Estimation of Pairwise Relatedness With Molecular Markers
 ^{*} Department of Biology, University of Oregon, Eugene, Oregon 97403
 ^{†} Department of Forest Sciences, University of British Columbia, Vancouver, British Columbia V6T1Z4, Canada
 Corresponding author: Michael Lynch, Department of Biology, University of Oregon, Eugene, OR 97403. Email: mlynch{at}oregon.uoregon.edu
Abstract
Applications of quantitative genetics and conservation genetics often require measures of pairwise relationships between individuals, which, in the absence of known pedigree structure, can be estimated only by use of molecular markers. Here we introduce methods for the joint estimation of the twogene and fourgene coefficients of relationship from data on codominant molecular markers in randomly mating populations. In a comparison with other published estimators of pairwise relatedness, we find these new “regression” estimators to be computationally simpler and to yield similar or lower sampling variances, particularly when many loci are used or when loci are hypervariable. Two examples are given in which the new estimators are applied to natural populations, one that reveals isolationbydistance in an annual plant and the other that suggests a genetic basis for a coat color polymorphism in bears.
COEFFICIENTS of relationship between pairs of individuals play a central role in many areas of genetics and behavioral ecology. For example, in quantitative genetics, the phenotypic resemblance of relatives, which forms the basis for the empirical estimation of components of genetic variance, is a direct function of the probability that individuals have one or two genes identical by descent at a locus. Given such probabilities, causal components of variance (such as the additive and dominance genetic variance) can be estimated from the phenotypic covariance (Falconer and Mackay 1996; Lynch and Walsh 1998). In studies of laboratory or domesticated populations, where investigators can be certain of the degrees of relationship among observed individuals, the application of conventional quantitativegenetic methodology is straightforward. Major uncertainties about the relationships among individuals from natural populations are the primary impediment to extending quantitativegenetic analysis to field studies, but Ritland (1989, 1996a) has suggested how this problem might be overcome by regressing pairwise measures of phenotypic similarity on pairwise estimates of relatedness obtained with molecular markers.
Pairwise measures of relatedness also play a role in the field of conservation genetics. For example, in captive breeding programs, substantial effort is being made to ensure that matings are minimized between close relatives to reduce the loss of genetic variation by random genetic drift. If the potential parents are derived directly from wildcaught stock or are descendants of individuals of unknown relationship, a relative ranking of degrees of relatedness can only be achieved through inferences with molecular markers (Avise 1995).
A third field of inquiry within which pairwise relatedness plays a significant role is the evolution of social behavior. Studies in this area are largely focused around Hamilton’s (1964) theory of kin selection, which states that the evolutionary advantage of an altruistic act depends on whether the cost to the donor exceeds the benefit to the recipient multiplied by the relatedness between the two individuals. Because most such studies involve field populations where parentage is not directly observed, indirect inferences about relatedness must again be made with molecular markers.
In all of the abovementioned applications of molecular markers, it is an implicit assumption that such markers provide reasonable, if not excellent, estimates of relatedness coefficients. Yet, there are few existing methods for the estimation of pairwise relatedness for which the statistical properties are well understood or well behaved. Several estimators have been developed for pairwise relatedness using the rather specialized data provided by DNAfingerprint profiles (Lynch 1988; Liet al. 1993; Geyer and Thompson 1995). Following up on earlier work of Pamilo and Crozier (1982), Queller and Goodnight (1989) developed markerbased estimators for withingroup relatedness, but these are of somewhat limited applicability in the estimation of pairwise relationship because of their poor behavior with diallelic loci. An efficient methodofmoments estimator, recently developed by Ritland (1996b), provides a basis for the joint estimation of identitybydescent at both the genic and genotypic levels. Ritland’s approach, which is based on a model involving joint probabilities of the two genotypes of a pair, can be quite complex computationally and is illbehaved with some gene frequencies. Maximumlikelihood methods have been developed by Thompson (1975, 1976, 1986) to test for specific types of relationship.
In this article, we introduce a simple method for obtaining unbiased estimates of pairwise relationship coefficients. Its simplicity arises from the use of a regression approach for inferring relationship—one individual of a pair serves as a “reference,” and the probabilities of the locusspecific genotypes in the other “proband” individual are conditioned on those of the reference. Aside from its ease of application and unbiased nature, this method has two very useful features—it generates joint estimates of both the two and fourgene coefficients of relatedness, and it yields simple expressions for the sampling variance of these coefficients. This latter feature provides a convenient means for optimizing the use of information derived from different loci. Following our derivation of the regression method, we compare its performance against that of other methods and then provide two examples of its application to studies of natural populations.
JOINT ESTIMATION OF TWOGENE AND FOURGENE COEFFICIENTS
Throughout, we focus on the traditional definition of relatedness for individual pairs of diploid individuals, r_{xy} = 2Θ_{xy}, where the coefficient of coancestry, Θ_{xy}, is the probability that, for any autosomal locus, a random gene taken from individual x is identical by descent with a random gene taken from individual y. For monozygotic twins (and clonemates), r_{xy} = 1; for parentoffspring and fullsib relationships, r_{xy} = 0.5; and for second and thirdorder relationships, r_{xy} is equal, respectively, to 0.25 and 0.125.
The relatedness coefficient for two individuals (x and y) is a linear function of two “higherorder” coefficients,
In the following analyses, we focus on the estimation of r_{xy} and Δ_{xy}, as these are the relationship coefficients that are of primary practical utility. Our computer simulations showed that estimates of ϕ_{xy} have much higher sampling variance than those of r_{xy} and Δ_{xy}, enough so that the accurate measurement of ϕ_{xy} is beyond reach unless very large numbers of informative loci can be assayed. This large sampling variance does not carry over greatly to estimates of the composite measure r_{xy}, because there is also a very large negative sampling covariance between the two component coefficients, ϕ_{xy} and Δ_{xy}.
Genotypic probabilities: There are two fundamental ways to set up a model for the genotypic probabilities in a pair of individuals. The first approach, adopted by Ritland (1996b), specifies the joint probability of both genotypes. The second approach, adopted here, specifies the conditional genotypic probability of a proband individual y, given the genotype of the reference individual x. We refer to these two approaches as “correlation” and “regression” methods in the sense that they are symmetrical vs. asymmetrical measures. Both approaches allow the joint estimation of r_{xy}, ϕ_{xy}, and Δ_{xy}, but as we will see, correlation and regression estimators differ substantially in terms of complexity and statistical properties. It is important to note that our use of the terms correlation and regression refers to the underlying statistical model and not to the estimators themselves. The estimators developed here and in Ritland (1996b) are more properly termed “methodofmoments” estimators.
Consider a single locus with n alleles, and let x be the reference individual (with alleles a and b) and y be the proband individual (with alleles c and d). The conditional probabilities for the n(n + 1)/2 possible genotypes in y can be expressed as a function of ϕ_{xy}, Δ_{xy}, and the known allele frequencies,
Regression estimators: Equation 2 provides the foundation for the regressionbased estimators that we now explore. To illustrate the general approach, we first derive estimators conditioned on the observation of a homozygote reference genotype. In this straightforward case, two probabilities are informative about x’s relationship with individual y: P(iiii) and P(i·ii), the conditional probabilities that the two individuals have two vs. one pair of genes identical in state at the locus, with a dot denoting any allele other than i. The probability of no genes identical in state, P(··ii), provides no additional information, as it simply equals [1  P(iiii)  P(i·ii)]. Letting p_{i} be the frequency of the ith allele, from Equation 2,
Assuming that we know the allele frequency p_{i} in advance, these two equations can be rearranged to yield estimators for the two unknown relationship coefficients,
The appendix provides a parallel set of results for heterozygotes at diallelic and multiallelic loci. Diallelic heterozygous reference individuals introduce no new problems, but with multiallelic loci, there are six classes of conditional probabilities for heterozygous reference individuals. In the latter case then, the number of observed 0/1 variables exceeds the number of unknowns (ϕ and Δ). To deal with this situation, we provide a weighted leastsquares approximation.
A general estimator, which covers all three cases, is best described by introducing “indicator variables” for the sharing of pairs of alleles (as opposed to more complex patterns of sharing as used earlier). As before, let the reference individual have alleles a and b and the proband individual alleles c and d. If the reference individual is homozygous, S_{ab} = 1, while if it is heterozygous, S_{ab} = 0. Likewise, if allele a from the reference individual is the same as allele c from the proband, S_{ac} = 1, while S_{ac} = 0 if it is different. In total, there are six S’s corresponding to the six ways of choosing two objects without replacement from a pool of four objects. Letting p_{a} and p_{b} be the frequencies of alleles a and b in the population, the fully general expressions for the two coefficients of primary interest are
In actual practice, there is no particular reason to use one member of a pair of individuals as the reference as opposed to the other member. Thus, the reciprocal estimates rˆ_{xy} and rˆ_{yx}, etc., can be arithmetically averaged to further refine the pairwise relationship estimates for the pair of individuals x and y. In all of the following analyses, we rely on such reciprocal estimates, as the arithmetic average of the two reciprocal estimates generally has a lower statistical variance than a single estimate. In principle, the root of the product of the two reciprocal estimates could be used, but this leads to undefined estimates in the event that one is negative.
Multilocus estimates: Estimates of relatedness are usually based on data from multiple loci. Under the assumption that the marker loci are unlinked, the locusspecific estimates are independent. However, any averaging of the locusspecific estimates to obtain overall estimates of r_{xy} and Δ_{xy} should account for the dramatic amonglocus differences of sampling variance that can arise from both differences in reference genotypes (e.g., common homozygote vs. rare heterozygote) and in levels of variation (loci with more alleles being more informative).
Let w_{r,x}(ℓ) and w_{Δ,}_{x}(ℓ) denote the weights to be used for the ℓth locus in the overall estimates of r_{xy} and Δ_{xy}, and let W_{r,x} and W_{Δ,}_{x} be the sums of the weights over all L loci. The composite estimates of the relationship coefficients for x and y are then
With statistically independent marker loci, the locusspecific weights that minimize the sampling variance of the overall estimates
Properties of the regression estimators: Extensive computer simulations demonstrated that the regression estimators given above are essentially unbiased, regardless of the numbers of loci or the values of ϕ and Δ. Thus, the primary issues of interest are the magnitudes of the sampling variances of the estimators and their sensitivity to the degree of actual relationship and to the allelefrequency distribution.
We obtained estimates of the sampling variances of the regression estimators by Monte Carlo simulation, assuming gene frequencies were known without error and assuming a random mating population with unlinked marker loci. Reference genotypes were drawn randomly according to their HardyWeinberg frequencies, and the genotypes of the paired individuals were then obtained from the conditional genotype distributions given the reference genotype and the particular relationship. For multiallelic loci, two types of allelefrequency distributions were considered: uniform distributions, in which the frequencies of each of the n alleles per locus were equal to 1/n, and “triangular” distributions, in which the frequencies of alleles followed the proportions 1, 2,..., n. In all of the following figures, we report the singlelocus sampling variances of the relationship coefficients. For analyses involving multiple loci with identical allele frequencies, the sampling variance of multilocus estimates can be obtained by dividing the plotted values by the number of loci (L).
A special property of the regression estimator is that the expected singlelocus sampling variance declines with increasing numbers of unlinked loci, down to an asymptotic value (Figure 1). This dependence on number of loci arises with the regression estimator because the estimation variances (the weights) differ among alternative reference genotypes at the same locus (for example, a reference genotype having rarer alleles gives estimates with lower variance). By contrast, the correlation estimator of Ritland (1996b) is not conditioned upon observed genotype, and its variance only depends on the distribution of gene frequencies in the population. Although Figure 1 details the influence of the number of loci on the variance of the regression estimator, for the remaining analyses we focus on the situation in which 10 informative loci have been sampled. At that point, the lower asymptotic value of the singlelocus sampling variance is closely approximated in most situations, and 10 loci is a good approximation of the sampling scheme employed in many empirical studies, with diallelic loci corresponding to isozymes and multiallelic loci corresponding to microsatellites.
For diallelic loci, the asymptotic sampling variance per locus for rˆ is equal to 1 in the case of nonrelatives and somewhat lower for related individuals (even though nonoptimal weights are employed with relatives; Figure 1). With allele frequencies approaching 0.5, the optimal weights of all reference genotypes approach equality regardless of the degree of relationship, because all alleles are then equally informative. Thus, the asymptotic sampling variances near allele frequencies of 0.5 are the best that one could expect to achieve even if the correct weights were used. Because even with close relatives, the sampling variance is never less than about 0.4 per locus, these results imply that with a large number of loci, the expected standard error of rˆ is generally on the order
As in the case of rˆ, the singlelocus sampling variance of
In principle, an increase in the number of alleles per locus should reduce the sampling variance of relatedness estimates, because alleles that are identical in state will be more reliable as indicators of identity by descent. For nonrelated individuals, the asymptotic singlelocus sampling variance of rˆ is very close to 1/(n  1), regardless of the form of the allelefrequency distribution (Figure 2). With parents and offspring, the sampling variance is up to 50% less than this, while with other types of relatives it is somewhat higher when alleles with low frequency are common. Again, with an even allelefrequency distribution, all reference genotypes are equally informative regardless of the degree of relationship, so the results for this case can be viewed as the minimum sampling variance that one can expect to achieve with the regression estimator—except in the case of parents and offspring, a standard error of rˆ less than about
COMPARISON WITH OTHER ESTIMATORS
As noted above, for applications in quantitative genetics, there is a need for separate estimates of r_{xy} and Δ_{xy} because the additive genetic covariance between individuals is a function of the composite measure r_{xy}, whereas the dominance genetic covariance is a function of Δ_{xy}. However, for situations in which one can be reasonably certain that the dominance genetic variance for a trait is negligible, or when one can be certain that collateral relatives (e.g., pairs of individuals, such as full sibs and double first cousins, that share paternal and maternal genes) are absent, Δ_{xy} can be ignored. In addition, in many applications in conservation genetics and behavioral ecology, the composite estimate r_{xy} may provide all the information that is needed. Four additional estimators of r_{xy}, all of which are unbiased, have been previously described.
A simple estimator based on the sharing of alleles, proposed by Lynch (1988) for analyses employing DNA fingerprint profiles, can be generalized to any set of codominant markers. The following expression includes the slight modification suggested by Li et al. (1993). Define the similarity index, S_{xy}, to be the average fraction of genes at a locus in a reference individual (here either x or y) for which there is another gene in the proband that is identical in state. Thus, S_{xy} = 1 when (x = ii, y = ii) or (x = ij, y = ij), S_{xy} = 0.75 when (x = ii, y = ij) or vice versa, S_{xy} = 0.5 when (x = ij, y = ik), and S_{xy} = 0 when (x = ij, y = kl). A singlelocus estimator for r_{xy} is then
Like Equation 8, Ritland’s (1996b) methodofmoments estimator for r_{xy} considers the joint distribution of both genotypes in a symmetrical way. The differing information provided by alternative alleles is incorporated by considering the incidence of each of the n possible alleles at the locus. The observed data are summarized as an array of n similarities, where the ith element (S_{i}) is equal to 0.0 (at most, one of the individuals contains allele i), 0.25 (both individuals contain a single i allele), 0.5 (one individual contains two and the other individual one i alleles), or 1.0 (both individuals are ii homozygotes). Estimates of r_{xy} derived for each allele are combined into a single estimate for the locus by using weights that assume zero relationship (as with the weighted regression estimators derived above),
A simpler estimator, also based upon the joint distribution of genotypes, was described by Ritland (1996b) and earlier workers (Li and Horvitz 1953; Weir 1996, Equation 2.28), primarily in relation to estimating inbreeding coefficients. Defining an alternative similarity index such that
Finally, we note Queller and Goodnight’s (1989) estimator of r_{xy}. Although their index is primarily designed for estimating the average degree of relatedness within groups of individuals, it can be expressed in terms of the same parameters that we employ with our Equations 5a and 5b to obtain a pairwise estimator for individuals x and y,
In comparing the performance of these alternative methods for estimating r_{xy} to that of the regression estimator, we evaluated their singlelocus sampling variances analytically by considering the joint probabilities of all genotypes of pairs of individuals, conditional on the degree of relationship and the allelefrequency distribution. With these alternative methods, the weights depend only on the allelefrequency distribution in the population, not on the genotypes of the reference and proband individuals. Thus, with multiple marker loci all with the same allele frequencies, the multilocus sampling variances are simply the singlelocus values divided by the number of loci. When loci have different allelefrequency distributions, as is usually the case in practice, weighted multilocus estimates can be obtained by weighting the locusspecific estimates by the inverses of their sampling variance.
For diallelic loci, the correlation estimator yields a sampling variance per locus equal to one in the case of nonrelatives regardless of the allele frequency (Figure 3). As noted above, the regression estimator asymptotically approaches this same level of efficiency for nonrelatives, but the similarityindex method has higher sampling variance. On the other hand, for close relatives, compared to the correlation estimator, the regression and similarityindex methods yield more accurate estimates of r over the full range of allele frequencies at diallelic loci, with the latter actually outperforming the former in the case of parentoffspring pairs.
A multiallelic perspective yields further insight into the relative efficiencies of the four techniques. With a uniform distribution of three or more alleles per locus, the singlelocus sampling variance for rˆ is essentially 1/(n  1) with nonrelatives regardless of the method (Figure 4). Thus, because an even allelefrequency distribution provides the greatest power of inference, this seems to be the best that one can expect to achieve with any estimator of distant relationships. For related individuals, the regression and similarityindex methods yield very similar sampling variances of r provided there are at least three alleles per locus, while the correlation and QuellerGoodnight estimators are again less efficient. For the two superior methods, the singlelocus sampling variance of estimates of rˆ asymptotically approaches 0.14 with increasing allele number with full sibs, and very slowly approaches 0 with parents and offspring.
With a triangular allelefrequency distribution, the regression and correlation methods again yield essentially identical results with nonrelatives, while the similarityindex and QuellerGoodnight methods have somewhat higher sampling variances. However, with related individuals, the similarityindex method is again the superior of the four methods, and the correlation and QuellerGoodnight estimators generally yield the highest sampling variance. By use of either the regression or similarityindex methods, up to a 50% reduction in the standard error of rˆ an be achieved.
The only other markerbased method for the estimation of Δ is the correlationbased estimator of Ritland (1996b), which is quite complex algebraically. Results in Table 1 show that the much simpler regression estimator presented above yields essentially the same asymptotic sampling variances as the correlation method when the allelefrequency distribution is uniform. With triangular allelefrequency distributions, the results are also very similar for nonrelatives, but with related individuals, the regression estimator yields more precise estimates, with the reduction in sampling variance approaching 50% with close relatives.
Thompson (1975, 1986) has extensively investigated the use of maximum likelihood for inferring pairwise relationship. The likelihood method allows one to take an entirely different approach for genealogical inference. For example, Thompson discusses the power of likelihood to distinguish among major types of relationships, defined as family (parentoffspring, full sibs), close (half sibs, uncle, etc.), remote (cousin, etc.), and unrelated. This approach to inferring genealogical “relationship” is fundamentally different from our approach to estimating “relatedness,” which is a nondiscrete numerical parameter defined in terms of probabilities of identitybydescent. Nevertheless, we have considered the possibility of using likelihood methods to estimate “relatedness” under our regression framework. Using notation developed earlier, the likelihood of data from one locus is the probability
Using computer simulations, we examined the behavior of such maximumlikelihood estimation of relatedness by a standard numerical method (NewtonRaphson iteration). Convergence to a maximum was confirmed both by noting that the likelihood increased over iterations and converged and by comparing the iterative solutions to likelihood functions of the same data mapped by brute force. The results, and those discussed by Ritland (1996b), suggest that the potential for using maximum likelihood for estimating relatedness is limited. The problem is fundamentally due to the fact that the ideal properties of likelihood are asymptotic or apply to “large” sample sizes. The number of loci usually available for pairwise estimation is inherently small—too small for likelihood to avoid substantial problems with bias (usually negative) and extremely large sampling variance. For example, for the case of zero true relatedness, the average estimate of r_{xy} is on the order of 1.0 or less when 40 or fewer loci are sampled, and the sampling variance is two to three orders of magnitude beyond that shown for the alternative estimators in Figures 3 and 4. Interestingly, we found that there is an approximate sample size (number of loci) above which the maximumlikelihood estimators become “stable” or show approximately the predicted asymptotic variance. However, this sample size is large. For the maximumlikelihood estimator of r_{xy}, at low true relatedness, stability occurs at ∼70 diallelic loci (p = 0.5). The maximumlikelihood estimator of Δ_{xy} exhibits similar behavior, although it begins to stabilize when ∼30 loci have been sampled. Thus, while the maximumlikelihood approach may provide a useful means for comparing alternative degrees of relationship by likelihoodratio tests, its applicability for estimating pairwise relatedness coefficients appears to be limited unless one has the luxury of a very large number of polymorphic markers.
EXAMPLE APPLICATIONS
As examples of how estimators of pairwise relatedness can be used in population studies and how they behave with actual data, we consider two applications. First, as part of a study of isolationbydistance and field heritabilities in the common monkeyflower (Mimulus guttatus), 300 plants were randomly selected along an 84m transect through a meadow adjacent to Indian Valley Reservoir in Clear Lake County, California (this was the “meadow” transect of Ritland and Ritland 1996). Extracts were obtained from corollas and assayed for 10 polymorphic isozyme loci. Eight loci were diallelic, 1 was triallelic, and the other had four alleles. Using the regression estimator, relatedness was estimated for pairs of plants separated by up to 4 m (with gene frequencies estimated from the entire sample). The estimates of pairwise relatedness from this dataset show considerable scatter, with some being >+1 and many <0 (Figure 5). Such behavior is in accordance with the results presented above, which highlight the large sampling variance expected for estimates based upon relatively few marker loci. Because of this large variance, significant inferences can be made only from groups of pairwise relatedness estimates or from correlations of these estimates with other quantities such as similarity for a quantitative trait (Ritland 1996a). In this particular application, there is a negative regression of relatedness on distance (Figure 5) as expected under isolationbydistance. Relatedness decreased ∼50% over the span from 0 to 4 m, with the average value for adjacent plants being 0.21, nearly the level of relatedness expected between half sibs (0.25).
A second application of relatedness estimates derives from work (D. Marshall and K. Ritland, unpublished results) with a whitephase (termed Kermodism) of the black bear, which is found in low to moderate (10%) frequency along the north coast of British Columbia and adjacent islands. The genetic basis of the coat color polymorphism is unknown. During late summer 1997, nearly 900 bear hair samples were collected from five islands and the adjacent mainland of northern coastal Bristish Columbia. DNA was extracted from hairs with roots and assayed for 8 highly polymorphic microsatellite loci using the primers developed by Paetkau et al. (1995). The number of alleles per locus ranged from 7 to 17, with a mean of 10.4, and locusspecific heterozygosities ranged from 0.72 to 0.85, with a mean of 0.79. After factoring out the multiple samples for individual bears, a total of 89 distinct genotypes were found in the regions where Kermodism was of significant frequency (17 on Gribbel Island, 13 on Hawksbury Island, 38 on Princess Royal Island, and 21 at Terrace [mainland BC]). Bear hair color was also recorded in these samples. Estimates of pairwise relatedness were found within each of these four regions, using the pooled samples to estimate gene frequencies. All pairs of individuals were then classified into two groups: pairs sharing coat color (both white or both black, of which there were 614 pairings) and pairs not sharing coat colors (one black, one white, involving 156 pairings). A comparison of the frequency distribution of rˆ for these two groups (Figure 6) shows an excess of relatedness among bears sharing coat colors (r¯ = 0.057 compared to 0.039 for unlike colors), suggesting a genetic basis for the variation in this character. However, bootstrap resampling indicated that this difference of means is not significant (the excess being present in only 88 highly variable microsatellite loci, the statistical error of relatedness is considerably less than that experienced with isozyme markers in the previous study). Further inferences about the mode of inheritance of Kermodism are given in Ritland (1999).
DISCUSSION
Estimation of relatedness with molecular markers is a statistically demanding enterprise. On the positive side, all of the estimators described above (except maximumlikelihood) are essentially unbiased in the sense that they return estimates that are on average identical to their expected values. Errors in estimates of population allele frequencies, which were not incorporated into our simulations, can introduce bias, but the effects of error in genefrequency estimation will generally be trivial (of order 1/N when N individuals are censused for gene frequency) compared to the additional sampling errors that arise in the estimation of relatedness, provided the number of individuals sampled exceeds 100 or so (Ritland 1996a,b). Moreover, this source of bias can be simply removed by omitting the pair of interest from the estimate of allele frequency (Queller and Goodnight 1989), although pathological behavior will occur in the rare event that marker alleles are unique to particular individuals, as this would lead to population genefrequency estimates of zero. In addition, the sampling variance of the relationship coefficients owing to uncertain allele frequencies can, in principle, be obtained by resampling procedures.
The high sampling variance of estimates of relatedness arises in part because of variance in identitybydescent among loci and in part because of variance in identityinstate for alleles that are not identical by descent. These sources of sampling error are fundamental consequences of Mendelian segregation, and no amount of statistical finesse can eliminate them. In the actual estimation of relatedness, however, further sampling error is introduced by error in inference. With the regression and correlation estimators, for example, large standard errors result because the estimates of relationship coefficients derived from single loci commonly fall outside of the true domain of (0, 1). Although estimators can be designed to ensure that all estimates lie in the range of true possibilities (e.g., Thompson 1976), all such estimators necessarily return biased estimates, and the magnitude of the bias depends on the actual degree of relationship. Thus, while negative singlelocus estimates of relationship coefficients may seem to be an undesirable feature, it is precisely this feature that ensures that the estimators proposed above will be unbiased.
Our results suggest that the relative advantages of the alternative estimators of relatedness depend on several factors. These include the number of loci, the allelefrequency distribution, the degree of actual relationship, and the coefficient estimated (r vs. Δ). In general, molecularmarker approaches that yield many alleles and loci tend to favor use of the regression estimators proposed in this article over the correlation estimators presented by Ritland (1996b). With small numbers of diallelic loci with extreme allele frequencies, the correlation method is more efficient than the regression method, but the regression estimators are more efficient in almost all other cases. In addition, the simplicity of the regression estimators lends to easier programming and more stability of estimates under uneven allele frequency distributions. The simplicity of the regressionbased approach is underscored by our ability to obtain an analytical solution for
As noted above, some simple statements can be made concerning the minimum sampling variance that one can expect to achieve in the estimation of relationship coefficients. For pairs of unrelated or distantly related individuals assayed at L loci, each containing n alleles, the standard errors of the estimates of ϕ (details leading up to this result are not shown), Δ, and r will be no less than
One of the limitations of both the regression and correlation methods for estimating relatedness is the use of weights that assume zero relationship. The best weights are a function of the actual relationship, but this is an unknown. Nevertheless, the use of approximate but incorrect weights yields more precise estimates than the use of unweighted estimators, because differences in the information content of alleles with different frequencies are at least partially taken into account. One might think that estimates obtained with the null weights could be improved upon by subsequently refining the weights, using the previous estimates of relatedness in the calculation of the weights. These revised weights could then give a second round of weighted estimates, and the whole process could be repeated again until a suitable degree of convergence to final estimates is achieved. However, simulations by us and by Ritland (1996b) indicated that, even with large numbers of loci, this iterative approach has little promise. Bias is introduced, and with the weights being as noisy as they are, the weights themselves are often wildly unrealistic.
Generally speaking, our results show that attempts to estimate relatedness with molecular markers can be greatly improved upon by working with multiallelic loci, with the most dramatic gains in efficiency occurring with loci with relatively even distributions of allele frequencies. Because the sampling variance of rˆ is inversely proportional to Ln, it is clear that roughly the same amount of efficiency is gained by working with loci with twice the number of alleles as by doubling the number of loci. For Δ, the sampling variance is inversely proportional to Ln^{2}, so a much greater gain can be achieved by increasing numbers of alleles as opposed to numbers of loci. Thus, an early investment in a search for informative loci (those with a large number of alleles, with roughly equal frequencies) can be quite advantageous in the long term. These recommendations assume that at least 10 or so loci are sampled, because with fewer loci, the tradeoff involving r favors more loci over more alleles per locus.
The results presented above indicate that even with fairly large numbers of loci, standard errors of relationship coefficients will rarely be
APPENDIX
Provided there are only two alleles at the locus in the population, the approach provided in the text for a homozygous reference genotype can also be applied to the case in which the reference genotype is a heterozygote for alleles i and j. The conditional probabilities of observing proband genotypes, given a heterozygous reference genotype, are
Equating these probabilities to their estimates and rearranging, estimators for the coefficients of relationship are obtained as
If there are more than two alleles in the population, there are six possible proband genotype categories conditioned on observing the heterozygous reference genotype ij. The conditional probabilities include P(iiij) and P(jjij) as given in Equations A1a and A1b plus four more:
Linear regression provides a datafitting procedure for obtaining estimators for ϕ_{xy}, Δ_{xy}, and r_{xy} in this case. The six probabilities can be assembled into an array,
For any pair of individuals, the observed data vector (Pˆ_{)} will always contain a single one for the observed twogenotype combination with all other elements being equal to zero. The linear model then becomes
If the elements of the observation vector Pˆ were independent and identical in distribution, ordinary leastsquares analysis could be used to obtain estimates of the relationship coefficients with minimum sampling variance. However, because all of the elements of the observation vector are constrained to sum to 1, such conditions are obviously violated. Although the failure to fully account for the structure of the data in the P vector does not cause the estimates of the coefficients of relationship to be biased, it does elevate the sampling variance. Unfortunately, the variancecovariance structure necessary to generate the optimal weights for a more powerful generalized leastsquares framework depends on the unknown parameters ϕ_{xy} and Δ_{xy}. To obtain approximate weights, we rely on Ritland’s (1996b) argument that, in the absence of prior information on the relationship of x and y, it is reasonable to start with the assumption that ϕ_{xy} = Δ_{xy} = 0.
Using the optimal weights given by Equation 4b of Ritland (1996b), we were able to obtain analytical solutions for the weighted leastsquares estimators of ϕ_{xy} and Δ_{xy} using an equation solver program. These are
Acknowledgments
We thank John Kelley for helpful comments. This work was supported by National Institutes of Health grant GM36827 and National Science Foundation grant DEB9629775 to M.L., and by a National Sciences and Engineering Research Council/Industry Research Chair in population genetics held by K.R.
Footnotes

Communicating editor: A. H. D. Brown
 Received June 26, 1998.
 Accepted April 19, 1999.
 Copyright © 1999 by the Genetics Society of America