# Unbiased Estimation of Gene Diversity in Samples Containing Related Individuals: Exact Variance and Arbitrary Ploidy

- Michael DeGiorgio
^{*},^{1},^{2}, - Ivana Jankovic
^{*},^{1},^{3}and - Noah A. Rosenberg
^{*}^{†}

^{*}Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109 and^{†}Department of Human Genetics and the Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109

- 2
*Corresponding author:*Center for Computational Medicine and Bioinformatics, University of Michigan, 2017 Palmer Commons, 100 Washtenaw Ave., Ann Arbor, MI 48109-2218. E-mail: degiormi{at}umich.edu

## Abstract

Gene diversity, a commonly used measure of genetic variation, evaluates the proportion of heterozygous individuals expected at a locus in a population, under the assumption of Hardy–Weinberg equilibrium. When using the standard estimator of gene diversity, the inclusion of related or inbred individuals in a sample produces a downward bias. Here, we extend a recently developed estimator shown to be unbiased in a diploid autosomal sample that includes known related or inbred individuals to the general case of arbitrary ploidy. We derive an exact formula for the variance of the new estimator, , and present an approximation to facilitate evaluation of the variance when each individual is related to at most one other individual in a sample. When examining samples from the human X chromosome, which represent a mixture of haploid and diploid individuals, we find that performs favorably compared to the standard estimator, both in theoretical computations of mean squared error and in data analysis. We thus propose that is a useful tool in characterizing gene diversity in samples of arbitrary ploidy that contain related or inbred individuals.

FOR a given locus, gene diversity, also known as expected heterozygosity, characterizes the proportion of heterozygous genotypes expected in a population under Hardy–Weinberg equilibrium (Nei 1973). Nei and Roychoudhury (1974) devised an estimator of gene diversity that is unbiased for random samples of unrelated, noninbred individuals. When inbred individuals or close relatives are included in a sample, however, this estimator has a downward bias (Weir 1989; DeGiorgio and Rosenberg 2009). To account for the effects of inbreeding in a sample of diploid individuals, Weir (1989, 1996) derived the expected value of gene diversity, producing an unbiased estimator of gene diversity that makes use of the mean inbreeding coefficient across sampled individuals, where the inbreeding coefficient of an individual is defined as the probability for a randomly chosen locus that the two alleles of the individual are inherited identically by descent from a common ancestor. Using the mean kinship coefficient across pairs of sampled individuals, DeGiorgio and Rosenberg (2009) extended this estimator to account for the bias produced in samples containing close relatives, where the kinship coefficient between two individuals, *j* and *k*, is defined as the probability that an allele randomly selected from individual *j* at a random locus and an allele randomly selected from individual *k* at the same locus are identical by descent (IBD).

The DeGiorgio and Rosenberg (2009) estimator is useful for autosomal markers in samples from diploid organisms that contain related or inbred individuals. However, in studying gene diversity among related individuals in nondiploid cases (*e.g.*, Buteler *et al*. 1999) or in cases of mixed ploidy, such as in the analysis of sex chromosomes (*e.g.*, Reiland *et al*. 2002), unbiasedness for this estimator has not been demonstrated. Here, we extend the DeGiorgio and Rosenberg (2009) estimator of gene diversity to account for situations in which known related and inbred individuals are included in a sample and in which the sample contains an arbitrary mixture of individuals of different ploidy. We use a more general method to obtain the estimator than the method used for diploids by DeGiorgio and Rosenberg (2009), and we show that the general estimator reduces to the DeGiorgio and Rosenberg (2009) estimator in the diploid case. We also derive a formula for the variance of our estimator, , to facilitate evaluation of the statistical properties of the estimator. This variance formula, which is a function of identity states among individuals, includes terms that involve identity-by-descent among two, three, and four individuals and among pairs of pairs of individuals. Our variance function is convenient because extensive work on IBD probabilities among individuals (*e.g.*, Cotterman 1940; Harris 1964; Gillois 1965; Cockerham 1971; Jacquard 1974; Thompson 1974; Lange 2002) has provided a framework for calculating the quantities incorporated in the formula.

Using the variance formula, we examine the performance of our estimator in scenarios involving the human X chromosome, for which males and females, who might both be included in a typical sample, differ in ploidy. In our evaluations, we first show that the exact theoretical values of the variance, which are obtained from a quite complex formula, are closely matched by simulations. We also validate that when each sampled individual is related to at most one other individual in the sample, the exact theoretical variance can be approximated well by a simpler formula. Using the variance approximation and simulations, we compare the behavior of our estimator to that of the Nei and Roychoudhury (1974) estimator, which does not account for relatives. We then analyze human SNPs from the X chromosome and find that also performs well in practice.

## THEORY

Consider a sample of *g* groups, each with different ploidy (*e.g.*, haploid males and diploid females on the human X chromosome). Suppose that the sample from group *b* contains *n _{b} m_{b}*-ploid individuals,

*b*= 1, 2, …,

*g*. Further, let (

*b*,

*k*),

*k*= 1, 2, …,

*n*, denote individual

_{b}*k*from group

*b*. The number of copies of allelic type

*i*in individual

*k*from group

*b*is(1)where is an indicator random variable that takes on the value 1 if the ℓth allele in individual (

*b*,

*k*) has type

*i*and that equals 0 otherwise.

Note that , where *p _{i}* is the frequency of allelic type

*i*in the population. We can then define an unbiased estimator for the frequency of allele

*i*as(2)Rewriting the estimator of Nei and Roychoudhury (1974) for the mixed-ploidy case, if no inbred or related individuals are included in the sample, then an unbiased estimator of gene diversity is(3)If inbred or related individuals are included in the sample, then is a biased estimator of . We follow the approach of DeGiorgio and Rosenberg (2009), correcting for this bias by first obtaining the variance of sample allele frequencies. However, we use a different method here for obtaining the variance of sample allele frequencies, determining the bias correction for diploids as a special case of a more general computation.

#### An unbiased estimator:

Suppose we have four possibly, but not necessarily, distinct individuals (*a*, *j*), (*b*, *k*), (*a*′, *j*′), and (*b*′, *k*′). Define Φ_{(a,j)(b,k)} as the probability that two alleles randomly chosen, one from individual (*a*, *j*) and the other from individual (*b*, *k*), are IBD. Similarly, define Φ_{(a,j)(b,k)(a′,j′)} as the probability that three alleles randomly chosen, one from (*a*, *j*), one from (*b*, *k*), and one from (*a*′, *j*′), are IBD. Define Φ_{(a,j)(b,k)(a′,j′)(b′,k′)} as the probability that four alleles randomly chosen, one from (*a*, *j*), one from (*b*, *k*), one from (*a*′, *j*′), and one from (*b*′, *k*′), are IBD. Finally, define Φ_{(a,j)(b,k),(a′,j′)(b′,k′)} as the joint probability that two alleles randomly chosen, one from (*a*, *j*) and the other from (*b*, *k*), are IBD and two alleles randomly chosen, one from (*a*′, *j*′) and the other from (*b*′, *k*′), are IBD. These four types of probability of identity-by-descent are identical to the θ, γ, δ, and Δ coefficients of Cockerham (1971), respectively. We can then define(4)(5)(6)(7)as weighted mean kinship coefficients across all sets of pairs, triplets, quartets, and pairs of pairs of individuals. The weight associated with an individual in group *x*, , is proportional to the ploidy associated with the group. Define the inbreeding coefficient for individual (*b*, *k*), denoted by *f*_{(b,k)}, as the probability that two alleles randomly chosen without replacement from individual (*b*, *k*) are IBD and let be the mean inbreeding coefficient across individuals in group *b*. This definition reduces to the standard definition for the diploid case.

In this section we first present two equations (Equations 8 and 9) that aid in the development of a generalized estimator of gene diversity (Theorem 1). This general estimator, the main result of the section, corrects the bias created by the inclusion of related and inbred individuals in a sample consisting of individuals with any mixture of ploidy. Using this estimator, we provide generalizations of results presented by DeGiorgio and Rosenberg (2009) for diploids to the case of arbitrary ploidy (Equations 13 and 14) and we show how these generalizations can be reduced to the diploid case.

Consider a locus with *I* distinct alleles, allele frequencies *p _{i}* ∈ [0, 1], and . Suppose a sample from a population has

*g*groups, each with different ploidy, and

*n*-ploid individuals in group

_{b}m_{b}*b*,

*b*= 1, 2, …,

*g*, each of whom is possibly inbred and related to other individuals in the sample. Consider the ℓth allele of individual (

*a*,

*j*) and the

*t*th allele of individual (

*b*,

*k*). By definition of expected value, we have(8)In taking the expected value of our estimator of gene diversity, we need to evaluate the quantity . Using Equation 8, we show in appendix a that(9)Plugging Equations 8 and 9 into yields , which reduces to the result presented for the diploid case in Equation 7 of DeGiorgio and Rosenberg (2009), by reduction of the definition of for the diploid case. The following theorem provides a generalized unbiased estimator of gene diversity when a sample with any mixture of ploidy contains related or inbred individuals.

Theorem 1. *Consider a locus with I distinct alleles*, *allele frequencies p _{i}* ∈ [0, 1],

*and*.

*Suppose a sample from a population has g groups*,

*each with different ploidy*,

*and n*-

_{b}m_{b}*ploid individuals in group b*,

*b*= 1, 2, …,

*g*,

*each of whom is possibly inbred and related to other individuals in the sample. Then*(10)

*is an unbiased estimator for gene diversity*.

The proof that is unbiased follows that of Proposition 1 in DeGiorgio and Rosenberg (2009), substituting the more general in place of the corresponding mean kinship coefficient in the earlier proof.

When reducing the definition of for the diploid case studied by DeGiorgio and Rosenberg (2009), the result in Theorem 1 is identical to the result presented for this case in Proposition 1 of DeGiorgio and Rosenberg (2009). One interesting consequence of Theorem 1 is that has a simple representation in terms of the sample proportion of identity-by-state and the probability of identity-by-descent computed on the basis of assumed levels of inbreeding and relationship. This representation is(11)where is the probability that two alleles in the sample, chosen uniformly at random with replacement, are identical by state, and [IBD] is the probability that two alleles in the sample, chosen uniformly at random with replacement, are identical by descent. A proof that Equation 11 is a consequence of Equation 10 is provided in appendix a. Note that Equations 10 and 11 have a connection to estimators of relatedness in a context in which relatedness is unknown. Such estimators essentially invert equations similar to Equation 11 to get estimators of (Ritland 1996; Rousset 2002).

We next seek to transform the estimator in Equation 10 into one that is more convenient for data analysis. Let 𝒢_{a,b,} *a*, *b* = 1, 2, …, *g*, be the set of distinct types of relative pairs for pairs of distinct individuals in a sample, one from group *a* and one from group *b*. Let η_{R} be the number of pairs of individuals with relationship type *R* in 𝒢_{a,b,} and let Φ_{R} be the kinship coefficient for each of these pairs. Then, as shown in appendix a, we can write as(12)This version of is convenient for computation. To obtain a formula for that is convenient for computation and that is a generalized version of an analogous quantity for the diploid case in Equation 9 of DeGiorgio and Rosenberg (2009), we can substitute Equations 3 and 12 into Equation 10 to get(13)whereA proof of Equation 13 is provided in appendix a. We note that by using *g* = 1, *n*_{1} = *n*, and *m*_{1} = 2 in Equation 13, we obtain Equation 9 of DeGiorgio and Rosenberg (2009).

Note that , whereBy rearranging and taking the expected value, we get . Therefore,(14)Equation 14 is a generalized version of the bias formula in the diploid case, in Equation 11 of DeGiorgio and Rosenberg (2009). The bias is always negative and it has a magnitude that increases linearly with respect to *H*. Using *g* = 1, *n*_{1} = *n*, and *m*_{1} = 2 in Equation 14, we obtain Equation 11 of DeGiorgio and Rosenberg (2009).

#### Variance of the estimator:

In the previous section, we derived an unbiased estimator of gene diversity in a sample of arbitrary ploidy. It is useful to determine the variance of the estimator, a quantity that in the diploid case DeGiorgio and Rosenberg (2009) obtained only by simulation. The following theorem provides a formula for the variance of the generalized estimator of gene diversity in samples with any mixture of ploidy.

Theorem 2. *Consider a locus with I distinct alleles*, *allele frequencies p _{i}* ∈ [0, 1],

*and*.

*Suppose a sample from a population has g groups*,

*each with different ploidy*,

*and n*-

_{b}m_{b}*ploid individuals in group b*,

*b*= 1, 2,…,

*g*,

*each of whom is possibly inbred and related to other individuals in the sample. Then the variances of the*

*and*

*estimators of gene diversity are*(15)

*and*(16)

*where*(17)The proof of Theorem 2 is long and is provided in appendix b.

We next derive an approximate formula that in our calculations below we use in place of Equation 17 inside of Equations 15 and 16. The approximation is based only on pairwise kinship coefficients and is useful in cases in which the number of relatives in a sample is small enough that no individual is related to more than one other sampled individual. In such cases, the only nonzero terms included in , , and all involve sampling the same individual or pairs of individuals more than once. Thus, the , , and terms, along with , are ignored, as they are likely to be much smaller than in cases in which the number of relationships in the sample is small.

In addition to the assumptions listed in Theorem 2, suppose that each individual in the sample is related to no more than one other individual in the sample. If we ignore terms involving , *k* > 1, then terms involving , , , and in Equation 17 can be ignored. The only terms in Equation 17 that we retain are those of order and . is of order . Therefore, reducing Equation 17 leads to(18)This formula is an approximation to Equation 17 when the number of relatives in a sample is small enough that no individual is related to more than one other sampled individual.

We now show that when no related individuals are included in a sample of diploids, the variance in Equation 18 is exactly the formula given by Weir (1989). Suppose a sample from a diploid population consists of *n* unrelated, but possibly inbred, individuals. Further suppose that we ignore terms involving *n*^{−}* ^{k}*,

*k*> 1. Then Φ

_{kk}= (1/2)(1 +

*f*), where

_{k}*f*is the inbreeding coefficient for individual

_{k}*k*. We can write the mean pairwise kinship coefficient aswhere is the mean inbreeding coefficient across individuals. Plugging into Equation 18, we get(19)

#### The X chromosome case:

A common situation in which data of mixed ploidy arise is on sex chromosomes, for which members of one sex have two copies of a specific sex chromosome and members of the other sex have one copy. Later, we examine data on the human X chromosome, for which females have two copies and males have one. Thus, we now utilize Equation 13 to derive an unbiased estimator of gene diversity in samples from the X chromosome.

Consider an X-linked locus with *I* distinct alleles, allele frequencies *p _{i}* ∈ [0, 1], and . Suppose a sample from a population has

*n*

_{F}females and

*n*

_{M}males, each of whom is possibly inbred and related to other sampled individuals. Let ℳ, ℱ, and 𝒰 be the sets of distinct types of male–male, female–female, and male–female relative pairs in the sample, respectively. Further, let η

_{R}be the number of pairs of individuals with relationship type

*R*and let Φ

_{R}be the kinship coefficient for each of these pairs. Let males be group 1 and let females be group 2. Plugging

*g*= 2,

*n*

_{1}=

*n*

_{M},

*n*

_{2}=

*n*

_{F},

*m*

_{1}= 1, and

*m*

_{2}= 2 into Equation 13, we obtain an unbiased estimator for gene diversity at an X-linked locus as(20)where is the mean inbreeding coefficient across female individuals and

*f*is the inbreeding coefficient for female

_{k}*k*.

The following special case of Equation 20 is useful for the examples we consider in subsequent sections. It makes use of Table 1, which shows the various types of relationships possible for the X chromosome in pairs of individuals. Suppose a noninbred sample from a population has *n*_{F} females and *n*_{M} males, among which η_{k} pairs of relationship type *k* are included. Let Φ_{k} be the kinship coefficient for each of these pairs. Because the sample is not inbred, the mean inbreeding coefficient across female individuals is . Plugging as well as η_{k} and Φ_{k} for each relationship type *k* (Table 1) into Equation 20, we obtain(21)

## DATA ANALYSIS

#### Data:

We investigated the properties of on mixed-ploidy data using analytical computations of bias, variance, and mean squared error; simulations; and analysis of data from human populations. Our choices for simulation parameters were designed on the basis of values in the data. In our analytical computations and simulations, we based our assumed true allele frequencies on sample allele frequencies at 36 X-chromosomal loci typed in 950 unrelated individuals, 624 males and 326 females, from the Human Genome Diversity Panel (HGDP-CEPH) microsatellite data set of 1048 individuals (Ramachandran *et al*. 2008). Individuals 127 and 139 from the Ramachandran *et al*. (2008) data set were not included in our analyses. The 950 individuals were assumed to have no first- or second-degree relationships, on the basis of the Rosenberg (2006) analysis of the full HGDP-CEPH panel.

Our data analysis was performed on a data set of 13,052 X-chromosomal single-nucleotide polymorphism (SNP) loci genotyped in 485 individuals from 29 populations in the HGDP-CEPH panel (Jakobsson *et al*. 2008). We also removed individuals related through the X chromosome, yielding a data set of 446 unrelated individuals. Unlike the Jakobsson *et al*. (2008) data set of 443 unrelated individuals, our set of 446 individuals did not retain individuals 866, 1046, or 1049, which are not in the H952 subset of the HGDP-CEPH panel. However, individuals 292, 451, 477, 983, 988, and 1089 were included in the data set of nonrelatives because they were all involved exclusively in male–male parent–offspring relationships and were therefore unrelated through the X chromosome to other sampled individuals.

#### Data analysis methods:

We used simulations and analytical calculations to evaluate the behavior of the estimator for X-chromosomal loci under conditions of varying heterozygosities, sample sizes, and relationships of sampled individuals. We compared the relative performance of and by applying and to samples containing related individuals and to samples in which relatives were removed so that no relative pairs remained. True allele frequencies were based on microsatellite sample allele frequencies (see *Data*). In the simulations, individuals of a relative pair were generated by randomly choosing the allele(s) of the first individual on the basis of the empirical allele frequency distribution from the data set. For a given type of relative pair, we then simulated the allele(s) of the second individual by copying alleles from the first individual using the probabilities of sharing zero, one, and two alleles IBD for that type of pair. Table 2 depicts these probabilities, as well as the symbols used here to denote the various classes of relative pairs. If only one allele was shared, then it was copied in the second individual from the first allele of the first (independently generated) individual. In cases of male–female relative pairs, the male was generated first and the second allele of the female was always chosen independently from the allele frequency distribution.

To create a reduced data set of unrelated individuals, the second (possibly dependent) individual was not included for same-sex pairs, whereas for male–female pairs, the male relative was removed. Thus, because each individual in our simulation was included in exactly one relative pair, the number of individuals used to calculate for the unrelated sample was always half of that used for the other two estimators. Removing the male in male–female pairs results in the loss of one-third of the alleles, compared to a loss of one-half of the alleles for removal of an individual from a same-sex pair. Thus, compared to removing females, removing males from male–female pairs generates a larger sample of alleles while still ensuring that no individuals are related.

The value assumed for the true heterozygosity, *H*, of a specific locus, was calculated from the assumed true allele frequencies on the basis of genotypic data of the 950 unrelated individuals. In each simulated scenario, for each of the three estimators, this true heterozygosity was compared to the mean of the estimates produced by the estimator in 100,000 replicate simulations. The subscript *full* is used to denote cases in which an estimator was applied to the entire sample, whereas the subscript *reduced* indicates that relatives were removed from the sample. For a given scenario, the bias of each estimator was found by subtracting *H* from the mean value of the estimates for that estimator. Variance was calculated as the squared mean of the estimates across simulations subtracted from the mean across simulations of the squares of the estimates. Mean squared error (MSE) was then calculated as the sum of bias squared and variance.

##### Approximate variance:

Because each of our analyses was performed on samples that contained only pairs of related individuals, the assumptions that underlie the derivation of the approximate variance (Equation 18) apply. We compared the exact, the approximate, and the simulated variance for and in a series of cases that included only full-sib pairs. We chose nine representative cases of the various parameters that can affect estimator performance. Three of these cases considered an equal mix of male–male, female–female, and male–female full-sib pairs at the ATCT003 (*H* = 0.7794), DXS1068 (*H* = 0.7344), and GATA48H04 (*H* = 0.6476) loci, chosen to represent high, intermediate, and low heterozygosity, respectively. Additionally, we considered cases at the intermediate-heterozygosity locus involving 20 male–male, 80 male–male, 20 female–female, 80 female–female, 20 male–female, and 80 male–female pairs, to examine the effects of sample size and the sexes of the individuals. In each of our evaluations, we calculated the exact variances (Equations 15 and 16), approximate variances (Equation 18 plugged into Equations 15 and 16), and simulation variances obtained from 100,000 replicate simulations.

As Table 3 shows, in all cases examined, the exact, approximate, and simulated variances are similar, with the approximate variance slightly underestimating the exact variance. Because of the complexity of the formula for the exact variance, the difference between the approximate variance and the exact variance does not have a simple dependence on heterozygosity or sample size. However, it can be observed in Table 3 that for both and , the relative difference between the approximate variance and exact variances is smallest at low heterozygosity and large sample size, typically near ∼2%. In cases of high heterozygosity and small sample size, the relative difference remains at most ∼10%. We note that the same approximation to the variance of in Equation 18 is applied in obtaining the approximate variances of both and . Thus, because the approximation is generally reasonably accurate and because it treats and in the same way, our use of the approximation is sensible in our subsequent comparisons of the mean squared errors of and .

#### Effect of parameters on the estimators:

Several factors can potentially affect the performance of the estimators. These factors include the true value of heterozygosity itself, the sample size, the type of relative pair represented in the sample, and, if multiple types of relative pairs are included, the combination of particular types of relative pairs. We now examine each of these factors in sequence.

##### Varying heterozygosity:

To investigate the influence of varying heterozygosity on the estimator, we evaluated the scenario of 60 related individuals in 10 *t*_{1} pairs, 10 *u*_{2} pairs, and 10 *v*_{2} pairs (see Table 2) for each of the 36 X-linked microsatellite loci. This scheme incorporates 30 full-sib pairs, considering equally many males and females and utilizing three distinct kinship coefficients: for male–male pairs (*t*_{1}), for male–female pairs (*u*_{2}), and for female–female pairs (*v*_{2}). The 36 loci represent a spread of assumed true heterozygosities ranging from 0.4008 to 0.8599. For each locus, we calculated (Equation 21), as well as and (Nei and Roychoudhury 1974).

Figure 1 displays the properties of the three estimators, , , and , based on application of analytical computations of bias (Equation 14 for ) and the variance approximation (Equation 18 plugged into Equations 15 and 16) to each of the 36 loci. and are unbiased estimators and therefore have zero bias, whereas exhibits increasing bias squared as heterozygosity increases. The bias squared for as a function of heterozygosity is plotted using the theoretical prediction based on Equation 14: . Generally, over the space of heterozygosities defined by the 36 microsatellite loci, the MSE and variance of all three estimators decrease with increasing heterozygosity.

##### Varying sample size and type of relative pair:

We next applied the estimators to scenarios of varying sample size. The ATCT003 (*H* = 0.7794), DXS1068 (*H* = 0.7344), and GATA48H04 (*H* = 0.6476) loci were chosen from the data set to represent high, intermediate, and low heterozygosities, respectively. Only the data for the intermediate heterozygosity locus DXS1068 are shown; the other two loci yield similar results. For each locus and for each of the 10 types of relative pairs in Table 2, we varied the sample size from 2 to 100 pairs. We considered a sample size of at least 2 pairs, as no information is available for the computation of from a single pair of male–male relatives. For all three loci, analytical calculations were performed using the variance approximation (Equation 18 plugged into Equations 15 and 16).

Figure 2 shows that as sample size increases, MSE decreases for all three estimators, and it is always comparable for and ( mostly overlaps in Figure 2). Usually, we expect MSE in a reduced sample to be highest due to greater variance. However, although the results conformed to this prediction for most types of relative pairs, for male–female relative pairs for which there was probability for sharing exactly one allele IBD (types *u*_{1} and *u*_{4}), the MSE of was actually lower than the MSE for and . The same result was also detected in our simulations (data not shown). Investigating further, we found that in male–male and female–female pairs, cases with high probabilities for sharing one or two alleles IBD had MSEs for and that were closer to the MSE values, compared with the higher MSE for observed in other cases. The MSE of is smaller relative to that of the other estimators for *u*_{1} and *u*_{4} male–female pairs because when only one-third of the sample is removed in creating the unrelated set of individuals (removal of males), the increase in variance due to the relatively small decrease in sample size in is comparable to the increased variance caused by the high IBD probabilities for *u*_{1} and *u*_{4} pairs in and , unlike in other cases. When females, instead of males, are removed from male–female pairs, decreasing the sample by two-thirds rather than one-third, the estimators behave more intuitively (Figure 3), with yielding the highest MSE.

##### Varying combinations of relative pairs:

Finally, we studied the effect of relative pair combinations in a sample, using allele frequencies at the ATCT003, DXS1068, and GATA48H04 loci. Only the results for the highest heterozygosity locus, ATCT003, are shown; as was true in the previous section, each locus yielded similar results. For each locus, we examined each of the 231 possible divisions of exactly 20 full-sib pairs into male–male (*t*_{1}), male–female (*u*_{2}), and female–female (*v*_{2}) pairs. Figure 4 displays the MSE, variance, and bias squared of the three estimators, calculated analytically using the variance approximation (Equation 18), for various combinations of *t*_{1}, *u*_{2}, and *v*_{2} pairs for the ATCT003 locus. Variance was highest for , because it had the smallest sample of alleles. For all estimators, variance was highest where the configuration of full-sibs had mostly male–male pairs, again due to the smaller sample of alleles. and were unbiased across the space of possible combinations. showed a trend in bias squared in which configurations with a greater proportion of males had higher bias squared, as is predicted analytically from the smaller sample size (Equation 14). For all configurations, the bias squared of was greater than that for the other estimators. Among the three estimators, MSE was highest for . Similarly to the observation for variance, MSE was greatest for configurations with a high proportion of male–male pairs. Although performed slightly worse in having a greater variance compared to , it had a slightly lower MSE due to its lower bias. More generally, although performed better in the setting of Figure 4, the exact formula can be used to determine which estimator has lowest MSE for a given scenario.

#### Application to data:

We next investigated the behavior of our estimator using X-chromosomal SNP data sets of 485 individuals and 446 unrelated individuals (see *Data*). Table 4 displays the relative pairs in the sample of 485 individuals. Because we analyzed the estimators separately by population, the subscripts of 485 and 446 refer to whether or not relatives were included in a calculation, not to the actual numbers of individuals in that calculation. In the same manner as in DeGiorgio and Rosenberg (2009), we took for each population to be a proxy for true heterozygosity, because this quantity provided an unbiased estimate when no relatives were included in the sample. Note that removed individuals belonged only to pairs related through the X chromosome; individuals related only autosomally (such as male–male parent–offspring pairs) were included in the reduced sample. In our analysis, we compared the means of and across the 13,052 loci to the corresponding mean of .

Figure 5 compares the difference between the mean of across loci () and the mean of () with the difference between the mean of () and the mean of (). As Figure 5A shows, generally yields a lower heterozygosity estimate than due to the downward bias caused by related individuals. Applying reduces the magnitude of the difference between the estimate of heterozygosity in sets with and without relatives (Figure 5B), and yields values that are not consistently lower than those of . It is important to note that because 15 of 45 of the relative pairs in the data have an uncertain second-degree relationship (*t*_{3}, *u*_{5}, or *v*_{5}), might have overcorrected bias in cases in which the individuals were not related via the X chromosome and undercorrected bias in cases in which the individuals actually were related on the X chromosome.

A Wilcoxon signed-rank test was used to evaluate the differences between and applied to the 13 populations that contained relatives (see Table 4). This test yielded a *P*-value of 0.0024, indicating that the inclusion of relatives had a significant impact on the estimation of heterozygosity using . In contrast, the Wilcoxon signed-rank comparison of and yielded a *P*-value of 0.6355, indicating that the inclusion of relatives did not significantly alter the estimation of heterozygosity when was used. The mean difference (−8.0493 × 10^{−5}) and the mean absolute difference (6.3159 × 10^{−4}) were smaller across the 13 populations than the mean difference (−1.9393 × 10^{−3}) and the mean absolute difference (1.9849 × 10^{−3}), and , respectively.

We also investigated the behavior of and with regard to variance for the 13 populations that contained relatives. We compared and , which we used as proxies for bias, following the methods of DeGiorgio and Rosenberg (2009), and the standard deviations of the two estimators applied with relatives included. From Figure 6, we observe that while there was a sizeable difference in the bias proxy between and , there was only a small difference in standard deviation. This result is compatible with the results from our analytical computations, which suggest that corrects bias without substantially increasing variance.

## DISCUSSION

Our estimator, , is an effective tool for assessing the gene diversity of a sample of arbitrary ploidy containing related or inbred individuals. It can be used to provide unbiased estimates of expected heterozygosity when the inbreeding and kinship coefficients of sampled individuals are known. We found that the unbiasedness of the diploid estimator of DeGiorgio and Rosenberg (2009) extends to a much more general set of scenarios, provided that kinship coefficients are appropriately weighted by ploidy in the computation.

Here, we evaluated the properties of in the specific case of the human X chromosome. Through our analytical calculations, we have shown that, similarly to the DeGiorgio and Rosenberg (2009) estimator in the diploid case, the performance of is generally superior to that of when the sample to which the estimators are applied contains relatives. accounts for the bias introduced by relatedness while simultaneously maintaining comparable MSE and variance to . Our estimator also performs well compared to when applied to data from human populations. While the true heterozygosity of each population is not known, when we compared and to an approximation of true heterozygosity, with applied to the data set with no related individuals, we found that the difference between the estimate when relatives were included and when relatives were not included was significantly smaller for . Because the reduction in this proxy for bias is accompanied by only a small increase in standard deviation, we argue that should often be preferred over in the estimation of gene diversity in a sample containing relatives.

In addition to developing the estimator for gene diversity, we also determined the analytical variance of our estimator, allowing us to theoretically evaluate the properties of . We also developed an approximation for variance (Equation 18) that is simpler to compute and that is applicable when each individual has at most one relative in the sample. Knowledge of the theoretical variance can further allow investigators to evaluate the circumstances under which applied to a full sample, including relatives, is superior to using with a reduced sample in which members of relative pairs have been removed. For example, Figure 2 indicates that removing relatives will provide a lower MSE of the heterozygosity estimate in some cases. However, Figure 4 suggests that yields a lower MSE than except in the small fraction of relative–pair combinations that contain large numbers of *u*_{1} pairs. Thus, we propose that in most cases the use of on a sample set that includes related individuals affords a better estimate of gene diversity than applying on a sample that contains no relatives and that investigators can use the theoretical variance of to determine whether a given situation is likely to be among the exceptions.

## APPENDIX A

In this section, we present proofs for Equations 9, 11, 12, and 13.

*Proof of Equation* 9. Applying the definition of and using Equation 8, we have▪*Proof of Equation* 11. . We only need to show that . Note that while we write as an estimate, [IBD] depends only on quantities that are treated as known with certainty and we write it as a known quantity itself. Consider two alleles from the sample (that are not necessarily distinct). Let *C*_{(a,j)(b,k)} denote the event that the first of the two alleles is from individual (*a*, *j*) and the second is from individual (*b*, *k*), where (*a*, *j*) and (*b*, *k*) are not necessarily distinct. Supposing that the two alleles are drawn uniformly at random from the sample, with replacement, let *C*_{(a,j)(b,k)} denote the probability of event *C*_{(a,j)(b,k)}. Let IBD|*C*_{(a,j)(b,k)} be the probability that two alleles are IBD given that the first allele is chosen from individual (*a*, *j*) and the second is chosen from individual (*b*, *k*). ThenNote that, for individuals (*a*, *j*) and (*b*, *k*), which are not necessarily distinct,It follows that▪*Proof of Equation* 12. For an *m _{b}*-ploid individual

*k*, Φ

_{(b,k)(b,k)}= 1/

*m*+ (1 − 1/

_{b}*m*)

_{b}*f*

_{(b,k)}= (1/

*m*)[1+(

_{b}*m*−1)

_{b}*f*

_{(b,k)}]. Note that Φ

_{(a,j)(b,k)}= 0 if individuals (

*a*,

*j*) and (

*b*,

*k*) are unrelated. We can then break into three components, considering three different types of pairs of individuals: same group–same individual, same group–different individual, and different group. Therefore▪

*Proof of Equation*13. First we note thatSubstituting into (Equation 10) givesRearranging Equation 3 we getfrom which▪

## APPENDIX B

In this section, we present results that aid in the derivation of the variance of our gene diversity estimator. Lemma 3 derives certain expectations involving four alleles. These expectations are used to calculate the variance and covariance of squared allele frequency estimates in Lemma 4. Lemma 4 is then used to prove the variance formula in Theorem 2 when related and inbred individuals are included in a sample.

Lemma 3. *Consider a locus with I distinct alleles*, *allele frequencies p _{i}* ∈ [0, 1],

*and*.

*Suppose a sample from a population has g groups*,

*each with different ploidy*,

*and n*

_{b}*m*,

_{b}-ploid individuals in group b*b*= 1, 2, …,

*g*,

*each of whom is possibly inbred and related to other individuals in the sample. Consider the*ℓ

*th allele of individual*(

*a*,

*j*),

*the tth allele of individual*(

*b*,

*k*),

*the*ℓ′

*th allele of individual*(

*a*′,

*j*′),

*and the t*′

*th allele of individual*(

*b*′,

*k*′).

*For clarity*,

*let w*= (

*a*,

*j*),

*x*= (

*b*,

*k*),

*y*= (

*a*′,

*j*′),

*and z*= (

*b*′,

*k*′).

*Then for allelic types i and i*′ ≠

*i*,(B1)(B2)

*Proof.*We need to evaluatewhere

*S*represents one of the 15 identity states in Figure B1 for four alleles—one from

*w*, one from

*x*, one from

*y*, and one from

*z*—and Δ

_{s}is the identity coefficient, the probability of observing state

*S*=

*s*for four alleles randomly chosen, one from

*w*, one from

*x*, one from

*y*, and one from

*z*. We can rewrite the identity coefficients in terms of kinship coefficients by using the following relationships:(B3)

Note that the Δ-coefficients above are identical to the δ-coefficients in Cockerham (1971). Also, the Φ-coefficients involving two individuals, three individuals, and pairs of pairs of individuals are identical to Cockerham's θ-, γ-, and Δ-coefficients, respectively (Cockerham 1971). If *i*′ = *i*, we get(B4)If *i* ≠ *i*′, we get(B5)The desired result follows by substituting Equation B3 into Equations B4 and B5. ▪

Note that expressions mathematically identical to Equations B1 and B2 except with different notation appear in Table 1 of Cockerham (1971). However, a slight conceptual difference is that our formulas involve an expectation of a product among four arbitrary alleles, not necessarily four alleles in two pairs of diploid genotypes. We now use Lemma 3 to derive and .

Lemma 4. *Consider a locus with I distinct alleles*, *allele frequencies p _{i}* ∈ [0, 1],

*and*.

*Suppose a sample from a population has g groups*,

*each with different ploidy*,

*and n*-

_{b}m_{b}*ploid individuals in group b*,

*b*= 1, 2, …,

*g*,

*each of whom is possibly inbred and related to other individuals in the sample. Then for allelic types i and i*′ ≠

*i*,(B6)(B7)

*and therefore*(B8)(B9)

*Proof.*Applying the definition of , we haveFor the case with alleles

*i*and

*i*′ ≠

*i*, we haveApplying the definition of variance, we haveApplying the definition of covariance, we have▪

We now utilize Lemma 4 to prove Theorem 2.

*Proof of Theorem 2.* Applying the definition of variance, we haveSimplifying, we getApplying the identity gives Equation 15. ▪

It is interesting (and convenient) that although the derivation requires the use of all 15 Δ-coefficients, the only coefficients required in the variance formula are , , , and . The 15 Δ-coefficients in Figure B1 completely specify the 14 Φ-coefficients in Equation B3 (along with the 15th Φ-coefficient equal to Δ_{15}). Through symmetry of the 6 Φ-coefficients involving two individuals, symmetry of the 4 Φ-coefficients involving three individuals, and symmetry of the 3 Φ-coefficients involving pairs of pairs of individuals, by averaging over sets of individuals, the variance of gene diversity becomes a function of only 4 average Φ-coefficients.

## Acknowledgments

We thank Laurent Excoffier and three anonymous reviewers for their valuable comments. This work was supported by National Institutes of Health (NIH) grant R01 GM081441, NIH training grant T32 GM070449, a University of Michigan Rackham Merit Fellowship, and grants from the Burroughs Wellcome Fund and the Alfred P. Sloan Foundation.

## Footnotes

1 These authors contributed equally to this work.

3

*Present address:*David Geffen School of Medicine, University of California, Los Angeles, CA 90095.Communicating editor: L. Excoffier

- Received August 3, 2010.
- Accepted September 21, 2010.

- Copyright © 2010 by the Genetics Society of America