Originally published as Genetics Published Articles Ahead of Print on August 9, 2008.

Genetics, Vol. 179, 2027-2036, August 2008, Copyright © 2008
doi:10.1534/genetics.107.084772

The Relationship Between Homozygosity and the Frequency of the Most Frequent Allele

Department of Human Genetics, Center for Computational Medicine and Biology, and the Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109-2218

1 Corresponding author: University of Michigan, 2017 Palmer Commons, 100 Washtenaw Ave., Ann Arbor, MI 48109-2218.
E-mail: rnoah{at}umich.edu

Manuscript received November 20, 2007. Accepted for publication May 15, 2008.

ABSTRACT

Homozygosity is a commonly used summary of allele-frequency distributions at polymorphic loci. Because high-frequency alleles contribute disproportionately to the homozygosity of a locus, it often occurs that most homozygotes are homozygous for the most frequent allele. To assess the relationship between homozygosity and the highest allele frequency at a locus, for a given homozygosity value, we determine the lower and upper bounds on the frequency of the most frequent allele. These bounds suggest tight constraints on the frequency of the most frequent allele as a function of homozygosity, differing by at most 1/4 and having an average difference of 2/3 – π²/18 ≈ 0.1184. The close connection between homozygosity and the frequency of the most frequent allele, which we illustrate using allele frequencies from human populations, has the consequence that when one of these two quantities is known, considerable information is available about the other quantity. This relationship also explains the similar performance of statistical tests of population-genetic models that rely on homozygosity and those that rely on the frequency of the most frequent allele, and it provides a basis for understanding the utility of extended homozygosity statistics in identifying haplotypes that have been elevated to high frequency as a result of positive selection.


THE concept of homozygosity appears ubiquitously in population genetics, in the context of mathematical theory as well as in statistical methods for data analysis. Consider a locus with K ≥ 2 alleles, for which the frequency of allele i is p_i > 0 and for which the alleles are placed in decreasing order of frequency so that p_i ≥ p_j if i < j. For diploids, the fraction of homozygotes expected under the assumption of Hardy–Weinberg proportions can be defined as

$$H = \sum_{i=1}^{K} p_i^2, \tag{1}$$
where

$$\sum_{i=1}^{K} p_i = 1. \tag{2}$$
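As a minimal sketch of these two definitions, the Python snippet below computes the expected homozygosity and the frequency of the most frequent allele from a list of allele frequencies. The function names and the three-allele example are illustrative assumptions, not part of the article.

```python
def homozygosity(freqs):
    """Expected homozygosity under Hardy-Weinberg: sum of squared frequencies."""
    return sum(p * p for p in freqs)


def most_frequent(freqs):
    """Frequency of the most frequent allele (p_1 when sorted in decreasing order)."""
    return max(freqs)


# Hypothetical locus with three alleles; the frequencies sum to 1.
example_freqs = [0.5, 0.3, 0.2]
example_H = homozygosity(example_freqs)   # 0.25 + 0.09 + 0.04
example_M = most_frequent(example_freqs)
```

In this hypothetical example, homozygotes for the most frequent allele contribute 0.25 of the total 0.38, which illustrates the disproportionate contribution of high-frequency alleles.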

In this article, we show that if all that is known about a locus is its expected homozygosity H, it is possible to localize the frequency p1 of its most frequent allele within a quite narrow range. Conversely, given p1, a narrow range can be specified for the value of H. Thus, we determine the upper and lower bounds on the frequency p1 of the most frequent allele as functions of homozygosity H. We also determine the bounds on H as functions of p1.

The connection between H and p1 provides a close relationship between two of the most basic quantities associated with a polymorphic locus. We use this relationship to explain a high correlation observed between H and p1 in human microsatellite data, as well as to provide a conceptual basis for the success of extended haplotype homozygosity methods in detecting positive selection. Note that expected heterozygosity under Hardy–Weinberg proportions is 1 – H; thus, by a simple transformation, our results can also be used to describe the relationship between heterozygosity and the frequency of the most frequent allele.


RESULTS
We consider a polymorphic locus with at least two alleles. We do not assume that the number of alleles with nonzero frequency is known; it is convenient to view the locus as having infinitely many alleles and to allow some of these alleles to have frequency 0. We refer to the frequency of the most frequent allele, p_1, by M. Henceforth we use H and "homozygosity" to refer to expected homozygosity assuming Hardy–Weinberg proportions. Both M and H must lie in the interval (0, 1). The quantity ⌈x⌉ denotes the smallest integer greater than or equal to x. Our main results, which are proved in the APPENDIX, are the bounds on M as functions of H (Theorem 1) and the bounds on H as functions of M (Theorem 2).

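THEOREM 1. Consider a sequence of the allele frequencies at a locus, {p_i} (i = 1, 2, ...), with p_i ∈ [0, 1), $\sum_{i=1}^{\infty} p_i = 1$, $H = \sum_{i=1}^{\infty} p_i^2$, M = p_1, and i < j implies p_i ≥ p_j. Then

$$\text{(i)}\quad M < \sqrt{H}$$

$$\text{(ii)}\quad M \ge \frac{1}{\lceil H^{-1} \rceil}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right)$$
with equality if and only if p_i = M for 1 ≤ i ≤ K – 1, p_K = 1 – (K – 1)M, and p_i = 0 for i > K, where K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉.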

THEOREM 2. Consider a sequence of the allele frequencies at a locus, {p_i} (i = 1, 2, ...), with p_i ∈ [0, 1), $\sum_{i=1}^{\infty} p_i = 1$, $H = \sum_{i=1}^{\infty} p_i^2$, M = p_1, and i < j implies p_i ≥ p_j. Then (i) H > M² and (ii) H ≤ 1 – M(⌈M⁻¹⌉ – 1)(2 – ⌈M⁻¹⌉M), with equality if and only if p_i = M for 1 ≤ i ≤ K – 1, p_K = 1 – (K – 1)M, and p_i = 0 for i > K, where K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉.

The bounds obtained in Theorems 1 and 2 are summarized in Table 1. Loosely speaking, Theorem 1 verifies that for a given homozygosity, the frequency of the most frequent allele is smallest when as many alleles as possible are tied as most frequent and greatest when there is one extremely frequent allele and many rare alleles. Theorem 2 shows that for a given frequency of the most frequent allele, homozygosity is smallest when many extremely rare alleles are present and greatest when as many alleles as possible are tied as most frequent. For each of the theorems, part i is straightforward to prove, and part ii follows from the fact that when considering all possible sets of nonnegative real numbers bounded above by a specified constant M and having a fixed sum C, the maximal sum of squares is obtained by greedily choosing as many of the numbers as possible to equal M and by assigning at most one additional number to be positive (Lemma 3 in the APPENDIX).
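The four bounds can be sketched as small Python functions. This is an illustrative sketch, not the authors' code: the function names are assumptions, the lower bound on M is obtained here by solving the equality case of Theorem 2 (ii) for M, and the `max(...)` guards only protect against floating-point round-off at reciprocals of integers.

```python
import math


def ceil_inverse(x):
    """ceil(1/x) for x in (0, 1); never below 2 on that interval."""
    return max(2, math.ceil(1.0 / x))


def M_upper(H):
    """Upper bound on M given H (Theorem 1, part i; not attained)."""
    return math.sqrt(H)


def M_lower(H):
    """Lower bound on M given H (Theorem 1, part ii)."""
    K = ceil_inverse(H)
    inner = max(0.0, (K * H - 1.0) / (K - 1.0))  # guard against tiny negatives
    return (1.0 + math.sqrt(inner)) / K


def H_lower(M):
    """Lower bound on H given M (Theorem 2, part i; not attained)."""
    return M * M


def H_upper(M):
    """Upper bound on H given M (Theorem 2, part ii)."""
    K = ceil_inverse(M)
    return 1.0 - M * (K - 1) * (2.0 - K * M)
```

For example, `M_lower(0.25)` and `M_upper(0.25)` evaluate to 0.25 and 0.5, and applying `H_upper` to `M_lower(H)` returns H, consistent with the two bounds being inverse functions.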


TABLE 1

Bounds on homozygosity and the frequency of the most frequent allele

 
Theorems 1 and 2 can be visualized in Figures 1–5, and various properties of the bounds that can be observed in the figures are considered in the APPENDIX. Figure 1 illustrates the upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity. The peculiar yet continuous and monotonic nature of the lower bound can be observed, as can the relatively confined range between the upper and lower bounds (with an average difference of 2/3 – π²/18 ≈ 0.1184) in which the frequency of the most frequent allele must lie. The stepped shape for the lower bound results from transitions at reciprocals of integers for the number of alleles contained in the collection of allele frequencies that achieves the lower bound.


FIGURE 1. Upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity.

 

FIGURE 2. The difference between the upper and lower bounds on the frequency of the most frequent allele, for a given homozygosity, and the difference between the bounds and homozygosity itself.

 

FIGURE 3. The lower bound on the fraction of homozygosity contributed by homozygotes for the most frequent allele. The upper bound is 1.

 

FIGURE 4. Upper and lower bounds on homozygosity, as functions of the frequency of the most frequent allele. These upper and lower bounds are the inverse functions of the lower and upper bounds on the frequency of the most frequent allele, given homozygosity.

 

FIGURE 5. The difference between the upper and lower bounds on homozygosity given the frequency of the most frequent allele, and the difference between the frequency of the most frequent allele and the bounds.

 
Figure 2 shows the pairwise differences among the upper bound, the lower bound, and the homozygosity itself. From this figure, it is possible to see that the lower bound on the frequency of the most frequent allele is greater than or equal to the homozygosity, with equality when homozygosity is the reciprocal of an integer. It can also be seen that the difference between the lower bound and the homozygosity has numerous local maxima, the highest point being at (5/8, 1/8), and that the difference between the upper bound and the lower bound has local maxima at reciprocals of integers and local minima in the intervening intervals. The maximal difference between the upper and lower bounds occurs at (1/4, 1/4), and the highest of the local minima is nearby.

Figure 3 displays the minimal fraction of homozygosity contained in homozygotes for the most frequent allele. This function is monotonically increasing, so that for homozygosities substantially >1/2, nearly all homozygotes are homozygous for the most frequent allele, regardless of the total number of alleles.

The upper and lower bounds on homozygosity in terms of the frequency of the most frequent allele are the inverse functions of the lower and upper bounds on the frequency of the most frequent allele in terms of homozygosity. Thus, there is a close relationship between the bounds on H in terms of M shown in Figure 4 and the bounds on M in terms of H shown in Figure 1.

Figure 5 depicts, as functions of the frequency of the most frequent allele, the pairwise differences among the upper bound on homozygosity, the lower bound, and the frequency of the most frequent allele itself. The frequency of the most frequent allele is greater than or equal to the upper bound, equaling the upper bound at reciprocals of integers. The difference between this frequency and the upper bound has a collection of local maxima, the highest being at (3/4, 1/8). The difference between the upper and lower bounds has local maxima at reciprocals of integers and local minima in the intervening intervals. The maximal difference between the upper and lower bounds occurs at (1/2, 1/4), near the highest of the local minima.


APPLICATION TO DATA
To demonstrate the bounds with actual allele frequencies, we consider the homozygosity and frequency of the most frequent allele for 783 multiallelic microsatellite loci studied in a sample of 1048 individuals drawn from worldwide human populations (ROSENBERG et al. 2005). Although our theoretical results are useful for any collection of multiallelic loci, this data set provides a particularly illustrative example, as levels of variability of human microsatellites span quite a wide range. For each locus, we assume that the allele frequencies in the sample are parametric allele frequencies, and we obtain values for H and M in the full collection of 1048 individuals.

Figure 6 plots H and M for the 783 loci, illustrating a high degree of correlation between the two quantities. Homozygosity ranges from 0.0837 to 0.6872, and the frequency of the most frequent allele ranges from 0.1136 to 0.8146. Several loci have values of M quite close to the lower bound for their homozygosity values (Table 2). The lists of allele frequencies for these loci are fairly close to the lists that achieve the lower bound. For example, locus AGAT017 has homozygosity 0.2118, between 1/5 and 1/4, and its four most frequent alleles have frequencies 0.2425, 0.2410, 0.2300, and 0.1979. At homozygosity 0.2118, the lower bound for the frequency of the most frequent allele is achieved when four alleles have frequency 0.2243 and a fifth allele has frequency 0.1027.
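The AGAT017 numbers can be reproduced with a short calculation. The sketch below assumes the closed-form lower bound obtained from the equality case of Theorem 2; the function name is illustrative, and because the published homozygosity is rounded to four decimals, the remainder allele agrees with the quoted 0.1027 only approximately.

```python
import math


def lower_bound_configuration(H):
    """Allele frequencies achieving the lower bound on M for homozygosity H."""
    K = max(2, math.ceil(1.0 / H))
    inner = max(0.0, (K * H - 1.0) / (K - 1.0))
    M = (1.0 + math.sqrt(inner)) / K
    # K - 1 alleles tied at frequency M, plus one allele holding the remainder.
    return [M] * (K - 1) + [1.0 - (K - 1) * M]


agat017 = lower_bound_configuration(0.2118)  # homozygosity reported for AGAT017
tied_frequency = agat017[0]                  # about 0.2243
remainder_frequency = agat017[-1]            # about 0.103
```

The observed top frequency of 0.2425 sits only about 0.018 above this bound, which is why the locus appears near the lower curve in Figure 6.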


FIGURE 6. Homozygosity and frequency of the most frequent allele for 783 microsatellite loci. Each bin is 0.01 × 0.01, and the upper and lower bounds on the frequency of the most frequent allele are shown for comparison. The correlation coefficient of homozygosity and the frequency of the most frequent allele is 0.9439. Tables 2 and 3 give the frequencies of all alleles at the marked microsatellite loci.

 

TABLE 2

Five microsatellite loci with frequency of the most frequent allele close to the lower bound

 

TABLE 3

Three microsatellite loci with frequency of the most frequent allele close to the upper bound

 
Two other loci whose most frequent alleles have frequency close to the lower bound, TATC012 and GATA146D07, have homozygosities between 1/4 and 1/3. For homozygosities in this interval, the lower bound is achieved when the three highest allele frequencies have the same value; indeed both loci have three high-frequency alleles with frequency near the lower bound. Similarly, locus GATA151C03P, with homozygosity between 1/3 and 1/2, has two high-frequency alleles with frequency near the lower bound.

Table 3 displays the allele frequencies for three loci with values of M close to the upper bound. The upper bound is approximated when a locus has one allele with a particularly high frequency and many alleles with low frequencies. Consistent with their M values near the upper bound, each of the three loci has a single high-frequency allele and several low-frequency alleles.

Subdividing loci on the basis of their numbers of alleles, Figure 7 illustrates a trend of decreasing H with an increasing number of alleles. Considering the four plots, the mean value of M – H is greatest in Figure 7B, in which the mean homozygosity is near 0.25. This observation is explained by the fact that the range between the upper and lower bounds on M is greatest for a homozygosity of 1/4. As the mean homozygosity moves away from 1/4 in Figure 7, A, C, and D, the mean value of M – H decreases.


FIGURE 7. Frequency of the most frequent allele minus homozygosity for 783 microsatellite loci. Each bin is 0.01 × 0.01, and the upper and lower bounds on M – H are shown for comparison. For each plot, the mean (H, M – H) is marked by an x. (A) Loci with 4–9 distinct alleles (233): the mean is (0.2989, 0.1179). (B) Loci with 10–11 distinct alleles (222): the mean is (0.2570, 0.1208). (C) Loci with 12–14 distinct alleles (175): the mean is (0.2262, 0.1126). (D) Loci with 15–35 distinct alleles (153): the mean is (0.1890, 0.1102).

 


DISCUSSION
For a biallelic locus, an exact relationship exists between homozygosity (H) and the frequency of the most frequent allele (M), as H = 2M² – 2M + 1 and M = [1 + √(2H – 1)]/2. Although in general the value of H or M is not uniquely specified from the value of the other quantity, we have found that a close connection between H and M does in fact exist. Our analysis verifies that measured values for homozygosity (and heterozygosity) consist largely of the contribution of the most common allele, and that the contribution made by rarer alleles is relatively small. Especially if homozygosity is very high or if the most frequent allele has a high frequency, each of the two summaries H and M greatly limits the possible values of the other quantity, so that both quantities provide similar information about an underlying allele-frequency distribution.
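The biallelic relationship can be checked directly. With two alleles at frequencies M and 1 – M, where M ≥ 1/2, a short derivation (written out here for convenience rather than taken from the article) gives:

```latex
\begin{align*}
H &= M^2 + (1 - M)^2 = 2M^2 - 2M + 1, \\
0 &= 2M^2 - 2M + (1 - H)
  \;\Longrightarrow\;
  M = \frac{2 \pm \sqrt{4 - 8(1 - H)}}{4} = \frac{1 \pm \sqrt{2H - 1}}{2}.
\end{align*}
```

Because M is the larger of the two allele frequencies, the root with the plus sign applies. For instance, M = 0.8 gives H = 0.68, and substituting H = 0.68 returns M = (1 + 0.6)/2 = 0.8.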

These results have implications for population-genetic methods that rely on H or M in analyses of multiallelic loci. Various neutrality tests have been developed that identify deviations from null population-genetic models on the basis of unusual values of homozygosity (WATTERSON 1977, 1978), heterozygosity (DEPAULIS and VEUILLE 1998; DEPAULIS et al. 2001; MARKOVTSOVA et al. 2001), or the frequency of the most frequent allele (HUDSON et al. 1994). The close connection between homozygosity and the frequency of the most frequent allele suggests that tests using H and those using M detect similar features of the allele-frequency distribution. This observation potentially explains a high level of agreement seen in Table 7 of INNAN et al. (2005) for the haplotype diversity test (DEPAULIS and VEUILLE 1998), based on haplotype heterozygosity, and the HUDSON et al. (1994) haplotype test, based on the frequency of the most frequent haplotype.

Our results are also informative in relation to recently proposed methods that use "extended haplotype homozygosity"—pairwise identity of long haplotypes in the neighborhood of an index site—in detecting the signature of partial selective sweeps (SABETI et al. 2002; TOOMAJIAN et al. 2006; VOIGHT et al. 2006; TANG et al. 2007; ZENG et al. 2007). During such sweeps, a favored mutant allele rises to high frequency, carrying with it neighboring alleles that were near the selected site on the haplotype on which the mutation originally occurred. Thus, the detection of partial selective sweeps is a search for long high-frequency haplotypes that have not had sufficient time to be broken down by recombination. Because of the close connection between homozygosity and the frequency of the most frequent allele, genomic regions that have long high-frequency haplotypes will largely be coincident with regions that have long stretches of high haplotype homozygosity. Consequently, extended haplotype homozygosity methods provide an effective basis for accessing the signal of partial selective sweeps contained in extended high-frequency haplotypes.

Finally, the connection between homozygosity and the frequency of the most frequent allele may be useful for examining the properties of a variety of additional functions of allele frequencies that are based on homozygosity. Notably, the genetic differentiation measure FST and related quantities can be assembled from the homozygosities of various subgroups of a population—especially when viewed in the formulation of the GST measure of NEI (1987). From the connection between H and M, it follows that constraints on FST as functions of M undoubtedly exist; such constraints potentially provide the conceptual basis for understanding a frequency dependence observed for values of FST (LONG and KITTLES 2003; HEDRICK 2005).


APPENDIX
In addition to verifying Theorems 1 and 2, this APPENDIX formalizes many of the features visible in Figures 1–5. We begin with the proofs of the theorems. We then obtain properties of the frequency of the most frequent allele in terms of homozygosity and properties of homozygosity in terms of the frequency of the most frequent allele. For convenience, we label the bounds as follows:

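$$F(H) = \sqrt{H} \tag{A1}$$

$$f(H) = \frac{1}{\lceil H^{-1} \rceil}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right) \tag{A2}$$

$$g(M) = M^2 \tag{A3}$$

$$G(M) = 1 - M(\lceil M^{-1} \rceil - 1)(2 - \lceil M^{-1} \rceil M) \tag{A4}$$
For integers K ≥ 2, we also denote the half-open interval [1/K, 1/(K – 1)) by I_K.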

The key result is Lemma 3, which considers sets of nonnegative numbers with a fixed positive sum C, in which the numbers in the set are bounded above by a positive constant M. The square of a positive number x is greater than or equal to the sum of squares for each collection of nonnegative numbers whose sum is x. As a result, considering all sets of nonnegative numbers with maximum M and with sum equal to C, we can show that the maximal sum of squares is obtained when as many of the numbers as possible are equal to M and when at most one remaining number is smaller than M. Lemma 3 makes it possible to obtain the maximal homozygosity as a function of M and, ultimately, to find the minimal M as a function of H.

LEMMA 3. Suppose M > 0 and C > 0 and that ⌈C/M⌉ is denoted K. Considering all sequences p = {p_i} (i = 1, 2, ...) with p_i ∈ [0, M], $\sum_{i=1}^{\infty} p_i = C$, and i < j implies p_i ≥ p_j, $H(p) = \sum_{i=1}^{\infty} p_i^2$ is maximal if and only if p_i = M for 1 ≤ i ≤ K – 1, p_K = C – (K – 1)M, and p_i = 0 for i > K, and its maximum is K(K – 1)M² – 2C(K – 1)M + C².

Proof. We use induction on K. Suppose K = 1, so that C ≤ M. Because $\sum_{i=1}^{\infty} p_i = C$,

$$H(p) = \sum_{i=1}^{\infty} p_i^2 = C^2 - 2\sum_{i<j} p_i p_j. \tag{A5}$$
Because a nonnegative term is subtracted in Equation A5, the maximum of H(p) occurs when this term is zero. As a result, at the maximum, p_1 = C ≤ M, p_i = 0 for i > 1, and H(p) = C². This establishes the base case.

Assume that the desired result is true for all C and M with ⌈C/M⌉ = K – 1. Now suppose ⌈C/M⌉ = K. The proposed value of p that maximizes H has p_i = M for 1 ≤ i ≤ K – 1, p_K = C – (K – 1)M, and p_i = 0 for i > K. Label this sequence by p*. Then

$$H(p^*) = (K-1)M^2 + [C - (K-1)M]^2 = K(K-1)M^2 - 2C(K-1)M + C^2. \tag{A6}$$
By assumption, ⌈C/M⌉ = K and M ≥ C/K. As Equation A6 describes a parabola in M with positive leading term, regardless of the value of M, H(p*) is greater than or equal to the value at the minimum of the parabola, which occurs at M = C/K and equals C²/K.

We now show that no other sequence p can achieve a value of H as high as H(p*). Suppose p_1 < C/K. Because p_i ≤ p_1 for i > 1,

$$H(p) = \sum_{i=1}^{\infty} p_i^2 \le p_1 \sum_{i=1}^{\infty} p_i = p_1 C < \frac{C^2}{K}. \tag{A7}$$
Because H(p*) ≥ C²/K, the sequence p that maximizes H cannot have p_1 < C/K. This sequence must therefore have p_1 ≥ C/K and ⌈C/p_1⌉ ≤ K. However, because p_1 ≤ M and ⌈C/M⌉ = K by assumption, ⌈C/p_1⌉ ≥ ⌈C/M⌉ = K. Thus, ⌈C/p_1⌉ = K and ⌈(C – p_1)/p_1⌉ = K – 1.

Note that $H(p) = p_1^2 + \sum_{i=2}^{\infty} p_i^2$. We can therefore apply the inductive hypothesis to $\sum_{i=2}^{\infty} p_i^2$ with C – p_1 in place of C and p_1 in place of M. By the inductive hypothesis, the maximum of $\sum_{i=2}^{\infty} p_i^2$ occurs if and only if p_i = p_1 for 2 ≤ i ≤ K – 1, p_K = (C – p_1) – (K – 2)p_1, and p_i = 0 for i > K. As a result,

$$H(p) \le (K-1)p_1^2 + [C - (K-1)p_1]^2 = K(K-1)p_1^2 - 2C(K-1)p_1 + C^2. \tag{A8}$$
This function is monotonically increasing in p_1 for p_1 ≥ C/K and therefore achieves its maximum when p_1 is as large as possible, that is, when p_1 = M. ∎
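Lemma 3 lends itself to a quick numerical sanity check. The sketch below builds the greedy sequence described in the lemma and compares its sum of squares against randomly generated sequences that satisfy the same constraints. The function names, the rejection-sampling scheme, and the parameter values are illustrative assumptions; a passing check is evidence, not a proof.

```python
import math
import random


def greedy_maximizer(C, M):
    """Sequence from Lemma 3 together with the closed-form maximum."""
    K = math.ceil(C / M)
    p = [M] * (K - 1) + [C - (K - 1) * M]
    closed_form = K * (K - 1) * M**2 - 2 * C * (K - 1) * M + C**2
    return p, closed_form


def random_bounded_sequence(C, M, n, rng):
    """Random nonnegative sequence with sum C, or None if an entry exceeds M."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    p = [C * w / total for w in weights]
    return p if max(p) <= M else None


def never_exceeds(C, M, trials=2000, seed=0):
    """True if no sampled admissible sequence beats the greedy sum of squares."""
    rng = random.Random(seed)
    _, best = greedy_maximizer(C, M)
    for _ in range(trials):
        p = random_bounded_sequence(C, M, rng.randint(3, 8), rng)
        if p is not None and sum(x * x for x in p) > best + 1e-12:
            return False
    return True
```

With C = 1 and M = 0.4, the greedy sequence is approximately (0.4, 0.4, 0.2), whose sum of squares is 0.36, matching the closed form in the lemma.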

Proof of Theorem 2. (i) This result follows from the definition of H and from the fact that p_2 > 0; (ii) this follows from Lemma 3, taking C = 1 so that K = ⌈M⁻¹⌉. ∎

LEMMA 4. (i) F, f, g, G: (0, 1) → (0, 1) are monotonically increasing, continuous, and bijective; (ii) f and G are differentiable on (1/K, 1/(K – 1)) for each integer K ≥ 2.

Proof. The result is trivial for F and g. For integers K ≥ 2, G(1/K) = 1/K, and G is monotonically increasing on each interval I_K, where ⌈M⁻¹⌉ has the fixed value K. Thus, G is monotonically increasing on (0, 1). From the form of G it is clear that on (1/K, 1/(K – 1)), G is continuous and differentiable. For each K, as M = 1/K is approached from either direction, G(M) approaches 1/K. Thus, G is continuous on (0, 1). Given H ∈ (0, 1), there is a unique M for which G(M) = H, so that G is bijective. Similar reasoning holds for f. ∎

As a consequence of this lemma, since G(1/K) = 1/K for integers K ≥ 2, if M ∈ (0, 1) and ⌈M⁻¹⌉ = K, then G(M) lies in the interval I_K. Similarly, if ⌈H⁻¹⌉ = K for H ∈ (0, 1), then f(H) also lies in I_K.

LEMMA 5. F and g are inverse functions on (0, 1), as are f and G.

Proof. The result is trivial for F and g. As bijections, both f and G are invertible. Because M ∈ I_K implies G(M) ∈ I_K, we have ⌈M⁻¹⌉ = ⌈G(M)⁻¹⌉, from which we can solve for M in terms of G(M) on each interval I_K to find that on each interval the inverse of G is f. ∎

Proof of Theorem 1.

  1. This result follows from the definition of H and from the fact that p_2 > 0.
  2. By Theorem 2, for a given value of M and a given sequence {p_i} (i = 1, 2, ...) with $H = \sum_{i=1}^{\infty} p_i^2$, G(M) ≥ H with equality if and only if p_i = M for 1 ≤ i ≤ K – 1, p_K = 1 – (K – 1)M, and p_i = 0 for i > K, where K = ⌈M⁻¹⌉. Applying the monotonically increasing function f to the inequality G(M) ≥ H, f(G(M)) ≥ f(H) with the same equality condition. Because f is the inverse of G, M ≥ f(H) with the same equality condition. ∎

Note that for a given value of H, we can find a set of allele frequencies for which the value of M comes arbitrarily close to its upper bound of √H. This can be accomplished by supposing that a locus has one common allele with frequency M and N rare alleles each with frequency ε. If all other alleles have zero frequency, such a locus must have M² + Nε² = H and M + Nε = 1. Solving this pair of equations for M in terms of N (taking the larger root) and letting N → ∞, we obtain M → √H. Similar reasoning yields sets of allele frequencies that for a given value of M have values of H arbitrarily close to the lower bound of M².
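The limiting construction can be made concrete. Eliminating ε from the two constraints gives the quadratic (N + 1)M² – 2M + (1 – NH) = 0, and the sketch below evaluates its larger root for increasing N. The function name and the example value H = 0.38 are assumptions for illustration; N must be large enough for the discriminant to be nonnegative.

```python
import math


def M_with_rare_alleles(H, N):
    """Larger root M of (N + 1) M^2 - 2 M + (1 - N H) = 0.

    One common allele has frequency M and N rare alleles each have
    frequency (1 - M) / N, so that the frequencies sum to 1 and the
    squared frequencies sum to H.
    """
    discriminant = 1.0 - (N + 1) * (1.0 - N * H)
    return (1.0 + math.sqrt(discriminant)) / (N + 1)


H_example = 0.38  # hypothetical homozygosity
approach = [M_with_rare_alleles(H_example, N) for N in (10, 100, 1000, 10**6)]
upper = math.sqrt(H_example)
```

For H = 0.38 the successive values climb toward √0.38 ≈ 0.6164 from below, without reaching it.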

Frequency of the most frequent allele in terms of homozygosity:

We now derive the properties of the upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity. Most of the results that follow are relatively straightforward to prove, and they are included for completeness.

Proposition 6 determines the mean values of the bounds, finding that the difference between them has a rather small mean of 2/3 – π²/18 ≈ 0.1184. Lemma 7 then shows that the lower bound on the frequency of the most frequent allele is greater than or equal to the homozygosity itself; the mean values of the differences of the upper and lower bounds from the homozygosity are then obtained in Proposition 8. Results 9–15 concern additional properties of the differences among the upper and lower bounds and the homozygosity and properties of various maxima and minima associated with the upper and lower bounds. The section concludes with Proposition 16, which determines a lower bound on the fraction of homozygosity that is due to the most frequent allele.

PROPOSITION 6. Averaging across values of H ∈ (0, 1), (i) the mean of F(H) is 2/3; (ii) the mean of f(H) is π²/18; (iii) the mean of F(H) – f(H) is 2/3 – π²/18.

Proof.

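  1. The mean of F(H) is $\int_0^1 \sqrt{H}\,dH = 2/3$.
  2. Because (0, 1) = $\bigcup_{K=2}^{\infty} I_K$, and because ⌈H⁻¹⌉ = K for H ∈ I_K, the mean of f(H) can be written

    $$\int_0^1 f(H)\,dH = \sum_{K=2}^{\infty} \int_{1/K}^{1/(K-1)} \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right) dH = \sum_{K=2}^{\infty} \left[\frac{1}{3}\left(\frac{1}{K} - \frac{1}{K-1}\right) - \frac{1}{3K^2} + \frac{2}{3(K-1)^2}\right]. \tag{A9}$$
    Because $\sum_{K=2}^{\infty} [1/K - 1/(K-1)]$ reduces to –1 and because $\sum_{K=1}^{\infty} 1/K^2 = \pi^2/6$, Equation A9 simplifies to π²/18.

  3. That the mean of F(H) – f(H) is 2/3 – π²/18 follows directly from i and ii together with the fact that F(H) > f(H) for H ∈ (0, 1). ∎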
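The value π²/18 can be confirmed numerically. The sketch below sums the exact integral of f over each interval I_K, using the per-interval closed form (3K – 1)/[3K²(K – 1)²]; that closed form is derived here by direct integration and is an assumption of this sketch rather than an expression quoted from the article.

```python
import math


def mean_lower_bound(terms=200000):
    """Partial sum over K = 2, 3, ... of the integral of f(H) on I_K."""
    total = 0.0
    for K in range(2, terms + 2):
        total += (3 * K - 1) / (3.0 * K * K * (K - 1) ** 2)
    return total


mean_f = mean_lower_bound()
mean_F = 2.0 / 3.0            # integral of sqrt(H) over (0, 1)
mean_gap = mean_F - mean_f    # about 0.1184
```

The terms decay like 1/K³, so the truncated sum converges quickly to the stated mean.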

LEMMA 7. For H ∈ (0, 1), f(H) ≥ H, with equality if and only if H = K⁻¹ for some integer K.

Proof. This result follows from the fact that for H ∈ (0, 1), ⌈H⁻¹⌉ > 1, and ⌈H⁻¹⌉ – 1 < H⁻¹ ≤ ⌈H⁻¹⌉, with the equality occurring if and only if H = K⁻¹ for an integer K. ∎

PROPOSITION 8. Averaging across values of H ∈ (0, 1), (i) the mean of F(H) – H is 1/6; (ii) the mean of f(H) – H is π²/18 – 1/2.

Proof. By Theorem 1 and Lemma 7, F(H) > f(H) ≥ H on the interval (0, 1). The mean of F(H) – H [or f(H) – H] equals the mean of F(H) [or f(H)] minus the mean of H, which is 1/2. Consequently, using Proposition 6, (i) the mean of F(H) – H is 2/3 – 1/2 = 1/6, and (ii) the mean of f(H) – H is π²/18 – 1/2. ∎

PROPOSITION 9. On the interval [1/K, 1/(K – 1)), where K ≥ 2 is an integer, the maximal value of f(H) – H is 1/[4K(K – 1)], and it is achieved at H = (4K – 3)/[4K(K – 1)].

Proof. For H ∈ I_K, ⌈H⁻¹⌉ = K, and

$$f(H) - H = \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right) - H.$$
f(H) – H is continuous on the interval and differentiable except at the endpoints. Its only critical point on the interval is a maximum that occurs at ((4K – 3)/[4K(K – 1)], 1/[4K(K – 1)]). ∎
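Proposition 9 can be spot-checked numerically for small K. The sketch below evaluates f(H) – H on I_K at the stated critical point and at two nearby points; the closed form for f on I_K follows from the equality case of Theorem 2, and the function names and step size are illustrative assumptions.

```python
import math


def f_minus_H(H, K):
    """f(H) - H on the interval I_K, where ceil(1/H) = K."""
    return (1.0 + math.sqrt((K * H - 1.0) / (K - 1.0))) / K - H


def check_proposition_9(K):
    """Compare the stated maximum with neighboring values inside I_K."""
    stated_max = 1.0 / (4 * K * (K - 1))
    H_star = (4 * K - 3) / (4.0 * K * (K - 1))
    step = 0.1 * stated_max  # small enough to stay inside I_K
    peak = f_minus_H(H_star, K)
    left = f_minus_H(H_star - step, K)
    right = f_minus_H(H_star + step, K)
    return abs(peak - stated_max) < 1e-12 and peak > left and peak > right


all_ok = all(check_proposition_9(K) for K in range(2, 11))
```

For K = 2 the critical point is H = 5/8, where f(H) = 3/4 and the difference is 1/8.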

COROLLARY 10. On (0, 1), the maximal value of f(H) – H is 1/8, and it is achieved at H = 5/8.

Proof. Because (0, 1) = $\bigcup_{K=2}^{\infty} I_K$, f(H) – H has its maximum in I_K for some K, namely the K for which the maximal value of f(H) – H is greatest. By Proposition 9, the maximum of f(H) – H on I_K is 1/[4K(K – 1)]. As 1/[4K(K – 1)] decreases for K ≥ 2, the maximum of f(H) – H on (0, 1) occurs in I_2. Applying Proposition 9, this maximum is at (5/8, 1/8). ∎

PROPOSITION 11. On the interval [1/K, 1/(K – 1)], where K ≥ 2 is an integer,

i. For K ≥ 5, the maximal value of F(H) – f(H) is 1/√(K – 1) – 1/(K – 1), and it is achieved at H = 1/(K – 1). For K = 2, 3, 4, the maximal value of F(H) – f(H) is 1/√K – 1/K, and it is achieved at H = 1/K.
ii. The minimal value of F(H) – f(H) is

$$\beta(K) = \frac{1}{K}\left(\sqrt{\frac{K^2 - K - 1}{K - 1}} - 1\right), \tag{A10}$$
and it is achieved at H = (K – 1)/(K² – K – 1).

Proof. Define χ(H) = F(H) – f(H). For H ∈ I_K, ⌈H⁻¹⌉ = K, and

$$\chi(H) = \sqrt{H} - \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right).$$
To verify ii, note that the only critical point of χ(H) on [1/K, 1/(K – 1)] is a minimum that occurs at ((K – 1)/(K² – K – 1), β(K)).

To obtain i, note that because there is no maximum in the interior of [1/K, 1/(K – 1)], the maximum of χ(H) occurs at the endpoint of the interval that produces the larger value of χ(H). At H = 1/K, χ(H) = 1/√K – 1/K, and at H = 1/(K – 1), χ(H) = 1/√(K – 1) – 1/(K – 1). Define γ(H) = √H – H, and note that at points H = 1/K for integers K ≥ 1, γ(H) = χ(H). At the endpoints of [0, 1], γ(H) = 0, and on [0, 1], γ(H) has its maximum and only critical point at (1/4, 1/4). Consequently, for H, H' ∈ [0, 1], if H > H' ≥ 1/4, then γ(H) < γ(H'), whereas if 1/4 ≥ H > H', then γ(H) > γ(H'). Thus, for K = 2, 3, 4, γ(1/K) = χ(1/K) > χ(1/(K – 1)) = γ(1/(K – 1)), whereas for integers K ≥ 5, γ(1/(K – 1)) = χ(1/(K – 1)) > χ(1/K) = γ(1/K). ∎

PROPOSITION 12. On (0, 1), the highest local minimum of F(H) – f(H) is (√19 – 2)/10, and it occurs at H = 4/19.

Proof. By Proposition 11, the minimal difference for a given interval [1/K, 1/(K – 1)] is achieved at H = (K – 1)/(K² – K – 1) and is β(K). To find the integer K ≥ 2 where β(K) is greatest, we show that β(K) > β(K + 1) for K ≥ 6. It then follows that the largest value of β(K) occurs at the integer K ∈ [2, 6] that produces the highest value of β(K). This maximum occurs at K = 5, so that H = 4/19 and β(5) = (√19 – 2)/10.

The following chain of inequalities yields the result:

Formula A10
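The location of the highest local minimum can be checked by direct evaluation. The sketch below computes F(H) – f(H) at the critical point H = (K – 1)/(K² – K – 1) of each interval and finds the K with the largest value. The function name and the search range are illustrative assumptions, and the form of f on I_K is the one implied by the equality case of Theorem 2.

```python
import math


def local_minimum(K):
    """Critical point of F(H) - f(H) on [1/K, 1/(K-1)] and its value."""
    H = (K - 1) / (K * K - K - 1)
    f = (1.0 + math.sqrt((K * H - 1.0) / (K - 1.0))) / K
    return H, math.sqrt(H) - f


# Search a finite range of K; the values decrease steadily for large K.
best_K = max(range(2, 200), key=lambda K: local_minimum(K)[1])
best_H, best_value = local_minimum(best_K)
```

The values for K = 4, 5, and 6 are close to one another (roughly 0.229, 0.236, and 0.235), so the ordering is worth confirming numerically.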

COROLLARY 13. The maximal value of F(H) – H is 1/4, and it is achieved at H = 1/4.

Proof. This result was shown in the proof of Proposition 11 when it was found that γ(H) = √H – H has its maximum on [0, 1] at H = 1/4. ∎

COROLLARY 14. The maximal value of F(H) – f(H) is 1/4, and it is achieved at H = 1/4.

Proof. Because f(H) ≥ H, F(H) – f(H) ≤ F(H) – H. By Corollary 13, the maximum of F(H) – H occurs at H = 1/4 and is 1/4. Evaluating at H = 1/4, F(H) – f(H) achieves this same upper bound. ∎
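The global maxima in Corollaries 10, 13, and 14 can be located with a simple grid search. This is a sketch under stated assumptions: the grid spacing of 1/8000 is chosen so that 1/4 and 5/8 are grid points, and F and f are written in the closed forms implied by Theorems 1 and 2.

```python
import math


def F(H):
    """Upper bound on M given H."""
    return math.sqrt(H)


def f(H):
    """Lower bound on M given H."""
    K = max(2, math.ceil(1.0 / H))
    inner = max(0.0, (K * H - 1.0) / (K - 1.0))  # guard against round-off
    return (1.0 + math.sqrt(inner)) / K


grid = [i / 8000 for i in range(1, 8000)]
H_cor10 = max(grid, key=lambda H: f(H) - H)     # expected 5/8
H_cor13 = max(grid, key=lambda H: F(H) - H)     # expected 1/4
H_cor14 = max(grid, key=lambda H: F(H) - f(H))  # expected 1/4
```

A grid search only locates maxima up to the grid resolution, so it supports the corollaries without replacing the proofs.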

PROPOSITION 15. The difference F(H) – f(H) is (i) greater than f(H) – H if H < 4 – 2√3, (ii) equal to f(H) – H if H = 4 – 2√3, and (iii) less than f(H) – H if H > 4 – 2√3.

Proof. Consider H ∈ [1/K, 1/(K – 1)], for K ≥ 3. On this interval, by Proposition 11, the minimum of F(H) – f(H) is β(K), and by Proposition 9, the maximum of f(H) – H is µ(K) = 1/[4K(K – 1)]. The following inequalities yield F(H) – f(H) > f(H) – H for H < 1/2:

Formula A11(A11)
For H ∈ [1/2, 1), [F(H) – f(H)] – [f(H) – H] = √H + H – 1 – √(2H – 1), which for H ∈ [1/2, 1) can be shown to fall on the same side of zero as H² – 8H + 4. The only root of H² – 8H + 4 = 0 for H ∈ [1/2, 1) is 4 – 2√3 ≈ 0.5359, at which the sign of H² – 8H + 4 switches from positive to negative. ∎
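The crossing point in Proposition 15 can be verified numerically. The sketch below evaluates [F(H) – f(H)] – [f(H) – H] = F(H) + H – 2f(H), using the closed form for f implied by Theorem 2; the function name and the test points are illustrative assumptions.

```python
import math


def gap_difference(H):
    """[F(H) - f(H)] - [f(H) - H], which equals F(H) + H - 2 f(H)."""
    K = max(2, math.ceil(1.0 / H))
    inner = max(0.0, (K * H - 1.0) / (K - 1.0))  # guard against round-off
    f = (1.0 + math.sqrt(inner)) / K
    return math.sqrt(H) + H - 2.0 * f


H_cross = 4.0 - 2.0 * math.sqrt(3.0)  # root of H^2 - 8H + 4 in [1/2, 1)
value_at_cross = gap_difference(H_cross)
positive_below_half = all(gap_difference(i / 1000) > 0 for i in range(1, 500))
```

At H = 4 – 2√3, note that √H = √3 – 1 and √(2H – 1) = 2 – √3, so the four terms cancel exactly.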

PROPOSITION 16.

i. The fraction of homozygosity due to homozygotes for the most frequent allele is greater than or equal to

$$\frac{f(H)^2}{H} = \frac{1}{H \lceil H^{-1} \rceil^2}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right)^2,$$
with equality if and only if K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉, p_1 = p_2 = ... = p_{K–1} = M, and p_K = 1 – (K – 1)M.

ii. The fraction of homozygosity due to homozygotes for the most frequent allele is greater than or equal to H, equality requiring H = K⁻¹ for some integer K ≥ 2, and p_1 = p_2 = ... = p_K = H.
iii. The lower bound on the fraction of homozygosity due to homozygotes for the most frequent allele lies in [1/K, 1/(K – 1)), where K = ⌈H⁻¹⌉.
iv. The lower bound on the fraction of homozygosity due to homozygotes for the most frequent allele is monotonically increasing with H on the interval (0, 1).

Proof. The fraction of homozygosity due to homozygotes for the most frequent allele is $M^2/H$, so that i follows directly from Theorem 1ii.

ii. That $M^2/H \geq f(H)^2/H \geq H^2/H$ follows directly from Theorem 1ii and Lemma 7, with equality under the same conditions as specified by these results.
iii. That $M^2/H \geq K^{-1}$ for $H \in I_K$ follows trivially from ii. Note that $f(H)^2/H < 1/(K-1)$ is equivalent to $[(K-1)H - 1]^2 > 0$, which is true except if $H = 1/(K-1)$.
iv. Denote the lower bound in i by $\sigma(H)$. The function $\sigma$ is continuous on $(0, 1)$, and at $H = K^{-1}$ for integers $K \geq 2$, $\sigma(H) = K^{-1}$. To show that $\sigma$ is monotonic on $(0, 1)$ all that must be shown is that it is monotonic for $H \in I_K$. On this interval, $\lceil H^{-1} \rceil = K$, and the derivative of $\sigma$ is

$$\sigma'(H) = \frac{1}{K^2(K-1)H^2}\left[\frac{(2-KH)\sqrt{K-1}}{\sqrt{KH-1}} - (K-2)\right].$$
To show that the term inside the brackets is positive for $H \in I_K$, we can begin with the inequality $(K-1)H^2 - KH + 1 > 0$, which holds for $H \in I_K$, as the leading coefficient is positive and the roots are located at $1/(K-1)$ and $1$. Multiplying by $K^2$ and adding identical terms to both sides, we have $(K-1)(K^2H^2 - 4KH + 4) > (K^2 - 4K + 4)(KH - 1)$. Noting that $K \geq 2$ and that for $H \in I_K$, $2 - KH > 0$, the square root of both sides can be taken to obtain $\sqrt{K-1}\,(2 - KH) > (K-2)\sqrt{KH-1}$. $\blacksquare$
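The properties in Proposition 16 can be illustrated on a grid. The sketch below takes the lower bound on the fraction of homozygosity to be $\sigma(H) = f(H)^2/H$, with $f(H) = (1/K)[1 + \sqrt{(KH-1)/(K-1)}]$ and $K = \lceil H^{-1} \rceil$ assumed as the form of the lower bound on $M$; the grid spacing and tolerance are arbitrary.

```python
import math

def f(H):
    # Assumed lower bound on the frequency of the most frequent allele.
    K = max(2, math.ceil(1 / H))
    return (1 + math.sqrt(max(0.0, (K * H - 1) / (K - 1)))) / K

def sigma(H):
    # Lower bound on the fraction of homozygosity due to the most frequent allele.
    return f(H) ** 2 / H

grid = [i / 2000 for i in range(1, 2000)]
values = [sigma(H) for H in grid]

# Part iv: sigma should not decrease along the grid (small floating-point tolerance).
is_monotone = all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
# Part ii: sigma(H) >= H.
above_H = all(s >= H - 1e-12 for H, s in zip(grid, values))

print("monotone on grid:", is_monotone)
print("sigma(H) >= H on grid:", above_H)
print("sigma(1/4) =", sigma(0.25), " sigma(0.4) =", sigma(0.4))
```

For $H = 0.4$, $K = 3$, so by part iii the bound should fall in $[\frac{1}{3}, \frac{1}{2})$.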

Homozygosity in terms of the frequency of the most frequent allele:

Many of the results in this section follow from those in the previous section, using the fact that the lower and upper bounds g and G for homozygosity are the respective inverse functions of the upper and lower bounds F and f for the frequency of the most frequent allele.

PROPOSITION 17. Averaging across values of $M \in (0, 1)$, (i) the mean of $G(M)$ is $1 - \pi^2/18$; (ii) the mean of $g(M)$ is $\frac{1}{3}$; (iii) the mean of $G(M) - g(M)$ is $\frac{2}{3} - \pi^2/18$.

Proof.

iii. Because $G$ and $g$ are the inverse functions of $f$ and $F$ by Lemma 5, and because on $(0, 1)$, $G > g$ and $F > f$, the area between $G$ and $g$ equals the area between $F$ and $f$. By Proposition 6, this area is $\frac{2}{3} - \pi^2/18$.
ii. The mean of $g(M)$ is $\int_0^1 M^2\,dM = \frac{1}{3}$.
i. That the mean of $G(M)$ is $1 - \pi^2/18$ follows directly from ii and iii. $\blacksquare$
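The three means in Proposition 17 can be approximated by numerical integration. The sketch below uses a midpoint rule and assumes $g(M) = M^2$ and $G(M) = K(K-1)M^2 - 2(K-1)M + 1$ with $K = \lceil M^{-1} \rceil$; the number of subintervals `N` is an arbitrary choice.

```python
import math

def g(M):
    # Assumed lower bound on homozygosity given M.
    return M * M

def G(M):
    # Assumed upper bound on homozygosity given M, with K = ceil(1/M), floored at 2.
    K = max(2, math.ceil(1 / M))
    return K * (K - 1) * M * M - 2 * (K - 1) * M + 1

N = 100_000
mids = [(i + 0.5) / N for i in range(N)]  # midpoints of N equal subintervals of (0, 1)

mean_G = sum(G(M) for M in mids) / N
mean_g = sum(g(M) for M in mids) / N

print("mean G     :", mean_G, " vs 1 - pi^2/18  :", 1 - math.pi**2 / 18)
print("mean g     :", mean_g, " vs 1/3          :", 1 / 3)
print("mean G - g :", mean_G - mean_g, " vs 2/3 - pi^2/18:", 2 / 3 - math.pi**2 / 18)
```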

LEMMA 18. For $M \in (0, 1)$, $G(M) \leq M$, with equality if and only if $M = K^{-1}$ for some integer $K$.

Proof. This result follows directly from Lemma 7 and the inverse relationship of $G$ and $f$ in Lemma 5. $\blacksquare$

PROPOSITION 19. Averaging across values of $M \in (0, 1)$, (i) the mean of $M - G(M)$ is $\pi^2/18 - \frac{1}{2}$; (ii) the mean of $M - g(M)$ is $\frac{1}{6}$.

Proof. From the inverse relationship between $G$ and $f$ (Lemma 5), the area between $M$ and $G(M)$ equals the area between $f(H)$ and $H$, or $\pi^2/18 - \frac{1}{2}$ (Proposition 8ii), and from the inverse relationship between $g$ and $F$, the area between $M$ and $g(M)$ equals the area between $F(H)$ and $H$, or $\frac{1}{6}$ (Proposition 8i). $\blacksquare$

PROPOSITION 20. On the interval $[1/K, 1/(K-1))$, where $K \geq 2$ is an integer, the maximal value of $M - G(M)$ is $1/[4K(K-1)]$, and it is achieved at $M = (2K-1)/[2K(K-1)]$.

Proof. For $M \in I_K$, $\lceil M^{-1} \rceil = K$, and $M - G(M) = M - [K(K-1)M^2 - 2(K-1)M + 1]$. $M - G(M)$ is continuous on the interval and differentiable except at the endpoints. Its only critical point on the interval is a maximum that occurs at $((2K-1)/[2K(K-1)],\ 1/[4K(K-1)])$. $\blacksquare$

COROLLARY 21. On $(0, 1)$, the maximal value of $M - G(M)$ is $\frac{1}{8}$, and it is achieved at $M = \frac{3}{4}$.

Proof. Because $(0, 1) = \bigcup_{K=2}^{\infty} I_K$, $M - G(M)$ has its maximum in $I_K$ for some $K$; in particular, for the $K$ for which the maximal value of $M - G(M)$ is greatest. By Proposition 20, the maximum of $M - G(M)$ on $I_K$ is $1/[4K(K-1)]$. As $1/[4K(K-1)]$ decreases for $K \geq 2$, the maximum of $M - G(M)$ on $(0, 1)$ occurs in $I_2$. Applying Proposition 20, this maximum is at $(\frac{3}{4}, \frac{1}{8})$. $\blacksquare$
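Proposition 20 and Corollary 21 lend themselves to a direct numerical check. The sketch assumes $G(M) = K(K-1)M^2 - 2(K-1)M + 1$ with $K = \lceil M^{-1} \rceil$, compares the value of $M - G(M)$ at the midpoint of each interval with $1/[4K(K-1)]$, and then searches a grid on $(0, 1)$ for the overall maximum. The helper names are illustrative.

```python
import math

def G(M):
    # Assumed upper bound on homozygosity given M, with K = ceil(1/M), floored at 2.
    K = max(2, math.ceil(1 / M))
    return K * (K - 1) * M * M - 2 * (K - 1) * M + 1

def interval_max(K):
    # Candidate maximizer of M - G(M) on [1/K, 1/(K-1)) and the value attained there.
    M = (2 * K - 1) / (2 * K * (K - 1))
    return M, M - G(M)

for K in range(2, 8):
    M, value = interval_max(K)
    print(f"K={K}  M*={M:.5f}  M-G={value:.6f}  1/[4K(K-1)]={1 / (4 * K * (K - 1)):.6f}")

# Grid on (0, 1) that contains M = 3/4 exactly.
grid = [i / 1000 for i in range(1, 1000)]
M_best = max(grid, key=lambda M: M - G(M))
print("grid argmax:", M_best, "value:", M_best - G(M_best))
```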

PROPOSITION 22. On the interval [1/K, 1/(K – 1)], where K ≥ 2 is an integer,

i. For $K \geq 3$, the maximal value of $G(M) - g(M)$ is $(K-2)/(K-1)^2$, and it is achieved at $M = 1/(K-1)$. For $K = 2$, the maximal value of $G(M) - g(M)$ is $\frac{1}{4}$, and it is achieved at $M = \frac{1}{2}$.
ii. The minimal value of $G(M) - g(M)$ is $\rho(K) = (K-2)/(K^2-K-1)$, and it is achieved at $M = (K-1)/(K^2-K-1)$.

Proof. Define $\xi(M) = G(M) - g(M)$. For $M \in I_K$, $\lceil M^{-1} \rceil = K$, and $\xi(M) = (K^2-K-1)M^2 - 2(K-1)M + 1$. To verify ii, note that the only critical point of $\xi(M)$ on $[1/K, 1/(K-1)]$ is a minimum that occurs at $((K-1)/(K^2-K-1),\ \rho(K))$.

To obtain i, note that because there is no maximum in the interior of $[1/K, 1/(K-1)]$, the maximum of $\xi(M)$ occurs at the endpoint that produces the larger value of $\xi(M)$. At $M = 1/K$, $\xi(M) = 1/K - 1/K^2$, and at $M = 1/(K-1)$, $\xi(M) = 1/(K-1) - 1/(K-1)^2$. Define $\delta(M) = M - M^2$, and note that at points $M = 1/K$ for integers $K \geq 1$, $\delta(M) = \xi(M)$. At the endpoints of $[0, 1]$, $\delta(M) = 0$, and on $[0, 1]$, $\delta(M)$ has its maximum and only critical point at $(\frac{1}{2}, \frac{1}{4})$. Consequently, for $M, M' \in [0, 1]$, if $M > M' \geq \frac{1}{2}$, then $\delta(M) < \delta(M')$, whereas if $\frac{1}{2} \geq M > M'$, then $\delta(M) > \delta(M')$. Thus, for $K = 2$, $\delta(1/K) = \xi(1/K) > \xi(1/(K-1)) = \delta(1/(K-1))$, whereas for integers $K \geq 3$, $\delta(1/(K-1)) = \xi(1/(K-1)) > \xi(1/K) = \delta(1/K)$. $\blacksquare$

PROPOSITION 23. On $(0, 1)$, the highest local minimum of $G(M) - g(M)$ is $\frac{1}{5}$, and it occurs at $M = \frac{2}{5}$.

Proof. By Proposition 22, the minimal difference for a given interval $[1/K, 1/(K-1)]$ is achieved at $M = (K-1)/(K^2-K-1)$ and is $\rho(K)$. To find the integer $K \geq 2$ where $\rho(K)$ is greatest, note that

$$\rho(K) - \rho(K+1) = \frac{K^2 - 3K + 1}{(K^2-K-1)(K^2+K-1)}. \tag{A12}$$
As a result, $\rho(K) - \rho(K+1) > 0$ for $K \geq 3$. It follows that $\rho(K)$ is largest at the integer $K \in [2, 3]$ that produces the highest value of $\rho(K)$. This maximum occurs at $K = 3$, so that $M = \frac{2}{5}$ and $\rho(K) = \frac{1}{5}$. $\blacksquare$
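Because $\rho(K)$ is rational, Proposition 23 can be checked in exact arithmetic. The sketch below uses Python's `fractions.Fraction` and the expressions $\rho(K) = (K-2)/(K^2-K-1)$ and $\xi(M) = (K^2-K-1)M^2 - 2(K-1)M + 1$ from Proposition 22. The closed form used for $\rho(K) - \rho(K+1)$ in the final check was derived for this illustration and should be treated as an assumption about the form of the omitted display.

```python
from fractions import Fraction

def rho(K):
    # Minimum of G(M) - g(M) on [1/K, 1/(K-1)], per Proposition 22ii.
    return Fraction(K - 2, K * K - K - 1)

def xi(M, K):
    # G(M) - g(M) on the interval indexed by K.
    return (K * K - K - 1) * M * M - 2 * (K - 1) * M + 1

rhos = {K: rho(K) for K in range(2, 11)}
best_K = max(rhos, key=rhos.get)
M_star = Fraction(best_K - 1, best_K**2 - best_K - 1)

for K, r in rhos.items():
    print(f"K={K:2d}  rho={r}")
print("largest rho at K =", best_K, " M =", M_star, " rho =", rhos[best_K])

# Assumed closed form for the successive difference rho(K) - rho(K+1).
diffs_match = all(
    rho(K) - rho(K + 1) == Fraction(K * K - 3 * K + 1, (K * K - K - 1) * (K * K + K - 1))
    for K in range(2, 10)
)
print("difference formula matches:", diffs_match)
```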

COROLLARY 24. The maximal value of $M - g(M)$ is $\frac{1}{4}$, and it is achieved at $M = \frac{1}{2}$.

Proof. This result was shown in the proof of Proposition 22 when it was found that $\delta(M) = M - M^2$ has its maximum on $[0, 1]$ at $M = \frac{1}{2}$. $\blacksquare$

COROLLARY 25. The maximal value of $G(M) - g(M)$ is $\frac{1}{4}$, and it is achieved at $M = \frac{1}{2}$.

Proof. Because $M \geq G(M)$, $G(M) - g(M) \leq M - g(M)$. By Corollary 24, the maximum of $M - g(M)$ occurs at $M = \frac{1}{2}$ and is $\frac{1}{4}$. Evaluating at $M = \frac{1}{2}$, where $G(M) = M$, $G(M) - g(M)$ achieves this same upper bound. $\blacksquare$

PROPOSITION 26. The difference $G(M) - g(M)$ is (i) greater than $M - G(M)$ if $0 < M < \frac{2}{3}$, (ii) equal to $M - G(M)$ if $M = \frac{2}{3}$, and (iii) less than $M - G(M)$ if $\frac{2}{3} < M < 1$.

Proof. Consider $M \in [1/K, 1/(K-1)]$, for $K \geq 3$. On this interval, by Proposition 22, the minimum of $G(M) - g(M)$ is $\rho(K)$, and by Proposition 20, the maximum of $M - G(M)$ is $\mu(K) = 1/[4K(K-1)]$. The quantity $\rho(K) - \mu(K)$ can be simplified to

$$\rho(K) - \mu(K) = \frac{4K^3 - 13K^2 + 9K + 1}{4K(K-1)(K^2-K-1)},$$
which is clearly positive for $K > \frac{13}{4}$, and which is also positive for $K = 3$. As a result, $G(M) - g(M) > M - G(M)$ for intervals $I_K$ with $K \geq 3$, that is, for $0 < M < \frac{1}{2}$.

For $M \in [\frac{1}{2}, 1)$, $[G(M) - g(M)] - [M - G(M)] = 3M^2 - 5M + 2$. The only root of $3M^2 - 5M + 2 = 0$ for $M \in [\frac{1}{2}, 1)$ is $M = \frac{2}{3}$, at which the sign of $3M^2 - 5M + 2$ switches from positive to negative. $\blacksquare$
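The sign pattern in Proposition 26 can be confirmed exactly at rational values of $M$. The sketch below assumes $g(M) = M^2$ and $G(M) = K(K-1)M^2 - 2(K-1)M + 1$ with $K = \lceil M^{-1} \rceil$, and evaluates $[G(M) - g(M)] - [M - G(M)]$ using `fractions.Fraction`; the sample points are arbitrary.

```python
import math
from fractions import Fraction

def g(M):
    # Assumed lower bound on homozygosity given M.
    return M * M

def G(M):
    # Assumed upper bound on homozygosity given M, with K = ceil(1/M), floored at 2.
    K = max(2, math.ceil(1 / M))
    return K * (K - 1) * M * M - 2 * (K - 1) * M + 1

def gap(M):
    # [G(M) - g(M)] - [M - G(M)]
    return (G(M) - g(M)) - (M - G(M))

for M in (Fraction(1, 4), Fraction(2, 5), Fraction(3, 5), Fraction(2, 3), Fraction(4, 5)):
    print(f"M={M}  gap={gap(M)}")
```

Exact rational arithmetic avoids any ambiguity about the sign of the difference near $M = \frac{2}{3}$.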


ACKNOWLEDGEMENTS
We thank S. Boca for numerous discussions of this work. Grant support was provided by a University of Michigan Center for Genetics in Health and Medicine postdoctoral fellowship, by National Institutes of Health grant R01 GM081441, by an Alfred P. Sloan Research Fellowship, and by a Burroughs Wellcome Fund Career Award in the Biomedical Sciences.



Communicating editor: A. D. LONG

