# The Relationship Between *F*_{ST} and the Frequency of the Most Frequent Allele

^{*}Department of Evolutionary Biology and Science for Life Laboratory, Uppsala University, SE-752 36, Uppsala, Sweden

^{†}Department of Biology, Stanford University, Stanford, California

^{1}Corresponding author: Uppsala University, Norbyvägen 18D, SE-752 36, Uppsala, Sweden. E-mail: mattias.jakobsson{at}ebc.uu.se

## Abstract

*F*_{ST} is frequently used as a summary of genetic differentiation among groups. It has been suggested that *F*_{ST} depends on the allele frequencies at a locus, as it exhibits a variety of peculiar properties related to genetic diversity: higher values for biallelic single-nucleotide polymorphisms (SNPs) than for multiallelic microsatellites, low values among high-diversity populations viewed as substantially distinct, and low values for populations that differ primarily in their profiles of rare alleles. A full mathematical understanding of the dependence of *F*_{ST} on allele frequencies, however, has been elusive. Here, we examine the relationship between *F*_{ST} and the frequency of the most frequent allele, demonstrating that the range of values that *F*_{ST} can take is restricted considerably by the allele-frequency distribution. For a two-population model, we derive strict bounds on *F*_{ST} as a function of the frequency *M* of the allele with highest mean frequency between the pair of populations. Using these bounds, we show that for a value of *M* chosen uniformly between 0 and 1 at a multiallelic locus whose number of alleles is left unspecified, the mean maximum *F*_{ST} is ∼0.3585. Further, *F*_{ST} is restricted to values much less than 1 when *M* is low or high, and the contribution to the maximum *F*_{ST} made by the most frequent allele is on average ∼0.4485. Using bounds on homozygosity that we have previously derived as functions of *M*, we describe strict bounds on *F*_{ST} in terms of the homozygosity of the total population, finding that the mean maximum *F*_{ST} given this homozygosity is 1 − ln 2 ≈ 0.3069. Our results provide a conceptual basis for understanding the dependence of *F*_{ST} on allele frequencies and genetic diversity and for interpreting the roles of these quantities in computations of *F*_{ST} from population-genetic data. 
Further, our analysis suggests that many unusual observations of *F*_{ST}, including the relatively low *F*_{ST} values in high-diversity human populations from Africa and the relatively low estimates of *F*_{ST} for microsatellites compared to SNPs, can be understood not as biological phenomena associated with different groups of populations or classes of markers but rather as consequences of the intrinsic mathematical dependence of *F*_{ST} on the properties of allele-frequency distributions.

Differentiation among groups is one of the fundamental subjects of the field of population genetics. Comparisons of the level of variation among subpopulations with the level of variation in the total population have been employed frequently in population-genetic theory, in statistical methods for data analysis, and in empirical studies of distributions of genetic variation. Wright’s (1951) fixation indices, and *F*_{ST} in particular, have been central to this effort.

Wright’s *F*_{ST} was originally defined as the correlation between two randomly sampled gametes from the same subpopulation when the correlation of two randomly sampled gametes from the total population is set to zero. Several definitions of *F*_{ST} or *F*_{ST}-like quantities are now available, relying on a variety of different conceptual formulations but all measuring some aspect of population differentiation (*e.g.*, Charlesworth 1998; Holsinger and Weir 2009). Many authors have claimed that one or another formulation of *F*_{ST} is affected by levels of genetic diversity or by allele frequencies, either because the range of *F*_{ST} is restricted by these quantities or because these quantities affect the degree to which *F*_{ST} reflects population differentiation (*e.g.*, Charlesworth 1998; Nagylaki 1998; Hedrick 1999, 2005; Long and Kittles 2003; Jost 2008; Ryman and Leimar 2008; Long 2009; Meirmans and Hedrick 2011). For example, Nagylaki (1998) and Hedrick (1999) argued that measures of *F*_{ST} may be poor measures of genetic differentiation when the level of diversity is high. Charlesworth (1998) suggested that *F*_{ST} can be inflated when diversity is low, arguing that *F*_{ST} might not be appropriate for comparing loci with substantially different levels of variation. In a provocative recent article, Jost (2008) used the diversity dependence of forms of *F*_{ST} to question their utility as differentiation measures at all.

One definition that is convenient for mathematical assessment of the relationship of an *F*_{ST}-like quantity and allele frequencies is the quantity labeled *G*_{ST} by Nei (1973), which for a given locus measures the difference between the heterozygosity of the total (pooled) population, *h*_{T}, and the mean heterozygosity across subpopulations, *h*_{S}, divided by the heterozygosity of the total population:

$$G_{ST} = \frac{h_T - h_S}{h_T}. \tag{1}$$

Using the homozygosity of the total population, *H*_{T} = 1 − *h*_{T}, and the mean homozygosity across subpopulations, *H*_{S} = 1 − *h*_{S}, we can write

$$G_{ST} = \frac{H_S - H_T}{1 - H_T}. \tag{2}$$

Here *H*_{S} ≥ *H*_{T} and, therefore, because *H*_{S} ≤ 1 and for a polymorphic locus with finitely many alleles, 0 < *H*_{T} < 1, *G*_{ST} lies in the interval [0,1].

Using *G*_{ST} for their definition of *F*_{ST}, Hedrick (1999, 2005) and Long and Kittles (2003) pointed out that because *h*_{T} < 1, *F*_{ST} cannot exceed the mean homozygosity across subpopulations, *H*_{S}:

$$F_{ST} \le H_S. \tag{3}$$

A stronger bound is attained in a set of *K* equal-sized subpopulations, in which each allele is private to a single subpopulation. In the limit as *K* → ∞, this stronger upper bound on *F*_{ST} as a function of *H*_{S} and *K* reduces to Equation 3 (see also Jin and Chakraborty 1995 and Long and Kittles 2003).

While Hedrick (1999, 2005) and Long and Kittles (2003) have clarified the relationship between *F*_{ST} and the mean homozygosity *H*_{S} across subpopulations, their approaches do not easily illuminate the connection between *F*_{ST} and allele frequencies themselves. A formal understanding of the relationship between *F*_{ST} and allele frequencies would make it possible to more fully understand the behavior of *F*_{ST} in situations where markers of interest differ substantially in allele frequencies or levels of genetic diversity. Our recent work on the relationship between homozygosity and the frequency of the most frequent allele (Rosenberg and Jakobsson 2008; Reddy and Rosenberg 2012) provides a mathematical approach for formal investigation of bounds on population-genetic statistics in terms of allele frequencies. In this article, we therefore seek to thoroughly examine the dependence of *F*_{ST} on allele frequencies by investigating the upper bound on *F*_{ST} in terms of the frequency *M* of the most frequent allele across a pair of populations. We derive bounds on *F*_{ST} given the frequency of the most frequent allele and bounds on the frequency of the most frequent allele given *F*_{ST}. We consider loci with arbitrarily many alleles in a pair of subpopulations. Using theory for the bounds on homozygosity given the frequency of the most frequent allele, we obtain strict bounds on *F*_{ST} given the homozygosity of the total population. Our analysis clarifies the relationships among *F*_{ST}, allele frequencies, and homozygosity, providing explanations for peculiar observations of *F*_{ST} that can be attributed to allele-frequency dependence.

## Model

We examine a polymorphic locus with at least two alleles in a setting with *K* subpopulations that contribute equally to a total population. Denote the number of distinct alleles by *I*, the frequency of allele *i* in population *k* by *p*_{*ki*}, and the mean frequency of allele *i* across populations by

$$\bar{p}_i = \frac{1}{K}\sum_{k=1}^{K} p_{ki}.$$

We consider *F*_{ST} formulated as a property of nonnegative numbers between 0 and 1 such that within populations, the allele frequencies sum to 1 ($\sum_{i=1}^{I} p_{ki} = 1$ for each *k*). This formulation is the same as the formulation of Nei’s *G*_{ST}, which we hereafter denote by *F*. We have (Nei 1973)

$$F = \frac{H_S - H_T}{1 - H_T},$$

where $H_S = \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{I} p_{ki}^2$ and $H_T = \sum_{i=1}^{I} \bar{p}_i^2$. Because the locus is polymorphic, *H*_{T} < 1. The assumption that *I*, the number of distinct alleles at the locus, is finite guarantees that *H*_{T} > 0 (and hence, *H*_{S} > 0 because *H*_{S} ≥ *H*_{T}). Thus, 0 < *H*_{T} < 1 and 0 < *H*_{S} ≤ 1.

We assume that all allele frequencies are the parametric allele frequencies of the population under consideration. Thus, the frequency of an allele is the probability of drawing the allele from the parametric frequency distribution; homozygosity is then the probability that two independent random draws carry the same allelic type, and heterozygosity is the probability that two independent random draws carry different allelic types. We emphasize that in our formulation, *F*, *H*_{T}, and *H*_{S} are functions of the parametric allele frequencies, and our interest is in the properties of these functions and their relationships with the allele frequencies; we do not investigate their estimation from data, nor do we consider how evolutionary models affect the underlying allele frequencies involved in their computation.

We focus on the case of two subpopulations (*K* = 2). In this case, the allele frequencies are denoted *p*_{1*i*} for population 1 and *p*_{2*i*} for population 2. For each *i* from 1 to *I*, let *σ*_{*i*} = *p*_{1*i*} + *p*_{2*i*} be the sum across populations of the frequency of allele *i*. Each *σ*_{*i*} lies in (0, 2), and the number of alleles *I* counts only those alleles with *σ*_{*i*} > 0. We label the alleles so that *σ*_{1} ≥ *σ*_{2} ≥ … ≥ *σ*_{*I*}. We denote the frequency of the most frequent allele in the total pooled population by *M* = *σ*_{1}/2, and we find it convenient to express some results in terms of *σ*_{1} and others in terms of *M*. Because each *σ*_{*i*} is positive and the *σ*_{*i*} sum to 2, we have 1/*I* ≤ *M* < 1.

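Let *δ*_{*i*} = |*p*_{1*i*} − *p*_{2*i*}| be the absolute difference between *p*_{1*i*} and *p*_{2*i*}. We can write the homozygosity of the total population as

$$H_T = \sum_{i=1}^{I}\left(\frac{\sigma_i}{2}\right)^2 = \frac{1}{4}\sum_{i=1}^{I}\sigma_i^2$$

and the mean homozygosity across subpopulations as

$$H_S = \frac{1}{2}\sum_{i=1}^{I}\left(p_{1i}^2 + p_{2i}^2\right) = \frac{1}{4}\sum_{i=1}^{I}\left(\sigma_i^2 + \delta_i^2\right),$$

so that

$$F = \frac{\sum_{i=1}^{I}\delta_i^2}{4 - \sum_{i=1}^{I}\sigma_i^2}. \tag{4}$$

Thus, *F* can be computed solely using the allele-frequency sums and differences between the two populations.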
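As a quick numerical check of the sum-and-difference form, the following minimal Python sketch computes *F* for two populations in two ways: from the homozygosities *H*_{S} and *H*_{T}, and from the allele-frequency sums *σ*_{*i*} and differences *δ*_{*i*}, using *F* = Σ*δ*_{*i*}^{2}/(4 − Σ*σ*_{*i*}^{2}). The function names and the example frequencies are illustrative assumptions and are not taken from the data analyzed in this article.

```python
def fst_from_homozygosity(p1, p2):
    """F = (H_S - H_T) / (1 - H_T) for two equally weighted populations."""
    h_s = 0.5 * (sum(x * x for x in p1) + sum(y * y for y in p2))
    h_t = sum(((x + y) / 2) ** 2 for x, y in zip(p1, p2))
    return (h_s - h_t) / (1 - h_t)


def fst_from_sums_and_differences(p1, p2):
    """F = sum(delta_i^2) / (4 - sum(sigma_i^2)),
    with sigma_i = p1i + p2i and delta_i = |p1i - p2i|."""
    sigma_sq = sum((x + y) ** 2 for x, y in zip(p1, p2))
    delta_sq = sum((x - y) ** 2 for x, y in zip(p1, p2))
    return delta_sq / (4 - sigma_sq)


# Hypothetical allele frequencies at a three-allele locus (made up for illustration).
pop1 = [0.7, 0.2, 0.1]
pop2 = [0.3, 0.3, 0.4]

f_a = fst_from_homozygosity(pop1, pop2)
f_b = fst_from_sums_and_differences(pop1, pop2)
print(f_a, f_b)  # both are approximately 0.104
```

The two functions agree because *p*_{1*i*}^{2} + *p*_{2*i*}^{2} = (*σ*_{*i*}^{2} + *δ*_{*i*}^{2})/2 for every allele.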

## Bounds on *F*

Our goal is to study the relationship between *F* and *M* in the general case of *I* alleles in two populations. For convenience, we write *F* as a function of *σ*_{1}, keeping in mind that *σ*_{1}/2 = *M*, and we begin by considering the special case in which *I* = 2.

### Bounds on F for two alleles

This case has two alleles, with frequencies *p*_{11} and *p*_{12} in population 1, and *p*_{21} and *p*_{22} in population 2 (Table 1). The frequency of the second allele is *p*_{12} = 1 − *p*_{11} in population 1 and *p*_{22} = 1 − *p*_{21} in population 2. Using Equation 4, we have a simple expression for *F* (Weir 1996; Rosenberg *et al.* 2003):

$$F = \frac{\delta_1^2}{\sigma_1(2-\sigma_1)}. \tag{5}$$

We now bound *F* in terms of the frequency of the most frequent allele *M* = *σ*_{1}/2. Because the alleles are arranged to satisfy *σ*_{1} ≥ *σ*_{2} and because *σ*_{1} + *σ*_{2} = 2, *σ*_{1} must lie in [1, 2). For the lower bound on *F* as a function of *σ*_{1}, we note that if allele 1 has the same frequency in both populations, then *p*_{11} = *p*_{21} = *σ*_{1}/2. The frequency of allele 2 will also be the same in the two populations, *p*_{12} = *p*_{22} = 1 − *σ*_{1}/2, and *δ*_{1} and *δ*_{2} will both equal zero. For these allele frequencies, we see that *H*_{S} = *H*_{T}, and it is clear from Equation 5 that *F*(*σ*_{1}) ≥ 0 for all values of *σ*_{1} in [1, 2), with equality if and only if *p*_{11} = *p*_{21} = *σ*_{1}/2.

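For the upper bound, we first note that because *δ*_{1} = 2*p*_{11} − *σ*_{1} when *p*_{11} ≥ *p*_{21} and *δ*_{1} = 2*p*_{21} − *σ*_{1} when *p*_{21} ≥ *p*_{11},

$$\delta_1 \le 2 - \sigma_1, \tag{6}$$

with equality if and only if *p*_{11} = 1 or *p*_{21} = 1. Using Equations 5 and 6, we have

$$F \le \frac{(2-\sigma_1)^2}{\sigma_1(2-\sigma_1)} = \frac{2-\sigma_1}{\sigma_1}. \tag{7}$$

The maximum of *F* as a function of *σ*_{1} is achieved when the allele frequencies of the two populations differ as much as possible, that is, when (*p*_{11}, *p*_{21}) = (1, *σ*_{1} − 1) or (*p*_{11}, *p*_{21}) = (*σ*_{1} − 1, 1). The bounds on *F* are

$$0 \le F \le \frac{2-\sigma_1}{\sigma_1}.$$

In terms of *M* = *σ*_{1}/2, the upper bound is (1 − *M*)/*M*, which decreases monotonically from 1 at *M* = 1/2 to 0 as *M* approaches 1.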
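The two-allele bound can be explored numerically. The sketch below, whose function names are hypothetical, fixes *σ*_{1}, scans the admissible values of *p*_{11}, and compares the largest *F* found with (2 − *σ*_{1})/*σ*_{1}. It assumes 1 ≤ *σ*_{1} < 2, so that allele 1 is the most frequent allele.

```python
def fst_two_alleles(p11, p21):
    """Two-allele F: delta_1^2 / (sigma_1 * (2 - sigma_1))."""
    sigma1 = p11 + p21
    delta1 = abs(p11 - p21)
    return delta1 ** 2 / (sigma1 * (2 - sigma1))


def scan_max_fst(sigma1, steps=1000):
    """Largest F over a grid of (p11, p21) pairs with p11 + p21 = sigma1.

    Assumes 1 <= sigma1 < 2, so p11 ranges over [sigma1 - 1, 1].
    """
    lo, hi = sigma1 - 1.0, 1.0
    best = 0.0
    for k in range(steps + 1):
        p11 = lo + (hi - lo) * k / steps
        best = max(best, fst_two_alleles(p11, sigma1 - p11))
    return best


sigma1 = 1.5
bound = (2 - sigma1) / sigma1  # upper bound on F, here 1/3
print(scan_max_fst(sigma1), bound)
```

The maximum on the grid occurs at the endpoints of the scan, where one population carries only allele 1, in agreement with the equality condition for *δ*_{1}.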

### Lower bound on F for an unspecified number of alleles

For any number of alleles *I* and any set of *σ*_{*i*}, by noting that the denominator of *F* in Equation 4 is positive and that the numerator is a sum of squares, we see that *F* ≥ 0, with equality if and only if for all *i*, *p*_{1*i*} = *p*_{2*i*} = *σ*_{*i*}/2. Thus, the lower bound on *F* as a function of *σ*_{1} is achieved when the allele frequencies are the same in both populations for all *I* alleles. Consequently, *F* = 0 is attainable for any value of *σ*_{1} in (0, 2).

### Upper bound on F for an unspecified number of alleles

The upper bound on *F* as a function of *σ*_{1} has different properties for *σ*_{1} ∈ (0, 1) and for *σ*_{1} ∈ [1, 2). We begin with *σ*_{1} ∈ (0, 1).

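Using Equation 4 and the identity *δ*_{*i*}^{2} = *σ*_{*i*}^{2} − 4*p*_{1*i*}*p*_{2*i*}, we can rearrange *F*(*σ*_{1}) to obtain

$$F(\sigma_1) = \frac{\sum_{i=1}^{I}\sigma_i^2 - 4\sum_{i=1}^{I}p_{1i}p_{2i}}{4 - \sum_{i=1}^{I}\sigma_i^2}.$$

Because the products *p*_{1*i*}*p*_{2*i*} are nonnegative, for any fixed set of *σ*_{*i*}, *F*(*σ*_{1}) is maximal when each allele is found only in one of the two subpopulations.

To complete the maximization of *F*(*σ*_{1}) as a function of *σ*_{1}, it remains to maximize $\sum_i \sigma_i^2$, because the ratio $\sum_i \sigma_i^2 / (4 - \sum_i \sigma_i^2)$ increases with $\sum_i \sigma_i^2$. When each allele is found in only one subpopulation, $\sum_i \sigma_i^2 = \sum_i p_{1i}^2 + \sum_i p_{2i}^2$.

Define *J* = ⌈*σ*_{1}^{−1}⌉. The number of alleles *I* is unspecified; we search for an upper bound over all possible values *I* ≥ 2 and discover that the maximum occurs when each subpopulation has *J* distinct alleles. Because *p*_{1*i*} + *p*_{2*i*} ≤ *σ*_{1} for each *i* and because, at the maximum of *F*(*σ*_{1}), each allele has either *p*_{1*i*} = 0 or *p*_{2*i*} = 0, it suffices to maximize $\sum_i p_{1i}^2$ subject to *p*_{1*i*} ≤ *σ*_{1} for all *i*, and likewise for population 2. This maximization is the same problem considered in Rosenberg and Jakobsson (2008, Lemma 3), which demonstrates that the maximum occurs if and only if the locus has *J* − 1 alleles of frequency *σ*_{1} and one remaining allele of frequency 1 − (*J* − 1)*σ*_{1}.

Lemma 3 of Rosenberg and Jakobsson (2008) yields 1 − *σ*_{1}(*J* − 1)(2 − *Jσ*_{1}) for each of the two maxima, so that $\sum_i \sigma_i^2 \le 2[1 - \sigma_1(J-1)(2-J\sigma_1)]$ and

$$F(\sigma_1) \le Q(\sigma_1) = \frac{1 - \sigma_1(J-1)(2 - J\sigma_1)}{1 + \sigma_1(J-1)(2 - J\sigma_1)}. \tag{10}$$

The maximum is attained on 2*J* alleles, *J* of which occur only in the first subpopulation and the other *J* of which occur only in the second population, and each subpopulation has *J* − 1 alleles of frequency *σ*_{1} and one allele of frequency 1 − (*J* − 1)*σ*_{1}.

We next consider *σ*_{1} ∈ [1, 2). Because *σ*_{1} ∈ [1, 2), we separate terms in Equation 4 for the first and subsequent alleles:

$$F(\sigma_1) = \frac{\delta_1^2 + \sum_{i=2}^{I}\delta_i^2}{4 - \sigma_1^2 - \sum_{i=2}^{I}\sigma_i^2}.$$

As in the two-allele case, *δ*_{1} ≤ 2 − *σ*_{1}, so the largest contribution of the first allele to *F*, given *σ*_{1}, occurs when, for *σ*_{1} ∈ [1, 2), *p*_{11} = 1 or *p*_{21} = 1.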

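Next, for any *i*, *δ*_{*i*} ≤ *σ*_{*i*}, with equality if and only if *p*_{1*i*} = 0 or *p*_{2*i*} = 0. Then

$$\sum_{i=2}^{I}\delta_i^2 \le \sum_{i=2}^{I}\sigma_i^2 \le \left(\sum_{i=2}^{I}\sigma_i\right)^2 = (2-\sigma_1)^2.$$

Equality in the second inequality requires that among the *σ*_{*i*} with *i* ≥ 2, only one can be positive, namely *σ*_{2}, by the assumption that the alleles are labeled in decreasing order of frequency. Thus, equality occurs in both inequalities if and only if *σ*_{2} = 2 − *σ*_{1} and either *p*_{12} or *p*_{22} is 0.

We have therefore found that given *σ*_{1} ∈ [1, 2), *F* is maximized when (*p*_{11}, *p*_{12}, *p*_{21}, *p*_{22}) = (1, 0, *σ*_{1} − 1, 2 − *σ*_{1}) or (*σ*_{1} − 1, 2 − *σ*_{1}, 1, 0). Replacing the terms in the separated form of Equation 4 by their values when *p*_{11} = 1 or *p*_{21} = 1 and *σ*_{2} = 2 − *σ*_{1}, we obtain

$$F(\sigma_1) \le q(\sigma_1) = \frac{2(2-\sigma_1)^2}{4 - \sigma_1^2 - (2-\sigma_1)^2} = \frac{2-\sigma_1}{\sigma_1}. \tag{13}$$

This result matches the two-allele case: when *σ*_{1} ∈ [1, 2), the case of an unspecified number of alleles reduces to the case of two alleles.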

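Summarizing our results, the bounds of *F* are

$$0 \le F \le \begin{cases} Q(\sigma_1), & \sigma_1 \in (0, 1), \\ q(\sigma_1), & \sigma_1 \in [1, 2), \end{cases}$$

where *Q*(*σ*_{1}) = [1 − *σ*_{1}(*J* − 1)(2 − *Jσ*_{1})]/[1 + *σ*_{1}(*J* − 1)(2 − *Jσ*_{1})] with *J* = ⌈*σ*_{1}^{−1}⌉, and *q*(*σ*_{1}) = (2 − *σ*_{1})/*σ*_{1}. The upper bound on *F* is continuous at *σ*_{1} = 1, as

$$\lim_{\sigma_1 \to 1^-} Q(\sigma_1) = 1 = q(1).$$

The upper bound on *F* is shown as the solid line in Figure 2. The plot illustrates that the upper bound on *F*(*σ*_{1}) has a piecewise structure on (0, 1), with changes in shape occurring when *σ*_{1} is equal to the reciprocal of an integer. Similarly to the bounds examined by Rosenberg and Jakobsson (2008), for each *J* ≥ 2, *Q*(*σ*_{1}) is monotonically increasing on the interval [1/*J*, 1/(*J* − 1)), where ⌈*σ*_{1}^{−1}⌉ has a constant value *J*. Further, *Q*(*σ*_{1}) is continuous at the boundaries 1/*J* between intervals, with *Q*(1/*J*) = 1/(2*J* − 1). On [1, 2), the upper bound has a simple monotonic decline according to *q*(*σ*_{1}).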
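The piecewise upper bound can be checked against the allele frequencies that attain it. The following Python sketch is a reconstruction under stated assumptions: `max_fst` encodes the closed forms implied by the maximizing configurations described above, namely [1 − *σ*_{1}(*J* − 1)(2 − *Jσ*_{1})]/[1 + *σ*_{1}(*J* − 1)(2 − *Jσ*_{1})] with *J* = ⌈1/*σ*_{1}⌉ on (0, 1) and (2 − *σ*_{1})/*σ*_{1} on [1, 2), and `maximizing_frequencies` builds those configurations. All function names are hypothetical.

```python
import math


def max_fst(sigma1):
    """Upper bound on F given sigma_1 in (0, 2), for two populations."""
    if not 0 < sigma1 < 2:
        raise ValueError("sigma1 must lie in (0, 2)")
    if sigma1 >= 1:
        return (2 - sigma1) / sigma1          # q(sigma_1)
    j = math.ceil(1 / sigma1)
    a = sigma1 * (j - 1) * (2 - j * sigma1)
    return (1 - a) / (1 + a)                  # Q(sigma_1)


def maximizing_frequencies(sigma1):
    """Allele frequencies (pop1, pop2) that attain the upper bound."""
    if sigma1 >= 1:
        # One population carries only allele 1.
        return [1.0, 0.0], [sigma1 - 1, 2 - sigma1]
    j = math.ceil(1 / sigma1)
    # J - 1 alleles of frequency sigma_1 plus one remainder, private to each population.
    private = [sigma1] * (j - 1) + [1 - (j - 1) * sigma1]
    zeros = [0.0] * j
    return private + zeros, zeros + private


def fst(p1, p2):
    """F from allele-frequency sums and differences."""
    sigma_sq = sum((x + y) ** 2 for x, y in zip(p1, p2))
    delta_sq = sum((x - y) ** 2 for x, y in zip(p1, p2))
    return delta_sq / (4 - sigma_sq)


for s in (0.3, 0.4, 0.5, 0.8, 1.0, 1.5):
    p1, p2 = maximizing_frequencies(s)
    print(s, max_fst(s), fst(p1, p2))  # the last two columns agree
```

For example, at *σ*_{1} = 0.4 each population carries private alleles with frequencies 0.4, 0.4, and 0.2, and both columns equal 0.36/1.64.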

## Properties of the Upper Bound on *F*

The region between 0 and the upper bound on *F* exactly circumscribes the set of possible values of *F* as a function of *σ*_{1}, as the upper bound is strict. We now explore a series of features of the upper bound on *F* as a function of *σ*_{1}.

### The space between the upper and lower bounds on F

The mean maximum *F* across the range of possible frequencies for the most frequent allele gives a sense of the maximal *F* attainable on average, when *M* is uniformly distributed. This mean can be obtained by evaluating the area of the region between the lower and upper bounds on *F*.

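Because the lower bound on *F* is zero over the entire interval *σ*_{1} ∈ (0, 2), we need to determine only the area *A* under the upper bound on *F*. We integrate *Q*(*σ*_{1}) for *σ*_{1} ∈ (0, 1) and *q*(*σ*_{1}) for *σ*_{1} ∈ [1, 2),

$$A = \int_0^1 Q(\sigma_1)\,d\sigma_1 + \int_1^2 q(\sigma_1)\,d\sigma_1. \tag{17}$$

For the first term, we split (0, 1) into the intervals [1/*J*, 1/(*J* − 1)) for *J* ≥ 2. On each such interval, ⌈*σ*_{1}^{−1}⌉ has a constant value *J*. We then have

$$\int_0^1 Q(\sigma_1)\,d\sigma_1 = \sum_{J=2}^{\infty}\int_{1/J}^{1/(J-1)} \frac{1-\sigma_1(J-1)(2-J\sigma_1)}{1+\sigma_1(J-1)(2-J\sigma_1)}\,d\sigma_1.$$

In the *Appendix*, we show that this sum is ∼0.3307807.

The second term in Equation 17 is

$$\int_1^2 \frac{2-\sigma_1}{\sigma_1}\,d\sigma_1 = 2\ln 2 - 1.$$

Thus, the area under *q*(*σ*_{1}) for *σ*_{1} ∈ [1, 2) is ∼0.3862944.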

Summing the values for the two integrals, the area *A* under the upper bound on *F* is ∼0.7170751. Considering *F* as a function of *M* = *σ*_{1}/2 rather than *σ*_{1}, *F* is confined to a region with area ∼0.3585376. This area under the curve is the mean maximal value of *F* across the space of values of *M*, and it is substantially less than 1. Thus, on average, *F* is constrained within a narrow range, and across most of the space of possible values for the frequency of the most frequent allele, *F* cannot achieve large values. For example, only over half the range—for *M* between 1/4 and 3/4—is it possible for *F* to exceed 1/3.
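The area *A* can be approximated numerically without the closed-form evaluation. The sketch below applies a simple midpoint rule to the piecewise upper bound; `max_fst` is the same hypothetical helper form used for the bound, [1 − *a*]/[1 + *a*] with *a* = *σ*_{1}(*J* − 1)(2 − *Jσ*_{1}) and *J* = ⌈1/*σ*_{1}⌉ on (0, 1), and (2 − *σ*_{1})/*σ*_{1} on [1, 2). The number of grid cells is an arbitrary choice.

```python
import math


def max_fst(sigma1):
    """Upper bound on F given sigma_1 in (0, 2)."""
    if sigma1 >= 1:
        return (2 - sigma1) / sigma1
    j = math.ceil(1 / sigma1)
    a = sigma1 * (j - 1) * (2 - j * sigma1)
    return (1 - a) / (1 + a)


def midpoint_integral(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


area_low = midpoint_integral(max_fst, 0.0, 1.0)   # area under Q on (0, 1)
area_high = midpoint_integral(max_fst, 1.0, 2.0)  # area under q on [1, 2)
area_a = area_low + area_high
print(area_a, area_a / 2)  # approximately 0.717 and 0.359
```

Dividing by 2 converts the area from the *σ*_{1} scale to the *M* scale, giving the mean maximum *F* for *M* uniform on (0, 1).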

### Jagged points touch a simple curve

For *σ*_{1} ∈ [1, 2), the upper bound on *F* is a smooth function *q*(*σ*_{1}). For *σ*_{1} ∈ (0, 1), however, the upper bound is a jagged curve. At *σ*_{1} = 1/*J* for any integer *J* ≥ 2, that is, at the “jagged points” where the upper bound is not differentiable, *Q*(*σ*_{1}) coincides with the reflection of *q*(*σ*_{1}) across the line *σ*_{1} = 1. We have

$$Q\left(\frac{1}{J}\right) = \frac{1}{2J-1} = \frac{\sigma_1}{2-\sigma_1} = q(2-\sigma_1)$$

when *σ*_{1} = 1/*J*. Thus, for *σ*_{1} = 1/*J*, *Q*(*σ*_{1}) touches the curve *q**(*σ*_{1}) = *q*(2 − *σ*_{1}) = *σ*_{1}/(2 − *σ*_{1}) on (0, 1).

Because *q**(*σ*_{1}) on (0, 1) is the reflection of *q*(*σ*_{1}) on [1, 2) across the line *σ*_{1} = 1, the area under *q**(*σ*_{1}) on (0, 1) is the same as the area under *q*(*σ*_{1}) on [1, 2), or 2 ln 2 − 1. Thus, on the interval (0, 1), the space between *q**(*σ*_{1}) and *Q*(*σ*_{1}) is 2(2 ln 2 − 1) − *A* ≈ 0.0555.

### The contribution made by M to the upper bound on F

We denote by *F*_{1}(*σ*_{1}) the contribution of the most frequent allele to *F*(*σ*_{1}). By this quantity, we mean the term in *F*(*σ*_{1}) contributed by the difference between populations in the frequency of the most frequent allele. From Equation 4, *F*(*σ*_{1}) can be written

$$F(\sigma_1) = \sum_{i=1}^{I}\frac{\delta_i^2}{4-\sum_{j=1}^{I}\sigma_j^2}.$$

If the *i*th term in the summation is denoted *F*_{*i*}(*σ*_{1}), our interest is in the value of *F*_{1}(*σ*_{1}) obtained at the set of allele frequencies that maximizes *F*(*σ*_{1}).

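For *σ*_{1} in the interval (0, 1), defining *J* = ⌈*σ*_{1}^{−1}⌉, the maximum of *F*(*σ*_{1}) is attained with 2*J* − 2 alleles with frequency *σ*_{1} and two alleles with frequency 1 − (*J* − 1)*σ*_{1}: *J* − 1 alleles with frequency *σ*_{1} and one allele with frequency 1 − (*J* − 1)*σ*_{1} in each subpopulation. Denoting the contribution of *F*_{1}(*σ*_{1}) to *F*(*σ*_{1}) at the maximum by *Q*_{1}(*σ*_{1}), we have

$$Q_1(\sigma_1) = \frac{\sigma_1^2}{2\left[1 + \sigma_1(J-1)(2-J\sigma_1)\right]}.$$

In the *Appendix*, we evaluate the area under *Q*_{1}(*σ*_{1}) on (0, 1), which is ∼0.12845.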

For *σ*_{1} ∈ [1, 2), at the maximum of *F*(*σ*_{1}), the contribution *q*_{1}(*σ*_{1}) is

$$q_1(\sigma_1) = \frac{(2-\sigma_1)^2}{4-\sigma_1^2-(2-\sigma_1)^2} = \frac{2-\sigma_1}{2\sigma_1},$$

and the area under *q*_{1}(*σ*_{1}) on [1, 2) is ln 2 − 1/2 ≈ 0.1931472. Summing the areas under *Q*_{1}(*σ*_{1}) and *q*_{1}(*σ*_{1}), the total area *B* under *F*_{1} as *σ*_{1} ranges from 0 to 2 is ∼0.3215994. Considering *F*_{1} as a function of *M* = *σ*_{1}/2, we find that *F*_{1} is confined to ∼0.1607997 of the space of possible pairs of values (*M*, *F*). The fraction of the area *A* under the upper bound on *F* contributed by the most frequent allele over the entire interval *σ*_{1} ∈ (0, 2) is *B*/*A* ≈ 0.4484877. This quantity can be interpreted as the mean contribution of the most frequent allele to the maximum value of *F*, and it indicates a substantial role for the most frequent allele. Indeed, for *σ*_{1} ∈ [1, 2), *q*_{1}(*σ*_{1})/*q*(*σ*_{1}) = 1/2. The contribution made by the most frequent allele to the upper bound on *F* appears in Figure 3.
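The share of the maximal *F* attributable to the most frequent allele can also be approximated numerically. The sketch below is a reconstruction under stated assumptions: at the maximizing allele frequencies, the first-allele term is taken to be *σ*_{1}^{2}/(2[1 + *σ*_{1}(*J* − 1)(2 − *Jσ*_{1})]) on (0, 1), since *δ*_{1} = *σ*_{1} there, and (2 − *σ*_{1})/(2*σ*_{1}) on [1, 2). The function names and grid size are hypothetical choices.

```python
import math


def max_fst_parts(sigma1):
    """Return (upper bound on F, contribution of the most frequent allele)."""
    if sigma1 >= 1:
        total = (2 - sigma1) / sigma1
        return total, total / 2
    j = math.ceil(1 / sigma1)
    a = sigma1 * (j - 1) * (2 - j * sigma1)
    return (1 - a) / (1 + a), sigma1 ** 2 / (2 * (1 + a))


def areas(n=200_000):
    """Midpoint-rule areas under both curves for sigma_1 in (0, 2)."""
    h = 2.0 / n
    area_a = area_b = 0.0
    for k in range(n):
        total, first = max_fst_parts((k + 0.5) * h)
        area_a += total * h
        area_b += first * h
    return area_a, area_b


area_a, area_b = areas()
print(area_b / 2, area_b / area_a)  # approximately 0.161 and 0.448
```

The ratio of the two areas is the mean contribution of the most frequent allele to the maximal *F*.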

## Bounds on *M*

Our derivation of the bounds on *F* as functions of the frequency *M* of the most frequent allele enables us to provide bounds on *M* as functions of *F* by taking the inverse of the functions *q*(*σ*_{1}) and *Q*(*σ*_{1}). For 0 < *F* < 1, we show that the bounds on the frequency of the most frequent allele in terms of *F* are

$$\frac{1}{J}\left[1+\sqrt{\frac{(2J-1)F-1}{(J-1)(1+F)}}\right] \le \sigma_1 \le \frac{2}{1+F}, \qquad J = \left\lceil \frac{1+F}{2F} \right\rceil,$$

with *M* = *σ*_{1}/2. For *F* = 1, *σ*_{1} must equal 1, and for *F* = 0, *σ*_{1} lies in the open interval (0, 2).

### Bounds on σ_{1} for two alleles

We first consider the two-allele case. By definition of *σ*_{1}, regardless of the value of *F*, *σ*_{1} can be no smaller than 1, and when *σ*_{1} = 1, Equation 5 reduces to *F* = *δ*_{1}^{2}. Hence, for any *F* ∈ [0, 1], it is possible to choose allele frequencies *p*_{11} and *p*_{21} so that *σ*_{1} = *p*_{11} + *p*_{21} = 1. We simply set *p*_{11} = (1 + √*F*)/2 and *p*_{21} = (1 − √*F*)/2. Thus, the lower bound *σ*_{1}(*F*) = 1 can be achieved across the full domain *F* ∈ [0, 1].

For the upper bound on *σ*_{1}, recall that the upper bound on *F* in terms of *σ*_{1} (Equation 7) is a continuous monotonically decreasing function on the interval *σ*_{1} ∈ [1, 2). We can therefore obtain the upper bound on *σ*_{1} as the inverse of this function. Thus, for *F* ∈ [0, 1], the bounds on *σ*_{1} are:

$$1 \le \sigma_1 \le \frac{2}{1+F}.$$

The corresponding bounds on *M* = *σ*_{1}/2 appear in Figure 4.

### Lower bound on σ_{1} for an unspecified number of alleles

For the general case, we obtain lower and upper bounds on *σ*_{1} as functions of *F*, considering all possible choices for the number of distinct alleles. It is useful to first recall that the function *Q*(*σ*_{1}) for the upper bound on *F* for *σ*_{1} ∈ (0, 1) is monotonically increasing, while the function *q*(*σ*_{1}) for the upper bound on *F* for *σ*_{1} ∈ [1, 2) is monotonically decreasing. We can therefore invert *Q*(*σ*_{1}) and *q*(*σ*_{1}), so that the lower bound on *σ*_{1} as a function of *F* is obtained by solving *Q*(*σ*_{1}) = *F* for *σ*_{1} and the upper bound by solving *q*(*σ*_{1}) = *F* for *σ*_{1}. For the lower bound, we perform the inversion piecewise. For integers *J* ≥ 2, if *σ*_{1} ∈ [1/*J*, 1/(*J* − 1)), then *Q*(*σ*_{1}) ∈ [1/(2*J* − 1), 1/(2*J* − 3)). Therefore, for *J* ≥ 2, if *Q* ∈ [1/(2*J* − 1), 1/(2*J* − 3)), then the lower bound on *σ*_{1} lies in [1/*J*, 1/(*J* − 1)). For this interval on *Q*, ⌈(1 + *Q*)/(2*Q*)⌉ = *J*, and in this region, the lower bound on *σ*_{1}, which we term *r*(*F*), also satisfies ⌈*r*(*F*)^{−1}⌉ = *J*. We solve Equation 10 for *σ*_{1} for *Q* ∈ [1/(2*J* − 1), 1/(2*J* − 3)), where both ⌈*σ*_{1}^{−1}⌉ and ⌈(1 + *Q*)/(2*Q*)⌉ are equal to *J*:

$$r(F) = \frac{1}{J}\left[1+\sqrt{\frac{(2J-1)F-1}{(J-1)(1+F)}}\right], \qquad J = \left\lceil\frac{1+F}{2F}\right\rceil.$$

We take the larger root of the quadratic equation because the smaller root is less than 1/*J* and does not permit *σ*_{1} ≥ *σ*_{*i*} for all *i* > 1. The upper and lower bounds appear in Figure 5.

### Upper bound on σ_{1} for an unspecified number of alleles

From Equation 13 and Figure 2, we see that for any *F* ∈ [0, 1], the upper bound on *σ*_{1} is ≥ 1. Because Equation 13 is continuous and monotonically decreasing, we can take the inverse of this function to compute the upper bound on *σ*_{1} as a function of *F*. The upper bound *R*(*F*) on *σ*_{1} is

$$R(F) = \frac{2}{1+F}.$$
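The inversion of the two branches of the upper bound can be sketched in Python. The code below is a reconstruction under stated assumptions: the lower bound solves [1 − *a*]/[1 + *a*] = *F* with *a* = *σ*_{1}(*J* − 1)(2 − *Jσ*_{1}) for the larger root, with *J* = ⌈(1 + *F*)/(2*F*)⌉, and the upper bound solves (2 − *σ*_{1})/*σ*_{1} = *F*. The function names are hypothetical, and the round trip through `max_fst` serves as a consistency check.

```python
import math


def max_fst(sigma1):
    """Upper bound on F given sigma_1 in (0, 2)."""
    if sigma1 >= 1:
        return (2 - sigma1) / sigma1
    j = math.ceil(1 / sigma1)
    a = sigma1 * (j - 1) * (2 - j * sigma1)
    return (1 - a) / (1 + a)


def sigma1_bounds(f):
    """(lower, upper) bounds on sigma_1 given 0 < F < 1."""
    if not 0 < f < 1:
        raise ValueError("F must lie strictly between 0 and 1")
    j = math.ceil((1 + f) / (2 * f))
    inner = ((2 * j - 1) * f - 1) / ((j - 1) * (1 + f))
    # max(..., 0.0) guards against tiny negative values from rounding.
    lower = (1 + math.sqrt(max(inner, 0.0))) / j
    upper = 2 / (1 + f)
    return lower, upper


for f in (0.1, 0.25, 0.5, 0.9):
    lo, hi = sigma1_bounds(f)
    # Bounds on M are lo / 2 and hi / 2; the bound on F at each end returns f.
    print(f, lo / 2, hi / 2, max_fst(lo), max_fst(hi))
```

For *F* = 0.5, for instance, the bounds on *M* are approximately 0.394 and 0.667.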

## *F* and Homozygosity of the Total Population

The relationship between *F* and the frequency of the most frequent allele can be used together with the relationship between homozygosity and the frequency of the most frequent allele (Rosenberg and Jakobsson 2008; Reddy and Rosenberg 2012) to find a relationship between *F* and homozygosity, again in the setting of two populations. The homozygosity that we consider, *H* in Rosenberg and Jakobsson (2008), corresponds to the homozygosity of the total pooled population *H*_{T}. We first note that given any *H*_{T} ∈ (0, 1), the lower bound on *F* is zero. For example, for any *H*_{T}, *F* = 0 is obtained by using the equality condition in Theorem 1ii of Rosenberg and Jakobsson (2008) to specify a list of allele frequencies with sum of squares *H*_{T} and then assigning that same list of frequencies to both of the component subpopulations.

### Upper bound on F given H_{T} for an unspecified number of alleles

Rosenberg and Jakobsson (2008) showed that the value of *H*_{T} constrains the frequency *M* of the most frequent allele to a narrow range. We have already determined the upper bound on *F* as a function of *M*. Thus, we can obtain an upper bound on *F* as a function of *H*_{T} by taking the maximum value of the upper bound over the range of possible values of *M* allowed under the results of Rosenberg and Jakobsson (2008) for a given value of *H*_{T}. This approach does not guarantee that the upper bound on *F* that we obtain in terms of *H*_{T} is strict; nevertheless, the approach happens to produce a strict bound for *H*_{T} ∈ [1/2, 1). For *H*_{T} ∈ (0, 1/2), it is possible to produce a strict bound by writing *F* in terms of *H*_{T}.

To obtain the bound for *H*_{T} ∈ (0, 1/2), we substitute the inequality *δ*_{*i*} ≤ *σ*_{*i*} into Equation 4 and use $\sum_i \sigma_i^2 = 4H_T$, obtaining

$$F \le \frac{\sum_i \sigma_i^2}{4 - \sum_i \sigma_i^2} = \frac{H_T}{1 - H_T}. \tag{31}$$

For a given *H*_{T}, equality is obtained in Equation 31 when *δ*_{*i*} = *σ*_{*i*} for all *i*. Thus, for *H*_{T} ∈ (0, 1/2), *F* is maximized when each allele occurs in only one of the two populations. To see that the upper bound is strict, note that when each allele occurs in only one population and the two populations have homozygosities *H*_{1} and *H*_{2}, *H*_{T} = (*H*_{1} + *H*_{2})/4. As *H*_{T} < 1/2, 2*H*_{T} < 1, and we can choose *H*_{1} = *H*_{2} = 2*H*_{T}. Using the equality condition in Theorem 1ii of Rosenberg and Jakobsson (2008), we can specify a set *L* of exactly ⌈(2*H*_{T})^{−1}⌉ allele frequencies whose sum of squares is 2*H*_{T}. We then construct a set of 2⌈(2*H*_{T})^{−1}⌉ alleles. In population 1, the first ⌈(2*H*_{T})^{−1}⌉ alleles in the set have exactly the allele frequencies in *L* and the next ⌈(2*H*_{T})^{−1}⌉ alleles have frequency 0. In population 2, the first ⌈(2*H*_{T})^{−1}⌉ alleles have frequency 0, and the next ⌈(2*H*_{T})^{−1}⌉ alleles have the frequencies in *L*.

For *H*_{T} ∈ [1/2, 1), *H*_{T}/(1 − *H*_{T}) ≥ 1, so Equation 31 provides only the trivial bound of *F* ≤ 1, and another approach is needed. For any *H*_{T} ∈ [1/2, 1), using Theorem 1ii of Rosenberg and Jakobsson (2008), *M* ≥ 1/2. For *M* ≥ 1/2, the upper bound on *F* as a function of *σ*_{1} is monotonically decreasing in *σ*_{1}, and consequently, the upper bound on *F* as a function of *H*_{T} is obtained by evaluating *q*(*σ*_{1}) at the smallest value of *σ*_{1} permitted by *H*_{T}. Theorem 1ii of Rosenberg and Jakobsson (2008) indicates that this smallest allowed *σ*_{1} satisfies

$$\frac{\sigma_1}{2} = \frac{1 + \sqrt{2H_T - 1}}{2}.$$

Replacing *σ*_{1}/2 in Equation 16 with this expression, that is, evaluating *q*(*σ*_{1}) = (2 − *σ*_{1})/*σ*_{1} at this value, we have

$$F \le \frac{1 - \sqrt{2H_T - 1}}{1 + \sqrt{2H_T - 1}} \tag{32}$$

for *H*_{T} ∈ [1/2, 1).

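For *H*_{T} ∈ [1/2, 1), the set of allele frequencies that achieves the minimum *M* as a function of *H*_{T} and the set that achieves the maximum *F* as a function of *M* coincide. Given *H*_{T}, *M* is minimized by setting the mean frequency of the most frequent allele to $(1+\sqrt{2H_T-1})/2$ and assigning all remaining frequency to a single allele, so that only one allele has positive mean frequency among those with *i* ≥ 2. If these mean frequencies are distributed between the two populations such that one population carries only the most frequent allele (*p*_{11} = 1 or *p*_{21} = 1), then the maximum *F* is achieved.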

Figure 6 shows our upper bound on *F* as a function of the total homozygosity *H*_{T}. If *H*_{T} is low, and particularly if *H*_{T} is high, then *F* is restricted to small values. High values of *F* are possible only when *H*_{T} is near 1/2. In fact, using Equations 31 and 32, *F* can exceed 1/2 only if *H*_{T} lies in (1/3, 5/9).
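The bound on *F* given *H*_{T} is simple enough to evaluate directly. The following minimal Python sketch, with a hypothetical function name, combines the two branches, *H*_{T}/(1 − *H*_{T}) for *H*_{T} < 1/2 and (1 − √(2*H*_{T} − 1))/(1 + √(2*H*_{T} − 1)) for *H*_{T} ≥ 1/2, and checks the endpoints of the interval on which *F* can exceed 1/2.

```python
import math


def max_fst_given_ht(h_t):
    """Upper bound on F given total homozygosity H_T in (0, 1)."""
    if not 0 < h_t < 1:
        raise ValueError("H_T must lie strictly between 0 and 1")
    if h_t < 0.5:
        return h_t / (1 - h_t)
    root = math.sqrt(2 * h_t - 1)
    return (1 - root) / (1 + root)


for h_t in (0.1, 1 / 3, 0.5, 5 / 9, 0.9):
    print(round(h_t, 4), max_fst_given_ht(h_t))
```

The bound equals 1/2 at *H*_{T} = 1/3 and at *H*_{T} = 5/9 and reaches 1 only at *H*_{T} = 1/2.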

### The space between the upper and lower bounds on F given H_{T}

In the same manner as in our investigation of the bounds on *F* as a function of *M*, we evaluate the area of the region between the upper and lower bounds on *F* to find the mean maximum *F* across the range of possible values of *H*_{T}.

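Because the lower bound on *F* is zero over the entire interval *H*_{T} ∈ (0, 1), it suffices to evaluate the area *A* under the upper bound on *F*. This area is

$$A = \int_0^{1/2} \frac{H_T}{1-H_T}\,dH_T + \int_{1/2}^{1} \frac{1-\sqrt{2H_T-1}}{1+\sqrt{2H_T-1}}\,dH_T.$$

The first term has indefinite integral −*H*_{T} − ln(1 − *H*_{T}) and evaluates to ln 2 − 1/2. The second term has indefinite integral $2\sqrt{2H_T-1} - H_T - 2\ln(1+\sqrt{2H_T-1})$ and evaluates to 3/2 − 2 ln 2. Thus, *A* = 1 − ln 2 ≈ 0.3068528.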

Note that *F* is substantially more constrained when *H*_{T} ∈ [1/2, 1) than when *H*_{T} ∈ (0, 1/2). The difference between the areas under the upper bound for *H*_{T} ∈ (0, 1/2) and for *H*_{T} ∈ [1/2, 1) is 3 ln 2 − 2 ≈ 0.0794415, a sizeable fraction of the sum of the two areas. Twice the difference in areas, or 6 ln 2 − 4 ≈ 0.1588831, is the expectation of the difference between the maximum value of *F* for a value of *H*_{T} chosen uniformly at random from (0, 1/2) and the maximum value of *F* for a value of *H*_{T} chosen uniformly at random from [1/2, 1).
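The area for *H*_{T} ∈ [1/2, 1) can be worked out with a substitution. The steps below are a sketch of one route to the value, using the auxiliary variable *u* (our notation) and the upper bound (1 − √(2*H*_{T} − 1))/(1 + √(2*H*_{T} − 1)) on this interval. With $u = \sqrt{2H_T - 1}$, we have $H_T = (1+u^2)/2$ and $dH_T = u\,du$, so that

$$\int_{1/2}^{1} \frac{1-\sqrt{2H_T-1}}{1+\sqrt{2H_T-1}}\,dH_T = \int_0^1 \frac{u(1-u)}{1+u}\,du = \int_0^1 \left(2 - u - \frac{2}{1+u}\right)du = \frac{3}{2} - 2\ln 2 \approx 0.1137.$$

Adding the area for *H*_{T} ∈ (0, 1/2) gives

$$\left(\ln 2 - \frac{1}{2}\right) + \left(\frac{3}{2} - 2\ln 2\right) = 1 - \ln 2,$$

and subtracting the two areas gives

$$\left(\ln 2 - \frac{1}{2}\right) - \left(\frac{3}{2} - 2\ln 2\right) = 3\ln 2 - 2.$$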

## Application to Data

We illustrate the bounds on *F*, *M*, and *H*_{T} for a series of examples using human polymorphism data from Rosenberg *et al.* (2005) and Li *et al.* (2008). For each example, for each locus, we assume that the allele frequencies in the data sets are parametric allele frequencies. The parametric allele frequencies are obtained in each of a pair of populations, and they are then averaged to obtain parametric allele frequencies for the total population. *F*, *M*, and *H*_{T} are then computed. The data set of Rosenberg *et al.* (2005) considers 1048 individuals genotyped for 783 microsatellites, and the data set of Li *et al.* (2008) considers 938 unrelated individuals genotyped for single-nucleotide polymorphisms (SNPs); for all analyses, we restrict our attention to the 935 individuals found in both data sets. For the Li *et al.* (2008) data, we examine only 640,034 SNPs studied by Pemberton *et al.* (2012).
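The per-locus computation described above can be sketched as follows. This is a minimal illustration under the same assumption as in the text, namely that the supplied frequencies are treated as parametric; it does not address estimation from samples. The function name, the allele labels, and the toy frequencies are hypothetical and are not drawn from the data sets of Rosenberg *et al.* (2005) or Li *et al.* (2008).

```python
def locus_summary(freqs1, freqs2):
    """Return (F, M, H_T) for one locus.

    freqs1 and freqs2 map allele labels to frequencies in populations 1 and 2;
    alleles absent from a mapping have frequency 0 in that population.
    """
    alleles = set(freqs1) | set(freqs2)
    # Allele frequencies of the total population: the average of the two populations.
    mean = {a: (freqs1.get(a, 0.0) + freqs2.get(a, 0.0)) / 2 for a in alleles}
    h_t = sum(v * v for v in mean.values())
    h_s = 0.5 * (sum(v * v for v in freqs1.values())
                 + sum(v * v for v in freqs2.values()))
    return (h_s - h_t) / (1 - h_t), max(mean.values()), h_t


# Toy frequencies at a single locus (made up for illustration).
pop1 = {"A": 0.6, "B": 0.4}
pop2 = {"A": 0.2, "B": 0.3, "C": 0.5}

f, m, h_t = locus_summary(pop1, pop2)
print(f, m, h_t)  # approximately 0.1603, 0.4, 0.345
```

Applying such a function to every locus yields the (*M*, *F*) and (*H*_{T}, *F*) pairs that are compared with the bounds.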

### Example 1: Africans and Native Americans

Our first example considers microsatellites in 101 Africans and 63 Native Americans, and it is chosen to illustrate a relatively wide range of values of *F*, *M*, and *H*_{T}. Figure 7 shows *F* and *M*, demonstrating that for the comparison of Africans and Native Americans, *F* < 0.1 for most of the 783 loci. The mean value of *F* is 0.05 with standard deviation 0.06, and the mean value of *M* is 0.37 with standard deviation 0.11.

Similarly, Figure 8 plots *F* and *H*_{T} for the 783 loci. The mean *H*_{T} is 0.25 with standard deviation 0.08. In both Figures 7 and 8, relatively few loci approach the upper bound on *F*.

### Example 2: High-diversity and low-diversity populations

The bounds on *F* as a function of *M* and *H*_{T} indicate that genetic diversity in a pair of populations has a strong effect on the value of *F* between them. To illustrate this point, we compare the values of *F* obtained from two populations each with high within-population diversity to those obtained from two populations with lower within-population diversity.

The Yoruba and Mbuti Pygmy populations are two African populations with high genetic diversity; the Colombian and Pima populations are Native American populations with lower diversity. Figure 9A shows *F* and *M* computed from the Yoruba and Mbuti Pygmy populations, and Figure 9B shows *F* and *H*_{T}. The mean value of *F* is 0.04 with standard deviation 0.03, the mean value of *M* is 0.35 with standard deviation 0.11, and the mean value of *H*_{T} is 0.24 with standard deviation 0.08.

By contrast, in corresponding plots for the less diverse Colombian and Pima populations, higher values of *F*, *M*, and *H*_{T} are apparent (Figure 9, C and D). In particular, because *M* and *H*_{T} tend to be nearer to 1/2, larger values of *F* are possible. The mean values of *M* and *H*_{T} are much closer to 1/2 than in the African groups; the mean *M* is 0.50 with standard deviation 0.15, and the mean *H*_{T} is 0.38 with standard deviation 0.15. As is suggested by the fact that *F* can attain its largest values when *M* and *H*_{T} lie near 1/2, the mean value of *F* for the Native American groups is nearly twice as high as in the African groups (mean 0.07, standard deviation 0.07).

### Example 3: Single-nucleotide polymorphisms

Our third example considers SNPs in the same set of Africans and Native Americans for which microsatellites were examined in Figures 7 and 8. Figure 10 shows the joint distribution of *F* and *M* as well as the mean and median of *F* for intervals of *M* ranging from 1/2 to 1 with width 0.01. Mean values of *F* decrease with *M* for *M* ∈ (1/2, 1), and this decrease is correlated with the decreasing value of the bound on *F* as a function of *M* (*r* = 0.94). Compared with the mean, the median value of *F* is less correlated with the value of the bound, although it also declines with increasing *M* (*r* = 0.77).

For biallelic markers, for *M* > 1/2, at least one of the two alleles must appear in both populations, and the upper bound on *F* occurs when one of the populations has only one allele. In Figure 10, for high values of *M*, more SNPs approach the upper bound on *F* than for low values of *M*. This result indicates that SNPs with high values of *M* are more likely to have an allele found in one but not the other of the two populations.

## Discussion

The range of *F* depends on the level of diversity in the markers considered. In this article, we have further shown that not only does diversity constrain the range of *F*, but the frequency of the most frequent allele also has a strong influence on the values that *F* can take. When the frequency of the most frequent allele is small *or* large, *F* is restricted to small values far from one (Figure 2). In fact, considering all possible values of *M*, *F* is restricted on average to only ∼35.85% of the space of possibilities. This extreme reduction in range for *F* can be viewed as a consequence of our result that about half of the contribution to the maximal *F* arises from the most frequent allele (exactly half for *σ*_{1} ∈ [1,2)). Using results from Rosenberg and Jakobsson (2008) on the relationship between homozygosity and the frequency of the most frequent allele, we have described a link between *F* and homozygosity of the total population (*H*_{T}) via separate relationships of *F* and homozygosity to the frequency of the most frequent allele. *F* is restricted by *H*_{T} even further than by *M*, to only ∼30.69% of the space of possibilities.
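The value 1 − ln 2 ≈ 0.3069 can be checked numerically. The piecewise bound in the sketch below is a reconstruction and is not quoted from the main text: it assumes *F* = (*H*_{S} − *H*_{T})/(1 − *H*_{T}), that the maximum for *H*_{T} < 1/2 is reached when no allele is shared between the populations, and that the maximum for *H*_{T} ≥ 1/2 is reached at a biallelic locus with one population fixed. The function names are hypothetical.

```python
from math import log, sqrt


def max_f_given_ht(h):
    """Assumed upper bound on F as a function of pooled homozygosity h, 0 < h < 1."""
    if not 0.0 < h < 1.0:
        raise ValueError("h must lie strictly between 0 and 1")
    if h < 0.5:
        # No shared alleles: H_S = 2 * H_T, so F = H_T / (1 - H_T).
        return h / (1.0 - h)
    # Biallelic locus, one population fixed: sigma_1 = 1 + sqrt(2h - 1),
    # and F = (2 - sigma_1) / sigma_1.
    u = sqrt(2.0 * h - 1.0)
    return (1.0 - u) / (1.0 + u)


def midpoint_mean(func, n=200_000):
    """Mean of func over (0, 1) by the midpoint rule with n equal subintervals."""
    step = 1.0 / n
    return sum(func((i + 0.5) * step) for i in range(n)) * step


print("numerical mean of the assumed bound:", midpoint_mean(max_f_given_ht))
print("1 - ln 2:                           ", 1.0 - log(2.0))
```

Agreement between the two printed values supports the reconstruction but does not replace the derivation in the main text.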

Our work extends knowledge of the connection between *F* and genetic diversity, providing a framework for interpreting a variety of features of values of *F* measured in population-genetic data. We have presented empirical computations that illuminate recently observed phenomena in human population genetics. In particular, even without a formal understanding of the ways in which evolutionary processes and the population-genetic models that encode them give rise to values of *M*, *H*_{T}, and *F*, the mathematical constraints linking these quantities can aid in interpreting the patterns found in the data.

### Low F_{ST} values in human populations from Africa

Estimates of *F*_{ST} in human populations have been low in Africa compared with other geographic regions, such as among Native Americans (Rosenberg *et al.* 2002; Tishkoff *et al.* 2009). This pattern appears to belie the extensive genetic differentiation known to exist among African populations. For example, using microsatellite loci, Tishkoff *et al.* (2009) identified a number of genetically distinctive subgroups of African populations despite confirming that *F*_{ST} in Africa has an unexpectedly small value. The apparent discrepancy between the extensive genetic differentiation among populations in Africa and counterintuitively low values of *F*_{ST} can be explained using our results. Because Africa has high within-population genetic diversity—including microsatellite homozygosities well below 1/2 in many populations (Tishkoff *et al.* 2009, Figure S2B)—the maximum *F*_{ST} for comparisons of African populations at microsatellite loci is relatively constrained compared with the maximum *F*_{ST} for comparisons of groups that have less within-population diversity and mean homozygosities nearer 1/2. Figure 9 shows that *F*_{ST} values comparing African populations are more constrained by *M* and *H*_{T} than are those comparing Native American populations. Thus, the observation for microsatellites of low *F*_{ST} in African populations can be attributed to high within-population genetic diversities.

That *F*_{ST} is more tightly constrained for high-diversity populations than for populations where *H*_{T} ≈ 1/2 has an additional consequence. When considering two pairs of populations with the same *F*_{ST} value and *H*_{T} < 1/2, it is likely that a pair of populations with higher within-group diversity is more differentiated than is a pair of populations with relatively low within-group diversity. In other words, the higher the level of genetic diversity within a population, the greater the extent to which raw values of *F*_{ST} underpredict the intuitive level of differentiation among subpopulations; the result of Tishkoff *et al.* (2009) exactly follows this pattern.
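A small worked example illustrates how raw values can understate differentiation when diversity is high. It is a sketch under the assumption that *F* = (*H*_{S} − *H*_{T})/(1 − *H*_{T}), with hypothetical function names and idealized frequencies: two populations share no alleles at all, and each carries *k* equally frequent private alleles. Under these assumptions, *F* equals 1/(2*k* − 1), so it falls quickly as within-population diversity grows even though the populations remain completely distinct.

```python
def f_st(p1, p2):
    """Two-population F, assuming F = (H_S - H_T) / (1 - H_T)."""
    h_s = 0.5 * (sum(x * x for x in p1) + sum(y * y for y in p2))
    h_t = sum(((x + y) / 2.0) ** 2 for x, y in zip(p1, p2))
    return (h_s - h_t) / (1.0 - h_t)


def private_allele_pair(k):
    """Frequencies for two populations with k equally frequent alleles each, none shared."""
    p1 = [1.0 / k] * k + [0.0] * k
    p2 = [0.0] * k + [1.0 / k] * k
    return p1, p2


for k in (1, 2, 5, 10, 20):
    p1, p2 = private_allele_pair(k)
    h_t = sum(((x + y) / 2.0) ** 2 for x, y in zip(p1, p2))
    print(f"k = {k:2d}  H_T = {h_t:.4f}  F = {f_st(p1, p2):.4f}")
```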

### Lower F_{ST} values for microsatellites than for SNPs

Computations of *F*_{ST} in human populations have generally found that *F*_{ST} estimates based on multiallelic loci such as microsatellites are lower than those obtained from biallelic loci such as SNPs (*e.g.*, Rosenberg *et al.* 2002; Li *et al.* 2008). This observation is apparent in the difference between *F*_{ST}-like computations from nearly the same sets of individuals for microsatellites and for SNPs. When separating human populations into seven geographic regions and computing the within-population component of genetic variation, a quantity analogous to 1 − *F*_{ST}, Rosenberg *et al.* (2002) obtained an estimate of 0.941 with microsatellites, whereas Li *et al.* (2008) obtained 0.889 with SNPs. Our results provide a simple explanation for this difference. The SNPs of Li *et al.* (2008) each have only two alleles, so for each locus, the frequency of the most frequent allele is at least 1/2; further, the minor alleles tend to be common, such that many of the loci have *M* near 1/2. By contrast, the microsatellites in the study of Rosenberg *et al.* (2002) have ∼12 alleles on average, so *M* is typically smaller than 1/2 and often much smaller (Rosenberg and Jakobsson 2008). Thus, for microsatellites, because of lower frequencies of the most frequent allele and higher levels of genetic diversity, the maximum value of *F* is substantially more constrained than the corresponding maximum of *F* for SNPs (Figure 2). We can explain the difference in the magnitudes of the Rosenberg *et al.* (2002) and Li *et al.* (2008) *F*_{ST} values via this phenomenon.

Recently, attention has increasingly focused on biallelic sites for which the rarer allele has low frequency (Keinan and Clark 2012; Nelson *et al.* 2012; Tennessen *et al.* 2012). In our terms, these are sites for which the frequency of the most frequent allele, *M*, is high. Because *F* is tightly constrained for high values of *M*, we might expect that when *F*_{ST} is calculated using sites with rare minor alleles, small *F*_{ST} values will be produced. Indeed, Figure 10 shows that when *F* is used to compare Africans with Native Americans at SNP loci, mean values of *F* decrease as *M* increases from 1/2 to 1.

## Conclusions

Measures of *F*_{ST} have often been used for making inferences about such phenomena as population structure, migration patterns, and range expansions. However, we have found that without a proper understanding of the dependence of *F*_{ST} on diversity and allele frequencies, *F*_{ST} can potentially produce puzzling or misleading results. We have described mathematical relationships between *F*_{ST}, the frequency of the most frequent allele, and homozygosity that are useful for interpreting the properties of differentiation measures when features of allele frequencies and diversity statistics vary across loci or populations—as they inevitably do in typical scenarios.

Beginning with Charlesworth (1998), Nagylaki (1998), and Hedrick (1999), recent studies have noted that *F*_{ST} is constrained by diversity, and the issue was described as early as in the work of Sewall Wright (Wright 1978, p. 82). Jost (2008) generated new interest in the dependence of *F*_{ST} on diversity, illustrating that the dependence can produce substantial discord between intuitions about and measurements of differentiation levels. Jost (2008) also used a multiplicative definition of diversity to propose a pair of new differentiation indices that have the feature of reaching their maximum value if and only if each allele is private to a single subpopulation. In our view, the key to choosing and applying measures of differentiation lies not in “fixation on an index” (Long 2009), be it *F*_{ST}, the measures of Jost (2008), or other indices that have recently been proposed (Meirmans and Hedrick 2011), but in developing an understanding of the ways in which possible statistics relate both to intuitive aspects of differentiation and to mathematical features of allele frequencies and genetic diversity. In this context, *F*_{ST} remains of particular interest on the basis of its long history of use in population genetics and its connection to features of biological models (Whitlock 2011). Our examples provide only a few among many ways in which the mathematical properties we have obtained for *F*_{ST} can be used to interpret its behavior in the analysis of empirical data.

## Acknowledgments

We thank S. Boca and J. VanLiere for numerous discussions of this work. Financial support was provided by the Swedish Research Council, the Erik Philip Sörensen Foundation, the Burroughs Wellcome Fund, a Stanford Graduate Fellowship, and U.S. National Institutes of Health grants GM081441 and HG005855.

## Appendix

The appendix provides the derivations of two integrals described in the main text.

#### Integral ∫_{0}^{1} *Q*(*σ*_{1}) *dσ*_{1} (Equation 18)

To obtain ∫_{0}^{1} *Q*(*σ*_{1}) *dσ*_{1}, we consider, for each *k* ≥ 1, the interval 1/(*k* + 1) ≤ *σ*_{1} < 1/*k*. We have

#### Integral ∫_{0}^{1} *Q*_{1}(*σ*_{1}) *dσ*_{1} (with *Q*_{1} as in Equation 24)

To obtain ∫_{0}^{1} *Q*_{1}(*σ*_{1}) *dσ*_{1}, for *k* ≥ 1,

## Footnotes

*Communicating editor: M. A. Beaumont*

- Received August 7, 2012.
- Accepted November 5, 2012.

- Copyright © 2013 by the Genetics Society of America

Available freely online through the author-supported open access option.