# A Fundamental Relationship Between Genotype Frequencies and Fitnesses

- Joseph Lachance^{1}

- Graduate Program in Genetics, Department of Ecology and Evolution, State University of New York, Stony Brook, New York 11794-5222

- ^{1} *Author e-mail:* joseph.lachance{at}sunysb.edu

## Abstract

The set of possible postselection genotype frequencies in an infinite, randomly mating population is found. Geometric mean heterozygote frequency divided by geometric mean homozygote frequency equals two times the geometric mean heterozygote fitness divided by geometric mean homozygote fitness. The ratio of genotype frequencies provides a measure of genetic variation that is independent of allele frequencies. When this ratio does not equal two, either selection or population structure is present. Within-population HapMap data show population-specific patterns, while pooled data show an excess of homozygotes.

What patterns of genetic variation are possible within a population, and how does natural selection affect these patterns? R. A. Fisher remarked “it is often convenient to consider a natural population not so much as an aggregate of living individuals but as an aggregate of gene ratios” (Fisher 1953, p. 515). This mathematical abstraction allows key questions in evolutionary genetics to be addressed. A population of diploid individuals can be characterized by a set of genotype frequencies (*P*_{AA}, *P*_{AB}, *P*_{BB}, etc.). This population genetic state is represented by a point in genotype frequency space, where each dimension corresponds to the frequency of a particular genotype. As genotype frequencies change over time, evolving populations explore genotype frequency space (Rice 2004).

However, not every possibility can be realized. Populations are constrained to a restricted set of genotype frequencies. Trivially, genotype frequencies must sum to one. Mendelian segregation and patterns of mating further restrict the set of possible genotype frequencies. For example, in a randomly mating population it is unlikely that every individual will be the same heterozygous genotype. Natural selection also influences patterns of genetic variation, as high-fitness genotypes are found at higher frequencies than neutral expectations. What genotype frequencies can one expect to find, and how does genotype-specific fitness influence this? Any equation summarizing the set of all possible population genetic states must contain frequency and fitness terms for every genotype. Subsequently, genotype frequency data can be used to infer a ratio of genotypic fitnesses. While mathematical descriptions exist for loci with two segregating alleles (Cannings and Edwards 1968), such formulations are lacking for arbitrary numbers of segregating alleles. Here, a general equation describing the set of possible postselection genotype frequencies is derived. Much like how the Hardy–Weinberg principle describes population genetic states in the absence of selection, this novel equation describes population genetic states in the presence of selection. In the context of genotype-frequency space, this is a multidimensional surface, the curvature of which is influenced by natural selection (Figure 1). Evolution involves adaptive walks toward regions of high mean fitness on this surface (Wright 1932; Ewens 1989; Edwards 2000). The set of possible genotype frequencies is analogous to the ecological concept of a fundamental niche (Hutchinson 1957) and the Ramachandran diagrams of biochemistry (Ramachandran *et al*. 1963).
The former describes the full range of environmental conditions under which an organism can exist, while the latter describes the possible conformations of dihedral angles for a polypeptide. In each case, valid regions of parameter space are described.

## MODEL

A standard single-locus model of theoretical population genetics is considered (diploidy, autosomal inheritance, random mating, and infinite population size). Fitnesses are assumed to be constant and frequency independent. If there are *n* segregating alleles at a single locus, *n*(*n* + 1)/2 different genotypes are possible, of which *n* are homozygous and *n*(*n* − 1)/2 are heterozygous. Thus, genotype-frequency space spans *n*(*n* + 1)/2 dimensions. Under random mating, each point in allele-frequency space maps to a single point in genotype-frequency space. Consequently, the surface of possible genotype frequencies is (*n* − 1)-dimensional. The recursion equations of classical population genetics give genotype frequency in the present generation (*P*_{ij}) as a function of genotype fitness (*w*_{ij}) and allele frequencies in the past generation (*p*_{i}).

#### Derivation of genotypic ratio:

Subsequent to mating, but prior to selection, genotype frequencies are found in Hardy–Weinberg proportions. Postselection homozygote frequencies are equal to $P_{ii} = p_i^{2} w_{ii}/\bar{w}$, while postselection heterozygote frequencies are equal to $P_{ij} = 2 p_i p_j w_{ij}/\bar{w}$ (Rice 2004). Mean fitness ($\bar{w}$) equals the weighted sum of all genotype fitnesses. It is useful to algebraically manipulate these recursion equations so that a ratio of genotype frequency to genotype fitness is on the left-hand side and a ratio of allele frequencies to mean fitness is on the right-hand side. Subsequently, terms for multiple genotypes can be multiplied.
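The recursion just described can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the article: the function name `postselection_frequencies`, the dictionary layout for fitnesses, and the example allele frequencies and fitness values are all assumptions made here for demonstration.

```python
from itertools import combinations_with_replacement

def postselection_frequencies(p, w):
    """Return (genotype frequencies, mean fitness) after one generation.

    p: list of allele frequencies that sum to one.
    w: dict mapping genotype (i, j), with i <= j, to its constant fitness.
    """
    n = len(p)
    genotypes = list(combinations_with_replacement(range(n), 2))
    # Hardy-Weinberg proportions before selection: p_i^2 or 2 * p_i * p_j.
    hw = {(i, j): (p[i] ** 2 if i == j else 2 * p[i] * p[j])
          for i, j in genotypes}
    # Mean fitness is the frequency-weighted sum of genotype fitnesses.
    w_bar = sum(hw[g] * w[g] for g in genotypes)
    # Viability selection rescales each genotype by its relative fitness.
    post = {g: hw[g] * w[g] / w_bar for g in genotypes}
    return post, w_bar

# Hypothetical three-allele example (values chosen only for illustration).
p = [0.5, 0.3, 0.2]
w = {(0, 0): 0.8, (0, 1): 1.0, (0, 2): 0.9,
     (1, 1): 0.7, (1, 2): 1.0, (2, 2): 0.6}
post, w_bar = postselection_frequencies(p, w)
print(len(post), sum(post.values()), w_bar)
```

With three alleles the dictionary holds six genotypes (three homozygous, three heterozygous), and the postselection frequencies sum to one by construction.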

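A natural division of genotypes involves homozygotes and heterozygotes. Every allele has a corresponding homozygous genotype, and the product of all homozygote ratios is

$$\prod_{i=1}^{n} \frac{P_{ii}}{w_{ii}} = \frac{\prod_{i=1}^{n} p_i^{2}}{\bar{w}^{\,n}} \tag{1}$$

Since all terms in the above equation are positive, each side of Equation 1 can be raised to the (*n*(*n* − 1)/2)th power:

$$\left(\prod_{i=1}^{n} \frac{P_{ii}}{w_{ii}}\right)^{n(n-1)/2} = \frac{\prod_{i=1}^{n} p_i^{\,n(n-1)}}{\bar{w}^{\,n^{2}(n-1)/2}} \tag{2}$$

Every allele also can be found in heterozygous genotypes (each allele occurs in *n* − 1 of them), and the product of all heterozygote ratios is

$$\prod_{i<j} \frac{P_{ij}}{w_{ij}} = \frac{2^{\,n(n-1)/2} \prod_{i=1}^{n} p_i^{\,n-1}}{\bar{w}^{\,n(n-1)/2}} \tag{3}$$

Moving the constant term to the left-hand side and raising every term of Equation 3 to the *n*th power,

$$\left(\prod_{i<j} \frac{P_{ij}}{2\,w_{ij}}\right)^{n} = \frac{\prod_{i=1}^{n} p_i^{\,n(n-1)}}{\bar{w}^{\,n^{2}(n-1)/2}} \tag{4}$$

Note that the right-hand sides of Equations 2 and 4 are identical. Further algebraic manipulation and the transitive property of equality (where *A* = *B* and *B* = *C* imply *A* = *C*) allow a single equation containing every genotypic term to be derived:

$$\frac{\left(\prod_{i<j} P_{ij}\right)^{n}}{\left(\prod_{i=1}^{n} P_{ii}\right)^{n(n-1)/2}} = 2^{\,n^{2}(n-1)/2}\, \frac{\left(\prod_{i<j} w_{ij}\right)^{n}}{\left(\prod_{i=1}^{n} w_{ii}\right)^{n(n-1)/2}} \tag{5}$$

Since every term in the above equation is positive, Equation 5 can be simplified by taking the (*n*²(*n* − 1)/2)th root of both sides of the equation. This root is the product of the number of homozygote and heterozygote states:

$$\frac{\left(\prod_{i<j} P_{ij}\right)^{2/(n(n-1))}}{\left(\prod_{i=1}^{n} P_{ii}\right)^{1/n}} = 2\, \frac{\left(\prod_{i<j} w_{ij}\right)^{2/(n(n-1))}}{\left(\prod_{i=1}^{n} w_{ii}\right)^{1/n}} \tag{6}$$

Note that the geometric mean of *k* numbers is the *k*th root of their product. In the absence of assortative mating, patterns of genetic variation reduce to a surprisingly elementary equation. The geometric mean heterozygote frequency divided by the geometric mean homozygote frequency equals two times the geometric mean heterozygote fitness divided by the geometric mean homozygote fitness. Denoting geometric means with asterisks,

$$\frac{P^{*}_{ij}}{P^{*}_{ii}} = 2\, \frac{w^{*}_{ij}}{w^{*}_{ii}} \tag{7}$$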
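As a worked check, the two-allele case (alleles *A* and *B*) can be verified directly from the recursion equations. The block below is a sketch under the model's stated assumptions; the symbols $p_A$, $p_B$, and $\bar{w}$ (pre-selection allele frequencies and mean fitness) follow standard conventions and are introduced here for illustration.

```latex
\begin{align*}
P_{AA} &= \frac{p_A^{2}\, w_{AA}}{\bar{w}}, \qquad
P_{AB} = \frac{2\, p_A p_B\, w_{AB}}{\bar{w}}, \qquad
P_{BB} = \frac{p_B^{2}\, w_{BB}}{\bar{w}} \\[4pt]
\frac{P_{AB}}{\sqrt{P_{AA} P_{BB}}}
  &= \frac{2\, p_A p_B\, w_{AB} / \bar{w}}
          {p_A p_B \sqrt{w_{AA} w_{BB}} / \bar{w}}
   = \frac{2\, w_{AB}}{\sqrt{w_{AA} w_{BB}}}
\end{align*}
```

Allele frequencies and mean fitness cancel, so the ratio of heterozygote frequency to geometric mean homozygote frequency depends only on the fitnesses, as the general result states.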

#### Description of the genotypic ratio:

The above genotypic ratio equation is marked by multiple axes of symmetry: frequencies are on the left-hand side while fitnesses are on the right-hand side, and heterozygous terms are found in numerators while homozygous terms are found in denominators. Genotype frequencies satisfy the above equation after a single generation of random mating and viability selection. As expected, postselection genotype frequencies show increased heterozygosity when heterozygote fitnesses are large relative to homozygote fitnesses. The right-hand side of Equation 7 involves a ratio of fitnesses, indicating that relative, rather than absolute, fitnesses determine genotype frequencies. Under conditions of neutrality Equation 7 reduces to Hardy–Weinberg proportions. However, these proportions also arise when fitnesses are multiplicative (Lewontin and Cockerham 1959). By extension, one can expect to find the same ratio of genotypic frequencies as Hardy–Weinberg when the assumptions of this model are met and the geometric mean heterozygote fitness equals the geometric mean homozygote fitness. Regardless of selection coefficients, heterozygote frequencies are maximized at intermediate allele frequencies. The constant 2 on the right-hand side of Equation 7 is due to diploidy and equivalence between *ij* and *ji* heterozygotes. Singularities in the above equation are unproblematic, as any genotype with zero fitness must also have a postselection frequency of zero. Equation 7 holds for both equilibrium and nonequilibrium populations. Genotype frequencies of natural populations are much easier to obtain than genotype-specific fitnesses. Consequently, Equation 7 allows one to infer the ratio of genotype fitnesses from genotype-frequency data (so long as population size is large and mating is random).
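The inference step described above can be illustrated with a short Python sketch. It assumes genotype frequencies are exact (infinite population, random mating) and strictly positive; the function name `inferred_fitness_ratio` and the example fitnesses and allele frequencies are hypothetical values chosen for demonstration.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

def inferred_fitness_ratio(freqs):
    """Estimate (geometric mean heterozygote fitness) divided by
    (geometric mean homozygote fitness) from genotype frequencies alone.

    freqs: dict mapping genotype (a, b) to its postselection frequency.
    """
    hets = [f for (a, b), f in freqs.items() if a != b]
    homs = [f for (a, b), f in freqs.items() if a == b]
    # Halve the frequency ratio to remove the constant 2 in Equation 7.
    return geometric_mean(hets) / geometric_mean(homs) / 2

# Hypothetical two-allele example: generate frequencies from known
# fitnesses, then recover the fitness ratio without using the fitnesses.
p_A, p_B = 0.6, 0.4
w = {("A", "A"): 0.9, ("A", "B"): 1.0, ("B", "B"): 0.8}
unnormalized = {
    ("A", "A"): p_A ** 2 * w[("A", "A")],
    ("A", "B"): 2 * p_A * p_B * w[("A", "B")],
    ("B", "B"): p_B ** 2 * w[("B", "B")],
}
w_bar = sum(unnormalized.values())
freqs = {g: x / w_bar for g, x in unnormalized.items()}

ratio = inferred_fitness_ratio(freqs)
true_ratio = w[("A", "B")] / math.sqrt(w[("A", "A")] * w[("B", "B")])
print(ratio, true_ratio)
```

Only the ratio of geometric mean fitnesses is recoverable this way; individual genotype fitnesses remain unidentified, and finite samples add sampling error that this sketch ignores.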

Fitness dominance influences the relative proportions of heterozygotes and homozygotes. The ratio of geometric mean heterozygote frequency to geometric mean homozygote frequency (*i.e*., the left-hand side of Equation 7) is denoted by Φ:

$$\Phi = \frac{\left(\prod_{i<j} P_{ij}\right)^{2/(n(n-1))}}{\left(\prod_{i=1}^{n} P_{ii}\right)^{1/n}} \tag{8}$$

Φ < 2 indicates an excess of homozygotes relative to neutral expectations, and Φ > 2 indicates an excess of heterozygotes. When fitnesses are multiplicative (*i.e*., fitness dominance is absent), Φ = 2. Geometric means are always less than or equal to the arithmetic mean. Therefore, additive fitnesses (*i.e*., the fitnesses of heterozygotes are equal to the mean of the relevant homozygotes) result in Φ > 2 whenever the homozygote fitnesses differ. Concave fitness functions (where fitnesses of heterozygotes are greater than the arithmetic mean of the relevant homozygote fitnesses) yield Φ > 2. Depending on heterozygote fitnesses, convex fitness functions yield Φ < 2, Φ = 2, or Φ > 2. Note that enzyme kinetics of metabolic pathways are associated with concave fitness functions (Hartl *et al*. 1985; Gillespie 1991), and overdominance and underdominance are exaggerated forms of concave and convex fitness functions, respectively. MATLAB simulations (Mathworks 2005) verify the effects of fitness dominance and also show that Φ is independent of allele frequency (see Table 1).
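The effect of fitness dominance on Φ can be reproduced with a small two-allele calculation. The following Python sketch is not the MATLAB code referenced above; the function name `phi_two_alleles` and the fitness values are assumptions chosen to illustrate each dominance scheme.

```python
import math

def phi_two_alleles(p, w_AA, w_AB, w_BB):
    """Phi after one generation of random mating and viability selection."""
    q = 1 - p
    w_bar = p * p * w_AA + 2 * p * q * w_AB + q * q * w_BB
    P_AA = p * p * w_AA / w_bar
    P_AB = 2 * p * q * w_AB / w_bar
    P_BB = q * q * w_BB / w_bar
    # With two alleles there is one heterozygote and two homozygotes.
    return P_AB / math.sqrt(P_AA * P_BB)

# Hypothetical homozygote fitnesses; only the heterozygote fitness varies.
w_AA, w_BB = 1.0, 0.64
schemes = {
    "multiplicative": math.sqrt(w_AA * w_BB),  # geometric mean of homozygotes
    "additive": (w_AA + w_BB) / 2,             # arithmetic mean of homozygotes
    "overdominant": 1.1,
    "underdominant": 0.5,
}
for name, w_AB in schemes.items():
    # Evaluate at several allele frequencies to show Phi does not depend on p.
    values = [phi_two_alleles(p, w_AA, w_AB, w_BB) for p in (0.1, 0.5, 0.9)]
    print(name, [round(v, 4) for v in values])
```

Each scheme gives the same Φ at every allele frequency tried, and the multiplicative case returns Φ = 2, the additive and overdominant cases return Φ > 2, and the underdominant case returns Φ < 2.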

The genotypic ratio equation (Equation 7) also holds for subsets of alleles. In principle, this allows genotype-specific fitness effects to be detected. The ratio of geometric mean heterozygote frequency to geometric mean homozygote frequency for a subset of alleles is denoted Φ_{i,j,k…} (*e.g*., for the alleles *A*, *B*, and *C* the genotypic frequency ratio is equal to Φ_{ABC}). For example, if there are three segregating alleles and the genotype *AA* is deleterious relative to all other genotypes, one would expect Φ_{AB}, Φ_{AC}, and Φ_{ABC} to be >2 and Φ_{BC} to be 2. This application can identify nonneutral genotypes of highly polymorphic loci, such as microsatellites or genes encoding blood group antigens. A similar approach has been developed that uses genotype-specific fixation indexes (Alvarez 2008). Note, however, that the absolute magnitude of selection-induced departures from Hardy–Weinberg proportions is expected to be small for most sets of genotypic fitnesses (Pereira and Rogatko 1984; Hernández and Weir 1989). Consequently, sample sizes needed to detect selection would need to be quite large (Weir 1996).
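The three-allele example above can be checked numerically. The Python sketch below assumes exact postselection frequencies; the helper names (`phi_subset`, `postselection`) and the specific allele frequencies and fitness of *AA* (0.8) are hypothetical choices for illustration.

```python
import math
from itertools import combinations, combinations_with_replacement

def geometric_mean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

def postselection(p, w):
    """Postselection genotype frequencies; p maps allele -> frequency."""
    alleles = sorted(p)
    raw = {}
    for a, b in combinations_with_replacement(alleles, 2):
        hw = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        raw[(a, b)] = hw * w.get((a, b), 1.0)  # unlisted genotypes: fitness 1
    w_bar = sum(raw.values())
    return {g: x / w_bar for g, x in raw.items()}

def phi_subset(freqs, alleles):
    """Phi restricted to the genotypes formed by the given alleles."""
    alleles = sorted(alleles)
    hets = [freqs[pair] for pair in combinations(alleles, 2)]
    homs = [freqs[(a, a)] for a in alleles]
    return geometric_mean(hets) / geometric_mean(homs)

# Hypothetical scenario: only genotype AA is deleterious.
freqs = postselection({"A": 0.3, "B": 0.3, "C": 0.4}, {("A", "A"): 0.8})
for subset in ("AB", "AC", "BC", "ABC"):
    print(subset, round(phi_subset(freqs, subset), 4))
```

Because Φ is a ratio of geometric means, rescaling all frequencies by a common factor leaves it unchanged, so the subset frequencies do not need to be renormalized before the calculation.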

If the assumption of constant genotypic fitness is relaxed, the magnitude of the genotype-frequency ratio depends on the number of segregating alleles. Consider a stochastic fitness scenario where genotype-specific fitnesses vary from generation to generation and are drawn from the same arbitrary distribution (*i.e*., no genotype is more fit “on average” than any other genotype). When there is temporal variation in fitness, the geometric mean fitness of genotypes applies (Haldane and Jayakar 1963). The stochastic fitness expectation of Φ is greater than the constant fitness expectation when a small number of alleles are segregating and less than the constant fitness expectation when a large number of alleles are segregating. The geometric mean of a number of independent random variables decreases as the number of variables increases (F. J. Rohlf, personal communication). This is because the geometric mean is sensitive to low values, and each random variable has a chance of resulting in a low value. Random variables in this case refer to genotypic fitnesses. Consequently, the magnitude of Φ is contingent on the relative numbers of heterozygous and homozygous genotypes (which are a function of the number of segregating alleles). Stochastic fitness also influences the genotype-frequency ratio independent of the number of segregating alleles. This is because Φ in a stochastic fitness scenario involves the ratio of two random variables. The geometric mean of a ratio of two identically distributed random variables has an expectation of one. Due to the arithmetic mean–geometric mean inequality, the arithmetic mean of a stochastic fitness ratio is greater than or equal to one, resulting in Φ > 2. Effects of stochastic fitness that depend on the number of alleles and effects that do not combine in a complex manner, and MATLAB simulations indicate that when three or fewer alleles are segregating, Φ > 2 (see Table 1).
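The diallelic case of this stochastic fitness argument can be explored with a short simulation. The Python sketch below is an illustration only, not the MATLAB simulation reported in Table 1: the uniform fitness distribution on [0.5, 1.5], the seed, and the function name `mean_phi_diallelic` are all assumptions made here.

```python
import math
import random

def mean_phi_diallelic(generations, seed=1):
    """Arithmetic mean of per-generation Phi for two alleles when all three
    genotype fitnesses are drawn independently from one distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(generations):
        # Assumed distribution: uniform on [0.5, 1.5] for every genotype.
        w_AA = rng.uniform(0.5, 1.5)
        w_AB = rng.uniform(0.5, 1.5)
        w_BB = rng.uniform(0.5, 1.5)
        # For two alleles, Phi equals 2 * w_AB / sqrt(w_AA * w_BB),
        # independent of allele frequency.
        total += 2 * w_AB / math.sqrt(w_AA * w_BB)
    return total / generations

mean_phi = mean_phi_diallelic(20000)
print(mean_phi)
```

Under these assumptions the average exceeds 2 even though no genotype is fitter on average, which matches the qualitative claim for small numbers of alleles; other fitness distributions or larger allele numbers may behave differently.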

#### Visualization of genotype frequencies:

The high dimensionality of genotype-frequency space makes visualization difficult. However, it is possible to take two-dimensional slices through genotype-frequency space and view possible frequencies for pairs of genotypes (Figure 2). Five different curves are possible, depending on the number of shared alleles and whether the genotypes in question are homozygous or heterozygous. For example, if one genotype in question involves a homozygote (*ii*) and the other genotype involves a heterozygote that shares zero alleles with the homozygote (*jk*), then Figure 2C applies. Given the assumptions of this model, populations can exist only within the solid regions of Figure 2. Areas and shapes of solid regions are contingent on the ratio of the geometric mean heterozygote fitness to the geometric mean homozygote fitness. The exact position of a population genetic state depends on allele frequencies. For example, one will not find *AA* homozygotes and *AB* heterozygotes at high frequencies if a third allele, *C*, happens to be common. Note that heterozygote advantage in a multiallelic system is unlikely to result in the maintenance of many segregating alleles (Lewontin *et al*. 1978), although spatial heterogeneity in selection pressures relaxes these constraints (Star *et al*. 2007).

#### Comparison of heterozygosity and Φ:

Heterozygosity and the genotypic ratio, Φ, are complementary measures of genetic variation. Both measures exhibit an excess of heterozygotes when there is overdominance and an excess of homozygotes when there is underdominance. However, heterozygosity is maximized at intermediate allele frequencies, while Φ is independent of allele frequency. This is because both homozygous and heterozygous genotypes containing rare alleles will be found at low frequencies (canceling out in Equation 7). Heterozygosity varies as allele frequencies change due to selection. By contrast, the genotypic ratio does not change during an adaptive walk. In addition, expected heterozygosity is greater when more alleles are segregating, while Φ is independent of the number of segregating alleles. Equilibrium heterozygosity of neutral loci depends on population size and mutation rate, while Φ = 2 for neutral loci regardless of population size and mutation rate. Both measures of variation decrease over time when there is inbreeding. Positive assortative mating results in an excess of homozygotes, and negative assortative mating results in an excess of heterozygotes. Population structure also affects both measures of variation. When subpopulations differ in allele frequencies, the frequencies of homozygotes in a pooled population are larger than the mean homozygote frequency of unmixed subpopulations (Hedrick 2005). This reduction in heterozygosity due to population structure is known as the Wahlund effect. The above properties of heterozygosity and Φ hint at the ability to distinguish between alternative evolutionary hypotheses. For example, genotypic ratio data can be combined with other information (such as linkage disequilibrium, allele frequency spectra, and reduced heterozygosity) to provide integrated evidence of selection.
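The Wahlund effect on Φ can be demonstrated by pooling two subpopulations that are each in Hardy–Weinberg proportions. This Python sketch uses hypothetical allele frequencies (0.2 and 0.8) and equal subpopulation sizes; the function names are introduced here for illustration.

```python
import math

def hardy_weinberg(p):
    """Genotype frequencies for two alleles under random mating, no selection."""
    q = 1 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}

def phi(freqs):
    """Two-allele Phi: heterozygote over geometric mean homozygote frequency."""
    return freqs["AB"] / math.sqrt(freqs["AA"] * freqs["BB"])

def pool(populations, weights):
    """Weighted average of genotype frequencies across subpopulations."""
    total = sum(weights)
    return {g: sum(wt * pop[g] for pop, wt in zip(populations, weights)) / total
            for g in ("AA", "AB", "BB")}

# Hypothetical subpopulations with divergent allele frequencies.
sub1 = hardy_weinberg(0.2)
sub2 = hardy_weinberg(0.8)
pooled = pool([sub1, sub2], [1, 1])

print(phi(sub1), phi(sub2), phi(pooled))
```

Each subpopulation has Φ = 2, while the pooled sample has Φ well below 2, reflecting the excess of homozygotes produced by population structure alone.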

One common measure of genetic variation is Wright's inbreeding coefficient, *f*. This is equal to one minus observed heterozygosity over expected heterozygosity, $f = 1 - H_{\text{obs}}/H_{\text{exp}}$. An *f* greater than zero corresponds to an excess of homozygotes and an *f* less than zero corresponds to an excess of heterozygotes. While this measure is directly related to the concept of heterozygosity, its relationship to selection coefficients is more convoluted. This is because the magnitude of *f* depends on allele frequencies and does not significantly differ from zero when one allele is rare (see supplemental information). Consider a recessive deleterious allele in mutation–selection balance (*w*_{AA} = 0.9, *w*_{AB} = 1.0, *w*_{BB} = 1.0, *p*_{A} = 0.0032). This scenario results in *f* = −0.0003 and Φ = 2.1082. Each measure of genetic variation is sensitive to a different range of genotype frequencies. If values from different generations and/or loci are averaged, it is possible to have an excess of heterozygotes from one measure and an excess of homozygotes from the other measure. When genotype-frequency data are condensed into a single summary statistic like *f* or Φ, information is unavoidably lost. Thus, a more complete picture of genetic variation arises when both *f* and Φ are calculated (see Tables 1 and 2).
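The recessive deleterious example can be recomputed with a few lines of Python. This sketch assumes that 0.0032 is the pre-selection frequency of the deleterious allele *A* and that expected heterozygosity is computed from postselection allele frequencies; both are interpretations made here, and the function name `f_and_phi` is hypothetical.

```python
import math

def f_and_phi(p_A, w_AA, w_AB, w_BB):
    """Return (f, Phi) after random mating and viability selection."""
    q = 1 - p_A
    w_bar = p_A * p_A * w_AA + 2 * p_A * q * w_AB + q * q * w_BB
    P_AA = p_A * p_A * w_AA / w_bar
    P_AB = 2 * p_A * q * w_AB / w_bar
    P_BB = q * q * w_BB / w_bar
    # Assumption: expected heterozygosity uses postselection allele frequencies.
    p_post = P_AA + P_AB / 2
    expected_het = 2 * p_post * (1 - p_post)
    f = 1 - P_AB / expected_het
    phi = P_AB / math.sqrt(P_AA * P_BB)
    return f, phi

f, phi = f_and_phi(0.0032, 0.9, 1.0, 1.0)
print(round(f, 4), round(phi, 4))
```

Under these assumptions the calculation gives *f* close to zero and Φ clearly above 2, illustrating how a rare allele can leave *f* nearly uninformative while Φ still reflects the fitness ratio.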

#### Genomic analysis of Φ:

The signature of selection tends to be local within the genome, while population structure often results in genomewide patterns. Ideally, one could calculate Φ across all loci and look for outliers (with the reasoning that large departures from Φ = 2 are indicative of selection). A Bayesian formulation of Φ exists for two segregating alleles (Pereira and Rogatko 1984), allowing the estimation of type I and type II error rates. In practice, however, sample sizes are rarely large enough to detect significant departures from Hardy–Weinberg proportions. This problem is compounded when genomic data are used because multiple-testing issues arise. An alternative is to calculate the genomewide mean of Φ for different populations. This allows departures from random mating to be detected, as putatively neutral markers are expected to have a mean Φ = 2. Data from the International HapMap Project are well suited for this type of analysis and were used here (International HapMap Consortium 2003). In this data set, 60 individuals from northern and western Europe (CEU), 45 Han Chinese individuals from Beijing (CHB), 45 Japanese individuals from Tokyo (JPT), and 60 Yoruban individuals from Ibadan, Nigeria (YRI) were sequenced at ∼800,000 SNP markers. HapMap Data Release 23a was used (phase II, March 2008, NCBI B36 assembly). Φ was calculated for 400 randomly selected SNPs covering the short arm of chromosome 3 (see supplemental information for a list of SNPs and genotype frequencies). Linkage disequilibrium in human populations decays substantially over 200 kb (Ke *et al*. 2004). To ensure independence of data points, SNPs were chosen that were at least 200 kb apart. SNPs were required to be polymorphic in all four populations, and an additional criterion was that the heterozygote and both homozygote genotypes were observed. Results are summarized in Figure 3 and Table 2.
European and Chinese populations exhibited an excess of heterozygotes while the Yoruban population exhibited an excess of homozygotes. There is a large spread in values of Φ, owing to the relatively small number of individuals in each sample population. When population data are pooled, mean Φ < 2 (*P* < 0.0001, one-sample *t*-test with 399 d.f.). This is indicative of a Wahlund effect. For each population mean Φ significantly differed from 2 (*P* < 0.0001, one-sample *t*-test with 399 d.f.).
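Two steps of this analysis, spacing SNPs at least 200 kb apart and testing whether mean Φ differs from 2, can be sketched in Python. The greedy thinning rule is an assumption (the article states only that SNPs were randomly selected subject to the spacing requirement), and the positions and Φ values below are made up for illustration; they are not HapMap results.

```python
import math
import statistics

def thin_by_distance(positions, min_gap=200_000):
    """Greedily keep sorted positions that are at least min_gap bp apart."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept

def one_sample_t(values, mu0=2.0):
    """Return (t statistic, degrees of freedom) for H0: mean equals mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1

# Made-up inputs for illustration only.
positions = [10_000, 150_000, 260_000, 300_000, 520_000]
phi_values = [1.8, 2.3, 1.9, 2.1, 1.7, 2.0, 1.6, 2.2]

kept = thin_by_distance(positions)
t_stat, df = one_sample_t(phi_values)
print(kept, round(t_stat, 3), df)
```

Converting the *t* statistic to a *P* value requires the *t* distribution with the returned degrees of freedom (399 for the 400 SNPs analyzed here), which is available in standard statistics packages.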

Numerous selective and demographic causes can explain these patterns. An excess of heterozygotes is consistent with overdominance, associative overdominance, stochastic fitness of diallelic loci, negative selection against deleterious recessive alleles, and positive selection of dominant advantageous alleles. Conversely, an excess of homozygotes is consistent with underdominance, negative selection against deleterious dominant alleles, and positive selection of advantageous recessive alleles. However, values of Φ seen in the HapMap data set would require very large selection coefficients (on the order of 10%). Also, it is unlikely that all loci in question are under selection (Kimura 1983, but see Hahn 2008), and there are no *a priori* reasons why the four HapMap populations would have such different signatures of selection. In contrast to the local footprint of selection, demography yields genomewide patterns. Negative assortative mating, where individuals preferentially mate with individuals with different genotypes, results in an excess of heterozygotes over panmictic expectations. Inbreeding avoidance also results in Φ > 2 (Pusey and Wolf 1996). Both positive assortative mating and the pooling of subdivided populations result in an excess of homozygotes. Each of the four HapMap populations has a different demographic history, potentially explaining why they differ in mean Φ. Alternatively, ascertainment bias could be responsible for the differences between populations. Individuals were selected via different methods for each population, particularly with respect to the presence of couples (International HapMap Consortium 2003). In addition, criteria for ethnic identity ranged from self-identification (Japanese) to all four grandparents sharing the same culture (Yoruban). While it is possible for the effects of selection and population structure to cancel out (resulting in Φ = 2), this is unlikely to occur on a genomic scale.
At present, the above causes cannot be distinguished by genotypic ratio data. Indeed, they are not mutually exclusive and pluralistic explanations are possible.

## CONCLUSION

The ratio of geometric mean heterozygote frequency to geometric mean homozygote frequency is coupled to the effects of natural selection. It provides a measure of genetic variation that is complementary to heterozygosity and can be used to detect the signature of evolutionary processes. As larger numbers of individuals are sequenced (as in Macdonald *et al*. 2005), the utility of the genotypic ratio will increase. Genotype frequencies bear the footprint of differential fitnesses, and elegant mathematical patterns arise from the natural phenomena of Mendelian segregation and Darwinian selection.

## Acknowledgments

I thank A. Onstine, H. Spencer, J. True, S. Yeh, R. Yukilevich, and the Eanes lab for constructive criticism during the preparation of this manuscript. This work was supported by a National Institutes of Health predoctoral training grant (5 T32 GM007964-24).

## Footnotes

Communicating editor: H. G. Spencer

- Received July 3, 2008.
- Accepted August 7, 2008.

- Copyright © 2008 by the Genetics Society of America