Population structure parameters commonly used for diploid species are reexamined for the particular case of tetrasomic inheritance (autotetraploid species). Recurrence equations that describe the evolution of identity probabilities for neutral genes in an “island model” of population structure are derived assuming tetrasomic inheritance. The expected equilibrium value of FST is computed. In contrast to diploids, the correlation of genes between individuals within populations with respect to genes between populations (FST) may vary among loci due to the particular segregation patterns expected under tetrasomic inheritance and is consequently inappropriate for estimating demographic parameters in such populations. We thus define a new parameter (ρ) and derive its relationship withNm. This relationship is shown to be independent from both the selfing rate and the proportion of double reduction. Finally, the statistical procedure required to evaluate these parameters using data on gene frequencies distribution among autotetraploid populations is developed.
DUE to its frequent occurrence among angiosperm species (from 30 to 50%; Stebbins 1971; Grant 1981), polyploidy is now recognized as an important step in the evolutionary diversification of flowering plants (Lewis 1980; Levin 1983; Stebbins 1985; Thompson and Lumaret 1992; Soltis and Soltis 1993; Bretagnolle and Thompson 1995, 1996; Petit et al. 1996, 1997). Polyploid species are commonly classified in two major types according to their presumed origin: allopolyploids are thought to result from hybridization between different taxa and subsequent chromosome doubling, while autopolyploids presumably stem from the chromosome doubling of the same genome, primarily by fusion of unreduced gametes (Bever and Felber 1992; Bretagnolle and Thompson 1995). Autotetraploidy was originally thought to be rare and maladaptive as compared to allopolyploidy. However, a growing number of studies using genetic information in addition to cytological and morphological traits confirm that autopolyploids are more common and of greater evolutionary importance than originally appreciated (Levin 1983; Crawford 1985; Rieseberg and Doyle 1989; Soltis and Soltis 1989).
Due to the addition of divergent genomes, inheritance in allopolyploids is disomic; i.e., pairing behavior during meiosis is similar to that of nonhomologous pairs of chromosomes in diploids. In contrast, segregation patterns in autopolyploids are much more complex because more than two homologous chromosomes can pair during meiosis. Multivalents leading to polysomic inheritance are formed. This does not necessarily lead to random assortments of homologous chromosomes into gametes; two sister chromatids may also segregate into the same gamete (Figure 1). This phenomenon, known as “double reduction,” is specific to autopolyploids. It increases the production of homozygous gametes as compared to what is expected under random chromosome segregation and is thus likely to alter many basic expectations of population genetics (Bever and Felber 1992). Because the frequency of double reduction depends on the occurrence of crossovers between the centromere and the locus under consideration (Figure 1), segregation patterns are expected to vary among loci, obscuring predictions regarding genetic aspects of autopolyploids.
Probably because of the agronomic significance of polyploid species, the consequences of polysomic inheritance and double reduction have been investigated, especially for self-fertilization and regular systems of inbreeding (Haldane 1930; Demarly 1963; Bennett 1968; Gallais 1990). In contrast, few investigations have dealt with the amount and patterns of genetic variation among naturally occurring autopolyploid populations, and theoretical models incorporating population structure and estimation procedures are still lacking for tetrasomic inheritance (Glendinning 1989; Bever and Felber 1992 for review; Moodyet al. 1993).
For diploids, the distribution of genetic diversity within and among natural populations is commonly analyzed using theoretical models of population structure, for instance, the island model or the stepping stone model. Functions of probabilities of gene identity within and between units (populations, subpopulations), such as FST (Wright 1951), can be estimated using isozyme or DNA-based marker diversity and can be compared to expectations under specific models such as Wright's island or isolation by distance models (Slatkin and Barton 1989; Rousset 1997). The relationships between estimates and expectations can then be used to quantify gene flow between the studied units or even to understand how ecological and life history traits may influence the distribution of genetic variation within and among populations (see, for example, Loveless and Hamrick 1984; Hamrick and Godt 1990). However, models of population structure as well as estimation procedures have been almost exclusively devoted to diploid populations (see, however, Wright 1938).
The aim of this article is to develop a theoretical framework for the analysis of population structure in autotetraploid species. Recurrence equations that describe probabilities of gene identity under the island or isolation by distance models may be generalized for the case of tetrasomic inheritance; the case of the island model is given here as an illustration. Equilibrium values for traditional F-statistics parameters are derived. Because the proportion of double reduction may vary over loci, we define an additional function of probabilities of gene identity. This parameter seems appropriate to analyze population structure in autotetraploids, because its relationship with the migration rate and the population size is shown to be independent from both the selfing rate and the proportion of double reduction. Finally, following Weir and Cockerham (1984), we define estimators for the different parameters using the analysis of variance framewor.
HIERARCHICAL GENIC STRUCTURE AND DEFINITION OF PARAMETERS
Let Q stand for the probability of identity, Q0 for pairs of genes within individuals, Q1 for pairs of gene between individuals within subpopulations, and Q2 for pairs of genes between subpopulations. Throughout this article, the notation Qj will refer to probabilities of identity in state (IIS) and the j indices (j = 0 to 2) to the same pairs of genes. The addition of a dot on the top of a parameter will denote probabilities of identity by descent (IBD) (i.e., Q̇j), and the standard notation ≡ will be used to distinguish the definition of parameters from their values under particular models of population structure.
Under tetrasomic inheritance, four genes are available at a given locus. Then, a random pair of genes within individuals (Q0) can be issued either from the same gamete (probability 1/3) or from two different gametes (probability 2/3). If QA and QB denote the probability of IIS associated, respectively, with these two categories of pairs of genes, then Q0 = (QA + 2QB)/3.
Another parameter we will consider is (4)
This parameter is analogous to the “correlation between truly outcrossed mates” in diploids (Waller and Knight 1989; Tachida and Yoshimaru 1996). For diploids, interest in this correlation has come from the fact that the relationship between this parameter, the migration rate, and the population size is independent from the selfing rate (see Nagylaki 1983; Tachida and Yoshimaru 1996). We will show that, for in autotetraploids, this relationship is moreover independent of the proportion of double reduction and therefore identical for all loci independently of their distance to the centromere.
Let us define Q̇r as the IBD probability for two genes in different individuals located either in two different subpopulations (r = 2) or in the same subpopulation (r = 1) and use the relationship between coalescence of genes and identity probabilities (Malécot 1975; Tachida 1985; Slatkin and Voelm 1991): the probability of IBD for a pair of genes is the probability that neither gene has mutated between the present time and the time of first common ancestry, that is, their coalescence time (Malécot 1975; Slatkin 1991). This yields the expression (5) where μ is the mutation rate per generation, P(t) the probability that two genes coalesce at generation t in the past, and T a random variable that describes the coalescence time for these two genes. As shown by Slatkin and Voelm (1991; see also Tachida and Yoshimaru 1996), if we think of the process as going backward in time, then T can be divided in two phases: T1, the waiting time for two genes to be found in the same individual, and T2, the time for the two genes in the same individual to coalesce (Figure 2). As a result, and since T1 and T2 are independent, (6)
Because T2 corresponds to a coalescence time, then using the relationships between the coalescence of genes and identity probabilities, E[(1 – μ)2T2] represents the IBD probability for a pair of genes when both are sampled in the same individual (Figure 2): (7)
Unlike T2, T1 in this instance is not a coalescence time but rather the “waiting time” for two genes initially at distance r to migrate within the same individual (Figure 2). To define T1, we do not make any reference to identity between the two genes under consideration, and this waiting time will depend only on the initial distance between the two genes (r = 1 or 2) and on the way genes migrate within and between subpopulations. Hence, E[(1 – μ)2T1], which we will denote ḣr in what follows, is not an IBD probability but simply denotes the probability that neither gene has mutated during T1. Since double reduction affects only transition probabilities for genes within individuals, it does not affect T1 nor ḣr. These two parameters are consequently independent from the proportion of double reduction.
Now, following (2), and using (3), the IBD probability for two genes in different individuals at distance r reduces to (8) and, putting this formula into (4), yields the following expression for : (9) This parameter is of interest for two reasons: (1) Because the ḣrs are independent of the coefficient of double reduction, this equation shows that this is also true for ; (2) as will be shown later, the expected value of can be deduced with minimal effort from previous models of haploid populations.
Consider now, (10) Noting that we can always write which reduces to (11)
This may be compared to the result of the diploid model with selfing (Tachida and Yoshimaru 1996), in which one can write (12)
EQUILIBRIUM VALUES OF THE PARAMETERS IN AN ISLAND MODEL
We consider a finite island model (Wright 1951) of population structure: a set of n subpopulations, each consisting of N individuals, with nonoverlapping generations. Individuals are monoecious and subpopulations exchange migrant gametes at a rate m. Each migrant has an equal chance of coming from each of the other n – 1 subpopulations. Genes are assumed to be neutral and the mutation rate μ is the same for all alleles. Following Nagylaki (1983) and Crow and Aoki (1984), two notations will be used. After migration, the proportion of pairs of genes that originate from one subpopulation in the previous generation is a = (1 – m)2 + m2/(n – 1) for genes within a subpopulation, and b = (1 – a)/(n – 1) for genes from different subpopulations. In each subpopulation, a proportion S of offspring is produced through selfing and the proportion of double reduction for the studied locus is denoted α.
When individuals are autotetraploid, there are 4N genes in each subpopulation. Then, provided neither gene has mutated [with probability γ = (1 – μ)2], genes originating from the same subpopulation are identical by descent with probability (1 + 3Q̇0)/4N + (1 – 1/N)Q̇1, while genes from different subpopulations are identical by descent with probability Q̇2. The recurrence relations for Q̇1 and Q̇2 are as follows (t denoting time in generation): (13) (14) Combining these two relationships, we obtain (15) At equilibrium, the Qi's do not change, hence (16) Using d = a – b, this equation can be expressed as (17)
Noting that (1 + Q̇A + 2Q̇B)/4 – Q̇1 = (1 + 3Q̇0 – 4Q̇1)/4, then substituting this into (10) and using (17), yields (18) which is the same result as in the diploid (or haploid) model. Using θ = n/(n – 1), Equation 18 becomes (19) i.e., (20) and, using (11), (21)
As one may note, we do not need to know identity probabilities within subpopulations (Q0 and Q 1) to derive these results. For diploids, the expected equilibrium value of FIS depends on the selfing rate (S), and the population size (N). For autotetraploids, it also depends on the proportion of double reduction that increases the proportion of homozygous gametes produced [see, for example, Bennett (1968) for the case of a (single) large autotetraploid population]. When neither selfing nor double reduction are occurring in the population (i.e., S = 0 and α = 0), FIS = 0. Equation 21 can then be further simplified into (22)
Expected values of can be computed for other mutation models as previously described (e.g., Crow and Aoki 1984; Rousset 1996), as well as for other geographical models. Under isolation by distance models, it can be shown that ρr/(1 – ρr) ≈ r/(2Dσ2) + Constant, for a pair of populations at distance r in a one-dimensional model, and ρr/(1 – ρr) ≈ ln(r)/(2Dπσ2) + Constant, in a two-dimensional model, where D is the population density and σ2 is a measure of dispersal (Rousset 1997).
POPULATION PARAMETERS ESTIMATION
Consider a dataset describing the genotypic constitution of autotetraploid individuals sampled (at random) from a set of r subpopulations. Each subpopulation is represented by ni individuals (sample size), where i refers to the ith subpopulation. To build estimators for the level of population differentiation, we use the linear model with hierarchical effects (subpopulations, individuals within subpopulations, and genes within individuals) developed by Cockerham (1969, 1973) for the analysis of diploid population structure. Now xijk is an indicator variable describing the state of the kth gene (1 ≤ k ≤ 4, instead of 1 ≤ k ≤ 2 for diploids) in the jth sampled individual (1 ≤ j ≤ ni) of the ith subpopulation (1 ≤ i ≤ r). For a particular allele u, xijk:u = 1, if the gene is u, xijk:u = 0 otherwise, and the ANOVA setup is as follows: Using the same developments as for diploids (Weir 1996), the following sum of squares expectations can be derived (details are given in the appendix): for genes within individuals (23a) for genes between individuals within subpopulations (23b) and for genes between individuals from different subpopulations (23c) where S1 = Σini, , Wd ≡ S1 – r, Wa ≡ S1 – S2/S1, and Ww ≡ r – 1.
From Equations 23a, 23b, 23c, we obtain (24) (25) which yield an estimator of FST: (26) Now, noting that 1 + 3Q0 – 4Q1 = 1 – Q0 + 4(Q0 – Q1) = ϵ(SSI)/Wd, we have (27) An estimator of F̂IT ≡ 1 – (1 – Q̂1)/(1 – Q̂2) is (28) and (29)
For all these parameters, multilocus estimates (i.e., combining the information from all alleles and all loci) are defined as the sum of locus-specific numerators divided by the sum of locus-specific denominators (see also Reynoldset al. 1983; Weir 1996). For example, (30) where l refers to the lth loci and u to the uth allele (with nl, the number of locus and ul, the number of alleles at locus l). Given the dependency of F-statistics on the proportion of double reduction (see above), multilocus estimates of these parameters will be appropriate to make inferences about the balance between migration (and/or mutation) and drift only if α = 0 for all the studied loci. As soon as α ≠ 0 for at least one locus, only the estimate of ρ will have this property.
The aim of this study was to adapt the use of Wright's FST to estimate population structure and gene flow in autotetraploid species. In contrast to diploids, FST estimates in autotetraploids are expected to vary across the loci as a consequence of different amounts of double reduction during meiosis (Figure 1 and Introduction). This problem is illustrated in Equation (11) because FIS will vary depending on both the selfing rate and the proportion of double reduction (α). Since the proportion of double reduction for a given locus is difficult to assess empirically and because population structure estimates should be based on several loci, we defined a new function of identity probabilities, ρ, which is an analogue to the “correlation between truly outcrossed mates” previously defined for diploids (Waller and Knight 1989; Tachida and Yoshimaru 1996). For both diploids and tetraploids, the relationship between this correlation and the product Nm is independent from the selfing rate (except when selfing affects migration). For autotetraploids, interest in ρ comes mainly from the fact that this relationship is also independent of the proportion of double reduction and therefore identical for all loci independently of their distance to the centromere. The parameter ρ can consequently be used to assess population structure over many loci, without any prior knowledge concerning the proportion of double reduction.
Inspection of the relationship between ρ and FST (11) shows that FST is increased by a factor (1 + 3FIS)/4 when self-fertilization or double reduction occurs within subpopulations. This means that like self-fertilization, double reduction reduces the effective subpopulation size and hence promotes differentiation among subpopulations (for the studied locus). The complication due to partial selfing or double reduction can be absorbed in the single parameter FIS and by defining the effective population size as NZ = N/(1 + 3FIS). Equation (21) can then be used with NZ replacing N, i.e., FST/(1 – FST) = 1/(8NZmθ + 8NZμ), while ρ is still equal to ρ/(1 – ρ) ≈ 1/(2Nmθ + 2Nμ), which depends only on the migration rate, mutation rate, and the demographic population size (i.e., N, not NZ). Comparison of Equation 11 with the results of the diploid model (12) further shows that self-fertilization has a greater influence on differentiation in autotetraploids as compared to diploids.
When ignoring selfing and double reduction, the expected effect of drift under the island model of population structure is halved at equilibrium as compared to expectations for diploids, i.e., FST ≈ 1/(1 + 4Nmθ + 4Nμ) (Crow and Aoki 1984; Cockerham and Weir 1987). This can be interpreted as the decrease in the rate of coalescence of genes within subpopulations and is due to the fact that the probability of drawing the same gene within an individual is reduced to 1/4 in an autotetraploid species instead of 1/2 in diploids. In other words, this means that, for a same demographic population size, the effective population size is doubled in an autotetraploid population as compared to a diploid one. This result is in accordance with earlier work on autotetraploids (Moodyet al. 1993) that assumed that mutation alone was opposed to genetic drift (i.e., m = 0, FIS = 0, α = 0): FST = 1/(1 + 8Nμ).
Following the linear model derived in Cockerham (1969, 1973) for diploid data, estimators for F-statistics and ρ can be computed through hierarchical analyses of variance of gene frequencies. Simulations were performed to assess possible bias in the estimation of ρ due to small sample sizes. We simulated a finite island model composed of n monoecious subpopulations of size N. In each subpopulation, 10 neutral, independent loci (recombination rate = 0.5), each with K possible allelic states (K-allele model), and all segregating according to the same proportion of double reduction (α) were modeled. Initial frequencies of the different allelic states were made equal in all the subpopulations (initial frequency = 1/K). Each subpopulation had the same mating system: complete outcrossing (S = 0) or partial selfing (S ≠ 0). We assumed discrete and nonoverlapping generations. Mutation occurs at a rate μ per locus per generation, each allele having an equal chance to mutate toward one of the K-1 other allelic states. Migration occurs through male gametes only: to produce the next generation in a given subpopulation, each pollen grain was sampled independently, and with probability m it was chosen among gametes from the remaining n – 1 subpopulations. As shown in Figure 3, the discrepancies between the average value of the estimator and the expected value of ρ are very small even for small sample sizes, with either S = 0 or S ≠ 0 and α ≠ 0.
We wrote a computer program estimating F-statistics and the parameter ρ according to the ANOVA setup developed above (details of the computations are given in the appendix). The program provides estimations for ρ, FST, FIS, and FIT for each allele as well as estimates combining data over alleles and over loci. To test for a departure from FST = 0, the program allows for Fisher's exact test on (population × genotypes) contingency tables [for each locus separately, see Raymond and Rousset (1995) for the diploid model]. Exact tests on contingency tables in which cell counts are tetraploid genotypes are valid even if there is double reduction. As for diploid datasets (Raymond and Rousset 1995), the software allows for analysis either over the whole set of populations or for pairs of populations. The program containing both estimations and exact tests procedures is available upon request.
We thank D. Couvet and P. Jarne for discussions, M. Raymond for advice concerning the computer program, and J. M. Prosperi for comments on the manuscript. This work was supported by a grant from the French “Bureau des Ressources Génétiques” to E.J. and J.R. This is contribution number 98-085 of the Institut des Sciences de l'Evolution.
Computation of expected sum of squares of gene frequencies involved in estimating ρ and F-statistics: Let Eu denote the expectation of xijk:u, and πu, the expected frequency of the allele u. Then , where ϵ denotes expectation. Then, summing over all alleles, we obtain (A1) where Q3 denotes the identity probability for genes from different independent replicate populations, then, in the following, we write [xijk – E] for the sum over alleles. Using the relationship (xk – xk′)2 = (xk – E)2 + (xk′ – E)2 – 2 · (xk – E) · (xk′ – E), we obtain a useful equation for the covariance of two genes, i.e., (A2) This, derived for different pairs of genes, yields the covariances (A3) for genes within individuals, (A4) for genes between individuals within subpopulations, and (A5) for genes between individuals in different subpopulations. These relationships can be used to derive the expectations i.e., (A6) i.e., (A7) i.e., (A8) where S1 = Σini and .
Now, the basic relationship can be used to write sum of squares expectations, for genes within individuals, and using (A6) and (A7), we obtain (A9)
Following the same procedure and denoting Wd ≡ S1 – r, Wa ≡ S1 – S2/S1, and Ww ≡ r – 1, we find the following sum of squares expectations: for genes between individuals within subpopulations (using A6 and A7), (A10) for genes between individuals from different subpopulations (using A7 and A8), (A11)
As for diploids (Cockerham and Weir 1987), the components of variance of the nested ANOVA model (xijk:u = μu + αi:u + βj:u + ϵijk:u) can be expressed as linear functions of identity probabilities, i.e., (A12) (A13) and (A14)
ANOVA framework for the estimation of ρ and F-statistics: To compute sum of squares, straight way gene frequencies were used instead of indicator variables (xijk). This method is based on the following relationships between gene frequency estimates and the indicator variable [see Weir (1996) for the diploid case], where p̃Ai = ΣjΣkxijk/4ni and P̃AAi = P̃0,i + P̃1,i/2 + P̃,2;i/6 with P0,i, P̃1,i, and P̃2,i standing, respectively, for the proportion of monogenic (AAAA), trigenic (AAAa), and digenic (AAab) individuals in the ith population (Malécot 1948). These relationships yield more convenient expressions for the variance components of the analysis, as shown in Table 1.
Communicating editor: M. Slatkin
- Received March 4, 1998.
- Accepted June 23, 1998.
- Copyright © 1998 by the Genetics Society of America