Interpretation of Variation Across Marker Loci as Evidence of Selection
Renaud Vitalis (a,b,c), Kevin Dawson (a,d), and Pierre Boursot (a)
a Laboratoire Génome, Populations et Interactions, Université Montpellier II, 34095 Montpellier Cedex 05, France,
b Laboratoire Génétique et Environnement, Institut des Sciences de l'Évolution de Montpellier, Université Montpellier II, 34095 Montpellier Cedex 05, France,
c Station Biologique de la Tour du Valat, 13200 Arles, France
d I.A.C.R. Long Ashton Research Station, Department of Agricultural Science, University of Bristol, Bristol BS41 9AF, United Kingdom
Corresponding author: Renaud Vitalis, Laboratoire Génétique et Environnement, C.C. 065, Institut des Sciences de l'Évolution de Montpellier, Université Montpellier II, Place Eugène Bataillon, 34095 Montpellier Cedex 05, France., vitalis{at}isem.univ-montp2.fr (E-mail)
| ABSTRACT |
|---|
Population structure and history have similar effects on the genetic diversity at all neutral loci. However, some marker loci may also have been strongly influenced by natural selection. Selection shapes genetic diversity in a locus-specific manner. If we could identify those loci that have responded to selection during the divergence of populations, then we may obtain better estimates of the parameters of population history by excluding these loci. Previous attempts were made to identify outlier loci from the distribution of sample statistics under neutral models of population structure and history. Unfortunately these methods depend on assumptions about population structure and history that usually cannot be verified. In this article, we define new population-specific parameters of population divergence and construct sample statistics that are estimators of these parameters. We then use the joint distribution of these estimators to identify outlier loci that may be subject to selection. We found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states represented in the pooled sample at each locus. This is so because the conditional distribution is less sensitive to the values of nuisance parameters.
PRESUMED neutral polymorphic loci are commonly used in making inferences about patterns of differentiation within or among populations of the same or closely related species. For this purpose, genetic distances (see, e.g., ![]()
However, misinterpretations can occur if one is not able to clearly distinguish between the patterns generated by random genetic drift and those generated by natural selection. The problem is that selective processes can also affect neutral loci. A locus that is neutral will respond to selection whenever it is in linkage disequilibrium (statistical association among allelic states at different loci) with other loci that are subject to selection. Such associations may arise by chance in small populations.
Therefore, it is of prime interest to identify loci that are responding to selection to exclude them from the genetic analysis of population structure or history. It was recognized early on by ![]()
$\hat{F}$, the standardized variance of gene frequency, which is an estimator of the parameter FST. Their first test is a goodness-of-fit test comparing the observed distribution of $\hat{F}$ estimates (one estimate from each locus) to a $\chi^2$ distribution with (n - 1) d.f., where n is the number of populations sampled. The second test is based on the comparison of the observed variance of $\hat{F}$ (across loci), denoted $s_F^2$, with the theoretical variance approximated as

$$\sigma^2 = \frac{k\,\bar{F}^2}{n - 1} \tag{1}$$

where $\bar{F}$ is the mean value of $\hat{F}$ averaged across loci, and k is a constant. The ratio $s_F^2/\sigma^2$ should be distributed approximately as a $\chi^2$/d.f., the number of degrees of freedom being determined by the number of biallelic loci.
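The variance comparison in Equation 1 can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the default `k = 2.0` is an assumed value chosen for the example, not one given in the text above.

```python
from statistics import mean, variance


def variance_ratio(f_values, n_populations, k=2.0):
    """Compare the across-locus variance of F-hat with Equation 1.

    f_values      : one F-hat estimate per locus
    n_populations : number of populations sampled (n in Equation 1)
    k             : constant of Equation 1 (2.0 is an assumed, illustrative value)
    """
    f_bar = mean(f_values)                          # mean F-hat across loci
    s2_obs = variance(f_values)                     # observed sample variance
    sigma2 = k * f_bar ** 2 / (n_populations - 1)   # theoretical variance, Equation 1
    return s2_obs, sigma2, s2_obs / sigma2


# Hypothetical estimates for three loci sampled in three populations.
observed, expected, ratio = variance_ratio([0.1, 0.2, 0.3], n_populations=3)
print(observed, expected, ratio)
```

A ratio well above 1 would indicate more heterogeneity across loci than Equation 1 allows for, subject to the caveats about population history discussed next.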
However, since populations of the same species share, to a certain extent, a common history and since populations are connected through the dispersal of individuals, $\hat{F}$ values will be correlated across loci. For example, the geographic and historical relationships between populations may have a hierarchical structure if populations have been derived from a common ancestral population by a sequence of successive splits. This is the pattern to be expected following the fragmentation of a species range. The effect of such a population history is always to increase the expected variance of $\hat{F}$.
More recently, ![]()
Thus, their approach might be flawed whenever the true population history consists of repeated branching events or when the connectivity of populations is uneven. However, we cannot infer patterns of migration or historical branching and test for the homogeneity of the markers with the same data. This is what ![]()
| THE MODEL |
|---|
We consider two haploid populations of constant sizes N1 and N2, which completely separated $\tau$ generations ago from a single population of stationary size N0. By complete separation, we mean that the populations did not exchange any migrants between the time of the split and the present. We do not assume that the common ancestral population was at equilibrium when it split. Instead, we allow the ancestral population to have gone through a bottleneck $\tau_0$ generations before present (with $\tau_0 > \tau$). Before this, the ancestral population was at mutation-drift equilibrium, with constant size Ne. Generations do not overlap. New mutations arise at a rate µ and follow the infinite allele model (IAM). This model of population divergence is illustrated in Fig 1.
Let Qw,i be the probability that two genes sampled at random within population i are identical by descent (IBD) and Qa be the probability that a gene sampled at random from population 1 is IBD to a gene sampled at random from population 2. IBD probabilities are defined as the probabilities that two genes have not mutated since their most recent common ancestor.
More generally, let Qh denote the IBD probability of any pair of genes: h = (w, i) when two genes are sampled within population i, or h = a when one gene is sampled from each population. It is possible to give an expression for Qh as a function of the coalescence time,

$$Q_h = \int_0^{\infty} \gamma^{t}\, c_h(t)\, dt \tag{2}$$

where $c_h(t)$ is the probability density of the coalescence time t for a pair of genes of type h and $\gamma \equiv (1 - \mu)^2$. The waiting time for a coalescent event in a population of size Ni has an exponential distribution with mean Ni. The IBD probability for a pair of genes in population i reduces to

$$Q_{w,i} = \int_0^{\tau} \gamma^{t}\, \frac{1}{N_i}\, e^{-t/N_i}\, dt + (1 - C_i)\, Q_0 \tag{3}$$
where Q0 is the IBD probability for two genes sampled at random from the common ancestral population at time $\tau$ (just before the split) and $(1 - C_i) = \gamma^{\tau} e^{-\tau/N_i}$ is the probability that the two genes neither coalesce nor mutate in the ith population in the time interval $0 < t \le \tau$. The first term on the right-hand side of Equation 3 is the probability that the two genes coalesce in the time period $0 < t \le \tau$ and are IBD. Following Equation 2, the IBD probability for a pair of genes sampled at random from the common ancestral population just before the split at time $\tau$ is given by

$$Q_0 = \int_{\tau}^{\tau_0} \gamma^{t - \tau}\, \frac{1}{N_0}\, e^{-(t - \tau)/N_0}\, dt + (1 - C_0)\, \frac{1}{1 + \theta} \tag{4}$$
where $(1 - C_0) = \gamma^{\tau_0 - \tau} e^{-(\tau_0 - \tau)/N_0}$ is the probability that the two genes neither coalesce nor mutate in the time interval $\tau < t \le \tau_0$. The first term on the right-hand side of Equation 4 averages over the coalescent events occurring during the population bottleneck. During this time interval ($\tau < t \le \tau_0$) the waiting time for a coalescent event is exponentially distributed with mean N0. The last term in Equation 4 averages over coalescent events occurring in the ancestral population at mutation-drift equilibrium. This last term involves the IBD probability for two randomly sampled genes in a stationary population of size Ne, which is $1/(1 + \theta)$, with $\theta = 2N_e\mu$. Solving the integrals in the low-mutation limit (where $\gamma^{t} \approx e^{-2\mu t}$), we find that the solution of Equation 3 is

$$Q_{w,i} = \frac{1}{1 + \theta_i}\left[1 - e^{-T_i(1 + \theta_i)}\right] + Q_0\, e^{-T_i(1 + \theta_i)} \tag{5}$$
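where $\theta_i = 2N_i\mu$ and $T_i = \tau/N_i$. The value of Q0 is given by the solution of Equation 4,

$$Q_0 = \frac{1}{1 + \theta_0}\left[1 - e^{-T_0(1 + \theta_0)}\right] + \frac{1}{1 + \theta}\, e^{-T_0(1 + \theta_0)} \tag{6}$$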
where $\theta_0 = 2N_0\mu$ and $T_0 = (\tau_0 - \tau)/N_0$. The probability for a gene in population 1 to be IBD with a gene in population 2 is just given by

$$Q_a = \gamma^{\tau}\, Q_0 \approx e^{-2\mu\tau}\, Q_0 \tag{7}$$
Obviously, two such genes cannot coalesce during the $\tau$ generations between the moment of divergence and the present. They are IBD only if their respective ancestors are IBD when populations 1 and 2 diverge and, furthermore, if they do not undergo mutation during the divergence. Now, it is useful to consider the parameter

$$F_i \equiv \frac{Q_{w,i} - Q_a}{1 - Q_a} \tag{8}$$
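As a numerical check on the expressions for the IBD probabilities, the following sketch evaluates the low-mutation solutions for Q0, Qw,i and Qa and then the parameter Fi of Equation 8. The function name and the parameter values in the usage lines are hypothetical. The example uses a large, diverse ancestral population so that mutations arising during the divergence are negligible, in which case Fi should be close to 1 - exp(-Ti).

```python
import math


def divergence_parameters(N_i, N0, Ne, tau, tau0, mu):
    """Return (Q_w_i, Q_a, F_i) for one daughter population (low-mutation limit)."""
    theta = 2 * Ne * mu          # ancestral population at equilibrium
    theta0 = 2 * N0 * mu         # ancestral population during the bottleneck
    theta_i = 2 * N_i * mu       # daughter population i
    T_i = tau / N_i
    T0 = (tau0 - tau) / N0

    decay0 = math.exp(-T0 * (1 + theta0))
    Q0 = (1 - decay0) / (1 + theta0) + decay0 / (1 + theta)      # ancestral IBD probability

    decay_i = math.exp(-T_i * (1 + theta_i))
    Q_w = (1 - decay_i) / (1 + theta_i) + decay_i * Q0           # within population i

    Q_a = math.exp(-2 * mu * tau) * Q0                           # between populations
    F_i = (Q_w - Q_a) / (1 - Q_a)                                # Equation 8
    return Q_w, Q_a, F_i


# Hypothetical values giving T_i = 0.1.
Q_w, Q_a, F_i = divergence_parameters(N_i=1000, N0=1e6, Ne=1e6,
                                      tau=100, tau0=200, mu=1e-5)
print(F_i, 1 - math.exp(-0.1))
```

When the ancestral diversity is small relative to the mutations accumulated during divergence, the two printed values can differ noticeably, which is worth keeping in mind when reading the approximation that follows.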
It is worth noting that the weighted sum of Fi over the two populations gives the intraclass correlation for the probability of identity by descent for genes within populations relative to genes between populations. This is of particular interest, because the properties of the intraclass correlations for the probability of identity in state ("IIS correlations"; ![]()
If we neglect new mutations arising during the divergence process, Qa reduces to Q0 and Qw,i = Ci(1 - Q0) + Q0. Thus

$$F_i \approx 1 - e^{-T_i} \tag{9}$$

Note that Equation 9 gives a well-known result when both daughter populations are assumed to have the same size N, so that $F_1 = F_2 = F \approx 1 - e^{-\tau/N}$. In this approximation, Fi does not depend on $\theta$ or T0. This suggests that a simple moment-based estimator $\hat{T}_i$ of branch length can be derived as

$$\hat{T}_i = -\ln\left(1 - \hat{F}_i\right) \tag{10}$$

where $\hat{F}_i$ is an estimator of Fi (see Appendix for details).
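The estimation step can be illustrated with allele counts from two samples. The sketch below assumes a plug-in form: the within-population identity in state is estimated as in Equation A3 of the Appendix, the between-population identity as the product of sample frequencies, and both are substituted into Equation 8 before applying Equation 10. The function names are hypothetical, and the two count lists must list the same alleles in the same order.

```python
import math


def iis_within(counts):
    """Unbiased within-population identity in state from allele counts (Eq. A3)."""
    n = sum(counts)
    return (n * sum((c / n) ** 2 for c in counts) - 1) / (n - 1)


def iis_between(counts1, counts2):
    """Identity in state for one gene from each sample (assumed plug-in form)."""
    n1, n2 = sum(counts1), sum(counts2)
    return sum(a * b for a, b in zip(counts1, counts2)) / (n1 * n2)


def branch_lengths(counts1, counts2):
    """Return [(F1_hat, T1_hat), (F2_hat, T2_hat)]; T_hat is None when undefined."""
    q_a = iis_between(counts1, counts2)
    if q_a >= 1.0:
        # Both samples are fixed for the same allele: Equation 8 is undefined.
        return [(None, None), (None, None)]
    results = []
    for counts in (counts1, counts2):
        f_hat = (iis_within(counts) - q_a) / (1.0 - q_a)                  # Equation 8
        # Equation 10; treated as failing for negative or unit F_hat.
        t_hat = -math.log(1.0 - f_hat) if 0.0 <= f_hat < 1.0 else None
        results.append((f_hat, t_hat))
    return results


# Hypothetical biallelic locus, 100 genes sampled in each population.
print(branch_lengths([80, 20], [20, 80]))
```

Returning `None` rather than a negative branch length is a design choice made here so that loci for which the moment estimator fails are easy to filter out.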
| PROPERTIES |
|---|
Simulation procedure:
For each set of parameter values, a sequence of artificial data sets was generated using standard coalescent simulations. The genealogy of the genes sampled in each population is first generated separately for the two populations, back to $\tau$ generations in the past. During this period, all the coalescent events are separated by exponentially distributed time intervals, with means $N_1/\binom{n_1}{2}$ in population 1 and $N_2/\binom{n_2}{2}$ in population 2 (see Equation 3), where n1 and n2 are the numbers of lineages remaining in each population. At time $\tau$, the number n0 of lineages that remain represents the ancestors of all the genes sampled in populations 1 and 2. The genealogy of these lineages is generated for the time period [$\tau$, $\tau_0$], and all the coalescence events are separated by exponentially distributed time intervals, with mean $N_0/\binom{n_0}{2}$ (see the first term in the right-hand side of Equation 4). At time $\tau_0$, the lineages that remain are the ancestors of all the genes sampled in populations 1 and 2. The genealogy of these ne genes is generated for the period [$\tau_0$, $+\infty$], with all coalescent events separated by exponentially distributed time intervals with mean $N_e/\binom{n_e}{2}$ (see the second term in the right-hand side of Equation 4). Once the complete genealogy is obtained, the mutation events are superimposed on the coalescent tree of lineages. In the results that follow, each artificial data set consisted of two (haploid) samples of size n = 100, one from population 1 and the other from population 2.
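The simulation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used for the article: instead of building the full genealogy and superimposing mutations afterward, it interleaves coalescence and mutation events backward in time, which gives the same allelic configurations under the infinite allele model. All function names and the parameter values in the usage lines are hypothetical, and the mutation rate must be strictly positive.

```python
import itertools
import math
import random


def run_phase(lineages, N, mu, duration, alleles, labels, rng):
    """Run coalescence and IAM mutation backward for `duration` generations.

    Each lineage is a list of sampled genes that still lack an allele label.
    A mutation labels every gene under the lineage and ends that lineage.
    """
    t = 0.0
    while lineages:
        n = len(lineages)
        coal_rate = n * (n - 1) / (2.0 * N)   # mean waiting time N / C(n, 2)
        mut_rate = n * mu
        t += rng.expovariate(coal_rate + mut_rate)
        if t > duration:
            break
        if rng.random() * (coal_rate + mut_rate) < coal_rate:
            a = lineages.pop(rng.randrange(len(lineages)))
            b = lineages.pop(rng.randrange(len(lineages)))
            lineages.append(a + b)            # two lineages merge
        else:
            label = next(labels)              # new allele, never seen before
            for gene in lineages.pop(rng.randrange(len(lineages))):
                alleles[gene] = label
    return lineages


def simulate_locus(n, N1, N2, N0, Ne, tau, tau0, mu, rng):
    """Return the allelic states of n genes sampled in each of two populations."""
    alleles, labels = {}, itertools.count()
    pop1 = run_phase([[(1, j)] for j in range(n)], N1, mu, tau, alleles, labels, rng)
    pop2 = run_phase([[(2, j)] for j in range(n)], N2, mu, tau, alleles, labels, rng)
    # Remaining lineages join the common ancestral population (size N0, then Ne).
    ancestral = run_phase(pop1 + pop2, N0, mu, tau0 - tau, alleles, labels, rng)
    run_phase(ancestral, Ne, mu, math.inf, alleles, labels, rng)
    sample1 = [alleles[(1, j)] for j in range(n)]
    sample2 = [alleles[(2, j)] for j in range(n)]
    return sample1, sample2


rng = random.Random(1)
s1, s2 = simulate_locus(n=20, N1=1000, N2=1000, N0=500, Ne=5000,
                        tau=100, tau0=300, mu=1e-4, rng=rng)
print(len(set(s1 + s2)), "allelic states in the pooled sample")
```

Ending a lineage as soon as it mutates keeps the sketch short, since under the IAM nothing that happens further back in time can change the allelic state of the genes below that mutation.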
Simulation results:
By calculating the estimators $\hat{F}_1$ and $\hat{F}_2$ for each of these artificial data sets, it was possible to obtain a close approximation to the expected distribution of these estimators (see Appendix for details). Fig 2 shows this expected joint distribution of $\hat{F}_1$ and $\hat{F}_2$ for various combinations of the nuisance parameters $\theta$ and T0. In this case, the "true" branch lengths were T1 = T2 = 0.1 (hence F1 = F2 ≈ 0.0953). The expected value of the estimator $\hat{F}_1$ (respectively $\hat{F}_2$) was always close to the value of the parameter F1 (respectively F2). One can show that, by construction, the points ($\hat{F}_1$, $\hat{F}_2$) lie within the upper-right triangle with vertices (1, 1), (-1, 1), and (1, -1). The joint distribution of these two statistics has a negative correlation. Most importantly, it is clear from this figure that the joint distribution of $\hat{F}_1$ and $\hat{F}_2$ depends strongly on the nuisance parameters, even though their expectations remain close to the true values of F1 and F2.
It can be seen that, for smaller values of T0, the joint distribution becomes tighter as $\theta$ increases. On the other hand, for larger values of $\theta$, the distribution is found to widen as T0 increases. In both cases, it is the level of variation that remains before divergence that is crucial in shaping the joint distribution. With small $\theta$ and large T0, the lineages coalesce rapidly before the divergence, and the number of distinct mutations (allelic states) that can be maintained is small. In this case, the variance of the estimates of population branch lengths is large, as illustrated by the wide joint distribution of $\hat{F}_1$ and $\hat{F}_2$. Therefore, the joint distribution of $\hat{F}_1$ and $\hat{F}_2$ is not ideal for investigating the homogeneity of response of a set of molecular markers to the genealogical processes. Indeed, other factors such as heterogeneous mutation rates across loci may be invoked to explain disparities of branch length estimates among markers. Fortunately, this problem can be overcome by considering the joint distribution of $\hat{F}_1$ and $\hat{F}_2$, conditional upon the total number k of allelic states in the pooled sample at each locus. Fig 3 shows the estimated joint distribution for T1 = T2 = 0.1 (hence F1 = F2 ≈ 0.0953), conditioned on k = 4. The combinations of nuisance parameter values are the same as in Fig 2.
The expected joint conditional distribution appears to be almost independent of the nuisance parameters. So, given the observed values for the parameters F1 and F2, and given the number of alleles in the sample, one can obtain the conditional joint distribution, and then a high probability region that should contain 95% of the observed pairwise $\hat{F}_i$ values. This result provides the justification for using the conditional distributions to analyze the homogeneity in the patterns of genetic differentiation revealed by a (large) set of markers.
| APPLICATIONS |
|---|
In this section, we present a methodology for identifying outlier loci by a pairwise analysis of populations. For each pair of populations (i, j), we suggest the following protocol:
- For all loci, the statistics $\hat{F}_i$ and $\hat{F}_j$ are computed (see Appendix).
- The parameters Fi and Fj are estimated as the averages among loci weighted by the heterozygosity estimates for populations i and j, respectively (see Appendix). This corresponds to the weighting of loci suggested by WEIR and COCKERHAM (1984) for the multilocus estimator of FST.
- The expected joint distribution of $\hat{F}_i$ and $\hat{F}_j$ is generated by performing 10,000 coalescent simulations for a given set of nuisance parameter values. This is repeated using a wide range of values for the nuisance parameters. In the D. simulans data set discussed below, all the pairwise combinations for $\theta$ and T0 were performed, with $\theta$ = 1, 5, or 10 and T0 = 0.01, 0.1, or 1. Thus, a total of 90,000 coalescent simulations were performed in this example. The simulated sample sizes are chosen to be representative of those actually realized in the real data set.
- For each expected joint distribution of $\hat{F}_i$ and $\hat{F}_j$, we construct all the distributions, conditional on the number of allelic states k in the pooled sample, for k = 2, 3, ... (the pooled sample is the sample obtained by pooling the samples from populations i and j). Remember, there is one expected distribution for each set of nuisance parameter values. For each conditional distribution, we identify the "high probability" or "high density" region, in the range of the points $\hat{F}_i$ and $\hat{F}_j$, where 95% of the data are expected to lie (see Appendix for the construction of this high probability region).
- For each value of the number of allelic states in the pooled sample, we superimpose a scatter plot of the observed data points (pairs of $\hat{F}_1$ and $\hat{F}_2$ values) over an outline of the 95% high probability region to identify outlier loci.
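The last two steps of the protocol can be sketched as follows. The construction used here is an assumption made for illustration (the article's own construction is given in its Appendix): simulated points are grouped by the number of allelic states, the most frequent points are accumulated until they hold 95% of the simulations, and observed loci falling outside that set are reported. Function names and the toy data are hypothetical, and estimates are rounded so that identical allele-count configurations map to the same point.

```python
from collections import Counter


def _key(f1, f2, ndigits=6):
    """Round a pair of estimates so equal configurations compare equal."""
    return (round(f1, ndigits), round(f2, ndigits))


def high_probability_region(simulated, k, level=0.95):
    """Set of (F1_hat, F2_hat) points holding `level` of the simulations with k alleles.

    simulated : iterable of (F1_hat, F2_hat, number_of_alleles) triples
    """
    points = Counter(_key(f1, f2) for f1, f2, kk in simulated if kk == k)
    total = sum(points.values())
    region, mass = set(), 0
    for point, count in points.most_common():   # most frequent points first
        if mass >= level * total:
            break
        region.add(point)
        mass += count
    return region


def outlier_loci(observed, region):
    """Names of observed loci whose point lies outside the region."""
    return [name for name, (f1, f2) in observed.items() if _key(f1, f2) not in region]


# Toy simulated distribution: 100 replicates with k = 4, 50 with k = 3.
sims = ([(0.1, 0.1, 4)] * 90 + [(0.5, 0.0, 4)] * 6 +
        [(0.9, -0.2, 4)] * 4 + [(0.3, 0.3, 3)] * 50)
region = high_probability_region(sims, k=4)
print(outlier_loci({"locusA": (0.1, 0.1), "locusB": (0.9, -0.2)}, region))
```

Working with the discrete set of simulated points, rather than a smoothed density, mirrors the discrete nature of allele-count data.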
D. simulans data set:
We applied this method to a D. simulans data set, described in ![]()
($\hat{F}_1$, $\hat{F}_2$) estimates.
In the great majority of cases, the points fall within the 95% confidence region. With 43 loci we would expect two (0.05 × 43 ≈ 2) to lie outside the region by chance. But considering the joint distributions for loci with three or more alleles, we found 4 loci that clearly lie outside. Caution is required in the case of loci that lie on the borders of the possible range (Fig 4B). These correspond to loci that have an allele fixed in one population. Slight variations in the nuisance parameters can increase or decrease the relative proportion of loci that may fix one allele in a population. Indeed, we found some conditions under which the 95% envelope contained these 2 loci. This problem can remain even when we condition on the observed number of alleles. On the other hand, 2 other loci (coding for glutamate pyruvate transaminase and carbonic anhydrase-3) are clear outliers of the expected distributions (Fig 4, C and D). In all pairwise comparisons that included the French population, these two loci fell either outside, or on the edges of the 95% high probability region.
In all the pairs that included the population from Congo, two loci coding respectively for the larval protein-10 (Pt-10) and the phosphoglucomutase (PGM) were found to lie outside or on the limit of the 95% high probability region (Fig 5). The locus coding for the larval protein-10 systematically gives a longer estimated branch length for this African population than do all other loci, while it gives similar branch lengths to other loci for the other populations. This suggests that genetic variation was severely reduced by a factor other than genetic drift in this African population. The locus coding for phosphoglucomutase gives a longer branch length estimate than the other loci in three cases (Fig 5, A-C) and a shorter one in one case (Fig 5D). The locus coding for phosphoglucomutase was also found to lie outside the limit of the 95% high probability region in all the pairs that included the population from Seychelle Island (Fig 6). To strengthen our presumption that these loci were outside the limit allowed by a neutral model, we checked whether these loci also lie outside the limit of the 99% high probability region. The same results were obtained. For these loci, we did not find any plausible neutral scenario of divergence by drift that could provide such a scatter of points. We thus conclude that natural selection may have acted on these loci or on closely linked regions within the genome.
We are more cautious about claiming that the loci coding for glutamate pyruvate transaminase and carbonic anhydrase-3 were or are subject to selection. These loci are clear outliers in some pairwise comparisons involving the French population but fall just within the limits of the confidence region in other comparisons. Moreover, when considering 99% confidence regions instead of 95% confidence regions, some loci were no longer detected as outliers but rather as lying on the edges of the confidence limit. The locus coding for isocitrate dehydrogenase-1 was found to be an outlier in three (out of four) pairs that included the population from Seychelle Island. Overall, six more loci were detected as outliers in single pairwise comparisons only. Therefore, we should be very cautious about considering those latter loci as being under selection. Indeed, if a locus has responded to selection in one particular contemporary population since it became isolated, then we expect this locus to show up as an outlier in all (or most) comparisons involving this population. This pattern is exactly what we found for the two loci coding for larval protein-10 and phosphoglucomutase in the Congo and Seychelle Island populations.
Evaluating the robustness of this method to the assumptions of the model:
In the data set discussed above, it is likely that the populations of D. simulans have exchanged migrants after divergence. More generally, one may wonder whether complete isolation and divergence by random drift accurately describes natural situations. An alternative approach would be to develop a new model of population divergence that allows subsequent migration after separation. But if we want to make inferences about a more realistic (and hence a more complex) model of divergence, then we need to distinguish between the pattern of genetic differentiation that results from (i) recent separation followed by very little migration or (ii) ancient separation followed by a moderate amount of migration. This is a difficult task, which would require more powerful methods for inferring parameter values (e.g., maximum likelihood).
So, we are interested in testing if our method (which assumes evolution in complete isolation after divergence) is undermined when applied to pairs of populations that still exchange genes after divergence. It should be borne in mind that gene flow, like genetic drift, affects the whole genome in the same way. We generated artificial data sets under neutral models of population divergence, including high mutation rates and moderate levels of migration between populations. We used a modified version of the algorithm described by ![]()
Up to $\tau$ generations in the past, considering populations 1 and 2 altogether, the waiting time to the next event (coalescence or migration) is drawn from an exponential distribution with mean $N_1N_2/D$, where $D = N_2\binom{n_1}{2} + N_1\binom{n_2}{2} + m(n_1 + n_2)N_1N_2$, n1 and n2 are the current numbers of lineages in populations 1 and 2, and m is the backward migration rate. The next event is either a coalescence in population 1 (respectively population 2), with probability $N_2\binom{n_1}{2}/D$ (respectively $N_1\binom{n_2}{2}/D$), or the migration of one gene from population 2 to population 1 (respectively from population 1 to population 2), with probability $m\,n_1N_1N_2/D$ (respectively $m\,n_2N_1N_2/D$). For the period [$\tau$, $+\infty$], the coalescent process was generated as previously described (see also Fig 1).
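The event scheme with migration can be checked numerically with a short sketch. The function and key names are hypothetical. The migration terms are written with the factor N1·N2 in the numerator so that the four probabilities sum to one, which follows from dividing each event rate by the total rate.

```python
from math import comb


def next_event_distribution(n1, n2, N1, N2, m):
    """Mean waiting time and event probabilities before the populations merge.

    n1, n2 : current numbers of lineages in populations 1 and 2
    N1, N2 : (haploid) population sizes
    m      : backward migration rate per lineage per generation
    """
    D = N2 * comb(n1, 2) + N1 * comb(n2, 2) + m * (n1 + n2) * N1 * N2
    mean_wait = N1 * N2 / D      # inverse of the total event rate
    probs = {
        "coalescence_in_1": N2 * comb(n1, 2) / D,
        "coalescence_in_2": N1 * comb(n2, 2) / D,
        # Backward in time, a lineage sampled in population 1 moves to population 2,
        # i.e., forward in time the gene migrated from population 2 to population 1.
        "lineage_1_moves_to_2": m * n1 * N1 * N2 / D,
        "lineage_2_moves_to_1": m * n2 * N1 * N2 / D,
    }
    return mean_wait, probs


mean_wait, probs = next_event_distribution(n1=2, n2=2, N1=100, N2=100, m=0.01)
print(mean_wait, probs)
```

The function assumes that at least one event is possible (D > 0), which holds whenever some pair of lineages shares a population or m is positive.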
For each set of parameters, we generated 20 data sets composed of two samples (n1 = n2 = 50) of 50 loci each. The parameter values are given in Table 1. For each data set, we applied our method as described above. We generated joint distributions, conditional on the number of alleles, according to the actual numbers of alleles in each sample. For all sets of parameters, we grouped loci with eight alleles and more in a single class. The number of joint conditional distributions generated per artificial data set (i.e., the number of classes for different numbers of alleles) ranged from three to seven. For each data set, over all the joint conditional distributions taken together, we expected to detect 0.05 × 50 = 2.5 outlier loci, just by chance. We performed Wilcoxon's signed-rank tests.
Table 1 shows the total observed number of outlier loci (mean and median over 20 independent simulated data sets) detected for a range of nuisance parameter values (low and high mutation rates, short or long divergence by random drift, with or without migration). In no case could we reject the null hypothesis that the expected number of outlier loci detected by our method was equal to 2.5 (against the alternative hypothesis that the expected number of outliers was >2.5). Thus, our approach is conservative in the sense that the 95% confidence region contains at least 95% of the loci generated by a truly neutral model. At the level of 5% we do not (falsely) detect any more than 5% of outlier loci in a sample of neutral markers (type I error).
Comparison with Beaumont and Nichols' (1996) method:
We also applied BEAUMONT and NICHOLS' (1996) procedure to the D. simulans data set. Based on a preliminary examination of the data, three loci (coding for α-fucosidase, dipeptidase-1, and mannose phosphate isomerase) were found to lie outside the 95% confidence region of the conditional joint distribution of $F_{ST}$ and mean heterozygosity. The percentiles were determined as described in ![]()
We suspect that, in the present case, the inclusion of a very distant insular population (Seychelle Island) may bias their analysis. Indeed, populations heterogeneous with respect to their demographic parameters (effective population sizes and migration rates) were shown to strongly affect their method.
Moreover, in general, the loci that were outliers in our analysis gave small values of (global) FST. But from the shape of the joint distribution of FST and heterozygosity, it seems that BEAUMONT and NICHOLS' (1996) analysis is likely to detect outlier loci that exhibit unusually large FST values. However, a process that would cause an apparent decrease of genetic variation at one locus in a single local population, without leading to a decrease of the variation over all populations, would not be detected in BEAUMONT and NICHOLS' (1996) procedure. In other words, if selection acts on one locus at a local scale, pairwise comparisons of populations are more likely to be efficient for detecting outlier loci.
| DISCUSSION |
|---|
Using population-specific estimators of branch lengths:
Conventional pairwise genetic distances or pairwise measures of population differentiation are based on the assumption that the sizes of populations are equal and constant through time or that dispersal, if any, is symmetric. For example, the pairwise FST parameter is defined as a ratio of identity probabilities within and among populations. But the within-population term is taken as an average over the pair of populations. Thus, the definition of the parameter implicitly assumes that both populations share the same demographic parameters. WEIR and COCKERHAM's (1984) estimator $\hat{F}_{ST}$ of FST is constructed to have low bias and variance, assuming that the populations are independent replicates of the same stochastic process. This means that populations are supposed to have the same size and that they do not exchange migrants. Without these assumptions, $\hat{F}_{ST}$ would be a complex function of unequal (within-population) identity probabilities.
In contrast, the Fi parameters defined here make sense even when the populations are of unequal size. The only assumption we make is that once the two populations have separated, they remain completely isolated. From the estimation of Fi's for a pair of populations, we can infer the branch lengths. The ratio of these branch length estimates is inversely proportional to the ratio of effective population sizes. Thus, these estimates may be seen as measures of the intensity of genetic drift that has occurred since population divergence. The main drawback to this approach is that when estimates of IIS probabilities are smaller within populations than among them (i.e., $\hat{Q}_{w,i} < \hat{Q}_a$), $\hat{F}_i$ becomes negative, and the moment-based estimator of branch length fails. Although this can arise just by chance for some loci, averaging estimates over loci reduces the problem.
Provided that we obtain good estimates of branch lengths for a pair of populations (which requires the pooling of information from many independent loci), we may be able to evaluate the consistency of locus-specific estimates. Indeed, the joint distribution of branch length estimates, conditioned on the number of alleles in the pooled sample, depends only weakly on nuisance parameters of the simple model of divergence by drift. In particular, this conditional distribution is not sensitive to departures from mutation-drift equilibrium before isolation or to differences in mutation rates.
Detection of selection acting on genetic markers:
We saw from the analysis of the D. simulans data set that the great majority of loci always fall in the confidence region of the conditional pairwise distributions of branch length estimates, while some loci do not. Overall, we identified two loci that were probably subject to selection in the population from Congo, one of which was also probably subject to selection in the population from Seychelle Island. We concluded that the distribution of variability at these loci may have been shaped by forces other than mutation and drift. Furthermore, we identified two other loci that either lie on the edges or fall just outside the high probability region of the expected conditional distribution in the French population, although we are more cautious about these latter loci. It is noteworthy that our estimation of the density of $\hat{F}_i$ (see Appendix) is discontinuous, because of the discrete nature of the data (the allele counts). This is particularly true when the number of alleles on which the distribution is conditioned is small (for a given set of parameters, the lower the number of allelic states, the more discontinuous the null distribution; see Fig 4). Using discrete distributions is clearly preferable to using some (unnecessary) continuous approximation to them. Moreover, whenever the null distribution is based on the same number of allelic states and the same number of genes as in the sample, there is no tendency for loci to show up as outliers just because of the discrete nature of the distribution (i.e., a locus cannot, by construction, show up between arc-shaped areas located at the edge of some distributions). Yet, when an apparent outlier lies very close to the 95% high probability region, it is highly advisable to check whether this locus also lies outside the 99% high probability region.
The main criticisms of LEWONTIN and KRAKAUER's (1973) attempts to interpret across-loci heterogeneity of FST values arose from their failure to consider allele frequencies as random variables, whose distribution depends on the underlying model of population structure and history. Indeed, uneven patterns of dispersal among populations (![]()
However, conditioning the distribution of FST on the heterozygosity (![]()
4Ndµ/(1 + 4Ndµ), for diploids]. In addition, at a local scale, FST is only weakly influenced by the total population size Nd (![]()
As already suggested by ![]()
We found that patterns such as those identified in, e.g., the Tunisia vs. Congo data set as evidence of selection can be produced by "neutral models," where the coalescent process occurs independently at each locus. Indeed, similar scatters of points could be obtained whenever the population-specific parameters for populations 1 and 2 vary across loci, having particularly high values at certain loci (results not shown). Models of this type provide a rough approximation to models of unlinked neutral loci, some of which were strongly influenced by selection (remembering that the effect of selection resembles a reduction in the effective population size experienced by these loci, as described by ![]()
An important task for the future is to consider a more general neutral model of the divergence of two populations, where gene flow may continue after the moment of "separation." It is also desirable to extend this approach to more elaborate neutral models, incorporating recombination. More sophisticated estimators of the divergence parameters (branch lengths) would then be required. We assumed that the mutation process follows the IAM and we allowed a wide range of possible mutation rates. In the IAM, genes that are identical in state are also identical by descent. This may not be the case with other mutation models such as the K allele or stepwise mutation processes, which can produce IIS genes that are not IBD (homoplasy). The IAM is probably an adequate model for allozyme data. It is certainly less appropriate for potentially more variable markers, such as microsatellites. Recent studies revealed that the processes of mutation of microsatellite markers may be more complex than previously thought and may vary greatly among loci.
If we could identify those marker loci that responded to selection during the process of divergence, then we may be able to obtain improved estimates of the parameters of population structure and history by excluding these loci.
| ACKNOWLEDGMENTS |
|---|
We are very grateful to R. S. Singh and R. A. Morton for providing the Drosophila simulans data set. We thank I. Olivieri for helpful comments on a previous draft of this manuscript and S. Billiard for valuable discussions about the structured coalescent. We are grateful to two anonymous reviewers for their constructive comments. This work was funded by contract no. BIO4-CT96-1189 of the Commission of the European Communities (DG XII) to P.B., and R.V. was also partially funded by the Fondation Sansouire. This is publication no. 2001-045 of the Institut des Sciences de l'Évolution de Montpellier.
Manuscript received October 23, 2000; Accepted for publication May 14, 2001.
| APPENDIX |
|---|
Parameter estimation:
For any given allele $u$, we use the indicator variable $x_{iju}$ to describe the state of the $j$th gene in the $i$th population, with $i \in \{1, 2\}$: $x_{iju} = 1$ if the allelic type is $u$, and $x_{iju} = 0$ otherwise. Let $p_{iu}$ be the frequency of allele $u$ in the $i$th population. Then $p_{iu} = \mathrm{E}(x_{iju} \mid p)$, where $\mathrm{E}(\cdot \mid p)$ denotes the expectation, conditional on the array $p$ of all the allele frequencies. Considering the second moments of the random variable $x_{iju}$, it follows that $\mathrm{E}(x_{iju}^2 \mid p) = p_{iu}$ and, since individuals are sampled independently from the $i$th population, $\mathrm{E}(x_{iju} x_{ij'u} \mid p) = p_{iu}^2$ for $j' \neq j$. Then, summing over all alleles gives the probability for two genes in population $i$ to be identical in state (IIS),
$$Q_i = \mathrm{E}\left(\sum_{u=1}^{k} p_{iu}^2\right) \tag{A1}$$
where $\mathrm{E}$ now denotes the expectation over the distribution of allele frequencies $p$, and $k$ is the number of alleles in the population. The IIS probability for two genes taken from populations 1 and 2, respectively, is given by
$$Q_{12} = \mathrm{E}\left(\sum_{u=1}^{k} p_{1u}\, p_{2u}\right) \tag{A2}$$
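An unbiased estimator of the frequency of allele $u$ among $n_i$ sampled individuals from the $i$th population is simply given by $\hat{p}_{iu} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{iju}$. Expanding the square of this expression, and then taking the expectation, gives $\mathrm{E}(\hat{p}_{iu}^2 \mid p) = \frac{p_{iu}}{n_i} + \frac{n_i - 1}{n_i}\, p_{iu}^2$. Therefore,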
$$\hat{Q}_i = \frac{n_i}{n_i - 1} \sum_{u=1}^{k} \hat{p}_{iu}^2 - \frac{1}{n_i - 1} \tag{A3}$$
is an unbiased estimator of the probability for two genes in population i to be identical in state, with k being the number of alleles in the sample. Similarly
$$\hat{Q}_{12} = \sum_{u=1}^{k} \hat{p}_{1u}\, \hat{p}_{2u} \tag{A4}$$
is an unbiased estimator of the IIS probability of two genes taken in the ancestral population, before divergence. Approximating the expectation of a ratio by the ratio of expectations, an estimator of Fi is given by
$$\hat{F}_i = \frac{\hat{Q}_i - \hat{Q}_{12}}{1 - \hat{Q}_{12}} \tag{A5}$$
When combining the information brought by all alleles at more than one locus, a multilocus estimator is defined as the ratio of the sum of locus-specific numerators over the sum of locus-specific denominators. A corresponding substitution of estimates in Equation 8 directly yields Cockerham's estimators.
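The estimation steps described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the program used for the analyses: the function names are hypothetical, each sample is assumed to be a list of allele labels for one locus (at least two genes per population), and the estimator of F_i is assumed to take the ratio form (Q̂_i - Q̂_12) / (1 - Q̂_12).

```python
import math
from collections import Counter


def allele_freqs(sample):
    """Sample allele frequencies from a list of allele labels."""
    n = len(sample)
    return {u: c / n for u, c in Counter(sample).items()}


def q_within(sample):
    """Unbiased estimate of the within-population IIS probability.

    Corrects the sum of squared sample frequencies for sampling
    two genes without replacement (requires at least two genes).
    """
    n = len(sample)
    s = sum(f * f for f in allele_freqs(sample).values())
    return (n * s - 1) / (n - 1)


def q_between(sample1, sample2):
    """Estimate of the IIS probability for one gene from each population."""
    f1, f2 = allele_freqs(sample1), allele_freqs(sample2)
    return sum(f1[u] * f2.get(u, 0.0) for u in f1)


def f_hat(loci):
    """Multilocus estimates of (F_1, F_2).

    `loci` is a list of (sample1, sample2) pairs, one pair per locus.
    Assumed form: ratio of summed numerators over summed denominators.
    Returns (nan, nan) when the summed denominator is zero.
    """
    num1 = num2 = den = 0.0
    for s1, s2 in loci:
        q12 = q_between(s1, s2)
        num1 += q_within(s1) - q12
        num2 += q_within(s2) - q12
        den += 1.0 - q12
    if den == 0:
        return math.nan, math.nan
    return num1 / den, num2 / den
```

For a single locus, `f_hat([(sample1, sample2)])` reduces to the single-locus ratio; passing several loci pools numerators and denominators before dividing, as described for the multilocus estimator.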
Estimation of the density of the $F_i$ parameters:
For each set of parameter values, coalescent simulations were performed, thus generating "artificial data sets." Each artificial data set yields a pair of estimates $\hat{F}_1$ and $\hat{F}_2$. An approximation to the expected joint distribution was obtained as follows. First, a two-dimensional histogram was constructed. Recall that the points $(\hat{F}_1, \hat{F}_2)$ are constrained to lie within the upper-right triangle of a square with vertices (-1, -1), (1, -1), (-1, 1), and (1, 1). The whole square region was covered by a two-dimensional array (or mesh) of 100 × 100 square cells. Each cell thus has sides of length 0.02. Each observation $(\hat{F}_1, \hat{F}_2)$ was binned in the appropriate cell. The cell counts were divided by the total number of observations to obtain a discrete probability distribution over the two-dimensional array. This discrete distribution is a close approximation to the expected joint distribution of the estimators $(\hat{F}_1, \hat{F}_2)$. The q-level "high probability region" (q = 95% or any other value) is constructed as follows. The cells are sorted in order of decreasing probability. Then, starting from the cells with the highest associated probabilities, cells are sequentially added to the confidence region until the cumulative probability of the whole set of cells obtained is equal to (or just exceeds) the chosen q-value.
From this procedure, we obtain for each simulation a region within which a proportion q of the data lies. Note that this confidence region is not necessarily contiguous. Constructing the high probability region directly from the discrete distribution is clearly preferable to using some (unnecessary) continuous approximation to it.
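The binning and region-building steps above can be illustrated with the following Python sketch. It is a hedged example under stated assumptions rather than the original implementation: it relies on NumPy, the function names `high_probability_region` and `in_region` are hypothetical, all estimates are assumed to fall inside the square, and ties between cells of equal probability are broken arbitrarily.

```python
import numpy as np


def high_probability_region(f1, f2, q=0.95, n_cells=100):
    """Return (mask, edges) for the q-level high probability region.

    `f1` and `f2` are sequences of simulated estimates in [-1, 1].
    `mask[i, j]` is True when cell (i, j) belongs to the region.
    """
    # Mesh of n_cells x n_cells square cells covering [-1, 1] x [-1, 1].
    edges = np.linspace(-1.0, 1.0, n_cells + 1)
    counts, _, _ = np.histogram2d(f1, f2, bins=[edges, edges])
    # Discrete probability distribution over the cells.
    probs = counts / counts.sum()
    # Flat cell indices sorted by decreasing probability.
    order = np.argsort(probs, axis=None)[::-1]
    cum = np.cumsum(probs.ravel()[order])
    # Smallest number of cells whose cumulative probability reaches q
    # (small tolerance guards against floating-point round-off).
    n_keep = int(np.searchsorted(cum, q - 1e-12, side="left")) + 1
    mask = np.zeros(probs.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(probs.shape), edges


def in_region(mask, edges, x, y):
    """Check whether the point (x, y) falls in a cell of the region."""
    n = mask.shape[0]
    i = min(int(np.searchsorted(edges, x, side="right")) - 1, n - 1)
    j = min(int(np.searchsorted(edges, y, side="right")) - 1, n - 1)
    return bool(mask[i, j])
```

An observed locus whose pair of estimates gives `in_region(mask, edges, x, y) == False` lies outside the q-level region and would be flagged as a candidate outlier under the simulated model.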
| LITERATURE CITED |
|---|
BARTON, N. H., 1995 Linkage and the limits to natural selection. Genetics 140:821-841.
BARTON, N. H., 1998 The effect of hitch-hiking on neutral genealogies. Genet. Res. 72:123-133.
BEAUMONT, M. A. and R. A. NICHOLS, 1996 Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. Ser. B 263:1619-1626.
BOWCOCK, A. M., J. R. KIDD, J. L. MOUNTAIN, J. M. HEBERT, and L. CAROTENUTO et al., 1991 Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc. Natl. Acad. Sci. USA 88:839-843.
CAVALLI-SFORZA, L. L., 1966 Population structure and human evolution. Proc. R. Soc. Lond. Ser. B 164:362-379.
CHARLESWORTH, B., M. T. MORGAN, and D. CHARLESWORTH, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289-1303.
CHARLESWORTH, B., M. NORDBORG, and D. CHARLESWORTH, 1997 The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations. Genet. Res. 70:155-174.
CHOUDHARY, M., M. B. COULTHART, and R. S. SINGH, 1992 A comprehensive study of genic variation in natural populations of Drosophila melanogaster. VI. Patterns and processes of genic divergence between D. melanogaster and its sibling species, D. simulans. Genetics 130:843-853.
COCKERHAM, C. C., 1973 Analyses of gene frequencies. Genetics 74:679-700.
COCKERHAM, C. C. and B. S. WEIR, 1987 Correlations, descent measures: drift with migration and mutation. Proc. Natl. Acad. Sci. USA 84:8512-8514.
ESTOUP, A., and B. ANGERS, 1998 Microsatellites and minisatellites for molecular ecology: theoretical and empirical considerations, pp. 55-86 in Advances in Molecular Ecology, edited by G. R. CARVALHO. IOS Press, Amsterdam.
FELSENSTEIN, J., 1982 How can we infer geography and history from gene frequencies? J. Theor. Biol. 96:9-20.
FLINT, J., J. BOND, D. C. REES, A. J. BOYCE, and J. M. ROBERTS-THOMSON et al., 1999 Minisatellite mutational processes reduce FST estimates. Hum. Genet. 105:567-576.
HILL, W. G. and A. ROBERTSON, 1966 The effect of linkage on limits to artificial selection. Genet. Res. 8:269-294.
HILL, W. G. and A. ROBERTSON, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38:226-231.
HUDSON, R. R., 1990 Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.
HUDSON, R. R. and N. L. KAPLAN, 1988 The coalescent process in models with selection and recombination. Genetics 120:831-840.
KAPLAN, N. L., R. R. HUDSON, and C. H. LANGLEY, 1989 The hitchhiking effect revisited. Genetics 123:887-899.
LEWONTIN, R. C. and J. KRAKAUER, 1973 Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphism. Genetics 74:175-195.
LEWONTIN, R. C. and J. KRAKAUER, 1975 Testing the heterogeneity of F values. Genetics 80:397-398.
MALÉCOT, G., 1975 Heterozygosity and relationship in regularly subdivided populations. Theor. Popul. Biol. 8:212-241.
MAYNARD SMITH, J. and J. HAIGH, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.
MENDENHALL, W. M., D. D. WACKERLY and R. L. SCHEAFFER, 1990 Mathematical Statistics with Applications. PWS-KENT Publishing Company, Boston.
NEI, M., 1972 Genetic distance between populations. Am. Nat. 106:283-292.
NEI, M. and A. CHAKRAVARTI, 1977 Drift variance of FST and GST statistics obtained from a finite number of isolated populations. Theor. Popul. Biol. 11:307-325.
NEI, M. and T. MARUYAMA, 1975 Lewontin-Krakauer test for neutral genes. Genetics 80:395.
NEI, M., A. CHAKRAVARTI, and Y. TATENO, 1977 Mean and variance of FST in a finite number of incompletely isolated populations. Theor. Popul. Biol. 11:291-306.
NIELSEN, R. and M. SLATKIN, 2000 Likelihood analysis of ongoing gene flow and historical association. Evolution 54:44-50.
NORDBORG, M., 2001 Coalescent theory, pp. 179-212 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.
OHTA, T. and M. KIMURA, 1969 Linkage disequilibrium due to random genetic drift. Genet. Res. 13:47-55.
REYNOLDS, J., B. S. WEIR, and C. C. COCKERHAM, 1983 Estimation of the coancestry coefficient: basis for a short term genetic distance. Genetics 105:767-779.
ROBERTSON, A., 1961 Inbreeding in artificial selection programmes. Genet. Res. 2:189-194.
ROBERTSON, A., 1975a Gene frequency distribution as a test of selective neutrality. Genetics 81:775-785.
ROBERTSON, A., 1975b Remarks on the Lewontin-Krakauer test. Genetics 80:396.
ROSS, K. G., D. D. SHOEMAKER, M. J. B. KRIEGER, J. DEHEER, and L. KELLER, 1999 Assessing genetic structure with multiple classes of molecular markers: a case study involving the introduced fire ant Solenopsis invicta. Mol. Biol. Evol. 16:525-543.
ROUSSET, F., 1996 Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142:1357-1362.
ROUSSET, F., 1997 Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics 145:1219-1228.
ROUSSET, F., 2001 Inferences from spatial population genetics, pp. 239-269 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.
SINGH, R. S., M. CHOUDHARY, and J. R. DAVID, 1987 Contrasting patterns of geographic variation in the cosmopolitan sibling species Drosophila melanogaster and D. simulans. Biochem. Genet. 25:27-40.
SLATKIN, M., 1991 Inbreeding coefficients and coalescence times. Genet. Res. 58:167-175.
STROBECK, C., 1983 Expected linkage disequilibrium for a neutral locus linked to a chromosomal arrangement. Genetics 103:545-555.
STROBECK, C., 1987 Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117:149-153.
TAKAHATA, N., 1988 The coalescent in two partially isolated diffusion populations. Genet. Res. 52:213-222.
TSAKAS, S. and C. B. KRIMBAS, 1976 Testing the heterogeneity of F values: a suggestion and a correction. Genetics 84:399-401.
WEIR, B. S. and C. C. COCKERHAM, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.
WRIGHT, S., 1951 The genetical structure of populations. Ann. Eugen. 15:323-354.