Genetics, Vol. 160, 753-763, February 2002, Copyright © 2002

A Microsatellite-Based Multilocus Screen for the Identification of Local Selective Sweeps

Christian Schlötterer
Institut für Tierzucht und Genetik, Veterinärmedizinische Universität Wien, 1210 Wien, Austria

Corresponding author: Christian Schlötterer, Veterinärmedizinische Universität Wien, Josef Baumann Gasse 1, 1210 Wien, Austria. E-mail: christian.schloetterer@vu-wien.ac.at

Communicating editor: J. HEY


*  ABSTRACT

With the availability of completely sequenced genomes, multilocus scans of natural variability have become a feasible approach for the identification of genomic regions subjected to natural and artificial selection. Here, I introduce a new multilocus test statistic, ln RV, which is based on the ratio of observed variances in repeat number at a set of microsatellite loci in two groups of populations. The distribution of ln RV values captures demographic history of the populations as well as variation in microsatellite mutation among loci. Given that microsatellite loci associated with a recent selective sweep differ from the remainder of the genome, they are expected to fall outside of the distribution of neutral ln RV values. The ln RV test statistic is applied to a data set of 94 loci typed in eight non-African and two African human populations.


It is well understood that genetic change provides the basis for adaptation processes in natural and domesticated populations. Hence, the identification of those genetic changes causing a phenotype with an increased fitness has been of long-standing interest in biological sciences.

Three different approaches to identify targets of selection (and thus adaptation) have been pursued: (1) the candidate gene approach, (2) QTL mapping, and (3) the multilocus screen.

The candidate gene approach is based on a priori knowledge about the function of a given gene. The ease of PCR amplification and DNA sequencing, combined with the availability of several test statistics to evaluate the statistical significance of the observed and expected patterns of DNA sequence variation (OTTO 2000), has resulted in numerous studies using a candidate gene approach. Despite the unquestionable importance of these studies in understanding the partitioning of genetic variation in natural populations, this approach is limited to a small number of candidate genes. Hence, a screen for genes involved in adaptation is difficult to pursue as neither the traits nor their genetic basis are known.

QTL mapping (LYNCH and WALSH 1998) is a more general approach. On the basis of the idea that many traits are of quantitative nature, QTL mapping aims to partition the phenotypic variance into a genotypic and environmental component. While this approach is becoming increasingly popular to identify genes contributing to a given trait, it suffers from the problem that the phenotypic trait of potential adaptive relevance must be known. Limited information is available, however, about the traits that are responsible for the adaptation of natural populations to their environment. Thus, QTL mapping has only limited potential for the identification of the genes that are involved in the adaptation process of natural populations.

The key for a multilocus screen is the idea that different forces act in characteristic ways on the genome. While genetic drift, migration, and inbreeding affect all loci to the same extent, selection targets only a few loci. Hence, a locus that shows a significantly different pattern from the remainder of the genome is expected to reside in a genomic region that has been the target of selection. This idea was first used by CAVALLI-SFORZA (1966), who calculated F values over several human groups. Later, LEWONTIN and KRAKAUER (1973) proposed a formal test statistic to identify loci that deviate from a neutral pattern. This test statistic is based on the variance of the inbreeding coefficient F, which is proportional to the square of its mean value averaged across loci. This test has subsequently been criticized for several reasons. In particular, correlations in allele frequencies, which could be caused by stepping-stone migration and phylogenetic history, will inflate the variance in F relative to the expectations. Furthermore, skewed allele frequencies will also affect the Lewontin-Krakauer test (NEI and MARUYAMA 1975; ROBERTSON 1975). Despite some recent improvements (TSAKAS and KRIMBAS 1976; BOWCOCK et al. 1991; BEAUMONT and NICHOLS 1996; VITALIS et al. 2001), these test statistics have never been widely used to infer selection from multilocus data.

With the recent progress in genomics, various new markers, which are distributed over the genome at a high density, have become available. In combination with the (almost) complete genomic sequence of various organisms, multilocus screens should be reconsidered.

While the high density of available single nucleotide polymorphisms (SNPs) makes them the marker of choice for various studies, such as linkage disequilibrium mapping, they are biallelic markers with a limited information content of a single marker. Microsatellites, on the other hand, are less dense, but offer the advantage of a multiallelic marker, which is highly informative.

In this study I explore the potential of microsatellites to serve as a genetic marker for the identification of genomic regions that have been subject to selection. While microsatellites are unlikely to be the target of natural selection, linkage to a genomic region that has been the target of selection is expected to cause a deviation from neutral expectations. The spread of a novel beneficial mutation through a population results in a reduction of natural variability at the selected locus and flanking regions (MAYNARD SMITH and HAIGH 1974; SLATKIN 1995a). The extent to which flanking sequences are affected by such a selective sweep depends largely on the strength of selection and the recombination rate. Hence, a microsatellite locus linked to a beneficial mutation is expected to have a reduction in variability below neutral expectations (SLATKIN 1995a; SCHLÖTTERER et al. 1997; PRITCHARD and FELDMAN 1998; WIEHE 1998; SCHLÖTTERER and WIEHE 1999). Thus, a multilocus screen for genomic regions subjected to selection could take advantage of this reduction in variability.

This conceptually simple approach is significantly hampered by the observed differences in variability among microsatellite loci. In neutrally evolving populations different coalescent times and variation in mutation rates are responsible for those differences. Hence, the goal of a multilocus test for selected genomic regions is the identification of those microsatellite loci that deviate from the neutrally evolving genome. While the variation in coalescent time can be estimated under certain assumptions of population history, the mutation rate of a microsatellite locus remains an unknown parameter. Here, I introduce a new test statistic, ln RV, which accounts for neutral variation in coalescent times and different microsatellite mutation rates. A microsatellite data set is used to evaluate the usefulness of the ln RV test statistic to identify genomic regions that differ between African and non-African human populations.


*  MATERIALS AND METHODS

Population samples:
Microsatellite data from 10 different populations were analyzed. African populations were represented by Biaka Pygmies from the Central African Republic and the Mbuti Pygmies from northwestern Zaire. Non-African populations included a sample of unrelated Danish blood donors, a moslem community from Northern Israel, Han Chinese living in the United States, native Japanese from the Osaka area or visitors to Stanford or Yale, the Yakut from Siberia, the Nasioi from Melanesia, the Mayan from Mexico, and the Rondonian Surui from Brazil. More information about these populations is available at http://info.med.yale.edu/genetics/kkidd/pops.html.

Genetic markers used:
Data from a total of 94 microsatellite loci were used. The loci are part of the ABI linkage panels 8–11 and 13–16 covering the chromosomes 5–11. All data were taken from the Kidd lab webpage: http://info.med.yale.edu/genetics/kkidd/abiinfo.html. GenBank searches were performed before March 2001.

Test of neutrality (ln RV test):
Assuming the stepwise mutation model (OHTA and KIMURA 1973), neutrality, and mutation drift equilibrium, the variance in repeat number (V) is a good estimator of microsatellite variability (MORAN 1975; GOLDSTEIN et al. 1995; SLATKIN 1995b):

(1)

Ne is the effective number of diploid individuals and µ the microsatellite mutation rate. Given that microsatellite mutation rates differ substantially among loci (DI RIENZO et al. 1998; HARR et al. 1998), it is difficult to compare variances among loci directly. This problem can be circumvented by calculating the ratio of the variance in repeat number in two populations, which is independent of the mutation rate. It has to be noted that the expectation of RV is not identical to the ratio of the expectations of VPop1 and VPop2. Computer simulations, however, indicate that over a reasonable range of parameters the two expectations are very similar (Table 1):

(2)
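As a reading aid, the argument in this paragraph can be sketched in symbols. The forms below are an assumption based on the standard result for the strict single-step stepwise mutation model and may differ in detail from Equations 1 and 2; the subscripted symbols for population-specific effective sizes are introduced here for illustration only.

```latex
% Expected variance in repeat number at a neutral locus (single-step mutations)
E(V) = 2 N_e \mu

% Ratio of variances between two populations at the same locus:
% the locus-specific mutation rate \mu cancels
RV = \frac{V_{Pop1}}{V_{Pop2}}, \qquad
E(RV) \approx \frac{E(V_{Pop1})}{E(V_{Pop2})}
      = \frac{2 N_{e,Pop1}\,\mu}{2 N_{e,Pop2}\,\mu}
      = \frac{N_{e,Pop1}}{N_{e,Pop2}}
```

Under this reading, every locus shares the same approximate expectation of RV for a given pair of populations, whatever its mutation rate.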


 
Table 1. Verification of approximations by computer simulation

A better approximation is provided by the delta method (LYNCH and WALSH 1998):

(3)

Higher-order approximations given in LYNCH and WALSH (1998) are not included because of the large term 1/12. Computer simulations show that (3) provides a better fit than (2) (Table 1).
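For orientation, the generic second-order delta-method approximation for the expectation of a ratio is sketched below. It assumes that the two variances are statistically independent (the two populations are simulated independently), so the covariance term drops out; whether Equation 3 has exactly this form is an assumption.

```latex
% Second-order Taylor (delta-method) approximation for a ratio X/Y
E\!\left(\frac{X}{Y}\right) \approx \frac{E(X)}{E(Y)}
  \left[ 1 + \frac{\operatorname{Var}(Y)}{E(Y)^2}
           - \frac{\operatorname{Cov}(X,Y)}{E(X)\,E(Y)} \right]

% With X = V_{Pop1}, Y = V_{Pop2} and Cov(X, Y) = 0
E(RV) \approx \frac{E(V_{Pop1})}{E(V_{Pop2})}
  \left[ 1 + \frac{\operatorname{Var}(V_{Pop2})}{E(V_{Pop2})^2} \right]
```

The correction factor exceeds 1 whenever the denominator variance has sampling variation, which is why the simple ratio of expectations tends to underestimate E(RV).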

Given the close fit of the approximation, it can be assumed that RV is independent of the mutation rate and all loci have approximately the same expectation for a comparison of two populations. Nevertheless, historic sampling causes variation in the coalescent times at the loci studied. Hence, a distribution of RV values is expected. To determine the shape of this distribution, I used computer simulations (see below) and found that for neutrally evolving microsatellite loci the ln RV values follow a Gaussian distribution.

Hence, it is possible to design a test statistic to identify individual microsatellite loci that deviate from neutral expectations. Assuming that most loci are evolving neutrally, the mean and standard deviation of the observed ln RV values could be used to describe the corresponding Gaussian distribution. Using the density function of the Gaussian distribution, it is possible to assign a P value to the ln RV value of each locus. The P values give the probability that a given ln RV value is consistent with the null hypothesis of a neutral evolution.
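The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name, the two-sided definition of the P value, and the example ln RV values are all assumptions made for the sketch.

```python
from statistics import NormalDist, mean, stdev

def ln_rv_p_values(ln_rv_values):
    """Assign a P value to each locus from a Gaussian fitted to all loci.

    Assumption: the P value is two-sided, i.e. the probability of a
    standardized deviation at least as extreme as the one observed.
    """
    mu = mean(ln_rv_values)      # mean of the observed ln RV values
    sd = stdev(ln_rv_values)     # sample standard deviation across loci
    std_normal = NormalDist()    # standard normal distribution
    p_values = []
    for x in ln_rv_values:
        z = (x - mu) / sd
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values

# Hypothetical ln RV values for five loci; the last one mimics a locus
# with strongly reduced variability in the first population.
values = [-0.05, 0.07, 0.09, -0.05, -2.81]
p_values = ln_rv_p_values(values)
```

Because the mean and standard deviation are estimated from the same loci that are being tested, a strong outlier inflates the fitted standard deviation; with many loci and few selected ones this effect is small.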

Test for normal distribution:
Visual inspection of the distribution of ln RV values from computer simulations suggested that they are normally distributed. For a formal test, two different strategies were pursued. First, the nonparametric Kolmogorov-Smirnov test was used to evaluate the distribution of 1000 simulated ln RV values. Because the tail of the distribution is particularly important to define the significance level, I also constructed a "tail test." This test is based on two properties of a normal distribution. First, the distribution is symmetrical with the same number of data points in the upper and lower tail. Second, 95% (99%) of the values of a standardized distribution are expected to fall within the interval between -1.96 (-2.58) and 1.96 (2.58). Hence, Fisher's exact test could be used to test whether or not the number of observations falling in the tail fits the expectations. I determined the significance from 1000 simulated ln RV values using the 1 and 5% tail. A distribution was considered to be normally distributed if neither the Kolmogorov-Smirnov test nor the two tail tests were significant (P > 0.05).
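One possible reading of the tail test is sketched below. The layout of the 2x2 table (observed versus expected counts inside and outside the tail), the pooling of the upper and lower tails, and all function names are assumptions; the Kolmogorov-Smirnov step is omitted, and the input is a stand-in drawn from a normal distribution rather than simulated ln RV values.

```python
import math
import random
from statistics import mean, stdev

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x observations in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def tail_test(values, z_cut=1.96, tail_fraction=0.05):
    """Compare the observed number of standardized values beyond +/- z_cut
    with the number expected under a normal distribution."""
    mu, sd = mean(values), stdev(values)
    n = len(values)
    observed = sum(1 for v in values if abs((v - mu) / sd) > z_cut)
    expected = round(tail_fraction * n)
    p = fisher_exact_two_sided(observed, n - observed, expected, n - expected)
    return observed, p

random.seed(1)
simulated = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for ln RV values
observed_tail, p_tail = tail_test(simulated)               # 5% tail
observed_tail_1, p_tail_1 = tail_test(simulated, z_cut=2.58, tail_fraction=0.01)  # 1% tail
```

A separate comparison of the counts in the upper and lower tail would be needed to check the symmetry property mentioned in the text; it is not part of this sketch.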

Computer simulations:
The coalescent process, which describes the genealogical history of chromosomes, provides a very simple approach to simulate population samples (HUDSON 1990). I made the standard assumptions associated with the coalescent process including neutrality, constant population size, and panmixia.

If not stated otherwise, between 100 and 10,000 loci were simulated for two independent populations using the unbiased stepwise microsatellite mutation model (OHTA and KIMURA 1973; GOLDSTEIN et al. 1995). For each simulated locus, the ln RV test statistic was calculated. When variation in microsatellite mutation rates was incorporated in the computer simulations, mutation rates varied by a factor of 10 drawn from a uniform distribution. For these simulations the mean θ-values are reported. For a restricted set of parameters, computer simulations were run with a two-phase mutation model of microsatellites (DI RIENZO et al. 1994). In addition to single repeat changes, a given fraction of microsatellite mutations was allowed to mutate by more than one repeat unit. The size change for such mutations was drawn from a uniform distribution ranging from 1 to a specified maximum.
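A minimal sketch of such a simulation is given below, assuming a constant-size neutral coalescent with time measured in units of 2Ne generations, θ = 4Neµ, and strictly single-step symmetric mutations. The function names, the Poisson sampler, and the decision to redraw loci with zero variance (so that the logarithm is defined) are assumptions of the sketch, not details taken from the text.

```python
import math
import random
from statistics import mean, variance

def poisson(lam, rng):
    """Poisson draw obtained by counting unit-rate exponential arrivals before lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_repeat_variance(n_chrom, theta, rng):
    """Variance in repeat number at one neutral locus for n_chrom sampled chromosomes."""
    offsets = [0] * n_chrom                   # repeat number relative to the root allele
    lineages = [[i] for i in range(n_chrom)]  # each lineage lists its sampled descendants
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        t = rng.expovariate(k * (k - 1) / 2.0)
        for members in lineages:
            # Mutations on this branch segment; each changes the allele by +1 or -1.
            m = poisson(theta / 2.0 * t, rng)
            step = sum(rng.choice((-1, 1)) for _ in range(m))
            if step:
                for i in members:
                    offsets[i] += step
        # Merge two randomly chosen lineages.
        i, j = rng.sample(range(k), 2)
        merged = lineages[i] + lineages[j]
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return variance(offsets)

def simulate_ln_rv(n_chrom, theta1, theta2, rng):
    """ln RV for one locus in two independent populations (zero variances are redrawn)."""
    while True:
        v1 = simulate_repeat_variance(n_chrom, theta1, rng)
        v2 = simulate_repeat_variance(n_chrom, theta2, rng)
        if v1 > 0 and v2 > 0:
            return math.log(v1 / v2)

rng = random.Random(42)
ln_rv_values = [simulate_ln_rv(20, 5.0, 5.0, rng) for _ in range(200)]
mean_ln_rv = mean(ln_rv_values)
```

With equal θ in both populations the ln RV values should scatter around zero. Variation in mutation rate among loci could be added by drawing a locus-specific θ and using it for both populations.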

The influence of demographic events, such as bottleneck and population expansion, was studied by a modification of the constant population size model. All demographic events affect the entire genome; therefore all loci were simulated using the same algorithm. A bottleneck was modeled as suggested by HUDSON (1990). For the computer simulations of the population expansion, an instantaneous rise in population size was assumed.

To study the effect of admixture, I modified a recently proposed method (PRITCHARD et al. 2000) and simulated 100 chromosomes from three independent populations each. A set of randomly selected chromosomes was taken from population one and replaced the same number of chromosomes in population two. Rather than simulating two additional generations for the admixed populations, the ln RV test statistic was directly calculated for populations two and three.

The neutral coalescent simulations could be modified to study the properties of a single microsatellite locus, which is linked to a genomic region subjected to directional selection. I assumed an instantaneous selective sweep, which was simulated as a bottleneck occurring at the selected locus only. Hence, one locus in one of the two populations was simulated under the selection model, while all other loci were simulated under the constant population size model.


*  RESULTS

Verification of the test statistic:
To explore the behavior of the ln RV test statistic, computer simulations were performed under the following assumptions: neutrality, a constant population size, random mating, mutation drift equilibrium, no linkage of the microsatellite loci, and independence of the two populations. Using standard coalescent simulations (HUDSON 1990), I obtained the variance in repeat number for a set of microsatellite loci. If not stated otherwise, computer simulations assumed the unbiased stepwise mutation model (OHTA and KIMURA 1973; GOLDSTEIN et al. 1995).

Dependence on the mutation rate: Using a wide range of θ-values (2–100) consistently resulted in a distribution of ln RV values that was very similar to a Gaussian distribution (Fig 1A and Fig 1B). Based on 1000 loci and a sample of 100 chromosomes, no significant deviation from normality could be detected (Kolmogorov-Smirnov and tail test, P > 0.1). This observation is consistent with previous computer simulations (GOLDSTEIN et al. 1996; PRITCHARD and FELDMAN 1998) and empirical reports (HARR et al. 1998), which demonstrated that ln V generally approximates a normal distribution.



Figure 1. Distribution of ln RV values as obtained from coalescent simulations of 10,000 independent microsatellite loci and 100 sampled chromosomes. The parameters used for the simulations are θ of population 1/θ of population 2/variation in mutation rate among loci (in percentages): (A) 5/5/0; (B) 500/500/0; (C) 5/5/1000; (D) 5/500/0. The variation in mutation rate was simulated on the basis of a uniform distribution.

Microsatellite mutation rates vary by more than one order of magnitude (DI RIENZO et al. 1998; HARR et al. 1998). To account for this, microsatellite mutation rates were drawn from a uniform distribution, resulting in an up to 10-fold variation in mutation rate (Fig 1C). Simulations of 1000 loci for 100 chromosomes each did not result in statistically significant deviations from normality (Kolmogorov-Smirnov and tail test, P > 0.1). Finally, I also tested the influence of differences in population sizes among the groups compared (Fig 1D). The ratios of the effective population sizes were varied (from 1:1 to 1:100) and no deviation from a normal distribution could be detected for 1000 simulated microsatellite loci (Kolmogorov-Smirnov and tail test, P > 0.1, 100 chromosomes). Hence, using a wide range of parameters, the ln RV test statistic can be approximated by a Gaussian distribution. This greatly facilitates the design of a statistical test to detect deviation from neutrality, as no a priori knowledge about the mutation rate or population size of the tested populations is required.

Deviation from stepwise mutation model: Inference from population data (DI RIENZO et al. 1994) and direct observations (WIERDL et al. 1997; BRINKMANN et al. 1998; HARR and SCHLÖTTERER 2000) indicated that microsatellite mutations are not confined to single repeat unit changes, but could also encompass larger gains and losses. To investigate whether such a modification of the mutation process affects the ln RV test statistic, I simulated 1000 microsatellite loci for two populations and a sample size of 100 chromosomes. No deviation from the normal distribution could be detected (Kolmogorov-Smirnov and tail test, P > 0.3, Table 2). The only notable difference was an increase in the variance of ln RV values with both a larger step size and a higher proportion of loci not evolving by single repeat unit changes (Table 2).


 
Table 2. Variance of the ln RV test statistic based on computer simulations of 1000 loci in two neutrally evolving populations under the two-phase microsatellite mutation model

Power of the ln RV test statistic: To assess the power of the ln RV test statistic I simulated the variance in repeat number for 100 microsatellite loci of which one microsatellite locus was associated with a selective sweep. The rates of recombination between the selected site and the microsatellite as well as the strength of selection are two important parameters required for model selection. Given the large uncertainty for each of these parameters, I used the reduction in variability (r) at the marker locus as a compound parameter in the computer simulations. A strong reduction in variability could result from a large selection coefficient, tight linkage to the selected site, or both. Consistent with expectation, a more pronounced reduction in variability resulted in larger numbers of simulation runs with significant (P < 0.05) ln RV values (Table 3). Also, the mean ln RV of the selected locus was higher and had a lower variance when a large r was used. Hence, for a recent and strong reduction in variability, a large fraction of the selected loci will be identified by the ln RV statistic. Some differences could be detected between the simulations using different θ-values (Table 3). In comparison to the large effect of r, the influence of θ was found to be moderate.


 
Table 3. Power of the ln RV test statistic as a function of r

Table 4 indicates that the power of the ln RV statistic significantly decreases with the time elapsed since the selection (t) occurred. Only recent selective sweeps could be detected reliably. This observation is fully consistent with previous analytical results (WIEHE 1998). Similar to the simulations for which r was varied, θ had only a moderate influence on the power of the ln RV test.


 
Table 4. Power of the ln RV test statistic as a function of t

Influence of the sample size: Despite the continuous improvement in screening technologies, the analysis of large sample sizes is still an important hurdle in population genetics. Therefore, it is interesting to determine the influence of the number of sampled chromosomes on the ln RV test statistic. The power of the ln RV test statistic is dependent on the shape of the distribution of ln RV values. A larger variance in ln RV values requires a more extreme reduction in variability to obtain significance (Table 5). Therefore, I calculated the standard deviation of ln RV over 10,000 loci. Each ln RV value was simulated for different sample sizes (10–1000 chromosomes). Fig 2 clearly indicates that <30 chromosomes result in a large standard deviation, which will, in turn, result in a lower power of the ln RV test statistic. On the other hand, samples of >50 chromosomes will not significantly improve the test statistic. Hence, only a moderate number of individuals need to be typed to determine the significance level of the ln RV test statistic.



Figure 2. Influence of the sample size (in chromosomes) on the standard deviation of the ln RV test statistic. Standard deviations were measured on 10,000 independently simulated microsatellite loci using the parameters 5/5/0 (see Fig 1 for further explanations).


 
Table 5. Power of the ln RV test statistic in relation to the variance of ln RV across loci

Influence of demographic events: Computer simulations were used to investigate the influence of common demographic events (population expansion, bottlenecks, and admixture) on the distribution of ln RV values in neutrally evolving populations. In all simulations one population was kept at a constant size, while for the other population either a change in size or admixture was simulated.

Although the computer simulations covered quite radical population size changes, for most simulations the ln RV values were not found to deviate significantly from a normal distribution. The most notable exceptions were recent and strong bottlenecks in combination with a low θ (Table 6). Under such extreme scenarios, microsatellite loci did not recover variability, resulting in an excess of loci with low variability in the bottlenecked population. Interestingly, a bottleneck occurring 0.1Ne generations ago resulted in a significant tail test (P = 0.038) for the population with a large θ (Table 6). For expanding populations only the combination of large θ-values with an older population expansion resulted in a significant deviation from a normal distribution (Table 7). This deviation is most likely the result of a large diversity generated in those samples, which was not adequately sampled with 100 chromosomes. When the sample size was increased to 200 chromosomes, no significant deviation from normality could be detected (data not shown). No significant deviation from a normal distribution was observed for various admixture proportions as well as different effective population sizes of the source population (Table 8).


 
Table 6. Variance of ln RV when one population had passed through a bottleneck


 
Table 7. Variance of ln RV with one recently expanded population


 
Table 8. Mean ln RV values of computer simulations with one population experiencing admixture from a third population (variance of ln RV over 1000 loci)

The power of the ln RV test statistic depends strongly on the behavior of the neutrally evolving loci. If the ln RV values have a broad distribution (large variance of ln RV), then the identification of a selected locus is more difficult. On the other hand, a very narrow distribution of ln RV values makes the identification of selected loci easier (Table 5). Given that the power of the ln RV test varies with the parameters used for the computer simulation (see above), a systematic power assessment is difficult. Therefore, I use the variance of the ln RV values as an indication for the power of the ln RV test under various demographic scenarios.

Table 6 indicates that population bottlenecks could have quite complex effects on the behavior of the ln RV test statistic. Simulations based on θ-values of 6 produced a wider distribution of ln RV when one population recently (t = 0.01) went through a bottleneck. For those simulations that are based on large θ-values, a bottleneck resulted in a narrower distribution of ln RV.

Population expansions were simulated using a wide range of times since expansion (t) and factors (r) by which the population size changed. Irrespective of the parameters used, population expansions always resulted in a smaller variance of ln RV values (Table 7).

Various proportions of admixture from a third population were simulated. As expected, admixture increased the variability in the admixed population, resulting in a shift of mean ln RV values (Table 8). This was also observed if immigrants from a population with a smaller effective population size replaced a large fraction (0.25) of the population (Table 8). The variance in ln RV values, however, was largely unaffected by admixture. Hence, the power of the ln RV test statistic is not significantly influenced by admixture.

Screening for adaptive mutations in the human genome:
Data from mtDNA and microsatellites suggest that human populations left Africa about 100,000 years ago to colonize the rest of the world (JORDE et al. 1998). This migration challenged human populations in the form of a novel environment. Hence, a comparison of African and non-African populations could potentially identify genomic regions that were involved in adaptation processes in the two groups. Using the ln RV test statistic, it should be possible to identify some candidate regions bearing an adaptive mutation. In this report I used a data set consisting of 94 microsatellite loci, which were typed in 10 human populations, 2 African and 8 non-African. To apply the ln RV test statistic, I averaged the observed variances in repeat number in the non-African and African groups for each locus. The distribution of the ln RV values of the 94 microsatellite loci followed a Gaussian distribution (Kolmogorov-Smirnov test, P = 0.94). Out of the 94 loci analyzed, 4 loci had an ln RV value located outside the 95% confidence interval. Two loci had more variation in the non-African populations than expected by the level of variation detected in African populations (D10S249, P = 0.002; D6S305, P = 0.023). Microsatellite loci D6S462 (P = 0.007) and D10S197 (P = 0.018) had a reduced variability in non-African populations. Because the number of outliers alone is fully consistent with neutral expectations, I also evaluated the allele distribution of the two loci that showed the strongest deviation from the remainder of the genome (D10S249 and D6S462). Fig 3 shows the allele distribution of both loci in the pooled African and non-African populations. Consistent with expectations under the selective sweep hypothesis, each locus showed a strongly peaked allele distribution in the population with reduced variability, while the other population had a scattered allele distribution.
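The group-level calculation described above (averaging the variance in repeat number over the populations of each group, then taking the log ratio) can be sketched as follows. The direction of the ratio (non-African over African), the function name, and the repeat-number data are assumptions made for illustration; they are not taken from the Kidd lab data set.

```python
import math
from statistics import mean, variance

def ln_rv_for_locus(group1, group2):
    """ln RV for one locus from two groups of populations.

    Each group is a list of populations; each population is a list of
    repeat numbers, one per sampled chromosome.
    """
    v1 = mean(float(variance(pop)) for pop in group1)  # mean variance, group 1
    v2 = mean(float(variance(pop)) for pop in group2)  # mean variance, group 2
    return math.log(v1 / v2)

# Hypothetical repeat numbers at a single locus.
non_african = [[10, 10, 10, 11, 10, 10], [10, 10, 11, 10, 10, 10]]
african = [[8, 10, 12, 9, 11, 13], [7, 9, 11, 10, 12, 14]]

# Negative values indicate lower variability in the first (non-African) group.
value = ln_rv_for_locus(non_african, african)
```

Averaging variances gives every population equal weight regardless of its sample size; a weighted average would be an equally plausible choice when sample sizes differ strongly.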



Figure 3. Allele frequency distribution at the two microsatellite loci with the most extreme ln RV values in African and non-African populations. () Africa, (■) non-Africa.

Until now, populations were jointly analyzed as African or non-African groups. An alternative approach for the identification of loci that differ between African and non-African populations would be to make individual comparisons of all African against all non-African populations. Even under neutrality, 5% of the loci will be identified as significant outliers by the ln RV test. When multiple pairs of independent populations are compared, neutral outliers are expected to be confined to one comparison, but selected loci should be significant in all comparisons. Although neither the African nor the non-African populations are independent, I compared all African populations against each non-African population. In 16 pairwise comparisons of 94 microsatellite loci, 72 significant outliers were marked by the ln RV test. The probability of observing x significant ln RV tests for a given locus could be calculated by a binomial distribution. Loci D6S462 and D10S249 were significant in 9 and 16 comparisons, respectively, which would be extremely unlikely in 16 independent comparisons (P < 0.0000001). While the P values should not be taken at face value, given that the comparisons were not independent, Fig 4 clearly indicates that both loci are different from the remainder of the genome.
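The binomial calculation can be sketched as follows, assuming 16 independent comparisons and a per-comparison false-positive probability of 0.05. Using the upper tail P(X ≥ x) rather than the point probability P(X = x) is an assumption of this sketch, as is the function name.

```python
import math

def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Probability of at least 9, and of all 16, significant comparisons by chance.
p_nine = binomial_tail(9, 16, 0.05)
p_sixteen = binomial_tail(16, 16, 0.05)
```

Both probabilities fall below the 0.0000001 bound quoted in the text, with the caveat already stated there that the 16 comparisons share populations and are therefore not independent.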



Figure 4. Frequency distribution of the number of significant (P < 0.05) pairwise comparisons for all possible comparisons of African and non-African populations.


*  DISCUSSION

The interpretation of natural variability has been of long-standing interest in population genetics. Natural variability at a given locus is governed by various factors: mutation rate, effective population size, historic sampling, population demography, and selection. Any attempt to identify targets of selection in the genome is challenged by the need to account for the pattern expected under neutrality. In principle, each site in the genome may have its own specific neutral mutation rate. On the other hand, effective population size, demographic history, and historic sampling variation are shared across sites (at least for autosomes). Hence, it would be desirable to have a joint estimator of the parameters common to all loci and to adjust for differences in mutation rate.

The central variable of the new test statistic is ln RV. For every microsatellite locus analyzed, the ratio of the variance in repeat number is calculated for two populations. This ratio has the same expectation independent of the mutation rate of a given locus. Hence, ln RV values calculated for a number of microsatellite loci are independent of the mutation rate, but reflect population-specific parameters including effective population size and historic sampling. Computer simulations indicate that the distribution of ln RV follows to a very good approximation a Gaussian distribution. Thus, the mean and standard deviation summarize the neutral expectations of ln RV for a set of two populations.

Influence of demographic events on the ln RV test statistic:
In contrast to selection, demographic events affect the entire genome. Hence, similar to the demographic model of a constant population size, selection at a genomic region may be detected by a deviation from the remainder of the genome. Computer simulations have been used to study population expansion, bottlenecks, and admixture. Two different aspects of the ln RV test were examined under those demographic scenarios: first, whether ln RV remains normally distributed, and second, the power of the ln RV statistic.

Distribution of ln RV: For some extreme demographic events, such as a recent and strong bottleneck, ln RV is no longer normally distributed (Table 6). This deviation is caused by a large number of microsatellite loci, which have lost almost all variability. It is obvious, however, that a data set containing a large number of loci with no or very little variability cannot be informative to infer a recent selective sweep at one or a few loci by the reduction in variability. Therefore, I do not consider this deviation as a major limitation for the application of the ln RV test. More serious is the deviation from normality that was observed for an old population expansion for highly variable loci (large Θ). While a larger sample size could solve this problem, these simulations indicated that it may be advisable to test the obtained ln RV values for normality before applying the ln RV test.
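One way to carry out such a normality check is sketched below. No specific normality test is prescribed here, so the Jarque-Bera statistic is used purely as an assumption for illustration; it relies on a large-sample chi-square approximation (2 d.f.) and may be unreliable for small numbers of loci. The function name is hypothetical.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic P value) for a list of ln RV values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with an n denominator, as in the usual JB definition.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 d.f. is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value
```

A small P value would suggest that the normal approximation is questionable for the data set at hand, in which case the tail probabilities of the ln RV test should be interpreted with caution.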

Overall, the distribution of ln RV values could be approximated by a normal distribution for most of the parameters of the demographic scenarios considered, suggesting that the ln RV test could also be applied for a wider range of demographic events than just constant population sizes.

In this article I did not consider the effect of population substructure within each of the two populations compared. While further computer simulations are required to determine the influence of population structure on the distribution of ln RV values, it has to be noted that the effective population sizes can be determined for any hierarchical level of population structure (CHESSER et al. 1993). As under neutrality all autosomal loci have the same effective population size, the ln RV test statistic is most likely not affected by population substructure.

The independence of the ln RV test statistic for most of the demographic scenarios analyzed is in sharp contrast to many other statistical tests to identify selection, such as tests for linkage disequilibrium (DEPAULIS 1998; ANDOLFATTO et al. 1999; KOHN et al. 2000; VIEIRA and CHARLESWORTH 2000). These tests could be highly sensitive to admixture, which significantly complicates the identification of selected regions in the genome.

Power of the ln RV test: I estimated the power of the ln RV test for the three demographic scenarios considered by the variance of ln RV values. Given that for constant population sizes, the power of the ln RV statistic increased with a smaller variance of ln RV (Table 5), I assumed that this also applies to other demographic scenarios as long as ln RV follows a normal distribution. Exact power estimates, however, would require computer simulations of the joint effects of selection and a given demographic event. While this may lead to slightly different power estimates, the overall picture is unlikely to be affected.

For all parameters evaluated, population expansion resulted in a narrower distribution of ln RV values (Table 7), suggesting a higher statistical power to detect local selective sweeps in growing populations. Admixture from a distantly related population (not included in the analysis) increases variability at all loci, resulting in a broader distribution of ln RV values (Table 8). Hence, admixture reduces the power of the ln RV test statistic. The effect of bottlenecks was less clear. For high Θ-values a bottleneck consistently resulted in a lower variance of ln RV when compared to a constant population size. Simulations based on intermediate Θ-values showed an increased variance of ln RV for very recent bottlenecks only (Table 6).
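The link between a narrower ln RV distribution and higher power can be made concrete with a deliberately crude Python sketch. It is a heuristic under strong assumptions and not a substitute for the simulations reported in the tables: the selected locus is assumed to follow the same normal distribution as neutral loci, merely shifted by the log of the fraction of variance remaining after the sweep, and the test is taken as one-sided. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(remaining_fraction, sd_ln_rv, alpha=0.05):
    """Heuristic one-sided power when a sweep leaves `remaining_fraction`
    of the neutral variance in the population used as the numerator."""
    std_normal = NormalDist()
    shift = math.log(remaining_fraction)   # negative for a reduction
    z_crit = std_normal.inv_cdf(alpha)     # lower-tail critical value
    # P(shifted normal falls below the neutral critical value).
    return std_normal.cdf(z_crit - shift / sd_ln_rv)


# Same sweep strength, two hypothetical widths of the neutral ln RV distribution.
power_broad = approx_power(0.1, sd_ln_rv=1.5)
power_narrow = approx_power(0.1, sd_ln_rv=0.8)
```

Under these assumptions the same reduction in variability is easier to detect when the neutral standard deviation of ln RV is smaller, which is the qualitative pattern described for expanding versus admixed populations.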

Limitations of the ln RV test statistic:
Historic sampling is an important source of fluctuating variability in the genome. The ln RV test statistic uses the ratio of the observed variability at a given locus in two populations. Given that both populations are subjected to historic sampling, the ln RV test statistic has a considerable variability determined by historic sampling. The computer simulations assumed two completely unrelated populations, which maximizes the variation in ln RV due to historic sampling. Hence, less pronounced selective sweeps are very difficult to identify. This is reflected by the requirement of a strong recent reduction in variability to identify a selected locus with the ln RV test statistic. One possible approach to improve the power of the ln RV test statistic would be the comparison of two closely related populations. Closely related populations share a large fraction of their genealogy at each locus; hence, the variance of ln RV is expected to be significantly smaller than for distantly related populations. Thus, loci that have been exposed to a different selection regime in the two closely related populations should be easier to detect than in a comparison of two distantly related populations. Nevertheless, further simulation studies are required to verify the behavior of the ln RV test statistic for closely related populations or populations connected by gene flow.

The ln RV test statistic makes the important assumption of constant mutation rates across populations. This assumption could be easily violated given the strong correlation between microsatellite mutation rate and repeat number (SCHLÖTTERER et al. 1998; HARR and SCHLÖTTERER 2000). Furthermore, interruptions in the microsatellite repeat also reduce microsatellite mutation rates (WEBER 1990). If the two populations differ in the average repeat number or an interruption in the repeat structure is more frequent in one population than in the other, this could result in a significant ln RV test statistic. Experimental evidence, such as sequencing of alleles, could provide further insight. Furthermore, a comparison of the allele distribution will also be informative (see below).

The ln RV test statistic uses the mean and standard deviation of the observed ln RV values to identify those loci that deviate from the remainder of the genome. The probability for each locus to deviate from the expectation can be directly inferred from the density function of the normal distribution. A larger number of loci results in a more accurate estimate of the mean and standard deviation, but also in a larger number of loci with ln RV values located in the tails of the normal distribution. A generally accepted solution to this problem is the use of a smaller α-value, which reduces the number of false positives. As a consequence, however, the identification of loci subjected to selection becomes more difficult and the type 2 error increases. Computer simulations indicated that even an α-value of 0.05 results in a considerable type 2 error (Table 3 and Table 4). A more practical approach is to use the ln RV test statistic as a first-pass analysis for the identification of candidate regions. Whether or not an identified region has been subject to a selective sweep could then be further investigated by the analysis of flanking microsatellite loci, which are also expected to show the signature of a selective sweep (WIEHE 1998). Also, sequence analysis at the candidate locus could provide further evidence for a selective sweep when test statistics specific for sequence polymorphism are applied (OTTO 2000). Additional evidence for a selective sweep could be obtained from those loci that have already recovered some variability after the sweep. The allele distribution of a locus that starts accumulating variation after the fixation of a single allele is strongly peaked. The more mutations that have occurred since the fixation, the broader the distribution becomes, until the random loss of alleles erodes the single-peaked distribution. Hence, after a sweep, the allele distribution should be more tightly peaked than in populations that have not experienced a sweep.
Following the same rationale, multilocus test statistics based on the allele frequency distribution have been suggested to infer population size changes (BEAUMONT 1999; REICH et al. 1999; GARZA and WILLIAMSON 2001). The power of an allele frequency distribution for a single locus, however, is generally weak and could only be used as confirmatory evidence.
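The first-pass screen described in this section (standardize by the observed mean and standard deviation, then read tail probabilities from the normal distribution) might look as follows in Python. This is a sketch only; the two-sided test, the function names, and the toy values are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev


def outlier_screen(ln_rv_values, alpha=0.05):
    """Return (index, z, two-sided P, flagged) for each locus."""
    mu = mean(ln_rv_values)
    sd = stdev(ln_rv_values)
    std_normal = NormalDist()
    results = []
    for index, value in enumerate(ln_rv_values):
        z = (value - mu) / sd
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        results.append((index, z, p_value, p_value < alpha))
    return results


def expected_false_positives(n_loci, alpha):
    """Number of neutral loci expected in the tails by chance alone."""
    return n_loci * alpha


# Toy data: 20 unremarkable loci plus one hypothetical extreme value.
toy_values = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4 + [6.0]
flagged = [r for r in outlier_screen(toy_values) if r[3]]
```

For a data set of 94 loci screened at α = 0.05, `expected_false_positives(94, 0.05)` gives about 4.7 loci expected in the tails by chance, which illustrates why flagged loci are better treated as candidates for follow-up than as proof of selection. Note that an extreme locus also inflates the standard deviation it is judged against.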

Finally, the analysis of independent pairs of populations could serve as an additional tool for the verification of an identified deviation from neutral expectations. While outliers are expected to occur in a single comparison only, selected loci should be detected in most of the pairwise comparisons. Assuming independence among the populations, the probability of observing multiple significant tests could be evaluated on the basis of the binomial distribution. In reality, however, populations are rarely independent, as they share a common history. Nevertheless, the analysis indicated that most of the significant ln RV tests between African and non-African populations occurred in a limited number of comparisons only. Out of 19 loci for which a significant ln RV test was recorded, 11 loci were found to be significant in 1 or 2 comparisons only. While it is impossible to rule out that some form of local selection has acted on those loci, the more likely explanation is that they are false positives. In any case, out of 16 tests the two candidate loci D6S462 and D10S249 had a significant ln RV value in 9 and 16 comparisons, respectively.
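The binomial argument can be illustrated with a short Python sketch. It takes the independence assumption at face value, which, as noted above, is unrealistic for populations with a shared history, so the resulting probabilities are at best a rough guide. The per-comparison α of 0.05 and the function name are assumptions for illustration.

```python
from math import comb


def binomial_tail(n_tests, k_significant, alpha):
    """P(at least k_significant of n_tests are significant by chance),
    assuming independent tests that each have false-positive rate alpha."""
    return sum(
        comb(n_tests, i) * alpha ** i * (1.0 - alpha) ** (n_tests - i)
        for i in range(k_significant, n_tests + 1)
    )


# 16 pairwise comparisons (8 non-African x 2 African populations).
p_two_or_more = binomial_tail(16, 2, 0.05)
p_nine_or_more = binomial_tail(16, 9, 0.05)
p_all_sixteen = binomial_tail(16, 16, 0.05)
```

Under independence, one or two significant comparisons out of 16 are far more probable by chance than 9 or 16, consistent with treating loci that are significant in only 1 or 2 comparisons as likely false positives.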

Much of the theory of the ln RV test is based on the variance in repeat number at mutation-drift equilibrium. While the high mutation rate of microsatellites means that less time is required to reach mutation-drift equilibrium, most natural populations are not expected to meet this condition. Computer simulations indicated that the normal distribution of ln RV seems to be quite robust to demographic events. Furthermore, no deviation from normality could be detected for the ln RV values of human populations. Additional studies, in particular experimental ones with a large number of loci analyzed, will provide further insight into the behavior of the ln RV test statistic in natural populations.

Genomic regions associated with selective sweeps in human populations:
The simple model of out-of-Africa-associated adaptive mutations would have predicted more loci with significantly reduced variability in non-African populations than in African populations. In the human data set of 94 microsatellite loci, however, the same numbers of outliers were observed on both sides of the ln RV distribution.

Locus D6S462 showed the strongest reduction in variability in the non-African populations, suggesting that this locus may have been linked to a genomic region that has swept in non-African populations. The approach to combining populations in African and non-African groups requires a consistently low level of variation across populations to result in a significant ln RV value. A separate analysis of the eight non-African populations against the pooled African populations indicated that each of the non-African populations had a reduced variability at locus D6S462 (P < 0.065, one-sided test). Given that the eight non-African populations covered a wide range of human diversity outside of Africa, the strong reduction in variability in the non-African populations is best explained by a selection event that coincided with the colonization. Further evidence for a recent selective sweep at D6S462 could be gleaned from the allele distribution. While the African population showed a scattered allele distribution, the non-African populations had a highly peaked allele distribution (Fig 3), a pattern that would be expected for an allele that has swept through the population and is starting to accumulate new mutations.

The recently published draft of the human genome (INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM 2001; VENTER et al. 2001) could potentially indicate genes flanking the two candidate regions. While for D6S462 several flanking expressed sequence tags (ESTs) could be detected, a Genome Data Bank (GDB) search did not indicate known genes mapping to D6S462. Further analysis has to await the progress of the analysis of the human genome.

In contrast to the expectations for a selection event associated with the human habitat expansion out of Africa, the locus that deviates most from the remainder of the genome, D10S249, harbored a surplus of variability in non-African populations. Based on the allele spectrum at locus D10S249 (Fig 3), it is very likely that this locus has been subjected to a recent selective sweep. A BLAST search of the human subset of GenBank failed to identify locus D10S249 in the published draft of the human genome sequence. Thus, no information about flanking sequences is available. A GDB search indicated that locus D10S249 is located in the amplimer AFM207wd12. The gene mapping closest to microsatellite D10S249 is called Severe Combined Immunodeficiency, Athabaskan type (SCIDA), a genomic region associated with both T-cell and B-cell immunity (MURPHY and STRINGER 1986). V(D)J recombination, which accounts for the diversity of T-cell receptor and immunoglobulin-encoding genes, is initiated by a specific double-strand break. The general DNA repair machinery is responsible for the resolution of this break. Previously, it was shown that an essential DNA repair/V(D)J recombination gene lies in the same region as SCIDA (MOSHOUS et al. 2000). While it remains purely speculative until further proof (which will become feasible with the availability of the genomic sequence of this region), it is conceivable that a gene involved in immune defense is a potential target for adaptive mutations. Populations are constantly challenged by pathogen pressure and one way to counter this pressure is the acquisition of novel mutations to control pathogens.

Perspective:
The introduced test statistic provides a means to search multilocus data to identify those loci that show a deviation from neutral expectations in one population (group). Given the inherent problem of a multilocus test statistic and the high type 2 error of the ln RV test statistic, it is obvious that loci identified as outliers by the ln RV test are no final proof of selection, but could serve as a starting point for subsequent studies.


*  ACKNOWLEDGMENTS

I am particularly grateful to K. Kidd for making the data public on the web. Many thanks go to the C.S. lab, C. Haley, and R. R. Hudson for helpful discussions. R. Harding, B. Harr, M. van Staaden, and G. Muir provided comments on the manuscript. Special thanks to T. Wiehe for pointing out the glitches of the expectation of the ratio of two random variables. R. Bürger is acknowledged for his help in approximating the expectation of the ratio of two random variables. Three anonymous reviewers provided helpful comments, which improved the manuscript. W. Schlötterer helped with the C code. C.S. is supported by grants from the Fonds zur Förderung der wissenschaftlichen Forschung (FWF).

Manuscript received July 19, 2001; Accepted for publication November 5, 2001.


*  LITERATURE CITED

ANDOLFATTO, P., J. D. WALL, and M. KREITMAN, 1999  Unusual haplotype structure at the proximal breakpoint of In(2L)t in a natural population of Drosophila melanogaster. Genetics 153:1297-1311.

BEAUMONT, M. A., 1999  Detecting population expansion and decline using microsatellites. Genetics 153:2013-2029.

BEAUMONT, M. A. and R. A. NICHOLS, 1996  Evaluating loci for use in genetic analysis of population structure. Proc. R. Soc. Lond. Ser. B 263:1619-1626.

BOWCOCK, A. M., J. R. KIDD, J. L. MOUNTAIN, J. M. HEBERT, and L. CAROTENUTO et al., 1991  Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc. Natl. Acad. Sci. USA 88:839-843.

BRINKMANN, B., M. KLINTSCHAR, F. NEUHUBER, J. HUHNE, and B. ROLF, 1998  Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. 62:1408-1415.

CAVALLI-SFORZA, L., 1966  Population structure and human evolution. Proc. R. Soc. Lond. Ser. B 164:362-379.

CHESSER, R. K., O. E. RHODES, D. W. SUGG, and A. SCHNABEL, 1993  Effective size for subdivided populations. Genetics 135:1221-1232.

DEPAULIS, F., 1998  Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:1788-1790.

DI RIENZO, A., A. C. PETERSON, J. C. GARZA, A. M. VALDES, and M. SLATKIN et al., 1994  Mutational processes of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91:3166-3170.

DI RIENZO, A., P. DONNELLY, C. TOOMAJIAN, B. SISK, and A. HILL et al., 1998  Heterogeneity of microsatellite mutations within and between loci and implications for human demographic histories. Genetics 148:1269-1284.

GARZA, J. C. and E. G. WILLIAMSON, 2001  Detection of reduction in population size using data from microsatellite loci. Mol. Ecol. 10:305-318.

GOLDSTEIN, D. B., A. RUIZ LINEARES, L. L. CAVALLI-SFORZA, and M. W. FELDMAN, 1995  An evaluation of genetic distances for use with microsatellite loci. Genetics 139:463-471.

GOLDSTEIN, D. B., L. A. ZHIVOTOVSKY, K. NAYAR, A. RUIZ LINEARES, and L. L. CAVALLI-SFORZA et al., 1996  Statistical properties of the variation at linked microsatellite loci: implications for the history of human Y chromosomes. Mol. Biol. Evol. 13:1213-1218.

HARR, B. and C. SCHLÖTTERER, 2000  Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155:1213-1220.

HARR, B., B. ZANGERL, G. BREM, and C. SCHLÖTTERER, 1998  Conservation of locus specific microsatellite variability across species: a comparison of two Drosophila sibling species D. melanogaster and D. simulans. Mol. Biol. Evol. 15:176-184.

HUDSON, R. R., 1990  Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.

INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM, 2001  Initial sequencing and analysis of the human genome. Nature 409:860-921.

JORDE, L. B., M. BAMSHAD, and A. R. ROGERS, 1998  Using mitochondrial and nuclear DNA markers to reconstruct human evolution. BioEssays 20:126-136.

KOHN, M. H., H. J. PELZ, and R. K. WAYNE, 2000  Natural selection mapping of the warfarin-resistance gene. Proc. Natl. Acad. Sci. USA 97:7911-7915.

LEWONTIN, R. C. and J. KRAKAUER, 1973  Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74:175-195.

LYNCH, M. and B. WALSH, 1998  Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.

MAYNARD SMITH, J. and J. HAIGH, 1974  The hitch-hiking effect of a favorable gene. Genet. Res. 23:23-35.

MORAN, P. A. P., 1975  Wandering distributions and electrophoretic profile. Theor. Popul. Biol. 8:318-330.

MOSHOUS, D., L. LI, R. CHASSEVAL, N. PHILIPPE, and N. JABADO et al., 2000  A new gene involved in DNA double-strand break repair and V(D)J recombination is located on human chromosome 10p. Hum. Mol. Genet. 9:583-588.

MURPHY, K. E. and J. R. STRINGER, 1986  RecA independent recombination of poly[d(GT)-d(CA)] in pBR322. Nucleic Acids Res. 14:7325-7340.

NEI, M. and T. MARUYAMA, 1975  Letters to the editors: Lewontin-Krakauer test for neutral genes. Genetics 80:395.

OHTA, T. and M. KIMURA, 1973  A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22:201-204.

OTTO, S. P., 2000  Detecting the form of selection from DNA sequence data. Trends Genet. 16:526-529.

PRITCHARD, J. K. and M. W. FELDMAN, 1998  A test for heterogeneity of microsatellite variation, pp. 47-56 in Proceedings of the Trinational Workshop on Molecular Evolution, edited by M. K. UYENOYAMA and A. VON HAESELER. Duke University Publications Group, Durham, NC.

PRITCHARD, J. K., M. STEPHENS, and P. DONNELLY, 2000  Inference of population structure using multilocus genotype data. Genetics 155:945-959.

REICH, D. E., M. W. FELDMAN, and D. B. GOLDSTEIN, 1999  Statistical properties of two tests that use multilocus data sets to detect population expansions. Mol. Biol. Evol. 16:453-466.

ROBERTSON, A., 1975  Remarks on the Lewontin-Krakauer test. Genetics 80:396.

SCHLÖTTERER, C. and T. WIEHE, 1999  Microsatellites, a neutral marker to infer selective sweeps, pp. 238-248 in Microsatellites: Evolution and Applications, edited by D. GOLDSTEIN and C. SCHLÖTTERER. Oxford University Press, Oxford.

SCHLÖTTERER, C., C. VOGL, and D. TAUTZ, 1997  Polymorphism and locus-specific effects on polymorphism at microsatellite loci in natural Drosophila melanogaster populations. Genetics 146:309-320.

SCHLÖTTERER, C., R. RITTER, B. HARR, and G. BREM, 1998  High mutation rates of a long microsatellite allele in Drosophila melanogaster provide evidence for allele-specific mutation rates. Mol. Biol. Evol. 15:1269-1274.

SLATKIN, M., 1995a  Hitchhiking and associative overdominance at a microsatellite locus. Mol. Biol. Evol. 12:473-480.

SLATKIN, M., 1995b  A measure of population subdivision based on microsatellite allele frequencies. Genetics 139:457-462.

TSAKAS, S. and C. B. KRIMBAS, 1976  Testing the heterogeneity of F values: a suggestion and a correction. Genetics 84:399-401.

VENTER, J. C., M. D. ADAMS, E. W. MYERS, P. W. LI, and R. J. MURAL et al., 2001  The sequence of the human genome. Science 291:1304-1351.

VIEIRA, J. and B. CHARLESWORTH, 2000  Evidence for selection at the fused locus of Drosophila virilis. Genetics 155:1701-1709.

VITALIS, R., K. DAWSON, and P. BOURSOT, 2001  Interpretation of variation across marker loci as evidence of selection. Genetics 158:1811-1823.

WEBER, J. L., 1990  Informativeness of human (dC-dA)n·(dG-dT)n polymorphisms. Genomics 7:524-530.

WIEHE, T., 1998  The effect of selective sweeps on the variance of the allele distribution of a linked multi-allele locus-hitchhiking of microsatellites. Theor. Popul. Biol. 53:272-283.

WIERDL, M., M. DOMINSKA, and T. D. PETES, 1997  Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146:769-779.



