Many data sets used for population genetics contain artifactual sites, i.e., sequencing errors. Here, we first explore the impact of such errors on several common summary statistics, assuming that sequencing errors are mostly singletons. We show that in the presence of these errors, estimators of θ can be strongly biased. We further show that even with a moderate number of sequencing errors, neutrality tests based on the frequency spectrum reject neutrality. This implies that analyses of data sets with such errors will systematically lead to wrong inferences of evolutionary scenarios. To avoid these errors, we propose two new estimators of θ that ignore singletons as well as two new tests, Y and Y*, that can be used to test neutrality despite sequencing errors. All in all, we show that even though singletons are ignored, these new tests retain some power to detect deviations from a standard neutral model. We therefore advise the use of these new tests to strengthen conclusions in suspicious data sets.
THE mode of evolution of a population sculpts the polymorphisms that segregate in its homologous sequences. Under a standard neutral model (i.e., a Wright–Fisher model or related models), an extensive body of theory has been developed to characterize those polymorphisms. Using the expected properties of these polymorphisms, population geneticists have proposed several summary statistics measuring the departure from the standard model. These statistics are used to test the plausibility of the standard model.
Most, if not all, of the neutrality tests based on the frequency spectrum compare different estimators of θ = 2pNeμ, where p is the ploidy (1 for haploids and 2 for diploids), Ne is the effective population size, and μ is the whole-locus mutation rate. Among a sample of n homologous sequences, several estimators of θ have been derived using a coalescent framework. These include (i) θS = S/an, which uses S, the total number of polymorphic sites, along with the correcting factor an = 1 + 1/2 + ⋯ + 1/(n − 1) (Watterson 1975); (ii) θπ = π, where π is the average pairwise difference in the sample (Tajima 1983); (iii) θξ1 = ξ1, which uses ξ1, the number of derived singletons (sites at frequency 1/n) (Fu and Li 1993); and (iv) θη1 = η1(n − 1)/n, based on η1, the total number of singletons [sites at frequency 1/n or (n − 1)/n] (Fu and Li 1993).
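These four estimators are simple functions of the unfolded site-frequency spectrum; the sketch below (function and variable names are our own, not from the original) computes all four at once:

```python
import numpy as np

def theta_estimators(xi, n):
    """Four classical theta estimators from an unfolded site-frequency
    spectrum: xi[i - 1] is the number of sites at derived frequency i/n."""
    xi = np.asarray(xi, dtype=float)
    i = np.arange(1, n)
    a_n = (1.0 / i).sum()                 # Watterson's correcting factor
    S = xi.sum()                          # total number of polymorphic sites
    pi = (xi * 2 * i * (n - i)).sum() / (n * (n - 1))  # avg pairwise difference
    theta_S = S / a_n                     # Watterson 1975
    theta_pi = pi                         # Tajima 1983
    theta_xi1 = xi[0]                     # Fu & Li 1993, derived singletons
    theta_eta1 = (xi[0] + xi[-1]) * (n - 1) / n  # Fu & Li 1993, all singletons
    return theta_S, theta_pi, theta_xi1, theta_eta1
```

Under the standard model E[ξi] = θ/i, so feeding the expected spectrum returns θ for all four estimators.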
Statistics used for neutrality tests are typically based on the difference between two of these estimators, normalized by its standard deviation. Hence, using these estimators, several statistics were proposed: Tajima's D (Tajima 1989) as well as four other tests that compute differences between other estimators, D*, F*, DFL, and F (Fu and Li 1993). Fu (1996) later proposed two other statistics, Gη and Gξ, that compute the sum of the differences between the observed and the expected numbers of polymorphic sites at each frequency. Fu (1997) defined a general framework of tests that encompasses all of these tests, among others (like the H test of Fay and Wu 2000). A comparison of the main tests (Simonsen et al. 1995; Fu 1997) reveals that all these statistics usually behave similarly, although some violations of the standard model induce notable differences. For example, all statistics have almost the same power to detect population growth (Fu 1997) but show differences in detecting selective sweeps (Simonsen et al. 1995; Fu 1997; Fay and Wu 2000). The original Tajima's D is usually one of the most powerful tests (Simonsen et al. 1995; Fu 1997), although this is not systematically the case (see, e.g., Teshima et al. 2006).
Singletons are the class of polymorphisms that have the highest impact on several statistics including D. An excess or a deficit of singletons strongly skews the statistics; this deviation can lead to the rejection of the standard model. As we discuss below, sequencing errors are mostly singletons. In that respect, these errors have a strong harmful potential for population geneticists who want to infer evolutionary scenarios from data sets. Even when substantial effort is made to correct sequencing errors, singletons are typically not taken into account (Innan et al. 2003).
“Sequencing errors” are used here in their broadest sense. They encompass any errors that are introduced during the experimental procedure. Sequencing is done directly either from amplified genomic regions or from a cloned fragment.
In the former case, sequencing is done on a pool of amplified fragments, some of which typically contain PCR errors. Indeed, the error rates of the replication enzymes used in PCR are typically high: regular Taq polymerases and reverse transcriptases make ∼10−4 errors/bp, and Pfu polymerases, which proofread the newly synthesized fragments, make ∼10−6 errors/bp (see, e.g., the Invitrogen website). Because direct sequencing reads a consensus of all amplified fragments, it introduces few errors. However, new sequencing techniques (i.e., pyrosequencing) can exhibit an error rate as high as 10−3 errors/bp (Wang et al. 2007). Whatever the error rate, only increasing the coverage (the number of independent replicates) reduces the number of sequencing errors.
Alternatively, one can sequence a clone, that is, a genomic fragment embedded in a plasmid. If the cloned fragment is a PCR product, it typically contains one or more amplification errors. Indeed, even though the consensus of all amplifications is essentially error free, individual amplified fragments typically contain errors. Any error carried by the clone itself cannot be corrected by increasing the sequencing coverage. Estimates of the error rate for cloned amplified fragments from the maize genome give 7 × 10−4 errors/bp (Eyre-Walker et al. 1998; Tiffin and Gaut 2001). Here also, increasing the number of independent clones of the same genomic region can be used to reduce the number of errors.
There are several reasons why only a single clone is sequenced. The first obvious reason is cost. Typically, exploratory genome projects [e.g., the Génolevures project (Souciet et al. 2000)] invest more in the number of sequenced genomes than in the coverage of each sequence. This type of analysis culminates with the metagenomic projects [e.g., the Sargasso Sea project (Venter et al. 2004)], where random chunks of DNA from environmental samples are cloned and subsequently sequenced. Another interesting example is when sequences from a given organism cannot be independently cloned twice. This happens either when no clonal culture is available or when the organism cannot be grown, which is the case for several microorganisms and almost all viruses. We emphasize the case of individual cloning of retroviruses (see, for example, the method developed by Palmer et al. 2003), where, for each individual, only a single clone is obtained by reverse PCR.
Finally, ancient DNA also contains many errors due to DNA degradation (Green et al. 2006). In this last case, chemical damage that accumulates in the molecule through the years makes it impossible to retrieve the original sequence at some sites.
Recently, a very interesting approach explicitly incorporated sequencing errors (among other experimental biases) into coalescent models (Knudsen and Miyamoto 2007). This model can be used to estimate, through a full maximum-likelihood framework, population parameters as well as the sequencing error rate. Although this is a very powerful approach, it remains extremely computationally intensive and is strongly affected by recombination events. We therefore think that summary-statistics estimations (moment methods) are complementary to these types of methods. In that regard, the impact of sequencing errors on θS and θπ was studied very recently by Johnson and Slatkin (2008). The authors developed a finite-site model that accounts for these errors on the basis of their probability of occurrence per site in a finite sequence. Complementarily, we develop here an infinite-site model (i.e., all sequencing errors are new singletons) to characterize their impact in detail and avoid them without any prior knowledge of their likelihood.
Here we focus on the impact of sequencing errors on neutrality tests based on the frequency spectrum. First, we explore how strongly sequencing errors can affect the estimators of θ. Then, we show that the D and F statistics (Tajima 1989; Fu and Li 1993) can be highly skewed by sequencing errors. Therefore, we propose new θ-estimators that are insensitive to sequencing errors (θSξ, θπξ, θSη, and θπη) and two related statistics (Y* and Y) that can be used to test neutrality despite the presence of sequencing errors. We then analyze the power of the tests based on these two statistics to detect some violations of the standard model, variable population size (bottleneck), selection, and isolation (an extreme case of population structure), with and without sequencing errors.
Regardless of the experimental artifact leading to sequencing errors, the errors are very likely to be uniformly distributed along the sequences. As a consequence, if the sequences are long enough, almost all sequencing errors will be singletons. If so, the singletons observed in a sample of sequences are a mixture of real singletons and sequencing errors. There are two types of mutations that are singletons: the ξ1 ones at frequency 1/n and the ξn−1 ones at frequency (n − 1)/n. Without the help of an outgroup, these two classes are pooled into a single class of mutations, η1 = ξ1 + ξn−1.
The number of sequencing errors, ε, depends on both the locus error rate, μerr, and the number of sequences, n. Indeed, ε can be modeled as a Poisson random variable with parameter nμerr. Both ε and S, the number of real mutations, increase linearly with the sequence length (i.e., the whole-locus rates scale as μseq = Lseq × μsite). This implies that increasing the length of the sequence of interest will not alter the fraction of artifactual sites. On the contrary, S and ε do not exhibit a similar relationship with n, the number of sequences: ε increases linearly with n, whereas S increases only logarithmically with n. In that respect, increasing the number of sequences in the sample inflates the fraction of artifactual sites and thus worsens the situation. The effect is even more dramatic when one considers only the derived singletons (ξ1), whose average number does not depend on n (E[ξ1] = θ). As a result, adding new sequences adds, on average, only new artifactual singletons but no real ones.
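The contrasting scalings of real and artifactual sites with n can be tabulated directly from the expectations E[S] = anθ, E[ξ1] = θ, and E[ε] = nμerr (a small sketch; the values θ = 2 and μerr = 0.1 are arbitrary illustration choices):

```python
def expected_counts(theta, mu_err, n):
    """Expected numbers of real polymorphic sites, real derived singletons,
    and artifactual singletons under the infinite-site model of the text."""
    a_n = sum(1.0 / i for i in range(1, n))
    return theta * a_n, theta, n * mu_err  # E[S], E[xi1], E[eps]

# Real sites grow like log(n); artifactual singletons grow linearly with n,
# and real singletons do not grow at all.
for n in (10, 100, 1000):
    S, xi1, eps = expected_counts(theta=2.0, mu_err=0.1, n=n)
    print(f"n={n:4d}  E[S]={S:5.1f}  E[xi1]={xi1}  E[eps]={eps:6.1f}")
```

Doubling n doubles the expected number of errors while adding only a handful of real segregating sites.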
Impact of sequencing errors on θ-estimators:
We first study how sequencing errors affect four common θ-estimators: θS, θπ, θξ1, and θη1. Even though all of these estimators will be inflated by sequencing errors, we can expect that the errors will not affect all of them equally. In fact, each error adds a new singleton as well as a new segregating site but adds only 2/n to the average pairwise difference. Since S and ε are independent, all covariances between real and artifactual sites are null. Using the expectations of π, S, η1, and ξ1 as well as their variances (Watterson 1975; Tajima 1983; Fu and Li 1993) (Equations B1–B13), and recalling that ε is Poisson with mean (and thus variance) nμerr, we can express the means and the variances of all biased estimators as

E[θSerr] = θ + nμerr/an, (1)
Var[θSerr] = Var[θS] + nμerr/an², (2)
E[θπerr] = θ + 2μerr, (3)
Var[θπerr] = Var[θπ] + 4μerr/n, (4)
E[θξ1err] = θ + nμerr, (5)
Var[θξ1err] = Var[θξ1] + nμerr, (6)
E[θη1err] = θ + (n − 1)μerr, (7)
Var[θη1err] = Var[θη1] + (n − 1)²μerr/n, (8)

where an = 1 + 1/2 + ⋯ + 1/(n − 1) and Var[θS], Var[θπ], Var[θξ1], and Var[θη1] are the usual error-free variances (Equations B1–B13).
The expectations are in good agreement with intuition; the effect is strongest on θξ1 and weakest on θπ. It is worth noting that for n > 4, the bias is stronger for θS than for θπ. The expectations also show that increasing the number of sequences inflates the bias for all estimators except θπ. The examination of the variances reveals some interesting properties as well. As intuition suggests, adding sequencing errors inflates all variances. Among the original variances (set μerr to 0), only the variance of θS vanishes as n increases, and this is the reason why θ is preferentially estimated from S. The other variances typically converge to a constant as n increases. Adding sequencing errors drastically changes this pattern. Indeed, the relationship between the variances and n becomes linear for θξ1err and θη1err, sublinear for θSerr, and converges to a constant for θπerr. With a moderate rate of sequencing error (i.e., μerr = 0.1) and a typical sample size (n ≤ 100), Var[θSerr] can even exceed Var[θπerr] for small θ-values (i.e., θ ≤ 1).
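The relative sizes of these biases can be checked numerically; the sketch below (names are our own) encodes the expected inflation of each estimator, which follows from the fact that each error adds one singleton, one segregating site, and 2/n to π:

```python
def expected_bias(mu_err, n):
    """Expected bias E[estimator] - theta caused by errors, for each of the
    four classical estimators (infinite-site model, errors = singletons)."""
    a_n = sum(1.0 / i for i in range(1, n))
    return {
        "theta_S":    n * mu_err / a_n,       # errors add to S
        "theta_pi":   2 * mu_err,             # each error adds 2/n to pi
        "theta_xi1":  n * mu_err,             # every error is a singleton
        "theta_eta1": (n - 1) * mu_err,       # eta1 estimator scales by (n-1)/n
    }

b = expected_bias(mu_err=0.1, n=50)
# Strongest on theta_xi1, weakest on theta_pi; theta_S in between for n > 4.
assert b["theta_xi1"] > b["theta_eta1"] > b["theta_S"] > b["theta_pi"]
```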
Therefore, in the presence of nonnegligible sequencing errors, estimations of θ should be performed carefully. Since both the mean and the variance of θπ are less affected by errors, its use is less inadequate than that of θS, especially when the sample size gets large or when θ ≤ 1. This is in good agreement with the predictions of a finite-site model (Johnson and Slatkin 2008). However, all these estimators of θ are biased; this consequently motivates the derivation of new estimators that are immune to sequencing errors.
The simplest way to correct for sequencing errors would be to ignore some of the observed singletons in the sequences. However, this assumes that we are able to properly estimate the number of artifactual singletons, which is unrealistic. Therefore, we derive new estimators that do not use the singletons to estimate θ. Although there is no way to compute revised versions of θξ1 and θη1, since they are solely based on singletons, one can compute new estimators derived from both the number of segregating sites and the average pairwise differences when singletons are ignored. Depending on the availability of an outgroup to orient the mutations, these estimators are defined as θSξ and θπξ (with outgroup) or θSη and θπη (no outgroup). Their means were derived by removing the expected numbers of singletons in a sample, given by either E[ξ1] or E[η1] = E[ξ1] + E[ξn−1]; these two values are respectively E[ξ1] = θ and E[ξn−1] = θ/(n − 1) (Fu and Li 1993). It is important to point out that a singleton weighs 2/n on π. As a consequence, the expectations of the singleton-free quantities are

E[S − ξ1] = (an − 1)θ, (9)
E[π − 2ξ1/n] = ((n − 2)/n)θ, (10)
E[S − η1] = (an − 1 − 1/(n − 1))θ, (11)
E[π − 2η1/n] = ((n − 3)/(n − 1))θ, (12)

where an = 1 + 1/2 + ⋯ + 1/(n − 1).
Corresponding variances are presented in appendix b (see Equations B22, B23, B31, and B32). We used the mean values to derive unbiased estimators of θ that should be insensitive to sequencing errors:

θSξ = (S − ξ1)/(an − 1), (13)
θπξ = (n/(n − 2))(π − 2ξ1/n), (14)
θSη = (S − η1)/(an − 1 − 1/(n − 1)), (15)
θπη = ((n − 1)/(n − 3))(π − 2η1/n). (16)
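A sketch of these four singleton-free estimators (again with our own naming); each divides the singleton-free observable by its expectation coefficient so that the result is unbiased for θ:

```python
import numpy as np

def theta_no_singletons(xi, n, outgroup=True):
    """Theta estimators that ignore singletons, from an unfolded SFS
    (xi[i - 1] = number of sites at derived frequency i/n).  With an
    outgroup, only derived singletons (xi_1) are dropped; without one,
    both singleton classes (eta_1 = xi_1 + xi_{n-1}) are dropped."""
    xi = np.asarray(xi, dtype=float)
    i = np.arange(1, n)
    a_n = (1.0 / i).sum()
    S = xi.sum()
    pi = (xi * 2 * i * (n - i)).sum() / (n * (n - 1))
    if outgroup:
        theta_from_S = (S - xi[0]) / (a_n - 1)
        theta_from_pi = (pi - 2 * xi[0] / n) * n / (n - 2)
    else:
        eta1 = xi[0] + xi[-1]
        theta_from_S = (S - eta1) / (a_n - 1 - 1 / (n - 1))
        theta_from_pi = (pi - 2 * eta1 / n) * (n - 1) / (n - 3)
    return theta_from_S, theta_from_pi
```

On the expected neutral spectrum E[ξi] = θ/i, all four quantities return θ, illustrating unbiasedness.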
Expected impact of sequencing errors on D and F:
Since the overestimation of θ differs among estimators, all neutrality tests based on a difference between θS, θπ, θξ1, and θη1 will be biased. Here, we focus on how tests based on D (Tajima 1989) and F (Fu and Li 1993) are affected by sequencing errors. Note that all tests that use θξ1 or θη1 (i.e., all the Fu and Li 1993 ones) behave very similarly to the F test. Therefore, for clarity, we present results for F only, but all our conclusions apply similarly to the other tests.
Even though the full probability distribution of D is difficult to derive, one can obtain an approximation of the average Derr as

E[Derr] ≈ (2 − n/an)μerr / √(Var[θπ − θS] + (2/n − 1/an)²nμerr). (19)
Similarly, one can show that

E[Ferr] ≈ (2 − n)μerr / √(Var[θπ − θξ1] + (1 − 2/n)²nμerr). (20)
As expected, sequencing errors drive Derr and Ferr toward negative values. Indeed, the numerator of Equation 19 is always negative (or null when n = 2), and that of Equation 20 even more dramatically so. Comparing the numerators of Equations 20 and 19, we expect the impact to be stronger on Ferr than on Derr. This negative bias increases with both the sequencing error rate, μerr, and the number of sampled sequences, n. Interestingly, the estimated variances in the denominators are larger with sequencing errors, which fits intuition, since adding another random variable (the number of sequencing errors) inflates the overall variance. The left terms are the usual variances whereas the right terms are the expected increase, which grows with n and μerr. Increasing the denominator tends to diminish the magnitude of the bias on D and F (without changing their signs).
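The signs and relative sizes of the two biases can be read off the numerators alone, i.e., the differences of the estimator biases derived above (a sketch with our own naming):

```python
def err_numerators(mu_err, n):
    """Expected numerators of D_err and F_err: bias(theta_pi) minus
    bias(theta_S) and bias(theta_xi1), respectively."""
    a_n = sum(1.0 / i for i in range(1, n))
    num_D = (2 - n / a_n) * mu_err   # <= 0, null at n = 2
    num_F = (2 - n) * mu_err         # more negative than num_D for n > 2
    return num_D, num_F

for n in (2, 10, 50):
    d, f = err_numerators(0.1, n)
    assert d <= 0 and f <= d         # F is hit at least as hard as D
```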
Simulated impact of sequencing errors on D and F:
To assess more precisely the impact of sequencing errors on D and F, we ran simulations of a standard model in which extra singletons were added to the resulting sequences. All coalescent simulations were performed using a standard algorithm that depends only on θ and n; detailed descriptions of such algorithms are given in reviews or books such as Hudson (1990) and Hein et al. (2005). Here, all sequencing errors are assumed to be singletons. Therefore, to simulate data with sequencing errors, we simulate a regular coalescent tree with “true” mutations and add “artifactual” singletons whose number ε is drawn as a Poisson random variable with parameter μerr × n. Errors are distributed uniformly among all sequences.
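A minimal version of such a simulation, written as a sketch rather than as the authors' actual code, draws one standard coalescent genealogy, drops Poisson mutations on each branch, and then adds the Poisson(nμerr) artifactual singletons:

```python
import numpy as np

def simulate_sfs(theta, mu_err, n, rng):
    """One replicate of the standard coalescent, returning the unfolded
    SFS with Poisson(n * mu_err) artifactual singletons added."""
    sizes = [1] * n             # number of sampled sequences below each lineage
    sfs = np.zeros(n - 1)
    k = n
    while k > 1:
        # while k lineages remain, the level lasts Exp(k(k - 1)/2)
        t = rng.exponential(2.0 / (k * (k - 1)))
        for s in sizes:         # mutations fall at rate theta/2 per branch
            sfs[s - 1] += rng.poisson(theta / 2.0 * t)
        i, j = sorted(rng.choice(k, size=2, replace=False))
        sizes[i] += sizes.pop(j)   # coalesce two random lineages
        k -= 1
    sfs[0] += rng.poisson(n * mu_err)   # sequencing errors, all singletons
    return sfs
```

With this time scaling, the expected total number of real segregating sites is anθ, so averaging over replicates recovers the classical expectation.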
In this study, we analyze the power of several statistics (i.e., D, F, Y*, and Y) to detect a departure from a standard model. We use here the largest and most robust confidence interval that is constituted by the most extreme values of the statistics for a given n but for θ ∈ [0, ∞[. We present in appendix a a method to find the confidence-interval limits and discuss other possible methodologies.
Results (Figure 1) show that even with a moderate rate of sequencing errors, the distributions of Derr and Ferr are skewed toward significant negative values. More precisely, when μerr is in the vicinity of [θ/100, θ/10], the Derr and Ferr distributions tend to be significantly negative. In other words, if there is 1 sequencing error per ∼10–100 singletons (or pairwise differences) in the sample (E[ξ1] = E[π] = θ), Derr and Ferr are always “too” negative. As expected, this effect gets stronger as n becomes large. Because the test based on F directly uses the number of singletons, the bias is stronger for Ferr than for Derr.
This shows that what really matters for the impact of sequencing error is the natural diversity (i.e., θ) of the population and the sample size, n. Even with a low rate of sequencing error, results from large data sets (i.e., n > 100) or from species with low diversity can be strongly affected by sequencing errors and therefore should be interpreted with caution. Thus, we derive two new tests that are immune to sequencing errors.
Derivation of Y* and Y:
As in the standard D from Tajima, we define two new statistics, Y and Y*, as a difference between θ-estimators. Following Fu and Li's (1993) notation, Y requires the use of an outgroup whereas Y* does not. As we have shown, θ can be estimated either from the number of nonsingleton segregating sites (θSξ or θSη) or from the average pairwise differences excluding singletons (θπξ or θπη). Therefore, we define Y and Y* as

Y = (θπξ − θSξ)/√(Var[θπξ − θSξ]) (21)

and

Y* = (θπη − θSη)/√(Var[θπη − θSη]). (22)
The details of the derivations of both variances in the denominators are presented in appendix b. They can be expressed as Var[θπξ − θSξ] = αnθ + βnθ² and Var[θπη − θSη] = αn*θ + βn*θ², where αn, βn, αn*, and βn* are constants that depend on n only. As for the regular D, θ and θ² are unknown but can be estimated from the data. Please refer to appendix b for more details.
Importantly, these new statistics are totally immune to sequencing errors provided that these errors are singletons.
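This immunity can be verified mechanically: adding artifactual singletons to a spectrum leaves the singleton-free estimators, and hence the numerators of Y and Y*, unchanged (a self-contained sketch with our own naming; the appendix B variance normalization is omitted):

```python
import numpy as np

def y_numerator(xi, n, outgroup=True):
    """Numerator of Y (outgroup=True) or Y* (outgroup=False): the
    difference of the two singleton-free theta estimators."""
    xi = np.asarray(xi, dtype=float)
    i = np.arange(1, n)
    a_n = (1.0 / i).sum()
    S = xi.sum()
    pi = (xi * 2 * i * (n - i)).sum() / (n * (n - 1))
    if outgroup:                       # drop derived singletons only
        t_S = (S - xi[0]) / (a_n - 1)
        t_pi = (pi - 2 * xi[0] / n) * n / (n - 2)
    else:                              # drop both singleton classes
        eta1 = xi[0] + xi[-1]
        t_S = (S - eta1) / (a_n - 1 - 1 / (n - 1))
        t_pi = (pi - 2 * eta1 / n) * (n - 1) / (n - 3)
    return t_pi - t_S

# Adding artifactual singletons leaves both numerators unchanged:
xi = np.array([8.0, 4, 2, 1, 1])      # unfolded SFS for n = 6
xi_err = xi.copy()
xi_err[0] += 5                        # five sequencing errors
for out in (True, False):
    assert np.isclose(y_numerator(xi, 6, out), y_numerator(xi_err, 6, out))
```

The cancellation is exact: each extra singleton adds the same amount to the singleton count that is subtracted from S and from π.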
Violation of the standard model:
We choose to explore the relative power of D, F, Y*, and Y to detect violation of the standard model in three different scenarios: bottleneck, hitchhiking along with a selective sweep, and isolation. In all three scenarios, we used dedicated coalescent simulations for both n = 20 and n = 50 and used θ = 10 with either no sequencing errors (μerr = 0) or a low rate of sequencing error (μerr = 0.1).
It is important to keep in mind that it is always possible to increase θ for the locus by increasing sequence length. This will result in an improvement of the power of the tests. Actually, more mutations (larger θ) in the tree make it easier to recover its shape from polymorphisms and, therefore, to unravel its nonneutral distortions. This effect of θ is especially important when we consider cases where the tree is reduced overall.
As a consequence, we do not discuss the absolute power of the tests based on Y and Y* but rather their power relative to that of the tests based on D or F. Since we disregard some of the information carried by the sequences (i.e., the singletons), we expect tests based on Y* and Y to have less power than the tests based on D and F. Importantly, adding sequencing errors will not affect the tests based on Y and Y* but will change D and F into Derr and Ferr. It is important to note that, with the parameters we used (θ = 10 and μerr = 0.1), sequencing errors alone do not affect the power of the test based on D and only weakly affect the test based on F (Figure 1). We report the power of the tests with and without sequencing errors in all scenarios.
Change in population size:
We considered a population that experiences a severe bottleneck. The simulations were performed using a change in timescale (Griffiths and Tavaré 1994). The population has a regular size N but shrinks to size N/100 during the bottleneck, which lasts for a time Tl (at most Tl = 0.1). Some time Tb may have elapsed since the bottleneck ended. Looking at this process backward in time, any coalescent time that falls between Tb and Tb + Tl/f, where f = 1/100 is the relative population size during the bottleneck, needs to be shortened accordingly (Simonsen et al. 1995). Here we consider the cases where the sample is taken during the bottleneck (Tb = 0; Tl ≤ 0.1) or after the bottleneck (Tb > 0; Tl = 0.1).
Results from simulations (Figure 2) show that deviations from the standard model are observed whenever the sample is taken either close to the beginning of the bottleneck (0 < Tl < 0.01 and Tb = 0) or just after it finishes (Tl = 0.1 and 0 < Tb < 1). On one hand, if the sample is taken after the bottleneck (i.e., population expansion), the tree will have internal branches that are “too short,” and therefore all statistics will exhibit negative values. On the other hand, if the sample is taken during the bottleneck (i.e., population decline), the resulting coalescent tree will be shorter, with internal branches that are “too long”; this leads to an excess of medium-frequency polymorphisms and therefore to positive statistics. In both cases, the signal lasts for N (or N/100) generations. Whereas the new tests perform poorly at detecting an excess of low-frequency polymorphisms, they perform well at detecting an excess of medium-frequency polymorphisms. Adding sequencing errors reduces the power to detect an excess of medium-frequency polymorphisms but enhances the ability to detect an excess of low-frequency polymorphisms. This highlights that the bias induced by sequencing errors tends to dampen the positive deviations of the statistics and enhance the negative ones. It should also be noted that the loss of power to detect positive deviations is stronger for F than for D, in good agreement with the stronger impact of sequencing errors on F shown above.
Hitchhiking along with a selective sweep:
Another interesting situation is when the locus of interest is neutral but linked to a nearby locus that experienced a selective sweep. This effect has long been referred to as the hitchhiking effect (Maynard Smith and Haigh 1974; Stephan et al. 1992; Kim and Stephan 2002). We used the simplified version of the model proposed by Braverman et al. (1995), as described in Fay and Wu (2000). Going from the end of the sweep back to its beginning, the frequency of the selected allele decreases deterministically from 1 − 10−4 to 10−4, with a selection coefficient α = 2Ns = 1000. We started the simulation at Ts = 0.01 after the end of the sweep (i.e., a very recent sweep). In Figure 3 we report results for a range of recombination rates (R = 4Nc), expressed as the ratio between recombination and selection (c/s).
When the c/s ratio is very small, the resulting star-like tree carries almost only singletons. For n = 20, it carries only a few singletons (i.e., no power for any test), whereas it carries several when n = 50. In this case, the tests based on Y and Y* have no power to detect the deviation. Importantly, adding sequencing errors greatly increases the excess of singletons and therefore artificially enhances the power of the tests based on Derr and Ferr at low c/s ratios. When the ratio is in the vicinity of 0.01, an important fraction of the trees shows only 1 lineage that escaped the sweep through recombination. This lineage will carry ancestral singletons (ξn−1). In this case, the test based on Y performs well whereas the test based on Y* performs poorly. When the ratio becomes high, most lineages escape the sweep through recombination and the process approaches a standard one; there is no more deviation to be detected besides a residual bias due to sequencing errors on F.
Isolation:
An extreme case of population structure is isolated subpopulations, i.e., a complete population split. In a simple model (Simonsen et al. 1995), an isolation event happened at some time Ti in the past, after which the two populations (each of size N/2) no longer mix. Before Ti, the ancestral population (size N) is panmictic and simply follows a standard model. We analyze two types of sampling: (1) both populations are sampled with equal size and (2) one population is largely underrepresented in the sample. As we will see, the tests behave differently in each situation.
Results for the case of equal sample sizes (Figure 4a) show that all tests have almost the same power to detect a violation of the standard expectations. With Ti ≥ 1, all distributions are skewed toward high values in a very similar fashion. Here, there is a long internal branch that splits the sample into two subtrees with an equal number of lineages. Since singletons play no particular role, all tests perform identically when there are no sequencing errors. Adding sequencing errors lowers the power of the tests based on D and F since it reduces their positive deviations.
On the contrary, results for samples of unequal sizes (Figure 4b) show that the new tests based on Y and Y* perform better than the original tests based on D and F. Despite the difference in power, all distributions are skewed toward negative values. Unequal sample sizes stretch the branch that splits the tree into two subtrees, one with few individuals and the other with all remaining individuals. This stretch induces an excess of both high- and low-frequency polymorphisms.
Sequencing errors are very often encountered in data sets used for population genetics inferences. As a result, geneticists can be misled by sequencing errors that artificially increase the number of singletons. This overabundance of singletons leads to overestimates of θ, which is typically used to compute the effective population size. We show that, even with a moderate rate of errors, the overestimation can be significant. As a consequence, we propose new estimators of θ that are insensitive to sequencing errors provided that these errors are singletons. We therefore highly recommend estimating θ with these new estimators in suspicious data sets.
Another very important consequence of the presence of sequencing errors is their effect on neutrality tests. Indeed, singletons are usually the class of polymorphisms that has the greatest impact on neutrality tests; it is mainly an excess or a deficiency of singletons that steers the statistics outside of their confidence intervals. We have shown (Figure 1) that an artifactual excess of singletons can easily provoke rejection of the standard model. More precisely, it will exaggerate an excess of singletons caused by some evolutionary scenarios (e.g., population expansion or selective sweep) or mask a deficiency of singletons caused by others (e.g., population decline or population isolation with equal sampling). As a result, it becomes hard to distinguish the effect of sequencing errors from those due to biological departures from the standard model. Tests that use the θ-estimators based on singletons (i.e., the Fu and Li 1993 ones) are the most affected.
We show here that the tests based on D and F are skewed even when the error rate is low. Namely, if the rate of artifactual singletons per sequence represents ∼0.01–0.1 of θ (the number of singletons or the average pairwise difference), D and F are often significantly negative. Furthermore, a low rate of sequencing error (0.01 of θ) can strongly alter the power of the tests when the scenario is not a standard neutral one. Using the average pairwise difference, we can calculate the error rate above which a data set becomes suspicious. In the human population, we have θ ≈ 0.001/site (Sachidanandam et al. 2001); this translates into a rate of ∼10−5–10−4 errors/bp. In a population of HIV-1 infecting a patient, we observe θ ≈ 0.01/site (Achaz et al. 2004); this leads to an error rate of 10−4–10−3 errors/bp. When these rates are compared to the ones observed either for pyrosequencing techniques [up to 10−3 errors/bp (Wang et al. 2007)] or for cloned PCR products [7 × 10−4 errors/bp (Eyre-Walker et al. 1998; Tiffin and Gaut 2001)], we have to acknowledge that sequencing errors are a major issue in uncurated data sets. Consequently, only data sets where sequencing errors have been carefully removed should be analyzed with standard neutrality tests. Removing sequencing errors usually requires independent cloning and/or sequencing, which is highly time and resource consuming. In this context, we think our new tests can be very useful to many population geneticists, since they offer a safe and quick way to test for neutrality despite the presence of sequencing errors.
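The back-of-the-envelope rule used here, a data set becomes suspicious when the per-bp error rate falls within [θ/100, θ/10] of the per-site diversity, is easy to tabulate (a sketch; the diversity values are those cited in the text):

```python
def suspicious_range(theta_per_site):
    """Per-bp error rates at which artifactual singletons reach
    ~1/100 to ~1/10 of the real diversity (see text)."""
    return theta_per_site / 100, theta_per_site / 10

human = suspicious_range(0.001)   # ~1e-5 .. 1e-4 errors/bp
hiv1 = suspicious_range(0.01)     # ~1e-4 .. 1e-3 errors/bp
```

Both the pyrosequencing rate (up to 10−3 errors/bp) and the cloned-PCR rate (7 × 10−4 errors/bp) sit at or above these ranges, which is why uncurated data sets are problematic.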
If there are no sequencing errors, we expect tests based on Y and Y* to show less power than the regular tests. However, rather than a radical loss of power, ignoring part of the suspicious data leads to a shift in the sensitivity of the tests. We show that in any situation where the height of the tree is greatly reduced and the tree shape converges to a star tree (i.e., population expansion and selective sweep with no recombination), the new tests show less power to detect departures from the standard model. This fits intuition perfectly, since the signal lies in derived singletons, which are here completely ignored. On the contrary, we show that for scenarios where the internal branches of the tree are stretched (i.e., population decline and isolated populations with a balanced sample), all the tests behave very similarly. In this case, most of the segregating sites are at intermediate frequency, and all tests are more or less equally able to detect departures from the standard model. Finally, the new tests can outperform the original test based on D in two situations: (i) when there is an excess of high-frequency polymorphisms (hitchhiking along with a sweep, especially for Y, and isolated populations with an unbalanced sample) and (ii) when there is an excess of low-frequency, though nonsingleton, polymorphisms (isolated populations with an unbalanced sample).
We have assumed throughout this study that sequencing errors are singletons. If these errors are genuinely distributed uniformly along long enough sequences, this assumption will very likely hold. It is, however, known that some errors are more common than others. For example, mononucleotide microsatellites are often lengthened or shortened by 1 repeat unit during the cloning/sequencing step. These small indels are not a problem, since they can easily be corrected manually. On the other hand, if the sequencing error rate depends on the nucleotide context, i.e., some sites are more error prone than others, errors can become nonsingletons and will then also affect tests based on Y and Y*. In any case, if there are some nonsingleton sequencing errors, our tests will be much less sensitive to these few rare events than the original D is to sequencing errors in general. We note that the equivalent finite-site model (Johnson and Slatkin 2008) is more appropriate when the density of polymorphic sites is high and/or when the sequences are short. This model, however, requires prior knowledge of the error rate to handle the errors properly.
We ignored the possibility of recombination events within the locus under study. It has been reported, however, that recombination tends to decrease the variance of both θ-estimators (those based on S and on π) (Hudson 1983). As a consequence, it decreases the variance of D and therefore, if the same confidence interval is kept, there is a substantial loss of power (Wall 1999). We expect that the same will happen with tests based on Y and Y*. Here, we decided to keep the most conservative confidence interval for the test (using θ ∈ [0, ∞[ and R = 0). However, one can define a confidence interval for a different range of parameters that can be estimated from the data. Using this last strategy, we should be able to recover most of the lost power.
To conclude, sequencing errors can easily misguide interpretations of the data. In particular, they can make regular statistics (D and F, among others) significantly negative, a signal that is usually interpreted as a star-like tree. In this case, if there is no departure from a standard model despite sequencing errors, tests based on Y and Y* could be helpful to avoid spurious interpretations of the data. However, because the new tests do not have much power to detect star-like trees, an absence of a significant result should not systematically mean that the standard model is correct. We can already envision that new tests inspired by the Fu and Li (1993) ones, using θ-estimators based on polymorphisms at frequencies 2/n and (n − 2)/n, should perform better than Y and Y* for detecting star-like trees.
APPENDIX A: POWER OF THE TESTS
To test whether an observed D value significantly differs from the neutral expectation, one needs to compare the observation to limit values beyond which one can reject the regular coalescent model (for a given α-risk). Although, for readability, we do not mention it each time, all of the following applies equally to F, Y, and Y*. We are interested in finding the confidence interval on D ([Dlow, Dup]) that contains a fraction 1 − α of all values under a standard model; outside of this interval, we reject neutrality. There are two sources of variance in the frequency spectrum of the polymorphic sites. The first comes from the variance in the shape and branch lengths of the genealogy; the second derives from the number of mutations and their locations in the tree. Importantly, the mutation locations depend only on the underlying tree, which itself, if expressed in units of N generations, depends only on n (the sample size). On the other hand, the number of mutations depends on θ, the population mutation rate, and on the total tree length (expressed in N generations).
When a geneticist samples sequences from the wild and plans to use a neutrality test, he/she knows n but not θ; he/she can, however, easily measure some quantities from the sequences (e.g., S, the number of polymorphic sites). There are therefore several possible strategies to compute the confidence interval [Dlow, Dup]. The most conservative one, which defines the largest confidence interval, assumes that only n is known and that all θ-values are possible (i.e., θ ∈ [0, +∞[). A second strategy assumes that some part of the data is known (typically S). Using S, we can either give a single point estimate of θ (e.g., Watterson's estimate) or give a confidence interval on θ. This latter option was retained by Simonsen et al. (1995): compute a confidence interval [θlow, θup] from S (Tavaré 1984) with a small β-risk (β < α) and then, using the method proposed by Berger and Boos (1994), compute the confidence interval of D over all θ-values in [θlow, θup]. Although it sets the confidence interval only for "likely" values of θ, this method is extremely time consuming. Finally, a last strategy is to assume that S is known and that the D confidence interval can be computed only for the observed S (Hudson 1993; Depaulis and Veuille 1998). This last strategy removes all the variance that comes from the mutation rate, so it gives the narrowest confidence interval and therefore increases the power of the test. It implicitly assumes that the generated genealogies have to be weighted by their probability given S (Markovtsova et al. 2001), but the shortcut of using S instead of θ is robust under almost all neutral genealogies (Depaulis et al. 2001; Wall and Hudson 2001). Here, we chose to use the most conservative confidence interval (i.e., for θ ∈ [0, +∞[). Even though one cannot scan the whole θ-space, we show that there is a relatively quick method to compute this conservative confidence interval.
To get a sense of how the confidence interval of D ([Dlow, Dup]) varies as a function of θ, we explore, for a given n, a large range of θ-values (from 0 to 100, by steps of 0.1), which includes most of the values typically encountered. For each pair (n, θ), we ran 10^5 regular coalescent simulations, built the empirical distribution of D, and recorded the 0.95 confidence interval. Results for n = 10, 50, 100, and 300 (Figure A1a) show that Dup exhibits a peak for low values of θ (usually θ < 5). Similarly, Dlow reaches its lowest value also for small θ-values. This suggests that the most extreme limits of D are observed for low θ-values, regardless of n. Importantly, this property also holds for F, Y*, and Y.
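As an illustration, this Monte Carlo procedure can be sketched in Python (the article's own code is a C++ library; this is an independent minimal sketch using far fewer replicates): simulate standard-coalescent genealogies, superpose Poisson-distributed mutations on the branches, compute Tajima's D from the resulting frequency spectrum, and take empirical quantiles.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small per-branch means here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_sfs(n, theta, rng):
    """One standard (Kingman) coalescent genealogy with infinite-sites
    mutations; returns xi, where xi[i] is the number of derived sites
    at frequency i/n in the sample."""
    sizes = [1] * n          # sampled sequences subtended by each lineage
    xi = [0] * n
    k = n
    while k > 1:
        # epoch during which k lineages coexist (coalescent time units)
        t = rng.expovariate(k * (k - 1) / 2.0)
        for s in sizes:      # Poisson(theta/2 * t) mutations per branch
            xi[s] += poisson(theta / 2.0 * t, rng)
        i, j = sorted(rng.sample(range(k), 2))
        sizes[i] += sizes.pop(j)   # coalesce two random lineages
        k -= 1
    return xi

def tajimas_D(xi, n):
    """Tajima's D from the site-frequency spectrum; None if S == 0."""
    S = sum(xi)
    if S == 0:
        return None
    pi = sum(i * (n - i) * xi[i] for i in range(1, n)) / (n * (n - 1) / 2.0)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    var = (c1 / a1) * S + (c2 / (a1 ** 2 + a2)) * S * (S - 1)
    return (pi - S / a1) / math.sqrt(var)

def empirical_interval(n, theta, reps, rng, alpha=0.05):
    """Empirical (1 - alpha) interval [D_low, D_up] under the standard model."""
    ds = sorted(d for d in (tajimas_D(simulate_sfs(n, theta, rng), n)
                            for _ in range(reps)) if d is not None)
    return ds[int(alpha / 2 * len(ds))], ds[int((1 - alpha / 2) * len(ds)) - 1]
```

For instance, `empirical_interval(10, 5.0, 2000, random.Random(42))` returns an estimate of [Dlow, Dup] for n = 10 and θ = 5; a production run would use many more replicates, as described above.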
From there, we set up a strategy to find Dlow and Dup for a given n but for θ ∈ [0, +∞[. For Dup, we want to find the most extreme value (which is reached for a small θ). Therefore, we start from θ = 0 and run 10^5 standard coalescent simulations with increasing θ-values (by steps of 0.1) until the newly calculated Dup is less than the most extreme value of Dup encountered so far (by at least 0.001). The same procedure is applied to find the most extreme Dlow. Dup and Dlow are treated individually; we therefore explore until we capture the summit of the peak or the bottom of the well. This usually requires only a reasonable number of steps (extreme values are observed for low θ-values). The limits computed in this way depend only on n and no longer on the unknown parameter θ. A graphical representation of the confidence interval for D, Y*, and Y is given in Figure A1b for n ranging from 5 to 500.
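The scan-until-past-the-peak rule is independent of the coalescent machinery and can be sketched as follows; here `bound_at` is a hypothetical callback that would run the batch of coalescent simulations at a given θ and return the simulated upper limit of D (with reversed comparisons, the same routine finds the most extreme Dlow):

```python
def find_extreme_upper(bound_at, step=0.1, tol=1e-3):
    """Scan theta upward from 0, tracking the largest upper bound seen;
    stop once the current bound falls at least `tol` below that maximum,
    i.e., once we are clearly past the peak."""
    theta = 0.0
    best, best_theta = float("-inf"), 0.0
    while True:
        b = bound_at(theta)
        if b > best:
            best, best_theta = b, theta
        elif best - b >= tol:
            return best, best_theta   # past the peak: keep the extreme value
        theta += step
```

In practice `bound_at` is noisy (Monte Carlo), so the stopping tolerance of 0.001 described above serves to avoid stopping on random fluctuations near the peak.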
We also mention that we chose here to consider the “conservative” confidence interval for all statistics to reduce the computation time. However, our strategy to find this conservative confidence interval could be easily changed to find a narrower confidence interval that considers only likely values of θ (Simonsen et al. 1995). This interesting idea can be exploited through a minor modification of our algorithm that will greatly reduce the original computation time. A modified version of Simonsen et al.'s strategy could be defined as follows:
1. Find Dup and Dlow, together with the θ-values at which each extreme is reached, as we do currently (this is affordable since only a few runs are needed from θ = 0 to the peak).
2. Compute the confidence interval for θ ([θlow, θup]) using S (Tavaré 1984) (this can be performed very quickly using a numerical approach).
3. Compare the θ-value at which Dlow was reached to [θlow, θup] to set the final lower boundary on D. If this θ-value is inside [θlow, θup], keep Dlow as the lower boundary; otherwise, perform one last run of simulations using θlow (if the θ-value is below θlow) or θup (if it is above θup) and take the lower boundary from this last run.
4. Perform almost the same procedure as in step 3 to find the final upper boundary.
Obviously the same four steps can be done for F, Y*, and Y. This would narrow the confidence interval and therefore improve the power of the tests especially when θ is large. It, however, implies that the confidence interval has to be computed from S for each data set individually; this seems feasible for a given data set but inadequate for extensive simulations.
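The decision in step 3 can be sketched as a small routine; `bound_at` is again a hypothetical callback that reruns the simulations at a given θ and returns the corresponding lower limit of D (a sketch, not the article's C++ implementation):

```python
def final_lower_bound(d_low, theta_at_low, theta_low, theta_up, bound_at):
    """Step 3 of the modified strategy: keep the conservative D_low if the
    theta at which it was reached is 'likely' (inside [theta_low, theta_up]);
    otherwise rerun the simulations once at the nearest likely theta."""
    if theta_low <= theta_at_low <= theta_up:
        return d_low
    nearest = theta_low if theta_at_low < theta_low else theta_up
    return bound_at(nearest)   # one last run of simulations
```

Step 4 is the mirror image of this routine for the upper boundary.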
APPENDIX B: VARIANCES OF THE NEW ESTIMATORS AND STATISTICS
Elementary variances and covariances:
Here we give all elementary variances of π, S, ξ1, and η1 as well as their covariances. These are necessary to derive the variances of the singleton-excluding estimators of θ (based on S − η1, π − η1, S − ξ1, and π − ξ1) as well as those of Y and Y*.
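For reference, the first moments underlying these quantities are standard under the neutral coalescent (Fu 1995), writing a_n for the harmonic sum over i = 1, …, n − 1:

```latex
E[\xi_i] = \frac{\theta}{i}, \qquad
E[S] = a_n \theta \quad \text{with } a_n = \sum_{i=1}^{n-1} \frac{1}{i}, \qquad
E[\pi] = \theta, \qquad
E[\eta_1] = E[\xi_1] + E[\xi_{n-1}]
          = \left(1 + \frac{1}{n-1}\right)\theta = \frac{n}{n-1}\,\theta .
```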
Using the variances of ξ1 and ξn−1 and the covariance between them, one can directly derive the variances and covariances that involve η1. Here, we have expressed them (Equations B11–B13) in an alternative, usually more compact, form to that in Fu and Li (1993).
Derivation of Var[Y*]:
If we define the numerator of Y* as the difference between the two singleton-excluding estimators of θ, its variance can then be written as in Equation B14.
Following the notation of Fu (1995), we define ξ1 as the number of mutations in external branches and ξn−1 as the number of mutations in the highest branch (whenever it exists); both variances as well as the covariance can thus be rewritten as in Equations B15–B17.
Each element of Equation B18 is known or can be easily derived. Therefore, replacing all the elementary variances and covariances (Equations B1–B10) in Equation B18, the variance of Y* can be expressed as in Equation B19, where the two coefficients are given by Equations B20 and B21.
Variances and covariance of S−η1 and π−η1:
These values can be used as an alternative way of computing Var[Y*] (using Equation B14).
Estimation of θ and θ² from S − η1:
As for the regular D, θ is unknown, but it can be estimated using S − η1, since E[S] = anθ and E[η1] = (n/(n − 1))θ (Fu 1995), so that E[S − η1] = (an − n/(n − 1))θ, where an = 1 + 1/2 + ⋯ + 1/(n − 1).
Hence, an adequate unbiased estimator of θ is (Equation B25)
θ̂ = (S − η1) / (an − n/(n − 1)).
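A minimal numerical sketch of such an estimator, assuming (per Fu 1995) E[ξ1] = θ and E[ξn−1] = θ/(n − 1), hence E[S − η1] = (an − n/(n − 1))θ:

```python
def a_n(n):
    """Watterson's coefficient a_n = 1 + 1/2 + ... + 1/(n - 1)."""
    return sum(1.0 / i for i in range(1, n))

def theta_hat_no_singletons(S, eta1, n):
    """Unbiased theta estimator ignoring all singletons, based on
    E[S - eta1] = (a_n - n/(n-1)) * theta."""
    return (S - eta1) / (a_n(n) - n / (n - 1.0))
```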
Similarly, θ² can be estimated using the second moment of S − η1 together with its variance (Equation B22). Actually, for practical purposes, Equation B22 can be rewritten in a simplified form using adequate coefficients.
Using these results, we can now express E[(S − η1)²] as a function of θ and θ².
Hence, the unbiased estimator of θ² is given by Equation B26.
Derivation of Var[Y]:
The use of an outgroup allows us to keep the ancestral state at frequency (n − 1)/n. The derivations are extremely similar to those above, except that ξn−1 is ignored (only ξ1 is removed from S and π). We then have to define another constant, used in the definition of Y (Equation B27).
Variances and covariance of S−ξ1 and π−ξ1:
Estimation of θ and θ² from S − ξ1:
Since E[ξ1] = θ (Fu 1995), E[S − ξ1] = (an − 1)θ; hence, the adequate unbiased estimator of θ is (Equation B34)
θ̂ = (S − ξ1) / (an − 1).
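Similarly, a sketch of the outgroup-based estimator, assuming E[ξ1] = θ so that E[S − ξ1] = (an − 1)θ:

```python
def theta_hat_no_derived_singletons(S, xi1, n):
    """Unbiased theta estimator ignoring derived singletons only, based on
    E[xi1] = theta, hence E[S - xi1] = (a_n - 1) * theta."""
    a_n = sum(1.0 / i for i in range(1, n))  # Watterson's coefficient
    return (S - xi1) / (a_n - 1.0)
```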
Using γn and δn as adequate coefficients, Equation B31 can be written in the simplified form γnθ + δnθ².
Therefore, θ² can be estimated as given by Equation B35.
The source code was designed as a C++ library and is available upon request. I thank J. Wakeley, who motivated this study, for his scientific generosity and his priceless advice. I also thank F. Depaulis, M. Tenaillon, P. Nicolas, and D. Higuet for giving me constructive comments on the manuscript; N. Bierne and O. Tenaillon for suggesting the name of the new statistics; and T. Treangen for improving the English of this manuscript. Finally, I thank the three anonymous reviewers for their constructive suggestions that improved the manuscript.
Communicating editor: M. Nordborg
- Received September 21, 2007.
- Accepted April 18, 2008.
- Copyright © 2008 by the Genetics Society of America