Abstract
Several neutrality tests use outgroups to infer the ancestral and derived states for polymorphism data. However, homoplasy can result in the incorrect inference of the derived variant. We show that empirically derived rates of misorientation strongly influence Fay and Wu's Htest, especially when the sample size is large.
INTRASPECIFIC polymorphism data are usually analyzed within the framework of the neutral WrightFisher model, which (besides the absence of selection) assumes a randomly mating population of constant size. Departures from this model are typically attributed to selective or demographic effects. A rarely considered alternative explanation is that the mutational process can cause or at least contribute to such departures (see, however, Rogers 1992). Indeed, the neutral model can be applied to various mutational models. For sequence data, the expectations are usually derived from the infinite sites model (Kimura 1969; Watterson 1975). This model assumes that each mutation occurs on a different site; i.e., homoplasy is ignored. Numerous statistics have been developed to test whether an observed pattern of polymorphism is expected under the infinite sites neutral model (reviewed in Wall 1999; Depauliset al. 2004). Among these statistics, several use an outgroup to infer the derived and ancestral states of polymorphic sites (Fu and Li 1993; Fay and Wu 2000). These statistics are potentially sensitive to the presence of multiple hits, due to the long branch leading to the outgroup where mutations can be superimposed. In this article, we estimate the level of homoplasy (including parallelisms and reversions) in empirical data sets and we use coalescent simulations to assess the influence of these homoplasy levels on the tests with outgroups.
Three neutrality tests with an outgroup have been proposed. Each one compares a pair of unbiased estimates of the mutational parameter of the population (θ= 4N_{e}μ for an autosomal marker). Fu and Li’s (1993) D and Ftests rely on standardized statistics:
They compare Watterson’s (1975) θ_{w} estimator, which is based on the total number of polymorphic sites, and Tajima’s (1983) diversity π (respectively) to η_{e}, the number of derived unique mutations (mutations on external branches of the tree). They are, thus, highly sensitive to the relative proportion of the latter mutations. Fay and Wu’s (2000) Htest,
In practice, an outgroup is used to identify the derived and ancestral variants of a polymorphic site. This inference can be incorrect if an undetected second mutation occurred at the same site on the outgroup branch, i.e., if multiple hits are present. Fay and Wu (2000) considered this possibility and computed the probability of misorientation, assuming a constant mutation rate and a finite site model. They proposed to incorporate this rate in the null hypothesis, but this is virtually never applied in practice. Furthermore, substitution rates usually vary among sites, suggesting a similar variation in neutral mutation rates (e.g., Nachman and Crowell 2000). Such heterogeneity of neutral mutation rates should substantially enhance the frequency of secondary mutation on variable sites (e.g., the extreme case of mutation hot spots) and the corresponding rate of misorientations. This heterogeneity is, however, difficult to estimate precisely. In the absence of more information, a gamma distribution is generally assumed for substitution rates (e.g., Gu and Zhang 1997). To avoid making assumptions about the distribution of mutations across sites, we used an empirically derived estimator of P_{M}, the average probability of misorientation per site. We first estimated P_{D}, the probability that a polymorphic site shows a detected second mutation on the outgroup branch, by determining the proportion of polymorphic sites for which the outgroup shows a third state. If all mutations were equally likely, then the rate of (undetected) misinferred sites P_{M} would just be half of P_{D}. However, in all genomes examined, transitions occur at a higher rate than transversions (Yang and Yoder 1999), thereby increasing the rate of undetected misinferred sites. Let α be the rate of each possible transition and β that of transversions (Kimura’s twoparameter model; Kimura 1980). We computed the relative rates of undetected over detected multiple hits (twostate vs. threestate sites), given that a second mutation occurred on the outgroup branch. We neglect the possibility of more than two mutations.
If the first mutation, which produces the polymorphic site, is a transition (probability α), then the second mutation on the outgroup branch is undetected if it is also a transition (probability α) and it is detected if it is a transversion (probability 2β since there are twice as many possible transversions as transitions). If the first mutation is a transversion (probability 2β), the second mutation is undetected only if it is the same type of transversion (probability β) and it is detected if it involves the other type of transversion or a transition (probability α+β). Hence, the ratio of undetected to detected multiple hits is
We then estimated the probability of misorientation in data sets from three species that are frequently the focus of sequence polymorphism studies: Homo sapiens, Drosophila simulans, and Arabidopsis thaliana. Patterns of polymorphism in a population can be affected by several factors, including demographic history. To minimize this effect, we chose surveys of loci with comparable sampling schemes within a species. Frisse et al. (2001) analyzed 10 noncoding regions in three human populations and used chimpanzee as an outgroup. Begun and Whitley (2000) collected data from 29 loci in a North American population of D. simulans (outgroup D. melanogaster). Finally, we used polymorphism data from 12 loci of A. thaliana (outgroup A. lyrata or A. gemmifera). This data set was collected by different researchers, but very similar samples composed of worldwide A. thaliana ecotypes were used in all studies. In the 10 human data sets, we did not observe any site with three states. Our estimate of P_{M} is therefore 0 for all loci (P_{M} is likely to be a poor estimator for low P_{D} values). In the D. simulans data sets, P_{M} varies between 0 and 10.6% (mean 3.3%). This is in agreement with Innan and Tajima’s (1997) suggestion that sites with more than one mutation are likely to be present in comparison between D. melanogaster and D. simulans. In the A. thaliana data set, P_{M} ranges from 0 to 19.1% (mean 5.1%). These differences in the average probability of misorientation probably reflect the difference in divergence between the studied species and the outgroup, which are, respectively, 1.1, 2.6, and 8.8% for human, D. simulans, and A. thaliana (average divergence considering all sites, with Jukes and Cantor 1969 correction). The average estimates of probability of misorientation that we observed in D. simulans and A. thaliana are higher than that obtained following Fay and Wu’s (2000) procedure (3.3 vs. 0.9 and 4.7 vs. 2.9% in the two species, respectively). The difference probably reflects heterogeneity of the neutral mutation rates across sites. In contrast to Fay and Wu’s approach, our model implicitly takes this heterogeneity into account. Such heterogeneity should explain the relatively high P_{D} rates observed. A significant negative correlation was observed between H and P_{M} in A. thaliana (Pearson’s R ^{2} = 0.33, P = 0.047), suggesting that H is influenced by homoplasy in these data sets (Figure 1). We did not observe such a correlation in the D. simulans data sets, possibly because of the lower average value of P_{M} or because of small sample sizes of these data (average = 7.3) leading to reduced power. No correlation was observed in any data set between Fu and Li’s (1993) tests and P_{M}.
To study the effect of misorientation ranging from 0 to 20% (the observed range in the analyzed data sets) on tests with an outgroup, we used genealogies generated by a standard coalescent algorithm (Hudson 1993). First, we generated the critical test values (5% significance level, twosided for F and D, onesided for H as in the original procedures) by simulations of the standard neutral model with a fixed number of segregating sites, no misorientation, and no recombination. Second, we simulated data sets in the presence of misorientation by exchanging the frequency of the derived and ancestral variant, with probability given by the misorientation parameter for each segregating site. This implicitly assumes that the rate of multiple hits is independent of the polymorphism frequency, as seems reasonable if such mutations principally occur on the branch connecting the root of the intraspecific tree to the outgroup. The proportion of simulations with a value of the statistics more extreme than the critical value(s) was recorded. Fifty thousand simulations were performed for each combination of parameter values.
The F and Dstatistics were found to be minimally affected by the presence of homoplasy (Figure 2). With levels of misorientation up to 20%, the null model is rejected <8% of the time for all sets of parameter values (sample sizes and numbers of segregating sites each ranging from 10 to 100; results not shown). On the contrary, H is very sensitive to the presence of homoplasy. For example, if P_{M} = 0.15 with a sample size of 50, the null model will be rejected ∼25% of the time. The effect markedly increases with sample size (Figure 2), but is virtually insensitive to the number of segregating sites (results not shown). The difference of susceptibility of F and Dtests vs. the Htest can be understood by considering the unfolded frequency spectrum of polymorphic sites. Unique variants are frequent under the neutral model, while highfrequencyderived variants are very scarce (Figure 3). Thus, in most cases, homoplasies transform a unique variant into a highfrequency one, which produces a large excess of such variants (Figure 3). Misorientation therefore strongly affects the Htest, which is designed precisely to detect an excess of highfrequencyderived variants (Fay and Wu 2000). On the other hand, Fu and Li’s tests are primarily sensitive to unique external mutations, which are diminished by only a fraction when homoplasy occurs at a reasonable rate.
Our results suggest that, in species where a relatively divergent outgroup is commonly used, like D. simulans or A. thaliana, misorientation of the derived state of variants can produce significant values of the Htest with appreciable frequency. This effect is likely to be underestimated in our study since more distant outgroups are sometimes used (e.g., D. yakuba for D. melanogaster or gorilla for humans). On the contrary, the use of a very closely related outgroup could lead to yet another source of misorientation due to the occurrence of ancestral polymorphisms. Using several outgroups could potentially help. (To our knowledge, this is not done in practice when applying the test.) It does not, however, fully solve the problem since adding more outgroups, especially more distant ones, increases the probability of getting multiple hits. Our approximation that neglects higherorder mutations then becomes inappropriate. Using several outgroups can reduce the fraction of misoriented sites, but would correspondingly increase the number of sites that cannot be oriented. These sites would have to be removed from the analyses, thereby reducing the power of the test. The issue becomes a tradeoff between power and robustness to homoplasy. Finally, two outgroups are typically far from being independent: a large part of the lineage linking an outgroup to the intraspecific tree is generally shared between two outgroups, thus providing little additional information. If several outgroups are to be used, our approach can still be applied by replacing the outgroups with an estimate of the ancestral sequence on the node that links the outgroups to the intraspecific tree, e.g., with maximumlikelihood methods (Yanget al. 1995).
Any additional mutational bias (e.g., base frequency heterogeneity) should increase the undetected over detected multiplehit ratio. This is the case particularly for data from coding regions where twofold degenerate sites tend to show high transition:transversion bias since only transitions are synonymous. This is taken into account in our average estimate of transition:transversion ratio, but we implicitly assume that this ratio is constant over sites, an assumption that may induce a slight bias. For data in coding regions, a rough correction can be performed by removing all twofold degenerate sites before estimating α, β, and P_{D} and deriving P_{M} with (3). The assumption of constant α over β ratio between replacement and silent polymorphisms becomes more realistic once those sites are removed. Then P_{M} should be corrected by the factor L/(L  L_{2}_{X}), with L the total length of the sequence and L_{2}_{X} the number of twofold degenerate sites. The rationale is that twofold degenerate sites cannot lead to detected multiple hits for synonymous polymorphisms.
In the above discussion we considered only misorientations caused by homoplasy effects but misorientations could also result from other type of biases such as poor alignment with the outgroup sequences when indels are frequent. Finally, in the presence of recombination, the use of critical values for the case of no recombination is overly conservative at the expense of a drastic reduction in the power of the tests (Wall 1999). The homoplasy effect is likely to be much stronger when the tests are applied with recombination and thus have much tighter confidence intervals. More generally, we conclude that when a significant Htest is observed, misorientation effects should be considered if the test is to be applied in the standard way. Alternatively, a misorientation rate should be incorporated in the null hypothesis with estimators more conservative than those proposed in the original procedure (Fay and Wu 2000).
Acknowledgments
We thank A. Di Rienzo and L. Frisse for providing data sets and D. Begun, D. Carlini, E. Heyer, H. Innan, C. MüllerGraf, and anonymous reviewers for comments on the manuscript. E.B. is supported by a grant from the Ecole Pratique des Hautes Etudes and F.D. by a grant from the Centre National de la Recherche Scientifique.
Footnotes

Communicating editor: D. Begun
 Received June 18, 2003.
 Accepted July 31, 2003.
 Copyright © 2003 by the Genetics Society of America