Effect of Misoriented Sites on Neutrality Tests With Outgroup
Emmanuelle Baudry and Frantz Depaulis

Laboratoire d'Ecologie, Centre National de la Recherche Scientifique UMR 7625-EPHE, Université Pierre et Marie Curie, 75252 Paris Cedex 05, France

Corresponding author: Frantz Depaulis, Laboratoire d'Ecologie, CNRS UMR 7625-EPHE, case 237, bat A, 7 quai St. Bernard, 75252 Paris Cedex 05, France. E-mail: fdepaulis{at}snv.jussieu.fr
Communicating editor: D. BEGUN
| ABSTRACT |
|---|
Several neutrality tests use outgroups to infer the ancestral and derived states for polymorphism data. However, homoplasy can result in the incorrect inference of the derived variant. We show that empirically derived rates of misorientation strongly influence Fay and Wu's H-test, especially when the sample size is large.
INTRASPECIFIC polymorphism data are usually analyzed within the framework of the neutral Wright-Fisher model, which (besides the absence of selection) assumes a randomly mating population of constant size. Departures from this model are typically attributed to selective or demographic effects. A rarely considered alternative explanation is that the mutational process can cause or at least contribute to such departures.
Three neutrality tests with an outgroup have been proposed. Each one compares a pair of unbiased estimates of the mutational parameter of the population ($\theta = 4N_e\mu$ for an autosomal marker). FU and LI's (1993) D- and F-tests rely on standardized statistics:

$$D = \frac{\theta_w - \theta_e}{\sqrt{\mathrm{Var}(\theta_w - \theta_e)}}, \qquad F = \frac{\pi - \theta_e}{\sqrt{\mathrm{Var}(\pi - \theta_e)}} \tag{1}$$

They compare WATTERSON's (1975) $\theta_w$ estimator, which is based on the total number of polymorphic sites, and TAJIMA's (1983) diversity $\pi$ (respectively) to $\theta_e$, the number of derived unique mutations (mutations on external branches of the tree). They are thus highly sensitive to the relative proportion of the latter mutations. FAY and WU's (2000) H-test,

$$H = \pi - \theta_H \tag{2}$$

compares $\pi$ to $\theta_H$, an estimator weighted by the homozygosity of the derived variants. It is thus primarily sensitive to the relative proportion of high-frequency-derived variants. The H-test was designed specifically to detect positive selection in the presence of recombination (with recombination occurring between the region surveyed and the selected site during the selective stage). Since its introduction, an unexpectedly large number of significant H values have been reported in humans and Drosophila.
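As an illustration of how the estimators entering the H-test depend on derived-allele frequencies, the following minimal sketch computes $\theta_w$, $\pi$, $\theta_H$, and H from a list of derived-allele counts. It assumes the standard definitions of these estimators (in particular $\theta_H = \sum_i 2 i^2 \xi_i / [n(n-1)]$ from FAY and WU 2000); the function name and input layout are hypothetical and are not taken from the article.

```python
def neutrality_estimators(derived_counts, n):
    """Compute theta_w, pi, theta_H and H for a sample of n sequences.

    derived_counts: one entry per segregating site, giving the number of
    sequences (1..n-1) that carry the variant inferred to be derived.
    """
    if any(not 0 < i < n for i in derived_counts):
        raise ValueError("each derived count must lie between 1 and n-1")
    pairs = n * (n - 1)
    # Watterson's estimator: number of segregating sites over the harmonic sum.
    a_n = sum(1.0 / k for k in range(1, n))
    theta_w = len(derived_counts) / a_n
    # Tajima's diversity: mean number of pairwise differences.
    pi = sum(2 * i * (n - i) for i in derived_counts) / pairs
    # Fay and Wu's estimator: weighted by homozygosity of the derived variant.
    theta_h = sum(2 * i * i for i in derived_counts) / pairs
    return {"theta_w": theta_w, "pi": pi, "theta_H": theta_h, "H": pi - theta_h}


if __name__ == "__main__":
    # Two singletons and one variant carried by 3 of 4 sequences.
    print(neutrality_estimators([1, 1, 3], n=4))
```

Because $\pi$ is symmetric in i and n - i whereas $\theta_H$ grows with the square of i, swapping the ancestral and derived states at a site leaves $\pi$ unchanged but can change $\theta_H$, and hence H, substantially.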
In practice, an outgroup is used to identify the derived and ancestral variants of a polymorphic site. This inference can be incorrect if an undetected second mutation occurred at the same site on the outgroup branch, i.e., if multiple hits are present.

Let $\alpha$ be the rate of each possible transition and $\beta$ that of transversions (Kimura's two-parameter model; KIMURA 1980).
If the first mutation, which produces the polymorphic site, is a transition (probability $\alpha$), then the second mutation on the outgroup branch is undetected if it is also a transition (probability $\alpha$) and it is detected if it is a transversion (probability $2\beta$ since there are twice as many possible transversions as transitions). If the first mutation is a transversion (probability $2\beta$), the second mutation is undetected only if it is the same type of transversion (probability $\beta$) and it is detected if it involves the other type of transversion or a transition (probability $\alpha + \beta$). Hence, the ratio of undetected to detected multiple hits is

$$\frac{P_M}{P_D} = \frac{\alpha^2 + 2\beta^2}{4\alpha\beta + 2\beta^2} \tag{3}$$
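The case analysis above can be written out term by term. The block below is a worked sketch of that bookkeeping under the two-parameter model, with $\alpha$ the rate of each transition and $\beta$ the rate of each transversion; the labels "undetected" and "detected" are shorthand introduced here for the two outcomes.

```latex
\begin{aligned}
\text{undetected} &\propto
  \underbrace{\alpha \cdot \alpha}_{\text{transition, then transition}}
  + \underbrace{2\beta \cdot \beta}_{\text{transversion, then same transversion}}
  = \alpha^{2} + 2\beta^{2}, \\
\text{detected} &\propto
  \underbrace{\alpha \cdot 2\beta}_{\text{transition, then transversion}}
  + \underbrace{2\beta \cdot (\alpha + \beta)}_{\text{transversion, then other change}}
  = 4\alpha\beta + 2\beta^{2}, \\
\frac{\text{undetected}}{\text{detected}} &=
  \frac{\alpha^{2} + 2\beta^{2}}{4\alpha\beta + 2\beta^{2}} .
\end{aligned}
```

As a check, with no transition bias ($\alpha = \beta$) the ratio is $3\beta^{2} / 6\beta^{2} = 1/2$: one of the three possible second changes goes undetected and two are detected. The ratio increases with the transition bias $\alpha/\beta$.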
In practice, we simply estimated $\alpha$ and $\beta$ for each data set by counting the proportions of sites with transitional and transversional differences between the sequences, including all polymorphisms and fixed differences. Given the relatively low level of divergence between the sequences considered here (<9%, see below), this should provide a reasonably accurate approximation.
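The estimation step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that $\alpha$ is taken proportional to the count of transitional differences and $\beta$ to half the count of transversional differences (since there are two possible transversions per site), and that the misorientation probability is obtained by multiplying the observed proportion of detected multiple hits by the undetected-to-detected ratio. All function and parameter names are hypothetical.

```python
def undetected_to_detected_ratio(alpha, beta):
    """Ratio of undetected to detected multiple hits under Kimura's model."""
    if alpha < 0 or beta <= 0:
        raise ValueError("alpha must be >= 0 and beta must be > 0")
    undetected = alpha * alpha + 2 * beta * beta
    detected = 4 * alpha * beta + 2 * beta * beta
    return undetected / detected


def estimate_rates(n_transitions, n_transversions):
    """Relative rates from counts of transitional and transversional differences.

    Assumption: alpha is proportional to the transition count and beta to half
    the transversion count. Only the alpha/beta ratio matters for the result.
    """
    total = n_transitions + n_transversions
    if total == 0:
        raise ValueError("no differences observed")
    return n_transitions / total, n_transversions / (2 * total)


def misorientation_probability(p_detected, n_transitions, n_transversions):
    """Scale the proportion of detected multiple hits by the ratio above."""
    alpha, beta = estimate_rates(n_transitions, n_transversions)
    return p_detected * undetected_to_detected_ratio(alpha, beta)


if __name__ == "__main__":
    # Hypothetical counts: 20 transitional and 10 transversional differences,
    # with 5% of polymorphic sites showing a third state in the outgroup.
    print(misorientation_probability(0.05, 20, 10))
```

The ratio depends only on $\alpha/\beta$, so the normalization chosen in `estimate_rates` does not affect the result.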
We then estimated the probability of misorientation in data sets from three species that are frequently the focus of sequence polymorphism studies: Homo sapiens, Drosophila simulans, and Arabidopsis thaliana. Patterns of polymorphism in a population can be affected by several factors, including demographic history. To minimize this effect, we chose surveys of loci with comparable sampling schemes within a species.
To study the effect of misorientation ranging from 0 to 20% (the observed range in the analyzed data sets) on tests with an outgroup, we used genealogies generated by a standard coalescent algorithm.
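A simulation of this kind can be sketched as below. This is only an illustrative reconstruction under stated assumptions, not the authors' program: it assumes no recombination, a fixed number of segregating sites placed on branches in proportion to their length, independent misorientation of each site with probability `pm` (the derived count i becomes n - i), and a one-tailed test that rejects for low values of H at a critical value taken from simulations without misorientation. All names are hypothetical.

```python
import random


def coalescent_branches(n, rng):
    """Return (length, n_descendants) for each branch of one neutral genealogy."""
    # Each active lineage is (number of sampled descendants, time it appeared).
    lineages = [(1, 0.0)] * n
    branches = []
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence among k lineages.
        t += rng.expovariate(k * (k - 1) / 2)
        i, j = sorted(rng.sample(range(k), 2))
        second = lineages.pop(j)  # pop the larger index first
        first = lineages.pop(i)
        for size, born in (first, second):
            branches.append((t - born, size))
        lineages.append((first[0] + second[0], t))
    return branches


def simulate_h(n, s, pm, rng):
    """Simulate Fay and Wu's H for n sequences and s sites, misorienting with prob pm."""
    branches = coalescent_branches(n, rng)
    lengths = [length for length, _ in branches]
    sizes = [size for _, size in branches]
    # Fixed-S scheme: each mutation falls on a branch in proportion to its length.
    counts = rng.choices(sizes, weights=lengths, k=s)
    pi = theta_h = 0.0
    for i in counts:
        if rng.random() < pm:
            i = n - i  # misoriented site: ancestral and derived states swapped
        pi += 2 * i * (n - i)
        theta_h += 2 * i * i
    return (pi - theta_h) / (n * (n - 1))


def rejection_rate(n, s, pm, reps=2000, level=0.05, seed=1):
    """Fraction of misoriented data sets falling below the neutral critical value."""
    rng = random.Random(seed)
    null = sorted(simulate_h(n, s, 0.0, rng) for _ in range(reps))
    critical = null[int(level * reps)]
    observed = [simulate_h(n, s, pm, rng) for _ in range(reps)]
    return sum(h < critical for h in observed) / reps


if __name__ == "__main__":
    for pm in (0.0, 0.05, 0.15):
        print(pm, rejection_rate(n=50, s=20, pm=pm))
```

The critical value here is itself estimated by simulation, so the rejection rates are approximate and vary with the seed and the number of replicates.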
The F- and D-statistics were found to be minimally affected by the presence of homoplasy (Fig 2). With levels of misorientation up to 20%, the null model is rejected <8% of the time for all sets of parameter values (sample sizes and numbers of segregating sites each ranging from 10 to 100; results not shown). In contrast, H is very sensitive to the presence of homoplasy. For example, if PM = 0.15 with a sample size of 50, the null model will be rejected 25% of the time. The effect markedly increases with sample size (Fig 2), but is virtually insensitive to the number of segregating sites (results not shown). The difference in susceptibility between the F- and D-tests and the H-test can be understood by considering the unfolded frequency spectrum of polymorphic sites. Unique variants are frequent under the neutral model, while high-frequency-derived variants are very scarce (Fig 3). Thus, in most cases, homoplasies transform a unique variant into a high-frequency one, which produces a large excess of such variants (Fig 3). Misorientation therefore strongly affects the H-test, which is designed precisely to detect an excess of high-frequency-derived variants.
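The frequency-spectrum argument can be made concrete with a short calculation. The sketch below assumes the standard neutral expectation for the unfolded spectrum (the expected number of sites with i derived copies is $\theta / i$) and models misorientation as swapping class i with class n - i with probability `pm`; function names are hypothetical.

```python
def expected_sfs(n, theta=1.0):
    """Expected unfolded site frequency spectrum under the neutral model."""
    return {i: theta / i for i in range(1, n)}


def misoriented_sfs(sfs, n, pm):
    """Spectrum after a fraction pm of sites has ancestral and derived states swapped."""
    return {i: (1 - pm) * sfs[i] + pm * sfs[n - i] for i in range(1, n)}


def expected_h(sfs, n):
    """Expected value of H = pi - theta_H for a given expected spectrum."""
    pairs = n * (n - 1)
    return sum(x * (2 * i * (n - i) - 2 * i * i) for i, x in sfs.items()) / pairs


if __name__ == "__main__":
    n, pm = 50, 0.15
    neutral = expected_sfs(n)
    shifted = misoriented_sfs(neutral, n, pm)
    # Fold increase of the highest derived-frequency class, and the shift in E[H].
    print(shifted[n - 1] / neutral[n - 1])
    print(expected_h(neutral, n), expected_h(shifted, n))
```

Because singletons are the most abundant class and the class with n - 1 derived copies is the rarest, even a modest misorientation rate multiplies the expected number of high-frequency-derived variants several-fold and pushes the expected H below zero, while the expected $\pi$ is unchanged.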
Our results suggest that, in species where a relatively divergent outgroup is commonly used, like D. simulans or A. thaliana, misorientation of the derived state of variants can produce significant values of the H-test with appreciable frequency. This effect is likely to be underestimated in our study since more distant outgroups are sometimes used (e.g., D. yakuba for D. melanogaster or gorilla for humans). Conversely, the use of a very closely related outgroup could lead to yet another source of misorientation due to the occurrence of ancestral polymorphisms. Using several outgroups could potentially help. (To our knowledge, this is not done in practice when applying the test.) It does not, however, fully solve the problem since adding more outgroups, especially more distant ones, increases the probability of getting multiple hits. Our approximation that neglects higher-order mutations then becomes inappropriate. Using several outgroups can reduce the fraction of misoriented sites, but would correspondingly increase the number of sites that cannot be oriented. These sites would have to be removed from the analyses, thereby reducing the power of the test. The issue becomes a trade-off between power and robustness to homoplasy. Finally, two outgroups are typically far from being independent: a large part of the lineage linking an outgroup to the intraspecific tree is generally shared between two outgroups, thus providing little additional information. If several outgroups are to be used, our approach can still be applied by replacing the outgroups with an estimate of the ancestral sequence on the node that links the outgroups to the intraspecific tree, e.g., with maximum-likelihood methods.
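The trade-off between misoriented and unorientable sites can be illustrated with a simple orientation rule. The sketch below is a hypothetical consensus rule, not the maximum-likelihood reconstruction mentioned above: a site is oriented only if all outgroups carry the same base and that base is one of the two segregating variants; otherwise it is treated as unorientable and would be dropped from the analysis.

```python
def orient_site(ingroup_alleles, outgroup_bases):
    """Return the inferred ancestral base, or None if the site cannot be oriented.

    ingroup_alleles: the variants segregating at the site, e.g. "AG".
    outgroup_bases: one base per outgroup sequence, e.g. "AA" for two outgroups.
    """
    alleles = set(ingroup_alleles)
    states = set(outgroup_bases)
    if len(states) != 1:
        return None  # outgroups disagree: a multiple hit is detected
    (state,) = states
    if state not in alleles:
        return None  # third state in the outgroups: a multiple hit is detected
    return state


if __name__ == "__main__":
    sites = [("AG", "AA"), ("AG", "AG"), ("CT", "GG"), ("CT", "T")]
    for alleles, outgroups in sites:
        print(alleles, outgroups, orient_site(alleles, outgroups))
```

With a single outgroup the rule can only reject third-state sites. Each added outgroup gives more chances to detect a multiple hit, which lowers misorientation but also discards more sites.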
Any additional mutational bias (e.g., base frequency heterogeneity) should increase the undetected over detected multiple-hit ratio. This is the case particularly for data from coding regions where twofold degenerate sites tend to show high transition:transversion bias since only transitions are synonymous. This is taken into account in our average estimate of transition:transversion ratio, but we implicitly assume that this ratio is constant over sites, an assumption that may induce a slight bias. For data in coding regions, a rough correction can be performed by removing all twofold degenerate sites before estimating $\alpha$, $\beta$, and $P_D$ and deriving $P_M$ with (3). The assumption of a constant $\alpha$ over $\beta$ ratio between replacement and silent polymorphisms becomes more realistic once those sites are removed. Then $P_M$ should be corrected by the factor $L/(L - L_{2X})$, with $L$ the total length of the sequence and $L_{2X}$ the number of twofold degenerate sites. The rationale is that twofold degenerate sites cannot lead to detected multiple hits for synonymous polymorphisms.
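The rough correction for coding regions amounts to a single rescaling, sketched below with hypothetical parameter names. It assumes that the misorientation probability passed in was estimated after removing the twofold degenerate sites, as described above.

```python
def corrected_pm(pm, total_length, n_twofold_sites):
    """Rescale P_M by L / (L - L_2X) to account for removed twofold degenerate sites."""
    if not 0 <= n_twofold_sites < total_length:
        raise ValueError("need 0 <= n_twofold_sites < total_length")
    return pm * total_length / (total_length - n_twofold_sites)


if __name__ == "__main__":
    # Hypothetical example: 1000 sites, 200 of them twofold degenerate.
    print(corrected_pm(0.10, total_length=1000, n_twofold_sites=200))
```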
In the above discussion we considered only misorientations caused by homoplasy, but misorientations could also result from other types of bias, such as poor alignment with the outgroup sequences when indels are frequent. Finally, in the presence of recombination, the use of critical values for the case of no recombination is overly conservative, at the expense of a drastic reduction in the power of the tests.
| ACKNOWLEDGMENTS |
|---|
We thank A. Di Rienzo and L. Frisse for providing data sets and D. Begun, D. Carlini, E. Heyer, H. Innan, C. Müller-Graf, and anonymous reviewers for comments on the manuscript. E.B. is supported by a grant from the Ecole Pratique des Hautes Etudes and F.D. by a grant from the Centre National de la Recherche Scientifique.
Manuscript received June 18, 2003; Accepted for publication July 31, 2003.
| LITERATURE CITED |
|---|
AGUADE, M., 2001 Nucleotide sequence variation at two genes of the phenylpropanoid pathway, the FAH1 and F3H genes, in Arabidopsis thaliana. Mol. Biol. Evol. 18:1-9.
BEGUN, D. and P. WHITLEY, 2000 Reduced X-linked nucleotide polymorphism in Drosophila simulans. Proc. Natl. Acad. Sci. USA 97:5960-5965.
DEPAULIS, F., S. MOUSSET, and M. VEUILLE, 2004 Powers of neutrality tests against bottleneck and hitchhiking. J. Mol. Evol. in press.
FAY, J. C. and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:1405-1413.
FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, and J. DONFACK et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69:831-843.
FU, Y. X. and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709.
GU, X. and J. ZHANG, 1997 A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113.
HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 23-36 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Japan Scientific Societies Press/Sinauer Associates, Sunderland, MA.
INNAN, H. and F. TAJIMA, 1997 The amount of nucleotide variation within and between allelic classes and the reconstruction of the common ancestral sequence in a population. Genetics 147:1431-1444.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21-132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KAWABE, A. and N. T. MIYASHITA, 1999 DNA variation in the basic chitinase locus (ChiB) region of the wild plant Arabidopsis thaliana. Genetics 153:1445-1453.
KIMURA, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutation. Genetics 61:893-903.
KIMURA, M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
KUITTINEN, H. and M. AGUADE, 2000 Nucleotide variation at the CHALCONE ISOMERASE locus in Arabidopsis thaliana. Genetics 155:863-872.
MIYASHITA, N., 2001 DNA variation in the 5' upstream region of the Adh locus of the wild plants Arabidopsis thaliana and Arabis gemmifera. Mol. Biol. Evol. 18:164-171.
NACHMAN, M. W. and S. L. CROWELL, 2000 Estimate of the mutation rate per nucleotide in humans. Genetics 156:297-304.
OLSEN, K. M., A. WOMACK, A. R. GARRETT, J. I. SUDDITH, and M. D. PURUGGANAN, 2002 Contrasting evolutionary forces in the Arabidopsis thaliana floral developmental pathway. Genetics 160:1641-1650.
PRZEWORSKI, M., 2002 The signature of positive selection at randomly chosen loci. Genetics 160:1179-1189.
ROGERS, A., 1992 Error introduced by the infinite-site model. Mol. Biol. Evol. 9:1181-1184.
SAVOLAINEN, O., C. H. LANGLEY, B. P. LAZZARO, and H. FREVILLE, 2000 Contrasting patterns of nucleotide polymorphism at the alcohol dehydrogenase locus in the outcrossing Arabidopsis lyrata and the selfing Arabidopsis thaliana. Mol. Biol. Evol. 17:645-655.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
WALL, J. D., 1999 Recombination and the power of statistical tests of neutrality. Genet. Res. 74:65-79.
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.
YANG, Z. and A. D. YODER, 1999 Estimation of the transition/transversion rate bias and species sampling. J. Mol. Evol. 48:274-283.
YANG, Z., S. KUMAR, and M. NEI, 1995 A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641-1650.