Abstract
The polymorphism of a gene or a locus is studied with increasing frequency by multiple laboratories, or by the same group at different times. Such practice results in polymorphism being revealed by different samples at different regions of the locus. Tests of neutrality have been widely conducted for polymorphism data, but commonly used statistical tests cannot be applied directly to such data. This article provides a procedure for conducting a neutrality test in this situation, and details are given for two commonly used tests. Applying the two new tests to the chemokine-receptor gene (CCR5) in humans, we found that the hypothesis that all mutations are selectively neutral cannot explain the observed pattern of DNA polymorphism.
THE amount and pattern of polymorphism in DNA sequence samples from a population reflect not only mutations in the ancestors of the sequences but also random genetic drift as well as other evolutionary forces, such as natural selection. How to detect the presence of natural selection is an important issue in molecular population genetics and evolution. Detection is possible because natural selection often causes the pattern of polymorphism to differ from that under the neutral mutation hypothesis, which postulates that the majority of mutations that have contributed significantly to the genetic variation in natural populations are neutral or nearly neutral (Kimura 1983). However, the neutral mutation hypothesis is not sufficiently quantified to be tested rigorously in practice. A narrower definition of neutrality is that all mutations are selectively neutral, which is referred to as the hypothesis of strict neutrality (Fu and Li 1993).
A number of statistical tests have been proposed, and almost all of them are designed for a single sample. In reality, polymorphic data can be accumulated over time in the same or different laboratories, which means different sites may be examined using different samples. How to conduct neutrality tests in such situations is the focus of this article. To date, millions of single nucleotide polymorphisms (SNPs) have been identified. It is very likely that SNPs from a single gene or tightly linked regions will be typed for different samples by different research groups over time. The newly developed method will be valuable for analyzing such data. We present an example of such an analysis using data from the CCR5 gene.
STATISTICAL TESTS
A number of statistical tests can be extended for multiple samples. However, for the sake of discussion, we focus on two particular tests, partly because they are widely used. First, Tajima (1989) proposed using the difference between two estimates of θ (=4Nμ), where N is the effective population size and μ is the mutation rate per sequence per generation, to detect the presence of selection. His test statistic is

D_{T} = (Π − K/ā)/√Var(Π − K/ā),

where Π is the average number of pairwise nucleotide differences, K is the number of segregating sites, and ā = Σ_{i=1}^{n−1} 1/i. Second, Fu and Li (1993) proposed a test based on the difference between K and the number ξ_{1} of singleton mutations; their statistic is

D_{F} = (K − āξ_{1})/√Var(K − āξ_{1}).
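For a single sample, Tajima's statistic can be computed directly from the sample size, the number of segregating sites, and the average pairwise difference. A minimal sketch follows; the constants are those of Tajima (1989), while the function and variable names are ours:

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D for a single sample of n sequences with S segregating
    sites and average pairwise difference pi (Tajima 1989)."""
    a1 = sum(1.0 / i for i in range(1, n))       # sum of 1/i, i = 1..n-1
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    var = e1 * S + e2 * S * (S - 1)              # estimated variance of the numerator
    return (pi - S / a1) / math.sqrt(var)
```

When pi exceeds S/a1 (an excess of intermediate-frequency variants) the statistic is positive; a deficit makes it negative.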
The above two tests can be extended to multiple samples in the following way. Assume that a locus without recombination is divided into m regions that have been surveyed using different or partially overlapping samples (Figure 1). It should be emphasized that the assumption of no recombination is made here to keep the null model as simple as possible, as in the original Tajima and Fu and Li tests. Just as the presence of recombination does not invalidate the Tajima and Fu and Li tests but makes them more conservative, the new tests will likely behave similarly, so they should also be applicable to data with recombination.
Define θ = 4Nμ, where μ is the mutation rate per sequence for the entire locus, and N is the effective population size. Also define θ_{i} = 4Nμ_{i}, where μ_{i} is the mutation rate of the ith region (i = 1, 2, ..., m), and r_{i} = μ_{i}/μ, the proportion of the total mutation rate contributed by the ith region. If the mutation rate per site is constant, r_{i} is equal to the proportion of the length of the locus occupied by the ith region.
Furthermore, we define
PERFORMING THE TESTS
Since Var(Π − K/ā) = Var(Π) + Var(K)/ā^{2} − 2 Cov(Π, K)/ā and Var(K − āξ_{1}) = Var(K) + ā^{2} Var(ξ_{1}) − 2ā Cov(K, ξ_{1}), we can compute Var(Π − K/ā) and Var(K − āξ_{1}) if we are able to compute Var(K), Var(Π), Var(ξ_{1}), Cov(Π, K), and Cov(K, ξ_{1}). Some of these terms can be computed analytically; others have to be estimated.
Analytical result: Assume the total sample consists of n sequences, numbered from 1 to n, and let S = {1, 2, ..., n} represent the whole sample. The sample S_{i} for the ith region is a subset of S; that is, S_{i} ⊆ S. We make no assumption here about the relationships among the S_{i}: at one extreme, S_{i} ∩ S_{j} = ϕ for every pair i ≠ j, and at the other, S_{1} = S_{2} = ... = S_{m} = S. In many situations it is likely that S_{i} ∩ S_{j} ≠ ϕ (e.g., Figure 1). We use |S_{i}| to denote the number of elements in the set S_{i}, which is the sample size for the ith region. Then we have n_{i} = |S_{i}| and n = |S|.
The computation of Var(Π), Var(K), and the other components mentioned above requires an understanding of the sample genealogy. Let L_{i} be the total branch length of the subtree for the ith region, scaled so that 1 unit corresponds to 4N generations; the length of a branch here means a time duration. Let ξ_{ki} be the number of size-k mutations in the ith region, and l_{ki} the total length of size-k branches in the subtree for the ith region, scaled in the same way as L_{i} (Figure 1). The size of a branch is the number of sequences in the sample that are descendants of that branch, and a mutation is said to be of size k if it occurs on a branch of size k (Fu 1995). Consider the subtree drawn with solid lines in Figure 1: branch g has two descendant sequences, 1 and 2, so the size of the branch is 2, and a mutation on branch g is therefore of size 2. Following the definitions of θ, θ_{i}, and r_{i}, we have

E(ξ_{ki}) = θ_{i}E(l_{ki}) = r_{i}θE(l_{ki}).
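The size of a branch can be computed by a simple recursion over the genealogy. The sketch below uses a toy tree of our own that mirrors the branch-g example: it counts, for each branch, the sampled sequences descended from it.

```python
def branch_sizes(children, root):
    """Size of the branch above each node: the number of sampled
    sequences (leaves) descended from it.  `children` maps each
    internal node to its child nodes; leaves have no entry."""
    sizes = {}
    def count(node):
        kids = children.get(node, [])
        sizes[node] = 1 if not kids else sum(count(k) for k in kids)
        return sizes[node]
    count(root)
    return sizes

# Toy genealogy: root joins branch g with leaf 3, and g joins leaves
# 1 and 2.  Branch g has descendants 1 and 2, so its size is 2, and a
# mutation occurring on g would be a size-2 mutation.
tree = {"root": ["g", 3], "g": [1, 2]}
sizes = branch_sizes(tree, "root")
```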
Numerical estimation: The above derivation shows that E(l_{ki}l_{pj}) is critical for computing Var(K), Var(Π), Cov(Π, K), Var(ξ_{1}), and Cov(K, ξ_{1}). The value of E(l_{ki}l_{pj}) is dependent on the relationship of branches among subtrees of the regions. A simple example of the relationship is given in Figure 1. Although Fu (1995) was able to derive E(l_{ki}l_{pj}) when i = j, the general formula for i ≠ j appears to be analytically intractable. However, an estimate can be obtained relatively easily from the following procedure:
1. Simulate a genealogy g (topology without mutations) of a sample of n sequences numbered from 1 to n.

2. Compute the value of E_{g}(l_{ki}l_{pj}) for the simulated genealogy.

3. Repeat steps 1 and 2 M times. E(l_{ki}l_{pj}) is then estimated as the average of E_{g}(l_{ki}l_{pj}) over the M simulated genealogies, (1/M)Σ_{g}E_{g}(l_{ki}l_{pj}).
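The procedure above can be sketched as follows. This is our own illustration, not the authors' code: it simulates standard coalescent genealogies (time is in units of 2N generations; a constant rescaling gives the 4N units used in the text), and it averages the realized products l_{ki}l_{pj} across replicates, which estimates the same expectation as averaging the conditional expectations E_{g}.

```python
import random
from collections import defaultdict

def simulate_genealogy(n, rng):
    """One coalescent genealogy for n sequences.  Returns a list of
    (descendant_set, branch_length) pairs, one per branch."""
    lineages = [frozenset([i]) for i in range(1, n + 1)]
    birth = {lin: 0.0 for lin in lineages}       # time each lineage appeared
    t, branches = 0.0, []
    while len(lineages) > 1:
        j = len(lineages)
        t += rng.expovariate(j * (j - 1) / 2.0)  # waiting time to next coalescence
        a, b = rng.sample(lineages, 2)           # the coalescing pair
        for x in (a, b):
            lineages.remove(x)
            branches.append((x, t - birth[x]))   # branch above x ends at time t
        merged = a | b
        lineages.append(merged)
        birth[merged] = t
    return branches

def size_k_lengths(branches, subsample):
    """l_{ki}: total length of branches ancestral to exactly k members of
    `subsample`, for 1 <= k <= |subsample| - 1 (the restricted subtree)."""
    l = defaultdict(float)
    for desc, length in branches:
        k = len(desc & subsample)
        if 0 < k < len(subsample):
            l[k] += length
    return l

def estimate_product_moment(n, S_i, S_j, k, p, M=2000, seed=1):
    """Monte Carlo estimate of E(l_{ki} l_{pj}) over M genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        branches = simulate_genealogy(n, rng)
        total += size_k_lengths(branches, S_i)[k] * size_k_lengths(branches, S_j)[p]
    return total / M
```

For n = 2 and S_i = S_j = {1, 2}, the size-1 length is twice an exponential(1) coalescence time, so E(l_{1i}l_{1j}) = E(4T^2) = 8, which the estimator should approach as M grows.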
The components mentioned above can be obtained once E(l_{ki}l_{pj}) is estimated, and D_{T} and D_{F} can then be calculated. As with Tajima's (1989) and Fu and Li's (1993) tests, D_{T} and D_{F} do not follow well-known standard distributions. Since E(l_{ki}l_{pj}) has to be estimated from simulated samples, it is natural to use computer simulations to determine the critical points of the tests as well. Overall, this approach gives more accurate critical values than approximation by a standard distribution.
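Empirical critical points can be read off the simulated null distribution directly. A generic sketch, assuming a routine that produces one simulated null value of the statistic per call (the function names are ours):

```python
import random

def critical_values(simulate_stat, M=5000, alpha=0.05, seed=7):
    """Two-sided empirical critical points: the alpha/2 and 1 - alpha/2
    quantiles of M values of the statistic simulated under the null."""
    rng = random.Random(seed)
    vals = sorted(simulate_stat(rng) for _ in range(M))
    lo = vals[int(alpha / 2 * M)]
    hi = vals[int((1 - alpha / 2) * M) - 1]
    return lo, hi
```

An observed statistic outside [lo, hi] then leads to rejection of strict neutrality at level alpha.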
AN EXAMPLE
We consider data from the CCR5 gene in Caucasians (Carrington et al. 1997) to illustrate how the extended tests described in this article can be applied. CCR5 encodes a cell-surface chemokine-receptor molecule that serves as a coreceptor for the macrophage-tropic strains of HIV-1. Because of its obvious importance, the CCR5 gene has been the subject of many studies. One hypothesis is that CCR5 might have been under natural selection (Carrington et al. 1997). In the data of Carrington et al. (1997), 12 mutations were documented in Caucasians, 10 of which alter the amino acid sequence of the protein, and each mutation was typed in a different or partially overlapping sample (Table 1). Since the precise relationships among the samples were not given in the original article, we consider two extreme cases here. In the first case, we assume that different samples are composed of different individuals; that is, S_{i} ∩ S_{j} = ϕ for i ≠ j. In the second case, we assume that smaller samples are subsets of larger samples; that is, S_{8} (64) ⊂ S_{11} (90) ⊂ S_{6} (98) ⊂ S_{3} (170) ⊂ S_{12} (174) ⊂ S_{10} (242) ⊂ S_{1} (382) ⊂ S_{5} (462) ⊂ S_{9} (490) ⊂ S_{2} (698) ⊂ S_{4} (708) ⊂ S_{7} (5210), where the numbers in parentheses are the sample sizes. We refer to these two cases as case 1 and case 2, respectively. By assuming in case 1 that the samples are independent, we obtain the maximum possible information from such data; by assuming in case 2 that smaller samples are subsets of larger ones, we have the minimal amount of information.
From (12), we can get the estimate of θ as
Acknowledgments
We are grateful to Ms. Sara Barton for her help. This work was supported by National Institutes of Health grants R01 GM50428 and R01 GM55759 (Yun-Xin Fu), the Chinese Academy of Sciences (KSCX2105), the National Natural Science Foundation of China, and the Natural Science Foundation of Yunnan Province in China.
Footnotes
Communicating editor: G. B. Golding
Received April 23, 2002.
Accepted November 21, 2002.
 Copyright © 2003 by the Genetics Society of America