Genetics, Vol. 163, 1147-1151, March 2003, Copyright © 2003

Neutrality Tests Using DNA Polymorphism From Multiple Samples

Haipeng Lib,c, Yunwu Zhangc, Ya-Ping Zhangc, and Yun-Xin Fub,a
a Laboratory of Bioinformatics, Yunnan University, Kunming 650991, People's Republic of China,
b Human Genetics Center, University of Texas, Houston, Texas 77030
c Laboratory of Molecular Evolution and Genome Diversity, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, People's Republic of China

Corresponding author: Yun-Xin Fu, UT School of Public Health, P.O. Box 20186, 1200 Herman Pressler, Houston, TX 77030., yunxin.fu{at}uth.tmc.edu (E-mail)

Communicating editor: G. B. GOLDING


*  ABSTRACT
*TOP
*ABSTRACT
*STATISTICAL TESTS
*PERFORMING THE TESTS
*AN EXAMPLE
*LITERATURE CITED

The polymorphism of a gene or a locus is studied with increasing frequency by multiple laboratories or the same group at different times. Such practice results in polymorphism being revealed by different samples at different regions of the locus. Tests of neutrality have been widely conducted for polymorphism data but commonly used statistical tests cannot be applied directly to such data. This article provides a procedure to conduct a neutrality test and details are given for two commonly used tests. Applying the two new tests to the chemokine-receptor gene (CCR5) in humans, we found that the hypothesis that all mutations are selectively neutral cannot explain the observed pattern of DNA polymorphism.


THE amount and pattern of polymorphism in DNA sequence samples from a population reflects not only mutations in the ancestors of the sequences but also random genetic drift as well as other evolutionary forces, such as natural selection. How to detect the presence of natural selection in molecular population genetics and evolution is an important issue. It is possible to detect the presence of natural selection because natural selection often causes the pattern of polymorphism to differ from that under the neutral mutation hypothesis, which postulates that the majority of mutations that have contributed significantly to the genetic variation in natural populations is neutral or nearly neutral (KIMURA 1983 Down). However, the neutral mutation hypothesis is not sufficiently quantified to be tested rigorously in practice. A narrower definition of neutrality is that all mutations are selectively neutral, which is referred to as the hypothesis of strict neutrality (FU and LI 1993 Down).

A number of statistical tests have been proposed, and almost all of them are designed for a single sample. In reality, polymorphic data can be accumulated over time in the same or different laboratories, which means different sites may be examined using different samples. How to conduct neutrality tests in such situations is the focus of this article. To date, millions of single nucleotide polymorphisms (SNPs) have been identified. It is very likely that SNPs from a single gene or tightly linked regions will be typed for different samples by different research groups over time. The newly developed method will be valuable for analyzing such data. We present an example of such an analysis using data from the CCR5 gene.


*  STATISTICAL TESTS
*TOP
*ABSTRACT
*STATISTICAL TESTS
*PERFORMING THE TESTS
*AN EXAMPLE
*LITERATURE CITED

A number of statistical tests can be extended for multiple samples. However, for the sake of discussion, we focus on two particular tests partly because they are used widely. First, TAJIMA 1989 Down proposed using the difference between two estimates of {theta} (=4Nµ), where N is the effective population size, and µ is the mutation rate per sequence per generation, to detect the presence of selection. His test statistic is

(1)

where {Pi} is the mean number of nucleotide differences between two sequences, K is the number of segregating sites, which is equal to the total number of mutations under the infinite-sites model, n is sample size, and

(2)

Second, FU and LI 1993 Down suggested several tests of neutrality, one of which is

(3)

where {xi}1 is the number of mutations in the external branches, that is, mutations that are inherited by only one sequence in the sample.

The above two tests can be extended to multiple samples in the following way. Assume that a locus without recombination is divided into m regions that have been surveyed using different or partially overlapping samples (Fig 1). It should be emphasized that the assumption of no recombination is made here to make the null model as simple as possible, similar to the original Tajima and Fu and Li tests. Just as the presence of recombination does not invalidate the Tajima and Fu and Li tests, but makes them more conservative, the new tests will likely behave similarly and this will also be applicable to data with recombination.



View larger version (21K):
In this window
In a new window
Download PPT slide
 
Figure 1. An illustration of a genealogy of multiple samples. The locus is divided into two regions. The sequence length in the first region is 400 and in the second one is 600, so r1 = 0.4 and r2 = 0.6 under the hypothesis of a constant mutation rate. |S1| = 3, |S2| = 3, and S1 {cap} S2 = {3}. That means {1, 2} are sequenced from 1 to 400, {4, 5} are sequenced from 401 to 1000, and {3} is sequenced from 1 to 1000. The subtree for the first region is marked by solid lines, and the subtree for the second region is labeled by dashed lines. The lengths of branches can be calculated from coalescent times, and the length of a branch here means a time duration. For example, g = t2 + t3 + t4, and h = t2. Let Li be the branch length of the subtree for the ith region, and then L1 = a + b + g + h + e, and L2 = c + d + f + e. The branch of e is shared among two subtrees. lki is the length of the k-size branch in the subtree for the ith region, so l11 = a + b + h + e, l21 = g, l12 = c + d + e, and l22 = f. Five mutations are on the genealogy and are shown as circles, three of which (solid circles) are found in the first region, and two of which (shaded circles) are found in the second region. Therefore, {xi}11 = 2, {xi}21 = 1, {xi}12 = 2, and {xi}22 = 0.

Define {theta} = 4Nµ, where µ is the mutation rate per sequence per entire locus, and N is the effective population size. Also define {theta}i = 4Nµi, where µi is the mutation rate for the ith region (i = 1, 2, ... , m), and ri = µi/µ, which is the proportion of the mutation rate of the ith region. If the mutation rate per site is constant, the ri is equal to the proportion of the length of the ith region,

Furthermore, we define the mean number of nucleotide differences between two sequences in the ith region as {Pi}i, the number of segregating sites in the ith region as Ki, the number of mutations in the external branches of the genealogy in the ith region as {xi}1i, and the sample size in the ith region as ni (Fig 1). Define

(4)


(5)

and

(6)

It is obvious that when m = 1, these equations reduce to their conventional definitions for a single sample. Note that in these equations the same weight is given to every region. An alternative approach is to give weight to a region according to certain criterion. However, so far we have not found any other weighting scheme to perform better than the equal-weighting scheme.

Furthermore, we define

(7)

where ani is given by (2). That is, is a weighted average of ani. Then the tests (1) and (3) become

(8)

and

(9)


*  PERFORMING THE TESTS
*TOP
*ABSTRACT
*STATISTICAL TESTS
*PERFORMING THE TESTS
*AN EXAMPLE
*LITERATURE CITED

Since and , we can compute Var({Pi} - K/) and Var(K - {xi}1) if we are able to compute Var(K), Var({Pi}), Var({xi}1), Cov({Pi}, K), and Cov(K, {xi}1). Some of these terms can be computed analytically; others have to be estimated.

Analytical result:
Assume the total sample consists of n sequences, and let those sequences be numbered from 1 to n. S = {1, 2, ... , n} will represent the whole sample. The sample, Si, for the ith region will be a subset of S. That is Si {subseteq} S. We do not make any assumption here on the relationship among Si. We note that in one extreme, we can have Si {cap} Sj = {phi} for every pair of i and j, and on the other hand, we may have S1 = S2 = ... = Sm = S. In many situations, it is likely that Si {cap} Sj != {phi} (e.g., Fig 1). We use |Si| to represent the number of elements in the set Si, which is the sample size. Then we have ni = |Si|, n = |S|.

The computation of Var({Pi}), Var(K), and other components that we mentioned above requires understanding of the sample genealogy. Let Li be the total branch length of subtree for the ith region scaled so that 1 unit corresponds to 4N generations, and the length of a branch here means a time duration. {xi}ki is the number of k-size mutations in the ith region, and lki is the length of k-size branches in the subtree for the ith region scaled similarly as Li (Fig 1). The size of a branch is the number of sequences in the sample that are descendants of that branch, and a mutation is said to be size k if it occurs in a branch of size k (FU 1995 Down). Considering the subtree that is part of the tree shown as solid lines in Fig 1, the branch of g has 2 descendent sequences, 1 and 2, so the size of the branch is 2. A mutation is on the branch of g, so the size of the mutation is 2. Following the definition of {theta}, {theta}i, and ri, we have

(10)

Moreover, since 1 unit in time corresponds to 4N generations, we have from LI and FU 1998 Down

(11)

Under the infinite-sites model, we have

(12)

Conditioning on the coalescent times, the number of mutations in each branch follows a Poisson distribution with parameter {theta}l, where l is the branch length. We thus have

where tk (2 <= k <= n) is the time duration required for k sequences to coalesce to k - 1 sequence, i.e., the so-called k-coalescent time (Fig 1), and i != j. Then we have

(13)

which leads to

(14)

WATTERSON 1975 Down showed that K in the case of one sample is

(15)

where

Also we can partition E(LiLj) further as

(16)

Therefore, from (5), (10), (14), (15), and (16), we have

(17)

Similar to (13), we have for i != j that

(18)

TAJIMA 1983 Down showed E({Pi}i) = {theta}i, so we have

(19)

Since {Pi}i is the mean number of nucleotide differences between two sequences in the ith region, it can be calculated from {xi}ki as

(20)

(FU 1995 Down). TAJIMA 1983 Down derived the variance of {Pi}i as

(21)

So we have

(22)

and thus

(23)

where Var({Pi}i) and E({Pi}i{Pi}j) are given by (21) and (22), respectively. Moreover, from (12) and (19), we have

(24)

where E(K{Pi}) is given later (Equation 25). FU 1995 Down showed the formula to calculate E({xi}ki{xi}pi). After putting these terms together, we have

(25)

From FU 1995 Down, we have

(26)

which leads to

(27)

and

(28)

FU and LI 1993 Down showed that the variance of the total number of mutations in the external branches is given by

(29)

where cn = 1 when n = 2, and when n > 2

From (18) and (26), we have

(30)

Substituting (29) and (30) into (28), we have

(31)

and

(32)

where E({xi}ki{xi}1i) is given by FU 1995 Down.

Numerical estimation:
The above derivation shows that E(lkilpj) is critical for computing Var(K), Var({Pi}), Cov({Pi}, K), Var({xi}1), and Cov(K, {xi}1). The value of E(lkilpj) is dependent on the relationship of branches among subtrees of the regions. A simple example of the relationship is given in Fig 1. Although FU 1995 Down was able to derive E(lkilpj) when i = j, the general formula for i != j appears to be analytically intractable. However, an estimate can be obtained relatively easily from the following procedure:

  1. Simulate a genealogy g (topology without mutation) of a sample of n sequences numbered from 1 to n.

  2. Compute the value of Eg(lkilpj) for the simulated genealogy.

  3. Repeat steps 1 and 2 M times. Then E(lkilpj) is finally estimated as

    The computation of Eg(lkilpj) in step 2 is done in a similar manner as FU 1994 Down. As it is obvious, the accuracy of the estimation can be improved by using large values of M.

  4. The components mentioned above can be obtained after E(lkilpj) is estimated, and then DT and DF can be calculated. Similar to TAJIMA'S 1989 Down and FU and LI'S 1993 Down tests, DT and DF do not follow well-known standard distributions. Since E(lkilpj) have to be estimated from simulated samples, it is natural to use computer simulations to determine the critical points of the tests. Overall, this approach gives more accurate critical values than using approximation by a standard distribution.


*  AN EXAMPLE
*TOP
*ABSTRACT
*STATISTICAL TESTS
*PERFORMING THE TESTS
*AN EXAMPLE
*LITERATURE CITED

We consider data from the CCR5 gene from Caucasians (CARRINGTON et al. 1997 Down) to illustrate how the extended tests described in this article can be applied. The CCR5 encodes a cell-surface chemokine-receptor molecule that serves as a co-receptor for the macrophage-tropic strains of HIV-1. Because of its obvious importance, the CCR5 gene has been subjected to many studies. One hypothesis is that CCR5 might have been under natural selection (CARRINGTON et al. 1997 Down). In the data from CARRINGTON et al. 1997 Down, 12 mutations were documented in Caucasians, and 10 of the discovered mutations alter the amino acid sequence of the protein, and each mutation is typed by different or partially overlapping samples (Table 1). Since the precise relationships among samples were not given in the original article, we consider two extreme cases here. In the first case, we assume that different samples are composed of different individuals. That is, Si {cap} Sj = {phi}, where i != j. In the second case, we assume that smaller samples are a subset of larger samples. That is, we assume S8 (64) {sub} S11 (90) {sub} S6 (98) {sub} S3 (170) {sub} S12 (174) {sub} S10 (242) {sub} S1 (382) {sub} S5 (462) {sub} S9 (490) {sub} S2 (698) {sub} S4 (708) {sub} S7 (5210), where the numbers in parentheses are the sample sizes. We refer to those two cases as case 1 and case 2, respectively. By assuming samples are independent in the first case, we basically obtain the maximum possible information from such data. On the other hand, by assuming smaller samples are a subset of larger samples, we have the minimal amount of information.


 
View this table:
In this window
In a new window

 
Table 1. The list of ri and the frequencies of mutations in 12 regions of the CCR5 gene

From (12), we can get the estimate of {theta} as since so we have = 1.913. Also we have {Pi} = {sum} {Pi}i = 0.392. Table 2 shows that both tests yield significant negative values in both cases 1 and 2. Thus we conclude that the CCR5 region has not evolved according to the neutral model. One possibility is that it has evolved under natural selection, which remains to be seen by further study.


 
View this table:
In this window
In a new window

 
Table 2. Results of neutrality tests for CCR5


*  ACKNOWLEDGMENTS

We are grateful to Ms. Sara Barton for her help. This work is supported by National Institutes of Health grants R01 GM50428 and R01 GM55759 (Yun-Xin Fu), the Chinese Academy of Sciences (KSCX2-1-05), the National Nature Science Foundation of China, and the Nature Science Foundation of Yunnan Province in China.

Manuscript received April 23, 2002; Accepted for publication November 21, 2002.


*  LITERATURE CITED
*TOP
*ABSTRACT
*STATISTICAL TESTS
*PERFORMING THE TESTS
*AN EXAMPLE
*LITERATURE CITED

CARRINGTON, M., T. KISSNER, B. GERRARD, S. IVANOV, and S. J. O'BRIEN et al., 1997  Novel alleles of the chemokine-receptor gene CCR5.. Am. J. Hum. Genet. 61:1261-1267.[Medline]

FU, Y. X., 1994  Estimating effective population-size or mutation-rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138:1375-1386.[Abstract]

FU, Y. X., 1995  Statistical properties of segregating sites. Theor. Popul. Biol. 48:172-197.[Medline]

FU, Y. X. and W. H. LI, 1993  Statistical tests of neutrality of mutations. Genetics 133:693-709.[Abstract]

KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

LI, W. H., and Y. X. FU, 1998 Coalescent theory and its applications in population genetics, pp. 45–79 in Statistics in Genetics, edited by E. HALLORAN. Springer-Verlag, New York.

TAJIMA, F., 1983  Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.[Abstract/Free Full Text]

TAJIMA, F., 1989  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.[Abstract/Free Full Text]

WATTERSON, G. A., 1975  On the number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7:256-276.[Medline]




This article has been cited by other articles:


Home page
Mol Biol EvolHome page
Y.-w. Zhang, O. A. Ryder, and Y.-p. Zhang
Intra- and Interspecific Variation of the CCR5 Gene in Higher Primates
Mol. Biol. Evol., October 1, 2003; 20(10): 1722 - 1729.
[Abstract] [Full Text]