Abstract
A new statistic for detecting genetic differentiation of subpopulations is described. The statistic can be calculated when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers. Using a symmetric island model, and assuming an infinite-sites model of mutation, it is found that the new statistic is as powerful as or more powerful than previously proposed statistics for a wide range of parameter values.
Detecting genetic differentiation of subpopulations is an important problem in several areas of population biology, including evolutionary genetics, ecology, and conservation biology. When data are obtained from two or more localities in the form of allele frequencies at one or more unlinked loci, standard chi-square tests (or likelihood-ratio tests) of homogeneity are appropriate (Workman and Niswander 1970) and can be quite powerful for detecting differentiation. Even when the expected counts in some cells are small, permutation methods can be used to give good results (Lewontin and Felsenstein 1965; Roff and Bentzen 1989). If the data consist of DNA sequences, or haplotypes at two or more linked sites, the same methods can be employed if distinct sequences or haplotypes are treated as alleles. However, if the haplotype diversity is very high and the sample sizes are small, most haplotypes may appear in the sample only once, and the methods based on haplotype frequencies will have low power and, in extreme cases, can become completely useless. With these methods, longer sequences, which must contain more information, can result in lower power than shorter sequences. This problem is most severe with small samples and long sequences. To handle these kinds of data, Hudson et al. (1992) proposed the use of sequence-based statistics in the permutation tests. These sequence-based statistics utilize information on the numbers of differences between haplotypes and not just the frequencies of the haplotypes. The particular sequence-based statistics considered by Hudson et al. (1992) were shown to be more powerful than the chi-square statistic when haplotype diversity was very high, but were found to be relatively weak when the diversity was low. Thus, for low diversity samples, the chi-square statistic (or a likelihood-ratio statistic) would appear to be best, but, for high diversity samples, the sequence-based statistics should be used.
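The loss of power at high haplotype diversity can be illustrated with a small sketch. The function name `chi_square_haplotypes` and the toy sequences are hypothetical and chosen only for illustration. The code computes a Pearson chi-square statistic on a locality-by-haplotype table, treating each distinct sequence as an allele. When every haplotype is a singleton, the statistic equals the total sample size n for any assignment of individuals to localities, so a permutation test based on it cannot reject.

```python
from collections import Counter


def chi_square_haplotypes(seqs, localities):
    """Pearson chi-square for a locality-by-haplotype contingency table.

    Each distinct sequence is treated as one allele (haplotype).
    """
    n = len(seqs)
    row_totals = Counter(localities)            # sample size per locality
    col_totals = Counter(seqs)                  # copies of each haplotype
    observed = Counter(zip(localities, seqs))   # cell counts (0 if absent)
    chi2 = 0.0
    for loc, n_loc in row_totals.items():
        for hap, n_hap in col_totals.items():
            expected = n_loc * n_hap / n
            chi2 += (observed[(loc, hap)] - expected) ** 2 / expected
    return chi2


# Hypothetical toy data: six sequences, every haplotype seen exactly once.
seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
true_labels = [1, 1, 1, 2, 2, 2]    # localities match the two sequence clusters
mixed_labels = [1, 2, 1, 2, 1, 2]   # localities unrelated to the clusters

# Both calls give the same value (n = 6): the statistic ignores how
# similar the sequences are, so the two labelings are indistinguishable.
print(chi_square_haplotypes(seqs, true_labels))
print(chi_square_haplotypes(seqs, mixed_labels))
```

With shared haplotypes the same function does separate the two labelings, which is consistent with the chi-square statistic performing well when diversity is low.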
Unfortunately, there are no absolute criteria known for when the chi-square statistic should be employed and when the sequence-based statistics should be used. It would be desirable to have a single statistic that performs well at all levels of diversity. In this note, a new sequence-based statistic is introduced that appears to have this property. Under a symmetric two-island model with mutations occurring according to the infinite-sites model, this new statistic is found to be as powerful as or more powerful than other statistics that have been proposed for detecting genetic differentiation. This superior power is found over a wide range of haplotype diversity.
The new statistic, referred to as the nearest-neighbor statistic (Snn), is a measure of how often the “nearest neighbors” (in sequence space) of sequences are from the same locality in geographic space. This is made more precise below. The statistic is applicable when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers.
To define Snn, it is helpful to first establish some notation. For concreteness, suppose the data collected are mitochondrial sequences obtained from n individuals, some of which are from locality 1 and some from locality 2. (The statistic automatically generalizes to more localities.) We assume all sequences are the same length with no gaps. Arbitrarily number the individuals from 1 to n, and denote the sequence of individual i by si. Let dij equal the number of nucleotide sites at which si differs from sj. Focus on a particular individual, say, individual k, and let mk denote the minimum of {dkj}, j = 1, 2, … , k − 1, k + 1, … , n. Thus, mk is the distance to the nearest neighbor(s) of individual k. (Neighbor here reflects closeness in sequence space, not in geographic space.) Let Tk equal the number of individuals for which dkj = mk, again for fixed k and j ≠ k. Tk is the number of nearest neighbors of individual k. Let Wk equal the number of individuals with dkj = mk that are from the same locality as individual k. In other words, Wk is the number of nearest neighbors of individual k that are from the same locality as individual k. Now define Xk = Wk/Tk. Thus, Xk is the fraction of nearest neighbors of individual k that are from the same locality as individual k. For example, if individual k has only a single nearest neighbor, then Xk is one if the nearest neighbor is from the same locality as individual k, and Xk is zero if the nearest neighbor is from a different locality. The statistic Snn is simply the average of the Xk:

$$S_{nn} = \frac{1}{n} \sum_{k=1}^{n} X_k$$
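The definition above can be sketched in a few lines of code. This is a minimal illustration and not the author's snn.c program. The function names `hamming` and `snn` and the toy sequences are assumptions made for this example. Sequences are assumed to be equal-length strings with no gaps, as in the text.

```python
def hamming(a, b):
    """Number of sites at which two equal-length sequences differ (d_ij)."""
    return sum(x != y for x, y in zip(a, b))


def snn(seqs, localities):
    """Nearest-neighbor statistic: the average of X_k = W_k / T_k."""
    n = len(seqs)
    total = 0.0
    for k in range(n):
        others = [j for j in range(n) if j != k]
        dists = [hamming(seqs[k], seqs[j]) for j in others]
        m_k = min(dists)                           # distance to nearest neighbor(s)
        nearest = [j for j, d in zip(others, dists) if d == m_k]
        t_k = len(nearest)                         # number of nearest neighbors
        w_k = sum(localities[j] == localities[k] for j in nearest)
        total += w_k / t_k                         # X_k
    return total / n


# Hypothetical toy data: two clusters of three sequences each.
seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
print(snn(seqs, [1, 1, 1, 2, 2, 2]))   # localities match the clusters: 1.0
print(snn(seqs, [1, 2, 1, 2, 1, 2]))   # nearest neighbors all from the other locality: 0.0

# Ties are shared: "AA" has two nearest neighbors, one from each locality,
# so X_1 = 1/2; the other two individuals give X = 1 and X = 0.
print(snn(["AA", "AT", "TA"], [1, 1, 2]))   # 0.5
```

The sketch recomputes all pairwise distances on every call. A distance matrix computed once would be the natural choice if the statistic is evaluated many times on the same sequences.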
Table 1. Power of tests (cases examined by Hudson et al. 1992)
Table 2. Power of tests in very small sample sizes
To assess the power of permutation tests using Snn to detect geographic differentiation, the same symmetric two-island model considered by Hudson et al. (1992) was used. The parameters of this model are N, the island population size; u, the neutral mutation rate; c, the recombination rate between the ends of the segment sequenced; and m, the migration fraction per generation. An infinite-sites model was assumed (and thus no recurrent mutations occur in these simulations). The results of these simulations are shown in Tables 1 and 2. Table 1 shows the results for all parameter values and sample sizes considered by Hudson et al. (1992). In Table 2, more results for small sample sizes are given. For comparison, the power of the permutation tests based on the chi-square statistic (χ2) and on KS*, Z*, and HS are also shown in the tables. The statistics KS*, Z*, and HS were the most powerful sequence-based statistics found by Hudson et al. (1992).
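A permutation test based on Snn might be sketched as follows. The text does not spell out the details of the procedure, so several choices here are assumptions: locality labels are shuffled among individuals, the test is one-sided (large Snn is taken as evidence of differentiation), and the P value is estimated as the fraction of permuted data sets with Snn at least as large as the observed value. The function names, the `n_perm` and `seed` parameters, and the toy data are hypothetical.

```python
import random


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def snn(seqs, localities):
    """Nearest-neighbor statistic: the average of X_k = W_k / T_k."""
    n = len(seqs)
    total = 0.0
    for k in range(n):
        others = [j for j in range(n) if j != k]
        dists = [hamming(seqs[k], seqs[j]) for j in others]
        m_k = min(dists)
        nearest = [j for j, d in zip(others, dists) if d == m_k]
        w_k = sum(localities[j] == localities[k] for j in nearest)
        total += w_k / len(nearest)
    return total / n


def snn_permutation_test(seqs, localities, n_perm=1000, seed=0):
    """Return (observed Snn, estimated one-sided permutation P value)."""
    rng = random.Random(seed)        # fixed seed so the sketch is repeatable
    observed = snn(seqs, localities)
    labels = list(localities)        # copy, so the caller's list is untouched
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)          # reassign localities at random
        if snn(seqs, labels) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical toy data: two clusters of three sequences each.
seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
observed, p_value = snn_permutation_test(seqs, [1, 1, 1, 2, 2, 2])
print(observed, p_value)
```

For this toy sample only 2 of the 20 possible labelings give Snn = 1, so the estimated P value is close to 0.1 even though the localities are perfectly separated. Samples this small cannot reach conventional significance levels, whatever statistic is used.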
In Table 1, we find that Snn has equal or higher power than the χ2 statistic in all cases except one. (The exception is the first case in Table 1 in which the power of Snn was 0.77 while the power of χ2 was 0.78, a very small difference.) For most cases in this table, Snn has equal or only slightly higher power than the test based on χ2. However, in cases with small sample sizes (n1 = n2 = 10 or 15), especially with recombination, there is substantially higher power with the Snn statistic. (In the case with n1 = n2 = 10 and 4Nc = 20, the power with Snn is 0.46, while the power with χ2 is 0.21.) These results motivated us to look at more cases with small sample size, which are shown in Table 2.
For samples of size 6 from each locality, the Snn statistic is substantially more powerful than the χ2 statistic at all levels of variation examined (see Table 2). For samples of size 10 from each locality, Snn is only slightly more powerful than χ2 at low levels of variation, but at higher levels of variation, Snn has very much higher power than χ2. In contrast to the chi-square statistic, higher mutation rates (longer sequences) always lead to more power using the nearest-neighbor statistic, which accords with the intuition that longer sequences should provide more information. With low to moderate levels of variation, the Snn statistic is more powerful than the sequence-based statistics of Hudson et al. However, with the small sample sizes considered in Table 2, it appears that KS* and Z* may have slightly higher power than Snn when levels of variation are very high. (See the case n1 = n2 = 6 and 4Nu = 4Nc = 10.)
Summarizing, we find that among the statistics tested, Snn is the most powerful statistic, or nearly as powerful as the best statistic, under all conditions examined. It should be emphasized, however, that all assessments of power were carried out with a symmetric two-island model and assuming that mutations occur according to an infinite-sites model (in which no multiple hits occur). Other models may lead to different conclusions. The use of Snn eliminates the need to establish criteria for when to use an “allele” frequency-based statistic and when to use a sequence-based statistic. This statistic may also be of use in testing for genetic differences between samples in case-control studies, though these usually consist of diploid data in large samples.
Source code (in the language C) for a program that carries out the test on Unix or Linux machines is in a file, snn.c, available at http://home.uchicago.edu/~rhudson1.
Footnotes
- Communicating editor: M. Slatkin
- Received February 15, 2000.
- Accepted May 1, 2000.
- Copyright © 2000 by the Genetics Society of America