A New Statistic for Detecting Genetic Differentiation
Richard R. Hudson
 Corresponding author: Richard R. Hudson, Department of Ecology and Evolution, University of Chicago, 1101 E. 57th St., Chicago, IL 60637. Email: rrhudson{at}uchicago.edu
Abstract
A new statistic for detecting genetic differentiation of subpopulations is described. The statistic can be calculated when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers. Using a symmetric island model, and assuming an infinite-sites model of mutation, it is found that the new statistic is as powerful as or more powerful than previously proposed statistics for a wide range of parameter values.
Detecting genetic differentiation of subpopulations is an important problem in several areas of population biology, including evolutionary genetics, ecology, and conservation biology. When data are obtained from two or more localities in the form of allele frequencies at one or more unlinked loci, standard chi-square tests (or likelihood-ratio tests) of homogeneity are appropriate (Workman and Niswander 1970) and can be quite powerful for detecting differentiation. Even when the expected counts in some cells are small, permutation methods can be utilized to give good results (Lewontin and Felsenstein 1965; Roff and Bentzen 1989). If the data consist of DNA sequences, or haplotypes at two or more linked sites, the same methods can be employed if distinct sequences or haplotypes are treated as alleles. However, if the haplotype diversity is very high and the sample sizes are small, most haplotypes may appear in the sample only once, and the methods based on haplotype frequencies will have low power and, in extreme cases, can become completely useless. With these methods, longer sequences, which must contain more information, can result in lower power than short sequences. This problem is most severe with small samples and long sequences. To handle these kinds of data, Hudson et al. (1992) proposed the use of sequence-based statistics in the permutation tests. These sequence-based statistics utilize information on the numbers of differences between haplotypes and not just the frequencies of the haplotypes. The particular sequence-based statistics considered by Hudson et al. (1992) were shown to be more powerful than the chi-square statistic when haplotype diversity was very high, but were found to be relatively weak when the diversity was low. Thus, for low-diversity samples, the chi-square statistic (or a likelihood-ratio statistic) would appear to be best, but, for high-diversity samples, the sequence-based statistics should be used.
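The extreme case mentioned above, in which every haplotype appears only once, can be illustrated with a short sketch. The function name `chi_square` and the toy data are hypothetical and are not taken from any of the cited papers; the code simply computes the Pearson chi-square statistic for a locality-by-haplotype table. When all haplotypes are unique, the statistic equals the total sample size for every assignment of individuals to localities with the same sample sizes, so a permutation test based on it cannot reject homogeneity.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_square(haplotypes: Sequence[Hashable], localities: Sequence[Hashable]) -> float:
    """Pearson chi-square for the locality-by-haplotype contingency table.

    Illustrative helper (hypothetical name), treating distinct haplotypes as alleles.
    """
    if len(haplotypes) != len(localities):
        raise ValueError("one locality label is needed per haplotype")
    n = len(haplotypes)
    row_totals = Counter(localities)               # sample size per locality
    col_totals = Counter(haplotypes)               # copies of each haplotype
    observed = Counter(zip(localities, haplotypes))
    stat = 0.0
    for loc, n_loc in row_totals.items():
        for hap, n_hap in col_totals.items():
            expected = n_loc * n_hap / n
            stat += (observed[(loc, hap)] - expected) ** 2 / expected
    return stat


# Toy data: eight individuals, four per locality.
unique_haps = ["h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8"]    # all distinct
repeated_haps = ["a", "a", "a", "a", "b", "b", "b", "b"]          # two haplotypes
clustered = [1, 1, 1, 1, 2, 2, 2, 2]
alternating = [1, 2, 1, 2, 1, 2, 1, 2]

if __name__ == "__main__":
    # With unique haplotypes the statistic ignores how localities are assigned.
    print(chi_square(unique_haps, clustered), chi_square(unique_haps, alternating))
    # With repeated haplotypes the statistic does respond to the assignment.
    print(chi_square(repeated_haps, clustered), chi_square(repeated_haps, alternating))
```

In this toy example both calls with unique haplotypes give 8.0, whereas the calls with repeated haplotypes give 8.0 and 0.0, which is why frequency-based tests lose all power once every haplotype is a singleton.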
Unfortunately, there are no absolute criteria known for when the chi-square statistic should be employed and when the sequence-based statistics should be used. It would be desirable to have a single statistic that performs well at all levels of diversity. In this note, a new sequence-based statistic is introduced that appears to have this property. Under a symmetric two-island model with mutations occurring according to the infinite-sites model, this new statistic is found to be as powerful as or more powerful than other statistics that have been proposed for detecting genetic differentiation. This superior power is found over a wide range of haplotype diversity.
The new statistic, referred to as the nearest-neighbor statistic (S_{nn}), is a measure of how often the “nearest neighbors” (in sequence space) of sequences are from the same locality in geographic space. This is made more precise below. The statistic is applicable when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers.
To define S_{nn}, it is helpful to first establish some notation. For concreteness, suppose the data collected are mitochondrial sequences obtained from n individuals, some of which are from locality 1 and some from locality 2. (The statistic automatically generalizes to more localities.) We assume all sequences are the same length with no gaps. Arbitrarily number the individuals from 1 to n, and denote the sequence of individual i by s_{i}. Let d_{ij} equal the number of nucleotide sites at which s_{i} differs from s_{j}. Focus on a particular individual, say, individual k, and let m_{k} denote the minimum of {d_{kj}}, j = 1, 2, … , k − 1, k + 1, … , n. Thus, m_{k} is the distance to the nearest neighbor(s) of individual k. (Neighbor here reflects closeness in sequence space, not in geographic space.) Let T_{k} equal the number of individuals for which d_{kj} = m_{k}, again for fixed k and j ≠ k. T_{k} is the number of nearest neighbors of individual k. And let W_{k} equal the number of individuals with d_{kj} = m_{k} that are from the same locality as individual k. In other words, W_{k} is the number of nearest neighbors of individual k that are from the same locality as individual k. Now define X_{k} = W_{k}/T_{k}. Thus, X_{k} is the fraction of nearest neighbors of individual k that are from the same locality as individual k. For example, if individual k has only a single nearest neighbor, then X_{k} is one if the nearest neighbor is from the same locality as individual k, and X_{k} is zero if the nearest neighbor is from a different locality. The statistic S_{nn} is simply the average of the X_{k}:
S_{nn} = \frac{1}{n} \sum_{k=1}^{n} X_{k}
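The definition of S_{nn} can be sketched in a few lines of Python. This is a minimal illustration under the assumptions stated in the text (equal-length sequences, no gaps); it is not the author's snn.c program, and the function names `hamming` and `snn` are hypothetical.

```python
from typing import Hashable, Sequence


def hamming(a: str, b: str) -> int:
    """d_ij: number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (no gaps)")
    return sum(x != y for x, y in zip(a, b))


def snn(seqs: Sequence[str], localities: Sequence[Hashable]) -> float:
    """Nearest-neighbor statistic: the average over k of X_k = W_k / T_k."""
    n = len(seqs)
    if n < 2 or n != len(localities):
        raise ValueError("need at least two sequences and one locality per sequence")
    total = 0.0
    for k in range(n):
        # Distances from individual k to every other individual j != k.
        dists = [(hamming(seqs[k], seqs[j]), j) for j in range(n) if j != k]
        m_k = min(d for d, _ in dists)                 # distance to nearest neighbor(s)
        nearest = [j for d, j in dists if d == m_k]    # all ties count as neighbors
        t_k = len(nearest)                             # T_k
        w_k = sum(localities[j] == localities[k] for j in nearest)  # W_k
        total += w_k / t_k                             # X_k
    return total / n


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "GGCC", "GGCA"]
    print(snn(seqs, [1, 1, 2, 2]))   # nearest neighbors always share a locality
    print(snn(seqs, [1, 2, 1, 2]))   # nearest neighbors never share a locality
    print(snn(["AA", "AT", "TA"], [1, 1, 2]))  # individual 1 has two tied neighbors
```

For these toy inputs the three calls give 1.0, 0.0, and 0.5. The last case shows how ties are handled: the first sequence has two nearest neighbors, one from each locality, so its X_{k} is 1/2.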
To assess the power of permutation tests using S_{nn} to detect geographic differentiation, the same symmetric two-island model considered by Hudson et al. (1992) was used. The parameters of this model are N, the island population size; u, the neutral mutation rate; c, the recombination rate between the ends of the segment sequenced; and m, the migration fraction per generation. An infinite-sites model was assumed (and thus no recurrent mutations occur in these simulations). The results of these simulations are shown in Tables 1 and 2. Table 1 shows the results for all parameter values and sample sizes considered by Hudson et al. (1992). In Table 2, more results for small sample sizes are given. For comparison, the power of the permutation tests based on the chi-square statistic (χ^{2}) and on K_{S}*, Z*, and H_{S} is also shown in the tables. The statistics K_{S}*, Z*, and H_{S} were the most powerful sequence-based statistics found by Hudson et al. (1992).
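A permutation test based on S_{nn} can be sketched as follows. The details here are assumptions made for illustration and are not specified in this note: locality labels are shuffled among individuals with the sample sizes held fixed, the test is one-sided (large S_{nn} suggests differentiation), and the P value is estimated as the fraction of shuffled data sets whose S_{nn} is at least as large as the observed value. The function names, the default of 1000 permutations, and the seed are all hypothetical.

```python
import random
from typing import Hashable, List, Sequence


def snn_from_distances(dist: List[List[int]], localities: Sequence[Hashable]) -> float:
    """S_nn computed from a precomputed matrix of pairwise differences."""
    n = len(localities)
    total = 0.0
    for k in range(n):
        others = [j for j in range(n) if j != k]
        m_k = min(dist[k][j] for j in others)
        nearest = [j for j in others if dist[k][j] == m_k]
        same = sum(localities[j] == localities[k] for j in nearest)
        total += same / len(nearest)
    return total / n


def permutation_p_value(seqs: Sequence[str], localities: Sequence[Hashable],
                        n_perm: int = 1000, seed: int = 0) -> float:
    """Estimated one-sided P value for S_nn (illustrative sketch)."""
    n = len(seqs)
    # Pairwise differences do not change under relabeling, so compute them once.
    dist = [[sum(a != b for a, b in zip(seqs[i], seqs[j])) for j in range(n)]
            for i in range(n)]
    observed = snn_from_distances(dist, localities)
    rng = random.Random(seed)
    labels = list(localities)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign localities, keeping sample sizes fixed
        # Small tolerance guards against floating-point rounding in the comparison.
        if snn_from_distances(dist, labels) >= observed - 1e-12:
            hits += 1
    return hits / n_perm


# Toy data: two tight clusters of four sequences that differ at every site.
toy_seqs = ["AAAAAA", "AAAAAT", "AAAATA", "AAATAA",
            "GGGGGG", "GGGGGC", "GGGGCG", "GGGCGG"]

if __name__ == "__main__":
    print(permutation_p_value(toy_seqs, [1, 1, 1, 1, 2, 2, 2, 2]))  # clusters match localities
    print(permutation_p_value(toy_seqs, [1, 2, 1, 2, 1, 2, 1, 2]))  # clusters split across localities
```

With the toy data, the first call should give a small P value, since only 2 of the 70 possible labelings reproduce S_{nn} = 1, and the second a large one. Power, as reported in the tables, would then be the fraction of simulated data sets for which such a test rejects at a chosen significance level.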
In Table 1, we find that S_{nn} has equal or higher power than the χ^{2} statistic in all cases except one. (The exception is the first case in Table 1 in which the power of S_{nn} was 0.77 while the power of χ^{2} was 0.78, a very small difference.) For most cases in this table, S_{nn} has equal or only slightly higher power than the test based on χ^{2}. However, in cases with small sample sizes (n_{1} = n_{2} = 10 or 15), especially with recombination, there is substantially higher power with the S_{nn} statistic. (In the case with n_{1} = n_{2} = 10 and 4Nc = 20, the power with S_{nn} is 0.46, while the power with χ^{2} is 0.21.) These results motivated us to look at more cases with small sample size, which are shown in Table 2.
For samples of size 6 from each locality, the S_{nn} statistic is substantially more powerful than the χ^{2} statistic at all levels of variation examined (see Table 2). For samples of size 10 from each locality, S_{nn} is only slightly more powerful than χ^{2} at low levels of variation, but at higher levels of variation, S_{nn} has very much higher power than χ^{2}. In contrast to the chi-square statistic, higher mutation rates (longer sequences) always lead to more power using the nearest-neighbor statistic, which accords with the intuition that longer sequences should provide more information. With low to moderate levels of variation, the S_{nn} statistic is more powerful than the sequence-based statistics of Hudson et al. (1992). However, with the small sample sizes considered in Table 2, it appears that K_{S}* and Z* may have slightly higher power than S_{nn} when levels of variation are very high. (See the case n_{1} = n_{2} = 6 and 4Nu = 4Nc = 10.)
Summarizing, we find that among the statistics tested, S_{nn} is the most powerful statistic, or nearly as powerful as the best statistic, under all conditions examined. It should be emphasized, however, that all assessments of power were carried out with a symmetric two-island model and assuming that mutations occur according to an infinite-sites model (in which no multiple hits occur). Other models may lead to different conclusions. The use of S_{nn} eliminates the need to establish criteria for when to use an “allele” frequency-based statistic and when to use a sequence-based statistic. This statistic may also be of use in testing for genetic differences between samples in case-control studies, though these usually consist of diploid data in large samples.
Source code (in the language C) for a program that carries out the test on Unix or Linux machines is in a file, snn.c, available at http://home.uchicago.edu/~rhudson1.
Footnotes

Communicating editor: M. Slatkin
 Received February 15, 2000.
 Accepted May 1, 2000.
 Copyright © 2000 by the Genetics Society of America