# Statistical Tests of the Coalescent Model Based on the Haplotype Frequency Distribution and the Number of Segregating Sites

- Hideki Innan*,
- Kangyu Zhang
^{†}, - Paul Marjoram
^{‡}, - Simon Tavaré
^{†}^{§}and - Noah A. Rosenberg
^{†},^{1}

^{*}Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, Texas 77030^{†}Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089-1113^{‡}Department of Preventive Medicine, University of Southern California, Los Angeles, California 90089-9011^{§}Department of Oncology, University of Cambridge, Hutchison/MRC Research Centre, Cambridge CB2 2XZ, United Kingdom

- 1
*Corresponding author:*Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089-1113. E-mail: noahr{at}usc.edu

## Abstract

Several tests of neutral evolution employ the observed number of segregating sites and properties of the haplotype frequency distribution as summary statistics and use simulations to obtain rejection probabilities. Here we develop a “haplotype configuration test” of neutrality (HCT) based on the full haplotype frequency distribution. To enable exact computation of rejection probabilities for small samples, we derive a recursion under the standard coalescent model for the joint distribution of the haplotype frequencies and the number of segregating sites. For larger samples, we consider simulation-based approaches. The utility of the HCT is demonstrated in simulations of alternative models and in application to data from *Drosophila melanogaster*.

SELECTIVELY neutral models of within-species evolution consist of a model that describes the genealogy of sampled DNA sequences and a model for the stochastic process of mutation along the branches of the genealogy. Typical neutral models use the coalescent process (Nordborg 2001, for example) to describe the genealogy and the infinitely many sites model (Watterson, 1975) for the mutation process. Many theoretical predictions have been made under the standard neutral model, in which the particular coalescent model chosen is the one with constant population size.

As an alternative to computationally intensive comparisons of likelihoods of DNA sequence data under null and alternative models (Griffiths and Tavaré 1994; Kuhner *et al.* 1998; Thomson *et al.* 2000), summaries of variation in a sample of sequences are often used in testing neutral models (Kreitman 2000; Nielsen 2001; Ford 2002). Test statistics computed from the data are compared to theory-based predictions. If such predictions are unavailable or intractable, hypothesis testing is performed using simulations of the appropriate model.

Several neutrality tests use summary statistics based on the frequency distribution of haplotypes in a sample of DNA sequences from a particular region of a genome (Table 1). These “haplotype tests” are sometimes used to detect positive selection on particular haplotypes (Hudson *et al.* 1994). Alternatively, because haplotype frequency distributions may differ greatly across demographic scenarios (Nei *et al.* 1975; Donnelly 1996), haplotype tests can also help to identify deviations from the demographic assumptions of the standard neutral model.

Some of the first haplotype tests, such as the Ewens-Watterson homozygosity test, were based on the Ewens (1972) sampling theory for the infinitely many alleles mutation model. Because “allele” in this model and “haplotype” in the infinitely many sites model have the same meaning, the Ewens (1972) theory provides the conditional distribution of the haplotype frequency vector **C** given the sample size *n* and the number of distinct haplotypes *K*, under the standard neutral model (Tavaré and Ewens 1998, Equation 6). In the Ewens-Watterson test, haplotype homozygosity (*F*) is estimated for a sample of DNA sequences at a nonrecombining locus as the sum of the squares of observed haplotype relative frequencies (Watterson 1977, 1978). Given *n*, the value of *F* is compared to the known null distribution, and the neutral model is rejected if *F* is unusually high or low. A subsequent “exact test” was based on whether **C** itself was unlikely given *n* and *K* (Slatkin 1994, 1996).

Other, more recently devised, haplotype-based tests reject the null hypothesis when the haplotype frequency distribution is unlikely under the same neutral model, given the number of mutations, or segregating sites (*S*), observed in the data (these tests are also conditioned on a mutation parameter θ, which for now we assume to be known). The adoption of a genealogical perspective has contributed to this shift in emphasis; in neutral models, conditional on the total length (*L*) of the genealogical tree that underlies the data, 𝔼[*S*|*L*] is proportional to *L* (Hudson 1990, for example). Thus, conditioning on *S* loosely serves as a proxy for conditioning on tree length.

Three main test statistics have been used in haplotype-based tests conditioned on *S* (Hudson *et al.* 1994; Depaulis and Veuille 1998): the absolute frequency of the most frequent haplotype (*M*), the number of distinct haplotypes (*K*), and the haplotype diversity (*H*). For each test statistic, many approaches are possible for implementing the test. We term tests based on *M*, *K*, and *H* the Hudson *et al.* haplotype test (HHT), haplotype number test (HNT), and haplotype diversity test (HDT), respectively, allowing each name to apply to the collection of implementations of the relevant test.

Here we develop a “haplotype configuration test” of neutrality (HCT) conditional on *S*. This test is analogous to the exact test of Slatkin (1994, 1996), which does not take *S* into account. The HCT tests if the haplotype frequency vector **C** is an unlikely configuration, given *S* (and the mutation parameter θ). The haplotype configuration **C** conveniently summarizes the pattern of variation among DNA sequences; it perhaps incorporates more information about the data than do *M*, *K*, and *H*, but its use is less cumbersome than full-likelihood inference based on the DNA sequences themselves.

First, we derive a recursion for the joint distribution ℙ(**C**, *S*|*n*, θ), and we show how it can be used to obtain the conditional distribution ℙ(**C**|*S*, *n*, θ). This recursively calculated conditional distribution allows exact computation of rejection probabilities for the HCT for small values of *S* and *n*. Because *M*, *K*, and *H* are calculated from **C**, the recursion can also enable exact computation of rejection probabilities for the HHT, HNT, and HDT. For large *S* and *n*, use of the recursion is slow, and we discuss simulation techniques for all four tests. We describe simulation-based extensions that allow for more complex null hypotheses, which may include such phenomena as migration and recombination. We also consider various ways for addressing the problem that θ might not be “known”; in particular we argue that the effect of θ on haplotype tests can be addressed by using prior information about θ from genomic surveys. Finally, we consider the power of the HCT, HHT, HNT, and HDT against alternative models, and we apply the tests to an example from *Drosophila melanogaster*. The mathematical notation used here is shown in Table 2.

## THEORY

### Joint distribution of C and *S:*

We abbreviate ℙ(**C** = **c**, *S* = *s*|*n*, θ) by *q*(**c**, *s*|*n*, θ). This provides the probability that for *n* lineages and mutation parameter θ the haplotype configuration is **c** = (*c*_{1}, *c*_{2}, … , *c _{n}*) and

*s*segregating sites are observed, where

*c*is the number of haplotypes with absolute frequency

_{k}*k*. We also abbreviate ℙ(

**C**=

**c**|

*S*=

*s*,

*n*, θ) by

*q*(

**c**|

*s*,

*n*, θ) and ℙ(

**C**=

**c**|

*n*, θ) by

*q*(

**c**|

*n*, θ). We wish to express

*q*(

**c**,

*s*|

*n*, θ) in terms of probabilities for states one event backward in time, where an “event” is either a coalescence of two lineages or a mutation. Using the standard neutral coalescent model with population size

*N*(Nordborg 2001, for example) and an infinitely many sites mutation model with mutation parameter θ, time is scaled in units of

*N*generations so that the waiting time until the most recent coalescence is exponentially distributed with rate

*n*(

*n*− 1)/2 and so that for each lineage, the waiting time until a mutation is exponentially distributed with rate θ/2. Because

*n*lineages are present and because mutations on different lineages occur independently of one another, the waiting time (backward in time) until a mutation happens on any lineage is exponentially distributed with rate

*n*θ/2. The probability that the most recent event is a coalescence is (

*n*− 1)/(θ +

*n*− 1), and the probability that it is a mutation is θ/(θ +

*n*− 1), regardless of the time of the event. Suppose the event is a coalescence (a branching, if viewed forward in time). Then for each of

*k*= 1, 2, … ,

*n*− 1, it is possible that a haplotype of absolute frequency

*k*+ 1 was generated by branching from a haplotype that had frequency

*k*in the previous step. The probability that the event produces a haplotype of frequency

*k*+ 1 is the fraction of lineages whose haplotypes had frequency

*k*in the previous step, or

*k*(

*c*+ 1)/(

_{k}*n*− 1).

If the event is a mutation, with probability *c*_{1}/*n* the mutation occurs along a lineage of absolute frequency one, leaving the haplotype configuration unchanged. For each of *k* = 2, 3, … , *n*, it is also possible that a lineage whose haplotype has frequency *k* experiences the mutation. In this situation, which has probability *k*(*c _{k}* + 1)/

*n*, a haplotype of frequency

*k*is replaced by a singleton haplotype and a haplotype of frequency

*k*− 1.

These cases are combined to produce the recursion: 1In (1), **e*** _{k}* is the

*n*-dimensional vector with the

*k*th coordinate equal to 1 and all other coordinates equal to 0. For convenience, we treat all vectors as having length

*n*, appending extra zeros to the ends of vectors of smaller lengths. We also specify the following rules: 234567Induction using (1) with (2)–(7) can be used to derive various well-known results, as well as new expressions for

*n*= 3 and

*n*= 4 (Table 3). For larger

*n*, the expression becomes unwieldy and numerical evaluation of (1) at particular θ values is preferable (Table 4).

Summing (1) from *s* = 0 to ∞ and rearranging terms produces the following recursion for *q*(**c**|*n*, θ): 8Conditions associated with (8) are obtained from sums of corresponding conditions for (1) from *s* = 0 to ∞: 91011It can be shown that (8) is equivalent to a recursion satisfied by the Ewens (1972) sampling formula and used in its proof (Karlin and McGregor 1972, Equation 9). Thus, the solution to (8) with the initial condition (9) is the Ewens sampling formula (Ewens 1972; Tavaré and Ewens 1998, Equation 3): 12where θ_{(}_{n}_{)} = θ(θ + 1) · · · (θ + *n* − 1), *c*_{1}, *c*_{2}, … , *c _{n}* are nonnegative integers, and . It is straightforward to verify that (12) indeed satisfies (8).

### Joint distribution of univariate haplotype summary statistics and *S*:

The statistics *M*, *K*, and *H* are functions of the frequency vector **C** as follows: 131415For a given *n*, only finitely many haplotype configurations are possible; thus, *M*, *K*, and *H* each have a finite range. The probability that one of these statistics, *G*, equals a value *g*, is obtained by summing *q*(**c**, *s*|*n*, θ) over the configurations that produce the value *g*: 16

### Conditioning on *S*:

The conditional probability of **C**, given *S*, *n*, and θ, is the quotient of the joint probability of **C** and *S* and the probability of observing *S* segregating sites: 17The denominator is the sum of the numerator over all configurations and equals 18(Tavaré 1984, Equation 9.5). Equations similar to (17) apply for *M*, *K*, and *H*; if *G* represents one of these statistics, then 19

## STATISTICAL TESTS

The exact computation of *q*(**c**|*s*, *n*, θ) suggests the following haplotype configuration test (HCT).

*Procedure 1—exact implementation of the haplotype configuration test:*

Under the null hypothesis of neutrality, use (17) to compute the probabilities of all haplotype configurations, given

*s*,*n*, and the known θ.Sum the probabilities of all haplotype configurations whose probabilities under the null model are less than or equal to the probability of the observed configuration

**c**. That is, compute 20Reject the null hypothesis at level α if

*P*≤ α. Two-tailed haplotype tests based on univariate summary statistics can be implemented in a similar manner (one-tailed versions of these tests are also possible).

*Procedure 2—exact implementation of (two-tailed) haplotype tests based on univariate summary statistics:*

Under the null hypothesis of neutrality, use (19) to compute the probabilities of all possible values of the summary statistic

*G*given*s*,*n*, and the known θ.If

*g*denotes the observed value of*G*in the data, reject the null hypothesis at level α if ℙ(*G*≤*g*) ≤ α/2 or if ℙ(*G*≥*g*) ≤ α/2.

For the tests based on *M*, *K*, and *H*, the null hypothesis is rejected if a particular aspect of the observed haplotype frequency distribution is a rare occurrence. Using the HCT, however, the null hypothesis is rejected when the observed haplotype frequency distribution itself is rare. Rare configurations will typically—but not always—have unlikely values for one or more of *M*, *K*, and *H*.

For three choices of θ, Table 5 shows exact probabilities for **C**, *M*, *K*, and *H*, given *s* = 10 and *n* = 7, obtained numerically from (17) and (19). The probabilities shown provide rejection probabilities for the four tests. In Table 5, certain haplotype configurations may be unlikely, even if their values of *M*, *K*, and *H* are centrally located with respect to null distributions of these statistics. For example, (0, 2, 1, 0, 0, 0, 0) is highly unlikely for *s* = 10 and θ ≈ 4.08: the HCT rejects the null hypothesis at *P* = 0.0285. However, the frequency of the most frequent haplotype (3), the number of haplotypes (3), and the haplotype diversity (0.6531), are all rather ordinary for the given parameter values.

It is also possible that from among the HHT, HNT, and HDT, one or more tests might reject the null hypothesis at smaller α than does the haplotype configuration test. For example, in Table 5 with θ ≈ 4.08, (7, 0, 0, 0, 0, 0, 0) is unusual at the α = 2 × 0.0328 = 0.0656 level for the HHT, HNT, and HDT, but only at α = 0.1113 using the HCT.

Because **C**, *M*, *K*, and *H* can take on only finitely many values for a given *n*, there is some probability that (for example) ℙ(*G* < *g*) ≤ α/2 but ℙ(*G* ≤ *g*) > α/2. In such cases, procedures 1 and 2 do not reject the null hypothesis and thus are slightly conservative. This situation often arises with small samples, for which not many numbers can be possible values of the test statistics. It occurs most often for *M* and *K*, each of which has only *n* possible values. A small-sample correction for the HCT is to append to step 3 of procedure 1:

if *P* > α but *P* − *Q* < α, where 21 reject the null at level α with probability (α − *P* + *Q*)/*Q*.

For the other tests, we can append to step 2 in procedure 2:

if ℙ(*G* ≤ *g*) > α/2 but ℙ(*G* < *g*) < α/2, reject the null hypothesis at level α with probability [α/2 − ℙ(*G* < *g*)]/ℙ(*G* = *g*); if ℙ(*G* ≥ *g*) > α/2 but ℙ(*G* > *g*) < α/2, reject the null hypothesis at level α with probability [α/2 − ℙ(*G* > *g*)]/ℙ(*G* = *g*).

Corrections for one-tailed versions of the HHT, HNT, and HDT are analogous. The conservative procedures 1 and 2 are suitable for data analysis; the small-sample correction is most useful when it is important for a rejection region to have fixed size, as in evaluations of the power to reject the null hypothesis under alternative models.

## SIMULATION-BASED IMPLEMENTATIONS

Thus far, we have assumed that the sample size and number of segregating sites are small enough that numerical iteration of the recursion is feasible. We have also assumed a simple demographic model without recombination and that θ is known exactly. We now discuss implementations of the HCT and the other tests when these ideal conditions do not hold. First, we consider simulation methods, demographic models, and recombination, continuing to assume that θ is known; we then discuss ways in which uncertainty in θ might be incorporated.

### Methods for simulation:

Earlier articles on haplotype tests have described various simulation-based implementations. These algorithms will be generally applicable to the HCT as well. Although exact computation from (17) and (19) is appropriate for small *s* and *n*, simulation is necessary as *s* and *n* increase.

The procedures that simulate from the correct conditional distribution ℙ(**C**|*S*, *n*, θ) might be classified as Markov chain Monte Carlo methods (Markovtsova *et al.* 2000, 2001), importance sampling methods (Depaulis *et al.* 2001), and acceptance-rejection algorithms (Tavaré *et al.* 1997; Wall and Hudson 2001). A simple and often highly efficient approach is the following version of algorithm 1 of Tavaré *et al.* (1997).

*Procedure 3—acceptance-rejection algorithm for generating samples from* ℙ(**C**|*S*, *n*, θ)*:*

Simulate the coalescence times

*W*,_{n}*W*_{n}_{−1}, … ,*W*_{2}as independent exponentially distributed random variables, with*W*∼ exp[_{j}*j*(*j*− 1)/2].Compute the total branch length of the resulting genealogical tree, .

Accept the simulated collection of

*W*values with probability_{j}*u*, where 22Otherwise, discard the*W*and return to the initial step._{j}Simulate the branching structure of the genealogy by randomly joining lineages until one lineage remains and associate the

*W*with corresponding branching events._{j}Independently place

*S*mutations uniformly on the genealogy.Record the haplotype configuration

**C**(and the quantities*M*,*K*, and*H*).

Procedure 3 is then repeated until a prespecified number of genealogies have been accepted. The empirical distribution of **C** is then used in procedure 1 to decide whether or not to reject the null hypothesis. To implement the HHT, HNT, and HDT, the empirical distributions of *M*, *K*, and *H* are used in the final step of procedure 2.

The efficiency of procedure 3 derives from the fact that it accepts or rejects the simulation before placing *S* mutations on the branching diagram. It is less efficient to place mutations on the genealogy with a Poisson process with mean *L*θ/2 and only afterward accept trees that have accumulated exactly *S* mutations. The denominator of (22) is the maximum of the numerator over all values of *L*, ensuring that acceptance will occur reasonably often, except if *S* is much larger or much smaller than is suggested by the value of θ (Tavaré *et al.* 1997).

It is noteworthy that even with the efficient algorithm of procedure 3, for large *n*, due to the large number of possible configurations, the HCT has computational limitations not shared by the other tests. For a given sample size *n*, the number of possible values of the test statistics for the HHT or HNT is only *n*. For the HDT, up to a linear transformation, the set of possible values of the statistic is equal to the set of numbers that can equal the sum of the squares of the elements of a partition of *n* into positive integers (Sloane 2005, entry A069999). The configurations that produce the smallest and largest sums of squares, which equal *n* and *n*^{2}, respectively, are (*n*, 0, … , 0) and (0, 0, … , 0, 1). Because the parity of the sum of squares in any partition of *n* must be the same as that of *n*, (*n*^{2} − *n* + 2)/2 provides an upper bound on the number of possible values of the HDT statistic. Even for sample sizes that are presently considered large, using a large number of accepted simulated genealogies (exceeding the number of possible values of the test statistic by a factor of at least 100), it is feasible to approximate the probabilities of all possible values of *M*, *K*, and *H*.

Unlike the other tests, however, the HCT has the form of an exact test; such tests are characterized by enumeration of the probabilities of all possible data sets under the model and summation of the probabilities of all data sets that are at most as probable as the observed data set (Mehta and Patel 1998). For these tests, the number of possible data sets increases very rapidly with some property of the data (such as the number of alleles at the locus, in exact tests of Hardy-Weinberg proportions). For the HCT, the number of haplotype configurations, equivalent to the number of unordered partitions of *n* into positive integers, *p*(*n*), grows quickly with *n* (Abramowitz and Stegun 1965, p. 836). While *n* = 10 has only 42 partitions, *n* = 50 has 204,226, and there are ∼4 × 10^{12} for *n* = 200. Because enumeration of the probabilities of all configurations is impossible for large samples, the HCT is limited to sample sizes *n* for which it is feasible to generate at least 100*p*(*n*)samples from ℙ(**C**|*S*, *n*, θ). As Markov chain Monte Carlo algorithms have been developed for other exact tests in genetics to avoid the enumeration problem (Guo and Thompson 1992; Raymond and Rousset 1995), it is conceivable that such an algorithm might be developed for this test as well.

### Demographic models:

The approach in procedure 3 is versatile, in that the null model need not be the standard constant-size coalescent. A more complex demographic model can be accommodated by substituting its distributions for the waiting times and the branching structure in steps 1 and 4, respectively.

For example, in place of a population with constant size *N*, consider an exponentially growing population of present population size *N* with growth parameter β. At *t* time units of *N* generations in the past, population size was *N* exp[−β*t*]. To use this model as the null, after step 1, for each *j*, replace *W _{j}* by

*f*(

*W*), where 23(Slatkin and Hudson 1991; Nordborg 2001, Equation 8), and for

_{j}*j*∈ {2, 3, … ,

*n*− 1}, 24For each

*j*,

*f*(

*W*) gives the time to coalescence of

_{j}*j*lineages to

*j*− 1, measured in units of

*N*generations. Because variable population size does not change the branching structure of the genealogy, the rest of procedure 3 is identical under exponential population growth.

With a population structure model, such as island migration, as the null, it is simpler to simulate the appropriate waiting times and branching structure concurrently in place of step 1 (Hudson 1990), omitting step 4.

### Recombination:

Using the ancestral recombination graph (Hudson 1983; Griffiths and Marjoram 1996; Nordborg 2001), procedure 3 can be extended to allow recombination in the null model. Suppose the recombination parameter for a DNA sequence region is ρ = 2*Nr*, where *l* is the number of base pairs in the region and *r* is *l* times the recombination rate per base pair per generation. Simulation of the ancestral recombination graph for *n* lineages consists in repeatedly simulating an exponentially distributed random variable for the time of the next coalescence (appropriately transformed if the model includes population growth) and another exponential random variable for the time of the next recombination. The smaller of the two times gives the type of the next event, which is then allowed to occur, the larger time is discarded, and uniform random variables are simulated to decide which lineages participate in the event. From the graph, a genealogical tree is produced at each base pair, so that nearby base pairs are more likely than distant base pairs to have equivalent trees (Rosenberg and Nordborg 2002, Figure 3, for example). The graph splits the sequence into *I* segments with lengths *l*_{1}, … , *l _{I}*, such that all base pairs within a segment have equivalent genealogies, and such that recombination events occur only at segment boundaries. Denote the total branch length of the genealogy of segment

*i*by

*L*.

_{i}*Procedure 4—acceptance-rejection algorithm for generating samples with recombination from* ℙ**(C**|*S*, *n*, θ, ρ)*:*

Simulate an ancestral recombination graph for

*n*lineages with recombination rate ρ, until all parts of the DNA sequence reach most recent common ancestors.Compute the total branch length of the genealogical tree at an “average” site, .

Accept the simulated graph with probability

*u*, where 25Otherwise, discard the graph and return to the initial step.Independently place

*S*mutations uniformly on the genealogy.Record the haplotype configuration

**C**(and the quantities*M*,*K*, and*H*).

This procedure enables haplotype tests to be performed conditional on a known recombination parameter as well as on the known mutation parameter. In the computer program we have implemented, which proceeds more slowly than the algorithm in procedure 4, placement of mutations occurs concurrently with the simulation of the graph in step 1. In place of step 3, the graph is accepted if *S* mutations are placed, and steps 2 and 4 are omitted.

### Treatment of θ:

Because θ does not affect the conditional distribution of **C** given the sample size and the number of haplotypes, its value is not of concern in neutrality tests based on the Ewens (1972) sampling theory. However, θ does affect ℙ(**C**|*S*, *n*, θ) and analogous distributions for *M*, *K*, and *H*. As an example, in Table 5, for θ = 1, the most common and rarest configurations are (2, 1, 1, 0, 0, 0, 0) and (7, 0, 0, 0, 0, 0, 0), respectively, while (5, 1, 0, 0, 0, 0, 0) is most common and (0, 0, 1, 1, 0, 0, 0) is rarest for θ = 10. Thus, as has been reported for other neutrality tests (Fu 1996, 1998; Markovtsova *et al.* 2001; Wall and Hudson 2001), it might be expected that the value of θ used can affect rejection probabilities for the HCT, as well as for the HHT, HNT, and HDT.

Table 5 indicates that nominal and actual rejection probabilities for the HCT can differ substantially if the value of θ used is a poor estimate. For example, suppose that the true θ equals 1 and that the configuration (1, 0, 2, 0, 0, 0, 0) is observed with *s* = 10. The actual HCT rejection probability is *P* = 0.1748. If no prior knowledge about θ was available, a sensible procedure would be to base *P*-values on a value of θ estimated using an estimator such as that of Watterson (1975),

26Equation 26 gives θ̂_{W} ≈ 4.08; had this estimate been used, it would have been concluded that the observed configuration is significantly unlikely at α = 0.05 (*P* ≈ 0.0477; see the middle part of Table 5).

This variation in *P*-values as θ fluctuates suggests that for tests that depend on θ, the standard method for obtaining rejection probabilities when they vary with the unknown value of a nuisance parameter—using the maximal rejection probability across possible values of the parameter (Berger and Boos 1994)—is overly conservative. Indeed, in the example of *s* = 10 and *n* = 7, for each of the 14 haplotype configurations possible with *s* > 0, there is *some* value of θ for which the configuration is reasonably likely (Figure 1A). In this example, if the largest rejection probability across all values of θ were used, for the HCT, no configuration would lead to rejection at α = 0.05, or even at α = 0.2 (Figure 1B).

Because point estimates of θ generally have large variances (Felsenstein 1992; Fu and Li 1993), a strategy of maximizing the rejection probability only over a confidence set for θ (Berger and Boos 1994), and not over the full range (0, ∞), is also likely to be quite conservative. This is especially true as *P*-values can vary considerably in the vicinity of a point estimate (consider Figure 1B at log_{10}θ̂_{W} ≈ 0.61). Thus, an alternative method is to substitute a distribution of values for θ in a Bayesian procedure (Kelly 1997; Fu 1998; Depaulis *et al.* 2001). In this scheme, a prior probability density, *f*_{prior}(θ), is chosen, and rejection probabilities are evaluated using the density that is proportional to ℙ(**C**|*S*, *n*, θ)*f*_{prior}(θ). To implement this approach, we can add a step 0 to procedure 3: Under exponential growth or recombination, β and ρ can also be chosen from priors, so that evaluation of rejection probabilities is performed conditional on the prior rather than on a fixed growth or recombination rate. Using a uniform prior on [0, 20] for θ (with no growth or recombination), we implemented procedure 3 with *s* = 10 and *n* = 7 (Table 6). The distribution for the Bayesian approach is similar to that for θ̂ ≈ 4.08; for the example configuration (1, 0, 2, 0, 0, 0, 0), the HCT rejection probability is 0.0403, close to the value obtained using the point estimate. Other priors, such as uniform distributions on [0, 100] or [0, 500], produce similar results (not shown).

An alternate approach to uncertainty in θ is a simulation strategy that does not explicitly use θ. In this procedure (Hudson 1993; Hudson *et al.* 1994), equivalent to assuming an implicit density, *f*_{imp}(θ|*S*, *n*), and simulating from a density proportional to ℙ(**C**|*S*, *n*, θ)*f*_{imp}(θ|*S*, *n*), the waiting times and branching structures of genealogies are simulated under the neutral coalescent. On each genealogy, *S* mutations are placed, uniformly and independently. The null distribution of **C** is taken as the empirical distribution of its values for the simulated trees. If the estimate θ̂_{W} is close to the true value θ, this procedure produces similar rejection probabilities to both the fixed-θ approach that assumes θ = θ̂_{W} and the Bayesian procedure in the previous paragraph. However, if θ̂_{W} and θ differ substantially, the conditional distributions of test statistics by this kind of fixed-*S* simulation may also be quite different from their conditional distributions given *S* and θ (Markovtsova *et al.* 2001; Wall and Hudson 2001). This observation is supported also by our simulation with 10 mutations and *n* = 7 (Table 6). The probability distribution for fixed-*S* simulation is quite similar to the exact distribution shown in Table 5 with θ = θ̂_{W}, and the HCT rejection probability for the example configuration (1, 0, 2, 0, 0, 0, 0) is 0.0520.

Tables 5 and 6 and Figure 1 show that the effect of θ on haplotype tests is not negligible, and that erroneous conclusions might be reached if the observed number of segregating sites is not close to expectation. The problem is not fixed by using a distribution of values for θ estimated from the same data that are to be analyzed using the tests, as such a distribution will likely produce results similar to those obtained with a point estimate. However, the use of diverse regions spread across a genome decreases the variance of an estimate of θ dramatically (Innan *et al.* 2003). Thus, with genomic polymorphism data, θ can be estimated at many loci independent of the region of interest, and the genomic estimate can be treated as the known value of θ. A collection of values spread around a genomic point estimate might also be used in a Bayesian procedure, although this approach will probably not differ greatly in outcome from use of only the point estimate.

The application of a genomically estimated θ requires the assumption of a constant value of θ across the genome. However, variables such as GC-content and recombination rate may lead to considerable variation in θ (Begun and Aquadro 1992, for example). In such cases, the “known” θ could be estimated within classes of regions that have similar GC-content, levels of recombination, or values of other quantities that influence θ.

## POWER

We investigated the power of the four haplotype tests to detect deviations from the standard coalescent model. For various choices of *s* and *n*, we used procedure 3 with θ̂_{W} for θ to obtain the distributions of **C**, M, K, and H under the null model. For given choices of *s* and *n*, the empirical null distributions of **C**, *M*, *K*, and *H* were based on a set of 10^{6} accepted simulations; for *n* ≤ 30, this number was found sufficient to ensure that the distributions were accurately estimated. The rejection regions for the tests were defined using procedures 1 and 2, employing simulated probabilities in place of the exact probabilities and using the correction for small sample size: haplotypes on the rejection region boundary for the chosen significance level were assigned an appropriate rejection probability in (0, 1), and all other haplotypes were placed either inside or outside the rejection region.

For each choice of *s* and *n* and each alternative model, the power to reject the null for significance level α was equal to the fraction of 10^{5} simulations of the alternative whose haplotypes lay in the α-rejection region. Note that in contrast to the simulations of the null model, which were used to simultaneously estimate a large number of quantities, namely the probabilities of all possible haplotype frequencies or values of a test statistic, simulations of the alternative were used only to estimate what fraction of replicates lay in the rejection region. Thus, in comparison with the null model, to obtain repeatable results, the alternative model required fewer replicates.

Simulations of the alternative model require use of a value of θ. As discussed earlier, when the null hypothesis is true, a sensible choice is to use a value of θ estimated by assuming that the null is true—for example, θ̂_{W}. Thus, when the alternative hypothesis is true, a sensible choice of θ on which to condition the simulations is a value that would have been estimated assuming that the alternative was true. For a general model in which the expected total branch length of a genealogical tree is 𝔼[*L*], the generalized expression analogous to θ̂_{W} is 2*S*/𝔼[*L*].

In models with recombination, the recombination rate affects the variation across sites in branch lengths of genealogical trees but not the expected total branch length of a randomly chosen site (Pluzhnikov and Donnelly 1996). Thus, the expected total branch length is the same as in the absence of recombination, and consequently we used θ̂_{W} for the simulations of the alternative model.

For exponential population growth and two-population symmetric island migration models, however, the branch lengths of genealogies differ from those of the standard model. Our simulations of the alternative model employed 2*s*/ for θ, where was the mean total branch length in 10^{5} simulations of the model. Note that this choice makes the value of θ at which tests were performed dependent on the parameters of the alternative model. Once the value of θ was selected, an independent set of simulations was used for the estimation of power.

### Recombination:

If no recombination takes place, the strong correlation of genotypes at neighboring sites makes it fairly likely that if two DNA sequences both contain a mutation at a particular site, they will also share the same haplotype. Thus, many individuals may have the same haplotype, and configurations with high *M*, low *K*, and low *H* are common.

Recombination decouples neighboring sites, so that for large recombination rates, neighboring genotypes are nearly independent. Thus, two sequences with the same genotype at one site will have the same haplotype only if at each of the *s* − 1 remaining sites they share the same genotype—an unlikely event whose chance of occurring is the product of *s* − 1 probabilities. Consequently, given *s* and *n*, as recombination rate increases, *M* decreases, *K* and *H* increase, and **C** tends toward configurations that contain many haplotypes of frequency 1, such as *s***e**_{1} + **e**_{n}_{−}* _{s}* for small

*s*and

*n*

**e**

_{1}for large

*s*.

For relatively large recombination rates, all tests were able to reject the null model in favor of the recombination model, with power greater in the scenarios with smaller *s* (Figure 2). For small *s*, power was comparable for the four tests, whereas for intermediate and large *s*, the HCT had poorer relative performance than the other tests. The HNT performed particularly well, in agreement with previous observations about the informativeness of *K* about the recombination rate (Schierup and Hein 2000; Wall 2000; Myers and Griffiths 2003).

### Exponential population growth:

Because coalescences are more probable in smaller populations, by increasing the population size in recent generations compared with that of ancient generations, exponential population growth makes it more likely that lineages will persist into the distant past without coalescing. Thus, growth increases lengths of terminal branches of coalescent trees in comparison with those of internal branches. Mutations on terminal branches create single-copy haplotypes, so that growth increases the frequencies of configurations with low *M*, high *K*, and high *H*. As in the recombination case, the configurations *s***e**_{1} + **e**_{n}_{−}* _{s}* and

*n*

**e**

_{1}become probable for small and large

*s*, respectively.

In the simulations with exponential growth, the HCT had comparable power to the HNT for small and intermediate *s* (Figure 2). The HHT and HDT performed rather poorly for small *s*, improving as *s* increased. As in the recombination simulations, the HNT was the most generally powerful test. Although some exceptions were observed (for example, with *s* = 10), similar results were usually obtained when the null hypothesis included recombination (Figure 3).

### Island migration:

With a large migration rate, the behavior of the two-population island migration model approaches that of the standard neutral model (Nordborg and Krone 2002, for example). Thus, haplotype frequencies for large migration rates will be similar to those under the null model. As the migration rate decreases, however, lineages coalesce separately in the two populations, with no intervening migrations. The time until one of these lineages migrates to the other population is long, so that genealogies have two long internal branches. These branches contain most mutations, leading to configurations with high *M*, low *K*, and low *H*.

The simulations produced reasonable power with low migration rates and low power with high migration rates (Figure 2). The HCT and HNT had comparable power for intermediate and large *s*. Similarly to the population growth simulations, the HHT and HDT performed poorly for small *s*, with similar results when recombination was included in the null hypothesis (Figure 3). As in both the recombination and the growth scenarios, the HNT was the most generally powerful test for small *s*.

## APPLICATION TO DATA

In a data set for the *Sod* locus in *D. melanogaster* (Hudson *et al.* 1994), 55 segregating sites were observed in a sample of size 10, with **C** = (5, 0, 0, 0, 1, 0, 0, 0, 0, 0). Thus, *M* = 5, *K* = 6, and *H* = 0.7. Using (26), θ̂_{W} ≈ 19.44. To demonstrate the application of the four haplotype tests, we used procedures 3 and 4 (appropriately modified in the case of exponential population growth) conditional on point estimates for θ (Table 7). For β = 0, the estimate of θ employed (26); for β > 0, the estimate was obtained using simulations, as described in the previous section.

As was observed by Hudson *et al.* (1994), *M* = 5 is highly unusual under the standard neutral model, as well as under models that include small amounts of population growth and recombination. In each case, the HHT and HDT reject the null hypothesis at very low *P*-values. The HCT is significant below the 5% level in all cases, while the two-tailed HNT is not significant at the 5% level for low levels of growth and recombination.

## CONCLUSIONS

We have introduced the haplotype configuration test of neutrality, which rejects the null hypothesis if the configuration itself is unlikely given *S*, *n*, and θ. We have also developed a recursion that allows haplotype tests to be applied exactly for small samples; for larger samples, efficient acceptance-rejection algorithms can be implemented.

Testing the haplotype frequency distribution for anomalies can be viewed as similar to testing if the “site frequency spectrum,” or the distribution of allele frequencies for segregating sites, fits the standard neutral model (Tajima 1989). For the site frequency spectrum, an excess of rare variants can reflect population growth or population structure, whereas an excess of common variants can reflect positive selection or balancing selection. Similarly, conditional on the number of segregating sites, an excess of rare haplotypes may reflect population growth or recombination, while an excess of common haplotypes may reflect population structure or positive selection. As is done for tests based on the site frequency spectrum, the HCT, HHT, HNT, and HDT can be used to scan genomes for atypical regions: for the haplotype tests, to accommodate variability across regions in the number of segregating sites, outlying regions can be identified as those with extreme rejection probabilities (rather than extreme values of the test statistic).

While the HCT perhaps takes into account more information about the data than do the HHT, HNT, and HDT, the various tests reject the null hypothesis under different conditions. The HCT is designed to detect general deviation from the predicted haplotype frequency distribution; although this test may not be optimal for specific alternative scenarios, it may have the potential to identify more diverse departures from the null hypothesis than can be detected with the univariate statistics. Some alternative hypotheses, such as multiallelic balancing selection or positive selection for different haplotypes across subgroups of a structured population, might be better suited to the HCT, as they may be unlikely to produce anomalous values of *M*, *K*, or *H*. For other alternatives, such as positive selection on a single haplotype, univariate statistics such as *M* may be most appropriate. Regardless of which tests are used, however, genomic data sets will perhaps increase the confidence that can be placed in *P*-values for haplotype tests and other neutrality tests, because in many species θ will no longer need to be estimated from the same data on which the tests are being performed.

## Acknowledgments

We thank A. Hirsh, M. Nordborg, M. Przeworski, and J. Wall for comments. A program for implementing the simulation-based haplotype configuration test, *haploconfig*, is available from N.A.R. This work was supported by a grant from the University of Texas to H.I., a Burroughs-Wellcome Fund Career Award in Biomedical Sciences to N.A.R., National Institutes of Health grant GM 58897, and Center of Excellence in Genomic Science grant P50HG002790 from the National Human Genome Research Institute. S.T. is a Royal Society-Wolfson Research Merit Award Holder.

## Footnotes

Communicating editor: N. Takahata

- Received June 18, 2004.
- Accepted December 8, 2004.

- Genetics Society of America