## Abstract

The identification of genomic regions that have been exposed to positive selection is a major challenge in population genetics. Since selective sweeps are expected to occur during environmental changes or when populations are colonizing a new habitat, statistical tests constructed on the assumption of constant population size are biased by the co-occurrence of population size changes and selection. To delimit this problem and gain better insights into demographic factors, theoretical results regarding the second-order moments of segregating sites, such as the variance of segregating sites, have been derived. Driven by emerging genomewide surveys, which allow the estimation of demographic parameters, a generalized version of Tajima's *D* has been derived that takes into account a previously estimated demographic scenario to test single loci for traces of selection against the null hypothesis of neutral evolution under variable population size.

THE analysis and interpretation of patterns of DNA variation is the “main obsession” (Gillespie 2004) of molecular population geneticists. It is of particular interest to determine whether a locus is evolving neutrally or as a target of selection (Hey 1999). Adaptive processes are generally thought to be associated with selective sweeps (Charlesworth 1992). Models predict that selective sweeps result in a local reduction of DNA polymorphism around the target site of selection. However, it is well known that a pattern of reduced polymorphism may also result from demographic changes, particularly from recent population size bottlenecks. Therefore, a reliable identification of the cause of reduced variability is difficult, especially when DNA fragments of only a few loci are examined. Recently, several new methods and statistical tests were developed to analyze polymorphism data that presumably have been produced by a complex evolutionary history, during which selective and demographic forces acted simultaneously. Nielsen *et al.* (2005) developed a composite-likelihood method, which is an extension of the test proposed by Kim and Stephan (2002), where the null hypothesis, rather than being fixed *a priori*, is derived from the pattern of background variation in the data. Li and Stephan (2006) devised a maximum-likelihood method, which allows inference of demographic changes and detection of recent positive selection in populations of varying size. Alternatively, variability patterns at multiple loci may be considered jointly. The reasoning here is that the traces of selective sweeps may be distinguished from those of recent population bottlenecks, since only the former have a local effect, while the latter should have a chromosomewide effect. In a test tailored for multiple microsatellite loci, Wiehe *et al.* (2007) found that this is often the case, but, for a certain (population genetically relevant) range of demographic parameters the false positive rate to detect selective sweeps is still unacceptably high. Other recently popularized methods to detect traces of selection are centered around the analysis of linkage disequilibrium (Kim and Nielsen 2004; Stephan *et al.* 2006; McVean 2007) or the structure and frequency spectrum of haplotypes (Sabeti *et al.* 2002, 2007). In contrast to variability-based methods, the latter require knowledge or inference of multilocus haplotypes for data analysis. While the human HapMap project (International Hapmap Consortium 2005) furnishes such data, they are rarely available in population genetic studies of other species.

Initiated by the seminal article of Kingman (1982), coalescent models have become the textbook standard to study the properties of alleles that are generated by common ancestry (Hein *et al.* 2005). Although it is well known that demographic changes such as population size expansions and bottlenecks, admixture, and population subdivision affect present-day polymorphism patterns, popular statistical tests for deviation from neutral evolution (Tajima 1989b; Fu and Li 1993; Fay and Wu 2000) posit a model of constant population size. At the same time, researchers emphasize that an adequate interpretation of observed polymorphism patterns, for instance in Drosophila, cannot neglect the demographic history of a population (Glinka *et al.* 2003; Haddrill *et al.* 2005). There is a fairly long history of population genetic models considering the effects of variable population size upon genetic variability. For example, Nei *et al.* (1975), Maruyama and Fuerst (1984), Watterson (1984), Tajima (1989a), and Slatkin and Hudson (1991) studied the consequences of population size reductions or expansions. The different scenarios have been consolidated into a more general framework by Griffiths and Tavaré (1994).

Here, we study the standard coalescent approximation to the Wright–Fisher model for a sample of *n* genes, without recombination and with population size varying in time. We first revisit joint densities of waiting times, as defined by Griffiths and Tavaré (1994), and their means (Griffiths and Tavaré 1998). Next, we present new results concerning their second-order moments. We extend the results of Fu (1995), regarding size and type of mutations, to general coalescent trees (Griffiths and Tavaré 1998, 2003) and in particular to coalescent trees under variable population size. On the basis of these results, we establish a generalized version of Tajima's *D* (Tajima 1989b), which is applied to test the polymorphism patterns of single loci from an African and a European population of *Drosophila melanogaster* (Li and Stephan 2006; Hutter *et al.* 2007) against the null hypothesis of variable population size. Innan and Stephan (2000) have addressed the same issue, using coalescent simulations to distinguish the effects of exponential growth and selection in *Arabidopsis thaliana*.

## METHODS AND RESULTS

#### The model:

We measure time backward in units of 2*N* generations, where *N* is the size of a diploid population at time of sampling (*t* = 0), and assume that population size is large in each generation. Furthermore, let λ(*t*) = *N*(*t*)/*N* denote the ratio of the population sizes at time *t* in the past and the present and λ(*t*) be real and piecewise continuous. The population-size intensity function Λ is defined byWe assume that Λ(∞) = ∞, so that a sample of genes has a most recent common ancestor (MRCA) with probability 1. Let *T _{n}*, … ,

*T*

_{2}be the time periods during which the genealogy has

*n*, … , 2 lineages, respectively. The joint density (Griffiths and Tavaré 1994) of (

*T*, … ,

_{n}*T*

_{2}) isfor 0 ≤

*t*, … ,

_{n}*t*

_{2}< ∞, where

*s*

_{n}_{+1}= 0,

*s*=

_{n}*t*,

_{n}*s*=

_{j}*t*+ … +

_{j}*t*,

_{n}*j*= 2, … ,

*n*− 1.

Although the waiting times *T _{n}*, … ,

*T*

_{2}are not independent, unlike in the case of constant population size, where

*g*(

*t*, … ,

_{n}*t*

_{2}) =

*g*(

*t*) …

_{n}*g*(

*t*

_{2}), we use the joint density and the law of iterated expectations to derive the first two moments of waiting times. Griffiths and Tavaré (1998) already obtained the means on the basis of previous results from Griffiths (1980) and Tavaré (1984). To emphasize the dependence of the first- and second-order moments of waiting times on the sample size

*n*, waiting times carry a second index ((

*T*)

_{k}_{n}) unless the meaning is clear from the context.

#### First-order moments:

We start by successively deriving *E*(*T _{k}*)

_{n},

*E*(

*T*)

_{k}_{n−1}, … and perform these recursions in particular for small sample sizes. From the joint density above one immediately obtainsby making use of Lemma 1 (see appendix a: proof by induction for the mean waiting times). This result forms the initial step of the induction proof and provides in particular the solutions for

*E*(

*T*

_{2})

_{2}and

*E*(

*T*

_{3})

_{3}. To calculate

*E*(

*T*

_{2})

_{3}we use conditional expectations:Note that the result for

*E*(

*T*

_{2})

_{2}is used on the right-hand side of the above equation. Due to conditioning on

*T*

_{3}=

*t*

_{3}, the integration bounds are shifted by

*t*

_{3}. Using Lemma 2 (see appendix a: proof by induction for the mean waiting times), we can continue and writeIn the same way, we obtainwhereSince this result provides us with the solution for

*E*(

*T*

_{3})

_{4}, we finally derive

*E*(

*T*

_{2})

_{4}by conditioning the result for

*E*(

*T*

_{2})

_{3}on the newly introduced waiting time

*T*

_{4}:In analogy,whereand α

_{n,n,n−2}= α

_{n,n−1,n−2}− α

_{n,n−2,n−2}. When performing the next recursion step to calculate

*E*(

*T*

_{n}_{−3})

_{n}, one might already see thatThe solutions of α

_{n,k,k}, α

_{n,k+1,k}, … eventually lead to(1)whereand

*g*

_{2}(

*t*) is the density of (

*T*

_{2})

_{2}. The proof of (1) is given in appendix a: proof by induction for the mean waiting times. The expected waiting times for specific demographic scenarios can be derived from Equation 1. Population genetic models that consider a finite number of instantaneous population size changes can be explicitly solved, as seen in the following example of a population bottleneck. Let λ(

*t*) =

*f*>>

*n*for τ <

*t*≤ τ + τ

_{f}and λ(

*t*) = 1 otherwise. Then,In the case of exponential growth, where λ(

*t*) = exp{−β

*t*} and β > 0, the mean waiting times

*E*(

*T*) can be expressed in terms of the exponential integral Ei(.) and areObviously, the assumption of a large population size in each generation is violated at some time in the past. However, this turns out not to matter greatly (Nordborg 2003). Models of instantaneous population size change followed by exponential growth can be expressed in terms of tabulated functions as well. It is interesting to note that population size does not necessarily have to decrease backward in time. An example where population size increases backward in time is given by . The intensity function and Λ(∞) = ∞. Expected waiting times are then

_{k}#### Second-order moments:

The same ideas of using conditional expectations and a proof by induction can be applied to second-order moments. Since this requires a considerable amount of space, we just note the results here and encourage the interested reader to study the supplemental material.

Letwhere *g*_{3,2}(*t*_{3}, *t*_{2}) is the joint density of (*T*_{3}, *T*_{2})_{3}.

Then(2)For 2 ≤ *k* < *k*′ ≤ *n*,(3)For all demographic models with a finite number of *m* intervals, each with a different constant population size, Equations 2 and 3 can be explicitly solved by decomposition of the double integrals according to the possible arrangements of coalescent events over these *m* time phases. For example, in the case of the previously mentioned bottleneck model (*cf.* Figure 1) it is instructive to express the solution for three alleles and then to replace by and by .

In the case of exponential population growth, one must resort to a numerical evaluation of (2) and (3).

The variance and covariance of waiting times are(4)(5)

#### Expectation and variance of total tree length and the number of segregating sites:

Let *T*_{c} be the sum of all branch lengths in the coalescent tree of sample size *n*. Then(6)(7)Let mutations occur according to independent Poisson processes of rate θ/2 along the edges of the tree, conditional on the edge lengths of the tree. This assumption refers to general coalescent trees (Griffiths and Tavaré 1998, 2003), where *T _{n}*, …,

*T*

_{2}are continuous random variables under a general joint distribution of waiting times. The ancestral tree is binary and such that when there are

*k*ancestral lines each pair has probability of being the next pair to coalesce. Under variable population size θ is given by 4

*N*μ, where

*N*is the present population size and μ is the mutation rate per sequence per generation. Therefore, θ/2 is the mutation rate per sequence per 2

*N*generations, providing the convenience of the same time scaling of the mutational and the genealogical process.

Under the infinite-sites model (which we assume throughout this article) the number of segregating sites *S _{n}* in a sample of

*n*sequences is identical to the number of mutations on the tree. Then

*S*is a compound Poisson-distributed random variable with mean (θ/2)

_{n}*T*

_{c}and it holds (

*e.g.*, Hudson 1990) that(8)(9)

#### The number of mutations of size *i*:

Fu (1995) defined the size of a branch as the number of sequences in a sample that are descendants of that branch (at time *t* = 0). Furthermore, a mutation is of size *i* (1 ≤ *i* ≤ *n* − 1) if it occurs on a branch of size *i*. Let ξ_{i} be the number of mutations of size *i* in a sample of size *n*. Fu (1995) derived expectations, variances, and covariances of these random variables for constant population size. In fact, the probabilities in Equations 14–21 in Fu (1995), which rely on a Pólya urn model, are valid in a general coalescent tree since the timing of the urn draws is irrelevant for the binary coloring scheme. The probability *p _{n}*

_{,k}(

*i*) that a randomly chosen line of waiting time

*T*is of size

_{k}*i*in waiting time

*T*(Fu 1995; Griffiths and Tavaré 1998) isFor notational convenience, we defineThe probabilities

_{n}*p*(

_{a}*k*,

*i*,

*k*′,

*j*) for

*i*≥

*j*and

*p*(

_{b}*k*,

*i*,

*k*′,

*j*) for

*i*+

*j*<

*n*given in Equations 18 and 19 in Fu (1995) include the summation over an additional variable

*t*, which can be eliminated by rearranging the summands in terms of a hypergeometric distribution and its first two moments. This results inLet ξ

_{kl}be the number of mutations that occurred in the

*l*th line of state

*k*and ε

_{kl}(

*i*) be an index variable, such that ε

_{kl}(

*i*) = 1 if the

*l*th line of state

*k*is of size

*i*and 0 otherwise. Then,In analogy to the results by Fu (1995), we have (see appendix b)Following (22) of Fu (1995), we find(10)whereThis formula already appears in Griffiths and Tavaré (1998). Under variable population size and by applying (1) to a demographic history, which consists of multiple epochs of different but constant population sizes, (10) results in Equation 1 of Marth

*et al.*(2004).

Using Fu's Equation 23, then by separating cases and after some arithmetic manipulations, we obtain(11)(12)whereandFor *i* > *j*,whereTo decide whether a mutation is of size *i*, knowledge of the ancestral and derived alleles is required. Without this knowledge, one can consider only the *type* of a mutation (Fu 1995). A mutation is of type *i* if it is of size *i* or *n* − *i*. If η_{i} is the number of mutations of type *i*, then analogous results for *E*(η_{i}), *V*(η_{i}), and Cov(η_{i}, η_{j}) can be derived by using η_{i} = (ξ_{i} + ξ_{n−i})(1 + δ_{i,n−i})^{−1}. Since , the covariances Cov(ξ_{i}, *S _{n}*) and Cov(η

_{i},

*S*) can be written in terms of the above expressions.

_{n}The average number of all pairwise differences in a sample of size *n* isThe results derived so far can be used to express moments of Π_{n} in terms of moments of *T _{k}*. After some arithmetic manipulations one obtains(13)(14)(15)whereEquation 13 has been already derived in Griffiths and Tavaré (2003). Under variable population size (13) simplifies to

*E*(Π

_{n}) = θ

*E*(

*T*

_{2})

_{2}.

#### Tajima's *D* for variable population size:

Tajima's *D* (Tajima 1989b) is a widely applied test statistic for the null hypothesis of neutral evolution under constant population size. We have now collected all necessary ingredients to formulate a generalized version of *D*, here called *D*′, which may be applied to test the model of neutral evolution under variable population size, provided the demographic history is assumed to be known.

The equalityimplieswhere , and . Note that our definition of *d*′ is already standardized and therefore differs from the definition of *d* in Equation 28 in Tajima (1989b). By using Equations 1–9 and 13–15 together with the unbiased estimates of θ and θ^{2},(16)(17)one formally obtains the test statistic *D*′, which simplifies to *D* for λ(*t*) = 1. We note that Fu and Li's test statistic *D** (Fu and Li 1993) can be generalized in the same fashion as Tajima's *D*. All formulas presented here can be downloaded from http://justus.genetik.uni-koeln.de:8200/software as “Mathematica” files for several different population size histories.

Below, we now discuss the demographic histories of African and European populations of *D. melanogaster* as studied by Li and Stephan (2006) and Hutter *et al.* (2007).

## APPLICATION

#### The demographic history of African and European populations of *D. melanogaster*:

Li and Stephan (2006) developed a maximum-likelihood method to infer demographic changes and to detect recent selective sweeps in populations of varying size from DNA polymorphism data. On the basis of a sample of average size , which comprises 262 loci of the X chromosome from an African population of *D. melanogaster*, they suggested that an instantaneous population expansion model explains the observed polymorphism significantly better than the standard neutral model of constant population size. Under the expansion model, they estimated , where *N*_{A0} indicates the current effective population size for the X chromosome in the African population. The average mutation rate per site per generation was estimated as , assuming 10 generations per year. Then, and the estimated time of the expansion is 60,000 years before the present (corresponding to 0.035 in units of generations). The population size before the expansion was estimated as .

Furthermore, Li and Stephan (2006) studied a sample () that consists of 272 X-linked loci from a European population of *D. melanogaster*. The derived estimate for is 0.0062, with being the present size of the European population. The African and European populations diverged ∼15,800 years ago (corresponding to 0.0735 in units of generations). Subsequently, the European population went through a bottleneck during which its size was reduced to and that lasted ∼340 years (corresponding to 0.0016 in units of generations). The demographic history of both populations is illustrated in Figure 2.

The observed mutation-frequency spectrum, the maximum-likelihood estimation (MLE) from Li and Stephan (2006), and the theoretical frequency spectrum for both populations, obtained by inserting (1) into Equation 4.5 of Griffiths and Tavaré (2003), are shown in Figure 3. The MLE and the theoretical expectations based on the suggested demographic histories are in good agreement since both rely on expected branch lengths.

The property that the theoretical frequency spectrum is strictly monotonically decreasing with the increasing size of a mutation follows directly from Equation 4.5 of Griffiths and Tavaré (2003). Since both populations of Drosophila exhibit an excess of high-frequency derived mutations (*cf.* Figure 3), there is a clear indication for the presence of positive selection in the data (Fay and Wu 2000). However, here we focus on low-frequency derived variants.

In the following, θ denotes 4*N*_{0}μ*L*, where *L* is the average locus length in base pairs. Then, the above estimates of θ become (*L* = 471) and (*L* = 498), respectively. For the African and European data sets the average number of segregating sites per locus is 17.87 and 6.88, respectively. Hence, Watterson's (Watterson 1975) is 5.92 for the African and 2.28 for the European population. In contrast, the estimates based on the suggested demographic models and on Equation 16 are 23.6 and 3.07, respectively. The discrepancy is due to the differences in total tree length (*cf.* Figure 2). The theoretical results for the average number of nucleotide differences with respect to and are 5.35 and 2.49, respectively, by making use of (13). Multiplying the nucleotide diversity by the average locus lengths results in 5.35 and 2.43. We remark that all the outcomes regarding means experience only minor changes if the different sample sizes of the data are taken into account.

The comparison of theoretical and experimental results with respect to variances entails difficulties, since the impact of recombination and the possibility that mutation rates vary among different loci should be taken into account.

#### Analyzing *D*′ for the African demography:

From Equations 1–9 and 13–17, we obtainfor the African demography and *n* = 12. To decide whether an observed value of *D*′ for a certain locus is below or above a critical value, one needs the distribution of *D*′. Although some intralocus recombination might occur, recombination would decrease the variance in the denominator of *D*′, such that the critical values obtained without recombination are even more conservative (*cf.* Tajima 1989b). While mutations are assumed to occur at a constant rate per locus, they can vary across loci. Therefore, we generated the distribution of *D*′ for different values of θ that are in the range of the two outermost values obtained by (16) by coalescent simulations. Taking the most extreme 5 and 95% quantiles among these distributions provides critical values (underlined in Table 1), which are conservative in regard to possible θ-values. The distributions of *D*′ for different values of θ and for *n* = 12 and *n* = 40 are shown in Figure 4. Following Tajima (1989b), nonpolymorphic iterations are disregarded.

The average of the observed *D*′-values in the African sample is −0.27, whereas the average of the observed *D*-values is −0.67. Clearly, *D* is smaller than *D*′, since *D* does not take the effect of expansion into account. On the basis of the critical values (*cf.* Table 1), we found 6 loci with significantly negative *D*′ [Fin025 (0.026), Fin122 (0.021), Fin295 (0.009), Fin430 (0.006), Fin470 (0.019), and Fin743 (0.013)] and 1 locus with significantly positive *D*′ [Fin729 (0.025)]. The *P*-values, shown within parentheses, were generated by the distribution, which already provided the outermost 5 and 95% quantiles, respectively. Two Loci (Fin728 and Fin419) have a *D*′-value of −1.5. All loci, except for Fin025 and Fin122, fall into regions for which the null hypothesis of neutral evolution was also rejected in the test by Li and Stephan (2006). The window including Fin025 contains a total of 11 loci, with an average *D*′ = −0.377. The neighboring locus (2.7 kb away) of Fin025, Fin026, has a *D*′-value of −1.46. Fin122 is located in a window with a total of 9 loci and an average *D*′ of −0.33. These findings suggest that the signals of selection of both loci might be suppressed by their neighboring loci. Since not all loci in the African data set provide a sample of size 12, we recalculated *D*′ for all different sample sizes and obtained the critical values as described above. There is nearly no difference in the ranking of loci with regard to the usage of the statistics *D* and *D*′. The 6 loci with a significantly negative *D*′ are a subset of the 21 significant loci from Hutter *et al.* (2007) based on *D* under constant population size. However, the critical values obtained by *D*′ are more appropriate to reject the neutral model, since *D*′ takes the suggested demographic history into account. Also note that *D*′ captured one locus with a significantly positive value.

#### Analyzing *D*′ for the European demography:

From Equations 1–9 and 13–17, we obtainfor the European demography and *n* = 12. In analogy to the African scenario, we simulated the distribution of *D*′ for different values of θ corresponding to the estimated relevant range. The distributions of *d*′ and *D*′ for different values of θ and for sample sizes 12 and 40 are shown in Figure 5.

Similarly to Tajima (1989b) and Fu and Li (1993), we used the unbiased estimates of θ and θ^{2}, such that the denominator of *D*′ depends solely on the number of segregating sites. Since *d*′ is distributed with mean 0 and variance 1 due to its dependency on θ, the estimates and affect the distortion of the distribution of *D*′ (*cf.* Figure 5). It should be noted that and are considerably larger than under constant population size. Furthermore, the distribution of *D*′ for the European scenario is shifted even more toward negative values [(, *V*(*D*′)) ≈ (−0.2, 0.96) and (−0.3, 0.95) for θ = 3 and θ = 10, respectively] under this bottleneck model than under the standard model (*cf.* Table 1 in Tajima 1989b). While the distribution of *D* becomes smoother with increasing sample size (*cf.* Figure 3 in Tajima 1989b), this is not the case for the distribution of *D*′ and the average θ of ∼3 (*cf.* Figure 5). It is interesting to note that, given a demographic scenario such as the European one, increased sample size may not compensate for the unsatisfactory distributional properties of *D*′. Therefore, the present approach does not produce reliable results in the analysis of the European data.

## DISCUSSION

Changes in population size during evolutionary history can obscure the traces left by natural selection on DNA polymorphism. Methods to identify potential target sites of selective substitutions in a genome, which rely on the spatial distribution of polymorphic sites, may be severely misled if demographic properties are ignored. In particular, the identification of adaptive substitutions can be difficult, because colonization of a new habitat or environmental changes and adaptation to them often occur simultaneously. It is therefore essential that the possible joint action of demographic and adaptive factors is adequately considered in models and statistical methods for data analysis.

Jensen *et al.* (2005) demonstrated that the composite-likelihood-ratio test proposed by Kim and Stephan (2002), which compares the alternative hypotheses of genetic hitchhiking and neutral evolution under constant population size, loses power and suffers from false positives for certain demographic parameters. It is of interest to compare Figure 6 here with Figure 1 in Jensen *et al.* (2005), where the false positives of the composite-likelihood-ratio test (Kim and Stephan 2002) for different parameter constellations of a bottleneck model are shown.

Obviously, the variance to mean ratio of total tree length *V*(*T*_{c})/*E*(*T*_{c}), a measure for the dispersion of the probability distribution of *T*_{c}, may serve as an indicator of “critical” demographic scenarios, *i.e.*, those that lead to a high false positive rate in the detection of selective sweeps, if no means for correction are taken. It should be noted that *V*(*T*_{c}) is always smaller for a population bottleneck in contrast to a model with constant but equal population size at the time of sampling. Since *E*(*T*_{c}) is also reduced under a bottleneck, the variance to mean ratio of total tree length may increase or decrease compared to the standard neutral model depending on the particular parameters of the bottleneck. Although recombination may cause a considerable reduction of *V*(*T*_{c}) [but not of *E*(*T*_{c})], the variance to mean ratio for critical bottlenecks is still high with respect to constant population size. Clearly, the effect on *V*(*T*_{c}) and *E*(*T*_{c}) of varying population size is propagated to the moments of *S _{n}*, Π

_{n}, and

*D*′. As seen in the analysis of the European data set, for certain demographic scenarios it is difficult to filter out the effects on the statistics incurred by the demography. In contrast to the standard neutral model where

*n*= 12, and

*V*(

*T*

_{c}) is almost identical to

*E*(

*T*

_{c}), we find that the ratio

*V*(

*T*

_{c})/

*E*(

*T*

_{c}) is ∼0.17 and 2.93 for the African and European demographic histories, respectively. This is in line with the degree of applicability of

*D*′ for both populations; however,

*D*′ loses its reliability for the European population. To appreciate this effect quantitatively one may compare

*V*(

*T*

_{c})/

*E*(

*T*

_{c}) with the mean of

*D*′ (

*cf.*Figure 7).

Not only can the mean of *D*′ become heavily unbiased for problematic parameter constellations, but also the distribution of *D*′ becomes bimodal with a second peak near or within the rejection region (*cf.* Figure 5) or even ragged (*cf.* Figure 8), which makes a meaningful definition of the rejection region difficult. Also note that the small amount of polymorphism obtained in the European scenario conceals the shape of the genealogies with respect to the moments of *D*′. Due to a smaller value of θ combined with a more pronounced reduction in population size, we obtained a less biased *D*′ than in the examples illustrated in Figure 7 (*cf.* also Figure 8).

It is discomforting that a demographic scenario that already caused problems for tests of neutral evolution under the constant population size model remains cumbersome when tested under a more general model of variable population size with the help of *D*′. Nevertheless, the impact of varying population size on the observable pattern of genetic variability must not be underestimated. The expansion scenario of the African sample illustrates the importance of taking demographic estimates into account, if possible. In particular, for genomic scans this may lead to a considerable reduction of the false positive rate.

Our results and the discussion of two well-characterized data sets show that estimates even of basic population genetic quantities, such as Watterson's , may be heavily skewed, if details of the demographic history are ignored. The results derived here provide a means to guard against this.

## APPENDIX A: PROOF BY INDUCTION FOR THE MEAN WAITING TIMES

The following chart illustrates the setup of recursions and the way of inductive reasoning:First, we prove three lemmas.

Lemma 1.

*Proof*.

Lemma 2.

*Proof*.

Lemma 3.

*Proof*.

Now we are ready to complete the proof of (1). To infer *E*(*T _{k}*)

_{n+1}from the induction assumption given in (1), we writewhere (

*cf.*Figure A1)Therefore,By Lemma 2, we getThe usage of Lemma 3 results in

## APPENDIX B

In the following, we make repeated use of the assumption that mutations occur according to *independent* Poisson processes of rate θ/2 along the edges of the tree, conditional on the edge lengths of the tree:

We writewhere denotes the marginal density of the random variable

*T*. Hence,and_{k}In analogy,whereTherefore,

We writewhereThus,

Finally,where denotes the two-dimensional marginal density of (

*T*_{k}_{′},*T*). Sincewe obtain_{k}

## Acknowledgments

We thank in particular one of the reviewers whose constructive comments led to a considerable improvement of the theory. This work was supported by grants from the Deutscher Akademischer Austausch Dienst to D.Z. and the Deutsche Forschungsgemeinschaft (SFB680) to T.W.

## Footnotes

Communicating editor: M. Veuille

- Received May 9, 2008.
- Accepted June 13, 2008.

- Copyright © 2008 by the Genetics Society of America