## Abstract

This work extends the methods of demographic inference based on the distribution of pairwise genetic differences between individuals (mismatch distribution) to the case of linked microsatellite data. Population genetics theory describes the distribution of mutations among a sample of genes under different demographic scenarios. However, the actual number of mutations can rarely be deduced from DNA polymorphisms. The inclusion of mutation models in theoretical predictions can improve the performance of statistical methods. We have developed a maximum-pseudolikelihood estimator for the parameters that characterize a demographic expansion for a series of linked loci evolving under a stepwise mutation model. Those loci would correspond to DNA polymorphisms of linked microsatellites (such as those found on the Y chromosome or the chloroplast genome). The proposed method was evaluated with simulated data sets and with a data set of chloroplast microsatellites that showed signal for demographic expansion in a previous study. The results show that inclusion of a mutational model in the analysis improves the estimates of the age of expansion in the case of older expansions.

THE shape of the genealogy of a random sample of genes (*i.e*., copies of the same locus) from a population is strongly influenced by the demographic history of the population. For expanding populations the gene genealogy will have a “star” shape (Slatkin and Hudson 1991), with short internode branches and long terminal branches. Under such genealogies, terminal branches accumulate many mutations, producing an “excessive” number of haplotypes and singletons compared to the expectation for a constant-size population. This pattern is exploited in most neutrality tests (*e.g.*, Tajima 1989; Fu 1997). In addition, the distribution of pairwise differences between individuals (also known as mismatch distribution) follows a unimodal distribution, in contrast to the ragged patterns that would be found for a sample from a constant-size population (Slatkin and Hudson 1991).

In this work we propose an approach that follows the ideas developed for the study of mismatch distributions of mtDNA for demographic inference (Rogers and Harpending 1992; Rogers 1995; Schneider and Excoffier 1999), which are adapted here to linked microsatellites by assuming the stepwise mutation model (Kimura and Ohta 1978). The interest of linked microsatellites comes from their application in chloroplast (Provan *et al.* 2001) and mammal Y chromosome genetic diversity studies [mainly studied in humans (Roewer *et al.* 1992; Willuweit and Roewer 2007), but also found in other species (Edwards *et al.* 2000; Luo *et al.* 2007)]. The proposed method is evaluated through simulations and its use is exemplified with a data set of chloroplast microsatellites for the Canary Island pine (*Pinus canariensis*).

## THEORY

#### Number of mutations between two random gene copies:

The classical work by Watterson (1975) gives the distribution probability for the number of mutations *j* occurring between a pair of genes sampled randomly from a population of constant size *N*,(1)where θ = 2*N*μ is the effective population size scaled by the mutation rate μ. And Li (1977) obtained the equivalent distribution for the case of a population size change from *N*_{0} to *N*_{1} in a single step *t* generations ago,(2)where θ_{0} = 2*N*_{0}μ, θ_{1} = 2*N*_{1}μ, τ = 2*t*μ, and *P*(*j* | θ) corresponds to Equation 1 for the stationary case.

Fitting Equation 2 to the distribution of pairwise genetic differences of a sample of genes has been employed to obtain estimates of the three demographic parameters θ_{0}, θ_{1}, and τ (Rogers and Harpending 1992). However, the use of Equation 2 implies that the number of observed genetic differences should correspond to the number of mutations that actually occurred; *i.e.*, an infinite-site mutation model (Kimura 1969) is being assumed for DNA sequence evolution. This will be an unrealistic model for most data sets in which some amount of multiple hits (*i.e.*, homoplasious mutations) are expected.

#### Number of differences between two random gene copies:

Schneider and Excoffier (1999) proposed to introduce a mutational model to describe the probabilistic relationship between the number of observed differences, *i*, and the number of mutations, *j*, to infer the parameters of a demographic expansion from the distribution of genetic differences between pairs of gene copies. Their strategy consists of using Equation 2 and integrating over all possible numbers of mutations that can produce the observed number of differences,(3)where *P*(*i* | *j*) is the probability of observing *i* differences when *j* mutations have occurred. Similarly, the stationary state could be described with(4)

Schneider and Excoffier (1999) derived *P*(*i* | *j*) for DNA sequence data with models of nucleotide substitutions with heterogeneous mutation rates across sites. In this work we present the equivalent distribution for the differences between two nonrecombining chromosomes typed at several microsatellite loci, assuming a symmetrical stepwise mutation model (Kimura and Ohta 1978).

First we consider the case of a single microsatellite locus. The mutational process, conditioned on the number of mutations *j*, can be seen as a Markov process of *j* steps from the state (number of repeats) of one gene to the state of the other gene. Each step might increase or decrease the number of repeats with equal probability. If we defined *x* as the number of steps in a given direction (let it be the number of steps increasing the number of repeats), *x* has a binomial distribution, . The most informative measure of difference available for a pair of microsatellite genes is their absolute difference in number of repeats, δ, which in our Markov process is , and its distribution, from the binomial distribution of *x*, is(5)which is equivalent to Equations 22 and 26 from Walsh (2001). Substituting δ for *i* in Equation 3, we can describe the probability distribution of the difference in number of repeats for two microsatellite genes randomly drawn from a population that underwent a demographic expansion.

Now we consider a nonrecombining chromosome (or chromosome fragment) containing *L* microsatellite loci. We define as a vector that contains the differences in number of repeats at each locus for a pair of chromosomes. Let *k _{l}* equal the number of mutations at locus

*l*, and let

*K*equal the vector with properties

*k*

_{1}, … ,

*k*are nonnegative integers and

_{L}*k*≥ δ

_{l}_{l}for any locus

*l*. For each number of mutations

*j*several

*K*vectors can be built, each of them with a probability given by the multinomial distribution(6)where

*p*is the probability that a mutation hitting the chromosome hits locus

_{l}*l*(

*i.e.*,

*p*is the ratio between the

_{l}*l*locus mutation rate, μ

_{l}, and the global mutation rate for the set of loci, μ

_{g}). Note that, for the multilocus case, θ

_{0}, θ

_{1}, and τ are scaled to global mutation rate μ

_{g}.

Assuming that mutational process is independent among loci, *P*(Δ | *K*) can be calculated as the product of (from Equation 5) for all loci. From and it is possible to obtain *P*(Δ|*j*) by integrating over all possible *K* vectors for *j* and Δ:(7)

## MATERIALS AND METHODS

#### Estimation of demographic parameters:

To obtain estimates , , and , we fit Equation 3 to the distribution of pairwise differences by maximizing the pseudolikelihood of the empirical sample. Letting , , and τ* be a combination of possible values for parameters θ_{0}, θ_{1}, and τ, we can calculate the likelihood of , , and τ*, for a sample of two chromosomes from Equations 3 and 7. For larger samples the pseudolikelihood of , , and τ* is the product of the likelihoods for all possible pairs in the sample. Code in C programming language for the calculation of pseudolikelihood is available from the corresponding author. The combination , , and τ* attaining the highest pseudolikelihood value will be our point estimates , , and , which we call *m*aximimum *p*seudolikelihood using a model with *h*omoplasy (MPH) estimates. Parametric bootstrap confidence intervals around those point estimates can be obtained as described by Schneider and Excoffier (1999).

Multinomial parameters (*p*_{1}, … , *p _{L}*) must be provided to obtain MPH estimates in the multilocus case. When independent mutation rate estimates are available for every locus, multinomial parameters can be easily calculated by approximating the global mutation rate of the chromosome, μ

_{g}, as the sum of all local mutation rates μ

_{1}, … , μ

_{L}(this holds if for every

*l*). When no mutation rate estimates are available, relative mutation rates can be inferred from allele size variance or homozygosity for loci evolving under a stepwise mutation model (Chakraborty

*et al.*1997; Kimmel

*et al.*1998). Note that the sample from which variance in allele size or homozygosity is estimated does not need to be restricted to the sample from which demographic inferences will be drawn, as using additional data sets will increase the precision of those estimates. If relative mutation rate estimates are obtained from the same data set used for the demographic estimation, the uncertainty in the rates estimates can be taken into account in the parametric bootstrap procedure (see below).

In addition, two other methods to estimate demographic parameters of a population expansion can be proposed for linked microsatellite data. For these we ignore the problem of recurrent mutation and we use methods originally described for DNA sequence data evolving under an infinite-site model. To apply these methods, the Manhattan distance () is used to describe differences between two chromosomes (instead of Δ). Assuming that *D*_{M} corresponds to the actual number of mutations (), *m*aximum *p*seudolikelihood (MP) estimates can be obtained by fitting Equation 2 to the distribution of pairwise *D*_{M} of the data (this is equivalent to the approach in Rogers and Harpending 1992). Rogers (1995) proposed *m*oment-based (M) estimates, assuming that (*i.e.*, θ_{1} is very large). For this model and , where *m* and *v* are the mean and variance of the number of mutations (*D*_{M} in our case) between all pairs of the sample (in practice, if *v* < *m*, then , and if , then , as in Rogers 1995).

It must be noted that, while these methods (M, MP, and MPH) provide estimates for the parameters of a stepwise demographic expansion, they do not provide a formal test for the null hypothesis of a constant neutral demography. Thus, previous to this kind of estimation some evidence of expansion should be obtained, typically through a neutrality test.

#### Simulations:

The accuracy of the three described methods (M, MP, and MPH) has been evaluated through coalescent simulations performed with SIMCOAL2 (Laval and Excoffier 2004). Samples of 50 chromosomes were simulated from a population that underwent a step change in its effective population size from θ_{0} to θ_{1} at time τ before present. Three types of chromosomes (single-locus, four-locus, and eight-locus chromosomes, with uniform mutation rate across loci) and ages of the demographic expansion ranging from τ = 1 to τ = 9 were considered. Effective population sizes were between θ_{0} = 0.01 and θ_{0} = 2 and between θ_{1} = 1 and θ_{1} = 1000 (always θ_{0} < θ_{1}). To explore the bias and accuracy of the estimates 1000 replicates were run for several scenarios (*i.e.*, combination of the following parameters: number of loci, age of expansion, and effective population sizes before and after expansion). Output of simulations was analyzed to obtain estimates , , and for the three methods (with the exception of for the moment method, which assumes ).

*P. canariensis*, a study case:

*P. canariensis* is an endemic tree of the Canary Islands, whose recent volcanic origin makes them an interesting place to study dispersal and colonization processes (Juan *et al.* 2000). Navascués *et al.* (2006) detected demographic expansion for *P. canariensis* at each island using chloroplast microsatellites and considered these expansions likely linked to the colonization process. However, they considered their estimates for the time of expansion to be biased because of the assumption of a model without homoplasy in the analysis. We have reanalyzed the same data set to obtain estimates for the three methods described above (M, MP, and MPH) of the time of expansion for each island. This set comprises 497 individuals from four islands, Tenerife (280 individuals), Gran Canaria (145 individuals), La Palma (48 individuals), and El Hierro (24 individuals), genotyped for six microsatellite loci (Pt15169, Pt30204, Pt71936, Pt87268, Pt26081, and Pt36480; Vendramin *et al.* 1996). These data have been previously published by Gómez *et al.* (2003), Navascués *et al.* (2006), and Vaxevanidou *et al.* (2006). Demographic estimates were not produced for the sample from a fifth island, La Gomera (see Navascués *et al.* 2006). Relative mutation rates among loci were estimated using allele size variance for the whole data set (including the La Gomera sample). For MP estimates confidence intervals were estimated by parametric bootstrap by simulating 1000 samples under an infinite-site model and demographic parameters , , and , using MS (Hudson 2002). Similarly, for MPH confidence intervals parametric bootstrap was performed by simulating the evolution of a six-microsatellite chromosome, with relative mutation rates estimated by allele size variance and demographic parameters , , and , using SIMCOAL2 (Laval and Excoffier 2004). There are no reliable mutation rate estimates for chloroplast microsatellites. However, to compare (in mutation units) with dated geological or biogeographical events, mutation rates in the range 10^{−5}–10^{−4} mutations per generation per locus [and 100 years per generation (Provan *et al.* 1999; Navascués *et al.* 2006)] can be used.

## RESULTS

#### Pseudolikelihood profile:

Figure 1 shows the pseudolikelihood profiles of the demographic parameters for three example simulated data sets. Pseudolikelihood profiles for τ present narrow bell shapes with the maxima within the proximity of the true value (Figure 1a). However, the behavior is quite different for the population size parameters θ_{0} and θ_{1}. For θ_{1} > 100 the pseudolikelihood profile often takes an shape (Figure 1c), and for θ_{0} < 1 it often takes an inverted- shape (Figure 1b). Therefore it is not possible to distinguish very large final population sizes or very small initial population sizes. By increasing the final effective population size (given a fixed τ, note that time is scaled to mutation rate) a threshold is reached from which a coalescent event after the expansion has an extremely low probability and, thus, the shape of the genealogy is the same (terminal branches of length τ) for any θ_{1}, and the average number of mutations accumulating after the expansion reaches a maximum. By decreasing the initial effective population size another threshold is reached from which a mutation event has an extremely low probability, giving scenarios where no mutations occur before the expansion, regardless of the shape of the genealogy. These characteristics of the pseudolikelihood profile are helpful to understand the results of the accuracy and bias in the demographic estimates. See supplemental material for further discussion about the pseudolikelihood profile.

#### Bias and accuracy of estimates:

The estimation of the effective population sizes before and after the expansion shows a poor performance with biases of some orders of magnitude (Figure 2, note logarithmic scale). In only one case (θ_{1} = 10, Figure 2b) the estimates are apparently better, but this behavior seems to be for a very narrow range of the parameter values and it still shows a tendency for the overestimation of the parameter. A partial explanation of this can be found by looking at the shape of the pseudolikelihood profiles of and when they take extreme values. In addition, there seems to be a tendency for pseudolikelihood maxima to fall in combinations of extreme values for both and , pushing the estimates to have big biases for these two parameters (see, for instance, point estimates for simulations represented in Figure 1). Conversely, estimates for the time of expansion are more accurate. Figure 3 presents results for simulations with ages of expansion τ = 1 and τ = 7. Some bias, which increases with the age of expansion, is evident for the two methods that do not account for homoplasy (M and MP). By estimating the number of mutations between two chromosomes by the differences in number of repeats (*i.e.*, *D*_{M}) there is a proportion of back and parallel mutations that is missed. Therefore, a younger time of expansion fits well the estimated number of mutations, as it leaves less time for mutations to accumulate. The method that uses a model with homoplasy (MPH) shows a clear improvement over the other two methods, particularly when several linked loci are considered. The reduction of the bias comes with an increase in the variance of the estimator; however, the mean squared error (MSE) for the MPH method was similar to or lower than that in the other methods [for instance, for simulation of τ = 7 and eight loci, , , and ; and for simulation of τ = 1 and eight loci, , , and ]. These results suggest that using a more complex analysis that includes a mutational model for microsatellites can improve the estimates, but only when there has been substantial homoplasious mutation in the sample (*i.e*., older expansions).

*P. canariensis*:

Applying these methods, we obtained estimates of the time of expansions (*i.e.*, putative time of colonization) for *P. canariensis* (Table 1). For the MPH method, relative (to total) mutation rates were estimated from allele size variance, giving *p*_{Pt15169} = 0.070, *p*_{Pt30204} = 0.140, *p*_{Pt71936} = 0.530, *p*_{Pt87268} = 0.210, *p*_{Pt26081} = 0.048, and *p*_{Pt36480} = 0.002. Regardless of the method considered, the geologically older Gran Canaria and Tenerife present older expansions than La Palma and El Hierro in the point estimates. However, taking into account the confidence intervals, the differences in time of expansion among islands are similar in magnitude to the error of the estimates, which makes difficult the interpretation of the results. The major source of error can be attributed to the samples from El Hierro and La Palma as they have a significantly smaller sample size. When the analysis takes into account recurrent mutation (by assuming the stepwise mutation model), older times of expansion are obtained, as predicted by Navascués *et al.* (2006), except for the island of La Palma. In the light of the results of our simulated data, getting lower estimates in the MPH method than in the MP method is rare but is more frequent for very recent expansion where the MPH method can have slightly higher error. Despite the uncertainty around the expansion age estimates and mutation rate of chloroplast microsatellites, the results obtained are roughly congruent with the geological history of the archipelago and with the phylogeography knowledge of the pine-specific parasite *Brachyderes rugatus* (see Table 1; Emerson *et al.* 2000; Navascués *et al.* 2006). However, the analysis seems to gain little from the use of a mutational model due to the young age of the expansions.

## DISCUSSION

This work presents an analytical framework to describe the evolution of a set of linked microsatellites between a pair of individuals. Walsh (2001) presented a similar description, scaling the process to clock time (*i.e.*, generations). Because scaling to mutation rate is necessary for demographic inference, an approach different from the one used by Walsh (2001) was used to account for the heterogeneity in mutation rates (*i.e.*, Equation 7). As long as relative mutation rates across loci are available, Equation 7 can be employed for the analysis of sites evolving under mixed models. Therefore, it is straightforward to include single-nucleotide polymorphisms [for instance, considering them as unique event polymorphisms where *P*(*i* = 0; *j* = 0) = 1 and *P*(*i* = 1; *j* = 1) = 1] or microsatellites evolving under the geometric step mutation model (following Watkins 2007).

Statistical inference from linked microsatellite data can be performed by coalescent MCMC Bayesian [implemented in BATWING software (Wilson and Balding 1998; Wilson *et al.* 2003)] and approximate Bayesian [*i.e.*, without likelihoods (Pritchard *et al.* 1999)] methods. These are computationally intensive approaches but are changing the way population genetics analysis is currently done (Beaumont 2004; Beaumont and Rannala 2004). This applicability of state of the art statistics contrasts to the lack of simple summary statistics specific to linked microsatellites. Using simple statistics is an effective way to describe the genetic diversity of a sample and incorporating mutational models can make them more informative. We demonstrate their usefulness for characterizing demographic expansions, but the same framework could be extended to further statistics, such as neutrality tests or estimation of identity-by-descent probabilities, with linked microsatellites. The development of a neutrality test would be particularly interesting since such a test should be performed before considering the demographic inference under the model of expansion described in this work. A scheme for this could be the estimation of the parameter θ of the constant-size model [via BATWING (Wilson and Balding 1998) or by maximazing the pseudolikelihood (Equation 4)] and the use of simulations under the null model to obtain the distribution of some statistic sensitive to demographic expansion, such as the number of haplotypes (Fu 1997; Navascués *et al.* 2006).

The method for demographic inference proposed (MPH) does not substitute for the above-mentioned coalescent MCMC Bayesian approach (Wilson *et al.* 2003), which can also be used to characterize demographic expansions; however, it requires less computation time, especially for data sets with a large number of individuals. It must be noted that this computation advantage is attained only when a low number of loci are typed. This is because in our approach we integrate over all possible ways to distribute *j* mutations in *l* loci (Equation 7), which becomes a huge number when *j* is large and many loci are considered. This increases greatly the computation time, particularly for highly diverse samples. Nevertheless, most published works on chloroplast or Y chromosome microsatellites use a number of loci within the limits of computationally “affordable” (which we suggest to be ∼10, depending on the machine used and the diversity of the sample analyzed).

To conclude, this work extends the methods of demographic inference based on the distribution of pairwise genetic differences to the case of linked microsatellite data. A statistical method to estimate the demographic parameters of a population expansion (*i.e*., time and magnitude) is developed to be applied to data sets where there is some evidence for demographic expansion (*i.e.*, through a neutrality test). This method assumes that microsatellites follow a stepwise model for mutation to account for homoplasious mutations. This approach improves the estimates of the age of expansion by reducing their bias when the event is relatively old; however, little is gained by its application to younger expansion as can be seen with both the simulated and the empirical data analyzed.

## Acknowledgments

We are grateful to Frantz Depaulis for his support and discussions. Comments made by two anonymous referees helped improve a previous version of this manuscript. Funding for research stays was provided to M.N. (European Science Foundation ConGen Exchange Grant 1142) and C.B. (Bourse pour chercheurs étrangers de la Marie de Paris 2007).

## Footnotes

Communicating editor: L. Excoffier

- Received October 30, 2008.
- Accepted December 15, 2008.

- Copyright © 2009 by the Genetics Society of America