## Abstract

We provide a framework for Bayesian coalescent inference from microsatellite data that enables inference of population history parameters averaged over microsatellite mutation models. To achieve this we first implemented a rich family of microsatellite mutation models and related components in the software package BEAST. BEAST is a powerful tool that performs Bayesian MCMC analysis on molecular data to make coalescent and evolutionary inferences. Our implementation permits the application of existing nonparametric methods to microsatellite data. The implemented microsatellite models are based on the replication slippage mechanism and focus on three properties of microsatellite mutation: length dependency of mutation rate, mutational bias toward expansion or contraction, and number of repeat units changed in a single mutation event. We develop a new model that facilitates microsatellite model averaging and Bayesian model selection by transdimensional MCMC. With Bayesian model averaging, the posterior distributions of population history parameters are integrated across a set of microsatellite models and thus account for model uncertainty. Simulated data are used to evaluate our method in terms of accuracy and precision of *θ* estimation and also identification of the true mutation model. Finally we apply our method to a red colobus monkey data set as an example.

MICROSATELLITES, also called short tandem repeats (STRs), are repetitions of a DNA sequence motif with length between 1 and 6 bp. Because they are abundant, widely distributed in the genome, and highly polymorphic, microsatellites have become one of the most popular genetic markers for making inferences on molecular evolution and population genetics (Shikano *et al.* 2010; Spong *et al.* 2010).

Unequal crossing over (Smith 1976; Richard and Pâques 2000) and replication slippage (Levinson and Gutman 1987) are the two main mechanisms proposed that potentially provide an explanation for the high mutation rate of microsatellites. The study by Levinson and Gutman (1987) using *Escherichia coli* showed that replication slippage is the predominant mutation mechanism of microsatellite DNA. Replication slippage occurs when the replicating strand and the template strand disassociate and then realign out of register, forming a loop in one of the strands. If the process of replication continues, a loop formed by the replicating strand gives rise to an insertion while that by the template strand results in a deletion.

The simplest microsatellite model is the stepwise mutation model (SMM) proposed by Ohta and Kimura (1973), which states that the length of the microsatellite increases or decreases by 1 repeat unit at a rate independent of the microsatellite length. Although the SMM has been employed to devise commonly used statistics for estimating genetic divergence (Slatkin 1995) and effective population size (Wehrhahn 1975), the model has some drawbacks. Under the SMM, there is no stationary distribution and under this process the repeat length will eventually grow arbitrarily long, which is inconsistent with empirical microsatellite length distributions from genomic data (Kruglyak *et al.* 1998). Moreover the SMM ignores various properties of microsatellite mutation that have been observed in empirical data. Many different models have been developed in attempts to capture some of these properties.

Observations from many studies support the fact that longer microsatellites have a higher mutation rate (Goldstein and Clark 1995; Wierdl *et al.* 1997; Schlötterer *et al.* 1998). A longer microsatellite allele has more locations for potential slippage errors and hence possesses a greater chance of experiencing a mutation event during replication, as demonstrated by Streisinger and Owen (1985) using bacteriophage T4. This is the motivation behind rate-dependent models such as the proportional slippage model (Kruglyak *et al.* 1998) and others (Calabrese *et al.* 2001; Sibly *et al.* 2001), which describe the mutation rate as a polynomial function of length in repeat units.

Another property is mutational bias, which exists when the probability of expansion and contraction is unequal for a mutation event. Evidence for this phenomenon has been found in genomes of several species including humans (Rubinsztein *et al.* 1999), which exhibit a preference for expansion, and the bacterium *Mycoplasma gallisepticum* (Metzgar *et al.* 2002), which tends to contract. Models proposed by Calabrese and Durrett (2003) and Walsh (1987) have accounted for this rate asymmetry (see original citations for the stationary distributions of these models).

In one-phase models, a mutation leads to expansion or contraction of the microsatellite by 1 repeat unit only. However, empirical data suggest that mutations can occasionally result in a change in the microsatellite length of >1 unit. According to the two-phase model (TPM), proposed by Di Rienzo *et al.* (1994), there is a probability of *p* that a mutation changes the microsatellite length by 1 unit and has a probability of 1 – *p* that a change in length is ≥1 repeat unit(s), where the length of change is given by a geometric distribution. The generalized mutational model (Fu and Charkraborty 1998) is a simplified version of this mixed model, which sets *p* to 0, and consequently the length of change is entirely governed by the geometric distribution.

Many population genetics inference methods for microsatellite data require the adoption of a mutation model such as those described above. These approaches can be divided into three categories. The first category involves moment estimators based on summary statistics, including sample homozygosity (Kimmel *et al.* 1998; Xu and Fu 2004) and allele length variance (Wehrhahn 1975; Kimmel *et al.* 1998), to estimate *θ* = 4*N*_{e}*μ* (four times the product of effective population size and the mutation rate).

The second category consists of likelihood-based approaches to the estimation of *θ*. As it is not in general possible to evaluate the likelihood function analytically, it is approximated by computational methods including Markov chain Monte Carlo (MCMC) (Beerli and Felsenstein 1999) and importance sampling (Nielsen 1997). Significant progress has been made in the development of methods that employ importance sampling and composite likelihoods for microsatellite inference, allowing the maximum-likelihood estimate of demographic parameters to be computed efficiently (Iorio *et al.* 2005; RoyChoudhury and Stephens 2007). On the other hand, Wilson and Balding (1998), Beaumont (1999), and others have applied MCMC to provide Bayesian inference of demographic history from microsatellite data, in which case population parameters are treated as random variables instead of unknown fixed parameters as in a maximum-likelihood approach. Cornuet *et al.* (2006) investigates the underlying mutation process of microsatellites using reversible-jump MCMC (Green 1995) of microsatellite models.

The third category includes likelihood-free approaches such as approximate Bayesian computation (ABC) (Weiss and Von Haeseler 1998; Beaumont *et al.* 2009; Bertorelle *et al.* 2010). The application of ABC to microsatellite data (Beaumont *et al.* 2002; Cornuet *et al.* 2008; Tallmon *et al.* 2008) aims to increase computation efficiency as the method uses summary statistics instead of the full data set and employs simulation to circumvent the likelihood computation step.

Many of the likelihood approaches mentioned above are based on the coalescent theory (Kingman 1982; Griffiths and Tavare 1994). Rather than assuming a parametric model for the population history, for example exponential growth or logistic growth models (Pybus *et al.* 2003), advanced coalescent-based methods provide inference of the demographic history by estimating population as a function of time directly from the data (Drummond *et al.* 2005; Opgen-Rhein *et al.* 2005; Heled and Drummond 2008; Minin *et al.* 2008), but most of them have not been accessible for microsatellite inference.

To extend previous work on Bayesian coalescent inference of microsatellite data, we develop a method that provides inference of the demographic history averaged over a nested set of microsatellite mutation models that incorporate length dependency, mutation bias, and step size. Our method can handle multiple loci and these are assumed to be unlinked or in independent blocks of linkage. The implementation of this method consists of two main parts. The first part is to introduce the implementation of a rich family of microsatellite mutational models and other necessary components to the BEAST software package (Drummond and Rambaut 2007), which provides microsatellite inference access to sophisticated coalescent models (Drummond *et al.* 2005; Heled and Drummond 2008; Minin *et al.* 2008). The second part deals with model uncertainty by employing the product space formulation of transdimensional MCMC (Sisson 2005) as described in section 2.5 of Godsill (2001), which facilitates Bayesian model selection by producing posterior probabilities of the microsatellite mutation models and Bayesian model averaging for estimates of population history and genealogies over those models. The transdimensional MCMC technique chosen here uses techniques from Bayesian variable selection (BVS) (Geweke 1996; Kuo and Mallick 1998) *sensu* Godsill (2001). The BVS-inspired scheme is preferred over other transdimensional MCMC techniques because it trades a small increase in MCMC state space for a high degree of simplification and flexibility in programming.

To apply BVS a composite model space must be constructed that nests all submodels of interest over which inference of population history and genealogies should be averaged. In our case the submodels have a natural nesting by variable selection, because each model represents a special case of the most general microsatellite mutation model in our family. However, models that do not nest naturally can still be averaged over using BVS by introducing a simple index parameter alongside the union of all submodel parameters to construct a product space over models (Carlin and Chib 1995). However, an advantage of our nested model space is that it is able to indicate which of the three microsatellite mutation properties has a strong signal in the data.

## MATERIALS AND METHODS

#### The basic global model:

Here we give an overview of the global model and the framework within which our implementation is developed. The microsatellite data, **D**, consist of *L* loci, **D** = {**D**_{1}, … , **D _{L}**} and each locus is composed of a collection of microsatellite repeat lengths from the population of interest,

*l*= 1, … ,

*L*and

*n*is the number of copies of locus

_{l}*l*collected. In haploid data,

*n*is the number of sampled individuals, while in diploid data

_{l}*n*is twice the number of individuals from which the samples have been collected. We assume that the data have been generated by an underlying continuous time Markov chain (CTMC), along an unknown genealogy

_{l}*τ*, which is a rooted bifurcating tree. In the simulations and analyses, the mutation rate is assumed constant along the tree within a locus,

_{l}*i.e.*, a strict molecular clock rate. The time intervals between successive coalescence events in the genealogy are modeled by the coalescent, which requires a demographic model component containing parameters

**Θ**. The mutation process is defined by the microsatellite mutation model (more details in the Microsatellite models section) with parameters

**φ**. Let

**τ**= {

**τ**, … ,

_{1}**τ**} and assuming the loci are independent and identically distributed given (

_{L}**τ**,

*φ*), the joint posterior distribution of

**τ**,

*φ*, and

**Θ**is

*et al.*2002). The tree likelihood of locus

*l*is Pr(

**D**|

_{l}**τ**,

_{l}**) and can be evaluated using the peeling/pruning algorithm described by Felsenstein (1981), although we employ augmentation of internal nodes with repeat lengths. The coalescent models come into play by serving as priors for the tree topology and coalescent/divergence times. The form of the coalescent likelihood,**

*φ**f*(

_{G}**τ**|

_{l}**Θ**), depends on a demographic model specified

*a priori*and its parameters (

**Θ**) are jointly estimated. The prior distributions for parameters of the demographic model and mutational model are selected from various standard univariate and multivariate distributions.

#### Microsatellite models:

The models of interest in this study were ones that could be approximated by finite state space continuous-time Markov chains to readily incorporate them into existing software for likelihood calculations on trees. We first need to decide on the coding of the data. Unlike nucleotides or amino acids that have a finite state space, the size of the microsatellite state space is ambiguous, because a universal upper bound for length of microsatellites probably cannot be defined (Kruglyak *et al.* 1998). Yet, according to previous observations, the number of repeats in a microsatellite allele rarely exceeds a few tens (Goldstein and Pollock 1997). In addition, it seems sensible to impose a lower bound on repeat length, above which we can expect the characteristic behavior of microsatellite mutation to occur. In this article, an allele with *i* repeats is denoted as *i*. The imposed maximum and minimum lengths of a microsatellite are denoted as *i*_{max} and *i*_{min}, respectively. Therefore there are *s* = *i*_{max} − *i*_{min} + 1 possible states.

Once boundaries are set, it is easy to define the infinitesimal rate matrix of an ergodic Markov chain. The infinitesimal rate matrix *q _{i}*

_{,j}specifies the relative instantaneous rate of allele

*i*mutating to allele

*j*, and the shared lower and upper bounds of

*i*and

*j*are

*i*

_{min}and

*i*

_{max}, respectively. Given the mutation rate (

*μ*), the Markov chain has the transition probability matrix (

*P*),

*p*

_{i}_{,j}(

*μt*) is the probability of mutating from allele

*i*to allele

*j*, in time

*t*.

In our implementation, the infinitesimal rate matrix of the most complex model is parameterized:*i.e.*, let *c* so that −∑_{i}*q′ _{i,i}*π

_{i}= 1.0.

The function *α*(*u*_{0}, *u*_{1}, *u*_{2}, *d*_{0}, *d*_{1}, *d*_{2}, *i*) is the truncated version of the asymmetric quadratic model proposed by Calabrese and Durrett (2003) and accounts for the length dependency of mutation rate and mutational bias by modeling the rate of expansion and contraction as separate quadratic equations. The rate can be symmetric if expansion and contraction share exactly the same quadratic equation; in other words *u*_{0} = *d*_{0}, *u*_{1} = *d*_{1}, and *u*_{2} = *d*_{2}. Equal rate for all lengths is obtained by setting the coefficients of the linear terms and quadratic terms to zero in both equations. Similarly, the rates are modeled as linear functions of the length when *u*_{2} and *d*_{2} are set to 0.

In Equation 3 the *α*-function has symmetric rates. It is standardized so that parameters *u*_{0} and *d*_{0} are fixed to 1.0, because the rate of *i*_{min} must be a positive real number and for any *α*(1, *a*_{1}, *a*_{2}, 1, *a*_{1}, *a*_{2}, *i*) it is equivalent to constant × *α*(1, *a*_{1}, *a*_{2}, 1, *a*_{1}, *a*_{2}, *i*), because of the normalization of the rate matrix.

The focal length is equal to *i*_{f} if *i*_{min} ≤ *i*_{f} ≤ *i*_{max} and is the repeated root of the equation (*u*_{0} − *d*_{0}) + (*u*_{1} − *d*_{1})(*i*_{f} − *i*_{min}) + (*u*_{2} – *d*_{2})(*i*_{f} − *i*_{min})^{2} = 0. At the focal length the rate of expansion and contraction is the same, so given a mutation event, there is equal probability of expansion and contraction.

Although the function *α*(*u*_{0}, *u*_{1}, *u*_{2}, *d*_{0}, *d*_{1}, *d*_{2}, *i*) has taken mutational bias into account, the parameterization may not necessarily provide answers to questions regarding the relationship between mutational bias and microsatellite length. As mentioned earlier, mutational bias can be quantified by the probability of expansion given a mutation event. The function *β*(*b*_{0}, *b*_{1}, *i*) models the probability of expansion by a simple logistic regression.

Both the bias constant parameter, *b*_{0}, and the bias linear parameter, *b*_{1}, take real values from the range (−∞, +∞). The probability of contraction is 1 − *β*(*b*_{0}, *b*_{1}, *i*). This is a modification of the parameterization adopted by Sainudiin *et al.* (2004). They model expansion probability by simple linear regression. The probability of expansion then becomes

It is worth noting that *β*(*b*_{0}, *b*_{1}, *i*) ∈ (0, 1) whereas

Both parameterizations *β* and

To account for larger steps in state space by a single mutation, we employ the parameterization used by Sainudiin *et al.* (2004), which is similar to the TPM proposed by Di Rienzo *et al.* (1994). Under this model, single-repeat mutations have a probability of *p* whereas multirepeat mutations (length of change ≥1 repeat) have a probability of 1 − *p*. For multirepeat mutations, the distribution of step size, |*i* − *j*|, is given by a truncated geometric distribution *γ*(*g*, *i*, *j*). The symbol *g* is the failure probability of the truncated geometric distribution.

#### Stationary distribution:

For an ergodic Markov chain, as time, *t*, approaches positive infinity, its transition probability matrix converges to a matrix in which every single row is the stationary distribution, lim_{t}_{→∞} *P*(*μt*) = 1*π*. As mentioned by Sainudiin *et al.* (2004) the stationary distributions of all one-phase models are special cases of the general birth–death chain.

#### Bayesian model uncertainty:

The output of a Bayesian analysis is the posterior distribution of the parameters given the data. However, the high-dimensional parameter space in a genealogy-based analysis dictates simulation of the posterior distribution by computationally intensive Monte Carlo methods such as MCMC or importance sampling. Here, the posterior distribution is produced by the Metropolis–Hastings MCMC algorithm (Metropolis *et al.* 1953).

In a Bayesian framework, the standard procedure to compare two models is by computing their Bayes factor (BF), which is the ratio of the marginal likelihoods of the two models (*M*_{1} and *M*_{2}):*comparison* are very time consuming. In addition and more importantly, the mutational model may not be of prime interest, *i.e.*, nuisance, and therefore it is not ideal to perform a separate analysis for every mutation model. The solution to this problem is Bayesian model averaging (BMA). We employ transdimensional MCMC to provide joint inference via sampling the microsatellite model indicator, *ν*, to produce the posterior distribution**φ** is a union of parameter vectors of all *n* models of interest, and *M _{ν}*. The marginal posterior distribution of model indicator

*ν*can be obtained from posterior samples of Pr(

**Θ**,

**τ**,

**φ**,

*v*|

*D*), representing the posterior distribution of the microsatellite model. The joint posterior distribution of

**Θ**and

**τ**integrated over the models is

**Θ**and

**τ**.

Our implementation of transdimensional MCMC combines the techniques of BVS (Kuo and Mallick 1998) and pseudopriors or linking densities (Carlin and Chib 1995). Early BVS applications have aimed to solve the problem of variable selection encountered when building a linear regression model. Initially, there is a large number of potential predictors **X _{1}**, … ,

**X**, with values

_{p}*x*,

_{ij}*j*= 1, … ,

*p*and the focus is on determining which of these predictors are linearly associated with the response variable

*Y*. The full model describes the response

*y*as a linear combination of the explanatory variables

_{i}*x*:

_{ij}*ε*∼

_{i}*N*(0,

*φ*that is (statistically) significantly different from 0 suggests predictor

_{j}**X**may help in predicting the response. Conversely, a

_{j}*φ*that does not significantly differ from 0 indicates

_{j}**X**provides little additional information and can be excluded from the model. The variable selection method by Kuo and Mallick (1998) uses an auxiliary binary indicator variable,

_{j}**δ**, of

*P*dimension.

*δ*= 1 indicates the presence and

_{j}*δ*= 0 indicates the absence of the parameter

_{j}*φ*. The full linear model becomes

_{j}*can be considered as the outcome of the function ψ*

_{j}*=*

_{j}*g*(

*φ*,

_{j}*δ*) =

_{j}*φ*. Setting

_{j}δ_{j}*δ*to 0 forces ψ

_{j}*to 0, so that*

_{j}*φ*is effectively excluded from the model. However, even though in such a case the value of

_{j}*φ*has no effect in the likelihood, it is still sampled by the MCMC machinery, but according to its prior distribution only. This means the dimensions of

_{j}*φ*and

*δ*, and hence the model parameter dimension, are not changed even though mutation model parameters are effectively included and excluded in the likelihood during the MCMC. Therefore it does not require the computation of the Jacobian ratio unlike transdimensional MCMC techniques such as reversible-jump (RJ)MCMC (Green 1995).

We augment the parameters in Equation 3 with a set of indicators, each associated with one of the parameters in the most general microsatellite mutation model, to produce a natural nesting of the described microsatellite mutation models.

Our most complex (full) microsatellite model (defined by Equation 3) contains the parameters, φ^{Full} = {*a*_{1}, *a*_{2}, *b*_{0}, *b*_{2}, *g*, *p*}, and any submodel has only a subset of *k*th element in φ^{Full}. We augment **φ ^{Full}** with a binary indicator variable

**δ**= {

*δ*

_{1}, … ,

*δ*

_{6}} to provide a set of toggle switches that can be used to define all nested models of

**φ**. Letting

^{Full}*q*(

_{ij}**φ**) represent Equation 3, the equation that defines the instantaneous rate matrix of the new model is

^{Full}**φ**and

^{Full}**δ**. Even when the value of

*δ*= 0,

_{k}**φ**=

**φ**. If

^{Full}**φ**and

**δ**are assumed independent,

*f*

_{Φ|Δ}(

**φ**|

**δ**)

*f*

_{Δ}(

**δ**) is replaced by

*f*

_{Φ}(

**φ**)

*f*

_{Δ}(

**δ**).

##### Prior on model space:

There are six free parameters in the full model, which theoretically give us 64 submodels. However, parameter *p* cannot be estimated for submodels in which *g* is not a free parameter (*i.e.*, fixed to 0). This is because if *g* is fixed to 0, *p* does not have any effect on the likelihood. Furthermore, when modeling with regression, it is convention to estimate all polynomial terms in the model up to the largest degree considered in the model. We apply this convention to functions *α*(1, *a*_{1}, *a*_{2}, 1, *a*_{1}, *a*_{2}) and *β*(*b*_{0}, *b*_{1}, *i*) in Equation 3. The application of these restrictions to the model space results in a connected subspace of 27 models, and we apply a uniform prior so that the prior probability on each model is 1/27, while the remaining 37 models have a prior probability of 0.0. Figure A1 in appendix a shows the restricted model space.

##### Proposal distributions:

Model switching is performed by two proposal moves, the flip move and the pick move. The flip move uniformly picks an index of the bit vector **δ** at random and performs a flip, whereby the value at that index changes from 0 to 1 or vice versa. For a bit vector of *n* dimension, the probability is 1/*n* for both the flip move and its reverse. The Hastings ratio for *M _{i}* →

*M*is

_{j}*q*(

*M*|

_{i}*M*)/

_{j}*q*(

*M*|

_{j}*M*) = (1/

_{i}*n*)/(1/

*n*) = 1.0. This proposal distribution over bit vector

**δ**is symmetric and therefore no Hastings correction is required for this proposal. Since, in our case, the 27 models with nonzero prior probability form a connected component, the flip move will produce an ergodic and irreducible Markov chain. In effect, the nonhomogeneity of the restricted model space is corrected for by rejection of the neighboring models with zero prior probability, rather than defining a Hastings ratio specifically for the restricted model space (shown in Figure A1 of appendix a).

The pick move, on the other hand, allows larger moves. It selects a model uniformly at random from the set of 27 permitted models. All permitted models have equal probability to be selected, and thus the pick move is symmetric.

##### Pseudopriors:

Pseudopriors or linking densities, a technique used in transdimensional MCMC, were first introduced by Carlin and Chib (1995). Their method considers the situation when there is no overlap among the individual parameter vectors of the *n* models of interest. The parameter “pool” *φ* is therefore simply a concatenation of all model parameter vectors. The joint posterior distribution for *v* and **φ** given data **D** can be written as*M _{ν}*. The expression

*f*(

*v*,

*Y*), but govern the mixing of MCMC as they play the role of jumping distributions in RJMCMC (Green 2003). Appropriate pseudopriors resemble efficient proposal distributions and achieve efficient sampling by preventing extremely unlikely parameter values of

Selecting suitable pseudopriors can overcome the problem of poor mixing encountered when the prior is very different from the posterior for parameters being model averaged. When a parameter is not in the likelihood, values sampled from the prior may have little agreement with the data. Consequently, a parameter may have difficulty reentering the likelihood, resulting in poor mixing.

Godsill (2001) extended the method by Carlin and Chib (1995) to allow arbitrary overlap among parameter vectors *φ*, indicators, *I*(*ν*), map *ν* to the elements of *φ* used by model *M _{v}*. The posterior is expressed as

Carlin and Chib (1995) suggested that the pseudoprior of a parameter *φ* in model *M _{ν}* should match closely the model-specific posterior distributions Pr(

*φ*|

*Mv*). It has been observed in some trial runs that even though two models

*φ*, the marginal posterior distributions

*φ*in

To accommodate augmented mutation model parameter space for a model-specific pseudoprior, the new function for ψ becomes*I*(*k*, *δ*) returns the index of the element in *φ* according to *k*, which indicates the type of parameter, and **δ**, which specifies the currently active model. For example, if *φ _{I}*

_{(k = 3,δ = **11**)}maps to the parameter

#### Tree likelihood computation:

Felsenstein’s (1981) pruning algorithm of tree likelihood computation implicitly sums over all possible ancestral states. For a data type with *s* states and a rooted gene tree with *n* taxa (*n* − 1 ancestral nodes), the pruning algorithm is *O*(*ns*^{2}). The speed of this algorithm is sufficiently fast for analysis to be completed on nucleotide data, which have state space size of 4 (A, T, C, and G). For microsatellite DNA, however, the number of states is many times larger than that of the nucleotide data type, and therefore likelihood calculation is much more time consuming.

One solution to this problem is to avoid summation across all possible states at ancestral nodes by treating unknown ancestral allelic states, *l* is the product of all likelihoods of nodes in a tree,*x* is one of the 2*n* − 2 nodes in the tree excluding the root and *i _{x}* is the state of node

*x*. The parent of node

*x*is denoted as anc(

*x*) and

*t*is the length of the branch that connects

_{x}*x*to anc(

*x*). Following Wilson and Balding (1998), we replace Felsenstein's tree likelihood with the likelihood in Equation 14. The prior probability of the ancestral state in the root node,

For a discussion on the numerical stability see appendix b.

##### Proposal moves for ancestral state sampling:

The candidate allelic state of an ancestral node is proposed by a random-walk integer move, which makes a step from the current allelic state. This move randomly picks direction and step size, which is an integer between 1 and a maximum step size specified by the user. The maximum step size permitted is less than the difference between the upper and lower boundaries of the allelic state. If after a random-walk step the value proposed exceeds the boundaries, then the exceeding proportion of the step is reflected back. Due to the condition on the maximum step size and the type of reflection chosen, the result of a reflection will not be on either boundary. Given maximum step size, *w*, the number of possible combinations of direction and step size is 2*w*. The Hastings ratio is thus the ratio for a move from state *i* to *j* and is given by *H*(*j*, *i*)/*H*(*i*, *j*), where *H*(*i*, *j*) = *h*_{1}(*i*, *j*) + *h*_{2}(*i*, *j*) + *h*_{3}(*i*, *j*). The equations *h*_{1}(*i*, *j*), *h*_{2}(*i*, *j*), and *h*_{3}(*i*, *j*) are given below:

#### Simulations:

After developing the implementation for Bayesian microsatellite analysis, it is of interest to obtain some indication of the accuracy and precision of the estimates. We consider only a subset of the 27 models in the restricted model space. This subset is obtained by setting *a*_{2} and *p* to 0; therefore the most complex model considered here has only four parameters and the bit vector **δ** has four dimensions. Because of the restriction that *b*_{1} is a free parameter only when *b*_{0} is a free parameter, this subset has 12 permitted models instead of 16.

These 12 models resemble the set in Sainudiin *et al.* (2004), except we use simple logistic regression to model mutational bias. For convenience, we use their model naming system. This restricted model space of 12 models is illustrated in Figure 1.

Simulated data were generated under the 12 different microsatellite models from the procedure described below:

One hundred replicate data sets were generated under each microsatellite model.

For each replicate, 30 random coalescent trees were generated, each with 15 individuals assuming a constant population size with

*N*_{e}*μ*= 2.0 (where*N*_{e}is the effective population size of chromosomes and*μ*is the mutation rate representing the number of mutations per microsatellite locus per generation).A microsatellite data type was created with minimum length of 1 repeat unit and a maximum of 30 repeat units.

For each coalescent tree, a microsatellite site pattern was simulated under the microsatellite model with mutation rate equal to 1.0. All site patterns in a trial were simulated under the same microsatellite model. There were 15 sampled haploid individuals in each site pattern.

Each of the 1200 simulated unlinked 30-locus data sets were analyzed with transdimensional MCMC to demonstrate how well our method identifies the true microsatellite mutational model (*M*_{true}). We also compared the accuracy and precision in the demographic estimates between model averaging and when the true model was known. Analyses have chain lengths of 70 million steps with transdimensional MCMC and 50 million steps with the true model. Sampled parameters were recorded every 50,000 steps.

It is also of interest to investigate the effect of the number of taxa and that of loci on the precision of *θ*-estimation using transdimensional MCMC. Data sets were simulated under the PL2 model with different combinations of number of loci and number of taxa presented in supporting information, Table S1, which also includes the MCMC chain length for each combination. One hundred simulations were carried out for each combination. For this set of simulations, we recorded every 10,000th step of the MCMC. The convergence of each analysis was examined by the trace analyses including the computation of the effective sample size (ESS) of each estimated parameter. Table S2 is a summary of the model parameter values that are chosen for simulating the data. All simulated data sets are provided in .csv format in File S1.

##### Measure of accuracy:

The accuracy was measured by computing the relative bias. Here we define relative bias as*θ* is the true population size value and *k*.

Accuracy of estimates may also be indicated from the percentage of trials with 95% highest probability intervals (95% HPD) containing the true answer. The

##### Measure of precision:

The relative error was used to measure the precision of the estimates obtained. We define the relative error as

##### Prior distribution for microsatellite model parameters:

We used a normal(0, 10) prior on both *b*_{0} and *b*_{1}, an exponential(1) on *a*_{1}, and uniform(0,1) on *g*.

##### Pseudopriors:

From test runs it appears that only *b*_{0}, *b*_{1}, and *g* require pseudopriors to reach reasonable convergence. The pseudopriors for each variable are chosen to be tight distributions centered around the true parameter values since they were known. Pseudopriors for each parameter are shown in Table S3.

##### Tree prior:

For the simulations, the tree prior used was the coalescent with constant population (see Kingman 1982 or Griffiths and Tavare 1994 for details on the coalescent likelihood calculation). For inference, the constant population size model was chosen to match the simulation conditions.

The prior density for *θ* was set to one-on-*x* prior, *x* prior is an improper prior; however, in the case of constant population size, it can be shown to be Jeffrey's prior, and its application in this context leads to a proper posterior distribution (Drummond *et al.* 2002, 2004).

##### Microsatellite model prior:

Because we did not have any *a priori* information regarding the microsatellite model, a uniform prior is applied to the set of 12 models considered.

##### Sampling tree topologies:

The tree proposal moves subtree-slide, narrow exchange, wide exchange, and Wilson–Balding are used for tree topology sampling in all the analyses undertaken in this article. A nice summary of these proposal moves is presented in Höhna *et al.* (2008).

#### Red colobus monkey data:

##### Data:

The red colobus monkey (*Pilocolobus tephrosceles*) data set was kindly provided by J. Allen (University of Florida, Gainesville). This unpublished data set consists of 62 samples from each of 16 loci. Each locus was typed for both homologous copies from each of 31 red colobus monkeys from the Kibale National Park of Uganda. These loci are treated as unlinked as no clear signal of linkage had been found (J. Allen, personal communication). The allele lengths and the PCR primers are presented in Table S4 and Table S5.

##### Analyses:

An upper bound of 33 and lower bound of 6 repeats were imposed. We made boundaries wider than the observed length range of the data to account for the possibility that the observed sample range is not the true range in the population. We ran two separate analyses of 200,000,000 for each of the 12 microsatellite models. We also ran two replicate analyses using transdimensional MCMC. To compare mixing and performance between transdimensional MCMC and fixing the microsatellite model, we estimated *θ* from this data set with transdimensional MCMC and each of the 12 models used for simulation. Values for ESS per MCMC step were computed for *θ*, tree likelihoods, coalescent likelihoods, and mutation rates. We accommodated the potential mutation rate variation across loci by estimating the relative rates but fixing the average rate to 1.0.

##### Prior selection:

A uniform Dirichlet prior for 16 dimensions was applied to the relative mutation rates. The tree prior and mutation model parameter priors used for the real data analyses are the same as the ones for analyses of the simulation data. However, because the true values of the mutational parameters are unknown, pseudopriors could not be picked as easily as for simulated data. We took Carlin and Chib's suggestion and obtained preliminary posterior distributions of mutation model parameters by running a short MCMC (of 40,000,000 steps) with each microsatellite model. The posterior densities obtained from these preliminary runs are still quite broad, but they provide sufficient guidance for the selection of pseudopriors. The first 10% of each chain is discarded as burn-in and the remaining chain is used to fit a standard parametric distribution to the posterior sample of each parameter.

The marginal posterior distributions of the microsatellite mutation parameters were fitted using the maximum-likelihood–based fitdistr function in the MASS package (Venables and Ripley 2002) of R (R Development Core Team 2009), a software environment for statistical computing. The fitdistr function returns parameter estimates for a parametric distribution that best describes the posterior sample. The quality of the fit is then examined by the one-sample Kolmogorov–Smirnov test with the null hypothesis that the posterior sample has come from the fitted distribution. Several different parametric distributions were fitted to the sample and the one with the largest Kolmogorov–Smirnov test *P*-value (least evidence against a bad fit) was chosen.

Table S6 presents all the pseudoprior distributions. In this case, the parameter space is augmented so that there is no overlap in model parameter vectors and each parameter has its own model-specific pseudoprior.

## RESULTS

#### Simulations:

For each analysis of simulated data, the first 10% of the chain was discarded as burn-in and analysis of traces confirm that all parameters have ESS > 100. The accuracy of Bayesian model selection by our method is indicated by how often the maximum *a posteriori* model corresponds with the true model, *M*_{True}. Table 1 presents the frequency distribution of *M*_{Best} (the model with maximum posterior probability using our transdimensional MCMC method) for data sets simulated under each *M*_{True}. The highest percentage value in each row is in italics. It is shown that all diagonal values are in italics, which means *M*_{Best} = *M*_{True} has the highest frequency for all 12 models of *M*_{True}.

Accuracy is also indicated by computing the percentage of trials that has the *M*_{True} contained in the 95% HPD set of models. These values are presented in Table 2. A very high proportion (>0.9) of the trials captures *M*_{True} within the 95% HPD set. The median 95% HPD set size is between two and four models, and the majority of the models have a median set size of two, suggesting good precision.

Often, the user is more concerned with the accuracy of other evolutionary or demographic estimates rather than the mutational model *per se*. The values for relative bias, relative error, and relative 95% HPD bounds for the demographic parameter of constant population size are calculated for each trial. Table 3 is a summary of the median relative bias, the median relative error, the median 95% HPD relative bound, and the percentage of trials in which the 95% HPD interval captured the true value of *N*_{e}*μ* = 2.0 for each model.

Estimates with high precision have small values of median relative error or median 95% HPD relative bound. Accurate estimates have small values of median relative bias and a high percentage of 95% HPD intervals containing the true value. Within each row, the values of the four statistics of accuracy and precision are close between BMA and when the true model is known. However, when the true model (TM) is known, the results have greater coverage of the true population parameter value, smaller median relative error, smaller median relative 95% HPD bound, and smaller absolute median relative bias than model-averaged estimates of *θ*.

All model–method combinations had high frequentist coverage varying from 0.86 to 0.99. Given the small number of replicates, these coverage statistics are not significantly different from each other and are all consistent with an underlying proportion in the region of 0.95, although in the Bayesian setting there is no reason to expect coverage to be at the 0.95 level.

For either method, there is a spectrum of median relative error values across the models of *M*_{True}, where the median relative error value is the smallest when the data are simulated under PU1 and largest when data are simulated under PU2.

Similarly, there is quite some variation in the size of the median relative 95% HPD bounds across different models of *M*_{True} for both BMA and analysis with the true model. Median relative 95% HPD bounds are the smallest for data simulated under PL1 and the largest for PU2, which are approximately three times of that for PL1.

##### Precision vs. number of loci:

Values of median relative error and median relative HPD range are plotted against the number of loci (Figure 2), where the number of taxa is fixed to 15. As the number of loci increases, the relative error (Figure 2, dashed blue line) and 95% HPD range (Figure 2, solid red line) reduce linearly (on a log-log scale). It appears that increasing the number of loci to eight times larger will halve the relative error and reduces the 95% HPD range by a factor >2.

##### Precision vs. number of taxa:

Figure 3 shows median relative error and 95% HPD credible intervals as a function of the number of taxa per loci for 10 unlinked loci. Both the error (Figure 3, dashed blue line) and the HPD interval (Figure 3, solid red line) decrease as the number of taxa increases, but they seem to asymptote to some positive limits. Further reduction of error and HPD range can be achieved only by sampling more loci. The function form *y* = *a*_{0}+*a*_{1}/(*a*_{2}+*x*) is chosen merely to illustrate the general trend.

#### Real data example—colobus monkey data:

Tracer (Rambaut and Drummond 2007) was used to decide the length of the chain to be discarded from a log file as burn-in. The burn-in length was ∼10–20% of the original length of each log file. The two logs from analyses with the same microsatellite model were subsequently combined. The combined log files were then examined again by Tracer to investigate mixing and convergence to stationarity. All ESS values were >150.

According to the results from transdimensional MCMC, the model with the highest posterior probability is EU1. The 95% highest posterior probability set consists of EU1, EU2, PL1, EL1, and EL2, with the respective probabilities 0.483, 0.324, 0.053, 0.049, and 0.048. The posterior probability of including a parameter *a*_{1} = 0.089, *b*_{0} = 0.193, *b*_{1} = 0.186, and *g* = 0.412. These posterior probabilities suggest that multistep mutation is the most evident feature followed by mutation bias and rate dependency.

The posterior mean, median, and 95% HPD interval from analyses with each model are presented in Table 4.

The posterior median of *θ* for single-step models ranges from 4.28 to 4.97 and that for multistep models ranges from 3.40 to 3.98. Since a multistep model was sampled almost half the time, the model-averaged posterior median of *θ* is somewhere in between.

##### Mixing and performance:

We use ESS values per MCMC step as an indication of the sampling efficiency. Figure S1, Figure S2, Figure S3, Figure S4, and Figure S5 are dot plots of ESS value per MCMC step for *θ*, tree likelihoods, coalescent likelihoods, root heights, and mutation rates. These plots suggest that the sampling efficiency of transdimensional MCMC is only slightly less than the average sampling efficiency of single-model analyses. The ratio of ESS per MCMC step for transdimensional MCMC *vs.* single-model analyses on average is 0.78 for *θ*. Averaging across loci, the ratio is 0.90 for tree likelihood, 0.74 for coalescent likelihood, 0.71 for root height, and 0.98 for relative mutation rate.

## DISCUSSION

The focus of this research was on the implementation of a nested family of microsatellite mutation models in the BEAST software package (Drummond and Rambaut 2007). There are many analysis tools unique to BEAST, including nonparametric coalescent-based inference methods such as the extended Bayesian skyline plot (Heled and Drummond 2008) and the newly developed multispecies coalescent method *BEAST (Heled and Drummond 2010). Equipping BEAST with microsatellite models and other related software components permits the application of these methods to microsatellite data.

The microsatellite models implemented have all been previously described in the literature (Di Rienzo *et al.* 1994; Fu and Charkraborty 1998; Calabrese and Durrett 2003; Sainudiin *et al.* 2004). It was not intended to introduce new models of microsatellite evolution in this work, except to make slight modifications where it made the models more suitable for Bayesian inference in an MCMC setting. Aside from their implementation, simulations were used to investigate their statistical properties.

#### Simulations:

Simulation results show moderate variation in performance across data sets simulated under different microsatellite models. However, when a smaller number of loci were simulated (10), more biased results were observed (results not shown).

As mentioned in parameter prior specification (see section on *Prior distribution for microsatellite model parameters*), free parameters shared by more than one microsatellite model have been given the same prior distribution for all models containing the parameter. For example, the constant bias parameter, *b*_{0}, is in models EC1, EC2, PC1, and PC2, and the prior distribution for *b*_{0} is the same for all four models. The impact of the prior choices made for the mutational parameters has not been investigated in this study. It is quite possible that different prior choices would have altered the statistical properties of the estimators. For parameters such as *g* in the two-phase models, which are defined on [0, 1], the uniform prior is natural; however, for other scale parameters, a number of alternatives are feasible. Therefore, unsuitable prior distributions on the microsatellite model parameters may be partially responsible for the demographic estimation bias observed.

#### Red colobus monkey data example:

Our results suggest that the convergence speed of transdimensional MCMC is only slightly worse than for single microsatellite model analyses on average. In addition, only one analysis of transdimensional MCMC is required to perform model selection; however, for single-model analyses, we would need as many as the number of models of interest (12 in this case) and thus require a far longer time.

In this analysis we have selected pseudoprior densities by fitting univariate distributions to densities acquired from preliminary runs. The procedure can become much less time consuming if an empirical density function is used, coupled with the automation of the input file preparation for preliminary runs and the transdimensional MCMC. However, poorly mixed preliminary runs may still yield pseudopriors that are very different from the posterior and thus offer little improvement in mixing of the analysis with transdimensional MCMC. If human data are analyzed, then appropriate pseudopriors may also be constructed from the wealth of empirical data on the mutation parameters from sperm-typing (Zhang *et al.* 1994) and pedigree studies (Weber and Wong 1993; Xu *et al.* 2000; Whittaker *et al.* 2003).

#### Model averaging and Bayes factors:

In addition to providing a model-averaged posterior distribution of population and genealogical parameters, our transdimensional MCMC method also facilitates robust estimates of the marginal likelihoods (and therefore Bayes factors) of the individual microsatellite models. These estimates are not subject to the large, even infinite variances (Raftery *et al.* 2007; Calderheada and Girolami 2009) associated with the harmonic mean estimator of marginal likelihoods (Newton and Raftery 1994). Using transdimensional MCMC to estimate Bayes factors is also computationally efficient as it requires only a single MCMC run to determine the relative merits of all *k* mutational models, rather than the *k* (or more) independent runs required by other techniques, including thermodynamic integration (Lartillot and Philippe 2006).

#### Issues and improvement:

##### Low information content in a single microsatellite locus:

Low information content of a single microsatellite locus means inference results may be sensitive to poor prior choices. In comparison to microsatellite data, mtDNA sequence data possess much more information available for reconstruction of the genetic ancestry. As a result, an mtDNA tree has a higher level of resolution than a microsatellite tree. However, this does not mean inference on mtDNA sequence data is more accurate than that on microsatellite data. Given a population history, the coalescent admits wide variation in the topology and coalescent times of the gene trees. To make a more reliable inference, it is important to use multiple loci, each of which has an independent history (Felsenstein 2006). Even though mtDNA sequence data provide a clear view of the genealogy of the mtDNA sequences, the whole mtDNA genome is a completely linked locus. The mtDNA tree therefore provides only one of the many possible realizations of the coalescent process for a given population history. On the contrary, there are potentially thousands of independent microsatellite loci available to overcome the problem of stochastic variability of individual genealogies.

##### Speed and convergence:

A large microsatellite data set with hundreds of loci may give very accurate population size estimates, but is currently not practical in our implementation, due to slow convergence and computational inefficiencies. An analysis for a data set containing ∼60 unlinked loci and 100 taxa requires days to satisfy our heuristic diagnostic statistics for convergence when all the loci were used simultaneously in the same MCMC run. The slow convergence is due to the large joint parameter space when all loci are unlinked. The parameter space containing 60 independent 100-tip trees is much larger than that having a single 100-tip tree with 60 linked sites. One potential solution is sequential Monte Carlo methods (Liu 2001), which take advantage of the independence structure of the likelihood to build up a full posterior distribution by sequential analysis of the loci (De Finetti 1974). Besides the speed, there is also the need to improve the efficiency of sampling ancestral states. Our implementation samples ancestral states by a naive Metropolis–Hastings algorithm and therefore has low acceptance probability. Gibbs sampling (Geman and Geman 1984) is an alternative MCMC algorithm that can sometimes produce more efficient sampling. It is a special case of the Metropolis–Hastings algorithm, whereby each proposed candidate is always accepted, since the components of the state that change in the proposal are drawn directly from the conditional posterior distribution. Gibbs sampling of internal nodes may improve the convergence.

#### Comparison with other software:

Well-known software programs such as BATWING (Wilson *et al.* 2003) and Migrate (Beerli 2004) also provide Bayesian coalescent analysis on microsatellite data. However, these programs contain only a few simple microsatellite models. BATWING provides the SMM and the *K*-allele model; microsatellite model options in Migrate are the SMM (called the ladder model in Migrate) and Brownian motion (an approximation of the SMM). These models do not take into account many properties of microsatellite mutation and as mentioned in the Introduction there is much evidence for those properties. Therefore the simplifying assumptions of the SMM may not be adequate to perform inference on real data. Furthermore, we provide model averaging over a rich set of microsatellite models, which is absent in these programs.

In the case of BATWING, all microsatellites are assumed to be linked into a single locus. It was discussed earlier that incorporation of multiple loci is necessary for accurate inference. Using only a single locus overlooks the genome-wide distribution of microsatellites, a highly advantageous trait for coalescent inference.

#### Future directions:

In all analyses in this study, all loci shared the same (model-averaged) microsatellite model within an MCMC run. It is possible that the properties of mutation vary across loci. While our implementation allows for variation in both rates and microsatellite models across loci, we have not performed a systematic study of the properties of such models. A hierarchical prior structure can account for variation of the same component in different models. For example, every locus could have its own EC1 model, containing the parameter *b*_{0}. During the MCMC, each *b*_{0} varies according to a given distribution, and the parameters of this distribution (hyperparameters) also have a prior.

Our framework for analysis of microsatellite data can be combined with the multispecies coalescent (Heled and Drummond 2010) to estimate the species tree using multiple microsatellite loci sampled from closely related species. Although microsatellite models can be used alongside various relaxed-clock models in BEAST (Drummond *et al.* 2006) to estimate divergence times, we do not recommend this type of analysis, because each microsatellite locus does not have sufficient information to estimate rate heterogeneity among branches.

Model selection has always been an important problem in statistical inference. It is common to make inferences on the basis of the best model selected by a standard model comparison procedure. However, such a procedure may produce a subset of models that are not significantly different from one another in their goodness-of-fit and therefore create difficulty in deciding which model provides the most reliable inference. Our transdimensional method allows the data to speak for themselves and more importantly makes population inference on the basis of a set of microsatellite models, accounting for model uncertainty and avoiding model misspecification.

## APPENDIX A: RESTRICTED MODEL SPACE

Figure A1 represents the restricted model space. The nodes represent the models, each labeled with its bit vector representation, and the arrow-edges are the Hastings ratios of a move from one model to it neighbor. Two models are neighbors if they have only a single difference. The nodes (models) are color coded according to the number of neighbors they have.

In this case the restricted model space is one connected component. However, some prior specification with zero probabilities on a subset of models may result in two or more disjointed components. In such cases the flip move alone cannot produce an ergodic Markov chain; therefore the pick move must be used so that all the models in the restricted space can be proposed.

## APPENDIX B: NUMERICAL STABILITY

We used a few shortcuts in the tree likelihood calculation. If a proposal move results only in a change of likelihood on a few branches, then we subtract the initial logarithm of partial tree likelihood from the logarithm of full tree likelihood and add the new log partial tree likelihood. This is more efficient than computing the entire full log likelihood.

We have found that if only partial tree likelihood calculation is used, the difference in likelihood between partial and entire likelihood calculation at each step increases as the MCMC proceeds. However, a step that requires entire likelihood calculation will set this difference back to 0. Luckily, in a real analysis, there are many other parameters that will force the calculation of the entire full likelihood rather frequently. We have also provided the option so that the user can force the entire full likelihood computation every *n* number of likelihood computations.

Another component that is relevant to the numerical stability of the method is matrix exponentiation. Matrix exponentiation is achieved by using codes adapted from Cern Colt library 1.2 (http://acs.lbl.gov/software/colt/, more details in appendix c). To ensure that matrix exponentiation of the CERN colt library is reliable, we compare the matrix exponentiation results computed from our codes with from the function MatrixExp in the msm package (Jackson 2009) of R.

Comparisons were made for five values of *t* (0.001, 0.01, 0.1, 1, and 10) and 12 different *Q* matrices, which are the instantaneous matrices under which our data were simulated. There was very little difference between the two exponential methods (all differences <10^{−13}).

## APPENDIX C: MATRIX EXPONENTIATION

The method of matrix exponentiation here requires the *Q* matrix to be diagonalizable. This requirement is checked by using the method described in Gentle (2007). If the method indicates that the requirement is not met, a 0 probability matrix is returned, which leads to the rejection of the proposal move. The code adapted from Cern COLT library 1.2 performs an eigen decomposition of the instantaneous rate matrix, so the matrix *Q* can be expressed as *Q* = *VUV*^{−1}, where *V* is a matrix of eigenvectors and the *U* is a diagonal matrix of eigenvalues. Given some value of *t*, the exponential of *Qt* is then obtained by finding the matrix product *e ^{Qt}* =

*Ve*

^{Ut}V^{−1}. The expression

*e*is also a diagonal matrix, with diagonal values

^{Ut}*u*represents the

_{ii}*i*th diagonal value.

## Acknowledgments

The computer simulations in this article were performed using computational resources provided by BestGRID (http://www.bestgrid.org/), a New Zealand not-for-profit organization that delivers services and tools supporting research. We thank Raazesh Sainudiin and three anonymous reviewers for helpful comments. We thank J. Allen for the unpublished red colobus monkey data set. The collection of the red colobus monkey data set was funded by the National Science Foundation under a grant to D. L. Reed (DEB 0717165). The authors were supported by Marsden grant UOA0809.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.125260/DC1.

- Received November 16, 2010.
- Accepted February 18, 2011.

- Copyright © 2011 by the Genetics Society of America

Available freely online through the author-supported open access option.