## Abstract

The rate of microsatellite mutation is dependent upon both the allele length and the repeat motif, but the exact nature of this relationship is still unknown. We analyzed data on the inheritance of human Y-chromosomal microsatellites in father–son duos, taken from 24 published reports and comprising 15,285 directly observable meioses. At the six microsatellites analyzed (DYS19, DYS389I, DYS390, DYS391, DYS392, and DYS393), a total of 162 mutations were observed. For each locus, we employed a maximum-likelihood approach to evaluate one of several single-step mutation models on the basis of the data. For five of the six loci considered, a novel logistic mutation model was found to provide the best fit according to Akaike’s information criterion. This implies that the mutation probability at the loci increases (nonlinearly) with allele length at a rate that differs between upward and downward mutations. For DYS392, the best fit was provided by a linear model in which upward and downward mutation probabilities increase equally with allele length. This is the first study to empirically compare different microsatellite mutation models in a locus-specific fashion.

“MICROSATELLITES”, also known as “short tandem repeats” (STRs), are stretches of DNA in which a short piece of sequence (1–6 bp, the repeat motif) is tandemly repeated a number of times. Microsatellites are abundant throughout the human genome (Lander *et al.* 2001). They have high mutation rates, which renders them useful for a number of practical applications, including genetic epidemiology (Jobling and Tyler-Smith 2003), gene mapping (Weissenbach 1993), studies of population history (Jobling and Tyler-Smith 2003), genetic fingerprinting (Krawczak and Schmidtke 1998), kinship testing (Weir *et al.* 2006), and genetic ancestry analysis (Shriver and Kittles 2004). Microsatellites also may have played a major role in genome evolution (Kashi and King 2006). Microsatellite mutations typically comprise the gain or loss of 1 repeat unit, but larger changes of the repeat number in a single meiosis also have been observed (see below).

Various statistical models have been proposed for the microsatellite mutation process (Calabrese and Sainudiin 2005), but no model has yet been identified as representing the single best, thereby testifying to the complexity of the mutation process. In any case, it is well established that the mutation rate of microsatellites (i) depends upon the respective repeat number (Xu *et al.* 2000; Lai and Sun 2003; Kelkar *et al.* 2008) and (ii) varies greatly across loci, which is partly due to the variable nature of the repeat motif itself (Kelkar *et al.* 2008).

The stepwise mutation model (SMM) was the first one used to describe microsatellite evolution. The SMM implies that each mutation adds or deletes 1 repeat unit at a time with a symmetric mutation rate that is independent of the repeat number (Ohta and Kimura 1973). The SMM was used in simulation-based attempts to explain the observed population patterns of microsatellite allele frequencies (Shriver *et al.* 1993; Valdes *et al.* 1993). Since then, various modifications of the SMM have been proposed. For example, the two-phase model (Di Rienzo *et al.* 1994) combines the SMM with geometrically distributed mutational changes of more than 1 repeat unit. This model was itself generalized (Garza *et al.* 1995), allowing for any prespecified distribution of the repeat number change. Other approaches included lower and upper bounds for the allele length (Nauta and Weissing 1996; Feldman *et al.* 1997).

Several microsatellite mutation models have been proposed to account for the relationship between the mutation rate and the respective allele length (Falush and Iwasa 1999). More restricted versions of this approach entail the idea that only the difference in repeat number before and after mutation, but not the mutation rate itself, depends upon the allele length (Garza *et al.* 1995; Kimmel *et al.* 1996). Kruglyak, Durrett, and co-workers (Kruglyak *et al.* 1998; Durrett and Kruglyak 1999) combined the SMM with the possibility of (rare) point mutations that reduce the repeat number. This model, later expanded (Calabrese *et al.* 2001; Dieringer and Schlötterer 2003), is motivated by the idea that a point mutation divides a microsatellite into two parts, only one of which is kept track of.

Finally, Whittaker *et al.* (2003) proposed an exponential microsatellite mutation model with parameters α_{u}, α_{d}, γ_{u}, γ_{d}, λ_{u}, and λ_{d} from an unspecified parameter space. Their transition probabilities (*cf*. *Materials and Methods*) from paternal allele *i* to offspring allele *j* were as follows:However, in mathematical terms this model is not well defined because, for fixed parameters with α_{u}, γ_{u} > 0, *p _{i}*

_{,}

_{i}_{+1}may exceed 1 for large

*i*, and

*p*

_{i}_{,}

_{i}_{+1}even approaches infinity as

*i*→ ∞. Nevertheless, the model can be remedied if written as an instantaneous rate matrix over a finite state space as in Calabrese and Sainudiin (2005).

In the present study, we compared different models of microsatellite mutation, taking into account the locus dependency of the mutation process by evaluating each model seperately for each locus. We chose to confine our study to microsatellites from the male-specific region of the human Y chromosome (Y-STRs) for which mutations can be inferred directly from genotype data on father–son duos. A large amount of such data have become available for opportunistic use in scientific analyses because large numbers of father–son pairs are regularly genotyped in paternity cases. Moreover, these data allow us to directly infer the one-step probability matrix of the discrete-time Markov chain model scaled in generations, as opposed to the continuous-time Markov process usually applied to evolutionary genomic data.

## Materials and Methods

### Y-STRs

In forensics as well as for other practical applications, a few commercially available, PCR-based kits are widely used for microsatellite genotyping (Mayntz-Press and Ballantyne 2007). The Y chromosome is of particular interest in forensics because it allows easy discrimination between male and female contributions to traces and trace mixtures. This has led to a situation where a few Y-STRs are routinely genotyped in a vast number of individuals (Roewer 2009). Table 1 lists the six Y-STRs that have been included in the present study, along with their most important molecular characteristics. For labeling Y-STR loci and alleles, we employed the nomenclature recommended by the International Society of Forensic Genetics (Gusmão *et al.* 2006). In particular, the allele designations refer to the actual number of repeats rather than the crude PCR fragment length.

### Mutation data

Table 2 lists 24 published reports of directly observed Y-STR mutations (or the lack thereof) in father–son pairs, which formed the basis of our present study. Taken together, these reports comprised 15,285 father–son pairs in whom 162 mutations were observed across the six Y-STRs considered. Paternity was confirmed in all cases by independent genotyping of autosomal loci. All studies were original reports so that there should be no overlap between the samples included. Most of the recent reports included the population frequencies of the paternal alleles along with the mutation data, as was recommended by the International Society of Forensic Genetics in 2006 (Gusmão *et al.* 2006). This information is required for fitting mutation models with allele-dependent mutation rates. For studies that did not provide paternal allele frequencies, we tried to approximate these as follows: For four studies (Kayser *et al.* 2000; Budowle *et al.* 2005; Ge *et al.* 2009; Goedbloed *et al.* 2009), estimates were equated to the respective allele frequencies as reported for the relevant subpopulations in the Y Chromosome Haplotype Reference Database (Willuweit and Roewer 2007), Release 31. For the remaining studies, we adopted the allele frequencies reported in the studies themselves, although these estimates included not only fathers, but also unrelated individuals.

### Mutation models

The microsatellite mutation models considered here share a number of characteristics. Thus, alleles are consistently represented by an integer corresponding to the number of repeats. In a father-to-son transmission, an allele may change according to a probability distribution that depends only upon the paternal allele. Mathematically, this means that each model entails a time-homogeneous Markov chain (*X _{n}*)

_{n}_{∈ℤ}with state space

*S*= ℤ, the set of all integers. Each mutation model is thus defined by specifying, for all

*i*,

*j*∈ ℤ, transition probabilities

*p*from paternal allele

_{ij}*i*to offspring allele

*j*. For values of

*j*not explicitly addressed in the definitions given below, we assume

*p*= 0. Since the data considered here contained only three mutations that changed the paternal allele by more than 1 repeat unit (Table 3), we confined our analyses to single-step models, characterized by

_{ij}*p*= 0 for all

_{ij}*i*,

*j*∈ ℤ with |

*j*−

*i*| > 1.

#### Stepwise mutation model:

The most popular model of microsatellite mutation is the SMM, originally proposed by Ohta and Kimura (1973), although in a slightly different context. The parameters of the SMM are 0 < μ_{u}, μ_{d} < 0.5, and the transition probabilities areParameters μ_{u} and μ_{d} correspond to the upward and downward mutation rates, respectively. The symmetric SMM is a special case in which μ_{u} = μ_{d} = μ.

#### Linear model:

To be able to properly define the linear model (Kruglyak *et al.* 1998), we must restrict the state space to *S* = {1, 2, . . . , *k*}, with *k* ∈ ℕ fixed. The linear model has parameters , and the transition probabilities arefor all *i* ∈ *S*\{1, *k*}, *j* ∈ *S*. Furthermore, *p*_{12} = ν_{u}, *p*_{11} = 1 − ν_{u}, *p _{k}*

_{,}

_{k}_{−1}=

*k*⋅ ν

_{d}, and

*p*= 1 −

_{kk}*k*⋅ ν

_{d}. The symmetric linear model is a special case in which ν

_{u}= ν

_{d}= ν.

#### Logistic model:

Here, we newly introduce the logistic model, defined on state space *S* = ℤ. To our knowledge, this mutation model has not been evaluated yet, neither theoretically nor empirically. In its most general form, the logistic model has parameters α_{u}, α_{d} ∈ ℝ, β_{u}, β_{d} ∈ ℝ^{+}, and γ_{u}, γ_{d} ∈ (0,0.5) and is characterized by transition probabilitiesThus, α_{u}, β_{u}, and γ_{u} together with the repeat number determine the probability of a gain of 1 repeat unit. Similarly, α_{d}, β_{d}, and γ_{d} determine the probability of a downward mutation. Specifically, the α-values determine the increase in mutation rate with increasing allele length while the γ-values can be interpreted as maximum mutation probabilities (mutation probabilities for very long alleles). More formally, lim_{i}_{→∞} *p _{i}*

_{,}

_{i}_{+1}= γ

_{u}if α

_{u}> 0, and lim

_{i}_{→∞}

*p*

_{i}_{,}

_{i−}_{1}= γ

_{d}if α

_{d}> 0. Parameter β can be thought of as the repeat number that corresponds to an intermediate mutation probability .

### Statistical approach

Similar to Whittaker *et al.* (2003), we employed a maximum-likelihood approach (Pawitan 2001) to estimate the parameters of a given mutation model. Assuming pairwise independence of the father-to-son transmissions, the likelihood *L* of a fixed parameter vector θ equals the product of the respective transition probabilities, taken over all observed father–son duos. Thus (1)where *N* is the number of observed father–son duos and *p _{i}*(θ) is the respective transition probability (with parameter vector θ) for the

*i*th father–son duo. Note that Equation 1 includes all observed father–son duos, not only those in whom a mutation occurred, because transition probabilities include the probability of the Markov chain staying in the current state. This is why paternal allele frequencies are required for alleles not involved in any observed mutation.

Since the likelihoods in Equation 1 are typically very small, it is convenient to use logarithms; *i.e.*, (2)For each of the six Y-STRs and for each model considered, we maximized the log-likelihood function ℒ using numerical optimization (Nocedal and Wright 2006) as implemented in the *R* environment (Ihaka and Gentleman 1996), particularly the nlminb function from the stats package.

To allow comparison of unnested mutation models that have different numbers of parameters, we used Akaike’s information criterion (AIC), computed as (3)where *k* is the number of parameters in the mutation model of interest (Akaike 1973; Burnham and Anderson 2002). The smaller the AIC value is for a given Y-STR, the better the model fit. In the case of nested models, any additional parameter has to improve the model fit by at least 1 log-likelihood unit to reduce the AIC value. As a general rule of thumb, all models with an AIC value less than 2 units above the smallest AIC value warrant further attention (Burnham and Anderson 2002).

## Results

### Observed mutations

The observed distribution of mutations (Table 3) reveals a consistent pattern of upward bias for shorter alleles and downward bias for longer alleles. The only apparent exception is DYS392, but since only seven mutations were observed at this locus, the observed pattern still seems inconclusive.

All observed mutations were single step except for a −2 mutation at DYS391 and one +2 mutation each at DYS390 and DYS391. These mutations were counted as −1 and +1, respectively, for the purpose of model fitting.

### Model fitting

A straightforward approach to estimate locus- and allele-specific mutation probabilities would be by way of observed relative frequencies of mutations (Figure 1). This simplistic assessment already reveals an asymmetry between upward and downward mutations. In particular, long alleles appear to have high downward mutation probabilities. However, since allele-specific probability estimates are not reliable for alleles with few or no observed mutations, a model-based approach appeared warranted. Moreover, mathematical models potentially yield additional insights into the biological nature of the microsatellite mutation process. We used a maximum-likelihood approach to evaluate several mutation models using the Y-STR data summarized in Table 3. The resulting parameter estimates for some of these models are given in Table 4.

Inspection of the model- and locus-specific maximum log-likelihoods and AIC values (Table 5) reveals a remarkable consistency across the first four loci (DYS19, DYS389I, DYS390, and DYS391). For each of these, the symmetric SMM was found to be only a crude approximation, and the asymmetric SMM was not significantly better. The linear model turned out to be superior to the SMM although allowance for asymmetry again did not seem to improve the model fit substantially. The logistic model provided the best fit to the available data for all four loci, and it was found to be substantially better than both the SMM and the linear model. Within the logistic models, symmetry with respect to γ seems to be acceptable whereas α and β apparently need to be asymmetric. Thus, the newly introduced logistic model with parameters α_{u}, α_{d}, β_{u}, β_{d}, and γ turned out to provide the best fit according to the AIC, closely followed by the full logistic model. All other model fits were significantly worse.

For DYS392, the symmetric linear model yielded the smallest AIC, but this should be interpreted with caution because only few mutations were observed at this locus (Table 3). For DYS393, the logistic model with symmetrical α-, β-, and γ-values was the best fit according to AIC, but other versions of the logistic model yielded very similar AIC values.

A plot of three models fitted to the DYS19 data (Figure 2) serves to illustrate further that the novel logistic model is appropriate to explain the respective mutation data. The corresponding plots for the other five loci look similar and are therefore not shown.

## Discussion

It is well known that the pattern of microsatellite mutation varies across loci (Kelkar *et al.* 2008). To our knowledge, however, the present study is the first to systematically compare novel as well as previously described microsatellite mutation models for Y-STRs in a locus-specific fashion. This comparison was made possible by the accumulation of suitable genotype data from 15,285 father–son duos. The major advantage of this type of data is that all father–son relationships had been confirmed by independent genotyping of other markers. In studies using deep pedigree data for mutational analyses, this is typically not the case (Heyer *et al.* 1997), which renders discrimination between false paternity records and genuine mutations notoriously difficult. Furthermore, by basing our analysis on directly observed mutations (*i.e.*, Y-STR mutations in father–son duos), we avoided the need for additional assumptions about the underlying population dynamics, mating behavior, or selective pressure. This is a clear advantage over studies that sought to investigate microsatellite mutation processes by comparing distantly related genomes (Dieringer and Schlötterer 2003).

In our analysis, we also avoided the complicating effects of recombination through choosing loci from the male-specific region of the Y chromosome. Although this restriction may at first glance seem to limit the general applicability of our results, it may be surmised that Y-chromosomal and autosomal microsatellite loci obey similar mutation models because they have similar biochemical properties and because replication slippage is responsible for STR mutations in both instances (Heyer *et al.* 1997; Kayser *et al.* 2000). This contrasts with minisatellite mutations, where recombination plays a significant role (Buard *et al.* 2000).

One caveat of our study is that the loci considered were originally selected for forensic applications because of their high variability. Therefore, we cannot exclude that our parameter estimates are biased toward higher mutation probabilities, but this seems unlikely to affect the general conclusion as to which models are most appropriate for microsatellites.

As was mentioned before, many statistical models have been proposed for the microsatellite mutation process (Calabrese and Sainudiin 2005). However, only a few of these turned out to be applicable to our data. For example, the model proposed by Kruglyak *et al.* (1998), which includes point mutations, was not deemed relevant to our study because, with the genotyping systems used in forensics, point mutations are not altering repeat counts (Gusmão *et al.* 2006). Moreover, we chose to restrict ourselves to one-step models owing to the scarcity of data on multistep mutations (Table 3). The three instances of mutations resulting in a change by 2 repeat units in our data were counted as single-step changes for model-fitting purposes. This concerned two of the six loci considered, for which the models should therefore be interpreted as dichotomizing all possible mutation events into up- and downward mutations, regardless of step size. However, since multistep mutations are very rare, this dichotomization should not affect our conclusions substantially. Nevertheless, should more data on multistep mutations become available in the future, a study of more sophisticated models may become worthwhile.

Our study was in part inspired by Whittaker *et al.* (2003), who were the first to suggest the use of maximum likelihood to fit microsatellite mutation models. However, as explained above, their exponential mutation model was not well defined, resulting in unbounded mutation probabilities. This problem does not occur with a logistic model, which to our knowledge has not been investigated before, because in the logistic model mutation probabilities are always bounded by parameter γ. Notably, for small repeat numbers, Whittaker’s model and the logistic model are similar in that both entail an exponential increase in mutation probabilities with increasing repeat number. Qualitatively, the logistic model is also similar to the best models emerging from genome comparisons, *e.g.*, the PL1 model in Sainudiin *et al.* (2004). The main features in both instances are an allele-length–dependent mutation rate and a confinement to single-step mutations.

In principle, it would be possible to combine mutation models to obtain better fits. For example, a model with linearly increasing upward mutation probabilities but with downward mutation probabilities according to the logistic model fits our data for DYS390 and DYS391 somewhat better than the logistic model alone (ΔAIC = −21.9 and ΔAIC = −37.7, respectively, *cf*. Table 5). However, we decided to focus on pure models here to not exacerbate the multiple-testing problem.

Practical applications of our results are vast because many uses of microsatellite data require estimates of the respective mutation probabilities. The logistic model, which was shown here to provide the best fit to empirical mutation data, is readily applicable to likelihood-based kinship analysis, phylogenetic analysis, and coalescence methods used in population genetics. Our statistical evaluation of mutation models may also contribute to a better understanding of the underlying biological mutation mechanisms. In particular, the fact that the combined models fit the data better than the original ones suggests possible differences between the mechanisms of upward and downward mutation.

With these applications in mind, gathering of further mutation data, for example, in an international STR mutation database, seems to be warranted. With a growing database, it will become possible to further refine parameter estimates as well as the models themselves.

## Acknowledgments

We thank all authors whose Y-STR mutation data were included in the present study and Lutz Roewer and Sascha Willuweit for providing access to summary statistics for the Y Chromosome Haplotype Reference Database (Willuweit and Roewer 2007). We thank Raazesh Sainudiin and an anonymous reviewer for helpful suggestions.

## Footnotes

*Communicating editor: M. A. Beaumont*

- Received July 6, 2011.
- Accepted September 23, 2011.

- Copyright © 2011 by the Genetics Society of America