## Abstract

Bayesian posterior distributions are obtained for the time to the most recent common ancestor (MRCA) for a nonrecombining segment of DNA (such as the nonpseudoautosomal arm of the Y chromosome or the mitochondrial genome) for two individuals given that they match at *k* out of *n* scored markers. We argue that the distribution of the time *t* to the MRCA is the most natural measure of relatedness for such nonrecombining regions. Both an infinite-alleles (no recurring mutants) and stepwise mutation model are examined, and these agree well when *n* is moderate to large and *k*/*n* is close to one. As expected, the infinite alleles model underestimates *t* relative to the stepwise model. Using a modest number (20) of microsatellite markers is sufficient to obtain reasonably precise estimates of *t* for individuals separated by 200 or less generations. Hence, the multilocus haplotypes of two individuals can be used not only to date very deep ancestry but also rather recent ancestry as well. Finally, our results have forensic implications in that a complete match at all markers between a suspect and a sample excludes only a modest subset of the population unless a very large number of markers (>500 microsatellites) are used.

MOLECULAR marker information has proven an invaluable tool for assessing the degree of relatedness between individuals. To date, most uses of marker information have been concerned with zero-, one- or two-, or deep-generation relatives. By zero-generation, we mean matching/rejecting a forensic sample and a suspect. One- or two-generation assessment includes paternity testing and assessing the degree of relatedness between individuals in natural populations. Typically, these tests have very low power for detecting relatives more distant than sibs and first cousins (Queller and Goodnight 1989; Lynch and Ritland 1999). Finally, human geneticists have been very successful in using marker information to assess very deep relationships, on the order of hundreds (or more typically thousands) of generations. Many of these deep relationship studies have used the haplotypes of nonrecombining chromosomes, such as the nonpseudoautosomal arm of the Y (*e.g.*, Hammer 1995; Deka*et al.* 1996; Skorecki*et al.* 1997; Bianchi*et al.* 1998; Kittles*et al.* 1998; Wilson and Balding 1998; Thomas*et al.* 2000) and mitochrondial DNA (*e.g.*, Torroni*et al.* 1994; Merriwether*et al.* 1995; Forster*et al.* 1996; Brown*et al.* 1998; Stone and Stoneking 1998; Torroni*et al.* 1998). Here we show how haplotype information can also be used to estimate the age of the common ancestor for individuals sharing an intermediate ancestry (tens to hundreds of generations) with reasonable precision. Since our comparison is restricted to the time to the common ancestor for a particular pair of individuals of interest, the fine details of the population history and structure do not enter into the analysis, other than very weakly through the mean of the assumed prior (as discussed below).

For unlinked markers, the product rule (multiplying single-locus genotype probabilities together to obtain a multilocus genotype probability) applied to highly polymorphic loci allows just a few (5-10) unlinked markers to be quite sufficient for identifying individuals that share a common relative in the last generation (such as parent-offspring or sibs). However, because of recombination, unlinked markers have very weak power for distinguishing individuals sharing deeper common ancestry. While there is a growing body of literature on estimating relatedness of two individuals given autosomal marker information (Thompson 1975; Queller and Goodnight 1989; Blouin*et al.* 1996; Ritland 1996; Marshall*et al.* 1998; Lynch and Ritland 1999), it is less clear how to proceed when using markers from nonrecombining DNAs. The product rule does not hold for such regions, as the markers are inherited as a single block.

The key to assessing the amount of relatedness using markers on a nonrecombining chromosome is that any two individuals (indeed, the entire population) will have a most recent common ancestor for that region. This follows from coalescence theory (Hudson 1991; Donnelly and Tavaré 1995), which shows that in a demographically stable population the expected time back to this most recent common ancestor (MRCA) follows a geometric distribution with parameter λ = 1/*N*_{e}, the inverse of the appropriate measure of effective population size [the effective numbers *N*_{m} and *N*_{f} of males and females for the Y and mitochondrial (mt)DNAs, respectively]. Using marker information, we can estimate the time *t* to the MRCA, and it is the distribution of *t* that provides a natural metric for describing the relatedness (at least for that region) between two individuals. It is important to stress this difference between the MRCA of a region of DNA and the MRCA for two individuals. Two individuals can easily share an ancestor that is more recent than their MRCA for a particular DNA region. Hence, estimates of the time to the MRCA using a particular DNA region provide an upper bound on the time back to the most recent common ancestor shared by two individuals.

Here we develop a Bayesian estimator for *t*, obtaining the complete posterior distribution for the time *t* to the MRCA. We assume that *n* markers are scored on the nonrecombining chromosome of interest (either the Y or mtDNA) and that we observe matches in allelic state at *k* of these. We start by assuming the infinite alleles model, where each mutation is assumed to be unique. We then modify our results by assuming a (symmetric single-step) stepwise mutational model, which is a more appropriate descriptor for microsatellite markers. As we show, when *k*/*n* is close to one and *n* moderate to large, the two different mutational models give essentially the same distribution for *t.*

## THE INFINITE ALLELES MODEL

Suppose we score the allelic states at *n* defined markers on a nonrecombining segment of DNA for two individuals and we wish to estimate the time back to their MRCA (for this segment). We first assume for each marker that (at most) only a single mutation has occurred over both lineages leading from the MRCA to the two individuals being considered. This is not an unreasonable assumption if the individuals match at most markers. We refer to this model as the infinite alleles model, as our development is also exact for the situation where each mutation is unique so that matches occur only when the marker locus in both lines has not mutated. Finally, we assume that the markers are exchangeable in the sense that the per-generation mutation rate μ is the same for each locus. We relax these assumptions below.

Let *q*(*t*) be the probability of a match (at any given single marker) between two individuals with a most recent common ancestor *t* generations ago. The number of matches (*k*) out of *n* loci follows a binomial distribution, with
*t* generations is simply (1 - μ)^{2}* ^{t}*, as the probability of no mutations occurring in the lineage from the ancestor to individual one is (1 - μ)

*, with a similar probability from the MRCA to individual two. Hence,*

^{t}*t*)(1 + 2μ

*t*),

*i.e.*, on the order of (2μ

*t*)

^{2}exp(-2μ

*t*). Thus if 2μ

*t*≪ 1, the effects of such multiple mutations have only a trivial effect on increasing

*q*(

*t*) over the value assuming no mutations (see Figure 4). When a specific value of μ is required, we generally use μ = 1/500 = 0.20%, motivated by estimates for Y chromosome microsatellites of 0.28% (Kayser

*et al.*2000) and 0.21% (Heyer

*et al.*1997), which are very similar to the estimated mutation rates of 0.1 to 0.21% for autosomal microsatellites (Weber and Wong 1993; Brinkmann

*et al.*1998). We note that these mutation rate estimates are generally done by scoring microsatellites already known to be polymorphic, which introduces a slight ascertainment bias. However, since we assume the markers being scored are also chosen because they are known to be polymorphic (in the population as a whole), then these potentially biased estimates of mutation rates are appropriate for our analysis.

From Equations 1 and 2a, the resulting likelihood for the time *t* back to the MRCA given that we observe *k* out of *n* matches is
*L*) equal to zero gives the maximum-likelihood estimate (MLE) for *t* is highly positively skewed, resulting in a considerable variance and highly asymmetric confidence intervals about the MLE. In particular, note that the MLE is zero for all values of *n* when there are no mismatches (*k* = *n*), which tells us nothing about the possible restrictions on the maximal time back to the MRCA (see Fu and Li 1996 and Donnelly*et al.* 1996 for a related discussion).

## BAYESIAN POSTERIOR DISTRIBUTIONS FOR TIME TO MRCA

While the MLE describes one feature of the distribution of *t* (the mode), the most complete picture is given by the full posterior distribution of *t*, which can be obtained by a Bayesian analysis (*e.g.*, Lee 1997). Such an analysis proceeds from Bayes’ theorem, with the posterior distribution *p*(*t*|*k*) being proportional to the product of a prior distribution *p*(*t*) for *t* and a likelihood *L*(*t*|*n, k*) given the data (*k* out of *n* matches),

The main objection to a Bayesian analysis raised by non-Bayesians is that the choice of a prior is often very subjective and, as such, this can greatly bias the posterior. For the time back to the MRCA, an appropriate prior naturally follows from standard coalescence theory, as the expected time back to a MRCA under pure drift in an effective population size of *N*_{e} follows the geometric distribution with success parameter *e.g.*, Wilson and Balding 1998). The parameter is 1/*N*_{e} in this case [as opposed to 1/(2*N*_{e}) for an autosomal gene] because the uniparental inheritance means that both mtDNA and the Y chromosome are essentially haploid. As summarized by Hammer (1995), estimates for *N*_{e} based on the standing level of variation at presumably neutral markers are on the order of 5000 in humans for both mtDNA and the male-specific region of the Y chromosome.

Treating time as continuous, the geometric prior is equivalent to using an exponential distribution with hyperparameter *k* ≪ *n*) and the effective population size is extremely small. Thus the prior is both well motivated and the choice of the prior hyperparameter (λ) has very little effect on the final (posterior) distribution in most cases.

Recalling Equation 3, the resulting posterior distribution becomes
*k* ≫ λ, alternate choices of λ have little effect on the posterior distribution. Since *N*_{e}μ*k* ≫ 1. For a mutation rate of μ = 1/500, the choice of *N*_{e} (and hence λ) has essentially no effect on the prior provided *N*_{e}*k* ≫ 250, which is a very mild restriction.

Returning to the posterior distribution, the normalizing constant is easily computed by expanding the (1 - *e*^{-2μ}* ^{t}*)

^{n}^{-}

*term, noting that we can express the function being integrated as*

^{k}With a flat prior (λ = 0) the normalizing term further simplifies to

For zero marker mismatches (*k* = *n*), the posterior is simply an exponential distribution with parameter λ + 2*n*μ,
* _{t}*) and standard deviation (σ

*) for the time to MRCA when there are no mismatches are*

_{t}Likewise, the cumulative probability distribution for the time back to the MRCA is just
*T*_{α} satisfying Pr(*t* ≤ *T*_{α}) = α is given by
*n* marker loci, there is a 90% probability they shared a MRCA in the last 1.15/(μ*n*) generations. The values for 50, 95, and 99% are 0.347, 1.498, and 2.303 (μ*n*)^{-1} generations, respectively.

The posterior distributions with one or more mismatches also follow from Equation 12. For example, for one (*k* = *n* - 1) and two (*k* = *n* - 2) mismatches, the posteriors are
*k* again follow by expanding the (1 - *e*^{-2μ}* ^{t}*)

^{n}^{-}

*term. Define*

^{k}*t*to the MRCA, given

*k*of

*n*marker loci match, a prior with hyperparameter λ, and a per marker mutation rate of μ, are given by

Figures 1, 2, 3 plot the posterior distributions corresponding to different numbers of matches (*k*) for *n* = 5, 10, 20, 50, and 100 markers under the assumption that μ = 1/500 and there is a flat prior (λ = 0). Provided that *N*_{e} ≫ 250, the results are the same for any other prior using λ = 1/*N*_{e}.

Table 1 summarizes the mean, standard deviation (SD), and mode (the MLE) for the posterior distributions as well as the 2.5, 50, 90, and 97.5% cutoff values. Note that a 95% credible region for *t* is given by the 2.5 and 97.5% values. As expected, the distribution of *t* is highly skewed, as mode < median < mean. The distribution becomes increasingly skewed to the right as the number of mismatches increases, which is reflected in not only an increase in the mean but also in the variance. However, note that the mean/SD ratio declines with increasing numbers of mismatches, so that the coefficient of variation declines as *k* decreases.

Finally, note that the resolution for *t* offered by using *n* = 5 markers is very poor, but rather fine precision is offered by using 100 markers. While scoring the latter number of markers may be unrealistic, using 20 markers is both feasible as well as offering reasonable precision.

## DIFFERENTIAL MUTATION RATES

As our knowledge of the parameters associated with the mutational process continues to improve, it is likely that we may find significant differences in the mutation rates at different markers. Fortunately, it is straightforward to modify the likelihood function (Equation 3) to take this into account. Suppose *n* markers are examined, generating the matching data *x*_{1},..., *x _{n}*, where the

*x*are coded as

_{i}*e*

^{-2}

^{t}^{μ}

*, which generally involves only a few terms unless there are a significant number of mismatches.*

^{i}For the simplest case of no mismatches (*k* = *n*), the posterior for the time to the MRCA becomes
*i*), the posterior becomes
*i.*

### CORRECTING FOR MULTIPLE HITS: THE STEPWISE MUTATION MODEL

Microsatellites, given their higher mutation rates compared to single nucleotide polymorphisms (SNPs), are clearly the marker of choice for estimating the time to MRCA when the individuals are assumed to be at least modestly related. As microsatellite “alleles” correspond to different lengths of the repeat unit in the microsatellite array, the infinite alleles model assumed previously is not appropriate if multiple mutations are expected in any given marker. Since mutations change the number of repeats (and hence the size) of an array, two (or more) mutations can recover the initial state found in the MRCA. Likewise, parallel mutations in both lineages leading from the MRCA can also lead to the two individuals sharing the same allelic state, even though mutations have occurred. Thus, using an infinite alleles model for microsatellites will tend to underestimate the time to the MRCA. To examine the severity of this underestimation, we consider the divergence, assuming a stepwise mutational model (SMM) assuming equal probabilities of an up (increase array size by one) or down (decrease array size by one) move (the symmetric single-step SMM). The roots of the SMM trace from Ohta and Kimura’s (1973) model of charge differences in electrophoretically scored proteins. A number of workers have shown this model to be a good fit for microsatellites in both indirect studies examining the distribution of array sizes in natural populations (Edwards*et al.* 1992; Shriver*et al.* 1993; Valdes*et al.* 1993; Di Rienzo*et al.* 1994) and direct studies looking at actual mutations arising in pedigrees (Brinkmann*et al.* 1998; Kayser*et al.* 2000). In the latter two studies, the vast majority (35 out of 37) of new mutations were single step, while the remaining (2 out of 37) were two step.

Denote the allelic state (array size) in the MRCA as state 0, and let *X _{t}* denote the array size at time

*t.*As before we assume a per-generation mutation rate of μ. Under this model, the transition probabilities between states become

To apply this model, we need to compute *q*(*t*), the probability of a match between two lineages sharing a MRCA *t* generations ago. A little thought shows that for the one-step model allelic states in two lineages can only match if an even number (2*M*) of mutations have occurred. The appendix shows that

Since the probability of 2*M* total mutations along both lineages assuming *t* generations back to the MRCA is given by a Poisson distribution with parameter 2*t*μ, we have the probability of a match conditioned on *t* as
_{0} denotes the zero-order modified type I Bessel function (Olver 1964). Hence, under the single-step mutational model, the probability of a match after τ = 2μ*t* generations is