## Abstract

We present a method for estimating the age of a mutation based on the genetic length of ancestral haplotypes shared between individuals carrying the mutation. The method can be reliably applied to small samples, typical of situations involving rare mutations, and makes effective use of modern high-density SNP data, thus overcoming two of the limitations with existing methods. The method provides age estimates and confidence intervals without the use of asymptotic theory and is applicable to genealogies in which the data are independent or correlated. In the correlated case we estimate the correlation directly from the data, rather than relying on a model for the genealogy. To demonstrate the method’s efficacy, we provide simulation results and compare it to other methods. The length data are obtained with a simple procedure, and an R script is available for performing the calculations.

ESTIMATING the age of mutations, *i.e.*, “dating” mutations, has become a fascination for geneticists, particularly since a number of classic studies in the 1990s (*e.g.*, Serre *et al.* 1990; Risch *et al.* 1995; Stephens *et al.* 1998) to the present day (*e.g.*, Holm *et al.* 2011; Fu *et al.* 2012). This interest has stimulated the development of a number of methods for dating mutations. The main limitations with existing methods, however, are that: (1) they cannot be applied reliably in situations involving *rare* mutations where sample sizes are typically *small* and (2) they cannot make effective use of today’s high-density SNP data (derived from genotyping arrays or genome sequencing).

Common methods for dating mutations fall into three broad categories: methods based on the *frequency* of the mutation, methods based on *gene trees*, and methods based on the decay of *linkage disequilibrium* (LD) around the mutation (see Slatkin and Rannala 2000; Rannala and Bertorelle 2001; Colombo 2008). Frequency methods model the way a mutation’s population frequency changes over time and use the mutation’s current frequency to estimate its age. The most well-known method of this sort is that proposed by Kimura and Ohta (1973), which assumes a constant population size and an absence of natural selection. Other frequency methods include those of Thompson (1976) and Griffiths and Tavaré (1998). Frequency methods are not well suited to small sample situations involving rare mutations since obtaining an accurate frequency estimate in such situations is difficult.

Gene tree methods infer mutation age from a sample of mutation-bearing chromosomes using observed variation in sequence or marker data (at polymorphic loci) completely linked to the mutation. A mutation model is proposed for the process that generates the observed variation (*e.g.*, a Poisson process model with an infinitely many sites assumption), and a model (*e.g.*, a coalescent process) is proposed for the genealogy of the sample (*e.g.*, Thomson 1998; Griffiths and Tavaré 1999; Markovtsova *et al.* 2000) or the genealogy is inferred directly using distance or maximum parsimony methods (*e.g.*, Morral *et al.* 1994). The observed variation (presumed to have arisen over time) is then used to infer the mutation’s age. Gene tree methods are not well suited to dating rare mutations using SNP data. Rare mutations are relatively young, and so sufficient time will usually not have elapsed since the mutation’s creation to generate the required variation (see Rannala and Bertorelle 2001). If enough time has elapsed for variation to arise, this variation will typically not be observable in SNP data (sequence or rapidly evolving microsatellite data would be the only viable options).

Methods based on LD decay make use of the fact that the mutation will be in LD with nearby marker alleles at polymorphic loci, and the decay of this LD over time (through recombination) provides information about the mutation’s age. The most well-known method of this sort is given in Serre *et al.* (1990). Variations on the Serre method include those of Stephens *et al.* (1998) and Reich and Goldstein (1999). Serre’s method (and its variations) estimates a mutation’s age by using information from a single marker locus. To utilize information available from other marker loci, some have applied the Serre method to each locus separately and then combined each of the resulting age estimates into a single estimate, *e.g.*, by averaging (see Holm *et al.* 2011). The efficacy of the Serre approach ultimately rests on the asymptotic properties of the estimate. We show, however, that these properties do not hold for small samples.

Methods that combine the frequency approach with one of the other approaches have also been proposed: some methods combine the frequency and gene tree approaches (*e.g.*, Slatkin and Rannala 1997; Stephens 2000), and some methods combine the frequency and LD approaches (*e.g.*, Rannala and Reeve 2003). Since these methods require the mutation’s frequency, they have the same problem mentioned above for plain frequency methods: they are not suited to small sample situations.

An alternative LD approach is based on the following biological picture: present-day mutation “carriers” will have inherited the mutation as segments of the ancestral chromosome that originally acquired the mutation, where these segments will differ in size due to random recombination events (Figure 1). These inherited ancestral segments (harboring the mutation) decrease in size over time, and so their *lengths* can be used to estimate the mutation’s age. Genin *et al.* (2004) and McPeek and Strahs (1999) (originally developed for fine mapping, but also provides a method for dating mutations) are two methods based on this approach. While the Genin method is applicable to small samples, it can utilize only data from highly polymorphic loci (*e.g.*, microsatellite loci), not high-density SNP data. The McPeek and Strahs method is also unsuited to high-density SNP data, not because it cannot be applied to SNP loci, but because the method “infers” the ancestral segment lengths using a hidden Markov model, and the required computations rapidly become impractical as the data density increases to today’s standards. The McPeek and Strahs method also makes extensive use of asymptotic techniques. However, we show that these techniques are not effective for small samples. While others have studied the lengths of ancestral segments segregating in a population (*e.g.*, Stam 1980; Chapman and Thompson 2002; Baird *et al.* 2003), the two methods just discussed are the only approaches which use the lengths of segments harboring a mutation to estimate the mutation’s age.

The method we propose, which we call the *Gamma method*, is also based on ancestral segment lengths. Unlike the methods above, however, we utilize high-density SNP data, inferring the segment lengths with a simple procedure. The method provides age estimates and confidence intervals without the use of asymptotic theory, making it applicable to small samples and thus suitable for typical studies involving rare mutations. To establish the method’s efficacy, we provide simulation results and compare it to other LD methods.

We make two important comments before proceeding. First, all methods considered below are concerned with dating mutations on *recombining* parts of the genome (*i.e.*, not mitochondrial and Y chromosome mutations), where the term “mutation” refers to any kind of variation of DNA sequence, *i.e.*, a nonsynonymous substitution, an insertion, or a deletion (care may be needed when dating large insertions or deletions). Second, the term “mutation age” has become ambiguous in the literature: it can mean the time since the mutation first came into existence, or it can mean the time since the most recent common ancestor (MRCA) of the sample (which is less than or equal to the time in the first sense). The term is near universally used to refer to the time since the MRCA, since the majority of methods estimate the age in that sense, and so we also use the term this way. Often the aim is to estimate age in the MRCA sense anyway (*e.g.*, cases involving a founder effect). However, if the mutation arose in the population from which a random sample is taken, then estimating the MRCA age will be approximately equivalent to estimating the time since the mutation came into existence.

## Materials and Methods

### Age estimation and confidence intervals

We begin with a statistical model describing the genetic length *l* of an ancestral segment (measured in morgans) and the way that length changes over time (measured in generations). Assuming only the Haldane recombination model, the length of the left and right “arms” of the ancestral segment (measured from the mutation) both have independent Exponential(*τ*) distributions, and so (1)where *τ* denotes the age of the mutation (see *Appendix A*). The same result was obtained by McPeek and Strahs (1999), by other means. This model implies that the mean ancestral segment length is 2/*τ*, *i.e.*, the mean length decreases over time, consistent with our expectations. The essential idea of our approach is to obtain ancestral segment lengths from a sample of “unrelated” individuals carrying the mutation and then estimate the *τ* parameter in (1) by *maximum likelihood*. In reality the sampled individuals are not unrelated, since the mutation is tacitly assumed to be identical by descent, and so the term “unrelated” is used here to mean individuals who are separated by a reasonable number of meiosis events; *e.g.*, the individuals should not be siblings or close cousins. We develop two versions of the method: one applies to samples assumed to have an *independent* genealogy (*i.e.*, sampled individuals have independent recombinational histories since their MRCA) and one applies to samples assumed to have a *correlated* genealogy (*i.e.*, a “tree-like” genealogy, where subsets of the sample have common ancestry earlier than the MRCA for the entire sample, thus producing correlated observations).

First, assume that the sample has an independent genealogy. In that case, the segment lengths *l*_{1}, *l*_{2}, …, *l _{n}* of a sample of size

*n*are independent and identically distributed with

*l*∼ Gamma(2,

_{i}*τ*), from which it follows that the maximum-likelihood estimate (MLE) of

*τ*is (2)The MLE is a biased estimate of

*τ*, particularly for small samples, and so for our purposes it is desirable to remove this bias. Properties of the Gamma distribution imply that (3)from which we can obtain the minimum variance unbiased estimate of

*τ*, given by (4)where

*b*(

*n*) = (2

*n*− 1)/2

*n*is the bias correction factor. Properties of the Gamma distribution also imply that (5)

*i.e.*, is a pivotal quantity from which

*exact*confidence intervals for

*τ*are easily derived (see

*Appendix B*).

Now assume that the sample has a correlated genealogy. In this case, the segment lengths *l*_{1}, *l*_{2}, …, *l _{n}* are dependent random variables. We assume that all lengths have the same pairwise correlation

*ρ*,

*i.e.*, Corr(

*l*,

_{i}*l*) =

_{j}*ρ*for all

*i*≠

*j*. Note that this assumption is far weaker than it appears, since for our purposes

*ρ*will act (roughly) as an average or summary of the correlation present among the observations. To estimate

*τ*in this setting we begin indirectly by estimating

*τ*

^{−1}with (6)since this remains an unbiased estimate of

*τ*

^{−1}regardless of the dependence among the

*l*variables. We now seek the distribution of . Inspired by the result in the case involving independent variables,

_{i}*i.e.*, result (3), we suppose that has approximately a Gamma(

*α*,

*β*) distribution, for some

*α*and

*β*. To obtain expressions for

*α*and

*β*, we first observe (see

*Appendix C*) that the mean and variance of is (7)and so equating these expressions to the mean and variance of a Gamma(

*α*,

*β*) random variable,

*i.e.*,

*α*/

*β*and

*α*/

*β*

^{2}, respectively, we easily obtain (8)Thus, we arrive at the following distribution for : (9)where (10)We refer to this as the

*Gamma approximation*. We use the

*n*

^{∗}notation to simultaneously highlight the similarity with the case under independence,

*i.e.*, result (3), and to suggest an enlightening interpretation for the effect of the dependence: it effectively

*reduces*the sample size;

*i.e.*, the estimate is calculated with a sample of size

*n*but behaves as if it came from a smaller sample of size

*n*

^{∗}(since

*n*

^{∗}≤

*n*). We emphasize that (9) is not an asymptotic result valid only for large

*n*; it is an approximation valid for all sample sizes.

Given the Gamma approximation, by using arguments similar to those leading to (4) and (5) above (described in *Appendix B*), we arrive at the following further results. First, (11)is an unbiased estimate of *τ*, where *b*(*n*^{∗}) is the same bias correction factor as in the case under independence, except that *n* is now replaced by *n*^{∗} in the formula. Second, (12)from which confidence intervals for *τ* are easily obtained.

The correlation *ρ* can be estimated directly from the sample of lengths *l*_{1}, *l*_{2}, …, *l _{n}* using the estimatewhere and

*S*

^{2}are the sample mean and variance, respectively (see

*Appendix D*). While , some values of produce inappropriate values of

*n*

^{∗}, and in that event the calculation of

*n*

^{∗}is adjusted accordingly (see

*Appendix D*).

### Obtaining the data

To identify ancestral segments in sampled individuals, we apply the following procedure: line up high-density SNP data around the mutation locus, look for *continuous* haplotype sharing between *two or more* individuals, and then take this sharing to define the ancestral segments (Figure 2). The segment lengths are obtained from the genetic map positions of the outermost markers. This procedure is reasonable since such sharing will typically involve hundreds of consecutive markers, which can be explained only if the shared haplotypes are inherited from a common ancestor, *i.e.*, if they are ancestral segments.

To assess haplotype sharing for individuals with a *single* copy of the mutation (*e.g.*, with a dominant mutation), the data need to be *phased* to resolve the single underlying ancestral haplotype. Similarly, for individuals with copies of two distinct mutations at the same gene locus (*i.e.*, with a compound heterozygous mutation), the data also require phasing to reveal the two distinct ancestral haplotypes (both mutations can then be dated). Phasing is easily achieved if parental data are collected, or with modern statistical techniques (see review by Browning and Browning 2011). We note that other LD methods also require data to be phased in these situations. However, for individuals with *two* copies of an identical recessive mutation, the situation is particularly simple: since such individuals inherit two ancestral haplotypes via an inbreeding loop, producing homozygosity along the haplotype, continuous haplotype sharing becomes equivalent to continuous sharing of *identical homozygous markers*; *i.e.*, phasing is not required (if either ancestral haplotype shortens while in the inbreeding loop then the shorter copy will be revealed by this criterion).

Since haplotype sharing is defined only between two or more individuals, the two *longest* arms on either side of the mutation stop sharing at the *same* point, but the true ancestral segment in one of the individuals will almost always continue some distance further (Figure 2). This extra length cannot be detected by the sharing criterion and so results in a biased age estimate, with the bias being more pronounced for small samples. To correct this, the missing length is estimated from the data. Let *a*_{missing} denote the length of one of these missing arm sections. Since the ancestral segment arm lengths have an Exponential(*τ*) distribution, and since the Exponential distribution is memoryless, it follows that the distribution of *a*_{missing} is Exponential(*τ*). Thus, we can estimate *a*_{missing} by estimating its expected length, which is *τ*^{−1}. Using the inferred arm lengths *a*_{1}, *a*_{2}, …, *a*_{2}* _{n}*, we can estimate

*τ*

^{−1}using (13)which, apart from the two truncated observations, is an unbiased estimate of

*τ*

^{−1}, regardless of whether the data are independent or correlated. Thus, to correct the bias problem, the sum of the lengths in the age estimates (4) and (11),

*i.e.*, , is replaced with (14)where the accounts for the two missing lengths.

Three other factors can affect the length data, obtained using the procedure above. First, chance haplotype sharing is possible at the very end of each segment, having the effect of inflating the inferred lengths. If high-density data with a range of allele frequencies are used, the amount of chance sharing is extremely small. In most circumstances the effect is unnoticeable; a small effect becomes noticeable only when the mutation is very old. Nevertheless, we build in a simple correction that covers these situations (see *Appendix E*). Second, the effect of mutation and errors in the marker data are unnoticeable. Modern marker data (*i.e.*, SNP data) have low mutation rates, and modern genotyping methods (*e.g.*, SNP arrays) produce highly reliable genotype calls of maker loci. Moreover, with the density of modern data, mutations and errors are nearly always obvious: continuous sharing interrupted by one of these features will typically be followed by another large run of continuous sharing. These features may not be so obvious near segment ends, but in that case they have only a minor influence. Finally, in correlated genealogies there is a possibility that some sampled individuals have extra haplotype sharing (around the ancestral haplotype) as a result of common ancestry more recent than the entire sample. Provided the mutation is not too young and/or the sample is not very large, the chances of a random sample including individuals with common ancestry recent enough to produce this problem are low. For correlated genealogies, we assume that recombination has removed any excess sharing of this sort in individuals who have common ancestry more recent than the entire sample.

### Simulation and comparison

To evaluate the Gamma method, we apply it to ancestral segment lengths simulated using SNP data from HapMap 3 (International HapMap 3 Consortium 2010). Phased chromosome 1 data from the CEPH population (Utah residents with ancestry from northern and western Europe) was used, with ancestral segment lengths simulated according to the Haldane recombination model around a nominated mutation locus. We selected loci that overlapped those targeted by a standard genotyping array (an Illumina 1-M array); this represented 2/3 or ∼68K loci targeted by the array on that chromosome.

To generate a sample with an independent genealogy, each ancestral segment is independently simulated using the same HapMap chromosome (designated the ancestral chromosome) and then the resulting segments are substituted into a set of unrelated HapMap chromosomes at corresponding loci. For correlated genealogies, we wanted to simulate under *weak* assumptions and so we use a simple *forward simulation* approach. To generate a sample with a correlated genealogy we begin with a single HapMap chromosome (designated the ancestral chromosome) and then ancestral segments are simulated as follows: (1) extant segments in each generation produce a Poisson distributed (rate *λ* = 2.5) number of offspring, (2) each segment is then passed on to its offspring with probability 1/2, each time undergoing a round of recombination, and (3) this is continued for the required number of generations, then a random sample of segments is taken from the resulting population and substituted into a set of unrelated HapMap chromosomes at corresponding loci. For both the independent and correlated cases, the ancestral segment lengths are inferred from the resulting chromosomes using the procedure described above. Then the method is applied to these inferred lengths, using the version of the method that applies to the assumed kind of genealogy.

We compare our approach to other LD methods. We do not include frequency or gene tree methods in our comparison for two reasons: (1) our approach is based solely on LD (the mutation’s frequency is not part of our data) and (2) our concern is dating rare mutations from small samples using dense marker data (frequency and gene tree methods are not suited to these situations, as explained above). We compare it to the McPeek and Strahs (1999) method, since this is the only method based on haplotype lengths that can be applied to dense marker data, and we compare it to the Serre *et al.* (1990) method, since this is the paradigm LD decay method and by far the most widely used.

Using a sample of ancestral segment lengths *l*_{1}, *l*_{2}, …, *l _{n}*, the McPeek and Strahs age estimate (for both independent and correlated genealogies) is given by (15)which is the MLE of

*τ*in the Gamma model for ancestral segment length. Since the data density makes the use of their software unfeasible, we use the lengths inferred with our procedure. They construct confidence intervals using asymptotic MLE theory: the result that (16)is used for independent genealogies, where is the estimated Fisher information for

*τ*

^{−1}and the result that (17)is used for correlated genealogies, where is the same and , which assumes that all lengths have identical pairwise correlation

*ρ*. They model correlated genealogies using a

*conditional coalescent*process. Under that model they show that (18)(This is the corrected expression reported in errata to their article.)

The Serre age estimate (regardless of the assumed genealogy) is given by (19)where *θ* is the recombination frequency between the mutation *m* and a marker allele *α* (assumed to be ancestral, *i.e.*, linked when the mutation was created) at a polymorphic locus, *f _{α}* is the population frequency of

*α*on normal chromosomes (assumed known and constant over time), and is the proportion of sampled mutant chromosomes with

*α*at the locus. For sensible estimates we require (otherwise ), and we require (otherwise is not a real number). For the single-locus version of the method we apply this formula to randomly selected loci, and for the multilocus version we apply the formula to every locus where it can be applied and then average the results. To be as fair as possible, we use only loci within the bounds of the longest segment arms (alleles at these loci are the only ones guaranteed to be ancestral), and the proportion is calculated using the knowledge of which allele is ancestral. In practice, the point at which loci should stop being used to estimate age will be unknown, and it will not always be obvious which of the alleles at a particular locus is ancestral (particularly with small samples). There are currently no proposals in the literature for calculating confidence intervals that take into account all sources of variation. However, for the single-locus version assuming an independent genealogy, since is the MLE under binomial sampling, confidence intervals can be calculated using the delta method: (20)where (21)Bootstrap methods for confidence intervals cannot be employed effectively, since the requirement that places significant restrictions on the loci that could be used for dating: we would require

*f*< 1/

_{α}*n*to guarantee that can be calculated for all possible bootstrap resamples. As far as we know, this is the first reported simulation study of the Serre method.

We evaluate performance of the methods by calculating the mean of the point estimates (to assess bias), nominal confidence interval coverage (we nominated 95% coverage), and confidence interval length (which is related to the variance of the estimate and of relevant practical interest). By assessing performance on these dimensions, we capture even more than would be learned from the mean square error (MSE) of the estimators.

## Results

We show simulation results for the true age *τ* = 50 generations (≈ 1000–1500 years old, assuming 20 to 25-year average generation span), with 1000 replicates for each of six sample sizes *n* = 5, 10, 15, 20, 25, 30 (Figure 3).

For both independent and correlated genealogies, the Gamma method performs well: the age estimate is unbiased and the confidence interval has the desired coverage. The only exception is for independent genealogies with sample size *n* = 5, where there is a very small upward bias and a slight decrease in confidence interval coverage. Importantly, for small samples sizes, the Gamma method performs better than the McPeek and Strahs and Serre methods: (1) for small samples the age estimates of these methods have a large bias and are only without bias for *n* > 30, (2) the methods do not achieve the desired confidence interval coverage (an exception being the McPeek and Strahs method for *n* ≥ 30 in the independent genealogy case), and (3) the confidence interval length for the Gamma method is typically much smaller than for the other methods. Similar results are obtained for *τ* = 20 and *τ* = 100 (see Supporting Information, Figure S1 and Figure S2).

The primary reason the McPeek and Strahs method does not perform as well, for these sample sizes, is its reliance on the asymptotic properties of the MLE used to estimate age. To compound this problem, for correlated genealogies the correlation is not accurately estimated by their formula, which explains why the confidence intervals in that case do not achieve the correct coverage. Similarly, the Serre method does not perform well with independent genealogies because it relies on asymptotic properties of the age estimate. The performance of the Serre method for correlated genealogies is not totally unexpected, since it has already been reported that the method is biased in this case (see Slatkin and Rannala 2000). We have now shown, however, that there is almost as much bias for the case involving independent genealogies.

## Discussion

The Gamma method overcomes two of the limitations with existing methods for dating mutations: (1) the method can be reliably applied in small sample situations, making it ideal for studies involving rare mutations and (2) the method makes effective use of modern high-density SNP data. Since the method is not based on asymptotic theory, it performs better than other methods for small samples. The method provides estimates and confidence intervals for *both* independent and correlated genealogies. This is an advantage over other LD methods in which confidence intervals are either not provided or provided only under the assumption of an independent genealogy (the McPeek and Strahs method is a notable exception). For correlated genealogies, since the method estimates the correlation directly from the data, instead of relying on idealized models of the genealogy (*e.g.*, coalescent models), it performs better in those situations. The fact that the method does not depend on a population genetics model for the genealogy represents an advantage when applying the method in practice.

The Gamma method has other advantages. First, the method does not require knowledge of the population growth rate or selective forces acting on the mutation, two factors that can significantly influence other approaches. Population growth influences the structure of correlated genealogies, but since the correlation is estimated directly from the data in the Gamma method, it can take these effects into account. Natural selection can effect a mutation’s frequency, but there is no plausible mechanism whereby it can affect the lengths of ancestral segments (other LD methods are similarly unaffected). Some methods take these factors into account (*e.g.*, Slatkin and Rannala 1997; Rannala and Reeve 2003), requiring them as user-specified parameters, but this can be a disadvantage in practice, since their magnitude is often poorly known. We note here that some have inferred the presence of selection by comparing a frequency estimate and an LD estimate of age (*e.g.*, Stephens *et al.* 1998), the former being affected by selection while the latter is not. It is also worth speculating that LD age estimates might provide a basis for making inferences about a mutation’s frequency; LD is related to age, and age is related to frequency.

Second, the Gamma method has some flexibility regarding the data it can be applied to. Unlike some LD methods, the Gamma method is not limited to one data type; *e.g.*, it can be applied to microsatellite data. However, while the typical heterozygosity of such data essentially eliminates the problem of chance sharing (*i.e.*, inferring inflated lengths), depending on the size of the gaps between marker loci, termination points of ancestral segments might be obtained only approximately, leading to approximate age estimates. Depending on the aim of the study, however, approximate results may be sufficient. It is the sheer density of modern SNP data that essentially eliminates the problem of chance sharing, despite the low heterozygosity of the loci, and allows ancestral segment lengths to be obtained with precision. Moreover, unlike other LD methods, marker allele frequencies are not necessarily required for the Gamma method: allele frequencies are needed only when correcting for subtle chance sharing effects in certain situations (see *Appendix E*), but this correction is optional. This can be an advantage in situations where allele frequencies are not always available (*e.g.*, genome sequencing data).

One large disadvantage of the Gamma method, like all other LD methods, is that it is not useful for dating *ancient* mutations where the LD surrounding the mutation has decayed so much as to make dating infeasible. In such cases, frequency or gene tree methods are the only viable options.

A simple R script is available (http://bioinf.wehi.edu.au/software/ageofmutation), which performs the calculations of the Gamma method given a sample of ancestral segment lengths as input, obtained using the straightforward procedure described above. No extensive computations are required. The Gamma method approach has been successfully applied in Corbett *et al.* (2011), Lomax *et al.* (2013), and Lim *et al.* (2014).

## Acknowledgments

We thank Helen Lindsay for several helpful discussions concerning the simulation study. This work was made possible through Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIISS.

## Appendix A: Derivation of the Gamma model

Our derivation assumes the Haldane recombination model: the genetic length between adjacent recombination events on a single chromosome is exponentially distributed with rate 1 (where this rate is set by definition). (The Haldane model does not take *genetic interference* into account, and consequently is not the most accurate model of recombination. Despite this, the model will be a good approximation for our purposes, since the genetic distances that concern us are short enough that multiple recombination events are extremely unlikely within those distances.) No further assumptions are required.

Let *a*_{L} and *a*_{R} denote the lengths of the left and right ancestral segment “arms” measured from the mutation. Suppose that a mutation has been created on the ancestral chromosome, and consider subsequent recombination events to the *right* of the mutation. Let *D*_{1}, *D*_{2}, …, *D _{τ}* denote the distances from the mutation to the next recombination point (to the right) in each of the

*τ*generations since the mutation’s creation. Since the exponential distribution is

*memoryless*, the Haldane model implies that

*D*

_{1},

*D*

_{2}, …,

*D*are Exponential(1) variables. These variables are independent, since the recombination process in each generation occurs independently.

_{τ}Now observe that the right arm length *a*_{R} is the *minimum* of the *D*_{1}, *D*_{2}, …, *D _{τ}* variables; since the ancestral segment never gets longer, the recombination “bite” closest to the mutation occurring over the

*τ*generations,

*i.e.*, the minimum of the

*D*’s, marks the extent of the ancestral segment. That is,

_{i}*a*

_{R}= min{

*D*

_{1},

*D*

_{2}, …,

*D*}. But, the minimum of

_{τ}*k*independent Exponential(

*λ*) variables has an Exponential(

*kλ*) distribution. Thus,

*a*

_{R}∼ Exponential(

*τ*). A similar argument shows that

*a*

_{L}∼ Exponential(

*τ*).

The *a*_{L} and *a*_{R} are independent, which follows from the memoryless property. Hence, since the sum of two independent Exponential(*λ*) variables has a Gamma(2, *λ*) distribution, and since *l* = *a*_{L} + *a*_{R}, it follows that *l* ∼ Gamma(2, *τ*).

## Appendix B: MLE, Bias Correction Factor, and Confidence Intervals

Under the assumption that a sample of lengths *l*_{1}, *l*_{2}, …, *l _{n}* is independent and identically distributed with

*l*∼ Gamma(2,

_{i}*τ*), it is straightforward to show that the MLE of

*τ*is(This amounts to assuming that each ancestral segment is the same age,

*i.e.*, that the

*τ*associated with each segment is the same, an assumption that should hold, at least approximately, in practice.) To calculate the bias of this estimate and to calculate confidence intervals for

*τ*we make use of the following fact: if

*X*

_{1},

*X*

_{2}, …,

*X*are independent and identically distributed Gamma(

_{n}*k*,

*λ*) variables, then

To calculate the bias of , observe that fact (B1) implies (B2)and from this a routine calculation showswhere *b*(*n*) = (2*n* − 1)/2*n* is the bias correction factor. Thus,is an unbiased estimate of *τ*, and since is a function of a sufficient statistic for *τ*, the *Rao–Blackwell theorem* implies that it has the smallest variance among all unbiased estimates.

To obtain confidence intervals for *τ* we again make use of fact (B1). Reasoning identical to that leading to (B2) leads toMultiplying by *τ* and applying fact (B1) once more, this time with *ψ* = *τ*, we obtainshowing that is a *pivotal* quantity. Thus, letting *γ _{n}*

_{;}

*denote the*

_{α}*α*percentile of the Gamma(2

*n*, 2

*nb*(

*n*)) distribution, we easily obtain 1 −

*α*confidence intervals:This confidence interval is

*exact*: it is not an asymptotic statement,

*i.e.*, it has coverage 1 −

*α*regardless of the sample size. Note that no penalty has been suffered by calculating the interval with the unbiased estimate (

*e.g.*, the interval is not wider). The scaling properties of the Gamma distribution make the interval invariant to the choice of

*b*(

*n*).

## Appendix C: Parameters for the Gamma Approximation

We first calculate the mean and variance of the estimate . The mean is (C1)since (*l _{i}*) = 2/

*τ*. To calculate the variance, observe the equivalence (C2)where and

**L**

^{T}= [

*l*

_{1},

*l*

_{2}, …,

*l*]. Observe also that, given the assumed correlation structure,

_{n}*i.e.*, Corr(

*l*,

_{i}*l*) =

_{j}*ρ*for

*i*≠

*j*, the covariance matrix for

**L**is (C3)Now recall the following standard result: if

**X**is a random vector and

**a**is vector of constants of the same length, then (

**a**

^{T}

**X**) =

**a**

^{T}(

**X**)

**a**. Using this result with (C2) and (C3), it follows that (C4)since

*σ*

^{2}= (

*l*) = 2/

_{i}*τ*

^{2}.

We now equate expressions (C1) and (C4) to the mean and variance of a Gamma(*α*, *β*) random variable, *i.e.*, *α*/*β* and *α*/*β*^{2}, respectively, to obtainafter some elementary algebra, which are the required parameters for the Gamma approximation.

## Appendix D: Correlation estimate

We first obtain the following more general results:

(D1) (D2)where *X*_{1}, *X*_{2}, …, *X _{n}* are identically distributed random variables with common mean

*μ*, variance

*σ*

^{2}, and common pairwise correlation

*ρ*, where is their average.

We begin by noting that **X** = [*X*_{1}, *X*_{2}, …, *X _{n}*]

^{T}has mean

*μ*

**1**

*and covariance matrixwhere*

_{n}*γ*=

*σ*

^{2}

*ρ*. Observe that

*σ*

^{2}+ (

*n*− 1)

*γ*is an eigenvalue of (

**X**) with eigenvector

**1**

*, and so the symmetric and idempotent matrix is an orthogonal projector onto the space Span{*

_{n}**1**

*}. Also observe that*

_{n}*σ*

^{2}−

*γ*is an eigenvalue, but with algebraic multiplicity

*n*− 1, and that the symmetric and idempotent matrix

**I**−

**G**is a projector onto the orthogonal complement of Span{

**1**

*},*

_{n}*i.e.*, the corresponding eigenspace. Thus, given that (

**X**) is symmetric, its spectral decomposition is (D3)Next, observe that

**GX**and (

**I**−

**G**)

**X**are orthogonal, and therefore uncorrelated:since

**G**is symmetric and idempotent. Also observe that (D4)since and that (D5)We now use the following fact: if

**A**is an

*n*×

*n*matrix of coefficients, then (D6)where we have used standard properties for the trace and the fact that (

**XX**

^{T}) = (

**X**)(

**X**

^{T}) + (

**X**). Thus, using (D6), (D4), (D3), and the fact that trace(

**G**) = 1, we obtainwhich establishes result (D1). Similarly, using (D5) and the fact that trace(

**I**−

**G**) =

*n*− 1, we obtainwhich establishes result (D2). Importantly, we have established that the random quantities inside the expectation of results (D1) and (D2) are uncorrelated.

Applying results (D1) and (D2) to the situation where *l*_{1}, *l*_{2}, …, *l _{n}* are Gamma(2,

*τ*) variables with common pairwise correlation

*ρ*, using the fact that

*μ*

^{2}= 2

*σ*

^{2}, and using the standard definition of the sample variance

*S*

^{2}, these equations becomeAn unbiased estimate of the left-hand side of each equation is simply given by the quantity inside the expectation, and since these quantities are uncorrelated, we can obtain a single estimate of

*ρ*by eliminating

*σ*

^{2}. Thus, following some elementary algebra, we obtainwhich is the correlation estimate.

Even though , some values of produce inappropriate values of *n*^{∗}. When this occurs we adjust the calculation of *n*^{∗}. Recall that there are two places where *n*^{∗} is used: the bias correction factor *b*(*n*^{∗}), and parametrizing the Gamma approximation, *i.e.*, . Also recall thatObserve that *n*^{∗}, as a function of *ρ*, is a rectangular hyperbola on [−1, 1] with a vertical asymptote at *ρ* = −1/(*n* − 1). There are three sets of values that are potentially problematic:When or then |*n*^{∗}| > *n*. If we interpret the correlation as effectively reducing the sample size, then an estimated correlation falling within either of these intervals suggests that there is more information in the sample than a sample of size *n*. For our concerns, the effective sample size should be no greater than a sample size assuming independence, *i.e.*, *n*^{∗} ≤ *n*. An added complication is that when or then *n*^{∗} < 0. Such *n*^{∗} values, however, cannot be used to parametrize the Gamma approximation.

Thus, we adopt the following procedures. When calculating the bias correction factor *b*(*n*^{∗}), if or we bound *n*^{∗} so that |*n*^{∗}| ≤ *n*. This produces accurate bias correction. When parametrizing the Gamma approximation, if we set *n*^{∗} = *n*, and if we recalculate *n*^{∗} using as the correlation value. This produces reliable confidence intervals since the value, used to recalculate *n*^{∗}, reliably picks out the correct effective (reduced) sample size. Note that intervals *I*_{2} and *I*_{3} become narrower (approaching zero length) as the sample size increases.

Finally, note that in a correlated genealogy, samples can sometimes have repeated observations; *i.e.*, more than one individual has inherited the very same length of ancestral segment. While this is rare, repeated observations can bias , and so when calculating the estimate we remove any repeats of distinct observations.

## Appendix E: Correction for chance sharing

The effect of chance sharing becomes noticeable only when a mutation is old (*e.g.*, *τ* ≥ 100). This is because ; *i.e.*, the age estimate is a *hyperbolic* function of the sum of lengths. Thus, when the mutation is old and the sum of lengths small, a small change in the lengths due to chance sharing can be magnified into a noticeable (albeit small) change in the estimated age. On the other hand, when the mutation is younger and the sum of lengths larger, any change in the estimated age due to a small change in lengths is undetectable. While this problem is minor, we develop a simple correction.

We assume that the marker loci are bialleic, with alleles A and B. Consider one of the shorter inferred ancestral segments *S* in relation to a longer segment *T*, as we head away from the mutation on one side (see Figure 4). Now consider marker locus *L*_{1}, the first locus before the inferred endpoint of *S*. The probability that *S* and *T* share the same alleles at *L*_{1} by chance iswhere *f*_{A} and *f*_{B} are the population frequencies of alleles A and B, respectively. The probability of matching by chance at loci *L*_{2}, *L*_{3}, …, as we head toward the mutation, can be calculated similarly. Let *p _{i}* denote the probability of matching at locus

*i*. Thus, ignoring LD, the probability

*P*

_{run}of a “run” of matches up to some locus

*k*is thenTo deal with chance sharing, the idea is to find the number

*k*such that

*P*

_{run}<

*ε*, where

*ε*is some threshold (we set

*ε*= 0.01), and then remove these

*k*loci from the length calculation. We now use the following approximate strategy. In the expression for

*p*, replace the particular

_{i}*f*

_{A}with the median of the

*f*

_{A}frequencies across the observed loci on the relevant chromosome. So, letting 〈

*f*

_{A}〉 denote the median, we havefor all

*i*. To find the number

*k*of loci to remove from the length calculation, we find the

*k*such thatwhich is

*k*= log(

*ε*)/log(

*p*). For standard SNP arrays, with

*ε*= 0.01,

*k*is on the order of 10 loci. This is translated into genetic length by multiplying by the average genetic length between observed markers, which we denote by

*φ*. Thus, to correct for chance sharing, for each segment arm on either side of the mutation we subtract

*φk*from its inferred length. (This correction is not used for independent genealogies for sample sizes

*n*≤ 10, since the age estimate in this circumstance has a slight upward bias, which is offset by the slight downward bias caused by chance sharing, thus obviating the need to correct for chance sharing.)

## Footnotes

*Communicating editor: J. Wakeley*

- Received March 26, 2014.
- Accepted May 24, 2014.

- Copyright © 2014 by the Genetics Society of America