## Abstract

We demonstrate the advantages of using information at many unlinked loci to better calibrate estimates of the time to the most recent common ancestor (TMRCA) at a given locus. To this end, we apply a simple empirical Bayes method to estimate the TMRCA. This method is both asymptotically optimal, in the sense that the estimator converges to the true value when the number of unlinked loci for which we have information is large, and has the advantage of not making any assumptions about demographic history. The algorithm works as follows: we first split the sample at each locus into inferred left and right clades to obtain many estimates of the TMRCA, which we can average to obtain an initial estimate of the TMRCA. We then use nucleotide sequence data from other unlinked loci to form an empirical distribution that we can use to improve this initial estimate.

WITHOUT intralocus recombination, all DNA sequences sampled at a given genetic locus originate from a common ancestor. That is, if we follow the genetic lineages of these sequences back in time, they will merge with one another until a single inheritance path remains. For each locus, this process yields a genealogical tree that unites all of the sampled sequences. The time to the most recent common ancestor (TMRCA) of a particular locus is the height of the genealogical tree at that locus.

TMRCA estimates are commonly used in inferring demographic history. For example, the TMRCA can be used to place an upper bound on the divergence time of subpopulations, if the migration rate between subpopulations and the size of each subpopulation are relatively small (Rosenberg and Feldman 2002). This idea has been applied to obtain the evolutionary history of a number of different organisms, from chaffinches to anchovies (Griswold and Baker 2002; Hailer *et al.* 2012).

Early articles in the TMRCA literature studied the human mitochondrial DNA ancestor, which supported the African origin hypothesis (Vigilant *et al.* 1991). Later studies sought to infer the TMRCA of the Y chromosome, to shed light on the origin and dispersal of modern humans. This was challenging due to the scarcity of DNA sequence polymorphisms on the Y chromosome (Jakubiczka *et al.* 1989; Hammer 1995). One early study examined the *Zfy* intron, which was revealed to be completely monomorphic in a sample of 38 males (Dorit *et al.* 1995). Estimating the TMRCA of this intron necessitated a Bayesian approach, because any estimate proportional to the number of mutations would have given a value of zero. Dorit *et al.* (1995) used a uniform prior distribution on the TMRCA, which was considered inappropriate by a number of commenters, who advocated using priors that stemmed from coalescent theory and their preferred demographic models (Donnelly *et al.* 1996; Fu and Li 1996; Weiss and von Haeseler 1996). As a result of the lack of signal in the data, these different studies inferred very different estimates of the TMRCA (Brookfield 1997). Further efforts to infer the TMRCA for other Y chromosome data have also been affected by this dependence on the prior (Hammer 1995; Whitfield *et al.* 1995; Walsh 2001).

Given the interest in the TMRCA of an individual gene in inferring demography, the dependence of the estimate on the prior demographic model is particularly problematic (Brookfield 1997). In contrast to parametric Bayesian methods such as those applied to Y chromosome data, frequentist approaches such as maximum likelihood do not require the specification of a prior and so might appear preferable. One such frequentist estimator is the one proposed by Tang *et al.* (2002). In this method, nucleotide sequence data are used to partition the sample into two groups, corresponding to the inferred two clades on either side of the root of the tree. Tang *et al.* (2002) then estimate the TMRCA, using the average number of nucleotide sequence differences across all left–right clade pairs,

Of course, application of this method to the *Zfy* data would give an estimate of zero for the TMRCA, which is a clear underprediction. More generally, if Tang *et al.* (2002) had regressed true TMRCA on estimated TMRCA, it would have been revealed that their method tends to underpredict when the number of segregating sites at a locus is small and to overpredict when it is large. This is because an extreme number of segregating sites at a locus often results from a combination of a relatively small or large TMRCA at that locus and a relatively small or large number of mutations conditional on the TMRCA. Errors in inference will occur if all of the variation in the number of segregating sites is attributed to variation in times to most recent common ancestry, as is the case generally in frequentist approaches.

We propose augmenting the method of Tang *et al.* (2002) by using information at unlinked loci to better calibrate estimates of the TMRCA, and we introduce a very simple nonparametric empirical Bayes method. By “nonparametric,” we mean that we do not assume that the prior on the TMRCA has any particular shape, only that all loci’s TMRCAs are sampled from the same distribution. In addition to improving on Tang *et al.* (2002)’s estimator, our method is advantageous over many Bayesian methods in that it makes no prior assumptions about the distribution of TMRCAs and therefore can be used when the history of the population is completely unknown. We show that our method performs well in simulated data from a wide variety of demographic scenarios.

The idea of using information at additional loci to improve the estimate at one locus appears in a number of recent methods, *e.g.*, Hobolth *et al.* (2007) and Li and Durbin (2011), although mostly with a spatial context along the genome that our method does not have. Similarly to Li and Durbin (2011), our method is able to extract information from a single genome, by making use of the number of heterozygote sites in sequences of DNA between recombination break points. We apply this method to a single Bantu individual and a single European individual and are able to show that loci with the same number of heterozygous sites in different populations have different average TMRCAs.

## Materials and Methods

### Assumptions

We assume that the number of mutations at a locus follows a Poisson distribution with constant rate equal to the product of the total genealogical branch length and the per locus mutation rate. In addition, we assume that each mutation generates a new segregating site, in accordance with the infinite-sites model developed by Watterson (1975), which also includes the assumption of complete linkage among sites at a locus. In fact, we allow for the possibility of within-locus recombination as long as it does not modify tree topology or TMRCA, which would preclude the application of Tang *et al.* (2002)’s method. Finally, we assume that all of the different loci under consideration are independent, in the sense that they represent independent samples from the distribution of TMRCA. Approximate independence can be achieved by allowing for sufficient interlocus distance.

### Simple existing methods for inferring the TMRCA of a sampled pair

Let us first consider estimating the TMRCA at a locus *i* in a sample of size 2. The number of nucleotide differences between these two samples follows a Poisson distribution with rate where is the per nucleotide mutation rate at that locus, is the length of the sequenced region, and is the time until coalescence measured in coalescent units. One natural estimator of is the maximum-likelihood estimator, used for example by Tang *et al.* (2002):(1)In Tang *et al.* (2002), is the average number of segregating sites across all left–right clade pairs, and for

Within the framework of coalescent theory, where priors for have been derived for a number of demographic models, it is more common to estimate using a parametric Bayesian approach. However, this requires certain assumptions about demographic history, which we might ideally prefer not to make. One such estimator is the posterior mean, which can be obtained in the manner of equations 19 and 20 in Tajima (1983) for an exponential prior on the TMRCA, which corresponds to the demographic assumption of a constant population size,(2)where and is the effective population size.

### Nonparametric empirical Bayes approach

We can use Robbins’ (1955) method to improve on these simple frequentist and parametric Bayesian approaches, by utilizing information from other unlinked loci in the sample. Robbins considered the following case of sampling from a mixed distribution. Let conditional on some variable be specified by a Poisson distribution,The are in turn independent and identically distributed according to some distribution, which we do not know and do not need to specify. For an illustration of the data-generating process, see Figure 1.

This data-generating process exactly describes the process that yields the number of mutations at unlinked loci in a genome, given our assumptions. That is, conditional on and each is an independently distributed Poisson random variable with rate and each is drawn i.i.d. from an unknown distribution. For the sake of computational simplicity, we assume that which is equivalent to a simple rescaling of

Under this compound sampling scheme (although initially not applied to genetic data), Robbins (1955) showed that we can obtain a point estimate of by making use of Bayes’ rule and the form of the Poisson probability distribution,where is the marginal probability that that is, the marginal probability that we observed exactly segregating sites at locus *i*. As can be seen from the sampling structure depicted in Figure 1, this marginal distribution, which we could simply call does not depend on *i*.

When the number of loci is not too small, we can approximate by the fraction of loci where the number of observed segregating sites is equal to We use to refer to *m* times this fraction or the number of loci with exactly mutations. In this way we obtain the following estimator of the TMRCA at locus *i*,(3)where NPEB is the nonparametric empirical Bayes approach. Note that mutation rates vary across the genome, and we are not assuming a single underlying mutation rate. Loci with relatively high mutation rates, for example, can be truncated, such that the product of the mutation rate and the locus length across all considered loci is roughly similar.

Robbins (1955) proved that this estimator is asymptotically optimal. That is, as the total number of loci sampled grows (), its Bayes risk (such as the mean squared error) converges to the Bayes risk for the Bayesian model where the true prior of the and therefore is known. As might be expected, Robbins’ (1955) method behaves erratically in cases where there are few data. If, for example, that is if no loci have exactly segregating sites, then our estimate of corresponding to a locus *i* where there are segregating sites would be 0, which is clearly wrong. To mitigate this effect, there are a number of smoothing techniques one might apply (Lidstone 1920; Good 1953; Gale and Church 1990, 1994). In this article, we attempt to estimate using Robbins’ (1955) method only when loci where there are segregating sites and loci where there are segregating sites are not rare. It is indeed for these loci that Robbins’ (1955) method shows a clear advantage over traditional methods that do not incorporate information from other independent loci.

Another consequence of variation in is that estimates of are not necessarily a nondecreasing function of the number of mutations In fact, we would expect loci in which there are more mutations to be at least as ancient as loci in which there are only a few. To remedy this, we can fit a weighted isotonic regression of the inferred mean on the number of mutations, using the pava( ) function in the “Iso” package (Turner 2015) in R (R Core Team 2015), where we weight each value by(4)and obtain a new set of estimators, denoted by We use these weights as an approximation of the variance of as is explained in the *Effectiveness of Robbins’ method* section. As the isotonic regression yields the least-squares best fit among nondecreasing relationships, performing this step ensures that if there are fewer mutations at locus *i* than at locus *j*.

To summarize, Robbins’ (1955) method uses the ratio of the number of loci with exactly and mutations to calibrate the TMRCA at a given locus with exactly mutations. We then incorporate the knowledge that the expected number of segregating sites at a locus is a nondecreasing function of its TMRCA, by running an isotonic regression on the TMRCA estimates.

### Generalizing our estimator to a sample of size

In generalizing our estimator for use on samples of size we are inspired by the frequentist estimation of coalescence times from nucleotide sequence data, using a tree-based partition in Tang *et al.* (2002). In that work, the *n* sequences are first partitioned into two subsets that are meant to correspond to the left and right clades of the genealogical tree. The MRCA of any two sequences, one in the left clade and one in the right clade, is the root of the tree. Tang *et al.* (2002) propose an estimator of the TMRCA based on the average number of pairwise differences between sequences in the left clade and sequences in the right clade (see Equation 1).

Although genealogical trees are not always completely resolved by the data, in many cases there is little ambiguity about the branching pattern at the root (Tang *et al.* 2002). When ambiguity does exist at the root, Tang *et al.* (2002) propose a partition algorithm that is less biased than forcing the pair of sequences that differ most from each other to be in different clades. This algorithm does not require knowledge of the ancestral state at the segregating sites. The eight steps of this algorithm are described in detail in Tang *et al.* (2002).

We use the following steps to infer in cases where which we also illustrate in Figure 2.

For each locus

*i*, where we use Tang*et al.*’s (2002) tree partitioning algorithm to partition the sample at locus*i*into left and right clades.From the set of left-clade samples, we pick at random a single sample. We also pick at random a sample from the set of right-clade samples. We calculate the number of pairwise differences and repeat this process for every locus

*i*. The reason we count the number of differences between single pairs of left–right clade sequences instead of averaging the number of differences across all left–right clade pairs is that Robbins’ (1955) method requires to be an integer. We then calculate a for each locus, using these counts at all*m*loci, according to Equation 3. The result of this step is a table that contains estimates of TMRCA corresponding to different observed numbers of segregating sites. We then fit a weighted isotonic regression to these estimates, where each estimate is weighted according to Equation 4.Clearly, at the end of the previous step, we have not used much of the information from our sample, as we have sampled only one left-clade–right-clade pair from each locus. We therefore repeat the previous step over all possible left–right clade pairs at a particular locus, which all have the same TMRCA if the partitioning algorithm is correct. For each locus, the number of possible left–right clade pairs depends on the topology of the tree at that locus. If a single sequence forms one of the clades, the data at that locus will consist of highly correlated pairwise differences. When the tree is balanced, there are pairs, many more than in the unbalanced case. We repeat step 2 until all left-right clade pairs have been used at least once. For loci with the maximum observed number of pairs, each pair is used exactly once. For loci with fewer pairs, some pairs are used multiple times; these are sampled uniformly at random after all pairs at a locus have been used once. At the end of this step, we obtain between and tables, depending on the

*m*inferred tree configurations. That is, the number of tables produced is equal to the number of pairs in the locus with the largest amount of left–right pairs.We average the entries in all of the tables obtained in the previous two steps,

*i.e.*, the estimates of TMRCA for each observed number of segregating sites at a locus, and in this way we obtain a final table with the aggregate information that links each integer-valued unique number of segregating sites to a unique estimate of the TMRCA.We then consider the data at a single locus

*i*. We calculate the average number of segregating sites over all left–right clade pairs at this locus. If this average is an integer, then the estimate of the TMRCA can be read from the row corresponding to value in the final table. More likely than not, though, is not an integer. We can create a piecewise linear function that extends our estimates of the TMRCA to noninteger values of Our estimate of the TMRCA is then a weighted average of the estimates of the TMRCA in the rows corresponding to and

We note here that the presence of recombination does not compromise the method in any way when but does require a reinterpretation of the meaning of the results. The NPEB estimate will no longer correspond to a single TMRCA at a given locus but to an average TMRCA across the locus. This is due to the additive properties of the Poisson distribution and to the fact that, in a sample of size 2, intralocus recombination will not produce a new tree with a different shape. Indeed, in a sample of size 2, there is no ambiguity concerning the members of the left and the right clades. For a sample size >2, we require no intralocus recombination that affects tree shape, because otherwise we could not partition our sample into left and right clades.

### Data availability

The programs sufficient to reproduce the results in this article are available at wakeleylab.oeb.harvard.edu/resources.

## Results

### Effectiveness of Robbins’ method

To assess where Robbins’ (1955) NPEB method is most effective, we calculate the variance of as a function of *m*, and Using a Taylor expansion, we can approximate the variance of the ratio of two random variables (Rice 2007):(5)We can represent the distribution of the for each observed by a multinomial distribution, as long as we create a bin to account for all unobserved yet possible values of In the model there are countably infinite possible numbers of segregating sites, but in practice the number is limited by the length in nucleotides of each locus *i*. By the properties of the multinomial, we have and Equation 5 then simplifies toTherefore, as we have(6)To illustrate where Robbins’ (1955) method is most effective, we apply 6 to each value of an example distribution illustrated in Figure 3. As one might expect, if we increase *m*, we get more accurate results over more points. For moderate numbers of loci *m*, results are still very accurate if is not too small, especially in comparison with Finally, the contribution of the first term on the right side of 6 is smallest when is small. For these reasons, our method can give accurate results at sites with segregating sites.

Using the observed data, we can approximate the variance at each point by assuming This is how we obtain the weights for our isotonic regression (see Equation 4).

### Simulation results in a wide range of population histories

To test the performance of our estimator against the traditional frequentist and parametric Bayesian estimators, we run a series of simulations. Programs sufficient to reproduce all of the results we present are available at https://wakeleylab.oeb.harvard.edu/resources.

We generate synthetic data using the program MSMS (Ewing and Hermisson 2010), which generates sequence data and TMRCA values under a range of demographic scenarios including population growth, subdivision, and admixture. We vary the population mutation rate *θ*, the exponential growth rate of the population *g*, the number of sequences from which we build our genealogies *n*, and the divergence time between populations *d* across a range of parameters described in Table 1. We use a cutoff of 0.2 in step 6 of Tang *et al.*’s (2002) tree partitioning algorithm. This essentially disallows sampled pairs that have relatively very few nucleotide differences from being selected to belong to different tree clades. We choose this value as it is the default setting in Tang *et al.* (2002). Note again that we measure time in units scaled by the population mutation rate *θ*.

We illustrate the performance of the method for two sample sizes, and Felsenstein (2006) suggested as an optimal choice to balance accuracy of estimating *θ* against the costs of genotyping. To justify we might also appeal to the results that the expected TMRCA is equal to and that the probability the MRCA of the sample contains the MRCA of the entire population at a locus is equal to (Saunders *et al.* 1984) if the interest is in the whole-population TMRCA at each locus. Concretely, this means that the TMRCA for 8 lineages is likely to be close to the TMRCA for many more lineages.

For each demographic scenario, we simulate *m* independent genealogies. We then use our algorithm to calculate at each locus *i*. To measure performance, we first compute the mean squared error (MSE) of our estimators at all loci for which our estimate of the variance of is smaller than some threshold, chosen to be 0.1 in these simulations. We assume that there are such loci,(7)where is the true TMRCA at locus *i* given by MSMS. The subscript *s* is the index of one simulated set of *m* loci under a given demographic scenario. To have a more accurate estimate of our error, we repeat these simulations *S* different times for each combination of parameters. We then average over the *S* different sets to obtain the final measure of the accuracy of our estimator, given the demographic scenario.

We impose a cutoff variance because we expect our method to be advantageous only when the variance of the estimator is reasonably small. That is, it is beneficial in estimating the TMRCA of a locus *i* only where and are large. Reasonable values of this threshold will depend on the population mutation rate *θ*. The smaller the cutoff variance is, the smaller is, the number of loci for which we estimate a TMRCA. We specifically chose 0.1 in these simulations to restrict ourselves to loci whose TMRCA we could accurately predict, at least more accurately than using Tang *et al.*’s (2002) method across the range of parameters in our simulations.

### Comparison to Tang’s method and the parametric Bayes posterior mean

Figure 4 is a scatterplot of the MSE of estimates using the method of Tang *et al.* (2002) compared to those obtained using NPEB for simulations over the parameters in the multidimensional grid described in Table 1. We see that Robbins’ (1955) method always performs better than Tang *et al.*’s (2002) approach as measured by MSE.

As we increase *m*, our estimates become more and more accurate: The NPEB MSE converges to the Bayes MSE where the true prior is assumed (Robbins 1955). We illustrate this for and in Figure 5. The parametric Bayes estimates were obtained by assuming (correctly) that the values were drawn from an exponential distribution with parameter *θ*. We update the prior on with the observed number of mutations and in this way obtain the posterior on We then report the mean of this posterior (see Equation 2). For *m* = 250, the MSE of the NPEB estimator is, depending on *θ*, ∼3.5–7.4% higher than the MSE of the Bayesian estimator using the correct prior. For the difference is even smaller, with an increase of only ∼1.2%.

We found that when the assumed prior is not true, the Robbins estimator performs better than the parametric Bayesian estimator as long as *m* is big enough and the assumed prior is different enough from the true prior (Robbins 1955). We illustrate this in a particular case, for different values of when the prior assumes and for demographic parameter values and in Figure 6. It is worth noting that as we increase *m*, we also increase the number of loci for which we are estimating the TMRCA. For this reason, the raw MSEs (*e.g.*, Figure 4) are not completely comparable across different values of *m*, as they depend on this value (see Equation 7).

In summary, our method performs better than Tang’s method across the entire range of tested parameters. Unsurprisingly, the parametric Bayesian estimator performs better than the empirical Bayes estimator when the true prior is assumed. However, our method can outperform the parametric Bayesian estimate in terms of MSE when the assumed prior is incorrect.

### Admixture case study

We also consider the special case of admixture, as a more complicated demographic history. In this case, we can still assume that the true TMRCAs are independent and identically distributed, but this time according to a more complicated distribution that exhibits bimodality (see Figure 7). Using again the program MSMS (Ewing and Hermisson 2010), we simulate the genealogies of pairs of just admixed individuals. Their parent populations diverged 6 time units in the past, with 50% of the genetic material in the sample originating from the first population and 50% from the second population. This means that 50% of sampled lineages will not be able to coalesce before 6 time units in the past. We fix and consider independently segregating loci. Histograms of the true MRCAs and the number of mutations per site are shown in Figure 7, the latter being equal to Tang’s estimator in this case (). We can see that there is considerable bimodality in the TMRCAs, which translates to bimodality in the number of mutations at different loci.

Plotting the true TMRCAs *vs.* the inferred TMRCAs using the two methods reveals that the true TMRCAs are appropriately shrunk using our method and that Tang’s method especially overestimates the TMRCAs in cases where there are a lot of mutations and underestimates them in cases where there are very few mutations (see Figure 8). We used 0.2 as a cutoff value, such that any points with variance >0.2 are not displayed. Note that Figure 8 represents a single (although typical) run of the algorithm. How well the NPEB ends up approximating the true TMRCA depends somewhat on the stochasticity of the data.

### Analysis of TMRCAs from human genomes

We also apply our method to data from 37,574 neutrally evolving autosomal loci from a European and a Bantu individual (Gronau *et al.* 2011). Each interlocus distance is at minimum 50,000 bp, a distance deemed sufficiently high by Gronau *et al.* (2011) that the genealogies can be assumed to be approximately uncorrelated. These presumably neutral loci are 1000 bp in length and were chosen to avoid recombination hotspots. We remove any masked bases and reduce all of our loci to 900 bp, by truncating loci with >900 unmasked bases and removing loci with <900 unmasked bp. We use Gronau *et al.*’s (2011) estimate of the mutation rate of mutations per site per year and for the sake of illustration assume no variation in mutation rate across these loci, which we would otherwise control for by varying the length of each locus. Because of diploidy, we have a sample of size 2 for each individual.

The distribution of numbers of mutations (or heterozygous sites) is different in the case of the Bantu and the European (see Figure 9), which we might attribute to the well-known bottleneck in the ancestry of European populations (Voight *et al.* 2005; Keinan *et al.* 2007). In particular, the average number of pairwise differences is greater for the Bantu than for the European. In Figure 10, we plot the inferred TMRCA at each locus for each of these two genomes. We note that, unlike with our method, the TMRCAs estimated using Tang’s method do not vary depending on the population. Using our method to estimate TMRCAs, we find that the calibration is less intense for the European sample than it is for the Bantu sample, which makes sense in light of the fact that the frequency of sites with exactly mutations decreases more sharply as increases for the European sample (Figure 9).

## Discussion

We have shown that the problem of estimating the TMRCA of a sample can be framed in such a way that it allows for the use of NPEB methods, such as a modified Robbins’ (1955) method. The advantage of these methods is that they use data from all loci to efficiently account for the randomness of mutation, through which loci with the same TMRCA can have very different numbers of segregating sites. In all of our simulations, Robbins’ (1955) method, one of the simplest NPEB methods, showed radical improvement over Tang *et al.*’s (2002) maximum-likelihood method (this is because the method makes use of a lot more of the available information). It also performed very well against a parametric Bayesian method in which it is assumed that the true prior for TMRCA is known.

It is particularly useful in that Robbins’ (1955) method provides reliable estimates of the TMRCA even when the mutation rate is very low. Many of the nucleotide sequences we simulated had 0 segregating sites. Our method was nonetheless reliably able to infer TMRCA at these loci, as long as there was enough information from other independently segregating loci. The other benefit of our method is that it does not require any prior assumptions on demographic history. We ran simulations using simple models of population expansion and divergence and showed that our method is effective in a wide variety of demographic scenarios.

For all cases where the genealogies uniting the sampled sequences are known, as for example when the sample is of size 2, the NPEB estimate may be calculated simply and directly using Equation 3. However, this method is somewhat limited to loci with sufficiently common numbers of segregating sites. It does not perform well with outliers, *i.e.*, when is small.

More effective yet complicated NPEB approaches involve estimating the distribution of the from the data. Laird (1978) proved that the distribution of that maximizes the likelihood of the data is a discrete distribution over finitely many points *j*. An estimate of this distribution can be obtained using the expectation-maximization algorithm (Dempster *et al.* 1977). We can then get estimates of each individual by using Bayes rule with as a prior:(8)This approach is superior to Robbins’ (1955) method in that conditions of monotonicity and convexity are satisfied, and its success does not depend on the use of a squared error loss function over a general loss function (Carlin and Louis 2000). However, it involves much more computation than Robbins’ method. In this article, we concentrated on Robbins’ method as our goal was to show that there is information at independent loci and that even the simplest NPEB method performs quite well, especially compared to the maximum-likelihood approach.

## Acknowledgments

We thank N. Rosenberg, P. Ralph, and an anonymous reviewer for very helpful comments.

## Footnotes

*Communicating editor: N. A. Rosenberg*

- Received December 7, 2015.
- Accepted June 17, 2016.

- Copyright © 2016 by the Genetics Society of America