## Abstract

The general coalescent tree framework is a family of models for determining ancestries among random samples of DNA sequences at a nonrecombining locus. The ancestral models included in this framework can be derived under various evolutionary scenarios. Here, a computationally tractable full-likelihood-based inference method for neutral polymorphisms is presented, using the general coalescent tree framework and the infinite-sites model for mutations in DNA sequences. First, an exact sampling scheme is developed to determine the topologies of conditional ancestral trees. However, this scheme has some computational limitations and to overcome these limitations a second scheme based on importance sampling is provided. Next, these schemes are combined with Monte Carlo integrations to estimate the likelihood of full polymorphism data, the ages of mutations in the sample, and the time of the most recent common ancestor. In addition, this article shows how to apply this method for estimating the likelihood of neutral polymorphism data in a sample of DNA sequences completely linked to a mutant allele of interest. This method is illustrated using the data in a sample of DNA sequences at the APOE gene locus.

THE interest in analyzing polymorphism data in contemporary samples of DNA sequences under various evolutionary scenarios creates a demand to design computationally tractable full-likelihood-based inference methods. For an evolutionary scenario of interest, an ancestral-mutation model can be used to design such a method. The ancestral-mutation model for a sample of DNA sequences at a nonrecombining locus is a combination of two processes: one is an ancestral process that traces the lineages of the sample back in time until the most recent common ancestor, constructing an ancestral tree for the sample. The second is a mutation process that is superimposed on the ancestral tree. The complexities of ancestral-mutation models make the design of such methods challenging. Full data are used instead of summary statistics, which can result in loss of important information in the data (see Felsenstein 1992; Donnelly and Tavaré 1995). In addition, current methods use specific features of the underlying ancestral-mutation models, so they lose flexibility to be applicable to other ancestral-mutation models.

More specifically, Griffiths and Tavaré (1994c, 1995) and Kuhner *et al*. (1995) developed full-likelihood-based inference methods for neutral polymorphisms at a nonrecombining locus. They used the combinations of the standard coalescent (Kingman 1982a,b,c; Hudson 1983; Tajima 1983) with the finite-sites or infinite-sites (Watterson 1975) models as ancestral-mutation models. Stephens and Donnelly (2000) designed an importance sampling method to estimate the full likelihood of the data using the same settings for the ancestral-mutation models. Hobolth *et al*. (2008) provided another importance sampling scheme restricted to the infinite-sites model. The last two methods are computationally more efficient than the first two methods, but they lose flexibility to be applicable to ancestral models without standard coalescent features with independent coalescence waiting times, such as the coalescent processes with exponential growth (Slatkin and Hudson 1991; Griffiths and Tavaré 1994b).

To incorporate the coalescent processes with exponential growth, Kuhner *et al*. (1998) and Griffiths and Tavaré (1994a, 1999) extended their previous methods. For example, the method of Griffiths and Tavaré (1994a, 1999) allows one to consider ancestral models based on coalescent processes with variable population sizes. Coop and Griffiths (2004) modified this inference method and made it applicable for analyzing full polymorphism data in a sample of DNA sequences from a nonrecombining locus completely linked to a mutant allele of interest, either neutral or under selection. Additionally, ancestral models have been developed for this type of sample, where the mutant allele is either neutral (Griffiths and Tavaré 1998, 2003; Wiuf and Donnelly 1999; Stephens 2000) or under selection (Slatkin and Rannala 1997; Stephens and Donnelly 2003). The ancestral model of Slatkin and Rannala (1997) is part of a family of ancestral models derived by Thompson (1975), Nee *et al*. (1994), and Rannala (1997), using a linear birth–death process as an evolutionary process in a population. Although all the ancestral models mentioned above differ in their properties and evolutionary scenarios, they are part of the general coalescent tree framework (Griffiths and Tavaré 1998). Therefore, a computationally tractable full-likelihood-based inference method based on this general framework is of great interest.

For a sample of *n* sequences, an ancestral model in the general coalescent tree framework is described as a bifurcating rooted tree with *n* − 1 internal nodes and *n* leaves, where the internal nodes are coalescent events that happen one at a time. The tree is a combination of two independent components: the topology and the branch lengths. The topology of the tree is constructed going backward in time by combining two randomly chosen ancestral lineages of the sample at each node; the branch lengths of the tree are defined by the joint distribution function of the coalescence waiting times. Note that any density function for coalescence waiting times can define an ancestral model in the general coalescent tree framework.

The *n* leaves (and the sequences in the sample) are labeled from 1 to *n*; and the *n* − 1 internal nodes of the ancestral tree are labeled from 1 to *n* − 1 (in order of occurrence of the coalescent events backward in time). Thus, the topology of an ancestral tree is a leaf-labeled bifurcating rooted tree with totally ordered interior vertices. These trees are called topological trees.

When using the general coalescent tree framework and the infinite-sites model, an evolutionary process that generates polymorphism data in a sample of DNA sequences can be described in the following way. An ancestral tree is constructed, as described above, and mutations are added independently on different branches of the ancestral tree as Poisson processes with equal rates, θ/2, in which θ is the mutation rate at the locus. Then, at the mutation events, the ancestral sequences of the sample are changed according to the infinite-sites model; that is, each mutation occurs at a site of an ancestral sequence at which no previous mutations occurred. Thus, these changes define polymorphism data.

Naively, this probabilistic framework can be used to estimate the likelihood of the full observed data in a sample of *n* sequences. That is, data sets are simulated independently as described above and each simulated data set is compared to the observed data. The proportion of the simulated data sets that match the observed data is an estimate of the likelihood of the observed data. Although this approach provides an estimate for the likelihood of the observed data, this method is computationally infeasible, because the topologies of the ancestral trees of the generated data sets are sampled from the space of all the possible topological trees with *n* leaves. This space has size *n*!(*n* − 1)!/2^{n}^{−1} (Edwards 1970), which is huge for moderate values of *n*. The topologies of the ancestral trees of the generated data sets that match the observed data represent a small portion of that space. Thus, designing a method that samples topologies of the ancestral trees from this subspace can make the method computationally tractable.

On the basis of this idea, I use the general coalescent tree framework with the infinite-sites model to develop a computationally tractable full-likelihood-based inference method for polymorphisms in DNA sequences at a nonrecombining locus. First, an exact sampling scheme for topologies of the conditional ancestral trees is developed. This method has some computational limitations, so to overcome these limitations a second scheme based on an importance sampling is provided. These sampling schemes are combined with Monte Carlo integrations to estimate the likelihood of the full data, the ages of the mutations in the sample, and the time of the most recent common ancestor of the sample. I describe an application of this method for neutral polymorphism data in a sample of DNA sequences at a nonrecombining locus that is completely linked to a mutant allele of interest, either neutral or under selection. The method is illustrated using the data in a sample of DNA sequences at the APOE gene locus from Fullerton *et al*. (2000).

## ESTIMATION OF THE LIKELIHOOD FUNCTION

#### The likelihood function:

The naive approach described in the previous section is formalized and modified below to develop an efficient method for estimating the likelihood function *L*(**α** | *D*), where *D* is the polymorphism data in a sample of *n* DNA sequences and **α** is a vector of parameters of the ancestral-mutation model in the general coalescent tree framework with the infinite-sites model. So, the elements of this vector can be the mutation rate θ and parameters in the distribution function of the coalescence waiting times, such as the growth rate in the exponential population growth model.

For estimating the likelihood function *L*(**α** | *D*), the probability of the data ℙ(*D* | **α**) is estimated up to a multiplicative constant *C*, which does not depend on **α**. To estimate ℙ(*D* | **α**), first an expression is derived for it. On the basis of the naive approach, the equation(1)holds, where **I**{*x*} is an indicator function that is 1 if *x* is true or 0 if *x* is false. An ancestral tree with mutations Δ is constructed by combining the infinite-sites model with an ancestral model in the general coalescent tree framework. For each mutation in the sample, the set of labels of sequences (leaves) that carry that mutation is called a mutant group. The whole sample is considered as a mutant group created by a hypothetical mutation. The notation Δ ≅ *D* means that Δ is consistent with *D*; that is, there is a one-to-one mapping between mutations in the sample and those in Δ such that the mutant groups of the corresponding mutations are the same. For ease of presentation, this article assumes that the ancestral nucleotide at each polymorphic site in the sample is known. (See discussion for other possibilities.)

Now some notations are needed. Let **T** = (*T*_{2}, …, *T _{n}*) denote the waiting times between consecutive coalescent events in an ancestral model, where

*T*is the time between coalescent events when the number of distinct ancestral lineages of the sample changes from

_{i}*i*to

*i*− 1. Let

*f*(

*t*

_{2}, …,

*t*) be the joint density function of

_{n}**T**and let Γ be the topology of the ancestral tree in Δ. The expectation in (1) is conditioned on the ancestral tree (Γ,

**T**) of Δ. That is, the equations(2)hold, where Γ and

**T**are jointly distributed as , andwhere Ω(

*n*) is the space of all the possible topological trees with

*n*leaves.

Before modifying further the right side of (2), it is easy to see that if Δ is consistent with *D*, then each mutation in *D* constrains the topology of the ancestral tree in Δ. The ancestral lineages associated with each mutant group coalesce within their group before they coalesce with other ancestral lineages of the sample. Thus, a topological tree is consistent with *D* if it satisfies the constraints created by all mutant groups in *D*. Let Γ_{D} denote such a topological tree and let Ω(*D*) be the space of this type of topological tree. Note that Ω(*D*) is a subspace of Ω(*n*).

From this observation, it follows that if a topological tree Γ is not consistent with *D*, then the equation holds. Hence, using the equationand narrowing the space Ω(*n*) to Ω(*D*), Equation 2 can be written as(3)where (Γ_{D}, **T**) is distributed according to and .

Finally, the probability can be simplified because for a given ancestral tree the mutations along different branches of the ancestral tree are the result of independent Poisson processes with equal rates, θ/2. Thus, it is easy to see that the equation(4)holds, where is the total length of the ancestral tree. *S*′ is the set of mutation classes defined by consideration that two mutations in the sample are equivalent if they represent the same mutant group and *m _{x}* is the size of mutation class

*x*. Let

*L*(

*x*) be the time interval between two coalescent events in the ancestral tree (Γ

_{D},

**T**). In the first event, the mutant group corresponding to

*x*has the most recent common ancestor. In the second event this mutant group shares, for the first time, an ancestor with other sequences in the sample. Let

*J*

_{1}(

*x*) and

*J*

_{0}(

*x*) denote the numbers of distinct ancestral lineages of the sample at the first and second events, respectively. So,

*L*(

*x*) is equal to the sum .

Note that from (3) it follows that is also the likelihood function because ℙ_{n}(Ω(*D*)) depends only on the constraints of the data and not on θ and **T**, hence, not on **α**. The formula in (4) is derived assuming that the order of the ages of mutations within the mutation classes is not known. However, when the order of the ages of mutations in the class *x* is known, the right side of (4) must be multiplied by 1/*m _{x}*! for each mutation class

*x*in the sample. Because these multipliers depend only on data

*D*, they do not change the likelihood function.

#### Algorithms for estimating the likelihood function:

Below, I provide two algorithms for estimating the likelihood function using Monte Carlo integration based on (4) and (3).

##### Algorithm 1:

Sample a topological tree Γ

_{D}from the probability space (Ω(*D*), ℙ_{D}(·)). Sample coalescence waiting times**T**using the joint density function*f*(*t*_{2}, …,*t*)._{n}Compute the expression in the right side of (4).

Repeat steps 1 and 2 many times and then average the computed values in step 2. (According to the law of large numbers, the average will converge to the likelihood function as the number of the repetitions goes to infinity.)

The second algorithm is a modification of the first algorithm. Let **Q**_{D}(·) be a probability distribution on the space Ω(*D*), such that it is positive for all topological trees Γ_{D}. Then equation (3) can be rewritten as(5)where the joint distribution of (Γ_{D}, **T**) is *f*(*t*_{2},…, *t _{n}*)

**× Q**

_{D}(Γ

_{D}) and . On the basis of these modifications, I present the second algorithm.

##### Algorithm 2:

Sample a topological tree Γ

_{D}from the probability space (Ω(*D*),**Q**_{D}(·)), and sample coalescence waiting times**T**using the density function*f*(*t*_{2},…,*t*)._{n}Compute the expression .

Repeat steps 1 and 2 many times and then average the computed values in step 2. (As in Algorithm 1, the average will converge to the likelihood function as the number of the repetitions goes to infinity.)

It is important to keep in mind some notes regarding the two algorithms. Algorithm 2 is an importance sampling method and similar algorithms are derived for estimating the ages of the mutations in the sample (see in a section below). The definition of the proposal distribution **Q**_{D}(·) and computation of *W*(Γ_{D}) are provided in the next sections. Two sampling methods are developed for topological trees Γ_{D} based on the distributions ℙ_{D}(·) and **Q**_{D}(·). According to (4), it follows that for the Monte Carlo methods described in the algorithms, only the values of (*J*_{1}(·), *J*_{0}(·)) for each mutation class, not the whole topology of the ancestral tree, are necessary to know. In Algorithm 2, it might seem that to compute *W*(Γ_{D}) the probability ℙ_{n}(Ω(*D*)) needs to be computed, but that is not the case (see a section below).

## TOPOLOGICAL STRUCTURE OF THE CONDITIONAL ANCESTRAL TREES

#### Representation of the polymorphism data in a tree structure:

To understand the structure of the topological trees conditioned on the data *D* and how to sample them from the space Ω(*D*), the data *D* are represented in a tree structure form called mutation tree. The root of the mutation tree corresponds to the whole sample. The other nodes of the tree correspond to mutant groups (or mutation classes, because there is a one-to-one mapping between the mutant groups and the mutation classes) in the sample. The nodes are connected on the basis of a parent–child relationship inherited from corresponding mutant groups. For example, a mutant group *A* is the parent of a mutant group *B* (or *B* is a child of *A*) if *B* is nested in *A* and *B* is the biggest one nested in *A*. In other words, there is no other mutant group *C* in the sample, distinct from *A* and *B*, such that *C* is nested in *A*, and *B* is nested in *C*. A recursive procedure can be used to construct this tree by linking the parent node to its child nodes, starting from the root node. An algorithm for building a mutation tree from the data is given in supporting information, File S1; the algorithm is a slight modification of the algorithm of Gusfield (1991). For example, the mutation tree corresponding to the polymorphism data in Figure 1 is shown in Figure 2.

Mutation trees consisting of only parent (root) node and its children are called simple mutation trees. An example of a simple mutation tree is shown in Figure 3. Any mutation tree can be represented as a combination of the simple mutation trees associated with the nodes of the mutation tree. This representation is used to construct a topological tree consistent with the mutation tree. For example, Figure 4 shows the simple mutation trees embedded in the mutation tree in Figure 2.

A topological tree is consistent with the mutation tree if it satisfies the constraints created by the mutant groups corresponding to the nodes of the mutation tree. To construct a topological tree consistent with a mutation tree of *D*, first, topological trees are constructed consistent with each simple mutation tree embedded in the mutation tree. Second, all of these topological trees are then combined. I explain the first part of the procedure in a following section. In the second part of the procedure, the topological trees consistent with simple mutation trees of the mutation tree are changed recursively. This procedure starts with the topological tree that is consistent with the simple mutation tree of the root node of the mutation tree. In this tree, topological subtrees associated with mutant groups of child nodes of the root are cut and respectively replaced by the topological trees consistent with simple mutation trees of the child nodes of the root. The order of the coalescent events does not change during these cut-and-paste procedures. That is, the topological tree that is pasted inherits the labels of the coalescent events in the topological tree that was cut, and the coalescent events stay in the same order as their labels on the updated topological tree. As a result of this sampling procedure, the labeled coalescent events get assigned to the mutant groups. A realization of the procedure for the mutation tree in Figure 2 is shown in Figure 5. In the next section, I describe a probabilistic approach for sampling a topological tree consistent with a simple mutation tree.

## THE PROBABILITIES OF THE TOPOLOGIES OF THE CONDITIONAL ANCESTRAL TREES

#### The probability of a simple mutation tree:

The topological trees with *n* leaves and consistent with a mutation tree constitute a subset of the set Ω(*n*). The probability of that mutation tree is defined as the probability of that subset in the probability space (Ω(*n*), ℙ_{n}(·)). The probability of a mutation tree can be computed using the following lemma, which is based on the computation of the probabilities of the simple mutation trees.

Lemma 1. *The probability of a mutation tree is the product of the probabilities of the simple mutation trees embedded in that mutation tree.*

Lemma 1 easily follows from the sampling procedure described in the previous section.

To compute the probability of a simple mutation tree the following notations are needed. A simple mutation tree can be represented as a vector (**A**, *C*) = (*A*_{1}, …, *A _{k}*,

*C*), where C = (C

_{1}, C

_{0}) and in this mutation tree the child nodes corresponding to mutant groups of size >1 are labeled as

*A*

_{1},…,

*A*;

_{k}*C*

_{1}is the set of the child nodes corresponding to mutant groups of size 1; and

*C*

_{0}is the set of the labels in the mutant group of the root node that do not belong to any of the mutant groups corresponding to the child nodes of the simple mutation tree. Note that the mutant groups of size 1 in a simple mutation tree do not constrain the topological trees consistent with the simple mutation tree and

*k*is defined as the number of mutation constraints for the simple mutation tree (

**A**,

*C*). When there is no confusion, instead of (

**A**,

*C*) the numeric vector (

**a**,

*c*) = (

*a*

_{1}, …,

*a*,

_{k}*c*) is used, where

*a*is the size of the mutant group corresponding to

_{i}*A*;

_{i}*c*

_{1}and

*c*

_{0}are the sizes of the sets

*C*

_{1}and

*C*

_{0}, respectively; and

*c*is the sum

*c*

_{1}+

*c*

_{0}. Let

*m*be the sum .

Let *N _{m}*(

**a**,

*c*) be the total number of topological trees with

*m*leaves and consistent with (

**A**,

*C*). These topological trees can be classified into disjoint groups according to the first coalescent event, backward in time. That is, the coalescent event could happen with two ancestral lineages associated with a mutant group corresponding to one of the

*A*,

_{i}*i*= 1, …,

*k*, or with

*C*. So, the equation(6)holds, where

**e**

_{i}is a vector with

*k*elements, the

*i*th element is 1, and the others are zero.

According to the above notation, *N _{m}*(

*m*) is the number of all possible topological trees with

*m*leaves, soand the probability of having a topological tree consistent with the simple mutation tree (

**A**,

*C*) is

So, from (6) it follows that the probabilities of the simple mutation trees satisfy the recursion formula(7)with the initial conditionsFor consistency with notations above, it is important to note that if on the right side of (6) and (7) the size of one of the mutant groups corresponding to *A _{j}* (

*j*= 1, …,

*k*) becomes 1 the first time, let it happen with

*A*, and then the expressions and should be replaced by and , respectively, where

_{i}**a**is as vector

_{(–i)}**a**with a deleted

*i*element.

Although the probability of a simple mutation tree can be computed using (7), this will be inefficient when the number of mutation constraints of the simple mutation tree or the sizes of the mutant groups corresponding to the child nodes of the simple mutation tree are large. In general, it appears to be a difficult problem to find explicit formulas for the probabilities of the simple mutation trees. For simple mutation trees of the form (*a*_{1}, *c*), the following formula (Wiuf and Donnelly 1999) can be used:(8)Below, I derive explicit formulas for the probability of a simple mutation tree of the form (*a*_{1}, *a*_{2}, *c*).

Lemma 2. *The number of the topological trees consistent with a simple mutation tree of the form* (*a*_{1}, *a*_{2}, *c*) *can be computed using the following formulas*: *If c* ≠ *0*, *then**otherwise,**The expressions for* Φ*(a, b, c) and Q(a, b, c) are defined as**and*

The proof of Lemma 2 is given in File S1.

Remark. From Lemma 2 it follows that if *c* ≠ 0, then(9)otherwise,

## SAMPLING ALGORITHMS FOR THE TOPOLOGIES OF THE CONDITIONAL ANCESTRAL TREES

#### A sampling method from Ω(*D*) based on ℙ_{D}(·):

An algorithm for sampling topological trees consistent with simple mutation trees is developed to describe the sampling procedure for topological trees from the probability space (Ω(*D*), ℙ_{D}(·)), where ℙ_{D}(·) is defined as ℙ_{n}(· | Ω(*D*)).

Let a simple mutation tree have the form (**A**, *C*) = (*A*_{1},…, *A _{k}*,

*C*). Equation 7 is used to sample a topological tree from the probability space , where . The sampling procedure is based on recursively constructing coalescent events, backward in time, by finding which two lineages should coalesce in each coalescent event. So two ancestral lineages that coalesce at the first coalescent event belong to one of the mutant groups of

*A*

_{1}, …,

*A*or in group

_{k}*C*. To find the group in which they are a decision based on (7) is done using the following probabilities: the coalescent event occurs with two ancestral lineages in the mutant group corresponding to

*A*with probabilityor in group

_{j}*C*with probability

If the two coalescing lineages are in group *C*, then both lineages are from group *C*_{1} or from *C*_{0}; otherwise, one of them is from group *C*_{1}, and the other one is from *C*_{0}. These three possibilities have the following probabilities:(10)

If the coalescent event is in the mutant group of *A _{i}*, then randomly choose two ancestral lineages associated with this group and combine them in one lineage. The two lineages are replaced by the new lineage and

*a*is decreased by 1. If

_{i}*a*is 1, then the node corresponding to this mutant group is moved to group

_{i}*C*

_{1};

*c*

_{1}and

*c*are increased by 1 and the value of

*k*is decreased by 1. If the coalescent event is in group

*C*

_{0}, then two ancestral lineages from this group are chosen randomly and combined into a new lineage. They are replaced by the new lineage and

*c*

_{0}and

*c*are decreased by 1. If the coalescent event is in group

*C*

_{1}, then two ancestral lineages associated to the group

*C*

_{1}are chosen randomly and combined together. The new lineage is moved to group

*C*

_{0}and

*c*

_{1}is decreased by 2, and

*c*

_{0}and

*c*are increased by 1. The last case involves randomly choosing a lineage from

*C*

_{1}and

*C*

_{0}. After combining them,

*c*

_{1}and

*c*are decreased by 1. So, after changing the two vectors (

*A*

_{1}, …,

*A*,

_{k}*C*) and (

*a*

_{1}, …,

*a*,

_{k}*c*), the procedure is applied to this new simple mutation tree. Recursively repeating this procedure

*m*− 1 times, the construction of the topological tree will be completed.

In this sampling method, it is necessary for the probabilities of the simple mutation trees in the decision-making steps to have already been computed. Although these probabilities can be computed recursively using (7), doing so can be very inefficient. However, this sampling method can be applied to simple mutation trees of the forms (*C*), (*A*_{1}, *C*), and (*A*_{1}, *A*_{2}, *C*) because explicit formulas are given for the probabilities of these types of simple mutation trees (see above). To overcome this limitation, a second sampling method is presented below, which does not require knowing the values of the probabilities.

#### The second sampling method based on Q_{D}(·):

First, a proposal distribution **Q**_{(A,C)}(·) is defined for the simple mutation tree (**A**, *C*) = (*A*_{1}, …, *A _{k}*,

*C*) and then extended for

**Q**

_{D}(·). To define proposal distribution

**Q**

_{(A,C)}(·), below Equation 7 is rewritten in another form (see Equation 11). This step is similar to the approach used by Griffiths and Tavaré (1994c).

After rewriting (7) as(11)a new decision-making rule is defined as follows: the coalescent event occurs with two ancestral lineages associated with the mutant group of *A _{j}* with probabilityor with group

*C*with probabilityWhen the two lineages are from group

*C*, the coalescent probabilities of the three possibilities for the two lineages are the same as in (10).

So, using the previous sampling procedure with this decision-making rule, a topological tree, Γ_{(A,C)}, will be sampled consistent with (**A**, *C*). This sampling defines the proposal distribution **Q**_{(A,C)}(·) on the space Ω(**A**, *C*). As shown below, several quantities must also be computed during this sampling method. For the simple mutation tree (*a*_{1}, …, *a _{k}*,

*c*), the expression

*W*(

*a*

_{1}, …,

*a*,

_{k}*c*) is computed before each decision-making step using the following formula:At the end of the procedure for sampling topological tree Γ

_{(A,C)}, these quantities,

*W*(

*a*

_{1},…,

*a*,

_{k}*c*), are multiplied together and the result is denoted by

*W*(Γ

_{(A,C)}). So, from Equation 11 and definition of

**Q**

_{(A,C)}(·) it easily follows that the following relation holds for the probabilities of Γ

_{(A,C)}:(12)

In the previous section, a procedure was described for sampling a topological tree consistent with a mutation tree. This method allows one to define both a proposal distribution **Q**_{D}(·) using **Q**_{(A,C)}(·) and a sampling scheme from the space (Ω(*D*), **Q**_{D}(·)). Another form of this sampling scheme is in Algorithm 3 listed below. This algorithm can be modified to make it applicable for sampling from the space (Ω(*D*), ℙ_{n}(· | *D*)).

Let *i* be a variable with the values of the labels of the coalescent events, let *p* be a simple mutation tree associated with a node of the mutation tree, and let *p* have the form (*A*_{1}, …, *A _{k}*,

*C*). It is easy to see that the following equations hold for each mutant group

*x*. In a sampled topological tree consistent with

*D*, the labels of the two coalescent events (when mutant group

*x*has the most recent common ancestor and when, for the first time, it shares an ancestor with others in the sample) are respectively equal to

*n*−

*J*

_{1}(

*x*) and

*n*−

*J*

_{0}(

*x*).

#### Algorithm 3:

Assign 1 to

*i*, the simple mutation tree of the root of the mutation tree to*p*, 1 to*W*(Γ_{D}),*n*to*J*_{1}(·), and 1 to*J*_{0}(·) (for all mutant groups).Compute the following expressions for

*p*,and assign*W*(Γ_{D})*W*(*a*_{1}, …,*a*,_{k}*c*) to*W*(Γ_{D}).Apply the decision-making rule in the second sampling method [based on (7)] to

*p*to find in which group the*i*coalescent event occurs.If the

*i*coalescent event is in group*C*, then there are the following three possibilities: the two coalescing lineages are (1) from group*C*_{1}, or (2) from*C*_{0}, or (3) from group*C*_{0}and from group*C*_{1}. For the first possibility, randomly choose two lineages (without replacement) from the mutant groups of the nodes in*C*_{1}and combine the two lineages together. Assign*n*–*i*to*J*_{0}(·) for these two mutant groups, and move the new lineage to*C*_{0}. Delete the mutant groups of the two lineages from the group*C*_{1}and update the sizes of the groups. For the second possibility, randomly choose two lineages from group*C*_{0}and combine them to make a new lineage in*C*_{0}. Update the size of*C*_{0}decreasing it by 1. For the third possibility, randomly choose one lineage from group*C*_{1}and one from group*C*_{0}and combine them in one lineage. Assign*n*−*i*to*J*_{0}(·) for the mutant group from*C*_{1}, delete this node from*C*_{1}, and decrease*c*_{1}by 1.If the

*i*coalescent event occurs in group*A*, then check the size of the mutant group corresponding to_{j}*A*. If the size of that group is >2, assign the simple mutation tree of node_{j}*A*to_{j}*p*, keep the value of*i*, and go to step 2. Otherwise, assign*n*−*i*to*J*_{1}(*A*) [this coalescent event defines_{j}*J*_{1}(·) for this mutant group]. The simple mutation tree associated with the node of this mutant group has zero mutation constraint,*k*= 0, and it has the form (*c*_{1},*c*_{0}), for which it is (2, 0) or (1, 1) or (0, 2). So, in all these cases, coalesce the two lineages and update the mutation tree according to the three cases.Increase the value of

*i*by 1 (that is, move to the next coalescent event) and assign the simple mutation tree of the root of the updated mutation tree to*p*. Then go to step 2.

Remark. An equation similar to (12) holds for this general case:(13)This formula results from the following steps: write Equation 11 for each node of the mutation tree and expand the right sides of these equations recursively; then multiply all these equations together. The left side of the new equation will be the product of the probabilities of the simple mutation trees associated with each node of the mutation tree. However, this product according to Lemma 1 is the probability of Γ_{D} in the space Ω(*n*). Thus, after expanding the products on the right side of the new equation and dividing both sides of the equation on ℙ_{n}(Ω(*D*)) and using the definitions of **Q**_{D}(·) and *W*(Γ_{D}), the proof is complete.

## AGES OF MUTATIONS CONDITIONAL ON THE DATA

#### The order of the ages of mutations in a mutation class is known:

This section describes how to infer the ages of mutations in the sample conditional on full polymorphism data *D*. Let η be the age of a mutation in a mutation class *x*. Using similar arguments as in the derivation of Equation 3, the mean of the functional λ(η) conditional on the data *D* can be expressed as(14)

To estimate the first and second moments of η, and its distribution function at the point *t*, the functional λ(η) is considered to be equal to η, η^{2}, and **I**{η ≤ *t*}, respectively. Because mutations are superimposed independently on different branches of the ancestral tree of the sample as Poisson processes with equal rates θ/2, it follows that conditional on the age η of a mutation in mutation class *x* is uniformly distributed on the interval , where . On the basis of this fact, = , in distribution, where *U* is a uniform random variable on the interval (0, 1). From this representation, it follows that if there is no extra information about the order of the ages of mutations in mutation class *x*, the ages of these mutations are conditionally independent. But, if the order of the ages of the mutations is known, then after adding this condition, the joint conditional distribution of the ages of mutations in mutation class *x* (ranked in increasing order) is distributed as the vector , where is the ordered statistics of *m _{x}* independent uniform random variables on the interval (0, 1). Note that

*U*

_{(k)}has beta distribution

*B*(α, β), where α =

*k*and β =

*m*+ 1 −

_{x}*k*(see,

*e.g.*, Devroye 1986). Thus, the following Monte Carlo method can be used to estimate the first and the second moments of the age of the

*k*th mutation in the mutation class

*x*and its distribution at point

*t*: sample (Γ

_{D},

**T**) many times and compute the expressions(15)(16)and(17)Average the computed values, respectively, and divide by the estimate of ℙ(

*D*|

**α**). Note that for evaluating the probabilities above, use the formulawhere ℙ(

*U*

_{(k)}≤

*u*) = 0 and 1 when

*u*≤ 0 and

*u*≥ 1, respectively.

Note that if the order of the ages of mutations in mutation class *x* is unknown, then a similar approach to that above can be used to estimate the ages of the mutations.

#### Time to the most recent common ancestor:

Similarly, as in the method of the previous section, a Monte Carlo approach can be applied to infer the time of the most recent common ancestor of the sample, *T*_{MRCA}, conditional on *D*. To estimate the first and second moments and the distribution of *T*_{MRCA}, the method above should be modified by using the following expressions instead of (15)–(17), respectively:

## APPLICATIONS

#### Inference for full polymorphism data completely linked to a mutant allele:

To show flexibility and applicability of the inference method developed in the previous sections, below I describe applications of the method for neutral polymorphism data in a sample of DNA sequences completely linked to a mutant allele, either neutral or under selection.

To formalize the problem the following notations are needed. Assume that *n* DNA sequences at a nonrecombinant locus are drawn at random from a population and *b* sequences in this sample carry an allele, which is the result of a unique mutation event. Let *D _{b}* denote the polymorphism data just among the

*b*sequences and assume that the polymorphism data are neutral and consistent with the infinite-sites model. Here the data are the pair (

*b*,

*D*). Below I describe a method for estimating the probability of the data (

_{b}*b*,

*D*) and the distributions of the ages of the mutant allele and mutations in

_{b}*b*sequences.

In the literature, many probabilistic frameworks have been developed for modeling the ancestral tree of the *b* sequences. Griffiths and Tavaré (1998, 2003) developed a theory for extracting the ancestral tree of the *b* sequences from the ancestral tree of the *n* sequences for which the ancestral tree is modeled in the general coalescent tree framework when the mutant allele is neutral. The special case of this framework was also discussed by Wiuf and Donnelly (1999) and Stephens (2000).

Stephens and Donnelly (2003) and Coop and Griffiths (2004) developed ancestral models for the *b* sequences, where the *n* sequences are drawn at random from an equilibrium population with a mutant allele under selection. In all of these models, the ancestral tree of the *b* sequences is extracted from the conditional ancestral tree of the *n* sequences. Slatkin and Rannala (1997, 2000) used a linear birth–death process to model the ancestral tree of the *b* sequences. This model is different from the previous two models in a few aspects: this model assumes that the mutant allele is in low frequency in the population and under additive selection, the implementation of this model is computationally less complex, and it allows population growth. Although the above models are different in their properties, they all fit to the general coalescent tree framework.

To estimate the probability of the data (*b*, *D _{b}*), this probability is represented as a product of two probabilities:Thus, estimation of the probabilities and is explained below.

#### The neutral case:

For the case of the neutral mutant allele, the probability can be estimated using the equation (3.3) derived by Griffiths and Tavaré (1998):The theory developed by Griffiths and Tavaré (1998) can be used to derive the following formula:(18)where represents the coalescence waiting times on the ancestral tree of *n* sequences; note that the joint density function of **T** is not conditioned on the existence of the unique mutant allele. Here is a random vector uniformly distributed on.The quantity is computed as the expression in (4), but, instead of *n*, *D*, and *T*, the following quantities are used, respectively: *b*, *D _{b}*, and ; is defined aswhere .

Similar expressions can be derived for estimating the ages of the mutant allele and mutations in *b* sequences conditional on (*b*, *D _{b}*). See File S1 for more details about these expressions and algorithms based on Monte Carlo integrations for estimating these quantities.

#### The selection case:

Slatkin and Rannala (1997, 2000), Stephens and Donnelly (2003), and Coop and Griffiths (2004) developed ancestral models for the *b* sequences when the mutant allele is under selection. So, to estimate the probability of ℙ(*D _{b}* |

*b*) when using one of these models, only the coalescence waiting times of the ancestral tree of the

*b*sequences are needed for application of the estimation method developed in the previous section. When using the model by Slatkin and Rannala (1997), the probability ℙ(

*b*) can be estimated using their Equation 7. For the other models, this probability can be evaluated using Equation 24 by Griffiths (2003) or the approach suggested by Stephens and Donnelly (2003).

#### APOE data analysis:

In humans, the APOE gene is located on chromosome 19. Many studies have suggested that some variations of this gene are associated with coronary artery disease and Alzheimer's disease. Thus, it is important to understand the impact of evolutionary processes on genetic variation of this gene. Fullerton *et al*. (2000) sequenced the APOE gene locus (5.5 kbp long) from 96 individuals (192 sequences) from four different populations worldwide: 24 individuals (48 chromosomes) from each population. One of the nucleotide sites in this gene (at position 3937, see Table 1 in Fullerton *et al*. 2000) is polymorphic in all human populations, but not in nonhuman primate relatives that carry the ancestral haplotype (see Fullerton *et al*. 2000 and references therein). Of 192 sequences, 165 carry the mutant allele at this position. To illustrate, the inference methods developed in this article are used to analyze the polymorphism data in the 165 sequences that carry the mutant allele. The results of this analysis are contrasted with the inference based on the full polymorphism data in the 192 sequences.

To analyze to full polymorphism data in the 192 sequences, Fullerton *et al*. (2000) used the Genetree 9.0 program, written by R. C. Griffiths. The program uses the methods developed by Griffiths and Tavaré (1994a,b,c, 1999). In this analysis, the population exponential growth model is considered and the maximum likelihood is used to estimate the underlying parameters in the model: that is, the mutation rate θ at the locus and the exponential growth rate β. Using the likelihood-ratio test, the null hypothesis (β = 0) was not rejected. They did not find a significant signal for population exponential growth. When β = 0, their maximum-likelihood estimate of the mutation rate θ is 3.66.

This article also considers the population exponential growth model as the underlying evolutionary scenario for the data in the sample. For consistency with the infinite-sites model and nonrecombining locus conditions, four sequences were removed from the whole sample and the polymorphism data were restricted to a 4528-bp-long region between positions 833 and 5360. First, the maximum-likelihood estimates of the mutation rate θ at the locus and the exponential growth rate β are derived on the basis of the full polymorphism data in the 165 sequences that are completely linked to the mutant allele. The estimates are θ = 2.75 and β = 0. The inference is done using software that implements the method developed in this article (the software is written in the C programming language and available from the author upon request). Ten independent runs were used with 7 × 10^{7} iterations each (as in Algorithm 5 in File S1); the likelihood surface for the parameters (θ, β) is estimated at the points, where θ is in the range of 0.5–7.25 (10 equidistant points) and β is in the range of 0–4 (5 equidistant points). The contour plot of the likelihood surface is shown in Figure 6. The time required for one run is 58 hr. Note that the likelihood functions at the 50 points of (θ, β) are estimated simultaneously in one run. This estimation approach is appropriate because in this method the topologies of the ancestral trees and coalescence times are sampled independently. In addition, the coalescence times in the exponential growth model can be presented as a transformation from coalescence times in the standard coalescent; hence the sampling of the coalescence times can be done by sampling coalescence times under the standard coalescent and then using the transformation for each β to convert them, respectively. For these estimated parameter values, the mean age of the mutant allele is 1.66 with a standard deviation of 0.89. Time is measured in 2*N*_{e} generations, as a human effective population size, *N*_{e}, can be considered 10,000, with 20 years for a generation time.

The inference methods have also been applied to full polymorphism data in the 188 sequences and the results are contrasted with the above estimates. The contour plot of the likelihood surface of (θ, β) is shown in Figure 7. The range for θ is from 0.5 to 9.4, and it is from 0 to 9.3 for β; for each parameter 16 equidistant points are considered. To compute the likelihood function for these parameter values one simulation run was done with 6 × 10^{7} iterations and 10 independent runs. The time required for one run is 58 hr. The maximum-likelihood estimates of the parameters are θ = 4.0 and β = 2.5. To test for the population exponential growth, mutation rate θ is also estimated when β = 0. In this case, the maximum-likelihood estimate of the mutation rate θ is 2.87. The likelihood-ratio statistic is 2.3, which is not significant. The mean time to the most recent common ancestor is 1.26 with a 0.3 standard deviation. The mean age of the mutant allele is 0.83 with a standard deviation of 0.25.

Note that the mean age of the mutant allele based only on its frequency in the sample is 1.85, using the formula in Kimura and Ohta (1973) or in Griffiths and Tavaré (1999). The Watterson (1975) estimate of the mutation rate at this locus is 2.4. All of these estimates of the mutation rate at the locus and the age of the mutant allele derived in this section do not provide contrasting results, hence, do not suggest that the mutant allele increased its frequency because of selection.

## DISCUSSION

This article developed a computationally tractable full-likelihood-based inference method using the general coalescent tree framework with the infinite-sites model for polymorphism data in a sample of DNA sequences at a nonrecombining locus. This method is different from existing full-likelihood-based inference methods in several ways. First, it is applicable to a wide range of ancestral models that fit into the general coalescent tree framework, which includes various demographic scenarios for populations. Second, the method was developed given the assumption of the infinite-sites model for a mutation scheme in DNA sequences, a nonrecombining locus, and no population structure. In contrast, full-likelihood-based inference methods were developed for evolutionary scenarios such as population subdivision (Bahlo and Griffiths 2000; De Iorio and Griffiths 2004; De Iorio *et al*. 2005), or recombination at the locus (Griffiths and Marjoram 1996; Fearnhead and Donnelly 2001), or other mutation schemes such as stepwise mutation models (Nielsen 1997; Wilson and Balding 1998; Stephens and Donnelly 2000; Drummond *et al*. 2002), assuming large constant size for a population or subpopulations in Wright–Fisher models.

The construction of the importance sampling method developed in this article differentiates this method from other methods. In the other importance sampling methods, ancestral and mutation processes are combined together, whereas, in this method, an ancestral tree is sampled that is consistent with data and then it is combined with the mutation process. On the basis of this sampling approach, the likelihood curve of the mutation rate can be estimated for a set of values of the mutation rate in one simulation run; it does not require choosing a particular value as a starting value. The reason for this is that the sampling of the topologies of the ancestral trees and coalescent waiting times do not depend on mutation rate. The same advantage also holds for estimating the likelihood surface of the mutation rate and any parameters in the distribution function of the coalescence waiting times for the variable population size model. The coalescence times in this model can be derived by a transformation of the coalescence times in the standard coalescent (see Griffiths and Tavaré 1994a, 1999). So, it is enough to sample coalescence times from the standard coalescent and then to apply the transformation to them for different values of the parameters. Using the same idea, Bayesian inference can be developed for the parameters in this framework.

In this article, two sampling methods have been developed for topologies of the conditional ancestral trees. The first method samples the topologies from the exact distribution, but the second method samples the topologies from a proposal distribution. The first method has some limitations for applicability because the probabilities of the simple mutation trees must be known in decision-making steps of this method and no closed-form formula is known for these probabilities in the general case. However, this method can be applied to data sets for which all the simple mutation trees embedded in the mutation tree of the data have no more than a two-mutation constraint. Although this is a restriction, there can be many real data sets that satisfy this condition. For example, the data sets from Fullerton *et al*. (1994) and Garrigan *et al*. (2005) satisfy this condition when the polymorphic sites of the data sets are phased on the basis of the reference sequence from the chimpanzee genome. In addition, data sets are generated for a sample size of 20 under the standard coalescent with the infinite-sites model for a wide range of mutation rates and for each data set the corresponding mutation tree is constructed and the maximal number of mutation constraints of simple mutation trees embedded in the mutation tree is computed. Then the distribution of this statistic for each θ is estimated. For each θ, the value of this statistic with highest probability is at 0, 1, or 2. Note that the same thing is true for sample sizes <20 because coalescence waiting times in the standard coalescent are independent.

I developed three computer programs for the inference method in this article, using three sampling schemes for topologies of the conditional ancestral trees. The two sampling methods are described in the previous sections, and the third one is explained below. I compared the programs with Genetree 9.0 on many simulated data sets. The estimates from these programs agree on all data sets (data not shown), and these programs show some run-time advantage over Genetree 9.0. The run-time differences between the programs are shown in Table 1, in which data sets from Fullerton *et al*. (1994) and Garrigan *et al*. (2005) are used.

#### Modifications and extensions:

This article assumed that the sequences in the sample are labeled and ordered. Considering the labeled and unordered case relaxes this condition. In this case, similar inference methods can be developed by changing decision-making rules and the equations on which they are based. The details of these changes are given in File S1.

For purposes of this research, I assume that the ancestral nucleotide at each polymorphic site is known. Note that the ancestral nucleotide at each polymorphic site in the sample of DNA sequences consistent with the infinite-sites model can be phased on the basis of a reference sequence from an out-group species. If it is not available, the more frequent allele at each polymorphic site can be considered as the ancestral allele because this fact holds in the general coalescent tree framework (see Sargsyan and Wakeley 2008). Otherwise, all possible phasing should be considered (for more details see Griffiths and Tavaré 1995).

An ancestral model based on the model-free approach has also been considered in population genetics. (For more about the model-free approach, see, *e.g.*, Stumpf and Goldstein 2001 and Meligkotsidou and Fearnhead 2005, and references therein.) In this model the ancestral tree of a sample is a rooted-bifurcating tree with random-joining topological structure, but the time intervals between consecutive coalescent events are parameters. This model does not fit the general coalescent tree framework, where coalescence waiting times are random variables. However, the sampling methods developed in this article for the topologies of the conditional ancestral trees can be applied to design a full-likelihood-based inference method, using this ancestral model with the infinite-sites model.

## Acknowledgments

I thank John Wakeley and Simon Tavaré for stimulating discussions and helpful comments on the drafts of the manuscript.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.112847/DC1.

Communicating editor: R. Nielsen

- Received December 7, 2009.
- Accepted May 7, 2010.

- Copyright © 2010 by the Genetics Society of America