## Abstract

In this article, I develop a methodology for inferring the transmission rate and reproductive value of an epidemic on the basis of genotype data from a sample of infected hosts. The epidemic is modeled by a birth–death process describing the transmission dynamics in combination with an infinite-allele model describing the evolution of alleles. I provide a recursive formulation for the probability of the allele frequencies in a sample of hosts and a Bayesian framework for estimating transmission rates and reproductive values on the basis of observed allele frequencies. Using the Bayesian method, I reanalyze tuberculosis data from the United States. I estimate a net transmission rate of 0.19/year [0.13, 0.24] and a reproductive value of 1.02 [1.01, 1.04]. I demonstrate that the allele frequency probability under the birth–death model does not follow the well-known Ewens’ sampling formula that holds under Kingman's coalescent.

Pathogens evolve rapidly due to a short generation time and a high mutation rate. As a consequence, new alleles arise regularly, and in a population of infected individuals, a variety of alleles are present. Assuming a model for the spread of the pathogen to new hosts and a model for the mutation of a pathogen allele allows the estimation of key epidemiological parameters for a pathogen based on the sampled alleles in an epidemic (Tanaka *et al.* 2006; Luciani *et al.* 2008, 2009).

In this article, I consider the *infinite-allele model* (IAM) for the evolution of alleles; the *constant rate birth–death model* (BDM) is assumed for the epidemic spread of the pathogen. Under the infinite-allele model, new alleles arise in a host with a constant mutation rate θ. If a new allele arises, it has not appeared before. This means that there is no convergent evolution. Each infected host is characterized by an allele type. Under the BDM, the alleles spread with a constant transmission (birth) rate λ to new hosts, and infected hosts recover or die with a constant death rate μ. Note that through an estimated birth rate λ and death rate μ, the net transmission rate (λ−μ) and the reproductive value (λ/μ) are determined.

Assuming the IAM together with the BDM, the net transmission rate and reproductive value for tuberculosis have been estimated on the basis of the allele frequencies of the IS*6110* marker (Tanaka *et al.* 2006) using an approximate Bayesian computation (ABC) approach (Pritchard *et al.* 1999; Beaumont *et al.* 2002; Marjoram *et al.* 2003). Bayesian methods infer the posterior distribution of parameters, whereas ABC methods infer the approximate posterior distribution of parameters. The quality of the approximation depends crucially on the choice of summary statistics (unless the full data are used, which is usually not feasible), and the speed of obtaining the approximation depends on the speed of the required simulation tools. Given efficient simulation tools, ABC methods might be faster than Bayesian methods; however, this comes with a cost in accuracy.

Bayesian methods require the knowledge of the likelihood of the data (here the allele frequencies). In this article, I derive the likelihood of the sampled allele frequency under the BDM with the IAM. The allele frequency likelihood is calculated recursively for the pure-birth process; *i.e.*, μ = 0. Under the BDM with death, I calculate the allele frequency likelihood conditioned on the underlying birth–death tree structure. Integrating over all possible trees yields the allele frequency probability. However, when estimating parameters using a Bayesian approach, the integration is not necessary, as both parameters and trees can be sampled from the posterior distribution directly.

Using the Bayesian approach, I reanalyze the tuberculosis data of Small *et al.* (1994). I obtain significantly lower estimates than Tanaka *et al.* (2006) for the net transmission rate (0.19/year *vs.* 0.69/year) and the reproductive value (1.02 *vs.* 3.4). This demonstrates that summary statistics employed by ABC often do not yield a sufficiently good approximation of the posterior distribution. The approach presented here has the further advantage over the previous ABC approach that it is much faster, as no simulations of large trees are required and incomplete sampling is included in the likelihood directly (instead of considering complete trees as a temporary step).

## Models and Methods

### Framework for estimating birth and death rates on the basis of allele frequency data

I first introduce some definitions and notations that are used throughout this article. The model for the spread of an allele is based on the BDM, which starts with a single host infected at time *t*_{or} in the past. Over time each host dies (or recovers) with a constant rate μ and infects a new host with a constant rate λ. One allele within a host is tracked. This allele can mutate to a new allele that has not appeared previously with a constant rate θ (IAM).

Let the number of hosts being sampled from the population be *n*. The allele types of the sampled hosts are summarized in a vector of allele frequencies *a* = (*a*_{1}, *a*_{2}, …, *a*_{n}) ∈ ℕ^{n} as follows. The number of hosts that share each allele is counted. The number *a*_{i} is the number of alleles that are shared by exactly *i* hosts. Note that if for all *j* > *k* we have *a*_{j} = 0, we simply write *a* = (*a*_{1}, …, *a*_{k}). Note that *n* = ∑_{j} *j* *a*_{j}. Let the number of sampled allele types (clusters) be *c* = ∑_{j} *a*_{j}.

An example for a vector of allele frequencies is *a* = (2, 3, 1), meaning that *n* = 2 × 1 + 3 × 2 + 1 × 3 = 11 hosts are sampled, which carry in total *c* = 2 + 3 + 1 = 6 different alleles, say alleles *A*_{1}, … , *A*_{6}. Alleles *A*_{1} and *A*_{2} appear only in one host; alleles *A*_{3}, *A*_{4}, and *A*_{5} appear each in two hosts; and allele *A*_{6} appears in three hosts.
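The bookkeeping behind the allele frequency vector can be sketched in a few lines of Python. This is only an illustration of the definition above; the function names and the label strings are assumptions made for the example and do not come from the article.

```python
from collections import Counter

def allele_frequency_vector(labels):
    """Return (a_1, ..., a_k): a_i is the number of alleles carried by exactly i hosts."""
    hosts_per_allele = Counter(labels).values()    # cluster size of each allele
    alleles_per_size = Counter(hosts_per_allele)   # how many clusters have each size
    largest = max(alleles_per_size)                # labels must be non-empty
    return tuple(alleles_per_size.get(i, 0) for i in range(1, largest + 1))

def sample_size(a):
    """n = sum_j j * a_j."""
    return sum(j * a_j for j, a_j in enumerate(a, start=1))

def cluster_count(a):
    """c = sum_j a_j."""
    return sum(a)

# Hypothetical labels reproducing the example a = (2, 3, 1) from the text.
labels = ["A1", "A2"] + ["A3"] * 2 + ["A4"] * 2 + ["A5"] * 2 + ["A6"] * 3
a = allele_frequency_vector(labels)
```

With these labels, `a` is `(2, 3, 1)`, `sample_size(a)` is 11, and `cluster_count(a)` is 6, matching the worked example.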

In the following, the probability of observing *a* in a given sample of size *n*, ℙ[*a*], is determined. This probability can be used to estimate the model parameters on the basis of the data *a*. Throughout this section we define *e*_{i} as a unit vector; *i.e.*, it is a vector of only zeros but a 1 at position *i*.

### Pure-birth model

First, a special case of the BDM, namely a pure-birth model, *i.e.*, μ = 0, is considered. Under this model, a recursion for the probability of an allele frequency *a* is derived.

### Complete sampling

First assume that the whole population of infected hosts is sampled. This is of course almost never the case. However, using the results under complete sampling, the probability of allele frequencies under incomplete sampling is derived in the next section.

**Theorem 1.** *The probability of observing the allele frequency a under the pure-birth process and complete sampling is*

A proof of the Theorem is found in the *Appendix*. The probability ℙ[*a*] can be calculated recursively in the following way. For *a*, *a*′ ∈ ℕ^{n}, define *a* > *a*′ if (i) […] *j* < *k*, but […], with the minimum *a*_{min} = (1). Since *a* > *a* + *e*_{j−1} − *e*_{j} and *a* > *a* − *e*_{1} − *e*_{j−1} + *e*_{j} for *j* > 1, and > defines a total order, ℙ[*a*] can be calculated recursively using Equation 1, with the initial value ℙ[(1)] = 1. The probability ℙ[*a*] depends not on the two parameters θ and λ individually, but only on their ratio θ/λ. I did not find a closed-form solution for ℙ[*a*]. In particular, Ewens’ sampling formula is not the solution. I calculated the probability ℙ[*a*] via the recursion for up to five individuals; see Table 1.

The probability ℙ[*a*] is the likelihood of the data, and therefore maximum-likelihood or Bayesian methods can be employed to estimate birth and death rates on the basis of allele frequencies.

### Incomplete sampling

Now I consider the scenario that a population of size *N* evolved under the pure-birth process, and then *n* of these *N* individuals are sampled uniformly at random. The probability of obtaining the allele frequency *a* when sampling *n* individuals of *N* individuals is calculated recursively, […] ℙ_{N}[*a*] = ℙ[*a*] for […]

Note that Ewens’ sampling formula is invariant under sampling; *i.e.*, ℙ_{N}[*a*] = ℙ_{n}[*a*] for *n* ≠ *N* (see also *Discussion*). However, under the pure-birth model, inspection of Equations 1 and 2 reveals ℙ_{N}[*a*] ≠ ℙ_{n}[*a*] for *n* ≠ *N*.
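The sampling step itself (drawing *n* of *N* hosts uniformly at random and recording the resulting allele frequencies) can be mimicked by simulation. The Python sketch below is not the recursion of Equation 2; it is a hedged illustration of one random draw, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def expand(a):
    """One allele label per host for an allele frequency vector a."""
    labels, allele = [], 0
    for size, count in enumerate(a, start=1):
        for _ in range(count):
            labels.extend([allele] * size)   # `size` hosts share this allele
            allele += 1
    return labels

def frequency_vector(labels):
    """Allele frequency vector of a non-empty list of allele labels."""
    size_counts = Counter(Counter(labels).values())
    return tuple(size_counts.get(i, 0) for i in range(1, max(size_counts) + 1))

def subsample(a_population, n, rng):
    """Draw n of the N hosts uniformly without replacement; return the sample's vector."""
    return frequency_vector(rng.sample(expand(a_population), n))

rng = random.Random(0)
a_sub = subsample((2, 3, 1), 5, rng)   # 5 of the 11 hosts from the earlier example
```

Repeating the draw many times and tabulating the outcomes would give a Monte Carlo estimate of the subsample distribution for one fixed population state.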

### Birth–death model

Introducing a death rate μ for each individual yields, analogous to that above, […] *a*′ on the right-hand side with *a* < *a*′.

One solution would be to introduce a cutoff, *i.e.*, assign probability 0 to all states with *N* >> *n* individuals. *N* has to be chosen so large that the probability of getting back to the final stage with *n* individuals is very small. However, even with this cutoff, the above recursion becomes computationally very time consuming, in particular with incomplete sampling, as the underlying number of individuals *N* can be very large.

I therefore introduce a Bayesian approach to estimate the birth and death rates. The idea is based on deriving a closed-form solution for the probability ℙ[*a*|

I first define […] *t*_{or} ago. It is assumed that infected hosts are sampled uniformly at random; *i.e.*, each infected host at time *t*_{or} (after the first infected host appeared) is sampled with probability ρ. All nonsampled and extinct lineages are suppressed from the tree. Let the tree induced in this way be T, and let the number of leaves be *n*. Mutations of the allele occur on the tree edges with constant rate θ. The tree T together with *c* − 1 edges where at least one mutation occurs such that the allele frequency *a* with *c* different alleles is induced is denoted by […] *c* − 1 edges with mutations are *l*_{1}, … , *l*_{c−1}. The time from the origin of the process to the most recent common ancestor of the individuals that are not descendants of the *c* − 1 edges with mutations is defined as *l*_{c}. Note that mutations during this time *l*_{c} do not change the allele frequency *a*. An example of a tree […]

First note that ℙ[λ, μ, θ, *t*_{or}|*a*] = ℙ[λ, μ, θ, *a*] as *t*_{or} is specified by *a*] is a normalizing constant. The probability ℙ[λ, μ, θ, *t*_{or}] is a prior on λ, μ, θ, *t*_{or}. We determine the quantity ℙ[*t*_{or}] in the following.

**Theorem 2.** *The probability density* ℙ[*t*_{or}] *is* […] *where t*_{1}, … , *t*_{n−1} *are the branching times in* […] *and* […]

*Proof.* To derive ℙ[*t*_{or}], we split the tree *c* different alleles into *c* subtrees in the following way. We sequentially choose an edge with a mutation that has no mutated descending edge, and define this edge with all its descendants as a subtree, and delete this subtree from *c* − 1 subtrees are removed from *c*th subtree of *t*_{or} and the most ancient edge may or may not have a mutation, while all other edges do not have a mutation.

In each of the subtrees, no edge descending from the first diversification event (the root) has a mutation. In the first *c* − 1 subtrees, the edge above the root (root edge) has at least one mutation. In the *c*th subtree, a mutation might or might not have happened above the root (a mutation simply means that the ancestor allele is lost in the sample).

The probability density of a tree T with *n* leaves and bifurcation times *t*_{1}, … , *t*_{n−1} given the age *t*_{0} is […] *c* − 1 subtrees, and the probability of observing a mutation on the first *c* − 1 root edges is […]

Equation 3 allows us to infer the posterior distribution for λ, μ, which is done in the next section for tuberculosis data.

### Application of the Bayesian approach to tuberculosis data

I implemented a Markov chain Monte Carlo (MCMC) approach using the Metropolis–Hastings algorithm (Metropolis *et al.* 1953; Hastings 1970) to sample from the posterior distribution […] *t*_{or}] is provided in Equation 3. ℙ[λ, μ, θ, *t*_{or}] is the prior distribution. I assume a uniform prior for the net diversification rate λ − μ on [0.01, 10] per year, a uniform prior for μ/λ on [0, 1], and a uniform prior for *t*_{or} on [0, 100] years.

I fixed θ = 0.198. This is the major difference in prior assumptions compared to Tanaka *et al.* (2006). In the previous study, the prior for θ was a normal distribution with mean 0.198 and standard deviation 0.06735. However, under an IAM in combination with the BDM, the parameters λ, μ, *t*_{or}, θ give rise to the same process as the time-scaled parameters λ/*s*, μ/*s*, *t*_{or}*s*, θ/*s* (recall that under the pure-birth process we already observed the invariance of the likelihood for θ/λ being constant). For example, if the original parameters were in units of years, the scaled parameters with *s* = 365 are in units of days. I provide all estimates in units of years, assuming θ = 0.198. For a different estimate of θ, the values can then be transformed to new variables using *s* = 0.198/θ. This scaling in parameters is also apparent in Tanaka *et al.* (2006): Figure 3 shows the net transmission rate for varying θ priors. The peak of the net transmission rate estimates correlates linearly with the mean θ.
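The rescaling rule can be written out as a small Python helper. This is a sketch of the transformation described above (rates divided by *s*, times multiplied by *s*, with *s* = 0.198/θ); the function name and the value used for *t*_{or} in the example are hypothetical.

```python
THETA_REF = 0.198  # mutation rate per year assumed for the reported estimates

def rescale(net_rate, reproductive_value, t_or, theta_new, theta_ref=THETA_REF):
    """Re-express estimates obtained under theta_ref for a different mutation rate."""
    s = theta_ref / theta_new
    return (
        net_rate / s,          # lambda - mu scales like a rate
        reproductive_value,    # lambda / mu is a ratio of rates, so it is unchanged
        t_or * s,              # times scale inversely to rates
    )

# A mutation rate 50% larger than 0.198 (t_or = 50 years is a made-up value).
net_rate, r_value, t_or = rescale(0.19, 1.02, 50.0, theta_new=0.297)
```

In this example the net transmission rate becomes 0.19 × 1.5 = 0.285 per year, while the reproductive value stays at 1.02.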

There are two minor differences to Tanaka *et al.* (2006). First, in Tanaka *et al.* (2006), the priors were uniform for λ, μ on [0, ∞) with λ > μ. Note that, since there is a one-to-one mapping (bijection) between (λ, μ) and (λ − μ, μ/λ), the priors in Tanaka *et al.* (2006) are equivalent to uniform priors for λ − μ on [0, ∞) and μ/λ on [0, 1]. Since in my analysis and in Tanaka *et al.* (2006) the estimates for λ − μ are at least 10-fold smaller than the upper bound 10, the different upper bounds do not bias the posterior samples. Second, the tree in the previous study was stopped when a fixed number *N* of infected hosts was reached, and from this tree the observed number *n* was sampled. I assume each individual is sampled with probability *n*/*N* from the big tree and condition on the number of sampled individuals being *n*. I cannot obtain a likelihood function accounting for the sampling procedure in Tanaka *et al.* (2006); however, my sampling procedure introduces only some random noise to the original tree size *N*, which should not bias the posterior distribution.
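The one-to-one mapping between (λ, μ) and (λ − μ, μ/λ) can be made explicit. The Python sketch below only illustrates the change of variables for λ > μ ≥ 0; the function names are assumptions made for the example.

```python
def to_rates(net_rate, turnover):
    """Map (lambda - mu, mu / lambda) to (lambda, mu); requires 0 <= turnover < 1."""
    birth = net_rate / (1.0 - turnover)
    death = turnover * birth
    return birth, death

def from_rates(birth, death):
    """Map (lambda, mu) with lambda > mu >= 0 to (lambda - mu, mu / lambda)."""
    return birth - death, death / birth

birth, death = to_rates(2.0, 0.5)
```

Here `to_rates(2.0, 0.5)` gives λ = 4 and μ = 2, and `from_rates(4.0, 2.0)` recovers (2.0, 0.5).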

After 8 million steps, discarding the first 25% as burn-in, the MCMC chain returned a median net transmission rate (λ − μ) of 0.19/year with 95% credible interval [0.13, 0.24] and a reproductive value *R* (λ/μ) of 1.02 [1.01, 1.04]; for the posterior distribution see Figure 2 (left). The initial state was chosen to be the estimates from the previous study based on an ABC approach (Tanaka *et al.* 2006) (0.69 for net transmission rate, 3.4 for *R*). A further run of the MCMC with the initial state λ = 4 and μ = 2 yielded the same posterior distributions; see Figure 2 (right).
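For readers unfamiliar with the sampler, the following is a minimal random-walk Metropolis–Hastings sketch in Python (not the R code used for the analysis). The prior bounds are those stated above. The likelihood is a made-up placeholder standing in for Equation 3, and the proposal step sizes, the starting value for *t*_{or}, and all function names are assumptions chosen for illustration.

```python
import math
import random

# Prior support from the text: a uniform prior on each interval.
BOUNDS = {
    "net_rate": (0.01, 10.0),   # lambda - mu, per year
    "turnover": (0.0, 1.0),     # mu / lambda
    "t_or": (0.0, 100.0),       # years since the first infection
}

def in_prior_support(state):
    return all(low <= state[name] <= high for name, (low, high) in BOUNDS.items())

def metropolis_hastings(log_likelihood, start, steps, step_sizes, rng):
    """Random-walk Metropolis-Hastings with uniform priors on BOUNDS."""
    state = dict(start)
    current = log_likelihood(state)
    chain = []
    for _ in range(steps):
        # Symmetric Gaussian proposal: inside the prior support the
        # acceptance ratio reduces to the likelihood ratio.
        proposal = {k: v + rng.gauss(0.0, step_sizes[k]) for k, v in state.items()}
        if in_prior_support(proposal):
            candidate = log_likelihood(proposal)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                state, current = proposal, candidate
        chain.append(dict(state))   # a rejected proposal repeats the old state
    return chain

def toy_log_likelihood(state):
    # Hypothetical placeholder, NOT the article's likelihood (Equation 3).
    return -0.5 * ((state["net_rate"] - 0.2) / 0.05) ** 2

rng = random.Random(1)
start = {"net_rate": 0.69, "turnover": 1 / 3.4, "t_or": 50.0}
step_sizes = {"net_rate": 0.05, "turnover": 0.05, "t_or": 5.0}
chain = metropolis_hastings(toy_log_likelihood, start, 4000, step_sizes, rng)
kept = chain[len(chain) // 4:]   # discard the first 25% as burn-in
```

Replacing `toy_log_likelihood` with an implementation of the article's likelihood, and adding the tree as a sampled quantity, would be needed before the sketch says anything about real data.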

The source code in R is available from the author on request.

## Discussion

My estimate of the net transmission rate of tuberculosis is ∼3.5 times lower and the reproductive value is ∼3 times lower than the previous estimate based on the same data (Tanaka *et al.* 2006). The presented estimates challenge the statement of Tanaka *et al.* (2006, p. 1518), saying that “the genetic information (as interpreted with the methods in this study) supports a faster spread of tuberculosis, at least for the data of Small *et al.* (1994).” As Tanaka *et al.* (2006) used the same data, the same model assumption, and basically the same prior distributions as I used in the present study, the difference must come from the method. I obtained the same posterior distribution for different starting values; therefore I claim that the differences in my estimates from the previous estimates are due to the approximation of the ABC not being accurate enough. The quality of the approximation in an ABC analysis depends crucially on the summary statistics used, but unfortunately there is no straightforward way to determine whether a summary statistic is good. Clearly, for a given sample, the allele frequency *a* is a sufficient statistic (as the likelihood depends only on *a*); however, the Bayesian approach in this article revealed that the one-dimensional summary statistic *H* = 1 − ∑_{j} *a*_{j}(*j*/*n*)^{2} (Tanaka *et al.* 2006) is not sufficient.
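To see how much the one-dimensional statistic compresses the data, *H* can be computed directly from an allele frequency vector. The Python sketch below uses exact fractions; the function name is an assumption, and the two colliding vectors are examples chosen here, not taken from the article. A collision only shows that *H* maps distinct samples to one value; it does not by itself quantify the effect on the ABC posterior.

```python
from fractions import Fraction

def summary_h(a):
    """H = 1 - sum_j a_j * (j / n)^2 for an allele frequency vector a."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    return 1 - sum(a_j * Fraction(j, n) ** 2 for j, a_j in enumerate(a, start=1))

h_example = summary_h((2, 3, 1))      # the vector used as an example in the text

# Two different samples of n = 6 hosts with the same H:
h_first = summary_h((2, 0, 0, 1))     # clusters of sizes 4, 1, 1
h_second = summary_h((0, 0, 2))       # clusters of sizes 3, 3
```

For *a* = (2, 3, 1) this gives *H* = 98/121, and both six-host samples give *H* = 1/2.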

My net transmission rate estimate is slightly low compared to the net transmission rate for tuberculosis estimated in Porco and Blower (1998) (0.231−0.693). That study developed a detailed model of tuberculosis transmission and obtained an estimate of the net transmission rate through previous estimates in the literature of other tuberculosis parameters like the number of infections per year and the progression rate to tuberculosis. The net transmission rate correlates linearly with the mutation rate. I assumed a mutation rate of 0.198/year following Tanaka *et al.* (2006). However, the estimates in the literature vary, and assuming a mutation rate that is 50% larger (as estimated, *e.g.*, in Rosenberg *et al.* 2003) yields a net transmission rate interval that largely overlaps with the interval of Porco and Blower (1998).

The estimated reproductive value is close to 1 (1.02); *i.e.*, each infected individual infects only one further individual in expectation. This low number is likely due to the fact that the tuberculosis epidemic is in the equilibrium phase (after an initial exponential expansion). This means that we estimated the actual reproductive number (Amundsen *et al.* 2004), which should not be confused with the basic reproductive number (Anderson and May 1979, 1992). The basic reproductive number quantifies the expected number of individuals infected by a single infected individual in a fully susceptible population, while the actual reproductive number accounts for the fact that a fraction of the population is infected. Estimation of the basic reproductive number using a BDM requires knowledge of the early phase of the epidemic.

In classic population genetics, the spread of an allele is modeled with Kingman's coalescent under a constant population size (Kingman 1982a,b,c) instead of the BDM. The probability of a sampled allele frequency is Ewens’ sampling formula (Ewens 1972), […] *a*. A second scenario gives rise to Ewens’ sampling formula. Under the pure-birth process where new alleles are introduced via a constant migration rate η (instead of mutation), the allele frequency probability is also Ewens’ sampling formula (Joyce and Tavaré 1987) with parameter η/λ instead of θ (an extension that also includes death is discussed in Tavaré 1989 and Rannala 1996). Ewens’ sampling formula can be derived under both scenarios analogous to the recursive approach introduced in this article for the pure-birth model and mutation; the derivations are given in the Appendix. One convenient property of Ewens’ sampling formula is that if a subsample is chosen uniformly at random from a sample of alleles that are distributed according to Ewens’ sampling formula, the subsample again is distributed according to Ewens’ sampling formula.
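For reference, the standard textbook form of Ewens’ sampling formula is ℙ[*a*] = *n*!/(θ(θ + 1)⋯(θ + *n* − 1)) ∏_{j} θ^{a_j}/(*j*^{a_j} *a*_{j}!). The Python sketch below evaluates this form exactly and checks that it sums to 1 over all allele frequency vectors of a given sample size. Here `theta` is the single parameter of the formula; how it is scaled relative to the mutation rate θ used elsewhere in this article is not assumed, and the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def ewens_probability(a, theta):
    """Standard Ewens sampling formula for an allele frequency vector a."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1
    for i in range(n):
        rising *= theta + i                      # theta (theta+1) ... (theta+n-1)
    prob = Fraction(factorial(n)) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= Fraction(theta) ** a_j / (j ** a_j * factorial(a_j))
    return prob

def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing lists of cluster sizes."""
    largest = n if largest is None else largest
    if n == 0:
        yield []
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield [part] + rest

def frequency_vectors(n):
    """Yield every allele frequency vector (a_1, ..., a_n) with sum_j j * a_j = n."""
    for parts in partitions(n):
        yield tuple(parts.count(j) for j in range(1, n + 1))

theta = Fraction(1, 2)
total = sum(ewens_probability(a, theta) for a in frequency_vectors(5))
```

Because the arithmetic uses `Fraction`, `total` equals 1 exactly rather than up to rounding error.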

Unfortunately, we cannot make use of the convenient properties of Ewens’ sampling formula in the framework of epidemiology: We cannot assume the coalescent since transmission and death rates are not parameterized and thus cannot be estimated (the coalescent captures only the population size). The BDM describes the spread of an allele in an epidemic by interpreting birth rates as transmission rates. In the context of epidemiology, new alleles evolve in hosts within a population; thus I chose the BDM with the IAM over the previously studied BDM with immigration.

An analog of Ewens’ sampling formula for the BDM in combination with the IAM is […] *t*_{or}] being a prior distribution for *t*_{or}. An analytic integration of this expression does not seem feasible (a hyperbolic function is involved), and therefore maximum-likelihood estimation of the birth and death rates or the mutation rate is not straightforward.

However, as I demonstrate, a Bayesian framework, which avoids the explicit integration, performs well. In particular, the Bayesian approach incorporates a sampling probability and therefore avoids considering trees that are larger than the actual sample size. This makes the approach attractive for sparsely sampled data, which is common in infectious diseases.

Using the BDM for the spread of tuberculosis, or any other epidemic, is the simplest epidemiological model and is appropriate when the epidemic is in the initial exponential phase (meaning λ >> μ) or in the equilibrium phase that is reached due to a reduced number of susceptibles (meaning λ ≈ μ). To model both phases simultaneously, SIR (susceptible-infected-recovered) models (Keeling and Rohani 2008) accounting for a declining number of susceptible individuals over time are required. While these models are well studied in a deterministic framework, they are not well understood in a probabilistic framework. However, a probabilistic formulation is required for a Bayesian analysis.

## Acknowledgments

I thank the editor and two anonymous reviewers for very helpful comments.

## APPENDIX

### Proof of Theorem 1

*Proof.* Let B be the event that the ancestor state of *a* is followed by a bifurcation (*i.e*., transmission) event. Let M be the event that the ancestor state of *a* is followed by a mutation event. Let *a*′ be the allele frequency state before undergoing the most recent event that yields *a*. [For example, if *a* = (2, 3, 1) from above, and the most recent event is bifurcation, then *a*′ = (3, 2, 1) or *a*′ = (2, 4). If the most recent event is mutation, then *a*′ = (2, 3, 1) or *a*′ = (0, 4, 1) or *a*′ = (1, 2, 2) or *a*′ = (1, 3, 0, 1).] With these definitions, we obtain

Assume *a*′ is followed by a bifurcation event. Let the bifurcation event be in a component of size *j* − 1, *j* = 2, … , *n*. Then, […] Assume *a*′ is followed by a mutation event. Let the mutation event be in a component of size *j*, *j* = 2, … , *n*. Then, […]
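The ancestor states *a*′ used in this proof can be enumerated mechanically. The Python sketch below lists them for a given *a*, following the bracketed example in the proof (bifurcation: *a* + *e*_{j−1} − *e*_{j}; mutation: *a* itself or *a* − *e*_{1} − *e*_{j−1} + *e*_{j}). It does not attach the transition probabilities of the theorem, and the function names are assumptions made for the example.

```python
def _trim(vec):
    """Drop trailing zeros so that (2, 4, 0) is written as (2, 4)."""
    vec = list(vec)
    while len(vec) > 1 and vec[-1] == 0:
        vec.pop()
    return tuple(vec)

def bifurcation_predecessors(a):
    """States a' = a + e_{j-1} - e_j for each cluster size j >= 2 present in a."""
    out = []
    for j in range(2, len(a) + 1):
        if a[j - 1] > 0:
            prev = list(a)
            prev[j - 1] -= 1    # one cluster of size j fewer
            prev[j - 2] += 1    # one cluster of size j - 1 more
            out.append(_trim(prev))
    return out

def mutation_predecessors(a):
    """States before a mutation event that produced one of the singletons in a."""
    if a[0] == 0:
        return []               # a mutation always leaves at least one singleton
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    out = [tuple(a)]            # a singleton mutated: frequencies unchanged
    for j in range(2, n + 1):
        prev = list(a) + [0] * (n - len(a))
        prev[0] -= 1            # the new singleton did not exist yet
        if prev[j - 2] == 0:
            continue            # no cluster of size j - 1 left to rejoin
        prev[j - 2] -= 1        # the mutant came out of a cluster of size j
        prev[j - 1] += 1
        out.append(_trim(prev))
    return out

a = (2, 3, 1)
before_bifurcation = bifurcation_predecessors(a)
before_mutation = mutation_predecessors(a)
```

For *a* = (2, 3, 1) the two lists reproduce the states named in the proof: (3, 2, 1) and (2, 4) for bifurcation, and (2, 3, 1), (0, 4, 1), (1, 2, 2), (1, 3, 0, 1) for mutation.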

### Derivation of Ewens’ Sampling Formula for the Coalescent With Mutation

We now determine the probability of the sampled allele frequency *a* under the coalescent. Let *a*′ be the state one event ancestral to *a*. Let *B* be the event that the present state evolved following a bifurcation event. Let *M* be the event that the present state evolved following a mutation event. Then, […] Assume *a* evolved after a bifurcation event. Let the bifurcation event be in a component of size *j* − 1, *j* = 2, … , *n*. Then, […] Assume *a* evolved after a mutation event. Let the mutation event be in a component of size *j*, *j* = 2, … , *n*. Then, […] Since *a* > *a* + *e*_{j−1} − *e*_{j} and *a* > *a* − *e*_{1} − *e*_{j−1} + *e*_{j} and > defines a total order with minimum (1) (as explained in the main text), we can calculate ℙ[*a*] recursively, with the initial value ℙ[(1)] = 1. The solution of this recursion is Ewens’ sampling formula (Equation 4 of main text), which can be easily proved by induction.

### Derivation of Ewens' Sampling Formula for the Pure-Birth Process With Migration

We again have a pure-birth process for the population dynamics. Novel alleles migrate at a constant rate η. We assume that we sample the whole population. Since Ewens’ sampling formula is invariant to random sampling, the derivation of Ewens’ sampling formula for the whole population implies that also a subsample is distributed according to Ewens’ sampling formula.

Let M be the event that *a*′ is followed by a migration event. With the other notation as introduced above, we have again […] Assume *a*′ is followed by a bifurcation event. Let the bifurcation event be in a component of size *j* − 1, *j* = 2, … , *n*. Then, […] Assume *a*′ is followed by a migration event. Then, […] Since *a* > *a* + *e*_{j−1} − *e*_{j} and *a* > *a* − *e*_{1} − *e*_{j−1} + *e*_{j} and > defines a total order with minimum (1), we can calculate ℙ[*a*] recursively, with the initial value ℙ[(1)] = 1. Here, the solution of the recursion is again Ewens’ sampling formula (Equation 4 of main text), which can be proved by a simple induction.

- Received January 4, 2011.
- Accepted April 27, 2011.

- Copyright © 2011 by the Genetics Society of America