## Abstract

We present a formalism for unifying the inference of population size from genetic sequences and mathematical models of infectious disease in populations. Virus phylogenies have been used in many recent studies to infer properties of epidemics. These approaches rely on coalescent models that may not be appropriate for infectious diseases. We account for phylogenetic patterns of viruses in susceptible–infected (SI), susceptible–infected–susceptible (SIS), and susceptible–infected–recovered (SIR) models of infectious disease, and our approach may be a viable alternative to demographic models used to reconstruct epidemic dynamics. The method allows epidemiological parameters, such as the reproductive number, to be estimated directly from viral sequence data. We also describe patterns of phylogenetic clustering that are often construed as arising from a short chain of transmissions. Our model reproduces the moments of the distribution of phylogenetic cluster sizes and may therefore serve as a null hypothesis for cluster sizes under simple epidemiological models. We examine a small cross-sectional sample of human immunodeficiency (HIV)-1 sequences collected in the United States and compare our results to standard estimates of effective population size. Estimated prevalence is consistent with estimates of effective population size and the known history of the HIV epidemic. While our model accurately estimates prevalence during exponential growth, we find that periods of decline are harder to identify.

COALESCENT theory has found wide applications for inference of viral phylogenies (Nee *et al*. 1996; Rosenberg and Nordborg 2002; Drummond *et al*. 2005) and estimation of epidemic prevalence (Yusim *et al*. 2001; Robbins *et al*. 2003; Wilson *et al*. 2005), yet there have been few attempts to formally integrate coalescent theory with standard epidemiological models (Pybus *et al*. 2001; Goodreau 2006). While epidemiological models such as susceptible–infected–recovered (SIR) consider the dynamics of an entire population going forward in time, the coalescent theory operates on a small sample of an infected subpopulation and models the merging of lineages backward in time until a common ancestor has been reached. The original coalescent theory was based on a population of constant size with discrete generations (Kingman 1982a,b). Numerous extensions have been made for populations with overlapping generations in continuous time, exponential or logistic growth (Griffiths and Tavare 1994), and stochastically varying size (Kaj and Krone 2003). However, infectious disease epidemics are a special case of a variable size population, often characterized by early explosive growth followed by decline that leads to extinction or an endemic steady state.

If superinfection is rare and if mutation is fast relative to epidemic growth, each lineage in a phylogenetic tree corresponds to a single infected individual with its own unique viral population. An infection event viewed in reverse time is equivalent to the coalescence of two lineages and every transmission of the virus between hosts can generate a new branch in the phylogeny of consensus viral isolates from infected individuals. Recently diverged sequences should represent transmissions in the recent past, and branches close to the root of a tree should represent transmissions from long ago. Consequently, branching patterns provide information about the frequency of transmissions over time (Wilson *et al*. 2005). The correspondence between transmission and phylogenetic branching is easiest to detect for viruses such as human immunodeficiency virus (HIV) and hepatitis C virus that have a high mutation rate relative to dispersal. Underlying SIR dynamics also apply to other pathogens, although in some cases it may be more difficult to reconstruct the transmission history.

We examined the properties of viral phylogenies generated by the most common epidemiological models based on ordinary differential equations (ODEs). We are able to fit epidemiological models to a reconstructed phylogeny for sampled viral sequence data and make inferences regarding the size of the corresponding infected population. Our solution takes the form of an ODE analogous to those used to track epidemic prevalence and thereby provides a convenient link between commonly used epidemiological models and phylodynamics. Virtually all coalescent theory to date has been expressed in terms of integer-valued stochastic processes. Our motivation for using differential equations to describe the coalescent process is a desire to formalize a link with standard epidemiological models that are also expressed in terms of differential equations.

We use our method to calculate the distribution of coalescent times for samples of viral sequences, fit SIR models to a viral phylogeny, and calculate median time to the most recent common ancestor (MRCA) of the sample. Our method also provides equations that describe the time evolution of the cluster size distribution (CSD)—the distribution of the number of descendants of a lineage over time. Clusters of related virus are often interpreted as epidemiologically linked. For example, clusters of acute HIV infections may represent short transmission chains between high-risk individuals (Yerly *et al*. 2001; Hue *et al*. 2005; Pao *et al*. 2005; Goodreau 2006; Brenner *et al*. 2007; Drumright and Frost 2008; Lewis *et al*. 2008). Because our model reproduces the moments of the cluster size distribution, it can be used to predict the level of clustering as a function of epidemiological conditions. The moments could be directly compared to empirical values or they could be used to reconstruct the entire CSD, whereupon standard statistical tests could be used for comparing distributions.

Although our equations describe the macroscopic properties of the population distribution of cluster sizes, we generalize our method to the case of a small cross-sectional sample of sequences. This allows us to develop a likelihood-based approach to fitting SIR models to observed sequences.

By considering variable degrees of incidence and the size of the infected population, our solution sheds light on the relationship between coalescent rates and epidemic dynamics. Coalescent rates are low near peak prevalence, but higher when there is a large ratio of incidence to prevalence. This can occur early on, when the epidemic is entering its expansion phase, as well as late if the epidemic has multiple periods of growth.

## METHODS

Consider a population of size *N* comprising susceptible , infected , and recovered individuals. The deterministic limiting behavior of , , and as and with all variables ≫ is described by a set of coupled ordinary differential equations, with time-dependent rates of change from state *X* to state *Y* denoted as *f _{XY}*(

*t*). For instance, the classical mass-action SIR model(1)(Kermack and McKendrick 1927; Bailey 1975; Anderson and May 1991) is obtained by setting

*f*(

_{SI}*t*) = β

*S*(

*t*)

*I*(

*t*),

*f*(

_{IR}*t*) = γ

*I*(

*t*), and all other rates to 0. We omit the explicit dependence of terms on time when it is unambiguous.

Classical coalescent inference operates on a small subsample of the larger evolving population, taken at a single time point, and infers properties of the population at an earlier time point; *e.g*., What is the expected number of lineages at a given time *t*? Here, we denote the time of sampling by *T* and consider the evolution of the population backward in time toward time *t* = 0. While this differs from the conventional temporal notation for coalescent theory (where the sampling, or present, time is denoted 0, and as we move backward *t* denotes the number of years before the present), it allows us to develop a system of equations that link coalescent inference with standard epidemiological models.

We apply the coalescent model to the population of infecteds and draw upon the dynamical system to provide parameters such as the rate of lineage coalescence. The practical questions that we seek to address include the following:

If

*n*individuals are sampled at time*T*, how many lineages exist at time*t*≤*T*?How many lineages extant at time

*t*have surviving progeny at time*T*? We define*progeny*of a viral lineage extant from time*t*≤*T*as those individuals infected at time*T*whose virus can be traced back to that viral lineage at time*t*. For instance, in Figure 1, from*t*=*t*_{1}the progeny of lineage 6 has four individuals (5, 6, 8, and 9), but from*t*=*t*_{2}the progeny of lineage 6 consists of only 5 and 6.Can we describe the distribution of the number of progeny from time

*t*(a time*t*cluster),**X**(*t*), using its distributional moments? For instance, in Figure 1, at time*t*=*t*_{2}this distribution is given by (2, 2, 2), while for*t*=*t*_{1}the distribution is (4, 2).

Note that a transmission does not always result in an observable coalescent event depending on which lineages expire due to recovery or are not sampled (*e.g*., the transmission from 7 to 10 in Figure 1), and a transmission to an individual that recovers may still correspond to a coalescent event if that person transmits prior to recovering (*e.g*., the transmission from 6 to 7 in Figure 1).

#### Coalescent model for SIR epidemics:

In an SIR epidemic, a branch in the tree corresponds to a transmission event, and as a lineage is traced backward in time, it traverses multiple infected hosts. A recovery event before the sample time *T* does not alter the number of lineages with progeny because no progeny of this individual can be sampled at a later time. In a standard coalescent model, *n* lineages merge in reverse time at a rate proportional to . Given that a coalescent event occurs among the individuals in , the probability of observing it among the *n* observed lineages is

We introduce the dimensionless variable *A*(*t*; *T*), which is the fraction of the population at *t* with sampled progeny extant at *T. A*(*t*; *T*) is proportional to the number of ancestors of a sample of sequences and is analogous to the integer-valued ancestor function used in standard coalescent theory (Griffiths and Tavare 1994). We consider how *A* evolves as *t* moves into the past, with *T* fixed.

If a fraction ϕ of the infected population is sampled at time *T*, then we observe a number lineages. Initially, *t* = *T*, and *A*(*T*; *T*) = ϕ*I* (the ancestor of each sequence is itself). The sample fraction ϕ is not always known, but if ϕ = 1, our solution will describe the evolution of the fraction of extant lineages for the entire population.

Using the definition of *A* and assuming , the probability of a transmission event causing a coalescent event to be observed in our sample is

The rate of coalescence for a sample of sequences is analogous to the rate of change of the ancestor function, *A*. We can write the coalescence rate for the sample of sequences as the product of the number of transmissions per unit time, *f _{SI}*(

*t*) and the probability

*p*

_{c}that a transmission results in a coalescence being observed in our sample. The ancestor function

*A*(

*t*;

*T*) can be found by integrating the following backward ordinary differential equation from time

*T*:(2)This equation works even when ϕ = 1, in which case

*A*represents the number of ancestors of the entire population of infecteds observed at time

*T*.

#### Cluster size distribution:

Let **X**_{1}(*t*; *T*) denote the number of progeny at *T* of a random infected host from time *t* ≤ *T*, given that such progeny exist. We denote the expected value of **X**_{1} by *x*_{1}(*t*; *T*) and interpret it as the *mean cluster size* from time *t*. **X**_{2}(*t*; *T*) [and *x*_{2} = *E*(**X**_{2})] is a random variable that describes the size of the cluster if it is selected with probability proportional to the cluster's size. This is the same distribution of cluster sizes as if we select an infected at time *T* and determine the size of the cluster to which that infected belongs.

Below, we show that *x*_{1} and *x*_{2} can be found by integrating the ordinary differential equations(3)(4)*backward* in time from *T* with initial prevalence *I*(*T*) taken from the epidemic model. Also, initially (at *t* = *T*), all cluster sizes are unity, and *x*_{1}(*T*; *T*) = *x*_{2}(*T*; *T*) = 1.

The set of infecteds is distributed across a number *A*(*t*; *T*)*N* clusters, and for any 0 ≤ *t* ≤ *T*, the average number of infecteds per time-*t* cluster is *I*(*T*)/*A*(*t*; *T*). This implies(5)

Evaluating the backward derivative at *t* yields(6)Using Equation 6 in conjunction with Equations 2 and 5 yields Equation 3.

Dynamics of *x*_{2} can be found by directly quantifying the mean field behavior of **X**_{2}. Consider the size of a cluster to which a focal individual, a sampled infected at time *T*, belongs. As before, *p*_{c} × *f _{SI}* gives the rate of coalescence. Two clusters merge at each coalescent event, so there is a probability proportional to 2/

*A*that a focal individual belongs to a cluster that takes part in the event. And given that the individual's cluster coalesces, the average amount by which the cluster increases is

*x*

_{1}. Multiplying these factors and probabilities together yields(7)As with

*x*

_{1}, this can be solved by integrating in reverse time with initial conditions

*x*

_{2}(

*T*;

*T*) = 1.

The variance of **X**_{1} can be found by noting that(8)Recall that **X**_{2} is the size of a cluster selected with probability proportional to size, soCombining the last two expressions with the definition of givesThen, the variance in cluster size is(9)

Higher moments can also be derived recursively from earlier moments. We now show that the *n*th moment of the CSD, *M _{n}*, can be found by solving the following differential equation with initial conditions

*M*(

_{n}*T*) = 1,(10)where we define

*M*

_{0}:= 1 for convenience. Equations 3 and 4 could be derived using Equation 10 as a starting point.

Equation 10 is obtained by multiplying the rate at which a cluster merges with other clusters (*f _{SI}A*/

*I*

^{2}) and the expected change in the

*n*th moment when two clusters merge. When a cluster of size

*i*merges with a cluster of size

*j*, the

*n*th moment to be considered will change from that for a cluster of size

*i*to that for a cluster of size (

*i*+

*j*). To find the expected change in the

*n*th moment when two clusters merge, we sum over all possible combinations of clusters of sizes

*i*and

*j*:

The product of the coalescent rate *f _{SI}A*

^{2}/

*I*

^{2}and the factor 1/

*A*that accounts for the probability that a focal lineage takes part in a coalescent event, along with the expected size function, yields Equation 10. In supporting information, Figure S1, we compare solutions of this equation to the second through fifth moments from simulations.

#### Fitting epidemic models to sequence data:

If we know the branching times *t*_{1}, *t*_{2}, · · · , *t*_{n−1} for a phylogeny constructed from *n* sequences, we can use Equation 2 to fit an SIR model. In practice, there is considerable uncertainty about the exact genealogy and branching times given a sample of sequences. The theory developed here is based on a fixed genealogy with no uncertainty about branch lengths, but it should be straightforward to generalize these results to cope with error in the *t _{i}* (Drummond

*et al*. 2005).

The total number of coalescent events observed between times *t* and *T* is proportional to *A*(*T*; *T*) – *A*(*t*; *T*), and at some time *t* < τ < *T*, the fraction of the coalescent events that have occurred is(11)This provides a cumulative distribution function for the distribution of coalescent times. Differentiating with respect to τ yields the density

We make the approximation that when two lineages coalesce, the rates at which other lineages coalesce remain unchanged. Then each coalescent time will be an i.i.d. random variable with the distribution (11). The probability of observing a particular sequence of branching times will be proportional to the product of the density evaluated at each branching time. Consequently, we can construct the log-likelihood function out of an *A*-trajectory(12)where θ denotes the parameters of the SIR model, such as transmission and recovery rates. In File S1 we also present a fitting criterion based on the Kolmogorov–Smirnov statistic for comparing distributions.

## RESULTS

Equation 3 indicates some simple relationships that govern coalescent rates in epidemics. Coalescent rates are proportional to epidemic incidence (*f _{SI}*) and inversely proportional to square prevalence (

*I*

^{−2}). Rates will be highest when prevalence is low and incidence is high, such as at the beginning of an epidemic, during the expansion phase, or following a trough in prevalence.

Equation 9 implies that variance of the CSD asymptotically approaches the mean squared (Figure S4). This is similar to what is seen in the offspring distribution of forward time branching processes, such as the Galton–Watson process (Athreya and Ney 2004).

The point in time where the ancestor function (5) crosses the value 1/*N* is the point at which the phylogeny of the virus has collapsed to a single lineage—the MRCA of the sequences. Therefore, if we collect a sample of size *n* at time *T*, and solve Equation 2 to time zero, with *A*(*T*) = *n*/*N*, the time τ that satisfies *A*(τ) = 1/*N* corresponds to the time to the most recent common ancestor of the sample. Although our differential equations should not serve as an adequate description of the discrete valued processes for values close to 1/*N*, we find that this approximation works quite well. A demonstration with comparison to simulations is provided in Figure S11.

#### Simulations:

To assess the performance of our model, we carried out stochastic simulations of SIR epidemics. Simulations were individual based and in continuous time. Transmission events and recovery events were queued using exponentially distributed lag times, similar to the Gillespie algorithm. Each transmission event was recorded, which allowed us to simulate viral phylogenies under controlled conditions and to test the accuracy of Equations 3 and 9. The transmission data were then converted into phylogenetic trees with known branching times.

Simulation code was independently written by S. D. Frost and E. M. Volz in Python and C. Results from both models were compared to ensure accuracy.

To assess the accuracy of the equations we have derived, we developed a simulation experiment with 10^{3} (1%) initially infected agents out of a population of total size *N* = 10^{5} otherwise identical agents. Transmission and recovery rates were such that *R*_{0} = 10/3. Figure 2 shows Equations 3 and 9 (lines) and the 90% confidence intervals from simulations at 10 thresholds (*t* values). The exact values of *t* and *T* are reported in File S1. Each trajectory corresponds to a cross-sectional census of the infected population at four time points (*T* values) corresponding to maximum prevalence, as well as 86, 68, and 22% of maximum prevalence after the peak. As we go backward in time, all moments of the CSD increase, until the entire census of infecteds falls into a single cluster. Many of the trajectories intersect, which demonstrates that the CSD is a complex function of both *t* and *T* and could therefore not be reduced to a simple forward-looking model.

#### Comparison with the generalized skyline:

Further simulations were developed to test the suitability of the model for estimating epidemiological parameters. When the number of infecteds is small, epidemic dynamics will be subject to large stochastic fluctuations. To determine if Equation 12 can be used to fit SIR models when the population size is small, we conducted a set of simulations with only a single initial infected in a population of 10,000 agents (Figure S5).

The simulations were also designed to determine if SIR models that are fit via likelihood Equation 12 can provide advantages beyond methods that are commonly used to estimate effective population size (*N*_{e}). For purposes of comparison, we used the generalized skyline model (Opgen-Rhein *et al*. 2005) (ape library in R) and compared the estimated effective population size to the best-fit SIR models and the known epidemic prevalence from simulations. Details of the simulations are provided in File S1.

We found that the accuracy of the best-fit SIR models exceeded that of the generalized skyline by 8–30% as measured by the root mean square error (RMSE) of estimated prevalence. It may seem surprising that the SIR model based on ODEs outperforms the generalized skyline even in the presence of stochasticity at small population sizes. This is due to the fact that population dynamics converge to the deterministic SIR model as the infected population increases in size. Fluctuating incidence due to stochastic effects when the number of infecteds is small has the effect of shifting the distribution of coalescence times to the left or the right, but does not fundamentally change the shape of the distribution. This is easily accounted for by including a parameter that varies the fraction initially infected in the deterministic SIR model.

Figure 3 shows the distribution of RMSE over 300 simulations. The mode of RMSE for the SIR model is zero for all experiments, whereas the skyline is slightly biased. Increasing sample size decreases RMSE for both SIR and skyline. Taking the sample at a later time (corresponding to 20% of peak prevalence) decreases the accuracy of both SIR and skyline, although in general the SIR models cope better with late sample times than does the skyline. In Figure S10, we show several representative SIR and skyline fits. It is usually the case that the generalized skyline fails to detect a decrease in prevalence and overestimates in the latter stages of the epidemic.

The SIR models also provide a quite accurate estimate of *R*_{0} [*R*_{0} = 2, (95%: 1.71–2.17)].

#### The effect of a sample fraction:

In the Kingman coalescent, the fraction of the population that is sampled is assumed to be small, such that the probability that more than two individuals have the same parent in the preceding generation is negligible. For example, Kingman showed that the probability that *n* sampled sequences will not have a common ancestor in the preceding generation isKingman then made the approximation that the *O*(*N*^{−2}) terms are zero, which yields a minimum requirement that .

Analytical work has been carried out to investigate the effect on coalescent processes of violating the assumption of a small sample fraction (see, for example, Fu 2006), using discrete mathematics similar to the original Kingman model. Such work has indicated that the Kingman coalescent can be a surprisingly good approximation even when the sample fraction is large.

Nevertheless, our model is not an approximation and takes the sample fraction into account. This gives some insight into how the fraction of the infected population sampled affects the distribution of coalescent times and thus the shape of the reconstructed phylogeny of viral sequences.

Figure 4 shows the empirical distribution of coalescence times for 150 simulations (*R*_{0} = 2) with samples taken at peak prevalence. The sample fraction was varied from 5 to 40%. When the sample fraction is small (5%), the distribution is skewed left, meaning the phylogeny is starlike, which is in agreement with conventional notions for an exponentially growing population. However, as the sample fraction is increased to 10, 20, and 40%, the shape of the distribution changes until it is skewed right, which means that most of the branches occur close to the tips. These qualitatively antipodal distributions are generated by the same underlying population dynamics, with only the sample fraction being varied. This observation is of practical as well as theoretical interest, since many serological surveys for HIV may reach >20% of infected individuals within a given locality (Lewis *et al*. 2008).

Equation 11 gives the analytical distribution of coalescence times and is shown in red in Figure 4. It also provides some simple intuition for why most coalescence events will happen close to the sample time (*T*) when the sample fraction is large. We use the initial conditions *A*(*T*) = *n*/*N*, so that when *n* is large, the term (*A*(*T*)/*I*(*T*))^{2} is also large, which is the probability that two individuals in a transmission event represent sample lineages. Conversely, if *n* and (*A*(*T*)/*I*(*T*))^{2} are small, fewer coalescent events will occur until *I* converges to *A*, which will occur early in the epidemic.

#### Estimating HIV prevalence:

Equation 2 gives the rate of coalescence at any time prior to the sample time (*T*) and, by extension, the distribution of coalescence times. This allowed us to derive the likelihood function (12), which we used to fit a simple mass-action SIR model to 55 HIV-1 sequences of the *pol* gene collected as part of the ACTG241 clinical trial (D'Aquila *et al*. 1996; Leigh Brown *et al*. 1999). All sequences were collected from men who have sex with men (MSM) over a short period of time (May to July, 1993) within the United States. Because the sequences were collected within a short window of time, it is valid to make the approximation that all sequences were sampled simultaneously. To estimate a phylogeny, we used a general-time-reversible model of nucleotide substitution (Tavare 1986) with gamma-distributed variation in site-to-site substitution rates. The root giving the most clocklike rates was determined by maximum likelihood and the null hypothesis of a molecular clock could not be rejected at the 5% significance level.

The epidemiology of HIV has several factors that are important to include in a model. Upon infection, individuals progress through an acute phase lasting 1–3 months and then progress to a chronic phase lasting many years. The transmission probability per act is much greater during the acute phase. Furthermore, since we wish to model the epidemic over a period of 25 years, we must consider natural mortality and immigration into the susceptible pool. All of these factors are considered in the following model:(13)(14)(15)*I*_{1} and *I*_{2} respectively represent the fractions of the population that are at the acute and the chronic stages of infection. Parameters we wish to estimate include the following:

β

_{1}: The transmission rate of acute infecteds.β

_{2}: The transmission rate of chronic infecteds.μ: The immigration rate into the susceptible population and the natural mortality rate. We consider immigration to balance natural mortality.

α: A parameter that controls how incidence scales with cumulative incidence.

ε: The fraction of the population infected at the TMRCA of the sample.

Many more parameters could be included in a model for HIV among MSM, but since our purpose is to fit a model to only 55 sequences, we choose to keep the number of free parameters to a minimum. In addition, we assumed an acute phase that lasts 2 months on average (γ_{1} = 1/60) and a chronic phase that lasts 10 years on average [γ_{2} = 1/(10 × 365)].

Prior distributions are given in File S1.

Given *n* = 55 sequences, we use the initial conditions *A*(*T*) = 55/*N*, *I*_{1}(0) = ε, and *S*(0) = 1 – ε. Since we are including equations for two types of infecteds, we must keep track of ancestor functions for both types. *A*_{1} and *A*_{2} are the fractions of the population that are respectively acute and chronic infected and that have sampled progeny at time *T*. We have(16)(17)For purposes of fitting the SIR model, we use *A* = *A*_{1} + *A*_{2} and = + . A derivation is provided in File S1.

Fitting the model proceeded in two steps. First, we fit a model using Equation 12 as described above. The second step made use of sero-surveillance data of MSM in the United States (Hall *et al*. 2008). These data provided estimates of HIV incidence based on back calculation for the period 1977–2006. To ameliorate error from uncertainty in the chronological values of phylogenetic branch lengths, we adjusted the timescale of the epidemic and rescaled estimated rates to gain the greatest fit with incidence data by a least-squares criterion.

Figure 5 shows the best-fit SIR model. The median posterior estimates were as follows: acute transmission rate, transmission per 47 days; chronic transmission rate, transmission per 1207 days; immigration rate to susceptible state, 1 per 19.5 years; and incidence scaling parameter, . Together, these parameters imply an *R*_{0} value of 2.24 (see File S1). They also imply that 41% of transmissions occur during the acute stage.

For comparison with our SIR model, effective population size (*N*_{e}) was calculated using the skyline plot (Pybus *et al*. 2000). *N*_{e} was rescaled so that min(*N*_{e}) = min(*I*). Figure 5 shows the rescaled skyline and an SIR trajectory that was integrated with parameters from the median of the posterior distribution. Confidence intervals are also given, which show the upper and lower bounds within which 95% of posterior epidemic prevalence falls. Figure 5 also compares the best-fit SIR model with the estimated cumulative incidence among MSM in the United States based on sero-surveillance data. The SIR model is in broad agreement with the data from public health sources regarding the early rate of growth and saturation in the early 1990s. The skyline also reproduces the growth rate during the expansion phase and the tapering of epidemic growth in the early 1990s. However, the skyline predicts a rise in *N*_{e} between 1980 and 1993, which probably overestimates the true prevalence.

We have also compared the CSD mean and variance from our best-fit SIR model to the empirical values from the ACTG241 data (Figure 6). The SIR model successfully reproduces the mean cluster size throughout the course of the epidemic. However, there is substantial deviation between the actual and the predicted variance of cluster sizes. As the clustering threshold is increased, all sampled infecteds eventually fall within a single cluster, and in a finite population, variance converges to zero (not shown).

## DISCUSSION

The distribution of cluster sizes is a function of the time *T* at which we observe a population, such as by taking a sample of sequences, and *t* < *T*, which is a clustering threshold (if the MRCA of two sequences occurs after *t*, then those sequences are clustered). We have derived differential equations that describe how the moments of the CSD change as the threshold *t* moves into the past. This could be used to calculate the distribution of cluster sizes to arbitrary precision at any time. It is straightforward to use the model to calculate the probability that an infected host will have viral progeny at a later time point and, conversely, the expected number of ancestor lineages of a sample taken at *T*. The model promises to serve as a null hypothesis for clustering of infecteds under various epidemiological scenarios and could possibly be used to detect effects that may distort the CSD such as selection and population structure.

The CSD is sensitive to details of the underlying population dynamics. Most coalescent approaches take into account only variable population size, such as epidemic prevalence, but not variable birth rates, analogous to epidemic incidence. Such approaches can give misleading results for epidemics. For example, in both susceptible–infected (SI) models (no recovery) and susceptible–infected–susceptible (SIS) models (recovery into the susceptible state), prevalence rapidly approaches an equilibrium. However, a naive coalescent model based on constant population size would erroneously predict identical coalescent patterns in these two cases. In fact, the SIS case is very similar to a standard constant-population size coalescent, but the lineages in an SI epidemic coalesce only during exponential growth, not at equilibrium (Figure S2 and Figure S3).

We observed drastically less precision when estimating recovery rates than when estimating transmission rates. Consequently, decline in prevalence is much harder to detect than growth. This has been observed previously (Lavery *et al*. 1996) in other biological systems due to differences in the timescale of population change and genetic variation. We nevertheless found that our estimation procedure is robust to misspecification of priors that include zero recovery, and it is feasible to distinguish SI from SIR dynamics (Figure S6, Figure S7, Figure S8, and Figure S9).

In conclusion, coalescent-based estimates of effective population size, such as the generalized skyline, have wide applicability and require minimal consideration of underlying population dynamics. However, in the case that the epidemic dynamics are well understood, the potential is raised for a population genetic model that takes into account the precise effects of transmission and recovery, thereby predicting population dynamics with greater accuracy. We have developed a model that provides a step toward the formal integration of phylodynamics and epidemiology and that can be used to estimate epidemiological and demographic parameters directly from viral sequence data.

Fitting population models to data requires biological simplifications to make the model tractable, which presents the danger of making the model useless for real systems (Wilson *et al*. 2005). Pathogens require successful reproduction both within and between hosts, whereas we have focused entirely on transmission of lineages to uninfected and immunologically naive hosts. We have not considered biological nuances such as superinfection and recombination or the possibility that different strains will have different epidemiological characteristics. Consequently, there are many ways that our model could be extended and improved.

We have calculated coalescent rates and CSD moments only for the most simple mass-action SIR models. But modern mathematical epidemiology has progressed in the direction of incorporating variable host susceptibility, pathogen virulence, geographical heterogeneity, and host contact network structure. Reproducing our derivations for such models would be a difficult but worthy enterprise.

While we have focused on variable population size in epidemics, a second pillar of phylodynamics concerns the effects of immune selection on viral phylogenies (Grenfell *et al*. 2004). A major limitation of our approach is that we adopt the standard assumption of selective neutrality. It is unknown how our method would perform for genes under strong immune selection, such as influenza virus hemagglutinin.

We have made a first attempt at a method for fitting arbitrary SIR models to cross-sectional samples of viral sequences. Many challenges remain for increasing the utility of the method. It may be possible to improve estimation of model parameters when historical prevalence data are available. However, it is not known how to discriminate between competing models when only sequence data are available. The estimation theory developed here is based on a fixed genealogy of virus with no uncertainty about branch lengths; in reality there can be a great deal of uncertainty about the structure of the genealogy, and it should be straightforward to generalize the method to account for this (Drummond *et al*. 2005). Finally, it should also be possible to extend our solutions to heterochronous samples—sequence data collected at multiple time points over the course of an epidemic.

## Acknowledgments

Irene Hall provided estimates of HIV incidence in MSM. The authors acknowledge support from the National Institutes of Health (T32 AI07384, R01 AI47745). S.D.W.F. is supported by a Royal Society Wolfson Research Merit Award. M.J.W. is supported by the Biotechnology and Biological Sciences Research Council.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.106021/DC1.

Communicating editor: M. W. Feldman

- Received June 8, 2009.
- Accepted September 11, 2009.

- Copyright © 2009 by the Genetics Society of America