Abstract

Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman’s coalescent theory. Here, we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent susceptible–infected–removed (SIR) tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the performance of the two coalescent models and also juxtapose results obtained with a recently published birth–death-sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. Additionally, we analyze influenza A (H1N1) sequence data sampled in the Canterbury region of New Zealand and HIV-1 sequence data obtained from known United Kingdom infection clusters. We show that both coalescent SIR models are effective at estimating epidemiological parameters from data with large fundamental reproductive number R0 and large population size S0. Furthermore, we find that the stochastic variant generally outperforms its deterministic counterpart in terms of error, bias, and highest posterior density coverage, particularly for smaller R0 and S0. However, each of these inference models is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with R0 close to one or with small effective susceptible populations.

Phylodynamics and the Coalescent

THE epidemiological and evolutionary processes that underpin rapidly evolving species occur on a shared spatiotemporal frame of reference. Unified analyses that include both the dynamics of an epidemic and the reconstruction of the pathogen phylogeny can therefore uncover otherwise inaccessible information to aid in outbreak prevention. Such information includes the rates of pathogen transmission and host recovery, effective population sizes, and the “time of origin” representing the introduction of the first infected individual into a population of susceptible hosts.

The term phylodynamics was popularized by Grenfell et al. (2004) to describe the interlaced study of immunodynamics, epidemiology, and evolutionary mechanisms. Several phylodynamic models, both stochastic and deterministic in nature, have since been developed to characterize the phylogenetic history of the pathogen species and compartmentalizations of the host population throughout the epidemic. Such models grant the ability to infer key epidemiological parameters from genetic sequence data and include birth–death branching processes (Stadler et al. 2012, 2013; Gavryushkina et al. 2014; Kühnert et al. 2014), as well as coalescent approaches (Griffiths and Tavaré 1994; Pybus et al. 2001; Rasmussen et al. 2011, 2014; Koelle and Rasmussen 2012; Dearlove and Wilson 2013) derived from Kingman’s coalescent theory (Kingman 1982).

Significant steps toward the unification of epidemiology and statistical phylogenetics were made by Pybus et al. (2001), Volz et al. (2009), and Dearlove and Wilson (2013), with the formalization and application of Kingman’s n-coalescent to pathogen population dynamics. These methods involved numerical integration of a set of ordinary differential equations (ODEs) to find deterministic approximations to the variation in the number of sampled lineages through time. Volz (2012) extended the tree density calculation from previous work (Volz et al. 2009) to allow for serially sampled and spatially structured genetic sequence data. In this coalescent model, the birth and death rates can vary in time and by the state of the host, so that “the birth rate of a single gene copy is both time- and state-dependent” (Volz 2012, p. 7).

In this article, we assess the ability of coalescent-based phylodynamic models to infer, in a Bayesian setting, a range of epidemiological parameters from simulated data. While Dearlove and Wilson (2013) paved the way by implementing a coalescent approach for deterministic susceptible–infected (SI), susceptible–infected–susceptible (SIS), and susceptible–infected–removed (SIR) models for Bayesian inference, we implement and rigorously test both deterministic and stochastic coalescent SIR models of epidemic dynamics extended for heterochronously sampled data.

Stochastic and Deterministic Models

Stochasticity and determinism in population sizes each maintain dominant roles in particular stages of an epidemic. Once the infected population has grown considerably large, on the order of 1000–10,000 lineages, the probability densities of stochastically expressed population size dynamics converge toward the deterministic interpretation (Rouzine et al. 2001). However, during the early stages the population size of infected individuals is small, and the dynamics of the epidemic are therefore governed by stochastic processes due to the relative significance of fluctuations in the demographic and rate parameters of the population model (Kühnert et al. 2014). Therefore, approximating the prevalence of infection by a deterministic function requires the number of infected hosts within the effective population to be assumed as very large throughout the duration of the described epidemic, i.e., once the exponential growth phase has been reached (Rouzine et al. 2001).

Population size is critical to the epidemiological system and, as with any parameter in a Bayesian setting, yields the most accurate estimations when detailed prior information is available and incorporated into the inference (Drummond et al. 2006). In our extension and implementation of the coalescent model for epidemics, both stochastic and deterministic population size processes are used for the simulation of trees and/or trajectories for subsequent inference.

Compartmental Population Models (SIR)

Host populations can be compartmentalized simply but effectively in mathematical models that describe epidemic progression. The specific division of the aggregate population depends on the contagion, spanning a range of scenarios where hosts may or may not recover from infection, may or may not be reinfected, etc. Such examples include the SI, SIS, and SIR models (Anderson and May 1991; Keeling and Rohani 2008). Each of these compartments can be expressed either (a) by a set of ODEs that describe the deterministic time development of real-valued compartment occupancies or (b) in terms of integer-valued occupancies governed by continuous-time Markov chains (CTMC) that allow for a degree of uncertainty in the timing and number of events that occur over the course of the epidemic.

In this article, we concentrate on the SIR model, which describes epidemics that include infected individuals who are at some point in time removed from the effective population by way of immunity, death, behavioral changes, or some other termination of infectiousness. The deterministic variant of this model was introduced by Kermack and Mckendrick (1932) and is given by the trio of coupled ODEs,
ddtS(t)=βI(t)S(t),
(1)
ddtI(t)=βI(t)S(t)γI(t),
(2)
ddtR(t)=γI(t),
(3)
where β and γ respectively represent the transition rates from susceptible S to infected I and infected I to removed R. The model fully defines the population dynamics with initial conditions S(z0), I(z0), and R(z0). It is worth recognizing that, in the closed SIR model used here, there is no demographic change in the host population. Therefore, (d/dt)S(t)+(d/dt)I(t)+(d/dt)R(t)=0 and S(t)+I(t)+R(t)=N, where N is the constant total population size. Throughout this article we refer to the solutions to Equations 1–3 as deterministic SIR trajectories.
The comparable stochastic description is given in terms of the probability of the epidemic state at time t given its initial state and the rate parameters
π(s,i,r;t)Pr(S(t)=s,I(t)=i,R(t)=r|S(0),I(0),R(0),β,γ),
(4)
which is governed by the following equation of motion:
ddtπ(s,i,r;t)=β[(s+1)(i1)π(s+1,i1,r;t)siπ(s,i,r;t)]+γ[(i+1)π(s,i+1,r1;t)iπ(s,i,r;t)].
(5)
An explicit sampling process is incorporated by allowing each removal event to coincide with a sampling event with a fixed probability ψ/(ψ+μ), where ψ and μ are the overall rates of sampled and unsampled removals, respectively, such that γ=ψ+μ. We refer to epidemic histories sampled from this model as stochastic SIR trajectories.

Both types of epidemic trajectories can be related to models of sampled transmission tree genealogies. In the deterministic case, this relationship is made via the coalescent distributions described in Volz (2012). We call this the deterministic coalescent SIR model. In the stochastic case, genealogies appear naturally from a branching process in which the branching events coincide with the transmission events in the CTMC and only those lineages ancestral to sampled removals are recorded. We call this the stochastic SIR model.

Another way of relating the stochastic SIR model to sampled transmission trees involves drawing a realization of a stochastic SIR epidemic and then using the coalescent distribution in Volz (2012) to produce a tree conditional on the particular piecewise constant infected compartment size corresponding to that realization. We call this approach the stochastic coalescent SIR model. Unlike BDSIR, the stochastic coalescent SIR model does not require the sampling process to be specified explicitly.

Both the transmission rate β and the removal rate γ can be estimated using each of the methods considered in this article from data ascribed to an SIR epidemic.

Methods

Inference framework

All phylodynamic inference discussed in this article is based on the joint posterior probability density
f(T,V,η,θ|D)=Pr(D|T,θ)f(T|V,η)f(V|η)f(η)f(θ)Pr(D),
(6)
where the sampled transmission tree T, the epidemic trajectory denoted V=(S,I,R), the substitution parameters θ, and the epidemiological parameters η={β,γ,S0,z0} are all estimated from the sequence data. The sampled transmission tree T is assumed to be identical to the pathogen genealogy.

Here, S, I, and R represent the host compartment sizes from the present time τ=0 back to the origin z0, such that S(τ)=S(z0τ), (τ)=I(z0τ), and (τ)=R(z0τ).

The various terms making up the right-hand side of Equation 6 are the tree likelihood Pr(D|T,θ), the tree prior f(T|V,η), the epidemic trajectory density f(V|η), and the substitution and epidemiological parameter priors f(η) and f(θ). The probability Pr(D) is merely a normalizing constant and can be ignored. It is the product of the tree prior and trajectory density f(T|V,η)f(V|η) that distinguishes each of the models considered in this article.

For both the deterministic and stochastic coalescent SIR models, the tree prior f(T|V,η) is calculated in the following way. First, consider the time span of a tree divided into segments bracketed by both sampling and coalescent events. By considering intervals ending in sampling events as well as coalescent-ending intervals, we follow previous work that extended coalescent approaches to time-stamped, serially sampled data (Rodrigo and Felsenstein 1999; Drummond et al. 2002). Interval i is spanned by ki lineages and is the ith interval when ordered from the most recent tip to the root. The set of intervals A ending in sample events and the set of intervals Y ending in coalescent events together encompass all intervals, V=AY. Let the end time of an interval be τi (going back in time), with τ0=0 as the time of the most recent tip and with time increasing into the past. Then the probability density of a genealogy given an epidemic trajectory is
f(T|V,η)=iYλki(τi)iVω(τi,ki),
(7)
where λki(τ) is the instantaneous coalescent rate at τ prescribed by Volz (2012),
λki(τ)=(ki2)2βS(τ)(τ),
(8)
and where ω(τi,ki) is the survival probability
ω(τi,ki)=exp(τi1τiλki(τ)dτ).
(9)
The deterministic coalescent SIR model assumes that the SIR epidemic trajectories are found by integrating the ODEs in Equations 1–3. Therefore, under this model each epidemic trajectory is a deterministic function of its parameters V(η). This means that the trajectory density can be written as
f(V|η)=δ(VV(η)),
(10)
where δ(x) is the Dirac δ-function and represents a point mass concentrated at x=0.

In contrast, the stochastic coalescent SIR model assumes that the epidemic is generated by a jump process corresponding to the master equation given in Equation 5. In this case, the probability f(V|η) is nonsingular and thus contributes to the uncertainty in the final inference result.

In the BDSIR model introduced by Kuhnert et al. (2014), f(V|η) is the same as for the stochastic coalescent SIR model, but f(T|V,η) is defined differently. See Kühnert et al. (2014) for details.

Markov chain Monte Carlo algorithm

We use Markov chain Monte Carlo (MCMC) to sample from the joint posterior density given in Equation 6. Many of the specifics of the algorithm used have been discussed previously, in particular the method for calculating the tree likelihood (Felsenstein 1981, 2004) and the mechanism for exploring tree space (Drummond et al. 2002). However, the model-specific product f(T|V,η)f(V|η) requires special attention.

As we are primarily interested in parametric inference rather than the epidemic trajectory itself, we can regard V as a nuisance parameter to be marginalized over. This marginalization can be achieved implicitly by sampling it using MCMC and then ignoring this component of the sampled state, which is the strategy we use when reporting the BDSIR results. It can also be made an explicit part of the likelihood calculation, which is the approach we take with the deterministic and stochastic coalescent SIR models. This marginalization means that the product f(T|V,η)f(V|η) becomes
f(T|η)=f(T|V,η)f(V|η)dV,
(11)
the probability density of the tree given the epidemiological parameters.

In the case of the deterministic coalescent SIR model, this density reduces to f(T|V(η),η), meaning that the density of the tree given epidemiological parameters η is obtained simply by substituting the numerical solution to Equations 1–3 for those parameters into Equation 7.

The stochastic coalescent SIR model is more complex, as in this case the trajectory density f(V|η) is nonsingular, meaning that computing the integral in Equation 11 is nontrivial. We treat this here using the “pseudomarginal” approach (Beaumont 2003; Andrieu and Roberts 2009) in which, at each step in the MCMC chain, the marginalized tree density f(T|η) is replaced by the Monte Carlo estimate
f^(T|η)=1Mr=1Mf(T|Vr,η),
(12)
where each Vr is a trajectory sampled independently from f(V|η), using a stochastic simulation algorithm (Sehl et al. 2009). Perhaps counterintuitively within an MCMC framework, this stochastic likelihood converges to the true marginal posterior distribution regardless of the number M of realizations used in the estimate. However, the magnitude of M can significantly affect the rate at which the chain produces effectively independent samples from the posterior and must be tuned carefully.

Implementation and validation

We have implemented the schemes described above for performing inference under the deterministic and stochastic coalescent SIR models within the BEAST 2 phylodynamics package found at http://github.com/CompEvol/phylodynamics. This has a number of advantages over a stand-alone implementation. Foremost, we were able to avoid reimplementing components of the algorithm that are in common with other already-implemented phylogenetic and phylodynamic analyses, such as the MCMC proposal operators used to traverse the parameter space. Furthermore, this greatly increases the usefulness of the implementation, as it can be immediately used in conjunction with a wide variety of nucleotide and amino acid substitution models and parameter priors.

We have taken two steps to ensure our implementation is correct. First, we compared tree probability density f(T|V,η) values calculated using the main implementation of each of the two models with those calculated using completely independent implementations in R (R Core Team 2014).

Second, we used the implemented MCMC algorithms to sample transmission trees from the tree density given in Equation 11 for each model. We then compared the distributions of tree height, total edge length, and binary clade count summary statistics from these sampled ensembles with sample distributions obtained directly via stochastic simulation. As shown in Supporting Information, File S1, Figure S1, Figure S2, and Figure S3 (Sampling from the prior) and in the associated figures, the resulting pairs of distributions agree, providing strong support for our claim that the implementations of the methods described above are correct.

Instructions for downloading and using this package are also available on the project website located at http://github.com/CompEvol/phylodynamics.

Simulation study

To evaluate the implementation and extension of the coalescent models, we performed analyses on both sequence data and fixed trees simulated with known parameter values. The median estimated values produced by each model were then used to measure relative error and bias, along with the widths and coverage of 95% highest posterior density (HPD) intervals.

We used three methods for simulating the trees and trajectories, as shown below:
Inferencemodel:{Stoch.Coal.SIRDeter.Coal.SIRBDSIR______________________________________________________________________________________Simulationscheme:{Stoch.Coal.SIRStoch.Coal.SIRStoch.Coal.SIRDeter.Coal.SIRDeter.Coal.SIRDeter.Coal.SIRStochasticSIRStochasticSIRStochasticSIR.
The stochastic coalescent and deterministic coalescent simulation schemes were used to validate the coalescent SIR inference models. The stochastic SIR scheme, contrarily, is emphasized for its realistic properties.

Stochastic SIR trees and trajectories were generated using master equations in the simulation package MASTER (Vaughan and Drummond 2013). Deterministic coalescent trajectories were generated using a Runge–Kutta integrator (Runge 1895; Kutta 1901) with adaptive step sizes to solve a system of first order ODEs. Stochastic coalescent trajectories were generated using Sehl et al.’s (2009) SAL τ-leaping algorithm (Sehl et al. 2009).

To simulate the stochastic coalescent SIR trees, we used the stochastic SIR trajectories, which could be converted to effective population size with the mathematical expression used to obtain Volz’s (2012) coalescent rate for the SIR model: Ne(τ)=1/λ2(τ)=(τ)/(2βS(τ)). The sampling times, generated by a sampling rate ψ, for the stochastic coalescent SIR trees were also taken from the MASTER output to allow for direct comparison between the sets of trees. In other words, the underlying epidemic function was the same for both stochastic SIR and stochastic coalescent SIR trees, the latter of which were then simulated under a piecewise constant population function.

Likewise, for the simulation of deterministic coalescent trees we used deterministic SIR trajectories to construct a population function and the relation Ne=/(2βS) to convert infected and susceptible host population sizes to effective population size. The sampling times were randomly generated from a probability distribution so that the density of samples taken through time was proportional to the number of infected individuals through time, as with the stochastic SIR trees.

We simulated stochastic SIR trees, using multiple combinations of parameter values. We were particularly interested in varying the basic reproductive ratio R0 and the initial susceptible population size S0, to observe the changes in relative error, bias, and uncertainty in stochastic and deterministic models. To alter the ratio R0=βS0/γ and still generate sensible trees with a consistent number of tips, one or more of the other parameters (birth rate β, removal rate γ, or S0) must also change. Table 2, Table S6, Table S7, and Table S9 show the true values of the parameters for each set of simulations. (The birth rate β is not shown, as our implementation allows either β or R0 to serve as a parameter in the inference, and R0 is the parameter of interest. However, β can be calculated via the other three, using β=R0γ/S0. For example, when R0=1.0978, S0=499, and γ=0.25, then β=5.50E-4.)

Heterochronous trees:

We generated 100 trees under each of the three (stochastic SIR, stochastic coalescent SIR, and deterministic coalescent SIR) models with parameters S0, β, and γ. For heterochronously sampled trees, each removal generates a sample with probability ψ/(ψ+μ), where ψ is the overall rate of sampled removals and μ is the rate of unsampled removals such that γ=ψ+μ.

The simulations ended once the number of infected individuals reached zero, i.e., when the last infected individual was removed. This ensured that the simulated trajectories spanned past the exponential growth phase of the epidemic and therefore included samples past the peak of infected individuals. This choice of procedure was motivated by (a) the suggestion of Stadler et al. (2014) that the behavior of the coalescent beyond the exponential phase could either inflate or reduce bias and (b) the observations of Dearlove and Wilson (2013) and Bošková et al. (2014) that deterministic coalescent SIR models might be properly fitted only once the epidemic has peaked. Figure 1 shows trajectories of susceptible, infected, and removed individuals underlying the simulation of stochastic SIR trees (Figure 2) generated in MASTER. An example XML for simulating these MASTER trees is provided in File S1.

Figure 1

Stochastic SIR trajectories for susceptible S, infected I, and recovered R populations, with (top row) S0=999 and R0=2.4975, (middle row) S0=499 and R0=1.497, and (bottom row) S0=499 and R0=1.0978. (The right column shows infected I only.)

Figure 2

(A) Full stochastic SIR transmission tree with both sampled ψ-tips, shown in red, and otherwise removed μ-tips, shown in yellow. (B) The corresponding 140-tip sampled stochastic SIR tree. A and B were generated in FigTree (Rambaut 2007).

We required that the trees had n100 leaves, filtering out those in which the epidemic died out in the early stages, i.e., when the initial infected individual was removed from the effective population too quickly to infect others. (Note that the inference procedures discussed in this article all implicitly condition on the number of leaves.) The probability that the first event in a given trajectory is the removal (by recovery, death, etc.) of patient zero is given by δ/(βS0+δ)=1/(1+R0). When R02.50, this probability is 30%. In our case, 52/152 (34%) trees were “empty” or containing only one node. The filtering process left us with a mean of 160 leaves for the simulated trees.

Homochronous trees:

A major concern in the comparison between Kühnert et al. (2014)’s birth–death-sampling SIR inference model, which includes explicit sampling, and our implementations of Volz (2012)’s coalescent SIR models, which do not include explicit sampling, is that the former is given extra information via the sampling process. Volz and Frost (2014) addressed this issue by providing a coalescent SIR model that does incorporate sampling explicitly.

That being said, results from Bošková et al. (2014) indicate that the poor performance of the deterministic coalescent SIR model in comparison with birth–death models was due to the lack of handling stochastic population size changes through time rather than the lack of information about the sampling proportion. Their results showed that the coalescent is “very robust to changes in sampling schemes” (Boskova et al. 2014, p. 8).

Regardless, to ensure a fair comparison of BDSIR and the coalescent SIR models, we simulated an SIR epidemic with homochronous, or contemporaneous, sampling. This type of simulation affords no additional information about the population size for explicit-sampling models, as there is only a single time of sampling.

We selected a simulation time of t=20 for the homochronously sampled trees, with the trajectories being sampled at high prevalence but also past the time of peak prevalence. This is important for distinguishing SIR from SI/SIS outbreaks, as it provides information about the removal parameter γ. In this set of simulations, each lineage was sampled at t=20 with probability 0.7 (the leaf count distribution for varied sampling probabilities is in File S1).

Simulated sequences:

To assess the ability of each SIR model to infer epidemic parameters with the inclusion of phylogenetic uncertainty, we also simulated the evolution of 2000-bp sequences down each simulated tree. We time stamped the sequences with the tip dates of each corresponding tree and informed the inference with the true Hasegawa–Kishino–Yano (HKY) substitution model (Hasegawa et al. 1985), clock rate =5E-3, and κ=5. These choices were made to reflect real data, specifically those of influenza (Vaughan et al. 2014).

Along with simulated sequence data, analyses were performed with the simulated trees fixed (results are in File S1), and the parameters R0, γ, S0, and the origin of the tree z0 were estimated with Bayesian prior distributions as listed in Table 4.

Deterministic coalescent SIR on higher R0 and S0:

Finally, we had particular interest in the effects of varying the population size parameter S0 on the deterministic coalescent SIR model, as comparisons from initial analyses with lower true R0 (≈1.5 and ≈1.1) and S0 (= 499) showed higher error and bias and lower 95% HPD coverage. Also, it is often assumed that deterministic descriptions will perform well for higher R0 and larger population sizes. Table S7 and Table S9 detail the parameter values we used to explore the behavior of the deterministic coalescent on varied R0 and S0 combinations.

Interpretation of results

We compared the coalescent SIR, as well as BDSIR, parameter estimations from the simulated data to the true values used to generate the SIR trajectories. Following Kühnert et al. (2014), the precision and accuracy of these methods were measured by relative error, bias, and HPD intervals. We used the posterior median value of the parameter value η^ compared with the true parameter η¯{R0,γ,S0,z0}. Relative error and bias are then gauged by calculating the median value over medians from all 100 trees, such that
REη^=τ=1100|η^η¯|/η¯100
and
RBη^=τ=1100|η^η¯|/η¯100.
Measures of HPD interval widths are given by
95%HPDupperbound95%HPDlowerboundη¯.
Table 1, Table 2, and Table 3 show these results, along with the percentages of posterior estimates that produced 95% HPD intervals containing the true values (i.e., 95% HPD coverage).

Results for simulated sequences: R02.50, S0=999

Table 1
Results for simulated sequences: R02.50, S0=999
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.412.160.13−0.110.9797.00
Deter.Coal.SIR2.502.782.030.380.050.7987.00
BDSIR2.503.212.840.150.141.86100.00
γStoch.Coal.SIR0.300.160.130.52−0.520.8247.00
Deter.Coal.SIR0.300.250.160.56−0.280.9756.00
BDSIR0.300.170.140.52−0.521.1384.00
S(0)Stoch.Coal.SIR999180511480.320.215.1299.00
Deter.Coal.SIR999238415650.660.606.54100.00
BDSIR999400226111.701.7010.3899.00
z(0)Stoch.Coal.SIRVaries51.6748.890.260.230.6137.00
Deter.Coal.SIRVaries49.1346.460.220.200.2629.00
BDSIRVaries31.1629.520.510.510.7918.00
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.412.160.13−0.110.9797.00
Deter.Coal.SIR2.502.782.030.380.050.7987.00
BDSIR2.503.212.840.150.141.86100.00
γStoch.Coal.SIR0.300.160.130.52−0.520.8247.00
Deter.Coal.SIR0.300.250.160.56−0.280.9756.00
BDSIR0.300.170.140.52−0.521.1384.00
S(0)Stoch.Coal.SIR999180511480.320.215.1299.00
Deter.Coal.SIR999238415650.660.606.54100.00
BDSIR999400226111.701.7010.3899.00
z(0)Stoch.Coal.SIRVaries51.6748.890.260.230.6137.00
Deter.Coal.SIRVaries49.1346.460.220.200.2629.00
BDSIRVaries31.1629.520.510.510.7918.00
Table 1
Results for simulated sequences: R02.50, S0=999
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.412.160.13−0.110.9797.00
Deter.Coal.SIR2.502.782.030.380.050.7987.00
BDSIR2.503.212.840.150.141.86100.00
γStoch.Coal.SIR0.300.160.130.52−0.520.8247.00
Deter.Coal.SIR0.300.250.160.56−0.280.9756.00
BDSIR0.300.170.140.52−0.521.1384.00
S(0)Stoch.Coal.SIR999180511480.320.215.1299.00
Deter.Coal.SIR999238415650.660.606.54100.00
BDSIR999400226111.701.7010.3899.00
z(0)Stoch.Coal.SIRVaries51.6748.890.260.230.6137.00
Deter.Coal.SIRVaries49.1346.460.220.200.2629.00
BDSIRVaries31.1629.520.510.510.7918.00
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.412.160.13−0.110.9797.00
Deter.Coal.SIR2.502.782.030.380.050.7987.00
BDSIR2.503.212.840.150.141.86100.00
γStoch.Coal.SIR0.300.160.130.52−0.520.8247.00
Deter.Coal.SIR0.300.250.160.56−0.280.9756.00
BDSIR0.300.170.140.52−0.521.1384.00
S(0)Stoch.Coal.SIR999180511480.320.215.1299.00
Deter.Coal.SIR999238415650.660.606.54100.00
BDSIR999400226111.701.7010.3899.00
z(0)Stoch.Coal.SIRVaries51.6748.890.260.230.6137.00
Deter.Coal.SIRVaries49.1346.460.220.200.2629.00
BDSIRVaries31.1629.520.510.510.7918.00

Simulation study results for fixed trees: R02.50 and S0=999, R01.50 and S0=499, and R01.10 and S0=499

Table 2
Simulation study results for fixed trees: R02.50 and S0=999, R01.50 and S0=499, and R01.10 and S0=499
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.842.680.120.090.98100.00
Deter.Coal.SIR2.502.682.490.130.040.8198.00
BDSIR2.502.732.670.120.080.5594.00
γStoch.Coal.SIR0.300.270.250.19−0.131.1499.00
Deter.Coal.SIR0.300.320.290.163.14E-31.2799.00
BDSIR0.300.280.270.13−0.090.6295.00
S(0)Stoch.Coal.SIR99913909210.19−0.033.85100.00
Deter.Coal.SIR999180711330.520.294.5998.00
BDSIR999159111420.390.243.4299.00
z(0)Stoch.Coal.SIRVaries41.8140.350.030.010.2099.00
Deter.Coal.SIRVaries41.1739.990.030.010.0776.00
BDSIRVaries40.8939.728.65E-4−5.13E-43.43E-397.00
0Stoch.Coal.SIR1.501.481.370.09−0.060.81100.00
Deter.Coal.SIR1.501.801.490.240.150.5285.00
BDSIR1.501.461.430.08−0.030.4799.00
γStoch.Coal.SIR0.300.190.170.40−0.401.0685.00
Deter.Coal.SIR0.300.260.230.27−0.221.1589.00
BDSIR0.300.260.250.18−0.180.7297.00
S(0)Stoch.Coal.SIR4995993900.25−0.223.56100.00
Deter.Coal.SIR4995623610.44−0.263.3691.00
BDSIR4999967140.510.494.63100.00
z(0)Stoch.Coal.SIRVaries76.4768.240.550.540.5899.00
Deter.Coal.SIRVaries91.0372.510.390.380.4288.00
BDSIRVaries69.1166.510.34−0.310.2094.00
0Stoch.Coal.SIR1.101.391.320.220.221.0999.00
Deter.Coal.SIR1.101.681.440.460.460.5925.00
BDSIR1.101.341.320.200.200.5175.00
γStoch.Coal.SIR0.250.170.150.37−0.361.1184.00
Deter.Coal.SIR0.250.220.180.30−0.221.1686.00
BDSIR0.250.280.260.120.090.92100.00
S(0)Stoch.Coal.SIR4996083980.24−0.183.38100.00
Deter.Coal.SIR4995533370.42−0.263.0892.00
BDSIR499147110401.211.216.5299.00
z(0)Stoch.Coal.SIRVaries91.6084.550.060.020.6097.00
Deter.Coal.SIRVaries112.7990.370.260.260.9485.00
BDSIRVaries82.9880.930.02−0.010.0888.00
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.842.680.120.090.98100.00
Deter.Coal.SIR2.502.682.490.130.040.8198.00
BDSIR2.502.732.670.120.080.5594.00
γStoch.Coal.SIR0.300.270.250.19−0.131.1499.00
Deter.Coal.SIR0.300.320.290.163.14E-31.2799.00
BDSIR0.300.280.270.13−0.090.6295.00
S(0)Stoch.Coal.SIR99913909210.19−0.033.85100.00
Deter.Coal.SIR999180711330.520.294.5998.00
BDSIR999159111420.390.243.4299.00
z(0)Stoch.Coal.SIRVaries41.8140.350.030.010.2099.00
Deter.Coal.SIRVaries41.1739.990.030.010.0776.00
BDSIRVaries40.8939.728.65E-4−5.13E-43.43E-397.00
0Stoch.Coal.SIR1.501.481.370.09−0.060.81100.00
Deter.Coal.SIR1.501.801.490.240.150.5285.00
BDSIR1.501.461.430.08−0.030.4799.00
γStoch.Coal.SIR0.300.190.170.40−0.401.0685.00
Deter.Coal.SIR0.300.260.230.27−0.221.1589.00
BDSIR0.300.260.250.18−0.180.7297.00
S(0)Stoch.Coal.SIR4995993900.25−0.223.56100.00
Deter.Coal.SIR4995623610.44−0.263.3691.00
BDSIR4999967140.510.494.63100.00
z(0)Stoch.Coal.SIRVaries76.4768.240.550.540.5899.00
Deter.Coal.SIRVaries91.0372.510.390.380.4288.00
BDSIRVaries69.1166.510.34−0.310.2094.00
0Stoch.Coal.SIR1.101.391.320.220.221.0999.00
Deter.Coal.SIR1.101.681.440.460.460.5925.00
BDSIR1.101.341.320.200.200.5175.00
γStoch.Coal.SIR0.250.170.150.37−0.361.1184.00
Deter.Coal.SIR0.250.220.180.30−0.221.1686.00
BDSIR0.250.280.260.120.090.92100.00
S(0)Stoch.Coal.SIR4996083980.24−0.183.38100.00
Deter.Coal.SIR4995533370.42−0.263.0892.00
BDSIR499147110401.211.216.5299.00
z(0)Stoch.Coal.SIRVaries91.6084.550.060.020.6097.00
Deter.Coal.SIRVaries112.7990.370.260.260.9485.00
BDSIRVaries82.9880.930.02−0.010.0888.00
Table 2
Simulation study results for fixed trees: R02.50 and S0=999, R01.50 and S0=499, and R01.10 and S0=499
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.842.680.120.090.98100.00
Deter.Coal.SIR2.502.682.490.130.040.8198.00
BDSIR2.502.732.670.120.080.5594.00
γStoch.Coal.SIR0.300.270.250.19−0.131.1499.00
Deter.Coal.SIR0.300.320.290.163.14E-31.2799.00
BDSIR0.300.280.270.13−0.090.6295.00
S(0)Stoch.Coal.SIR99913909210.19−0.033.85100.00
Deter.Coal.SIR999180711330.520.294.5998.00
BDSIR999159111420.390.243.4299.00
z(0)Stoch.Coal.SIRVaries41.8140.350.030.010.2099.00
Deter.Coal.SIRVaries41.1739.990.030.010.0776.00
BDSIRVaries40.8939.728.65E-4−5.13E-43.43E-397.00
0Stoch.Coal.SIR1.501.481.370.09−0.060.81100.00
Deter.Coal.SIR1.501.801.490.240.150.5285.00
BDSIR1.501.461.430.08−0.030.4799.00
γStoch.Coal.SIR0.300.190.170.40−0.401.0685.00
Deter.Coal.SIR0.300.260.230.27−0.221.1589.00
BDSIR0.300.260.250.18−0.180.7297.00
S(0)Stoch.Coal.SIR4995993900.25−0.223.56100.00
Deter.Coal.SIR4995623610.44−0.263.3691.00
BDSIR4999967140.510.494.63100.00
z(0)Stoch.Coal.SIRVaries76.4768.240.550.540.5899.00
Deter.Coal.SIRVaries91.0372.510.390.380.4288.00
BDSIRVaries69.1166.510.34−0.310.2094.00
0Stoch.Coal.SIR1.101.391.320.220.221.0999.00
Deter.Coal.SIR1.101.681.440.460.460.5925.00
BDSIR1.101.341.320.200.200.5175.00
γStoch.Coal.SIR0.250.170.150.37−0.361.1184.00
Deter.Coal.SIR0.250.220.180.30−0.221.1686.00
BDSIR0.250.280.260.120.090.92100.00
S(0)Stoch.Coal.SIR4996083980.24−0.183.38100.00
Deter.Coal.SIR4995533370.42−0.263.0892.00
BDSIR499147110401.211.216.5299.00
z(0)Stoch.Coal.SIRVaries91.6084.550.060.020.6097.00
Deter.Coal.SIRVaries112.7990.370.260.260.9485.00
BDSIRVaries82.9880.930.02−0.010.0888.00
ηInferenceTruthMeanMedianErrorBiasRelative HPD width95% HPD accuracy (%)
0Stoch.Coal.SIR2.502.842.680.120.090.98100.00
Deter.Coal.SIR2.502.682.490.130.040.8198.00
BDSIR2.502.732.670.120.080.5594.00
γStoch.Coal.SIR0.300.270.250.19−0.131.1499.00
Deter.Coal.SIR0.300.320.290.163.14E-31.2799.00
BDSIR0.300.280.270.13−0.090.6295.00
S(0)Stoch.Coal.SIR99913909210.19−0.033.85100.00
Deter.Coal.SIR999180711330.520.294.5998.00
BDSIR999159111420.390.243.4299.00
z(0)Stoch.Coal.SIRVaries41.8140.350.030.010.2099.00
Deter.Coal.SIRVaries41.1739.990.030.010.0776.00
BDSIRVaries40.8939.728.65E-4−5.13E-43.43E-397.00
0Stoch.Coal.SIR1.501.481.370.09−0.060.81100.00
Deter.Coal.SIR1.501.801.490.240.150.5285.00
BDSIR1.501.461.430.08−0.030.4799.00
γStoch.Coal.SIR0.300.190.170.40−0.401.0685.00
Deter.Coal.SIR0.300.260.230.27−0.221.1589.00
BDSIR0.300.260.250.18−0.180.7297.00
S(0)Stoch.Coal.SIR4995993900.25−0.223.56100.00
Deter.Coal.SIR4995623610.44−0.263.3691.00
BDSIR4999967140.510.494.63100.00
z(0)Stoch.Coal.SIRVaries76.4768.240.550.540.5899.00
Deter.Coal.SIRVaries91.0372.510.390.380.4288.00
BDSIRVaries69.1166.510.34−0.310.2094.00
0Stoch.Coal.SIR1.101.391.320.220.221.0999.00
Deter.Coal.SIR1.101.681.440.460.460.5925.00
BDSIR1.101.341.320.200.200.5175.00
γStoch.Coal.SIR0.250.170.150.37−0.361.1184.00
Deter.Coal.SIR0.250.220.180.30−0.221.1686.00
BDSIR0.250.280.260.120.090.92100.00
S(0)Stoch.Coal.SIR4996083980.24−0.183.38100.00
Deter.Coal.SIR4995533370.42−0.263.0892.00
BDSIR499147110401.211.216.5299.00
z(0)Stoch.Coal.SIRVaries91.6084.550.060.020.6097.00
Deter.Coal.SIRVaries112.7990.370.260.260.9485.00
BDSIRVaries82.9880.930.02−0.010.0888.00

Epidemic parameter inference from H1N1 sequences in New Zealand

Table 3
Epidemic parameter inference from H1N1 sequences in New Zealand
Inference modelR0γS0Root of the tree (yr)Origin z0 of the epidemic (yr)
Stoch. Coal. SIR1.46 (1.04–2.14)27.08 (4.20–64.03)6.90E4 (175–2.86E5)0.53 (0.44–0.61)0.69 (0.45–1.03)
Deter. Coal. SIR1.35 (1.05–1.84)34.50 (3.86–82.16)1.20E5 (29–4.59E5)0.54 (0.45–0.62)0.73 (0.47–1.04)
BDSIR1.61 (1.09–2.29)27.72 (6.82–55.04)2.22E4 (259–9.38E4)0.49 (0.41–0.56)0.53 (0.43–0.65)
Inference modelR0γS0Root of the tree (yr)Origin z0 of the epidemic (yr)
Stoch. Coal. SIR1.46 (1.04–2.14)27.08 (4.20–64.03)6.90E4 (175–2.86E5)0.53 (0.44–0.61)0.69 (0.45–1.03)
Deter. Coal. SIR1.35 (1.05–1.84)34.50 (3.86–82.16)1.20E5 (29–4.59E5)0.54 (0.45–0.62)0.73 (0.47–1.04)
BDSIR1.61 (1.09–2.29)27.72 (6.82–55.04)2.22E4 (259–9.38E4)0.49 (0.41–0.56)0.53 (0.43–0.65)

Shown are mean estimates (and 95% HPD intervals) of each epidemic parameter inferred from seasonal influenza A (H1N1) sequence data collected in the Canterbury region of New Zealand throughout the 2001 flu season.

Table 3
Epidemic parameter inference from H1N1 sequences in New Zealand
Inference modelR0γS0Root of the tree (yr)Origin z0 of the epidemic (yr)
Stoch. Coal. SIR1.46 (1.04–2.14)27.08 (4.20–64.03)6.90E4 (175–2.86E5)0.53 (0.44–0.61)0.69 (0.45–1.03)
Deter. Coal. SIR1.35 (1.05–1.84)34.50 (3.86–82.16)1.20E5 (29–4.59E5)0.54 (0.45–0.62)0.73 (0.47–1.04)
BDSIR1.61 (1.09–2.29)27.72 (6.82–55.04)2.22E4 (259–9.38E4)0.49 (0.41–0.56)0.53 (0.43–0.65)
Inference modelR0γS0Root of the tree (yr)Origin z0 of the epidemic (yr)
Stoch. Coal. SIR1.46 (1.04–2.14)27.08 (4.20–64.03)6.90E4 (175–2.86E5)0.53 (0.44–0.61)0.69 (0.45–1.03)
Deter. Coal. SIR1.35 (1.05–1.84)34.50 (3.86–82.16)1.20E5 (29–4.59E5)0.54 (0.45–0.62)0.73 (0.47–1.04)
BDSIR1.61 (1.09–2.29)27.72 (6.82–55.04)2.22E4 (259–9.38E4)0.49 (0.41–0.56)0.53 (0.43–0.65)

Shown are mean estimates (and 95% HPD intervals) of each epidemic parameter inferred from seasonal influenza A (H1N1) sequence data collected in the Canterbury region of New Zealand throughout the 2001 flu season.

H1N1 data analysis

To test the efficacy of the coalescent SIR models on real data, epidemic parameters R0, γ, S0, and time of origin z0 were estimated from 42 seasonal influenza A (H1N1) sequences sampled throughout the 2001 flu season in Canterbury, New Zealand.

Influenza infections are well known for their seasonal SIR behavior in nonequatorial populations, as each annual flu season begins with a supply of susceptible hosts and tapers off as the hosts recover with adaptive immunity (Iwasaki and Pillai 2014). Due partly to this seasonal pattern, the influenza virus is both a motivator for the development of specialized models and a prime subject for testing phylodynamic models (Koelle et al. 2006).

Sampling a particular region bypasses the necessity of specifying geographically structured populations, and New Zealand is an area of particular interest due to its geographic location and relative isolation from other regions with potentially varying dynamics. It is also assumed to play a key role in the global circulation of influenza strains (Rambaut and Holmes 2009; Bedford et al. 2010).

We used an HKY nucleotide substitution model, with a substitution rate of 5E-3 as estimated in Vaughan et al. (2014), and informed the models with dated sequences. Priors used for the Bayesian inference are shown in Table 4.

Bayesian prior distributions

Table 4
Bayesian prior distributions
AnnalysisR0γS(0)z(0)ψ/(ψ+μ)
R02.5, S0=999LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
R01.5, S0=499LogN(0.5, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=499LogN(0.1, 1)LogN(−1.5, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7, 1)Unif(0, 500)
R01.1, S0=1999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7.5, 1)Unif(0, 500)
R01.2, S0=499aLogN(0.2, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)
H1N1Unif(0, 10)LogN(3, 0.75)LogN(13, 2)Unif(0, 10)Beta(1, 1)
HIV-1LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
AnnalysisR0γS(0)z(0)ψ/(ψ+μ)
R02.5, S0=999LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
R01.5, S0=499LogN(0.5, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=499LogN(0.1, 1)LogN(−1.5, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7, 1)Unif(0, 500)
R01.1, S0=1999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7.5, 1)Unif(0, 500)
R01.2, S0=499aLogN(0.2, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)
H1N1Unif(0, 10)LogN(3, 0.75)LogN(13, 2)Unif(0, 10)Beta(1, 1)
HIV-1LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)

Shown are prior distributions for the reestimation of SIR parameters—the reproductive ratio R0, the rate of removal γ, the number of susceptible individuals at the start of the epidemic S(0), the time of origin z(0), and the sampling proportion ψ/(ψ+μ) for BDSIR—from the simulated trees, seasonal influenza A (H1N1), and human immunodeficiency virus (HIV-1) data analyses. LogN(M, S) is a log-normal distribution with mean M and standard deviation S in log space.

a

Only applies to deterministic coalescent SIR; see details in File S1.

Table 4
Bayesian prior distributions
AnnalysisR0γS(0)z(0)ψ/(ψ+μ)
R02.5, S0=999LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
R01.5, S0=499LogN(0.5, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=499LogN(0.1, 1)LogN(−1.5, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7, 1)Unif(0, 500)
R01.1, S0=1999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7.5, 1)Unif(0, 500)
R01.2, S0=499aLogN(0.2, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)
H1N1Unif(0, 10)LogN(3, 0.75)LogN(13, 2)Unif(0, 10)Beta(1, 1)
HIV-1LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
AnnalysisR0γS(0)z(0)ψ/(ψ+μ)
R02.5, S0=999LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)
R01.5, S0=499LogN(0.5, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=499LogN(0.1, 1)LogN(−1.5, 1)LogN(6, 1)Unif(0, 500)Beta(1, 1)
R01.1, S0=999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7, 1)Unif(0, 500)
R01.1, S0=1999aLogN(0.1, 1)LogN(−1.5, 1)LogN(7.5, 1)Unif(0, 500)
R01.2, S0=499aLogN(0.2, 1)LogN(−1, 1)LogN(6, 1)Unif(0, 500)
H1N1Unif(0, 10)LogN(3, 0.75)LogN(13, 2)Unif(0, 10)Beta(1, 1)
HIV-1LogN(1, 1)LogN(−1, 1)LogN(7, 1)Unif(0, 100)Beta(1, 1)

Shown are prior distributions for the reestimation of SIR parameters—the reproductive ratio R0, the rate of removal γ, the number of susceptible individuals at the start of the epidemic S(0), the time of origin z(0), and the sampling proportion ψ/(ψ+μ) for BDSIR—from the simulated trees, seasonal influenza A (H1N1), and human immunodeficiency virus (HIV-1) data analyses. LogN(M, S) is a log-normal distribution with mean M and standard deviation S in log space.

a

Only applies to deterministic coalescent SIR; see details in File S1.

HIV-1 data analysis

In addition to our analysis of H1N1 sequence data, we selected HIV-1 subtype B nucleotide sequences collected from infected individuals located in the United Kingdom. The coalescent SIR results were collated with the results from the BDSIR data analysis performed by Kühnert et al. (2014), using the same sequences. More details of this analysis are provided in File S1.

Results and Discussion

Simulation study

Results for epidemic parameter inference from nucleotide sequences simulated from stochastic SIR trees are provided in Table 1 for R02.50. Results for inference from fixed trees (R02.50, R01.50, R01.10) are shown in Table 2, with 95% HPD coverage shown for each analysis in Figure 3. Inference results for analyses with true R0=1.0987 and varying population size (S0=499,999,1999) are described in Tables S1 and S2 in the supporting information, along with results from trees simulated under the stochastic and deterministic coalescent models for validation.

Figure 3

Estimates of R(0) from true stochastic SIR trees using inference methods by column, with stochastic coalescent SIR (A–C), deterministic coalescent SIR (D–F), and BDSIR (G–I). The truth varies by row, with R0=2.4975 (A, D, and G), R0=1.4970 (B, E, and H), and R0=1.0978 (C, F, and I).

Heterochronous trees:

For R02.50, all three inference methods performed similarly for parameters R0 and γ, with high 95% HPD coverage and low error and bias. The most weakly identifiable parameter S0 yielded the largest HPD intervals for all three inference models. The deterministic coalescent returned higher error (0.52) and bias (0.29) than the stochastic coalescent SIR (0.19, −0.03) and BDSIR (0.39, 0.24) and recovered the origin parameter z0 for only 76 of 100 simulated trees, while the stochastic coalescent and BDSIR respectively recovered z0 for 99 and 97 of 100 simulations.

For R01.50, the relative HPD widths (akin to variance) for three of the four estimated parameters (R0, γ, and z0) were smallest for BDSIR. For the parameter S0, the relative HPD width is largest for BDSIR, although it also had slightly higher 95% HPD coverage than deterministic coalescent SIR and the same as stochastic coalescent SIR. The deterministic coalescent SIR method recovered the truth for 85, 89, 91, and 88 of 100 trees for parameters R0, γ, S0, and z0, while its stochastic analog recovered the truth for 100, 85, 100, and 99 of 100 trees for the same parameters. Finally, for stochastic coalescent SIR and BDSIR, error and (absolute) bias were relatively low for R0, arguably the parameter of most interest to epidemiologists since it represents the number of individuals each infected individual will infect in a naive population. Deterministic coalescent SIR has a higher error (0.24) and bias (0.15) and also has significantly lower coverage for R0 (85%).

For R01.10, the two stochastic models again outperformed the deterministic coalescent in error, bias, and 95% HPD coverage. The stochastic coalescent most reliably recovered the truth for R0 (99 of 100 simulations), while the deterministic coalescent had more than double the error and bias and still recovered the truth for only 25 of the 100 simulations. BDSIR had the lowest error and bias for R0 under this scheme, although it recovered the truth for only 75 of 100 simulations. For removal parameter γ, BDSIR again yielded lower error and bias, in this case returning the truth for 100/100 trees (in contrast to 84 and 86 from the stochastic and deterministic coalescent, respectively).

In the stochastic models, there is a greater trade-off between parameters due to the impact the relationship between them has on the survival of trajectories at low R0. A larger estimated removal rate tends to require a larger susceptible population for the epidemic to avoid dying out in the early stages. Likewise, a smaller susceptible population implies a smaller estimated γ.

Deterministic coalescent SIR on higher R0 and S0:

As mentioned in the preceding subsection, the deterministic coalescent model yielded higher error and bias than both the stochastic coalescent and BDSIR for most parameters with R01.10 and S0=499.

To investigate the deterministic model’s sensitivity to population sizes, we also simulated a range of population sizes (S0= 499, 999, and 1999) for R0=1.0987. Even with S0=1999, the deterministic coalescent SIR model’s 95% HPD coverage was low. For parameters R0, γ, S0, and z0, this coverage was respectively 40%, 64%, 66%, and 18%. Table S6 shows these results.

Additionally, we increased both R0 (to 3.5 and 5) and S0 (to 4999 and 9999). However, for parameters R0, γ, and S0, the deterministic coalescent SIR showed increased error, bias, and HPD widths, and the HPD coverage for z0 did not improve. These results are shown in Table S9.

While each of these methods is an approximation, the deterministic coalescent particularly suffers from model misspecification since it does not account for the stochasticity that is always present in the early stages of epidemics, regardless of S0.

Homochronous trees:

Results for homochronously sampled trees are given in Table S3.

All three SIR inference models recover the truth for >95/100 trees within their respective 95% HPD widths for epidemic parameters R0, γ, and S0. The time of origin z0 was recovered for 100/100 trees by BDSIR, 95/100 trees by stochastic coalescent SIR, and 73/100 trees by deterministic coalescent SIR. However, relative error and bias also increased consistently across all three models, along with the 95% HPD widths. The deterministic coalescent had the highest error, bias, and HPD width for R0 and highest error and HPD width for S0, which is consistent with the heterochronously sampled data.

Further consideration of the effects of sampling rate changes and sampling model misspecification are warranted for BDSIR and coalescent SIR, the latter of which has been facilitated by Volz and Frost (2014).

Simulated sequences:

Relative error and bias were inflated across all three inference models with the addition of phylogenetic uncertainty, and in certain cases the 95% HPD coverage was lower than with fixed trees. The deterministic coalescent model recovered the truth within its 95% HPD intervals only for ≥90 of the 100 trees in the case of S0. The true values for the parameters R0, γ, and z0 were covered by 95% HPD intervals for 87, 56, and 29 of the 100 trees, respectively. This is contrasted with the performance of the stochastic coalescent (100, 97, 47, and 37 for parameters S0, R0, γ, and z0) and BDSIR (99, 100, 84, and 18 for S0, R0, γ, and z0), as shown in Table 1.

Error, bias, and 95% HPD widths were higher with simulated sequences for all three inference models for parameters γ, S0, and z0 than with fixed trees. This indicates the importance of calibrating epidemic parameters of interest. In our case, we emphasize the basic reproductive number R0, often the parameter of most interest to epidemiologists. For R0, stochastic coalescent SIR and BDSIR recovered the truth within their 95% HPD intervals for 97 and 100 of the 100 simulations, respectively. They also showed only slight changes in error and bias compared to inference performed on the fixed trees used to generate the sequences. The deterministic coalescent SIR model recovered R0 for 87 of the 100 simulations (contrasted with 98/100 for the fixed trees) and with increased error.

Priors and identifiability:

It is important to understand the impact of selected priors on inference results, as the priors are where the power of Bayesian inference lies. For example, we found relatively weak identifiability in the initial susceptible population parameter S0, which must either be fixed or be estimated alongside the origin parameter z0.

In addition to allowing each parameter to be either fixed or estimated, we have provided options for parameterization of our models, with either the transmission rate β or R0 acting as operable parameters in MCMC analysis. For the deterministic coalescent, there is also an option to use the intrinsic growth parameter described by Dearlove and Wilson (2013).

The choice of parameterization necessarily affects the prior that will be used in the inference and should be considered carefully. However, we found that once a parameterization has been selected, our inference models are robust to different prior distributions placed on each parameter. We also used broader prior distributions on the deterministic coalescent to test whether this would increase its lower 95% HPD coverage relative to the stochastic models. We found that doing so increased the error and bias of the results without increasing the accuracy (shown in Table S4).

H1N1 data analysis

Epidemic parameter estimates from serially sampled influenza A (H1N1) virus sequence data are shown in Table 3.

The estimated means of the basic reproductive number were R0= 1.46, 1.35, and 1.61 for the stochastic coalescent, the deterministic coalescent, and BDSIR, respectively. Estimates of R0 from pandemic H1N1 in New Zealand range from ∼1.2 to 1.5 (Paine et al. 2010; Opatowski et al. 2011; Roberts and Nishiura 2011; Roberts 2013; Biggerstaff et al. 2014), and estimates of R0 for seasonal H1N1 from other countries also range from ∼1.2 to 1.5 (Chowell et al. 2008). The 95% HPD intervals were very similar across each model, ranging from just over 1.0 to ∼2.0.

The population of the Canterbury region in 2001 was reported to be ∼481, 431 by the Environment Canterbury Regional Council (Ecan 2001) and 521, 832 by Statistics New Zealand (StatsNZ 2001). The mean estimates of S0 were considerably lower using the stochastic coalescent (S0=69,000), the deterministic coalescent (S0=120,000), and BDSIR (S0=22,200). However, the effective population of susceptibles is assumed to be much smaller, as the total population contains individuals of various susceptibility, e.g., those with partial immunity from vaccination and previous or secondary infections.

Most people recover from flu symptoms, the time they are likely to be most infectious, within a few days up to 2 weeks (CDC 2014; WHO 2014). This provides a range of probable true values for the removal parameter γ. The sequence data and molecular clock rate, and therefore the tree, are in units of years. Therefore, our γ range would be from 365/14 days to 365/2 days or from γ=26.1 to γ=182.5. The stochastic coalescent, the deterministic coalescent, and BDSIR respectively inferred γ means of 27.08, 34.50, and 27.72. These estimates are on the low side compared to epidemiological models for influenza that include explicit spatial and household effects (Ferguson et al. 2005), but a moderate misfit of the model is not unexpected when fitting a simple closed SIR model with no population substructure.

The root of the tree was very similar across all inference models, respectively 0.53, 0.54, and 0.49 for stochastic coalescent SIR, deterministic coalescent SIR, and BDSIR. The same was true for the origin z0, with: 0.69, 0.73, and 0.53 for the stochastic coalescent, the deterministic coalescent, and BDSIR. All three inference models returned tree root and origin estimates that are consistent with previous estimates from single flu seasons. That is, the tree age is young and the root coincides with the start of the (winter) influenza season in the Southern Hemisphere. The time of introduction of influenza into the region, z0, was 1 or 2 months before the root. This supports the notion that the sequences selected represent a single introduction of the strain into the Canterbury population (see File S1 for details of data selection and Figure S4 for representative trees inferred from an alternate data selection.).

The trees estimated by each of the three models are typical for influenza (see Figure 4 for representative trees from each posterior), with branches that are quick to coalesce moving backward in time from the most recently sampled tip.

Figure 4

Representative influenza A (H1N1) posterior trees from inference using the (A) BDSIR, (B) stochastic coalescent SIR, and (C) deterministic coalescent SIR inference models.

HIV-1 data analysis

Results for inference from HIV-1 sequence data can be found in File S1. 95% HPD intervals are shown in Figure S5, Figure S6, Figure S7, and Table S8.

Computational efficiency

Finally, Table S5 shows comparisons of computation times under each inference model for each type of data analyzed. The deterministic coalescent SIR model is by far the fastest to sample and converge, with stochastic coalescent SIR and BDSIR varying, depending on the type of data.

Closing remarks

A key reason for the success of coalescent theory in population genetics is its mathematical simplicity and the computational efficiency of calculating the probability density of a sample genealogy. Our results show that a stochastic variant of coalescent theory can be successfully adapted to estimate epidemiological parameters in a true Bayesian inference context. This stochastic coalescent SIR model performs better than the deterministic analog for estimating epidemic parameters in some circumstances. Unfortunately, the stochastic model relies on a computationally demanding Monte Carlo estimate of the coalescent density via simulation of an ensemble of epidemic trajectories, negating one of the main advantages of coalescent theory. In fact, the current implementation is less computationally efficient than the implementation of the BDSIR model. However, an advantage of the stochastic coalescent over the explicit sampling model in BDSIR is its robustness to biased sampling schemes, as has been shown for the case of pure exponential growth dynamics (Bošková et al. 2014).

A more computationally efficient approach to computing the coalescent probability of the sample genealogy in the stochastic setting would be to use particle filtering (Andrieu and Roberts 2009; Andrieu et al. 2010; Rasmussen et al. 2011, 2014), but there are no theoretical barriers to applying particle MCMC to the exact model (Stadler et al. 2014). Therefore, an obvious extension of this work would be to apply particle MCMC algorithms to the exact stochastic SIR model that was used in simulations in this work. We anticipate that the exact model would outperform all the methods tested here, especially when R0 is close to one.

In the meantime, the Bayesian coalescent inference methods developed here make it feasible to estimate epidemic parameters from time-stamped, serially sampled molecular sequence data, while accurately accounting for uncertainty in the topology and the divergence times of the phylogenetic tree.

Acknowledgments

We thank Gabriel Leventhal and Louis Du Plessis (ETH Zürich) for constructive and valuable input and the New Zealand eScience Infrastructure for access to high-performance computing facilities (http://www.nesi.org.nz/). A.J.D. was funded by a Rutherford Discovery Fellowship from the Royal Society of New Zealand. A.P., T.V., T.S., and A.J.D. were also partially supported by Marsden grant UOA1324 from the Royal Society of New Zealand (http://www.royalsociety.org.nz/programmes/funds/marsden/awards/2013-awards/).

Footnotes

Available freely online through the author-supported open access option.

Communicating editor: Y. S. Song

Literature Cited

Anderson
R M
,
May
R M
,
1991
Infectious Diseases of Humans: Dynamics and Control
.
Oxford University Press
,
Oxford
.

Andrieu
C
,
Roberts
G O
,
2009
The pseudo-marginal approach for efficient Monte Carlo computations.
Ann. Stat.
37
:
697
.

Andrieu
C
,
Doucet
A
,
Holenstein
R
,
2010
Particle Markov chain Monte Carlo methods.
J. R. Stat. Soc. Ser. B Stat. Methodol.
72
:
269
342
.

Beaumont
M A
,
2003
Estimation of population growth or decline in genetically monitored populations.
Genetics
164
:
1139
1160
.

Bedford
T
,
Cobey
S
,
Beerli
P
,
Pascual
M
,
2010
Global migration dynamics underlie evolution and persistence of human influenza a (h3n2).
PLoS Pathog.
6
:
e1000918
.

Biggerstaff
M
,
Cauchemez
S
,
Reed
C
,
Gambhir
M
,
Finelli
L
,
2014
Estimates of the reproduction number for seasonal, pandemic, and zoonotic influenza: a systematic review of the literature.
BMC Infect. Dis.
14
:
480
.

Bošková
V
,
Bonhoeffer
S
,
Stadler
T
,
2014
Inference of epidemiological dynamics based on simulated phylogenies using birth-death and coalescent models.
PLoS Comput. Biol.
10: e1003913.

CDC, 2014 United States Centers for Disease Control and Prevention. Available at: http://www.cdc.gov/flu/. Accessed: November, 2014.

Chowell
G
,
Miller
M
,
Viboud
C
,
2008
Seasonal influenza in the United States, France, And Australia: transmission and prospects for control.
Epidemiol. Infect.
6
:
852
864
.

Dearlove
B
,
Wilson
D J
,
2013
Coalescent inference for infectious disease: meta-analysis of hepatitis C.
Philos. Trans. R. Soc. Lond. B Biol. Sci.
368
:
20120314
.

Drummond
A J
,
Nicholls
G K
,
Rodrigo
A G
,
Solomon
W
,
2002
Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data.
Genetics
161
:
1307
1320
.

Drummond
A J
,
Ho
S Y W
,
Phillips
M J
,
Rambaut
A
,
2006
Relaxed phylogenetics and dating with confidence.
PLoS Biol.
4
:
e88
.

ECAN, 2001 Environment Canterbury Regional Council. Available at: http://ecan.govt.nz/about-us/population/how-many/pages/census.aspx. Accessed: November, 2014.

Felsenstein
J
,
1981
Evolutionary trees from DNA sequences: a maximum likelihood approach.
J. Mol. Evol.
17
:
368
376
.

Felsenstein
J
,
2004
Inferring Phylogenies
.
Sinauer Associates
,
Sunderland, MA
.

Ferguson
N
,
Cummings
D
,
Cauchemez
S
,
Fraser
C
,
Riley
S
et al. ,
2005
Strategies for containing an emerging influenza pandemic in southeast Asia.
Nature
437
:
209
214
.

Gavryushkina, A., D. Welch, T. Stadler, and A. Drummond, 2014 Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. arXiv:1406.4573.

Grenfell
B T
,
Pybus
O G
,
Gog
J R
,
Wood
J L N
,
Daly
J M
et al. ,
2004
Unifying the epidemiological and evolutionary dynamics of pathogens.
Science
303
:
327
332
.

Griffiths
R C
,
Tavaré
S
,
1994
Ancestral inference in population genetics.
Stat. Sci.
9
:
307
319
.

Hasegawa
M
,
Kishino
H
,
Yano
T
,
1985
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
J. Mol. Evol.
22
:
160
174
.

Iwasaki
A
,
Pillai
P S
,
2014
Innate immunity to influenza virus infection.
Nat. Rev. Immunol.
14
:
315
328
.

Keeling
M J
,
Rohani
P
,
2008
Modeling Infectious Diseases in Humans and Animals
.
Princeton University Press
,
Princeton
.

Kermack, W., and A. McKendrick, 1932 Contributions to the mathematical theory of epidemics. ii. The problem of endemicity. Proc. R. Soc. A 138: 55–83.

Kingman
J F C
,
1982
The coalescent.
Stoch. Proc. Appl.
13
:
235
248
.

Koelle
K
,
Rasmussen
D A
,
2012
Rates of coalescence for common epidemiological models at equilibrium.
J. R. Soc. Interface
9
:
997
1007
.

Koelle
K
,
Cobey
S
,
Grenfell
B
,
Pascual
M
,
2006
Epochal evolution shapes the phylodynamics of interpandemic influenza a (h3n2) in humans.
Science
314
:
1898
1903
.

Kühnert
D
,
Stadler
T
,
Vaughan
T G
,
Drummond
A J
,
2014
Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model.
J. R. Soc. Interface
11
:
20131106
.

Kutta
M W
,
1901
Beitrag zur näherungsweisen integration totaler differentialgleichungen.
Zeitschrift für Mathematik und Physik
46
:
435
453
.

Opatowski
L
,
Fraser
C
,
Griffin
J
,
de Silva
E
,
Van Kerkhove
M
et al. ,
2011
Transmission characteristics of the 2009 h1n1 influenza pandemic: comparison of 8 southern hemisphere countries.
PLoS Pathog.
7: e1002225.

Paine, S., G. Mercer, P. Kelly, D. Bandaranayake, M. Baker et al., 2010 Transmissability of 2009 pandemic influenza a(h1n1) in New Zealand: effective reproduction number and influence of age, ethnicity, and importations. Eurosurveillance 15(24). Available at: http://www.eurosurveillance.org/ViewArticle.aspx?ArticleId=19591.

Pybus
O G
,
Charleston
M A
,
Gupta
S
,
Rambaut
A
,
Holmes
E C
et al. ,
2001
The epidemic behavior of the hepatitis c virus.
Science
292
:
2323
2325
.

R Core Team
,
2014
R: A Language and Environment for Statistical Computing
.
R Foundation for Statistical Computing
,
Vienna
.

Rambaut, A., 2007 Figtree. Available at: http://tree.bio.ed.ac.uk/software/figtree/.

Rambaut
A
,
Holmes
E
,
2009
The early molecular epidemiology of the swine-origin a/h1n1 human influenza pandemic.
PLoS Curr.
1: RRN1003.

Rasmussen
D A
,
Ratmann
O
,
Koelle
K
,
2011
Inference for nonlinear epidemiological models using genealogies and time series.
PLoS Comput. Biol.
7
:
e1002136
.

Rasmussen
D A
,
Volz
E M
,
Koelle
K
,
2014
Phylodynamic inference for structured epidemiological models.
PLoS Comput. Biol.
10
:
e1003570
.

Roberts
M
,
2013
Epidemic models with uncertainty in the reproduction number.
J. Math. Biol.
66
:
1463
1474
.

Roberts
M
,
Nishiura
H
,
2011
Early estimation of the reproduction number in the presence of imported cases: pandemic influenza h1n1–2009 in New Zealand.
PLoS ONE
6: e17835.

Rodrigo, A., and J. Felsenstein, 1999 Coalescent approaches to HIV population genetics, pp. 233–272 in The evolution of HIV, edited by K. A. Crandall. The Johns Hopkins University Press, Baltimore.

Rouzine
I M
,
Rodrigo
A
,
Coffin
J M
,
2001
Transition between stochastic evolution and deterministic evolution in the presence of selection: general theory and application to virology.
Microbiol. Mol. Biol. Rev.
65
:
151
185
.

Runge
C
,
1895
Ueber die numerische auflösung von differentialgleichungen.
Math. Ann..
46
:
167
178
.

Sehl
M
,
Alekseyenko
A V
,
Lange
K L
,
2009
Accurate stochastic simulation via the step anticipation tau-leaping (sal) algorithm.
J. Comput. Biol.
16
:
1195
1208
.

Stadler
T
,
Kouyos
R
,
von Wyl
V
,
Yerly
S
,
Böni
J
et al. ,
2012
Estimating the basic reproductive number from viral sequence data.
Mol. Biol. Evol.
29
:
347
357
.

Stadler
T
,
Kühnert
D
,
Bonhoeffer
S
,
Drummond
A J
,
2013
Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV).
Proc. Natl. Acad. Sci. USA
110
:
228
233
.

Stadler, T., T. G. Vaughan, A. Gavruskin, S. Guindon, D. Kühnert, et al., 2014 Population genetics vs. population dynamics: How well can coalescent-based models approximate population dynamic processes? Genetics 190: 187–201.

StatsNZ, 2001 Statistics New Zealand. Available at: http://stats.govt.nz/Census/.

Vaughan
T G
,
Drummond
A J
,
2013
A stochastic simulator of birth-death master equations with application to phylodynamics.
Mol. Biol. Evol.
30: 1480–1493.

Vaughan, T., D. Kühnert, A. Popinga, D. Welch, and A. Drummond, 2014 Efficient Bayesian inference under the structured coalescent. Bioinformatics 30: 2272–2279.

Volz
E
,
Frost
S D
,
2014
Sampling through time and phylodynamic inference with coalescent and birth-death models.
J. R. Soc. Interface
11
: 20140945.

Volz
E M
,
2012
Complex population dynamics and the coalescent under neutrality.
Genetics
190
:
187
201
.

Volz
E M
,
Kosakovsky Pond
S L
,
Ward
M J
,
Leigh Brown
A J
,
Frost
S D W
,
2009
Phylodynamics of infectious disease epidemics.
Genetics
183
:
1421
1430
.

WHO, 2014 World Health Organization. Available at: http://www.who.int/topics/influenza/en/. Accessed: November, 2014.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data