## Abstract

Transmission events are the fundamental building blocks of the dynamics of any infectious disease. Much about the epidemiology of a disease can be learned when these individual transmission events are known or can be estimated. Such estimations are difficult and generally feasible only when detailed epidemiological data are available. The genealogy estimated from genetic sequences of sampled pathogens is another rich source of information on transmission history. Optimal inference of transmission events calls for the combination of genetic data and epidemiological data into one joint analysis. A key difficulty is that the transmission tree, which describes the transmission events between infected hosts, differs from the phylogenetic tree, which describes the ancestral relationships between pathogens sampled from these hosts. The trees differ both in timing of the internal nodes and in topology. These differences become more pronounced when a higher fraction of infected hosts is sampled. We show how the phylogenetic tree of sampled pathogens is related to the transmission tree of an outbreak of an infectious disease, by the within-host dynamics of pathogens. We provide a statistical framework to infer key epidemiological and mutational parameters by simultaneously estimating the phylogenetic tree and the transmission tree. We test the approach using simulations and illustrate its use on an outbreak of foot-and-mouth disease. The approach unifies existing methods in the emerging field of phylodynamics with transmission tree reconstruction methods that are used in infectious disease epidemiology.

ESTIMATING who infected whom for an outbreak of an infectious disease can provide valuable insights. Estimated transmission trees have been used to evaluate effectiveness of intervention measures (Ferguson *et al.* 2001; Keeling *et al.* 2003; Wallinga and Teunis 2004; Heijne *et al.* 2009), to quantify superspreading (Lloyd-Smith *et al.* 2005), to estimate key parameters (Haydon *et al.* 2003; Heijne *et al.* 2012; Hens *et al.* 2012), and to identify mechanisms of transmission (Spada *et al.* 2004; Ypma *et al.* 2013). Transmission trees can be statistically reconstructed using epidemiological data from outbreak investigations, such as time of symptom onset, geographical location, and social ties; these data generally must be very detailed to allow for accurate reconstructions.

For many pathogens, in particular RNA viruses, evolutionary processes occur on the same timescale as epidemiological processes (Holmes *et al.* 1995; Pybus and Rambaut 2009). This makes it possible to draw conclusions about epidemiology from genetic analysis. The field that infers epidemiological characteristics from genetic sequences by simultaneously considering host dynamics and pathogen genetics has been dubbed “phylodynamics” (Grenfell *et al.* 2004). In practical applications researchers have considered a specific epidemiological model dependent on the phylogenetic tree inferred from sequence data, simultaneously estimating mutational and epidemiological parameters. This allowed them to answer questions on relative population sizes and dates of introduction of pathogens. Initially the epidemiological models used were classical models from population genetics, such as the Wright–Fisher model (Pybus *et al.* 2001). Recently more realistic epidemiological models such as the Susceptible-Infected-Recovered (SIR) model (Volz *et al.* 2009; Rasmussen *et al.* 2011) and birth–death model (Stadler 2009) have been suggested. In these methods, the sampled hosts are thought of as the leaves of the phylogenetic tree, while the internal nodes coincide with transmission times to unobserved hosts. The phylogenetic tree is thus equated with the partially observed transmission tree.

Although the transmission tree and the phylogenetic tree of an outbreak may appear as two incarnations of the same tree, they are in fact different in interpretation and in local characteristics. The phylogenetic tree represents the clonal ancestry of sampled pathogen its leaves are sampled pathogens, and its internal nodes are most recent common ancestors of the sampled and transmitted pathogens (Figure 1) (Pybus and Rambaut 2009). As a pair of lineages corresponding to two transmitted pathogens can coalesce together before coalescing to the lineage sampled from the infecting host, the topology of the two trees need not be the same (Figure 1A). The difference between phylogenetic trees and transmission trees is closely related to the difference between phylogenetic trees and species trees; in the latter context this phenomenon is known as ”incomplete lineage sorting” (Rosenberg and Nordborg 2002; Maddison and Knowles 2006). While the timing of nodes in the transmission tree corresponds to transmission times, the timing of internal nodes of the phylogenetic tree corresponds to coalescent events that take place prior to transmission. The absolute difference in branch length between the two trees depends on the epidemiological generation interval and within-host dynamics. As the branch length of the phylogenetic tree decreases with sampling rate of hosts, the relative difference between the partially observed transmission tree and the phylogenetic tree is largest when the sampling rate is high (Figure 1).

Recently, methods that focus on including genetic information in transmission tree reconstructions have been proposed (Morelli *et al.* 2012; Ypma *et al.* 2012). However, these approaches either ignore the phylogenetic tree or assume that internal nodes of the phylogenetic trees coincide with transmission events. As these approaches require a high sampling fraction, the within-host genetic diversity can contribute to the genetic diversity observed between the sampled sequences. Because this contribution is ignored in the analyses, it can lead to incorrect inference of the transmission tree and biased estimates of parameters.

Here, we present a consistent way to use pathogen genetic sequence data in transmission tree reconstruction, by simultaneously estimating the phylogenetic tree of the pathogens. This requires inclusion of a model of within-host pathogen dynamics. In this article we describe the likelihood framework for such a joint estimation of the transmission tree and the phylogenetic tree. We investigate the performance and robustness of the methodology using simulations of an influenza outbreak in a confined setting. In these simulations, we also investigate the sensitivity of the outcome to unsampled or unobserved hosts. Finally, we illustrate the use of the method by applying it to a previously published data set on an outbreak of foot-and-mouth disease (FMD).

## Methods

### Joint likelihood of the transmission tree, phylogenetic tree, and within-host dynamics

We focus on outbreaks for which nearly all cases are observed, host characteristics such as time of symptom onset are known, and sequences are obtained from pathogens sampled from a proportion of the hosts. The transmission events are not observed. From these data we simultaneously estimate both the transmission tree and the phylogenetic tree, by writing down the likelihood for any pair of these trees. Throughout, we assume that the first infected host, or index case, introduced infection and all other infected hosts are infected by another host in this outbreak. We assume every host is infected at most once.

Define a transmission tree *T* as the set of all transmissions between infected hosts, including transmission times (Wallinga and Teunis 2004). The transmission times could be observed or unknown, in which case we estimate them simultaneously. The phylogenetic tree *P* is the usual dichotomous tree, with timed internal nodes. The function *W(t,h)* gives a measure of within-host genetic diversity, as the product of the pathogen generation time and the within-host effective pathogen population size in host *h* at time *t*. *W(t,h)* can be thought of as being proportional to the total number of virus particles in host *h* at time *t*. Let **θ** be the epidemiological parameters, **μ** the mutational parameters, and the data *D* consist of genetic sequences *D*_{G} and epidemiological data *D*_{E}.

The probability of the transmission tree *T*, phylogenetic tree *P*, within-host dynamics *W*, and parameters **θ** and **μ** can be found by first applying Bayes’ theorem and then the chain rule of probability,where *π* denotes the prior probability. We can further simplify this equation by using conditional dependencies. In particular, if we know the full transmission tree, epidemiological parameters, and within-host dynamics, the phylogenetic tree and mutational parameters give no further information on the probability of the epidemiological data:Likewise, when the ancestry of sampled pathogens and the mutational parameters are known, the epidemiological data and parameters give no further information on the sequence data:Also, the mutational and epidemiological parameters give no further information on the ancestry when the transmission tree and within-host dynamics are known (we assume that there is no information on selection pressures):We thus get(1)The amount of prior information differs per application; prior information on parameters and within-host dynamics might be available due to previous studies and prior information on the transmission tree might be available through contact tracing.

We point out the similarity of Equation 1) to previous methodologies. The first term on the right-hand side is, up to the inclusion of the within-host model *W*, identical to the likelihood equations found in transmission tree reconstruction methods (Wallinga and Teunis 2004). The second and third terms together resemble the likelihood equations found in many phylodynamic approaches (Pybus and Rambaut 2009; Volz *et al.* 2013), with the difference that the epidemiological model in such approaches is replaced here by the combination of the transmission tree and the within-host model.

### Numerical implementation: Sampling from the probability distribution

Equation 1 defines (up to a constant) a probability distribution on the space of transmission trees, phylogenetic trees, and parameters. We can sample from this distribution using Markov chain Monte Carlo (MCMC) methods. The initial state consists of a transmission tree, phylogenetic trees, and parameters with probability larger than 0. We construct the initial state as follows. We first ensure that every host has a time of infection by assigning one if the time was unobserved. We then construct a transmission tree by assigning an infector to all infected hosts, apart from the host infected first. The infecting hosts are randomly taken from the set of hosts that are infectious at the time of infection of the infected host. We construct a phylogenetic tree that is consistent with this tree and take initial values for the parameters from their prior distributions.

In each iteration of the MCMC, the trees and parameters are updated. An iteration consists of five different steps:

For a host infected in the outbreak, pick a new infector (in this step we also alter the phylogenetic tree, to make sure that it is compatible with the proposed transmission tree);

consider a new root for the transmission tree, switching infection times between the current and proposed root;

update infection times;

per host, update the phylogenetic tree contained in that host; and

update parameter values.

See supporting information, File S1, for technical details. Below, we illustrate this general concept using simulated and real data.

### Testing the approach on simulated data sets

To illustrate the method and assess robustness to missing data, we apply it to simulated data on an influenza outbreak in a confined school-based population of 200 individuals. Such confined outbreaks are amenable to analysis as, due to the low number of cases, individual infections can be tracked using only epidemiological data (Cauchemez *et al.* 2011; Hens *et al.* 2012).

Each simulation starts with 1 infected and 199 susceptible individuals, half of them children; only outbreaks with at least 50 cases were used for further analysis. We use a latent period of 2 days, and a gamma-distributed infectious period with a mean of 3 days and a variance of 2. During their infectious period, individuals exert an equal and constant force of infection on all susceptible individuals; *i.e.*, we assume homogeneous mixing, such that the expected number of infections caused by an infectious individual is 0 for adults and 2 for children at the start of the outbreak. We assume that symptom onset coincides with the start of the infectious period. See File S1 for details.

We take a simple analytically tractable function, a quartic kernel, for the product of pathogen generation time and effective population size,where *T _{h}* is the time between infection and recovery of host

*h*, and (

*t − t*) is the time since infection

_{h}*t*of

_{h}*h*. Note that the model assumes an effective size of 1 at transmission, ensuring only one strain can infect a host. We take a substitution rate of 0.003 substitutions/site/year (Jenkins

*et al.*2002).

For each case, we record the time of symptom onset, the recovery time, a sequence sampled 1 day after symptom onset, and whether the case is a child or adult. These data are then used for inference, as described below.

### Inference

We estimate the transmission tree and parameters **θ** and **μ** using the proposed likelihood Equation 1. To this end, we need to specify the three components of the likelihood.

#### Likelihood component for the transmission tree:

We assume that the incubation period, defined as the time between infection and developing symptoms, is gamma distributed with mean 2 and variance 1. The likelihood of the transmission tree then becomes the product over all infected hosts of the infectiousness of the infecting host relative to the total infectiousness of all hosts times the probability distribution *s* for the length of the incubation period,where *H* is the set of all infected hosts, *H ^{−}* is the set of all infected hosts minus the index case,

*t*and

_{x}*o*are the start of the infection and the start of the infectious period (respectively) of host

_{x}*x*,

*v(x)*is the infector of host

*x*, and

*I*,

_{t}(a|θ*D*

_{E}

*)*is the infectiousness of host

*a*at time

*t*. Here we take infectiousness to be the rate at which a host infects any other host. The infectiousness

*I*,

_{t}(a|θ*D*

_{E}

*)= θ*if

*t*>

*s*and

_{a}*a*has not recovered at time

*t*, 0 otherwise. Note that the infectiousness does not depend on the infected host, as we assume homogeneous mixing and we do not want to

*a priori*assume a difference in infectiousness between adults and children.

#### Likelihood component for the phylogenetic tree:

The likelihood of the phylogenetic tree can be obtained from coalescent theory for haploid organisms. Going backward in time, two events that alter the number of lineages of the phylogenetic tree contained within one host can take place. First, the number can decrease by one due to a coalescent event. Second, the number can increase by one due to an incoming lineage: a transmission event from this host to another, akin to a new sample in standard coalescent models. The first case is described by the likelihood that the lineages did not coalesce for a time, and finally coalesced; the second case is described by the likelihood that the lineages did not coalesce. Using forward time we getwhere *H* is the set of all infected hosts and *C _{x}* is the set of all intervals , where the number of viral lineages within host

*x*with sampled offspring is constant and >1. The indicator function is 1 if the interval starts with a coalescent, 0 otherwise. Here, we assume

*W*to be fully known.

#### Likelihood component for the mutational parameters:

We take the simplest feasible substitution model; all mutations are equally likely and happen with rate *μ*. We therefore getwhere the first product is over all base pairs, the sum is over all possible assignments of each of the nucleotides to each of the *N* internal nodes, the second product is over all branches of the phylogenetic tree, *t* denotes branch length, and the indicator denotes whether a mutation occurred on that branch. We omit the Jukes–Cantor correction because timescales are very short and selection pressures absent, and hence the probability of multiple mutations at the same locus is negligible (see File S1). We can avoid having to sum over all 4* ^{N}* possible states using Felsenstein’s pruning algorithm (Felsenstein 1981). Inclusion of other, perhaps more realistic, substitution models would be straightforward.

### Evaluating performance

We examine how well the transmission tree can be reconstructed by evaluating the probability assigned to the actual transmission events. We estimate the probability that host *j* infected host *i* by the proportion of sampled trees in which *j* infected *i*. We assess false positives by evaluating, for each infected host, whether the infector assigned the highest probability is the actual infector. We further evaluate how well the method estimates the substitution rate and whether the method is able to find the reduced infectiousness of adults. The latter is done by counting the fraction of infections caused by adults among transmission events estimated at probability at least 0.9. As we are interested in the parameters and transmission tree, the phylogenetic tree can be considered a nuisance parameter, and we do not further investigate its estimation.

To assess robustness of the estimation procedure, we simulate data, estimate parameters and evaluate performance for seven different simulation scenarios. In the baseline scenario, all data are available. In the second and third scenario, we examine the impact of incomplete sampling by randomly discarding 0 or 50% of sequences. In a fourth scenario, we examine the impact of unobserved hosts by randomly discarding 20% of infected hosts. To keep the number of observed hosts comparable, we start these simulations with 20% more susceptibles. In a fifth scenario, we examine sensitivity to an increased substitution rate of 0.01 substitutions/site/year. In a sixth scenario, we examine sensitivity to a decreased substitution rate of 0.001 substitutions/site/year. In a seventh scenario, we examine the impact of a misspecified within-host model, by setting *W(t,h)* = 1 in the analysis. Specifying this within-host model is equivalent to making the incorrect assumption that coalescent events coincide with transmission events.

### Application to an outbreak of foot-and-mouth disease

To illustrate the use of the method in practice, we reanalyze data on an outbreak of FMD in 2001 in Durham County, England (Cottam *et al.* 2006, 2008; Morelli *et al.* 2012). Here, for a cluster of 12 infected farms, spatial information, date of symptom onset, culling date, and one full genome sequence is known for all 12 farms in the cluster. Both the phylogenetic tree and transmission tree have been estimated before for this outbreak separately, giving inconsistent results (Cottam *et al.* 2008). We apply our method to illustrate how to estimate both trees simultaneously, using the same epidemiological model and substitution model as were used in a previous study (Morelli *et al.* 2012). We assume an exponentially increasing function for *W* (see File S1 for details), as at time of culling the disease is still spreading within the farms.

## Results

We first test the proposed method by estimating the transmission tree from the simulated data sets. Figure 2 shows how well individual transmissions can be estimated, and Table 1 shows the average probability assigned to the actual transmissions. When all data are available, on average half of the transmissions in the reconstructed transmission tree are correct (Figure 2); this number is almost one if we restrict ourselves to transmission events assigned a high probability. Few transmissions can be correctly estimated when the percentage of infected hosts sampled decreases. When instead the percentage of hosts observed decreases, transmissions can be estimated correctly quite often, but the percentage of incorrectly estimated transmissions increases strongly. More transmission events can be estimated when the substitution rate is higher and fewer transmission events can be estimated when the substitution rate is lower. Nearly as many transmission events can be correctly estimated when we assume that coalescent events coincide with transmission events as when we know the correct within-host model. However, the percentage of transmissions that is incorrectly estimated increases.

We now turn to estimation of mutational parameters from the simulated data sets. The substitution rate can be estimated well even when data are missing (Figure 3A); the mean estimated rate was 0.0033, and the actual value of 0.003 was contained in the 95% credibility interval for 90% of the simulations. Assuming that coalescent events coincide with transmission leads to a large overestimation (mean 0.0067, coverage 15%); this is because the total branch length of the phylogenetic tree is underestimated, leading to an overestimate of the substitution rate (Figure 1).

Next, we focus on performance in estimating the relative infectiousness of adults from the simulated data sets. Figure 3B gives, for each of 100 simulations under each of the seven scenarios, the point estimate and confidence interval for *p*, the fraction of infections caused by adults. Most estimates of *p* are close to the correct value of 0, except for the scenarios where only 80% of hosts are observed (mean *P* = 0.05), or when transmission times are equated with coalescent times (mean *P* = 0.11). More importantly, the confidence interval for *p* becomes larger when there is more uncertainty about the transmission tree, most notable under the two scenarios with decreased sampling rates.

Having tested the method, we apply it to infer transmission trees from epidemiological and genetic sequence data collected during an FMD outbreak in Durham County, England, in 2001. Figure 4A illustrates a typical sample from the MCMC, consisting of a transmission chain connecting the infected farms and a phylogenetic tree connecting the sampled sequences. The results are broadly similar to those obtained from previous analyses on the same data set (File S1). The mean latency period was estimated at 7.8 (95% credibility interval, CI: 4.9, 12) days (Figure 4), which is in agreement with a typical latency period of 5 days (95% CI: 1–12 days) of FMD virus (Keeling *et al.* 2001; Gibbens and Wilesmith 2002; Charleston *et al.* 2011). Ignoring the within-host genetic diversity would have led to an unrealistically large estimate for the latency period of 24 (95% CI: 17, 35) days (Morelli *et al.* 2012). The substitution rate was estimated at 1.1 × 10^{−2} (95% CI: 8.7 × 10^{−3}, 1.5 × 10^{−2}) substitutions per site per year, which is higher than a typical value of 7.7 × 10^{−3} based on genetic data only (Cottam *et al.* 2008). We found the value to be sensitive to the precise specification of the within-host dynamics. See File S1 for further results and comparison to previous estimates. Together, these results confirm that the method is able to estimate plausible parameter values, while reconstructing the transmission tree and a consistent phylogenetic tree of the outbreak.

## Discussion

We have shown how the within-host dynamics of pathogens relate the phylogenetic tree of sampled pathogens to the transmission tree of an outbreak of an infectious disease. We use this relationship to estimate key parameters by combining genetic data on the pathogen with epidemiological data. The advantage of this estimation procedure is that estimates are more accurate, and estimation is feasible even when hosts are unsampled or unobserved.

The method is able to correctly estimate epidemiological and mutational parameters and can infer individual transmission events. Although the methods perform best when all infected hosts have been observed and sampled, simulations show that even when 20% of infected hosts are unobserved the transmission tree can still be estimated reasonably well. Note, however, that estimation of some epidemiological parameters might become biased. For example, in the FMD example, the slight overestimation of both duration of the latency period and substitution rate could well be due to unobserved farms (Morelli *et al.* 2012). As expected, the accuracy of the estimates increases with substitution rate. The method therefore seems most suited to analyzing data sets on RNA viruses. Analysis for DNA viruses or bacteria might still be feasible if the times between infection are large enough, as sufficient genetic diversity might still accumulate over the course of the outbreak.

To correctly estimate both the transmission tree and the phylogenetic tree, knowledge on the within-host effective pathogen population size and pathogen generation time is needed. This knowledge will, however, not be available in general. On the short timescales we are considering, we expect it will be challenging to estimate the shape of the time-varying population size from the data. Conversely, the impact of a misspecification of the within-host effective pathogen population size is minor. Even the extreme case of taking a constant population size of 1 (*i.e.*, equating transmission times with coalescent times) leads to reasonable estimation of the transmission tree. We see this in our application to FMD; although the substitution rate is overestimated, the estimated epidemiological parameters agree very well with previous estimates. It is possible that within-host dynamics are easier to estimate for chronic infections. However, for chronic infections, the epidemiological data are usually less informative of the transmission tree.

In proposing this methodology, we extend previously proposed methods that aim to reconstruct transmission trees using both epidemiological and genetic data. The field was pioneered by Cottam *et al.* (2008), who considered a sequential estimation procedure for the phylogenetic tree and the transmission tree, an approach that leads to loss of information contained in the phylogenetic tree. More recent approaches considered both data types simultaneously (Jombart *et al.* 2011; Morelli *et al.* 2012; Ypma *et al.* 2012), but implicitly or explicitly associated sequences with hosts, rather than with individual pathogens within the host. This means that transmission times coincide with coalescent times, an approximation that is inaccurate when the sampling fraction is high. Our simulations show that this leads to suboptimal inference of the transmission tree and to a large overestimation of the substitution rate.

Transmission tree reconstruction methods focus on individual transmission events, whereas many published phylodynamical approaches (Pybus and Rambaut 2009) typically focus on finding general characteristics of an outbreak, *e.g.*, epidemiological parameters, from sampled sequences. To estimate individual transmissions, detailed data are needed; we expect that the number of outbreaks that are studied in such detail will increase. More importantly, transmission tree reconstruction allows for an understanding of the outbreak at a high level of detail; for example, hypotheses regarding the transmission mechanism can be tested using the reconstructed transmission tree (Ypma *et al.* 2013).

Reconstructing large outbreaks at the detailed level of individual transmissions is feasible only when highly informative data are available. These could take the form of detailed epidemiological data on who infected whom, informative genetic data, *i.e.*, a large number of sampled sequences exhibiting high genetic diversity, or a combination of both. The method we have presented uses both data types to estimate simultaneously the transmission tree and the phylogenetic tree, acknowledging the fact that these two are, in fact, different. With the decreasing cost of sequencing technologies it is likely that more and more of such detailed molecular epidemiological data sets will become available for a range of pathogens, a trend we have already seen in the past few years (Bataille *et al.* 2011; Gardy *et al.* 2011; Harris *et al.* 2012). We therefore expect that the usefulness of this combination of two historically separated fields will become even more apparent in years to come.

## Footnotes

*Communicating editor: M. A. Beaumont*

- Received June 27, 2013.
- Accepted September 5, 2013.

- Copyright © 2013 by the Genetics Society of America