The coalescent with recombination is a very useful tool in molecular population genetics. Under this framework, genealogies often represent the evolution of the substitution unit, and because of this, the few coalescent algorithms implemented for the simulation of coding sequences force recombination to occur only between codons. However, it is clear that recombination is expected to occur most often within codons. Here we have developed an algorithm that can evolve coding sequences under an ancestral recombination graph that represents the genealogies at each nucleotide site, thereby allowing for intracodon recombination. The algorithm is a modification of Hudson's coalescent in which, in addition to keeping track of events occurring in the ancestral material that reaches the sample, we need to keep track of events occurring in ancestral material that does not reach the sample but that is produced by intracodon recombination. We are able to show that at typical substitution rates the number of nonsynonymous changes induced by intracodon recombination is small and that intracodon recombination does not generally result in inflated estimates of the overall nonsynonymous/synonymous substitution ratio (ω). On the other hand, recombination can bias the estimation of ω at particular codons, resulting in apparent rate variation among sites and in the spurious identification of positively selected sites. Importantly, in this case, allowing for variable synonymous rates across sites greatly reduces the false-positive rate and recovers statistical power. Finally, coalescent simulations with intracodon recombination could be used to better represent the evolution of nuclear coding genes or fast-evolving pathogens such as HIV-1.We have implemented this algorithm in a computer program called NetRecodon, freely available at http://darwin.uvigo.es.
THE coalescent (Kingman 1982; Hudson 1990) provides an efficient sampling of genealogical histories from a theoretical population evolving under a neutral Wright–Fisher model (Ewens 1979; Kingman 1982; Hudson 1990). Coalescent simulations are commonly used in molecular population genetics to understand the behavior and interactions among evolutionary processes under different scenarios (Innan et al. 2005), such as hypothesis testing (DeChaine and Martin 2006), evaluation and comparison of different analytical methods (Carvajal-Rodriguez et al. 2006), or estimation of population genetic parameters (Beaumont et al. 2002). Indeed, to obtain meaningful biological inferences from these simulations, it is very important that the underlying model is as realistic as possible. In this regard, a number of models have been developed during the last decade that consider different evolutionary processes such as recombination (Simonsen and Churchill 1997; Wiuf and Posada 2003), gene conversion (Wiuf and Hein 2000), selection (Hudson and Kaplan 1988, 1995), and gene flow or demographic history (Slatkin 1987; Pybus and Rambaut 2002).
Despite these advances, and in the face of a plethora of coalescent simulators (Excoffier et al. 2000; Hudson 2002; Posada and Wiuf 2003; Spencer and Coop 2004; Mailund et al. 2005; Schaffner et al. 2005; Marjoram and Wall 2006; Arenas and Posada 2007; Hellenthal and Stephens 2007; Liang et al. 2007), it was not possible until very recently to simulate recombining protein-coding DNA sequences within this framework (Anisimova et al. 2003; Arenas and Posada 2007). Importantly, to our knowledge, the algorithms described or implemented so far allow recombination only between codons, not within them. The reason for this unrealistic constraint is that standard codon models describe the probabilities of change along a lineage from one codon to another (Yang 2006), whereas recombination can occur between any two nucleotides, potentially resulting in one or more lineages not being shared by all the positions of the codon. In other words, although the unit for substitution in coding sequences is the codon, the unit for recombination in these sequences is still the nucleotide. Here we describe a new algorithm that overcomes this limitation by allowing for the evolution of different positions of the same codon in distinct genealogies. Furthermore, we use this algorithm to evaluate the effect of intracodon recombination on the generation of nonsynonymous (NS) diversity and on the estimation of the ratio of nonsynonymous-to-synonymous substitution rates (ω or dN/dS) (Li and Gojobori 1983) and the hypotheses derived from it.
Simulation of intracodon recombination under the coalescent:
The simulation of intracodon recombination occurs in two independent steps: the construction of the ancestral recombination graph (ARG) (Griffiths 1991; Griffiths and Marjoram 1996, 1997) and the evolution of the coding sequences. The first step is an extension of the standard coalescent m-loci continuous-time model with recombination (Kaplan and Hudson 1985). The novelty comes from the fact that intracodon recombination, apart from breaking the ancestral material in two segments as in the case of intercodon recombination, also results in ancestral material that never reaches the sample (Figure 1). Therefore, we distinguish between “sampled” and “unsampled” ancestral material, while in the standard coalescent all the ancestral material appears in the sample. Because sampled and unsampled ancestral material can meet in the same codon, we need to be sensitive to recombination and substitution events occurring not only in the sampled ancestral material, as usual, but also in the unsampled ancestral material. We also tried a shortcut in which recombination events were allowed only in the sampled material and obtained almost identical results as with the full algorithm. However, this simplified algorithm was discarded because it did not reduce the computational costs much because, in practice, recombination events within unsampled material were very rare. Because we keep track of the relationships between recombinant segments containing ancestral material, we are able to define at the end the exact genealogy for each codon site, which we call the ancestral codon graph (ACG) (Figure 1). Note that, in the presence of recombination, the ACGs along the alignment can be different from each other. Moreover, for codons that contain recombination breakpoints, the ACGs are reticulated graphs or networks, while, for the nonrecombining codons, the ACGs are binary trees. The algorithm for building the ARG is as follows:
Start with k nodes = n sampled sequences. Each node contains a single segment encompassing all the codons considered.
Sample the time back to the next event from an exponential distribution with the parameter k(k − 1)/2 + 2Nrg, where N is the effective population size, r is the recombination rate per site per generation, and g is the number of possible breakpoint sites summed over all sequences in the sample. A breakpoint site is a potential recombining site only when it has non-MRCA (most recent common ancestor) ancestral material (sampled or unsampled; see below) before and after it (see Wiuf and Hein 1999).
Choose the type of event. It will be a coalescent event (CA) with probability [k(k − 1)/2)]/[k(k − 1)/2 + 2Nrg] and a recombination event (RE) with probability 2Nrg/[k(k − 1)/2 + 2Nrg].
Complete the event. If it is a CA event, select two nodes at random and merge them into a new ancestral node inheriting all the segments from both coalescing and descendant nodes and set k = k − 1. If it is a RE event, draw a random site among all possible breakpoints across all segments in all nodes. Cut all the segments in the (recombinant) node that contains the chosen (breakpoint) site into left and right segments. If the event occurs between codons, create two ancestral parental nodes that will inherit either the left or the right segments and set k = k + 1. If the event occurs within a codon, create two ancestral parental nodes that will inherit the left and right segments (ancestral material that does reach the sample or “sampled material”) and the left and right site(s) flanking the breakpoint in the same codon (ancestral material that never reaches the sample or “unsampled material”), set g = g + 3 and k = k + 1. In both cases, keep the location and ancestral relationships of every segment.
If k = 1, label this node as the root and end the process; otherwise, go to step 2.
Importantly, note that intracodon recombination increases the amount of (unsampled) ancestral material by three nucleotides.
In a second step, each codon is evolved independently along its ACG, starting at the root, which contains all the sampled and unsampled ancestral material (Figure 1). Note that in the presence of intracodon recombination the grand most recent common ancestor (GMRCA) (Griffiths and Marjoram 1996) of the sample is not necessarily the root node. Given branch lengths and a Markov model of codon evolution, it is straightforward to calculate the probabilities of change between any two nodes and to use them to evolve the codons along a nonreticulated ACG (i.e., a tree) (Yang 2006). If the ACG is reticulated, the process is slightly more involved. The general recursion proceeds as follows, independently for each codon position (Figure 2):
Starting at the root, choose a random codon by sampling it from the equilibrium distribution.
Evolve this codon to produce new codons at the descendant nodes.
For every node with an assigned codon that has not been evolved yet:
If the node is a tip, do nothing else.
If the node is a coalescent node, evolve its codon to produce new codons at the two descendant nodes.
If the node is a parental node, check whether both parentals involved in the same recombination event have been already assigned a codon. If not, wait until this assignment is made. Once both parentals have been assigned a codon, combine both codons according to the breakpoint location and evolve the resulting codon to produce a new codon at the descendant recombinant node. If a codon stop is generated, erase all codons in the ACG and go back to step 1.
If all the nodes in the ACG have been assigned a codon, stop.
Note that, in step 3c, if a codon stop is created, we start the substitution process again from the root, keeping the simulated genealogy to afford computational costs and to avoid favoring smaller genealogies because larger genealogies tend to have more recombination events and thus a higher chance of generating stop codons.
Algorithm development and validation:
The algorithm was implemented in C in a program called NetRecodon, which is a major redraft of Recodon (Arenas and Posada 2007). To validate it, we compared the results obtained with this algorithm with the theoretical expectations for the mean and variances of different simulation statistics, such as the number of recombination events or the time to the most recent common ancestor (Hudson 1990). We also checked whether these summary statistics agreed with those obtained with the ms program (Hudson 2002) under different evolutionary scenarios. In addition, substitution and codon model parameters were estimated from the simulated data using HYPHY (Kosakovsky Pond et al. 2005) and PAUP* (Swofford 2000). These estimates agreed very well with the expected values from the simulations. Moreover, we implemented additional features that correspond to a variety of real scenarios, such as exponential growth, migration, longitudinal samples (dated tips), haploid/diploid populations, and a broad set of codon models that allow ω to change along the sequences according to different distribution (see Yang et al. 2000).
Simulation of protein-coding sequences with recombination:
We simulated coding sequences under different values of the population mutation parameter [θ = 4Nμl = 10, 20, 50, 100, and 200, where N is the effective (diploid) population size, μ is the substitution rate per codon, and l is the number of codons], recombination rates (ρ = 4Nrl = 0, 1, 4, 16, 64, and 128; where r is the recombination rate per nucleotide), and dN/dS ratios (ω = 0.2, 1.0, and 5.0). The number of sequences in the sample (n = 10), alignment length (l = 333 codons), and effective population size (N = 1000) was constant. The codon model used was GY94 (Goldman and Yang 1994) with a transition/transversion ratio of 0.5, and under the standard genetic code. For every combination of parameters (5 × 6 × 3 = 90 combinations), we simulated recombination in two different ways. In one setting, we used the algorithm introduced above, where recombination was free to occur within and between codons. In addition, we also repeated the simulations but forced recombination to occur only between codons because this setting had been previously used to understand the impact of recombination on the detection of positive selection (Anisimova et al. 2003; Shriner et al. 2003). Indeed, we made sure that the total recombination rate, ρ, was the same in both situations. For every scenario, we simulated 200 alignments.
The global ω was estimated from the simulated data using the Nei and Gojobori method (NG86) (Nei and Gojobori 1986) as implemented in SNAP (Korber 2000) and maximum likelihood under the GY94 model as implemented in HYPHY. The NG86 is a very simple method, commonly used for closely related sequences, which does not use phylogenetic information. The GY94 requires a phylogeny, which was estimated with the neighbor-joining (NJ) algorithm (Saitou and Nei 1987).
ω per site:
To understand the effect of recombination on a site-by-site basis, we also performed some simulations parameterized according to a real data set—an HIV-1 env 2007 subtype reference alignment (A–K, without recombinants)—downloaded from the Los Alamos National Laboratory HIV sequence database (http://hiv-web.lanl.gov) with 40 sequences and 956 codons. In particular, we studied the effect of (inter- and intra-) codon recombination on the M0 (one rate) and M1 (neutral) likelihood-ratio test (LRT) for homogeneity of ω across sites (see Yang et al. 2000) and on the identification of positively selected sites (PSS) (ω > 1).
The estimated nucleotide diversity was 0.16, which roughly corresponds to θ = 250 under our settings. According to PAML (Yang 2007) under the M0, M1, and M8 models, ωM0 = 0.5 (but we also studied cases with ωM0 = 1.0 and ωM0 = 2.0), ωM1 = 0.1, p0M1 = 0.6, p0M8 = 0.9, pM8 = 0.2, qM8 = 0.3, and ωM8 = 3.8—the values used in the simulations. As before, we assumed that ρ = 0, 1, 4, 16, 64, and 128, l = 999 nt, and N =1000, but we increased the sample size (n = 30) because these models are more complex. The number of replicates was 200 for M0 and M1 and 2000 for M8. For each data set, maximum-likelihood phylogenetic trees were estimated using PHYML (Guindon and Gascuel 2003) and fed into PAML to obtain model likelihoods. Results for the M0 and M1 LRTs were classified as true negatives (TN) or as false positives (FP; P-value < 0.05) when data were simulated under M0 and as true positives (TP) or as false negatives (FN) when data were simulated under M1. We used this classification to calculate the false-positive rate [FPR = FP/(TN + FP)] and the power [TP/(TP + FN)] of the LRTs. To identify PSS, we used fixed-effects likelihood (FEL) model (Kosakovsky Pond and Frost 2005b) assuming a “one-rate” (FEL-1R; dS is held constant across sites) and a “two-rate” (FEL-2R; dS is adjusted across sites) model, as implemented in HYPHY upon a NJ tree. False-positive rates and power were calculated as before, but on a site-by-site basis in the context of the M8 model.
In theory, intracodon recombination can generate new codons that have 0, 1, or 2 NS changes when compared to the parental codons. In the simulations, recombination within codons was indeed twice as common as intercodon recombination, and only at high substitution rates did the proportion of total recombination events that resulted in NS changes reach a significant proportion (12–34%) (Table 1). Nonetheless, the proportion of the total NS changes observed in the history of the sample due to intracodon recombination was always small. This proportion indeed depended on the recombination and substitution rates and reached 10–12% when ρ = 128 (∼230 recombination events in the history of the sample) (Table 2).
Effect of intracodon recombination on the estimation of ω:
We did not observe a significant effect of (intra- or inter-) recombination on the estimation of the global ω (supporting information, Table S1, Table S2, and Table S3). However, at high substitution rates, some significant instances were detected in which increasing recombination rates resulted in estimates of ω closer to 1, regardless of its simulated (true) value. Both methods employed to estimate ω worked better in the presence of elevated substitution rates, although the phylogenetic GY94 model showed a slight tendency to overestimate ω compared to NG86, especially when θ was large.
Effect of intracodon recombination on the LRT for ω variation across sites:
Recombination clearly biased the M0 and M1 LRTs toward the rejection of the null hypothesis of homogeneity of ω across sites (Figure 3). Increasing (intra-or inter-) recombination rates elevated both the false-positive rate and power, making these LRTs nonconservative.
Effect of intracodon recombination on the identification of PSS:
Recombination significantly increased the number of sites erroneously identified as positively selected by FEL-1R, although the errors were also evident in the absence of recombination. At high recombination rates (ρ = 64, 128), the number of false PSSs reached 30–35 (of 333 codons) at the 95% significance level (Figure S1). When the PSSs were identified with FEL-2R, the number of false positives clearly decreased. In this case, the number of false PSSs was tiny when the recombination rate was 0 or very small and increased slowly with higher rates. In all cases, allowing for intracodon recombination did not make a difference. Consequently, the false-positive rate was more or less constant regardless of the recombination rate in the case of FEL-1R and smaller and much more correlated with the recombination rate for FEL-2R (Figure 4). Remarkably, the power to detect PSS was four times higher for FEL-2R than for FEL-1R. In both cases, the recombination rate did not affect the results, except for the fact that the standard errors per replicate decreased at larger recombination rates.
Simulation is indeed a powerful tool in population genetics, with a rich variety of applications, but most of its benefits rely on biologically meaningful models. The algorithm described here facilitates the generation of more realistic protein-coding sequence samples in the presence of recombination and, importantly, allows for the estimation of the nonsynonymous substitutions induced by recombination, relative to mutation. Here, a necessary assumption is that the genealogy is independent of ω (as in Anisimova et al. 2003; Shriner et al. 2003; Scheffler et al. 2006; Wilson and McVean 2006). Our results suggest that recombination does not have a strong overall effect on the generation of nonsynonymous changes, although this does not mean that it cannot mislead positive selection analyses, especially those based on phylogenies (Anisimova et al. 2003). Other authors have studied the impact of recombination on the estimation of (the global) ω. Shriner et al. (2003) studied the effect of recombination on the maximum-likelihood phylogenetic methods implemented in PAML (Yang 1997) for the characterization of molecular adaptation. They used the program ms (Hudson 2002) to simulate the data, which does not allow for intracodon recombination. Although they found that recombination leads to false-positive detection of sites undergoing positive selection, the effect of recombination on the estimate of ω across the entire sequence length was unclear. This is because, although point estimates were reported as significantly higher than the expected value of 1, in fact the 95% confidence intervals included 1 in most cases. Similarly, Anisimova et al. (2003), also ignoring intracodon recombination, did not find a strong effect of recombination on PAML's estimation of ω, although there was some inflation of the number of sites identified as positively selected (ω > 1, PSS). In addition, Kosakovsky Pond et al. (2006) found that a single recombination hotspot could increase the number of PSSs detected by the FEL analysis (Kosakovsky Pond and Frost 2005a), but they did not investigate its effect on the estimation of ω. More recently, Kosakovsky Pond et al. (2008) did not find a significant effect of a single recombination event on the estimation of ω, although they did find that the PSSs identified were different when the recombination breakpoint was explicitly taken into account. Here we show that considering intracodon recombination does not make a difference in the assessment of the impact of recombination on the estimation of global ω and that, as previously shown, this impact is low.
A different issue is the effect of recombination on the estimation of ω per site. We show that recombination can give the impression that the selection pressure varies along the sequence when in fact it is constant and that some sites appear to be under positive selection when they are not. Obviously, this is especially relevant for those studies trying to understand positive selection across recombining genomes, such as studies of Drosophila, humans, or HIV-1 (Nielsen and Yang 1998; Zanotto et al. 1999; De Oliveira et al. 2004; Bustamante et al. 2005; Clark et al. 2007; Nielsen et al. 2007). In theory, recombination could inflate the value of ω (global or per site) through two different mechanisms. One mechanism could be the generation of nonsynonymous changes, but we have shown that the number of nonsynonymous changes generated by recombination is low compared to substitution. Indeed, many recombination events occur between identical codons, especially at low substitution rates. The other possible mechanism operates to confound phylogenetic estimation and therefore also those methods that use a phylogeny to estimate this parameter (so, in theory, the NG86 method should not be affected by this bias). Our results suggest that recombination affects selection analyses mainly because it confounds the phylogenies used in those analyses. Indeed, in our simulations, the consideration of intracodon recombination does not change the effects of recombination when only intercodon events, which cannot result in nonsynonymous changes, are allowed. Previously, Anisimova et al. (2003) found that just using an incorrect phylogeny can have a severe effect on the comparison of codon models. Indeed, phylogenetic variation across sites can be wrongly interpreted by the LRTs as variation in synonymous and nonsynonymous substitution rates. This can happen because different trees along the alignment can have different heights (see Figure 1, for example), and when dS is held constant (as it is in most popular codon models), this variation in tree length can be interpreted as variation in dN and therefore as variation of the dN/dS ratios (Anisimova et al. 2003). The fact that a model that does not hold dS constant (FEL-2R) showed much better properties in terms of false-positive rate and power (and provides a better statistical fit; data not shown) in the presence of recombination supports this idea. However, some bias still persisted, and this might be due to recombination misleading phylogenetic inference and the derived LRTs (Anisimova et al. 2003). Indeed, we have previously shown (Posada and Crandall 2002) that the trees inferred by ignoring recombination can be quite different from the underlying true phylogenies. In any case, using a “two-rate” or “dual” model such as FEL-2R seems highly recommended if recombination is suspected to have occurred in the history of the sample under study.
By chance, 2/3 of the recombination events that occur within coding sequences should happen within codons. We have analyzed recombination breakpoints for the circulant recombinant forms of HIV-1 reported at the Los Alamos HIV database (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/breakpoints.html). Note that these breakpoints are very gross estimates and do not constitute a random sample. Among the 290 breakpoint locations listed, only 28% occur within codons, which might suggest that intracodon recombination events are selected against and/or are more difficult to detect (especially if the detection is carried out at the amino acid level). More reliable data are needed to better understand this question.
To our knowledge, only one method has been developed to co-estimate ρ and ω (Wilson and McVean 2006). This method showed very good performance in simulations, but these did not allow recombination within codons. One of the potential uses of our algorithm is a more meaningful evaluation of these kinds of methods. The algorithm described here has been implemented in a computer program called NetRecodon, freely available from the software section at http://darwin.uvigo.es, which also simulates migration, demographic periods, and dated tips. The program is reasonably fast and can produce large alignments (100 sequences with 1000 codons will take 2 min). Note that the execution time depends directly on the recombination and substitution rates. Conveniently, NetRecodon can also run in parallel (using the Message Passing Interface libraries). This algorithm could be used to more realistically model the evolution of nuclear genes and fast-evolving pathogens such as HIV-1 or the estimation of genetic parameters using approximate Bayesian computation (Beaumont et al. 2002; Tallmon et al. 2004; Excoffier et al. 2005; Tanaka et al. 2006). Certainly, in coding sequences, recombination occurs more often within codons than between them, and therefore recombination needs to be taken into account.
We thank two anonymous reviewers for their excellent suggestions. In particular, one of the reviewers suggested some important improvements to the original algorithm. This work was supported by the Spanish Ministry of Science and Education (grant no. BIO2007-61411 to D.P., Formación de Personal Investigador (FPI) fellowship BES-2005-9151 to M.A.).
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.109736/DC1.
Communicating editor: J. Wakeley
- Received September 14, 2009.
- Accepted November 14, 2009.
- Copyright © 2010 by the Genetics Society of America