Abstract
Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) has suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, 4Ner, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV1 and HIV2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.
RECOMBINATION breaks down the correlation in genealogical history between different regions of a genome and shuffles genetic diversity among chromosomes. In evolutionary biology, the importance of recombination is the generation of novel gene combinations, which allows the spread of multiple beneficial mutations (Fisher 1932; Muller 1932) and prevents the accumulation of deleterious ones (Muller 1964). In medical genetics, associations between disease phenotypes and genetic markers that build up through genetic drift and are broken down by recombination are central to the mapping of disease-associated mutations (Pritchard and Przeworski 2001).
The occurrence of recombination also has practical implications for evolutionary inference. For population geneticists, recombination reduces the effects of evolutionary stochasticity, averaging out genealogical histories over a genome. In contrast, traditional methods of phylogenetic inference typically assume the absence of recombination. If the assumption is incorrect, inferences about the evolutionary history of gene sequences may be misleading (Schierup and Hein 2000). Recombination is therefore a critical issue for analyses of within-species variation.
A variety of nonparametric methods have been developed to detect recombination from gene sequences, without estimating the rate at which it occurs. Some use phylogenetic methods to ask whether different regions of a gene have different histories (Grassly and Holmes 1997; McGuire et al. 2000); these are targeted at identifying rare recombinant genotypes. Other methods are aimed at inferring the presence of recurrent recombination, such as occurs among the genes of most eukaryote species. Among these methods, some consider summary statistics that are sensitive to recombination, such as the relationship between physical distance and measures, or indicators, of linkage disequilibrium (Lewontin 1964; Maynard Smith 1999). Other methods consider properties of phylogenetic trees inferred under the assumption of no recombination (Maynard Smith and Smith 1998; Worobey 2001). The methods vary in their ability to statistically detect recombination under different conditions and their sensitivity to an accurate characterization of the underlying model of sequence evolution (Maynard Smith 1999; Meunier and Eyre-Walker 2001).
The inability of such methods to estimate the rate at which recombination occurs is a serious limitation. Characterizing the rate of recombination is important for analyzing the power of association studies, assessing the reliability of phylogenetic methods, and predicting the rate at which advantageous mutations, such as those conferring drug resistance, can spread between genetic backgrounds. Some nonparametric methods for detecting recombination, such as the homoplasy test (Maynard Smith and Smith 1998) and derivatives (Worobey 2001), provide a characterization of how far the data are from the extremes of free recombination and complete clonality. But there is no straightforward relationship between such a property and the parameters of any underlying evolutionary model. As a result, comparison between genes or species is problematic, and there is little or no way of statistically testing whether data sets have different levels of recombination. Model-based estimation of the rate of recombination does rely on an underlying model that is almost certainly a simplification of reality. However, the benefits gained are the ease of comparison between different data sets, the ability to make predictions about the question of interest, and the potential to test whether the model of evolution is an adequate characterization of the underlying processes. In addition, parametric models can be used to test for the presence of recombination by comparing the likelihood of the data under models with and without recombination (Brown et al. 2001).
What evolutionary model is appropriate for describing the effects of recombination on gene sequences? Coalescent theory provides a statistical description of the genealogical history of sequences sampled from large, Fisher-Wright populations with nonoverlapping generations, constant population size, and no selection or migration (Kingman 1982; Hudson 1991). Within this framework, the effects of recombination on sample history are a function not of the absolute recombination rate, but of the product of the per-gene, per-generation rate of crossing over (genetic map length), r, and the effective population size, Ne (Griffiths and Marjoram 1996b). Without prior information about one of these parameters, it is possible to estimate only their product, often written as ρ = 4Ner (equivalently, one can estimate the ratio of the recombination rate and the mutation rate, r/μ, and the population mutation rate θ = 4Neμ). The coalescent can readily be extended to include time-varying population size, migration, and some forms of selection (Hudson and Kaplan 1994; Braverman et al. 1995). Under these more complex situations, the effects of recombination on gene samples also depend on other parameters. In general, however, the product of the current effective population size of the population and the absolute recombination rate is the key determinant of the impact of recombination on patterns of genetic diversity.
Within the framework of the coalescent, several methods have been proposed as estimators of the population recombination rate. Hudson (1987) derived a moment estimator on the basis of the variance in pairwise differences. Hey and Wakeley (1997) developed a method on the basis of combining analytically derived likelihoods for all pairs of sites and sets of four sequences. Wall (2000) proposed to find the value of 4Ner that maximizes the likelihood of observing the number of haplotypes and inferred minimum number of recombination events (Hudson and Kaplan 1985). Full-likelihood estimators of the population recombination rate, on the basis of the coalescent, have also been developed. These use computationally intensive Monte Carlo methods; Griffiths and Marjoram (1996a) described a method on the basis of importance sampling, while Kuhner et al. (2000) developed a Metropolis-Hastings Markov chain Monte Carlo (MCMC) method. Recently, Fearnhead and Donnelly (2001) improved the importance sampling method considerably. Even so, full-likelihood methods remain computationally intensive and are practically impossible to apply to many data sets.
Figure 1. Recurrent mutation (A) and recombination (B) can generate similar patterns of genetic variability. The top shows the genealogies and occurrence of mutations, while the bottom depicts the resulting sampled gene sequences.
Recently, Hudson (2001) suggested an ad hoc method for estimating the population recombination rate on the basis of combining the coalescent likelihoods of all pairwise comparisons of segregating sites. Estimation of 4Ner is rapid, and the method performs well in terms of bias and variance in comparison to Hudson's earlier moment estimator (Hudson 1987) and other ad hoc approaches (Hudson 2001). The method does not use all available information in the sequence data and introduces nonindependence in the combination of multiple comparisons, but is flexible and can potentially be expanded to incorporate deviations from the standard coalescent. Hudson's (2001) estimator of 4Ner has been termed the composite-likelihood estimate (CLE).
In this article we consider a problem of critical importance to the analysis of recombination: the detection and estimation of recombination in genomes, such as those of many viruses and bacteria, where the rate of substitution is sufficiently high that some sites have experienced multiple mutations in the history of the sample. The issue is important because recurrent mutation can generate patterns of genetic variability that resemble the effects of recombination (Figure 1); in particular, the presence of all four haplotypes for a pair of segregating sites. Under the infinite-sites model, any such incompatibilities would be interpreted as evidence for recombination and hence will bias estimates of the recombination rate upward. Similarly, the likelihood-ratio test for the presence of recombination will be sensitive to misspecification of the mutation model, particularly the underestimation of the mutation rate at segregating sites, which can be caused by rate heterogeneity.
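The pattern referred to above, the presence of all four haplotypes at a pair of segregating sites, is easy to check directly. The following is a minimal sketch, assuming each site is biallelic and stored as a string of allele labels with one character per sampled sequence; the function names and the toy alignment are illustrative and do not come from the article.

```python
from itertools import combinations

def has_all_four_haplotypes(site_a, site_b):
    """True if two biallelic sites show all four two-site haplotypes."""
    # Each site holds one allele label per sampled sequence, in the same order.
    return len(set(zip(site_a, site_b))) == 4

def incompatible_pairs(sites):
    """Index pairs (i, j), i < j, of sites that show all four haplotypes."""
    return [(i, j) for i, j in combinations(range(len(sites)), 2)
            if has_all_four_haplotypes(sites[i], sites[j])]

# Hypothetical toy data: 3 segregating sites scored in 4 sequences.
toy_sites = ["0011", "0101", "0011"]
print(incompatible_pairs(toy_sites))  # [(0, 1), (1, 2)]
```

Under the infinite-sites model each pair reported here would be attributed to recombination; under a finite-sites model the same pair can also arise from a repeat mutation at one of the two sites.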
To address these problems we have extended Hudson's composite-likelihood method (Hudson 2001) to allow for finite-sites mutation models. In addition, we propose a permutation-based test (the likelihood permutation test) to test the hypothesis of no recombination (4Ner = 0). We use a permutation-based approach, rather than estimate confidence intervals from the composite likelihood, as the nonindependence makes interpretation of the composite-likelihood surface problematic, but also because we wish the test to be robust to model misspecification. We find that the composite-likelihood estimator performs well, even when most sites analyzed have experienced multiple mutations, and that the likelihood permutation test is more powerful than previous permutation-based methods for detecting recombination. We also consider the effect of misspecification of the model of sequence evolution on both the test for recombination and estimation of 4Ner. We show that the likelihood permutation test is robust to misspecification, unlike the homoplasy test (Maynard Smith and Smith 1998) or the informative sites test (Worobey 2001), and that estimation of 4Ner is also robust to minor misspecification of the model of sequence evolution. We apply the likelihood permutation test and estimation procedure to several empirical data sets from viruses, bacteria, and human mitochondria.
METHODS
Composite-likelihood estimation of 4Ner: First, we outline our implementation of the approach of Hudson (2001) for estimating the population recombination rate under the standard Fisher-Wright population model. The central difference between the method of Hudson (2001) and that presented here is that we allow for models of sequence evolution in which multiple mutations may occur at a site during the history of the sample. Although it is possible to use an arbitrary model of sequence evolution, we make the simplifying assumption that all sites in a sequence conform to a two-allele model with reversible, symmetric mutation, such that the rate of mutation per site per generation is μ and is constant across sites. Consequently, we restrict analysis to sites at which there are no more than two alleles segregating. The extension of the method to more complex models of sequence evolution is left to future research; however, it is worth noting that the method appears to perform well, even when the true model of sequence evolution is considerably more complex than that assumed (see below).
The estimation procedure has four stages. The initial step is to estimate the population mutation rate per site, θ = 4Neμ, from an approximate finite-sites version of the Watterson estimate
The third stage is to estimate the likelihood of each equivalent set under the estimated value of θ, the symmetric, reversible mutation model, and a range of recombination rates (typically 0 ≤ 4Ner ≤ 100), using the importance sampling method of Fearnhead and Donnelly (2001). We also used a simple Monte Carlo scheme for estimating the likelihood, similar to that implemented in Hudson (2001), to check the accuracy of likelihoods estimated by the importance sampling method (results not shown).
In the final stage, an estimate of the population recombination rate for the entire sequence (4Ner) is obtained by combining the likelihoods from all pairwise comparisons. The composite likelihood is given by
For genomes, such as viruses and bacteria, in which a gene-conversion model for recombination is more appropriate than a crossing-over model, the relationship between physical distance and recombination rate is modeled as
Figure 2. (A) The composite (CLR) and full (LR) relative likelihood surface for a single simulated data set. (B) The joint distribution of the maximum-likelihood estimate (MLE) of 4Ner and the composite-likelihood estimate (CLE). Likelihoods were calculated with θ = 0.01 per site.
For simple data sets and low values of 4Ner, it is possible to compare the composite-likelihood surface with the full-likelihood surface estimated by the method of Fearnhead and Donnelly (2001). Figure 2 shows a comparison of the two surfaces for a single case and the joint distribution of the maximum-likelihood estimator (MLE) and CLE point estimates of 4Ner for 100 simulated data sets with n = 50 and θ = 4Ner = 3. For the single example (Figure 2A), the composite-likelihood curve has a very similar point estimate to the ML estimate, but is more highly curved because of the nonindependence introduced by multiple comparisons. Statistics for the two estimators of 4Ner (full-likelihood/composite-likelihood) are median, 2.4/3.8; variance, 9.1/15.6; proportion within a factor of two from the true value, 0.50/0.52. The correlation between the composite- and maximum-likelihood estimates is 0.78 (Figure 2B).
Hudson (2001) characterized the composite-likelihood estimator for the case where data conform to the infinite-sites model. In terms of bias and variance, the CLE is one of the better ad hoc methods for estimating the population recombination rate, although the estimator has considerable variance. This is also true of the MLE (Figure 2) and, to a large extent, is a reflection of inherent stochasticity in the genealogical process. However, while full likelihood provides an estimate of the relative likelihood of different values, there is no easily interpretable meaning of the composite-likelihood curve. Confidence intervals for the estimate of 4Ner can be obtained only by extensive simulation (Hudson 2001).
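The combination step that defines the CLE can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the article: the pairwise log-likelihood is passed in as a function (here a purely hypothetical placeholder, whereas the article obtains these values by importance sampling), and the recombination rate between two sites is assumed to scale linearly with their physical distance, as in a crossing-over model.

```python
def composite_log_likelihood(pairs, rho, seq_length, pair_loglik):
    """Sum of pairwise log-likelihoods for a candidate whole-sequence rate rho.

    pairs       : list of (config, distance) tuples, one per pair of sites
    pair_loglik : callable (config, rho_ij) -> log-likelihood of that pair
    """
    total = 0.0
    for config, distance in pairs:
        # Assumption: linear scaling of the rate with distance (crossing over).
        rho_ij = rho * distance / seq_length
        total += pair_loglik(config, rho_ij)
    return total

def composite_likelihood_estimate(pairs, rho_grid, seq_length, pair_loglik):
    """Grid value of rho that maximizes the composite log-likelihood."""
    return max(rho_grid,
               key=lambda rho: composite_log_likelihood(
                   pairs, rho, seq_length, pair_loglik))

# Hypothetical placeholder likelihood: each "config" is just the pairwise rate
# it favors. Real values would come from a coalescent likelihood calculation.
def toy_pair_loglik(config, rho_ij):
    return -(rho_ij - config) ** 2

toy_pairs = [(1.0, 50), (2.0, 100)]
rho_hat = composite_likelihood_estimate(toy_pairs, range(0, 5), 100, toy_pair_loglik)
print(rho_hat)  # 2
```

Because the same sites appear in many pairs, the summed terms are not independent, which is why the curvature of this surface cannot be read as a conventional likelihood.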
The likelihood permutation test: We propose a simple test for the presence of recombination. Under a model of no recombination, and assuming a uniform mutation rate, sites are exchangeable (this is also true if there is free recombination). That is, the likelihood of observing the data is independent of the order in which sites occur. If there is some recombination, sites are no longer exchangeable, because closely linked sites have correlated genealogies. Consequently, the likelihood of observing the data is dependent on the order of sites. The likelihood permutation test for recombination is based on this property; we find the maximum composite likelihood for a data set (estimating 4Ner in the process), then permute segregating sites by location, and for each permutation find the maximum composite likelihood (and the corresponding value of 4Ner). The proportion of permuted data sets with a composite likelihood equal to or greater than that of the original data is calculated. If this proportion is lower than a chosen significance level, we conclude that there is evidence for recombination.
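The permutation procedure described above can be sketched generically. The code below is an illustration only: the statistic is supplied as a function of the site positions, and the toy statistic used in the example is a hypothetical stand-in for the maximum composite likelihood, which would require the full pairwise likelihood machinery.

```python
import random

def permutation_p_value(positions, score, n_perm=999, seed=0):
    """Proportion of position permutations whose score is >= the observed score.

    positions : site coordinates, in the order of the data columns
    score     : callable taking a list of positions and returning the statistic
                (in the article, the maximum composite likelihood)
    """
    rng = random.Random(seed)
    observed = score(positions)
    shuffled = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign locations to sites at random
        if score(shuffled) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical toy statistic: association between sites i and j decays with
# |i - j|, and the score is high when associated sites are physically close.
N_SITES = 8
ASSOC = [[1.0 / (1 + abs(i - j)) for j in range(N_SITES)] for i in range(N_SITES)]

def toy_score(positions):
    total = 0.0
    for i in range(N_SITES):
        for j in range(i + 1, N_SITES):
            total -= ASSOC[i][j] * abs(positions[i] - positions[j])
    return total

toy_positions = [10 * i for i in range(N_SITES)]
p_value = permutation_p_value(toy_positions, toy_score)
print(p_value)
```

When the statistic does not depend on site order, as under no recombination or free recombination, every permutation ties with the observed value and the proportion is 1.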
There are several methods for detecting recombination on the basis of the permutation of segregating sites. Permutation tests for recombination aimed at detecting a decay of a summary statistic of linkage disequilibrium (r2 or |D′|) with distance have been used to suggest the presence of recombination in human mitochondria (Awadalla et al. 2000) and Plasmodium falciparum (Conway et al. 1999) and regions of low recombination in the Drosophila melanogaster genome (Miyashita and Langley 1988). Another permutation test (referred to as G4) has been suggested by Meunier and Eyre-Walker (2001), which compares the sum of distances between all pairs of sites that have all four possible haplotypes to the distribution in permuted data sets. We compared the power of the likelihood permutation test with these other permutation-based tests.
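For reference, the two linkage disequilibrium summaries mentioned above can be computed for a pair of sites as follows. This is a minimal sketch using the standard textbook definitions of r2 and |D′|, which the article does not spell out; it assumes both sites are segregating, biallelic, and coded as 0/1, and the function name is illustrative.

```python
def ld_stats(site_a, site_b):
    """Return (r2, abs_d_prime) for two segregating biallelic sites coded 0/1."""
    n = len(site_a)
    p_a = sum(site_a) / n                      # frequency of allele 1 at site A
    p_b = sum(site_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # coefficient of disequilibrium D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max

# Three haplotypes present: |D'| is 1 although r2 is well below 1.
print(ld_stats([0, 0, 1, 1], [0, 0, 0, 1]))
```

With these definitions |D′| stays at 1 for any pair that lacks at least one of the four haplotypes, so a test based on |D′| draws its signal from the pairs at which all four are present.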
Models of sequence evolution: We characterize both the composite-likelihood estimator and likelihood permutation test under a range of models of sequence evolution that reflect genomes experiencing high mutation rates at some or all sites. We have chosen four caricature models to represent the diversity of possible situations:
Infinite sites: All sites have the same low mutation rate (θ = 0.01) and conform to the two-allele symmetric, reversible mutation model used in the likelihood estimation stage. This represents the best-case scenario (effectively infinite sites), as might be assumed for nuclear loci in humans (excluding hypermutable CpG dinucleotides).
Hypermutable: Most sites (99.5%) effectively conform to the infinite-sites model (θ = 0.005), but a fraction (0.5%) have a 100-fold higher mutation rate. All sites conform to the symmetric, reversible mutation model. This is chosen to reflect extreme rate variation, as occurs when hypermutable CpG dinucleotides are included in an analysis or in the mitochondrial genome of mammals.
Complex: This is characterized by strong base composition variation and mutation rate variation. Specifically, this is an HKY (Hasegawa, Kishino, Yano) mutation model (Hasegawa et al. 1985), with base frequencies πT = 0.4, πC = 0.1, πA = 0.4, πG = 0.1, a transition-transversion ratio of 2, and an exponential distribution of mutation rates with a base-averaged mutation rate of
Finite sites: All sites have the same, high mutation rate (θ = 0.5) and conform to the two-allele symmetric, reversible mutation model. In this case, each segregating site experiences, on average, 2.6 mutations in the history of the sample. This model represents the extreme levels of polymorphism as occur at synonymous sites in retroviruses such as human immunodeficiency virus (HIV).
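As a small worked example of how extreme the rate variation in the hypermutable model is, the arithmetic below uses only the figures quoted for that model (99.5% of sites at θ = 0.005 and 0.5% of sites at a 100-fold higher rate). The variable names are illustrative.

```python
# Two site classes of the hypermutable model, as quoted in the list above.
frac_fast = 0.005                  # fraction of hypermutable sites
theta_slow = 0.005                 # per-site theta at ordinary sites
theta_fast = 100 * theta_slow      # 100-fold higher rate at hypermutable sites

# Average per-site theta across both classes.
mean_theta = (1 - frac_fast) * theta_slow + frac_fast * theta_fast

# Share of the total mutational input contributed by the hypermutable class.
share_fast = frac_fast * theta_fast / mean_theta

print(round(mean_theta, 6), round(share_fast, 3))  # 0.007475 0.334
```

About one-third of the mutational input is therefore concentrated in one site in 200, which is where recurrent mutation is most likely to mimic recombination.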
Data are simulated under the null 4Ner = 0 and 4Ner = 10, for n = 50 and the length of sequence chosen such that the average number of segregating sites is in the range 40–50. Ideally, for each simulated data set the likelihoods should be calculated for the value of θ estimated from the data. However, for the large number of replicates required to provide an accurate characterization of the estimator's properties, calculating the likelihoods for each data set is practically unfeasible. Instead, we have estimated likelihoods under three different values of θ, 0.01, 0.1, and 0.5, and present the results for each, along with mean and standard deviation of the values of θ estimated from the simulated data. One advantage of this approach is that it allows us to characterize the severity of model misspecification on the detection and estimation of recombination.
Empirical data: We applied both the likelihood permutation test and estimation of the population recombination rate to a series of empirical data sets from viruses, bacteria, and human mtDNA. Previous analyses (Suerbaum et al. 1998; Awadalla et al. 1999; Worobey et al. 1999; Ingman et al. 2000; Worobey 2001) of these data sets revealed a range of levels of recombination, from effectively clonal in hepatitis C virus (HCV) and mtDNA (Ingman et al. 2000; Worobey 2001) to freely recombining in Helicobacter pylori (Suerbaum et al. 1998). While none of these data sets represent random samples from Fisher-Wright populations, as is supposed by the coalescent methods of analysis, the results are likely to be indicative of the situation in more appropriate samples.
Viral genomes: Data sets were the following: HCV, 6 complete genome sequences (Worobey 2001; worldwide sample); measles, 50 sequences of the Hemagglutinin gene (Woelk et al. 2001; worldwide sample); dengue DEN-1 virus, 7 sets of concatenated capsid C, premembrane/membrane prM/M, and E genes (Worobey et al. 1999; worldwide); HIV2 subtype A, 21 sequences of env gene (Kuiken et al. 2000; worldwide); and HIV1 subtype B, 93 sequences of the env gene (Kuiken et al. 2000; worldwide).
Bacterial genomes: H. pylori data sets were 33 sequences of the flaA gene (worldwide; Suerbaum et al. 1998).
Mitochondrial genomes: Data sets were 45 partial genome sequences from the analysis of Awadalla et al. (1999; worldwide) and 53 complete genome sequences from the analysis of Ingman et al. (2000).
RESULTS
Estimating 4Ner with recurrent mutation: To date, estimators of the population recombination rate have typically been characterized under the infinite-sites assumption that each segregating site is the result of a single mutation. In many biologically realistic situations this assumption cannot be justified, even though the infinite-sites model is superficially plausible. For example, if 20 mutations occur in a genealogy of 500 linked sites (the expected number for n = 50 and θ = 0.009), the probability that at least one site experiences recurrent mutation is >30% and will be higher if there is recombination or any variation between sites in the mutation rate. In organisms with high mutation rates, such as many viruses and bacteria, a large proportion of sites may have experienced multiple mutations.
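The figures in the example above can be checked with a short calculation. The sketch assumes the standard Watterson expectation for the number of mutations (θ per site times the number of sites times the harmonic sum over 1 to n − 1) and treats the 20 mutations as falling independently and uniformly on the 500 sites, with no recombination and no rate variation; the function names are illustrative.

```python
def expected_mutations(n, theta_per_site, n_sites):
    """Expected number of mutations on the genealogy of n sequences."""
    a_n = sum(1.0 / i for i in range(1, n))   # harmonic sum 1 + 1/2 + ... + 1/(n-1)
    return theta_per_site * n_sites * a_n

def prob_recurrent_mutation(n_mutations, n_sites):
    """Probability that at least one site receives two or more mutations."""
    p_all_distinct = 1.0
    for k in range(n_mutations):
        # The (k+1)th mutation must miss the k sites already hit.
        p_all_distinct *= (n_sites - k) / n_sites
    return 1.0 - p_all_distinct

print(round(expected_mutations(50, 0.009, 500), 1))   # 20.2
print(round(prob_recurrent_mutation(20, 500), 2))     # 0.32
```

The second value is the familiar birthday-problem probability, and it confirms that recurrent mutation is far from negligible even at this modest level of diversity.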
Because recurrent mutation can create patterns of genetic variability that resemble the effects of recombination (Figure 1), it is important to develop methods for estimating the recombination rate that can account for finite-sites models of sequence evolution. We have extended Hudson's (2001) composite-likelihood method for estimating the population recombination rate, 4Ner, within a coalescent framework, to incorporate models in which sites may experience multiple mutations in the history of the sample. Our approach is to use the simplest possible model of finite-sites evolution (two-allele system with symmetric reversible mutation and a constant mutation rate across sites) and to investigate how the method performs under a variety of caricature models of sequence evolution chosen to reflect biological diversity.
Figure 3. The distribution of CLEs of the population recombination rate simulated and analyzed under different models of sequence evolution. Each chart represents the results from 1000 data sets simulated with 4Ner = 10. The model of sequence evolution used to simulate data is on the left and the value of θ used to calculate likelihoods under the two-allele symmetric reversible model is at the top of the columns.
Figure 3 shows the distribution of point estimates for 4Ner for data simulated under the four caricature models (n = 50 and 4Ner = 10) and likelihoods estimated under three different values of θ: 0.01, 0.1, and 0.5. In Table 1 we also present the median and proportion of estimates that are within a factor of two from the true value, along with the mean and standard deviation of estimates of θ obtained from Equation 1.
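The two summaries used to characterize the estimator, the median and the proportion of estimates within a factor of two of the true value, can be computed as in the following sketch. The estimates in the example are hypothetical numbers chosen only to show the calculation; they are not results from the article.

```python
from statistics import median

def proportion_within_factor_of_two(estimates, true_value):
    """Fraction of estimates lying between true_value / 2 and 2 * true_value."""
    inside = sum(1 for e in estimates
                 if true_value / 2 <= e <= 2 * true_value)
    return inside / len(estimates)

# Hypothetical point estimates of 4Ner for data simulated with 4Ner = 10.
toy_estimates = [4.0, 5.0, 10.0, 20.0, 25.0]
print(median(toy_estimates))                                  # 10.0
print(proportion_within_factor_of_two(toy_estimates, 10.0))   # 0.6
```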
As expected, when there is a considerable discrepancy between the true value of θ and that used to estimate likelihoods, estimates of 4Ner are strongly biased. When the true value of θ is lower than the value used to estimate likelihoods, estimates of 4Ner are downwardly biased. In contrast, when the true value of θ is greater than the value used to estimate likelihoods, estimates of 4Ner are biased upward. However, it is encouraging to find that when likelihoods are estimated under the correct value of θ, the estimator performs almost as well when the mutation rate is very high as it does when the mutation rate is low (Figure 3, bottom right vs. top left).
The middle two rows of Figure 3 and Table 1 show the effects of applying the simplistic mutation model to data simulated under models representing some degree of biological complexity. For both the hypermutable and complex models there is strong rate variation across sites, yet the estimator properties are hardly worse than under the best-case scenario, and the estimates of θ are well within the range that leads to sensible estimates of 4Ner. In short, the composite-likelihood estimator of the population recombination rate is robust to minor misspecification of the underlying mutation model. This conclusion is of great importance as it provides a justification of the use of the CLE on real data sets.
Table 1. Statistical properties of the composite-likelihood estimator
Detecting recombination: The results presented above may give us some confidence that the value of 4Ner estimated by the composite-likelihood method is meaningful, even in genomes where the rate of recurrent mutation is high. However, one important question that is difficult to address within the CLE framework is whether one can reject the hypothesis that 4Ner = 0. Direct experimental evidence for recombination may be difficult to obtain for many genomes (particularly if genetic exchange is very rare); thus it is important to have indirect, population genetic-based methods for detecting recombination. And it is equally important that such methods should not create false positives through misspecification of the model of sequence evolution.
We have proposed the likelihood permutation test as a means of testing for the presence of recombination. Table 2 shows the results of the power analysis carried out on the same four caricatures of sequence evolution, and again estimating likelihoods under the three values of θ. We also compare the power of the likelihood permutation test to other permutation-based tests for recombination that consider summaries of the data sensitive to the presence of recombination.
The key result is that the likelihood permutation test is consistently the most powerful permutation-based method for detecting recombination from population genetic data. In the case of infinite-sites data, recombination is detected in almost 96% of cases, compared to ~80% for the other tests. Even when the model used to estimate likelihoods is very different from the true model, the power of the test is considerable. For example, with data generated by the finite-sites model with θ = 0.5, recombination is detected in 83% of cases when the correct value of θ is used to calculate likelihoods, compared to 82% of cases when θ = 0.01 is used to estimate likelihoods. In contrast, those methods that rely heavily on the distribution of pairs at which all four gametes are present (|D′| and G4) have greatly reduced power under such high levels of mutation (51 and 39%, respectively). The one situation where the likelihood permutation test has reduced power is when the true value of θ is much lower than that used to estimate likelihoods; however, such a situation is unlikely to occur for empirical data. It is also worth noting that the power to detect recombination using the correlation between r2 and physical distance is consistently greater than with either |D′| or G4 for the biologically plausible models of sequence evolution.
DISCUSSION AND APPLICATION
The composite-likelihood method and likelihood permutation test together present a powerful approach for assessing the influence of recombination on patterns of genetic variability. Even when the mutational and substitutional processes affecting gene sequence evolution are complex and unlikely to be fully characterized by any simple model, the use of simple models provides a remarkably robust way of detecting recombination and estimating the population recombination rate. To investigate how the new approach performs on real data, we have applied the methods to samples of gene sequences from the viruses HIV1, HIV2, hepatitis C, dengue-1, and measles, the bacterium H. pylori, and human mitochondrial DNA. We also discuss possible limitations of the approach, in particular misspecification of the population model used to estimate the likelihoods.
Empirical data: The empirical data sets were chosen to reflect a diversity of levels of recombination, as had been estimated from previous studies (Maynard Smith et al. 1993; Suerbaum et al. 1998; Awadalla et al. 1999; Worobey et al. 1999; Ingman et al. 2000; Worobey 2001). For the HIV data sets, we analyzed third position sites in the coding region separately from the first two positions, to investigate whether different results were obtained from using data with different levels of diversity. In addition, we analyzed two human mtDNA data sets that have been used to provide evidence for (Awadalla et al. 1999) and against (Ingman et al. 2000) recombination. In all cases, a gene-conversion model for recombination is more appropriate than a crossing-over model, and we have fixed the average tract length of gene conversion to 100 bp for the viral and bacterial data sets and 500 bp for the mtDNA data sets. These numbers are arbitrary, although in the microbial and viral data sets, the composite likelihood increases for small tract lengths (data not shown). In one of the few cases in eukaryotes where gene conversion tract lengths have been estimated, the best fit to the data was a geometric distribution with mean tract length of 352 bp (Hilliker et al. 1994).
Table 2. Power analysis of permutation tests for detecting recombination
Table 3 presents the results of these analyses and the estimate of the population recombination rate, γ, under a gene conversion type model; see Equation 5. In addition, we carried out the same analyses, but filtering out single nucleotide polymorphisms (SNPs) for which the minor allele was at a frequency <0.1; the results are presented in Table 4. For the HCV and dengue virus data sets the results from the filtered analysis are identical to those in Table 3, because the sample sizes are <10 and no segregating site can have a minor allele frequency below 0.1. We also omitted the results for the test of Meunier and Eyre-Walker (2001) as it behaves in an almost identical fashion to |D′|.
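The frequency filter can be sketched as follows, assuming each site is stored as a string of allele labels with one character per sequence. The function names and the toy sites are illustrative. The example also shows why the filter has no effect on samples of fewer than 10 sequences: a segregating site then has a minor allele frequency of at least 1/n, which exceeds 0.1.

```python
def minor_allele_frequency(site):
    """Frequency of the rarer allele at a site (0.0 if the site is monomorphic)."""
    counts = {}
    for allele in site:
        counts[allele] = counts.get(allele, 0) + 1
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(site)

def filter_sites(sites, min_maf=0.1):
    """Keep only sites whose minor allele frequency is at least min_maf."""
    return [s for s in sites if minor_allele_frequency(s) >= min_maf]

# Hypothetical sites in a sample of 20 sequences: a singleton and a common variant.
singleton = "0" * 19 + "1"
common = "0" * 12 + "1" * 8
print(filter_sites([singleton, common]) == [common])   # True

# With 6 sequences, even a singleton has frequency 1/6 and passes the filter.
print(filter_sites(["000001"]) == ["000001"])          # True
```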
From Table 3 and, more noticeably, from Table 4, we find evidence for recombination in almost all data sets and levels of recombination that range from
The effect of filtering out rare variants is worth noting. Rare variants are largely uninformative about recombination (though not entirely; McVean 2001), and hence their inclusion may obscure the signal of recombination, particularly if there is an excess of rare mutations in the data. Removal of rare variants from the data has little effect on estimates of the population recombination rate in both the empirical (compare estimates of γ from Tables 3 and 4) and simulated data. For example, under the finite-sites model, the median of estimates of γ was 9.8 when all sites were used (and analyzed under the correct mutation model) and 10.2 when the analysis was restricted to sites for which the minor allele frequency was at least 0.1. In the simulated data, no increase in the power of the likelihood permutation test was found when the analysis was restricted to intermediate frequency variants. However, the simulated data sets have no excess of rare variants, unlike the empirical data.
Very high levels of recombination in HIV: The results concerning recombination in HIV1 subtype B and HIV2 subtype A sequences are particularly notable. Although recombination between different subtypes is occasionally observed (Kuiken et al. 2000), recombination within subtypes has largely been ignored in phylogenetic analysis of genetic diversity (Nielsen and Yang 1998; Rambaut et al. 2001). The results presented here do not support the view that within-subtype recombination can be ignored. Using the likelihood permutation test, we find evidence for recombination in both HIV2 and HIV1, though only when SNPs are filtered for the case of HIV1. For HIV1 the estimate of γ is beyond the range for which likelihoods were estimated.
Levels of genetic diversity are extremely high in HIV1 and HIV2 (estimates of θ per site at first/second codon positions of 0.144 and 0.102, respectively). Because recurrent mutation can cause patterns of genetic diversity similar to that caused by recombination, one might be cautious of concluding that recombination is present. However, the estimation of a low level of recombination in HCV, which has an even higher level of diversity
Detecting recombination in empirical data
The implications of such a high level of recombination in HIV1 are considerable. Not only does it call into question the validity of conclusions about the age and timing of events in the history of the virus that have been made assuming an absence of recombination (Nielsen and Yang 1998; Rambaut et al. 2001), but it also has practical implications for predicting how fast mutations (such as those conferring drug resistance) may spread across different genetic backgrounds. Analysis of genetic data from appropriate samples taken at different population scales will be essential for inferring the extent and consequences of recombination.
Recombination in human mtDNA? Another issue of considerable importance is whether there is evidence for recombination in human mtDNA. The data set of Awadalla et al. (1999) clearly shows evidence for recombination when all data are used, irrespective of the test employed (for r² and the likelihood permutation test this is also true for >90% of random subsets of 35 of the 45 sequences). In direct contrast, the data of Ingman et al. (2000) show no evidence for recombination, irrespective of the test used. When the frequency filter is applied, only one statistic, r², still shows evidence for recombination in the first data set (and this is sensitive to the removal of a single segregating site). These results are in direct contrast to those from the viral and bacterial sequences, where the frequency filter increases the power of almost all tests. Taken together, the results suggest a lack of evidence for recombination in human mtDNA.
Why should low frequency variants create the impression of recombination? Hey (2000) suggested that sequencing protocols might lead to the propagation of correlated errors. Such an effect may be enhanced by the combination of sequences from multiple laboratories (because recurrent errors will be strongly correlated), and for this reason, the data collected and sequenced by Ingman et al. (2000) are preferable. Given that sequencing errors tend to be at low frequency, this may explain why three of the four tests are significant only if all the data are analyzed, but it does not explain (beyond chance) why r² still shows a significant relationship with distance when only high frequency variants are used. McVean (2001) suggested that bouts of local adaptive evolution might lead to correlated mutations and a relationship between physical distance and linkage disequilibrium as measured by r². How adaptive evolution influences patterns of linkage disequilibrium and the measurement and detection of recombination is an important problem.
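To make concrete what a "significant relationship with distance" for r² involves, the following is a minimal Python sketch of a correlation-based permutation test. It is an illustration under stated assumptions rather than the implementation in LDhat: the function names are hypothetical, sites are assumed to be biallelic and coded 0/1, and significance is assumed to be assessed by shuffling the positions of the segregating sites and asking how often the correlation between r² and distance is at least as negative as the observed one.

```python
import random
from itertools import combinations
from math import sqrt


def r_squared(col_a, col_b):
    """Squared correlation between two biallelic sites coded as 0/1."""
    n = len(col_a)
    p_a = sum(col_a) / n
    p_b = sum(col_b) / n
    p_ab = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def permutation_test(columns, positions, n_perm=1000, seed=0):
    """Return (observed correlation, one-sided permutation P value)."""
    pairs = list(combinations(range(len(columns)), 2))
    r2 = [r_squared(columns[i], columns[j]) for i, j in pairs]

    def correlation(pos):
        dist = [abs(pos[i] - pos[j]) for i, j in pairs]
        return pearson(r2, dist)

    observed = correlation(positions)
    rng = random.Random(seed)
    shuffled = list(positions)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign physical positions to sites
        if correlation(shuffled) <= observed + 1e-12:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)


# Toy data: two tightly linked pairs of sites that are far apart.
site_a = [0, 0, 0, 0, 1, 1, 1, 1]
site_b = [0, 0, 1, 1, 0, 0, 1, 1]
toy_columns = [site_a, site_a, site_b, site_b]
toy_positions = [0, 10, 1000, 1010]
toy_observed, toy_p = permutation_test(toy_columns, toy_positions, n_perm=200)
```

In the toy data, linkage disequilibrium is complete between neighboring sites and absent between distant ones, so the observed correlation is negative. Because the test conditions on the observed set of r² values, any process that produces correlated variants at nearby sites (including correlated sequencing errors) could in principle generate the same signal.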
Detecting recombination with mutations at intermediate frequencies
Misspecification of the population model: While the properties of the composite-likelihood estimator of the population recombination rate have been examined across a variety of models of sequence evolution, we have not yet addressed how robust the methods described here are to deviations from the population model. Coalescent estimation of likelihoods assumes that a random sample has been taken from a population of constant size, with random mating, no migration to or from different populations, and no natural selection. In reality, none of these assumptions is likely to hold exactly, although several deviations from the standard neutral model (such as fluctuating population size) can be approximated as having an effect on the effective population size, Ne.
Population growth, strong geographical structuring, and nonrandom representation of gene sequences in the databases are potentially important concerns for the use of coalescent methods. Sampling of sequences specifically for population genetic analysis will overcome the problems of nonrandom database representation; however, inadequacies in the demographic model are more problematic. Population growth tends to decrease linkage disequilibrium while population structure tends to increase linkage disequilibrium (e.g., Pritchard and Przeworski 2001). Consequently, one might expect estimates of the population recombination rate (and the ability to detect recombination) to be sensitive to the demographic history of the population.
While no exhaustive attempt is made here to characterize the behavior of the CLE under misspecified population models, it is possible to ask whether the data sets analyzed show evidence for deviation from the neutral model in terms of the allele frequency spectrum. This can most simply be assessed through the use of Tajima's D statistic, which compares estimates of the population mutation rate derived from the number of segregating sites and the average pairwise differences. A negative value of the statistic indicates an excess of rare variants and the possibility of population growth, and a positive value suggests population structure may be important.
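For readers unfamiliar with the statistic, the following Python sketch computes Tajima's D from a set of aligned sequences using the standard formulas of Tajima (1989). It is an illustration only: the function name is hypothetical, sequences are assumed to be equal-length strings without missing data, and every site with more than one allele is counted as a single segregating site.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for aligned sequences; None if the statistic is undefined."""
    n = len(sequences)
    n_sites = len(sequences[0])
    # S: number of segregating sites
    s = sum(1 for i in range(n_sites) if len({seq[i] for seq in sequences}) > 1)
    if n < 4 or s == 0:
        return None  # variance term is zero or there is no variation

    # pi: average number of pairwise differences
    n_pairs = n * (n - 1) / 2
    total_diffs = sum(
        sum(1 for x, y in zip(a, b) if x != y)
        for a, b in combinations(sequences, 2)
    )
    pi = total_diffs / n_pairs

    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimates of theta, scaled by its
    # estimated standard deviation
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy samples: every variant is a singleton (excess of rare variants) ...
rare_variants = ["TAAAAA", "ATAAAA", "AATAAA", "AAATAA", "AAAATA", "AAAAAT"]
# ... versus every variant at frequency 0.5 (excess of intermediate variants).
balanced_variants = ["AA"] * 3 + ["TT"] * 3
```

For the toy samples, `tajimas_d(rare_variants)` is negative and `tajimas_d(balanced_variants)` is positive, matching the interpretation given above. Assessing significance requires the null distribution of D, which (as discussed in the text) depends on the amount of recombination assumed.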
Table 3 includes the value of Tajima's D statistic for the data sets analyzed and indicates the significance level estimated assuming no recombination. While the statistic is negative for all data sets, it is significantly so only for measles, HIV1, and the two mtDNA data sets. However, the variance of the statistic is reduced by recombination, which narrows the confidence limits under the null model. Other data sets (particularly the HIV2 data) may therefore also reflect significant deviations from the standard neutral model. Nevertheless, those data sets that show evidence for a departure from the standard neutral model also span the full range of estimated recombination rates. In short, while departure from the assumed demographic model may have some influence on the estimate of the population recombination rate, it is unlikely to be confused with the signal of recombination.
Acknowledgments
We thank Michael Worobey for the generous supply of empirical data sets and important insights. In addition, we thank Dick Hudson, Molly Przeworski, and two reviewers for discussion and comments on the manuscript. G.M. is funded by the Royal Society and P.A. is funded by the Wellcome Trust. The programs pairwise and permute used to estimate the population recombination rate and test for recombination are available within the LDhat package, which can be downloaded from http://www.stats.ox.ac.uk/~mcvean.
Footnotes
- Communicating editor: J. Hey
- Received October 2, 2001.
- Accepted January 7, 2002.
- Copyright © 2002 by the Genetics Society of America