Although the analysis of linkage disequilibrium (LD) plays a central role in many areas of population genetics, the sampling variance of LD is known to be very large with high sensitivity to numbers of nucleotide sites and individuals sampled. Here we show that a genome-wide analysis of the distribution of heterozygous sites within a single diploid genome can yield highly informative patterns of LD as a function of physical distance. The proposed statistic, the correlation of zygosity, is closely related to the conventional population-level measure of LD, but is agnostic with respect to allele frequencies and hence likely less prone to outlier artifacts. Application of the method to several vertebrate species leads to the conclusion that >80% of recombination events are typically resolved by gene-conversion-like processes unaccompanied by crossovers, with the average lengths of conversion patches being on the order of one to several kilobases in length. Thus, contrary to common assumptions, the recombination rate between sites does not scale linearly with distance, often even up to distances of 100 kb. In addition, the amount of LD between sites separated by <200 bp is uniformly much greater than can be explained by the conventional neutral model, possibly because of the nonindependent origin of mutations within this spatial scale. These results raise questions about the application of conventional population-genetic interpretations to LD on short spatial scales and also about the use of spatial patterns of LD to infer demographic histories.
THE analysis of linkage disequilibrium (LD) plays a central role in many areas of population genetics, including the determination of genetic maps, ascertainment of levels of recombination at the population level, and estimation of effective population sizes. For populations in approximate drift–mutation–recombination equilibrium, the latter becomes possible with neutral markers because the expected levels of allelic association across loci can be expressed in terms of the population parameters θ = 4Neu and ρ = 4Ner, where Ne is the effective population size, u is the mutation rate per nucleotide site, and r is the rate of recombination between sites (Ohta and Kimura 1969; Hill 1975). In principle, if an estimate of r is available, the effective population size can be extracted from ρ (e.g., Hill 1981; Hayes et al. 2003; Tenesa et al. 2007), or vice versa. Unfortunately, the stochasticity of evolutionary processes generates enormous evolutionary variance for two-locus measures (Hill and Weir 1988), so a very large number of LD estimates from independent pairs of sites is required for meaningful inferences on ρ. Moreover, it has become increasingly clear that gene conversion causes a nonlinear relationship between ρ and physical distances between sites (Andolfatto and Nordborg 1998), raising questions about the very meaning of ρ.
With methods for whole-genome sequencing now well established, obtaining observations on large numbers of sites is no longer problematic. Although extending LD analysis to large numbers of individuals is still challenging (Maruki and Lynch 2014), it has been suggested that even a single individual, if sampled from a randomly mating population, can provide insight into genome-wide disequilibrium patterns (Lynch 2008). The proposed parameter, Δ, denoted as the “correlation of zygosity,” is a measure of the degree to which the spatial distribution of heterozygous sites deviates from the expectation under random segregation. Here, we consider the issues underlying this parameter in detail, first describing the observable single-individual measure in terms of the traditional population-level estimates of LD, then exploring the power of the proposed method, and finally summarizing the results of applications to the sequenced genomes of several vertebrate species. The empirical observations provide new insights into features of recombination that have been previously accessible with only a small number of model genetic organisms.
The correlation of zygosity, Δ
Consider all pairs of genomic sites separated by distance d within a single individual, and let π and (1 − π), respectively, be the fractions of heterozygous and homozygous sites within the individual. Provided two sampled sites are uncorrelated by descent, denoted by (1 − Δ) below, the probabilities of jointly homozygous and jointly heterozygous states are (1 − π)2 and π2, respectively. If the genotypes of the two sites are correlated, denoted by Δ, they are either both homozygous or both heterozygous with respective probabilities (1 − π) and π. After some simplification, it follows that the frequencies of jointly homozygous, jointly heterozygous, and mixed homozygous/heterozygous pairs of sites are respectively:(1a) (1b) (1c)
From Equation 1c, it can be seen that (2)is a measure of the deviation of the frequency of pairs of loci with mixed zygosities from the random expectation. In general, Δ is expected to be positive because recombination causes variance in coalescence times among chromosomal regions. Nonrecombining haplotype blocks that happen to have deep coalescent times have a higher probability of carrying multiple derived mutations than do pairs with shallow coalescent times, and hence of being doubly homozygous for such mutations. The mathematical bounds to Δ are therefore generally expected to be zero to one, and in numerous applications we have never seen otherwise.
Relationship to population-genetic parameters
A more mechanistic understanding of Δ requires expressions for the expectations of the three two-site zygosity classes (H0, H1, and H2) in terms of the underlying population determinants. In the following, we adhere to the standard Wright–Fisher population-genetic model, with biallelic loci (individual nucleotide sites), random mating, and discrete generations with consecutive episodes of mutation, recombination, and gamete sampling. Mutation is assumed to occur at rate u per generation at each site, and the probability that a recombination event is initiated between sites is denoted as r.
Consider two hypothetical sites, and denote the respective frequencies of alleles A and a at the first site as p and 1 − p and of alleles B and b at the second site as q and 1 − q. At the population level, the gamete frequencies can be represented in the usual manner as PAB = pq + D, Pab = (1 − p)(1 − q) + D, PAb = p(1 − q) − D, and PaB = q(1 − p) − D, respectively, where D is the disequilibrium coefficient. The expected frequency of homozygous–heterozygous pairs is then (3)where the latter expression assumes equal expected heterozygosities at both sites, i.e., E[2p(1 − p)] = E[2q(1 − q)]. This expression further simplifies to (4a)where π = E[2p(1 − p)], and δ = E[4p(1 − p)q(1 − q)] + E[2D(1 − 2p)(1 − 2q)] + E(4D2). Likewise, it can be shown that the expected frequencies of double homozygotes and double heterozygotes are (4b) (4c)Similar points have been made previously from the standpoint of an entire population (Sabatti and Risch 2002).
Under the assumption of neutrality, the expected values of H0, H1, and H2 are functions of the underlying evolutionary forces of drift, mutation, and recombination. For a two-allele model, the expected heterozygosity for an equilibrium population is π = θ/(1 + 2θ) (Kimura 1968), and expectations for the fourth-order moments can be obtained by simplification of diffusion approximations, given as in Ohta and Kimura (1969, Equations 17), (5a) (5b) (5c)where (5d)Because θ is almost always <0.05 in diploid organisms (Lynch 2007; Leffler et al. 2012), terms involving θ2 and θ3 that do not also involve ρ can be ignored, and substitution into Equation 2 yields(6)
An alternative expression for E(Δ) presented in Haubold et al. (2010), which is closely approximated by (7)is obtained using expressions for the expectations of H0 and H1 given by Strobeck and Morgan (1978) for the infinite-allele model (in contrast to the infinite-sites model relied on above). Provided θ ≪ 1, the deviation in the numerator is trivial relative to the leading term θ(18 + ρ), and this is likely true in the denominator as well. Thus, focusing only on the leading terms that are similar (8)The latter expression is also obtained with the infinite-sites model (where each newly arisen mutation occurs at a previously homozygous site) (Ohta and Kimura 1971; Hill 1975), demonstrating its generality with respect to the features of the mutation model. The quantity within parentheses has appeared previously in the literature, being equivalent to the probability that pairs of nucleotides at two sites are derived from the same common ancestor (Pluzhnikov and Donnelly 1996; Wiuf and Hein 2000; McVean 2002) and also to the correlation of coalescence times at two sites (Hudson 1991). Note that as ρ → 0, E(Δ) → θ; for ρ ≪ 1, E(Δ) ≃ θ[1 − (2ρ/3)]; and as ρ → ∞, E(Δ) → θ/ρ.
As a check on the validity of the diffusion-theory approximation, we evaluated the average values of the two-site features derived from simulations of a Wright–Fisher model over the full range of per-site values of θ and ρ observed in eukaryotes (Lynch 2007). The agreement between the expectations of the theory and simulated data was excellent.
Relationship to population-level linkage disequilibrium
How does Δ relate to the more conventional measures of LD extracted from population samples of multiple individuals, e.g., D2? Because E[p(1 − p)q(1 − q)] ≃ 0.25π2 over the full range of allele frequencies, it follows from Equations 2 and 3 that (9)which using the expressions for the moments in the numerator reduces to (10)when θ ≪ 1. Because the term within the square brackets is between 2.0 and 2.8 (for large and small ρ, respectively), as a rough approximation(11)
An alternative measure is the standardized linkage disequilibrium, defined by Ohta and Kimura (1969) as (12)Although broadly used, this statistic is problematic in that it is not an unbiased estimator of the squared correlation, which is a function of the expectation of the ratio rather than the ratio of expectations. Because this difference in definition can lead to up to 100-fold differences in values of if allele frequencies are extreme, as is often the case with neutral alleles (McVean 2002; Song and Song 2007), most investigators restrict LD analyses to pairs of loci with intermediate allele frequencies. Such treatment is not possible, but also not necessary, with the proposed method. Again noting that the denominator is approximately π2/4, we see that , so from Equation 11, provided π ≪ 1, .
Interpretation of the recombination rate
Recombination events involve heteroduplex formations between homologous chromosomes, where nonmatching sites must be resolved by mismatch repair, which causes gene conversion. Inclusion of this factor in the definition of the recombination rate is essential because although all recombination events result in gene conversion, not all conversion events are accompanied by crossovers.
Let c be the total recombination rate per single nucleotide site (i.e., the rate of recombination per site with or without crossing over), x be the fraction of recombination events resulting in a crossover, and d be the number of sites separating the two focal positions (with d = 1 for adjacent sites). From a two-site perspective, a gene-conversion event has consequences equivalent to a crossover if the conversion is restricted to a single heterozygous site, a point first made by Andolfatto and Nordborg (1998) under the assumption of a constant conversion tract length. Following Langley et al. (2000) and Frisse et al. (2001), and allowing for an exponential distribution of tract lengths with an average length (in base pairs), the total recombination rate per site associated with conversions is , with the term in parentheses being the fraction of conversion events covering only single sites. Because a single-site conversion accompanied by a crossover restores the original state, the total recombination rate is then (13a)Thus, although it is commonly assumed that the recombination rate increases linearly with distance between sites, gene conversion results in two domains with different distant-dependent rates. For , (13b)whereas for , (13c)Gene conversions unaccompanied by crossing over cause a lower rate of recombination at all distances, with a fractional reduction x from very short to very long distances. Under the exponential-distribution model, the transition between the two asymptotic rates initiates at a distance approximately equal to and is nearly complete at distance . This effect leaves a characteristic signature in the expected relationship between Δ and d (Figure 1).
Influence of population-size history
Because the population-genetic interpretations of Δ (and indeed all prior measures of LD) noted above are based on the assumption of a constant long-term population size, it is prudent to consider the potential effects of past demographic changes. This problem can be readily examined by use of the recursion equations presented by Strobeck and Morgan (1978), whose equilibrium solutions were referred to above. As in Figure 1, it is useful here to simply evaluate the behavior of the scaled correlation Δ/θ0, where θ0 denotes the current heterozygosity in the population, as this then leaves the influence of recombination as a separate term, e.g., Equation 8. (Recall from above that .) To further simplify the presentation, we also assume x = 1 in the following examples.
With populations with constant Ne, the scaled correlation converges on values of 1.0 for all Ne at short distances between sites, and then yields a set of parallel declining lines (on a logarithmic scale) at higher distances, with higher elevations associated with smaller Ne (Figure 2). This asymptotic behavior, which arises because Δ ≃ 1/(4Necxd) at large d, implies an expected inverse relationship between Δ and d at large distances.
For populations changing in size, we considered demographic shifts within the past 4Nc generations, where Nc is the current effective population size. Because 4Nc is the mean coalescence time for a neutral mutation in a population of constant size, current patterns of LD will be negligibly influenced by demographic changes prior to this point. For populations declining in size progressively through time, the scaled correlation exceeds 1.0 at short distances between sites and then converges on the expected values for the current population size at large distances (Figure 2). In effect, continuously shrinking populations result in a situation in which LD at closely linked sites is elevated relative to the expectation for a constant-size population with the observed θc. Such behavior is a reflection of the fact that the current population, which encourages greater buildup of LD, retains old variation reflective of a larger Ne. Populations expanding in size exhibit the opposite pattern—scaled correlations <1.0 at the shortest distances, but again with values converging on the expectations for the current population size at large d. Linear rates of change yield slightly less extreme behavior than exponential growth/decline (Figure 2).
An abrupt but transient transition to a novel population size was also investigated. In this case, both a sudden population expansion or bottleneck results in a reduction in the scaled correlation relative to expectations for the current population size, except in the case of a very distant expansion, which only trivially increases Δ/θ0 (Figure 2). Again, large effects are noticeable only at short physical distances (<1 kb), with the results for larger distances reflecting the expectations for a constant population size Nc with observed θc, even though the latter is not the current equilibrium expectation 4Ncu because of the lingering influence of historical changes in Ne.
These results suggest that estimates of Δ at very short distances can be influenced by past population-size changes, although the particular patterns observed are not necessarily very diagnostic of the historical nature of change unless the temporal pattern is largely continuous. For example, depending on the magnitude of change, transient population expansions and bottlenecks can lead to qualitatively very similar behaviors. On the other hand, even with extreme demographic changes, estimates of Δ/π for pairs of sites separated by >1 kb closely reflect the expectations of an equilibrium population with the current size. Hence, at appropriate physical distances, Equation 8 provides informative insight into the recombinational properties responsible for the pattern of decline of Δ with d.
We now consider the practical issue of obtaining unbiased and minimum-sampling-variance estimates of Δ. Consider an idealized data set in which the raw reads for a single individual have depth of sequencing coverage n. With very high depths of coverage, the genotypic states at each locus can usually be determined essentially unambiguously from the consensus sequence (assuming proper alignment of sites). However, most high-throughput-sequencing methods are highly error prone, and most projects deploy a low enough level of sequence coverage per site that genotypic ascertainment is problematic. To deal with such problems, a maximum-likelihood (ML) procedure has been suggested for acquiring joint estimates of π and Δ that best explain the read data at the full collection of genomic sites, while simultaneously correcting for biases associated with sequencing errors (Lynch 2008).
The general strategy is to condense the data at each pair of sites (say a and b) down to quartets enumerating the numbers of times the four alternative nucleotides have been observed, (naA, naC, naG, naT) and (nbA, nbC, nbG, nbT). Letting n denote the octet for the pair of sites, the likelihood of the observed data can be expressed as (14)where the four terms ℓ1a, ℓ1b, ℓ2a, and ℓ2b describe the site-specific likelihoods of observations conditional on being homozygous (ℓ1a and ℓ1b) or heterozygous (ℓ2a and ℓ2b), defined below. The ML solution is the joint set of values of and (with ) that maximizes the product of the likelihood of the observations over all pairs of sites separated by a specific distance d.
Each of the ℓ terms is a function of the genome-wide nucleotide composition, binomial allelic sampling (in the case of heterozygotes), and the error rate (ε, itself a parameter to be estimated). Expressions for these terms were given previously as Equations 4a and 4b in Lynch (2008), where issues of bias were noted. In the intervening time, we have formulated modified expressions that greatly reduce the bias in the derived parameter estimates, even at low coverage.
For each site, the probability of an observed quartet conditional on the site being homozygous is (15a)where pi is the frequency of nucleotide type i in the genome (and hence the fraction of homozygotes expected to be of type i), b(n − ni; n, ε) is the binomial probability of n − ni errors in n reads, conditional on the individual being a homozygote of type i and given the error rate ε, and the final term defines the probability of the observed distribution of the three hypothetical error types conditional on i being the proposed homozygote type (the indices used are such that i ≠ j ≠ k, with j and k denoting two of the error types and the count for the third simply being defined as n − ni − nj − nk). This final term, not used in Lynch (2008), provides information on the error rate not contained in the binomial term alone.
Conditional on the site being heterozygous, the likelihood of an observed quartet is (15b)Here, 2pipj/S is the genome-wide expected frequency of ij heterozygous genotypes among all possible heterozygotes, with serving to normalize the sum of possible heterozygote frequencies to 1.0; b(n −ni − nj; n, 2ε/3) is the probability of errors to nucleotides other than i and j; p(ni; ni + nj, 0.5) is the probability of sampling the ith nucleotide ni times from the pool of hypothetically nonerroneous reads (ni + nj); and the final term (not included in Lynch 2008) is the probability of the observed distribution of read types interpreted as errors (not i or j).
To determine the power and accuracy of the ML estimator, using the Wright–Fisher model described above, random pairs of two-locus genotypes were generated using biologically realistic values for the effective population size (Ne) and site-specific rates of mutation (u) and recombination (r) to define θ and ρ, which in turn defined the expected values of H0 and H2 based on Equations 4a–4c. Errors in sequences were generated by assuming constant rates per site per read with ε = 0.01, an unfortunately realistic situation. ML estimates of ε, H0, and H2 were obtained by maximizing the log-likelihood of the observed data (i.e., the log of the product of Equation 14 for full sets of sites), and these were then extrapolated to yield (16a) (16b)Some sets of results, with each independent simulation involving 105 to 106 pairs of sites, are compared to their expectations as defined in Figure 3 by Equation 6. On average, the estimates of Δ are independent of the level of coverage, provided the latter is >4×. In some cases, the estimates are upwardly biased, but generally by no more than 25%, and such biases become asymptotically negligible with large numbers of sites.
In practical applications of this method, computational efficiency can be dramatically enhanced by performing a first pass through the data in a univariate analysis to estimate π and ε (Lynch 2008). This then enables the precomputation of values of ℓ1 and ℓ2 for different quartets, which can be repeatedly reused in analyses performed over different distances. The estimates of Δ for each distance of interest can then be rapidly acquired in a second round of analysis from the ML estimates of H0 and H2. These and other refinements are now implemented in the program mlRho v. 2.2 (available at http://guanine.evolbio.mpg.de/mlRho/). Strictly speaking, the overall approach is equivalent to a composite-likelihood procedure, as not all pairs of sites are genealogically independent (although those on different chromosomes are expected to be, so that for species with n chromosomes, no more than a fraction ∼1/n of pairs of pairs will be nonindependent).
To obtain insight into the recombinational features of natural populations, we applied the preceding methods to single-genome sequences of several noninbred vertebrate species, all of which harbor enough sites to avoid issues of bias (Supporting Information, File S1). As discussed in Haubold et al. (2010), to avoid artifacts that might arise from paralogous regions (including repetitive DNAs such as mobile elements), such applications require the initial mapping of the full set of raw sequencing reads to the genome assembly, followed by the removal of any reads that map to more than one genomic location and also masking paralogous regions and any other sites that attract unusually large numbers of reads.
To illustrate some general issues, we initially focus on results for four human genomes (one archaic), after which we summarize further observations across the vertebrate phylogeny. In accordance with the expectations outlined in Figure 1, the decline in Δ with physical distance is strongly nonlinear in all genomes, even on a logarithmic scale, and this nonlinearity remains apparent at all scales of physical distance (Figure 4). Particularly striking are the high values and very rapid decline of Δ over distances <100 bp or so. As noted above, for the standard neutral model, Δ is expected to converge on the standing level of heterozygosity (π) as distance d → 1, but estimates of Δ at d < 100 are often an order of magnitude or more greater than the ML-derived estimates of π (Table 1). Potential reasons for such behavior are discussed below.
The scaled correlations (obtained by dividing the ML estimates of Δ by the respective estimates of π) point to different past demographic histories of the sequenced humans (Figure 4). First, with the exception of the Watson sequence, the profiles are essentially parallel, with the estimates for the Chinese genome being elevated ∼7× relative to that for the African genome and the estimates for the Denisovan genome being elevated a further fourfold. As noted above, such behavior is indicative of an ∼30-fold larger long-term Ne in the recent ancestry of the African genome than in that of the Denisovan genome. The elevation of Denisovan estimates substantially >1.0 may imply a rapid ancestral decline in population size, whereas the estimates for the African genome are in better agreement with population stasis. The data for the Watson genome are qualitatively consistent with a current Ne similar to that for the Chinese genome, but with a dramatic decline from a large ancestral population size, possibly a consequence of an out-of-Africa bottleneck. Finally, we note that the expected inverse relationship between Δ and d is approached only at distances on the order of 105, which is indicative of a substantial amount of noncrossover recombination events even at large distances.
To obtain estimates of the recombination parameters ρ1 = 4Nec, x, and L, we generated least-squares fits to Equation 8 as a function of d, with Equation 13a used to define the recombination rate as a function of distance d. Because of the nonlinear nature of the relationship between Δ and d, some compromises need to be made in deriving such fits, as overreliance on short vs. long distances will bias the fits to the ranges of distances containing the most data. Our analyses involved estimates of Δ at 1-bp intervals up to sites 103 bp apart and pooled estimates for 10-bp windows for distances 103 < d ≤ 104 bp and for 100-bp windows for distances 104 < d ≤ 105 bp. The statistical analyses then sought the joint set of the four parameter estimates (θ, ρ1, x, and L) that minimized the mean-squared deviations of the logarithm of Δ from the model expectations. As the data for very short distances have features that apparently violate the assumptions of the model, these analyses were restricted to pairs of sites with distances in the range of 250 to 105 bp. As can be seen in Figure 4, the resultant fits provide an excellent description of this range of d.
Several tentative conclusions can be drawn from the fitted parameter estimates for these human genomes (Table 1). First, the estimated fraction of human recombination events resulting in crossing over is in the range of x = 0.066 to 0.235, with a mean of 0.171 (SE = 0.038). Second, estimates of ρ1 are in the range of 0.00015 to 0.0015, with a mean of 0.00062 (0.00030). Third, estimates of average conversion-tract lengths are in the range 1389–3208 bp, with a mean of 2100 (390).
Qualitatively similar results were obtained with analyses of five other primate genomes (Figure 5). Again, the estimates of Δ are aberrantly high at short distances between sites, and again the expected asymptotic inverse relationship between Δ and d is not attained even at a distance of 100 kb. When restricted to sites with d > 250 (>1000 in the case of chimpanzee, which exhibits aberrant behavior at smaller distances), the fit of Equations 8 and 13a to the data are again quite good. For this set of species, x has an average value of 0.075 (0.027), whereas the average value of is 4332 (976). Overall, the data on LD for distant sites suggest that recent effective population sizes are highest for Pongo abelii, lower for P. pygmaeus and Gorilla gorilla, still lower for Pan troglodytes, and lowest for Macaca mulatta. Although such estimates may seem discordant with the current day population sizes induced by recent human activity, the past few dozen generations are not expected to be sufficient to greatly influence standing levels of LD relative to the entire demographic history of the species. Results for additional vertebrates, summarized in Table 1, again consistently imply x < 0.25 and on the order of one to several kilobases.
The approach taken above is computationally intensive, involving well over 109 nucleotide sites for most analyses, which translates to orders of magnitude more pairs of sites depending on the sizes of the assembled contigs. Although not all pairs of sites are genealogically independent, the vast majority will be essentially so in vertebrates, which typically harbor a few dozen chromosomes approximately equal in size. Thus, the composite-likelihood approach appears to be justified statistically for individual estimates of Δ. The program mlRho computes the confidence interval around each estimate of Δ in the usual way by searching for positions of decline of two units on the likelihood surface; by solving Equation 8, this interval can then be translated into statistical bounds on the estimates of ρ. From the limited scatter of the data in Figure 4 and Figure 5 alone, it can be seen that with the magnitude of data available for vertebrate genomes, individual data points are quite reliable.
On the other hand, because estimates of Δ at close physical distances do share genealogical history, there is the issue of nonindependence along the distance profile. Such sampling covariance will be a function of not only the finite number of sites involved in an analysis, but also the evolutionary history of a population, and as a consequence no fully general statement can be made about the absolute magnitude of noise in a Δ profile that might be expected. However, to gain some insight into the matter of how variable Δ profiles might be among individuals, we consider two data sets.
First, we present empirical data on the Δ profile derived from 10 genomes of the microcrustacean Daphnia pulex, each derived from a different population (Figure 6). All 10 of these genotypes were extracted from populations known to be randomly mating on an annual basis and in Hardy–Weinberg equilibrium (Tucker et al. 2013). Despite the different population sources (ranging over several hundreds of kilometers), the Δ profiles are highly similar. Deviations become slightly larger at increasing physical distances because of declining sample sizes (and possibly recent differences in population-specific demographic histories).
The average estimate of θ = 4Neu for these D. pulex genotypes is 0.0204 (0.0005), whereas the average estimate of 4Nec is 0.00069 (0.00003), implying that on a per-site basis the mutation rate in this species is about thirty times the recombination rate. The estimated fraction of recombination events resolved as crossovers is x = 0.099 (0.020), and the average conversion tract length is . Daphnia is one of the few metazoan species in which conversion tracts can be directly observed by loss-of-heterozygosity in clonally propagated lines, and direct observations in this species (Omilian et al. 2007; Xu et al. 2011) are entirely consistent with these indirect estimates.
Second, to specifically investigate the variation in Δ profiles that can arise among individuals within a single population, we used the computer program ms (Hudson 2002) to generate independent diploid genotypes with the same input parameters (using the average values noted above for human sequences). As can be seen in Figure 7, when large numbers of sites are available per individual, the Δ profiles are very similar. Thus, there is little question that the variation observed among individuals in Figure 4 is a consequence of true variation in background parameters, and not a simple consequence of stochastic within-population variation.
Although numerous techniques have been proposed for estimating LD from population samples (Stumpf and McVean 2003; Slatkin 2008), most applications of these methods have involved small to moderate numbers of nucleotide sites harboring alleles with intermediate frequencies. In addition, almost all attempts to estimate the population recombination rate ρ1 = 4Nec have assumed a standard neutral model with independently arising mutations, negligible gene conversion, and a linear relationship between the recombination rate and physical distance. The results presented above imply that all of these assumptions are violated in a wide range of species.
Rather than focusing on a large population sample with imprecise allele-frequency estimates, the approach presented here uses the information contained within the two chromosome sets of just a single diploid individual to estimate a function of LD that is directly related to the conventional population-level measure of LD, E(D2). Although estimation of the underlying recombinational parameters using this method does assume that the sampled individual is derived from a randomly mating population, the same assumption applies to extrapolations made with all prior LD methods involving multiple individuals.
Because a nonlinear regression involving four parameters has the capacity to yield a diversity of functions, an obvious question is whether the preceding empirical results are simply a curve-fitting procedure, as opposed to providing insights into the actual physical features of recombination. Several lines of evidence support the latter view. First, our conclusion that the fraction of recombination events resulting in crossing over (x) is generally on the order of 0.2 or less is quite consistent with previous observations. For example, x has been estimated to be in the range 0.25–0.34 in budding yeast Saccharomyces cerevisiae (Malkova et al. 2004; Mancera et al. 2008) and ∼ 0.14 in humans (Frisse et al. 2001; Padhukasahasram and Rannala 2013). The latter is not significantly different from our estimate of 0.17 for humans nor from our average estimate of 0.08 for other primates. For gene-sized regions in Drosophila melanogaster, Langley et al. (2000) concluded that almost all recombination is a consequence of conversion events, and using their data set, Yin et al. (2009) inferred x = 0.08, which is not greatly different than a direct empirical estimate of 0.15 in same species (Hilliker et al. 1994). From a high-resolution cross in Arabidopsis, Yang et al. (2012) inferred x ≃ 0.05, and population sampling yields estimates of x = 0.06 and 0.16 for barley and maize, respectively (Morrell et al. 2006).
Second, our conclusion that mean conversion-tract lengths are typically on the order of one to several kilobases is also consistent with previous observations. Estimates of are on the order of 400 bp in Eubacteria (Santoyo and Romero 2005), in the range 500–4000 bp in S. cerevisiae (Ahn and Livingston 1986; Judd and Petes 1988; McGill et al. 1990), and 400–1400 bp in D. melanogaster (Hilliker et al. 1994; Preston and Engels 1996; Miller et al. 2012). Paigen et al. (2008) have observed conversion-tract lengths in the mouse ranging from 200 to 1200 bp, and Padhukasahasram and Rannala (2013) have inferred tract lengths of 100–900 bp on human chromosomes. Other observations reviewed in Chen et al. (2007) and Rukść et al. (2008) suggest mammalian conversion-tract lengths in the range of a few hundred to two thousand base pairs. Finally, we note that a variety of mechanisms associated with double-strand-break repair, including break-induced replication and synthesis-dependent strand annealing, can lead to loss-of-heterozygosity tract lengths exceeding several thousands of base pairs (Omilian et al. 2007; Llorente et al. 2008; Xu et al. 2011). Such effects will have the same influence on LD as conventional meiotic gene conversions and almost certainly contribute to the mean tract-lengths inferred in our analyses.
Taken together, these observations strongly suggest that the vast majority of meiotic recombination events are resolved as simple gene conversions without crossing over, with average tract lengths of many hundreds to thousands of base pairs. Assuming an approximately geometric distribution of tract lengths (Ahn et al. 1988; Hilliker et al. 1994), which may underestimate the actual range of lengths, this means that conversion-like events up to tens of kilobases commonly occur. Thus, rates of recombination between sites separated by less than many thousand kilobases are generally much greater (up to 1/x greater) than one would expect based on a linear extrapolation from estimates of crossover frequencies derived from genetic maps.
Although future work with a much wider range of taxa will be required to determine whether the known cytological differences in meiosis (Kohl and Sekelsky 2013) translate into quantitatively significant differences in the features of recombination (e.g., x and ) among phylogenetic lineages, the consistency of our results with prior observations suggests a way forward. Aside from its applicability to the sequenced genomes of just single individuals, a substantial advantage of the proposed method is that it provides a genome-wide average view of recombinational features that is not dependent on the kinds of artificial constructs often deployed in laboratory experiments. In addition, analyses can readily be partitioned spatially to estimate regional features of chromosomes. Finally, because the method is agnostic with respect to allele frequencies, it provides a global estimate of LD that should be unaffected by biases that can arise when analyses are restricted to alleles that are able to rise to high frequencies, as has often been the focus of studies using .
While the mechanisms remain unclear, our observations of anomalous patterns of high LD at short distances are consistent with results from previous analyses in Drosophila (Andolfatto and Przeworski 2000), humans (Przeworski and Wall 2001), sorghum (Hamblin et al. 2005), and Arabidopsis (Kim et al. 2007), all of which suggest problems with the standard neutral model at short (typically intragenic) distances. The central issue is that the absolute amount of LD at distances <200 bp or so is far greater than expected under models that assume independent mutational and recombinational events.
There are at least three reasons why genetic events on such small spatial scales should not be treated as independent. First, as pointed out by Schrider et al. (2011), new mutations arise in a significantly clustered manner on spatial scales <100 bp or so. Such effects might result simply from an occasional defective polymerase engaging at origins of replication. Second, Hicks et al. (2010) find in S. cerevisiae an ∼800-fold increase in the mutation rate in newly synthesized strands involved in gene conversions, and more generally, double-strand-break repair appears to be mutagenic (Malkova and Haber 2012). Such effects not only contribute to mutation clustering, but also render mutation and recombination nonindependent events. Third, nonhomologous gene conversion can introduce excess LD at individual sites (Walsh 1988; Mansai and Innan 2010). Although we attempted to avoid the latter effects by restricting our analyses to sequence reads mapping to single sites, it remains possible that more divergent stretches of DNA occasionally engage in illegitimate conversion events.
One additional factor might contribute to anomalously high estimates of LD at very short distances—nonindependence of base-call errors at closely spaced sites. The likelihood functions employed in mlRho assume that errors at different sites arise independently, which seems entirely reasonable for distant sites contained in nonoverlapping sequence reads. However, it is possible that some individual reads will be subject to higher error rates throughout, causing an overestimate of LD on the spatial scale of individual reads. Localized mismapping might generate similar problems. Should they exist, such artifacts would be a concern for any method for estimating LD and are not unique to the method introduced herein.
Finally, we note that a number of studies have sought to exploit information on the relationship between population-level LD and physical distance to infer past population-size changes (e.g., Hayes et al. 2003; Tenesa et al. 2007; Qanbari et al. 2010). With an adherence to the assumption that the recombination rate between sites increases linearly with distance, i.e., rd = cd, the general idea underlying such studies is that the distance-scaled recombination parameter ρd/d (assumed to equal 4Nec) reflects population-size conditions ∼ 1/(2cd) generations in the past. A number of concerns with these methods have been raised (Corbin et al. 2012; Park 2012; MacLeod et al. 2013; Sheehan et al. 2013).
A related approach to inferring the history of population-size changes, which like our method uses sequence information from just single individuals, relies on inferences on the homozygosity tract-length distribution, i.e., the genome-wide distribution of distances between heterozygous sites (Li and Durbin 2011; Harris and Nielsen 2013). It has been argued that this method, which can also be extended to a sample of individuals (Sheehan et al. 2013), allows for the testing of specific demographic hypotheses, although again the assumption of a linear recombination rate–physical distance relationship is adhered to. A potential limitation of such analyses (not shared by the Δ estimator) is the issue of missing data, which break up homozygosity tracts (but simply reduce sample sizes used in Δ estimation).
Our observations raise significant questions about the use of both types of methods to infer the past demographic history of populations. A primary issue is that the relationship between LD and physical distance depends not just on past demographic changes but on the scaling features of recombination. With a nonlinear mapping of the recombination rate and physical distance apparently being typical up to at least 100 kb between sites, and the recombination rate per base pair being on the order of 10× greater for close than distant pairs of sites, it appears that any attempt to convert spatial patterns of LD to temporal patterns in Ne needs to employ some sort of nonlinear timescale transformation.
Without such correction, and adhering to a simple reliance on crossover-based genetic maps, biases in estimates of Ne of up to an order of magnitude may result. In effect, by simply extrapolating from crossover rates, the actual amount of recombination will be increasingly underestimated for more closely spaced sites. This will lead to overestimates in both the distance into the past that might be represented by such sites and the associated Ne, the exact pattern of bias depending on x and . We will leave to a future study an evaluation of the power of Δ profiles to infer past population history, but simply reemphasize that substantially different patterns of demographic change may also sometimes lead to quite similar distance-dependent patterns of LD (Figure 2). A similar point has been made previously with respect to interpretations based on the site-frequency spectrum of polymorphisms (Myers et al. 2008).
We thank A. Thota, S. Michael, and R. Henschel (University Information Technology Services, Indiana University) for computational assistance and M. Hahn for helpful comments. This work was support by National Science Foundation (NSF) grants EF-0827411 and DEB-1257806 and National Institutes of Health grant GM101672 to M.L. The computational analyses were supported in part by NSF grants CNS-0723054 and CNS-0521433, which support IU computational facilities. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by NSF grant OCI-1053575.
Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.166843/-/DC1.
Communicating editor: Y. S. Song
- Received June 3, 2014.
- Accepted June 17, 2014.
- Copyright © 2014 by the Genetics Society of America