| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089
1 Address for correspondence: Program in Molecular and Computational Biology, University of Southern California, 1042 W. 36th Pl., Los Angeles, CA 90089.
E-mail: jeffwall{at}usc.edu
| ABSTRACT |
|---|
|
|
|---|
(= 4Nr), where N is the (diploid) effective population size and r is the recombination rate per generation per base pair (HUDSON 1985; PRITCHARD and PRZEWORSKI 2001). It is well known that, all else being equal, more recombination leads to less LD (e.g., OHTA and KIMURA 1969, 1971); in addition, larger population sizes lead to less genetic drift, thus lower levels of LD. From extant sequence polymorphism data alone, neither N nor r can be estimated individually. Estimates of
can be thought of as multisite measures of LD (PRITCHARD and PRZEWORSKI 2001; WALL 2001). Accurate estimates of
would be useful for the future design of large-scale association studies (e.g., KRUGLYAK 1999; PRITCHARD and PRZEWORSKI 2001; WALL and PRITCHARD 2003), as well as for distinguishing between alternative evolutionary models (WALL 2001; WALL et al. 2002).
The development of methodology for estimating
from patterns of variation has long been an active research area, and there is a range of different methods for estimating
(under a simple model where all recombination events are crossovers and the phase of all double heterozygotes is known). The first methods developed were ad hoc moment estimators (HUDSON 1987; HEY and WAKELEY 1997; WAKELEY 1997). These are quick and easy to use, but do not efficiently utilize the available information. In contrast, full maximum-likelihood techniques are attractive because they make full use of the available haplotype information. Several different implementations have been proposed, which use either importance sampling (GRIFFITHS and MARJORAM 1996; FEARNHEAD and DONNELLY 2001) or Markov chain Monte Carlo (KUHNER et al. 2000; NIELSEN 2000) to estimate likelihoods. However, the state spaces for these methods are extremely complex and have many dimensions, so it is hard to evaluate when a program has run for long enough to obtain an accurate estimate of the likelihood. In practice, these methods are computationally infeasible for medium- and large-size data sets (e.g., data sets where
10, which corresponds to
15 kb in humans; WALL 2000; FEARNHEAD and DONNELLY 2001). Even with the approximately exponential increase in computer processor speeds, full-likelihood methods will be of limited applicability for the foreseeable future.
To date, several compromise approaches have been proposed, which seek to avoid the computational burdens of calculating the exact likelihood of haplotype configurations while keeping the statistical rigor of a likelihood framework. One method involves describing the data with one or more summary statistics and then estimating
using maximum likelihood on the reduced data (WALL 2000). The success of this approach relies on finding summaries of the data that are highly sensitive to the recombination rate. Both the number of distinct haplotypes and the minimum number of inferred recombination events (cf. HUDSON and KAPLAN 1985) seem to work reasonably well (WALL 2000; HUDSON 2001).
Another effective approach exploits the fact that the likelihood of
can be written as the product of conditional distributions of one haplotype given a sample of the other haplotypes (LI and STEPHENS 2003). If x1, x2, ... , xn are a sample of n haplotypes, then
![]() |
that maximizes this product. This method is sensitive to the order of the haplotypes, so the authors estimate their approximate likelihood by averaging over several possible orders (LI and STEPHENS 2003).
A third approach utilizes a composite likelihood (HUDSON 1993, 2001; MCVEAN et al. 2002; PTAK et al. 2004). For a given data set, this method calculates the likelihood (as a function of
) of the haplotype data at each pair of segregating sites and then forms a composite likelihood by multiplying all these pairwise likelihood curves together. The maximum (composite) likelihood
CL is a good point estimate of the recombination rate (HUDSON 2001), but, due to dependency among pairs, the standard asymptotic maximum-likelihood assumptions (for the likelihood ratio) do not apply. Hence, the uncertainty in estimates can be assessed only by simulation. In essence, Hudson's method breaks up the data into small subsets (in this case pairs of sites), calculates the likelihood (as a function of
) for these subsets, and then multiplies the likelihood functions together to form a composite likelihood. The probabilities for all possible two-site haplotype configurations can be calculated in advance, so the analysis of any particular data set is quite fast once the appropriate file has been generated.
The methods described above assume that all recombination events are crossovers. However, this simple model of recombination is not biologically realistic. Current meiotic recombination models allow for two different forms of recombination (e.g., SZOSTAK et al. 1983). Both involve some copying of DNA from one chromosome to another, but they differ in whether the copying is accompanied by the reciprocal exchange of flanking markers (i.e., crossing over) or not. We call these two forms of recombination "crossing over" and "gene conversion," respectively. There is ample experimental evidence for gene conversion in higher eukaryotes (e.g., HILLIKER et al. 1994; ZANGENBERG et al. 1995; GUILLON and DE MASSY 2002; JEFFREYS and MAY 2004), and some evidence that the empirical patterns of LD also reflect the effects of gene conversion (LANGLEY et al. 2000; ARDLIE et al. 2001; FRISSE et al. 2001; PRZEWORSKI and WALL 2001; ANDOLFATTO and WALL 2003). Unfortunately, direct (laboratory-based) estimation of gene conversion rates is technically challenging and extremely laborious, even for a single locus.
Recently, theoretical models have been developed that incorporate both crossing over and gene conversion (ANDOLFATTO and NORDBORG 1998; WIUF and HEIN 2000). Under certain assumptions, it is possible to generalize the composite-likelihood approach of HUDSON (2001) to jointly estimate crossing over and gene conversion rates from polymorphism data (FRISSE et al. 2001). Given specific crossing-over and gene conversion rates, as well as the distribution of gene conversion tract lengths, one can calculate the function that relates physical to genetic distance. From this, it is straightforward to calculate likelihoods for pairs of sites and to construct a composite likelihood.
In practice, it is very difficult to accurately estimate gene conversion rates from a single genetic region. The effects of crossing over vs. gene conversion on patterns of LD are quite subtle and are often overwhelmed by the inherent stochasticity of the evolutionary process. Generally, accurate estimation of crossing-over and gene conversion rates requires polymorphism data from multiple evolutionarily independent regions. FRISSE et al. (2001) assume that recombination rates do not vary across loci and simply multiply the composite likelihoods at each locus to form a composite likelihood for a collection of loci. An alternative approach was proposed by PTAK et al. (2004); they assume that the ratio of gene conversion events to crossing-over events is constant across loci, but that the crossing-over rate varies across loci. Ptak and colleagues argue that crossing-over rates generally vary across loci and that local rates are not necessarily accurately estimated by a comparison of physical and genetic maps.
In this article, we present a new method for estimating recombination rates. We take a composite-likelihood approach similar to HUDSON (2001), but we consider triplets of sites instead of pairs of sites. This should allow us to discriminate more efficiently between the effects of crossing over and of gene conversion. Due to memory constraints, we can calculate only three-site likelihoods for small sample sizes. For larger sample sizes, we consider random subsets of the sample and form a composite likelihood by multiplying together the (composite) likelihoods of many subsets (see METHODS). Our approach can be used to estimate crossing-over rates only (comparable to WALL 2000, HUDSON 2001, and LI and STEPHENS 2003) or to jointly estimate crossing-over rates and gene conversion rates (comparable to FRISSE et al. 2001 and PTAK et al. 2004). We run simulations to compare the relative accuracies of our and previous methods.
| METHODS |
|---|
|
|
|---|
(= 4Nµ, where µ is the mutation rate per site per generation) is near 0 and condition on there being exactly two alleles at each site. Our model of recombination assumes that gene conversion and crossing over are mechanistically independent and that gene conversion tract lengths follow a geometric distribution (HILLIKER et al. 1994; FRISSE et al. 2001). The model has three parameters: a scaled crossing-over rate
co (= 4Nrco, where rco is the crossing-over rate per base pair per generation), a scaled gene conversion rate
gc (= 4Nrgc, where rgc is the gene conversion initiation rate per base pair per generation), and the average gene conversion tract length (t, in base pairs). As with previous studies (e.g., FRISSE et al. 2001; PTAK et al. 2004), we replace
gc with f =
gc/
co so that f is the ratio of gene conversion initiation events (per base pair) to crossing-over events (per base pair). Simulation under this model is a straightforward generalization of the standard coalescent (WIUF and HEIN 2000; PRZEWORSKI and WALL 2001). We assume no variation in recombination rates across the sequence, although this can be relaxed. There are two different ways of calculating two-site sample probabilities. One method utilizes a recursion described by GOLDING (1984) and ETHIER and GRIFFITHS (1990), where exact probabilities can be obtained by solving a large (but sparse) system of linear equations. The other method involves generating a large number of random genealogies using standard coalescent machinery (e.g., HUDSON 1983) and estimating the probabilities of different sample configurations separately for each replicate. The latter method proves to be more practical for larger sample sizes (HUDSON 2001) and we adopt the same Monte Carlo approach here.
Suppose we have three (segregating) sites, s1, s2, and s3 (in that order along a chromosome), and assume N, n,
co, f, and t are fixed, where
co refers to the scaled crossing-over rate per base pair and the distances between s1, s2, and s3 are fixed. We wish to calculate the probability of a particular three-site configuration n, which we write as Pr(n|
,
co, f, t). We run coalescent simulations with gene conversion and construct three-locus genealogies g by considering the trees at the positions corresponding to the locations of s1, s2, and s3. At site s1, the genealogy is a tip-labeled tree with 2n 2 branches. Arbitrarily label these 1, 2, ... , 2n 2 and denote the length (in units of 4N generations) of the ith branch as ai. (Note that for notational convenience our scaling differs by a factor of 2 from the standard scaling.) The branch lengths at s2 and s3 are defined analogously and are labeled {bj} and {ck}. We assume that the number of mutations that occur on a branch of length x is Poisson, with mean
x, and set
, and
. Define the indicator variable I(g, n, i, j, k) to be 1 if mutations on the ith, jth, and kth branches of the s1, s2, and s3 trees produce the haplotype configuration n, and 0 otherwise. It is easy to determine the value of I( ) from the genealogy g.
Now, suppose I(g, n, i, j, k) = 1. Then, the configuration n would arise if a mutation happened on the ith, jth, and kth branches of the s1, s2, and s3 trees, but no mutations occurred on any of the other branches. The probability of this is
![]() |
,
co, f, t), we sum over all possible trios of branches (from the three trees) and take the expectation over random genealogies (assuming the standard equilibrium neutral model):
![]() |
terms. As
0, we consider the scaled likelihood function
![]() |
![]() | (1) |
Our main interest is in the probabilities of sample configurations conditional on there being two alleles at each site. In the limit as
0, the conditional probability of the configuration n is
![]() | (2) |
![]() |
1(h) is the sum of the branch lengths at site s1 for the hth simulation, etc. Note that the conditional probability (2) does not depend on
and can be estimated directly from simulations. Informally, we refer to these conditional probabilities as the "likelihoods" of three-site sample configurations; we write them as Pr(n;
co, f, t).
As mentioned above, the likelihoods of all possible configurations {n} can be estimated simultaneously. In addition, likelihoods (of all possible configurations) for multiple recombination rates can be estimated at the same time as well. Suppose R1 = {
1,
2, ... ,
m} is a set of different crossing-over values (not scaled per base pair and in ascending order). If d1 (respectively, d2) is the distance in base pairs between s1 and s2 (respectively, s2 and s3), then likelihoods for all d1
co, d2
co
R1 can be estimated simultaneously from the same underlying coalescent simulations. For each replicate, we examine the trees at a collection of sites S such that for every d1
co, d2
co
R1, there exist s1, s2, s3
S such that the crossing-over rate between s1 and s2 is d1
co and the crossing-over rate between s2 and s3 is d2
co. Although the common genealogy will introduce a slight degree of dependency between the different triplets {si}, this drawback is greatly compensated for by the large gains in computational effort of not having to generate separate genealogies for each triplet. Note that while multiple crossing-over rates can be considered, the other recombination parameters are constrained. In our implementation, both t
co and f must be constant.
Estimating likelihoods:
For the special case of f = 0, we ran simulations with n = 20 and 4 x 107 replicates, over a grid of R1 values ranging from 0.0 to 160 (17 total values). For joint estimates of crossing-over and gene conversion rates, we ran simulations with n = 10, over a grid of R1, t
co, and f values. We take R1 = {0.0, 0.15, 0.25, 0.4, 0.63, 1.0, 1.6, 2.5, 4.0, 6.3, 10.0, 16.0, 25.0, 40.0, 63.0, 100.0, 160.0}, let t
co range from 0.015625 to 4.0 (9 total values), and consider f values from 0.0 to 16.0 (13 total values). A total of 5 x 106 replicates were run for each parameter combination. (The number of replicates was chosen to be high enough so that all likelihoods, even for unlikely configurations, could be estimated reasonably accurately.) These simulations took more than a year's worth of computing time on a pair of Xeon 2.4 GHz processors. We emphasize that the estimation of three-site likelihoods needs to be done only once and that subsequent analyses of particular data sets are relatively fast. It would be preferable to consider a larger value of n, but the sizes of the likelihood files impose memory constraints. The parameter values were chosen in part so that analyses could be performed on a desktop computer with 1 GB of RAM.
Estimating recombination rates for a single locus:
Suppose we have polymorphism data with n
10 from a single locus, D1. Label the set of segregating sites as S = {s1, s2, ... , su}. We calculate a composite likelihood by multiplying together the likelihoods of all subsets of three sites,
![]() |
co, f, t). Call this
W04.
If n > 10 we adopt a subsampling approach. We choose a large number of subsets of 10 sequences. Denote V = {v1, v2, ... , vz} as the collection of subsets. For each subset we consider all triplets of segregating sites. Then, we calculate a composite likelihood by multiplying together the likelihoods of all of the triplets in each of the subsets,
![]() |
co, f, t).
For each triplet, we estimate its likelihood over a grid of recombination values (
co, f, and t). To estimate these likelihoods, we use linear interpolation over the calculated likelihoods (described above) whose parameter values (e.g., d1
co, d2
co, f, and t) are closest to the actual ones. For our simulation study (described below), we fix t at a plausible value (either 125 or 500 bp; see HILLIKER et al. 1994; FRISSE et al. 2001; JEFFREYS and MAY 2004) and consider only a grid of
co and f values. We decided to calculate likelihoods over a grid of values, rather than use various iterative procedures (cf. PRESS et al. 1992), because we did not know a priori whether the composite-likelihood surfaces would have a single local maximum.
Estimating recombination rates for multiple loci:
Suppose we have multiple unlinked loci and we wish to jointly estimate recombination parameters from these data. If the relevant parameters are assumed to be constant across loci, then we can simply multiply the composite-likelihood functions at each individual locus to form a composite likelihood for the combined data. If D = {D1, D2, ... , Dm} is a collection of loci, we set
![]() | (3) |
co, f, t) as a point estimate. Call these
T1,
T1, and
T1, and denote the maximum composite likelihood as MCL(D;
co, f, t). We call this method the "joint" method and note that it is analogous to the approach of FRISSE et al. (2001).
We also implement an alternative approach that is analogous to PTAK et al. (2004). We form a different composite likelihood assuming that f and t are fixed across loci (e.g., at f0 and t0), but allowing
co to vary:
![]() | (4) |
T2. Following PTAK et al. (2004), we call this method of estimating f the "profile" method, due to its similarity to profile likelihood.
Extensions:
We note in passing that it should be straightforward to extend our methods to handle diploid (i.e., unphased) data, data without ancestral-derived information, ascertainment bias, and/or missing data. For example, given unphased data from three sites, we would first enumerate all possible (haploid) haplotype configurations that are consistent with the diploid data, and then calculate the probability that each haplotype configuration would produce the observed diploid configuration (assuming random assortment of gametes). The likelihood of the diploid configuration M is then
![]() |
Simulations:
To test the accuracy of our method for estimating crossing-over rates (only), we run coalescent simulations (cf. HUDSON 1983) with known recombination rates and then tabulate the distributions of estimated recombination rates. These simulations have n = 50,
= 0.001/bp,
= 0.001/bp or 0.004/bp, and 2, 4, 6, ... , 16 kb of simulated sequence. A total of 1000 replicates were run for each parameter combination. For each replicate, we consider 100 subsets of size 20 and all triplets of three segregating sites from each of these subsets. These simulations are directly comparable to those in Figures 7 and 8 of HUDSON (2001) and Figure 5 of LI and STEPHENS (2003).
We take a similar approach for evaluating how accurately our method can estimate f. We simulate 50,000 loci of length 5 kb, each with n = 50 and
=
co = 0.001/bp. One set of simulations had f = 1 and a mean conversion tract length of t = 500 bp while the other had f = 4 and t = 125 bp. (Gene conversion tract lengths followed a geometric distribution.) For each locus, we calculate composite likelihoods (using 150 subsets of 10 sequences) over a grid of
co {0, 0.0002, 0.0004, ... , 0.004} and f ({0, 0.35, 0.5, 0.7, 1, 1.4, 2, 2.8, 4} when f = 1 and {0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16} when f = 4) values. We consider each locus individually, as well as groups of 2, 5, 10, 20, 50, 100, 200, and 500 loci. (Due to computational constraints, the same 50,000 loci are used for each analysis.) For each collection of loci, we calculate
T1,
T1, and
T2, using the two methods described above, in (3) and (4). For comparison, we also estimated f using the methods of FRISSE et al. (2001) and PTAK et al. (2004). We call these pairwise composite-likelihood estimators
P1 and
P2, respectively. To compare the accuracies of the different estimators of f we log transform all values and calculate the root mean square error [i.e., the square root of the average squared difference between log (estimated f) and log (actual f)]. Estimates of f that are 0 are changed to 0.25 times the true value (i.e., 0.25 or 1 depending on the simulations) to ensure that log
is well defined.
Following PTAK et al. (2004), we also examine the effects of recombination rate variation (among loci) on estimates of f. We generate 50,000 loci of length 5 kb with n = 50,
= 0.001/bp,
co = gamma(10, 104) and either f = 1 with t = 500 bp or f = 4 with t = 125 bp. Tract lengths are geometrically distributed and the loci are analyzed in the same way as described above. The mean value of
co in these simulations is 0.001/bp (as before), and the gamma distribution parameters were chosen to approximately match the simulations described in PTAK et al. (2004).
We are also interested in determining how often one could reject a model of crossing over only (i.e., f = 0). For each collection of loci with
=
co = 0.001/bp and f = 1 or 4, we evaluate
![]() | (5) |
= 0.001/bp,
co = 0.00168/bp (or 0.0018/bp), and f = 0. The value of
co was chosen to produce the same mean
co estimate as the
co = 0.001/bp, f = 1 (or f = 4) simulations when we set f = 0. We generate 50,000 null loci and evaluate loci either singly or in groups of 2, 5, 10, 20, 50, 100, 200, or 500 loci as above. For each one, we take the 95th percentile of the composite-likelihood-ratio distribution as the critical value when analyzing the simulations with gene conversion. The simulations described in this section took several months of computing time on a pair of 2.4 GHz Xeon processors. All programs used in the analyses were written in C and are available upon request.
| RESULTS |
|---|
|
|
|---|
and then tabulated the distributions of estimated values. Figure 1 shows the 10th and 90th percentiles of the distributions of
W04, with analogous results from HUDSON's (2001)
CL shown for comparison. The distributions of the two estimators are quite similar, although the percentiles of the
W04 distributions tend to be slightly lower than the corresponding ones for
CL. The root mean square error for
W04 is generally slightly smaller than that for
CL, but both are larger than the root mean square error of the distribution of
PACB (LI and STEPHENS 2003, Figure 5). A priori we expected that an estimate based on the likelihoods of triplets of sites would be more accurate than an estimate based on the likelihoods of pairs of sites, but that the difference would be mitigated by having to consider subsets of 20 sequences when calculating
W04 but using all sequences when calculating
CL. If we compare the two estimators on simulated data with n = 20, then the difference in accuracy between
W04 and
CL is more substantial (results not shown). We note that it should be possible to modify the methods of MCVEAN et al. (2002) to consider triplets of sites and that such a method may outperform
W04.
|
T1) and profile (
T2) methods using triplets of sites. For comparison, we also calculated
P1 and
P2, the pairwise versions of the joint and profile methods (cf. FRISSE et al. 2001; PTAK et al. 2004). Roughly speaking, the joint method assumes that
(per base pair) is constant across loci, while the profile method does not. (For both
T1 and
P1 we jointly estimate
and
for each collection of loci but present results only for
.) Figures 2 and 3 show the distribution of estimated f values for the four different methods: Figure 2 uses 1, 5, 20, and 100 independent loci, respectively, with f = 1 and t = 500 bp while Figure 3 uses 1, 5, 20, and 100 independent loci, respectively, with f = 4 and t = 125 bp. It is quite striking how poorly all four methods perform when there are few loci. With 1 or 5 loci, none of the methods perform much better than random guessing (among the 9 or 10 possible values of f), while even with 20 loci there is an
50% probability of being off by at least a factor of two. The only scenario where the actual value of f (f = 1 in Figure 2 and f = 4 in Figure 3) was estimated at appreciable frequency (e.g., more than half of all replicates) was with
T1 for sets of 100 loci. We note in passing that the distributions for
T1 and
P1 are substantially more ragged than the corresponding distributions for the pairwise estimators. Presumably this reflects in part the small sample sizes of the subsets and the roughness of the grid over which likelihoods were calculated.
|
|
T1 has lower RMS error than the other methods. The true value of
is constant across loci in the simulations, so we might expect the joint methods (which explicitly assume this rate constancy) to perform better than the profile methods. While this is clearly true for the triplet composite-likelihood methods, the two pairwise composite likelihoods perform comparably. As did PTAK et al. (2004), we find that the profile methods are biased downward; in contrast,
P1 is biased upward slightly, while
T1 is nearly unbiased.
|
4 in both humans (ZANGENBERG et al. 1995; JEFFREYS and MAY 2004) and Drosophila melanogaster (HILLIKER and CHOVNICK 1981).
Thus far we have concentrated on the problem of estimating f. Ideally, we would want to jointly estimate
and f. In general this is difficult to do, since high values of
and low values of f are hard to distinguish from lower values of
and higher values of f (PTAK et al. 2004; results not shown). We tabulate for both joint methods the probability that both
and f can be estimated "reasonably" well: the probability that both are closer than a factor of 2 away from their true values (i.e., 0.5 <
< 2 or 2 <
< 8 and 0.0005 <
< 0.002). Figure 5 plots this as a function of the number of loci. As with the earlier results, the three-site joint method is substantially better than the pairwise joint method, and estimation is more accurate for the f = 4 simulations compared with the f = 1 simulations. Both methods perform poorly overall, and
1020 loci are needed to have a 50% chance of estimating both
and f reasonably well. In contrast, under the simpler crossing-over-only model of recombination, only a single locus is necessary for estimating
to the same level of accuracy.
|
T1 is more accurate than the other methods, and the f = 4 simulations have lower RMS error than the f = 1 simulations. Also, as before, the only scenario where the true value of f was estimated at appreciable frequency (e.g., >50% of replicates) was using
T1 and sets of 100 loci.
|
co is constant across loci. However, this does not turn out to be the case.
P1 and
P2 show approximately equal bias and RMS error (on log-transformed data), with
P2 biased downward and
P1 biased upward. For the three-site likelihoods,
T1 has a smaller bias and RMS error than
T2.
We also examined how well the two joint methods perform at jointly estimating
co and f. The probability that both parameters can be estimated reasonably well (see previous section for details) is slightly lower for loci with (crossing-over) rate variation than in Figure 5 for loci with no rate variation (results not shown).
Rejecting f = 0:
The gene conversion results presented above examine the question of how well we can estimate
co and f. Here we address a simpler question and look at how often we can reject a model of no gene conversion (i.e., f = 0). To do this, we use the equivalent of a likelihood-ratio statistic, with critical values determined from (null) simulations without gene conversion (see METHODS for details). We tabulate the proportion of simulations (with gene conversion) where f = 0 can be rejected at the 5% level. Figure 7 shows these proportions as a function of the number of independent loci considered, for the joint and profile methods, using either pairs or triplets of sites. Figure 7A shows the results when f = 1 and t = 500 bp while Figure 7B shows the results when f = 4 and t = 125 bp. The number of loci needed to have appreciable power to reject the no gene conversion model is substantial; in Figure 7A >10 loci are necessary to have 50% power (5 or more loci in Figure 7B), while
50 loci are required to achieve >80% power (
10 loci in Figure 7B). Since estimators of f are more precise for the f = 4 simulations (see Figures 4 and 5), it is not surprising that the probability of rejecting f = 0 is higher in Figure 7B than in Figure 7A. However, it is somewhat surprising that for
20 loci the pairwise methods have greater power to reject the null model than the three-site methods, even though
T1 is the most accurate and most precise of the four estimators (see Figures 26).
|
100 loci. We stringently checked our code and reran our simulations to verify that this odd result was not an error. The dip in power is due to a sharp increase in the critical values for sets of >50 loci. Slightly <1 locus out of every 1000 simulated with no gene conversion is extremely unusual, with a composite likelihood (based on pairs of sites) that strongly supports f > 0. Equivalently, the distribution of (5) for each locus taken individually is highly skewed, with a long tail to the right (results not shown). These unusual loci in the tail dominate when calculating (5); since >5% of groups of 100 loci contain one of these unusual loci, the estimated critical value is quite large, and the power is low. Similarly, since <5% of groups of 50 loci contain one of these outliers, the estimated critical value is low (and thus the power is high). This unexpected result illustrates the hidden dangers of using composite likelihoods instead of real likelihoods.
An example:
As an illustration, we estimate f using the Arabidopsis thaliana sequence polymorphism data described in HAUBOLD et al. (2002). The authors sequenced 14 short regions spread out over 170 kb of sequence in a worldwide sample of 39 accessions. Using the ad hoc method of WIEHE et al. (2000), they estimated f = 9 (HAUBOLD et al. 2002, Figure 6). Although the data are limited, we wanted to see whether our methods also estimate extremely high values of f. We applied both the two-site and three-site composite-likelihood methods to the data and jointly estimated
co and f (assuming a mean conversion tract length of 300 bp, as in HAUBOLD et al. 2002). Using two-site likelihoods, we estimate
P1 = 1.7 x 104 and
P1 = 14.8, while using three-site likelihoods (and 300 subsets of 10 sequences) we obtain
T1 = 1.8 x 104 and
T1 = 16. Note that
T1 might be an underestimate; due to computational constraints, we could not estimate (three-site composite) likelihoods for values of f > 16. These estimates are highly concordant, and they provide additional evidence that gene conversion rates may be extremely high in A. thaliana. More precise estimates will require extensive additional sequence polymorphism data.
| DISCUSSION |
|---|
|
|
|---|
Despite the improvement over previous methods, our three-site composite-likelihood approach does not perform very well on an absolute scale. For the parameter values considered here, accurate estimates of crossing-over and gene conversion rates require data from
10 independent loci, whereas the same level of accuracy in estimating crossing-over rates alone (assuming no gene conversion) requires just a single locus. While the poor performance in Figures 26 reflects in part the limitations of our methods, the main difference (compared with the results in Figure 1) lies in the added difficulty in estimating multiple recombination rates instead of just a single one. We fear that future methods will yield only incremental improvements because the information content of sequence polymorphism data is so poor for the question at hand.
Since accurate estimates of gene conversion (and crossing-over) rates require polymorphism data from multiple, evolutionarily independent regions, we must make explicit assumptions regarding the extent to which recombination rates vary across different regions. The most restrictive assumption, namely that rates are constant across loci (i.e., the joint method), may not be generally appropriate for analyzing real data (but see below). If the goal is to estimate f, then variation in crossing-over rates can be accounted for mostly by using the profile method. However, both the joint and profile methods assume that f is constant across loci. Essentially nothing is known about whether f varies across the genomes of higher eukaryotes, but several authors have suggested that f may be higher in regions of reduced crossing over in D. melanogaster (LANGLEY et al. 2000; ANDOLFATTO and WALL 2003). In principle, we could test this hypothesis by jointly estimating f and
co for individual regions and determining whether estimates of f vary in a systematic way. We caution though that the results of such an analysis would be difficult to interpret, since the rate estimates for any particular region are likely to be inaccurate. Nonetheless, it may be possible to determine whether estimates of f are higher at some set of loci compared to others, even if the actual values cannot be estimated accurately.
Another approach for dealing with rate heterogeneity across loci is to apply the joint or profile methods to the data anyway, viewing the estimates obtained as "averages" across many loci. Some systematic biases may be associated with such an approach, such as the upward bias to
P1 when the actual data have rate variation. In our simulations, the joint methods show equal or less bias than the profile methods regardless of whether or not there is variation in recombination rates in the underlying data (Figures 24; results not shown). We also ran simulations with more extreme rate variation [gamma(4, 0.00025)-distributed rates for
co] but still found that
P1 performed as well as or better than
P2 (results not shown). In contrast, Ptak and colleagues found that
P2 was less biased than
P1 when the underlying data had variation in recombination rates (PTAK et al. 2004). The reason for this discrepancy between the two studies is unclear, but the performance of different estimators (of f) may be sensitive to the particular parameter values used in the simulations (in particular, to the precise extent of recombination rate variation). It should also be noted that nothing is yet known about the effect of variation in f (across loci) on the various methods for estimating f discussed here. Further simulation studies will address this question.
Finally, we emphasize that our simulation study assumed we have perfect data quality. Real data have sequencing or genotyping errors, and these errors can artificially inflate estimated gene conversion rates (PTAK et al. 2004). Recurrent mutations (i.e., multiple mutations at the same nucleotide site) can also have the same effect. Gene conversion can be thought of as a process where small stretches of sequence (i.e., conversion tracts) look different because they were inherited from a different ancestral chromosome. Similarly, sequencing errors and recurrent mutations can be thought of as processes where small stretches of sequence (i.e., single SNPs) look different because of experimental error or mutation. The extent to which these processes "look" different (on the basis of patterns of variation and LD) depends in part on the density of single-nucleotide polymorphisms (SNPs) and the mean gene conversion tract length. In species where the typical gene conversion tract might contain no more than one SNP, we expect that it will be extremely difficult to distinguish between the effects of sequencing errors, recurrent mutation, and gene conversion on the basis of patterns of LD alone. However, in most cases sequencing error rates can be estimated directly, given sufficient time and motivation. Under simple models of the error process, recombination rates could then be estimated assuming a fixed rate of sequencing errors. In addition, in species with higher SNP densities and/or longer gene conversion tract lengths, it may be possible to distinguish between the effects of gene conversion and sequencing errors (or even jointly estimate both rates) using a generalization of the methods described in this article.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
ANDOLFATTO, P., and M. NORDBORG, 1998 The effect of gene conversion on intralocus associations. Genetics 148: 13971399.
ANDOLFATTO, P., and J. D. WALL, 2003 Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster. Genetics 165: 12891305.
ARDLIE, K., S. N. LIU-CORDERO, M. A. EBERLE, M. DALY, J. BARRETT et al., 2001 Lower-than-expected linkage disequilibrium between tightly linked markers in humans suggests a role for gene conversion. Am. J. Hum. Genet. 69: 582589.[CrossRef][Medline]
CHARLESWORTH, B., 1996 Background selection and patterns of genetic diversity in Drosophila melanogaster. Genet. Res. 68: 131149.[Medline]
ETHIER, S. N., and R. C. GRIFFITHS, 1990 On the two-locus sampling distribution. J. Math. Biol. 29: 131159.[CrossRef]
FEARNHEAD, P., and P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159: 12991318.
FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, J. DONFACK et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69: 831843.[CrossRef][Medline]
GOLDING, G. B., 1984 The sampling distribution of linkage disequilibrium. Genetics 108: 257274.
GRIFFITHS, R. C., and P. MARJORAM, 1996 Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3: 479502.[Medline]
GUILLON, H., and B. DE MASSY, 2002 An initiation site for meiotic crossing-over and gene conversion in the mouse. Nat. Genet. 32: 296299.[CrossRef][Medline]
HAUBOLD, B., J. KROYMANN, A. RATZKA, T. MITCHELL-OLDS and T. WIEHE, 2002 Recombination and gene conversion in a 170-kb genomic region of Arabidopsis thaliana. Genetics 161: 12691278.
HEY, J., and J. WAKELEY, 1997 A coalescent estimator of the population recombination rate. Genetics 145: 833846.[Abstract]
HILLIKER, A. J., and A. CHOVNICK, 1981 Further observations of intragenic recombination in Drosophila melanogaster. Genet. Res. 38: 281296.[Medline]
HILLIKER, A. J., G. HARAUZ, A. G. REAUME, M. GRAY, S. H. CLARK et al., 1994 Meiotic gene conversion tract length distribution within the rosy locus of Drosophila melanogaster. Genetics 137: 10191026.[Abstract]
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183201.[CrossRef][Medline]
HUDSON, R. R., 1985 The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics 109: 611631.
HUDSON, R. R., 1987 Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50: 245250.[Medline]
HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 2336 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Sinauer Associates, Sunderland, MA.
HUDSON, R. R., 2001 Two-locus sampling distributions and their application. Genetics 159: 18051817.
HUDSON, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147164.
JEFFREYS, A. J., and C. A. MAY, 2004 Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36: 151156.[CrossRef][Medline]
KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241247.[CrossRef][Medline]
KRUGLYAK, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22: 139144.[CrossRef][Medline]
KUHNER, M. K., J. YAMATO and J. FELSENSTEIN, 2000 Maximum likelihood estimation of recombination rates from population data. Genetics 156: 13931401.
LANGLEY, C. H., B. P. LAZZARO, W. PHILLIPS, E. HEIKKINEN and J. M. BRAVERMAN, 2000 Linkage disequilibrium and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics 156: 18371852.
LI, N., and M. STEPHENS, 2003 Modeling linkage disequilibrium, and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 22132233.
MCVEAN, G., P. AWADALLA and P. FEARNHEAD, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 12311241.
NIELSEN, R., 2000 Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931942.
OHTA, T., and M. KIMURA, 1969 Linkage disequilibrium due to random genetic drift. Genet. Res. 13: 4755.
OHTA, T., and M. KIMURA, 1971 Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68: 571580.
PRESS, W. H., S. A. TEUKOLSKY, W. T. VETTERLING and B. P. FLANNERY, 1992 Numerical Recipes in C. Cambridge University Press, Cambridge, UK.
PRITCHARD, J. K., and M. PRZEWORSKI, 2001 Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69: 114.[CrossRef][Medline]
PRZEWORSKI, M., and J. D. WALL, 2001 Why is there so little intragenic linkage disequilibrium in humans? Genet. Res. 77: 143151.[CrossRef][Medline]
PTAK, S. E., K. VOELPEL and M. PRZEWORSKI, 2004 Insights into recombination from patterns of linkage disequilibrium in humans. Genetics 167: 387397.
SZOSTAK, J. W., T. L. ORR-WEAVER, R. J. ROTHSTEIN and F. W. STAHL, 1983 The double-strand-break repair model for recombination. Cell 33: 2535.[CrossRef][Medline]
WAKELEY, J., 1997 Using the variance of pairwise differences to estimate the recombination rate. Genet. Res. 69: 4548.[CrossRef][Medline]
WALL, J. D., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17: 156163.
WALL, J. D., 2001 Insights from linked single nucleotide polymorphisms: what we can learn from linkage disequilibrium. Curr. Opin. Genet. Dev. 11: 647651.[CrossRef][Medline]
WALL, J. D., and J. K. PRITCHARD, 2003 Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4: 587597.[CrossRef][Medline]
WALL, J. D., P. ANDOLFATTO and M. PRZEWORSKI, 2002 Testing models of selection and demography in Drosophila simulans. Genetics 162: 203216.
WEBER, J. L., 2002 The Iceland map. Nat. Genet. 31: 225226.[Medline]
WIUF, C., and J. HEIN, 2000 The coalescent with gene conversion. Genetics 155: 451462.
YU, A., C. ZHAO, Y. FAN, W. JANG, A. J. MUNGALL et al., 2001 Comparison of human genetic and sequence-based physical maps. Nature 409: 951953.[CrossRef][Medline]
ZANGENBERG, G., M. M. HUANG, N. ARNHEIM and H. ERLICH, 1995 New HLA-DPB1 alleles generated by interallelic gene conversion detected by analysis of sperm. Nat. Genet. 10: 407414.[CrossRef][Medline]
This article has been cited by other articles: