- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- A corrigendum has been published
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Li, N.
- Articles by Stephens, M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Li, N.
- Articles by Stephens, M.
Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data
Na Lia and Matthew Stephensba Department of Biostatistics, University of Washington, Seattle, Washington 98195
b Department of Statistics, University of Washington, Seattle, Washington 98195
Corresponding author: Matthew Stephens, University of Washington, Padelford B-313, Stevens Way, Seattle, WA 98195., stephens{at}stat.washington.edu (E-mail)
Communicating editor: S. TAVARE
| ABSTRACT |
|---|
We introduce a new statistical model for patterns of linkage disequilibrium (LD) among multiple SNPs in a population sample. The model overcomes limitations of existing approaches to understanding, summarizing, and interpreting LD by (i) relating patterns of LD directly to the underlying recombination process; (ii) considering all loci simultaneously, rather than pairwise; (iii) avoiding the assumption that LD necessarily has a "block-like" structure; and (iv) being computationally tractable for huge genomic regions (up to complete chromosomes). We examine in detail one natural application of the model: estimation of underlying recombination rates from population data. Using simulation, we show that in the case where recombination is assumed constant across the region of interest, recombination rate estimates based on our model are competitive with the very best of current available methods. More importantly, we demonstrate, on real and simulated data, the potential of the model to help identify and quantify fine-scale variation in recombination rate from population data. We also outline how the model could be useful in other contexts, such as in the development of more efficient haplotype-based methods for LD mapping.
LINKAGE disequilibrium (LD) is the nonindependence, at a population level, of the alleles carried at different positions in the genome. The patterns of LD observed in natural populations are the result of a complex interplay between genetic factors and the population's demographic history. In particular, recombination plays a key role in shaping patterns of LD in a population. When a recombination occurs between two loci, it tends to reduce the dependence between the alleles carried at those loci and thus reduce LD. Although recombination events in a single meiosis are relatively rare over small regions, the large total number of meioses that occurs each generation in a population has a substantial cumulative effect on patterns of LD, and so molecular data from population samples contain valuable information on fine-scale variations in recombination rate.
Despite the undoubted importance of understanding patterns of LD across the genome, most obviously because of its potential impact on the design and analysis of studies to map disease genes in humans, most current methods for interpreting and analyzing patterns of LD suffer from at least one of the following limitations:
- They are based on computing some measure of LD defined only for pairs of sites, rather than considering all sites simultaneously.
- They assume a "block-like" structure for patterns of LD, which may not be appropriate at all loci.
- They do not directly relate patterns of LD to biological mechanisms of interest, such as the underlying recombination rate.
As an example of the limitations of current methods, consider Fig 1, which graphically shows pairwise LD measures for six simulated data sets, simulated under various models for heterogeneity in the underlying recombination rate. The reader is invited to speculate on what the underlying models are in each casethe answer appears in the Fig 8 legend. In each of the six figures one can identify by eye, or by some quantitative criteria (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
|
|
|
|
|
|
|
|
In this article we introduce a statistical model for LD that overcomes the limitations of existing approaches by relating genetic variation in a population sample to the underlying recombination rate. We examine in detail one natural application of the model: estimation of underlying recombination rates from population data. Using simulation, we show that in the case in which recombination is assumed constant across the region of interest, recombination rate estimates based on our model are competitive with the very best of current available methods. More importantly, we demonstrate, on real and simulated data, the potential of the model to help identify and quantify fine-scale variation in recombination rate (including "recombination hotspots") from population data.
Although we focus here on estimating recombination rates, we view the model as being useful more broadly in interpreting and analyzing patterns of LD across multiple loci. In particular, as we outline in our DISCUSSION, the model could be helpful in the development of more efficient haplotype-based methods for LD mapping, along the lines of, for example, ![]()
![]()
![]()
| MODELS |
|---|
Background:
The most successful current approaches to constructing statistical models relating genetic variation to the underlying recombination rate (and to other genetic and demographic factors) are based on the coalescent (![]()
![]()
![]()
![]()
![]()
Despite the ease with which coalescent models can be simulated from, using these models for inference remains extremely challenging. For example, consider the problem of estimating the underlying recombination rate in a region, using data from a random population sample. It follows from coalescent theory that population samples contain information on the value of the product of the recombination rate c and the effective (diploid) population size N, but not on c and N separately. It has therefore become standard to attempt to estimate the compound parameter
= 4Nc, and several methods have been proposed. Some (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
More recently, ![]()
![]()
over moderate to large genomic regions. Hudson's method is based on multiplying together likelihoods for every pair of sites genotyped, in which these pairwise likelihoods are computed via simulation, assuming an "infinite-sites" mutation model (i.e., no repeat mutation). This method has been modified by ![]()
![]()
A new model:
Here we describe a new model for LD, which enjoys many of the advantages of coalescent-based methods (e.g., it directly relates LD patterns to the underlying recombination rate) while remaining computationally tractable for huge genomic regions, up to entire chromosomes. Our model relates the distribution of sampled haplotypes to the underlying recombination rate, by exploiting the identity
![]() |
(1) |
where h1, ... , hn denote the n sampled haplotypes, and
denotes the recombination parameter (which may be a vector of parameters if the recombination rate is allowed to vary along the region). This identity expresses the unknown probability distribution on the left as a product of conditional distributions on the right. For simplicity we often use the notation
to denote these conditional distributions. While the conditional distributions are not computationally tractable for models of interest, they are amenable to approximation, as we describe below. Our strategy is to substitute an approximation for these conditional distributions (
, say) into the right-hand side of (1), to obtain an approximation to the distribution of the haplotypes h given
:
![]() |
(2) |
We refer to this as a "product of approximate conditionals" (PAC) model and to the corresponding likelihood as a PAC likelihood, which we denote LPAC. Explicitly,
![]() |
(3) |
Similarly, we refer to the value of
that maximizes LPAC as a maximum PAC likelihood estimate for
and denote it by
PAC.
The utility of the model (3) will naturally depend on the use of an appropriate approximation for the conditional distribution
. This approximation should be designed to answer the following question: if, at a particular locus, in a random sample of k chromosomes from a population, we observe genetic types h1, ... , hk, what is the conditional distribution of the type of the next sampled chromosome, Pr(hk+1|h1, ... , hk)? We are aware of three forms for
in the literature, each of which attempts to answer this question under different assumptions for the genetic model underlying the loci being studied. The first and best known comes from the Ewens sampling formula (![]()
= 4Nµ, then with probability k/(k +
) the k + 1st haplotype is an exact copy of one of the first k haplotypes chosen at random; otherwise it is a novel haplotype. Although the assumptions underlying this formula will never hold in practice, it does capture the following properties that we would expect to hold more generally:
- The next haplotype is more likely to match a haplotype that has already been observed many times rather than one that has been observed less frequently.
- The probability of seeing a novel haplotype decreases as k increases.
- The probability of seeing a novel haplotype increases as
increases.
However, for modern molecular data, and for sequence data and SNP data in particular, it fails to capture the following two properties:
- If the next haplotype is not exactly the same as an existing (i.e., previously seen) haplotype, it will tend to differ by a small number of mutations from an existing haplotype, rather than to be completely different from all existing haplotypes.
- Due to recombination, the next haplotype will tend to look somewhat similar to existing haplotypes over contiguous genomic regions, the average physical length of these regions being larger in areas of the genome where the local rate of recombination is low.
![]()
that captures properties iiv above. In their suggested form for
, the next haplotype differs by M mutations from a randomly chosen existing haplotype, where M has a geometric distribution with Pr(M = 0) = k/(k +
) (so that it reproduces the Ewens sampling formula in the special case of the infinite-alleles mutation model). Thus the next haplotype is a (possibly imperfect) "copy" of a randomly chosen existing haplotype.
![]()
to also capture property v above. In FD's approximation, the k + 1st haplotype is made up of an imperfect mosaic of the first k haplotypes, with the size of the mosaic fragments being smaller for higher values of the recombination rate.
Here we use two new forms for
that also capture properties iv above. The first, described in detail in Appendix A and illustrated in Fig 2, which we denote
A, is a simplification of FD's approximation that is easier to understand and slightly quicker to compute. [N. PATTERSON (personal communication) has independently suggested a similar simplification.] The second, which we describe in detail in Appendix B and denote
B, is a slight modification of
A, developed using empirical results from Fig 3 to produce a likelihood LPAC that gives more accurate estimates of
. Where necessary, we denote the PAC likelihoods and maximum PAC-likelihood estimates corresponding to
A (respectively,
B) by LPAC-A and
PAC-A (respectively, LPAC-B and
PAC-B).
A key property of both
A and
B is that they are easy and fast to compute. Unlike the Ewens sampling formula, but like the approximations of ![]()
, based on such explicit assumptions, and capturing iv or v, are known. However, the suggested forms for
were motivated by considering both the Ewens sampling formula and the underlying genealogy (or, in the case with recombination, genealogies) relating a random sample of haplotypes from a neutrally evolving, constant-sized panmictic population. As such, it may be helpful to view them as approximations to the (unknown) true conditional distribution under these assumptions. In particular, there are certain aspects of many real populations (e.g., population expansion or population structure) and biological factors (e.g., gene conversion and selection) that these forms for
do not attempt to capture. For some applications this may not matter very much. For others it may be necessary to develop forms for
that do capture these aspectsa point we return to in the DISCUSSION.
An unwelcome feature of the PAC likelihoods corresponding to our choices of
and indeed the forms for
from ![]()
![]()
. Although in principle the dependence on ordering could be removed by averaging the PAC likelihood over all possible orderings of the haplotypes, in practice this would require a sum over n! terms, which is infeasible even for rather small values of n. Instead, as a pragmatic alternative solution, we propose to average LPAC over several random orders of the haplotypes. Unless otherwise stated, all results reported here were obtained by averaging over 20 random orders. In our experience, the performance of the method is not especially sensitive to the number of random orders usedresults based on 100 random orders gave qualitatively similar results, and results based on a single random order were often not much worse (data not shown). It is, however, important that when comparing likelihoods for different values of
, the same set of random orders should be used for each value of
.
| ESTIMATING CONSTANT RECOMBINATION RATE |
|---|
In this section we consider estimating the recombination rate when it is assumed to be constant across the region of interest. More precisely, we assume that crossovers in a single meiosis occur as a Poisson process of constant rate c per unit (physical) distance and consider estimating the scalar parameter
= 4Nc. We first use simulated data to examine the properties of the estimator
PAC-A, corresponding to the conditional distribution
A described in Appendix A, under what we call the "standard coalescent model": a constant-sized, panmictic population with an infinite-sites mutation model. We show that, although quite accurate,
PAC-A exhibits a systematic bias. We use the empirical results to develop a modified conditional distribution,
B (described in detail in Appendix B), whose corresponding estimator,
PAC-B, exhibits considerably less bias and is more accurate. We compare the performance of models based on both
A and
B with results from other methods.
Properties of the point estimate
PAC:
We used the program mksample (![]()
- The number n of haplotypes in the sample
- The number S of markers typed
- The value of
(we measure physical distance so that the total physical length of each simulated haplotype equals 1.0. Thus the value of
is also the total value of
across the region).
For each data set we found
PAC-A by numerically maximizing the PAC likelihood (using a golden bisection search method; ![]()
used to generate the data.
It seems natural to measure the error in estimates for
on a relative, rather than an absolute, scale. For example, ![]()
gave estimates within a factor of 2 of the true value, and both FD and ![]()
/
for their estimates of
and the deviation of this ratio from the "optimal" value of 1. A problem with working with this ratio directly is that it tends to penalize overestimation more heavily than underestimation. For example, overestimating
by a factor of 10 gives a larger deviation from 1 than underestimating
by a factor of 10. To avoid this problem, we quantify the relative error of an estimate
for
by Err(
,
) = log10(
/
). This gives, for example, an error of 0 if
=
, an error of 1 if
overestimates
by a factor of 10, and an error of -1 if
underestimates
by a factor of 10.
We note that Err(
,
) can also be viewed as the error (on an absolute scale) in estimating log10(
) by log 10(
). Thus, if the usual asymptotic theory for maximum-likelihood estimation applies for estimation of log10(
) in this setting (which, as discussed in FD, it may not), then for the actual maximum-likelihood estimate (MLE)
MLE of
, Err(
,
MLE) would be normally distributed asymptotically, centered on 0. Optimistically, we might therefore hope that for sufficiently large data sets (large in terms of the number of haplotypes, the number of markers, or both) Err(
,
PAC-A) might be approximately normally distributed, centered on 0. In our simulations, we found that for some combinations of n, S, and
this did indeed appear to be the case (e.g., Fig 3B), but that for other combinations, although the distribution often appeared close to normal, it was centered around some nonzero value (e.g., Fig 3, a and c), indicating a systematic tendency for
PAC-A to over- or underestimate
. We refer to the median of Err as the "bias" [of log (
PAC-A) in estimating log(
)]. Although bias is usually defined as a mean error, this is not particularly helpful here since the mean is often heavily influenced by a small number of very large values and may even be infinite in some cases (see also FD). We therefore follow previous authors, including ![]()
Despite the biases evident in Fig 3, a and c,
PAC-A gives reasonably accurate estimates of
. For example, even in Fig 3C, which shows one of the most extreme biases that we observed in our simulations, the bias corresponds to underestimating
by approximately a factor of 2, and
PAC-A is within a factor of 2 of the true value of
in 68% of cases. Although in many statistical applications estimates within a factor of 2 of the truth would not be considered particularly helpful or impressive, in this setting this kind of accuracy is often not easy to achieve (see, for example, ![]()
We performed extensive simulations to better characterize the bias noted above and found that, although the bias depends on all three variables (n, S, and
), it is especially dependent on the average spacing between sites. More specifically, for fixed n and S we observed a striking linear relationship between the bias and the log of the average marker spacing (Fig 4). This linear relationship was also apparent for data simulated under an assumption of population expansion (data not shown). The slope of the linear relationship is negative in each case, indicating a tendency for
PAC-A to overestimate
when the markers are very closely spaced and underestimate
when the markers are far apart. As the number of sampled haplotypes increases, both the slope and intercept of the line appear to get closer to 0 (Table 1). On the basis of these empirical results we can modify
A to reduce the bias of the point estimates (see Appendix B for details). The improved performance of this modified conditional distribution, which we denote
B, is illustrated in the next section.
|
Fig 4 also illustrates the effect of varying parameter values on the variability of point estimates. As might be expected, the variance of the error reduces with increased sample size and increased number of sites, with the latter providing the more substantial decrease. For example, doubling the number of sites from 50 to 100 roughly halved the variance of the error in most cases, while doubling the number of individuals from 50 to 100 resulted in much smaller decreases. For a fixed sample size and number of sites, the variance of the error decreases as the spacing between sites grows. This may be due to the fact that for larger spacings more recombination events occur, increasing the relative accuracy with which
can be estimated, although we would not expect this pattern to continue indefinitely as the marker spacing is increased beyond the range considered here.
Comparison of point estimates with other methods:
![]()
, based on multiplying together the likelihood computed for every pair of SNPs. He compared the performance of this method with others in the literature (![]()
![]()
![]()
![]()
![]()
CL, with the results for
PAC-A and
PAC-B on data sets simulated under the same conditions (Fig 5). For data sets with small numbers of SNPs (
12)
CL provides the most accurate estimates of
, although all three methods struggle to produce reliable estimates. For larger numbers of SNPs both
PAC-A and
PAC-B tend, desirably, to exhibit less variability than
CL. Further,
PAC-B exhibits little or none of the bias present in
PAC-A and provides the most accurate estimates of
.
The superior performance of the pairwise composite-likelihood method for data sets with small numbers of SNPs is perhaps not surprisingindeed, for data sets with only two SNPs
CL is precisely the maximum-likelihood estimate for
. However, we note that almost all of the improvement in accuracy comes from the increase in the 10th percentile of the estimator toward the true value, rather than from a decrease in the 90th percentile. One possible explanation for this is that
CL uses a likelihood based on an infinite-sites mutation model (i.e., assumes no repeat mutation) and so is able essentially to rule out very small values for
if there is even one pair of sites at which all four gametes are present. (The effect of this may be compounded by the fact that
CL was found by maximizing over a grid of possible values, which forces all nonzero estimates of
to be above some threshold.) Our estimator does not make the infinite-sites assumption and so will be more inclined to estimate very small values of
, possibly leading to occasional substantial underestimates. Since in real data it will typically be unclear whether or not the infinite-sites assumption holds, the advantage of
CL for even small numbers of sites is perhaps less clear-cut than it appears in Fig 5.
We used the same simulated data to examine the accuracy of estimates of
obtained by the methods described by ![]()
across the region," or "per-locus
," which we denote
(more precisely, in our notation
=
L, where L is the physical length of the region). Results from FD suggest that even for small values of
(<3, say), the approximate-likelihood curves obtained by these methods may be poor approximations to the actual likelihood curve, and so it seems unlikely that the curves will be accurate for larger
. However, point estimates based on these methods could still be accurate, if the maximum of the approximate-likelihood curve occurs in about the right place. To investigate this possibility, we applied both methods, using
1 day of CPU time per method per data set (compared with
30 sec per data set for
PAC-B), to 10 of the data sets simulated with
= 40. Computational considerations make a more comprehensive simulation study inconvenient. Each of the methods was run with
fixed at the value used to simulate the data, giving them some advantage over how they could be used in practice. Nevertheless, neither method produced point estimates of
as accurate as those from
PAC-B (Table 2). Of the two full-likelihood schemes, the maximum of the likelihood curve obtained by infs was consistently closer to the true value of
than was the maximum of the likelihood curve obtained by Recombine. Indeed, the estimates obtained from Recombine were often close to an order of magnitude smaller than the true value of
, which raises a danger that when the method is applied to real data (for which the value of
is of course not known) the user might be misled into thinking that the value of
is small enough for the method to produce reliable results. Results from longer runs of infs, taking
5 days of CPU time each, produced improved results, competitive with
PAC-B (data not shown).
|
Properties of PAC likelihood curves:
Construction of confidence intervals:
We examined the coverage properties of confidence intervals (C.I.'s) constructed from the PAC-likelihood curve in two ways:
- Include all values of
for which loge(LPAC-B(
)) is within 2 of the maximum. - Include all values of loge(
) within ±1.96
of
PAC-B, where
is the square root of the inverse of minus the second derivative (found numerically) of the log of the PAC-B likelihood curve [as a function of loge(
)] evaluated at
PAC-A =
PAC-B.
The rationale for looking at such C.I.'s is that, under standard asymptotic theory for likelihood estimation, C.I.'s constructed in this way using the true likelihood curve would include the true value of
95% of the time. (For I this follows from the asymptotic
2 distribution of the log-likelihood-ratio statistic; for II it follows from asymptotic normal distribution of the MLE.)
Fig 6 shows the coverage properties for C.I.'s produced using the two methods (i.e., the proportion of times that C.I.'s formed using each method contained the true value of
) for the data sets used to obtain Fig 5C. For moderate sequence length both methods produce C.I.'s that are slightly anticonservative, with coverage properties that approach
0.91, compared with the expectation of
0.95 under asymptotic theory. On the basis of these results we speculate that the curvature of the PAC-B-likelihood curve does not deviate grossly from that of the true-likelihood curve. We note that the coverage properties are also closer to asymptotic expectations than those reported by ![]()
Comparison with other methods:
We compared the LPAC-B-likelihood curves with likelihood curves obtained from three other methods: the full-data coalescent method of FD (implemented in the computer program infs) and the pairwise composite-likelihood methods of HUDSON [2001; implemented by one of us (N.L.), using tables available from R. Hudson's website] and ![]()
![]()
=
= 3.0 across a region of physical length 1 and were kindly supplied by J. D. Wall. (These likelihood curves are plotted with
on the x-axis, rather than log(
), because infs and LDhat output likelihood curves for evenly spaced values of
.)
Interpreting the results of this comparison is slightly tricky. Unlike the other three methods we consider, the full-data coalescent method can, in principle, provide a fully accurate representation of the true-likelihood curve. As such it is tempting to treat this as a "gold standard" against which to compare the other methods. However, as mentioned previously, even for the rather small value of
= 3 used to generate these data, accurate approximation of the true-likelihood curve may be computationally impractical. Indeed, the estimated effective sample sizes (ESSs) obtained for these data sets, shown at the top of each part of Fig 7, suggest that we should not place much confidence in the accuracy of many of the curves. Our attempts to obtain more accurate likelihood curves by performing longer runs for some of the data sets (numbers 15 and 16) actually produced smaller estimated ESSs, suggesting that the effective sample sizes quoted for the other data sets are optimistic (see FD for further discussion of this problem). A further complication in comparing the methods is that both our method and that of ![]()
: the likelihood curves shown from infs are profile likelihoods for
at the true value of
; Hudson's method and our method avoid explicitly estimating
; LDhat estimates
using an analog of Watterson's estimate, but allowing for multiple mutations.
Notwithstanding these issues, we attempt to draw some general conclusions:
- In general, the likelihood curves produced by the four methods seem to agree rather more closely than might have been expected. (Compare, for example, the variability here with the variability observed for different runs of a single method in FD.) However, the closeness of the agreement between the methods differs appreciably across data sets. Data set 12 consists of only four sites, three of which are singletons, and so the differences in the curves for this data set seem not to be particularly interesting. We were unable to discern a systematic reason for the larger differences among methods observed in some of the other data sets (e.g., 16).
- The two pairwise composite-likelihood methods tend to produce likelihood curves that are slightly more peaked than those of the other two methods. This might be expected since, as pointed out by
MCVEAN et al. 2002 , pairwise composite-likelihood curves are typically more peaked than the true-likelihood curve because they treat each pair of sites as independent, when in fact many pairs are highly dependent.
- The method implemented in LDhat, which allows for multiple mutations, tends to achieve its maximum at larger values of
than does Hudson's method, which does not allow for multiple mutations. This is surprising; indeed, the opposite might have been expected, since multiple mutations could be used in place of recombination events to explain certain patterns of LD. One possible explanation is that the run lengths we used for computing the likelihood in LDhat might be insufficient (we used the default values). - Different orderings of the haplotypes can give PAC-likelihood curves that differ appreciably from one another. In addition, the maximum of the likelihood curve based on the average over several orderings tends to be toward the left end of the distribution of maxima obtained from different orderings. This is because, although not shown in Fig 7, the curves with maxima at smaller values of
tend to be larger (in absolute value) than those with maxima at larger values of
(presumably because they correspond to orderings of the haplotypes that, in some sense, require fewer recombination events to explain them) and thus contribute more to the average. Although this dependence on ordering is bothersome, in simulation studies (results not shown) we have found that the variability in the position of the maxima of the PAC likelihood over different orderings of the haplotypes is typically small compared with the uncertainty in estimation of
.
| VARIABLE RECOMBINATION RATE |
|---|
Models for variation in recombination rate:
One of our main motivations for developing this model is to explore fine-scale variation in recombination rates. A simple (no interference) model for variation in recombination rates is that crossovers in a single meiosis occur as an inhomogeneous Poisson process, of rate c(x) at position x. Here we consider two specific cases of this general model:
- A simple single-hotspot model, where

(4) Here
represents the background rate of crossover, a and b represent the left and right ends of the hotspot region, and
(>1) quantifies the magnitude of the recombination hotspot. The PAC likelihood for this model is a function of four parameters: a, b,
, and
. - A more general model, where if x is a position between markers j and j + 1, then

(5) Here
represents the background rate of crossover, and
j is a multiplier controlling how the crossover rate between markers j and j + 1 deviates from the background rate. The PAC likelihood for this model is a function of the parameters
1, ...
S-1 (where S is the number of SNPs) and
.
For the simple single-hotspot model it is straightforward to obtain numerically the maximum PAC-likelihood estimates for all four parameters simultaneously, although in the examples that we consider we assume that a and b are known and maximize the PAC likelihood in terms of
and
. The evidence for the presence of a hotspot can be summarized by the log-likelihood ratio (LLR) for the null hypothesis of no hotspot, H0:
= 1 vs. the alternative H1:
> 1. If
denotes the value of
that maximizes LPAC-B under H0, and
1 and
1 denote the values of
and
that maximize LPAC-B under H1, then
![]() |
(6) |
and large values of LLR represent evidence for the existence of a hotspot. Under standard asymptotic theory, two times LLR would have (asymptotically) a chi-square distribution on 1 d.f., and so rejecting H0 if LLR > 1.92 would give a hypothesis test with a type I error rate of 0.05. Although it seemed unlikely that standard asymptotic theory would apply here, we found that for data sets simulated under the null hypothesis, rejecting H0 for LLR > 1.92 gave empirical type I error rates close to 0.05 (Table 3), which provides some guidance as to what might be considered a "large" value of LLR.
|
For the second, more general model, obtaining maximum PAC-likelihood estimates for the parameters creates problems. First, the maximum-likelihood estimates are not unique (indeed there are infinitely many of them), because multiplying all the
j by any constant, and dividing
by the same constant, gives exactly the same likelihood. (In technical terms, the parameters are said to be unidentifiable.) Second, even if the identifiability problem is solved (for example, by first obtaining an estimate for
assuming that there is no hotspot and then fixing this when estimating the other parameters) there is the practical problem that the likelihood curve for some
j will often be very flat, resulting in estimates for many
j being very close to (or equal to) either 0 or infinity. This seems undesirable: if the likelihood for a particular
j is very flat, this indicates that there is little information about the recombination rate in that marker interval, in which case it seems sensible to estimate that the recombination rate is close to the background rate (i.e.,
j
1), rather than (close to) infinitely bigger or smaller!
To solve both these problems, we assume a "prior" distribution for the
j's: specifically that the
j's are independent and identically distributed, with log10(
j)
N(0, 0.52). This prior was chosen to allow occasional deviations from the background rate of recombination by a factor of 10 or more (with probability
95%,
j lies in the range 0.110). This choice of prior could be motivated from a Bayesian viewpoint as reflecting our prior beliefs about the
j's, but it also has the more pragmatic justification that identifying variations of this kind of magnitude seems both interesting and, perhaps, attainable. We consider alternative prior specifications in the DISCUSSION.
In principle, given the prior distribution for the
j's described above, we could also place a prior distribution on
and obtain an approximation to the posterior distribution of all parameters, using Markov chain Monte Carlo, for example. Although this would be our preferred approach, for simplicity we avoid this here and instead use the following ad hoc two-stage approach: first, obtain point estimates for
and
by jointly maximizing the product of LPAC(
,
) and the prior density of
; second, obtain a "posterior distribution" for each
j by fixing all other parameters at their estimated values, discretizing the prior on
j (truncated at
j = ±3), and computing the corresponding discretized posterior distribution as being proportional to the prior times the PAC likelihood. For data sets with a large number of sites the first stage (optimization over
,
) can be very time consuming, requiring large numbers of evaluations of the likelihood function. Further, it seems unlikely that the simple optimization method we used will reliably find the global maximum of the likelihood surface. Both these problems could be alleviated by exploiting the fact that the derivatives of the PAC likelihood can be computed efficiently, but we do not pursue this here.
Power to detect recombination hotspots and robustness:
In this section we assume that there is a single recombination hotspot (Equation 4), whose putative position is known, and examine the power of our model to detect the hotspot under various assumptions about the population demography and SNP marker ascertainment. Although the assumptions made here (in particular, that there is a single hotspot with known putative position) are unrealistic, they provide a convenient framework within which to examine quantitatively the power of our approach and how it is affected by population demographic history and marker ascertainment schemes.
We consider the following scenarios:
- Constant-size randomly mating population, all markers
- Constant-size randomly mating population, only markers at frequency >0.1
- Exponentially expanding population, with expansion starting t = 500 generations ago
- Exponentially expanding population, with expansion starting t = 5000 generations ago
- Haplotypes sampled from a structured population, consisting of two islands exchanging migrants at a rate of one per generation (scaled migration parameter 4Nm = 4)
- Haplotypes sampled from only one of the islands in the structured population described above.
These last four models are the same as those considered in ![]()
= 1960, 350 for t = 500,5000 were kindly provided by M. Przeworski.)
For each scenario we simulated data sets under the simple single-hotspot model described above, using mksample and the postprocessing algorithm described in Appendix C. For the first two scenarios each data set was simulated to have
50 segregating sites and 60 chromosomes, with a = 0.4, b = 0.5,
, and
= 10 [these values were chosen to approximately match values for the TAP2 data from ![]()
, and for the structured scenarios
within each population. For each scenario we also simulated data sets under the same conditions, but with no hotspot (i.e.,
= 1).
|
We applied the likelihood-ratio test to each data set to test the null hypothesis H0:
= 1 against the alternative H1:
> 1 (Table 3). For the scenarios not involving population expansion, the test gave type I error rates of
0.05 when applied to data without a hotspot and a power of
0.90 when applied to data simulated with a hotspot, although the test based on just the common SNPs had a slightly reduced power. The two scenarios involving population expansion gave either a substantial reduction in power or an inflated type I error rate (which is in some sense equivalent to a reduction in power). This might be due to a reduction in the number of "common" SNPs under these scenarios, as common SNPs tend to be most informative for estimating recombination rates.
We also examined the robustness of estimates of
under the various scenarios (Fig 3). As noted by FD, the recombination rate
depends on how the effective population size N is defined. In contrast, the definition of the parameter
does not depend on how the effective population size is defined, and so we might hope that estimation of
will be robust to departures from the assumption of a constant-sized panmictic population. For the levels of population structure we used in our simulations this does indeed appear to be the casein both cases estimates were more accurate than those for the sample from a single random-mating population, perhaps because population structure makes recombinants easier to "spot." As might be expected, estimates based only on common SNPs were less accurate than those based on all SNPs. A drop in accuracy is also evident for the scenarios simulated under population expansion, probably again due to a reduction in the number of "common" SNPs under these scenarios. Some of the scenarios also resulted in an upward bias for estimates of
, notably one of the expansion scenarios in which the median of the estimates was almost 2.5 times the true value.
Estimating recombination rates along a region:
Simulated data:
We fitted the more general varying recombination rate model to the simulated data used to produce the pairwise LD plots in Fig 1; the results are shown in Fig 8. From the latter figure we might conclude, correctly, that the data sets corresponding to the top left and bottom middle parts had recombination hotspots somewhere in the region 0.40.6. We might also conclude that the other data sets had no hotspots, which would be correct except in the case of the bottom left figure, which was actually generated from data with a hotspot between 0.4 and 0.5. One possible reason that the hotspot shows up less well in this case is that there are fewer sites at high frequency (>0.15) in this data set. Despite the fact that we might have been misled in one case out of six, we view Fig 8 as considerably more informative than Fig 1, from which we find it difficult to draw any conclusions.
TAP2 data:
![]()
Through analysis of sperm crossover events ![]()
We fitted the simple single-hotspot model to the haplotype data (kindly provided in convenient electronic format by A. Jeffreys), assuming a hotspot between 4 and 5.2 kb, and obtained estimates of
= 1.3/kb and
= 12 (95% C.I. [6,21]), with a LLR of 12, indicating strong evidence for the presence of the hotspot. Our estimate for the average magnitude of the hotspot,
= 12 times the background rate, agrees well with the sperm-typing results from ![]()
![]()












