Pedigree Data Analysis With Crossover Interference
Sharon Browning, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695-7566 and GlaxoSmithKline, Research Triangle Park, North Carolina 27709
Corresponding author: Sharon Browning, Five Moore Dr., MAI. A112B.1G, Research Triangle Park, NC 27709. E-mail: sharon.r.browning{at}gsk.com
Communicating editor: S. TAVARÉ
| ABSTRACT |
|---|
We propose a new method for calculating probabilities for pedigree genetic data that incorporates crossover interference using the chi-square models. Applications include relationship inference, genetic map construction, and linkage analysis. The method is based on importance sampling of unobserved inheritance patterns conditional on the observed genotype data and takes advantage of fast algorithms for no-interference models while using reweighting to allow for interference. We show that the method is effective for arbitrarily many markers with small pedigrees.
EXISTING methods for likelihood-based analysis of pedigree genotype data assume absence of crossover interference, even though such interference has been well documented in humans.
The major reason that crossover interference models are not used is the lack of computationally feasible methods for working with more than a handful of linked markers.
We propose a new method for calculating probabilities for pedigree genetic data that does incorporate crossover interference. The method is based on importance sampling, which is a Monte Carlo technique for estimating the value of an integral or sum (in this case probabilities of the data, expressed as a sum over unobserved inheritance indicators). Importance sampling involves evaluating the integrand at independently sampled realizations from a probability distribution that is roughly proportional to the integrand. Correct weighting of the sampled values gives an unbiased estimate that converges to the true value of the integral as the number of sample repetitions is increased. Because the samples are independent, the method is not plagued by the difficulties in assessing convergence encountered in use of Markov chain Monte Carlo (MCMC) methods.
The probabilities produced by this method may be used for multipoint linkage mapping, genetic map construction, and relationship inference. The method can be used with arbitrarily many linked markers for small pedigrees. For small pedigrees the method is fast enough that it could be used on a routine basis in analysis of pedigree genetic data.
| BACKGROUND |
|---|
It has long been known that the locations of multiple crossovers on a chromosome at meiosis are not independent, but exhibit crossover interference whereby the existence of a crossover at one location suppresses the occurrence of crossovers in nearby regions. A common approach is to use the Kosambi map function with methods that assume independence between recombination events in nonoverlapping intervals. This approach does not have any real effect, as genetic distances are used only for reporting results, while recombination fractions are used at all steps in the calculations. That is, observed recombination fractions are converted to genetic distances with the map function and are converted back to recombination fractions with the same map function for use in the analysis.
A useful family of point process models for crossover interference is the chi-square models. These are renewal models for the occurrence of crossovers on the four-strand chromatid bundle. A parameter m controls the strength of interference: m = 0 corresponds to no interference, while m = 4 corresponds to approximately the level of interference found in humans.
In pedigree data, the recombination patterns are generally not directly observed, due to unobserved individuals and insufficiently polymorphic markers. Thus, calculating probabilities for genotype data requires summation over possible recombination patterns. For K meioses and L marker loci there are $2^{K(L-1)}$ possible recombination patterns, so this computation, if performed exactly, rapidly becomes infeasible for increasing numbers of markers and meioses.
When independence (i.e., no-interference) models for recombination are used, exploitation of the independence can reduce the number of computations to the order of $LK2^K$.
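To give a sense of scale, the following short sketch compares the number of recombination patterns, $2^{K(L-1)}$, with the order $LK2^K$ quoted for independence models. The function names are illustrative and are not part of the author's implementation.

```python
def n_recombination_patterns(K: int, L: int) -> int:
    """Number of possible recombination patterns for K meioses and L loci."""
    return 2 ** (K * (L - 1))


def independence_order(K: int, L: int) -> int:
    """Order of the computation under independence models, L * K * 2**K."""
    return L * K * 2 ** K


if __name__ == "__main__":
    for K, L in [(2, 10), (5, 10), (8, 300)]:
        digits = len(str(n_recombination_patterns(K, L)))
        print(f"K={K}, L={L}: a {digits}-digit number of patterns "
              f"vs order {independence_order(K, L)}")
```

Even for the eight-meiosis, 300-marker case the second quantity stays below a million, while the first is far too large to enumerate.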
To reduce the computational burden while incorporating crossover interference, we propose an importance-sampling approach. This approach takes advantage of the special algorithms for independence models, while using reweighting to allow for interference.
| METHODS |
|---|
We present a method for calculating probabilities for genotypic data under the chi-square models. The method is based on importance sampling of underlying unobserved inheritance patterns. We focus primarily on calculation of the likelihood, which is equal to the probability of the genotypic data under the assumed model. Likelihoods may be used to compare models, for example, to find the most likely genealogical relationship between individuals.
Importance sampling of inheritance indicators:
For each marker locus l and meiosis k, let $X_{kl}$ be a zero-one inheritance indicator. The parent involved in the meiosis has two copies of the DNA, one maternal (0) and the other paternal (1). The indicator $X_{kl}$ describes which of these two copies was transmitted to the offspring at this locus. If $X_{kl} = X_{k,l+1}$, no recombination occurred on meiosis k over the interval between markers l and l + 1, while recombination occurred if $X_{kl} \neq X_{k,l+1}$. The inheritance pattern X represents the inheritance indicators over all meioses and loci.
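As a small illustration of this definition, the sketch below converts a matrix of inheritance indicators into the corresponding recombination pattern. The function name and the example data are hypothetical.

```python
from typing import List


def recombination_pattern(X: List[List[int]]) -> List[List[int]]:
    """X[k][l] is the 0/1 inheritance indicator for meiosis k at locus l.

    Returns R with R[k][l] = 1 if meiosis k recombined between
    locus l and locus l + 1 (0-based indices), and 0 otherwise.
    """
    return [[int(row[l] != row[l + 1]) for l in range(len(row) - 1)]
            for row in X]


X = [
    [0, 0, 1, 1, 1],  # meiosis 1: switches copy between the 2nd and 3rd loci
    [1, 1, 1, 1, 1],  # meiosis 2: no recombination
]
R = recombination_pattern(X)
```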
We can write the probability of the genotype data Y as a sum over inheritance patterns
$$P_C(Y) = \sum_X P_C(Y \mid X)\, P_C(X),$$
where $P_C$ denotes probabilities under the chi-square crossover model, with a given choice of parameter m. Since $P_C(Y \mid X) P_C(X) = P_C(X, Y) \propto P_C(X \mid Y)$, only those terms with large values of $P_C(X \mid Y)$ contribute significantly to the sum. An importance-sampling approach aims to sample with high frequency those terms with significant contribution, while sampling with low frequency the terms with negligible contribution. Ideally we would like to sample from $P_C(X \mid Y)$, but we have no way to do so directly. Instead we sample from $P_I(X \mid Y)$, where $P_I$ denotes probabilities under the independence model. The probabilities $P_I(X \mid Y)$ are sufficiently close to $P_C(X \mid Y)$ to result in a useful importance sampler.
In what follows we assume that probabilities for genotypes $Y_l$ at a marker l are conditionally independent of genotypes and inheritance patterns at other markers given the inheritance pattern $X_l$ at the marker. This assumption holds if the markers are in linkage equilibrium, which will be approximately true if all the genotyped individuals come from a single homogeneous population and the markers are not too closely spaced. A consequence of this assumption is that probabilities of genotypes Y given inheritance patterns X do not depend on the crossover model, so $P_C(Y \mid X) = P_I(Y \mid X) \equiv P(Y \mid X)$. Then $P(Y \mid X) = P_I(Y \mid X) = P_I(X \mid Y) P_I(Y) / P_I(X)$ and we can write
$$P_C(Y) = \sum_X P(Y \mid X)\, P_C(X) = P_I(Y) \sum_X \frac{P_C(X)}{P_I(X)}\, P_I(X \mid Y).$$
Thus if $X^{(1)}, X^{(2)}, \dots, X^{(n)}$ are sampled from $P_I(X \mid Y)$, an unbiased estimate of $P_C(Y)$ is given by
$$\hat{P}_C(Y) = \frac{P_I(Y)}{n} \sum_{i=1}^{n} \frac{P_C(X^{(i)})}{P_I(X^{(i)})}.$$
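The reweighting identity can be checked on a toy problem small enough to enumerate. The sketch below is only an illustration under stated assumptions. It uses a single meiosis with six loci, made-up genotype probabilities `EMIT`, and a simple stand-in interference model `p_interf` in which a recombination makes one in the next interval unlikely. This stand-in is not the chi-square model, and sampling from $P_I(X \mid Y)$ is done here by brute-force enumeration rather than by the fast algorithms the method relies on.

```python
import itertools
import random

L = 6        # number of loci; one meiosis, so X is a 0/1 vector of length L
THETA = 0.2  # recombination fraction per interval under independence


def recs(x):
    """Recombination indicators for the L - 1 intervals of x."""
    return [int(a != b) for a, b in zip(x, x[1:])]


def p_indep(x):
    """P_I(X): uniform first indicator, independent recombinations."""
    p = 0.5
    for r in recs(x):
        p *= THETA if r else 1.0 - THETA
    return p


def p_interf(x):
    """Toy stand-in for P_C(X); NOT the chi-square model."""
    p, prev = 0.5, None
    for r in recs(x):
        if prev is None:
            q = 0.2                      # first interval
        else:
            q = 0.02 if prev else 0.25   # recombination suppresses the next
        p *= q if r else 1.0 - q
        prev = r
    return p


# Hypothetical (P(Y_l | X_l = 0), P(Y_l | X_l = 1)) for each locus.
EMIT = [(0.9, 0.1), (0.5, 0.5), (0.8, 0.2), (0.5, 0.5), (0.3, 0.7), (0.1, 0.9)]


def p_y_given_x(x):
    p = 1.0
    for l, xl in enumerate(x):
        p *= EMIT[l][xl]
    return p


PATTERNS = list(itertools.product((0, 1), repeat=L))
P_I_Y = sum(p_y_given_x(x) * p_indep(x) for x in PATTERNS)   # exact P_I(Y)
P_C_Y = sum(p_y_given_x(x) * p_interf(x) for x in PATTERNS)  # exact P_C(Y)
POSTERIOR_I = [p_y_given_x(x) * p_indep(x) / P_I_Y for x in PATTERNS]


def estimate_p_c_y(n, seed=0):
    """Importance-sampling estimate of P_C(Y) from n draws of P_I(X|Y)."""
    rng = random.Random(seed)
    draws = rng.choices(PATTERNS, weights=POSTERIOR_I, k=n)
    return P_I_Y * sum(p_interf(x) / p_indep(x) for x in draws) / n
```

Comparing `estimate_p_c_y(n)` with the enumerated value `P_C_Y` for increasing `n` shows the estimate settling around the exact answer.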
Improved performance through resampling:
This method is an example of sequential importance sampling.
We now give more details of the resampling algorithm, in which we apply the method of residual resampling described in Sect. 3.4.4 of LIU (2001). First we set a batch size B and a resampling interval T (choice of these values is discussed below). We start by sampling $X^{(i)}_L, X^{(i)}_{L-1}, \dots, X^{(i)}_{L-T+1}$ for i = 1, 2, ..., B, perform resampling as described below, and then sample $X^{(i)}_{L-T}, X^{(i)}_{L-T-1}, \dots, X^{(i)}_{L-2T+1}$, perform resampling, and continue in this fashion until reaching $X_1$ at the end of the chromosome.
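Write $X^{(i)}_{(j)}$ for the ith partial sample at the jth resampling. For example, $X^{(i)}_{(1)} = (X^{(i)}_L, \dots, X^{(i)}_{L-T+1})$ and $X^{(i)}_{(2)} = (X^{(i)}_L, \dots, X^{(i)}_{L-2T+1})$. The resampling weight for this sample is
$$w^{(i)}_j = w^{(i)}_{j-1}\, \frac{P_C(X^{(i)}_{(j)})\, P_I(X^{(i)}_{(j-1)})}{P_I(X^{(i)}_{(j)})\, P_C(X^{(i)}_{(j-1)})}$$
for j > 1, while at the first resampling
$$w^{(i)}_1 = \frac{P_C(X^{(i)}_{(1)})}{P_I(X^{(i)}_{(1)})}.$$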
At the jth resampling, let $W_j = \sum_{i=1}^{B} w^{(i)}_j$. Start by retaining $k_i = [B w^{(i)}_j / W_j]$ copies of $X^{(i)}_{(j)}$, where [ ] is the floor function giving the largest integer less than or equal to its argument. The number of elements remaining to be resampled is $r = B - \sum_{i=1}^{B} k_i$. Add to the set of retained inheritance patterns r independent random draws from the $X^{(i)}_{(j)}$ with probabilities proportional to $B w^{(i)}_j / W_j - k_i$, i = 1, 2, ..., B. Now reset the weights $w^{(i)}_j$ to $W_j/B$ for i = 1, 2, ..., B and continue with the sequential sampling of the inheritance patterns.
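The residual resampling step can be sketched in a few lines. This is a minimal illustration of the steps just described, not the author's implementation. The function name and its return convention (indices of the retained partial samples plus the common reset weight) are assumptions made for the example.

```python
import math
import random


def residual_resample(weights, rng):
    """Residual resampling of B weighted partial samples.

    Returns (indices, reset_weight): indices has length B and lists which
    partial samples are retained (with repeats); reset_weight is W / B,
    the common weight assigned to every retained sample afterwards.
    """
    B = len(weights)
    W = sum(weights)
    scaled = [B * w / W for w in weights]
    k = [math.floor(s) for s in scaled]       # deterministic copies k_i
    indices = [i for i, ki in enumerate(k) for _ in range(ki)]
    r = B - sum(k)                            # draws still needed
    if r > 0:
        residual = [s - ki for s, ki in zip(scaled, k)]
        indices += rng.choices(range(B), weights=residual, k=r)
    return indices, W / B


rng = random.Random(0)
indices, reset_weight = residual_resample([0.5, 0.1, 3.4, 1.0], rng)
```

In this example the third sample has scaled weight 2.72, so it is retained at least twice, and the remaining two slots are filled at random according to the residual weights.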
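Let J = [(L - 1)/T] be the total number of resampling points. After the samples are completed, the weight of the ith sample $X^{(i)}$ is
$$w^{(i)} = w^{(i)}_J\, \frac{P_C(X^{(i)})\, P_I(X^{(i)}_{(J)})}{P_I(X^{(i)})\, P_C(X^{(i)}_{(J)})}.$$
An unbiased estimate of $P_C(Y)$ is
$$\hat{P}_C(Y) = \frac{P_I(Y)}{B} \sum_{i=1}^{B} w^{(i)}.$$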
The values obtained from a single batch are correlated and cannot be used to estimate the standard error of $\hat{P}_C(Y)$. To estimate the standard error we divide our total number of iterations into batches of size B. Choice of B has negligible effect on computing time. Small choices of B are less efficient (i.e., give larger standard error) because resampling is less effective with a smaller pool. Thus it is best to choose B to be reasonably large, subject to computer memory limitations and to having a sufficient number of batches to give a good estimate of standard error (say at least 20 batches). We used B = 100 in all the examples presented in RESULTS.
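One straightforward way to combine batches, assuming each batch yields its own estimate and that different batches are independent, is to average the batch estimates and use their sample variance for the standard error. The sketch below illustrates this. The function name is hypothetical and the combination rule is an assumption, since the text does not spell it out.

```python
import math


def combine_batches(batch_estimates):
    """Return (mean, standard error of the mean) of per-batch estimates.

    Assumes the batch estimates are independent and identically
    distributed, so the usual sample-variance formula applies.
    """
    n = len(batch_estimates)
    if n < 2:
        raise ValueError("need at least two batches to estimate a standard error")
    mean = sum(batch_estimates) / n
    var = sum((b - mean) ** 2 for b in batch_estimates) / (n - 1)
    return mean, math.sqrt(var / n)


mean, se = combine_batches([1.0, 2.0, 3.0])
```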
Two issues are involved in the best choice of resampling interval T. First, resampling involves computational time, and thus frequent resampling can be computationally expensive. The computational time for resampling is essentially independent of the pedigree size; thus the expense of resampling is most notable in computations on small pedigrees and is negligible relative to total computation time for larger pedigrees (such as the eight-meiosis pedigree for four full-sibs). Second, as the resampling interval increases, the standard error of the estimates tends to increase. This increase in standard error is particularly evident in larger pedigrees, which are also the pedigrees for which resampling is most beneficial. Thus small values of T (say $1 \le T \le 5$) are best for large pedigrees, and large values of T (say $T \ge 5$) are best for small pedigrees. We chose to use T = 5 for simplicity throughout the examples presented in RESULTS.
We note that it is not necessary (and is very inefficient) to calculate $P_C(X^{(i)}_{(j)})$ and $P_I(X^{(i)}_{(j)})$ from scratch at each resampling point. For the independence sampler, $P_I(X^{(i)}_{(j)})$ equals $P_I(X^{(i)}_{(j-1)})$ multiplied by the probabilities of the recombination pattern in $X^{(i)}_{L-(j-1)T+1}, X^{(i)}_{L-(j-1)T}, \dots, X^{(i)}_{L-jT+1}$. For the chi-square sampler it is necessary to save an m-dimensional vector of partial probabilities from calculation of $P_C(X^{(i)}_{(j-1)})$ for use in calculation of $P_C(X^{(i)}_{(j)})$.
Our implementation of the algorithm described above is available from the author on request.
Probabilities for linkage analysis and map construction:
Linkage analysis and map construction are based on probabilities of inheritance indicators $X_l$ at a locus l given the genotype information. Under the chi-square model these probabilities are
$$P_C(X_l = x_l \mid Y) = \sum_X P(x_l \mid X, Y)\, P_C(X \mid Y),$$
where $P(x_l \mid X, Y) = P(x_l \mid X)$ is one if $X_l = x_l$ and zero otherwise. An unbiased estimate of this probability is proportional to
$$\sum_{i=1}^{n} P(x_l \mid X^{(i)})\, \frac{P_C(X^{(i)})}{P_I(X^{(i)})},$$
where $X^{(1)}, \dots, X^{(n)}$ are independent realizations from $P_I(X \mid Y)$ and the constant of proportionality may be found by summing over all possible values of $x_l$. Probabilities $P_C(X_l = x_l, X_{l+1} = x_{l+1} \mid Y)$ may be estimated similarly. Thus these probabilities may be estimated under interference models using the machinery presented here and then used in linkage mapping or genetic map construction.
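A minimal sketch of this weighted estimate follows. It assumes that each sampled inheritance pattern is stored as a sequence of integer-coded inheritance vectors (one per locus) and that the importance weights $P_C(X^{(i)})/P_I(X^{(i)})$ have already been computed. The function name and data layout are illustrative.

```python
def estimate_marginal(samples, weights, locus, n_states):
    """Estimate P_C(X_l = x | Y) for x = 0, ..., n_states - 1.

    samples[i][locus] is the integer-coded inheritance vector at the locus
    in the i-th draw from P_I(X | Y); weights[i] = P_C(X_i) / P_I(X_i).
    """
    totals = [0.0] * n_states
    for x, w in zip(samples, weights):
        totals[x[locus]] += w          # indicator P(x_l | X_i) times weight
    norm = sum(totals)                 # constant of proportionality
    return [t / norm for t in totals]


samples = [(0, 1), (1, 1), (0, 0)]
weights = [1.0, 2.0, 1.0]
at_first_locus = estimate_marginal(samples, weights, 0, 2)
at_second_locus = estimate_marginal(samples, weights, 1, 2)
```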
| RESULTS |
|---|
We present results from analysis of simulated data, to demonstrate the capabilities of the proposed approach.
Our simulated data were designed to represent human chromosome 1, with an estimated genetic length of 2.9 morgans. Likelihoods were estimated under the chi-square model [giving $\hat{P}_C(Y)$] and, for comparison, calculated under the independence model with the Kosambi map function [giving $P_I(Y)$]. We look at three relationships: two half-sibs (two meioses), an aunt-niece pair (five meioses), and four full-siblings (eight meioses). The data were simulated under the chi-square model with m = 4, although this choice has little impact on the results. Table 1 shows results and computing times for one simulated chromosome for each map/relationship combination.
From the results in Table 1, we see that the type of marker has little impact on the computing time or standard errors, but does affect the magnitude of the likelihoods. Marker spacing affects computing time, but has little impact on standard errors. The most computing-intensive part of the algorithm is sampling of the inheritance patterns from the distribution $P_I(X \mid Y)$, which is of computational order $L2^K$, so that computing time doubles for each additional meiosis. Hence, with the current implementation, eight meioses is about the maximum that one would want to work with: $10^5$ iterations for chromosome 1 with a 1-cM map took about 1 hr (fewer iterations and hence shorter computing time would suffice in some cases). Standard errors also increase with the number of meioses, so one would generally want to increase the number of iterations as pedigree size increases.
In considering whether an estimate of $\ln P_C(Y)$ is sufficiently precise, one minimal requirement is that the standard error be less than the difference between $\ln P_C(Y)$ and $\ln P_I(Y)$, where $\ln P_I(Y)$ is the natural log of the probability of the data under the independence model using the Kosambi map function. If this requirement is not satisfied, then the exact answer obtained under the independence model is a better estimate of $P_C(Y)$ than that from the importance sampler. We note that the requirement is satisfied in all but 1 of the 12 examples looked at, and in most of the examples a much smaller number of iterations would have sufficed to meet this minimal requirement (in fact, <100 iterations, at approximately one-thousandth of the computing time shown, would have sufficed in 7 out of 12 of the examples). Thus $10^5$ iterations are more than enough (based on this criterion) for this chromosome length, choice of model parameter m, and pedigree size, but more iterations may be required for longer chromosomes, larger pedigrees, or different values of m.
We found that resampling was very helpful in reducing the amount of computing time required to achieve a given standard error, with benefit increasing with pedigree size. For the aunt-niece and four full-sib relationships, savings in computational time due to the resampling were at least twofold and typically around fivefold.
| DISCUSSION |
|---|
We have presented an algorithm for calculating probabilities for pedigree genetic data under the chi-square family of interference models. The method is based on importance sampling and thus gives an approximate calculation with precision depending on the number of importance sampling iterations. The results shown indicate that the algorithm is feasible for pedigrees with up to eight meioses and for any number of markers. Calculated probabilities of the data may be used as likelihoods for relationship inference, by repeating the calculation with differing pedigrees. In addition, the approach may be used to calculate probabilities of inheritance patterns given the data for linkage gene mapping.
Computational time is higher than that under a no-interference assumption; in general, the computing time without interference would be approximately the time to perform one iteration of the importance sampler used here. However, the important point is that the computation times for this method scale the same with increasing marker density and pedigree size as do those for the basic no-interference algorithm.
We have presented this algorithm for the chi-square model only, but it is generalizable to other classes of models. To directly apply the approach here, it is necessary to be able to calculate the probability of a pattern of recombinations over a series of markers. Such calculation may not be computationally feasible for some models. A more natural approach would involve sampling of not only the recombination pattern but also the actual locations of crossovers on the chromatid bundle under the independence model and conditional on the genotype data Y. Such an approach has the potential to be very flexible, although it will add some noise, resulting in lower precision for a given number of iterations of the importance sampler.
| ACKNOWLEDGMENTS |
|---|
The author thanks Reinhard Hopperger for programming assistance and Terry Speed and the referees for helpful comments. This work was supported in part by a Research Starter Grant in Informatics from the PhRMA Foundation.
Manuscript received May 27, 2002; Accepted for publication March 7, 2003.
| APPENDIX |
|---|
Probability of recombination pattern for chi-square models:
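We reproduce the matrix method for computing the probability of a recombination pattern under the chi-square model with parameter m. Let p = m + 1 and, for the lth interval with genetic length $y_l$, let $D_s(y_l)$ be the p × p matrix associated with s crossovers on the chromatid bundle in that interval. Define $N_l = D_0(y_l) + \frac{1}{2}\sum_{s \ge 1} D_s(y_l)$ and $R_l = \frac{1}{2}\sum_{s \ge 1} D_s(y_l)$. Then the probability of a recombination pattern over the L - 1 intervals between L consecutive loci on a chromosome is given by $(1/p)\,\mathbf{1} M_1 M_2 \cdots M_{L-1} \mathbf{1}'$, where $\mathbf{1}$ is a row vector of p ones, $M_l = N_l$ whenever there is no recombination in the lth interval, and $M_l = R_l$ when there is recombination.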
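The matrix product above is simple to evaluate numerically. The sketch below is an illustration under assumptions rather than the author's code. In particular, the entries of $D_s(y)$ are taken to be $e^{-\lambda}\lambda^{ps+j-i}/(ps+j-i)!$ with $\lambda = 2py$ and $y$ in morgans (zero when $ps + j - i < 0$), which is the form usually given for the chi-square model in the literature and is assumed here, not quoted from this article. The infinite sum over s is truncated at `s_max`, and interval lengths must be positive.

```python
import math


def d_matrix(s, y, m):
    """Assumed form of D_s(y): p x p matrix with p = m + 1, y > 0 in morgans."""
    p = m + 1
    lam = 2.0 * p * y
    D = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            n = p * s + j - i
            if n >= 0:  # Poisson(lam) probability of n events, via logs
                D[i][j] = math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
    return D


def n_and_r(y, m, s_max=60):
    """Return (N, R) for one interval; the sum over s is truncated at s_max."""
    p = m + 1
    D0 = d_matrix(0, y, m)
    R = [[0.0] * p for _ in range(p)]
    for s in range(1, s_max + 1):
        Ds = d_matrix(s, y, m)
        for i in range(p):
            for j in range(p):
                R[i][j] += 0.5 * Ds[i][j]
    N = [[D0[i][j] + R[i][j] for j in range(p)] for i in range(p)]
    return N, R


def pattern_probability(recs, distances, m):
    """(1/p) 1 M_1 ... M_{L-1} 1' for recombination indicators recs."""
    p = m + 1
    row = [1.0 / p] * p
    for r, y in zip(recs, distances):
        N, R = n_and_r(y, m)
        M = R if r else N
        row = [sum(row[i] * M[i][j] for i in range(p)) for j in range(p)]
    return sum(row)
```

With m = 0 the calculation reduces to independent intervals with recombination fraction $\frac{1}{2}(1 - e^{-2y})$, and with m = 4 a double recombination in two adjacent 10-cM intervals is much less probable than the product of the single-interval probabilities, as expected under positive interference.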
Sampling inheritance patterns conditional on the genotype data under the independence model:
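For each locus l and each possible value s of the inheritance vector $X_l$, define
$$\alpha_l(s) = P_I(Y_1, \dots, Y_l, X_l = s).$$
The terms $\alpha_l(s)$ can be calculated recursively as
$$\alpha_1(s) = P(Y_1 \mid X_1 = s)\, P(X_1 = s)$$
and
$$\alpha_l(s) = P(Y_l \mid X_l = s) \sum_{s'} \alpha_{l-1}(s')\, P_I(X_l = s \mid X_{l-1} = s').$$
This is an application of the BAUM (1972) algorithm. Note that $P_I(Y) = \sum_s \alpha_L(s)$. Then
$$P_I(X_L = s \mid Y) = \frac{\alpha_L(s)}{P_I(Y)}$$
can be used to sample $X_L$ from $P_I(X \mid Y)$. Recursively, after sampling $X_l$, sample $X_{l-1}$ from
$$P_I(X_{l-1} = s \mid X_l, Y) = \frac{\alpha_{l-1}(s)\, P_I(X_l \mid X_{l-1} = s)}{\sum_{s'} \alpha_{l-1}(s')\, P_I(X_l \mid X_{l-1} = s')}.$$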
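The sampling scheme in this section, a forward pass over the loci followed by sampling from the last locus back to the first, can be sketched as follows. This is a plain illustration under assumptions. Inheritance vectors for K meioses are coded as integers 0, ..., $2^K - 1$, the genotype probabilities `emit[l][s]` standing for $P(Y_l \mid X_l = s)$ are supplied by the caller, and the transition sum is evaluated directly, at cost of order $L4^K$, without the faster transform-based methods needed to reach order $L2^K$. All names are hypothetical.

```python
import random


def transition(s_prev, s, theta, K):
    """P_I(X_l = s | X_{l-1} = s_prev): independent recombination per meiosis."""
    d = bin(s_prev ^ s).count("1")          # number of meioses that recombine
    return theta ** d * (1.0 - theta) ** (K - d)


def forward(emit, thetas, K):
    """alpha[l][s] = P_I(Y_1..Y_l, X_l = s), with a uniform prior on X_1."""
    S = 2 ** K
    alpha = [[emit[0][s] / S for s in range(S)]]
    for l in range(1, len(emit)):
        prev = alpha[-1]
        alpha.append([
            emit[l][s] * sum(prev[t] * transition(t, s, thetas[l - 1], K)
                             for t in range(S))
            for s in range(S)
        ])
    return alpha


def sample_posterior(emit, thetas, K, rng):
    """Draw one X from P_I(X | Y); also return P_I(Y)."""
    S = 2 ** K
    L = len(emit)
    alpha = forward(emit, thetas, K)
    x = [0] * L
    x[L - 1] = rng.choices(range(S), weights=alpha[L - 1])[0]
    for l in range(L - 2, -1, -1):          # sample X_{L-1}, ..., X_1
        w = [alpha[l][s] * transition(s, x[l + 1], thetas[l], K)
             for s in range(S)]
        x[l] = rng.choices(range(S), weights=w)[0]
    return x, sum(alpha[L - 1])
```

For example, with K = 2 meioses and three loci, `sample_posterior(emit, [0.1, 0.3], 2, random.Random(1))` returns a list of three integer-coded inheritance vectors together with $P_I(Y)$.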
| LITERATURE CITED |
|---|
BAUM, L. E., 1972 An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, pp. 1-8 in Inequalities III: Proceedings of the Third Symposium on Inequalities Held at The University of California, Los Angeles, September 1-9, 1969, edited by O. SHISHA. Academic Press, San Diego.
BOEHNKE, M. and N. J. COX, 1997 Accurate inference of relationships in sib-pair linkage studies. Am. J. Hum. Genet. 61:423-429.
BROMAN, K. W. and J. L. WEBER, 2000 Characterization of human crossover interference. Am. J. Hum. Genet. 66:1911-1926.
BROMAN, K. W., J. C. MURRAY, V. C. SHEFFIELD, R. L. WHITE, and J. L. WEBER, 1998 Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63:861-869.
BROMAN, K. W., L. B. ROWE, G. A. CHURCHILL, and K. PAIGEN, 2002 Crossover interference in the mouse. Genetics 160:1123-1131.
GOLDGAR, D. E., P. R. FAIN, and W. J. KIMBERLING, 1989 Chiasma-based models of multilocus recombination: increased power for exclusion mapping and gene ordering. Genomics 5:283-290.
GOLDSTEIN, D. R., H. ZHAO, and T. P. SPEED, 1995 Relative efficiencies for $\chi^2$ models of recombination for exclusion mapping and gene ordering. Genomics 27:265-273.
KRUGLYAK, L. and E. S. LANDER, 1998 Faster multipoint linkage analysis using Fourier transforms. J. Comput. Biol. 5:1-7.
KRUGLYAK, L., M. J. DALY, M. P. REEVE-DALY, and E. S. LANDER, 1996 Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58:1347-1363.
LANDER, E. S. and P. GREEN, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84:2363-2367.
LIN, S. and T. P. SPEED, 1996 Incorporating crossover interference into pedigree analysis using the $\chi^2$ model. Hum. Hered. 46:315-322.
LIN, S. and T. P. SPEED, 1999 Relative efficiencies of the chi-square recombination models for gene mapping with human pedigree data. Ann. Hum. Genet. 63:81-95.
LIN, S., R. CHENG, and F. A. WRIGHT, 2001 Genetic crossover interference in the human genome. Ann. Hum. Genet. 65:79-93.
LIU, J. S., 2001 Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.
MCPEEK, M. S. and T. P. SPEED, 1995 Modeling interference in genetic recombination. Genetics 139:1031-1044.
MCPEEK, M. S. and L. SUN, 2000 Statistical tests for detection of misspecified relationships by use of genome screen data. Am. J. Hum. Genet. 66:1076-1094.
SPEED, T. P., 1996 What is a genetic map function? pp. 65-88 in Genetic Mapping and DNA Sequencing (IMA Volumes in Mathematics and its Applications, Vol. 81), edited by T. P. SPEED and M. S. WATERMAN. Springer-Verlag, New York.
THOMPSON, E. A., 2000a MCMC estimation of multi-locus genome sharing and multipoint gene location scores. Int. Stat. Rev. 68:53-73.
THOMPSON, E. A., 2000b Statistical Inference From Genetic Data on Pedigrees. Institute of Mathematical Statistics, Beachwood, OH.
WEEKS, D. E., G. M. LATHROP, and J. OTT, 1993 Multipoint mapping under genetic interference. Hum. Hered. 43:86-97.
ZHAO, H., M. S. MCPEEK, and T. P. SPEED, 1995a Statistical analysis of chromatid interference. Genetics 139:1057-1065.
ZHAO, H., T. P. SPEED, and M. S. MCPEEK, 1995b Statistical analysis of crossover interference using the chi-square model. Genetics 139:1045-1056.