Abstract
We propose a new method for calculating probabilities for pedigree genetic data that incorporates crossover interference using the chisquare models. Applications include relationship inference, genetic map construction, and linkage analysis. The method is based on importance sampling of unobserved inheritance patterns conditional on the observed genotype data and takes advantage of fast algorithms for nointerference models while using reweighting to allow for interference. We show that the method is effective for arbitrarily many markers with small pedigrees.
EXISTING methods for likelihoodbased analysis of pedigree genotype data assume absence of crossover interference, even though such interference has been well documented in humans (Weekset al. 1993; Broman and Weber 2000; Linet al. 2001) and other species (Zhaoet al. 1995b; Bromanet al. 2002). Moreover, simulation studies have shown that failure to incorporate crossover interference into linkage analyses is inefficient (Goldgaret al. 1989; Goldsteinet al. 1995; Lin and Speed 1999).
The major reason that crossover interference models are not used is the lack of computationally feasible methods for working with more than a handful of linked markers. Lin and Speed (1996) present a method for exact computation of probabilities for pedigree data with the chisquare models; however, computational time increases exponentially with both the number of meioses in the pedigree and the number of markers, severely limiting its applicability. Thompson (2000a,b) has a Markov chain Monte Carlo approach for incorporating crossover interference, but it is limited to at most 12 loci per chromosome.
We propose a new method for calculating probabilities for pedigree genetic data that does incorporate crossover interference. The method is based on importance sampling, which is a Monte Carlo technique for estimating the value of an integral or sum (in this case probabilities of the data, expressed as a sum over unobserved inheritance indicators). Importance sampling involves evaluating the integrand at independently sampled realizations from a probability distribution that is roughly proportional to the integrand. Correct weighting of the sampled values gives an unbiased estimate that converges to the true value of the integral as the number of sample repetitions is increased. Because the samples are independent, the method is not plagued by the difficulties in assessing convergence encountered in use of Markov chain Monte Carlo (MCMC) methods.
The probabilities produced by this method may be used for multipoint linkage mapping, genetic map construction, and relationship inference. The method can be used with arbitrarily many linked markers for small pedigrees. For small pedigrees the method is fast enough that it could be used on a routine basis in analysis of pedigree genetic data.
BACKGROUND
It has long been known that the locations of multiple crossovers on a chromosome at meiosis are not independent, but exhibit crossover interference whereby the existence of a crossover at one location suppresses the occurrence of crossovers in nearby regions. A common approach is to use the Kosambi map function with methods that assume independence between recombination events in nonoverlapping intervals. This approach does not have any real effect, as genetic distances are used only for reporting results, while recombination fractions are used at all steps in the calculations. That is, observed recombination fractions are converted to genetic distances with the map function and are converted back to recombination fractions with the same map function for use in the analysis. As Speed (1996) points out, the only adequate way to incorporate crossover interference is with a point process model. Such an approach can account for nonindependence of recombination between intervals.
A useful family of point process models for crossover interference is the chisquare models. These are renewal models for the occurrence of crossovers on the fourstrand chromatid bundle. A parameter m controls the strength of interference: m = 0 corresponds to no interference, while m = 4 corresponds to approximately the level of interference found in humans (Lin and Speed 1996). The genetic distance between crossovers on the fourstrand bundle is modeled as the sum of m + 1 independent exponential random variables each with rate 2(m + 1). This sum follows a gamma, or scaled chisquare, distribution. A gamete receives only one of the four chromatids, and since each crossover involves four chromatids, the probability that a certain crossover on the chromatid bundle is seen in the gamete is onehalf. Assuming no chromatid interference (Zhaoet al. 1995a), whether or not one crossover on the bundle is transmitted to the gamete is independent of the outcome for other crossovers. McPeek and Speed (1995) and Zhao et al. (1995b) found that the chisquare models provide an excellent fit to available data. Zhao et al. (1995b) provide a method for calculating the probability of a pattern of recombination between consecutive markers for a single meiosis, which involves on the order of Lm^{2} calculations for L marker loci. We reproduce the details of this method in the appendix since our method makes use of these probabilities.
In pedigree data, the recombination patterns are generally not directly observed, due to unobserved individuals and insufficiently polymorphic markers. Thus, to calculate probabilities for genotype data requires summation over possible recombination patterns. For K meioses and L marker loci there are 2^{K}^{(}^{L}^{1)} possible recombination patterns, so this computation, if performed exactly, rapidly becomes infeasible for increasing numbers of markers and meioses.
When independence (i.e., nointerference) models for recombination are used, exploitation of the independence can reduce the number of computations to the order of LK2^{K} (Kruglyak and Lander 1998), so that large numbers of markers may be analyzed, while the size of the pedigree (measured by the number of meioses) must be restricted. McPeek and Sun (2000) describe a related approach for the chisquare models; however, the number of computations required by such an approach is proportional to L4^{K}^{(}^{m}^{+1)}, so that for moderate values of m, calculation is feasible only with extremely small pedigrees. Thus in general to calculate exact probabilities for pedigree data with the chisquare or other interference models involves summing over all 2^{KL} recombination patterns, resulting in a computational burden that increases exponentially with both the number of markers and the number of meioses (Lin and Speed 1996). This is the approach taken by Lin and Speed (1999) in their work showing the efficiency gains obtainable using the chisquare model in gene mapping with pedigree data. They were able to work with 12 meioses (a threegenerational family with four grandparents, two parents, and four children) and seven marker loci.
To reduce the computational burden while incorporating crossover interference, we propose an importancesampling approach. This approach takes advantage of the special algorithms for independence models, while using reweighting to allow for interference.
METHODS
We present a method for calculating probabilities for genotypic data under the chisquare models. The method is based on importance sampling of underlying unobserved inheritance patterns. We focus primarily on calculation of the likelihood, which is equal to the probability of the genotypic data under the assumed model. Likelihoods may be used to compare models, for example, to find the most likely genealogical relationship between individuals (Boehnke and Cox 1997). In addition, the methodology presented here may be used to calculate probabilities of inheritance patterns given the genotypic data, which are used in genetic map construction and multipoint linkage analysis, and we give details of those calculations at the end of this section.
Importance sampling of inheritance indicators: For each marker locus l and meiosis k, let X_{kl} be a zeroone inheritance indicator. The parent involved in the meiosis has two copies of the DNA, one maternal (0) and the other paternal (1). The indicator X describes which of these two copies was transmitted to the offspring at this locus. If X_{kl} = X_{k}_{,}_{l}_{+1} no recombination occurred on meiosis k over the interval between markers l and l + 1, while recombination occurred if X_{kl} ≠ X_{k}_{,}_{l}_{+1}. The inheritance pattern X represents the inheritance indicators over all meioses and loci.
We can write the probability of the genotype data Y as a sum over inheritance patterns
In what follows we assume that probabilities for genotypes Y_{l} at a marker l are conditionally independent of genotypes and inheritance patterns at other markers given the inheritance pattern X_{l} at the marker. This assumption holds if the markers are in linkage equilibrium, which will be approximately true if all the genotyped individuals come from a single homogeneous population and the markers are not too closely spaced. A consequence of this assumption is that probabilities of genotypes Y given inheritance patterns X do not depend on the crossover model, so P_{C}(YX) = P_{I}(YX) ≡ P(YX). Then P(YX) = P_{I}(YX) = P_{I}(XY)P_{I}(Y)/P_{I}(X) and we can write
Improved performance through resampling: This method is an example of sequential importance sampling (Liu 2001, Sect. 2.6.3), because the inheritance patterns X are sampled sequentially along each chromosome. Improvement, in terms of reducing the standard error of the estimate for a fixed number of iterations n, may be obtained by resampling (Liu 2001, Sect. 3.4.4). The idea is to concurrently sample a number of inheritance patterns X. After sampling the inheritance indicators at a couple of loci, write
We now give more details of the resampling algorithm, in which we apply the method of residual resampling described in Sect. 3.4.4 of Liu (2001). First we set a batch size B and a resampling interval T (choice of these values is discussed below). We start by sampling
Write
Let J = [(L  1)/T] be the total number of resampling points. After the samples are completed, the weight of the ith sample X^{(}^{i}^{)} is
A couple of issues are involved in best choice of resampling interval T. First, resampling involves computational time, and thus frequent resampling can be computationally expensive. The computational time for resampling is essentially independent of the pedigree size; thus the expense of resampling is most notable in computations on small pedigrees and is negligible relative to total computation time for larger pedigrees (such as the eightmeiosis pedigree for four fullsibs). Second, as the resampling interval increases, the standard error of the estimates tends to increase. This increase in standard error is particularly evident in larger pedigrees, which are also the pedigrees for which resampling is most beneficial. Thus small values of T (say 1 ≤ T ≤ 5) are best for large pedigrees, and large values of T (say T ≥ 5) are best for small pedigrees. We chose to use T = 5 for simplicity throughout the examples presented in results.
We note that it is not necessary (and is very inefficient) to calculate
Our implementation of the algorithm described above is available from the author on request.
Probabilities for linkage analysis and map construction: Linkage analysis and map construction are based on probabilities of inheritance indicators X_{l} at a locus l given the genotype information (Kruglyaket al. 1996) or the joint probabilities of inheritance indicators X_{l} and X_{l}_{+1} at two neighboring loci given the genotype information (Lander and Green 1987). Consider estimating.
RESULTS
We present results from analysis of simulated data, to demonstrate the capabilities of the proposed approach.
Our simulated data were designed to represent human chromosome 1, with an estimated genetic length of 2.9 morgans (Bromanet al. 1998). The other human chromosomes are shorter, and since our results (not shown) indicate that standard errors tend to be lower with shorter chromosomes because of smaller numbers of crossovers, chromosome 1 represents the “worst case.” We simulated sets of data with marker spacing of 1 cM (291 markers) and 10 cM (29 markers), with singlenucleotide polymorphisms (SNPs; allele frequencies 0.7 and 0.3), and with microsatellites (seven alleles with frequencies 0.4, 0.2, 0.2, 0.05, 0.05, 0.05, 0.05, and heterozygosity 0.75). Calculation was performed under the chisquare model with m = 4 [giving Pˆ_{C(}_{Y}_{)]} and, for comparison, under the independence model with the Kosambi map function [giving P_{I}(Y)]. We look at three relationships: two halfsibs (two meioses), an auntniece pair (five meioses), and four fullsiblings (eight meioses). The data were simulated under the chisquare model with m = 4, although this choice has little impact on the results. Table 1 shows results and computing times for one simulated chromosome for each map/relationship combination.
From the results in Table 1, we see that the type of marker has little impact on the computing time or standard errors, but does affect the magnitude of the likelihoods. Marker spacing affects computing time, but has little impact on standard errors. The most computingintensive part of the algorithm is sampling of the inheritance patterns from the distribution P_{I}(XY), which is of computational order L2^{K}, so that computing time doubles for each additional meiosis. Hence, with the current implementation, eight meioses is about the maximum that one would want to work with—10^{5} iterations for chromosome 1 with a 1cM map took ∼1 hr (fewer iterations and hence shorter computing time would suffice in some cases). Standard errors also increase with the number of meioses, so one would generally want to increase the number of iterations as pedigree size increases.
In considering whether an estimate of ln P_{C}(Y) is sufficiently precise, one minimal requirement is that the standard error be less than the difference between ln P_{C}(Y) and ln P_{I}(Y), where ln P_{I}(Y) is the natural log of the probability of the data under the independence model using the Kosambi map function. If this requirement is not satisfied, then the exact answer obtained under the independence model is a better estimate of P_{C}(Y) than that from the importance sampler. We note that the requirement is satisfied in all but 1 of the 12 examples looked at, and in most of the examples a much smaller number of iterations would have sufficed to meet this minimal requirement (in fact, <100 iterations, at approximately onethousandth of the computing time shown, would have sufficed in 7 out of 12 of the examples). Thus 10^{5} iterations are more than enough (based on this criterion) for this chromosome length, choice of model parameter m, and pedigree size, but more iterations may be required for longer chromosomes, larger pedigrees, or different values of m.
We found that resampling was very helpful in reducing the amount of computing time required to achieve a given standard error, with benefit increasing with pedigree size. For the auntniece and four fullsib relationships, savings in computational time due to the resampling were at least twofold and typically around fivefold.
DISCUSSION
We have presented an algorithm for calculating probabilities for pedigree genetic data under the chisquare family of interference models. The method is based on importance sampling and thus gives an approximate calculation with precision depending on the number of importance sampling iterations. The results shown indicate that the algorithm is feasible for pedigrees with up to eight meioses and for any number of markers. Calculated probabilities of the data may be used as likelihoods for relationship inference, by repeating the calculation with differing pedigrees. In addition, the approach may be used to calculate probabilities of inheritance patterns given the data for linkage gene mapping.
Computational time is higher than that under a nointerference assumption—in general, the computing time without interference would be approximately the time to perform one iteration of the importance sampler used here; however, the important point is that the computation times for this method scale the same with increasing marker density and pedigree size as do those for the basic nointerference algorithm of Lander and Green (1987), so that with increasing computing speed, the size of the problem that can be addressed with interference will lag only a couple of steps behind that for no interference.
We have presented this algorithm for the chisquare model only, but it is generalizable to other classes of models. To directly apply the approach here, it is necessary to be able to calculate the probability of a pattern of recombinations over a series of markers. Such calculation may not be computationally feasible for some models. A more natural approach would involve sampling of not only the recombination pattern but also the actual locations of crossovers on the chromatid bundle under the independence model and conditional on the genotype data Y. Such an approach has the potential to be very flexible, although it will add some noise, resulting in lower precision for a given number of iterations of the importance sampler.
APPENDIX
Probability of recombination pattern for chisquare models: We reproduce the method given in Zhao et al. (1995b) for calculating the probability of a pattern of recombination between consecutive markers for a single meiosis. Following their notation, let p = m + 1 and y = 2px, and let D_{s}(y) be the p × p matrix with the i, jth entry e^{}^{y}y^{ps}^{+}^{j}^{}^{i}/(ps + j  i)! (but with zero entries where ps + j  i < 0). The i, jth entry of D_{s}(y) represents the probability that over genetic distance x, s crossovers occurred on the fourstrand bundle, and the number of intermediate events (the events implied by the sum of m + 1 independent exponential random variables) since the last crossover changed from i to j (hence the zero terms, since j cannot be less than i if s = 0 crossovers occurred). For interval l between markers l and l + 1, let x_{l} be the genetic distance of the interval, y_{l} = 2(m + 1)x_{l} and define N_{k} = D_{0}(y_{l}) + ½ Σ_{s}_{≥1} D_{s}(y_{l}) and R_{l} = ½ Σ_{s}_{≥1} D_{s}(y_{l}). Then the probability of a recombination pattern over the L  1 intervals between L consecutive loci on a chromosome is given by (1/p)1M_{1}M_{2}... M_{L1}1′, where M_{l} = N_{l} whenever there is no recombination in the lth interval and M_{l} = R_{l} when there is recombination. See Zhao et al. (1995b) for the proof of this result.
Sampling inheritance patterns conditional on the genotype data under the independence model: Thompson (2000b, Sect. 8.4) describes how to sample the inheritance pattern X from P_{I}(XY), the probability distribution of the inheritance patterns given the genotype data Y under the independence model, and we give the details here. Let X_{l} = (X_{l}_{1}, X_{l}_{2},..., X_{lK}) be the inheritance vector at locus l, Y_{l} the genotype at locus l, and
Acknowledgments
The author thanks Reinhard Hopperger for programming assistance and Terry Speed and the referees for helpful comments. This work was supported in part by a Research Starter Grant in Informatics from the PhRMA Foundation.
Footnotes

Communicating editor: S. Tavaré
 Received May 27, 2002.
 Accepted March 7, 2003.
 Copyright © 2003 by the Genetics Society of America