| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 177, 1101-1116, October 2007, Copyright © 2007
doi:10.1534/genetics.107.074047
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

* Department of Cognitive Neuroscience/Biophysics, Institute for Computing and Information Sciences, Radboud University, 6525 EZ Nijmegen, The Netherlands and
Department of Information and Knowledge Systems, Institute for Computing and Information Sciences, Radboud University, 6525 ED Nijmegen, The Netherlands
1 Corresponding author: Department of Cognitive Neuroscience/Biophysics/126, Institute for Computing and Information Sciences, Radboud University, Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands.
E-mail: k.albers{at}science.ru.nl
| ABSTRACT |
|---|
|
|
|---|
Since the marker data are generally not informative enough to unambiguously infer the ordered genotypes, a probabilistic modeling approach can be used to deal with the uncertainties. The computer programs MERLIN (ABECASIS et al. 2002), GENEHUNTER (KRUGLYAK et al. 1996), and SUPERLINK (FISHELSON and GEIGER 2002; FISHELSON et al. 2005) reconstruct exact maximum-likelihood haplotype configurations in general pedigrees. Due to the exponential increase of computation time and memory usage with pedigree size (MERLIN, GENEHUNTER) or the tree width of the graphical model associated with the likelihood function (SUPERLINK), application of these programs to large pedigrees and many markers typical of QTL-mapping studies may not be feasible, especially when some of the individuals have missing genotypes or no genotype information at all. Approximate statistical approaches based on Markov chain Monte Carlo (MCMC) sampling (THOMPSON 1994; LANGE and SOBEL 1996; JENSEN and KONG 1999; THOMPSON and HEATH 1999; THOMAS et al. 2000; GEORGE and THOMPSON 2003) use the same likelihood function as the exact probabilistic approaches and consequently may achieve very high accuracy. MCMC methods can be generally applied and have modest memory requirements. Although in theory computation time does not scale exponentially with the problem size, in practice it can be very long and convergence of the Markov chain can be difficult to assess. An efficient statistical approach based on a heuristic approximation of conditional probabilities was proposed by GAO et al. (2004); however, it has been tested only on data sets with no missing genotypes.
To overcome problems of efficiency several nonstatistical approaches have been developed. WIJSMAN (1987) proposed a zero-recombinant haplotyping method that is linear in the number of markers and individuals. Recently, efficient algorithms were described by ZHANG et al. (2005; BARUCH et al. 2006). Application of these approaches is limited to data sets without forced recombination events. QIAN and BECKMANN (2002) presented a six-rule algorithm to reconstruct minimum recombinant haplotypes. Since computation time is quadratic in pedigree size and cubic in the number of markers, application to large data sets may not be practical. LI and JIANG (2004) proposed an expectation-maximization (EM) approach that approximately minimizes the number of recombination events. They also proposed an integer linear programming approach that minimizes the number of recombination events. Although computation time of the latter scales linearly with the rate of missing genotypes, it scales exponentially with the number of individuals. WINDIG and MEUWISSEN (2004) described an efficient haplotype reconstruction algorithm for general pedigrees. In spite of the improved efficiency of these methods, imputation of missing genotypes can be problematic due to the lack of a statistical treatment of missing data.
We present a statistical approach, implemented in the computer program CVMHAPLO, that combines the general applicability and accuracy of MCMC approaches with high efficiency. Our haplotype inference algorithm is an iterative procedure where each iteration consists of the following two operations: (1) Estimation of the marginal probabilities of all unassigned ordered genotypes conditioned on the assigned ordered genotypes and the marker data and (2) assignment of a number of the ordered genotypes with the highest conditional marginal probabilities.
Like SIMWALK2, it can be applied to any pedigree and any number of markers. It provides detailed information about the uncertainty in the inferred haplotypes. Computation time of CVMHAPLO scales approximately linearly with the number of markers and individuals.
Step 1 of the assignment procedure is generally intractable. Therefore we use the cluster variation method (CVM) (KIKUCHI 1951; MORITA 1990; YEDIDIA et al. 2005) to approximately compute the marginal probability distributions of ordered genotypes. The CVM is a variational approximation designed for efficient estimation of marginal probabilities in complex probability models for which exact computation is not feasible. The CVM estimates marginal probabilities by optimizing marginal distributions on overlapping subsets of variables for which exact probability calculus is feasible. The CVM was introduced as a method for estimation of equilibrium properties of materials consisting of interacting magnetic spins (KIKUCHI 1951) and has been used mainly for this purpose by physicists. YEDIDIA et al. (2005) established a connection between the most basic approximation of the CVM and the belief propagation algorithm (PEARL 1988), sparking interest in the computer science community. The CVM has been applied to problems in the fields of image restoration (TANAKA and MORITA 1995), computer vision (FREEMAN et al. 2000), interference in two-dimensional channels (SHENTAL et al. 2004), medical diagnosis (KAPPEN 2002), decoding of error-correcting codes (GALLAGER 1963; MCELIECE et al. 1998; KABASHIMA and SAAD 2004), predicting protein structure (PELIZZOLA 2005; KAMISETTY et al. 2007), and language processing (CROFT and TURTLE 1989). We refer to PELIZZOLA (2005) for a recent review of the CVM.
In previous work (ALBERS et al. 2006) we showed that the CVM may be used to obtain accurate approximations of parametric LOD scores in pedigrees without loops, comparing favorably with those of the MCMC program MORGAN (THOMPSON 1994; THOMPSON and HEATH 1999; GEORGE and THOMPSON 2003). The algorithm we propose here does not provide parametric linkage scores, but infers a single consistent haplotype configuration. It is an extension of our previous approach in that it can be applied to general pedigrees, allowing for inbreeding.
Our procedure to iteratively assign ordered genotypes is similar to that of GAO et al. (2004), who applied it to pedigrees without missing data. Gao et al. used a deterministic procedure to approximate conditional probabilities of ordered genotypes given a subset of the observations, whereas we use the CVM to approximate conditional probabilities given all observations in the pedigree. We show that our approach yields accurate reconstructions in problems with substantial amounts of missing data and that it also provides accurate estimates of posterior marginal probabilities of ordered genotypes.
We evaluate our approach in simulated and real data sets. We restrict the evaluation to single-nucleotide polymorphisms (SNPs) and discuss extension to markers with more than two alleles. We compare CVMHAPLO with exact maximum-likelihood approaches and the state of the art MCMC maximum-likelihood approximation of SIMWALK2.
| MATERIALS AND METHODS |
|---|
|
|
|---|
, which are the paternal and maternal alleles. For each nonfounder i in the pedigree and each marker l there are the paternal and maternal segregation indicators
. The founder and nonfounder individuals are denoted by F and NF, respectively. We denote the vector of all ordered genotype variables by G and the vector of all segregation indicators by v. Both G and v are unobserved experimentally. Instead, the observed genotypes consist of unordered pairs of alleles
for a subset of persons and markers. We denote by M the vector of all observed allele pairs. The marker map is assumed to be known; the recombination frequency between marker l and l – 1 is denoted by
l, l–1 and ml denotes the prior allele frequencies for marker l.
|
The algorithm CVMHAPLO:
Algorithm 1 shows CVMHAPLO in pseudocode. The ordered genotypes and segregation indicators are assigned to a specific value in a number of iterations, labeled by n. Lines 1–3 represent the initialization of the algorithm. Lines 4–17 represent the iterative assignment procedure.
In iteration n of the algorithm, we use the CVM to compute the approximate marginal probability of all unassigned ordered genotypes, conditioned on all observed genotypes M and conditioned on all ordered genotypes that have been assigned in all previous iterations, which we denote by
. The resulting conditional probability is denoted by
(line 5 in Algorithm 1). This is the computationally intensive step of the algorithm. It can be performed either exactly or approximately using the CVM. The latter approach is explained in the APPENDIX.
Subsequently, a number of ordered genotypes are assigned. All ordered genotypes for which
= 1 for some value of
are assigned, as well as an additional number of ordered genotypes in the following way. For each marker and person we compute
![]() |
![]() |
![]() |
![]() |
![]() |
= 0.9 and it is obtained when
= {A, A}. We sort the
in descending order (line 8) and select the pNL ordered genotypes with the highest value (line 9) (N denotes the number of individuals in the pedigree and L the number of markers and p is a percentage that is specified by the user).
In line 6 the partial haplotype configuration
of the previous iteration is checked for consistency as described in the APPENDIX. The consistency check verifies that the partial haplotype configuration has a nonzero likelihood under the probabilistic model. When an inconsistency is detected, it is assumed that too many ordered genotypes have been assigned per iteration, and the algorithm is reinitialized in lines 12–14 with a lower value of p.
In line 10,
is updated so that it contains all assigned ordered genotypes. The procedure of estimating marginal distributions and assigning ordered genotypes is repeated either until all ordered genotypes have been assigned or until a stopping criterion has been reached.
Confidence in the assignment:
When there are many missing values, there is a large uncertainty about the value of the ordered genotypes. In this case, maybe some ordered genotypes can be assigned with a relatively high confidence but others not. This is signaled by the conditional marginal probabilities computed above. For instance, in the above example it is clear that if the four probabilities are all 0.25, no reliable assignment can be made. In this case, it is clear that a full reconstruction of all ordered genotypes is likely to produce many errors and it is important to monitor the quality of the iterative assignment procedure. We suggest to use the values of the
as an indication of the reliability of the assignment procedure, in the following way. Denote by {i, l} the set of all ordered genotypes that have been assigned up to iteration n, and
is the probability of the assignment at the time that it was made. We define the confidence in the total assignment up to iteration n as the average of these assignment probabilities:
![]() | (1) |
Application of the cluster variation method:
Exact inference of the conditional marginal probabilities
requires a summation over an exponential number of configurations of the unassigned ordered genotypes and segregation indicators compatible with the marker data and previous assignments. This computation scales exponentially with problem size and is not feasible in practice for complex pedigrees and a large number of markers. Therefore we apply the CVM to compute these probabilities approximately. The idea of the cluster variation method is to avoid the exponential sum by optimizing marginal distributions of overlapping subsets of variables, i.e., the clusters. The subsets of variables must be chosen such that exact probability calculus on the corresponding cluster marginal distributions Q
(x
| ·), where
labels a cluster, is feasible. In essence, the CVM exactly models correlations between variables that are contained in the same cluster and approximates correlations between variables that are contained in different clusters. In the APPENDIX we provide mathematical details of the CVM; here we focus on the practical aspects of applying the CVM.
Obtaining approximations of the marginal distributions of the ordered genotypes with the CVM proceeds along the following lines. First the probabilistic model must be defined. We make use of the standard pedigree likelihood assuming Hardy–Weinberg equilibrium and linkage equilibrium; the specific form of the distribution is given in the APPENDIX. As a preprocessing step we eliminate a number of symmetries from the model, such as the unknown phase in the ordered genotypes of the founders (see the APPENDIX for details). Second, the CVM requires specification of the set of clusters B = {
1,
2, ...} that determines the approximation. Below we describe our choice of clusters for the problem of haplotype inference; this is the default cluster choice of CVMHAPLO.
Third, given the set of clusters and the probabilistic model, the cluster variation method prescribes that the so-called free energy
must be minimized with respect to the cluster marginal distributions to obtain the optimal approximation; i.e.,
. The minimization must be performed under the constraint that the clusters have identical marginal distributions on variables that are contained in more than one cluster. The CVM does not prescribe how this minimization must be performed; it provides only the analytic form of the functional
in terms of the marginal distributions
, the parameters of the probabilistic model, the marker data, and the assigned ordered genotypes. The assumption is that the specific form, which is given in the APPENDIX, yields accurate approximations. We apply the provably convergent double-loop algorithm described by HESKES et al. (2003) to perform the numerical minimization of the CVM free energy.
Finally, after the numerical minimization procedure has converged, the marginal distribution of an ordered genotype can be obtained by straightforward marginalization of the marginal probability distribution of one of the clusters, e.g.,
, that contains the ordered genotype of interest:
![]() |
Specification of the clusters for CVMHAPLO:
For the purpose of haplotype inference we have chosen the clusters such that the corresponding CVM approximation can be applied to any pedigree, regardless of inbreeding and size, and the numerical minimization can be performed within a reasonable time and a reasonable amount of memory usage for large problems. Computation time and memory usage of the CVM increase exponentially with cluster size, but approximately linearly with the number of clusters. The accuracy of the CVM approximation generally increases with cluster size, resulting in a trade-off between accuracy and efficiency.
For every nonfounder individual i and each pair of adjacent markers l and l + 1, we define the cluster
![]() | (2) |
are not explicitly included in the cluster. Because their value depends only on the unobserved genotype variables through the conditional probability tables
(see the APPENDIX for details), as a preprocessing step we integrate over
before applying the CVM. With this choice the number of clusters scales linearly with both the number of individuals and the number of markers, irrespective of the pedigree structure. As we will show, computation time and memory usage for this choice of clusters are acceptable, while the accuracy of the approximation is high.
Illustration of CVMHAPLO:
Here we demonstrate the procedures with a simple example. We consider a family consisting of a father, a mother, a daughter, and a son. Genotype data are simulated for three markers with recombination fractions of 0.05. The true ordered genotypes are
|
![]() |
![]() |
![]() |
![]() |
) and segregation indicators (e.g.,
) of the child and the genotype variables of both parents (e.g.,
) for the two markers in the cluster. Thus the genotype variables and segregation indicators of the children defined for marker 2 are contained in two clusters; the genotype variables of the parents defined for marker 2 are contained in all four clusters. With this set of clusters the CVM will yield approximate probabilities.
With p = 0.5%, CVMHAPLO requires four iterations to reconstruct the haplotypes. In Table 1 the marginal distributions of the ordered genotypes as computed with the CVM, and the ordered genotypes that are assigned from these marginals, are shown for all four iterations. In the first iteration all homozygous genotypes can be assigned, since the corresponding marginal distributions indicate that one configuration has probability one. Also the ordered genotypes of the daughter and son for marker 2 can be assigned as these are unambiguously determined by the homozygous genotypes of the parents. The heterozygous ordered genotype of the father for marker 1 is assigned in the first iteration since it has the highest qmap. Symmetry has been removed (see the APPENDIX) by fixing the paternal segregation indicator of the daughter at the middle marker such that the father transmits his paternal allele. As a result, the father is most likely to transmit his paternal allele to the daughter at the first marker as well. Since the daughter must have received the "B" allele from the father at this marker, this implies a probability of 1.0 – recomb. frac. = 0.95 for the ordered genotype "BA" of the father at the first marker. Ordered genotype "AB" implies a recombination event and therefore has probability 0.05. In the second iteration the marginal distributions are reestimated conditioned on the marker data and the ordered genotypes assigned in the first iteration. The ordered genotype of the father for marker 3 now has the highest probability, qmap = 0.9, and is assigned: the MAP configuration is
. In the third iteration the ordered genotype of the daughter for marker 3 is assigned and finally in the fourth iteration the ordered genotype of the mother for marker 3 is assigned.
|
Data sets:
We evaluated CVMHAPLO on two pedigrees that were taken from experimental linkage studies. Pedigree I is an extended pedigree and concerns an affected/not-affected disease with a complex mode of inheritance. It consists of 53 individuals, of which 13 have been genotyped with an Affymetrix (Santa Clara, CA) 10K SNP array. The pedigree is shown in Figure 2. Pedigree II is a complex pedigree of 368 individuals, of which 262 were genotyped for eight SNP markers spanning
0.08 cM. It is taken from a QTL fine-mapping study in a chicken population. The pedigree is shown in simplified form in Figure 3. Pedigree IIsub contains a subset of the individuals in pedigree II and is shown in Figure 4. Exact computations are feasible in this pedigree. In all analyses the Haldane mapping function was used. Details of the data sets analyzed are given in Table 2. We used the computer program MEGA2 (MUKHOPADHYAY et al. 2005) to create input files for SIMWALK2.
|
|
|
|
Implementation:
The implementation of CVMHAPLO was done in C++ and compiled with gcc version 4.1.1.
Hardware:
All simulations were performed on a cluster of five machines with two dual-core 2.4-GHz AMD64 processors each and 4 GB of physical memory available per processor, running the Linux operating system.
| RESULTS |
|---|
|
|
|---|
with the exact marginal distributions
, using the absolute difference as error measure. Unfortunately there are no linkage programs that provide exact marginal distributions; therefore we implemented the junction tree algorithm (JENSEN 1996; JENSEN and KONG 1999) to calculate these. Forty replicates of five markers were simulated for pedigree I (configuration A in Table 2). The marker distances were, respectively, 0.52, 0.32, 0.24, and 0.17 cM. Figure 5 shows a scatter plot of the exact marginal probabilities vs. the approximate CVM marginal probabilities. The mean error was 0.0044 ± 0.012. Eight of the CVM estimates had an error >0.25; Figure 5 shows that these correspond to exact marginal probabilities that were close to 0.5. The error of the CVM estimates was generally smaller for exact marginal probabilities close to 1 and 0. This is relevant because the ordered genotypes that correspond to these extreme marginal probabilities are the ordered genotypes with the least uncertainty that will be assigned by CVMHAPLO in every iteration, while the ordered genotypes with the most uncertainty will not be assigned.
|
Accuracy of the inferred haplotypes:
In this section we evaluate the accuracy of the reconstructed haplotypes in simulated data sets where the true inheritance was known. We define accuracy as the percentage of assigned ordered genotypes equal to the true simulated ordered genotype.
Comparison with exact maximum-likelihood methods:
We assessed the performance of CVMHAPLO in two pedigrees for which exact calculation of the maximum-likelihood haplotype configuration was feasible. We analyzed pedigree I with five markers using our own implementation of the junction tree algorithm and pedigree IIsub with eight markers using SUPERLINK (configurations B and G in Table 2). Replicates were simulated with 30, 40,..., 90% of the individuals genotyped for all markers. The genotyped individuals were chosen at random. For each percentage of genotyped individuals 40 replicates were simulated, resulting in a total number of 280 replicates per pedigree.
Columns two and three in Table 3 show that the accuracies of, respectively, the exact maximum-likelihood haplotype configuration and the haplotype configuration obtained with CVMHAPLO were similar for all percentages of genotyped individuals, for both pedigrees. The fourth and fifth columns show the log-likelihoods log P(Gassigned, vassigned, M) of the corresponding haplotype configurations. For both pedigrees the log-likelihoods of CVMHAPLO were very close to the exact maximum log-likelihoods when the percentage of genotyped individuals was high and slightly lower when the percentage of genotyped individuals was low. Although in theory higher likelihoods should result in higher accuracies, we find that the differences in the likelihoods did not significantly affect the accuracy.
|
99%. The last column shows that the percentage of assigned ordered genotypes in the partial haplotype configuration decreased when fewer individuals had genotype information. In pedigree I the percentage of assigned ordered genotypes was lower than the percentage of genotyped individuals, while it was higher in pedigree IIsub, indicating a nontrivial dependence of the accuracy on the structure of the pedigree and distribution of the genotyped individuals over the pedigree. These results show that (1) provides a useful stopping criterion to obtain partial assignments of high accuracy. To assess the performance of CVMHAPLO in the real data sets of pedigrees I and IIsub, we performed the simulations of Table 3, however, simulating genotype data for the same individuals as in the real data set (configurations A and H in Table 2) rather than for individuals selected at random. For pedigree I we found that the log-likelihoods of CVMHAPLO were on average 2.03% lower than the exact maximum log-likelihood, but that the accuracy was higher (75.95 and 73.02%, respectively). For pedigree IIsub we found that the log-likelihoods of CVMHAPLO were on average 2.02% lower than the exact maximum log-likelihood, but that the accuracies were comparable (91.56 and 92.05%, respectively). These results are compatible with the results of Table 3: when the pedigree contains untyped individuals the accuracy of CVMHAPLO is comparable to the accuracy of the exact maximum-likelihood approach, while the log-likelihoods are slightly lower.
When applied to the real data sets of pedigree I and pedigree IIsub (configurations A and H), CVMHAPLO yielded a haplotype configuration with a log-likelihood that was 2.3% lower than the exact maximum log-likelihood for pedigree I and a log-likelihood that was equal to the exact maximum log-likelihood for pedigree IIsub. These results suggest that CVMHAPLO will also accurately infer haplotypes in the real data sets.
We conclude that the accuracy of the haplotype configurations inferred with CVMHAPLO was high and similar to the accuracy of exact maximum-likelihood haplotype configurations.
Comparison with SIMWALK2:
For pedigree I 40 replicate data sets with 20 markers were simulated and for pedigree II 8 replicate data sets with 8 markers were simulated (respectively configurations C and F in Table 2). The number of replicates for pedigree II was relatively small due to the long computation times of SIMWALK2. Exact computation of the maximum-likelihood haplotype configuration was not feasible in these data sets. Figure 6 shows the accuracy as a function of the percentage of assigned ordered genotypes of CVMHAPLO and SIMWALK2 for both pedigrees.
|
When all alleles in the haplotypes are assigned (100% on the horizontal axis), the accuracy of CVMHAPLO was not signficantly different from the accuracy of SIMWALK2, both in the subset of genotyped individuals and in the full pedigree. As expected, the accuracy of SIMWALK2 and CVMHAPLO was higher in the subset of genotyped individuals. The likelihoods of the haplotype configurations inferred with CVMHAPLO were slightly lower than the likelihoods of the haplotype configurations inferred with SIMWALK2: in pedigree I the mean difference in the log-likelihood was –1.6 ± 1.2%; in pedigree II the mean difference was –3.3 ± 2.3%. Apparently the lower likelihoods did not significantly affect the overall accuracy, in agreement with our previous results.
In the case of partial assignments, we infer from Figure 6 that the accuracies of SIMWALK2 and CVMHAPLO are similar for the genotyped individuals and that the accuracy of CVMHAPLO is significantly higher for the individuals without genotype information. The accuracy of CVMHAPLO was very high when only the ordered genotypes with high confidence (Equation 1) were assigned and decreased as more ordered genotypes were assigned. In contrast, the criterion used by SIMWALK2 to flag the alleles that could be assigned with certainty was more coarse. This difference was more pronounced in pedigree I than in pedigree II. We attribute this difference to a large extent to the larger percentage of missing data in pedigree I. We conclude that CVMHAPLO gives more accurate partial assignments than SIMWALK2 when the percentage of missing values is high.
Scaling with the number of markers:
To assess the scaling of the accuracy of CVMHAPLO with the number of markers, we analyzed 10 replicates with 20 markers and 10 replicates with 200 markers for pedigree I (configurations C and D in Table 2, respectively). To obtain replicates with comparable marker informativeness, the replicates with 200 markers consisted of 10 adjacent blocks of the markers in the replicates with 20 markers. Analysis of the replicates with 200 markers with CVMHAPLO was feasible, whereas SIMWALK2 did not converge in reasonable time (1 week for a single replicate).
Figure 7 shows that the average accuracy for the replicates with 20 markers and 200 markers was similar. We conclude that the accuracy of CVMHAPLO does not degrade with the number of markers.
|
We did not find a significant effect of LD on the accuracy of the inferred haplotypes for pedigrees I and II, either for the genotyped individuals or for the individuals without genotype information (results not shown). We found this to be the case for both CVMHAPLO and SIMWALK2. In the presence of LD we also observed that the log-likelihoods of the fully reconstructed CVMHAPLO haplotype configurations were slightly lower than the log-likelihoods of the haplotype configurations of SIMWALK2, similar to what we found in data sets simulated without LD.
Evaluation of the confidence measure:
Figure 6 demonstrates that it would be useful to have an indication of the reliability of the (partial) haplotype configuration, as the accuracy decreased significantly when a larger subset of ordered genotypes was assigned. For these replicates we therefore compared the confidence level from Equation 1 to the accuracy of the (partial) haplotype configuration inferred in iteration n of CVMHAPLO, for all n. Figure 8 shows for a given confidence level the mean accuracy of the corresponding haplotype configurations, where the leftmost circle and square data point correspond to the haplotype configurations where all the ordered genotypes were assigned (the haplotypes obtained in the final iteration). We see that for pedigree I the confidence was lower than the accuracy, but still highly correlated with it. For pedigree II the confidence of the haplotype configurations gave a very good indication of the accuracy. The difference is most likely due to the fact that the marker data in pedigree II were more informative than those in pedigree I. We conclude that the confidence measure (1) gives a useful indication of the accuracy (assuming absence of genotype errors) and may be used to control the accuracy of the inferred haplotypes.
|
For all analyses performed with CVMHAPLO we used a fixed value of p = 0.5% for the percentage of ordered genotypes assigned in every iteration, independent of the number of markers and individuals. Theoretically, for a fixed percentage p computation time of CVMHAPLO is expected to scale linearly with the number of markers and approximately linearly with the number of individuals depending on the pedigree structure, which is confirmed by the results shown in Table 4. Although CVMHAPLO required more memory, it was significantly more efficient than SIMWALK2 for the complex pedigree II and scaled more favorably with the number of markers.
|
| DISCUSSION |
|---|
|
|
|---|
It is interesting to note that although our approach does not explicitly maximize the likelihood, it inferred haplotype configurations with nearly optimal likelihood when full genotype information was available. In the case of missing genotypes, the log-likelihood of the inferred haplotype configurations was
2% lower than the exact maximum log-likelihood; however, the accuracy was not significantly different. Our results suggest that the assignments that are suboptimal in the sense of the likelihood are limited to the ordered genotypes that cannot be inferred with high certainty from the marker data.
An important parameter of CVMHAPLO is p, the percentage of ordered genotypes assigned in every iteration. In general, for smaller values of p the accuracy will be higher and the computation times longer. In our experience values of p < 0.5% did not yield significantly higher accuracies. With p = 0.5%, for only 4 replicates of the
700 replicates analyzed the algorithm required a restart with p = 0.25% to produce a consistent configuration; in all of the other replicates a consistent configuration was found with the initial value of p = 0.5%. For higher values of p the number of replicates where CVMHAPLO had to be restarted increased somewhat. Therefore we recommend to use the default value of p = 0.5%, but it can be adjusted by the user. We plan to investigate whether computationally inexpensive heuristics can be devised to prevent inconsistent assignments and consequently restarts of the algorithm.
We compared our approach to the approximate maximum-likelihood haplotyping algorithm of SIMWALK2, since like our approach, SIMWALK2 is a statistical approach that does not require absence of recombinations or tightly linked markers and does not assume the number of recombinations to be minimal. Furthermore, SIMWALK2 is commonly used by practitioners. In previous work on the estimation of parametric LOD scores in pedigrees without inbreeding (ALBERS et al. 2006), we showed that our approach based on the CVM was more efficient than the MCMC sampler implemented in the computer program MORGAN (THOMPSON 1994; THOMPSON and HEATH 1999; GEORGE and THOMPSON 2003). Since MORGAN has no option for haplotyping and cannot be applied to general pedigrees, we believe a comparison here would not be of added value. We also considered the integer linear programming algorithm implemented in PEDPHASE (LI and JIANG 2004), since this algorithm does not require absence of recombinations. On the simulated data sets for pedigree I (configuration C in Table 2), the accuracy was on average 10% lower than that of CVMHAPLO, and the log-likelihoods were on average 50% lower than those of SIMWALK2. It could not analyze the real and simulated data sets (configuration F) for pedigree II within 1 week. The block-extension algorithm in PEDPHASE produced inconsistent output for the data sets simulated for pedigree I and terminated with error status for the data sets simulated for pedigree II. Finally, on simulated data (400 SNPs covering 100 cM) in a pedigree of 400 outbred mice, where a small number of parents had many offspring, we found that our approach was more accurate than the approach described by WINDIG and MEUWISSEN (2004), although it was less efficient (results not shown). This approach requires that sufficient genotyped offspring are available for each parent and therefore we expect that it will be less accurate than SIMWALK2 and CVMHAPLO on the data sets considered in this article, especially for pedigree I. For these reasons we have not included this approach in our comparison.
Like SIMWALK2, our haplotype inference algorithm currently does not explicitly account for linkage disequilibrium between the markers. In dense SNP panels, such as the Affymetrix 10K and Illumina 6K panels, significant marker–marker LD has been shown to be present (PERALTA et al. 2005). LD can lead to a bias in LOD scores when the marker alleles are assumed to be in equilibrium, especially when parental genotypes are missing (HUANG et al. 2004; ABECASIS and WIGGINTON 2005), demonstrating that missing data may strongly affect identity-by-descent probabilities. SCHAID et al. (2002) showed in a real data set that haplotype frequencies estimated in unrelated individuals with an expectation-maximization algorithm differed significantly from haplotype frequencies estimated in pedigrees with GENEHUNTER under the assumption of linkage equilibrium. They suspected that strong LD was responsible for this discrepancy. Our finding that there was no significant effect of LD on the accuracy appears to be in contradiction with these findings, but we believe the difference can be explained by the fact that we evaluated the accuracy of a single haplotype configuration, while the LOD scores investigated by Huang et al. and the haplotype frequencies estimated by Schaid et al. may be more sensitive to violations of the assumption of linkage equilibrium.
To determine the effect of LD we introduced haplotype frequencies, but allowed for recombination between markers in the same haplotype block. As pedigree I was taken from a linkage study in a human population, it is more realistic to have (virtually) no recombination between markers in the same block. We found that also in this case the haplotyping accuracy was not affected by LD.
The issue of LD is a modeling issue and therefore in principle unrelated to the issue of the quality of the CVM approximation, although, of course, the quality of the CVM approximation may depend on the model. The CVM approximation can be applied to any probabilistic model and in particular to a pedigree-likelihood model that includes LD. Pairwise modeling of LD between markers would not require a different choice of the clusters in the CVM approximation. This is a direction for further research.
We have shown results for the CVM cluster choice shown in Figure 1. This cluster choice provides a good trade-off between accuracy and efficiency for a wide range of pedigrees. We found that larger clusters consisting of the variables of three markers for an individual and its parents (instead of two as in Figure 1) may further increase accuracy. If many individuals are genotyped the increase in computation time is often still acceptable as the number of compatible configurations is limited by the observations. One may also choose smaller clusters; however, if many individuals are genotyped the gain in efficiency may be limited. Our current implementation of the algorithm offers several cluster choices.
We applied our algorithm to data sets consisting of SNP markers only. Currently our software does accept multiallelic markers. Due to the increased state space in the case of multiallelic markers, the efficiency of our implementation is not as high as in the case of SNPs. Work is in progress to improve the efficiency for multiallelic markers by applying additional preprocessing techniques and using clusters with fewer variables in the CVM approximation.
In this article we assumed absence of genotyping errors. In practice this will rarely be the case. A simple heuristic for error detection is to locate unlikely double recombinants. These can be inferred from the marginal distributions over segregation indicators of adjacent loci, which can be trivially obtained from the cluster marginal distributions. However, it is preferable to use an error model as proposed by SOBEL et al. (2002). Such an error model can be relatively easily incorporated into our approach, at the expense of a larger state space. Although the efficiency of CVMHAPLO will be reduced, we believe that the additional computational expense may be well justified. In a preliminary analysis of a real data set of 1600 animals genotyped for 14 closely linked markers, we found that inclusion of an error model significantly reduced the number of unlikely recombinant haplotypes, which suggests that haplotype effects can be estimated with better power. These results will be presented in a separate publication. We plan to incorporate the modeling of genotyping errors in future versions of CVMHAPLO.
The program CVMHAPLO is freely available for noncommercial use and can be downloaded from our website at http://www.mbfys.ru.nl/
keesa/cvmhaplo/.
| APPENDIX |
|---|
|
|
|---|
A Bayesian network representation:
The full probability distribution is given by
![]() | (A1) |
are given as well as the
of the person markers with genotype information.
The cluster variation method:
Here we describe the CVM for the case that no ordered genotypes have been assigned and only unordered genotype observations are available. The case in which ordered genotype assignments are available does not require a different treatment, as assignments of ordered genotypes can be modeled as observed genotypes for which both the value of the two alleles and the ordering of the alleles are known.
The idea of the cluster variation method (KIKUCHI 1951; MORITA 1990; YEDIDIA et al. 2005) is to approximate the intractable probability distribution
![]() |
1,
2, ...}: a collection of overlapping subsets of variables, chosen such that the corresponding approximate marginal distributions Q
(x
) are feasible for exact probability calculus. For ease of notation we do not explicitly state that Q
(x
) is conditioned on the marker data M and assigned ordered genotypes Gassigned and that x = (v, G). We define I as the set of clusters that consists of all clusters that can be formed by the intersection of clusters in B and in I. Thus any intersection of clusters is contained in I. The choice of B determines the approximation and fully determines I. The following restrictions on the choice of the set of clusters B hold:
B that contains all variables of the conditional probability table.
1
B is a subset of another cluster
2
B.
To motivate the formalism of the CVM, we first observe that the exact posterior distribution
can be obtained by minimizing the exact free energy defined as
![]() |
(x) is the right-hand side of Equation A1. This can be verified by simple differentiation with respect to Q. Since the functional Fexact(Q) itself is generally intractable to evaluate, the cluster variation method proposes to minimize the approximate free energy
![]() | (A2) |
(x
). Here
defines the collection of all approximate cluster marginals {Q
(x
):
B
I}.
The cluster potential functions 
are defined by the conditional probability tables of the Bayesian network in Equation A1:
![]() | (A3) |
(n) denotes the set of variables on which variable xn is conditioned. Thus, the product of conditional probability tables that defines a potential function 
may contain the tables associated with the allele transmissions,
as well as unordered genotype observations
. Note again that the variables
are not explicitly included in the clusters since they are observed.
The Moebius coefficients a
satisfy
![]() |
The constraints in Equation A2 are consistency and normalization constraints. The consistency constraints are
![]() |
![]() |
Thus, the approximate free energy
is a sum of the free energies associated with each cluster
B
I multiplied by the Moebius coefficient a
. The generalized belief propagation (GBP) algorithm (PEARL 1988; MURPHY et al. 1999; YEDIDIA et al. 2005) is a fixed-point iteration scheme that finds extrema of
. However, GBP does not always converge. Therefore we use the convergent double-loop algorithm described by HESKES et al. (2003) to minimize
. The idea of the double-loop algorithm is to iteratively minimize convex upper bounds on the free energy. At each iteration of the algorithm a convex upper bound is calculated (the outer loop) that is minimized in the inner loop. The algorithm always converges to a (local) minimum of
, provided the inner loop has converged.
For clarity we note that the free energy is not explicitly used in the iterative assignment procedure for the reconstruction of the haplotypes. It is merely used as a suitable optimization criterion for inferring approximate marginal distributions. In special cases of tree-like probabilistic graphical models minimization of the free energy yields exact marginal distributions (PELIZZOLA 2005; YEDIDIA et al. 2005); then the free energy
can be interpreted as a distance measure between the exact distribution P and the CVM distribution
(YEDIDIA et al. 2005).
Consistency of the assignment:
When p of the ordered genotypes are assigned simultaneously in one iteration, it is possible that the resulting
is inconsistent in the sense that this configuration has zero probability under the probability model of equation A1. This is not likely to happen if p is chosen sufficiently small. Our implementation of CVMHAPLO automatically detects inconsistent assignments as follows. When assignments are inconsistent, for one or more of the CVM cluster marginal distributions as a result Q
(x
) = 0,
x
; i.e., all states have zero probability. When all alleles in the pedigree are assigned and no inconsistency is detected like this, the inferred haplotype configuration is consistent.
Preprocessing:
Given our cluster choice, the CVM is exact for a single marker if the pedigree has no loops (PELIZZOLA 2005). As a result, the CVM eliminates all inconsistent genotypes in this case by assigning these configurations a probability of zero in the cluster marginal distributions. Even if the pedigree contains loops, many inconsistent genotypes will be eliminated. As a preprocessing step we apply the CVM to each marker independently and determine which configurations in the cluster marginal distributions are assigned probability zero. These configurations need not be considered in the subsequent haplotype reconstruction procedure. This preprocessing step is highly similar to the genotype elimination algorithm proposed by LANGE and GORADIA (1987) and may yield significant speedups.
Removal of symmetries:
The probabilistic model defined by Equation A1 contains a number of symmetries. The first symmetry concerns the haplotypes of the founders: since by definition the founders do not have parents that are included in the pedigree, it is impossible to determine which haplotype is paternal and which is maternal. The second symmetry occurs when a father and a mother are founders and also do not have genotype information. In this case a haplotype that is found in one of the children could be inherited from either parent with equal probability (assuming no recombination). More symmetries may be present, but in general it is hard to enumerate them all.
We find experimentally that CVMHAPLO yields better results when these symmetries are removed before application of the CVM. We remove the first symmetry by fixing for one child of every founder the parental source of the allele inherited from the founder for one marker. The second symmetry is removed by fixing in one grandchild of every untyped pair of founder parents the parental source of the inherited allele for one marker.
We choose not to fix segregation indicators of more than one marker for each chromosome, as this may lead to inconsistent configurations. See the discussion of SILLANPÄÄ and ARJAS (1998) on this topic in the context of MCMC approximations.
Algorithm 1—CVMHAPLO:
1
for all individuals i and loci l.
is consistent then
![]() | (A4) |
![]() | (A5) |
such that
.
0
n + 1 | ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|