Genetics, Vol. 177, 1929-1940, November 2007, Copyright © 2007
doi:10.1534/genetics.107.079525

* University of Goettingen, Institute of Animal Breeding and Genetics, 37075 Goettingen, Germany and
State Key Laboratories of Agrobiotechnology, Key Laboratory for Animal Breeding and Genetics of Ministry of Agriculture of China, College of Animal Science and Technology, China Agricultural University, Beijing, 100094, China
1 Corresponding author: College of Animal Science and Technology, China Agricultural University, Beijing, 100094, China.
E-mail: qzhang@cau.edu.cn
| ABSTRACT |
|---|
There is a growing number of articles on haplotype inference for unrelated individuals (CLARK 1990; EXCOFFIER and SLATKIN 1995; STEPHENS et al. 2001), but more and more studies show that haplotype inference through close relatives, especially in nuclear families, can be an alternative strategy, as family information can reduce phase ambiguity and improve the efficiency of haplotype frequency estimates (HODGE et al. 1999; ROHDE and FUERST 2001; BECKER and KNAPP 2002; SCHAID 2002). However, these methods mainly consider nuclear families with both parents and one child (trios). When diseases with onset in adulthood or old age are studied, it may be impossible to obtain marker genotypes for the parents of the affected offspring, so that only full-sib information is available; parental genotypes may also be unavailable for other reasons. It is therefore essential to develop efficient approaches to handle such families.
The existing computational methods for haplotyping fit into two categories: statistical methods and rule-based methods. The rule-based approaches (QIAN and BECKMAN 2002; LI and JIANG 2003; GAO et al. 2004; BARUCH et al. 2006) are deterministic and fast and thus can handle large pedigrees with dense markers. However, they normally do not provide numerical assessments of the reliability of their results, and the utility of rule-based approaches for nuclear families remains unknown (NIU 2004). On the other hand, statistical approaches are flexible in tackling nuclear families (ROHDE and FUERST 2001; BECKER and KNAPP 2002; DING et al. 2006), although they are time-consuming and thus may not be suitable for large pedigrees.
Maximum likelihood via the expectation-maximization (EM) algorithm (DEMPSTER et al. 1977) is a widely used statistical approach for haplotype inference. EXCOFFIER and SLATKIN (1995) were the first to propose a maximum-likelihood-based approach for haplotype frequency estimation for unrelated individuals. EM-based approaches that do not assume linkage equilibrium among the loci were suggested for various types of complete (ROHDE and FUERST 2001; BECKER and KNAPP 2002) or incomplete (DING et al. 2006) nuclear family data. Their performance was shown to be superior to that of the Lander–Green algorithm (LANDER and GREEN 1987) implemented in GENEHUNTER (KRUGLYAK et al. 1996), which, like other linkage analysis programs, assumes complete linkage equilibrium between the loci (BECKER and KNAPP 2002; DING et al. 2006).
Several methods have been suggested for haplotype inference using sibship data (BECKER and KNAPP 2004; HORVATH et al. 2004; LIU et al. 2006), which have their own strengths and weaknesses. In this article, we propose a new maximum-likelihood-based method for haplotype reconstruction and estimation of haplotype frequencies using full-sib families, which allows genetic markers to be in linkage disequilibrium and assumes that no recombination occurs between the markers.
In our study, we first introduce the general idea of the new algorithm. Most of the technical details are presented in the APPENDIX. We then report the outcome of a simulation study showing that our approach results in a higher accuracy of the estimation of population haplotype frequencies and of reconstructed individual haplotypes. In the DISCUSSION we provide arguments to explain the better statistical properties of our procedure compared with the established methods, and we discuss options to overcome practical problems and limitations, e.g., missing genotypes and the restriction in the number of loci processed simultaneously.
| METHODS |
|---|
termed full-sib haplotype set (FSHS), where G1 denotes the diplotype of sib 1 in the ith FSHS of family f.
The likelihood function:
Following similar arguments presented by EXCOFFIER and SLATKIN (1995), for a sample of m families with only full sibs, the likelihood function of the population haplotype frequencies is defined as
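$$L(p_1, \ldots, p_h) = \prod_{f=1}^{m} \sum_{i=1}^{S_f} P(\mathrm{FSHS}_i^f) \tag{1}$$
where $p_1, \ldots, p_h$ are the population haplotype frequencies, $\mathrm{FSHS}_i^f$ is the ith FSHS for family f with nf full sibs, and Sf is the number of possible FSHS in family f.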
The EM algorithm:
The EM algorithm iterates between the expectation step and the maximization step until the haplotype frequency estimates converge (i.e., until the changes in every haplotype frequency between consecutive iterations are less than some small value).
To implement the EM algorithm, a set of initial values is required. It is assumed that given the phase-unknown genotypes of family f, all the possible FSHSs for family f have the same probability; i.e.,
$$P^{(0)}(\mathrm{FSHS}_i^f) = \frac{1}{S_f} \tag{2}$$
![]() | (3) |
where nf is the number of full sibs in family f and the indicator variable is equal to the number of times that haplotype t is present in the ith FSHS; its possible values are 0, 1, ..., 2nf.
In the expectation step in the gth iteration, the haplotype frequencies obtained in the previous iteration are used to calculate the probability of each possible FSHS for family f as
![]() | (4) |
![]() | (5) |
In these expressions, one term is the probability of the jth parental combination given the estimates of population haplotype frequencies in the gth iteration, and the other is the probability of the FSHS conditional on the jth possible parental combination. Iterating between the E-step, using Equation 4 to update the probabilities of all FSHSs, and the M-step, using Equation 3 to calculate all haplotype frequencies, the EM algorithm yields the maximum-likelihood estimates of the population haplotype frequencies when an adequate convergence criterion is reached.
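The alternation between the E-step and the M-step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the FSHAP implementation: the function names (`child_prob`, `fshs_prob`, `m_step`, `em`) are hypothetical, parents are assumed to be drawn in Hardy–Weinberg proportions, parental combinations are enumerated by brute force over all haplotype quadruples, the M-step is assumed to be plain gene counting over all sibs, and the candidate FSHSs of each family are taken as given input.

```python
from itertools import product


def child_prob(child, father, mother):
    """P(unordered child diplotype | parental diplotypes), no recombination."""
    target = sorted(child)
    hits = sum(1 for f in father for m in mother if sorted((f, m)) == target)
    return hits / 4.0


def fshs_prob(fshs, freqs):
    """P(FSHS) = sum_j P(parental combination j) * P(FSHS | combination j)."""
    haps = [h for h, p in freqs.items() if p > 0.0]
    total = 0.0
    for f1, f2, m1, m2 in product(haps, repeat=4):
        # Assumption: parental haplotypes are drawn independently (HWE).
        p = freqs[f1] * freqs[f2] * freqs[m1] * freqs[m2]
        for child in fshs:
            p *= child_prob(child, (f1, f2), (m1, m2))
            if p == 0.0:
                break
        total += p
    return total


def m_step(families, weights, haps):
    """Gene counting: expected haplotype counts over all sibs, normalised."""
    counts = dict.fromkeys(haps, 0.0)
    n_haps = 0
    for cands, w in zip(families, weights):
        n_haps += 2 * len(cands[0])  # two haplotypes per sib
        for fshs, w_i in zip(cands, w):
            for diplotype in fshs:
                for h in diplotype:
                    counts[h] += w_i
    return {h: c / n_haps for h, c in counts.items()}


def em(families, haps, tol=1e-8, max_iter=1000):
    """families[f] is the list of candidate FSHSs of family f."""
    # Initial values: every candidate FSHS of a family is equally likely.
    weights = [[1.0 / len(c)] * len(c) for c in families]
    freqs = m_step(families, weights, haps)
    for _ in range(max_iter):
        # E-step: renormalised FSHS probabilities under current frequencies.
        weights = []
        for cands in families:
            probs = [fshs_prob(fshs, freqs) for fshs in cands]
            s = sum(probs)
            weights.append([p / s for p in probs])
        new = m_step(families, weights, haps)
        # Converged when no frequency changes by more than tol.
        if max(abs(new[h] - freqs[h]) for h in haps) < tol:
            return new
        freqs = new
    return freqs
```

Here each FSHS is a tuple of sib diplotypes and each diplotype is a pair of haplotype labels, e.g. `(("AB", "AB"), ("AB", "ab"))` for two sibs. The brute-force enumeration grows with the fourth power of the number of haplotypes, so the sketch is only practical for a handful of loci.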
In addition to the estimation of haplotype frequencies, haplotype reconstruction is another objective of haplotype inference. Using the probability of each possible FSHS obtained in the expectation step (Equation 4) after convergence, the conditional probabilities of these FSHSs for a full-sib family with phase-unknown genotype combination YPf can be calculated after the conversion of all probabilities as
![]() | (6) |
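As a rough illustration of this reconstruction step, the following Python sketch assumes that the conditional probability of each FSHS is obtained by normalising the converged FSHS probabilities within a family, and that the FSHS with the highest conditional probability is reported. The function names are hypothetical.

```python
def fshs_posteriors(probs):
    """Normalise the FSHS probabilities of one family so that they sum to 1."""
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("no candidate FSHS has positive probability")
    return [p / total for p in probs]


def most_likely_fshs(candidates, probs):
    """Return the candidate FSHS with the highest conditional probability."""
    post = fshs_posteriors(probs)
    best = max(range(len(candidates)), key=post.__getitem__)
    return candidates[best], post[best]
```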
| SIMULATION STUDY |
|---|
Approaches to be compared:
In our study, we compared our approach FSHAP with the following three approaches:
Sibships with two missing parents can be treated as well, and these are regarded as nuclear families in which parental genotype information is missing at all loci, but frequencies are still estimated with respect to the parental generation (BECKER and KNAPP 2004). However, the frequencies in the parental generation are identical to those in the offspring generation due to Hardy–Weinberg equilibrium.
fbat/haploinfo.htm). This analysis provides haplotype population frequencies and diplotypes of both parents and offspring. These three approaches were compared with FSHAP, which is specially designed for haplotype inference using families with only full sibs and can handle arbitrary numbers of full sibs. The parameters were estimated with the approaches described in METHODS and thus account both for linkage disequilibrium (LD) and for pedigree information.
FAMHAP, FBAT, and FSHAP allow genetic markers to be in linkage disequilibrium and assume that no recombination occurs between the markers in the generation leading to the full-sib groups. Although the inappropriateness of using GENEHUNTER to reconstruct haplotypes from markers in LD has been identified (SCHAID et al. 2002), it was used here as a lower-bound reference for the performance of FAMHAP, FBAT, and FSHAP.
Criteria:
The efficiencies of the different approaches were evaluated with two sets of performance indexes. The first set, including indexes IF and IH, is related to the evaluation of the population haplotype frequency estimation. IF measures the discrepancy between the estimated and true simulated sample haplotype frequencies and was defined by STEPHENS et al. (2001) as
$$I_F = \frac{1}{2} \sum_{i=1}^{h} \left| \hat{p}_i - p_i \right| \tag{7}$$
where $\hat{p}_i$ and $p_i$ denote, respectively, the estimated and the true simulated frequency for the ith of the h haplotypes in the sample. IF varies between 0 and 1. The more accurate the estimation is, the closer IF will be to 0.
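A small Python sketch of the discrepancy index, assuming IF is half the sum of absolute differences between estimated and true haplotype frequencies (which is consistent with its stated range of 0 to 1). The function name and the dictionary representation of frequencies are illustrative choices.

```python
def discrepancy(estimated, true):
    """Half the summed absolute difference between two frequency dicts."""
    haps = set(estimated) | set(true)
    return 0.5 * sum(
        abs(estimated.get(h, 0.0) - true.get(h, 0.0)) for h in haps
    )
```

Haplotypes missing from one of the two dictionaries are treated as having frequency 0 there, so two estimates with no haplotype in common give the maximum value of 1.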
Identification rate IH examines whether all haplotypes present in the sample are identified in the estimated haplotypes. In a sample with N individuals, the minimum frequency for every true haplotype must be 1/(2N), which can be used as a lower threshold value for determining the existence of a haplotype; i.e., a haplotype is accepted as detected only if its estimated frequency is >1/(2N). On the basis of this, EXCOFFIER and SLATKIN (1995) suggested the statistic
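$$I_H = \frac{2\,(k_{\mathrm{true}} - k_{\mathrm{missed}})}{k_{\mathrm{true}} + k_{\mathrm{est}}} \tag{8}$$
where $k_{\mathrm{true}}$ is the number of haplotypes truly present in the sample, $k_{\mathrm{est}}$ is the number of estimated haplotypes with a frequency above the threshold, and $k_{\mathrm{missed}}$ is the number of true haplotypes that were not identified.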
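The identification rate can be sketched in Python as below. The form of the statistic used here, 2(k_true - k_missed)/(k_true + k_est), is an assumption based on the definition attributed to EXCOFFIER and SLATKIN (1995); the function name and arguments are hypothetical.

```python
def identification_rate(estimated, true, n_individuals):
    """I_H with detection threshold 1/(2N) on the estimated frequencies."""
    threshold = 1.0 / (2 * n_individuals)
    true_set = {h for h, p in true.items() if p > 0.0}
    # A haplotype counts as detected only if its estimate exceeds 1/(2N).
    est_set = {h for h, p in estimated.items() if p > threshold}
    k_true = len(true_set)
    k_est = len(est_set)
    k_missed = len(true_set - est_set)
    return 2.0 * (k_true - k_missed) / (k_true + k_est)
```

Under this form the index equals 1 only when the detected haplotypes coincide exactly with the true ones; both missed haplotypes and spurious extra haplotypes lower it.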
There are two options for the definition of true haplotype frequency. The first one is the relative frequency of haplotype i in the entire ("true") population, and the second one is the relative frequency of haplotype i in the sample (i.e., in the sibships). The methods compared in our study all make use of the same data. Accuracy of parameter estimation is a combination of (i) sampling and (ii) estimation conditional on the sample. Since we are interested only in the differences between methods, only step ii is relevant; therefore a comparison conditional on the drawn samples seems appropriate.
The second set of indexes, including error rate and IR, is related to the evaluation of the haplotype reconstruction.
If the most likely diplotype of an individual is the same as the simulated true genotype, this individual will be considered as being correctly haplotyped. The error rate is the proportion of not correctly haplotyped individuals in the population.
Although the phase-unknown genotypes of parents are not available, they can be inferred according to the information of offspring. However, the father and the mother cannot be definitely assigned due to their unknown genotypes; only the reconstructed parental diplotypes are taken into account to be compared with true parental diplotypes in the calculation of error rate in our approach. For FAMHAP, FBAT, and GENEHUNTER, the reconstructed diplotypes for father and mother were assigned to the most similar true genotypes of the parents, respectively. The following combinations were compared: (i) reconstructed father–true father and reconstructed mother–true mother and (ii) reconstructed father–true mother and reconstructed mother–true father. The more similar combination was accepted and used as basis for calculation of error rate in parents.
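The comparison of the two parental assignments can be sketched as follows in Python. The similarity measure is an assumption made for illustration: similarity is taken as the number of reconstructed parental diplotypes that match a true parental diplotype exactly, with diplotypes treated as unordered haplotype pairs. The function name is hypothetical.

```python
def parental_errors(reconstructed, true):
    """Number of wrongly reconstructed parents under the better assignment.

    reconstructed, true: (father, mother) pairs of diplotypes, where each
    diplotype is an unordered pair of haplotype labels.
    """
    def same(a, b):
        return sorted(a) == sorted(b)

    rec_f, rec_m = reconstructed
    true_f, true_m = true
    # (i) father-father and mother-mother
    direct = same(rec_f, true_f) + same(rec_m, true_m)
    # (ii) father-mother and mother-father
    swapped = same(rec_f, true_m) + same(rec_m, true_f)
    return 2 - max(direct, swapped)
```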
Even if the most likely diplotype of an individual is the correct one, the posterior probability of this diplotype may be substantially smaller than one. The overall quality of the haplotype reconstruction procedure can be evaluated with the average posterior probability of correctly reconstructed haplotypes, which is denoted as IR. Since GENEHUNTER does not provide the posterior probability of the most likely diplotype, and FAMHAP only provides that for parents, the statistic IR can be given only by FSHAP and FBAT.
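The two reconstruction indexes can be computed along the following lines in Python. This sketch assumes that IR is averaged over the correctly haplotyped individuals only, and that each call is given as the most likely diplotype together with its posterior probability; the function name and input layout are illustrative.

```python
def error_rate_and_ir(calls, truth):
    """calls: list of (diplotype, posterior); truth: list of true diplotypes."""
    correct_post = [
        post
        for (diplotype, post), true_dip in zip(calls, truth)
        if sorted(diplotype) == sorted(true_dip)
    ]
    error_rate = 1.0 - len(correct_post) / len(truth)
    # I_R is undefined when no individual is haplotyped correctly.
    ir = sum(correct_post) / len(correct_post) if correct_post else float("nan")
    return error_rate, ir
```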
Where appropriate, contrasts of the means of simulation results between different estimation methods were tested with a conventional t-test using SAS 9.1 (SAS INSTITUTE 2004).
Running time of the algorithms was measured in seconds on an IBM server (SUSE Linux 9.2 and 3-GHz Intel Xeon processor).
| RESULTS |
|---|
As expected, the efficiency of all the approaches can be improved by increasing the number of offspring in each family (Figure 1), which provides more family information to exclude more redundant FSHSs and parental combinations. The only exception is that the discrepancy of haplotype frequencies from FAMHAP does not decrease as in other approaches but increases slightly.
This point is further illustrated by Table 1. For the second scenario of 30 families with only four sibs each, even when the genotyping cost is doubled by increasing the number of families to 60, the performance of FSHAP and FAMHAP is still lower than that in the fourth scenario of 15 families with eight sibs each. On the other hand, Table 1 also shows that increasing only the number of families improves the efficiency of FSHAP and FAMHAP very little, and the identification rate does not increase but decreases slightly.
The running time is also affected by the number of children per family, since with multiple sibs more redundant parental combinations can be excluded, which speeds up FSHAP, FAMHAP, and FBAT. Therefore, the running time of these three approaches decreases when the number of children is increased from two to six (Table 4). However, this advantage is counteracted by the need to enumerate all haplotype configurations of more children; e.g., the running time of FAMHAP increases abruptly when the sibship size is increased to eight. Table 4 also indicates that FSHAP runs faster than FAMHAP, FBAT, and GENEHUNTER. FAMHAP is the second fastest approach.
As shown in Table 5, FSHAP performs significantly better compared to the other approaches in most situations. The values of discrepancy from FAMHAP and FAMHAP_nit are not different, whereas the performance of FAMHAP is significantly better than that of FAMHAP_nit with respect to identification rate and haplotype reconstruction.
| DISCUSSION |
|---|
Another widely used strategy for a large number of loci is the partition-ligation (PL) algorithm proposed by NIU et al. (2002). PL was first implemented together with Gibbs sampling to estimate haplotype phases for a large number of SNPs, and QIN et al. (2002) further combined it with the EM algorithm to handle large sets of loci. The PL–EM of QIN et al. (2002) is currently implemented for unrelated individuals only, but can also be integrated in our approach.
Although both FAMHAP and FSHAP are EM-based approaches, there are two crucial steps in FSHAP that make it perform better than FAMHAP, both with respect to computing speed and accuracy of haplotype inference:
LIU et al. (2006) proposed another EM-based approach for haplotype inference from sibship data, which was not included in the comparison in our study. LIU et al. (2006) report that their approach performs slightly better than FAMHAP and that the variability of discrepancy of their performance is small with the sample size. However, only sibships with two children were taken into account in their study. The approach proposed by LIU et al. (2006) is similar to our approach in considering different parental mating designs; however, the calculation of posterior parental combinations is different. On the other hand, LIU et al. (2006) do not make effective use of the joint information of full sibs given the parental configuration; therefore we expect our approach to be more efficient with increasing family sizes.
Our study shows that including nuclear family information improves not only the correctness of haplotype reconstruction but also the accuracy of haplotype frequency estimates, as discussed in other studies (ROHDE and FUERST 2001; BECKER and KNAPP 2002; SCHAID 2002). For our approach FSHAP in particular, the parental information can also be inferred accurately when the number of offspring is increased. This will be especially helpful for research in multiparous species like pigs, dogs, fish, and many lab animals, where it is easy to collect families with multiple siblings.
Theoretically, our approach can deal with sibships of arbitrary size. However, families with an excessively large number of children cannot be handled because of limited computing memory. Moreover, increasing the number of children does not always improve the efficiency of our approach. As shown in Table 6, the improvement is very small when the number of children is increased from 8 to 12 and 15, and the performance of our approach decreases when the number of children is increased to 20.
In practical situations, incomplete data on some individuals due to typing failure at one or more of the component loci are very common. Our approach can easily handle such a situation. For an individual with a missing locus, we first list all the possible genotypes at this missing locus, where the information of the other sibs of this individual can be used to exclude some impossible genotypes. Thus this individual will have several possible phase-unknown genotypes. When inferring this individual's diplotype, each of his or her phase-unknown genotypes has a corresponding most likely diplotype with a conditional probability, so the one with the highest probability among these most likely diplotypes is taken as the final diplotype, and its corresponding phase-unknown genotype is the final multilocus genotype.
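The handling of missing genotypes described above can be sketched in Python as follows. The sketch assumes biallelic markers coded as 0, 1, or 2 copies of one allele, with `None` marking a missing locus; `best_diplotype` stands for a hypothetical callback that returns the most likely diplotype and its conditional probability for a complete genotype, and `allowed` is a hypothetical filter that would encode the exclusions derived from the other sibs.

```python
from itertools import product


def candidate_genotypes(genotype, codes=(0, 1, 2)):
    """Expand every missing locus (None) into all possible genotype codes."""
    options = [codes if g is None else (g,) for g in genotype]
    return [tuple(c) for c in product(*options)]


def resolve_missing(genotype, best_diplotype, allowed=lambda g: True):
    """Pick the completed genotype whose best diplotype is most probable."""
    scored = [
        (best_diplotype(g), g)
        for g in candidate_genotypes(genotype)
        if allowed(g)
    ]
    (diplotype, prob), final_genotype = max(scored, key=lambda s: s[0][1])
    return final_genotype, diplotype, prob
```

The number of completed genotypes grows as 3 to the power of the number of missing loci, so in practice the sib-based exclusion step matters for keeping the enumeration small.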
As in other family-based haplotype reconstruction methods, it is also assumed that within a nuclear family recombination does not occur in the considered chromosome segments (HODGE et al. 1999). When recombination events do occur among loci, inferring the parental combinations from the information of sibs becomes complex. However, for tightly linked loci, recombination is an unlikely event. Moreover, recent studies (PATIL et al. 2001; GABRIEL et al. 2002) have shown that the human genome can be partitioned into large blocks with high LD and relatively low recombination, separated by short regions of low LD. Therefore, if the markers within the same haplotype block are analyzed together, it is reasonable to assume that there is no recombination among these markers (WANG et al. 2002).
Although FSHAP was initially designed for families with only full sibs, it can also deal with sibships with parents. According to the principle of FSHAP, the available parent will help to exclude redundant parental combinations and to improve the efficiency of FSHAP.
Furthermore, our approach can also be used in mixed-data structures, consisting, e.g., of complete nuclear families (two parents and at least one child) (ROHDE and FUERST 2001), incomplete nuclear families (one parent and at least one child) (DING et al. 2006), sibships with an arbitrary number of children (this study), and single individuals (EXCOFFIER and SLATKIN 1995). All of these four methods are implemented via an EM algorithm and are similar in the likelihood function. Hence, they can be unified in one framework for mixed-data structures, which will be done in a future study.
At the moment, FSHAP runs only under Linux, and it is available on request from the authors.
| APPENDIX |
|---|
![]() | (A1) |
where the FSHS is the set of diplotypes of all children in this family, k is the number of distinct diplotypes among all children in this family (its maximum value is 4, as shown in Table A3), yi (i = 1, ... , k) is the number of children with diplotype i, and pi is the corresponding probability of diplotype i shown in Table A3.
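The quantities k, yi, and pi can be illustrated with a short Python sketch. It assumes Mendelian transmission without recombination, so that each of the four parental haplotype pairings has probability 1/4, and it treats the sibs as ordered, so the probability is the plain product of pi raised to yi; whether Equation A1 also carries a multinomial coefficient cannot be recovered from the text. The function names are hypothetical.

```python
from collections import Counter


def child_distribution(father, mother):
    """Diplotype probabilities p_i of a child given both parental diplotypes."""
    dist = Counter()
    for f in father:
        for m in mother:
            dist[tuple(sorted((f, m)))] += 0.25
    return dict(dist)


def fshs_given_parents(fshs, father, mother):
    """Product over distinct child diplotypes of p_i ** y_i (ordered sibs)."""
    dist = child_distribution(father, mother)
    counts = Counter(tuple(sorted(d)) for d in fshs)  # y_i per diplotype
    prob = 1.0
    for diplotype, y in counts.items():
        prob *= dist.get(diplotype, 0.0) ** y
    return prob
```

For two parents that are both `("ha", "hb")`, the distribution has three diplotype classes with probabilities 1/4, 1/2, and 1/4, while parents carrying four different haplotypes give the maximum of four classes.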
and the parental combination known as (hahb x hahb), we can obtain the values of k, y1, y2, and y3 as
![]() |
![]() |
Calculation of the probability of the FSHS:
For family f with nf sibs, there are several possible FSHSs, and for each FSHS, there are also several possible parental combinations for each FSHS; then the probability of the ith FSHS in family f can be calculated as
![]() | (A2) |
where the probability of the parental combination can be obtained by using the expression listed in Table A2, and the other term is the conditional probability of the ith FSHS given the parental combination.
| ACKNOWLEDGEMENTS |
|---|
| LITERATURE CITED |
|---|
BARUCH, E., J. I. WELLER, M. COHEN-ZINDER, M. RON and E. SEROUSSI, 2006 Efficient inference of haplotypes from genotypes on a large animal pedigree. Genetics 172: 1757–1765.
BECKER, T., and M. KNAPP, 2002 Efficiency of haplotype frequency estimation when nuclear family information is included. Hum. Hered. 54: 45–53.
BECKER, T., and M. KNAPP, 2004 Maximum-likelihood estimation of haplotype frequencies in nuclear families. Genet. Epidemiol. 27: 21–32.
CEPPELLINI, R., M. SINISCALCO and C. A. B. SMITH, 1955 The estimation of gene frequencies in a random mating population. Ann. Hum. Genet. 20: 97–115.
CLARK, A. G., 1990 Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: 111–122.
CLAYTON, D., 1999 A generalization of the transmission/disequilibrium test for uncertain haplotypes. Am. J. Hum. Genet. 65: 1170–1177.
DAWSON, E., G. R. ABECASIS, S. BUMPSTEAD, Y. CHEN, S. HUNT et al., 2002 A first generation linkage disequilibrium map of human chromosome 22. Nature 418: 544–548.
DEMPSTER, A. P., N. M. LAIRD and D. B. RUBIN, 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39: 1–38.
DING, X. D., Q. ZHANG, C. FLURY and H. SIMIANER, 2006 Haplotype reconstruction and estimation of haplotype frequencies from nuclear families with only one parent available. Hum. Hered. 62: 12–19.
EXCOFFIER, L., and M. SLATKIN, 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921–927.
GABRIEL, S. B., S. F. SCHAFFNER, H. NGUYEN, J. M. MOORE, J. ROY et al., 2002 The structure of haplotype blocks in the human genome. Science 296: 2225–2229.
GAO, G., I. HOESCHELE, P. SORENSEN and F. DU, 2004 Conditional probability methods for haplotyping in pedigrees. Genetics 167: 2055–2065.
HODGE, S. E., M. BOEHNKE and M. A. SPENCE, 1999 Loss of information due to ambiguous haplotyping of SNPs. Nat. Genet. 21: 360–361.
HORVATH, S., X. XU, S. LAKE, E. SILVERMAN, S. WEISS et al., 2004 Tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet. Epidemiol. 26: 61–69.
KRUGLYAK, L., M. J. DALY, M. P. REEVE-DALY and E. L. LANDER, 1996 Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58: 1347–1363.
LANDER, E. S., and P. GREEN, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84: 2363–2367.
LI, J., and T. JIANG, 2003 Efficient inference of haplotypes from genotypes on a pedigree. J. Bioinform. Comput. Biol. 1: 41–69.
LIU, P. Y., Y. LU and H. W. DENG, 2006 Accurate haplotype inference for multiple linked single-nucleotide polymorphisms using sibship data. Genetics 174: 499–509.
NIU, T., 2004 Algorithms for inferring haplotypes. Genet. Epidemiol. 27: 334–347.
NIU, T., Z. S. QIN, X. XU and J. S. LIU, 2002 Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70: 157–169.
PATIL, N., A. J. BERNO, D. A. HINDS, W. A. BARRETT, J. M. DOSHI et al., 2001 Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719–1723.
QIAN, D., and L. BECKMAN, 2002 Minimum-recombinant haplotyping in pedigrees. Am. J. Hum. Genet. 70: 1434–1445.
QIN, Z. S., T. NIU and J. S. LIU, 2002 Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 71: 1242–1247.
ROHDE, K., and R. FUERST, 2001 Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum. Mutat. 17: 289–295.
SAS INSTITUTE, 2004 SAS 9.1.3 Help and Documentation. SAS Institute, Cary, NC.
SCHAFFNER, S. F., C. FOO, S. GABRIEL, D. REICH, M. J. DALY et al., 2005 Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15: 1576–1583.
SCHAID, D. J., 2002 Relative efficiency of ambiguous vs. directly measured haplotype frequencies. Genet. Epidemiol. 23: 426–443.
SCHAID, D. J., S. K. MCDONNELL, L. WANG, J. M. CUNNINGHAM and S. N. THIBODEAU, 2002 Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Am. J. Hum. Genet. 71: 992–995.
SMITH, C. A. B., 1957 Counting methods in genetical statistics. Ann. Hum. Genet. 21: 254–276.
STEPHENS, M., N. J. SMITH and P. DONNELLY, 2001 A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978–989.
WANG, N., J. M. AKEY, K. ZHANG, K. CHAKRABORTY and L. JIN, 2002 Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet. 71: 1227–1234.
Communicating editor: C. HALEY