# A New Method for Haplotype Inference Including Full-Sib Information

^{*}University of Goettingen, Institute of Animal Breeding and Genetics, 37075 Goettingen, Germany and^{†}State Key Laboratories of Agrobiotechnology, Key Laboratory for Animal Breeding and Genetics of Ministry of Agriculture of China, College of Animal Science and Technology, China Agricultural University, Beijing, 100094, China

- 1
*Corresponding author:*College of Animal Science and Technology, China Agricultural University, Beijing, 100094, China. E-mail: qzhang{at}cau.edu.cn

## Abstract

Recent literature has suggested that haplotype inference through close relatives, especially from nuclear families, can be an alternative strategy in determining linkage phase and estimating haplotype frequencies. In the case of no possibility to obtain genotypes for parents, and only full-sib information being used, a new approach is suggested to infer phase and to reconstruct haplotypes. We present a maximum-likelihood method via an expectation-maximization algorithm, called FSHAP, using only full-sib information when parent information is not available. FSHAP can deal with families with an arbitrary number of children, and missing parents or missing genotypes can be handled as well. In a simulation study we compare FSHAP with another existing expectation-maximization (EM)-based approach (FAMHAP), the conditioning approach implemented in FBAT and GENEHUNTER, which is only pedigree based and assumes linkage equilibrium. In most situations, FSHAP has the smallest discrepancy of haplotype frequency estimation and the lowest error rate in haplotype reconstruction, only in some cases FAMHAP yields comparable results. GENEHUNTER produces the largest discrepancy, and FBAT produces the highest error rate in offspring in most situations. Among the methods compared, FSHAP has the highest accuracy in reconstructing the diplotypes of the unavailable parents. Potential limitations of the method, *e.g*., in analyzing very large haplotypes, are indicated and possible solutions are discussed.

WITH the discovery of single-nucleotide polymorphisms (SNPs) along the genome, genotyping of large samples of biallelic multilocus genetic phenotypes for fine mapping of complex traits has become standard practice. Both simulation and empirical studies have demonstrated that statistical analysis based on haplotypes often is more efficient than separate analyses of individual markers (Dawson *et al*. 2002). Considerable research effort has been devoted to algorithms that infer haplotype phase from genotype data.

There are a growing number of articles on haplotype inference for unrelated individuals (Clark 1990; Excoffier and Slatkin 1995; Stephens *et al*. 2001), but more and more studies show that haplotype inference through close relatives, especially from nuclear families, can be an alternative strategy, as family information can reduce phase ambiguity and improve the efficiency of haplotype frequency estimates (Hodge *et al*. 1999; Rohde and Fuerst 2001; Becker and Knapp 2002; Schaid 2002). However, these methods consider mainly those nuclear families with both parents and one child (trios). When diseases with onset in adulthood or in old age are studied, it may be impossible to obtain genotypes for markers in the parents of the affected offspring, so that only full-sib information is available, which also may be true for other reasons. Obviously, it is essential to develop efficient approaches to handle such families.

The existing computational methods for haplotyping fit into two categories: statistical methods and rule-based methods. The rule-based approaches (Qian and Beckman 2002; Li and Jiang 2003; Gao *et al*. 2004; Baruch *et al*. 2006) are deterministic and fast and thus can handle large pedigrees with dense markers. However, they normally do not provide numerical assessments of the reliability of their results, and the utility of rule-based approaches for nuclear families remains unknown (Niu 2004). On the other hand, statistical approaches are flexible in tackling nuclear families (Rohde and Fuerst 2001; Becker and Knapp 2002; Ding *et al*. 2006), although they are time-consuming and thus may not be suitable for large pedigrees.

Maximum likelihood via the expectation-maximization (EM) algorithm (Dempster *et al*. 1977) is a widely used statistical approach for haplotype inference. Excoffier and Slatkin (1995) were the first to propose a maximum-likelihood-based approach for haplotype frequency estimation for unrelated individuals. EM-based approaches without assuming linkage equilibrium among the loci were suggested for various types of complete (Rohde and Fuerst 2001; Becker and Knapp 2002) or incomplete (Ding *et al*. 2006) nuclear family data. Their performance was shown to be superior to that of the Lander–Green algorithm (Lander and Green 1987) implemented in GENEHUNTER (Kruglyak *et al*. 1996), which as well as other linkage analysis programs assumes complete linkage equilibrium between the loci (Becker and Knapp 2002; Ding *et al*. 2006).

Several methods have been suggested for haplotype inference using sibship data (Becker and Knapp 2004; Horvath *et al*. 2004; Liu *et al*. 2006), which have their own strengths and weaknesses. In this article, we propose a new maximum-likelihood-based method for haplotype reconstruction and estimation of haplotype frequencies using full-sib families, which allows genetic markers to be in linkage disequilibrium and assumes that no recombination occurs between the markers.

In our study, we first introduce the general idea of the new algorithm. Most of the technical details are presented in the appendix. We then report the outcome of a simulation study showing that our approach results in a higher accuracy of the estimation of population haplotype frequencies and of reconstructed individual haplotypes. In the discussion we provide arguments to explain the better statistical properties of our procedure compared with the established methods, and we discuss options to overcome practical problems and limitations, *e.g*., missing genotypes and the restriction in the number of loci processed simultaneously.

## METHODS

#### Definitions:

We consider a series of *N* closely linked polymorphic loci. For *N* = 3, a possible phase-unknown genotype of individual *i* is *Y _{i}* = (12; 34; 56). A haplotype is defined as the ordered series of alleles on one of the homologous chromosomes of one individual;

*e.g*., for

*Y*, a possible first haplotype is

_{i}*h*

_{i}_{1}= (1 4 5). The diplotype, denoted as

*G*

_{i}_{,}is then a particular combination of two haplotypes;

*e.g*.,

*G*= (

_{i}*h*

_{i}_{1},

*h*

_{i}_{2}) = (1 4 5, 2 3 6) . Note that for a given phase-unknown genotype, several diplotypes are possible. So for one family

*f*with

*n*

_{f}full sibs, and their phase-unknown genotype combination defined as

*YP*

_{f}, there are several possible diplotype combinations, one of which can be represented as termed full-sib haplotype set (FSHS), where

*G*

_{1}denotes the diplotype of sib 1 in the

*i*th FSHS of family

*f*.

#### The likelihood function:

Following similar arguments presented by Excoffier and Slatkin (1995), for a sample of *m* families with only full sibs, the likelihood function of the population haplotype frequencies is defined as(1)where *p _{1}*,

*p*, …,

_{2}*p*are the population frequencies of all haplotypes, and is the

_{n}*i*th FSHS for family

*f*with

*n*

_{f}full sibs, and

*S*

_{f}is the number of possible FSHS in family

*f*.

#### The EM algorithm:

The EM algorithm iterates between the expectation step and the maximization step until the haplotype frequency estimations converge (*i.e*., when the changes in haplotype frequency in consecutive iterations are less than some small value).

To implement the EM algorithm, a set of initial values is required. It is assumed that given the phase-unknown genotypes of family *f*, all the possible FSHSs for family *f* have the same probability; *i.e*.,(2)These initial probabilities are used in the likelihood function to calculate the initial likelihood value. According to Ceppellini *et al*.(1955) and Smith (1957), the population haplotype frequencies can be calculated in the first and in all subsequent iterations as(3)where 2*n*_{f} is the total number of haplotypes being considered in the *i*th FSHS in family *f* and is an indicator variable equal to the number of times that haplotype *t* is present in the *i*th FSHS; its possible values are 0, 1, …, 2*n*_{f}.

In the expectation step in the *g*th iteration, the haplotype frequencies obtained in the previous iteration are used to calculate the probability of each possible FSHS for family *f* as(4)where(5)Here, we give only a brief explanation of Equation 5; for details see the appendix. We first inferred the possible parental combinations based on FSHS, and according to the posterior parental information the probability of FSHS is calculated. For one FSHS, there often will be several possible parental combinations, so is the probability of the *j*th parental combination given the estimates of population haplotype frequencies in the *g*th iteration, and is the probability of FSHS conditional on the *j*th possible parental combination.

Iterating between the E-step, using Equation 4 to update probabilities of all FSHSs, and the M-step, using Equation 3 to calculate all haplotype frequencies, the EM algorithm yields the maximum-likelihood estimates of the population haplotype frequencies when an adequate convergence criterion is reached.

In addition to the estimation of haplotype frequencies, haplotype reconstruction is another objective of haplotype inference. Using the probability of each possible FSHS obtained in the expectation step Equation 4 after convergence, the conditional probabilities of these FSHSs for a full-sib family with phase-unknown genotype combination *YP*_{f} can be calculated after the conversion of all probabilities as(6)The one with the highest probability is the most likely FSHS in full-sib family *f*, and subsequently the most likely diplotype for each member in this family can be obtained. On the other hand, there should be several possible parental combinations for this most likely FSHS, where a probability can be assigned to each possible parental combination (see the appendix). The one with the highest probability will be regarded as the most likely parental combination, and diplotypes of parents will be easily obtained.

## SIMULATION STUDY

#### Simulated data:

To evaluate our approach, we carried out a series of simulation studies. We simulated haplotypes using Schaffner's simulation program (Schaffner *et al*. 2005) based on a coalescent model. The parameters used for the simulation were: chromosome segment length, 1 Mb; mutation rate, 1.5 × 10^{−8}; recombination rate, 1 × 10^{−8}; effective population size, 10,000; and number of sampled chromosomes, 1000. From the simulated haplotypes, the diplotypes of related individuals were produced as follows: we first combined two randomly chosen haplotypes to be the diplotype of the first parent and two other randomly chosen haplotypes to form the diplotype of the second parent. With the assumption of no recombination, the diplotype of their offspring was generated by randomly picking one of the two haplotypes of the father and the mother, respectively. For full-sib families, the information on both parents was omitted after generating the children. Markers are thinned to obtain the required 1 SNP per 8-kb density that was used throughout this study. In the different scenarios, haplotypes of 5, 10, or 20 SNPs were considered, the number of full-sib families was varied between 15 and 60, and the number of offspring in each family was varied between 2 and 20, respectively. For each scenario, 100 replicates were generated and analyzed, and every data set was expected to be in Hardy–Weinberg equilibrium.

#### Approaches to be compared:

In our study, we compared our approach FSHAP with the following three approaches:

FAMHAP estimates haplotype frequencies from unrelated individuals or simple nuclear families with an arbitrary number of children with the EM algorithm (Becker and Knapp 2004). The frequencies are the frequencies in the founders,

*i.e*., those of the parents of the nuclear families and/or the individuals (single-person families). FAMHAP provides only the most likely diplotypes of both parents. The diplotypes of offspring must be inferred again on the basis of their own genotype and parental diplotypes.

Sibships with two missing parents can be treated as well, and these are regarded as nuclear families in which parental genotype information is missing at all loci, but frequencies are still estimated with respect to the parental generation (Becker and Knapp 2004). However, the frequencies in the parental generation are identical to those in the offspring generation due to Hardy–Weinberg equilibrium.

b. FBAT was initially designed by Horvath

*et al*.(2004) for implementing a broad class of family-based association tests. For multiple tightly linked markers, the haplotypes are first reconstructed via a conditioning approach and association testing is then applied (Horvath*et al*. 2004). Haplotype FBAT can deal with nuclear families, sibships, etc., when using the additional program HaploInfo (http://www.biostat.harvard.edu/∼fbat/haploinfo.htm). This analysis provides haplotype population frequencies and diplotypes of both parents and offspring.c. GENEHUNTER is a widely used software for linkage analysis (Kruglyak

*et al*. 1996), which makes full use of the pedigree information. After convergence the program provides information only on the most likely diplotype and does not give its posterior probability. Although very popular and technically suited to handle sibship data, GENEHUNTER uses pedigree information only assuming genotypes to be in full linkage equilibrium.

These three approaches were compared with FSHAP, which is specially designed for haplotype inference using families with only full sibs and can handle arbitrary numbers of full sibs. The parameters were estimated with the approaches described in methods and thus account both for linkage disequilibrium (LD) and for pedigree information.

FAMHAP, FBAT, and FSHAP allow genetic markers to be in linkage disequilibrium and assume that no recombination occurs between the markers in the generation leading to the full-sib groups. Although the inappropriateness of using GENEHUNTER to reconstruct haplotypes from markers in LD has been identified (Schaid *et al*. 2002), it was used here as a lower-bound reference for the performance of FAMHAP, FBAT, and FSHAP.

#### Criteria:

The efficiencies of the different approaches were evaluated with two sets of performance indexes. The first set, including indexes *I*_{F} and *I*_{H}, is related to the evaluation of the population haplotype frequency estimation. *I*_{F} measures the discrepancy between the estimated and true simulated sample haplotype frequencies and was defined by Stephens *et al*. (2001) as(7)where the and *p _{i}* denote, respectively, the estimated and the true simulated frequency for the

*i*th haplotype in the sample.

*I*

_{F}varies between 0 and 1. The more accurate the estimation is, the closer

*I*

_{F}will be to 0.

Identification rate *I*_{H} examines whether all haplotypes present in the sample are identified in the estimated haplotypes. In a sample with *N* individuals, the minimum frequency for every true haplotype must be ≥, which can be used as a lower threshold value for determining the existence of a haplotype; *i.e*., a haplotype is accepted to be detected only if its estimated frequency is >. On the basis of this, Excoffier and Slatkin (1995) suggested the statistic(8)where *k*_{true} is the number of true haplotypes in the sample, *k*_{found} is the number of identified haplotypes with frequency above the threshold value in the sample, and *k*_{missed} is the number of true haplotypes not identified in the sample. *I*_{H} also varies between 0 and 1. When all true haplotypes are identified, it will be 1, and when none of the true haplotypes are identified, it will be 0.

There are two options for the definition of true haplotype frequency. The first one is the relative frequency of haplotype *i* in the entire (“true”) population, and the second one is the relative frequency of haplotype *i* in the sample (*i.e*., in the sibships). The methods compared in our study all make use of the same data. Accuracy of parameter estimation is a combination of (i) sampling and (ii) estimation conditional on the sample. Since we are interested only in the differences between methods, only step ii is relevant; therefore a comparison conditional on the drawn samples seems appropriate.

The second set of indexes, including error rate and *I*_{R}, is related to the evaluation of the haplotype reconstruction.

If the most likely diplotype of an individual is the same as the simulated true genotype, this individual will be considered as being correctly haplotyped. The error rate is the proportion of not correctly haplotyped individuals in the population.

Although the phase-unknown genotypes of parents are not available, they can be inferred according to the information of offspring. However, the father and the mother cannot be definitely assigned due to their unknown genotypes; only the reconstructed parental diplotypes are taken into account to be compared with true parental diplotypes in the calculation of error rate in our approach. For FAMHAP, FBAT, and GENEHUNTER, the reconstructed diplotypes for father and mother were assigned to the most similar true genotypes of the parents, respectively. The following combinations were compared: (i) reconstructed father–true father and reconstructed mother–true mother and (ii) reconstructed father–true mother and reconstructed mother–true father. The more similar combination was accepted and used as basis for calculation of error rate in parents.

Even if the most likely diplotype of an individual is the correct one, the posterior probability of this diplotype may be substantially smaller than one. The overall quality of the haplotype reconstruction procedure can be evaluated with the average posterior probability of correctly reconstructed haplotypes, which is denoted as *I*_{R}. Since GENEHUNTER does not provide the posterior probability of the most likely diplotype, and FAMHAP only provides that for parents, the statistic *I*_{R} can be given only by FSHAP and FBAT.

Where appropriate, contrasts of the means of simulation results between different estimation methods were tested with a conventional *t*-test using SAS 9.1 (SAS Institute 2004).

Running time of the algorithms was measured in seconds on an IBM server (SUSE Linux 9.2 and 3-GHz Intel Xeon processor).

## RESULTS

We simulated four scenarios with identical genotyping costs: 60 families with two sibs, 30 families with four sibs, 20 families with six sibs, and 15 families with eight sibs. All the approaches deal with the same data sets with haplotypes of 10 SNPs. The results of our comparisons with respect to the performance of FSHAP, FAMHAP, FBAT, and GENEHUNTER from these scenarios are shown in Figure 1. In most cases, our new method for haplotype inference using sibship data (FSHAP) has the smallest discrepancy and lowest error rate and the highest identification rate. Only in some situations, the performance of FSHAP is close to FAMHAP, *e.g*., the discrepancy in the first scenario of 60 families with two sibs and the identification rate in the third scenario of 20 families with six sibs. In the estimation of haplotype frequencies, GENEHUNTER produces the largest discrepancy and the lowest identification rate, and in the haplotype reconstruction, FBAT produces the highest error rate in offspring in most situations.

In the parental haplotype reconstruction using offspring's information, as shown in Figure 1, FSHAP, FAMHAP, FBAT, and GENEHUNTER perform poorly in the case of families with only two full sibs, where the error rate in parents remains above 0.45. This performance is not so helpful to further analysis. However, their performance improves rapidly with the number of offspring being increased; especially, the error rate in parents from our approach (FSHAP) and FAMHAP decreases faster than that from FBAT and GENEHUNTER. For the calculation of error rate in parents generally our approach performs better than the other three methods, whereas the performance of FBAT is very close to that of GENEHUNTER.

As expected, the efficiency of all the approaches can be improved by increasing the number of offspring in each family (Figure 1), which provides more family information to exclude more redundant FSHSs and parental combinations. The only exception is that the discrepancy of haplotype frequencies from FAMHAP does not decrease as in other approaches but increases slightly.

This point is further illustrated by Table 1. For the second scenario of 30 families with only four sibs each, even when the genotyping cost is double after the number of families is increased to 60, the performance of FSHAP and FAMHAP is still lower than that in the fourth scenario of 15 families with only eight sibs each. On the other hand, it also can be seen from Table 1 that the improvement of efficiency of FSHAP and FAMHAP is very small by increasing only the number of families, and the identification rate is not increased but decreased a little bit.

Our approach and FBAT can provide a posterior probability for the most likely diplotype. As shown in Table 2, observing more offspring in families is also helpful to improve the reliability of inference for parents and offspring. For only two full sibs, there are a lot of possible parental combinations, which make the reliability of inference for parents very low, but for multiple sibs, more redundant parental combinations are excluded and the posterior probability of the most likely diplotype will be increased. Table 2 shows that the reliability of FBAT is apparently higher than that of FSHAP in most situations. This mainly is a consequence of the higher error rate of FBAT compared with FSHAP. FBAT identifies less haplotypes correctly, but those correctly identified on average have a higher posterior probability than the ones identified with FSHAP.

Generally, the discrepancy and error rate for the EM-based methods increase when more SNPs are included, since in this case the number of possible haplotypes increases exponentially while the average amount of information for estimation from the same data set decreases. As shown in Table 3, the efficiencies of FSHAP, FAMHAP, and FBAT all decrease when the number of SNPs is increased from 10 to 20. However, the decrease with FSHAP is very moderate compared to that with FAMHAP and FBAT.

It is also indicated from Table 3 that the impact of the number of SNPs on the running time for the three EM-based approaches of FSHAP, FAMHAP, and FBAT is very large compared to that for GENEHUNTER. In the case of 10 SNP loci, FSHAP is the fastest one among these four approaches, but the average running time increases dramatically from 0.17 sec to 9.66 sec when the number of SNPs is increased from 10 to 20. Compared to FSHAP, FBAT becomes much slower, where the average running time exponentially increases to 161 sec from 3.45 sec. However, a progressive-extension technique was implemented in FAMHAP (Becker and Knapp 2004), which makes FAMHAP very fast for the large number of SNP loci. Similarly, the Lander–Green algorithm (Lander and Green 1987) keeps the speed of GENEHUNTER almost stable with the doubling of the number of loci.

The running time is also affected by the number of children in families since more redundant parental combinations can be excluded to improve speed by using multiple sibs for FSHAP, FAMHAP, and FBAT. Therefore, the running time of these three approaches is decreased when the number of children is increased from two to six (Table 4). However, this advantage will be counteracted by the enumeration of all haplotype configurations of more children; *e.g*., the running time of FAMHAP is suddenly increased as sib size is increased to eight. It is also indicated from Table 4 that FSHAP performs faster than FAMHAP, FBAT, and GENEHUNTER. FAMHAP is the second fastest approach.

In the case of 10 and 20 SNPs, some families cannot be handled by FAMHAP due to too many possible haplotypes. Therefore FAMHAP with a progressive-extension technique being implemented is used throughout our study, and FAMHAP without a progressive-extension technique is denoted as FAMHAP_nit. A small number of SNPs (5) and varying numbers of children in families (4–8 or 2–10, respectively) are assumed to compare the performance of FSHAP, FAMHAP, FAMHAP_nit, and FBAT. We simulated two scenarios: 60 families with 4–8 sibs and 60 families with 2–10 sibs, all data sets with haplotypes of 5 SNPs.

As shown in Table 5, FSHAP performs significantly better compared to the other approaches in most situations. The values of discrepancy from FAMHAP and FAMHAP_nit are not different, whereas the performance of FAMHAP is significantly better than that of FAMHAP_nit with respect to identification rate and haplotype reconstruction.

## DISCUSSION

One limitation of the EM algorithm is that it cannot handle a large number of loci due to the memory constraint. So far, FSHAP can handle up to 30 loci. Recently, some strategies were proposed to overcome this problem. Clayton (1999) first implemented progressive-extension (PE) techniques in his program SNPHAP. Becker and Knapp (2004) further used PE in FAMHAP, which gives reliable approximations of the maximum-likelihood estimates for up to 63 SNP loci. However, the number of bits required for the type of data prevents this technique from handling more loci (Becker and Knapp 2004). The PE technique does not guarantee that the true EM estimates for individual haplotypes are obtained (Qin *et al*. 2002), which in the case of FAMHAP is not critical since the main objective of the program is to estimate haplotype frequencies rather than to reconstruct individual haplotypes. However, the results of our study (Table 5) indicate that the progressive-extension technique will not impair the performance of FAMHAP in haplotype frequency estimation, while it can make FAMHAP perform better in haplotype reconstruction. A possible explanation is that despite the good haplotype frequency estimates FAMHAP without progressive-extension techniques might be unable to pinpoint which family has which haplotype (T. Becker, personal communication).

Another widely used strategy for a large number of loci is the partition-ligation (PL) algorithm proposed by Niu *et al*.(2002). PL was first implemented together with Gibbs sampling to estimate haplotype phases for a large number of SNPs, and Qin *et al*.(2002) further combined it with the EM algorithm to handle large sets of loci. The PL–EM of Qin *et al*.(2002) is currently implemented for unrelated individuals only, but can also be integrated in our approach.

Although both FAMHAP and FSHAP are EM-based approaches, there are two crucial steps in FSHAP that make it perform better than FAMHAP, both with respect to computing speed and accuracy of haplotype inference:

In our approach, a collapse technique is used to infer possible parental haplotype combinations. It starts from the possible parental haplotype combinations based on a single pair of full sibs and then goes through all additional full sibs, excluding those haplotype combinations not being compatible with the extra children (see the appendix). In FAMHAP a complete list of possible parental haplotype combinations is set up first, which will be very large when parental multilocus genotypes are missing. Afterward, those diplotypes not compatible with the children's genotypes are excluded. This strategy is much slower and more memory demanding than the one implemented in FSHAP, especially when the number of loci is large.

In our approach, the probability of each parental configuration is calculated according to different mating designs, which will give different weight to each parental haplotype combination. Further, a multinominal distribution is used to calculate the joint probability of a sibship's diplotype given each posterior parental combination, which makes effective use of family information (see the appendix). By this the information on parents and sibships is updated in each iteration simultaneously, which makes the estimation of haplotype frequencies more accurate and the inference of haplotype reconstruction in parents and offspring more reliable compared to FAMHAP, which originally was developed primarily for haplotype frequency estimation rather than for individual haplotype reconstruction.

Liu *et al*.(2006) proposed another EM-based approach for haplotype inference from sibship data, which was not included in the comparison in our study. Liu *et al*.(2006) report that their approach performs slightly better than FAMHAP and that the variability of discrepancy of their performance is small with the sample size. However, only sibships with two children were taken into account in their study. The approach proposed by Liu *et al*.(2006) is similar to our approach by considering different parental mating designs; however, the calculation of posterior parental combinations is different. On the other hand, Liu *et al*.(2006) do not make effective use of the joint information of full sibs given the parental configuration; therefore we expect our approach to be more efficient with increasing family sizes.

Our study proves that including nuclear family information will improve not only the correctness of haplotype reconstruction but also the accuracy of haplotype frequency estimates as discussed in other studies (Rohde and Fuerst 2001; Becker and Knapp 2002; Schaid 2002). Especially for our approach FSHAP the parental information can also be inferred accurately when the number of offspring is increased. It will be especially helpful for research in multiparous species like pigs, dogs, fish, and many lab animals, where it is easy to collect families with multiple siblings.

Theoretically, our approach can deal with sibships of arbitrary size. However, families with an excessively large number of children cannot be handled due to the limitation of computing memory. On the other hand, increasing the number of children is not always helpful to improve the efficiency of our approach. As shown in Table 6, the improvement is very small when the number of children is increased from 8 to 12 and 15, and the performance of our approach is decreased when the number of children is increased to 20.

As Figure 2 shows, it is difficult to have a functional relationship between the running time and sib size, because the composition of phase-unknown genotypes of sibs plays a more important role than sib size. To illustrate this we give three scenarios: (i) families with all four children having genotype (12; 12; 12; 12; 12), (ii) families with four children having different genotypes each, and (iii) families with eight children having different genotypes each. The running time of our approach in the first case is higher than that in the second and third cases because the number of possible parental combinations is much higher in the first case than in the second and third cases, even though the sib size in the first two cases is identical and doubled in the third case.

Comparing the results obtained here to the ones reported by Ding *et al*.(2006), it can be concluded that with the same burden of genotyping complete nuclear families are more informative than sibships. In their simulation based on the same program (Schaffner *et al*. 2005), the error rate of complete-family EM proposed by Rohde and Fuerst (2001) is 0.4% in the case of 30 trios and 10 SNPs (Ding *et al*. 2006), which is less genotyping cost and a much lower error rate than the 3.5% of FSHAP in the case of 30 sibships with four sibs each and 10 SNPs.

In practical situations, incomplete data on some individuals due to failure of typing for one (or more) of the component loci is very common in every lab. Our approach can easily handle such a situation. For an individual with a missing locus, we first list all the possible genotypes at this missing locus, where the information of other sibs of this individual can be used to exclude some impossible genotypes. Thus this individual will have several possible phase-unknown genotypes. When inferring this individual's diplotype, each of her (his) phase-unknown genotypes has a corresponding most likely diplotype with a conditional probability, so the one with the highest probability among these most likely diplotypes is considered as the final diplotype, and its corresponding phase-unknown genotype is the final multilocus genotype.

As in other family-based haplotype reconstruction methods, it also is assumed that within a nuclear family recombination does not occur in the considered chromosome segments (Hodge *et al*. 1999). When recombination events do occur among loci, it will make it complex to infer the parental combinations on the basis of the information of sibs. However, for tightly linked loci, recombination is an unlikely event. Moreover, recent studies (Patil *et al*. 2001; Gabriel *et al*. 2002) have shown that the human genome can be partitioned into large blocks with high LD and relatively low recombination, separated by short regions of low LD. Therefore, if the markers within the same haplotype block are analyzed together, it is reasonable to assume that there is no recombination among these markers (Wang *et al*. 2002).

Although FSHAP was initially designed for families with only full sibs, it can also deal with sibships with parents. According to the principle of FSHAP, the available parent will help to exclude redundant parental combinations and to improve the efficiency of FSHAP.

Furthermore, our approach can also be used in mixed-data structures, consisting, *e.g*., of complete nuclear families (two parents and at least one child) (Rohde and Fuerst 2001), incomplete nuclear families (one parent and at least one child) (Ding *et al*. 2006), sibships with an arbitrary number of children (this study), and single individuals (Excoffier and Slatkin 1995). All of these four methods are implemented via an EM algorithm and are similar in the likelihood function. Hence, they can be unified in one framework for mixed-data structures, which will be done in a future study.

At the moment, FSHAP runs only under Linux, and it is available on request from the authors.

## APPENDIX

#### Inferring parental combination using sibs:

For one population, there are seven types of diplotype combinations as shown in Table A1 for any two individuals: (1) both are identical homozygotes; (2) one is a homozygote and the other is a heterozygote, with one common haplotype; (3) both are identical heterozygotes; (4) both are heterozygotes with one common haplotype; (5) both are different homozygotes; (6) one is a homozygote and the other is a heterozygote, without a common haplotype; and (7) both are heterozygotes without a common haplotype.

#### Calculation of joint probability of the children's diplotypes using parental information:

For one family with known parental combination the probability of each possible diplotype of the children have can be obtained as shown in Table A3, and the joint probability of *n* children in this family is multinominal,(A1)where () is the set of diplotypes of all children in this family, *k* is the number of types of diplotypes among all children in this family (its maximum value is 4 as shown in Table A3), *y _{i}* (

*i =*1,

*…*,

*k*) is the number of children with diplotype

*i*, and

*p*is the corresponding probabilities of diplotype

_{i}*i*shown in Table A3.

For example, in the case of *n* = 4, ( and the parental combination known as (*h _{a}h_{b} × h_{a}h_{b}*), we can obtain the values of

*k*,

*y*

_{1},

*y*

_{2}, and

*y*

_{3}asThen according to Table A3 and Equation A1, the joint probability of these children is.

#### Calculation of the probability of the FSHS:

For family *f* with *n*_{f} sibs, there are several possible FSHSs, and for each FSHS, there are also several possible parental combinations for each FSHS; then the probability of the *i*th FSHS in family *f* can be calculated as(A2)where *j* is the type of parental combination shown in Table A2, and can be obtained by using the expression listed in Table A2. is the conditional probability of the *i*th FSHS given the parental combination.

## Acknowledgments

We thank two anonymous reviewers for their constructive comments, Lin Wang for help in using the program HaploInfo to read the output of FBAT, and Tim Becker for helpful advice. This research was supported by the Functional Genome Analysis in Animal Organisms program of the German Federal Ministry of Education and Research, the Förderverein Biotechnologieforschung e.V. Bonn, Lohmann TierzuchtGmbH Cuxhaven, and by The National Key Basic Research Program of China (grant no. 2006CB102104).

## Footnotes

Communicating editor: C. Haley

- Received July 26, 2007.
- Accepted September 18, 2007.

- Copyright © 2007 by the Genetics Society of America