## Abstract

Sibships are commonly used in genetic dissection of complex diseases, particularly for late-onset diseases. Haplotype-based association studies have been advocated as powerful tools for fine mapping and positional cloning of complex disease genes. Existing methods for haplotype inference using data from relatives were originally developed for pedigree data. In this study, we proposed a new statistical method for haplotype inference for multiple tightly linked single-nucleotide polymorphisms (SNPs), which is tailored for extensively accumulated sibship data. This new method was implemented via an expectation-maximization (EM) algorithm without the usual assumption of linkage equilibrium among markers. Our EM algorithm does not incur extra computational burden for haplotype inference using sibship data when compared with using unrelated parental data. Furthermore, its computational efficiency is not affected by increasing sibship size. We examined the robustness and statistical performance of our new method in simulated data created from an empirical haplotype data set of human growth hormone gene 1. The utility of our method was illustrated with an application to the analyses of haplotypes of three candidate genes for osteoporosis.

THE human genome has been portrayed as a series of high linkage disequilibrium (LD) regions with limited haplotype diversity (Patil *et al.* 2001; Gabriel *et al.* 2002). Several common haplotypes that can be captured by a few tagging single-nucleotide polymorphisms (SNPs) usually account for a majority of genetic variation in genomic regions or candidate genes (Johnson *et al.* 2001; Patil *et al.* 2001; Gabriel *et al.* 2002). Such haplotype patterns observed in empirical studies have triggered the development of the International HapMap Project that aims to determine the common patterns of DNA sequence variation in the human genome (Gibbs *et al.* 2003). Focusing on these common haplotypes greatly facilitates LD-based mapping analyses, such as those for fine mapping and positional cloning of complex disease genes (Johnson *et al.* 2001). However, linkage phase information in diploids, such as humans, is usually unknown from genotype data. The determination of haplotypes by experimental methods in large samples is currently time consuming and expensive. Therefore, computational algorithms and statistical methods have been used for large-scale haplotype determination.

Sibships are commonly used with increasing emphasis on genetic studies of complex diseases, particularly those late-onset ones such as Alzheimer's and Parkinson's diseases, for which genotype data of parents of affected individuals are usually not available (Freimer and Sabatti 2004). Extensively accumulated sibship data call for algorithmic and methodological advances in haplotype inference for subsequent fine mapping and positional cloning studies. Currently available computational algorithms and statistical methods of haplotype inference from relatives were originally developed for pedigree data (Lander and Green 1987; Sobel and Lange 1996; O'Connell 2000; Qian and Beckmann 2002). It is unclear how these methods perform in haplotype inference without founder information in sibship data.

Existing methods for pedigree haplotype inference can be broadly classified into two categories: rule-based and likelihood-based haplotyping methods. The rule-based approaches (Wijsman 1987; O'Connell 2000; Tapadar *et al.* 2000; Qian and Beckmann 2002; Li and Jiang 2003; Gao *et al.* 2004) are deterministic and fast and thus can handle large pedigrees with dense markers. However, they do not normally provide numerical assessments of the reliability of their results. On the other hand, likelihood-based methods (Lander and Green 1987; Sobel and Lange 1996; Lin and Speed 1997; Abecasis *et al.* 2002) are typically stochastic and flexible in tackling complex pedigrees, but they are time consuming and thus may not be suitable for large data sets. In particular, most likelihood-based methods for pedigree haplotype inference implicitly assume linkage equilibrium among markers. However, this assumption is contradicted by haplotype block structures in human genomes, as observed ubiquitously in recent empirical data (Patil *et al.* 2001; Gabriel *et al.* 2002). The potential problem of using linkage-based software that assumes linkage equilibrium among markers in haplotype inference has been nicely addressed in the analysis of haplotypes within the HPC1 gene (Schaid *et al.* 2002). Two recent methods, FAMHAP and FBAT, are exceptions in that they can be used to infer haplotype from nuclear families without the assumption of linkage equilibrium among markers (Becker and Knapp 2004; Horvath *et al.* 2004).

In this study, we proposed a maximum-likelihood method for haplotype inference for multiple tightly linked SNPs, tailored for sibship data. We implemented our method via a well-known expectation-maximization (EM) algorithm without the assumption of linkage equilibrium among markers. The utility of the new method was validated by using a wide variety of simulated and real data sets. We compared our method with commonly used software for haplotype inference, Genehunter (Kruglyak *et al.* 1996) and FAMHAP (Becker and Knapp 2004).

## STATISTICAL METHODS

Suppose that there is a sample of *n* sibships, and each sibship contains *n _{i}* individuals. Let denote the observed genotypes for the

*n*sibships, where and is the genotype of individual

*j*in the sibship

*i*. Let denote population haplotype frequencies, where

*M*is the number of all possible haplotypes in the sample. Let and denote the unobserved father and mother genotypes in the sibship

*i*, respectively. The likelihood function of the data can be expressed as(1)

Suppose that the marker loci are tightly linked so that the recombination in the parental generation can be safely ignored and Hardy–Weinberg equilibrium (HWE) holds true. Then,(2)where the notion denotes a parental zygote configuration in which the two haplotypes, and , are compatible with the genotype ; is the probability of offspring genotypes given different mating design and equals 1 when both parents are homozygotes, when only one of the parents is a homozygote, when both parents are two different heterozygotes, and and for homozygous and heterozygous offspring, respectively, when both parents are identical heterozygotes; has values of 1 or 0 according to whether parental genotypes and are compatible with their offspring genotypes () or not; is the combination coefficient of two haplotypes and has values of 1 or 2 according to whether or not; is the combination coefficient of two parental genotypes and has values of 1 or 2 according to whether or not; and , , , and are frequencies of haplotypes , , , and , respectively.

The maximum-likelihood estimate (MLE) of can be found analytically by solving a system of regular equations with a Lagrange multiplicator ,(3)and(4)However, solving a set of *M* equations is tedious when *M* is large, and the number *M* is often unknown *a priori*. Alternatively, a wide variety of iterative algorithms for solving MLE equations can be used. Among them, the EM method is very general for missing data problems, with fairly simple forms and well-established statistical properties (Dempster *et al.* 1977). In our method, in the expectation step, the probability of each of the parental zygote configurations that are compatible with their offspring genotypes is calculated by using haplotype frequencies obtained in the previous iteration step. In the maximization step, haplotype frequencies are estimated by counting copies of a specific haplotype in parental zygote configurations, which maximizes the likelihood of Equation 1. Here we combined the expectation and maximization steps and obtained the following expectation-maximization recursion,(5)where and are the estimated frequencies at steps *r* and *r+*1, respectively, and counts how often (zero, one, two, three, or four times) haplotype *j* occurs in the two haplotype pairs () and (). Within the EM framework, the approximate variances of the estimates of haplotype frequencies can be obtained by inverting the estimated information matrix (appendix a) (Louis 1982).

We adopted the following two strategies to improve the computation speed of our method. First, we used an efficient approach of starting iteration suggested by Rohde and Fuerst (2001). That is, at the first iteration step, we took into account only the most likely haplotype pairs in the sample and then estimated haplotype frequencies by counting these most likely haplotype pairs. This approach usually generated the best haplotype estimates while obviating multiple starting iterations. Second, the possible unknown parental genotypes have three to the power of the number of loci in haplotype inference. These possible parental genotypes in Equation 5 were analytically collapsed and reduced given different mating designs. As an example, this collapse approach for sibship data with two siblings is detailed in appendix b. It can be seen from appendix b that haplotype inference using sibship data does not increase the computational burden of our EM algorithm compared with that using unrelated parental data. Also, the increasing sibship size does not result in a larger computational burden since the number of compatible parental genotypes is reduced and more easily resolved given the larger sibship size.

## SIMULATIONS

Simulation data were sampled from empirical haplotype data of human growth hormone gene 1 (GH1) (Horan *et al.* 2003), where GH1 underwent a combination of events including mutation, recombination, and gene conversion. These complexities inherent in the GH1 gene are common in real data applications. Haplotypes of the GH1 gene were experimentally determined in a sample of 154 male British Caucasians. The 15 SNP sites spanned 535 nucleotides in the promoter of the GH1, with minor allele frequencies ranging from 0.3 to 41.2%. Six of these SNPs can be considered as rare variants with minor allele frequencies <5% (0.3–3.6%). Standardized linkage disequilibrium measured by |*D*′| (Lewontin 1964) among the remaining 9 common SNPs ranged from complete LD (*i.e*., sites −301 and −308) to effective linkage equilibrium (*i.e*., sites −1 and +59).

To simplify simulation and comparison of the simulation results, we here considered each sibship with two siblings. In the simulations, in each nuclear family, both parents' haplotypes were randomly selected from the haplotype distribution of the GH1 gene as shown in Table 1 of Horan *et al.* (2003). Two offspring (*i.e.*, a sib pair) were then randomly generated from their parents. The following six cases were studied:

Our newly proposed EM algorithm using sib-pair data (EM-Sib): We treated parents' genotypes as missing data and used only the sib-pair data for haplotype inference.

The conventional EM algorithm using singleton data (EM-Single): To avoid related subjects, we randomly selected one individual from each sib pair and used this singleton data for haplotype inference by applying the conventional EM algorithm that was designed for random population samples (Excoffier and Slatkin 1995).

Genehunter using sib-pair data (GH-Sib): Genehunter Version 2.1 (http://www.fhcrc.org/labs/kruglyak/Downloads/index.html) was used for haplotype inference for the sib-pair data without using parent information.

Genehunter using family data (GH-Family): Genehunter was used for haplotype inference for the whole-family data in which each nuclear family had both parents and two offspring. To keep the genotyping cost the same for each method, the number of nuclear families used was halved in this case and thus the total number of subjects used for haplotype inference was the same as in the cases using the sib-pair data.

FAMHAP using sibship data (FAMHAP-Sib) without reordering SNP data: The FAMHAP software was obtained from Tim Becker's website (http://www.uni-bonn.de/%7Eumt70e/becker.html).

FAMHAP using sibship data with reordered SNP data (FAMHAP-Sib-R): To obtain the best estimates, we tried several orders of genotype data and took estimates with the largest likelihood.

To compare the performance of each statistical method, we evaluated the following commonly used measurements:

The discrepancy between the estimated and true sample haplotype frequencies,(6)where and denote, respectively, the estimated and true frequencies of the

*j*th haplotype in the sample.Error rate—namely, the proportion of individuals with ambiguous phase whose haplotypes are not correctly inferred (Niu

*et al.*2002).Identification rate: A good statistical method for haplotype inference should meet two criteria: (i) high proportion of true haplotypes identified to total true haplotypes in the sample and (ii) low proportion of false (nonexistent) haplotypes to total estimated haplotypes in the sample. The haplotype identification rate is an integrative index measuring these two criteria (Excoffier and Slatkin 1995). We consider that a haplotype is identified as being present in the true sample if its frequency is estimated to be above the threshold value of 1/(2

*n*) (*n*is sample size). The identification rate can be defined as(7)where is the number of haplotypes in the true sample, is the number of estimated haplotypes with frequencies above the threshold, and is the number of true haplotypes not identified in the sample.

## SIMULATION RESULTS

We simulated six sample sizes: *n* = 50, 100, 150, 200, 250, and 300. Simulations were replicated 100 times for each case and the mean results of the 100 simulations were summarized. Comparisons of discrepancies, error rates, and identification rates among different methods are shown in Figures 1 and 2. Our new method for haplotype inference using sibship data (*i.e.*, EM-Sib) has the smallest discrepancy and error rate and the highest identification rate. The FAMHAP does not work well for inferring haplotype with a large number of loci without reordering SNP genotype data. However, the performance of FAMHAP is improved when running several orders of SNP data. The Genehunter method for haplotype inference using sibship data (*i.e*., GH-Sib) produces the largest discrepancy (>20%) and error rate (>30%) and the lowest identification rate (<70%) for haplotype analyses. However, when using the whole family data Genehunter performs dramatically better than that using sibship data, but is still not as good compared to the EM methods. The EM-Single method that only uses singleton data is less accurate for haplotype inference compared with our EM-Sib method. This is because the EM-Single method does not completely explore sibship information for haplotype inference.

Generally, the discrepancy and the error rate for the EM methods accumulate with increased number of loci for haplotype inference. However, the performance of Genehunter depends largely on the LD among markers and is not affected by the number of loci for haplotype inference. For instance, in our simulations, the error rate of the GH-Sib is 33 and 44% for inferring 15- and 9-locus haplotypes, respectively. Another remarkable feature is that increasing sample sizes does not reduce the error rate or raise the identification rate when using the GH-Sib and GH-Family methods (Figures 1 and 2). The EM algorithms including the EM-Sib, EM-Single, FAMHAP-Sib, and FAMHAP-Sib-R are favored by increasing sample sizes. Larger sample sizes result in more accurate haplotype estimates and smaller standard errors in these EM methods.

## APPLICATION EXAMPLES

To demonstrate the utility of our new method, we selected ∼200 Caucasian nuclear families (each with both parents and two offspring) from our studies that aim to search genes underlying the risk to osteoporosis in the Creighton University Osteoporosis Research Center. Three important candidate genes for osteoporosis were chosen for estimation of haplotype frequencies. These were the vitamin D receptor (VDR), apolipoprotein E (APOE), and parathyroid hormone (PTH)/PTH-related peptide receptor type 1 (PTHR1). Four SNPs were genotyped for each of these three genes in these subjects (Long *et al.* 2004). For the VDR gene, the SNPs spanned 63.4 kb and pairwise LD (|*D*′|) (Lewontin 1964) ranged from 0.049 to 0.980 with an average of 0.345. For the APOE gene, the SNPs spanned 3.6 kb and pairwise LD ranged from 0.745 to 0.999 with an average of 0.898. For the PTHR1 gene, the SNPs spanned 20.3 kb and pairwise LD ranged from 0.947 to 0.985 with an average of 0.970. The nuclear families were divided into two subsamples (with the same number of subjects): unrelated parental sample and sibship sample. We estimated haplotype frequencies at the VDR (*n* = 221), APOE (*n* = 197), and PTHR1 (*n* = 218) genes using the above six strategies: EM-Sib, EM-Single, GH-Sib, GH-Family, FAMHAP-Sib, and FAMHAP-Sib-R. To evaluate the performance of the above strategies in the real data application, we applied the conventional EM algorithm for estimation of haplotype frequencies to the parental samples (denoted as EM-Parent). The conventional EM algorithm is effective for unrelated parental samples and should yield a high accuracy of haplotype frequency estimates in a sample of ∼400 unrelated subjects (Fallin and Schork 2000). Therefore, haplotype frequencies estimated from the EM-Parent are treated as references of “true” haplotype frequencies in the population. We then compared haplotype frequencies from the EM-Parent with those from the other six strategies (Tables 1–3⇓).

Haplotype frequencies estimated from the EM-Sib and FAMHAP methods are most similar to those from the EM-Parent. The total discrepancy for these methods is ∼2.5%. The performance of the EM-Single ranks fourth among the six strategies with an average discrepancy of 5.8%. The total discrepancies for the GH-Sib method are 32.6, 41.8, and 18.3% for the VDR, PTHR1, and APOE genes, respectively. The GH-Family performs reasonably better than the GH-Sib with an average discrepancy of 9.2%. There are three to six haplotypes whose frequency differences between the EM-Parent and the GH-Sib are >5%. Such a large estimated frequency difference is also observed between the EM-Parent and GH-Family methods. Furthermore, the GH methods falsely declared several pseudohaplotypes. In the PTHR1 gene, two haplotypes (*ACAA* and *ACGA*) whose frequencies were estimated to be zero in all of the EM methods were detected by both the GH-Sib and GH-Family methods. These two haplotypes likely do not exist in the sample since the EM methods are very robust in random samples (Fallin and Schork 2000). It is also worth noting that the FAMHAP methods tend to miss rare haplotypes. For example, four VDR haplotypes with frequencies of ∼2% (>1/4*n*) that were detected by the EM-Parent method were not found in the FAMHAP methods.

## DISCUSSION

There is an increasing consensus among human geneticists to use sib-pair linkage and association approaches to map disease-susceptibility genes (Freimer and Sabatti 2004). Many large sibship data sets have already been accumulated, calling for due applications. For late-onset diseases, sibship data without parents are common. One of the applications for these sibship data is haplotype analysis, which has shown some distinct advantages over single-marker analysis in genetic studies of common diseases. In the context of association studies, when the disease association of a specific allele is dependent on *cis*-acting interactions with other loci, the disease association may not be detected by testing a single allele unless the whole functional haplotype itself is analyzed. This has been demonstrated through both empirical studies (Drysdale *et al.* 2000; Martin *et al.* 2000) and simulation studies (Zhang *et al.* 2002, 2003). Computational algorithms and statistical methods are currently the preferred means, particularly for large-scale haplotype determination. In this study, we developed a new statistical method for haplotype inference for multiple tightly linked SNPs, which is specially designed for sibship data. We examined the robustness and statistical performance of the new method in our simulated data that were created from empirical haplotype data of the GH1 gene (Horan *et al.* 2003). The utility of our method was demonstrated with an application to the inferences of haplotypes at several candidate genes underlying osteoporosis.

The accuracy of the proposed method was improved over that of existing methods, as demonstrated in both simulated and real data. Our EM algorithm did not increase the computational burden for haplotype inference using sibship data, compared with that using unrelated parental data. Its computational efficiency was not affected by increasing sibship size. FAMHAP did not perform well for inferring haplotype with a large number of loci in single SNP genotype data and its performance was dramatically improved when running several orders of SNP data. In addition, FAMHAP tended to miss rare haplotypes in the sample. However, Genehunter estimated haplotype frequencies with large biases and falsely declared nonexistent haplotypes in the sample. This is because Genehunter assumes linkage equilibrium among markers on the basis of the Lander–Green algorithm (Lander and Green 1987). This assumption was apparently violated for dense SNP data that were used in our simulations and real data analyses. Similar results were also observed in the haplotype analysis of the HPC1 gene in a recent study (Schaid *et al.* 2002).

Haplotype inference can be viewed as a missing data problem. In our method, specifically developed for sibship data without parents, marker data from all of the sibships are observed data, whereas parental zygote configurations [*i.e.*, ] are unobservable or missing data. The principle of our method is to resolve sibship genotype data into their parental zygote configurations, from which haplotype inference can then be conducted. This circumvents the problem of related subjects that arises in the direct application of the conventional EM algorithm to sibship data. In our method, parental genotype data, if available, can reduce the number of potential zygote configurations and thus increase the accuracy of haplotype inference and accelerate the computation. Incorporating parental data into our method is straightforward and simply discards those zygote configurations that are not compatible with their observed parental genotypes. This is somewhat similar to a recent study of haplotype inference using nuclear family information (Rohde and Fuerst 2001). We therefore encourage researchers to collect available parent data if possible in addition to sibship data in their respective studies.

Finally, two assumptions underlying our method should be acknowledged. One is that no recombination occurs among dense SNP markers in the parental generation. This simplifies the variable in Equation 2 as a constant for different mating types. If takes into account marker interval distance as a function of recombination, it will lead to complexity of the likelihood of the data and require further investigation. However, this assumption should hold in real applications, especially for inferring haplotypes in haplotype blocks where recombination is highly restricted in human genomes. There is little evidence for historical recombination in past generations in haplotype blocks (Patil *et al.* 2001; Gabriel *et al.* 2002), let alone recombination occurring in a given single generation (*e.g.*, the parental generation). The other is that HWE holds at haplotypes. Fallin and Schork (2000) have recently reported a simulation study to assess the effects of departure from HWE on estimation of haplotype frequency by the EM algorithm in random population samples. They demonstrated that the EM algorithm is reasonably robust to departure from HWE and there is no increase in error with extreme departure from HWE toward excess homozygosity.

## APPENDIX A

The approximate variances of the estimates of haplotype frequencies can be obtained by inverting the estimated information matrix ,whereandThe last term of the above equation equals zero when evaluated at . The above equation can also be regarded as the observed information for the EM algorithm (Louis 1982).

## APPENDIX B

The possible unknown parental genotypes in Equation 5 were analytically collapsed given different mating designs, which could greatly reduce the EM computation. In our method, we first found the possible parental zygote configurations that are compatible with the sibling's genotypes. We then derived the explicit form of Equation 5 in each configuration. Below, we illustrate this collapse technique for sibship data with two siblings. The two siblings could take four, three, two, or one possible unobserved haplotypes.

Suppose that the two siblings take four haplotypes and their haplotype configuration is (, ). Accordingly, their compatible parental genotypes should be (, ) and (, ). For the parental genotype (, ), the denominator of Equation 5 can be expressed asThen we counted the number of haplotype

*j*that occurred in the two haplotype pairs () and (). The numerator of Equation 5 can be expressed asThe same results can be obtained from the parental genotype (, ).Suppose that the two siblings take three haplotypes and their possible haplotype configurations are (, ) and (, ). Accordingly, their compatible parental genotypes should be (, ) and (, ), respectively, where can be any haplotype in the population. For the parental genotype (, ), the denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed asFor the parental genotype (, ), the denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed as

Suppose that the two siblings take two haplotypes and their possible haplotype configurations are (, ), (, ), (, ), and (, ). Accordingly, their compatible parental genotypes should be (, ), (, ), (, ), and (, ), respectively, where and can be any haplotypes in the population. For the parental genotype (, ), the denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed asThe same results can be obtained from the parental genotype (, ) by exchanging the subscripts 1 and 2 in the above derivations. For the parental genotype (, ), the denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed asFor the parental genotype (, ), the denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed as

Suppose that the two siblings take one haplotype and their haplotype configuration is (, ). Accordingly, their compatible parental genotype should be (, ). The denominator of Equation 5 can be expressed asThe numerator of Equation 5 can be expressed as

This collapse technique can be also implemented similarly in other sibship data structures. As shown in the above derivations, the haplotype inference using sibship data does not dramatically increase the EM computation compared with that using unrelated parental data. Also, increasing sibship size does not result in a larger computational burden since the number of compatible parental genotypes is reduced and more easily resolved given the larger sibship size.

## Acknowledgments

We thank two anonymous reviewers for their helpful criticisms and comments. Investigators of this work were partially supported by grants from the National Institutes of Health and the state of Nebraska. This study also benefited by grants from the National Natural Science Foundation of China, Huo Ying Dong Education Foundation, the Ministry of Education of China, Xi'an Jiaotong University, and Hunan Normal University.

## Footnotes

Communicating editor: C. Haley

- Received December 2, 2005.
- Accepted June 14, 2006.

- Copyright © 2006 by the Genetics Society of America