## Abstract

Determining the patterns of DNA sequence variation in the human genome is a useful first step toward identifying the genetic basis of a common disease. A haplotype map (HapMap), aimed at describing these variation patterns across the entire genome, has been recently developed by the International HapMap Consortium. In this article, we present a novel statistical model for directly characterizing specific sequence variants that are responsible for disease risk based on the haplotype structure provided by HapMap. Our model is developed in the maximum-likelihood context, implemented with the EM algorithm. We perform simulation studies to investigate the statistical properties of this disease-sequencing model. A worked example from a human obesity study with 155 patients was used to validate this model. In this example, we found that patients carrying a haplotype constituted by allele Gly16 at codon 16 and allele Gln27 at codon 27 genotyped within the β2AR candidate gene display significantly lower body mass index than patients carrying the other haplotypes. The implications and extensions of our model are discussed.

CURRENT studies of genetic mapping have been developed to a point at which the quantitative variation of almost any complex trait can be dissected into its underlying Mendelian factors on the basis of molecular linkage maps (Lander and Botstein 1989) or association studies (Wu* et al.* 2002; Lou* et al.* 2003). Such Mendelian factors, *i.e.*, the so-called quantitative trait loci (QTL), are *hypothesized* regions with alleles triggering an effect on quantitative variation. A number of statistical methods have been developed to map QTL for various mapping population structures and different marker and character types, each aimed at precisely identifying QTL that are close enough to markers for their positional cloning using biotechnology approaches, such as chromosome walks (Wu and Casella 2005). These methods have led to the detection of QTL contributing to complex traits in a range of species from microorganisms to plant and animal species to humans (Mackay 2001).

The basic principle for QTL mapping is the cosegregation of the alleles at a QTL with those at one or a set of known polymorphic markers genotyped on a genome. If a QTL is cosegregating with molecular markers, the genetic effects of QTL and their genomic positions can be estimated from the markers. This approach is robust and powerful for the detection of major QTL and presents the most efficient way to utilize marker information when marker maps are sparse. However, this approach is limited in two aspects. First, because the markers and QTL bracketed by them are located at different genomic positions, the significant linkage of a QTL detected with given markers cannot give any information about the sequence structure and organization of QTL. Second, the inference of the QTL positions using the nearby markers is sensitive to marker informativeness, marker density, and mapping population type. As a result of these, only a few QTL detected from markers have been successfully cloned (Frary* et al.* 2000), despite a considerable number of QTL reported in the literature.

A more accurate and useful approach for the characterization of genetic variants contributing to quantitative variation is to directly analyze DNA sequences associated with a particular disease. If a string of DNA sequence is known to increase disease risk, this risk can be reduced by the alteration of this string of DNA sequence using a specialized drug. The control of this disease can be made more efficient if all possible DNA sequences determining its variation are identified in the entire genome.

With the recent development of the human genome project, massive amounts of DNA sequence data have been available across the human genome (International HapMap Consortium 2003). In particular, single-nucleotide polymorphisms (SNPs), being the most common type of variant in the DNA sequence, provide a powerful means for genotyping the whole genome. This facilitates the complete identification of specific sequence variants responsible for complex diseases. A linear arrangement of alleles (*i.e.*, nucleotides) at different SNPs on a single chromosome, or part of a chromosome, is called a haplotype. The cosegregation of SNP alleles on haplotypes leads to nonrandom association, *i.e.*, linkage disequilibrium (LD), between these alleles in the population. Empirical analyses of LD for SNPs have shown that nearby SNPs in the human genome tend to display highly significant levels of LD and are often distributed in block-like patterns, rather than displaying random- or even-spaced distribution as originally predicted (Patil* et al.* 2001; Dawson* et al.* 2002; Gabriel* et al.* 2002). SNPs within haplotype blocks are much more strongly associated with each other than are those between different blocks. Haplotype diversity within each block can be well explained by only a finite number of SNPs, called tag SNPs or representative SNPs. The existence of these tag SNPs means that it is not necessary to associate a disease with all SNPs in the DNA sequence to understand the complete genetic control of the disease.

In this article, we present a novel statistical model for determining specific DNA sequences that are associated with the phenotypic variation of disease risk. This model is derived on the basis of multilocus haplotype analysis using a finite number of tag SNPs. We derived a closed-form solution for estimating the effects of haplotypes, haplotype frequencies, allele frequencies, and the degrees of LD of various orders among tag SNPs underlying the disease. We performed simulation studies to test the statistical behavior of this haplotype-based sequence-mapping model. A worked example is used to validate our model, in which a DNA sequence variant is detected to significantly reduce human obesity.

## THEORY

Suppose there is a random sample of size *n* drawn from a natural human population at Hardy-Weinberg equilibrium. In this sample, a number of SNPs are genotyped genome-wide, aimed at the identification of DNA sequences responsible for a complex disease. Consider *R* (*R* > 1) tag SNPs for a haplotype block. Each of these *R* SNPs has two alleles denoted by *A*^{r}_{kr} (*k _{r}* = 1, 2;

*r*= 1, …

*R*), with allele frequencies denoted by

*p*

^{}

_{kr}for the

*r*th SNP. We use superscripts and subscripts to distinguish between different SNPs and different alleles within SNPs, respectively. These SNPs form 2

*possible haplotypes expressed as*

^{R}*A*

^{1}

_{k1}

*A*

^{2}

_{k2}…

*A*

^{R}

_{kR}, whose frequencies are denoted by

*p*

_{k1k2…kR}. The haplotype frequencies are composed of allele frequencies at each SNP and linkage disequilibria of different orders among SNPs (Lou

*et al.*2003). The random combination of maternal and paternal haplotypes generates 2

^{R}^{−1}(2

*+ 1) diplotypes expressed as (1 ≤*

^{R}*k*

_{1}≤

*l*

_{1}≤ 2, … , 1 ≤

*k*≤

_{R}*l*≤ 2). These

_{R}*R*SNPs form 3

*observable multilocus zygotic genotypes, generally expressed as*

^{R}*A*

^{1}

_{k1}

*A*

^{1}

_{l1}/

*A*

^{2}

_{k2}

*A*

^{2}

_{l2}/…/

*A*

^{R}

_{kR}

*A*

^{R}

_{lR}. When at most one SNP is heterozygous, the diplotype is consistent to its zygotic genotype. However, when two or more SNPs are heterozygous, the genotype will have different diplotypes and, therefore, the number of multilocus genotypes will be less than the number of diplotypes. Let

*P*

_{}and

*P*

_{k1l1/k2l2/ …/kRlR}denote the diplotype and genotype frequencies, respectively, and

*n*

_{k1l1/k2l2/ …/kRlR}denote the observations of various genotypes.

Our interest is to search for the haplotype diversity that can explain phenotypic variation in a complex disease. The association between haplotype diversity and phenotypic variation has been detected in several studies of drug responses (Judson* et al.* 2000; Bader 2001). This allows us to assume that a particular haplotype is different from other haplotypes for a given disease. Because haplotypes (comprising diplotypes) cannot be directly observed, the effects of different haplotypes on the phenotype need be postulated from observed zygotic genotypes. The inference of diplotypes for a particular genotype is statistically a missing data problem that can be formulated by a finite mixture model (Wu and Casella 2005).

### Likelihood function:

In this study, the complete data are diplotype configurations at a given set of SNPs for each genotype and for disease outcomes of subjects, whereas the observed data are the genotypes of these SNPs and the disease outcomes. The missing data are at the connection from the genotypes to diplotypes. For any given genotype, all possible diplotypes can be written out. For example, genotype *A*^{1}_{1}*A*^{1}_{1}/*A*^{2}_{1}*A*^{2}_{1} has one diplotype , as does genotype *A*^{1}_{1}*A*^{1}_{1}/*A*^{2}_{1}*A*^{2}_{2}, the diplotype . This is because in these situations where at most one SNP is heterozygous, the genotype and diplotype are identical. However, for a double-heterozygous genotype *A*^{1}_{1}*A*^{1}_{2}/*A*^{2}_{1}*A*^{2}_{2}, we have two diplotypes, and .

Table 1 lists all possible genotypes and diplotypes at two SNPs genotyped from a sample of size *n*. Each genotype (and therefore each diplotype) is composed of two haplotypes, one from the mother and the other from the father. Two haplotypes composing a diplotype come from four possible haplotypes, *A*^{1}_{1}*A*^{2}_{1}, *A*^{1}_{1}*A*^{2}_{2}, *A*^{1}_{2}*A*^{2}_{1}, and *A*^{1}_{2}*A*^{2}_{2}, whose frequencies are expressed as *p*_{11}, *p*_{12}, *p*_{21}, and *p*_{22}, respectively. The diplotype frequencies can be expressed in terms of the haplotype frequencies (Table 1). Two diplotypes and of a double heterozygote *A*^{1}_{1}*A*^{1}_{2}/*A*^{2}_{1}*A*^{2}_{2} have frequencies *p*_{11}*p*_{22} and *p*_{12}*p*_{12}, respectively. Thus, the relative frequencies of these two diplotypes for this double heterozygote are a function of haplotype frequencies, which can be expressed as *p*_{11}*p*_{22}/(*p*_{11}*p*_{22} + *p*_{12}*p*_{21}) and *p*_{12}*p*_{21}/(*p*_{11}*p*_{22} + *p*_{12}*p*_{21}), respectively.

Table 1 also gives the relative expected frequencies of haplotypes contained in a given genotype. All genotypes, except for the double heterozygote, contain one or two known haplotypes. For example, genotype *A*^{1}_{1}*A*^{1}_{1}/*A*^{2}_{1}*A*^{2}_{1} has one haplotype *A*^{1}_{1}*A*^{2}_{1}, whereas *A*^{1}_{1}*A*^{1}_{1}/*A*^{2}_{1}*A*^{2}_{2} has one half haplotype *A*^{1}_{1}*A*^{2}_{1} and one half haplotype *A*^{1}_{1}*A*^{2}_{2}. The double heterozygote contains four possible haplotypes, with the relative frequencies *p*_{11}*p*_{22}/(*p*_{11}*p*_{22} + *p*_{12}*p*_{21}) for haplotypes *A*^{1}_{1}*A*^{2}_{1} and *A*^{1}_{2}*A*^{2}_{2} and *p*_{12}*p*_{21}/(*p*_{11}*p*_{22} + *p*_{12}*p*_{21}) for haplotypes *A*^{1}_{1}*A*^{2}_{2} and *A*^{1}_{2}*A*^{2}_{1}.

The haplotype frequencies, arrayed by **Ω**_{p} = (*p*_{11}, *p*_{12}, *p*_{21}, *p*_{22}), that belong to population genetic parameters can be estimated using the nine observed genotypes (**G**) for two SNPs (Table 1). The log-likelihood function of unknown haplotype frequencies given observed genotypes can be written as a multinomial form, *i.e*., 1

Assuming that diplotypes are associated with phenotypic variation in a disease, we formulate a likelihood for unknown population (**Ω**_{p}) and quantitative genetic parameters (**Ω**_{q}) given observed phenotypes (*y*) and SNP genotypes (**G**). Generally speaking, a given 2-SNP genotype, *A*^{1}_{k1}*A*^{1}_{l1}/*A*^{2}_{k2}*A*^{2}_{l2}, can be partitioned into two possible diplotypes, and . Thus, such a log-likelihood function can be formulated on the basis of a two-component mixture model, *i.e.*, 2where the mixture proportion, represents the relative frequency of subject *i* whose diplotype is , and are the probability density functions for subject *i* who has two possible diplotypes, respectively, with the genotypic means of μ_{} for diplotype and μ_{} for diplotype and the common residual variance of σ^{2}. Note that subscript *i* is used to describe the genotypic means in the two density functions above because these means are diplotype- (and therefore subject-) dependent although there are only a total of 10 genotypic means for two SNPs. These means and variance are contained in vector .

Suppose there is a particular haplotype, *A*^{1}_{k1}*A*^{2}_{k2}, which is different from the other three haplotypes, *A*^{1}_{k1}*A*^{2}_{k̄2}, *A*^{1}_{k̄1}*A*^{2}_{k2}, and *A*^{1}_{k̄1}*A*^{2}_{k̄2} (where a bar means the alternative of that allele), in its effect on a complex disease. The phenotypic means of the three genotypes that contain these two distinct groups of haplotypes are denoted as μ_{1} for *A*^{1}_{k1}*A*^{1}_{k1}/*A*^{2}_{k2}*A*^{2}_{k2}, μ_{2} for *A*^{1}_{k1}*A*^{1}_{k̄1}/*A*^{2}_{k2}*A*^{2}_{k̄2}, and μ_{3} for *A*^{1}_{k̄1}*A*^{1}_{k̄1}/*A*^{2}_{k̄2}*A*^{2}_{k̄2}. On the basis of quantitative genetic theory, these three μ's can be partitioned into the overall mean (μ), the additive effect (*a*) due to the substitution of different haplotypes, and the dominant effect (*d*) due to the interaction between different haplotypes, *i.e.*, μ_{1} = μ + *a*, μ_{2} = μ + *d*, and μ_{3} = μ − *a*. In Table 1, the genotypic means are also given for different genotypes by assuming that haplotype *A*^{1}_{1}*A*^{2}_{1} is different from the rest of the haplotypes.

The log-likelihood function described by Equation 2 can be expanded to include all possible SNP genotypes, which is now expressed as 3given that these SNP genotypes are independent. It can be seen from the above likelihood function that, although most zygote genotypes contain a single component (diplotype), the double heterozygote is the mixture of two possible diplotypes weighted by π and 1 − π. Thus, an advanced statistical method should be implemented to obtain the MLEs of the underlying population and quantitative genetic parameters.

### An integrative EM algorithm:

We derived a closed-form solution for estimating the unknown parameters with the EM algorithm (Dempster* et al.* 1977). Haplotype frequencies can be expressed as a function of allelic frequencies and LD. For a two-SNP haplotype, we have 4where *D* is the linkage disequilibrium between the two SNPs. Thus, once haplotype frequencies are estimated, we can estimate allelic frequencies and LD by solving Equation 3. The estimates of haplotype frequencies are based on the log-likelihood function of Equation 1, whereas the estimates of diplotype genotypic means (and therefore the overall mean, haplotype additive, and dominant effects) and residual variance are based on the log-likelihood function of Equation 2. These two different types of parameters can be estimated using an integrative EM algorithm.

In the E step, the expected number (π* _{i}*) and probabilities (Π

*) of subject*

_{i}*i*of a double-heterozygous genotype who carries diplotype are calculated by 56

Note that for all the other genotypes, such probabilities do not exist.

In the M step, the probabilities calculated in the previous iteration are used to estimate the haplotype frequencies using 78910

Assuming that haplotype *A*^{1}_{1}*A*^{1}_{1} has an effect different from the other three haplotypes, these probabilities are used to estimate the haplotype additive and dominant effects and the residual variance by 11121314where *ṅ* = *n*_{11/12} + *n*_{12/11} and *n̈* = *n*_{11/22} + *n*_{12/22} + *n*_{21/21} + *n*_{21/22} + *n*_{22/22}. Iterations including the E and M steps are repeated among Equations 5–14 until the estimates of the parameters converge to stable values. The sampling errors of these parameters can be estimated by calculating Louis's (1982) observed information matrix.

### Hypothesis tests:

We can test two major hypotheses in the following sequence: (1) the association between two SNPs by testing their LD and (2) the difference of a given haplotype from the rest of the haplotypes in its effect on the disease outcome by testing the significance of haplotype additive and dominant effects. The LD between two given SNPs can be tested using two alternative hypotheses: 15

The log-likelihood-ratio test statistic (LR) for the significance of LD is calculated by comparing the likelihood values under H_{1} (full model) and H_{0} (reduced model) using 16where the tilde and hat denote the maximum-likelihood estimates (MLEs) of unknown parameters under H_{0} and H_{1}, respectively. The LR_{1} is considered to asymptotically follow a χ^{2} distribution with 1 d.f. The MLEs of allelic frequencies under H_{0} can be estimated using the EM algorithm described above, but with the constraint *p*_{11}*p*_{22} = *p*_{12}*p*_{21}.

Diplotype or haplotype effects on a complex trait can be tested using two alternative hypotheses expressed as 17

The log-likelihood-ratio test statistic (LR_{2}) under these two hypotheses can be similarly calculated. The LR_{2} may asymptotically follow a χ^{2} distribution with 2 d.f. However, the approximation of a χ^{2} distribution may be inappropriate when some regularity conditions, such as normal and uncorrelated residuals, are violated. The permutation test approach proposed by Churchill and Doerge (1994), which does not rely upon the distribution of the LR_{2}, may be used to determine the critical threshold for determining the existence of a QTL.

*R*-SNP sequence model:

The idea for sequencing a complex trait is described for a two-SNP model. It is possible that the two-SNP model is too simple to characterize genetic variants for quantitative variation. With the analytical line for the two-SNP sequencing model, we can readily extend our model to include an arbitrary number of SNPs whose sequences are associated with the phenotypic variation.

Consider *R* SNPs that form 3* ^{R}* observable multilocus zygotic genotytpes, generally expressed as

*A*

^{1}

_{k1}

*A*

^{1}

_{l1}/

*A*

^{2}

_{k2}

*A*

^{2}

_{l2}/… /

*A*

^{R}

_{kR}

*A*

^{R}

_{lR}. These genotypes are collapsed from a total of 2

^{R}^{−1}(2

*+ 1) diplotypes expressed as (1 ≤*

^{R}*k*

_{1}≤

*l*

_{1}≤ 2, … , 1 ≤

*k*≤

_{R}*l*≤ 1). A key issue for the multi-SNP sequencing model is how to distinguish among 2

_{R}

^{r}^{−1}different diplotypes for the same genotype heterozygous at

*r*loci. The relative frequencies of these diplotypes can be expressed in terms of haplotype frequencies. The integrative EM algorithm can be employed to estimate the MLEs of haplotype frequencies. Lou

*et al.*(2003) provided a general formula for expressing haplotype frequencies in terms of allele frequencies and linkage disequilibria of different orders. The MLEs of the latter can be obtained by solving a system of equations.

In the multi-SNP sequencing model, we face many haplotypes and haplotype pairs. An Akaike information criterion- or Bayes information criterion-based model selection strategy (Burnham and Andersson 1998) has been framed to determine the haplotype that is most distinct from the rest of the haplotypes in explaining quantitative variation. However, in practice, a simultaneous analysis of too many SNPs will encounter considerable computational load and, also, may not be necessary for the explanation of disease variation. These two factors should be considered in a further study of the determination of a maximal number of SNPs for sequencing mapping.

## RESULTS

### Simulation:

We performed Monte Carlo simulation experiments to examine the statistical properties of the model proposed for association studies between variation in DNA sequence and a complex trait. Our simulation studies were designed to consider the effects of different heritability levels (*H*^{2} = 0.1 *vs.* 0.4), different sample sizes (*n* = 100 *vs.* 400), and different gene action modes (additive *vs.* dominant *vs.* overdominant) on the estimation precision of parameters from our model. The ratio of dominant (*d*) over additive effect (*a*) is used to determine the degree of dominant. A haplotype is regarded as additive, dominant, or overdominant if this ratio is zero, one, and greater than one, respectively (Lynch and Walsh 1998).

Suppose two SNPs are segregating in a natural population at Hardy-Weinberg equilibrium. The allele frequencies and linkage disequilibrium between these two SNPs are given in Table 2. We assume that one of the four possible haplotypes for the two SNPs is different from the rest of the haplotypes in their effects on the phenotype of a complex trait, which leads to three distinct groups of diplotypes. By assigning the genetic values for these two contrasting haplotypes under different gene action modes (Table 2), we calculate the genotypic values for all possible diplotypes and further simulate their phenotypic values on the basis of normal distribution. The residual variance is determined according to different heritability levels (0.1 *vs.* 0.4) expressed as the relative proportion of genetic variance to total observed variance. The genetic variance is determined on the basis of the genotypic values of all diplotypes and their frequencies.

Table 2 describes the results of parameter estimation from our simulation using the proposed model. In general, all parameters (including population and quantitative genetic) can be reasonably estimated. As expected, the precision of population parameter estimation is a function of sample size and does not rely on the *H*^{2} value or the gene action mode. Large sample sizes always lead to more precise estimation of allelic frequencies and linkage disequilibrium (Table 2).

The estimation precision of quantitative parameters is dependent on heritability level, sample size, and gene action mode. In almost all situations, the additive effects can be estimated better than the dominant effect. The estimation of additive and dominant effects responds differently to genetic action mode, depending on the degree of dominance. For example, given the same heritability *H*^{2} = 0.1 and same sample size *n* = 100, the sampling error of the additive effect estimation is increased by 121% from additive to dominant modes, but is stable from dominant to overdominant modes. These two percentage values are 1 and 42% for the sampling errors of the dominant effect estimation. It appears that the additive effect estimation is more sensitive than the dominant effect estimation to the changes of heritability and sample size.

We analyze the power to detect a significant diplotype difference from our model under different combinations of *H*^{2} value, sample size, and gene action mode (Table 2). To do so, we first obtain empirical critical threshold values for declaring the significance of diplotype difference through simulations. The thresholds at the significance level α = 0.05 or 0.01 are estimated as the 95th and 99th percentile for the simulated test statistics calculated from simulated data involving no diplotype difference over 1000 LR_{2} values. It is clear that increased *H*^{2} levels and increased sample sizes can increase the power to detect diplotype differences. When *H*^{2} = 0.4 or *n* = 400, such power is 100% (Table 2).

An additional simulation study was performed to investigate the statistical behavior of the multi-SNP sequence model. The results from the 3-SNP sequence model are summarized in Table 3. First, all genetic parameters can be reasonably estimated from the 3-SNP sequence model, although the estimation precision of the dominance effect is reduced compared to that from the 2-SNP model. Second, the power to detect significant sequence variants and the estimation precision of different genetic parameters from the 3-SNP model can be improved with increased heritability levels, increased sample sizes, and decreased gene interactions.

### A worked example:

A real example from an obesity study is used to demonstrate the usefulness of our model. Numerous genes have been investigated as potential obesity-susceptibility genes (Mason* et al.* 1999; Chagnon* et al.* 2003). The β-2 adrenoceptor (BAR-2) is a major lipolytic receptor in human fat cells, whose function was found to be determined by known polymorphisms in codons 16 and 27 (Green* et al.* 1995). In a genetic study of obesity involving 149 women by Large* et al.* (1997), the Arg16Gly polymorphism at codon 16 was found to be associated with altered BAR-2 function with Gly16 carriers showing a fivefold increased agonist sensitivity. The Gln27Glu polymorphism at codon 27 is markedly associated with obesity traits. The homozygotes for Glu27 display an average fat mass excess of 20 kg and 50% larger fat cells than controls. Although the Arg16Gly polymorphism is not significantly associated with obesity and the Gln27Glu polymorphism is not significantly associated with the BAR-2 function, genetic variability in these two sites of the human BAR-2 gene could be of major importance for obesity, energy expenditure, and lipolytic BAR-2 function in adipose tissue, at least in women (Large* et al.* 1997).

To determine whether sequence variants at these two polymorphisms of BAR-2 are associated with obesity phenotypes, we investigated a group of 155 women of ages 32 to 86 years old with a large variation in body fat mass. Each of these patients was determined for her genotypes at codon 16 with two alleles, Arg16 (A) and Gly16 (G), and at codon 27 with two alleles, Gln27 (C) and Glu27 (G), within the BAR-2 gene and measured for body mass index (BMI). These two SNPs form four haplotypes designated as AC, AG, GC, and GG, which lead to nine genotypes, AA/CC, AA/CG, AA/GG, AG/CC, AG/CG, AG/GG, GG/CC, GG/CG, and GG/GG, and the 10 corresponding diplotypes, [AC][AC], [AC][AG], [AG][AG], [AC][GC], and [AC][GG] or [AG][GC], [AG][GG], [GC][GC], [GC][GG] and [GG][GG]. Our model is used to associate diplotype differences with variation in BMI. The MLEs of the haplotype frequencies, allele frequencies, and linkage disequilibrium between the two SNPs were obtained (Table 4). These two SNPs are highly associated with each other, whose LD is estimated as 0.1261. The allele frequencies are ∼0.61 for allele G at codon 16 and 0.38 for allele G at codon 27, respectively, suggesting that both of them have fairly high heterozygosity.

Our model is based on the assumption that one haplotype is different from the rest of the haplotypes. In a practical situation, as in our example used here, the issue of how these haplotypes differ in their effects is not known. By assuming that any one haplotype is different from the rest of the haplotypes, we can find a best-difference pattern that is based on the estimates of the LR_{2}'s using Equation 16. When haplotype GG, GC, AG, or AC is assumed to differ in BMI from the other haplotypes, the LR_{2} values were estimated as 10.35, 3.11, 1.52, and 2.32, respectively. Obviously, haplotype GG is considered as a reference haplotype that has an effect on BMI in a comparison with the other haplotypes. The difference between haplotype GG and the rest of the haplotypes can significantly explain some variation in BMI. The resultant LR_{2} for testing the association between DNA sequence and BMI phenotype (10.35) is well beyond the critical threshold value at the significance level of 1% (6.07) estimated from 1000 simulation replicates. Haplotype GG displays large additive and dominant effects. Patients with homozygous genotype GG/GG have significantly lower BMI than those with homozygous genotype AA/CC, as indicated by a large additive effect (*a* = −1.77). For double-heterozygote AG/GC, diplotype [GG][AC] reduces BMI by 3.05 (the dominant effect) compared to diplotype [AG][GC]. The difference between haplotype GG and the rest of the haplotypes can explain 6.3% of the observed variation in BMI. We estimate the standard errors of the MLEs of the population and quantitative genetic parameters on the basis of Louis's (1982) observed information matrix, suggesting that all MLEs have reasonable estimation precision although the estimates of quantitative genetic parameters are not as precise as those of population genetic parameters due to a small sample size used (see the simulation result in Tables 2 and 3).

## DISCUSSION

The elucidation of the entire human genome using SNPs will make it possible to develop a haplotype map of the human genome. Although such a HapMap has been recently available due to international collaborations (International HapMap Consortium 2003), there is a serious lack of powerful statistical tools for specific utility of this expensive HapMap to detect genes and genetic variations that affect health and disease. Such a statistical model has been developed and presented in this article. The idea behind our statistical modeling is that differences between SNP-constructed haplotypes may be associated with differing susceptibility to disease (International HapMap Consortium 2003).

SNPs reside at sites in the DNA sequence where individuals differ at a single DNA base. Sets of nearby SNPs on the same chromosome are inherited in blocks. Blocks may contain a large number of SNPs, but a few SNPs are enough to uniquely identify the haplotypes in a block (Gabriel* et al.* 2002). The HapMap is a map of these haplotype blocks constructed by tag SNPs, those that explain most of haplotype diversity. The HapMap should be valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to ∼500,000 tag SNPs. This will make genome scan approaches to finding regions with genes that affect diseases much more efficient and comprehensive, since effort will not be wasted typing more SNPs than necessary and all regions of the genome can be included.

Our model is founded on the discovery of tag SNPs in the genome, thus allowing for a fast scan for the association between variation in DNA sequence and traits. This model has three advantages. First, it can materialize the genetic basis for quantitative variation by directly characterizing specific DNA sequences predisposing to a certain disease. The traditional statistical models for genetic mapping attempt to postulate the position of hypothesized QTL that are linked with known markers genotyped from the genome (Lander and Botstein 1989; Lou* et al.* 2003). The QTL detected from these models are regarded as “hypothesized” because it is not possible to know their DNA sequences and, therefore, physiological function. As opposed to the traditional “indirect” approach, our model presents a “direct” approach. At present, the utility of the direct approach is limited to sequencing the functional parts of candidate genes with known biochemical or physiological function. With the release of HapMap, our model will make the direct approach more useful and more efficient in searching for causal variants throughout the whole genome. Our model is also different from traditional association studies (*e.g.*, Zaykin* et al.* 2002) whose aims are the significance test for the association between haplotypes and phenotypes (mostly discrete traits) rather than the precise estimation of haplotype effects. Second, our model is statistically simple and computationally fast. The most difficult part for the estimation from our model is to construct diplotype configurations for heterozygous genotypes at two or more SNPs. The estimation of population genetic parameters is based on a multinomial-likelihood function of the observed genotype data, whereas the estimation of quantitative genetic parameters is based on a mixture-based likelihood function including different diplotypes. These two likelihood functions can be easily integrated to a unified estimation framework implemented with the EM algorithm.

Finally, our model is flexible to different genetic and experimental settings. The results from a simulation study indicate that the association between DNA sequence and phenotype can be well detected when the trait has a modest heritability level (0.1) or a modest sample size (100) is used. Our model can also obtain fairly precise estimation of parameters when diplotypes display overdominance in the situation with modest heritability and sample size. The specific utility of our model to a real example from an obesity study leads to the successful detection of a DNA sequence (haplotype) at codons 16 and 27 genotyped within the β2AR candidate gene (Chagnon* et al.* 2003) for its significant impact on human obesity. This haplotype, composed of the Gly16 form of codon 16 and the Gln27 form of codon 27, tends to reduce BMI when it is combined with itself or any other haplotypes and accounts for ∼6% of the total observed variation in BMI for 155 patient samples.

Although our simulation and example were based on 2- or 3-SNP analyses, our sequencing model has been developed to allow for the detection of sequence variants involving any number of SNPs within a haplotype block. In addition to its use in studying genetic associations with disease, our sequencing model can be extended to study the genetic factors contributing to variation in response to environmental factors, in susceptibility to infection, and in the effectiveness of and adverse responses to drugs and vaccines (Evans and McLeod 2003; Weinshilboum 2003). It can also be modified to estimate the effects of sequence-sequence interaction on a complex trait. It is possible that a haplotype within a candidate gene interacts with haplotypes from other candidate genes. The characterization of specific DNA sequence variants for diseases should allow the development of tests to predict which drugs or vaccines would be most effective in individuals with particular genotypes for genes affecting drug metabolism.

## Acknowledgments

We thank two anonymous referees for their constructive comments on this manuscript. This work is supported by an Outstanding Young Investigator Award of the National Natural Science Foundation of China (30128017), a University of Florida Research Opportunity Fund (02050259), and a University of South Florida biodefense grant (7222061-12) to R.W. The publication of this manuscript was approved as journal series no. R-10063 by the Florida Agricultural Experiment Station.

## Footnotes

Communicating editor: J. B. Walsh

- Received April 20, 2004.
- Accepted May 26, 2004.

- Genetics Society of America