## Abstract

In this article, population-based regression models are proposed for high-resolution linkage disequilibrium mapping of quantitative trait loci (QTL). Two regression models, the “genotype effect model” and the “additive effect model,” are proposed to model the association between the markers and the trait locus. The marker can be either diallelic or multiallelic. If only one marker is used, the method is similar to a classical setting by Nielsen and Weir, and the additive effect model is equivalent to the haplotype trend regression (HTR) method by Zaykin *et al*. If two/multiple marker data with phase ambiguity are used in the analysis, the proposed models can be used to analyze the data directly. By analytical formulas, we show that the genotype effect model can be used to model the additive and dominance effects simultaneously; the additive effect model takes care of the additive effect only. On the basis of the two models, *F*-test statistics are proposed to test association between the QTL and markers. By a simulation study, we show that the two models have reasonable type I error rates for a data set of moderate sample size. The noncentrality parameter approximations of *F*-test statistics are derived to make power calculation and comparison. By a simulation study, it is found that the noncentrality parameter approximations of *F*-test statistics work very well. Using the noncentrality parameter approximations, we compare the power of the two models with that of the HTR. In addition, a simulation study is performed to make a comparison on the basis of the haplotype frequencies of 10 SNPs of angiotensin-1 converting enzyme (ACE) genes.

IN genetics research, one important goal is to locate and identify important genetic variants that are related to complex traits. With the development of dense maps such as single-nucleotide polymorphisms (SNPs) and high-resolution microsatellites in the human genome, enormous amounts of genetic data on human chromosomes are becoming available (International SNP Map Working Group 2001; Kong *et al.* 2002; International HapMap Consortium 2003; HapMap project, http://www.hapmap.org). The opportunities for a genomewide scan to map complex disease genes are tremendous. It is important to build appropriate models and useful algorithms in association mapping of complex diseases to identify important genetic variants of complex traits, for human, animal, or plant study.

In recent years, there has been great interest in linkage disequilibrium (LD) mapping (or association study) of quantitative traits of complex diseases. One way is to use diallelic markers such as SNPs in analysis. This approach has been receiving much attention and there are quite a lot of references to it in the literature (Fulker *et al.* 1999; George *et al.* 1999; Abecasis *et al.* 2000a,b, 2001; Sham *et al.* 2000; Fan *et al.* 2005). Another approach is to use haplotype data that may consist of a set of SNPs (Schaid *et al.* 2002; Zaykin *et al.* 2002; Schaid 2004). The haplotype data may provide more information on the relation between DNA variants and complex traits than that of any single SNP. Hence, it is important to investigate models and algorithms that are based on haplotype data. In Schaid *et al.* (2002) and Zaykin *et al.* (2002), score tests are proposed for association between complex traits and haplotypes, which can be ambiguous owing to the unknown linkage phase of different haplotypes. In Zaykin *et al.* (2002), the method is called haplotype trend regression (HTR), which is very close to the method of Schaid *et al.* (2002) (see Schaid 2004, p. 355, for further explanation). HTR does not assume that haplotype phases are known. Meuwissen and Goddard (2000) introduced a haplotype-based approach, which assumes that haplotype phases are known. In addition, mixed models are used to model the haplotype effect in Meuwissen and Goddard (2000). Morris *et al.* (2004) used a Markov chain Monte Carlo algorithm based on the shattered coalescent model for fine mapping.

On the other hand, the direct available information is genotypes by current genotyping technology, instead of haplotypes. Hence, it is interesting to build models by directly using genotype information; under these models, the main effects of each marker are modeled, which does not require phase information across the markers. If phase is unknown, presumably the haplotypes would need to be estimated first, using a reconstruction algorithm such as PHASE or EM algorithms (Dempster *et al.* 1977; M. Stephens *et al.* 2001; Stephens and Donnelly 2003). This may introduce bias into the subsequent analysis, which would need to be investigated. It is of real interest in making comparison of the genotype-based models and the haplotype-based models. Interestingly, Morris *et al.* (2004) and Clayton *et al.* (2004) have observed that the haplotypes at SNPs may be only slightly more advantageous or even less powerful for fine mapping than the corresponding unphased genotypes.

Suppose that a quantitative trait locus (QTL) is located in a chromosome region. In the region, a marker (or two/multiple markers) is (or are) typed. In our previous research, the markers are assumed to be diallelic (Fan and Xiong 2002). In the current article, the markers can be either diallelic or multiallelic. Suppose that a population sample is available. For each individual in the sample, both trait value and genotypes at the markers are observed. We propose two regression models in association mapping of QTL based on population genetic data. One model is the “genotype effect model,” and the other is the “additive effect model.” These two models extend our previous research of high-resolution LD mapping of QTL using diallelic markers (Fan and Xiong 2002). The model can be very easily performed by using any statistical software in data analysis, or it can be easily implemented by widely used language such as C++. By analytical formulas, we show that the genotype effect model can be used to model the additive and dominance effects simultaneously; the additive effect model takes care of additive effect only. On the basis of the two models, *F*-test statistics are proposed to test association between the QTL and markers. To investigate the robustness of the proposed models and the related *F*-test statistics, simulation studies are performed to calculate the type I error rates. The noncentrality parameters of *F*-test statistics are derived to make power calculation and comparison. Moreover, the proposed models are compared with the haplotype trend regression method by simulation study and type I error rate analysis when two diallelic markers are used in the analysis (Zaykin *et al.* 2002). On the basis of the haplotype frequencies of 10 SNPs of angiotensin-1 converting enzyme (ACE) genes, a simulation study is performed to make power comparison of the proposed models with the haplotype trend regression method (Keavney *et al.* 1998).

A software, CLAM_QTL, is written in C++ to implement the proposed models and methods, which can be downloaded from http://www.stat.tamu.edu/∼rfan/software.html/.

## METHODS

As the first step, we present models and methods by using one marker. Here the marker can be either biallelic or multiallelic. This article extends our previous work (Fan and Xiong 2002). Similar results were worked out independently by colleagues at North Carolina State University, although their language and notations are slightly different (Weir and Cockerham 1977; Nielsen and Weir 1999, 2001). Then, the models and methods are extended to use two/multiple markers in analysis. On the basis of the models, *F*-test statistics are proposed, and the related noncentrality parameter approximations of the *F*-tests are derived.

#### Analysis by one marker:

##### Population models:

Consider a quantitative trait locus *Q*, which is located at an autosome. Suppose that there are two alleles *Q*_{1} and *Q*_{2} at the trait locus with frequencies *q*_{1} and *q*_{2}, respectively. In a region of the QTL *Q*, suppose that one marker *A* is typed, which may be diallelic such as a single-nucleotide polymorphism or may be multiallelic such as a microsatellite marker. Let us denote the alleles of marker *A* by *A*_{1}, …, *A _{m}*, where

*m*is the number of alleles. Suppose that the marker

*A*is in Hardy-Weinberg equilibrium (HWE). Let the frequency of

*A*be . There are

_{i}*J*=

_{A}*m*(

*m*+ 1)/2 possible genotypes, which can be listed as

*A*

_{1}

*A*

_{1}, …,

*A*,

_{m}A_{m}*A*

_{1}

*A*

_{2}, …,

*A*

_{1}

*A*, …,

_{m}*A*

_{m}_{−1}

*A*. Accordingly, let β

_{m}_{11}, …, β

_{mm}, β

_{12}, …, β

_{1m}, …, β

_{m−1,m}be the corresponding effects of the listed genotypes on the quantitative trait. Let

*y*be the trait value of an individual with genotype

*G*=

_{A}*A*. Under an assumption of normality, the trait value can be modeled as(1)where

_{i}A_{j}*w*is a row vector of covariates such as sex and age, γ is a column vector of regression coefficients of

*w*, and

*e*is the error term. Assume that

*e*is normal

*N*(0, σ

_{e}

^{2}). In addition to the covariate effects, there are

*J*=

_{A}*m*(

*m*+ 1)/2 parameters β

_{ij}in model (1), where β

_{ij}= β

_{ji}. Model (1) treats each genotype effect as one parameter. Hence, we call it a genotype effect model. In practice, model (1) may lead to large number of parameters.

Now let us denote the effect of allele *A _{i}* as α

_{i},

*i*= 1, …,

*m*. Suppose the genetic effect is additive in a sense of β

_{ij}= α

_{i}+ α

_{j},

*i*,

*j*= 1, …,

*m*. If an individual has quantitative trait value

*y*and genotype

*G*=

_{A}*A*, model (1) can be modified as(2)In addition to the covariate effects, there are

_{i}A_{j}*m*parameters α

_{i},

*i*= 1, …,

*m*, in model (2). Compared with model (1), model (2) may significantly reduce the number of parameters. Since it models only the additive effect, we call it the additive effect model.

##### Property of model coefficients and association tests:

As in the traditional quantitative genetics, let *a* be the effect of genotype *Q*_{1}*Q*_{1}, *d* be the effect of genotype *Q*_{1}*Q*_{2}, and −*a* be the effect of genotype *Q*_{2}*Q*_{2} (Falconer and Mackay 1996). Let α_{Q} = *a* + (*q*_{2} − *q*_{1})*d* be the average effect of gene substitution and δ_{Q} = 2*d* be the dominance deviation. In addition, let μ = *a*(*q*_{1} − *q*_{2}) + 2*dq*_{1}*q*_{2} be the aggregate effect of the QTL on the trait mean in the population. For *i* = 1, 2, …, *m*, let us denote , which are measures of LD between QTL *Q* and marker *A*. Here *P*(*Q*_{1}*A _{i}*) is the frequency of haplotype

*Q*

_{1}

*A*. In appendix a, we show that the regression coefficients of model (1) are given by(3)In appendix b, we show that the regression coefficients of model (2) are given by(4)From Equations 3 and 4, it is clear that β

_{i}_{ij}= α

_{i}+ α

_{j}, when δ

_{Q}= 0,

*i.e*., no dominance effect. Suppose that the marker

*A*and the QTL

*Q*are in linkage equilibrium;

*i.e.*, . Then Equation 3 implies β

_{ij}= μ; Equation 4 implies that α

_{i}= μ/2. Hence, models (1) and (2) are reduced to(5)

Assume that the additive genetic effect is significantly present, but the dominance genetic effect is not significantly present; *i.e.*, α_{Q} ≠ 0 but δ_{Q} = 0. To test association between the marker *A* and the QTL *Q*, one may test hypotheses H_{a0}: α_{1} = ⋯ = α* _{m} vs.* H

_{a1}: at least two α

_{i}'s are not equal. To see this, note that the hypotheses H

_{a0}: α

_{1}= ⋯ = α

_{m}is equivalent to , since α

_{Q}is significantly different from 0. Thus, implies and so under H

_{a0}. Hence, the hypotheses H

_{a0}: α

_{1}= ⋯ = α

*H*

_{m}vs._{a1}: at least two α

_{i}'s are not equal to each other are equivalent to at least one is not equal to 0. Model (2) can be used to map the QTL by an association analysis.

On the other hand, assume that both additive and dominance genetic effects are significantly present at the putative QTL *Q*; *i.e.*, α_{Q} ≠ 0 and δ_{Q} ≠ 0. To test association between the marker *A* and the QTL *Q*, one may test hypotheses H_{ad0}: β_{11} = ⋯ = β_{mm} = β_{12} = ⋯ = β_{1m} = ⋯ = β_{m−1,m} *vs.* H_{ad1}: at least two β_{ij}'s are not equal.

##### Relation to our previous work:

If the marker *A* has only two alleles *A*_{1} and *A*_{2}, Fan and Xiong (2002) proposed the following model in association mapping of the QTL *Q*,(6)where *x _{A}* and

*z*are dummy random variables defined by(7)and α

_{A}_{A}and δ

_{A}are regression coefficients of the dummy variables

*x*and

_{A}*z*. The regression coefficients are given by and (Fan and Xiong 2002). It can be shown that model (6) is equivalent to model (1). Actually, the following relations of the regression coefficients of the two models can be shown: , and . Similarly, model (2) is equivalent to

_{A}*y*=

*w*γ + μ +

*x*α

_{H}_{A}+

*e*, and we have the following relations and . The advantage of model (6) is that the association effect is decomposed into summations of additive and dominance effects if

*A*is diallelic. If

*A*has more than two alleles, model (1) extends model (6), and model (2) extends model

*y*=

*w*γ + μ +

*x*α

_{H}_{A}+

*e*.

##### Regression models:

Assume that *N* individuals from a population are available for study. Let us list their trait values as *y*_{1}, …, *y _{N}* and their genotypes as

*G*

_{A}_{1}, …,

*G*. For individual

_{AN}*k*, let

*x*

_{ii}^{(k)}be the indicator function of genotype

*A*and

_{i}A_{i}*x*

_{ij}^{(k)}be the indicator function of genotype

*A*. That is, they are dummy variables defined bywhere

_{i}A_{j}*i*,

*j*= 1, 2, …,

*m*,

*i*≠

*j*. Let ,

*k*= 1, 2, …,

*N*;

*i.e.*,

*X*is a column vector of genotype indicator functions of individual

_{k}*k*. Here the superscript τ denotes a vector/matrix transpose. Denote . The corresponding regression of model (1) can be written as(8)where subscript

*k*indicates the corresponding quantities of individual

*k*.

Similarly, let be the number of alleles *A _{i}* of genotype

*G*,

_{Ak}*i*= 1, 2, …,

*m*, for individual

*k*. That is, is a dummy variable defined byDenote and . To use model (2) for data analysis, the corresponding regression model is(9)

##### F-tests and noncentrality parameter approximations:

It is well known that the additive variance and the dominance variance . Let be the total variance. Assume that there are no covariates. Let us denote , and . Then model (8) can be expressed as *y* = *X*η + *e*. By standard regression theory, the coefficients can be estimated by . Let *H* be a (*J _{A}* − 1) ×

*J*matrix defined byThen, (

_{A}*H*η)

^{τ}= (β

_{11}− β

_{22}, …, β

_{11}− β

_{mm}, β

_{11}− β

_{12}, …, β

_{11}− β

_{1m}, …, β

_{11}− β

_{m−1,m}). Hence, the hypothesis H

_{ad0}is equivalent to

*H*η = (0, …, 0)

^{τ}. From Graybill (1976), Chap. 6, the test statistic of a hypothesis H

_{ad0}is noncentral

*F*(

*J*− 1,

_{A}*N*−

*J*) defined bywhere

_{A}*I*is the

_{N}*N*×

*N*identity matrix. The noncentrality parameter of the above

*F*-statistic is λ

_{m,ad}= (

*H*η)

^{τ}[

*H*(

*X*

^{τ}

*X*)

^{−1}

*H*

^{τ}]

^{−1}(

*H*η)/σ

^{2}. Under the assumption of large sample sizes

*N*, we show in appendix c the approximation(10)where

*R*

_{AQ}^{2}is a general measure of the degree of linkage disequilibrium between marker

*A*and the QTL

*Q*defined by (Crow and Kimura 1970; Hedrick 1987; Morton and Wu 1988; Sham

*et al.*2000). Note that

*R*

_{AQ}^{2}is the χ

^{2}-statistic of the

*m*× 2 table of haplotype frequencies of the marker

*A*and trait locus

*Q*. Approximation (10) shows that the noncentrality parameter of test statistics of the null hypothesis of no genetic effects of model (1) is reduced by a factor of for additive variance and by a factor of for dominance variance.

Similarly, let us denote . Then model (9) can be expressed as *y* = *Z*ψ + *e*. The coefficients can be estimated by . Let *K* be a (*m* − 1) × *m* matrix defined byThen, (*K*ψ)^{τ} = (α_{1} − α_{2}, …, α_{1} − α_{m}). Hence, the hypothesis H_{a0} is equivalent to *K*ψ = (0, …, 0)^{τ}. From Graybill (1976), Chap. 6, the test statistic of the hypothesis H_{a0} is noncentral *F*(*m* − 1, *N* − *m*) defined byThe noncentrality parameter of the above *F*-statistic is λ_{m,a} = (*K*ψ)^{τ}[*K*(*Z*^{τ}*Z*)^{−1}*K*^{τ}]^{−1}(*K*ψ)/σ^{2}. Under an assumption of large sample sizes *N*, we show in appendix d the following approximation:(11)This approximation (11) shows that the noncentrality parameter λ_{m,a} is reduced by a factor of for additive variance. The dominance variance is not present in λ_{m,a}.

#### Analysis by two/multiple markers:

##### Population models and association tests:

If genetic data of two/multiple markers are available, models (1) and (2) can be extended for association study of QTL. Most importantly, the data of two/multiple markers may contain phase ambiguity, *i.e.*, phase unknown double heterozygotes. In the following, we generalize models (1) and (2) to directly analyze genetic data of two markers. The principle, actually, can be applied to multiple marker data.

In addition to marker *A*, assume that a second marker *B* is typed, which has *n* alleles denoted by *B*_{1}, …, *B _{n}*. Suppose that the marker

*B*is also in Hardy-Weinberg equilibrium. Let the frequency of allele

*B*be . There are

_{k}*J*=

_{B}*n*(

*n*+ 1)/2 possible genotypes, which can be listed as

*B*

_{1}

*B*

_{1}, …,

*B*,

_{n}B_{n}*B*

_{1}

*B*

_{2}, …,

*B*

_{1}

*B*, …,

_{n}*B*

_{n}_{−1}

*B*. Let

_{n}*y*be the trait value of an individual with genotype

*G*at marker

_{A}*A*and genotype

*G*at marker

_{B}*B*. Such as relations (7), define If marker

*A*has only two alleles

*A*

_{1}and

*A*

_{2}, then

*x*defined above is closely related to

_{Ai}*x*, which is defined in (7). Actually, it is easy to see the following relation since .

_{A}To extend model (2) by using two markers *A* and *B* in the analysis, consider the following model(13)In addition to the covariate effects, there are *m* + *n* − 1 parameters α, α_{Ai}, α_{Bk}, *i* = 1, …, *m* − 1, *k* = 1, …, *n* − 1 in model (13). To see why model (13) extends model (2), it is worthwhile to note that model (2) is equivalent to . Actually, the quantity implies that if only information of marker *A* is used in the analysis; thus, α_{m} = α/2, α_{i} = α_{Ai} + α/2, *i* = 1, …, *m* − 1. Such as model (2), model (13) takes only the additive effect into account. Hence, we call it an additive effect model. Similarly, model (1) can be extended to(14)In addition to the covariate effects, there are *J _{A}* +

*J*− 1 parameters α, α

_{B}_{Ai}, α

_{Bk}, δ

_{Aij}, δ

_{Bkl}in model (14). Model (14) takes both additive and dominance effects into account, and it is called the genotype effect model. Again, model (1) is equivalent to .

Denote *X _{A}* = (

*x*

_{A}_{1}, …,

*x*

_{A}_{(m−1)})

^{τ},

*X*= (

_{B}*x*

_{B}_{1}, …,

*x*

_{B}_{(n−1)})

^{τ}, and

*X*

_{A}_{∪B}= (

*X*

_{A}^{τ},

*X*

_{B}^{τ})

^{τ}. Let us denote the additive variance–covariance matrix of the indicator variables

*x*,

_{Ai}*x*by . Similarly, let

_{Bk}*Z*= (

_{A}*z*

_{A}_{12}, …,

*z*

_{A}_{1m},

*z*

_{A}_{23}, …,

*z*

_{A}_{2m}, …,

*z*

_{A}_{(m−1)m)})

^{τ},

*Z*= (

_{B}*z*

_{B}_{12}, …,

*z*

_{B}_{1n},

*z*

_{B}_{23}, …,

*z*

_{B}_{2n}, …,

*z*

_{B}_{(n−1)n)})

^{τ}, and . Let us denote the dominance variance–covariance matrix of the indicator variables

*z*,

_{Aij}*z*by

_{Bkl}*V*

_{D}= Cov(

*Z*

_{A}_{∪B},

*Z*

_{A}_{∪B}). For

*k*= 1, 2, …,

*n*, let us denote , which are measures of LD between QTL

*Q*and marker

*B*. In appendix e, we show that the regression coefficients of models (13) and (14) are given by (15) The elements of matrices

*V*

_{A}and

*V*

_{D}are provided in appendix e. Equations 15 show that the parameters of LD (

*i.e.*, and ) and gene effect (

*i.e.*, α

_{Q}and δ

_{Q}) are contained in the regression coefficients. Models (13) and (14) simultaneously take care of the LD and the effects of the putative trait locus

*Q*. The gene substitution effect α

_{Q}is contained only in α

_{Ai}, α

_{Bk}; and the dominance effect δ

_{Q}is contained only in δ

_{Aij}, δ

_{Bkl}. Therefore,

*V*

_{A}is called the additive variance–covariance matrix; and

*V*

_{D}is called the dominance variance–covariance matrix. The model (14) orthogonally decomposes the genetic effect into a summation of additive and dominance effects.

In Fan and Xiong (2002), regression models are proposed for LD mapping of QTL by diallelic markers. Models (13) and (14) extend the models by using multiallelic markers in LD analysis. On the basis of Equations 15, we may use models (13) and (14) to test the association between the trait locus *Q* and the two markers *A* and *B*. Assume that the additive genetic effect is significantly present, but the dominance genetic effect is not significantly present; *i.e*., α_{Q} ≠ 0 but δ_{Q} = 0. To test association between the markers *A* and *B* and the QTL *Q*, one may test hypotheses H_{ABa0}: α_{A1} = ⋯ = α_{A(m−1)} = α_{B1} = ⋯ = α_{B(n−1)} = 0 *vs.* H_{ABa1}: at least one α_{Ai}, α_{Bk} is not equal to 0. To see this, note that the hypothesis H_{ABa0} is equivalent to , since α_{Q} is significantly different from 0. On the other hand, assume that both additive and dominance genetic effects are significantly present at the putative QTL *Q*; *i.e.*, α_{Q} ≠ 0 and δ_{Q} ≠ 0. To test association between the markers *A* and *B* and the QTL *Q*, one may test hypothesis H_{ABad0}: α_{A1} = ⋯ = α_{A(m−1)} = α_{B1} = ⋯ = α_{B(n−1)} = δ_{A12} = ⋯ = δ_{A1m} = ⋯ = δ_{A(m−1)m} = δ_{B12} = ⋯ = δ_{B1n} = ⋯ = δ_{B(n−1)n} = 0 *vs.* H_{ABad1}: at least one α_{Ai}, α_{Bk}, δ_{Aij}, δ_{Bkl} is not equal to 0, since both α_{Q} and δ_{Q} are significantly different from 0.

##### Regression models, F-tests, and noncentrality parameter approximations:

Assume that *N* individuals from a population are available for study, whose trait values are listed as *y*_{1}, …, *y _{N}* and their genotypes as

*G*

_{A}_{1}, …,

*G*at marker

_{AN}*A*and

*G*

_{B}_{1}, …,

*G*at marker

_{BN}*B*. For individual

*s*, let be the corresponding coding functions of genotypes

*G*and

_{As}*G*. Let us denote and . Denote α

_{Bs}_{A∪B}= (α, α

_{A1}, …, α

_{A(m−1)}, α

_{B1}, …, α

_{B(n−1)})

^{τ}, and δ

_{A∪B}= (δ

_{A12}, …, δ

_{A(m−1)m}, δ

_{B12}, …, δ

_{B(n−1)n})

^{τ}. The corresponding regression of model (14) can be written as(16)Let us denote and and . On the basis of regression (16), one may construct an

*F*-test statistic

*F*

_{AB}_{,ad}to test the null hypothesis H

_{ABad0}in the same way as constructing

*F*

_{m}_{,ad}or

*F*

_{m}_{,a}(Graybill 1976, Chap. 6). Under the null hypothesis of H

_{ABad0},

*F*

_{AB}_{,ad}is central to

*F*(

*J*+

_{A}*J*− 2,

_{B}*N*−

*J*−

_{A}*J*+ 1). Assume the sample size

_{B}*N*is large enough that the large sample theory applies. Under the alternative hypothesis of H

_{ABad1},

*F*

_{AB}_{,ad}is noncentral to

*F*(

*J*+

_{A}*J*− 2,

_{B}*N*−

*J*−

_{A}*J*+ 1), and it can be shown that the corresponding noncentrality parameter is approximated bySimilarly, an

_{B}*F*-test statistic

*F*

_{AB}_{,a}used to test the null hypothesis H

_{ABa0}can be constructed. Under the null hypothesis of H

_{ABa0},

*F*

_{AB}_{,a}is central to

*F*(

*m*+

*n*− 2,

*N*−

*n*−

*m*+ 1). Under the alternative hypothesis of H

_{ABa1},

*F*

_{AB}_{,a}is noncentral to

*F*(

*m*+

*n*− 2,

*N*−

*m*−

*n*+ 1), and it can be shown that the corresponding noncentrality parameter is approximated by

#### The haplotype trend regression method:

If only one marker *A* is used in the analysis, the proposed model (2) is equivalent to the HTR method of Zaykin *et al.* (2002). However, the proposed models are different from the haplotype trend regression method for two/multiple marker data. Assume that *M* markers are typed in a region of the trait locus *Q*. On the basis of the genotypes of the multiple markers, assume that *J* haplotypes can be determined as *h*_{1}, …, *h _{J}* with frequencies . For each individual, we may define an expected haplotype score vector as follows (Schaid

*et al.*2002; Zaykin

*et al.*2002). The expected haplotype score vector is a column vector of

*J*elements (

*c*

_{1}, …,

*c*)

_{J}^{τ}based on the genotype combination (

*G*

_{1}, …,

*G*) at the markers of an individual. For instance, the score vector is (1, 0, …, 0)

_{M}^{τ}if haplotype pair

*h*

_{1}/

*h*

_{1}is the only possible phase of the genotype combination (

*G*

_{1}, …,

*G*). In general,

_{M}*c*is the conditional probability of a haplotype

_{j}*h*given genotype combination (

_{j}*G*

_{1}, …,

*G*) at the markers;

_{M}*i.e.*,In the above equation, the conditional probability

*P*(

*G*

_{1}, …,

*G*|

_{M}*h*,

_{i}*h*) is 1 if haplotype pair

_{k}*h*/

_{i}*h*is a possible phase for the genotype combination (

_{k}*G*

_{1}, …,

*G*), and

_{M}*P*(

*G*

_{1}, …,

*G*|

_{M}*h*,

_{k}*h*) is 0 otherwise. For each individual, the summation of the expected haplotype scores is equal to 1.

_{j}For the purpose of explanation, consider two diallelic markers *A* and *B*. Let us denote the two alleles of marker *A* by *A*_{1}, *A*_{2}; and denote the two alleles of marker *B* by *B*_{1}, *B*_{2}. Table 1 gives the score vector for each genotype combination of markers *A* and *B*. To understand the entries of Table 1, it is worthwhile to take genotype combination (*G _{A}* =

*A*

_{1}

*A*

_{1},

*G*=

_{B}*B*

_{1}

*B*

_{1}) as an example. Two copies of haplotype

*A*

_{1}

*B*

_{1}can be formed from the genotype combination (

*G*=

_{A}*A*

_{1}

*A*

_{1},

*G*=

_{B}*B*

_{1}

*B*

_{1}). The score for haplotype

*A*

_{1}

*B*

_{1}is 1 for this genotype combination; and scores for the other three haplotypes are all 0. Denote the genotype of an individual at marker

*A*by

*G*and the genotype at marker

_{A}*B*by

*G*. Let us denote

_{B}*c*

_{1}=

*P*(

*A*

_{1}

*B*

_{1}|

*G*=

_{A}*A*

_{1}

*A*

_{2},

*G*=

_{B}*B*

_{1}

*B*

_{2}) =

*P*(

*A*

_{1}

*B*

_{1})

*P*(

*A*

_{2}

*B*

_{2})/[2

*P*(

*A*

_{1}

*B*

_{1})

*P*(

*A*

_{2}

*B*

_{2}) + 2

*P*(

*A*

_{1}

*B*

_{2})

*P*(

*A*

_{2}

*B*

_{1})] =

*c*

_{4};

*i.e.*,

*c*

_{1}is the conditional probability of a haplotype

*A*

_{1}

*B*

_{1}given the double heterozygotes (

*G*=

_{A}*A*

_{1}

*A*

_{2},

*G*=

_{B}*B*

_{1}

*B*

_{2}); and . For the double heterozygotes (

*G*=

_{A}*A*

_{1}

*A*

_{2},

*G*=

_{B}*B*

_{1}

*B*

_{2}), the expected scores are

*c*

_{1},

*c*

_{2},

*c*

_{2},

*c*

_{1}for haplotypes

*A*

_{1}

*B*

_{1},

*A*

_{1}

*B*

_{2},

*A*

_{2}

*B*

_{1},

*A*

_{2}

*B*

_{2}. The scores of the other genotype combinations are provided in Table 1. Then the corresponding model of the haplotype trend regression method can be written as(17)where β

_{i}are regression coefficients, and

*I*are expected scorings of haplotypes defined in Table 1. It can be seen that model (17) is not equivalent to either proposed model (13) or model (14).

_{i}In the general case of *M* markers, let *I _{j}* be the expected score of haplotype

*h*,

_{j}*j*= 1, 2, …,

*J*. In terms of conditional probabilities,

*I*can be expressed asThe corresponding model of the haplotype trend regression method can be written as(18)For

_{j}*j*= 1, 2, …,

*J*, let us denote , which are measures of LD between QTL

*Q*and the haplotypes. Here

*P*(

*Q*

_{1}

*h*) is the frequency of haplotype

_{j}*Q*

_{1}

*h*. In appendix f, we show that the regression coefficients of model (18) satisfy the matrix equation(19)where

_{j}*E*(

*I*) are given in appendix f, andFrom Equations 19, it is clear that model (18) models both the additive and dominance effects. Suppose that the haplotype and the QTL

_{i}I_{k}*Q*are in linkage equilibrium;

*i.e.*, . Then Equation 19 implies β

_{1}= ⋯ = β

_{J}= μ, since and . Hence, model (18) is reduced to (5). To test association between the haplotypes and the trait locus, one may test a null hypothesis β

_{1}= ⋯ = β

_{J}, and the related

*F*-test statistic can be constructed.

Again, assume that *N* individuals from a population are available for study with trait values and genotype information. On the basis of regression (18), one may construct an *F*-test statistic *F*_{HTR} to test the null hypothesis β_{1} = ⋯ = β_{J} = μ (Graybill 1976). Under the null hypothesis, *F*_{HTR} is central to *F*(*J* − 1, *N* − *J*). Under the alternative hypothesis that at least two β_{j}'s are not equal to each other, *F*_{HTR} is noncentral to *F*(*J* − 1, *N* − *J*). Assume the sample size *N* is large enough that the large sample theory applies. Then it can be shown that the corresponding noncentrality parameter is approximated bywhere

The advantage of model (17) is that it may model the haplotype effect by parameters β_{i}. In practice, it is necessary to calculate the expected scorings or haplotype frequencies before building the haplotype trend regression model. Instead, the proposed models (13) and (14) may be used to analyze genetic data directly. Moreover, we have derived analytical formulas to calculate the regression coefficients of the HTR method and the related noncentrality parameter of the test statistic *F*_{HTR}. Note that the original article by Zaykin *et al.* (2002) did not work out this very useful information. Our analytical coefficient equations and related noncentrality parameter approximations can be readily utilized for power evaluation.

## RESULTS

#### Type I error rates:

To evaluate the robustness of the proposed models, we calculate type I error rates of test statistics *F _{m}*

_{,ad},

*F*

_{m}_{,a},

*F*

_{AB}_{,ad},

*F*

_{AB}_{,a}, and

*F*

_{HTR}at a 0.05 significance level. The results are presented in Tables 2 and 3. Four test cases are considered: null, no major gene effect

*a*=

*d*= 0; additive, additive mode of inheritance

*a*= 1, but no dominant effect

*d*= 0; dominant, dominant mode of inheritance

*a*=

*d*= 1; and recessive, recessive mode of inheritance

*a*= 1 and

*d*= −0.5. The total variance is fixed as σ

^{2}= 1.0 and the trait allele frequency is taken as

*q*

_{1}=

*q*

_{2}= 0.5 except for that in the null test case. In Table 2, only one marker

*A*is used in analysis; the number

*m*of alleles ranges from 2 to 6. The allele frequencies are given by: when

*m*= 2; when when

*m*= 4; when

*m*= 5; and when

*m*= 6.

To calculate the type I error rates, 10,000 data sets are simulated for each test case. Each data set contains either 200 or 300 individuals. In each test case in Table 2, the data sets are generated under an assumption of linkage equilibrium between the QTL *Q* and the marker *A*; *i.e.*, . That is, there is no association between the QTL *Q* and marker *A*. Utilizing the data sets, we fit either model (8) or model (9), and then calculate the *F*-test *F _{m}*

_{,ad}or

*F*

_{m}_{,a}. Because the data sets are generated under the assumption of linkage equilibrium, an empirical test statistic that is larger than the cutting point of the related

*F*-statistic at a 0.05 significance level is treated as a false positive. On the basis of the

*F*-test of either

*F*

_{m}_{,ad}or

*F*

_{m}_{,a}, type I error rates are calculated as the proportions of the 10,000 simulation data sets that give significant results at the 0.05 significance level.

For the test statistic *F _{m}*

_{,a}, the Table 2 results show that the type I error rates are around the 0.05 nominal significance level in all cases. Hence, the proposed model (9) is robust for data sets of a sample size

*N*= 200. For test statistic

*F*

_{m}_{,ad}, the type I error rates are around the 0.05 nominal significance level when

*m*≤ 5 for data sets of sample size

*N*= 200. For

*m*= 6 and a sample size

*N*= 200, the type I error rates of test

*F*

_{m}_{,ad}are too big for the dominant and recessive test cases (9.11 and 7.04%, respectively). This is partially due to the large degrees of freedom,

*J*− 1 =

_{A}*m*(

*m*+ 1)/2 − 1 = 20 of test

*F*

_{m}_{,ad}when

*m*= 6; in addition, the high rate of type I error may be also caused by the mode of inheritance,

*i.e.*, for the cases of dominant and recessive models. When the sample size increases to

*N*= 300, the type I error rates of test

*F*

_{m}_{,ad}are around the 0.05 nominal significance level for

*m*= 6. Model (8) is less robust than model (9).

In Table 3, two markers *A* and *B* are used in the analysis. The numbers *m* and *n* of alleles are equal to 2. The allele frequencies are given by and . In each test case, linkage equilibrium is assumed between the QTL *Q* and the markers *A* and *B*; *i.e.*, . Denote , which is the measure of LD between *A* and *B*. Here *P*(*A*_{1}*B*_{1}) is the frequency of haplotype *A*_{1}*B*_{1}. Let(20)be the measure of the third-order LD (Thomson and Baur 1984). Here *P*(*A*_{1}*Q*_{1}*B*_{1}) is the frequency of haplotype *A*_{1}*Q*_{1}*B*_{1}. Between marker *A* and marker *B*, two situations are considered: (1) linkage equilibrium, *i.e.*, , and (2) linkage disequilibrium, *i.e.*, . No linkage disequilibrium of third order is assumed among markers *A* and *B* and the QTL *Q*; that is, *D _{AQB}* = 0. Again, 10,000 data sets are simulated for each test case, and each data set contains 200 individuals. The simulation is done as follows. First, the haplotype frequencies are calculated on the basis of allele frequencies and LD coefficients by relation (20) (Thomson and Baur 1984). Then data sets are simulated using the haplotype frequencies. On the basis of the

*F*-test of either

*F*

_{AB}_{,ad}or

*F*

_{AB}_{,a}or the HTR method, type I error rates are calculated as the proportions of the 10,000 simulation data sets that give significant results at the 0.05 significance level. The Table 3 results show that the type I error rates are around the 0.05 nominal significance level in all cases. Hence, the proposed models (13) and (14) and the HTR method are robust for data sets of a sample size

*N*= 200.

Table 4 shows type I error rates (percentages) of test statistics *F _{ABC}*

_{,ad},

*F*

_{ABC}_{,a}, and

*F*

_{HTR}at a 0.05 significance level when three diallelic markers

*A*,

*B*, and

*C*are used in the analysis. The measures

*D*,

_{ABC}*D*, and

_{AQC}*D*of the third-order LD are defined as that of

_{BQC}*D*; the measure of the fourth order is defined accordingly (Bennett 1954). Such as relation (20), the haplotype frequencies at the three markers

_{AQB}*A*,

*B*, and

*C*and at QTL

*Q*are calculated on the basis of allele frequencies and LD coefficients by Weir's (1996, p. 119) relation (3.14). Then data sets are simulated using the haplotype frequencies. Since this article is about population data, one individual may have two copies of haplotypes. Each haplotype is sampled according to the haplotype frequencies. From the Table 4 results, we can see that the proposed models and the HTR method give correct type I errors for data sets of a sample size

*N*= 200.

#### Power calculation and comparison:

Let *h*^{2} = σ_{ga}^{2}/σ^{2} be the heritability. Figure 1 shows power curves of the test statistics *F*_{4,a}, *F*_{4,ad}, *F*_{2,a}, and *F*_{2,ad} against the disequilibrium coefficient for a dominant mode of inheritance *a* = *d* = 1.0 at a 0.05 significance level based on the approximations of noncentrality parameters λ_{m,a} and λ_{m,ad}. *F*_{4,a} and *F*_{4,ad} are calculated when *A* has four equal frequency alleles; *i.e*., . In addition, the measures of LD are given as follows: Figure 1, A and B, , and Figure 1, C and D, . *F*_{2,a} and *F*_{2,ad} are calculated by collapsing the four alleles to be two alleles: in Figure 1, A and C, alleles *A*_{1} and *A*_{2} are collapsed as one allele, and alleles *A*_{3} and *A*_{4} are collapsed to be the other; in Figure 1, B and D, alleles *A*_{1} and *A*_{3} are collapsed to be one allele, and alleles *A*_{2} and *A*_{4} are collapsed to be the other. For *F*_{2,a} and *F*_{2,ad}, a simple calculation can show that the measures of LD in Figure 1A are 0, 0; the measures of LD in Figure 1B are ; the measures of LD in Figure 1C are 0, 0; and the measures of LD in Figure 1D are . Hence, the QTL *Q* is in linkage equilibrium with the marker after collapsing the alleles in Figure 1, A and C. The other parameters are *q*_{1} = 0.50, *h*^{2} = 0.25, *N* = 200.

From Figure 1, we may see the following:

*F*_{4,ad}is slightly less powerful than*F*_{4,a}, and*F*_{2,ad}is slightly less powerful than*F*_{2,a}. This is because that test statistic*F*_{m}_{,ad}has larger degrees of freedom than those of*F*_{m}_{,a}. Note that the noncentrality parameter approximation λ_{m,ad}of*F*_{m}_{,ad}is given by Equation 10. The contribution of the dominance effect is , which depends on both dominance effect*d*and the magnitude of factor and it can be significant when both of them are large enough. Hence, including a dominance component in the model can improve the power of QTL detection only when the magnitude of is large enough to compensate for the extra degrees of freedom. Note that the quantity is the product of the dominance variance and of the measure*R*_{AQ}^{4}of LD. The magnitude of is the result of the dominance variance reduced by a factor . Even when is large, can be small when LD coefficients are not big;*i.e.*, is small.When the measures of LD are high, the power of the test statistics is high. On the other hand, the power is minimal if all measures of LD are close to 0.

The dependence of power on measures of LD can also be observed by comparing Figure 1A with Figure 1C, 1B with 1D. The power of

*F*_{4,ad}and*F*_{4,a}in Figure 1A is higher than that of*F*_{4,ad}and*F*_{4,a}in Figure 1C, respectively; the power of each test statistic in Figure 1B is higher than that of the same test statistic in Figure 1D. This is because the measures of LD in Figure 1A are equal to or higher than those in Figure 1C, and the measures of LD in Figure 1B are equal to or higher than those in Figure 1D.In Figure 1B and Figure 1D, the power of

*F*_{4,ad}is slightly lower than that of*F*_{2,ad}; the power of*F*_{4,a}is slightly lower than that of*F*_{2,a}.In Figure 1A and Figure 1C, the power of

*F*_{2,ad}and*F*_{2,a}is minimal. This is because measures of LD are 0 after collapsing the alleles in these two graphs.

Figure 2 shows power curves of the test statistics *F*_{4,a}, *F*_{4,ad}, *F*_{3,a}, and *F*_{3,ad} against the disequilibrium coefficient for a dominant mode of inheritance *a* = *d* = 1.0 at a 0.05 significance level. *F*_{4,a} and *F*_{4,ad} are calculated as those in Figure 1. *F*_{3,a} and *F*_{3,ad} are calculated by collapsing two of the four alleles to be a new alelle: in Figure 2, A and C, alleles *A*_{1} and *A*_{2} are collapsed as a new one; in Figure 2, B and D, alleles *A*_{1} and *A*_{3} are collapsed to be a new one. For *F*_{3,a} and *F*_{3,ad}, a simple calculation can show that the measures of LD in Figure 2A are the measures of LD in Figure 2B are the measures of LD in Figure 2C are and the measures of LD in Figure 2D are . Among the features shown in Figure 1, it can be seen that in Figure 2, A and C, the power of *F*_{4,ad} is higher than that of *F*_{3,ad}, and the power of *F*_{4,a} is higher than that of *F*_{3,a}. In Figure 2, B and D, the power of *F*_{4,ad} is slightly lower than that of *F*_{3,ad}, and the power of *F*_{4,a} is slightly lower than that of *F*_{3,a}. Hence, the way to collapse the alleles has impact on power.

From Figures 1 and 2, we may see that the power of *F*_{4,a} and *F*_{4,ad} is relatively stable although it may be slightly lower than that of *F*_{3,a}, *F*_{3,ad}, *F*_{2,a}, and *F*_{2,ad} in certain circumstances. However, the power of *F*_{3,a}, *F*_{3,ad}, *F*_{2,a}, and *F*_{2,ad} depends heavily on the way to collapse the alleles. This shows the advantage of using multiallelic markers in an association study of QTL detection. For multiallelic marker data, the proposed test statistics *F _{m}*

_{,a}and

*F*

_{m}_{,ad}can be directly used to test if there is association between the marker and the QTL. As shown in Figures 1 and 2, the test statistic

*F*

_{m}_{,a}is usually more powerful than

*F*

_{m}_{,ad}due to the increase of degrees of freedom of test statistic

*F*

_{m}_{,ad}.

Figure 3 shows power curves of the test statistics *F*_{4,a} and *F*_{4,ad} against the heritability *h*^{2} at a 0.05 significance level for a dominant mode of inheritance *a* = *d* = 1.0 and for a recessive mode of inheritance *a* = 1.0, *d* = −0.5, respectively. As with Figures 1 and 2, Figure 3 is based on noncentrality parameter approximations (10) and (11). In Figure 3, A and B, the power can be high as the heritability *h*^{2} > 0.1; in these two graphs, the measures of LD are given by . In Figure 3, C and D, the power can be high as the heritability *h*^{2} > 0.15; in these two graphs, the measures of LD are given by . Figure 4 shows power curves of the test statistics *F*_{4,a} and *F*_{4,ad} against the trait allele frequency *q*_{1} or marker allele frequency at a 0.05 significance level. It can be seen that the power depends on both the measures of linkage disequilibrium and the trait allele frequency *q*_{1} or marker allele frequency .

#### Comparison with the haplotype trend regression method:

Assume that the two diallelic markers *A* and *B* are used in the analysis. Figures 5 and 6 show power curves of the test statistics *F _{AB}*

_{,a},

*F*

_{HTR}, and

*F*

_{AB}_{,ad}against the heritability

*h*

^{2}at a 0.05 significance level. The related parameters are given in the figure legends. The power curves of the test statistics

*F*

_{AB}_{,a},

*F*

_{HTR}, and

*F*

_{AB}_{,ad}are calculated on the basis of approximations of noncentrality parameters λ

_{ABa}, λ

_{HTR}, and λ

_{ABad}.

In Figure 5, no third-order linkage disequilibrium is assumed; *i.e.*, *D _{AQB}* = 0. In Figure 6, A and B, weak third-order linkage disequilibrium is assumed;

*i.e.*,

*D*= 0.025. It can be seen that the genotype effect model can be less powerful than the HTR method, and the HTR method can be less powerful than the additive effect model in the case of no or weak third-order linkage disequilibrium among the two markers and the QTL (Figure 5 and Figure 6, A and B). In Figure 6, C and D, strong third-order linkage disequilibrium is assumed;

_{AQB}*i.e.*,

*D*= 0.065. In the case that strong third-order linkage disequilibrium exists, the HTR method can be more powerful (Figure 6, C and D).

_{AQB}Note the following fact: in Figure 6, A and B, the maximum of *D _{AQB}* is 0.025; in Figure 6, C and D, the maximum of

*D*is 0.065 (otherwise, some of the haplotype would have negative frequencies). Thus, the simulated power curves of the haplotype trend regression method in Figures 5 and 6 represent the two extreme situations: (1) no third-order linkage disequilibrium (Figure 5) and (2) strongest third-order linkage disequilibrium (Figure 6). In practice, the third-order linkage disequilibrium would exist in a more moderate way that is between the two extremes; and the power of the haplotype trend regression method should be between those of the two extremes. Note that the proposed genotype effect model and additive effect model utilize only the second-order linkage disequilibrium or pairwise linkage disequilibrium. Hence, the powers of

_{AQB}*F*

_{AB}_{,a}and

*F*

_{AB}_{,ad}are the same for Figures 5 and 6.

Figure 7 shows power curves of the test statistics *F _{ABC}*

_{,a}and

*F*

_{ABC}_{,ad}and

*F*

_{HTR}against the heritability

*h*

^{2}at a 0.05 significance level, when three diallelic markers

*A*,

*B*, and

*C*are used in the analysis. The related parameters are given in the figure legend. From Figure 7, it can be seen that the power of

*F*

_{HTR}is the lowest. This is due to the large number of degrees of freedom of

*F*

_{HTR}, which is

*F*(7,

*N*− 8),

*N*= 200. In contrast,

*F*

_{ABC}_{,a}is

*F*(3,

*N*− 4),

*N*= 200; and

*F*

_{ABC}_{,a}is

*F*(6,

*N*− 7),

*N*= 200. The low power of

*F*

_{HTR}is most likely due to the biallelic QTL situation that we consider. In the situation of multiple QTL haplotypes and strong LD between QTL and marker haplotypes, the haplotype-based methods are expected to have good power.

#### Comparison based on ACE haplotype frequencies:

To work on more realistic scenarios, we take the haplotype information of ACE genes as an example. Ten diallelic polymorphisms in the ACE gene spanning 26 kb were genotyped (Keavney *et al.* 1998). The order of the 10 polymorphisms is T-5991C, A-5466C, T-3892C, A-240T, T-93C, T1237C, G2215A, I/D, G2350A, and 4656(CT)3/2. Table 5 lists 10 haplotypes, where the first 7 are the most frequent haplotypes (http://www.well.ox.ac.uk/∼mfarrall/oxhap_freq.html). For the 10 haplotypes, allele I at marker I/D is always present with allele A at marker G2350A, and allele D at marker I/D is always present with allele G at marker G2350A. Hence, the two markers can be treated as one. Similarly, markers T-5991C and A-5466C can be treated as one; and markers A-240T and T-93C can be treated as one. Therefore, the 10 haplotypes can be considered as containing seven markers.

In Abecasis *et al.* (2000a,b) and Fan *et al.* (2005), it is found that that markers I/D and G2350A show strongest association with the circulating ACE level. Thus, markers I/D and G2350A are treated as a putative trait locus *Q*. A quantitative trait of the putative locus *Q* is simulated for each graph in Figure 8, A–D. The empirical power curves of the test statistics *F*_{HTR}, *F*_{a}, and *F*_{ad} are plotted against the heritability *h*^{2} at a 0.05 significance level in Figure 8. Here *F*_{a} is the test statistic based on the additive effect model, and *F*_{ad} is the test statistic based on the genotype effect model. The empirical power curves *SF*_{HTR}, *SF*_{a}, and *SF*_{ad} in Figure 8 are calculated as follows. First, the interval (0.01, 0.25) of the heritability *h*^{2} is divided into 24 subintervals. Correspondingly, the 24 subintervals lead to 25 end points. For each end point, there is a set of parameters for the power curve. Using the set of parameters, 2500 data sets are simulated for each end point. For each data set, empirical statistics of *F*_{HTR}, *F*_{a}, and *F*_{ad} are calculated. The simulated power is the proportion of the 2500 simulated data sets for which the empirical statistic is larger than the cutting point of the corresponding *F*-distribution at a 0.05 significance level.

In Figure 8, A and C, the curves are plotted for a dominant mode of inheritance *a* = *d* = 1.0; in Figure 8, B and D, the curves are plotted for an additive mode of inheritance *a* = 1.0, *d* = 0. In Figure 8, A and B, all 10 haplotypes are used in the simulations; in Figure 8, C and D, only the first 7 most frequent haplotypes are used. From Figure 8, A–D, it can be seen that the proposed additive effect model has similar power to that of the HTR method. In Figure 8, A and C, when the dominance effects are present, the genotype effect model has similar power to those of the additive effect model and the HTR method. In Figure 8, B and D, the genotype effect model is less powerful because of the absence of the dominance effect. Hence, the genotype effect model can be useful only if the dominance effect can compensate for the extra degrees of freedom.

#### Simulation study:

To evaluate the accuracy of the noncentrality parameter approximations, we performed simulations for the power curves in Figures 1, 2, 5, 6, and 7. The results are presented as supplemental information (http://www.genetics.org/supplemental/). It can be seen that the approximations are excellent.

## DISCUSSION

In this article, two models, the genotype effect model and the additive effect model, are proposed for high-resolution association mapping of QTL on the basis of population data. The two models extend our previous research, which is based on multiple diallelic markers (Fan and Xiong 2002, 2003; Jung *et al.* 2005). The genotype effect model is closely linked to the measured genotype approach (Boerwinkle *et al.* 1986). The very popular genetics software such as Mendel 5.0 is already capable of performing association mapping of QTL by the additive effect model (Cantor *et al.* 2005; Lange *et al.* 2005). Surprisingly, there is no research to theoretically show why these two models are valid methods in association mapping of QTL under normal distribution. There are no existing analytical formulas to evaluate the power of the related test statistics. This article shows that the model coefficients are functions of measures of LD; and thus related *F*-test statistics can be constructed for association study of QTL. In the presence of both additive and dominance effects of the QTL, either the *F _{m}*

_{,ad}(or

*F*

_{AB}_{,ad}) statistic or the

*F*

_{m}_{,a}(or

*F*

_{AB}_{,a}) statistic can be used. Since the

*F*

_{m}_{,ad}(or

*F*

_{AB}_{,ad}) test statistic has bigger degrees of freedom than those of

*F*

_{m}_{,a}(or

*F*

_{AB}_{,a}),

*F*

_{m}_{,a}(or

*F*

_{AB}_{,a}) can be more powerful. If the extra degrees of freedom of the

*F*

_{m}_{,ad}test can be compensated by magnitude , it can be more powerful than

*F*

_{m}_{,a}.

The formulas of noncentrality parameter approximations (10) and (11) clearly indicate the dependence of the power on the quantity *R _{AQ}*

^{2}for genetic data. That is, the noncentrality parameter of test statistics of the null hypothesis of no genetic effects is reduced by a factor of for additive variance and by a factor of for dominance variance. If only one diallelic marker

*A*is used in the analysis, both our previous research and the work of colleagues have derived similar formulas to support this argument (Sham

*et al.*2000; Fan and Xiong 2002, 2003; Fan and Jung 2003; Fan

*et al.*2005; Jung

*et al.*2005). This is a good example in the debate on appropriate measures of LD for markers or multiallelic markers (Hedrick 1987; Devlin and Risch 1995; Pritchard and Przeworski 2001; Weiss and Clark 2002). For multiallelic markers or haplotypes, a satisfactory measure of LD has not been derived, as mentioned regarding p306 in Ardlie

*et al.*(2002). For two diallelic loci

*A*and

*Q*, Ardlie

*et al.*(2002) favor using , which is the correlation of alleles at the two loci. For multiallelic marker data, this article extends previous research by providing the definition of

*R*

_{AQ}^{2}and deriving Equations 10 and 11. Hayes

*et al.*(2003) introduced a multilocus approach for estimating LD and past effective size and used chromosome segment homozygosity (CSH), which was introduced in Sved (1971). The dependence of the noncentrality parameter on the quantity has been indicated by our study and also by Sham

*et al.*(2000).

In Fulker *et al.* (1999), Abecasis *et al.* (2000a,b, 2001), and Sham *et al.* (2000), an association between-family and association within-family (“AbAw”) approach is proposed to decompose the genetic association into effects of between pairs and within pairs on the basis of variance component models. The AbAw approach is based on any single diallelic marker. Instead of using a single diallelic marker, we have proposed variance component models using multiple diallelic markers. In our models, the association is decomposed into additive and dominance components (Fan and Xiong 2002, 2003; Fan and Jung 2003; Fan *et al.* 2005; Jung *et al.* 2005). In Fan and Jung (2003), Fan *et al.* (2005), and Jung *et al.* (2005), we compare our method with the AbAw approach and find that our method is advantageous over the AbAw approach. In model (1) or (2), only one marker is used in model building. If multiple markers or multiallelic markers are available, it is very easy to generalize the models to analyze the data. For instance, model (14) generalizes model (1) if two markers are available in the analysis. Accordingly, model (13) generalizes model (2). If only one marker is used in analysis, the proposed model (2) is equivalent to the haplotype trend regression method by Zaykin *et al.* (2002), which is very close to the method of Schaid *et al.* (2002). However, the proposed models are different from the haplotype trend regression method for two/multiple marker data. If both markers are diallelic markers, the genotype effect model can be less powerful than the HTR method, and the HTR method can be less powerful than the additive effect model in the case of no or weak third-order linkage disequilibrium among the two markers and the QTL. If strong third-order linkage disequilibrium exists, the HTR method can be more powerful.

Basically, the proposed models are genotype based. The models can be used to analyze directly any number of markers, and the markers can be either diallelic or multiallelic. By a simulation study based on ACE haplotype frequencies, we show that the proposed additive effect models have similar power to that of the haplotype-based HTR method. In the meantime, the proposed models enjoy the simplicity of not needing to estimate the expected haplotype scorings; in contrast, the HTR method needs to calculate the expected haplotype scorings before building the models. The proposed models decompose the main marker effects into a summation of additive and dominance effects. In the presence of haplotype effects, it is important to estimate the haplotype effects and haplotype-based methods are more relevant (Stram *et al.* 2003; Tregouet *et al.* 2004).

One potential problem of this generalization is that the number of parameters can be very big. Then, one needs to select important alleles in the analysis and search for important genetic variants that are truly associated with the genetic traits. At first glance, model (1), (2), (13), or (14) seems too complicated and contains too many terms. However, the models are not intimidating at all if one takes into account the recent discovery of haplotype structure in the human genome. Although a haplotype block may contain many SNPs, it takes only a few SNPs to uniquely identify each of the haplotypes in the block. Within a block, there are only two to four common haplotypes (Arnheim *et al.* 2003; Daly *et al.* 2001; Goldstein 2001; Patil *et al.* 2001; Reich *et al.* 2001; Rioux *et al.* 2001; J. C. Stephens *et al.* 2001; Gabriel *et al.* 2002; Nordborg and Tavaré 2002; Phillips *et al.* 2003). This implies that model (1), (2), (13), or (14) contains a few terms and hence is manageable. Moreover, model (1) or (2) already takes the haplotype structure into account and is potentially more powerful. In practice, one may want to collapse some alleles to reduce the number of parameters. However, the collapsing process may decrease linkage disequilibrium and therefore result in loss of power. The proposed regression models can be fitted to alleviate the problem.

In the mathematical derivations, we make the assumption of HWE. It is unclear how to construct tests reflecting deviations from HWE and this requires further research. In addition, we illustrate that the false-positive rate of the genotype effect test is too high for more than five alleles in a sample of 200 individuals. This is obviously due to the large numbers of possible genotypes and hence to sparseness in the contingency table. This problem could be overcome by using exact tests or permutation procedures.

The models of this article are based on population data. Suppose that both population and pedigree data including sibships are available. Then, model (1) or (2) can be generalized to perform high-resolution combined LD mapping and a linkage study of QTL by variance component models in the spirit of our previous work. In fact, we may generalize regression (1) or (2) by adding the polygenic effect to fit the data. Moreover, log-likelihoods can be constructed on the basis of variance component models. This will generalize our research by using either diallelic/multiallelic markers or haplotypes in a combined analysis of population and pedigree data. It is well known that association study-based population data are prone to false positives, due to the population stratification and population history. A valid approach would be to find linkage information by using pedigree data to locate the QTL on a broad chromosome region. Then, a combined linkage and association mapping can be performed for fine mapping of the genetic traits on the basis of both population and pedigree data (Fan and Xiong 2003). This would be more likely to overcome the drawbacks of separate analysis of either a linkage study or association mapping: low resolution of linkage analysis and high false-positive rates in the association study. In the meantime, it is more likely to take advantage of the two methods: the low false-positive rates of linkage analysis and the high resolution of the association-mapping method.

## APPENDIX A

For an individual of a population with trait values *y* and genotype *G _{A}* at marker

*A*, let

*x*be an indicator function of genotype

_{ii}*A*and

_{i}A_{i}*x*be an indicator function of genotype

_{ij}*A*. That is, they are dummy variables defined bywhere

_{i}A_{j}*i*,

*j*= 1, 2, …,

*m*,

*i*≠

*j*. Then model (1) can be rewritten as(A1)Note that . Given Equation A1, taking expectation of

*yx*leads to . On the other hand, a true random-effect model describing the trait value is

_{ii}*y*=

*w*γ +

*g*+

*e*, whereUtilizing and gives(A2)Equating the above quantity to shows Equation 3 when

*i*=

*j*.

If *i* ≠ *j*, . Multiplying at both sides of Equation A1 by *x _{ij}* and taking the expectation lead to

*E*(

*yx*) =

_{ij}*E*(

*x*)[

_{ij}*w*γ + β

_{ij}]. Again, utilizing , , , and gives(A3)Equating the above quantity to shows Equation 3 when

*i*≠

*j*.

## APPENDIX B

For an individual with trait values *y* and genotypes *G _{A}* at marker

*A*, let

*z*be the number of alleles

_{i}*A*of genotype

_{i}*G*,

_{A}*i*= 1, 2, …,

*m*. That is,

*z*is a dummy variable defined byThen model (2) can be rewritten as(B1)Multiplying both sides of expression (B1) by

_{i}*z*and taking the expectation lead to(B2)The elements of the matrix on the left-hand side of the above equation can be calculated as follows: . For

_{i}*i*≠

*j*, the expectation . For the elements on the right-hand side, Equations A2 and A3 lead to , since . Plugging the above quantities into matrix Equation B2 gives Equation 4 aswhere diag(…) denotes a diagonal matrix;

*e.g.*, isIn the above calculation, we use a fact of the inverse matrix (

*M*+

*ab*

^{τ})

^{−1}=

*M*

^{−1}− (

*M*

^{−1}

*a*)(

*b*

^{τ}

*M*

^{−1})/(1 +

*b*

^{τ}

*M*

^{−1}

*a*).

## APPENDIX C

Denote a vector . If the sample size *N* is large enough, the large number law implies the approximation(C1)where is a diagonal matrix, whose elements on the diagonal are given by the elements of . That is, if , then . Let *H* be a (*J _{A}* − 1) ×

*J*matrix defined byThen, (

_{A}*H*η)

^{τ}= (β

_{11}− β

_{22}, …, β

_{11}− β

_{mm}, β

_{11}− β

_{12}, …, β

_{11}− β

_{1m}, …, β

_{11}− β

_{m−1,m}). From approximation (C1), we have the approximationwhere . Applying a fact of inverse matrix (

*M*+

*ab*

^{τ})

^{−1}=

*M*

^{−1}− (

*M*

^{−1}

*a*)(

*b*

^{τ}

*M*

^{−1})/(1 +

*b*

^{τ}

*M*

^{−1}

*a*) again, we haveThe noncentrality parameter is given by(C2)From Equation 3, we haveUtilizing relation , we havePlugging the above equation into (C2), we haveNote that , and so . Hence, the noncentrality parameter approximation (10) is valid.

## APPENDIX D

The large number law implies the following approximation:In the above approximation, the quantities *E*(*z _{i}z_{j}*) in appendix b are used. Applying a fact of inverse matrix (

*M*+

*ab*

^{τ})

^{−1}=

*M*

^{−1}− (

*M*

^{−1}

*a*)(

*b*

^{τ}

*M*

^{−1})/(1 +

*b*

^{τ}

*M*

^{−1}

*a*), the inverse isLet

*K*be a (

*m*− 1) ×

*m*matrix defined byThen, (

*K*ψ)

^{τ}= (α

_{1}− α

_{2}, …, α

_{1}− α

_{m}). On the other hand, we have the approximationwhose inverse is given byTherefore, an approximation of the noncentrality parameter is given byEquation 4 implies that . Thus, the noncentrality parameter

## APPENDIX E

For *i* = 1, 2, …, *m*, *k* = 1, …, *n*, let us denote , which are measures of LD between markers *A* and *B*. Here *P*(*A _{i}B_{k}*) is frequency of haplotype

*A*. It can be shown that for

_{i}B_{k}*i*≠

*j*,

*k*≠

*l*,

*j*≠

*j*′,

*l*≠

*l*′, (

*i*,

*j*) ≠ (

*i*′,

*j*′), (

*k*,

*l*) ≠ (

*k*′,

*l*′),(E1)The quantities in (E1) imply that

Since *EZ _{A}*

_{∪B}is a vector of 0's by the quantities in (E1), it can be shown that

*V*

_{D}= Cov(

*Z*

_{A}_{∪B},

*Z*

_{A}_{∪B}) =

*E*(

*Z*

_{A}_{∪B}

*Z*

_{A}_{∪B}

^{τ}). Moreover, the quantities in (E1) imply that the covariance matrix Cov(

*X*

_{A}_{∪B},

*Z*

_{A}_{∪B}) is a 0 matrix.

Taking variance–covariance between *y* and *x _{Ai}*,

*x*,

_{Bk}*z*,

_{Aij}*z*on the basis of relation (14), we may get the regression coefficients (15) of models (13) and (14).

_{Bkl}## APPENDIX F

Multiplying both sides of expression (18) by *I _{j}* and taking the expectation lead to(F1)

The elements of the matrix on the left-hand side of the above equation can be calculated as follows:The elements on the right-hand side are given bywhere

Plugging the above quantities into matrix Equation F1 gives Equation 19.

## Acknowledgments

We thank two anonymous reviewers for very detailed and thoughtful critiques, which make the paper better. R. Fan was supported by the National Science Foundation Grant DMS-0505025.

## Footnotes

Communicating editor: G. Gibson

- Received June 2, 2005.
- Accepted September 19, 2005.

- Copyright © 2006 by the Genetics Society of America