## Abstract

The analysis of the haplotype-phenotype relationship has become more and more important. We have developed an algorithm, using individual genotypes at linked loci as well as their quantitative phenotypes, to estimate the parameters of the distribution of the phenotypes for subjects with and without a particular haplotype by an expectation-maximization (EM) algorithm. We assumed that the phenotype for a diplotype configuration follows a normal distribution. The algorithm simultaneously calculates the maximum likelihood (*L*_{0max}) under the null hypothesis (*i.e*., nonassociation between the haplotype and phenotype), and the maximum likelihood (*L*_{max}) under the alternative hypothesis (*i.e*., association between the haplotype and phenotype). Then we tested the association between the haplotype and the phenotype using a test statistic, −2 log(*L*_{0max}/*L*_{max}). The above algorithm along with some extensions for different modes of inheritance was implemented as a computer program, QTLHAPLO. Simulation studies using single-nucleotide polymorphism (SNP) genotypes have clarified that the estimation was very accurate when the linkage disequilibrium between linked loci was rather high. Empirical power using the simulated data was high enough. We applied QTLHAPLO for the analysis of the real data of the genotypes at the *calpain 10* gene obtained from diabetic and control subjects in various laboratories.

IN many cases, haplotypes or diplotype configurations but not genotypes are associated with phenotypes. A diplotype configuration is defined as a combination of two haplotype copies possessed by an individual, and an ordered diplotype configuration denotes an ordered list of two haplotypes arranged according to the derivation (father and mother). Since recent analyses disclosed many linked polymorphic loci within a gene, the multiple loci often have to be treated together rather than separately. A haplotype and a haplotype copy have distinct definitions in this manuscript since when a subject is homozygous for a haplotype, he (or she) is interpreted to have a single haplotype but two haplotype copies.

There are several common methods for haplotype inference using genotype SNP data. For example, the Clark algorithm (Clark 1990), the EM algorithm (Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long* et al.* 1995; Schneider* et al.* 2000; Kitamura* et al.* 2002), PHASE (Stephens* et al.* 2001), the PL algorithm (Niu* et al.* 2002), and the PL-EM algorithm (Qin* et al.* 2002) have been used. We also proposed an algorithm to estimate haplotypes by use of pooled genotype data (Ito* et al.* 2003). However, the methods to relate such inferred haplotypes to the phenotypes are still to be developed. We have recently proposed an algorithm (PENHAPLO) to test the association between qualitative phenotypes (*e.g.*, affection status) and the presence of a haplotype by EM algorithm (Ito* et al.* 2004). Thus far, we have considered disease status as a qualitative trait with two outcomes, affected and unaffected, and penetrances as the conditional probabilities of phenotypes given genotypes or diplotype configurations. In practice, however, the disease phenotype often consists of a quantitative measurement such as blood sugar level. The locus for a trait concerning the above disease phenotypes is referred to as the quantitative trait locus (QTL) and the associated phenotypes as quantitative phenotypes. Such quantitative phenotypes often follow continuous distributions, and the quantitative phenotypes should be handled separately from the qualitative phenotypes. Thus, the program PENHAPLO cannot be directly applied to quantitative phenotypes.

We developed an algorithm to estimate simultaneously the diplotype configurations for the subjects and the distribution of quantitative phenotypes for different diplotype configurations and to test the association between the phenotypes and the presence of a haplotype. Note that the following considerations apply to this article. First, rather than defining the probability of a phenotype (penetrance), probability density for a value of a quantitative phenotype was defined. Second, the phenotypes were considered to depend on diplotype configurations rather than on genotypes at single loci.

Recent studies have reported that, in some cases, drug responses and other phenotypes were associated not with genotypes but with haplotypes or diplotype configurations (Judson* et al.* 2000; Bader 2001; Urano* et al.* 2002; Tanaka* et al.* 2002). Tanck* et al.* (2003) presented a method to estimate multilocus haplotype effects using a weighted penalized log-likelihood model. Schaid* et al.* (2002) proposed methods to test the association between ambiguous haplotypes and a variety of traits (binary, ordinal, and quantitative traits), which were based on score equations for generalized linear models (GLMs).

Therefore, it is important to develop a method for testing the association between quantitative phenotypes and different diplotype configurations. One of the problems in haplotype inference is that the diplotype configurations for some subjects are not uniquely determined (ambiguous diplotype configurations). This is because more than one diplotype configuration is possible for a subject even when the genotypes at all the relevant loci are observed. We could regard the diplotype configuration with the highest probability as the true configuration and perform the test using the inferred data; however, such a test may inflate the type I error rates.

To overcome this problem, we developed an algorithm to simultaneously estimate parameters of the phenotype distributions, haplotype frequencies, and diplotype configurations given observed genotypes and the phenotype data.

As the simplest model, we assumed that the phenotype conditional on a diplotype configuration follows a normal distribution. Thus, the distribution of the phenotype for subjects with a specific haplotype follows *N*(μ_{1}, σ^{2}), while the distribution of the same phenotype without it follows *N*(μ_{2}, σ^{2}). We estimate haplotype frequencies, diplotype configurations, and parameters of the phenotype distribution by an EM algorithm using genotype and phenotype data. Ambiguous diplotype configurations are treated as multiple diplotype configurations for each subject with different probabilities. Using simulation data and real data, we demonstrate that our approach can be used to detect the association between quantitative phenotypes and the presence of a haplotype and to estimate the distribution of the phenotypes. Although we assumed the normality for the distribution of phenotypes in the standard model, the methods to cope with the violation of the normality are discussed in this manuscript.

## METHODS

### Analysis of real data:

As the real data, we used the data from three linked SNP loci of the *calpain* (*CAPN*)*10* gene and the quantitative phenotypes. Haplotypes at the *CAPN10* gene have been reported to be associated with type 2 diabetes (Horikawa* et al.* 2000). We selected body mass index (BMI), blood sugar level, and insulin level as the quantitative phenotypes. We applied the real data of the genotypes at the three linked loci as well as one of the phenotypes from the subjects to QTLHAPLO. By this method we tested the association between the haplotypes and the phenotypes and, at the same time, estimated the parameters of the distribution of the phenotypes.

### Algorithm:

#### Sample space:

In this study, the sample space is defined as a set of outcomes from the following experiment. Let us assume that there are *l* linked SNP loci. The number of all the possible haplotypes will be *L* = 2* ^{l}*. We set up a collection of an infinite number of haplotype copies. The relative haplotype frequencies in the collection are Θ = (θ

_{1}, … , θ

*, … , θ*

_{j}*), where θ*

_{L}*is the relative frequency of*

_{j}*j*th haplotype, and θ

*≥ 0, . According to the haplotype frequencies, an ordered combination of two haplotype copies is given to each of*

_{j}*N*individuals by randomly drawing them from the collection of the haplotype copies. A diplotype configuration is defined, in this article, as an ordered combination of two haplotype copies. Let

*a*

_{1},

*a*

_{2}, … ,

*a*

_{L}

^{2}be possible diplotype configurations. The probability that the

*i*th subject has the diplotype configuration

*a*given Θ is

_{k}*P*(

*d*=

_{i}*a*|Θ) = θ

_{k}*θ*

_{l}*, where*

_{m}*d*is a diplotype configuration for the

_{i}*i*th subject, and

*l*,

*m*are the labels (1, 2, … ,

*L*) of the haplotypes that constitute

*a*. The

_{k}*i*th subject develops quantitative phenotype ψ

*, following a probability density function. Let us assume that the phenotype for a diplotype configuration follows a normal distribution with a fixed variance but with a mean that depends on the diplotype configuration. An outcome from the experiment is defined by (Θ,*

_{i}*D*, Ψ), where

*D*= (

*d*

_{1}, … ,

*d*) denotes the vectors of the diplotype configurations and Ψ = (ψ

_{N}_{1}, … , ψ

*) denotes the vectors of the phenotypes. The mean μ of the distribution of a phenotype is assumed to differ between the subjects with and without haplotype*

_{N}*h*

_{b}.

*h*

_{b}is the haplotype that has a different effect from the others. Let

*D*

_{+}denote a set of diplotype configurations that contain the haplotype

*h*

_{b}. We then define two normal distributions for a phenotype, one,

*N*(μ

_{1}, σ

^{2}), for the subjects with

*d*∈

_{i}*D*

_{+}and the other,

*N*(μ

_{2}, σ

^{2}), for those with

*d*∉

_{i}*D*

_{+}. Let

*f*

_{μ1}denote the probability density function that the

*i*th individual develops a phenotype

*x*when

*d*∈

_{i}*D*

_{+}, and let

*f*

_{μ2}denote the probability density function that the

*i*th individual develops

*x*when

*d*∉

_{i}*D*

_{+}.

Thus, if ψ* _{i}* denotes the phenotype of

*i*th subject, the probability density is and

Note that ψ* _{i}* is independent of Θ conditional on

*d*.

_{i}#### Likelihood function:

The observed data are the genotypes and the quantitative phenotypes of the subjects. Let *G*_{obs} = (*g*_{1}, *g*_{2}, … , *g _{N}*) and Ψ

_{obs}= (

*w*

_{1},

*w*

_{2}, … ,

*w*) denote the vectors of the observed genotypes and the quantitative phenotypes, respectively, where

_{N}*g*and

_{i}*w*denote the observed genotypes and the quantitative phenotype of the

_{i}*i*th subject.

As the first step, we consider a general case in which the distributions differ between all the diplotype configurations. Let denote the vector of the means for the distributions for all possible diplotype configurations. Note that, in this context, the distributions of a phenotype are assumed to be potentially different between different diplotype configurations. Then the likelihood function is where *A _{i}* denotes the set of

*a*for the

_{k}*i*th subject that are consistent with

*g*and

_{i}*f*is the probability density function for

*N*(μ

*, σ*

_{k}^{2}).

Since *d _{i}* is independent of

**μ**, σ and ψ

*is independent of Θ conditional on*

_{i}*d*, 1

_{i}Under the null hypothesis that the distribution of the phenotype is independent of the diplotype configuration, the likelihood function is 2where, under the null hypothesis, the mean on the distribution of the phenotype for the diplotype configurations is invariable, and **μ _{0}** denotes the vectors of the means,

**μ**= (μ

_{0}_{0}, μ

_{0}, … , μ

_{0}). Then again,

*A*denotes the set of diplotype configurations for the

_{i}*i*th subject that are consistent with

*g*.

_{i}It is not realistic, however, to assign different distributions for all different diplotype configurations for the alternative hypothesis. We then set up only two normal distributions, *N*(μ_{1}, σ^{2}) and *N*(μ_{2}, σ^{2}), for the alternative hypothesis. For the null hypothesis, we set up only one normal distribution, *N*(μ_{0}, σ^{2}). Thus, under the alternative hypothesis, the *i*th subject develops the phenotype *x* at the probability density function:

#### EM algorithm:

Our algorithm is an extension of the EM algorithm for estimating marker haplotype frequencies to the association studies. However, our likelihood function, unlike previous algorithms, includes the information about the phenotypes.

Equation 1 is maximized over Θ, **μ**, and σ, and the maximum likelihood thus obtained is denoted *L*_{max}. Then Equation 2 is maximized over Θ, **μ _{0}**, and σ, and the maximum likelihood thus obtained is denoted

*L*

_{0max}. The likelihood ratio

*L*

_{0max}/

*L*

_{max}is used to test the association between the presence of the haplotype and the distribution of the phenotypes.

In the maximization for *L*_{max}, the parameters to be estimated are Θ = (θ_{1}, θ_{2}, … , θ* _{L}*), μ

_{1}, μ

_{2}, and σ, while in the maximization for

*L*

_{0max}, the parameters to be estimated are Θ = (θ

_{1}, θ

_{2}, … , θ

*), μ*

_{L}_{0}, and σ. Under the null hypothesis, −2 log(

*L*

_{0max}/

*L*

_{max}) is expected to follow the χ

^{2}distribution with 1 d.f. (Wilks 1962; Serfling 1981).

If the complete data of *d*_{1}, *d*_{2}, … , *d _{N}* and ψ

_{1}, ψ

_{2}, … , ψ

*were available, the maximum-likelihood estimates of θ*

_{N}_{1}, θ

_{2}, … , θ

*and*

_{L}**μ**, σ would be easily obtained as θ̂

*=*

_{j}*n*/(2

_{j}*N*) for

*j*= 1, 2, … ,

*L*and , , where

*n*is the number of the copies of the

_{j}*j*th haplotype in the

*N*subjects,

*N*

_{+}denotes the number of subjects who possess haplotype

*h*

_{b}, and

*N*

_{−}denotes the number of subjects who do not possess haplotype

*h*

_{b}.

However, the complete data are not available, and we observe only genotypes and phenotypes of the subjects. Therefore, we substitute the expected values of *n _{j}*/(2

*N*), ∑

_{di∈D+}ψ

_{i}/

*N*

_{+}, ∑

_{di∉D+}ψ

_{i}/

*N*

_{−}, andfor the real values in the following EM algorithm.

For

*n*= 0, initial values are given to , where and θ^{}_{j}≥ 0.For

*n*= 0, initial values are given to .For

*n*= 0, initial values are given to σ^{(}^{n}^{)}.For all

*i*and for all*a*consistent with the genotype_{k}*g*, calculate 3where_{i}*A*denotes the set of_{i}*a*consistent with_{m}*g*. Note that we examine only_{i}*a*consistent with_{k}*g*. In addition, since_{i}*d*is independent of_{i}**μ**^{(}^{n}^{)}and σ^{(}^{n}^{)}, and ψis independent of Θ_{i}^{(}^{n}^{)}conditional on*d*, Equation 3 becomes 4v. Since_{i}*n*, the number of_{j}*j*th haplotype copies possessed by*N*subjects is a random variable, we can define the expected number of*j*th haplotype copies conditional on the observed data as where*g*(_{j}*a*) denotes the number of_{k}*j*th haplotype copies in*a*, and_{k}*A*again denotes the set of diplotype configurations for the_{i}*i*th subject that is consistent with*g*. Note that_{i}*g*(_{j}*a*) is 0, 1, or 2. The expected values are calculated for all_{k}*j*.Here, ∑

_{di∈D+}ψ_{i}/*N*_{+}, ∑_{di∉D+}ψ_{i}*/N*_{−}, andare random variables and, therefore, expected values conditional on the observed data can be defined as 5In the above equation, and where the denominator and the numerator in Equation 5 are the summed probability densities of the observed data for the*i*th subject consistent with*g*and those consistent with_{i}*g*and_{i}*a*∈_{k}*D*_{+}∩*A*: 6In the above equation, and where the denominator and the numerator in Equation 6 are the summed probability densities of the observed data for the_{i}*i*th subject consistent with*g*and those consistent with_{i}*g*and_{i}*a*_{k}∈*A*_{i}∩*D̅*_{+}, where*n*denotes .From the result of step v, Θ is updated for the next step as follows: From the result of step vi,

**μ**and σ are updated for the next step as follows: viii. Steps iv–vii are repeated until the values converge. The values when converged are considered as the maximum-likelihood estimates Θ̂ = (θ̂_{1}, θ̂_{2}, … , θ̂), μ̂_{L}_{1}, μ̂_{2}, and σ̂.To avoid the local maximum, different sets of values for and σ

^{(0)}are tested.

Here, Equation 1, given the values Θ̂, **μ̂**, and σ̂, is the maximum-likelihood *L*_{max} for the alternative hypothesis. If we give the condition **μ _{0}** = (μ

_{0}, μ

_{0}) and repeat steps iv–vii, then we get the maximum-likelihood

*L*

_{0max}for the null hypothesis. The present algorithm can handle missing data in both the observed genotypes and the phenotypes. Thus, when the genotype data were missing in some loci for the

*i*th subject,

*g*, the observed genotypes for

_{i}*i*, were interpreted as the set of all possible genotypes consistent with the observed genotypes excluding the loci where the data were missing. When the phenotype was missing for the

*i*th subject, the likelihood of only the observed genotype data but not that of the phenotype data was included in the calculation. Even when the phenotype data were missing for some subjects, the inclusion of their observed genotype data for the analysis has increased the accuracy of the estimation of the population haplotype frequencies.

Under the null hypothesis, the statistic −2 log(*L*_{0max}/*L*_{max}) is expected to follow, asymptotically, χ^{2} distribution with 1 d.f.. The above algorithm is implemented as a computer program QTLHAPLO.

### Designs of simulations:

#### The QTL parameter estimations:

The purpose of the simulation was to verify the accuracy of the estimation of the parameters for the distribution of the phenotype. A sample was generated by the experiment defined above. Then an ordered combination of two haplotypes was randomly selected from a collection of haplotype copies and given to each of the *N* subjects according to the given haplotype frequencies. We obtained haplotype frequencies for the *SAA1* gene from a previous study (Moriguchi* et al.* 2001). SNP data at six loci were included in the haplotype data of the *SAA1* gene. We performed two types of simulations, one using the data from six loci and the other from four loci. The latter set of loci (four loci) was obtained by excluding the first and the fourth loci, which were in only weak linkage disequilibrium with the other loci. Haplotype frequencies used in the two types of simulations are shown in Table 1. We assumed that one of the haplotypes is associated with the phenotype, and the phenotype of the subject with that haplotype follows *N*(μ_{1}, σ^{2}). The phenotype of the subject without that haplotype was assumed to follow *N*(μ_{2}, σ^{2}). Thereafter, we removed the phase information and ran our algorithm to estimate parameters.

#### Behavior of the statistic −*2 log*(L_{0max}/L_{max}) under the null hypothesis:

_{0max}

_{max}

The purpose of this simulation was to examine the distribution of the likelihood-ratio test statistic −2 log(*L*_{0max}/*L*_{max}) under the null hypothesis μ_{1} = μ_{2}. The null hypothesis was equivalent to the assumption of no association between the phenotype and the presence of the haplotype. The test statistic was determined for each sample. The distribution of the test statistic was empirically estimated from samples generated by the simulation.

#### The estimation of power:

The purpose of this simulation was to estimate the power under the alternative hypothesis. With varying values of μ_{1}, μ_{2}, and σ, the empirical power was determined.

### Bootstrap method to calculate standard errors of the estimated parameters:

To evaluate the reliability of estimated parameters μ̂_{1} , μ̂_{2}, and σ̂, we used the bootstrap method (nonparametric bootstrap method) to calculate means and standard errors. From the original real sample, an artificial sample was generated by drawing the same number of the subjects at random. A single subject in the original sample may be repetitively drawn. It means that a new sample was drawn from the population in which the subjects in the original sample were uniformly distributed. Using the new artificial sample, the parameters were estimated using QTLHAPLO. The above procedure was repeated 10,000 times and the values of the estimated parameters were used to calculate the mean and the standard error.

### Extension of the algorithm:

The present algorithm was extended so that it can handle dominant, recessive, and additive modes of inheritance. Let *A* denote the haplotype for a genetic region *R* that is related to the phenotype, and let *B* denote the complement of *A*, *i.e.*, the set of all haplotypes other than *A*. We gave the following mean variables for different diplotype configurations. Thus, we gave μ_{1} for both *AA* and *AB* and μ_{2} for *BB* in the dominant mode, while we gave μ_{1} for *AA* and μ_{2} for both *AB* and *BB* in the recessive mode. In the additive mode, we gave μ_{1} and μ_{2} for *AA* and *BB*, respectively, and (μ_{1} + μ_{2})/2 for *AB*. In addition, we have implemented the mode in which the three different means μ_{1}, μ_{2}, and μ_{3} were given to the three different diplotype configurations.

Another extension is to define *A* not as a single haplotype but as a set of multiple haplotypes. For example, we can denote *D*_{+} as the set of diplotype configurations that contain either of the two phenotype-associated haplotypes. More generally, we can define a set *Q* as a set of all phenotype-associated haplotypes and *D*_{+} as the set of diplotype configurations with at least one member of *Q*. We implemented dominant, recessive, and additive modes for the analysis using such sets of haplotypes. In this way, we can test the association between a set of haplotypes and a phenotype. Since a SNP can be defined as a set of haplotypes, we could test the association between a SNP and a phenotype in this way. This extension was also implemented in QTLHAPLO.

## RESULTS

### Analysis of real data:

We analyzed the data from the diabetic patients. The data included the genotypes at *CAPN10* and quantitative phenotypes such as BMI, blood glucose level (BS), and immunoreactive insulin level (IRI). The precise data will be published elsewhere (N. Iwasaki, Y. Horikawa, Y. Kitamura, Y. Nakamura, Y. Tanizawa, Y. Oka, K. Hara, T. Kadowaki, T. Awata, M. Honda, K. Yamashita, M. Ogata, N. Kamatani, N. J. Cox, G. I. Bell and Y. Iwamoto). These quantitative phenotypes are expected to follow asymptotically normal distributions (data not shown). Table 2 also shows the haplotype frequencies Θ inferred under the hypothesis of no linkage disequilibrium. Table 3 shows that the pairwise linkage disequilibrium measures *D*, *D*′, and *r*^{2} estimated under the presence of linkage disequilibrium. These results showed that there was considerable linkage disequilibrium between each pair of the loci of *CAPN10*.

Then, we incorporated the quantitative phenotype data in addition to the genotype data into the analysis. Thus, one of the following quantitative phenotypes was selected: BMI, BS at 0 min (BS 0′), BS 30′, BS 60′, BS 120′, IRI at 0 min (IRI 0′), IRI 30′, IRI 60′, or IRI 120′ (Table 4). The results indicate that there were significant associations between the presence of the haplotype 112 and both BS 30′ and BS 60′ (Table 4). Table 5 shows that when the haplotype 112 was assumed to be the phenotype-associated haplotype, μ̂_{1} > μ̂_{2}, suggesting that the subjects with the 112 haplotype exhibit higher blood glucose levels at 30 and 60 min after the glucose ingestion than those without the haplotype. SE by the bootstrap method were (μ_{1} ± SE, μ_{2} ± SE, σ ± SE) = (147.1 ± 2.6, 138.9 ± 2.2, 28.2 ± 1.1) in the blood glucose level at 30 min, and (138.5 ± 3.7, 129.0 ± 3.0, 38.8 ± 1.5) in the blood glucose level at 60 min. In addition, haplotype 122 was significantly associated with BS 0′ (Table 4). However, this may not necessarily indicate that the subjects with the 122 haplotype exhibit lower fasting glucose levels than those without the haplotype. In fact, the frequency of the 122 haplotype was 0.0087, a value too low to evaluate (Table 2). Although such problems as multiple testing should be kept in mind, these results suggest an association between the 112 haplotype and blood glucose levels.

### The accuracy of estimated values of parameters:

We used the simulation to generate samples under either the null or the alternative hypothesis and analyzed the data in the sample using QTLHAPLO.

First, haplotype frequencies Θ were employed from the four-locus data at the *SAA1* gene, as shown in Table 1. The CCTC haplotype was considered to be the phenotype-associated haplotype. Note that all of the four loci were in tight linkage disequilibrium with each other. Two haplotype copies were selected using the haplotype frequencies and assigned to each subject. The phenotype of the subject was determined stochastically using two normal distributions. *N*(μ_{1}, σ^{2}) was used when the subject possessed the phenotype-associated haplotype, while *N*(μ_{2}, σ^{2}) was used when the subject did not possess the haplotype. The parameters μ_{1}, μ_{2}, and σ were given to each simulation as described in Table 6. A sample consisted of a total of *N* subjects.

After diplotype configurations and phenotypes were determined for all the subjects, the phase information was removed. Using the genotype information and the phenotypes of the subjects, we used QTLHAPLO to estimate the parameters Θ, μ_{1}, μ_{2}, and σ and, at the same time, calculated *P*-values for excluding the null hypothesis.

The results showed that our algorithm is highly accurate for estimating the parameters μ_{1}, μ_{2}, and σ, whether the simulation is performed under the null hypothesis or the alternative hypothesis under the given conditions (Table 6). As expected, the *P*-value to exclude the null hypothesis was high when the simulation was under the null hypothesis, while it was low when the simulation was performed under the alternative hypothesis (Table 6).

Next, Θ was employed from the six-locus data at the *SAA1* gene, shown in Table 1. Note that two of the six loci were in only weak linkage disequilibrium with the other loci. In this case, the ACCGTC haplotype was considered to be the phenotype-associated haplotype. The simulation and the analysis of the data generated by the simulation were performed exactly as in the case of the four-locus data as described above except for the number of loci.

Table 7 again shows that the estimation of the parameters was accurate. The risk to exclude the null hypothesis (*P*-value) was high when the data were simulated under the null hypothesis, while it was very low when they were simulated under the alternative hypothesis (Table 7).

### Accuracy of estimated diplotype configuration:

Using the simulated data, the posterior probability distribution of the diplotype configuration (diplotype distribution) for each subject conditional on the observed genotype and phenotype data [*P*(*d _{i}* =

*a*|

_{k}*G*

_{obs}, Ψ

_{obs})] was determined by QTLHAPLO. The diplotype distribution for each subject conditional on only the observed genotype data [

*P*(

*d*=

_{i}*a*|

_{k}*G*

_{obs})] was also determined using the same program. It is of interest to examine whether the diplotype distribution changes when the observed phenotype data are added to the observed genotype data. Another question is whether the inference becomes more accurate when the observed phenotype data are incorporated. Table 8 shows the comparison of the diplotype distribution for each subject inferred by only the genotype data and inferred by both the genotype and the phenotype data. In this case, the simulation was performed under the conditions μ

_{1}= 165, μ

_{2}= 160, σ = 5.0, and

*N*= 1000. The results were shown only for individuals

*i*= 1, 2, … , 10. For the subjects 1, 3, 4, 5, 6, 8, and 9, the diplotype configurations were concentrated on single events whether or not the observed phenotype data were incorporated (Table 8). For subjects 2, 7, and 10, the diplotype configurations were not concentrated on single events; however, the distributions were almost identical between the two inferences, one made by incorporating the observed phenotype data and the other without them (Table 8).

The comparison of the diplotype distribution between the two inferences was done using the six-locus data for the *SAA1* gene. Part of the results are shown in Table 9. In this case, the diplotype distributions were not concentrated on single events in the subjects *i* = 53, 55, 57, 58, and 59. For the subject *i* = 55, the diplotype distribution differed significantly between the two inferences. Since this subject has a quantitative phenotype of 167.8, the subject is likely to possess the phenotype-associated haplotype ACCGTC because μ_{1} = 165 and μ_{2} = 160. The incorporation of the phenotype data changed the diplotype distribution of the subject *i* = 55 so that the probability of the diplotype configuration occurring with the phenotype-associated haplotype (ACCGTC GCTGCT) increased. Thus the inclusion of the phenotype data changed the diplotype distribution for each subject and seemed to improve the accuracy of the inference of the diplotype configurations.

We have intensively addressed this issue by the simulation; *i.e*., we asked whether the inference of the diplotype configurations becomes more accurate by incorporating the phenotype data in addition to the genotype data. Thus, the inference of the diplotype configuration for each subject was performed using only the genotype data or using both genotype and phenotype data from the simulated samples. Then, we counted how many of the subjects' inferred diplotype configurations became more accurate and how many became less accurate by incorporating the phenotype data. When the posterior probability of the true diplotype configuration became higher by incorporating the phenotype data, as was the case with the subject *i* = 55 in Table 9, we judged that the inference became more accurate. On the other hand, when the posterior probability of the true diplotype configuration became lower by the incorporation of the phenotype data, we judged that the inference became less accurate. When the six-locus haplotype frequencies were used, the proportions of the subjects whose inference of the diplotype configurations became more and less accurate by the incorporation of the phenotype data were 0.1957 ± 0.0429 and 0.1061 ± 0.0413 (results from simulations under the same conditions as in Table 9), respectively. On the other hand, when the four-locus haplotype frequencies were used, the proportions of the subjects whose inference of the diplotype configurations became more and less accurate by the incorporation of the phenotype data were 0.1387 ± 0.0111 and 0.0589 ± 0.0075 (results from simulations under the same conditions as in Table 8), respectively.

### Distribution of the statistic −2 log(*L*_{0max}/*L*_{max}) under the null hypothesis:

The null hypothesis is that the distribution of the quantitative phenotype is independent of the diplotype configurations. It is equivalent to the assumption of μ_{1} = μ_{2}. The samples were simulated under the null hypothesis using various values of μ_{1} = μ_{2}, and σ. For Θ, the four-locus data (Table 1) were used and the 10,000 independent samples were generated. Figure 1 shows the histogram for the statistic −2 log(*L*_{0max}*/L*_{max}) obtained by the analysis of the simulated data using QTLHAPLO. Thus, when the parameters were μ_{1} = μ_{2} = 160 and σ = 5, the statistic followed asymptotically the χ^{2} distribution with 1 d.f. Similar simulations were performed under various conditions followed by the analysis of the data using QTLHAPLO. Table 10 shows the results of the estimated values of the parameters and the type I error rates using two different Θ data sets (four-locus and six-locus data sets in Table 1) and two sample sizes (100 and 1000). The estimated parameters were accurate for both Θ data sets and two sample sizes. In addition, when the value of 3.841, where the cumulative distribution function for χ^{2} distribution with 1 d.f. becomes 0.95, was set as the threshold, the proportion of the samples that generated statistic values over the threshold (empirical type I error rate) was very close to the expected value of 0.05 (Table 10).

### Power of the test:

We determined the empirical powers of the present test using various conditions. First, samples were simulated under the alternative conditions and the data were analyzed by QTLHAPLO. The proportions of the samples that generated the statistic over the threshold value of 3.841 were considered as empirical powers. The results show that the power increases as a function of |μ_{1} − μ_{2}| and sample size (Figure 2). Additional simulation experiments with different parameters followed by the analysis of the data show that the power was a function of |μ_{1} − μ_{2}|/σ as expected (data not shown). Thus, our algorithm has a sufficient power when |μ_{1} − μ_{2}|/σ and sample size are large.

## DISCUSSION

In this investigation, we developed an algorithm to estimate the parameters of the distribution of a quantitative phenotype and to test the association between the presence of a haplotype and the quantitative phenotype. The data used are genotype data at linked loci as well as the data of a quantitative phenotype in multiple subjects.

We examined whether our algorithm could accurately estimate the parameters. Samples of genotypes and phenotypes for multiple subjects were generated using various sets of parameters, and the data were analyzed by the maximum-likelihood method. The maximum-likelihood estimates thus obtained were very close to the values of the parameters that had been given before the simulation, indicating that our algorithm could accurately estimate the parameters.

Then we examined the distribution of the generalized likelihood-ratio statistic, obtained by analyzing the data derived under the null hypothesis. Under various conditions, the statistic was found to follow an asymptotically χ^{2} distribution with 1 d.f. In addition, the analysis of the data simulated under the alternative hypothesis indicated that the power was considerably high when |μ_{1} − μ_{2}|/σ and sample size *N* were sufficiently large; *i.e.*, (μ_{1} − μ_{2})/σ = 0.2, *N* = 1000 and (μ_{1} − μ_{2})/σ = 0.6, *N* = 100.

The importance of |μ_{1} − μ_{2}|/σ for the power of the test is easily understood as follows. Let *A* denote the haplotype for a genetic region *R* that is related to the phenotype, and let *B* denote the complement of *A*, *i.e*., the set of all haplotypes other than *A*. In fact, both *A* and *B* may be sets of haplotypes rather than single haplotypes. In our model, the distribution of the phenotype was assumed to be different between the subjects with the (unordered) diplotype configurations *AA*, *AB*, and *BB*. This means that we divided the phenotype into two parts, *i.e.*, the part due to the effect of the diplotype configurations in region *R* and the part independent of that effect. The latter part contains both environmental and genetic elements unrelated to region *R*. Thus, we assumed covariance neither between the effects of region *R* and other genetic loci nor between the effects of region *R* and the environment. It means that no epistasis was assumed in our model.

The impact of the effect on the phenotype is evaluated by comparing the variances (Fisher 1918). Thus, the impact of the effect of region *R* on the phenotype is evaluated by the ratio of the variance of the effect of region *R* to the total phenotypic variance (Amos 1994; Almasy and Blangero 1998; Pratt* et al.* 2000; Sham* et al.* 2000).

The mathematical modeling of this kind was initiated by Fisher (1918) many years ago, although it was not about the diplotype configurations but about the genotypes. Let σ^{2}_{t} and σ^{2}_{r} denote the total phenotypic variance and the variance due to region *R*, respectively. The ratio σ^{2}_{r}/σ^{2}_{t} is an indicator of the impact of region *R* in the total phenotypic variation. The difference contains elements from both the environment and the genetic loci other than region *R*. If region *R* is the only genetic region relevant to the phenotype, then where σ^{2}_{e} denotes the variance due to environment. Note that, in this case, equals to heritability.

According to our model, the means of the phenotypes for the subjects with the diplotype configurations of *AA* and *BB* are μ_{1} and μ_{2}. Then, if Hardy-Weinberg equilibrium can be assumed, the phenotypic variance due to region *R* is written when μ_{1} ≥ μ_{2} as 7where μ_{3} denotes the mean of the phenotypes for the subjects with the diplotype configuration of *AB*, and *p* denotes the population frequency of the haplotype *A*.

In our model, σ^{2}_{n}, the variance unrelated to region *R* is equal to σ^{2}, the variance of the phenotypes for the subjects with the same diplotype configurations for region *R*. Note that we assumed no difference in the variance for different normal distributions between the phenotypes for different diplotype configurations.

In the dominant model (the phenotypes for *AA* and *AB* are the same), μ_{3} = μ_{1} and Equation 7 becomes 8In the recessive model (the phenotypes for *AB* and *BB* are the same), μ_{3} = μ_{2} and Equation 7 becomes 9In the additive model, μ_{3} = (μ_{1} + μ_{2})/2 and Equation 7 becomes

10Therefore, in dominant (8), recessive (9), and additive (10) modes, σ^{2}_{r} has the form of *f*(*p*)(μ_{1} − μ_{2})^{2}, and has the form of Thus, the ratio of the phenotypic variance due to the difference in the diplotype configurations for region *R* to the total phenotypic variance is positively correlated with |μ_{1} − μ_{2}|/σ. Note that *f*(*p*) ≥ 0 for 0 ≤ *p* ≤ 1. This ratio is equivalent to the heritability when region *R* is the only genetic region influencing the phenotype.

One of the problems in the haplotype inference is that the diplotype configurations of some subjects are not unequivocally determined. Such subjects with ambiguous diplotype configurations should be treated in the analysis. If one attempts to test the association between the diplotype configurations and a phenotype, the subjects with ambiguous diplotype configurations cannot be unequivocally categorized. As is often done, they can be classified into some categories according to the most likely diplotype configurations. However, such forced categorization may cause inflation of type I errors. In fact, our simulation studies have shown that the algorithm presented here is superior to such methods in that it allows the presence of ambiguous diplotype configurations when testing the association between the presence of a haplotype and a quantitative phenotype.

The problems of ambiguous diplotype configurations are amplified when the linkage disequilibrium of the loci to be analyzed is weak. We analyzed two cases in detail. In one case, all the four loci were in tight linkage disequilibrium, while two of the six loci were in weak linkage disequilibrium in the other. When all the four loci were in tight linkage disequilibrium, the percentage of the subjects with ambiguous diplotype configurations was low and the degree of the ambiguity was minimal. However, when two of the six loci were in weak linkage disequilibrium, the problems of the ambiguous diplotype configurations became large. Interestingly, the estimated probability of the true diplotype configuration was often larger when the phenotype data in addition to the genotype data were incorporated for the analysis than when only the genotype data were used. This indicates that the inference of the diplotype configurations becomes more accurate by incorporating the phenotype data when there is a true association between the presence of a haplotype and a quantitative phenotype.

Wu* et al.* (2002) proposed the joint linkage and linkage disequilibrium mapping strategy for estimating allelic frequencies, recombination fractions, and linkage disequilibria for multiallelic markers in natural populations using the Fisher-scoring algorithm. The genomic region within which the linkage disequilibrium is tight is denoted the haplotype block or linkage disequilibrium (LD) block. Within the haplotype block, the problem of ambiguous diplotype configurations is not large; however, it is likely to emerge when polymorphic loci outside the haplotype blocks or those in the region that includes the border(s) of the block(s) are the targets of the study. The value of the present algorithm may be high especially when the involved loci are not within a block.

We then applied this algorithm to the analysis of the data from diabetic patients. The data were composed of the genotypes at three SNP loci within the *CAPN10* gene as well as the quantitative phenotypes. The three loci were in moderate linkage disequilibrium (|*D*′| > 0.6). The analysis has shown that there were significant associations between certain haplotypes and some quantitative phenotypes.

We modeled the test of the association between haplotypes and quantitative phenotypes in a way similar to that employed by Chiano and Clayton (1998), Fallin* et al.* (2001), Zaykin* et al.* (2002), and Lou* et al.* (2003). Thus, Chiano and Clayton (1998) developed the linear logistic regression model, which not only tests for association but also determines how far the haplotype harboring the putative disease gene extends, and estimated haplotype frequencies by the EM algorithm. Zaykin* et al.* (2002) have also developed a statistical method to test the association of haplotype frequencies with phenotypes in samples of unrelated individuals. They estimated haplotype frequencies using the EM algorithm and then related the inferred haplotype probabilities for each individual to the phenotype using regression-based analysis. Fallin* et al.* (2001) devised a method to test the association between haplotypes inferred by the EM algorithm and the disease phenotype using the chi-square statistic for contingency tables. They applied their method for testing the association between multiple SNPs in the APOE gene region and Alzheimer's disease and showed that it was useful even when the linkage disequilibrium was weak and the effect of the gene was rather small. The proposed framework by Lou* et al.* (2003) can accommodate genetic effects of different kinds for the QTL. Our model is easily extendable to estimate the interactions of two haplotypes and between haplotypes and environment (Chiano and Clayton 1998). There are some similarities between the above methods and our algorithm; however, our algorithm differs from each of them. For example, epistasis is not assumed in our algorithm while that by Lou* et al.* (2003) assumed its presence. In the extended phase of our algorithm, sets of haplotypes rather than single haplotypes can be handled. In addition, the sample size of 100 is sufficient for our algorithm while their algorithm needs larger sizes (Lou* et al.* 2003).

Our algorithm can be applied to real data only when the quantitative phenotypes are expected to follow the normal distributions. Indeed, many quantitative phenotypes may follow asymptotically normal distributions; however, there are certainly phenotypes that do not. One of the solutions to such problems may be to use the transformed value of the quantitative phenotype that is expected to follow, under the null hypothesis, a normal distribution. Some of the mathematical transformations that convert the phenotype include the logarithm transformation for the skewed trait (Schaid* et al.* 2002; Wrigth 1968) and the power transformation (Hoaglin* et al.* 1983). The nonparametric method may be another approach.

Although our algorithm is useful for cohort studies, it may be extended in other types of studies. One extension is the application of our algorithm to case-control studies. In principle, this algorithm can be applied to the data from cohort studies and clinical trials but not to those from case-control studies. The reason is that the estimated parameters Θ, μ_{1}, μ_{2}, and σ do not indicate the population parameters when this algorithm is applied to data from case-control studies. However, the test of the association between the presence of a haplotype and a quantitative phenotype may be possible by this algorithm even for the data from case-control studies. In this case, however, the maximum-likelihood estimates obtained by the algorithm are not the real estimates of the parameters. We are now extensively analyzing this issue by simulations to examine whether the application of the present algorithm to data from case-control studies is plausible. The test of the association between a phenotype and the diplotype configurations using subjects with “extreme” phenotypic values will be useful. If we can obtain two samples, one with high phenotypic values and the other with low phenotypic values, the test of the association will become very powerful. In this case, however, the same problem as stated above in the case of case-control studies will emerge. Although the data from such samples can be submitted to our algorithm and the outputs will be obtained, the estimated parameters (for example, Θ) do not indicate population parameters. We are now extensively analyzing this issue by simulations and have found that our algorithm can be used to analyze the data from the subjects with extreme phenotypes in some cases. It means that although the estimated parameters were incorrect, the type I errors did not inflate very much. However, it is still to be clarified under what conditions such application is plausible.

Recently, several groups have proposed algorithms and methods to study the association between haplotypes and quantitative phenotypes. Lou* et al.* (2003) proposed a haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis. Thus, the likelihood approach to the estimation of haplotype frequencies is useful; however, there are some limitations (Fallin and Schork 2000; Tishkoff* et al.* 2000). Fallin and Schork (2000) reported that the accuracy of haplotype estimation increases as the amount of linkage disequilibrium between loci increases using the likelihood approach. In this respect, Tanck* et al.* (2003) developed the weighted penalized log-likelihood model and compared it with the different log-likelihood models.

Although there have been several proposals for studies of the association between quantitative phenotypes and haplotypes, procedures that are both reliable and accurate still need to be developed. Such sophisticated methods will be necessary because a number of different quantitative phenotypes are expected to be studied in the near future at the population basis. Thus, not only quantitative phenotypes obtained by simple clinical examinations but also multiple clinical tests as well as the results from DNA microarray studies can be used. Even quantitative data from proteomics studies can be used as quantitative phenotypes.

In conclusion, we developed an algorithm to simultaneously estimate, by the maximum-likelihood method, the population haplotype frequencies and the parameters of the distributions of quantitative phenotypes that are different between subjects with different diplotype configurations using both genotype and phenotype data from multiple subjects. Using a test statistic, −2 log(*L*_{0max}*/L*_{max}), we could construct a method to test the association between the presence of a haplotype and a quantitative phenotype. We implemented this algorithm in a computer program, QTLHAPLO. The analysis of the simulated and real data using this program indicated that this method can accurately estimate the parameters and reliably test the association between the haplotypes and the phenotypes.

## Acknowledgments

This study was supported by a grant from the New Energy and Industry Technology Development Organization.

## Footnotes

↵

^{1}*Present address:*Department of Bioinformatics, Graduate School of Tokyo Medical and Dental University, Yushima 1-5-45, Bunkyo-ku, Tokyo 113-8510, Japan.Communicating editor: M. Feldman

- Received April 7, 2004.
- Accepted June 10, 2004.

- Genetics Society of America