## Abstract

Analysis of the association between haplotypes and phenotypes is becoming increasingly important. We have devised an expectation-maximization (EM)-based algorithm to test the association between a phenotype and a haplotype or a haplotype set and to estimate diplotype-based penetrance using individual genotype and phenotype data from cohort studies and clinical trials. The algorithm estimates, in addition to haplotype frequencies, penetrances for subjects with a given haplotype and those without it (dominant mode). Relative risk can thus also be estimated. In the dominant mode, the maximum likelihood under the assumption of no association between the phenotype and presence of the haplotype (*L*_{0max}) and the maximum likelihood under the assumption of association (*L*_{max}) were calculated. The statistic −2 log(*L*_{0max}/*L*_{max}) was used to test the association. The present algorithm along with the analyses in recessive and genotype modes was implemented in the computer program PENHAPLO. Results of analysis of simulated data indicated that the test had considerable power under certain conditions. Analyses of two real data sets from cohort studies, one concerning the *MTHFR* gene and the other the *NAT2* gene, revealed significant associations between the presence of haplotypes and occurrence of side effects. Our algorithm may be especially useful for analyzing data concerning the association between genetic information and individual responses to drugs.

ANALYSIS of polymorphism data based on linkage disequilibrium (LD) and haplotype structure is becoming increasingly important. Haplotype is also useful for associating the genetic information of subjects with various phenotypes. A phenotype is often associated not only with each SNP but also with haplotypes. If the disease association of a specific allele is dependent on *cis*-acting interactions with other loci, the disease association may not be detected unless the functional haplotypic unit itself is analyzed. Thus, many studies of various diseases and conditions have reported the importance of haplotype information in addition to single locus information (Puffenberger *et al.* 1994; Drysdale *et al.* 2000; Mummidi *et al.* 2000; Hugot *et al.* 2001; Joosten *et al.* 2001; Rioux *et al.* 2001).

We can interpret haplotypes as complete data and genotypes (for example, SNP genotypes at multiple linked loci) as incomplete data from the statistical viewpoint, since we can extract all genotype data from haplotype data, while the reverse is not the case. Therefore, it is generally more useful to consider the association between polymorphism and phenotype based on haplotypes or diplotype configurations (haplotype combinations) rather than on alleles or genotypes. Recent studies have suggested that, in some cases, phenotypes such as diabetes (Horikawa *et al.* 2000) and reaction to drugs are associated with haplotypes or diplotype configurations rather than (single-nucleotide polymorphism) genotypes (Judson* et al.* 2000; Bader 2001; Tanaka *et al.* 2002; Urano *et al.* 2002).

The phenotypes at the level of individuals are based on diplotype configurations rather than haplotypes. This is equivalent to the relationship between the genotypes and alleles at a locus. Note that, according to Mendel's law, phenotypes are associated with genotypes but not alleles. Therefore, the association between a phenotype and genetic information is in some cases detected more efficiently by comparing the proportions of affected individuals between subjects with different diplotype configurations than by comparing the haplotype frequencies between subjects with different phenotypes.

However, the haplotype or diplotype configuration of a subject cannot easily be observed, although molecular methods for the determination of haplotypes have been reported (Michalatos-Beloin* et al.* 1996). To compare the proportions of affected individuals between subjects with different diplotype configurations, the diplotype configuration for each individual should be calculated as posterior distribution based on population haplotype frequencies, which are inferred using one of the haplotype inference algorithms, such as Clark's algorithm (Clark 1990), the expectation-maximization (EM) algorithm (Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long *et al.* 1995; Schneider *et al.* 2000; Kitamura *et al.* 2002), the PHASE algorithm (Stephens *et al.* 2001), the partition-ligation (PL) algorithm (Niu *et al.* 2002), and the PL-EM algorithm (Qin *et al.* 2002). Once diplotype configurations are inferred, all the individuals are classified according to the presence or absence of a haplotype or a diplotype configuration. After classifying the affected and nonaffected individuals into categories, the test of independence is performed.

When one of these algorithms is used to calculate the posterior distribution of the diplotype configuration, at least in some individuals, diplotype configurations are not unequivocally determined. It is not clear how much type I error can occur by classifying subjects according to inferred haplotypes rather than according to the real haplotype information.

To overcome this problem, we developed an algorithm to infer diplotype-based penetrance in addition to haplotype frequencies in populations and diplotype configurations based on observed single-nucleotide polymorphism (SNP) genotypes at multiple linked loci and phenotype data. This algorithm does not require that the diplotype configuration of each individual be unequivocally determined. Rather, on the basis of the EM algorithm, it calculates the maximum-likelihood estimates of population haplotype frequencies, posterior distribution of the diplotype configuration for each individual, and the diplotype-based penetrances. Using the algorithm, the association between the presence of haplotypes and a phenotype (in the dominant mode) can be tested at the individual level using the genotype and phenotype data from cohort studies or clinical trials. We examined the usefulness of this algorithm using both simulated and real data sets and found that it was very useful for analyzing genotype and phenotype data from cohort studies and randomized clinical trials.

## METHODS

### Algorithm:

#### Sample space of the EM-based algorithm for haplotype inference:

In the EM-based algorithm for haplotype inference using genotype data, the sample space is defined as a set of outcomes from the following experiment. First, haplotype frequencies are provided for a collection of infinite haplotype copies. (Throughout this article, “haplotype” and “haplotype copy” have different concepts. If a subject has the two same haplotype copies, the number of haplotypes is still one.) According to the haplotype frequencies, each of the *N* subjects is given two ordered haplotype copies after randomly drawing them from the collection of haplotype copies. The observed data are the genotypes at multiple linked loci involved in the haplotypes of the group of all subjects. In this EM-based algorithm for both haplotype and penetrance inference, the experiment used to define the sample space is a little different. After two ordered haplotype copies are assigned to each subject, he or she develops or does not develop a phenotype as a stochastic process. Thus, the difference between the EM-based algorithm for haplotype inference and the new algorithm presented here is that, in the new algorithm, the process of development of phenotype is included in the experiment for construction of the sample space.

#### New sample space:

Let us assume that there are *l* linked SNP loci. The number of all possible haplotypes will be *L* = 2* ^{l}*. We set up a collection of an infinite number of haplotype copies. The haplotype frequencies in the collection are Θ = (θ

_{1}, … , θ

*, … , θ*

_{j}*), where θ*

_{L}*is the frequency of the*

_{j}*j*th haplotype, and θ

*≥ 0, . To each of*

_{j}*N*individuals, ordered two haplotype copies are assigned by randomly drawing them from the collection of haplotype copies. A diplotype configuration is defined as an ordered combination of two haplotype copies. (Throughout the algorithm section, “diplotype configuration” means an ordered set of two haplotype copies for a subject, while in the other parts of this article, it means an unordered two haplotype copies for a subject). Let

*a*

_{1},

*a*

_{2}, . . ,

*a*

_{L2}be possible diplotype configurations. The probability that the

*i*th subject has the diplotype configuration

*a*is

_{k}*P*(

*d*=

_{i}*a*| Θ) = θ

_{k}*θ*

_{l}*, where*

_{m}*d*is a diplotype configuration for the

_{i}*i*th subject, and

*l*and

*m*are the orders of the haplotypes that constitute

*a*. This means that Hardy-Weinberg equilibrium is assumed at the haplotype level. The

_{k}*i*th subject develops the phenotype ψ

_{+}at the probability determined as a function of

*d*. Theoretically, the penetrances can be defined for all the diplotype configurations. However, it is not realistic to assign different penetrances to all the different diplotype configurations. We therefore defined only two or three penetrances depending on the mode of inheritance in this study.

_{i}Thus, in the dominant mode, the subjects with a haplotype and those without it were given two different penetrances, while in the recessive mode, the subjects who were homozygous for a haplotype and the others were given two different penetrances. In the genotype mode, the subjects with zero, one, and two copies of a haplotype were given three different penetrances.

Let *H*_{all} denote the set of all the haplotypes, and let *H*_{+} denote the subset of *H*_{all} containing the haplotype or haplotypes the presence of which has a different effect than the others. *H*_{+} typically contains only one haplotype, but may contain multiple haplotypes. If *H*_{+} is set up so that it contains all the haplotypes with an allele at a locus, then it is equivalent to the situation of testing the association of an allele (rather than a haplotype) with the phenotype. Two penetrances were set when the analysis was performed either in the dominant or in the recessive mode. Thus, in the dominant mode, let *D*_{+} denote a set of diplotype configurations that contains a member or members of *H*_{+}. In the recessive mode, let *D*_{+} denote a set of diplotype configurations whose two haplotypes are the members of *H*_{+}. Then, let *q*_{+} denote the probability that the *i*th individual develops ψ_{+} when *d _{i}* ∈

*D*

_{+}, and let

*q*

_{−}denote the probability that the

*i*th individual develops ψ

_{+}when

*d*∉

_{i}*D*

_{+}.

Thus, if ψ* _{i}* denotes the phenotype of

*i*th subject, and

Note that Θ and *q*_{+}, *q*_{−} are independent. Since dominant or recessive mode of inheritance is assumed in the above model, only two penetrances are defined.

If genotype-dependent mode is assumed, three penetrances should be defined depending on the genotype. Let *D*_{0}, *D*_{1}, and *D*_{2} denote the sets of diplotype configurations that contain zero, one, and two copies of the members of *H*_{+}, respectively. Let *q*_{0}, *q*_{1}, and *q*_{2} denote the probabilities that the *i*th individual develops ψ_{+} when *d _{i}* ∈

*D*

_{0},

*d*∈

_{i}*D*

_{1}, and

*d*∈

_{i}*D*

_{3}, respectively. Thus, and

Note that Θ and *q*_{0}, *q*_{1}, and *q*_{2} are independent.

The new experiment is different from the old experiment in that the process of developing a phenotype is included in the former. The parameters *q*_{+} and *q*_{−} or *q*_{0}, *q*_{1}, and *q*_{2} are, in addition to Θ, included in the new probability space. Note that ψ* _{i}* is independent of Θ conditional on

*d*.

_{i}#### Likelihood function:

The observed data are the genotypes and phenotypes of the subjects. Let *G*_{obs} = (*g*_{1}, *g*_{2}, … , *g _{N}*) and Ψ

_{obs}= (

*w*

_{1},

*w*

_{2}, … ,

*w*) denote the vectors of the observed genotypes and phenotypes, respectively, where

_{N}*g*and

_{i}*w*denote the observed genotype and phenotype of the

_{i}*i*th subject.

The likelihood function under the alternative hypothesis is written as follows. Thus the likelihood function in the dominant or recessive mode iswhere *A _{i}* denotes the set of

*a*for

_{k}*i*th subject that is consistent with

*g*.

_{i}Since *d _{i}* is independent of

*q*

_{+},

*q*

_{−}and

*ψ*is independent of Θ conditional on

_{i}*d*, 1

_{i}For any *i* and *k*,

In the genotype mode, the likelihood function is 2and

Under the null hypothesis that the phenotype is independent of the diplotype configuration concerning the loci examined, the likelihood function is 3where *q*_{c} denotes the penetrance for all the diplotype configurations.

For any *i* and *k*,

To obtain the maximum likelihood under the alternative hypothesis, Equation 1 or 2 is maximized over (Θ, *q*_{+}, and *q*_{−}) or (Θ, *q*_{0}, *q*_{1}, and *q*_{2}), and the maximum likelihood thus obtained is denoted as *L*_{max}. Then Equation 3 is maximized over Θ and *q*_{c}, and the maximum likelihood under the null hypothesis thus obtained is denoted as *L*_{0max}. The likelihood ratio *L*_{0max}/*L*_{max} is used to test the association between the haplotype and the phenotype.

In the maximization of *L*_{max}, the parameters to be estimated are Θ = (θ_{1}, θ_{2}, … , θ* _{L}*),

*q*

_{+}, and

*q*

_{−}in the dominant or recessive mode and Θ,

*q*

_{0},

*q*

_{1}, and

*q*

_{2}in the genotype mode, while, in maximization of

*L*

_{0max}, the parameters to be estimated are Θ = (θ

_{1}, θ

_{2}, … , θ

*) and*

_{L}*q*

_{c}. The space spanned by the latter maximization is a subspace of that spanned by the former. Under the null hypothesis, the statistic −2 log(

*L*

_{0max}/

*L*

_{max}) is expected to follow the χ

^{2}distribution with 1 or 2 d.f. asymptotically depending on the number of penetrances.

#### The EM algorithm:

In this section, the algorithm is described mainly for the dominant mode. If the complete data of *d*_{1}, *d*_{2}, … , *d _{N}* and ψ

_{1}, ψ

_{2}, … , ψ

*were available, the maximum-likelihood estimates of θ*

_{N}_{1}, θ

_{2}, … , θ

*and*

_{L}*q*

_{+},

*q*

_{−}would be easily obtained as θ̂

*=*

_{j}*n*/(2

_{j}*N*) for

*j*= 1, 2, … ,

*L*and

*q̂*

_{+}=

*N*

_{+ψ+}/

*N*

_{+},

*q̂*

_{−}=

*N*

_{−ψ+}/

*N*

_{−}, where

*n*is the number of the copies of

_{j}*j*th haplotype in the

*N*subjects,

*N*

_{+}= #{

*i*;

*d*∈

_{i}*D*

_{+}},

*N*

_{−}= #{

*i*;

*d*∉

_{i}*D*

_{+}},

*N*

_{+ψ+}= #{

*i*;

*d*∈

_{i}*D*

_{+}, ψ

*= ψ*

_{i}_{+}}, and

*N*

_{−ψ+}= #{

*i*;

*d*∉

_{i}*D*

_{+}, ψ

*= ψ*

_{i}_{+}}. Here, #{

*i*; , } denotes the number of subjects that meet the conditions after ;.

However, complete data are not available, and we observe only the genotypes and phenotypes of the subjects. Therefore, we substitute the expected values of *n _{j}*/(2

*N*),

*N*

_{+}

_{ψ}_{+}/

*N*

_{+}, and

*N*

_{−}

_{ψ}_{+}/

*N*

_{−}for the real values in the EM algorithm to maximize the likelihood function defined by Equation 1. If we supply the condition

*q*

_{c}=

*q*

_{+}=

*q*

_{−}, then we obtain the maximum likelihood

*L*

_{0max}for the null hypothesis defined by Equation 3. Note that

*q*

_{+}/

*q*

_{−}is usually called “relative risk,” and

*q̂*

_{+}/

*q̂*

_{−}is the maximum-likelihood estimate of relative risk. In the genotype mode,

*q*

_{0},

*q*

_{1}, and

*q*

_{1}are substituted by the expected values just as

*q*

_{+}and

*q*

_{−}were substituted in case of the dominant mode as described above.

By use of the EM algorithm, missing data in the observed data of genotypes and phenotypes could be handled. If data were missing from the observed genotype for the *i*th subject, *g _{i}* was interpreted as the set of possible genotypes for the subject not inconsistent with

*g*. For the missing data in phenotypes, the probability that a subject develops the unknown phenotype was interpreted to be 1.

_{i}We can test the association between a phenotype and a set of haplotypes rather than a single haplotype. Thus, a set of haplotypes *H*_{+} typically has only a single haplotype as a member; however, multiple haplotypes can be members of *H*_{+}. We have recently defined a concept of “incomplete haplotype” (Kamatani *et al.* 2004).

Thus, let *H*_{all} denote the set of all the haplotypes concerning all linked loci within a region. A complete haplotype is defined as a list of alleles at all linked loci in the region. Here, incomplete haplotype *H _{I}* is defined as a subset of

*H*

_{all}whose members have certain alleles at some loci within the region (Kamatani

*et al.*2004). Therefore, an incomplete haplotype is defined by a list of alleles at a limited number of the loci. In other words, an incomplete haplotype is defined by a list of alleles at all the loci (one allele at a locus), some of which are masked. For example, AC* defines an incomplete haplotype whose members are the complete haplotypes ACT and ACC (when the alleles at the third locus are T and C). Note that the set of all incomplete haplotypes is not the same as the set of all the set of complete haplotypes. The former is included in the latter. Also note that an allele of a SNP locus can also be defined as an incomplete haplotype because the allele T at the third position in the above haplotype is defined as **T.

Any incomplete haplotype as described above can be used as a target haplotype whose association with a phenotype is tested. In PENHAPLO, we implemented the algorithm to test the association between all possible incomplete haplotypes in dominant, recessive, or genotype mode. However, when the number of loci in the region is high, the number of all the incomplete haplotypes should be very large. The problem of multiple comparison will indeed emerge and this problem is discussed in the discussion.

### Simulation:

#### Empirical distribution of the statistic *−2 log(*L_{0max}/L_{max}) under the null hypothesis:

_{0max}/

_{max})

We first examined the empirical distribution of the statistic −2 log (*L*_{0max}/*L*_{max}) under the null hypothesis by simulation. To do this, we used the frequency distribution of haplotypes Θ obtained from the real data rather than simulating it. Thus, we used Θ obtained from our previous study on *SAA* (serum amyloid A) genes (Moriguchi* et al.* 2001), which include six SNPs. Table 1 shows the haplotypes and their frequencies. For *q _{c}*, we tested various values between 0 and 1. Note that, under the null hypothesis, the penetrance is the same for all diplotype configurations.

We began the simulation by assigning ordered sets of two haplotype copies to each of the *N* subjects by drawing the haplotype copies using Θ. Then, the phenotype of each subject was determined according to *q*_{c}. Thus, *q*_{c} was used as the probability at which any subject develops the phenotype ψ_{+}. Then, after removing the phase information, the algorithm as defined above was applied to the simulated data for determination of the statistic −2 log(*L*_{0max}/*L*_{max}). The simulation was repeated 10,000 times, and the distribution of the statistic was examined.

#### Simulation under the alternative hypothesis:

Next, simulation under the alternative hypothesis was performed in the dominant mode. Thus, one of the haplotypes was assumed to be associated with the phenotype ψ_{+}, and this haplotype was denoted the “phenotype-associated haplotype.” For this simulation, *D*_{+} was defined as the set of diplotype configurations that contained at least one phenotype-associated haplotype. Various real values between 0 and 1 were given to *q*_{−} and *q*_{+} before the simulation.

The simulation was begun by assigning each of the subjects an ordered set of two haplotype copies by drawing them using Θ. Then the phenotype of each person was determined using *q*_{+} or *q*_{−} as the probability of developing the phenotype ψ_{+}. *q*_{+} was used when the diplotype configuration of that subject contained the phenotype-associated haplotype, while *q*_{−} was used otherwise.

After removing the phase information, the SNP genotypes at multiple loci and the phenotype data were subjected to the above-defined algorithm. Simulation was repeated many times, and the results obtained were analyzed. Thus, the power was estimated at various values of *q*_{+} and *q*_{−} by assuming that, under the null hypothesis, the statistic − 2 log(*L*_{0max}/*L*_{max}) follows the χ^{2} distribution with 1 d.f.

In addition, simulation data were applied to two different association tests to evaluate the performance of the present algorithm. First, using the true diplotype configurations of the subjects that could not be observed easily in real data, two-by-two contingency tables were prepared for χ^{2} test. The rows of the contingency tables show whether a subject had at least one copy of phenotype-associated haplotype, and the column shows the phenotype. Second, after posterior distribution of the diplotype configuration for each subject was estimated by LDSUPPORT, which is the haplotype-inference program based on the EM algorithm (Kitamura *et al.* 2002), using the phase-removed genotype data, two-by-two contingency tables were prepared assuming that the diplotype configuration of the maximum probability was true. Then the probablity of the tables were calculated by χ^{2} distribution with 1 d.f. In this simulation, another Θ from an artificial gene, which was made for six SNPs under a weak linkage disequilibrium condition, was used as the frequency distribution of haplotypes in addition to Θ from the *SAA* gene.

### Analysis of real data:

We analyzed two sets of data previously published. One of them was derived from a cohort study in which the association between the haplotypes at *MTHFR* and the occurrence of side effects was tested. The other was derived from another cohort study in which the association between the haplotypes at *NAT2* (*N*-acetyltransferase 2) gene and the occurrence of side effects was tested, as published previously (Tanaka *et al.* 2002). In both of the studies, haplotypes were significantly associated with the occurrence of side effects.

To evaluate the reliability of estimated parameters *q̂*_{+} and *q̂*_{−}, the nonparametric bootstrap method is used to calculate means and standard errors. A bootstrap sample was constructed by drawing a new set of *N* subjects from the original *N* subjects, by permitting duplicate sampling, and was applied to the present algorithm. Bootstrap sampling was repeated 10,000 times (*B* = 10,000) to calculate means and standard errors of the frequencies of the haplotypes, *q̂*_{+}, *q̂*_{−}, and the statistic −2 log(*L*_{0max}/*L*_{max}).

## RESULTS

### Empirical distribution of the statistic −2 log**(** L_{0max}/L_{max}**)** under the null hypothesis:

Although we implemented dominant, recessive, and genotype modes in PENHAPLO, the distribution of the statistic was examined mainly in the dominant and the genotype modes.

Figure 1, A–D, shows the histograms of the statistic −2 log(*L*_{0max}/*L*_{max}) at various values of *q*_{c} and *N* in the dominant mode. The histograms were compatible with the expected χ^{2} distribution with 1 d.f. under all the conditions tested. The histogram in the genotype mode was compatible with the expected χ^{2} distribution with 2 d.f. for *q*_{c} = 0.1 and *N* = 1000 (Figure 1E), while the histogram is shifted to the positive direction as compared with the curve for *q*_{c} = 0.2 and *N* = 200 (Figure 1F). The genotype model seems to require larger sample sizes than the dominant model for the test statistic to follow the expected distribution.

Table 2 shows the estimated *q̂*_{+} and *q̂*_{−} and empirical type I error rates (*α* = 0.05) for various simulation parameters of *q*_{c} and *N* in the dominant mode. These results suggest that, under the null hypothesis, this statistic follows the χ^{2} distribution with 1 d.f. asymptotically.

### Simulation under the alternative hypothesis:

Simulation under the alternative hypothesis was performed in the dominant mode. Thus, it was repeated 10,000 times for various values of *q*_{+}, *q*_{−}, and *N*, and the proportion of the trials that yielded values of the statistic over 3.841 (the value that yielded the cumulative density function of 0.95 for the χ^{2} distribution with 1 d.f.) was determined (empirical power). Figure 2 shows that the power increased with increasing value of *q*_{+}/*q*_{−} (*q*_{+}/*q*_{−} ≥ 1) and with increase in sample size *N*. We then tested the effects of frequency of the phenotype-associated haplotype on the statistic. Figure 3 shows that the empirical power peaked at the intermediate frequency of phenotype-associated haplotype.

Table 3 indicates that the present algorithm has higher power than the association test using the contingency tables made by posterior distribution of the diplotype configuration after haplotype inference, especially under weak linkage disequilibrium conditions. The association test with the contingency tables made by the true diplotype configurations has the highest power because the complete data are known, but the present algorithm can be applied to SNP genotypes at multiple loci without complete data that could not be observed easily in real data.

### Distribution of estimated penetrances *q̂*_{+} and *q̂*_{−}:

We then examined the distribution of estimated *q̂*_{+} and *q̂*_{−} under the alternative hypothesis. Table 4 indicates that the estimation of *q*_{+} and *q*_{−} is quite accurate and that variations are small. Accordingly, the relative risk *q̂*_{+}/*q̂*_{−} was also reliably estimated. These results indicate that the estimated penetrances are accurate with minor variations.

### Analysis of real data:

We then applied the present algorithm to real data for *MTHFR* (Urano *et al.* 2002) and *NAT2* (Tanaka *et al.* 2002). Both sets of data were derived from cohort studies.

The data set for *MTHFR* was derived from a cohort study of rheumatoid arthritis patients. The 104 patients who received methotrexate were examined for both the occurrence of side effects and the genotypes at two SNP loci in the *MTHFR* gene. The precise clinical and genotype data have been published elsewhere (Urano *et al.* 2002). One of the haplotypes was assumed to be the phenotype-associated haplotype, according to a previous article. The statistic −2 log(*L*_{0max}/*L*_{max}) calculated for the data was 6.8074, which was significant (*P* < 0.01). Type II error estimated by the nonparametric bootstrap method was 0.22, which should decrease as the sample size increases. The maximum-likelihood estimates *q̂*_{+} and *q̂*_{−} were 0.2571 ± 0.0523 and 0.0588 ± 0.0405, respectively, and the maximum-likelihood estimate relative risk was 4.37. The standard errors of *q̂*_{+} and *q̂*_{−} were estimated by the nonparametric bootstrap method. Thus, the presence of the phenotype-associated haplotype was associated with susceptibility to development of the phenotype. The diplotype distribution of each subject was concentrated on a single event. For all the subjects, the diplotype configurations estimated under the alternative hypothesis were the same as those estimated by the null hypothesis or by LDSUPPORT, which is the haplotype inference program based on the EM algorithm (Kitamura *et al.* 2002). Thus, the diplotype configurations estimated were in this case the same regardless of the presence of phenotypes. The maximum-likelihood estimates Θ̂ were not different between the results obtained with and without incorporation of phenotype data.

We next analyzed the data set for *NAT2*. This data set was also derived from a cohort study of rheumatoid arthritis patients. The 144 patients who received sulfasalazine were examined for both the occurrence of side effects and the genotypes at seven SNP loci in the *NAT2* gene. One of the haplotypes, known to be the wild-type haplotype, was assumed to be the phenotype-associated haplotype, according to a previous article. The statistic −2 log(*L*_{0max}/*L*_{max}) calculated for the data was 13.4629, which was significant (*P* < 0.001), showing that this haplotype was significantly associated with side effects. Type II error estimated by the nonparametric bootstrap method is 0.088. The maximum-likelihood estimates *q̂*_{+} and *q̂*_{−} were 0.0809 ± 0.0233 and 0.6248 ± 0.1839, respectively, and the maximum-likelihood estimate relative risk was 0.129. Thus, more precisely, the presence of the phenotype-associated haplotype was associated with a reduced probability of side effects. The diplotype distribution of each subject was concentrated on a single event in all cases but one. For all the subjects, the diplotype configurations estimated under the alternative hypothesis were the same as those estimated under the null hypothesis or by LDSUPPORT (results not shown).

## DISCUSSION

In this article, we describe an algorithm that tests the association between a phenotype and a haplotype or a haplotype set at the level of individuals using observed data for genotypes and phenotypes. This algorithm can be applied to data from either clinical trials or cohort studies. The algorithm can also calculate maximum-likelihood estimates of penetrances in dominant, recessive, or genotype modes.

This algorithm implemented in PENHAPLO was found to be both reliable and powerful. As shown by the simulation studies in the dominant mode, the test statistic −2 log(*L*_{0max}/*L*_{max}) followed the χ^{2} distribution with 1 d.f. asymptotically when the data were derived under the assumption of no association between haplotype and phenotype. It was also shown that this method had considerable power under certain conditions (when *q*_{+}/*q*_{−} was not close to 1, *N* was sufficiently large, and the frequency of the phenotype-associated haplotype was close to neither 0 nor 1). Furthermore, compared with the association test using contingency tables made by the posterior distribution of diplotype configuration after haplotype inference, this algorithm has greater power. When this method was applied to real data published previously, it could effectively detect the difference in penetrance between subjects with a given haplotype and those without it.

Although the association between phenotype and haplotypes is tested by comparing the frequencies of a haplotype between two groups with different phenotypes, comparison of penetrances between two groups with different diplotype configurations is even more important in some cases. This method provided the basis for such studies.

In this algorithm, the diplotype configuration for each subject should not necessarily be unequivocally determined. Therefore, this method is expected to be robust and superior to the methods in which ambiguous diplotype configurations are not allowed. When the haplotypes in question involve only polymorphic loci within a haplotype block, the diplotype distribution for each subject is often concentrated on a single event and the diplotype configuration thus estimated is reliable. However, if the haplotypes involve loci located beyond the boundaries of a haplotype block, the diplotype distribution for each subject is often dispersed, and even if the most likely diplotype configuration is selected, estimation is not reliable. Therefore, the test presented here is likely to be more practical for real data.

This method considers the association between haplotypes and phenotypes by focusing on penetrances, while other methods view the same association on the basis of the difference in the frequencies of the haplotypes between groups with different phenotypes. Recent studies on drug-related genes have clarified that some genetic differences are significantly related to the occurrence of side effects and to efficacy. Some studies have shown that haplotypes are associated with reactions to drugs (Drysdale *et al.* 2000; Judson *et al.* 2000). If one intends to apply the results of such studies to a single individual, it is important to evaluate the probability that he or she develops the phenotype (for example, the occurrence of side effects). In such a case, the probability of developing the phenotype may be estimated on the basis of the observed data for the genotypes before the administration of drugs. This article provides one such method for estimation of probabilities on which decisions regarding treatment may depend.

The problem in association studies based on haplotypes is that the candidate phenotype-associated haplotype should be known in advance. In some cases, it is known from previous independent studies or the data from the functional studies. However, when it is not, one must test all the haplotypes, each of which may be associated with the phenotype. In PENHAPLO, the function to test all incomplete haplotypes for the association with a phenotype has been implemented. When each incomplete haplotype in a region is tested, the problem of the multiple comparison will emerge because the number of all the incomplete haplotypes will be huge. The correction for multiple comparisons, such as Bonferroni's correction, is definitely too conservative since such multiple tests are tightly dependent on each other. One of the solutions is to perform the association study using haplotypes constructed with haplotype-tagging SNPs (htSNPs) extracted from a block. The extraction of htSNPs is one of the methods with which to remove the redundancy from the haplotype-based association analysis. Then we use only htSNPs within a block as unmasked loci in the incomplete haplotypes. When the above procedures are employed, the number of incomplete haplotypes within a block with considerable frequencies is not so large (Kamatani *et al.* 2004). We admit, however, that the problem of multiple comparison is the greatest problem to be considered in the future studies in the haplotype-based association studies. The computer program PENHAPLO will be sent to researchers on request under certain conditions.

In summary, we have devised an algorithm with which to test the association between a haplotype or a haplotype set and a phenotype in various modes and to estimate diplotype-based penetrances using data from cohort studies and clinical trials. We have implemented the algorithm in the computer program PENHAPLO. Both simulated data and real data were analyzed by our algorithm. The results suggested that our method is especially useful for analyzing data concerning the association between genetic information and individual responses to drugs.

## Acknowledgments

This study was supported by a grant for Research for the Future Program from the Japan Society for the Promotion of Science (JSPS).

## Footnotes

Communicating editor: Y.-X. Fu

- Received November 17, 2003.
- Accepted August 30, 2004.

- Genetics Society of America