# Test of Association Between Haplotypes and Phenotypes in Case–Control Studies: Examination of Validity of the Application of an Algorithm for Samples From Cohort or Clinical Trials to Case–Control Samples Using Simulated and Real Data

^{*}Algorithm Team, Japan Biological Information Research Center (JBIRC), Japan Biological Informatics Consortium (JBIC), Tokyo 135-0064, Japan,^{†}Mitsubishi Research Institute, Tokyo 100-8141, Japan and^{‡}Division of Genomic Medicine, Department of Advanced Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women's Medical University, Tokyo 162-0054, Japan

- 2
*Corresponding author:*Division of Genomic Medicine, Department of Advanced Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women's Medical University, 10-22 Kawada-cho, Shinjuku-ku, Shinjuku, Tokyo 162-0054, Japan. E-mail: kamatani{at}ior.twmu.ac.jp

## Abstract

The use of haplotype information in case–control studies is an area of focus for the research on the association between phenotypes and genetic polymorphisms. We examined the validity of the application of the likelihood-based algorithm, which was originally developed to analyze the data from cohort studies or clinical trials, to the data from case–control studies. This algorithm was implemented in a computer program called PENHAPLO. In this program, haplotype frequencies and penetrances are estimated using the expectation-maximization algorithm, and the haplotype–phenotype association is tested using the generalized likelihood ratio. We show that this algorithm was useful not only for cohort studies but also for case–control studies. Simulations under the null hypothesis (no association between haplotypes and phenotypes) have shown that the type I error rates were accurately estimated. The simulations under alternative hypotheses showed that PENHAPLO is a robust method for the analysis of the data from case–control studies even when the haplotypes were not in HWE, although real penetrances cannot be estimated. The power of PENHAPLO was higher than that of other methods using the likelihood-ratio test for the comparison of haplotype frequencies. Results of the analysis of real data indicated that a significant association between haplotypes in the *SAA1* gene and AA-amyloidosis phenotype was observed in patients with rheumatoid arthritis, thereby suggesting the validity of the application of PENHAPLO for case–control data.

THE importance of the analysis of polymorphism data based on linkage disequilibrium and haplotype structure has increased with the number of available single-nucleotide polymorphism (SNP) markers. Some studies have suggested that phenotypes of individuals are associated with a diplotype configuration (a combination of haplotypes) rather than a SNP genotype (Horikawa *et al*. 2000; Sugiura *et al.* 2002; Tanaka *et al*. 2002; Urano *et al.* 2002; Kawaguchi *et al.* 2003). Since neither haplotypes nor diplotype configurations of a subject are usually observed, the haplotype frequencies of the population are inferred using algorithms.

For the analysis of data from either cohort studies or clinical trials, we developed an algorithm for an association test between individual qualitative phenotypes and specific haplotypes (Ito *et al.* 2004). This algorithm estimates diplotype-based penetrances in dominant, recessive, or general relative risk modes; further, it estimates the frequencies in a population.

On the basis of the expectation-maximization (EM) method, the algorithm was implemented in the computer program PENHAPLO. The proportions of affected and unaffected individuals with different diplotype configurations can be compared using posterior distribution given the maximum-likelihood estimates of population haplotype frequencies as priors. A number of computer programs have been developed for inferring haplotype frequencies in a population (Stephens *et al.* 2001; Kitamura *et al.* 2002; Niu *et al.* 2002; Qin *et al.* 2002; Stephens and Donnelly 2003). One advantage of using PENHAPLO is that the diplotype configuration of each individual need not be determined to test the association between the haplotypes and phenotypes. Rather, the diplotype configurations of some subjects can remain ambiguous. Our previous results showed that PENHAPLO is particularly useful for the analysis of cohort data on the association between haplotypes and phenotypes (Ito *et al.* 2004).

A case–control study is another type of useful clinical study for detecting the association between haplotypes and phenotypes. Several methods have been proposed for the analysis of case–control studies on the basis of haplotypes (Fallin *et al.* 2001; Schaid *et al.* 2002; Zaykin *et al.* 2002; Epstein and Satten 2003; Stram *et al.* 2003). Fallin *et al.* (2001) estimated haplotype frequencies for cases and controls separately; then, they applied omnibus tests to assess the differences in the overall haplotype frequency profiles between cases and controls. Schaid *et al.* (2002) and Zaykin *et al.* (2002) developed methods for testing the association using a logistic regression model. In these methods, haplotype frequencies were estimated under the assumption that haplotypes were in Hardy–Weinberg equilibrium (HWE) in the population. For a case–control study, this assumption is appropriate under the null hypothesis, in which no association is assumed between haplotypes and phenotypes; however, it does not hold under the alternative hypothesis. If an association between haplotypes and phenotypes exists, the proportion of haplotypes in the cases does not reflect the haplotype frequencies in the population and may deviate from HWE. The haplotypes positively associated with the phenotypes are observed more often in the cases, and vice versa. Epstein and Satten (2003) and Stram *et al.* (2003) developed methods to overcome this problem. Epstein and Satten (2003) proposed a retrospective likelihood method in which HWE was assumed only in the controls. Stram *et al.* (2003) corrected the likelihood by introducing known sampling probabilities of cases and controls from the population, although sampling probabilities are rarely known in an actual case–control study. Stram *et al.* (2003) showed that, as long as haplotypes were well predicted by SNP genotyping of data, the difference in the risks estimated using cohort likelihood and case–control-based likelihood was low.

We investigate the applicability of PENHAPLO, which is based on cohort likelihood, to case–control studies. We focused on testing the association between haplotypes and phenotypes, but not on inferring haplotype frequencies. In the subsequent section, we briefly describe the algorithm implemented in PENHAPLO. Next, we apply PENHAPLO to analyze simulated case–control data under both the null hypothesis and the alternative hypothesis. The results using PENHAPLO are compared with those using known-phase data. Finally, we apply PENHAPLO to the real data from a case–control study to examine the validity of the method.

## METHODS

#### Likelihood function in the case of cohort studies:

Details of the algorithm implemented in PENHAPLO are described in our previous article Ito *et al.* (2004). We briefly describe the parameters and the likelihood function used in the algorithm.

Suppose that there are *N* subjects in a sample. A qualitative phenotype ψ_{i} for the *i*th subject occurs with a probability *q _{k}* when the subject possesses a diplotype configuration

*a*and a phenotype ψ

_{k}_{+}. Throughout this article, a diplotype configuration is defined as an ordered combination of two haplotypes. Therefore, the maximum value of

*k*is

*L*

^{2}, where

*L*is the number of all possible haplotypes. In our algorithm implemented in PENHAPLO (Ito

*et al.*2004), we considered two or three penetrances depending on the mode of inheritance, instead of assigning

*q*to each possible diplotype configuration. In the dominant mode, the penetrance

_{k}*q*

_{+}was assumed to be identical for all members of

*D*

_{+}, the subset that contains diplotype configurations with at least one phenotype-associated haplotype

*H*

_{+}; on the other hand, the other penetrance

*q*

_{−}was assumed for nonmembers of

*D*

_{+}. In the recessive mode, the penetrances

*q*

_{+}and

*q*

_{−}were assumed for members and nonmembers, respectively, of

*D*

_{+}, the subset containing diplotype configurations with two

*H*

_{+}copies. In the general relative risk mode, three penetrances were assigned, depending on the number of haplotype

*H*

_{+}copies in the subject.

If the haplotypes are in HWE in the underlying population with frequencies Θ = (θ_{1}, … , θ_{L}), the likelihood function in the dominant or recessive mode is expressed as(1)

Here, *A _{i}* denotes the set of diplotype configurations

*a*for subject

_{k}*i*that are consistent with the observed genotype

*g*is a diplotype configuration for the subject,

_{i}. d_{i}*w*is the observed phenotype, and

_{i}*N*is the number of subjects in the observed data. For any

*i*and

*k*, the probability of the subject

*i*having the diplotype configuration

*a*is expressed as a product of the two frequencies of the haplotypes that constitute

_{k}*a*. The probability that the

_{k}*i*th subject develops the observed phenotype

*w*under the condition that

_{i}*d*=

_{i}*a*is expressed as

_{k}In the general relative risk mode, the likelihood function is expressed as(2)where *q*_{0}, *q*_{1}, and *q*_{2} show the penetrances for the diplotype configurations that contain zero, one, and two copies of the phenotype-associated haplotype *H*_{+}, respectively. The probability function is expressed aswhere *D*_{0}, *D*_{1}, and *D*_{2} denote the sets of diplotype configurations that contain zero, one, and two copies of the members of *H*_{+}, respectively. The parameters (Θ, *q*_{+}, and *q*_{−}) or (Θ, *q*_{0}, *q*_{1}, and *q*_{2}) are estimated using the EM method. The maximum likelihood under the null hypothesis, *i.e.*, *q*_{+} = *q*_{−} or *q*_{0} = *q*_{1} = *q*_{2}, and that under the alternative hypothesis, *i.e.*, *q*_{+} ≠ *q*_{−} or *q*_{0} ≠ *q*_{1} ≠ *q*_{2}, were calculated; the generalized likelihood ratio was used to test the association between the phenotype and the diplotype configuration.

#### Parameters for a case–control study:

In a case–control study, haplotype frequencies of cases or controls exhibit a bias as compared to those of the population if an association between the phenotype and the haplotype *H*_{+} exists. The phenotype-associated haplotype is observed more often in cases than in controls. When PENHAPLO is used to analyze the data from a case–control study, parameters *r*_{+} and *r*_{−} are estimated instead of the penetrances *q*_{+} and *q*_{−} in the dominant or recessive mode. In the general mode, *r*_{0}, *r*_{1}, and *r*_{2} are estimated instead of *q*_{0}, *q*_{1}, and *q*_{2}, respectively. Here, *r*_{+} is the ratio of the number of cases to the total number of subjects of cases and controls for members of *D*_{+}, while *r*_{−} is the same ratio, but for nonmembers of *D*_{+}. We assume that *r*_{+} and *r*_{−} express the bias between cases and controls. The odds ratio OR is estimated using *r*_{+} and *r*_{−} as(3)

(see the appendix for derivation). If prevalence λ of the disease is known for the population, *q*_{+} and *q*_{−} can be obtained using *r*_{+} and *r*_{−} as(4)(5)where *s* denotes the ratio of the number of cases to the total number of subjects in the sample of a case–control study.

#### Permutation test for multiple comparison:

Neither the phenotype-associated haplotype nor the mode of inheritance is known in a real analysis. To find the phenotype-associated haplotypes and the mode of inheritance, all combinations of “incomplete haplotypes” and three modes of inheritance can be tested for the association analysis using PENHAPLO. The concept of incomplete haplotype was defined by Kamatani *et al.* (2004) as a subset of a complete haplotype whose members have certain alleles at some of the loci within the region. On the other hand, a “complete” haplotype is defined as a list of alleles at all the loci within the region. For example, *CTGCT defines an incomplete haplotype whose members are the complete haplotypes ACTGCT and GCTGCT if the alleles at the first locus are A and G, respectively. A single allele, such as T at the third position, is also defined as an incomplete haplotype because it is expressed as **T***.

To control a familywise type I error rate, we evaluate the significance level per test by a permutation test. A familywise type I error rate must be inflated if multiple haplotypes and multiple modes of inheritance are tested. Since incomplete haplotypes are not independent of each other, the commonly used Bonferroni correction would result in extremely conservative corrections for this multiple-testing problem. Therefore, we employ the single-step resampling method proposed by Westfall and Young (1993) for evaluating the significance level per test. The algorithm is as follows:

Generate

*P*-values under the null hypothesis for all combinations of the incomplete haplotypes and modes of inheritance that are tested for real data. To obtain the*P*-values, first, shuffle the phenotype data ψ, affected or nonaffected, among all the subjects. Then, compute the*P*-values for the tests.The minimum

*P*-value is adopted for resampled data.Repeat 1–2

*M*times and obtain*M P*-values.Sort the

*P*-values in ascending order. The (α ×*M*)th*P*-value is a significance level per test corresponding to a familywise significant level α.

#### Simulation 1: validation of the robustness of the algorithm:

We performed simulations to validate the robustness of the use of PENHAPLO in case–control studies. Type I error rates under the null hypothesis were calculated for the simulation data on the basis of two blocks in different states of linkage disequilibrium. The estimated type I error rates show whether the test statistic used in PENHAPLO approximately follows the χ^{2}-distribution. Under various alternative hypotheses, the means and standard deviations of the odds ratio estimated using PENHAPLO were examined. According to the alternative hypothesis, if an association exists between haplotypes and phenotypes, diplotype frequencies in a case–control study are not in HWE. A comparison of the estimated odds ratios using PENHAPLO with the true values implicitly shows the adequacy of the HWE assumption used in PENHAPLO.

The flow chart of the simulation is shown in Figure 1. We began the simulation by assigning two ordered haplotype copies to each of the 1,000,000 subjects by randomly selecting the haplotype copies using Θ. Next, the phenotype of each subject was stochastically determined according to the penetrances *q*_{+} and *q*_{−}. The cases and controls were randomly sampled from affected and unaffected subjects, respectively. After removing the phase information, PENHAPLO was applied to the simulated case–control data for an association test and estimation of odds ratios. Simultaneously, 2 × 2 contingency tables were constructed on the basis of known-phase data and tested for independence using Pearson's χ^{2}-statistic. In the contingency tables, the rows indicate whether a subject had at least one copy of a disease-associated haplotype, and the columns indicate cases *vs.* controls. We also used three other methods (haplotype frequency comparison, diplotype frequency comparison, and subject partition) to test the associations using the same simulated data. Using these three methods, the haplotype frequencies were first estimated, followed by the test of association between haplotypes and disease using contingency tables.

In the simulation, we used either Θ obtained from our previous study on serum amyloid A (*SAA*) genes (Moriguchi *et al.* 2001) or artificially created distributions (*ART*). Table 1 shows the haplotypes and their frequencies. The *SAA* gene contains six SNPs. They are not in tight linkage disequilibrium, as shown in Figure 2. The *ART* data also contain six SNPs, and the loci from 1 to 3 and those from 5 to 6 are in tight linkage disequilibrium, whereas the other loci are in linkage equilibrium. The penetrance *q*_{−} was fixed at 0.2 for both the null and alternative hypotheses, whereas various values of *q*_{+} were tested for alternative hypotheses. In this simulation, we used the dominant mode, in which *q*_{+} and *q*_{−} were given for the subjects with at least one disease-associated haplotype and those without any, respectively. The simulations were repeated 20,000 times for the null hypothesis and 5000 times for each alternative hypothesis. The numbers of cases were equal to those of the controls throughout the study.

The association between a disease and haplotype was tested in recent case–control studies (Fallin *et al.* 2001; Hodgkinson *et al.* 2004). In these studies, haplotype frequencies were estimated separately in cases and controls followed by a test of their differences between cases and controls. The PENHAPLO algorithm also estimates the haplotype frequencies; however, it simultaneously tests the association between disease and diplotype configurations. We compared the results obtained by the analysis using PENHAPLO with those obtained by the analysis using the following methods:

##### Method 1 (haplotype frequency comparison method):

In the first step, LDSUPPORT (Kitamura *et al.* 2002)—a software that estimates the haplotype frequencies using the EM algorithm—was used to separately estimate haplotype frequencies Θ for cases and controls. Then, the differences in the frequencies of the specific haplotype *H*_{+} between cases and controls were tested using 2 × 2 contingency tables.

##### Method 2 (diplotype frequency comparison method):

First, haplotype frequencies for cases and controls were estimated separately using LDSUPPORT. Then, the frequency of each diplotype configuration was calculated separately for cases and controls, as a product of the frequencies of two haplotypes. The frequencies of diplotype configurations were summed up separately for members and nonmembers of *D*_{+}. Then, the frequencies of diplotype configurations categorized as members of *D*_{+} were compared between cases and controls.

##### Method 3 (subject partition method):

In the first step, haplotype frequencies were estimated by LDSUPPORT using pooled data of cases and controls. Then, posterior probabilities of diplotype configurations for each subject were calculated by considering the products of the estimated relative frequencies of two haplotypes constituting possible diplotype configurations for each subject as priors. The expected numbers of diplotype configurations in both cases and controls were calculated by summing them for the members and nonmembers of *D*_{+} separately. Then, the frequencies of diplotype configurations categorized as members of *D*_{+} were compared between the cases and controls. Note that, in this method, a subject may be partitioned into two proportions—that of a member and that of a nonmember of *D*_{+}.

#### Simulation 2: detecting the association:

In simulation 1, the risk haplotype and mode of inheritance were assumed to be known. Next, to simulate a more real association study, the data were simulated as follows and analyzed using PENHAPLO under the assumption that the risk haplotype and mode of inheritance were unknown:

A common disease: One of the major haplotypes was assumed to be a disease-related haplotype. The risk was additive. The relative risk of a subject with two disease-related haplotypes compared to subjects with no disease-related haplotypes was assumed to be 1.5 or 2.

Disease caused by a single mutation: One of the minor haplotypes was assumed to be a disease-related haplotype. The mode of inheritance was dominant. The relative risk was assumed to be 2 or 4;

*i.e.*, the corresponding odds ratios were 2.25 and ∼3.9, respectively.

In both cases, the penetrances *q*_{−} and *q*_{0} for the diplotype without the risk haplotype were fixed at 0.1. The haplotype frequencies Θ of *SAA* genes shown in Table 1 were used. For the simulation of a common disease, the risk haplotype was assumed to be “ACCGTC,” while “GCTGCT” was assumed to be the risk haplotype for the other simulation.

Since neither the risk haplotype nor the mode of inheritance was kept blinded in the analysis of the simulated data all possible combinations of the incomplete haplotypes and three modes of inheritance were tested using PENHAPLO, followed by a comparison of the *P*-values with the significance level per test evaluated by the permutation test. It was decided that an association was detected when at least one *P*-value was less than the significance level per test. Type I error rates under the null hypothesis and the power under an alternative hypothesis were calculated.

The results of PENHAPLO were compared with an alternative nonparametric method (method 4). For case–control studies, haplotype frequencies between cases and controls are often compared using a likelihood-ratio statistic. In our study, log-likelihoods for cases, controls, and combined data of cases and controls—ln(*L*_{cases}), ln(*L*_{controls}), and ln(*L*_{cases+controls}), respectively—were maximized using LDSUPPORT. Then, the likelihood-ratio (LR) statistics were calculated as LR = 2{ln(*L*_{cases}) + ln(*L*_{controls}) − ln(*L*_{cases+controls})}. The statistic approximately follows a χ^{2}-distribution, except in the case where there are many minor haplotypes in the data. Therefore, we evaluated the critical point of LR under the null hypothesis for each simulation with a varying sample size so that the type I error rate equals 0.05. The empirical power was calculated as the proportion of the attempts that yielded values of the test statistic that are greater than the values over the critical point. The powers obtained were compared with the results using PENHAPLO.

Three sample sizes—400, 1000, and 2000 (half of the samples were the cases)—were simulated. The familywise significance level α was set at 0.05. The simulations were repeated 2000 times for both the null and the alternative hypotheses. The permutation tests for an evaluation of significance level per test were performed 1000 times for each simulation.

#### Analysis of real data:

We applied PENHAPLO to analyze the real data from patients with rheumatoid arthritis (RA) with amyloidosis and non-RA healthy subjects (Moriguchi *et al.* 2001). Since the cases (amyloidosis) and controls (non-RA) were selected separately, the study design is considered to be a case–control study. In the previous study (Moriguchi *et al.* 2001), five SNPs of the *SAA1* gene were obtained for 44 RA patients with amyloidosis (cases) and for 58 non-RA subjects (controls). Although one SNP in the *SAA2* gene was obtained in the previous study, we did not use it since the locus was not in linkage disequilibrium with the other loci.

## RESULTS

#### Simulation 1: type I errors under the null hypothesis:

Figure 3 shows the empirical type I error rates (the proportions of attempts with a test statistic that had fallen in the critical region) when the data simulated under the null hypothesis were analyzed by various methods at a fixed significance level of α = 0.05. The results obtained by the analysis using PENHAPLO were compared with those obtained by the analysis using 2 × 2 tables based on the known-phase data. The results were also compared with those obtained by the other three methods using the contingency tables based on the unknown-phase data. The type I error rates were calculated as the proportions of the attempts that yielded the values of the test statistic—that is, −2 log(generalized likelihood ratio) for PENHAPLO and χ^{2}-values for the other methods—that were >3.841 (the value that yields a cumulative distribution function of 0.95 for the χ^{2}-distribution with 1 d.f.). When the same simulated samples were analyzed, the empirical type I error rates obtained using PENHAPLO were consistent with those obtained using the known-phase data. However, the empirical type I error rates obtained using methods 1 and 2 were higher than those obtained using the known-phase data. Therefore, these methods underestimate type I error rates. A real type I error rate of 0.05 was obtained when the critical points were set at values >3.841. The values of the critical points depend on the number of samples and the haplotype frequency in the population. For example, the critical values for *SAA* data with 200 samples of each case and control were 4.005 and 4.036 for methods 1 and 2, respectively.

In contrast, the empirical type I error rates obtained using method 3 were lower than those obtained using the known-phase data, which suggests that method 3 tends to overestimate the type I error rates (Figure 3). A real type I error rate of 0.05 was obtained when the critical points were set at values <3.841. In this case as well, the values of the critical points depend on the number of samples and the haplotype frequency in the population. When the data from *SAA* genes were analyzed, the ranges of the statistical errors (1σ) for the type I error rates included the expected significance level of 0.05 for both PENHAPLO and the known-phase data; however, it was not included for the other three methods (Figure 3).

#### Simulation 1: odds ratio:

We then performed equivalent simulations under the alternative hypotheses. Simulations were performed at a fixed value of *q*_{−} = 0.2, whereas *q*_{+}/*q*_{−} was varied from 1.0 to 2.0. The number of both cases and controls was fixed at 200. Tables 2 and 3 show the means and standard deviations of the estimated odds ratios for the *SAA* and *ART* data, respectively. For both *SAA* gene and *ART* data, the discrepancies between the means of the estimated odds ratio calculated using PENHAPLO and those calculated using the known-phase data were <2%. The discrepancies between the true values and the means estimated using the known-phase data decreased with increasing sample size. Method 1 underestimated the odds ratios: the means of the estimated odds were significantly lower than those calculated using known-phase data (Tables 2 and 3). The discrepancies were >30% at the maximum. The maximum discrepancies occurred at a maximum odds ratio of ∼2.7. The discrepancies between the means estimated using either method 2 or method 3 and those calculated using the known-phase data were within 5% for *SAA* data (Table 2). The discrepancies increased with increasing odds ratio. For *ART* data, method 3 underestimated the odds ratios by >30% (when the means were compared) at the maximum odds ratio (Table 3).

#### Simulation 2: power calculation:

The powers and type I error rates are shown in Table 4.

The familywise type I error rates estimated using PENHAPLO were almost equal to the significance level α = 0.05. Although ∼270 tests (∼90 incomplete haplotypes times and three modes of inheritance) were performed in each simulation analyzed using PENHAPLO, it was shown that the familywise type I error rates did not inflate when the *P*-values per test were compared with the significance levels per test evaluated by the permutation test.

For the alternative hypothesis, the powers analyzed using PENHAPLO were ∼1.1–1.3 times higher than those analyzed using method 4, which was based on the comparison of haplotype frequencies.

#### Analysis of real data:

Finally, we applied PENHAPLO to the real data for the case–control study of amyloidosis (Moriguchi *et al.* 2001). The details of the five SNPs in the *SAA1* gene are shown in Table 5. All the incomplete haplotypes constructed by three htSNPs (loci 3, 4, and 6) were tested for the association in dominant, recessive, and general modes. Table 6 shows the results whose *P*-values were smaller than the significance level per test of 3.10 × 10^{−3}. The results did not change significantly even when additional data were included after our previous publication (Moriguchi *et al.* 2001).

The incomplete haplotype showing the strongest association with AA amyloidosis was *T*** in the general relative risk model. This incomplete haplotype actually denotes a SNP at −13T/C of the *SAA1* gene. The result suggested that −13T was a risk factor for the development of AA amyloidosis. The results are consistent with previous works by Moriguchi *et al.* (2001) and Yamada *et al.* (2003).

Among 39 tests, 8 were significant for the association between AA amyloidosis and the haplotypes. The significant results were divided into two groups. One consisted of the incomplete haplotypes whose major complete haplotype was CTGCC, and the other consisted of those having CCGTC at the highest frequency. The number of members in both groups was four. A careful inspection of the results showed that the group CTGCC serves as a risk, while the group CCGTC serves as an inhibitory factor in the development of AA amyloidosis.

The haplotype CTGCC was the most frequent haplotype observed in the controls, followed by CCGTC, which was the second most frequent haplotype. When more than one major haplotype with contrary effects are detected in a case–control study, the results should be interpreted with caution. In a case–control study, an increase in a disease-related haplotype causes a decrease in the relative frequencies of other haplotypes, in cases. If a minor haplotype is related to the disease and the penetrance is low, then an increase in the risk haplotype in cases does not significantly change the relative frequencies of the major haplotypes. However, if a major haplotype is associated with a risk, the effect may not be so small. As a result, either an increase in a risk haplotype or a decrease in the other major haplotypes (or both) might be detected if the power is sufficiently large. A comparison of the relative frequencies of three major haplotypes (≥10% in controls) between cases and controls revealed that the relative frequency of CTGCC was higher in the cases than in the controls, while CCGTC was observed less frequently in the cases than in the controls. On the other hand, the relative frequency of GCGCT in the controls (third highest) was almost the same between the cases and controls. These results may suggest that the decrease of CCGTC in the cases was caused by the increase of CTGCC. However, a possibility remains that the effect of the haplotype CCGTC is independent of the effect of −13T.

## DISCUSSION

Earlier, we developed an algorithm, for implementation in PENHAPLO, for testing the association between phenotypes and haplotypes using genotype and phenotype data from either cohort studies or clinical trials. The algorithm simultaneously estimates penetrances on the basis of diplotype configurations and haplotype frequencies and tests the association between haplotypes and phenotypes. Since the penetrances can be estimated using data from cohort studies and clinical trials, PENHAPLO can be applied to such studies without major problems. In our study, we investigated the validity of the application of the algorithm implemented in PENHAPLO to case–control studies by simulations as well as by real data.

It was found that PENHAPLO is useful not only for the analysis of the data from cohort studies and clinical trials but also for the analysis of data from case–control studies with regard to the association test between haplotypes and qualitative phenotypes. However, when applying PENHAPLO to case–control studies, some estimated parameters are different from those estimated by the same algorithm when the data from either cohort studies or clinical trials are used. In the case of data from case–control studies, the parameters *r*_{+} and *r*_{−}, which are equivalent to the penetrances *q*_{+} and *q*_{−} for cohort studies and clinical trials, are sensitive to selection bias. The odds ratios (the ratios of the proportion of the number of subjects with *D*_{+} to the number of subjects with *D*_{−} in cases to the same proportion in controls) are expressed as a function of *r*_{+} and *r*_{−}. In addition, if the affection rate is known, the penetrances are expressed as a function of the affection rate, *r*_{+} and *r*_{−}.

Simulation studies have shown that PENHAPLO is a robust method for the analysis of case–control studies. Type I error rates were almost exactly as expected when data were generated under the null hypothesis. The means of the odds ratios estimated using PENHAPLO for the simulation converged to the true value in the population as the number of cases and controls increased. We compared the results obtained using PENHAPLO with those obtained using known-phase data. The type I error rate and the means of the odds ratios obtained using PENHAPLO were comparable to those analyzed using known-phase data.

Our results have shown that the odds ratios can be estimated from the combined case and control data by applying the method for the estimation of penetrances from the cohort data. The use of haplotypes inferred separately in cases and controls or inferred using combined data underestimates the odds ratio. These discrepancies increased with the decreasing strength of linkage disequilibrium and with increasing odds ratios. When an association test was performed using these alternative methods, the type I error rate under the null hypothesis did not correspond with the significance level.

Simulation studies have also shown that the power of PENHAPLO was higher than that of the alternative method.

The results of the analysis of the data from the case–control sample of patients with RA and healthy subjects were of interest. They support the previously proposed hypotheses for the association of *SAA* genes with AA amyloidosis. Baba *et al.* (1995) proposed that two missense base substitutions are related to AA amyloidosis; however, Moriguchi *et al.* (2001) and Yamada *et al.* (2003) proposed that −13T/C substitutions in the 5′-flanking sequence are responsible. The results of the analysis by PENHAPLO using the concept of incomplete haplotypes suggest that both the hypotheses might hold.

In conclusion, the likelihood-based algorithm PENHAPLO can be applied not only to cohort studies and clinical trials but also to case–control studies for testing the association between haplotypes and phenotypes. However, the parameters estimated as penetrances in cohort studies are not the true penetrances in case–control studies. Rather, an odds ratio is obtained from these parameters, and the penetrances can be calculated if the affection rate is available. Although the haplotypes are not in HWE in both cases and controls, the results showed that the effect of departure from HWE did not significantly affect the results, particularly when the loci were in tight linkage disequilibrium.

This work was supported by grants from the New Energy and Industrial Technology Development Organization.

## APPENDIX: PARAMETERS ESTIMATED IN A CASE–CONTROL STUDY

In a cohort study, the probabilities of a subject developing a disease when the diplotype configuration is or is not a member of *D*_{+} are shown in Table A1. Using the haplotype frequencies Θ in the population, the relative frequency of the *j*th diplotype configuration *D _{j}* in the population is expressed, assuming the HWE aswhere θ

_{l}and θ

_{m}are relative frequencies of the

*l*th and

*m*th haplotypes, respectively, that constitute

*D*. A diplotype configuration

_{j}*D*with

_{j}*j*= 1, … ,

*k*is a member of

*D*

_{+}, while

*D*with

_{j}*j*=

*k*+ 1, … ,

*L*

^{2}is a nonmember of

*D*

_{+}. The prevalence of disease λ is calculated using the penetrances

*q*

_{+},

*q*

_{−}, and as(A1)

In a case–control study, suppose that cases and controls are sampled from affected and not-affected subjects in the population with the sample rates *p*_{a} and *p*_{n}, respectively. The probabilities of observing a case or a control with a diplotype configuration *D _{k}* in all samples of cases and controls are expressed in Table A2. The ratio

*s*of the number of cases to the total number of subjects in a case–control study is expressed as(A2)

Thus, the following equation is obtained:(A3)

The ratio *s* is also obtained as the summation of the probabilities of the observed cases over all the possible diplotype configurations:(A4)

Correspondingly, the ratio of the number of controls to the total number of subjects, 1 − *s*, is obtained as the summation of the probabilities of the observed controls across all the possible diplotype configurations as follows:(A5)

In a case–control study, *s* is known; however, *p*_{a} and *p*_{n} are unknown. *r*_{+} and *r*_{−} are defined as the selection biases that express the difference in the relative frequencies of cases and controls with the diplotype configuration *D _{k}* ∈

*D*

_{+}and

*D*∉

_{k}*D*

_{+}, respectively, as(A6)(A7)

Substituting (A3) for (A6) yields(A8)

Instead of *q*_{+}, *r*_{+} is estimated when PENHAPLO is applied to analyze the data from a case–control study. Therefore, if the prevalence λ is known and *r*_{+} is estimated using PENHAPLO, the penetrance *q*_{+} is obtained as follows:(A9)

In the same manner, *q*_{−} is obtained as(A10)

The odds ratio is expressed as

Substituting (A6) and (A7) in the above equation yields(A11)

The likelihood-ratio test used in PENHAPLO is based on the following null and alternative hypotheses for a cohort study:

When PENHAPLO is used to analyze a case–control study, the null and alternative hypotheses are H_{0}: *r*_{+} = *s* and H_{1}: *r*_{+} ≠ *s* that correspond to

Since *r*_{+} and *r*_{−} are not independent parameters, *r*_{+} = *r*_{−} holds only if *r*_{+} = *s* [see (A4), (A5), (A9), and (A10)]; *i.e*., OR = 1 [see (A11)].

## Footnotes

1

*Present address:*Research Center for Advanced Science and Technology, Mitsubishi Research Institute, 2-3-6 Otemachi, Chiyoda-ku, Tokyo 100-8141, Japan.Communicating editor: Y.-X. Fu

- Received December 7, 2005.
- Accepted September 1, 2006.

- Copyright © 2006 by the Genetics Society of America