Originally published as Genetics Published Articles Ahead of Print on September 15, 2006.

Genetics, Vol. 174, 1505-1516, November 2006, Copyright © 2006
doi:10.1534/genetics.105.054452

Test of Association Between Haplotypes and Phenotypes in Case–Control Studies: Examination of Validity of the Application of an Algorithm for Samples From Cohort or Clinical Trials to Case–Control Samples Using Simulated and Real Data

* Algorithm Team, Japan Biological Information Research Center (JBIRC), Japan Biological Informatics Consortium (JBIC), Tokyo 135-0064, Japan, {dagger} Mitsubishi Research Institute, Tokyo 100-8141, Japan and {ddagger} Division of Genomic Medicine, Department of Advanced Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women's Medical University, Tokyo 162-0054, Japan

2 Corresponding author: Division of Genomic Medicine, Department of Advanced Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women's Medical University, 10-22 Kawada-cho, Shinjuku-ku, Shinjuku, Tokyo 162-0054, Japan.
E-mail: kamatani{at}ior.twmu.ac.jp

Manuscript received December 7, 2005. Accepted for publication September 1, 2006.

ABSTRACT

The use of haplotype information in case–control studies is an area of focus for the research on the association between phenotypes and genetic polymorphisms. We examined the validity of the application of the likelihood-based algorithm, which was originally developed to analyze the data from cohort studies or clinical trials, to the data from case–control studies. This algorithm was implemented in a computer program called PENHAPLO. In this program, haplotype frequencies and penetrances are estimated using the expectation-maximization algorithm, and the haplotype–phenotype association is tested using the generalized likelihood ratio. We show that this algorithm was useful not only for cohort studies but also for case–control studies. Simulations under the null hypothesis (no association between haplotypes and phenotypes) have shown that the type I error rates were accurately estimated. The simulations under alternative hypotheses showed that PENHAPLO is a robust method for the analysis of the data from case–control studies even when the haplotypes were not in HWE, although real penetrances cannot be estimated. The power of PENHAPLO was higher than that of other methods using the likelihood-ratio test for the comparison of haplotype frequencies. Results of the analysis of real data indicated that a significant association between haplotypes in the SAA1 gene and AA-amyloidosis phenotype was observed in patients with rheumatoid arthritis, thereby suggesting the validity of the application of PENHAPLO for case–control data.


THE importance of the analysis of polymorphism data based on linkage disequilibrium and haplotype structure has increased with the number of available single-nucleotide polymorphism (SNP) markers. Some studies have suggested that phenotypes of individuals are associated with a diplotype configuration (a combination of haplotypes) rather than a SNP genotype (HORIKAWA et al. 2000; SUGIURA et al. 2002; TANAKA et al. 2002; URANO et al. 2002; KAWAGUCHI et al. 2003). Since neither haplotypes nor diplotype configurations of a subject are usually observed, the haplotype frequencies of the population are inferred using algorithms.

For the analysis of data from either cohort studies or clinical trials, we developed an algorithm for an association test between individual qualitative phenotypes and specific haplotypes (ITO et al. 2004). This algorithm estimates diplotype-based penetrances in dominant, recessive, or general relative risk modes; further, it estimates the frequencies in a population.

On the basis of the expectation-maximization (EM) method, the algorithm was implemented in the computer program PENHAPLO. The proportions of affected and unaffected individuals with different diplotype configurations can be compared using posterior distribution given the maximum-likelihood estimates of population haplotype frequencies as priors. A number of computer programs have been developed for inferring haplotype frequencies in a population (STEPHENS et al. 2001; KITAMURA et al. 2002; NIU et al. 2002; QIN et al. 2002; STEPHENS and DONNELLY 2003). One advantage of using PENHAPLO is that the diplotype configuration of each individual need not be determined to test the association between the haplotypes and phenotypes. Rather, the diplotype configurations of some subjects can remain ambiguous. Our previous results showed that PENHAPLO is particularly useful for the analysis of cohort data on the association between haplotypes and phenotypes (ITO et al. 2004).

A case–control study is another type of useful clinical study for detecting the association between haplotypes and phenotypes. Several methods have been proposed for the analysis of case–control studies on the basis of haplotypes (FALLIN et al. 2001; SCHAID et al. 2002; ZAYKIN et al. 2002; EPSTEIN and SATTEN 2003; STRAM et al. 2003). FALLIN et al. (2001) estimated haplotype frequencies for cases and controls separately; then, they applied omnibus tests to assess the differences in the overall haplotype frequency profiles between cases and controls. SCHAID et al. (2002) and ZAYKIN et al. (2002) developed methods for testing the association using a logistic regression model. In these methods, haplotype frequencies were estimated under the assumption that haplotypes were in Hardy–Weinberg equilibrium (HWE) in the population. For a case–control study, this assumption is appropriate under the null hypothesis, in which no association is assumed between haplotypes and phenotypes; however, it does not hold under the alternative hypothesis. If an association between haplotypes and phenotypes exists, the proportion of haplotypes in the cases does not reflect the haplotype frequencies in the population and may deviate from HWE. The haplotypes positively associated with the phenotypes are observed more often in the cases, and vice versa. EPSTEIN and SATTEN (2003) and STRAM et al. (2003) developed methods to overcome this problem. EPSTEIN and SATTEN (2003) proposed a retrospective likelihood method in which HWE was assumed only in the controls. STRAM et al. (2003) corrected the likelihood by introducing known sampling probabilities of cases and controls from the population, although sampling probabilities are rarely known in an actual case–control study. STRAM et al. (2003) showed that, as long as haplotypes were well predicted by SNP genotyping of data, the difference in the risks estimated using cohort likelihood and case–control-based likelihood was low.

We investigate the applicability of PENHAPLO, which is based on cohort likelihood, to case–control studies. We focused on testing the association between haplotypes and phenotypes, but not on inferring haplotype frequencies. In the subsequent section, we briefly describe the algorithm implemented in PENHAPLO. Next, we apply PENHAPLO to analyze simulated case–control data under both the null hypothesis and the alternative hypothesis. The results using PENHAPLO are compared with those using known-phase data. Finally, we apply PENHAPLO to the real data from a case–control study to examine the validity of the method.


METHODS

Likelihood function in the case of cohort studies:

Details of the algorithm implemented in PENHAPLO are described in our previous article ITO et al. (2004). We briefly describe the parameters and the likelihood function used in the algorithm.

Suppose that there are N subjects in a sample. A qualitative phenotype {psi}i for the ith subject occurs with a probability qk when the subject possesses a diplotype configuration ak and a phenotype {psi}+. Throughout this article, a diplotype configuration is defined as an ordered combination of two haplotypes. Therefore, the maximum value of k is L2, where L is the number of all possible haplotypes. In our algorithm implemented in PENHAPLO (ITO et al. 2004), we considered two or three penetrances depending on the mode of inheritance, instead of assigning qk to each possible diplotype configuration. In the dominant mode, the penetrance q+ was assumed to be identical for all members of D+, the subset that contains diplotype configurations with at least one phenotype-associated haplotype H+; on the other hand, the other penetrance q was assumed for nonmembers of D+. In the recessive mode, the penetrances q+ and q were assumed for members and nonmembers, respectively, of D+, the subset containing diplotype configurations with two H+ copies. In the general relative risk mode, three penetrances were assigned, depending on the number of haplotype H+ copies in the subject.

If the haplotypes are in HWE in the underlying population with frequencies {Theta} = ({theta}1, ... , {theta}L), the likelihood function in the dominant or recessive mode is expressed as

Formula 1(1)

Here, Ai denotes the set of diplotype configurations ak for subject i that are consistent with the observed genotype gi. di is a diplotype configuration for the subject, wi is the observed phenotype, and N is the number of subjects in the observed data. For any i and k, the probability of the subject i having the diplotype configuration ak is expressed as a product of the two frequencies of the haplotypes that constitute ak. The probability that the ith subject develops the observed phenotype wi under the condition that di = ak is expressed as

Formula 1

In the general relative risk mode, the likelihood function is expressed as

Formula 2(2)
where q0, q1, and q2 show the penetrances for the diplotype configurations that contain zero, one, and two copies of the phenotype-associated haplotype H+, respectively. The probability function is expressed as

Formula 2
where D0, D1, and D2 denote the sets of diplotype configurations that contain zero, one, and two copies of the members of H+, respectively. The parameters ({Theta}, q+, and q) or ({Theta}, q0, q1, and q2) are estimated using the EM method. The maximum likelihood under the null hypothesis, i.e., q+ = q or q0 = q1 = q2, and that under the alternative hypothesis, i.e., q+ != q or q0 != q1 != q2, were calculated; the generalized likelihood ratio was used to test the association between the phenotype and the diplotype configuration.

Parameters for a case–control study:

In a case–control study, haplotype frequencies of cases or controls exhibit a bias as compared to those of the population if an association between the phenotype and the haplotype H+ exists. The phenotype-associated haplotype is observed more often in cases than in controls. When PENHAPLO is used to analyze the data from a case–control study, parameters r+ and r are estimated instead of the penetrances q+ and q in the dominant or recessive mode. In the general mode, r0, r1, and r2 are estimated instead of q0, q1, and q2, respectively. Here, r+ is the ratio of the number of cases to the total number of subjects of cases and controls for members of D+, while r is the same ratio, but for nonmembers of D+. We assume that r+ and r express the bias between cases and controls. The odds ratio OR is estimated using r+ and r as

Formula 3(3)

(see the APPENDIX for derivation). If prevalence {lambda} of the disease is known for the population, q+ and q can be obtained using r+ and r as

Formula 4(4)

Formula 5(5)
where s denotes the ratio of the number of cases to the total number of subjects in the sample of a case–control study.

Permutation test for multiple comparison:

Neither the phenotype-associated haplotype nor the mode of inheritance is known in a real analysis. To find the phenotype-associated haplotypes and the mode of inheritance, all combinations of "incomplete haplotypes" and three modes of inheritance can be tested for the association analysis using PENHAPLO. The concept of incomplete haplotype was defined by KAMATANI et al. (2004) as a subset of a complete haplotype whose members have certain alleles at some of the loci within the region. On the other hand, a "complete" haplotype is defined as a list of alleles at all the loci within the region. For example, *CTGCT defines an incomplete haplotype whose members are the complete haplotypes ACTGCT and GCTGCT if the alleles at the first locus are A and G, respectively. A single allele, such as T at the third position, is also defined as an incomplete haplotype because it is expressed as **T***.

To control a familywise type I error rate, we evaluate the significance level per test by a permutation test. A familywise type I error rate must be inflated if multiple haplotypes and multiple modes of inheritance are tested. Since incomplete haplotypes are not independent of each other, the commonly used Bonferroni correction would result in extremely conservative corrections for this multiple-testing problem. Therefore, we employ the single-step resampling method proposed by WESTFALL and YOUNG (1993) for evaluating the significance level per test. The algorithm is as follows:

  1. Generate P-values under the null hypothesis for all combinations of the incomplete haplotypes and modes of inheritance that are tested for real data. To obtain the P-values, first, shuffle the phenotype data {psi}, affected or nonaffected, among all the subjects. Then, compute the P-values for the tests.
  2. The minimum P-value is adopted for resampled data.
  3. Repeat 1–2M times and obtain M P-values.
  4. Sort the P-values in ascending order. The ({alpha} x M)th P-value is a significance level per test corresponding to a familywise significant level {alpha}.

Simulation 1: validation of the robustness of the algorithm:

We performed simulations to validate the robustness of the use of PENHAPLO in case–control studies. Type I error rates under the null hypothesis were calculated for the simulation data on the basis of two blocks in different states of linkage disequilibrium. The estimated type I error rates show whether the test statistic used in PENHAPLO approximately follows the {chi}2-distribution. Under various alternative hypotheses, the means and standard deviations of the odds ratio estimated using PENHAPLO were examined. According to the alternative hypothesis, if an association exists between haplotypes and phenotypes, diplotype frequencies in a case–control study are not in HWE. A comparison of the estimated odds ratios using PENHAPLO with the true values implicitly shows the adequacy of the HWE assumption used in PENHAPLO.

The flow chart of the simulation is shown in Figure 1. We began the simulation by assigning two ordered haplotype copies to each of the 1,000,000 subjects by randomly selecting the haplotype copies using {Theta}. Next, the phenotype of each subject was stochastically determined according to the penetrances q+ and q. The cases and controls were randomly sampled from affected and unaffected subjects, respectively. After removing the phase information, PENHAPLO was applied to the simulated case–control data for an association test and estimation of odds ratios. Simultaneously, 2 x 2 contingency tables were constructed on the basis of known-phase data and tested for independence using Pearson's {chi}2-statistic. In the contingency tables, the rows indicate whether a subject had at least one copy of a disease-associated haplotype, and the columns indicate cases vs. controls. We also used three other methods (haplotype frequency comparison, diplotype frequency comparison, and subject partition) to test the associations using the same simulated data. Using these three methods, the haplotype frequencies were first estimated, followed by the test of association between haplotypes and disease using contingency tables.


Figure 1
View larger version (24K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Schematic of the simulation (where the equation q+ = q = 0.2 was assumed for the null hypothesis).

 
In the simulation, we used either {Theta} obtained from our previous study on serum amyloid A (SAA) genes (MORIGUCHI et al. 2001) or artificially created distributions (ART). Table 1 shows the haplotypes and their frequencies. The SAA gene contains six SNPs. They are not in tight linkage disequilibrium, as shown in Figure 2. The ART data also contain six SNPs, and the loci from 1 to 3 and those from 5 to 6 are in tight linkage disequilibrium, whereas the other loci are in linkage equilibrium. The penetrance q was fixed at 0.2 for both the null and alternative hypotheses, whereas various values of q+ were tested for alternative hypotheses. In this simulation, we used the dominant mode, in which q+ and q were given for the subjects with at least one disease-associated haplotype and those without any, respectively. The simulations were repeated 20,000 times for the null hypothesis and 5000 times for each alternative hypothesis. The numbers of cases were equal to those of the controls throughout the study.


View this table:
In this window
In a new window

 
TABLE 1

Haplotype frequencies for SAA genes (MORIGUCHI et al. 2001) and ART

 

Figure 2
Figure 2
View larger version (55K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Linkage disequilibrium between six SNP markers. (a) D'-values for SAA; (b) those for ART.

 
The association between a disease and haplotype was tested in recent case–control studies (FALLIN et al. 2001; HODGKINSON et al. 2004). In these studies, haplotype frequencies were estimated separately in cases and controls followed by a test of their differences between cases and controls. The PENHAPLO algorithm also estimates the haplotype frequencies; however, it simultaneously tests the association between disease and diplotype configurations. We compared the results obtained by the analysis using PENHAPLO with those obtained by the analysis using the following methods:

Method 1 (haplotype frequency comparison method):

In the first step, LDSUPPORT (KITAMURA et al. 2002)—a software that estimates the haplotype frequencies using the EM algorithm—was used to separately estimate haplotype frequencies {Theta} for cases and controls. Then, the differences in the frequencies of the specific haplotype H+ between cases and controls were tested using 2 x 2 contingency tables.

Method 2 (diplotype frequency comparison method):

First, haplotype frequencies for cases and controls were estimated separately using LDSUPPORT. Then, the frequency of each diplotype configuration was calculated separately for cases and controls, as a product of the frequencies of two haplotypes. The frequencies of diplotype configurations were summed up separately for members and nonmembers of D+. Then, the frequencies of diplotype configurations categorized as members of D+ were compared between cases and controls.

Method 3 (subject partition method):

In the first step, haplotype frequencies were estimated by LDSUPPORT using pooled data of cases and controls. Then, posterior probabilities of diplotype configurations for each subject were calculated by considering the products of the estimated relative frequencies of two haplotypes constituting possible diplotype configurations for each subject as priors. The expected numbers of diplotype configurations in both cases and controls were calculated by summing them for the members and nonmembers of D+ separately. Then, the frequencies of diplotype configurations categorized as members of D+ were compared between the cases and controls. Note that, in this method, a subject may be partitioned into two proportions—that of a member and that of a nonmember of D+.

Simulation 2: detecting the association:

In simulation 1, the risk haplotype and mode of inheritance were assumed to be known. Next, to simulate a more real association study, the data were simulated as follows and analyzed using PENHAPLO under the assumption that the risk haplotype and mode of inheritance were unknown:
  1. A common disease: One of the major haplotypes was assumed to be a disease-related haplotype. The risk was additive. The relative risk of a subject with two disease-related haplotypes compared to subjects with no disease-related haplotypes was assumed to be 1.5 or 2.
  2. Disease caused by a single mutation: One of the minor haplotypes was assumed to be a disease-related haplotype. The mode of inheritance was dominant. The relative risk was assumed to be 2 or 4; i.e., the corresponding odds ratios were 2.25 and ~3.9, respectively.

In both cases, the penetrances q and q0 for the diplotype without the risk haplotype were fixed at 0.1. The haplotype frequencies {Theta} of SAA genes shown in Table 1 were used. For the simulation of a common disease, the risk haplotype was assumed to be "ACCGTC," while "GCTGCT" was assumed to be the risk haplotype for the other simulation.

Since neither the risk haplotype nor the mode of inheritance was kept blinded in the analysis of the simulated data all possible combinations of the incomplete haplotypes and three modes of inheritance were tested using PENHAPLO, followed by a comparison of the P-values with the significance level per test evaluated by the permutation test. It was decided that an association was detected when at least one P-value was less than the significance level per test. Type I error rates under the null hypothesis and the power under an alternative hypothesis were calculated.

The results of PENHAPLO were compared with an alternative nonparametric method (method 4). For case–control studies, haplotype frequencies between cases and controls are often compared using a likelihood-ratio statistic. In our study, log-likelihoods for cases, controls, and combined data of cases and controls—ln(Lcases), ln(Lcontrols), and ln(Lcases+controls), respectively—were maximized using LDSUPPORT. Then, the likelihood-ratio (LR) statistics were calculated as LR = 2{ln(Lcases) + ln(Lcontrols) – ln(Lcases+controls)}. The statistic approximately follows a {chi}2-distribution, except in the case where there are many minor haplotypes in the data. Therefore, we evaluated the critical point of LR under the null hypothesis for each simulation with a varying sample size so that the type I error rate equals 0.05. The empirical power was calculated as the proportion of the attempts that yielded values of the test statistic that are greater than the values over the critical point. The powers obtained were compared with the results using PENHAPLO.

Three sample sizes—400, 1000, and 2000 (half of the samples were the cases)—were simulated. The familywise significance level {alpha} was set at 0.05. The simulations were repeated 2000 times for both the null and the alternative hypotheses. The permutation tests for an evaluation of significance level per test were performed 1000 times for each simulation.

Analysis of real data:

We applied PENHAPLO to analyze the real data from patients with rheumatoid arthritis (RA) with amyloidosis and non-RA healthy subjects (MORIGUCHI et al. 2001). Since the cases (amyloidosis) and controls (non-RA) were selected separately, the study design is considered to be a case–control study. In the previous study (MORIGUCHI et al. 2001), five SNPs of the SAA1 gene were obtained for 44 RA patients with amyloidosis (cases) and for 58 non-RA subjects (controls). Although one SNP in the SAA2 gene was obtained in the previous study, we did not use it since the locus was not in linkage disequilibrium with the other loci.


RESULTS

Simulation 1: type I errors under the null hypothesis:

Figure 3 shows the empirical type I error rates (the proportions of attempts with a test statistic that had fallen in the critical region) when the data simulated under the null hypothesis were analyzed by various methods at a fixed significance level of {alpha} = 0.05. The results obtained by the analysis using PENHAPLO were compared with those obtained by the analysis using 2 x 2 tables based on the known-phase data. The results were also compared with those obtained by the other three methods using the contingency tables based on the unknown-phase data. The type I error rates were calculated as the proportions of the attempts that yielded the values of the test statistic—that is, –2 log(generalized likelihood ratio) for PENHAPLO and {chi}2-values for the other methods—that were >3.841 (the value that yields a cumulative distribution function of 0.95 for the {chi}2-distribution with 1 d.f.). When the same simulated samples were analyzed, the empirical type I error rates obtained using PENHAPLO were consistent with those obtained using the known-phase data. However, the empirical type I error rates obtained using methods 1 and 2 were higher than those obtained using the known-phase data. Therefore, these methods underestimate type I error rates. A real type I error rate of 0.05 was obtained when the critical points were set at values >3.841. The values of the critical points depend on the number of samples and the haplotype frequency in the population. For example, the critical values for SAA data with 200 samples of each case and control were 4.005 and 4.036 for methods 1 and 2, respectively.


Figure 3
Figure 3
View larger version (22K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Type I error rates obtained for simulation 1. The results were calculated using PENHAPLO, 2 x 2 tables based on known-phase data, method 1 (haplotype frequency comparison), method 2 (diplotype frequency comparison), and method 3 (subject partition). (a) The results for SAA; (b) those for ART. Statistical error bars were calculated for 1{sigma}. Lines are drawn for guidance.

 
In contrast, the empirical type I error rates obtained using method 3 were lower than those obtained using the known-phase data, which suggests that method 3 tends to overestimate the type I error rates (Figure 3). A real type I error rate of 0.05 was obtained when the critical points were set at values <3.841. In this case as well, the values of the critical points depend on the number of samples and the haplotype frequency in the population. When the data from SAA genes were analyzed, the ranges of the statistical errors (1{sigma}) for the type I error rates included the expected significance level of 0.05 for both PENHAPLO and the known-phase data; however, it was not included for the other three methods (Figure 3).

Simulation 1: odds ratio:

We then performed equivalent simulations under the alternative hypotheses. Simulations were performed at a fixed value of q = 0.2, whereas q+/q was varied from 1.0 to 2.0. The number of both cases and controls was fixed at 200. Tables 2 and 3 show the means and standard deviations of the estimated odds ratios for the SAA and ART data, respectively. For both SAA gene and ART data, the discrepancies between the means of the estimated odds ratio calculated using PENHAPLO and those calculated using the known-phase data were <2%. The discrepancies between the true values and the means estimated using the known-phase data decreased with increasing sample size. Method 1 underestimated the odds ratios: the means of the estimated odds were significantly lower than those calculated using known-phase data (Tables 2 and 3). The discrepancies were >30% at the maximum. The maximum discrepancies occurred at a maximum odds ratio of ~2.7. The discrepancies between the means estimated using either method 2 or method 3 and those calculated using the known-phase data were within 5% for SAA data (Table 2). The discrepancies increased with increasing odds ratio. For ART data, method 3 underestimated the odds ratios by >30% (when the means were compared) at the maximum odds ratio (Table 3).


View this table:
In this window
In a new window

 
TABLE 2

The results of simultaion 1: estimated mean and standard errors of the odds ratios for SAA

 

View this table:
In this window
In a new window

 
TABLE 3

The results of simulation 1: estimated mean and standard errors of the odds ratios for ART

 

Simulation 2: power calculation:

The powers and type I error rates are shown in Table 4.


View this table:
In this window
In a new window

 
TABLE 4

Powers and type I error rates for simulation 2

 
The familywise type I error rates estimated using PENHAPLO were almost equal to the significance level {alpha} = 0.05. Although ~270 tests (~90 incomplete haplotypes times and three modes of inheritance) were performed in each simulation analyzed using PENHAPLO, it was shown that the familywise type I error rates did not inflate when the P-values per test were compared with the significance levels per test evaluated by the permutation test.

For the alternative hypothesis, the powers analyzed using PENHAPLO were ~1.1–1.3 times higher than those analyzed using method 4, which was based on the comparison of haplotype frequencies.

Analysis of real data:

Finally, we applied PENHAPLO to the real data for the case–control study of amyloidosis (MORIGUCHI et al. 2001). The details of the five SNPs in the SAA1 gene are shown in Table 5. All the incomplete haplotypes constructed by three htSNPs (loci 3, 4, and 6) were tested for the association in dominant, recessive, and general modes. Table 6 shows the results whose P-values were smaller than the significance level per test of 3.10 x 10–3. The results did not change significantly even when additional data were included after our previous publication (MORIGUCHI et al. 2001).


View this table:
In this window
In a new window

 
TABLE 5

SNP sites of SAA1 and SAA2

 

View this table:
In this window
In a new window

 
TABLE 6

The incomplete haplotypes and modes of inheritance that were significant in tests for association with amyloidosis

 
The incomplete haplotype showing the strongest association with AA amyloidosis was *T*** in the general relative risk model. This incomplete haplotype actually denotes a SNP at –13T/C of the SAA1 gene. The result suggested that –13T was a risk factor for the development of AA amyloidosis. The results are consistent with previous works by MORIGUCHI et al. (2001) and YAMADA et al. (2003).

Among 39 tests, 8 were significant for the association between AA amyloidosis and the haplotypes. The significant results were divided into two groups. One consisted of the incomplete haplotypes whose major complete haplotype was CTGCC, and the other consisted of those having CCGTC at the highest frequency. The number of members in both groups was four. A careful inspection of the results showed that the group CTGCC serves as a risk, while the group CCGTC serves as an inhibitory factor in the development of AA amyloidosis.

The haplotype CTGCC was the most frequent haplotype observed in the controls, followed by CCGTC, which was the second most frequent haplotype. When more than one major haplotype with contrary effects are detected in a case–control study, the results should be interpreted with caution. In a case–control study, an increase in a disease-related haplotype causes a decrease in the relative frequencies of other haplotypes, in cases. If a minor haplotype is related to the disease and the penetrance is low, then an increase in the risk haplotype in cases does not significantly change the relative frequencies of the major haplotypes. However, if a major haplotype is associated with a risk, the effect may not be so small. As a result, either an increase in a risk haplotype or a decrease in the other major haplotypes (or both) might be detected if the power is sufficiently large. A comparison of the relative frequencies of three major haplotypes (≥10% in controls) between cases and controls revealed that the relative frequency of CTGCC was higher in the cases than in the controls, while CCGTC was observed less frequently in the cases than in the controls. On the other hand, the relative frequency of GCGCT in the controls (third highest) was almost the same between the cases and controls. These results may suggest that the decrease of CCGTC in the cases was caused by the increase of CTGCC. However, a possibility remains that the effect of the haplotype CCGTC is independent of the effect of –13T.


DISCUSSION
Earlier, we developed an algorithm, for implementation in PENHAPLO, for testing the association between phenotypes and haplotypes using genotype and phenotype data from either cohort studies or clinical trials. The algorithm simultaneously estimates penetrances on the basis of diplotype configurations and haplotype frequencies and tests the association between haplotypes and phenotypes. Since the penetrances can be estimated using data from cohort studies and clinical trials, PENHAPLO can be applied to such studies without major problems. In our study, we investigated the validity of the application of the algorithm implemented in PENHAPLO to case–control studies by simulations as well as by real data.

It was found that PENHAPLO is useful not only for the analysis of the data from cohort studies and clinical trials but also for the analysis of data from case–control studies with regard to the association test between haplotypes and qualitative phenotypes. However, when applying PENHAPLO to case–control studies, some estimated parameters are different from those estimated by the same algorithm when the data from either cohort studies or clinical trials are used. In the case of data from case–control studies, the parameters r+ and r, which are equivalent to the penetrances q+ and q for cohort studies and clinical trials, are sensitive to selection bias. The odds ratios (the ratios of the proportion of the number of subjects with D+ to the number of subjects with D in cases to the same proportion in controls) are expressed as a function of r+ and r. In addition, if the affection rate is known, the penetrances are expressed as a function of the affection rate, r+ and r.

Simulation studies have shown that PENHAPLO is a robust method for the analysis of case–control studies. Type I error rates were almost exactly as expected when data were generated under the null hypothesis. The means of the odds ratios estimated using PENHAPLO for the simulation converged to the true value in the population as the number of cases and controls increased. We compared the results obtained using PENHAPLO with those obtained using known-phase data. The type I error rate and the means of the odds ratios obtained using PENHAPLO were comparable to those analyzed using known-phase data.

Our results have shown that the odds ratios can be estimated from the combined case and control data by applying the method for the estimation of penetrances from the cohort data. The use of haplotypes inferred separately in cases and controls or inferred using combined data underestimates the odds ratio. These discrepancies increased with the decreasing strength of linkage disequilibrium and with increasing odds ratios. When an association test was performed using these alternative methods, the type I error rate under the null hypothesis did not correspond with the significance level.

Simulation studies have also shown that the power of PENHAPLO was higher than that of the alternative method.

The results of the analysis of the data from the case–control sample of patients with RA and healthy subjects were of interest. They support the previously proposed hypotheses for the association of SAA genes with AA amyloidosis. BABA et al. (1995) proposed that two missense base substitutions are related to AA amyloidosis; however, MORIGUCHI et al. (2001) and YAMADA et al. (2003) proposed that –13T/C substitutions in the 5'-flanking sequence are responsible. The results of the analysis by PENHAPLO using the concept of incomplete haplotypes suggest that both the hypotheses might hold.

In conclusion, the likelihood-based algorithm PENHAPLO can be applied not only to cohort studies and clinical trials but also to case–control studies for testing the association between haplotypes and phenotypes. However, the parameters estimated as penetrances in cohort studies are not the true penetrances in case–control studies. Rather, an odds ratio is obtained from these parameters, and the penetrances can be calculated if the affection rate is available. Although the haplotypes are not in HWE in both cases and controls, the results showed that the effect of departure from HWE did not significantly affect the results, particularly when the loci were in tight linkage disequilibrium.

This work was supported by grants from the New Energy and Industrial Technology Development Organization.


APPENDIX: PARAMETERS ESTIMATED IN A CASE–CONTROL STUDY
In a cohort study, the probabilities of a subject developing a disease when the diplotype configuration is or is not a member of D+ are shown in Table A1. Using the haplotype frequencies {Theta} in the population, the relative frequency Formula 5 of the jth diplotype configuration Dj in the population is expressed, assuming the HWE as

Formula 5
where {theta}l and {theta}m are relative frequencies of the lth and mth haplotypes, respectively, that constitute Dj. A diplotype configuration Dj with j = 1, ... , k is a member of D+, while Dj with j = k + 1, ... , L2 is a nonmember of D+. The prevalence of disease {lambda} is calculated using the penetrances q+, q, and Formula 5 as

Formula A1(A1)


View this table:
In this window
In a new window

 
TABLE A1

Proportions of the subjects categorized by the affection status and diplotype configurations

 
In a case–control study, suppose that cases and controls are sampled from affected and not-affected subjects in the population with the sample rates pa and pn, respectively. The probabilities of observing a case or a control with a diplotype configuration Dk in all samples of cases and controls are expressed in Table A2. The ratio s of the number of cases to the total number of subjects in a case–control study is expressed as

Formula A2(A2)


View this table:
In this window
In a new window

 
TABLE A2

Proportion of the subjects with different diplotype configurations in cases and controls among the combined case and control sample

 
Thus, the following equation is obtained:

Formula A3(A3)

The ratio s is also obtained as the summation of the probabilities of the observed cases over all the possible diplotype configurations:

Formula A4(A4)

Correspondingly, the ratio of the number of controls to the total number of subjects, 1 – s, is obtained as the summation of the probabilities of the observed controls across all the possible diplotype configurations as follows:

Formula A5(A5)

In a case–control study, s is known; however, pa and pn are unknown. r+ and r are defined as the selection biases that express the difference in the relative frequencies of cases and controls with the diplotype configuration Dk isin D+ and Dk {notin} D+, respectively, as

Formula A6(A6)

Formula A7(A7)

Substituting (A3) for (A6) yields

Formula A8(A8)

Instead of q+, r+ is estimated when PENHAPLO is applied to analyze the data from a case–control study. Therefore, if the prevalence {lambda} is known and r+ is estimated using PENHAPLO, the penetrance q+ is obtained as follows:

Formula A9(A9)

In the same manner, q is obtained as

Formula A10(A10)

The odds ratio is expressed as

Formula A10

Substituting (A6) and (A7) in the above equation yields

Formula A11(A11)

The likelihood-ratio test used in PENHAPLO is based on the following null and alternative hypotheses for a cohort study:

Formula A11

When PENHAPLO is used to analyze a case–control study, the null and alternative hypotheses are H0: r+ = s and H1: r+ != s that correspond to

Formula A11

Since r+ and r are not independent parameters, r+ = r holds only if r+ = s [see (A4), (A5), (A9), and (A10)]; i.e., OR = 1 [see (A11)].


FOOTNOTES
1 Present address: Research Center for Advanced Science and Technology, Mitsubishi Research Institute, 2-3-6 Otemachi, Chiyoda-ku, Tokyo 100-8141, Japan. Back


LITERATURE CITED

BABA, S., S. A. MASAGO, T. TAKAHASHI, T. KASAMA, H. SUGIMURA et al., 1995 A novel allelic variant of serum amyloid a, saa1 gamma: genomic evidence, evolution, frequency, and implication as a risk factor for reactive systemic aa-amyloidosis. Hum. Mol. Genet. 4: 1083–1087.[Abstract/Free Full Text]

EPSTEIN, M. P., and G. A. SATTEN, 2003 Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 73: 1316–1329.[CrossRef][Medline]

FALLIN, D., A. COHEN, L. ESSIOUX, I. CHUMAKOV, M. BLUMENFELD et al., 2001 Genetic analysis of case/control data using estimated haplotype frequencies: application to apoe locus variation and Alzheimer's disease. Genome Res. 11: 143–151.[Abstract/Free Full Text]

HODGKINSON, C. A., D. GOLDMAN, J. JAEGER, S. PERSAUD, J. M. KANE et al., 2004 Disrupted in schizophrenia 1 (disc1): association with schizophrenia, schizoaffective disorder, and bipolar disorder. Am. J. Hum. Genet. 75: 862–872.[CrossRef][Medline]

HORIKAWA, Y., N. ODA, N. J. COX, X. LI, M. ORHO-MELANDER et al., 2000 Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat. Genet. 26: 163–175.[CrossRef][Medline]

ITO, T., E. INOUE and N. KAMATANI, 2004 Association test algorithm between individual phenotype and a haplotype using simultaneous estimation of haplotype frequencies, diplotype configurations and diplotype-based penetrances. Genetics 168: 2339–2348.[Abstract/Free Full Text]

KAMATANI, N., A. SEKINE, T. KITAMOTO, A. IIDA, S. SAITO et al., 2004 Large-scale single-nucleotide polymorphism (snp) and haplotype analyses, using dense snp maps, of 199 drug-related genes in 752 subjects: the analysis of the association between uncommon snps within haplotype blocks and the haplotypes constructed with haplotype-tagging snps. Am. J. Hum. Genet. 75: 190–203.[CrossRef][Medline]

KAWAGUCHI, Y., A. TOCHIMOTO, N. ICHIKAWA, M. HARIGAI, M. HARA et al., 2003 Association of il1a gene polymorphisms with susceptibility to and severity of systemic sclerosis in the Japanese population. Arthritis Rheum. 48: 186–192.[CrossRef][Medline]

KITAMURA, Y., M. MORIGUCHI, H. KANEKO, H. MORISAKI, T. MORISAKI et al., 2002 Determination of probability distribution of diplotype configuration (diplotype distribution) for each subject from genotypic data using the em algorithm. Ann. Hum. Genet. 66: 183–193.[CrossRef][Medline]

MORIGUCHI, M., C. TERAI, H. KANEKO, Y. KOSEKI, H. KAJIYAMA et al., 2001 A novel single-nucleotide polymorphism at the 5'-flanking region of SAA1 associated with risk of type aa amyloidosis secondary to rheumatoid arthritis. Arthritis Rheum. 44: 1266–1272.[CrossRef][Medline]

NIU, T., Z. S. QIN, X. XU and J. LIU, 2002 Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70: 157–169.[CrossRef][Medline]

QIN, Z., T. NIU and J. LIU, 2002 Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 71: 1242–1247.[CrossRef][Medline]

SCHAID, D. J., C. M. ROWLAND, D. E. TINES, R. M. JACOBSON and G. A. POLAND, 2002 Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70: 425–434.[CrossRef][Medline]

STEPHENS, M., and P. DONNELLY, 2003 A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73: 1162–1169.[CrossRef][Medline]

STEPHENS, M., N. SMITH and P. DONNELLY, 2001 A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978–989.[CrossRef][Medline]

STRAM, D. O., C. LEIGH PEARCE, P. BRETSKY, M. FREEDMAN, J. N. HIRSCHHORN et al., 2003 Modeling and e-m estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum. Hered. 55: 179–190.[CrossRef][Medline]

SUGIURA, T., Y. KAWAGUCHI, M. HARIGAI, H. TERAJIMA-ICHIDA, Y. KITAMURA et al., 2002 Association between adult-onset Still's disease and interleukin-18 gene polymorphisms. Genes Immun. 3: 394–399.[CrossRef][Medline]

TANAKA, E., A. TANIGUCHI, W. URANO, H. NAKAJIMA, Y. MATSUDA et al., 2002 Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the n-acetyltransferase 2 gene. J. Rheumatol. 29: 2492–2499.[Medline]

URANO, W., A. TANIGUCHI, H. YAMANAKA, E. TANAKA, H. NAKAJIMA et al., 2002 Polymorphisms in the methylenetetrahydrofolate reductase gene were associated with both the efficacy and the toxicity of methotrexate used for the treatment of rheumatoid arthritis, as evidenced by single locus and haplotype analyses. Pharmacogenetics 12: 183–190.[CrossRef][Medline]

WESTFALL, P. H., and S. S. YOUNG, 1993 Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment (Wiley Series in Probability and Statistics). John Wiley & Sons, New York.

YAMADA, T., Y. OKUDA, K. TAKASUGI, L. WANG, D. MARKS et al., 2003 An allele of serum amyloid a1 associated with amyloidosis in both Japanese and Caucasians. Amyloid 10: 7–11.[Medline]

ZAYKIN, D. V., P. H. WESTFALL, S. S. YOUNG, M. A. KARNOUB, M. J. WAGNER et al., 2002 Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum. Hered. 53: 79–91.[Medline]

Communicating editor: Y.-X. FU




This article has been cited by other articles:


Home page
J. Pharmacol. Exp. Ther.Home page
Y. Ji, I. Moon, J. Zlatkovic, O. E. Salavaggione, B. A. Thomae, B. W. Eckloff, E. D. Wieben, D. J. Schaid, and R. M. Weinshilboum
Human Hydroxysteroid Sulfotransferase SULT2B1 Pharmacogenomics: Gene Sequence Variation and Functional Genomics
J. Pharmacol. Exp. Ther., August 1, 2007; 322(2): 529 - 540.
[Abstract] [Full Text] [PDF]