Pooling Analysis of Genetic Data: The Association of Leptin Receptor (LEPR) Polymorphisms With Variables Related to Human Adiposity
M. Heo^a, R. L. Leibel^b, B. B. Boyer^c, W. K. Chung^b, M. Koulu^d, M. K. Karvonen^d, U. Pesonen^d, A. Rissanen^e, M. Laakso^f, M. I. J. Uusitupa^f, Y. Chagnon^g, C. Bouchard^h, P. A. Donohoue^i, T. L. Burns^j, A. R. Shuldiner^k, K. Silver^k, R. E. Andersen^l, O. Pedersen^l, S. Echwald^m, T. I. A. Sørensen^n, P. Behn^o, M. A. Permutt^p, K. B. Jacobs^q, R. C. Elston^q, D. J. Hoffman^a, and D. B. Allison^r
a New York Obesity Research Center, Columbia University College of Physicians and Surgeons, New York, New York 10025,
b Department of Pediatrics, Columbia University College of Physicians and Surgeons, New York, New York 10032,
c Department of Molecular Biology, University of Alaska, Fairbanks, Alaska 99775,
d Department of Pharmacology and Clinical Pharmacology, University of Turku, 20520 Turku, Finland,
e Eating Disorder Unit, University of Helsinki, 00014 Helsinki, Finland,
f Department of Medicine, University of Kuopio, 70210 Kuopio, Finland,
g Kinesiologie, Physical Activity Sciences Lab, Laval University, Ste-Foy, Quebec G1K7P4, Canada,
h Pennington Biomedical Research Center, Louisiana State University, Baton Rouge, Louisiana 70808,
i Department of Pediatrics, University of Iowa College of Medicine, Iowa City, Iowa 52242,
j Department of Preventive Medicine, University of Iowa College of Public Health, Iowa City, Iowa 52242,
k Division of Endocrinology, University of Maryland, Baltimore, Maryland 21224,
l Department of Geriatrics, Johns Hopkins University, Baltimore, Maryland 21224,
m Steno Diabetes Center, Hagedorn Research Institute, DK-2820 Gentofte, Denmark,
n Danish Epidemiology Science Center, Institute of Preventive Medicine, Copenhagen 1399, Denmark,
o Internal Medicine Associates, Bloomington, Indiana 47403,
p Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri 63110,
q Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
r Department of Biostatistics, University of Alabama, Birmingham, Alabama 35294-0022
Corresponding author: D. B. Allison, Department of Biostatistics, School of Public Health, University of Alabama, RPHB 327M, 1530 Third Ave. S., Birmingham, AL 35294-0022. E-mail: dallison{at}ms.soph.uab.edu
Communicating editor: C. HALEY
| ABSTRACT |
|---|
Analysis of raw pooled data from distinct studies of a single question generates a single statistical conclusion with greater power and precision than conventional metaanalysis based on within-study estimates. However, conducting analyses with pooled genetic data, in particular, is a daunting task that raises important statistical issues. In the process of analyzing data pooled from nine studies on the human leptin receptor (LEPR) gene for the association of three alleles (K109R, Q223R, and K656N) of LEPR with body mass index (BMI; kilograms divided by the square of the height in meters) and waist circumference (WC), we encountered the following methodological challenges: data on relatives, missing data, multivariate analysis, multiallele analysis at multiple loci, heterogeneity, and epistasis. We propose herein statistical methods and procedures to deal with such issues. With a total of 3263 related and unrelated subjects from diverse ethnic backgrounds such as African-American, Caucasian, Danish, Finnish, French-Canadian, and Nigerian, we tested effects of individual alleles; joint effects of alleles at multiple loci; epistatic effects among alleles at different loci; effect modification by age, sex, diabetes, and ethnicity; and pleiotropic genotype effects on BMI and WC. The statistical methodologies were applied, before and after multiple imputation of missing observations, to pooled data as well as to individual data sets for estimates from each study, the latter leading to a metaanalysis. The results from the metaanalysis and the pooling analysis showed that none of the effects were significant at the 0.05 level of significance. Heterogeneity tests showed that the variations of the nonsignificant effects are within the range of sampling variation. 
Although certain genotypic effects could be population specific, there was no statistically compelling evidence that any of the three LEPR alleles is associated with BMI or waist circumference in the general population.
When many studies on the same topic differ in their statistical inferences and conclusions, combining the information from these separate studies, by either metaanalysis or raw data pooling, can enhance statistical power. The primary advantages of such analyses include (1) reduction of type I errors by consolidating many tests of the same hypothesis with many samples into a single test with one pooled sample; (2) increased statistical power; and (3) direct tests of heterogeneity among samples/populations.
In genetic studies, analysis of pooled data can be especially challenging even if all studies investigated relationships between genotypes at the same gene with the same phenotypes. Genetic data sets usually consist of individuals related to other individuals, resulting in correlated observations. Therefore, methodologies accounting for such correlated observations should be employed. Typically, familial correlations depend upon the degree of relationship between pedigree members. With respect to genotypes, multiple alleles across multiple loci, or markers, within the same gene may be of interest, leading to the need to evaluate multilocus analyses for direct and epistatic effects, i.e., interactions among the multiple alleles. Interactions among genotypes and covariates (i.e., gene-by-environment interaction or effect modification) should also be considered in modeling; the use of appropriate covariates can define more precisely the effects of individual alleles. Pleiotropic (either relational or mosaic) effects of genotypes are also of interest and can be investigated through multivariate analysis. Finally, missing observations on genotypes and other variables are not uncommon.
Codes for the same genotypes and discrete variables are often different from study to study. In addition, there may be increased numbers of missing observations in pooled data because of different lists of covariates and alleles. The number of members in each pedigree is usually different within and among studies. The configurations of degrees of relationship among pedigree members are not the same over the different pedigrees. These characteristics of the pooled data require statistical modeling, allowing flexible construction of the residual covariance matrix. Moreover, the number of independent variables in a model can be very large when all the main effects and interaction effects are included. Heterogeneity of samples (e.g., in demographic characteristics) among the different studies is also a concern and necessitates analysis adjusted for study profiles or study effects. To our knowledge, neither the guidelines nor the statistical software for handling such methodological and practical issues are currently well developed in the genetics literature, though some guidelines do exist in other contexts.
In this article, we illustrate these issues and demonstrate some appropriate (and in some cases ad hoc) statistical methodologies and procedures. We used pooled raw data on body mass index (BMI; measured as the weight in kilograms divided by the square of the height in meters) and waist circumference (WC) and the three amino acid substitutions (the polymorphisms or the allelic variants) K109R at exon 2, Q223R at exon 4, and K656N at exon 12 in the leptin receptor (LEPR) gene on human chromosome 1p. In the rest of this article, to denote any variant at a particular exon, we use the word "allele" rather than "polymorphism."
Interindividual variation in human BMI, direct measures of fat mass, and fat distribution has been clearly shown to have a strong genetic component.
| SUBJECTS |
|---|
A total of 3263 individuals were included in this study (Table 1). Sixty-two percent of these individuals are related to one or more subjects in the data set. The largest number of generations among the family pedigrees in the pooled data was two. Descriptive statistics are presented in Table 2 along with percentages of missing observations. The subjects are ethnically diverse, i.e., African-American, Caucasian, Danish, Finnish, French-Canadian, and Nigerian. Approximately half are female (Table 2).
| STATISTICAL ANALYSIS |
|---|
General model:
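In general, genetic models can be represented in the following form:
$$f(\text{phenotype(s)}) = \text{genetic effects} + \text{covariate effects} + \text{interaction effects} + \text{error}$$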
The function f of the phenotype(s) depends upon the model implemented. For example, f(phenotype(s)) is the squared difference of phenotypes of sibling pairs in the Haseman-Elston regression. The phenotypes can be univariate or multivariate. Genetic effects can be random as in variance components analysis or fixed as in usual association studies. Covariates are discrete and/or continuous, and their coefficients are usually fixed, although modeling random coefficients is possible. The model becomes a mixed-effects model when fixed and random effects are simultaneously included. Interaction effects can be among genetic effects (i.e., locus-by-locus interaction, epistasis), among main covariates, and between genetic and main covariates (i.e., locus-by-environment interactions). The expectation of the vector error term is zero but its covariance may not necessarily be in the form of
$\sigma^2 I$, where $\sigma^2$ is a common error variance, I is the n-by-n identity matrix, and n is the number of subjects under study.
Preliminary analysis:
The main objective of the preliminary analysis was to test the association between alleles at each exon and BMI and WC. We analyzed BMI and WC separately for each exon. The main covariates were sex and age, and no interaction effects were modeled. No missing data imputation was employed in this preliminary analysis. Its purpose was to obtain a crude overview of the overall association.
For the association analysis, we utilized the ASSOC routine in the S.A.G.E. (1997) software.
We also analyzed the proportion of alleles shared identical-in-state (IIS) by sibling pairs because the alleles at the three exons are of interest. Specifically, we regressed the square of the phenotypic difference between the sibling pairs (cf. the Haseman-Elston procedure) on the proportion of alleles shared IIS.
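The sibling-pair regression described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `iis_proportion` and `ols_slope`, the genotype encoding as a pair of allele labels, and the toy numbers are all assumptions made for the example.

```python
def iis_proportion(genotype_1, genotype_2):
    """Proportion of alleles shared identical-in-state (IIS) by two genotypes.

    Each genotype is a pair of allele labels, e.g. ("K", "R").
    The result is 0.0, 0.5, or 1.0.
    """
    remaining = list(genotype_2)
    shared = 0
    for allele in genotype_1:
        if allele in remaining:
            remaining.remove(allele)  # each allele can be matched only once
            shared += 1
    return shared / 2


def ols_slope(xs, ys):
    """Slope of the simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical sibling pairs: (genotype of sib 1, genotype of sib 2, BMI 1, BMI 2)
pairs = [
    (("K", "K"), ("R", "R"), 22.0, 30.0),
    (("K", "R"), ("K", "K"), 25.0, 29.0),
    (("K", "R"), ("R", "K"), 27.0, 28.0),
]
sharing = [iis_proportion(g1, g2) for g1, g2, _, _ in pairs]
squared_diff = [(b1 - b2) ** 2 for _, _, b1, b2 in pairs]
slope = ols_slope(sharing, squared_diff)
```

In a Haseman-Elston type regression, a negative slope (siblings who share more alleles being more alike) is the pattern consistent with association or linkage; the significance test and the handling of multiple pairs per sibship are omitted from this sketch.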
For this preliminary analysis we took two methodological approaches: metaanalysis and pooling analysis. For the former, we applied the association analyses described above to each data set from each study and then "metaanalyzed" the results, followed by heterogeneity analysis.
Main analysis:
The objectives of the main analysis were to test the simultaneous effects of alleles in multiple exons, to test the epistatic effects of alleles in these exons, and to test effect modification by main covariates such as age, sex, diabetes, and ethnicity. BMI and WC were modeled as separate univariate phenotypes and as a combined multivariate phenotype. All exons were simultaneously included in all models. The main covariates included continuous age polynomial variables up to the third degree (i.e., age, age², age³). In addition, discrete variables (ethnicity, diabetes, sex, and study effects) were dummy coded. The powers of age were included because the range of age was wide (3 to 94 years), and BMI and WC are nonlinearly associated with age over that range, especially in children.
Two- and three-way interaction effects among genotypes and two-way interaction effects between genotype and main covariates were included in the analysis. However, with regard to two-way interaction effects, age² and age³ were excluded to limit the number of independent variables in the models. The study effects were included only as main covariates without any related interaction terms. We used the S.A.G.E. ASSOC routine, fitted ordinary least squares (OLS) regression models, and conducted general linear model (GLM) multivariate analyses for the main analyses. In this main analysis, only pooling analyses were conducted. In the following section, the theoretical and practical issues that arose in conducting the main analysis are described.
Multiple imputation for missing values:
There were many missing observations for the phenotypes and covariates, e.g., diabetic status (Table 2), and for genotypes at exon 2 in particular (Table 3). Because deletion of missing values (e.g., list-wise deletion) from analyses can introduce biases and inefficient use of collected data, we applied multiple imputation (MI) to the missing observations.
The MI method is appealing because it accommodates variation due to random imputation in the inference procedure (note that single imputed values are not true observations). In principle, more imputations provide for better inference with increased accuracy. Practically, however, three to five imputations appear satisfactory in terms of efficiency of estimation even with 50% observations missing. The efficiency of an estimate based on m imputations, relative to one based on an infinite number of imputations, is approximately $(1 + \gamma/m)^{-1}$, where $\gamma$ is the fraction of missing information and m is the number of imputations; with m = 5, for example, the efficiency is about 91% when $\gamma$ = 0.5, i.e., 50% observations missing. Herein the missing observations were imputed five times.
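The efficiency argument can be checked numerically. The sketch below assumes that the expression referred to in the text is Rubin's standard approximation, (1 + γ/m)^(-1), for the efficiency of m imputations relative to infinitely many; the function name `relative_efficiency` is introduced only for this example.

```python
def relative_efficiency(gamma, m):
    """Approximate efficiency of an estimate based on m imputations,
    relative to an infinite number of imputations: 1 / (1 + gamma / m).

    gamma is the fraction of missing information (between 0 and 1).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return 1.0 / (1.0 + gamma / m)


# Efficiency for a fraction of missing information of 0.5
for m in (1, 3, 5, 10):
    print(m, round(relative_efficiency(0.5, m), 3))
```

With γ = 0.5 the gain from going beyond five imputations is small (about 0.91 at m = 5 versus about 0.95 at m = 10), which is the practical reason a handful of imputations is usually considered adequate.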
Random values from the normal distribution can be imputed for missing values of continuous variables. For missing discrete categorical values, however, the categorical values nearest to the randomly generated continuous values can be imputed. This imputation method for the categorical variables is acceptable in general even under normality assumptions.
Practical issues and ad hoc methods: The interaction terms and dummy codes for genotypes and discrete variables created models with many independent variables, making interpretation of results difficult. Therefore, we applied OLS backward elimination to select significant variables in the presence of genotype variables. In doing so, we first "stacked" the five imputed pooled data sets into one single data set and then assigned a weight of one-fifth to every data point in the stacked data set, so that each original nonmissing value, which appears five times in the stack, carries a total weight of 1 and each imputed value carries a weight of 1/5. By "stack" we mean accumulation of (the five rectangular) missing-imputed data sets by case. This procedure still does not account for familial correlations or variations due to MIs and therefore does not produce unbiased standard errors. However, this ad hoc method can help identify significant contributors to specific models; the OLS method provides unbiased coefficient estimates and reasonably good compatibility for testing compared to its counterpart that adjusts for correlations (M. A. PROVINCE, T. RICE and D. C. RAO, unpublished results). Therefore, for subsequent statistical analyses we took residualized BMI and WC resulting from OLS regression on some "significant" variables retained from the backward elimination using the stacked data set.
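The stacking and weighting step can be illustrated with a small sketch. It is not the authors' code: the helper names `stack_imputations` and `weighted_line_fit`, the record layout, and the toy data are hypothetical, and a single-predictor weighted least-squares fit stands in for the full OLS models with many covariates.

```python
def stack_imputations(imputed_sets):
    """Stack m imputed data sets by case and attach a weight of 1/m to every row."""
    m = len(imputed_sets)
    return [(row, 1.0 / m) for data in imputed_sets for row in data]


def weighted_line_fit(stacked, x_key, y_key):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total_w = sum(w for _, w in stacked)
    mean_x = sum(w * row[x_key] for row, w in stacked) / total_w
    mean_y = sum(w * row[y_key] for row, w in stacked) / total_w
    sxx = sum(w * (row[x_key] - mean_x) ** 2 for row, w in stacked)
    sxy = sum(w * (row[x_key] - mean_x) * (row[y_key] - mean_y)
              for row, w in stacked)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical example with m = 2 imputations: subject 3 has a missing
# genotype count "g" that is imputed as 2 in one set and 1 in the other.
observed = [{"id": 1, "g": 0, "bmi": 24.0}, {"id": 2, "g": 1, "bmi": 27.0}]
set_1 = observed + [{"id": 3, "g": 2, "bmi": 31.0}]
set_2 = observed + [{"id": 3, "g": 1, "bmi": 31.0}]

stacked = stack_imputations([set_1, set_2])
intercept, slope = weighted_line_fit(stacked, "g", "bmi")
```

Because every subject appears once per imputed set, the weights sum to 1 per subject, so the stacked data set has the same effective sample size as the original one. As the text notes, the point estimates from such a fit are usable for screening, but its standard errors ignore both familial correlation and between-imputation variation.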
Although the S.A.G.E. ASSOC routine was developed to allow familial correlations and to produce consistent standard error estimates, it has several practical limitations in terms of the number of interactions that can be included. However, one can include more than one locus by coding the other loci as covariates since ASSOC allows multiple covariates to be included in the model. ASSOC provides the estimated coefficients for each covariate and their standard errors. Using these statistics, we can test the effect of the additional loci by a Wald-type test. The choice of which locus is utilized as a genetic locus ("marker" in the S.A.G.E. ASSOC terminology) and which loci are treated as covariates should not affect the inferences about genotypic effects at each locus because the Wald (for the covariates) and likelihood-ratio tests (for the genetic locus, the marker) are asymptotically equivalent. On the other hand, most programs for OLS regression are easy and flexible to run and produce unbiased point estimates, but they produce biased estimates of standard errors when observations are correlated. To adjust the bias of the standard errors of the OLS regression, we calculated a "correction factor": the ratio of the average estimated variance of the point estimates from S.A.G.E. ASSOC to the average estimated variance from the OLS regression. This correction allowed flexible use of OLS in various model-fitting procedures in the presence of correlations among observations. We also tried the OLS approach with the stacked data, using weights of 1/5, to compare the estimates and P values with those obtained from the MI inference described in Appendix B.
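The correction factor defined above is a simple ratio of averaged variances. The sketch below computes it from two lists of standard errors; the function names and the input values are hypothetical. How the factor was applied is not spelled out in the text, so multiplying each OLS variance by the factor (equivalently, each standard error by its square root) is an assumption of this example.

```python
import math


def correction_factor(adjusted_se, ols_se):
    """Ratio of the average estimated variance from a correlation-adjusted
    fit (e.g., S.A.G.E. ASSOC) to the average estimated variance from OLS."""
    mean_var_adjusted = sum(se ** 2 for se in adjusted_se) / len(adjusted_se)
    mean_var_ols = sum(se ** 2 for se in ols_se) / len(ols_se)
    return mean_var_adjusted / mean_var_ols


def corrected_standard_errors(ols_se, factor):
    """Rescale OLS standard errors (assumed use of the factor:
    variance times factor, i.e., standard error times sqrt(factor))."""
    return [se * math.sqrt(factor) for se in ols_se]


# Hypothetical standard errors for the same coefficients from the two fits
factor = correction_factor(adjusted_se=[0.40, 0.55], ols_se=[0.45, 0.60])
rescaled = corrected_standard_errors([0.45, 0.60, 0.52], factor)
```

A factor below 1, as in this toy input, shrinks the OLS standard errors; a factor above 1 would inflate them.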
With respect to multivariate analysis for testing pleiotropic effects, many approaches can be employed.
| RESULTS |
|---|
Descriptive statistics:
Demographic descriptive statistics are presented in Table 1 and Table 2 along with percentages of missing observations. The range of age is large, as are the ranges of BMI and WC. The estimated allele frequencies at exons 2, 4, and 12 were 0.23, 0.48, and 0.20, respectively (Table 3). Departure from Hardy-Weinberg equilibrium (HWE) was assessed with a maximum-likelihood (ML) test.
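For readers unfamiliar with HWE testing, the sketch below shows the idea with the familiar Pearson chi-square goodness-of-fit test for a diallelic locus in unrelated subjects. This is deliberately a simpler, swapped-in technique and not the ML test used in the article; the function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg proportions for a
    diallelic locus, given genotype counts in unrelated subjects.

    Returns (frequency of allele a, chi-square statistic, P value on 1 d.f.).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency by gene counting
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("locus is monomorphic; the test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


# Hypothetical genotype counts
freq, chi2, p_value = hwe_chi_square(n_aa=60, n_ab=330, n_bb=610)
```

The test assumes independent observations, so with pedigree data such as those pooled here it would need to be restricted to unrelated individuals or replaced by a method that accounts for relatedness.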
Table 3 also presents genotype-specific mean BMI and WC by locus. Within each exon, the mean values are almost the same before or after imputation. However, before imputation, overall means of BMI and WC at exon 2 are smaller than those at the other exons because the genotypes at exon 2 in the Baltimore study (Table 1) are all missing and Baltimore study subjects have relatively larger BMI and WC compared to subjects in the other studies. After imputation, on the other hand, the overall mean values become almost the same across the exons. These results, unadjusted for covariates or familial correlations, provide a preliminary view of the association between the phenotypes and genotypes.
Preliminary analysis:
Results of the ASSOC analyses are presented in terms of differences of the estimated effects of the two genotypes, "wild-type" homozygote and heterozygote, on BMI and WC, separately, from those of the "mutant" homozygous genotypes (Table 4) after adjusting for age and sex. For example, subjects heterozygous (K109R) for the exon 2 allele had a meta-estimate effect size of 0.03 on BMI when compared to subjects with K109K genotype. No single effect was significant from either the individual studies or the metaanalysis (Table 4). The number of subjects for each estimate is presented in Table 5, which also shows P values to assess the significance of the phenotypic variation due to genotypic variation, by exon. The calculation of these P values was based on the contribution of the genotypic variation to the likelihood. The effects were again not significant for either phenotype for any exon from individual studies or from the metaanalysis. Table 6 presents results of the sibling-based permutation association test from individual studies, from the metaanalysis, and from the pooling analysis. In this sibling-based permutation association test, only studies with related subjects were analyzed, because the test requires siblings within each family. Table 6 also shows absence of significant genotypic effects on the phenotypes.
Results of the regression on proportions of alleles shared IIS are presented in Table 7 and Table 8. These results were obtained from the regression of squared phenotype differences and the regression of grand-mean centered phenotype cross products, respectively. There was no statistical evidence that the alleles examined are associated with (or linked to) significant variation in either phenotype. However, the result of such IIS analysis of one study (Quebec family study) showed a significant linkage of the Q223R allele to BMI (P = 0.04), which is in agreement with the result of a previously reported IBD linkage analysis.
Heterogeneity analysis:
Heterogeneity analysis (Table 4, Table 7, and Table 8) showed that the nonsignificant results are consistent over the three loci and over all of the individual studies, which implies that the variation in effects over the studies is within the range of sampling variation.
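A common way to carry out this kind of metaanalysis and heterogeneity check is inverse-variance (fixed-effect) pooling with Cochran's Q statistic. Whether this is exactly the procedure used in the article is an assumption, since the method citation is not available here; the function name and the per-study numbers are hypothetical.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance pooled estimate with Cochran's Q heterogeneity statistic.

    Returns (pooled estimate, pooled standard error, Q, degrees of freedom).
    Under homogeneity, Q is approximately chi-square with k - 1 d.f.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q, len(estimates) - 1


# Hypothetical per-study genotype effects on BMI and their standard errors
pooled, pooled_se, q, df = fixed_effect_meta(
    estimates=[0.10, -0.25, 0.05, 0.30],
    std_errors=[0.40, 0.50, 0.35, 0.60],
)
```

A value of Q close to its degrees of freedom, as in this toy input, is what one expects when between-study differences are no larger than sampling variation.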
Main analysis:
Missing data imputation:
The results of missing data imputation are presented in Table 3 with a comparison of descriptive statistics before and after imputation, as presented earlier. Due to the "restricted" imputation adopted herein, 5% of subjects still have missing genotypes for at least one exon even after imputation. Mean BMIs and WCs over the genotypes at exon 2 were increased after imputation. However, frequencies of alleles at each exon changed little.
Backward elimination and effect modifications: Significance of various effect modifications was evaluated by testing for interaction effects among genotypes and main covariates. We tested such interaction effects by OLS regression with the weighted stacked data. Backward elimination (starting with a full model with main genotype effects, epistatic effects, i.e., interaction effects among genotypes, main covariate effects, and interactions among genotypes and covariates) was applied to identify significant interaction effects. The OLS backward elimination results are listed in Table 9.
Only the interaction effect between R109R at exon 2 and sex was significant for BMI, implying that male subjects with R109R genotype at exon 2 have significantly higher BMI than the other subjects. However, the contribution of this interaction effect to the variation of BMI is minimal (increase in R² < 0.01%). The nonsignificant allele-by-environment interaction effects suggest that the genotypic effects, if any, might not be modified by the main covariates such as diabetes, sex, age, and ethnicity. In terms of epistatic effects, however, subjects with K109R at exon 2 and N656N at exon 12 (see HT2HM12 in Table 9) appeared to have significantly higher BMIs. Age, sex, ethnicity, diabetes, and study effects are major contributors to the variation of BMI and WC. Therefore, we took the "residualized," or adjusted, BMI and WC for sex, diabetes, ethnicity, age polynomials, and study effects for the following analyses.
Joint effects of multiple alleles at different loci and epistasis: The results for BMI from the S.A.G.E. ASSOC routine are displayed in Table 10 with the five imputed pooled data sets. No single genotype at any exon has significant effect on BMI. The joint effects of all the genotypes at all exons are also not significant. These results are consistent with those from OLS regression without controlling for the familial correlations (Table 10). Similar results were observed (Table 11) with respect to WC. Interestingly, estimated coefficients, their standard errors, and P values for individual genotypes obtained from the OLS regressions after employing the estimation process described in Appendix B were almost exactly the same as those from the OLS regressions with the stacked data (with weights of 1/5; Table 10 and Table 11).
The correction factors from both BMI and WC analyses were, however, <1, which is counterintuitive, implying that the standard errors estimated without controlling for familial correlations are bigger than those that include such controlling. In regard to testing epistasis, i.e., two- and three-way interaction effects among genotypes over all exons, we applied the OLS regression to accommodate all of the interactions in a particular model. The results are displayed in terms of P values for simultaneous effect of such interactions in Table 12. This analysis shows that the epistatic effects among genotypes of the three exons are not jointly significant from the results of MI nor from the results of weighted OLS, regardless of the presence of the main genotype effects in models after adjusting for sex, diabetes, ethnicity, age polynomials, and study effects.
Multivariate analysis for pleiotropic effects: The results for testing the pleiotropic effects are shown in Table 13. Although P values vary with imputation, no main effects or interaction effects are significant. These results indicate that the polymorphisms do not have any statistically significant simultaneous effects on BMI and WC.
| DISCUSSION |
|---|
We presented practical and ad hoc statistical methodologies that can accommodate many of the challenges encountered in pooled genetic data analysis. Although data pooling may be ideal for enhancement of power of statistical inferences, it is a daunting task to manage and analyze pooled data, as we have seen so far. Even a task as seemingly simple as coordinating the pooling of different data sets by creating a coherent coding system and uniform variable names can prove to be time consuming. Developing a coherent coding system for the pedigree members is important and creating appropriate dummy family members is often required for application of software. Therefore, investigators planning a pooling study should be aware of the large amount of time required for data management before the pooled data are analyzed.
In addition, the availability of statistical software that can handle the analytic problems raised here is limited. This deficiency creates a particular type of problem because, even after all data management issues are resolved, analysts are sometimes forced to change software and/or write additional programs with specific computer languages to conduct appropriate and necessary procedures and analyses. For example, one statistical software package may be able to provide a multiple-imputed data set but not be able to generate the estimates from the imputed data using advanced statistical analytical approaches. To obtain such estimates, data analysts generally need to apply a different software package to the multiple-imputed data; it is important to note that some types of software are not compatible with all forms of electronic data. This emphasizes another consideration. While there are many forms of genetic data analysis software available, few are flexible enough to meet the rapidly increasing genetic statistical needs. For example, statistical methodologies and theories exist, such as GLM and generalized estimating equations, that can flexibly account for varying familial correlation matrices due to different pedigree structures. At the same time, to our knowledge, there are no statistical packages that can easily handle such variations.
As such, the procedures and analyses proposed herein should be an example for conducting a pooled analysis in the absence of "bona fide" methodologies, although some methodological issues are still in question. For example, the empirical correction factors obtained in this study were all <1 (Table 10 and Table 11), implying that the estimated standard errors from OLS methods are bigger than those from methods accounting for familial correlations. This result may initially seem to be counterintuitive, because treating correlated observations as if they were uncorrelated often yields smaller standard errors due to inflated information. However, appropriate control for correlation may yield smaller standard errors. For example, suppose that random variables X and Y are perfectly correlated [i.e., corr(X, Y) = 1] and X = Y + $\delta$ for some nonrandom $\delta$. Now, we have n pairs of observed X and Y. If we apply a paired t-test to testing the null hypothesis of $\delta$ = 0, the result will be significant no matter how small a nonzero $\delta$, because the estimated standard error of the estimate of $\delta$, the denominator of the test statistic, will be 0. But if we apply a two-sample t-test, which ignores the correlation, then the results will depend on the magnitude of $\delta$ and the number of subjects, because the estimated standard error will not be zero. However, the question of whether this reasoning also applies to the situation described in this article is still unanswered. If this were the case in general, P values based on OLS methods would provide only an upper limit of "true" P values. Thus, the OLS-based P values may not provide a reasonable conclusion about hypothesis tests unless the OLS P values are large. Therefore, when the P values from OLS methods are borderline (between 0.05 and 0.10), application of correlation-adjusting methods may be needed for more accurate P values.
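The paired versus two-sample argument can be verified with a few lines of arithmetic. The sketch below uses hypothetical values and writes the nonrandom shift between X and Y as `delta`; it compares the two standard errors that would appear in the denominators of the respective t statistics.

```python
import math
import statistics


def paired_se(x, y):
    """Standard error of the mean difference, using the pairing."""
    diffs = [a - b for a, b in zip(x, y)]
    return math.sqrt(statistics.variance(diffs) / len(diffs))


def two_sample_se(x, y):
    """Standard error of the difference in means, ignoring the pairing."""
    return math.sqrt(statistics.variance(x) / len(x)
                     + statistics.variance(y) / len(y))


# Perfectly correlated toy data: X = Y + delta for a nonrandom delta
y = [20.0, 25.0, 30.0, 35.0, 40.0]
delta = 0.5
x = [value + delta for value in y]

se_paired = paired_se(x, y)        # every difference equals delta, so this is 0
se_unpaired = two_sample_se(x, y)  # driven by the spread of X and Y themselves
```

Here the analysis that accounts for the correlation has the smaller standard error (zero, against 5.0 for the unpaired analysis), which is the mechanism suggested in the text for correction factors below 1.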
The MI method adds (or more precisely "allows for") uncertainty around the unknown missing genotypes by generating multiple randomly imputed genotypes where the imputed values are the predicted values plus some random error for which the expected squared value is equal to the estimated variance of the prediction. This is done multiple times and the variance in the results that occur from imputation to imputation enter into the calculation of standard errors and P values, thereby "penalizing" one for uncertainty rather than artifactually augmenting one's certainty. In some cases, the variance around the imputation is zero because the missing genotypes are known without error by Mendel's laws. In those cases, the imputation adds the correct amount of uncertainty; it just happens that that amount is zero. More broadly, the justification of the regression imputation is that genotypes can be predicted on the basis of observed phenotypes and covariates just as phenotypes can be predicted on the basis of observed genotypes. Although imputing such genotypes does not in and of itself create new information, the regression imputation of missing genotypes in this way allows one to use the full information that is available in a data set by, for example, not requiring one to drop subjects who are missing genotypic information at one locus but who have information at other loci when conducting a multilocus analysis; i.e., it avoids list-wise deletions.
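The way between-imputation variability enters the standard error can be sketched with the standard combining rules for multiple imputation (Rubin's rules) for a scalar parameter. Appendix B is not reproduced here, so treating these rules as the procedure it describes is an assumption; the function name and the numbers are hypothetical.

```python
import math
import statistics


def combine_imputations(estimates, variances):
    """Combine m scalar estimates and their squared standard errors.

    Returns (combined estimate, combined standard error), where the total
    variance is the mean within-imputation variance plus (1 + 1/m) times
    the between-imputation variance of the estimates.
    """
    m = len(estimates)
    combined = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)  # uses the m - 1 denominator
    total = within + (1.0 + 1.0 / m) * between
    return combined, math.sqrt(total)


# Hypothetical coefficient estimates and variances from five imputed data sets
estimate, std_error = combine_imputations(
    estimates=[0.21, 0.18, 0.25, 0.20, 0.16],
    variances=[0.040, 0.042, 0.039, 0.041, 0.043],
)
```

When the imputed values are identical in every data set (for example, genotypes determined without error by Mendel's laws), the between-imputation variance is zero and the combined standard error reduces to the ordinary within-imputation one.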
Under the null hypothesis of no association, there is no reason to suspect that such imputation, if properly accounted for, would bias the expected results. However, under an alternative hypothesis, MI methods can give different results. One reason is the reduction of possible biases when the data are missing at random (MAR) but not missing completely at random (MCAR).
In terms of combining P values, when the dimension of the parameter vector is one, the procedure described in Appendix B [when dim(Qi) = 1] is well justified theoretically and by simulation studies.
With respect to allele-by-allele interaction analysis, it should be pointed out that when several loci, or markers, are in close physical proximity to each other, as in the current case, and interactions among loci are tested for, such interactions, if observed, may be due to linkage disequilibrium and not true epistasis. To understand why, consider a hypothetical situation of two diallelic loci, A and B, with alleles A1 and A2 and B1 and B2, respectively, in close proximity to each other. Assume that they are in equilibrium and that neither has any individual or interactive effect on the phenotype. Then at some point in history, an allelic variation might have occurred at locus C, which is also very close to A and B, and that C now has alleles C1 and C2, with the C2 allele conferring a predisposition to increased phenotypic values. Suppose further that C2 arose on a chromosome with alleles A2 and B2 and, due to tight linkage, to this day C2 occurs primarily (though not necessarily exclusively) on chromosomes with the A2, B2 haplotype. Finally, assume that loci A and B are genotyped in a study but C is not (i.e., C is unobserved). Then, given a sufficiently large sample, one will detect an interaction "effect" of the A and B loci. However, this is solely due to the fact that when A2 and B2 occur together, they represent a haplotype with a higher likelihood of having a C2 allele at the C locus. For the present study, such "phase" information is not available. However, if a polymorphism under study does not cause variation but is both linked to and in disequilibrium with a polymorphism that causes variation in the phenotype, power to detect epistasis may be enhanced through the use of estimated haplotypes rather than single nucleotide polymorphisms (![]()
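The hypothetical A, B, C scenario can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions that are not part of the original analysis: haplotypes rather than diploid genotypes are simulated, A and B alleles are drawn independently with frequency 0.5, C2 occurs only on A2-B2 haplotypes (with a hypothetical probability of 0.8), and only C affects the phenotype. The function name and all parameter values are illustrative.

```python
import random


def simulate_interaction_contrast(n=20000, p_c2_given_a2b2=0.8, effect=1.0, seed=1):
    """Simulate haplotypes in which only the unobserved locus C affects the phenotype.

    Returns the mean phenotype for each observed (A, B) allele pair and the
    interaction contrast m22 - m21 - m12 + m11 computed from those means.
    """
    rng = random.Random(seed)
    cells = [(a, b) for a in (1, 2) for b in (1, 2)]
    sums = {cell: 0.0 for cell in cells}
    counts = {cell: 0 for cell in cells}
    for _ in range(n):
        # A and B are in linkage equilibrium: alleles are drawn independently.
        a = rng.choice((1, 2))
        b = rng.choice((1, 2))
        # Toy assumption: C2 is found only on the A2-B2 haplotype.
        c2 = (a, b) == (2, 2) and rng.random() < p_c2_given_a2b2
        # The phenotype depends on C alone, plus standard normal noise.
        y = effect * c2 + rng.gauss(0.0, 1.0)
        sums[(a, b)] += y
        counts[(a, b)] += 1
    means = {cell: sums[cell] / counts[cell] for cell in cells}
    contrast = means[(2, 2)] - means[(2, 1)] - means[(1, 2)] + means[(1, 1)]
    return means, contrast


if __name__ == "__main__":
    cell_means, interaction = simulate_interaction_contrast()
    print(cell_means)
    print("A-by-B interaction contrast:", interaction)
```

Under these assumptions the contrast is close to the product of the C2 effect and its frequency on A2-B2 haplotypes, even though neither A nor B has any causal effect; an analysis of A and B alone would read this as an interaction.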
With respect to comparison between metaanalysis and pooling analysis, both methods should yield similar results in terms of estimated effects and their significance, as was the case in this study. Metaanalysis can be as powerful as pooling analysis in certain particular situations (e.g., ![]()
Interaction of study effects with genotypic effects was assessed by means of testing heterogeneity of effects among studies. The heterogeneity of the effects was not significant, as presented earlier. Because of this nonsignificance and concerns about possible overfitting, we did not include study-by-allele interaction effects in the main pooling analysis; i.e., we estimated equal genetic effects across the studies. Although the study effects are large (Table 9), this does not mean that the genotypic effects are different over the studies, but rather that average levels of the phenotypes are different in the different studies. However, we acknowledge that modeling study effects alone may not capture all possible heterogeneity across the study populations due to differences in sampling schemes, degrees of demographic homogeneity within samples, and so on. For example, the Danish samples (![]()
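For readers who wish to compare the two approaches numerically, a fixed-effect metaanalysis with a heterogeneity statistic can be sketched as follows. This is a generic inverse-variance calculation with Cochran's Q, offered as an assumption about what such a comparison might look like; it is not necessarily the heterogeneity test used in this study, and the function name and example numbers are hypothetical.

```python
def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance fixed-effect pooling of per-study effect estimates.

    Returns the pooled estimate, its standard error, Cochran's Q statistic,
    and the degrees of freedom for Q (number of studies minus one).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    # Q measures the weighted spread of study estimates around the pooled value.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    return pooled, pooled_se, q, df


if __name__ == "__main__":
    # Hypothetical per-study allele effects and standard errors.
    print(fixed_effect_meta([0.2, 0.4], [0.1, 0.1]))
```

Under the null hypothesis of homogeneous effects, Q is referred to a chi-square distribution with the returned degrees of freedom; a nonsignificant Q is what motivates estimating a single common genetic effect.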
An alternative approach would be a mixed-effects model with cohort as a random effect. Such a model could include empirical Bayes estimation and testing (![]()
On the basis of the results from the backward elimination procedures (Table 9), subjects who are heterozygous at exon 2 and homozygous at exon 4 have significantly greater BMI (with missing values imputed) than the other subjects. Although these five subjects may deserve further investigation, only one of them is extremely obese, with a BMI of 51. It is therefore unlikely that this particular allele interaction can cause obesity. There is also a limitation in the epistasis analysis because we used only three exons; i.e., we do not know the pathways by which the three exons interact with unknown alleles in other exons and introns. Furthermore, if we had adjusted for the number of tests, each adjusted P value would have been much higher than that reported in this article. For example, if all tests were independent, a multiple-test adjusted P value $p_b$ may be written as $p_b = 1 - (1 - p)^t$, where t is the number of tests and p is an unadjusted P value. It follows that $p_b$ is 0.06 even with t = 6 and p = 0.01. This further confirms that it is unlikely that all the nonsignificant results from this study are due to type 2 error, although, when the tests are dependent, the adjusted P value will be <0.06 but >0.01. Moreover, an allelic effect accounting for even 1% of the phenotypic variation would have been detected with >99% power with 3000 subjects. As far as confidence intervals of point estimates are concerned, they can be immediately computed from the standard errors provided in the tables.
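The adjustment for independent tests quoted above is easy to verify numerically. The short sketch below evaluates the adjusted P value as one minus (1 - p) raised to the power t; the function name is ours and is used only for illustration.

```python
def adjusted_p_independent(p, t):
    """Multiple-test adjusted P value for t independent tests: 1 - (1 - p)**t."""
    return 1.0 - (1.0 - p) ** t


if __name__ == "__main__":
    # The worked case from the text: t = 6 tests, unadjusted p = 0.01.
    print(round(adjusted_p_independent(0.01, 6), 2))
```

With t = 6 and p = 0.01 the adjusted value rounds to 0.06, in agreement with the figure given in the text.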
The lack of association between the amino acid substitutions and obesity indices, despite a large sample, suggests that the substitutions do not affect the phenotypes. While amino acid substitutions may result in either nonfunctioning or poorly functioning proteins, or even functional proteins (if the effect is silent), it is important to note that among complex traits with multiple pathways, such as obesity, the absence of association might not necessarily indicate a lack of effect. It may simply be that persons with the amino acid substitution compensated by other means or that additional genotypic factors may be involved and need to be taken into account before the phenotype becomes manifest. However, the lack of association does not rule out the possibility that the three alleles may influence intermediate traits, or phenotypes, not examined as part of the analyses conducted in this article.
In conclusion, appropriate statistical analysis of pooled genetic data requires careful data management and flexible adaptation of methods and software to model the biological effects of the genes under study effectively. In the absence of well-developed guidelines, we hope that the procedures and methods illustrated herein will serve as a useful example for future pooling of genetic studies of quantitative trait loci.
| ACKNOWLEDGMENTS |
|---|
We thank Drs. Jose Fernandez and Gary Gadbury for their valuable input. This study was supported in part by the National Institutes of Health grants R01DK51716, R01DK52431, R01ES09912, F33DK09919, P30DK26687, R01HD29569, R01GM28356, and P41RR03655.
Manuscript received October 30, 2000; Accepted for publication July 5, 2001.
| APPENDIX A |
|---|
DATA PROVIDERS
See Table 1 for the study names.
Finnish Studies 1 and 2: Markku Koulu, M. Karvonen, U. Pesonen, A. Rissanen, M. Laakso, and M. Uusitupa.
QFS and HFS: Claude Bouchard and Yvonne Chagnon.
MIFS: Trudy L. Burns and Patricia A. Donohoue.
Baltimore Study: Ross E. Andersen, Alan R. Shuldiner, and Kristi Silver.
Danish Study: Soren Echwald, Olaf Pedersen, and T. I. A. Sørensen.
Nigerian Study and AfAm Study: Philip Behn and M. Alan Permutt.
| APPENDIX B |
|---|
INFERENCE BASED ON MULTIPLE IMPUTATION (Schafer 1997)
We conducted the appropriate analysis on each imputed complete data set to obtain m = 5 estimates, Qi, i = 1, ... , m, and their estimated variances Ui = Var(Qi), i = 1, ... , m. In the following, dim(Qi) denotes the dimension of the parameter vector Qi.
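When dim(Qi) = 1, the final estimate, its variance T, and sampling distribution are

$$\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} Q_i, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right)B,$$

the total variance, where

$$\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i,$$

the within-imputation variance, and

$$B = \frac{1}{m-1}\sum_{i=1}^{m}\left(Q_i - \bar{Q}\right)^2,$$

the between-imputation variance, and

$$\frac{\bar{Q} - Q_0}{\sqrt{T}} \sim t_{\nu},$$

where Q0 is the true parameter value and the number of degrees of freedom is

$$\nu = (m-1)\left[1 + \frac{\bar{U}}{(1 + 1/m)\,B}\right]^2.$$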
When dim(Qi) = k > 1, calculation of an "overall" P value for testing H0: Q = Q0 can be performed on the basis of the Wald test statistic; that is,

or equivalently,

where pi is the P value obtained from the ith imputed complete data set. Then we obtained a test statistic

where

and

We then obtained the overall P value,

| LITERATURE CITED |
|---|
ABBOTT, A., 2000 Manhattan versus Reykjavik. Nature 406:340-342.
ALLISON, D. B., B. THIEL, P. ST. JEAN, R. C. ELSTON, and M. C. INFANTE et al., 1998 Multiple phenotype modeling in gene-mapping studies of quantitative traits: power advantages. Am. J. Hum. Genet. 63:1190-1201.
ALLISON, D. B., M. HEO, N. KAPLAN, and E. R. MARTIN, 1999 Sibling-based tests of linkage and association for quantitative traits. Am. J. Hum. Genet. 64:1754-1764.
AMOS, C. I., R. C. ELSTON, G. E. BONNEY, B. J. B. KEATS, and G. S. BERENSON, 1990 A multivariate approach for detecting linkage, with application to a pedigree with adverse lipoprotein phenotype. Am. J. Hum. Genet. 47:247-254.
CARLIN, B. P. and T. A. LOUIS, 2000 Empirical Bayes: past, present and future. J. Am. Stat. Assoc. 95(452):1286-1289.
CHAGNON, Y. C., W. K. CHUNG, L. PÉRUSSE, M. CHAGNON, and R. L. LEIBEL et al., 1999 Linkage and associations between the leptin receptor (LEPR) gene and human body composition in the Quebec Family Study. Int. J. Obes. 23:278-286.
CHAGNON, Y. C., J. H. WILMORE, I. B. BORECKI, J. GAGNON, and L. PÉRUSSE et al., 2000 Association between the leptin receptor gene and adiposity in middle-aged Caucasian males from the HERITAGE family study. J. Clin. Endocrinol. Metab. 85:29-34.
CHUA, S. C., JR., W. K. CHUNG, X. S. WU-PENG, Y. ZHANG, and S. M. LIU et al., 1996 Phenotypes of mouse diabetes and rat fatty due to mutations in the OB (leptin) receptor. Science 271:994-996.
CHUNG, W. K., L. POWER-KEHOE, M. CHUA, F. CHU, and M. DEVOTO et al., 1997 Exonic and intronic variation in the leptin receptor (OBR) of obese humans. Diabetes 46:1509-1511.
CLÉMENT, K., C. VAISSE, N. LAHLOU, S. CABROL, and V. PELLOUX et al., 1998 A mutation in the human leptin receptor gene causes obesity and pituitary dysfunction. Nature 392:398-401.
COMUZZIE, A. G. and D. B. ALLISON, 1998 The search for human obesity genes. Science 280:1374-1377.
COOPER, H., and L. V. HEDGES (Editors), 1994 The Handbook of Research Synthesis. Russell Sage Foundation, New York.
DEL GIUDICE, E. M., L. PERRONE, P. FORABOSCO, M. DEVOTO, and M. T. CARBONE et al., 2000 Linkage study of early-onset obesity to leptin receptor gene in Italian children. Nutr. Res. 20:1059-1063.
DONOHOUE, P. A., T. L. BURNS, M. C. B. MENDOZA, W. K. CHUNG, and R. L. LEIBEL, 2000 Lys656Asn variant of the leptin receptor gene (LEPR) and the β-3 adrenergic receptor (β3AR) gene linked to body mass index in humans: the Muscatine study. Pediatr. Res. 47:127A.
ECHWALD, S. M., T. D. SØRENSEN, T. I. A. SØRENSEN, A. TYBJÆRG-HANSEN, and T. ANDERSEN et al., 1997 Amino acid variants in the human leptin receptor: lack of association to juvenile onset obesity. Biochem. Biophys. Res. Commun. 233:248-252.
ELSTON, R. C., 2000 Introduction and overview. Stat. Methods Med. Res. 9:527-541.
ELSTON, R. C., S. BUXBAUM, K. B. JACOBS, and J. M. OLSON, 2000 Haseman and Elston revisited. Genet. Epidemiol. 19:1-17.
FALLIN, D. and N. J. SCHORK, 2000 Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet. 67:947-959.
FALLIN, D., A. COHEN, L. ESSIOUX, I. CHUMAKOV, and M. BLUMENFELD et al., 2001 Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Res. 11:143-151.
FISHER, R. A., 1954 Statistical Methods for Research Workers, Ed. 12. Hafner Publishing, New York.
GEORGE, V. T. and R. C. ELSTON, 1987 Testing the association between polymorphic markers and quantitative traits in pedigrees. Genet. Epidemiol. 4:193-201.
GOTODA, T., B. S. MANNING, A. P. GOLDSTONE, H. IMRIE, and A. L. EVANS et al., 1997 Leptin receptor gene variation and obesity: lack of association in a white British male population. Hum. Mol. Genet. 6:869-876.
HASEMAN, J. K. and R. C. ELSTON, 1972 The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2:3-19.
HEDGES, L. V., and I. OLKIN, 1985 Statistical Methods for Meta-Analysis. Academic Press, New York.
LI, K. H., T. E. RAGHUNATHAN, and D. B. RUBIN, 1991 Large-sample significance levels from multiply-imputed data using moment-based statistics and F reference distribution. J. Am. Stat. Assoc. 86:1065-1073.
LITTLE, R. J. A., and D. B. RUBIN, 1987 Statistical Analysis With Missing Data. John Wiley & Sons, New York.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
MANGIN, B., P. THOQUET, and N. GRIMSLEY, 1998 Pleiotropic QTL analysis. Biometrics 54:88-99.
MATSUOKA, N., Y. OGAWA, K. HOSODA, J. MATSUDA, and H. MASUZAKI et al., 1997 Human leptin receptor gene in obese Japanese subjects: evidence against either obesity-causing mutations or association of sequence variants with obesity. Diabetologia 40:1204-1210.
OKSANEN, L., J. KAPRIO, P. MUSTAJOKI, and K. KONTULA, 1998 A common pentanucleotide polymorphism of the 3'-untranslated part of the leptin receptor gene generates a putative stem-loop motif in the mRNA and is associated with serum insulin levels in obese individuals. Int. J. Obes. 22:634-640.
OLKIN, I. and A. SAMPSON, 1998 Comparison of meta-analysis versus analysis of variance of individual patient data. Biometrics 54:317-322.
ROLLAND-CACHERA, M. F., M. SEMPE, M. GUILLOUD-BATAILLE, E. PATOIS, and F. PEQUIGNOT-GUGGENBUHL et al., 1982 Adiposity indices in children. Am. J. Clin. Nutr. 36:178-184.
RUBIN, D. B., 1978 Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse, pp. 20-34 in Proceedings of the Survey Research Methods Section. American Statistical Association.
RUBIN, D. B., 1987 Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.
RUBIN, D. B., 1996 Multiple imputation after 18 years. J. Am. Stat. Assoc. 91:473-489.
S.A.G.E., 1997 Statistical Analysis for Genetic Epidemiology, Release 3.1 (Computer program package available from the Department of Epidemiology and Biostatistics, Rammelkamp Center for Education and Research, MetroHealth Campus, Case Western Reserve University, Cleveland).
SCHAFER, J. L., 1997 Analysis of Incomplete Multivariate Data (Monographs on Statistics and Applied Probability Series 72). Chapman & Hall, New York.
SCHAFER, J. L. and M. K. OLSEN, 1998 Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivar. Behav. Res. 33:545-571.
SILVER, K., J. WALSTON, W. K. CHUNG, F. YAO, and V. V. PARIKH et al., 1997 The Gln223Arg and Lys656Asn polymorphisms in the human leptin receptor do not associate with traits related to obesity. Diabetes 46:1898-1900.
TANIZAWA, Y., A. C. RIGGS, S. DAGOG-JACK, M. VAXILLAIRE, and P. FROGUEL et al., 1994 Isolation of the human LIM/homeodomain gene islet-1 and identification of a simple sequence repeat polymorphism. Diabetes 43:935-941 (erratum, Diabetes 43:1171).