## Abstract

Although the case-control or the cross-sectional design has been popular in genetic association studies of human longevity, such a design is prone to false positive results due to sampling bias and a potential secular trend in gene–environment interactions. To avoid these problems, the cohort or follow-up study design has been recommended. With the observed individual survival information, the Cox regression model has been used for single-locus data analysis. In this article, we present a novel survival analysis model that combines population survival with individual genotype and phenotype information in assessing the genetic association with human longevity in cohort studies. By monitoring the changes in the observed genotype frequencies over the follow-up period in a birth cohort, we are able to assess the effects of the genotypes and/or haplotypes on individual survival. With the estimated parameters, genotype- and/or haplotype-specific survival and hazard functions can be calculated without any parametric assumption on the survival distribution. In addition, our model estimates haplotype frequencies in a birth cohort over the follow-up time, which is not observable in the multilocus genotype data. A computer simulation study was conducted to specifically assess the performance and power of our haplotype-based approach for given risk and frequency parameters under different sample sizes. Application of our method to paraoxonase 1 genotype data detected a haplotype that significantly reduces carriers' hazard of death and thus reveals and stresses the important role of genetic variation in maintaining human survival at advanced ages.

THE current genetic association studies on human aging and longevity are dominated by the case–control (cases represent long lived and controls the younger aged) or cross-sectional design. In these studies, subjects of different ages are genotyped, and frequencies of a particular gene variant are compared across the observed ages to infer genetic association (De Benedictis *et al*. 2001). In terms of analytical methods, new statistical approaches have been proposed to help analyze genotype data collected in cross-sectional studies with improved power (Tan *et al*. 2004). Lewis and Brunner (2004) explored the validity of the basic assumptions in the cross-sectional approach, *i.e.*, no secular change in both risk and initial frequency of the gene under study. Their study concluded that the assumptions are questionable when gene frequency differs in populations and gene–environment interactions exist and suggested conducting long-term follow-up studies to ensure verifiable results. Although the cohort study design is expensive due to the long time of follow up, it is practically affordable to carry out follow-up studies on aged subjects given the high mortality rate at advanced ages. In this case, the study aims at investigating the genetic effect on human survival at extreme ages, an important topic nowadays because of the largely increased mean life span in the developed countries (Vaupel *et al*. 1998). In the literature, association studies using follow-up design on oldest-old subjects such as nonagenarians (Christiansen *et al*. 2004; Hurme *et al*. 2005) and centenarians (Blanche *et al*. 2001; Louhija *et al*. 2001) have already been conducted. With the collected genotype and survival information, traditional statistical methods have been employed for single-locus analysis, for example, frequency comparison using a simple χ^{2}-(trend) test or estimating genotype relative risk using the traditional Cox regression model.

In the context of human disease gene mapping, multilocus approaches such as the haplotype-based association analysis have been introduced (Schaid *et al*. 2002; Zhao *et al*. 2003). Haplotype-based analysis exhibits more power due to its functional and statistical advantages over the single-locus approach in linkage disequilibrium mapping (Akey *et al*. 2001; Clark 2004; Schaid 2004). Unfortunately, haplotype analysis in population-based human longevity studies encounters the problem of missing phases in the long-lived subjects because genotype information is unavailable from their parents. We present a novel survival analysis model that combines population survival with individual genotype and phenotype information in assessing the genetic association with human longevity in cohort studies. By monitoring the changes in the observed genotype frequencies over the follow-up period in a birth cohort, we are able to assess the effects of the genotypes and/or haplotypes on individual survival. With the estimated parameters, genotype- and/or haplotype-specific survival and hazard functions can be calculated without any parametric assumption on the survival distribution. In addition, our model estimates haplotype frequencies in a birth cohort over the follow-up time, which is not observable in the multilocus genotype data. A computer simulation study was conducted to evaluate the performance of our model. For given haplotype risk and frequency parameters, we assess the power of our model under different sample sizes and years of follow up. The model is applied to measure the association of paraoxonase 1 (*PON1*) gene polymorphism with human survival at advanced ages in the Danish 1905 birth cohort followed from 1998 to 2005. Application of our model helped us to detect a haplotype of a *PON1* gene that significantly reduces the carrier's hazard of death over the follow up.

## METHODS

#### The basic model for genotype-based analysis:

We suppose that we start our follow-up study in a birth cohort of old subjects from initial age . For each individual, we obtain genotype information at a biallelic locus (*e.g.*, a SNP locus) for assessing its influence on conditional survival after age . Combination of the two alleles (1 and 2) forms three genotypes (11, 12, and 22). If the frequencies of carriers of the three genotypes are , , and at intake (age ), then the conditional survival (conditional on the fact that each individual has survived to age ) of the birth cohort can be expressed as the mean of the survivals for carriers of the three genotypes; *i.e*.,(1)where , , and are genotype-specific conditional survival functions for carriers of the corresponding genotypes. Age *x* ranges from to , where *t* is the follow-up time in years. From (1), it is straightforward to calculate the proportion of genotype carriers in the birth cohort at age *x* during the follow up,(2)If we assign the 22 genotype as a reference and assume that, after age , the genotypes 11 and 12 affect survival with the relative risks (*i.e.*, the relative rate of death among those carrying a specific genotype over that of the reference genotype) and , respectively, then in a proportional hazard model we have the hazard of death at age *x* for carriers of the three genotypes as(3)Here, is the hazard of death for the 22 genotype or the baseline hazard function. With (3), we obtain the survival functions for carriers of the three genotypes as(4)To take into account the effects of unobserved risk factors (both genetic and nongenetic), the frailty model (Vaupel *et al*. 1979) can be introduced. Assuming that the unobserved frailty is gamma distributed with mean 1 and variance , we obtain the genotype-specific survivals (see appendix):(5)By introducing (4) or (5) into (2), we can estimate the parameters using the optimization method, which minimizes the differences between the observed and the estimated genotype frequencies over the follow-up ages (from to ),(6)where and *p*(*x*) are vectors of the observed and the fitted frequencies of all the genotypes at age *x*.

In the estimation, a parametric form of the baseline hazard function can be assigned. However, by introducing the mean cohort survival available from population statistics into (1), a nonparametric baseline hazard function can be estimated. This is done using a two-step procedure (Yashin *et al*. 1999; Tan *et al*. 2001) in which we start with an initial guess of the risk and frequency parameters and then numerically solve (1) to obtain a nonparametric baseline hazard function. This hazard function is introduced into (6) to estimate the risk and frequency parameters. The estimated parameters are put back into (1) to calculate an updated baseline hazard function. This process continues until (6) converges (Tan *et al*. 2001). To obtain the statistical significance for the risk parameters, we shuffle the ages at last observation for all the subjects to conduct the permutation tests.

#### The extended model for haplotype-based analysis:

When genotypes at closely linked loci are available, haplotype-based analysis can be performed. Different from genotypes that are observable for each individual, individual haplotypes cannot be determined explicitly without knowing phases. We start with assuming that all the haplotypes occurring at the typed loci are unambiguously observed and denote the collection of them with . Under Hardy–Weinberg equilibrium (HWE), the frequency of the haplotype pair can be calculated as(7)where and are the haplotype frequencies at initial age for haplotypes and . We further assume that the relative risk on hazard of death for carriers of haplotype is . For carriers of haplotype , the hazard of death at follow-up age *x* is in a proportional hazard model. Similar to the situation of genotype-based analysis, both homogeneity (no unobserved risk factor) and heterogeneity or frailty models can be fitted. In any case, the mean conditional survival of the birth cohort is the weighted survival for carriers of the different haplotype pairs; *i.e*.,(8)Likewise, at age *x*, the frequency of carriers of haplotype pair in the birth cohort is(9)Up to now, the haplotype-based parameterization is made by assuming that haplotypes are known explicitly for each individual. In practice, what we observe are unphased multilocus genotypes instead of haplotypes. However, for each multilocus genotype *g*, we have a set of haplotype pairs denoted as that are consistent with *g*. With this relationship, we can express the frequency of the multilocus genotype *g* at follow-up age *x* in the birth cohort as(10)On the basis of the multinomial distribution of the multilocus genotype frequencies in the birth cohort, we construct the likelihood function for estimating the initial haplotype frequencies,(11)where *n*_{g} is the number of carriers of the multilocus genotype *g* in the birth cohort and *p _{g}* is its frequency at the beginning of the follow-up. Instead of (11), one can also estimate the haplotype frequencies using existing software such as GENECOUNTING (Zhao

*et al*. 2002). With the initial haplotype frequencies, we further estimate the risk of haplotypes using the two-step procedure similar to the genotype-based analysis. Once the parameters are estimated, we are able to calculate the frequency trajectory for haplotype

*h*over the follow-up ages as(12)

_{i}All computer codes written in the GAUSS programming language are freely available upon contacting the corresponding author.

## SIMULATION STUDY

To examine the validity of our model, we conducted a computer simulation study specifically on our haplotype-based analysis. In the simulation, we assign different haplotype relative risks (a modest risk of 0.8 and a strong risk of 0.6 based on our previous experience on fitting relative risk models to genotype data) for a three-locus haplotype. Also we assign different sample sizes (500 and 1000) for the birth cohort, starting from the initial follow-up age of 93. Power of our model is assessed for samples collected from 3, 5, and 7 years of follow up. For each setting, we calculate the power as the probability that the 95% confidence limit of the estimated haplotype relative risk falls below 1. The mean cohort survival function is obtained from the female life table of the Danish 1905 cohort (http://www.mortality.org). The initial haplotype frequencies to use in the simulation are taken from the estimated haplotype frequencies in an empirical multilocus genotype data set spanning three SNP loci. Model performance is assessed on haplotypes with frequencies of 0.284 and 0.116. The simulation also takes into account individual heterogeneity in the unobserved frailty that contributes to individual survival by introducing a gamma-distributed frailty with mean 1 and variance 0.1. To assess the power, we first simulate the null distribution for the estimated null risk, which is set to 1 in generating the data for given haplotype frequency and sample size. The critical value for α = 0.05 in the null distribution is used in calculating the power.

In Table 1, we show the results from our simulation study using 500 replicates. We can see that, overall, our model produces unbiased estimates for the fixed haplotype risk and frequency parameters used in generating the multilocus genotype data. The precision of the estimated relative risks as indicated by their percentiles goes up with increasing sample sizes and extending follow-up time. The top part of Table 1 contains the results for a relatively frequent haplotype (frequency of 0.284). For a strong relative risk of *r* = 0.6, the model can have acceptable power in capturing the parameter even within a follow-up period of <5 years using a small birth cohort (*N* = 500). However, for the same sample size, the model is unable to detect a low risk haplotype (*r* = 0.8). A large birth cohort (*N* = 1000) is required to obtain an acceptable power (80%) after 7 years follow-up. For a haplotype with a lower frequency of 0.116, the model still exhibits acceptable power in a small birth cohort (*N* = 500) after 7 years follow-up. When the sample size is doubled, the power can be as high as 91% after 5 years follow-up. To detect a low frequency and low risk haplotype (*r* = 0.8), a long term of follow-up (>7 years) and a large sample of >2000 is required.

In Table 1, we also report the simulation result on a haplotype with a frequency of 0.284 and a risk of 0.8 but ignoring the effect of hidden frailty (). During all the follow-up years, the effect of the haplotype is obviously underestimated, meaning conservative results. At the same time, the power is largely reduced.

Finally, a large birth cohort of 10,000 individuals (over 93 years) was simulated. This time we assume that there is a harmful haplotype that increases the carrier's hazard of death with a relative risk of 1.5 and frequency of the haplotype at the initial age is chosen as 0.284. In Figure 1, we show the theoretical, the simulated, and the estimated frequency trajectories for the haplotype after 7 years follow up. Both the theoretical (solid line) and the simulated (dashed line) frequency patterns are well captured by our model (the dashed-dotted line). Note also that all three lines start from the initial frequency we set in the simulation.

## EXAMPLE APPLICATION

Previous studies have shown that the antioxidant enzyme *PON1* is associated with susceptibility to cardiovascular disorders (Watson *et al*. 1995; Heinecke and Lusis 1998). Recently, *PON1* gene polymorphisms have been found to affect human survival (Bonafe *et al*. 2002; Christiansen *et al*. 2004). In the study by Christiansen *et al*. (2004), three SNPs in the *PON1* gene (two in the coding region, amino acids M55L and Q192R, and one in the promoter region, C-107T) were genotyped. Data analysis using the Cox regression model revealed a strong protective effect of the 55M/L genotype and an increased risk of death for carriers of the 192R/R genotype in females (a relative risk of 1.38). In their study, haplotype analysis was not possible because such an analysis is not supported by the traditional Cox regression model. In this example application, we apply our model to a subsample in their study, the female 1905 cohort data, and perform both genotype- and haplotype-based analyses. Since the start of the follow up at the initial age of 93 years, survival information has been collected for a period of 6 years. The cohort survival function is taken from the female life table of the Danish 1905 cohort (http://www.mortality.org). Similar to the simulation study, we introduce a gamma-distributed frailty model (mean 1, variance 0.1) in our analysis to take into account the influence on survival from the unobserved factors.

In Table 2, we show the genotype counts and number of typed individuals at the three loci. Chi-square tests on each locus showed that all three loci are in Hardy–Weinberg equilibrium (= 0.005, *P* = 0.944 at *PON1* C-107T; = 0.100, *P* = 0.752 at *PON1* M55L; and = 3.190, *P* = 0.074 at PON1 Q192R). We present our results on genotype-based analysis in Table 2, where, as described in methods, risk and frequency parameters are estimated for each genotype against the reference. Although only a subsample was used, our model detected *PON1* 192 Q/Q and Q/R genotypes as of potential influence (*P* = 0.062 for Q/R and *P* = 0.076 for Q/Q genotypes), a result consistent with that of Christiansen *et al*. (2004). In Figure 2, we show the observed and the estimated genotype frequency patterns for all genotypes at the three loci.

On the basis of genotype information from the three loci, we carried out a haplotype-based analysis using our extended model. By taking individuals with genotype information available at all three loci, we obtained a sample size of 451 for haplotype analysis. We first tested the linkage disequilibrium between the markers using GENECOUNTING software (Zhao *et al.* 2002). Very strong marker–marker disequilibrium was found between markers C-107T and M55L (= 225.36, *P* = 0), M55L and Q192R (= 240.29, *P* = 0), and C-107T and Q192R (= 17.54, *P* = 0). In Table 3, we show the estimated initial frequency and the relative risk for each haplotype by assigning the rest as reference. Our results show that haplotype T-L-Q exhibits a significantly (*P* = 0.043) beneficial effect that reduces the carrier's hazard of death with a relative risk of 0.688 and an initial frequency of 0.090. In Figure 3, we show the estimated haplotype frequencies for haplotypes T-L-Q and T-L-R calculated using (12). The increasing frequency of the T-L-Q haplotype in survivors of the 1905 birth cohort illustrates the beneficial effect of the haplotype, which is in contrast to the frequency of the T-L-R haplotype. In addition, on the basis of the parameter estimates, we calculated the conditional survival functions for carriers and noncarriers of the T-L-Q haplotype as presented in Figure 4. A comparison of the survival curves shows that, on average, carriers of the T-L-Q haplotype may live ∼1 year longer than noncarriers of the haplotype (Figure 4).

## DISCUSSION

Recent association studies are increasingly emphasizing the important roles that genes may play in affecting human survival at advanced ages (Bonafe *et al*. 2002; Christiansen *et al*. 2004; Bellizzi *et al*. 2005; Hurme *et al*. 2005). Further studies using efficient analytical tools are needed. Through computer simulation and model application, we have shown that our survival analysis model is a valid method for both genotype- and haplotype-based association analyses of human survival in cohort studies of elderly subjects. Our computer simulation approach not only validates our model but also provides important information useful in planning future research. In addition, empirical application of the model to our *PON1* genotype data has helped us to identify an important haplotype that favors human longevity and indicates an increased power of the haplotype model.

Although our model can support the parametric form of the baseline hazard function, it is important to point out that, by incorporating the cohort-specific survival function available from population statistics, our model conducts parameter estimation without imposing any parametric form of the hazard function. This means that our model provides a nonparametric approach in analyzing survival data at advanced ages. This is important because, at extreme ages, validity of the parametric survival functions, such as the Gompertz or the Gompertz–Makeham models, has been seriously questioned (Driver 2001). In addition, when the sample size is limited at advanced ages, there will be a considerable error in estimating the survival distribution, which consequently leads to unreliable results. Since our model models the genotype or haplotype frequency patterns in the survivors over the follow-up period, with the estimated parameters, the fitted frequency trajectory over the observed period can be examined (Figures 2 and 3) for each genotype or haplotype. This is especially useful in haplotype-based analysis because, unlike in the single-locus analysis, we do not actually observe the individual haplotype in population studies. Moreover, since the parameters are estimated by monitoring the genotype or haplotype frequencies in the survivors over the follow-up period in a birth cohort, exact individual life span is not necessarily required to apply the model. This means that censoring is not a problem at all in our analysis. Finally, although the method is introduced using SNP markers, extending it to multiallelic loci involves only more genotypes or haplotypes in the analysis.

Here, we emphasize the importance of the frailty modeling in our analysis. It is well known that, at advanced ages, the cause-specific mortality curves start to converge as a result of heterogeneity in an individual's frailty composition (Vaupel *et al*. 1998). As a result, ignoring the existence of the competing risk factors can substantially underestimate the risks of genotypes or haplotypes. Frailty modeling in our genotype- and haplotype-based analyses can help us to assess the risk parameters in a more realistic manner. Most importantly, since haplotype analysis is biologically as well as statistically advantageous over the single-locus approach, our haplotype model with frailty modeling provides a powerful method in data analysis. In addition, we point out that the applicability of our model is theoretically not limited to longevity studies. Application of our model to any time to event data, for example, the age of onset of a disease, is feasible provided that the population probability distribution of the time to onset of the disease is available.

On the basis of the mixed results from conducted follow-up studies on apolipoprotein E gene and longevity, Lewis and Brunner (2004) suggested that adequate cohort studies with longer follow up (>5 years) be conducted to obtain reliable results. It is interesting to see from Table 1 that their conclusion seems to comply with our simulation. Since the cohort design avoids the validity issues concerning a cross-sectional design (secular change in the risk and initial frequency of the observed gene), our simulation result is promising because, with a proper analytical approach, the effect of a gene on human extreme age survival can be detected within an affordable period of follow-up time.

## APPENDIX

Under the proportional hazard assumption, if an individual carrying the risky genotype has frailty , the hazard of death at age *x* is . The mean hazard of death for a heterogeneous population carrying the genotype is(A1)Following the traditional approach (Vaupel *et al*. 1979; Hougaard 1991; Aalen 1998), we assume that the frailty is gamma distributed with mean 1 and variance . Then in (A1) can be derived as . Substituted into (A1), we get(A2)where is the cumulative baseline hazard function. Correspondingly, we have the mean survival for the genotype carriers,(A3)Estimating the variance parameter requires a large sample size (Ewbank 2002). In small-scale investigations, can be determined by a grid search for the peak of the likelihood for tentatively assigned values of (Yashin *et al*. 2000; Tan *et al*. 2001). On the basis of our experiences in fitting the gamma-frailty model to large population data sets, one can alternatively fit a frailty model by simply setting to 0.1. This can be conservative compared with some empirical results (Yashin *et al*. 2000; Tan *et al*. 2001; Ewbank 2002). However, we think it is applicable for small data sets.

## Acknowledgments

This work was jointly supported by the U.S. National Institute on Aging (NIA) research grant NIA-P01-AG08761 and the Danish Medical Research Council.

## Footnotes

Communicating editor: R. W. Doerge

- Received September 9, 2005.
- Accepted December 13, 2005.

- Copyright © 2006 by the Genetics Society of America