## Abstract

For most common diseases with heritable components, not a single or a few single-nucleotide polymorphisms (SNPs) explain most of the variance for these disorders. Instead, much of the variance may be caused by interactions (epistasis) among multiple SNPs or interactions with environmental conditions. We present a new powerful statistical model for analyzing and interpreting genomic data that influence multifactorial phenotypic traits with a complex and likely polygenic inheritance. The new method is based on Markov chain Monte Carlo (MCMC) and allows for identification of sets of SNPs and environmental factors that when combined increase disease risk or change the distribution of a quantitative trait. Using simulations, we show that the MCMC method can detect disease association when multiple, interacting SNPs are present in the data. When applying the method on real large-scale data from a Danish population-based cohort, multiple interactions are identified that severely affect serum triglyceride levels in the study individuals. The method is designed for quantitative traits but can also be applied on qualitative traits. It is computationally feasible even for a large number of possible interactions and differs fundamentally from most previous approaches by entertaining nonlinear interactions and by directly addressing the multiple-testing problem.

MOST common diseases are multifactorial and influenced by several factors that can be of both genetic and environmental origin (Lander and Schork 1994; Weiss 1994). The more prevalent of these disorders include cancer, diabetes, obesity, hypertension, and premature cardiovascular morbidity and mortality. Numerous genetic variations have been implicated as major pathogenic factors in various Mendelian disorders. However, the success in identifying the genetic factors underlying complex traits has at times been limited (Glazier *et al*. 2002; Hirschhorn *et al*. 2002). Many studies have been challenged by a presumed low effect of each genetic variant, small study populations, confounding effects such as population stratification, and, possibly, the use of highly simplistic genetic models.

Even though conditions such as diabetes harbor strong genetic components, not a single or a few single-nucleotide polymorphisms (SNPs) explain most of the genetic variance for these disorders (Risch and Merikangas 1996). It is hypothesized that much of the genetic variation may be caused by the interaction (epistasis) of multiple SNPs and interaction with environmental conditions (Cordell 2002). The disease penetrance associated with each allele is low and the impact of genetic components may vary depending on the genetic and environmental background (Carlson *et al*. 2004). Great progress has been achieved in the last few years using simple linear models (Cockerham and Zeng 1996; Falconer 1996; Schaid *et al*. 2002; Baker 2005). However, for large-scale data the methods often cannot detect interactions that are believed to have a substantial impact on the development of complex diseases (Culverhouse *et al*. 2004), because the models include only linear interactions and because the solutions that have been developed to address the multiple-testing problem lead to a drastic reduction in the statistical power (Cardon and Bell 2001). When modeling all the possible interactions between environmental factors and many SNPs at different loci using classical methods, the number of necessary parameters becomes very large as the number of SNPs increases (Nelson *et al*. 2001; Culverhouse *et al*. 2004). This is one of the main reasons why SNP–SNP interactions and SNP–environment interactions are rarely modeled in association studies (Carlborg and Haley 2004) even though these interactions are essential in uncovering the etiological background for complex diseases.

Some of the more successful methods for incorporating higher-order interactions are based on clustering algorithms. The combinatorial partitioning method (Nelson *et al*. 2001) was one of the first methods to model higher-order interactions without any main effect. This method evaluates all possible partitions and is not computationally feasible for large-scale data. The popular multifactor-dimensionality reduction (MDR) (Ritchie *et al*. 2001) for binary traits reduces the number of partitions by labeling the possible combinations as either high risk or low risk. The results are then evaluated through cross-validation or permutation testing, which compensates for multiple testing.

Several methods based on regression and classification trees allow for some interactions and are extremely fast (Breiman *et al*. 1984; Freund and Schapire 1997; Huang *et al*. 2004). One of these is the random forest method (Breiman 2001). This method uses a bootstrap sample of the data to construct a tree and uses the nonsampled (called “out of bag”) data for cross-validation. Multiple bootstrap samples are used to construct a *forest* of trees and the significance of each parameter is determined from the out-of-bag samples.

We propose using Markov chain Monte Carlo (MCMC) to overcome some of the problems described. MCMC is a stochastic computational approach that is very useful in Bayesian statistics where it can be used to estimate the posterior distribution of the parameters using Monte Carlo integration. This is achieved by generating samples from an ergodic Markov chain that has the same stationary distribution as the posterior distribution. MCMC provides a convenient method for dealing with a large parameter space when only parts of the posterior distribution are of interest. The new method explores sets of effects (risk sets) that increase the risk, or the phenotypic value, for individuals who fulfill the criterion defined by the sets. A risk set may contain one or more genetic or environmental conditions. The MCMC method then provides a probability that a particular risk set exists, *i.e*., that the conditions specified by the risk truly cause an increase in the phenotypic value or a higher disease risk. Methods that explore such a large range of models (multiple combinations of effects) often have very little power because they do not efficiently combine the evidence for association from different models. The new Bayesian method addresses this problem by combining information from many different models, for example, by evaluating the effect of all possible interactions when testing the effect of a single SNP. The method is described in materials and methods and the software for the method is called BAMSE and can be found at http://biostat.ku.dk/∼ande/BAMSE

## MATERIALS AND METHODS

#### Risk sets:

In the context of this method, a risk set is a subset of the parameter space that partitions the individuals on the basis of their genotypes and environmental factors. Let Ω denote all possible sets of SNPs and environmental factors (discrete or continuous). Let the *j*th group of individuals whose genetic and environmental profiles fit a risk set **T**_{j} ⊂ Ω denote a potential risk group *G _{j}*. Individuals not in any of the risk groups are placed in

*G*

_{0}. We set the upper bound,

*m*

_{r}, on the number of risk sets so that the number of risk sets

*n*∈ {0, 1,…,

_{m}*m*

_{r}} is finite. An example of a risk set could be {weight > 90 kg, sex is male, SNP3 is a heterozygote} and the individuals who fit this description constitute a risk group. Given a quantitative trait

*y*for individual

_{i}*i*∈

*G*and assuming that the trait for every individual in a given risk group is independent and normally distributed with equal variance, the observations

_{j}*y*are modeled as where

_{i}*n*is the number of risk sets, and

_{m}*I*is the indicator function. Let be a state from the space

*M*. If the phenotype is binary (case–control design) the phenotype is modeled using a binomial distribution where α

_{j}is the probability for being a case in the

*j*th risk group.

Each risk set **T**_{j} defines a genetic and environmental profile and all individuals must fit this profile to be part of the associated risk group *G _{j}*. The space of the

*l*th component for the

*j*th risk set is defined as Ω

_{l}and it is a finite discretization of a compact subspace of ℝ or a finite subset of ℕ, depending on the

*l*th parameter. For example, a SNP parameter has seven states given that individuals who are heterozygous and homozygous for major and minor alleles are all present in the sample (see Equation A4). This corresponds to one state for each possible combination of genotypes. For the continuous-risk parameters, such as environmental factors, a threshold

*t*can be defined that excludes individuals with observations that exceed this threshold. The space is defined as either = {

*a*:

*a*>

*t*} or = {

*a*:

*a*<

*t*}. Let the

*j*th risk set

**T**

_{j}be the set of the risk parameters for

*n*

_{p}parameters.

In some instances an individual may fit several risk sets. In this case an individual is placed in the risk group with the highest mean α_{i}. Since not all of the observations (SNPs and environmental factors) are necessarily thought to affect the trait in a causative combination, not all of them need to define a risk set. Therefore, only some 's are needed to restrict the *j*th risk group and the number of components, henceforth called the active risk components (for risk set *j*), will therefore be restricted to *n*_{a} ∈ {1, 2,…, *m*_{a}}, where *m*_{a} ≤ *n*_{p}. This means that a risk set can be restricted to a maximum of *m*_{a} active components. If only the *k*th component is active for the *j*th risk set then which means that only the *k*th component restricts the *j*th risk group.

#### Priors:

The means, α_{j}'s, for the risk sets are assumed to be normally distributed with the empirical average as mean and a variance calculated from the length of the range of the observed values *R* so that α ∼ *N*(ξ, κ^{−1}), where κ is a multiple of *R*^{−2} and . The prior for κ is the same prior in Richardson and Green (1997).

For the quantitative traits the variance is chosen to be uniformly distributed on (0, ∞). The priors for the distribution of the active risk components are uniformly chosen. The priors for the number of active risk components *n*_{a} and the number of risk sets *n _{m}* are chosen to be geometrically distributed, and respectively. Both distributions are normalized to sum to one. The priors for components defining the SNP and environmental factors are chosen to be uniformly distributed.

#### The Markov chain:

The chain is updated using the Metropolis–Hasting algorithm with acceptance probability(1)of jumping from the current state *m* to a proposed state *m*′. *p*(*m*) is the prior for the *m*th state, *q*(*m*′ | *m*) is the proposal probability of proposing state *m*′ given the current state *m*, and *L*(*m*) is the likelihood for state *m*. The method is implemented so that the parameters are updated one at a time when possible. However, when updating the number of active components or the number of risk sets, updates of several parameters must simultaneously be proposed. A new risk set can be proposed with a random choice of parameters or a risk set can be deleted, which is sometimes called the *death/birth* move. Details regarding the proposal algorithm are provided in the appendix.

The posterior probability of a certain risk set **T**_{k} can be approximated aswhere *B* is the burn-in, *N* is the number of iterations, and *m _{i}* is the

*i*th sampled state. Likewise, the posterior probability for the number of risk sets being equal to

*x*iswhere is the

*i*th sampled number of risk sets and

*I*is the indicator function. This probability can be used to evaluate whether there are any factors, genotypical or environmental, that affect the trait.

#### Simulations:

To evaluate the sensitivity and selectivity of the method several simulations were performed under different genetic models. Simulations were performed assuming 500 unrelated individuals with 20 uncorrelated SNPs and assuming equal variance for affected and unaffected individuals. The frequency of the minor allele was chosen to be 0.2 for all the SNPs and the mean phenotypical value for unaffected individuals was 100 with a standard deviation of 10. The same variance was chosen for the affected individuals but with varying means. However, in two genetic scenarios linkage disequilibrium (LD) was simulated using coalescent simulations based on the ms program (Hudson 2002), and the frequencies of the SNPs were thus random variables. In these simulations, a region was simulated under a neutral infinite size model assuming a crossover at rate of 1/4*N*_{0} for a 50,000-bp-long segment per generation, where *N*_{0} is the effective population size. A random selection of 20 SNPs with a minor allele frequency >0.05 was included in the nonepistatic scenario. For the epistatic scenario 20 SNPs were selected from two regions and a SNP with a minor allele frequency between 0.17 and 0.23 from each region was selected as a susceptibility SNP. Throughout, a burn-in of 10,000 iterations and a run time of 100,000 were chosen for the MCMC analysis by examining the convergence of the likelihood score and using the *potential scale reduction factor* (PSRF), also called the “shrinking factor”(Gelman and Rubin 1992), for the parameters that are always sampled. The shrinking factor was visualized at several intervals of the chains as recommended in Brooks and Gelman (1998). The risk parameters that were frequently, but not always, sampled were transformed into Bernoulli variables (0 for sampled and 1 for nonsampled) and then tested in the same manner as the other parameters.

To evaluate the power of the MCMC-based method compared with that of a linear model and random forests, sensitivity and selectivity were calculated for the three methods under different genetic models. The result is presented as ROC curves for general use of the MCMC method, the random forest, and a one-locus linear model. We evaluate the ability of the methods to detect any effect in the data and to identify specific susceptibility loci.

The sensitivity and selectivity for the linear model is calculated using an *F*-test comparing the full one-locus model to the null model of no genetic impact (three parameters *vs*. one parameter). For the MCMC method, the posterior probability of at least one risk set [*p*(*n _{m}* > 0 | data)] was used to detect an effect in the data and the posterior probability of SNP

*i*belonging to a risk set divided by the posterior probability for there being at least one risk set [

*p*(

*T*⊂

^{i}**T**| data)/

*p*(

*n*> 0 | data)] was used for identifying the susceptibility SNPs. The estimated increase in mean squared error was used for identifying SNPs using the random forest method and the estimated explained variance was used to detect an effect in the data. The random forest implementation in the statistical software R version 2.3.1 was used. Five hundred trees for each data set were chosen and seven variables were randomly sampled as candidates at each split. When identifying the SNPs, the susceptibility SNPs act as the true positives and the other SNPs as false positives. For detecting an effect in data, the simulations were compared with another set of simulations without any genetic effect. For the linear model the best (lowest)

_{m}*P*-values from the two sets of simulations were compared. The same data sets were used for the three methods.

#### Gene–gene–environment interactions affecting serum triglycerides:

Inter99 is a population-based cohort of 6741 individuals randomly recruited using the central person registry from the western part of Copenhagen County (Glümer *et al*. 2003). Only individuals with Danish ancestry by self report were included. A second group of individuals consisting of type 2 diabetes patients was recruited at Steno Diabetes Center. An oral glucose tolerance test was used to determine the glucose tolerance status of each individual according to the World Health Organization (1999): normal glucose tolerant (NGT), impaired fasting glycemia (IFG), impaired glucose tolerance (IGT), and type 2 diabetes (T2D). Smoking habits were quantitated on the basis of interviews and questionnaires for the Inter99 cohort.

## RESULTS

#### Simulations:

We have performed extensive simulations to evaluate the performance of the new method and compare the results with the performance of the single-locus linear model and the random forest method. Although there are many methods we could also have evaluated we have restricted ourselves to these methods because they can handle quantitative traits and are computationally feasible for the simulated data. The new MCMC method suffers only a slight reduction in power compared to the single-locus linear model assuming a single SNP with a dominant effect or an additive effect (Figure 1). This is true both for the detection of an effect in the data and for the identification of the particular SNP involved. This may be somewhat surprising given that the MCMC method allows for effects of multiple SNPs and interactions among SNPs. In contrast, the random forest method suffers a severe reduction in power compared to the linear model under these parameter settings. Similarly, under additive effects of two SNPs, the linear model and the MCMC method perform almost identically. In contrast, the random forest method has much lower power. In the presence of two interacting SNPs, or two pairs of interacting SNPs, the new MCMC method has a distinct advantage over the two other methods (Figure 2), particularly in detecting an effect. With three interacting SNPs, the effect is even more pronounced. The difference in power is manyfold at a low false-positive rate. However, in the presence of LD the advantage of the MCMC is somewhat reduced, which is due to the varying SNP frequency in the simulations.

In general, the MCMC-based method outperforms the normal linear model when dealing with interactions, especially multiple combinations of interactions, while the linear model in some cases may perform slightly better when dealing with a single susceptibility SNP with additive effects. The random forest method performs better than the linear model at detecting interacting SNPs, but much worse at detecting the effect of a single locus. On the basis of a limited number of simulations the three methods seem fairly robust against phenocopy and small departures from normality (data not shown).

To illustrate the efficacy of the method when the number of individuals is large we simulated 100 SNPs from five regions and 5000 unrelated individuals with effect sizes so that the linear model could not achieve significance after Bonferroni correction. Phenotypes were simulated on the basis of a second- and a third-order interaction, *i.e*., five susceptibility SNPs. A description of the simulated data and the result from the simulation can be seen in Figure 3 and Table 1. Figure 3 and Table 1 show that the MCMC method provides very strong evidence for association [*p*(*n _{m}* > 0) = 1] and that the method easily identifies the five susceptibility SNPs and the two specific combinations. The sampling from the stationary distribution starts within a few thousand iterations, which is <1 min on a standard PC.

#### Gene–gene–environment interactions affecting serum triglycerides:

To illustrate the method we applied it to the data described in materials and methods for a −1131 T > C polymorphism in the *APOA5* gene and a −250 G > A, an IVS1 + 49 C > T, and a Ser215Asn polymorphism in the *LIPC* gene. *APOA5* and *LIPC* are two of several genes where common alleles, many in high linkage disequilibrium, have shown a strong association with serum lipids such as triglycerides and cholesterol (Kao *et al*. 2003; Lai *et al*. 2003; Klos *et al*. 2005; Oliva *et al*. 2005). It has been hypothesized that APOA5 interacts with lipoprotein lipases that hydrolyze the apolipoproteins. Furthermore, it has been shown that APOA5 binds to the lipases (Merkel *et al*. 2005). APOA5 has been shown to be expressed only in the liver (Pennacchio *et al*. 2001), where the hepatic lipase encoded by *LIPC* is also found. Hepatic lipase hydrolyzes triglycerides and has been shown to enhance the uptake of lipoproteins (Thuren 2000). The effect of the −1131 T > C polymorphism in the *APOA5* promoter has shown a consistent association with serum triglycerides in various studies in different ethnic groups. Recently, a small association study (Jiang *et al*. 2005) indicated that plasma glucose levels may interact with an *APOA5* polymorphism, giving higher plasma triglyceride levels among type 2 diabetic patients, but failed to show any interactions among nondiabetic subjects. Association studies of the *APOA5* −1131 T > C variant in the Inter99 cohort have shown that the effect on serum triglyceride of the deleterious allele is modulated by other factors that affect serum triglyceride levels (G. Andersen, T. Sparsø, A. Albrechtsen, S. Castella, C. Glümer, K. Borch-Johnsen, T. Jørgensen, R. Nielsen, T. Hansen and O. Pedersen unpublished results). These factors include glucose tolerance status, gender, and smoking habits. The interactions were found using a linear model with two-way interaction terms. Using a linear model it was not possible to include all possible interaction terms among all factors. No interaction was found with the three SNPs in *LIPC* and the *APOA5* variant using a linear model. However, since the effect of the *APOA5* polymorphism is strongly modulated through glucose tolerance status, gender, and smoking habits it is possible that an association with serum triglycerides can be observed only through higher-order interactions. Therefore, the *APOA5* variant, the three *LIPC* variants, sex, smoking habits, and glucose tolerance status were included in the MCMC method. The adjustment factors age and body mass index (BMI) were also included, assuming a linear relationship. The glucose tolerance status and the smoking habits each have four categories (NGT, IFG, IGT, and T2D and never smoked, used to smoke, occasional smoker, daily smoker, respectively). Both environmental factors were assumed to be discrete ordinal variables.

We applied the MCMC method two times on 5300 individuals without missing data from the Inter99 cohort using 5,000,000 iterations in each run. The two chains gave similar results and the parameters sampled in each iteration can be seen in Figure 4. Convergence diagnostics were performed using the method of Gelman and Rubin (1992) and the multivariate shrinking factor was <1.01. The serum triglyceride levels were logarithmically transformed and two extreme outliers were excluded (6–8 SD from the mean after transformation). The method sampled 6–14 risk sets (see Figure 4B), where 10 was the most frequent sample. All the factors used in the method were frequently sampled (see Figure 4C). Not surprisingly, the environmental factors were sampled in each iteration.

The final results can be seen in Table 2. Even though there was no main effect of the three *LIPC* variants, they all show an association with serum triglyceride levels in combination with both genetic and environmental factors. Using the same criteria for the risk sets as seen in Table 2, plots for the genotypes and mean serum triglyceride levels when *LIPC* −250 G > A, *LIPC* IVS1 + 49 C > T, or *LIPC* Ser215Asn were part of the criteria are shown in Figure 5. The presence of several factors in a risk set does not necessarily imply an interaction but could also be additive effects between the factors. One of the risk sets that indicates a strong epistatic effect is the risk set consisting of smoking nonnormal glucose tolerance status (IFG, IGT, T2D) individuals with a combination of *APOA5* and *LIPC* IVS1 + 49 C/T alleles (see Figure 5A). The (unadjusted) *P*-values using a linear model for these stratifications of the data are also shown for illustration purposes. For risk set 6 we tested for epistasis between these SNPs for the smoking IFG, IGT, and type 2 diabetics using a linear model with covariates for being carriers of the minor alleles. To verify this, using standard methods, we followed up this finding by also performing the test for interactions on another cohort of 1008 IFG, IGT, and type 2 diabetic individuals consisting mostly of type 2 diabetics. A total of 683 of the individuals were smokers. The *P*-value of replication was 0.003 and it was the same allele combination that gave heightened serum triglyceride levels. It should be noted that most of the individuals in the follow-up group are under treatment, which may potentially influence the results.

## DISCUSSION

The MCMC approach appears to have overcome many of the major difficulties in modeling higher-order interactions in large-scale population-based data. For many SNPs no method can explore all of the possible interactions. However, the MCMC method uses the marginal effects for a low-order interaction to find the higher-order interactions. For example, if a fifth-order interaction exists in the data, then the phenotypical mean of the group of individuals with three of the five factors will probably be different from that for the rest of the individuals. The Markov chain will then spend more time in this state and by exploring the “local” area will find the fifth-order interaction (see appendix for updates of the Markov chain).

A related fully Bayesian MCMC method, called BAMA, has recently been developed (Kilpikari and Sillanpaa 2003). However, there are some important differences between our method and BAMA. The BAMA method assumes no interactions among loci, that all alleles have different effects, and that the effects from each locus are additive. Another related method is the Monte Carlo version of logic regression (Kooperberg and Ruczinski 2005). This method uses Boolean expressions to model covariates in a linear model and thus allows for higher-order interactions. Monte Carlo logic regression uses maximum-likelihood estimates for the coefficients in the model. The current MCMC method models all multiple combinations of SNPs and does not assume any linear relationship between the effects of any combination of SNPs and environmental factors. However, our method does allow adjustment factors to be included in the model if a linear relationship is assumed. We chose to compare our method with random forest because it does not make any linear assumptions.

#### Computational speed:

The speed of the algorithm is highly dependent on the number of individuals and the number of sampled risk sets. For the simulated data, where the number of sampled risk sets is rather low and the number of individuals is few (500), application of the method takes <1 min, while under the larger simulation condition (5000 individuals, 100 SNPs) each replicate takes a few minutes. When the method was applied on the Inter99 sample, the MCMC method found multiple risk sets defined by up to four factors, which could not be accomplished using most other methods. For example, modeling all possible interactions using a standard linear model would entail including 2592 parameters in the model.

#### Power:

The main advantage of the MCMC method is that it allows multiple effects simultaneously, thereby reducing the variance and allowing information to be pooled among effects. This gives the method more power than the linear models in most genetic scenarios when dealing with multiple susceptibility SNPs. Simulations showed that the MCMC can have more than twice the power of the linear model to detect two two-way interactions when the average number of false positives is low (<1). When there are interactions present in the data, then many combinations of SNPs will, to some extent, be associated with the trait even without any main effect. However, when compensating for multiple testing only information from the most strongly associated combination is used. The MCMC method combines information from many SNPs and risk sets in the calculation of the posterior probability and, thus, if many combinations are associated, the MCMC method compensates for multiple testing at a much lower cost.

While no method can entirely circumvent the problem of “the curse of dimensionality” (Bellman 1961), our simulations show that the new method is feasible for at least 100 SNPs. Although it may not be worthwhile to entertain the possibility of three-way or four-way interactions in data sets with thousands of SNPs (there are 10^{11} possible three-way interactions for 10,000 SNPs), the method can still be efficiently applied to such large data sets if the state space is constrained to exclude high-order interactions.

The new method is too slow for modeling, for example, 500,000 SNPs even without interactions. Nonetheless, while we see our method as most suitable for candidate gene studies, we also note that application in large-scale genomewide studies is possible. For these studies, we recommend using the method by applying it independently to different genic regions, in addition to using standard methods. If multiple SNPs within the same genic region have an effect, such an application of the method should greatly increase the mapping power compared to methods that analyze each SNP separately.

#### Priors and assumptions:

The MCMC method seems to perform well under a range of different genetic models and does not assume the same genetic model for the different risk sets. The factors used in the method can be either biallelic SNPs or environmental risk factors that can be continuous or discrete ordinal. All environmental factors are treated as binary traits, but the threshold that divides the data need not be defined in advance. The environmental factor can be included in two ways, either as the measured values or transformed to the sorting order of the values. By using the latter, the prior for the number of individuals included or excluded from a risk set is then uniformly distributed where, if the actual values were used, the prior for the threshold would be uniformly distributed in the range of the values of the environmental factor. The priors for the mean of each risk set (η, κ) can have a great effect on the posterior distribution of the number of risk sets (see, for example, Richardson and Green 1997 with discussion). If the mean is chosen as the midrange of the trait instead of the empirical average, the MCMC method is likely to predict a higher number of risk sets. This, however, is highly dependent on κ (the prior variance for the means) or the multiple of κ, where a small multiple gives a rather flat prior distribution.

#### Caveats:

Label switching, where different risk sets partition the data in the same way, is a potential problem in this MCMC method. Defining the risk sets to have a higher mean than the set of individuals not in a risk set ∀*i*: α_{i} > α_{0} largely eliminates this problem. However, a label-switching problem can still remain. For example, a risk group might contain all or none of the individuals, which means the data are partitioned in the same way as if there were no risk sets. This problem can be addressed in the calculation of posterior probabilities by appropriate editing of the MCMC output data, and it is, in any case, partially alleviated by assigning low prior probability to states with a high number of risk sets. In our implementation of the method, the number of nonempty risk sets is calculated.

It is important to note that the posterior probabilities estimated using the MCMC method do not have a frequentist interpretation. For example, in repeated simulations without a genetic effect, a posterior probability of >0.4, for there being at least one risk set, was virtually never observed. Practitioners desiring to report a frequentist *P*-value can do so by applying the method in combination with a standard permutation procedure. For small data sets, similar to the data sets simulated in this article, such permutations are easily completed on a single stand-alone computer. Permutation testing on large and complex data sets would require access to a cluster of processors.

The MCMC method can detect a range of different genetic effects, whether they be main effects, epistatic, gene–environment, or a mixture of them. In most genetic scenarios with multiple causative factors, the method has equal or more power to detect an effect and identify the causal combination of genetic factors than the more conventional linear model. The method can model higher-order interactions and find significant reproducible combinations of both genetic and environmental factors that influence serum triglyceride levels.

## APPENDIX

#### Pseudocode:

The algorithm can be written in pseudocode as

runtime=N

while (runtime – –){

update SNP parameters

update environmental parameters

update a mean for a random risk set

update adjustment factors

update the number of active risk components

update the positions of the active risk component

update the missing genotypes

update a mean for a random risk set

update the number of risk sets using delete or create

mean_update=5

while(mean_update – –){

update of variance

update a mean for a random risk set

}

sample the parameters

}

where *N* is the number of iterations. In each update a proposal update is either accepted or rejected.

#### Update of means:

Since a reasonable choice for a mean should be no higher than the maximum phenotypic value max_{p} and no lower than the minimum phenotypic value min_{p}, the parameter space for a mean is defined as [min_{p}, max_{p}]. The proposed mean of risk set *i* is sampled either from *N*(ξ, κ^{−1}) or from *U*(min_{p}, max_{p}). The latter is used when the acceptance rate of the updates falls below a specified threshold. This may typically happen when the Markov chain visits states with many risk sets.

The acceptance probability for this update is(A1)

To avoid some of the problems with label switching the risk sets can be chosen to have a higher mean than **T**_{0}. When updating the mean for one of the risk sets it is restricted to [α_{0}, max_{p}] and **T**_{0} is restricted to .

#### Update of the variance:

The variance is updated by simulating a uniform *U*(σ − *w*, σ + *w*) random variable, where *w* is some specified constant. If the proposed value, σ′, is outside (0, ∞), *i.e*., less than zero, it is *mirrored* back into the space(A2)where σ′′ is the unmirrored proposal variance sampled from *U*(σ − *w*, σ + *w*). This ensures that *q*(*m* | *m*′) = *q*(*m*′ | *m*) and reversibility.

The acceptance probability for this update is(A3)because the prior densities are uniformly distributed.

#### Update of adjustment factors:

This update is performed similarly to the update of the variance, but without any restrictions in ℝ.

#### Update of risk parameters:

There are seven different possible partitions, excluding the empty space, of the discrete space of the SNP risk parameters, when there are three possible one-locus genotypes,(A4)where partition 7 allows all genotypes. Updates are proposed by sampling from their uniform prior density *U*{1, 2, 3,…, 6} and the acceptance probability is the same as (A3). The updates for being active are proposed separately.

The environmental parameters allow the individuals to enter risk sets if they have environmental values that are below or higher than a given threshold . The *i*th environmental parameter takes values in the interval between the observed minimum and maximum environmental values. Again the proposals are sampled from their uniform prior density and the acceptance probability becomes (A3).

#### Updating the number of active risk components:

In the presence of many environmental and genetic factors, a random risk set is likely to result in an empty risk group. Therefore, it will in most cases lead to significant savings of computational time to define an upper limit for the number of components that are active in a risk set. The maximum number of active components in a risk set, *m*_{a}, can be defined by the user.

The probability of deactivating a given active risk component for risk set *i* when there currently are active components is defined as 1/ and the probability of activating a given inactive risk component is . The acceptance probability for deactivating a risk component becomes(A5)where *p*() is the prior for the number of active components in the *i*th risk set and is the prior for the distribution of the active components.

The acceptance probability for activating a risk component is(A6)

#### Updating the active risk component:

Risk sets are also updated by simultaneously proposing a deactivation of one component and an activation of another component. The components chosen to be activated or deactivated are chosen with equal probability. Because the priors are also uniformly distributed, the resulting acceptance probability is given by (A3).

#### Updating the number of risk sets:

To allow different combinations of genotypic and phenotypic factors to have different effects, we allow multiple risk sets with different means. Using reversible jumps (Green 1995), the Markov chain can jump between parameter spaces of different dimensionality. To ensure reversibility, a random component **c** is used so that the mapping from *m* to *m*′ is *one-to-one*. The mapping, when creating a new risk set, is defined as(A7)where **1** is the identity matrix, **0** is a zero matrix, and . The Jacobian for this mapping is the identity matrix and thus does not need to be included in the acceptance probability.

The number of active risk components in the new risk set is uniformly proposed from {1, 2,…, *m*_{a}}, and the positions of each active risk component are also chosen with equal probability. When proposing a new risk set the mean of the new risk set, the SNPs, and the environmental factors are sampled from their prior density. The acceptance probability then reduces to(A8)where is the prior for the number of active components in the proposed risk set (*i*) and is the probability of proposing this number.

#### Updating missing genotypes:

The priors for the missing genotypes are calculated on the basis of the frequencies of the observed genotypes, *p _{j}*, at locus

*j*and assuming Hardy–Weinberg equilibrium(A9)where

*i*is the number of minor alleles and

*g*is the state of the genotype.

_{j}All the missing genotypes for one SNP are updated at the same time and proposed from the prior distribution so that the acceptance probability is given by (A3).

The priors for the genotypes can also be specified for each individual, which can be very efficient when the SNPs are in LD. These priors can be estimated, for example, using the posterior estimates from the fastPHASE software (Scheet and Stephens 2006).

## Acknowledgments

We thank Charlotte Glümer, Knut Borch-Johnsen, and Torben Jørgensen for generously supplying the Inter99 data. This work was supported by the Danish research counsel and by National Institutes of Health grants NIGMS R01HG003229, U01HL084706.

## Footnotes

Communicating editor: M. Nordborg

- Received February 4, 2007.
- Accepted March 31, 2007.

- Copyright © 2007 by the Genetics Society of America