## Abstract

A novel method for Bayesian analysis of genetic heterogeneity and multilocus association in random population samples is presented. The method is valid for quantitative and binary traits as well as for multiallelic markers. In the method, individuals are stochastically assigned into two etiological groups that can have both their own, and possibly different, subsets of trait-associated (disease-predisposing) loci or alleles. The method is favorable especially in situations when etiological models are stratified by the factors that are unknown or went unmeasured, that is, if genetic heterogeneity is due to, for example, unknown genes × environment or genes × gene interactions. Additionally, a heterogeneity structure for the phenotype does not need to follow the structure of the general population; it can have a distinct selection history. The performance of the method is illustrated with simulated example of genes × environment interaction (*quantitative trait with loosely linked markers*) and compared to the results of single-group analysis in the presence of missing data. Additionally, example analyses with previously analyzed cystic fibrosis and type 2 diabetes data sets (*binary traits with closely linked markers*) are presented. The implementation (written in WinBUGS) is freely available for research purposes from http://www.rni.helsinki.fi/∼mjs/.

WITH the wide availability of markers, association mapping has been increasingly recognized as a primary tool to identify parts of chromosomes that may show a functional relationship to the phenotype (Risch and Merikangas 1996; Flint and Mott 2001; Lohmueller *et al.* 2003). Population-based association studies suffer from confounding due to population stratification (inability to divide variance into within- and among-population components) and genetic heterogeneity (trait loci or their alleles are not unique for the trait). If not accounted for properly, hidden population structure (stratification) may give rise to false positives (Lander and Schork 1994; Cardon and Palmer 2003) and genetic heterogeneity can dramatically disturb or mask the mapping signals (Terwilliger and Weiss 1998; Thorton-Wells *et al.* 2004). This is why both confounding and heterogeneity are probable contributors to the problem of nonreplication in genetic studies of complex traits (Sillanpää and Auranen 2004).

Techniques such as stratified analysis (Clayton 2001), matching (Hinds *et al.* 2004), genomic controls (Devlin and Roeder 1999; Marchini *et al.* 2004), structured association (Pritchard *et al.* 2000a; Sillanpää *et al.* 2001; Hoggart *et al.* 2003), smoothing (Conti and Witte 2003; Sillanpää and Bhattacharjee 2005), use of secondary samples (Epstein *et al.* 2005; Kazeem and Farrell 2005), or approaches based on knowledge of relatives (Ewens and Spielman 1995; Thomson 1995; Knapp and Becker 2003) have been used to overcome the problem of population stratification. (For extensive comparison, see Setakis *et al.* 2006.) Similarly, there are approaches based on relationship information (linkage analysis and identity-by-descent methods), haplotype frequency profiles (Longmate 2001), or smoothing/partition/clustering of haplotypes or alleles (Thomas *et al.* 2001; Morris *et al.* 2002, 2003; Seaman *et al.* 2002; Molitor *et al.* 2003a,b; Durrant *et al.* 2004; Yu *et al.* 2004a,b) that are robust to allelic heterogeneity.

Several model-based and model-free methods consider locus heterogeneity in the context of linkage analysis or family data (Smith 1963; Leal 1997; Grigull *et al.* 2001; Province *et al.* 2001; Schaid *et al.* 2001; Shannon *et al.* 2001; Whittemore and Halpern 2001; Hodge *et al.* 2002; Bull *et al.* 2003; Ekstrom and Dalgaard 2003; Hauser *et al.* 2004; Hoti *et al.* 2004); however, few consider locus heterogeneity in association analysis or case–control data. To prevent confounding due to locus heterogeneity in association analysis, one may apply a subset analysis (Leal and Ott 2000; Rebbeck *et al.* 2004), which, however, requires known subsets or subsets stratified on the basis of external covariates (*e.g.*, expression arrays or proteomics) and lacks some power. Another approach proposed by Schork *et al.* (2001) for the case of unknown subsets is a clustering of individuals using a set of neutral markers (or additional covariates) and incorporating this information into the subsequent association study. Province *et al.* (2001) have suggested use of additional covariates/markers and a recursive partitioning approach for a similar purpose. See Thorton-Wells *et al.* (2004) for discussion of clustering analysis, latent class analysis, and factor analysis in the context of producing homogeneous subsets of the data. To address locus/allelic heterogeneity, Sillanpää *et al.* (2001) suggested a joint analysis, where estimation of hidden population structure from neutral markers and detection of genotype–phenotype associations were performed simultaneously in a single modeling framework. The general problem in using population structure estimation inferred from neutral markers, known covariates (age at onset), or ethnic background to address genetic heterogeneity is that one needs to assume that the structure of genetic heterogeneity for the particular trait follows that of the general population or that given by covariates or ethnic background (Cooper *et al.* 2003; Foster and Sharp 2004). (For the opposite view, see Burchard *et al.* 2003.) This is especially problematic in the presence of missing data (Thorton-Wells *et al.* 2004). Several other assumptions are also required such as that the neutral markers have to show allele frequency difference between and Hardy–Weinberg and linkage equilibrium within the original populations (Pritchard *et al.* 2000b; Sillanpää *et al.* 2001).

One can improve the chances of finding trait loci in association analysis by using multiple gene models (Devlin *et al.* 2003; Kilpikari and Sillanpää 2003) and model selection (Balding *et al.* 2002; Broman and Speed 2002; Sillanpää and Corander 2002). To address genetic heterogeneity, we consider a multilocus association model with the joint estimation of population assignment, but unlike Sillanpää *et al.* (2001) we do not use any additional set of neutral markers or covariates, only the phenotypic model. Such treatment is motivated in situations when a structure of genetic heterogeneity for the particular trait does not follow that of the general population or that given by covariates or ethnic background (Cooper *et al.* 2003; Foster and Sharp 2004; Thorton-Wells *et al.* 2004). That is, the factor that would stratify the subsets is unknown or went unmeasured and it cannot be estimated on the basis of external information—only phenotype carries some information. In unclear cases, one can compare the results in two situations: (1) by estimating the unknown stratifying factor simultaneously in the analysis and (2) by treating the estimated or self-reported ethnic background as a known stratifying factor in the analysis. Our Bayesian approach is based on locus-specific indicator variables (Uimari and Hoeschele 1997; Conti *et al.* 2003; Yi *et al.* 2003a; Yi 2004; Sillanpää and Bhattacharjee 2005), which are used to control inclusion or exclusion of contribution from a particular locus in the multiple-regression model so that exclusion has much higher *a priori* probability. The method could be seen as an extension of the earlier work (Sillanpää *et al.* 2001; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005); see the discussion for differences. The proposed model is implemented using the WinBUGS software (Gilks *et al.* 1994; Spiegelhalter *et al.* 1999) and the performance of the method is illustrated in several settings for *a quantitative trait with a sparse set of markers* and compared with single-group analysis by using simulated data. To illustrate performance for *a binary trait and closely linked markers*, we analyzed real data sets of cystic fibrosis (Kerem *et al.* 1989) and of type 2 diabetes (Horikawa *et al.* 2000; Zöllner and Pritchard 2005).

## MODEL

#### Motivations:

The model presented here is designed to be robust against genetic heterogeneity due to multiple causes: (1) genes × environment interaction, when the environmental exposure is unknown or went unmeasured; (2) genes × gene interaction, when there is no measurement from the stratifying gene; (3) rapid population expansion from a small founding population (say 50 individuals), where two founder individuals (with their own etiologies) are carrying the interesting form of the trait; (4) admixture of two populations (with their own etiologies) in the remote past so that linkage disequilibrium due to an admixture event has already vanished; and (5) structure of the population and the etiological structure of the trait have evolved separately—they have distinct selection histories. In each of these situations, one cannot utilize neutral markers or additional covariates to estimate etiological structure. Nor can one ensure that any of the above conditions are met in practice. However, in the case of no heterogeneity, this model maintains the power comparable to that of the single-group analysis (model of Sillanpää and Bhattacharjee 2005). For practical perspective on prior evidence of genetic heterogeneity, see the discussion.

#### Notation:

Let us consider a set of *M* candidate markers and a vector (*N*_{1}, … , *N _{M}*), where

*N*(≥2) is the number of alleles at locus

_{l}*l*. These candidates may represent a preselected set of haplotype-tagging SNP markers (Meng

*et al.*2003; Lin and Altman 2004) or a set of chromosomal regions or haplotype blocks (International HapMap Consortium 2003, 2005), where different haplotypes (within each region/block) are treated as different alleles. The aim is to find a trait-associated subset of loci among the

*M*candidates for some particular trait that is either quantitative (continuous) or qualitative (binary) type. Because only a discrete set of candidate loci is considered, the associated locus is likely to be just a close candidate that is in linkage disequilibrium with the true trait locus. We use a term QTL generally as a synonym for such a candidate locus linked to the quantitative or qualitative trait. We assume that phenotypes and marker observations have been collected in a set of

*M*marker loci from

*N*

_{ind}unrelated individuals. This sample may consist of individuals from the general population or of cases and controls. In the absence of missing data, marker observations

*m*

^{obs}give complete genotype information

*m*= (

*m*). Here,

_{i}*i*refers to individual and

*m*= (

_{i}*m*

_{i}_{1},

*m*

_{i}_{2}, … ,

*m*), where ) consists of the two alleles (assumed to be known without error) at each marker locus

_{iM}*l*. Note that the alleles and are both in the range [1,

*N*].

_{l}#### Missing-data model:

We assume that there might be some missing observations among the marker genotypes and that missing marker genotypes occur at random and independently within and across markers (in the sense that the probability that the genotype is missing is not dependent on the true genotype pattern at the locus or at any of its neighboring markers). By factorizing the joint distribution of complete and observed marker data *p*(*m*^{obs}, *m*) = *p*(*m*^{obs} | *m*)*p*(*m*), we obtain an indicator function *p*(*m*^{obs} | *m*) and the prior for complete observations *p*(*m*). Following the usual Bayesian missing-data model (Sillanpää and Bhattacharjee 2005; missing-data model 2), the prior distribution for complete genotype data *p*(*m*) under Hardy–Weinberg equilibrium is a multinomial distribution, where the occurrence probability of each allele (allele frequency) within the locus is assumed to be equal. (Note that data augmentation under this model is based on the likelihood of the data.) Given this prior for complete genotypes *m*, we consider only a subset of *m* in which *p*(*m*^{obs} | *m*) = 1. This is equal to assuming that missing value imputations are made conditionally on the observations.

#### Genetic model:

Let us assume two etiology groups (with possibly their own trait loci and/or associated alleles) and that each individual *i* has its own assignment variable with value 1 (*E _{i}* = 1) or 2 (

*E*= 2) indicating assignment to one of the groups. Define an indicator variable

_{i}*I*for group

_{lj}*j*(at marker

*l*), where the value 1 (

*I*= 1) corresponds to the case where the marker

_{lj}*l*at group

*j*is included in the model and value 0 (

*I*= 0) implies exclusion. To model genetic effects, we assume that alleles act additively both within locus (no dominance) and between loci (no epistasis). Each group

_{lj}*j*at each marker position

*l*has its own vector of genetic effect coefficients , where is the coefficient for allele

*a*at marker

*l*at group

*j*, where

*a*= 1, … ,

*N*and

_{l}*l*= 1, … ,

*M*and

*j*= 1, 2. Given the assignment of individuals

*E*= (

*E*) and the group-specific quantities—vector of indicators

_{i}*I*= (

*I*

_{l}_{1},

*I*

_{l}_{2}), overall means α = (α

_{1}, α

_{2}), and effects β = ()—our genetic model with additive allelic effects for observation

*y*(individual

_{i}*i*) can be written as(1)where the variable is 1 is the case that an assignment variable

*E*(of individual

_{i}*i*) equals group

*j*and is zero otherwise; the residuals

*e*, regardless of the individual's etiology group, are assumed to be normally distributed

_{i}*N*(0, 1/τ), with common precision parameter (

*i.e.*, inverse of residual variance). Binary phenotypes can also be considered by using a logit link function and omitting the residuals of the model (1) (see Uimari and Sillanpää 2001 and the Discussion in Sillanpää and Bhattacharjee 2005). We allow the first coefficient () at each locus

*l*and each group

*j*to be unconstrained in the model. For discussion of alternative formulations of genetic (genotype and haplotype) effects, see Sillanpää and Bhattacharjee (2005).

#### Hierarchical model:

Let us have a vector of locus-specific genetic variance components σ^{2} = () over *M* loci and assume a random variance model for the genetic effects β | σ^{2} at each locus (for motivation, see the discussion). Let us also prespecify the prior expectation of the proportion of trait-associated markers among all candidates, denoted as *s*. To hierarchically model assignment variables *E* and adopting ignorance in specifying a uniform prior distribution (with probabilities and ) for the proportions of individuals in each of two groups, we assume an underlying hyperparameter κ_{2} describing a proportion (relative frequency) of individuals that are members in group 2. The simple uniform distribution can be adopted if there is some prior information that sizes of the two groups are equal.

The posterior distribution *p*(*I*, α, β, *E*, κ_{2}, τ, σ^{2}, *m* | *y*, *m*^{obs}, *s*) is proportional to a joint distribution of parameters {*I*, α, β, *E*, κ_{2}, τ, σ^{2}, *m*} and the observed data {*y*, *m*^{obs}}. This relation is known as Bayes' rule and is here conditional on fixed quantity *s*; see below. We make the following conditional independence assumptions: (i) given σ^{2} and *s*, the locus indicators *I* and genetic effects β are independent; (ii) given κ_{2}, the complete marker genotypes *m*, the assignment variables *E*, and the regression parameters {α, τ} are mutually independent; and (iii) given σ^{2}, *s*, and κ_{2}, the locus indicators, the genetic effects, the regression parameters, the assignment variables, and the complete marker genotypes are all mutually independent. Under these conditional independence assumptions that are made *a priori*, one can introduce a hierarchical model as a factorization of joint distribution of parameters and data (Figure 1). More explicitly,Here the likelihood *p*(*y* | *m*, *I*, α, β, *E*, τ) in the case of a quantitative trait is a normal density function and in the case of binary trait is an inverse logistic function (see Uimari and Sillanpää 2001; Sillanpää and Bhattacharjee 2005). The actual likelihood value is obtained by substituting residuals *e _{i}* (= observed − estimated trait value) of genetic model (1) into the likelihood (in the case of a binary trait, the estimated trait values are substituted instead of residuals). The following priors are specified for the parameters. The marker-independence prior for indicator variables

*I*isHere

*p*(

*I*|

_{lj}*s*) is a Bernoulli distribution with parameter

*s*representing a small prior probability (expectation) for a locus to be associated into the trait. We give , which corresponds to a prior belief of one QTL among all candidates. For closely linked markers, like haplotype-tagging SNPs, one could use a marker-dependence prior as presented in Sillanpää and Bhattacharjee (2005), where the value of an indicator is dependent on the other indicators in the region. Prior distribution for genetic coefficients (allele

*a*) were assumed to be normal

*N*(0, ) with locus-specific variance component , which is common for both groups (

*j*= 1, 2). This leads to the joint prior . The prior for genetic variance at locus

*l*was given an inverse Gamma (1, 1), and consequently . We assume that a prior for the assignment variables,

*p*(

*E*| κ

_{i}_{2}), is a Bernoulli distribution with parameter κ

_{2}=

*p*(

*E*= 2) representing a probability of an individual to be assigned in group 2; a prior

_{i}*p*(κ

_{2}) was assumed to be Beta (1, 1). (Alternatively, one could have prespecified the fixed value for κ

_{2}, representing prior belief of both values of individual assignments

*E*being equally probable, implying

_{i}*p*(κ

_{2}) = 1.) This leads to the joint prior . The prior for precision parameter

*p*(τ) was Gamma (1, 1) and the prior for both overall mean parameters

*p*(α

_{1}) =

*p*(α

_{2}) is

*N*(0, 10), and

*p*(α) =

*p*(α

_{1})

*p*(α

_{2}). Note that choice of the priors for the genetic variances, the precision parameter, and the overall means reflect the measurement scale of the trait.

## SIMULATED DATA ANALYSIS

In data generation, we wanted to mimic a situation, where there are two etiological groups that arise because of presence or absence of exposure to some factor (modifier) that is completely unknown or went unmeasured. In such a situation, two groups may have identical genotype distributions (and homogeneous ethnic background) and one cannot utilize neutral markers or self-reported ethnic background to estimate heterogeneity underlying the phenotype. These kinds of data may arise when there is gene–environment interaction. We first describe how the homogeneous population of 1000 sampled individuals in the current generation was generated. Then we explain how we created (sampled) two subgroups from this homogeneous population so that there are between-group differences in trait etiology but no differences in genotype frequencies.

#### Generation of homogeneous population:

We first adopted a simulated marker data set used in Kilpikari and Sillanpää (2003). The data set consisted of 1000 individuals with a complete set of genotypes at 36 multiallelic markers, with recombination fractions in between 0.01 and 0.5. There were varying numbers (two to seven) of segregating alleles at each locus with average heterozygosity of 0.65. These 1000 individuals, which were generated using a backward population simulator (Gasbarra *et al.* 2005), had a common founding population (466 founders) 10 generations ago. The following assumptions were used: Hardy–Weinberg and linkage equilibrium for founders and nonrandom mating and slowly increasing population size in each (discrete) generation. For more details of the original data set, see Kilpikari and Sillanpää (2003) and for the simulator see Gasbarra *et al.* (2005).

#### Simulating etiological subgroups:

We created two etiology groups so that each individual in the homogeneous data set (1000 individuals in total) had ∼25% chance to be randomly assigned into each one of two groups. This sampling process resulted in 244 individuals in group 1 and 220 individuals in group 2. Such a “drop” in the number of study individuals (from 1000 to 464) was partly motivated by the reduced computation time. The quantitative phenotypes in both groups were generated analogously using an additive generating model: phenotype = overall mean + a sum of additive genetic values of the trait loci + residual sampled from a standard normal distribution, *N*(0, 1). The two groups were simulated to have their own values for overall mean, trait loci (exactly at markers), and their genetic effects (see Table 1). Note that only one of the three simulated trait loci was active in both groups and even there the active alleles were different (allele heterogeneity). The recombination fractions surrounding the simulated QTL were the following: 0.5 and 0.5 for QTL at marker 4; 0.02 and 0.2 for QTL at marker 16; and 0.1 and 0.1 for QTL at marker 30. Individual observations of two groups were then combined (464 individuals altogether) and a group status of each individual was “forgotten” in the data analysis stage. In this combined data set, the heritability attributable to each QTL at markers 4, 16, and 30 was 0.37, 0.12, and 0.05, respectively. This corresponds to the overall heritability of 0.55. Even if heritability in this data set may appear to be unrealistically high, the analysis presented here arguably corresponds to the analysis of a larger sample and smaller heritability. The simulated phenotypic distributions of the two groups are shown in Figure 2. Note that these two distributions overlap reasonably well and that the same range of values is covered in both distributions. We refer to this combined data set later in the article as a complete data set.

#### Analyses:

We sequentially increased the amount of random missingness in the genotype data. We considered the complete data set and four other data sets, where 5, 10, 15, or 20% of the marker genotypes of the original complete set were coded as missing. The missingness was introduced in a nested manner so that all genotypes that were missing in the 5% data set were also missing in the 10, 15, and 20% data sets and so on. All five marker data sets (with identical phenotypes) were analyzed with the proposed method. For comparison, a single group analysis was also performed (which is equivalent to constraining all individuals to belong to one group only) for the data set with 5% missing values. Note that the single-group analysis closely corresponds to the methods of Kilpikari and Sillanpää (2003) and Sillanpää and Bhattacharjee (2005) (with marker-independence prior), whose performances are roughly comparable to the frequentist multiple regression approaches (see the above articles for details).

The estimation of the model parameters was performed in WinBUGS 1.3 (Gilks *et al.* 1994; Spiegelhalter *et al.* 1999) using a Pentium 4, 2.8 GHz. We used random initial values in the analyses. In all analyses, we ran two Markov chain Monte Carlo (MCMC) chains of length 25,000. The first 5000 “burn-in” iterations were discarded from each chain, which resulted in 40,000 pooled MCMC samples in total. (In analysis of the data set with 10% missing values we utilized only samples from a single chain due to reasons explained in results.) We stored all the MCMC samples (“no thinning”), because of a sufficient storage capacity and a low autocorrelation in the samples. Two MCMC chains were run in parallel, which took ∼40 hr for the complete data and up to 60 hr for the 20% missing data. However, the same number of iterations for a single-group analysis took only ∼8 hr (with 5% missing data). The convergence assessment was performed by visually monitoring chains for several different parameters.

## SIMULATION RESULTS

In Table 2, one can see the estimated posterior probabilities for different markers to be associated on the phenotype (*i.e.*, QTL probabilities) at a group level and the posterior expectation (over markers) for the number of QTL in given groups. Table 2 illustrates the effect of cumulative missingness on these probabilities. In the complete data and the data set with 5% missing values, only the correct QTL positions for the two groups (markers 4, 16, and 30) are supported with QTL probability 1 and there is very little support for other positions. For comparison, the single-group analysis in the data set with 5% missing values (not shown in Table 2) resulted in nonzero QTL probabilities for markers 4, 16, 20, and 30 with values 1.0, 0.4, 0.1, and 0.5, respectively. With increasing missingness in Table 2, also spurious QTL positions (markers 13 and 29) gathered more support (with QTL probability ≤0.6). The analysis of the data set with 10% missing values showed us that we cannot clearly identify the two different genetic mechanisms in this case (if we utilize samples from both chains).

#### Labeling problem:

A well-known problem in this kind of population assignment method is the weak identifiability of the labels of the two groups (see Pritchard *et al.* 2000b; Stephens 2000). This means that sometimes the group configurations obtained in one analysis end up as their mirror image in another analysis just because of the symmetry in the likelihood of the parameters (Figure 3). Following the standard practice of sophisticated population assignment methods in genetics, like those of Pritchard *et al.* (2000b), there is nothing in the model, except the initial values of the MCMC sampler, to attach individuals of group one to the group with index one. The suggested solution for this problem is to impose some order constraints on one of the parameters such as constraining the groups to have increasing means or increasing proportions of individuals in the groups (Richardson and Green 1997; Stephens 2000). However, R. M. Neal's comment in Kass *et al.* (1998) was strongly against using such constraints, because they can do serious harm to the convergence. Here such constraints presumably do not provide good identifiability for the groups and complicate interpretation, because the information with respect to the groups at some loci seems to be too weak on the basis of our experiences with the secondary analysis (see conditional analyses below). In unconstrained models such as the one here, it can actually be preferable that the MCMC sampler is mixing poorly with respect to symmetry of the two groups (*i.e.*, label switching does not occur), because it simplifies interpretation of the results (see Pritchard *et al.* 2000b). Note that Celeux *et al.* (2000) have suggested a tempering scheme for assignment models to improve the mixing properties of the sampler. As is typical in population assignment methods (see Pritchard *et al.* 2000b), the label switching did not occur within any single MCMC chain in our analyses. Such a behavior was found here only between two separate MCMC chains with the data set with 10% missing values.

#### Conditional analyses:

With real data, one may not be able to decide if the estimated genetic architecture is unclear (not unique) because of the label switching or because of the complexity of the underlying genetic architecture. Even though it was known in our simulation study that the label switching was responsible, we still performed the additional analyses with the data set having 10% missing values to illustrate one possible approach. Three additional analyses were performed so that a locus-indicator pair (corresponding to two groups) at a single locus of the three markers (4, 16, and 30) was kept fixed throughout the MCMC estimation process. These three conditional analyses are “short cuts” for exhaustive enumeration of analyses where all possible combinations of the gene actions over these three loci are fixed one at a time. However, we can still gain some further information about the genetic architecture this way. Two parallel chains were run with 10,000 iterations in each (the first 2500 iterations from both were discarded, as burn in). This resulted in 15,000 effective MCMC samples in total.

In the first conditional analysis, where two group-specific locus indicators of marker 4 were simultaneously fixed to values of *I*_{4,1} = 1 and *I*_{4,2} = 0, the analysis was able to reconstruct the true underlying genetic architecture extremely well, having well-structured QTL probabilities of the two groups at markers 16 (1.0, 1.0) and 30 (0.1, 1.0) and a negligible support for marker 13 (0.1, 0.0), respectively. In the second analysis, where two indicators of marker 16 were fixed to values of *I*_{16,1} = 1 and *I*_{16,2} = 0, several nonzero QTL probabilities appeared at new (incorrect) marker positions in particular for group 2, indicating that the locus indicator *I*_{16,2} should not be zero. This was further supported also by the QTL probabilities in group 1 for markers 4 and 30, which were 0.5 and 0.5, respectively. In the third analysis, where the two indicators of marker 30 were fixed to values of *I*_{30,1} = 1 and *I*_{30,2} = 0, the problem of label switching occurred again even after the constraint so that two MCMC samples were mirror images of each other (excluding the fixed locus 30). This behavior might be an indication of the weak genetic effect of marker 30 (*cf.* the simulated effect sizes of Table 1). However, one of the two MCMC chains converged to the same structure supported by the first conditional analysis (of marker 4), which also happened to be the true underlying structure.

#### Summarizing QTL effects:

Because a random variance model was used for QTL effects, it is natural to look first at the posterior estimates of genetic variances and then at the sizes of the individual effects. As expected, the estimated genetic variances (their posterior means) at positions with low QTL probabilities are very close to their prior mean 1 (the results not shown). This is because in our model the data do not influence the genetic variance estimate in the MCMC rounds where the corresponding locus indicator value is zero. Due to (i) assuming *a priori* independence of the indicators and genetic effects (*cf.* Kuo and Mallick 1998), (ii) assuming only a single genetic variance parameter for each marker, and (iii) assuming a nonzero prior mean for the genetic variance, we summarize our results in two groups in the form of weighted genetic variances () in Figure 4. The weighted genetic variance is actually the model-averaged estimate of the genetic variance (averaged over all models with the effect set to zero in models where the marker for a given group was not selected) (Sillanpää and Bhattacharjee 2005). See the discussion for the motivation for points i and ii above. (Because of point iii above, we expected to obtain better mixing properties for the sampler and avoid confounding between locus indicators and genetic effects, leading to better estimates of QTL probabilities.) The estimates of Figure 4 are based on the additional analysis with two parallel MCMC chains (both of length 1500 and no burn in), where the last states of parameter values (of earlier 25,000 MCMC rounds) were used as starting values. (In practice this can be seen also as discarding 25,000 MCMC samples as burn in from each chain.) For a comparison, the weighted genetic variances from the single-group analysis (in the data set with 5% missing values) are also shown (Figure 4A).

#### Dissecting allelic heterogeneity:

The posterior estimates of locus-specific allelic effects at two groups for complete data are shown in Figure 5. In the top, one can see that only three simulated QTL seem to have nonnegligible estimated allelic effects, which are shown in more detail below (Figure 5, bottom). Since we did not impose any constraints for the coefficients, one should interpret the graphs with respect to that. (One can impose constraints afterward by setting the first coefficient to zero and looking at differences (contrasts) of estimated coefficients at each locus.) If the constraints are imposed afterward, then only four true alleles seem to have nonnegligible effects (allele 4 at marker 4, alleles 1 and 3 at marker 16, and allele 5 at marker 30), which closely correspond to the simulated values of Table 1. In Figure 5, bottom, we observe that marker 16 is the only one showing evidence of allelic heterogeneity and the evidence is with respect to the correct alleles (1 and 3).

#### Concordance in assignments:

To illustrate how well the individual assignments can be identified from five different data sets, Table 3 presents the numbers of correctly and incorrectly classified individuals in each group. The estimated group for each individual is based on the posterior mean estimate of the group assignment. The posterior mean estimated proportion of individuals belonging to the smaller of the two groups is also shown (true value is 0.47). Note that the proportion of incorrectly classified individuals is slowly increasing with the amount of missing data, but the “prop of inds”, which equals *p*(κ_{2} | *y*, *m*^{obs}, *s*), does not seem to do so. We would have expected to see this behavior in the case that the prior value for the proportion of individuals belonging to the one of the two groups was 0.5, corresponding to a uniform prior on the group assignment variable (*E _{i}*).

#### Estimated overall means:

Table 4 shows how well the group-specific overall mean parameters (true values are 2.3 and −2.3) were estimated from five different data sets using the proposed approach and the single-group analysis (with the constraint that all individuals belong to group 1). Although only the posterior means are shown, they can still illustrate how estimates have been shrunk toward zero by increasing the amount of missing values or assuming only a single group. Note that the single-group analysis was performed only for the data set with 5% missing observations.

## REAL DATA ANALYSIS

#### Cystic fibrosis data, model, and analysis:

As in Sillanpää and Bhattacharjee (2005), we selected a well-known cystic fibrosis (CF) data set (Kerem *et al.* 1989), with binary phenotype and 23 biallelic markers collected from 93 individuals. The markers in this data set span the 1.7-Mb region surrounding the cystic fibrosis transmembrane regulator (CFTR) gene on chromosomal segment 7q31. The data set contains the haplotype information (without individual identities) and the physical distances (Kerem *et al.* 1989; Morris *et al.* 2000), and there is some degree of missing alleles. As earlier (Sillanpää and Bhattacharjee 2005), we use a marker-dependence prior (with ) to utilize physical distances and the model where each individual contributed two independent observations (phenotype and one allele in each locus) to the analysis. The prior for the overall smoothing parameter is Gamma(1, 0.01) with prior mean at 100. A difference from our earlier analysis (Sillanpää and Bhattacharjee 2005) is that here the haplotype information is utilized—each haplotype is classified into one of the two etiology groups. By using such an independent-observation idea (Sasieni 1997), sample size is doubled, only a single allelic coefficient is fitted in each selected locus for each observation, and estimated effects are approximately double in size. Note that here we are effectively carrying out a heterogeneity analysis for the alleles with respect to their parental origins. Note also that, unlike Sillanpää and Bhattacharjee (2005), we use a random variance model for genetic effects and the prior for missing values where each allele is considered to be *a priori* equally probable at each marker locus. Because of the binary phenotype, we do not have prior for precision parameter. Otherwise, we use the same priors as in the simulated data analyses.

The parameter estimation was done in WinBUGS 1.3 (Gilks *et al.* 1994; Spiegelhalter *et al.* 1999), using a Pentium 4, 3.40 GHz. Two parallel MCMC chains each of length 7800 were run with random initial values that took ∼29 hr. Because of 300 burn-in iterations per chain and no thinning, this resulted in 15,000 pooled MCMC samples in total. No evidence of label switching or convergence problems was detected by our visual inspection of MCMC chains for several different parameters.

#### Results of CF data:

In Figure 6, we present values of weighted genetic variances estimated for two etiology groups. Only locus 2 shows a highly elevated value in group 1 whereas we can see two elevated peaks at positions 10 and 17 in group 2. The estimated proportions of haplotypes classified into two groups, respectively, were ∼0.2 and 0.8, and individual assignment probabilities were found to be surprisingly high, making unambiguous membership identification possible for most of the haplotypes. The two positions (10 and 17) as well as their weighted genetic variance peaks in group 2 are very similar to what was found in the single-group analysis of Sillanpää and Bhattacharjee (2005). (Note that only QTL probabilities were shown in Sillanpää and Bhattacharjee 2005). However, one can see how the peak of position 2 in group 1 has grown much higher in the two-group analysis, changing the overall conclusion for that position. Note that Lazzeroni (1998) also found a notably high peak at position 2.

Further exploration of the haplotype data was performed for a subset of haplotypes that showed very high assignment probability (>0.75) to one of the two groups. This led us to 169 classified haplotypes of a total of 186. This exercise revealed that although locus 2 was identified as an influential position in group 1, allele frequencies at this marker were not visibly different for two groups when compared to frequencies calculated from the whole data set. Surprisingly, it was locus 17 that showed a remarkable allele frequency difference between the two etiology groups (suggesting that position 17 could actually be a stratifying factor). Moreover, a large proportion of haplotypes in group 1 were observed to have allele 2 at locus 17 co-occurring with allele 1 at locus 2 (suggesting the existence of epistatic interaction between loci 17 and 2). At locus 10, we observed a minor increase of frequency of allele 2 in group 1 and of allele 1 in group 2 but the difference was much milder than that at locus 17.

Finally, we wanted to check whether the estimated stratification of haplotypes (in the two groups) is connected to any of the known stratifiers—mutation subsets presented in Table 3 of Kerem *et al.* (1989). The membering CF haplotypes are identified according to whether they carry the Δ*F*_{508} or other mutations and are additionally within each of two mutation groups—those with pancreatic insufficiency (PI) or pancreatic sufficiency (PS). The inspections were done by monitoring the posterior estimated number of patient haplotypes that were classified to group 1 and that also had the non-Δ*F*_{508} mutation, posterior estimated number of haplotypes classified to group 1, and so on. (This monitoring was based on a separate MCMC run.) On the basis of the posterior estimates we observed that the frequency of CF haplotypes containing the non-Δ*F*_{508} mutation showed clear enrichment in group 1 (the smaller group with locus 2 as the main QTL) while the frequency of haplotypes carrying the Δ*F*_{508} mutation showed no difference. From this we may conclude/suggest that if there is epistatic interaction between loci 17 and 2, it is likely to occur in haplotypes with the non-Δ*F*_{508} mutation on the CF chromosome.

#### Type 2 diabetes mellitus and model:

These data were first published in a positional cloning study of Horikawa *et al.* (2000) and a subset of data has been reanalyzed by Zöllner and Pritchard (2005). We use the same subset of data as Zöllner and Pritchard (2005), which have a binary phenotype (108 cases and 112 controls) and 85 SNP markers (with physical distances) spanning the 876-kb area in the NIDDM1 region on chromosome 2. However, there is one important difference between our approaches: before the actual association analysis, Zöllner and Pritchard (2005) completed missing alleles with their most likely values and estimated haplotypes using the PHASE program whereas we use raw genotype data with missing values directly in our analysis. We use the model where every individual contributes a single observation (phenotype and two alleles at each locus) to the analysis. We use a random variance model for genetic effects and the prior for missing values where each allele is considered to be *a priori* equally probable at each marker locus.

The parameters were estimated using two different values of shrinkage parameter: (shrinkage) and (no shrinkage) in WinBUGS 1.3 (Gilks *et al.* 1994; Spiegelhalter *et al.* 1999), using a Pentium 4, 3.40 GHz. Two parallel MCMC chains each of length 5500 and 3750 (no shrinkage) were run with random initial values, which took ∼25 sec per iteration (with two parallel chains). By having burn in of 500 and 2500 (no shrinkage) iterations per chain and no thinning, the estimations were based on 10,000 and 2500 (no shrinkage) pooled MCMC samples in total. Again, no problems in label switching or convergence were detected.

#### Results of diabetes data:

The analysis with shrinkage parameter and marker-independence prior (no distances) ended up with a clear signal at 269 kb in group 1. The estimated proportion of individuals therein is 0.73. To see if any additional positions can be found by relaxing the stringency on the shrinkage parameter, another analysis was executed with the shrinkage paramameter . (Note that half of the genes are assumed to be associated *a priori* under this setting.) As presented in Figure 7, two additional positions at 160 and 161 kb were identified to be active in both groups. One can see that there are weak signals also at 269 and 400 kb in group 2. The estimated proportion of individuals in group 1 is 0.62. The estimated position of disease mutation in Zöllner and Pritchard (2005) is at 131 kb and the three SNPs that make up the haplotype at CAPN-10 (Horikawa *et al.* 2000) are located at 121, 124, and 134 kb. Overall, our estimates are somewhat to the right of their estimates. Zöllner and Pritchard (2005) emphasized that the presence of other genes cannot be excluded because their signal was only modestly significant.

Finally, the complementary analysis was executed with the model using the marker-dependence prior (with distances). This analysis seemed to confirm the locations that were already found in the marker-independence prior analysis (details and results not shown).

## DISCUSSION

#### Presence of genetic heterogeneity:

Generally the most efficient ways to reduce genetic heterogeneity of the trait in the sample are applicable only in the design stage of the study. Examples of those practices are: (1) careful phenotype definition (Leboyer *et al.* 1998; Sillanpää 2002) based on expression profiling (Kraft and Horvath 2003) or proteomics data (Semmes 2004), (2) careful choice of study population (Wright *et al.* 1999; Peltonen 2000; Shifman and Darvasi 2001), and (3) careful sample collection such as utilization of the known history of the population isolate in ascertainment (Heath *et al.* 2001). In this article, we emphasize that we have considered the case where such preventative actions cannot be applied or they have not been sufficient in homogenizing the sample. Moreover, one can consider using this method as a test of presence of genetic heterogeneity or as a primary analysis when there is already some prior evidence or a hint about genetic heterogeneity obtained by statistical or biological means: (1) association testing cannot find a signal, (2) findings from earlier studies cannot be replicated or are contradictory, (3) the population consists of individuals from several ethnic backgrounds, or (4) there is multimodality of the phenotype distribution. Note that one can also use this approach to argue if (self-reported or estimated) ethnic background can be used as a known factor giving specification for the etiological groups.

#### The presented method:

We have presented a new method for studying multilocus association between multiple marker loci and a quantitative or binary trait in the presence of genetic heterogeneity. The method can handle multiallelic marker loci with some degree of missing genotypes. On the basis of tested examples, the amount of missing data should not be too large (roughly not more than 15%). We expect this to be even more critical in binary traits, where phenotype carries less information. This method can be implemented with WinBUGS (Gilks *et al.* 1994; Spiegelhalter *et al.* 1999), which automatically calculates data-specific tuning parameters needed in a Metropolis–Hastings random walk (Chib and Greenberg 1995). Additionally, by assuming *a priori* independence of the indicators and genetic effects (*cf.* Kuo and Mallick 1998), we can avoid the existence of other difficult tuning parameters like those needed in stochastic search variable selection (George and McCulloch 1993; Yi *et al.* 2003a). The method allows for use of external covariates such as age, sex, environmental risk factors, gene-expression measurements (see Hoti and Sillanpää 2006), and dominance in the genetic model. However, we do not claim that this approach is robust for bias of population stratification although logistic regression should be a relatively safe approach in the case of binary phenotype (see Setakis *et al.* 2006). If marker data have been collected from some specific chromosomal interval, one can control for the problem of population stratification (and other confounders) by using the marker-dependence prior of locus indicators to incorporate physical or genetic map distances, similarly as in Sillanpää and Bhattacharjee (2005). In other cases, we propose the use of self-identified ethnicity or structured association to handle the problem of population stratification (Pritchard *et al.* 2000a,b; Sillanpää *et al.* 2001; Corander *et al.* 2003, 2004; Hoggart *et al.* 2003). Note that a self-reported population identity given by a person (or identity estimated by STRUCTURE or BAPS software) is a putative classification covariate that could be handled as an extra “marker locus” in the presented method.

#### Stratifying factor and label switching:

As an output, our method provides estimates for group-specific genetic models (QTL) and individual assignments to the groups. In the estimated groups and their QTL, one cannot distinguish if the group-specific locus was identified because of genetic heterogeneity or because of interaction between the locus and some unmeasured (or unused) environmental covariate (see Thorton-Wells *et al.* 2004). One can, however, check this afterward with respect to existing (but unused) covariates that are available, but generally this is an intractable problem. One should keep this in mind when conclusions are drawn from the analysis. Like some other sophisticated population assignment methods in genetics, our method is likely to suffer from the problem of label switching between different MCMC chains (Pritchard *et al.* 2000b; Stephens 2000). We did not encounter label switching within any single MCMC chain in our test examples. This (no label switching) simplifies interpretation of the results (groups have labels) but it is also an indication of poor mixing of the MCMC sampler, which one should be aware of and monitor carefully. When label switching occurs across chains (but not within a chain), one should calculate posterior estimates on the basis of samples from a single long MCMC chain instead of pooling samples together from several shorter MCMC runs (even if monitoring the convergence and detection of this problem is based on running several MCMC chains). Moreover, one can also apply the conditional analysis for unclear cases to dissect the underlying structure further, as was illustrated in results. As a conclusion, we emphasize that even if label switching within a single MCMC chain happens, one can still obtain valuable information about the existence of genetic heterogeneity from this analysis when compared to the results of single-group analysis. That is, if the same set of loci is supported by both groups and this set differs from single-group analysis, one can conclude that it is likely that there is genetic heterogeneity and label switching.

#### Computation:

As was illustrated in the examples, our approach requires a substantial amount of computation time. Therefore, the current WinBUGS implementation may not directly scale to extremely large sizes of data (with thousands of individuals) or large marker sets (with several hundreds of markers). Thus, for large-scale data sets: (1) the size of the problem can be reduced by focusing on a subgroup of genes (*e.g.*, on the basis of known pathways, specific genomic regions, or preliminary screening or dimension-reduction techniques), and/or (2) one can pursue faster implementation of the method by using C-language, and (3) one can concentrate on the estimating mode (maximum *a posteriori*, MAP) of the posterior distribution only. However, an extra level of difficulty in methods 2 and 3 above (avoided here by WinBUGS) is to find a good updating scheme for the MCMC sampler. Finally, it is also possible in a Bayesian framework to first divide individuals in the data into the smaller data sets and analyze them separately. Because the posterior distribution (results) from one analysis can always be taken as a prior distribution to the next analysis, it is possible to combine separate analyses together afterward and to produce still a meaningful summary. However, we do not recommend such a “split and combine” approach here, because of the weak identifiability of the groups and potential labeling problems.

#### Differences between methods:

Because the proposed method can be seen as an extension of the earlier work (Sillanpää *et al.* 2001; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005) we briefly comment on the differences between these methods:

Population assignment: Sillanpää

*et al.*(2001) also considered a population assignment as here but the set of neutral markers were included as an extra information source (which is useful for correcting population stratification but not neccessarily for genetic heterogeneity). Kilpikari and Sillanpää (2003) and Sillanpää and Bhattacharjee (2005) considered only a case of single-population analysis.Locus indicators and a random variance model: Sillanpää

*et al.*(2001) and Kilpikari and Sillanpää (2003) employed the reversible-jump MCMC for candidate (QTL) selection and a fixed-variance model (fixed-effect model) for QTL effects, whereas Sillanpää and Bhattacharjee (2005) as well as the approach here utilized locus-indicator variables for genetic model selection and a random-variance model (variance-component model) for QTL effects. However, the reversible-jump MCMCs in these approaches were implemented in a way that resembles locus-indicator models (see Kilpikari and Sillanpää 2003).Two groups share a common variance: The special property of the model presented here is that the group-specific effects (at each locus) are assumed to be exchangeable; that is, they share a distribution with the common variance component. In the case that the QTL (marker) is active in both groups, all observations contribute to the estimation of the corresponding genetic variance. This means that in the homogeneous case, where the same sets of QTL are simultaneously present in two groups, the approach can maintain the power comparable to that of the single-group analysis. Sillanpää

*et al.*(2001) considered an individual parameter set for each group, which is not optimal in the homogeneous case.

#### Number of groups:

In the analysis, we have assumed that the number of etiology groups is two and that the residual variance (precision) parameter is common for both groups. If the number of true underlying groups is one this is not a problem (see above paragraph). However, the method can also be generalized for a larger (more than two) number of groups if there is some *a priori* information about the groups and identifiability is not a problem. Nevertheless, the number of distinct modes of the phenotype distribution should not be directly used as a number of groups (*cf.* Figure 2). For a large number of groups, it may still be best to assume only two groups but use group-specific residual variances. This kind of model would follow the spirit of an admixture model, where one tries to fit a common disease model only for the homogeneous part of the data and puts the rest of the data into a “junk group,” which can include phenocopies and all kinds of individuals from multiple genetic backgrounds.

#### Epistasis:

We share the view (Sillanpää 2002; Moore 2003; Sillanpää and Auranen 2004; Thorton-Wells *et al.* 2004) that the dissection of complex traits requires novel methods that are capable of handling both the genetic heterogeneity and complex genetic architecture. The current approach does use multilocus estimation and can account for genetic heterogeneity but does not explicitly model epistatic interactions. Although it may be easy to include known pairs of epistatic markers in the model (*e.g.*, based on database information on known pathways; see Thomas 2005), detection of epistasis is generally a difficult task that requires a suitable combination of the following: specific designs, large data sets, appropriate statistical modeling, and high computing power. In spite of these difficulties, an extension of the presented approach that includes epistasis and utilizes the modern ideas of stochastic partitioning where genotypes from multiple loci are stochastically pooled into a smaller number of classes (see Moore 2003; Sillanpää and Auranen 2004) would be worth of considering in the future. For current Bayesian interaction approaches, see Conti *et al.* (2003), Yi and Xu (2002), Yi *et al.* (2003b, 2005), Narita and Sasaki (2004), and Zhang and Xu (2005).

The model specification code (written in WinBUGS) is freely available for research purposes at http://www.rni.helsinki.fi/∼mjs/.

## Acknowledgments

We are grateful to Nancy Cox, Jonathan Pritchard, and Sebastian Zöllner for their help with the diabetes data; to Andrew Morris, Kung-Yee Liang, and Yen-Feng Chiu for their help with the CF data; and to Andrew Thomas, Duncan C. Thomas, and two anonymous referees for their constructive comments on the manuscript. This work was supported by research grant no. 202324 from the Academy of Finland.

## Footnotes

Communicating editor: K. W. Broman

- Received May 24, 2006.
- Accepted August 28, 2006.

- Copyright © 2006 by the Genetics Society of America