## Abstract

A Bayesian method for fine mapping is presented, which deals with multiallelic markers (with two or more alleles), unknown phase, missing data, multiple causal variants, and both continuous and binary phenotypes. We consider small chromosomal segments spanned by a dense set of closely linked markers and putative genes only at marker points. In the phenotypic model, locus-specific indicator variables are used to control inclusion in or exclusion from marker contributions. To account for covariance between consecutive loci and to control fluctuations in association signals along a candidate region we introduce a joint prior for the indicators that depends on genetic or physical map distances. The potential of the method, including posterior estimation of trait-associated loci, their effects, linkage disequilibrium pattern due to close linkage of loci, and the age of a causal variant (time to most recent common ancestor), is illustrated with the well-known cystic fibrosis and Friedreich ataxia data sets by assuming that haplotypes were not available. In addition, simulation analysis with large genetic distances is shown. Estimation of model parameters is based on Markov chain Monte Carlo (MCMC) sampling and is implemented using WinBUGS. The model specification code is freely available for research purposes from http://www.rni.helsinki.fi/~mjs/.

METHODS for association-based gene mapping use genetic markers that lie near the causative genes as gene representatives in their phenotypic models. How successful these methods are, or how much the original gene effect is reduced when measured indirectly through the closest marker, depends on existing covariance (linkage disequilibrium, LD) between the marker and the gene (Tanksley 1993). The magnitude of LD again depends on factors like population history and the distance between the two loci. In association analysis we are especially interested in using LD due to close linkage of loci as a detection signal, which depends according to exponential decay on the distance between the two loci and would give its highest value at a gene position. Thus, we want to exclude confounding effects (*e.g.*, mutation, selection, genetic drift, population structure, and variations in allele frequencies) from the association signal (measured LD).

Association analysis has been recognized as an important and complementary tool for mapping genes in human, animals, and plants (Risch and Merikangas 1996; Flint and Mott 2001). Effective methods are currently available for finding trait-associated subsets of candidate markers using collected samples of equally related or unrelated individuals (Ball 2001; Piepho and Gauch 2001; Sen and Churchill 2001; Broman and Speed 2002; Devlin* et al.* 2003; Kilpikari and Sillanpää 2003; Xu 2003; Yi* et al.* 2003). These methods use modern procedures for candidate selection but they do not account for covariance between closely linked markers or control fluctuations in association signals. Accounting for such factors may improve gene localization in fine-mapping studies, where a large number of markers have been collected from small chromosomal regions.

Conti and Witte (2003) considered the covariance between the loci in their semi-Bayesian approach. Instead of trying to find some initial evidence for the gene, they wanted to increase the ability to localize disease genes in situations where support for some particular genome region has already been established by statistical or biological means. In their method, the genetic effect coefficient, which is a pairwise measure of LD, is first estimated for each locus using a first-stage model and then spatially smoothed along the candidate region using a second-stage model that can include information on genetic or physical distances (and/or haplotype blocks). This smoothing approach does not require knowledge about haplotypes (linkage phases) and it can control fluctuations in pairwise measures of LD (effect estimate) that may arise from reasons other than tight linkage of loci. Such reasons include population history, structure, and events, as well as allele frequency differences between loci and small sample size (Clayton 2000; Greenland* et al.* 2000; Nordborg and Tavare 2002; Conti and Witte 2003). Unfortunately their approach is feasible only for biallelic markers (where it is easy to infer coupling of alleles between loci) and one cannot consider more than a single locus at a time in the first-stage model (because other loci would confound the pairwise signal). For other approaches that consider covariance between neighboring loci, see Lazzeroni (1998) and Cordell and Elston (1999).

Utilization of marker distances is common in fine-mapping methods, which either assume known linkage phases or technically sum (integrate) over all possible haplotype configurations that are consistent with the genotype data (Rannala and Slatkin 1998; McPeek and Strahs 1999; Service* et al.* 1999; Morris* et al.* 2000, 2003a,b; Thomas* et al.* 2003; Durrant* et al.* 2004). These methods account for covariance between haplotypes (LD due to common population history) and often consider putative gene positions that can be different from the marker positions. Instead of trying to consider more than a single gene locus at a time in their disease model, these approaches concentrate on reconstructing ancestral disease haplotypes or modeling related recombination histories. The fine-scale LD mapping methods generally assume a discrete phenotype (unlike Meuwissen and Goddard 2000) and ignore the population stratification (Lazzeroni 2001).

We consider only marker positions as putative gene loci in our association-based method. Our approach is based on Bayesian analysis with multiple-trait loci, where locus-specific indicator variables are used to control inclusion or exclusion of a particular set of allelic coefficients in the multiple-regression model. In general, several approaches utilize locus-specific indicator variables for model selection (Uimari and Hoeschele 1997; Conti* et al.* 2003; Yi* et al.* 2003; Meuwissen and Goddard 2004; Yi 2004). Instead of following others and requiring that the indicators are mutually independent *a priori*, we allow their prior dependence structure to exploit the distance information. This kind of prior is motivated by its ability to control fluctuations in association signals (LD measured from locus indicators) along a candidate region (see Figure 1). [Note that this treatment makes our approach directly applicable for genotype data from multiallelic (polymorphic) markers, unlike that of Conti and Witte (2003), who modeled marker dependency in effect estimates.] To control the number of selected markers, we let the values of the indicators have a tendency to shrink toward zero; this idea is modified from the approaches of Meuwissen* et al.* (2001) and Xu (2003). The presented model was implemented using the Markov chain Monte Carlo software package WinBUGS (Gilks* et al.* 1994; Spiegelhalter* et al.* 1999). To illustrate the performance of our approach we analyze the well-known cystic fibrosis data set of Kerem* et al.* (1989), utilizing available physical distances but assuming only genotype data. Additionally we analyze Friedreich ataxia data of Liu* et al.* (2001) and simulated data of Kilpikari and Sillanpää (2003) with genetic distances and assuming no haplotype information.

## MODEL

### Notation:

Let us consider a trait, either continuous or binary, and a small candidate region *R*, which consist of a discrete set of *M* marker loci where the putative genes (trait loci) can be positioned. We use a generic term QTL for a trait locus, with influence on a quantitative or qualitative trait, and for the closest candidate marker that is in strong LD with a trait locus. Let us denote a vector of phenotypes by *y* = (*y*_{1}, *y*_{2}, … , *y*_{Nind}) and a vector of marker observations by consisting of measurements from *N*_{ind} different individuals and *M* marker loci. In the case of no missing data, a vector of observations *m*^{obs} equals a complete genotype vector *m* = (*m _{i}*), where an element

*m*= (

_{i}*m*

_{i}_{1},

*m*

_{i}_{2}, … ,

*m*) belongs to individual

_{iM}*i*. Here, each allele in pair can take an integer value in range [1,

*N*], where

_{l}*N*is a number of alleles at locus

_{l}*l*.

### Model for missing genotype data:

We assume that there might be some missingness in the marker genotypes, *m*^{obs}, and that their values are missing at random. To hierarchically model genotype observations (under Hardy-Weinberg equilibrium), we assume a vector of underlying population allele frequencies *f* = (*f _{l}*), where

*f*consist of allele frequencies at locus

_{l}*l*. The joint distribution of complete and incomplete (observed) marker genotypes and population allele frequencies

*p*(

*m*

^{obs},

*m*,

*f*) can be factored and presented in the form

*p*(

*m*

^{obs},

*m*,

*f*) =

*p*(

*m*

^{obs}|

*m*)

*p*(

*m*|

*f*)

*p*(

*f*). In principle, we can have an indicator function

*p*(

*m*

^{obs}|

*m*) that takes value one or zero depending on whether the complete marker genotypes (

*m*) are consistent or not, with the observed (incomplete) marker genotype data (

*m*

^{obs}). However, in practice, WinBUGS does not need this kind of prior to generate imputed values, because missing value imputations are done conditionally on observations (see Congdon 2001). The prior for the complete genotype data is multinomial, where the occurrence probability of each allele is obtained from the corresponding frequency in

*f*. We consider two alternative schemes: (1) to assume the prior for each population allele frequency

_{l}*f*, in , as Dirichlet (

_{l}*N*), or (2) to fix the population allele frequencies (

_{l}*f*) to a uniform distribution over alleles at each locus; this implies

*p*(

*f*) = 1.

### Phenotype model:

Each marker position *l* has its own locus-indicator variable *I _{l}*, where the value one (

*I*= 1) corresponds to the case where the marker is included in the model as a QTL representative and value zero (

_{l}*I*= 0) implies exclusion. Each marker position

_{l}*l*has its own vector of genetic effect coefficients β

*= (β*

_{l}*), where*

_{la}*β*is the coefficient for allele

_{la}*a*at marker

*l*, where

*a*= 1, … ,

*N*and

_{l}*l*= 1, … ,

*M*. Given the vector of locus indicators

*I*= (

*I*), overall mean α, and effects β = (β

_{l}*), our genetic model with additive allelic effects for observation*

_{l}*y*(individual

_{i}*i*) can be written as 1where residuals

*e*are assumed to be normally distributed

_{i}*N*(0, 1/τ), with precision parameter (

*i.e.*, inverse of residual variance). In the case of binary phenotypes we omit the residuals from the model (1) and consider phenotypes through a logit link function (see Uimari and Sillanpää 2001; Conti and Witte 2003; and discussion). Let be a vector of genetic variances where an element σ

^{2}

_{gl}is the genetic variance at locus

*l*. We assume that effect coefficients β

*are normally distributed with*

_{la}*N*, where in the case of biallelic markers one can use a fixed variance model and prespecify σ

^{2}

_{gl}and in the case of multiallelic markers use a random variance model with unknown σ

^{2}

_{gl}. (In classical statistics, these two alternatives are called the fixed-effect model and the variance component model.) We allow the first coefficient (β

_{l}_{1}) at each locus

*l*to be unconstrained in our model unlike that in Kilpikari and Sillanpää (2003). Note that we can still identify differences (contrasts) from the Markov chain Monte Carlo (MCMC) sample or from MCMC point estimates of coefficients afterward.

For loci with only a few segregating alleles or haplotypes, one can replace allele-specific effects in model (1) with genotype-specific (dominance) or haplotype-specific coefficients (Kilpikari and Sillanpää 2003) using a fixed-variance model. To use genotype- or haplotype-specific coefficients in the presence of a larger number of alleles/haplotypes, one should use a random-variance model. See the discussion for alternate ways of handling a large number of coefficients.

### Hierarchical model:

One needs to prespecify the following in the model: (1) the prior expectation of the proportion of trait-associated markers in region *R*, denoted as *s*, and (2) either the physical or genetic map distances in a vector *d* = (*d*_{2}, … , *d _{M}*), where an element

*d*refers to the distance between markers

_{l}*l*and

*l*− 1. Optionally, one may also prespecify an overall smoothing parameter λ in the model (see below).

From Bayes formula we obtain that the posterior density of parameters {*I*, α, β, σ^{2}_{g}, τ, λ, *m*, *f*} given the observed data (*y*, *m*^{obs}) and fixed quantities {*s*, *d*}, denoted as *p*(*I*, α, β, σ^{2}_{g}, τ, λ, *m*, *f*|*y*, *m*^{obs}, *s*, *d*), is proportional to the joint density *p*(*y*, *m*^{obs}, *I*, α, β, σ^{2}_{g}, τ, λ, *m*, *f*|*s*, *d*). By making suitable conditional independence assumptions among variables (like assuming *a priori* independence between the vector of locus indicators and genetic effects; Kuo and Mallick 1998), the joint density can be presented as a product of a likelihood and a factorized prior density: Here the *p*(*y*|*m*, *I*, α, β, *τ*) is a likelihood function that is obtained from the genetic model by substituting residuals *e _{i}* = (

*y*− δ

_{i}*) to the normal density function (see Sillanpää and Arjas 1998; Kilpikari and Sillanpää 2003); In the case of binary data and logit-link function, the likelihood takes the form .*

_{i}A vector containing intermarker distances *d* = (*d*_{2}, … , *d _{M}*) is incorporated in the prior for indicator variables where

*p*(

*I*

_{1}|

*s*) is a Bernoulli distribution with parameter

*s*, where 1 −

*s*can be interpreted as a given shrinkage factor (how much value zero is preferred over one). In our analyses, we have used

*s*= 1/

*M*, which corresponds to a prior belief of one QTL in region

*R*. For some other ways of setting a prior for the number of QTL in locus-indicator models, see Yi (2004). Transition probabilities

*p*(

*I*|

_{l}*I*

_{l}_{−1},

*s*,

*λ*,

*d*) from position

_{l}*l*− 1 to

*l*can be arranged into the 2 × 2 matrix containing all possible transitions between states 0 and 1: 2To understand this structure better, we can express the value of

*I*as a function of

_{l}*I*

_{l}_{−1}as follows: , for

*l*= 2, … ,

*M*. Here,

*x*

_{l}_{0}is a Bernoulli variable with

*x*

_{l}_{0}= 1 indicating that

*I*takes the same value as

_{l}*I*

_{l}_{−1}and with

*x*

_{l}_{0}= 0 indicating that the value of

*I*is drawn independently of the value of the previous locus (

_{l}*i.e.*, ). The probability of

*x*

_{l}_{0}being 1 depends on the distance

*d*between markers

_{l}*l*and

*l*− 1 and reflects the linkage disequilibrium in the area. Here this probability is modeled as , where λ can be interpreted as an overall smoothing parameter and

*d*is the physical or genetic distance, expressed in kilobases or morgans, between markers

_{l}*l*and

*l*− 1. This will result in lower dependencies between consecutive markers if the distance between them is large and vice versa. (Note that although the apparent dependence in transition probability is one-sided, it will actually induce two-sided dependence between indicators in practice.) The new indicator

*I*

^{′}

_{l}has Bernoulli distribution with shrinkage parameter

*s*. Note that the lower diagonal term in Equation 2,

*p*(

*I*= 1|

_{l}*I*

_{l}_{−1}= 1,

*s*, λ,

*d*), is loosely related to the Malecot equation for isolation by distance (

_{l}*cf.*Morton

*et al.*2001), where there is positive probability (depending on population size) to find association due to chance. Following Conti and Witte (2003) one may prespecify and use a fixed smoothing parameter value or perhaps use a range of different values in different MCMC chains for tuning (to introduce a preferred level of smoothing). Alternatively, we can specify a wide or a narrow prior for smoothing parameter λ. In the following we have used a Gamma(1, 0.01) prior, if not stated otherwise, with prior mean at 100. Priors for the genetic coefficients

*p*in the case of a fixed variance model (at each locus

*l*) were assumed to be

*a priori*independent standard normal

*N*(0, 1), omitting a prior for genetic variance

*p*, and in the case of a random variance model

*p*is

*N*(0, σ

^{2}

_{gl}) with

*p*as an inverse Gamma(1, 1). Consequently we assumed and . The prior for the overall mean is

*N*(0, 0.1) and that for the precision parameter

*p*(τ) is Gamma(1, 1), where the latter is needed only for continuous phenotypes. Note that ideally the genetic prior variance given in a fixed variance model should be proportional to the phenotypic variance of the trait, which also can be used inversely as a natural lower bound for

*p*(τ).

## MATERIALS

To test the performance of our methodology for *a binary trait with bi-allelic markers and physical distances*, we selected a well-known cystic fibrosis (CF) data set that has some degree of missing alleles (Kerem* et al.* 1989). Additionally, we analyzed public data for Friedreich ataxia (FA), which has *a binary trait with multiallelic markers and genetic distances* as well as some missing data (Liu* et al.* 2001; Molitor* et al.* 2003a,b). We have illustrated the influence of choice of different types of priors (wide and narrow) and fixed values (small and large) for smoothing parameter λ and also compared the results to the case of no smoothing, with the FA data. We also wanted to compare the two previously mentioned schemes for handling of missing values between CF and FA analyses. Finally, to test performance of our methodology for *a quantitative trait with multiallelic markers and large genetic distances* without missing data, we analyzed the simulated data set presented in Kilpikari and Sillanpää (2003).

### CF data and model:

The CF data contain 93 individuals with binary disease status and 23 biallelic markers with haplotype information. These restriction fragment length polymorphism (RFLP) markers span the area surrounding the cystic fibrosis transmembrane regulator (CFTR) gene on chromosomal segment 7q31. The physical distances in this 1.7-Mb region are available in the data (Kerem* et al.* 1989; Morris* et al.* 2000). Because it was unknown which two haplotypes belong to the same individual (within both phenotypic classes), we used a fixed-variance model with a slight modification, where only a single allelic coefficient was fitted in each selected locus; each individual contributed two independent observations (phenotype and one allele at each locus) to the analysis (*cf.* Sasieni 1997). It is evident that this modified model gives the same result as the model where two allelic coefficients are fitted in each selected locus, except that the estimated allelic effects (and their contrasts) are approximately double in size. It is important to note that the analysis was done by assuming that we did not have haplotypes available. Here we assumed the model for missing values, where the frequency (hyper)parameters have Dirichlet prior distributions and are estimated from the data.

### FA data and model:

The Friedreich ataxia data were first published in Liu* et al.* (2001) and they consist of 58 disease haplotypes and 69 control haplotypes with 12 microsatellite markers covering a 15-cM region (9 closely spaced markers within a 1-cM area and 3 markers at the two ends covering the rest of the length). An extra unphased individual was omitted from the analysis. Genetic distances are available in the data. We used a random-variance model with the same modification as in CF data analysis, where each individual contributed two independent observations to the analysis. Again, haplotype information was omitted in the analysis. In the analysis we assumed a model for missing values, where each allele was considered to be *a priori* equally probable at each marker locus. In the analysis with no smoothing, we assumed an independent prior for locus indicators, where each indicator depends only on shrinkage parameter.

### Simulated data and model:

The simulated data consist of quantitative phenotype and genotype measurements at 36 multiallelic markers (without haplotype information) from 1000 individuals; see Kilpikari and Sillanpää (2003) for details. Simulated QTL are at markers 18 (a strong effect) and 22 (a weak effect) with a joint heritability of 0.63. Haldane's function was used to convert recombination fractions to morgans. The intermarker genetic distances are wide in this data set, typically corresponding to recombination fractions of 0.1 or 0.2. The complete data set was used here (without any missing values), whereas Kilpikari and Sillanpää (2003) used the data set where 20% of the marker genotypes were missing. Unlike in CF and FA analyses, every individual contributed a single observation (one phenotype and two alleles at each locus) to the analysis where a random-variance model, with for each allele its own coefficient, was assumed.

### Analyses:

Data sets were analyzed in WinBUGS 1.3 (Gilks* et al.* 1994; Spiegelhalter* et al.* 1999), using a Pentium 4, 2.8 GHz. In CF and FA analyses, where we assumed a prior distribution for λ or prior independence of locus indicators, we ran two MCMC chains each of length 30,000 (13,500 in FA analyses with fixed λ and 5500 in simulation analyses) with different initial values. First, 5000 (1000 and 4000) rounds were discarded from each chain as “burn-in,” which resulted in 50,000 (25,000 and 3000) pooled MCMC samples in total, respectively. We did not apply any “thinning” for the chain, because of sufficient computer storage capacity and low autocorrelation in the samples. The two MCMC chains were run in parallel, which took ∼20 hr with CF data and 8 (4) hr with FA data (with fixed λ) and 98 hr with simulated data. The convergence assessments were performed by visually monitoring chains for several different parameters.

## RESULTS

### CF data:

In Figure 2, we present the estimated posterior probabilities for different markers to be associated on the CF phenotype (left) and the same based on prior only (right). The posterior estimate for the number of trait loci in the whole region turned out to be ∼3 (with mean 3.24 and mode 3) where the prior assumption for the same was 1.

QTL probabilities in our model may be confounded with QTL effects. This is because in the case of strong LD (dependence), the prior of locus indicators supports several indicators to have value one simultaneously. At the same time, if the position with indicator value one is not a good explanatory variable for phenotype, the corresponding QTL effect becomes negligible. In such case, a locus-indicator value is not sufficient alone to determine how well the putative position explains the phenotype. However, we can still inspect their values as preliminary indicators of QTL activity. In particular due to the relatively small amount of LD in the data set (see below) and the “shrinkage” assumption in our model, we obtained a clear indication of QTL positions at the probability scale. Figure 2 (left) shows a distinct signal at the correct Δ*F*_{508} position on marker 17, which is known to account for ∼66% of the disease chromosomes (Bertranpetit and Calafell 1996). Additionally a clear signal appears at marker 10 (EG1.4) outside the CFTR region. This location as well as many others was found to have a strong secondary association in the study of Kerem* et al.* (1989). Moreover, Molitor* et al.* (2003b) used a single-locus model and found a bimodal posterior distribution for gene location with both modes within 0.03 cM of our best candidates. They also made an additional analysis where the left mode disappears in an analysis for which the only case haplotypes were those known to contain the Δ*F*_{508} mutation. Our method clearly identifies marker 10 as a secondary contributor in this data set. Because of this, we performed a closer inspection of haplotypes in the original data set. In Table 1, one can see that occurrences of certain alleles at markers 10 and 17 correlate with each other in haplotypes (correlation is 0.51). Moreover, one can say that correlation between markers 10 and 17 is surprisingly low (−0.02) among control individuals (Table 1, PC0), suggesting that locus 10 may have protective influence (negative susceptibility) on the disease. A possible reason why this locus has not been suggested by others is that most LD mapping approaches focus on searching for susceptibility alleles only. Note that although the posterior estimated number of trait loci was three, Figure 2 clearly supported only two locations with several neighboring locations with much smaller QTL probabilities. We believe that this estimated number reflects a marker dependency (LD) pattern rather than the actual QTL number for the region (see discussion).

In Figure 3 (left), the posterior plot of estimated allelic effects consistently supported the same two markers (10 and 17) as indicated in Figure 2. One can see large and opposite effects of two alleles at these markers. The difference (or contrast) between these coefficients is comparable to the estimated effect obtained under the model where the first coefficient at each locus is constrained to zero. For comparison, the right side of Figure 3 shows allelic effects under the prior model (without data). Note that effect estimates shown in Figure 3 are point estimates (mean) whereas the whole distribution is summarized for loci 10 and 17 in Table 2.

In general, effect estimation can be inaccurate in practice and as mentioned before, estimates for the QTL positions and their effects may be confounded. Therefore, a more robust way of analyzing association in such confounded models can be done by combining two sources of posterior information (QTL and their effects) into a single marker-specific summary that can be called a weighted genetic variance. This may be obtained from the posterior distribution of the product of indicator variable *I _{l}* and either absolute difference (biallelic case) or variance (multiallelic case) of allelic effects β

*'s at each location*

_{la}*l*. For biallelic loci, this summary corresponds to a model-averaged effect estimate (averaged over all models with the effect set to zero in models where the marker was not selected) proposed by Ball (2001) and a model-averaged variance estimate for multiallelic markers. However, for the present analysis, this practice produced a picture similar to Figure 2 (with an approximate scale difference). A related approach of combining posterior information on QTL and their effects using Bayesian QTL mapping and variance component models has been proposed in Xu and Yi (2000).

### Linkage disequilibrium pattern due to close linkage of loci:

Instead of estimating LD as a degree of joint occurrence of alleles at two loci (which requires haplotype data), we assume that the effect of LD should also be visible in the behavior of locus indicators (in the case that they are not confounded; otherwise a weighted genetic variance may be selected). We express the strength of LD as a probability of an event that two adjacent markers have value one simultaneously in their locus indicators (this event is called joint selection). Figure 4 shows distinctly elevated posterior probabilities of joint selection around marker 17 and also around marker 10 (although not so high) compared to prior levels of LD.

### Inference about estimated average age of mutations over trait loci:

Sometimes it may not be simple to parameterize the model to obtain a direct posterior of a certain quantity of interest analytically, as is the case here. To make inference about such a stochastic function of the posterior distribution, a sequential analysis can be carried out by using posterior samples generated by a first Bayesian analysis as data in a second Bayesian analysis. We performed such an additional analysis for a posterior sample of locus indicators to estimate the average age of mutations over trait loci. In this second Bayesian analysis, 30,000 MCMC samples were considered as actual data points consisting of sampled values of indicators *x _{l}* = 1

_{{}

_{Il}_{=}

_{Il}_{−1=1}}for each consecutive marker pair (

*l*− 1,

*l*) among 23 markers. In our model we assumed that indicators

*x*are Bernoulli distributed with parameter

_{l}*e*, where

^{−adl}*a*represents the average age of mutation over trait loci with a possible scale difference and

*d*is the distance between consecutive markers of marker pair (

_{l}*l*− 1,

*l*). We assumed a Gamma(5, 0.05) prior for age parameter

*a*and ran two parallel chains with 3500 MCMC rounds (with 1000 burn-ins in each) resulting altogether in 5000 pooled MCMC samples to be utilized for estimation. The sampler seemed to have converged quickly. Note that although this model appears to be similar to the one used for smoothing, the smoothing parameter λ does not have this interpretation because it represents prior rather than posterior dependency (linkage disequilibrium) along a chromosomal segment.

In the analysis described above, the estimated posterior mean of age *a* was 0.2679. Because 1 cM is ∼1 Mb in humans and our analyses were performed in kilobases, we have to multiply our estimate by 1000, leading to an estimated age of 268 generations (with 95% credible interval [266, 269]). Others have found an estimated age of Δ*F*_{508} to be near 200 (Serre* et al.* 1990; Morris* et al.* 2000). However, recall that our estimate is averaged over all trait loci (positions 10 and 17) and does not represent only the CFTR region. Therefore, we did yet another analysis and allowed two age parameters, one for the CFTR region (markers 16–20) and another outside of it (models not shown). The new analysis resulted in posterior means of the two age parameters to be 207 (with 95% credible interval [205, 208]) at the CFTR region and 319 (with 95% credible interval [316, 321]) elsewhere (*i.e.*, dominated by position 10). This result is also in agreement with the idea that position 10 may have protective influence because it is very likely that mutations with protective effects are older than others.

### Sensitivity analyses:

To study sensitivity of the estimates of age of mutation *a*, we performed test trials assuming Gamma(5, 0.05), Gamma(1, 1), or Gamma(10, 0.1) prior for *a*. All priors led to the same posterior estimates, strongly supporting the estimated value.

The estimated posterior mean for the smoothing parameter λ was 100, corresponding closely to the prior mean. To check validity of this estimate we also used some other values for prior mean. These extra analyses resulted in the posterior means of smoothing parameter values, which corresponded strongly to the prior assumption, indicating that this quantity is very dependent on the prior choice. However, the posterior distributions of other parameters were not affected by the changes made to the prior for the smoothing parameter.

With CF data, we tested four different priors for allelic coefficients: *N*(0, 1), *N*(0, 10), *N*(0, 20), and *N*(0, 100). The change of prior had the largest influence on the convergence time but also increased variability at the mean level of posterior estimated allelic coefficients with increased variability (“flatness”) in the prior. Maximally a 2.5-fold change was observed in the estimated allelic coefficient by changing the prior. The influence was negligible to the absolute difference between coefficients. However, the same positions were supported in all cases. We also tested the influence of changing the prior for overall mean and the results were comparable.

### Friedrich ataxia data:

Unlike with CF data, locus-indicator variables turned out to be confounded with QTL effects in the Friedreich ataxia data (due to high LD). Therefore, we present posterior information on locus indicators and effect parameters in a combined form (weighted genetic variance). This quantity was calculated at each MCMC iteration as an indicator times genetic variance at a particular locus . In Figure 5, one can see the posterior mean weighted genetic variance estimated at different marker positions. Two putative QTL positions at markers 3 and 5 are clearly indicated. It is known that the gene is located between markers 5 and 6 (Liu* et al.* 2001). Note that the best putative candidate of Molitor* et al.* (2003a) was marker 3. Again we did some closer inspection of the haplotype data and found a striking joint occurrence of certain alleles at haplotypes of loci 3 and 5 among disease haplotypes (32/58 with haplotype 8-5 and 12/58 with haplotype 8-6) and to some extent also among control haplotypes (11/69 with haplotype 8-5 and 3/69 with haplotype 8-6; all results not shown). Joint occurrence among controls is not conclusive due to a high number of missing alleles at marker 5 (23/69). Simple odds ratio calculation for alleles indicates enrichment of alleles 3 and 5 of marker 5 and allele 8 of marker 3 among the cases with values 2.1, 2.3, and 2.5, respectively. However, allele 5 of marker 5 occurs together in haplotypes with allele 8 of marker 3. The same was further supported by the posterior estimates of individual allelic effects (results not shown). To conclude, most of the alleles that were estimated to be enriched among cases occurred jointly with a high frequency with either allele 3 of marker 5 or allele 8 of marker 3 (which also implies they occurred with alleles 5 or 6 of marker 5). Therefore all the enriched alleles of all markers could be approximately represented by these two alleles at these two markers together. This further supports two putative findings of this analysis.

Figure 5 represents weighted genetic variance under different prior specifications for overall smoothing parameter λ and for locus indicators (each corresponding to a different level of smoothing). In Figure 5A one can see that if we assume a wide prior for the smoothing parameter, similarly as in the CF data, we obtain a moderately smoothed posterior. In contrast, if a narrow prior is given for the smoothing parameter (Figure 5B), we obtain a posterior where the smoothing has become a bit stronger (this is more visible in the estimated LD pattern, results not shown). By comparing Figure 5E (no smoothing), Figure 5D (weak constant smoothing), and Figure 5C (strong constant smoothing), one can clearly see that signals at positions 3 and 5 become stronger as levels of smoothing increase. Figure 5E corresponds to the model, where *a priori* independence of locus indicators was assumed [*cf.* the model of Kilpikari and Sillanpää (2003), where all unoccupied QTL positions were considered to be equally likely].

### Dichotomizing transformation for weighted genetic variance:

Direct inference about estimated average age of mutations over trait loci or number of QTL, as was done for CF data, is not possible from FA data because the estimates of locus indicators and effects are confounded in the FA data due to high linkage disequilibrium. In other words, posterior weighted genetic variances clearly differ from posterior QTL probabilities. To estimate age of mutation (or number of QTL) under this condition, we transform each (continuous) weighted genetic variance to a binary form that can be called an adjusted locus indicator, which can then be collected together and plotted as QTL probabilities. The QTL-probability plot drawn from adjusted indicators should resemble the plot drawn from weighted genetic variances (with an approximate scale difference). These adjusted indicators (that are created by transformation) can then be used as data for estimating age of mutation over trait loci or number of QTL similar to our CF analysis.

To understand the transformation, let us first consider the marker-specific posterior for the weighted genetic variance. Note that only a single value (mean) of this posterior is plotted at each marker point in Figure 5. Although the histogram estimates indicate these variables to be stochastically ordered (for the FA data), in the absence of analytical forms (for these posteriors), we utilize only the tail-ordering property of the posteriors. For each marker, we estimated the corresponding tail probability of exceeding the predefined percentile of the joint distribution of all weighted genetic variances. [Alternatively, the cutoff point can be visually defined from the weighted genetic variance plot (Figure 5).] Mathematically this is a simple Bernoulli discretization of a continuous variable where the corresponding Bernoulli probability then represents the probability for a marker to have a high or extreme value of the weighted genetic variance. These tail probabilities should be similar in pattern (over markers) with those obtained by the means as in Figure 5. (Assuming the means to be reasonable estimates of the central tendency of the weighted genetic variances).

### Estimated average age of mutations over trait loci and their number in FA data:

We performed several trials with different cutoff points put on the weighted genetic variance and each time ran the separate age estimation analysis with MCMC using adjusted locus indicators as data, as in CF analysis (results not shown). As expected, the estimated age depends on the stringency of the criteria used to define adjusted locus indicators. The higher the cutoff is put on the weighted genetic variance the rarer the event becomes and the estimated age would increase accordingly. However, even with the choice of cutoff as 90th, 95th, and 99th percentiles (of the overall distribution) the posterior (median) age estimates were 30, 42, and 66 generations (with narrow credible intervals), indicating this to be a recent mutation compared to many others. The corresponding three estimates for the number of QTL were 1.2, 0.6, and 0.1, respectively. Three percentiles correspond to points 7.85, 13.29, and 33.65, respectively, in the weighted genetic variance scale (recall that only means are shown in Figure 5). On the basis of these trials, it seems that this practice can provide useful information on the neighborhood of the age estimate rather than a unique estimate in such cases where direct estimation is impossible due to confounding. However, this practice is not very suitable for estimating the number of QTL.

### Comparison of missing value imputation methods:

On the basis of our numerical experiments with these data sets, it was clear that missing data imputation that was applied to the FA data (with the prior where each allele was considered as equally likely) was better behaved than that in the CF data (with the Dirichlet prior for allele frequency hyperparameters), providing faster convergence.

### Simulated data:

As expected, in large genomic regions this model seems to behave in a closely similar fashion to the model of Kilpikari and Sillanpää (2003). In Figure 6 (left), one can see that the posterior QTL probability is practically zero at most of the markers (except some faint pattern due to LD) while it is one at true QTL locations (18 and 22). Recall that marker 25 showed a weak signal in Kilpikari and Sillanpää (2003), which we think is due to missing values in the analyzed data set. The same two positions (18 and 22) are supported also in Figure 6 (right), which presents weighted genetic variances for the data. Note that it is clearly visible that position 18 had a stronger effect than 22. The posterior (mean) estimate for the number of QTL in the whole region was 2. Closer inspection of locus indicators showed that some dependence between indicators was found only between markers 15–16 and 31–34, where the intermarker distances were the smallest (recombination fractions for the two regions were 0.02 and [0.02, 0.01, 0.01], respectively; results not shown).

## DISCUSSION

We have presented a new multilocus method for association mapping in short chromosomal intervals, where covariances between markers are accounted for by using available physical or genetic distance information. Accounting for these covariances corresponds to spatial smoothing of association signals (LD measured from locus indicators) along the candidate region. The method is suitable for genotype data on multiallelic markers and is equally applicable for quantitative and binary traits and can handle some degree of missing observations. One can also estimate the age of the variant, which should be interpreted here as time to the most recent common ancestor rather than to actual age of mutation, which can be much older (Rannala and Slatkin 1998). One advantage of this method is that it can be implemented with WinBUGS (see Congdon 2001), which avoids user specification of data-specific tuning parameters needed in the Metropolis-Hastings random-walk algorithm (Chib and Greenberg 1995). Moreover, inclusion of covariate information (such as age, sex, or environmental risk factors) or dominance (genotype effects) into the phenotype model is straightforward in WinBUGS. However, the overall smoothing parameter λ can be thought of as a special tuning parameter in our model, which should not be too small in data sets with strong LD. With small values, the dependence between adjacent locus indicators may become too strong, practically preventing the MCMC sampler from moving (insufficient mixing). This may also happen with some prior distributions, which allow small values for λ. Therefore, one should be careful when analyzing data sets known to have strong LD.

We share the view of Phillips* et al.* (2003) and Morris* et al.* (2003b) that the block structure of LD in the human genome creates a need for sophisticated modeling to refine locations within the LD blocks. (For the opposite view, see Goldstein 2001.) Conti and Witte (2003) demonstrated how haplotype block information (Daly* et al.* 2001; Goldstein 2001; Gabriel* et al.* 2002; Wall and Pritchard 2003) can be easily incorporated in their model so that marker-specific gene effects “borrowed information” or were depending on each other only within the same block. The same goal (independence of markers between blocks) can be obtained in our model by artificially replacing the known distance between block-boundary markers (adjacent markers that are members in different blocks) with some arbitrarily large value corresponding to independence of loci. Utilization of block information corresponds to accounting for common population ancestry (LD due to covariance between haplotypes).

Population stratification is a well-known problem with association studies (Cardon and Palmer 2003). The benefit of the “LD smoothing” approach presented here is that it can control fluctuations in LD (spurious associations) due to several factors including events in population history. This way one does not need to apply matching (Hind* et al.* 2004) or utilize techniques such as genomic controls (Devlin and Roeder 1999) or structured association (Pritchard* et al.* 2000a,b; Sillanpää* et al.* 2001; Corander* et al.* 2003, 2004; Hoggart* et al.* 2003), which use an external set of unlinked markers. If the sample consists of related individuals with known relationships, inclusion of a polygenic component into the association model could take away a stratification problem completely (see discussion in Kilpikari and Sillanpää 2003). Available relationship information also enables the use of models that combine pedigree and linkage disequilibrium information (Meuwissen* et al.* 2002; Perez-Enciso 2003; Fan and Jung 2003; Lund* et al.* 2003; Meuwissen and Goddard 2004).

The postgenomic era is bringing us a vast amount of external public information on the human genome sequence. Rannala and Reeve (2001) proposed utilization of such information by specifying an informative prior distribution for the location of a disease gene in their Bayesian LD analysis. An informative prior providing information on gene-rich areas or distribution of exons and introns in the candidate region will improve gene localization. To utilize this type of external information in our model, one can modify the prior for locus-specific indicators *p*(*I _{l}*|

*I*

_{l}_{−1},

*s*, λ,

*d*) by specifying individual values of shrinkage factor 1 −

_{l}*s*at each locus, reflecting the prior probability of the position (locus-specific

*s*should then be scaled so that they sum up to the prior number of QTL in region

_{l}*R*).

We want to briefly comment on our choice of a logit-link function here in contrast to a probit link that was used in Kilpikari and Sillanpää (2003). The logit link was chosen here because of its good mixing properties in WinBUGS (A. Thomas, personal communication). Additionally, although probit and logit-link functions are very close to each other, only logit link is robust for ascertainment (Kagan 2001; Neuhaus 2002). The choice of probit link, implemented via data augmentation, was motivated in Kilpikari and Sillanpää (2003) by the technical fact that one was able to apply exactly the same MCMC sampler (including the full conditionals) for all underlying model parameters in both quantitative and binary traits.

In some cases, the number of QTL-effect parameters (categories) considered in the model may become too large to model each parameter with a fixed-variance model (fixed-effect model). Such a situation may arise if the number of distinct alleles segregating in a marker is very high or one considers gene-gene or gene-environment interactions or genotype/haplotype-specific effects in the phenotype model. In an alternative to the random-variance model (“variance component model”), which avoids these problems, one can coarse the number of groups by regrouping alleles or genotypes/haplotypes into a small number of new groups on the basis of some simple rule. A more sophisticated solution is to include partitioning as a part of the model following ideas in Seaman* et al.* (2002).

Estimates for the number of QTL in studies, which concentrate on small chromosomal intervals, are confounded by strong LD (dependence) between markers. It is very likely that more than just a single marker is in strong LD with the true QTL and therefore they show elevated gene activity, as illustrated in CF and FA data. Because LD appears as continuous segments around the QTL due to common history, it may be more meaningful to estimate the number of QTL as the number of trait-associated marker segments rather than as the number of individual trait-associated markers (see Terwilliger* et al.* 1997; Chapman and Thompson 2002). The alternative, illustrated with FA data, to estimate the number of QTL from transformed data, did not seem to perform well. One potential future approach to reduce or better control confounding in our model would be to apply stochastic search variable selection (SSVS; George and McCulloch 1993; Yi* et al.* 2003; Meuwissen and Goddard 2004) so that locus indicators are hierarchically controlling the prior of QTL effects. A well-known drawback in SSVS is that it requires additional tuning parameters (pseudo-priors), which are data dependent (Dellaportas* et al.* 2000). In any case, the subset selection of candidate markers should be done by the analyst on the basis of the highly elevated locus-specific QTL probabilities or weighted genetic variances in contrast to selection based on markers showing small *P*-values in classical association testing. By doing so a Bayesian analysis avoids complicated problems in multiple testing. On the other hand, Bayesian analysis requires attention to monitoring of convergence and inspection of mixing properties of the MCMC sampler. Also so-called sensitivity analysis is an important part of good analysis practice.

We have put the model specification code (written in the BUGS language) at URL: http://www.rni.helsinki.fi/~mjs/. This code is freely available for research purposes.

## Acknowledgments

We are grateful to Jules Hernández-Sánchez and Andrew Thomas for helpful discussions and two anonymous reviewers for comments that greatly improved the presentation of the results. This work was supported by a research grant (202324) from the Academy of Finland and by the Centre of Population Genetic Analyses, University of Oulu, Finland.

## Footnotes

Communicating editor: J. B. Walsh

- Received June 22, 2004.
- Accepted September 16, 2004.

- Genetics Society of America