help button home button Genetics J Bacteriology
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Originally published as Genetics Published Articles Ahead of Print on April 15, 2007.

Genetics, Vol. 176, 1169-1185, June 2007, Copyright © 2007
doi:10.1534/genetics.106.064279

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
genetics.106.064279v1
176/2/1169    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Yang, R.
Right arrow Articles by Xu, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yang, R.
Right arrow Articles by Xu, S.

Bayesian Shrinkage Analysis of Quantitative Trait Loci for Dynamic Traits

Runqing Yang* and Shizhong Xu{dagger},1

* School of Agriculture and Biology, Shanghai Jiaotong University, Shanghai, 201101, People's Republic of China and {dagger} Department of Botany and Plant Science, University of California, Riverside, California 92521

1 Corresponding author: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521.
E-mail: xu{at}genetics.ucr.edu

Manuscript received August 1, 2006. Accepted for publication March 23, 2007.


    ABSTRACT
 TOP
 ABSTRACT
 THEORY AND METHODS
 APPLICATIONS
 DISCUSSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Many quantitative traits are measured repeatedly during the life of an organism. Such traits are called dynamic traits. The pattern of the changes of a dynamic trait is called the growth trajectory. Studying the growth trajectory may enhance our understanding of the genetic architecture of the growth trajectory. Recently, we developed an interval-mapping procedure to map QTL for dynamic traits under the maximum-likelihood framework. We fit the growth trajectory by Legendre polynomials. The method intended to map one QTL at a time and the entire QTL analysis involved scanning the entire genome by fitting multiple single-QTL models. In this study, we propose a Bayesian shrinkage analysis for estimating and mapping multiple QTL in a single model. The method is a combination between the shrinkage mapping for individual quantitative traits and the Legendre polynomial analysis for dynamic traits. The multiple-QTL model is implemented in two ways: (1) a fixed-interval approach where a QTL is placed in each marker interval and (2) a moving-interval approach where the position of a QTL can be searched in a range that covers many marker intervals. Simulation study shows that the Bayesian shrinkage method generates much better signals for QTL than the interval-mapping approach. We propose several alternative methods to present the results of the Bayesian shrinkage analysis. In particular, we found that the Wald test-statistic profile can serve as a mechanism to test the significance of a putative QTL.


SOME quantitative traits can be measured repeatedly during the development of life. Such traits are called longitudinal traits in humans, but more often are called dynamic traits in animals and plants. Some genes control the phenotypic values of the dynamic traits at fixed time points and others may alter the transitions of the phenotypes between consecutive time points. The growth pattern of a dynamic trait is called the growth trajectory. Studying the growth trajectory may detect both sets of genes and thus may enhance our understanding of the genetic architecture of the growth trajectory. Dynamic traits are often collected in large animals and plants, such as milk production in dairy cattle, growth rate in pigs, egg production in chickens, and growth rate in forest trees.

A growth trajectory is usually described by a logistic growth function. Wu and colleagues (MA et al. 2002; WU et al. 2002, 2003, 2004) developed many logistic regression models for mapping QTL for dynamic traits. The basic model is the logistic regression but different covariance structures of the residual errors are fit to different models. The logistic regression fits only growth trajectories that are sigmoid. Many growth trajectories, however, may show patterns that are not sigmoid. This stimulated YANG et al. (2004, 2005) to adopt a more general model called Legendre polynomial analysis. Legendre polynomial has been extensively used by animal geneticists and breeders to fit milk production and other dynamic traits (KIRKPATRICK and HECKMAN 1989; KIRKPATRICK et al. 1990; SCHAEFFER 2004). Milk production varies across days both within the same season and over different seasons. The curves are definitely not logistic, which explains why animal breeders do not use logistic regression to fit milk production curves. In addition to the flexibility to fit biological curves with arbitrary shapes, the Legendre polynomial is a linear model, as such theory and methodology extensively developed in linear models apply directly to Legendre polynomial analysis.

MACGREGOR et al. (2005) recently applied Legendre polynomial to QTL mapping for longitudinal traits in pedigrees. They adopted the traditional random regression model (RRM) in which the vector of polynomial regression coefficients (genetic effects) for each animal is treated as a random vector sampled from a multivariate normal distribution. The identity-by-descent-based variance component method (GOLDGAR 1990; SCHORK 1993; AMOS 1994; FULKER and CARDON 1994; XU and ATCHLEY 1995; ALMASY and BLANGERO 1998; YI and XU 2000a,b) is applied to estimate and test the variance components of a putative location of the genome. YANG et al. (2006) used Legendre polynomial to map QTL for dynamic traits in line crosses. The method of YANG et al. (2006), however, may not be called RRM because the polynomial regression coefficients for QTL effects are treated as fixed effects. Both the MACGREGOR et al. (2005) and the YANG et al. (2006) methods are implemented via the interval-mapping approach where the model includes the effects of a single QTL and a complete QTL analysis requires scanning the entire genome. MACGREGOR et al. (2005) included a polygenic effect in their RRM. YANG et al. (2006) included an individual specific residual effect in their fixed regression model. Both the polygenic effect in MACGREGOR et al. (2005) and the individual specific residual effect in YANG et al. (2006) can absorb, to some extent, the effects of QTL in regions other than the current one being tested. The models, in terms of QTL mapping, are still single-QTL-effect models. As a consequence, results are hard to summarize because QTL effects for different genome locations are estimated from different models.

In this study, we employed the multiple-QTL model developed for single-trait QTL analysis (XU 2003; WANG et al. 2005) to map multiple QTL for dynamic traits. Although the maximum-likelihood method has been developed for multiple-QTL mapping (e.g., JANSEN 1993; JANSEN and STAM 1994; KAO et al. 1999), the Bayesian method has become more and more popular because of its ability to handle more complicated models (SATAGOPAN et al. 1996; HEATH 1997; UIMARI and HOESCHELE 1997; SILLANPÄÄ and ARJAS 1998, 1999; DAW et al. 1999; HURME et al. 2000; XU and YI 2000; YI and XU 2000a,b, 2002; YUAN et al. 2000; UIMARI and SILLANPÄÄ 2001; BINK et al. 2002; CORANDER and SILLANPÄÄ 2002; YI et al. 2003a,b, 2005; YI 2005). The most cumbersome issue in multiple-QTL mapping is how to determine the optimal number of QTL. Variable selection via stepwise regression is commonly used in maximum-likelihood (ML) mapping (KAO et al. 1999). Reversible-jump Markov chain Monte Carlo (RJ–MCMC) is the corresponding variable selection procedure used in Bayesian analysis (GREEN 1995, 2003). However, recent studies show that RJ–MCMC is subject to poor mixing and slow convergence to the stationary distribution (GODSILL 2001; GREEN 2003). Bayesian shrinkage analysis (XU 2003; WANG et al. 2005) and stochastic search variable selection (SSVS) (YI et al. 2003a,b) are more efficient methods than RJ–MCMC. In these methods, no variable selection is conducted in an explicit manner; rather, a treatment similar to variable selection is made implicitly by shrinking the effects of excessive QTL to zero. QTL with estimated effects shrunk to zero in the shrinkage analysis are equivalent to being excluded from the model in a variable selection approach. The advantages of the shrinkage analysis over an explicit variable selection are twofold: (1) improving the convergence of MCMC and (2) reducing the chance of missing QTL.


    THEORY AND METHODS
 TOP
 ABSTRACT
 THEORY AND METHODS
 APPLICATIONS
 DISCUSSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Genetic model:
We use a backcross (BC) mating design as an example to describe the genetic model for dynamic traits. Let Formula be the phenotypic value of individual i (Formula) measured at time t (Formula), where n is the sample size and t is a standardized time point between –1 and 1. Let T be the time point in the original scale. The time point in the standardized scale is obtained using Formula, where Formula and Formula are the starting and ending time points (see KIRKPATRICK et al. 1990). The single-QTL model used by YANG et al. (2006) is modified and adopted here to describe Formula,

Formula 1(1)
where Formula 1 is the population mean at time t, Formula 1 is an individual specific random environmental effect with a Formula 1 distribution, and Formula 1 is a random environmental error (independent of time) distributed as Formula 1. This is a multiple-QTL model with the effect of the jth QTL at time point t denoted by Formula 1 for Formula 1, where q is the number of QTL included in the model. The single-QTL model of YANG et al. (2006) is a special case when Formula 1. Variable Formula 1 is a genotype indicator variable for individual i at locus j and defined as Formula 1 for one genotype and Formula 1 for the other genotype. For example, if the genotypes of the two parents for a QTL are AA and aa, the genotype of the F1 hybrid is Aa. There are two types of BC design. Assuming that the BC family is generated by crossing F1 with the AA parent, the two genotypes of the BC family are AA and Aa, respectively. Therefore, if a BC progeny has a genotype of AA, Formula 1; otherwise, Formula 1.

Following YANG et al. (2006), we use a Legendre polynomial of order d to express each time-dependent variable in model (1) as a linear function of a time-independent vector of parameters, such as Formula 1, Formula 1, and Formula 1, where each one of Formula 1, Formula 1, and Formula 1 is a Formula 1 column vector of time-independent parameters and Formula 1 is a Formula 1 row vector of constants (YANG et al. 2006). Model (1) is then rewritten as

Formula 2(2)

Since Formula 2 is a vector of random effects, we assume Formula 2, where Formula 2 is a Formula 2 unstructured variance–covariance matrix (a full symmetric positive definite matrix).

To estimate the parameters, we need to collect molecular marker data as well as phenotypes of repeated measurements of the dynamic trait. Let Formula 2 (Formula 2) be the kth time point at which the phenotype Formula 2 is measured. Let Formula 2 be an Formula 2 column vector for the repeated measurements of the dynamic traits. In matrix notation, model (2) becomes

Formula 3(3)
where Formula 3 is a Formula 3 matrix of constants (see YANG et al. 2006) and Formula 3 is now an Formula 3 vector for the residual errors with an assumed Formula 3 distribution. Note that the Formula 3 in Equations 1 and 2 is scalar because it represents the residual error for the dynamic trait at a single time point whereas the Formula 3 in Equation 3 is a vector because it contains an array of the residual errors for Formula 3 time points. The assumption of Formula 3 means that the residual errors are i.i.d. normal across all time points. The heterogeneous distribution of the environmental errors over time has been taken into account by the time-dependent environmental effect, Formula 3.

For simplicity, we assume that each individual has observations at all time points and the time points are common for all individuals. If this assumption is violated, Formula 3 will have a different dimension across different i and matrix Formula 3 will also vary in both dimension and content over different individuals. The technical details for variable time points across individuals have been discussed by YANG et al. (2006).

Data, parameters, and missing values:
The phenotypic values of the dynamic trait of interest are a source of data. Marker genotypes and the relative positions of the markers along the genome (or called the marker map) are another source of data. Let us denote the array of phenotypic values by Y and the marker information by M. The data now are denoted by Formula 3. In QTL mapping, the most important parameter is the number of QTL (denoted by q in this study). In the Bayesian shrinkage analysis, however, we treat q as a constant (see WANG et al. 2005 for justification). The parameters of direct interest in the Bayesian shrinkage mapping include the position (Formula 3) and the effect (Formula 3) of QTL (Formula 3). Other parameters include Formula 3, Formula 3, and Formula 3. The genotype indicator variable x and the individual specific effect Formula 3 also appear in model (3), but they are called missing values rather than parameters. These missing values are not missing observations; rather, they may be better called the latent variables or nuisance parameters to avoid confusion. In Bayesian analysis, however, missing values and parameters are treated equally and both are called unobservables, denoted by Formula 3. Each unobservable is a random variable following a certain distribution. The distribution of a missing value is usually described as a function of the existing parameters. The distribution of each parameter is called the prior distribution, which also has its own parameters called the hyperparameters. The hyperparameters are constants chosen by the investigator or estimated from the data if they are described by a higher level prior distribution. If the prior of a hyperparameter is used to estimate the hyperparameter, the model is called the hierarchical model (GELMAN 2005). Let Formula 3 be the vector of hyperparameters. The prior density is denoted by Formula 3. The probability density of the data given the parameters is denoted by Formula 3, which is also called the likelihood. The posterior distribution of the parameter vector is

Formula 4(4)

Given the likelihood function and the prior distribution, the specific form of the posterior distribution can be found or inferred through MCMC sampling.

The likelihood can be written as Formula 4, where Formula 4 is multivariate normal and Formula 4 is modeled by a Markov distribution (WANG et al. 2005). The position of each QTL is assigned a uniform prior within the interval bracketed by two markers. Each QTL effect is assigned a multivariate normal prior with mean zero and an unknown variance matrix. This variance matrix is further assigned a hyperprior so that the variance matrix can be estimated from the data. An inverse Wishart prior is chosen for each variance matrix. The inverse Wishart distribution is a multivariate version of the scaled inverse chi-square distribution and is often used to model a variance–covariance matrix (MEUWISSEN and GODDARD 2004). The residual variance is assigned a scaled inverse chi-square prior distribution.

Gibbs sampler:
Let Formula 4 be the initial values used for all the unknown variables. Let Formula 4 be a component of vector Formula 4 and Formula 4 be the remaining components of Formula 4, excluding Formula 4. The conditional posterior probability of Formula 4 given Formula 4 is denoted by Formula 4, which is used to simulate Formula 4. Once Formula 4 is simulated, it is moved to the list of known elements and the conditional posterior probability of the next component of Formula 4 is calculated and a value is drawn from this distribution. Once all elements of Formula 4 are drawn, we complete the first iteration and the value of Formula 4 is denoted by Formula 4. The sampling process may continue for N iterations to form a Markov chain, denoted by Formula 4, where N is a very large number. The sample mean for the kth element, Formula 4, is the empirical posterior mean of Formula 4, which may be considered as a Bayesian point estimate.

Here we focus on each variable whose Formula 4 has an explicit form so that samples can be directly drawn from that distribution. This process of directly generating samples from the posterior is called the Gibbs sampler (see GELMAN et al. 1995). Variables that can be generated with the Gibbs sampler are described below.

The prior distribution for the population mean is uniform across the real (uninformative prior), which leads to a normal posterior distribution. Therefore, the population mean Formula 4 is drawn from a multivariate normal distribution with mean

Formula 5(5)
and variance

Formula 6(6)
where Formula 6 is the variance–covariance matrix of vector Formula 6. Note that we use Formula 6 and Formula 6 to denote the conditional posterior expectation and variance, respectively. Similar notation applies to other parameters.

Let Formula 6 be the multivariate normal prior for the effect of the jth QTL with mean 0 and variance Formula 6, where Formula 6 is a Formula 6 positive definite matrix. It is a multivariate version of the shrinkage parameter (WANG et al. 2005). The conditional posterior distribution for Formula 6 is multivariable normal with mean

Formula 7(7)
and variance

Formula 8(8)

The Bayesian shrinkage method differs from the usually Bayesian regression analysis in that here we treat Formula 8 (a hyperparameter) as an unknown variable rather than as a fixed constant. Because of this, Formula 8 is described by another prior Formula 8, called the inverse Wishart distribution with a prior belief Formula 8 and a Formula 8 positive definite scale matrix Formula 8. This special treatment is called hierarchical modeling (e.g., GELMAN et al. 1995; GELMAN 2005). The conditional posterior of Formula 8 remains inverse Wishart; i.e.,

Formula 9(9)

This posterior distribution is used to draw Formula 9. Because Formula 9 is sampled, it varies from one iteration to another, reflecting the stochastic nature. More importantly, Formula 9 varies across j (different loci), reflecting the selective nature of the shrinkage. When Formula 9 is large, the posterior mean and variance of Formula 9 will resemble the "least-squares" estimates (no shrinkage). When Formula 9 is small, both the posterior mean and the posterior variance of Formula 9 will approach zero, leading to the "shrinkage estimate" of Formula 9. It is easy to understand where the shrinkage occurs under the univariate model of QTL mapping (WANG et al. 2005), but hard to understand the nature of shrinkage under the multivariate version of the QTL model. We have provided an explanation for this in the beginning of the DISCUSSION.

The individual-specific Formula 9 is sampled from a multivariate normal distribution with mean

Formula 10(10)
and variance

Formula 11(11)

Assume that the prior distribution of Formula 11 is Formula 11. Given Formula 11, the posterior distribution of Formula 11 is

Formula 12(12)

Finally, the residual error variance is sampled from a scaled inverse chi-square distribution,

Formula 13(13)
where Formula 13 and Formula 13 are the hyperparameters in the inverse chi-square prior distribution, and

Formula 14(14)
is the residual error.

Metropolis–Hastings sampler:
We are able to use the Gibbs sampler to generate variables so far because the conditional posterior for each of the aforementioned variables has an explicit form of distribution. The conditional posterior distribution of the position of a QTL, however, has no explicit form. Therefore, the general Metropolis–Hastings (METROPOLIS et al. 1953; HASTINGS 1970) algorithm is required to sample Formula 14. Since the genotype of QTL (Formula 14) depends on the QTL position Formula 14, we decide to sample Formula 14 jointly as a block but proceed with the sampling with one locus at a time. The Gibbs sampler turns out to be a special case of the general Metropolis–Hastings (MH) algorithm (GELMAN et al. 1995). Following WANG et al. (2005), we assume that there is one QTL in each marker interval. The position of the QTL varies within the marker interval that contains the QTL. The prior distribution of Formula 14 is

Formula 15(15)
where Formula 15 and Formula 15 are the positions of the left and the right markers that define the interval. Let Formula 15 be the current position of the locus of interest and Formula 15 be the genotype array of all individuals at the locus. We first sample a new position for the QTL, called the proposed position and denoted by Formula 15, where Formula 15 is sampled from Formula 15 and s is a small positive number (tuning parameter) such as 1 cM. For the new position, we simulate the genotypes for all individuals, denoted by Formula 15. We then use the MH rule to decide whether Formula 15 should be accepted or not. If Formula 15 is accepted, we update both the position and the genotype using Formula 15 and Formula 15. Otherwise, the old values of Formula 15 and Formula 15 are carried over so that Formula 15 and Formula 15. A detailed formula of the MH acceptance rule can be found in WANG et al. (2005) and ZHANG and XU (2005). The one-QTL-per-interval approach to sampling QTL positions is called the fixed-interval approach (WANG et al. 2005). It works well when markers are more or less evenly distributed and the marker density is intermediate. For highly dense marker maps, a slight modification is required.

Figure 1 shows the problem of dealing with a highly dense marker map and a proposed solution. This modified method is called the moving- (or variable-) interval approach as opposed to the fixed-interval approach described above and by WANG et al. (2005). The moving-interval approach was also introduced by WANG et al. (2005), but only very briefly. Here, we use an example to illustrate the moving-interval process. In this example, there are 18 dense markers (17 intervals) covering a chromosome. In the modified method, we place five QTL evenly on the chromosome. The upper limit for the number of QTL is now Formula 15. Each QTL interval may expand to >2 markers. We assume that each QTL interval carries one QTL. Before the MCMC starts, we simply place the five QTL evenly on the chromosome. We then update one QTL at a time given the positions of the remaining QTL. The prior of each QTL position is uniform between the positions of the left QTL and the right QTL; i.e., Formula 15. Because the prior for a QTL position depends on the current positions of other QTL, this approach is called the adaptive Bayesian approach (GILKS et al. 1995). The sampling process remains the same as that described earlier for the fixed-interval approach except that the length of the prior interval for each QTL varies and moves from one iteration to another. A complete cycle of updates for QTL positions is illustrated in Figure 2.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 1.— Illustration of the problem with dense marker map and the sketch of a proposed solution. The line at the top shows 7 markers (Formula 15) with intermediate density and 6 QTL (Formula 15), one in each marker interval. The Bayesian shrinkage method works well without modification. The line in the middle shows the problem of too many correlated variables if the marker density is extremely high. There are 18 markers (Formula 15) and 17 QTL (Formula 15), one in each marker interval. The line at the bottom shows the solution where we can define a QTL interval covering several markers and assign 1 QTL to each so defined interval. Here, there are 18 markers (Formula 15) but only 5 QTL (Formula 15) and each QTL interval covers several markers.

 

Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 2.— Illustration of the moving-interval search algorithm for QTL on a chromosome with saturated markers. The first line shows the marker map and the initial positions of five QTL. The second line shows that the position of Formula 15 can move between Formula 15 and Formula 15. Once the position of Formula 15 is sampled, Formula 15 can move between Formula 15 and Formula 15 (third line). The position of each QTL moves sequentially conditional on the positions of other QTL until positions of all QTL have been updated, which completes one cycle of the sampling (lines 4–6). The line at the bottom shows the new positions of the five QTL after one cycle of the iteration.

 
Block updating of QTL positions and genotypes is still needed in the moving-interval approach. The moving-interval approach creates high correlation between parameters because the range of one QTL position depends on positions of other QTL. This may potentially disturb the mixing of the sampler. However, our previous simulation study did not show any notable disturbance of the mixing of the sampler (WANG et al. 2005).

Post-MCMC analysis:
Unlike maximum-likelihood analysis, the product of MCMC sampling is a realized sample of all unknown variables drawn from the joint posterior distribution and the results should be interpreted in a different way. The most important parameters of interest are the locations of the QTL and their effects. The variance matrix Formula 15 for the jth QTL is not of interest by itself but serves as a mechanism to control the estimate of the QTL effect (i.e., shrinkage). In conventional Bayesian mapping (e.g., SILLANPÄÄ and ARJAS 1998, 1999; YI and XU 2000a,b; WANG et al. 2005), the marginal posterior distribution of QTL position can be depicted via plotting the number of hits by QTL in a short segment (say a 1-cM segment), called a bin, against the genome position where the bin is located. The curve is called the QTL intensity profile, denoted by Formula 15, where Formula 15 represents a location of the genome. Here we have a situation where each marker interval is assumed to have a QTL and thus all marker intervals are hit by the same number of times. If a marker interval contains a true QTL, we expect the QTL intensity profile within the interval to show a peak. Otherwise, the profile should be flat.

In addition to the QTL intensity profile, there are several alternative profiles to present the result of MCMC. One profile is the quadratic QTL effect profile, denoted by Formula 15, where Formula 15 is the average genetic effect (a vector) over the posterior samples for the QTL whose positions fall in the bin located at Formula 15. If the bin at Formula 15 is not hit by any QTL, Formula 15 is set to 0. Normally, the number of hits of a bin by QTL is proportional to the size of the QTL for that bin. However, it is possible that a bin may be hit by one or a few large QTL, which makes Formula 15 an uninformative representation of QTL position. Therefore, a more informative profile may be the profile of the quadratic effect weighted by the frequency, denoted by Formula 15. This profile is a better indicator for the position of the genome repeatedly hit by large QTL and thus is more informative than either Formula 15 or Formula 15. Another alternative profile may be the Wald test-statistic profile denoted by Formula 15, where Formula 15 is the sample variance matrix for the QTL effect at position Formula 15. A potential problem is that for a position not frequently hit by QTL, Formula 15 is not a good representative of the variance matrix of the sampled Formula 15. Therefore, we may use the average theoretical conditional posterior variance matrix of Formula 15 given in Equation 8 in place of Formula 15 to draw the Wald statistic profile. The Wald test statistic may be used as an objective way to test the significance of QTL at position Formula 15. Under the null hypothesis, i.e., there is no QTL at position Formula 15, Formula 15 may follow a chi-square distribution with Formula 15 d.f. Therefore, a critical value may be used to declare statistical significance at position Formula 15. This critical value is Formula 15, the Formula 15 percentile of a chi-square distribution with Formula 15 d.f., where Formula 15 is the type I error (not the vector of QTL effect). Let Formula 15 be the kth element of vector Formula 15 and Formula 15 be the kth diagonal element of matrix Formula 15 for Formula 15. We may propose to use Formula 15 as the univariate test statistical profile for the kth element, where Formula 15 is the chi-square test, a univariate version of the Wald test statistic. The critical value used to declare statistical significance is Formula 15 for Formula 15 and Formula 15 for Formula 15. Note that

Formula 16(16)
is the conditional expectation of Formula 16 given Formula 16 (the remaining elements of vector Formula 16) and

Formula 17(17)
is the conditional variance of Formula 17 given Formula 17. A matrix or a vector with a subscript Formula 17 means a collection of the remaining elements of the corresponding matrix or vector that excludes element k.


    APPLICATIONS
 TOP
 ABSTRACT
 THEORY AND METHODS
 APPLICATIONS
 DISCUSSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Simulated data analysis:
Fixed-interval approach:
We simulated a dynamic trait measured at Formula 17 time points from Formula 17 BC individuals. The original eight time points denoted by a vector T and the standardized time points denoted by a vector t are given below:

Formula 17

A genome consisting of a single large chromosome of 600 cM was simulated, which was covered by 61 evenly placed markers (10 cM per marker interval). The growth pattern of the dynamic trait was assumed to be controlled by 10 QTL with positions and effects listed in Table 1. The order of the polynomial that generated the growth trajectory was Formula 17. The population mean was Formula 17, the residual variance was Formula 17, and the individual-specific environmental error variance matrix was

Formula 17


View this table:
[in this window]
[in a new window]

 
TABLE 1 QTL parameters used in the simulation experiment

 
Phenotypic values of the dynamic trait were simulated using model (3). A single sample was simulated and analyzed with the Bayesian shrinkage method. Marker genotypes and the phenotypic values of the 150 simulated individuals are available from the authors on request.

The trajectories for the two genotypes of each QTL are shown in Figure 3, where Formula 17 is the trajectory for one genotype and Formula 17 is the trajectory for the alternative genotype. Figure 3 does not show the plot for QTL 4 because (a) it is close to QTL 3 and (b) with this additional plot Figure 3 would not be as neat as it is.


Figure 3
View larger version (14K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 3.— The trajectories of genotypic effects for 9 of the 10 simulated QTL (the fourth QTL effects are not included in the plots). The genotypic value for one genotype is Formula 17 and the value for the alternative genotype is Formula 17, where t is the standardized time point and T is the original time point ranging from 5 to 40.

 
Although we simulated 10 QTL, the working model included Formula 17 QTL. Here are the actual values for the hyperparameters used in the data analysis: Formula 17, Formula 17, Formula 17, and Formula 17. The initial values of all variables were sampled from their prior distributions. The MCMC was run for 6000 cycles as a burn-in period (deleted) and then for an additional 30,000 cycles after the burn-in. The chain was then thinned to reduce serial correlation by saving one observation in every 60 cycles. The posterior sample contained 5000 (Formula 17) observations for the post-MCMC analysis. The simulation experiment was replicated five times. The result showed very little variation among the replicates. Therefore, we report the result of one replicate only.

Figure 4 shows the four proposed profiles of the QTL parameters. The QTL density profile (Figure 4a) shows nine peaks at the locations of nine simulated QTL. The third and fourth QTL are too close to each other (5 cM distance) and thus could not be separated by the method. This profile shows the expected behavior of the shrinkage analysis and the characteristic of the profile is similar to that observed for the single-trait shrinkage mapping (WANG et al. 2005). The quadratic effect profile is shown in Figure 4b and the signals are very clear for positions occupied by the simulated QTL. The shrinkage effect is obvious: effects of intervals without QTL have been shrunk to zero and thus no variable selection is needed. Peaks of the quadratic effect profile are not sharp. However, the weighted quadratic effect profile (Figure 4c) shows sharp peaks at the positions where the simulated QTL occur. Finally, the Wald test-statistic profile (Figure 4d) also shows clear signals for the simulated QTL. Although Bayesian analysis is not designed for significance tests, the Wald test statistic can be used for a significance test. All nine peaks passed the critical value (Formula 17) and no peaks elsewhere on the genome are higher than the critical value. In real data analysis, one may simply choose the weighted quadratic effect profile Formula 17 or the Wald test-statistic profile Formula 17 for presentation.


Figure 4
View larger version (17K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 4.— Profiles of QTL parameters drawn from the fixed-interval analysis: (a) QTL intensity profile Formula 17, (b) quadratic effect profile Formula 17, (c) weighted quadratic effect profile Formula 17, and (d) Wald test statistic profile Formula 17. The horizontal reference line on the W plot (plot d) is the critical value for the Wald test statistic (Formula 17). The true location of each simulated QTL is indicated by a solid needle on the horizontal axis.

 
We fit the polynomial model with order Formula 17 and thus each QTL effect Formula 17 is a Formula 17 vector, Formula 17. Each component can be tested using the Z-test statistic. The Formula 17 profiles are shown in Figure 5. Clearly, some components of the simulated QTL did not pass the critical value. This means that not all QTL have the same order. Some are linear and others are quadratic and so on. Again, no variable selection is required because the coefficients corresponding to the nonsignificant one are shrunk to zero.


Figure 5
View larger version (14K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 5.— The Z-test-statistic profile for each of the four components of the QTL effect Formula 17 obtained from the fixed-interval analysis, where Formula 17 is the component for order k of the polynomial. The reference line on each plot is the critical value for the Z-test (Formula 17).

 
The estimated QTL locations and effects (posterior means) are summarized in Table 2, showing that Bayesian analysis provides quite accurate estimates of the locations and effects of QTL (see Table 1). The estimated population mean and covariance matrix for the individual specific environmental effects (posterior means) are

Formula 17
and

Formula 17
respectively, where the numerical values in parentheses are the posterior standard deviations. The estimated residual variance is Formula 17.


View this table:
[in this window]
[in a new window]

 
TABLE 2 QTL parameters for the simulated data estimated from the fixed-interval approach

 
Figure 6 shows the estimated trajectories for the nine QTL detected. They are close to the true trajectories shown in Figure 3.


Figure 6
View larger version (15K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 6.— Estimated growth trajectories for the effects of the nine detected QTL with the fixed-interval approach, where Formula 17 and Formula 17 are the trajectories of the two alternative genotypes for each QTL.

 
Finally, as a comparison, we ran the interval-mapping program for dynamic traits (YANG et al. 2006) on the simulated data. The likelihood-ratio profile is shown in Figure 7. Interval-mapping analysis only detected eight simulated QTL using a permutation generated critical value of 25.01 (from 500 permuted samples). The peaks for QTL 2 and QTL 9 are not sufficiently clear. For the two pairs of closely linked QTL, the Bayesian shrinkage method gained an extra power to separate one pair of QTL whereas the likelihood analysis separated neither pair.


Figure 7
View larger version (9K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 7.— The likelihood-ratio (LR) test-statistic profile for the simulated data generated by the dynamic trait interval-mapping program. The horizontal line indicates the empirical critical value at Formula 17 obtained from 500 permuted samples. The true location of each simulated QTL is indicated by a solid needle on the horizontal axis.

 
Moving-interval approach:
For the same design of experiment as described in the fixed-interval experiment, we simulated data with unevenly distributed markers. This time, we placed 121 markers on the same genome with one marker at each end of the genome and the remaining 119 markers were distributed randomly. On average, there was 1 marker in every 5 cM. The positions of the 10 QTL remain the same as those in the fixed-interval experiment. With the fixed-interval approach, we would have to deal with a model with 120 QTL effects. Multicolinearity would be a problem because some markers were too closely linked. With the moving-interval approach, however, we included only 15 QTL in the model. In the beginning of the MCMC process, each QTL interval expands 40 cM of the genome (8 markers on average). Note that the length of the simulated chromosome is Formula 17. With Formula 17 evenly placed QTL on the chromosome, each QTL covers a range of Formula 17. As the MCMC proceeded, the intervals moved from one cycle to another.

Figure 8 shows the QTL profiles for the moving-interval analysis. This time, the QTL intensity profile (Figure 8a) is much more informative than that shown in the fixed-interval approach (Figure 4a). Because of the high density of markers, all 10 simulated QTL have been identified. The Wald test-statistic profile (Figure 8d) shows that all 10 QTL are significant and nowhere else (other than the 10 QTL identified) in the genome shows peaks higher than the critical value. The Z-test profiles for each of the components of QTL effects are depicted in Figure 9. Compared with the results of the fixed-interval approach, there were less significant components, but for each simulated QTL there was at least one significant component. The estimated effects for the 10 QTL are given in Table 3. The moving-interval approach is highly recommended in real data analysis. It takes less computing time and produces better results than the fixed-interval approach. For the simulated data analysis, the fixed-interval approach took ~8 hr on a Pentium IV PC with a 3.60-GHz processor and 3.00 GB RAM while the moving-interval approach took ~3 hr on the same computer.


Figure 8
View larger version (14K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 8.— Profiles of QTL parameters drawn from the moving-interval analysis: (a) QTL intensity profile Formula 17, (b) quadratic effect profile Formula 17, (c) weighted quadratic effect profile Formula 17, and (d) Wald test-statistic profile Formula 17. The horizontal reference line on the W plot (plot d) is the critical value for the Wald test statistic (Formula 17).

 

Figure 9
View larger version (12K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 9.— The Z-test-statistic profile for each of the four components of the QTL effect Formula 17 from the moving-interval analysis, where Formula 17 is the component for order k of the polynomial. The reference line on each plot is the critical value for the Z-test (Formula 17).

 

View this table:
[in this window]
[in a new window]

 
TABLE 3 QTL parameters for the simulated data estimated from the moving-interval approach

 
A real-life example:
A doubled-haploid (DH) rice population consisting of 110 lines was established by anther culture from a cross between an indica rice variety Gui-630 and a japonica rice variety Taiwanjing. Ten plants from two inner rows in each plot were chosen for trait evaluation. The number of developed leaves and the length of the developing leaf on the main stem of each plant were recorded every 5–7 days, beginning from the 30th day (May 5) after the seeds were planted until the end of leaf development. The trait subject to QTL mapping was called "leaf age," which was defined as

Formula 17

(ZHOU et al. 2001). The trait was measured eight times after May 5. The time points of measurement counted by the numbers of days after May 5 were

Formula 17

On the basis of the DH population, a linkage map comprising 175 RFLP markers and covering a total length of Formula 17 cM of the rice genome was constructed (WENG et al. 2000). This map consists of the 12 largest linkage groups for each parental map. These 12 linkage groups represent roughly the 12 chromosomes of the rice genome.

We analyzed the data with both the Bayesian shrinkage method and the ML interval-mapping procedure. In the Bayesian analysis, considering the unequal marker intervals, we adopted the variable-interval approach. We placed a total of Formula 17 QTL in the whole genome. These QTL were distributed among the 12 chromosomes with the number of QTL in each chromosome proportional to the size of the chromosome. With this q-value, the average QTL interval was 17.50 cM, i.e., one QTL in every 17.5 cM. With this QTL density, we can be sure that no major QTL would be missed. The MCMC experiment was set up in the same way as that in the simulation study. For comparison, we also analyzed the data with the ML-based interval mapping (YANG et al. 2006). The empirical critical value for the significance test was obtained from 500 permutations. In both methods, the order of the Legendre polynomial was set at Formula 17, the same order that we chose for the simulation study.

The Wald test-statistic (W) profile for the Bayesian method and the likelihood-ratio (LR) test-statistic profile for the ML interval-mapping method are depicted in Figure 10. Similar to the results of the simulation study, the Bayesian shrinkage method generated clear contrasts between the signals and the noises. Overall, there were eight peaks (QTL) with the W-values >Formula 17. The ML-based interval mapping, however, detected only five QTL, all of which were detected by the Bayesian shrinkage method (compare Figure 10a with 10b). The fact that the Bayesian shrinkage method detected three more QTL implies that Bayesian shrinkage analysis has higher power than the ML-based interval mapping.


Figure 10
View larger version (20K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 10.— The test-statistic profiles for QTL mapping from the rice data analysis: (a) The Wald test-statistic (W) profile generated by the Bayesian shrinkage method; (b) the likelihood-ratio (LR) test-statistic profile drawn from the maximum-likelihood interval-mapping analysis. The horizontal reference line in the W plot (top) is the critical value (Formula 17) for the significance test. The horizontal reference line in the LR plot (bottom) is the empirical critical value at a type I error rate of Formula 17 for the LR test statistic generated from permutation tests. The genome consists of 12 chromosomes that are separated by the vertical dotted lines. The 12 chromosomes are drawn in scales proportional to their lengths. Positions of the markers are indicated by the ticks on the horizontal axis.

 
The estimated Formula 17 and Formula 17 obtained from the means of the posterior sample are

Formula 17
and Formula 17, respectively, where the values in parentheses are the posterior standard deviations. The estimated positions and effects for the eight detected QTL for the Bayesian method are listed in Table 4. Overall, the real data analysis shows the same behavior as the simulation study. The Bayesian method has successfully shrunk spurious QTL to zero, generating clear contrasts between the signals and the noises.


View this table:
[in this window]
[in a new window]

 
TABLE 4 Estimated QTL positions and effects obtained from the Bayesian shrinkage method for the leaf-age trait in the rice-crossing experiment

 

    DISCUSSION
 TOP
 ABSTRACT
 THEORY AND METHODS
 APPLICATIONS
 DISCUSSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
The Bayesian shrinkage mapping strategy was originally developed for a single quantitative trait. The study showed that a multivariate version of the shrinkage analysis works equally well for dynamic traits. The selective shrinkage property for the univariate Bayesian QTL mapping of WANG et al. (2005) is obvious, but it is not apparent for the multivariate version of the Bayesian QTL mapping. After some algebraic manipulation, Equations 7 and 8 can be rewritten as

Formula 18(18)
and

Formula 19(19)
respectively. The selective shrinkage property for the multivariate model is now obvious. The inverse of matrix Formula 19 can be expressed as Formula 19, where Formula 19 is a matrix with the same dimension as that of Formula 19 and Formula 19 is the determinant of Formula 19 (a scalar). When matrix Formula 19 is "large" (large deviation from 0), Formula 19 will be large so that Formula 19 becomes "small" (close to 0). In this case, the posterior mean of Formula 19 will be equivalent to the least-squares estimate (no shrinkage). When matrix Formula 19 is small (close to 0), Formula 19 will be small so that Formula 19 becomes extremely "large." A large Formula 19 will dominate over Formula 19 so that Formula 19 is effectively very large, leading to a very small Formula 19. A small Formula 19 in turn will lead to a small Formula 19 because of the relationship given in Equation 18. As a result, the value of Formula 19 will be shrunk toward 0.

Shrinkage mapping differs from mapping with variable selection in that the model dimension is fixed. This means that we must predetermine the number of QTL (q) before the MCMC starts. WANG et al. (2005) defined q as the "maximum number of QTL." This maximum number depends on the knowledge of the investigator on the traits of interest. It also depends on the marker density and the sample size. A denser marker map and larger sample size allow a larger q to be specified. WANG et al. (2005) and ZHANG and XU (2005) found that the q-value should be selected in such a way that, on average, a QTL interval covers ~45 cM. This optimal value was obtained empirically via simulation studies. For dynamic traits, the average QTL interval should be shorter because we expect the true number of QTL to be larger for a dynamic trait than for a single quantitative trait. Assuming that 45 cM is the optimal average interval for a single trait, it is reasonable to take Formula 19 cM as the optimal interval for a dynamic trait, because a dynamic trait fit with a polynomial of order d is equivalent to d independent single traits. In the simulation studies, we chose the average interval of 40 cM because we knew that Formula 19 is sufficient for these data, where 600 is the length of the simulated chromosome measured in centimorgans. We also tried shorter intervals, e.g., 25 cM, and the results were not much different from that when 45 cM was used (data not shown). In real data analysis, one may choose a few alternative q-values and present the results with the "optimal" q. The optimal q-value may be found by trial and error using a cross-validation technique, where we use part of the sample to estimate the parameters and test the estimated parameters using the remaining sample. We then choose the q that has the minimum prediction error.

Is the optimal q very important in the shrinkage analysis? The answer is no. It may be important for other methods, e.g., variable selection, but certainly not in shrinkage analysis. Therefore, the rule of thumb is to choose a q as large as the marker density and the sample size allow. How do we deal with the excessive QTL? By definition, the shrinkage analysis will shrink them to zero. The positions of these excessive QTL will be distributed uniformly across the regions where these QTL are started. For example, in the moving-interval analysis, several excessive QTL were trapped between QTL 2 and QTL 3, but their positions were almost uniformly distributed between the two QTL.

The order of the polynomial will determine the shape of the QTL effect trajectories. There is no reason to believe that the shape of Formula 19 is the same as that of Formula 19. Similarly, the shape of Formula 19 may be different across different QTL. Therefore, the ideal treatment would be to use different orders for different effects. The Z-test statistic profile provides such information, which can be used to select different orders for different effects. However, it is tedious to implement the Z-test approach interactively for order selection during the MCMC process. Fortunately, the shrinkage method provides an automatic mechanism to shrink higher-order coefficients to zero. For simplicity, we may simply choose the highest possible order and use it for all QTL and let the shrinkage mechanism take care of the excessive coefficients. In addition to the order of the polynomial, the effects of the polynomial regression coefficients also determine the shape of a function. It is very likely that two polynomial functions of the same order have quite different shapes because the regression coefficients are different. If a higher-order polynomial is fit to a curve that can be adequately described by a lower-order polynomial, the regression coefficients for the orders higher than necessary will be shrunk to zero. The shape of the curve fit with the higher-order polynomial that contains extra zero regression coefficients will remain the same as that fit with the lower-order polynomial.

The SSVS approach to QTL mapping proposed by YI et al. (2003a,b) is a method similar to the shrinkage analysis, both of which employed models with fixed dimensions. SSVS, however, used a latent indicator variable to indicate the status of a QTL: inclusion or exclusion. WANG et al. (2005) compared SSVS with the shrinkage analysis and found that the two approaches performed equally well (slightly in favor of the shrinkage method but difference barely notable). A multivariate version of SSVS has been developed by MEUWISSEN and GODDARD (2004), which can be adopted here to map dynamic traits. Let Formula 19 be the indicator variable for QTL j. Let Formula 19 be the effect for the jth QTL, where Formula 19 is a Formula 19 vector. The prior distribution for Formula 19 in SSVS is

Formula 19
where Formula 19 and Formula 19. This is a multivariate version of the following single-variable mixture prior,

Formula 19
where Formula 19 here is a single variable. Although Formula 19 has been suggested (GEORGE and MCCULLOCH 1993; YI et al. 2003a,b), a more informative prior for Formula 19 is Formula 19, where Formula 19 is further described by a Beta prior so that its value can be inferred from the data. The multivariate SSVS discussed here is a simple extension of the univariate SSVS. The method of MEUWISSEN and GODDARD (2004), however, goes one step further by assigning a scaled inverse chi-square prior to Formula 19. As a result, Formula 19 can be sampled from its posterior distribution. Furthermore, the authors set Formula 19 so that sampling for Formula 19 has been avoided. Since the multivariate SSVS method works for multiple-QTL mapping (MEUWISSEN and GODDARD 2004), there is no reason to believe that it does not work for dynamic-trait QTL mapping.

An even more general curve-fitting approach, called the B-splines (DE BOOR 2001), may be used in place of the Legendre polynomials. The statistical method is identical to that reported here except that Formula 19 is defined differently for the B-splines. If a biological curve has a complexity that must be described by a polynomial with order 5 or higher, B-splines are recommended. Especially, when the curve is irregular and there is no global trend to fit the data, B-splines will be more useful than the polynomials. With the B-splines, one can divide the entire time course by a few segments. The dividing points are called the knots. Within each segment, a polynomial curve with a particular order is fit and the orders of polynomials across different segments are the same. This is called local polynomial fitting. The number of knots and the order of the polynomial within each segment determine the shape of the entire time-course function. The overall degree of freedom is the sum of the polynomial order and the number of knots (rather than the product of the two). This allows a low degree of freedom of the B-splines to fit a more complicated curve than a global polynomial fitting. One example may be the lifetime milk production curve for dairy cattle. Daily milk production varies within seasons and across seasons. The lifetime curve usually shows many cycles, making B-spline an attractive alternative to the Legendre polynomial. Most biological curves, however, may be described by polynomials of order ≤5. Biological curves with higher order are perhaps very rare and hard to find in reality. Therefore, the Legendre polynomial analysis proposed in this study is sufficient for most dynamic traits.

The Bayesian shrinkage mapping procedure is developed on the basis of an "additive" model in the sense that epistatic effects of QTL pairs have been ignored. YANG et al. (2006) demonstrated that the Legendre polynomial can be easily extended to include epistatic effects. Their model (Equation 42 of YANG et al. 2006) can be directly adopted here. The only difference between the epistatic effect model and the additive model is that the f