Originally published as Genetics Published Articles Ahead of Print on June 4, 2006.
Genetics, Vol. 173, 2339-2356, August 2006, Copyright © 2006
doi:10.1534/genetics.105.054775
Mapping Quantitative Trait Loci for Longitudinal Traits in Line Crosses
Runqing Yang*,
Quan Tian* and
Shizhong Xu
,1
* School of Agriculture and Biology, Shanghai Jiaotong University, Shanghai, 201101, People's Republic of China and
Department of Botany and Plant Science, University of California, Riverside, California 92521
1 Corresponding author: Department of Botany and Plant Sciences, University of California, 900 University Ave., Riverside, CA 92521.
E-mail: xu{at}genetics.ucr.edu
Manuscript received December 15, 2005.
Accepted for publication May 30, 2006.
 |
ABSTRACT
|
|---|
Quantitative traits whose phenotypic values change over time are called longitudinal traits. Genetic analyses of longitudinal traits can be conducted using any of the following approaches: (1) treating the phenotypic values at different time points as repeated measurements of the same trait and analyzing the trait under the repeated measurements framework, (2) treating the phenotypes measured from different time points as different traits and analyzing the traits jointly on the basis of the theory of multivariate analysis, and (3) fitting a growth curve to the phenotypic values across time points and analyzing the fitted parameters of the growth trajectory under the theory of multivariate analysis. The third approach has been used in QTL mapping for longitudinal traits by fitting the data to a logistic growth trajectory. This approach applies only to the particular S-shaped growth process. In practice, a longitudinal trait may show a trajectory of any shape. We demonstrate that one can describe a longitudinal trait with orthogonal polynomials, which are sufficiently general for fitting any shaped curve. We develop a mixed-model methodology for QTL mapping of longitudinal traits and a maximum-likelihood method for parameter estimation and statistical tests. The expectation-maximization (EM) algorithm is applied to search for the maximum-likelihood estimates of parameters. The method is verified with simulated data and demonstrated with experimental data from a pseudobackcross family of Populus (poplar) trees.
THE genetic variance of a quantitative trait is controlled by the segregation of many genes, each with a small effect. Small genetic effects are collectively called the polygenic effects. In the infinitesimal model, only polygenic effects exist. An alternative to this classical definition of genetic variance of a quantitative trait is the theory that the genetic variances of many quantitative traits are actually controlled by the segregation of one or more major genes plus numerous small genetic effects. Using the latter definition, the phenotype of interest still shows a continuous distribution because of the joint contribution from the polygenic effects and random environmental variant. Such a model is called the oligogenic model, which can be tested using segregation analysis (ELSTON and STEWART 1971; MORTON and MACLEAN 1974). In fact, the oligogenic model is the basis for QTL mapping because QTL analysis with the current statistical methods and sample sizes (n < 1000) can detect only genes with major or moderate effects (e.g., HALEY and KNOTT 1992; ZENG 1994; KAO et al. 1999).
When a trait is measured repeatedly over time, it is called a time-dependent trait or longitudinal trait. Three approaches are currently available for analyzing longitudinal traits. The first approach is to treat the phenotypic values at different time points as repeated measurements of the same trait and analyze the trait under the repeated measurements framework. The second approach is to treat the phenotypes measured from different time points as different traits and analyze the traits jointly on the basis of the theory of multivariate analysis (WU et al. 1999). The third approach is to fit a growth curve to the phenotypic values across different time points and analyze the fitted parameters of the growth trajectory under the theory of multivariate analysis in a much reduced dimension (WU et al. 2002). The method of fitting a growth trajectory is considered to be the optimal one because it treats phenotypes measured over time as different traits, but takes into account the correlation generated by the ordered time points.
Abundant molecular markers are now available for QTL mapping. Many QTL mapping studies have focused on traits measured at a single time point (e.g., HALEY and KNOTT 1992; ZENG 1994). Methods for single-trait QTL mapping cannot be directly adopted for longitudinal traits because phenotypes measured at different time points may be controlled by different sets of genes (NUZHDIN and PASYUKOVA 1997; VERHAEGEN et al. 1997; EMEBIRI et al. 1998; YAN et al. 1998a,b; WU et al. 1999, 2002a). Several methods have been proposed for multiple-trait analysis in the context of QTL mapping (e.g., JIANG and ZENG 1995; KOROL et al. 1995; RONIN et al. 1995; EAVES et al. 1996; WU et al. 1999; KNOTT and HALEY 2000). The method of JIANG and ZENG (1995) is one of the first methods for multivariate QTL mapping. In practice, the method is suitable only for a few traits. As the number of traits increases, computational time will be a prohibitive factor. To reduce the dimensionality of the multivariate analysis, a principle component analysis has been proposed (MANGIN et al. 1998; KOROL et al. 2001), in which the first principle component, called a "super trait," is used as a representative of the multiple-trait complex. QTL mapping can be performed on the super trait, which, methodologically, is not different from the univariate QTL mapping. However, the interpretation of the super trait in a biological setting is challenging.
It is in general agreement that the phenotypic values measured over time are correlated. The correlation may be generated by some underlying biological process. In fact, if the trait were measured at an infinite number of time points, the phenotypic value would form a smooth curve as a function of time. This phenotypic profile is called the growth trajectory, which may be governed by a few growth parameters. Genetic analysis of traits with time-course repeated measurements may be conducted on the few growth parameters. The work of WU et al. (2002) was among the early studies on QTL mapping based on the growth models. The authors first fit the phenotypes of a longitudinal trait to a growth model (Equations 3 and 4 of WU et al. 2002) and then treated the estimated growth parameters as multiple traits. QTL mapping was then conducted on the growth parameters using a standard multivariate QTL mapping procedure (JIANG and ZENG 1995). The two-step approach may not be optimal because errors in the estimated growth parameters (the first step) have not been taken into account. Nevertheless, the idea presented by WU et al. (2002) is fundamentally important in the study of longitudinal traits. Other scientists have been attracted to this important area of QTL mapping. Rongling Wu and his colleagues (MA et al. 2002; WU et al. 2002b, 2003, 2004) presented substantial improvement on the method of Weiren Wu and his group (WU et al. 2002). The improvement was reflected by the change from two steps to a single step. MA et al. (2002) used a logistic curve to model the growth trajectory, in which the genetic value is described by
, where t is the time point, a is the asymptotic value of g, r is the relative rate of growth, and b (combined with a) is a parameter describing the starting point of g. The combination of the three parameters, {a, b, r}, determines the actual shape of the growth trajectory. Each of the three parameters is further partitioned into an additive and a dominance effect. If the trait is measured at
time points where
, analysis of the
measurements is always performed in three dimensions, regardless of the dimension of
. The time-course functional trait analysis replaces a continuous function with a finite number of collected data points to facilitate estimation of the functional parameters.
A growth trajectory is governed by the biological law of growth (WEST et al. 2001). The shape, in general, is sigmoid or S-like, a monotonic increasing function of time. Many biological curves, however, cannot be described by the monotonic logistic function. For example, the yearly gain of growth in trees is not sigmoid with time; rather, it is a negative exponential. Daily milk production of cows may be better described by a quadratic function of time. A reaction norm (a trait measured in a continuously varying environment) may have an arbitrary shape. The logistic function cannot be applied to these curves as they are not sigmoid.
Even if the biological process has an S-shaped trajectory, the lack of linearity in the logistic regression limits the use of extensively developed inference methods available for linear models. As a result, the EM algorithm developed for the logistic function cannot be generalized to other types of functions. If we follow the approach of logistic analysis and develop methods using a different specialized function for a different curve, a different EM algorithm would be required for each of the functions, because each function involves a different set of parameters with different meanings. This is another area that needs further investigation.
We propose to use orthogonal polynomials to fit the biological curve and perform QTL mapping for longitudinal traits with the following justifications. First, the polynomial analysis is sufficiently flexible for fitting a biological curve in an arbitrary shape. This can be achieved by choosing different orders of the polynomials. Second, the polynomial function is linear in the parameters, although it is nonlinear in time. As a consequence, we can take advantage of the extensively developed inference methods available for linear models. Third, the same EM algorithm developed for a linear mixed-effects model can be applied to all types of biological curves, requiring no modification, which is clearly in contrast to the aforementioned logistic approach.
Orthogonal polynomials have been applied to genetic analysis for longitudinal traits in pedigree data (HENDERSON 1982; SWALVE 2000; JENSEN 2001). These analyses utilized the random regression model (RRM) because the regression (polynomial) coefficients of individual animals were treated as random effects. MEYER and HILL (1997) showed the equivalence of the covariance function of KIRKPATRICK et al. (1990) to the RRM. A key element in the RRM analysis is the function submodels for the random regression (MISZTAL et al. 2000). One such orthogonal polynomial is the normalized Legendre polynomial (KIRKPATRICK et al. 1990; MEYER 2000), which has been extensively used to evaluate breeding values for various longitudinal traits of large farm animals (SCHAEFFER 2004). The publication of the first QTL mapping study for longitudinal traits in pedigrees has been conducted using Legendre orthogonal polynomials (MACGREGOR et al. 2005), demonstrating the applied importance of this approach. Notably, dominance and epistatic effects of QTL were assumed to be absent in that study. Here we extend this idea to a much more general framework, which includes dominance as well as traditional QTL populations such as the F2 and backcross.
In this study, we take the polynomial approach to mapping QTL for longitudinal traits, but address the problem in the context of line-crossing experiments. Line crosses are major designs of the experiment for genetic analysis. An F2 population derived from the cross of two inbred lines is commonly used genetic material for QTL mapping. We first introduce the theory and methodology based on the F2 mating design. We then conduct a series of simulation experiments to validate the method. Finally, we apply the method to QTL mapping for the growth trajectory in a pseudobackcross family of Populus (poplar) trees and compare the result with that of MA et al. (2002).
 |
THEORY AND METHODS
|
|---|
Genetic model for longitudinal traits:
On the basis of Mendel's law of inheritance, there are three possible genotypes in an F2 population for a segregating locus, denoted by AA, Aa, and aa, respectively, for the three genotypes. Let
be the phenotypic value of individual i measured at time
(a standardized time point ranging from 1 to +1, see APPENDIX B for the definition of standardized time point). This phenotypic value
can be described by the following linear model,
 | (1) |
for
, where n is the number of individuals, µ(
) is the population mean,
is a genotype indicator variable (defined as 1 for AA, 0 for Aa, and 1 for aa) for the ith individual at the genetic locus of interest,
is a dominance indicator variable defined as 1 for the heterozygote and 1 for the homozygote,
(
) is the additive effect,
(
) is the dominance effect,
is an individual-specific time-dependent residual error with an i.i.d.
distribution, and
is an individual-specific time-independent experimental error with an i.i.d.
distribution. This is a mixed-effects model with µ(
),
(
), and
(
) being treated as the fixed effects and
as the random effect. The purpose of the analysis is to estimate and test
(
) and
(
), the time-dependent functional genetic effects of a hypothesized QTL at a putative position of the genome.
The genetic variance contributed by the genetic locus of interest at time
is expressed as
 | (2) |
where
and
under the current definitions of x and z and the assumption of no segregation distortion (see APPENDIX A for the derivations). The phenotypic variance should be written as
 | (3) |
The proportion of the phenotypic variance contributed by the QTL is
 | (4) |
Orthogonal polynomial for longitudinal traits:
All the model parameters, except
2, are functions of time. The functional relationships between genetic parameters and the time variable are described with an orthogonal polynomial. Define
as a basis of the polynomial with order
. A basis is also called a submodel. Method to construct the basis of an orthogonal polynomial is given in APPENDIX B. Let us define
as a vector of population means, which is time independent. The time-dependent population mean µ(
) can be described as a linear combination of µ weighted by the basis of the polynomials; i.e.,
. Similarly, we can describe other parameters with the same basis; e.g.,
and
, where
and
. The random effect at time
can also be expressed as a linear function of the time-independent effects,
, where
for
. Since
is treated as a vector of random effects, we assume that
is i.i.d.
, where
is an
positive definite covariance matrix.
Recall that we are dealing with a longitudinal trait (a trait measured repeatedly over some time interval on the same individual). If the phenotypic value measured at any given time point were treated as an individual trait, for
time points, we would be dealing with
different traits. The vector of genetic effects, say
, would have a dimension of
because each element of vector
represents an effect for a trait. In the polynomial analysis of order r, however, the dimension of vector
is reduced to
for
. Therefore, the polynomial analysis for longitudinal traits is a special dimension reduction technique.
With the polynomial reparameterization, model (1) is now rewritten as a linear function of the time-independent parameters,
 | (5) |
The expectation function of
conditional on the fixed effects is
 | (6) |
The covariance function between
and
conditional on the fixed effects is
 | (7) |
where
is another standardized time point,
for
, and
for
. The first term in the right-hand side of Equation 7,
, is actually the covariance between
and
for
. When
, it becomes the variance of
; i.e.,
.
Model (5) is now linear in parameters, which can be estimated under the framework of the mixed-model analysis. We first estimate µ,
,
,
, and
2 and then use the basis of the polynomial to convert µ,
,
into their time-dependent functional counterparts.
The polynomial transformation allows us to express the total genetic variance as a function of the quadratic terms of the time-independent genetic parameters, as shown below,
 | (8) |
The phenotypic variance may be rewritten as
 | (9) |
The proportion of the phenotypic variance contributed by the major gene is
 | (10) |
One can examine the behavior of each genotype by evaluating the change of the genotypic value across time. Let
,
, and
be the values of the three genotypes at time
. They are expressed as
 | (11) |
Likelihood function:
The phenotypic values for each individual are collected at
fixed time points, denoted by
. We add a subscript to variable
to indicate that
now has taken a particular value from one of the
time points. We use an
vector to denote an array of the phenotypic values for the ith individual,
Define
as an
matrix. In matrix notation, the linear model for vector
is
 | (12) |
where
is an
vector for the experimental errors with
, where
is an
identity matrix. The conditional expectation of model (12) given the fixed effects is
 | (13) |
and the variancecovariance matrix is
 | (14) |
which applies to all
.
The models discussed so far are based on known genotypes of the QTL. In practice, the genotypes are not observable and the likelihood function must be constructed by taking into account all three genotypes for all individuals. Let us order the three genotypes, aa, Aa, and AA, as genotypes 1, 2, and 3 (indexed by k), respectively, and use a numerical variable G to denote the three ordered genotypes. Let
be the genotype of the major gene for individual i. Before we observe the phenotypic values, we should use markers to infer the probability of
. This conditional probability is denoted by
, where
denotes marker information. The conditional probability
varies across individuals because different individuals are supposed to have different marker genotypes. Methods to calculate
can be found in LANDER and BOTSTEIN (1989) for interval mapping and JIANG and ZENG (1997) for multipoint mapping. We take the multipoint method because it facilitates an automatic mechanism to handle missing marker genotypes. Recall that
is the conditional expectation of
given the genotype of individual i and the population parameters. Since each individual can take one of three genotypes, there are only three distinguishable conditional expectations, denoted by
 | (15) |
The probability density of phenotypes for the ith individual with the kth genotype (
) is
 | (16) |
Let
= {µ,
,
,
,
2} be the parameters. The likelihood function is
 | (17) |
Several numerical methods can be used to search for the maximum-likelihood estimates (MLEs) of the parameters, e.g., the simplex method of NELDER and MEAD (1965). We decided to use the EM algorithm (DEMPSTER et al. 1977; LANDER and BOTSTEIN 1989; JANSEN 1993) because explicit equations for the maximization step are available in the complete data situation.
EM algorithm for parameter estimation:
With the EM algorithm, we classify variables into data, missing values, and parameters. In QTL mapping, y = {yi} and
are data,
are parameters, and
are variables with missing values. The genotype indicator variables
, however, are determined by
. So, the missing values can be rewritten as
. The likelihood function given in (17) with all the missing values integrated out is called the observed likelihood function. Instead of directly maximizing the observed likelihood function, the EM algorithm deals with the following complete-data likelihood function,
 | (18) |
where
 | (19) |
 | (20) |
and
 | (21) |
Note that we introduce a new variable
whose value is determined by
as
for
and
for
. Because the complete-data likelihood function involves missing values, integration (or expectation) is taken with respect to the missing values, not for the complete-data likelihood function, but for the following log-transformed complete-data likelihood function,
 | (22) |
The expectation of Equation 22 with respect to the missing values is
 | (23) |
where
 | (24) |
 | (25) |
and
 | (26) |
The EM algorithm requires: (1) finding the partial derivatives of
with respect to the parameters, (2) setting these partial derivatives equal to zero, and (3) searching for explicit solutions for the parameters as a function of the missing values. This completes the maximization step. The expectation step requires taking the expectations for all terms that involve the missing values. Since the complete-data likelihood function is just a likelihood function for a typical mixed model, we can simply utilize the existing mixed-model EM algorithm to find the MLE of parameters (HENDERSON 1986). The following are the EM steps we implement for the mixed-model analysis.
- Step 0: Set
= 0 and initialize all parameters with values in their legal domain, denoted by
(
).
- Step 1 (E1): Compute the posterior probabilities of the three genotypes for each individual,
 | (27) |
- Step 2 (E2): Find the posterior distribution of the random effect
. This posterior distribution turns out to be a mixture of three normal distributions with a mean
 | (28) |
and a covariance matrix
 | (29) |
All parameters appearing in the right-hand side of the equations should be substituted by their most current values of the parameters at iteration t. This statement also holds for subsequent steps.
- Step 3 (E3): Compute all the expectations involved in the following maximization steps (see the next paragraph for detailed expressions of the expectations).
- Step 4 (M1): Update the population mean
 | (30) |
- Step 5 (M2): Update the additive effect
 | (31) |
- Step 6 (M3): Update the dominance effect
 | (32) |
- Step 7 (M4): Update the covariance matrix of the random effect
 | (33) |
- Step 8 (M5): Update the residual variance
 | (34) |
- Step 9: Increment
by 1; i.e., let
=
+1, and repeat steps 18 (three E-steps and five M-steps) until
holds.
In this paragraph, we provide explicit expressions for the expectations of various terms involved in the EM steps. The expectation of the quadratic term of the random effects is
 | (35) |
where
and
are already given in step 2 (Equations 28 and 29). We need to provide only the expectations related to the genotype indicator variables. These expectations are
 | (36) |
The EM algorithm was implemented via a FORTRAN90 program, which is available from the authors on request.
Hypothesis tests:
Several important hypothesis tests are interesting and can be performed via the likelihood-ratio test statistics. The first hypothesis is "no segregation of a QTL at the current position," which is denoted by
. The test statistic for this hypothesis is
 | (37) |
where
is the log-likelihood function evaluated at
and
is the log likelihood under the null model evaluated at
. Note that
differs from
in two aspects: (1)
has a lower dimension because
has been enforced, and (2) the three parameter sets in
are estimated from the null model and they are different from the counterparts estimated under the full model. The EM algorithm for estimating
is a simple special case of the EM algorithm for the full model. Under the null hypothesis,
will follow approximately a chi-square distribution with
, where
and
are the dimensions of
and
, respectively. These two vectors differ by
, each having a dimension of
, which explains why
.
When
is rejected, we can further test whether the QTL effects are due to additive or dominance effects. First, we can test the departure from additivity, i.e., the dominance effect. The null hypothesis of "no dominance effect" is
and the test statistic is
 | (38) |
where
is the parameter vector estimated under
.
To test the additive effect, our null hypothesis becomes "no additive effect," denoted by
. The test statistic is
 | (39) |
where
. Each one of
and
will approximately follow a chi-square distribution with
d.f.
It is also possible to test other hypotheses, such us whether a polynomial of order r +1 is sufficiently better than that of order r and whether the structured V-matrix sufficiently explains the covariances of the experimental errors compared to an unstructured V-matrix. We defer these hypothesis tests to the DATA ANALYSIS and DISCUSSION sections.
So far we have discussed hypothesis tests only at one location of the genome. To scan the entire genome for QTL, we test every putative position of the genome with a 1- or 2-cM increment and plot the test statistic against the genome location to form a test statistic profile. Positions that correspond to significant peaks of the profile provide estimates of the QTL locations. The critical value of the test statistic used to declare significance is obtained via the permutation test (CHURCHILL and DOERGE 1994).
 |
DATA ANALYSIS
|
|---|
Simulation studies:
We simulated an F2 population under several different sample sizes. A single chromosome of 200 cM was simulated with 21 evenly spaced markers (10 cM per interval). The position of the first marker was designated as position zero and positions of all other markers were measured as the distances from the first marker. We put three QTL at positions 23, 94, and 187 cM, respectively. The range of the original time points (t) was [1, 150]. The cumulative heritabilities of the three QTL were 0.24, 0.24, and 0.15, respectively, each of which was defined as
 | (40) |
where
is the proportion of the phenotypic variance explained by the QTL at time point
[see Equation 4 for the definition of
]. The QTL effects that generate such values of
were found by trial and error and are given in Table 1 with Legendre polynomial of order 3 (
). The simulated population mean was
The functional curve for the mean was
. The functions of population mean, additive effect, dominance effect, and heritability of the three QTL plotted against time are given in Figure 1, a, b, c, and d, respectively. The individual specific and time-dependent random effect was simulated using
, where
was generated from a
distribution with
which is a
positive definite covariance matrix. The residual variance was
. The model to generate the phenotypic values was
 | (41) |
where the summation indicated the sum of three QTL.