## Abstract

Quantitative traits whose phenotypic values change over time are called longitudinal traits. Genetic analyses of longitudinal traits can be conducted using any of the following approaches: (1) treating the phenotypic values at different time points as repeated measurements of the same trait and analyzing the trait under the repeated measurements framework, (2) treating the phenotypes measured at different time points as different traits and analyzing the traits jointly on the basis of the theory of multivariate analysis, and (3) fitting a growth curve to the phenotypic values across time points and analyzing the fitted parameters of the growth trajectory under the theory of multivariate analysis. The third approach has been used in QTL mapping for longitudinal traits by fitting the data to a logistic growth trajectory. This approach applies only to the particular S-shaped growth process. In practice, a longitudinal trait may show a trajectory of any shape. We demonstrate that one can describe a longitudinal trait with orthogonal polynomials, which are sufficiently general for fitting curves of any shape. We develop a mixed-model methodology for QTL mapping of longitudinal traits and a maximum-likelihood method for parameter estimation and statistical tests. The expectation-maximization (EM) algorithm is applied to search for the maximum-likelihood estimates of parameters. The method is verified with simulated data and demonstrated with experimental data from a pseudobackcross family of Populus (poplar) trees.

THE genetic variance of a quantitative trait is controlled by the segregation of many genes, each with a small effect. Small genetic effects are collectively called the polygenic effects. In the infinitesimal model, only polygenic effects exist. An alternative to this classical definition of genetic variance of a quantitative trait is the theory that the genetic variances of many quantitative traits are actually controlled by the segregation of one or more major genes plus numerous small genetic effects. Under the latter definition, the phenotype of interest still shows a continuous distribution because of the joint contribution from the polygenic effects and random environmental variation. Such a model is called the oligogenic model, which can be tested using segregation analysis (Elston and Stewart 1971; Morton and MacLean 1974). In fact, the oligogenic model is the basis for QTL mapping because QTL analysis with the current statistical methods and sample sizes (*n* < 1000) can detect only genes with major or moderate effects (*e.g.*, Haley and Knott 1992; Zeng 1994; Kao *et al.* 1999).

When a trait is measured repeatedly over time, it is called a time-dependent trait or longitudinal trait. Three approaches are currently available for analyzing longitudinal traits. The first approach is to treat the phenotypic values at different time points as repeated measurements of the same trait and analyze the trait under the repeated measurements framework. The second approach is to treat the phenotypes measured from different time points as different traits and analyze the traits jointly on the basis of the theory of multivariate analysis (Wu *et al.* 1999). The third approach is to fit a growth curve to the phenotypic values across different time points and analyze the fitted parameters of the growth trajectory under the theory of multivariate analysis in a much reduced dimension (Wu *et al.* 2002). The method of fitting a growth trajectory is considered to be the optimal one because it treats phenotypes measured over time as different traits, but takes into account the correlation generated by the ordered time points.

Abundant molecular markers are now available for QTL mapping. Many QTL mapping studies have focused on traits measured at a single time point (*e.g.*, Haley and Knott 1992; Zeng 1994). Methods for single-trait QTL mapping cannot be directly adopted for longitudinal traits because phenotypes measured at different time points may be controlled by different sets of genes (Nuzhdin and Pasyukova 1997; Verhaegen *et al.* 1997; Emebiri *et al.* 1998; Yan *et al.* 1998a,b; Wu *et al*. 1999, 2002a). Several methods have been proposed for multiple-trait analysis in the context of QTL mapping (*e.g.*, Jiang and Zeng 1995; Korol *et al.* 1995; Ronin *et al.* 1995; Eaves *et al.* 1996; Wu *et al.* 1999; Knott and Haley 2000). The method of Jiang and Zeng (1995) is one of the first methods for multivariate QTL mapping. In practice, the method is suitable only for a few traits. As the number of traits increases, computational time becomes a prohibitive factor. To reduce the dimensionality of the multivariate analysis, a principal component analysis has been proposed (Mangin *et al.* 1998; Korol *et al.* 2001), in which the first principal component, called a “super trait,” is used as a representative of the multiple-trait complex. QTL mapping can be performed on the super trait, which, methodologically, is not different from univariate QTL mapping. However, the interpretation of the super trait in a biological setting is challenging.

It is generally agreed that phenotypic values measured over time are correlated. The correlation may be generated by some underlying biological process. In fact, if the trait were measured at an infinite number of time points, the phenotypic values would form a smooth curve as a function of time. This phenotypic profile is called the growth trajectory, which may be governed by a few growth parameters. Genetic analysis of traits with time-course repeated measurements may be conducted on the few growth parameters. The work of Wu *et al.* (2002) was among the early studies on QTL mapping based on growth models. The authors first fit the phenotypes of a longitudinal trait to a growth model (Equations 3 and 4 of Wu *et al.* 2002) and then treated the estimated growth parameters as multiple traits. QTL mapping was then conducted on the growth parameters using a standard multivariate QTL mapping procedure (Jiang and Zeng 1995). The two-step approach may not be optimal because errors in the estimated growth parameters (the first step) have not been taken into account. Nevertheless, the idea presented by Wu *et al.* (2002) is fundamentally important in the study of longitudinal traits. Other scientists have been attracted to this important area of QTL mapping. Rongling Wu and his colleagues (Ma *et al*. 2002; Wu *et al.* 2002b, 2003, 2004) presented substantial improvement on the method of Weiren Wu and his group (Wu *et al.* 2002). The improvement was reflected by the change from two steps to a single step. Ma *et al*. (2002) used a logistic curve to model the growth trajectory, in which the genetic value is described by g(t) = a/(1 + be^{−rt}), where *t* is the time point, *a* is the asymptotic value of *g*, *r* is the relative rate of growth, and *b* (combined with *a*) is a parameter describing the starting point of *g*. The combination of the three parameters, {*a*, *b*, *r*}, determines the actual shape of the growth trajectory.
Each of the three parameters is further partitioned into an additive and a dominance effect. If the trait is measured at *m* time points where *m* > 3, analysis of the measurements is always performed in three dimensions, regardless of the magnitude of *m*. The time-course functional trait analysis replaces a continuous function with a finite number of collected data points to facilitate estimation of the functional parameters.

A growth trajectory is governed by the biological law of growth (West *et al.* 2001). The shape, in general, is sigmoid or S-like, a monotonic increasing function of time. Many biological curves, however, cannot be described by the monotonic logistic function. For example, the yearly gain of growth in trees is not sigmoid with time; rather, it is a negative exponential. Daily milk production of cows may be better described by a quadratic function of time. A reaction norm (a trait measured in a continuously varying environment) may have an arbitrary shape. The logistic function cannot be applied to these curves as they are not sigmoid.

Even if the biological process has an S-shaped trajectory, the lack of linearity in the logistic regression limits the use of extensively developed inference methods available for linear models. As a result, the EM algorithm developed for the logistic function cannot be generalized to other types of functions. If we follow the approach of logistic analysis and develop methods using a different specialized function for a different curve, a different EM algorithm would be required for each of the functions, because each function involves a different set of parameters with different meanings. This is another area that needs further investigation.

We propose to use orthogonal polynomials to fit the biological curve and perform QTL mapping for longitudinal traits with the following justifications. First, the polynomial analysis is sufficiently flexible for fitting a biological curve in an arbitrary shape. This can be achieved by choosing different orders of the polynomials. Second, the polynomial function is linear in the parameters, although it is nonlinear in time. As a consequence, we can take advantage of the extensively developed inference methods available for linear models. Third, the same EM algorithm developed for a linear mixed-effects model can be applied to all types of biological curves, requiring no modification, which is clearly in contrast to the aforementioned logistic approach.

Orthogonal polynomials have been applied to genetic analysis for longitudinal traits in pedigree data (Henderson 1982; Swalve 2000; Jensen 2001). These analyses utilized the random regression model (RRM) because the regression (polynomial) coefficients of individual animals were treated as random effects. Meyer and Hill (1997) showed the equivalence of the covariance function of Kirkpatrick *et al.* (1990) to the RRM. A key element in the RRM analysis is the function submodels for the random regression (Misztal *et al*. 2000). One such orthogonal polynomial is the normalized Legendre polynomial (Kirkpatrick *et al.* 1990; Meyer 2000), which has been extensively used to evaluate breeding values for various longitudinal traits of large farm animals (Schaeffer 2004). The first QTL mapping study for longitudinal traits in pedigrees was conducted using Legendre orthogonal polynomials (Macgregor *et al*. 2005), demonstrating the applied importance of this approach. Notably, dominance and epistatic effects of QTL were assumed to be absent in that study. Here we extend this idea to a much more general framework, which includes dominance as well as traditional QTL populations such as the F_{2} and backcross.

In this study, we take the polynomial approach to mapping QTL for longitudinal traits, but address the problem in the context of line-crossing experiments. Line crosses are among the major experimental designs for genetic analysis. An F_{2} population derived from the cross of two inbred lines is commonly used genetic material for QTL mapping. We first introduce the theory and methodology based on the F_{2} mating design. We then conduct a series of simulation experiments to validate the method. Finally, we apply the method to QTL mapping for the growth trajectory in a pseudobackcross family of Populus (poplar) trees and compare the result with that of Ma *et al*. (2002).

## THEORY AND METHODS

### Genetic model for longitudinal traits:

On the basis of Mendel's law of inheritance, there are three possible genotypes in an F_{2} population for a segregating locus, denoted by *AA*, *Aa*, and *aa*, respectively. Let y_i(τ) be the phenotypic value of individual *i* measured at time τ (a standardized time point ranging from −1 to +1; see appendix b for the definition of a standardized time point). This phenotypic value can be described by the following linear model,

y_i(τ) = μ(τ) + x_iα(τ) + z_iδ(τ) + ξ_i(τ) + ε_i, (1)

for i = 1, …, n, where *n* is the number of individuals, μ(τ) is the population mean, x_i is a genotype indicator variable (defined as 1 for *AA*, 0 for *Aa*, and −1 for *aa*) for the *i*th individual at the genetic locus of interest, z_i is a dominance indicator variable defined as 1 for the heterozygote and −1 for the homozygotes, α(τ) is the additive effect, δ(τ) is the dominance effect, ξ_i(τ) is an individual-specific time-dependent residual error with an i.i.d. distribution, and ε_i is an individual-specific time-independent experimental error with an i.i.d. N(0, σ²) distribution. This is a mixed-effects model with μ(τ), α(τ), and δ(τ) treated as the fixed effects and ξ_i(τ) as the random effect. The purpose of the analysis is to estimate and test α(τ) and δ(τ), the time-dependent functional genetic effects of a hypothesized QTL at a putative position of the genome.

The genetic variance contributed by the genetic locus of interest at time τ is expressed as

σ_g²(τ) = σ_x²α²(τ) + σ_z²δ²(τ), (2)

where σ_x² = ½ and σ_z² = 1 under the current definitions of *x* and *z* and the assumption of no segregation distortion (see appendix a for the derivations). The phenotypic variance is

σ_p²(τ) = σ_g²(τ) + σ_ξ²(τ) + σ², (3)

where σ_ξ²(τ) is the variance of ξ_i(τ). The proportion of the phenotypic variance contributed by the QTL is

h²(τ) = σ_g²(τ)/σ_p²(τ). (4)
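As a numeric check of these variance components, one can simulate the indicator variables under the F_{2} segregation ratios (¼ *AA*, ½ *Aa*, ¼ *aa*) and confirm that var(x) = ½, var(z) = 1, and cov(x, z) = 0. The sketch below is purely illustrative and is not part of the original analysis.

```python
import numpy as np

# F2 genotype frequencies under no segregation distortion: AA 1/4, Aa 1/2, aa 1/4
rng = np.random.default_rng(1)
geno = rng.choice(["AA", "Aa", "aa"], size=200_000, p=[0.25, 0.5, 0.25])
x = np.where(geno == "AA", 1, np.where(geno == "Aa", 0, -1))  # additive indicator
z = np.where(geno == "Aa", 1, -1)                             # dominance indicator
print(x.var(), z.var(), np.cov(x, z)[0, 1])
```

The sample variances converge to ½ and 1, and the sample covariance to 0, matching the coefficients of α²(τ) and δ²(τ) in the genetic variance.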

### Orthogonal polynomial for longitudinal traits:

All the model parameters, except σ², are functions of time. The functional relationships between the genetic parameters and the time variable are described with an orthogonal polynomial. Define **T**(τ) = [T_0(τ), T_1(τ), …, T_r(τ)] as a basis of the polynomial with order *r*. A basis is also called a submodel. The method to construct the basis of an orthogonal polynomial is given in appendix b. Let us define **μ** = (μ_0, μ_1, …, μ_r)^T as a vector of population means, which is time independent. The time-dependent population mean μ(τ) can be described as a linear combination of **μ** weighted by the basis of the polynomials; *i.e.*, μ(τ) = **T**(τ)**μ**. Similarly, we can describe the other parameters with the same basis; *e.g.*, α(τ) = **T**(τ)**α** and δ(τ) = **T**(τ)**δ**, where **α** = (α_0, α_1, …, α_r)^T and **δ** = (δ_0, δ_1, …, δ_r)^T. The random effect at time τ can also be expressed as a linear function of the time-independent effects, ξ_i(τ) = **T**(τ)**ξ**_i, where **ξ**_i = (ξ_{i0}, ξ_{i1}, …, ξ_{ir})^T for i = 1, …, n. Since **ξ**_i is treated as a vector of random effects, we assume that **ξ**_i is i.i.d. N(**0**, **Σ**), where **Σ** is an (r + 1) × (r + 1) positive definite covariance matrix.
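One concrete choice of such a basis is the normalized Legendre polynomial mentioned later in the text. The sketch below builds the basis numerically, assuming the standard normalization φ_j(τ) = √((2j+1)/2)·P_j(τ) on [−1, 1]; it is an illustration, not the construction given in appendix b.

```python
import numpy as np
from numpy.polynomial import legendre as NPL

def legendre_basis(tau, r):
    """Return the n x (r+1) matrix whose row i is T(tau_i) for the
    normalized Legendre basis of order r on [-1, 1]."""
    tau = np.atleast_1d(tau)
    # column j holds P_j(tau); normalize so that each column integrates to 1
    P = np.stack([NPL.Legendre.basis(j)(tau) for j in range(r + 1)], axis=1)
    norm = np.sqrt((2 * np.arange(r + 1) + 1) / 2.0)
    return P * norm

# verify orthonormality with Gauss-Legendre quadrature on [-1, 1]
x, w = NPL.leggauss(20)
T = legendre_basis(x, 3)
G = T.T @ (w[:, None] * T)   # Gram matrix; equals the identity for an orthonormal basis
print(np.allclose(G, np.eye(4)))
```

Any other orthogonal basis could be substituted; only the matrix **T**(τ) changes, not the mixed-model machinery that follows.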

Recall that we are dealing with a longitudinal trait (a trait measured repeatedly over some time interval on the same individual). If the phenotypic value measured at any given time point were treated as an individual trait, for *m* time points we would be dealing with *m* different traits. The vector of genetic effects, say **α**, would have a dimension of *m* because each element of vector **α** represents an effect for one trait. In the polynomial analysis of order *r*, however, the dimension of vector **α** is reduced to r + 1 for r + 1 ≤ m. Therefore, the polynomial analysis for longitudinal traits is a special dimension-reduction technique.

With the polynomial reparameterization, model (1) is now rewritten as a linear function of the time-independent parameters,

y_i(τ) = **T**(τ)**μ** + x_i**T**(τ)**α** + z_i**T**(τ)**δ** + **T**(τ)**ξ**_i + ε_i. (5)

The expectation of y_i(τ) conditional on the fixed effects is

E[y_i(τ)] = **T**(τ)(**μ** + x_i**α** + z_i**δ**). (6)

The covariance between y_i(τ) and y_i(τ′) conditional on the fixed effects is

cov[y_i(τ), y_i(τ′)] = **T**(τ)**Σ****T**^T(τ′) + I(τ = τ′)σ², (7)

where τ′ is another standardized time point and the indicator I(τ = τ′) equals 1 for τ = τ′ and 0 for τ ≠ τ′. The first term on the right-hand side of Equation 7, **T**(τ)**Σ****T**^T(τ′), is actually the covariance between ξ_i(τ) and ξ_i(τ′) for τ ≠ τ′. When τ = τ′, it becomes the variance of ξ_i(τ); *i.e.*, var[ξ_i(τ)] = **T**(τ)**Σ****T**^T(τ).

Model (5) is now linear in parameters, which can be estimated under the framework of the mixed-model analysis. We first estimate **μ**, **α**, **δ**, **Σ**, and σ^{2} and then use the basis of the polynomial to convert **μ**, **α**, **δ** into their time-dependent functional counterparts.

The polynomial transformation allows us to express the total genetic variance as a function of the quadratic terms of the time-independent genetic parameters, as shown below,

σ_g²(τ) = ½**T**(τ)**αα**^T**T**^T(τ) + **T**(τ)**δδ**^T**T**^T(τ). (8)

The phenotypic variance may be rewritten as

σ_p²(τ) = σ_g²(τ) + **T**(τ)**Σ****T**^T(τ) + σ². (9)

The proportion of the phenotypic variance contributed by the major gene is

h²(τ) = σ_g²(τ)/σ_p²(τ). (10)

One can examine the behavior of each genotype by evaluating the change of the genotypic value across time. Let g_1(τ), g_2(τ), and g_3(τ) be the values of the three genotypes *aa*, *Aa*, and *AA* at time τ. They are expressed as

g_1(τ) = **T**(τ)(**μ** − **α** − **δ**), g_2(τ) = **T**(τ)(**μ** + **δ**), g_3(τ) = **T**(τ)(**μ** + **α** − **δ**). (11)

### Likelihood function:

The phenotypic values for each individual are collected at *m* fixed time points, denoted by τ_1, …, τ_m. We add a subscript to variable τ to indicate that τ now takes a particular value from one of the time points. We use an m × 1 vector

**y**_i = [y_i(τ_1), …, y_i(τ_m)]^T

to denote the array of phenotypic values for the *i*th individual. Define

**T** = [**T**^T(τ_1), …, **T**^T(τ_m)]^T

as an m × (r + 1) matrix. In matrix notation, the linear model for vector **y**_i is

**y**_i = **T****μ** + x_i**T****α** + z_i**T****δ** + **T****ξ**_i + **ε**_i, (12)

where **ε**_i is an m × 1 vector of experimental errors with **ε**_i ~ N(**0**, **I**σ²), where **I** is an m × m identity matrix. The conditional expectation of model (12) given the fixed effects is

E(**y**_i) = **T**(**μ** + x_i**α** + z_i**δ**) (13)

and the variance–covariance matrix is

**V** = var(**y**_i) = **T****Σ****T**^T + **I**σ², (14)

which applies to all i = 1, …, n.
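The structure of this covariance matrix is easy to verify numerically. The sketch below assembles **V** = **TΣT**^T + **I**σ² for a small example; the order, the number of time points, and the values of **Σ** and σ² are all made-up illustrations.

```python
import numpy as np
from numpy.polynomial import legendre as NPL

r, m = 3, 5
tau = np.linspace(-1.0, 1.0, m)   # m standardized measurement times
# m x (r+1) basis matrix T; plain Legendre values suffice for this sketch
T = np.stack([NPL.Legendre.basis(j)(tau) for j in range(r + 1)], axis=1)
Sigma = 0.5 * np.eye(r + 1)       # hypothetical covariance of the random coefficients
sigma2 = 2.0                      # hypothetical residual variance
V = T @ Sigma @ T.T + sigma2 * np.eye(m)   # var(y_i) = T Sigma T' + I sigma^2
print(V.shape)                    # (5, 5); the same V applies to every individual
```

Note that a single m × m matrix serves all *n* individuals, which is what makes the likelihood evaluation below inexpensive.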

The models discussed so far are based on known genotypes of the QTL. In practice, the genotypes are not observable and the likelihood function must be constructed by taking into account all three genotypes for all individuals. Let us order the three genotypes, *aa*, *Aa*, and *AA*, as genotypes 1, 2, and 3 (indexed by *k*), respectively, and use a numerical variable *G* to denote the three ordered genotypes. Let G_i be the genotype of the major gene for individual *i*. Before we observe the phenotypic values, we should use markers to infer the probability of G_i = k. This conditional probability is denoted by p_ik = Pr(G_i = k | M), where M denotes the marker information. The conditional probability varies across individuals because different individuals are supposed to have different marker genotypes. Methods to calculate p_ik can be found in Lander and Botstein (1989) for interval mapping and Jiang and Zeng (1997) for multipoint mapping. We take the multipoint method because it facilitates an automatic mechanism to handle missing marker genotypes. Recall that Equation 13 is the conditional expectation of **y**_i given the genotype of individual *i* and the population parameters. Since each individual can take one of three genotypes, there are only three distinguishable conditional expectations, denoted by

**μ**_1 = **T**(**μ** − **α** − **δ**), **μ**_2 = **T**(**μ** + **δ**), **μ**_3 = **T**(**μ** + **α** − **δ**). (15)

The probability density of the phenotypes for the *i*th individual with the *k*th genotype (G_i = k) is

f_k(**y**_i) = (2π)^{−m/2}|**V**|^{−1/2}exp[−½(**y**_i − **μ**_k)^T**V**^{−1}(**y**_i − **μ**_k)]. (16)

Let **θ** = {**μ**, **α**, **δ**, **Σ**, σ²} be the parameters. The likelihood function is

L(**θ**) = Π_{i=1}^{n} Σ_{k=1}^{3} p_ik f_k(**y**_i). (17)

Several numerical methods can be used to search for the maximum-likelihood estimates (MLEs) of the parameters, *e.g.*, the simplex method of Nelder and Mead (1965). We decided to use the EM algorithm (Dempster *et al*. 1977; Lander and Botstein 1989; Jansen 1993) because explicit equations for the maximization step are available in the complete-data situation.
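Because the likelihood is a product over individuals of three-component normal mixtures weighted by the marker-derived genotype probabilities, it can be evaluated in a few lines. The function below is a self-contained sketch with all inputs assumed precomputed; it is not the authors' FORTRAN90 implementation.

```python
import numpy as np

def mixture_loglik(y, P, mu, V):
    """log L = sum_i log sum_k p_ik f_k(y_i), with f_k a N(mu_k, V) density.
    y: n x m phenotypes; P: n x 3 genotype probabilities from markers;
    mu: 3 x m genotype-specific mean vectors; V: m x m common covariance."""
    m = y.shape[1]
    _, logdet = np.linalg.slogdet(V)
    Vi = np.linalg.inv(V)
    ll = 0.0
    for yi, pi in zip(y, P):
        d = yi - mu                                # 3 x m residuals, one row per genotype
        q = np.einsum('km,mn,kn->k', d, Vi, d)     # quadratic forms for the 3 genotypes
        logf = -0.5 * (m * np.log(2 * np.pi) + logdet + q)
        ll += np.log(pi @ np.exp(logf))            # mix over the three genotypes
    return ll
```

Evaluating this function over a grid of putative QTL positions (with p_ik recomputed at each position) yields the likelihood profile used in the genome scan.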

### EM algorithm for parameter estimation:

With the EM algorithm, we classify variables into data, missing values, and parameters. In QTL mapping, **y** = {**y**_{i}} and are data, **θ** are parameters, and are variables with missing values. The genotype indicator variables , however, are determined by . So, the missing values can be rewritten as . The likelihood function given in (17) with all the missing values integrated out is called the observed likelihood function. Instead of directly maximizing the observed likelihood function, the EM algorithm deals with the following complete-data likelihood function,(18)where(19)(20)and(21)Note that we introduce a new variable η whose value is determined by as for and for . Because the complete-data likelihood function involves missing values, integration (or expectation) is taken with respect to the missing values, not for the complete-data likelihood function, but for the following log-transformed complete-data likelihood function,(22)The expectation of Equation 22 with respect to the missing values is(23)where(24)(25)and(26)

The EM algorithm requires: (1) finding the partial derivatives of with respect to the parameters, (2) setting these partial derivatives equal to zero, and (3) searching for explicit solutions for the parameters as a function of the missing values. This completes the maximization step. The expectation step requires taking the expectations for all terms that involve the missing values. Since the complete-data likelihood function is just a likelihood function for a typical mixed model, we can simply utilize the existing mixed-model EM algorithm to find the MLE of parameters (Henderson 1986). The following are the EM steps we implement for the mixed-model analysis.

Step 0: Set ζ = 0 and initialize all parameters with values in their legal domain, denoted by **θ**^{(ζ)}.

Step 1 (E1): Compute the posterior probabilities of the three genotypes for each individual,

π_ik = p_ik f_k(**y**_i) / Σ_{k′=1}^{3} p_ik′ f_k′(**y**_i). (27)

Step 2 (E2): Find the posterior distribution of the random effect **ξ**_i. This posterior distribution turns out to be a mixture of three normal distributions with a mean (Equation 28) and a covariance matrix (Equation 29). All parameters appearing on the right-hand side of the equations should be substituted by their most current values at the current iteration. This statement also holds for the subsequent steps.

Step 3 (E3): Compute all the expectations involved in the following maximization steps (see the next paragraph for detailed expressions of the expectations).

Step 4 (M1): Update the population mean **μ** (Equation 30).

Step 5 (M2): Update the additive effect **α** (Equation 31).

Step 6 (M3): Update the dominance effect **δ** (Equation 32).

Step 7 (M4): Update the covariance matrix **Σ** of the random effect (Equation 33).

Step 8 (M5): Update the residual variance σ² (Equation 34).

Step 9: Increment ζ by 1, *i.e.*, let ζ = ζ + 1, and repeat steps 1–8 (three E-steps and five M-steps) until the convergence criterion holds.

In this paragraph, we provide explicit expressions for the expectations of various terms involved in the EM steps. The expectation of the quadratic term of the random effects, E(**ξ**_i**ξ**_i^T), is given by Equation 35, where the posterior mean and covariance matrix are already given in step 2 (Equations 28 and 29). We need to provide only the expectations related to the genotype indicator variables. Letting π_ik denote the posterior probability of genotype *k* for individual *i* (genotypes ordered *aa*, *Aa*, *AA*, as computed in step 1), these expectations are

E(x_i) = π_i3 − π_i1, E(x_i²) = π_i1 + π_i3, E(z_i) = π_i2 − π_i1 − π_i3, E(z_i²) = 1, E(x_iz_i) = π_i1 − π_i3. (36)
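The E1 step and the posterior expectations of the indicator variables can be coded compactly. The sketch below works in log space for numerical stability; the column ordering (*aa*, *Aa*, *AA*) and the variable names are illustrative.

```python
import numpy as np

def e_step(logf, P):
    """Posterior genotype probabilities pi_ik proportional to p_ik * f_k(y_i).
    logf: n x 3 per-genotype log-densities; P: n x 3 marker-derived priors.
    Columns are ordered aa, Aa, AA."""
    a = np.log(P) + logf
    a -= a.max(axis=1, keepdims=True)      # stabilize before exponentiating
    w = np.exp(a)
    pi = w / w.sum(axis=1, keepdims=True)
    # posterior expectations of the indicator variables (x: 1, 0, -1; z: -1, 1, -1)
    ex = pi[:, 2] - pi[:, 0]               # E(x_i)
    exx = pi[:, 2] + pi[:, 0]              # E(x_i^2)
    ez = pi[:, 1] - pi[:, 0] - pi[:, 2]    # E(z_i); E(z_i^2) = 1 always
    return pi, ex, exx, ez
```

These expectations replace the unobserved x_i and z_i wherever they appear in the maximization steps.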

The EM algorithm was implemented via a FORTRAN90 program, which is available from the authors on request.

### Hypothesis tests:

Several hypotheses of interest can be tested via likelihood-ratio test statistics. The first hypothesis is “no segregation of a QTL at the current position,” denoted by H_0: **α** = **δ** = **0**. The test statistic for this hypothesis is

λ = −2[L_0(**θ̂**_0) − L(**θ̂**)], (37)

where L(**θ̂**) is the log-likelihood function evaluated at the MLE **θ̂** under the full model and L_0(**θ̂**_0) is the log likelihood under the null model evaluated at **θ̂**_0 = {**μ̂**_0, **Σ̂**_0, σ̂_0²}. Note that **θ̂**_0 differs from **θ̂** in two aspects: (1) **θ̂**_0 has a lower dimension because **α** = **δ** = **0** has been enforced, and (2) the three parameter sets in **θ̂**_0 are estimated from the null model and they are different from the counterparts estimated under the full model. The EM algorithm for estimating **θ̂**_0 is a simple special case of the EM algorithm for the full model. Under the null hypothesis, λ will follow approximately a chi-square distribution with d − d_0 d.f., where d and d_0 are the dimensions of **θ** and **θ**_0, respectively. These two vectors differ by **α** and **δ**, each having a dimension of r + 1, which explains why d − d_0 = 2(r + 1).

When H_0 is rejected, we can further test whether the QTL effects are due to additive or dominance effects. First, we can test the departure from additivity, *i.e.*, the dominance effect. The null hypothesis of “no dominance effect” is H_0: **δ** = **0** and the test statistic is

λ_δ = −2[L_δ(**θ̂**_δ) − L(**θ̂**)], (38)

where **θ̂**_δ is the parameter vector estimated under **δ** = **0**.

To test the additive effect, our null hypothesis becomes “no additive effect,” denoted by H_0: **α** = **0**. The test statistic is

λ_α = −2[L_α(**θ̂**_α) − L(**θ̂**)], (39)

where **θ̂**_α is the parameter vector estimated under **α** = **0**. Each of λ_δ and λ_α will approximately follow a chi-square distribution with r + 1 d.f.

It is also possible to test other hypotheses, such as whether a polynomial of order *r* + 1 fits significantly better than one of order *r* and whether the structured **V**-matrix sufficiently explains the covariances of the experimental errors compared to an unstructured **V**-matrix. We defer these hypothesis tests to the data analysis and discussion sections.

So far we have discussed hypothesis tests only at one location of the genome. To scan the entire genome for QTL, we test every putative position of the genome with a 1- or 2-cM increment and plot the test statistic against the genome location to form a test statistic profile. Positions that correspond to significant peaks of the profile provide estimates of the QTL locations. The critical value of the test statistic used to declare significance is obtained via the permutation test (Churchill and Doerge 1994).
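The permutation procedure can be sketched as follows. Here `scan_fn`, which is assumed to return the maximum likelihood-ratio statistic over the genome, is a hypothetical stand-in for the user's own scanning routine; only the shuffling logic is the point of the example.

```python
import numpy as np

def permutation_threshold(y, markers, scan_fn, n_perm=1000, alpha=0.05, seed=1):
    """Empirical genome-wide critical value (Churchill and Doerge 1994).
    scan_fn(y, markers) must return the maximum LR statistic over the genome."""
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = y[rng.permutation(len(y))]   # break the marker-phenotype association
        maxima.append(scan_fn(shuffled, markers))
    return np.quantile(maxima, 1 - alpha)
```

Shuffling phenotypes against fixed marker data preserves the marker map and the phenotypic distribution while destroying any true QTL signal, so the (1 − α) quantile of the permuted maxima serves as the genome-wide critical value.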

## DATA ANALYSIS

### Simulation studies:

We simulated an F_{2} population under several different sample sizes. A single chromosome of 200 cM was simulated with 21 evenly spaced markers (10 cM per interval). The position of the first marker was designated as position zero and the positions of all other markers were measured as distances from the first marker. We put three QTL at positions 23, 94, and 187 cM, respectively. The range of the original time points (*t*) was [1, 150]. The cumulative heritabilities of the three QTL were 0.24, 0.24, and 0.15, respectively, each of which was defined by accumulating across the measured time points (Equation 40) the proportion h²(τ) of the phenotypic variance explained by the QTL at time point τ [see Equation 4 for the definition of h²(τ)]. The QTL effects that generate such values of the cumulative heritability were found by trial and error and are given in Table 1 with a Legendre polynomial of order 3 (r = 3). The simulated population mean vector **μ** defined the functional mean curve μ(τ) = **T**(τ)**μ**. The functions of population mean, additive effect, dominance effect, and heritability of the three QTL plotted against time are given in Figure 1, a, b, c, and d, respectively. The individual-specific and time-dependent random effect was simulated as ξ_i(τ) = **T**(τ)**ξ**_i, where **ξ**_i was generated from a N(**0**, **Σ**) distribution with a positive definite covariance matrix **Σ**. The residual variance was σ² = 2.0. The model used to generate the phenotypic values was

y_i(τ) = **T**(τ)**μ** + Σ_{j=1}^{3}[x_{ij}**T**(τ)**α**_j + z_{ij}**T**(τ)**δ**_j] + **T**(τ)**ξ**_i + ε_i, (41)

where the summation indicates the sum over the three QTL.

An F_{2} population was simulated with three different sample sizes *n*. The frequency of measurement of the trait was simulated at two levels, *m* = 5 and *m* = 10. When the trait was measured five times (*m* = 5), the interval between two consecutive measurements was 30 days, whereas *m* = 10 corresponded to an interval of measurement of 15 days. All six combinations of *n* and *m* were simulated for comparison. Each of the six combinations (scenarios) was replicated 100 times.

The critical values used to declare statistical significance for QTL in the simulation experiments were obtained empirically by simulating 500 additional replicates under the null model where all QTL effects were set to zero. The empirical critical values under the six scenarios are given in Table 2.

The average estimated QTL positions, the likelihood-ratio test statistics, and the empirical powers of the three QTL are given in Table 3. The estimated QTL effects (including additive and dominance effects) are given in Table 4. The purpose of the simulations was neither to select the optimal design of the experiment nor to choose the optimal model of QTL mapping; rather, it was to demonstrate the general behavior of the mapping procedure and validate the computer program. Therefore, the simulations were not exhaustive. The results of the simulations did show the expected behaviors of a mapping procedure: (1) the power increases as the sample size increases, (2) the standard error of each estimated parameter decreases as the sample size increases, and (3) the increased power and decreased error are also associated with the increased frequency of the trait measurements. Therefore, the algorithm converges as expected and the computer program is logically valid. The average likelihood-ratio profiles of the 100 replicates under and are demonstrated in Figure 2. The three peaks in the likelihood-ratio profile clearly demonstrated the efficiency of the method for scanning multiple QTL. Again, we are not looking at model selection here; rather, we are trying to verify the model fit (Wiltshire *et al.* 2005). We developed a single-QTL model but implemented the method with data containing multiple QTL. This is a typical approach for genome scanning with interval mapping (Lander and Botstein 1989). The estimated positions and effects of the three simulated QTL were very close to the true values (see Table 4). This demonstrated the robustness of the single-QTL model for mapping multiple additive QTL. The robustness may be credited to the covariance matrix **Σ** included in the model. When QTL1 was evaluated, QTL2 and QTL3 were not taken into account by the model. These two neglected QTL would ordinarily cause extra correlation among the repeated measurements. 
The extra correlations, however, were absorbed by matrix **Σ**. Therefore, the estimated **μ**, **Σ**, and σ^{2} from the single-QTL model were not comparable to the true values of **μ**, **Σ**, and σ^{2} that were used to simulate the three-QTL-controlled longitudinal trait.

Normally, one would expect to see some bias in the estimated effects associated with the Legendre polynomials due to the Beavis effect (Beavis 1994; Xu 2003). The lack of such biases in the simulation experiment is due to the fact that we reported all effects, regardless of the significance of the test. The Beavis effect would be expected to happen if the mean effects were calculated on the basis of a censored sample (containing only the replicates in which significant QTL were detected).

To verify the efficiency of the method for estimating **μ**, **Σ**, and σ², we simulated an additional data set with exactly the same setup as the original simulation experiment except that we now kept only the first QTL (see Table 1 for the parameters of QTL1) in the simulation. The experiment was replicated 100 times. The average estimated σ² was close to the true value of 2.0. The average estimated **μ** and **Σ**, with standard deviations (the numbers following ±) obtained from the 100 replicates, were also close to the true values of **μ** and **Σ**. Overall, the simulation experiment demonstrated that we have developed a working procedure of QTL mapping for longitudinal traits.

For data generated by the single-QTL model, we also performed the logistic regression analysis using FunMap (Ma *et al*. 2002, 2004). Unfortunately, no QTL was detected in any of the replicates (see Figure 3 for the comparison of the two methods). This comparison is not quite fair because the method of Ma *et al*. (2002) was not designed to handle curves of arbitrary shape with a general covariance structure. This demonstrates only that extreme departure from the assumption of Ma *et al*. (2002) will cause the method they developed to fail, and so experimentalists should take care in its application. A fair comparison is deferred to the real data analysis.

We now demonstrate that, with a little modification, the method can handle models with epistatic (interactive) effects of QTL. Wu *et al.* (2004) recently extended their logistic model to map epistatic effects and published the epistatic model alone in a separate study. For simplicity of presentation, we simulated a backcross population with test frequencies of *m* = 5 or *m* = 10 for the period of 150 days. A single chromosome of 100 cM was simulated and 11 markers were placed evenly on the chromosome (10 cM per interval). Two QTL were simulated at positions 23 and 75 cM, respectively. The effects of these two QTL were defined under the following two models: (1) the additive-plus-epistatic-effect model (A + E model), in which both QTL have their own additive effects and also act interactively to determine the phenotype of the trait, and (2) the epistatic-effect-only model (E model), in which the two QTL act interactively to determine the phenotype and neither QTL has its own additive effect. These two models were used to generate the data, but the analytical model always assumed the presence of both the additive and epistatic effects and thus always presented estimated values for both types of effects. The epistatic model is

y_i(τ) = **T**(τ)**μ** + x_{i1}**T**(τ)**α**_1 + x_{i2}**T**(τ)**α**_2 + x_{i1}x_{i2}**T**(τ)**γ** + **T**(τ)**ξ**_i + ε_i, (42)

where the values of **μ**, **Σ**, and σ² remained the same as those given in the previous simulation experiment and the vectors of QTL effects (**α**_1, **α**_2, and the epistatic effects **γ**) were defined as follows.

For the A + E model, the first QTL has effectsthe second QTL has effectsand the epistatic effects are

For the E model, the corresponding three vectors were defined analogously, except that both additive-effect vectors were set to zero, leaving only the epistatic-effect vector nonzero.
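As a sketch of how data under these two models can be generated, the following fragment simulates a backcross population with two QTL whose additive and epistatic effects are arbitrary illustrative functions of standardized time. The effect curves, sample size, and residual standard deviation below are hypothetical placeholders, not the values used in the paper:

```python
import numpy as np

def simulate_backcross(n, taus, mu, a1, a2, aa, sigma=0.5, seed=0):
    """Simulate longitudinal backcross phenotypes under
    y(tau) = mu(tau) + x1*a1(tau) + x2*a2(tau) + x1*x2*aa(tau) + e,
    with genotype indicators x1, x2 coded -0.5 / +0.5 (one common coding).
    Setting a1 = a2 = 0 gives the epistasis-only (E) model."""
    rng = np.random.default_rng(seed)
    x1 = rng.choice([-0.5, 0.5], size=n)          # QTL 1 genotype codes
    x2 = rng.choice([-0.5, 0.5], size=n)          # QTL 2 genotype codes
    y = (mu(taus)[None, :]
         + x1[:, None] * a1(taus)[None, :]        # additive effect, QTL 1
         + x2[:, None] * a2(taus)[None, :]        # additive effect, QTL 2
         + (x1 * x2)[:, None] * aa(taus)[None, :] # epistatic effect
         + rng.normal(0.0, sigma, size=(n, len(taus))))
    return x1, x2, y
```

Passing zero functions for `a1` and `a2` turns the A + E model into the E model without changing any other code.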

The estimated positions of the two QTL were obtained in a two-dimensional grid search for the maximum value of the likelihood function. Once the positions were found, the estimated QTL effects corresponding to the optimal positions were reported as the MLEs of the QTL effects. The critical value used for declaration of statistical significance was obtained in the same way as in the previous simulation experiment.
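The two-dimensional search can be sketched as follows, with `loglik` standing in for a (hypothetical) profile log-likelihood evaluated at a pair of candidate QTL positions:

```python
def grid_search_2d(loglik, positions):
    """Two-dimensional grid search: evaluate loglik(p1, p2) over all
    ordered pairs p1 < p2 of candidate positions (in cM) and return
    the pair with the maximum value.  `loglik` is a hypothetical
    callable, not the paper's likelihood."""
    best, best_pair = float("-inf"), None
    for i, p1 in enumerate(positions):
        for p2 in positions[i + 1:]:
            ll = loglik(p1, p2)
            if ll > best:
                best, best_pair = ll, (p1, p2)
    return best_pair, best
```

In practice one would restrict `positions` to a grid along the scanned chromosome (here, every cM of the 100-cM chromosome).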

Table 5 shows the estimated positions of the two QTL along with the empirical statistical powers. The A + E model reached 100% power at both test frequencies examined. The E model, however, had less power. The estimated errors of the parameters of the two models under different test frequencies also behaved as expected. Table 6 gives the average estimated QTL effects, which were all close to the true values used to simulate the data. Overall, the simulation results validated the method and the program.

### Example of real data analysis:

We used the data published by Ma *et al*. (2002) as an example to demonstrate the application of the new method to real data from a longitudinal trait. The trait was the growth of the stem diameter of poplar trees measured annually for 11 years. The mapping population consisted of 78 progeny of a pseudobackcross family (progeny of a hybrid backcrossed to a parent of the hybrid). The method developed for the F_{2} design was revised by including only two genotypes per locus. The conditional probability of the QTL genotype given marker information was calculated using the multipoint method of Jiang and Zeng (1997). The data contained markers of chromosome 10 only, and thus only one chromosome was scanned.

Before conducting QTL mapping, we used the least-squares method to fit the phenotypic values of diameter growth for each individual to the logistic curve and to Legendre polynomials of orders 2, 3, and 4. The average goodness of fit (*R*^{2}) for the growth trajectories of the 78 individuals showed that the Legendre polynomials with order *r* ≥ 3 fit the data as well as, or better than, the logistic curve. This justifies the orders chosen for the polynomial analysis in this study and the logistic analysis in Ma *et al.* (2002). Theoretically, the polynomial approach and the logistic analysis can be compared only if the shape of the growth curve is sigmoid. When the order of the polynomial is ≤2, the curve is linear or quadratic, which fits the tree diameter growth curve poorly. The tree diameter growth curve appeared to be S-shaped, and thus a valid comparison of the polynomial and the logistic analyses can be made.
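A per-individual least-squares fit of this kind can be sketched with NumPy's Legendre utilities. The function name and details below are illustrative, not the code used in the study:

```python
import numpy as np
from numpy.polynomial import legendre

def fit_legendre_r2(t, y, order):
    """Least-squares fit of a Legendre polynomial of the given order to
    one individual's growth record; returns the coefficients and the
    goodness of fit R^2.  Time is first standardized to [-1, 1], as in
    the orthogonal-polynomial setup."""
    tau = -1.0 + 2.0 * (t - t.min()) / (t.max() - t.min())
    coef = legendre.legfit(tau, y, order)   # LS fit in the Legendre basis
    yhat = legendre.legval(tau, coef)       # fitted trajectory
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot
```

Averaging the returned *R*^{2} over individuals for each candidate order gives the kind of comparison described above.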

With 11 time points, there are a maximum of 10 different orders of the polynomial to choose from. We fit each of the orders and chose the one with the minimum Bayesian information criterion (BIC) (Schwarz 1978). The BIC for order *r* was calculated as BIC(*r*) = −2 ln *L*(θ̂_{r}) + *d*_{r} ln(*n*) (Equation 43), where θ̂_{r} is the MLE of the parameters for order *r* and *d*_{r} is the dimension of **θ** (the number of independent parameters in the model). The BIC scores for the Legendre polynomials of orders 2, 3, 4, and 5 are 1332.6, 920.3, 768.0, and 809.6, respectively. Clearly, *r* = 4 is the best order. Therefore, all subsequent reports are based on the results of the order-4 Legendre polynomial. Results of fitting other orders of the polynomial are not reported due to space limitations. However, the results of *r* = 3 and *r* = 5 were very close to those of *r* = 4 (data not shown).
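The order selection can be sketched as follows, assuming the Schwarz form BIC = −2 ln *L* + *d* ln *n*. The log-likelihood and parameter-count values in the test are hypothetical placeholders, not the numbers from the analysis:

```python
import math

def bic(loglik, n_params, n_obs):
    """Schwarz (1978) criterion: BIC = -2 ln L + k ln n."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_order(logliks, param_counts, n_obs):
    """Return (best_order, scores): the polynomial order with minimum BIC.
    `logliks` maps order -> maximized log-likelihood; `param_counts`
    maps order -> number of independent parameters."""
    scores = {r: bic(ll, param_counts[r], n_obs) for r, ll in logliks.items()}
    return min(scores, key=scores.get), scores
```

The same routine applies unchanged when comparing models with different covariance structures, since BIC only needs the maximized likelihood and the parameter count.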

Figure 4 gives the likelihood-ratio test statistic profile for chromosome 10. Two peaks are evident in the profile. We conducted a permutation test (with 500 randomly reshuffled samples) to determine the empirical critical values for significance declaration. Both peaks passed the critical values, and thus both QTL were significant. The positions of the two QTL were estimated at 32 and 54 cM, and the two QTL are designated QTL32 and QTL54, respectively. The estimated QTL effects and the corresponding QTL effect functions were obtained for QTL32 and for QTL54. Note that dominance effects cannot be separated from additive effects in a backcross design; therefore, the estimated additive effects are actually confounded with the dominance effects. The functions of the two QTL effects are depicted in Figure 5 and show similar dynamic patterns of change. Both QTL appeared to be inactive until year 7, and then the effects increased until year 10, when a plateau was reached.

For comparison, we analyzed the same data with the method of Ma *et al.* (2002) using FunMap (Ma *et al.* 2004), a public program for mapping longitudinal traits with the logistic regression method. The covariance structure specified in the program was AR(1), the first-order autoregressive model. The likelihood-ratio profile is given in Figure 6 (the shaded curve). Only one peak passed the critical value, and its location was at 14 cM. We calculated the BIC value for this analysis and found that it was 915.8, greater than the 768.0 obtained for the Legendre polynomial of order 4. Therefore, the two different methods seem to generate somewhat different results.

Because the difference between the two methods may be due to the different covariance structures specified, we then modified our method by using the AR(1) covariance structure while still fitting the QTL effects to the Legendre polynomials. We found that the minimum BIC value still occurred at *r* = 4 (BIC score 906.0). This value is greater than 768.0, the BIC of the order-4 Legendre polynomial with the general covariance structure. The likelihood-ratio test statistic profile is depicted in Figure 6 (the solid curve). The profile has exactly the same shape as that of Ma *et al*. (2002). This showed that the difference in results between the two methods was partially due to the difference in covariance structure. Judging by the BIC values, we may conclude that this hybrid method of the Legendre polynomial with the AR(1) covariance structure is slightly better than the method of Ma *et al.* (2002).

## DISCUSSION

We presented a polynomial approach to QTL mapping for longitudinal traits as an alternative to the logistic regression approach (Ma *et al.* 2002). Both methods may be considered dimension-reduction techniques for multivariate analysis. With the polynomial approach, the dimension of the model can vary according to the goodness of fit of the model to the actual data; the BIC value was used to determine the optimal dimension. As a result, the polynomial model is more general and flexible than the logistic regression approach, which has a fixed dimension of three. More importantly, the polynomial model is linear in the parameters, which allows well-developed linear-model methodology to be fully utilized in longitudinal trait studies. Both methods benefit when the trait is measured more frequently. Another method for functional data analysis that is quite similar to the polynomial approach is the method of B-splines (de Boor 2001). B-splines are more commonly used in nonparametric data analysis to infer the empirical distribution of a variable and are rarely applied to quantitative genetics. We think that the polynomial approach is more intuitive and easier to understand. Once the polynomial approach is accepted by geneticists, we may further explore the B-spline approach and compare the two competing but similar methods.

For simplicity, we fit all model effects to the same basis (order) of the polynomials. In fact, different types of model effects may have different functional relationships with time. For example, the population mean may be linear in time, whereas the additive effect may be quadratic or cubic in time. This can be taken into account by describing each type of model effect with a polynomial having its own basis. A model with this flexibility may look like Equation 44, where each effect has its own basis with its own order, which may differ from the orders of the other effects. Allowing the random effects to have a basis different from that of the other effects may be very desirable because this will introduce a differently structured residual covariance matrix. Further discussion of the structured covariance matrix is given later.

If a major-gene model sufficiently describes the data, the residual covariance matrix should have a simple form. Polygenic effects and other factors not included in the model may cause a denser structure of the covariance matrix. Wu *et al*. (2002b, 2003, 2004) used AR(1), the first-order autoregressive covariance matrix, because AR(1) is one of the simplest structures and allowed the authors to develop a clean EM algorithm. Extension to more complicated structures either is impossible or would take substantial effort. With the polynomial analysis, however, we are able to use a general structured covariance matrix. We can choose its dimension as low as zero, giving a simple residual covariance, or as high as the number of time points, giving a fully unstructured covariance matrix. This flexibility cannot be enjoyed by the logistic growth-curve-fitting approach. The optimal dimension may be found by evaluating the BIC at different dimensions. Again, there is no theoretical difficulty with the polynomial analysis.
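For concreteness, an AR(1) residual covariance of the kind used by Wu *et al.* can be constructed in a few lines. This is a generic sketch; *m*, ρ, and σ² are free parameters, not estimates from the paper:

```python
import numpy as np

def ar1_cov(m, rho, sigma2):
    """First-order autoregressive covariance matrix for m time points:
    Cov(e_j, e_k) = sigma2 * rho**|j - k|."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(np.subtract.outer(idx, idx))
```

A fully unstructured covariance, by contrast, carries m(m + 1)/2 free parameters instead of the two (ρ, σ²) needed here, which is the trade-off the BIC comparison in the text adjudicates.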

Many aspects of agricultural experiments are subject to high levels of uncertainty, and not all plants may have full measurements at all the calibrated time points. Occasionally, a plant may fail to be measured for the phenotype at a certain time point due to an oversight of the researchers; this is a typical missing-value problem. Some plants may die before they are measured for the entire term of the experiment; this is a typical problem in longitudinal studies (Diggle *et al.* 1994). The EM algorithm is particularly suited to handling problems like these, so no additional technique is required. One should still include the missing phenotype (as a symbolic quantity) in the model but simply replace all terms involving the missing phenotype by their conditional expectations and proceed with the usual EM iterations. The conditional expectation for a term involving a missing phenotype, say *y*_{ij}, should be calculated on the basis of the parameters at the current iteration and a restriction *y*_{i(j−1)} ≤ *y*_{ij} ≤ *y*_{i(j+1)}, where *y*_{i(j−1)} and *y*_{i(j+1)} are the observed values before and after time point *j*. If the plant dies at time point *j*, the corresponding restriction should hold for all time points after *j*. These restrictions apply only to growth traits. For other types of traits, say daily gain, such restrictions are not necessary.

An alternative approach for dealing with missing values may be more powerful than the aforementioned one. Instead of using the conditional expectations of the terms related to the missing phenotypes, we may model only the observed phenotypes and ignore the missing phenotypes completely. The complication in this alternative approach is that the vector of phenotypic values for an individual with incomplete measurements has a dimension smaller than the full number of time points. In general, the dimension of the phenotype vector varies across individuals, which causes the dimension of the basis matrix to vary, so that an individual-specific basis matrix must be used in place of the common one. There is no theoretical difficulty in incorporating this variable dimension into the likelihood analysis. The model bearing this flexibility may look like Equation 45, where the basis matrix for individual *i* represents the basis of the polynomial with order *r* but is constructed using only the time points with actual measurements for that individual. With the polynomial basis varying across individuals, we can fully take the missing-value problem into account. This approach is sufficiently general to handle even more complicated experiments, such as designs in which different individuals are measured at different time points during the entire term of the experiment. One should be cautious about standardizing the time points from the original scale to the [−1, +1] scale (see appendix b for the definition of the standardized scale): the initial and ending time points used should be constant across individuals. In other words, one should use, for all individual plants, the time at which measurement starts and the time at which measurement ends over the entire experiment. Other technical problems may occur, which deserve further investigation. The primary objective of this study was to lay the foundation for the polynomial approach to longitudinal trait studies. Technical improvements, which would digress from the main focus, are deferred to subsequent studies.

We have demonstrated that if the growth curve is indeed sigmoid, the proposed polynomial analysis can still be used because of the great generality of the method. The flexibility of the polynomial analysis in incorporating structured residual covariance matrices makes the method preferable to the logistic approach, which fits only an AR(1) covariance structure. One criticism of the polynomial approach when applied to the sigmoid curve concerns the interpretability of the polynomial regression coefficients. In logistic models, each parameter has a biological meaning, whereas that is not the case for a polynomial. However, a logistic curve can be described by a polynomial of order 3. If a growth curve is indeed fitted with such a polynomial, we can find the functional relationships of the parameters of the logistic curve with respect to the parameters of the polynomial. By doing so, we can interpret the biological meanings of the polynomial parameters. We take the following approach to infer the functional relationships between the logistic parameters and the polynomial parameters. First, we note that the actual shape of a logistic curve is determined by the initial value of the curve at time zero and the value of the curve at the inflection point. This implies that we can express the logistic parameters (*a*, *b*, *r*) as functions of the two coordinates (the initial point and the inflection point), because together they determine the shape of the curve. We then find the functional relationships of the values of these two coordinates with respect to the polynomial parameters. Combining the two steps, we can eventually express each of *a*, *b*, and *r* as a function of the polynomial parameters. Derivations of the three functions are provided in appendix c. We have thus expressed the logistic parameters as functions of the polynomial parameters.
Although the biological meaning of each polynomial regression coefficient alone is not obvious, different combinations (functions) of the polynomial parameters determine the parameters of the logistic curve, which are well interpretable biologically.

## APPENDIX A: DERIVATIONS OF QTL VARIANCES IN AN F_{2} POPULATION

The linear model for the phenotypic value at time τ is given in Equation A1; definitions of the symbols are given in the main text (see Equation 1). When α(τ) and δ(τ) are treated as fixed effects, the genetic effect (Equation A2) has the variance given in Equation A3. Define σ_{x}^{2}, σ_{z}^{2}, and σ_{xz} as the theoretical variances and covariance of variables *x* and *z* across individuals within the F_{2} population. Under the assumption of no segregation distortion, the ratio of the three genotypes (*AA*, *Aa*, and *aa*) is expected to be 1:2:1. On the basis of the definitions of variables *x* and *z* presented in the text, we can show that σ_{x}^{2} = 1/2 (A4), σ_{z}^{2} = 1/4 (A5), and σ_{xz} = 0 (A6). Therefore, the genetic variance contributed by the major gene at time τ is σ^{2}(τ) = (1/2)α^{2}(τ) + (1/4)δ^{2}(τ) (A7).
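The orthogonality of the two coded variables under the 1:2:1 F_{2} ratio can be checked numerically. The sketch below assumes the codings *x* = (1, 0, −1) and *z* = (−1/2, 1/2, −1/2) for genotypes *AA*, *Aa*, and *aa* (one common choice, cf. Xu 1998; the paper's exact codes are defined in its main text):

```python
from fractions import Fraction as F

# F2 genotype frequencies for AA, Aa, aa under no segregation distortion
p = [F(1, 4), F(1, 2), F(1, 4)]
# Assumed additive (x) and dominance (z) codings for AA, Aa, aa
x = [F(1), F(0), F(-1)]
z = [F(-1, 2), F(1, 2), F(-1, 2)]

def mean(v):
    """Population mean of a coded variable under the 1:2:1 frequencies."""
    return sum(pi * vi for pi, vi in zip(p, v))

def var(v):
    """Population variance: E[v^2] - (E[v])^2, computed exactly."""
    return mean([vi * vi for vi in v]) - mean(v) ** 2

# Covariance between the two coded variables
cov_xz = mean([xi * zi for xi, zi in zip(x, z)]) - mean(x) * mean(z)
```

With these codings the additive and dominance variables are exactly orthogonal, so the genetic variance decomposes into the two terms shown above with no cross term.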

The particular values of σ_{x}^{2}, σ_{z}^{2}, and σ_{xz} obtained for an F_{2} population are due to the special way we defined the values of variables *x* and *z* for the three genotypes. One can define the values of *x* and *z* in any arbitrary way without affecting the results of the analysis. The definitions of *x* and *z* chosen here are adopted from Xu (1998). In fact, a mathematically more attractive definition of *x* and *z* exists for the three genotypes. The *x* defined this way appears to be odd, which explains why this system has not been adopted in the literature. However, one can prove that the two variables then have equal variances and zero covariance, leading to Equation A8, which is the mathematical attractiveness of the new system. One should be cautious in adopting a new system for the definitions of *x* and *z*. If *x* and *z* are not orthogonal, then the covariance σ_{xz} must be incorporated into the formula for the genetic variance. Furthermore, if segregation distortion has occurred, the theoretical variances and covariance of variables *x* and *z* are not known. One may estimate the variances and covariance from the observed data and use the estimated values instead.

## APPENDIX B: BASIS OF THE ORTHOGONAL POLYNOMIAL

Orthogonal polynomial analysis is a general way to describe the functional relationship of one variable to another. The dependent variable is usually stochastic and follows some kind of distribution. The independent variable is usually calibrated at fixed time points, which are controlled by the investigator. One can choose the order of the polynomial as high as the total number of fixed time points minus one; such a model fully describes the data with no error but has no general use. In contrast, one can choose the order of the polynomial as low as 0, leading to a constant, again a trivial result. An order of 1 makes the functional relationship linear in time, an order of 2 describes a quadratic function, and so on. The rule of thumb is to choose the order as low as possible while still describing the data sufficiently well. In biological science, most curves may be described by an order of <5; a curve with an order >5 is difficult to interpret.

The first step in the orthogonal polynomial analysis is to convert the original time point *t* into a standardized time point τ using the expression

τ = −1 + 2(*t* − *t*_{min})/(*t*_{max} − *t*_{min}),(B1)

where *t*_{min} is the initial time point and *t*_{max} is the ending time point. Therefore, time point τ used throughout the text is the standardized time point. Recall that if we choose a basis with order *r*, then

ψ(τ) = [ψ_{0}(τ), ψ_{1}(τ), …, ψ_{r}(τ)]^{T}.(B2)

The *k*th component of ψ(τ) is defined as

ψ_{k}(τ) = (1/2^{k})[(2*k* + 1)/2]^{1/2} Σ_{m=0}^{⌊k/2⌋} (−1)^{m} C(*k*, *m*) C(2*k* − 2*m*, *k*) τ^{k−2m},(B3)

where ⌊*k*/2⌋ is the integer part of *k*/2 and C(·, ·) denotes a binomial coefficient. The first five terms of the orthogonal polynomial are ψ_{0} = 0.7071, ψ_{1} = 1.2247τ, ψ_{2} = −0.7906 + 2.3717τ^{2}, ψ_{3} = −2.8062τ + 4.6771τ^{3}, and ψ_{4} = 0.7955 − 7.9550τ^{2} + 9.2808τ^{4}, respectively (Schaeffer 2004).
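The normalized Legendre basis of Schaeffer (2004) can be generated programmatically. The following sketch implements the standard normalized-Legendre formula and can be checked against the first five terms listed in this appendix:

```python
from math import comb, floor, sqrt

def legendre_norm(k, tau):
    """Normalized Legendre polynomial psi_k(tau) (Schaeffer 2004):
    psi_k = (1/2^k) * sqrt((2k+1)/2)
            * sum_m (-1)^m * C(k, m) * C(2k-2m, k) * tau^(k-2m),
    summed over m = 0 .. floor(k/2)."""
    s = sum((-1) ** m * comb(k, m) * comb(2 * k - 2 * m, k) * tau ** (k - 2 * m)
            for m in range(floor(k / 2) + 1))
    return s * sqrt((2 * k + 1) / 2) / 2 ** k
```

Evaluating `legendre_norm(k, tau)` for *k* = 0, …, *r* at a standardized time point gives the basis vector ψ(τ) used throughout the text.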

## APPENDIX C: RELATIONSHIP BETWEEN LOGISTIC PARAMETERS AND POLYNOMIAL PARAMETERS

The logistic curve is

*g*(*t*) = *a*/(1 + *b*e^{−rt}),(C1)

where the time origin *t* = 0 is chosen for convenience. The initial coordinate is (0, *g*_{0}), with *g*_{0} = *a*/(1 + *b*), and the coordinate at the inflection point is (*t*_{1}, *g*_{1}), which is found by setting the second derivative of *g*(*t*) to zero and solving for *t*. Letting *t*_{1} be the solution of this equation, we obtain *g*_{1} = *g*(*t*_{1}) = *a*/2 using Equation C1. Given (*g*_{0}, *t*_{1}, *g*_{1}), we now have three equations with three unknowns (the logistic parameters),

*g*_{0} = *a*/(1 + *b*), *t*_{1} = ln(*b*)/*r*, *g*_{1} = *a*/2.(C2)

The solutions of Equations C2 are

*a* = 2*g*_{1}, *b* = (2*g*_{1} − *g*_{0})/*g*_{0}, *r* = ln[(2*g*_{1} − *g*_{0})/*g*_{0}]/*t*_{1}.(C3)

We now evaluate the polynomial curve of order 3 (Equation C4). The coordinate of the initial point is the value of the polynomial at the time point corresponding to *t* = 0, given in Equation C5. The second derivative of the polynomial with respect to τ is given in Equation C6. Setting this second derivative to zero and solving for τ gives the inflection time on the standardized scale; substituting it into Equation C4 yields the value of the curve at the inflection point (Equation C7). We then convert the inflection time from the standardized scale back into the original scale by Equation C8; the simplification in C8 follows from the definition of the standardized scale (Equation B1). Substituting *g*_{0} (Equation C5), *g*_{1} (Equation C7), and *t*_{1} (Equation C8) into Equations C3, we obtain the logistic parameters as functions of the polynomial parameters (Equation C9).
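As a numerical check of the landmark algebra in this appendix, the sketch below assumes the standard logistic form *g*(*t*) = *a*/(1 + *b*e^{−rt}) and recovers (*a*, *b*, *r*) from the initial value *g*(0), the inflection time, and the inflection value. It is a self-contained illustration, not the paper's code:

```python
import math

def logistic(t, a, b, r):
    """Logistic growth curve g(t) = a / (1 + b * exp(-r * t))."""
    return a / (1.0 + b * math.exp(-r * t))

def logistic_from_landmarks(g0, t1, g1):
    """Recover (a, b, r) from the initial value g0 = g(0), the inflection
    time t1, and the inflection value g1 = g(t1) = a/2, using:
        a = 2*g1,  b = a/g0 - 1,  r = ln(b)/t1."""
    a = 2.0 * g1
    b = a / g0 - 1.0
    r = math.log(b) / t1
    return a, b, r
```

The round trip (choose *a*, *b*, *r*; compute the three landmarks; invert) recovers the original parameters, which is the consistency underlying Equations C2 and C3.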

## Acknowledgments

We are grateful to two anonymous reviewers for their comments and suggestions, which greatly assisted the revision of this manuscript. This research was supported by the National Institutes of Health grant R01-GM55321, the National Science Foundation grant DBI-0345205 to S.X., and the Chinese National Natural Science Foundation grant 30471236 to R.Y.

## Footnotes

Communicating editor: Y.-X. Fu

- Received December 15, 2005.
- Accepted May 30, 2006.

- Copyright © 2006 by the Genetics Society of America