## Abstract

There has been a great deal of interest in the development of methodologies to map quantitative trait loci (QTL) using experimental crosses in the last 2 decades. Experimental crosses in animal and plant sciences provide important data sources for mapping QTL through linkage analysis. The Collaborative Cross (CC) is a renewable mouse resource that is generated from eight genetically diverse founder strains to mimic the genetic diversity in humans. The recombinant inbred intercrosses (RIX) generated from CC recombinant inbred (RI) lines share similar genetic structures of *F*_{2} individuals but with up to eight alleles segregating at any one locus. In contrast to *F*_{2} mice, genotypes of RIX can be inferred from the genotypes of their RI parents and can be produced repeatedly. Also, RIX mice typically do not share the same degree of relatedness. This unbalanced genetic relatedness requires careful statistical modeling to avoid false-positive findings. Many quantitative traits are inherently complex with genetic effects varying with other covariates, such as age. For such complex traits, if phenotype data can be collected over a wide range of ages across study subjects, their dynamic genetic patterns can be investigated. Parametric functions, such as sigmoidal or logistic functions, have been used for such purpose. In this article, we propose a flexible nonparametric time-varying coefficient QTL mapping method for RIX data. Our method allows the QTL effects to evolve with time and naturally extends classical parametric QTL mapping methods. We model the varying genetic effects nonparametrically with the B-spline bases. Our model investigates gene-by-time interactions for RIX data in a very flexible nonparametric fashion. Simulation results indicate that the varying coefficient QTL mapping has higher power and mapping precision compared to parametric models when the assumption of constant genetic effects fails. We also apply a modified permutation procedure to control overall significance level.

DURING the past 2 decades, there has been considerable development in statistical methodologies for mapping quantitative trait loci (QTL), since Lander and Botstein (1989) implemented a maximum-likelihood approach to the interval-mapping technique (Goldgar 1990; Amos 1994; Jansen and Stam 1994; Zeng 1994; Almasy and Blangero 1998; Kao *et al.* 1999; Zou *et al.* 2001; Xu *et al.* 2005). In addition to the interval-mapping approach, many other statistical approaches have been used in QTL mapping, such as regression analyses (Haley and Knott 1992) and Bayesian approaches (Satagopan *et al.* 1996; Sillanpaa and Arjas 1998; Yi and Xu 2000; Yi 2004; Hoeschele 2007).

While these methods have been instrumental for QTL identification, they are not able to capture the temporal pattern of QTL effect. Many quantitative traits, such as body size, are inherently too complex to be described by a single value, because their phenotypes, for example, change with age. Instead of being measured at one fixed time point, each subject's phenotype may be measured at different time points across samples, which allows us to study genetic effects that vary with the change of time. For example, genetic correlations among age-specific weights in a laboratory population of rats were shown to involve variable gene action at different ages (Cheverud *et al.* 1983). Vaughn *et al.* (1999) located QTL responsible for age-specific weights in mice, and they found that some QTL affect the early growth patterns and some affect the late growth patterns. To study genetic determination of such functional traits, Wu and colleagues (Ma *et al.* 2002; Wu *et al.* 2002, 2004; Lin and Wu 2006) developed the functional mapping approach. They used growth curve data as an example of functional traits, and the genetic effect was modeled by a parametric function such as sigmoidal or logistic function (Ma *et al.* 2002). While the parametric nature of functional mapping offers tremendous biological and statistical advantages, a reliance on the availability of mathematical functions limits its applicability (Yang *et al.* 2009).

Varying coefficient models are very useful statistical tools for exploring dynamic effects. The varying coefficient models were introduced by Cleveland *et al.* (1991), and discussed by Hastie and Tibshirani (1993) in more detail, to extend the applications of local regression techniques from one-dimensional to multidimensional settings. In varying coefficient models, there are many ways to model the function of the varying effect, such as polynomials, Fourier series, piecewise polynomials, and more general nonparametric functions (Hastie and Tibshirani 1993). For nonparametric varying-coefficient models, various basis systems can be used, and the most common choice is the B-spline basis (He and Shi 1998; Pittman 2002; Huang *et al.* 2004; Wang *et al.* 2007, 2008). One advantage of B-splines over some other nonparametric approaches, like smoothing splines, is that the smoother matrix is independent of the responses. Yang *et al.* (2009) proposed a nonparametric functional mapping framework for genetic mapping of QTL controlling for a dynamic trait, implemented with B-splines.

Although important, QTL mapping in humans is difficult, time consuming, expensive, and hampered by ethical problems and uncontrollable environments. These obstacles are nearly all overcome in laboratory mice. Furthermore, most human genes have functional mouse counterparts and both genomes are organized similarly. Hence, the laboratory mouse has become an important model organism in mapping QTL related to human disease. Recombinant inbred (RI) lines have contributed greatly to genetic dissection of simple and complex traits. A major advantage of RI panels over other commonly used mapping approaches is their ability to support genetic mapping and correlations among many traits, even under different environmental conditions (Plomin *et al.* 1991). However, the traditional inbred mice have a limited amount of variation (Darvasi 1998). Typical mouse RI panels have only 15–35 strains from a single pair of parental inbred lines (Zou *et al.* 2005; Tsaih *et al.* 2005). This is a particularly acute problem when one wants to examine numerous gene–environment interactions or study disease progression at many stages and ages (Zou *et al.* 2005). Mouse RI panels generally have low power and precision compared to other resources because of their small size.

The Collaborative Cross (CC) project (Threadgill *et al.* 2002; Churchill *et al.* 2004; Collaborative Cross Consortium 2012) has been carried out to create a large panel of new RI mouse strains. The CC RI lines are generated from an eight-way cross using eight genetically diverse founder strains, which makes the CC RI lines closer to natural populations than regular RI lines with more genetic variation. A novel derivative of RI lines, called recombinant inbred intercrosses (RIX), has been designed that permits repeated interrogations of a fixed genotype to reduce nongenetic variance while increasing the power of the original RI panel (Threadgill *et al.* 2002; Collaborative Cross Consortium 2012). The RIX panel is created as F1 hybrids of RI lines. Linkage analyses can be performed, using these resources, to fine map genetic loci that are responsible for most inheritable complex traits. Since all RI mice are homozygous at each locus, the genotypes of the derivative RIX mice will be known in advance by imputing from the genotypes of the parental RI lines. RIX mice with identical genotypes can be regenerated whenever needed. Compared to RI, the RIX design has several advantages that include twice the number of recombination sites in a single individual since each is derived from two parental RI, dominance effects can be estimated, there is a large expansion of different RIX genomes over the parental RI, and, because of the buffering capacity of their heterogeneous genome structure, RIX genomes should provide more reliable trait means than the parental RIs. The RIX approach also has advantages over classical crosses like the *F*_{2} design since each RIX has a higher recombination density than a single *F*_{2} individual when performing interval mapping (Broman 2005; Broman 2012), RIX are especially useful for long-term collaborative research because their genotypes are renewable making the phenotypic data cumulative within the research community, and since RIX genomes are easily replicated, experiments with different environmental variables or temporal relationships can be performed on the same genotypes. At the individual level, although the genome of each RIX mouse has similar genetic structures of *F*_{2} individuals, statistical analyses for *F*_{2} data cannot be directly applied to RIX data. This is because some RIX individuals share a common parental RI line, making them genetically more related to each other than those that do not share any parental lines. Several QTL mapping methods have been introduced (Zou *et al.* 2005; Tsaih *et al.* 2005; Yuan *et al.* 2011) for dealing with the special genetic structure of RIX data. However, none has considered the situation in which the QTL effect varies with other covariates. In this study, we propose a new method to properly model both the (time) varying genetic effects and the genetic complexity of RIX data. The proposed model investigates gene-by-time interactions in a flexible nonparametric fashion for RIX data.

## Methods

For an RI panel with *L* lines, there are at most *L*(*L* − 1)/2 nonreciprocal RIXs that can be generated, which is a huge number when *L* is large. A useful sampling and mating scheme is the loop design as described by Zou *et al.* (2005) and Yuan *et al.* (2011). With the loop design, *L* RI lines were randomly ordered to form a circle. Then each RI line is mated with the next *J* RI lines after it, resulting in total of *LJ* samples. That is, we mate RI_{1} with RI_{2}, RI_{3}, …, and RI_{J+1}; …; RI_{i} with RI_{m(i+1,L)}, RI_{m(i+2,L)}, …, and RI_{m(i+j,L)}; …; and RI_{L} with RI_{1}, RI_{2}, …, and RI_{J,} whereNot only in the loop design, but in many RIX populations, pairs of RIX sharing one parent are more closely related than those RIX that do not share a parent. For example, RIX produced by crossing RI_{1} and RI_{2} (RIX_{12}) is expected to be more similar to RIX produced by crossing RI_{1} and RI_{3} (RIX_{13}) than to RIX from crosses between RI_{3} by RI_{4} (RIX_{34}) since (RIX_{12}) and (RIX_{13}) share a parental RI (RI_{1}) while (RIX_{12}) and (RIX_{34}) do not share any parental RI lines.

To study the RIX data, we fit a mixed-effect model by applying a random effect to model the polygenic effect. For simplicity, a model with only additive effect is considered. Also, we assume that all putative QTL are located on markers. For individual *i*, define the observed data as {*y _{i}*,

*t*,

_{i}*x*

_{i}_{1}, …

*x*}, where

_{iM}*y*is the phenotype,

_{i}*t*is the measure of the covariate and is nonconstant across subjects,

_{i}*M*is the total number of markers, and

*x*is the genotype at the

_{im}*m*th marker, coded as −1, 0, or 1 for genotypes aa, Aa, and AA, respectively. We consider one putative QTL at a time and therefore suppress the subindex

*m*in the sequel. The model can be expressed aswhere μ(

*t*) and β(

*t*) are the overall population mean and QTL effect that vary with time

*t*, respectively. The random polygenic effect α

_{l}follows for

*l*= 1, 2, …,

*L*; the random error ε

*follows ; andThis model can be applied to any RIX population in addition to the loop design as described above. The hypotheses for whether there exists any major QTL at a given locus are*

_{i}*H*

_{0}: β(

*t*) = 0

*vs. H*: β(

_{a}*t*) ≠ 0.

We incorporate B-spline bases to model the varying coefficient functions β(*t*) and μ(*t*). The smoothness of the function modeled by B-splines is controlled by the parameter *K* = *n _{j}* +

*d*+ 1, where

*n*is the number of interior knots and

_{j}*d*is the degree of spline. The interior knots of the splines can be either equally spaced or placed on the sample quantiles of the data, so that there are about the same number of observations between any two adjacent knots. We use equally spaced knots for all numerical examples for this study, and hence

*B*(

_{k}*t*) is determined for any given

*t*.

The mixed-effects model becomeswhere *B _{k}*(

*t*)'s are basis functions of B-splines of order

_{i}*K*, and γ

_{0k}'s and γ

*'s are coefficients for B-spline basis. Here μ(*

_{k}*t*) is approximated by and β(

*t*) is approximated by . To test for genetic effect of QTL, the hypotheses

*H*

_{0}: β(

*t*) = 0

*vs. H*: β(

_{a}*t*) ≠ 0 are equivalent to

*H*

_{0}: γ

_{1}= … =

*γ*= 0

_{K}*vs. H*: not all the γ

_{a}*are 0.*

_{k}sWe can rewrite the model above in the matrix form aswhere ** y** = (

*y*

_{1}, …

*y*)

_{n}^{T};

**γ**= (γ

_{01}…γ

_{0K}, γ

_{1}…

*γ*)

_{K}^{T};

**X**is the corresponding

*n*× 2

*K*design matrix for the time-varying fixed effect;

**α**= (α

_{1},… α

*)*

_{L}^{T};

**ε**= (ε

_{1}, … ε

*)*

_{n}^{T}; and

**A**is an

*n*×

*L*design matrix for the random polygenic effect. The design matrix

**X**can be expressed asWe therefore observe

**∼**

*y**N*(

**Xγ**,

**Σ**) with , which can be reparameterized as , with

**D**=

**AA**

^{T}, and

**V**= θ

**D**+

**I**. Regardless of the form of the covariance matrix

**Σ**, the generalized least squares (GLS) is an appropriate estimate for parameter

**γ**as

The profile log-likelihood functions with only unknown parameters in **Σ**, based on the maximum likelihood (ML) and restricted/residual maximum likelihood (REML), can be written asfor ML andfor REML, where *p* is the rank of **X** and **r** = **y** − **X**(**X**^{T}**V**^{−1}**X**)^{−1}**X**^{T}**V**^{−1}**y** is a function of θ.

To simplify the computation, we further solve for the ML or REML estimate of as a function of θ,for ML andfor REML. Substitute the expressions above, we obtain the final profile log-likelihoods for θ asand

Note that the profile log-likelihood above involves only the nuisance parameter θ. Hence its MLE can be easily computed by the Newton–Raphson algorithm. Once θ is estimated, **γ** and can be subsequently estimated byandfor ML andfor REML. We use REML in the following simulation studies, since it has some advantages over ML, such as taking into account the degrees of freedom for fixed effects (Mcculloch and Searle 2001).

Once the parameters are estimated, likelihood-ratio (LR) tests can be performed to evaluate the evidence of QTL effect, and LOD scores can be calculated at the locations of all genetic markerswhere is the MLE under *H*_{0}: γ_{1} = … = γ* _{K}* = 0.

Since the hypothesis testing is performed on a number of markers, it is necessary to adjust the significance level for multiple testing. The threshold, in practice, is usually obtained by permutation procedures (Churchill and Doerge 1994). However, it is quite complicated to obtain appropriate threshold values for RIX data, because direct permutation will not only destroy the relationship between QTL and the trait, but also destroy the relationship between polygenes and the trait, which will result in incorrect thresholds (Anderson and Ter Braak 2003; Zou *et al.* 2005; Churchill and Doerge 2008). To overcome this difficulty, Zou *et al.* (2005) extended the permutation method of Churchill and Doerge (1994) to a novel permutation method for the RIX data. The modified permutation method starts with permuting parental RI strain numbers 1, 2, …, *L* into φ(1), φ(2), …, φ(*L*). Then the permuted marker genotypes of RIX_{ij} will be the corresponding marker genotypes of RIX_{min(φ(i),}_{ φ(j))max(φ(i),}_{ φ(j))} in the original data. The permuted samples are analyzed with the same model as the original data to generate an empirical distribution of maximum LOD scores, where the threshold value can be obtained.

## Results

All analysis and simulation code used below are included in supporting information, File S1. In simulation studies, we set the number of parental RI lines *L* = 100, and applied the loop design with *J* = 3 to generate a total of 300 RIX samples (Zou *et al.* 2005). A single chromosome with 101 evenly spaced markers was simulated with either a 2-cM interval or 5-cM interval between nearby markers (resulting in a total length of 2 M or 5 M, respectively). The QTL is located at the 41th marker, which is at either 80 or 200 cM, for the two marker densities, respectively. The marker genotypes were simulated using R/qtl (Broman *et al.* 2003). We set μ(*t*), the mean temporal growth function for QTL genotype Aa to , which is a logistic growth curve (Ma *et al.* 2002; Yang *et al.* 2009). We randomly generated *t _{i}* from (0, 60) for each subject.

We considered the three different functions for β(*t*):

Case 1:

Case 2:;

Case 3:

Cases 1 and 2 are nonlinearly increasing functional effects, used in simulation studies by Wang *et al.* (2008). Case 3 mimics the situation in which the genetic effect is hardly perceptible until after certain age, such as some breast cancer-susceptibility genes (Foulkes *et al.* 2004). To test the performance of the model under various signal/noise ratios, two different sets of variances for random effect and random error were considered for each case: , and , . In all cases, the average heritability is between 0.02 and 0.18.

To choose a good combination of the interior knot number *n _{j}* and the degree of spline

*d*for the genetic effect, 500 runs of simulation were performed. In those simulations, we set , , and the interval length to 5 cM. Figure 1 and Figure 2 plotted the mean and mean for case 2 with different combinations of

*n*and

_{j}*d*. The figures showed that relatively small

*n*and

_{j}*d*in general fit the curves well, and the same is true for the other two cases. We calculated the squared differences (SQD) between and μ(

*t*), and between and β(

*t*) as for each of the

*n*and

_{j}*d*combination. We recorded the number of combinations of

*n*and

_{j}*d*with the smallest SQD in Table 1, left. The results suggest that the combination of

*n*= 1 and

_{j}*d*= 2 is the best for cases 1 and 2, while for case 3, it is

*n*= 2 and

_{j}*d*= 1.

In practice, the true β(*t*) is unknown, so the choice of *n _{j}* and

*d*needs to be estimated. We propose the following approach to choose

*n*and

_{j}*d*using the AIC (Akaike 1970, 1974) as the selection criterion. First, we set

*n*= 1 and

_{j}*d*= 1 and identify the marker with the highest LOD score. Then at the selected marker, we calculate the AIC values for a set of

*n*and

_{j}*d*, and choose the one with the smallest AIC. In the simulation study, we computed the AIC values for the 500 simulations. Table 1, right, shows the number of combinations of

*n*and

_{j}*d*with the smallest AIC. The results are consistent with the SQD results presented in Table 1, left.

For model comparison, we also fitted β(*t*) parametrically. Specifically, we used polynomial functionsWe set *s* = 1 and *s* = 2, for linear and quadratic polynomial functions, in the simulation studies.

Under each case, 200 runs of simulation were conducted for all models mentioned above. For each case, we estimated β(*t*) using both B-splines and the polynomial functions. Hypothesis testings were performed on *H*_{0}: β(*t*) = 0 *vs. H _{a}*: β(

*t*) ≠ 0, and LOD scores were calculated. For accessing the significance of the hypothesis testing, simulations were carried out from the following null model:Total 1000 runs of simulations were performed and the 95% percentile of the maximum LOD score was calculated.

The QTL position was estimated as the location where the maximum LOD is reached. The mean and standard error of the estimated QTL position, by the three approaches, are listed in Table 2. Power is listed in Table 3. All the three methods have similar performance on estimating the QTL position and power for mapping QTL under cases 2 and 3. However, the B-spline approach has substantially higher power than the other two approaches under case 1, as well as higher precision in estimating the QTL location. The mean of the estimated phenotypic curves are plotted along time in Figure 3, Figure 4, and Figure 5, for all cases with 5-cM intervals, and . The nonparametric approach provides better fit to the true underlying phenotypic curves than the parametric approach in all three cases. Overall speaking, the B-spline method outperforms the parametric method.

To evaluate the performance of the modified permutation, we further carried out the following simulation studies. From 100 RI lines, 300 RIX subjects were simulated. A single 100-cM chromosome with evenly spaced markers was simulated with the QTL located at 40 cM. There were either 51 markers separated by 2-cM intervals or 21 markers separated by 5-cM intervals, on the chromosome. Two different β(*t*) functions, as described in cases 1 and 2 above, were simulated. We set , , and . A total of 100 simulations were conducted. Within each simulation, 1000 permutations were performed and the permutation threshold was calculated. To obtain the empirical thresholds, we ran additional 5000 simulations under *H*_{0}: β(*t*) = 0. We compared the permutation thresholds with the empirical ones. The results listed in Table 4 indicate that the modified permutation performs reasonably well. The permutation thresholds were close to the empirical ones. Type I errors were slightly inflated. This is probably due to the small number of RI lines used in the simulations, as well as the sampling variation due to the small number (100) of simulations conducted.

Besides the biallelic marker data, our method can be extended to model founder effects by fitting individual functional curve for each of the eight CC founder alleles. Assuming additive model, we fitwhere *x _{ij}* is the number of the

*j*th founder alleles in the

*i*th RIX sample and β

*(*

_{j}*t*) is the functional effect of the

*j*th founder allele (

*j*= 1, …, 8). To model β

*(*

_{j}*t*) nonparametrically, we use to approximate it.

We carried out a simple simulation study to demonstrate the performance of this model. We again simulated 300 RIX subjects from 100 RI lines by the loop design. Every parental RI line has the same probability, 1/8, to carry one of the eight founder alleles. We set for *j* = 1, 2, 3, 4 and for *j* = 5, 6, 7, 8, respectively. We further set and . The simulations were conducted for 100 runs. In the simulations, we assumed that the founder alleles were known for all RI lines. The means of the estimated functions of the eight founder alleles were plotted in Figure 6. All estimate the true functions well. With no prior knowledge that the genetic effects of the first four founder alleles were the same, our model obtained four very similar estimated curves, which allows us to group founder alleles with similar genetic effects.

## Discussion

This study is largely motivated by the availability of the CC lines (Collaborative Cross Consortium 2012; Kelada *et al.* 2012). The CC project aims to generate and maintain >300 multiparental CC RI lines, and our ability to map complex traits will be greatly increased by making use of these resources. RIX samples derived from CC RI lines process some good properties from both RI lines and *F*_{2} populations. Genotypes of RIX can be directly inferred from those of their parental RI lines. Unlike the parental RIs whose genotypes are homozygous, the genetic structure of an RIX resembles *F*_{2} animals, reducing the phenotypic anomalies associated with inbred genomes. However, RIX animals typically do not share the same degree of relatedness. This unbalanced genetic relatedness requires careful statistical modeling to avoid a large number of false-positive findings. The functional mapping idea is not new in statistical genetics community (Ma *et al.* 2002; Wu *et al.* 2002, 2004; Lin and Wu 2006; Yang *et al.* 2009). However, this article is the first one that develops the functional mapping method for the RIX data and specifically models the unique genetic structure of RIX samples. In addition to B-spline approximation, other nonparametric approaches can be used to model the varying coefficients, such as the local polynomial regression (Fan and Gijbels 1996), the smoothing splines (Hastie and Tibshirani 1993; Hoover *et al.* 1998), and wavelet-based approaches (Donoho and Johnstone 1994). One advantage of using B-splines is that the smoother matrix {*B _{k}*(

*t*)} is independent of the responses. Unlike other nonparametric approaches, how to determine the smoothness is still an open question, although the choice of the number of knots is generally not critical (Yang

_{i}*et al.*2009). Our simulation results (for example, Figures 1 and 2) show that the estimated functional effects are not very sensitive to the choices of

*d*and

*n*

_{j}_{.}

In our simulations, we applied single marker analysis because the high marker density of the parental RI (Aylor *et al.* 2011; Durrant *et al.* 2011; Collaborative Cross Consortium 2012; Kelada *et al.* 2012), and thus RIX, makes results similar to those that would be obtained using more complicated mapping methods, such as traditional interval mapping (Lander and Botstein 1989) or regression interval mapping (Haley and Knott 1992). We also assume no parent-of-origin QTL and polygenic effects. The model can be extended to include additional effects. For example, with two random effects—one for the maternal effect and another for the paternal effect—we can model the parent-of-origin polygenic effects. Our method mainly considers quantitative trait nucleotide (QTN) mapping, we have shown by simulation studies that it can be extended to model founder allelic effects by fitting one functional curve for each of the eight CC founder alleles. Although our model considers only the additive genetic effects, the dominant effects can be easily included in the model by adding additional functional effects. This is not a concern for QTN models but for models with founder allelic effects, the subsequent increase in the number of parameters can be very large. We may need to consider grouping certain founder alleles with some prior knowledge or genetic similarity as was done in haplotype analysis (Schaid *et al.* 2002; Park *et al.* 2003; Wang *et al.* 2004; Lin *et al.* 2005) to maximize mapping power.

When more than one QTL is on a chromosome, the test statistic at one position will be affected by all the other QTL, the genetic estimates are likely to be biased, and QTL can be mapped to wrong positions (Knott and Haley 1992; Martinez and Curnow 1992). Our model can be extended to multiple regression for multiple QTL mapping, and some model selection approaches can be modified for QTL selection.

Our model investigates gene-by-time interactions for RIX data in a flexible nonparametric fashion. In this model, correlations among subjects are modeled as a function of their relatedness, which dramatically simplifies the covariance matrix of the data. The final result is a framework for mapping in complex genetic designs, which is computationally tractable.

## Acknowledgments

The authors are grateful for constructive comments and suggestions from the reviewers and the associate editor. Support was provided in part by National Institute of General Medical Sciences R01GM074175 and NIMH/NHGRI Center of Excellence in Genomic Sciences (P50MH090338).

## Footnotes

*Edited by Lauren M. McIntyre, Dirk-Jan de Koning, and 4 dedicated Associate Editors*

- Received July 10, 2011.
- Accepted November 17, 2011.

- Copyright © 2012 by the Genetics Society of America