Originally published as Genetics Published Articles Ahead of Print on September 14, 2008.

Genetics, Vol. 180, 1743-1761, November 2008, Copyright © 2008
doi:10.1534/genetics.108.091058

Mixed Effects Models for Quantitative Trait Loci Mapping With Inbred Strains

* Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78245-0549, {dagger} Department of Biomathematics, University of California, Los Angeles, California 90095-1766, {ddagger} Department of Biostatistics, University of California, Los Angeles, California 90095-1772, § Department of Human Genetics, University of California, Los Angeles, California 90095-7088 and ** Department of Statistics, University of California, Los Angeles, California 90095-1554

1 Corresponding author: Department of Genetics, Southwest Foundation for Biomedical Research, P.O. Box 760549, San Antonio, TX 78245-0549.
E-mail: lbauman{at}sfbrgenetics.org

Manuscript received May 14, 2008. Accepted for publication September 5, 2008.

ABSTRACT

Fixed effects models have dominated the statistical analysis of genetic crosses between inbred strains. In spite of their popularity, the traditional models ignore polygenic background and must be tailored to each specific cross. We reexamine the role of random effect models in gene mapping with inbred strains. The biggest difficulty in implementing random effect models is the lack of a coherent way of calculating trait covariances between relatives. The standard model for outbred populations is based on premises of genetic equilibrium that simply do not apply to crosses between inbred strains since every animal in a strain is genetically identical and completely homozygous. We fill this theoretical gap by introducing novel combinatorial entities called strain coefficients. With an appropriate theory, it is possible to reformulate QTL mapping and QTL association analysis as an application of mixed models involving both fixed and random effects. After developing this theory, our first example compares the mixed effects model to a standard fixed effects model using simulated advanced intercross line (AIL) data. Our second example deals with hormone data. Here multivariate traits and parameter identifiability questions arise. Our final example involves random mating among eight strains and vividly demonstrates the versatility of our models.


IN analyzing gene mapping data from inbred strains, there is always the temptation to borrow models more pertinent to outbred populations. The vast majority of statisticians are wise enough to resist this temptation and turn to analysis methods tailored to specific breeding designs. Fortunately, the typical backcross or F2 design has sufficient symmetry to permit analysis of variance by standard statistical packages. As mammalian geneticists explore more complicated designs involving multiple strains and multiple generations, this analysis paradigm has begun to fracture. It is therefore hardly surprising that the last decade and a half have seen a revival of interest in statistical models for gene mapping with inbred strains. Although we briefly review some of the important contributions to this literature in the next section, it is fair to say that most modern models rely heavily on fixed effects. In contrast, the most successful models for mapping quantitative trait loci (QTL) in outbred populations invoke random effects (HOPPER and MATHEWS 1982; GOLDGAR 1990; SCHORK 1993; AMOS 1994; BLANGERO and ALMASY 1997).

The premise of this article is that, properly formulated, random effects models hold equal promise for more complicated inbred strain data. If a QTL is segregating between two strains, backcross and F2 designs reliably detect it (VALDAR et al. 2006). Models based on fixed allelic effects play a critical role in this process. Traditional designs have two drawbacks. First, the scarcity of recombination events often gives long mapped intervals. Second, when two founder strains of related ancestry are chosen, there may be no segregating QTL. To increase the number of recombination events and the number of segregating QTL, geneticists are turning to more complex designs involving multiple strains. Although the rationale for more complex designs is compelling, they bring in their wake problems of overparameterization. Random effects models neatly circumvent some of the parameterization issues encountered with fixed effects models. Unfortunately, the standard outbred QTL model does not make sense for inbred strains. All individuals of a particular strain are genetically identical and completely homozygous. These cardinal characteristics have subtle consequences when we calculate trait covariances for the descendants of matings between different strains. A logically correct theory for specifying covariances between pairs of individuals is the key to making random effects models respectable for inbred strains.

In this article, we take two approaches to QTL mapping; both capture polygenic background as a source of random variation. The two approaches differ in how they handle variation caused by the QTL. In association mapping, markers are treated one by one as candidate genes, and observed genotypes or allele counts at a marker serve as fixed predictors of trait means. In linkage mapping, markers in the vicinity of the QTL provide prior information on gene sharing, and the QTL contribution is modeled as a random effect. The greatest defect of our models is the blanket assumption of additivity. The greatest strength of our models is their generality in other regards. Thus, there is no limit to the number of founding strains, the depth and complexity of pedigrees, or the number of traits in a multivariate analysis.

To avoid breaking the flow of our discussion, much of the mathematical detail is relegated to the APPENDIXES. The following sections summarize previous contributions, lay out the model with full attention to computation of strain coefficients and relative covariances, resolve the thorny issue of identifiability, apply the models to real and simulated data, and discuss the broader implications and limitations of the models.


METHODS

A brief survey of previous methods:

Inbred mammalian strains have unique advantages in genetics. All members of a strain are genetically identical and completely homozygous. Simple crosses between strains involve no phase ambiguities, and any genes mapped can be quickly located in humans and other species by synteny. With mice and other small mammals, breeding is reasonably straightforward, generation times are fairly short, and the environment can be exquisitely controlled.

For decades, QTL mapping in inbred strains was considered an exercise in fixed effects modeling. Testing for association between marker genotypes and trait values is readily carried out using several available statistical packages. In the interval method introduced by LANDER and BOTSTEIN (1989), the QTL is allowed to take any position along a chromosome. This makes QTL genotypes unobservable and requires computation of posterior distributions given observed genotypes at the flanking markers. Although the EM algorithm is applicable in this context, it is often slow to converge, and the regression method of HALEY and KNOTT (1992) provides a quick approximation. The permutation test of CHURCHILL and DOERGE (1994) handles multiple testing problems gracefully. The recent program R/qtl (BROMAN et al. 2003), which capitalizes on the R software environment, combines several of these methods with hidden Markov modeling of missing genotypes. Despite these admirable advances, interval mapping is still limited to simple crosses where polygenic background is confounded with random environment. As the field embraces more complex crosses, geneticists no longer have the luxury of ignoring polygenic background, and it seems self-evident that explicitly modeling it will improve statistical inference.

The composite interval mapping method of ZENG (1993, 1994) implemented in QTL Cartographer generalizes interval mapping by including the direct effects of one or more markers unlinked to the QTL. Hence, composite interval mapping can be viewed as an attempt to incorporate polygenic background through fixed effects. If the number of typed markers is large, then it becomes hopeless to include all of them, and some automatic selection of background markers is desirable (MANLY and OLSON 1999).

Although XIE et al. (1989) take important first steps toward including polygenic background as a random effect, they do not derive general covariance expressions. This failure makes it difficult to deal with nonstandard crosses and awkward to combine data from different crosses. In the meantime, the pressure to increase the number of strains per cross has been growing (REBAI and GOFFINET 1993). Of 21 cloned mouse genes listed in Tables 1 and 2 of the review by FLINT et al. (2005), 7 rely on cloning strategies involving multiple strains or outbred mice. These practical concerns are stimulating intense efforts to revamp experimental design and statistical analysis of inbred cross data (LIU and ZENG 2000; HITZEMANN et al. 2002; PLETCHER et al. 2004; LI et al. 2005; CERVINO et al. 2007). Other recent models that delve into multiple QTL models and epistasis are both frequentist (KAO et al. 1999; JANNINKA and JANSENA 2001; SEATON et al. 2002; BROMAN et al. 2003) and Bayesian oriented (SILLANPÄÄ and ARJAS 1998; SEN and CHURCHILL 2001; BROMAN et al. 2003).


View this table:
In this window
In a new window

 
TABLE 1

AIL: type I error, power, coverage, and average estimates

 

View this table:
In this window
In a new window

 
TABLE 2

Four-way cross: mean estimates for bivariate leptin analysis at D3Mit127

 

Trait means, variances, and covariances:

We begin our theory development with a basic model applicable to any inbred strain design, including F2, advanced intercross lines, and random mating. Suppose that i and j are two animals generated by a complex cross involving s inbred strains. At t traits of interest, i and j exhibit random vectors Xi and Xj of trait values. For the sake of simplicity, assume further that Xi and Xj reflect the contributions of a single gene whose alleles have additive effects. Our immediate goal is to calculate the expected vectors E(Xi) and E(Xj) and the covariance matrix Cov(Xi, Xj). When i = j, we recover variances as well as covariances. Because of our assumption of additivity, Xi decomposes as the sum Yi + Zi of a maternal contribution Yi plus a paternal contribution Zi. To calculate E(Yi), let Mi denote the originating strain of the maternal gene of i. Although Mi is unobserved, we can calculate the probability Pr(Mi = a) for any given strain a. In terms of these probabilities and the t x 1 mean vector µ(a) of allelic effects on each trait for strain a, we have

Formula
Invoking a similar expression for E(Zi), it follows that

Formula 1(1)
where {gamma}i(a) is the probability that a randomly sampled gene from i originates from strain a. We refer to {gamma}i as the strain fraction vector for animal i; {gamma}i has dimension s x 1.

Covariances are derived by the same kind of reasoning. Decompose Xj into the sum Vj + Wj of a maternal contribution Vj plus a paternal contribution Wj. In view of the bilinearity of the covariance operator and the symmetry of maternal and paternal alleles, it suffices to find the covariance Cov(Yi, Vj). Let Nj denote the originating strain of the maternal gene of j. Conditioning on the joint value of Mi and Nj then yields

Formula 1
where the superscript * indicates a vector or matrix transpose. By analogy with kinship coefficients, we define the strain coefficient {psi}ij(a, b) to be the joint probability that a randomly drawn gene from animal i originates from strain a and a randomly drawn gene from the same locus of animal j originates from strain b. If i and j coincide, then sampling is done with replacement. The t x t covariance matrix between the trait values of i and j becomes

Formula 2(2)
where Cij(a, b) = {psi}ij(a, b) – {gamma}i(a){gamma}j(b), which we collect into an s x s matrix, denoted Cij.

For s strains and t traits, it is convenient to stack the allelic effects into a column vector µ of length st with transpose

Formula 2
The positive semidefinite matrix {Omega} = µµ* can then be split into t2 blocks {Omega}kl each of size s x s. Restricting our attention to the block corresponding to traits k and l, the covariance matrix (2) has entries given by the trace formula

Formula 3(3)

In polygenic inheritance, many independent loci contribute in an additive manner to the traits under consideration. Since trait means and covariances add in this setting, the mean expression (1) and the covariance expressions (2) and (3) remain valid provided we replace µ by Formula 3 and {Omega} by Formula 3. Here µl denotes the vector contribution corresponding to locus l rather than the lth component of µ. APPENDIX A shows that every pair (µ, {Omega}) consisting of a vector µ and a positive semidefinite matrix {Omega} can be represented as two such coordinated linear combinations. Hence, to capture polygenic background, it suffices to estimate arbitrary µ and {Omega}. We see later that there is an identifiability issue that must be surmounted in estimating {Omega}.

Computation of strain coefficients:

Because the combinatorial coefficients {gamma}i(a) and {psi}ij(a, b) are essential in calculating trait means and variances, we need good algorithms to compute these coefficients. Fortunately, we can mimic the logic used in calculating kinship coefficients for outbred populations. Since a pedigree founder i is assumed to be strain pure, one entry of the vector {gamma}i = 1, and the remaining entries = 0. Likewise for two founders i and j, one entry of the matrix {psi}ij = 1, and the remaining entries = 0. All other strain fraction vectors {gamma}i and strain coefficient matrices {psi}ij are defined recursively starting with the founders.

To avoid circular reasoning, pedigree members are numbered so that parents always precede their children. If animal i is not a founder, then it has parents k and l. Assuming that k and l have already been visited in filling in the strain fractions, we set

Formula 4(4)
If j != i, then without loss of generality we can assume j has been visited already, and we can set

Formula 5(5)

Formula 6(6)
This leaves only the case j = i. There are four equally likely possibilities when we sample two genes of i: (a) both genes coincide with the gene passed by k, (b) both genes coincide with the gene passed by l, (c) the first gene comes from k and the second from l, and (d) the first gene comes from l and the second from k. These considerations produce the matrix recurrence

Formula 7(7)
where diag({gamma}) denotes a diagonal matrix whose diagonal entries coincide with the entries of the vector {gamma}.

The initial conditions on founders and the recurrences (4)–(7) completely determine {gamma}i and {psi}ij. These in turn determine the Cij matrices, which have a richer mathematical structure than the strain coefficient matrices {psi}ij. APPENDIX B describes several fascinating properties of the Cij matrices. One such property is Cij = 0 between most members of simple crosses, for example, for all F2 animals when i != j or whenever i is a founder or F1.

Variance component models for QTL mapping with outbred populations require conditional kinship coefficients in addition to theoretical kinship coefficients. For exactly the same reasons, we also need conditional strain fractions and coefficient matrices. These depend on observed marker genotypes in the vicinity of a putative QTL. On small pedigrees, it is possible to compute conditional strain coefficient matrices exactly by considering all descent graphs (gene flow patterns) at the QTL and neighboring markers (KRUGLYAK et al. 1996). In practice, inbred strain pedigrees are so large that the number of possible descent graphs is astronomical. Stochastic sampling provides a workable substitute for exhaustive enumeration of descent graphs (SOBEL and LANGE 1996). The Markov chain Monte Carlo (MCMC) method incorporated in the computer program SimWalk samples relevant descent graphs with the appropriate conditional probabilities. Given a descent graph at the QTL, it is trivial to compute strain fractions for all animals and strain coefficient matrices for all pairs of animals in a pedigree. The averages of these quantities over all sampled descent graphs serve as approximations to the conditional strain fractions and strain coefficient matrices.

Strain coefficients convey more information than strain fractions. For instance, it is obvious that

Formula 7
We can put this extra information to good use in predicting QTL genotypes. At a given genomic location, imagine a marker with a different allele for each strain. Let Formula 7 be the conditional probability that animal i has unordered genotype a/b at the hypothetical marker given the observed data at the ordinary markers. The relations

Formula 7
connect the conditional genotype probabilities to the conditional strain fractions and coefficients. These relations in turn imply that

Formula 8(8)
Thus, we can impute strain genotypes as well as strain fractions.

Variance component models:

Variance component models revolve around the multivariate normal distribution or related distributions such as the multivariate t. Every multivariate normal distribution is uniquely determined by its mean vector {nu} and variance matrix {Sigma}. If we decompose trait values into independent, additive contributions, then {nu} and {Sigma} can be expressed as sums over the various contributions. As long as we are willing to take the leap of faith that all random contributions are Gaussian, then trait vectors will be Gaussian as well. For each random contribution, variance matrices are constructed from a constant part and a parametric part. The genetic covariance formula (3) is typical in this regard. The constant parts Cij are forced on us by the nature of the pedigree. The parametric part {Omega} with blocks {Omega}kl requires estimation.

The environmental contribution to the mean is usually modeled as the sum of a grand mean {eta} plus covariate effects such as age or sex. Random environment and cage effects can be modeled by Kronecker products of variance matrices, provided we order trait values so that all values corresponding to a given trait are contained in a single block, and animals are consistently enumerated across blocks. Given these conventions, the variance matrix under random environment reduces to the Kronecker product {Upsilon} {otimes} I of the trait variance matrix {Upsilon} and the identity matrix I. Obviously, {Upsilon} is the parametric part; it describes the environmental covariation of the traits in a single animal. The matrix I reflects the independence of the random environments for the various animals. For a random cage effect, we replace the identity matrix by a cage matrix H = (hij), where hij = 1 if animals i and j belong to the same cage and 0 otherwise. The matrix replacing {Upsilon} describes the environmental covariation of the traits for animals in a single cage (LANGE 2002). As an example, heritability analyses generally specify two random effects, additive polygenes and random error/environment,

Formula 9(9)

Formula 10(10)
where {alpha}ic is the cth of C covariates measured on animal i and βkc is the corresponding regression coefficient for trait k.

Once we specify the mean and variance components, the loglikelihood of a pedigree can be written as

Formula 10
using the observed trait values x, the mean vector {nu} such as that of Equation 9, and the variance matrix {Sigma} such as that of Equation 10. Assuming pedigrees behave independently, their loglikelihoods add. Given the overall loglikelihood, parameters can be estimated by maximum likelihood, and statistical inference conducted by standard likelihood ratio tests comparing alternative hypotheses to null hypotheses. LANGE (2002) develops this frequentist approach to estimation and inference in detail. Our computer program Mendel relies on a quasi-Newton algorithm for maximum likelihood estimation. BAUMAN et al. (2005) discusses an alternative EM algorithm as well as factor-analytic parameterizations of variance matrices. Given the presence of covariates and heterogenous pedigree structures, permutation testing is rarely possible. To aid the user in judging significance and model fitting, Mendel reports standard errors of parameters, pedigree deviances, outlier individuals, and various goodness-of-fit statistics.

Two QTL mapping strategies:

There are two specific strategies, association and linkage, for QTL mapping. Variance component models are pertinent to both. Although the two strategies differ in how they portray QTL effects, each captures polygenic background as a random effect. In addition to the strain effects appearing in Equation 1, most models include a grand mean {eta} and fixed effects tied to plausible predictors. If we specify {eta}, then we must impose the vector constraint Formula 10 on the polygenic mean vector µ. Here the index a ranges over all strains. Random effects include the polygenic effect summarized by Equation 3, random environment plus measurement error, and possibly correlated environment such as cage effects. As described in the next section, the polygenic variance matrix {Omega} is not identifiable, and complicated constraints must be imposed on it to compensate for this fact. Regardless of the nature of these constraints, we must compute theoretical strain fractions and strain coefficients to estimate µ and {Omega} under the null hypothesis of no QTL effect.

In linkage mapping, markers serve to tag chromosome segments and keep track of recombination events. The genotypes of the causative QTL are unobserved, and the QTL is allowed to assume any position along the genome. Under the alternative hypothesis in linkage mapping, we model the QTL as a random effect in the same way that we modeled the contribution of a single gene with additive effects. The only difference is that we use strain fractions and coefficients calculated conditional on the observed marker data. From here on, we refer to these as conditional strain fractions and coefficients; those calculated unconditionally we call theoretical strain fractions and coefficients. Motivated by Equations 1 and 3, we let {varepsilon}(a) denote the additive effect of the QTL in strain a. Then our earlier reasoning shows that the QTL contribution has mean

Formula 10
for animal i and covariance

Formula 10
for animals i and j. Here the circumflexes indicate conditional versions of the strain fractions and coefficients estimated from the marker data. Under the alternative hypothesis, we estimate the entries of {varepsilon}.

Our basic linkage model therefore specifies the trait means and covariances

Formula 11(11)

Formula 12(12)
for two animals i and j. Here k and l index two traits, {alpha}ic is covariate c of animal i, and βkc is the corresponding regression coefficient for trait k. If we let Formula 12 denote the average Formula 12, then all QTL models that include a grand mean require the constraint Formula 12. In the presence of this constraint, the likelihood ratio test of linkage follows asymptotically a {chi}2 distribution with stt degrees of freedom.

In association mapping, QTL fixed effects are tied to the current marker. The marker is viewed as a candidate gene whose genotypes or alleles directly influence trait means (LANGE et al. 2005); random QTL effects are omitted. Hence, in Equation 12 we drop the random effect Formula 12, and in Equation 11 we amend the fixed effect Formula 12 to represent regression on observed allele counts at the current marker. If the additive model for allelic effects is viewed as too restrictive, then we can regress on observed genotypes. Association testing is again conducted by likelihood ratio statistics.

In the presence of missing genotypes in association testing, we fall back on imputed allele counts or imputed genotype counts. Because genotypes at markers are usually directly observed, little is lost in imputation by ignoring genotypes at flanking markers. In this simpler setting, a fast deterministic algorithm is available for imputation (LANGE et al. 2005). Flanking marker genotypes occasionally resolve phase ambiguities caused by combining closely spaced single nucleotide polymorphisms (SNPs) into supermarkers. Accordingly, the current version of Mendel also accepts MCMC estimates of conditional strain fractions from SimWalk. When each strain carries a different allele at the marker, the allele counts delivered by SimWalk are computed by doubling the conditional strain fractions at the marker. When two strains share a common allele at the marker, the corresponding strain fractions are added before doubling.

Identifiability:

We have seen that the polygenic covariance expression (3) between trait k of animal i and trait l of animal j involves the s x s trait block {Omega}kl of an st x st variance matrix {Omega}. Unfortunately, estimation of {Omega} collides with an identifiability issue. The crux of the problem is the existence of nontrivial matrices {Lambda} with

Formula 12
for every legitimate choice of Cij and every trait pair (k, l). Proposition 2 of APPENDIX B explains this phenomenon by representing Cij as a convex combination of the matrix 0 and Formula 12 matrices Emn indexed by unordered strain pairs {m, n}. Here all entries of Emn are 0 except for the diagonal entries emm = enn = 1 and the off-diagonal entries emn = enm = –1. It follows that

Formula 12
provided

Formula 13(13)
for every strain pair {m, n} and every s x s trait block {Lambda}kl = ({lambda}kl,mn) of {Lambda}.

We can solve the identifiability problem by subtracting the nonidentifiable part of {Omega} from {Omega}. To achieve this end, we view the positive semidefinite matrix {Omega} as a vector in the Euclidean space Rstxst. In this setting the trace function Formula 13 = tr(AB*) and Frobenius norm ||A||F = tr(AA*)1/2 reduce to the standard inner product and Euclidean norm. To find the nonidentifiable part of {Omega}, one projects {Omega} onto the vector subspace Formula 13 of symmetric matrices satisfying Equation 13 for every strain pair {m, n} and every trait block {Omega}kl. Formally, the projection P({Omega}) is defined to be the matrix X giving the minimum of Formula 13 for Formula 13.

Fortunately, minimization of Formula 13 separates into subproblems corresponding to different trait blocks. First, consider a diagonal block {Omega}kk of {Omega}. To simplify notation, denote its entries by ymn = {Omega}kk,mn and the entries of the corresponding block of the projection by xmn = P({Omega})kk,mn. To find P({Omega})kk we must minimize the sum of squares

Formula 13
subject to the constraints xmm + xnn = xmn + xnm for every pair {m, n}. Now consider off-diagonal blocks {Omega}kl = Formula 13. These come in pairs that must be handled together, so we let

Formula 13
and

Formula 13
and minimize the sum of squares

Formula 13
subject to the constraints xmm + xnn = xmn + xnm for every pair {m, n}. It follows that diagonal blocks and off-diagonal blocks lead to the same constrained minimization problem.

APPENDIX C shows that each of these least-squares problems has solution X = (xmn) with residual

Formula 13
where Y = (ymn), U = QYQ, and Q is the s x s projection matrix Formula 13. In calculating a covariance, we can ignore symmetrization and replace the matrix Formula 13 by U. Indeed, the symmetry of Cij implies that

Formula 13
Thus, tr(CijQ{Omega}klQ) faithfully represents the covariance between trait k of animal i and trait l of animal j. By the same reasoning, we can replace the entire residual matrix {Omega}P({Omega}) by the matrix

Formula 14(14)
Here diag(Q) is a diagonal block matrix with all t diagonal blocks equal to Q. One can easily check that diag(Q) is a projection matrix and that R inherits the properties of symmetry and positive semidefiniteness from {Omega}.

In reparameterizing {Omega}, it is convenient to define an orthogonal matrix O mapping the vector Formula 14 to the standard basis vector e1. (See APPENDIX D for one version of O.) It follows that

Formula 14
Observe that pre- and postmultiplying any square matrix by Formula 14 zeros out the first row and first column of the matrix. To take advantage of this fact, we express the residual matrix (14) as

Formula 15(15)
The matrix

Formula 15
is a positive semidefinite replacement for diag(O){Omega} diag(O*). By our earlier remark, a block {Upsilon}kl of {Upsilon} equals the corresponding block of diag(O){Omega} diag(O*) with its first row and column zeroed out.

We are now close to the desired goal of reparameterizing the residual. The matrix {Upsilon} has entire rows and columns consisting of zeros. Permuting its rows and columns appropriately will move its nontrivial part to an upper-left block, which will be positive definite whenever {Omega} is positive definite. The Cholesky decomposition of this upper-left block then serves as a good parameterization of R. To compute the number of parameters for s strains and t traits, observe that the matrix {Upsilon} is st x st. A total of t rows and columns are lost in the zeroing-out process. This leaves an (stt) x (stt) upper-left block with (stt)(stt + 1)/2 diagonal or subdiagonal entries. For example, with three strains and two traits, there are 10 parameters.

For the sake of clarity, let us summarize how our proposed parameterization leads to trait covariances. It begins with a Cholesky decomposition {Delta} of an (stt) x (stt) positive definite matrix. The matrix {Delta}{Delta}* is then subdivided into (s – 1) x (s 1) trait blocks ({Delta}{Delta}*)kl, and each block is promoted to an s x s trait block {Upsilon}kl by adding a top row and left column of zeros. In matrix notation, {Upsilon}kl = Z({Delta}{Delta}*)klZ* with Z the s x (s 1) matrix

Formula 15
Finally, we construct the residual matrix R via Equation 15, using the orthogonal matrix O.

With these conventions, the covariance between trait k of animal i and trait l of animal j amounts to

Formula 16(16)
In computing covariances over large pedigrees, it saves time and storage to precompute and store the (s – 1) x (s – 1) matrices 4Z*OCijO*Z and discard the s x s matrices Cij. Note that the action A Formula 16 Z*AZ on an s x s matrix A deletes the first row and first column of A.

This ends our theoretical overview of the model. APPENDIX E shows how to differentiate covariances with respect to parameters, and APPENDIX F supplies a counterexample connecting identifiability and symmetry. We now move on to data analysis.


APPLICATIONS

A simulated advanced intercross line:

An AIL starts with F1 offspring from an intercross of two inbred strains. The F1 animals are randomly bred to produce the F2 animals, the F2 animals are randomly bred to produce the F3 animals, and so on for a total of n generations. An AIL differs from repeated brother–sister mating, because it involves enough animals to preserve genetic diversity. It draws its strength from the steady accumulation of recombination events over many generations (DARVASI and SOLLER 1995). Simulating data according to an AIL design permits us to compare our mixed effects results with the fixed effects results of the benchmark program QTL Cartographer. This exercise is not meant to be a substitute for an exhaustive study of power and experimental design. Also, the comparison is not entirely fair because QTL Cartographer analyzes the Fn data at the last generation ignoring the previous generations. To reconstruct missing marker information, QTL Cartographer applies an inflated recombination fraction scaled to reflect n.

To create our simulated AIL data, we mated two inbred founder animals and subjected their descendants in each generation to virtual random mating. Generation 10 contained 175 animals in 140 sibships with 492 animals overall. Placing the QTL locus at the midpoint of markers 5 and 6 of 11 equally spaced marker loci, we simulated genotypes by gene dropping and assigned QTL effects on the basis of the genotypes at the QTL. QTL genotypes were then discarded from further analysis. We modeled a univariate trait with a grand mean {eta} = 4, an environmental variance {sigma}Formula 16 = 1, and a 2 x 2 polygenic variance matrix

Formula 16
For this simulated trait, strain one has a genetic variance comparable to the environmental variance and larger than the genetic variance of strain two. The two strains share a modest genetic correlation. For reasons explained in the next section, a single generation of data in a symmetric cross of this sort does not sustain estimation of strain-specific polygenic means. To circumvent this problem in our comparisons, we set the strain-specific polygenic means equal to 0. We chose small strain-specific QTL effects {varepsilon}1 = 0.2 and {varepsilon}2 = –0.2 centered around 0. In view of our discussion of identifiability, we can estimate only a single parameter p1 characterizing {Omega}. The projection technique discussed yields the value p1 = 0.667. The discussion of the Cij matrices in APPENDIX B explains why genotype data on a single generation also prevent estimation of p1.

To provide the most informative comparisons, we ran three analyses: (1) Mendel on the full pedigree with complete genotype and phenotype data (Mendel Full), (2) Mendel on the full pedigree but with phenotype data on only the final F10 generation (Mendel F10), and (3) QTL Cartographer on the final F10 generation with complete genotype and phenotype data (Cartographer). Simply comparing cases Mendel Full and Cartographer is hardly fair; the full pedigree contains more than twice the number of animals in the final generation. Mendel F10 takes advantage of the full genealogy and all genotype data in computing theoretical and conditional strain coefficients. It limits itself to the phenotype sample in the last generation to enable a better comparison to QTL Cartographer.

Before turning to QTL mapping in the Mendel analyses, we fit a baseline model including the grand mean, the polygenic variance, and the environmental variance. We then estimated conditional strain coefficients at each of the 11 marker loci. This put us in a position to estimate the global parameters and the QTL-specific parameters simultaneously at each locus. The evidence in favor of the QTL is summarized by a likelihood ratio test (LRT) statistic following a Formula 16 distribution; a nonlinear false discovery rate (FDR) correction (BENJAMINI et al. 2001) corrects for multiple testing for all three analyses. Table 1 summarizes the type I error rate, power, and coverage as well as the generating parameters, their estimates, and the standard errors of the estimates at the loci adjacent to the QTL. Successful coverage occurs when the equivalent one-LOD drop interval (4.6 LRT units) includes the QTL. We reject the null hypothesis of no QTL effect when the LRT is significant at the 0.05 level.

The results in Table 1 reflect 100 simulations for a QTL-effect size that yields power >90% for Mendel Full; type I error rates are given as confidence intervals based on 500 simulations under the null hypothesis of no QTL effect. Clearly, the power to detect linkage is drastically reduced when only the F10 generation is available for analysis. This absence of data also makes it difficult for Mendel F10 to estimate the polygenic parameter p1 accurately. For Mendel Full all estimates are within one standard error of their true values, and standard errors are small. QTL Cartographer exhibits slightly better power and coverage than Mendel F10, but with a largely inflated type I error rate. Both methods are easily bested by Mendel Full. These trends continue over a range of smaller QTL effects (data not shown). We are pleased with these results. In our view they demonstrate that application of the mixed effects model sacrifices little in simple settings while generalizing readily to complex pedigrees.

A multivariate four-way cross:

To illustrate the analysis of multivariate traits, we next consider the hormone data of Burke and colleagues (HARPER et al. 2003) on aging UM-HET3 mice. Figure 1 shows how the UM-HET3 mice were created from four founder strains: BALB/cJ (C), C57BL/6J (B6), C3H/HeJ (C3), and DBA/2J (D2). CB6F1 females crossed with C3D2F1 males provided 967 F2 full siblings. At markers with four different alleles, all F2 mice were heterozygous. Thus compared to a two-way cross, the four-way cross doubles the number of founder strains without sacrificing phase certainty. Hormone levels of insulin-like growth factor I (IGF), leptin (Lep), and thyroxine (T4) were measured at 4 and 15 months on each of the F2 mice. Testing maternal and paternal effects separately, Harper et al. found several linked markers in these data via ANOVA, including a maternal allele at D3Mit25 linked to IGF at 15 months, a paternal allele at D3Mit127 linked to Lep at 4 months, and both maternal and paternal alleles linked to Lep at 15 months. It is worth pointing out that ANOVA or MANOVA must be carried out at marker loci. Only here do marker genotypes or allele counts unambiguously define factor levels. With complete genotyping, our model collapses in this setting to the classical models.


Figure 1
View larger version (6K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Four-way cross for UM-HET3 Mice. UM-HET3 mice are created from four founder strains: BALB/cJ (C), C57BL/6J (B6), C3H/HeJ (C3), and DBA/2J (D2); the F2 generation results from CB6F1 females crossed with C3D2F1 males.

 
This multistrain cross highlights identifiability pitfalls inherent in the structure of some crosses and the data collected on them. For example, all F2 mice share the strain fraction vector Formula 16. Hence, the polygenic mean is confounded with the grand mean. Using strain trait averages or phenotyping members of the original strains would allow us to estimate the polygenic means, but this is not an option for the current data.

Although the rigid structure of the four-way cross preserves phase certainty, it reduces uncertainty to the point where the polygenic covariance matrix cannot be estimated. Polygenic covariances depend on the combinatorial matrices Cij. We have already noted that Cij = 0 whenever i is a founder or i and j are F1 mice. Straightforward calculations for F2 mice i and j with i != j yield

Formula 16
Inspection of Equation 3 therefore shows that the polygenic covariance matrix {Omega} is confounded with the matrix describing the environmental covariances.

Finally, there are identifiability problems with the QTL allelic effects. At the covariance level, the conditional coefficient matrix Formula 16 is identically 0 when typing is full and different alleles are present in each strain. At the mean level, imposition of the constraint {varepsilon}4 = {varepsilon}1{varepsilon}2{varepsilon}3 shows that the genotype-specific means in a purely allelic model can be expressed as the vector

Formula 16
Because the matrix on the right of this equation has less than full rank, some mean vectors are not representable. As a substitute for the additive QTL contributions, we assign a different mean effect to each of the four F2 genotypes.

We analyze these data in the same manner as the simulated AIL except for graphing the –log10(P-value) instead of LRTs and analyzing multiple map points in the intervals between marker loci. We enjoy two advantages over ANOVA or MANOVA; namely, we can use phenotyped individuals with wholly or partially missing genotypes, and we can estimate both QTL location and effect size.

To carry out a multivariate analysis, one must decide which univariate traits to analyze together. This is not a trivial matter because combining traits exacerbates the multiple testing problem and may add noise and degrade power (AMOS et al. 2001; BAUMAN et al. 2005). With outbred populations it is intertwined with the issue of ascertainment (DAWSON and ELSTON 1984); it may also be a problem with inbred populations since strains are often chosen for a particular experiment on the basis of their average phenotype. We present here the results of two multivariate analyses making these points. The most interesting results from this example data set are on chromosome 3, and we focus on three traits, leptin measurements at both 4 and 15 months and insulin-like growth factor I at 15 months in this region. In univariate analysis, both IGF-15 and Lep-4 show significant linkage to markers on chromosome 3, while Lep-15 shows suggestive linkage. Multivariate analyses are indicated biologically, spatially, and temporally.

We carried out a number of multivariate analyses; some of the results are summarized in Figures 2 and 3 and Table 2. The graphs of –log10(P-value) along chromosome 3 in Figure 2 correspond to the univariate analyses of IGF-15, Lep-4, and Lep-15 and the bivariate analysis of Lep-4 and Lep-15. The univariate graph of IGF-15 peaks over marker D3Mit5. Subjecting the P-values for IGF-15 to the nonlinear FDR correction (BENJAMINI et al. 2001) suggests a single location for IGF-15. Both of the univariate leptin graphs as well as the bivariate graph peak over D3Mit127. After FDR correction, at least two significant map points are suggested over D3Mit127 for the bivariate leptin analysis. Table 2 reports estimates and standard errors for the bivariate leptin mean parameters at marker D3Mit127. These estimates are very similar at the two time points. Although likelihood ratios improve over univariate analysis, P-values do not because the degrees of freedom of the {chi}2 test double. The estimated environmental covariance matrix

Formula 17(17)
is consistent with the raw correlation of the two traits. In the matrix (17), the standard error of each estimate appears in parentheses.


Figure 2
View larger version (13K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Univariate and bivariate results, four-way cross on chromosome 3, univariate results for IGF-15 peak over marker D3Mit25. Univariate and bivariate results for Lep-4 and Lep-15 peak over marker D3Mit127.

 

Figure 3
View larger version (8K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Trivariate analysis, four-way cross on chromosome 3, trivariate results peak over marker D3Mit25 and D3Mit86. These peaks are lower than those obtained with univariate and bivariate analyses.

 
A trivariate analysis of IGF-15, Lep-4, and Lep-15 clearly illustrates that in the case of multivariate traits, more is not always better. Comparing Figure 2 to Figure 3 shows two large peaks: one at marker D3Mit25 and one over marker D3Mit127. After FDR adjustment only the first peak survives, and the evidence for it is compromised. Thus, the trivariate analysis provides no additional linkage information and actually degrades the power to detect linkage. While leptin and IGF share numerous biological interactions, there is no evidence in these data for a common genetic determinant on chromosome 3.

An eight-strain simulated cross:

Our first two examples demonstrated the equivalence of the random effects model to the fixed effects model for standard cross designs and hint at the flexibility of our approach. To demonstrate this flexibility, we now present an eight-strain simulated example that (a) documents how correctly accounting for polygenic background can be beneficial and (b) demonstrates how it is possible to test hypotheses with the kind of unbalanced pedigree data encountered in human studies. As with the simulated AIL example, this exercise is not meant to be a substitute for an exhaustive study of power and experimental design.

Simulation specifics:

Our simulated cross involves a univariate trait, eight inbred strains, and seven pedigrees of nine generations each. We are motivated in part by the heterogeneous stock (MOTT et al. 2000) and the collaborative-cross designs (WILLIAMS et al. 2002). Starting with strain-pure founders, we constructed each pedigree by random mating with a decreasing number of progeny per animal per generation. The average number of animals per pedigree is 366. Random mating ensures substantial diversity in theoretical and conditional strain fractions and coefficients. On the basis of the marker map for chromosome 2 in the UM-HET example of the previous section, we simulated genotypes at six loci using the gene-dropping option of Mendel. Locus 3 serves as the QTL and the remaining loci as markers. Genotypes at the QTL are omitted during linkage analysis.

We generated univariate trait values independently for each pedigree by sampling from a multivariate normal distribution with prescribed means and covariances. If animal i has QTL genotype a/b and trait value Xi, then

Formula 17
where {eta} is the grand mean, µ is the vector of polygenic deviations from the mean, and {varepsilon} is the vector of QTL deviations from the mean. For animals i and j, the polygenic and random environment contributions entail the covariance

Formula 17
Note the absence here of a QTL variance contribution. Although the data are analyzed conditionally given observed marker genotypes, they are generated unconditionally. Table 3 displays the values of the parameters used for the simulations. These values were chosen randomly subject to constraints such as Formula 17.


View this table:
In this window
In a new window

 
TABLE 3

Eight-strain cross: simulation generating parameters

 
Our simulation choices present both opportunities and challenges. For example, the fact that each strain is assigned a unique QTL allele suggests that even a simple F2 cross between two strains would be adequate to map the QTL. This advantage is tempered by the long genetic distances separating the QTL from the flanking markers, by the smallness of the QTL effects, by the similarity of these effects in some strains, and by the discordance of the QTL effects and the polygenic means effects.

In using random effect models for QTL mapping, inclusion of polygenic background is usually a good idea. If polygenic background is present but ignored, then the only way of accounting for relative correlations is through the QTL component. When we analyze the current data omitting polygenic background, every single chromosome location in the linkage analysis achieves a P-value <0.00001. Adding polygenic background causes P-values to reach more reasonable levels, ranging from 0.0019 to 0.3835. Subjecting the P-values to the (FDR) procedure highlights the QTL and one neighboring point as significant (BENJAMINI et al. 2001). Figure 4 plots the function –log10(P-value) along the chromosome; as earlier, the P-values reflect the likelihood ratio tests of the QTL component. The QTL is located at 30 cM from the origin between marker D2Mit323 at 23 cM and marker D2Mit37 at 42 cM.


Figure 4
View larger version (8K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Eight-strain cross example—linkage and association results for the simulated eight-strain random-mating example. Association results mirror linkage results at the markers. The linkage results peak over the QTL, located between markers D2Mit323 and D2Mit58.

 
We also used these data to illustrate the application of the QTL association model. As in our linkage analysis, omitting polygenic background leads to unrealistically small P-values. Figure 4 plots the –log10(P-value) for the association analysis with the polygenic background. The association results are similar to the linkage results. The marker with the most significant result is D2Mit323, which is the marker nearest to the QTL. The FDR procedure singles out D2Mit323 as the only significant association.

Comparison of computation times between the two models illustrates the speed of the association analysis. The linkage model requires ~4 hr for calculation of the coefficient matrices for each pedigree and ~20 hr to estimate the parameters for each of the 17 points. The association model requires ~1.5 hr for all calculations at each of the five markers.


DISCUSSION
In the hope of mapping QTL with small effects, geneticists are undertaking more ambitious crosses with multiple strains, multivariate traits, and dense marker sets. The random effects models developed here will enable a smooth transition to more sophisticated statistical analysis. The greatest strength of the models is their ability to capture polygenic background parsimoniously. A second strength is their versatility in handling large pedigrees, large numbers of contributing strains, and multivariate traits. While we have warned against importing ideas wholesale from the rest of statistical genetics, judicious adaptations are fully warranted. For example, since environment can be exquisitely controlled for inbred strain experiments, models of gene-by-environment interaction can be put to good use on the mean level (BLANGERO 1993) and on the variance level (LANGE 1986; ITOH and YAMADA 1990). These techniques apply both to continuous traits (PLETCHER 1999; PLETCHER and GEYER 1999; JAFFRÉIC and PLETCHER 2000; PLETCHER and JAFFRÉIC 2002; PURCELL 2002; PURCELL and SHAM 2002; MEYER and KIRKPATRICK 2005) and to categorical traits (TOWNE et al. 1997; VIEL et al. 2005). It is also straightforward to model multiple QTL acting additively (LANGE 2002).

Balanced against these strengths is the need for better-conceived study designs. Unless crosses are carefully structured, some parameters will be unidentifiable. One antidote is to scale back the complexity of a model and reparameterize. Our first two examples illustrate this tactic. Another antidote is to avoid monolithic designs and opt for a mixture of designs that individually reveal different features of a model. Our third example does this.

In random effects models, trait values for most animals are correlated. Logically, one should treat all animals as members of a single large pedigree. At some point this requirement becomes unwieldy. The computational demands of the random effects models are fairly high, so tactics such as pedigree splitting, marker thinning, and marker amalgamation should not be dismissed. It will probably take a combination of these tactics to cope with the large-scale mapping projects now under way (PLETCHER et al. 2004). Fortunately, our experiences with simulated data suggest that a moderate amount of pedigree splitting sacrifices little information.

We have omitted a detailed discussion of how the program SimWalk delivers conditional strain fractions and coefficients. In our experience, SimWalk's MCMC algorithm adequately samples descent graph space. In association analysis, this lengthy process can be dispensed with if information at neighboring markers is ignored. Deterministic algorithms that produce approximate kinship and strain coefficients may ultimately be a better choice than stochastic sampling (GAO et al. 2004; GAO and HOESCHELE 2005). In maximizing loglikelihoods, it is also worth mentioning that Mendel allows the user to set initial parameter values and bounds. This flexibility is valuable in exploring multimodal likelihood surfaces.

Our QTL parameters enter the model at both the mean and the variance level and are not subject to nonnegativity constraints. Thus, the asymptotic distribution of a likelihood ratio test follows a chi-square distribution with degrees of freedom equal to the difference in the number of independent parameters between the underlying nested models. Model selection can be accomplished by likelihood ratio tests or modified criteria such as the Akaike information criteria (AIC) or the Bayesian information criteria (BIC). Multiple testing is certainly an issue. The FDR correction of Benjamini and Hochberg (BENJAMINI et al. 2001) for dependent tests is often a useful cure and provided us with correct inferences in our simulated examples. Extensions such as Storey's optimal discovery procedure (STOREY 2007; STOREY et al. 2007) can lead to more accurate P-values and should be kept in mind.

The assumption of multivariate normality is helpful in maximum likelihood estimation. For univariate traits with excess kurtosis, the multivariate t distribution is a workable substitute for the multivariate normal distribution and is an implemented option in Mendel. It is reasonable to conjecture that some version of the central limit theorem should hold for a polygenic trait over a pedigree (LANGE 1978; LANGE and BOEHNKE 1983). For simple pedigrees generated en masse in a cross, one can check the normality assumption empirically. The impact of departures from normality has been considered by several researchers (BEATY et al. 1985; ALLISON et al. 1999; PRATT et al. 2000). BLANGERO et al. (2000) and SHAM et al. (2000) suggest solutions to gross violations. One can object that QTL effects by their discrete nature cannot be normal. Three responses are possible. First, this objection has never stopped ordinary QTL mapping with outbred populations. Second, under the null hypothesis, the discrete effects disappear. Third, in all but the simplest crosses, application of a rigorous model incorporating both polygenes and major genes is very computationally demanding.

The web site (http://www.genetics.ucla.edu/software) offers the current versions of Mendel and SimWalk for several computing platforms. Ample documentation and sample problems are provided. The experimental versions of Mendel and SimWalk featured in this article will be released publicly as soon as it is practical.


APPENDIX A: REPRESENTATION OF POSITIVE DEFINITE MATRICES
Given a k x k positive definite matrix {Omega} and a k x 1 vector µ, we now prove that vectors µ1, ... , µn exist such that Formula 17 and Formula 17. To simplify the proof, we pass to the spectral decomposition {Omega} = O*DO of {Omega}. Here O is an orthogonal matrix, and D is a diagonal matrix whose jth diagonal entry dj is an eigenvalue of {Omega}. If n vectors {nu}1, ... , {nu}n exist such that Formula 17 and Formula 17, then taking µi = O*{nu}i for each i completes the proof.

With the transformed problem, we can work on each dimension j separately. Suppose we can find scalars a1, ... , am such that a1 + ... + am = {nu}j and Formula 17 + ... + Formula 17 = dj. Then we construct m vectors w1, ... , wm whose entries are 0 except for their jth entries wij = ai. These m vectors compose part of the solution set {nu}1, ... , {nu}n and do not impinge on the parts contributed by other dimensions. To show that appropriate scalars a1, ... , am exist, we consider optimizing the function f(a) = Formula 17 + ... + Formula 17 subject to the affine constraint a1 + ... + am = {nu}j. By introducing a Lagrange multiplier, we can prove that f(a) attains its minimum Formula 17 when all ai = {nu}j/m. The maximum of f(a) is infinite in all but the trivial case m = 1. For instance, we can take Formula 17, Formula 17, and all other ai = 0 and send p to {infty}. Since dj must be positive, some positive integer m exists with Formula 17 < dj. This choice of m puts us in a position to invoke the intermediate value theorem. The set of vectors a = (a1, ... , am) satisfying the constraint is convex and therefore connected. A continuous function on a connected set attains every value between its minimum and maximum values. Hence, there is some a with f(a) = dj.


APPENDIX B: PROPERTIES OF THE CIJ MATRICES
The role of the matrix Cij in formula (3) suggests its importance. Mathematically Cij is better behaved than the strain coefficient matrix {psi}ij. Recall that the founder initial conditions and the recurrences (4)–(7) completely determine the strain fraction vectors {gamma}i and the strain coefficient matrices {psi}ij. If we retain the conventions that i has parents k and l and j is an animal previously considered, then the last three recurrences translate into the similar recurrences

Formula 1(B1)

Formula 2(B2)
and

Formula 3(B3)
on the Cij matrices. The next proposition collects some relevant facts.

PROPOSITION 1. In addition to satisfying the recurrences (B1), (B2), and (B3), the matrix Cij

  1. has all entries 0 when either i or j is a founder,
  2. is symmetric,
  3. is positive semidefinite,
  4. has the vector 1 in its null space,
  5. has entries Cij(m, n) confined to the interval Formula 3 for n != m and to the interval Formula 3 for n = m.

Proof.

  1. If i is a founder belonging to strain q and j is a founder belonging to strain r, then by definition {gamma}i(m) = 1{m=q}, {gamma}j(n) = 1{n=r}, and {psi}ij(m, n) = 1{m=q}1{n=r}. Thus, all entries of Cij vanish. If i or j is a founder but the other is not, then induction and the recurrences (B1) and (B2) show that all entries of Cij vanish.
  2. Formula (B3) forces Cii to be symmetric, and the recurrences (B1) and (B2) preserve symmetry.
  3. Because the recurrences (B1) and (B2) preserve positive semidefiniteness, it suffices to prove that Cii is positive semidefinite. Inspection of formula (B3) further demonstrates that it suffices to prove that diag({gamma}k) – {gamma}k{gamma}k* is positive semidefinite for all k. Accordingly, let v be an arbitrary vector. The quadratic form

    Formula 3
    is nonnegative owing to Cauchy's inequality

    Formula 3
    and the fact that Formula 3.

  4. Again this is a consequence of the recurrences (B1) and (B2) and the validity of the assertion for Cii. In the latter case, the equality

    Formula 3
    is obvious.

  5. Because the stated bounds are preserved by recurrences (B1) and (B2), it suffices to consider Cii. The contribution Formula 3 to a diagonal term in Equation B3 is bounded below by 0 and above by Formula 3. The contribution Formula 3 to an off-diagonal term is bounded below by Formula 3 and above by 0. {blacksquare}

The collection Formula 3 of all Cij matrices over a pedigree has considerable structure. For example, the symmetry of Cij entails Cij = Cji. With just s = 2 strains, parts b, d, and e of Proposition 1 imply that every Cij is representable as

Formula 3
for some constant Formula 3. Furthermore, since aij = 0 whenever i or j is a founder or an F1 individual, straightforward recursive arguments show that within any strictly linear mating designs like Fn, aij = 0 for all i != j. Two-strain systems also produce uninteresting conditional coefficients; straightforward calculations show that Formula 3 for all i and j at markers that differentiate between the strains with complete genotyping.

To generalize this representation to more than two strains, it is helpful to introduce the s x s matrix Emn where all entries of Emn are 0 except for emm = enn = 1 and emn = enm = –1. There are Formula 3 such matrices.

PROPOSITION 2. Every matrix Cij from the collection Formula 3 can be represented as a linear combination

Formula 4(B4)
Furthermore, the coefficients aij,mn are nonnegative dyadic rationals satisfying

Formula 4

Proof. Each matrix Cii = 0 corresponding to a founder i clearly qualifies. The representation (B4) is preserved by the averaging process of the recurrences (B1) and (B2), so it suffices to prove the representation for a matrix Cii generated by a nonfounder. Again the averaging nature of recurrence (B3) allows us to verify the representation (B4) for a matrix of the form Formula 4. Because the set of dyadic rationals constitutes an algebraic field, it is clear by induction that all entries of {gamma}k are dyadic rationals. We now claim that

Formula 5(B5)
Equality (B5) is certainly true for the off-diagonal entries of the matrices on both sides. For the diagonal entries, it is a consequence of the identity

Formula 5
Because the coefficients Formula 5 are dyadic rationals, all that remains is to check that the sum of the coefficients is properly bounded. This follows from

Formula 6(B6)
The upper bound (B6) can be proved by introducing a Lagrange multiplier corresponding to the constraint Formula 6. Equality is achieved only when all Formula 6. {blacksquare}


APPENDIX C: COMPUTATION OF THE PROJECTION
Consider minimizing the function

Formula 6
subject to the constraints xmm + xnn = xmn + xnm for every unordered pair {m, n}. We proceed by seeking a stationary point of the Lagrangian

Formula 6
This point is characterized by the equations

Formula 1(C1)

Formula 1

Formula 2(C2)
with the convention that m != n. Rearrangement of Equation (C1) gives

Formula 3(C3)
If we interchange m and n in Equation C3, add the result to Equation C3, and invoke the constraint (C2), then we get the equation

Formula 3
determining µ{m,n} as

Formula 3
From Equation C3 it follows that

Formula 4(C4)

It is easy to check that the constraint (C2) is implicit in this solution. Furthermore, the solution entails the residual

Formula 4
If we set Formula 4, our objective function can now be expressed as

Formula 4
Neither the off-diagonal entries xmn nor the constraints now appear. To solve this unconstrained problem, we center the amn by subtracting their average value Formula 4. This allows us to reparameterize f(x) as

Formula 4
in more or less obvious notation.

Minimizing the objective function in this form coincides with a classical problem in population genetics. If we assume that m and n represent two possible alleles from s equally frequent alleles and bmn represents a trait value determined by the genotype m/n, then minimizing f(x) corresponds to the problem of determining the additive genetic variance of a centered trait. The solution to this problem is known to be

Formula 4
It follows that

Formula 4
A final substitution for amn gives

Formula 4
and the general formula

Formula 5(C5)
based on Equation (C4) and valid for both m != n and m = n.

The projection solution (C5) reduces the residual rmn = ymn xmn to

Formula 5
If we define a matrix R with entries rmn and a matrix U with entries

Formula 5
then it is clear that

Formula 5
In other words, the residual matrix Formula 5 is a symmetrized version of U. Fortunately, we can represent U as the matrix product

Formula 5
of Y = (ymn) sandwiched between two copies of the orthogonal projection Formula 5.


APPENDIX D: CONSTRUCTION OF AN ORTHOGONAL MATRIX
An orthogonal matrix O mapping the vector Formula 5 to the standard basis vector e1 can be explicitly constructed by the Gramm–Schmidt process applied to the basis Formula 5, where ek is the standard basis vector with 1 in position k and zeros elsewhere. The first row of O is just Formula 5; the subsequent rows take the form

Formula 5
where k – 2 zeros precede the entry s k + 1. The reader can easily check that the row vectors Formula 5 provide an orthonormal basis.


APPENDIX E: DIFFERENTIATION OF VARIANCES AND COVARIANCES
Because the fastest maximum likelihood algorithms rely on exact derivatives, there is an obvious need to calculate the partial derivatives of each covariance Cov(Xik, Xjl) with respect to the entries of {Delta} = ({delta}mn). If we let {partial}mn denote partial differentiation with respect to {delta}mn, then formula (16) immediately leads to

Formula 5
so it suffices to compute the partial derivatives of {Delta}{Delta}* = (duv). Since

Formula 5
the product rule of differentiation yields

Formula 5
Thus, {partial}mn({Delta}{Delta}*) consists entirely of zeros except for row m and column n. This fact considerably simplifies computation of derivatives.


APPENDIX F: A COUNTEREXAMPLE ON IDENTIFIABILITY
Finally, we consider a counterexample that illustrates some of the subtleties of identifiability. We noted that projection replaces each trait block Y = {Omega}kl with a symmetrized block residual

Formula 5
For purposes of computing covariances, we argued that symmetrization is unnecessary and avoiding it simultaneously yields correct covariances and reduces the number of parameters. We have not actually demonstrated that no further reduction is possible. Furthermore, exploiting the symmetrized version may lead to a residual {Omega}P({Omega}) that fails to be positive semidefinite. Consider the matrix

Formula 5
Straightforward algebra leads to the positive semidefinite matrix

Formula 5
If we symmetrize each 2 x 2 block of B, then we get

Formula 5
A tedious computation shows that

Formula 5
and C cannot be positive semidefinite.


ACKNOWLEDGEMENTS
The authors are grateful to David Burke for access to the UM-HET3 data, to Karl Broman for his editorial interest and guidance, and to the anonymous reviewers for their helpful comments. This investigation was supported by U.S. Public Health Service grants MH59490, GM53275, T32-HG02536, and HL28481.


LITERATURE CITED

ALLISON, D. B., M. C. NEALE, R. ZANNOLLI, N. J. SCHORK, C. I. AMOS et al., 1999 Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure. Am. J. Hum. Genet. 65: 531–544.[CrossRef][Medline]

AMOS, C. I., 1994 Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54: 535–543.[Medline]

AMOS, C. I., M. DE ANDRADE and D. K. ZHU, 2001 Comparison of multivariate tests for genetic linkage. Hum. Hered. 51: 133–144.[CrossRef][Medline]

BAUMAN, L. E., L. ALMASY, J. BLANGERO, R. DUGGIRALA, J. SINSHEIMER et al., 2005 Fishing for pleiotropic QTLs in a polygenic sea. Ann. Hum. Genet. 69: 590–611.[CrossRef][Medline]

BEATY, T. H., S. G. SELF, K.-Y. LIANG, M. A. CONNOLLY, G. A. CHASE et al., 1985 Use of robust covariance components models to analyse triglyceride data in families. Ann. Hum. Genet. 49: 315–328.[Medline]

BENJAMINI, Y., D. DRAI, G. ELMER, N. KAFKAFI and I. GOLANI, 2001 Controlling the false discovery rate in behavior genetics research. Behav. Brain Res. 125: 279–284.[CrossRef][Medline]

BLANGERO, J., 1993 Statistical approaches to human adaptability. Hum. Biol. 65: 941–966.[Medline]

BLANGERO, J., and L. ALMASY, 1997 Multipoint oligogenic linkage analysis of quantitative traits. Genet. Epidemiol. 14: 959–964.[CrossRef][Medline]

BLANGERO, J., J. T. WILLIAMS and L. ALMASY, 2000 Robust lod scores for variance component-based linkage analysis. Genet. Epidemiol. 19: S8–S14.[CrossRef][Medline]

BROMAN, K. W., H. WU, S. SEN and G. A. CHURCHILL, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889–890.[Abstract/Free Full Text]

CERVINO, A. C. L., A. DARVASI, M. FALLAHI, C. C. MADER and N. F. TSINOREMAS, 2007 An integrated in silico gene mapping strategy in inbred mice. Genetics 175: 321–333.[Abstract/Free Full Text]

CHURCHILL, G. A., and R. W. DOERGE, 1994 Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971.[Abstract]

DARVASI, A., and M. SOLLER, 1995 Advanced intercross lines, an experimental population for fine genetic mapping. Genetics 141: 1199–1207.[Abstract]

DAWSON, D. V., and R. C. ELSTON, 1984 A bivariate problem in human-genetics—ascertainment of families through a correlated trait. Am. J. Med. Genet. 18: 435–448.[CrossRef][Medline]

FLINT, J., W. VALDAR, S. SHIFMAN and R. MOTT, 2005 Strategies for mapping and cloning quantitative trait genes in rodents. Nat. Rev. Genet. 6: 271–286.[CrossRef][Medline]

GAO, G., and I. HOESCHELE, 2005 Approximating identity-by-descent matrices using multiple haplotype configurations on pedigrees. Genetics 171: 365–376.[Abstract/Free Full Text]

GAO, G., I. HOESCHELE, P. SORENSEN and F. DU, 2004 Conditional probability methods for haplotyping in pedigrees. Genetics 167: 2055–2065.[Abstract/Free Full Text]

GOLDGAR, D. E., 1990 Multipoint analysis of human quantitative genetic-variation. Am. J. Hum. Genet. 47: 957–967.[Medline]

HALEY, C. S., and S. A. KNOTT, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315–324.[Medline]

HARPER, J. M., A. T. GALECKI, D. T. BURKE, S. L. PINKOSKY and R. A. MILLER, 2003 Quantitative trait loci for insulin-like growth factor I, leptin, thyroxine, and corticosterone in genetically heterogeneous mice. Physiol. Genomics 15: 44–51.[Abstract/Free Full Text]

HITZEMANN, R. W., B. MALMANGER, S. COOPER, S. COULOMBE, C. REED et al., 2002 Multiple cross mapping (MCM) markedly improves the localization of a QTL for ethanol-induced activation. Genes Brain Behav. 1: 214–222.[CrossRef][Medline]

HOPPER, J. L., and J. D. MATHEWS, 1982 Extensions to multivariate normal models for pedigree analysis. Ann. Hum. Genet. 46: 373–383.[Medline]

ITOH, Y., and Y. YAMADA, 1990 Relationships between genotype x environment interaction and genetic correlation of the same trait measured in different environments. Theor. Appl. Genet. 80: 11–16.

JAFFRÉIC, F., and S. D. PLETCHER, 2000 Statistical models for estimating the genetic basis of repeated measures and other function-valued traits. Genetics 156: 913–922.[Abstract/Free Full Text]

JANNINKA, J.-L., and R. JANSENA, 2001 Mapping epistatic quantitative trait loci with one-dimensional genome searches. Genetics 157: 445–454.[Abstract/Free Full Text]

KAO, C.-H., Z.-B. ZENG and R. D. TEASDALE, 1999 Multiple interval mapping for quantitative trait loci. Genetics 152: 1203–1216.[Abstract/Free Full Text]

KRUGLYAK, L., M. J. DALY, M. P. REEVE-DALY and E. S. LANDER, 1996 Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58: 1347–1363.[Medline]

LANDER, E. S., and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.[Abstract/Free Full Text]

LANGE, K., 1978 Central limit theorems for pedigrees. J. Math. Biol. 6: 59–66.[CrossRef]

LANGE, K., 1986 Cohabitation, convergence and environmental covariances. Am. J. Med. Genet. 24: 483–491.[CrossRef][Medline]

LANGE, K., 2002 Mathematical and Statistical Methods for Genetic Analysis, Ed. 2. Springer-Verlag, New York.

LANGE, K., and M. BOEHNKE, 1983 Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am. J. Med. Genet. 14: 513–524.[CrossRef][Medline]

LANGE, K., J. S. SINSHEIMER and E. M. SOBEL, 2005 Association testing with Mendel. Genet. Epidemiol. 29: 36–50.[CrossRef][Medline]

LI, R., M. A. LYONS, H. WITTENBURG, B. PIAGEN and G. A. CHURCHILL, 2005 Combining data from multiple inbred line crosses improves the power and resolution of quantitative trait loci mapping. Genetics 169: 1699–1709.[Abstract/Free Full Text]

LIU, Y., and Z.-B. ZENG, 2000 A general mixture model approach for mapping quantitative trait loci from diverse cross designs involving multiple inbred lines. Genet. Res. 75: 345–355.[CrossRef][Medline]

MANLY, K. F., and J. M. OLSON, 1999 Overview of QTL mapping software and introduction to Map Manager QT. Mamm. Genome 10: 327–334.[CrossRef][Medline]

MEYER, K., and M. KIRKPATRICK, 2005 Up hill, down dale: quantitative genetics of curvaceous traits. Philos. Trans. R. Soc. 360: 1443–1455.[CrossRef]

MOTT, R., C. J. TALBOT, M. G. TURRI, A. C. COLLINS and J. FLINT, 2000 A method for fine mapping quantitative trait loci in outbred animal stocks. Proc. Natl. Acad. Sci. USA 97: 12649–12654.[Abstract/Free Full Text]

PLETCHER, M. T., P. MCCLURG, S. BATALOV, A. I. SU, S. W. BARNES et al., 2004 Use of a dense single nucleotide polymorphism map for in silico mapping in the mouse. PLoS Biol. 2: 2159–2169.

PLETCHER, S. D., 1999 Model fitting and hypothesis testing for age-specific mortality data. J. Evol. Biol. 12: 430–439.[CrossRef]

PLETCHER, S. D., and C. J. GEYER, 1999 The genetic analysis of age-dependent traits: modeling the character process. Genetics 153: 825–835.[Abstract/Free Full Text]

PLETCHER, S. D., and F. JAFFRÉIC, 2002 Generalized character process models: estimating the genetic basis of traits that cannot be observed and that change with age or environmental conditions. Biometrics 58: 157–162.[CrossRef][Medline]

PRATT, S. C., M. J. DALY and L. KRUGLYAK, 2000 Exact multipoint quantitative-trait linkage analysis in pedigrees by variance components. Am. J. Hum. Genet. 6: 1153–1157.

PURCELL, S., 2002 Variance components models for gene-environment interaction in twin analysis. Twin Res. 5: 554–571.[CrossRef][Medline]

PURCELL, S., and P. SHAM, 2002 Variance components models for gene-environment interaction in quantitative trait locus linkage analysis. Twin Res. 5: 572–576.[CrossRef][Medline]

REBAI, A., and B. GOFFINET, 1993 Power of tests for QTL detection using replicated progenies derived from a diallel cross. Theor. Appl. Genet. 86: 1014–1022.

SCHORK, N. J., 1993 Extended multipoint identity-by-descent analysis of human quantitative traits: efficiency, power, and modeling considerations. Am. J. Hum. Genet. 53: 1306–1319.[Medline]

SEATON, G., C. S. HALEY, S. A. KNOTT, M. KEARSEY and P. M. VISSCHER, 2002 QTL Express: mapping quantitative trait loci in simple and complex pedigrees. Bioinformatics 18: 339–340.[Abstract/Free Full Text]

SEN, S., and G. A. CHURCHILL, 2001 A statistical framework for quantitative trait mapping. Genetics 159: 371–387.[Abstract/Free Full Text]

SHAM, P. C., J. H. ZHAO, S. S. CHERNY and J. K. HEWITT, 2000 Variance-components QTL linkage analysis of selected and non-normal samples: conditioning on trait values. Genet. Epidemiol. 19: S22–S28.[CrossRef][Medline]

SILLANPÄÄ, M. J., and E. ARJAS, 1998 Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148: 1373–1388.[Abstract/Free Full Text]

SOBEL, E. M., and K. LANGE, 1996 Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am. J. Hum. Genet. 58: 1323–1337.[Medline]

STOREY, J. D., 2007 The optimal discovery procedure: a new approach to simultaneous significance testing. J. R. Stat. Soc. B 69: 347–368.[CrossRef]

STOREY, J. D., J. Y. DAI and J. T. LEEK, 2007 The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics 8: 414–432.[Abstract/Free Full Text]

TOWNE, B., R. M. SIERVOGEL and J. BLANGERO, 1997 Effects of genotype-by-sex interaction on quantitative trait linkage analysis. Genet. Epidemiol. 14: 1053–1058.[CrossRef][Medline]

VALDAR, W., J. FLINT and R. MOTT, 2006 Simulating the collaborative cross: power of quantitative trait loci detection and mapping resolution in large sets of recombinant inbred strains of mice. Genetics 172: 1783–1797.[Abstract/Free Full Text]

VIEL, K. R., D. M. WARREN, A. BUIL, T. D. DYER, T. E. HOWARD et al., 2005 A comparison of discrete versus continuous environment in a variance components-based linkage analysis of the COGA data. BMC Genet. 6(Suppl. 1): S57.[CrossRef][Medline]

WILLIAMS, R. W., K. W. BROMAN, J. M. CHEVERUD, G. A. CHURCHILL, R. W. HITZEMANN et al., 2002 A collaborative cross for high-precision complex trait analysis. First workshop report of the complex trait consortium. Technical report.

XIE, C., D. D. G. GESSLER and S. XU, 1989 Combining different line crosses for mapping quantitative trait loci using the identical by descent-based variance component method. Genetics 149: 1139–1146.

ZENG, Z.-B., 1993 Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90: 10972–10976.[Abstract/Free Full Text]

ZENG, Z.-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.[Abstract]

Communicating editor: K. W. BROMAN