| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 176, 1935-1938, July 2007, Copyright © 2007
doi:10.1534/genetics.107.071977
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
,

* Linnaeus Centre for Bioinformatics, Uppsala University, SE-75124 Uppsala, Sweden,
Division of Scientific Computing, Department of Information Technology, Uppsala University, SE-75124 Uppsala, Sweden and
Department of Mathematics and Physics, Mälardalen University, SE-72123 Västerås, Sweden
1 Corresponding author: Linnaeus Centre for Bioinformatics, Uppsala University, SE-75124 Uppsala, Sweden.
E-mail: lars.ronnegard{at}lcb.uu.se
| ABSTRACT |
|---|
|
|
|---|
Variance component estimation has been included in general statistical software, such as Proc Mixed in SAS (LITTELL et al. 1996), where an arbitrary covariance structure of the random effect can be given by the user. These programs have in common that they use iterative procedures, Fisher's scoring or Newton–Raphson, to maximize the likelihood or the restricted likelihood (PAWITAN 2001). More specific programs for applications in animal breeding have also been developed over the last two decades such as ASReml, DMU, and VCE (see DRUET and DUCROCQ 2006 and references therein). These programs use a mixture of Fisher's scoring and Newton–Raphson to maximize the restricted likelihood, called the "average information restricted maximum-likelihood (AI-REML) algorithm." The most computationally demanding part of AI-REML is the inversion of the variance–covariance matrix (V) of the response vector (y). This inversion has to be performed on each iteration. We study a model with a random QTL effect and a residual effect, with variance–covariance matrix of the form
, where
is the symmetric identity-by-descent (IBD) matrix,
is the QTL variance, I is the identity matrix, and
is the residual variance. If
is positive definite, then the inversion of V can be simplified by inverting V in parts. This has been implemented in the software package ASReml by setting up the mixed-model equations (MME) (GILMOUR et al. 1995; JOHNSON and THOMPSON 1995; JENSEN et al. 1997).
The AI-REML algorithm can be implemented by combining the MME with sparse matrix techniques (as done in ASReml), which gives fast solutions when the covariance structure of the random effect is sparse and positive definite. In traditional animal breeding applications, the covariance structure is given by the average relationship between individuals (LYNCH and WALSH 1998). This covariance structure is usually sparse and always positive definite. The IBD matrix is not necessarily sparse or positive definite, however, and the advantage of using sparse matrix techniques together with inversion of V by means of MME in AI-REML may be questioned. LEE and VAN DER WERF (2006) found that the AI-REML algorithm could be faster and more robust in QTL analysis if direct inversion of V is used, especially in linkage-disequilibrium linkage (LDL) mapping (MEUWISSEN and GODDARD 2000) since the IBD matrices used in LDL are usually dense and positive semidefinite. The reason for this is that a covariance structure is added to the base generation alleles (MEUWISSEN and GODDARD 2000; HERNANDEZ-SANCHEZ et al. 2006), which increases the number of nonzero elements in
.
The rank of
at a marker depends on the size of the base generation and how polymorphic the marker is (RÖNNEGÅRD and CARLBORG 2007). In a QTL linkage analysis, the rank of
is twice the size of the base generation when the marker is fully informative (i.e., all marker alleles are unique) and the rank does not depend on the total pedigree size, whereas the number of rows (and columns) in
equals the total number of individuals in the pedigree, n. Hence, at marker locations,
will have many eigenvalues equal to zero, and the number of zero-valued eigenvalues increases with the difference between the total number of individuals and the number of base individuals. In nonmarker locations, the number of eigenvalues in
that approaches zero, when the distance to the marker decreases, is equal to twice the difference between the total number of individuals and the number of base individuals. Thus, for a dense marker map most eigenvalues in all IBD matrices will be either equal to zero (in marker positions) or close to zero (in nonmarker positions).
In this article we develop a fast genome scan method for variance component QTL analysis using AI-REML. The method utilizes an efficient inversion of V that takes advantage of the fact that
has many eigenvalues close to zero.
An efficient inversion of V using the Sherman–Morrison–Woodbury formula:
A simple mixed linear model for QTL detection is given by
, where X and
are the design matrix and parameter vector, respectively, for the fixed effects, v is the vector of QTL genotype effects (length n), v
MVN (0, 
), and e
MVN(0, I
). Thus the variance–covariance matrix of y is V =
+
. Now let the spectral decomposition of
be
, where
is the matrix of eigenvectors and D is the diagonal eigenvalue matrix (superscript T denotes matrix transpose). Using this decomposition, we approximate
. Let
be a submatrix of D, where the eigenvalues larger than a threshold value
are included, and
be the matrix of eigenvectors corresponding to these eigenvalues. Then an approximate IBD matrix with reduced rank is given by
. In our applications
was set equal to
i for which the cumulative sum of
1 to
i divided by the total sum of eigenvalues was 0.8, where
1,
2, ... ,
n are the ordered eigenvalues from largest to smallest. Using the Sherman–Morrison–Woodbury formula (GOLUB and VAN LOAN 1996) we then get an efficient equation for inverting V:
.
Here Ired is the identity matrix of the same size as
. This approximation dramatically decreases the number of mathematical operations needed to invert V. Let k be the rank of
; then the number of floating point arithmetic operations (flops) in calculating the approximated inverse is n2k + 3nk2, whereas the number of flops to invert V directly is n3/2. The spectral decomposition of
requires on the order of n3 flops, but this computation is performed only once in each REML estimation, whereas the inversion of V is performed in each iteration. As an example suppose n = 500, k = 20, and that AI-REML converges in 10 iterations; then the total number of flops for inverting V directly is 6.25 x 108 with the direct method and 1.81 x 108 with the reduced method, giving a 3.5-fold speedup.
Two important questions are then: How are the maximum log-likelihood values and the variance component estimates affected by the approximation? How much is the computational time reduced in a genome scan in practice? To answer these questions we tested the method on chicken chromosome 1 (475 cM) in a Jungle Fowl–White Leghorn F2 cross, where the measured trait was body weight at 200 days of age. KERJE et al. (2003) reported two QTL for this trait on chromosome 1 at 68 and 420 cM. There were 4 F0, 41 F1, and 767 F2 individuals in the pedigree. In our analysis, population mean and sex were included as fixed effects. There were 14 genotyped markers located at 0, 27.7, 35.3, 91.3, 124.3, 154.2, 189.7, 209.3, 233.0, 258.8, 337.4, 407.9, 425.9, and 475.4 cM. The IBD matrices were estimated at every 1 cM using the Markov chain Monte Carlo-based program package Loki (HEATH 1997; HEATH et al. 1997). The estimated IBD matrices were dense with >80% nonzero elements (elements were defined as nonzero for values
10–3). The likelihood-ratio (LR) statistic in AI-REML was calculated as twice the difference between the maximized log(L) and log(L) with
= 0. The convergence criterion used was change in log-likelihood <10–4, with
and
, and the starting value of
was 0.01 times the residual variance. The computer code to calculate
and the AI-REML algorithm was implemented in R (R DEVELOPMENT CORE TEAM 2004).
Accuracy and efficiency of the method:
At individual chromosomal locations, our method was up to five times faster than the direct inversion of V and taken over the whole chromosome it reduced the computation time by 43% (Figure 1). It was faster at all tested locations, except at positions >30 cM from the closest marker. The rank of
was smallest close to marker positions, where also the greatest speedups were achieved.
|
and
was 0.9999. Both methods resulted in maximum LR values at 54 and 426 cM (Figure 2). The relative difference in LR between the two methods was 0.5 and 4.3% at these locations, respectively. In our experience, the shape of the likelihood-ratio curve is not substantially affected by the inclusion of polygenic effects in the variance component QTL model, which was confirmed in our analysis (Figure 2). Both the full model including polygenic effects and our method identified the same two peaks, with a relative difference in LR of 2.7% at 54 cM and 8.1% at 426 cM.
|
assuming fixation within lines. This is an extreme case of LDL mapping, where the base generation alleles are assumed fully correlated within lines. Assuming fixation of QTL alleles within lines is equivalent to fitting a line effect, and for a fully informative marker the rank of
will therefore be 2. The rank of
was consequently reduced further in this example, where the rank was 2 in all positions except at positions >30 cM from the closest marker. In this case our method was up to five times faster than direct inversion in specific positions and reduced the computations over the whole chromosome by 72%. Both models with either
or
gave similar LR values (correlation 0.9999) with QTL at 60 and 430 cM. The relative differences in LR were 1.2 and 0.7%, respectively, for the two QTL.
Conclusions:
Our efficient AI-REML method approximates the likelihood-ratio statistic very well. It is, therefore, a useful method to locate the regions in the genome with the strongest support for a QTL. The efficiency of the method increases when a covariance structure is added to the base generation alleles, which is the case in LDL mapping. This increase in efficacy was illustrated with an extreme covariance structure where the QTL alleles were assumed fixed within lines.
For accurate testing and estimation of the variance components at the QTL identified using this fast method, we suggest that a full model including polygenic effects should be used. Our results indicate that inclusion of polygenic effects is not important in the variance component model when the aim is to detect the location of potential QTL. Other studies have shown that the risk of getting false positives increases in QTL studies if polygenic effects are not included (KENNEDY et al. 1992), but this will not be a problem in our suggested application.
MEUWISSEN and GODDARD (2000) did not include polygenic effects in their presentation of their LDL-mapping model, but accounted for different kinds of population substructures by modeling a general residual variance–covariance matrix R, where V =
+
. This is also possible in our model, and V may then be inverted using the more general version of the Sherman–Morrison–Woodbury formula (GOLUB and VAN LOAN 1996)
. The inversion of R has to be performed only once for the whole genome scan, since it does not change between positions, and the loss in computation time merely depends on how sparse R–1 is.
We have not developed the method for several QTL (see LEE and VAN DER WERF 2006). In principle, the method should be extendable to several QTL, either by developing a more general version of the Sherman–Morrison–Woodbury formula or by developing an orthogonal model where the variance components can be tested one at a time (following THOMSEN 1975).
In our analysis we used Loki to estimate
. If, however, an approximate decomposition of
could be rapidly estimated directly from the marker data, then the spectral decomposition would be redundant and the speed of AI-REML would increase significantly. This decomposition can be made in LDL mapping if the number of possible haplotypes is limited, as shown by MEUWISSEN and GODDARD (2000). Furthermore, RÖNNEGÅRD and CARLBORG (2007) have recently developed a general method for estimating a decomposition of
directly from marker information. Their method is based on single-marker information, but is in principle extendable to a multiple-marker framework. In our analysis of chicken chromosome 1 the number of flops to invert V would have been decreased by 98% if a decomposition of
had been estimated directly. In Table 1, we have also compared the potential of our method when applied to an existing commercial chicken pedigree (ROWE et al. 2006) and an outbred pig cross (M. S. LUND, personal communication; Danish Institute of Agricultural Sciences).
|
| LITERATURE CITED |
|---|
|
|
|---|
DRUET, T., and V. DUCROCQ, 2006 Innovations in software packages in quantitative genetics. Paper no. 27-10. World Congress on Genetics Applied to Livestock Production, Belo Horizonte, Brazil.
GILMOUR, A. R., R. THOMPSON and B. R. CULLIS, 1995 Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51: 1440–1450.[CrossRef]
GOLUB, G. H., and C. VAN LOAN, 1996 Matrix Computations, Ed. 3. Johns Hopkins University Press, Baltimore.
HEATH, S. C., 1997 Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am. J. Hum. Genet. 61: 748–760.[Medline]
HEATH, S. C., G. L. SNOW, E. A. THOMPSON, C. TSENG and E. M. WIJSMAN, 1997 MCMC segregation and linkage analysis. Genet. Epidemiol. 14: 1011–1015.[CrossRef][Medline]
HERNANDEZ-SANCHEZ, J., C. S. HALEY and J. A. WOOLLIAMS, 2006 Prediction of IBD based on population history for fine gene mapping. Genet. Sel. Evol. 38: 231–252.[CrossRef][Medline]
JENSEN, J., E. A. MANTYSAARI, P. MADSEN and R. THOMPSON, 1997 Residual maximum likelihood estimation of (co)variance components in multivariate mixed linear models using average information. J. Indian Soc. Agric. Stat. 49: 215–236.
JOHNSON, D. L., and R. THOMPSON, 1995 Restricted maximum likelihood estimation of variance components for univariate animal models using sparse matrix techniques and average information. J. Dairy Sci. 78: 449–456.[Abstract]
KENNEDY, B. W., M. QUINTON and J. A. VAN ARENDONK, 1992 Estimation of effects of single genes on quantitative traits. J. Anim. Sci. 70: 2000–2012.[Abstract]
KERJE, S., Ö. CARLBORG, L. JACOBSSON, K. SCHÜTZ, C. HARTMANN et al., 2003 The twofold difference in adult size between the red junglefowl and White Leghorn chickens is largely explained by a limited number of QTLs. Anim. Genet. 34: 264–274.[CrossRef][Medline]
LEE, S. H., and J. H. J. VAN DER WERF, 2006 An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet. Sel. Evol. 38: 25–43.[CrossRef][Medline]
LITTELL, R. C., G. A. MILLIKEN, W. W. STROUP and R. D. WOLFINGER, 1996 SAS System for Mixed Models. SAS Institute, Cary, NC.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
MEUWISSEN, T. H. E., and M. E. GODDARD, 2000 Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci. Genetics 155: 421–430.
PAWITAN, Y., 2001 In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press, Oxford.
R DEVELOPMENT CORE TEAM, 2004 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (http://www.R-project.org).
RÖNNEGÅRD, L., and Ö. CARLBORG, 2007 Separation of base allele and sampling term effects gives new insights in variance component QTL analysis. BMC Genet. 8: 1.[CrossRef][Medline]
ROWE, S., R. PONG-WONG, C. S. HALEY, S. KNOTT and D. J. DE KONING, 2006 Variance component estimation of additive and dominance QTL effects in outbred populations applied to commercial broilers. Paper no. 20-09. World Congress on Genetics Applied to Livestock Production, Belo Horizonte, Brazil.
THOMSEN, I., 1975 Testing hypotheses in unbalanced variance components models for two-way layouts. Ann. Stat. 3: 257–265.
Communicating editor: E. ARJAS
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |