| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |

,
,
,

,1
* Department of Computer Science and Engineering and
Bioinformatics Program, University of California, San Diego, California 92093,
Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02141,
Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts 02114, ** Microsoft Research, Redmond, Washington 98052 and 
Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, California 90095
1 Corresponding author: Department of Computer Science and Department of Human Genetics, Mail Code 1596, 3532-J Boelter Hall, University of California, Los Angeles, CA 90095-1596.
E-mail: eeskin{at}cs.ucla.edu
| ABSTRACT |
|---|
|
|
|---|
However, genetic association studies in inbred model organisms are confronted by the problem of inflated false positive rates due to population structure and genetic relatedness among inbred strains caused by the complex genealogical history of most model organism strains. Conventional statistical tests of independence between a genetic marker and a phenotype are prone to spurious associations because the marker and the phenotype are likely to be correlated due to population structure that violates the independence assumption under the null hypothesis. Recent association- or linkage-mapping studies in model organisms attempt to avoid inflated false positive rates by designing the studies using recombinant inbred lines generated from a small number of parental strains (BYSTRYKH et al. 2005; ZOU et al. 2005). However, these studies are limited by the variation present in the parental strains and have long regions between recombinations due to relatively few generations between the recombinant inbred strains and the parental strains. Traditional QTL mapping using F2 or backcross suffers from the same problem in fine-resolution mapping in addition to expensive genotyping cost (BELKNAP 1998; FLINT et al. 2005).
An alternative approach to reduce the inflation of false positives is to apply a statistical test that corrects for the bias due to population structure or genetic relatedness. The most widely used methods to reduce such bias in human association mapping are genomic control (DEVLIN and ROEDER 1999), structured association (PRITCHARD et al. 2000), and principal component analysis (PATTERSON et al. 2006; PRICE et al. 2006). However, these methods are inadequate in the case of model organism association mapping. Genomic control suffers from weak power when the effect of population structure is large as in model organisms (PRICE et al. 2006; YU et al. 2006). Structured association or principal component analysis, which assumes a small number of ancestral populations and admixture, only partially captures the multiple levels of population structure and genetic relatedness in model organisms (ARANZANA et al. 2005; YU et al. 2006; ZHAO et al. 2007). Recently, it has been suggested that linear mixed models can effectively correct for population structure in the association mapping of quantitative traits (YU et al. 2006). Linear mixed models incorporate pairwise genetic relatedness between every pair of individuals in the statistical model directly, reflecting that the phenotypes of two genetically similar individuals are more likely to be correlated than genetically dissimilar individuals. Applications of mixed models to association mapping in maize, Arabidopsis, and potato panels demonstrate that mixed models obtain fewer false positives and higher power than previous methods including genomic control, structured association, and principal component analysis (YU et al. 2006; MALOSETTI et al. 2007; ZHAO et al. 2007).
Although mixed models can effectively capture statistical confounding due to population structure, the currently available implementations have several limitations in the context of model organism association mapping. First, the variance components numerically estimated by various hill-climbing approaches such as the Nelder–Mead simplex algorithm (NELDER and MEAD 1965; GRASER et al. 1987; MEYER 1989), the EM algorithm (SMITH 1990), and the Newton–Raphson algorithm (LINDSTROM and BATES 1988; GILMOUR et al. 1995; JOHNSON and THOMPSON 1995) provide only a locally optimal solution, which may cause the statistical inferences based on these estimates to be inaccurate. Second, the computational cost of the numerical optimization procedure is substantial, requiring a large number of computationally expensive matrix operations at each iteration. Computational considerations are important when large data sets are to be tested. For example, the association mapping with maize panels consisting of hundreds of SNPs over hundreds of strains takes hours for a single run with currently available implementations such as TASSEL (YU et al. 2006) or SAS (SAS INSTITUTE 2004). A microarray data set tested for genomewide association mapping between thousands of transcripts and tens of thousands of markers would take several years of CPU time. Third, when inferring the genetic variance component referred to as the kinship matrix, the importance of a mathematically correct form of kinship matrix estimation is often overlooked. For example, YU et al. (2006) proposed to infer a kinship matrix using SPAGeDi software, setting negative kinship coefficients to zero. Such a kinship matrix may not be positive semidefinite and thus not be a valid form of variance component. Using a nonpositive semidefinite kinship matrix generates ill-defined likelihood for a subset of parameter space in the estimation of the variance component.
In this article, we propose a new method, efficient mixed-model association (EMMA), which corrects for population structure and genetic relatedness in model organism association mapping. Our method takes advantage of the specific nature of the optimization problem in applying mixed models for association mapping, which allows us to substantially increase computational speed by orders of magnitude and improve the reliability of results by achieving near global optimization. Our method improves the efficiency of the mixed-model method by enabling us to perform statistical tests with single-dimensional optimization. Our method's efficiency is further increased by avoiding redundant computationally expensive matrix operation at each iteration in the computation of likelihood function by leveraging spectral decomposition, reducing the computational cost of each iteration from cubic to linear complexity. Due to a substantially decreased computational cost of each iteration, it is possible to converge the global optimum of the likelihood in variance-component estimation with high confidence by combining grid search and the Newton–Raphson algorithm even though the likelihood function may not be convex. Our method is related to a similar technique developed in a different context of simulating the null distribution of variance-component test statistics (CRAINICEANU and RUPPERT 2004).
We show that a simple genetic similarity matrix can serve as a kinship matrix accounting for genetic relatedness as effectively as previously suggested methods while guaranteeing positive semidefiniteness. Our results are consistent with other studies (ZHAO et al. 2007), which suggests that these simpler kinship matrices reduce the false positive rate as effectively as or more effectively than the kinship matrices generated by previous methods (YU et al. 2006). We propose an additional method called phylogenetic control based on the assumption that a phylogenetic tree is a good approximation of the genealogical history of an inbred model organism. In such cases, the phylogenetic tree may be used as a confounding factor, correcting for the complex genetic relations between strains. We show that phylogenetic control can be formulated as a linear mixed model and present an algorithm for inferring the phylogenetic kinship matrix. We show that the phylogenetic kinship matrix is always positive semidefinite and its optimal variance components are unique regardless of the choice of root.
One of the important questions in the design of model organism association-mapping studies is estimating the statistical power for any specific set of inbred strains. We performed a simulation study of the power of our EMMA method to identify causal SNPs both on a genomewide scale and within a smaller region such as a QTL interval. Our results show that with a limited number of genetically diverse strains, such as the currently available panel of inbred mice, it is possible to identify causal loci with a genomewide significance only if the locus explains a large portion of phenotypic variance. However, with more strains, the power of these association studies increases dramatically. Our analysis of statistical power in model organism association mapping demonstrates the dramatic increase in power using multiple measurements of phenotypes from multiple animals for each strain. Study designs that do not replicate phenotype measurements and analysis methods that do not take individual measurements into account suffer a significant decrease in statistical power.
We applied our EMMA method to association mappings of various inbred model organisms. First, we verified that EMMA gives almost identical results to other widely used implementations using the maize panel data sets (YU et al. 2006). In terms of computational time, EMMA is shown to be orders of magnitude faster than the previous methods while performing near global optimization. Second, we performed a genomewide association mapping of Arabidopsis flowering-time phenotypes. Our results are consistent with the recently published results (ZHAO et al. 2007), reducing most of the inflated false positives. Finally, we used our EMMA method to perform a whole-genome association-mapping study of inbred mouse strains. We analyzed nearly 140,000 mouse HapMap SNPs over 48 strains and three quantitative phenotypes, liver weight, body weight, and saccharin preference, with QTL identified by previous studies. We identified significant associations for the three phenotypes while our results show a significant reduction in the inflation of false positives. Interestingly, many of the significantly associated SNPs fall into the known QTL, suggesting the results are likely to be true associations.
An R package implementation of EMMA and the webserver containing the mouse association results are publicly available online at http://mouse.cs.ucla.edu/emma.
| MATERIALS AND METHODS |
|---|
|
|
|---|
For inbred mouse association mapping, the Broad mouse HapMap SNP sets were obtained from the mouse HapMap web site. The 106,040 SNPs that have no more than 10% of genotype calls missing were tested after imputing the missing alleles. The initial body weight (MPD10305) and liver weight phenotypes (MPD2907) were downloaded from Jackson Laboratory MPD (JACKSON LABORATORY 2004). They consist of 374 and 308 phenotype measurements over 38 and 34 strains, respectively. The saccharin preference phenotypes consist of 280 phenotype measurements in 24 strains (REED et al. 2004).
EMMA:
Suppose that n measurements of a phenotype are collected across t inbred strains. A linear mixed model in model organism association mapping is typically expressed as
![]() | (1) |
is a q x 1 vector representing coefficients of the fixed effects. Z is an n x t incidence matrix mapping each observed phenotype to one of t inbred strains. u is the random effect of the mixed model with Var(u) =
where K is the t x t kinship matrix inferred from genotypes as described in the following section, and e is an n x n matrix of residual effect such that Var(e) =
The overall phenotypic variance–covariance matrix can be represented as
Instead of solving mixed-model equations by obtaining the best linear unbiased prediction (BLUP) of random effects u via Henderson's iterative procedure (HENDERSON 1984; ARBELBIDE et al. 2006), we directly estimate the variance components
g and
e, maximizing the full likelihood or restricted likelihood that is defined as full likelihood with the fixed effects integrated out (DEMPSTER et al. 1981). The restricted likelihood avoids a downward bias of maximum-likelihood estimates of variance components by taking into account the loss in degrees of freedom associated with fixed effects. Under the null hypothesis, the full log-likelihood and restricted log-likelihood function can be formulated as
![]() | (2) |
![]() | (3) |
=
g and H =
–1V = ZKZ' +
I is a function of
, defined as
=
The full-likelihood function is maximized when β is
and the optimal variance component is
for full likelihood and
for restricted likelihood, where
is a function of
as well.
Using spectral decomposition, it is possible to find
i and
s such that
![]() | (4) |
![]() | (5) |
. Let
y = [
1,
2,
,
n–q]'; then finding maximum-likelihood (ML) or restricted maximum-likelihood (REML) estimates is equivalent to optimizing the following functions with respect to
:
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
> 0 if and only if all the eigenvalues
s are nonnegative. Otherwise, such as in the case of the nonpositive semidefinite kinship matrix, the likelihood would be ill defined for a certain range of
.
The suggested procedure in computing likelihood and its derivatives involves only a linear time vector operation at each iteration once the spectral decomposition is computed. The time complexity of the method is O(n3 + rn), where r is the number of iterations required. The time complexity of standard EM or Newton–Raphson algorithms is O(rn3), and the actual ratio of the running time is much bigger than r because the existing methods typically require a large number of matrix multiplications and inverses at each iteration while EMMA computes spectral decomposition only once. Since the computational cost of each iteration has decreased dramatically, instead of obtaining a locally optimal solution during the numerical optimization, it is now computationally feasible to perform a grid search combining with the Newton–Raphson algorithm in the single-dimensional parameter space consisting of
, which is the ratio of the environmental random effect to the genetic background effect, to optimize the likelihood globally with high confidence.
Furthermore, when a large number of multiple measurements are phenotyped per strain, i.e.,
, the execution time can be further reduced using the fact that the nonnegative eigenvalues of ZKZ' and SZKZ'S are the same as those of KZ'Z and KZ'SZ, respectively. Combining this fact with a simple modification of the Gram–Schmidt process greatly reduces the execution time of eigenvalue decomposition, reducing the time complexity into O(n2t + rn). When multiple phenotypes are tested such as in expression quantitative trait loci (eQTL) mapping, the spectral decomposition can be reused, and only a square-time matrix–vector multiplication is required for each phenotype. Thus, the time complexity with m different phenotypes is O(n2t + n2m + rmn), which is much more efficient than O(rn3m) achieved by previous approaches.
In the application of our EMMA method to the various data sets presented in this article, the
's ranged from 10–5 (almost pure population structure effect) to 105 (almost pure environmental or residual effect) and are divided evenly into 100 regions in logarithm scale to compute the derivatives of likelihood functions. The global ML or REML is searched for by applying the Newton–Raphson algorithm to all the intervals where the signs of derivatives change and taking the optimal
among all of the stationary points and endpoints. Since the derivatives of both the full- and the restricted-likelihood function are continuous with nonnegative eigenvalues, such an optimization technique has guaranteed convergence properties as long as the kinship matrix is positive semidefinite. In the following two sections, we describe different methods to infer a kinship matrix K, based on either a genetic similarity matrix or a phylogenetic tree.
Similarity-based kinship matrix:
A number of methods for inferring a kinship matrix from a large number of molecular markers have been suggested, including a simple identical-by-state (IBS) allele-sharing matrix, an allele-frequency weighted IBS matrix (LYNCH and RITLAND 1999), a maximum-likelihood kinship matrix (THOMAS and HILL 2000), and a Monte Carlo simulation-based matrix (WANG 2002). Comparisons of different kinship matrices for explaining genetic differentiation among populations show similar results with small quantitative differences (NIEVERGELT et al. 2007). Recent studies on the association mapping of Arabidopsis thaliana in a structured population show that a simple IBS allele-sharing matrix effectively corrects for confounding from population structure, even better than more sophisticated methods (ZHAO et al. 2007). Although recently suggested estimators of pairwise relatedness have some desirable statistical properties over a simple IBS allele-sharing matrix (CASTEELE et al. 2001), they are not guaranteed to be positive semidefinite.
Here we show that a simple IBS allele-sharing matrix based on the assumption of each SNP or haplotype inducing the same level of small random changes on the phenotype guarantees positive semidefiniteness and convergence if missing alleles are handled appropriately.
Let li,j,h
{0, 1} be a binary variable that has a value of 1 only when the genotype (or haplotype) allele at jth locus in the ith strain is
, where
is the total number of alleles at the jth locus. Let xh,j be random variables independently sampled from N(0,
2); then the genetic background effect ui of strain i can be modeled as an accumulation of small random effects as follows, assuming that xh,j denote the random genetic effect caused by allele h at the jth locus,
![]() | (10) |
and let Lh be the matrix whose element at (i, j) is li,j,h; then the overall genetic background effect u is expressed in the form
![]() | (11) |
2 independently, the variance–covariance matrix of u becomes
Since its (i0, i1)th element
represents the number of shared IBS alleles between the i0th and i1th strains directly if wj = 1, Var(u) is equivalent to a weighted IBS allele sharing a kinship matrix with the scaling factor
2. It is obvious from Equation 11 that the kinship matrix is positive semidefinite. When missing genotypes exist, we estimate li,j,h to be the square root of the probability of the SNP or haplotype allele at the jth locus having the allele h. This is so the random effect for each allele is assigned probabilistically. We generated genotype similarity of maize, Arabidopsis, and mouse data sets using uniform weight. When a haplotype similarity matrix is used, the haplotype window size resulting in the largest ML estimates is selected as the optimal window size. In the Arabidopsis and mouse association-mapping results of this article, the optimal haplotype window size is set to five in both cases.
Phylogenetic control:
Evolutionary biologists have modeled interspecific phenotype distribution using various phylogenetic comparative methods (PCMs) (MARTINS and HANSEN 1997). The correlation structure between phenotypes can be effectively captured with phylogenetic trees, and PCMs have been applied to evolutionary analysis of quantitative traits such as gene expression (GU 2004; OAKLEY et al. 2005) or, very recently, to the association mapping of dichotomous phenotypes (BHATTACHARYA et al. 2007; CARLSON et al. 2007). Felsentein's independent contrast (FIC) method (FELSENSTEIN 1985) models the correlation between phenotypes under the assumption of Brownian motion of phenotypic change along the phylogeny. Since random phenotypic changes occur within a species as well, in cases where the phylogenetic tree is a good approximation of genealogical history, it is reasonable to apply PCMs such as the FIC method in modeling the phenotypic variation in model organisms.
We followed Felsenstein's assumption of Brownian phenotypic changes along the phylogeny. Although multiple fluctuating selection may lead to a Brownian motion model (FELSENSTEIN 1981), here we assume a neutral model where phenotypic changes are explained by accumulated random pleiotropic effects by the genetic background to mathematically model Brownian phenotypic changes. Let T be a phylogenetic tree with t leafs and m edges, and let z
Rm be random variables independently sampled from
At each branch i whose length is bi, we represent the amount of random phenotypic changes along the branch as
Let
i denote the set of branches connecting to a leaf node i from the root. Then the accumulated phenotypic changes are equivalent to
If
is the ancestral mean at an arbitrarily chosen root node, then the phenotype values at the leaf nodes are expressed in the form
![]() | (12) |
if branch j exists in the path from the root to the leaf node i and zero otherwise. The kinship matrix of random effect u = Ez is K = EE' and is proportional to its covariance. If the root of the phylogenetic tree changes, E is changed into E + 1tcT, with 1t a vector of ones and another vector c. However, the restricted likelihood does not change because SZ1t = 0 always holds. In our results, we adjusted the genetic distance matrix using the F84 model (KISHINO and HASEGAWA 1989; FELSENSTEIN and CHURCHILL 1996) from the genomewide genotypes and inferred the phylogenetic tree with the Fitch–Margoliash and least-squares distance method (FITCH and MARGOLIASH 1967).
Statistical tests and multiple hypothesis testing:
Once the ML or REML variance component
is estimated, a general F-statistic testing the null hypothesis Mβ = 0 for an arbitrary full-rank p x q matrix M can be constructed as suggested in KENNEDY et al. (1992) and YU et al. (2006),
![]() | (13) |
The likelihood-ratio test can also be performed on the basis of the estimated ML variance components under different fixed effects. The statistic asymptotically follows a
distribution unless the estimated variation component meets the boundary of parameter space.
When a large number of correlated SNPs are tested, Bonferroni correction may lead to too conservative type I error control. Alternatively, permutation tests or other multiple hypothesis-testing procedures can be used (PIEPHO 2001; STOREY and TIBSHIRANI 2003). If permutations of simulation-based approaches are applied, the computational cost is much larger but can be reduced by reusing the spectral decomposition in the same way described in the context of multiple phenotypes. For each permuted y, only
y = [
1,
2,
,
n–q] has to be computed again to compute the full or the restricted likelihood in linear time at each iteration. Thus, the computational cost for a cubic-time spectral decomposition at each permutation can be substituted by a square-time matrix–vector multiplication, reducing the overall time complexity from O(n2t + rn) to O(n2 + rn).
The variance-component estimation is performed on the basis of REML for the F-test, and ML estimations are used for the likelihood-ratio test and the computation of the Bayesian information criterion (BIC). The P-values are computed from the asymptotic null distribution.
Simulation studies:
We performed two simulation studies for analyzing the statistical power of EMMA. The first simulation is similar to those from other mixed-model studies (YU et al. 2006; ZHAO et al. 2007). A fixed effect based on a randomly chosen causal SNP from the genome with minor allele frequency >10% is added to the existing phenotypes, and the statistical power is computed at the causal SNP. At each fixed effect, the simulation study was performed 1000 times to estimate the average power. The variance explained by a SNP is computed assuming that average minor allele frequency of the causal SNP is 0.3.
Next, we generated simulated phenotypes sampled from a multivariate normal distribution. A random noise vector is added according to the contribution of genetic background to phenotypes,
If
is the fraction of variance due to genetic background excluding the SNP effect, then the covariance of the simulated data is simulated as Var(y) = ((
where S0 = I – 11'/n, where 1 denotes a vector of ones. Similar to the first simulation study, a fixed effect based on a randomly chosen causal SNP is added to the simulated phenotypes and the average power is computed from 1000 independent simulations.
| RESULTS |
|---|
|
|
|---|
|
|
We also applied our EMMA method to perform genomewide association mapping of the flowering-time phenotype in which statistically significant associations are reported in previous studies. The cumulative distribution of P-values across 13,416 nonsingleton SNPs across 95 strains obtained from EMMA is shown in Figure 2a. The cumulative distribution of P-values with a haplotype similarity matrix nearly follows the expected distribution, implying that mixed models significantly outperform structured association in eliminating the inflation of false positives for this data set. Phylogenetic control reduces a large portion of inflated false positives, but residual inflation is still observed. Structured association and simple linear regression showed much larger inflation of false positives, consistent with the previous studies. After correction for genetic relatedness, the previously known FRI gene is still found to be significant at a nominal P-value of P = 10–5 across different kinship matrices. Our independent analyses are consistent with the more extensive results of Arabidopsis association mapping recently published (ZHAO et al. 2007).
|
The cumulative distributions of observed P-values in Figure 2 show that, without correcting for population structure, the rate of false positives is very high. In particular, the body weight phenotype has a substantial inflation of false positives. When our mixed model is used, the inflation of the statistics is significantly reduced in all three phenotypes.
Figure 3 shows genomewide association signals for the three phenotypes. Comparing Figure 3a and 3b, it is obvious that, without correcting for population structure, many false positives are observed at a genomewide level of significance due to inflated P-values. Without correcting for population structure, we were able to identify nearly 6000 SNPs at a nominal P-value of 10–6 and 283 SNPs with P-values <10–10. However, none are significant after applying the mixed model. This strongly supports that most of the significant associations without correcting for population structure are indeed false positives. Interestingly, although the strongest signals for the body weight after applying the mixed model are not genomewide significant, they are concentrated in the region around 114 Mb in chromosome 8. This region almost exactly falls into the LOD peak of a previously known body weight QTL Bwq3 (ANNUCIADO et al. 2001). The P-value of the most significant locus is 3.8 x 10–6 with the F-test, explaining 49% of the overall phenotypic variance and 39% of the phenotypic variation due to the genetic variance component. Although it is slightly below the genomewide significance threshold with a conservative Bonferroni correction, if utilizing the results from previous QTL studies, the locus can be declared as significant over the region of known body weight QTL.
|
For the saccharin preference phenotype, we were able to identify a SNP 30 kb away from the Tas1r3 gene that is perfectly correlated with the SNP previously reported to have significant association with the phenotype (REED et al. 2004). It explains 51% of the genetic variance component, with a P-value of 1.0 x 10–5. The SNP is neither genomewide significant nor the most significant. We believe this is due to the limited power of the study with a small number of strains.
Power of inbred model organism association mapping:
We evaluated the statistical power of association mapping of inbred model organisms in two different ways. First, we simulated an additive effect of a causal SNP over the existing phenotypes for mouse, Arabidopsis, and maize strains, similar to previous studies. Such simulation studies evaluate the SNP effect on the power maintaining the existing correlation structure of phenotypes. However, they do not change the effect of the genetic background or the number of multiple measurements, and no random variable other than the SNP is involved in the power simulation. As an alternative model-driven method for simulation studies, we generated simulated phenotypes randomly sampled from a multivariate normal distribution with various effects of population structure on the phenotypic variation. A SNP effect is simulated on the randomly generated samples, and the statistical power is evaluated. In this way, we can simulate not only the SNP effect but also different genetic backgrounds and different numbers of replicated measurements. We believe that our simulation analysis provide a more extensive understanding of the statistical power of association mapping with model organisms.
Figure 4 shows the statistical power with respect to the additive SNP effect on the Arabidopsis and maize flowering-time phenotypes and three inbred mouse phenotypes used in this article. The maize panel data set consisting of 277 strains has high statistical power, achieving 80% power with a SNP effect explaining 5% of phenotypic variation. Genomewide significance can also be achieved with high power with 10% of SNP effects. For the Arabidopsis data set consisting of 95 strains, the statistical power is decreased, and roughly twice the SNP effect would be needed compared to the maize panels to achieve the same statistical power. For the inbred mouse phenotypes, genomewide power is achievable only when the SNP explains a very large portion of phenotypic variance. In our results, the plausible significant associations explained >35% of the phenotypic variance. The power to achieve genomewide power is largely dependent on the number of available strains. Table 2 summarizes the most plausible associations in these three phenotypes.
|
|
|
the residual variance is zero and the replicated measurement does not increase the power since there is no variability of phenotype within strains. Figure 5d shows more clearly the effect of genetic background and multiple measurements at a glance. When a SNP explains a fairly large fraction (17%) of phenotypic variance, the genomewide significance level can be achieved with high power only when the phenotype has very small population structure effect and the number of replicates is large. As the effect from genetic background becomes larger, the advantage of using multiple measurements decreases significantly.
| DISCUSSION |
|---|
|
|
|---|
The computational efficiency of EMMA is orders of magnitude greater than that of other widely used implementations. When multiple measurements per strain are used across different individuals, the relative efficiency is further increased. This is of a great importance when the computational cost may be a bottleneck in the statistical analysis of high-throughput data such as genomewide gene expressions. For example, the single run of genomewide association mapping of mouse body weight phenotypes with multiple measurements would take up to a month of CPU time with other implementations, while EMMA takes only a single CPU hour. When hundreds and thousands of phenotypes are available such as in the analysis of whole-genome expression data, the computational cost of previous implementations is prohibitive even with high-performance computing. It should be noted that there are other techniques developed for improving computational efficiency of the numerical estimation in a more general context of linear mixed models such as average information REML (GILMOUR et al. 1995), but these techniques would not provide us with the same improvements on the efficiency of each iterative procedure.
Our results of inbred mouse association mapping show the potential and limitations of genomewide inbred mouse association studies. It is remarkable that we were able to identify significant associations at a genomewide level without inflation of false positives, under the limited statistical power of the method. Although there is a possibility that residual confounding still remains with mixed-model association, we believe that the most significant SNP associated with liver weight is likely to be a true positive because it explains a large portion of phenotypic variations between the strains beyond genetic background effect so that the conservative Bonferroni-adjusted P-value still remains significant. The SNP associated with body weight looks also plausible, but it could possibly be due to residual confounding that is not completely captured by a kinship matrix. Likewise, other significant associations can possibly be due to residual confounding not captured by the kinship matrix, so the identified associations must be verified through independent analysis.
In a more general context of association mapping that requires the use of multiple variance components, the computational advantages of EMMA are not applicable since EMMA can effectively solve a model only with one correlated variance component. For example, when allowing heterozygous alleles for outbred individuals, the full model typically takes both additive and dominant variance components in the linear mixed model (LYNCH and RITLAND 1999; ARBELBIDE et al. 2006). Likewise, if strain-specific environmental random effects or other additional random effects are to be considered such as in plant association mapping, multiple variance components need to be used. In such cases where EMMA is not directly applicable, computational bottlenecks may be the biggest obstacles in analyzing large amounts of data such as genomewide expression profiles. EMMA can still be applied in this case if a reasonable approximation is combined with other standard mixed-model methods taking multiple variance components. Under the null hypothesis, it is possible to estimate the ratio between multiple variance components using the full model, and EMMA can be applied under an alternative hypothesis assuming that the ratio between variance components is preserved. Since variance-component estimation under the null hypothesis needs to be done once across a larger number of alternative hypotheses for each marker, such an approximation procedure provides a large amount of computational efficiency essentially equivalent to EMMA with one variance component. Although the approximated test may lose statistical power slightly, the false positive rates would not be inflated.
There have been several genomewide association-mapping studies with inbred mouse strains. To the best of our knowledge, our results are the first whole-genome association mapping of inbred mice that takes the genetic relatedness into account via a statistical method supported by asymptotic theory. Previous studies either do not take the population structure into account (CERVINO et al. 2007) or apply heuristics to reduce the confounding effect from population structure. For example, the weighted version of the F-statistic (PLETCHER et al. 2004) does not follow the asymptotic null distribution. Redefining the significance level on the basis of the empirical null distribution given the heritability parameter (LIU et al. 2006) or the weighted permutation (MCCLURG et al. 2007) rescales the P-values only similar to genomic control and will suffer from a lack of power as the genetic background effect becomes larger.
Our power simulation studies provide assistance to the design of the association study under the effect of population structure. Multiple factors are involved in determining the condition for identifying a locus, and it cannot be represented simply by a single value such as phenotypic variance explained by the SNP. Our results show the importance of multiple measurements of phenotypes from multiple animals for each strain and of directly using the individual measurements in the statistics for association mapping. Taking individual measurements into account within the association mapping is much more computationally intensive. EMMA provides a method for efficiently handling individual measurements. In addition, our results also demonstrate the effect of genetic background on the statistical power. As the population structure explains larger phenotypic variance, the power using multiple measurements becomes lower.
Our results show that phylogenetic control can control for population structure as effectively as the linear mixed model based on the genetic similarity matrix in some data sets despite the limited ability of the model to represent complex genetic relatedness. Since genetic similarity matrices are better models when accounting for recombination and hybridization, and also are easier to compute, phylogenetic control is not preferred in association mapping in model organisms. However, it is possible to compute the likelihood of the phylogenetic control model in linear time (FELSENSTEIN 1985), and this may be useful when a very large number of individuals are to be tested.
| APPENDIX: DERIVATION OF RESTRICTED LIKELIHOOD AND ITS DERIVATIVES |
|---|
|
|
|---|
Plugging in the optimal parameters
and
in Equation 2, it follows that
![]() | (A1) |
And R can be rewritten as follows:
![]() | (A2) |
![]() | (A3) |
![]() | (A4) |
where P = I – X(X'H–1X)–1X'H–1.
It is straightforward to show that
![]() | (A5) |
![]() | (A6) |
![]() | (A7) |
![]() | (A8) |
![]() | (A9) |
![]() | (A10) |
From Equations A1 and A10, it follows that
![]() | (A11) |
The restricted likelihood of y is equivalent to computing the likelihood of Ay where S = AA' and A'A = I:
![]() | (A12) |
![]() | (A13) |
Accordingly,
and
hold, and the restricted likelihood of y is equivalent to the likelihood of
y
N(0,
2diag(
s +
)). By plugging in
to
2, it immediately follows that
![]() | (A14) |
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
ANNUCIADO, R. V. P., M. NISHIMURA, M. MORI, A. ISHIKAWA, S. TANAKA et al., 2001 Quantitative trait loci for body weight in the intercross between SM/J and A/J mice. Exp. Anim. 50: 319–324.[CrossRef][Medline]
ARANZANA, M. J., S. KIM, K. ZHAO, E. BAKKER, M. HORTON et al., 2005 Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes. PLoS Genet. 1: e60.[CrossRef][Medline]
ARBELBIDE, M., J. YU and R. BERNADO, 2006 Power of mixed-model QTL mapping from phenotypic, pedigree and marker data in self-pollinated crops. Theor. Appl. Genet. 112: 876–884.[CrossRef][Medline]
BELKNAP, J. K., 1998 Effect of within-strain sample size on QTL detection and mapping using recombinant inbred mouse strains. Behav. Genet. 28: 29–38.[CrossRef][Medline]
BHATTACHARYA, T., M. DANIELS, D. HECKERMAN, B. FOLEY, N. FRAHM et al., 2007 Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science 315: 1583–1586.
BYSTRYKH, L., E. WEERSING, B. DONTJE, S. SUTTON, M. T. PLETCHER et al., 2005 Uncovering regulatory pathways that affect hematopoietic stem cell using genetical genomics. Nat. Genet. 37: 225–232.[CrossRef][Medline]
CARLSON, J. M., C. KADIE, S. MALLAL and D. HECKERMAN, 2007 Leveraging hierarchical population structure in discrete association studies. PLoS One 2: e591.[CrossRef]
CASTEELE, T. V. D., P. GALBUSERA and E. MATTHYSEN, 2001 A comparison of microsatellite-based pairwise relatedness estimators. Mol. Ecol. 10: 1539–1549.[CrossRef][Medline]
CERVINO, A. C., A. DARVASI, M. FALLAHI, C. C. MADER and N. F. TSINOREMAS, 2007 An integrated in silico gene mapping strategy in inbred mice. Genetics 175: 321–333.
CRAINICEANU, C. M., and D. RUPPERT, 2004 Likelihood ratio tests in linear mixed models with one variance component. J. R. Stat. Soc. B 66: 165–185.[CrossRef]
DEMPSTER, A. P., D. B. RUBIN and R. K. TSUTAKAWA, 1981 Estimation in covariance components models. J. Am. Stat. Assoc. 76: 341–353.[CrossRef]
DEVLIN, B., and K. ROEDER, 1999 Genomic control for association studies. Biometrics 55: 997–1004.[CrossRef][Medline]
FELSENSTEIN, J., 1981 Evolutionary trees from dna sequences: a maximum likelihood approach. J. Mol. Evol. 17: 368–376.[CrossRef][Medline]
FELSENSTEIN, J., 1985 Phylogenies and the comparative method. Am. Nat. 125: 1–15.[CrossRef]
FELSENSTEIN, J., and G. CHURCHILL, 1996 A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93–104.[Ab