## Abstract

Additive genetic variance in natural populations is commonly estimated using mixed models, in which the covariance of the genetic effects is modeled by a genetic similarity matrix derived from a dense set of markers. An important but usually implicit assumption is that the presence of any nonadditive genetic effect increases only the residual variance and does not affect estimates of additive genetic variance. Here we show that this is true only for panels of unrelated individuals. In the case that there is genetic relatedness, the combination of population structure and epistatic interactions can lead to inflated estimates of additive genetic variance.

MIXED models with random genetic effects have become an important tool for studying the genetic architecture of complex traits. The covariance of the genetic effects is assumed to be proportional to a genetic similarity matrix (GSM) based on a dense set of markers, which is equivalent to assuming additive effects for each standardized marker score. Under several additional assumptions, such as constant linkage disequilibrium, this gives unbiased estimates of additive genetic variance and narrow-sense heritability (Yang *et al.* 2010; Speed *et al.* 2012; Lee and Chow 2014; Speed and Balding 2015). The sampling variance of such heritability estimators has been studied in Visscher and Goddard (2014) and Kruijer *et al.* (2015). These results are, however, derived under the assumption that the model is correct, *i.e.*, contains the true distribution of the data. Here we consider situations where this is not the case and argue that potential sources of bias may be identified by computing the parameter value that minimizes the Kullback–Leibler (KL) divergence with respect to the true distribution *Q*. For *n*-dimensional Gaussian distributions and , the KL divergence equalsIt is a well-known fact from statistics that in the case of misspecification, *i.e.*, when *Q* is not contained in the model , the maximum-likelihood (ML) estimator converges to (Huber 1967; White 1982). Here we investigate misspecification in a mixed-model context, the covariance of the data being misspecified due to infinitesimal interactions or other nonadditive effects. We consider three different scenarios (A–C) and in each of them three different values of additive and nonadditive genetic variance. The total phenotypic variance is assumed to be known and equal to 1.

In scenario A, the phenotype of *n* individuals is modeled using the multivariate normal distribution(1)where *K* is a marker-based GSM, is the identity matrix, is the additive genetic variance, and is the residual variance. We assume, however, that *Q*, the *actual* distribution of *Y*, is the zero mean normal distribution with covariance , being the Hadamard (entry-wise) product. The matrix is the covariance due to small epistatic interactions between all standardized marker scores (Supporting Information, File S1; see also Jiang and Reif 2015). Hence, the narrow- and broad-sense heritabilities are equal to, respectively, 0.4 and 0.6. In addition to this genetic architecture, we also consider the case where the covariance matrix of *Y* is (*i.e.*, and ) and (*i.e.*, and ).

For all these genetic architectures, does not equal the identity matrix , and *Q* is therefore not contained in model (1). Hence, the ML estimator will not converge to *Q*, but rather to the point () minimizing the KL divergence, . For genetic similarity matrices derived from published data in maize, rice, and *Arabidopsis*, ranges between 0.47 and 0.53, given a true value of 0.4 (Table 1). Similar bias occurs when and Hence, the presence of epistatic interactions leads to inflated estimates of additive genetic variance. For a panel of simulated unrelated individuals, equals the true value of , which is due to the much smaller off-diagonal elements of *K*, making almost indistinguishable from .

In scenario B, a plant trait is phenotyped on *r* genetically identical replicates. Following Kruijer *et al.* (2015), the observations are modeled by the normal distribution(2)*Z* being an incidence matrix assigning plants to genotypes. The true distribution *Q* is multivariate normal with covariance *i.e.*, there are nonadditive (not necessarily epistatic) effects with independent distributions. Such effects could be due to, for example, genotype–environment interaction. As in scenario A, we also consider a genetic architecture with and (*i.e.*, covariance ) and a genetic architecture with and In contrast to model (1) (where and ), is different from , and *Q* is not contained in model (2). Again, the value minimizing KL divergence is substantially larger than the true value (Table 1), and additive genetic variance will tend to be overestimated. Intuitively, this is because the block structure is better captured by than by the diagonal residual.

Scenario C is a combination of scenarios A and B. To avoid the misspecification occurring in scenario B, the model(3)is considered, extending (2) with independent nonadditive effects. This model has been used in the analysis of field trials (Oakey *et al.* 2006, 2007), as well as genomic prediction (Gianola and van Kaam 2008; Howard *et al.* 2014; Jarquin *et al.* 2014). If in fact the nonadditive effects have covariance (as in scenario A) and , the data have covariance . As in scenarios A and B, the minimizing KL divergence is larger than the true value (Table 1), except for the rice population of Zhao *et al.* (2011) with In the latter case, was on average 0.14, while its bias was at most 0.01 for all other populations and heritability levels.

In addition to the minimization of KL divergence we analyzed ML estimates for simulated traits, in which case the phenotypic variance is unknown (File S2). For most populations and heritability levels, the bias of additive genetic variance estimates is similar to what was found by minimizing KL divergence in models (1)–(3). Differences are largest for the population of Zhao *et al.* (2011), where the total phenotypic variance is consistently overestimated.

The bias we identified here by statistical arguments and simulations has important implications, in particular for immortal populations, for which genetically identical replicates are available (*e.g.*, *Arabidopsis thaliana*, agronomic crops, bacteria, and fungi). Typically there is strong population structure and often only several hundreds of different genotypes are phenotyped. One can analyze such data at the individual level [model (2)] or at the level of genotypic means [model (1), with ’s divided by the number of replicates]. Kruijer *et al.* (2015) showed that in the latter type of analysis, standard errors of heritability estimates can be huge, and recommended model (2) for both heritability estimation and genomic prediction. Here we have shown that in the presence of nonadditive effects, this model is likely to overestimate additive genetic variance. If, however, the nonadditive effects are due to epistatic interactions, analysis at the genotypic means level [model (1)] will, apart from the large sampling variance, also give inflated estimates of additive genetic variance. This is a rather realistic scenario, since epistasis may be an important part of the genetic architecture (Mackay 2014), and several other types of nonadditive effects can be ruled out or minimized for immortal populations: *e.g.*, genotype–environment interactions are unlikely in homogeneous controlled environments with adequate randomization, and dominance effects are absent when using inbred lines. Inflated heritability estimates may also affect the performance of G-BLUP, although the loss in accuracy is considerably smaller than in the case where heritability is underestimated (Kruijer *et al.* 2015).

Interestingly, the inflation of additive genetic variance is not due to any nonlinearity or absence of main effects (as in, *e.g.*, Culverhouse *et al.* 2002; Song *et al.* 2010; Zuk *et al.* 2012), but rather to the population structure present in the epistatic GSM, which to some extent resembles the structure of the GSM for the additive effects. At the same time, it is this structure that makes the epistatic GSM distinguishable from the diagonal error. This suggests that epistatic interactions are easier to model in structured populations; *i.e.*, sampling variance of epistatic variance components may not be as large as in unstructured human populations (Yang *et al.* 2011). Expressions for the asymptotic variance in a model with both additive and epistatic effects (File S3) indicate that this is indeed the case. More generally, the inflation of heritability estimates due to misspecification illustrates the difficulty of modeling and estimating genetic effects. As recently pointed out by Speed and Balding (2015) this is already challenging for the additive genetic effects, in the sense that depending on the genetic architecture, different GSMs may be appropriate. Indeed, the potential bias resulting from an inappropriate GSM could be assessed by evaluating KL divergence with respect to the true model, as is the case for alternatives for the epistatic GSM considered here.

## Acknowledgments

I thank two anonymous reviewers for their constructive comments that helped to improve the manuscript. Martin Boer and Fred van Eeuwijk are acknowledged for useful discussions. The research leading to these results has been conducted as part of the project DROught-tolerant yielding PlantS (DROPS), which received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 244374. This research was also funded by the Learning from Nature project of the Dutch Technology Foundation, which is part of the Netherlands Organisation for Scientific Research.

## Footnotes

*Communicating editor: A. H. Paterson*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177212/-/DC1.

- Received May 15, 2015.
- Accepted November 10, 2015.

- Copyright © 2016 by the Genetics Society of America