## Abstract

The additive genomic variance in linear models with random marker effects can be defined as a random variable that is in accordance with classical quantitative genetics theory. Common approaches to estimate the genomic variance in random-effects linear models based on genomic marker data can be regarded as estimating the unconditional (or prior) expectation of this random additive genomic variance, and result in a negligence of the contribution of linkage disequilibrium (LD). We introduce a novel best prediction (BP) approach for the additive genomic variance in both the current and the base population in the framework of genomic prediction using the genomic best linear unbiased prediction (gBLUP) method. The resulting best predictor is the conditional (or posterior) expectation of the additive genomic variance when using the additional information given by the phenotypic data, and is structurally in accordance with the genomic equivalent of the classical additive genetic variance in random-effects models. In particular, the best predictor includes the contribution of (marker) LD to the additive genomic variance and possibly fully eliminates the missing contribution of LD that is caused by the assumptions of statistical frameworks such as the random-effects model. We derive an empirical best predictor (eBP) and compare its performance with common approaches to estimate the additive genomic variance in random-effects models on commonly used genomic datasets.

- best prediction
- genetic variance
- quantitative genetics
- genomic variance
- random-effects models
- BLUP
- whole-genome regression

THE additive genetic variance is defined as the variance of the breeding value (BV) and is the most important determinant of the response of a population to selection (Falconer and Mackay 1996). The additive variance can be estimated from observations made on the population and is a principal component of the (narrow-sense) heritability, which is one of the main quantities of interest in many genetic studies (Falconer and Mackay 1996). The heritability is eminent, among other things, for the prediction of the response to selection in the breeder’s equation (Piepho and Moehring 2007; Hill 2010). Although nonadditive genetic variation exists, most of the genetic variation is additive, such that it is usually sufficient to investigate the additive genetic variance (Hill *et al.* 2008). More specifically, epistasis is only important on the gene level but not for the genetic variance (Hill *et al.* 2008), and Zhu *et al.* (2015) show that, for human complex traits, dominance variation contributes little. Nevertheless, linkage disequilibrium (LD) is an important factor, especially when departing from random mating and Hardy-Weinberg equilibrium, which is often the case in animal breeding (Hill *et al.* 2008; L. Dempfle, personal communication).

The additive genomic variance is defined as the variance of a trait that can be explained by a linear regression on a set of markers (de los Campos *et al.* 2015).

Many authors have been chasing what is sometimes coined “missing heritability” (Maher 2008), which means that only a fraction of the “true” genetic variance can be captured by regression on influential markers. Initially, researchers have used genome-wide association studies (GWAS) in order to find quantitative trait loci (QTL) by single-marker fixed-effect regression combined with variable selection. After having added the estimated corresponding genomic variances of the single statistically significant loci, they asserted that they could only account for a fraction of the “true” genetic variance. For instance, Maher (2008) found that only 5% instead of the widely accepted heritability estimate of 80% of human height could be explained.

Golan *et al.* (2014) state that the “true” genetic variance is generally underestimated when applying variable selection, *e.g.*, GWAS, to genomic datasets which are typically characterized by their high dimensionality, where the number of variables (markers) *p* is much larger than the number of observations *n*. It is well known that a lot of traits are influenced by many genes, and that at least some loci with tiny effects are missed when using variable selection or even single-marker regression models.

Consequently, Bernardo (1994) decided to fit all [restriction fragment length polymorphism (RFLP-)] markers in maize jointly using genomic best linear unbiased prediction (gBLUP), where he assumes the marker effect vector to be random. In animal breeding, Meuwissen *et al.* (2001) used Bayesian approaches (BayesA and BayesB) to fit all markers jointly in order to predict BVs.

Then, Yang *et al.* (2010) estimated the genomic variance in an approach that they termed genome-wide complex trait analysis genomic restricted maximum likelihood (GCTA-GREML) (Yang *et al.* 2011). They showed that quantifying the combined effect of all single-nucleotide polymorphisms (SNPs) explains a larger part of the heritability than only using certain variants quantified by GWAS methods. They illustrate their results on a dataset on human height by pointing out that they could explain a heritability, also termed “chip heritability” (Zhou *et al.* 2013), of ∼45%. They concluded that the main reason for the remaining missing heritability was incomplete LD of causal variants with the genotyped SNPs, which refers to the general difference between the genetic variance and the genomic variance (Powell *et al.* 2010; de los Campos *et al.* 2015). However, the GCTA-GREML approach can be biased upwards as well as downward (Wolc *et al.* 2013; de los Campos *et al.* 2015; Fernando *et al.* 2017a; Lehermeier *et al.* 2017).

Recently, there has been a general discussion whether estimators for the genomic variance account for LD between markers, which is defined as the covariance between the marker genotypes (Bulmer 1971). Some authors argue that estimators similar to GCTA-GREML lack the contribution of LD (Krishna Kumar *et al.* 2016a,b; Lehermeier *et al.* 2017), whereas others (Yang *et al.* 2016) resolutely disagree. More specifically, Krishna Kumar *et al.* (2016a,b) state that, in GCTA-GREML, the contributions of the *p* markers to the phenotypic values are assumed to be independent normally distributed random variables with equal variances. Thus, they claim that the random contribution made by each marker is not correlated with the random contributions made by any other marker, which leads to a negligence of the contribution of LD to the additive genomic variance.

In a study on the model plant *Arabidopsis thaliana* (1001 Genomes Consortium 2016), Lehermeier *et al.* (2017) use Bayesian ridge regression (BRR) to relate the phenotype flowering time to the genomic data. They use an estimator (termed M2) based on the posterior distribution of the marker effects obtained by Markov Chain Monte Carlo (MCMC) methods and show that this estimator explains a larger proportion of the phenotypic variance than the estimator, termed M1, based on gBLUP (VanRaden 2008; Yang *et al.* 2010, 2011). Lehermeier *et al.* (2017) argue that the reason for the better performance of the Bayesian estimator for the additive genomic variance [already mentioned in Sorensen *et al.* (2001), Fernando and Garrick (2013), Zhou *et al.* (2013), Fernando *et al.* (2017b)] is the explicit inclusion of LD.

We show that the additive genomic variance in linear models with random marker effects (RME) can be defined as a random variable. Based on this premise, we propose a novel predictor of the additive genomic variance and place existing estimators in a joint framework permitting comparison with the new predictor. We contribute to the solution of many of the above mentioned controversies by reviewing common approaches to estimate the additive genomic variance, *e.g.*, GCTA-GREML, and show that they estimate the unconditional (or prior) expectation of the random additive genomic variance. Combined with the assumptions on the unconditional distribution of the marker effects in the gBLUP-method, this leads to an insufficient adaptation to the data and a negligence of the contribution of LD.

We introduce a novel best prediction approach for the additive genomic variance in both the current and the base population, *i.e.*, we use the conditional (or posterior) expectation of the random additive genomic variance given the additional information by the phenotypic values for an improved adaptation to the data. We decompose the best predictor into the GCTA-GREML estimator and a function for the contribution of marker LD, which determines whether GCTA-GREML is biased up- or downward. The best predictor is structurally in accordance with the genomic equivalent of the additive genetic variance from classical quantitative genetics, *i.e.*, it explicitly includes the contribution of LD. We propose an empirical best predictor (eBP), and illustrate our theoretical results on several commonly used genomic datasets.

## Materials and Methods

### Linear models

The connection of the *n*-vector *y* of phenotypic values and the *n*-vector *g* of genomic values is given by(1)where μ denotes a fixed intercept, is a *n*-row-vector containing 1s, denotes environmental deviations, and is the identity matrix of dimension *n*.

The statistical model that is probably most popular in genomic applications is the random-effects model (REM), in which the genomic values are assumed to be normally distributed(2)where **G** is a variance-covariance matrix and is the variance component of the vector of genomic values *g*.

In the following, we assume that the genome is mapped with markers and we denote by **X** the design matrix coding the genotypes of the markers.

In this framework, the -matrix **G** introduced in Equation 2 is called the genomic relationship matrix (GRM), and is most commonly defined as(3)where is the idempotent -matrix used for column-wise mean-centering, and , where denotes the frequency of the minor allele at marker *j* (VanRaden 2008).

Then, the genomic values *g* can be separated into the coded genotypes of the single markers **X** and their corresponding *p*-vector *β* of marker effects such that(4)holds in distribution, where for . The *p* individual components of the marker effect vector *β* are drawn at random from a common fixed normal distribution for each marker genotype:(5)with positive variance component .

Model (1) is called linear equivalent model (Henderson 1984) to the “standard” additive linear regression model(6)The centering matrix **P** in Equations 4 and 6 is mandatory for the equivalence of models (1) and (6), and to be able to consistently apply model (6).

The restriction of the column-means of the marker genotypes in model (6) to be 0 guarantees that the sample mean of the genomic values in Equation 4 equals 0 . This ensures the uniqueness of the definition of the vector *g* of genomic values in Equation 4. Otherwise, a different coding of the marker genotypes leads to different genomic values *g*.

Inferences on quantities based on genomic data in model (1) and (6) are often performed with the gBLUP method (Bernardo 1994). Model (6) allows for marker-specific investigations and inferences on the genomic contribution to the phenotypic values, whereas estimation of parameters in model (1) has computational advantages. For additional information on the gBLUP-method, we refer to Appendix *Genomic Best Linear Unbiased Prediction*.

Model (6) is a realization of *n* draws from the underlying data-generating process of the (mean-centered) marker genotypes (Bühlmann and van de Geer 2011). This distribution, as well as the corresponding genomic values in Equation 4, relate to the current population of individuals.

When we are interested in the genomic values in the corresponding consistent base population, we should take the relationship (correlation) between the individuals into account (Powell *et al.* 2010; Legarra 2016). In what follows, we assume a given relationship matrix **R**. Instead of the genomic values *g* or we investigate the uncorrelated genomic values defined by(7)These are realizations of *n* draws from the underlying data-generating process of marker genotypes in the base population. The sample mean of the genomic values in the base population, , is usually different from 0.

### Definitions of the genomic variance

The additive genetic variance (in the base population also called genic variance) with respect to a trait is defined as the theoretical variance of the genetic value based on the respective QTLs (Falconer and Mackay 1996). The genomic variance has been described as the variance of the genomic values that can be explained by a linear regression on a set of markers (de los Campos *et al.* 2015). This does not, however, imply a unique formal definition of the genomic variance.

Here, we give an overview of different approaches to define a genomic variance in the framework of the linear models (1) and (6).

Without further assumptions on the nature of the genome, we can define the sample variance(8)of the mean-centered *n*-vector of genomic values (see Equation 4) in the current population (Ould Estaghvirou *et al.* 2013). Here, defines the sample variance-covariance matrix of the marker genotypes in the current population.

In the base population, we define the sample variance(9)of the uncorrelated genomic values (see Equation 7). Here, defines the sample variance-covariance matrix of the marker genotypes in the base population.

Alternatively, we can define the theoretical variance of the genomic values directly in the REM. The linear model is generated by drawing from the data-generating process of the marker genotypes (representative individual), and the model assumptions in the REM dictate that marker effects are random variables. This gives rise to three different scenarios for the variance of the genomic values in the REM (marker genotypes random, marker effects random, or both random).

The additive genomic variance of a randomly sampled (representative) individual (Gianola *et al.* 2009; de los Campos *et al.* 2015; Fernando *et al.* 2017b) equals(10)where denotes the variance-covariance matrix of the marker genotypes.

The variance of a randomly sampled (representative) individual with RME is given by(11)This is not the additive genomic variance (Gianola *et al.* 2009; de los Campos *et al.* 2015).

The variance of a randomly sampled trait averaged over individuals with fixed genotypes **X** equals(12)and does not equal the additive genomic variance.

We derive the equalities in Equations 10–12 in more detail in the Appendix *Theoretical Variances of the Genomic Values in the REM*. These quantities refer to the genotypes in the current population. We can apply the same definitions in the base population by considering the data-generating process of the genotypes in the base population (exchange *X* by ).

In Table 1, we give an overview of the different possibilities to define the variance of the genomic values in the REM.

In actual applications, we have to replace in Equation 10 by its estimator . Consequently, the sample variance (Equation 8) as well as the theoretical variance (Equation 10) effectively represent the additive genomic variance, *i.e.*, the genomic equivalent of the additive genetic variance (Bulmer 1971; Falconer and Mackay 1996), in the current population. In the following, we do not explicitly distinguish between the sample or the theoretical version of the variance, and will speak only of the additive genomic variance.

From now on, we focus on the estimation of the additive genomic variance in the general form(13)which is a non-negative quadratic form of the genomic values. By specifying the positive semidefinite and symmetric -matrix **B**, we determine whether the genomic variance refers to the current population (see Equation 8), or the base population (see Equation 9). Because the randomness of the marker genotypes is not explicitly necessary to derive Equation 13, we can easily express all results in terms of the genomic values *g* defined in the equivalent model.

In the framework of the REM, the marker effects *β* in model (6) and the genomic values *g* in model (1) are random variables. Consequently, the additive genomic variance in Equation 13 is also a random variable, and, as such, has to be predicted in an optimal way before finally being estimated [in the same way that the RME *β* are predicted by the BLUP and then estimated by the eBLUP (Kackar and Harville 1984; Jiang 1999)].

First, we will show that estimators for the unconditional expectation of Equation 13, like GCTA-GREML, are of the form of Equations 11 and 12, and, therefore, do not estimate the additive genomic variance.

Then, we introduce the (frequentist) best predictor for the additive genomic variance and show that this approach maintains the structure of the additive genomic variance (Equation 13)—the genomic equivalent of the additive genetic variance.

### The expectation of the additive genomic variance

The expectation of the random variable in Equation 13 minimizes the quadratic formwith respect to all real numbers *a*, *i.e.*, is the best approximation of in the absence of additional information (van der Vaart 2007). The unconditional (or prior) expectation of equals(14)because of the properties of the trace.

For the additive genomic variance in the current population, , we choose in Equation 14 and obtain(15)in model (6) or(16)in the equivalent model (1). Unconditional expectations of the form *V* for the additive genomic variance are considered in, for example, Ould Estaghvirou *et al.* (2013).

For the additive genomic variance in the base population, , we choose in Equation 14 and obtain(17)in model (6) or(18)in the equivalent model.

Often (VanRaden 2008; Yang *et al.* 2010, 2011; Speed *et al.* 2012; Vinkhuyzen *et al.* 2013; Legarra 2016), the matrix **R** used for the transformation to the base population is assumed to be the GRM **G** defined in Equation 3. Then, the unconditional expectation of the additive genomic variance simplifies to(19)and the variance component from the gBLUP-method is considered as the additive genomic variance in the base population. We have shown, however, that is only the unconditional expectation of the additive genomic variance.

We recommend caution when using the simplification in Equation 19 because the GRM **G** is in general singular (because **P** is singular), and therefore is not well defined.

We emphasize that only the diagonal elements of the sample variance-covariance matrix ( or ) of the marker genotypes influence the unconditional expectations *V*, and of the additive genomic variance. The model assumptions in the REM dictate the matrix to be diagonal, which leads to a negligence of the off-diagonal elements of or in Equation 14. The covariances (LD) between the marker genotypes are not included, and *V*, and are of the same form as in Equation 11 and in Equation 12. This implies that the unconditional expectation of the random additive genomic variance is structurally not fully in accordance with the (random) additive genomic variance in Equation 13.

Explicit formulae for the estimation of the unconditional expectations *V*, and will be given in the Appendix *Estimation of the Additive Genomic Variance in the REM*.

### Best prediction of the additive genomic variance

The unconditional (prior) expectation of in Equation 14 is strongly influenced by the model assumption on the marginal distribution of the marker effects, and does not use additional information given by the phenotypic values *y* in model Equations 1 and 6. Estimation procedures for the unconditional expectation [which include *e.g.*, restricted maximum likelihood (REML)], however, make use of the data *y* to a certain extent.

By contrast, the conditional (posterior) expectation, given the phenotypic values *y*, makes best use of the information provided by *y* even before the final estimation is performed.

Generally, the conditional (or posterior) expectation of a random variable *Z* (in our case ) given the knowledge of the random vector *Y* is defined as the “best prediction” (Searle *et al.* 1992; van der Vaart 2007) of the random variable *Z*. The best predictor(20)is the unique function that minimizes the mean square error of predictionover all functions in *Y*, *i.e.*, the conditional expectation is the projection (closest element in a given set of functions) of *Z* onto the linear space of all functions in *Y* (Searle *et al.* 1992; van der Vaart 2007).

The best predictor in Equation 20 is, by definition, an unbiased predictor for the random variable *Z* and maximizes the correlation , *i.e.*, we can replace the target random variable *Z* by the best predictor defined in Equation 20 in an optimal way (Searle *et al.* 1992). Instead of inferring the unobservable target random variable, we conduct inferences on the best predictor. Because the best predictor has realized in a given dataset , it is estimable (Searle *et al.* 1992).

In the following, we introduce a novel approach of considering the frequentist best predictor instead of the unconditional expectation for the random additive genomic variance in Equation 13. We proceed according to Equation 20, and definefor the given phenotypic values *y*, because **X** is constant, and, therefore, independent of *y*. The matrix of conditional second moments of the marker effects *β* is usually nondiagonal (contrary to ), and can be expressed asusing the BLUP of the random vector *β*, and the variance-covariance matrix of *β* given the data *y*. Then, the best predictor equals(21)where the last equality holds because of the connectionof the BLUPs and the conditional variance-covariance matricesin models (1) and (6), see also Appendix *Genomic Best Linear Unbiased Prediction*.

For the best predictor of the additive genomic variance in the current population we set in Equation 21, and obtain(22)in model (6) or(23)in the terminology of the equivalent model (1).

For the best predictor of the additive genomic variance in the base population we set in Equation 21, and obtain(24)in model (6) or(25)in the terminology of the equivalent model (1).

We note that Equations 24 and 25 have structural similarities with the iterative maximum-likelihood equation for the variance component when using the expectation-maximization algorithm (Sorensen and Gianola 2002).

We emphasize that the best predictor of the additive genomic variance in the current population (*W*), as well as in the base population includes the contribution of all elements of the sample variance-covariance matrix of marker genotypes ( or ), and, hence, comprises LD information, contrary to the unconditional expectations *V*, and of the additive genomic variance from the previous section.

Explicit formulae for the eBPs of the additive genomic variance, as well as a formula for (approximate approach using the GRM **G** for transformation to the base population), will be given in the Appendix *Estimation of the Additive Genomic Variance in the REM*. We compare the use of the unconditional expectation and the best predictor for the prediction of the random additive genomic variance in the REM in Table A.1 and Table A.2 in the Appendix.

### Statistical analysis (genomic data)

For an illustration of the theoretical results of the previous sections, we used the mice dataset that comes with the *R*-package BGLR (Pérez and de los Campos 2014). The data originally stem from an experiment by Valdar *et al.* (2006a,b) in a mouse population. The dataset contains the matrix **X** with values in of polymorphic marker genotypes that were measured in mice. The trait (*n*-vector *y*) under consideration was body length (BL). The relationship of the mice is recorded in the pedigree matrix **R**, and is used for the transformation to the base population.

Additionally, we used the publicly available historical wheat dataset that also comes with the *R*-package “BGLR” (Pérez and de los Campos 2014). The data originally stems from CIMMYT’s Global Wheat Program and consists of lines of wheat, where the trait under consideration was average grain yield. The phenotypes are divided up into four basic target sets of environments designated as Wheat I, Wheat II, Wheat III, and Wheat IV, of which we considered the only first. The dataset contains the matrix of marker genotypes for markers as well as a relationship matrix.

Moreover, we analyzed a population of fully sequenced *Arabidopsis* lines for which phenotypes and genotypes are publicly available by the effort of the *Arabidopsis* 1001 Genomes project (1001 Genomes Consortium 2016). The lines represent natural inbred lines and we examined the same trait, namely flowering time at 10° (FT10), and the same SNP-markers that were used in Lehermeier *et al.* (2017). For these data, no relationship matrix was available.

We conducted all calculations with the free software R (R Development Core Team 2017). For each dataset, we used the gBLUP-method in the equivalent version (computational advantages) implemented in the *R*-package “sommer” (Covarrubias-Pazaran 2017) to fit a REM. We used REML to obtain estimates ( and ) for the variance components. We also obtained estimates of the best predictor of the genomic effects and the their variance-covariance matrix .

We used this outcome for the estimation of the unconditional expectation *V* and the BP *W* of the additive genomic variance in the current population, as well as the unconditional expectation and the BP for the additive genomic variance in the base population (except for the *Arabidopsis* dataset, where no relationship matrix was available). Although the GRM is not invertible, we will show in the Appendix *Estimation of the Additive Genomic Variance in the REM* how to use the GRM for a transformation to the base population, and to calculate the corresponding unconditional expectation and the BP for the additive genomic variance in the base population.

Detailed information about the calculations and the programming code is provided in the Supplemental Material, File S1.

### Data availability

The authors affirm that all data necessary for confirming the conclusions of this article are represented fully within the manuscript and the supplemental material that has been uploaded to figshare. File S1 contains a detailed description of the estimation of the genomic variances for the gBLUP-method as well as the corresponding *R*-code and its output. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.8114117.

## Results

In the first section of Table 2, we present the estimation results for the unconditional expectation *V* and the best predictor *W* for the additive genomic variance in the current population. In the mice and wheat datasets, exceeds , whereas for the *Arabidopsis* data, the eBP is about double the size of the unconditional expectation.

The sample variance of the phenotypic values has been scaled to 1. The sum of and the residual variance is larger than the phenotypic variance for the mice and wheat data but smaller for the *Arabidopsis* data. Technically, it is possible to define the heritability in two ways, namely with respect to the phenotypic variance, and with respect to the sum of the additive genomic variance and the residual variance. The sum of the eBP and the residual variance, however, equals the scaled phenotypic variance of 1 remarkably exactly for all datasets considered.

In the second section of Table 2, we first present the estimation results for the unconditional expectation and the best predictor for the additive genomic variance in the base population using the given relationship matrices for the transformation. For the mice data, and are similar to their analogs and in the current population. For the wheat data, however, the estimated unconditional expectation and eBP in the base population are around five times larger than those in the current population, and exceed the sample phenotypic variance in the current population. By this approach, it is not possible to define a heritability in the base population because both the estimate of the residual variance and the phenotypic sample variance refer to the current population.

The estimation results for the unconditional expectation and the best predictor for the additive genomic variance in the base population using the GRM for the transformation differ from those using the given relationship matrices by a considerable amount. In the mice and wheat data, is larger than , whereas, for the *Arabidopsis* data the eBP exceeds the estimated unconditional expectation. This conforms to the behavior of *V* and *W* in the current population.

## Discussion

We have shown that commonly used estimators for the additive genomic variance in the REM with genomic marker data are based on the unconditional expectation of the random additive genomic variance. We have introduced a novel best prediction approach for the random additive genomic variance in both the current and the base population. In the following, we discuss several important implications.

### Current and base population

Common ways of estimating the additive genomic variance focus on the base population. These approaches are independent of the actual current population, and, consequently, valid even if the generations change.

If one aims at the response of a population to selection, however, it might be more meaningful to estimate the additive genomic variance in the actual given population. This implies that the estimation of the genomic variance has to be conducted again when the individuals change. A formal definition of the heritability is best possible in the current population, where the phenotypic and residual variance are estimable.

We have preferred to use given relationship matrices for the transformation of the genomic values to the base population. In the case that such a matrix is not available, we have shown how to use genomic relationship matrices for the transformation, although a formal inversion of GRMs is, in general, not possible.

In Table 2, we illustrate that we can decompose the sample phenotypic variance into the sum of the eBP of the additive genomic variance in the current population and into the estimated residual variance. This is due to the orthogonal projection property of the conditional expectation, which gives the best approximation of the random additive genomic variance. This enables a unique definition of the heritability in the current population. It is never possible, however, to transfer the residual variance to the base population. Consequently, a definition of the heritability in the base population is not straightforward.

### The gBLUP-method and the Bayesian approach

The frequentist gBLUP-method can also be set up in the context of Bayesian regression models (with prior distribution for the effect vector as defined in Equation 5, and uninformative priors for the variance components). Lehermeier *et al.* (2017) considered the additive genomic variance in the current population (see Equation 8), and used BRR to estimatewhere denotes MCMC samples from the posterior distribution of *β*. In that approach, Lehermeier *et al.* (2017) estimated the posterior mean of the additive genomic variance in the current population. This approach is the Bayesian equivalent of the (frequentist) empirical version of the best predictor of in Equation 13 in the current population. does not describe the genomic variance in the base population, and should not directly be compared with approaches introduced *e.g.*, in Yang *et al.* (2010, 2011). Analogously to the best predictor (see Equation 24), for the genomic variance in the base population, one can consideras the posterior mean of the genomic variance in the base population in Bayesian regression models.

The frequentist gBLUP-method provides a more formal approach to the prediction of the random additive genomic variance in linear models with random effects than the Bayesian approach. It enables the derivation of explicit formulas for the predictors (unconditional expectation and best predictor) of the random additive genomic variance using the standard output of the gBLUP-method, which goes hand-in-hand with a fast implementation of the empirical version of the predictors. The connection between the BLUP and its covariance for the RME with the additive genomic variance are clearly visible. This enables us, for instance, to derive the decomposition of the best predictor of the random additive genomic variance into the unconditional expectation and a function for the marker LD in the following section.

### Influence of LD

In the section *Definitions of the genomic variance*, we saw that the (random) additive genomic variance equalsin the current population, andin the base population. We emphasize that the variance-covariance matrix of the marker genotypes (marker LD) plays a decisive part in the determination of the additive genomic variances in both the current and the base population. The variances and are structurally in accordance with the classical additive genetic variance (Bulmer 1971; Falconer and Mackay 1996), which is caused by the genotypes, whereas the genotypic effects are fixed.

In the REM, however, the marker effects are random, with unconditional expectation 0 and unconditional diagonal variance-covariance matrix with equal variances . As a consequence, the unconditional expectation of the additive genomic variancein the current population andin the base population contain only the variances of the marker genotypes in the corresponding population. In addition, the unconditional expectation resembles both the variance of a randomly sampled trait for a randomly sampled individual and the variance of a randomly sampled trait for individual with fixed genotypes, see Table 1 for an overview.

We show in the Appendix *Estimation of the Additive Genomic Variance in the REM* that we can partition the best predictor of the random additive genomic variance in the following way:wherein the current population, andin the base population .

The best predictor, therefore, consists of the unconditional expectation of the additive genomic variance (no contribution of LD) and a function that explicitly contains the weighted contribution of marker LD. This function determines whether estimators like GCTA-GREML (unconditional expectation of the random genomic variance in the base population) are biased upwards or downward, *i.e.*, it determines the direction and the magnitude of the bias of GCTA-GREML (this method is based on the assumption that the function *L* constantly equals 0). In addition, we notice that this bias depends not only on the sign of the covariance between the marker genotypes, but both on the sign and the magnitude of the weighted covariances.

When best (conditional) estimators are considered, the inclusion of LD is intrinsic; in that case, discussions about whether or not LD should be included or even optimally included become immaterial.

We emphasize that, contrary to the unconditional expectation, the best predictor maintains the structure of the additive genomic variance and , because the function *L* can be decomposed into the weighted sample variances and covariances of the marker genotypes. Instead of the marker effects, the components of the matrix , which is typically nonzero and nondiagonal, take the part of the weighting factors of the elements of and The best predictor maintains the structure of the additive genomic variance in both the current and the base population and, thus, conforms to the classical genetic variance (Bulmer 1971; Falconer and Mackay 1996).

The difference between the estimators *V* and *W* ( and and ) is given by the estimated and can be obtained from Table 2. We notice that the weighted contribution of marker LD is large and positive in the case of the *Arabidopsis* data, whereas in the mice and wheat data, the weighted contribution of marker LD is slightly negative.

To sum up, the application of the unconditional expectation of the additive genomic variance combined with the model assumptions on the marker effects in random effect models cause, at least partially, the missing contribution of LD to the estimated additive genomic variance. This goes hand-in-hand with the critique expressed in Krishna Kumar *et al.* (2016a,b). It is, however, less important when estimating the additive genomic variance in the base population, where the individuals are uncorrelated and less LD persists (although the marker genotypes need not be uncorrelated).

The best prediction approach eliminates the problem of the missing contribution of LD to the additive genomic variance that is caused by mathematical modeling (*e.g.*, the assumptions in the random-effects model).

### Concluding remarks

The variability in the marker genotypes causes the variability in the genomic values in a population of individuals. Hence, the additive genomic variance is induced by the variance of the marker genotypes. In the application of the random-effects model, the marker effects are assumed to be random, which introduces another source of variability, namely a statistical one, in the genomic values.

We have shown that commonly used estimators use the unconditional expectation to handle this statistical randomness. However, we recommend the use of the best prediction approach (conditional expectation) that uses the additional information given by the phenotypic data in an optimal way, minimizes the mean square error of prediction, includes the contribution of LD, and maintains the structure of the genomic equivalent of the classical additive genetic variance.

## Acknowledgments

We would like to express our gratitude to Chris-Carolin Schön and Leo Dempfle for important remarks and advice on the paper. The remarks of the editor and two anonymous reviewers helped to greatly improve the quality of this manuscript. We are also grateful to Henner Simianer, Torsten Pook, Jonas Brehmer, and Christopher Dörr for their comments on earlier versions of the manuscript. N.S. and M.S. were supported by the Deutsche Forschungsgemeinschaft (DFG), project number SCHL 1865/4-1, as well as the ‘RTG 1953 - Statistical Modeling of Complex Systems and Processes’. H.-P.P. was supported by the DFG project PI 377/18-1.

## Appendix

### Genomic Best Linear Unbiased Prediction

In the REM for the model(6)we have that(26)The marker effect vector *β* cannot be estimated because it is a random variable. Henderson (1984) introduced the concept of the prediction of *β*, which refers to the estimation of the realized values of the random effects. The best linear unbiased predictor (BLUP) for *β* is given by (Henderson 1984; Searle *et al.* 1992). The conditional expectation is the unique best predictor, *i.e.* it is unbiased and has minimal mean square error of predictionwithin the whole set of functions *g* that depend on the data *y* (van der Vaart 2007).

The joint distribution of *y* and β equalsand we obtain(27)see *e.g.* Kotz *et al.* (2000). Consequently, the BLUP equals(28)and is linear in *y*. The conditional variance-covariance matrix of *β* equals(29)and the variance-covariance matrix of the BLUP equals(30)The actual estimation of the parameters in model (6) with the BLUP-method is a two-stage procedure (Das *et al.* 2004). In the first stage, a BLUE for the fixed quantities and a BLUP for the random variables are derived. However, they involve the variance components and as unknown parameters. In a second stage, these parameters are replaced by estimates, and the estimators for the BLUE and the BLUP are referred to as empirical BLUE (eBLUE) and empirical BLUP (eBLUP) (see Kackar and Harville 1984; Jiang 1999). Investigations on the properties of the eBLUE and the eBLUP are very complex (Searle *et al.* 1992), and often only approximate results are obtained (Kackar and Harville 1984; Jiang 1999; Das *et al.* 2004).

Assume that we are provided with estimators for the variance components using, *e.g.*, REML (Patterson and Thompson 1971; Corbeil and Searle 1976; Searle *et al.* 1992). These estimated variance components are nonlinear functions of the data *y*, and, consequently, the eBLUEfor the intercept and the eBLUP(31)for the marker effects *β* are not even linear in the data *y* anymore (despite their naming). The unbiasedness of the eBLUE and the eBLUP can be asserted if the estimated variance components and are non-negative, even functions in *y*, translation-invariant, and if the expectations of the eBLUE and eBLUP are finite (Kackar and Harville 1984). When using REML estimates for the variance components, these requirements are satisfied and the eBLUE and the eBLUP are bias-free estimators for *μ* and *β* (Jiang 1999).

Conditional on the estimation of the variance components (ignoring the randomness in the second stage of the estimation of the eBLUP), the variance-covariance matrix of the eBLUP equals(32)becauseholds.

We can transfer the results derived in model (6) to model(1)by using their equivalence in distribution. The gBLUP for *g* equals(33)The conditional variance-covariance matrix of *g* is obtained as(34)as well as the variance-covariance matrix of the BLUP (35)The variance-covariance matrix of the eBLUP equals

### Theoretical Variances of the Genomic Values in the REM

We review three different definitions of the theoretical variance of the genomic values in the REM (marker genotypes random, marker effects random, or both random). We focus the following analysis on the linear model (6) because of the explicit separation of marker genotypes and marker effects. For simplicity, we focus on the genomic variance in the current population. The results for the base population are obtained by replacing the data-generating process *X* with .

*Random genotypes and random effects*

If the marker genotypes as well as the marker effects are the source of genomic variation, we calculate the variance of the genomic value according to the law of total variance as:The unconditional expectation and the variance operator in the second line apply to the random marker effect vector *β*. Because of the model assumptions on the marker effects in Equation 5, and the mean-centered marker genotypes , we obtain(37)with the interpretation as the variance of a randomly sampled (representative) individual for a trait with random effects.

*Fixed genotypes and random effects*

If the genomic variation is caused by the marker effects only, and the marker genotypes are fixed, then the *n*-vector of genomic values is normally distributed:(38)In order to obtain an average theoretical variance of the individuals in the sample, we calculate the mean trace of the variance-covariance matrix of the genomic values:(39)This approximately equals the variance of a randomly sampled individual for a randomly sampled trait (see Equation 37). Even when the marker genotypes are fixed, their sample variance-covariance matrix contributes to the theoretical variance of the genomic values when averaging over the individuals in the sample.

*Random genotypes and fixed effects*

Probably the most common assumption on the nature of the genome is that the marker genotypes are random, whereas the marker effects are fixed (Falconer and Mackay 1996). In order to translate these assumption to the variance of the genomic values in the REM, we have to condition on the marker effects (*i.e.* we fix them). Then, the theoretical variance of the genomic values of a individual with random marker genotypes (representative individual), and with fixed marker effects, equals(40)and describes the genomic equivalent of the definition of the additive genetic variance (Bulmer 1971; Falconer and Mackay 1996).

### Estimation of the Additive Genomic Variance in the REM

In the sections *The expectation of the additive genomic variance* and *Best prediction of the additive genomic variance*, we have introduced ways to predict the random additive genomic variance(13)in the REM, namely by using the unconditional expectation(14)and the best predictor(21)In the following, we introduce estimators for these quantities and investigate their properties.

*Estimation of the unconditional expectation*

For any given positive semidefinite matrix **B**, the unconditional expectationin model (6) can be estimated byafter having obtained an estimate of the variance component . In the equivalent model (1), we can estimateby usingfor any positive semidefinite matrix **B**.

The specification of **B** as in the section *The expectation of the additive genomic variance* leads to the explicit form of the estimatorsand

*Empirical best prediction*

Because of equalities (Equations 29 and 30) and the variance-covariance matrix of *β*, we have thatConsequently, the best predictor of defined in Equation 21 equals(42)where(43)We have partitioned the best predictor of into the unconditional expectation of and the random variable *L*, which is realized in the phenotypic data *y*. The random variable *L* specifies the adaption of the best predictor to the data and incorporates the contribution of (marker) LD. The expectation of *L* over all possible data *y* is 0 becauseThe sign of the realization of *L* determines whether the best predictor is larger (positive weighted LD) or smaller (negative weighted LD) than the unconditional expectation.

The task of finding an eBP for is reduced to estimating the realized values of *L* because of the connection derived in Equation 42. We replace the BLUP and their variance-covariance matrix in Equation 43 by the eBLUP and their estimated variance-covariance matrix:We assume that we are provided with REML-estimators and for the variance components. Then, we findbecause the eBLUP is unbiased for *β*, *i.e.* (Jiang 1999). Unfortunately, the unbiasedness of the estimated variance-covariance matrix of the eBLUP can be asserted only in a trivial way by conditioning on the estimated variance components:Therefore, the expectation of (conditionally on the variance components) equals 0.

The same holds true forin the equivalent model, because the quantities and are linear combinations of and (see Equations 33 and 36).

Altogether, we can define the unbiased (conditionally on the estimated variance components) eBPfor the additive genomic variance .

The specification of B as in the section *Best prediction of the additive genomic variance* leads to the explicit formof the eBP for the additive genomic variance in the current population, and to the eBPfor the additive genomic variance in the base population.

Using the GRM **G** for a transformation to the base population is not well-defined because **G** is singular. However, because (see Equation 41), is commonly used, we want to find an analogous formula for the eBP in this set-up.

Instead of calculatingandwe useandas substitutes. Then, we defineas an approximation of the eBP of the additive genomic variance in the base population when using the GRM for the transformation.

## Footnotes

Supplemental material available at Figshare: https://doi.org/10.25386/genetics.8114117.

*Communicating editor: W. Valdar*

- Received March 4, 2019.
- Accepted July 30, 2019.

- Copyright © 2019 Schreck
*et al.*

Available freely online through the author-supported open access option.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.