## Abstract

Genomic data provide a valuable source of information for modeling covariance structures, allowing a more accurate prediction of total genetic values (GVs). We apply the kriging concept, originally developed in the geostatistical context for predictions in the low-dimensional space, to the high-dimensional space spanned by genomic single nucleotide polymorphism (SNP) vectors and study its properties in different gene-action scenarios. Two different kriging methods [“universal kriging” (UK) and “simple kriging” (SK)] are presented. As a novelty, we suggest use of the family of Matérn covariance functions to model the covariance structure of SNP vectors. A genomic best linear unbiased prediction (GBLUP) is applied as a reference method. The three approaches are compared in a whole-genome simulation study considering additive, additive-dominance, and epistatic gene-action models. Predictive performance is measured in terms of correlation between true and predicted GVs and average true GVs of the individuals ranked best by prediction. We show that UK outperforms GBLUP in the presence of dominance and epistatic effects. In a limiting case, it is shown that the genomic covariance structure proposed by VanRaden (2008) can be considered as a covariance function with corresponding quadratic variogram. We also prove theoretically that if a specific linear relationship exists between covariance matrices for two linear mixed models, the GVs resulting from BLUP are linked by a scaling factor. Finally, the relation of kriging to other models is discussed and further options for modeling the covariance structure, which might be more appropriate in the genomic context, are suggested.

PREDICTING genotypes and phenotypes plays an important role in many areas of life sciences. Both in animal and plant breeding, it is essential to predict the genetic quality (the so-called total genetic value, GV) of individuals or lines, on the basis of different sources of knowledge. Often, phenotypic measures for various traits are available and the additive genetic relationship between individuals (Wright 1922) can be derived, on the basis of the known pedigree. Best linear unbiased prediction (BLUP; Henderson 1973) of breeding values is a well-established methodology in animal breeding (Mrode 2005) and has recently gained relevance in plant breeding (Piepho*et al.* 2008). In both areas, the main interest is in complex traits with a quantitative genetic background.

In human medicine, the interest is in predicting phenotypes, rather than genotypes, for simple or complex traits (*e.g.,* the probability/risk to encounter a certain disease). Genetic prediction is mainly applied in the context of genetic counseling by predicting the risk of genetic disorders with known mono- or oligogenetic modes of inheritance and a certain history of cases in a known family structure, but accurate predictions of genetic predispositions to human diseases should also be useful for preventive and personalized medicine (delos Campos*et al.* 2010). Wray*et al.* (2007) discuss the potential use of prediction of the genetic liability for traits with a complex quantitative genetic background in a human genetics context, and the variety of possible methods, including linear models, penalized estimation methods, and Bayesian approaches was reviewed by delos Campos*et al.* (2010).

With the availability of high-throughput genotyping facilities (Ranade*et al.* 2001), genotypes for massive numbers of single nucleotide polymorphisms (SNPs) are available and can be used as an additional source of information for predicting GVs. Meuwissen*et al.* (2001) have suggested including SNP information in a statistical model of prediction. They used three statistical models: a model assigning random effects to all available SNPs (later termed “genomic BLUP”), assuming all SNP effects to be drawn from the same normal distribution, and two Bayesian models, where all (“Bayes A”) or a subset (“Bayes B”) of the random SNP effects are drawn from distributions with different variances. Various modifications of these methods and additional models have been subsequently suggested (Gianola*et al.* 2009).

Gianola*et al.* (2006) and Gianola and van Kaam (2008) have suggested a nonparametric treatment of genomic information by using reproducing kernel Hilbert spaces (RKHS) regression, which has already been demonstrated with real data (González-Recio*et al.* 2008, 2009). As was argued by delos Campos*et al.* (2009), the RKHS regression approach to genomic modeling represents a generalized class of estimators and provides a framework for genetic evaluation of quantitative traits that can be used to incorporate information on pedigrees, markers, or any other ways of characterizing the genetic background of individuals.

Opportunities to enhance genetic analyses by using nonparametric kernel-based statistical methods are enormous and these methods have been considered in different areas of genetic research. Schaid (2010a,b) provides an overview of measures of genomic similarity based on kernel methods and describes how kernel functions can be incorporated into different statistical methods like, *e.g.,* nonparametric regression, support vector machines, or regularization in a mixed model context. Only recently, kernel-based methods have also been used in association studies (Kwee*et al.* 2008; Yang*et al.* 2008) and QTL mapping for complex traits (Zou*et al.* 2010), which demonstrates their great potential and flexibility.

Prediction is also relevant in other areas of research: In large parts of geostatistics, the spatial distribution of variables (like temperature, humidity, ore concentration, etc.) is considered. On the basis of a given (limited) set of measurements, the prediction of the variable realization in any point of the considered space is of interest. A standard approach for prediction in this case is the so-called “kriging” (Chilès and Delfiner 1999), which makes use of a parameterized covariance function of the regionalized variables.

While in geostatistics the application of kriging is naturally limited to few dimensions, the basic approach is rather universal (Schölkopf*et al.* 2004). In this article we apply kriging to the genomic prediction problem. Here, one dimension reflects genotype realizations at one SNP. In the genomic context, with *p* SNPs, realizations are in a *p*-dimensional orthogonal hypercube. Due to the biallelic nature of SNPs, only three genotype realizations (coded, *e.g.,* as 0, 1, and 2) are possible in each dimension, so that the number of possible genotype constellations over *p* SNPs is 3* ^{p}*.

The concept of kriging is closely related to the concept of BLUP. Cressie (1989) provides a “historical map of kriging” up to 1963 in which he also refers to Henderson (1963) who introduced BLUP in animal breeding. The steps of kriging are equivalent to “empirical BLUP”-procedures known in other frameworks, and kriging can be viewed as a “spatial BLUP.” The conceptual equivalence of geostatistical kriging and BLUP has already been discussed by Harville (1984). Robinson (1991) provides a detailed review of the history of estimation of random effects via BLUP and its various derivations. He also points out the similarities between BLUP and kriging.

The equivalence of kriging with BLUP in a space spanned by *genomic* data was first noted by Piepho (2009), who also discusses relationships with other estimation principles, like ridge regression (Whittaker*et al.* 2000) and least squares support vector machines (Suykens*et al.* 2002). Comparing the performance of spatial mixed models to ridge regression with maize data, he found that spatial models provide an attractive alternative for prediction. He also points out that the BLUP model used in Meuwissen*et al.* (2001) has an interpretation as a spatial model with quadratic covariance function. Spatial models for genomic prediction were also used by Schulz-Streeck and Piepho (2010).

Moreover, kriging is known to be closely related to radial basis function (RBF) regression methods (Myers 1992). Long*et al.* (2010) showed with real and simulated data that nonparametric RBF regression methods can outperform Bayes A when predicting total GVs in the presence of nonadditive effects using SNP markers.

In this article we demonstrate the potential of the kriging approaches applied to genomic data: As a novelty, we suggest the family of Matérn covariance functions to reflect the functional dependency of the observed covariances from the distance of genotypes expressed as Euclidean norm. On the basis of this model and the assumed covariance function, we suggest two kriging approaches. Under both models, parameters and hidden variables are estimated via maximum likelihood (ML) and BLUP of the unknowns is established by solving the corresponding linear kriging systems. All predictions can also be implemented in the form of the so-called mixed model equations (Henderson 1973). The predictive performance of the two models is compared to a common genomic BLUP as a reference method in a whole-genome simulation study considering various gene-action models.

Furthermore, we show that in a limiting case the genomic covariance structure proposed by VanRaden (2008) can be considered as a covariance function with corresponding quadratic variogram. In addition we prove theoretically that predicted GVs are scaled only by a factor if the covariance structures are linearly transformed. Finally, we discuss further options for a more differentiated modeling using the suggested methodological approach.

## Methods

### Kriging

The term kriging stems from the prediction of ore concentrations in deposits and was mainly developed by Matheron (1962, 1963) on the basis of the master’s thesis of Krige (1951). In geostatistics, kriging is currently the standard approach whenever spatial prediction of a so-called regionalized variable (Matheron 1989), *e.g.,* temperature, ozone concentration or soil moisture, must be performed on the basis of a few isolated measurements of the quantity. It is assumed that the regionalized variable is a realization of a random function with a certain covariance structure. Mostly, the latter is given by a parameterized covariance function (Cressie 1993), and the random function is assumed to be Gaussian.

The kriging approach consists of two steps: (i) estimation of the unknown parameters and hidden variables (in particular by ML or REML) and (ii) prediction of the values of the regionalized variables by performing a BLUP, under the auxiliary assumption that the parameter values and hidden variables estimated in the first step are the true ones.

Many variants of the general kriging principle have been discussed (Cressie 1993). The type of kriging is implied by the unbiasedness condition: In “simple kriging” (SK), it is assumed that the underlying regionalized variable has zero mean, whereas in “universal kriging” (UK), a linear model for the mean of the underlying regionalized variable is assumed.

### The model for polygenic and genomic data

In our further studies, we assume to have *q* individuals with pedigree information, *n* of them being genotyped and having phenotype measurements of a certain quantitative trait. Typically, GVs have to be predicted for individuals that are genotyped, but have no phenotype data.

We use the following model for the given datawhere *y _{i}* is a measurement of the phenotype for individual

*i*,

**β**is an

*f*-vector of nuisance location parameters,

**x**

*is a*

_{i}*p*-vector of dummy SNP instance variates (genotype) observed on individual

*i*, and

*g*is an unknown, random function as described below. Let be a

*q*-vector of additive genetic effects of

*q*individuals, where is the additive genetic variance due to unmarked polygenes, and

**A**is the numerator relationship matrix. The entries of the numerator relationship matrix are twice the coefficients of coancestry between individuals. The vectors and are known incidence vectors;

**z**

*is a unit vector with one component being 1 and all the others zero, indicating the respective position in the pedigree. Let*

_{i}**e**= (

*e*

_{1},…,

*e*)

_{n}*be the vector of environmental residual effects with where is the environmental variance.*

^{T}We assume that {*g*(**x*** _{i}*),

**x**

*ε } is a Gaussian random field (Lifshits 1995) with (*

_{i}*g*(

**x**

*)) = 0 and covariance structure given by where is a covariance function depending on parameters ν,*

_{i}*h*, and

*σ*

_{K}_{.}Let be the corresponding covariance matrix.

### The family of Matérn covariance functions

For the covariance structure we suggest using the so-called family of Matérn covariance functions, which was introduced by Matérn (1986) and Handcock and Wallis (1994) and is defined byHere, ‖ · ‖ is the Euclidean norm, ν > 0 is a smoothness parameter, *h* is a scale parameter, is the variance parameter, and is a modified Bessel function of the second kind of order ν (Abramowitz and Stegun 1984). The Matérn function is isotropic, in that Cov(*g*(**x*** _{i}*),

*g*(

**x**

*)) depends only on the Euclidean norm of the separation vector*

_{j}**x**

*–*

_{i}**x**

*.*

_{j}Matérn covariance functions build a very general class of covariance functions including special cases like the exponential (ν = ) and the Gaussian (ν = ∞) covariance function, the ones that have also been used by Piepho (2009). If the smoothness parameter ν is of the form *m* + , where *m* is an integer, the Matérn function factorizes into the product of an exponential function and a polynomial of degree *m*; *cf.* Table 1 and Figure 1. The best fitting parameter value ν is determined through the model-fitting approaches described below.

In matrix notation, the statistical model is(1)where is an (*n* × *f*)- and is an (*n* × *q*)-incidence matrix and *g*(**X**) = (*g*(**x**_{1}),…, *g*(**x*** _{n}*))

*. Finally, we assume that the random vectors*

^{T}**u**,

**e,**and

*g*(

**X**) are independent.

### Two kriging approaches and a reference model

We consider two models to predict the total genetic value of a certain genotyped individual indexed by 0. This individual belongs to the set of *q* individuals, but it does not have to be phenotyped. The models differ in the size of the sets of quantities that are estimated in the first kriging step and subsequently used for predictions.

### Universal kriging: Modeling of *y*

We exploit the fact that **y** has a multivariate normal distribution,and estimate the parameters β, σ_{u}, σ_{e}, ν, *h* and σ* _{K}* by maximizing the log likelihood of the corresponding density function.

Then, we perform a best linear unbiased prediction of *g*(**x**_{0}) and *i.e.,* we apply the BLUP principle: To obtain we minimizewith the linear predictor under the condition This approach is called universal kriging in other areas of research (Cressie 1993). In fact, the condition assures and therefore *i.e.,* is unbiased. Let The approach results in the following kriging system of equations:Note that this linear system does not depend on **β**. Analogously, can be predicted by the universal kriging estimator where **a _{u}** satisfiesand one gets as BLUP of

In the animal breeding context it is well known that a BLUP approach for the model **y** = **Wβ** + **Zu** + **g**(**X**) + **e** is equivalent to solving the mixed model equations (MME)(2)for given variance components estimated, *e.g.,* by ML. For a derivation of the MME from the kriging system see, *e.g.,* Dempfle (1982).

### Simple Kriging: Joint modeling of y, u, and g(X)

In the second approach we model the hidden variables **u** and **g**(**X**) explicitly and consider the joint density function *f*** _{y,u,g}** of

**y**,

**u,**and

**g**(

**X**), which equalswith

Here, we have to estimate the parameters **β**, σ_{u}, σ_{e}, ν, *h,* σ* _{K}* and the hidden variables

**u**and

**g**(

**X**). Note that in this approach we consider

**u**and

**g**(

**X**) to be parameters that have to be estimated via ML in the first kriging step. Therefore, we maximize the log likelihood

*J*of the density function

*f*

_{y}_{,u,g};

*i.e.,*we maximize(3)with respect to

**β**,

**u,**and

**g**(

**X**). Taking the derivatives with respect to

**β**,

**u,**and

**g**(

**X**) leads to the linear system given in (2), which yields estimators for

**β**,

**u,**and

**g**(

**X**). When using these estimates in Equation 3, the value of

*J*depends only on σ

_{u}, σ

_{e}, ν,

*h,*and σ

_{K}_{.}Thus,

*J*can be maximized numerically with respect to these parameters, leading to estimates for

**β**, σ

_{u}, σ

_{e}, ν,

*h,*σ

*,*

_{K}**u,**and

**g**(

**X**). According to the kriging philosophy, we now assume the values of the estimators (especially the value of the estimator for

**g**(

**X**)) to be the true ones, and

*g*(

**x**

_{0}) is predicted via by the BLUP principle. That is, we minimizewith the linear estimator

This approach is called simple kriging (Cressie 1989, 1993; Chilès and Delfiner, 1999). Note that is always unbiased. The solution is(4)

Finally, the predicted GV is given by where is the estimator obtained in the iterative procedure described above.

### Reference model: genomic BLUP

This approach performs a genomic BLUP on the basis of the modelwhich leads to the kriging systemand predicting

Here, and **Z** are defined as in the previous approaches. The vector is multivariate normal with **G** being a genomic relationship matrix calculated by using the approach of VanRaden (2008). (For the definition of the genomic relationship matrix see the formulas in the *Appendix*.) The matrix is a known incidence matrix whose rows consist of unit vectors with one component being 1 and all the others zero, indicating the respective position in the **g**-vector. Variance components for this model are estimated via ML.

### Simulation study

In a first step, four types of simulations were performed differing in the hypothetical gene-action scenario: additive, additive dominance with two different ratios of dominance variance to additive variance, and epistasis. For each scenario 50 independent simulations were run, resulting in 50 data sets per scenario.

The simulation process basically followed that of Meuwissen*et al.* (2001), Solberg*et al.* (2008), and Long*et al.* (2010).

### Population and genome

In each scenario, the population evolved during 1000 generations of random mating and random selection with a population size of 100 (50 males and 50 females) in each generation to reach a mutation-drift balance. After 1000 generations, the population size was increased to 500 at generation *t* = 1001 by mating each male with 10 females, with one offspring per mating pair. In generations *t* = 1002,…, *t* = 1011 offspring were born from random mating of individuals of the previous generation. The 1500 individuals of generations 1008, 1009, and 1010 were used as estimation set, the 500 individuals of generation 1011 formed the validation set for which total GVs were predicted. Pedigree data were recorded for individuals of the last 10 generations. SNP data of individuals were recorded both for the estimation and the validation set. Phenotypes were stored only for individuals of the estimation set.

The simulated genome consisted of one chromosome of length 1 M, containing 100 equally spaced putative QTL. Each QTL was flanked by 30 equally spaced SNP markers resulting in 3030 markers (*M*) in total. The layout of the chromosome was therefore given by

Starting with monomorphic loci in the base generation, mutation rates at QTL and SNP markers were 2.5 × 10^{−3} per locus per generation (*t* = 1,…, *t* = 1000), to obtain an adequate number of segregating (biallelic) loci. On average, simulation resulted in 2745 segregating markers and 98 segregating QTL in generation *t* = 1001. Only segregating markers and QTL were considered in the following generations. True total GVs were obtained by summing up the QTL effects resulting from the following three gene-action models.

### Three different gene-action models

#### Additive scenario A:

Each QTL locus had an additive effect only, without dominance or epistasis. The additive effect (*a)* was equal to the allele substitution effect, such that for genotypes *QQ*, *Qq,* and *qq* their GVs were 2*a*, *a,* and 0, respectively. The value of *a* at each QTL locus was sampled from a normal distribution .

#### Additive-dominance scenarios AD1 and AD2:

Each QTL locus had both an additive and a dominance effect. Two different scenarios were considered, setting the ratio of dominance variance to additive variance at each QTL to δ = 1 or δ = 2. The additive effects (*a)* were obtained as in the additive scenario. Given the additive effect *a _{i}* and allele frequency

*p*at the

_{i}*i*th locus, its dominance effect (

*d*) was determined by solving the equation,[see Falconer and Mackay (1996)]. Genetic values at that locus were then given by 2

_{i}*a*,

*a*+

*d,*and 0 for genotypes

*Qq,*and

*qq,*respectively.

For simplicity, independence between QTL was assumed and, as a result, the total additive (dominance) variance was summed over all loci.

#### Epistasis scenario E:

In this model there was no additive or dominance effect at any of the individual QTL. Epistasis existed only between pairs of QTL. The forms of epistasis included additive × dominance (*A* × *D*), dominance × additive (*D* × *A*), and dominance × dominance (*D* × *D*). Additive and (*A* × *A*) epistatic effects were excluded, to prevent the additive variance from dominating the total genetic variance.

All segregating QTL were involved in epistatic interactions. QTL were randomly chosen to form pairs and each pair was assigned an (*A* × *D*) interaction effect *ℓ _{AD}*, a (

*D*×

*A*) interaction effect

*ℓ*and a (

_{DA},*D*×

*D*) interaction effect

*ℓ*, which were all equal and sampled from a normal distribution . Given a pair of QTL (

_{DD}*i*= 1, 2), its epistatic value was given bywhere

*x*and

_{i}*z*were additive and dominance codes at locus

_{i}*i*, respectively. For genotype

*i*,

*x*= 1,

_{i}*z*= −0.5; for

_{i}*x*= 0,

_{i}*z*= 0.5; and for

_{i}*x*= −1,

_{i}*z*= −0.5; compare Cordell (2002). The total GV was the sum of the epistatic values produced by the QTL pairs.

_{i}Note that although no additive, dominance, and (*A* × *A*) epistatic effects were explicitly simulated, the model still generated additive dominance , and epistatic variances. The procedure of estimating these variance components followed Cockerham (1954), assuming independence between two loci of each QTL pair and between QTL pairs.

On average, simulation in the epistatic scenario resulted in a broad-sense heritability of 0.84. Furthermore, 30% of the total genetic variance was attributed to additive effects, 27% was due to dominance effects, 14% was attributed to (*A* × *A*) effects, 25% was due to (*D* × *A*) and (*A* × *D*) effects, and 4% was due to (*D* × *D*) effects.

In all scenarios phenotypic records were obtained by adding a normally distributed residual term to the total GVs of the individuals. The environmental variance was obtained such that the narrow sense heritability was 0.25 in all scenarios.

### Additional scenarios

Four additional scenarios based on scenario AD1 were simulated, to analyze the influence of the number of chromosomes, the QTL architecture, the SNP density, and a polygenic effect on the prediction accuracy:

*Scenario AD1.2*: Three chromosomes of length M were simulated, each containing 33 equally spaced QTL and 1000 SNPs.*Scenario AD1.3*: Three chromosomes of length M were simulated, each of them containing 1000 SNPs and the first two of them containing 50 equally spaced QTL. The third chromosome contained no QTL.*Scenario AD1.4*: The same as scenario AD1.2 but with each chromosome containing 33 equally spaced QTL and 3000 SNPs.*Scenario AD1.5*: The same as scenario AD1, but additionally a polygenic effect*u*was simulated, starting from generation 1006. Here, the ratio of additive QTL variance to polygenic variance was set to 3. The polygenic effect*u*of an offspring was calculated as 0.5 · (*u*_{mother}+*u*_{father}) +*m*, where*m*is its Mendelian sampling term drawn from a normal distribution

with *F*_{mother} and *F*_{father} being the inbreeding coefficients of the corresponding mother and father. Here, the true total GV was obtained by summing up the QTL effects and the polygenic effect.

### Statistical analyses

The three methods were compared for their accuracy of predicting the true GVs of the individuals in generation *t* = 1011. For this we applied the three approaches to the 50 simulated data sets consisting of 5500 individuals, the last 5000 of them having pedigree information and the last 2000 of them being fully genotyped, as described in the previous section. Total GVs of the nonphenotyped individuals in generation *t* = 1011 (validation set) were predicted. Thereby, parameters and hidden variables were estimated with the help of 1500 individuals (generations 1008 − 1010, estimation set).

All approaches were implemented in R (R Development Core Team 2007). The ML estimation of the parameters and hidden variables was done using the R-package RandomFields v. 2.0.23 (Schlather 2001–2009) and its function “fitvario.” The function fitvario determines the ML by the function “optim” of R with automatically created starting values.

All models were run on a 1.9-GHz PC running Linux. On average, computing times per data set ranged from approximately 20 min (genomic BLUP) over 77 min (universal kriging) to 227 min (simple kriging), but no special efforts were made to achieve computational efficiency at this stage.

For each method and each gene-action scenario, we computed the correlation between the predicted and the true GVs. This was done both for the estimation set of 1500 individuals and for the validation set of 500 individuals. In addition, we calculated the average true GV of the 50 individuals with the highest predicted GVs in the validation set. Finally, results were summarized by averaging over the 50 data sets and a paired *t*-test was applied to test for significant differences between each pair of characteristics at the 1% significance level.

## Results and Discussion

The results of 50 replicates for the different gene-action models and scenarios are shown in Tables 2–3.

In the additive scenario, universal kriging yields a correlation between predicted and true simulated GVs, which is almost as high as the correlation obtained by the reference method genomic BLUP, both in the estimation and in the validation set (*cf.* Table 2), while simple kriging yields the lowest correlations both in the estimation and in the validation set.

The results are similar to the findings of Piepho (2009) and Schulz-Streeck and Piepho (2010) who also report that for an additive true genetic model the prediction accuracies for ridge regression (with covariance structures based on relationship matrices) and spatial models (with covariance structures based on covariance functions) are similar.

In the AD and E scenarios, universal kriging outperforms genomic BLUP in both estimation and validation set by showing the highest average correlations. The difference in correlations of universal kriging and genomic BLUP is highest in the E scenario and the scenario with the higher ratio of dominance to additive variance (≈ 0.03 for the results of the validation set, which is an increase of accuracy by approximately 7%).

Scatterplots of the correlations of the 50 replicates for the different methods and scenarios are shown in Figure 2, which also demonstrate the better performance of universal kriging in the presence of dominance and epistasis. With the degree of nonadditivity ((E, AD2) > AD1 > A) the accuracy of prediction in the validation set compared to the estimation set deteriorates.

Comparing the average true GV of the 50 individuals (10%) ranked best by prediction in the validation set (*cf.* Table 3), universal kriging and genomic BLUP yield results that are not significantly different from each other both in the A and AD scenarios, while universal kriging outperforms genomic BLUP in the E scenario. Again, simple kriging performs worst in all scenarios apart from AD1.

All three methods, being unbiased by definition, show almost no empirical bias of total GVs (results not shown).

The results of the additional scenarios AD1.2 to AD1.5 indicate that the predictive ability of the universal kriging approach is robust with respect to the number of chromosomes, the QTL distribution, the SNP density, and the inclusion of a polygenic effect (*cf.* Table 4). In scenario AD1.4 with higher SNP density the absolute values of correlations between true and predicted GVs are slightly higher compared to scenario AD1.2 with lower SNP density. In scenario AD1.5 the absolute values of correlations between true and predicted GVs are lower for all three methods.

Overall, results indicate the superiority of universal kriging over genomic BLUP in the presence of nonadditive effects. Simple kriging was shown to have a poorer predictive ability compared to universal kriging and genomic BLUP in all considered gene-action models and scenarios.

The poorer predictive ability of simple kriging is most likely due to the high number of parameters estimated in the first kriging step and the resulting numerical difficulties in optimization. In simple kriging 3505 parameters are estimated compared to only 5 parameters in universal kriging and 3 parameters in genomic BLUP. The poor performance of simple kriging and the influence of the high-dimensional parameter space need further investigations, especially as simple kriging is known to work well in low-dimensional geostatistical frameworks.

The simulation study is primarily meant as a “proof of concept.” Results demonstrate that the suggested kriging procedures based on the Matérn function are able to yield competitive results, despite the fact that the modeling of the genomic part of the data by use of the Matérn function follows a completely different reasoning than in the usual methods. This also demonstrates the flexibility of the basic kriging principle.

The importance of the Matérn family is highlighted by Stein (1999), who recommends the use of the Matérn model in the context of prediction of spatial data. The Matérn model has been widely used in other areas of research; see Guttorp and Gneiting (2006) for a historical excursion. One of the most important reasons for adopting the Matérn model is the inclusion of the parameter ν in the model, which controls the smoothness of the underlying random field. Whereas Stein (1999) advocates the simultaneous estimation of all relevant parameters via (restricted) maximum likelihood, Ruppert*et al.* (2003) and Nychka (2000) remark that the likelihood-based estimation of *h* and ν may lead to problems as both parameters enter in a nonlinear fashion, which may cause the ML fitting to be computationally intensive. Our experience so far indicates that the simultaneous estimation of all relevant parameters is feasible.

As an alternative to the ML estimation of parameters, one could also use REML (Patterson and Thompson 1971) to adjust for the loss of degrees of freedom caused by the fixed effects and to produce less biased estimates. In our simulation study there is only one fixed effect (*i.e.,* **β** is a scalar and **W** = (1,…,1)* ^{T}*), such that there will be little difference between REML and ML estimates for variance components in the reference method GBLUP (Abney

*et al.*2000; Ruppert

*et al.*2003; Bonate 2006; Webster

*et al.*2006). This is also mostly the case in practical applications, where highly accurately predicted GVs are used as phenotypes and only an overall mean is included in the model. With respect to the parameter estimates in the kriging approaches using the Matérn function, it is not clear whether REML is preferable to ML, as the parameters

*h*and ν enter in a nonlinear fashion.

### Relation between the Matérn covariance function and the covariance matrix of VanRaden (2008)

To investigate the general relationship between covariance matrices based on the Matérn function and the genomic relationship matrix of VanRaden (2008), we consider the so-called variograms.

For a random field {*g*(**x**), **x** ε * ^{s}*}, the theoretical variogram is defined by γ(

**x**

*,*

_{i}**x**

*) = 0.5((*

_{j}*g*(

**x**

*) −*

_{i}*g*(

**x**

*))*

_{j}^{2}) for

**x**

*,*

_{i}**x**

*ε*

_{j}*. If and (*

^{s}*g*(

**x**

*)) = 0 for all*

_{i}**x**

*ε*

_{i}*, the variogram is given byfor*

^{s}**x**

*,*

_{i}**x**

*ε*

_{j}*. If further Cov(*

^{s}*g*(

**x**

*),*

_{i}*g*(

**x**

*)) depends only on the Euclidean distance ‖*

_{j}**x**

*−*

_{i}**x**

*‖, the variogram γ can be considered as a function on [0, ∞).*

_{j}In the *Appendix* we show that in a limiting case (in which the number of SNPs tends to infinity) the covariance structure of VanRaden (2008) depends only on the Euclidean distance between the SNP vectors and that the corresponding variogram is a quadratic function on [0,∞).

In all kriging procedures, ν was estimated to be larger than 5, indicating an approximately Gaussian form of the covariance function. In factThe corresponding variogram for the Matérn function is then given byfor *x* ε [0, ∞) by Taylor expansion up to the second derivative around zero. This means the corresponding variogram of the Matérn function is approximately quadratic for small distances *x* and for ν→∞. If both variograms, the one induced by VanRaden's covariance structure and γ_{Matern}, were exactly quadratic, the corresponding covariance matrices would be linear transformations of each other. The equivalence of a quadratic covariance function and the second-order Taylor expansion of the Gaussian model has also been noted by Piepho (2009).

Note that the Matérn covariance function is at least three times differentiable for ν>1.5 (Guttorp and Gneiting 2006), such that it is still possible to derive a second-order Taylor expansion for 1.5 < ν < ∞, leading to a quadratic variogram for small distances *x* as well.

### Using linear transformations of covariance matrices leads to linear transformed predicted GVs

In this context another interesting relation can be shown: There is a linear relation between the predicted GVs, if there is a linear relation between the phenotypic covariance matrices **B** and and a linear relation between the covariance vectors **B**_{0} and on the right-hand sides of the kriging systems under the assumption that *W* = (1,…, 1)* ^{T}* =

**j**and thatis invertible: In detail, it can be shown thatfor the linear (kriging) systems(5)and(6)with

*d*≠0 and

**J**=(

**j**,…,

**j**), from which we get The proof of this result can be found in the

*Appendix*.

The general result has important practical implications: It is shown that predictions resulting from the two systems (5) and (6) are identical although a constant (*c* and ) is added to the phenotypic covariance matrix or the covariance vector on the right-hand side of the kriging system, or to both. In the genetic context, such a modification changes relevant population parameters, like heritabilities as well as genetic and phenotypic correlations. Despite this, predicted GVs remain completely unaffected.

Scaling the phenotypic covariance matrix and the covariance vectors by a factor (*d* and ) also changes the heritability, but is shown to lead to a mere linear transformation of the GVs, thus providing an identical ranking of individuals according to their predicted GVs. However, results obtained from such a scaled system might lead to a higher or lower level of mean squared errors.

As stated before, solving the kriging systems is equivalent to solving the corresponding MME. Hence, we have also proved that the solutions and of the MME are scaled by the factor if the phenotypic covariance matrix and the covariance matrix of **Zu** + **g**(**X**) are linearly transformed.

To our knowledge, the above theoretical result (including the scaling factors *d* and ) has not been proved elsewhere in this explicit form, but some authors refer to the invariance of the predictions to the addition of a multiple of the matrix **J**: It is well known that in ordinary kriging with constant mean, one needs only to know the covariance function up to a constant (Matheron 1971; Christensen 1990). Kitanidis (1993) discusses in the context of so-called “generalized covariance functions” the variability among the covariance functions that behave identically in terms of prediction. The invariance to the addition of a multiple of **J** in a mixed model context is also mentioned in Piepho (2009).

### Reproducing kernel Hilbert space approach

In this subsection we contrast our approach to the reproducing kernel Hilbert spaces approach of Gianola and Van Kaam (2008). Stein (1999) strongly advocates use of the Matérn family because of the wide range of smoothness controlled by the smoothness parameter 0 < ν < ∞. In our study ν was estimated to be larger than 5 in all kriging procedures, indicating an approximately Gaussian form of the covariance function, the one used by Gianola and van Kaam (2008). Gianola and van Kaam (2008) use the same model as in (1) except for the assumption that *g* is a Gaussian random function. They consider the functional,where *g* and are implicitly assumed to be elements of a reproducing kernel Hilbert space for fixed **β** and **u**. Then, the representer theorem (Schölkopf*et al.* 2001) states that the minimizer of *J*(**g**|*s*) has the form(7)where the *α _{i}*’s are unknown coefficients. The function to be minimized becomes

Gianola and van Kaam (2008) state further that a random-effects treatment of leads to the functionalwhich then must be minimized. Taking the gradients of *J*(**β**, u, **α**|*s*) with respect to **β**, **u,** and **α** and setting them to zero leads to the following linear system of equations:.(8)

By equating , and Equations 2 and 8 are obviously identical, as well as Equations 4 and 7. Finally, Gianola and van Kaam (2008) proceed with embedding the above approach into a Bayesian framework.

The approach of Gianola and van Kaam (2008) and our approach are different in that we maximize the full likelihood whereas they drop the summand log(*c*) in Equation 3. Note that *c* depends on the unknown parameters, *i.e.,* the variance components and the parameters of the Matérn covariance function. Dropping the summand log(*c*) therefore leads to different estimates of the parameters. Scheuerer (2011) argues that the factor *c* might be included even in the framework of reproducing kernel Hilbert spaces. Hence, maximizing *J* in (3) is partially justified even if the normal assumption for the *e _{i}* does not hold.

### Further options

The general nonparametric approach of basing the prediction on a covariance function offers a number of possibilities for more differentiated modeling. While in spatial statistics using the Euclidean distance is a natural choice, other distance metrics (Reif*et al.* 2005) may be more adequate in the genomic context. With dense marker maps it is found that the genome is structured in haplotype blocks of varying length (International Hapmap Consortium 2005; Qanbari*et al.* 2010) within which the loci are in high linkage disequilibrium; *i.e.,* genotypes are highly correlated. Here, it might be adequate to account for this nonindependence in the definition of the scale, since otherwise highly correlated loci will lead to a massive double counting. A further option is to implement a feature selection, which could, *e.g.,* give a higher weight to SNPs that are positioned in genomic regions that are found to be relevant for the physiological pathways (Wang*et al.* 2007) underlying the studied trait complex.

### Total GVs

Prediction of the total GV of an individual, including nonadditive components, is of different relevance in different fields. In animal breeding, the value of a breeding animal is mostly determined by its so-called breeding value, which is purely additive. While it is possible to predict nonadditive genetic components even in pedigree-based estimation procedures (see, *e.g.,* Hoeschele 1991; de Boer and Hoeschele 1993), these components are in general not transmitted to the offspring and therefore are mostly considered as nuisance parameters in animal breeding.

In plant breeding, prediction of the total GV as part of the phenotype is more relevant, especially since the biological nature of some crop species and/or reproductive biotechnologies allow an identical reproduction (cloning) of given genotypes. Complex gene models including dominance and epistasis might be especially useful in predicting crossbred performance, but the relevance is rather diverse across the agriculturally used plants (Holland 2001).

It was recently suggested that under polygenic inheritance the additive part is the dominating genetic component (Hill*et al.* 2008) and that under directional selection the rate of change is largely determined by the additive genetic variance, so that attempts to include nonadditive terms in prediction might be, at best, useless or even harmful (Crow 2010). These arguments pertain both to animal and plant breeding and need careful consideration based on empirical evidence.

Predicting the genetic disposition in humans in the context of preventive and personalized medicine using whole genome markers is a relatively new and controversial topic (see delos Campos*et al.* 2010 for a review). The main motivation to consider such approaches comes from the phenomenon that even in extremely large-scale studies the genetic background of complex diseases cannot be sufficiently determined with classical mapping approaches (the so-called “case of the missing heritability”; Maher (2008)). Disposition for complex diseases is assumed to be affected to a considerable extent by nonadditive allelic interactions, and hence models allowing for such interactions are expected to yield improved predictions compared to purely additive models.

One data set and the corresponding R-code for the prediction of GVs are available on http://www.stochastik.math.uni-goettingen.de/∼schlather/genoKriging/.

## Acknowledgments

The authors thank two anonymous referees for their valuable comments on the manuscript and their constructive suggestions. The authors are also grateful to Daniel Gianola for his critical comments on an earlier version of the manuscript. Alexander Malinowski and Marco Oesting have provided additional comments and hints. Parts of the analyses were carried out during a research stay of the corresponding author at the University of Wisconsin in Madison. This research was funded by the German Federal Ministry of Education and Research (BMBF) within the AgroClustEr “Synbreed—Synergistic plant and animal breeding” (FKZ 0315528C) in association with the Deutsche Forschungsgemeinschaft (DFG) research training group “Scaling Problems in Statistics” (RTG 1644).

## APPENDIX

### Relation Between the Matérn Covariance Function and the Covariance Matrix of VanRaden (2008)

We show that the covariance structure of VanRaden (2008) leads to a quadratic variogram γ in a limiting case. The covariance matrix of VanRaden (2008) is defined aswhere **M** is the (*n* × *s*)-matrix of SNP vectors for the *n* animals with *s* SNPs coded by −1, 0, 1 and the *j*th column of **P** is (2(*p _{j}* − 0.5),…, 2(

*p*− 0.5))

_{j}*, where*

^{T}*p*is the frequency of the second allele at locus

_{j}*j*.

Let and let . In the GBLUP model we assumed It follows easily that(9)where **m**_{i}_{•} denotes the *i*th row of **M** and is the Euclidean norm. Consider M* _{ij}* as a random variable with values −1, 0, 1 and corresponding probabilities . Then (

*M*) = 2(

_{ij}*p*− 0.5) and Var(

_{j}*M*) = 2

_{ij}*p*(1 −

_{j}*p*) for all

_{j}*i*= 1,…,

*n*. With

*Y*= (M

_{j}*− 2(*

_{ij}*p*− 0.5))

_{j}^{2}we have (

*Y*) = Var(

_{j}*M*) = 2

_{ij}*p*(1 −

_{j}*p*) and(10)

_{j}Now consider the limiting case *s* → ∞ and assume the series *p*_{1}, *p*_{2},… and (1 − *p*_{1}), (1 − *p*_{2}),… to be uniformly bounded away from zero, which implies(11)for some *c* > 0 and for all *s*. Assume further that *Y*_{1}, *Y*_{2},…, are uncorrelated. Because of Var(*Y _{i}*) < ∞ we can apply Rajchman’s (1932) version of the strong law of large numbers (cited by Krengel 2005, p. 154), which yieldswith probability 1 for

*s*→ ∞. Because of (11) we also havewith probability 1 for

*s*→ ∞, from which we get that the left-hand side of (10) converges to 1 with probability 1 for

*s*→ ∞. Together with (9) it follows thatwith probability 1 for

*s*→ ∞,

*i.e*.,for

*s*large. Hence, Cov(

*g*,

_{i}*g*) depends only on the Euclidean distance of the SNP vectors for

_{j}*s*large. If we consider

*g*as the value of a random field on at position

_{i}*m*

_{i}_{•}, then the corresponding variogram isfor

*s*large,

*i.e.,*for

*x*ε [0, ∞).

### Proof: Using Linear Transformations of Covariance Matrices Leads to Linear Transformed Predicted GVs

The proof starts with calculatingHere we used the unbiasedness condition . Hence we obtainFurthermore, we haveThus, we obtain,which finishes the proof.

- Received March 20, 2011.
- Accepted April 14, 2011.

- Copyright © 2011 by the Genetics Society of America

Available freely online through the author-supported open access option.