- Split View
-
Views
-
Cite
Cite
Ulrike Ober, Malena Erbe, Nanye Long, Emilio Porcu, Martin Schlather, Henner Simianer, Predicting Genetic Values: A Kernel-Based Best Linear Unbiased Prediction With Genomic Data, Genetics, Volume 188, Issue 3, 1 July 2011, Pages 695–708, https://doi.org/10.1534/genetics.111.128694
- Share Icon Share
Abstract
Genomic data provide a valuable source of information for modeling covariance structures, allowing a more accurate prediction of total genetic values (GVs). We apply the kriging concept, originally developed in the geostatistical context for predictions in the low-dimensional space, to the high-dimensional space spanned by genomic single nucleotide polymorphism (SNP) vectors and study its properties in different gene-action scenarios. Two different kriging methods [“universal kriging” (UK) and “simple kriging” (SK)] are presented. As a novelty, we suggest use of the family of Matérn covariance functions to model the covariance structure of SNP vectors. A genomic best linear unbiased prediction (GBLUP) is applied as a reference method. The three approaches are compared in a whole-genome simulation study considering additive, additive-dominance, and epistatic gene-action models. Predictive performance is measured in terms of correlation between true and predicted GVs and average true GVs of the individuals ranked best by prediction. We show that UK outperforms GBLUP in the presence of dominance and epistatic effects. In a limiting case, it is shown that the genomic covariance structure proposed by VanRaden (2008) can be considered as a covariance function with corresponding quadratic variogram. We also prove theoretically that if a specific linear relationship exists between covariance matrices for two linear mixed models, the GVs resulting from BLUP are linked by a scaling factor. Finally, the relation of kriging to other models is discussed and further options for modeling the covariance structure, which might be more appropriate in the genomic context, are suggested.
PREDICTING genotypes and phenotypes plays an important role in many areas of life sciences. Both in animal and plant breeding, it is essential to predict the genetic quality (the so-called total genetic value, GV) of individuals or lines, on the basis of different sources of knowledge. Often, phenotypic measures for various traits are available and the additive genetic relationship between individuals (Wright 1922) can be derived, on the basis of the known pedigree. Best linear unbiased prediction (BLUP; Henderson 1973) of breeding values is a well-established methodology in animal breeding (Mrode 2005) and has recently gained relevance in plant breeding (Piephoet al. 2008). In both areas, the main interest is in complex traits with a quantitative genetic background.
In human medicine, the interest is in predicting phenotypes, rather than genotypes, for simple or complex traits (e.g., the probability/risk to encounter a certain disease). Genetic prediction is mainly applied in the context of genetic counseling by predicting the risk of genetic disorders with known mono- or oligogenetic modes of inheritance and a certain history of cases in a known family structure, but accurate predictions of genetic predispositions to human diseases should also be useful for preventive and personalized medicine (delosCamposet al. 2010). Wrayet al. (2007) discuss the potential use of prediction of the genetic liability for traits with a complex quantitative genetic background in a human genetics context, and the variety of possible methods, including linear models, penalized estimation methods, and Bayesian approaches was reviewed by delosCamposet al. (2010).
With the availability of high-throughput genotyping facilities (Ranadeet al. 2001), genotypes for massive numbers of single nucleotide polymorphisms (SNPs) are available and can be used as an additional source of information for predicting GVs. Meuwissenet al. (2001) have suggested including SNP information in a statistical model of prediction. They used three statistical models: a model assigning random effects to all available SNPs (later termed “genomic BLUP”), assuming all SNP effects to be drawn from the same normal distribution, and two Bayesian models, where all (“Bayes A”) or a subset (“Bayes B”) of the random SNP effects are drawn from distributions with different variances. Various modifications of these methods and additional models have been subsequently suggested (Gianolaet al. 2009).
Gianolaet al. (2006) and Gianola and van Kaam (2008) have suggested a nonparametric treatment of genomic information by using reproducing kernel Hilbert spaces (RKHS) regression, which has already been demonstrated with real data (González-Recioet al. 2008, 2009). As was argued by delosCamposet al. (2009), the RKHS regression approach to genomic modeling represents a generalized class of estimators and provides a framework for genetic evaluation of quantitative traits that can be used to incorporate information on pedigrees, markers, or any other ways of characterizing the genetic background of individuals.
Opportunities to enhance genetic analyses by using nonparametric kernel-based statistical methods are enormous and these methods have been considered in different areas of genetic research. Schaid (2010a,b) provides an overview of measures of genomic similarity based on kernel methods and describes how kernel functions can be incorporated into different statistical methods like, e.g., nonparametric regression, support vector machines, or regularization in a mixed model context. Only recently, kernel-based methods have also been used in association studies (Kweeet al. 2008; Yanget al. 2008) and QTL mapping for complex traits (Zouet al. 2010), which demonstrates their great potential and flexibility.
Prediction is also relevant in other areas of research: In large parts of geostatistics, the spatial distribution of variables (like temperature, humidity, ore concentration, etc.) is considered. On the basis of a given (limited) set of measurements, the prediction of the variable realization in any point of the considered space is of interest. A standard approach for prediction in this case is the so-called “kriging” (Chilès and Delfiner 1999), which makes use of a parameterized covariance function of the regionalized variables.
While in geostatistics the application of kriging is naturally limited to few dimensions, the basic approach is rather universal (Schölkopfet al. 2004). In this article we apply kriging to the genomic prediction problem. Here, one dimension reflects genotype realizations at one SNP. In the genomic context, with p SNPs, realizations are in a p-dimensional orthogonal hypercube. Due to the biallelic nature of SNPs, only three genotype realizations (coded, e.g., as 0, 1, and 2) are possible in each dimension, so that the number of possible genotype constellations over p SNPs is 3p.
The concept of kriging is closely related to the concept of BLUP. Cressie (1989) provides a “historical map of kriging” up to 1963 in which he also refers to Henderson (1963) who introduced BLUP in animal breeding. The steps of kriging are equivalent to “empirical BLUP”-procedures known in other frameworks, and kriging can be viewed as a “spatial BLUP.” The conceptual equivalence of geostatistical kriging and BLUP has already been discussed by Harville (1984). Robinson (1991) provides a detailed review of the history of estimation of random effects via BLUP and its various derivations. He also points out the similarities between BLUP and kriging.
The equivalence of kriging with BLUP in a space spanned by genomic data was first noted by Piepho (2009), who also discusses relationships with other estimation principles, like ridge regression (Whittakeret al. 2000) and least squares support vector machines (Suykenset al. 2002). Comparing the performance of spatial mixed models to ridge regression with maize data, he found that spatial models provide an attractive alternative for prediction. He also points out that the BLUP model used in Meuwissenet al. (2001) has an interpretation as a spatial model with quadratic covariance function. Spatial models for genomic prediction were also used by Schulz-Streeck and Piepho (2010).
Moreover, kriging is known to be closely related to radial basis function (RBF) regression methods (Myers 1992). Longet al. (2010) showed with real and simulated data that nonparametric RBF regression methods can outperform Bayes A when predicting total GVs in the presence of nonadditive effects using SNP markers.
In this article we demonstrate the potential of the kriging approaches applied to genomic data: As a novelty, we suggest the family of Matérn covariance functions to reflect the functional dependency of the observed covariances from the distance of genotypes expressed as Euclidean norm. On the basis of this model and the assumed covariance function, we suggest two kriging approaches. Under both models, parameters and hidden variables are estimated via maximum likelihood (ML) and BLUP of the unknowns is established by solving the corresponding linear kriging systems. All predictions can also be implemented in the form of the so-called mixed model equations (Henderson 1973). The predictive performance of the two models is compared to a common genomic BLUP as a reference method in a whole-genome simulation study considering various gene-action models.
Furthermore, we show that in a limiting case the genomic covariance structure proposed by VanRaden (2008) can be considered as a covariance function with corresponding quadratic variogram. In addition we prove theoretically that predicted GVs are scaled only by a factor if the covariance structures are linearly transformed. Finally, we discuss further options for a more differentiated modeling using the suggested methodological approach.
Methods
Kriging
The term kriging stems from the prediction of ore concentrations in deposits and was mainly developed by Matheron (1962, 1963) on the basis of the master’s thesis of Krige (1951). In geostatistics, kriging is currently the standard approach whenever spatial prediction of a so-called regionalized variable (Matheron 1989), e.g., temperature, ozone concentration or soil moisture, must be performed on the basis of a few isolated measurements of the quantity. It is assumed that the regionalized variable is a realization of a random function with a certain covariance structure. Mostly, the latter is given by a parameterized covariance function (Cressie 1993), and the random function is assumed to be Gaussian.
The kriging approach consists of two steps: (i) estimation of the unknown parameters and hidden variables (in particular by ML or REML) and (ii) prediction of the values of the regionalized variables by performing a BLUP, under the auxiliary assumption that the parameter values and hidden variables estimated in the first step are the true ones.
Many variants of the general kriging principle have been discussed (Cressie 1993). The type of kriging is implied by the unbiasedness condition: In “simple kriging” (SK), it is assumed that the underlying regionalized variable has zero mean, whereas in “universal kriging” (UK), a linear model for the mean of the underlying regionalized variable is assumed.
The model for polygenic and genomic data
In our further studies, we assume to have q individuals with pedigree information, n of them being genotyped and having phenotype measurements of a certain quantitative trait. Typically, GVs have to be predicted for individuals that are genotyped, but have no phenotype data.
We assume that {g(xi), xi ε } is a Gaussian random field (Lifshits 1995) with (g(xi)) = 0 and covariance structure given by where is a covariance function depending on parameters ν, h, and σK. Let be the corresponding covariance matrix.
The family of Matérn covariance functions
Matérn covariance functions build a very general class of covariance functions including special cases like the exponential (ν = ) and the Gaussian (ν = ∞) covariance function, the ones that have also been used by Piepho (2009). If the smoothness parameter ν is of the form m + , where m is an integer, the Matérn function factorizes into the product of an exponential function and a polynomial of degree m; cf.Table 1 and Figure 1. The best fitting parameter value ν is determined through the model-fitting approaches described below.
Two kriging approaches and a reference model
We consider two models to predict the total genetic value of a certain genotyped individual indexed by 0. This individual belongs to the set of q individuals, but it does not have to be phenotyped. The models differ in the size of the sets of quantities that are estimated in the first kriging step and subsequently used for predictions.
Universal kriging: Modeling of y
Simple Kriging: Joint modeling of y, u, and g(X)
Finally, the predicted GV is given by where is the estimator obtained in the iterative procedure described above.
Reference model: genomic BLUP
Here, and Z are defined as in the previous approaches. The vector is multivariate normal with G being a genomic relationship matrix calculated by using the approach of VanRaden (2008). (For the definition of the genomic relationship matrix see the formulas in the Appendix.) The matrix is a known incidence matrix whose rows consist of unit vectors with one component being 1 and all the others zero, indicating the respective position in the g-vector. Variance components for this model are estimated via ML.
Simulation study
In a first step, four types of simulations were performed differing in the hypothetical gene-action scenario: additive, additive dominance with two different ratios of dominance variance to additive variance, and epistasis. For each scenario 50 independent simulations were run, resulting in 50 data sets per scenario.
The simulation process basically followed that of Meuwissenet al. (2001), Solberget al. (2008), and Longet al. (2010).
Population and genome
In each scenario, the population evolved during 1000 generations of random mating and random selection with a population size of 100 (50 males and 50 females) in each generation to reach a mutation-drift balance. After 1000 generations, the population size was increased to 500 at generation t = 1001 by mating each male with 10 females, with one offspring per mating pair. In generations t = 1002,…, t = 1011 offspring were born from random mating of individuals of the previous generation. The 1500 individuals of generations 1008, 1009, and 1010 were used as estimation set, the 500 individuals of generation 1011 formed the validation set for which total GVs were predicted. Pedigree data were recorded for individuals of the last 10 generations. SNP data of individuals were recorded both for the estimation and the validation set. Phenotypes were stored only for individuals of the estimation set.
Starting with monomorphic loci in the base generation, mutation rates at QTL and SNP markers were 2.5 × 10−3 per locus per generation (t = 1,…, t = 1000), to obtain an adequate number of segregating (biallelic) loci. On average, simulation resulted in 2745 segregating markers and 98 segregating QTL in generation t = 1001. Only segregating markers and QTL were considered in the following generations. True total GVs were obtained by summing up the QTL effects resulting from the following three gene-action models.
Three different gene-action models
Additive scenario A:
Each QTL locus had an additive effect only, without dominance or epistasis. The additive effect (a) was equal to the allele substitution effect, such that for genotypes QQ, Qq, and qq their GVs were 2a, a, and 0, respectively. The value of a at each QTL locus was sampled from a normal distribution .
Additive-dominance scenarios AD1 and AD2:
For simplicity, independence between QTL was assumed and, as a result, the total additive (dominance) variance was summed over all loci.
Epistasis scenario E:
In this model there was no additive or dominance effect at any of the individual QTL. Epistasis existed only between pairs of QTL. The forms of epistasis included additive × dominance (A × D), dominance × additive (D × A), and dominance × dominance (D × D). Additive and (A × A) epistatic effects were excluded, to prevent the additive variance from dominating the total genetic variance.
Note that although no additive, dominance, and (A × A) epistatic effects were explicitly simulated, the model still generated additive dominance , and epistatic variances. The procedure of estimating these variance components followed Cockerham (1954), assuming independence between two loci of each QTL pair and between QTL pairs.
On average, simulation in the epistatic scenario resulted in a broad-sense heritability of 0.84. Furthermore, 30% of the total genetic variance was attributed to additive effects, 27% was due to dominance effects, 14% was attributed to (A × A) effects, 25% was due to (D × A) and (A × D) effects, and 4% was due to (D × D) effects.
In all scenarios phenotypic records were obtained by adding a normally distributed residual term to the total GVs of the individuals. The environmental variance was obtained such that the narrow sense heritability was 0.25 in all scenarios.
Additional scenarios
Four additional scenarios based on scenario AD1 were simulated, to analyze the influence of the number of chromosomes, the QTL architecture, the SNP density, and a polygenic effect on the prediction accuracy:
Scenario AD1.2: Three chromosomes of length M were simulated, each containing 33 equally spaced QTL and 1000 SNPs.
Scenario AD1.3: Three chromosomes of length M were simulated, each of them containing 1000 SNPs and the first two of them containing 50 equally spaced QTL. The third chromosome contained no QTL.
Scenario AD1.4: The same as scenario AD1.2 but with each chromosome containing 33 equally spaced QTL and 3000 SNPs.
Scenario AD1.5: The same as scenario AD1, but additionally a polygenic effect u was simulated, starting from generation 1006. Here, the ratio of additive QTL variance to polygenic variance was set to 3. The polygenic effect u of an offspring was calculated as 0.5 · (umother + ufather) + m, where m is its Mendelian sampling term drawn from a normal distribution
with Fmother and Ffather being the inbreeding coefficients of the corresponding mother and father. Here, the true total GV was obtained by summing up the QTL effects and the polygenic effect.
Statistical analyses
The three methods were compared for their accuracy of predicting the true GVs of the individuals in generation t = 1011. For this we applied the three approaches to the 50 simulated data sets consisting of 5500 individuals, the last 5000 of them having pedigree information and the last 2000 of them being fully genotyped, as described in the previous section. Total GVs of the nonphenotyped individuals in generation t = 1011 (validation set) were predicted. Thereby, parameters and hidden variables were estimated with the help of 1500 individuals (generations 1008 − 1010, estimation set).
All approaches were implemented in R (R DevelopmentCoreTeam 2007). The ML estimation of the parameters and hidden variables was done using the R-package RandomFields v. 2.0.23 (Schlather 2001–2009) and its function “fitvario.” The function fitvario determines the ML by the function “optim” of R with automatically created starting values.
All models were run on a 1.9-GHz PC running Linux. On average, computing times per data set ranged from approximately 20 min (genomic BLUP) over 77 min (universal kriging) to 227 min (simple kriging), but no special efforts were made to achieve computational efficiency at this stage.
For each method and each gene-action scenario, we computed the correlation between the predicted and the true GVs. This was done both for the estimation set of 1500 individuals and for the validation set of 500 individuals. In addition, we calculated the average true GV of the 50 individuals with the highest predicted GVs in the validation set. Finally, results were summarized by averaging over the 50 data sets and a paired t-test was applied to test for significant differences between each pair of characteristics at the 1% significance level.
Results and Discussion
The results of 50 replicates for the different gene-action models and scenarios are shown in Tables 2–3.
Scenario | Set | Universal kriging | Simple kriging | Genomic BLUP |
A | Estimation set | 0.801αa (0.005) | 0.772β (0.009) | 0.815γ (0.004) |
Validation set | 0.773α (0.005) | 0.731β (0.008) | 0.776γ (0.005) | |
AD1 | Estimation set | 0.754α (0.004) | 0.652β (0.009) | 0.670β (0.004) |
Validation set | 0.571α (0.006) | 0.530β (0.010) | 0.558γ (0.007) | |
AD2 | Estimation set | 0.854α (0.004) | 0.624β (0.013) | 0.621β (0.005) |
Validation set | 0.490α (0.007) | 0.447β (0.009) | 0.457β (0.007) | |
E | Estimation set | 0.910α (0.009) | 0.631β (0.015) | 0.681γ (0.006) |
Validation set | 0.468α (0.006) | 0.411β (0.008) | 0.437γ (0.007) |
Scenario | Set | Universal kriging | Simple kriging | Genomic BLUP |
A | Estimation set | 0.801αa (0.005) | 0.772β (0.009) | 0.815γ (0.004) |
Validation set | 0.773α (0.005) | 0.731β (0.008) | 0.776γ (0.005) | |
AD1 | Estimation set | 0.754α (0.004) | 0.652β (0.009) | 0.670β (0.004) |
Validation set | 0.571α (0.006) | 0.530β (0.010) | 0.558γ (0.007) | |
AD2 | Estimation set | 0.854α (0.004) | 0.624β (0.013) | 0.621β (0.005) |
Validation set | 0.490α (0.007) | 0.447β (0.009) | 0.457β (0.007) | |
E | Estimation set | 0.910α (0.009) | 0.631β (0.015) | 0.681γ (0.006) |
Validation set | 0.468α (0.006) | 0.411β (0.008) | 0.437γ (0.007) |
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within rows.
Scenario | Set | Universal kriging | Simple kriging | Genomic BLUP |
A | Estimation set | 0.801αa (0.005) | 0.772β (0.009) | 0.815γ (0.004) |
Validation set | 0.773α (0.005) | 0.731β (0.008) | 0.776γ (0.005) | |
AD1 | Estimation set | 0.754α (0.004) | 0.652β (0.009) | 0.670β (0.004) |
Validation set | 0.571α (0.006) | 0.530β (0.010) | 0.558γ (0.007) | |
AD2 | Estimation set | 0.854α (0.004) | 0.624β (0.013) | 0.621β (0.005) |
Validation set | 0.490α (0.007) | 0.447β (0.009) | 0.457β (0.007) | |
E | Estimation set | 0.910α (0.009) | 0.631β (0.015) | 0.681γ (0.006) |
Validation set | 0.468α (0.006) | 0.411β (0.008) | 0.437γ (0.007) |
Scenario | Set | Universal kriging | Simple kriging | Genomic BLUP |
A | Estimation set | 0.801αa (0.005) | 0.772β (0.009) | 0.815γ (0.004) |
Validation set | 0.773α (0.005) | 0.731β (0.008) | 0.776γ (0.005) | |
AD1 | Estimation set | 0.754α (0.004) | 0.652β (0.009) | 0.670β (0.004) |
Validation set | 0.571α (0.006) | 0.530β (0.010) | 0.558γ (0.007) | |
AD2 | Estimation set | 0.854α (0.004) | 0.624β (0.013) | 0.621β (0.005) |
Validation set | 0.490α (0.007) | 0.447β (0.009) | 0.457β (0.007) | |
E | Estimation set | 0.910α (0.009) | 0.631β (0.015) | 0.681γ (0.006) |
Validation set | 0.468α (0.006) | 0.411β (0.008) | 0.437γ (0.007) |
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within rows.
Scenario | Universal kriging | Simple kriging | Genomic BLUP |
A | 2.420αa (0.259) | 2.291β (0.261) | 2.432α (0.258) |
AD1 | 1.754α (0.182) | 1.648α (0.186) | 1.728α (0.177) |
AD2 | 1.720α (0.172) | 1.563β (0.178) | 1.612α (0.171) |
E | 6.410α (0.502) | 5.847β (0.476) | 5.893β (0.485) |
Scenario | Universal kriging | Simple kriging | Genomic BLUP |
A | 2.420αa (0.259) | 2.291β (0.261) | 2.432α (0.258) |
AD1 | 1.754α (0.182) | 1.648α (0.186) | 1.728α (0.177) |
AD2 | 1.720α (0.172) | 1.563β (0.178) | 1.612α (0.171) |
E | 6.410α (0.502) | 5.847β (0.476) | 5.893β (0.485) |
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within rows.
Scenario | Universal kriging | Simple kriging | Genomic BLUP |
A | 2.420αa (0.259) | 2.291β (0.261) | 2.432α (0.258) |
AD1 | 1.754α (0.182) | 1.648α (0.186) | 1.728α (0.177) |
AD2 | 1.720α (0.172) | 1.563β (0.178) | 1.612α (0.171) |
E | 6.410α (0.502) | 5.847β (0.476) | 5.893β (0.485) |
Scenario | Universal kriging | Simple kriging | Genomic BLUP |
A | 2.420αa (0.259) | 2.291β (0.261) | 2.432α (0.258) |
AD1 | 1.754α (0.182) | 1.648α (0.186) | 1.728α (0.177) |
AD2 | 1.720α (0.172) | 1.563β (0.178) | 1.612α (0.171) |
E | 6.410α (0.502) | 5.847β (0.476) | 5.893β (0.485) |
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within rows.
In the additive scenario, universal kriging yields a correlation between predicted and true simulated GVs, which is almost as high as the correlation obtained by the reference method genomic BLUP, both in the estimation and in the validation set (cf.Table 2), while simple kriging yields the lowest correlations both in the estimation and in the validation set.
The results are similar to the findings of Piepho (2009) and Schulz-Streeck and Piepho (2010) who also report that for an additive true genetic model the prediction accuracies for ridge regression (with covariance structures based on relationship matrices) and spatial models (with covariance structures based on covariance functions) are similar.
In the AD and E scenarios, universal kriging outperforms genomic BLUP in both estimation and validation set by showing the highest average correlations. The difference in correlations of universal kriging and genomic BLUP is highest in the E scenario and the scenario with the higher ratio of dominance to additive variance (≈ 0.03 for the results of the validation set, which is an increase of accuracy by approximately 7%).
Scatterplots of the correlations of the 50 replicates for the different methods and scenarios are shown in Figure 2, which also demonstrate the better performance of universal kriging in the presence of dominance and epistasis. With the degree of nonadditivity ((E, AD2) > AD1 > A) the accuracy of prediction in the validation set compared to the estimation set deteriorates.
Comparing the average true GV of the 50 individuals (10%) ranked best by prediction in the validation set (cf.Table 3), universal kriging and genomic BLUP yield results that are not significantly different from each other both in the A and AD scenarios, while universal kriging outperforms genomic BLUP in the E scenario. Again, simple kriging performs worst in all scenarios apart from AD1.
All three methods, being unbiased by definition, show almost no empirical bias of total GVs (results not shown).
The results of the additional scenarios AD1.2 to AD1.5 indicate that the predictive ability of the universal kriging approach is robust with respect to the number of chromosomes, the QTL distribution, the SNP density, and the inclusion of a polygenic effect (cf.Table 4). In scenario AD1.4 with higher SNP density the absolute values of correlations between true and predicted GVs are slightly higher compared to scenario AD1.2 with lower SNP density. In scenario AD1.5 the absolute values of correlations between true and predicted GVs are lower for all three methods.
Universal kriging | Simple kriging | Genomic BLUP | ||||
Scenario | Est. set a | Val. set b | Est. set | Val. set | Est. set | Val. set |
AD1 | 0.754αc (0.004) | 0.571α (0.006) | 0.652α (0.009) | 0.530α (0.010) | 0.670α (0.004) | 0.558α (0.007) |
AD1.2 | 0.751α (0.004) | 0.550α (0.006) | 0.627α (0.007) | 0.511α (0.008) | 0.666α (0.005) | 0.541α (0.007) |
AD1.3 | 0.753α (0.005) | 0.554α (0.010) | 0.630α (0.009) | 0.518α (0.011) | 0.670α (0.006) | 0.543α (0.010) |
AD1.4 | 0.758α (0.004) | 0.567α (0.007) | 0.642α (0.007) | 0.531α (0.008) | 0.677α (0.005) | 0.558α (0.007) |
AD1.5 | 0.718β (0.004) | 0.528β (0.006) | 0.623α (0.009) | 0.496α (0.008) | 0.666α (0.005) | 0.518β (0.007) |
Universal kriging | Simple kriging | Genomic BLUP | ||||
Scenario | Est. set a | Val. set b | Est. set | Val. set | Est. set | Val. set |
AD1 | 0.754αc (0.004) | 0.571α (0.006) | 0.652α (0.009) | 0.530α (0.010) | 0.670α (0.004) | 0.558α (0.007) |
AD1.2 | 0.751α (0.004) | 0.550α (0.006) | 0.627α (0.007) | 0.511α (0.008) | 0.666α (0.005) | 0.541α (0.007) |
AD1.3 | 0.753α (0.005) | 0.554α (0.010) | 0.630α (0.009) | 0.518α (0.011) | 0.670α (0.006) | 0.543α (0.010) |
AD1.4 | 0.758α (0.004) | 0.567α (0.007) | 0.642α (0.007) | 0.531α (0.008) | 0.677α (0.005) | 0.558α (0.007) |
AD1.5 | 0.718β (0.004) | 0.528β (0.006) | 0.623α (0.009) | 0.496α (0.008) | 0.666α (0.005) | 0.518β (0.007) |
Estimation set.
Validation set.
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within columns.
Universal kriging | Simple kriging | Genomic BLUP | ||||
Scenario | Est. set a | Val. set b | Est. set | Val. set | Est. set | Val. set |
AD1 | 0.754αc (0.004) | 0.571α (0.006) | 0.652α (0.009) | 0.530α (0.010) | 0.670α (0.004) | 0.558α (0.007) |
AD1.2 | 0.751α (0.004) | 0.550α (0.006) | 0.627α (0.007) | 0.511α (0.008) | 0.666α (0.005) | 0.541α (0.007) |
AD1.3 | 0.753α (0.005) | 0.554α (0.010) | 0.630α (0.009) | 0.518α (0.011) | 0.670α (0.006) | 0.543α (0.010) |
AD1.4 | 0.758α (0.004) | 0.567α (0.007) | 0.642α (0.007) | 0.531α (0.008) | 0.677α (0.005) | 0.558α (0.007) |
AD1.5 | 0.718β (0.004) | 0.528β (0.006) | 0.623α (0.009) | 0.496α (0.008) | 0.666α (0.005) | 0.518β (0.007) |
Universal kriging | Simple kriging | Genomic BLUP | ||||
Scenario | Est. set a | Val. set b | Est. set | Val. set | Est. set | Val. set |
AD1 | 0.754αc (0.004) | 0.571α (0.006) | 0.652α (0.009) | 0.530α (0.010) | 0.670α (0.004) | 0.558α (0.007) |
AD1.2 | 0.751α (0.004) | 0.550α (0.006) | 0.627α (0.007) | 0.511α (0.008) | 0.666α (0.005) | 0.541α (0.007) |
AD1.3 | 0.753α (0.005) | 0.554α (0.010) | 0.630α (0.009) | 0.518α (0.011) | 0.670α (0.006) | 0.543α (0.010) |
AD1.4 | 0.758α (0.004) | 0.567α (0.007) | 0.642α (0.007) | 0.531α (0.008) | 0.677α (0.005) | 0.558α (0.007) |
AD1.5 | 0.718β (0.004) | 0.528β (0.006) | 0.623α (0.009) | 0.496α (0.008) | 0.666α (0.005) | 0.518β (0.007) |
Estimation set.
Validation set.
Results were averages of 50 replicates. Standard errors of the means are in parentheses. Different lowercase Greek letters indicate significant differences (1% level of significance) within columns.
Overall, results indicate the superiority of universal kriging over genomic BLUP in the presence of nonadditive effects. Simple kriging was shown to have a poorer predictive ability compared to universal kriging and genomic BLUP in all considered gene-action models and scenarios.
The poorer predictive ability of simple kriging is most likely due to the high number of parameters estimated in the first kriging step and the resulting numerical difficulties in optimization. In simple kriging 3505 parameters are estimated compared to only 5 parameters in universal kriging and 3 parameters in genomic BLUP. The poor performance of simple kriging and the influence of the high-dimensional parameter space need further investigations, especially as simple kriging is known to work well in low-dimensional geostatistical frameworks.
The simulation study is primarily meant as a “proof of concept.” Results demonstrate that the suggested kriging procedures based on the Matérn function are able to yield competitive results, despite the fact that the modeling of the genomic part of the data by use of the Matérn function follows a completely different reasoning than in the usual methods. This also demonstrates the flexibility of the basic kriging principle.
The importance of the Matérn family is highlighted by Stein (1999), who recommends the use of the Matérn model in the context of prediction of spatial data. The Matérn model has been widely used in other areas of research; see Guttorp and Gneiting (2006) for a historical excursion. One of the most important reasons for adopting the Matérn model is the inclusion of the parameter ν in the model, which controls the smoothness of the underlying random field. Whereas Stein (1999) advocates the simultaneous estimation of all relevant parameters via (restricted) maximum likelihood, Ruppertet al. (2003) and Nychka (2000) remark that the likelihood-based estimation of h and ν may lead to problems as both parameters enter in a nonlinear fashion, which may cause the ML fitting to be computationally intensive. Our experience so far indicates that the simultaneous estimation of all relevant parameters is feasible.
As an alternative to the ML estimation of parameters, one could also use REML (Patterson and Thompson 1971) to adjust for the loss of degrees of freedom caused by the fixed effects and to produce less biased estimates. In our simulation study there is only one fixed effect (i.e.,β is a scalar and W = (1,…,1)T), such that there will be little difference between REML and ML estimates for variance components in the reference method GBLUP (Abneyet al. 2000; Ruppertet al. 2003; Bonate 2006; Websteret al. 2006). This is also mostly the case in practical applications, where highly accurately predicted GVs are used as phenotypes and only an overall mean is included in the model. With respect to the parameter estimates in the kriging approaches using the Matérn function, it is not clear whether REML is preferable to ML, as the parameters h and ν enter in a nonlinear fashion.
Relation between the Matérn covariance function and the covariance matrix of VanRaden (2008)
To investigate the general relationship between covariance matrices based on the Matérn function and the genomic relationship matrix of VanRaden (2008), we consider the so-called variograms.
In the Appendix we show that in a limiting case (in which the number of SNPs tends to infinity) the covariance structure of VanRaden (2008) depends only on the Euclidean distance between the SNP vectors and that the corresponding variogram is a quadratic function on [0,∞).
Note that the Matérn covariance function is at least three times differentiable for ν>1.5 (Guttorp and Gneiting 2006), such that it is still possible to derive a second-order Taylor expansion for 1.5 < ν < ∞, leading to a quadratic variogram for small distances x as well.
Using linear transformations of covariance matrices leads to linear transformed predicted GVs
The general result has important practical implications: It is shown that predictions resulting from the two systems (5) and (6) are identical although a constant (c and ) is added to the phenotypic covariance matrix or the covariance vector on the right-hand side of the kriging system, or to both. In the genetic context, such a modification changes relevant population parameters, like heritabilities as well as genetic and phenotypic correlations. Despite this, predicted GVs remain completely unaffected.
Scaling the phenotypic covariance matrix and the covariance vectors by a factor (d and ) also changes the heritability, but is shown to lead to a mere linear transformation of the GVs, thus providing an identical ranking of individuals according to their predicted GVs. However, results obtained from such a scaled system might lead to a higher or lower level of mean squared errors.
As stated before, solving the kriging systems is equivalent to solving the corresponding MME. Hence, we have also proved that the solutions and of the MME are scaled by the factor if the phenotypic covariance matrix and the covariance matrix of Zu + g(X) are linearly transformed.
To our knowledge, the above theoretical result (including the scaling factors d and ) has not been proved elsewhere in this explicit form, but some authors refer to the invariance of the predictions to the addition of a multiple of the matrix J: It is well known that in ordinary kriging with constant mean, one needs only to know the covariance function up to a constant (Matheron 1971; Christensen 1990). Kitanidis (1993) discusses in the context of so-called “generalized covariance functions” the variability among the covariance functions that behave identically in terms of prediction. The invariance to the addition of a multiple of J in a mixed model context is also mentioned in Piepho (2009).
Reproducing kernel Hilbert space approach
By equating , and Equations 2 and 8 are obviously identical, as well as Equations 4 and 7. Finally, Gianola and van Kaam (2008) proceed with embedding the above approach into a Bayesian framework.
The approach of Gianola and van Kaam (2008) and our approach are different in that we maximize the full likelihood whereas they drop the summand log(c) in Equation 3. Note that c depends on the unknown parameters, i.e., the variance components and the parameters of the Matérn covariance function. Dropping the summand log(c) therefore leads to different estimates of the parameters. Scheuerer (2011) argues that the factor c might be included even in the framework of reproducing kernel Hilbert spaces. Hence, maximizing J in (3) is partially justified even if the normal assumption for the ei does not hold.
Further options
The general nonparametric approach of basing the prediction on a covariance function offers a number of possibilities for more differentiated modeling. While in spatial statistics using the Euclidean distance is a natural choice, other distance metrics (Reifet al. 2005) may be more adequate in the genomic context. With dense marker maps it is found that the genome is structured in haplotype blocks of varying length (InternationalHapmapConsortium 2005; Qanbariet al. 2010) within which the loci are in high linkage disequilibrium; i.e., genotypes are highly correlated. Here, it might be adequate to account for this nonindependence in the definition of the scale, since otherwise highly correlated loci will lead to a massive double counting. A further option is to implement a feature selection, which could, e.g., give a higher weight to SNPs that are positioned in genomic regions that are found to be relevant for the physiological pathways (Wanget al. 2007) underlying the studied trait complex.
Total GVs
Prediction of the total GV of an individual, including nonadditive components, is of different relevance in different fields. In animal breeding, the value of a breeding animal is mostly determined by its so-called breeding value, which is purely additive. While it is possible to predict nonadditive genetic components even in pedigree-based estimation procedures (see, e.g.,Hoeschele 1991; de Boer and Hoeschele 1993), these components are in general not transmitted to the offspring and therefore are mostly considered as nuisance parameters in animal breeding.
In plant breeding, prediction of the total GV as part of the phenotype is more relevant, especially since the biological nature of some crop species and/or reproductive biotechnologies allow an identical reproduction (cloning) of given genotypes. Complex gene models including dominance and epistasis might be especially useful in predicting crossbred performance, but the relevance is rather diverse across the agriculturally used plants (Holland 2001).
It was recently suggested that under polygenic inheritance the additive part is the dominating genetic component (Hillet al. 2008) and that under directional selection the rate of change is largely determined by the additive genetic variance, so that attempts to include nonadditive terms in prediction might be, at best, useless or even harmful (Crow 2010). These arguments pertain both to animal and plant breeding and need careful consideration based on empirical evidence.
Predicting the genetic disposition in humans in the context of preventive and personalized medicine using whole genome markers is a relatively new and controversial topic (see delosCamposet al. 2010 for a review). The main motivation to consider such approaches comes from the phenomenon that even in extremely large-scale studies the genetic background of complex diseases cannot be sufficiently determined with classical mapping approaches (the so-called “case of the missing heritability”; Maher (2008)). Disposition for complex diseases is assumed to be affected to a considerable extent by nonadditive allelic interactions, and hence models allowing for such interactions are expected to yield improved predictions compared to purely additive models.
One data set and the corresponding R-code for the prediction of GVs are available on http://www.stochastik.math.uni-goettingen.de/∼schlather/genoKriging/.
Acknowledgements
The authors thank two anonymous referees for their valuable comments on the manuscript and their constructive suggestions. The authors are also grateful to Daniel Gianola for his critical comments on an earlier version of the manuscript. Alexander Malinowski and Marco Oesting have provided additional comments and hints. Parts of the analyses were carried out during a research stay of the corresponding author at the University of Wisconsin in Madison. This research was funded by the German Federal Ministry of Education and Research (BMBF) within the AgroClustEr “Synbreed—Synergistic plant and animal breeding” (FKZ 0315528C) in association with the Deutsche Forschungsgemeinschaft (DFG) research training group “Scaling Problems in Statistics” (RTG 1644).
Footnotes
Communicating editor: I. Hoeschele
Literature Cited
APPENDIX
Relation Between the Matérn Covariance Function and the Covariance Matrix of VanRaden (2008)
Proof: Using Linear Transformations of Covariance Matrices Leads to Linear Transformed Predicted GVs
Author notes
Available freely online through the author-supported open access option.