## Abstract

Semiparametric procedures for prediction of total genetic value for quantitative traits, which make use of phenotypic and genomic data simultaneously, are presented. The methods focus on the treatment of massive information provided by, *e.g*., single-nucleotide polymorphisms. It is argued that standard parametric methods for quantitative genetic analysis cannot handle the multiplicity of potential interactions arising in models with, *e.g*., hundreds of thousands of markers, and that most of the assumptions required for an orthogonal decomposition of variance are violated in artificial and natural populations. This makes nonparametric procedures attractive. Kernel regression and reproducing kernel Hilbert spaces regression procedures are embedded into standard mixed-effects linear models, retaining additive genetic effects under multivariate normality for operational reasons. Inferential procedures are presented, and some extensions are suggested. An example is presented, illustrating the potential of the methodology. Implementations can be carried out after modification of standard software developed by animal breeders for likelihood-based or Bayesian analysis.

MASSIVE quantities of genomic data are now available, with potential for enhancing accuracy of prediction of genetic value of, *e.g.*, candidates for selection in animal and plant breeding programs or for molecular classification of disease status in subjects (Golub *et al.* 1999). For instance, Wong *et al.* (2004) reported a genetic variation map of the chicken genome containing 2.8 million single-nucleotide polymorphisms (SNPs) and demonstrated how the information can be used for targeting specific genomic regions. Likewise, Hayes *et al.* (2004) found 2507 putative SNPs in the salmon genome that could be valuable for marker-assisted selection in this species.

The use of molecular markers as aids in genetic selection programs has been discussed extensively. Important early articles are Soller and Beckmann (1982) and Fernando and Grossman (1989), with the latter focusing on best linear unbiased prediction of genetic value when marker information is used. Most of the literature on marker-assisted selection deals with the problem of locating one or a few quantitative trait loci (QTL) using flanking markers. However, in the light of current knowledge about genomics, the widely used single-QTL search approach is naive, since there is evidence of abundant QTL affecting complex traits, as discussed, *e.g.*, by Dekkers and Hospital (2002). This would support the infinitesimal model of Fisher (1918) as a sensible statistical specification for many quantitative traits, with complications being the accommodation of nonadditivity and of feedbacks (Gianola and Sorensen 2004). Dekkers and Hospital (2002) observe that existing statistical methods for marker-assisted selection do not deal well with the complexity posed by quantitative traits. Some difficulties are: specification of “statistical significance” thresholds for multiple testing, strong dependence of inferences on the model chosen (*e.g.*, number of QTL fitted, distributional forms), inadequate handling of nonadditivity, and ambiguous interpretation of effects in multiple-marker analysis, due to collinearity.

Here, we discuss how large-scale molecular information, such as that conveyed by SNPs, can be employed for marker-assisted prediction of genetic value for quantitative traits in the sense of, *e.g.*, Meuwissen *et al.* (2001), Gianola *et al.* (2003), and Xu (2003). The focus is on inference of genetic value, rather than detection of quantitative trait loci. A main challenge is that of positing a functional form relating phenotypes to SNP genotypes (viewed as thousands of possibly highly collinear covariates), to polygenic additive genetic values, and to other nuisance effects, such as sex or age of an individual, simultaneously.

Standard quantitative genetics theory gives a mechanistic basis to the mixed-effects linear model, treated either from classical (Sorensen and Kennedy 1983; Henderson 1984) or from Bayesian (Gianola and Fernando 1986) perspectives. Meuwissen *et al.* (2001) and Gianola *et al.* (2003) exploit this connection and suggest highly parametric structures for modeling relationships between phenotypes and effects of hundreds or thousands of molecular markers. A first concern is the strength of their assumptions (*e.g.*, linearity, multivariate normality, proportion of segregating loci, spatial within-chromosome effects); it is unknown if their procedures are robust. Second, collinearity between SNP or marker genotypes is bound to exist, because of the sheer massiveness of molecular data plus cosegregation of alleles. While adverse effects of collinearity can be tempered when marker effects are treated as random variables, statistical redundancy is undesirable (Lindley and Smith 1972).

The genome seems to be much more highly interactive than what standard quantitative genetic models can accommodate (*e.g.*, D'Haeseleer *et al.* 2000). In theory, genetic variance can be partitioned into orthogonal additive, dominance, additive × additive, additive × dominance, dominance × dominance, etc., components, only under highly idealized conditions. These include linkage equilibrium, absence of natural or artificial selection, and no inbreeding or assortative mating (Cockerham 1954; Kempthorne 1954). Arguably, these conditions are violated in nature and in breeding programs. Actually, marker-assisted selection exploits existence of linkage disequilibrium, and even chance creates disequilibrium. Further, estimation of nonadditive components of variance is notoriously difficult, even under standard assumptions (Chang 1988). Therefore, it is doubtful whether or not standard quantitative genetic approaches can model fine-structure relationships between genotypes and phenotypes adequately, unless either departures from assumptions have mild effects or statistical constructs based on multivariate normality turn out to be more robust than what is expected on theoretical grounds. These considerations suggest that a nonparametric treatment of the data could be valuable.

On the other hand, application of the additive genetic model in selective breeding of livestock has produced remarkable dividends, as shown in Dekkers and Hospital (2002). Hence, a combination of nonparametric modeling of effects of molecular variables (*e.g.*, SNPs) with features of the additive polygenic mode of inheritance is appealing.

Our objective is to present semiparametric methods for prediction of genetic value for complex traits that make use of phenotypic and genomic data simultaneously. This article is organized as follows. *Kernel regression on SNP markers* introduces nonparametric regression, kernel functions, and smoothing parameters and proposes a nonparametric approximation to additive genetic value. Next, *Semiparametric kernel mixed model* combines features of kernel regression with the mixed-effects linear model and describes classical and Bayesian implementations. *Reproducing kernel Hilbert spaces mixed model* uses established calculus-of-variations results and hybridizes the mixed-effects linear model with a regression on kernel basis functions. Estimation procedures are presented, the problem of incomplete genotyping is addressed, and a simulated example is given to illustrate feasibility and potential. The article concludes with a discussion and suggestions for additional research.

## KERNEL REGRESSION ON SNP MARKERS

#### The regression function:

Consider a stylized situation in which each of a series of individuals possesses a measurement for some quantitative trait denoted as *y*, as well as information on a possibly massive number of genomic variables, such as SNP “genotypes,” represented by a vector **x**. In the main, **x** is treated as a continuously valued vector of covariates, even though SNP genotypes are discrete (coding is done via dummy variates). Also, **x** could represent gene expression measurements from microarray experiments; here, it would be legitimate to regard this vector as continuous. Although gene expression measurements are typically regarded as response variables, there are contexts in which this type of information could be used in an explanatory role (Mallick *et al.* 2005).

Let the relationship between *y* and **x** be represented as

$$
y_i = g(\mathbf{x}_i) + e_i, \tag{1}
$$

where *y*_{i} is a measurement, such as plant height or body weight, taken on individual *i*; **x**_{i} is a *p* × 1 vector of dummy SNP or microsatellite covariates observed on *i*; and *g*(.) is some unknown function relating these genotypes to phenotypes. Define *g*(**x**_{i}) = *E*(*y*_{i} | **x**_{i}) as the conditional expectation function, that is, the mean phenotypic value of an infinite number of individuals, all possessing the *p*-dimensional genotype **x**_{i}. The term $e_i \sim (0, \sigma_e^2)$ is a random residual, distributed independently of **x**_{i} and with variance $\sigma_e^2$.

The conditional expectation function is

$$
g(\mathbf{x}) = E(y\,|\,\mathbf{x}) = \frac{\int y\,p(y,\mathbf{x})\,dy}{p(\mathbf{x})}. \tag{2}
$$

Following Silverman (1986), consider a nonparametric kernel estimator of the *p*-dimensional density of the covariates,

$$
\hat{p}(\mathbf{x}) = \frac{1}{nh^{p}}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right), \tag{3}
$$

where $K(\cdot)$ is a kernel function and *h* is a window width or smoothing parameter. In (3), **x** is the value (“focal point”) at which the density is evaluated and **x**_{i} (*i* = 1, 2,…, *n*) is the observed *p*-dimensional SNP genotype of individual *i* in the sample. Hence, (3) estimates population densities (or frequencies). If $\hat{p}(\mathbf{x})$ is to behave as a multivariate probability density function, then it must be true that the kernel function is positive and that the condition

$$
\int K(\mathbf{z})\,d\mathbf{z} = 1
$$

is satisfied. This implies that

$$
\int \hat{p}(\mathbf{x})\,d\mathbf{x} = \frac{1}{nh^{p}}\sum_{i=1}^{n}\int K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right)d\mathbf{x} = 1.
$$

Similarly, and assuming that a single *h* parameter suffices, one can estimate the joint density of phenotype and genotypes at point (*y*, **x**) as

$$
\hat{p}(y,\mathbf{x}) = \frac{1}{nh^{p+1}}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right) K\!\left(\frac{y-y_i}{h}\right),
$$

where $K\!\left(\frac{y-y_i}{h}\right)$ is also a kernel function; again, *y*_{i} is the observed sample value of variable *y* in individual *i*. The numerator of (2) can be estimated as

$$
\int y\,\hat{p}(y,\mathbf{x})\,dy = \frac{1}{nh^{p+1}}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right)\int y\,K\!\left(\frac{y-y_i}{h}\right)dy. \tag{4}
$$

In (4), let $z = (y - y_i)/h$, so that $dy = h\,dz$ and

$$
\int y\,K\!\left(\frac{y-y_i}{h}\right)dy = h\int (y_i + hz)\,K(z)\,dz.
$$

The kernel function is typically a proper probability density function chosen such that $\int K(z)\,dz = 1$ and $\int z\,K(z)\,dz = 0$. If so, the preceding expression is equal to $h\,y_i$, so that (4) becomes

$$
\int y\,\hat{p}(y,\mathbf{x})\,dy = \frac{1}{nh^{p}}\sum_{i=1}^{n} y_i\,K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right). \tag{5}
$$

Returning to (2), one can form the nonparametric estimator

$$
\hat{g}(\mathbf{x}) = \frac{\int y\,\hat{p}(y,\mathbf{x})\,dy}{\hat{p}(\mathbf{x})},
$$

which, upon replacing the numerator and denominator by (5) and (3), respectively, takes the form

$$
\hat{g}(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x})\,y_i, \tag{6}
$$

where

$$
w_i(\mathbf{x}) = \frac{K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{\mathbf{x}-\mathbf{x}_j}{h}\right)}
$$

is a weight that depends on the kernel function and window width chosen and on the **x**_{i} (*i.e.*, genotypes) observed in the sample. The linear combination of the observations (6) is called the Nadaraya–Watson estimator of the regression function (Nadaraya 1964; Watson 1964). As seen in (6), the fitted value at coordinate **x** is a weighted average of all data points, with the value of the weight depending on the “proximity” of **x**_{i} to **x** and on the value of the smoothing parameter *h*. For instance, if the kernel function has the Gaussian form

$$
K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right) = (2\pi)^{-p/2}\exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i)}{2h^{2}}\right],
$$

this has a maximum value of $(2\pi)^{-p/2}$ when **x** = **x**_{i} and tails off to 0 as the distance between **x** and **x**_{i} increases. The values of *w*_{i}(**x**) decrease more abruptly as $h \to 0$. This is the reason why this type of estimator is called “local,” in the sense that observations with **x**_{i} coordinates closer to the focal point **x** are weighted more strongly in the computation of the fitted value $\hat{g}(\mathbf{x})$.

A specification that is more restrictive than (1) is the additive regression model

$$
y_i = \sum_{j=1}^{p} g_j(x_{ij}) + e_i \tag{7}
$$

(Hastie and Tibshirani 1990; Fox 2005), where *x*_{ij} is the genotype for SNP *j* in individual *i*. In this model each of the “partial regression” functions is two-dimensional, thus allowing exploration of effects of individual SNPs on phenotypes, albeit at the expense of ignoring, *e.g.*, epistatic interactions between genotypes. A preliminary examination of relationships can be done via calculation of Nadaraya–Watson estimators of each of the *g*_{j}(.) as

$$
\hat{g}_j(x_j) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_j - x_{ij}}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x_j - x_{ij}}{h}\right)},
$$

where the kernels are unidimensional. Naturally, one may wish to account for interactions between SNPs, so this type of analysis would be merely exploratory. A specification that is intermediate between (7) and (1) could include sums of single, pairwise, tripletwise, etc., SNP regression functions.
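As a concrete sketch of estimator (6), the following computes a Gaussian-kernel Nadaraya–Watson fit at a focal point. The simulated covariate, sample size, and band width are illustrative assumptions, not values taken from the article:

```python
import numpy as np

def gaussian_kernel(u):
    # Multivariate Gaussian kernel evaluated at standardized distances u = (x - x_i)/h.
    p = u.shape[-1]
    return (2 * np.pi) ** (-p / 2) * np.exp(-0.5 * np.sum(u * u, axis=-1))

def nadaraya_watson(x, X, y, h):
    # Fitted value (6) at focal point x: a weighted average of all y_i,
    # with weights w_i(x) proportional to K((x - x_i)/h).
    k = gaussian_kernel((x - X) / h)
    return (k @ y) / k.sum()

# Toy data: phenotype depends nonlinearly on a single covariate.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, 200)

ghat = nadaraya_watson(np.array([0.5]), X, y, h=0.2)
```

With a moderate band width, the fitted value at *x* = 0.5 lands near the true regression sin(1.0), illustrating the "local averaging" behavior described above.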

#### Impact of window width:

The scatter plot shown in Figure 1, from Chu and Marron (1991), consists of data on log-income and age of 205 people. The solid line in Figure 1A is a moving weighted average of the points, with weights proportional to the curve at the bottom. Chu and Marron (1991) regard the dip in average income in the middle ages as “unexpected.” Figure 1B gives results from three different smooths at window widths of 1, 3, and 9. When *h* = 1, the curve displays considerable sampling variation or roughness. On the other hand, when *h* = 9, local features disappear, because points that are far apart receive considerable weight in the fitting procedure. If the dip is not an artifact, the oversmoothing results in a “bias.” Hence, *h* must be gauged carefully.

Marron (1988) and Chu and Marron (1991) discuss data-driven procedures for assessing *h*. Silverman (1986) gives a discussion in the context of density estimation, whereas Mallick *et al.* (2005) consider Hilbert spaces kernel regression, with *h* treated as an unknown parameter. A conceptually simple and intuitively appealing procedure is cross-validation (*e.g.*, Schucany 2004). For instance, in the “leave-one-out” procedure, the datum for case *i*, that is (*y*_{i}, **x**_{i}), is deleted, and a fit is carried out on the basis of the other *n* − 1 cases. Then, the prediction $\hat{g}_{-i}(\mathbf{x}_i)$ is formed, where the notation −*i* indicates that all data other than that for case *i* are used for estimating the regression function. This process is repeated for all *n* data points. Subsequently, the cross-validation criterion

$$
\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \hat{g}_{-i}(\mathbf{x}_i)\right]^2
$$

is formed. A cross-validation estimate of *h* is the minimizer of CV(*h*); this is found by carrying out the computations over a grid of *h*-values. Hart and Lee (2005), however, present evidence of large variability of the leave-one-out estimates of *h*. Hastie *et al.* (2001) discuss alternatives based on leaving out 10–20% of the sample values. Ruppert *et al.* (2003) present procedures for simple calculation of cross-validation statistics.
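The leave-one-out criterion can be sketched compactly for a Gaussian kernel: zeroing the diagonal of the kernel matrix removes each case from its own fit. The toy data and grid of *h*-values are assumptions for illustration:

```python
import numpy as np

def loo_cv(X, y, h):
    # CV(h) = (1/n) * sum_i [y_i - ghat_{-i}(x_i)]^2, with ghat the
    # Nadaraya-Watson fit computed from the other n - 1 cases.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2 / h ** 2)     # Gaussian kernel; constants cancel in the ratio
    np.fill_diagonal(K, 0.0)           # "leave one out"
    yhat = (K @ y) / K.sum(axis=1)
    return np.mean((y - yhat) ** 2)

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(150, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, 150)

grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
cv = {h: loo_cv(X, y, h) for h in grid}
h_star = min(cv, key=cv.get)
```

On these data, CV(*h*) is large for strongly oversmoothed fits (*h* = 2 flattens the curve) and its minimizer falls at a small-to-moderate band width, mirroring Figure 1B.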

D. Gulisija, D. Gianola and K. A. Weigel (unpublished results) used another nonparametric procedure, LOESS, to study the relationship between performance and inbreeding in Jersey cows. There is some relationship between LOESS and kernel regression. In LOESS, the number of points contributing to a focal fitted value is fixed (contrary to kernel regression, where the actual number depends on the gentleness of the kernel chosen) and governed by a spanning parameter. This parameter (ranging between 0 and 1) is equivalent to *h* and dictates the fraction of all data points that contribute toward a fitted value. Figure 2 gives a LOESS fit for protein yield (actually, residuals from a parametric model) and 100 bootstrap replicates, illustrating uncertainty about the regression surface. Without getting into details, note that yield decreases gently at low values of inbreeding, followed by a faster linear decline, and then by an apparent increase. Irrespective of the variability (due to the fact that few animals were either noninbred or highly inbred), neither the change of rate at low inbreeding nor the increase in yield at high consanguinity would be predicted by standard quantitative genetics theory. This is another illustration of how “irregularities” that would otherwise remain hidden can be discovered nonparametrically.

#### Estimation of linear approximation:

If the kernel function is a probability density function, the nonparametric density estimator (3) will be a density function as well, retaining differentiability properties of the kernel (Silverman 1986). Consider *E*(*y* | **x**) = *g*(**x**) and suppose that one is interested in inferring a linear approximation to *g*(**x**) near some fixed point **x***, such as the mean value of the covariates; this leads to a nonparametric counterpart of additive genetic value. From a plant and animal breeding point of view, the concept of breeding value is essential in parametric models, so it seems important to develop a nonparametric counterpart as well. The linear function is

$$
g_L(\mathbf{x}) = g(\mathbf{x}^*) + (\mathbf{x}-\mathbf{x}^*)'\mathbf{b}, \tag{8}
$$

where

$$
\mathbf{b} = \left.\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\mathbf{x}^*}
$$

is the gradient of *g*(**x**) at **x** = **x***. This suggests the pseudomodel

$$
y_i = g(\mathbf{x}^*) + (\mathbf{x}_i-\mathbf{x}^*)'\mathbf{b} + e_i^* \tag{9}
$$

and the approximate variance decomposition

$$
\mathrm{Var}(y) \approx \mathbf{b}'\,\mathrm{Var}(\mathbf{x})\,\mathbf{b} + \sigma_e^2. \tag{10}
$$

The first term in the expression above can be interpreted as the variance contributed by the linear effect of **x** in the neighborhood of **x***. Further, if **x** is a vector of genotypes for molecular markers, this would be, roughly, the additive variance “due to” markers.

A nonparametric estimator of (8) is given by

$$
\hat{g}_L(\mathbf{x}) = \hat{g}(\mathbf{x}^*) + (\mathbf{x}-\mathbf{x}^*)'\hat{\mathbf{b}}, \tag{11}
$$

where $\hat{g}(\mathbf{x}^*)$ is the Nadaraya–Watson estimator (6) evaluated at the focal point,

$$
\hat{g}(\mathbf{x}^*) = \sum_{i=1}^{n} w_i(\mathbf{x}^*)\,y_i, \tag{12}
$$

and $\hat{\mathbf{b}}$ is an estimate of the vector of first derivatives of *g*(**x**) with respect to **x**, evaluated at **x***. The estimate could be the gradient of (6),

$$
\hat{\mathbf{b}} = \left.\frac{\partial \hat{g}(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\mathbf{x}^*} = \sum_{i=1}^{n}\left.\frac{\partial w_i(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\mathbf{x}^*} y_i, \tag{13}
$$

where the vectors of derivatives of the weights have forms that depend on the kernel function used. Hence, (11) becomes

$$
\hat{g}_L(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x}^*)\,y_i + (\mathbf{x}-\mathbf{x}^*)'\hat{\mathbf{b}}. \tag{14}
$$

Likewise, an estimator of the variance contributed to (10) by the linear effect of **x** (nonparametric additive genetic variance from the SNPs) is given by $\hat{\mathbf{b}}'\hat{\mathbf{V}}_{\mathbf{x}}\hat{\mathbf{b}}$, where $\hat{\mathbf{V}}_{\mathbf{x}}$ is some estimate of the covariance matrix of **x**. Finally, the relative contribution to variance made by the linear effect of **x** (τ²) can be assessed as

$$
\hat{\tau}^2 = \frac{\hat{\mathbf{b}}'\hat{\mathbf{V}}_{\mathbf{x}}\hat{\mathbf{b}}}{\hat{\mathbf{b}}'\hat{\mathbf{V}}_{\mathbf{x}}\hat{\mathbf{b}} + \hat{\sigma}_e^2}, \tag{15}
$$

where $\hat{\sigma}_e^2$ is an estimate of $\sigma_e^2$, such as

$$
\hat{\sigma}_e^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \hat{g}(\mathbf{x}_i)\right]^2. \tag{16}
$$
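As a sketch, the gradient in (13) can be approximated by finite differences of the Nadaraya–Watson fit, and τ² assessed as in (15)–(16). The two-covariate toy data and band width are assumptions for illustration, not from the article:

```python
import numpy as np

def nw(x, X, y, h):
    # Gaussian-kernel Nadaraya-Watson fit (6) at point x.
    k = np.exp(-0.5 * np.sum((x - X) ** 2, axis=1) / h ** 2)
    return (k @ y) / k.sum()

def nw_gradient(x, X, y, h, eps=1e-5):
    # Central finite-difference approximation to the gradient of ghat at x,
    # a numerical alternative to the analytic expression in (13).
    g = np.zeros(len(x))
    for j in range(len(x)):
        e = np.zeros(len(x)); e[j] = eps
        g[j] = (nw(x + e, X, y, h) - nw(x - e, X, y, h)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

x_star = X.mean(axis=0)          # focal point: mean of the covariates
b_hat = nw_gradient(x_star, X, y, h=0.5)

# Nonparametric "additive" variance and its relative contribution (15)-(16):
V = np.cov(X, rowvar=False)
var_lin = b_hat @ V @ b_hat
sigma2 = np.mean((y - np.array([nw(xx, X, y, 0.5) for xx in X])) ** 2)
tau2 = var_lin / (var_lin + sigma2)
```

Note that kernel smoothing attenuates the estimated slopes somewhat relative to the true coefficients (1.5, −0.5); the signs and the dominant share of variance attributed to the linear effect are nonetheless recovered.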

#### Gaussian kernel:

A *p*-dimensional Gaussian kernel with a single band width parameter *h* has the form

$$
K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right) = (2\pi)^{-p/2}\exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i)}{2h^{2}}\right],
$$

so that the estimator of the density at **x** is the finite mixture of Gaussians

$$
\hat{p}(\mathbf{x}) = \frac{1}{nh^{p}}\sum_{i=1}^{n}(2\pi)^{-p/2}\exp\!\left[-\frac{1}{2h^{2}}\sum_{j=1}^{p}(x_j - x_{ij})^2\right],
$$

where, for $\mathbf{x}_i = [x_{i1}\; x_{i2}\;\ldots\; x_{ip}]'$ at focal $\mathbf{x} = [x_1\; x_2\;\ldots\; x_p]'$,

$$
(\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i) = \sum_{j=1}^{p}(x_j - x_{ij})^2.
$$

The weights entering into estimator (6) are then

$$
w_i(\mathbf{x}) = \frac{\exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i)}{2h^{2}}\right]}{\sum_{j=1}^{n}\exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_j)'(\mathbf{x}-\mathbf{x}_j)}{2h^{2}}\right]}.
$$

The fitted surface is given by

$$
\hat{g}(\mathbf{x}) = \frac{\mathbf{v}'(\mathbf{x})\,\mathbf{y}}{\mathbf{v}'(\mathbf{x})\,\mathbf{1}}, \tag{17}
$$

where **y** is the *n* × 1 vector of data points, **1** is an *n* × 1 vector of ones, and **v**(**x**) is an *n* × 1 vector with typical element

$$
v_i(\mathbf{x}) = \exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i)}{2h^{2}}\right].
$$

Consider next the linear approximation in (14). Using (17) and the quotient rule, write

$$
\frac{\partial \hat{g}(\mathbf{x})}{\partial \mathbf{x}}
= \frac{\left[\dfrac{\partial \mathbf{v}'(\mathbf{x})}{\partial \mathbf{x}}\right]\mathbf{y}}{\mathbf{v}'(\mathbf{x})\,\mathbf{1}}
- \frac{\mathbf{v}'(\mathbf{x})\,\mathbf{y}}{\left[\mathbf{v}'(\mathbf{x})\,\mathbf{1}\right]^{2}}\left[\frac{\partial \mathbf{v}'(\mathbf{x})}{\partial \mathbf{x}}\right]\mathbf{1}.
$$

Now, for a Gaussian kernel,

$$
\frac{\partial v_i(\mathbf{x})}{\partial \mathbf{x}} = -\frac{1}{h^{2}}(\mathbf{x}-\mathbf{x}_i)\,v_i(\mathbf{x}).
$$

Further,

$$
\left[\frac{\partial \mathbf{v}'(\mathbf{x})}{\partial \mathbf{x}}\right]\mathbf{y} = -\frac{1}{h^{2}}\sum_{i=1}^{n}(\mathbf{x}-\mathbf{x}_i)\,v_i(\mathbf{x})\,y_i
$$

and

$$
\left[\frac{\partial \mathbf{v}'(\mathbf{x})}{\partial \mathbf{x}}\right]\mathbf{1} = -\frac{1}{h^{2}}\sum_{i=1}^{n}(\mathbf{x}-\mathbf{x}_i)\,v_i(\mathbf{x}).
$$

Hence, evaluating these expressions at **x** = **x*** yields the gradient $\hat{\mathbf{b}}$ required in (14).
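The analytic gradient of the Gaussian-kernel fit can be sketched directly and checked against finite differences; the toy data below (a quadratic surface) are an assumption for illustration:

```python
import numpy as np

def nw_and_gradient(x, X, y, h):
    # Gaussian-kernel Nadaraya-Watson fit (17) and its analytic gradient:
    # v_i(x) = exp(-||x - x_i||^2 / (2 h^2)),  dv_i/dx = -(x - x_i) v_i / h^2,
    # ghat = (v'y)/(v'1), so dghat/dx follows from the quotient rule.
    D = x - X                                   # n x p matrix with rows (x - x_i)'
    v = np.exp(-0.5 * np.sum(D * D, axis=1) / h ** 2)
    s, t = v.sum(), v @ y
    dv = -(D * v[:, None]) / h ** 2             # n x p; row i is dv_i/dx
    grad = (dv.T @ y) / s - t * (dv.T @ np.ones(len(y))) / s ** 2
    return t / s, grad

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = X[:, 0] ** 2 + rng.normal(0, 0.05, 400)

ghat, grad = nw_and_gradient(np.array([1.0, 0.0]), X, y, h=0.3)
```

The analytic gradient agrees with a central finite difference of the fitted surface, which is a convenient check on the derivative expressions above.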

#### Kernels for discrete covariates:

For a biallelic SNP, there are three possible genotypes at each “locus,” as in stylized Mendelian situations. In a standard (parametric) analysis of variance representation, incidence situations (or additive and dominance effects at each of the loci) are described via two dummy binary variables per locus, and all corresponding epistatic interactions can be assessed from effects of cross-products of these variables. This leads to a highly parameterized structure and to formidable model selection problems.

Consider now the nonparametric approach. For an **x** vector with *p* coordinates, its statistical distribution is given by the probabilities of each of the $3^{p}$ genotypic combinations, coded via binary outcomes. With SNPs, *p* can be very large (possibly much larger than *n*), so it is hopeless to estimate the probability distribution of genotypes accurately from observed relative frequencies, and smoothing is required (Silverman 1986). Kernel estimation extends as follows: for binary covariates the number of disagreements between a focal **x** and the observed **x**_{i} in subject *i* is given by

$$
d(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}-\mathbf{x}_i)'(\mathbf{x}-\mathbf{x}_i),
$$

where *d*(.) takes values between 0 and *p*. For illustration, suppose that one has genotype *AABbccDd* as focal point and that individual *i* in the sample has been genotyped as *aaBbccDD*. The vectors of binary covariates for the focal and observed cases (save for an intercept) are given in the matrix

Genotype | AA | Aa | aa | BB | Bb | bb | CC | Cc | cc | DD | Dd | dd
---|---|---|---|---|---|---|---|---|---|---|---|---
Observed | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0
Focal | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0
Agree | N | Y | N | Y | Y | Y | Y | Y | Y | N | N | Y

where N (Y) stands for disagreement (agreement). Then, *d*(**x**, **x**_{i}) = 4, which is twice the number of disagreements in genotypes, because there are only 2 “d.f.” per locus. In practice, one should work with a representation of incidences that is free of redundancies. For binary covariates, Silverman (1986) suggests the “binomial” kernel

$$
K(\mathbf{x}, \mathbf{x}_i \,|\, h) = h^{\,p - d(\mathbf{x}, \mathbf{x}_i)}(1-h)^{\,d(\mathbf{x}, \mathbf{x}_i)},
$$

with $\tfrac{1}{2} \le h \le 1$; alternative forms of the kernel function are discussed by Aitchison and Aitken (1976) and Racine and Li (2004). It follows that the kernel estimate of the probability of observing the focal value **x** is

$$
\hat{p}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} h^{\,p - d(\mathbf{x}, \mathbf{x}_i)}(1-h)^{\,d(\mathbf{x}, \mathbf{x}_i)}. \tag{18}
$$

If *h* = 1, the estimate is just the proportion of cases for which **x**_{i} = **x**; if $h = \tfrac{1}{2}$, every focal point gets an estimate equal to $\left(\tfrac{1}{2}\right)^{p}$, irrespective of the observed values **x**_{1}, **x**_{2},…, **x**_{n}.

The nonparametric estimator of the regression function is

$$
\hat{g}(\mathbf{x}) = \frac{\sum_{i=1}^{n} y_i\, h^{\,p - d(\mathbf{x}, \mathbf{x}_i)}(1-h)^{\,d(\mathbf{x}, \mathbf{x}_i)}}{\sum_{i=1}^{n} h^{\,p - d(\mathbf{x}, \mathbf{x}_i)}(1-h)^{\,d(\mathbf{x}, \mathbf{x}_i)}}.
$$

Since a discrete distribution does not possess derivatives, the additive genetic value must be defined in the classical sense (*e.g.*, Falconer 1960), that is, by defining contrasts between expected values of individuals having appropriate genotypes. Here, one can use either the vectorial representation or the concept of the additive regression model in (7). Hereinafter, it is assumed that the distribution of **x** is continuous, and a continuous kernel function is employed throughout, as an approximation.
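The binomial kernel and estimator (18) can be sketched directly; the two limiting cases stated above (*h* = 1 recovers raw relative frequencies, *h* = ½ smooths every genotype to the uniform value $2^{-p}$) make a convenient check. The toy genotype data are an assumption for illustration:

```python
import numpy as np

def binomial_kernel(x, xi, h):
    # Silverman's "binomial" kernel for binary covariates:
    # K = h^(p - d) * (1 - h)^d, d = number of disagreements, 1/2 <= h <= 1.
    d = np.sum(x != xi)
    p = len(x)
    return h ** (p - d) * (1 - h) ** d

def phat(x, X, h):
    # Kernel estimate (18) of the probability of the focal genotype x.
    return np.mean([binomial_kernel(x, xi, h) for xi in X])

rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(500, 4))   # 4 binary "loci" for 500 individuals
x = np.array([1, 0, 1, 0])              # focal genotype

freq = np.mean(np.all(X == x, axis=1))  # raw relative frequency of x
```

The choice of *h* thus interpolates between the empirical frequency table and a uniform distribution over all genotype combinations.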

## SEMIPARAMETRIC KERNEL MIXED MODEL

#### General considerations:

Consider now a situation for which there might be an operational or mechanistic basis for specifying at least part of a model. For instance, suppose that *y* is a measure on some quantitative trait, such as milk production of a cow. Animal breeders have exploited to advantage the infinitesimal model of quantitative genetics (Fisher 1918). Vectorial representations of this model are given by Quaas and Pollak (1980), and applications to natural populations are discussed by Kruuk (2004). In this section, we combine features of the infinitesimal model with a nonparametric treatment of genomic data and present semiparametric implementations.

#### Statistical specification:

Model (1) is expanded as

$$
y_i = \mathbf{w}_i'\boldsymbol{\beta} + \mathbf{z}_i'\mathbf{u} + g(\mathbf{x}_i) + e_i, \tag{19}
$$

where $\boldsymbol{\beta}$ is a vector of nuisance location effects and **u** is a *q* × 1 vector containing additive genetic effects of *q* individuals (these effects are assumed to be independent of those of the markers), some of which may lack a phenotypic record; **w**′_{i} and **z**′_{i} are known nonstochastic incidence vectors. As before, *g*(**x**_{i}) is some unknown function of the SNP data. It is assumed that $\mathbf{u} \sim N(\mathbf{0}, \mathbf{A}\sigma_u^2)$, where $\sigma_u^2$ is the “unmarked” additive genetic variance and **A** is the additive relationship matrix, whose entries are twice the coefficients of coancestry between individuals. Let $\mathbf{e} = [e_1, e_2, \ldots, e_n]'$ be the *n* × 1 vector of residuals, and assume that $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, where $\sigma_e^2$ is the residual variance. Note that the model implies that

$$
y_i - g(\mathbf{x}_i) = \mathbf{w}_i'\boldsymbol{\beta} + \mathbf{z}_i'\mathbf{u} + e_i \tag{20}
$$

and

$$
y_i - \mathbf{w}_i'\boldsymbol{\beta} - \mathbf{z}_i'\mathbf{u} = g(\mathbf{x}_i) + e_i. \tag{21}
$$

The preceding means that (1) the offset $y_i - g(\mathbf{x}_i)$ follows a standard mixed-effects model and (2) if $\boldsymbol{\beta}$ and **u** were known, one could use (6) to estimate *g*(**x**_{i}) employing $y_i - \mathbf{w}_i'\boldsymbol{\beta} - \mathbf{z}_i'\mathbf{u}$ as “observations.” This suggests two strategies for analysis of data, discussed below.

#### Strategy 1—mixed model analysis:

This follows from representation (20). First, estimate *g*(**x**_{i}), for *i* = 1, 2, … , *n*, via $\hat{g}(\mathbf{x}_i)$, as in (6) or, if a Gaussian kernel is adopted, as in (17). Then, carry out a mixed model analysis using the “corrected” data vector $\mathbf{y}^* = \{y_i - \hat{g}(\mathbf{x}_i)\}$ and pseudomodel

$$
\mathbf{y}^* = \mathbf{W}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e},
$$

where **W** and **Z** are incidence matrices of appropriate order. The pseudomodel ignores uncertainty about $\hat{g}(\mathbf{x}_i)$, since it is treated as if it were the true regression (on SNPs) surface.

Under the standard multivariate normality assumptions of the infinitesimal model, one can estimate the variance components $\sigma_u^2$ and $\sigma_e^2$ from **y*** via restricted maximum likelihood (REML) (Patterson and Thompson 1971) and form empirical best linear unbiased estimators and predictors of **β** and **u,** respectively, by solving the Henderson mixed model equations

$$
\begin{bmatrix}
\mathbf{W}'\mathbf{W} & \mathbf{W}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\lambda
\end{bmatrix}
\begin{bmatrix}
\hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{W}'\mathbf{y}^* \\ \mathbf{Z}'\mathbf{y}^*
\end{bmatrix} \tag{22}
$$

(Henderson 1973). The ratio $\lambda = \sigma_e^2/\sigma_u^2$ is evaluated at REML estimates of the variance components. Solving system (22) is a standard problem in animal breeding even for very large *q*, since **A**^{−1} is easy to compute. The two-stage procedure could be iterated several times, *i.e*., use the solutions to (22) to obtain a new estimate of *g*(**x**_{i}), using $y_i - \mathbf{w}_i'\hat{\boldsymbol{\beta}} - \mathbf{z}_i'\hat{\mathbf{u}}$ as “data,” and then update the pseudodata **y***, etc.
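A minimal numerical sketch of solving (22) follows. The toy setting (a single overall mean, unrelated individuals so that **A** = **I**, and an assumed value of λ in place of REML estimates) is chosen so that the solutions have a known closed form:

```python
import numpy as np

def solve_mme(W, Z, A_inv, ystar, lam):
    # Henderson's mixed model equations (22) for the corrected data y*:
    # [ W'W   W'Z              ] [beta]   [W'y*]
    # [ Z'W   Z'Z + lam * A^-1 ] [ u  ] = [Z'y*]
    p = W.shape[1]
    lhs = np.block([[W.T @ W, W.T @ Z],
                    [Z.T @ W, Z.T @ Z + lam * A_inv]])
    rhs = np.concatenate([W.T @ ystar, Z.T @ ystar])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]

# Toy example: 6 unrelated individuals, one overall mean.
rng = np.random.default_rng(6)
n = q = 6
W = np.ones((n, 1))
Z = np.eye(n)
A_inv = np.eye(q)
ystar = 2.0 + rng.normal(0, 1, q) + rng.normal(0, 0.5, n)

beta_hat, u_hat = solve_mme(W, Z, A_inv, ystar, lam=0.25)  # lam = sig_e^2 / sig_u^2
```

In this special case the equations reduce to $\hat{\beta} = \bar{y}^*$ and $\hat{u}_i = (y_i^* - \hat{\beta})/(1 + \lambda)$, which shows the shrinkage toward zero induced by the variance ratio.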

The “total” additive genetic effect of an individual possessing a vector of SNP covariates with focal value **x** can be defined as the sum of the additive genetic effect of the SNPs, in the sense of (8), plus the polygenic effect *u*_{i}, that is,

$$
T_i = (\mathbf{x} - \mathbf{x}^*)'\mathbf{b} + u_i.
$$

An empirical predictor $\hat{T}_i$ can be formed by adding (14) to the *i*th component of $\hat{\mathbf{u}}$ in the mixed model equations. It is not obvious how a measure of uncertainty about $\hat{T}_i$ can be constructed using this procedure.

A Bayesian approach can be used instead, using the corrected data as observations. Under standard assumptions made for the prior and the likelihood (*e.g.*, Wang *et al.* 1993, 1994; Sorensen and Gianola 2002), one can draw samples $[\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}]$, *j* = 1, 2, … , *m*, from the pseudoposterior distribution via Gibbs sampling and then form “semiparametric” draws of the total genetic value as

$$
T_i^{(j)} = (\mathbf{x}_i - \mathbf{x}^*)'\hat{\mathbf{b}} + u_i^{(j)}
$$

for *j* = 1, 2, … , *m*. These can be construed as draws from a semiparametric pseudoposterior distribution; one can readily calculate the mean, median, percentiles, and variance of this distribution and produce posterior summaries, leading to an approximation to uncertainty about total genetic value.

Irrespective of whether classical or Bayesian viewpoints are adopted, this approach ignores the error of estimation of *g*(**x**), as noted earlier.

#### Strategy 2—random *g*(.)function:

The estimate of *g*(**x**_{i}) can be improved as follows. Consider (21) and suppose, temporarily, that $\boldsymbol{\beta}$ and **u** are known. Then, form the offset $y_i - \mathbf{w}_i'\boldsymbol{\beta} - \mathbf{z}_i'\mathbf{u}$ and the alternative Nadaraya–Watson estimator of *g*(**x**_{i}):

$$
\hat{g}(\mathbf{x}_i \,|\, \boldsymbol{\beta}, \mathbf{u}, h) = \sum_{j=1}^{n} w_j(\mathbf{x}_i)\left(y_j - \mathbf{w}_j'\boldsymbol{\beta} - \mathbf{z}_j'\mathbf{u}\right). \tag{23}
$$

The problem is that the location vectors $\boldsymbol{\beta}$ and **u** are not observable, so they must be inferred somehow.

Regard now $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2]$ as unknown quantities possessing some prior distribution, which is modified via Bayesian learning into the pseudoposterior process $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2 \,|\, \mathbf{y}^*, \mathbf{X}, h]$, after observation of the pseudodata **y*** and of the *n* × *p* matrix of observed SNP covariates **X**. Then, for some band width *h*, $\hat{g}(\mathbf{x}_i \,|\, \boldsymbol{\beta}, \mathbf{u}, h)$ would also be a random variable possessing some pseudoposterior distribution. Let $[\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}]$ be a draw from the pseudoposterior process, as in the preceding section. Further, regard $\hat{g}(\mathbf{x}_i \,|\, \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h)$ as a draw from the pseudoposterior distribution of $\hat{g}(\mathbf{x}_i \,|\, \boldsymbol{\beta}, \mathbf{u}, h)$, given **y***. Subsequently, estimate features of this pseudoposterior distribution by ergodic averaging. For example, an estimate of its posterior expectation would be

$$
\hat{E}\left[\hat{g}(\mathbf{x}_i) \,|\, \mathbf{y}^*\right] = \frac{1}{m}\sum_{j=1}^{m}\hat{g}\left(\mathbf{x}_i \,|\, \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h\right),
$$

where the samples $[\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}]$, *j* = 1, 2, … , *m*, are obtained via a Markov chain Monte Carlo (MCMC) procedure from the pseudoposterior distribution $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2 \,|\, \mathbf{y}^*, \mathbf{X}, h]$.

The MCMC procedure would consist of sequential, iterative, sampling from all conditional pseudoposterior distributions, as follows:

1. Sample $\boldsymbol{\beta}$ from $[\boldsymbol{\beta} \,|\, \mathrm{ELSE}]$, where ELSE denotes all other parameters, **y***, **X**, and *h*. Using standard results, this conditional distribution has the form

   $$
   \boldsymbol{\beta} \,|\, \mathrm{ELSE} \sim N\!\left(\hat{\boldsymbol{\beta}},\, (\mathbf{W}'\mathbf{W})^{-1}\sigma_e^2\right),
   $$

   where $\hat{\boldsymbol{\beta}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'(\mathbf{y}^* - \mathbf{Z}\mathbf{u})$ and $\mathbf{y}^* = \{y_i - \hat{g}(\mathbf{x}_i \,|\, \boldsymbol{\beta}, \mathbf{u}, h)\}$ is evaluated at the current draws.

2. Sample **u** from $[\mathbf{u} \,|\, \mathrm{ELSE}]$. Using similar standard results, the distribution to sample from is

   $$
   \mathbf{u} \,|\, \mathrm{ELSE} \sim N\!\left(\hat{\mathbf{u}},\, \left(\mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\lambda\right)^{-1}\sigma_e^2\right),
   $$

   where $\hat{\mathbf{u}} = \left(\mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\lambda\right)^{-1}\mathbf{Z}'(\mathbf{y}^* - \mathbf{W}\boldsymbol{\beta})$ and $\lambda = \sigma_e^2/\sigma_u^2$.

3. Sample the two variance components from the scaled inverse chi-square distributions

   $$
   \sigma_u^2 \,|\, \mathrm{ELSE} \sim \left(\mathbf{u}'\mathbf{A}^{-1}\mathbf{u} + \nu_u S_u^2\right)\chi_{q+\nu_u}^{-2}
   $$

   and

   $$
   \sigma_e^2 \,|\, \mathrm{ELSE} \sim \left(\mathbf{e}'\mathbf{e} + \nu_e S_e^2\right)\chi_{n+\nu_e}^{-2},
   $$

   where $\mathbf{e} = \mathbf{y}^* - \mathbf{W}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u}$. Above, $\nu_u$, $S_u^2$ and $\nu_e$, $S_e^2$ are known hyperparameters of the independent scaled inverse chi-square priors assigned to $\sigma_u^2$ and $\sigma_e^2$, respectively.

4. Form draws from the pseudoposterior distribution of $\hat{g}(\mathbf{x}_i)$ as $\hat{g}\left(\mathbf{x}_i \,|\, \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h\right)$, using (23).

As usual, early draws in the MCMC procedure are discarded as burn-in and, subsequently, *m* samples are collected to infer features of interest. Note that the procedure termed strategy 1 gives $\hat{g}(\mathbf{x})$ as an estimate of the relationship between phenotypes and SNP data. In strategy 2, one can use $\hat{g}(\mathbf{x} \,|\, \bar{\boldsymbol{\beta}}, \bar{\mathbf{u}}, h)$ instead, where $\bar{\boldsymbol{\beta}}$ and $\bar{\mathbf{u}}$ are means of the pseudoposterior distributions of $\boldsymbol{\beta}$ and **u**, respectively. Under strategy 2, a point predictor of total additive genetic value could be

$$
\hat{T}_i = (\mathbf{x}_i - \mathbf{x}^*)'\hat{\mathbf{b}} + \bar{u}_i,
$$

where $\bar{u}_i$ is the posterior mean of *u*_{i}.
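A heavily simplified Gibbs sketch of strategy 2 follows. It is not the article's implementation: it assumes a flat prior on β, "naive" variance-component draws (ν = 0, *S*² = 0 in the scaled inverse chi-squares), a Gaussian kernel for (23), and toy dimensions; it is meant only to show how the mixed-model draws and the kernel update alternate:

```python
import numpy as np

def gibbs_strategy2(y, W, Z, A_inv, X, h, n_iter=600, burn=200, seed=7):
    # Alternate (i) conditional draws of beta, u, and the variance components
    # given the current ghat, with (ii) the Nadaraya-Watson update (23) of ghat
    # computed from the current offsets y - W beta - Z u.
    rng = np.random.default_rng(seed)
    n, p = W.shape
    q = Z.shape[1]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kw = np.exp(-0.5 * d2 / h ** 2)            # Gaussian kernel weights for (23)
    beta = np.zeros(p); u = np.zeros(q); g = np.zeros(n)
    se2 = su2 = 1.0
    keep = []
    for it in range(n_iter):
        ystar = y - g                           # current pseudodata y*
        # beta | ELSE ~ N(bhat, (W'W)^-1 se2)
        C = np.linalg.inv(W.T @ W)
        beta = rng.multivariate_normal(C @ W.T @ (ystar - Z @ u), C * se2)
        # u | ELSE ~ N(uhat, (Z'Z + A^-1 lam)^-1 se2), lam = se2/su2
        Cu = np.linalg.inv(Z.T @ Z + A_inv * se2 / su2)
        u = rng.multivariate_normal(Cu @ Z.T @ (ystar - W @ beta), Cu * se2)
        # variance components (naive scaled inverse chi-square draws)
        e = ystar - W @ beta - Z @ u
        se2 = (e @ e) / rng.chisquare(n)
        su2 = (u @ A_inv @ u) / rng.chisquare(q)
        # ghat | ELSE via (23): smooth the offsets
        off = y - W @ beta - Z @ u
        g = (Kw @ off) / Kw.sum(axis=1)
        if it >= burn:
            keep.append(g.copy())
    return np.mean(keep, axis=0)               # ergodic average of ghat draws

# Toy data: one overall mean, unrelated individuals, smooth SNP-like signal.
rng = np.random.default_rng(10)
n = 60
X = rng.normal(size=(n, 1))
g_true = np.sin(2 * X[:, 0])
y = 1.0 + g_true + rng.normal(0, 0.3, n)
W = np.ones((n, 1)); Z = np.eye(n); A_inv = np.eye(n)

ghat = gibbs_strategy2(y, W, Z, A_inv, X, h=0.3)
```

Because the kernel update smooths the offsets over covariate space, polygenic and residual noise average out while the signal carried by **x** survives, so the ergodic average tracks the simulated *g*.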

## REPRODUCING KERNEL HILBERT SPACES MIXED MODEL

#### General:

What follows is motivated by developments in Mallick *et al.* (2005) for classification of tumors using microarray data. The underlying theory is outside the scope of this article. Only essentials are given here, and foundations are in Wahba (1990, 1999).

Using the structure of (19), consider the penalized sum of squares

$$
\sum_{i=1}^{n}\left[y_i - \mathbf{w}_i'\boldsymbol{\beta} - \mathbf{z}_i'\mathbf{u} - g(\mathbf{x}_i)\right]^2 + h\,\|g\|^2, \tag{24}
$$

where, as before, *h* is a smoothing parameter (possibly unknown) and $\|g\|^2$ is some norm or “stabilizer.” For instance, in smoothing splines, $\|g\|^2$ is a function of the second derivatives of *g*(**x**) integrated between end points that compose the data. The second term in (24) acts as a penalty: if the unknown function *g*(**x**) is rough, in the sense of having slopes that change rapidly, the penalty increases. The main problem here is that of finding the function *g*(**x**) that minimizes (24). Since $\|g\|^2$ is a functional on *g*(**x**), this is a variational or calculus of variations problem over a space of smooth curves. The solution was given by Kimeldorf and Wahba (1971) and Wahba (1999), and the minimizer admits the representation

$$
g(\mathbf{x}) = \sum_{j=1}^{n}\alpha_j\,K_h(\mathbf{x}, \mathbf{x}_j),
$$

where $K_h(\cdot,\cdot)$ is called a reproducing kernel. A possible choice for the kernel (Mallick *et al.* 2005) is the single smoothing parameter Gaussian function

$$
K_h(\mathbf{x}, \mathbf{x}_j) = \exp\!\left[-\frac{(\mathbf{x}-\mathbf{x}_j)'(\mathbf{x}-\mathbf{x}_j)}{h}\right].
$$

#### Mixed model representation:

We embed these results into (19), leading to the specification

$$
y_i = \mathbf{w}_i'\boldsymbol{\beta} + \mathbf{z}_i'\mathbf{u} + \sum_{j=1}^{n}\alpha_j\,K_h(\mathbf{x}_i, \mathbf{x}_j) + e_i, \tag{25}
$$

with the intercept parameter α_{0} included as part of $\boldsymbol{\beta}$. Note that there are as many regression coefficients α_{j} as there are data points. However, the roughness penalty in the variational problem leads to a reduction in the effective number of parameters in reproducing kernel Hilbert spaces (RKHS) regression, as occurs in smoothing splines (Fox 2002).

Define the 1 × *n* row vector

$$
\mathbf{k}_i' = \left[K_h(\mathbf{x}_i, \mathbf{x}_1)\;\; K_h(\mathbf{x}_i, \mathbf{x}_2)\;\cdots\; K_h(\mathbf{x}_i, \mathbf{x}_n)\right],
$$

the *n* × 1 column vector $\mathbf{k}_j$, *j* = 1, 2,…, *n*; and the *n* × *n* matrix $\mathbf{K} = \{K_h(\mathbf{x}_i, \mathbf{x}_j)\}$. Then (25) can be written in matrix form as

$$
\mathbf{y} = \mathbf{W}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{K}\boldsymbol{\alpha} + \mathbf{e}.
$$

Suppose, further, that the α_{j} coefficients are exchangeable according to the distribution $\boldsymbol{\alpha} \,|\, \sigma_{\alpha}^2 \sim N(\mathbf{0}, \mathbf{I}\sigma_{\alpha}^2)$. Hence, for a given smoothing parameter *h*, we are in the setting of a mixed-effects linear model.

Given *h*, $\sigma_e^2$, $\sigma_u^2$, and $\sigma_{\alpha}^2$ (at a given *h*, the three variance components may be estimated by, *e.g.*, REML), one can obtain predictions of the polygenic breeding values **u** and of the coefficients $\boldsymbol{\alpha}$ from the solutions to the system

$$
\begin{bmatrix}
\mathbf{W}'\mathbf{W} & \mathbf{W}'\mathbf{Z} & \mathbf{W}'\mathbf{K} \\
\mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\dfrac{\sigma_e^2}{\sigma_u^2} & \mathbf{Z}'\mathbf{K} \\
\mathbf{K}'\mathbf{W} & \mathbf{K}'\mathbf{Z} & \mathbf{K}'\mathbf{K} + \mathbf{I}\dfrac{\sigma_e^2}{\sigma_{\alpha}^2}
\end{bmatrix}
\begin{bmatrix}
\hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \\ \hat{\boldsymbol{\alpha}}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{W}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{K}'\mathbf{y}
\end{bmatrix}. \tag{26}
$$

The total genetic value of an individual possessing a vector of SNP covariates **x**_{i} can then be defined as the sum of the polygenic effect and the kernel regression term,

$$
T_i = u_i + \sum_{j=1}^{n}\alpha_j\,K_h(\mathbf{x}_i, \mathbf{x}_j).
$$
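A numerical sketch of system (26) follows; the Gaussian reproducing kernel, the variance ratios, and the toy data are assumptions for illustration (in practice the variance components would come from, *e.g.*, REML):

```python
import numpy as np

def rkhs_mme(W, Z, A_inv, K, y, lam1, lam2):
    # System (26): Henderson's equations augmented with the kernel coefficients
    # alpha, treated as exchangeable N(0, sigma_alpha^2) random effects.
    # lam1 = sig_e^2 / sig_u^2,  lam2 = sig_e^2 / sig_alpha^2.
    p, q, n = W.shape[1], Z.shape[1], K.shape[0]
    lhs = np.block([
        [W.T @ W, W.T @ Z,                  W.T @ K],
        [Z.T @ W, Z.T @ Z + lam1 * A_inv,   Z.T @ K],
        [K.T @ W, K.T @ Z,                  K.T @ K + lam2 * np.eye(n)],
    ])
    rhs = np.concatenate([W.T @ y, Z.T @ y, K.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:p + q], sol[p + q:]

rng = np.random.default_rng(8)
n = 50
X = rng.normal(size=(n, 3))
h = 2.0
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-d2 / h)                       # Gaussian reproducing kernel
W = np.ones((n, 1)); Z = np.eye(n); A_inv = np.eye(n)
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, n)

beta, u, alpha = rkhs_mme(W, Z, A_inv, Kmat, y, lam1=5.0, lam2=0.5)
fitted = W @ beta + Z @ u + Kmat @ alpha
```

The penalties on **u** and **α** make the coefficient matrix positive definite even though there are as many α_{j} as data points, which is the "effective parameter reduction" noted above.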