Abstract
Semiparametric procedures for prediction of total genetic value for quantitative traits, which make use of phenotypic and genomic data simultaneously, are presented. The methods focus on the treatment of massive information provided by, e.g., single-nucleotide polymorphisms. It is argued that standard parametric methods for quantitative genetic analysis cannot handle the multiplicity of potential interactions arising in models with, e.g., hundreds of thousands of markers, and that most of the assumptions required for an orthogonal decomposition of variance are violated in artificial and natural populations. This makes nonparametric procedures attractive. Kernel regression and reproducing kernel Hilbert spaces regression procedures are embedded into standard mixed-effects linear models, retaining additive genetic effects under multivariate normality for operational reasons. Inferential procedures are presented, and some extensions are suggested. An example is presented, illustrating the potential of the methodology. Implementations can be carried out after modification of standard software developed by animal breeders for likelihood-based or Bayesian analysis.
MASSIVE quantities of genomic data are now available, with potential for enhancing accuracy of prediction of genetic value of, e.g., candidates for selection in animal and plant breeding programs or for molecular classification of disease status in subjects (Golub et al. 1999). For instance, Wong et al. (2004) reported a genetic variation map of the chicken genome containing 2.8 million single-nucleotide polymorphisms (SNPs) and demonstrated how the information can be used for targeting specific genomic regions. Likewise, Hayes et al. (2004) found 2507 putative SNPs in the salmon genome that could be valuable for marker-assisted selection in this species.
The use of molecular markers as aids in genetic selection programs has been discussed extensively. Important early articles are Soller and Beckmann (1982) and Fernando and Grossman (1989), with the latter focusing on best linear unbiased prediction of genetic value when marker information is used. Most of the literature on marker-assisted selection deals with the problem of locating one or few quantitative trait loci (QTL) using flanking markers. However, in the light of current knowledge about genomics, the widely used single-QTL search approach is naive, since there is evidence of abundant QTL affecting complex traits, as discussed, e.g., by Dekkers and Hospital (2002). This would support the infinitesimal model of Fisher (1918) as a sensible statistical specification for many quantitative traits, with complications being the accommodation of nonadditivity and of feedbacks (Gianola and Sorensen 2004). Dekkers and Hospital (2002) observe that existing statistical methods for marker-assisted selection do not deal well with complexity posed by quantitative traits. Some difficulties are: specification of “statistical significance” thresholds for multiple testing, strong dependence of inferences on model chosen (e.g., number of QTL fitted, distributional forms), inadequate handling of nonadditivity, and ambiguous interpretation of effects in multiple-marker analysis, due to collinearity.
Here, we discuss how large-scale molecular information, such as that conveyed by SNPs, can be employed for marker-assisted prediction of genetic value for quantitative traits in the sense of, e.g., Meuwissen et al. (2001), Gianola et al. (2003), and Xu (2003). The focus is on inference of genetic value, rather than detection of quantitative trait loci. A main challenge is that of positing a functional form relating phenotypes to SNP genotypes (viewed as thousands of possibly highly colinear covariates), to polygenic additive genetic values, and to other nuisance effects, such as sex or age of an individual, simultaneously.
Standard quantitative genetics theory gives a mechanistic basis to the mixed-effects linear model, treated either from classical (Sorensen and Kennedy 1983; Henderson 1984) or from Bayesian (Gianola and Fernando 1986) perspectives. Meuwissen et al. (2001) and Gianola et al. (2003) exploit this connection and suggest highly parametric structures for modeling relationships between phenotypes and effects of hundreds or thousands of molecular markers. A first concern is the strength of their assumptions (e.g., linearity, multivariate normality, proportion of segregating loci, spatial within-chromosome effects); it is unknown if their procedures are robust. Second, colinearity between SNP or marker genotypes is bound to exist, because of the sheer massiveness of molecular data plus cosegregation of alleles. While adverse effects of colinearity can be tempered when marker effects are treated as random variables, statistical redundancy is undesirable (Lindley and Smith 1972).
The genome seems to be much more highly interactive than what standard quantitative genetic models can accommodate (e.g., D'Haeseleer et al. 2000). In theory, genetic variance can be partitioned into orthogonal additive, dominance, additive × additive, additive × dominance, dominance × dominance, etc., components, only under highly idealized conditions. These include linkage equilibrium, absence of natural or artificial selection, and no inbreeding or assortative mating (Cockerham 1954; Kempthorne 1954). Arguably, these conditions are violated in nature and in breeding programs. Actually, marker-assisted selection exploits existence of linkage disequilibrium, and even chance creates disequilibrium. Further, estimation of nonadditive components of variance is notoriously difficult, even under standard assumptions (Chang 1988). Therefore, it is doubtful whether or not standard quantitative genetic approaches can model fine-structure relationships between genotypes and phenotypes adequately, unless either departures from assumptions have mild effects or statistical constructs based on multivariate normality turn out to be more robust than what is expected on theoretical grounds. These considerations suggest that a nonparametric treatment of the data could be valuable.
On the other hand, application of the additive genetic model in selective breeding of livestock has produced remarkable dividends, as shown in Dekkers and Hospital (2002). Hence, a combination of nonparametric modeling of effects of molecular variables (e.g., SNPs) with features of the additive polygenic mode of inheritance is appealing.
Our objective is to present semiparametric methods for prediction of genetic value for complex traits that make use of phenotypic and genomic data simultaneously. This article is organized as follows. kernel regression on snp markers introduces nonparametric regression, kernel functions, and smoothing parameters and proposes a nonparametric approximation to additive genetic value. Next, semiparametric kernel mixed model combines features of kernel regression with the mixed-effects linear model and describes classical and Bayesian implementations. reproducing kernel hilbert spaces mixed model uses established calculus of variations results and hybridizes the mixed-effects linear model with a regression on kernel basis functions. Estimation procedures are presented, the problem of incomplete genotyping is addressed, and a simulated example is given, to illustrate feasibility and potential. The article concludes with a discussion and with suggestions for additional research.
KERNEL REGRESSION ON SNP MARKERS
The regression function:
Consider a stylized situation in which each of a series of individuals possesses a measurement for some quantitative trait denoted as y, as well as information on a possibly massive number of genomic variables, such as SNP “genotypes,” represented by a vector x. In the main, x is treated as a continuously valued vector of covariates, even though SNP genotypes are discrete (coding is done via dummy variates). Also, x could represent gene expression measurements from microarray experiments; here, it would be legitimate to regard this vector as continuous. Although gene expression measurements are typically regarded as response variables, there are contexts in which this type of information could be used in an explanatory role (Mallick et al. 2005).
Let the relationship between y and x be represented as(1)where yi is a measurement, such as plant height or body weight, taken on individual i, xi is a p × 1 vector of dummy SNP or microsatellite covariates observed on i, and g(.) is some unknown function relating these genotypes to phenotypes. Define g(xi) = E(yi|xi) as the conditional expectation function, that is, the mean phenotypic value of an infinite number of individuals, all possessing the p-dimensional genotype xi. ei ∼ (0, σ2) is a random residual, distributed independently of xi and with variance σ2.
The conditional expectation function is(2)Following Silverman (1986), consider a nonparametric kernel estimator of the p-dimensional density of the covariates,
(3)where
is a kernel function and h is a window width or smoothing parameter. In (3), x is the value (“focal point”) at which the density is evaluated and xi (i = 1, 2,…,n) is the observed p-dimensional SNP genotype of individual i in the sample. Hence, (3) estimates population densities (or frequencies). If
is to behave as a multivariate probability density function, then it must be true that the kernel function is positive and that the condition
is satisfied. This implies that
Similarly, and assuming that a single h parameter suffices, one can estimate the joint density of phenotype and genotypes at point (y, x) as
where
is also a kernel function; again, yi is the observed sample value of variable y in individual i. The numerator of (2) can be estimated as
(4)In (4), let
, so that dy = hdz and
The kernel function is typically a proper probability density function chosen such that
and
. If so, the preceding expression is equal to yi, so that (4) becomes
(5)
Returning to (2), one can form the nonparametric estimatorwhich, upon replacing the numerator and denominator by (5) and (3), respectively, takes the form
(6)where
is a weight that depends on the kernel function and window width chosen and on the xi (i.e., genotypes) observed in the sample. The linear combination of the observations (6) is called the Nadaraya–Watson estimator of the regression function (Nadaraya 1964; Watson 1964). As seen in (6), the fitted value at coordinate x is a weighted average of all data points, with the value of the weight depending on the “proximity” of xi to x and on the value of the smoothing parameter h. For instance, if the kernel function has the Gaussian form
this has a maximum value of
when x = xi and tails off to 0 as the distance between x and xi increases. The values of wi(x) decrease more abruptly as
. This is the reason why this type of estimator is called “local,” in the sense that observations with xi coordinates closer to the focal point x are weighted more strongly in the computation of the fitted value
.
A specification that is more restrictive than (1) is the additive regression model(7)(Hastie and Tibshirani 1990; Fox 2005), where xij is the genotype for SNP j in individual i. In this model each of the “partial regression” functions is two-dimensional, thus allowing exploration of effects of individual SNPs on phenotypes, albeit at the expense of ignoring, e.g., epistatic interactions between genotypes. A preliminary examination of relationships can be done via calculation of Nadaraya–Watson estimators of each of the gj(.) as
where the kernels are unidimensional. Naturally, one may wish to account for interactions between SNPs, so this type of analysis would be merely exploratory. A specification that is intermediate between (7) and (1) could include sums of single, pairwise, tripletwise, etc., SNP regression functions.
Impact of window width:
The scatter plot shown in Figure 1, from Chu and Marron (1991), consists of data on log-income and age of 205 people. The solid line in Figure 1A is a moving weighted average of the points, with weights proportional to the curve at the bottom. Chu and Marron (1991) regard the dip in average income in the middle ages as “unexpected.” Figure 1B gives results from three different smooths at window widths of 1, 3, and 9. When h = 1, the curve displays considerable sampling variation or roughness. On the other hand, when h = 9, local features disappear, because points that are far apart receive considerable weight in the fitting procedure. If the dip is not an artifact, the oversmoothing results in a “bias.” Hence, h must be gauged carefully.
Impact of window width on the regression function of log-income on age (Chu and Marron 1991). (A) Scatter plot and (B) smooths for earning power data. Kernel is N(0, 1). Window widths are represented by curves: solid curves, h = 3; dotted curves, h =1; dashed curves, h = 9.
Marron (1988) and Chu and Marron (1991) discuss data-driven procedures for assessing h. Silverman (1986) gives a discussion in the context of density estimation, whereas Mallick et al. (2005) consider Hilbert spaces kernel regression, with h treated as an unknown parameter. A conceptually simple and intuitively appealing procedure is cross-validation (e.g., Schucany 2004). For instance, in the “leave-one-out” procedure, the datum for case i, that is (yi, xi), is deleted, and a fit is carried out on the basis of the other n − 1 cases. Then, the prediction of yi is formed, where the notation −i indicates that all data other than that for case i are used for estimating the regression function. This process is repeated for all n data points. Subsequently, the cross-validation criterion
is formed. A cross-validation estimate of h is the minimizer of CV(h); this is found by carrying out the computations over a grid of h-values. Hart and Lee (2005), however, present evidence of large variability of the leave-one-out estimates of h. Hastie et al. (2001) discuss alternatives based on leaving out 10–20% of the sample values. Ruppert et al. (2003) present procedures for simple calculation of cross-validation statistics.
D. Gulisija, D. Gianola and K. A. Weigel (unpublished results) used another nonparametric procedure, LOESS, to study the relationship between performance and inbreeding in Jersey cows. There is some relationship between LOESS and kernel regression. In LOESS, the number of points contributing to a focal fitted value is fixed (contrary to kernel regression, where the actual number depends on the gentleness of the kernel chosen) and governed by a spanning parameter. This parameter (ranging between 0 and 1) is equivalent to h and dictates the fraction of all data points that contribute toward a fitted value. Figure 2 gives a LOESS fit for protein yield (actually, residuals from a parametric model) and 100 bootstrap replicates, illustrating uncertainty about the regression surface. Without getting into details, note that yield decreases gently at low values of inbreeding, followed by a faster linear decline, and then by an apparent increase. Irrespective of the variability (due to that few animals were either noninbred or highly inbred), neither the change of rate at low inbreeding nor the increase in yield at high consanguinity would be predicted by standard quantitative genetics theory. This is another illustration of how “irregularities” can be discovered nonparametrically, which would remain hidden otherwise.
LOESS curves of protein yield deviations (EBLUP) in Jersey cows against their inbreeding coefficients (F, %). The thick curve is the fitted regression (second-degree local polynomial, spanning parameter = 0.9); the other curves are 100 bootstrap replicates.
Estimation of linear approximation:
If the kernel function is a probability density function, the nonparametric density estimator (3) will be a density function as well, retaining differentiability properties of the kernel (Silverman 1986). Consider E (y | x) = g(x) and suppose that one is interested in inferring a linear approximation to g(x) near some fixed point x* such as the mean value of the covariates; this leads to a nonparametric counterpart of additive genetic value. From a plant and animal breeding point of view, the concept of breeding value is essential in parametric models, so it seems important to develop a nonparametric counterpart as well. The linear function is(8)where
is the gradient of g(x) at x = x*. This suggests the pseudomodel
(9)and the approximate variance decomposition
(10)The first term in the expression above can be interpreted as the variance contributed by the linear effect of x in the neighborhood of x*. Further, if x is a vector of genotypes for molecular markers, this would be, roughly, the additive variance “due to” markers.
A nonparametric estimator of (8) is given by(11)where
(12)and
is an estimate of the vector of first derivatives of g(x) with respect to x, evaluated at x*. The estimate could be the gradient of (6),
(13)where
and
are vectors whose form depends on the kernel function used. Hence, (11) becomes
(14)Likewise, an estimator of the variance contributed to (10) by the linear effect of x (nonparametric additive genetic variance from the SNPs) is given by
, where
is some estimate of the covariance matrix of x. Finally, the relative contribution to variance made by the linear effect of x(τ2) can be assessed as
(15)where
is an estimate of σ2, such as
(16)
Gaussian kernel:
A p-dimensional Gaussian kernel with a single band width parameter h has the formso that the estimator of the density at x is the finite mixture of Gaussians,
where, for xi = [xi1 xi2 . xip]′, at focal x = [x1 x2 . xp]′
The weights entering into estimator (6) are thenThe fitted surface is given by
(17)where y is the n × 1 vector of data points, 1 is an n × 1 vector of ones, and v′(x) is an n × 1 vector with typical element
Consider next the linear approximation in (14). Using (17), writeNow, for a Gaussian kernel,
Further,
and
Hence
and
Kernels for discrete covariates:
For a biallelic SNP, there are three possible genotypes at each “locus,” as in stylized Mendelian situations. In a standard (parametric) analysis of variance representation, incidence situations (or additive and dominance effects at each of the loci) are described via two dummy binary variables per locus, and all corresponding epistatic interactions can be assessed from effects of cross-products of these variables. This leads to a highly parameterized structure and to formidable model selection problems.
Consider now the nonparametric approach. For an x vector with p coordinates, its statistical distribution is given by the probabilities of each of the 3p combinations of binary outcomes. With SNPs, p can be very large (possibly much larger than n), so it is hopeless to estimate the probability distribution of genotypes accurately from observed relative frequencies, and smoothing is required (Silverman 1986). Kernel estimation extends as follows: for binary covariates the number of disagreements between a focal x and the observed xi in subject i is given bywhere d(.) takes values between 0 and p. For illustration, suppose that one has genotype AABbccDd as focal point and that individual i in the sample has been genotyped as aaBbccDD. The vectors of binary covariates for the focal and observed cases (save for an intercept) are given in the matrix
where N (Y) stands for disagreement (agreement). Then, d(x, xi) = 4, which is twice the number of disagreements in genotypes because there are only 2 “d.f.” per locus. In practice, one should work with a representation of incidences that is free of redundancies. For binary covariates, Silverman (1986) suggests the “binomial” kernel
with
; alternative forms of the kernel function are discussed by Aitchison and Aitken (1976) and Racine and Li (2004). It follows that the kernel estimate of the probability of observing the focal value x is
(18)If h = 1, the estimate is just the proportion of cases for which xi = x; if
, every focal point gets an estimate equal to
, irrespective of the observed values x1, x2,…, xn.
Genotype | AA | Aa | aa | BB | Bb | bb | CC | Cc | cc | DD | Dd | dd |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Observed | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Focal | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Agree | N | Y | N | Y | Y | Y | Y | Y | Y | N | N | Y, |
The nonparametric estimator of the regression function is
Since a discrete distribution does not possess derivatives, the additive genetic value must be defined in the classical sense (e.g., Falconer 1960), that is, by defining contrasts between expected values of individuals having appropriate genotypes. Here, one can use either the vectorial representation or the concept of the additive regression model in (7). Hereinafter, it is assumed that the distribution of x is continuous, and a continuous kernel function is employed throughout, as an approximation.
SEMIPARAMETRIC KERNEL MIXED MODEL
General considerations:
Consider now a situation for which there might be an operational or mechanistic basis for specifying at least part of a model. For instance, suppose that y is a measure on some quantitative trait, such as milk production of a cow. Animal breeders have exploited to advantage the infinitesimal model of quantitative genetics (Fisher 1918). Vectorial representations of this model are given by Quaas and Pollak (1980), and applications to natural populations are discussed by Kruuk (2004). In this section, we combine features of the infinitesimal model with a nonparametric treatment of genomic data and present semiparametric implementations.
Statistical specification:
Model (1) is expanded as(19)where
is a vector of nuisance location effects and u is a q × 1 vector containing additive genetic effects of q individuals (these effects are assumed to be independent of those of the markers), some of which may lack a phenotypic record; w′i and z′i are known nonstochastic incidence vectors. As before, g(xi) is some unknown function of the SNP data. It is assumed that
, where
is the “unmarked” additive genetic variance and A is the additive relationship matrix, whose entries are twice the coefficients of coancestry between individuals. Let
be the n × 1 vector of residuals, and assume that
, where
is the residual variance. Note that the model implies that
(20)and
(21)The preceding means that: (1) the offset yi − g(xi) follows a standard mixed-effects model, and (2) if
and u were known, one could use (6) to estimate g(xi) employing yi − w′iβ − z′iu as “observations.” This suggests two strategies for analysis of data, discussed below.
Strategy 1—mixed model analysis:
This follows from representation (20). First, estimate g(xi), for i = 1, 2, … , n, via , as in (6) or, if a Gaussian kernel is adopted, as in (17). Then, carry out a mixed model analysis using the “corrected” data vector and pseudomodel
where
and
are incidence matrices of appropriate order. The pseudomodel ignores uncertainty about
, since
is treated as if it were the true regression (on SNPs) surface.
Under the standard multivariate normality assumptions of the infinitesimal model, one can estimate the variance components and
from y* via restricted maximum likelihood (REML) (Patterson and Thompson 1971) and form empirical best linear unbiased estimators and predictors of β and u, respectively, by solving the Henderson mixed model equations
(22)(Henderson 1973). The ratio
is evaluated at REML estimates of the variance components. Solving system (22) is a standard problem in animal breeding even for very large q, since A−1 is easy to compute. The two-stage procedure could be iterated several times, i.e., use the solutions to (22), to obtain a new estimate of
using
as “data,” and then update the pseudodata y*, etc.
The “total” additive genetic effect of an individual possessing a vector of SNP covariates with focal value x can be defined as the sum of the additive genetic effect of the SNPs, in the sense of (8), plus the polygenic effect ui, that is,An empirical predictor of Ti can be formed by adding (14) to the ith component of
in the mixed model equations. It is not obvious how a measure of uncertainty about
can be constructed using this procedure.
A Bayesian approach can be used instead, using the corrected data as observations. Under standard assumptions made for the prior and the likelihood (e.g., Wang et al. 1993, 1994; Sorensen and Gianola 2002), one can draw samples j = 1, 2, … , m from the pseudoposterior distribution
via Gibbs sampling and then form “semiparametric” draws of the total genetic value as
for j = 1, 2, … , m. These
can be construed as draws from a semiparametric pseudoposterior distribution; one can readily calculate the means, median, percentiles, and variance of this distribution and produce posterior summaries, leading to an approximation to uncertainty about total genetic value.
Irrespective of whether classical or Bayesian viewpoints are adopted, this approach ignores the error of estimation of g(x), as noted earlier.
Strategy 2—random g(.)function:
The estimate of can be improved as follows. Consider (21) and suppose, temporarily, that
and u are known. Then, form the offset
and the alternative Nadaraya–Watson estimator of g(xi):
(23)The problem is that the location vectors
and u are not observable, so they must be inferred somehow.
Regard now as unknown quantities possessing some prior distribution, which is modified via Bayesian learning into the pseudoposterior process
, after observation of the pseudodata y* and of the n × p matrix of observed SNP covariates X. Then, for some band width h,
would also be a random variable possessing some pseudoposterior distribution. Let
be a draw from the pseudoposterior process
, as in the preceding section. Further, regard
as a draw from the pseudoposterior distribution of
, given y*. Subsequently, estimate features of the pseudoposterior distribution of
by ergodic averaging. For example, an estimate of its posterior expectation would be
where the samples
are obtained via a Markov chain Monte Carlo (MCMC) procedure from the pseudoposterior distribution
.
The MCMC procedure would consist of sequential, iterative, sampling from all conditional pseudoposterior distributions, as follows:
Sample
from
, where ELSE denotes all other parameters, y*, X, and h. Using standard results, this conditional distribution has the form
where
and
Sample u from
. Using similar standard results, the distribution to sample from is
where
and
Sample the two variance components from the scaled inverse chi-square distributions
and
Above, νu,
and νe,
are known hyperparameters of independent scaled inverse chi-square priors assigned to
and
, respectively.
Form draws from the pseudoposterior distribution of
as
As usual, early draws in the MCMC procedure are discarded as burn-in and, subsequently, m samples are collected to infer features of interest. Note that the procedure termed as strategy 1 gives as an estimate of the relationship between phenotypes and SNP data. In strategy 2, one can use
instead, where
and
are means of the pseudoposterior distributions
and
, respectively. Under strategy 2, a point predictor of total additive genetic value could be
where
is the posterior mean of ui.
REPRODUCING KERNEL HILBERT SPACES MIXED MODEL
General:
What follows is motivated by developments in Mallick et al. (2005) for classification of tumors using microarray data. The underlying theory is outside the scope of this article. Only essentials are given here, and foundations are in Wahba (1990, 1999).
Using the structure of (19), consider the penalized sum of squares(24)where, as before, h is a smoothing parameter (possibly unknown) and
is some norm or “stabilizer.” For instance, in smoothing splines,
is a function of the second derivatives of g(x) integrated between end points that compose the data. The second term in (24) acts as a penalty: if the unknown function g(x) is rough, in the sense of having slopes that change rapidly, the penalty increases. The main problem here is that of finding the function g(x) that minimizes (24). Since
is a functional on g(x), this is a variational or calculus of variations problem over a space of smooth curves. The solution was given by Kimeldorf and Wahba (1971) and Wahba (1999), and the minimizer admits the representation
where
is called a reproducing kernel. A possible choice for the kernel (Mallick et al. 2005) is the single smoothing parameter Gaussian function
Mixed model representation:
We embed these results into (19), leading to the specification(25)with the intercept parameter α0 included as part of
. Note that there are as many regressions αj as there are data points. However, the roughness penalty in the variational problem leads to a reduction in the effective number of parameters in reproducible kernel Hilbert spaces (RKHS) regression, as it occurs in smoothing splines (Fox 2002).
Define the 1 × n row vectorthe n × 1 column vector
, j = 1, 2,…, n; and the n × n matrix
Then (25) can be written in matrix form as
Suppose, further, that the αj coefficients are exchangeable according to the distribution . Hence, for a given smoothing parameter h, we are in the setting of a mixed-effects linear model.
Given h, ,
, and
(at a given h, the three variance components may be estimated by, e.g., REML) one can obtain predictions of the polygenic breeding values u and of the coefficients
from the solutions to the system
(26)The total additive genetic value of an individual possessing a vector of SNP covariates xi can be defined as
(27)where
Since Ti is a linear function of the unknown ui and αj effects, its empirical best linear unbiased predictor, assuming known h, can be obtained by replacing these effects by the corresponding solutions from (26).
Incomplete genotyping:
At least in animal breeding, it is not feasible to have all individuals genotyped for SNPs. On the other hand, the number of animals with phenotypic information available is typically in the order of hundreds of thousands, and genotyping is selective, e.g., young bulls that are candidates for progeny testing in dairy cattle production. Animals lacking molecular data are not a random sample from the population, and ignoring this issue may lead to biased inferences. Unless missingness of molecular data is ignorable, in the sense of, e.g., Henderson (1975), Rubin (1976), Gianola and Fernando (1986), or Im et al. (1989), the procedures given below require modeling of the missing data process, which is difficult and may lack robustness. Here, it is assumed that missingness is ignorable, enabling use of likelihood-based or Bayesian procedures as if selection had not taken place (Sorensen et al. 2001). Two ad hoc procedures are discussed, and an alternative approach, suitable for kernel regression, is presented in the conclusion.
Let the vector of phenotypic data be partitioned as , where y1 (n1 × 1) consists of records of individuals lacking SNP data, whereas y2 (n2 × 1) includes phenotypic data of genotyped individuals. Often, it will be the case that n1 > p ≫ n2. We adopt the model
(28)For the sake of flexibility, assume that
and
are mutually independent but heteroscedastic vectors. In short, the key assumption made here is that the random effect α affects y2 but not y1 or, equivalently, that it gets absorbed into e1. With this representation, the mixed model equations take the form
(29)If SNP data are missing completely at random and h,
,
, and
are treated as known, then
is an unbiased estimator of
, and
and
are unbiased predictors of u and α, respectively. They are not “best,” in the sense of having minimum variance or minimum prediction error variance, because the smooth function
of the SNP markers is missing in the model for individuals that are not genotyped (Henderson 1974).
An alternative consists of writing the bivariate modeland then assigning to the polygenic component, the multivariate normal distribution
Here,
and
are additive genetic variances in individuals without and with molecular information, respectively, and
is their additive genetic covariance. Computations would be those appropriate for a two-trait linear model analysis (Henderson 1984; Sorensen and Gianola 2002).
Bayesian analysis:
To illustrate, consider the first of the two options in the preceding section. Suppose a kernel has been chosen but that the value of h is uncertain, so that the model unknowns areLet the prior density have the form
(30)where H denotes the set of all known hyperparameters (whose values are fixed a priori) and
indicates a multivariate normal distribution with appropriate mean vector and covariance matrix. The four variance components
are assigned independent scaled inverse chi-square prior distributions with degrees of freedom ν and scale parameters S2, with appropriate subscripts. Assign an improper prior distribution to each of the elements of
and, as in Mallick et al. (2005), adopt a uniform prior for h, with lower and upper boundaries hmin and hmax, respectively.
Given the parameters, observations are assumed to be conditionally independent, and the distribution adopted for the sampling model is
Given h, one is again in the setting of the Bayesian analysis of a mixed linear model, and Markov chain Monte Carlo procedures for this situation are well known. Under standard conjugate prior parameterizations, all conditional posterior distributions are known, save for that of h. Hence, one can construct a Gibbs–Metropolis sampling scheme in which conditional distributions are used for drawing and a Metropolis update is employed for h.
Location effects are drawn from a multivariate normal distribution with mean vector and covariance matrix
(31)and
(32)respectively. Likewise, the additive genetic effects can be sampled from a normal distribution centered at
(33)and with covariance matrix
(34)The conditional posterior distribution of the coefficients
is multivariate normal as well, with mean vector
(35)and variance–covariance matrix
(36)
All four variance components have scaled inverse chi-square conditional posterior distributions and are conditionally independent. The conditional posterior distributions to sample from are(37)
(38)
(39)and
(40)
The most difficult parameter to sample is h. Its conditional posterior density can be represented as(41)where
Density (41) is not in a recognizable form. However, a Metropolis algorithm (Metropolis et al. 1953) can be tailored for obtaining samples from the distribution
. Suppose that the Markov chain is at state h[t]. A proposal value h* is drawn from some symmetric candidate-generating distribution and accepted with probability
If the proposal is accepted, then set h[t+1] = h*; otherwise, stay at h[t]. If a Hastings update is employed, instead, an adaptive proposal distribution could be, e.g., a finite mixture of the prior density and of a scaled inverse chi-square density f with mean value h[t] and variance
,
where hlargest and hsmallest are the largest and smallest value of h, respectively, accepted up to time t. The weight w in
is assessed such that a reasonable acceptance rate for the Metropolis–Hastings algorithm is attained.
Illustrative example:
Phenotypic and genotypic values were simulated (single replication) for a sample of N unrelated individuals, for each of two situations. The trait was determined either by five biallelic QTL, having additive gene action, or by five pairs of biallelic QTL, having additive-by-additive gene action. Under nonadditive gene action, the additive genetic variance was null, so that all genetic variance was of the additive-by-additive type. Heritability (ratio between genetic and phenotypic variance) in both settings was set to 0.5.
Genotypes were simulated for a total of 100 biallelic markers, including the “true” QTL; all loci were simulated to be in gametic-phase equilibrium. Since all individuals were unrelated and all loci were in gametic-phase equilibrium, only the QTL genotypes and the trait phenotypes would provide information on the genotypic value. In most real applications, the location of the QTL will not be known, and so many loci that are not QTL will be included in the analysis.
A RKHS mixed model was used to predict genotypic values, given phenotypic values and genotypes at the QTL and at all other loci. The model included a fixed intercept α0 and a random RKHS regression coefficient αi for each subject; additive effects were omitted from the model, as individuals were unrelated, precluding separation of additive effects from residuals, in the absence of knowledge of variance components. The genetic value gi of a subject was predicted aswhere
and
were obtained by solving (26) for this model, using a Gaussian kernel at varying functions of h. The mean squared error of prediction of genetic value (MSEP) was calculated as
(42)and a grid search was used to determine the values of h and
that minimized (42). To evaluate the performance of
as a predictor, another sample of 1000 individuals (“PRED”) was simulated, including genotypes, genotypic values, and phenotypes. This was deemed preferable to doing prediction in the training sample, to reduce dependence between performance and (h,
), whose values were assessed in the training sample The genotypic values of the subjects in PRED were predicted, given their genotypes, using
. Genotypic values were also predicted using a multiple linear regression (MLR) mixed model with a fixed intercept and random regression coefficients on the linear effects of genotypes.
Results for the RKHS mixed model are in Table 1, and those for the MLR mixed model are in Table 2, where “accuracy” is the correlation between true and predicted genetic values. When gene action was strictly additive, the two methods (each fitting k = 100 loci) had the same accuracy, indicating that RKHS performed well even when the parametric assumptions were valid. On the other hand, when inheritance was purely additive by additive, the parametric MLR was clearly outperformed by RKHS, irrespective of the number of loci fitted. An exception occurred at k = 100 and N = 1000; here, the two methods were completely inaccurate. However, when N was increased from 1000 to 5000, the accuracy of RKHS jumped to 0.85 (Table 1), whereas MLR remained inaccurate (Table 2). Note that the accuracy of RKHS decreased (with N held at 1000) when k increased. We attribute this behavior to the use of a Gaussian kernel when, in fact, covariates are discrete.
Accuracy of predicting genotypic values using the RKHS mixed model
Accuracy of predicting genotypic values using the MLR mixed model
CONCLUSION
This article discusses approaches for prediction of genetic value using markers for the entire genome, such as SNPs, and phenotypic measurements for complex traits. In particular, theory for nonparametric and semiparametric procedures, i.e., kernel regression and reproducing kernel Hilbert spaces regression, is developed. The methods consist of a combination of features of the classical additive genetic model of quantitative genetics with an unknown function of SNP genotypes, which is inferred nonparametrically. Mixed-model and Bayesian implementations are presented. The procedures can be computed using software developed by animal breeders for likelihood-based and Bayesian analysis, after modifications.
Except for the parametric part of the model, that is, the standard normality assumption for additive genetic values and model residuals, the procedures attempt to circumvent potential difficulties posed by violation of assumptions required for an orthogonal decomposition of genetic variance stemming from SNP genotypes (Cockerham 1954; Kempthorne 1954). Our expectation is that the nonparametric function of marker genotypes, , captures all possible forms of interaction, but without explicit modeling. The procedures should be particularly useful for highly dimensional regression, including the situation in which the number of SNP variables (p) exceeds amply the number of data points (n). Instead of performing a selection of a few “significant” markers on the basis of some ad hoc method, information on all molecular polymorphisms is employed, irrespective of the degree of colinearity. This is because the procedures should be insensitive to difficulties caused by colinearity, given the forms of the estimators, e.g., (6) or (23). It is assumed that the assignment of genotypes to individuals is unambiguous, i.e., x gives the genotypes for SNP markers free of error.
Our methods share the spirit of those of Meuwissen et al. (2001), Gianola et al. (2003), Xu (2003), Yi et al. (2003), Ter Braak et al. (2005), Wang et al. (2005), and Zhang and Xu (2005), but without making strong assumptions about the form of the marker–phenotype relationship, which is assumed linear by all these authors, and without invoking parametric distributions for pertinent effects.
A hypothetical example was presented, illustrating potential and computational feasibility of at least the RKHS procedure; a standard additive genetic model was outperformed by RKHS when additive-by-additive gene action was simulated. Comparisons between parametric and nonparametric procedures are needed. It is unlikely that computer simulations would shed much light in this respect. First, a huge number of simulation scenarios can be envisaged, resulting from limitless combinations of parameter values, numbers of markers, marker effects, residual distributions, etc. Second, simulations tend to have local value only, that is, conclusions are tentative only within the limited experimental region explored and are heavily dependent on the state of nature assumed. Third, end points and gauges of a simulation tend to be arbitrary. For instance, should frequentist procedures be assessed on Bayesian grounds and vice versa? We believe that studies based on predictive cross-validation for a range of traits and species are perhaps more fruitful. These studies will be conducted once adequate and reliable phenotypic-SNP data sets become more widely available.
There are at least two difficulties with the proposed methodology. As noted above, it is assumed that the SNP genotypes are constructed without error, which is seldom the case. To solve this problem, one would need to build an error-in-variables model, but at the expense of introducing additional assumptions. A second difficulty is that posed by the fact that many individuals will lack SNP information, at least in animal breeding. Earlier, we presented approximate procedures on the basis of the assumption that missingness of SNP data is ignorable, such that the effect of can be absorbed into a residual that has a different variance from that peculiar to individuals possessing SNP data or into the additive genetic value in a two-trait implementation. A more appropriate treatment of missing data requires imputation of genotypes for individuals lacking SNP information. If the SNP data are missing completely at random or just at random, the solutions to system (29), after augmentation with the missing
, would give the means of the posterior distributions of
, and
, conditionally on the variance components and on h (Gianola and Fernando 1986; Sorensen et al. 2001). However, the sampling procedure should address the constraint that SNP genotypes of related individuals must be more alike than those of unrelated subjects. In our treatment, and for operational reasons, we adopted the simplifying assumption that SNP genotypes are independently distributed. This may be anticonservative and could lead to some bias if SNP data are not missing completely at random.
It is unknown to what extent our procedures are robust with respect to the choice of kernel function. Silverman (1986) discusses several options and, in the context of univariate density estimation, concludes that different kernels differ little in mean squared error. Also, an inadequate specification of the smoothing parameter h may affect inference adversely. In this respect, the procedures discussed in reproducing kernel hilbert spaces mixed model provide for automatic specification of the band-width parameter. Here, one could either make an analysis conditionally on the “most likely” value of h or, alternatively, average over its posterior distribution, to take uncertainty into account fully. Here, we have focused on using a continuous kernel, primarily to exploit differentiability properties. However, the vector of markers, x, has a discrete distribution, and careful investigation is needed to assess the adequacy of such an approximation.
A procedure to accommodate missing genotypes in kernel regression is as follows. Replace g(xi) in (19) by(43)where
is the conditional expectation of xi given all the observed genotypes in the pedigree, and
. The conditional expectation in (43) is a function of the conditional probabilities of the missing genotypes given the observed genotypes. Much research has been devoted to computing such probabilities from complex pedigrees using either approximations (Van Arendonk et al. 1989; Fernando et al. 1993; Kerr and Kinghorn 1996) or MCMC samplers (Sheehan and Thomas 1993; Jensen et al. 1995; Jensen and Kong 1999; Fernández et al. 2002; Stricker et al. 2002).
When genotypes for individual i are observed, , and qi is null. When genotypes for i are missing,
is an approximation to the conditional expectation of g(xi) given the observed genotypes, and qi is a random variable with approximately null expectation. Now, the genotypic value of an individual can be written as
(44)where, from the linear approximation given by (9), the variance of qi is
(45)with Vi being the conditional covariance matrix of xi given the observed genotypes of relatives. The jth diagonal in Vi is a function of the conditional probabilities of the missing genotypes at locus j given the observed genotypes, and the jkth off-diagonal element is a function of the conditional probabilities of the missing genotypes at loci j and k given the observed genotypes. An estimate of Var(qi) can be obtained by replacing
by its estimate in (45).
In the simplest approach to modeling the covariances of the qi, the random effects ui and qi are combined asNow, the model for yi is written as
(46)An approximation to the covariance matrix
of a can be obtained by using the usual tabular algorithm that is based only on pedigree information, after setting the ith diagonal of
to Var(ui) + Var(qi). The inverse of this approximate
is sparse and it can be computed efficiently (Henderson 1976).
Although qi is not null only for individuals with missing genotypes, as discussed below, the observed genotypes on relatives can provide information on the segregation of alleles in individuals with missing genotypes. Thus, observed genotypes can provide information on covariances between the qi. Let qij denote the additive genotypic value at locus j, which can be written aswhere vijm and vijp are the gametic values of haplotypes hijm and hijp. When genotype information is available at a locus, it is more convenient to work with the gametic values vijm and vijp rather than the genotypic values qij. For an individual with missing genotypes, the variance of
can be written as
(47)for x = m or p, where
is the additive effect of haplotype hijm. These additive effects can be estimated from the linear function given by (8) as follows. Let
denote the value of x with all elements set to its mean value except at locus j where one of the haplotypes is set to hj. Then,
The covariance between and
, for example, can be written as
(48)where
is the probability that the maternal haplotype of i′ is inherited from the maternal haplotype of d its dam. Suppose genotype information is available on the ancestors of d and on the descendants of i′. Then, the segregation probabilities in (48) may be different from 0.5, and thus the observed genotypes will contribute information for genetic covariances at this locus. However, even in this situation, the inverse of the gametic covariance matrix Gj for locus j that is constructed using (47) and (48) is sparse and it can be computed efficiently (Fernando and Grossman 1989).
For an improved approximation to the modeling of covariances between the qi, the model for yi is written as(49)where L is the set of k loci with the largest effects on the trait. The covariance matrix for the vector vj of gametic values can be computed using (47) and (48). The usual tabular method based on pedigree information is used to compute
, after setting the ith diagonal to
(50)
Our models extend directly to binary or ordered categorical responses when a threshold-liability model holds (Wright 1934; Falconer 1965; Gianola 1982; Gianola and Foulley 1983). Here, the function would be viewed as affecting a latent variable; Mallick et al. (2005) used this idea in analysis of gene expression measurements.
Extension to multivariate responses is less straightforward. It is conceivable that each trait may require a different function of SNP genotypes. It is not obvious how this problem should be dealt with without making strong parametric assumptions.
In conclusion, we believe that this article gives a first description of nonparametric and semiparametric procedures that may be suitable for prediction of genetic value using dense marker data. However, considerable research is required for tuning, extending, and validating some of the ideas presented here.
Acknowledgments
Miguel A. Toro, Bani Mallick, and two anonymous reviewers are thanked for useful comments. Research was completed while the senior author was visiting Parco Tecnologico Padano, Lodi, Italy. Support by the Wisconsin Agriculture Experiment Station and by grants National Research Initiatives Competitive Grants Program/U.S. Department of Agriculture 2003-35205-12833, National Science Foundation (NSF) DEB-0089742, and NSF Division of Mathematical Sciences (DMS)–NSF DMS-044371 is acknowledged.
Footnotes
Communicating editor: R. W. Doerge
- Received August 14, 2005.
- Accepted April 18, 2006.
- Copyright © 2006 by the Genetics Society of America