Genetics, Vol. 163, 347-365, January 2003, Copyright © 2003

On Marker-Assisted Prediction of Genetic Value: Beyond the Ridge

Daniel Gianolaa, Miguel Perez-Encisob, and Miguel A. Toroc
a Department of Animal Sciences, University of Wisconsin, Madison, Wisconsin 53706,
b Station d'Amelioration Génétique des Animaux, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France
c Departamento de Mejora Genética Animal, Instituto Nacional de Investigaciones Agrarias, 28040-Madrid, Spain

Corresponding author: Daniel Gianola, 1675 Observatory Dr., Madison, WI 53706., gianola{at}calshp.cals.wisc.edu (E-mail)

Communicating editor: J. B. WALSH


*  ABSTRACT
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

Marked-assisted genetic improvement of agricultural species exploits statistical dependencies in the joint distribution of marker genotypes and quantitative traits. An issue is how molecular (e.g., dense marker maps) and phenotypic information (e.g., some measure of yield in plants) is to be used for predicting the genetic value of candidates for selection. Multiple regression, selection index techniques, best linear unbiased prediction, and ridge regression of phenotypes on marker genotypes have been suggested, as well as more elaborate methods. Here, phenotype-marker associations are modeled hierarchically via multilevel models including chromosomal effects, a spatial covariance of marked effects within chromosomes, background genetic variability, and family heterogeneity. Lorenz curves and Gini coefficients are suggested for assessing the inequality of the contribution of different marked effects to genetic variability. Classical and Bayesian methods are presented. The Bayesian approach includes a Markov chain Monte Carlo implementation. The generality and flexibility of the Bayesian method is illustrated when a Lorenz curve is to be inferred.


THE availability of a plethora of markers has led to consideration of the issue of the extent to which molecular information can be used to advantage in genetic improvement programs of agricultural species, such as maize or dairy cattle. There is a large body of literature on this matter (e.g., SOLLER and BECKMANN 1983 Down; SMITH and SIMPSON 1986 Down; LANDE and THOMPSON 1990 Down); WHITTAKER 2001 Down gives a review. Marker-assisted selection can be effective as long as part of the genetic variance can be associated with segregating marker loci (LANDE and THOMPSON 1990 Down).

The basic idea in marker-assisted selection is to exploit statistical dependencies (linkage disequilibrium) existing in the joint distribution of marker and quantitative trait loci (QTL) genotypes. For example, when two inbred lines are crossed, the disequilibrium is manifest in the F2 generation. On the other hand, when there is linkage equilibrium at the population level, only the joint distribution of marker and QTL genotypes within a family is nontrivial (OLLIVIER 1998 Down). GEORGES et al. 1995 Down have exploited high levels of within-family disequilibria in dairy cattle populations for QTL mapping. Most livestock populations have some disequilibrium due at least to chance (small effective size), as shown by FARNIR et al. 2000 Down.

Linkage disequilibrium between markers and QTL can be used for two main purposes: (1) to infer genomic location and effects of QTL affecting a trait and (2) to arrive at improved (in some statistical sense) predictions of genetic merit of candidates for selection in a breeding program. These two objectives may not be disjoint (e.g., FERNANDO and GROSSMAN 1989 Down). However, genetic cartography of QTL is not a requirement for prediction of genetic merit or marker-assisted selection (LANGE and WHITTAKER 2001 Down). In fact, remarkable advances have been made in prediction of breeding values in livestock since the introduction of the best linear unbiased predictor (BLUP; HENDERSON et al. 1959 Down; HENDERSON 1973 Down). This method uses only phenotypic and pedigree information, with the QTL viewed only in an abstract manner. On the other hand, it has been argued and shown, at least in simulations, that molecular information may enhance the accuracy of selection (e.g., WHITTAKER et al. 2000 Down).

Our concern is with statistical models and methods for inferring genetic merit using molecular and phenotypic information. The objective is to describe phenotype-marker associations using multilevel hierarchical linear models. The setting is mainly as in WHITTAKER et al. 2000 Down and LANGE and WHITTAKER 2001 Down, i.e., situations where conditional independence of genetic sampling can be assumed, such as in an F2 population derived from a cross between inbred lines. Model features include chromosome-specific effects, spatial associations of markers within chromosomes, existence of background genetic variability, and heterogeneity among families, if some such clustering exists. Classical and Bayesian methods are described. First, we present a mixed-effects model formulation and a BLUP implementation, with the dispersion components estimated by some likelihood-based procedure. An extension of the mixed-effects model is given subsequently. Finally, a Bayesian formulation is presented, including a Markov chain Monte Carlo procedure for drawing samples from target posterior distributions. Possible applications to outbred populations are discussed.


*  A MIXED-EFFECTS MODEL FORMULATION
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

Hierarchical representation:
Let the phenotypic value of individual i for a quantitative trait in an F2 from a cross between inbred lines be described by the model

(1)

Here, the p x 1 vector ß contains some systematic effects representing, e.g., year of harvest, level of fertilization, or plant density; x'i is a known incidence vector relating ß to yi; ai is an unobserved genetic value; and ei is an independently distributed random residual reflecting environmental variability or inadequacy of the model. It is assumed that the genetic value ai results from an unknown number (K, say) of QTL acting additively, so that

where Qik is the genotype at biallelic (assumed for simplicity) locus k for individual i and {alpha}k is the per-allele effect. If ai is random, (1) is a special case of the well-known mixed linear model (e.g., HENDERSON 1973 Down). When K goes to infinity this becomes the classical infinitesimal model of quantitative genetics.

It is conceptually convenient to develop model (1) hierarchically. The first level of a Gaussian hierarchy is given by

(2)

where N (·) indicates a normal distribution and {sigma}2e is the environmental variance. If the n environmental deviates are independently and identically distributed, this leads to the matrix representation

(3)

where X is an n x p incidence matrix assumed (without loss of generality) to have full-column rank, a = {ai}, and I denotes an identity matrix, in this case n x n. Suppose next that individual i has been typed for marker genotypes at each of l loci; this is represented by the vector

with

Assume that all individuals have been typed for all markers (this is not realistic, but see the DISCUSSION). Then, the unobserved genetic value can be modeled as

(4)

where {gamma} = {{gamma}k} is of order l x 1. We refer to {gamma}k as the "marked effect" of marker locus k on genetic value, while {epsilon}i can be interpreted as some residual or "background" genetic effect not involved in the association between genetic value and the markers but, yet, having an effect on phenotype. The vector {gamma} is the gradient or regression of the unobservable additive genetic value on the observable marker genotype, that is, {gamma} = {partial}ai/{partial}mi. As noted by LANDE and THOMPSON 1990 Down, dominance can be introduced by expanding (4) as, e.g.,

where m2'i is a row vector with elements consisting of the squares of the corresponding entries of mi. Interactions between marked effects for different marker loci can be modeled via cross-products between appropriate elements of mi. For simplicity, additivity of marked effects is assumed throughout.

The second level of the hierarchy is represented by a distribution describing the uncertainty about genetic values, given the marked effects, that is the background genetic variability. We adopt the Gaussian model,

where {sigma}2{epsilon} is the background additive genetic variance. It is assumed (rightly or wrongly, depending on the context), given the marker genotypes and {gamma}, that the "background" genetic effects {epsilon}i of different individuals are mutually independent. This implies that either there is no family structure or that, conditionally on the marked effects {gamma}, the family structure is not relevant. Family clustering is taken up in a later section. In matrix notation, and consistently with (4), the assumption of independence leads to

(5)

where M is the n x l matrix of known marker genotypes. Unless there is some prior knowledge about {sigma}2e and {sigma}2{epsilon} (or some clustering of individuals, such as a family structure), the background effect {epsilon}i must be lumped together with ei, because of nonidentifiability. On the other hand, if the variances {sigma}2{epsilon} and {sigma}2e are known a priori, it is possible to "predict" {epsilon}i distinctly from ei, in the same way that one can predict additive genetic and environmental effects via BLUP when dispersion parameters are known.

WHITTAKER et al. 2000 Down and WHITTAKER 2001 Down treat {gamma} as a fixed parameter and employ ridge regression for estimation of this vector. From a Bayesian perspective (e.g., LINDLEY and SMITH 1972 Down; ZELLNER and VANDAELE 1975 Down), this is equivalent to regarding {gamma} as having the distribution

(6)

with {sigma}2{gamma} elicited in some manner. Assumption (6) is adopted as the third level of the hierarchy. In a frequentist setting, the assumption states that the marked effects {gamma} are drawn at random from the multivariate normal distribution (6) in each conceptual repetition of a crossing experiment. In a Bayesian setting, this would be part of the prior ensemble of the model; this is discussed later. Regression coefficients for markers that are not adjacent to QTL are expected to be null under the assumption of no interference (ZENG 1993 Down). However, without knowing the location of the QTL in relation to the markers (the usual situation), it is not obvious how such a prior consideration can be incorporated into the model. Since (6) is a prior distribution in the Bayesian sense, its influence on inferences can be tempered by measuring enough individuals. At this point, it suffices to say that {gamma} can be treated as a random effect merely as a device for obtaining possibly improved (in some sense) predictions of genetic value. HAYES and GODDARD 2001 Down assumed that unobservable gene or chromosome effects followed a Gamma distribution, so these effects would be strictly positive even though estimated values can be negative. MEUWISSEN et al. 2001 Down used Gamma deviates for simulating gene effects, but then "tossed a coin" to determine their sign. Therefore, it is unclear what would be gained from assuming a Gamma distribution for the elements of {gamma}. Note that (6) implies that the "marked effects" are independent and identically distributed, but this assumption is relaxed later on. The normality assumption in (6) is probably adequate enough and facilitates computation significantly.

The three-stage hierarchy can be condensed by inserting (5) into (3), so that the model describing the phenotypic values can be written as

(7)


(8)

where {epsilon} = {{epsilon}i} and e = {ei}. This can be viewed as a frequentist mixed-effects model, where ß is a fixed location parameter and {gamma} and {epsilon} are random terms. Unless additional assumptions are made or some knowledge about the partition of variance in the population is available, {epsilon}i and ei are "confounded." However, it is conceptually useful to maintain these two vectors as distinct. The marginal (frequentist) distribution of the phenotypes induced by model (8) is the normal process

(9)

where MM'{sigma}2{gamma} is the variance-covariance matrix of marked genotypic values, conditionally on the marker genotypes M observed in the experiment. In scalar notation, the "total variance" of a single observation is

Hence,

(10)

is interpretable as the fraction of genetic variance attributable to marked effects, in an experiment repeated over and over with the marker genotypes fixed across replications. The variance "due to" the association with the markers is {sum}lk=1m2ik{sigma}2{gamma}, which depends nontrivially on mi, the specific marker genotype of individual i. On the other hand, since the marker genotypes vary at random over replications,

(11)

where the expectation and covariance matrix are taken over the distribution of marker genotypes in the population. Knowledge of the distribution of marker genotypes is needed for evaluation of (11).

Best prediction and best linear prediction:
Under standard assumptions, with ß and the variance components {sigma}2{gamma}, {sigma}2{epsilon}, and {sigma}2e known, the joint distribution of {gamma} and {epsilon}, given the phenotypes and the marker genotypes, is the multivariate normal process:

(12)

Here,

(13)

where {tau}{gamma} = {sigma}2e/{sigma}2{gamma}, {tau}{epsilon} = {sigma}2e/{sigma}2{epsilon}, and

(14)

The best linear predictor [BLP; also the best predictor (BP) under normality; HENDERSON 1973] of the unobserved total genetic values a = M{gamma} + {epsilon} is

(15)

and the variance-covariance matrix of the prediction error (under normality this is also the covariance matrix of the conditional distribution of a) is

(16)

The best predictor has the smallest possible mean-squared error of prediction. Hence, it would be difficult to improve upon this, provided that the model is reasonable, normality holds, and parameters are known. Thus, (13–16) provide an alternative to a ridge regression approach to prediction. Generalization to multiple traits measured in different individuals is straightforward, but this is not dealt with here. Since, given M, the BLP or BP is unbiased (e.g., HENDERSON 1973 Down), it follows automatically that it is also unbiased unconditionally. However, unconditionally,

with the expectation taken with respect to the joint distribution of marker genotypes in the entire population. A drawback of BP or BLP is that it is unrealistic to assume that ß, {sigma}2{gamma}, {sigma}2{epsilon}, and {sigma}2e are known without error.

Best linear unbiased prediction:
An obvious improvement is to use BLUP. BLUP takes into account uncertainty about ß, which is not the case of BP or BLP above, where ß is treated as known. Under normality, BLUP(a) can be interpreted as the mean of the conditional distribution of the predictand a = M{gamma} + {epsilon}, given a vector of "error contrasts," denoted as w. For example, take w = y - X, where is either the ordinary least-squares or the generalized least-squares estimator of ß. In such a setting, BLUP is the best predictor under normality, but only in the class of linear translation invariant predictors (SEARLE 1974 Down; GIANOLA and GOFFINET 1982 Down). It is well known that for (8) and (9),

(17)

where, for P = I - X(X'X)-1X',

(18)

gives = BLUP({gamma}) and = BLUP({epsilon}). Further,

(19)

The BLUP of a = M{gamma} + {epsilon} is

and the variance-covariance matrix of the prediction errors is

As noted, BLUP of the unobservable genetic values (marked or background genetic effects or any linear combination thereof) takes into account the uncertainty about ß, although given {tau}{gamma}, {tau}{epsilon}, and {sigma}2e. Since BLUP is conditionally unbiased, it follows that it is also so unconditionally (averaged over marker genotypes). The unconditional covariance matrix of the prediction errors is the average of (19) taken over the distribution of marker genotypes.

The needed variance components:
Sensible values of the dispersion parameters must be specified for implementing BLUP. As stated, unless there is prior or external knowledge, it is not possible to separate the variance of the background genetic effects ({sigma}2{epsilon}) from that of the environmental influences ({sigma}2e). The model can be rewritten as

(20)

where e* ~ N(0, I{sigma}2*e). Methods for obtaining maximum-likelihood estimates of {sigma}2{gamma} and of ({sigma}2{epsilon} + {sigma}2e) for (20) are well known, e.g., SEARLE et al. 1992 Down. In this setting only the marked part of the genetic value can be predicted. Here, the conditional distribution used for inferring {gamma} would be

(21)

where

is the best predictor (given the error contrasts) or BLUP({gamma}). The covariance matrix of the conditional distribution (21) is

with this being the same as the covariance matrix of the prediction errors {gamma} - , given the marker genotypes M. The BLUP of the marked quantitative trait genotype is M, and the prediction error dispersion matrix is MC*{gamma}{gamma}M'{sigma}2*e. Hence, M could be used as a criterion for genetic evaluation in marker-assisted selection ignoring background effects. If reasonable maximum-likelihood (ML) or restricted maximum-likelihood (REML) estimates of {sigma}2{gamma} and of {sigma}2*e are available, these can be treated as "true" values in (21).

Suppose that estimates of the additive genetic (2a) and of the environmental variance (2e) are available (or of heritability h2 and the phenotypic variance 2y) from an analysis ignoring marker information. Then, the variance of the background genetic effects (assuming additivity) could be estimated as 2{epsilon} = 2*e - 2e = 2*e - (1 - h2)2y (hoping that this value will be positive). Then, form

where 2{gamma} is the ML or REML estimate of {sigma}2{gamma}, and proceed with the BLUP implementation in (17). The resulting predictor is an empirical BLUP having properties that depend largely on the accuracy of the variance component estimates and on the adequacy of model in (7) and (9).

Differences with ridge regression:
A main difference with the ridge regression approach of WHITTAKER et al. 2000 Down and WHITTAKER 2001 Down is how the value of {tau}{gamma} (or of {tau}*{gamma} = {sigma}2*e/{sigma}2{gamma}) is obtained. Here, the formalism of the random-effects model (with a justification from a Bayesian viewpoint) is favored over an ad hoc procedure for doing shrinkage and tempering colinearity in a fixed-effects model, such as ridge regression. LINDLEY and SMITH 1972 Down, ZELLNER and VANDAELE 1975 Down, and BOX 1980 Down discuss ridge regression from a Bayesian perspective. Further, a random-effects treatment of a fixed-effects model can lead to estimators and predictors with a superior frequentist performance (e.g., ZELLNER and VANDAELE 1975 Down; GIANOLA 1990 Down; WEIGEL et al. 1991 Down).

An advantage of a random-effects treatment is the flexibility to accommodate additional model features. For example, the ridge regression estimator of WHITTAKER 2001 Down implies that all marked effects in {gamma} are independent (a priori). Evidence of coexpression of genes in at least the same chromosome (CARON et al. 2001 Down) indicates that the assumption that marked sections or segments within a chromosome have independent effects is not always tenable. Adjacent QTL may coexpress more (that is, be less independent) than QTL further apart. Thus, some spatial covariance structure along the chromosome might be in order. Ridge regression overstates the prior precision of the model, by virtue of implicitly assuming independence. A possible consequence is that "more genotype is marked" than warranted. Also, there may be chromosomes that are "hotbeds" of QTL, whereas other chromosomes may be arid, leading to between-chromosome variation. This heterogeneity can be accommodated by introducing chromosome effects in the model, with markers in different chromosomes having (possibly) distinct distributions. These extensions are dealt with in the following section.


*  EXTENDING THE HIERARCHY
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

Model without background genetic effects:
For simplicity, assume first that the state of prior knowledge does not allow disentangling background genetic effects from environmental deviates. The first tier is then as in (20), so that

(22)

The second tier is given by the distribution of the marked genetic effects. Here, instead of assuming that {gamma}|{sigma}2{gamma} ~ N(0, I{sigma}2{gamma}), a "one-way layout" is adopted, to partition the variability of marked effects into between- and within-chromosome components. Arrange markers sequentially according to their order within each chromosome and assume a dense marker map. The model for the second tier is

(23)

where c = {ci} is an nc x 1 vector of "marked chromosomal effects" (nc is the number of pairs of chromosomes), T is a known incidence matrix of order l x nc relating marked effects to chromosomes, and v = {vij} is an l x 1 vector of marked "within-chromosome" deviates. In scalar notation, {gamma}ij = ci + vij is the "marked quantitative effect of the jth marker at the ith chromosome." It is assumed that

(24)

Here, V = Diag(Vi) is taken to be a block-diagonal matrix with nc blocks, where Vi is the li x li variance-covariance matrix of marked within-chromosome deviates in chromosome i, and li is the number of markers in chromosome i. The block diagonality of V in (24) implies that marked within-chromosome effects are independent across chromosomes. However, a within-chromosome dependence will be introduced. The matrix Vi can be structured in several different manners, and some forms of modeling such dependence are discussed below. In (23) and (24) the nc x 1 vector of chromosome effects is assumed to possess the distribution

(25)

where {sigma}2c is the variance between chromosome effects. Some of the possible forms of Vi (among many possible ones) are considered next:

  1. Within-chromosome deviates are correlated according to a first-order autoregressive process. Partition with

    Then, if markers are equally spaced,

    where {sigma}2vi is the variance between deviates in chromosome i (i = 1, 2, ... , nc) and {rho} is a parameter taking values between -1 and 1. This covariance structure satisfies the assumption that adjacent within-chromosome deviates are more strongly correlated than those further apart. A form of relaxing the requirement that markers be equally spaced is immediately below.

  2. Within-chromosome deviates are correlated according to a Gaussian decay model (e.g., VERBEKE and MOLENBERGHS 1997 Down). The covariance structure here is

    where

    is the correlation between marked within-chromosome deviates k and t in chromosome i. Here, di,kt is the distance (e.g., in physical units such as kilobases) between markers k and t in chromosome i, and {rho}i > 0 is a chromosome-specific parameter. When the distance between markers -> 0, {lambda}i,kt -> 1; on the other hand, when di,kt -> {infty}, the correlation between within-chromosome deviates is null. The parameter {rho}i governs the rate at which the correlation decreases. When {rho}i is close to 0, such correlation falls rapidly; when {rho}i increases, the drop is more gentle. It may be convenient to assume that {rho}i = {rho} for all chromosomes.

  3. Suppose that the number of markers is constant across chromosomes (with l markers and nc chromosomes there would be z = l/nc markers per chromosome). If the markers are equally spaced and if the variance of within-chromosome deviates is constant across chromosomes, one may pose the Toeplitz covariance specification

    (VERBEKE and MOLENBERGHS 1997 Down) for i = 1, 2, ... , nc. Here, the correlation between within-chromosome deviates k and t is

Note that {lambda}kt here, even though a correlation coefficient, is not the same as {lambda}i,kt in the preceding section. Using (23) in (22) the model can be expressed as

(26)

Under Gaussian assumptions, the joint distribution of c and v, given a vector or linearly independent error contrasts w, is

(27)

Here, = BLUP(c) and = BLUP(v) are computed as

(28)

and

(29)

The BLUP of the M{gamma} is . Since the predictor is conditionally unbiased (given M), it is also unbiased unconditionally. The unconditional covariance matrix of the prediction errors is the average of (29) taken with respect to the distribution of marker genotypes.

Here, it is possible to predict marked effects on a chromosomal basis. Consider model (26) and recall that {gamma} = Tc + v. Suppose that a hypothetical species has four pairs of chromosomes. For the n individuals assayed, the incidence matrix M can be written as

where m'ij are the codes for the marker genotypes of individual i at chromosome j. Likewise, the "true" marked effects on a chromosome-by-chromosome basis can be written as

where cj is an effect peculiar to all markers in chromosome pair j and 1lj is a vector of 1's of order lj. Then, after solving (28), one has the following relationships:

.

For many reasonable covariance structures, all dispersion parameters are identifiable in this model and can be estimated by tailoring some suitable algorithm for maximum likelihood. VERBEKE and MOLENBERGHS 1997 Down, VERBEKE and MOLENBERGHS 2000 Down present examples of applications of standard statistical software for restricted maximum-likelihood estimation.

Model with background genetic effects:
The model is now an expanded version of (26):

(30)

The best linear unbiased estimator of ß and the BLUP of c, v, and {epsilon} can be obtained by solving the system:

(31)

As before, if an estimate of the total genetic variance is available, one can set {sigma}2{epsilon} as 2{epsilon} = 2*{epsilon} - (1 - h2) 2y. Some estimates (e.g., REML) of {sigma}2c and of the parameters defining the structure of Vi can be obtained when estimating 2*e from the model where background genetic effects are ignored or from the full model but fixing the value of {sigma}2{epsilon}. The BLUP of the "total" genetic value of the entire collection of individuals in the sample is then

(32)

and the covariance matrix of the prediction errors is

(33)

Differential contribution of marked effects to genetic variability:
A representation of the degree of inequality of a frequency distribution is given by the Lorenz curve (e.g., GASTWIRTH 1971 Down, GASTWIRTH 1972 Down). This is used for measuring income inequality in economics and was applied first in quantitative genetics by URIOSTE et al. 2001 Down in an assessment of heterogeneity of variance of milk yield in cattle among herds. We place the cumulative proportion of the markers 1/l, 2/l, ... , i/l, ... , (l - 1)/l in the abscissa and plot this against the cumulative proportion of a measure of the marked "genetic variability." A hypothetical example is in Fig 1. A straight line with a slope of 45° indicates perfect equality. For example, if 20% of the markers explain 20% of the variability, 40% of the markers account for 40%, and so on, a straight line results. On the other hand, if a few markers account for a large portion of the variability, the curve bends downward because the distribution of effects is unequal. A measure of the inequality of contributions to variability is the Gini coefficient or twice the "inequality area," that is, the area between the Lorenz curve and the straight line. Since the square in Fig 1 has an area equal to 1, the maximum value of the inequality area is 0.5. Hence, the Gini coefficient takes values between 0 (all marked effects contributing equally to variability) and 1 (a single marked effect explains all variability).



View larger version (20K):
In this window
In a new window
Download PPT slide
 
Figure 1. Lorenz curve depicting the relative contribution of marked effects to marked genetic variability. Ordinate, cumulative proportion of marked genetic variance explained by marked effects; abscissa, cumulative proportion of markers.

Let Munique be a set of rows of M defining u unique marker genotypes (u may be much smaller than n), m'unique,j be the jth row of Munique, and munique,jk be the entry for marker locus k in m'unique,j. Define the corresponding "unique marked effect" as

(34)

Arbitrarily, define "total marked genetic variability" as the sum of the squared unique marked genetic effects, so

(35)

This definition does not take into account the frequency of the unique marker genotypes in the population, but this can be accounted for easily by weighting each µj appropriately. Let now µ2[j] be the jth ordered value of µ2j, sorted in an increasing order. The ordinate values in the Lorenz curve, i.e., the cumulative proportion of the "total marked genetic variation," are calculated as

(36)

where i/u is the cumulative proportion of unique marker genotypes. The Lorenz curve results from plotting L(i/u) against i/u; observe that L(0) = 0 and L(1) = 1. It can be shown (AGUILAR-GUTIERREZ 2000 Down; URIOSTE et al. 2001 Down) that the Gini coefficient is approximately equal to

(37)

A main difficulty in estimating the Lorenz curve and the Gini coefficient is that nonlinear functions of unknown quantities are involved. Simple method of moment estimators and are obtained by replacing µ2j in (36) and (37) by

with obtained from (31). This statistic does not take into account the uncertainty associated with .

Family heterogeneity:
Suppose that the phenotypic and molecular marker information can be partitioned into F nonoverlapping or "independent" clusters representing some familial aggregation. For example, XU 1998 Down discussed the use of "families" of line crosses for QTL mapping. This author pointed out that single-family methods present the risk that if two lines involved in the cross are not segregating at a QTL, then the latter would not be detected, irrespective of the number of individuals scored in a backcross or F2 population. He allowed for multiple families in the model, but treated their effects as fixed. Irrespective of whether these effects are viewed as fixed or random, they constitute statistical nuisances from the point of view of QTL mapping (XU 1998 Down). Here, a different point of view is taken. Allowance is made for heterogeneity of family effects on the marker-phenotype relationships. The stylized situation described in standard treatments of quantitative genetics is adopted; for example, individuals may be clustered into full-sib or half-sib families. We emphasize that our approach does not include recombination rates between markers and QTL as model parameters. A rigorous treatment for outbred populations requires extremely involved algebra (e.g., WANG et al. 1998 Down). The purpose here is to bypass these complex procedures via straightforward linear models having a much simpler covariance structure.

Let the vector of phenotypic records be partitioned as y = [y'1, y'2, ... , y'F]', where yi contains fi observations pertaining to family i (i = 1, 2, ... , F). Assume now that heterogeneity in the marker-phenotype association exists. In the presence of family-specific marked effects the entire vector {gamma} can be written as

where {gamma}i is an l x 1 set of effects peculiar to family i. The model for phenotypic values, allowing now for family-specific background effects, {epsilon}F, becomes

(38)

where X = [X'1, X'2, ... , X'F]' and Xi has as many rows as there are observations for family i (fi) and p columns. Further, M = Diag(Mi). The incidence matrix Mi has fi rows and l columns, the number of markers. The F x 1 vector {epsilon}f = {{epsilon}i} contains F family background effects related to the phenotypic values via the known incidence matrix F (consisting of vectors of ones, 1i, in appropriate locations and of zeroes elsewhere); w is an n x 1 residual vector of independently distributed within-family deviations, with common variance {sigma}2w and where n = {sum}Fi=1fi.

In the absence of marker information, the phenotypic variance can be partitioned into between- and within-family components. For example, with half-sib families, the between-family variance contains one-quarter of the additive genetic variance (supposing additive inheritance), and the within-family component includes the environmental variance plus three-quarters of the additive genetic variance. In the setting of (38) it is assumed that {epsilon}f and w are independently distributed vectors, with distributions {epsilon}f|{sigma}2{epsilon}f ~ N(0, I{sigma}2{epsilon}f) and w|{sigma}2w ~ N(0, I{sigma}2w), respectively. This partitioning reassigns the unmarked additive genetic variance into between ({sigma}2{epsilon}f) and within-family ({sigma}2w) components. Now, the {gamma}i marked effects can be treated as random regressions following the distribution

(39)

Here, {gamma}0 is an l x 1 parameter that is common to all families (the "fixed" part of the regression) and B is the l x l variance-covariance matrix between the regression coefficients. This implies that {gamma}i = {gamma}0 + {delta}i, where {delta}i ~ N(0, B) is a vector of deviations from the common regression {gamma}0 that is specific to family i. In a simpler hierarchical model it could be postulated, for example, that {gamma}i|{sigma}2{gamma} ~ N(0, Ii{sigma}2{gamma}) for i = 1, 2, ... , F. The model for the more general specification becomes

(40)

where ß*' = [ß', {gamma}'0], X* = [X M0], M0 = [M'1 M'2 · M'F]', and {delta}' = [{delta}1, {delta}'2, ... , {delta}F] is a vector of family-specific deviations from the overall regression {gamma} in (39).

Given the dispersion parameters B, {sigma}2{epsilon}f, and {sigma}2w, the best linear unbiased estimator of ß* and the BLUP of each of the family-specific deviations {delta}i are found by solving the system

(41)

where {oplus}Fi=1 is the "direct sum" of matrices notation (e.g., SEARLE 1982 Down), denoting a block-diagonal concatenation of the F matrices M'iMi + B-1{sigma}2w, with all these matrices being l x l. Note that

and

The variance-covariance matrix of the prediction errors is given by the inverse of the coefficient matrix in (41) multiplied by {sigma}2w. Unless there are many families (e.g., independent half-sib groups in dairy cattle) it will be difficult to obtain reliable maximum-likelihood estimates of B.

The hierarchy can be extended as in (23) and (24) to include chromosome and within-chromosome effects that would now be family specific. Put

(42)

where {delta}i is l x 1, ci is an nc x 1 vector of chromosome effects peculiar to family i, Ti is an l x nc incidence matrix, and vi is an l x1 vector of within-chromosome effects for family i. It is convenient to assume that the within-chromosome marked effects are independent across families and chromosomes, but possibly dependent within chromosomes. Here, the variance-covariance matrix V in (24) would be Fl x Fl and would take the form

where Vi = Var(vi) is an l x l variance-covariance matrix that is specific to family i. Now, partition

where vij is an lj (number of markers in chromosome j) x 1 vector of within-chromosome deviates for family i in chromosome j. Then, under the assumption of independence of deviates across chromosomes,

With this,

and Vij can be assigned any of the spatial structures discussed earlier. Then, the distribution of all family-specific deviations from regression would be

(43)

The hierarchy is completed by assigning a distribution to the family-specific chromosome effects ci. An extension of assumption (25) is

(44)

where K = {kij} is an nc x nc matrix of covariances between chromosome effects of the same family, with this dispersion structure assumed homogeneous across families. For example, element k45 of K would be the covariance between the effects of chromosomes 4 and 5 within the same family. As a side note, observe that unconditionally to ci, (43) and (44) imply that {delta}i ~ N(0, TiKT'i + Vi). Hence, assumption (39) is equivalent to (43) and (44) if and only if TiKT'i + Vi = B for all i.

Using (42) in (40) leads to the following model for phenotypes,

(45)

where M = Diag(Mi) and T = Diag(Ti). The BLUP of all family-specific chromosome and within-chromosome effects is obtained by solving

(46)

The variance-covariance matrix of the prediction errors is the inverse of the coefficient matrix in (46) times {sigma}2w.

Lande-Thompson regression model revisited:
LANDE and THOMPSON 1990 Down suggested splitting the molecular scores into between- and within-family components, so that the components could be combined "optimally" for the purpose of genetic improvement. In our context and for simplicity, consider the following variant of their model, using (38). Write the marker codes for family i as Mi = i + (Mi - i), where i is an fi x l matrix with columns equal to the mean values of the corresponding columns of Mi, with all its rows being equal. A model of interest might be

(47)

where {alpha}B is the regression of phenotypes on mean family molecular scores, MW is a matrix of within-family deviations from the mean scores for the appropriate families, and {alpha}W is the vector of regressions of phenotypes on the within-family deviations. Assume that {alpha}B ~N(0, IF{sigma}2{alpha}F) and {alpha}W ~ N(0, IF{sigma}2{alpha}W) where {sigma}2{alpha}F and {sigma}2{alpha}W are components of variance, and suppose that the two random vectors are independently distributed. The best linear unbiased predictor of the between- and within-family regressions can be calculated by solving

(48)

The variance-covariance matrix of the prediction errors can be obtained from the inverse of the coefficient matrix in (48), times {sigma}2w. Model (47) can be extended by allowing the within-family regressions to be family-specific. In this extended model, in addition to {alpha}B, there would be {alpha}Wi (i = 1, 2, ... , F) within-family regressions.


*  A BAYESIAN FORMULATION
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

We adopt now a Bayesian point of view and describe a Markov chain Monte Carlo (MCMC) implementation. The focus is on model (40) and its hierarchical expansion in (42–44), leading to (45) and (46) in the mixed model and BLUP treatment. This is the most general and richly parameterized specification among those discussed so far. The developments follow the typical hierarchical structure of Bayesian multilevel models (LINDLEY and SMITH 1972 Down).

Joint posterior distribution:
Start with (40) as sampling model (data-generating process) or first level of the Bayesian hierarchy. Conditionally on ß*, {delta}, and {epsilon}f, it is assumed that the phenotypic values are drawn independently from the distribution

(49)

It is convenient to express the joint density of all phenotypes, given ß*, {delta}, {epsilon}f, and {sigma}2w as the product of F densities corresponding to the contributions to information made by the phenotypic values in each of the families. Thus,

(50)

The notation N(yi|X*iß* + Mi{delta}i + 1i{epsilon}i, I{sigma}2w) indicates a normal density or distribution with yi as random variable, X*iß* + Mi{delta}i + 1i{epsilon}i as mean vector, and I{sigma}2w as covariance matrix; a similar notation is used if the distribution involves a scalar.

The second level assigns a prior distribution to all unknown parameters of the first tier. It is assumed that the joint prior density can be written as

(51)

Above, ß*0 is the known mean of the prior distribution of ß* and s2ß* is a known scalar that tunes the degree of vagueness of the prior for this parameter. Different degrees of vagueness may be assigned to the two components of ß*; thus, there would be two distinct tuning parameters, s2ß and s2{gamma}0. The notation V({phi}) means that the variance-covariance matrix of v (the within-chromosome deviates) depends on some parameter vector {phi}. For example, if within-chromosome deviates are correlated according to an autoregressive process having {rho} as parameter and with chromosome-specific variances {sigma}2v1, {sigma}2v2, ... , {sigma}2vnc, then {phi} = [{rho}, {sigma}2v1, {sigma}2v2, ... , {sigma}2vnc]'. The autoregressive process is assumed hereafter, to illustrate one of the possible specifications. In (51), s2w and {nu}w are known parameter values (hyperparameters) of a scaled inverted chi-square distribution (e.g., GELMAN et al. 1995 Down) assigned to {sigma}2w. Now, as stated in the developments following (42) and leading to (43), given the chromosome effects c, the marked effects {delta} of different families are taken to be mutually independent. Hence, (51) can be put as

(52)

The third level of the hierarchy consists of the prior distribution assigned to c, {phi}, and {sigma}2{epsilon}f, the parameters of the second tier. It is assumed that the corresponding density takes the form

(53)

In the preceding, it is assumed that (a) the chromosome effects of different families are mutually independent, although correlated within a family with an nc x nc covariance matrix K, as in (44); (b) the parameter {rho} has a prior distribution indexed by some known parameter vector {eta} and is independently distributed (a priori) of all within-chromosome variances {sigma}2vi; (c) the nc variances {sigma}2vi are independently and identically distributed as scaled inverted chi square with known hyperparameters s2v and {nu}v; and (d) the variance between families {sigma}2{epsilon}f follows a scaled inverted chi-square distribution with known parameters s2f and {nu}f.

The fourth and final level of the hierarchical model is the prior distribution assigned to K, the covariance matrix between effects of the same family on different chromosomes. It is assumed that K follows an inverted Wishart process of order nc (the order of K) with density p(K|{nu}cSc, {nu}c), where {nu}cSc is a known scale matrix and {nu}c is a known positive parameter usually referred to as the "degrees of freedom" of the distribution (e.g., GELMAN et al. 1995 Down). Our parameterization is such that, a priori, E(K|{nu}cSc, {nu}c) = {nu}cSc/({nu}c - nc - 1). For a very large value of {nu}c, Sc then approximates the mean value of the prior distribution. In less structured models, e.g., independence of chromosome effects within a family, K would take a simpler form, in which case the prior distribution would be modified appropriately.

Some of the prior distributions given above are difficult or impossible to elicit. In such a situation, one may consider using some of the standard default improper priors, carry out a technically involved reference prior analysis (BERNARDO 1979 Down; BERNARDO and SMITH 1994 Down), or resort to maximum entropy fits; see CANTET et al. 1992 Down and SORENSEN and GIANOLA 2002 Down for an elementary discussion of the latter in a genetics context. Collecting (50), (52), (53), p(K|{nu}cSc, {nu}c) and letting H denote the set of all known hyperparameters, the joint posterior density of the uncertain parameters is, after rearrangement,

(54)

Markov chain Monte Carlo scheme:
At least in theory, all marginal or joint distributions of sets of parameters of interest can be derived from (54) by effecting the necessary integrations, to take uncertainty about nuisance parameters into account properly. Since the joint posterior distribution is not in a form amenable to analytical treatment, MCMC (e.g., GILKS et al. 1996 Down) can be used for simulating draws from the posterior. Features of any target posterior distribution can be estimated from the simulated samples. Monte Carlo error can be made negligible by sheer computing force; however, suitable reparameterizations may enhance the efficiency of the simulation.

The hybrid algorithm proposed uses the conditional posterior distributions as candidate-generating densities, whenever possible. Save for the presence of p({rho}|{eta}) (whose structure has not been specified yet), the joint posterior process with density (54) is in a normal-inverse Gamma or normal-inverse Wishart form. All fully conditional distributions (except that of {rho}) can be identified and sampled with relative ease. This defines a Gibbs sampler for drawing from most posterior distributions of interest. Since the conditional process is not recognizable for {rho}, a Metropolis procedure is tailored. The most relevant expressions needed for implementing this hybrid algorithm are presented below. The notation [parameter or parameters|ELSE] is used throughout to denote a fully conditional posterior distribution, that is, the posterior distribution of a scalar or vector parameter given all other parameters, the data, and H. For details, see WANG et al. 1993 Down, WANG et al. 1994 Down, SORENSEN et al. 1994 Down, and SORENSEN and GIANOLA 2002 Down.

Conditional distribution of ß*: This is arrived at by retaining in (54) only the terms involving ß*. The resulting conditional density is

This can be identified as the density of the multivariate Gaussian distribution,

(55)

where

and

Obtaining samples from (55) is straightforward, especially if the order of ß* is not too large. When the prior distribution of ß* is diffuse (s2ß* -> {infty}), this step involves essentially least-squares type computations, after making an offset of the data vector.

Conditional distributions of family-specific marked effects {delta}: Inspection of (54) reveals that all family-specific marked regressions {delta}i are mutually independent, given all other parameters. The density of the conditional posterior distribution of {delta}i (i = 1, 2, ... , F) is

The distribution can be shown to be the l-variate normal process

(56)

where

and

All family-specific marked effects can be drawn on a family-by-family basis by sampling from the multivariate normal distribution given in (56).

Conditional distributions of family-specific background effects {epsilon}: From (54), it follows that

so that family-specific background effects are conditionally independent, given all other parameters. In particular, for the ith family one has

This can be shown to be the kernel of the density of the univariate normal distribution,

(57)

where

and

Note that 1'i1i = fi is the number of individuals in family i and that {sigma}2{epsilon}f/({sigma}2{epsilon}f + {sigma}2w/fi) is proportional to the "heritability of a family mean" (whenever the variance of background effects between families is proportional to the additive genetic variance), in the terminology of standard quantitative genetics. Sampling from distribution (57) is straightforward.

Conditional distributions of chromosome effects: The density of the conditional distribution of the vector of chromosome effects c is, from (54),

Thus, family-specific chromosome effect vectors ci (i = 1, 2, ... , F) are conditionally independent of each other. After algebra, as in GIANOLA and FERNANDO 1986 Down or GIANOLA et al. 1990 Down, it can be shown that the ith conditional distribution is the nc-variate normal process,

(58)

where

and

For instance, in a species with 30 pairs of chromosomes such as cattle, a draw would need to be made from a 30-variate normal distribution for each of the F families represented in the data set.

Between- and within-family variances of background effects: The density of the conditional posterior distribution of the between-family variance of background effects, {sigma}2{epsilon}f, is

Writing the F Gaussian densities and the density of the scaled inverted chi-square prior process of {sigma}2{epsilon}f explicitly, one arrives at

This indicates that the conditional posterior distribution of {sigma}2{epsilon}f is scaled inverted chi square (e.g., WANG et al. 1993 Down; GELMAN et al. 1995 Down), that is,

(59)

To draw samples from (59), one extracts a random deviate from a central chi-square distribution on F + {nu}f degrees of freedom, takes its reciprocal, and multiplies the inverted deviate by ({sum}Fi=1{epsilon}2i + s2f{nu}f).

Likewise, the density of the conditional posterior distribution of the within-family variance is

Writing the densities explicitly and rearranging one arrives at

This is the distribution of the scaled inverted chi-square random variable

(60)

The sampling process is similar to that described in connection with (59).

Autoregressive parameter {rho}: Note in (54) that the only terms involving {rho} are the prior densities of the family-specific marked effects {delta}i and the prior density p({rho}|{eta}). Hence,

Now, recall that Vi({phi}) is an l x l covariance matrix such that

where Vij({phi}) is the lj x lj covariance matrix of within-chromosome deviates for family i in chromosome j. Specifically,

(61)

Note that this covariance matrix is chromosome specific. Using this, the conditional posterior density above has the form

(62)

When viewed as a function of {rho}, (62) is not in a recognizable form, irrespective of the form of the prior density p({rho}|{eta}) adopted for the autoregressive parameter. Effecting a draw from this distribution requires a more involved procedure, such as a single-site Metropolis or Metropolis-Hastings algorithm (METROPOLIS et al. 1953 Down; TANNER 1993 Down; GELMAN et al. 1995 Down; GILKS et al. 1996 Down; SORENSEN and GIANOLA 2002 Down). Perhaps the simplest is to adopt a Metropolis step. Here, a proposal distribution is needed for generating candidate values of {rho}. As in FISHER 1921 Down, and for the purpose of facilitating the generation of proposals, consider transforming the correlation coefficient {rho} into {zeta} as

with the Jacobian of the transformation being

Then, the conditional posterior density of {zeta} is

(63)

where {zeta} = [{zeta}, {sigma}2v1, {sigma}2v2, ... , {sigma}2vnc]'. Suppose now that the state of {rho} at iteration t of the algorithm is {rho}[t] and that all other parameters in its conditional posterior distribution have been updated. This implies that the state of {zeta} is

We update {zeta} via a Metropolis jump. Here, a candidate value {zeta}* must be sampled from a symmetric candidate generating distribution with density Q({zeta}*|{zeta}[t]); symmetry means that Q({zeta}*|{zeta}[t]) = Q({zeta}[t]|{zeta}*) for all pairs ({zeta}*, {zeta}[t]) and for every t (GELMAN et al. 1995 Down). A distribution meeting this requirement is a t-process; the t-distribution has thicker tails than the normal, so "extreme" values appear at higher density, specially when the degrees of freedom are low (4–10, say). A possible Metropolis jump is as follows.

First, sample a deviate r from a Gamma(d.f./2, d.f./2) where d.f. denotes the degrees of freedom of the t-distribution.

Second, following BOX and TIAO 1973 Down, sample a candidate {zeta}* from

Third, using (63), compute the Metropolis ratio:

The integration constant cancels out in the numerator and denominator, so {delta} is calculated by direct evaluation of ratios involving (63).

Fourth, draw a deviate U from a Uniform(0, 1) distribution, and set

Invert the candidate as

and set

The preceding completes the updating of the autoregressive parameter.

Variances of within-chromosome deviates {sigma}2vi: From (54),

It is convenient to rewrite the model for marked effects presented in (42) into a chromosome-specific basis. Write

where {delta}*j is a vector of order Flj x 1 containing the marked effects of the lj markers in chromosome j for the F families, and v*j is a vector of within-chromosome deviates. An assumption made earlier is that within-chromosome deviates are conditionally mutually independent across chromosomes and families, but correlated within chromosomes according to the covariance matrix in (61). Since the preceding is merely a rearrangement of (42), it follows that

where the identity matrix is F x F. Hence,

indicating that the within-chromosome variances {sigma}2vj have independent fully conditional posterior distributions. In particular, and writing the densities explicitly, one has for {sigma}2vj,

(64)

This is the density of the scale-inverted chi-square variate:

(65)

Thus, the within-chromosome variance can be sampled independently, chromosome by chromosome.

Covariance matrix of chromosome effects K: Retaining in (54) only those terms involving K leads to

Now, we write explicitly the F normal densities and the inverse Wishart density p(K|Sc, vc), to obtain

(66)

This indicates that the fully conditional posterior distribution is inverse Wishart of order nc, degrees of freedom equal to F + vc, scale matrix {sum}Fi=1cic'i + Scvc, and posterior expectation

(given all other parameters). Procedures for sampling from inverse Wishart distributions in a quantitative genetics context are in, e.g., JENSEN et al. 1994 Down, KORSGAARD et al. 1999 Down, and SORENSEN and GIANOLA 2002 Down.

The MCMC scheme in a nutshell:
A discussion of how Markov chain Monte Carlo schemes can be tuned, run, and monitored for convergence can be found in GELMAN et al. 1995 Down, GILKS et al. 1996 Down, COWLES and CARLIN 1996 Down, ROBERT and CASELLA 1999 Down, and SORENSEN and GIANOLA 2002 Down. A possible implementation follows.

  1. Set starting values that are hopefully "inside" of the target joint posterior distribution.

  2. Sample systematically from each of the following conditional posterior distributions, perhaps changing the order of visitation at each round: (55) for ß*; (56), (57), and (58) for the family-specific marked effects, background effects, and chromosome effects, respectively; (59) and (60) for the between- and within-family variances of the background effects; (62) for the autoregressive parameter, after incorporating the Metropolis step; (65) for the nc variances of within-chromosome deviates; and (66) for K, the covariance matrix of chromosome effects. This constitutes a single completed iteration of the scheme. At each subsequent loop, update the values of the appropriate conditioning values with the new draws.

  3. Repeat the iteration as many times as needed to ensure (a) that draws can be reasonably claimed as belonging to the posterior distribution and (b) that posterior features are estimated with a sufficiently small Monte Carlo error.

  4. If necessary, discard early iterations, in what is called the "burn-in" period.

  5. Using the remaining samples, estimate any feature of the posterior distribution, e.g., a posterior mean or variance, a posterior density, or the distribution of an order statistic.

To illustrate the flexibility of the approach, consider inferring the Lorenz curve and the Gini coefficient in (36) and (37), respectively, and contrast this with the crude methods discussed in Differential contribution of marked effects to genetic variability. Let {gamma}(k) be a draw from the posterior distribution of the marked effects in the simplest model. Then, as in (34), µ(k)j = m'unique,j{gamma}(k) is a draw from the posterior distribution of µj and S({gamma}(k)) is a sample from the posterior distribution of S({gamma}) in (35). Next, order all the µ2(k)j, so that

is a sample from posterior distribution of the ordered squared "unique" marked effects. Then

is a sample from the posterior distribution of the Lorenz curve in (36). Likewise,

is a sample from the posterior distribution of the Gini coefficient in (37), measuring unequal contribution of marked effects to genetic variability. The Bayesian analysis produces an entire description of the uncertainty about the Lorenz curve and the Gini coefficient.

Draws from the posterior distribution of the "total" genetic value (in the simplest model) would be obtained as M{gamma}(k) + {epsilon}(k), and posterior means, variances, and covariances can be computed directly from such samples.


*  DISCUSSION
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

Several methods for incorporating molecular information into predictors of genetic merit of candidates for selection in improvement programs have been proposed in recent years. For example, LANDE and THOMPSON 1990 Down suggested using multiple regression (in a least-squares sense) of the phenotypic value of an individual on the number of copies of alleles at polymorphic loci; in situations where the linkage map is known, separate regressions could be calculated for each chromosome. They argued that, given a sufficient number of markers, nearly all of the additive genetic variance could be accounted for by the regression model. These authors calculated a "molecular score" (the fitted value of the least-squares regression) and combined the molecular score with the phenotype for the quantitative trait via standard selection index techniques. LANDE and THOMPSON 1990 Down also introduced a familial structure and discussed how selection could be optimized (in some sense) by combining family means and within-family deviations for both the molecular score and the phenotypic values via selection index theory (SMITH 1936 Down; HAZEL 1943 Down). There are some problems with this approach. First, the method breaks down when formulated in a vectorial manner. Even in the trivial case where candidates are independent and identically distributed (essentially their setting), the covariance matrix of the molecular scores is singular since the distribution of fitted values in regression is defined only in a p-dimensional space, rather than in n dimensions (the number of individuals with molecular scores). Hence, a selection index approach coined in the more general framework of best linear prediction (HENDERSON 1973 Down) leads to an infinite number of solutions. Hence, care must be exercised to ensure that the functions of genetic merit to be predicted are indeed predictable (HARVILLE 1976 Down).

Another difficulty with the least-squares method arises when the number of markers is almost as large as the number of individuals typed. For example, >1.5 million single-nucleotide polymorphisms (the current desideratum of genetic marker) have been identified in the human genome and their positions located at an average spacing of 2 x 10-3 cM (HARTL and JONES 2002 Down). If advantage is to be taken of this map density, this leads to a situation where l >> n. Hence, colinearity of predictor variables creates an insurmountable parameter identification problem unless some dimension-reduction technique, such as singular value decomposition, or some ad hoc marker model selection procedure is employed. Also, it is well known in statistics that least-squares is an inadmissible estimation procedure. STEIN 1955 Down showed for the orthonormal linear regression model that whenever l > 2, the least-squares estimator is dominated (in the mean-squared error sense) throughout the parameter space by a nonlinear, shrinkage type, estimator. This result, which torpedoes best linear unbiased estimation and maximum-likelihood estimation under normality, has been ignored in quantitative genetics (GIANOLA 1990 Down), although it is widely acknowledged in econometrics (THEIL 1971 Down; JUDGE et al. 1985 Down). On these grounds, not much should be expected from least-squares derivatives in marker-assisted prediction of genetic value, as found by MEUWISSEN et al. 2001 Down.

WHITTAKER et al. 2000 Down and WHITTAKER 2001 Down noted that having "too many" markers in the regression model produces serious colinearity, causing unstable least-squares estimates and a poor prediction of the molecular score. The method leads to apparent paradoxes, as follows. At least in theory, the regression coefficients of phenotype on markers flanking a QTL must have the same sign (WHITTAKER et al. 1996 Down). However, the estimated regression coefficients can differ in sign even when the true parameters agree in sign (HWANG and NETTLETON 2002 Down). The latter authors point out that sign inconsistency is greatest when flanking markers are close together and, thus, highly correlated. As an alternative to selection of subsets of markers to include in the model, WHITTAKER et al. 2000 Down and WHITTAKER 2001 Down suggested employing ridge regression (HOERL and KENNARD 1970 Down) for estimation. This ad hoc procedure shrinks the least-squares estimates toward zero, improves the condition of the coefficient matrix in least-squares, and often produces a reduction in mean-squared error of estimation. Here, one encounters the typical dilemma: the shrinkage estimator may be more precise but less accurate, since it is biased. Whether or not this reduces mean-squared error depends on a bias vs. variance trade-off, and the performance of ridge relative to least-squares regression is a function of the true value of the regression vector, which is obviously unknown. One never knows how much shrinkage should be effected, although there are some informal recipes for tuning the shrinkage parameter. On the other hand, ridge regression makes more sense when viewed from a Bayesian perspective (LINDLEY and SMITH 1972 Down; ZELLNER and VANDAELE 1975 Down; GIANOLA and FERNANDO 1986 Down). Ridge regression is equivalent to adopting a normal prior for the regression vector centered at zero and with a prior covariance structure equal to an identity matrix times a scalar, the variance of the prior distribution. A small variance means large prior precision and more shrinkage (a stronger belief that the regression is null); a large variance implies diffuse prior knowledge of the regression vector. The scalar prior covariance matrix implies that the individual regression coefficients of phenotypes on markers are viewed as independently and identically distributed a priori. Shrinking all marked effects toward a common prior mean may be questionable. Also, assuming that different chromosome sections have independent expression levels is contradicted by evidence from gene expression trials (CARON et al. 2001 Down).

In practice, all individuals cannot be typed for all markers. Further, if typing costs continue to be abated, the number of predictor variables (markers) may grow faster than the number of observations. This situation is encountered routinely in animal breeding, and it can be solved by positing random-effect models. The BLUP methods used in animal breeding can be viewed as based on conditional or posterior distributions (GIANOLA and FERNANDO 1986 Down), illustrating why the dimensionality and identification problems arising in least-squares are circumvented. MEUWISSEN et al. 2001 Down noted this and compared least-squares with BLUP and with two Bayesian methods for marker-assisted prediction. In their simulation, effects of 50,000 marker haplotypes needed to be inferred from 2200 observations. Since this is not possible via least-squares, they calculated regressions on markers found in chromosome segments of 1 cM each and evaluated a log-likelihood (assuming normality) for every segment. Then, "significant" segments were selected using an arbitrary procedure on the basis of the said likelihoods. Eventually, the authors combined all such segments into a single multiple regression model. In the BLUP analysis, all chromosome effects were regarded as independent and identically distributed random effects with known variance. Their first Bayesian model was as in BLUP, but assigning independent scaled inverted chi-square distributions to the variances of the chromosome segments. The second Bayesian model used a mixture prior distribution for the variance of the chromosome segment variances. Their main conclusions were: (1) least-squares had a low accuracy of prediction of genetic value, (2) the BLUP and Bayesian methods differed by little, and (3) the Bayesian methods assigning prior distributions to the variance associated with chromosome segments gave the most accurate predictions.

The classical and Bayesian methods outlined here do not accommodate complex genealogical structures such as those arising in animal breeding. To illustrate, consider the hypothetical situation in which the markers and the QTL are identical, so that there are no background effects if gene action is additive. Hence, in the notation of A MIXED-EFFECTS MODEL FORMULATION, ai = {sum}lk=1mik{gamma}k, and, therefore, one would expect a priori that the covariance between the genetic value of any two relatives i and i' is equal to

In the absence of knowledge of genotype frequencies, the term E(m'im'i) cannot be assessed. In a fully Bayesian method one needs to assign a prior to the genotypic frequencies and take the MCMC method even further (e.g., HOESCHELE 2001 Down). For this reason, all analyses presented in our article treat the matrix M as nonstochastic. Concerning the background genetic effects, one can introduce a full relationship matrix under additive inheritance.

Model (38) introduces a family structure, and (47) expands LANDE and THOMPSON 1990 Down to a matrix domain, but with a random-effects treatment. To illustrate a Bayesian analysis of the latter, consider the following terms: (1) i{alpha}B, the fi x 1 vector of unobserved (contrary to least-squares fitted values) mean molecular scores for all members of family i (all its elements are equal); (2) (Mi - i){alpha}W, the within-family contributions to the molecular scores; (3) {epsilon}i or background genetic effect of family i (e.g., the additive genetic contribution, net of the marked part of the variability); and (4) wi, a vector of within-family deviations (also net of the marked part of the variability). The mean of the posterior distribution of the total genetic merit of the individuals in family i would be obtained by averaging i{alpha}(k)B + (Mi - i){alpha}(k)W + 1i{epsilon}(k)i over the MCMC samples. Similarly, the posterior mean of the within-family deviations can be estimated by averaging w(k)i = yi - Xiß(k) - Mi{gamma}(k)i - 1i{epsilon}(k)i. LANDE and THOMPSON 1990 Down consider "selection optimization" by defining total merit of a candidate for selection as a linear combination of family and within-family information. For instance, one could write the merit vector

where hF{alpha} and hF{epsilon} are some nonstochastic relative weights assigned to the unobserved molecular and quantitative trait family means, respectively, and hw{alpha}, hw{epsilon} are corresponding weights assigned to the unobserved within-family deviations. The means, variances, covariances, and any feature of a posterior distribution involving the unobserved total merits Ti can be estimated from the MCMC samples. The procedure takes the ideas of LANDE and THOMPSON 1990 Down to a higher level of generality, with the uncertainty about nuisance parameters taken into account fully in the usual Bayesian manner (BOX and TIAO 1973 Down).

We have assumed throughout that all individuals have been typed for all markers, but this is seldom the case, as noted earlier. In the Bayesian context missing markers can be dealt with automatically via an augmentation of the posterior distribution. Subsequently, MCMC is used to make imputations for the unobserved part of the molecular information, at least if missingness is at random in the sense of RUBIN 1976 Down. See HOESCHELE 2001 Down for technical details.

It remains to be seen to what extent the proposed procedures hold in practice or in simulations what they seem to promise on theoretical grounds. In a simulation, WHITTAKER et al. 2000 Down found that a simple ridge regression procedure enhanced response to selection over plain phenotypic selection or selection based on least-squares fits of the molecular scores. The results of MEUWISSEN et al. 2001 Down point in the same direction. This is encouraging. Another issue is the computer intensiveness of the required calculations. The BLUP-REML analysis can be carried out with standard software for large data sets. On the other hand, the MCMC procedures require a higher level of computation; however, they are more flexible and lead to richer inference (e.g., estimation of the Lorenz curve). At any rate, computational burden is unimportant relative to the cost of genotyping individuals. Also, the Bayesian analysis requires elicitation of some complex priors distributions and this is difficult. In practice, a possibility is to use some of the standard default noninformative priors (e.g., SORENSEN and GIANOLA 2002 Down) although at the risk of causing impropriety. Another option is to carry out the Bayesian analysis with different priors. If inferences are invariant with respect to the set of priors used, this would suggest that the data are "informative enough" and that the prior is overwhelmed by the information contributed by the data.


*  ACKNOWLEDGMENTS

The authors thank Rohan Fernando, Davorka Gulisija, and Guilherme Rosa for useful comments. Research was supported by the Wisconsin Agriculture Experiment Station and by grants from the NRICGP/U.S. Department of Agriculture (99-35205-8162) and National Science Foundation (DEB-0089742).

Manuscript received May 3, 2002; Accepted for publication September 27, 2002.


*  LITERATURE CITED
*TOP
*ABSTRACT
*A MIXED-EFFECTS MODEL...
*EXTENDING THE HIERARCHY
*A BAYESIAN FORMULATION
*DISCUSSION
*LITERATURE CITED

AGUILAR-GUTIERREZ, X., 2000 Desigualdad y Pobreza en México, son inevitables? Instituto de Investigaciones Económicas, Universidad Nacional Autónoma de México, Mexico City.

BERNARDO, J. M., 1979  Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 41:113-147.

BERNARDO, J. M., and A. F. M. SMITH, 1994 Bayesian Theory. John Wiley & Sons, New York.

BOX, G. E. P., 1980  Sampling and Bayes' inference in scientific modelling and robustness. J. R. Stat. Soc. A 143:383-430.

BOX, G. E. P., and G. C. TIAO, 1973 Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, UK.

CANTET, R. J. C., R. L. FERNANDO, and D. GIANOLA, 1992  Bayesian inference about dispersion parameters of univariate mixed models with maternal effects: theoretical considerations. Genet. Sel. Evol. 24:107-135.

CARON, H., B. VAN SCHAIK, M. VAN DER MEE, F. BAAS, and G. RIGGINS et al., 2001  The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291:1289-1292.[Abstract/Free Full Text]

COWLES, M. K. and B. P. CARLIN, 1996  Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Stat. Assoc. 91:883-904.

FARNIR, F., W. COPPIETERS, J. J. ARRANZ, P. BERZI, and N. CAMBISANO et al., 2000  Extensive genome-wide linkage disequilibrium in dairy cattle. Genome Res. 10:220-227.[Abstract/Free Full Text]

FERNANDO, R. L. and M. GROSSMAN, 1989  Marker-assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 33:209-229.

FISHER, R. A., 1921  On the probable error of a coefficient of correlation deduced from a small sample. Metron 4:3-32.

GASTWIRTH, J. L., 1971  A general definition of Lorenz curve. Econometrica 39:1037-1039.

GASTWIRTH, J. L., 1972  The estimation of the Lorenz curve. Rev. of Econ. Stat. 54:306-316.

GELMAN, A., J. B. CARLIN, H. S. STERN and D. B. RUBIN, 1995 Bayesian Data Analysis. Chapman & Hall, London.

GEORGES, M. D., D. NIELSEN, M. MACKINNON, A. MISHRA, and R. OKIMOTO et al., 1995  Mapping quantitative trait loci controlling milk production in dairy cattle by exploiting progeny testing. Genetics 139:907-920.[Abstract]

GIANOLA, D., 1990 Can BLUP and REML be improved upon?, pp. 445-449 in Proceedings of the 4th World Congress on Genetics Applied to Livestock Production, Vol. XIII. Joyce Darling, Penicuik, UK.

GIANOLA, D. and R. L. FERNANDO, 1986  Bayesian methods in animal breeding theory. J. Anim. Sci. 63:217-244.[Abstract/Free Full Text]

GIANOLA, D. and B. GOFFINET, 1982  Sire evaluation with best linear unbiased predictors. Biometrics 38:1085-1088.

GIANOLA, D., S. IM and F. W. MACEDO, 1990 A framework for prediction of breeding value, pp. 210–238 in Advances in Statistical Methods for Genetic Improvement of Livestock, edited by D. GIANOLA and K. HAMMOND. Springer-Verlag, Heidelberg.

GILKS, W. R., S. RICHARDSON and D. J. SPIEGELHALTER, 1996 Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

HARTL, D. L., and E. W. JONES, 2002 Essential Genetics: A Genomics Perspective. Jones & Bartlett, Sudbury, MA.

HARVILLE, D. A., 1976  Extension of the Gauss-Markov theorem to include the estimation of random effects. Ann. Stat. 4:384-395.

HAYES, B. and M. E. GODDARD, 2001  The distribution of the effects of genes affecting quantitative traits in livestock. Genet. Sel. Evol. 33:209-229.[Medline]

HAZEL, L. N., 1943  The genetic basis for constructing selection indexes. Genetics 28:476-490.[Free Full Text]

HENDERSON, C. R., 1973 Sire evaluation and genetic trends, pp. 10–41 in Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. Jay L. Lush. American Society of Animal Science and American Dairy Science Association, Champaign, IL.

HENDERSON, C. R., O. KEMPTHORNE, S. R. SEARLE, and C. M. VON KROSIGK, 1959  The estimation of environmental and genetic trends from records subject to culling. Biometrics 15:192-218.

HOERL, A. E. and R. W. KENNARD, 1970  Ridge regression: biased estimation and applications for non-orthogonal problems. Technometrics 12:55-82.

HOESCHELE, I., 2001 Mapping quantitative trait loci in outbred pedigrees, pp. 599–644 in Handbook of Statistical Genetics, edited by D. J. BALDING M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.

HWANG, J. T. and D. NETTLETON, 2002  Investigating the probability of sign inconsistency in the regression coefficients of markers flanking quantitative trait loci. Genetics 160:1697-1705.[Abstract/Free Full Text]

JENSEN, J., C. S. WANG, D. A. SORENSEN, and D. GIANOLA, 1994  Bayesian inference on variance and covariance components for traits influenced by maternal and direct genetic effects using the Gibbs sampler. Acta Agric. Scand. 44:193-201.

JUDGE, G. G., W. E. GRIFFITHS, R. CARTER, H. LUTKEPOHL and T. C. LEE, 1985 The Theory and Practice of Econometrics. John Wiley & Sons, New York.

KORSGAARD, I. R., A. H. ANDERSEN, and D. SORENSEN, 1999  A useful reparameterisation to obtain samples from conditional inverse Wishart distributions. Genet. Sel. Evol. 31:177-181.

LANDE, R. and R. THOMPSON, 1990  Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124:743-756.[Abstract]

LANGE, C. and J. C. WHITTAKER, 2001  On prediction of genetic values in marker-assisted selection. Genetics 159:1375-1381.[Abstract/Free Full Text]

LINDLEY, D. V. and A. F. M. SMITH, 1972  Bayes estimates for the linear model. J. R. Stat. Soc. B 34:1-41.

METROPOLIS, N., A. W. ROSENBLUTH, M. N. ROSENBLUTH, A. H. TELLER, and E. TELLER, 1953  Equations of state calculation by fast computing machines. J. Chem. Phys. 21:1087-1091.

MEUWISSEN, T. H. E., B. J. HAYES, and M. E. GODDARD, 2001  Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829.[Abstract/Free Full Text]

OLLIVIER, L., 1998  The accuracy of marker-assisted selection for quantitative traits in populations in linkage equilibrium. Genetics 148:1367-1372.[Abstract/Free Full Text]

ROBERT, C. P., and G. CASELLA, 1999 Monte Carlo Statistical Methods. Springer-Verlag, New York.

RUBIN, D. B., 1976  Inference and missing data. Biometrika 63:581-592.[Abstract/Free Full Text]

SEARLE, S. R., 1974 Prediction, mixed models and variance components, pp. 229–266 in Reliability and Biometry, edited by F. PROSCHAN and R. J. SERFLING. Society for Industrial and Applied Mathematics, Philadelphia.

SEARLE, S. R., 1982 Matrix Algebra Useful for Statistics. John Wiley & Sons, New York.

SEARLE, S. R., G. CASELLA and C. E. MCCULLOCH, 1992 Variance Components. John Wiley & Sons, New York.

SMITH, C. and S. P. SIMPSON, 1986  The use of genetic polymorphisms in livestock improvement. J. Anim. Breed. Genet. 103:205-217.

SMITH, H. F., 1936  A discriminant function for plant selection. Ann. Eugen. 7:240-250.

SOLLER, M. and J. S. BECKMANN, 1983  Genetic polymorphisms in varietal identification and genetic improvement. Theor. Appl. Genet. 67:25-33.

SORENSEN, D., and D. GIANOLA, 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York.

SORENSEN, D., C. S. WANG, J. JENSEN, and D. GIANOLA, 1994  Bayesian analysis of genetic change due to selection using Gibbs sampling. Genet. Sel. Evol. 26:333-360.

STEIN, C., 1955 Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, pp. 192-206 in Proceedings of the Third Berkeley Symposium on Mathematics, Statistics, and Probability. University of California Press, Berkeley.

TANNER, M. A., 1993 Tools for Statistical Inference. Springer-Verlag, New York.

THEIL, H., 1971 Principles of Econometrics. John Wiley & Sons, New York.

URIOSTE, J. I., D. GIANOLA, R. REKAYA, W. F. FIKSE, and K. A. WEIGEL, 2001  Evaluation of extent and amount of heterogeneous variance for milk yield in Uruguayan Holsteins. Anim. Sci. 72:259-268.

VERBEKE, G., and G. MOLENBERGHS, 1997 Linear Mixed Models in Practice: A SAS Oriented Approach. Springer-Verlag, New York.

VERBEKE, G., and G. MOLENBERGHS, 2000 Linear Mixed Models for Longitudinal Data. Springer, New York.

WANG, C. S., J. J. RUTLEDGE, and D. GIANOLA, 1993  Marginal inferences about variance components in a mixed linear model using Gibbs sampling. Genet. Sel. Evol. 25:41-62.

WANG, C. S., J. J. RUTLEDGE, and D. GIANOLA, 1994  Bayesian analysis of mixed linear models via Gibbs sampling with an application to litter size in Iberian pigs. Genet. Sel. Evol. 26:91-115.

WANG, T., R. L. FERNANDO, and M. GROSSMAN, 1998  Genetic evaluation by best linear unbiased prediction using marker and trait information in a multibreed population. Genetics 148:507-515.[Abstract/Free Full Text]

WEIGEL, K. A., D. GIANOLA, R. J. TEMPELMAN, C. A. MATOS, and I. H. C. CHEN et al., 1991  Improving estimates of fixed effects in a mixed linear model. J. Dairy Sci. 74:3174-3182.[Abstract]

WHITTAKER, J. C., 2001 Marker-assisted selection and introgression, pp. 673–693 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.

WHITTAKER, J. C., R. THOMPSON, and P. M. VISSCHER, 1996  On the mapping of QTL by regression of phenotype and marker-type. Heredity 77:23-32.

WHITTAKER, J. C., R. THOMPSON, and M. C. DENHAM, 2000  Marker-assisted selection using ridge regression. Genet. Res. 75:249-252.[Medline]

XU, S., 1998  Mapping quantitative trait loci using multiple families of line crosses. Genetics 148:517-524.[Abstract/Free Full Text]

ZELLNER, A., and W. VANDAELE, 1975 Bayes-Stein estimators for k-means, regression and simultaneous equation models, pp. 627–653 in Studies in Bayesian Econometrics and Statistics, edited by S. FIENBERG and A. ZELLNER. North-Holland, Amsterdam.

ZENG, Z-B., 1993  Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90:10972-10976.[Abstract/Free Full Text]




This article has been cited by other articles:


Home page
Crop Sci.Home page
H. P. Piepho
Ridge Regression and Extensions for Genomewide Selection in Maize
Crop Sci., June 26, 2009; 49(4): 1165 - 1176.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
G. de los Campos, H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel, and J. M. Cotes
Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree
Genetics, May 1, 2009; 182(1): 375 - 385.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
N. Yi and S. Xu
Bayesian LASSO for Quantitative Trait Loci Mapping
Genetics, June 1, 2008; 179(2): 1045 - 1055.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
D. Gianola and J. B. C. H. M. van Kaam
Reproducing Kernel Hilbert Spaces Regression Methods for Genomic Assisted Prediction of Quantitative Traits
Genetics, April 1, 2008; 178(4): 2289 - 2303.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
O. Gonzalez-Recio, D. Gianola, N. Long, K. A. Weigel, G. J. M. Rosa, and S. Avendano
Nonparametric Methods for Incorporating Genomic Information Into Genetic Evaluations: An Application to Mortality in Broilers
Genetics, April 1, 2008; 178(4): 2305 - 2313.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
D. Gianola, R. L. Fernando, and A. Stella
Genomic-Assisted Prediction of Genetic Value With Semiparametric Procedures
Genetics, July 1, 2006; 173(3): 1761 - 1776.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
H. Wang, Y.-M. Zhang, X. Li, G. L. Masinde, S. Mohan, D. J. Baylink, and S. Xu
Bayesian Shrinkage Estimation of Quantitative Trait Loci Parameters
Genetics, May 1, 2005; 170(1): 465 - 480.
[Abstract] [Full Text] [PDF]


Home page
Crop Sci.Home page
C. Cervantes-Martinez and J. S. Brown
A Haplotype-Based Method for QTL Mapping of F1 Populations in Outbred Plant Species
Crop Sci., September 1, 2004; 44(5): 1572 - 1583.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
M. Perez-Enciso, M. A. Toro, M. Tenenhaus, and D. Gianola
Combining Gene Expression and Molecular Marker Information for Mapping Complex Trait Genes: A Simulation Study
Genetics, August 1, 2003; 164(4): 1597 - 1606.
[Abstract] [Full Text] [PDF]