## Abstract

Obtaining accurate estimates of the genetic covariance matrix for multivariate data is a fundamental task in quantitative genetics and important for both evolutionary biologists and plant or animal breeders. Classical methods for estimating are well known to suffer from substantial sampling errors; importantly, its leading eigenvalues are systematically overestimated. This article proposes a framework that exploits information in the phenotypic covariance matrix in a new way to obtain more accurate estimates of . The approach focuses on the “canonical heritabilities” (the eigenvalues of ), which may be estimated with more precision than those of because is estimated more accurately. Our method uses penalized maximum likelihood and shrinkage to reduce bias in estimates of the canonical heritabilities. This in turn can be exploited to get substantial reductions in bias for estimates of the eigenvalues of and a reduction in sampling errors for estimates of . Simulations show that improvements are greatest when sample sizes are small and the canonical heritabilities are closely spaced. An application to data from beef cattle demonstrates the efficacy this approach and the effect on estimates of heritabilities and correlations. Penalized estimation is recommended for multivariate analyses involving more than a few traits or problems with limited data.

QUANTITATIVE geneticists, including evolutionary biologists and plant and animal breeders, are increasingly dependent on multivariate analyses of genetic variation, for example, to understand evolutionary constraints and design efficient selection programs. New challenges arise when one moves from estimating the genetic variance of a single phenotype to the multivariate setting. An important but unresolved issue is how best to deal with sampling variation and the corresponding bias in the eigenvalues of estimates for the genetic covariance matrix, . It is well known that estimates for the largest eigenvalues of a covariance matrix are biased upward and those for the smallest eigenvalues are biased downward (Lawley 1956; Hayes and Hill 1981). For genetic problems, where we need to estimate at least two covariance matrices simultaneously, this tends to be exacerbated, especially for . In turn, this can result in invalid estimates of , *i.e.*, estimates with negative eigenvalues, and can produce systematic errors in predictions for the response to selection.

There has been longstanding interest in “regularization” of covariance matrices, in particular for cases where the ratio between the number of observations and the number of variables is small. Various studies recently employed such techniques for the analysis of high-dimensional, genomic data. In general, this involves a compromise between additional bias and reduced sampling variation of “improved” estimators that have less statistical risk than standard methods (Bickel and Li 2006). For instance, various types of shrinkage estimators of covariance matrices have been suggested that counteract bias in estimates of eigenvalues by shrinking all sample eigenvalues toward their mean. Often this is equivalent to a weighted combination of the sample covariance matrix and a target matrix, assumed to have a simple structure. A common choice for the latter is an identity matrix. This yields a ridge regression type formulation (Hoerl and Kennard 1970). Numerous simulation studies in a variety of settings are available, which demonstrate that regularization can yield closer agreement between estimated and population covariance matrices, less variable estimates of model terms, or improved performance of statistical tests.

In quantitative genetic analyses, we attempt to partition observed, overall (phenotypic) covariances into their genetic and environmental components. Typically, this results in strong sampling correlations between them. Hence, while the partitioning into sources of variation and estimates of individual covariance matrices may be subject to substantial sampling variances, their sum, *i.e.*, the phenotypic covariance matrix, can generally be estimated much more accurately. This has led to suggestions to “borrow strength” from estimates of phenotypic components to estimate the genetic covariances. In particular, Hayes and Hill (1981) proposed a method termed “bending” that involved regressing the eigenvalues of the product of the genetic and the inverse of the phenotypic covariance matrix toward their mean. One objective of this procedure was to ensure that estimates of the genetic covariance matrix from an analysis of variance were positive definite. In addition, the authors showed by simulation that shrinking eigenvalues even further than needed to make all values nonnegative could improve the achieved response to selection when using the resulting estimates to derive weights for a selection index, especially for estimation based on small samples. Subsequent work demonstrated that bending could also be advantageous in more general scenarios such as indexes that included information from relatives (Meyer and Hill 1983).

Modern, mixed model (“animal model”)-based analyses to estimate genetic parameters using maximum likelihood or Bayesian methods generally constrain estimates to the parameter space, so that—at the expense of introducing some bias—estimates of covariance matrices are positive semidefinite. However, the problems arising from substantial sampling variation in multivariate analyses remain. In spite of increasing applications of such analyses in scenarios where data sets are invariably small, *e.g.*, the analysis of data from natural populations (*e.g.*, Kruuk *et al*. 2008), there has been little interest in regularization and shrinkage techniques in genetic parameter estimation, other than through the use of informative priors in a Bayesian context. Instead, suggestions for improved estimation have focused on parsimonious modeling of covariance matrices, *e.g.*, through reduced rank estimation or by imposing a known structure, such as a factor-analytic structure (Kirkpatrick and Meyer 2004; Meyer 2009), or by fitting covariance functions for longitudinal data (Kirkpatrick *et al*. 1990). While such methods can be highly advantageous when the underlying assumptions are at least approximately correct, data-driven methods of regularization may be preferable in other scenarios.

This article explores the scope for improved estimation of genetic covariance matrices by implementing the equivalent to bending within animal model-type analyses. We begin with a review of the underlying statistical principles (which the impatient reader might skip), examining the concept of improved estimation, its implementation via shrinkage estimators or penalized estimation, and selected applications. We then describe a penalized restricted maximum-likelihood (REML) procedure for the estimation of genetic covariance matrices that utilizes information from its phenotypic counterparts and present a simulation study demonstrating the effect of penalties on parameter estimates and their sampling properties. The article concludes with an application to a problem relevant in genetic improvement of beef cattle and a discussion.

## REVIEW: PRINCIPLES OF PENALIZED ESTIMATION

In broad terms, regularization in statistics refers to a scenario where estimation for ill-posed or overparameterized problems is improved through use of some form of additional information. Often, the latter is composed of a penalty for a deviation from a desired outcome. For example, in fitting smoothing splines a “roughness penalty” is commonly employed to place preference on simple functions (Green 1998). This section reviews some of the underlying principles of improved estimation of covariance matrices.

#### Minimizing statistical risk:

A central term in estimation is that of risk, defined as expected loss, arising from the inevitable deviation of estimates from the underlying population values. Consider a set of *q* normally distributed variables with population covariance matrix , recorded on *n* individuals, and estimator . Common loss functions considered in the statistical literature are the entropy (*L*_{1}) and quadratic (*L*_{2}) loss (James and Stein 1961)(1)and(2)A natural estimator for is a scalar multiple of the matrix of sums of squares and cross-products among the *q* variables, **S**. In this class of estimators, the sample covariance matrix **S**/*d* with *d* the degrees of freedom, *i.e.*, the usual, unbiased estimator, minimizes the *L*_{1} risk, while **S**/(*d* + *q* + 1) yields the minimum risk estimator under loss function *L*_{2} (*e.g.*, Haff 1980).

For known, these loss functions provide a measure of how different the estimate is from the true value, where the difference can reflect both chance (*i.e.*, sampling error) and systematic deviations (*i.e.*, bias). While estimates with minimum bias may be preferred in some circumstances, in general estimators with a low overall loss or risk are considered desirable. The “statistical” risk defined in such a general manner may not reflect the importance of errors in specific elements of adequately in some situations, and alternative definitions may be more appropriate. These are, in principle, readily substituted where required, although they may not necessarily translate to simple penalties on the likelihood function as for the *L*_{1} and *L*_{2} loss. In a genetic context, an important use of estimated genetic and phenotypic covariance matrices is in the calculation of the weights for individual traits in selection indexes. The achieved response to selection on the index then depends on how closely the estimates—and thus the index weights derived from them—adhere to the true values (Hayes and Hill 1981). While reducing the general risk, as quantified by the loss functions above (Equations 1 and 2), is likely to reduce the error in index weights and thus the discrepancy between expected and achieved response, this may not be optimal. For instance, if economic weights for individual traits in an index differ substantially, the definition of risk may need to be adapted accordingly.

#### Improved estimators of covariance matrices:

There is a considerable body of literature on improved estimators of covariance matrices. These are generally biased, but have a lower risk than the standard, unbiased estimator (sample covariance matrix). Several studies derived the risk for a certain class of estimator and given loss function and presented estimators that “dominate” over other estimators followed by a simulation study to demonstrate their properties. Others obtained estimators using a different motivation, such as minimax or empirical Bayesian estimation; see Kubokawa (1999) and Hoffmann (2000) for reviews.

Sampling variation causes the largest eigenvalues of a covariance matrix to be overestimated and the smallest eigenvalues to be underestimated, while their mean is expected to be unbiased. Hence, attention has focused on estimators that modify the eigenvalues of the sample covariance matrix while retaining the corresponding eigenvectors. The impetus for this is generally attributed to Stein (1975), but similar suggestions can be found earlier, for instance, in Lawley (1956). Let denote the *i*th eigenvalue of the sample covariance matrix. Stein's proposal then consisted of an adaptive shrinking obtained by scaling each by . The resulting estimator minimizes the entropy loss but does not preserve the order of eigenvalues or ensure nonnegativity. Hence, later work often combined this with order-preserving measures such as an “isotonizing” regression (which restores order by merging values out of line) or truncation at zero (Dey and Srinivasan 1985; Lin and Perlman 1985; Ye and Wang 2009).

A simple modification scheme entails the linear shrinkage of the sample eigenvalues toward their mean. It can be shown that this yields an estimator that is a weighted combination of the sample covariance matrix and an identity matrix. Considering a quadratic loss function, Ledoit and Wolf (2004) derived an optimal shrinkage factor ρ ∈ [0, 1] that minimized the risk associated with the estimator (with **I** an identity matrix and the mean sample eigenvalue). Daniels and Kass (2001) argued that, due to the nature of the quadratic loss, such an estimator could result in overshrinkage, in particular of the smallest eigenvalues, when the true eigenvalues were far apart. Instead the authors proposed an estimator derived by assuming a prior normal distribution for the eigenvalues on the logarithmic scale, approximated as . This resulted in modified values of the form [with the mean of ], *i.e.*, again involved a regression toward the sample mean, but on a different scale. Warton (2008) proposed a similar, regularized estimator of the sample correlation matrix , , and showed that this was the penalized maximum-likelihood estimator with penalty term proportional to –tr(**R**^{−1}), with the corollary that the corresponding, ridge type estimator of a covariance matrix , , involved a penalty proportional to .

Other work considered shrinkage toward a more general structure. Schäfer and Strimmer (2005) and Sancetta (2008) extended the approach of Ledoit and Wolf (2004) to different target matrices of simple structure with few parameters to be estimated, *e.g.*, a diagonal matrix with different variances or a matrix with all correlations equal. Böhm (2008) examined estimators for multivariate time series and suggested data-driven shrinkage toward a factor-analytic structure. Shrinkage estimators of correlation matrices have been described by Lin and Perlman (1985), Daniels and Kass (2001), and Warton (2008).

##### More than one matrix:

Few studies have addressed improved estimation for multilevel models such as appear in quantitative genetics. The simplest case, with two matrices to be estimated, is a balanced one-way classification. Let and denote the covariance matrices between and within groups, respectively, and **B** and **W** denote the corresponding matrices of mean squares and cross-products (MSCP). Derivations of improved estimators by and large utilized the so-called canonical decomposition of **B** and **W**: For any two symmetric (real) matrices, **W** and **B**, of size *q* × *q* with **W** positive-definite and **B** positive semidefinite (p.s.d.), a matrix **T** exists such that **TT′** = **W** and , with the diagonal matrix of eigenvalues of **W**^{−1}**B** (Anderson 1984).

An immediate, additional problem then is to ensure that estimates are within the parameter space, *i.e.*, are not negative definite. Due to sampling variation, the usual unbiased estimator for the between-group component, (with *m* the group size), has a high probability, increasing with *q* and decreasing sample size, of not being p.s.d., *i.e.*, to have negative eigenvalues (Hill and Thompson 1978; Bhargava and Disch 1982). Using the canonical transformation yields , and it is readily seen that is guaranteed to be nonnegative definite by truncating the elements of at a minimum of unity, *i.e.*, by replacing with . The resulting estimator is the REML estimator (Klotz and Putter 1969; Amemiya 1985; Anderson *et al*. 1986). In a genetic context where we estimate the matrix of environmental covariances as (with α^{−1} the degree of relationship among group members), additional constraints may be required to ensure that is within the parameter space (Meyer and Kirkpatrick 2008).

As outlined above, Hayes and Hill (1981) suggested to bend the estimate of the genetic covariance matrix, , toward the estimate of the phenotypic covariance matrix, , by regressing the eigenvalues of to their mean. Their rationale for this was somewhat *ad hoc*: plays a central role in computing the weights in a selection index and the main objective was to improve the properties of selection indexes based on the estimated covariance matrices. Rather than manipulating the roots of directly though, Hayes and Hill (1981) modified **W**^{−1}**B**, using that for λ_{i} a root of **W**^{−1}**B**, α(λ_{i} – 1)/(λ_{i} – 1 + *n*) is a root of . Their estimator for was then obtained by replacing above by , with the mean of the λ_{i} and ρ ∈ [0, 1] the bending factor.

Loh (1991), Mathew *et al*. (1994), Srivastava and Kubokawa (1999), and Kubokawa and Tsai (2006) considered estimation for two independent Wishart matrices, such as **B** and **W** in the one-way classification. Minimizing the sum of entropy losses, they derived different types of joint estimators, analogous to those proposed for a single matrix, and showed that improved estimators were available that had lower risk than the unbiased or REML estimators. However, no practical applications using any of these results are available. Again, these estimators involved some form of modification of the eigenvalues arising from the canonical decomposition of the two matrices, indicating that the suggestion of Hayes and Hill (1981) was well founded.

#### Penalized maximum-likelihood estimation:

A standard method of regularization is that of penalized estimation, in particular for regression problems. In a least-squares or maximum-likelihood (ML) context, this involves a penalty that is added to the criterion to be minimized. The effect of the penalty is modulated by a so-called tuning parameter (ψ) that determines the relative importance of information from the data and the desired outcome. For (RE)ML, this replaces the objective function withwhere is the vector of parameters to be estimated, denotes the standard log likelihood, and is a (nonnegative) penalty function (the factor is for algebraic consistency and could be omitted).

Let **y** = **Xb** + **e** denote a simple regression model with **y**, **b**, and **e** the vectors of observations, regression coefficients, and residuals, respectively, and **X** the corresponding incidence matrix. A class of penalties commonly employed is that of the ℓ_{p} norm(3)with *b _{i}* the

*i*th element of

**b**. Different values of

*p*are appropriate for different analyses. For

*p*= 0, the penalty is equal to the number of elements in

**b**and may be employed in model selection. A value of

*p*= 1 yields a least absolute shrinkage and selection operator (LASSO)-type penalty that encourages shrinkage of small (absolute value) elements of

**b**to zero and thus subset selection (Tibshirani 1996). A quadratic penalty (

*p*= 2) produces a ridge regression formulation where all elements of

**b**are shrunk proportionally (Hoerl and Kennard 1970). Noninteger values for

*p*have been suggested,

*e.g.*, the so-called “bridge,” attributed to Frank and Friedman (1993), for values of 0 <

*p*< 1. Other proposals have been to combine ℓ

_{1}- and ℓ

_{2}-type penalties to form the “elastic net” (Zou and Hastie 2005) or to account for the correlation among predictors in the penalty term (Tutz and Ulbricht 2009).

In a more general framework, we may assume a certain prior distribution for the parameters to be estimated and impose a penalty that is proportional to minus the logarithmic value of the prior density. This provides a direct link to Bayesian estimation—indeed, penalized maximum-likelihood estimation has been described as “an attempt of enjoying the Bayesian fruits without paying the B-club fee” (Meng 2008, p. 1545). For instance, imposing an ℓ_{1}-type penalty is equivalent to assuming a double exponential prior distribution, while an ℓ_{2} penalty implies a normal prior.

##### Estimation of the tuning factor:

A general procedure to estimate the tuning parameter ψ from the data at hand is cross-validation. This involves splitting the data into so-called training and validation sets. We then fit our model and estimate the parameters of interest for a range of values of ψ, using the training set only. For each set of estimates, the value of the (unpenalized) objective function, *e.g.*, the likelihood or the residual sum of squares, is determined using the validation set. The value of ψ that optimizes this criterion is then chosen as the best value to use for a penalized analysis of the complete data set.

In practice, multiple splits are used. A popular scheme is that of *K*-fold cross-validation (*e.g.*, Hastie *et al*. 2001, Chap. 7) where the data are evenly split into *K* subsets, and *K* analyses are carried out for each value of ψ, with the *i*th subset treated as the validation set and the remaining *K* − 1 subsets forming the training set. The tuning parameter is then chosen on the basis of the objective function averaged across the *K* validation sets. Common values of *K* used are 5 and 10. A related technique is repeated random subsampling. For example, Bickel and Levina (2008) employed a scheme using 50 “random splits” of the data, with the training set comprising a third of the data, and showed that while, due to noise, this tended to overestimate the true risk substantially, the location of the minimum tended to be identified correctly.

In a simulation setting, an alternative is to simply simulate additional data sets for validation, with the same data structure and sampling from distributions with the same population parameters as for the “training” sets. Using one set per replicate, Rothman *et al*. (2009) noted that use of such validation sets could be replaced with cross-validation without any significant changes in results (but did not give details on the cross-validation scheme referred to).

In special cases, ψ can be estimated directly. An example is that of function estimation (smoothing) using a semiparametric regression or penalized splines with a quadratic penalty. In that case, the regression can be rewritten as a mixed model with the penalized coefficients treated as random effects and the tuning parameter can be estimated “automatically” in a (RE)ML analysis from the ratio of the residual variance and the variance due to random effects; see Ruppert *et al*. (2003, Chap. 5) for an example. Foster *et al*. (2009) showed that a LASSO-type penalty on effects can be imposed in a mixed model by treating these as random effects with a double-exponential distribution and that the respective variance parameters (which determine the amount of penalization) can be estimated using a ML approach.

##### Applications to covariance matrices:

Penalized ML estimation of covariance matrices has predominantly been applied in the spirit of covariance selection (Dempster 1972), *i.e.*, to encourage sparsity in the estimate of the inverse of the covariance matrix (“concentration” matrix), and mostly for problems with high dimensions and some natural ordering of variables, *e.g.*, longitudinal data. This relied on the modified Cholesky decomposition or , with **L** a lower triangular matrix with diagonal elements of unity and **D** a diagonal matrix. For longitudinal data, the nonzero off-diagonal elements of **L** have an interpretation as (minus) the regression coefficients in an autoregressive model (Pourahmadi 1999) and can be penalized in the same fashion as coefficients in a simple regression context. Applications using both ℓ_{1}- and ℓ_{2}-type penalties can be found in Huang *et al*. (2006), Bickel and Levina (2008), Friedman *et al*. (2008), Levina *et al*. (2008), Rothman *et al*. (2008), and Yap *et al*. (2009). With a different objective, Warton (2008) considered penalized ML estimation of covariance and correlation matrices to obtain regularized estimates with a stable inverse for use in multivariate regression problems.

## PENALIZED REML ESTIMATION

Consider a simple animal model for *q* traits, **y** = **Xb** + **Zg** + **e** with **y**, **b**, **g**, and **e** the vectors of observations, fixed effects, and additive genetic and residual effects, respectively, and **X** and **Z** the corresponding incidence matrices. Let and denote the matrices of additive genetic and residual covariances among the *q* traits, and let with **A** the numerator relationship matrix between individuals and , with **R**_{k} the submatrix of corresponding to the traits recorded for the *k*th individual and is the direct matrix sum. This gives Var(**y**) = **ZGZ′** + **R** = **V** and the REML log likelihood is, apart from a constant,(4)with **X**_{0} a full-rank submatrix of **X** (*e.g.*, Harville 1977).

Assuming and are unstructured, we have *q*(*q* + 1) variance parameters to be estimated. Maximization of Equation 4 with respect to the elements of and represents a constrained optimization problem, as estimates need to be in the parameter space, *i.e.*, cannot have negative eigenvalues. Hence implementations of REML estimation often employ a parameterization to a scale that does not require constraints (Pinheiro and Bates 1996), *e.g.*, estimating the elements of the Cholesky factor of a covariance matrix rather than the covariance components directly.

A natural alternative in our context is a parameterization to the elements of the canonical decomposition of and : Let and so that for **T** = **Q**^{−1}, and . The elements of are the eigenvalues of and **T** is the corresponding matrix of eigenvectors premultiplied by the matrix square root of . This yields *q* parameters λ_{i} and *q*^{2} elements (*t _{ij}*) of

**T**to be estimated. Eigenvalues λ

_{i}can be interpreted as heritabilities on the canonical scale and are thus constrained to the interval [0, 1]. Again, these constraints can be removed through a suitable further reparameterization,

*e.g.*, by estimating log(–log λ

_{i}) instead of λ

_{i}.

Our review above has identified that modification of the canonical eigenvalues of the “between” and “within” matrices of MSCP in a one-way classification can provide improved estimators of the corresponding covariance matrices and that manipulation of these eigenvalues is equivalent to modifying the canonical eigenvalues of and . Furthermore, we have shown that regressing the eigenvalues of a matrix toward their mean yields a shrinkage estimator that is a weighted combination of the matrix and a multiple of the identity matrix and that such estimators can be obtained by penalized maximum likelihood with a quadratic, ℓ_{2}-type (*cf.* Equation 3) penalty. Using these findings, we propose to implement the equivalent to bending within our REML animal model analysis by replacing in Equation 4 by(5)for . This directly mimics the suggestion of Hayes and Hill (1981) as the quadratic penalty provides a linear shrinkage of all λ_{i} toward their mean. For ψ = 0, reduces to , and for ψ → ∞ Equation 5 yields a model in which all λ_{i} are constrained to be equal.

Equation 5 is readily extended to other types of penalties. For instance, rather than shrinking toward the arithmetic mean , we could use the geometric or harmonic mean (Ye and Wang 2009). An analog of the log-posterior shrinkage estimator of Daniels and Kass (2001) could be obtained by transforming eigenvalues λ_{i} to logarithmic scale. More generally, the shrinkage could be modified by applying a Box-Cox transformation, *i.e*., by replacing λ_{i} with ( – 1)/γ for γ > 0 or log λ_{i} for γ = 0. Alternatively, we might consider replacing the exponent of *p* = 2 with a general value or using a combination of penalties.

For the parameterization to the elements of the canonical transformation, derivatives of are straightforward, and standard REML algorithms, such as expectation-maximization or the so-called average information algorithm (see Thompson *et al*. 2005, for a recent review and references), are readily adapted. Derivatives required are summarized in the appendix. One drawback of this parameterization, however, is that (first) derivatives of both and with respect to all *q*(*q* + 1) parameters to be estimated are nonzero. This implies that computational requirements per iterate are increased compared to the usual implementations. However, this increase is relatively small and pales to insignificance compared to extra computations required when the tuning parameter is to be estimated using cross-validation.

## SIMULATION

#### Method:

A simulation study was carried out considering *q* = 5 traits, 11 sets of genetic parameters, and two types of penalties. Without loss of generality, population values chosen were different combinations of canonical heritabilities (λ_{i}). As discussed by Hill and Thompson (1978), these can represent a wide range of constellations of heritabilities and genetic and environmental correlations. Population values selected differed in both the average level of heritability () and the spread of the λ_{i} about their mean. Values for scenarios *A*–*K* are summarized in Table 1. To understand the effects of the pedigree structure, two contrasting designs were examined. Simulation I comprised a classic, balanced paternal half-sib design with 500, 200, or 100 sires with 10 progeny each. Simulation II considered 125 or 50 unrelated families, using the design of Bondari *et al*. (1978): Each family involved records on two pairs of full-sibs in generation 1, with one male and one female per pair. In generation 2, two paternal half-sibs of different sex were mated to unrelated individuals, recording two offspring per mating. This yielded records for 8 individuals per family, which provided nine different types of covariances between relatives; see Thompson (1976) for a mating plan and list of covariances.

Matrices of MSCP (**B** and **W** for simulation I and the 40 × 40 matrix of MSCP pertaining to records for a family, 5 × 8, in simulation II) for each design and set of population values were sampled from central Wishart distributions as described by Odell and Feiveson (1966). Estimates of genetic and environmental covariance components were obtained using the canonical parameterization as described above and a simple derivative-free optimization procedure to locate the maximum of the (penalized) likelihood function, . Estimation constrained values of λ_{i} to the range of 0.00001–0.99999. Quadratic penalties on the deviation of the canonical heritabilities from their mean were imposed either on the eigenvalues directly or on values transformed to logarithmic scale. A range of values for ψ (0–1.8 in steps of 0.2, 2–4.5 in steps of 0.5, 5–99 in steps of 1, 100–248 in steps of 2, 250–495 in steps of 5, and 500–1000 in steps of 10; 289 in total) were considered.

For simplicity, the appropriate tuning parameter for each replicate was estimated by sampling additional validation data, as done, for instance, by Levina *et al*. (2008) and Rothman *et al*. (2008, 2009), rather than by cross-validation. This involved sampling 100 sets of matrices of MSCP for each replicate, for the same sample size and population parameters as the training data. Let denote the vector of parameter estimates for a replicate and the *r*th value of the tuning parameter, ψ_{r}. For each vector of estimates (*r* = 1, 289) and each of the additional data sets, **y**_{l} (*l* = 1, 100), the value of the unpenalized likelihood, , was calculated. The estimate of the tuning parameter to be used for this replicate, , was then chosen as the value of ψ_{r} for which the average likelihood across validation sets,was maximized. A total of 10,000 replicates were carried out for each scenario examined.

As suggested by Lin and Perlman (1985), the effect of penalized estimation was summarized as percentage reduction in average loss (PRIAL), calculated aswith the standard, unpenalized REML estimate and the penalized estimate, for *X* = *G*, *E*, and *P* and the entropy loss as defined in Equation 1 averaged over replicates.

#### Results:

The effect of sampling variation and penalization on estimates of canonical heritabilities is illustrated in Figures 1 and 2. As expected from theory (*e.g.*, Lawley 1956), bias in unpenalized estimates increased markedly with decreasing spread in the population values and decreasing sample size. Patterns for both designs were similar. For scenarios with equal population values (*A* and *G*), penalization dramatically reduced the bias in estimates with the small remaining bias in the same direction as for unpenalized estimation. For the other cases penalization appeared to overcompensate somewhat, resulting in a bias in the opposite direction for the extreme values (λ_{1} and λ_{5}), *i.e.*, yielded estimates of the largest values that were biased downward and estimates of the smallest values that were biased upward.

Imposing a penalty on the logarithmic scale tended to give estimates of λ_{1} that were less biased than for penalization on the original scale but, in turn, yielded larger upward bias in estimates of λ_{5} in most cases. Overshrinkage, in particular of the smallest eigenvalues, when population values are far apart has been observed previously and has been attributed to the nature of the quadratic penalty used (Daniels and Kass 2001). While the upward bias due to penalization in the estimate of the smallest eigenvalue for case *H* (and similarly cases *D* and *F*) may look disconcertingly large (see Figure 2), it should borne in mind that the corresponding population value was 0.1 so that, in absolute terms, an upward bias of 30–60% was still relatively small and, as discussed below, penalized estimation resulted in a substantial reduction in loss in .

Penalization has the potential to reduce sampling variation for variables on the original scale dramatically. Furthermore, it has to be emphasized that the systematic bias in estimates of the heritabilities on the canonical scale does not automatically translate into a corresponding bias in estimates of the usual genetic parameters. Figure 3 shows the effect of penalized estimation on estimated heritabilities and genetic correlations and their sampling variances for scenarios *A* and *D* (Bondari's design with 125 families). For case *A*, there were very few replicates with estimates at the boundary of the parameter space, and all parameters were thus estimated without notable bias while variation in estimates was reduced by orders of magnitude. For case *D*, with a substantial spread in population heritabilities, however, penalized estimation caused the largest heritabilities to be biased downward and the smallest values to be biased upward, corresponding to the overshrinkage of eigenvalues evident in Figure 2. The reduction in sampling variances achieved for this scenario was less (but still substantial for most parameters) and largest for correlations between traits with low heritabilities.

Penalization had very little effect on the estimates of eigenvectors, with only a slight increase in the average angle between true and estimated vectors apparent. Hence, the reduction in risk achieved, summarized in Tables 1 and 2, is a direct reflection of the effects of penalties on the estimates of the canonical heritabilities. Average tuning factors decreased with increasing spread in the population eigenvalues and were markedly lower for penalties applied on the logarithmic scale. For most scenarios, values for tended to increase with decreasing sample size; *i.e.*, as to be expected, more severe penalties needed to be imposed as fewer data were available and sampling variation increased. The main exception to this pattern was for the cases with identical population values (*A* and *G*). For these, was equal to the maximum value considered (1000) for the more replicates the larger the sample size. This implies that the average value was somewhat distorted by the upper limit on imposed, with the decrease reflecting more variability for smaller samples.

Risks for were largest for scenarios with a wide spread of roots. For reasonable sample sizes, penalized estimation reduced the average loss in throughout, with reductions increasing as the spread in population values decreased. Penalties on the logarithmic scale appeared most advantageous for scenarios with one large eigenvalue and the remaining values close together (*E*, *I*, *J*, and *K*). For constellations with a large spread in λ_{i} (*D* and, to a lesser extent, *F*) penalization increased the loss in in up to a third of replicates; on average, though, there was a reduction in risk.

Relatively small PRIALs for for cases with the largest population value close to unity (*F* and *K*) reflected, in part at least, the effects of constraints on the parameter space that decreased the scope for penalization to reduce risk. While constraints biased the average of across replicates only slightly (depending on the scenario, by <4% up- or downward), effects for individual replicates may have been larger, resulting in attempts to penalize deviations from a less appropriate estimate of the mean than we may wish for. Additional simulations (not shown here) yielded a higher PRIAL for cases *D*, *F*, *I*, *J*, and *K* when replacing (original scale) in Equation 5 with the corresponding harmonic mean.

Again, the pattern of results for the two designs was comparable, suggesting that bending is just as effective in a complex pedigree as in the paternal half-sib design it was originally suggested for. Values of PRIAL for the same sample size (100 sires and 125 families in simulations I and II, respectively) were generally smaller for Bondari's design. This was accompanied by smaller values for ; *i.e.*, with numerous covariances between relatives the same number of observations provided more information so that the effects of sampling variations were less and penalized estimation had somewhat less impact.

## APPLICATION

Application of the procedure suggested is illustrated with data for carcass measurements of beef cattle. This is a typical example of traits considered in livestock improvement schemes that are difficult and expensive to record but play a major role in breeding programs. Data were collected from abattoirs under a meat quality research project and have been analyzed previously; see Reverter *et al*. (2000) for details.

A total of six traits recorded on up to 1796 animals were considered, with 916, 1524, 1796, 1796, 1784, and 1672 records for traits 1–6 (retail beef yield, percentage intramuscular fat, rump fat depth, carcass weight, rib fat depth, and eye muscle area), respectively. Only 44% of individuals had all six traits recorded. All records were preadjusted for differences in age at slaughter or carcass weight as described in Reverter *et al*. (2000). Animals in the data were the progeny of 130 sires and 932 dams, and no parents had records themselves. Adding pedigree information yielded an additional 3105 animals to be included, *i.e.*, a total of 4901 in the analysis. Data and pedigrees are available in the supporting information, File S1.

The model of analysis was a simple animal model, fitting animals' additive genetic effects as random effects. The only fixed effects fitted were those of “contemporary groups” (CG) that represented a combination of herd of origin, sex of animal, date of slaughter, abattoir, finishing regime, and target market subclasses, with up to 282 levels per trait. Estimates of genetic and environmental covariance matrices were obtained by REML, using an “average information” algorithm followed by a derivative-free search to ensure the maximum of the likelihood had been located with reasonable accuracy. Both a standard multivariate analysis and analyses imposing a penalty on the squared deviation of the canonical heritabilities from their mean as described above were carried out.

The tuning parameter ψ was estimated using 10-fold cross-validation, as described above. This required splitting the data into 10 subsets. To avoid problems arising from dividing small CG subclasses in doing so, data were split by sequentially assigning all animals in a CG (for trait 4) to a subset, processing CGs in order of size. For example, subset 1 consisted of records on all animals in CGs 1, 11, 21, …, subset 2 comprised animals in CGs 2, 12, 22, and so forth. For each of the 10 folds, the *i*th subset was designated the validation set and the remaining 9 subsets were merged to form the training data. Estimates of covariance components were then obtained from the training data for a range of tuning parameters, and the corresponding values for the (unpenalized) REML likelihood of these estimates in the validation data were calculated. Initially values of ψ = 0, 1, 2,…, 20 and ψ = 25, 30,…, 100 were considered, and, in a second pass, all values between 20 and 35 in steps of 1 were evaluated; *i.e.*, cross-validation involved 48 × 10 analyses and likelihood evaluations. The value of ψ then chosen as “best” was that for which the average of the 10 validation set likelihood values was highest.

#### Results:

Estimates of canonical heritabilities from a standard, unpenalized analysis together with their approximate standard errors (derived from the inverse of the average information matrix at convergence) were 0.89 ± 0.14, 0.54 ± 0.10, 0.38 ± 0.09, 0.24 ± 0.09, 0.14 ± 0.07, and 0.03 ± 0.05, with a mean of 0.37. Conducting a simulation study, corresponding to simulation II above with 125 families, for six traits (measured on all individuals) with canonical heritabilities of 0.8, 0.5, 0.4, 0.3, 0.2, and 0.1 yielded an average estimate of the tuning parameter of 34, with PRIAL of 19% for and of 39% for , respectively. In line with these results, cross-validation yielded an estimate for the tuning parameter of . Corresponding estimates of canonical heritabilities from the penalized analysis were 0.69 ± 0.11, 0.50 ± 0.09, 0.38 ± 0.08, 0.27 ± 0.08, 0.17 ± 0.07, and 0.05 ± 0.05, with a mean of 0.34. The likelihood for this set of estimates was reduced by 1.32 compared to the value from the unpenalized analysis; *i.e.*, penalization for such relatively mild penalty did not decrease the likelihood significantly even though the estimate of λ_{1} was reduced by >20%.

Resulting estimates of heritabilities (on the original scale) and correlations from the two analyses are contrasted in Figure 4. On the whole, there was good agreement between analyses with most penalized estimates well within the range of plus/minus one standard deviation from their unpenalized counterparts. Penalization reduced the estimates of the higher heritabilities and slightly increased the lowest value. In addition, it tended to reduce the magnitude of higher (absolute value) estimates of genetic correlations somewhat. Reassuringly, changes were largest for trait 1, the trait with the smallest number of records. In particular, an estimate of the environmental correlation between traits 1 and 4 was reduced from 0.82 to 0.63.

## DISCUSSION

Accurate multivariate estimation of genetic covariance matrices is a longstanding problem. Mixed-model-based estimation, considering more than just a few traits and fitting the so-called animal model to accommodate complex pedigrees, has become feasible on a routine basis, due to advancement in computing facilities and improvements in software available. However, problems associated with substantial sampling variation inherent in multivariate estimation, especially for relatively small data sets, remain. In particular, the fact that large eigenvalues tend be biased upward while small eigenvalues tend to be biased downward is generally given little consideration. Emphasis on unbiased estimation of breeding values has fostered a corresponding preference for unbiased methods of estimation, often ignoring the fact that standard methods such as REML are biased as they require estimates to be within the parameter space, *i.e.*, constrain estimates of covariance matrices to be positive (semi-) definite.

#### Regularization:

Our review has shown that trading additional bias against a lower statistical risk in the estimation of covariance matrices is a well-established practice. The literature ranges from theoretical studies, which are predominantly focused on establishing that certain classes of improved estimators dominate over others, to applications that demonstrate that using regularized estimates of covariance matrices in regression problems, discriminant analyses, or portfolio estimation results in more reliable estimates or statistical tests. In a quantitative genetic context, an early form of regularization—though not labeled as such—has been suggested in the form of bending and has been shown to improve the achieved response to selection based on indexes derived using regularized estimates of genetic covariance matrices (Hayes and Hill 1981).

We propose to implement the equivalent to bending in REML analyses fitting an animal model by penalizing the corresponding log likelihood, with the penalty term proportional to the sum of squared deviations of the canonical heritabilities from their mean. Our simulation results demonstrate the statistical risks associated with standard REML estimates of covariance matrices and show that these can be dramatically reduced using penalized estimation. On a relative scale, penalization is most effective when the population eigenvalues are close together, which is the scenario when sampling variances in estimates of eigenvalues are largest. However, mean risks increase considerably as the true roots are spread farther apart, so that a proportionally much smaller reduction for these cases can still represent a substantial decrease in absolute values.

Analyses examining the eigenvalues of estimated genetic covariance matrices usually show that a substantial proportion of the total genetic variance is explained by the leading principal components, with genetic eigenvalues declining in an approximately exponential fashion (Kirkpatrick 2009). Corresponding canonical heritabilities have not been examined in a methodical fashion. While the pattern in genetic eigenvalues does not imply that the eigenvalues of follow suit, our applied example suggests that practical cases with a relatively large spread may not be unusual.

Clearly, an alternative to penalized (RE)ML is Bayesian estimation where regularization is implicit through the prior distributions specified. While such analyses have become a standard in quantitative genetics, uninformative priors appear to be used more often than not, at least in variance component estimation; *i.e*., “only lip service is paid to the Bayesian paradigm” (Thompson *et al*. 2005, p. 1471). This demonstrates that specification of suitable prior distributions or of the associated hyperparameters is often not all that straightforward. Hence penalized REML may provide an easier option in practice.

#### Tuning parameter:

While penalized estimation is appealing, it requires a decision on the strength of the penalty to be applied, *i.e.*, identification of a suitable tuning parameter, ψ. As outlined, cross-validation (*K*-fold or using random splits) is a widely used technique to determine this quantity from the data at hand and is applicable in our case. However, it is a laborious procedure that can increase computational requirements of penalized estimation by orders of magnitude, compared to standard, nonpenalized estimation. Furthermore, for data with a genetic family structure and usually affected by one or more cross-classified fixed effects, representative subsampling can be challenging, especially for small samples. There appears to be a paucity of literature addressing this problem or examining other strategies such as bootstrapping.

An alternative may be to choose a penalty factor *a priori*, analogous to the choice for the degree of confidence to be made when specifying prior distributions in Bayesian estimation. Earlier, Hayes and Hill (1981) suggested bending on sample size alone. Our results indicate that this choice should consider pedigree structure and the spread in canonical heritabilities in addition. Of course the latter is unknown, and we again need to have some prior knowledge, perhaps on the basis of literature results, or try to assess this from the estimates obtained from a standard, nonpenalized analysis. Further work is required to see whether suitable rules of thumb can be established. Results dating back to James and Stein (1961) show that for a quadratic penalty, any amount of shrinkage (ψ > 0) in a certain range reduces the mean square error, with the range determined by the (co)variances of the standard estimator and the shrinkage target; see, for instance, Opgen-Rhein and Strimmer (2007) for a lucid outline. This suggests that a simple approach to exploit at least some of the benefits of penalized multivariate estimation may be to impose a relatively mild penalty, chosen on the basis of the number of traits considered, sample size, and data structure and any information we may have on the spread of the eigenvalues. For this, “mild” might be defined as a value of ψ for which the unpenalized likelihood corresponding to the penalized estimates differs does not differ substantially from the maximum of for ψ = 0.

#### Open questions:

The purpose of this article has been to introduce the concept and demonstrate the utility of regularized estimation for the estimation of genetic parameters, considering multivariate, animal model-type analyses and (restricted) maximum likelihood. A number of problems remain for further research. These include technical issues, such as extensions to models with additional random effects and random regression analyses, the best algorithm to be used to maximize the penalized likelihood, and which cross-validation strategy to use. In addition, there are more fundamental questions to be addressed, such as the most appropriate definition of risk for particular scenarios or the use of alternative prior distributions and penalties.

## CONCLUSIONS

Penalized estimation of genetic covariance matrices can reduce the deviation of estimates from population values by reducing the inherent overdispersion in sample roots, producing “better” estimates. It is recommended for multivariate analyses comprising more than a few traits and small samples to make the best possible use of limited and precious data.

## APPENDIX

Let represent a *q* × *q* matrix with *ij*th element of unity and zero otherwise. The nonzero derivatives needed to adapt a standard REML algorithm to the canonical parameterization and penalized estimation for are given in the following.

#### First derivatives:

#### Second derivatives:

## Acknowledgments

The Animal Genetics and Breeding Unit, University of New England, is a joint venture with Industries & Investment, New South Wales. This work was supported by Meat and Livestock Australia under grant BFGEN.100B (to K.M.); M.K. is grateful for support from National Science Foundation grant DEB-0819901 and the Miller Institute for Basic Research.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.113381/DC1.

Communicating editor: J. Wakefield

- Received December 20, 2009.
- Accepted April 25, 2010.

- Copyright © 2010 by the Genetics Society of America