## Abstract

An analysis of litter size and average piglet weight at birth in Landrace and Yorkshire using a standard two-trait mixed model (SMM) and a recursive mixed model (RMM) is presented. The RMM establishes a one-way link from litter size to average piglet weight. It is shown that there is a one-to-one correspondence between the parameters of SMM and RMM and that they generate equivalent likelihoods. As parameterized in this work, the RMM tests for the presence of a recursive relationship between additive genetic values, permanent environmental effects, and specific environmental effects of litter size, on average piglet weight. The equivalent standard mixed model tests whether or not the covariance matrices of the random effects have a diagonal structure. In Landrace, posterior predictive model checking supports a model without any form of recursion or, alternatively, a SMM with diagonal covariance matrices of the three random effects. In Yorkshire, the same criterion favors a model with recursion at the level of specific environmental effects only, or, in terms of the SMM, the association between traits is shown to be exclusively due to an environmental (negative) correlation. It is argued that the choice between a SMM or a RMM should be guided by the availability of software, by ease of interpretation, or by the need to test a particular theory or hypothesis that may best be formulated under one parameterization and not the other.

MIXED linear models (Henderson 1984) are broadly used to predict breeding values and to estimate variance components for traits of interest in livestock and plant breeding and play an important role in evolutionary and theoretical quantitative genetics (Lande 1979; Cheverud 1984; Walsh 2003). In genetic improvement programs, the objective of selection includes typically several correlated traits. The classical approach for a multiple-trait analysis is to use models posing that the nature of the correlation between response variables (phenotypes) is due to linear associations between unobservables, such as additive genetic values or nongenetic sources, like permanent or temporary environmental effects.

Structural equation models represent an extension of the standard linear model to account for links (feedback and/or recursiveness) involving either the phenotypes directly or latent variables; they are well established in econometrics and sociology (Goldberger 1972; Jöreskog 1973; Duncan 1975). These models were discussed in the early genetics literature by Wright (1921) but this work has not received much attention in quantitative genetics. Recently, Xiong *et al.* (2004) proposed the use of structural equation models for modeling and identifying genetic networks. In a quantitative genetics context, Gianola and Sorensen (2004) studied the consequences of the existence of simultaneous and recursive relationships between phenotypes on genetic parameters and presented statistical methods for inference. A recent application to study the relationship between somatic cell score and milk yield in goats is in de los Campos *et al.* (2006). Here we are concerned with an illustration of the implementation of structural equation models for the analysis of litter size and average litter weight in two breeds of Danish pigs.

Litter size is an important trait in pig genetic improvement programs (Rothschild and Bidanel 1998) and there is now convincing evidence that it has responded successfully to selection (*i.e.*, Sorensen *et al.* 2000; Noguera *et al.* 2002). Several studies have also reported negative associations between litter size and individual birth weight (Kerr and Cameron 1995; Roehe 1999; Sorensen *et al.* 2000). Further, Sorensen *et al.* (2000) report an increase in the proportion of piglets born dead at higher litter size values.

Litter size is basically determined by ovulation rate and embryo mortality (Blasco *et al.* 1995); these processes take place mainly at the early stages of gestation. Piglet weight at birth is mostly determined by growth in late gestation. One could then postulate a one-way causal path establishing an effect of litter size on piglet weight at birth. This specification defines a recursive two-trait system. On the other hand, simultaneity occurs when trait 1 affects trait 2 and vice versa.

The objective of this study is, first, to show that recursive models can be interpreted as alternative parameterizations of standard linear models. We discuss identifiability of dispersion parameters, a topic that is intimately connected to the possibility of drawing inferences from the various parametric forms of a given model. Second, we address the statistical problems involved in deciding whether the association between traits is mediated by additive genetic and/or environmental covariances or via recursion only. The results are illustrated using data on litter size and average litter weight in pigs.

## MATERIALS AND METHODS

#### Data:

Data from two breeds were analyzed: Landrace and Yorkshire. The traits analyzed were total number born per litter and average litter weight at birth (referred to as litter size and average piglet weight, hereinafter). The Landrace data set included 5178 litter size records and a pedigree file of 8800 individuals. The raw means for litter size and average piglet weight were 14.23 piglets and 1.36 kg., respectively, with standard deviations 3.62 piglets and 0.35 kg. The Yorkshire data set consisted of 3938 litter size records and a pedigree file of 7143 individuals. The raw means for litter size and average piglet weight were 13.01 piglets and 1.30 kg., respectively, with standard deviations 3.40 piglets and 0.22 kg. The raw correlations between traits were 0.01 in Landrace and −0.43 in Yorkshire.

Piglet weight at birth is strongly genetically determined by maternal effects (Grandinson *et al.* 2002), and, as a consequence, average piglet weight (as well as litter size) was considered a trait of the sow.

#### Models and likelihoods:

A description is provided of a standard mixed model (SMM) and a recursive mixed model (RMM). The SMM postulates the following linear structures for *y _{Lij}* (subscript

*L*represents litter size) and

*y*(subscript

_{Wij}*W*represents average piglet weight) of the

*j*th pair of records from female

*i*,(1a)(1b)where (

*k*=

*L*,

*W*) is the appropriate row of a known incidence matrix,

**b**

_{k}is a vector containing effects of herd years, seasons, and parity number,

*u*is an additive genetic effect of individual

_{ki}*i*,

*p*is a permanent environmental effect of individual

_{ki}*i*, and

*e*is a residual effect (the lengths of the vectors of additive genetic effects and data are different, but to simplify notation, it is assumed throughout that after an appropriate relabeling, a common subindex

_{kij}*i*can be used for

*y*,

*u*, and

*p*).

The following distributions were assigned to the location parameters:(2)Above, **I** is the identity matrix (of appropriate order),(3)and(4)The terms and (*x* = *u*, *p*; *m* = *L*, *W*) in (3) and (4) are variance and covariance components associated with the distribution of additive genetic effects (*x* = *u*) and permanent environmental effects (*x* = *p*) for litter size and for individual piglet weight.

A possible approach to modeling the residual term **R**_{ij} is as follows. Assume that the residual terms for individual piglet weight at birth that contribute to a given average piglet weight are conditionally normally and independently distributed, given litter size, with residual variance , where is the residual component of variance of individual piglet weight at birth and is the residual correlation between litter size and individual piglet weight at birth. Also assume that the residual terms for litter size are normally distributed with variance . Then the marginal (with respect to litter size) residual covariance between two individual piglet weight at birth records is and the residual covariance matrix is equal to(5)

In (5), the off-diagonal term , and *n _{ij}* is the known number of records contributing to the average piglet weight of female

*i*in parity

*j*. There are three identifiable parameters in (5). This residual dispersion matrix can also be written as(6)where is the residual regression of individual piglet weight at birth on litter size. Matrix

**R**

_{ij}is positive definite since . The residual covariance matrix (5) for

*n*= 1 is denoted by

_{ij}**R**.

The heritabilities for the two traits are(7)and the coefficients of correlation are(8)

Writing **y**_{ij} = (*y _{Lij}*,

*y*)′, Equations 1 can be expressed as(9)whereIt follows that the sampling model for

_{Wij}**y**

_{ij}is the Gaussian process(10)and the contribution to the likelihood by

**y**

_{ij}is(11)

The RMM assumes the following linear relationships between the *j*th pair of records from individual *i* and location parameters,(12a)(12b)where λ is the recursive parameter. The first term in the right-hand side of (12b) indicates that, according to the model, average piglet weight is linearly related to the deviation of litter size from its group mean, and the strength of this relationship is measured by λ. On the other hand, Gianola and Sorensen (2004) postulate recursiveness or simultaneity between traits involving the observed phenotypes, rather than the unobserved deviations. We return to this point in the discussion.

The system defined by (12) can be retrieved subtracting the mean on both sides of (9) and multiplying by **Λ**, to get(13)The reduced form of (13) is(14)which is the same as (9), whereandIt follows from the Gaussian form of the distributions (2) that(15)where(16)Therefore the sampling model for **y**_{ij} under the RMM is the Gaussian process(17)and the contribution to the likelihood by **y**_{ij} is(18)If λ were known this is the same likelihood as (11) due to the one-to-one relationship(19)However, with unknown λ, the left-hand side of (19) contains 10 parameters and the right-hand side 9. There are thus an infinite number of matrices involving the left-hand side of (19) that satisfy the equality, for any given **G** + **P** + **R**_{ij}. In other words, disregarding identifiability at the level of the mean for both models, the RMM as defined above generates an unidentifiable likelihood.

#### Likelihood identification under the SMM and the RMM:

The subject of identifiability of the SMM and the RMM at the level of the mean is well known (*e.g.*, Searle 1971) and is not discussed. In likelihood (11) of the SMM there are nine identifiable dispersion parameters associated with **G**, **P**, and **R**_{ij}. This model with nondiagonal covariance matrices for *u*, *p*, and *e* is labeled SMM_{upe}.

The RMM has an extra parameter, and a constraint needs to be introduced to achieve identification. One possible constraint is to assume that the phenotypic covariance on the recursive scale is zero. That is, denoting the mean of *y _{L}* by μ

_{L},(20)This places the following interpretation on λ,(21)the phenotypic regression of average litter weight on litter size. Expanding (19) it is easy to show that the constraint (20) guarantees a one-to-one relationship between the dispersion parameters of the RMM and those of the SMM

_{upe}and the likelihoods become equivalent. In this setting the RMM subject to the chosen constraint and the unconstraint SMM

_{upe}are two different identifiable parameterizations of the same likelihood model.

From the point of view of a likelihood analysis, inferences on the recursive scale can be obtained by fitting the SMM_{upe} and transforming the estimated parameters appropriately, and vice versa. However, it is not statistically meaningful to ask whether the data have been generated by the SMM_{upe} or by the recursive process described by the RMM subject to constraint (20), since both specifications lead to the same likelihood.

#### Generating an identifiable likelihood model to address the nature of the relationship between traits:

Here we present a statistically meaningful way to address the question whether the data have been generated by a recursive mechanism.

The starting point is the SMM defined in (3), (4), (5), and (9) but with a diagonal matrix for all the dispersion structures; that is,(22)(23)and(24)The contribution to the likelihood by the pair of records **y**_{ij} is(25)There are six dispersion parameters associated with this model (the covariance matrices of *u*, *p*, and *e* have 0 off-diagonal elements), which is labeled SMM_{0}.

The RMM that is developed here postulates that the relationship between data and location parameters is now(26)where **u**_{i}, **p**_{i}, and **e**_{ij} are the same stochastic variables as in the SMM_{0} with covariance matrices (22), (23), and (24), and with(27)and similarly for Λ_{p}, Λ_{e}, , and . Note that the **Λ**'s in (26) have the same structure as the **Λ**^{−1} in (14). Contrary to the generation of recursion in (13), the recursive model defined by (26) is not obtained by a linear transformation of the SMM and the two models lead to different marginal (with respect to random effects) distributions of the data. The linear structure specified by (26) and (27) has an interesting property: the components of average litter weight , *z* = *u*, *p*, *e*, have a term independent of litter size and a component dependent on litter size.

The sampling model for **y**_{ij} is(28)and the contribution to the likelihood from **y**_{ij} is(29)where , , and . This form of recursive model is labeled RMM_{upe}. There are nine identifiable parameters in the dispersion matrix of this likelihood and when λ_{u} = λ_{p} = λ_{e} = 0 (or when Λ_{u} = Λ_{p} = Λ_{e} = *I*), likelihood (29) is equal to (25). A comparison between RMM_{upe} and RMM with λ_{u} = λ_{p} = λ_{e} = 0, which is labeled RMM_{0}, is to jointly test whether or not there is recursion at the level of the unobservable additive genetic values, permanent environmental and environmental effects. Alternatively, since likelihoods (29) and (11) are equivalent [it is easy to see that, for example, and ] the comparison can be interpreted as testing whether or not the covariance matrices of the random effects of the SMM_{upe} have a diagonal structure. Indeed, note that(30)(31)(32)The bottom diagonal element in (32) is very similar to the corresponding element in (6). However, when the trait is not average (that is, when *n _{ij}* = 1), the second term in the bottom diagonal element of (6) vanishes. Since, for example, , by inspection of (30), (31), and (32) with (3), (4), and (6) it is obvious that the β's under the SMM are identical to the 's in the RMM.

We also need(33)which is matrix for *n _{ij}* = 1. When λ

_{u}= λ

_{p}= λ

_{e}= 0, the above covariance matrices become equal to (22), (23), and (24).

Under the RMM_{upe}, the heritability of average litter weight for *n _{ij}* =

*n*

_{T}for all

*i*,

*j*is defined as(34)

#### Prior and posterior distributions:

For the RMM_{upe}, the joint prior distribution of all parameters is assumed to admit the factorization(35)where **u**^{•} is the vector that contains the pairs for all individuals in the pedigree, and **p**^{•} is the vector that contains all permanent environmental effects of females with records. The vector **b** is allocated an improper uniform distribution and vectors **u**^{•} and **p**^{•} are assumed to be normally distributedwhere **A** is the known additive genetic relationship matrix, andThe 2 × 2 matrices **G**^{•}, **P**^{•}, and **R**^{•} follow inverse Wishart distributionswhere the hyperpriors , , and are known matrices of dimension 2 × 2 and the *v*'s are known degrees of freedom. The conditional density for the total data is equal to(36)where Σ^{•} is a block diagonal with blocks associated with each pair of records **y**_{ij}.

The posterior distribution of the RMM_{upe}, up to a proportionality constant, is obtained by multiplication of the joint prior (35) by (36), giving(37)which is also the posterior distribution of SMM_{upe}, the standard two-trait mixed model with nondiagonal covariance matrices associated with all the random effects. Inferences based on RMM_{upe} can be drawn from the posterior distribution (37) and the recursive parameters can easily be constructed from (30), (31), and (33),(38)(39)and(40)A variety of submodels can be generated either by assuming some or all of the λ's equal or by setting some of them equal to zero.

#### Implementation:

If the number of piglets born was the same for all litters, *n*_{T}, say, then , where denotes the residual covariance matrix (32) with *n _{ij}* replaced by

*n*

_{T}. In this case, the structure of in (37) simplifies considerably. To take advantage of this simplification in the computations one can augment the piglet weight data with the so-called missing single records , so that

*n*=

_{ij}*n*

_{T}for all

*ij*, where

*n*

_{T}is the largest number of records contributing to average piglet weight in the data set. This technique is known as data augmentation (Tanner and Wong 1987) and the general idea is as follows. Given observed data

*y*and a model indexed by parameters θ, the posterior distribution is proportional to . When the model is fitted using MCMC, drawing samples from this posterior distribution may be computationally demanding. However, it may be easy to draw samples fromwhere

*y*

^{mis}stands for the missing data. The strategy requires generating

*y*

^{mis}from . In the present case, is generated fromwhere θ is the vector of all parameters indexing the model.

After a little experimentation, a length of the Gibbs chain equal to 1 million was chosen. In Tables 1 and 2 we report Monte Carlo standard errors of estimates of various posterior means to give an idea of the accuracy of the Monte Carlo computations.

#### Model testing:

Checking for systematic differences between a given model and the observed data discloses the quality of fit of the posed model. An attractive way to study the fit of a model is to use posterior predictive model checking (Gelman *et al.* 1996, 2004). The approach is simple to implement, is flexible, and provides a graphical exploration of residual-type diagnostics. The key feature is the construction of the so-called discrepancy measures that describe particular putative features of the data that the model may fail to account for. To be more specific, consider testing for the presence of recursion at the level of permanent environmental effects. Absence or presence of recursion at the level of additive genetic effects or residuals is studied in a similar way. Let (*y _{Lij}*,

*y*),

_{Wij}*i*= 1, 2, … , denote observed data and for parity

*j*= 1, define the discrepancy measure(41)the change of average piglet weight per unit change of permanent environmental effect associated with litter size. In (41), the sum is over all females with first parity records, and is the average

*p*across females. If the observed data had been generated under RMM

_{Li}_{0}one would expect a value of

*b*in the vicinity of zero. If parameters were known, one could compare the observed value of

_{p}*b*to its sampling distribution, with a significant difference indicating model failure with respect to the discrepancy measure. This is equivalent to simulating data ,

_{p}*i*= 1, 2, … , under the RMM

_{0}, if parameters were known, computing in each replicate, and deciding whether the observed value of

*b*is an atypical value in the distribution of . Specifically and in the current context, one is testing whether the null model RMM

_{p}_{0}is failing to account for a recursive mechanism present in the observed data.

Since parameters are not known, we use the idea of posterior predictive model checking (Gelman *et al.* 1996, 2004) and consider the posterior predictive distribution of . This distribution reflects uncertainty about the parameters that enter in the discrepancy measure (41) as well as sampling variation. Note that the parameters are inferred from the “null model” RMM_{0} that assumes absence of recursion. The presence of recursion, not accounted for by model RMM_{0} would result in a distribution of *b _{p}* − shifted from zero. This can also be construed as a test for a nonzero covariance between permanent environmental effects affecting litter size and those affecting average piglet weight. The exploration of recursion at the level of additive genetic effects and of residuals involves constructing

*b*− and

_{u}*b*− along the same lines.

_{e}Often the diagnostic results of posterior predictive model checking are apparent visually, as is the case in this work. Other times it can be useful to compute a posterior predictive *P-*value to see whether the results could have arisen by chance under the null model (Gelman *et al.* 1996, 2004). These can be very easily computed from the MCMC output.

## RESULTS

The familiar parameterization in a two-trait mixed-model analysis is based on model SMM_{upe}. We therefore show in Table 1 Monte Carlo estimates of posterior means and standard deviations for chosen parameters based on the SMM_{upe} for Landrace and Yorkshire. Due to the symmetry of all the posterior distributions referred to below, standard deviations rather than posterior intervals are reported. The numbers in the table indicate that there is a striking difference between the breeds, especially for the size and the sign of the correlation coefficients. For Landrace, a value in the vicinity of zero for all three correlation coefficients is in an area of high probability mass. For Yorkshire, only for the environmental correlation is the value of zero excluded in the 95% posterior interval.

Table 2 shows Monte Carlo estimates of posterior means and standard deviations for chosen parameters based on the RMM_{upe} parameterization for Landrace and Yorkshire. There is a one-to-one relation between the parameters of the RMM_{upe} and those of the SMM_{upe}. The conclusions based on the recursive parameters are the same as those based on the correlation coefficients from Table 1.

Figures 1 and 2 show the posterior predictive distribution of discrepancies *b _{u}* − ,

*b*− , and

_{p}*b*− for Landrace and Yorkshire generated under RMM

_{e}_{0}. For Landrace, the Monte Carlo estimates of the posterior means (posterior standard deviations) for the three discrepancy measures are 0.067 (0.280), 0.019 (0.020), and −0.000 (0.003), reflecting lack of recursion at all levels. There is therefore lack of evidence suggesting that there is conflict between the data and the null model RMM

_{0}, with respect to the feature described by the discrepancy measure. For Yorkshire, the corresponding numbers are −0.057 (0.052), −0.030 (0.019), and −0.034 (0.002), supporting recursion at the level of the residual term only, a feature of the data that the null model fails to account for.

## DISCUSSION

In a recent article, Gianola and Sorensen (2004) discussed the use of simultaneous equation models to analyze and interpret systems of traits that may be subject to feedback and recursive relationships. Here we report an application of a recursive mixed model for the analysis of litter size and average piglet weight in two breeds of Danish pigs. The recursive relationship defined by model (12) establishes that average piglet weight is linearly related to the deviation of litter size from its group mean. The traditional specification, like that in Gianola and Sorensen (2004), postulates that average piglet weight is linearly related to litter size, rather than to its deviation from the mean. The system defined by (12) is free of some identifiability problems at the level of parameters entering the mean that are common to both traits. It seems also appealing that deviations from a midvalue, rather then absolute values, exert an influence on average piglet weight. Ultimately, these are two different models and a way of discerning between them is by computing their posterior probabilities, in the light of the data. This was not studied in the present work.

The structure of the residual dispersion matrix (5) has three parameters and was arrived at assuming that the covariance between residuals for single weight measurements is . This leads to conditionally independent residuals, given litter size. A more general model would assume that the above covariance is , where is the residual correlation between single weight measurements. Then residuals are no longer conditionally independent, given litter size. The resulting residual covariance matrix between litter size and average litter weight has four identifiable parameters and retrieves (5) when . This model has also its recursive parameterization counterpart.

The saturated recursive model used in this work has nine identifiable dispersion parameters. A more parsimonious alternative with seven parameters postulates that the three recursive parameters λ_{u}, λ_{p}, and λ_{e} are equal. In general, the recursive parameterization can be an attractive approach to arrive at parsimonious models, especially in analyses involving many traits.

Special attention has been given here to identifiability at the level of the likelihood, despite the fact that inferences were based on posterior distributions. In principle, a Bayesian analysis with a nonidentifiable likelihood is possible if proper prior distributions are specified for all the parameters (Bernardo and Smith 1994). In fact, depending on the prior distributions, a Bayesian analysis with a nonidentifiable likelihood may result in Bayesian learning, in the sense that the posterior and prior distributions of the nonidentified parameters are different (see, for example, Sorensen and Gianola 2002, p. 543). However, an MCMC implementation of a Bayesian model with “barely” identified parameters can lead to poor inferences due to extremely slow convergence and very short effective chain lengths. Achieving identifiability of parameters at the level of the likelihood will always lead to Bayesian learning and in general to better behavior of the MCMC algorithm. However, there may be situations where the constraints needed for identifiability may restrict inferences, and an unconstrained model using a careful prior specification could be considered instead.

The analyses of Yorkshire and Landrace data lead to markedly different inferences; we are not disturbed by this result. The breeds are distinct in various behavioral, physiological, and anatomical traits, as well as in outward appearance. From a breeding point of view, in Landrace, changes in litter size should not lead to associated changes in average litter weight. In Yorkshire, a change in environmental deviation of litter size of 1 unit (for example, due to culling) should result in a temporary reduction of average piglet weight of 36 g. In neither breed should successful selection for litter size have a direct effect on average piglet weight.

There is a rich literature dealing with various transformations of the data or reparameterizations that can lead to computationally more tractable analyses of the multivariate linear model (for example, Meyer 1987; Jensen and Mao 1988; Quaas 1988; Ducrocq and Besbes 1993; Groeneveld 1994; Gelfand *et al.* 1995; Thompson *et al.* 1995; Ducrocq and Chapuis 1997). While the recursive model can be viewed in this framework, the focus of the present work is that a recursive model whose likelihood is identifiable is an alternative parameterization of a standard mixed model. The two models provide different interpretations of the results, but are statistically equivalent. There is a one-to-one relationship between the parameters entering the likelihood in both models. This applies also in principle to simultaneous equation models, which in general require a larger number of constraints to achieve identifiability. However, it may not always be easy to define the equivalent standard model, say, to a model involving complex simultaneous and recursive relationships among many traits. Ultimately, the choice of parameterization should be guided by the availability of software (in simple situations like in the present work), by ease of interpretation, or by the need to test a particular theory or hypothesis. The mathematical formulation of such a hypothesis may be more naturally accomplished using one parameterization and not the other.

## Acknowledgments

We are grateful to Gustavo de los Campos and Daniel Gianola for discussions and comments on an earlier version of this manuscript.

## Footnotes

Communicating editor: C. Haley

- Received June 18, 2007.
- Accepted August 16, 2007.

- Copyright © 2007 by the Genetics Society of America