help button home button Genetics J Exp Med
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Originally published as Genetics Published Articles Ahead of Print on September 15, 2006.

Genetics, Vol. 174, 1613-1624, November 2006, Copyright © 2006
doi:10.1534/genetics.106.060673

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
genetics.106.060673v1
174/3/1613    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Gianola, D.
Right arrow Articles by Simianer, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gianola, D.
Right arrow Articles by Simianer, H.

A Thurstonian Model for Quantitative Genetic Analysis of Ranks: A Bayesian Approach

Daniel Gianola*,{dagger},{ddagger},1 and Henner Simianer{dagger}

* Department of Animal Sciences, University of Wisconsin, Madison, Wisconsin 53706, {dagger} Institute of Animal Breeding and Genetics, Georg-August-University, 37075 Göttingen, Germany and {ddagger} Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, N-1432 Ås, Norway

1 Corresponding author: Department of Animal Sciences, University of Wisconsin, 1675 Observatory Dr., Madison, WI 53706.
E-mail: gianola{at}ansci.wisc.edu

Manuscript received May 12, 2006. Accepted for publication September 8, 2006.


    ABSTRACT
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
A fully Bayesian method for quantitative genetic analysis of data consisting of ranks of, e.g., genotypes, scored at a series of events or experiments is presented. The model postulates a latent structure, with an underlying variable realized for each genotype or individual involved in the event. The rank observed is assumed to reflect the order of the values of the unobserved variables, i.e., the classical Thurstonian model of psychometrics. Parameters driving the Bayesian hierarchical model include effects of covariates, additive genetic effects, permanent environmental deviations, and components of variance. A Markov chain Monte Carlo implementation based on the Gibbs sampler is described, and procedures for inferring the probability of yet to be observed future rankings are outlined. Part of the model is rendered nonparametric by introducing a Dirichlet process prior for the distribution of permanent environmental effects. This can lead to potential identification of clusters of such effects, which, in some competitions such as horse races, may reflect forms of undeclared preferential treatment.


LATENT variable models for describing mechanisms by which some continuous scale maps into ordered or unordered categories of response have a long tradition in quantitative genetics and psychometry. For example, the threshold-liability model for ordinal categories dates back to WRIGHT (1934), DEMPSTER and LERNER (1950), GRÜNEBERG (1952), and FALCONER (1965); see GIANOLA (1982) for a review. Extremal models were pioneered in psychology by THURSTONE (1927) and were adapted by BOCK and JONES (1968) to explain choices between unordered alternatives.

Suppose that an observation is an assignment into one of M mutually exclusive and exhaustive categories of response. The classical Thurstonian extremal model postulates the existence of some latent or unobserved continuous valued vector Formula, such that category i is observed when yi is larger than the other M – 1 elements of the vector. The probability of observing an assignment into category M, say, is given by

Formula
where "parameters" are typically unknown quantities driving the process. The extremal model can also be applied to data consisting of ranks, presumably reflecting the outcome of some underlying continuous process, e.g., competing ability. For instance, HENERY (1981) and TAVERNIER (1991) used this idea in the context of horse races, and JOHNSON et al. (2002) for analysis of ranked data collected in primate intelligence experiments. Ranked data also arise in human genetics (e.g., LAWRENCE et al. 1994). In HENERY (1981) and TAVERNIER (1991), the setting is one where several individuals compete in a series of events (e.g., horse races). Prior to the event, each individual generates an unobservable random variable, or "liability of winning the event," and these liabilities are regarded as mutually independent, given some parameters. Posterior to the event, the observed ranking corresponds to the order of the realized values of these liabilities. Clearly, the latent variables are no longer independent, a posteriori, because the ranking implies that Formula, where y[1] is the liability of the individual ranked first, and so on. In particular, TAVERNIER (1991) described an empirical Bayes procedure for inferring features of the latent distribution, such as breeding value of racing horses or competing ability of each of the members of a list of competitors. The approach of TAVERNIER (1991) is based on calculating, iteratively, a conditional posterior mode, given some values of dispersion parameters; these, in turn, are estimated using an approximation to marginal maximum-likelihood estimation. Her procedure uses statistical and numerical approximations that are not needed in a Bayesian implementation based on Markov chain Monte Carlo (MCMC) sampling.

The objective of this study is to present theory for a fully Bayesian method for quantitative genetic analysis of data consisting of ranks scored at each of a series of runs, experiments, or events. The method uses a Thurstonian representation of the mechanism by which an underlying continuum is translated into observed ranks. The parameters driving the Bayesian hierarchical model include additive genetic and permanent environmental effects, and the MCMC implementation allows easy computation of probabilities of yet to be observed runs. The article is organized as follows. PROBABILITY DISTRIBUTION OF OBSERVED RANKS outlines the Thurstonian concept and gives the sampling model for the observed data. In BAYESIAN STRUCTURE, a hierarchical specification is outlined. CONDITIONAL MAXIMUM A POSTERIORI ESTIMATION reviews the empirical Bayes methodology suggested by TAVERNIER (1991) and discusses potential shortcomings of this type of treatment of data. FULLY BAYESIAN ANALYSIS gives details of the MCMC scheme proposed and presents procedures for inference and prediction. Some assumptions are relaxed in NONPARAMETRIC DISTRIBUTION FOR EFFECTS. Specifically, the assumption of a normal distribution of permanent environmental deviates is replaced by a Dirichlet process, using a normal distribution as baseline probability measure; a Gibbs sampling scheme is presented. The article concludes with a discussion.


    PROBABILITY DISTRIBUTION OF OBSERVED RANKS
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Suppose nk individuals "compete" in event k and that the information available for analysis is that furnished by the ranks observed in the event. It is assumed that no ties are possible, but this restriction is easy to lift. As in HENERY (1981) and TAVERNIER (1991), it is supposed that the observed ranking is a manifestation of the magnitudes of latent variables, liability of winning the event, which are specific to each of the individuals.

The model for the unobserved liability of individual j Formula winning event k Formula is the standard specification of quantitative genetics,

Formula 1(1)
(DEMPSTER and LERNER 1950; FALCONER 1965; GIANOLA 1982; GIANOLA and FOULLEY 1983), where β is a set of F location effects affecting liability and x'jk is a known incidence vector peculiar to individual j in event k; aj and pj are the additive genetic and permanent environmental effects, respectively, of individual j on liability; gk is an effect peculiar to event k (which may or may not be needed in the model) and Formula 1 is a random residual, distributed independently of all terms in the model, within and across subjects; and Formula 1 is the residual variance. In (1), Formula 1. The liabilities are not observable, so it is assumed that Formula 1 = 1 for parameter identification purposes. Hence, all effects are measured in units of residual standard deviation.

The liabilities of the "competitors" present in event k can be packed into the nk x 1 vector lk, and the corresponding vectorial representation is

Formula 2(2)
where a is a Q x 1 vector of additive genetic effects of all potential participants or genetically related individuals; p is a C x 1 vector of permanent environmental effects (including perhaps nonadditive genetic deviations) of individuals with at least one record of competition; g is a K x 1 vector of event effects; ek Formula 2 is a residual vector, and Formula 2 is an identity matrix of order nk; and Xk, Zak, Zpk, and Zgk are incidence matrices of appropriate order. Vertical concatenation of all liabilities into a Formula 2 vector l leads to the representation

Formula 3(3)
where Formula 3 and Formula 3.

Prior to the competition, and conditionally on β, a, p, and g, the joint density of the yet to be realized liabilities, in view of the mutual independence of the residuals, is

Formula 4(4)
where Formula 4 denotes the density of the normally distributed liability ljk, with mean µjk and variance 1 and Formula 4 is a set of indicator variables; i.e., {delta}jk = 1 if competitor j is present in event k and 0 otherwise.

Consider competition k, such that the "winner" is the individual with the largest liability among the nk competitors. An ordered variable is denoted as Formula 4, which is the realized liability of individual j attaining rank = 1, 2, ... , nk in the event; for instance, if a competitor whose identification is 20 places second in the event, then the corresponding liability is denoted as Formula 4. Following HENERY (1981) and TAVERNIER (1991), the probability that a given order or ranking is observed in event k Formula 4 is given by

Formula 5(5)
This probability can also be expressed in terms of nine contrasts between liabilities being >0. Since the latent variables are not observable, one can "anchor" the liability of the winner to 0, so that all other liabilities are negative; the mean liability of a "perfect" competitor (i.e., one winning all events) would then be null. The probability of the ranks observed in all K competitions (DATA), assuming that outcomes of different races are conditionally independent, given {theta}, is

Formula 6(6)


    BAYESIAN STRUCTURE
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Consider now the problem of inferring, via the Bayesian approach, unobserved features of interest, e.g., the vector of additive genetics a, as it would be the case in horse breeding, or the linear combination v = a + p, which can be termed "competing ability." Define Formula 6, a vector of variance parameters. Assume that the density of the joint prior distribution of {theta} and {Lambda} has the form

Formula 7(7)
Above, Formula 7, and Formula 7 denote multivariate normal densities of orders F, Q, C, and K, respectively; Formula 7 is a variance reflecting uncertainty about location parameters in β (it is assumed known and large); A is the numerator additive relationship matrix (HENDERSON 1976) and Formula 7 is the additive genetic variance; Formula 7 is the variance of permanent environmental effects and Formula 7 is the variance between events. In addition, Formula 7 is the density of a scaled inverted chi-square distribution on {nu}i degrees of freedom, with Formula 7 interpretable as a prior guess for Formula 7, and H = Formula 7 is a set of known hyperparameters.

The density of the joint posterior distribution of {theta} and {Lambda} is proportional to the product of (6) and (7):

Formula 8(8)


    CONDITIONAL MAXIMUM A POSTERIORI ESTIMATION
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Prior to the advent of MCMC procedures for sampling from posterior distributions, it was standard to use two-stage approaches to carry out an approximate Bayesian analysis. In the first stage, a Gaussian approximation to Formula 8 is used to obtain an approximate maximum marginal-likelihood estimate of {Lambda} (GIANOLA et al. 1986; FOULLEY et al. 1987), Formula 8, say. In the second stage, the joint mode of the conditional posterior distribution Formula 8 is used to obtain point estimates of {theta} (GIANOLA and FOULLEY 1983) given Formula 8, or "conditional maximum a posteriori" (MAP). If the distribution Formula 8 is exactly Gaussian, its posterior mode yields directly the posterior means of each of the elements of {theta} (GIANOLA and FERNANDO 1986). If, in addition, a flat prior is used for β, the posterior mean vector of Formula 8 yields the best linear unbiased predictor of β and the best linear unbiased predictors of a, p, and g. Also, if a flat prior is assigned to β and to each of the elements of {Lambda}, under Gaussian assumptions the elements of the mode of the marginal distribution Formula 8 coincide with the restricted maximum-likelihood estimates of the variance components (HARVILLE 1974).

TAVERNIER (1990, 1991) used this type of approach for finding the {theta}-maximizer of the log of the conditional posterior density Formula 8, assuming a flat prior for β. The objective function for the model in (1), assuming a normal prior for β, is

Formula 9(9)
TAVERNIER (1990, 1991) employed the Newton–Raphson algorithm for maximizing (9) in the context of horse races. This requires calculation of first and second derivatives of (9) with respect to {theta}. The gradient vector and the Hessian matrix needed for the Newton–Raphson machinery are

Formula 9
and

Formula 9
respectively, where

Formula 9
These derivatives are not trivial and involve integration of multivariate normal densities, as well as differentiation under the integral sign. TAVERNIER (1991) gives approximations to first and second derivatives that require calculation of expectations, variances, and covariances of order statistics. Also, numerical calculation of high-order multivariate normal integrals is cumbersome. Since the Newton–Raphson algorithm is iterative and the number of competitors and races in a data set can be very large, the procedure is numerically intensive, as a large number of calculations must be effected.

Aside from these technical issues, the procedure relies on statistical approximations that may not be always adequate. For instance, in a high-dimensional, nonnormal, multivariate distribution, the components of the mode of the joint posterior density may be far from the corresponding posterior expectations. Also, computing a measure of precision of estimates of {theta} is a task by itself: a Gaussian approximation to the conditional (given {Lambda}) posterior covariance matrix is obtained by inverting the negative of the Hessian matrix and by evaluating this inverse at the converged values of {theta}. The diagonal elements of this inverse give a large sample approximation to the conditional posterior variances of the elements of {theta}. Last but not least, the procedure for estimating parameters in {Lambda} is also based on Gaussian approximations and, furthermore, the approximate modal inferences about {theta} do not take into account the uncertainty of the approximate estimates of {Lambda}. All these difficulties can be circumvented via MCMC procedures, as discussed in the following sections.


    FULLY BAYESIAN ANALYSIS
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Joint posterior distribution:
The Bayesian structure given above is maintained, but the parameter vector is augmented with the unobserved liabilities as latent variables (TANNER and WONG 1987; SORENSEN et al. 1995). A main difference with a standard latent variable problem is that, in our setting, the sample space of the liabilities is order constrained, within events.

After augmentation with the liabilities l, the joint posterior density is expressible as

Formula 10(10)
Above, the indicator set Ik Formula 10 denotes the order of liabilities, or ranking attained, observed in competition k. For example, if competitors 1, 5, 7, and 14 participate in event 20, and the ranking observed is 5 > 14 > 1 > 7, then, following the notation employed in (10), Formula 10.

Markov chain Monte Carlo sampling:
A Gibbs sampler is described for effecting draws from the joint posterior distribution (10). This is a well-known procedure for sampling from joint distributions; see, e.g., GILKS et al. (1996) and ROBERT and CASELLA (2005). Briefly, the Gibbs sampler loops through all conditional posterior distributions, with uncertain quantities sampled either blockwise or piecewise. All samples from the fully conditionals are accepted with probability 1. Samples from early iterations are discarded as "burn in"; subsequently, at the end of each cycle of sampling, the coordinates of sample Formula 10 are regarded as draws from the corresponding marginal posterior distributions. The chain can be thinned, such that samples are as lowly correlated (serially) as desired. If storage capacity is not a limiting factor, then it is better to use all samples in the post-Gibbs analysis. Convergence diagnostics and processing issues are discussed in COWLES and CARLIN (1996), SORENSEN and GIANOLA (2002), and GELMAN et al. (2003).

Sampling of liabilities:
The form of (10) indicates that the conditional posterior distribution of the liabilities in event k has the form

Formula 11(11)
where ELSE denotes the unknown liabilities at all other events, {theta}, {Delta}, {delta}, and DATA. Note that, conditionally on ELSE, the liabilities at different events are mutually independent. However, within event k, liabilities are not independent, because of the sample space constraint imposed by knowledge of its outcome, represented by Ik. The dependence arises through the sample space, but the kernel of the conditional distribution of the nk liabilities in event k, given ELSE, is proportional to the product of the kernels under conditional independence. Sampling from the space-constrained distribution is straightforward:

Draw "independently" the nk liabilities of competitors in event k (whose indexes are in Ik) from the distributions Formula 11, j isin Ik.
If the liabilities sampled satisfy the order constraint in Ik, then accept the vector lk as a draw from the posterior distribution. Otherwise, reject the proposal and repeat the sampling until a valid draw is effected.

This procedure, although simple, is typically inefficient and it may have a large rejection rate. A more efficient sampling technique, described by DEVROYE (1986), is given in more detail in SORENSEN and GIANOLA (2002). To illustrate, let the outcome of competition 20 be as in the example above; that is, Formula 11. Hence, in event 20, the sample space of the liabilities is l7,20 < l1,20 < l14,20 < l5,20. To sample the four liabilities, one proceeds from either first to last or last to first (the liability of the winner can be set equal to zero, as stated above). For instance:

Sample l7,20 from Formula 11 without any constraint. This is the realized value of the liability of the last competitor.
Sample l1,20 from the truncated normal distribution Formula 11, where the parameters are as in the absence of truncation. The realized value is formed as

Formula 11

where Formula 11 is the standard normal distribution function and U is a uniform random number in (0,1).
Sample l14,20 from the truncated normal process Formula 11, forming the realized value as

Formula 11

Sample the liability of the winner from Formula 11 as

Formula 11

Once the K events are processed independently, this produces a realization of the Formula 11 vector of liabilities at iteration s, Formula 11, which becomes the "data" in model (3) at iteration s = 1, 2, ... , S of the Gibbs sampler.

Sampling of parameters:
Given the liabilities, the conditional posterior density of all parameters can be deduced from (10) after fixing the vector l. This yields

Formula 12(12)
Note that, given the liabilities, the information provided by the sets Ik is redundant. The density above is that of a Bayesian mixed-effects model under Gaussian assumptions (GIANOLA and FERNANDO 1986; WANG et al. 1993, 1994) with the sampled liabilities playing the role of data. The conditional posterior distributions of {theta} and {Delta} are in closed form (WANG et al. 1993, 1994), as follows. The location parameters {theta} can be sampled from the multivariate normal process

Formula 13(13)
where

Formula 13
and

Formula 13
The draw from distribution (13) can be done either in a single pass (GARCIA-CORTÉS and SORENSEN 1996) or element by element (WANG et al. 1994); the rate of convergence to the equilibrium distribution is lower and the autocorrelation between samples higher when sampling is piecewise.

All conditional posterior distributions of the dispersion parameters are scaled inverted chi square and are mutually independent. The distributions are

Formula 14(14)

Formula 15(15)
and

Formula 16(16)

The Gibbs sampling algorithm consists of drawing liabilities from distributions (11), such that the order constraints are satisfied, followed by draws from (13) and (14)–(16) to obtain parameter updates. The sampling process is repeated as needed, after convergence diagnostics are satisfied, and such that the Monte Carlo error of estimation of features of the posterior distribution is small enough.

Inference:
The procedure of TAVERNIER (1991) produces conditional posterior modes of location parameters, as already noted, but can also be used to estimate the probability of observing a certain outcome, even in a hypothetical competition, although ignoring uncertainty about estimates. The fully Bayesian Markov chain Monte Carlo procedure is more flexible. First, it is possible to produce a complete posterior distribution for any uncertain quantity, e.g., the competing ability aj + pj of individual j or the additive genetic variance Formula 16. In addition, it is possible to estimate the probability of any event related to a future competition. Let If be an order associated with a yet to be realized event referred to as "future," with nf participants. The predictive probability distribution of If, posterior to historical data (DATA), is given by

Formula 17(17)
where Formula 17 denotes expectation with respect to the distribution (.|.). The integral in (17) does not have a closed form, since the posterior distribution Formula 17 is unknown. However, it can be estimated from the Markov chain Monte Carlo samples via ergodic averaging.

Let Formula 17, s = 1, 2, ... , S be the Formula 17-coordinate of a sample from the posterior distribution Formula 17, with density (10), and let

Formula 17
be the probability of observing the ranking in If in the future liabilities when Formula 17. Note that, since event effects are assumed to be independent, DATA will not contain information about a future event effect gf. This does not pose a problem, as the posterior distribution can be augmented with this effect; however, DATA will convey information about Formula 17. A simulation-consistent estimator of (17) is

Formula 18(18)
where Formula 18 is an indicator variable taking the value 1 if the ranking in If is attained in sample s and 0 otherwise. In practice, one samples liabilities for future event f from

Formula 18
and checks whether or not the realized values satisfy Formula 18 if so, Formula 18. This process is repeated for every Monte Carlo sample, leading directly to (18).

A similar procedure can be used to estimate the probability of any future event A, such as whether "participant" j is first, second, or third in a competition involving nf competitors; let such an event be denoted as A. Then, the probability of observing A in a future event can be estimated as

Formula 18
Operationally, one samples liabilities lf from Formula 18 as given above and checks whether or not A is realized, in which case Formula 18 and 0 otherwise.


    NONPARAMETRIC DISTRIBUTION FOR EFFECTS
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Above, it was assumed that prior uncertainty about permanent environmental or event effects could be described reasonably by the normal distributions Formula 18 and Formula 18, respectively. However, it might be sensible to make less strong assumptions about these distributions. For instance, there may be clusters of events that are more similar to each other, i.e., clusters representing different, unknown, levels of competition. Likewise, there may be some competitors that receive similar "preferential" treatment. A suitable representation of uncertainty may be that furnished by the Dirichlet process prior (ESCOBAR 1994; MACEACHERN 1994; KLEINMAN and IBRAHIM 1998; VAN DER MERWE and PRETORIUS 2003). The case of the permanent environmental effects is discussed here. A priori, let the permanent effects pi be independently distributed as

Formula 18
where G is some general distribution. In turn, assume that G follows a Dirichlet process (DP) prior

Formula 18
(FERGUSON 1973; ANTONIAK 1974), meaning that G is a random member of a space of distributions, with M and Formula 18 being parameters of these process. For example, it could be that Formula 18 is taken as "base prior," a distribution approximating the nonparametric shape of G, with M interpretable as a degree of belief on how close Formula 18 is to G. The parameter M is such that, when Formula 18, then Formula 18; the same is true if all random effects are identical (KLEINMAN and IBRAHIM 1998). When G is mixed over the Dirichlet process, the Polya urn representation of the prior distribution of the random effects (KLEINMAN and IBRAHIM 1998) is, for Formula 18 as base prior,

Formula 18
where the probabilities P1, P2, ... , PC–1 follow from the parameters of the Dirichlet process. Using results in ESCOBAR (1994), KLEINMAN and IBRAHIM (1998), and VAN DER MERWE and PRETORIUS (2003), the prior distribution of the random effects generates the sequence

Formula 18
Note that as Formula 18, the prior distribution tends toward the parametric process Formula 18, as noted earlier. The preceding implies that the conditional prior distribution of a permanent environmental effect (i, say), given all other such effects, is

Formula 19(19)
where

Formula 19
and B is the set over which the sum is taken (ESCOBAR 1994; VAN DER MERWE and PRETORIUS 2003). Note that pi can be equal to the permanent environmental effect of other subjects with nonnull prior probability; this has an implication in the construction of conditional posterior distributions. Representation (19) illustrates that the conditional prior distribution of any of the permanent effects, given all other p's, is a mixture of C – 1 degenerate distributions (Formula 19) with point masses at pj Formula 19 and of the parametric base distribution Formula 19. The mixing probabilities of states j != i are Formula 19 for each of the C – 1 "conditioning" states, and Formula 19 for the probability that pi is a draw from the normal process.

Given the liabilities, and assuming temporarily that M is assigned a fixed value, the joint posterior density of all uncertain quantities is obtained by modifying (12) into

Formula 20(20)
Above, pi ~ G denotes that pi follows the unknown distribution G drawn from a Dirichlet process with Formula 20 as base measure and with known parameter M. Since the Dirichlet process allows pi to take any of the other p's as a current value, data from all individuals with records of performance contribute to the conditional posterior distribution of pi. One has as conditional posterior distribution

Formula 21(21)
Under standard parametric assumptions, the fully conditional posterior distribution of pi is given by

Formula 22(22)
Using well-known results (e.g., WANG et al. 1993, 1994; SORENSEN and GIANOLA 2002) it can be shown that

Formula 23(23)
where

Formula 23
Formula 23 is the column of Zp in (3) pertaining to individual j, and

Formula 23
In addition, let

Formula 24(24)
with this integral expressible in closed form. Note that the integrand is the product of the densities of all liabilities of individual i, with β, a, and g as fixed parameters (the p's of other individuals do not enter into the model for the liabilities of individual i) times the Formula 24 density. Hence, with li being the ri x 1 vector of liabilities of the individual, and Formula 24, and Formula 24 being incidence matrices relating li to β, a, and g, respectively, a standard integration yields

Formula 25(25)
where

Formula 25
and Formula 25 is a matrix of 1's of order ri. Employing (23) and (24) in (22) leads to

Formula 25
so that

Formula 26(26)
The form of (26) indicates that the fully conditional posterior distribution of pi is a mixture of the C – 1 degenerate distributions Formula 26 with point masses at pj Formula 26 and of the parametric distribution Formula 26. Let now

Formula 26
Similar to that in VAN DER MERWE and PRETORIUS (2003), the rule for drawing samples from the conditional posterior distribution pi|ELSE is to take as a current value for pi

Formula 27(27)
where pj is the current state in the Markov chain of the permanent environmental effect of individual j, and Qj is evaluated at the current values of all parameters entering into this probability. In the calculation of Qj, the permanent effect pi is replaced by pj when computing Formula 27. This means that values of pj that generate liabilities li with higher plausibility are assigned a higher probability of selection. Eventually, this leads to clustering of permanent environmental effects into "plausibility" groups.

The fully conditional posterior distributions of all other parameters can be readily deduced from (20) by fixing p as well as the liabilities

Formula 28(28)
The conditional posterior distribution of β, a, and g given everything else is the multivariate normal process

Formula 29(29)
where

Formula 29
The conditional posterior distributions of the variance components are as in the parametric model, that is,

Formula 30(30)

Formula 31(31)
and

Formula 32(32)
The Gibbs sampler draws liabilities as before, permanent environmental effects from (27), β, a, and g from (29), and the three variance components from the preceding expressions.

At each iteration of the Gibbs sampler, the procedure induces a clustering structure of permanent environmental effects, with each cluster consisting of individuals sharing the same value of pi. Following VAN DER MERWE and PRETORIUS (2003), let t be the number of unique permanent environmental effects (0 < t ≤ C). The posterior distribution of t can be obtained readily from the sampling procedure. BUSH and MACEACHERN (1996) suggest embedding an optional additional step in the Gibbs sampling procedure: let the number of clusters identified at iteration m be Formula 32; this means that there are Formula 32 distinct values of the p's, which are labeled as Formula 32. The model for the liabilities in (3) is then rewritten as

Formula 32
where Formula 32 is a condensation of Zp of order Formula 32 obtained by assigning liabilities to the appropriate cluster of permanent environmental effects to which they belong. At that point of the scheme, KLEINMAN and IBRAHIM (1998) and VAN DER MERWE and PRETORIUS (2003) suggest drawing a sample of {eta} from the process

Formula 32
where

Formula 32
and

Formula 32
Once a sample of {eta} is obtained, say Formula 32, the permanent environmental effects are reconstituted as Formula 32, where Formula 32 is a C x Formula 32 matrix that allocates cluster values to permanent environmental effects. Then, the MCMC process continues for other parameters as usual.

Turn attention now to inferring M, the degree of belief parameter of the Dirichlet process. Using results in ANTONIAK (1974) and WEST (1992), but giving more details, as their developments are compact, the prior distribution of the number of clusters t may be written as

Formula 32
where the factor Formula 32 does not involve M; Formula 32 is the Gamma function. As pointed out by WEST (1992), if one has samples of t (see above), the conditional posterior distribution of M is expressible as

Formula 32
where p(M) is the prior density of M. This result follows, because, if one knows p, this gives immediately the number of clusters t as well as the number of permanent environmental effects per cluster. Then, with B(. , .) being the Beta function

Formula 32
WEST (1992) notes that this density results from marginalizing the joint distribution [M, x|t, ELSE], where x is a continuous random variable taking values between 0 and 1. Hence, it follows that

Formula 32
Consequently, the conditional posterior density of the auxiliary variable x is

Formula 33(33)
so that Formula 33. In addition, the conditional posterior distribution of M is

Formula 33
and its form depends on the prior adopted for M. Since M must be positive, WEST (1992) suggests a Gamma (a, b) distribution as prior, so that

Formula 33
The normalized density is

Formula 33
The integrals in the denominator yield

Formula 33
Hence, letting a* = (t + a) and Formula 33 the conditional posterior distribution becomes

Formula 34(34)
This is a mixture of the two Gamma distributions indicated, with mixing probabilities {pi}x and 1 – {pi}x. Note that

Formula 34
Since Formula 34,

Formula 34
At the end of the MCMC sampling process there will be S samples of the number of clusters t and of the auxiliary variables x. The density of the marginal posterior distribution of M can be estimated (WEST 1992) using the Rao–Blackwell estimator

Formula 35(35)
where Formula 35, and Formula 35. If a = 0 and b = 0, the Gamma prior degenerates to Formula 35 or, equivalently, to Formula 35 {propto} constant. If such an improper prior is adopted, the Rao–Blackwell estimator reduces to

Formula 35
with a* = t and b* = –log x used in the calculation of Formula 35. While the conditional posterior distribution of M is well defined for a = b = 0, this does not guarantee that its marginal posterior distribution will be always proper. Hence, the uniform prior on Formula 35 should be used with caution.


    CONCLUSION
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
This study presents parametric and semiparametric procedures for analysis of rank data. Observations presented as ranks arise often in behavioral or cognitive sciences and in the context of competitions such as horse jumping events or greyhound races. For instance, JOHNSON et al. (2002) carried out a metaanalysis of 30 primate (24 genera represented) intelligence experiments grouped into "paradigms" and "procedures within paradigm." Paradigms are types of intelligence tests (e.g., "reversal learning," where individuals learn to pick one object rather than another, and then the value of the objects changes, so that a previously unrewarded object is now rewarded); "procedures" refer to different methodologies used to investigate the paradigm. In one of six reversal studies the observed ranking was Cebus > Hylobates > Aotus > Saimiri. In the analysis of the ranks, they used Thurstonian models similar to those described in this article. Their Bayesian formulation allowed a clear assessment of uncertainty, which was viewed as especially important because, in these experiments, raters usually do not rank >15 items, so asymptotic approximations may be inadequate.

In a study of competitions of jumping horses, TAVERNIER (1990) argued that a "physical" measure of performance (e.g., time, number of obstacles broken or refused) is not always available, and that it is unclear how to measure the technical difficulty of an event. Also, measures such as yearly earnings are available only for horses that earn money, so that the information comes from a selected sample, which may distort inference considerably. As discussed earlier, TAVERNIER (1990, 1991) proposed an approximate Bayesian method in which ranks are viewed as being the manifestation of latent variables, as in the present study. In the analysis, involving ~19,000 horses participating in all jumping competitions that took place in France in 1987, ~14% of the horses had an accuracy <0.14 ("accuracy" ranges between 0 and 1), and 28% did not reach 0.45. These accuracies are approximations and probably understate uncertainty, because the error of estimates of parameters is not taken into account in the calculations. Important numerical problems were encountered by TAVERNIER (1990). For each Newton–Raphson iteration, and for a race with n horses, she had to calculate the following multivariate normal integrals: (1) an integral of order n – 1, (2) n 1 integrals of order n – 2, and (3) Formula 35 integrals of order n – 3. For this reason, she used approximations to integrals based on Taylor series expansions. None of these integrals need to be computed explicitly or approximated in our procedure, as all calculations rely on Monte Carlo sampling. Another difficulty in the conditional maximum a posteriori method of TAVERNIER (1990, 1991) is the computation of the joint posterior mode, which, in the context of horse races in France may involve solving (reiteratively) a system exceeding 100,000 unknowns. In our procedure, solving of equations is replaced by effecting draws from conditional posterior distributions and, under the assumptions of this article, all these are recognizable. A potential difficulty may be slow mixing of the MCMC algorithm, but this can often be accelerated by either changing the algorithm or adopting a different parameterization. Introducing a Dirichlet process prior enhances the level of difficulty of the calculations, but it confers robustness with respect to possible departures from the normality assumption for the distributions of events or of permanent environmental effects. Further, the nonparametric treatment of permanent environmental effects is a flexible alternative to a thick-tailed distribution (STRANDÉN and GIANOLA 1998, 1999; ROSA et al. 2001, 2003, 2004), in the sense that it allows for asymmetry, as well as for allocation of effects into clusters with some biological meaning.

Here, it was assumed that event effects were equally distributed a priori, but there may be heterogeneity with respect to level of difficulty. This can be accommodated via hierarchical modeling. For instance, the prior distribution Formula 35 could be modified into Formula 35, where {alpha} is a vector of unknown "levels of difficulty" effects, and M is a known matrix relating g to {alpha}; in turn, {alpha} could be assigned a vague prior. In their primate studies, JOHNSON et al. (2002) fitted a latent variable model including a genus effect and a "paradigm-genus" deviation; they found that the latter did not contribute to variability appreciably and concluded that taxa differed in domain-general ability. A nonparametric alternative would be to assign a Dirichlet process prior to the event effects g and let the data drive to clusters representing level of difficulty.

In the context of horse breeding, it may be that there are individuals whose genes predispose them to perform well in certain types of competition, but not in others; i.e., there may be differents sets of genes for different types of competition. Our model can be extended by assuming that each individual has a liability peculiar to each type of competition. Suppose there are three types of competition; here, there would be three liabilities per individual with a within-individual correlation matrix R0, say. All three liabilities would be realized only if the individual in question participates in the three types of events. A simpler model could postulate that liabilities for different events are independent, but that each individual possesses an additive genetic effect peculiar to the type of event. An additive genetic variance–covariance matrix G0 would be introduced and inferred as in a multivariate Bayesian mixed-effects model (SORENSEN and GIANOLA 2002).

In many sports events, e.g., races or marathons, there are no ties. However, ties can be handled in our procedure via a simple modification of the limits the liability variates can take. For example, consider a competition with four participants (5, 7, 14, and 20), such that the observed ranking is Formula 35 here, Formula 35 denotes a tie in second place between individuals 7 and 14. In the process of sampling the liabilities, it suffices to accept samples such that 5 and 20 take the largest and smallest values, respectively, with the liabilities 7 and 14 taking any order between those of 5 and 20; then, the ranking of 7 and 14 would vary at random over MCMC samples. Alternatively, one could assign the average of the two liabilities sampled for 7 and 14 in each of the MCMC iterations. JOHNSON et al. (2002) handled ties via a parametric model for the probability that two genera are tied, with one additional parameter (with a corresponding prior distribution) introduced in the implementation.

A useful extension of this model, at least in the context of evaluation of horse races, would be one for a joint analysis of ranks and of some "continuous" variate, such as time in the event or earnings. As mentioned above, there are situations in which the rank information is available for all competitors, but earnings are available only for the placed competitors. Clearly, earnings for the other competitors are not missing completely at random. However, if the conditional probability of missing earning information depends on the ranks, then missingness would be ignorable in a joint analysis of ranks and of earnings (RUBIN 1976; IM et al. 1989; SORENSEN et al. 2001). For example, such a model could postulate a joint normal distribution of the liabilities and of the continuous trait measured on some competitors.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Yu-Mei Chang and Xiao-Lin Wu are thanked for helpful comments. Research was completed while the senior author was Mercator Professor (Deutsche Forschungsgemeinschaft) at Georg-August-University, Göttingen, Germany. Support by the Wisconsin Agriculture Experiment Station and by grants National Research Initiatives Competitive Grants Program/U.S. Department of Agriculture 2003-35205-12833, National Science Foundation (NSF) DEB-0089742, and NSF DMS-NSF DMS-044371 is acknowledged.


    LITERATURE CITED
 TOP
 ABSTRACT
 PROBABILITY DISTRIBUTION OF...
 BAYESIAN STRUCTURE
 CONDITIONAL MAXIMUM A POSTERIORI...
 FULLY BAYESIAN ANALYSIS
 NONPARAMETRIC DISTRIBUTION FOR...
 CONCLUSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 

ANTONIAK, C. E., 1974 Mixtures of Dirichlet processes with applications to nonparametric problems. Ann. Stat. 2: 1152–1174.[CrossRef]

BOCK, R. D., and L. V. JONES, 1968 The Measurement and Prediction of Judgment and Choice. Holden–Day, San Francisco.

BUSH, C. A., and S. N. MACEACHERN, 1996 A semiparametric Bayesian model for randomized block designs. Biometrika 83: 275–285.[Abstract/Free Full Text]

COWLES, M. K., and B. P. CARLIN, 1996 Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Stat. Assoc. 91: 883–904.[CrossRef]

DEMPSTER, E. R., and I. M. LERNER, 1950 Heritability of threshold characters. Genetics 35: 212–236.[Free Full Text]

DEVROYE, L., 1986 Non-Uniform Random Variate Generation. Springer, New York.

ESCOBAR, M. D., 1994 Estimating normal means with a Dirichlet process prior. J. Am. Stat. Assoc. 89: 268–275.[CrossRef]

FALCONER, D. S., 1965 The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 29: 51–76.[CrossRef]

FERGUSON, T. S., 1973 A Bayesian analysis of some nonparametric problems. Ann. Stat. 1: 209–230.[CrossRef]

FOULLEY, J. L., S. IM, D. GIANOLA and I. HÖSCHELE, 1987 Empirical Bayes estimation of genetic value for n binary traits. Genet. Sel. Evol. 19: 197–224.[CrossRef]

GARCIA-CORTÉS, L. A., and D. SORENSEN, 1996 On a multivariate implementation of the Gibbs sampler. Genet. Sel. Evol. 28: 121–126.[CrossRef]

GELMAN, A., J. B. CARLIN, H. S. STERN and D. B. RUBIN, 2003 Bayesian Data Analysis, Ed. 2. Chapman & Hall, London.

GIANOLA, D., 1982 Theory and analysis of threshold characters. J. Anim. Sci. 54: 1079–1096.[Abstract/Free Full Text]

GIANOLA, D., and R. L. FERNANDO, 1986 Bayesian methods in animal breeding theory. J. Anim. Sci. 63: 217–244.[Abstract/Free Full Text]

GIANOLA, D., and J. L. FOULLEY, 1983 Sire evaluation for ordered categorical data with a threshold model. Genet. Sel. Evol. 15: 201–223.[CrossRef]

GIANOLA, D., J. L. FOULLEY and R. L. FERNANDO, 1986 Prediction of breeding values when variances are not known. Genet. Sel. Evol. 18: 485–498.[CrossRef]

GILKS, W. R., S. RICHARDSON and D. J. SPIEGELHALTER, 1996 Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

GRÜNEBERG, H., 1952 Genetical studies on the skeleton of the mouse. IV. Quasi-continuous variations. J. Genet. 51: 95–114.

HARVILLE, D. A., 1974 Bayesian inference of variance components using only error contrasts. Biometrika 61: 383–385.[Abstract/Free Full Text]

HENDERSON, C. R., 1976 A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32: 69–83.[CrossRef]

HENERY, R. J., 1981 Permutation probabilities as models for horse races. J. R. Stat. Soc. B 43: 86–91.

IM, S., R. L. FERNANDO and D. GIANOLA, 1989 Likelihood inferences in animal breeding under selection: a missing data theory viewpoint. Genet. Sel. Evol. 21: 399–414.[CrossRef]

JOHNSON, V. E., R. O. DEANER and C. P. VAN SCHAIK, 2002 Bayesian analysis of multi-study rank data with application to primate intelligence ratings. J. Am. Stat. Assoc. 97: 8–17.[CrossRef]

KLEINMAN, K. P., and J. G. IBRAHIM, 1998 A semiparametric Bayesian approach to the random effects model. Biometrics 54: 921–938.[CrossRef][Medline]

LAWRENCE, S., R. BEASLEY, I. DOULL, B. BEGISHVILI, F. LAMPE et al., 1994 Genetic analysis of atopy and asthma as quantitative traits and ordered polychotomies. Ann. Hum. Genet. 58: 359–368.[Medline]

MACEACHERN, S. N., 1994 Estimation of normal means with a conjugate style Dirichlet process prior. Comm. Stat. Sim. 23: 727–741.[CrossRef]

ROBERT, C. P., and G. CASELLA, 2005 Monte Carlo Statistical Methods, Ed. 2. Springer, New York.

ROSA, G. J. M., D. GIANOLA and J. I. URIOSTE, 2001 Assessing relationships between genetic evaluations using robust regression with an application to Holsteins in Uruguay. Acta Agric. Scand. Sect. A 51: 21–34.[CrossRef]

ROSA, G. J. M., C. R. PADOVANI and D. GIANOLA, 2003 Robust linear mixed models with normal/independent distributions and Bayesian MCMC implementation. Biom. J. 45: 573–590.[CrossRef]

ROSA, G. J. M., D. GIANOLA and C. R. PADOVANI, 2004 Bayesian longitudinal data analysis with mixed models and thick-tailed distributions using MCMC. J. Appl. Stat. 31: 855–873.[CrossRef]

RUBIN, D. B., 1976 Inference and missing data. Biometrika 63: 581–582.[Abstract/Free Full Text]

SORENSEN, D., and D. GIANOLA, 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer–Verlag, New York.

SORENSEN, D., S. ANDERSEN, D. GIANOLA and I. KORSGAARD, 1995 Bayesian inference in threshold models using Gibbs sampling. Genet. Sel. Evol. 27: 229–249.[CrossRef]

SORENSEN, D. A., R. L. FERNANDO and D. GIANOLA, 2001 Inferring the trajectory of genetic variance in the course of artificial selection. Genet. Res. 77: 83–94.[CrossRef][Medline]

STRANDÉN, I., and D. GIANOLA, 1998 Attenuating effects of preferential treatment with Student-t mixed linear models: a simulation study. Genet. Sel. Evol. 30: 565–583.[CrossRef]

STRANDÉN, I., and D. GIANOLA, 1999 Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet. Sel. Evol. 31: 25–42.[CrossRef]

TANNER, M. A., and W. WONG, 1987 The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82: 528–550.[CrossRef]

TAVERNIER, A., 1990 Estimation of breeding value of jumping horses from their ranks. Livest. Prod. Sci. 26: 277–290.[CrossRef]