THE analysis of genetic data on groups of related individuals, or pedigrees, frequently necessitates the calculation of probabilities and likelihoods. There are well-known algorithms such as the peeling algorithm (Elston and Stewart 1971), which was extended to include arbitrarily complex pedigrees and genetic models by Cannings et al. (1978), and the expert systems algorithms such as that of Lauritzen and Spiegelhalter (1988) for large Bayesian networks (Jensen 1996), which can perform these calculations in theory. However, in practice, exact methods break down either when the pedigree structure or the genetic model under consideration becomes too complicated. A complex pedigree is one that has too many interconnecting undirected cycles or loops that force large cutsets in any peeling sequence, leading to impossible storage requirements. In a general graph, a sequence of edges for which each edge has a node in common with both preceding and succeeding edges forms an undirected cycle if it begins and ends at the same node. There are many ways of forming loops in a general graph but in pedigrees, loops are typically caused by inbreeding or intermarital relationships. A complex genetic model can pose similar computational challenges, even on a simple pedigree structure, because the underlying graphical model for the computational problem may be highly looped (Sheehanet al. 2002). In these situations, the required probabilities and likelihoods must be estimated. One approach is to approximate the problem by sacrificing a sufficient amount of the complexity to enable exact calculation (e.g., Wanget al. 1996). Alternatively, the complexity of the problem can be preserved and the calculations in question approximated using Markov chain Monte Carlo (MCMC) methods (see Thompson 2001 for an overview).
The first application of MCMC methods in pedigree analysis (Sheehan 1990) exploited the single-site Gibbs sampler (Geman and Geman 1984) for a simple single-locus trait on a large complex human pedigree of Greenland Eskimos where the statespace of the underlying Markov chain was the set of genotypic configurations on the pedigree. Reasonable estimates of marginal posterior distributions of founder genotypes were obtained in this way (Sheehan 1992). By noting that the genetic model is Markovian, a neighborhood system can be defined on a pedigree such that, conditional on the neighbors, genotypes of individuals are independent and the set of genotypic configurations on the pedigree is thus a locally dependent Markov random field. The Gibbs sampler moves between genotypic configurations by selecting a single individual and updating the existing estimate of genotype with one drawn at random from the full conditional distribution of genotype, given the current estimates elsewhere on the graph and any phenotypic data. Because of the local dependencies induced by the neighborhood system, the Gibbs sampler is very easy to implement in these genetic applications. However, it was also noted at this time (Sheehan 1990; Sheehan and Thomas 1993) that although the sampler defines an irreducible Markov chain for most traits determined by a diallelic locus, the irreducibility condition necessary to ensure convergence of the sampler to the correct equilibrium distribution is no longer guaranteed for a genetic system with three or more alleles.
The pedigree Gibbs sampler, as defined above, will converge to the true posterior distribution of genotypes on the pedigree given available phenotypes if it defines an irreducible chain whereby all states communicate and if each individual genotype is updated infinitely often. This follows from the ergodic theorem for finite, aperiodic, irreducible, Markov chains (Cox and Miller 1965). In many other areas of application, such as Bayesian image analysis, for example, there is considerable flexibility in the specification of the underlying Markov random field, which is usually constructed to satisfy the positivity condition (see Besag 1986), and this, in turn, trivially guarantees irreducibility of the single-site Gibbs sampler. For a pedigree with n individuals, let Si = {si} be the set of possible genotypes that can be assigned to the ith pedigree member, ignoring all others. Positivity, in our applications, would mean that any assignment (s1,..., sn) of genotypes on the pedigree would have an associated positive probability. This is clearly not the case for discrete genetic models, which pose a complicated excluded adjacency problem. In particular, we cannot start the Gibbs sampler from a random configuration as is usually the case with other applications.
—An example adapted from Sheehan (2000) where observed data on a dichotomous trait [affected and normal determined by a diallelic genetic system (with alleles A and B)] cause the single-site Gibbs sampler to be reducible. Note that both mother and daughter must be homozygous for the same allele, which means that once one pairing has been assigned, the other can never be visited by updating one individual genotype at a time if heterozygotes are affected with zero probability.
The issue, therefore, is whether the legal configurations of genotypes on a pedigree (i.e., those states that agree both with the data and the genetic model) communicate if genotypes are updated one at a time. It can be shown that the single-site Gibbs sampler is generally irreducible over the legal states of a diallelic genetic system. The only exception (Sheehan 1990) occurs within the following scenario. Suppose we have a dichotomous trait (e.g., affected and normal) where both homozygotes are affected with positive probability but the heterozygote is affected with probability zero. As illustrated in Figure 1, depending on the observed individuals, this can cause the single-site Gibbs sampler to be reducible. However, as all that is required for irreducibility is a minor relaxation of the penetrance probabilities, allowing the heterozygote to have some small positive probability of having the same phenotype as either homozygote, this does not cause a problem in practice. In many cases, of course, the phenotypic observations are actually the genotypes themselves and information on individuals is either complete in that they are fully typed or missing altogether. This is the setting that we focus on heretoforth. Note that the Gibbs sampler is always irreducible for diallelic loci in this case.
—A genetic system with three alleles, A, B, and C, adapted from Sheehan (2000), where observations on two siblings force noncommunicating alternatives on their unobserved parents.
Once the genetic system has three or more alleles, it is easy to see how the Gibbs sampler can define a reducible Markov chain. The classic example for a three-allele system, with alleles A, B, and C, say (see Sheehan and Thomas 1993, for example), is shown in Figure 2, where unobserved parents have two typed offspring, one of whom is heterozygous AB and the other homozygous for the third allele, CC. Clearly, each parent carries the C allele and one carries an A while the other carries a B. From the symmetry of genetic inheritance, it is equally plausible that either parent can have one or the other of the genotypes AB and AC. However, once one has been assigned in a starting configuration for a single-site Gibbs sampler, the other must take the complementary type and neither will change for the duration of the run.
Reducibility depends on the data. If it were possible to enumerate the noncommunicating classes of the Markov chain described by the Gibbs sampler, a scheme that would move between the different classes using the reversible-jump method of Green (1995) or the Metropolis jumping-kernel method of Lin 1996, for example, could be devised. A deterministic algorithm to find the noncommunicating classes was proposed by Lin et al. (1994), which depended on the assumption that such classes can be caused only by observations on offspring forcing noncommunicating alternatives for unobserved parents, as in the classic scenario of Figure 2. However, this assumption does not suffice: There are many other ways in which data can be assigned on a pedigree to cause the Gibbs sampler to be reducible and the algorithm does not work on several simple examples (Jensen and Sheehan 1998). In particular, partial information can filter down a pedigree from an ancestor, which, when combined with other information coming up from descendants, can create noncommunicating classes on individuals in intermediate generations. To date, there is no algorithm that will find all these classes. Moreover, it would seem that a completely general algorithm would require storage of so much partial information as to make it akin to peeling (Jensen and Sheehan 1998).
—A simple pedigree where one member of each marriage is fully typed. The mother-daughter pair, labeled A?, are either both AB or both AC. Once one configuration has been assigned, the other can never be reached by a singlesite Gibbs sampler.
The practical implication of this is that it may not be possible to sample correctly from the true posterior distribution of genotype, given phenotype, using a single-site Gibbs sampler once there are three or more alleles in the genetic system, and any estimates of probabilities and likelihoods obtained in this way are hence unreliable.
In animal breeding circles, especially those concerned with animals where artificial insemination is practiced, it is not uncommon to have a small number of males with large numbers of mates and offspring. In particular, full DNA information would usually be available on these males. The pedigrees resulting from such a breeding program are extremely complex and defy exact calculation of probabilities and likelihoods for any genetic system, however simple. Often, the pedigree structure is sacrificed and estimates are obtained on a simple subpedigree such as a half-sib design, using methods like least squares (Haleyet al. 1994), which permit fast analysis using permutation tests (Churchill and Doerge 1994), for example, or maximum likelihood. However, it is more appealing to preserve the pedigree information and apply MCMC methods.
It has been claimed in the animal genetics literature (see Hoescheleet al. 1997; Bink and Arendonk 1999, for example) that the single-site Gibbs sampler is irreducible if at least one in every parent pair is observed and many analyses have been performed on animal pedigrees with fully typed males in the belief that this is true. A counter-example to this claim involving a simple unlooped pedigree with three generations is presented in Figure 3.
For a half-sib design that includes only individuals over two generations and assumes that the dams are unrelated and distinct, the single-site Gibbs sampler is irreducible when the sires are fully typed, regardless of any observations on mates or offspring. However, it is not the case in general and misleading results can be obtained.
The belief that the single-site Gibbs sampler defines an irreducible Markov chain on the space of genotypic configurations over a general pedigree provided one member of each spouse pairing is observed would seem to derive from the erroneous statement in Lin et al. (1994) that noncommunicating classes can arise only on unobserved parents from offspring data. Jensen and Sheehan (1998) presented several counter-examples to this algorithm, involving two-, three-, and four-generational pedigrees with and without loops. The important point is that information forcing noncommunicating alternatives on a set of individuals can come from above via the ancestors as well as from below via the descendants. In particular, as noted by Thompson (1986), the terms “above” and “below” do not necessarily define disjoint sets in a complex pedigree and the routes by which information filters through a pedigree may be interconnected and hence highly complicated. In this article, we have presented a counter-example to the animal genetics community and have attempted to clarify the reducibility issue.
However, while it is true that any analysis based on a reducible sampling scheme is potentially dubious, reducibility itself is not really a problem. There are several methods for getting around reducibility (see Thompson 2001 for an overview). Most of these involve some sort of relaxation of the genetic model, which enables the sampler to move between illegal configurations. These include importance sampling with weights of zero and one (Sheehan and Thomas 1993), the Metropolis-coupled samplers of Geyer (1991), the heated companion chains of Lin et al. (1993), and the annealing-type samplers of Geyer and Thompson (1995), to name but a few. The corresponding statespace of the Markov chain could be vastly augmented from that defined by the single-site Gibbs sampler and good mixing of the chain becomes the important issue (Gilkset al. 1993).
Using meiosis indicators or descent graphs (Sobel and Lange 1996) rather than genotypes greatly improves mixing of MCMC samplers on pedigrees (Thompson 1994, 2000b). In addition, there are all kinds of joint or block updating schemes such as the blocking Gibbs method of Jensen et al. (1995), which updates large blocks of the pedigree conditional on the values at the remaining variables, the locus-by-locus samplers of Heath (1997) and Sillanpää and Arjas (1999), the meiosis-by-meiosis sampler of Thompson and Heath (2000), or the combination samplers that alternate between chromosome peeling and pedigree peeling (Hurmeet al. 2000; Thomaset al. 2000; Thompson 2000a), all of which require that the pedigree can be peeled for at least one locus. For large complex animal pedigrees, however, it may not be possible to peel even for a single locus and a relaxed sampler may be required in addition to blocking. The ESIP sampler of Fernández et al. (2001) samples genotypes over the entire pedigree at a single locus by using a sample obtained from a modified pedigree obtained by iterative peeling as the proposal in a Metropolis-Hastings updating step. However, the main problem in animal genetics applications is that the pedigrees are often enormous and these alternative MCMC methods have not been tested sufficiently on big problems (Hoescheleet al. 1997). Nonetheless, they at least have the potential to yield estimates from the correct distribution and provide a wealth of attractive alternatives to the single-site Gibbs sampler.
Acknowledgments
We acknowledge Rohan Fernando for bringing this problem to our attention. In addition, Nuala Sheehan acknowledges support from the Wellcome Trust Biomedical Research Collaboration Grant 056266/Z/98/Z and the TVW Telethon Institute for Child Health Research, Perth, Western Australia.
Footnotes
-
Communicating editor: N. Takahata
- Received May 3, 2002.
- Accepted August 13, 2002.
- Copyright © 2002 by the Genetics Society of America