help button home button Genetics J Cell Biol
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Durrett, R.
Right arrow Articles by York, T. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Durrett, R.
Right arrow Articles by York, T. L.
Genetics, Vol. 166, 621-629, January 2004, Copyright © 2004

Bayesian Estimation of Genomic Distance

Richard Durretta,b, Rasmus Nielsenb, and Thomas L. Yorkb
a Department of Mathematics, Cornell University, Ithaca, New York 14853
b Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853

Corresponding author: Richard Durrett, 523 Malott Hall, Cornell University, Ithaca, NY 14853., rtd1{at}cornell.edu (E-mail)

Communicating editor: J. B. WALSH


*  ABSTRACT
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

We present a Bayesian approach to the problem of inferring the number of inversions and translocations separating two species. The main reason for developing this method is that it will allow us to test hypotheses about the underlying mechanisms, such as the distribution of inversion track lengths or rate constancy among lineages. Here, we apply these methods to comparative maps of eggplant and tomato, human and cat, and human and cattle with 170, 269, and 422 markers, respectively. In the first case the most likely number of events is larger than the parsimony value. In the last two cases the parsimony solutions have very small probability.


UNDERSTANDING the relationship between the organizations of two genomes is important for transferring information between species, for example, for finding animal models of human diseases or locating genes of agricultural importance (O'BRIEN et al. 1999 Down). Inferences concerning genome evolution have primarily used parsimony methods. In the biology literature, experts have compared chromosome-banding patterns to detail the evolution of primate genomes (DUTRILLAUX 1979 Down) or have used comparative gene mapping and cross-species chromosome painting to analyze mammalian genome evolution (CHOWDHARY et al. 1996 Down; HAIG 1999 Down; MURPHY et al. 2001 Down). In a separate and independent effort computer scientists have developed algorithms for computing the minimum number of events needed to transform one genome into another (HANNENHALLI and PEVZNER 1995A Down, HANNENHALLI and PEVZNER 1995B Down). The number of evolutionary events (i.e., inversions and translocations) separating two species is of intrinsic biological interest, but may also guide comparative mapping studies and may be used for estimating phylogenies.

In an earlier article (YORK et al. 2002 Down) we developed a Bayesian method to infer the history of inversions separating homologous chromosomes from two different species. This method can be applied to mitochondrial genomes, to X chromosomes, and to chromosome arms of species of genera Drosophila (RANZ et al. 2001 Down) and Anopheles (SHARAKHOV et al. 2002 Down), where pericentric inversions and translocations are rare. In this article we extend these methods to the problem of genomic distance, i.e., the number of inversions and translocations needed to transform one genome into another. Here, fissions and fusions are included as a special case of translocations in which one of the input or output chromosomes is empty.

There are several reasons for preferring our statistical approach to parsimony methods. The first and simplest is that there is no guarantee that nature took the shortest path. Indeed, we see in two of our examples that the parsimony solution is not in the 95% credible interval. A second important reason is that our approach allows for estimation of rates of evolution and testing of hypotheses: Are all inversions equally likely? Are rates the same in different lineages? Answering the first question is important for a proper analysis of the second. Otherwise we will not know if the perceived higher rate of rearrangements in rodents compared to carnivores is due to higher density of markers in the first case.

In this article we concentrate on estimating the number of events needed to change one genome into the other, leaving hypothesis testing issues for later work. In the next section we describe existing parsimony methods. The following section (A BAYESIAN APPROACH) explains our new approach, and in the last section we analyze three data sets.


*  PARSIMONY METHODS
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

HANNENHALLI and PEVZNER 1995A Down, HANNENHALLI and PEVZNER 1995B Down, whom we abbreviate HP, developed a polynomial algorithm for computing the distance between two genomes; i.e., what is the smallest number of inversions and translocations needed to transform one genome into another? They accomplished this in two steps. In the first article they solved the problem of computing the reversal distance, i.e., the smallest number of inversions needed to transform one chromosome or mitochondrial genome into another. In the second article they showed how to reduce the genome distance problem to the one for reversal distance by concatenating chromosomes into one unit. That approach is convenient for proving the existence of a polynomial algorithm, but not for our implementation so we take a different approach.

To illustrate the method we consider part of the data of DOGANLAR et al. 2002 Down, who constructed a comparative genetic linkage map of eggplant (Solanum melongena) on the basis of 233 markers for tomato. Using the first letter of the common name to denote the species they found that the marker orders on T1 and E1 and on T8 and E8 were identical, while in four other cases (T2 vs. E2, T6 vs. E6, T7 vs. E7, and T9 vs. E9) the collections of markers were the same and the orders became the same after a small number of inversions were performed (3, 1, 2, and 1, respectively).

In our example we compare the remaining six chromosomes from the two species. The first step is to divide the chromosomes into conserved segments where the adjacency of markers has been preserved between the two species, allowing for the possibility of the overall order being reversed. When such segments have two or more markers we can determine the relative orientation. However, as the HP algorithm assumes that one knows the relative orientation of segments we have to assign orientations to conserved segments consisting of single markers to minimize the distance. In the case of the tomato-eggplant comparison there are only five singleton segments so one can easily consider all 25 = 32 possibilities. In the human-cat and human-cattle comparisons in ANALYSIS OF THREE DATA SETS there are 21 and 75 single-marker segments, respectively, so the amount of work is considerable (221 = 2,097,152) or impossible (275 = 3.77 x 1022).

We now describe the details of the distance computation. Conceptually this is simple—one defines a graph and then computes the number of cycles. However, carrying out the procedure involves a number of details. The first part of Table 1 shows the two genomes with an assignment of signs to the single-marker segments that minimizes the distance. The first step in preparing to use the HP algorithm is to double the markers. When segment i is doubled we replace it by two consecutive numbers, 2i - 1 and 2i; e.g., 6 becomes 11 and 12. A reversed segment -i is replaced by 2i and 2i - 1; e.g., -5 is replaced by 10 and 9. The second step is to add ends to the chromosomes and enough empty chromosomes to make the number of chromosomes equal. In this example, no empty chromosomes are needed. We have labeled the ends in the first genome by 1000–1011 and in the second genome by 2000–2011.


 
View this table:
[in this window]
[in a new window]

 
Table 1. Example of genome distance computation

The second part of Table 1 shows the result of the first two preparatory steps. Commas indicate separations between two segments or between a segment and an end. The next step is to construct the breakpoint graph that results when the commas are replaced by edges that connect vertices with the corresponding numbers. We did not draw the graph since we need to know only the connected components of the graph. Since each vertex has degree two, these are easy to find: Start with a vertex and follow the connections. The resulting component will be either a path that connects two ends or a cycle that consists of markers and no ends. In our example there are five paths of length three that connect ends. These "short paths" tell us that end 1000 in genome 1 corresponds to end 2000 in genome 2, etc. The other correspondences between ends are determined after we compute the distance.

The other "long paths" in the breakpoint graph are listed in Table 1. At this point there are no cycles in the breakpoint graph. To compute a lower bound for the distance we start now with the number of commas seen when we write out one genome. In this example that is 33. We subtract the number of connected components in the breakpoint graph. In this example that is 5 + 7 = 12, and then we add the number of paths that begin and end in the same genome, which in this case is 0. The result, which is 21 in this case, is a lower bound on the distance since any inversion or translocation can at most reduce this quantity by 1, and it is 0 when the two genomes are the same.

In general the distance between genomes can be larger than the lower bound from the breakpoint graph and the computation of the exact distance can be quite complex. In the reversal distance problem there can be obstructions called hurdles that can prevent us from decreasing the distance and hurdles can be intertwined in a fortress of hurdles that takes an extra move to break. In the genome distance problem the situation is even more complex and one must consider six different quantities to compute the distance. Fortunately these complexities rarely arise in biological data sets. BAFNA and PEVZNER 1995 Down considered the reversal distance problem for 11 chloroplast and mitochondrial data sets and in all cases they found that the distance is equal to the lower bound. That is also the case for the three examples we consider in this article.

To verify that 21 is the minimum distance we construct a sequence of 21 moves that transforms genome 1 into genome 2. This is straightforward, but somewhat tedious, so we have hidden the details away in an Appendix. We have given this computation to show that the procedure gives a systematic method for finding the shortest path. In the example we are considering our solution gives a slight improvement on that of DOGANLAR et al. 2002 Down.

The HP method just described gives an estimate of the number of events that can be seen with the marker set used; e.g., an inversion that occurs between two markers or reverses the orientation of a single marker cannot be detected. The method of NADEAU and TAYLOR 1994 Down, which can be used when we know the distances between markers in one of the genomes, avoids this problem. Their computation is based on the length of conserved segments with more than two markers, enlarges the lengths on the basis of the expected value of the unseen flanking ends, and then uses the transformed data to get an estimate of the average size of conserved segments. Using the total size of the genome, this translates into an estimate of the number of conserved segments, S.

Subtracting the number of chromosomes in the genome for which the distances are known from S gives an estimate of the number of disruptions, R, that have occurred. Given R, we divide by 2 to get an estimate of the number of events that have occurred. In the literature this estimate is called the breakpoint distance. It is reasonable to use in Nadeau and Taylor's context, since they are considering the situation in which complete information about marker order has revealed all of the disruptions, and if we assume that all breakpoints are equally likely, then with high probability no two disruptions will occur between the same pair of nucleotides. On the other hand with a small number of markers, e.g., in the tomato-eggplant comparison considered above, the breakpoint distance is not very accurate. For the six chromosomes considered, we have 27 - 6 = 21 disruptions but the genomic distance is 20.


*  A BAYESIAN APPROACH
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

The aim of the Bayesian approach is to generate the probability distribution of inversions and translocations given the data. The Bayesian approach is desirable because it allows probabilities to be assigned to particular evolutionary histories. On the basis of these probabilities, estimates of the number of inversions and translocations with associated measures of statistical uncertainty can be obtained. In addition, the Bayesian approach allows us to make probabilistic statements regarding the relative frequency of translocations and inversions and investigate the distribution of the inversion tract length. However, to use the Bayesian approach assumptions must be made regarding the biological processes generating inversions and translocations.

We assume that inversions and translocations arise in the genome at a constant rate and that the process of inversions and translocations forms a Markov chain with state space ({Omega}O) given by the set of all possible orderings of the markers on ordered chromosomes. Letting all markers be signed indicating their orientation on the chromosome, there are

orderings of N signed markers onto M numbered chromosomes, allowing empty chromosomes, i.e., chromosomes containing no markers. In our representation, transitions between neighboring states (states differing by only a single inversion or a single translocation) in {Omega}O occur at rate {lambda}I if the two states differ by an inversion and {lambda}T if they differ by a translocation. Because of the symmetry of the transition rates, the Markov chain is time reversible and has a uniform stationary distribution, i.e.,

Because the ordering of chromosomes is not of interest, we also consider a Markov chain with a collapsed state space ({Omega}U) that ignores the numbering of chromosomes. An element of {Omega}U, with M0 empty chromosomes, corresponds to 2(M-M0)M!/M0! elements of {Omega}O. The stationary probability assigned to an element of {Omega}U with M0 empty chromosomes is then

Transitions between neighboring states in this collapsed Markov chain occur at rates {lambda}I if the two states differ by an inversion and {lambda}T if they differ by a translocation, except for translocations involving a fission of two chromosomes in an equivalence class with M0 empty chromosomes. These transitions occur at rate 2M0{lambda}T. Therefore, the transition probabilities of the Markov chain obey the detailed balance equations and the Markov chain is time reversible. In the following we use the representation based on the Markov chain with state space on {Omega}U.

Consider the genome of two organisms. The genomic marker data from one species (x1) can be transformed into the genomic marker data from another species (x2) through a sequence of inversions and translocations. Because the process we have defined is time reversible we can write the sampling probability as

where Pr(x1 -> x2|{Theta}) is the transition probability for the transformation from x1 to x2, and {Theta} is the vector of parameters. Because a model parameterized in terms of t, {lambda}T, and {lambda}I is not identifiable, we arbitrarily set t = 1 and define {Theta} = ({lambda}T, {lambda}I). Therefore, {lambda}T and {lambda}I can also be interpreted as the expected number of translocations and inversions per marker pair in the history of the two species. Pr(x1) does not depend on {Theta}, and the likelihood function is, therefore, simply given by

Let {Psi} be the (countably infinite) set of all possible evolutionary paths from x1 to x2. We note

To estimate {Theta} we establish a Markov chain with state space on [0, {infty})2 x {Psi} and with stationary distribution given by the joint posterior distribution of parameters and evolutionary path (y):

To ensure that the posterior is proper and biologically meaningful, the supports of {lambda}T and {lambda}I are restricted to the intervals (0, {lambda}Tmax) and (0, {lambda}Imax), respectively. The Markov chain is simulated using the Metropolis-Hastings algorithm similarly to the method used in YORK et al. 2002 Down. In brief, the posterior distribution is proportional to

(1)

The Markov chain is then simulated by iteratively proposing new values of y, {lambda}T, and {lambda}I and accepting these new values according to the appropriate Metropolis-Hastings acceptance probabilities (see YORK et al. 2002 Down). These acceptance probabilities are evaluated using the terms of the right side of Equation 1. In this expression, the densities for the {lambda}T and {lambda}I can be trivially evaluated at the relevant values and the probability of an evolutionary path, Pr(y|{lambda}T, {lambda}I), can also be calculated. If at time t there are IT(t) possible translocations that can occur and II(t) possible inversions that can occur, the total rates of translocations and inversions are IT(t){lambda}T and II(t){lambda}I, respectively. Note that because translocations may change the possible number of inversions and translocations, the rate at which chromosomal rearrangements occur may not be constant in time. However, the chance that any particular inversion or translocation occurs as the first event after time t is {lambda}I/(IT(t){lambda}T + II(t){lambda}I) for inversions and {lambda}T/(IT(t){lambda}T + II(t){lambda}I) for translocations and the waiting time to the first event is exponentially distributed with parameter IT(t){lambda}T + II(t){lambda}I. So the probability assigned to a particular evolutionary path, with S events (inversion and translocations), is

where ti is the time between events i - 1 and i for i <= S, , and

In practice, to avoid keeping track of the times between events, we use a method based on uniformization to ensure that the total rate of change is constant in time. In brief, we allow pseudo-events of evolutionary change, which have no effect on marker order, to occur at rate max(IT,II){IT{lambda}T + II{lambda}I} - (IT(t){lambda}T + II(t){lambda}I) at time t. The total rate of evolutionary change is then kept constant at rate max(IT,II){IT{lambda}T + II{lambda}I}. As in YORK et al. 2002 Down, an efficient proposal kernel for y is constructed by considering the breakpoint graph of HANNENHALI and PEVZNER (1995a).

The prior densities for {lambda}T and {lambda}I are assumed to be independent uniform on (0, {lambda}Tmax) and (0, {lambda}Imax), respectively. Updates to these parameters are then proposed by simulating a new value of the parameter uniformly in a window around the current value, independently of each other and of updates of y. The marginal posterior distributions of the parameters are then estimated by sampling values of {lambda}T and {lambda}I from the simulated Markov chain at stationarity. The distribution of values of {lambda}T and {lambda}I is proportional to the joint likelihood function for {lambda}T and {lambda}I, and the results can be interpreted in both a Bayesian and a likelihood framework. Here we focus on the Bayesian interpretation since the usual asymptotic properties of the likelihood function may not be satisfied.

The method can also be used to make inferences on y. The posterior distribution of the number of inversions (LI) and the number of translocations (LT) that have occurred in the history of the two species can be estimated by sampling from the simulated Markov chain. Likewise, the posterior distribution of inversion tract lengths can be estimated.

To improve convergence we used the technique of Metropolis-coupled Markov chain Monte Carlo, in which each chain with stationary distribution {pi} is coupled to a set of "heated" chains with modified stationary distributions, , where the "temperatures," Ti, are slightly greater than one, resulting in flattened distributions and, consequently, better mixing. Only unheated chains are used in estimating posterior distributions.

Our estimates of posterior distributions use every eighth update, after excluding an initial burn-in phase, i.e., the part of the Markov chain before convergence to stationarity. To determine an appropriate length of the burn-in phase, we run multiple chains and compare the between-chain variance of the evolutionary path length L (for example), BL, to the within-chain variance, WL, requiring that BL/WL become small. This is essentially the method of GELMAN and RUBIN 1992 Down. We similarly consider B/W for LI, LT, {lambda}I, and {lambda}T, and define the burn-in phase to end when all five of these ratios are <0.1. The thinning factor of 8 is chosen to keep the output files to a reasonable size, and since this is much less than the number of updates to stationarity (which is >104 for all of the data sets considered here) there should be little loss of information about the posterior distribution.


*  ANALYSIS OF THREE DATA SETS
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

Tomato vs. eggplant:
DOGANLAR et al. 2002 Down constructed a comparative map of tomato and eggplant consisting of 233 markers. Thinning their data set to choose one marker from each group in which the order was not resolved leads to a data set with 170 markers. On the basis of this map and comparisons with maps for potato (TANKSLEY et al. 1992 Down) and pepper (LIVINGSTONE et al. 1999 Down) they concluded: "Overall, eggplant and tomato were differentiated by 28 rearrangements, which could be explained by 23 paracentric inversions and five translocations during evolution from the species' last common ancestor" (DOGANLAR et al. 2002 Down, p. 1697).

As demonstrated in PARSIMONY METHODS, the parsimony solution for this comparison is 21 events for the six chromosomes and seven inversions for the other six or also a total of 28 using only ordinary translocations. Our Bayesian analysis produced 95% credible intervals of [5, 7], [21, 31], and [28, 37] for the number of translocations, inversions, and total number of events (respectively) separating tomato and eggplant. These results are from 6 unheated chains, each coupled to 4 heated chains. Each of these 30 chains was updated 459,000 times, using 75,000 sec of CPU time on a 1.5-GHz processor. The first 14,000 updates are excluded as burn-in. The posterior distribution for the number of translocations assigns probability 0.08172 to 5, 0.55407 to 6, 0.32137 to 7, 0.03832 to 8, and 0.00453 to 9 or more. Fig 1 gives the posterior distribution of the number of inversions, which has a mode of 25. Thus, even in the case of these two closely related species, the most likely number of inversions and translocations is somewhat higher than the parsimony estimates.



View larger version (13K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 1. Posterior distribution of the number of inversions for the tomato-eggplant comparison.

Fig 2 gives the posterior joint distribution of the rates of translocations {lambda}t and inversions {lambda}i. The mode of the distribution occurs at {lambda}t = 0.000219 and {lambda}i = 0.0194. To interpret these numbers we note that in the two genomes being compared there are an average of 30,271 possible translocations and 1335 possible inversions. Multiplying we see that the total rate is 6.629 for translocations and 25.899 for inversions. This rate assumes that the total evolutionary time between the two genomes has been scaled to be 1. Taking 12 million years as an estimate of the divergence between tomato and eggplant we arrive at rates of 0.276 and 1.078 per genome per million years for translocations and inversions.



View larger version (20K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 2. Posterior joint distribution of inversion and translocation rates for the tomato-eggplant comparison.

Human vs. cat:
MURPHY et al. 2000 Down created a radiation hybrid map of the cat genome integrating 424 genes with 176 microsatellite markers. Using resources on the National Center for Biotechnology Information (NCBI) home page, http://www.ncbi.nlm.nih.gov, we were able to locate the position of 281 of the genes in the human genome. From this set we deleted 12 singleton disruptions, i.e., isolated genes that map to a chromosome different from their neighbors. We do this because it is our belief that these genes are the result of duplications of small segments that recent studies (BAILEY et al. 2002A Down) have shown to occur frequently in the human genome. In support of this practice we note that none of the regions deleted are observed in the chromosome-painting experiments of WEINBERG et al. 1997 Down and those techniques are thought to be capable of detecting segments as small as 5 Mb.

Parsimony analysis shows that the marker order in the human genome can be transformed into the cat genome in 78 moves (14 translocations and 64 inversions). Our Bayesian analysis gives 95% credible intervals of [12, 15], [71, 89], and [85, 102] for the numbers of translocations, inversions, and total number of events, respectively. These results are from 6 unheated chains, each coupled to 3 heated chains. Each of these 24 chains was updated 2.2 million times, using 790,000 sec of CPU time. The first 306,000 updates are excluded as burn-in.

Note that the parsimony distance is not in the 95% credible interval for the total number of events. In fact, the posterior probability of a total of 78 events is ~2 x 10-5. The posterior distribution for the number of translocations assigns probability 0.5967 to 12, 0.3451 to 13, 0.0531 to 14, and 0.0050 to 15 or more, so this time the smallest value is the most likely. The posterior distribution of the number of inversions is given in Fig 3, with the mode being 79.



View larger version (13K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 3. Posterior distribution of the number of inversions for the human-cat comparison.

The extensive conservation of gene order between human and cat is a widely cited result; see, e.g., O'BRIEN et al. 1999 Down. However, in some cases this fact is exaggerated on the basis of a misunderstanding: BAILEY et al. 2002B Down(p. 83) say "Even human comparisons to distantly related mammals demonstrate strong conservation with the estimated number of rearrangements varying from 17 in felines to 180 in mice." Unfortunately, here they are comparing the number of inversions and translocations in felines on the basis of fluorescence in situ hybridization data, which cannot detect many inversions with the number of inversions and translocations in mice based on a fairly dense comparative map.

MURPHY et al. 1999A Down mapped 25 markers on the feline X chromosome and found that the marker order in humans is identical. In the radiation hybrid map of MURPHY et al. 2000 Down, complete conservation of marker order holds for the 8 markers on HSA 9 and FCA D4 but most other chromosomes show a number of inversions. Our estimate for human-cat comparison is about one-half of the one for human-cattle given below. It would be interesting to use three species comparisons to partition the events between lineages but here and in most other cases there are not enough shared markers to make inferences. BOURQUE and PEVZNER 2002 Down compared human, cat, and rat. They found 193 markers shared by the three species but when they discarded the ones for which they could not determine the orientation only 114 were left.

It is interesting to compare the number of translocations estimated with the result of chromosome-painting experiments of WEINBERG et al. 1997 Down. The syntenic segments (i.e., contiguous sets of markers that all map to the same chromosome) from chromosome painting correspond to those in the comparative map with one notable exception: HSA 12 has a segment homologous to part of FCA D3, which was not revealed by the original chromosome painting but was identified by the follow-up study of MURPHY et al. 1999B Down. If we add this segment to the chromosome painting then one needs 12 translocations and two inversions to transform one genome into the other.

Finally we compare our estimates with those of Nadeau and Taylor. There are 83 conserved segments with two or more markers with an average length of 16.3 Mb, which results in an estimated average length of 18.2 Mb. Using 3.2 Gb for the length of the human genome yields an estimate of 175.8 segments. Subtracting 22 chromosomes in the human genome, we arrive at an estimate of 155.8 disruptions or an estimate of 77.4 events. The reader should note that not only is this lower than the parsimony and Bayesian estimates but also the method of Nadeau and Taylor includes events that are not visible with our set of markers.

Fig 4 gives the posterior joint distribution of the rates of translocations {lambda}t and inversions {lambda}i. The mode of the distribution occurs at {lambda}t = 0.000161 and {lambda}i = 0.0350. To interpret these numbers we note that in the two genomes being compared there are an average of 79,650 possible translocations and 2370 possible inversions. Multiplying we see that the total rate is 12.82 for translocations and 82.95 for inversions. Taking 100 million years as an estimate for the divergence between humans and cats we arrive at rate estimates of 0.0641 and 0.415 per genome per million years for translocations and inversions.



View larger version (18K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 4. Posterior joint distribution of inversion and translocation rates for the human-cat comparison.

Human vs. cattle:
Band et al. 2000 Down constructed a radiation hybrid map of cattle (Bos taurus) with a total of 638 genes. For the data see http://bos.cvm.tamu.edu/htmls. Using resources on the NCBI home page, we were able to locate the position of 448 of the genes in the human genome. Deleting 24 singleton disruptions for the reasons indicated above results in a map with 422 markers. Again the reduced map is consistent with the results of chromosome painting (see HAYES 1995 Down; CHOWDHARY et al. 1996 Down).

Parsimony analysis shows that the marker order in the human genome can be transformed into the cattle genome in 155 moves (20 translocations and 135 inversions). Our Bayesian approach experienced convergence problems in this example. Four unheated chains were each coupled to eight heated chains. Two of these sets of chains ran for 1.3 million updates and the other two sets for 1.5 million updates, using a total of 2.6 million CPU seconds. Our usual criterion for the end of burn-in was not attained; a burn-in of 600,000 updates is used for the plots shown.

Fig 5 and Fig 6 give the four posterior distributions for the number of translocations and inversions. In Fig 5 there is a considerable difference between the chains, indicating convergence problems. The qualitative differences between chains in the number of inversions are not as great as in the case of translocations. The modes are all in the range 185–191 but the variance differs considerably from run to run. We cannot make statements with much confidence about the number of inversions and translocations. However, two things are clear: (a) The number of events is roughly twice that in the human vs. cat comparison even though the divergence times are similar and (b) our conclusions differ considerably from those of Band et al. 2000 Down(p. 1359), who say that their "comparative map suggests that 41 translocation events and a minimum of 54 internal rearrangements have occurred." They do not explain how they reached this conclusion. However, we would require a larger number of translocations if we had not deleted the singletons and they would underestimate the number of inversions if they used the breakpoint distance.



View larger version (17K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 5. Posterior distributions for the number of translocations in four runs of the human-cattle comparison.



View larger version (16K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 6. Posterior distributions for the number of inversions in four runs of the human-cattle comparison.

To estimate distances using the methods of Nadeau and Taylor we note that there are 125 conserved segments with 2 or more markers with an average length of 7.19 Mb, which results in an estimated average length of 7.57 Mb. Using 3.2 Gb for the length of the human genome yields an estimate of 422.7 segments. Subtracting 22 chromosomes in the human genome gives an estimate of 400.7 disruptions or an estimate of 200 events. This is somewhat larger than our Bayesian estimate, but it is consistent with the fact that our estimate is restricted to the events that can be detected by the 422 markers in our map.


*  CONCLUSIONS
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

Elucidating the evolutionary history of genomes is one of the major goals of comparative genomics. We have shown that a full probabilistic approach to the problem is feasible, but the method experiences difficulty with data sets that have had a large number of events (inversions and translocations). For the human-cat and human-cattle comparisons, the probability that the true number of events is close to the minimum number is very small, so in these cases the parsimony estimate is not reliable. The breakpoint distance, which is the number of disruptions/2, is even worse.

Our analysis of the three data sets has revealed a consistent pattern in which inversions are four to seven times more frequent than translocations. Our ratio is larger than that of MURPHY et al. 2000 Down, who conclude that "the ordered gene maps of the cat and cow genomes show approximately twice as many intrachromosomal (inversions) as interchromosomal changes" (MURPHY et al. 2001 Down, p. 4). There are three reasons for this discrepancy: (i) Biologists use the breakpoint distance, which for these maps underestimates the minimum number of inversions; (ii) our Bayesian estimate is typically considerably larger than the minimum distance; and (iii) as genetic maps become denser more inversions can be seen.

Our methods can be extended to a model in which inversion probabilities depend upon tract length. This generalization is important if we are going to accurately extrapolate from the number of inversions seen by a collection of markers to the number that have actually occurred in the history of a genome. A second extension we would like to pursue is to incorporate gene duplications to treat the entire data set and obtain rate estimates for small segmental duplications.


*  ACKNOWLEDGMENTS

All three authors were partially supported by grant no. 0201037 from a joint program of the National Science Foundation and National Institute of General Medical Sciences.

Manuscript received June 9, 2003; Accepted for publication October 1, 2003.


*  APPENDIX
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

To explain the solution for the eggplant-tomato data set, we call the edges that result from commas in genome 1 black edges and those from genome 2 gray edges. On each step we choose an inversion or translocation involving two gray edges to increase the number of connected components by 1. It is easy to check that this can be done only by picking two edges in the same cycle.

Some of the cycles are easy to deal with. Using dashes to indicate gray edges, flipping segment 17 turns 1007 36-32 33-35 34-2009 into 1007 36-32 33-2009 and 35 = 34, where = indicates a black and a gray edge now connecting 34 and 35. Flipping -18 -17 now we have 1007 36-2009 and 32 = 33. Of course we could have also flipped -18 17 to -17 18 and then flipped -17. Similarly, 1009 44-42 43-40 41-2002 is eliminated by changing 21 -22 -20 to -22 -21 -20 with two inversions and 1011 54-49 48-52 53-51 50-2011 is eliminated by changing 24 -26 27 25 to 24 25 26 27 with three inversions.

To follow the solution on page 1707 of DOGANLAR et al. 2002 Down we also do two inversions to turn 1 -5 2 to 1 2 5, flip the order of 14 11 -15 3, flip -19, and perform a reciprocal translocation of T4 and T10. After these 11 events the situation is given by the arrangement called step 1 at the top of Table A11. Twenty-three components are now in the breakpoint graph, so the distance has been reduced to 10. At this point DOGANLAR et al. 2002 Down invoke a nonreciprocal translocation to put 3 4 in the right place. In our scheme this requires two translocations. Written in terms of the cycles the moves are 29-5 with 4-9 and then 29-9 with 8-2006. In terms of the chromosomes we first make 1 2 3 4 and -6 -5 15 -11 -14 9 and then 1 2 3 4 5 6 and 15 -11 -14 9.

Performing translocations 27-19 with 26-29 and 31-39 with 38-47 and then performing two inversions to turn -12 23 to 23 12 lead to the arrangement called step 2. After these six changes the distance is 4. To finish up we flip -9 then perform the translocation 46-23 with 31-47 to make 23 24 25 26 27 and -18 -17 -16 12 13 14 11 -15. Flipping 11 -15 to 15 -11 and then 12 13 14 15 we have E10 written backward and we are done.


 
View this table:
[in this window]
[in a new window]

 
Table A1. Intermediate points in our version of the path of DOGANLAR et al. 2002 Down, which does many of the inversions at the beginning


*  LITERATURE CITED
*TOP
*ABSTRACT
*PARSIMONY METHODS
*A BAYESIAN APPROACH
*ANALYSIS OF THREE DATA...
*CONCLUSIONS
*APPENDIX
*LITERATURE CITED

BAFNA, V. and P. PEVZNER, 1995  Sorting by reversals: genome rearrangement in plant organelles and evolutionary history of X chromosome. Mol. Biol. Evol. 12:239-246.

BAILEY, J. A., Z. GHU, R. A. CLARK, K. REINERT, and R. V. SAMONTE et al., 2002a  Recent segmental duplications in the human genome. Science 297:1003-1007.[Abstract/Free Full Text]

BAILEY, J. A., A. M. YAVOR, L. VIGGIANO, D. MISCEO, and J. E. HORVATH et al., 2002b  Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70:83-100.[CrossRef][Medline]

BAND, M. J., J. H. LARSON, M. REBEIZ, C. A. GREEN, and D. W. HEYEN et al., 2000  An ordered comparative map of the cattle and human genomes. Genome Res. 10:1359-1368.[Abstract/Free Full Text]

BOURQUE, G. and P. A. PEVZNER, 2002  Genome-scale evolution: reconstructing gene orders in ancestral species. Genome Res. 12:26-36.[Abstract/Free Full Text]

CHOWDHARY, B. P., L. FRONICKE, I. GUSTAVSSON, and H. SCHERTHAN, 1996  Comparative analysis of cattle and human genomes: detection of ZOO-FISH and gene mapping-based chromosomal homologies. Mamm. Genome 7:297-302.[CrossRef][Medline]

DOGANLAR, S., A. FRARY, M. C. DAUNAY, R. N. LESTER, and S. D. TANKSLEY, 2002  A comparative genetic linkage map of eggplant (Solanum melongea) and its implications for genome evolution in the Solanaceae. Genetics 161:1697-1711.[Abstract/Free Full Text]

DUTRILLAUX, B., 1979  Chromosomal evolution in primates: tentative phylogeny from Microcebus murinus (prosimian) to man. Hum. Genet. 48:251-314.[CrossRef][Medline]

GELMAN, A. and D. B. RUBIN, 1992  Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci. 7:457-511.

HAIG, D., 1999  A brief history of human autosomes. Philos. Trans. R. Soc. Lond. Ser. B 354:1447-1470.[CrossRef][Medline]

HANNENHALLI, S., and P. A. PEVZNER, 1995a Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), pp. 178–189 in Proceedings of the 27th Annual ACM Symposium on the Theory of Computing. ACM Digital Library.

HANNENHALLI, S., and P. A. PEVZNER, 1995b Transforming men into mice (polynomial algorithm for the genomic distance problem), pp. 581–592 in Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science. Institute of Electrical and Electronics Engineers, New York.

HAYES, H., 1995  Chromosome painting with human chromosome-specific DNA libraries reveals the extent and distribution of conserved segments in bovine chromosomes. Cytogenet. Cell. Genet. 71:168-174.[Medline]

LIVINGSTONE, K. D., V. K. LACKNEY, J. R. BLAUTH, R. VAN WIJK, and M. K. JAHN, 1999  Genome mapping in Capsicum and the evolution of genome structure in the Solanaceae. Genetics 152:1183-1202.[Abstract/Free Full Text]

MURPHY, W. J., S. SUN, Z. Q. CHEN, J. PECON-SLATTERY, and S. J. O'BRIEN, 1999a  Extensive conservation of sex chromosome organization as revealed by parallel radiation hybrid mapping. Genome Res. 9:1223-1230.[Abstract/Free Full Text]

MURPHY, W. J., M. MENOTTI-RAYMOND, L. A. LYONS, M. A. THOMPSON, and S. J. O'BRIEN, 1999b  Development of a feline whole genome radiation hybrid panel and comparative mapping of human chromosome 12 and 22 loci. Genomics 57:1-8.[CrossRef][Medline]

MURPHY, W. J., S. SHAN, Z. Q. CHEN, J. PECON-SLATTERY, and N. YUKHI et al., 2000  A radiation hybrid map of the cat genome implications for comparative mapping. Genome Res. 10:691-702.[Abstract/Free Full Text]

MURPHY, W. J., R. STANYON, and S. J. O'BRIEN, 2001  Evolution of mammalian genome organization inferred from comparative mapping. Genome Biol. 2:1-8.

NADEAU, J. H. and B. A. TAYLOR, 1994  Lengths of chromosomal segments conserved since the divergence of man and mouse. Proc. Natl. Acad. Sci. USA 81:814-818.

O'BRIEN, S. J., M. MENOTTI-RAYMOND, W. J. MURPHY, W. G. NASH, and J. WIENBERG et al., 1999  The promise of comparative genomics in mammals. Science 286:458-465.[Abstract/Free Full Text]

RANZ, J. M., F. CASALS, and A. RUIZ, 2001  How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila. Genome Res. 11:230-239.[Abstract/Free Full Text]

SHARAKHOV, I. V., A. C. SERAZIN, O. G. GRUSHKO, A. DANA, and N. LOBO et al., 2002  Inversions and gene order shuffling in Anopheles gambiae and A. funestrus.. Science 298:182-185.[Abstract/Free Full Text]

TANKSLEY, S. D., M. W. GANAL, J. P. PRINCE, M. C. DE VICENTE, and M. W. BONIERBALE et al., 1992  High density molecular linkage maps of the tomato and potato genomes. Genetics 132:1141-1160.[Abstract]

WEINBERG, J., R. STANYON, W. G. NASH, P. C. M. O'BRIEN, and F. YANG et al., 1997  Conservation of human vs. feline genome organization revealed by reciprocal chromosome painting. Cytogenet. Cell. Genet. 72:211-217.

YORK, T. L., R. DURRETT, and R. NIELSEN, 2002  Bayesian estimation of inversions in the history of two chromosomes. J. Comp. Biol. 9:808-818.




This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Lin and B. M.E. Moret
Estimating true evolutionary distances under the DCJ model
Bioinformatics, July 1, 2008; 24(13): i114 - i122.
[Abstract] [PDF]


Home page
Genome Res.Home page
K. Yogeeswaran, A. Frary, T. L. York, A. Amenta, A. H. Lesser, J. B. Nasrallah, S. D. Tanksley, and M. E. Nasrallah
Comparative genome analyses of Arabidopsis spp.: Inferring chromosomal rearrangement events in the evolutionary history of A. thaliana
Genome Res., April 1, 2005; 15(4): 505 - 515.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Durrett, R.
Right arrow Articles by York, T. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Durrett, R.
Right arrow Articles by York, T. L.


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2004 by the Genetics Society of America.