Abstract
Genomic divergence between species can be quantified in terms of the number of chromosomal rearrangements that have occurred in the respective genomes following their divergence from a common ancestor. These rearrangements disrupt the structural similarity between genomes, with each rearrangement producing additional, albeit shorter, conserved segments. Here we propose a simple statistical approach on the basis of the distribution of the number of markers in contiguous sets of autosomal markers (CSAMs) to estimate the number of conserved segments. CSAM identification requires information on the relative locations of orthologous markers in one genome and only the chromosome number on which each marker resides in the other genome. We propose a simple mathematical model that can account for the effect of the nonuniformity of the breakpoints and markers on the observed distribution of the number of markers in different conserved segments. Computer simulations show that the number of CSAMs increases linearly with the number of chromosomal rearrangements under a variety of conditions. Using the CSAM approach, the estimate of the number of conserved segments between human and mouse genomes is 529 ± 84, with a mean conserved segment length of 2.8 cM. This length is <40% of that currently accepted for human and mouse genomes. This means that the mouse and human genomes have diverged at a rate of ∼1.15 rearrangements per million years. By contrast, mouse and rat are diverging at a rate of only ∼0.74 rearrangements per million years.
After a speciation event, descendant genomes may diverge in overall structure as a result of intra- and interchromosomal rearrangements. Each rearrangement reduces the structural homology between the two genomes, while increasing the number of homologous chromosomal fragments. These homologous, but repositioned, fragments are referred to as conserved segments between the genomes compared. The genomic divergence due to chromosomal rearrangements between species can be estimated in terms of the number of conserved segments between their genomes (e.g., Sankoff and Nadeau 1996; Ehrlich et al. 1997).
Conserved segments are identified by examining the relative order of contiguous landmarks in the chromosomes of the species being compared. Protein coding genes are frequently used as landmarks because they are numerous and their orthology relationships can be determined with great certainty even among distantly related species because of high levels of protein sequence conservation. For most organisms, however, genomic map information for only a fraction of these landmarks is currently available. As a consequence, many conserved segments are not “visible.” A second reason for a conserved segment being unobserved is that there may simply be no identifying markers in those regions in one or both genomes.
Historically, genomic map information has consisted mostly of the knowledge of the chromosome number for a given gene in a genome. With recent genome-sequencing efforts and better mapping techniques, information is becoming available on the relative order as well as the actual physical location of genes. However, this progress has been made for only a select group of species. For these reasons, approaches that require only the chromosome number for genes (conserved synteny approach) continue to be used (e.g., Bengtsson et al. 1993; Zakharov et al. 1995; Sankoff and Nadeau 1996). A chromosome from one species is said to have a “conserved synteny” with a chromosome from another species if they have one or more markers in common. Thus, these measures require only knowledge of the chromosome number for the markers in both genomes.
However, statistical approaches utilizing conserved synteny data are known to have serious shortcomings. In particular, they provide only a lower bound on the number of observable conserved segments, and the number of conserved syntenies between two genomes cannot exceed c_{a} × c_{b}, where c_{a} and c_{b} are the number of chromosomes in species a and b, respectively (Sankoff and Nadeau 1996).
For these reasons, more extensive genome map information needs to be used whenever available. The ideal approach for estimating the total number of conserved segments is to use conserved linkage data, which requires the knowledge of the relative order of markers in both genomes. This requirement, however, makes this approach impractical. For instance, the map location for most known human genes is still in the form of the cytogenetic band they reside in, and even for the more precisely mapped mouse genome, the relative gene order is known for only a few thousand genes (see Blake et al. 2000). In short, it is not yet possible to find many pairs of mammalian species for which a substantial number of conserved linkages can be identified. Therefore, we use an approach intermediate to the conserved synteny and conserved linkage approaches, which utilizes currently available information more effectively than the conserved synteny approach. In our intermediate approach, we use contiguous sets of autosomal markers (CSAMs). A CSAM is an uninterrupted set of markers in one genome (primary genome) that are syntenic in the other genome (secondary genome; Figure 1). Therefore CSAMs can be identified using relative marker order information in one genome and the chromosome number of those markers in the other genome. This intermediate approach differs from others (e.g., Nadeau and Taylor 1984; Waddington et al. 2000) in that no information is required about the physical distances between markers, which are not yet known with precision for most genomes.
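The difference between counting CSAMs and counting conserved syntenies can be illustrated with a short sketch. It assumes that markers are supplied as (primary chromosome, secondary chromosome) pairs already sorted by position along the primary genome; the function names and the toy data are hypothetical and are not taken from the data sets analyzed here.

```python
from itertools import groupby

def csam_sizes(markers):
    """Return the number of markers in each CSAM, in primary-genome order.

    markers: list of (primary_chrom, secondary_chrom) tuples, sorted by
    position along the primary genome.
    """
    # A CSAM is a maximal run of adjacent primary-genome markers that lie on
    # the same primary chromosome and map to the same secondary chromosome.
    return [len(list(run)) for _, run in groupby(markers)]

def count_conserved_syntenies(markers):
    """Count distinct (primary, secondary) chromosome pairs sharing a marker."""
    return len(set(markers))

# Toy example: seven markers on two primary chromosomes.
example = [
    ("1", "A"), ("1", "A"), ("1", "B"), ("1", "A"),
    ("2", "A"), ("2", "A"), ("2", "C"),
]
print(csam_sizes(example))                 # [2, 1, 1, 2, 1]
print(count_conserved_syntenies(example))  # 4
```

In this toy case the pair (1, A) contributes two separate CSAMs but only one conserved synteny, which is why the synteny count can only be a lower bound.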
In this article we discuss the relationship of the number of CSAMs with the number of rearrangements and present a mathematical formulation for estimating the number of unobserved CSAMs. We also show the usefulness of our approach by computer simulation and empirical data analyses with data from human, mouse, and rat genomes.
A MODEL FOR CSAM SIZE DISTRIBUTION
Let us consider the entire set of autosomal chromosomes linked together in a linear head-to-tail fashion to form a superchromosome, whose total length is denoted by C. Let there be n conserved segments in this genome, and m genes (markers) residing on these n conserved segments. If we consider each conserved segment to be a separate bin, irrespective of its true physical length (which is not known beforehand), the probability of observing k genes in a given conserved segment can be described by the Poisson distribution,
Here we are proposing to use a gamma distribution to model an overdispersed Poisson distribution that is needed to describe counts of genes in a conserved segment. In this case, a more realistic distribution of genes and nonuniformity of breakpoints is modeled by the shape parameter, α. (β is a scaling factor.) It is worth noting that this gamma distribution is intended to model the observed data, which is affected by the extent of marker sampling, different types of chromosomal rearrangements (and their unknown relative proportions), and unknown differences in marker densities throughout the genome. For this reason, α is not a fundamental biological quantity, but is a descriptor of the observed data.
From Equations 1 and 2, it can be shown that the number of genes in a given conserved segment is expected to follow the negative binomial distribution
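The step from the Poisson and gamma assumptions to the negative binomial can be written out as a standard Poisson-gamma mixture. The sketch below is for orientation only: the symbol λ for the expected number of genes in a segment, the name f for the gamma density, and the treatment of β as a rate parameter (so that the mean of λ is α/β) are assumptions made here.

```latex
\begin{align}
P(k \mid \lambda) &= \frac{e^{-\lambda}\lambda^{k}}{k!}, \\
f(\lambda) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,
              \lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda > 0, \\
P(k) &= \int_{0}^{\infty} P(k \mid \lambda)\, f(\lambda)\, d\lambda
      = \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}
        \left(\frac{\beta}{1+\beta}\right)^{\alpha}
        \left(\frac{1}{1+\beta}\right)^{k}, \qquad k = 0, 1, 2, \ldots
\end{align}
```

Under this parameterization the mean number of genes per segment is α/β and the variance is α/β + α/β^{2}, so smaller values of α correspond to stronger overdispersion relative to the Poisson case.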
In Equation 3, there are n_{0} zero-gene segments that cannot be observed. Therefore, we use a truncated negative binomial distribution to estimate n,
n can be estimated by equating the observed and expected values of the first and second moments. For the truncated negative binomial distribution given in (4), the first two moments about zero are given by
If \bar{k} and \overline{k^{2}} are the observed first and second moments, respectively, and we define
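Assuming the gamma mixing density has shape α and rate β (an assumption made for illustration, since β is described only as a scaling factor), the zero-truncated distribution and its first two moments about zero take the following standard forms, where P_{0} denotes the probability of a zero-gene segment.

```latex
\begin{align}
P_{0} &= \left(\frac{\beta}{1+\beta}\right)^{\alpha}, \\
P(k \mid k \ge 1) &= \frac{1}{1-P_{0}}\,
   \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}
   \left(\frac{\beta}{1+\beta}\right)^{\alpha}
   \left(\frac{1}{1+\beta}\right)^{k}, \qquad k = 1, 2, \ldots, \\
E[k \mid k \ge 1] &= \frac{\alpha}{\beta\,(1-P_{0})}, \\
E[k^{2} \mid k \ge 1] &= \frac{\alpha\,(1+\alpha+\beta)}{\beta^{2}\,(1-P_{0})}.
\end{align}
```

Because n(1 − P_{0}) is the expected number of observed segments, these expressions tie the observed moments to n, α, and β.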
The estimate of n can be obtained by iteration. Sometimes the iterative procedure fails to converge, e.g., when a large number of unobserved segments needs to be estimated using a relatively small number of markers. In this case, we suggest smoothing of the observed segment size distribution (e.g., by taking a moving average with a window of size three) and truncating the tail (which often contains segments with functionally linked genes).
Once n is estimated, then the parameter modeling the nonrandomness of the gene distribution (α) and the scale parameter, β, are given by
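One way to carry out this moment matching numerically is sketched below. It does not reproduce the closed-form expressions of this article; it assumes a gamma rate parameterization (mean α/β), solves n(1 − P_{0}) = (number of observed segments) for n by bisection, and recovers α and β from the untruncated mean m/n and second moment. The function name, tolerances, and example size list are hypothetical.

```python
def estimate_segments(sizes, tol=1e-9, max_doublings=40):
    """Moment-based estimate of (n, alpha, beta) from observed segment sizes.

    sizes: number of markers in each observed segment (all >= 1).
    Raises ValueError when no solution is found (non-convergence).
    """
    n_obs = len(sizes)
    m = sum(sizes)                   # total number of markers
    s2 = sum(k * k for k in sizes)   # sum of squared segment sizes
    if s2 <= m:
        raise ValueError("all segments have one marker; alpha is not estimable")

    def params(n):
        mu = m / n                        # untruncated mean, alpha / beta
        excess = s2 / n - mu * mu - mu    # variance minus mean, alpha / beta**2
        beta = mu / excess
        return mu * beta, beta            # (alpha, beta)

    def gap(n):
        alpha, beta = params(n)
        p0 = (beta / (1.0 + beta)) ** alpha   # probability of an empty segment
        return n * (1.0 - p0) - n_obs         # expected minus observed count

    # n must be at least n_obs and large enough that variance exceeds mean.
    lo = max(float(n_obs), (m * m / (s2 - m)) * (1.0 + 1e-6))
    if gap(lo) >= 0:
        raise ValueError("no solution: data are not overdispersed enough")
    hi = 2.0 * lo
    for _ in range(max_doublings):
        if gap(hi) > 0:
            break
        hi *= 2.0
    else:
        raise ValueError("no solution: iteration does not converge")

    while hi - lo > tol * hi:             # bisection on n
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    n = 0.5 * (lo + hi)
    alpha, beta = params(n)
    return n, alpha, beta

# Hypothetical observed size distribution: 100 segments, 234 markers.
example_sizes = [1] * 40 + [2] * 25 + [3] * 15 + [4] * 10 + [5] * 5 + [6] * 3 + [8] * 2
n_hat, alpha_hat, beta_hat = estimate_segments(example_sizes)
print(round(n_hat, 1), round(alpha_hat, 2), round(beta_hat, 2))
```

The ValueError branches correspond to the situation described above in which the iteration fails, for example when the observed distribution is too close to L-shaped for the available number of markers.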
Now, given n and the number of observed CSAMs, the number of unobserved CSAMs (n_{0}) can be estimated. The standard error of the estimates can be obtained by the bootstrap procedure (Efron and Tibshirani 1993) in which the CSAMs are sampled with replacement. We suggest the resampling of CSAMs rather than markers, because the CSAM is the unit of measure. The same statistical estimation method can be applied to conserved syntenies, CSAMs, and conserved linkages.
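The resampling scheme can be sketched as follows. This is a minimal illustration, assuming the estimator is any function that maps a list of CSAM sizes to a number; the function name, the number of replicates, and the toy data are hypothetical.

```python
import random
import statistics

def bootstrap_se(sizes, estimator, n_boot=1000, seed=0):
    """Standard error of estimator(sizes), resampling CSAMs with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        # Resample whole CSAMs (not individual markers), keeping the
        # number of CSAMs fixed at the observed value.
        resample = rng.choices(sizes, k=len(sizes))
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

# Toy example using the mean CSAM size as the estimator.
toy_sizes = [1, 1, 1, 2, 2, 3, 4, 6, 9]
se_mean = bootstrap_se(toy_sizes, statistics.mean, n_boot=500)
print(se_mean)
```

In practice the estimator would be the procedure that returns n; replicates for which that procedure fails to converge would need to be handled explicitly, which this sketch does not attempt.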
Once the number of conserved segments has been estimated, an approximate (and conservative) estimate of the number of rearrangements can be obtained by R = (1/2)(n − max(c_{a}, c_{b})), where c_{a} and c_{b} are the number of chromosomes in species a and b, respectively, and n is the estimate of total number of conserved segments (observed + unobserved). The factor of 1/2 is based on our computer simulations involving different proportions of different types of intra- and interchromosomal rearrangements (Table 1).
Note that the Sankoff and Nadeau (SN) model (Sankoff and Nadeau 1996) is not a direct special case of our model, although results with α = 1 in our model produce estimates of n that are close to those obtained using their model (Sankoff and Nadeau 1996, p. 251). There have been some other recent sophisticated approaches to estimate the number of conserved segments (e.g., Burt et al. 1999; Waddington et al. 2000). Waddington et al. assume that smaller chromosomes contain fewer conserved segments, and larger ones contain more, and develop a model that accounts for chromosome size differences. However, it is not clear that there is a necessary correlation between chromosome size and the number of conserved segments as the chromosome sizes can change considerably during the evolutionary history of species (e.g., by simple fission) at different times. Also the empirical data do not seem to support this assumption (see Figure 5). Our model avoids making these types of assumptions.
COMPUTER SIMULATION
To assess the usefulness of our new approach and compare it to other approaches, we conducted computer simulations in the following manner. The process begins by creating a genome consisting of c chromosomes of specified lengths, with the location of centromeres assigned randomly. Positions of the given number (m) of genomic markers are then determined at random either under a uniform distribution (U) or clumped distribution (N) of marker density. Under the “clumped” scheme, a probability p and a distance d are specified for the clumping (we used p = 0.5 and d = 0.04 cM). With probability p, each marker is selected to be within distance d of the previously chosen marker on that chromosome arm, and with probability (1 − p) it is chosen from a uniform density over the entire chromosomal arm. A specified number of rearrangements is then applied to this set of chromosomes to produce the second genome.
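The two marker-placement schemes can be sketched for a single chromosome arm as follows. This is an illustration under stated assumptions rather than the simulation program itself: "within distance d" is interpreted as a uniform draw on the interval of half-width d around the previous marker, clipped to the arm, and the function name and arm length are hypothetical.

```python
import random

def place_markers(m, arm_length, p=0.0, d=0.04, seed=None):
    """Return m sorted marker positions (in cM) on one chromosome arm.

    p = 0 reproduces the uniform scheme (U); p > 0 gives the clumped
    scheme (N), with clumping probability p and clumping distance d.
    """
    rng = random.Random(seed)
    positions = []
    previous = None
    for _ in range(m):
        if previous is not None and rng.random() < p:
            # Clumped draw: within distance d of the previously chosen
            # marker, clipped so that the marker stays on the arm.
            low = max(0.0, previous - d)
            high = min(arm_length, previous + d)
            position = rng.uniform(low, high)
        else:
            # Uniform draw over the entire arm.
            position = rng.uniform(0.0, arm_length)
        positions.append(position)
        previous = position
    return sorted(positions)

uniform_markers = place_markers(50, 75.0, p=0.0, seed=1)
clumped_markers = place_markers(50, 75.0, p=0.5, d=0.04, seed=1)
print(len(uniform_markers), len(clumped_markers))  # 50 50
```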
Simulating the process of rearrangement requires the selection of breakpoints for excision and insertion of chromosomal segments. In both cases, the breakpoints are chosen restricted to the chosen arm to ensure that each resultant chromosome has exactly one centromere. Breakpoints were also selected using a uniform (U) or a nonuniform (N) distribution. For the uniform chromosomal breakage scheme, the density function for selection of the breakpoint is uniform over the entire arm of the chromosome. In the nonuniform case, the assumption of uniformity is relaxed by specifying an initial length and a breakpoint weight, which is roughly proportional to the current chromosomal length and the observed breakpoint rate of the first 20 autosomal chromosomes in humans (Table 2). As each rearrangement is applied, the chromosomal set is updated to reflect the new position of the segment(s), while retaining identification of the original source chromosome. This process is repeated for a specified number of rearrangements. At the end of this process, each chromosome consists of an ordered list of segments, along with their lengths, origins, and other information. At this point markers are sprinkled onto the original chromosomes, as described above, and their evolutionary trajectories are computed. This information was used in this study for comparing true and estimated values of desired quantities.
Chromosomal rearrangement can take place in various ways. In this article, we consider both inter- and intrachromosomal rearrangements. The interchromosomal rearrangements considered were simple translocations (an end, or terminal, piece of one chromosome breaks off and attaches itself to the end of another chromosome), reciprocal translocations (two chromosomes exchange portions from their respective ends or terminals), and intercalary transpositions (the movement is from a nonterminal piece of one chromosome to a nonterminal position on another chromosome). Among the intrachromosomal rearrangements we considered simple transpositions (a fragment moves from one part of a chromosome to another part of the same chromosome) and in-place inversions. The simulation results we present here are from either 100% reciprocal translocations (U) or a particular mix of rearrangements (N), consisting of 10% intercalary transpositions, 50% reciprocal translocations, 20% simple intrachromosomal transpositions, and 20% in-place inversions.
Computer simulations denoted by UUU refer to the cases where the marker distribution was uniform, the breakpoint distribution was uniform, and the rearrangements were entirely reciprocal interchromosomal translocations. We use NNN to denote the cases in which the marker distribution was clumped (with probability = 0.5 that a given marker was to be located within 0.04 cM of the previous marker), the breakpoint distribution was based on the human chromosome size and breakpoint distribution as given in Table 2, and the rearrangement mix was in the proportions given in the previous paragraph. For brevity, extensive results from other intermediate combinations are not presented and are discussed when necessary.
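A heavily simplified version of the reciprocal-translocation scheme can be sketched as follows. It is not the simulation program used in this study: centromeres are ignored, markers are evenly spaced, breakpoints are chosen uniformly among the gaps between markers rather than by physical position, and all function names are hypothetical. Each marker is labeled by its chromosome of origin, so CSAMs are runs of identical labels along the rearranged chromosomes.

```python
import random
from itertools import groupby

def reciprocal_translocation(genome, rng):
    """Swap the terminal portions of two randomly chosen chromosomes in place."""
    a, b = rng.sample(range(len(genome)), 2)
    i = rng.randint(1, len(genome[a]) - 1)   # breakpoint on chromosome a
    j = rng.randint(1, len(genome[b]) - 1)   # breakpoint on chromosome b
    genome[a], genome[b] = (genome[a][:i] + genome[b][j:],
                            genome[b][:j] + genome[a][i:])

def count_csams(genome):
    """Count runs of markers sharing the same chromosome of origin."""
    return sum(len(list(groupby(chrom))) for chrom in genome)

def simulate(c=20, genes_per_chrom=50, rearrangements=100, seed=0):
    """Return the rearranged genome as lists of origin labels."""
    rng = random.Random(seed)
    genome = [[k] * genes_per_chrom for k in range(c)]
    for _ in range(rearrangements):
        reciprocal_translocation(genome, rng)
    return genome

for r in (0, 50, 100, 200):
    print(r, count_csams(simulate(rearrangements=r)))
```

Because each translocation introduces two breakpoints, the CSAM count in this sketch can grow by at most two per rearrangement; it grows by less when a breakpoint falls on an existing boundary or when no marker separates two breakpoints.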
RESULTS
Temporal distribution of conserved CSAM sizes: Figure 2 shows the expected patterns of generation of CSAMs (A and B) and conserved syntenies (C and D) with increasing number of chromosomal rearrangements, for UUU and NNN cases (solid circles). It is clear from Figure 2 that CSAMs accumulate linearly with increasing number of rearrangements and that additional CSAMs continue to accumulate at the same rate even when the number of rearrangements is very large. (The number of CSAMs per rearrangement is approximately two even for a very large number of rearrangements.) Furthermore, these relationships generally hold for both UUU and NNN cases. The slight discrepancy in the initial stages for the number of CSAMs per rearrangement (Figure 2B) is due to the inclusion of inversions in the rearrangement types (Table 1). In the initial stages, inversions do not result in new conserved segments. In the later stages, however, inversions may create more than one new segment, as, for example, if the fragment that is inverted is straddling two different conserved segments before it gets inverted. The consequences of these scenarios are reflected in Figure 2B. Our computer simulations also showed that CSAMs, although intermediate between conserved syntenies and conserved linkages, underestimate the number of conserved linkages by only 5–15%.
As expected, the number of conserved syntenies does not increase linearly with the number of rearrangements. Even though the shape of the curve in Figure 2C (solid circles) suggests weak linearity in the early portion, the number of syntenies per rearrangement (open circles) declines quickly even in this portion. As expected, the number of syntenies shows an upper bound of 400 because the genomes compared contain 20 chromosomes each. This nonlinear relationship of the true number of conserved syntenies with the number of rearrangements means that even the perfect estimation of all unobserved conserved syntenies will produce biased (lower) estimates of the number of conserved segments. In our simulations we also computed the Q value of Bengtsson et al. (1993) and found that this statistic behaves in a manner similar to conserved syntenies, as it uses the number of observed conserved syntenies (results not shown).
Estimation of the number of conserved segments: Using landmarks such as genes we can count the number of observed conserved segments, e.g., by counting the number of CSAMs containing at least one marker. However, as mentioned earlier, the number of unobserved segments needs to be estimated. Accurate estimation of this quantity is important to compute the genomic distance between the two species.
The accuracy with which the number of unobserved conserved segments can be estimated depends upon the accuracy with which the histogram of the observed conserved segments can be modeled. Figure 3 shows the temporal changes in the expected histogram of conserved segments (CSAMs) obtained from computer simulation (NNN). Before rearrangement, both species have 20 chromosomes with 50 genes each, and thus 20 conserved segments between the two species, each of size 50. With each rearrangement, the number of conserved segments increases and, consequently, the average segment size decreases. Obviously, when the number of rearrangements is small there are many large-sized segments, due to historical reasons. With time the number of rearrangements increases, which increases the number of small segments and reduces the number of large segments (Figure 3, A–C). In fact, Figure 3A shows that when there are only 50 rearrangements, there are 3 segments that are still of size 50; that is, they have not yet been broken up. These segments quickly get broken up, however, as the number of rearrangements increases (Figure 3, B and C).
For comparison, each panel of Figure 3 also shows the fit of the gamma model with the best-fit α-value and that of the gamma model with α = 1, which approximates the SN model closely. The gamma model fits the observed conserved segments better, especially for CSAMs containing small numbers of markers. For the complete uniform (UUU) case, fixed (α = 1) as well as best-fit gamma models provide equally good fits to the observed distributions, as expected (results not shown).
As mentioned earlier, Equation 4 can be used to estimate the number of unobserved CSAMs. This involves estimation of α, which is generally a difficult problem when the 0 category is not available and the true value of α is small due either to the availability of only a small number of markers or because of a large number of rearrangements. This is because the probability distribution becomes L-shaped. All else being equal, increased nonuniformity of marker or breakpoint distribution also reduces α. If the value of α is fixed a priori (as in the case of our approximation of the SN model), estimation of n_{0}, the number of unobserved CSAMs, is straightforward. Concurrent estimation of α (along with n_{0}), as in the CSAM approach, yields good estimates of n_{0} as long as the assumptions are UUU or NUN. When the distribution of chromosomal breakpoints is nonuniform, however, estimation of α is less reliable and can be significantly biased (n_{0} is often overestimated). These results illustrate the difficulty in estimating the number of conserved segments, even when a single-parameter model is used. For this reason, parameter-rich models that attempt to estimate this quantity may experience difficulty without a large number of genes.
Number of markers needed to estimate genomic distance: Genomes of mammals generally consist of a very large number of genes (e.g., 30,000–130,000 genes in humans; Schuler et al. 1996; Scott 1999; Ewing and Green 2000; Liang et al. 2000; Roest Crollius et al. 2000). Clearly, it would be useful to know the minimum number of markers needed to reliably estimate the total number of conserved segments. Figure 4 shows the relationship between the error (percentage difference between the true and the estimated value) and the number of markers for UUU (Figure 4A) and NNN (Figure 4B) cases, for different numbers of rearrangements. We find that we need 700 markers or fewer (depending on the number of rearrangements) to obtain estimates of the number of conserved segments (and, thus, the number of rearrangements) within 5% of the true value, as long as the assumptions are UUU. If the assumptions are NNN (extreme nonuniform case), then the number of markers needed for a maximum error margin of 5% is >1000 (but <1200). For divergence levels greater than shown the error margins are fairly high, even for large numbers of markers, underscoring the problems with estimation of the number of conserved segments when the observed distribution of CSAMs is L-shaped (see Figures 3 and 6). As mentioned earlier, the simulation of chromosomal breakpoints may not be appropriate, creating extremely low α-values for larger numbers of rearrangements. If the real data fall between these two extreme scenarios (UUU and NNN), we find that we need a maximum of 1200 markers for good results, as long as the two species being compared are fairly closely related. For instance, for NUN we found that ∼1000 markers were needed for an error margin of <5%, even for 200 rearrangements.
Schoen (2000) used computer simulation to estimate the number of breakpoints with differing numbers of chromosomes, markers, and rearrangement events and reported a declining error in the estimate with increase in the number of markers. He found that for lower number of markers the error rates increased slightly with the number of rearrangements and that the type of rearrangement (translocation vs. inversion) did not affect the error rate.
DISCUSSION
In this article we have introduced the CSAM approach and compared it to the approach of Sankoff and Nadeau (1996). We have shown that the enumeration of conserved segments by using conserved syntenies as a unit has limitations (e.g., Figure 2). While such limitations have been pointed out in the literature (Sankoff and Nadeau 1996), the severity of the problem has not been clear previously. We have demonstrated that the CSAM approach remedies many of these problems while employing the Sankoff and Nadeau (1996) concept of estimating the number of unobserved segments from the distribution of observed segments. However, we have proposed a more flexible distribution to account for the effect of nonuniformity in the marker and breakpoint distributions—a biologically more realistic scenario.
As the CSAM approach requires relatively few markers for accurate estimation of the number of conserved segments, we now discuss its utility in establishing the extent of chromosomal homology between human, mouse, and rat genomes. In the first comparison, we use the mouse genome as the primary and the human genome as the secondary genome. This is because of the finer relative position information available for mouse genes (Blake et al. 2000). In this comparison, 310 CSAMs are directly observable (Figure 5). For the same set of markers, the conserved synteny analysis exposes only 143 conserved segments (conserved syntenies). The Sankoff and Nadeau (1996) method of estimating the number of unobserved syntenies predicts 8 additional conserved syntenies. Thus, the total number of conserved segments is predicted to be only 151. This and other similar estimates have been previously obtained and used to indicate the minimum number of conserved segments between human and mouse (Nadeau and Taylor 1984; Copeland et al. 1993; Sankoff and Nadeau 1996; Ehrlich et al. 1997; Nadeau and Sankoff 1998). This number is less than half the number of conserved segments identified through CSAMs.
The high genomic divergence between human and mouse genomes is also evident in the observed CSAMs (Figure 5). Many chromosomes show areas of high breakpoint frequency (probable hotspots for rearrangements). In these areas many conserved segments of small size appear to have been produced by in-place inversions, as evident from multiple adjacent conserved segments with alternating colors. It is, of course, possible that many such short segments are simply artifacts of map ambiguity. This would inflate the overall estimate of the number of conserved segments. However, our estimate is clearly not affected by such ambiguities in the relative order of markers in the human genome because the CSAM approach requires the use of marker order in one of the genomes only (mouse in the present case). In the mouse genome, whenever multiple genes were mapped to the same position, we ordered the markers so as to minimize the number of CSAMs.
The size distribution of the observed CSAMs between human and mouse genomes is given in Figure 6A. The fit of the gamma and SN models to the observed CSAM distribution shows that the gamma model fits the data better (χ^{2} value = 4.74 for the gamma model, as opposed to 19.49 for the SN model; d.f. = 11). The CSAM analysis predicts 219 unobserved segments, with the total number of conserved segments between human and mouse genomes adding up to 529 ± 84. This result is not surprising considering that the relative genomic locations of <3% of all the genes for these two species were used (see Table 3, mouse-human comparison, July 1999 data release). However, it is almost three times the estimates based on conserved-synteny analysis (e.g., Sankoff and Nadeau 1996). Indeed, this almost threefold increase is consistently seen in the analysis of the July 2000 data, with >300 more usable genes (Table 3), as well as in the reciprocal analysis (primary, human; secondary, mouse; results not shown).
The autosomes of the mouse genome consist of a total of ∼3000 Mbp and have a combined length of ∼1500 cM (Nusbaum et al. 1999; Blake et al. 2000). Therefore, the average length of a conserved segment is 2.84 ± 0.45 cM (5.67 ± 0.90 Mbp), which is <40% of the previous estimates of 8 cM or higher (e.g., Nadeau and Taylor 1984; Nadeau and Sankoff 1998). Interestingly, even the original Nadeau and Taylor (1984) approach using CSAM length data in centimorgan units (using CSAMs containing two or more markers) produced an estimate of 2.54 cM. This is somewhat surprising because the original method appears to be rarely used.
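The arithmetic behind these segment lengths can be checked with a short sketch. The standard errors are propagated here with a first-order (delta-method) approximation, which is an assumption; it happens to reproduce the quoted values. The function name is hypothetical.

```python
def mean_segment_length(total_length, n_segments, se_segments):
    """Return (mean length, standard error) of a conserved segment."""
    mean = total_length / n_segments
    # First-order error propagation for total_length / n, treating the
    # total length as a known constant.
    se = mean * se_segments / n_segments
    return mean, se

length_cm = mean_segment_length(1500.0, 529, 84)    # genetic length, cM
length_mbp = mean_segment_length(3000.0, 529, 84)   # physical length, Mbp
print("%.2f +/- %.2f cM" % length_cm)    # 2.84 +/- 0.45 cM
print("%.2f +/- %.2f Mbp" % length_mbp)  # 5.67 +/- 0.90 Mbp
```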
An approximate estimate of the rate of chromosomal rearrangement, averaged over evolutionary time, is obtained by assuming that each rearrangement produces 2 conserved segments (see Table 1). Table 1 shows the differential effects of the relative contributions of the various rearrangement types to the number of conserved segments created by a rearrangement event. On average, a working figure of 2 segments/rearrangement appears conservative. For the mouse-human pair the estimate of the rate of chromosomal rearrangements is 1.15 ± 0.19 per million years, as these two species diverged ∼110 million years ago (Kumar and Hedges 1998). This rate of rearrangement is approximate because the true relative contribution of different types of rearrangements that have led to the generation of the conserved segments remains unknown. Also, this is an average rate of divergence between human and mouse genomes; rates of evolution in the independent lineages leading to human and mouse may be unequal and differ substantially from each other.
We also used the CSAM approach to compare the mouse and rat (Rattus norvegicus) genomes (Table 3), and estimated 139 ± 25 conserved segments between their genomes (July 2000 data release). This translates to 0.74 ± 0.16 rearrangements per million years, assuming that the two lineages split 40 million years ago (Kumar and Hedges 1998). As mentioned above, the analysis of the mouse-human data yields a rate of 1.15 ± 0.19 rearrangements per million years. A reciprocal analysis of the mouse-rat data (i.e., mouse as secondary and rat as primary) revealed a reduction in the number of useful markers to less than one-third of that in the previous analysis because the rat genome is not mapped as extensively as is the mouse genome. Figure 4 shows that the standard error of the estimate is large when the number of markers is low. This instability is reflected in the estimation of the number of CSAMs in the reciprocal analysis when compared to the corresponding estimates when mouse is used as the primary genome. This is in contrast to the mouse-human analysis, where the number of useful markers is large and similar for the reciprocal analysis (human used as the primary genome). Therefore, when the number of markers is reasonably large, the results obtained are quite robust (results not shown).
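The two rates can be reproduced from the segment estimates with the sketch below. Several inputs are assumptions not stated in the text: the chromosome counts are taken to be autosome counts (22 for human, 19 for mouse, 20 for rat), and the elapsed time is taken to be twice the divergence time because both lineages have been evolving since the split. With these choices the quoted figures are recovered. The function name is hypothetical.

```python
def rearrangement_rate(n_segments, se_segments, autosomes_a, autosomes_b,
                       divergence_my):
    """Return (rate, standard error) in rearrangements per million years."""
    # R = (n - max(c_a, c_b)) / 2, i.e., two new segments per rearrangement.
    rearrangements = 0.5 * (n_segments - max(autosomes_a, autosomes_b))
    # Assumption: time is summed over both lineages since their divergence.
    elapsed_my = 2.0 * divergence_my
    rate = rearrangements / elapsed_my
    # The chromosome count is a constant, so only n contributes to the error.
    se = 0.5 * se_segments / elapsed_my
    return rate, se

mouse_human = rearrangement_rate(529, 84, 19, 22, 110)
mouse_rat = rearrangement_rate(139, 25, 19, 20, 40)
print("mouse-human: %.2f +/- %.2f" % mouse_human)  # mouse-human: 1.15 +/- 0.19
print("mouse-rat:   %.2f +/- %.2f" % mouse_rat)    # mouse-rat:   0.74 +/- 0.16
```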
Our CSAM approach provides a simple way to compare extensively mapped model organism genomes with other genomes for which precise maps are not likely to become available in the near future. This facilitates quantification of the nature and tempo of macroevolutionary forces that have been instrumental in generating the current diversity of genomic organization in mammals (Burt et al. 1999; O’Brien et al. 1999).
Acknowledgments
We thank Tom Dowling and Mark Miller for discussions, Anne Beausang for some artwork, G. Valente for some database searches, and the Mouse Genome Database staff for their excellent user support. This research was supported by the National Institutes of Health, the National Science Foundation, and Burroughs Wellcome Fund grants to S.K.
Footnotes

Communicating editor: H. Ochman
 Received July 28, 2000.
 Accepted December 4, 2000.
 Copyright © 2001 by the Genetics Society of America