Abstract
Comparative genetic maps of two species allow insights into the rearrangements of their genomes since divergence from a common ancestor. When the map details the positions of genes (or any set of orthologous DNA sequences) on chromosomes, syntenic blocks of one or more genes may be identified and used, with appropriate models, to estimate the number of chromosomal segments whose content is conserved between species. We propose a model for the distribution of the lengths of unobserved segments on each chromosome that allows for widely differing chromosome lengths. The model uses as data either the counts of genes in a syntenic block or the distance between extreme members of a block, or both. The parameters of the proposed segment length distribution, estimated by maximum likelihood, give predictions of the number of conserved segments per chromosome. The model is applied to data from two comparative maps for the chicken, one with human and one with mouse.
COMPARATIVE gene mapping, the analysis of the chromosomal location of homologous genes in different species, is a powerful tool for gene mapping and the study of genome organization and evolution. The most detailed comparisons are between mouse and man, with >2000 homologous genes mapped in both species. Almost 200 linkage groups are conserved between these two species (Carver and Stubbs 1997). Even before these detailed comparative gene maps were assembled, the early genetic maps of man and mouse were used to estimate the mean length and number of chromosomal segments conserved during evolution (Nadeau and Taylor 1984). Comparison of the locations of 83 homologous loci revealed 13 conserved segments. Statistical models were developed for using this sample of conserved segments to estimate the mean length of all conserved autosomal segments in the genome as 8.1 cM. This was used to estimate the number of conserved segments as 198, which is very close to the number observed today. Most comparative studies have focused on mammals, notably mouse and human comparisons (O'Brien et al. 1993, 1997; Womack and Kata 1995; Andersson et al. 1996; Carver and Stubbs 1997). Recently, comparisons between birds (Burt et al. 1995; Andersson et al. 1996; Jones et al. 1997; Pitel et al. 1998; Smith and Cheng 1998) or bony fish (Morizot 1983; Postlethwait et al. 1998) and mammals reveal a high degree of conservation of genome organization. This is surprising given that these species diverged from a common ancestor 420 mya.
The genetic marker maps of farm animals such as cattle, pigs, and poultry are now sufficiently well advanced to be of practical value for the study of economically important traits and livestock improvement (Andersson et al. 1996). Knowledge of the location of coding sequences is, however, limited. Maps of major livestock species contain 1000–2000 anonymous microsatellite markers and only 5–10% of all genetic markers are genes. Mapping of several vertebrate genomes is progressing rapidly, but by far the most detailed information is still to be found for mouse and human. Through comparative gene mapping, it is possible to link the “gene-poor” maps of livestock to the “gene-rich” maps of human and mouse (Andersson et al. 1996).
Many measures of genome rearrangement are possible, depending on the level of gene mapping information available (e.g., synteny, gene order, and gene position) and the corresponding mathematical modeling approach used. Two derived measures of the degree of genome reorganization between two species using synteny data have been proposed (Bengtsson et al. 1993), and also a measure of genome similarity using gene order (Zakharov et al. 1995). More mechanistic models have been derived from some or all of the known chromosome modification mechanisms such as reciprocal translocation, inversion, transposition, and chromosome fusion and fission. Such an approach has been developed to obtain a direct estimate of the number of conserved segments from synteny data (Sankoff and Nadeau 1996; Erlich et al. 1997), which takes account of as yet unobserved syntenies. When sequences of genes are accurately mapped, similar descriptive models of genome rearrangement are possible (Sankoff 1993; Hannenhalli 1995; Hannenhalli and Pevzner 1995) but these models do not allow for undiscovered segments.
Our concern is with incomplete data of an intermediate accuracy arising from genetic maps, which yield blocks of conserved synteny. These contain information on the number of genes per block and the measured distance (or range) between extreme genes in blocks with at least two members, but ignore information on gene order. The first published estimate of the number of conserved segments between man and mouse used an approach based on such data (Nadeau and Taylor 1984). Although this landmark work used measurements of distance, subsequent approaches have concentrated either on counts of genes (by chromosome or blocks within chromosomes) or on gene order. We build on the approach of Nadeau and Taylor (1984) by using both counts and the additional distance information available in ranges, when present. A central assumption of the Nadeau and Taylor (1984) model was that all chromosomes had identical distributions of the lengths of segments from which ranges had been sampled. Chromosome lengths were assumed to be large relative to segment lengths. This approximation is good for chromosomes >100 cM in length, and fair for those >50 cM in length (as in the mouse), but is untenable for species with shorter chromosomes, such as the chicken, which has extreme divergence in chromosome size. The currently established chicken linkage group sizes range from 2 cM to 518 cM, with several <50 cM. We have extended the method of Nadeau and Taylor (1984) to allow small chromosome lengths and also to use the probability density of the observed ranges in a likelihood approach. A similar method, using only the number of genes forming a syntenic block of one or more markers, is also proposed. This leads naturally to a combined approach using both types of data. The model allows a flexible description of chromosome breakage, which includes random breakage as a special case. The methods are illustrated using comparative maps that compare chickens with both humans and mice (Burt et al. 1999).
METHODS
Distributions of segment lengths for different chromosomes: How are the lengths of conserved segments expected to change, in general, as chromosome length increases? Very small (hypothetical) chromosomes are likely to contain only a single conserved segment, while large chromosomes might be expected to contain many relatively short segments and a few long ones. Intermediate length chromosomes may have segments whose lengths are a substantial proportion of chromosome length. Thus, for our empirical model of segment lengths, we require a flexible distribution whose shape can be defined for each chromosome. The β distribution, a two-parameter distribution defined on the unit interval, can give distributional shapes as varied as unimodal, uniform, exponential-like, and reverse exponential, as its parameters vary. Segment lengths for the kth chromosome can be scaled by chromosome length, l_{k}, to follow a β distribution whose parameters are a function of l_{k}. Using the square parenthesis notation for a density function, assume the distribution of segment lengths, y, on chromosome k to be
An important special case occurs when the β distribution parameter a equals one, so that
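The shape of such a rescaled density can be sketched numerically. The following is a minimal illustration only, assuming a standard β(a, b) density rescaled from the unit interval to (0, l_{k}); the function names are hypothetical, and the dependence of the parameters on l_{k} is not modeled here, so a and b are simply passed in as arguments.

```python
import math


def scaled_beta_pdf(y, length, a, b):
    """Density at y of a Beta(a, b) variable rescaled to the interval (0, length).

    Assumption: segment length divided by chromosome length follows Beta(a, b).
    """
    if not 0.0 < y < length:
        return 0.0
    x = y / length
    # log of the beta function B(a, b), computed via log-gamma for stability
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta
    # dividing by length is the Jacobian of the change of scale x = y / length
    return math.exp(log_pdf) / length


def scaled_beta_pdf_a1(y, length, b):
    """Special case a = 1, where the density reduces to b * (1 - y/length)**(b - 1) / length."""
    if not 0.0 < y < length:
        return 0.0
    return b * (1.0 - y / length) ** (b - 1.0) / length


if __name__ == "__main__":
    # a = 1, b = 1 is uniform on (0, length); a = 1, b = 2 is triangular.
    print(scaled_beta_pdf(30.0, 60.0, 1.0, 1.0))
    print(scaled_beta_pdf_a1(45.0, 90.0, 2.0))
```

With a = 1 the two functions agree, which gives a quick check on the general form.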
Count data: Observed genes are assumed to be distributed at random along the genome with constant density D genes per centimorgan. If there are many genes and a large number of observed syntenic groups, then the distribution of the number of genes (n) in a syntenic group found on an underlying conserved segment of length y will be approximately Poisson with mean Dy, defined for observable values of n ≥ 1.
The distribution of n, given y, is
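As a minimal sketch, assuming the truncated Poisson form implied by the description above (mean Dy, restricted to observable counts), the conditional probability of n given y can be computed as follows. The function name, the `min_count` argument, and the example segment length are illustrative assumptions; the density of 0.03 genes per centimorgan is of the order reported in RESULTS.

```python
import math


def truncated_poisson_pmf(n, mean, min_count=1):
    """P(N = n | N >= min_count) for N ~ Poisson(mean).

    Assumption: counts below min_count are unobservable, so the Poisson
    probabilities are renormalized over n >= min_count.
    """
    if n < min_count or mean <= 0.0:
        return 0.0
    # untruncated Poisson probability, computed on the log scale
    log_p = n * math.log(mean) - mean - math.lgamma(n + 1)
    # probability mass removed by the truncation
    below = sum(
        math.exp(j * math.log(mean) - mean - math.lgamma(j + 1))
        for j in range(min_count)
    )
    return math.exp(log_p) / (1.0 - below)


if __name__ == "__main__":
    density = 0.03         # genes per cM (illustrative)
    segment_length = 50.0  # cM (hypothetical segment)
    mean = density * segment_length
    for n in range(1, 5):
        print(n, truncated_poisson_pmf(n, mean))
```

Setting `min_count=2` gives the corresponding truncation for syntenic groups with at least two genes.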
Range data and combined data: An extension of the scheme for counts follows naturally for syntenic groups of at least two genes, where we have additional information on the range, w, between the outermost pair of the group. For this subset of the data the Poisson distribution of n given y has to be truncated to be ≥2:
It is possible to combine both preceding likelihoods. For single loci (n = 1) the distribution of n in the Count data section may be used. For range data the joint distribution of n and w may be used, with one modification. The Poisson distribution for a count conditional on segment length, [n|y]_{2}, should be truncated to allow n ≥ 1 rather than n ≥ 2. This gives a common truncated Poisson distribution for both approaches, so that their respective log-likelihoods may be added.
Then we maximize
Confidence intervals: All maximizations were performed using standard derivative-free optimization routines. Confidence intervals for the number of conserved segments, S, were calculated only for the two random breakage models with β parameter a = 1.
The random chromosome breakage model has a confidence region for δ and log(γ) that is an elliptical area defined by the critical log-likelihood contour corresponding to
The confidence interval for the random genome breakage model was found from the log-likelihood corresponding to a grid of integer S_{0} values, using the critical value
Comparing model and data: Observed genes are assumed to be distributed at random over the genome. Those found by means that are not random (previously mapped by FISH, gene families, chromosome walking, cross-referenced genes from other species' maps, etc.) have been omitted. If the distribution of genes is random and of constant density D, then, on average, the number found on linkage group k will be proportional to the length of the linkage group, l_{k}, and the observed number, m_{k}, will follow a Poisson distribution with mean Dl_{k}. A linear regression through the origin of Poisson variables m_{k} against l_{k} was fitted and the generalized Pearson chi-square used as a measure of lack of fit (Collett 1991) to assess the evidence for nonrandomness.
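The test of constant gene density can be sketched as follows. This is an illustration under stated assumptions rather than the fitting procedure actually used: it takes the maximum-likelihood estimate of D for Poisson counts with mean Dl_{k} (total count divided by total length) and forms the ordinary Pearson chi-square with one fewer degrees of freedom than the number of linkage groups. The function name and the data values are hypothetical.

```python
def fit_constant_density(counts, lengths):
    """Fit m_k ~ Poisson(D * l_k) and return (D_hat, pearson_chi_square, dof).

    Assumption: D is estimated by maximum likelihood, which for a Poisson
    regression through the origin with identity link is sum(m_k) / sum(l_k).
    """
    if len(counts) != len(lengths) or not counts:
        raise ValueError("counts and lengths must be non-empty and of equal size")
    density = sum(counts) / sum(lengths)
    expected = [density * length for length in lengths]
    # Pearson statistic: squared residuals scaled by the Poisson variance
    pearson = sum((m - e) ** 2 / e for m, e in zip(counts, expected))
    dof = len(counts) - 1  # one parameter (D) estimated
    return density, pearson, dof


if __name__ == "__main__":
    # Hypothetical linkage groups: lengths in cM and numbers of mapped genes.
    lengths = [400.0, 250.0, 120.0, 60.0, 30.0]
    counts = [11, 9, 3, 2, 1]
    d_hat, chi2, dof = fit_constant_density(counts, lengths)
    print(d_hat, chi2, dof)
```

A large statistic relative to its degrees of freedom would indicate departure from a constant density; with many small expected counts the chi-square reference distribution is only approximate.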
We can also compare the observed number of segments per chromosome with a prediction from the model. To estimate the predicted number of observed segments the distribution [y]_{k} is replaced by the distribution of the observed segments
Gene mapping data from the chicken genetic linkage map: For chicken, the genes were mapped as part of the EC CHICKMAP project and the worldwide effort to map the chicken genome (Burtet al. 1995; Burt and Cheng 1998). The mapping information is recorded in the chicken genome database, Arkdbchick (http://www.ri.bbsrc.ac.uk).
To estimate the genetic length of the chicken genome we take map lengths from recombination among m loci, using the Map Manager program (Manly 1993), corrected using the Kosambi mapping function (Kosambi 1944) and multiplied by (m + 1)/(m − 1) to adjust for failure to sample telomeric regions (Morton 1991). The second correction assumes that loci are sampled randomly from a uniform distribution along the genetic map.
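The two corrections can be written out as a short sketch. The Kosambi function below is the standard form, d = 25 ln[(1 + 2r)/(1 − 2r)] cM for recombination fraction r; the function names and example values are illustrative assumptions and are not taken from the Map Manager program.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance in centimorgans for recombination fraction r."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def telomere_corrected_length(observed_length_cm, n_loci):
    """Scale an observed map length by (m + 1)/(m - 1) for m mapped loci.

    Assumption: loci are sampled uniformly along the map, so the outermost
    loci fall short of the chromosome ends on average.
    """
    if n_loci < 2:
        raise ValueError("at least two loci are needed to define a map length")
    return observed_length_cm * (n_loci + 1) / (n_loci - 1)


if __name__ == "__main__":
    # Hypothetical linkage group: 10 loci spanning 90 cM between the end markers.
    print(kosambi_cm(0.10))
    print(telomere_corrected_length(90.0, 10))
```

The correction factor is large for linkage groups with few loci and approaches one as m increases.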
The locations of human and mouse genes were taken from the Genome Database (http://gdbwww.gdb.org/gdb/), UniGene (http://www.ncbi.nlm.nih.gov/), Online Mendelian Inheritance in Man (http://gdbwww.gdb.org/omim/docs/omimtop.html), and the Mouse Genome Database (http://www.informatics.jax.org/). The comparative gene map for chicken, human, and mouse (http://www.ri.bbsrc.ac.uk) contains 214 orthologous loci, most of which are known genes or conserved anonymous cDNA sequences. We excluded members of multigene families or genes for which specific orthology could not be determined or for which homology was in doubt.
RESULTS
Data presented for comparative maps are based on chicken linkage groups and are labeled chicken-human (CH) and chicken-mouse (CM).
Data: Details of the observed numbers of conserved genes between chicken and human or mouse are given in Table 1. Gene density, for those considered found at random, was ~3/100 cM for both comparisons, having excluded one-third of the conserved loci that were considered nonrandom and therefore biased. The total estimated length of the linkage groups in the chicken map was 3836 cM. There were considerably more single loci than conserved syntenic groups with ranges, particularly so for the chicken-mouse comparison. Most of the ranges were derived from fewer than five genes. Ranges were observed on 19 (CH) and 13 (CM) linkage groups, and almost always as a single range per linkage group except for the four largest linkage groups (Figure 1). Smaller linkage groups were more likely to contain single loci than ranges for the chicken-mouse comparison. In all, 28 (CH) and 26 (CM) linkage groups were found to contain homology segments defined by a single gene (n = 1) or conserved syntenic groups with n ≥ 2. The largest observed ranges from both comparative maps exceeded the median linkage group length, emphasizing the need for models allowing for chromosome size.
Tests of randomness, using the number of loci per linkage group, gave
Model fitting and predictions: The results of fitting the various models to different data types are presented in Table 2 for the chicken-human comparison and in Table 3 for the chicken-mouse comparison. The behavior of the models was broadly similar for both comparative maps. Using either count data alone or combined data there was no evidence against the random cutting of chromosomes. In contrast, a model of nonrandom breakage was preferred when range data was considered on its own [
The observed numbers of segments per linkage group, plotted against linkage group length, are presented in Figure 3. Also included are predictions from the random chromosome breakage model of the number of underlying segments per linkage group and of the number of observed segments per linkage group with a 95% confidence interval. The chicken-mouse prediction for the number of observed segments shows good agreement with the data. Even so, there are still some (nonzero) observed numbers outside the confidence range for observed segments. This is inevitable when the model predicts a single segment for very small linkage groups. For the same reason the predicted curve for the number of conserved segments also lies below some of the observed numbers of segments for short linkage groups. In the chicken-human comparison this also occurs for the longest linkage groups.
An illustration of the flexibility of the β distribution models to represent a wide range of segment length distributions is presented in Figure 4 using fitted distributions corresponding to the estimated parameters from the chicken-mouse comparison. The upper diagram shows changes in segment length distributions with chromosome length from the random chromosome breakage model applied to the combined data. Four chromosome lengths have been chosen for illustration. The distribution for the shortest chromosome of 20 cM has the most probable segment length equal to the chromosome. At a length of 60 cM the distribution is almost uniform, which would be appropriate for a single random cut point. At 90 cM the distribution becomes triangular, corresponding to two cut points. Longer chromosomes show an exponential-like segment length distribution shifted progressively further to the left. This is shown for a chromosome length of 150 cM, corresponding to 5 segments, for which the probability of a segment exceeding half of the length of the chromosome is 0.06. This probability halves for each additional segment on a chromosome.
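The quoted tail probability can be checked with a short calculation. The sketch below assumes that, under random breakage into S segments, segment length as a fraction of chromosome length follows a β distribution with parameters 1 and S − 1 (uniform for one cut point, triangular for two), so that the probability of exceeding a fraction x is (1 − x)^{S−1}. The function name is hypothetical.

```python
def prob_segment_exceeds(fraction, n_segments):
    """P(segment length / chromosome length > fraction) under Beta(1, n_segments - 1).

    Assumption: n_segments - 1 cut points placed uniformly at random, so the
    survival function of a scaled segment length is (1 - fraction)**(n_segments - 1).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if n_segments < 1:
        raise ValueError("a chromosome carries at least one segment")
    return (1.0 - fraction) ** (n_segments - 1)


if __name__ == "__main__":
    # Probability that a segment exceeds half the chromosome length.
    for s in range(2, 8):
        print(s, prob_segment_exceeds(0.5, s))
```

For five segments this gives 0.5^4 = 0.0625, in agreement with the value of 0.06 quoted above, and each additional segment multiplies the probability by one-half.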
The lower diagram in Figure 4 is of the nonrandom breakage model fitted to the range data alone. For the shortest chromosome the most probable segment length is equal to that of the chromosome, as in the random breakage model. As chromosome lengths increase, however, the segment length distributions have progressively smaller means relative to the length of the chromosome and are unimodal. There is no evidence in the range data, and therefore no reflection in the fitted distributions, of a preponderance of very small segment lengths.
DISCUSSION
The small size of some chicken chromosomes and the relatively large size of some conserved syntenic blocks have driven the construction of a chromosome-based model for conserved segments. But, as the model is a generalization of that of Nadeau and Taylor, there is no reason why the approach should not be used more widely, particularly with its emphasis on the testing of model and data assumptions. As illustrated in RESULTS, the β distribution has provided a very flexible and intuitively appealing range of distributional shapes for the unobservable segments. Particularly important are special cases corresponding to random breakage models, one of which is already prevalent in the literature (see Nadeau and Sankoff 1998 for a review). The likelihood approach presented here allows an explicit test of the plausibility of these random-breakage models, as well as providing a framework for deriving the confidence intervals that are an essential accompaniment to estimates. A particularly striking consequence of using such a flexible model is the need to use all available data to draw reliable conclusions. Discarding segments defined by single loci (homology segments) results in gross underestimation of genomic rearrangement. These issues are discussed in more detail below.
There are two untested assumptions made in the Nadeau and Taylor (1984) model, which have also been made here, without extensive comment: that genetic and physical distances are (approximately) proportional and that there are no insertions within a syntenic block, although inversions are permitted and fairly common in large conserved segments. This will lead to underestimation of genomic rearrangement. The algorithm of Sankoff et al. (1997a) could be used to identify probable inversions in the data, rather than those caused by incorrect gene ordering, prior to model estimation.
The crucial assumption of genes spread at random over the genome has been tested, but at the simplest level, to assess constant density over chromosomes. There is some evidence that recombination rates in the chicken microchromosomes (the smallest 33 autosomes) are some 2.5 times those in macrochromosomes (the largest 5 autosomes; Rodionov 1996), and that gene densities in microchromosomes are double those in macrochromosomes (Smith et al. 1999). These two effects cancel out in the test for a random scattering of genes. Performing the same test on the human-mouse comparative map using >1600 unselected genes gave a
The random chromosome and genome breakage models presented here are obtained as special cases of an empirical nonrandom breakage model. This allows likelihood-ratio tests for independent components of the model, an approach that is preferable to using goodness-of-fit tests for the whole model and then, if satisfactory, declaring that all of the model components are validated. Conclusions about the pattern of chromosome breakage are strongly influenced by the choice of which data measurements to analyze. It may be that when using only the ranges, which contain indirect metric evidence about segment lengths, we are detecting genuine nonrandomness. Or, perhaps, short ranges are underrepresented in the relatively small number of ranges in our sample covering the whole genome, often resulting in just a single range being present on a linkage group. When using all the data the tests within the model do not provide evidence against “chromosomes” being cut at random for the two comparative maps presented here, although the evidence is not unanimous about whether the randomness is on a chromosome or a genome basis. However, both models give similar estimates of conserved numbers of segments. Furthermore, if the model is modified so that observed ranges and linkage group lengths corresponding to microchromosomes are reduced by a factor of 2.5 (the minimum shrinkage factor corresponding to the almost linear part of the Kosambi mapping function), then the random genome breakage model is preferred for both comparative maps, and the estimates of conserved segment numbers change little. Other evidence for random genome breakage models is presented in Nadeau and Sankoff (1998). One arbitrary feature of the chromosome model presented here is the smooth relationship chosen to change the distribution of segment length with chromosome length.
There is no expectation that the number of conserved segments on a chromosome increases monotonically with chromosome length, although when chromosome lengths differ widely an increasing trend is likely. With random genome breakage the trend should be linear, becoming less variable with an increase in the number of generations and rearrangements between the two species being compared. This may be a factor in the superior agreement of the observed segment numbers and their prediction in the chicken-mouse comparison. The chicken-human comparison is dominated by linkage groups with only a single observed segment (Figure 1), and this suggests that other functions might be useful in relating the number of conserved segments to chromosome length for some contexts. For example, the model may be easily modified to fit different breakage rates among microchromosomes and among macrochromosomes in chickens, if considered biologically plausible.
The estimates of the number of conserved segments change considerably depending on which measurements are chosen as data, in contrast to the relative stability of the estimates over the different chromosome breakage models. The flexibility of these breakage models in describing segment length distributions means that the model will be more sensitive to the data than might be the case with, say, a single-parameter exponential distribution. This places considerable emphasis on the quality of the mapped data and the examination of the assumptions used to describe gene distributions.
As an example of the dramatic effect of very different model assumptions on conclusions, we have also used the model of Sankoff et al. (1997b) to estimate conserved segment number. This elegant model, derived from the probability theory of runs, treats the genome as continuous and relies on both the process of gene identification and chromosome breakage occurring at random. It excludes considerations of gene order and distance both between and within conserved segments, and consequently ignores the impossibility of some configurations of observed ranges and chromosome boundaries. Further work is required to assess the limitations of the model assumptions. Estimates of the conserved number of segments from this model are considerably larger than those derived from our models; CH = 142 segments and CM = 293 segments.
Using our models, the chicken-mouse comparison gave an estimate of the number of conserved segments of almost 50% more than the chicken-human comparison for the combined data, with a confidence interval twice as wide. The intervals barely overlapped, suggesting a difference in conserved number of segments with the chicken between these two species. Further evidence of a difference comes from an examination of the ranges that were found in both comparisons. There are 16 ranges in common, of which 8 were of equal length. The remaining 8 have a measured range that is shorter for the chicken-mouse comparison. The two comparative maps presented here have many common genes in their data: genes that are first mapped in the chicken and then located in both of the more extensive mouse and human maps. This precludes the simplest approach to testing for human and mouse differences in conserved segment numbers with chicken by pooling data and assuming it to be independent, because there may be a positive correlation between the estimates of conserved segment number caused by the sampling scheme above. Finding a satisfactory representation of this correlation will be important in future work in evolutionary modeling, because in the long term we will wish to assess differences in the number of conserved segments for multiple comparative maps and to use these maps to give a new perspective on phylogenetic trees.
Acknowledgments
We thank Liz Archibald for her excellent typing, and Michael Romanov for Russian translation. We also thank the Ministry of Agriculture, Fisheries and Food (MAFF), the Biotechnology and Biological Sciences Research Council (BBSRC) and the Commission of the European Communities for supporting this work.
Footnotes

Communicating editor: G. A. Churchill
 Received April 23, 1999.
 Accepted September 20, 1999.
 Copyright © 2000 by the Genetics Society of America