Measures of Synteny Conservation Between Species Pairs
Elizabeth Ann Housworth (a) and John Postlethwait (b)
a Mathematics Department, University of Oregon, Eugene, Oregon 97403
b Institute of Neuroscience, University of Oregon, Eugene, Oregon 97403
Corresponding author: Elizabeth Ann Housworth, Indiana University, Bloomington, IN 47405. E-mail: ehouswor{at}indiana.edu
Communicating editor: G. A. CHURCHILL
| ABSTRACT |
|---|
Measures of conserved synteny are important for estimating the relative rates of chromosomal evolution in various lineages. We present a natural way to view the synteny conservation between two species from an Oxford grid, an r x c table summarizing the number of orthologous genes on each of the chromosomes 1 through r of the first species that are on each of the chromosomes 1 through c of the second species. This viewpoint suggests a natural statistic, which we denote by ρ and call syntenic correlation, designed to measure the amount of synteny conservation between two species. This measure allows syntenic conservation to be compared across many pairs of species. We improve the previous methods for estimating the true number of conserved syntenies given the observed number of conserved syntenies by taking into account the dependency of the numbers of orthologues observed in the chromosome pairings between the two species and by determining both point and interval estimators. We also discuss the application of our methods to genomes that contain chromosomes of highly variable lengths and to estimators of the true number of conserved segments between species pairs.
GENOME evolution in multichromosomal organisms involves the translocation of genes between chromosomes, the rearrangement of genes on chromosomes, splitting and fusion of chromosomes, and gene and genome duplication events. Comprehensive measures of rearrangement distances, even restricted to pairs of chromosomes, one from each species, are computationally difficult to obtain. These measures are feasible only with highly conserved orthologous gene arrangements.
Recent articles modeling and measuring genome evolution have concentrated on estimating the true number of conserved syntenies or the true total number of conserved chromosomal segments between pairs of species.
Measures of conserved synteny ignore gene rearrangements on the chromosomes while measures involving the number of conserved segments take into consideration separate intrachromosomal rearrangements of blocks of orthologues while ignoring rearrangements within those blocks. Estimates of the true number of conserved syntenies clearly underestimate the true number of conserved segments between the genomes of two species. Measures of the total number of conserved syntenies or the total number of conserved segments are computationally feasible to obtain and provide a gross measure of genomic distance between pairs of species.
Both the estimators for synteny and segment conservation in recent articles (![]()
Neither the true total number of conserved syntenies nor the true total number of conserved segments is particularly useful in comparing genomic distances between pairs of species, because raw counts are not adjusted or standardized for basic genomic differences between the species, such as genome size or number of chromosomes. Further, the orthologous genes that are yet to be found may be ones not subject to much genetic or genomic constraint. These orthologues may be scattered widely on the chromosomes of the two species and may inflate the number of conserved syntenies or the number of conserved segments even though the bulk of the genome is highly conserved. We introduce a measure of genomic conservation, which we call syntenic correlation, that quantifies how far the orthologues are from being independently scattered in the genomes of the two species. This measure is standardized to lie between zero, for completely randomized arrangements of orthologues between the genomes, and one, for two genomes with perfect synteny conservation. Further, this measure can be used to compare genomic distances (i.e., Oxford grids) between many pairs of species.
| METHODS |
|---|
Multivariate distribution of gene counts:
Measures of synteny conservation come essentially from looking at an Oxford grid, i.e., an r x c table of the r chromosomes in species A and the c chromosomes in species B. The (i, j) entry is denoted ni,j and is the observed number of genes on species A chromosome i with an orthologue on species B chromosome j. A pictorial representation of this table is formed by placing a dot in the (i, j) box, representing the chromosome pair, in the orthologue's relative position on each chromosome (see Fig 1). A box with one or more entries is counted as a conserved synteny. The distribution of the n observed orthologues then follows a multinomial distribution with r x c classes (the boxes or pairs of chromosomes), each orthologue having chance pi,j of landing in the (i, j) class.
It is easiest for the analysis that follows to change notation to avoid the multidimensional subscript. Let r x c = m. Label the possible chromosome pairs 1, 2, ... , m and the corresponding probabilities p1, p2, ... , pm. Label the observed number of orthologues n1 for the first chromosome pair, n2 for the second, ... , nm for the mth pair. The multinomial distribution for the number of orthologs found on each chromosome pair is then

for n1
0, n2
0, ... , nm
0, and n1 + n2 + · · · nm = n and where 00 and 0! are interpreted as 1 in this context.
Let l ≤ m be the number of chromosome pairs on which orthologous genes will ever be found. The goal is to find an estimate for l, the true number of conserved syntenies that will be found after the genomes of both species have been completely mapped and analyzed.
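As a minimal sketch of the multinomial model above, the following Python evaluates the log-probability of a flattened Oxford grid. The function name `multinomial_log_prob`, the 2 x 2 grid, and the cell probabilities are hypothetical values chosen for illustration only; they are not data from this article.

```python
from math import lgamma, log, exp, inf

def multinomial_log_prob(counts, probs):
    """Log of n!/(n1!...nm!) * p1^n1 ... pm^nm, taking 0^0 = 1 and 0! = 1."""
    n = sum(counts)
    log_p = lgamma(n + 1)                    # log n!
    for count, p in zip(counts, probs):
        log_p -= lgamma(count + 1)           # lgamma(1) = 0, so 0! = 1
        if count > 0:
            if p == 0.0:
                return -inf                  # a cell with zero probability was observed
            log_p += count * log(p)
        # count == 0 contributes p^0 = 1, which covers the 0^0 case
    return log_p

# Hypothetical 2 x 2 Oxford grid, flattened row by row into m = 4 chromosome pairs.
grid = [[3, 0],
        [1, 2]]
counts = [c for row in grid for c in row]
probs = [0.4, 0.1, 0.2, 0.3]                 # assumed cell probabilities
prob = exp(multinomial_log_prob(counts, probs))
```

Working on the log scale avoids overflow in the factorials when the number of mapped orthologues is in the hundreds or thousands.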
Multivariate distribution of the lengths of the syntenic groups:
Consider the ancestral genome with all of its chromosomes concatenated and with the ancestral genes blocked by their syntenic groups that will be conserved between the two daughter species. The concatenated ancestral genome is to be broken into l segments (conserved syntenies) with lengths of proportions p1 through pl. The last proportion, pl, is determined from the other proportions as pl = 1 - (p1 + p2 + · · · + pl-1). If the number of breaks in any interval of the ancestral genome is modeled as Poisson, the realized lengths of the segments on the ancestral genome are modeled from an exponential distribution. The joint density function of the proportional lengths is given by (l - 1)! over the region p1 > 0, p2 > 0, ... , pl-1 > 0 and p1 + p2 + · · · + pl-1 < 1. That is, the joint density of the proportional lengths is uniform over the ancestral genome scaled to unit length. This distribution is the member of the Dirichlet family of distributions (further described below) with all of its l parameters equal to one.
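The construction in this paragraph can be simulated directly: draw l independent exponential segment lengths and divide by their total. The sketch below assumes a unit rate (the rate cancels on normalization); the function name `uniform_dirichlet_proportions` and the seed are illustrative choices, not part of the original analysis.

```python
import random

def uniform_dirichlet_proportions(l, rng=random):
    """Draw l proportional segment lengths by normalizing i.i.d. exponential lengths."""
    lengths = [rng.expovariate(1.0) for _ in range(l)]
    total = sum(lengths)
    return [x / total for x in lengths]

rng = random.Random(0)                        # fixed seed so the sketch is repeatable
props = uniform_dirichlet_proportions(5, rng)

# Under the uniform Dirichlet model each proportion has mean 1/l.
draws = [uniform_dirichlet_proportions(5, rng)[0] for _ in range(20000)]
mean_first = sum(draws) / len(draws)
```

With l = 5 the simulated mean of any one proportion should be close to 1/5, which is a quick sanity check on the normalization.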
It may be preferable to model the lengths of the conserved syntenies or segments with several gamma distributions, all on the same scale, but with shapes that depend on the sizes of the chromosomes making up the pairs. In this case, the joint density function of the proportional lengths follows a Dirichlet distribution whose parameters are determined by the shape parameters of the gamma distributions (see FRISTEDT and GRAY 1997, pp. 156-157). In Bayesian terms, this Dirichlet distribution on the proportional syntenic lengths is the conjugate prior for the multinomial probabilities that pairs of chromosomes from the two species contain orthologous genes; the parameters of the Dirichlet distribution may be chosen to take into account the relative lengths of the chromosomes in the two species. Choosing a nonuniform Dirichlet distribution amounts to choosing an informative rather than a noninformative prior distribution. The magnitudes of the parameters chosen to model the chromosome lengths determine how strongly the prior distribution informs the estimates.
Specifically, if the length of the block of genes from the ancestral genome that will constitute the orthologues of the jth chromosome pair is modeled by a gamma distribution with scale β and shape parameter αj, then the joint distribution of the proportional lengths follows a Dirichlet distribution with parameters {αj}. Let α = Σ αj. The density function of this Dirichlet distribution is
$$f(p_1, \ldots, p_{l-1}) = \frac{\Gamma(\alpha)}{\prod_{j=1}^{l} \Gamma(\alpha_j)} \prod_{j=1}^{l} p_j^{\alpha_j - 1}, \qquad p_l = 1 - (p_1 + \cdots + p_{l-1}),$$
over the region p1 > 0, p2 > 0, ... , pl-1 > 0, and p1 + p2 + · · · + pl-1 < 1.
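A short sketch of the gamma-to-Dirichlet relationship described above, assuming the standard Dirichlet density with shape parameters taken from the gamma distributions. The names `dirichlet_log_density` and `sample_proportions`, and the example shape values, are hypothetical. With all shapes equal to one the density reduces to the constant (l - 1)! of the uniform case.

```python
import random
from math import lgamma, log, exp, factorial

def dirichlet_log_density(props, alphas):
    """Log Dirichlet density at proportions props (summing to 1) with shapes alphas."""
    log_f = lgamma(sum(alphas)) - sum(lgamma(a) for a in alphas)
    return log_f + sum((a - 1.0) * log(p) for a, p in zip(alphas, props))

def sample_proportions(alphas, scale=1.0, rng=random):
    """Normalize independent gamma lengths (shape alpha_j, one common scale)."""
    lengths = [rng.gammavariate(a, scale) for a in alphas]
    total = sum(lengths)
    return [x / total for x in lengths]

# Uniform case: four segments, all shapes equal to one, density (l - 1)! = 3!.
uniform_density = exp(dirichlet_log_density([0.1, 0.2, 0.3, 0.4], [1, 1, 1, 1]))

# Hypothetical informative prior: larger shapes for larger chromosome pairs.
sample = sample_proportions([4.0, 2.0, 1.0], rng=random.Random(1))
```

The common scale drops out when the lengths are normalized, which is why only the shape parameters appear in the density.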
Distribution of the total number of conserved syntenies:
We assume that the proportional lengths of the syntenic segments are uniformly distributed. The uniform assumption is noninformative and corresponds to standard likelihood methods. The data consist of counts of orthologous gene pairs in the conserved syntenies found: (n1, n2, ... , nk). These counts are in a collection of k chromosome pairs. Another l - k chromosome pairings to which orthologous genes have not yet been mapped actually contain orthologues yet to be discovered.
The likelihood function of the true total number of conserved syntenies, l, is found by integrating the multinomial distribution against the joint uniform distribution on these proportional lengths. We must include the number of ways to choose l - k of the m - k chromosome pairings to which orthologous genes have not yet been mapped to actually contain orthologues yet to be discovered and we must include the fact that these particular l conserved syntenies are only one choice out of all the equally likely collections of l conserved syntenies chosen from the m chromosome pairings. Then we have the modification of Theorem 1 of ![]()

for l = k, k + 1, ... , m, where the integral is over the region where p1 > 0, p2 > 0, ... , pl-1 > 0, and p1 + p2 + · · · + pl-1 < 1.
The maximum-likelihood estimator for the true number of conserved syntenies is then the value of l that maximizes the function above. This estimator depends on the total number of orthologous genes mapped (n) and the observed number of conserved syntenies (k). The maximum-likelihood estimator depends on the total number of pairs of chromosomes (m = r x c) between the two species only through the constraint that l ≤ m because
$$\mathrm{Lik}(l \mid n_1, \ldots, n_k) = \frac{\binom{m-k}{l-k}}{\binom{m}{l}\binom{n+l-1}{n}} = \frac{(m-k)!}{m!} \cdot \frac{l!}{(l-k)!} \cdot \frac{n!\,(l-1)!}{(n+l-1)!}.$$
The formula for the density of the counts of the number of orthologous genes in the k conserved syntenies found given that there are l conserved syntenies in total has the following probabilistic interpretation: The denominator is the number of ways of choosing l out of the total of m possible conserved syntenies to be filled times the number of ways to fill l conserved syntenies with n orthologous genes. The numerator is the number of ways of choosing l - k of the unseen conserved syntenies from the m - k possibilities. The probability space includes not only the actual counts (n1, n2, ... , nk) observed but also which of the m possible conserved syntenies (cells in the table) get those counts.
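The counting interpretation above can be turned into a small calculation. The sketch below assumes the closed form implied by that interpretation, C(m - k, l - k) / [C(m, l) C(n + l - 1, n)], and finds the maximizing l by direct search. The function names and the example values of k, n, and m are hypothetical.

```python
from fractions import Fraction
from math import comb

def synteny_likelihood(l, k, n, m):
    """C(m-k, l-k) / (C(m, l) * C(n+l-1, n)) for k <= l <= m, else 0 (assumed form)."""
    if not k <= l <= m:
        return Fraction(0)
    return Fraction(comb(m - k, l - k), comb(m, l) * comb(n + l - 1, n))

def mle_syntenies(k, n, m):
    """Value of l in k..m with the largest likelihood (smallest such l on ties)."""
    return max(range(k, m + 1), key=lambda l: synteny_likelihood(l, k, n, m))

# Hypothetical data: k = 3 observed syntenies, n = 7 mapped orthologues.
estimate_small_grid = mle_syntenies(3, 7, 10)
estimate_large_grid = mle_syntenies(3, 7, 20)
```

Exact rational arithmetic is used so that very small likelihood values do not underflow. In this toy case the estimate is the same for m = 10 and m = 20, in line with the remark that m enters only through the constraint l ≤ m.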
An interval estimate for the true number of conserved syntenies, l, can be obtained by recognizing that we have essentially calculated the posterior distribution of l given the noninformative prior distribution that each chromosome pairing has equal chance of ever containing or not containing orthologues and that the orthologues are uniformly distributed among the chromosome pairings that actually contain orthologues. Under this noninformative prior, the posterior distribution on l is simply proportional to the likelihood function of l. That is,
$$P(l \mid n_1, \ldots, n_k) \propto \mathrm{Lik}(l \mid n_1, \ldots, n_k), \qquad l = k, k+1, \ldots, m.$$
The proportionality constant required to give a probability distribution is given by
$$\left[ \sum_{j=k}^{m} \mathrm{Lik}(j \mid n_1, \ldots, n_k) \right]^{-1}.$$
An interval estimate (at a 95% level) on the true number of conserved syntenies (of the form [k, L] where k is the observed number and L is the upper bound on the number) is determined by finding the smallest value of L that satisfies
$$\sum_{l=k}^{L} P(l \mid n_1, \ldots, n_k) \ge 0.95.$$
Syntenic correlation:
We introduce a measure of syntenic correlation that can be used to compare genomic distances across many pairs of species. Similar measures have been developed by BENGTSSON et al. (1993) and discussed in ![]()
Reverting to our original multivariate notation describing the r x c table summarizing the number of orthologues in each synteny, let ni,j be the observed number of genes on species A chromosome i with an orthologue on species B chromosome j. Let ei,j be the expected number of genes in the cell assuming that the genes are scattered independently in the two genomes. That is, ei,j = ni,· n·,j / n, where ni,· is the row total, the number of genes on species A chromosome i with an orthologue anywhere in species B's genome; n·,j is the column total, the number of genes on species B chromosome j with an orthologue anywhere in species A's genome; and n is the total number of orthologous genes mapped between the two species. Then a measure of syntenic correlation is given by

This measure of association has the following properties: It always makes sense as long as 0^2/0 is interpreted as being 0; the value of ρ lies between 0 and 1; the value is 1 if and only if, for one of the two species, knowing which chromosome an orthologue belongs to in that species determines which chromosome the orthologue is on in the other species; the value is 0 if and only if the counts of orthologues are perfectly independently scattered on the chromosomes of the two species; and the value is not changed by reordering the chromosomes in the two species.
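The properties listed above can be checked numerically on small tables. The sketch below assumes a Cramér-type scaling of the chi-square statistic, χ²/(n(min(r, c) - 1)), with expected counts ei,j = ni,· n·,j / n. It returns that scaled statistic and, separately, its square root (commonly called Cramér's V); both satisfy the listed properties, so the published definition should be consulted for which of the two the syntenic correlation denotes. Function names and grids are hypothetical.

```python
from math import sqrt

def scaled_chi_square(grid):
    """chi^2 / (n * (min(r, c) - 1)) for an r x c table of orthologue counts."""
    row_totals = [sum(row) for row in grid]
    col_totals = [sum(col) for col in zip(*grid)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(grid):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:                  # 0^2/0 is interpreted as 0
                chi2 += (observed - expected) ** 2 / expected
    return chi2 / (n * (min(len(grid), len(grid[0])) - 1))

def cramers_v(grid):
    """Square root of the scaled chi-square statistic."""
    return sqrt(scaled_chi_square(grid))

perfect = [[5, 0], [0, 7]]                    # each chromosome maps to exactly one partner
independent = [[2, 4], [3, 6]]                # counts equal their expected values
determined = [[4, 0, 3], [0, 5, 0]]           # species B chromosome determines species A chromosome
```

Reordering the rows or columns of any of these grids leaves both values unchanged, since the statistic is a sum over cells.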
The use of this scaled chi-square statistic as a measure of association is not new. It was proposed by Cramér as a measure of the degree of dependence or association between the arguments of a contingency table (CRAMÉR 1946). Although ρ is a useful measure of syntenic correlation, other statisticians have argued against the use of modified versions of the chi-square statistic as a measure of the degree of association (![]()
One of the alternative measures of association proposed by ![]()
![]()
is the difference between the probabilities of making an error with no information and with the chromosome of the other species known divided by the chance of making an error with no information from the other species. To obtain a symmetric measure, assume a gene is taken from each of the two species with probability 1/2 each.
Let mi,· be the maximum number of orthologues mapped from species A chromosome i to any chromosome in species B. Similarly, let m·,j be the maximum number of orthologues mapped from species B chromosome j to any chromosome in species A. Let mA be the maximum number of genes mapped to any single chromosome in species A and mB be the maximum number of genes mapped to any single chromosome in species B. Recall that n is the total number of genes mapped. The proposed measure of association is then

This measure of association has the following properties: It makes sense as long as not all the orthologous genes mapped lie in only one chromosome pairing; the value of
lies between 0 and 1; the value is 1 if and only if the counts of orthologues are concentrated in chromosome pairings (cells of the table), no two of which are in the same row or column; the value is 0 whenever knowing the chromosome on which an orthologue resides in the other species is of no help in determining the chromosome the gene resides on in the focal species; the value is not changed by reordering the chromosomes in the two species.
If the orthologues are scattered independently on the chromosomes of the two species, then this measure of chromosome prediction ability is 0. However,
= 0 whenever the same chromosome in the focal species is most likely to contain a gene no matter which chromosome contains the orthologue in the other species. An example where
= 0 without the genes being scattered independently may be constructed by ensuring that chromosome 1 of species A always contains more orthologues than any other chromosome from species A for each given chromosome in species B and that chromosome 2 of species B plays the same role for species B. Clearly the orthologues do not need to be independently scattered when constructing this example. Thus, this measure assesses the predictive value of the conditional distributions for gene assignments to chromosomes in the other species but it does not measure the randomness of the distribution of the orthologues among chromosome pairs.
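The counterexample described above can be made concrete. The sketch below assumes the standard symmetric prediction measure built from the quantities just defined: the sum of row maxima plus the sum of column maxima, minus mA and mB, divided by 2n - mA - mB. The function name `symmetric_lambda` and the 2 x 2 grids are hypothetical. Rows are species A chromosomes and columns are species B chromosomes.

```python
def symmetric_lambda(grid):
    """Symmetric prediction measure computed from row and column maxima."""
    n = sum(sum(row) for row in grid)
    sum_row_max = sum(max(row) for row in grid)          # sum over i of m_{i,.}
    sum_col_max = sum(max(col) for col in zip(*grid))    # sum over j of m_{.,j}
    m_a = max(sum(row) for row in grid)                  # largest species A chromosome total
    m_b = max(sum(col) for col in zip(*grid))            # largest species B chromosome total
    denominator = 2 * n - m_a - m_b
    if denominator == 0:
        raise ValueError("all orthologues lie in a single chromosome pairing")
    return (sum_row_max + sum_col_max - m_a - m_b) / denominator

# Chromosome 1 of A dominates every column and chromosome 2 of B dominates every row.
dominated = [[3, 10],
             [1, 2]]
value = symmetric_lambda(dominated)

# The counts are not independent: the cross products 3 * 2 and 10 * 1 differ.
is_independent = dominated[0][0] * dominated[1][1] == dominated[0][1] * dominated[1][0]
```

The measure is zero for this grid even though the cell counts depart from their independence expectations, which illustrates why it does not capture randomness of the scatter.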
| RESULTS |
|---|
To compare our method for estimating the true number of conserved syntenies to the method of ![]()
![]()
![]()
We report the observed, estimated, and 95% upper bound estimate of conserved syntenies between all species pairs of man, cow, rat, and mice in Table 1. We also include the cat-human data from ![]()
ρ = 0.66) is not statistically significantly different from the syntenic correlation between mice and rats (
ρ = 0.69) even though the time since divergence for humans and cats (
92 mya; ![]()
40.7 mya; ![]()
![]()
| DISCUSSION |
|---|
Recent articles on estimating the total number of conserved syntenies or segments between pairs of species (![]()
![]()
![]()
![]()
Further, many recent approaches choose one member of the pair of species being compared to provide critical information for the model. ![]()
![]()
![]()
![]()
In this article, we have demonstrated how to take the dependency of the number of genes in conserved syntenies into account when estimating the true total number of conserved syntenies and measuring syntenic correlation. Our methods are symmetrical and do not require the specification of a focal genome. We believe that extending our methods to estimating the total number of conserved segments is fundamentally more problematic. The following extension of our model to the problem of estimating the true number of conserved segments demonstrates why. The following is closely related to the ![]()
Suppose we observed k conserved segments containing n1, n2, ... , nk orthologues, respectively, where n = Σ ni is the total number of orthologues mapped between the two species. Suppose that the actual l conserved segments from the ancestral genome have lengths that are independently distributed and follow an exponential distribution with a common parameter. Then the proportional lengths follow a uniform Dirichlet distribution (FRISTEDT and GRAY 1997, pp. 156-157). To estimate the total number of conserved segments, l, consider the likelihood function
$$\mathrm{Lik}(l \mid n_1, \ldots, n_k) = \int \frac{n!}{n_1! \cdots n_k!}\; p_1^{n_1} \cdots p_k^{n_k}\, (l-1)!\; dp_1 \cdots dp_{l-1} = \frac{n!\,(l-1)!}{(n+l-1)!}$$
for l = k, k + 1, ... This distribution is uniform on the probability space, which includes not only the actual counts (n1, n2, ... , nk) observed but also which k of the l total conserved segments get those counts.
This likelihood function has its maximum when l = k [because Lik(l + 1|n1, n2, ... , nk) = l/(n + l) Lik(l|n1, n2, ... , nk)]. In short, without an informative, proper, prior distribution on the true number of conserved segments or information about the actual observed segment lengths proportional to the length of the genome, our most likely single estimate of the total number of conserved segments that will ever be found is simply the number observed at present.
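The recursion quoted above can be verified in a few lines. The sketch assumes the likelihood n!(l - 1)!/(n + l - 1)!, which equals 1/C(n + l - 1, n) and is consistent with the stated ratio l/(n + l). The function name and the values of n and k are hypothetical.

```python
from fractions import Fraction
from math import comb

def segment_likelihood(l, n):
    """n!(l-1)!/(n+l-1)! written as 1 / C(n+l-1, n) (assumed form)."""
    return Fraction(1, comb(n + l - 1, n))

n, k = 12, 4                                  # hypothetical: 12 orthologues in 4 observed segments
ratios = [segment_likelihood(l + 1, n) / segment_likelihood(l, n)
          for l in range(k, k + 5)]
best_l = max(range(k, k + 50), key=lambda l: segment_likelihood(l, n))
```

Every ratio is below one, so the likelihood decreases in l and the search returns the observed count k.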
The difference between estimating the total number of conserved syntenies and the total number of conserved segments is that, in the case of conserved syntenies, we in effect assume a noninformative prior distribution to model which chromosome pairs will contribute a conserved synteny. Given that there are exactly l conserved syntenies, each combination of l chromosome pairs out of the m possible pairs is assumed to be equally likely. In the case of counting conserved segments, the noninformative prior is improper because there is no upper bound on the true number, l, and the result is that our best guess for the true number of conserved segments is the number of conserved segments observed (much as our best guess for the probability a randomly chosen new gene will land in each cell in the Oxford grid is simply the observed proportion of genes in the cell).
Additionally, the observed number of conserved syntenies is a sufficient statistic for estimating the total number of conserved syntenies but the same is not true for segments. In other words, the information encoded by the numbers of orthologues found in the observed conserved syntenies and by the number and positions of the observed conserved syntenies that is useful in estimating the total number of conserved syntenies is completely summarized by k, the observed number of conserved syntenies.
Mathematically, we compute the likelihood function for l, the total number of conserved syntenies, given the observed number, k, as

This formula is obtained from the density function of the raw data by counting the number of ways to choose the k observed conserved syntenies from the m possible ones and the number of ways of distributing the n orthologues between those k conserved syntenies so that none are empty (![]()
Because we have no upper bound for the number of conserved segments, we lose information about the total number of conserved segments when we summarize the data by reporting only the observed number. The likelihood function for the total number of conserved segments, l, given the observed number, k, is

which is obtained from the density function of the raw data by counting the number of ways to choose the k observed segments from the total number of segments, l, and distributing the n orthologues so that none of the k segments is empty. Since one of the additional terms does depend on l, we have different information about l when we summarize the data into just the count of the number of conserved segments. Note that this formula is given in Theorem 3 (![]()
One way around these difficulties may be to use the following proper prior distribution: Assume an arbitrarily large, artificial number of possible conserved segments, m, and assume that, prior to obtaining data, each possible segment has equal chances of ever containing orthologues or not. This approach corresponds to the approach used in estimating conserved syntenies. For sufficiently large choices of m, the maximum-likelihood estimator for the true number of segments, l, will not depend on m and the posterior distribution of l will depend only weakly on m.
Neither the raw number of conserved segments nor the raw number of conserved syntenies provides an adequate measure of genomic distance. While the measures proposed by BENGTSSON et al. (1993) and discussed in ![]()
![]()
![]()
![]()
The caveat to the above work, of ourselves and of others, is the typical caveat for all observational data: The orthologous genes that have been mapped must represent a random sample of all the orthologous genes that will be discovered. Indeed, the orthologues found so far may be the ones that are more easily found due to mutational constraints on their divergence and these constraints may also require higher levels of synteny correlation. The ones left to be discovered may be more divergent due to fewer restrictions on their evolution and this relaxation of mutational constraint may also allow them to be more scattered in the genome. Nonetheless, even if orthologues are eventually found on all chromosome pairs from the two species and even when the entire genomes of many pairs of species have been mapped, our syntenic correlation measure will provide a useful and nontrivial measure of syntenic conservation, allowing for the summary and comparison of Oxford grids for many pairs of species.
| ACKNOWLEDGMENTS |
|---|
We thank Phuong Ngo-Hazelett for help with the construction and formatting of the Oxford grids analyzed in this article, Sasha Richardson for help constructing the synteny plot given in Fig 1, and Michael Lynch for suggestions improving the legibility of some of the formulas. We heartily thank David Sankoff and two anonymous reviewers for their constructive comments. One of the anonymous reviewers was particularly helpful, providing references for the use of scaled versions of the chi-square statistic as a measure of association and voicing concerns that enabled us to improve the article substantially. This work was supported by a National Science Foundation interdisciplinary grant in the mathematical sciences DMS 0075143 to E.A.H. and National Institutes of Health grant R01RR10715 to J.P.
Manuscript received September 12, 2001; Accepted for publication June 3, 2002.
| LITERATURE CITED |
|---|
BENGTSSON, B. O., K. K. LEVAN, and G. LEVAN, 1993 Measuring genome reorganization from synteny data. Cytogenet. Cell Genet. 64:198-200.
BOVBASE, 2001 The Roslin Institute, Edinburgh (http://www.ri.bbsrc.ac.uk/bovmap/arkbov/), July 28, 2001.
CRAMÉR, H., 1946 Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
EHRLICH, J., D. SANKOFF, and J. H. NADEAU, 1997 Synteny conservation and chromosome rearrangements during mammalian evolution. Genetics 147:289-296.
FELLER, W., 1968 An Introduction to Probability Theory and Its Applications, Vol. I. John Wiley & Sons, New York.
FISHER, R. A., 1938 Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
FRISTEDT, B., and L. GRAY, 1997 A Modern Approach to Probability Theory. Birkhäuser, Boston.
GOODMAN, L. A. and W. H. KRUSKAL, 1954 Measures of association for cross classifications. J. Am. Stat. Assoc. 49:732-764.
GRAUR, D., and W.-H. LI, 2000 Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA.
GUTTMAN, L., 1941 An outline of the statistical theory of prediction. Supplementary study B-1, pp. 253-318 in The Prediction of Personal Adjustment, edited by P. HORST, P. WALLIN and L. GUTTMAN. Bulletin 48, Social Science Research Council, New York.
HANNENHALLI, S., C. CHAPPEY, E. V. KOONIN, and P. A. PEVZNER, 1995 Genome sequence comparison and scenarios for gene rearrangements: a test case. Genomics 30:299-311.
KUMAR, S. and S. B. HEDGES, 1999 A molecular timescale for vertebrate evolution. Nature 392:917-920.
KUMAR, S., S. R. GADAGKAR, A. FILIPSKI, and X. GU, 2001 Determination of the number of conserved chromosomal segments between species. Genetics 157:1387-1395.
MOUSE GENOME DATABASE (MGD), 2001 Mouse Genome Informatics Web Site, The Jackson Laboratory, Bar Harbor, ME (http://www.informatics.jax.org/), July 28, 2001.
MURPHY, W. J., S. SUN, Z. CHEN, N. YUHKI, and D. HIRSCHMANN et al., 2000 A radiation hybrid map of the cat genome: implications for comparative mapping. Genome Res. 10:691-702.
NADEAU, J. H. and D. SANKOFF, 1998 Counting on comparative maps. Trends Genet. 14:495-501.
SANKOFF, D. and J. H. NADEAU, 1996 Conserved synteny as a measure of genome rearrangement. Discrete Appl. Math. 71:247-257.
SANKOFF, D., G. LEDUC, N. ANTOINE, B. PAQUIN, and B. F. LANG et al., 1992 Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA 89:6575-6579.
SANKOFF, D., M.-N. PARENT, I. MARCHAND and V. FERRETTI, 1997 On the Nadeau-Taylor theory of conserved chromosome segments, pp. 262-274 in Combinatorial Pattern Matching. Eighth Annual Symposium, edited by A. APOSTOLICO and J. HEIN. Lecture Notes in Computer Science 1264, Springer Verlag, Berlin.
WADDINGTON, D., A. SPRINGBETT, and D. W. BURT, 1999 A chromosome-based model for estimating the number of conserved segments between pairs of species from comparative genetic maps. Genetics 154:323-332.
ZAKHAROV, I. A. and A. K. VALEEV, 1988 Quantitative analysis of evolution of mammalian genomes by comparison of genetic maps. Proc. Acad. Sci. USSR 301:1213-1218. Genetika 28:77-81.