The stability of the structure of bacterial genomes is challenged by recombination events. Since major rearrangements (i.e., inversions) are thought to frequently operate by homologous recombination between inverted repeats, we analyzed the presence and distribution of such repeats in bacterial genomes and their relation to the conservation of chromosomal structure. First, we show that there is a strong underrepresentation of inverted repeats, relative to direct repeats, in most chromosomes, especially among the ones regarded as most stable. Second, we show that the avoidance of repeats is frequently associated with the stability of the genomes. Closely related genomes reported to differ in terms of stability are also found to differ in the number of inverted repeats. Third, when using replication strand bias as a proxy for genome stability, we find a significant negative correlation between this strand bias and the abundance of inverted repeats. Fourth, when measuring the recombining potential of inverted repeats and their eventual impact on different features of the chromosomal structure, we observe a tendency of repeats to be located in the chromosome in such a way that rearrangements produce a smaller strand switch and smaller asymmetries than expected by chance. Finally, we discuss the limitations of our analysis and the influence of factors such as the nature of repeats, e.g., transposases, or the differences in the recombination machinery among bacteria. These results shed light on the challenges imposed on the genome structure by the presence of inverted repeats.
The advances of the last decade on genome sequencing and pulsed field gel electrophoresis provide a puzzling image concerning the organization and stability of bacterial genomes. On one hand, many features of genome organization have been found or further unraveled, such as the impact of replication in imposing compositional strand biases (Lobry 1996) and constraining gene distribution (McLean et al. 1998). Coding sequences cover 90% of most bacterial genomes and transcriptional regulation can be very complex, suggesting selection for structural stability. On the other hand, the genome structure is extremely fluid. Operons are not well conserved between distant species (Itoh et al. 1999) and gene content varies at a very high rate in some bacterial lineages (Casjens 1998), partly because of frequent horizontal transfers (Ochman et al. 2000). Distinctive features of the organization of bacterial genomes, notably in relation to replication, have different importance in different species (Rocha and Danchin 2001). Among groups such as Firmicutes, one observes very different compositional bias, as well as different gene positional biases (Rocha 2002). Thus, the understanding of the trade-offs between genome stability and the requirements of genotypic diversity is becoming a major issue in the study of genome evolution.
Intrachromosomal homologous recombination can lead to deletions, duplications, translocations (for direct repeats), and inversions (for inverted repeats; Smith 1988; Roth et al. 1996; Romero and Palacios 1997). All these events change the genome composition, but most of them do not induce very important shifts in its structure. Indeed, large deletions are counterselected, large insertions are rare, and large tandem duplications are not observed in currently sequenced bacterial genomes, probably because they are too unstable. Therefore, inversions have been regarded as one of the main motors of chromosome structural change (Liu and Sanderson 1996; Roth et al. 1996; Hughes 2000). Pairwise comparisons among completely sequenced genomes show that the first large chromosome rearrangements are caused by inversions (Eisen et al. 2000; Tillier and Collins 2000a; Zivanovic et al. 2002). Currently, the sole exception to this rule is provided by the comparison of Mycoplasma pneumoniae and M. genitalium, which reveals several translocations and no inversion (Himmelreich et al. 1997). However, inverted repeats capable of mediating chromosomal inversions are strongly underrepresented in these genomes, being present up to 60 times less frequently than direct repeats (Rocha and Blanchard 2002).
It is difficult to define genome stability without experimental support or a large number of very close genomes. Thus, we use replication compositional bias as a proxy of genome stability. DNA replication is asymmetric; one strand is replicated continuously (leading strand) whereas the other is replicated in discrete steps through the use of Okazaki fragments (lagging strand; Marians 1992). Since the origins of replication in bacteria, when they are known, seem to be unique, the asymmetry in replication creates a durable asymmetry in the structure of the chromosome (Frank and Lobry 1999). This leads to different nucleotide compositions in each replicating strand, which seem to result from an essentially neutral mutational bias (Frank and Lobry 1999; Tillier and Collins 2000b; Rocha and Danchin 2001). Thus, the intensity of the bias is shaped by the strength of the mutational mechanism and by the rates of genome rearrangement. Assuming that the strength of the mechanism has small variations between genomes, strand bias should be highly correlated to the stability of the chromosome.
Chromosomal inversions seem to be rare in nature but very frequent in the laboratory (Louarn et al. 1985; Roth et al. 1996). This suggests the existence of selection pressure for maintaining chromosomal structure. As a result of this, the large inversions observed in bacterial genomes are symmetrical in relation to the origin of replication (Eisen et al. 2000; Tillier and Collins 2000a). It is then important to understand how chromosomes face recombination events and especially inversions. Here, we tackle this question by accounting for the distribution of repeats capable of producing rearrangements (inverted repeats). We also take into consideration the effects of such potential rearrangements on different elements of the chromosomal structure, in particular chromosome asymmetry and replication strand bias.
METHODS AND DATA
Data: Data on the complete bacterial genomes were taken from Entrez Genomes (http://www.ncbi.nlm.nih.gov), and the annotations were taken from the GenBank files. Except when noted otherwise, we used only one strain for each species to avoid any bias in favor of species represented several times in GenBank. This resulted in a data set of 63 chromosomes, representing 58 bacterial genomes.
Identification of large strict repeats: To compute the threshold minimal length of large repeats, we used a statistic of extremes that takes into account the nucleotide composition and the length of the genome (Karlin and Ost 1985). Among bacteria, the minimal length for which the probability of finding one exact repeat in the genome is <1‰ is in the range 21-26 nucleotides (nt) (P < 0.001; Rocha et al. 1999a). The search for such large, strictly identical repeats was done using Reputer (Kurtz and Schleiermacher 1999).
Deriving large nonstrict repeats: To investigate the influence of genome structure on repeats, we identified nonstrict repeats from strict repeats using an extension process previously described (Achaz et al. 2000, 2002). The method identifies nonstrict repeats by extending both sides of strict repeats when they share significant similarity in sequence. This is based on a local alignment procedure (Smith and Waterman 1981). Nucleotide frequencies differ widely between bacterial species and identity matrix scores produce artificially longer repeats in highly biased genomes. To avoid this effect, we used an empirical scoring matrix for each chromosome, which takes into account the frequencies of nucleotides (Achaz et al. 2002). After comparing different methods to build such a matrix, we used the one providing closer average lengths for repeats detected in random genomes with different nucleotide compositions (from very low to very high values of G + C content). This scoring matrix is defined in terms of pi, the frequency of the nucleotide i in the genome. It provides scores of matches ranging from 20 to 41 and scores of mismatches ranging from -41 to -20. The score of matchN/i is either 7 or 8, depending on the genome bias.
Strand compositional bias: Linear discriminant analyses followed by skew analyses were used to identify genomes with significant strand bias, as in Rocha and Danchin (2001). Once origin and terminus were identified, compositional strand bias was quantified in terms of ΔGC skews. These are defined as the average difference in GC skews between the genes in the leading and the lagging strand. ΔGC = (Glead - Clead)/(Glead + Clead) - (Glag - Clag)/(Glag + Clag), where Xi is the nucleotide frequency of the nucleotide X (i.e., G or C) in the genes of strand i (i.e., lead or lag). This normalizes the replication biases in terms of the genome average bias in nucleotide composition.
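The ΔGC skew defined above can be illustrated with a minimal Python sketch. It assumes that the gene sequences of each strand are simply pooled before counting G and C, and that each sequence is given as read on the replicating strand it belongs to; the function names `gc_skew` and `delta_gc_skew` are hypothetical and are not taken from the original analysis.

```python
def gc_skew(seq: str) -> float:
    """GC skew (G - C)/(G + C) of a nucleotide sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def delta_gc_skew(leading_genes, lagging_genes) -> float:
    """Difference in GC skew between pooled leading-strand and lagging-strand genes.

    Assumption: genes of each strand are concatenated before counting,
    which matches the pooled frequencies Glead, Clead, Glag, Clag.
    """
    lead = "".join(leading_genes)
    lag = "".join(lagging_genes)
    return gc_skew(lead) - gc_skew(lag)
```

With a leading-strand gene `GGGC` (skew 0.5) and a lagging-strand gene `GCCC` (skew -0.5), this toy input gives a ΔGC skew of 1.0.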
MODELS OF GENOME REARRANGEMENT
Before proceeding, we must model the potential outcome of recombination between repeats. We consider a random model where each copy of a repeat can recombine with another copy of the repeat in a random way. We further suppose that couples of repeats of identical size recombine at identical frequency. Yet, two factors are taken into account. First, since one expects larger repeats to recombine more often than smaller ones (in a linear fashion according to Shen and Huang 1986; Vulic et al. 1997), we weight each repeat by its length when computing indices of potential rearrangement. Second, we incorporate the fact that recombination always proceeds between two copies of a repeat. A repeat present in two copies recombines only in a single way between the two copies. However, a repeat with three copies (e.g., A, B, and C) can recombine in three different ways (A with B, A with C, and B with C), each resulting in different outcomes. Thus, because recombination takes place between pairs of repeats we count the latter repeat as three couples of repeats. Therefore, counting pairs of potentially recombining repeats is equivalent to counting couples of repeats.
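The counting rule described here (a repeat with n copies contributes n(n - 1)/2 couples, each weighted by the repeat length) can be sketched as follows. The function names and the `(repeat_length, copies)` input format are hypothetical, chosen only for illustration.

```python
from math import comb


def count_couples(copies: int) -> int:
    """Number of distinct pairs (couples) that can recombine among `copies` copies."""
    return comb(copies, 2)


def weighted_couples(families) -> int:
    """Sum of couples weighted by repeat length.

    `families` is an iterable of (repeat_length, copies) tuples,
    a hypothetical input format used for this sketch.
    """
    return sum(length * comb(copies, 2) for length, copies in families)
```

A two-copy repeat yields one couple and a three-copy repeat yields three, as in the A, B, C example.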
Assumptions: To accomplish this analysis we had to proceed to several assumptions:
We implicitly assume that homologous recombination may proceed with all statistically significant large repeats. Such minimal length varies from 21 to 26 nt depending on genome size and composition. In fact, it corresponds to the minimal length requirements for the start of homologous recombination by the RecBCD system in Escherichia coli and its functional homolog AddAB in Bacillus subtilis, the only bacterial species for which such studies have been conducted (Shen and Huang 1986; Roberts and Cohan 1993).
We analyze the distribution of repeats as they occur in the published genomes. Thus, we do not take into account the changes of that distribution if rearrangements do occur. Naturally, more refined models should be developed in the future to tackle this question. Such models should take into account the results of rearrangements on the relative positioning of the other repeats (even though that is a very hard computational problem), and the rate of repeat creation and loss.
Given the lack of experimental comparative studies of recombination mechanisms and frequencies in most bacteria, we implicitly assume that the frequency of intrachromosomal recombination is the same in different genomes. All bacteria here analyzed, except Buchnera (Shigenobu et al. 2000), have RecA, the major protein in homologous recombination pathways. However, the different elements of the homologous recombination pathways vary significantly between genomes (Eisen and Hanawalt 1999).
We consider that all repeats are involved in the dynamics of the chromosome in the same way. Since self-replicating repeated elements, such as IS, have special dynamics, we analyze their influence separately. We discuss the impact of violations to these assumptions in the interpretation of the results.
Measures of global rearrangement: The inversion produced by a recombination event between two occurrences of a repeat implicates the inversion of the region between the repeats (the spacer). This element contains less than half of the chromosome, by definition. A simple way of analyzing the potential for genome rearrangement is simply to divide the total number of pairs of inverted repeats by the length of the genome, thereby computing a density of pairs of repeats. However, the analysis of direct repeats has shown that the average spacer length is different between genomes (Rocha et al. 1999a). Also, the frequency of recombination between copies of a repeat is expected to be proportional to the repeat’s length. Therefore, a more precise measure for the average rearrangement length potentially induced by the inverted repeats in a genome is given by RL = (1/GL) × Σi (Lri × Lspi)/LrT, where RL is the potential rearrangement length associated with the repeats in the genome; Lri, the length of the repeat i; Lspi, its spacer length; GL, the genome length; and LrT, the sum of the repeats’ lengths.
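A minimal sketch of this index, assuming the length-weighted form RL = (1/GL) × Σ (Lri × Lspi)/LrT implied by the symbol definitions above, and assuming the spacer is the shorter arc between the two copies on a circular chromosome (measured between start coordinates, ignoring the repeat's own length). The function names and the `(repeat_length, pos_a, pos_b)` tuple format are hypothetical.

```python
def spacer_length(pos_a: int, pos_b: int, genome_length: int) -> int:
    """Shorter arc between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


def rearrangement_length(pairs, genome_length: int) -> float:
    """Length-weighted average spacer, as a fraction of the genome length.

    `pairs` is a list of (repeat_length, pos_a, pos_b) tuples,
    one per couple of inverted repeats (hypothetical format).
    """
    total_repeat_length = sum(length for length, _, _ in pairs)
    if total_repeat_length == 0:
        return 0.0
    weighted = sum(
        length * spacer_length(a, b, genome_length) for length, a, b in pairs
    )
    return weighted / (total_repeat_length * genome_length)
```

Because the spacer never exceeds half of the chromosome, the value returned lies between 0 and 0.5.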
Inversions and replication structure: Compositional strand bias and chromosomal symmetry are differently affected by recombination between inverted repeats (Figure 1). By definition, copies of inverted repeats occur in different DNA strands. However, they can be in the same type of replicating strand (i.e., both copies in the same chirochore, either leading or lagging strand) or in the same replichore (same replicating half of the chromosome). If they are in the same replichore (IR), then an inversion will produce a shift of the spacer from one replicating strand to the other, so that the sequence of the spacer that was on the leading strand switches to the lagging strand and vice versa. However, because in this case the spacer does not include the origin or the terminus of replication, the symmetry of the chromosome (i.e., the opposite placement of origin and terminus of replication) will not be affected. Naturally, close occurrences will induce small changes, whereas distant occurrences induce large changes. One can then define a measure of average strand switch (SS) potentially induced by all IR repeats in a genome, computed from the spacer lengths of the IR repeats in the same length-weighted way as RL. Conversely, the spacer of a repeat with occurrences in the same chirochore (IC) encompasses the origin or the terminus of replication. In this case, an inversion will not change the leading/lagging character of the spacer, but may induce changes in the relative positions of the origin and terminus of replication. The average asymmetry switch (AS) induced by the inversion will be proportional to the distance of the position of the center of the spacer (Pi) to the closer origin/terminus of replication (Pori/ter).
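The distinction between IR and IC couples can be sketched from coordinates alone: since the two copies of an inverted repeat lie on opposite DNA strands, copies in the same replichore fall in different chirochores, and copies in different replichores fall in the same chirochore. The sketch below assumes a circular chromosome with known origin and terminus coordinates; the function names are hypothetical. It also assumes roughly symmetric replichores, so that the shorter arc between two copies in the same replichore does not contain the origin or the terminus.

```python
def replichore(pos: int, ori: int, ter: int, genome_length: int) -> int:
    """Return 0 if `pos` lies on the arc from ori to ter (increasing
    coordinates, wrapping around the circle), and 1 otherwise."""
    arc = (ter - ori) % genome_length
    offset = (pos - ori) % genome_length
    return 0 if offset < arc else 1


def classify_inverted_pair(pos_a: int, pos_b: int, ori: int, ter: int,
                           genome_length: int) -> str:
    """Label a couple of inverted repeats as 'IR' (same replichore)
    or 'IC' (same chirochore, i.e., different replichores)."""
    same_replichore = (
        replichore(pos_a, ori, ter, genome_length)
        == replichore(pos_b, ori, ter, genome_length)
    )
    return "IR" if same_replichore else "IC"
```

For a 1000-nt toy chromosome with the origin at 0 and the terminus at 500, copies at 100 and 400 form an IR couple, whereas copies at 100 and 900 form an IC couple whose spacer spans the origin.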
Expected values: We determined the expected values of RL, SS, and AS under a model where pairs of copies of repeats engage into recombination randomly. The null model corresponds to a random placement of repeats in the chromosomes. Thus, approximate values for the expectations of RL, SS, and AS can be easily determined by simulation. Here, we detail the derivation of the exact expressions. Under the model of random placement of repeats in the chromosome, the distance between two copies of a repeat is distributed uniformly in the interval ]0, GL/2]. Therefore, the expected value of RL is 1/4 (i.e., 1/GL × GL/4).
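The simulation route mentioned here can be sketched with a small Monte Carlo experiment. This is only an illustration under stated assumptions: all repeats have the same length, copies are placed uniformly at random, the origin sits at coordinate 0 and the terminus at GL/2. The function name `simulate_expectations` is hypothetical. The value of 1/12 used below for SS is our own consequence of the null model (uniform distance, probability 1 - 2d/GL that a spacer of length d contains neither origin nor terminus) and is not quoted from the source.

```python
import random


def simulate_expectations(genome_length: int, n_pairs: int, seed: int = 0):
    """Monte Carlo estimates of E[RL] and E[SS] under random placement.

    Assumptions of this sketch: equal repeat lengths, origin at 0,
    terminus at genome_length / 2 (perfectly symmetric replichores).
    """
    rng = random.Random(seed)
    half = genome_length / 2
    total_spacer = 0.0
    total_switch = 0.0
    for _ in range(n_pairs):
        a = rng.randrange(genome_length)
        b = rng.randrange(genome_length)
        d = abs(a - b)
        spacer = min(d, genome_length - d)  # shorter arc between the copies
        total_spacer += spacer
        # Strand switch only when both copies are in the same replichore.
        if (a < half) == (b < half):
            total_switch += spacer
    norm = n_pairs * genome_length
    return total_spacer / norm, total_switch / norm
```

With 100,000 simulated couples on a 1-Mb chromosome, the two estimates should fall close to 1/4 and 1/12, respectively.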
For the determination of the expected values of SS and AS we assume, as previously, uniform distribution for the distance between copies. For simplicity, and without loss of generality, we assume that all repeats have the same length. Under these conditions, we call SSi the strand switch associated with a repeat and allow it to take one of two values: either the length of the spacer (both copies in the same replichore) or 0 (both copies in the same chirochore). Given the symmetry of the system, the value SSi = 0 has a probability 0.5. Thus, one has to determine only the expression for the probability density function of SSi when repeats are in the same replichore (which sums to 0.5). This results in a function that depends linearly on the spacer length (see Figure 2) and is constrained by two conditions: (i) the cumulated probability is 0.5 and (ii) the function evaluates to zero at GL/2. Thus, the probability density function is given by f(x) = (4/GL²) × (GL/2 - x), for 0 < x ≤ GL/2, which results in a function whose expected value is given by E(SSi) = GL/12. Since SS is the sum of each partial SSi, divided by the genome length, its expected value is 1/12. Excluding from the analysis the repeats in the same chirochore, for which SS = 0, the expected value becomes 1/6. A similar reasoning applies to the determination of the expected value of AS.
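The two constraints stated above (a density linear in the spacer length, integrating to 0.5 and vanishing at GL/2) determine the function completely. The following worked derivation is our own reconstruction from those constraints, under the equal-repeat-length assumption of the text; the symbol f for the density is introduced here for illustration.

```latex
% Linear density vanishing at GL/2: f(x) = a (GL/2 - x), 0 < x <= GL/2.
\begin{align*}
\int_{0}^{GL/2} a\left(\frac{GL}{2} - x\right)dx
  &= a\,\frac{GL^{2}}{8} = \frac{1}{2}
  \;\Longrightarrow\; a = \frac{4}{GL^{2}}, \\
f(x) &= \frac{4}{GL^{2}}\left(\frac{GL}{2} - x\right), \\
E[SS_i] &= \int_{0}^{GL/2} x\,f(x)\,dx
  = \frac{4}{GL^{2}}\left(\frac{GL^{3}}{16} - \frac{GL^{3}}{24}\right)
  = \frac{GL}{12}, \\
E[SS] &= \frac{E[SS_i]}{GL} = \frac{1}{12},
\qquad
E[SS \mid \text{same replichore}] = \frac{1/12}{1/2} = \frac{1}{6}.
\end{align*}
```

The same density follows from the geometry of the null model: the distance d between copies has density 2/GL on ]0, GL/2], and a spacer of length d avoids both the origin and the terminus with probability 1 - 2d/GL, whose product is f(d).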
RESULTS AND DISCUSSION
Relative distribution of inverted repeats in bacterial genomes: Absolute numbers of repeats: The distribution of direct and inverted repeats in bacterial genomes has recently been analyzed in the context of horizontal transfer (Rocha et al. 1999a) and of repeat generation (Achaz et al. 2002). These works have shown that bacterial genomes contain a considerable amount of large repeats. Furthermore, the abundance of such repeats is highly variable among species. In our data set, one finds a maximum of 66,860 pairs of inverted repeats in Neisseria meningitidis and no inverted repeats in Chlamydia trachomatis (Table 1). Interestingly, both bacteria are human pathogens and seem to have a functional RecBCD system. However, C. trachomatis is an obligatory and intracellular parasite, whereas Neisseria is neither. A low level of repeats is typical of obligatory intracellular bacteria (see below). On the other hand, Neisseria is a very extreme case of repeat abundance, mostly because of antigenic variation (Saunders et al. 2000). For clarity it is removed from most graphs where the number of repeats is taken explicitly into account. The average length of the repeats in the different genomes is nearly always well above the lower threshold of statistical significance. Indeed, the average length of strict repeats in the genomes is 207 nucleotides.
Inverted repeats are underrepresented compared to direct ones: One would expect to find more direct than inverted repeats if selection acts toward minimizing inversions. On one hand, inverted repeats may induce inversions. On the other hand, if repeats originate mainly from close direct repeats (Achaz et al. 2002), inversions are required to create inverted repeats from direct repeats. In any case, our analysis indicates that inverted repeats are usually underrepresented compared to direct repeats: the ratio of inverted/direct repeats is almost always <1 (Figure 3). This still holds if one excludes close direct repeats, which are thought to be the result of an active process of duplication (Achaz et al. 2002). The strongest underrepresentation of inverted repeats tends to occur when the total number of repeats is smaller. This suggests that when repeats are avoided (e.g., for structural reasons), inverted repeats are even more strongly avoided, possibly because of their major role in chromosomal inversions. Genomes saturated with repeats, e.g., Neisseria, show no difference between inverted and direct repeats, possibly because selection for a direct positioning vs. the inverted becomes inefficient at such a high level of repeat density (or possibly because their recombination apparatus is less sensitive to repeats).
Rearrangement length: Although relative avoidance of inverted repeats may suggest counterselection of sequences capable of producing inversions, many different causes can underlie such avoidance. In particular, if the magnitude of the rearrangements’ counterselection were simply proportional to their length, one would expect a selection for close repeats that could induce small rearrangements. However, the average observed/expected (O/E) RL is 0.963, which is not significantly different from 1 (P > 0.4, signed-rank test; Figure 4). One is then inclined to think that although selective pressure against rearrangements may cause the avoidance of inverted repeats, relative to direct ones, there is no systematic tendency toward the minimization of the length of the potential rearrangement.
Support for the hypothesis that inverted repeats challenge the chromosomal stability: Analyses of close genomes: The genomes presenting the lowest values of observed/expected rearrangement length are the ones containing fewer repeats, notably Chlamydia, some Mycoplasma, Rickettsia, and Buchnera. These are also the genomes with smaller inverted/direct ratios. Interestingly, recent works have shown that many obligatory intracellular bacterial genomes keep a remarkable synteny (Suyama and Bork 2001; Wolf et al. 2001). In light of their small populations one would expect less efficient purifying selection and therefore larger differences in gene order. The observations that such genomes contain a reduced recombination potential, especially when it involves inversions, may thus explain their stability. Recently, a second genome of Buchnera has been published (Tamas et al. 2002), which indicates that for 50 million years these genomes remained strictly colinear, showing no inversion. This is not surprising since these genomes have both <10 large inverted repeats in their genomes and a deficient homologous recombination machinery. However, the other small stable genomes presenting few repeats do code for both RecA and RecBCD or RecF-like systems.
Closely related bacteria with very different repeat abundance show increased levels of synteny loss. For example, the strains KIM and CO92 of Yersinia pestis are very closely related (average 99.9% of protein similarity) but show a considerable amount of rearrangement in their genomes (Deng et al. 2002). The closely related Salmonella enterica typhi and typhimurium (mean protein similarity of 98.6%) show only two large rearrangements. This can be put into relation with their different numbers of repeats: ∼5000 in Y. pestis, many of them insertion sequences, and <1000 in S. enterica typhimurium (for genomes of similar lengths). The correlation between abundance of repeats and genome stability seems to be valid also in Archaea. A recent comparative study of three Pyrococcus (Pyrococcus abyssi, P. horikoshii, and P. furiosus) has indicated that P. furiosus is much more subject to genome rearrangements (Zivanovic et al. 2002). The close comparison between these genomes seemed to implicate repeats in these rearrangements. Indeed, the comparison of the number of pairs of inverted repeats in these genomes (of nearly identical genome length) is in good agreement with these observations: 503 for P. horikoshii, 711 for P. abyssi, and 2004 for P. furiosus.
The case of Rickettsia conorii: One major exception to this trend concerns the comparison of Rickettsia conorii with R. prowazekii. R. conorii is 14% larger than R. prowazekii, but the genomes are colinear, thus supposedly stable, even though R. conorii contains 1180 inverted repeats that have been proposed to replicate in a selfish manner (Ogata et al. 2000), compared with only 6 inverted repeats in R. prowazekii. A closer analysis of the former genome indicates that its repeats are all small, since 70% of repeats are between 25 and 30 bp long, and only 2 repeats are >85 bp. Since the genome does not contain a homolog of the RecBCD system, homologous recombination is expected to follow the RecF pathway (Shen and Huang 1986). The RecF pathway is thought to be involved in restarting replication after replication fork disassociation, and the importance of its role in homologous recombination has been disputed (Courcelle et al. 2001; Amundsen and Smith 2003). For this pathway, the minimal length of strict homology required to start homologous recombination in E. coli is much larger than that required for the RecBCD pathway. It might be as large as 90 bp, whereas it is 20-30 bp for the RecBCD pathway (Shen and Huang 1986). It is thus quite possible that these repeats are not targeted by homologous recombination in Rickettsia, because of the peculiarities of its recombination machinery. This would explain the stability of these genomes in spite of the large number of small repeats.
Support for using replication compositional bias as a proxy of genome stability: We have previously suggested a link between the number of repeats in a genome and the replication compositional strand bias (Rocha et al. 1999b). Compositional replication strand bias seems to result from a fast asymmetric mutational bias causing inverted genes to adapt fast to the new strand (Tillier and Collins 2000b). The magnitude of strand bias can vary either by the intensity of the mutational bias or by processes counteracting its establishment, such as genome rearrangements. In this sense, important levels of strand bias can be established only if the genomes are stable. Among the 63 chromosomes, 44 exhibit a significant replication bias. Genomes with significant strand bias have a median of 394 repeats/genome, whereas the remaining genomes have a median of 708 repeats/genome (even though the former genomes are 23% larger). Moreover, genomes with significant strand bias show a negative correlation (ρ=-0.30, P < 0.05, Spearman’s rank test) between the number of inverted repeats and the intensity of the bias (measured as ΔGC skew). These results suggest that the chromosomal stability is highly challenged by inverted repeats. As a consequence, in very stable chromosomes, the number of inverted repeats might tend to be minimized.
How do inverted repeats challenge the chromosomal stability? To tackle this question, we divided inverted repeats into two categories: repeats in the same chirochore (further labeled as IC) and repeats in the same replichore (IR; see models of genome rearrangement and Figure 1). We also developed simple measures of the impact of these repeats on AS and SS. AS measures the consequences of potential rearrangements between IC. SS measures the consequences of potential rearrangements between IR. Therefore the ratio of observed/expected of these indices indicates the association between the positioning of repeats and the instabilities they might induce on genomes.
Differences between IC and IR suggest selection for chromosomal stability: Repeats are causes of change in chromosomal structure, but the distribution and maintenance of repeats is also constrained by the characteristics of that structure. In genomes containing strong compositional strand biases, the mutation pattern is similar for both copies of IC, but different for both copies of IR (Rocha and Danchin 2001). Thus, faster divergence between copies of IR repeats, relative to IC, could lead to differences in number and length between IC and IR (Table 2). If this is so, we should expect higher similarity between copies of IC than between copies of IR. Naturally this hypothesis cannot be tested with the data on strict repeats, which are identical (by definition). Therefore, we extended by dynamic programming the exact repeats into larger nonstrict repeats by searching for significant similarity at the edge of the strict repeats (as described in methods and data). The comparison of nonstrict repeats confirms that IC are more numerous and longer than IR (Table 2). However, the average identity percentage does not differ between IC and IR repeats. This suggests that the different abundance of each type of repeats is not due to larger rates of divergence among IR repeats. The avoidance of IR could then be a consequence of negative selection on the distribution of repeats. Such selection pressure may have different origins. First, inversions change the relative distance of the genes to the origin of replication. This is expected to be counterselected in genomes selecting for highly expressed genes near the origin of replication. Second, genes on the leading strand will be transferred to the lagging strand and vice versa. This is also expected to be counterselected for highly expressed genes and for genomes containing two dedicated DNA polymerases (Rocha 2002). 
Finally, it has been proposed that higher levels of substitutions in inverted genes may lead to gene loss (Mackiewicz et al. 2001).
Chromosomes tend to keep their symmetry: Using the positions of the origins and termini of replication, one can determine the relative lengths of the two replichores. We analyzed the 48 genomes for which the origin and the terminus can be reliably predicted. In these genomes the length of the two replichores never differed by >20%. Further, the ratio of the lengths of the smallest over the largest replichores of each genome shows a median of 0.95 (data not shown). Such similarity between replichore lengths is in good agreement with the existence of a selective pressure against inversions increasing the asymmetry of the chromosome. A similar selection pressure has been observed in horizontal transfer between strains of E. coli and Salmonella, since genomic variation tends to occur in equal amounts on both replichores, thus keeping chromosomal symmetry (Bergthorsson and Ochman 1998). Further, inversions between the rRNA operons of E. coli that strongly change the symmetry of the chromosome have been found to be severely detrimental (Hill and Gray 1988). This is also in good agreement with data indicating preference for symmetrical rearrangements around the origin and terminus of replication (Eisen et al. 2000; Tillier and Collins 2000a). It has been proposed that such inversions could result from illegitimate recombination between the two newly replicated chromosomes at the moment of replication (Tillier and Collins 2000a), but there is still no experimental evidence of such a mechanism. The analyses of AS indicate O/E ratios systematically smaller than one (average AS = 0.86, P < 0.001, signed-rank test; Figure 5). This indicates that potential rearrangements caused by homologous recombination between IC tend to be symmetrical and that such IC repeats may be less negatively selected.
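The replichore length ratio used at the start of this section (smallest over largest replichore) can be computed from the origin and terminus coordinates. The sketch below assumes a circular chromosome with a single origin and a single terminus; the function name `replichore_ratio` is hypothetical.

```python
def replichore_ratio(ori: int, ter: int, genome_length: int) -> float:
    """Ratio of the shorter to the longer replichore (1.0 = perfect symmetry).

    Assumes a circular chromosome with one origin and one terminus
    located at distinct coordinates.
    """
    first = (ter - ori) % genome_length   # arc from ori to ter
    second = genome_length - first        # complementary arc
    return min(first, second) / max(first, second)
```

On a 1000-nt toy chromosome with the origin at 100 and the terminus at 580, the replichores are 480 and 520 nt long, giving a ratio of about 0.92.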
Strand switch and replication compositional bias: An inversion between two IR switches the strands of the spacer and thus switches the compositional biases in each strand. The comparison of genomes with and without significant compositional strand biases shows a different median observed/expected SS (respectively, 0.80 and 1.08, P < 0.01, Wilcoxon test). Genomes lacking strand compositional bias have a median observed/expected SS not significantly different from 1 (median 1.08, not significant), whereas the others show a ratio systematically smaller than one (median 0.80, P < 0.001, signed-rank test). Further, among these genomes there is a significant negative correlation between the potential of repeats to induce strand switch and their genome ΔGC skew (-0.553, P < 0.001, Spearman ρ; Figure 6). Although the correlation is highly significant, the analysis of its residuals shows a considerable dispersion and two outliers, Streptococcus pneumoniae and N. meningitidis (P < 0.01). This is an indication that other factors affect strand bias and/or that some of our basic assumptions are oversimplified (e.g., the assumption of similar recombination mechanisms and frequencies in different bacteria).
General picture: Both AS and SS indicate observed/expected ratios systematically smaller than 1 (Figure 5), and the differences between AS and SS are not statistically significant. One should note that avoiding simultaneously AS and SS can be done in two different ways. First, it can be done if the occurrences of repeats are close. However, the analysis of RL for all inverted repeats and the relative abundance of IR and IC indicates that this is not the case. Second, it can be done by selecting the placement of the two copies of repeats in the same chirochore and in a symmetrical way around the origin or the terminus of replication (see Figure 1). Our results point toward the latter hypothesis.
The special role of transposases: Among the simplifications we have made at the beginning of this work, we assumed that repeats induced rearrangements through homologous recombination. This is an oversimplification for some types of sequences and especially when transposases are concerned. We have thus tried to further analyze the impact of these elements in the induction of genome rearrangements. We have identified 40 bacterial genomes containing genes coding for putative transposases, using the annotation files. As expected, these genomes contain a much larger density of repeats (4.5 times larger, P < 0.002, Wilcoxon test). Further, the density of repeats correlates well with the number of transposases (ρ= +0.45, P < 0.005, Spearman rank test) with two clear outliers (S. solfataricus and S. pneumoniae). However, only 19% of the repeats directly concern sequences coding for transposases. Part of the difference may be explained by the difficulty in identifying unknown families of transposases or by the existence of insertion sequence (IS) remnants that no longer contain intact transposases. Only in three genomes (Bacillus halodurans, Synechocystis C125, and Y. pestis) do transposase-coding sequences include >55% of the genome’s inverted repeats (respectively, 76%, 74%, and 72%).
Genomes lacking IS have smaller ratios of inverted/direct repeats (median 0.22) than genomes containing IS (median 0.69, P < 0.01), although both values are significantly <1 (P < 0.01). Transposases also have a positive and similar effect on the O/E values for AS and SS, which tend to get closer to 1 with both the presence and the number of transposases in the genome (P < 0.01). Thus, the role of transposases in shuffling the genome seems to exceed that of simple repeats targeted by homologous recombination. It is likely that their self-replicative behavior further shuffles the chromosome.
The availability of complete genomes of close species, or strains within a species, has brought to light the importance of genome rearrangements in fashioning the bacterial genome (Hughes 1999). Almost without exception, the first major rearrangements observed in recently divergent bacterial strains or species concern inversions that are symmetrical around the origin and terminus of replication. Here, we have tried to understand the relation between such analyses and the potential for intrachromosomal rearrangements mediated by the long repeats present in bacterial chromosomes. Selective processes are probably at the basis of the different abundance and characteristics of inverted repeats. These repeats have important consequences for genome stability, as we have seen, but they can also be under positive selection for antigenic variation or gene dosage effects. This seems to be a particular case of the trade-off between the necessity of generating genotypic diversity and the problems that are derived from that need.
To be able to compare different genomes we were forced to make several simplifying assumptions. Some, e.g., the role of transposases, could be tackled in this work, but most will have to be tested as more experimental work on homologous recombination in other bacteria becomes available. In particular, it is of utmost importance to determine the relative levels of homologous recombination between repeats in different genomes as well as the minimal lengths required for homologous recombination. The results of this work suggest that these requirements are likely to be different, since some genomes, such as Neisseria, contain an astonishingly high level of repeats. The genome of S. pneumoniae shows particularly striking features, since it contains very high numbers of repeats for its size and large numbers of transposases (46 genes), but exhibits strong ΔGC skews and 80% of the genes in the leading strand. Such a well-ordered genome structure contrasts with the quantity of elements capable of disrupting it. It remains an open question whether this is due to differences in the recombination machinery or to other processes.
Most of the results we have presented are compatible with the hypothesis that repeats challenge the structure of bacterial chromosomes. We found low values of AS and SS, a frequent association of repeat density with differential stability of close genomes, and a systematic underrepresentation of inverted repeats relative to direct ones. However, one would have also expected to find O/E RL values significantly <1, which was not the case. Considering only IR, though, O/E RL values are <1, resulting in O/E SS < 1 (the underrepresentation of IR as compared to IC leads to that apparent randomness). On the other hand, the lack of a global bias in RL shows that mechanisms creating repeats at short distances are not biasing our results. O/E RL values close to 1 could result if the other elements contributing to the selection of a stable chromosomal structure are not sensitive to the length of the rearrangement. For example, selection of operon structures should be equally effective on small and on large rearrangements, since in both cases only the two operons at the breakpoints of rearrangements are disrupted (and this only if the repeats are inside different operons). Considering that many large repeats in bacteria are inside coding sequences (Rocha et al. 1999a), selection for minimization of operon disruption would be effective only through the avoidance of inverted repeats relative to direct ones (as observed). Thus, the distribution of repeats in genomes would be constrained by the structure of the chromosome in terms of replication, which is dependent on the length and the type of rearrangement, and by some other factors, which are possibly independent of the length of the inverted segments (i.e., the rearrangement length). The relation between the distribution of repeats in bacterial chromosomes and other genomic features is still a largely unexplored field. For example, several works have suggested that nonpermissive intervals of rearrangement exist in E. coli (Segall et al. 1988; Guijo et al. 2001) and that some regions of the chromosome are particularly prone to recombination events (Louarn et al. 1991). Further work will be required to tackle these questions.
We are very grateful to Isabelle Gonçalves for carefully reading the manuscript. Guillaume Achaz was funded by “La Société de Secours des Amis des Sciences.” Eric Coissac and Pierre Netter are at the Université Pierre et Marie Curie and Eduardo Rocha at the Centre National de la Recherche Scientifique. This work was partially funded by the Association pour la Recherche sur le Cancer, contract 4672.
Communicating editor: M. A. F. Noor
- Received December 17, 2002.
- Accepted April 14, 2003.
- Copyright © 2003 by the Genetics Society of America