Base composition is spatially structured in mammalian genomes. From cesium chloride centrifugation experiments, Bernardi et al. (1985) defined three major classes of genomic fragments with low, median, and high GC content, respectively, and called them isochores. This discrete description now appears artificial. Analyses of the complete human genome (International Human Genome Sequencing Consortium 2001) have dismissed the underlying hypothetical picture of sharp boundaries between long homogeneous fragments: GC content turns out to vary continuously, and somewhat erratically, along chromosomes. These analyses, however, have confirmed the existence of a highly significant spatial autocorrelation of GC content, with most of the structure detectable at a relatively large (300-kb) scale. There is a strong correlation between the GC content at third codon positions (GC3) and the GC content of the region in which a gene is located (Bernardi 2000). GC3 varies from a typical 40% in low-GC-content regions to 80% and more in high-GC-content regions. GC content is correlated with various genomic features, including repeat element distribution, methylation pattern (Jabbari and Bernardi 1998), and, most remarkably, gene density (Mouchiroud et al. 1991; Duret et al. 1995). GC-rich regions include many genes with short introns while GC-poor regions are essentially deserts of genes. This suggests that the distribution of GC content in mammals could have some functional relevance, raising the issue of its origin and evolution. For brevity, in this article, we use the word “isochore” as an abbreviation for “the peculiar, structured distribution of GC content in mammalian genomes.”
Two opposite views about isochore evolution have been hotly debated over the last 20 years, as part of the neutralist/selectionist controversy. One view is that isochores may simply reflect variable mutation processes among genomic regions, consistent with the neutral model. Such variation in mutational biases has never been demonstrated. Alternatively, isochores might be the result of natural selection. Bernardi and colleagues have long argued that the isochore structure is an adaptation to homeothermy since it has been found in mammals and birds but not amphibians and fish (e.g., Bernardi 1993). The discovery of isochores in crocodiles and turtles (Hughes et al. 1999) led to the rejection of this hypothesis. Moreover, selection, if any, must be unrelated to gene expression: The GC content of genes in humans does not correlate positively with their expression level or pattern (Goncalves et al. 2000), and even pseudogenes translocated into a GC-rich region undergo an increase of GC content (Francino and Ochman 1999). Neutralists furthermore argued that, given the small effective population size of mammalian species, a selective hypothesis must imply very high selection coefficients (typically 2 orders of magnitude higher than in Drosophila) at every position of GC-rich regions, including introns and intergenic DNA (Sharp et al. 1995). The existence of such a selective pressure without apparent correlation with gene expression appeared quite speculative.
On the other hand, recent analyses of human polymorphism data sets (Eyre-Walker 1999; Smith and Eyre-Walker 2001) unexpectedly contradicted the mutational bias hypothesis. Under neutrality, the substitution process (accumulation of changes in the long run) should reflect the mutation process since each kind of mutation has an equal fixation probability. At equilibrium, therefore, the number of AT → GC and GC → AT mutations arising should be equal. Eyre-Walker, however, found significantly more GC → AT than AT → GC mutations. G and C alleles seem to have a selective advantage, making the substitution process more GC biased than the mutation process. These reports renewed interest in selective hypotheses for the evolution of isochores (e.g., Bernardi 2000). There are, however, at least two alternative possible explanations for this result. First, the excess of GC → AT mutations may be the consequence of nonstationarity. Such an excess is expected if GC content is decreasing in GC-rich regions. Smith and Eyre-Walker (2001) could not detect any departure from stationarity by comparing human sequences to other primate species. An analysis of repeated elements in the human genome, however, somewhat contradicted this view (International Human Genome Sequencing Consortium 2001). Second, Holmquist (1992) and Eyre-Walker (1993, 1999) mention another possible evolutionary force that might explain the polymorphism pattern, namely biased gene conversion (BGC). We now examine the BGC hypothesis and argue that it is likely to play an important role in the evolution of GC content in mammals.
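The stationarity argument above can be illustrated with a minimal two-state sketch. It assumes a neutral model in which each GC site mutates to AT at rate u and each AT site mutates to GC at rate v; the function names and the numerical rates are hypothetical, chosen only for illustration, and are not taken from the cited studies. At the equilibrium GC content v/(u + v) the two mutational flows are equal, whereas a region whose GC content lies above that equilibrium (i.e., a region in which GC content is decreasing) produces an excess of GC → AT changes.

```python
def neutral_equilibrium_gc(u: float, v: float) -> float:
    """Equilibrium GC fraction when GC->AT occurs at rate u and AT->GC at rate v."""
    return v / (u + v)


def mutation_flows(gc: float, u: float, v: float) -> tuple:
    """Return the (GC->AT, AT->GC) numbers of new mutations per site per generation."""
    return gc * u, (1.0 - gc) * v


# Hypothetical per-site mutation rates (assumptions, not estimates from the literature).
u, v = 1.5e-8, 1.0e-8

gc_eq = neutral_equilibrium_gc(u, v)
print("neutral equilibrium GC:", gc_eq)

# At equilibrium the two flows balance.
print("flows at equilibrium (GC->AT, AT->GC):", mutation_flows(gc_eq, u, v))

# A GC-rich region that is above its equilibrium shows an excess of GC->AT changes.
print("flows at GC = 0.6 (GC->AT, AT->GC):", mutation_flows(0.6, u, v))
```

In this sketch an excess of GC → AT mutations is therefore compatible either with a fixation bias favoring GC at stationarity or with a neutral, nonstationary region that is losing GC.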
Gene conversion is a molecular mechanism associated with recombination in which a genomic fragment is “copied/pasted” onto another homologous fragment (Lamb 1984). Both DNA fragments therefore share identical sequences after a conversion event. Allelic conversion occurs during the process of meiotic recombination. A DNA heteroduplex is formed, involving the plus strand of one chromosome and the minus strand of the homologous chromosome (Figure 1). If this region of heteroduplex includes a heterozygous site, then a mismatch will occur. This mismatch may be recognized and corrected by DNA repair systems, thus removing a difference between the two chromosomes. Conversion results in non-Mendelian segregation of alleles in the germ cell where it occurs (Figure 1). This has no evolutionary relevance if the two possible ways of repairing a mismatch have equal probabilities, because the proportions of gametes averaged over germ cells then remain Mendelian. If, however, the repair process is biased toward, say, G:C pairs, then allele frequencies would evolve in a nonneutral fashion: GC/AT heterozygotes would produce a higher proportion of GC gametes than AT gametes, resulting in a higher fixation probability for GC alleles.
Nagylaki (1983) showed that the dynamics of the fixation process for one locus under biased gene conversion (BGC) are identical to those under directional selection. Either effect could therefore account for the excess of GC → AT polymorphisms in humans. Eyre-Walker (1999) provides arguments against the BGC hypothesis. First, he notes that BGC is unlikely since third codon positions and introns have distinctive GC contents, while BGC should have identical effects in introns and exons. This difference, however, can be explained by an external factor, namely the accumulation of relatively GC-poor transposons (LINEs, Alu) within introns but not within coding regions (Duret and Hurst 2001). There is therefore no need to invoke distinct fixation biases in introns and synonymous sites. Eyre-Walker (1999) also argues that the BGC hypothesis is unlikely because there is only a narrow range (one order of magnitude) within which the rate of biased conversion would be high enough to significantly alter polymorphism patterns but low enough not to induce an extreme base composition. Biased conversion within the necessary range is not just an ad hoc assumption, however, if one takes into account the selective pressure acting on genetic systems. Very highly biased conversion rates would probably simply be selected against (see below for additional discussion of this issue).
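The equivalence between BGC and directional selection can be made concrete with a minimal sketch. It assumes the classical diffusion approximation for the fixation probability of a new semidominant mutation in a diploid population whose effective size equals its census size, and it treats the conversion bias as a coefficient `b` that plays the role of a selection coefficient; the function name, the symbol `b`, and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def fixation_probability(n_e: float, b: float) -> float:
    """Diffusion approximation for a new allele at initial frequency 1/(2*n_e).

    b > 0 favors the new allele, b < 0 disfavors it, and b == 0 is the neutral case.
    """
    if b == 0.0:
        return 1.0 / (2.0 * n_e)
    # (1 - exp(-2b)) / (1 - exp(-4*n_e*b)), written with expm1 for numerical accuracy.
    return math.expm1(-2.0 * b) / math.expm1(-4.0 * n_e * b)


# Hypothetical values: n_e = 10,000 and a bias chosen so that 4 * n_e * b = 1.
n_e = 10_000
b = 2.5e-5

p_neutral = fixation_probability(n_e, 0.0)
p_gc = fixation_probability(n_e, b)    # new GC allele, favored by the bias
p_at = fixation_probability(n_e, -b)   # new AT allele, disfavored by the bias

print("neutral:", p_neutral)
print("new GC allele:", p_gc)
print("new AT allele:", p_at)
print("ratio GC/AT:", p_gc / p_at)
```

Under these assumptions the ratio of the two fixation probabilities is exp(4·Ne·b − 2b), so even a very small per-generation bias shifts the substitution process toward GC once Ne·b approaches 1.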
Several observations indicate that the BGC hypothesis deserves further attention. First, biased DNA repair toward GC has been observed experimentally in mammalian cells after transfection of mismatched DNA fragments (Brown and Jiricny 1987; Bill et al. 1998). To significantly influence the fixation process of GC/AT polymorphisms and the resulting equilibrium GC content, BGC must occur at a rate w such that Ne · w ≥ 1 (where Ne is the effective population size and w is the per nucleotide conversion rate times the average conversion bias). Assuming effective population sizes of ~10⁴–10⁵, the estimated recombination rates (~10⁻⁸ crossovers per base pair per generation; Weissenbach et al. 1992), conversion biases (Bill et al. 1998), and length of heteroduplex during recombination events (several hundreds of bases and up to several kilobases; Detloff and Petes 1992; Li and Baker 2000) are high enough in mammals for BGC to be effective.
Sequence analysis also provides clues suggesting a role for gene conversion. Two kinds of conversion events occur within genomes: conversion between different copies within a gene family (ectopic gene conversion) or conversion between the two alleles of a gene (allelic gene conversion, Figure 1). If BGC were a major determinant of GC-content evolution, one would expect sequences undergoing frequent gene conversion (either ectopic or allelic) to become GC rich. Ectopic conversion is frequent in genes undergoing concerted evolution, and allelic conversion is frequent in recombination hotspots. Among the genes undergoing concerted evolution in mammals, the well-known ribosomal operons (especially their introns, Table 1), transfer RNAs (Table 1), and histones (GC3 range from ~60% to ~90%) are all GC rich, consistent with the above prediction. In histones the GC content of coding sequences (which experience gene conversion) is much higher than expected given the GC content of flanking regions (which do not; De Bry and Marzluff 1994). Similarly, Hogstrand and Bohme (1999) reported that those regions of the human and mouse MHC genes involved in gene conversion events show a higher CpG level and a higher GC content than regions for which conversion is not suspected.
Hotspots of recombination should also become GC rich under the BGC hypothesis, since gene conversion between alleles occurs during meiotic recombination, where homologous chromosomes are paired. Eyre-Walker (1992) reported observations suggesting a correlation between recombination rates and GC contents, including a higher chiasma density in cytogenetic bands suspected to be GC richer and a low GC content of the nonrecombining Y chromosome. The most famous hotspot of recombination in mammals is the pseudoautosomal region (PAR) of the X and Y chromosomes. In the sex chromosomes, a short region of homology, the so-called PAR, behaves like an autosomal bivalent in males. During meiosis, every bivalent undergoes at least one recombination event (e.g., see Lawrie et al. 1995). Given its small size, the per nucleotide recombination rate in the PAR is therefore much higher than in autosomes (Soriano et al. 1987). X-specific chromosomal sequences, in contrast, undergo one-half the autosomal recombination rate, while Y-specific sequences do not recombine at all. The average GC content at third codon positions of genes in these four chromosomal locations (Y, X, autosome, and PAR) increases with the recombination rate (Table 2), consistent with the BGC hypothesis. The GC content is even higher in the mouse PAR (gene STS: GC3 = 96.2%). Perry and Ashworth (1999) showed that the GC3 of a gene recently translocated into the PAR in mouse increased from 50 to 73% in <1 million years, strongly suggesting that recombination is the cause, not the consequence, of a high GC content. A similar line of evidence comes from birds. Bird genomes include very small chromosomes. Like the PAR, these microchromosomes must have a high per nucleotide recombination rate since they undergo at least one recombination event per generation. Bird microchromosomes again show a very high GC content, probably even higher than mammalian GC-rich regions (Kadi et al. 1993).
A relationship between recombination rate and GC content in humans was recently demonstrated by Eisenbarth et al. (2000). These authors analyzed the linkage disequilibrium between loci at the boundary of a GC-rich and a GC-poor region. They found significantly greater linkage disequilibrium (i.e., presumably less recombination) in the GC-poor region, again consistent with the BGC hypothesis. This result is somewhat supported by a genome-wide analysis: Yu et al. (2001) reported a weak, though significant, correlation between estimated recombination rates and GC content, despite the low level of accuracy of the human genetic map, as the authors acknowledge. Finally, note that a relationship between recombination and GC content has also been found in Saccharomyces cerevisiae (Gerton et al. 2000), Drosophila melanogaster, and Caenorhabditis elegans (Marais et al. 2001).
The BGC hypothesis can therefore account for (i) the higher GC content of regions undergoing gene conversion; (ii) the correlation between recombination rate and GC content, where recombination seems to be the governing force; and (iii) the nonneutral human polymorphism patterns. These arguments, together with the experimental evidence of a GC bias of the repair process, strongly suggest (but do not demonstrate) that BGC might be a major force governing isochore evolution.
A correlation between GC content and recombination rate is also expected under the hypothesis that sequences are under selection, because linkage reduces the efficacy of selection (Hill and Robertson 1966). This hypothesis, however, requires the existence of high selection coefficients, which are unable to operate in regions of low recombination rate. GC3 in the mouse PAR is >95%, which is reached for Ne · s = 1 assuming that sites evolve independently (where Ne is the effective population size and s the per site selection coefficient; e.g., see Bulmer 1991). If this were true, the mutation load for the entire genome would be enormous. If Hill-Robertson effects were impeding such a strong selection pressure and reducing the genomic GC content to a value so far from its optimal value (the GC3 in GC-poor regions is typically 40%), one might expect natural selection to favor modifiers, increasing the recombination rate (although this has not been demonstrated formally). Furthermore, such a strong Hill-Robertson effect, if any, should not affect GC content only. Under this hypothesis, selection against deleterious nonsynonymous changes should be less effective in regions of low recombination rate. This should result in a negative correlation between GC content and nonsynonymous substitution rates. We examined this prediction and found a weak positive (r = 0.036) correlation between a maximum-likelihood estimate of the nonsynonymous rate (Yang and Nielsen 2000) and the GC content of 5000 genes sampled in human and mouse (not shown). The Hill-Robertson effect, if any, has no detectable effect on the fate of nonsynonymous changes, making the hypothesis that it influences putatively selected synonymous ones unlikely.
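The GC3 value quoted above for the mouse PAR can be checked with a minimal sketch of the mutation-selection-drift equilibrium. It assumes independent sites, a diploid population with genic selection so that the ratio of the fixation probabilities of GC and AT alleles is approximately exp(4·Ne·s), and a mutation bias that would by itself yield the 40% GC3 typical of GC-poor regions. This scaling convention, the function name, and the use of 40% as the neutral baseline are assumptions made here for illustration and may differ from the parameterization used by Bulmer (1991).

```python
import math


def equilibrium_gc(neutral_gc: float, ne_s: float) -> float:
    """Equilibrium GC fraction for a given neutral GC fraction and scaled coefficient Ne*s.

    At equilibrium, GC/(1 - GC) = (v/u) * exp(4*Ne*s), where v/u is the ratio of the
    AT->GC to GC->AT mutation rates implied by the neutral GC fraction.
    """
    u_over_v = (1.0 - neutral_gc) / neutral_gc
    return 1.0 / (1.0 + u_over_v * math.exp(-4.0 * ne_s))


# Neutral baseline of 40% GC3 (assumption), with increasing values of Ne*s.
for ne_s in (0.0, 0.25, 0.5, 1.0):
    print(f"Ne*s = {ne_s:4.2f} -> equilibrium GC3 = {equilibrium_gc(0.4, ne_s):.3f}")
```

With these assumptions, Ne · s = 1 gives an equilibrium GC3 of roughly 97%, in line with the order of magnitude stated in the text; under the equivalence discussed earlier, the same curve applies if s is read as a conversion bias rather than a selection coefficient.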
A test of the causal relationship between recombination and GC content could come from the examination of polymorphism patterns in inbreeding species. Inbreeding populations include no heterozygotes so that BGC, however strong, would not influence the dynamics of allele segregation. Inbreeding species are scarce in vertebrates but common in plants. The genomes of Gramineae are quite similar to mammalian genomes with respect to the distribution of GC content and genes (Carels and Bernardi 2000). A comparison of sequence polymorphism patterns in inbreeding and outbreeding species of Gramineae and maybe other plant species would be of great relevance to the study of GC content evolution.
Although, under BGC, alleles do not segregate according to the neutral model, BGC is basically a neutral process since there are no fitness differences between individuals. The BGC hypothesis, however, might ironically raise new selectionist issues. Even if direct selection for GC content at every site of GC-rich regions turns out to be unlikely, selection on the molecular machinery that determines the dynamics of GC content must be considered. An interesting hypothesis was proposed by Fryxell and Zuckerkandl (2000). They argued that a biased repair process might be an adaptation to the high rate of methyl-cytosine deamination. Cytosines involved in CpG doublets have a mutation rate maybe 10 times higher than other nucleotides in humans (Gianelli et al. 1999). Correctly repairing these mutations is therefore crucial. Deamination of unmethylated cytosines produces uracil, and the resulting U:G mismatch can easily be repaired since U is an alien base in DNA. Deamination of methyl-cytosine, however, produces thymine and an ambiguous T:G mismatch. A repair process that would favor T:G → C:G repair over T:G → T:A might therefore be advantageous in methylated genomes (including mammals, birds, and plants) since most T:G mismatches result from a C → T mutation. Under this scenario (and assuming again that BGC determines GC content), isochores would be a by-product of natural selection acting on DNA repair. A potential test of this would be to compare mammals and species in which the repair of G:T mismatches is unbiased to see whether the latter have similarly structured GC content. Few such data are available, but, consistently, one such species, Xenopus laevis (Varlet et al. 1990), has a genome with much less structured GC content than that of the mammals that have been studied.
Communicating editor: D. Charlesworth
- Received December 27, 2000.
- Accepted July 31, 2001.
- Copyright © 2001 by the Genetics Society of America