The Evolution of Isochores: Evidence From SNP Frequency Distributions
Martin J. Lercher (a), Nick G. C. Smith (b), Adam Eyre-Walker (c), and Laurence D. Hurst (a)
a Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom
b Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
c Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton BN1 9QG, United Kingdom
Corresponding author: Martin J. Lercher, University of Bath, Claverton Down, Bath, Somerset BA2 7AY, UK. E-mail: m.j.lercher{at}bath.ac.uk
| ABSTRACT |
|---|
The large-scale systematic variation in nucleotide composition along mammalian and avian genomes has been a focus of the debate between neutralist and selectionist views of molecular evolution. Here we test whether the compositional variation is due to mutation bias using two new tests, which do not assume compositional equilibrium. In the first test we assume a standard population genetics model, but in the second we make no assumptions about the underlying population genetics. We apply the tests to single-nucleotide polymorphism data from noncoding regions of the human genome. Both models of neutral mutation bias fit the frequency distributions of SNPs segregating in low- and medium-GC-content regions of the genome adequately, although both suggest compositional nonequilibrium. However, neither model fits the frequency distribution of SNPs from the high-GC-content regions. In contrast, a simple population genetics model that incorporates selection or biased gene conversion cannot be rejected. The results suggest that mutation biases are not solely responsible for the compositional biases found in noncoding regions.
BASE composition varies along mammalian chromosomes over hundreds of kilobases.
Recent studies have tested the mutation bias hypothesis (that mammalian compositional variation is due to mutation bias variation alone) by considering single-nucleotide polymorphism (SNP) data. [...] AT→GC mutations and those generated by GC→AT mutations (termed AT→GC and GC→AT polymorphisms, respectively). Under the mutation bias hypothesis it can be shown that the numbers of the two types of polymorphisms are expected to be equal irrespective of the sequence composition, so long as the sequences are at compositional equilibrium. [...] GC→AT polymorphisms at synonymous sites and introns in GC-rich mammalian protein-coding genes; this suggests that either selection or biased gene conversion affects synonymous base composition. Given that the base composition of synonymous sites and introns is highly correlated with the base composition of the isochore in which the gene resides [...]
However, these studies assume compositional equilibrium of the examined sequences, whereas base composition can change profoundly and systematically over evolutionary time.
A comparison of the frequency distributions was first used by [...]
| MATERIALS AND METHODS |
|---|
We identified 2769 biallelic SNPs that were segregating an A or a T nucleotide and a G or a C nucleotide and that had been assayed for allele frequencies, in build 95 of the refSNP database (http://www.ncbi.nlm.nih.gov/SNP). From refSNP, we also obtained the surrounding sequence of each SNP and its location in relation to known or predicted genes (from contig annotation or BLASTing against mRNA in GenBank). To avoid sites that are under selection, we excluded all SNPs within 2 kb of a gene, resulting in 1504 SNPs. This also reduces the effect of selection on linked sites.
We test the null hypothesis of neutrality under purely mutational forces by constructing a population genetics model for the expected distribution of A/T allele frequencies. To this end we have developed a model that is expected to be less sensitive to compositional nonequilibrium. We take as our starting point the classic formula for the density of polymorphisms at which the new allele segregates at frequency x,

$$D(x, S) = c\,\frac{1 - e^{-S(1 - x)}}{\left(1 - e^{-S}\right)\,x\,(1 - x)}, \qquad (1)$$

where x is the frequency of the new allele; S = 4N_e s, where N_e is the effective population size and s is the fixation bias favoring the new allele (i.e., the selective coefficient if selection acts on GC content and a neutral bias in the case of biased gene conversion); and c is a scaling factor. Below, we test two alternative models: when considering mutation bias as the driving force of local GC composition, we set S = 0, in which case D(x, 0) = c/x; when considering fixation bias (from selection or biased gene conversion), we allow S to vary.
The SNPs in our data set were sampled in a two-stage process: before being assayed for allele frequency in a large number of chromosomes, the SNPs were originally discovered in samples of far fewer chromosomes. For any given SNP with known allele frequency x, the probability of detecting both alleles in an original sample of n chromosomes is

$$1 - x^{n} - (1 - x)^{n}.$$

The second stage of allele frequency determination can be modeled by a binomial sampling formula [...]
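The effect of the discovery stage can be illustrated with a minimal Python sketch. It assumes that a SNP is detected only if the discovery sample of n chromosomes contains at least one copy of each allele, which gives the probability 1 - x^n - (1 - x)^n; the function name `detection_probability` and the example frequency are illustrative and not taken from the original analysis.

```python
def detection_probability(x, n):
    """Probability that a sample of n chromosomes contains both alleles
    of a SNP whose allele frequency is x (0 < x < 1)."""
    # x**n: every sampled chromosome carries the first allele;
    # (1 - x)**n: every sampled chromosome carries the second allele.
    return 1.0 - x**n - (1.0 - x)**n


# A rare variant (x = 0.05) is much more likely to be discovered
# in a larger discovery sample than in a pair of chromosomes.
for n in (2, 20):
    print(n, detection_probability(0.05, n))
```

Small discovery samples therefore strongly under-represent low-frequency alleles, which is why n has to enter the expected frequency distribution.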
Although the formula for D(x, S) assumes that the distribution of polymorphism frequencies is stationary, this is a considerably less stringent assumption than the assumption of compositional stationarity used in previous tests of compositional neutrality. The test using the numbers of AT→GC and GC→AT polymorphisms is sensitive to changes in the mutation bias over a timescale of roughly 1/u, where u is the mutation rate. [...] 4N_e generations, with a standard deviation of the same order of magnitude. [...] 10,000 (A. EYRE-WALKER, P. D. KEIGHTLEY and N. G. C. SMITH, unpublished data) and the generation time is about 25 years. Thus, the polymorphism frequency test is compromised only by changes in the mutation bias over a timescale of 1-2 million years.
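As a quick check of the arithmetic, the following sketch converts 4N_e generations into years, assuming an effective population size of about 10,000 and a generation time of about 25 years as quoted above. The function name is illustrative.

```python
def polymorphism_timescale_years(effective_size, generation_years):
    """Return 4 * N_e generations expressed in years."""
    return 4 * effective_size * generation_years


timescale = polymorphism_timescale_years(10_000, 25)
print(f"{timescale:,} years")  # 1,000,000 years
```

Because the standard deviation is of the same order of magnitude as the mean, the relevant window extends to roughly twice this value.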
To achieve this increased robustness of our model against compositional nonequilibrium, we have to allow the proportions of GC→AT and AT→GC mutations to vary (under a neutral mutation model the proportions are equal when the composition is at equilibrium, even if the GC→AT and AT→GC mutation rates are not equal). We can then calculate the expected distribution of A/T polymorphism frequencies by combining the formulas for the GC→AT and AT→GC mutations. For a GC→AT mutation the new allele is the A/T allele, at frequency x; for an AT→GC mutation the new allele is the G/C allele, at frequency 1 - x. Thus the expected number of polymorphisms for which the A/T allele frequency is between x1 and x2, E(x1, x2), is given by

$$E(x_1, x_2) = T\,\frac{\int_{x_1}^{x_2}\left[P_{GC\to AT}\,D(x, -S) + P_{AT\to GC}\,D(1 - x, S)\right]\left[1 - x^{n} - (1 - x)^{n}\right]dx}{\int_{0}^{1}\left[P_{GC\to AT}\,D(x, -S) + P_{AT\to GC}\,D(1 - x, S)\right]\left[1 - x^{n} - (1 - x)^{n}\right]dx}, \qquad (2)$$

where T is the total number of polymorphisms, P_GC→AT and P_AT→GC are the proportions of GC→AT and AT→GC mutations, respectively (P_GC→AT + P_AT→GC = 1), and S is the fixation bias of GC over AT alleles. Note that the frequency distributions D(x, -S) and D(1 - x, S) are not normalized in Equation 2, but rather the entire distribution of expected polymorphisms is normalized so that it sums to T.
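The expected counts per frequency class can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the diffusion density D(x, S) proportional to (1 - e^{-S(1-x)}) / ((1 - e^{-S}) x (1 - x)), weights it by the discovery probability 1 - x^n - (1 - x)^n, and integrates with a simple midpoint rule. All function names and the example parameter values are hypothetical.

```python
import math


def density(x, S):
    """Unnormalised density D(x, S) of a new allele at frequency x (c = 1)."""
    if abs(S) < 1e-12:
        return 1.0 / x  # neutral limit of the formula as S -> 0
    return math.expm1(-S * (1.0 - x)) / (math.expm1(-S) * x * (1.0 - x))


def detection(x, n):
    """Probability that both alleles appear in a discovery sample of size n."""
    return 1.0 - x**n - (1.0 - x)**n


def expected_counts(T, p_gc_at, S, n, classes=10, steps=2000):
    """Expected number of SNPs in each A/T allele frequency class."""

    def integrand(x):
        # GC->AT: new allele is A/T at frequency x, fixation bias -S.
        # AT->GC: new allele is G/C at frequency 1 - x, fixation bias +S.
        mix = p_gc_at * density(x, -S) + (1.0 - p_gc_at) * density(1.0 - x, S)
        return mix * detection(x, n)

    width = 1.0 / classes
    h = width / steps
    raw = []
    for j in range(classes):
        lo = j * width
        # Midpoint rule: never evaluates the integrand at x = 0 or x = 1.
        raw.append(h * sum(integrand(lo + (k + 0.5) * h) for k in range(steps)))
    total = sum(raw)
    return [T * r / total for r in raw]  # normalise so the classes sum to T


# Hypothetical parameter values, for illustration only.
neutral = expected_counts(T=1504, p_gc_at=0.5, S=0.0, n=4)
biased = expected_counts(T=1504, p_gc_at=0.7, S=2.0, n=4)
print([round(v, 1) for v in neutral])
print([round(v, 1) for v in biased])
```

The scaling factor c cancels in the normalisation, so it is set to 1 in the sketch.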
We estimated S, n, and P_GC→AT by finding those parameters that minimize the value of the G test statistic, summed over the 10 polymorphism frequency classes of width 10%,

$$G = 2\sum_{j=1}^{10} O_j \ln\frac{O_j}{E_j}, \qquad (3)$$

where O_j and E_j are the observed and expected numbers of polymorphisms in the jth frequency class. The significance of the minimum G value is tested by approximation to the chi-square distribution, with the number of degrees of freedom given by nine minus the number of parameters estimated from the data.
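A minimal sketch of the goodness-of-fit calculation is given below, assuming the standard form of the G statistic, G = 2 Σ O_j ln(O_j / E_j). The function names and the counts are hypothetical and serve only to show the bookkeeping.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O_j * ln(O_j / E_j)) over frequency classes."""
    g = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # an empty class contributes 0 (limit of o * ln(o / e))
            g += o * math.log(o / e)
    return 2.0 * g


def degrees_of_freedom(n_classes, n_fitted_params):
    """Classes minus one, minus the number of parameters fitted to the data."""
    return (n_classes - 1) - n_fitted_params


# Hypothetical counts in 10 frequency classes (both lists sum to 365).
observed = [120, 60, 40, 30, 25, 20, 20, 18, 17, 15]
expected = [110, 65, 45, 32, 26, 22, 19, 17, 15, 14]
print(g_statistic(observed, expected))
print(degrees_of_freedom(10, 2))  # neutral model: n and P_GC->AT are fitted
```

In a full analysis the parameters would be varied to minimise `g_statistic`, and the minimum would then be compared with a chi-square distribution with the stated degrees of freedom.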
The explicit population genetics model described so far makes a number of evolutionary assumptions: constant mutation rates, constant population size, no population subdivision, unbiased sampling, and no linkage between polymorphic sites and sites under positive or negative selection. We can drastically simplify our assumptions by generalizing our null hypothesis, assuming only that the frequency distributions of AT→GC and GC→AT polymorphisms are identical. Under a null hypothesis of purely mutational bias, this will be true if the mutational pattern was constant over the last 4N_e generations. This general model can be implemented by minimizing Equation 3 over the space of all possible distributions D(x, 0). We approximate this procedure by replacing D(x, 0) with three "arbitrary" monotonically decreasing functions, e^{-zx}, (e^{-z_1 x} + e^{-z_2 x}), and x^{-z}, where z, z_1, and z_2 are positive numbers that describe the shape of the distribution.
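The general model can be sketched for the exponential case as follows. This is an illustration under assumptions that are not spelled out in the text: both polymorphism types are taken to share the same new-allele frequency distribution D(f) = e^{-zf}, GC→AT polymorphisms contribute D(x) to the A/T frequency class at x, and AT→GC polymorphisms contribute D(1 - x). The function names and parameter values are hypothetical.

```python
import math


def exp_class_mass(z, lo, hi):
    """Integral of exp(-z * f) over [lo, hi], for z > 0."""
    return (math.exp(-z * lo) - math.exp(-z * hi)) / z


def general_model_counts(total, p_gc_at, z, classes=10):
    """Expected counts per A/T frequency class under D(f) = exp(-z * f)."""
    width = 1.0 / classes
    norm = exp_class_mass(z, 0.0, 1.0)
    counts = []
    for j in range(classes):
        lo, hi = j * width, (j + 1) * width
        # GC->AT: the new allele is A/T, so its frequency is x itself.
        new_at = exp_class_mass(z, lo, hi)
        # AT->GC: the new allele is G/C, so its frequency is 1 - x.
        new_gc = exp_class_mass(z, 1.0 - hi, 1.0 - lo)
        counts.append(total * (p_gc_at * new_at + (1.0 - p_gc_at) * new_gc) / norm)
    return counts


print([round(v, 1) for v in general_model_counts(500, 0.55, 3.0)])
```

Because the same shape parameter z is used for both mutation types, any systematic difference between the two observed distributions shows up as a poor fit.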
We first performed a preliminary study to justify our assumption of infinite sample size n_2 for the second sampling stage (the allele frequency assay, see above). We analyzed the polymorphism frequency data summed across all GC classes (see Table 1) in an explicit purely mutational model (S = 0). This model [...]
We then performed simulations to test whether a mixture of SNPs that were detected in samples of varying sizes n can be adequately described by a single "mean" sample size n_0. As an extreme case, we analyzed two samples of 251 SNPs each, with n = 2 and n = 20, respectively. We calculated the predicted allele frequency distributions from Equation 2, with S = 0 and P_GC→AT = P_AT→GC = 0.5, with added Poisson noise. We then used Equation 2 with a single parameter n_0 to fit the summed frequency data of these two samples. When minimizing G (Equation 3) by varying P_AT→GC and n_0, we could reject our simplified "mean n_0" model at P = 0.05 for only 5.1% of simulated data sets. This suggests that fitting a mean value for n in Equation 2 is a valid approximation.
| RESULTS |
|---|
Although our population genetics model for fitting polymorphism frequency distributions can incorporate selection or biased gene conversion (i.e., a directional fixation bias; see MATERIALS AND METHODS), our first aim is to see whether a neutral mutation model (S = 0) is capable of explaining the SNP data. We have two parameters to fit for the neutral mutation model: the primary sample size, n, and the proportion of GC→AT mutations, P_GC→AT (see MATERIALS AND METHODS). The neutral mutation model cannot be rejected for SNPs from regions with low and intermediate GC content (P = 0.09 and P = 0.78, respectively; see Fig. 1 and Table 2). It is interesting to note that there is substantial evidence of compositional nonequilibrium, with an AT→GC bias in low GC regions (P_GC→AT = 0.46) and a GC→AT bias in intermediate GC regions (P_GC→AT = 0.58).
In contrast, the neutral mutation model provides a poor fit to SNPs from regions of the genome with high GC content (P = 0.014, see Table 2). In Fig. 2 we compare the observed data and the data expected on the basis of the neutral mutation model. The failure of the neutral mutation model seems to be due to the combination of the large excess of low-frequency (0-0.1) A/T alleles and the flat shape of the polymorphism frequency distribution at high A/T frequencies (0.7-1).
This effect is not likely to be due to CpG hypermutability. We tested this by applying our neutral mutation model to the high GC SNP data after removal of all polymorphisms that may have been generated by CpG mutations (CpG→TpG or CpG→CpA). Upon removal of such mutations we find that the rejection of the explicit neutral mutation model is only marginally significant (P = 0.054, see Table 3). However, we do not consider this reduction in statistical support as strong evidence for an effect of CpG mutations, as it is most likely due to the decrease in sample size. To check this, we simulated 1000 data sets in which the high GC data were randomly discarded to generate a data set the same size as the high GC minus CpG data set. In 413 cases the resultant G value was lower than that obtained using the high GC minus CpG data set. These simulations indicate that it is worth addressing alternative explanations of the high GC SNP data.
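The random discarding step of this control can be sketched as below. The sketch assumes that SNPs are removed uniformly at random, without replacement, from the binned high GC data; the counts, the target size, and the function name are hypothetical, and the refitting of the neutral model to each reduced data set is not shown.

```python
import random


def subsample_counts(counts, keep, rng):
    """Randomly keep `keep` SNPs (without replacement) from binned counts."""
    # Expand the histogram into one class label per SNP.
    labels = [j for j, c in enumerate(counts) for _ in range(c)]
    chosen = rng.sample(labels, keep)
    reduced = [0] * len(counts)
    for j in chosen:
        reduced[j] += 1
    return reduced


rng = random.Random(1)  # fixed seed so the sketch is reproducible
high_gc = [120, 60, 40, 30, 25, 20, 20, 18, 17, 15]  # hypothetical counts
replicates = [subsample_counts(high_gc, 250, rng) for _ in range(1000)]
print(replicates[0])
```

Each replicate would then be fitted with the neutral model, and the resulting G values compared with the G value of the data set from which CpG-derived polymorphisms were removed.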
The differences between observed and expected high GC SNP frequency distributions appear consistent with the action of a directional fixation bias, i.e., natural selection or biased gene conversion. [...] GC→AT mutations are preferentially removed and AT→GC polymorphisms are preferentially retained (in our notation such fixation bias is equivalent to S > 0). We fitted the high GC SNP data using a three-parameter model, varying n, P_AT→GC, and S. Upon the addition of the fixation bias parameter S we find a significant improvement in fit (ΔG = 6.8 and P = 0.009; see Fig. 2). The estimated value of S implies selection in favor of G/C alleles, which results in a mutation bias in favor of A/T because the GC content is elevated above its mutational equilibrium.
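The reported significance of the improvement in fit can be reproduced with a short calculation, assuming that the reduction in G from adding the single parameter S is referred to a chi-square distribution with 1 degree of freedom (the usual treatment for nested models; the text does not state this explicitly). The function name is illustrative.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 d.f., P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


p_value = chi2_sf_1df(6.8)
print(round(p_value, 3))  # 0.009
```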
Our explicit model incorporates a number of assumptions that may in fact not be met by the data. Population size changes, population subdivision, biased sampling, or selection on linked sites [...] AT→GC and GC→AT polymorphisms have the same frequency distributions. This will be satisfied under very general conditions, assuming only the stationarity of the allele frequency distributions and the absence of fixation bias. To compare the frequency distributions, we modeled D(f) using three different monotonically decreasing functions (see MATERIALS AND METHODS). Analyses using all three distributions gave very similar answers and so we present only results using D(f) = e^{-zf}, where z > 0 is the fitted parameter. Table 3 presents the results for each GC content category. In agreement with our results from the explicit population genetics model, this general model of mutational bias provides an adequate fit to the data for SNPs from regions with low and intermediate GC content (P = 0.08 and P = 0.52, respectively). In contrast, the model gives a poor fit to SNPs from regions of the genome with high GC content (P = 0.008; see Fig. 2). Again, the discrepancies between observed and expected distributions are consistent with the action of a fixation bias acting in favor of high GC.
| DISCUSSION |
|---|
In the above analysis, we have used two models: an explicit population genetics model and a general model avoiding any assumptions about the shape of the frequency distribution. The explicit model allows a detailed analysis of mutational and selective parameters and facilitates a direct comparison of models of mutation and/or fixation bias. However, it is built on a number of population genetical assumptions, which may not be met by our data. The general model, which directly compares the observed frequency distributions, does not give a quantitative description of the processes shaping composition. However, it is built on less stringent assumptions about the population history and is thus more robust.
Both explicit and general models of neutral mutation bias cannot be rejected for low and intermediate GC SNPs, but can be rejected on the basis of their failure to fit the high GC SNP frequency distribution (P = 0.014 and 0.008, respectively). Under the explicit model, we tested fixation bias (i.e., selection or biased gene conversion) as an explanation for the failure of the neutral mutation model. This explicit model suggests that there is selection/biased gene conversion in favor of GC, but that this is counteracted by mutation bias.
One possible explanation for the failure of our neutral mutation model is recent compositional nonequilibrium. Both the explicit population genetics model and the general model allow for ancient but not recent changes in base composition: the polymorphism frequency distribution needs to be at equilibrium, whereas the base composition at fixed sites need not be at equilibrium (see MATERIALS AND METHODS). Although the activity of transposable elements appears to have been low in the recent past of the human genome [...] GC→AT mutations.
The data analyzed are polymorphism counts in frequency classes. Regardless of the details of population history, these numbers are approximately Poisson distributed as long as there is no linkage between polymorphisms.
Our results have important consequences for previous studies of compositional neutrality. [...] (P_GC→AT = P_AT→GC = 0.5), then the neutral mutation fit to the high GC SNP data would be strongly rejected even after removal of CpG mutations (P = 0.002; data not shown).
| ACKNOWLEDGMENTS |
|---|
We thank Laurent Duret for interesting discussions and two anonymous referees for helpful suggestions. We acknowledge support from The Wellcome Trust (M.J.L.), the Biotechnology and Biological Sciences Research Council (L.D.H. and A.E.-W.), and The Royal Society (A.E.-W.).
Manuscript received September 4, 2001; Accepted for publication September 3, 2002.
| LITERATURE CITED |
|---|
AKASHI, H., 1999 Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics 151:221-238.
AKASHI, H. and S. W. SCHAEFFER, 1997 Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146:295-307.
BERNARDI, G., 2000 Isochores and the evolutionary genomics of vertebrates. Gene 241:3-17.
CHARLESWORTH, B., 1994 The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet. Res. 63:213-227.
CLAY, O., S. CACCIO, S. ZOUBAK, D. MOUCHIROUD, and G. BERNARDI, 1996 Human coding and noncoding DNA: compositional correlations. Mol. Phylogenet. Evol. 5:2-12.
DRAKE, J. W., B. CHARLESWORTH, D. CHARLESWORTH, and J. F. CROW, 1998 Rates of spontaneous mutation. Genetics 148:1667-1686.
DURET, L., M. SEMON, G. PIGANEAU, D. MOUCHIROUD, and N. GALTIER, 2002 Vanishing GC-rich isochores in mammalian genomes. Genetics 162:1837-1847.
EWENS, W. J., 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3:87-112.
EYRE-WALKER, A., 1997 Differentiating between selection and mutation bias. Genetics 147:1983-1987.
EYRE-WALKER, A., 1999 Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675-683.
EYRE-WALKER, A. and L. D. HURST, 2001 The evolution of isochores. Nat. Rev. Genet. 2:549-555.
FAY, J. C. and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:1405-1413.
FILIPSKI, J., 1987 Correlation between molecular clock ticking, codon usage, fidelity of DNA-repair, chromosome-banding and chromatin compactness in germline cells. FEBS Lett. 217:184-186.
FRANCINO, M. P. and H. OCHMAN, 1999 Isochores result from mutation not selection. Nature 400:30-31.
GALTIER, N. and D. MOUCHIROUD, 1998 Isochore evolution in mammals: a human-like ancestral structure. Genetics 150:1577-1584.
INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM, 2001 Initial sequencing and analysis of the human genome. Nature 409:860-921.
KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
KIMURA, M. and T. OHTA, 1973 The age of a neutral mutant persisting in a finite population. Genetics 75:199-212.
KLIMAN, R. M., 1999 Recent selection on synonymous codon usage in Drosophila. J. Mol. Evol. 49:343-351.
LOBRY, J. R., 1997 Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 205:309-316.
NAGYLAKI, T., 1983 Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA 80:6278-6281.
POWELL, J. R. and E. N. MORIYAMA, 1997 Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94:7784-7790.
RODRIGUEZ-TRELLES, F., R. TARRIO, and F. J. AYALA, 2000 Evidence for a high ancestral GC content in Drosophila. Mol. Biol. Evol. 17:1710-1717.
SAWYER, S. A. and D. L. HARTL, 1992 Population genetics of polymorphism and divergence. Genetics 132:1161-1176.
SAWYER, S. A., D. E. DYKHUIZEN, and D. L. HARTL, 1987 Confidence interval for the number of selectively neutral amino acid polymorphisms. Proc. Natl. Acad. Sci. USA 84:6225-6228.
SMITH, N. G. C. and A. EYRE-WALKER, 2001 Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Mol. Biol. Evol. 18:982-986.
SUEOKA, N., 1988 Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653-2657.
WOLFE, K. H., P. M. SHARP, and W.-H. LI, 1989 Mutation rates differ among regions of the mammalian genome. Nature 337:283-285.
WRIGHT, S., 1937 The distribution of gene frequencies in populations. Proc. Natl. Acad. Sci. USA 23:307-320.
YU, A., C. ZHAO, Y. FAN, W. JANG, and A. J. MUNGALL et al., 2001 Comparison of human genetic and sequence-based physical maps. Nature 409:951-953.