Originally published as Genetics Published Articles Ahead of Print on October 11, 2005.
Genetics, Vol. 172, 569-577, January 2006, Copyright © 2006
doi:10.1534/genetics.105.049916
Variation in Mutation Dynamics Across the Maize Genome as a Function of Regional and Flanking Base Composition
Brian R. Morton*,1, Irie V. Bi, Michael D. McMullen and Brandon S. Gaut
* Department of Biological Sciences, Barnard College, Columbia University, New York, New York 10027,
Department of Agronomy, Plant Sciences Unit, University of Missouri, Columbia, Missouri 65211,
Plant Genetics Research Unit, USDA-ARS, Columbia, Missouri 65211 and
Department of Ecology and Evolution, University of California, Irvine, California 92697
1 Corresponding author: Department of Biological Sciences, Barnard College, Columbia University, 3009 Broadway, New York, NY 10027.
E-mail: bmorton{at}barnard.edu
We examine variation in mutation dynamics across a single genome (Zea mays ssp. mays) in relation to regional and flanking base composition using a data set of 10,472 SNPs generated by resequencing 1776 transcribed regions. We report several relationships between flanking base composition and mutation pattern. The A + T content of the two sites immediately flanking the mutation site is correlated with rate, transition bias, and GC → AT pressure. We also observe a significant CpG effect, or increase in transition rate at CpG sites. At the regional level we find that the strength of the CpG effect is correlated with regional A + T content, ranging from a 1.7-fold increase in transition rate in relatively G + C-rich regions to a 2.6-fold increase in A + T-rich regions. We also observe a relationship between locus A + T content and GC → AT pressure. This regional effect is in opposition to the influence of the two immediate neighbors in that GC → AT pressure increases with increasing locus A + T content but decreases with increasing flanking base A + T content and may represent a relationship between genome location and mutation bias. The data indicate multiple context effects on mutations, resulting in significant variation in mutation dynamics across the genome.
EVOLUTION is ultimately dependent on mutation and thus characterizing mutation rates and biases, within and among genomes, is a prerequisite for studying genomics and molecular evolution. For example, comparative genomics requires an understanding of mutation dynamics in different lineages (e.g., DERMITZAKIS et al. 2002), and compositional patterns such as the possible isochore structure in vertebrates (BERNARDI 2000, but see COHEN et al. 2005) cannot be adequately studied without an understanding of how mutation bias varies along chromosomes (e.g., DURET et al. 2002). Increasingly, analyses of large SNP data sets, such as the recent analysis of 2,576,903 human SNPs (ZHAO and BOERWINKLE 2002), are proving to be valuable for studies of mutation bias. The availability of SNP data from many different taxa now makes it feasible to develop a more detailed knowledge of factors that contribute to variation in mutational biases.
A number of analyses of mutations have demonstrated that context, or the composition of nucleotides flanking a mutation, can have a significant influence on both mutation bias and overall mutation rate (BULMER 1986; MORTON 1995; KRAWCZAK et al. 1998; ZHAO and BOERWINKLE 2002; MORTON 2003). Although context effects are not often considered in studies that apply mutation parameters (although see ARNDT et al. 2003; SIEPEL and HAUSSLER 2003), there is evidence that understanding and incorporating such effects may be very important for interpreting genomic data (MORTON 2003; SIEPEL and HAUSSLER 2003) since they can result in variation in mutation dynamics across sites. In nuclear genes, the most apparent neighboring nucleotide effect that has been studied to date is the CpG effect, which is an increased rate of transitions at CpG dinucleotides as a result of deamination of methylated CpG sites (DUNCAN and MILLER 1980; BULMER 1986; COOPER and YOUSSOUFIAN 1988). The CpG effect has been primarily studied in vertebrate genomes (KRAWCZAK et al. 1998; ZHAO and BOERWINKLE 2002; FRYXELL and MOON 2005), and in human sequences there is a fivefold increase in the rate of transitions at CpG sites due to deamination of methylated cytosines (KRAWCZAK et al. 1998). The CpG effect appears to be weaker in G + C-rich regions, possibly due to greater local helix stability (FRYXELL and MOON 2005), and appears to be slightly stronger on the coding strand than on the template strand near genes (KRAWCZAK et al. 1998).
Context dependency of mutations has also been studied in grass chloroplast DNA (cpDNA; MORTON 1995, 2003). In this genome there is a significant correlation between the A + T content of the two sites flanking a mutation (the A/T context) and both the overall substitution rate and the transition:transversion (Ts:Tv) bias, due to a decreasing rate of transition substitutions as the A/T context increases (MORTON 2003). Since the observed context dependency is not consistent with CpG deamination, and since CpG methylation is not known to occur in cpDNA, it has been suggested that factors such as polymerase fidelity and variable repair efficiency may be responsible for context-dependent mutation biases (MORTON 2003). Neighboring base composition also influences substitution dynamics in cpDNA in other ways; both the bias toward A + T and the bias toward pyrimidines are a function of context (MORTON 2003). Similar context-dependent mutation patterns appear to exist in cpDNA across different flowering-plant lineages (MORTON 1997; YANG et al. 2002).
Given the growing body of evidence regarding context dependency and the lack of data about regional variation in mutation properties, there is a need to better understand context dependency and how mutation dynamics vary across individual genomes. To further our understanding of mutational context and variation, we have analyzed a large SNP data set generated from nuclear genes of maize (Zea mays ssp. mays) with respect to both regional and flanking base composition. We find evidence that the A + T content of flanking nucleotides has an influence on various aspects of mutation dynamics and report a correlation between regional base composition and both CpG effect and the relative rates of GC → AT and AT → GC mutations, or GC → AT mutation pressure.
Sequence data:
The sequence data analyzed in this article were reported previously (WRIGHT et al. 2005; YAMASAKI et al. 2005; GenBank nos. BV123534–BV144210, BV446558–BV447590, and BV106362–BV123527). Briefly, PCR primers were designed to amplify the 3' regions of ~2000 sequences from the Maize Mapping Project/Dupont unigene set (http://www.agron.missouri.edu/files_dl/MMP/Cornsensus). For each locus, PCR was performed on genomic DNA from 14 individuals representing the genetic diversity of modern maize inbreds. The sequencing, processing, alignment, and quality of the DNA sequence data were described previously (WRIGHT et al. 2005; YAMASAKI et al. 2005). We modified the alignments in three ways. First, any SNP site that was not supported by a phred quality score of at least 30 for both variants was assigned an "N" for all individuals and ignored in analyses. Second, some alignments were modified slightly to correct for apparent indel errors in coding regions (see below). Third, some loci were excluded from our analyses, either because they did not contain sequences from at least four of the inbred lines or because coding region assignment was uncertain. In total we analyzed 1776 loci with an average A + T content of 53.0% and a variance of 7.3%.
Definition of coding and noncoding regions:
To define coding regions, the unigene sequences were compared to the annotated rice peptide set (version 2 at http://www.tigr.org) and Arabidopsis peptide set (http://www.ncbi.nlm.nih.gov/ on August 16, 2004) with BLASTx. Any hit with an e-value <1e-5 was retained and considered a putative protein coding region (pCDS). The pCDS for each unigene was also estimated by finding the longest open reading frame on the basis of analyses with the bio perl module "getorf" of the EMBOSS package (RICE et al. 2000). Getorf was applied without assuming 5'–3' directionality and without assuming the presence of a start codon.
To ascertain whether any portion of pCDSs from unigenes were present in genomic alignments, we compared the pCDS to genomic data with BLASTn. All BLAST hits with an e-value <1e-5 were retained, as were the extreme 5' and 3' sites of the region(s) of the pCDS aligned by BLAST. The portion of the pCDS defined by the 5' and 3' sites was aligned to the entire genomic alignment with the program sim4 (FLOREA et al. 1998), using default settings. Sim4 aligns EST sequence to genomic sequence while accounting for genomic features such as consensus intron/exon junctions. Each alignment was also edited by hand both to confirm consensus intron/exon junctions and to eliminate 1-bp indels in coding regions, which were assumed to be sequencing errors when present in only one or two sequences. If there were larger indels or potential frameshifts, the coding region definition was considered ambiguous and the locus was removed from analysis. The 1776 alignments used in this study, including coding regions, are available from http://gautlab.bio.uci.edu/data.
Analysis of mutations:
The alignments were analyzed using a Java package written by one of the authors (B. R. Morton). Sites with a gap introduced into any sequence and SNPs at sites defined as coding were excluded from the analyses. At every variable noncoding site the most parsimonious number of changes was assumed and, given the lack of data from an outgroup taxon, mutations were polarized using the most frequent nucleotide at that site. The reliability of this method of polarization has yet to be established, so any conclusions dependent upon polarization must be considered in this light. As data from outgroup taxa become available, they will allow us to evaluate the validity of this method of polarization.
The context of every site, conserved or variable, was calculated using the majority base at the appropriate neighboring site(s). The contexts analyzed were (1) composition of the 5' neighbor, (2) composition of the 3' neighbor, (3) composition of the two 5' neighboring nucleotides, (4) composition of the two 3' neighboring nucleotides, (5) composition of both the 5' and 3' neighbor, and (6) the composition of the four flanking nucleotides, two on each side. Note that all sites occur in multiple contexts since many of these cases overlap. Heterogeneity in mutation dynamics among contexts was assessed by a likelihood-ratio test, or G-test (SOKAL and ROHLF 1995).
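The polarization and context rules described above can be illustrated with a minimal Python sketch. This is not the authors' Java package; the function names are hypothetical, ties for the majority base are broken arbitrarily (the text does not say how ties were handled), and only two-state sites are polarized here.

```python
from collections import Counter

def majority_base(column):
    """Most frequent unambiguous base in an alignment column (None if there is none).
    Ties go to the base seen first, which is an arbitrary choice for this sketch."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def polarize(column):
    """Return (ancestral, derived) for a two-state site, taking the majority base
    as ancestral. Conserved and multi-state sites return None in this sketch."""
    states = {b for b in column.upper() if b in "ACGT"}
    if len(states) != 2:
        return None
    ancestral = majority_base(column)
    (derived,) = states - {ancestral}
    return ancestral, derived

def at_context(left_column, right_column):
    """Number of A/T bases (0, 1, or 2) among the majority bases of the
    5' and 3' neighboring columns."""
    neighbors = (majority_base(left_column), majority_base(right_column))
    if None in neighbors:
        return None
    return sum(b in "AT" for b in neighbors)

# One column per alignment site, one character per sampled line (made-up data).
print(polarize("GGGAGGGG"), at_context("AAAAAAAA", "CCCCCCCC"))
```

Each column string holds the bases observed at one alignment site across the sampled lines, so the same helper serves both the focal site and its neighbors.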
For every context we analyzed mutations as both polarized and unpolarized. For unpolarized changes we simply scored the change as a transition or a transversion. For those sites where there were two changes possible, due to three character states across the sequences, we inferred one transversion (which is necessary) and one unknown change. The latter were included in rate calculations but not in transition:transversion calculations. Only 74 of the 5932 noncoding SNP sites (1.2%) had multiple changes and exclusion of these sites did not affect the conclusions (data not shown). For the analysis of polarized mutations we generated 4 x 4 mutation matrices for every context analyzed. For each matrix, the entry m_ij is the number of sites observed to have a change from nucleotide i to nucleotide j, with the matrix diagonal representing the conserved sites. The rate of each mutation type was then calculated from the matrix by dividing each element by the row total. In addition, for each matrix we calculated the stationary vector (MORTON 2003), which represents the expected equilibrium composition for a sequence evolving under that mutation (substitution) model. This stationary vector can be used as a descriptive parameter of the mutation matrix similar to Ts:Tv.
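As a rough illustration of the matrix calculations just described, the sketch below row-normalizes a 4 x 4 count matrix and finds its stationary vector by power iteration. The counts are invented for the example, the function names are hypothetical, and power iteration is simply one convenient way to obtain the stationary vector; the text does not say which numerical method was used.

```python
BASES = "ACGT"  # row and column order assumed for this sketch

def rate_matrix(counts):
    """Divide each element of a 4 x 4 count matrix by its row total.
    The diagonal holds conserved sites, so every row sums to 1."""
    return [[c / sum(row) for c in row] for row in counts]

def stationary_vector(rates, iterations=10000, tol=1e-12):
    """Power iteration: apply pi <- pi * P until pi stops changing."""
    n = len(rates)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        new = [sum(pi[i] * rates[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Made-up counts: rows are the inferred ancestral base, columns the observed base.
counts = [
    [9800, 50, 100, 50],    # from A
    [150, 9600, 50, 200],   # from C
    [200, 50, 9600, 150],   # from G
    [50, 100, 50, 9800],    # from T
]
rates = rate_matrix(counts)
pi = stationary_vector(rates)
equilibrium_at = pi[BASES.index("A")] + pi[BASES.index("T")]
print(dict(zip(BASES, pi)), equilibrium_at)
```

In these invented counts every G or C site changes to A or T at a combined rate of 0.035 and every A or T site changes to G or C at 0.015, so the equilibrium A + T content is 0.035/(0.035 + 0.015) = 0.70.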
We also examined regional effects on mutation. For this we calculated the overall A + T content of each locus, including both coding and noncoding sites, and then divided the loci into five classes: (1) A + T < 48%, (2) 48% ≤ A + T < 52%, (3) 52% ≤ A + T < 56%, (4) 56% ≤ A + T < 60%, and (5) A + T ≥ 60%. Mutations occurring in loci within each class were then grouped and analyzed together.
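A short sketch of the binning step follows. It assumes the class boundaries are inclusive at the lower end (48%, 52%, 56%, and 60% each open a new class), which is how the five classes are read here; the function name is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of classes 2-5, as A + T fractions (assumed inclusive).
CLASS_BOUNDS = [0.48, 0.52, 0.56, 0.60]

def regional_class(at_fraction):
    """Map a locus A + T fraction (0 to 1) to regional composition class 1-5."""
    return bisect_right(CLASS_BOUNDS, at_fraction) + 1

# Group made-up loci by class before pooling their mutations.
loci = {"locus_a": 0.455, "locus_b": 0.53, "locus_c": 0.61}
by_class = {}
for name, at in loci.items():
    by_class.setdefault(regional_class(at), []).append(name)
print(by_class)
```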
Sequence composition:
We analyzed data from a resequencing project in which loci were sequenced from genomic DNA of up to 14 maize inbred lines (WRIGHT et al. 2005; YAMASAKI et al. 2005). Each locus is a single transcribed region of the genome that was amplified using primers designed from a unigene sequence. An alignment was generated for each locus using the coding strand sequence data. We examined 1776 of these loci for which the coding regions could be reliably defined with an average sample size of n = 12.1 sequences. The 1776 loci represent a combined alignment length of 531,503 nucleotides, of which 260,475 (49.0%) are noncoding. A total of 10,472 SNPs representing ~2% of the sites were scored. A total of 5932 (56.6%) of the SNPs were at noncoding sites. Each SNP was scored in two ways: as an unpolarized (nondirectional) change and as a polarized (directional) change, for which the most frequent nucleotide at the site was taken as the ancestral state. The distribution of A + T content from these loci is shown in Figure 1 for all sites as well as for only noncoding sites. In general the loci are slightly A + T-rich with an average A + T content of 53.0%. The noncoding sites are only slightly higher in A + T content, with an average composition across loci of 55.1% A + T. Along with the bias toward A + T, we observed a consistent bias of G over C and T over A both in the sequences overall and in only the noncoding sites (a "GT skew"). If we measure the T-A skew by (T − A)/(T + A) and the G-C skew by (G − C)/(G + C), the T-A skew in the noncoding sites of our data is 12.0% while the G-C skew at noncoding sites is 5.6%. This skew toward G and T in the noncoding regions near genes is similar to a recent observation of human genes (LOUIE et al. 2003).
To study the effect of regional composition on mutation bias, loci were divided by A + T content into the following categories: (1) A + T < 48%, (2) 48% ≤ A + T < 52%, (3) 52% ≤ A + T < 56%, (4) 56% ≤ A + T < 60%, and (5) A + T ≥ 60%. These will be referred to as the regional composition classes. The results reported here are for the SNPs at noncoding sites but all conclusions discussed below were unchanged when analyses were repeated using all SNPs, although the higher proportion of noncoding SNPs relative to noncoding sites may reflect constrained sites within the coding regions. In addition, varying the categories into which loci were divided by A + T content did not change the general results (data not shown).
General mutation patterns:
Overall, the polarized SNP data yielded a G and C nucleotide mutation rate (the GC rate) that is ~1.6 times the rate of mutation for A and T nucleotides (the AT rate) (Table 1). The higher GC rate could potentially be due to the CpG effect, which is discussed in detail below. However, when the GC and AT rates were calculated for different 5' and 3' flanking nucleotides, there was a higher GC rate in every context. Thus, although the effect of CpG deamination is apparent in the higher GC rates when there is a 5' C or a 3' G (Table 1), the CpG effect cannot account for the overall higher GC rate. The ratio of GC-to-AT rates, which reflects the mutational AT pressure, varies across the regional composition classes; the GC:AT rate ratio is higher in those regions with a higher A + T content and lower in those regions with a lower A + T content (Table 2). This variation in mutation pressure is discussed in more detail below. Note that the rates in Table 1 tend to be slightly lower than the rates in Table 2 since accounting for context reduces the number of sites in the analysis by eliminating the first and/or last sites as well as any internal site for which context is ambiguous.
We also examined mutation bias by looking at the Ts:Tv ratio, which has not been well characterized in plant nuclear genomes. Overall, transitions occur at a rate ~1.5 times that of transversions (Table 3). This ratio is consistent across loci: although there is a slight variation in this ratio across loci as a function of regional composition, the variation is not significant (G = 3.6, P > 0.05). The maize nuclear Ts:Tv ratio is slightly higher than that of grass cpDNA, which shows an overall 1.3:1 Ts:Tv ratio. Note, however, that the Ts:Tv ratio in grass cpDNA ranges from <1 to >2.5 as a function of flanking base composition (MORTON 2003).
The effect of cytosine deamination:
To examine the influence of context on mutation bias, we first compared the frequency of transition events at CpG dinucleotides, which are known to be methylated in plant nuclear DNA, to the transition rate at other dinucleotides. Deamination of methylated cytosines at CpG dinucleotides is known to generate a significant increase in transition rate in many vertebrate taxa (KRAWCZAK et al. 1998; FRYXELL and MOON 2005) so we hypothesized that a similar CpG effect would exist in our data.
Since both strands at a CpG dinucleotide are methylated, deamination will lead to the observation of either a CG → CA change, for a deamination on the template strand, or a CG → TG change if the deamination is on the coding strand. To measure the CpG effect, we compared the rate of transition in the CpG context to the average rate in all other contexts. For the template strand this involved calculating the ratio of the rate of CG → CA changes to the average rate of AG → AA, TG → TA, and GG → GA changes. For deamination on the coding strand we calculated the ratio of the rate of CG → TG changes to the average of CA → TA, CT → TT, and CC → TC. The average CpG effect was then calculated as the average of the two strand values.
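The CpG-effect calculation described in this paragraph reduces to two ratios and their mean. The sketch below shows the arithmetic with invented rates; the dictionary layout and function name are assumptions made for illustration, not the format used in the original analysis.

```python
from statistics import mean

def cpg_effect(rates):
    """rates maps (dinucleotide_before, dinucleotide_after) to a polarized
    transition rate. Returns (template-strand ratio, coding-strand ratio, average)."""
    # Template strand: G -> A at the 3' position, CpG versus the other 5' neighbors.
    template = rates[("CG", "CA")] / mean(
        [rates[("AG", "AA")], rates[("TG", "TA")], rates[("GG", "GA")]]
    )
    # Coding strand: C -> T at the 5' position, CpG versus the other 3' neighbors.
    coding = rates[("CG", "TG")] / mean(
        [rates[("CA", "TA")], rates[("CT", "TT")], rates[("CC", "TC")]]
    )
    return template, coding, (template + coding) / 2

# Invented rates, chosen only to make the arithmetic easy to follow.
example_rates = {
    ("CG", "CA"): 0.020, ("AG", "AA"): 0.010, ("TG", "TA"): 0.010, ("GG", "GA"): 0.010,
    ("CG", "TG"): 0.030, ("CA", "TA"): 0.012, ("CT", "TT"): 0.010, ("CC", "TC"): 0.008,
}
template_ratio, coding_ratio, average_ratio = cpg_effect(example_rates)
print(template_ratio, coding_ratio, average_ratio)
```

With these invented rates the template-strand ratio is 2.0, the coding-strand ratio is 3.0, and the average CpG effect is 2.5.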
Using the polarized SNP data (see MATERIALS AND METHODS) the rates of mutation for each dinucleotide are shown in Table 4. Overall there is a 2.1-fold increase in transition rate in the CpG context relative to other contexts and this increase at CpG dinucleotides is significant (G = 78.0, P < 10⁻⁶). The CpG effect is also apparent when the rates of all possible dinucleotide changes are compared: the various transitions have higher rates of change than transversions do, as expected from the Ts:Tv > 1 described above, with the highest rates being transitions from the CpG dinucleotide, the CG → CA and CG → TG changes (Figure 2). Across the regional composition classes there is a correlation between the CpG effect and regional A + T content with A + T-rich regions showing a much stronger CpG effect than A + T-poor regions (Table 4). There is also a significant increase in CpG transition rate with increasing regional A + T content (G = 21.3, P < 0.001).
When we compared the rate of CpG transition for the two different strands, the rate of CG → CA was found to be significantly lower than the rate of CG → TG (G = 7.1, P < 0.01). Both CG → CA and CG → TG rates increase with increasing regional A + T content but the latter rate is higher in each composition class. These data suggest that the two DNA strands are affected differently by CpG deamination, similar to the data from humans (KRAWCZAK et al. 1998). However, there is no apparent difference in the increase of CG → CA changes and the CG → TG changes relative to G → A and C → T transitions, respectively (Table 4), so it is possible that the rate differences between CG → CA and CG → TG are more general than only CpG deamination. Overall, our data do not unambiguously indicate a difference in CpG effect between the two strands.
Context and transition:transversion bias:
In addition to the apparent effect of methylated cytosine deamination, we studied the general relationship between neighboring base composition and mutation bias. Given the observation from grass cpDNA that flanking base A + T content is correlated with mutation bias (MORTON 2003), we divided all sites into three categories depending on the number of A/T base pairs (0, 1, or 2) in the two immediate neighbors and defined this as the A/T context. SNPs that differed in A/T context were then analyzed separately for comparison.
As observed in cpDNA, we found a significant negative correlation between A/T context and Ts:Tv due to a decreasing rate of transitions with increasing A/T context (Table 5). This decreasing rate of transitions also results in a significant decrease in overall mutation rate with increasing A/T context. From Table 5, the overall rates of mutation in the A/T = 0, A/T = 1, and A/T = 2 contexts are 0.0276, 0.0238, and 0.0218, respectively. A comparison of variable (SNP) to conserved sites reveals that this variation in rate among contexts is significant (G = 26.8, P < 10⁻⁵). The negative correlation between A/T context and transition bias was observed in the regional composition classes where A + T > 52% but not in the regions with lower A + T content. Unlike the case for cpDNA, however, this correlation between Ts:Tv and A/T context in nuclear DNA could be due solely to the CpG effect. To remove the CpG effect, we repeated the analysis for the A/T = 1 and A/T = 2 contexts using only sites without a 5'C or 3'G. (There is only a single A/T = 0 context without a potential CpG, namely sites with a 5'G and 3'C, so we excluded this context altogether.) There was still a significant difference in Ts:Tv between the A/T = 1 and A/T = 2 contexts (G = 15.9, P < 10⁻⁴) and, again, this context effect tended to be significant in regions with higher A + T content (Table 5).
The data in Table 5 show that flanking bases influence mutations beyond the CpG effect and in a manner similar to what is observed in cpDNA. The variation in Ts:Tv across the three A/T contexts, however, is weaker in these data than in cpDNA (MORTON 2003).
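For readers unfamiliar with the G-test used for these comparisons, the sketch below computes the likelihood-ratio statistic for a table of counts, for example SNP versus conserved sites in each A/T context. The counts are invented, the function names are hypothetical, and the closed-form P value applies only to 2 degrees of freedom; treating the three-context rate comparison as a 2-d.f. test is an assumption made here, not something the text states.

```python
import math

def g_statistic(table):
    """Likelihood-ratio (G) statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g, df

def p_value_two_df(g):
    """Upper-tail chi-square probability, valid only for 2 degrees of freedom."""
    return math.exp(-g / 2.0)

# Invented counts: rows are A/T contexts 0, 1, 2; columns are (SNP, conserved).
table = [[280, 9720], [240, 9760], [220, 9780]]
g, df = g_statistic(table)
print(g, df, p_value_two_df(g))
```

Under the 2-d.f. assumption, the reported G = 26.8 corresponds to P = exp(−13.4), roughly 1.5 × 10⁻⁶, which is consistent with the stated P < 10⁻⁵.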
Previous studies of other taxa have indicated that nucleotides beyond immediate neighbors can influence nucleotide mutation biases (KRAWCZAK et al. 1998; MORTON 2000; ZHAO and BOERWINKLE 2002). We thus examined the effect of context beyond the nucleotide sites that immediately flank an SNP. However, previous studies have not always separated the effects of immediate neighbors from the composition of more distant nucleotide sites (ZHAO and BOERWINKLE 2002). In our analysis we controlled for the composition of the immediate neighbors by holding the composition of these sites constant and then comparing the composition of the nucleotides one base removed, both 5' and 3', from the SNP sites. For these data, we assessed both mutation rate and the Ts:Tv ratio. No significant relationship was found between the composition of these sites and either mutation rate or bias (data not shown).
Context and mutational AT pressure:
In this section we examine the relationship between context and GC → AT pressure using the polarized SNP data. All sites, both conserved and SNP, were separated by context. Two different sets of contexts were used: (1) A/T context (number of A/T base pairs immediately flanking the site, as above) and (2) regional A + T composition (the regional composition classes described above). Using all sites within a specified context, we generated a 4 x 4 matrix where the entry ij is the rate of change from nucleotide i to nucleotide j in that context as described in MATERIALS AND METHODS. Once the matrix for each context was determined, the matrices were analyzed using two approaches. The first approach involved finding the equilibrium composition of a sequence evolving under each mutation model. This was determined by calculating the stationary vector for each matrix, which represents the expected equilibrium distribution for that mutation model (see MATERIALS AND METHODS). For the second approach we compared the GC → AT and AT → GC mutation rates within each matrix.
A correlation is observed between the A/T context and equilibrium A + T composition (Table 6). As A/T context increases, predicted equilibrium A + T content of a site decreases. This trend is observed across the regional composition classes, indicating that sites evolving in a local context that is more A + T rich are themselves less biased toward A and T than sites in a local context that is A + T poor. The opposite trend is observed across regional composition classes. SNPs in loci that are more A + T rich overall predict a higher A + T bias than SNPs in loci that are relatively A + T poor (Table 6). Therefore, variation across regional composition classes cannot be explained by the influence of immediate neighbors since the two influences are in opposite directions and must represent some other feature of mutations.
The second approach, comparing GC → AT and AT → GC mutation rates directly, yielded similar results. As shown in Table 2, the GC:AT rate ratio increases with increasing regional A + T content. However, the GC and AT rates presented above were not limited to the GC → AT and AT → GC changes, which determine AT pressure. Therefore, we partitioned the GC and AT rates into two components each: the GC rate into GC → AT and GC → GC rates and the AT rate into AT → GC and AT → AT rates. The data demonstrate that the regional A + T content is correlated with GC → AT mutation pressure; the ratio of GC → AT:AT → GC rates increases with increasing regional A + T content as does the ratio of GC → AT:GC → GC change rates (Table 7). On the other hand, the rate of GC → GC transversions is not much higher than the rate of AT → AT transversions and this ratio decreases with increasing regional A + T content (Table 7), indicating that it is specifically GC → AT rates that are associated with regional composition, not only a general GC mutation rate.
Since transitions occur at a higher rate than transversions and GC → AT changes include transitions while GC → GC changes do not, we repeated the analysis using only G → T and C → A as well as T → G and A → C transversion mutations. These data show the same correlation between regional A + T content and GC → AT pressure (Table 7). Overall, the mutation dynamics shown in Tables 6 and 7 demonstrate a bias toward GC → AT changes, a bias that is stronger in regions with higher A + T content, and show a direct relationship between AT pressure and regional composition.
We should note that a number of our observations are based on polarizing mutations. For our analyses we polarized mutations by using the majority base at each site to infer the original state. This will not affect the analyses concerning flanking base effect on rate and transition bias and, therefore, the overall conclusions about context effects. In addition, although the polarization does allow us to infer the mutation rate away from CpG dinucleotides and provides stronger evidence, the high rate of transitions at these sites is in itself strong support for a CpG effect. Conclusions based on predicted equilibrium composition and GC → AT pressure are, however, fully dependent on polarizing the mutations that allow us to generate the 4 x 4 matrices. Future analysis using an outgroup taxon will allow us to examine these effects and to assess the validity of using the majority base to polarize mutations.
The most notable context effect is an elevated rate of CG → TG and CG → CA transitions relative to other transitions (Figure 2). Given the existence of CpG methylation in plants (TARIQ and PASZKOWSKI 2004), this rate elevation is most likely the result of a deamination of methylated cytosines at these dinucleotides. It is difficult to compare the magnitude of the CpG effect observed here directly to studies of nonplant taxa since methodologies differ, but it appears that the increase in transition rate that we observed at CpG sites, roughly a 2.1-fold increase relative to the transition rate at other sites, is not as high as what has been observed in vertebrates (KRAWCZAK et al. 1998). Although we observe an overall 2.1-fold increase in transition rate due to CpG deamination, this increase ranges from a 1.7-fold increase in regions with lower A + T content (<48%) to a 2.6-fold increase in regions with higher A + T content (>60%) and shows a general increase with increasing regional A + T content (Table 4). This trend may reflect variation in the degree of CpG methylation across loci or that repair of deamination products is more efficient in G + C-rich regions (FRYXELL and MOON 2005).
Along with a significant CpG effect, there are other influences of context on mutations apparent in our data. In particular, the composition of the two immediate neighbors, one 5' and one 3', of the mutation site is correlated with overall rate, transition bias, and GC → AT pressure. These effects are similar to what is observed in grass cpDNA and it is likely that they are due to an influence of local composition on polymerase misincorporation or mismatch repair (MORTON 1995, 2003). The similar relationship between context and mutation properties in both nuclear and cpDNA is interesting since it suggests shared replication and/or repair processes or that these properties are fundamental to mutations. Much remains to be learned about replication and repair in plants, but it is known that the two genomes do not share the same replication machinery and have significant differences in repair dynamics (HEINHORST and CANNON 1993; CANNON et al. 1995; HADA et al. 1998; KIMURA et al. 2002, 2005). As more is uncovered about the replication and repair processes in the two genomes, we should be able to better understand the causes of similar context effects.
Although we found a correlation between the composition of the two immediate neighbors and mutation properties, we did not see a clear relationship between mutation and the composition of individual neighboring nucleotides that do not flank the mutation. This contrasts with a recent study of human SNPs (ZHAO and BOERWINKLE 2002). Again, however, differences in methodology make it difficult to draw any specific conclusions about differences in context effects. In our study we controlled for the composition of the immediate neighbors, something that was not done in the study of human SNPs. Thus, it is possible that the human SNP study confounded immediate flanking base effects and nonrandom dinucleotide composition.
Despite the lack of correlation between specific individual nucleotides beyond the immediate neighbors and mutation dynamics, we do observe a correlation between regional composition and GC → AT mutation pressure. It is possible that this correlation is not a context effect but a secondary effect arising from a relationship between chromosome location and replication/mutation dynamics. For example, a correlation between location, replication time and the available nucleotide pool, which could affect misincorporation biases, could potentially lead to a relationship along the lines of what we observe.
One interesting feature of our inferred mutation dynamics is that they predict an A + T content at equilibrium that is higher than the observed base composition. Although we observe a correlation between regional A + T content and predicted A + T content (Table 6), the observed A + T content is lower than expected in each of the regional composition classes. If we group all mutations from our data set into one matrix, we predict an A + T content of 62.0% at equilibrium (Table 6), which is higher than the average regional A + T content of 55.1% observed for noncoding sites. Although, as stated above, the predicted equilibrium may not be accurate since the context of most sites will vary over time, the fact that in every composition class even the lowest predicted equilibrium A + T (typically in the A/T = 2 context) is higher than the observed A + T indicates a real discrepancy. This discrepancy is similar to what was observed in noncoding cpDNA (MORTON 2003) and suggests two possibilities. One is that the sequence is not at equilibrium and the A + T content is increasing in this lineage, as has been proposed recently for other taxa (e.g., DURET et al. 2002; TIFFIN and HAHN 2002; EBERSBERGER and MEYER 2005). The other is that there is a fixation bias, such as selection or biased gene conversion. Investigating these two possibilities in future studies should yield important insights into plant mutational dynamics.
Finally, the mutation dynamics inferred from the SNP data predict the GT skew observed in the data (see RESULTS). The total 4 × 4 matrix inferred from the SNPs predicts an equilibrium composition of 20.0% G, 18.0% C, 28.0% A, and 34.1% T, which is a 9.8% skew of T over A and a 5.3% skew of G over C, similar to the 12.0% and 5.6% T-A and G-C skews, respectively, observed in the noncoding sequences. Similar T-A and G-C skews are found when we consider SNPs in the different contexts described above (data not shown). Since our alignments are of coding strand sequences in transcribed regions, these skews further suggest the possibility that the bias is associated with transcription.
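The skew values quoted above can be checked with a short calculation. The sketch below assumes the skew of X over Y is defined as (X − Y) / (X + Y); the text does not state the definition here, but this one reproduces the reported 9.8% and 5.3%. The function name `skew` is hypothetical.

```python
def skew(x: float, y: float) -> float:
    """Skew of x over y, assumed here to be (x - y) / (x + y)."""
    if x + y == 0:
        raise ValueError("x + y must be non-zero")
    return (x - y) / (x + y)


# Predicted equilibrium composition (percent) quoted in the text.
predicted = {"G": 20.0, "C": 18.0, "A": 28.0, "T": 34.1}

ta_skew = skew(predicted["T"], predicted["A"])
gc_skew = skew(predicted["G"], predicted["C"])

print(f"T over A skew: {ta_skew:.1%}")  # 9.8%
print(f"G over C skew: {gc_skew:.1%}")  # 5.3%
```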
This skew toward T over A and G over C has recently been reported for human genes (LOUIE et al. 2003). Since their observation, like ours, was for noncoding sequences near genes on the coding strand and was found across numerous loci, they proposed that the skew was due to a transcription-coupled mismatch repair system. If this is the case, then the similar finding in our data suggests a similar mechanism in plant nuclear genes. It also raises the possibility that the G over C and T over A skew observed along the leading strand in prokaryotic genomes (LOBRY 1996; MCINERNEY 1998; MCLEAN et al. 1998; MORTON 1999) is at least partially the result of a transcription-coupled repair mechanism. The possibility of a transcription-coupled repair mechanism has significant implications for our understanding of compositional bias in genes, such as codon usage bias.
ARNDT, P. F., C. B. BURGE and T. HWA, 2003 DNA sequence evolution with neighbor-dependent mutation. J. Comput. Biol. 10: 313–322.
BERNARDI, G., 2000 Isochores and the evolutionary genomics of vertebrates. Gene 241: 3–17.
BULMER, M., 1986 Neighboring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3: 322–329.
CANNON, G. C., L. A. HEDRICK and S. HEINHORST, 1995 Repair mechanisms of UV-induced DNA damage in soybean chloroplasts. Plant Mol. Biol. 29: 1267–1277.
COHEN, N., T. DAGAN, L. STONE and D. GRAUR, 2005 GC composition of the human genome: in search of isochores. Mol. Biol. Evol. 22: 1260–1272.
COOPER, D. N., and H. YOUSSOUFIAN, 1988 The CpG dinucleotide and human genetic disease. Hum. Genet. 78: 151–155.
DERMITZAKIS, E. T., A. REYMOND, R. LYLE, N. SCAMUFFA, C. UCLA et al., 2002 Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 420: 578–582.
DUNCAN, B. K., and J. H. MILLER, 1980 Mutagenic deamination of cytosine residues in DNA. Nature 287: 560–561.
DURET, L., M. SEMON, G. PIGANEAU, D. MOUCHIROUD and N. GALTIER, 2002 Vanishing GC-rich isochores in mammalian genomes. Genetics 162: 1837–1847.
EBERSBERGER, I., and M. MEYER, 2005 A genomic region evolving towards different GC contents in humans and chimpanzees indicates a recent and regionally limited shift in the mutation pattern. Mol. Biol. Evol. 22: 1240–1245.
FLOREA, L., G. HARTZELL, Z. ZHANG, G. M. RUBIN and W. MILLER, 1998 A computer program for aligning a cDNA sequence with genomic DNA sequence. Genome Res. 8: 967–974.
FRYXELL, K. J., and W.-J. MOON, 2005 CpG mutation rates in the human genome are highly dependent on local GC content. Mol. Biol. Evol. 22: 650–658.
HADA, M., T. HASHIMOTO, O. NIKAIDO and M. SHIN, 1998 UVB-induced DNA damage and its photorepair in nuclei and chloroplasts of Spinacia oleracea L. Photochem. Photobiol. 68: 319–322.
HEINHORST, S., and G. C. CANNON, 1993 DNA replication in chloroplasts. J. Cell Sci. 104: 1–9.
KIMURA, S., Y. UCHIYAMA, N. KASAI, S. NAMEKAWA, A. SAOTOME et al., 2002 A novel DNA polymerase homologous to Escherichia coli DNA polymerase I from a higher plant, rice (Oryza sativa L.). Nucleic Acids Res. 30: 1585–1592.
KIMURA, S., T. ISHIBASHI, T. YAMAMOTO and K. SAKAGUCHI, 2005 DNA repair in higher plants. Seikagaku 77: 113–123.
KRAWCZAK, M., E. V. BALL and D. N. COOPER, 1998 Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Hum. Genet. 63: 474–488.
LOBRY, J. R., 1996 Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13: 660–665.
LOUIE, E., J. OTT and J. MAJEWSKI, 2003 Nucleotide frequency variation across human genes. Genome Res. 13: 2594–2601.
MCINERNEY, J. O., 1998 Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc. Natl. Acad. Sci. USA 95: 10698–10703.
MCLEAN, M. J., K. H. WOLFE and K. M. DEVINE, 1998 Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J. Mol. Evol. 47: 691–696.
MORTON, B. R., 1995 Neighboring base composition and transversion/transition bias in a comparison of rice and maize chloroplast noncoding regions. Proc. Natl. Acad. Sci. USA 92: 9717–9721.
MORTON, B. R., 1997 Rates of synonymous substitution do not indicate selective constraints on the codon bias of the psbA gene. Mol. Biol. Evol. 14: 412–419.
MORTON, B. R., 1999 Strand asymmetry and codon usage bias in the chloroplast genome of Euglena gracilis. Proc. Natl. Acad. Sci. USA 96: 5123–5128.
MORTON, B. R., 2000 Codon bias and the context dependency of nucleotide substitutions in the evolution of plastid DNA. Evol. Biol. 31: 55–103.
MORTON, B. R., 2003 The role of context-dependent mutations in generating compositional and codon usage bias in grass chloroplast DNA. J. Mol. Evol. 56: 616–629.
RICE, P., I. LONGDEN and A. BLEASBY, 2000 EMBOSS: the European molecular biology open software suite. Trends Genet. 16: 276–277.
SIEPEL, A., and D. HAUSSLER, 2003 Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21: 468–488.
SOKAL, R. R., and F. J. ROHLF, 1995 Biometry. W. H. Freeman, New York.
TARIQ, M., and J. PASZKOWSKI, 2004 DNA and histone methylation in plants. Trends Genet. 20: 244–251.
TIFFIN, P., and M. W. HAHN, 2002 Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis. J. Mol. Evol. 54: 746–753.
WRIGHT, S. I., I. V. BI, S. G. SCHROEDER, M. YAMASAKI, J. F. DOEBLEY et al., 2005 The effects of artificial selection on the maize genome. Science 308: 1310–1314.
YAMASAKI, M., M. I. TENAILLON, I. V. BI, S. G. SCHROEDER, H. SANCHEZ-VILLEDA et al., 2005 A large-scale screen for artificial selection in maize identifies candidate agronomic loci for domestication and crop improvement. Plant Cell 17: 2859–2872.
YANG, Y. W., P. Y. TAI and W.-H. LI, 2002 A study of the phylogeny of Brassica rapa, B. nigra, Raphanus sativa and their related genera using non-coding regions of chloroplast DNA. Mol. Phylogenet. Evol. 23: 268–275.
ZHAO, Z., and E. BOERWINKLE, 2002 Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 12: 1679–1686.
Communicating editor: S. YOKOYAMA




