We describe a surprising long-range periodicity that underlies a substantial fraction of C. elegans genomic sequence. Extended segments (up to several hundred nucleotides) of the C. elegans genome show a strong bias toward occurrence of AA/TT dinucleotides along one face of the helix while little or no such constraint is evident on the opposite helical face. Segments with this characteristic periodicity are highly overrepresented in intron sequences and are associated with a large fraction of genes with known germline expression in C. elegans. In addition to altering the path and flexibility of DNA in vitro, sequences of this character have been shown by others to constrain DNA∷nucleosome interactions, potentially producing a structure that could resist the assembly of highly ordered (phased) nucleosome arrays that have been proposed as a precursor to heterochromatin. We propose a number of ways that the periodic occurrence of An/Tn clusters could reflect evolution and function of genes that express in the germ cell lineage of C. elegans.
DNA carries biological information on a variety of levels. Some of the information is evident from well-defined sequence features such as the genetic code and certain reproducible transcription factor binding sites. Other (perhaps equally important) information in the genome may involve less precise sequence rules and longer-range interactions that are likely to be more difficult to detect and understand. One approach to dissecting the information encoded in the genome is to search for nonrandom character in the DNA sequence. Protein-coding constraints, for example, produce a nonrandom distribution in the utilization of base triplets. Likewise, certain transcription factors have a tendency to recognize multiple target sequences in a short interval, producing a localized increase in the incidence of one or more motifs. An intriguing means to investigate nonrandom features of DNA sequence involves searches for periodic appearance of specific sequence elements (see Trifonov 1989 and Minsky 2004 for reviews). Previous analyses of this type have identified, among other nonrandom features, a strong tendency for 3n repeats in coding sequences (due to nonrandom usage of the genetic code and of amino acid sequences) and a periodicity of 10–11 bp in occurrence of AA/TT dinucleotides. The latter periodicity has been observed rather strikingly in sequences that display intrinsic curvature in vitro (e.g., Koo et al. 1986; Ulanovsky and Trifonov 1987; Goodsell and Dickerson 1994 and references therein). AA/TT dinucleotides are also unusual in having less flexibility under certain circumstances than other dinucleotide pairs (e.g., Nelson et al. 1987) and in their apparent ability to contribute to the lateral positioning of nucleosomes along DNA (e.g., Satchwell et al. 1986).
Although the bulk genome periodicity analysis has substantial power to detect patterns in the sequence, the biological significance of these patterns has remained somewhat of a mystery, in particular due to challenges in identifying and finding functional consequences corresponding to individual sequence characteristics. From this perspective, well-characterized model systems such as Caenorhabditis elegans provide a tool of considerable value. C. elegans has been reported to exhibit a strong ∼10-base periodicity signal (e.g., VanWye et al. 1991; Widom 1996; Fukushima et al. 2002) and is among the most extensively characterized organisms in both structure (a complete sequence) and function (both individual genetic studies and whole-genome expression/phenotype analysis).
In this article, we analyze the surprisingly prevalent and extensive periodic character in the C. elegans genome. Rather than underlying the entire genome sequence, the periodic regions appear enriched in a number of “islands” throughout the genome. These islands are for the most part unique in sequence; strikingly, they appear to delineate transcribed regions for a large group of genes that are expressed in the germ cell lineage of C. elegans.
MATERIALS AND METHODS
Sequence versions and software:
Sequences used for this analysis were downloaded from GenBank (Wheeler et al. 2005), Wormbase (Chen et al. 2005a), or the Intronerator (Kent and Zahler 2000) databases at the times noted in describing each experiment. Use of distinct versions of certain sequences was in some cases necessitated by the availability of annotation files of a given type only for those versions. Basic observations [such as the strong enrichment of periodicity in the C. elegans genome and the propensity for periodic An/Tn cluster (PATC) islands to appear in autosomal arms] have been confirmed for numerous releases of the genome sequence and using several independent software tools with a wide variety of parameters.
Serial analysis of gene expression (SAGE) data were obtained from the Genome British Columbia (BC) C. elegans Gene Expression Consortium (http://elegans.bcgsc.bc.ca/) (McKay et al. 2003; Blacque et al. 2005; Chen et al. 2005b; B. Goszczynski and J. McGhee, personal communication; K. Wong, M. Marra, S. Jones, D. Baillie and D. Moerman, personal communication).
Stringent repeat masking of genomes was carried out as described by Kurtz and Schleiermacher (1999), using a PC-compatible machine running the Red Hat version of Linux. All other analyses were carried out on Macintosh systems using Pascal-based programming for processor-intensive tasks (Metrowerks Code Warrior v. 7) and Hypercard-based programming for text-intensive tasks. Several randomized sequences were derived independently from scripts in Perl, Hypercard, and Pascal with no evident differences in periodicity properties.
Unusual character of some C. elegans DNA fragments:
This work started from an observation that certain DNA sequences from C. elegans exhibited unusual electrophoretic mobility on agarose gels at low temperature (S. White-Harrison, J. Fleenor and A. Fire, unpublished observations). Retarded electrophoresis, seen for DNAs from diverse biological sources, can be induced by specific sequence elements that can produce either a static bend or increased bendability in the otherwise straight helix (e.g., Marini et al. 1982; Olson and Zhurkin 2000). Several rather precise models for predicting the three-dimensional path of the DNA helix as a function of base sequence (roll, tilt, and helical periodicity) have been published; these models can predict certain anomalies in electrophoretic behavior with considerable sensitivity and specificity, although the details of predicted trajectories differ considerably between different models (e.g., Goodsell and Dickerson 1994) and are thus at best a very rough approximation. All of these algorithms predict strongly bent character in the segments of C. elegans DNA for which we had observed abnormal mobility. The algorithms are readily scaled for high-throughput analysis of large numbers of sequences. This analysis revealed a higher overall “bend” density for C. elegans DNA than for other nonnematode genomes with a comparable base composition (Table 1).
Periodicity analysis of C. elegans DNA:
While the “bending” analysis above demonstrates a nonrandom character in the DNA sequence, it is by no means clear that bending per se is the biologically selected basis for this nonrandomness. Indeed, the diverse algorithms used above are all based on in vitro behavior of DNA at relatively low temperatures; little or no gel mobility anomaly is generally observed at physiological temperatures (Diekmann 1987). Thus the unusual shapes of DNAs predicted by the algorithms are unlikely to accurately represent the biological properties of the DNA in vivo.
When a set of predicted “bent” sequences from the C. elegans genome was inspected in detail, we found that the most prominent common feature was a propensity for occurrence of AA or TT dinucleotides on one face of the helix over tens of base pairs (e.g., Figure 1). This is consistent with reports in the literature that periodic distributions of AA/TT dinucleotides can produce some of the largest anomalies in DNA structure (e.g., Bolshoy et al. 1991). To extend this analysis to the whole genome, we carried out a general analysis of periodicity for the C. elegans genome. Figure 2 and supplemental Figure S1A (http://www.genetics.org/supplemental/) show such an analysis for each of the 256 tetranucleotide “words.” These figures represent the number of times that two occurrences of a given word are separated by each integral number of base pairs.
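The word-separation counts described above can be illustrated with a minimal sketch. It assumes start-to-start distances, counts all pairs of occurrences (not only adjacent ones), and scans a single strand; the function name `separation_histogram` is hypothetical, and this is not the Pascal code used for the published figures.

```python
from collections import Counter

def separation_histogram(seq, word, max_sep):
    """Count pairs of occurrences of `word` at each start-to-start distance 1..max_sep."""
    # Overlapping occurrences are included (e.g., AAAAA holds AAAA twice).
    positions = [i for i in range(len(seq) - len(word) + 1)
                 if seq.startswith(word, i)]
    counts = Counter()
    for idx, p in enumerate(positions):
        for q in positions[idx + 1:]:
            d = q - p
            if d > max_sep:
                break  # positions are sorted, so later ones are farther still
            counts[d] += 1
    # Element d - 1 of the result holds the count for separation d.
    return [counts[d] for d in range(1, max_sep + 1)]

# Toy sequence with AAAA placed every 10 bases.
toy = "AAAA" + "C" * 6 + "AAAA" + "C" * 6 + "AAAA"
hist = separation_histogram(toy, "AAAA", 20)
```

In a histogram of this kind, a helical-repeat periodicity would appear as peaks spaced roughly 10 bases apart.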
As expected from previous analyses of periodicity in sequence databases (e.g., Trifonov 1989), a variety of different characteristics are evident from the periodicity analysis of C. elegans.
For some tetranucleotides (e.g., TGTG; supplemental Figure S1A at http://www.genetics.org/supplemental/), a 2-base periodicity is present, apparently representing segments of the genome with extended runs of alternating purine–pyrimidine.
A 3-base periodicity is evident for a number of the tetranucleotides (e.g., GTAG; supplemental Figure S1A). This type of periodicity is likely to represent the triplet nature of the genetic code combined with nonrandom codon choices and distributions of amino acids.
Additional 3n periodicities of 6, 9, etc. (e.g., GATT, TCAG; supplemental Figure S1A) presumably reflect (at least in part) coding for protein motifs such as beta sheet that have periodicity.
A number of the distributions show a strong and very discrete signal at a unique periodicity (e.g., GCCG; Figure 2C). These longer periodicities represent highly repeated tandem sequences in the genome.
An ∼10-base periodicity is present in words containing multiple AA/TT dinucleotides (e.g., AAAA, Figure 2B).
Filtering the genome to remove repetitive and protein-coding sequences:
To avoid the biases of repetitive DNA and coding sequences, we sought to produce a version of the genome sequence in which repetitive sequences were stringently removed and coding sequences (as much as possible) ignored.
Many repeat masking algorithms are very limited in that only repeats from a limited and defined database are masked in the target sequence (e.g., the repeatmasker algorithm of Smit and Green) (Thomas et al. 2003). This type of operation was at best incomplete for the task at hand. Instead, we chose to apply an algorithm (reputer) and tools developed by Kurtz and Schleiermacher (1999). Starting with two parameters, window (n) and stringency (m), reputer efficiently removes any segment of n bp that can be aligned elsewhere in the genome with at least m bases matching. We used n = 25, m = 25 (i.e., removal of all sequences of 25 bp that precisely match another sequence in the genome) and n = 28, m = 27 (i.e., removal of all sequences of 28 bp that match 27/28 to another sequence in the genome) with highly similar results. For the discussion below, we make use of the 25/25 genome masks.
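The exact-match case (n = m) of this masking step can be sketched as follows. This is an illustration only, not the reputer algorithm: it assumes a single strand (reverse-complement matches are ignored), holds all windows in memory, and uses a hypothetical function name, `mask_exact_repeats`.

```python
from collections import Counter

def mask_exact_repeats(seq, n=25, mask_char="N"):
    """Mask every n-base window that occurs more than once in `seq` (forward strand only)."""
    n_windows = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(n_windows))
    masked = list(seq)
    for i in range(n_windows):
        if counts[seq[i:i + n]] > 1:
            # Mask all n bases of a window that has an exact copy elsewhere.
            masked[i:i + n] = mask_char * n
    return "".join(masked)

# Toy example with a 4-base window: ACGT occurs twice and is masked.
toy_masked = mask_exact_repeats("ACGTGGCAACGT", n=4)
```

Allowing mismatches (e.g., 27/28) requires an approximate-matching index and is not covered by this sketch.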
Removal of coding sequences was a less precise operation. Although numerous attempts have been made to annotate coding sequences in genomes, none are complete. We were, however, able to obtain a substantial damping of triplet and 3n periodicities by utilizing a version of the genome in which coding sequences annotated by the genome project consortium had been removed (Kent and Zahler 2000). Figure 2, C and E, and supplemental Figure S1B (http://www.genetics.org/supplemental/) show the tetranucleotide periodicity analysis for the repeat-subtracted, coding-region-depleted version of the C. elegans genome. Two aspects of the genome emerge from this analysis. First, tetranucleotides rich in AA or TT dinucleotides retain their ∼10-base periodicity, and second, the complex set of patterns for other tetranucleotides is substantially or completely removed.
The periodic distribution of AA/TT-containing tetranucleotides is consistent with previous observations (e.g., Widom 1996; Fukushima et al. 2002) as well as the above analysis of bending predictions. Not as evident from some of the earlier analyses was the extent to which the periodicities can be detected over relatively long molecular distances. In Figure 2B, showing the frequencies of AAAA to AAAA distance as a function of base pair separation, a clear periodicity can easily be seen to extend beyond 200 bp.
Islands of periodicity in the C. elegans genome:
It was conceivable that the periodic signal might derive either uniformly from the whole genome or from some number of highly periodic areas or islands. Under the latter scenario, periodic signals might be enhanced by (i) selecting a tetranucleotide that was overrepresented in periodic islands and then (ii) calculating distance distributions from this tetranucleotide to other (arbitrary) tetranucleotides. Such an analysis is shown in supplemental Figures S2A (distributions of distances between AAAA and an antecedent arbitrary tetranucleotide sequence) and S2B (distributions of distances between AAAA and a subsequent arbitrary tetranucleotide sequence) (http://www.genetics.org/supplemental/). Particularly striking is the propensity for AA/TT-containing tetranucleotides to concentrate in one face of the helix (e.g., start-to-start separations 10, 20, 30 from adjacent AAAA/TTTT sequences), while non-AA/TT-containing tetranucleotides concentrate in intervening regions (e.g., 5, 15, 25, etc). These analyses indicate a periodic character to virtually all tetranucleotides when located in the vicinity of AAAA/TTTT sequences. The enhanced periodicity for arbitrary tetranucleotides in the vicinity of AAAA/TTTT tetranucleotides is indicative of a mosaic character to the genome, i.e., of islands of periodicity. We give these islands the formal designation PATCs.
Detailed analysis of the histograms in supplemental Figure S2 (http://www.genetics.org/supplemental/) reveals an interesting asymmetry in that complementary tetranucleotides can show different location profiles [e.g., (AAAA to GTCG) compared to (AAAA to CGAC), supplemental Figure S2B]. Similar asymmetry can be seen with AA/TT-rich tetranucleotides [e.g., (AAAA to TTTT) compared to (TTTT to AAAA) as expanded in supplemental Figure S3 (http://www.genetics.org/supplemental/)]. The latter comparison in particular shows significant differences in the ≥100-nt range that would not be expected with a simple localized deviation in DNA sequence. Rather, this comparison suggests a relatively intricate and asymmetric interaction that can cover many helical turns in the DNA.
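The directional comparison above (e.g., AAAA to TTTT vs. TTTT to AAAA) can be sketched as a histogram of distances from each occurrence of one word to downstream occurrences of a second word. The function names are hypothetical, and start-to-start distances on a single strand are assumed.

```python
def find_starts(seq, word):
    """Return all start positions of `word` in `seq`, including overlapping ones."""
    return [i for i in range(len(seq) - len(word) + 1)
            if seq.startswith(word, i)]

def directed_separations(seq, first, second, max_sep):
    """hist[d] = number of times `second` starts d bases downstream of a `first` start."""
    second_starts = set(find_starts(seq, second))
    hist = [0] * (max_sep + 1)  # index 0 is unused
    for p in find_starts(seq, first):
        for d in range(1, max_sep + 1):
            if p + d in second_starts:
                hist[d] += 1
    return hist

toy = "AAAACCTTTT"
forward = directed_separations(toy, "AAAA", "TTTT", 8)  # AAAA followed by TTTT
reverse = directed_separations(toy, "TTTT", "AAAA", 8)  # TTTT followed by AAAA
```

Because the two histograms are computed on the same strand, any difference between them reflects an orientation-dependent arrangement of the two words.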
Identification of individual periodic islands in the C. elegans genome:
Although the statistical analysis above demonstrates nonrandom aspects of genome structure, our ability to understand the significance of the unusual structures depends on being able to identify and draw functional correlations between individual structures and genome function. To accomplish this, we needed algorithms that were able to detect a significant fraction (preferably as high as possible) of sequences with strong periodicity while showing maximal selectivity in avoiding detection of “false” positives (e.g., predicted highly periodic regions in random DNA sequence). Several different algorithms were evaluated for this purpose, including examination of localized sequence trajectories using available models for DNA bending, variations of a hidden Markov model, a localized form of Fourier analysis, and a heuristic algorithm (PATC) described in Figure 3. All of these approaches gave similar results in the incidence and distribution of unusual DNA structures in the C. elegans genome, with general agreement on both localized and overall patterns of these structures. The heuristic PATC algorithm was chosen for detailed analysis as it has given somewhat better signal-to-noise responses. For subsequent analysis we chose a cutoff score that excludes 99.999% of random sequence. On the basis of this cutoff score (which has an arbitrary numerical value of 95), 6.14% of the C. elegans genome is within identified PATCs.
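The way a cutoff can be calibrated against random sequence is sketched below. The PATC scoring rule itself is defined in Figure 3 and is not reproduced here; `toy_score` is a hypothetical stand-in (a simple AA/TT count), and the window length, number of windows, and 99% threshold are arbitrary choices for illustration.

```python
import math
import random

def empirical_cutoff(null_scores, exclude_fraction):
    """Return a score s such that at least `exclude_fraction` of null scores are <= s."""
    ordered = sorted(null_scores)
    index = max(math.ceil(exclude_fraction * len(ordered)) - 1, 0)
    return ordered[index]

def toy_score(seq):
    """Hypothetical stand-in score: number of AA/TT dinucleotides (NOT the PATC score)."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] in ("AA", "TT"))

# Null distribution from uniformly random 100-base windows (an assumption;
# a composition-matched null may be more appropriate for a real genome).
rng = random.Random(0)
null_windows = ["".join(rng.choice("ACGT") for _ in range(100)) for _ in range(2000)]
null_scores = [toy_score(w) for w in null_windows]
cutoff = empirical_cutoff(null_scores, 0.99)
```

Estimating a threshold as stringent as 99.999% in this way requires far more than 2000 null windows, since only about 1 window in 100,000 is expected to exceed it.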
PATC motifs show a striking global pattern within the genome (Figure 4) with enrichment on the terminal approximately one-third of each autosome and on the extreme left tip of the X chromosome. C. elegans autosomes are known to have distinct central and peripheral characteristics, with genes more densely packed in the center and more frequent recombination and occurrence of certain transposons in peripheral regions (Brenner 1974; Barnes et al. 1995; C. elegans Sequencing Consortium 1998; Duret et al. 2000).
The genome can be divided along functional lines into coding sequence, introns, and intergenic regions. As shown in Figure 5, periodicity of unique chromosomal regions is most evident in intron sequences, showing a relatively constant profile over the length of individual genes. A somewhat lower level of periodicity is seen in intergenic regions, while only low-level periodicity is seen in coding regions. It should be noted that intergenic regions are predicted on a very limited data set; it is conceivable that some of these sequences are actually transcribed as part of coding or noncoding RNA transcripts. Thus we cannot rule out the possibility that all PATCs are within transcribed regions. Periodicity of 5′- and 3′-UTR regions is also of interest and is described in Figure 5B as the best that could be extracted from current databases. Both 5′- and 3′-UTR regions clearly show above-background periodicity, although we stress that current annotation of UTR sequences is much less complete than that for introns and exons, potentially biasing the genomewide values for periodicity in UTRs.
To understand the functional significance of periodic An/Tn clusters in the C. elegans genome, we next sought a rough size distribution for individual islands. Although the AAAA/TTTT separation plot in Figure 2C demonstrates strong periodicity beyond 200 bp, a longer-range analysis of this plot becomes difficult as noise becomes sufficient to hide any signal. As an alternative approach, we examine the distribution of lengths for which the PATC algorithm described in Figure 3A can define a periodicity-enriched face of the helix. As shown in supplemental Figure S4A (http://www.genetics.org/supplemental/), this distribution continues beyond 1000 bp. Because the latter method could fortuitously “fuse” two adjacent periodic regions that happened to be in the same register, we applied a simpler and less “tuned” algorithm to look for periodic correlations over long distances. This algorithm simply assigns a weight to each 5-base word on the basis of the number of AA/TT dinucleotides and then sums the coincidence value (product) as a function of distance. As a starting data set for the latter procedure, we use the component of the genome most susceptible to high periodicity: introns present on autosomal arms. The resulting plot (supplemental Figure S4B at http://www.genetics.org/supplemental/) shows striking albeit imperfect periodicity well beyond 500 bp. Interestingly, if we require that two intron-contained words be separated not only by n bases, but by at least one exon sequence, we maintain the strong periodicity (supplemental Figure S4C at http://www.genetics.org/supplemental/). This indicates that the specific periodicity can be maintained even through a region of minimally periodic exon sequence.
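The simpler coincidence calculation described above can be sketched as follows. The exact weighting used for the published plots is not specified beyond "the number of AA/TT dinucleotides," so the weight here is assumed to be a plain count within each 5-base word; the function names are hypothetical.

```python
def aa_tt_weight(word):
    """Number of AA or TT dinucleotides in `word` (overlapping ones counted)."""
    return sum(1 for i in range(len(word) - 1) if word[i:i + 2] in ("AA", "TT"))

def coincidence_profile(seq, max_dist, word_len=5):
    """profile[d] = sum over positions i of weight(i) * weight(i + d)."""
    weights = [aa_tt_weight(seq[i:i + word_len])
               for i in range(len(seq) - word_len + 1)]
    profile = [0] * (max_dist + 1)  # index 0 is unused
    for d in range(1, max_dist + 1):
        profile[d] = sum(weights[i] * weights[i + d]
                         for i in range(len(weights) - d))
    return profile

# Two A5 tracts whose starts are 10 bases apart.
toy = "AAAAA" + "C" * 5 + "AAAAA"
profile = coincidence_profile(toy, 10)
```

Peaks in such a profile at multiples of the helical repeat indicate that AA/TT-rich words tend to recur in the same helical register.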
Associations between periodic character and gene expression in C. elegans:
The strong preference for periodicity in a subset of transcribed intron sequences raises the question of whether there might be correlation(s) between gene function and periodic character. To address this question, we first constructed a gene-by-gene list of periodicity values for introns, exons, and upstream and downstream segments (supplemental Figures S5 and S6 at http://www.genetics.org/supplemental/). Given the genomewide preference for periodicity in intron sequences, we then used the intron PATC score for evaluating expression–periodicity associations.
Despite the remarkable functional genomic analysis available for C. elegans, the majority of the coding regions have been subject to only limited experimental characterization. We expected that the most reliable evaluation would come from the relatively small number of genes for which functional data from classical genetics have been available (generally these are genes with classical genetic nomenclature and references to alleles isolated in forward mutagenic screens). Of the 10 most highly periodic genes on this list (sorted by intron periodicity), 9 (par-2, smu-2, mrt-2, ced-10, mel-46, sqv-2, hmp-2, ced-2, and apx-1) are known to express and/or function in the adult hermaphrodite germline, while only 1 gene in this set (ced-1) has not been reported to function in the germline [we note, however, that the observed lack of maternal rescue by ced-1(+) (Ellis et al. 1991) does not rule out expression during germline development] (Kemphues et al. 1988; Ellis et al. 1991; Mello et al. 1994; Costa et al. 1998; Ahmed and Hodgkin 2000; Hwang et al. 2003; Spartz et al. 2004; R. Minasaki and A. Streit, personal communication). Cross-referencing of a somewhat larger subset of this list with data in both the C. elegans database “Wormbase” and openly available articles in PubMed (supplemental Figure S7 at http://www.genetics.org/supplemental/) again revealed an unexpectedly high fraction (45/62) from this class with known roles or expression in the hermaphrodite germline. A comparable set of genes culled at random from the least periodic portion of the list included a much lower fraction with known germline roles (14/69). It should be stressed that although the difference in incidence of germline activity is highly statistically significant (P < 10⁻⁸), this analysis is quite imperfect, since annotation of gene expression and function in Wormbase and indeed in the literature as a whole is by nature incomplete.
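The comparison of 45/62 with 14/69 can be examined with a standard contingency-table test. The text does not state which test produced the reported P value; the sketch below assumes a one-sided Fisher exact test (hypergeometric tail), and the function name is hypothetical.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with all margins fixed."""
    row1 = a + b          # size of the first group
    col1 = a + c          # total number of "positive" genes
    total = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, k) * comb(total - col1, row1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(total, row1)

# Highly periodic set: 45 germline, 17 not; least periodic set: 14 germline, 55 not.
p_value = fisher_one_sided(45, 17, 14, 55)
```

The resulting `p_value` can be compared with the significance level quoted above; a two-sided test or a chi-square approximation would give a somewhat different number.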
A more objective (although still imperfect) assessment of association between periodic character and gene expression should be derivable from expression data obtained on a genomewide scale. Although microarray assays have been the most frequently employed in such analyses (e.g., Kim et al. 2001), the resulting data are complicated by a marginal signal-to-noise ratio among low-level-expressed genes. Another technique, SAGE, offers a somewhat better opportunity to definitively detect rare RNAs in a mixture (Velculescu et al. 1995). Through the heroic efforts of the Genome BC C. elegans Gene Expression Consortium and their various collaborators, the C. elegans community is fortunate to have access to an extensive set of published and unpublished SAGE data (McKay et al. 2003; Blacque et al. 2005; Chen et al. 2005b; B. Goszczynski and J. McGhee, personal communication; K. Wong, M. Marra, S. Jones, D. Baillie and D. Moerman, personal communication). As shown in Figure 6A, analysis of data from C. elegans oocytes shows a clear, albeit nonlinear, association between SAGE oocyte representation level and degree of periodicity. At one extreme, we observe low periodicity (on average) for genes whose RNAs are not represented in the oocyte SAGE library. Note here that some genes could be unrepresented in the SAGE library due to inefficient tag production or cloning, so we have included in Figure 6B only those genes that are actually represented at least once in the equivalent (long-SAGE) libraries prepared from different C. elegans tissues by the University of British Columbia (UBC) genome group. As representation in the oocyte library begins to increase, the periodicity can also be seen to increase substantially. Average periodicity scores drop for genes with higher levels of representation in the oocyte SAGE library. Although not necessarily precisely reflective of expression levels, SAGE frequencies are certainly related to the underlying mRNA level.
These data thus suggest a “bell curve” association between oocyte expression and periodic character.
Although the analyses in Figure 6B suggest a relationship between oocyte expression level and periodicity, there is no reason to assume that this relationship is exclusive. Indeed many of the genes described as periodic (e.g., supplemental Figures S5–S7 at http://www.genetics.org/supplemental/) are putative housekeeping genes involved in processes such as basal transcription that are essential for all cells. To determine which, if any, tissues were most tightly associated with periodic character, we needed a SAGE-compatible numerical method to evaluate the quality of associations such as the bell-shaped curve in Figure 6A. To this end, we adapted a technique from communications signal analysis (detailed in supplemental Methods at http://www.genetics.org/supplemental/) that yields a result in the form of a likelihood ratio. Using this method to compare the expression/periodicity association for the oocyte SAGE data to a null hypothesis (no significant association) gives a total likelihood ratio of >10¹⁹. As shown in Figure 6B, a much less significant likelihood ratio is obtained from the other major tissues (gut, hypodermis, muscle, neurons, and pharynx). Better likelihood ratios were obtained from two additional SAGE data sets (pharyngeal marginal cells and AFD neurons). It should be noted that the marginal cell and AFD data sets derive from a much rarer tissue (a small number of cells in each case). Contamination by germline or oocyte RNAs becomes a greater concern under such circumstances, and indeed we observe that several genes that are thought to be germline specific in their activity are represented in these data sets. We stress, however, that no contamination has been demonstrated and thus the possibility remains that these two tissues share considerable portions of their gene expression profiles with the germline.
On the basis of this analysis of the SAGE data, we conclude that oocyte SAGE data representation provides the best associative model for predicting periodicity.
We have described an unusual DNA structure that underlies a significant fraction of the C. elegans genome. This structure shares certain features with DNA that has been shown to curve or bend in several biochemical and ultrastructural assays (Crothers et al. 1990). Nonetheless, we note that the unusual sequence characteristics are not accounted for in detail by assumptions of a specific role in producing a bent naked DNA in vivo.
Nature of periodic An/Tn sequences in the C. elegans genome:
The long-range nature of the periodicities observed in C. elegans suggests that we are detecting the “shadow” of a large nuclear structure as it is reflected in DNA sequence. Our working hypothesis is that extended regions with highly periodic character (“PATC islands”) correspond with the ability of the DNA to interact with some surface within the nucleus. Extended surfaces in the nucleus include the nuclear envelope, the outer surfaces of nucleosome cores, neighboring DNA duplexes, and a number of other organelles (nuclear granules and speckles, higher-order chromatin structures, nuclear scaffolds, etc.) (Kornberg and Lorch 1999; Gall et al. 2004; Taddei et al. 2004; Gruenbaum et al. 2005).
Among the various nuclear structures, the outer surface of the nucleosome core provides perhaps the most fertile speculation. In particular, it is this surface that is thought to associate most prominently with the bulk of nuclear DNA (Kornberg and Lorch 1999). AA/TT dinucleotides have been suggested to be involved in specific positioning of nucleosomes through a preference for their minor grooves positioned facing inward in the structure (e.g., Calladine and Drew 1986). Although the contributions of individual AA/TT dinucleotides to energy of binding may be relatively small (and may combine with many other factors), nucleosome conformations for which many AA/TT dinucleotides face minor-groove-inward would provide a substantial free energy benefit. Such structural interactions have thus been proposed to be capable of constraining nucleosomes by impeding their ability to slide along the DNA (e.g., Flaus and Richmond 1998; Kiyama and Trifonov 2002; Kulic and Schiessel 2003). Although nucleosome mobility in vivo is likely to involve a number of biophysical modalities (Flaus and Owen-Hughes 2003), some sequence-dependent kinetic impedance of nucleosome translocation could be expected from all of these modalities. A useful (if incomplete) analogy would be to a bicycle chain that prefers to limit its interaction with the corresponding gears to a quantized set of positions, thereby generating traction.
Several observations are consistent with the suggestion that PATC clusters reflect nucleosomal positioning in C. elegans. First, nucleosome association seems to be the default state for the bulk of eukaryotic nuclear DNA (including that of C. elegans) (Dixon et al. 1990; Kornberg and Lorch 1999). Second, nucleosome modification complexes are critical in setting up appropriate patterns of gene expression in the germline (Ahringer 2000; Shin and Mello 2003). Third, the minor peaks in the Fourier plot in supplemental Figure S4E (http://www.genetics.org/supplemental/) and the observation in supplemental Figure S3 (http://www.genetics.org/supplemental/) that AAAA → TTTT separations have a distinct profile from TTTT → AAAA separations are consistent with previous reports of nonsymmetric dinucleotide preferences and nonuniform helical repeats at different positions within a nucleosome (e.g., see Hayes et al. 1991; Ioshikhes et al. 1996; Luger et al. 1997).
Although these data are consistent with a nucleosomal connection to the structural anomaly in PATC sequences, we point out several caveats. First, regions of the DNA may adopt different protein associations under distinct cellular conditions; thus association with nucleosomes could be limited to a subset of tissues. Second, helical periodicities of DNA have not been determined for potential alternative models (e.g., for nuclear envelope-associated DNA). Third, differences and heterogeneity in TT vs. AA dinucleotide distribution might also be expected for other physical interactions, either as a result of intrinsic structure and flexibility or as a result of specific protein binding.
One feature of the sequence anomalies that initially seems unexpected on the basis of nucleosomal models is the persistence of periodic structure. Individual nucleosomes would be expected to cover only 146–147 bp of DNA, with some extension potentially coming from defined linker proteins. Two situations could potentially extend this footprint: first, an array of tandemly arranged nucleosomes could potentially produce a longer periodicity, assuming that the spacing between nucleosomal cores was such that the 10.# base periodicity was maintained (Widom 1992). Second, as has been proposed for a number of systems, a region of DNA might be specialized on the basis of constraining one or more nucleosomes not to a single position but rather to a quantized set of positions, e.g., to impede its movement on DNA. Such a situation might involve a larger region of DNA than would be covered by individual nucleosomes, just as a bicycle chain is much longer than the circumference of the gears that it connects with.
For purposes of discussion of functionality and evolution, we refer in the remainder of this discussion to the unknown surfaces that align to periodic regions as “nucleosomes.” We note, however, that all of the subsequent arguments could apply equally to another type of subnuclear surface.
Functional correlates of strong periodic character in the C. elegans genome:
The strong bias for abundant periodicity in a discrete subset of genes suggests a functional commonality among these genes. Manual annotation of gene lists and objective comparisons of genomic data both indicate a statistically significant association between gene activity in the hermaphrodite germline and high sequence periodicity. Nonetheless, it should be stressed that this correlation could be secondary and that the functional link could be a complex one, for example, involving another cell type with similar expression profiles to the hermaphrodite germline or some metabolic process that is simply more effective in these cells. The C. elegans germline passes through a rather dramatic series of transitions during development. Examining the lists of genes that show highly periodic character, we note that the majority of strongly periodic genes appear to be active in meiotic cells of the distal adult germline that act simultaneously (1) as nurse cells for mature oocytes and (2) as “oocytes-in-training.” Interestingly, we observed no strong periodicity in a trio of genes [him-17 (Reddy and Villeneuve 2004), spo-11 (Dernburg et al. 1998), and rad-51 (Colaiacovo et al. 2003)] that are located on autosomal arms and are thought to be active at premeiotic stages in germline function. Likewise we fail to observe strong periodicity in a large group of genes (many on the X chromosome) that are activated in a late burst of transcription during the last stages of oocyte maturation (Kelly et al. 2002). Spermatogenesis also has a characteristic pattern of gene expression (L'Hernault and Roberts 1995), and we see no evidence for unusual periodicity of the majority of genes active in this aspect of germline development. These observations are consistent with the hypothesis that highly periodic character associates with expression at a stage of germline development in which oocyte precursors slowly traverse the early stages of meiosis.
Do An/Tn periodicities in the C. elegans genome contribute to macroscopic fitness?
Given the strongly nonrandom character of periodic sequences in the C. elegans genome, we expect that considerable evolutionary pressure must have been present for their origin and maintenance. Consistent with this expectation, we observe regions with comparably high periodicity signals in a number of additional nematode genomes for which draft or fragmentary sequences are available [e.g., C. briggsae (draft genome from Stein et al. 2003) and C. remanei, Caenorhabditis species CB5161, and Caenorhabditis species PS110 (fragments from the ama-1 gene; Kiontke et al. 2004)]. We envision two types of selective pressures that could contribute to forming and maintaining these structures: macroscopic selection (accumulation of periodic structures in DNA due to increased fitness for the animal) and microscopic bias (accumulation of periodic structures in DNA due to nonrandom mutagenesis and/or repair).
For a macroscopic selection model to be applicable, periodic segments that are sufficiently unusual to be absent in random DNA each would need to contribute in some way to the phenotypic fitness of the animal. Such contributions might be relatively subtle, such that no single base mutation would produce a visible phenotype in the laboratory. Alternatively, the effects could be dramatic, in which case one might expect to uncover a set of mutations whose phenotypic effects might be difficult to explain on the basis of standard aspects of DNA and RNA structure. Such “unexplained” mutational consequences have been rarely if ever reported in the C. elegans genetic literature, although biases for examining mutations with stronger phenotypes and biases in choices of mutagen may have masked the relevant interpretations. Estimates of total constraint due to highly periodic islands in the C. elegans genome are rather substantial (with on the order of 10⁶ noncoding bases showing nonrandom identity related to these structures). Spontaneous mutation rates in C. elegans (∼10⁻⁸/base/generation) are such that selective pressure on 10⁶ noncoding bases would pose a slight but evolutionarily significant burden on the species. At the same time, the likely nonlethal character of individual mutations that affect only one or a few periodic regions raises some interesting population genetic questions in terms of whether simple selection could allow highly periodic structures to be fixed and maintained within a population.
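The scale of this burden can be made concrete with a back-of-the-envelope calculation. The sketch below multiplies the two order-of-magnitude figures quoted above to give the expected number of new mutations per generation that fall within periodicity-constrained bases. The function names are ours and purely illustrative, and the Poisson approximation (independent, rare hits, counted per copy of the genome) is an assumption rather than a claim made in the text.

```python
import math

# Order-of-magnitude figures quoted in the text.
constrained_bases = 1e6   # noncoding bases with periodicity-related constraint
mutation_rate = 1e-8      # spontaneous mutations per base per generation


def expected_hits(n_bases: float, rate: float) -> float:
    """Expected new mutations per generation within the constrained bases."""
    return n_bases * rate


def prob_at_least_one(n_bases: float, rate: float) -> float:
    """P(at least one hit per generation), assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-expected_hits(n_bases, rate))


hits = expected_hits(constrained_bases, mutation_rate)
print(f"expected hits per generation: {hits:.3g}")
print(f"P(>=1 hit per generation):    "
      f"{prob_at_least_one(constrained_bases, mutation_rate):.3g}")
print(f"generations per expected hit: {1.0 / hits:.3g}")
```

Under these assumptions, a new mutation lands in a constrained base roughly once per hundred generations per genome copy. That is rare for any one lineage but frequent at the scale of a population, consistent with a burden that is slight yet evolutionarily significant.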
PATC islands in C. elegans would presumably make their contribution to organismal fitness through modulation of nearby genetic elements (coding regions or other functional chromosomal features). The most attractive hypothesis to us is that these sequences impede nucleosome rearrangement in genes for which germline expression is important. Formation of inactive heterochromatin has been shown at least in a subset of cases to be associated with the assembly of nucleosomes into arrays with a specific and uniform spacing (e.g., Sun et al. 2001). Constraints on nucleosome mobility (or alternatively, strong positioning signals that failed to match the required heterochromatic spacing) would be expected to increase the energy and time needed to assemble these uniformly spaced arrays. This in turn could inhibit or delay the assembly of specific chromosomal regions into higher-order (and potentially silenced) chromatin structures. Such a situation might serve to protect a subset of genes from active and somewhat indiscriminate silencing mechanisms that normally protect the germline from unwanted expression of genes whose expression might be harmful. Harmful genes might include both somatic specification and differentiation components and selfish DNAs such as transposons or viruses.
Although the forces that shape patterns of heterochromatin in the genome are not fully understood, they are almost certain to involve nucleation events (de novo initiation of heterochromatin structures) combined with a tendency of heterochromatin structures to spread laterally in a stepwise and somewhat deliberate manner until sequences that prevent their spread are encountered (Richards and Elgin 2002). Initiation events involve a mixture of triggers: specific DNA sequence elements that are targeted by interaction with proteins that initiate heterochromatic silencing, repeated structures in DNA that may be recognized by meiotic or recombination machinery, parasitic genetic elements, and certain types of modulatory RNA interactions (Turker and Bestor 1997; Hsieh and Fire 2000; Selker 2002). Interestingly, the outer autosomal regions where many of the highly periodic genes in C. elegans reside are enriched in repeated DNA and putative selfish elements (C. elegans Sequencing Consortium 1998). One model would be that these elements contribute to a relatively inhospitable genomic environment for those genes whose activity in the germline is important. This might necessitate the ability of such genes to evolve mechanisms to resist the encroachment of heterochromatic structure (which could potentially render them useless for the germline). A situation in which periodic regions of the germline-expressed genes were refractory to ordered nucleosome phasing and thus to higher-order heterochromatin assembly might provide these genes substantial protection from silencing.
Could biased mutagenesis and/or repair mechanisms contribute on a microscopic scale to periodicity in the C. elegans genome?
As a (nonexclusive) alternative to models proposing an organismal selection for periodic regions in DNA, it is conceivable that such structures arise due to biases in mutagenesis and repair at the level of individual genomic regions. Biased mutagenesis and repair processes have certainly been documented in numerous systems and could reflect both nucleosome positioning and transcriptional properties (Smerdon 1991; Miller 2005). The observed periodicity in C. elegans DNA could be explained if a number of conditions were met.
First, certain sequences in the genome (those transcribed at specific stages of germline development) would either (1) have distinct sensitivity to mutagenic hits or (2) be repaired with altered efficiency or specificity after such hits. The most likely scenario here would be for transcription in germline tissue to be associated with altered mutability characteristics. The relatively stable early meiotic states slowly traversed by oocyte precursors in the distal germline make a good candidate for the tissue of interest here since these chromosomes can persist in a stage-specific conformation for a considerable time period. Another possibility would be targeting of specific regions of chromatin that are accessible in diapause (e.g., dauer) animals, again at a stage where cells persist for a long period.
Second, the mutagenic processes in the animal would need to be influenced by local structural characteristics of the chromosome and its environment. Most attractive as a model here would be to propose that nucleosome interactions might influence the distribution of mutagenic hits and/or the spectrum of repair functions. Such influences have been hypothesized (Holmquist 1994) and demonstrated in model systems (e.g., Smerdon 1991; Schieferstein and Thoma 1996). In the C. elegans case, nucleosome positions would need to alter mutagenic consequences on a base-by-base level, skewing the eventual spectrum of mutations such that AA/TT dinucleotides would form more frequently on one face of a surface contacting the DNA.
Numerous mutational events would be required to produce a periodic structure. Assuming that these hits occurred over considerable evolutionary time, the microscopic bias model requires that some force must constrain the nucleosome. One possibility would be a constraint of nucleosomes independent of the AA/TT periodicity: the AA/TT periodicity would then appear as a shadow of this underlying positional constraint. Alternatively, the AA/TT periodicity itself may work to constrain nucleosome positions, so that positioning of nucleosomes may become more and more constrained as further point mutations are accumulated. The latter model works under the interesting constraint that biases in mutagenesis and repair serve to reinforce current nucleosome positioning (e.g., Holmquist 1994). This would certainly be the case if mutagenic lesions that produce AA/TT dinucleotides with their minor grooves facing into the nucleosome were more frequently formed or fixed in the DNA sequence. Curiously, this set of conditions would set up a situation where strong nucleosome positioning signals could coalesce de novo in DNA sequence through an increasingly energetically favorable and self-reinforcing process.
We have relatively little information from which to speculate on potential mutagenic processes that might produce the types of nonrandomness observed as periodicity in the C. elegans genome. Virtually any point mutagen that was sensitive to DNA structure or context could produce such a bias. Likewise almost any repair process could be biased for or against events that occur on surface-associated microregions in the DNA. For illustrative purposes, we note the following aspects of ultraviolet-induced mutagenesis: (a) UV has been shown to preferentially induce lesions at TT dimers, (b) repair can proceed by either an error-free mechanism (photoreactivation) or an error-prone mechanism (excision and gap repair), and (c) it is conceivable that gap repair would predominate on the outer (exposed) surface of nucleosomes while photoreactivation would predominate on inner (sterically protected) faces. Combining this with other mutagenic processes that might result in formation of A- or T-rich segments in transcriptionally accessible regions of the genome (Dai et al. 2005), one might expect the types of structures that we observe in nematodes.
On the possibility of macroscopic/microscopic coevolution:
Proposals that the strong periodicity seen in the C. elegans genome might reflect a combination of structural constraints and mutagenesis/repair biases lead to a plausible hypothesis that the resulting anomalies may have no particular fitness value for the organism. While this would certainly be a valid hypothesis, an attractive alternative would be to propose that the aggregate changes occurring in certain germline-active genomic regions would confer structural characteristics that would yield a fitness advantage for the organism. One possible modality for this advantage would be the formation, over evolutionary time, of certain germline-active chromatin regions that would be protected from epigenetic silencing. Other potential roles for these sequences could be envisioned in chromosome structure, replication, recombination, segregation, or maintenance. Whatever the role in the function of the organism, the unexpected suggestion here is that mutagenic and repair functions in the nematode phylum have been tuned over evolutionary time so that beneficial changes to certain regions of the genome accumulate without any immediate selection for the individual events.
The proposal that organisms might evolve to manage their own evolutionary change is by no means original to this system (e.g., McClintock 1984). As additional nematode genomes are sequenced and functionally characterized, we expect that the large-scale processes directing their change will become clearer.
Could PATC islands contribute to germline genome defense?
Organisms use numerous mechanisms to protect themselves from selfish or unwanted information (transposons, viruses, etc.). These mechanisms include silencing of repeated DNA, meiotic silencing of unpaired DNA, RNAi, recognition and clearance of extended regions of ssDNA and untranslated RNA, and gatekeeping processes that prevent nuclear export of unspliced RNAs (Fire 2006). Even with these defense mechanisms, there are still numerous instances where foreign sequences enter the genome and cause deleterious effects.
Unusual structural characteristics of germline expressed genes could potentially provide a species with an additional level of protection. Foreign DNAs (whether viral, transposon, or from another species or phylum) would generally lack such characteristics and might thus be recognized as foreign. In the case of the periodic character of many germline-expressed genes in C. elegans, an ability of PATC islands to resist encroachment of silencing would allow the organism to employ a highly aggressive process for spreading of germline heterochromatin into unprotected regions of the genome. The vast majority of sequences that entered the genome from outside sources would then be subject to effective silencing for germline expression during their first few generations in the C. elegans genome, a process that is readily seen with many different transgene constructs that are introduced into C. elegans (Kelly et al. 1997; Hsieh and Fire 2000). The process of injected transgene silencing in C. elegans is not completely understood but is thought to involve the progressive recruitment of heterochromatin-related histone modifications to the extrachromosomal array structures into which the foreign DNA is incorporated (Hsieh and Fire 2000; Bean et al. 2004). Several different vectors are available that seem to partially limit germline silencing of injected C. elegans transgenes; intriguingly the majority of these vectors are derived from genes with substantial periodicity scores (let-858, mex-3, pie-1, smu-2, smu-1, and ama-1) (Kelly et al. 1997; Reese et al. 2000; Spartz et al. 2004; M. Montgomery, S. Xu, W. Kelly and A. Fire, unpublished data; M. Dunn and G. Seydoux, personal communication). By contrast, germline-expressed genes with no detectable periodicity have (with one exception; W. Johnston and J. Dennis, personal communication; Lee and Schedl 2004) been ineffective tools for vector development.
Periodicity diagrams for several of the genes that have been tested in vector development are shown in supplemental Figure S8 (http://www.genetics.org/supplemental/).
Given that a number of germline-expressed C. elegans genes lack any detectable periodicity signal, one expects that the periodicity may not be the only force that can prevent heterochromatinization. The difference in periodic character between central and peripheral genes may indicate distinct optima in terms of protection from silencing, with periodicity-related processes the most effective means in the harsh environment of autosomal arms (which may have a greater proportion of silenced DNA) while alternatives may be most effective in transcription-rich chromosomal centers.
Although we see PATC periodicity most strikingly in nematode sequences, the existence of genome defense mechanisms based on species-specific or phylum-specific properties of DNA could in principle be quite general. One expects that other species could use features such as base composition (GC%), presence of specific protein-binding sites, or conserved DNA/miRNA interactions to provide a layer of self–nonself discrimination and thus to protect their cells (and most importantly germline cells) from invasion or activity by unwanted information.
Tool building based on genome structure:
Tools that manipulate gene expression in a specific cell type are critical for many experimental and therapeutic manipulations. The highly periodic structure in a large number of C. elegans germline-expressed genes certainly argues that we should be aware of periodicity in designing germline expression vectors. This might guide the choice of promoters, coding regions, and reporters (synthetic or natural) for intended germline expression. Ultimately, it may be useful to design our own reporter coding sequences de novo to maximize periodic character (particularly in introns) and thus potentially prevent epigenetic silencing of resulting transgenes.
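As a starting point for such design work, one could screen candidate intron or reporter sequences with a simple phase-coherence score for AA/TT dinucleotides. The sketch below is a minimal illustration only and is not the scoring procedure used to define PATC islands in this article. It computes the mean resultant length (a standard circular statistic) of AA/TT start positions at a chosen period. The function name `aatt_phase_score` and the default period of 10 bp are assumptions made for the example; the observed periodicity lies in the 10–11 bp range.

```python
import cmath
import math


def aatt_phase_score(seq: str, period: float = 10.0) -> float:
    """Mean resultant length of AA/TT start positions at the given period.

    Returns a value in [0, 1]: near 1 when AA/TT dinucleotides recur in phase
    with the period (i.e., on one helical face), near 0 when they are spread
    evenly over all phases. Returns 0.0 if no AA/TT dinucleotide is present.
    """
    s = seq.upper()
    # Start positions of every (possibly overlapping) AA or TT dinucleotide.
    starts = [i for i in range(len(s) - 1) if s[i:i + 2] in ("AA", "TT")]
    if not starts:
        return 0.0
    # Map each position to a point on the unit circle and average the vectors.
    total = sum(cmath.exp(2j * math.pi * i / period) for i in starts)
    return abs(total) / len(starts)


# Toy sequences (hypothetical, not taken from the C. elegans genome).
phased = "AAGCGCGCGC" * 10   # AA every 10 bp, always at the same phase
unphased = "AAGC" * 5        # AA every 4 bp, spread evenly over phases
print(round(aatt_phase_score(phased), 3))
print(round(aatt_phase_score(unphased), 3))
```

The phased toy sequence scores close to 1 and the unphased one close to 0. A score of this kind is inflated when only a few AA/TT dinucleotides are present (a single occurrence scores 1), so in practice candidate designs would need to be compared against shuffled controls of matched length and composition.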
For experimental manipulations of nonnematode species (e.g., human cell lines and transgenic mice) some of the same challenges of gene silencing have been described and discussed (e.g., Bacheler et al. 1979; see Bestor 2000 for review). Although the structures described in this article are somewhat specific to C. elegans, gene silencing applications outside of nematodes may still be worthy of investigation. First, the type of AA/TT periodicity described in this article, if it fundamentally alters nucleosome positioning or mobility, might be directly useful to avoid silencing even in systems that lack these signals normally. Second, a full knowledge of nucleosome positioning signals in vertebrates might allow the construction of vertebrate-specific vectors that would likewise constrain nucleosomes from forming highly ordered heterochromatin.
We thank K. Wong, M. Marra, S. Jones, D. Baillie, D. Moerman, J. McGhee, B. Goszczynski, the M. Smith Genome Sciences Centre, Genome British Columbia, and Genome Canada for providing C. elegans SAGE data prior to publication. Additional help, suggestions, and support from Anne Villeneuve, Tim Schedl, Robert Herman, Bill Kelly, Susan Strome, Geraldine Seydoux, Barbara Meyer, David Schwartz, Phil Beachy, Robert Schleif, Don Brown, Ed Trifonov, Alan Wolffe, Adrian Streit, Roger Kornberg, Jim Priess, Dennis Dixon, SiQun Xu, Jamie Fleenor, Susan White-Harrison, Javier Lopez-Molina, Jeb Gaudet, Dave Hansen, Andrew Travers, Wendy Johnston, Jim Dennis, Weng-Onn Lui, Lia Gracey, Jonathan Gent, Rayka Yokoo, Cecilia Mello, Steve Johnson, Julia Pak, Hann Yew, Blake Hill, and the National Institutes of Health (grants R01GM37706 and T32GM07231) are gratefully acknowledged.
- Received February 17, 2006.
- Accepted April 21, 2006.
- Copyright © 2006 by the Genetics Society of America