The distal arm of the fourth (“dot”) chromosome of Drosophila melanogaster is unusual in that it exhibits an amalgamation of heterochromatic properties (e.g., dense packaging, late replication) and euchromatic properties (e.g., gene density similar to euchromatic domains, replication during polytenization). To examine the evolution of this unusual domain, we undertook a comparative study by generating high-quality sequence data and manually curating gene models for the dot chromosome of D. virilis (Tucson strain 15010–1051.88). Our analysis shows that the dot chromosomes of D. melanogaster and D. virilis have higher repeat density, larger gene size, lower codon bias, and a higher rate of gene rearrangement compared to a reference euchromatic domain. Analysis of eight “wanderer” genes (present in a euchromatic chromosome arm in one species and on the dot chromosome in the other) shows that their characteristics are similar to other genes in the same domain, which suggests that these characteristics are features of the domain and are not required for these genes to function. Comparison of this strain of D. virilis with the strain sequenced by the Drosophila 12 Genomes Consortium (Tucson strain 15010–1051.87) indicates that most genes on the dot are under weak purifying selection. Collectively, despite the heterochromatin-like properties of this domain, genes on the dot evolve to maintain function while being responsive to changes in their local environment.
EUKARYOTIC genomes are packaged into two major types of chromatin: euchromatin is gene rich and has a diffuse appearance during interphase, while heterochromatin is gene poor and remains densely packaged throughout the cell cycle (Grewal and Elgin 2002). The distal 1.2 Mb of the fourth chromosome of Drosophila melanogaster, known as the dot chromosome or Muller F element, is unusual in exhibiting an amalgamation of heterochromatic and euchromatic properties. This domain has a gene density that is similar to the other autosomes (Bartolomé et al. 2002; Slawson et al. 2006). However, it appears heterochromatic by many criteria, including late replication and very low levels of meiotic recombination (Wang et al. 2002; Arguello et al. 2010). It exhibits high levels of association with heterochromatin protein 1 (HP1) and histone H3 di- and trimethylated at lysine 9 (H3K9me2/3), as shown by immunofluorescent staining of the polytene chromosomes (Riddle and Elgin 2006; Slawson et al. 2006). This association with heterochromatin marks has recently been confirmed by the modENCODE Project [N. C. Riddle, A. Minoda, P. V. Kharchenko, A. A. Alekseyenko, Y. B. Schwartz, M. Y. Tolstorukov, A. A. Gorchakov, C. Kennedy, D. Linder-Basso, J. D. Jaffe, G. Shanower, M. I. Kuroda, V. Pirrotta, P. J. Park, S. C. R. Elgin, G. H. Karpen, and the modENCODE Consortium (http://www.modencode.org), unpublished results]. To understand this unique domain and to examine the evolution of a region with very low levels of recombination, we have undertaken a comparative study using the dot chromosome of D. virilis, a species that diverged from D. melanogaster 40–60 million years ago (Powell and Desalle 1995). We sequenced and improved the assembly of the D. virilis dot chromosome and created a manually curated set of gene models to ensure that both the assembly and the gene annotations are at a quality comparable to those in D. melanogaster. We then compared the sequence organization and gene characteristics of the distal portion of the D. virilis dot chromosome with the corresponding region from the D. melanogaster dot chromosome.
In addition to examining the long-term dot chromosome evolution, we also investigated the short-term dot chromosome evolution by comparing the genomic sequences from two different strains of D. virilis. Agencourt Biosciences (AB) has previously produced a whole genome shotgun assembly of Tucson strain 15010–1051.87, while we have sequenced Tucson strain 15010–1051.88 of D. virilis [the Genomics Education Partnership (GEP) assembly]. The AB assembly has been improved by the Drosophila 12 Genomes Consortium and released as part of the comparative analysis freeze 1 (CAF1) assembly (Drosophila 12 Genomes Consortium et al. 2007).
Using the GEP and CAF1 assemblies from D. virilis, and the high-quality D. melanogaster assembly and its gene annotations from FlyBase (Crosby et al. 2007), we compared the gene properties and sequence organization of the dot chromosomes and reference euchromatic and heterochromatic domains. The dot chromosomes from D. melanogaster and D. virilis are distinct from the heterochromatic and euchromatic regions of the two genomes, both in organization (e.g., repeat density) and in characteristics of the genes (e.g., size, codon bias). The two dot chromosomes resemble each other by most criteria and differ only in the types of repetitive sequences present and in relative gene order and orientation. Despite the very low rate of meiotic recombination, comparison of the two D. virilis strains shows that dot chromosome genes are under weak purifying selection. Our analysis of genes that are present in a euchromatic chromosome arm in one species and on the dot chromosome in the other (the “wanderer” genes) shows that this set of genes evolves to maintain function while responding to the changes in the local chromosomal environment.
MATERIALS AND METHODS
Sequence analyses were implemented using custom scripts and programs written in Perl, Ruby, R, and Bash shell scripts. Data were stored in subversion repositories, MySQL databases, and plain text files. Graphical and statistical analyses were done using R from the R Project for Statistical Computing and Microsoft Excel. Sequence analysis was carried out on a quad-core server with 8 GB of RAM running openSUSE 11.1. The custom repositories, databases, and scripts are available from the authors upon request.
Sequencing of D. virilis fosmid clones:
The D. virilis fosmid library [Berkeley Drosophila Genome Project (BDGP)-DviF01] was obtained from the BACPAC Resource Center at Children's Hospital Oakland Research Institute (http://bacpac.chori.org/home.htm). Strategies used for isolating, sequencing, and finishing the D. virilis clones were previously documented (Slawson et al. 2006). All fosmids were improved to the quality standards used for the mouse genome and verified by restriction digests (see supporting information, File S1 for details). The nucleotide sequences and predicted protein sequences reported here have been deposited into the National Center for Biotechnology Information (NCBI) Entrez Genome Project Database under project ID 41283.
The curation strategy has been documented previously (Slawson et al. 2006). Evidence tracks, which were used to create the gene models, included results from multiple gene predictors (Genscan, Twinscan, Geneid, SGP2, SNAP) (Burge and Karlin 1997; Korf et al. 2001; Parra et al. 2000, 2003; Korf 2004), splice site predictors (Genesplicer) (Pertea et al. 2001), and BLAST searches [WUBLAST 2.0MP-WashU (May 4, 2006), http://blast.wustl.edu] against annotated proteins in D. melanogaster (FlyBase 5.16) (Crosby et al. 2007). Student annotations were loaded into the Apollo Genome Annotation Curation Tool for final quality control and analysis (Lewis et al. 2002).
RepeatMasker (version open-3.2.7) was run at the most sensitive settings (-s) using the cross_match (version 0.990329) search engine (http://www.repeatmasker.org). Three repeat libraries were used in the analysis: the Drosophila repeats library in Repbase (release 14.03), the Superlibrary (repeats in Repbase release 14.03 with novel repeats identified by PILER-DF), and the species-specific RepeatModeler (beta open-1-0-3) libraries (http://www.repeatmasker.org/RepeatModeler.html). RepeatRunner was run using the Superlibrary and the default repeat protein database from the RepeatRunner package with default parameters (Jurka et al. 2005; Smith et al. 2007). RepeatModeler was run on the CAF1 D. virilis assembly and release 5 of the D. melanogaster assembly using the de novo repeat finder RECON (release 1.06) and RepeatScout (release 1.05) with default parameters (Bao and Eddy 2002; Price et al. 2005). The repeats found by PILER-DF are available through the FlyBase FTP server (ftp://ftp.flybase.net/genomes/aaa/transposable_elements/PILER-DF/). The species-specific RepeatModeler libraries are available in File S2 (D. melanogaster) and File S3 (D. virilis).
Tandem repeats were identified using Tandem Repeats Finder (version 3.2.1) (Benson 1999), with the following parameters: matching weight = 2, mismatch penalty = 7, indel penalty = 7, match probability = 80, indel probability = 10, MinScore = 50, and maxPeriod = 2000.
Analysis of gene sizes, coding exon sizes, and intron sizes:
In addition to our manual annotations, gene models for the most comprehensive isoform (i.e., the isoform with the largest coding region) in D. melanogaster (release 5.16), and D. virilis GLEAN-R (Elsik et al. 2007) models (release 1.2), were extracted from the precomputed GFF files downloaded from FlyBase (see File S1 for analyses that justify the use of GLEAN-R models). D. virilis heterochromatic genes were not considered due to the small number of documented gene models available. To determine the intron size without repeats, intronic sequences were extracted from each model and repetitive sequences were identified using RepeatMasker with the species-specific RepeatModeler library. The unmasked bases were used to calculate the distribution of intron sizes with repeats removed. The significance threshold (alpha = 0.05) has been adjusted using the conservative Bonferroni correction to compensate for multiple pairwise comparisons (i.e., 15): only raw P-values less than 3.33E-03 (i.e., 0.05/15) are considered to be statistically significant in these analyses (Kolmogorov–Smirnov and Wilcoxon rank sum tests).
Codon bias analysis:
In addition to our manual annotations, coding regions for the most comprehensive isoform in D. melanogaster and the GLEAN-R D. virilis models were extracted from the translation sequence files downloaded from FlyBase. The effective number-of-codons (Nc) statistic (Wright 1990) was calculated using the chips program in the EMBOSS package.
Each gene found on the D. melanogaster and D. virilis dot chromosomes and in the euchromatic reference regions was assigned a unique identifier with its relative orientation. This set of unique identifiers was analyzed using GRIMM with default parameters for the unichromosomal genome through the GRIMM Web interface (Tesler 2002).
D. virilis strain comparisons:
The GEP dot chromosome was broken into nine smaller contigs on the basis of the locations of the gaps; these were aligned against the corresponding regions in the CAF1 dot chromosome using the global alignment algorithm stretcher in the EMBOSS package with default parameters (Rice et al. 2000). JalView was used to inspect and edit the alignments (Waterhouse et al. 2009).
The improved D. virilis dot chromosome assembly:
We previously reported the analysis of 18 D. virilis fosmids, 11 (372,650 bp) from the dot chromosome and 7 (273,110 bp) from the major euchromatic chromosome arms (Slawson et al. 2006). We have since isolated, sequenced, and improved 29 additional dot chromosome fosmids, bringing the quality of the whole region to the quality standards used for the mouse genome (see File S1 for these criteria). (Undergraduate students carried out the sequence improvement and annotation under the sponsorship of the GEP.) Collectively, the 40 overlapping fosmids assemble into 1,240,624 nonoverlapping base pairs from the D. virilis dot chromosome; only eight gaps remain with an estimated total gap size of 14,728 bases (Figure S1). This improved assembly is orthologous to the banded region of the dot chromosome of D. melanogaster. A custom version of the University of California Santa Cruz (UCSC) Genome Browser, available at http://gander.wustl.edu (D. virilis Manuscript assembly), is used to host the sequence data and evidence tracks used in this study (Kent et al. 2002). We identified 81 genes on the dot chromosome of D. virilis; 74 have putative orthologs among the 83 genes on the D. melanogaster dot chromosome (FlyBase Release 5.16).
In situ hybridization results using several fosmids from the GEP assembly (Slawson et al. 2006) place the centromere to the left of the GEP D. virilis dot chromosome scaffold and the telomere to the right. The CAF1 scaffold_13052 is anchored in the same relative orientation (Schaeffer et al. 2008). Although the entire CAF1 scaffold_13052 consists of 2,019,633 bases, the initial 600,000 bases and the final 200,000 bases have very poor quality. These regions collectively account for 244,702 out of 246,340 (99.3%) of the gap bases in the scaffold and show no sequence homology to the banded portion of the D. melanogaster dot chromosome. Since we cannot discount the possibility that these regions of the CAF1 scaffold_13052 have been misassembled, we only analyzed the region that spans from the most proximal to the most distal annotated genes (from 600,384 to 1,826,586 bp). A dot plot analysis shows that this region corresponds to the portion of the GEP strain dot chromosome we have sequenced and annotated and that the two strains of D. virilis have a high degree of sequence similarity (Figure S2).
In addition to improving the D. virilis dot chromosome assembly, we also created manually curated gene models for this region using results from several gene predictors and homology to the putative D. melanogaster orthologs. For each gene, we have consistently annotated the most comprehensive isoform (i.e., the isoform in D. melanogaster with the largest coding region). Our analysis has focused only on the coding regions because we cannot definitively annotate the untranslated regions in the D. virilis gene models, due to the sparse amount of expression data available.
As previously reported, the manually curated Repbase library has a strong bias toward repeats that are found in D. melanogaster (Slawson et al. 2006). To alleviate this bias, we generated three repeat libraries on the basis of the genomic assemblies of the two species. The first, Superlibrary, is a library of all previously reported repeats, combining annotated repeats from the Drosophila Repbase Update with novel repeats in the 12 Drosophila species found by the de novo repeat finder PILER-DF (Edgar and Myers 2005; Smith et al. 2007). Two species-specific repeat libraries were created with RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) and these later libraries were each used to analyze their respective genomes. However, despite our efforts to minimize bias in the repeat libraries, one cannot completely eliminate it using these approaches. The D. melanogaster assembly includes additional sequences from the Drosophila Heterochromatin Genome Project (Hoskins et al. 2007). Because there has been no corresponding effort to improve the heterochromatic regions in D. virilis, the D. melanogaster repeat libraries may be more comprehensive than those for D. virilis.
Collectively, the improved assembly, the manually annotated gene set, and the custom repeat libraries provide a unique resource to study the organization and evolution of the Drosophila dot chromosome. Using these resources from D. virilis and the high-quality sequence and annotations available for D. melanogaster, we seek to characterize the properties of this unique genomic region and to identify the forces that impact the evolution of this domain and the genes that reside within it.
Defining the euchromatic and heterochromatic reference regions:
To analyze the differences between the dot chromosomes and other regions of the Drosophila genome, we selected euchromatic and heterochromatic reference regions from both D. melanogaster and D. virilis for comparative study. We also utilized previously improved and annotated fosmids from euchromatic regions of the D. virilis genome and the corresponding regions from D. melanogaster (Slawson et al. 2006) to ensure that the reference regions we picked are representative of “typical” euchromatic regions in the two genomes.
Using the previously defined heterochromatin–euchromatin boundary for chromosome (chr) 3L (Muller D element) of D. melanogaster (Hoskins et al. 2007), we extracted a 1.25-Mb region distal to that boundary (toward the telomere) as a representative euchromatic region (chr 3L: 21,705,576–22,955,575), and the adjacent proximal 1.25-Mb region (toward the centromere) as a representative heterochromatic region (chr 3L: 22,955,576–24,205,575). Because there is no defined heterochromatin–euchromatin boundary for the D. virilis CAF1 assembly, we selected a scaffold (scaffold_13049) mapped to the Muller D element and identified heterochromatic and euchromatic domains on the basis of differences in gene and repeat density (Figure S3). The changes in these parameters near the boundary were more gradual in D. virilis than in D. melanogaster. Therefore, we picked a region ∼625 kb distal to the boundary to ensure that its properties reflect those of typical euchromatin in D. virilis. We also selected a 0.8-Mb region (from the boundary to the end of scaffold_13049) as a representative heterochromatic region. Both of these regions are high quality (12 gaps, total estimated gap size 13,612 bases in the euchromatic region; 12 gaps, total estimated gap size 37,211 bases in the heterochromatic region). Using these sequences as reference points, we asked whether the dot chromosomes more closely resemble the heterochromatic or the euchromatic domains for each property under investigation.
The dot chromosomes have a repeat density intermediate between that found in euchromatin and that in pericentric heterochromatin: A well-established characteristic of heterochromatin is a high level of repetitive DNA (Grewal and Elgin 2002). The D. melanogaster dot chromosome is unusual in that, while its gene density is similar to the other autosomes, its repeat density is much higher than the other nonpericentromeric regions in the D. melanogaster genome (Bergman et al. 2006). To determine whether the D. virilis dot chromosome has a similar repeat density, we used RepeatMasker with three custom repeat libraries (described above) to analyze it and the reference heterochromatic and euchromatic regions.
Results from RepeatMasker using the Repbase library initially suggested that the dot chromosome of D. melanogaster has a higher repeat density than that of D. virilis. However, the D. melanogaster and D. virilis dot chromosomes show similar overall repeat densities (Figure 1, Table S1) when we used the more comprehensive repeat libraries (i.e., RepeatRunner with the Superlibrary or RepeatMasker with the species-specific RepeatModeler libraries). These findings suggest that the Repbase library has a strong bias toward repeats in D. melanogaster.
Both dot chromosomes show repeat densities (26–28%) that are higher than the euchromatic reference regions (9–13%) and lower than the heterochromatic reference regions (66–68%) (Figure 1). This difference is consistent with our previous report based on a limited set of D. virilis fosmids (Slawson et al. 2006). The euchromatic reference region from D. virilis has a slightly higher repeat density (13%) than that from D. melanogaster (9%). This difference in the euchromatic reference regions is most pronounced when we use the least biased (RepeatModeler) species-specific libraries. Due to the difficulties of accurately defining repeat boundaries and heuristics used by repeat-finding algorithms (Bao and Eddy 2002), we cannot discern whether these small differences in repeat density are significant. However, our results are consistent with those obtained by the Drosophila 12 Genomes Consortium, which reported a higher overall repeat density in the D. virilis genome assembly compared to D. melanogaster (Drosophila 12 Genomes Consortium et al. 2007). The CAF1 and GEP dot chromosomes show a similar overall repeat density (27 vs. 28%), even though the CAF1 dot chromosome is unfinished while the GEP dot chromosome is improved to the mouse genome standard.
We used the classifications from the species-specific RepeatModeler libraries to analyze the distribution of different classes of repeats, since these libraries have the least bias. The D. melanogaster dot is enriched in DNA transposons and retroelements, and the D. virilis dot is enriched in simple repeats and low complexity sequences (Figure 2, Table S2). The difference in total repeat density between the dot chromosomes and the euchromatic regions can be attributed to the higher density of DNA transposons and retroelements on the dot chromosomes. In contrast, the difference in total repeat density between the heterochromatic reference regions and the dot chromosomes can primarily be attributed to an increase in retroelements. Given the high density of simple and low complexity sequences in D. virilis, we next investigated tandem repeats using Tandem Repeats Finder (TRF) (Benson 1999). We found that all domains in D. virilis have higher densities of tandem repeats compared to the corresponding regions in D. melanogaster (Figure 3 , Table S3).
The dot chromosomes have larger genes, reflecting larger introns, compared to euchromatic domains:
The higher repeat density on the dot chromosomes suggests the possibility of larger genes as a consequence of larger introns. A cumulative distribution plot of gene sizes (limited to the region from the start codon to the stop codon) show larger gene sizes on the D. melanogaster and D. virilis dot chromosomes compared to the euchromatic reference regions (Figure 4A; summary statistics in Table S4). Side-by-side boxplots show that the median gene size and the interquartile range (IQR) are larger on the dot chromosomes compared to the euchromatic reference regions (Figure S4A). The nonparametric Kolmogorov–Smirnov (KS) test shows that this difference in gene size is statistically significant (see raw P-values in Table S5). The difference in the distribution of gene sizes between the dot chromosome and the heterochromatic reference region was not statistically significant, which might be partially attributed to the smaller number of genes (21) documented in the latter domain. The distribution of gene sizes between the genes on the D. melanogaster and the D. virilis dot chromosomes is not significantly different (KS test raw P-value = 0.91). Similarly, we found no significant differences (raw P-value = 0.60) in the distribution of gene sizes between the manually annotated gene models for the GEP strain and the computationally predicted GLEAN-R gene models for the CAF1 strain. Thus the dot chromosome genes from the two species are similar to each other, and significantly larger than euchromatic genes.
To investigate factors that might contribute to the differences observed in gene size, we examined the size distributions of both the individual coding DNA sequences (CDSs) that make up the translated exons for each gene and the introns. The cumulative distribution plot of CDS sizes shows that the CDSs on the dot chromosome tend to be smaller than those in the euchromatic reference regions and larger than those in the D. melanogaster heterochromatic reference region (Figure 4B, summary statistics in Table S6). Differences in CDS sizes between the euchromatic reference regions and the dot chromosomes are statistically significant using the KS test (see boxplots in Figure S4B, raw P-values in Table S7). There are no significant differences in CDS sizes between the GEP and CAF1 D. virilis dot chromosomes or between the D. melanogaster and GEP D. virilis dot chromosomes (raw P-values = 0.99 and 0.72, respectively). Because CDSs on the dot chromosomes are generally smaller than those in the euchromatic reference regions, the larger overall gene size on the dot chromosomes must reflect larger intron sizes.
The cumulative distribution plot of intron sizes shows that the introns on the dot chromosomes are generally larger than the introns in the euchromatic reference regions, but smaller than the introns in the heterochromatic reference region (Figure 4C, summary statistics in Table S8). The one-sided Wilcoxon rank sum tests show these differences to be significant (see boxplot in Figure S4C, raw P-values in Table S9). There is no significant difference in intron sizes between the D. melanogaster and the D. virilis dot chromosomes (raw P-value = 0.90). Hence differences in intron sizes contribute to the differences in gene sizes observed between euchromatic and dot chromosome domains.
To ascertain whether the difference in intron sizes could be explained by the higher repeat density in the dot chromosomes and the heterochromatic reference regions compared to the euchromatic reference regions, we analyzed the size of the introns after repeats are removed. The cumulative distribution plot (Figure 4D) suggests that the differences in intron sizes are generally less pronounced (increase in P-value) but remain statistically significant (summary statistics in Table S10, boxplot in Figure S4D, Wilcoxon rank sum tests of significance in Table S11). The exception is the comparison between the introns for the D. melanogaster dot chromosome and the euchromatic reference regions from both D. melanogaster and D. virilis, where the raw P-values rose above the threshold considered to be statistically significant (from 2.18E-03 to 6.90E-02 and 3.32E-05 to 3.61E-03, respectively). Therefore, the higher repeat density within introns is one of the factors that contribute to the larger intron sizes observed on the dot chromosomes and in the heterochromatic reference region compared to the euchromatic reference regions, but is not the sole factor that leads to this result.
Dot chromosome genes exhibit low codon bias:
The effective Nc is a simple metric for measuring the usage of synonymous codons (Wright 1990); its value ranges from 61 (all synonymous codons are used equally) to 20 (1 codon used exclusively for each amino acid). Genes with high codon bias have a low Nc while genes with low codon bias have a high Nc.
Side-by-side boxplots of Nc show that genes on the dot chromosome have a significantly lower codon bias than genes in the euchromatic reference regions (Figure 5, summary statistics in Table S12). Two-sided KS tests show that this difference is statistically significant (see raw P-values in Table S13). Genes on the D. virilis and the D. melanogaster dot chromosomes also have a statistically significant different codon bias (raw P-value = 1.53E-06). Side-by-side boxplots of Nc show that genes in the D. melanogaster heterochromatic domain appear to have an intermediate distribution of codon usage between that seen for the dot chromosome genes and that seen for euchromatic genes, with the former but not the latter difference being statistically significant (raw P-values = 3.42E-05, 2.01E-04 compared to raw P-values = 3.61E-02, 5.37E-02). Our observations of low codon bias on the dot chromosomes are consistent with previous reports on codon bias in Drosophila (Singh et al. 2005; Drosophila 12 Genomes Consortium et al. 2007) and may reflect the low level of recombination in this domain (see discussion).
Differences in gene order and orientation between the dot chromosomes of D. melanogaster and D. virilis indicate rearrangements within the chromosomes:
Because we have assembled the D. virilis dot chromosome from a set of overlapping fosmids, each finished to high quality and verified by restriction digests, we can approach an analysis of synteny (gene order and orientation) with confidence. Specifically, we can estimate the minimum number of inversions (i.e., the reversal distance) required to transform the D. melanogaster dot chromosome into the D. virilis dot chromosome using the program GRIMM (Tesler 2002) and can identify genes that are located on the dot chromosome in one species and on another chromosome in the other. Using the set of 74 genes that can be found on both the D. virilis and D. melanogaster dot chromosomes, GRIMM predicts that a minimum of 33 inversions are required to transform the gene order and relative orientation on the D. melanogaster dot chromosome into that observed on the D. virilis dot chromosome (Figure 6).
To determine whether this number of inversions is unusual, we utilized the GLEAN-R ortholog assignments from FlyBase for D. virilis and calculated the ratio of the reversal distance to the number of genes in a euchromatic domain. We first ascertained whether the GLEAN-R ortholog assignments are adequate for this type of analysis by comparing the dot chromosome GRIMM results obtained with the manually curated dataset (GEP assembly) to results obtained using the GLEAN-R ortholog assignments (CAF1 assembly). Of the 69 genes on the D. melanogaster dot chromosome where GLEAN-R ortholog assignments are available in D. virilis, 66 remain on the D. virilis dot chromosome. GRIMM estimated that a minimum of 27 inversions are needed to transform the gene order and orientation for this subset of genes from that of D. melanogaster to that of D. virilis. This reversal distance to gene ratio (27/66 = 0.41) is similar to the ratio obtained using the complete set of genes (33/74 = 0.45), establishing that the GLEAN-R ortholog assignments are adequate for this type of analysis.
For the 1906 genes found (using the GLEAN-R ortholog assignment) on both the D. melanogaster and D. virilis Muller D elements (∼25-Mb region), GRIMM estimated that a minimum of 385 inversions is required to transform the gene order and orientation from that of D. melanogaster to that observed in D. virilis. Hence the reversal distance to gene ratio (385/1906 = 0.20) on the Muller D element is much lower than the ratio observed on the Muller F element (dot chromosome).
To confirm this striking difference between the dot chromosome and other elements, we determined the size of the syntenic blocks by visual inspection. The majority of the syntenic blocks on the dot chromosome are small, with an average block size of 1.9. We found 14 syntenic blocks with block sizes of at least 2; the largest syntenic block is 9. In contrast, the syntenic blocks on the Muller D element are much larger, with the largest syntenic block being 42 and an average block size of 8.8. The smaller syntenic blocks on the dot chromosomes are consistent with its higher rate of gene rearrangements.
Genes that have moved between the dot and euchromatic domains show a shift in gene characteristics that reflects the local chromatin environment. Our manual annotations identified three potentially novel genes on the D. virilis dot chromosome: A putative paralog of D. melanogaster CG16719 (CG16719-alpha), a putative paralog of D. melanogaster eIF-5A (eIF-5A-beta), and a novel gene (GEP001). See File S1 for additional details on the annotation of these proposed novel genes.
We also identified four putative orthologs on the D. virilis dot chromosome that are located elsewhere in the D. melanogaster genome: One gene on the D. melanogaster Muller D (chr 3L) element (CG5262), two on Muller B (chr 2L) element (CG5367 and rho-5), and one on Muller C (chr 2R) element (CG4038). Conversely, nine genes annotated on the D. melanogaster dot chromosome cannot be found on the D. virilis dot chromosome. Of these, four (CG11076, CG11077, CG1732, and CG9935) can be mapped to other D. virilis Muller elements in the CAF1 assembly. Of the remaining five, JY-alpha is an incomplete gene on the D. melanogaster dot chromosome and cannot be definitively mapped onto the D. virilis CAF1 assembly. Three other proposed genes (CG11231, CG11260, and CG32021) are likely to be remnants of repetitive elements that have been incorrectly annotated as genes. The remaining gene, CG33797, cannot be mapped to the D. virilis CAF1 assembly by sequence similarity. Hence CG33797 could be a D. melanogaster specific gene; it could be present in other species but in regions that are not part of the CAF1 assembly (e.g., in gaps or heterochromatic regions), or it could be an error in the D. melanogaster annotation. See File S1 for details on the annotation of these missing genes.
Thus among the above cases there are eight genes from D. melanogaster and D. virilis that can be unambiguously determined to reside on the dot chromosome in one species and on a non-dot chromosome in the other species (Figure 7). We can ask whether these wanderer genes (CG11076, CG11077, CG1732, CG4038, CG5262, CG5367, rho-5, and CG9935) show altered properties in the two species as a consequence of residing in the dot chromosome or in a euchromatic domain (Table 1). Consistent with the overall characteristics of the dot chromosome genes reported above, when these genes are found on the dot chromosomes of either species they have a larger average gene size (3099 bp vs. 2375 bp), a larger average intron size (1476 bp vs. 756 bp), a lower codon bias (Nc = 56.1 vs. 51.7), and a higher surrounding repeat density (19.4 vs. 6.0%). Thus these wanderer genes have evolved to maintain function in their new local environment while acquiring the characteristics of genes in that environment.
Comparison of two different strains of D. virilis shows transposon movement and small indels:
Previous studies have demonstrated that the lack of crossing over in the D. melanogaster dot chromosome results in a less effective adaptive response to selection (Marais and Charlesworth 2003; Haddrill et al. 2007). To see whether the same evolutionary pattern exists in the D. virilis dot chromosome, we compared the genomic sequences of the GEP and CAF1 strains. Because the ancestral sequence cannot be determined without a third outgroup, sequences found only in either the GEP (Tucson strain 15010–1051.88) or the CAF1 (Tucson strain 15010–1051.87) strains are labeled as indels (insertions or deletions).
Sequence comparison of the GEP and CAF1 strains of D. virilis shows that >70% of the indels are small (<10 bp) (Figure 8). All of the large indels (>1000 bp) have sequence similarity with LTR retroelements and DNA transposons in the Superlibrary (Table 2). Because the long terminal repeats of an LTR retroelement are identical at the time of integration, the age of an LTR insertion event can be estimated using the percent identity of the terminal repeats (SanMiguel et al. 1998; Lamb et al. 2007). The long terminal repeats can be identified in six of these indels (four in the GEP strain and two in the CAF1 strain): Five have perfect sequence identity and one has 99.7% (2113/2120 bp) sequence identity (Table 2, Figure S5). While we cannot generate an accurate estimate of the age of these large LTR indel events (because of the small number of mismatches), these large indels likely reflect recent evolutionary events while others (e.g., where no terminal repeats can be identified) may be older.
The six indels with identifiable long terminal repeats have sequence similarity to five different consensus sequences in the Superlibrary (Ulysses_I, dvir.2.37.centroid, dvir.2.53.centroid, dvir.5.67.centroid, and dvir.35.83.centroid). Four of the five consensus sequences belong to the Gypsy family of LTR retrotransposons (see File S1 for the annotation of these consensus sequences). Hence our analysis suggests a recent invasion of Gypsy LTR retrotransposons into the two D. virilis genomes and that insertion and excision of transposons is an integral part of the evolution of the dot chromosomes in D. virilis.
Indel to mismatch ratios in the two D. virilis strains:
The global alignments also reveal many base mismatches between the two strains of D. virilis (Table 3). A total of 88.6% of the mismatched bases (3141/3547) in the CAF1 dot chromosome have phred scores ≥30 (i.e., with an estimated error rate of <1 in 1000 bases) (Figure S6). Therefore the majority of the mismatches are genuine differences between the two strains. Previous work used a non-LTR retrotransposon (Helena) to estimate the neutral mutation rates in D. virilis; an estimate of 0.16 deletions per substitution with a 95% confidence interval of 0.09–0.26 deletions per substitution was obtained (Petrov et al. 1996). An extension of this analysis using five different non-LTR retrotransposons that are “dead-on-arrival” generated a neutral mutation rate of 0.174 deletions per substitution (95% confidence interval: 0.133–0.244) (Blumenstiel et al. 2002). The two strains of D. virilis used here show an overall indel-to-mismatch ratio of 0.247 on the dot chromosome compared to 0.200 on the random set of fosmids (previously finished and annotated in Slawson et al. 2006) from the other chromosome arms (Table 3). We also found that indels and mismatches are nonuniformly distributed across the D. virilis dot chromosome, with higher numbers of indels and mismatches near the centromere (Figure S7).
The ratio of indels to substitutions will change depending on the region's functional constraints (Chen et al. 2009): Coding regions have lower indel-to-substitution ratios because purifying selection removes indels that lead to frameshifts (Taylor et al. 2004). To determine whether the same constraints exist in the D. virilis dot chromosome, we first identified the analogous coding regions in the CAF1 strain by mapping our manually curated coding exons from the GEP strain onto the CAF1 dot chromosome. A total of 593 out of 594 exons showed full-length alignment (see File S1 for analysis of exons that show partial or poor alignments) with an average percentage of sequence identity of 99.9% and a standard deviation of 0.3%. After filtering mismatches that were caused by errors in the alignment or in the consensus sequence, we found a higher indel-to-mismatch ratio on the coding exons of the dot chromosome (0.078) compared to the random set of fosmids (0.069) (Table 3). To estimate the neutral mutation rate in both regions, we also analyzed the indel-to-mismatch ratios within introns. The dot chromosome again shows a higher indel-to-mismatch ratio (0.216) compared to random set of fosmids (0.126) (Table 3).
Collectively, our analysis shows that the dot chromosome has an elevated indel-to-mismatch ratio compared to the random set of fosmids from the other D. virilis chromosomes. The higher indel-to-mismatch ratio may reflect less effective selection on the dot chromosome compared to other regions of the D. virilis genome.
Ka/Ks analysis shows weak purifying selection for most of the genes on the dot chromosome:
We calculated the ratio of the number of substitutions per nonsynonymous site to the number of substitutions per synonymous site (Ka/Ks ratio) as a metric for the degree of functional constraint (i.e., purifying or directional selection) (Hurst 2002). Among the 76 genes on the GEP strain that can be mapped unambiguously onto the CAF1 strain, 22 genes had no base mismatches within the coding region, 48 genes had fewer than 10 mismatches, and the remaining 6 genes had more than 10 mismatches. Among the 54 genes with at least 1 mismatch, 20 genes contained only synonymous substitutions and 12 genes contained only nonsynonymous changes. For the 22 genes that contained both synonymous and nonsynonymous changes, the median Ka/Ks ratio is 0.302 and the mean is 0.398. The Ka/Ks ratio ranged from a maximum of 1.094 for CG32016 (which suggests that the gene is under no selective constraint) to the minimum of 0.047 for CG11093 (which suggests that the gene is under purifying selection). Therefore, despite the unique environment of the dot chromosome, most of the genes on the D. virilis dot chromosome are undergoing purifying selection (see Table S14 for the Ka/Ks ratio of each gene).
To determine whether the Ka/Ks ratio on the dot chromosome is unusual, we also analyzed the genes on the set of random fosmids that we have previously annotated (Slawson et al. 2006). Of the 20 partial and complete genes found on these fosmids, 17 can be mapped unambiguously onto the CAF1 assembly. For 9 of these genes that contain both synonymous and nonsynonymous changes, the median Ka/Ks ratio is 0.137 and the mean is 0.250 (Table S15). The higher Ka/Ks ratio on the dot chromosome compared to this random set of fosmids suggests that selection is less effective on the dot chromosome than in the euchromatic regions of the genome.
In this study, we have generated a high-quality D. virilis dot chromosome sequence and manually curated gene models to examine the characteristics and evolution of the dot chromosome in different Drosophila species (D. melanogaster and D. virilis) and in different D. virilis strains (GEP and CAF1). Our analysis consistently shows that the D. melanogaster and D. virilis dot chromosomes are more similar to each other than to the reference heterochromatic and euchromatic regions of both species.
Dot chromosomes have distinct distribution of repetitive elements:
The total repeat densities of the D. melanogaster and D. virilis dot chromosomes are similar, intermediate between that of euchromatin and heterochromatin (Figure 1). The main difference is in the types of repeats present, with the D. melanogaster dot enriched in DNA transposons and retroelements (Figure 2, Table S2). Previous studies using position effect variegation (PEV) in D. melanogaster as a readout of chromatin packaging have characterized the dot chromosome as largely heterochromatic and have also suggested that both proximity to certain transposable elements (e.g., the DNA transposon 1360) and overall repeat density may both play a role in heterochromatin formation and maintenance (Sun et al. 2004; Haynes et al. 2006; Riddle et al. 2008). If, as has been suggested, transposable elements are a better target for heterochromatin formation in Drosophila (Huisinga et al. 2006), differences in the distribution of classes of repeats may alter effective heterochromatin formation on the two dot chromosomes under some circumstances. This difference in repeat type, one of the few observed, might explain the difference reported earlier in polytene chromosome immunofluorescent staining, where the D. virilis dot chromosome fails to show the prominent association with HP1a and H3K9me2/3 seen in D. melanogaster (James et al. 1989; Slawson et al. 2006). However, genes on the D. virilis and D. melanogaster dot chromosome have similar characteristics (e.g., large size, low codon bias), which argues that these genes have evolved in a similar heterochromatin-like domain in both species and must be similarly packaged in germ line cells.
Our analysis also shows a higher abundance of tandem repeats in the D. virilis dot chromosome, as well as euchromatin and heterochromatin (Figure 3). The majority of the tandem repeats identified here by TRF overlap with simple and low complexity repeats identified by RepeatMasker, in agreement with previous findings of Schlotterer and Harr (2000). The expansion of these types of low complexity sequences both on the dot chromosome and in the euchromatic and heterochromatic reference regions may have contributed to the larger euchromatic genome size in D. virilis compared to D. melanogaster (150 Mb vs. 110 Mb; Moriyama et al. 1998), albeit recognizable tandem repeats only account for a small percentage of the two genomes.
Gene characteristics reflect the low levels of recombination:
Another well-established property of heterochromatic domains is a lack of recombination (Grewal and Elgin 2002). Previous reports have shown low levels of recombination on both the D. melanogaster and D. virilis dot chromosomes (Chino and Kikkawa 1933; Bridges 1935; Wang et al. 2002; Arguello et al. 2010). Work by others has found that both very short and very long introns are associated with regions of low recombination (Carvalho and Clark 1999; Comeron and Kreitman 2000). An earlier study by Haddrill et al. (2007) also found that a lack of recombination could be correlated with an increase in gene length. Therefore, if both the D. melanogaster and D. virilis dot chromosomes have low levels of recombination, they should have similar distributions of intron sizes, as we have observed. The higher density of repetitive elements on the dot chromosomes contributes to the larger gene and intron sizes on the dot chromosomes compared to the euchromatic reference regions (Figure 4, A and C). However, recognizable repetitive elements within introns are not the sole factor leading to larger gene and intron sizes, because the differences in intron sizes between the dot chromosomes and the euchromatic reference regions remain statistically significant even after recognizable repeats are removed (Figure 4D).
Codon usage bias has previously been shown to be negatively correlated with protein length and positively correlated with levels of recombination (Powell and Moriyama 1997; Haddrill et al. 2007). The positive correlation between codon bias and recombination rate can be attributed to the Hill–Robertson effect (i.e., regions with a low rate of recombination show less effective response to selection) (Hill and Robertson 1966). Selection may be at work in the positive correlation of codon bias with gene expression levels, observed generally in D. melanogaster (Duret and Mouchiroud 1999). Differences in expression levels are unlikely to be a major contributor to differences in codon bias between the dot chromosomes and the euchromatic reference regions, since Betancourt and colleagues (using expression data generated by Zhang et al. 2007) have previously shown that the difference in gene expression levels for the dot and non-dot loci are not statistically significant in D. virilis (Betancourt et al. 2009).
Our codon bias results (Figure 5) are consistent with the low rate of recombination reported for the heterochromatin-like D. melanogaster dot chromosome and suggest a similar evolution of the D. virilis dot chromosome. Our findings further suggest that the rate of recombination may be a more important determinant of codon usage bias on the dot chromosomes than protein length or level of expression. The codon bias in the D. melanogaster heterochromatic reference region is higher (lower Nc value) than the codon bias in the D. virilis dot chromosome, even though the pericentric heterochromatin has a significantly higher repeat density and is thought to have a similar low level of recombination.
Gene order and orientation indicate a high rate of inversions in a domain with low recombination:
While the recombination rate is significantly lower, we observe an approximately twofold higher rate of gene rearrangements on the dot chromosome compared to a euchromatic domain (Figure 6); this higher rate may reflect the higher repeat density, assuming that these elements promote inversions (Casals and Navarro 2007). Bhutkar et al. (2008) have previously observed a higher rate of gene rearrangements on the Muller A element (X chromosome) compared to Muller elements B–E and suggested that the rate of gene rearrangement may play a role in the evolution of the X chromosome. The higher rate of inversions on the dot chromosome compared to the Muller D element suggests that, similar to the Muller A element, gene rearrangements may play an important role in the evolution of the dot chromosome.
Wanderer genes exhibit the properties common in the domain they inhabit:
Despite the large number of gene rearrangements, ∼90% of the genes (74/83) can be found on both the D. melanogaster and D. virilis dot chromosomes. Our results are consistent with the previous findings by Bhutkar et al. (2008) who estimated that 95% of the genes in Drosophila are localized to the same Muller element across different Drosophila species. We identified eight wanderer genes that are present in a euchromatic domain in one species and heterochromatic (dot chromosome) in the other (Figure 7). These genes exhibit the characteristics of other genes in the same environment (Table 1), which suggests that characteristics such as gene size, codon bias, and repeat density are properties of the domain, and are not required for either set of genes to function per se. Our results are consistent with a previous study of the lt gene cluster in different Drosophila species, which shows that genes that transition from a euchromatic domain to a heterochromatic domain will reflect the properties of their local environment (i.e., increase in gene size due to accumulation of transposable elements in the heterochromatic domain) (Yasuhara et al. 2005). The movement of genes from one chromosome to another is widely observed in Drosophila (Drosophila 12 Genomes Consortium et al. 2007), but the mechanism remains obscure; these events do not appear to be the consequence of recombination or of retroviral action through a cDNA.
Strain differences indicate that indels contribute significantly to change:
Comparison of the GEP and CAF1 strains of D. virilis shows a large number of differences (e.g., base mismatches, insertions, and deletions). These differences include a few large indels of transposable elements (primarily LTR and DNA transposons; Table 2), although the majority of the indels are short (Figure 8). We found that most of the large indels with conserved long terminal repeats can be classified as members of the gypsy family, which suggests a recent invasion of gypsy elements into the genomes of D. virilis. Our results are consistent with previous reports that show gypsy retroelements to be actively transcribed in D. virilis and are also consistent with reports that show variation in the distribution of gypsy elements in different strains of D. virilis (Mizrokhi and Mazo 1991; Mejlumian et al. 2002).
Previous studies have suggested that indels play an important role in the evolution of eukaryotic genomes, and have postulated that indels account for the majority of the sequence differences in closely related DNA samples (Britten et al. 2003). Our analysis shows that the total number of mismatches exceeds the number of indel events in this case (Table 3). However, as each indel on average introduces a difference of two or more bases (including large transposon insertions and deletions), these types of events contribute more to the difference between the dot chromosomes (a total of 81,715 indel bases for the two strains combined), as postulated (Britten et al. 2003).
Strain differences point to weak purifying selection in the dot chromosome domain:
Previous studies on polymorphisms within the coding regions of D. melanogaster have shown much lower levels of both nonsynonymous and synonymous changes on the dot chromosome compared to the genome average (Berry et al. 1991; Sheldahl et al. 2003). Analysis of the Ka/Ks ratio observed here suggests that most of the genes on the D. virilis dot chromosome are undergoing weak purifying selection compared to the genes on a random set of fosmids from other, euchromatic regions. Analysis of D. americana, a species closely related to D. virilis, also suggested weak purifying positive selection on the dot chromosome, presumably a consequence of its lower level of recombination (Betancourt et al. 2009).
The dot chromosome is unusual compared to other gene-rich (euchromatic) regions of the Drosophila genome because of the high density of repetitive elements. However, in this regard the dot chromosome actually resembles a typical euchromatic region of a mammalian genome, where one observes repeat densities of ∼30–40%, with remnants of transposable elements interspersed within and between genes that are actively transcribed. How is gene activity maintained in the midst of repetitious elements, elements that are thought to serve as targets for heterochromatin formation and gene silencing? Future investigation should examine the transcription start sites of the dot chromosome genes through a comprehensive study of the distribution of histone modifications and chromosomal proteins surrounding these regions. In conjunction with the publicly available data released by the modENCODE Project for D. melanogaster [N. C. Riddle, A. Minoda, P. V. Kharchenko, A. A. Alekseyenko, Y. B. Schwartz, M. Y. Tolstorukov, A. A. Gorchakov, C. Kennedy, D. Linder-Basso, J. D. Jaffe, G. Shanower, M. I. Kuroda, V. Pirrotta, P. J. Park, S. C. R. Elgin, G. H. Karpen, and the modENCODE Consortium (http://www.modencode.org), unpublished results]. a comparative study with mapping data from multiple Drosophila species may reveal common sequence motifs that regulate gene expression and chromatin packaging in this genomic environment. The unique properties of the dot chromosome provide an opportunity to examine the impact of chromatin packaging on the evolution of genomes and the control of gene expression, making it worthy of further study.
We thank Casey Bergman, Andrew Clark, Kenneth Olsen, and two anonymous referees for valuable comments and suggestions on this manuscript. Members of the Elgin lab contributed throughout the process with criticisms and suggestions. We thank the Washington University Genome Center for generating raw sequences and providing training and support for many of the coauthors. This work was supported by grant no. 52005780 from the Howard Hughes Medical Institute (HHMI) to Washington University (to S.C.R.E.) with additional funding for data analysis from National Institutes of Health (NIH) grant R01 GM068388 (to S.C.R.E.). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of HHMI, the National Institute of General Medical Sciences, or the NIH.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.116129/DC1.
The improved D. virilis dot chromosome sequence data and gene annotations have been deposited with the NCBI Entrez Genome Project Database under project ID 41283.
Communicating editor: N. Perrimon
- Received February 27, 2010.
- Accepted May 9, 2010.
- Copyright © 2010 by the Genetics Society of America