An algorithm for detecting local clusters of homologous genes was applied to the genome of Caenorhabditis elegans. Clusters of two or more homologous genes are abundant, totaling 1391 clusters containing 4607 genes, over one-fifth of all genes in C. elegans. Cluster genes are distributed unevenly in the genome, with the large majority located on autosomal chromosome arms, regions characterized by higher genetic recombination and more repeat sequences than autosomal centers and the X chromosome. Cluster genes are transcribed at much lower levels than average and very few have gross phenotypes as assayed by RNAi-mediated reduction of function. The molecular identity of cluster genes is unusual, with a preponderance of nematode-specific gene families that encode putative secreted and transmembrane proteins, and enrichment for genes implicated in xenobiotic detoxification and innate immunity. Gene clustering in Drosophila melanogaster is also substantial and the molecular identity of clustered genes follows a similar pattern. I hypothesize that autosomal chromosome arms in C. elegans undergo frequent local gene duplication and that these duplications support gene diversification and rapid evolution in response to environmental challenges. Although specific gene clusters have been documented in C. elegans, their abundance, genomic distribution, and unusual molecular identities were previously unrecognized.
IT is widely appreciated that genes of related function tend to reside in clusters in eubacteria and archaea, often arranged in coregulated operons. Functional clusters in eukaryotes are less common, although there are various indications that such clusters may be more common than was first apparent from genetic studies. For example, in Caenorhabditis elegans ∼15% of genes are members of cotranscribed operons (Blumenthal et al. 2002). The two to eight genes in each of the ∼1050 operons are subject to similar transcriptional regulation (Lercher et al. 2003), and genes within an operon are often involved in related biological processes (Blumenthal et al. 2002). Such operon clusters are not generally composed of homologous genes, but instead seem to group distinct gene sequences for transcriptional coregulation (Blumenthal et al. 2002). In addition to such functional clustering, specific examples of clusters of homologous genes have also been reported in C. elegans (e.g., Gotoh 1998; Robertson 1998, 2000; Sluder et al. 1999) and in a wide variety of other metazoans (e.g., Fritsch et al. 1980; Akam 1989; Hofker et al. 1989; Del Punta et al. 2000; Glusman et al. 2001). A cursory global analysis of homologous gene clusters was reported in the C. elegans genome sequence report (C. elegans Sequencing Consortium 1998). To investigate homologous gene clusters systematically, I developed an algorithm for scanning the genome for locally abundant gene families. This method identified 1391 cases of local clusters of two or more homologous genes, 216 of which had five or more members. The larger families tend to share a variety of interesting properties, including striking clustering on autosomal arms, an abundance of nematode-specific gene families, and probable involvement in environmental and pathogen interactions.
MATERIALS AND METHODS
Frozen WormBase data set WS120 (http://ws120.wormbase.org/) was used for all analysis except as noted. Downloaded WormBase GFF files were parsed to produce a set of usable Java objects for computer analysis, including all exon positions, coding regions, and other features of interest for every gene, including all alternative splice forms. A set of 22,234 gene products were translated on the basis of this information and the matching chromosome sequences. For genes with multiple splice forms, all but the longest splice form were discarded from this set, which resulted in a set of 19,874 proteins. Keeping a single splice form eliminated blastp matches among splice forms and simplified subsequent analysis. The longest splice form was kept with the idea that this form was most likely to encode a complete functional gene product, but this choice might affect clustering accuracy in some cases. An all-by-all blastp search was conducted for all proteins from each chromosome, using the “–m 8” tabular output option (NCBI Blast 2004 at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.9/). For cluster analysis, a window of 20 genes was moved along the chromosome and cases were collected in which at least one pair of genes in the window were blastp matches with an E-value ≤ 0.001 and with a blast alignment that extended over at least 80% of the query protein. For each such pair, the blast bit score was divided by the query length to generate a bit-score-per-residue value. All such pair values for the window were summed and divided by the number of genes in the window. A histogram bar was plotted at the position of the window centroid (the mean position of the coding start for each gene in the window). This plot is shown in Figure 1, after scaling and annotation of some of the major clusters. Note that this approach produces a histogram bar that reflects the total local gene clustering, regardless of whether this results from one or more gene families. It also weights high-scoring blast matches more heavily than low-scoring ones. The window was moved one gene at a time, so a specific local cluster will contribute values to several histogram bars in its region. Search window sizes ranging from 10 to 50 and various blast match criteria were tested with broadly similar results to those reported here.
Because the clustering algorithm depends on comparing predicted gene products, there will inevitably be some inaccuracy in the clusters assigned. The most likely inaccuracy is underinclusion since gene prediction errors may cause failure to pass the clustering criteria. Because the clustering algorithm requires 80% blast alignment only of query (not of target), it should be relatively insensitive to modest inaccuracies. Such underinclusion is likely to be more severe for genes with less experimental transcript validation, which largely derives from EST data. Cluster genes and autosomal arm genes are transcribed at lower levels than other genes; I anticipate that more accurate gene predictions will modestly increase the number and size of gene clusters and that their genomic location will remain arm biased (and possibly will be more arm biased).
Cluster number and size statistics:
To test whether the distribution of clusters was nonrandom, I used a position-randomizing approach. For each chromosome to test, gene order was randomized and an identical clustering algorithm was run with the new gene order. This was repeated multiple times to acquire a statistical sampling. The number of clusters in randomized tests fit a normal distribution and a one-sample t-test was performed to determine whether the real cluster number deviated from this distribution. Significance of the size of clusters was determined by a nonparametric test because the distribution of real cluster sizes deviated sharply from normal. A list of real cluster sizes was compared to a concatenated list of cluster sizes from multiple randomized tests and the lists were compared by the Mann-Whitney U-test. For both cluster number and cluster size, the P-value was two-tailed and was determined using InStat 3.05 (GraphPad Software).
The sliding-window approach arbitrarily limited clusters to the local 20 gene window, which is useful for plotting genomic distributions. For subsequent analysis, these local clusters were merged by joining clusters whenever they shared at least one gene. The result is a set of merged clusters, each of which represents a regional sequence family. More than one cluster of a particular sequence family will be assigned on the same chromosome only when their nearest genomic neighbors are separated by at least 20 unrelated genes. In principle, this might result in undesirable merging of extended groups of genes that are scattered sparsely across long regions. In practice, no such cases were found and clusters defined in this manner were remarkably tight, in the sense that most genes in each cluster region belonged to the cluster family, interspersed with a modest number of unrelated genes. Figure 6 shows an example, albeit an unusually dense one.
Unclustered gene families:
The same set of longest-splice gene predictions used to define gene clusters was classified into merged gene families using exactly the same blast match and merging criteria as for clustering, except that genome position was ignored. The resulting gene families were ranked by size and were manually inspected for molecular identity using the University of California at Santz Cruz Family Browser (UCSC Gene Sorter, May 2003 C. elegans data set; http://genome.ucsc.edu/) and WormBase Release WS120 (http://ws120.wormbase.org/).
An HTML table, which lists all gene clusters and documents the identity and size of each cluster with five or more genes (supplementary data 4 at http://www.genetics.org/supplemental/), was created for each chromosome. These tables include entries for each gene in the cluster and links to stable UCSC Family Browser pages for one member of each cluster. The linked UCSC page is the family browser output keyed to the protein encoded by the link gene and sorted by protein similarity. To provide a stable available data set, all of the UCSC family tables were saved and the links are to these saved files. In addition, two tables that merged clusters from all chromosomes were made. One was sorted by genome position and has selected annotations (supplemental data 3A at http://www.genetics.org/supplemental/); the second was sorted by cluster size and includes more extensive annotations, including annotations for every cluster of five or more genes (supplemental data 3B at http://www.genetics.org/supplemental/). Members of a gene family were defined by gene products on the UCSC Family Browser (UCSC Gene Sorter, May 2003 C. elegans data set; http://genome.ucsc.edu/) that had blastp E-values <10−6 and at least 20.0% blastp identity with reference members of the family. In a few cases, additional information that was more recent than the May 2003 data release was incorporated, notably for the seven-pass receptor (SR) families and the insulin-like gene family. All of these files are available in supplemental data 2 and 3 at http://www.genetics.org/supplemental/.
Conservation in other phyla:
All-by-all blastp searches were conducted with the most current predicted protein sets from Drosophila melanogaster, Saccharomyces cerevisiae, and Homo sapiens, using the WS120 version of C. elegans WormPep as query. Quality scores as a function of genome position were computed by averaging the blastp score for a sliding window of 20 genes. The score was computed by dividing the blastp bit score by the length of the query protein. The general feature of higher conservation on autosomal arms was first reported in C. elegans Sequencing Consortium (1998). My results showed a smaller difference between autosomal arms and centers, probably because the protein query set used previously was from an earlier genome curation, which may have tended to exclude genes on chromosomal arms.
Most annotations derive from reports on the Pfam and InterPro websites, which are based on conserved domains noted on the UCSC family browser [Pfam release 16 (http://www.sanger.ac.uk/Software/Pfam/); InterPro release 8.1 (http://www.ebi.ac.uk/interpro/); UCSC Gene Sorter, May 2003 C. elegans data set (http://genome.ucsc.edu/)]. In a few cases, blastp or Ψ-blast searches were conducted (NCBI_Blast 2004 at http://www.ncbi.nlm.nih.gov/BLAST/) and manual inspection of hits was used to further confirm or reinterpret these annotations. The UCSC data set was from May 2003 (the most recent available at the time) and updated annotations of the SR and insulin gene superfamilies were abstracted from WormBase WS120 (March 2004) because I was aware that improved annotations had occurred in the interim.
Signal sequence and transmembrane domain analysis:
All 19,874 proteins analyzed for clustering were submitted to the SignalP 3.0 and TMHMM 2.0 servers (Nielsen et al. 1997; Remm and Sonnhammer 2000; Bendtsen et al. 2004). A protein was assigned as secreted according to the SignalP HMM method. A protein was assigned as transmembrane (TM) if the TMHMM short report predicted one or more TM domains and the protein did not have a predicted signal sequence (these are often mistermed “N-terminal transmembrane domains”). Lists of the one-line summary SignalP and TMHMM outputs for the entire protein set are available in supplemental materials (supplemental data 7 and 8 at http://www.genetics.org/supplemental/).
Gene counts and nematode specificity:
Gene counts in C. elegans and C. briggsae were assessed mostly on the basis of blastp searches on WormPep and BriggPep with WormBase data set WS123 (Release WS123; http://wormbase.org/). The gene numbers are presented as rough estimates based on a relatively arbitrary E-value cutoff of 10−4 and a consensus hit count from several different queries from the family. No attempt was made to determine how many members are unpredicted or how many of the predicted members are likely to be pseudogenes. Representation outside of nematodes was assessed from a combination of Pfam and InterPro annotations and a Ψ-blast search in June 2004 on the NCBI nr data set with a persistent threshold E-value of 10−6 (http://www.ncbi.nlm.nih.gov/BLAST/). This threshold appeared to serve well in preventing convergence on short domain matches that do not represent near-full-length homologs. For small proteins, these criteria were relaxed somewhat because of their lower information content. Prior to choosing Ψ-blast search query proteins, family members were recursively aligned and culled in an attempt to discard gene-prediction artifacts. Ψ-blast searches were initiated with proteins that appeared typical for the family as a whole, without any large insertions, deletions, or terminal extensions (these are common gene-finding artifacts in C. elegans).
Protein alignment, phylogenetics, and hydropathy plots:
Protein alignments for specifically investigated families were computed with ClustalX using BLOSUM matrices and otherwise default settings (Thompson et al. 1994, 1997). Phylogenetic trees were generated by Bonsai 1.1.4 (J. H. Thomas, March 2004 at http://calliope.gs.washington.edu/software/index.html) using the neighbor-joining distance method (Saitou and Nei 1987) and by PHYLIP proml using the maximum-likelihood method (Felsenstein 1993). Composite hydropathy plots were generated from ClustalX multiple alignments using Bonsai 1.1.4 and a window of nine amino acids. This method determines average hydropathy in aligned columns and is otherwise the same as Kyte-Doolittle hydropathy on single proteins (Kyte and Doolittle 1982).
Codon analysis for positive selection:
Codon analysis was performed only with members that appeared typical for the family as a whole, with no large insertions, deletions, or terminal extensions. Sets of 5–10 closely related proteins were selected and aligned using ClustalX with BLOSUM matrices and otherwise default settings. This protein alignment was used to construct a maximum-likelihood phylogenetic tree with proml and to make a corresponding codon alignment. These were provided to the codeml program in the PAML package (Yang 1997). Models 7 and 8, using at least three starting dN/dS values to avoid local optima during maximum-likelihood analysis were run (Yang 1997). Statistical significance was assessed using a chi-square test of twice the difference in likelihood scores for the two models, with 2 d.f. (Yang 1997).
Clusters in D. melanogaster:
Analysis of clusters in D. melanogaster was the same as for C. elegans except that a gene window of 30, blastp cutoffs of 0.0001, and an alignment length of ≥70% were used. Statistical tests were similar to those for C. elegans.
Detection of gene clusters:
A general method was developed for detecting physically linked clusters of genes that encode related protein sequences. The method uses a sliding gene window and all-by-all blastp results to locate regions of local protein-coding gene duplications. A variety of specific window sizes, blast match, and cluster scoring criteria was explored and the general pattern of clustering was robust to these changes. The results reported here are for a window of 20 genes with a blast E-value cutoff of 10−3 and at least 80% query alignment (see materials and methods). For the 19,874 genes analyzed, 4607 were located in 1391 local clusters of 2 or more genes. Of these, 1819 genes were in 216 clusters of 5 or more genes. For simplicity, I will refer to these as cluster-2 genes and cluster-5 genes. The set of proteins used for the analysis and sets of proteins found in clusters are found in supplemental data 1 and 2 at http://www.genetics.org/supplemental/). Summaries of the highest-scoring gene clusters are in Figure 1, Table 1, and supplemental data 3 at http://www.genetics.org/supplemental/). This overview has two striking features. First, the frequency and size of gene clusters varies greatly among the six C. elegans chromosomes. The most abundant and largest gene clusters occur on chromosomes II, IV, and V, with a notable paucity of clusters on the X chromosome. Second, gene clusters have a strong tendency to reside on autosomal arms. All but two autosomal arms are enriched in clustered genes, and some arms consist predominantly of clusters. Chromosomal arms in C. elegans are defined by a higher frequency of meiotic recombination (Barnes et al. 1995). Arms also have higher densities of simple and complex repeat DNA and more divergent gene products when compared to other phyla (Barnes et al. 1995; C. elegans Sequencing Consortium 1998). The SR superfamily of G-protein-coupled receptors is a major contributor to gene clusters, especially on chromosome V where most SR genes reside (Robertson 2000, 2001). However, similar genomic cluster patterns were seen even when all SR genes were removed prior to analysis (shown for chromosome V in Figure 2; data not shown). As shown in Table 2, the degree of gene clustering is statistically significant, since randomized gene orders subjected to the same clustering procedure reproducibly gave fewer and smaller clusters. The randomizing method was conservative since it randomized gene order only within a chromosome. Since a substantial feature of clustering is chromosome specificity, the degree of clustering from randomizing the entire genome would be substantially lower.
The distribution of the number of genes per cluster for the entire genome is summarized in Figure 3. The number of clusters drops rapidly with cluster size, but the distribution has a long tail with a small number of very large clusters. Most clusters are compactly arranged in the genome, with relatively few unrelated genes interspersed among the homologous genes that define the cluster. The cluster with the largest genome span is also the one with the most genes; it spans 286 kb on chromosome II and contains 50 homologous MATH-domain-containing genes interspersed irregularly with 44 other genes.
Some of the largest gene families that are represented in clusters have very biased genomic distributions; prominent examples are mentioned here and histograms of gene positions for the 12 largest gene clusters are available in supplemental data 5 at http://www.genetics.org/supplemental/. In the MATH family, 78 of 80 genes are on chromosome II and 50 of these are in one cluster near the middle of the left arm. In the DUF19 family, 33 of 37 genes are on chromosome V and most of these are in one large cluster. In the DUF227 family, 21 of 23 genes are on chromosome V and half of these are in two clusters. In the DUF750 family, 15 of 21 genes are on chromosome V and nearly all of these are in one large cluster. As previously noted, nearly all of the SR families are enriched on chromosome V, with smaller numbers of genes scattered on other autosomes. Regardless of chromosome, these SR families tend to be in large clusters on chromosomal arms. In the cytochrome P450 family, 43 of 76 genes are on chromosome V, mostly in two large clusters on the left arm. Finally, 154 of 253 nuclear hormone receptor genes are on chromosome V, again mostly in clusters on the arms.
Relationship of phylogenetic conservation to gene clusters:
The clustering method has the potential to find families with a wide range of conservation properties. Although the number of clusters makes a full description of their nature difficult, investigation of specific families made it clear that most or all meet a reasonable standard for constituting a gene family. A sampling of alignments among members of four families is shown in Figure 4. Apart from choosing families of sufficient size to provide abundant alignment material, these four families were arbitrarily chosen and appear typical. In all four families, a significant fraction of predicted proteins aligned dubiously with other family members, with large insertions, deletions, or extensions on one or both ends. Some of these are likely to be nonfunctional genes, but preliminary investigation suggests that many are due to errors in ab initio gene finding, since improved gene models were readily identified by manual curation (data not shown). The alignments in Figure 4 were made with proteins that appear typical for their family and that appear to have satisfactory gene models (no large deletions or insertions). In addition to good alignments within cluster families, extensive blast searches and annotation with the UCSC Gene Sorter (May 2003 C. elegans data set; http://genome.ucsc.edu/) showed that the sequences of cluster families are well separated from each other and from unclustered genes. A good test case was the SR families; for all previously identified families and one new family (see below), the clustering algorithm correctly grouped specific families in local clusters, even though members of different SR gene families are often close to each other and sometimes interspersed in the genome.
Many of the gene families identified by the clustering algorithm have been independently identified by various investigators and are the subject of published or current investigations into their patterns of duplication and divergence, including several of the SR families (Robertson 1998, 2000, 2001; Chen et al. 2005; Thomas et al. 2005), the nuclear receptor family (Maglich et al. 2001), the cytochrome P450 family (Gotoh 1998), the short-chain dehydrogenase family (Kallberg et al. 2002), the glutathione S-transferase family (Campbell et al. 2001), the C-type lectin family (Drickamer and Dodd 1999), and the NLP family of antifungal peptides (Couillault et al. 2004). In each case, the gene duplication patterns as inferred from protein relatedness suggest sporadic duplications, probably balanced by gene loss to produce a genetic complement in dynamic equilibrium. I conducted a limited analysis of three of the unstudied gene families identified by the clustering algorithm, the DUF19, CFAM8, and CFAM15 families. In each case, difficulties in gene prediction necessitated manual annotation of probable gene structures to analyze relationships among full-length family members. Unrooted protein trees for 29 DUF19 proteins, 22 CFAM8 proteins, and 10 CFAM15 proteins are shown in the lower part of Figure 4. For the CFAM8 family, 12 members from C. briggsae were annotated and included in the tree. In each case, the inferred patterns of duplication and divergence indicate that these families arose from sporadic duplication in patterns reminiscent of those documented for previously analyzed cluster families (see references above). All three families appear to include both ancient and recent duplications, producing protein divergences ranging from nearly identical to barely alignable. Further investigation of these families and the other new families is clearly required to obtain an adequate picture of their evolution.
As a preliminary assessment of the relationship of cluster size to divergence, I made a systematic analysis using blastp to compare proteins within each cluster size class. Clusters showed a weak correlation between cluster size and mean protein divergence within the cluster, with a slight trend toward greater divergence in large clusters. Specifically, genes had a mean length-normalized blastp score within their cluster as follows: cluster-2 (1.06), cluster-3 (0.96), cluster-4 (0.99), cluster-5 (0.95), and cluster-6 and greater (0.73). As expected, there was substantial variation among clusters, but on average small local clusters appear to result from duplications that are nearly as old as those in larger clusters. Complete lists of mean blastp values for each cluster are available in supplemental data 6 at http://www.genetics.org/supplemental/. Due to difficulties in gene prediction and the use of blast scores as a surrogate for proper distance measures, this analysis should be regarded as strictly provisional.
Finally, I analyzed the relationship of cluster proteins to proteins in other phyla. As previously noted (C. elegans Sequencing Consortium 1998), genes on autosomal arms in C. elegans tend to be less conserved in other phyla than genes in autosomal centers. This tendency is apparent in Figure 5, which graphs the best blastp match to D. melanogaster as a function of genome position for predicted proteins on three C. elegans chromosomes. Very similar patterns were observed for matches to S. cerevisiae and H. sapiens (data not shown). A substantial part of this trend appears to result from lower phylogenetic conservation of proteins in large gene clusters, which are concentrated on autosomal arms (Figure 5). The graphical correlation is striking and it stands up to quantitative scrutiny: cluster-5 proteins had a mean length-normalized best blastp score to D. melanogaster of 0.07, whereas cluster-2 proteins had a mean score of 0.146, and all other proteins had a mean score of 0.27 (see materials and methods). This trend is not due to cluster genes being part of gene families in C. elegans: when analyzed without regard to genome position, the half of C. elegans proteins with best self-blast hits are slightly more conserved in D. melanogaster than the lower half of proteins (not shown). This presumably results from the fact that unclustered gene families in C. elegans include many that are particularly well conserved phylogenetically (see discussion).
Operon clusters and homology clusters:
To test whether genes within operons have sequence similarity to each other, I analyzed all 1048 assigned operons in the WS120 data release (Release WS120; http://ws120.wormbase.org/). There were 2821 genes in these operons (average operon size 2.69 genes). Using the same match criteria as for homologous gene clusters, there were 2 or more homologous genes within 9% of operons (95 of 1048), involving a total of 248 genes. Although 91% of operons contain only nonhomologous genes, the number of exceptions is statistically significant: when pseudo-operons were constructed from random genes with a gene number distribution matching real operons, only 0.5% of them contained homology matches. I also tested whether genes in homology clusters tend to reside in operons. Of the 1819 cluster-5 genes, only 98 (5.3%) were in a known operon, significantly less than the 14.2% of all genes that are in known operons. Conversely, of the 248 homology matches within operons, only 74 (29.8%) were cluster-5 genes, which is slightly below expectation since 39.5% of cluster-2 genes are in the cluster-5 set. I conclude that there is a modest negative correlation between membership in an operon and membership in large homology clusters.
Molecular identity of gene cluster proteins:
All gene clusters were documented as described in materials and methods and these data are available as supplemental data 3 and 4 at http://www.genetics.org/supplemetal/. Briefly, gene clusters with five or more members were annotated using the UCSC family browser, WormBase, Pfam, various blast resources, signal sequence and transmembrane domain predictors, and other resources. The records include overall family size, potential functional identity, links to specific genes in each cluster, links to additional data, and other notes. Brief summaries of the 24 largest gene clusters are shown in Table 3, and additional summaries of all clusters-5 with functionally obscure gene products are found in Table 4. The molecular identities of gene products encoded by clustered gene families are unusual in a variety of ways. I summarize these features first and then discuss each in more detail. First, most of the families are nematode specific, suggesting that they evolve more rapidly than the typical gene. Second, the families are enriched for predicted secreted and transmembrane proteins. Finally, cluster genes are enriched for genes implicated in environmental interactions, specifically those involved in chemosensation, xenobiotic detoxification, and antimicrobial response.
The nematode specificity of cluster genes is dramatic. When all cluster-5 genes are classified by protein family, 51 of the 63 families are nematode specific (Tables 3 and 4 and supplemental data 3 at http://www.genetics.org/supplemetal/). Even when all known SR families are removed, this enrichment is clear (36 of 48 families). This property is not because cluster genes are parts of gene families: when gene families not represented in clusters were analyzed, none of the eight largest families were nematode specific (data not shown; also see discussion).
Table 5 documents the enrichment for secretion signals and transmembrane domains in cluster-2 and cluster-5 genes when compared to noncluster genes. The difference is not as dramatic as for nematode specificity, but this results in part from the presence of a few large cluster families with putative cytoplasmic or nuclear localization (the F-box domain, MATH domain, and nuclear hormone receptor proteins). When analyzed at the level of families, the trend is clearer: 34 of the 50 largest cluster families are predicted to be secreted or transmembrane.
Enrichment for genes implicated in environmental interactions cannot be documented as fully because there is no systematic way to classify genes in this way; analysis of gene ontology terms is inadequate because of inaccuracy and incompleteness. I took the alternative approach of identifying families known or inferred to be involved in specific processes and tested whether these specific families are present in clusters. Assignment of families to specific processes was done by a combination of blast searches, literature searches, and manual PFAM and WormBase browsing. These data are presented in Table 6. Putative environmental interaction genes in C. elegans can be divided into three groups. First, the SR superfamily of seven-pass receptors is implicated in chemosensation on the basis of function and tissue-specific expression in known chemosensory neurons (Troemel et al. 1995; Sengupta et al. 1996; Chen et al. 2005; C. Bargmann, personal communication). There is dramatic enrichment of SR superfamily members in gene clusters. A similar pattern is seen when each of the 15 families is analyzed individually (data not shown). Second, all 4 major gene families implicated in xenobiotic detoxification are highly enriched in clusters. Direct evidence that these genes function in detoxification in C. elegans is limited to the cytochrome P450 (Menzel et al. 2001) and glutathione S-transferase families (Tawe et al. 1998; Leiers et al. 2003); the other two are assigned on the basis of strong sequence similarity to families known to have such function in other organisms. Third, there is direct evidence for involvement in pathogen response for members of 9 specific gene families in C. elegans (Kato et al. 2002; Mallo et al. 2002; Couillault et al. 2004). Six of the 8 families with enough members to analyze are significantly enriched in cluster genes, 5 of them dramatically so. The ninth family (abf antimicrobial peptides) has only two members and they are in a cluster-2 (immediately adjacent genes). In addition to direct evidence for pathogen response, 3 of the pathogen-related cluster families have sequence similarity to proteins implicated in innate immune response in other organisms: two types of lectin proteins and lysozymes (Mallo et al. 2002).
Gene arrangement in homology clusters:
An example of the physical arrangement of homology cluster genes is shown in Figure 6. In Figure 6 (bottom), 16 cluster genes reside in a 40-kb region, with only 1 unrelated gene in their midst. Other clusters are not always this contiguous, but there is a strong tendency for them to reside in highly enriched blocks, and often most of the genes are homologous. The 16 cluster genes (Figure 6, bottom) reside in three blocks of genes and within each block all the genes are on the same strand. However, their evolution was not solely by repeated tandem duplication, since phylogenetic analysis suggests that several duplications and internal rearrangements gave rise to this cluster (data not shown). All of the genes shown have diverged sufficiently that it is difficult to discern which specific mechanism might underlie these rearrangements. Similarly, complex arrangements were found in a number of other clusters, with genes typically found on both strands.
MATH-domain and F-box domain families:
The two largest novel cluster families are the MATH-domain family and the F-box-FTH domain family, with ∼100 and 200 members, respectively. Both are predicted to encode cytoplasmic proteins and neither has yet been implicated in environmental interactions. However, both families appear to be subject to positive selection, a property often associated with changing selective pressure from the environment (Kamei et al. 2000; Choi and Lahn 2003; Thomas et al. 2005). Few of the MATH-domain and F-box-FTH domain genes have identified cDNAs and preliminary inspection of protein alignments and genomic sequences suggests that a substantial fraction of them are nonfunctional genes, perhaps as many as one-third. Although some are likely to be pseudogenes, there is no doubt that many of the genes are functional since there are large families of similar proteins in C. briggsae and dN/dS analysis shows that most of the protein sequence in both families is under strong purifying selection (data not shown). I carried out a preliminary evolutionary analysis of these two protein families based on the subset of predictions that align well with other members in the same family. Lists of proteins analyzed, schematics of protein structure, alignment, and dN/dS results are available in supplemental data 10 at http://www.genetics.org/supplemental/).
The MATH domain is ∼100 amino acids in length and is named for founding domain-containing members meprins and TRAF-C (Uren and Vaux 1996). The domain probably functions in protein-protein interactions (Sunnerhagen et al. 2002). C. elegans MATH-domain cluster-5 genes are predominantly of two sorts. In one type, nearly the entire protein is occupied by two or more repeats of the MATH domain. In the second type, there is a single N-terminal MATH domain followed by a BTB/POZ domain (Zollman et al. 1994). Like the MATH domain, the BTB domain is implicated in protein-protein interactions (Bardwell and Treisman 1994). Recent evidence indicates that some MATH-BTB proteins function as adapters to target other proteins to the ubiquitin-mediated proteolysis pathway (Furukawa et al. 2003; Pintard et al. 2003; Xu et al. 2003; Figueroa et al. 2005). An alignment of the first MATH domain from a sampling of the two-domain proteins is shown at the top of Figure 4. A minority of MATH-domain gene predictions consist largely of a single MATH domain; this fact, coupled with the criteria for clustering (see materials and methods), presumably explains why most MATH-domain genes were identified as members of the same merged clusters. Because of the paucity of confirmed gene structures, it is unclear whether there is real variability in the number of tandem MATH domains or whether the variability is an artifact of mispredicted genes or pseudogenes. Although neither the MATH nor the BTB domains are nematode specific, C. elegans has a hugely expanded number of MATH domains compared to other sequenced genomes (Pfam release 16 at http://www.sanger.ac.uk/Software/Pfam/). Analysis of codon alignments among closely related MATH-domain genes indicates that there is significant positive selection acting on specific sites. High dN/dS sites are concentrated largely in the MATH domain (supplemental data 10 at http://www.genetics.org/supplemetal/) and alignment with a solved MATH-domain protein structure suggests that the sites under positive selection are concentrated on one face of the domain in a region that interacts with one of its binding partners, CD40 (McWhirter et al. 1999; data not shown).
The F-box domain is ∼40 amino acids long and in some cases is known to act as an adapter to target other proteins to the ubiquitin-mediated proteolysis pathway (e.g., Bai et al. 1996; Schulman et al. 2000). In the C. elegans F-box-FTH family, the F-box domain occupies the N terminus followed by ∼250 amino acids called the FTH domain (Clifford et al. 2000; Nayak et al. 2005), which has no sequence relatives outside of nematodes. The entire protein aligns well among most members of the family in C. elegans, with the exceptions most likely being nonfunctional genes and gene prediction errors (supplemental data 10 at http://www.genetics.org/supplemetal/; data not shown). As with MATH-domain genes, analysis of codon alignments among closely related F-box genes shows clear indications of positive selection at specific sites in these proteins (supplemental data 10 at http://www.genetics.org/supplemental/). These sites are not in the F-box region and may cluster in specific regions in the rest of the protein. I speculate that F-box proteins in C. elegans function to target foreign proteins for proteolysis via binding sites in the regions under positive selection.
Chemosensory receptor families:
Members of multiple putative chemosensory receptor (SR) gene families are prominent contributors to gene clustering: 219 of the 1391 clusters contain genes in annotated SR families. These clusters range in size from 2 to 24 genes and contain a total of 1065 genes, including members of all previously described SR families. Extensive clustering of odorant, gustatory, and vomeronasal receptors is also found in vertebrates (e.g., Del Punta et al. 2000; Matsunami et al. 2000; Glusman et al. 2001) and, to a lesser extent, in Drosophila (Robertson et al. 2003), suggesting that local gene duplication and diversification is a phylogenetically conserved feature of chemoreceptor gene families. The specificity of the clustering algorithm in C. elegans is supported by the fact that each of many analyzed SR clusters contains genes from one specific SR family, despite the fact that most SR families have similar genome distributions and are often interspersed locally. One of the cluster families, with ∼75 predicted members, defines a new family in the SR superfamily. The new family is distantly related to the previously recognized srg, sru, srv, srh, and str SR families in C. elegans. Ψ-Blast searches started from two proteins from the new family (persistent E-value cutoff 10−4) also suggest a very distant relationship to melatonin receptors and opsins. A composite hydropathy plot and protein tree of 29 putative full-length predictions from the new family are shown in Figure 7. As with other SR families, the new family is concentrated on chromosome V (57 of 74 genes). Of the 57 genes on chromosome V, 45 are located on the left arm (Figure 7), including four clusters of 5 or more genes, all of which were identified by the clustering algorithm. The new family has been assigned the C. elegans gene designation srt (J. Hodgkin, personal communication). A full annotation of the srt family was completed and submitted to WormBase; a list of all known SR proteins, including the new srt family, is in supplemental data 9 at http://www.genetics.org/supplemental/.
Gene clusters in D. melanogaster are similar:
To investigate whether the extent of homologous gene clustering in metazoans is nematode specific, I performed a similar analysis with the nearly completed sequence of D. melanogaster (Adams et al. 2000). Gene clusters in this genome appear to be slightly less compact and gene-product heterogeneity was more problematic for blast comparisons, so slightly different cluster parameters were used (see materials and methods). Although results were less dramatic than with C. elegans, it is clear that there are homology gene clusters in Drosophila as well (Table 2). A more limited analysis of the molecular identities of large clusters indicated fewer families with no known function, although several were found (DUF227, DUF243, DUF1091, PF02448, DUF725, PF02756, DUF733, and PF03207 in order of descending cluster size). As in C. elegans, there was enrichment for genes implicated in xenobiotic detoxification and pathogen response. For xenobiotic detoxification, all four families that clustered in C. elegans are also clustered in Drosophila: cytochrome P450, glutathione S-transferase, UDP-glycosyl transferase, and short-chain dehydrogenase. Most pathogen response genes are sufficiently divergent to make it difficult to compare Drosophila directly with C. elegans, but Drosophila has clusters for many insect gene families implicated in pathogen response, including the persephone protease family (Ligoxygakis et al. 2002), γ-thionin defensins (PF00304, FlyBase 2004 at http://flybase.bio.indiana.edu; UCSC Gene Sorter, May 2003 C. elegans data set; http://genome.ucsc.edu/), lysozymes (Roxstrom-Lindquist et al. 2004), and serpin protease inhibitors (Levashina et al. 1999).
Drivers of cluster evolution:
Chromosome arms in C. elegans have features that might contribute to a higher frequency of gene duplication, including higher densities of simple and complex DNA repeats and approximately threefold higher rates of meiotic recombination. If unequal crossing over or some other homology-based mechanism were prominent drivers of gene duplication, these properties would produce more primary duplication material on chromosome arms. In agreement with this idea, DNA repeat sequences and gene clusters are less abundant on the X chromosome. However, DNA repeats are abundant on the left arm of chromosome I and on the right arm of chromosome III, regions nearly devoid of gene clusters as detected by the algorithm reported here. In addition, there is no immediately obvious correlation within chromosomal arms between the local frequency of DNA repeats and gene clusters (data not shown). Nevertheless, genome rearrangements probably do occur more frequently on arms, since regions of synteny between C. elegans and C. briggsae are reported to be shorter on autosomal arms than in centers and the X chromosome (Stein et al. 2003). However, this synteny analysis is more likely to detect large inversions, transpositions, and translocations than local duplications. Preliminary analysis of very recent gene duplications in the C. elegans genome suggests that most duplications are modest in size (a few genes or less in length) and cause local tandem or inverted repeats (data not shown; also see Katju and Lynch 2003). I speculate that this class of duplications is the primary source material for gene clusters. Further analysis is necessary to understand the mechanism by which these duplications arise and how they relate to the evolutionarily persistent gene clusters reported here.
Gene clusters contain unusual genes:
By various measures, the gene families identified by the clustering algorithm are unusual. Perhaps the most marked of these is the preponderance of nematode-specific gene products. In contrast to clustered gene families, unclustered homologous families in C. elegans are dramatically different. Using the same family-finding algorithm without regard to genome position, the largest gene families (after removal of cluster-5 families) encoded protein kinases, ligand-gated ion channels, ras/rab family G proteins, two types of transposases, transmembrane tyrosine kinases, protein phosphatases, and phosphoesterases. All of these families are well known, none are nematode specific, and all are the subject of thousands of research articles. To the extent that I investigated, their genome distribution lacks the autosomal arm bias that characterizes nearly every cluster-5 family. Other distinctive features of cluster-5 genes, when compared to the rest of C. elegans genes, are listed in Table 7. These include a dramatically reduced frequency of assigned phenotypes in RNA interference tests of gene function, shorter introns, reduced expression levels by two measures, and increased divergence from their closest predicted C. briggsae relative. All of these features, except shorter introns, are readily rationalized on the basis of functional redundancy and higher rates of evolution for cluster genes. Shorter introns may be an indirect result of the fact that genes with very low expression levels have a smaller average intron size (data not shown).
Apart from these features, do homology cluster genes have any common biological thread? Since the families are nearly all nematode specific and very few have members of known biological function, this is a difficult question to answer. Nevertheless, the enrichment for secretion signal sequences, transmembrane domains, putative chemoreceptors, xenobiotic detoxification genes, and pathogen response genes strongly suggests that environmental interactions are a prominent feature of cluster genes. In addition to these documented features, a surprising number of the novel families appear to encode secreted proteins with peculiar amino acid frequencies, properties shared by many antimicrobial peptides (Boman 2003). These families include DUF19 (34 cluster-5 genes), DUF130 (27 cluster-5 genes), CFAM2 (7 cluster-5 genes), CFAM8 (6 cluster-5 genes), CFAM12 (5 cluster-5 genes), CFAM13 (5 cluster-5 genes), and CFAM18 (10 cluster-5 genes). Finally, one cluster family encodes glycosyl hydrolases, among which are chitinases, potential antifungal enzymes (Leah et al. 1991). I speculate that many of these families are as yet uncharacterized elements of innate immunity in nematodes.
Nuclear hormone receptors:
The genome distribution of nuclear hormone receptor (nhr) genes is particularly telling because it includes both phylogenetically conserved genes and a large expanded family of nematode-specific relatives (Sluder et al. 1999). C. elegans possesses ∼30 nhr genes that belong to families with broad phylogenetic representation, including members of five of the six recognized chordate nhr families (Sluder and Maina 2001). The phylogenetically conserved subset of the nhr genes is distributed widely in the genome, with no obvious chromosome or arm bias. There is also a large expansion of nhr genes in C. elegans, including >200 that define expanded nematode-specific families. Presumably, these genes duplicated and diversified during nematode evolution. The phylogenetic tree for these genes indicates that these duplications occurred over a long period with no obvious indication of temporal clustering. The expanded nhr genes are distributed very nonuniformly in the genome, with 149 of 197 tested genes residing on chromosome V and with a strong bias toward clusters on autosomal arms. I speculate that this segment of the nhr family is specialized for transcriptional response to environmental challenges and that the genes duplicate and diversify on chromosomal arms in concert with this selection. One member of the nhr family, nhr-8, is experimentally implicated in xenobiotic response (Lindblom et al. 2001); however, this gene is not in a homology cluster and appears to be a member of the phylogenetically conserved class of C. elegans nhr genes. I speculate that some of these genes also participate in environmental responses.
Operons and homology clusters:
Genes that are found in large homology clusters tend not to be found in operons and vice versa. Why is this true? I speculate that operons, in their role as transcriptional regulatory units, tend to group genes that are unrelated in sequence but that function together in shared processes (as suggested by Blumenthal et al. 2002). In contrast, homology clusters exist as a consequence of evolutionary patterns of duplication and divergence rather than shared transcriptional regulation. Evolutionary theory indicates that duplicate genes that persist over time must acquire at least partially distinct functions to permit natural selection to act in retaining both gene copies (Ohno et al. 1968). It is likely that some of the functional distinctness acquired by such duplicate genes occurs at the transcriptional level, for example, when each of the duplicates is expressed in a subset of the tissues that expressed their ancestor. This mechanism of divergence is unlikely to be consistent with the duplicates residing in the same operon. In addition, the duplications that give rise to gene clusters might disrupt operon structure, favoring persistence of genes with their own promoters. The genomic distribution of operons is also different from homologous gene clusters; operons are relatively evenly distributed by chromosome (although reduced on the X chromosome) and they are less common on autosomal arms where most homology clusters reside (Blumenthal et al. 2002; data not shown).
I thank Hugh Robertson for inspiring me to analyze gene clusters and Bob Waterston, Emily Rocke, Zhirong Bao, Willie Swanson, Phil Green, Evan Eichler, Colin Manoil, and members of the Thomas lab for helpful discussions of this work. The work was supported by National Institutes of Health grant RO1GM48700.
Communicating editor: K. Kemphues
- Received December 16, 2004.
- Accepted September 6, 2005.
- Copyright © 2006 by the Genetics Society of America