Abstract
The SerH locus of Tetrahymena thermophila is one of several paralogous loci with genes encoding variants of the major cell surface protein known as the immobilization antigen (i-ag). The locus is highly polymorphic, raising questions concerning functional equivalency and selective forces acting on its multiple alleles. Here, we compare the sequences and expression of SerH1, SerH3, SerH4, SerH5, and SerH6. The precursor i-ags are highly similar. They are rich in alanine, serine, threonine, and cysteine and they share nearly identical ER translocation and GPI addition signals. The locations of the 39 cysteines are highly conserved, particularly in the 3.5 central, imperfect tandem repeats in which 8 periodic cysteines punctuate alternating short and long stretches of amino acids. Hydrophobicity patterns are also conserved. Nevertheless, amino acid sequence identity is low, ranging from 60.7 to 82.9%. At the nucleotide level, from 9.7 to 26.7% of nucleotide sites are polymorphic in pairwise comparisons. Expression of each allele is regulated by temperature-sensitive mRNA stability. H mRNAs are stable at <36° but are unstable at >36°. The H5 mRNA, which is less affected by temperature, has a different arrangement of the putative mRNA destabilization motif AUUUA. Statistical analysis of SerH genes indicates that the multiple alleles are neutral. Significantly low ratios of the rates of nonsynonymous to synonymous amino acid substitutions suggest that the multiple alleles are subject to purifying (negative) selection enforcing constraints on structure.
The variant surface proteins of ciliate protists are encoded by families of allelic and nonallelic genes whose expression is regulated by environmental conditions, principally temperature. Also known as the immobilization antigen (i-ag), these glycosyl phosphatidyl inositol (GPI)-linked proteins coat the entire external surface of the cell. These proteins are so named because exposure of cells to antisera raised against the i-ag results in the cessation of swimming, i.e., immobilization, thus providing a simple assay for their presence. Usually, only one i-ag type is found on the cell surface. The role of the i-ag is not known, but seasonal variation in frequencies of cells expressing different i-ag paralogs and their interlocked expression suggest an important functional role (Saad and Doerder 1995).
Tetrahymena thermophila has at least nine families of paralogous genes encoding alternative forms of i-ags expressed under different environmental conditions (Smith et al. 1992; Doerder 2000). The existence of multiple i-ag paralogs and, in some instances, multiple alleles, raises questions concerning i-ag evolution and their adaptive significance. Here, we focus on the multiple alleles at the SerH locus, which were the first to be described genetically (Nanney and Dubert 1960) and were the first to be sequenced (Tondravi et al. 1990; Deak and Doerder 1995). Inbred strains derived from natural isolates in the 1950s were found to contain five H antigenic types expressed from 20° to 36°. One was lost in early inbreeding, and the other four were shown to be products of four alleles, SerH1, SerH2, SerH3, and SerH4, encoding antigenically distinct i-ags (Nanney and Dubert 1960). Since the inbred strains with these alleles were derived from only four natural isolates, the existence of at least four SerH alleles indicates considerable genetic polymorphism, a conclusion supported by the finding of additional alleles in isolates from ponds in the Allegheny National Forest (ANF) of northwestern Pennsylvania (Saad and Doerder 1995). SerH variants from the ANF ponds include all four of the previously described antigenic types plus new alleles encoding antigenically distinct H i-ags described here. Furthermore, some of these antigenically similar alleles show restriction fragment length polymorphism (RFLP), significantly increasing the number of alleles. The question addressed in this article is whether the multiple SerH alleles are functionally equivalent.
The SerH alleles of inbred strains are expressed from 20° to 36° and encode GPI-linked proteins migrating from 44 to 52 kD (Doerder and Berkowitz 1986; Ko and Thompson 1992; Ron et al. 1992). Sequences of two alleles, SerH1 and SerH3 (present in inbred strains derived from cells isolated from the same pond), have been published, with recently corrected sequences deposited in GenBank (Table 1). The SerH1 (Deak and Doerder 1995) and SerH3 (Tondravi et al. 1990) genes consist of intron-free open reading frames (ORFs) encoding precursor proteins of ~400 amino acids. Both precursor proteins contain typical endoplasmic reticulum (ER) translocation and GPI-linkage signals and are especially rich in alanine, serine, threonine, and cysteine. The mature protein is divided into three domains: the amino-terminal region, 3.5 central imperfect tandem repeats each delineated by eight periodic cysteines, and the carboxyl-terminal region.
The upper limit of SerH expression (36°) is marked by a dramatic shift in mRNA stability. At 30°, for instance, H3 mRNA has a half-life of >60 min, whereas at 40°, the half-life is <3 min (Love et al. 1988; McMillan et al. 1993). Nuclear runoff assays showed that SerH1 and SerH3 are transcribed normally at 40° (McMillan et al. 1993; Deak and Doerder 1995). The mRNA destabilization motif, AUUUA, associated with mRNA stability in other systems (Asson-Batres et al. 1994; Jacobson and Peltz 1996; Ross 1996; Balmer et al. 2001; Wilusz et al. 2001) is present in the 3′-untranslated regions (UTRs) of SerH transcripts and is correlated with differences in mRNA stability among alleles characterized here.
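To give a sense of scale for these half-lives, the following minimal sketch assumes simple first-order (exponential) decay of the mRNA pool, which is an assumption made here for illustration and is not a model stated in the text. The function name and the 30-min time point are hypothetical.

```python
def fraction_remaining(minutes, half_life_min):
    """Fraction of an mRNA pool left after `minutes`, assuming first-order decay."""
    return 0.5 ** (minutes / half_life_min)

# Bounds implied by the reported half-lives, evaluated at an illustrative 30 min:
at_30_degrees = fraction_remaining(30, 60)  # half-life >60 min, so at least ~71% remains
at_40_degrees = fraction_remaining(30, 3)   # half-life <3 min, so at most ~0.1% remains
```

Under this assumption, ten or more half-lives elapse within 30 min at 40°, so the message is effectively cleared well within an hour while transcription continues.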
In this article the sequence and expression of five SerH alleles from geographically separate isolates are analyzed. SerH4 from inbred strains and SerH5 and SerH6 from wild isolates are compared with corrected sequences of SerH1 and SerH3. We focus on the question as to whether these multiple alleles identified by antigenic differences are functionally equivalent. The results indicate that the encoded proteins are structurally similar and that structural constraint is likely the result of purifying selection.
MATERIALS AND METHODS
Strains: Inbred T. thermophila strains A, B, B3, and C2 (homozygous for SerH1, SerH3, SerH4, and SerH4, respectively) were originally obtained from Dr. David L. Nanney. The SerH4 allele in strains B3 and C2 is identical by selection during inbreeding from a common ancestor; results were identical with both strains. ANF683, ANF12090, ANF16006, ANF16020, ANF16057, ANF16062, and ANF16015 (H3-expressing lines), ANF18021 (SerH6-expressing line), and ANF5906 and ANF6707 (SerH5-expressing lines) were collected as previously described (Doerder et al. 1996). Cells were grown in PPY medium (1% w/v proteose peptone, 0.15% w/v yeast extract, and 0.01 m FeCl3) at specified temperatures. Serotype tests were performed with immobilizing antisera as described previously (Doerder 1979).
Reverse transcription-PCR, cloning, and sequencing: For RNA isolation, T. thermophila strains were grown in 24 ml of PPY medium at room temperature for 2 days (~3 × 10^5 cells/ml). Total RNA was isolated utilizing RNeasy Midi kits (QIAGEN, Valencia, CA). Contaminating DNA was removed with RQ1 DNaseI (Promega, Madison, WI), followed by phenol:chloroform extraction and ethanol precipitation. Reverse transcription was performed with MMLV reverse transcriptase (Promega) and 10 μg of total RNA according to manufacturer's directions. Standard PCR amplification of the reverse transcription (RT) product was performed with primers H3AT (5′-GTAAAACAAAACTATAATAATTTG-3′; works on all SerH alleles except SerH2) and dTRI (5′-CGCGAATTCC(T)22-3′). The 5′-UTR was obtained by 5′-rapid amplification of cDNA ends (RACE) utilizing terminal deoxynucleotidyl transferase (Promega) and dG tailing followed by standard PCR.
The RT-PCR and standard PCR products were purified with a QiaQuick PCR Purification Kit (QIAGEN). The purified product was ligated into either pGEM-T or pGEM-T Easy vectors (Promega). Plasmids recovered from Escherichia coli “Sure” cells (Stratagene, La Jolla, CA) were sequenced in both directions either manually with the SequiTherm EXCEL II DNA sequencing kit (Epicentre Technologies, Madison, WI) or with an automated sequencer. Multiple clones were sequenced in both directions.
For RFLP analysis, RT-PCR was utilized to avoid amplification of possible pseudogenes. Following reverse transcription, cDNA was amplified by standard PCR with primers specific for the amino and carboxyl termini (H3AT; H3CT, 5′-TCAAAAAGTGCAATTTTAATTC-3′). PCR products were purified as above and restriction fragments were separated on polyacrylamide gels.
Northern blots, nuclear runoff, and slot blots: For RNA isolation, T. thermophila strains were grown in 16 ml of PPY medium at the appropriate temperature for 2 days (~3 × 10^5 cells/ml). Total RNA was isolated with Trizol Reagent (Life Technologies) according to manufacturer's directions for animal cell RNA isolation. Ten micrograms of each RNA was separated on a 1% agarose-formaldehyde gel and blotted onto a positively charged nylon membrane (Boehringer Mannheim, Indianapolis) overnight in 20× SSC pH 7. Full-length SerH3 and SerH4 probes were prepared utilizing the DIG high prime DNA labeling and detection starter kit (Boehringer Mannheim). A random-primed (MBI Fermentas) 32P-labeled actin control probe was made from a 1000-bp RT-PCR product. In vitro nuclear runoff assays and slot blots were performed as described previously (Love et al. 1988).
DNA isolation and Southern blotting: DNA isolation and Southern blotting were performed according to standard protocols (Sambrook et al. 1989). The blot was probed with full-length SerH5 32P-labeled random-primed fragments.
Statistical analysis of nucleotide and amino acid sequences: ClustalX was used to align deduced amino acid sequences of SerH alleles. After minor manual adjustments, DAMBE (version 4.0.36; Xia and Xie 2001) was used to align nucleotide sequences according to amino acid sequences and to calculate basic descriptive statistics (http://web.hku.hk/~xxia/software/software.htm). The program DnaSP (version 3.51; Rozas and Rozas 1999) was used to calculate Tajima's D statistic and Fu and Li's D* statistic as well as to calculate numbers of nucleotide polymorphisms (http://www.bio.ub.es/~julio/DnaSP.html). dN/dS ratios were calculated by maximum likelihood using the CODEML program of the PAML (version 3.0) package (Yang 1997; Yang and Bielawski 2000; http://abacus.gene.ucl.ac.uk/software/paml.html). The phylogenetic tree of SerH alleles was calculated from the adjusted ClustalX alignment by maximum likelihood using WebPhylip version 2.0 at http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/.
RESULTS
Polymorphism at the SerH locus: Among ANF isolates there appear to be numerous SerH alleles in addition to those already described. The antigenically distinct ANF i-ag previously referred to as “C” (Saad and Doerder 1995) was found by genetic analysis to be a SerH allele and is here referred to as SerH5 (segregation data not shown). Another allele, SerH6, specifies an i-ag that is detected by both anti-H1 and anti-H3 sera. The sequences of these alleles as well as the sequence of SerH4 from inbred strains are reported below.
Figure 1. RFLP among coding regions of SerH1 and SerH3-like variants. RFLPs were determined either from nucleotide sequences (SerH designated) or by electrophoretic analysis of restriction digests of RT-PCR products (~1.3 kb; numbered ANF isolates). Parentheses indicate inbred strain of allele origin (SerH1 and SerH3) or pond of origin (C, CRWP; S, SG29; 3, 343S).
Each major ANF pond has yielded isolates that are immobilized by anti-H1, anti-H2, anti-H3, anti-H4, or anti-H5 (Saad and Doerder 1995), indicating that there are at least five SerH alleles per pond. Figure 1 shows the results of RFLP analysis on several randomly chosen ANF lines immobilized by anti-H3. Two observations are important. The first is that there is considerable polymorphism in this antigenically limited group, supporting the idea that there are many relatively rare SerH alleles. Although the SphI and two PvuII sites are shared, the HindIII, PstI, and remaining PvuII sites vary. Two forms are found in ponds CRWP and 343S, and one of the forms found in 343S is found in SG29. Each of these ponds therefore must contain at least six SerH alleles. The second observation is that among this small sample of alleles, RFLP is somewhat greater for the 5′ (amino) half of the molecule than for the 3′ (carboxyl) half. At the protein level, greater variation in the amino terminus, where the molecule is most likely to be exposed to the environment, is consistent with the distribution of mutations among the H3-like i-ags (see below). The restriction sites shown in Figure 1 are absent in SerH4 and SerH5 alleles.
SerH5 in natural populations: The SerH5 allele was studied in isolates ANF5906 and ANF6707, both isolated from pond SG29. Antisera against each strain fully cross-reacted, but did not immobilize strains expressing other SerH alleles, suggesting a new allele as verified by segregation analysis (data not shown). RT-PCR using H3AT and dTRI primers yielded a single product from each strain, and sequencing of these products showed that SerH5 from both isolates was identical. However, Southern analysis (Figure 2A) showed that strain ANF6707 likely has an additional version of SerH5, a result confirmed by PCR, which amplified two bands from genomic DNA of ANF6707 but not ANF5906 (Figure 2B). Since RT-PCR results indicate a single transcript from both strains, the larger 1.4-kb genomic product from ANF6707 may be a SerH5 pseudogene. Pseudogenes were previously shown to be associated with SerH1 and SerH3 (Kile et al. 1988; Deak and Doerder 1995).
Figure 2. SerH5 belongs to a multigene family. (A) Southern blot analysis of genomic DNA from two ANF strains containing SerH5. Restriction enzymes: E, EcoRI; H, HindIII; P, PstI; B, BamHI; X, XbaI. Size markers to left are in kilobases. (B) PCR analysis indicating two SerH5-related genes, only one of which is expressed. Primers specific for the amino and carboxyl termini (H3AT and H3CT, respectively) were expected to amplify a 1.2-kb product. Lanes 1 and 2, amplification using genomic DNA from strains ANF6707 and ANF5906, respectively. Lanes 3 and 4, RT-PCR amplification using total RNA from strains ANF6707 and ANF5906, respectively, using primers H3AT and dTRI (see materials and methods).
Sequences of SerH4, SerH5, and SerH6: The nucleotide sequences of SerH4, SerH5, and SerH6 cDNAs are deposited in GenBank (Table 1) and are not shown here. The SerH4 sequence was identical in inbred strains B3 and C2, which is consistent with their common inbreeding parentage. Genomic (macronuclear) versions of SerH4 and SerH5 were identical to their respective cDNAs, indicating the absence of introns (the genomic version of SerH6 was not sequenced). Properties of these cDNAs and the encoded ORFs are shown in Table 1 where they are compared to the corrected sequences of SerH1 and SerH3. The 5′-UTRs (not shown) are exceptionally AU rich and are characterized by a length of 10–15 adenines upstream of the AUG start codon. The 3′-ends are described below. Except for length, the ORFs are strikingly similar. The ORF is 42% GC, substantially higher than the 21% GC in the 3′-UTR, an observation consistent with the pattern repeatedly observed with other T. thermophila genes.
Table 1. Properties of i-ag cDNAs and proteins
Codon usage is highly biased in i-ag genes, a property previously noted (Deak and Doerder 1995) and validated with the corrected SerH1 and SerH3 sequences. Among the five alleles, the 3′ nucleotide is A/U in 84.8% of the codons. In a recent study, an extreme A/U bias in the 3′ position was found in weakly expressed genes (e.g., the conjugation-induced CnjB and CnjC genes), whereas highly expressed genes (e.g., histones, ribosomal proteins, and chaperonin) favored C/U in the third position (Larsen et al. 1999). The highly expressed i-ags provide a clear exception to this pattern as they conform to the former. As an additional bias consistent with the AU-rich ORF, 15 codons, primarily those that are GC rich, are not used in SerH genes. These include AGG, ACG, CGN, CAG, CAY, CCG, GGG, GAR, UCG, and UAC. Of the six arginine codons, only AGA is used.
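The codon statistics above can be illustrated with a short sketch. It expands the degenerate codon patterns listed in the text (N = any base, Y = C or U, R = A or G) to confirm that they amount to 15 codons, and it shows one way to compute the A/U fraction at the third codon position. The function names and the toy ORF are hypothetical; the toy ORF is not SerH sequence, and this is not the software used in the study.

```python
from itertools import product

# IUPAC expansion for the symbols that appear in the unused-codon list
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "N": "ACGU", "Y": "CU", "R": "AG"}

def expand(pattern):
    """Expand a degenerate codon pattern such as 'CGN' into explicit codons."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in pattern))]

UNUSED_PATTERNS = ["AGG", "ACG", "CGN", "CAG", "CAY", "CCG", "GGG", "GAR", "UCG", "UAC"]
unused_codons = sorted({c for p in UNUSED_PATTERNS for c in expand(p)})

def third_position_au_fraction(orf):
    """Fraction of complete codons in an in-frame ORF whose third base is A or U."""
    rna = orf.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    return sum(c[2] in "AU" for c in codons) / len(codons)

toy_fraction = third_position_au_fraction("AUGGCUUCAACC")  # toy ORF, codons AUG GCU UCA ACC
```

The expanded list contains the two histidine codons (CAY) and the two glutamate codons (GAR), and five of the six arginine codons (CGN and AGG), leaving AGA as the only arginine codon in use.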
The precursor proteins encoded by SerH alleles are substantially alike in amino acid composition and structure but differ in certain details. The proteins are exceptionally rich in alanine, serine, and threonine residues (totaling ~47%; Table 1) and other small amino acids and are lacking histidine and glutamate (Figure 3). Sequence identity (Table 2) ranges from 60.7% between H1 and H4 to 82.9% between H3 and H6, generally low values for alleles. The precursor proteins are divided into three regions: an amino-terminal region of 105–109 amino acids, 3.5 imperfect tandem repeats each consisting of 75–85 amino acids delineated by eight periodic cysteines, and a carboxyl-terminal region of 27–29 amino acids. The well-conserved N-terminal 21–27 residues are hydrophobic and are identified as a putative cleavable ER translocation signal sequence (Figure 4). Additionally, in common with most other ciliate i-ags, the isoleucine at position 7 is conserved. The well-conserved C-terminal 14 residues are highly hydrophobic, and the region resembles GPI addition sites found in other GPI-linked proteins (Udenfriend and Kodukula 1995). The programs DGPI and big-PI Predictor predict similar, though not identical, GPI attachment sites (ω; Figure 4). Experimental verification is necessary to confirm these predictions.
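As an illustration of how a pairwise identity value of the kind reported in Table 2 can be computed from an alignment, the sketch below counts matching residues over columns in which neither sequence has a gap. How gapped columns were treated in the original analysis is not stated, so excluding them is an assumption, and the function name and toy sequences are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over ungapped columns of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return 100 * sum(x == y for x, y in pairs) / len(pairs)

# Toy aligned peptides (not SerH sequence): 3 matches over 4 ungapped columns
toy_identity = percent_identity("ACD-F", "ACE-F")
```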
The remaining central portion of the protein consists primarily of 3.5 imperfect tandem repeats with eight periodic cysteines per full repeat (Figure 4). H1, H3, and H6 have ~85 amino acids per repeat, whereas H4 and H5 have ~75 amino acids per repeat. Sequence identity is greatest in the third and fourth repeats (Figure 4), portions of the molecule likely to be closest to the membrane. The cysteine periodicity is remarkably conserved with a pattern of alternating long and short stretches of amino acids between cysteines (Figures 4 and 5A). Periods II and IV are the most variable in length, each with several indels. In addition to cysteine periodicity, hydrophobicity is also conserved across repeats and among alleles. An example with respect to the first repeat from i-ags H5 and H6 is shown in Figure 5B.
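The alternating short and long spacing between cysteines can be read directly from a protein sequence. The following is a minimal sketch with a hypothetical function name and a made-up peptide; it is not SerH sequence and is meant only to show the calculation.

```python
def cysteine_spacings(protein):
    """Number of residues between each pair of consecutive cysteines."""
    positions = [i for i, aa in enumerate(protein.upper()) if aa == "C"]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

# Toy peptide with a short-long-short pattern between its four cysteines
toy_spacings = cysteine_spacings("CATCSSTAGACGSC")
```

Applied to an aligned set of repeats, the same list of spacings would make differences in period length, such as the indels noted for periods II and IV, easy to tabulate.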
Evolution of SerH alleles: The existence of similar, multiple SerH alleles, each with apparently low frequency, raises questions as to the selective forces operating on these alleles. It is important to know whether the multiple SerH alleles are equivalent (neutral) or the result of positive selection for as yet unknown function(s). The conservation of cysteine periodicity and the overall similarity of amino acid composition among the five i-ags suggest that there are structural constraints. Paradoxically, unlike most alleles in other systems, the five SerH alleles display considerable nucleotide polymorphism (Table 2). In pairwise comparisons, from 9.7 to 27.5% of ~1200 nucleotide sites (after removing alignment gaps) are polymorphic, and among the five alleles a total of 422 sites (37.4%) are polymorphic for a total of 476 mutations (of 1128 shared nucleotide sites, 372 are dimorphic, 46 are trimorphic, and 4 are tetramorphic). As indicated by the sequence identities in Table 2, a considerable portion of the polymorphism alters the amino acid sequence. The mutations, both synonymous and nonsynonymous, are largely confined to the first third of the gene (encoding the aminoterminal region and first repeat) when comparisons are made among SerH1, SerH3, and SerH6 (Figure 5C) or between SerH4 and SerH5, but are spread throughout the gene when the former are compared to the latter (Figure 5C). Among the five alleles, 46.7% of the nucleotide sites (426) in the amino terminus and first repeat are polymorphic, compared to 31.8% of sites (702) in the remaining portion of the molecule.
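The site counts quoted above are internally consistent, as the following sketch shows. It classifies alignment columns by the number of distinct bases they contain and takes the minimum number of mutations at a site with k distinct bases to be k − 1. The function names and the toy alignment are hypothetical, and this is an illustration rather than the DnaSP procedure.

```python
from collections import Counter

def site_classes(seqs, gap="-"):
    """Count ungapped alignment columns by their number of distinct bases."""
    counts = Counter()
    for column in zip(*seqs):
        if gap in column:
            continue  # alignment gaps are excluded, as in the text
        counts[len(set(column))] += 1
    return counts

def min_mutations(counts):
    """Minimum number of mutations: a site with k distinct bases needs k - 1."""
    return sum((k - 1) * n for k, n in counts.items())

# Counts reported in the text for the five SerH alleles (1128 shared sites)
reported = Counter({1: 1128 - 422, 2: 372, 3: 46, 4: 4})
polymorphic = sum(n for k, n in reported.items() if k > 1)  # 372 + 46 + 4
percent_polymorphic = 100 * polymorphic / 1128
mutations = min_mutations(reported)                         # 372 + 2*46 + 3*4
```

The reported dimorphic, trimorphic, and tetramorphic counts sum to 422 polymorphic sites (37.4% of 1128) and imply at least 476 mutations, matching the figures in the text.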
Figure 3. Amino acid composition of precursor H i-ags deduced from nucleotide sequence. x-axis: single-letter amino acid abbreviations.
Table 2. Pairwise comparison of percentage amino acid sequence identity (top) and percentage polymorphic nucleotide sites (bottom)
Standard tests such as those of Tajima (1989) and Fu and Li (1993) indicate no significant departure from neutrality, either when the molecules are tested as a whole (D = −0.1326, P > 0.10; D* = 0.1266, P > 0.10) or when they are tested with sliding windows (not shown). This suggests the action of purifying (negative) selection in maintaining neutral multiple SerH alleles. This was more robustly tested by comparing the ratios of the rates of nonsynonymous (dN) to synonymous (dS) substitutions in the coding sequences. A ratio (ω = dN/dS) significantly >1.0 indicates positive selection, while ω< 1.0 indicates negative, or purifying, selection (Yang and Bielawski 2000; Nielsen 2001). The PAML program CODEML, which corrects for transition/transversion ratios and codon bias, was used to calculate ω and test by maximum likelihood for significant departure from ω = 1 (no selection; Yang 1997; Yang and Bielawski 2000). For all pairwise combinations of SerH alleles, the dN/dS ratio was significantly <1.0 (Table 3), implying purifying selection and supporting the hypothesis that H i-ags are functionally equivalent. An unrooted phylogenetic tree of the SerH alleles is shown in Figure 5D. As expected on the basis of the length of repeats (Table 1) and the distributions of mutations, SerH4 and SerH5 branch separately from SerH1, SerH3, and SerH6.
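For readers unfamiliar with the neutrality statistic, the sketch below implements the textbook form of Tajima's D (Tajima 1989) from the sample size n, the number of segregating sites S, and the mean number of pairwise differences. It is not the DnaSP implementation used here, the function names are hypothetical, and the example inputs are arbitrary; the mean pairwise difference for the SerH alleles is not given in the text, so the reported D is not reproduced.

```python
from math import sqrt

def watterson_expectation(n, seg_sites):
    """S / a1: the value of mean pairwise differences expected under neutrality."""
    return seg_sites / sum(1 / i for i in range(1, n))

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from n sequences, S segregating sites, and mean pairwise differences pi."""
    if n < 4 or seg_sites < 1:
        raise ValueError("D is undefined for n < 4 or S < 1")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)

# Arbitrary illustration: with pi equal to its neutral expectation, D is zero
d_neutral = tajimas_d(5, 10, watterson_expectation(5, 10))
```

A value of D near zero, as reported for the SerH alleles, means the two estimates of diversity agree, which is the pattern expected when there is no significant departure from neutrality.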
Expression of SerH i-ag genes: For SerH1 and SerH3 alleles of inbred strains, temperature shift to >36° results in expression of SerT in place of SerH, a shift that occurs in <1 hr (Williams et al. 1985). This change in expression is due to the rapid loss of SerH mRNA from the cellular pool (Love et al. 1988; Deak and Doerder 1995). To ascertain whether the same is true for other SerH alleles, of interest given their sequence divergence, H4- (strains B3 and C2) and H5- (ANF5906 and ANF6707) expressing cell lines were grown at 30° and 40° for ~18 hr. As expected, at 30°, H1, H3, H4, and H5 i-ag-expressing lines were immobilized by their respective antisera. At 40°, H1, H3, and both H4-expressing lines were unaffected by specific antisera, indicating a switch to T expression. Interestingly, both H5-expressing lines at 40° were immobilized by anti-H5, indicating continued H5 expression. Northern blot analyses were consistent with immobilization results (Figure 6). H1, H3, and H4 mRNA was present at 30° and undetected at 40°. H5 mRNA was abundant at 30° and still present, though greatly reduced, at 40°. In nuclear runoff assays SerH4 and SerH5 transcripts were detected at about the same levels for both 30° and 40° (blot not shown), indicating continued transcription, as in SerH1 and SerH3. SerH5, therefore, is exceptional regarding the increased stability of its mRNA.
Figure 4. Alignment of amino acid sequences of H1, H3, H4, H5, and H6. Identical amino acids are indicated by dots; indels are indicated by dashes. This alignment was used to align the nucleotide sequences for statistical analyses shown in Table 3 and Figure 5, C and D. Bar, putative ER translocation signal; arrows, cysteine-delineated repeats; boxes, cysteines; solid and open circles, putative GPI attachment sites predicted by DGPI (http://129.194.186.123/GPI-anchor/index_en.html) and BIG-PI (Eisenhaber et al. 1999) (http://mendel.imp.univie.ac.at/gpi/gpi_server.html), respectively.
The number (Table 1) and arrangement of AUUUA motifs implicated in mRNA instability in a variety of systems vary among the SerH alleles (Figure 7). A distinct doublet pattern, AUUUAXXAUUUA, is found in the 3′-UTRs of SerH1, SerH3, SerH4, and SerH6 but is absent in SerH5. It also should be noted that the poly(A) addition site is variable (Figure 7), and AUUUA exists downstream of proximal poly(A) sites for H1 and H4. Whether differences in AUUUA motifs and in poly(A) addition sites are responsible for greater stability of H5 mRNA awaits experimental tests.
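A scan for single AUUUA motifs and for the AUUUAXXAUUUA doublet (X being any nucleotide) can be sketched with regular expressions. Lookahead patterns are used so that overlapping motifs are also reported. The function name and the toy sequences are hypothetical and are not SerH 3′-UTR sequence.

```python
import re

def find_au_motifs(utr):
    """Return start positions (0-based) of AUUUA motifs and of AUUUAXXAUUUA doublets."""
    rna = utr.upper().replace("T", "U")
    singles = [m.start() for m in re.finditer(r"(?=AUUUA)", rna)]
    doublets = [m.start() for m in re.finditer(r"(?=AUUUA[ACGU]{2}AUUUA)", rna)]
    return singles, doublets

# Toy 3'-UTR fragments: the first has a doublet, the second has two overlapping motifs
with_doublet = find_au_motifs("GGAUUUACCAUUUAGG")
overlapping = find_au_motifs("AUUUAUUUA")
```

Run on the 3′-UTR of each allele, a scan of this kind would list the motif arrangements compared in Figure 7, including whether a given motif lies upstream or downstream of a proximal poly(A) site once site coordinates are supplied.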
DISCUSSION
The SerH i-ag locus of T. thermophila is a highly polymorphic locus with numerous alleles encoding variants of the cell surface immobilization antigen. Many of its alleles were originally recognized at the protein level by discriminating antisera, whereas others were directly recognized at the molecular level, for example, through differences in restriction sites. The antigenic types are geographically widely distributed and the multiple alleles are simultaneously present in ponds, raising questions concerning the type of selection acting on these alleles. Here, we compare the sequences of five alleles distinguished by differences in antigenicity, a trait unlikely to be related to their function.
The polymorphism at the SerH locus is considerably higher than that of similar systems. Among the five alleles, 37.4% of nucleotide sites (excluding indels) are polymorphic. By contrast, two much larger G i-ag alleles of Paramecium primaurelia (156G, GenBank accession no. X03882; 168G, GenBank accession no. X52133) are polymorphic at only 6.3% of 8064 nucleotide sites (after removing alignment gaps), with the bulk of the mutations confined to the central portion of the gene (Prat 1990). Similarly, among 51 AMA1 alleles of Plasmodium falciparum, only 62 of 1311 sites (4.74%) are polymorphic (Polley and Conway 2001), but in this instance, nucleotide diversity, like that of SerH, is greater in the amino-terminal region. However, unlike the SerH alleles, the AMA1 alleles, which encode a surface antigen of erythrocyte-invading merozoites, appear to be under diversifying selection (Polley and Conway 2001). Although among the T. thermophila H i-ags cysteine periodicity and hydrophobicity are conserved, there is considerable amino acid substitution, mostly limited to small amino acids (e.g., A, G, L, V, S, T). Only 21 residues are conserved among the repeats. The statistical analysis presented here indicates, not necessarily surprisingly, that the alleles are essentially neutral and likely subject to purifying selection, consistent with constraint on structure. It is therefore highly unlikely that the multiple alleles represent adaptations to, for example, variable microenvironments in natural populations.
Figure 5. Properties of SerH genes and proteins. (A) Structure of H i-ag molecule emphasizing cysteine periodicity. The molecule contains 3.5 repeats (shaded, numbered 1–4) with eight periodic cysteines (solid bars) per repeat. Numbers indicate the number of amino acids between cysteines. (B) Hydrophobicity plots. Kyte-Doolittle hydrophobicity plots for the first repeat of H5 and H6 are shown. (C) Nucleotide diversity per nucleotide site (π) comparing SerH1 to SerH4 (thick line) and SerH3 to SerH6 (thin line). Values of π were calculated using a sliding window of 50 nucleotides using the DnaSP program. (D) Phylogenetic tree of SerH genes. The tree was calculated by maximum likelihood with the BASML option in DAMBE and drawn with PhyloDraw.
Table 3. Maximum-likelihood test for selection at the SerH locus
Figure 6. Northern blot analysis of SerH expression. Total RNA from cells grown at the indicated temperatures was probed with full-length DIG-labeled probe. The same blot was probed with 32P-labeled actin as a control. H1 from strain A, H3 from strain B, H4 from strain C2, and H5 from strain ANF5906 are shown. H1 and H3 were probed with SerH3; H4 and H5 were probed with SerH4.
All five H i-ag precursor proteins have the same structure: the amino-terminal region of ~100 amino acids with a putative ER translocation signal, 3.5 central imperfect tandem repeats of 75–85 amino acids each, and the carboxyl-terminal region of ~30 amino acids with a putative GPI attachment sequence. Each precursor protein is similar in amino acid composition, and each contains 39 cysteines. Each repeat contains 8 cysteines in a pattern in which short (usually 2) and long (6 to 19) stretches of amino acids alternate between cysteines (Figure 5A). Although the arrangement of disulfide linkages is unknown (reduced and nonreduced i-ags migrate differently in SDS-PAGE, indicating that such linkages are formed), the consistently even number of cysteines per repeat and their consistent periodicity indicate that the same structure is formed by all H i-ags. We speculate that disulfide linkages occur between the cysteines separated by long stretches of amino acids, forming a fibrous structure, as hypothesized years ago for Paramecium i-ags (Preer 1959). Whether or not the cysteines are so linked, the amino portion likely has greatest exposure to the environment and therefore is most likely to contain the epitopes recognized by immobilizing antibody. As recognized when the H i-ags were first described (Margolin et al. 1959; Loefer and Owen 1961), antisera against H i-ags rarely cross-react in immobilization assays, including assays using antisera against purified H protein (Smith et al. 1992). The exception is anti-H3, which weakly reacts with H1-expressing cells. Examination of the amino acid sequence shows that the second repeat of H3 is nearly identical to the first repeat of H1 (Figure 4). The cross-reactivity is explained if, on the cell surface, the first repeat is accessible to antibodies and the second repeat, closer to the carboxyl end, is not.
This explanation is consistent with the greater diversity of the amino-terminal region and the first repeat compared to the rest of the molecule. In this respect the H6 i-ag was of interest, because cells expressing this i-ag are immobilized by both anti-H1 and anti-H3. Consistent with the model, there are several regions of the amino-terminal region and first repeat where putative epitopes of five or more amino acids are shared among H1, H3, and H6. However, the number of such sequences precludes identification of specific immobilization epitopes at this time. Similar distribution of putative epitopes is observed by comparison of H4 and H5, though in this instance, the greater identity is in the repeat, not the amino-terminal region.
The apparent functional equivalency of the H i-ags must be qualified by two additional observations. The first is that SerH2 has not yet been successfully cloned. This gene is not recognized on blots by full-length probes made from other SerH alleles, nor does it amplify in PCR with primers that amplify other SerH alleles, including combinations of various primers to the relatively conserved ER translocation and GPI addition signals. SerH2 is possibly a completely different allele, or possibly it is a null allele with a different paralog expressed in its place. The second observation is that while the upper cutoff for most SerH alleles is 36°, for some alleles like SerH5, expression continues past this temperature. The difference might be in the 3′-UTR where alleles differ in the arrangement of the mRNA destabilization motif, AUUUA. AUUUA motifs are typically found in the 3′-ORF and 3′-UTR and are recognized by trans-acting adenosine- and uracil-binding proteins (AUBPs) that may regulate the rate of deadenylation and/or decapping (Asson-Batres et al. 1994; Jacobson and Peltz 1996; Ross 1996; Balmer et al. 2001; Wilusz et al. 2001). Although the motifs function independently, multiple copies of the AUUUA motif, especially those that overlap or are located in close proximity, result in faster degradation of the mRNA (Chen and Shyu 1995). It has been shown that when two AUUUA motifs are arranged such that they are closely spaced or overlapping, degradation is enhanced, suggesting the trans-acting AUBPs may bind in dimeric form (Lagnado et al. 1994). The absence of the AUUUAXXAUUUA doublet may account for the increased stability of H5 mRNA. It should be noted that extended AUUUA pentamers also are present in the 3′-UTR and that these have also been implicated in mRNA destabilization (Balmer et al. 2001). In this context, we also observed that the poly(A) addition site for H mRNA is variable and that this variation may also incorporate an additional AUUUA motif.
We did not observe the canonical poly(A) addition signal, but we did observe that the 3′-UTR contains repeated elements (Figure 7), one of which (UGUGU) is observed in the 3′-UTR of histone genes (Liu and Gorovsky 1993) and SerJ (GenBank accession no. AF242387), and another (UGAUU) is found in SerL (GenBank accession no. AF312770). Variability of the poly(A) addition site has been previously reported for other genes of T. thermophila (Liu and Gorovsky 1993), including SerL (Doerder and Gerber 2000) and the actin controls (C. A. Gerber, J. Bartram and F. P. Doerder, unpublished data). Whether increased stability of SerH5 mRNA is of adaptive significance is not known.
Figure 7. Characterization of the 3′-UTR of SerH alleles. AUUUA motifs are boxed. UGUGU sequence is marked with solid lines, and repeated UGAUU is indicated with arrows. Poly(A) addition sites are indicated by boldface italics; numbers of transcripts sequenced are 2, 9, 3, 2, and 1, respectively.
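The motif arrangements discussed above (single AUUUA pentamers, overlapping or closely spaced copies, and the AUUUAXXAUUUA doublet) can be located in a 3′-UTR by a simple sequence scan. The following Python sketch is illustrative only: the function names, the two-nucleotide spacing cutoff, and the example sequence are assumptions made for the illustration and are not SerH data or the method used in this study.

```python
import re

MOTIF = "AUUUA"

def normalize(seq):
    """Uppercase the sequence and convert DNA (T) to RNA (U)."""
    return seq.upper().replace("T", "U")

def find_auuua(utr):
    """0-based start positions of every AUUUA pentamer.

    The lookahead lets overlapping copies (e.g., AUUUAUUUA) both be reported.
    """
    return [m.start() for m in re.finditer(r"(?=AUUUA)", normalize(utr))]

def find_doublets(utr, spacer=2):
    """Start positions of AUUUA-(N x spacer)-AUUUA doublets (AUUUAXXAUUUA by default)."""
    pattern = r"(?=AUUUA[ACGU]{%d}AUUUA)" % spacer
    return [m.start() for m in re.finditer(pattern, normalize(utr))]

def close_pairs(positions, max_gap=2):
    """Consecutive motif pairs separated by at most max_gap nucleotides.

    gap is the number of nucleotides between the end of the first motif and
    the start of the second; a negative gap means the two motifs overlap.
    max_gap=2 is an arbitrary cutoff chosen for this illustration.
    """
    pairs = []
    for a, b in zip(positions, positions[1:]):
        gap = b - (a + len(MOTIF))
        if gap <= max_gap:
            pairs.append((a, b, gap))
    return pairs

# Hypothetical 3'-UTR fragment (not a SerH sequence).
example_utr = "GGAUUUACCAUUUAGGGGAUUUAUUUACC"

if __name__ == "__main__":
    starts = find_auuua(example_utr)
    print("AUUUA starts:", starts)
    print("AUUUAXXAUUUA doublets:", find_doublets(example_utr))
    print("close or overlapping pairs:", close_pairs(starts))
```

Applied to aligned 3′-UTRs, a scan of this kind would show at a glance which alleles carry the doublet and which, like H5, do not.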
The apparent functional equivalency of multiple SerH alleles raises the question as to whether differences between SerH and its paralogs are the result of positive selection. Two observations suggest the action of positive selection but are not definitive: the J i-ag encoded by SerJ (which is expressed under H conditions) is expressed epistatically over H, and the frequencies of cells expressing H and J vary inversely in ponds (Saad and Doerder 1995). Among the nine T. thermophila i-ag families, only two have been sequenced in addition to SerH: SerJ with one expressed locus and SerL with five expressed loci. J has four repeats, each containing ~100 amino acids and 10 cysteines, whereas L has two, five, or six repeats, each containing ~60 amino acids and 6 cysteines (Doerder 2000; Doerder and Gerber 2000). In each case the cysteine pattern is (CXshort CXlong)N, and the ratio of repeat length to the number of cysteines is 10. This suggests that the same general structure is formed, at least with respect to repeats. However, there are structural differences that might be related to selection for different roles for these antigens. The L i-ags are smaller than H and the CXXC pattern of H (and J) is replaced by CXC in L. Additionally, whereas the H i-ags have ~100 amino acids in the amino-terminal region prior to the first repeat, in J and L the repeats begin at the end of the putative ER signal peptide; in other words, there is no amino-terminal region in J and L, only repeats. This longer amino-terminal region of H has 10 cysteines as in the J repeats, but the cysteines do not alternate between short and long stretches of amino acids as do J repeats, suggesting a different structure at the end of the molecule likely most exposed to the environment. Whether these differences among paralogous i-ags are of adaptive significance requires additional investigation.
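The periodic cysteine pattern and the length-to-cysteine ratio described above can be checked with a few lines of code. The Python sketch below is a minimal illustration under stated assumptions: the toy repeat is invented to show CXXC spacing and is not an i-ag sequence, and the threshold separating "short" from "long" spacers is an arbitrary choice.

```python
def cysteine_gaps(protein):
    """Number of residues between each pair of consecutive cysteines."""
    positions = [i for i, aa in enumerate(protein.upper()) if aa == "C"]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def alternates_short_long(gaps, threshold=3):
    """True if gaps strictly alternate between short (<= threshold) and long.

    threshold=3 is an assumed cutoff for this illustration only.
    """
    is_short = [g <= threshold for g in gaps]
    return all(x != y for x, y in zip(is_short, is_short[1:]))

def residues_per_cysteine(repeat_length, n_cysteines):
    """Ratio of repeat length to cysteine count."""
    return repeat_length / n_cysteines

# Hypothetical toy repeat: CXXC, a long spacer, CXXC, a long tail.
toy_repeat = "CAAC" + "A" * 8 + "CSSC" + "T" * 8

if __name__ == "__main__":
    gaps = cysteine_gaps(toy_repeat)
    print("gaps between cysteines:", gaps)
    print("alternating short/long:", alternates_short_long(gaps))
    # Approximate values from the text: J repeats (~100 aa, 10 Cys), L repeats (~60 aa, 6 Cys).
    print("J ratio:", residues_per_cysteine(100, 10))
    print("L ratio:", residues_per_cysteine(60, 6))
```

Running the same gap calculation on the amino-terminal region of an H i-ag would be expected to fail the alternation test, in line with the contrast to the J repeats drawn above.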
An important point concerns the tandem repeats characteristic of ciliate i-ags. In some instances the repeats are identical or nearly so, even at the nucleotide level. In the case of most Paramecium i-ags, which have from 31 to 38 repeats, the 2 or 3 central-most repeats are identical. A similar situation is observed with the large (6-repeat) L i-ag of T. thermophila. With the availability of allelic variants for both T. thermophila (SerH) and P. primaurelia (G) it is clear that repeat identity is greater within alleles than between alleles. Indeed, for the (G) alleles, the regions of identical repeats are the most divergent regions of the two alleles (Prat 1990). The identity of repeats and their sometimes variable number, as among the L i-ag paralogs, suggest a recombinogenic expansion mechanism. Gene duplication followed by unequal crossing over could produce variable numbers of repeats, and it could generate repeats with different numbers of cysteines. In this context it is important to note that there is evidence of allele-specific i-ag pseudogenes in both T. thermophila (F. P. Doerder, unpublished data) and Paramecium (Breuer et al. 1996). The Tetrahymena genome project should help provide additional insight. It should be noted that imperfect tandemly repeated elements with periodicity of even numbers of cysteines are found in surface proteins of a wide variety of protists, including parasitic forms, suggesting not only related function but also related mechanisms generating repeated elements.
Finally, these results clearly show that multiple SerH alleles are simultaneously present in ANF ponds. The recovery of SerH1 and SerH3 from the same pond near Woods Hole, Massachusetts, is consistent with this observation. In addition, isolates immobilized by anti-H1 were found in Illinois and isolates immobilized by anti-H3 have been found in Illinois and Michigan (Allen and Gibson 1973). The widespread distribution and simultaneous presence in ponds of multiple SerH alleles parallel the presence of what appear to be multiple mating type alleles in the same ponds (Arslanyolu and Doerder 2000). In both cases, there are numerous similar, but distinct, alleles. Such extensive polymorphism raises intriguing questions about population size and geographic history of T. thermophila. For example, the most closely related SerH alleles, SerH3 and SerH6 (Figure 5D), are from geographically distant populations with virtually no chance for natural gene flow. Similarly, the related SerH4 and SerH5 alleles are also from geographically distant locations. The use of additional markers, such as polymorphism for the presence or absence of the self-splicing intron of rDNA (F. P. Doerder and D. Coulton, unpublished data) and more extensive collecting, may permit reconstruction of T. thermophila's recent past.
Acknowledgments
We thank Dr. Harry van Keulen for technical advice and Bob Krebs for discussion and comments on the manuscript. We also thank Dr. Joe Deak for his constructive suggestions during the course of this work. Dr. Ted Clark first called our attention to the SerH3 sequencing error. Preliminary sequencing of SerH6 was done by Paul Sweeny. The actin control was prepared by James Bartram. Christina Merwin-Gerber assisted with characterization of poly(A) addition site variability of SerH mRNA. This work was supported by the Cleveland State University College of Graduate Studies and National Institutes of Health grant GM-55887.
Footnotes
- Communicating editor: S. L. Allen
- Received November 17, 2001.
- Accepted January 25, 2002.
- Copyright © 2002 by the Genetics Society of America