Abstract
A safe and effective HIV-1 vaccine is urgently needed to control the worldwide AIDS epidemic. Traditional methods of vaccine development have been frustratingly slow, and it is becoming increasingly apparent that radical new approaches may be required. Computational and mathematical approaches, combined with evolutionary reasoning, may provide new insights for the design of an efficacious AIDS vaccine. Here, we used codon-based substitution models and maximum-likelihood (ML) methods to identify positively selected sites that are likely to be involved in the immune control of HIV-1. Analysis of subtypes B and C revealed widespread adaptive evolution. Positively selected amino acids were detected in all nine HIV-1 proteins, including Env. Of particular interest was the high level of positive selection within the C-terminal regions of the immediate-early regulatory proteins, Tat and Rev. Many of the amino acid replacements were associated with the emergence of novel (or alternative) myristylation and casein kinase II (CKII) phosphorylation sites. The impact of these changes on the conformation and antigenicity of Tat and Rev remains to be established. In rhesus macaques, a single CTL-associated amino acid substitution in Tat has been linked to escape from acute SIV infection. Understanding the relationship between host-driven positive selection and antigenic variation may lead to the development of novel vaccine strategies that preempt the escape process.
DEVELOPMENT of an efficacious acquired immune deficiency syndrome (AIDS) vaccine is a public health priority (Check 2003; Klausner et al. 2003). Studies in macaques challenged with simian immunodeficiency virus (SIV) have shown that it is possible to contain and prevent infection (Daniel et al. 1992; Hirsch et al. 1994; Dunn et al. 1997). However, these studies have been ambivalent, and the cellular and humoral responses needed to elicit protective immunity have not been well defined. In addition, little is known about the immunogenicity of different subgenomic regions of SIV or about the epitopes that are likely to elicit a potent and sustained immune response. To date, the only successful retroviral vaccine has been one targeted against the transmitted variant of feline leukemia virus (Hoover et al. 1991).
Many groups have sequenced HIV-1 and defined cytotoxic T-lymphocyte (CTL) epitopes that are expressed during disease progression (Borrow et al. 1997; Rosenberg et al. 1997; Novitsky et al. 2001). However, it has been difficult to obtain information on acute phase viruses and the immune responses they elicit, primarily because most patients are not diagnosed during primary infection. Studies of rhesus macaques, inoculated with a single cloned variant of SIV (Allen et al. 2000), indicated that wild-type virus predominated during the first 2 weeks of infection. This was followed by a sharp decline in plasma viremia coincident with the emergence of Tat-specific CTLs. By 4 weeks postinfection, the first escape mutants were detected and, by 8 weeks, wild-type virus was completely replaced with Tat escape variants (Allen et al. 2000; O'Connor et al. 2001).
Much less is known about the generation of CTL escape mutants in humans exposed to multiple variants of HIV-1 in genital secretions or blood (Goulder et al. 1997; McMichael and Phillips 1997; Price et al. 1997; Delwart et al. 1998; Karlsson et al. 1998; McMichael 1998). Studies of viral kinetics indicate that HIV-1 replicates to high titer during the first week of infection, reaching peak viremia at ∼3 weeks (Mellors et al. 1997). During this time, prior to the induction of CTLs, the dominant forces acting on HIV-1 are likely to be related to viral fitness, replication capacity, and the adaptive potential of the virus in the new host (Overbaugh and Bangham 2001). In most patients, peak viremia is followed by a rapid decline in plasma HIV-1 RNA 6–8 weeks postinfection and, ultimately, by stabilization at a level referred to as the viral set point (Mellors et al. 1996).
Evidence suggests that the decline in HIV-1 viremia and the control of persistent infection are mediated by CTLs (Borrow et al. 1997; Rosenberg et al. 1997). By analogy with SIV (Addo et al. 2001; O'Connor et al. 2001), it would be expected that by 8 weeks postinfection, transmitted variants of HIV-1 would be completely replaced with escape viruses that have evaded the initial CTL response. Identification of early escape variants or, perhaps more importantly, identification of wild-type (pre-escape) variants may provide important new insights for vaccine design. The mechanisms underlying the escape process are not fully understood (da Silva and Hughes 1998; Yusim et al. 2002), but are likely to be complex and involve a number of different steps. In addition to changes in antigenicity, these steps may include alterations in proteasomal cleavage, TAP-mediated translocation, human leukocyte antigen (HLA) binding, and T-cell receptor recognition (Pamer and Cresswell 1998; Abele and Tampe 1999; Bochtler et al. 1999; Wilson et al. 1999).
In this study, we used codon-substitution models to measure selection pressures along the length of the HIV-1 genome and to search for positively selected amino acids that may play an important role in the escape from host immunity. The application of positive selection models to vaccine design was first suggested in 1998 (Nielsen and Yang 1998). Using this approach, widespread positive selection in the HIV-1 genome has been detected both at the interpatient level (Yang 2001; Yang et al. 2003) and within the same patient over time (Zanotto et al. 1999). Our studies confirm and extend these findings.
MATERIALS AND METHODS
HIV-1 sequence data sets:
Analyses were performed on a total of 81 full-length sequences from subtypes B (n = 27) and C (n = 27) and group M (n = 27) viruses, representing subtypes A–K. These representative sequences, which were downloaded from the Los Alamos HIV database (http://hiv-web.lanl.gov), are described in detail elsewhere (de Oliveira et al. 2003a). To rule out the possibility of intersubtype recombination, only sequences classified as nonrecombinant were included in the analyses (Anisimova et al. 2003; Yang et al. 2003). To avoid the introduction of insertions and deletions, nucleotide sequences representing multiply spliced early regulatory (tat, rev, nef), singly spliced (env, vif, vpr, vpu), and unspliced structural (gag, pol) genes were aligned against their predicted amino acid sequence using a CLUSTAL algorithm implemented in DAMBE (Xia and Xie 2001). Similar alignments were constructed for the translated amino acid sequences. The alignments were manually edited using the Genetic Data Environment for Linux interface (de Oliveira et al. 2003b).
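The idea of aligning nucleotide sequences against their predicted amino acid sequences can be illustrated with a minimal sketch. It assumes that a gapped protein alignment row is already available and simply threads the ungapped coding sequence onto it, so that every gap spans a whole codon. The function name `thread_codons` and the toy sequences are hypothetical; this is not the DAMBE implementation.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an ungapped coding sequence onto one gapped protein alignment row.

    Each residue consumes one codon; each protein gap becomes a three-base gap,
    so indels never break the reading frame.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length must be 3x the residue count")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


# Toy example (not real HIV-1 data): one gap in the protein row.
print(thread_codons("M-EP", "ATGGAGCCA"))  # ATG---GAGCCA
```

Keeping gaps in multiples of three is what allows codon-based models to be applied to the resulting nucleotide alignment.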
Phylogenetic analysis and tree building:
Separate analyses were performed on each individual gene, including both distance and maximum-likelihood (ML) methods. The best-fitting nucleotide substitution model was evaluated using a hierarchical likelihood-ratio test (LRT) implemented in MODELTEST 3.0 (Posada and Crandall 1998). The ML trees for complete genomes and individual genes were obtained by implementing a heuristic search with tree bisection reconnection branch swapping. Neighbor-joining trees were constructed using the Felsenstein 84 model and used in the codon-selection analysis. Phylogenetic analyses were performed with the PAUP* 4.0b10 program (Swofford 2002).
Analysis of selection pressure:
Positive selection was assessed using four different codon-based ML substitution models (Yang et al. 2000): M0 (one-ratio), M1 (neutral), M2 (selection), and M3 (discrete). All models were implemented in the Codeml program of the PAML software package (Yang 1997). Analyses were performed using the discrete model (M3) with three dn/ds (ω) classes. Such models allow ω to vary among sites by defining a set number of discrete site categories, each with its own ω value. Through maximum-likelihood optimization, it is possible to estimate the value for ω and for p, the fraction of sites in the aligned data set that falls into a given category. Finally, the algorithm calculates the a posteriori probability that each codon belongs to a particular site category. Using the M3 model, sites with a posterior probability exceeding 90% and an ω value >1.0 were designated as being “positive selection sites” (Yang et al. 2000). Since these models are nested, with M3 being the most complex and M0 the least complex, it is possible to evaluate the best-fitting model for the data using the LRT (Anisimova et al. 2001). Comparison of M0 with M3 is a test of site rate variation; comparison of M1 with M2 is a test for positive selection.
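The posterior classification step described above can be sketched with Bayes' rule: the posterior probability that a codon belongs to class k is proportional to the class proportion p_k multiplied by the likelihood of the site data under ω_k. The sketch below assumes that the per-class site likelihoods have already been computed by the ML program; the function names and all numbers are hypothetical, and summing the posterior over every class with ω > 1 is one plausible reading of the 90% criterion rather than a statement of what Codeml does internally.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each omega class at one codon site."""
    weighted = [p * lk for p, lk in zip(proportions, site_likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


def is_positively_selected(proportions, omegas, site_likelihoods, cutoff=0.90):
    """Flag a site when the posterior mass on classes with omega > 1 exceeds cutoff."""
    post = site_class_posteriors(proportions, site_likelihoods)
    mass = sum(pk for pk, w in zip(post, omegas) if w > 1.0)
    return mass > cutoff


# Hypothetical M3 estimates with three classes (illustrative values only).
p = [0.6, 0.3, 0.1]        # class proportions p1, p2, p3
omega = [0.1, 0.9, 4.0]    # dn/ds for each class
lik = [1e-6, 2e-6, 2e-4]   # likelihood of one site's data under each class

print(site_class_posteriors(p, lik))
print(is_positively_selected(p, omega, lik))
```

With these toy inputs almost all of the posterior mass falls on the third class, so the site would be flagged.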
Reconstruction of common ancestors:
A rooted tree of n taxa contains n − 1 internal nodes. Ancestral sequences at the internal nodes of each of the nine proteins in the B and C data sets were reconstructed by maximum likelihood using codon models selected by the LRT method. Reconstructed ancestral sequences were saved and translated into their corresponding amino acid sequences. The gp160 envelope glycoprotein was the most difficult to analyze due to the presence of hypervariable regions containing multiple insertions and deletions (indels). To facilitate analysis of gp160, sequences were aligned using glycosylation, myristylation, and protein kinase sites as anchors. To investigate the possibility of alternative coalescence events, other than those depicted by the ML trees constructed in PAML, suboptimal ML trees were also reconstructed using the Bayesian algorithm implemented in MRBayes software (Huelsenbeck and Ronquist 2001). Ancestral sequences of the trees were also reconstructed and saved for the prediction of escape epitopes.
Identification of escape epitopes:
To search for potential escape epitopes, genomic regions containing a large number of positively selected sites were analyzed together with ancestral sequences. The sequences were aligned, translated, and analyzed for differences between sampled strains and their reconstructed ancestral sequences. To identify new peptide sequences that were not present in the sampled strains, ancestral sequences were analyzed using a 10-amino-acid sliding window incremented one codon at a time. Whenever a reconstructed 10-amino-acid ancestral peptide was not present in the external branches of the tree, the sampled sequence was saved as a possible novel epitope. Novel amino acid peptides were screened against previously identified epitopes using two predictive software programs, SYFPEITHI (Rammensee et al. 1999) and Epimap from the Los Alamos HIV Seq.Db (Brander and Goulder 1999), for amino acid composition and binding properties.
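The sliding-window search can be sketched as follows. The sketch assumes ungapped amino acid strings and treats a window as novel when the exact 10-mer occurs nowhere in any sampled sequence; the function names and the toy sequences are hypothetical, and the subsequent screening against known epitopes is not shown.

```python
def windows(seq, size=10):
    """All overlapping peptides of the given size, advancing one residue at a time."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def novel_ancestral_peptides(ancestor, sampled, size=10):
    """Return (0-based start, peptide) for ancestral windows absent from every sampled sequence."""
    seen = {w for s in sampled for w in windows(s, size)}
    return [(i, pep) for i, pep in enumerate(windows(ancestor, size)) if pep not in seen]


# Toy strings: the sampled sequence differs from the ancestor at the first residue only.
ancestor = "MEPVDPRLEPWK"
sampled = ["AEPVDPRLEPWK"]
print(novel_ancestral_peptides(ancestor, sampled))  # [(0, 'MEPVDPRLEP')]
```

Only the window that covers the differing residue is reported; the two downstream windows are identical in both sequences.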
RESULTS
Positive selection and amino acid variability:
ML methods were used to assess amino acid variation and identify targets of positive selection. Significant differences were observed in the number and distribution of positively selected variants, among both different HIV-1 proteins and different regions of the same protein. Table 1 describes the fraction of sites (p1, p2, p3) in each protein that were under positive (diversifying) selection, along with the respective ω (dn/ds) values for each category of the M0, M1, M2, and M3 models. Table 1 also shows the results of the LRT comparing M3 with M0 and M2 with M1. Using LRT, the M3 (discrete) and M2 (positive) models were selected (P < 0.001) for all proteins of subtypes B and C, providing evidence of varying selection pressure at individual sites across the HIV-1 genome. The M3 and M2 models were also accepted for group M viruses. However, when compared to subtypes B and C, the number of positively selected sites in the group M data set was substantially higher. Several sites in the M group mapped to signature sequences identified by the VESPA program (Korber and Myers 1992), suggesting that observed variation was due to subtype constraints, rather than to immune selection pressure. This interpretation is consistent with other studies that have shown a decrease in power when applying selection models to highly divergent data sets (Rambaut et al. 2004).
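For nested models, the LRT statistic is twice the difference in log-likelihood, compared against a chi-square distribution. The sketch below is a minimal illustration that assumes 4 degrees of freedom for M3 (three classes) versus M0 and 2 for M2 versus M1, the parameter differences usually quoted for these model pairs but not stated in the text. The log-likelihood values are hypothetical, and the closed-form tail probability used here is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return the LRT statistic 2*(lnL_alt - lnL_null) and its P value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical log-likelihoods for one gene (not values from Table 1).
stat, p_value = likelihood_ratio_test(lnl_null=-5000.0, lnl_alt=-4950.0, df=4)
print(stat, p_value < 0.001)
```

A statistic of this size would reject the simpler model at P < 0.001, the threshold reported above.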
Parameter estimates under three models of variable ω (dn/ds) among sites
One of the most unexpected findings was the high frequency of positively selected variants in the early regulatory proteins, Tat and Rev. Overall, 30.3% of Tat codons in subtype C and 18.4% in subtype B had dn/ds values ≥2.0. The corresponding values for Rev were 18.4% for subtype C and 29.2% for subtype B. Lower levels of variability were observed for Vif, Env, Vpr, Vpu, and Nef, with the proportion of codons having dn/ds ≥ 2.0 ranging from 1 to 15%. Although Env and Vpu are generally considered to be the most variable HIV-1 proteins, the relative proportion of positively selected sites was greater in Tat and Rev. Many of the sites in Env were localized near regions that contained inserted or deleted codons. To avoid bias, these indels were removed from the analysis, making these regions uninformative. In Vpu, a large proportion of codons (27.3% in subtype B, 24.5% in subtype C) were under positive selection, but the selection intensity was relatively low, with 20.8% of B and 23.1% of C codons having dn/ds values ≤1.10 and 1.68, respectively. Only 1.3% of Vpu codons in subtype C and 6.5% in B were strongly selected with dn/ds values of 8.78 and 5.16, respectively. A similar pattern was observed for Nef. The least-variable protein was integrase with no sites in subtype C and 1.9% of sites in subtype B having dn/ds ≥ 2.0.
Phylogenetic analysis of the ancestral sequences:
Figure 1 is a representative tree constructed from 27 full-length subtype C and 8 reconstructed ancestral sequences. The sequences fell into 8 distinct sublineages, representing strains from India, southern Africa, Ethiopia and Israel, and Brazil. This pattern was supported by high bootstrap values (>75%) and high-score ML trees and by phylogenetic analyses of both nucleotide and deduced amino acid sequences. The nucleotide diversity for the complete alignment was 7.8%. As previously reported by Gaschen et al. (2002), the average distance between a given contemporary sequence and its most recent common ancestor (MRCA) was approximately one-half the sublineage diversity. The mean divergence among sequences in the same sublineage ranged from 4.3% among Brazilian viruses to 4.4% for Indian, 7.8% for Ethiopian and Israeli, and 8.0% for viruses from southern Africa. The average deviation between a given HIV-1 sequence and its most proximal ancestor was 2.2% for Brazilian, 2.6% for Indian, 3.9% for Ethiopian and Israeli, and 4.7% for African viruses. All of the following analyses focus on the early regulatory proteins Tat and Rev.
Representative phylogenetic tree derived from 27 full-length and 8 ancestral subtype C sequences. Ancestral sequences were reconstructed using ML methods implemented in PAML. Eight sublineages, representing subtype C strains from India, Africa, Ethiopia/Israel, and Brazil were identified. For each sublineage, the MRCA segregated as the basal (internal) nodes (labeled as India, Africa 1–5, ET-IS, and BR MRCAs). An indication of the degree of sequence dissimilarity between contemporary and ancestral sequences is shown by the branch length on the horizontal axis.
Pattern of amino acid variability:
The distribution of amino acid variants in the Tat and Rev proteins of subtypes B and C is shown in Figure 2. For both proteins, positively selected amino acids were concentrated primarily in the C-terminal regions, while neutral and negatively selected codons predominated in the conserved functional domains near the N termini of the proteins. Overall, 30 codons in Tat and 17 in Rev were under positive selection in subtype C viruses. The corresponding values for subtype B were 19 and 32, respectively. The detection of fewer positively selected sites in the Rev protein of subtype C was not unexpected, given the truncated nature of the C protein (Pollard and Malim 1998). A total of 21 of the positively selected codons, 12 in Tat and 9 in Rev, were common to both subtypes, suggesting that these amino acids are frequent targets of host selection pressure.
Correlation of positive selection with functional domains of Tat and Rev (Figure 2), including MRCA: consensus sequences for subtypes C and B (Cons.subC and Cons.subB, respectively); subtype B (HXB2) and subtype C (98.ZA.F12) strains; amino acids under positive selection with dn/ds > 2.0, Pos Select (*); CTL epitopes, CTL (C); novel epitopes, novel epit (N), present in reconstructed ancestral sequences but not in the pool of contemporary strains; nuclear localization signal plus transactivator signal, NLS + TAR (▵); NES (θ); cysteine-rich CTL epitopes (CTL-B) disulfide bond region, Cys Rich (∞); Sp1-binding site (∩); HAT (Ø); RGD cell attachment site, cell attch (▴); casein kinase phosphorylation sites, CKII (•); protein kinase C phosphorylation sites, PKC (♦); myristylation sites, MYRISTYL (□), and MRCA signature mutations that differ between subtypes B and C, Mut sub.BxC. Reading frames (+1, +2, and +3) of the overlapping regions of Tat, Rev, and Env are shown at the bottom.
Correlations among highly conserved functional domains, CTL epitopes, and amino acid variability:
Relatively few positively selected variants were detected at the N terminus of Tat between codons 1 and 57, a region that contains the functionally important minimal activation domain, the cysteine-rich disulphide bond region, the nuclear localization signal (NLS), the TAR- and Sp1-binding sites, and the histone acetyltransferase (HAT) domain (Jeang et al. 1999). In addition to being conserved and negatively selected, this region contains several experimentally defined CTL epitopes. Limited variation was tolerated within the disulphide bond region, but not within the essential cysteine residues at codons 22, 25, 27, 30, 34, and 37. Similarly, few positively selected variants were detected in the N-terminal portion of Rev. This region, which overlaps the C terminus of Tat, contains the high-affinity, arginine-rich binding site TRQARRNRRRRWRERQR, which functions as a nuclear import signal, a multimerization domain, and an RRE-binding domain (Pollard and Malim 1998). Despite a high density of CTL epitopes, the N terminus of Rev between codons 1 and 49 was relatively invariant, a finding that presumably reflects the structural and functional constraints of this region.
Correlations among CKII phosphorylation domains, amino acid variation, and CTL epitopes:
A striking finding was the high density of phosphorylation motifs at the C terminus of Tat. Five putative casein kinase II (CKII) sites, codons 61–64, 77–80, 82–85, 93–96, and 95–98, were prevalent in C, but not in B, viruses. Three of these sites (at positions 61–64, 82–85, and 95–98) were highly conserved with prevalence rates ranging from 88.9 to 92.6%. Variation at these conserved sites was restricted primarily to the spacer (x), rather than to the functionally important serine/threonine (S/T) and aspartic/glutamic acid (D/E) residues. In two viruses that lacked a CKII motif at codons 82–85, an alternative CKII site was detected downstream: the first at codons 83–86, the second at codons 87–90. Less-conserved CKII sites were identified at codons 77–80 and 93–96 in 66.6 and 51.0% of C viruses, respectively. At these sites, nonsynonymous mutations were tolerated in the serine/threonine (77T, 93S) and glutamic/aspartic acid (80D, 96E) residues. Interestingly, the RGD (arginine, glycine, aspartic acid) cell attachment site of the Tat protein was embedded within the CKII motif at positions 77–80. Nonsynonymous mutations in the arginine (R) and aspartic acid (D) residues of the CKII motif lead to elimination of the RGD site. The linked nature of these overlapping regions is shown in Figure 3.
Representative schematic showing overlap between Tat (positions 77–86) and Rev (positions 31–40). Row 1, line 1 shows the South African strain (ZA.TV001c8); line 2, the nucleotide sequence; and line 3, mutations detected at the first, second, and third codon positions. Row 2 shows amino acid mutations; row 3, whether the amino acid is positively selected; and row 4, the localization and prevalence of CKII, RGD attachment, myristylation, and RRE-binding sites. The two positively selected sites in Rev were located outside the myristylation site. Within the domain, with the exception of a single glycine-33 substitution, all mutations were synonymous or were accepted without disrupting the myristylation motif. In the overlapping region of Tat, these mutations were associated with the elimination of the RGD site and the CKII sites at positions 77–80 and 82–85. In two viruses, an alternative CKII site was detected at either position 83–86 or position 93–96 (data not shown). Changes in the third position of Tat codon 85 were limited to A-to-G mutations, substitutions that retained the basic nature (R, K) of the RRE at Rev codon 39.
CKII phosphorylation sites in Rev were also under positive selection. In subtype B, the most common CKII was located at codons 8–11 near the N terminus of the protein. Despite selection pressure on position 11, most mutations were synonymous or involved the replacement of glutamic acid with aspartic acid, substitutions that preserved the CKII motif. The majority (88.8%) of C viruses lacked this CKII site due to an alanine substitution at codon 11. Instead, the predominant CKII site in C viruses was located within the multimerization domain adjacent to the RRE-binding site at codons 54–57. Mutations (85.2%) at serine-54 in C viruses were often synonymous or involved the substitution of serine with threonine, leading to preservation of this serine-based CKII motif. In contrast, selection pressure on serine-54 and D/E-57 residues of B viruses led to nonsynonymous mutation and disruption of the CKII motif in 60% of isolates. Positive selection was also observed within the leucine-rich nuclear export signal (NES) of Rev (Pollard and Malim 1998), especially in C viruses. The frequent replacement of leucine-4 and -9 in the NES sequence LPPLERLTL suggests that C viruses may be interacting with nuclear export proteins that are different from those used by subtype B. Other features of subtype C included the deletion of amino acids 108–116 at the extreme C terminus of Rev and the presence of multiple overlapping myristylation motifs between codons 89–105, immediately upstream from the deletion. Despite positive selection pressure, the 108–116 deletion was not detected in subtype B and only 20% of B viruses carried the extended myristylation motif.
Constraints imposed by overlapping reading frames:
In total, 22 (73.3%) of the positively selected Tat codons in subtype C were situated in a region that overlaps the N terminus of Rev. Of these, 14 (63.6%) were localized in a reading frame that also overlaps with Env. In this region, nucleotide changes in tat would be expected to affect not only Tat but also the overlapping segment of Rev and Env. Conversely, nucleotide substitutions in rev would be expected to have an impact on the overlapping regions of Tat and Env. As an example, the glycine residue at position 33 of Rev was a frequent target of antibody and CTL reactivity (Figure 3). In total, eight substitutions were detected at this position, seven of which were synonymous mutations in the third codon position. These GGG → GGA mutations had no impact on the myristylation site in Rev, but caused a nonsynonymous aspartic acid (GAC) to asparagine (AAC) mutation in Tat, a change that eliminates the CKII site at codons 77–80.
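The effect of a single nucleotide change on two overlapping reading frames can be illustrated with a toy sequence. The nine-base string below is hypothetical (it is not the HIV-1 sequence); it is constructed so that the third position of a GGG codon in one frame is also the first position of a GAC codon in a second frame, mirroring the GGG → GGA and GAC → AAC changes described above. The function name `translate` and the frame offsets are assumptions made for this sketch.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(seq, offset=0):
    """Translate complete codons of seq, starting at the given 0-based offset."""
    usable = len(seq) - (len(seq) - offset) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(offset, usable, 3))


wild = "GAAGGGACA"                    # frame 0: GAA GGG ACA; frame 2: AGG GAC
mutant = wild[:5] + "A" + wild[6:]    # G -> A at the shared position

print(translate(wild, 0), translate(mutant, 0))  # EGT EGT  (synonymous)
print(translate(wild, 2), translate(mutant, 2))  # RD RN    (Asp -> Asn)
```

The same substitution is silent in the first frame and replaces aspartic acid with asparagine in the second, which is how selection on one protein can alter a motif in the overlapping protein.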
Prediction of novel peptide sequences:
A sliding window method was used to search for novel peptide sequences that were present in ancestral sequences, but absent from sampled contemporary sequences. A total of 771 peptides (including 589 Env, 111 Nef, 39 Tat, and 32 Rev sequences) in which one or more of the 10 amino acids in the ancestral sequence differed from the sampled sequence were identified. All of these peptides were localized in regions of positive selection. Several Env peptides were located adjacent to areas of insertions or deletions, suggesting that indels may contribute to the creation of new antigenic sites (data not shown). A high proportion (75%) of newly identified epitopes were localized internally at the ancestral nodes of the tree.
DISCUSSION
Understanding how HIV-1 defeats the human immune system is critical to the design of an efficacious AIDS vaccine. A number of recent studies have demonstrated the importance of CTL responses in controlling HIV-1 (and SIV) replication, during both acute and chronic phases of infection. However, it is also known that a single-nucleotide mutation within an immunodominant CTL epitope can lead to viral escape from CTLs, increased viral replication, and clinical disease progression (Allen et al. 2000; Barouch et al. 2002). These findings suggest that CTLs exert significant positive selection pressure and that our ability to predict (and prevent) immune escape will depend on an improved understanding of the relationships between host selection pressure, viral evolution, and antigenic variation. Traditional approaches to this problem have involved extensive sequencing of viral diversity and mapping of CTL epitopes, followed by a long development and testing cycle of new vaccine candidates. In addition to being labor intensive and expensive, it is difficult to extend this experimental approach to patients with acute infection, since such patients are rarely sampled during the acute phase.
As a result of the rapid accumulation of sequence data, combined with advances in codon-based substitution models and ancestral reconstruction methods, it is now possible to begin searching for evolutionary patterns that may be relevant to vaccine design. These methods can be applied to the entire HIV-1 genome and to large numbers of pooled sequences collected from patients with the same, or different, HLA types. Through a process of ancestral reconstruction, it may also be possible to identify wild-type ancestral sequences and to reconstruct CTL escape pathways.
As an example, we recently analyzed a set of published sequences collected during a time-course study of acute SIV infection in rhesus macaques inoculated with a single cloned variant of SIVMAC 239 (Allen et al. 2000; O'Connor et al. 2001). In these studies, escape from acute SIV infection was associated with a single-amino-acid substitution in the Tat-specific epitope, SL8. Ancestral analysis of serial SL8 sequences collected from different macaques and from the same macaque sampled over time at 2, 4, 6, and 8 weeks postinfection identified the inoculating variant of SL8, STPESANL, as the MRCA, even when this sequence was no longer present in the postinoculum specimen (data not shown).
In the HIV-1 setting, it may be possible to measure the intensity of the selection pressure exerted on individual amino acid sites and to preselect a small number of epitopes that are strongly selected and warrant further experimental investigation. These sequences could then be used to design a multi-epitope vaccine directed against regions of the virus that are unable to mutate and escape immune recognition. By inducing a mucosal immune response to these epitopes, prior to infection, it may be possible to prevent the initial establishment of infection or to reduce the level of peak viremia.
Our data confirm and extend previously published findings. In agreement with Yang (2001) and Yang et al. (2003), we detected varying levels of adaptive evolution in all nine HIV-1 genes. One of the most striking findings was the distinct pattern and high concentration of positively selected codons in the C termini of Tat and Rev in regions that lack CTL epitopes and essential functional domains such as the TAR- and RRE-binding domains. Our studies also suggest that the selection pressures directed against these C-terminal regions are likely to be complex and to involve both direct and indirect selection pressures exerted through overlapping reading frames. These findings are particularly intriguing, given that Tat and Rev (along with Nef) are the earliest proteins to be expressed in newly infected cells and that these proteins are the primary determinants controlling the complex, temporally regulated expression of HIV-1. Tat plays a major role in the upregulation of HIV-1 gene expression; Rev controls the switch from chronic abortive infection to full-length mRNA expression and productive infection. Since many of the selection pressures exerted on the C terminus of Tat are likely to also impact on Rev (and vice versa), our findings suggest that the coordinated, sequential expression of these two proteins may be regulated by antigenic variation induced by complex interactions with the host immune system and/or by interactions with other proteins and regulatory factors in the intracellular milieu. Flexibility at the C termini of Tat and Rev, as shown by the emergence and relocation of CKII and myristylation sites, suggests that these regions are able to tolerate a high level of genetic variation while still retaining their biological properties.
In contrast to highly conserved and functionally constrained domains at the N termini of Tat and Rev (i.e., NLS-, TAR-, and Sp1-binding sites), sequences at the C termini of Tat and Rev would be expected to be more susceptible to host-driven selection pressure.
One could argue that CKII sites [S/T-x(2)-D/E] may not be particularly relevant because of their low complexity and the possibility that these sites may not be phosphorylated in vivo. However, phosphorylation is known to play a major role in the regulation of RNA-binding proteins such as Tat and Rev (Holmes 1996; Meggio et al. 1996; Parada and Roeder 1996; Yang et al. 1996; Fouts et al. 1997; Chun et al. 1998; Marin et al. 2000). Studies have shown that CKII-mediated phosphorylation of serine-54 leads to conformational changes in Rev and rapid, efficient RNA-binding (Fouts et al. 1997). It has also been shown that less pathogenic HIV-2 viruses lack this CKII site. We found that, although most B viruses lacked the serine-54 site, they carried an alternative CKII phosphorylation motif at codons 8–11. In B viruses, phosphorylation of serine-8 has been shown to be important for transactivation of Rev.
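The low complexity of the CKII consensus is easy to see when it is written as a pattern. The sketch below scans a protein string for [S/T]-x(2)-[D/E] using a regular expression with a lookahead, so that overlapping sites are also reported. The function name `ckii_sites` is hypothetical, and a match indicates only that the motif is present, not that the site is phosphorylated in vivo. The example peptide is the SL8 epitope, STPESANL, discussed in this article.

```python
import re

# [S/T]-x(2)-[D/E]; the lookahead lets overlapping motifs be found.
CKII = re.compile(r"(?=([ST]..[DE]))")


def ckii_sites(protein):
    """Return (start, end, motif) tuples with 1-based inclusive positions."""
    return [(m.start() + 1, m.start() + 4, m.group(1)) for m in CKII.finditer(protein)]


print(ckii_sites("STPESANL"))  # [(1, 4, 'STPE')]
print(ckii_sites("SSEDE"))     # [(1, 4, 'SSED'), (2, 5, 'SEDE')]
```

Because any serine or threonine followed three residues later by an acidic residue matches, a one-residue shift can create an alternative site, as in the second toy string.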
Our understanding of Tat phosphorylation is less clear (Parada and Roeder 1996; Yang et al. 1996; Chun et al. 1998). Most studies have shown that Tat enhances the activity of other phosphorylated proteins. However, the detection of five CKII sites (three of which were highly conserved) in a short stretch of amino acids at the C terminus of subtype C suggests that direct phosphorylation (hyperphosphorylation) of Tat may also occur. The conservation of these CKII sites, in the face of strong diversifying pressure, may be due to the fact that many of the substitutions occurred in spacer amino acids (x), although serine ↔ threonine and glutamic ↔ aspartic acid substitutions were also tolerated. At the more variable CKII sites, two in subtype C and one in subtype B, nonsynonymous mutations were also tolerated in the functionally important [S/T] and [D/E] residues. However, the loss of a CKII [or protein kinase C (PKC)] site at one position was frequently associated with the presence of an alternative site at a different location.
The immunological and functional significance of these variations remains to be established. Interestingly, the STPESANL epitope associated with CTL-mediated escape from acute SIV infection (Allen et al. 2000; O'Connor et al. 2001) contains a serine-based CKII motif, [S]-x(2)-[E]. By 8 weeks postinfection, this CKII site had been eliminated from 35% of the SIV escape mutants. Further studies are needed to determine whether STPESANL and other escape epitopes are phosphorylated in vivo and whether phosphorylation/dephosphorylation alters the conformation and antigenicity of these epitopes, facilitating immune escape. Two additional epitopes have been identified. One contains a PKC site; the other maps to the NES (Addo et al. 2001). Both sites were under positive selection, especially in C viruses. Again, additional studies are needed to determine whether these patterns are consistent and whether they reflect different biological properties of the two viral subtypes.
Although codon-based evolutionary methods are still in the early stages of development, our studies suggest that a combination of mathematical and experimental methods will lead to an improved understanding of the mechanisms underlying CTL escape and provide new insights for vaccine design. Such studies may also yield new information on the significance of overlapping reading frames and how these regions contribute to the complexity of host-virus interactions and the regulated, ordered expression of the HIV-1 genome. As suggested by Overbaugh and Bangham (2001), to be informative, evolutionary modeling should be complemented with parallel testing of effector cell populations, including HIV-1-specific CTL and T-helper and B-cell clones to identify which selection processes are most critical to the escape pathway.
Our studies suggest that it may be advantageous to extend the above approach to an analysis of important functional domains. When possible, the analyses should be performed on transmitted variants of HIV-1 collected sequentially during the period of acute seroconversion. Although selection models have proven useful for analyzing HIV-1 from different patients with the same, or different, viral subtypes (Yang 2001; Yang et al. 2003), they are best suited to the analysis of within-host variation. In addition, to avoid problems relating to intersubtype recombination, the analyses should be performed on sequences that are known to be nonrecombinant at the subtype level (Anisimova et al. 2003; Yang et al. 2003). Once important escape patterns have been identified, they can be tested experimentally in the SIV model. Such studies would involve the induction of CTL responses to pre-escape variants, followed by viral challenge. Finally, by using site-directed mutagenesis, it should be possible to elucidate the potential role of phosphorylation in the escape process.
Acknowledgments
This work was supported by program grant no. 061238 from the Wellcome Trust (United Kingdom) and the Flemish Funds voor Wetenschappelijk Onderzoek (FWO grants G.0288.01 and the KAN2002 1.5.193.02, Postdoctoral Onderzoeker contract 530).
Footnotes
Communicating editor: Z. Yang
- Received May 29, 2003.
- Accepted April 5, 2004.
- Genetics Society of America