Abstract
We have studied the genetic polymorphism at 10 Plasmodium falciparum loci that are considered potential targets for specific antimalarial vaccines. The polymorphism is unevenly distributed among the loci; loci encoding proteins expressed on the surface of the sporozoite or the merozoite (AMA-1, CSP, LSA-1, MSP-1, MSP-2, and MSP-3) are more polymorphic than those expressed during the sexual stages or inside the parasite (EBA-175, Pfs25, Pfs48/45, and RAP-1). Comparison of synonymous and nonsynonymous substitutions indicates that natural selection may account for the polymorphism observed at seven of the 10 loci studied. This inference depends on the assumption that synonymous substitutions are neutral, which we test by analyzing codon bias and G+C content in a set of 92 gene loci. We find evidence for an overall trend towards increasing A+T richness, but no evidence for mutation bias. Although the neutrality of synonymous substitutions is not definitively established, this trend towards an A+T-rich genome cannot explain the accumulation of substitutions, at least in the case of four genes (AMA-1, CSP, LSA-1, and Pfs48/45), because the G↔C transversions are more frequent than expected. Moreover, the Tajima test indicates positive natural selection for the MSP-1 and, less strongly, MSP-3 polymorphisms; the McDonald-Kreitman test indicates natural selection at LSA-1 and Pfs48/45. We conclude that there is definite evidence for positive natural selection in the genes encoding AMA-1, CSP, LSA-1, MSP-1, and Pfs48/45. For four other loci, EBA-175, MSP-2, MSP-3, and RAP-1, the evidence is limited. No evidence for natural selection is found for Pfs25.
ELUCIDATING the processes that maintain genetic polymorphism is an issue of considerable interest and contention in population genetics (e.g., Kimura 1983; Gillespie 1991; Ohta 1992, 1996). In the case of Plasmodium falciparum, the agent of malignant malaria, it is also a matter of great clinical and epidemiological importance. Every year, between 300 and 500 million people in the world are infected with malaria; at least one million children under the age of five die each year in sub-Saharan Africa, and more than two billion people are at risk throughout the world (World Health Organization 1995). The design of antimalarial vaccines and the use of antimalarial drugs are hampered by extensive polymorphism in Plasmodium's proteins, particularly those expressed on the parasite's surface, which are obvious targets for the development of highly specific vaccines (McCutchan et al. 1988; Anders and Saul 1994; Kaslow 1994; Conway 1997). Highly polymorphic regions have been observed in the genes encoding surface antigenic proteins such as the circumsporozoite protein (CSP), merozoite surface proteins 1 and 2 (MSP-1 and MSP-2), and the S-antigen (Anders and Saul 1994).
Given that these genes encode antigenic proteins that are recognized by the host's immune system, the observed high levels of heterozygosity and rates of evolution have been attributed to natural selection, an outcome of the accumulation and frequent switch of suitable mutations, by means of which the parasite escapes the host's immune defenses (Hughes 1991, 1992; Anders and Saul 1994; Hughes and Hughes 1995). This interpretation is buttressed by the widespread observation that nonsynonymous nucleotide substitutions are more common than synonymous substitutions (Lockyer et al. 1989; Thomas et al. 1990; Shi et al. 1992a). Synonymous substitutions are likely to be neutral, or nearly so, whereas nonsynonymous substitutions may be functionally constrained and, thus, subject to natural selection (Kimura 1977; Ohta 1996).
The matter is, however, far from settled. It is possible in certain cases to account for an excess of nonsynonymous over synonymous substitutions while assuming that the substitutions are neutral (Sawyer and Hartl 1992; Maynard-Smith 1994). A ratio favoring nonsynonymous substitutions could occur, for example, if the compared sequences are so different from one another that a steady state between forward and backward mutations has been reached, as shown in Neisseria (Maynard-Smith 1994). The genome of Plasmodium falciparum is A+T rich and exhibits a strong codon bias (Hyde and Sims 1987; Musto et al. 1995), circumstances that constrain synonymous substitutions, which will thus accumulate slowly, affecting the observed ratio of synonymous to nonsynonymous substitutions (Sharp and Li 1987; Gillespie 1994). Biases in nucleotide frequencies and in the transition/transversion ratio may affect the estimates of synonymous and nonsynonymous substitutions, as noticed for mitochondrial genes (Ina 1995).
In the present study, we analyze 10 genes that are expressed at different stages of P. falciparum's complex life cycle. These genes encode antigens that are considered candidates for antimalarial vaccines. Our study includes genes that have not been investigated previously, such as those coding for EBA-175, MSP-3, Pfs25, and RAP-1, and includes new sequences for four other genes. We first estimate their polymorphism and investigate whether the synonymous–nonsynonymous substitution rates are consistent with neutrality. We consider, in particular, the effects of A+T content, biased codon use, and the transition/transversion ratio. We then apply the Tajima (1989) and McDonald-Kreitman (MK) tests (McDonald and Kreitman 1991) as additional, reliable methods for ascertaining natural selection. To apply the MK test, we compare P. falciparum with P. reichenowi, its most closely related species (Coatney et al. 1971; Collins and Aikawa 1993; Escalante and Ayala 1994; Escalante et al. 1995), which is parasitic to chimpanzees.
MATERIALS AND METHODS
Life cycle of P. falciparum: P. falciparum belongs to the phylum Apicomplexa, which consists of parasitic taxa characterized by the presence, in at least one stage of their life cycle, of a structure called the “apical complex” that is involved in the penetration of the host cell (Cheng 1986; Collins and Aikawa 1993).
The invasive stage to the vertebrate host consists of haploid sporozoites, which are injected by the mosquito vector during its blood meal. These sporozoites are carried by the blood to the liver, where they multiply within the hepatocyte and develop into liver merozoites, which start the erythrocyte life stage. Some merozoites differentiate into gametocytes, which are the forms taken up with the mosquito's blood meal. Fusion of gametes occurs in the mosquito, where the zygote is formed, develops into the ookinete, and further differentiates into the oocyst. This is the only part of the life cycle where the parasite is diploid. Meiosis takes place in the oocyst, resulting in the formation of haploid sporozoites (Collins and Aikawa 1993).
Genes and DNA sequences: We analyze 10 loci in P. falciparum that encode proteins expressed at different stages of the parasite's life cycle.
Apical membrane antigen-1 (AMA-1): The AMA-1 (also known as PF83) protein has 622 residues and a molecular weight of 83 kD. AMA-1 appears first in the apical complex and migrates to the merozoite's surface. The nine sequences used in our study are from Peterson et al. (1989), Thomas et al. (1990), and Oliveira et al. (1996).
Circumsporozoite protein (CSP): The CSP has ~420 residues and a molecular weight of 58 kD; it has a variable central region consisting of multiple repeats of four-residue-long motifs (McCutchan et al. 1988). It is the predominant protein on the surface of the sporozoite, the invasive stage transmitted by the mosquito vector. We use 22 sequences from Dame et al. (1984), del Portillo et al. (1987), Lockyer and Schwarz (1987), Campbell (1989), Caspers et al. (1989), and Jongwutiwes et al. (1994).
Erythrocyte-binding antigen of 175 kD (EBA-175): This is a merozoite protein involved in the initial erythrocyte binding by the merozoite (Sim 1995). We use 14 sequences from Ware et al. (1993) and Liang and Sim (1997).
Liver stage antigen 1 (LSA-1): This 200-kD protein is detected only during the liver stage and is accumulated in the parasitophorous vacuole (Zhu and Hollingdale 1991). It has a large central repeat region plus short, nonrepetitive N and C terminals. We use 14 sequences of the N-terminal portion from Yang et al. (1995).
Merozoite surface protein-1 (MSP-1): This protein has variable size determined by a central repeat region; the molecular weight is ~200 kD. The MSP-1 is proteolytically cleaved, and the C-terminal region remains on the merozoite after erythrocyte invasion. We analyze only a 42-kD fragment encoding the C-terminal region. We use 40 sequences from Chang et al. (1988), Peterson et al. (1988), Weber et al. (1988), Jongwutiwes et al. (1993), Pan et al. (1995), Tolle et al. (1995), plus unpublished sequences reported by Y. P. Shi, M. P. Alpers, M. M. Pova, B. L. Nahlen, A. G. Oloo and A. A. Lal under GenBank accession numbers U20726–U20733 and U20653–U20656. The alignment was made following the description in Miller et al. (1993).
Merozoite surface protein-2 (MSP-2): This, like MSP-1, is a membrane protein located on the merozoite, and has a molecular weight of 45–54 kD. It consists of N- and C-terminal regions and a central variable segment made up of repetitive and nonrepetitive motifs (Thomas et al. 1990; Marshall et al. 1992). We use 30 complete sequences from Smythe et al. (1990), Thomas et al. (1990), Fenton et al. (1991), Marshall et al. (1991), and Bhattacharya et al. (1995).
Merozoite surface protein-3 (MSP-3): This protein, also known as the secreted polymorphic antigen associated with the merozoite (SPAM), has a molecular weight of ~43 kD (McColl et al. 1994). It consists of N- and C-terminal regions and a central variable segment of repetitive motifs (McColl et al. 1994). The MSP-3 is secreted by the parasite into the parasitophorous vacuole space of the erythrocyte cytoplasm (McColl et al. 1994). We have included 19 sequences: two from McColl et al. (1994) and 17 partial sequences from Huber et al. (1997).
Ookinete protein (Pfs25): This is a 25-kD surface protein expressed in maturing gametocytes and in the zygote. Our 13 sequences are from Kaslow et al. (1989) and Shi et al. (1992b).
Pfs48/45: These two proteins of 48 and 45 kD are detected from day 2 of gametocytogenesis through gametogenesis and fertilization (Kaslow 1994). They are encoded by a single gene (with likely post-transcriptional modifications; see Kaslow 1994), and they are candidates for a transmission-blocking vaccine (Kaslow 1994). We use eight sequences from Kocken et al. (1993, 1995).
Rhoptry antigen protein-1 (RAP-1): This protein of 83 kD is present in the rhoptries, which are organelles located in the apical complex. We use two complete sequences from Ridley et al. (1990) and one from Y. P. Shi with GenBank accession number U20985.
For interspecific comparisons, we use five sequences (only one available at each locus) from P. reichenowi: CSP (Lal and Goldman 1991), the N-terminal fragment of LSA-1 (Y. Chang, personal communication), Pfs25 (Lal et al. 1990), Pfs48/45 (R. L. B. Milek, C. H. M. Kocken, H. Meijers, J. G. G. Schoenmakers and R. N. H. Konings, GenBank accession number L33882), and RAP-1 (Y. P. Shi, GenBank accession number U20986).
Statistical analysis: We use four measures of genetic polymorphism (see Nei 1987 and Tamura 1992). The parameter π estimates the average number of substitutions per site between any two sequences, assuming that the sample is random. This average is also estimated by d, which is based on Tamura's (1992) three-parameter model and corrects for bias in G+C content and transition/transversion ratio. The parameter θ is related to heterozygosity per site, or the effective number of alleles (ne = 1 + θ); under neutrality equilibrium assumptions, θ = 4Nμ, where N is the effective population size and μ is the rate of neutral mutations. S is simply the number of sites segregating in the sample and is dependent on sample size and length of the sequence. We provide this parameter as a measure of the polymorphism observed, but we do not use it for comparisons between loci.
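As an illustration of these measures, the following minimal sketch computes S, π, and a per-site θ from a set of aligned sequences of equal length. It is not the procedure used in this study: the function names are hypothetical, pairwise differences are left uncorrected, and θ is estimated here with Watterson's estimator based on S, which is an assumption because the estimator is not specified above.

```python
from itertools import combinations


def segregating_sites(seqs):
    """S: number of alignment columns showing more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """pi: mean number of differences per site over all pairs of sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total_diffs / (len(pairs) * length)


def watterson_theta(seqs):
    """Per-site theta from S (Watterson's estimator; assumed, not from the text)."""
    n = len(seqs)
    a1 = sum(1 / i for i in range(1, n))
    return segregating_sites(seqs) / (a1 * len(seqs[0]))


# Toy alignment (invented data, for illustration only).
toy = ["AAAA", "AAAT", "AATT"]
print(segregating_sites(toy), nucleotide_diversity(toy), watterson_theta(toy))
```

Because S grows with sample size and sequence length while π and θ are scaled per site, only the latter two are suitable for comparisons between loci.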
We test for intragenic recombination (which has been suggested to occur in CSP, MSP-1, and MSP-2; see Conway et al. 1991; Marshall et al. 1991; Hughes 1992; McCutchan et al. 1992; Rich et al. 1997) with Sawyer's permutation test, which is based on the sum of squares of the condensed fragment lengths (SSCF; Sawyer 1989; Hartl and Sawyer 1991). We calculate the SSCF score using all polymorphic sites, rather than only the synonymous ones as would be appropriate, owing to the scarcity of synonymous substitutions. This test could, therefore, be affected by convergence among the nonsynonymous sites as a result of natural selection. Where possible, however, we also calculate the SSCF score separately for synonymous and nonsynonymous substitutions. The significance of the SSCF score is obtained by means of 10,000 random computer permutations (Hartl and Sawyer 1991).
We first test for evidence of positive natural selection by comparing the number of synonymous and nonsynonymous substitutions. Without positive selection favoring amino acid polymorphism, the incidence of synonymous substitutions should be higher owing to purifying selection against nonsynonymous substitutions; a higher incidence of nonsynonymous than of synonymous substitutions is taken as evidence that positive natural selection is promoting polymorphism (Kimura 1977; Kreitman and Akashi 1995; Ohta 1996). The numbers of synonymous and nonsynonymous substitutions per site are estimated using two methods: Nei and Gojobori's (1986) with the Jukes and Cantor (1969) correction, as implemented in the MEGA program (Kumar et al. 1994), and the method of Li (1993) based on Kimura's (1980) two-parameter model. Intra- and interspecific transitions and transversions are estimated using the pairwise differences without any correction, as implemented in the MEGA program (Kumar et al. 1994); 95% confidence intervals for these statistics are estimated with 5000 bootstrap replications (Efron and Tibshirani 1993).
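Two of the ingredients named above can be sketched briefly: counting uncorrected transitions and transversions between a pair of aligned sequences, and applying the Jukes and Cantor (1969) correction to a proportion of differing sites. This is an illustrative sketch with hypothetical function names, not the MEGA implementation, and it does not partition sites into synonymous and nonsynonymous classes as the Nei and Gojobori method does.

```python
from math import log

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    Positions with gaps or ambiguous symbols are skipped.
    """
    transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * log(1 - 4 * p / 3)


# Toy sequences (invented data, for illustration only).
print(count_ts_tv("AAGC", "GACA"))
print(jukes_cantor(0.1))
```

The correction matters little when sequences differ at few sites, as for most loci considered here, and grows as p approaches the saturation value of 0.75.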
The previous test assumes that synonymous substitutions are neutral, as is generally taken to be the case. We test this assumption by ascertaining the consequences of codon bias. The effective number of codons, Nc, is defined as the number of equally used codons that would yield the observed level of codon usage bias (Wright 1990). Nc is a measure of the codon bias, and its value can range between 20 and 61. A value of Nc = 20 indicates that only one codon per amino acid is used; a value of 61 indicates that all synonymous codons are equally used. Nc and G+C content are calculated for a data set of 92 loci using the program CODONS (Lloyd and Sharp 1992); deviation from 50% G+C content in silent sites affects Nc values so that a correlation is expected if there is codon bias (Wright 1990). The 92 sequences are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/Web/Genbank/). Only one allele is included for each locus, chosen at random when more than one sequence is available. Only genes and gene fragments with >100 codons are included in this set. The significance of the correlation of Nc with (1) total G+C content and (2) G+C content in the third position is ascertained by means of Pearson's correlation test, with α = 0.05 and Bonferroni's correction for multiple tests, as appropriate (Dunn 1961; Hancock and Klockars 1996).
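The two quantities described above can be sketched as follows. This is a simplified reading of Wright's (1990) estimator under the standard genetic code, not the CODONS program: the function names are hypothetical, amino acids observed fewer than twice are skipped, and when isoleucine is missing its homozygosity is taken as the mean of the twofold and fourfold classes.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families keyed by amino acid (stop codons excluded).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)


def split_codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_by_codon_position(seq):
    """Fraction of G or C at the first, second, and third codon positions."""
    codons = split_codons(seq)
    return tuple(sum(c[pos] in "GC" for c in codons) / len(codons) for pos in range(3))


def effective_number_of_codons(seq):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, capped at 61."""
    counts = Counter(c for c in split_codons(seq) if c in CODON_TABLE)
    homozygosity = {2: [], 3: [], 4: [], 6: []}
    for codons in FAMILIES.values():
        n = sum(counts[c] for c in codons)
        if len(codons) == 1 or n < 2:
            continue  # Met, Trp, or too few observations
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            homozygosity[len(codons)].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items() if v}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if len(mean_f) < 4:
        raise ValueError("too few codons to estimate Nc")
    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(nc, 61.0)
```

A sequence that uses a single codon for every amino acid gives Nc = 20, and one that uses all synonymous codons equally gives the capped value of 61, matching the two extremes described above.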
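We use two additional tests for detecting positive natural selection in the maintenance of genetic polymorphism. The Tajima (1989) test is based on the statistic D, which contrasts π with the estimate of θ derived from the number of segregating sites, S; under neutrality, D is expected to be close to zero.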
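In its standard form (Tajima 1989), the statistic can be written as shown below. The notation is assumed here rather than taken from the text: n is the number of sequences sampled, S is the number of segregating sites, and k is the mean number of pairwise differences per sequence (π multiplied by the sequence length).

```latex
\begin{aligned}
D &= \frac{k - S/a_1}{\sqrt{e_1 S + e_2 S (S - 1)}}, &
a_1 &= \sum_{i=1}^{n-1} \frac{1}{i}, &
a_2 &= \sum_{i=1}^{n-1} \frac{1}{i^2}, \\
e_1 &= \frac{c_1}{a_1}, &
e_2 &= \frac{c_2}{a_1^2 + a_2}, \\
c_1 &= b_1 - \frac{1}{a_1}, &
c_2 &= b_2 - \frac{n + 2}{a_1 n} + \frac{a_2}{a_1^2}, \\
b_1 &= \frac{n + 1}{3 (n - 1)}, &
b_2 &= \frac{2 (n^2 + n + 3)}{9 n (n - 1)}.
\end{aligned}
```

The numerator is the difference between two estimates of the same neutral parameter, so an excess of intermediate-frequency variants (k larger than S/a_1) gives positive D, and an excess of rare variants gives negative D.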
In the MK test, the intra- and interspecific numbers of synonymous and nonsynonymous sites (McDonald and Kreitman 1991) are compared using Fisher's exact test for a 2×2 contingency table (Conover 1980).
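The comparison can be sketched with a small two-sided Fisher's exact test written against the Python standard library. The function names and the layout of the table (rows: polymorphic within species, fixed between species; columns: nonsynonymous, synonymous) are assumptions made for illustration, and the counts in the example are invented rather than taken from Table 5.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d
    denom = comb(total, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


def mk_test(poly_nonsyn, poly_syn, fixed_nonsyn, fixed_syn):
    """P value for the McDonald-Kreitman 2x2 comparison."""
    return fisher_exact_two_sided(poly_nonsyn, poly_syn, fixed_nonsyn, fixed_syn)


# Invented counts, for illustration only.
print(mk_test(3, 0, 0, 3))
```

A small P value with a relative excess of nonsynonymous polymorphism within the species, as opposed to between species, is the pattern interpreted here as selection maintaining polymorphism.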
RESULTS
Table 1 gives estimates of genetic variation at each of 10 gene loci of P. falciparum. Genetic diversity is greater in the five genes expressed on the surface of either the merozoite (π = 0.016, 0.088, 0.044, and 0.097, respectively, for AMA-1, MSP-1, MSP-2, and MSP-3) or the sporozoite (π = 0.006 for CSP) than in the four other genes (π = 0.004, 0.004, 0.002, and 0.002, respectively, for EBA-175, Pfs25, Pfs48/45, and RAP-1). The N-terminal region of LSA-1 is fairly polymorphic (π = 0.009), even though the protein has been said to be conserved (Fidock et al. 1994). LSA-1 has not been detected on the surface of the hepatocyte, although no experiments specifically designed for the purpose have been performed. The presence of cytotoxic T lymphocytes or cytotoxic T lymphocyte epitopes suggests immune protection (Hill et al. 1992), which has stimulated the working hypothesis that LSA-1 is transported outside the parasitophorous vacuole to the hepatocyte surface, where it would interact with major histocompatibility complex class I molecules (Fidock et al. 1994). The intermediate level of genetic diversity we find in LSA-1 (N-terminal region) is consistent with this hypothesis.
MSP-3 is very polymorphic (π = 0.097), comparable with MSP-1 (the 42-kD region herein analyzed, see MATERIALS AND METHODS) and other surface proteins. Some authors suppose that MSP-3 is located on the merozoite surface (Oeuvray et al. 1994); however, MSP-3 is part of a group of proteins that are not an integral part of the merozoite membrane but are secreted into the parasite parasitophorous vacuole space or the erythrocyte cytoplasm (McColl et al. 1994). These proteins are often referred to as “surface proteins” but their relationship with the merozoite membrane is unknown.
Table 1. Polymorphism in 10 P. falciparum genes
EBA-175 and RAP-1 are not present on the surface of the merozoite, but they are involved in the erythrocyte invasion that occurs as an essential stage of the parasite's life cycle. Pfs25 is a surface protein expressed in the mature gametocyte close to exflagellation and in the zygote, i.e., inside the mosquito host, and thus is not exposed to the immune system of the human host, where no antibodies against it have been found (Kaslow 1994). Similarly, Pfs48/45 is a surface protein expressed during gametogenesis. Naturally occurring antibodies against this protein have been found in 9–60% of individuals in exposed populations (Kaslow 1994). Epidemiologic studies suggest that individuals, after their first malarial infection, produce antibodies that gradually disappear in endemic areas with long-term exposure (Kaslow 1994).
Among the surface-expressed, more polymorphic loci, MSP-1 is distinctive in that it consists of two allelic families (MAD-20 and Wellcome, identified as MSP-1-MAD and MSP-1-Well in Table 1), with low divergence among the members of a family but great differentiation between the families (π = 0.089 for all 40 alleles, but only 0.004 for the 30 MAD alleles, and 0.001 for the 10 Well alleles).
If we use all polymorphic sites, synonymous as well as nonsynonymous, intragenic recombination is detected by the SSCF test in almost all fairly polymorphic genes: the three merozoite surface antigens (AMA-1, MSP-1, and MSP-2), the sporozoite surface antigen (CSP), and LSA-1; the only exception is MSP-3. The SSCF test, however, is not significant when synonymous substitutions alone are considered, which is only possible in the four genes that exhibit synonymous polymorphism within the regions included in our alignment (six sites in AMA-1, three sites in CSP, four sites in MSP-2, and five sites in Pfs25). When only amino acid replacement sites are taken into account, there is evidence of intragenic recombination at AMA-1 (SSCF = 5495; P = 0.001), CSP (SSCF = 27,669; P = 0.045), and MSP-2 (SSCF = 18,503; P = 0.029). We consider in this paper only the C-terminal region of MSP-1, but we have made an additional analysis of six complete sequences (using the alignment of Miller et al. 1993). Evidence of intragenic recombination exists for synonymous (SSCF = 134,495; P < 0.001) and replacement sites (SSCF = 759,275; P < 0.001).
Table 2 gives the incidence of synonymous and nonsynonymous substitutions, estimated by two methods. As previously reported (Hughes 1991, 1992; Hughes and Hughes 1995), the incidence of nonsynonymous substitutions is greater (significantly so in most cases) for several loci. The exceptional cases are MSP-1, MSP-3, and Pfs25. The number of synonymous substitutions is zero in three cases (LSA-1, Pfs48/45, and RAP-1). The pattern of synonymous and nonsynonymous substitutions in MSP-1 (the 42-kD region herein studied) is consistent with previous observations. Hughes (1992) found in regions 8–11 (using the classification of Tanabe et al. 1987) a greater number of synonymous than nonsynonymous substitutions. The excess of nonsynonymous substitutions (comparing Ks and Kn) indicates positive natural selection in seven of the 10 loci: AMA-1, CSP, EBA-175, LSA-1, MSP-2, Pfs48/45, and RAP-1. The method of Li (1993) leads to similar results for the same loci except for MSP-2; at this locus, the number of nonsynonymous substitutions is higher than the number of synonymous substitutions, but not significantly.
Table 2. Synonymous and nonsynonymous substitutions in 10 P. falciparum genes
Table 3 gives the incidence of transitions and transversions for the 10 P. falciparum genes. All the genes, except Pfs25, exhibit a higher number (often much higher) of transversions. Usually, transitional substitutions are about twice as common as transversions (Gojobori et al. 1982; Li et al. 1984; Collins and Jukes 1994) and almost 10 times more common in animal mitochondria (Brown et al. 1982; Kondo et al. 1993). In P. falciparum, transversions are more common than transitions in six genes (AMA-1, LSA-1, MSP-1, MSP-2, MSP-3, and Pfs48/45) and about equal in three genes (CSP, EBA-175, and RAP-1). The observed pattern is comparable with those found in the mitochondrial genome of Drosophila (Wolstenholme and Clary 1985; Gleason et al. 1997; Inohira et al. 1997) and Apis mellifera (Crozier and Crozier 1993), which are also A+T-rich genomes.
Table 4 gives the total G+C content and the Nc values for 92 loci. The average G+C for the P. falciparum genes is, for all three codon positions, 30.22% [with a 95% confidence interval (C.I.) around the mean of 29.2–31.2%, using a t distribution; range 22–51%]. However, for the third position, the G+C content is only 15.16% (C.I. 14.2–16.1%, range 7–31%). The average Nc is 36.82 (C.I. 36.0–37.6, range 31.33–51.44), which is comparable to other A+T-rich genomes, such as that of the proteobacterium Rickettsia prowazekii, with an average Nc of 40.84 and a range from 33.4 to 51.1 (Andersson and Sharp 1996). Figure 1 shows that Nc is highly correlated with G+C content at the third codon position (Pearson's coefficient: r = 0.737, P = 0.0001) but not with G+C content at the first (r = −0.134, P = 0.20) or at the second (r = 0.07, P = 0.51) codon position. There is no significant correlation, either, between Nc and total G+C (r = 0.178, P = 0.09). The significant correlation between Nc and G+C content at the third codon position persists even if we use a level of significance of α/4 = 0.0125 [on the grounds that four separate tests are performed; Bonferroni correction (see Dunn 1961; Hancock and Klockars 1996)]. Codon use is, thus, constrained by the reduction in G+C content in the third positions, where most of the synonymous substitutions occur.
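The logic of the Bonferroni comparison can be sketched as follows, applied to the four P values reported in this paragraph. The function names are hypothetical, and the sketch stops at the correlation coefficient and the corrected threshold; it does not recompute the P values, which would require the per-locus data of Table 4 and a t distribution.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# P values as reported in the text for the correlation of Nc with G+C content.
reported = {"GC1": 0.20, "GC2": 0.51, "GC3": 0.0001, "total GC": 0.09}
print(bonferroni_significant(reported))
```

With four tests the threshold is 0.05/4 = 0.0125, so only the correlation with G+C content at the third codon position remains significant.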
Table 3. Transitions and transversions in 10 P. falciparum genes
Table 5 reports the results of two additional tests that seek to ascertain whether positive natural selection contributes to the genetic polymorphism in P. falciparum. Tajima's test shows positive values and a significant departure from neutral expectations for MSP-1, where the synonymous and nonsynonymous substitution ratio failed to detect selection. MSP-3 is very close to significance; we repeated the test for combinations of all sequences minus one, and 90% of the tests were statistically significant. Positive and significant values of D indicate strong overdominance selection (Gillespie 1994).
The MK test (McDonald and Kreitman 1991) ascertains whether the intraspecific number of sites with nonsynonymous substitutions is significantly greater than the neutral expectation, which is determined by the interspecific nonsynonymous/synonymous ratio. The interspecific comparisons are made with P. reichenowi, which is parasitic to chimpanzees and is the closest known species to P. falciparum (Coatney et al. 1971; Collins and Aikawa 1993; Escalante and Ayala 1994; Escalante et al. 1995, 1997). The P. reichenowi nucleotide sequence is known for only five of the 10 genes surveyed in the present study. Evidence of natural selection emerges at two of the five loci, LSA-1 and Pfs48/45, which are also shown to be under selection by the test based on the excess of nonsynonymous substitutions.
Table 4. The 92 P. falciparum genes included in the analysis of G+C content and codon use
DISCUSSION
Although P. falciparum has been one of the most extensively investigated parasites, there are severe limitations when seeking to assess its genetic diversity. The gene loci studied remain few, and in several of them, the number of sequences is small. Only sequences from cultured parasites are available for some loci, which introduces bias into the samples. In the case of field isolates, sampling efforts have focused on areas with low genetic diversity (perhaps resulting from transmission differences; Jongwutiwes et al. 1993). For example, the CSP data set is mostly limited to Asian samples, and half of the MSP-2 sequences come from India. Moreover, malaria research has focused on those protein parts that are immunologically relevant, so the polymorphism of the complete gene cannot be assessed properly.
This study shows that loci encoding proteins expressed on the surface of the sporozoite and the merozoite are more polymorphic than those expressed during the sexual stages or inside the parasite. These results agree with the general observation that stage-specific surface proteins exhibit high polymorphism when compared with internal antigens (McCutchan et al. 1988; Riley et al. 1994). The general inference is that proteins that are involved in the parasite's recognition by the host's immune system are under strong selection pressure for accumulating polymorphism as a means for evading the host's defenses.
Figure 1. Correlation in P. falciparum between the effective number of codons (Nc) and G+C content in the first (GC1), second (GC2), or third (GC3) codon position, on the basis of 92 gene loci. Only the correlation with the third codon position is statistically significant (r = 0.737, P < 0.001 with α = 0.05/4 = 0.0125).
The 10 loci of P. falciparum we have surveyed are fairly polymorphic, with a weighted average of π = 0.0197 and a range from 0.002 (RAP-1) to 0.097 (MSP-3). This is higher than the observed diversity in many eukaryotes, such as humans, π = 0.0011 (Li and Sadler 1991), or in Drosophila melanogaster; e.g., π = 0.009 for Cu, Zn superoxide dismutase, which has an intron of 706 bp (Hudson et al. 1994), and π = 0.010 for the exon of the gene encoding glucose dehydrogenase (Hamblin and Aquadro 1997), even though the genes of P. falciparum considered in this study do not include noncoding segments. Intragenic recombination needs to be taken into account as a possible generator of genetic diversity in P. falciparum. Concerted evolution has been postulated as the process accounting for the intragenic conservation observed in genes such as CSP, which has a large middle region of tandem repeats, and the highly differentiated alleles of MSP-1 and MSP-2 (Tanabe et al. 1987; McCutchan et al. 1988; Marshall et al. 1991; Frontali 1994; Rich et al. 1997). Intragenic similarities between alleles with respect to nonsynonymous substitutions, however, may arise by convergence, i.e., selection for homoplastic substitutions (McCutchan et al. 1992; Rich et al. 1997). One possible test for discerning between intragenic recombination and homoplasy is to test for recombination only between silent sites. When using only silent sites, we detected recombination only at MSP-1, but the scarcity of silent substitutions makes it difficult to exclude intragenic recombination at the other loci (in addition to the constraints imposed by low G+C content; McCutchan et al. 1992). When using both silent and nonsilent substitutions, the SSCF test was significantly positive for four additional surface-expressed loci but not for other genes (Table 1, last column); however, this result may be clouded by homoplasy, as noted.
Polymorphism of P. falciparum genes
The polymorphism is unevenly distributed among the loci studied. Loci encoding proteins expressed on the surface of the sporozoite or the merozoite (AMA-1, CSP, LSA-1, MSP-1, MSP-2, and MSP-3) are more polymorphic than those expressed during the sexual stages or inside the parasite (weighted average of π = 0.040, range 0.008–0.096 vs. weighted average of π = 0.003, range 0.002–0.004). The level of polymorphism observed in MSP-1 and MSP-3, for example, is comparable to that found in the locus DRB1 of the major histocompatibility complex in humans (π = 0.071 based on 58 sequences obtained from Marsh and Bodmer 1993). The DRB1 polymorphism is considered to be ancient polymorphism under balancing selection (Ayala et al. 1994). In the case of genes such as CSP, MSP-1, and MSP-2, π estimates should be taken as minimum estimates because the polymorphism present at repeat motifs or highly divergent regions was excluded from our analysis, owing to the difficulty in obtaining a reliable alignment. The greater polymorphism obtained in the surface antigens is commonly attributed to natural selection, which is assumed to favor polymorphism in those genes directly exposed to the vertebrate host's immune system as a strategic mechanism for evading the host's defense.
Evidence that surface-expressed proteins exhibit high polymorphism as a consequence of positive natural selection can be seen in that amino acid replacement sites are more polymorphic than synonymous sites. For the most part, synonymous substitutions are generally thought to be selectively neutral rather than restrained by purifying natural selection. The higher incidence of replacement substitutions would then be driven by positive natural selection (Anders and Saul 1994; Kaslow 1994). In our P. falciparum loci, the incidence of nonsynonymous substitutions is higher at all loci (and most often significantly so), except for MSP-1 and Pfs25.
The nonsynonymous/synonymous substitution ratio has been used in Drosophila and other organisms for testing whether the nonsynonymous substitutions are under positive selection, which is thought to be the case when the ratio is high (Kreitman and Akashi 1995; Endoet al. 1996; Ohta 1996). In particular, the high nonsynonymous/synonymous ratios observed in human HLA and other major histocompatibility complex loci are commonly attributed to natural selection favoring diversity in the antibodies and other specific components of the immune defense (Hughes and Nei 1988, 1989). Recent studies have concluded that the high level of replacement substitutions observed in exposed surface proteins of P. falciparum is evidence of positive natural selection (Hughes 1991, 1992; Hughes and Hughes 1995). Endo et al. (1996) considered 3595 homologous proteins and found evidence for positive natural selection on 17 gene groups; nine of them are the surface antigens of parasites or viruses, and among them is the MSP-2 of P. falciparum. The observation that selection is contributing to the maintenance of genetic polymorphism has also been made for the env gene in HIV-1 (Seibertet al. 1995). The env gene encodes the envelope glycoproteins gp120 and gp41 with regions V1–V5, which are involved in immune evasion and have a higher number of nonsynonymous than synonymous substitutions (a ratio of 2.0 in V3 and 6.4 in V2; other differences are not statistically significant). The remainder parts of the glycoproteins have significantly more synonymous than nonsynonymous substitutions, providing additional evidence for positive natural selection operating in the maintenance of the nonsynonymous polymorphisms observed in the antigen regions (Seibertet al. 1995). 
Similar studies with similar conclusions have been carried out in other genes, such as the hemagglutinin gene of the human influenza A viruses (Ina and Gojobori 1994), and in proteins mediating sperm-egg recognition in marine invertebrates (Vacquier et al. 1997).
The conclusion that a high nonsynonymous/synonymous substitution ratio is an indication of natural selection favoring nonsynonymous polymorphisms depends, however, on certain assumptions. Foremost is the assumption that synonymous substitutions are neutral so that they are not subjected to purifying selection or affected by various constraints such as codon bias or G+C content (Ohta 1996). We have noted that for 92 P. falciparum loci the overall G+C content is 30.2% and only 15.2% for third codon positions, whereas the average G+C content in first and second codon positions is 42.59 and 30.64%, respectively. This variation in G+C content is in agreement with previous studies that included fewer genes and smaller fragments (Musto et al. 1995). Frontali (1994) has observed that about half of P. falciparum's genome consists of noncoding regions with very low G+C, even lower than for coding regions, but no quantitative analysis has been provided. We have studied the 37 introns reported in the list of the 92 loci in Table 4 and found an average G+C content of only 13.03% (95% C.I. 11.8–14.3, using a t distribution). The overall G+C content of the P. falciparum genome is 18% (McCutchan et al. 1984). Selection or mutation bias may thus be favoring substitutions towards A+T, whether or not they are synonymous. Be that as it may, the high A+T content indicates that constraints exist so that synonymous substitutions may not be completely neutral. The existence of these constraints can be observed in Figure 1, showing a strong correlation between Nc and G+C content in the third position (see also Hyde and Sims 1987; Musto et al. 1995). Musto et al. (1995) have concluded that there is a composition constraint operating in all the translated sequences and codon positions, but Hughes (1991, 1992; Hughes and Hughes 1995) has argued that G+C content does not account for the high nonsynonymous/synonymous substitution ratio observed in P. falciparum, although his argument has not been accepted definitively. Endo et al. (1996) have thus suggested that the mutation pressure towards increasing A+T content may account for the low level of synonymous substitutions in P. falciparum. A similar argument has been made in other cases where a strong base composition bias is present; for example, the excess of nonsynonymous with respect to synonymous substitutions in hepatitis delta virus has been explained by a strong preference of G+C at the third codon positions (Krushkal and Li 1995).
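The positional G+C values discussed above can be illustrated with a minimal Python sketch. It is not the procedure used to compile the values reported here; the function names `gc_content` and `gc_by_codon_position` are hypothetical, and the input is assumed to be an in-frame coding sequence with introns already removed.

```python
def gc_content(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def gc_by_codon_position(cds):
    """Return (overall, first, second, third) G+C fractions.

    Assumes `cds` starts at the first base of a codon; a trailing
    partial codon, if any, is dropped.
    """
    cds = cds.upper()
    cds = cds[: len(cds) - len(cds) % 3]
    # cds[i::3] collects every base at codon position i + 1
    by_position = tuple(gc_content(cds[i::3]) for i in range(3))
    return (gc_content(cds),) + by_position
```

For the toy sequence GCAGGTAAATTA (codons GCA, GGT, AAA, TTA), the sketch gives G+C fractions of 0.5, 0.5, and 0.0 at the first, second, and third codon positions, respectively.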
Codon bias is associated with genome G+C content, but it can be accounted for by either selection or mutation pressure (Gillespie 1991; Ohta 1996). The “mutation pressure” could be caused by mutation bias (Sueoka 1962, 1988), which may be detected if a correlation is found between the G+C content of introns and exons in the same genes (Shields et al. 1988; Moriyama and Hartl 1993; Powell and Moriyama 1997). We found no correlation between intron G+C composition and (1) G+C content of exons (r = −0.176, P = 0.283), (2) G+C content at the third codon positions (r = −0.170, P = 0.300), or (3) the effective number of codons (r = −0.198, P = 0.226). Because there is no evidence supporting a mutation bias, it may be possible that the observed codon usage is caused by natural selection (Gillespie 1991). Selection favoring specific codons may impose strong constraints on synonymous substitutions.
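A correlation analysis of this kind can be sketched as follows. The sketch assumes Pearson's product-moment correlation, with each intron G+C value paired with the corresponding exon value; the function names are hypothetical. The P value would be obtained by referring the t statistic to a t distribution with n − 2 degrees of freedom, a step that is not implemented here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two lists of equal length with at least 3 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))
```

With the hypothetical values x = [1, 2, 3, 4] and y = [1, 3, 2, 4], the sketch gives r = 0.8 and a t statistic of about 1.89 with two degrees of freedom.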
This trend for keeping a strong codon bias and an A+T-rich genome, whether it is selection or mutation pressure, will affect the number of synonymous substitutions. The substitutions observed in a set of alleles will then be in the direction of increasing the A+T content. The difficulty is that there is no direct way to observe whether a mutation was from A to G or G to A because we don't have enough information for establishing the ancestral state at the polymorphic sites under question. However, we can quantify the transversions A↔T and G↔C because these preserve the G+C content regardless of the direction of the change. We also can estimate the G+C content maintained in a given pair of sequences using those sites that do not change. This allows us to build a very conservative test. If the observed substitutions are caused by a genomic trend towards increasing A+T richness or keeping it high, we expect that (1) the ratio between the average number of G+C vs. A+T substitutions on all sequence pairs should not differ from the ratio of G+C vs. A+T sites estimated using invariant sites (assuming equilibrium in G+C composition at that specific gene) or (2) that the ratio of G+C vs. A+T substitutions will be lower because the G+C substitutions will be affected by purifying selection while A+T substitutions will be neutral or favored by selection. The results are summarized in Table 6. The G+C substitutions are more abundant than expected in the four genes AMA-1, CSP, LSA-1, and Pfs48/45. Although this is not a test for neutrality, it shows that the observed substitutions cannot be explained by a trend for increasing A+T content. This test is highly conservative because it requires a pattern leading to an increase in G+C content that can be observed in the accumulation of G+C transversions. This result thus favors the conclusion that the excess of nonsynonymous substitutions is caused by positive natural selection.
An additional source of concern is that the estimation of synonymous substitutions can be affected by nucleotide composition. Ina (1995) noted that all methods have some degree of bias when they are used on sequences with uneven nucleotide frequencies. On the whole, the method of Li (1993) provides better estimates than that of Nei and Gojobori (1986); both methods are biased, but in different directions. We noted in results that in the case of MSP-2, the excess of nonsynonymous substitutions is significant according to the method of Nei and Gojobori, but not according to Li's. This might occur because Nei and Gojobori's method underestimates the number of synonymous substitutions and overestimates the number of nonsynonymous substitutions when there is a strong bias in nucleotide composition, while Li's (1993) method overestimates the number of synonymous substitutions and underestimates the number of nonsynonymous substitutions (Ina 1995). To the best of our knowledge, no systematic studies have been performed about the effect of overall G+C content on the statistics used for estimating synonymous and nonsynonymous substitutions.
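In the method of Nei and Gojobori (1986), the proportions of synonymous and nonsynonymous differences are converted to distances with the Jukes-Cantor correction. The sketch below illustrates only that final step; the counts of sites and differences are assumed to have been obtained already, and the function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("correction is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites):
    """Return (dN, dS, dN/dS) from counts of differences and sites.

    The ratio is None when dS is zero.
    """
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ratio = d_n / d_s if d_s > 0 else None
    return d_n, d_s, ratio
```

The correction grows without bound as the proportion of differences approaches 0.75, so small errors in the counts of sites or differences can have a large effect on the estimated ratio when sequences are divergent.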
The conclusion at this point is that although the nonsynonymous/synonymous substitution ratio suggests that natural selection may account for the high levels of amino acid polymorphism observed at seven of the 10 loci studied, the evidence is clouded by the constraints imposed by the particular characteristics of codon bias and A+T content in the P. falciparum genome. These characteristics evidence an overall trend for keeping or increasing A+T richness in this genome. This may decrease the number of synonymous substitutions and, thus, affect the nonsynonymous vs. synonymous substitution ratio. However, this trend cannot explain the accumulation of substitutions in the case of AMA-1, CSP, LSA-1, and Pfs48/45 because transversions towards G+C are more frequent than expected. This G+C accumulation, in addition to the observed ratio of nonsynonymous vs. synonymous substitutions, supports the conclusion that selection is operating at least on these four genes.
In the case of MSP-1 and MSP-3, we could not detect evidence for positive natural selection using the nonsynonymous/synonymous substitution method in the regions under study, although Tajima's test shows that MSP-1 (Table 5) and perhaps MSP-3 are subject to selection. It is possible that selectively favored nonsynonymous substitutions become saturated over time, producing a synonymous/nonsynonymous substitution ratio that is consistent with neutrality in distantly related sequences, even though selection could be detected in closely related sequences (Hughes and Nei 1988; Hughes 1992). Lack of significance at other loci using Tajima's test may not be interpreted as negative evidence because this test is highly conservative and has limited power whenever there has been a recent bottleneck or selective sweep (Gillespie 1994; Simonsen et al. 1995), as most likely seems to have occurred in P. falciparum (Rich et al. 1997). Moreover, several sequences are partial, which further limits the power of the test.
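For reference, the D statistic on which Tajima's test is based can be computed as in the following Python sketch. It assumes complete, gap-free aligned sequences and treats every variable column as one segregating site; the function name is hypothetical, and the significance assessment is not implemented.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (no gaps assumed).

    Returns None when there are no segregating sites.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("at least four sequences are needed")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to equal length")

    # S: number of segregating (polymorphic) sites
    seg_sites = sum(len(set(column)) > 1 for column in zip(*seqs))
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    ) / n_pairs

    # Constants that depend only on the sample size n
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

For the toy sample AAAA, AAAT, AATA, ATAA, in which each of the three variable sites is a singleton, the sketch gives D of about −0.75. Negative values reflect an excess of low-frequency variants and positive values an excess of intermediate-frequency variants. With partial sequences, fewer sites contribute to both S and the pairwise differences.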
The ratio GC/AT in 10 P. falciparum genes
The MK test uses the number of fixed substitutions between two closely related species (nonsynonymous/synonymous ratio) as the expected value under neutrality (McDonald and Kreitman 1991). Of the nine loci surveyed in P. falciparum, five have been sequenced in P. reichenowi, which are thus the only ones to which the MK test can be applied. The test is significant at two loci, LSA-1 and Pfs48/45 (Table 5); the number of replacement substitutions is greater than expected under neutrality at both of these loci. The MK test corrects for various constraints, such as those arising from G+C content and codon bias, if we assume that the pattern of divergent substitutions between closely related species is also affected by the same kind of constraints as are intraspecific substitutions.
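The comparison made by the MK test amounts to a test of independence on a 2 × 2 table of fixed vs. polymorphic and nonsynonymous vs. synonymous substitutions. The Python sketch below uses a two-sided Fisher's exact test, which is an assumption made for illustration; the text does not state which test statistic was applied. The function name and the example counts are hypothetical.

```python
from math import comb


def mk_fisher_exact(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Two-sided Fisher's exact P value for a McDonald-Kreitman 2 x 2 table.

    Rows: fixed differences between species, polymorphisms within species.
    Columns: nonsynonymous, synonymous.
    """
    a = fixed_nonsyn
    row1 = fixed_nonsyn + fixed_syn
    row2 = poly_nonsyn + poly_syn
    col1 = fixed_nonsyn + poly_nonsyn
    total = row1 + row2
    if total == 0:
        raise ValueError("table is empty")

    denominator = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x nonsynonymous fixed differences
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    low = max(0, col1 - row2)
    high = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables that are no more probable than the observed one
    p_value = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(p_value, 1.0)
```

For the hypothetical table of 3 fixed nonsynonymous, 1 fixed synonymous, 1 polymorphic nonsynonymous, and 3 polymorphic synonymous substitutions, the sketch gives P = 34/70, or about 0.49. Because the test conditions on the marginal totals, only the proportions in the two rows are compared.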
The closest known relative of P. falciparum is P. reichenowi, a chimpanzee parasite. The time of divergence between P. falciparum and P. reichenowi has been estimated to be ~5–8 mya, about the same time when the chimpanzee and human lineages diverged (Coatney et al. 1971; Collins and Aikawa 1993; Escalante and Ayala 1994; Escalante et al. 1995). The genetic distance between P. reichenowi and P. falciparum alleles for the CSP is about five times as large as the distance between intraspecific P. falciparum alleles (Escalante et al. 1995). The information available about P. reichenowi is limited to 10 gene loci (including the 18S rRNA); however, the assumption that the G+C content or codon bias does not affect the MK results appears to be reasonable. For example, the overall G+C content of P. reichenowi is 0.29 (95% C.I. 0.255–0.325, t distribution), the G+C content at the third codon position is 0.158 (95% C.I. 0.105–0.21), and the effective number of codons is 39.6 (95% C.I. 34.3–44.9). These values are not statistically different from those observed in the same genes of P. falciparum. Table 7 gives the number of transitions and transversions, as well as the number of synonymous and nonsynonymous substitutions between P. reichenowi and P. falciparum. The transversion bias appears to be less pronounced than the one found intraspecifically, especially for LSA-1, one of the genes for which we found evidence of selection using the MK test. This may result from the presence of more synonymous than nonsynonymous substitutions between P. reichenowi and P. falciparum for all loci, except for the one encoding the CSP.
One potential problem with the MK test in the present case is that we have only one P. reichenowi sequence at each of the five loci tested (only one isolate of P. reichenowi is known to be available; Coatney et al. 1971, p. 309). If some sites are polymorphic in P. reichenowi, we are likely to overestimate the number of fixed differences between the species. This potential problem is not likely to be important because the MK test compares proportions and there is no reason to expect that the proportion of nonsynonymous/synonymous substitutions will be affected by P. reichenowi polymorphism. The MK test is detecting a disproportionate accumulation of nonsynonymous substitutions in P. falciparum when compared to P. reichenowi, which is apparent when Ks and Kn are compared within P. falciparum (Table 2) and between P. falciparum and P. reichenowi (Table 7). A similar approach was used by Mindell (1996) for studying the effect of natural selection in the maintenance of the genetic polymorphism in HIV-1 viruses.
In conclusion, there is evidence of natural selection contributing to amino acid polymorphism at nine loci: MSP-1 and MSP-3 (Tajima's test); LSA-1 and Pfs48/45 (MK test); AMA-1, CSP, EBA-175, LSA-1, MSP-2, Pfs48/45, and RAP-1 (synonymous/nonsynonymous rates). The evidence derived from the intraspecific nonsynonymous/synonymous ratio may be questionable if a trend towards increasing A+T richness could account for this pattern, but this is not the case for AMA-1 and CSP. The evidence for MSP-2 may be questioned because the excess of nonsynonymous substitutions was not significant according to the method of Li (1993). The evidence for MSP-3 may also be questioned because of the ambiguous significance obtained with Tajima's test. Positive natural selection seems definitely established in at least five of the 10 loci investigated.
Transitions, transversions, synonymous, and nonsynonymous substitutions between P. falciparum alleles and P. reichenowi
AMA-1, CSP, LSA-1, and MSP-1 are host-exposed surface proteins in which, as noted above, natural selection is generally assumed to favor polymorphism as a strategy for evading the host's immune system. Pfs48/45 is only moderately immunogenic, with antibody levels that vary geographically and with degree of exposure (Kaslow 1994). It may be that even moderate or low immune host activity generates sufficient selective pressure to be detectable in the parasite's polymorphism.
Acknowledgments
A. A. Escalante is supported by a fellowship from the American Society for Microbiology. Research in F. J. Ayala's laboratory is supported by National Institutes of Health grant GM42397. This work was supported in part by U.S. Agency for International Development grant HRN-60010-A-00-4010-00 to A. A. Lal.
Footnotes
- Communicating editor: W.-H. Li
- Received November 13, 1997.
- Accepted January 22, 1998.
- Copyright © 1998 by the Genetics Society of America