Plants use signaling pathways involving salicylic acid, jasmonic acid, and ethylene to defend against pathogen and herbivore attack. Many defense response genes involved in these signaling pathways have been characterized, but little is known about the selective pressures they experience. A representative set of 27 defense response genes were resequenced in a worldwide set of 96 Arabidopsis thaliana accessions, and patterns of single nucleotide polymorphisms (SNPs) were evaluated in relation to an empirical distribution of SNPs generated from either 876 fragments or 236 fragments with >400 bp coding sequence (this latter set was selected for comparisons with coding sequences) distributed across the genomes of the same set of accessions. Defense response genes have significantly fewer protein variants, display lower levels of nonsynonymous nucleotide diversity, and have fewer nonsynonymous segregating sites. The majority of defense response genes appear to be experiencing purifying selection, given the dearth of protein variation in this set of genes. Eight genes exhibit some evidence of partial selective sweeps or transient balancing selection. These results therefore provide a strong contrast to the high levels of balancing selection exhibited by genes at the upstream positions in these signaling pathways.
PLANTS defend themselves in response to pathogens and parasites, ranging from viruses to insect herbivores, using both physical and chemical defenses. Chemical defenses include the synthesis of antimicrobial compounds, such as phytoalexins and defensins, and the production of pathogenesis-related (PR) proteins, comprising enzymes capable of degrading pathogen cell walls. Defense responses often begin with a gene-for-gene recognition of an avirulence (avr) gene of the pathogen by a resistance (R) gene of the host (Van Der Biezen and Jones 1998; Dangl and Jones 2001; Staskawicz et al. 2001). R gene-mediated resistance is usually accompanied by reactive oxygen species (ROS) and hypersensitive cell death (HR) (Veronese et al. 2003), which can be followed by activation of a salicylic acid (SA)-dependent signaling pathway (Figure 1) and the consequent expression of a characteristic set of defense genes (Glazebrook 2001). As a result, plants become more resistant to subsequent attack by otherwise virulent—in the sense of “ability to infect”—pathogens, a phenomenon called systemic acquired resistance (SAR) (Sticher et al. 1997; Durrant and Dong 2004). The SA signaling pathway is activated not only as a result of R gene action but also as a result of pathogen-derived signals. Additional plant defense responses include those controlled by signal transduction networks requiring jasmonic acid (JA) and ethylene (ET). The JA/ET-dependent signaling pathways mediate resistance responses to herbivores or necrotrophic pathogens (Thomma et al. 2001), whereas the SA-dependent signaling pathway mediates resistance to biotrophic pathogens (Ryals et al. 1996). The SA, JA, and ET signaling pathways interact extensively (Glazebrook 2001; Traw et al. 2003). Finally, gene-for-gene interaction between an R gene and its corresponding avr gene in the pathogen (Flor 1956) can trigger downstream pathways in addition to SAR, of which three have been identified (Aarts et al. 1998; McDowell et al. 2000; Glazebrook 2001).
Defense response signal transduction pathways include many different genes that together control the activation of defense response as a reaction to pathogen attack and will therefore be referred to as “defense response genes.” There is presently no information regarding the selective pressures that act on this group of genes in plants. Evolutionary analyses on receptor molecules in the innate immune system of insects have produced diverse results. In general it appears that balancing selection as observed for MHC genes and plant R genes is uncommon for innate immunity genes, although in some instances there are records of transient forms of balancing selection (Lazzaro 2005). In Drosophila, a fruit fly, there is substantial evidence for adaptive evolution in a suite of genes involved in recognition of pathogens and upstream humoral response signaling pathways, corresponding with an arms race between hosts and parasites (e.g., Relish in Drosophila simulans, Schlenke and Begun 2003; Little et al. 2004; Jiggins and Kim 2006, 2007). In many other instances however, Drosophila innate immunity genes appear to be under purifying selection. The end products of response signaling pathways are antimicrobial peptides that provide a generalized pathogen resistance response. Sequence analysis of antimicrobial peptides produced by D. melanogaster fail to reveal evidence of recurrent directional selection, but instead reveal typical levels of standing silent variation and low levels of divergence across species (Lazzaro and Clark 2003; Jiggins and Kim 2005). Purifying selection acting on these genes most likely reflects an absence of specific parasites in Drosophila populations [as opposed to termite colonies with higher pathogen pressures (Bulmer and Crozier 2004)]. Other instances of purifying selection have been observed for genes that are involved in binding to parasite polysaccharides, which are expected to evolve slower as compared to genes involved in binding to parasite proteins (Jiggins and Hurst 2003; Little et al. 2004; Jiggins and Kim 2006).
We have previously demonstrated evidence for selection on Arabidopsis R genes involved in the specific recognition of pathogen isolates (Bakker et al. 2006). These R genes tend to exhibit evidence of balancing selection, maintaining high numbers of alternative alleles that exhibit high levels of recombination but that are not unusually old. We additionally find evidence for selective sweeps, or partial selective sweeps, at a few R gene loci. These results lead to a model in which selection maintains variation for some period of time, but not for an unusually long period of time, consistent with the trench-warfare model of plant-pathogen coevolution (Stahl et al. 1999). R genes are at the top of the defense response pathway of Arabidopsis. We currently know nothing about the molecular evolution of other defense response genes in Arabidopsis.
Tests to detect selection assume selective neutrality as a null hypothesis, and the population is often assumed to be at a demographic equilibrium. Departures from such models are interpreted as indications of natural selection (Kreitman 2000). But systematic surveys of polymorphism across the Arabidopsis thaliana genome have shown that genomewide polymorphism data do not fit standard neutral models, even when nonequilibrium demographic conditions are imposed (Nordborg et al. 2005; Schmid et al. 2005). In the absence of a defensible null neutral model against which to test for selection, several authors have begun substituting an empirical distribution of genomewide polymorphism for a theoretical distribution as a baseline against which to identify unusual features of polymorphism in genes of interest, as previously advocated by Kreitman (2000) and by Nordborg et al. (2005).
This “empirical” approach was employed by Bakker et al. (2006) to investigate evolutionary scenarios for a set of 27 nucleotide binding site–leucine-rich repeat (NBS–LRR) R genes, as described above. An attempt was made to distinguish between three evolutionary scenarios: (1) long-term balancing selection, where a gene may have highly diverged alleles at intermediate frequencies and a high level of silent polymorphism; (2) positive selection on a favorable allele, which can generate a locus with few, relatively young alleles with extended haplotypes (sometimes referred to as a selective sweep); and (3) purifying selection or functional constraint, which leads to low levels of nonsynonymous polymorphism and a commensurately low rate of divergence between species. An empirical modeling approach was additionally used by Akey et al. (2004) to identify eight disease genes that appear to be under selection in humans. They used polymorphism patterns obtained for 132 genes to distinguish between the effects of demographic history and selection. Using an empirical approach without models, Fst values of individual SNPs were contrasted with the empirical genomewide distribution of Fst; Akey et al. (2002) identified 174 candidate genes as targets of selection. However, this study was limited by a relatively small sample size, representing only a few populations. In contrast, the analysis presented in this article for defense response genes in A. thaliana relies on empirical distributions obtained from genomewide polymorphism patterns based on a worldwide sampling of this species.
Here we compare the evolutionary histories of a set of 27 defense response genes with results from nondefense-related genes. We minimized sampling artifacts by resequencing the same set of 96 accessions that were used to establish the empirical distribution of polymorphism in 876 randomly distributed genomic regions, which will be referred to as the “empirical distribution” or “reference loci” (Nordborg et al. 2005). Comparisons for a number of population genetic summary statistics revealed that defense response genes display lower numbers of protein variants and are in general less polymorphic than random genomic fragments, showing an overall pattern of purifying selection with a small number of instances of positive selection and adaptively driven substitutions.
MATERIALS AND METHODS
Sampling and data generation:
DNA was extracted from fresh leaves of 96 A. thaliana accessions (supplemental Table 1). The accessions were chosen to include a worldwide sample, where 50 were sampled from local populations in a hierarchical fashion. The same set of 96 accessions has been extensively characterized for polymorphism in 876 genomic fragments that average 583 bp in length and represent 0.48 Mbp of the Arabidopsis genome (Nordborg et al. 2005). We selected 27 genes that are involved in the Arabidopsis defense response pathway (for a functional description of these genes, see supplemental Table 4). Defense response gene sequences for the 96 A. thaliana accessions were obtained by directly sequencing PCR amplification products. The 876 genomic fragments were selected so that they would cover the Arabidopsis genome with an average spacing of ∼100 kb. Since the 27 defense response genes were not selected in this way, their distances to neighboring fragments are in general shorter than the distances between each of the 876 fragments and their nearest neighbors.
Primers were designed to amplify fragments of ∼600 bp for 27 defense response genes using the A. thaliana reference sequence. We were able to obtain sequence information for most of the accessions for 27 defense response gene loci (supplemental Tables 2 and 3). For 2 defense response genes more than 1 fragment was amplified: at1g66290 (ETR1; 3 fragments) and at4g26120 (NPR2; 2 fragments), which were subsequently concatenated. Studied regions predominantly consisted of exons, where the total length of coding sequences ranged between 159 and 1182 bp (supplemental Table 3). Twenty-two fragments contained 1 or more (parts of) introns and 3 fragments also contained noncoding sequences located outside of the studied gene (supplemental Table 3). Sequences were generated on ABI 3700 automated sequencers (Applied Biosystems, Foster City, CA) by Genaissance Pharmaceuticals. All fragments were sequenced in both directions. Heterozygous sites were treated as missing data. Sequences have been submitted to the GenBank database (accession nos. EU404192–EU406770).
We compared the set of 27 defense response genes with the set of 876 random fragments, for a number of population genetic summary statistics in a similar fashion to that done by Bakker et al. (2006). A number of summary statistics were calculated using C++ programs written by EGB: nucleotide diversity π (Nei and Li 1979), number of segregating sites standardized by sequence length S, Tajima's D (Tajima 1989), FST (Nei 1973), and minimum number of recombinations, Rh (Myers and Griffiths 2003). Summary statistics involving coding regions of defense response genes and random genomic fragments (i.e., 236 random genomic fragments with >400 bp exon sequence) included numbers of synonymous (Ks) and nonsynymous (Ka) substitutions (Nei and Gojobori 1986), maximum number of synonymous substitutions between accession pairs (Ksmax), and the number of protein variants on the basis of nonsynonymous substitutions. We also calculated π, S, and Tajima's D for synonymous and nonsynonymous sites separately. Confidence intervals for the estimates of Ka, Ks, and Ka/Ks ratio were obtained through Monte Carlo simulation using the program K-estimator (Comeron 1999). Distributions of other above-mentioned summary statistics calculated for the set of defense response genes were compared with corresponding distributions obtained for the random fragments by doing Mann–Whitney U-tests in R 1.8.1 (http://www.r-project.org/). In addition, we checked whether summary statistic values for the set of defense response genes would reside in the upper or lower 5th percentile of the corresponding distribution of statistics for the random fragments, as determined from the average values of 27 randomly sampled fragments, repeated 1000 times. While the 5% lower and upper tails together represent 10%, which would not be considered conservative, we believe that these tails do provide conservative cutoff values since the set of 876 reference fragments—and especially the set of 236 coding fragments—contains fragments that are under selection or are in close linkage to sites that are under selection. Analysis of this set of 876 reference fragments has shown fat tails for the distributions of Tajima's D and level of polymorphism , confirming that a substantial fraction of fragments is under selection or is closely linked to a fragment under selection (Nordborg et al. 2005).
Evidence for a recent selective sweep can be viewed as an extended region with reduced recombination surrounding the target of selection resulting in reduced variation at neighboring loci (Maynard-Smith and Haigh 1974; Aguadé et al. 1989). We investigated defense response genes with low nucleotide diversity levels further for indications of a recent selective sweep according to Bakker et al. (2006). This method involves comparison of differences in silent nucleotide diversity (πs) between these defense response genes and fragments within 400 kb (obtained from http://walnut.usc.edu/2010/), followed by a similar analysis in the 5′ and 3′ directions to πs differences for each of the 876 random fragments and its nearest neighboring fragments on both sides.
We used a second approach to identify partial selective sweep candidates. This approach involved a genomewide scan of haplotype sharing according to Toomajian et al. (2006). We used for this purpose polymorphism data from 1204 fragments that come from an ongoing survey of genomic DNA sequence and R gene polymorphism (Nordborg et al. 2005). The density of these sequenced regions along each chromosome is sufficient to identify unusually long identical haplotypes among subsets of these individuals. The haplotype-sharing approach applied here is based on the assumption that chromosomes that are identical by descent (IBD) at a polymorphic site must also share a short region surrounding that site. Therefore, the age of the shared allele at the polymorphic site and the recombination rate in the region influence the length of this IBD region (Nordborg and Tavaré 2002; Innan and Nordborg 2003). We compared haplotype sharing around each defense response gene with what is observed in the rest of the genome in a similar fashion as Bakker et al. (2006). See Toomajian et al. (2006) for a more detailed description of the haplotype sharing statistic.
We used orthologous sequences from the 4× A. lyrata assembly (Arabidopsis lyrata Sequencing Consortium, unpublished data) as outgroup sequences for Ka/Ks and McDonald–Kreitman tests (McDonald and Kreitman 1991). A. lyrata orthologous sequences were identified after BLAST analysis of each of the defense response genes against the assembly. BLAST analysis of neighboring genes was performed to check for synteny. Syneny was inferred when BLAST hits of the neighboring genes were located on the same A. lyrata scaffold. In addition, we performed reciprocal blasting of A. lyrata sequences on the A. thaliana genome sequence. Except for GL1 (At3g27920), we found good BLAST hits for each defense response gene and its neighboring genes. For GL1 we used a published A. lyrata sequence (GenBank accession no. AF263721; Hauser et al. 2001) as an outgroup sequence.
Overview of polymorphism patterns in defense response genes:
We obtained sequence information for 27 defense response genes, most of which are involved in the SA-, JA-, or ET-dependent signal transduction pathways or are part of downstream pathways that can be triggered after gene-for-gene interaction (Figure 1; supplemental Table 4). The genes vary widely in their functions in these pathways but can be subdivided into five categories on the basis of their annotation: (1) transcription regulation, (2) R gene response, (3) phytoalexin/PR protein production, (4) cell-death regulation, and (5) function in intermediate stage of pathway (Table 1). Since each category is involved in a different stage of these pathways, they are expected to experience different selective pressures. However, no significant differences were observed among these groups on the basis of the averages of summary statistics after Bonferroni correction (P > 0.05; Ksmax, π, S, Tajima's D, Rh, Fst, number of protein variants, and interspecies, Ka/Ks).
The complete group of 27 genes displays a wide range of evolutionary histories. A graphical overview of defense response gene evolution was obtained by plotting neighbor-joining trees on the basis of silent sites (Nei and Gojobori 1986) using MEGA version 2.1 (Kumar et al. 2001; Figure 2). This figure reveals substantial variation in tree shape and depth. Comparison with the allele (haplotype) frequency distribution of the set of 876 random genomic fragments (Figure 3) shows that defense response genes have a significantly higher minor allele frequency for synonymous SNPs (Mann–Whitney U-test, P < 0.05) and a marginally lower minor allele frequency for nonsynonymous SNPs (Mann–Whitney U-test, P = 0.054). These patterns reflect an excess of low frequency (minor allele frequency, P < 0.05) nonsynonymous SNPs compared to synonymous SNPs (62% vs. 43%, respectively) for the defense response genes, whereas for the reference loci this difference is much smaller (58% vs. 52%). However, both the reference loci and the defense response genes appear to have a significant excess of low frequency nonsynonymous alleles compared to synonymous alleles (Mann–Whitney U-test, P < 0.0001 and P < 0.0005, respectively), indicating purifying selection acting on nonsynonymous mutations.
Principal component analysis (PCA) based on seven summary statistics (Ksmax—the maximum number of synonymous substitutions between accessions pairs—π, S, Tajima's D, Rh, Fst, and number of protein variants) shows that the 27 defense response genes do not fall into distinct clusters but cover a range of possible evolutionary states (Figure 4). The first principal component (PC1, 44% of total variation) differentiates the set of defense response genes on the basis of Ksmax (maximum interallelic synonymous divergence, a measure of the maximum depth of a gene tree), Tajima's D (test for selection), π (nucleotide diversity), S (number of segregating sites) and Rh (recombination frequency) in such a way that defense response genes with high PC1 values are characterized by alleles that are highly differentiated one from another at the sequence level (Table 2). Small PC1 values, in contrast, indicate low levels of polymorphism. The second principal component (PC2, 20% of total variation) differentiates the set of defense response genes on the basis of Tajima's D (test for selection), Fst (population differentiation), and number of protein variants so that defense response genes with high PC2 values are characterized by a relative excess of alleles at intermediate frequency and high population differentiation (Table 2). Small PC2 values can indicate highly skewed allele frequency distributions and interpopulational uniformity. Candidates for balancing selection and selective sweep on the basis of analyses of population genetic summary statistics (Table 1) are indicated with different symbols in the PCA plot (Figure 4).
Pseudogenes and nonfunctional alleles:
Alleles rendered nonfunctional due to mutations causing frameshifts and/or premature stop codons were observed for five defense response loci, where EDS5 stood out with 36 (38%) plant accessions with nonfunctional alleles. Nonfunctional alleles were a result of a stop codon mutation, coinciding with a major clade in the gene tree (Figure 2). Four loci (ESP, ETR1, EDS1, and PAD4) had nonfunctional alleles in 1 or 2 of the 96 plant accessions. The number of loci with nonfunctional alleles did not differ significantly between the set of defense response genes and the set of reference loci (18.5% vs. 24.0%; Mann–Whitney U-test, P = 0.52).
Genomewide patterns of polymorphism that depart from neutral expectations are generally assumed to reflect demographic history. In contrast, loci that exhibit departures from neutrality but have atypical geographic structure may be candidates for selection. To investigate this possibility, we measured population differentiation (Fst) for the defense response genes across eight groups of accessions, which were previously identified by Structure (Falush et al. 2003) as distinguishable population clusters (Nordborg et al. 2005; supplemental Table 1). These eight groups followed plausible geographic boundaries. Fst values were calculated using alleles identified on the basis of both nucleotide substitutions and indels. No significant difference was observed between defense response genes and the genomewide Fst distributions (P = 0.49). Only one defense response gene (ACD1) had a Fst value below the lower 5th percentile of the empirical distribution, and none had Fst values in the upper 5th percentile of the empirical distribution. These results argue strongly against widespread selection corresponding to eight geographical regions (see Nordborg et al. 2005).
Comparisons with empirical distributions:
Unusually large values of the summary statistics Ksmax (allelic divergence), π (nucleotide diversity), S (number of segregating sites), Tajima's D (test for selection), number of protein variants, and interallelic Ka/Ks ratios (comparison of number of amino acid replacement changes with silent changes) are possible indicators of a locus being under the influence of balancing selection, whereas unusually small values for these summary statistics suggest either a recent selective sweep or a purifying selection. Comparison of the 27 defense response genes with the reference loci revealed that defense response loci have fewer protein variants (mean = 2.78) than the reference loci (mean = 5.38; Mann–Whitney U-test, P < 0.0005), and lower nonsynonymous site diversity (πnonsyn = 0.0008 for defense response genes vs. πnonsyn = 0.0017 for the reference loci, P < 0.005) and lower number of nonsynonymous segregating sites (mean = 0.0071 for defense response genes vs. mean = 0.0132 for the reference loci; Mann–Whitney U-test, P < 0.05).
Reduction in the number of protein alleles by a recent selective sweep can be distinguished from purifying selection against amino acid replacement mutations by asking whether synonymous nucleotide diversity is also reduced. Synonymous site diversity is slightly but not significantly higher for the defense response genes as compared to the reference loci (πsyn = 0.0020 vs. 0.0017, respectively; P = 0.27). Thus, the polymorphism data favor purifying selection.
The defense response genes and the reference loci did not differ significantly for their Tajima's D (synonymous and nonsynonymous polymorphism combined): −0.5065 and −0.8338 (P = 0.07) for the defense response genes vs. the reference loci, respectively. Defense response genes are both in the lower and upper 5% tail of the empirical distribution, but the variance of the Tajima's D estimates for the individual defense loci does not differ from that of the empirical distribution (F-test, P = 0.47). For ETR1 we found that one of the three concatenated fragments (starting at position 1.24739369) had a significantly high Tajima's D value (2.2573), in the upper 2.5th percentile of the empirical distribution. Further analysis revealed no significant difference between the set of 27 defense response genes and the empirical distribution for Tajima's D values calculated on the basis of nonsynonymous sites only (P = 0.23). Average Tajima's D calculated on the basis of synonymous sites, however, is more positive (−0.2028) than the empirical distribution (−0.4394; P = 0.41). Even though this difference is not significant, this positive shift in Tajima's D is a general tendency of the group of 27 genes rather than the presence of a larger than expected number of positive outliers, since we observed only 3 defense response genes (ETR1, PAD3, and DND1) with a synonymous Tajima's D value in the upper 5% tail of the corresponding empirical distribution (P = 0.31).
We did not observe a significant difference in Rh, the minimum number of recombination events per base pair, between the defense response genes and the empirical distribution (Mann–Whitney U-test, P = 0.19). We observed an Rh value of zero for 20 defense response genes (74.1%), but this percentage is not significantly different than for the empirical distribution (68.7%; Mann–Whitney U-test, P = 0.55). This result stands in strong contrast with our similar analysis of R genes, where recombination rates were significantly higher as compared to the empirical distribution for a given π-value (Bakker et al. 2006).
Ksmax, a measure of the maximum number of synonymous substitutions between any pair of accessions at a locus, provides a gauge of the time to the common ancestry of the sample. Large values indicate the presence of long-lived alleles at a locus, whereas small values imply a recent common ancestry of alleles. Compared to the empirical distribution, the most diverged alleles at a defense response locus do not tend to be unusually old (P = 0.44). For 1 defense response gene (BGL1) we observed Ksmax = 0, but this frequency of zero values (3.7%) in the set of 27 defense response genes is not significantly different from the empirical distribution (8.8%; Mann–Whitney U-test, P = 0.39). We did not observe interallelic Ka/Ks values significantly >1 for any defense response locus, which if present might indicate adaptive divergence of alleles (data not shown).
Comparisons with outgroup sequences:
We have obtained corresponding A. lyrata orthologous sequences for all 27 defense response genes. Ka/Ks values between A. thaliana and A. lyrata ranged from zero for PAL1 (At2g37040) to 1.21 for PAD3 (At3g26830). None of the 27 defense response genes exhibited a Ka/Ks value significantly >1, indicating an absence of recurrent adaptive substitution. This observation was confirmed by nonsignificant McDonald–Kreitman test results for all 27 tested defense response loci, indicating no significant deviations from the neutral expectation. This means that the majority of substitutions are neutral or nearly neutral and that the most prevalent type of selection acting on these genes is purifying selection.
Search for positive selection in defense response genes:
Even though the distributions of summary statistics for the 27 defense response genes conform reasonably well to the empirical distribution, 8 loci can be identified as having a combination of values that might qualify them as candidates for selective sweeps. PAL1, COI1, EIN3, PAD4, BGL2, CYP83B1, PBS1, and CPR5 are all observed to have low πsyn, Rh, Ksmax, Tajima's D, and number of protein variants. For 3 of these genes (COI1, PAD4, and CYP83B1), nucleotide diversity values were located in the lower 5% tail of the empirical distribution. None of these 8 defense response genes exhibit evidence for possessing recombinant alleles (Rh = 0), but with low levels of polymorphism the ability to detect recombination is also low. Three of these 8 defense response loci (COI1, CYP83B1, and PBS1) code for only one protein variant, whereas the remaining 5 code for 2–4 protein variants.
To further investigate possible selective sweeps for these loci, we took advantage of the reference loci database to ask whether polymorphisms in loci flanking these genes are similarly low in their levels of polymorphism. Silent nucleotide diversity (πs) was calculated for 5–19 flanking fragments located at distances within 400 kb on each side of these eight defense response genes and the differences in πs between defense response genes and their flanking fragments were plotted against physical distance separating the loci (Figure 5). A similar analysis was carried out for the set of 876 random fragments. In these cases, we analyzed πs differences between each fragment and its two nearest neighbors at distances within 400 kb. For the set of 876 random fragments, πs differences did not increase significantly at increasing physical distance (Figure 5). πs differences between the eight defense response genes and their flanking fragments were concentrated below the median of the empirical distribution: 51 fragments below the median vs. 16 fragments above (chi-square test of independence, P < 0.0001; Figure 5), which is an indication that these loci are imbedded in larger regions of low polymorphism.
For none of the eight defense response genes did we consistently observe πs differences below the lower 5% threshold on the basis of observations for the 876 random fragments. However, for PAD4, PAL1, and PBS1 we observed low πs differences with all fragments on both sides (within 170 kb) below the median of the random fragments distribution. A similar, but less obvious pattern was observed for the other defense response genes, where πs differences with fragments flanking one side of the defense response gene are in general low (below the median of the empirical distribution), whereas πs differences are relatively high on the other side. For one defense response gene, EIN3, we observed large πs differences with fragments on both sides. Such a pattern might be expected at a locus following a selective sweep if the recombination rate is sufficiently high to overcome linkage disequilibrium induced by selection at neighboring loci.
We also looked for evidence of partial selective sweeps of individual alleles using a haplotype-sharing approach (Toomajian et al. 2006; also see materials and methods section). A haplotype sharing score was assigned to each of 265 defense response alleles where the minor allele frequency at the corresponding polymorphic site was ≥2 (Figure 6). A score was also assigned to each allele selected on the basis of the same criterion from the 1204 reference fragments. A sliding window along alleles ordered by frequency was used to produce the 95th percentile of the random fragments distribution. Although only nine (3% of 265) defense response gene alleles (from five defense response genes) between the frequency of 0.05 and 0.95 are located above this percentile, we consider these candidates for recent selection because the tail of the empirical distribution may itself contain adaptive alleles. One allele from each of two different genes (AHBP1 and DND1) is associated with long haplotypes. There are multiple (2–3) alleles associated with long haplotypes for three other genes (ESP, EDS5, and NHL3). The two outlying alleles from ESP are nonoverlapping groups of accessions corresponding with different geographical regions, whereas the two and three outlying alleles from EDS5 and NHL3, respectively, share the same groups of accessions. Six of these alleles have a haplotype sharing score in the top 2.5% of the empirical distribution (AHBP1, one allele of ESP, one allele of EDS5, and all three alleles of NHL3). For only two loci is the allele at a relatively low frequency (<8 accessions; ESP and EDS5), the rest of the alleles are at frequencies varying between 23 and 88 accessions.
The defense response genes have fewer protein variants and are less variable than random genomic fragments, indicating either a purifying selection or a recent selective sweep. However, for a few of these genes (marked with an open triangle in the PCA graph, Figure 3) we observed high positive Tajima's D and Ksmax values for synonymous sites. The best candidate for balancing selection is ETR1. The allelic divergence (Ksmax) of this locus is the largest of all the 27 defense response genes. In addition, one of the three concatenated fragments of this locus has a significant Tajima's D value (2.2573). The other candidates for balancing selection are NPR1, PAD3, NHL3, and DND1, since their Tajima's D values for synonymous sites are located in the upper 5% tail of the corresponding empirical distribution. With the exception of PAD3 (Ksmax = 0.0255), all these genes have intermediate to large Ksmax values (0.0529–0.0929). Two other genes, ACD1 and DND1, have relatively high Ksmax values (>0.09). One of the diverged alleles of ACD1 occurs at a low frequency (as indicated by a low Tajima's D for the locus), making it a less likely candidate for selection.
Overview of polymorphism patterns:
Defense response genes tend to maintain lower levels of diversity as compared to the reference loci. These lower levels of diversity most likely reflect purifying selection acting on most defense response genes (see next section). This is in sharp contrast to R genes, which in general tend to maintain high levels of diversity over short periods of time (Bakker et al. 2006). Although defense response genes are not a homogeneous set with respect to function as is the case for R genes, it is remarkable that as a group they all show low levels of diversity, indicative of purifying selection. For a few defense response loci we found evidence for (partial) selective sweeps and/or transient balancing selection. However, in general, defense response genes appear to be experiencing purifying selection and do not appear to differ from any randomly sampled genomic region with respect to their evolutionary scenario.
Defense response genes are under purifying selection:
Many of the defense response genes in our study have specific roles in signal transduction pathways and it is therefore expected that at least some of them are experiencing selective constraint. Purifying selection is characterized by relatively low nonsynonymous nucleotide diversity and low numbers of protein variants. Because there is generally much less selective constraint on synonymous changes, synonymous nucleotide diversity values should exceed nonsynonymous nucleotide diversity. The defense response genes as a group display lower levels of nonsynonymous nucleotide diversity and have fewer nonsynonymous segregating sites compared with the reference loci. Synonymous sites, in contrast, do not differ significantly between the sets of loci for nucleotide diversity or segregating sites, indicating that the low nonsynonymous polymorphism levels are likely due to functional constraint. Comparison with A. lyrata orthologous sequences provided further evidence for purifying selection: for none of the 27 defense response genes did we observe a Ka/Ks value significantly >1 and/or a significant McDonald–Kreitman test result.
Defense response genes under strong purifying selection are PBS1, FAD3, CYP83B1, COI1, PBS2, and DND1, which code for only one protein variant. PBS1 codes for a protein kinase, which is involved in the detection of avr genes and R genes response (Shao et al. 2003). FAD3 codes for an omega-3 fatty acid desaturase, which synthesizes linolenic acid, a JA precursor (McConn and Browse 1996). CYP83B1 is involved in indole glucosinolate biosynthesis (Naur et al. 2003) and COI1 is involved in proteolysis (Xie et al. 1998). PBS2 is a resistance signaling protein that acts upon R genes response (Holt et al. 2005). We suggest that these six are likely to have histories dominated by purifying selection, on the basis of their relatively high levels of synonymous nucleotide diversity. One of these loci (DND1), however, has a relatively large positive Tajima's D value, raising the possibility of a selectively maintained polymorphism in another region of the gene.
It needs mention that several defense response genes also have other housekeeping functions, which could explain the conserved nature of these genes. However, of the six defense response genes that appear to be experiencing purifying selection, there is only one gene (CYP83B1) that is also involved in processes other than defense response (supplemental Table 4). It is therefore possible that this gene is under purifying selection because of its other essential functions.
Defense response genes do not experience balancing selection, selective sweeps:
Previous evolutionary analyses of a similar set of R genes in Arabidopsis revealed indications for transient balancing selection and partial selective sweeps (Bakker et al. 2006). Our data show that there is not much evidence for selective sweeps and balancing selection in the set of 27 defense response genes. However, for a proper comparison with the previously published set of 27 R genes (Bakker et al. 2006), we also have to look at these more complicated signs of selection. On the basis of comparisons with the reference loci, we conclude that as a group, purifying selection appears to be the dominant form of selection. For a small number of defense response genes, however, we have indications for positive selection (selective sweep) and balancing selection.
Eight defense response genes have a combination of low πsyn, Rh, Ksmax, Tajima's D, and number of protein variants (PAL1, COI1, EIN3, PAD4, BGL2, CYP83B1, PBS1, and CPR5), which together may indicate recent selective sweeps. Although the nucleotide diversity at these loci was lower than at other apparently neutrally evolving A. thaliana functional genes (e.g., Kuittinen and Aguadé 2000; Aguadé 2001; Hauser et al. 2001), it was not significantly different from the reference loci. We therefore believe that these defense response genes have undergone purifying selection. We identified only one candidate for a recent selective sweep (PAD4) on the basis of shared low diversity levels with neighboring genes. This gene is involved in activation of SA accumulation, camalexin synthesis, PR gene expression, and R genes response (Zhou et al. 1998). We investigated the possibility of partial selective sweeps by using a haplotype sharing approach (Toomajian et al. 2006). This analysis revealed several possible instances of a partial selective sweep: AHBP1 (transcription factor activity; Fan and Dong 2002; Pieterse and Van Loon 2004) with a haplotype near fixation (92% of accessions) and DND1 (cell-death regulator; Clough et al. 2000) with an extended haplotype at intermediate frequency (60% of accessions), corresponding with a major clade in the gene tree. This latter locus has relatively high Ksmax and (synonymous) Tajima's D values and therefore appears to be under balancing selection or is closely linked to a site under balancing selection. The fact that we observe evidence for a partial selective sweep and for balancing selection is reminiscent of observations we previously made for the R gene RPM1, where there is evidence for historical allele frequency fluctuations (Stahl et al. 1999) and for the R genes at1g56540 and at5g63020 (Bakker et al. 2006). It is possible that this defense response gene has experienced a recent rise in the frequency of a favorable allele resulting in an extended haplotype sharing surrounding the site(s) under selection. For three defense response genes: ESP (promotes the hydrolysis of glucosinolates to nitriles; Lambrix et al. 2001), EDS5 (transport of intermediates for SA biosynthesis, R genes response; Nawrath et al. 2002), and NHL3 (transport of intermediates for SA biosynthesis, R genes response; Nawrath et al. 2002), we observed >1 extended haplotype. Although multiple extended haplotypes could be the result of multiple partial sweeps at the same gene, they may simply result from the complex history of a single sweep. In fact, on the basis of their observations regarding a clustering of extreme values of a similar haplotype sharing statistic, Voight et al. (2006) have suggested that the clustering of extreme alleles may be a strong sign that at least one of the alleles is a true positive.
For EDS5 we observed 38% nonfunctional alleles coinciding with a major clade in the gene tree, whereas four other loci (ESP, ETR1, EDS1, and PAD4) had nonfunctional alleles in 1 or 2 of the 96 plant accessions. Our R gene study detected nonfunctional alleles in 17 of the 27 studied R genes in a range from 1.15 to 32.8% (Bakker et al. 2006). Nonfunctional alleles occurring at high abundances were observed to either coincide with one or more major clades (7 R genes) or to be scattered over the entire gene tree (2 R genes). Highly differentiated major clades with nonfunctional alleles in R genes are expected to be maintained by balancing selection due to a cost of resistance (Stahl et al. 1999; Tian et al. 2002, 2003). Possibly a similar mechanism has resulted in the maintenance of a clade with nonfunctional alleles in EDS5.
Four defense response genes appear to be under transient balancing selection: ETR1 (ethylene-binding activity; Guo and Ecker 2004), NPR1 (transcription regulation; Zhang et al. 1999), DND1 (cell-death regulation; Clough et al. 2000), and NHL3 (R genes response; Varet et al. 2003). Signatures of balancing selection were weak for all four of these defense response loci; however, in general, defense response genes do not appear to be under balancing selection as shown by low values for most of the summary statistics. Only synonymous Tajima's D values of these defense response genes showed a significant deviation from the empirical distribution.
Comparisons between defense response genes and R genes:
Comparison of these 27 defense response genes with 27 previously studied R genes (Bakker et al. 2006) shows that R genes display in general a significantly higher nonsynonymous nucleotide diversity (0.007 vs. 8.16 × 10−4; P < 0.0001), number of nonsynonymous segregating sites (0.028 vs. 0.007; P < 0.0001), recombination frequency (0.0037 vs. 0.0009; P < 0.005), and number of protein variants (14.2 vs. 2.8; P < 0.0001). Clearly, R genes maintain higher levels of protein variation due to natural selection acting to maintain nonsynonymous mutations and an elevated recombination frequency. It is believed that transient balancing selection is the main selective force acting on R genes causing high levels of protein variation maintained over intermediate periods of time (Bakker et al. 2006). This observation was confirmed by the resequencing study of 20 Arabidopsis accessions of Clark et al. (2007), which identified strong signatures of selection on NBS–LRR type R genes. It is widely argued that R genes experience this type of selection because of their direct interaction with pathogen avirulence ligands. In contrast, defense response genes, which do not interact with pathogen avirulence ligands, are substantially less variable. However, the near complete absence of positive and/or balancing selection signatures in this set of defense response genes is unexpected since evolutionary analyses of genes involved in other (defense response) pathways have shown varying evolutionary scenarios for groups of genes involved in those pathways. For example, studies of genes involved in the anthocyanin biosynthetic pathway have shown differences in selective constraint between upstream and downstream genes in the pathway and differences in evolutionary rates between regulatory and structural genes in maize, snapdragon, and morning glory (Purugganan and Wessler 1994; Rausher et al. 1999). In their study of genes involved in the floral developmental pathway in Arabidopsis, Olsen et al. (2002) observed that patterns of molecular evolution differ among regulatory genes with earlier acting genes exhibiting evidence of adaptive evolution. Similar patterns of evolution have been observed for various defense response pathways both in plants and in animals. Studies of defense response genes involved in the innate immunity response of some insect species have shown evidence for adaptive evolution on genes downstream in this defense response pathway (Schlenke and Begun 2003; Bulmer and Crozier 2004). Several genes involved in the glucosinolate–myrosinase system show high levels of adaptive variation and evidence for balancing selection (reviewed by Kliebenstein et al. 2005). However, in our study of 27 defense response genes in A. thaliana we did not observe any significant difference in population genetic summary statistics between genes involved in different parts of SA, JA, and ET signaling pathways. A possible explanation for this observation is that Arabidopsis SA, JA, and ET signaling pathways do not produce as high a number of variable end products as the glucosinolate–myrosinase system. A. thaliana contains >30 different glucosinolate structures (Kliebenstein et al. 2001; Reichelt et al. 2002), which are the result of genetic variation in glucosinolate core pathway and regulatory genes (Kliebenstein et al. 2002). It should be noted, however, that there is an overlap between the glucosinolate–myrosinase system and the SA and JA signaling pathways: jasmonates increase total glucosinolate concentration whereas salicylate stimulates glucosinolate accumulation (Kliebenstein et al. 2002). Two of the 27 defense response genes from our study are also described as being involved in the Arabidopsis glucosinolate biosynthesis (i.e., CYP83B1 and ESP; Kliebenstein et al. 2005). These 2 genes have not been characterized as candidates for balancing selection in previous analyses of the glucosinolate–biosynthesis system (Kliebenstein et al. 2005), confirming our observations of low levels of variation observed for these genes. Nevertheless, it is expected that genes upstream in the SA, JA, and ET signaling pathways experience less directional selection as compared to downstream and regulatory genes. A more detailed analysis of the Clark et al. (2007) data encompassing all defense response genes in combination with our data of 27 defense response genes might be able to reveal such a pattern for defense response genes.
Another possible explanation for the observed differences in evolutionary histories between defense response genes and R genes is that there is a difference in gene structure between the two groups of genes that could have an effect on to what extent high levels of diversity can be maintained. The LRR region in R genes consists of a varying number of leucine-rich repeat modules. R gene specificity is believed to have evolved through LRR module indels and/or mutations affecting amino acid changes in the solvent-exposed parts of the β-sheet structure (Jones and Jones 1996). Because of the modular nature of the LRR region it is expected that more nonsynonymous mutations are tolerated in R genes as compared to defense response genes, which do not possess such a modular region.
We investigated polymorphism in a representative sample of 27 defense response genes, mostly involved in SA, JA, or ET signaling pathways. A relatively large fraction of the genes display low numbers of protein variants, low numbers of segregating sites, and low levels of nucleotide diversity. Since most of these genes, being part of a signal transduction pathway, have conserved functions, it is not surprising to find evidence for purifying selection. Although we find hints in the data for positive selection, with instances of both directional and balancing selection, the evidence is relatively weak, which may reflect the fact that we studied short fragments of these genes. Additional polymorphisms from the unsequenced regions of these genes will likely provide additional insight into how selection acts on these genes.
The wide variety of gene tree topologies, time-depths, and summary statistics for the 27 defense response genes may reflect a continuum of evolutionary histories, with contributions from both stochastic forces and selection (both positive and negative). Yet, except for greater purifying selection on amino acid replacements, in almost every other respect this set of defense signaling pathway genes are indistinguishable from a large panel of randomly chosen loci spread across the entire Arabidopsis genome. This “unremarkable” pattern of polymorphism for this group of genes is surprising given that genes in other (defense response) pathways do seem to experience different levels of selective pressure. Genes involved in the SA, JA, and ET signaling pathways in general experience high levels of selective constraint possibly because of the conserved nature of the end products of these pathways.
We thank Tina Hu and Keyan Zhao for their help with primer design, Matt Horton and Brian Mack for their help with editing of sequence data, and Arnar Palsson for statistical help. We thank the Department of Energy's Joint Genome Institute for sequencing the Arabidopsis lyrata genome and allowing us to use the 4× draft assembly. This work was supported by a National Institutes of Health postdoctoral fellowship to C.T., a National Science Foundation grant (DEB-0115062), and a National Institutes of Health grant (GM-57994) to J.B. and M.K.
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under the following accession numbers: At1g08720, EU404192–EU404286; At1g52400, EU404287–EU404376; At1g54040, EU404377–EU404472; At1g64280, EU404473–EU404568; At2g37040, EU404569–EU404664; At3g11660, EU404665–EU404760; At3g20770, EU404761–EU404856; At3g27920, EU404857–EU404951; At3g44880, EU404952–EU405047; At3g48090, EU405048–EU405143; At3g57260, EU405240–EU405333; At4g31500, EU405334–EU405429; At4g37000, EU405430–EU405525; At4g39030, EU405622–EU405717; At5g06950, EU405718–EU405813; At5g45110, EU405814–EU405908; At5g51700, EU405909–EU406003; At5g64930, EU406004–EU406099; At1g66340, EU406100–EU406195; At4g26120, EU406196–EU406291; At5g13160, EU406292–EU406386; At2g29980, EU406387–EU406482; At2g39940, EU406483–EU406578; At3g26830, EU406579–EU406674; At5g15410, EU406675–EU406770.
↵1 Present address: Department of Horticulture, Oregon State University, Corvallis, OR 97331.
Communicating editor: D. Charlesworth
- Received October 10, 2007.
- Accepted December 21, 2007.
- Copyright © 2008 by the Genetics Society of America