While much study has gone into characterizing virulence factors that play a general role in disease, less work has been directed at identifying pathogen factors that act in a host-specific manner. Understanding these factors will help reveal the variety of mechanisms used by pathogens to suppress or avoid host defenses. We identified candidate Pseudomonas syringae host-specific virulence genes by searching for genes whose distribution among natural P. syringae isolates was statistically associated with hosts of isolation. We analyzed 91 strains isolated from 39 plant hosts by DNA microarray-based comparative genomic hybridization against an array containing 353 virulence-associated (VA) genes, including 53 type III secretion system effectors (T3SEs). We identified individual genes and gene profiles that were significantly associated with strains isolated from cauliflower, Chinese cabbage, soybean, rice, and tomato. We also identified specific horizontal gene acquisition events associated with host shifts by mapping the array data onto the core genome phylogeny of the species. This study provides the largest suite of candidate host-specificity factors from any pathogen, suggests that there are multiple ways in which P. syringae isolates can adapt to the same host, and provides insight into the evolutionary mechanisms underlying host adaptation.
PATHOGENS cannot cause disease indiscriminately, but are generally capable of avoiding or suppressing defenses in only a relatively small set of hosts. Both pathogen factors and host factors govern these interactions, and which of these two is dominant likely depends on the specific interaction. Nevertheless, there is a growing body of data that clearly demonstrates that pathogens have evolved compatibility factors that facilitate the disease process in a host-specific manner. (Cornelis 2002; Abramovitch and Martin 2004; Alfano and Collmer 2004; Espinosa and Alfano 2004; Nomura et al. 2005). However, there are several key issues about pathogen host-specificity factors that are still unknown. For example, the general relationship between virulence factors and host-specificity factors is unclear, although it is likely that the latter are a subset of the former required only on select hosts. It is also not known if the factors that are responsible for pathogen compatibility on different host species are fundamentally different from the factors that determine compatibility among different cultivars of the same host species. We do not even know at what level host specificity acts most strongly. For example, host-specificity factors may play important roles during initial colonization, during pathogen migration to the appropriate tissue or cell type, during the initiation of cellular interactions, during the maintenance of these interactions, or even at the point where the pathogen disperses to a new host (Levin 1996).
To address these questions, we first must have a way of identifying host-specific virulence factors. A reasonable hypothesis is that strains that are able to infect and cause disease on common host will share a common set of host-specificity factors. Bacterial genomes are remarkably flexible with respect to their specific complements of genes (Hacker and Kaper 2000; Dobrindt and Hacker 2001; Hacker and Carniel 2001; Dobrindt et al. 2002); consequently, associations between the genetic complement of pathogens and their hosts can be used to guide the search for host-specificity factors. One method that can be employed to identify these associations is microarray-based comparative genomic hybridization (CGH) (Behr et al. 1999; Malloff et al. 2001; Taboada et al. 2005). In this approach, genomic DNA from a variety of tester strains is hybridized against the arrayed genome of a reference or target strain. The relative hybridization intensity at each arrayed locus corresponds to the likelihood that the gene is present in the tester strains. Bacterial CGH studies have been extraordinarily successful in quantifying genome variability and characterizing the differences between pathogenic and nonpathogenic strains, but to date this approach has not been used for the large-scale identification of host-associated virulence factors (Behr et al. 1999; Salama et al. 2000; Hakenbeck et al. 2001; Israel et al. 2001; Malloff et al. 2001; Perrin et al. 2002; Saunders et al. 2004; Taboada et al. 2004; Tsolaki et al. 2004; Zhong et al. 2004; Lindroos et al. 2005; Porwollik et al. 2005; Taboada et al. 2005; van Leeuwen et al. 2005; Yao et al. 2005).
We used microarray-based CGH to identify host-associated factors from the model plant pathogen Pseudomonas syringae. P. syringae is an important agricultural pathogen that infects a tremendous diversity of important crop species. Although the host breadth of the species complex as a whole is extremely broad, individual P. syringae isolates typically have a very narrow range of hosts with which they are compatible (Wiebe and Campbell 1993; Cuppels and Ainsworth 1995; Morris et al. 2000). P. syringae is phylogenetically well characterized (Sawada et al. 1999; Sarkar and Guttman 2004; Hwang et al. 2005) and is well known for the large suite of virulence-associated (VA) systems that it uses during pathogenesis. These VA systems are responsible for the production of toxins (Bender et al. 1999), antimicrobials (McDermott et al. 2003), attachment polysaccharides (Yu et al. 1999), quorum-sensing molecules (Von Bodman et al. 2003), nutrient acquisition proteins (Cody and Gross 1984), and secreted effector proteins (Jin et al. 2003). Of these, the factors most commonly implicated in host specificity are the effector proteins secreted through the type III secretion system (T3SS).
The T3SS is a specialized protein secretion system used by both plant and animal pathogens to inject type III secreted effectors (T3SEs) directly into host cells. These effectors are the most promising candidate host-specificity factors since their role in suppressing the host defense response and promoting pathogen growth and transmission is well established (Abramovitch and Martin 2004; Alfano and Collmer 2004; Espinosa and Alfano 2004; Beth Mudgett 2005; Kim et al. 2005; Nomura et al. 2005). Work on bacterial avirulence proteins (which we now know to be predominantly T3SEs) and their cognate plant resistance proteins (Flor 1971) has clearly demonstrated that T3SEs are central to both the promotion and the restriction of host specificity (Jackson et al. 1999; Tsiamis et al. 2000; Guttman and Greenberg 2001; Abramovitch et al. 2003; Losada et al. 2004; Pitman et al. 2005).
P. syringae as a species has the largest known arsenal of T3SEs of any pathogen (Fouts et al. 2002; Guttman et al. 2002; Petnicki-Ocwieja et al. 2002; Zwiesler-Vollick et al. 2002; Greenberg and Vinatzer 2003; Chang et al. 2005). Functional studies have shown that while the T3SS as a whole is required for pathogenesis, many T3SEs have redundant or subtle roles (Lorang and Keen 1995; Charkowski et al. 1998; Guttman and Greenberg 2001; Badel et al. 2002, 2003; Losada et al. 2004; Lim and Kunkel 2005; Lin and Martin 2005). These findings suggest that host-specific virulence may require specific combinations or suites of T3SEs. This hypothesis is supported by preliminary studies demonstrating that P. syringae isolates from different hosts contain distinct suites of T3SEs (Guttman et al. 2002). Consequently, it is important to study T3SE repertoires in toto across the natural range of P. syringae hosts to understand their net function and role in host-specific interactions. CGH provides a powerful method for achieving this goal.
While it is generally assumed that horizontal gene transfer plays an important role in the evolution of these genes, the particular impact of this process and its importance relative to pathoadaptation (Sokurenko et al. 1999) has rarely been quantified. The few studies that have investigated T3SEs in a phylogenetic context have found support for both horizontal gene transfer and pathoadaptation (Prager et al. 2000; Lavie et al. 2004; Rohmer et al. 2004; Lindeberg et al. 2005; Guttman et al. 2006). While this debate still needs to be resolved, one issue that is clear is that many T3SEs are under intense selective pressures and that evolutionary changes in these genes can profoundly impact pathogen–host interactions (Rohmer et al. 2004; Pitman et al. 2005; Guttman et al. 2006; Ma et al. 2006).
We used microarray-based CGH to identify candidate P. syringae host-associated virulence factors. We assayed for the presence of 353 putative VA genes, including 53 T3SEs or putative T3SEs in 91 diverse P. syringae strains isolated from 39 unique hosts. We concentrated most of our analyses on T3SEs since they are the most promising host-specificity candidates. Our reference strain was the sequenced tomato pathogen P. syringae pv. tomato DC3000 (PtoDC3000). We tested for associations between the distribution of specific VA genes and VA gene profiles among isolates and the hosts of isolation to identify candidate host-specific virulence factors. We also mapped our findings onto the core genome phylogeny of the species to clarify the role of horizontal gene transfer in the evolution of these loci and the acquisition of new host specificities. This is the first large-scale population-level study of the association between natural genetic variation among VA genes and the host of isolation for a bacterial pathogen.
MATERIALS AND METHODS
Ninety-one isolates of P. syringae were generously donated by members of the research community as detailed in Table 1. These isolates span the geographic, phenotypic, and evolutionary diversity of the species and have representatives from all five major monophyletic groups (phylogroups) (Sarkar and Guttman 2004; Hwang et al. 2005). The collection includes 27 pathovars isolated from 39 different hosts, contained within 16 plant host families. Given the scope of the study, it was not possible to quantitatively measure host range for the collection. Nevertheless, published studies (Morris et al. 2000; Hunter and Taylor 2006) and our own analyses (W. Ma, R. L. Morgan and D. S. Guttman, unpublished results) have found that the host of isolation quite accurately reflects the ability of particular strains to grow on specific hosts and therefore is a useful surrogate for host specificity. Strains were grown in King's B (King et al. 1954) medium at 30°.
DNA isolation and PCR:
We selected 353 genes for the array on the basis of their bioinformatic assignment (Buell et al. 2003) to classes of loci known or suspected to play roles in virulence (supplemental Table 8 at http://www.genetics.org/supplemental/). These included genes responsible for the biosynthesis or assembly of alginate (a); coronatine (c); flagellar apparatus (f); general secretion pathway (g); siderophore (h); general metabolism (m); hemolysin (o); pyoverdine (p); quorum sensing (q); transcriptional regulation (r); polysaccharide (s); T3SS apparatus (t); T3SEs, putative T3SEs, and associated genes (te); unknown function (u); type IV pilus (v); and yersiniabactin (y). The genes of unknown function included on the array were upregulated in preliminary microarray studies of P. syringae during infection of Arabidopsis (G. B. Martin, unpublished results). Putative T3SEs and associated genes include those that encode some of the T3SS chaperones, genes located in the conserved effector locus (CEL) (Alfano et al. 2000), and some unannotated ORFs that have characteristics of T3SEs (Lindeberg et al. 2005). Only unique sequences of each locus were used, and in many cases only part of the coding sequence was used to eliminate cross-hybridization with paralogs or xenologs. The array was also spotted with four housekeeping genes from each of 24 P. syringae strains. These loci from recent multilocus sequence typing (MLST) studies (Sarkar and Guttman 2004; Hwang et al. 2005) were used as internal hybridization controls (discussed below). It is important to note that although some of the VA genes may not be functional in PtoDC3000, they still may have functional orthologs in other strains. Additionally, it is recognized that using only PtoDC3000 genes does result in an ascertainment bias since there are undoubtedly other VA genes not found in this strain that will be important for host-specific virulence. Nevertheless, the aim of the study is to identify genes associated with host-specific virulence, not to identify the gene(s) that strictly determine host specificity. While the latter goal is more lofty, it is also limited by our very incomplete understanding of the composition of the flexible genome of the P. syringae species complex. Nonetheless, this limitation in no way precludes us from identifying VA genes that have statistically supported associations with particular hosts and therefore putatively play a role in host-specific virulence.
Genomic DNA was isolated using the PureGene DNA isolation kit (Gentra Systems, Research Triangle Park, NC) according to the manufacturer's instructions and spectrophotometrically quantified. PtoDC3000 genes were PCR amplified using a 185-ng PtoDC3000 template, primers at a concentration of 0.36 μm, and the cycling parameters 94° for 3 min, 40 cycles at 94° for 1 min, 62° for 1 min, 72° for 2 min, and 72° for 15 min. Over 99% successfully yielded a single product as analyzed on 96-well Ready-To-Run precast agarose gels (Amersham, Buckinghamshire, UK). Clean up was performed using the QIAquick 96 PCR purification kit (QIAGEN, Chatsworth, CA). Primer sequences are presented in supplemental Table 9 at http://www.genetics.org/supplemental/.
Samples were vacuum dried, resuspended in 10 μl of CorningPronto! spotting solution, and spotted onto γ-amino-propyl-silane-coated UltraGAPS slides (Corning, Corning, NY) using a MicroGrid Pro arrayer (BioRobotics, Ann Arbor, MI) with 32 MicroSpot2500 printing pins. PCR products were fixed to the modified glass slides by treatment with 250 mJ of UV irradiation followed by a 2-hr incubation at 85°. Each gene was spotted twice per slide.
Genomic DNA was sheared by sonication (Branson sonifier) to generate fragment sizes between 0.5 and 1 kb. Approximately 1 μg of gDNA per microarray experiment was Klenow-labeled (Promega, Madison, WI) with either Cy3 or Cy5 using random primers (Promega). Reactions were purified using QIAGEN Qiaquick spin columns. Labeling efficiency was on average 1 pmol/μl and 0.7 pmol/μl of incorporated Cy3 or Cy5 fluorescent nucleotide, respectively. Equal amounts of Cy3- and Cy5-labeled genomic DNA representing two different strains were mixed, vacuum dried, dissolved in 65 μl of universal hybridization solution, and denatured at 95° for 5 min. Hybridization was performed for 13–16 hr at 42° using 0.25 pmol/μl of labeled sheared genomic DNA using the Promega Pronto! kit. Scanning of slides was performed using a two-channel confocal microarray scanner (ScanArray5000; GSI Lumionics, Boston) and ScanArray software (v3.1; Packard BioChip Technologies, Boston). All microarray experiments were performed at the Center for Gene Expression Profiling at the Boyce Thompson Institute for Plant Research at Cornell University. All hybridization experiments were replicated two or three times.
Numerical signal intensity data were adjusted to account for hybridization inconsistencies across the surfaces of individual slides by implementing the “spatial lowess” algorithm (Cui et al. 2002), which is based on linear scaling of average background values per channel followed by surface correction. Individual spots within Cy3 and Cy5 channels were also corrected for differences in amount of DNA spotted among the genes by linear scaling relative to the signal obtained from the hybridization of the reference strain PtoDC3000. The median hybridization signal intensity was obtained from two independent Cy3-labled hybridizations of PtoDC3000 genomic DNA (the standard deviation between these replicates was negligible). This value was used to determine a spot-specific scale factor, which transformed each signal to a similar intensity. Only one Cy5-labled PtoDC3000 hybridization was performed, so the Cy5 scale factor was calculated from this single value. The signal intensities for each spot were transformed by this scale factor for all subsequent hybridizations.
Scoring and validation:
We developed a new analytical method for identifying homologs of PtoDC3000 VA genes among strains using housekeeping genes as positive hybridization controls. These housekeeping genes were the same loci used in the MLST phylogenetic analysis of the P. syringae core genome (Sarkar and Guttman 2004; Hwang et al. 2005) and are therefore known to be present in all P. syringae isolates. Each chip contained four housekeeping genes from each of 24 strains that span the diversity of the P. syringae species complex. An objective cutoff was determined by identifying the signal intensity of these housekeeping genes from all of the strains not within the phylogroup of the query (probe) strain. The exclusion of strains from within the same phylogroup as the query strain removed a potential source of phylogenetic and sample size bias from the analysis.
The objective cutoff was determined on the basis of an analysis of the fully sequenced genome of the bean pathogen P. syringae pv. syringae B728A (PsyB728A) (Feil et al. 2005). First we bioinformatically determined which PtoDC3000 genes had homologs in PsyB728A by BLASTN analysis using an expected value of <e−10 and a normalized bitscore of >75 (to remove gene fragments). We used this analysis to bin the PsyB728A VA genes into those that are present or divergent. We then analyzed the data from the microarray experiment in which PsyB728A was used as the probe strain and identified a hybridization signal intensity that minimized the type I and type II errors on the basis of bioinformatic analysis. This signal intensity corresponded to the 1.25th percentile value for the housekeeping genes and resulted in a reasonable error rate of 12.9% false presents (type I) and 4.8% false absents (type II). We used the hybridization signal intensity corresponding to the 1.25th percentile for the housekeeping genes as the cutoff for each hybridization, with the specific cutoff determined independently for each hybridization experiment. We raised the housekeeping gene cutoff to the 1.4th percentile for the phylogroup 1 strains on the basis of an analysis of the PtoDC3000 hybridization results and to compensate for the evolutionary closeness of the phylogroup 1 strains to the reference strain PtoDC3000.
Following the standard convention used in CGH studies, genes with hybridization signal intensities below the cutoff were designated as “divergent” (Chan et al. 2003). In the case of two replicate microarray experiments with two replicate spots per gene, we called genes “present” if all four spots had signal intensities above the cutoff threshold described above. Genes were called “divergent” if none of the four spots had signal intensities above the cutoff. Present genes are solid in Figures 1–3⇓, while divergent genes are open. We also coded genes with light and dark shading to indicate genes that were called divergent or present, respectively, with lower confidence. The genes in the former category had one or two of the four spots above the cut-off threshold, while those in the latter group had three of the four spots above the cutoff.
A conservative cutoff is essential for CGH studies in that it ensures that all of the genes called present are in fact orthologs of the target sequences. The extremely high frequency of chimeric and truncated T3SEs (Stavrinides et al. 2006) makes a conservative assignment of orthologs even more critical. We empirically determined that the hybridization cutoff as dictated by the signal intensity of the housekeeping genes generally resulted in calling genes present only if they were at ∼80% or more identical to the target sequence. This approach is consistent with that taken in other CGH studies (Evertsz et al. 2001; Malloff et al. 2001; Taboada et al. 2005) and in fact is less conservative than the 88–92% range proposed by Falkow and colleagues (Chan et al. 2003).
Hierarchical clustering was performed using Cluster (Eisen et al. 1998) by average linkage clustering. Statistical tests for overrepresentation of profiles or genes in isolates of a given host species or family were implemented using bootstrapped permutation tests performed with dedicated PERL scripts (available from D.S.G. upon request).
The bootstrap permutation test for population structure used Watterson's θ, a measure of polymorphism based on the number of segregating sites as a measure of the average genetic distance among strains, and was calculated on the basis of the core genome's MLST housekeeping data set. To determine the statistical significance of these comparisons, we generated a null distribution of θ-values by random bootstrap resampling strains irrespective of the VA genes that they carried. Each bootstrapped data set was composed of the same number of strains as was empirically observed for each effector. One thousand bootstrapped data sets were generated for each effector, and this was compared to the observed θ-value for each individual effector. We performed this analysis for the entire strain set, as well as separately for the three largest P. syringae phylogroups. The latter analysis was performed to remove biases introduced by shared evolutionary history and to control for the fact that comparisons between phylogroups could easily mask any signal that may be present within any one group.
A likelihood-ratio test was used to determine if specific VA genes were statistically associated with specific hosts of isolation or host families. Each VA gene was tested with each host or host family independently. This test compared two models. The first model assumed that there was a strict positive association between individual VA genes and individual hosts or host families, with the probabilities for present and divergent calls being 0.999 and 0.001, respectively. A probability of 0.0 could not be used for the divergent calls since they would have zeroed out the likelihood estimates. The second model assumed that there was no association between the individual VA genes and individual hosts or host families. The probabilities for present and divergent calls were determined empirically from the observed frequency of the gene of interest among all of the strains not isolated from the host of interest. The likelihood-ratio statistic was compared to the chi-square distribution with 1 d.f.
RESULTS AND DISCUSSION
We grouped each VA gene into a functional group on the basis of its annotation (Buell et al. 2003) (Figure 1, Table 2, supplemental Table 8 at http://www.genetics.org/supplemental/) and assessed which VA functional groups are generally conserved among P. syringae isolates. We used the frequency of individual VA genes among the 91 isolates as an indicator of variation. Following standard population genetics convention used for identifying polymorphic loci, we designated VA genes as “variable” if they were present in <95% of the strains (≤86 strains) and as “conserved” if they were present in >95% of the strains. Conserved genes are likely to encode functions necessary for the survival of all strains regardless of host, while variable genes may include those that encode host- or niche-specific functions.
The functional groups with the greatest number of conserved VA genes include genes associated with pyoverdine and alginate biosynthesis, the general secretion pathway, and the flagellar apparatus (Figure 1). The conservation of these genes across diverse isolates of P. syringae indicates that these functions are indispensable regardless of host and therefore may be candidate pathogen-associated molecular patterns (PAMPs) (Nurnberger and Brunner 2002). This is supported by the fact that bacterial flagellin (PSPTO1949), the prototypical PAMP (Felix et al. 1999), is conserved across all isolates.
Interesting VA genes include those that are individually highly variable, but that are associated with a functional group consisting of genes that are largely conserved across strains. These loci belong to essential systems, but may be adaptive in specific hosts. Examples include gspD, a general secretory pathway outer membrane protein that has homology to a gatekeeper for secreted exotoxins in enterotoxigenic Escherichia coli (Tauschek et al. 2002); hlyB, potentially involved in hemolysin iron uptake (Expert 1999); and the flagellar hook genes fliD, fliK, and flgE-2, which have homology to proteins that act as adhesins in Pseudomonas aeruginosa (Yonekura et al. 2002).
The functional groups that contain the greatest number of highly variable VA genes include the coronatine biosynthetic genes and T3SEs (Figures 1 and 2), which is consistent with previous studies (Guttman et al. 2002; Hwang et al. 2005). Additionally, many genes of unknown function are also highly variable, indicating potentially important roles in host adaptation. Interesting exceptions are those highly conserved genes in functional groups that are largely composed of variable VA genes. These genes may be essential components of otherwise flexible systems. Examples include coronatine biosynthetic genes cfa5, corR, cfl, cmaC, cfa2, cmaT, and cfa3 and 10 T3SEs, putative T3SEs, or associated genes including shcF, hopS1, hopI1, PSPTO4851, hopD1, avrF, hopK1, hopAN1, hopAF1, and hopJ1. The high conservation of these genes in diverse strains may reflect selective constraint more typical of core genes and suggests that these loci may play central roles in the general disease process.
Genes encoding or associated with the type IV pilus, general metabolism, polysaccharide synthesis, siderophore, transcriptional regulation, quorum sensing, and T3SS apparatus are intermediate in their level of variation. The interpretation of these intermediate-variable genes is more complicated, given the inherent difficulty of differentiating between alleles that are highly divergent and alleles that are absent. This limitation is of course not limited to CGH studies, but is an inherent limitation of all hybridization-based approaches when there is not 100% probe-target identity (Taboada et al. 2005) (see materials and methods). Some intriguing highly variable genes found within the functional groups that are primarily composed of intermediate-variable genes include general metabolism genes such as glutathione S-transferase (gst), polysaccharide metabolism genes such as the endogluconase wssD, tonB-dependent siderophore receptors, and regulatory genes such as σ-factor rpoS, the global activator sensor gacS, and the T3SS regulator hrpS. This latter result was initially surprising since previous studies indicate that hrpS is ubiquitous among P. syringae strains (Sawada et al. 1999); nevertheless, we confirmed that the hrpS alleles are highly polymorphic by comparing our calls to the nucleotide variation observed in hrpS alleles. All of the alleles that we called divergent were <80% identical to the PtoDC3000 allele. These results are consistent with those of other CGH studies; for example, Tiedje and colleagues called gyrB divergent in their CGH study of Shewanella oneidensis due to its relatively high level of polymorphism (Murray et al. 2001). In general, these results on intermediate-variable loci are useful in that they provide information about which loci are polymorphic and suggest further investigation into their possible roles in adaptation.
We first attempted to identify those VA genes that evolved primarily through vertical descent vs. horizontal gene transfer. These two modes of evolution are not mutually exclusive, since genes that are acquired via horizontal exchange may then be transmitted vertically within a lineage indefinitely. Using the most simplistic approach, we note that four of the T3SEs (7.5%) are present in all strains, 10 T3SEs (18.9%) are present in 95% of strains, and 36 T3SE (67.9%) are present in 75% of the strains. Unfortunately, this approach cannot distinguish horizontal gene transfer from gene loss and therefore is of limited use. A better approach is to test for positive statistical associations between the distribution of individual genes and the core genome phylogeny developed by MLST of housekeeping genes (Sarkar and Guttman 2004; Hwang et al. 2005) (Figure 1 and supplemental Figure 1 at http://www.genetics.org/supplemental/). VA genes that did not display such an association were considered horizontally transferred. We then attempted to identify whether any horizontal transfer events were associated with apparent host shifts.
The most obvious association between the P. syringae core genome phylogeny and the VA gene profiles is in P. syringae phylogroup 1, which contains the reference strain PtoDC3000. Phylogroup 1 is the most highly conserved of the three largest P. syringae phylogenetic groups and, consequently, phylogroup 1 isolates are likely to carry a disproportionate proportion of homologs closely related to the PtoDC3000 target strain. However, despite this conservation, substantial variation is still obvious in the coronatine biosynthetic genes, T3SEs, and genes of unknown function in phylogroup 1 isolates. No distinct phylogenetic association is seen in the distribution of variable VA genes among the four other P. syringae phylogroups, indicating a prominent role for horizontal processes in the evolution of these genes.
We then identified potential VA gene horizontal acquisition and loss events by character mapping the distribution of each VA gene onto the core genome phylogeny. Each clade on the core genome tree was identified (boxed numbers along branches of the core genome phylogeny in Figure 1 and supplemental Figure 1 at http://www.genetics.org/supplemental/), and the percentage of strains within each clade that contained each VA gene was determined. Supplemental Tables 4 and 5 at http://www.genetics.org/supplemental/ present the percentage of occurrence of each VA and T3SE gene in each clade along with the total number of strains in each clade. Of particular interest are the clades that contain strains with similar hosts of isolation. For example, clade 40, which includes most of the strains in P. syringae phylogroup 1, is dominated by strains isolated from either tomato or Brassica hosts (radish, cabbage, cauliflower). The strains in this clade all carry hopO1-2, hopM1, hopF2, hopU1, hrpW, and other T3SEs. hopF2 and hopU1 are closely linked on the PtoDC3000 genome. It is interesting to speculate that the acquisition of some of these T3SEs was responsible for the expansion into this new niche—namely the ability to infect tomato and Brassica hosts. This is supported by the fact that there are no isolates from tomato or Brassica hosts outside of clade 40 within phylogroup 1.
Other illustrative examples include clade 29, which defines phylogroup 4. This small clade is strictly composed of strains that are pathogenic on monocot hosts. These strains share numerous T3SEs, including hopF2, hopX1, hopU1, and hopH 1. Clearly, the ancestor of this group acquired a large number of effectors that have been maintained by its descendants. It will be interesting to determine which of these T3SEs was instrumental in the movement to monocot hosts. Likewise, clade 3 defines a tight cluster of predominantly bean pathogens in phylogroup 3. The strains in this clade all contain T3SEs hopAJ1, hopQ1-1, and hopR1. These are likely candidate factors for the shift to bean hosts.
We also studied the population structure of each variable VA gene relative to the core genome phylogeny. Specifically, we asked whether the set of strains that carry any particular variable VA genes are more closely related to each other than are a set of randomly selected strains of the same size. When this test is upheld we infer that there is structure in the underlying population of strains that carry that particular VA gene. This indicates that a significant number of closely related strains carry the same VA gene; in other words, the gene has been transmitted vertically within this lineage. Alternatively, lack of structure would indicate that the specific VA gene has been randomly distributed among strains by horizontal gene transfer.
The test for population structure relied upon a bootstrap permutation test in which Watterson's θ, a measure of polymorphism based on the number of segregating sites, was used to determine the average genetic distance among strains on the basis of the core genome MLST data set. Each VA gene was analyzed independently. In general, we find that there is very little population structure among the strains carrying most T3SEs, supporting an important role for horizontal gene transfer (Table 3). Unfortunately, much of the signal in the analysis of the total data set and of the phylogroup 1 strains is masked by the very close similarity of the strains in phylogroup 1. Nevertheless, there are a relatively small number of cases where vertical evolution can be inferred. For example, the T3SE hopAA1-2 is quite rare and shows no population structure in any of the three major clades, unless all of the strains are analyzed simultaneously. The significant θ appears to be due to the disproportionate number of strains from phylogroups 4 and 5 that carry this T3SE. Likewise, hopAA1-1 is a highly variable T3SE that is overrepresented within clade 12 of phylogroup 2. The simplest conclusion from this analysis is that horizontal gene transfer has dominated the evolution of nearly all of the variable T3SEs.
Host associations and individual VA genes:
VA genes that are statistically overrepresented in isolates from specific hosts are the strongest candidate host-specific virulence factors. We used a likelihood analysis to test for an association between the presence of specific VA genes and the host of isolation against the null hypothesis that there is no association. Table 4 and supplemental Table 3 at http://www.genetics.org/supplemental/show the P-value results of a likelihood-ratio test of these hypotheses. The most obvious pattern is among the Chinese cabbage isolates, which show seven strongly significant associations, although these patterns more likely reflect the very close relationship among these strains and their high level of similarity to the reference strain PtoDC3000 than anything biologically meaningful. More interesting examples include hopAA1-1, which is found in only 46% of all strains, but is significantly associated with the three isolates obtained from rice. Two of these three isolates are clustered closely together in phylogroup 4, while the third is very divergent and found in phylogroup 2. hopAA1-1 is also marginally significant among tomato isolates (carried by 90% of all tomato isolates) and a number of other hosts including cauliflower, Japanese apricot, and bayberry. The three rice isolates also show marginally significant associations with hopAM1-2, hopQ1-2, and hopX1 (N-term). The nine soybean isolates are strongly associated with three loci. One of these is shcA, the chaperone for hopA1 (van Dijk et al. 2002), while two others are predicted to be lytic transglycosylases (hopAJ1 and PSPTO1378). hopAJ1, which is commonly considered a T3SS helper protein, is also associated with strains isolated from kidney beans. The 11 tomato isolates are significantly associated with a version of hopO1-3 that is truncated in PtoDC3000, hopP1, hopAK1, and an allele of hopAG interrupted by an insertion element in PtoDC3000. Truncated and chimeric T3SEs are extremely common and have been shown to confer new functions and secretion apparatus specificities and even to influence the host range of the bacteria carrying them (Stavrinides et al. 2006).
Host associations and VA profiles:
We attempted to determine whether specific VA gene profiles, defined as the suite of VA genes carried by any individual isolate, are associated with specific hosts of isolation. Two possible outcomes define the extremes of the expectations for this test: (1) all isolates of a given host have the same VA profile, implying that a specific complement of VA genes is required to infect a particular host and (2) all isolates of a given host have different VA profiles, implying that there are either multiple ways of infecting a given host or that there may be subprofiles that are similar among these isolates that are masked by differences in other VA genes. These differences may be due to unrecognized differences in host specificity or niche among strains.
The T3SE profiles clustered by host species and host family are shown in Figure 2. We quantified this analysis by determining the average pairwise level of profile identity shared among strains isolated from common hosts or hosts from the same families. Specifically, we asked whether the average level of profile identity was greater among strains isolated from common hosts than randomly selected strains. Profile identity was calculated by simply determining the proportion of genes that were in the same state (present or divergent) for each pairwise comparison of profiles. Statistical evaluation was performed against bootstrapped null distributions of 1000 pseudoreplicates created from randomly chosen strains. Only variable genes and hosts with three or more isolates were used in this analysis.
VA gene profiles are generally more conserved than T3SE profiles due to the presence of highly conserved functional categories (Table 5). Groups of strains isolated from Chinese cabbage, tomato, Brassicaceae, soybean, and rice have ≥75% profile identity. The profiles from Chinese cabbage and tomato isolates are statistically more similar than chance for both VA genes and T3SEs. The significant profile identity of tomato isolates is not surprising, given that most of the strains are very tightly clustered in phylogroup 1 along with the arrayed reference strain (which is also a tomato pathogen). The four cauliflower isolates are nearly identical strains that also cluster with the tomato isolates in phylogroup 1. Cuppels and Ainsworth (1995) demonstrated that all maculicola strains (Brassica pathogens) could infect tomato and that many tomato isolates could infect cauliflower, indicating that pathogens of these hosts must utilize similar suites of VA genes. Interestingly, when the larger collection of Brassica isolates is examined, significantly similar VA gene profiles are also apparent. These strains were isolated from a wide variety of Brassica hosts including radish, Chinese cabbage, and cauliflower, but are quite evolutionary divergent, being distributed in phylogroups 1, 3, and 5. This analysis suggests that there may be common selective pressures that act on all Brassica pathogens. In general, this analysis supports both predictions concerning profile identity among isolates from a common host, in that some hosts yield isolates that have very similar profiles, while many others do not.
These data permit us to ask more focused questions, such as, is there only one way to infect a specific host? And are similar VA or T3SE profiles the result of vertical descent from a common ancestor, convergent evolution, or horizontal gene transfer? The tomato isolates present an excellent opportunity to address these issues. Nine of 11 tomato isolates tightly cluster in P. syringae phylogroup 1, while the other two, PsyB76 and Pto2170, are highly divergent and found in phylogroup 2. The T3SE profile of Pto2170 is essentially indistinguishable from the 9 phylogroup 1 tomato isolates, as opposed to PsyB76, which has a highly divergent T3SE profile. Given that both the T3SE profile and evolutionary clustering of PsyB76 are different from other tomato pathogens, this strain has likely evolved the ability to infect tomato independently via convergent evolution. In contrast, Pto2170 shows evidence for the horizontal acquisition of T3SE from the phylogroup 1 tomato isolates, as it shares 36 variable T3SEs with the group 1 tomato isolates. The large-scale acquisition of T3SE via horizontal gene transfer is not entirely unlikely, given that many T3SE genes are clustered on genomic islands associated with mobile elements (Jackson et al. 2000). Twelve variable T3SEs were common to all tomato isolates, including: hopY1, hopL1, hopT1-2, hopX1 (C terminus), avrPto1, hopR1, CEL ORF4, hopB1, hopAK1, hopP1, hopAG:ISPssy, and hopO1-3. These may include those T3SEs that define the minimal set required for tomato specificity.
We extended our genomic analysis to address whether regions of the PtoDC3000 chromosome are highly variable with respect to their complement of VA genes. Clusters of high variation are putative genomic islands that may be more prone to gene acquisition and loss. Such genomic islands typically contain clusters of coordinately regulated loci required for niche specialization (Ilyina and Romanova 2002).
We mapped the variation observed for each VA gene onto the corresponding position in the PtoDC3000 genome (Figure 3). Variation was calculated as the percentage of strains that were scored as divergent for each gene. Regions of high variability include the hrp/hrc cluster, which encodes the T3SS apparatus, and the coronatine biosynthetic gene cluster. Significant clusters of variable loci were also found associated with T3SEs, flagellar synthesis genes, and yersiniabactin synthesis genes. Substantial numbers of T3SEs were observed to cluster near the hrp/hrc cluster, the coronatine biosynthetic genes, the largest plasmid, and at least two other genomic islands located at ∼0.55 and 0.95 Mb along the PtoDC3000 genome. The former island contains some highly variable genes encoding type IV pilus proteins, while the latter island contains four conserved type IV pilus biogenesis genes and putative virulence genes. The genomic regions flanking the conserved loci encoding the general secretory pathway include a number of highly variable genes, including genes encoding filamentous hemagglutinin proteins, tail tape measure proteins, and polysaccharide biosynthesis proteins.
We have provided the broadest survey of host-associated virulence factors from any pathogen by tapping into the natural genetic variation found in the P. syringae species complex. We have identified numerous candidate host-specificity genes, furthered our mechanistic understanding of how this remarkable bacterium adapts to diverse plant hosts, and provided a glimpse into the dynamic evolutionary processes that underlie these interactions. These data establish the basis for identifying mechanistic differences among evolutionarily distant pathogens that infect the same host and for determining why pathogens of divergent hosts have maintained similar complements of VA genes. We hope that this study will serve as a guide for focusing functional testing of host specificity and for framing hypotheses directed at understanding the complex mechanisms underlying host adaptation and pathogenesis.
We thank the many individuals who have provided strains to the Guttman lab and Sam Cartinhour for generously sharing DC3000 primer sequences. We acknowledge Nicholas Provart and Ryan Austin for microarray analysis advice; John Stinchcombe for statistical advice; Paul Debbie for assistance with microarray fabrication; and particularly John Stavrinides, Robyn Morgan, Wenbo Ma, and Pauline Wang for valuable input and critique. G.B.M. is partially supported by the National Science Foundation (DBI-0077622). S.F.S. was partially supported by an Ontario Graduate Scholarship. D.S.G. is a Canada Research Chair in Comparative Genomics and supported by grants from the Natural Science and Engineering Research Council of Canada and Performance Plants of Kingston, Ontario.
↵1 Present address: Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI 48824.
Communicating editor: J. Lawrence
- Received May 18, 2006.
- Accepted August 10, 2006.
- Copyright © 2006 by the Genetics Society of America