Sequence diversity of 39 dispersed gene loci was analyzed in 48 diverse individuals representative of the genus Pisum. The different genes show large variation in diversity parameters, suggesting widely differing levels of selection and a high overall diversity level for the species. The data set yields a genetic diversity tree whose deep branches, involving wild samples, are preserved in a tree derived from polymorphic retrotransposon insertions in an identical sample set. Thus, gene regions and intergenic “junk DNA” share a consistent picture for the genomic diversity of Pisum, despite low linkage disequilibrium in wild and landrace germplasm, which might be expected to allow independent evolution of these very different DNA classes. Additional lines of evidence indicate that recombination has shuffled gene haplotypes efficiently within Pisum, despite its high level of inbreeding and widespread geographic distribution. Trees derived from individual gene loci show marked differences from each other, and genetic distance values between sample pairs show high standard deviations. Sequence mosaic analysis of aligned sequences identifies nine loci showing evidence for intragenic recombination. Lastly, phylogenetic network analysis confirms the non-treelike structure of Pisum diversity and indicates the major germplasm classes involved. Overall, these data emphasize the artificiality of simple tree structures for representing genomic sequence variation within Pisum and highlight the need for fine structure haplotype analysis to accurately define the genetic structure of the species.
The genetic diversity of a species is the sum of its total DNA sequence variation, resulting from millions of years of cumulative mutation, recombination, and selection. Understanding the pattern of the diversity within cultivated plant species and their wild relatives is both interesting and practically important from the viewpoint of conservation and use. Therefore, the ways by which genetic diversity in populations is estimated and represented are important. A popular approach for measuring the genetic diversity of accessions of a species is to analyze samples from gene banks (germplasm collections) by one or more of the many available molecular marker methods. The effectiveness of this strategy depends upon the suitability of the marker method for analyzing the diversity of the sample collection under investigation. Different marker methods can give different views of diversity, depending upon the evolutionary parameters of the underlying DNA sequence variation. Rapidly evolving DNAs, such as simple sequence repeats (SSRs), give high resolution views of relatedness; single nucleotide polymorphism (SNP)-based variation is more suited to deeper relationships, reflecting the slow mutation rate for this type of sequence variation; and transposon insertion-based marker methods should lie between these two, reflecting their intermediate mutation rate, although few studies have been carried out to test this. Furthermore, the genomic compartment in which the markers reside might affect the diversity pattern seen, and it is possible that markers residing mainly in junk DNA might produce different results from markers based upon genes, which are predominantly euchromatic. For these reasons there is a need to compare the diversity patterns obtained using different molecular approaches for diversity assessment.
The data sets that result from marker analysis of germplasm samples are often represented by trees, with the summed branch lengths separating any two samples taken as a measure of their relatedness. Trees are visually appealing but can be misleading. One of several serious limitations to this approach is the fact that it ignores introgression and recombination. Modeling has shown that recombination leads to long terminal branches, resulting in trees showing a “star phylogeny” (Schierup and Hein 2000). Recombination results in multiplication of the number of trees, with every inherited crossover producing a corresponding extra tree, which differs by a single branch translocation from its antecedent (Hudson 1983; Maddison 1995). One way of representing such ambiguity is to replace tree structure by a reticulated network (Bryant and Moulton 2004; Huson and Bryant 2006). In summary, the usefulness of trees derived from unlinked sites in a genome that has undergone significant recombination is questionable, and trees derived from linked sites must be considered in the context of the ratio of recombination to mutation in that region in the corresponding lineage.
We have previously studied the genetic diversity of the field pea (P. sativum) and its wild relatives, using several marker-based approaches. P. sativum is an Old World legume crop first cultivated 10,000 years ago (Blixt 1972; Zohary 1996; Mithen 2003). Like all major crop species, cultivated Pisum has a condensed gene pool relative to its wild relatives (in this case P. fulvum and P. elatius), and the relationships within the genus Pisum have generated substantial debate. Studies by ourselves and others have indicated that P. fulvum can reasonably be considered as a distinct species, with P. sativum forming a subset of P. elatius (Vershinin et al. 2003; Baranger et al. 2004; Tar'an et al. 2005). Other claimed species such as P. humile and P. abyssinicum have little support from molecular studies. Furthermore, there is extensive sharing of SSAP retrotransposon markers between all Pisum species, suggesting that there has been significant outcrossing between them (Vershinin et al. 2003), despite the predominantly inbreeding nature of the genus, which should restrict introgression of haplotypes. Most, but not all, wide crosses between Pisum genotypes are fertile, the exceptions involving P. fulvum and P. abyssinicum. P. abyssinicum and P. sativum are interfertile, although success rates can be variable, and between some accessions reciprocal crosses may not be interfertile. P. fulvum and P. sativum have a more restricted interfertility, although crosses can be successful (Clement et al. 2002), and for practical purposes P. abyssinicum has been used in a bridging cross (e.g., Forster et al. 1999). There are sporadic reports of infertility in wide crosses (e.g., Ochatt et al. 2004), but such reports should be considered against the background of demonstrable success in these wide crosses. For these reasons Pisum is perhaps best considered as a species complex with multiple subspecies that interbreed to different degrees.
The SSAP markers that have given rise to the above conclusion are based upon insertions of the PDR1 Ty1-copia group retrotransposon (Vershinin et al. 2003). In large plant genomes such as maize and Pisum, retrotransposons are mainly located in intergenic nuclear DNA, and the antiquity of individual insertion events typically lies in the 0.1–5 MYA range (Sanmiguel et al. 1998; Jing et al. 2005). One aim of the present study was to compare SSAP-based diversity analysis with gene-based sequence diversity analysis, to see whether these very different genomic fractions produce similar or different pictures of Pisum diversity. Such information would also provide pointers to an optimal way of assessing diversity within Pisum and perhaps the other major inbreeding crop species with large, transposon-rich genomes such as wheat, barley, and rice. The other main goal was to explore the variation in genic sequences for Pisum and compare the contributions of recombination and mutation to the evolution of its genetic diversity.
MATERIALS AND METHODS
Plant materials and DNA extraction:
Forty-eight Pisum accessions from the John Innes Pisum Collection (http://www.jic.ac.uk/germplas/pisum/) were selected (see supplemental Table 1 at http://www.genetics.org/supplemental/). Forty-five of these are in a core set of 52 Pisum accessions chosen for our previous study of SSAP marker-based genetic diversity analysis (Vershinin et al. 2003). The three new P. sativum accessions (JI15, JI281, and JI399) are the parents of genetic mapping populations, providing polymorphisms that would hopefully allow us to map the genes in this study, and the four samples missing from this study are P. abyssinicum accessions that have been shown by extensive marker analysis to be almost identical to the single chosen P. abyssinicum accession JI2385 (Vershinin et al. 2003). Each sample was the progeny of a single selfed plant, to ensure homogeneity, and was derived from an inbred accession, to ensure homozygosity. DNAs were extracted by the QIAGEN (Valencia, CA) DNeasy 96 method.
PCR primers for amplification of gene loci:
Primers used for gene segment amplification are shown in supplemental Table 2 at http://www.genetics.org/supplemental/. All primers are derived from Pisum gene exon sequences, and all PCR amplicons contain at least one intron. Forty-two primer pairs were originally selected, 3 of which were discarded (data not shown) because they produced mixed PCR products.
PCR amplification of pea gene-derived sequences:
All PCR amplifications were carried out using an MJ Research PTC-225 Tetrad Thermal Cycler. Twenty-five-microliter reactions contained the following: 30 ng of pea genomic DNA template, 0.2 μM each of forward and reverse primer, 2.5 μl QIAGEN HotStar 10× PCR buffer containing 15 mM MgCl2, 4 μl of dNTPs (Roche 1250 mM), and 0.125 μl (0.625 units) HotStar Taq DNA Polymerase (QIAGEN). Cycling conditions involved an initial enzyme activation step for 15 min at 95°, followed by 40 cycles of 94° for 1 min, 55° for 1 min, and 72° for 1 min, with a final extension cycle of 72° for 7 min. Amplification products were visualized by electrophoresis of 5 μl of PCR product on a 1.5% agarose gel containing ethidium bromide. PCR products were purified using NucleoFast 96 PCR plates (Macherey-Nagel). The PCR product yield was estimated by comparison with standard concentrations of λ bacteriophage DNA using agarose gel electrophoresis. Thirty-eight amplifications failed to produce a PCR product (2% of the complete set).
Sequencing PCR reactions used 0.33 μl Big Dye Terminator v3.1 Cycle Sequencing RR-100 (Perkin-Elmer, Norwalk, CT), 3.33 μl BetterBuffer (Web Scientific), 0.44 μM primer, and 6 ng template PCR DNA fragment in a final volume of 10 μl. Cycling conditions involved 25 cycles of 96° for 30 sec, 50° for 15 sec, and 60° for 4 min, with a temperature ramp rate of 1°/sec. Products were purified either using genCLEAN plates (Genetix) or by ethanol precipitation, as follows: 31 μl of a 0.1 M sodium acetate pH 4.6 in 95% ethanol solution was added. This was allowed to incubate for 15 min at room temperature, followed by a 4000 rpm spin for 30 min in an Eppendorf 5810R plate centrifuge. Plates were inverted and spun at 700 rpm for 1 min. A total of 150 μl 70% ethanol was added, and the plates were vortexed and incubated for 15 min before another spin at 4000 rpm for 10 min. The ethanol wash step was then repeated to ensure efficient removal of unincorporated sequencing dye, and samples were left to air dry for 20 min. Electrophoresis of products from the sequencing reactions was carried out using an ABI 3730 capillary sequencer.
DNA sequence analysis:
The output sequence traces from the sequencer were pretrimmed using PHRED with a quality score of 13 or higher (Ewing et al. 1998; Ewing and Green 1998; http://www.phrap.com/phred/). Pretrimmed sequences were imported into Bioedit Sequence Alignment Editor (version 220.127.116.11; http://www.mbio.ncsu.edu/BioEdit/bioedit.html), and the sequences for each gene were aligned with ClustalW (Thompson et al. 1994). At this point all the sequence traces were visually checked against the corresponding alignment to identify regions of poor sequence quality leading to errors in polymorphism assignment. This resulted in the discarding of 3 complete alignments and 49 individual sequences (2.6% of the remaining data set) due to unacceptably low sequence quality. The alignments were then trimmed to the lengths of the shortest members, and presumptive polymorphisms were identified. To test the accuracy of polymorphism identification a second, independent round of polymorphism searching was performed. The sequence traces were imported into Mutation Surveyor V3.01 (SoftGenetics LLC http://www.softgenetics.com/mutationSurveyor.html). This software displays polymorphisms in both graphical and tabular forms, linked to the sequence traces and with associated quality scores. The Mutation Surveyor environment was used to cross check visually the sequence traces for all presumptive polymorphisms identified in both searches in the complete sequence set. For the remaining 39 alignments, 14 discrepancies between the two methods were identified out of 1021 presumptive polymorphic sites in 13,503 bp (∼0.1%). These 14 discrepancies were all resolved after close inspection of the traces.
DNA marker analysis and genetic linkage analysis:
Single nucleotide or insertion-deletion polymorphisms segregating between sample JI399 and either or both of JI15 and JI281 were identified in ClustalW sequence alignments. Allele-specific PCR marker primers are shown in supplemental Table 3 at http://www.genetics.org/supplemental/. PCRs using QIAGEN HotStar Taq DNA polymerase contained 0.1 μM each primer and 20 ng template genomic DNA and used the following program: 95° for 15 min, 40 cycles of 94° for 30 sec, 55° for 30 sec, and 72° for 30 sec, followed by 72° for 7 min. For loci Pis_GEN_27, Pis_GEN_28, and psat_EST_00176 the annealing temperature was reduced to 51°. The resulting markers were resolved by size difference on agarose gel electrophoresis by the MS-PCR method (Rust et al. 1993), the CEL I endonuclease approach (Kulinski et al. 2000), and/or by the tagged microarray (TAM) approach (Flavell et al. 2003; Jing et al. 2007). In the latter case e and g tags were appended to X and Y tags (supplemental Table 3), respectively, by inclusion of TAM tag primers e-X (ACCGCATCCGAACATTTGTC[spacer C-18]CGTGCCGCAAGGACGGGC) and g-Y (GCCGATAATCACCTTGTCAC[spacer C-18]TATATTATGGGCCGCACTGACGGAC), and the e and g tags were detected by the TAM approach. Markers were scored in either or both recombinant inbred line (RIL) mapping populations JI15 × JI399 and JI281 × JI399 (Ellis et al. 1992). Loci Psat_EST_172, Psat_EST_185, Psat_EST_189, Psat_EST_191, Pis_GEN_15, and Pis_GEN_27 were scored by the CEL I approach (Kulinski et al. 2000) in an attenuated RIL population of 16 and positioned by matching scores to existing data in an Excel spreadsheet.
Phylogenetic and diversity analysis:
Combined trees for all gene introns were created using DNAdist [neighbor joining (NJ)] and DNAml (maximum likelihood) in the Phylip package (http://evolution.genetics.washington.edu/phylip.html) after concatenation of DNA sequence alignments created using ClustalW within the Bioedit package (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Individual gene trees were constructed in the same way using DNAdist. The SSAP tree was a reanalysis of the data from Vershinin et al. (2003) based on Dice coefficients, using the DARwin5 package (http://mendel.ethz.ch:8080/Darwin/) (Perrier et al. 2003). Where a sequence was missing, a corresponding gap was inserted. Trees were modified with TreeView v1.6.6 (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). Diversity data analysis (Table 1) was carried out using DnaSP v4.10 (http://www.ub.es/dnasp/). Distance matrices were obtained using DNAdist, and data manipulations (Figures 4 and 5) were carried out in Microsoft Excel spreadsheets. Linkage disequilibrium analysis used the TASSEL (http://www.maizegenetics.net/tassel) (Bradbury et al. 2007) and DnaSP packages.
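As an illustration of the Dice-based distances underlying the SSAP tree, the following minimal Python sketch computes a Dice distance matrix from presence/absence band scores. It is not the DARwin5 implementation; the function names, the toy band scores, and the convention of returning a similarity of 0 when neither sample carries any band are assumptions made for the example.

```python
def dice_similarity(profile_a, profile_b):
    """Dice coefficient for two equal-length 0/1 marker profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same markers")
    # Bands present in both samples.
    shared = sum(1 for x, y in zip(profile_a, profile_b) if x == 1 and y == 1)
    total_present = sum(profile_a) + sum(profile_b)
    if total_present == 0:
        return 0.0  # assumption: two empty profiles carry no similarity signal
    return 2.0 * shared / total_present


def dice_distance_matrix(profiles):
    """Return sample names and the matrix of 1 - Dice similarity."""
    names = sorted(profiles)
    matrix = [
        [1.0 - dice_similarity(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical band scores (1 = insertion band present, 0 = absent).
toy_profiles = {
    "sample_1": [1, 1, 0, 1, 0],
    "sample_2": [1, 0, 0, 1, 1],
    "sample_3": [0, 0, 1, 0, 1],
}
names, matrix = dice_distance_matrix(toy_profiles)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

A matrix of this kind can then be passed to any neighbor-joining implementation; shared absences do not contribute to the Dice coefficient, which suits dominant insertion markers.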
Putative recombination breakpoints within gene segments were calculated from aligned single gene segment sequences using the TOPALi software (http://www.bioss.ac.uk/knowledge/topali/) (Milne et al. 2004) with the difference of squares (DSS) option. The statistical significance of DSS peaks, determined by the ϕ test (Bruen et al. 2006) within the TOPALi package, was estimated with parametric bootstrapping. Network analysis of the combined aligned sequence set used the split decomposition approach in the software package SplitsTree4 (http://www-ab.informatik.uni-tuebingen.de/software/splitstree4/welcome.html) (Bandelt and Dress 1992; Bryant and Moulton 2004; Huson and Bryant 2006). The statistical ϕ test (Bruen et al. 2006) was carried out within SplitsTree4.
RESULTS
To investigate gene-based diversity in pea we adopted a strategy of sequencing multiple gene regions in a set of samples that were selected to represent the full diversity of the genus Pisum (Table 1; supplemental Tables 1 and 2). The Pisum sample set comprises 45 accessions including 10 P. fulvum, 10 P. elatius, 2 P. humile, and 22 P. sativum landraces and cultivars. This is a subset of a core set of 52 Pisum accessions used in a previous SSAP marker-based genetic diversity study (Vershinin et al. 2003), allowing us to compare the diversity patterns deduced from the two experimental approaches. Three additional cultivars (JI15, JI281, and JI399) were analyzed because they are the parents of genetic mapping populations, allowing genetic mapping of the analyzed genes.
Thirty-nine genes were analyzed, to both ensure against bias associated with individual genes and to investigate the partitioning of diversity in a variety of genomic sites. Five of these genes were selected because of their significance in seed development and composition or plant architecture, and the rest were selected at random from Pisum sequence databases. The PCR amplicons were chosen to contain intron sequence, because of the higher variation in such regions, and exon-derived PCR primers were used because the primer sites are better conserved, minimizing PCR failure and associated data losses.
Gene sequence-based diversity in Pisum:
Alignments were generated for each of the 39 gene segment sequence sets. Nucleotide sequence diversity parameters were then determined for each sequence set, after trimming to remove missing data (Table 1). The regions sequenced are in the range 115–565 bp, with an average of 345 bp and a total aligned sequence set of 13,436 bp per sample. Haplotype diversity is high for most of the loci, averaging 0.749 (SD 0.184), with a notable exception, AJ418375, encoding the Nork receptor-like kinase involved in the early steps of nodulation signal transduction (Endre et al. 2002), with 0.084. The mean nucleotide diversity π varies over 40-fold, from 0.0009 for Nork to 0.0366 for the PS-1AA4/5 auxin-regulated gene, with a mean of 0.0108 (SD 0.0071), and Watterson's Θw (Watterson 1975), on the basis of the number of segregating sites in the sample, varies between 0.0032 for Nork and 0.0546 for PS-1AA4/5. Overall, these data reveal a high gene-based diversity of Pisum germplasm and show that this diversity is partitioned unevenly between genes. Tajima's D values on all 39 sequenced loci (Tajima 1989) showed four with evidence for non-neutral evolution [Pis_Gen 9, Pis_Gen 15, Psat_EST_188, and Psat_EST_201, with D values of −2.1714 (P < 0.05), −1.8259 (P < 0.05), −2.258 (P < 0.01), and −2.088 (P < 0.05), respectively].
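The three summary statistics reported above can be sketched in a few lines of Python. This is a minimal illustration using the standard formulas for π, Watterson's Θw, and Tajima's D, not a reproduction of the DnaSP analysis; the function name, the toy alignment, and the choice to discard any alignment column containing a gap or ambiguous base are assumptions.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Per-site pi, Watterson's theta, and Tajima's D for aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    # Assumption: complete deletion of columns with gaps or ambiguity codes.
    cols = [c for c in zip(*seqs) if all(base in "ACGT" for base in c)]
    length = len(cols)
    if length == 0:
        raise ValueError("no usable alignment columns")
    seg_sites = sum(1 for c in cols if len(set(c)) > 1)

    # Mean number of pairwise differences (k).
    pairs = list(combinations(range(n), 2))
    k = sum(sum(1 for c in cols if c[i] != c[j]) for i, j in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # D is undefined when there is no variation (or the variance term vanishes).
    tajima_d = (k - seg_sites / a1) / sqrt(variance) if variance > 0 else None

    return {
        "pi": k / length,
        "theta_w": seg_sites / (a1 * length),
        "tajima_d": tajima_d,
    }


# Hypothetical four-sequence alignment with two singleton polymorphisms.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGTACGT"]
stats = diversity_stats(toy_alignment)
print(stats)
```

In the toy alignment both polymorphisms are singletons, so π falls below Θw and D is negative, the same direction as the four loci flagged above. Significance testing of D is not covered by this sketch.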
Sequence-based diversity trees:
To investigate the tree structure of the complete sequence set, the aligned gene-derived sequences were concatenated and an NJ tree was derived from the distance matrix (Figure 1A). The most distinct clade in this tree, with 100% bootstrap support, comprises the ten P. fulvum accessions. This is flanked by seven other wild Pisum accessions with weak bootstrap support (64%). Only two other groups are strongly supported by bootstrap analysis, comprising two pairs of P. sativum cultivars.
The gene sequence-based tree was compared to an NJ tree for an identical set of Pisum samples derived from retrotransposon insertion SSAP marker data (Vershinin et al. 2003; Figure 1B). The SSAP tree shares the deeper features described above for the gene sequence tree, such as a well-supported P. fulvum clade, with the same flanking set of seven wild Pisums. This broad agreement for the deeper structure of the genetic diversity for Pisum, involving most of the wild samples in this analysis, between two approaches that utilize very different fractions of the genome, suggests that both are giving an accurate picture at this level.
Single locus diversity trees:
Despite the similarities in tree structure mentioned above, there is little detailed similarity between the two trees. Furthermore, the trees in Figure 1 tend toward a “star” phylogeny (i.e., the presence of long terminal branches) and show low bootstrap support for most of the branches, which suggests that these genomes have been recombining with one another (Schierup and Hein 2000). Consistent with this suggestion, removal of individual sequences from the concatenated complete sequence set resulted in some rearrangements to the tree structure (data not shown). The reason for this became apparent when trees derived from individual gene-derived aligned sequences were investigated. A few examples are shown in Figure 2 and all are included in supplemental Figure 1 at http://www.genetics.org/supplemental/. Several individual gene trees resemble the consensus tree quite well; for example, both Psat_EST 171 and AJ291298 show close grouping for all P. fulvum samples, with some members of the intermediate group of seven wild Pisum, mentioned above, nearby. However, many trees show structures that depart from this. For example, the AF010190 and Psat_EST_178 sequence trees split one of the P. fulvum samples, JI2530, from the other nine, and for Pis Gen 27 P. fulvum is widely distributed across the tree. Thus, the phylogeny inferred from these data depends very much on which gene is used.
Heterogeneity of gene-based genetic distance values:
The variance in genetic distance inferred from individual gene loci can be analyzed in a quantitative manner using the distance values. A few examples of the variation in pairwise distances between gene loci, together with the average for all pairwise distances, are shown in Figure 3. The 39 individual gene loci in the figure are ordered by increasing diversity (π) for the combined data set (i.e., in the same order as for Table 1). Figure 3a shows the distances between two cultivars, JI321 and JI399, which show the lowest mean distance separating them in the entire sample set (0.0036) and are tightly linked in the combined NJ tree (Figure 2). Twenty-three out of 37 loci are monomorphic between these samples, and the highest divergence corresponds to a genetic distance of 0.0219 (for Pis_Gen_8). Figure 3b involves JI321 again, this time with JI1267, a P. sativum landrace from India. Thirteen out of 37 loci are monomorphic in this sample pair, and the mean genetic distance is higher (0.0090), as expected. When JI321 is compared to the wild P. elatius sample JI261 (Figure 3c), only three loci are monomorphic in the sequenced region, and the mean genetic distance is still higher (0.0138). Lastly, when two of the most distantly related samples are cross-compared (JI261, P. elatius, and JI2517, P. fulvum) the mean genetic distance is almost doubled (0.0234), with 2 out of 36 loci monomorphic (Figure 3d).
Because the gene order in Figure 3 reflects the increasing global diversity in the combined gene set (Table 1), one would expect an increase in individual allele-to-allele genetic distance from left to right for individual sample pairs. This is broadly true (Figure 3), but there is obviously a large amount of scatter. For example, gene 3 (Pis Gen 9) is one of the least diverse of the 39 genes in this study (Table 1), yet it shows one of the highest genetic distances between JI261 and JI2517 (arrow in Figure 3d), while gene 22 (Psat EST 175, also arrow in Figure 3d), a medium diversity gene in the complete gene set, is monomorphic between these two highly diverse plant samples.
To quantify this scatter in genetic distance, each individual sample–sample distance value for each locus was divided by the corresponding average distance for all sample pairs for that locus, to normalize for varying diversity between gene loci. The normalized distances (D), with corresponding standard deviations (SD) and SD/D ratios for the complete dataset are shown in Figure 4. The normalized distance values vary between 0.233 and 1.888, and their associated SD values vary between 0.398 and 2.659. In general, closely related sample pairs show SD values slightly greater than the distances (SD/D ≥ 1), and distantly related samples tend to have SD/D ratios between 0.5 and 1.0.
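The normalization described above can be sketched as follows. This is a minimal Python illustration rather than the spreadsheet procedure actually used; the data layout, the function name, the toy distances, the use of the sample (n − 1) standard deviation, and the skipping of loci whose mean distance is zero are all assumptions.

```python
from statistics import mean, stdev


def normalized_pair_distances(dist_by_locus):
    """Summarize locus-normalized distances for every sample pair.

    dist_by_locus maps locus -> {(sample_1, sample_2): distance}.
    Returns {(sample_1, sample_2): (D, SD, SD/D)}.
    """
    normalized = {}
    for locus, pair_dist in dist_by_locus.items():
        locus_mean = mean(pair_dist.values())
        if locus_mean == 0:
            continue  # assumption: a monomorphic locus carries no information
        for pair, distance in pair_dist.items():
            # Divide by the locus-wide average to remove between-locus
            # differences in overall diversity.
            normalized.setdefault(pair, []).append(distance / locus_mean)

    summary = {}
    for pair, values in normalized.items():
        d_bar = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        ratio = sd / d_bar if d_bar > 0 else float("inf")
        summary[pair] = (d_bar, sd, ratio)
    return summary


# Hypothetical distances for three samples at two loci.
toy_distances = {
    "locus_1": {("A", "B"): 0.0, ("A", "C"): 0.02, ("B", "C"): 0.04},
    "locus_2": {("A", "B"): 0.01, ("A", "C"): 0.01, ("B", "C"): 0.01},
}
summary = normalized_pair_distances(toy_distances)
for pair, (d_bar, sd, ratio) in sorted(summary.items()):
    print(pair, round(d_bar, 3), round(sd, 3), round(ratio, 3))
```

In the toy data the closely related pair (A, B) has an SD larger than its mean normalized distance, whereas the more distant pair (B, C) has an SD/D ratio below 1, mirroring the trend reported for the real data set.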
Diversity, genetic linkage, and linkage disequilibrium:
We suggest that the scatter in distance values between gene loci is responsible for the distorted phylogenies seen in individual gene trees. If this scatter derives from introgression between different Pisum lineages that has shuffled gene-specific diversity in Pisum, then closely linked genes might yield similar tree structures. To investigate this possibility, we determined genetic map positions for all the gene loci in this study showing polymorphism in either or both of two recombinant inbred mapping populations (JI15 × JI399 and JI281 × JI399) derived from crosses between samples used in this study. The derived genetic linkage map is shown in Figure 5. Several gene segment pairs are closely linked, notably Psat_EST_197 and AF010190, which cosegregate; Psat_EST_191 and Psat_EST_185, separated by 5 cM; and Psat_EST_178 and Pis_GEN27, separated by 2 cM. The members of the first of these pairs are separated by ∼60 kb, or approximately seven genes, on Medicago genomic BAC clone AC139708 (supplemental Table 2). The pea genome is known to be larger than, but generally collinear with, that of Medicago (Kalo et al. 2004). The corresponding trees based on these two gene segments show some similarities in tree structure but several major differences (Figure 2). The other pairs above seem to be no more similar than any randomly chosen pair (supplemental Figures 1 and 2 at http://www.genetics.org/supplemental/). There is therefore little support for the suggestion that closely linked gene loci are showing coordinated DNA sequence evolution in this collection of germplasm.
The largely independent evolution of closely linked loci in Pisum led us to explore the conservation in linkage disequilibrium (LD) in these samples. Three of the seven P. sativum linkage groups (LGs) had five or six mapped gene loci available, and LD was investigated within and between the loci on each of these three groups. One example (LG III) is shown in Figure 6, and the other two, which produced broadly similar results, are in supplemental Figure 2. To perform the analysis the gene loci sequences were concatenated in their chromosome order. We had no direct information of the gene polarity but our main goal was intergenic LD decay, and intragenic LD decay would be very low compared to intergenic effects. LD decay is only valid relative to the germplasm in which it is investigated, and rates of LD decay increase as the germplasm broadens (Morrell et al. 2005; Caldwell et al. 2006). In the Pisum cultivars there is apparently extensive LD between the loci of LG III, but there is no statistical support for this conclusion because of the low sample number (only seven cultivars, supplemental Table 1). For landraces there is extensive LD within the sequenced loci but little evidence for LD between loci, even for the very closely linked Psat_EST_197 and AF010190. This result is consistent with the lack of close similarity between the trees derived from these loci mentioned above and underlines the fact that even closely linked loci appear to be evolving in a largely independent manner in noncultivar Pisum. Lastly, wild samples show apparent sporadic LD conservation but this is confined to relatively few markers, and we suggest that this is an artifact deriving from the strong population substructure in the wild sample set (see discussion). Indeed, there is evidence for LD decay within some of the loci in wild germplasm (Figure 6).
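For readers unfamiliar with pairwise LD measures, the sketch below computes r² between two biallelic sites. It assumes that each sequence from a homozygous inbred line can be treated as a single haplotype; whether this estimator matches the exact settings used in TASSEL and DnaSP is not stated here, and the function name and toy site columns are hypothetical.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites scored across the same haplotypes.

    site_a and site_b are strings with one base per sample. Samples with a
    gap or ambiguous base at either site are skipped. Returns None unless
    both sites are biallelic among the retained samples.
    """
    pairs = [
        (a, b) for a, b in zip(site_a, site_b) if a in "ACGT" and b in "ACGT"
    ]
    alleles_a = sorted({a for a, _ in pairs})
    alleles_b = sorted({b for _, b in pairs})
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        return None

    n = len(pairs)
    ref_a, ref_b = alleles_a[0], alleles_b[0]
    p_a = sum(1 for a, _ in pairs if a == ref_a) / n
    p_b = sum(1 for _, b in pairs if b == ref_b) / n
    p_ab = sum(1 for a, b in pairs if a == ref_a and b == ref_b) / n

    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical site columns across eight inbred samples.
complete_ld = r_squared("AAAATTTT", "CCCCGGGG")
no_ld = r_squared("AATTAATT", "CGCGCGCG")
print(complete_ld, no_ld)
```

With only seven cultivars, as noted above, estimates of this kind have very wide sampling error, which is why the apparent LD in cultivars lacks statistical support.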
Evidence for recombination from mosaic detection and phylogenetic network analysis:
The combined analyses above suggest that recombination has shuffled gene loci in Pisum. To test whether intragenic recombination could be detected in any of these sequences, individual gene segment alignments were analyzed by the TOPALi package, which searches for statistically significant evidence of chimeric sequences in aligned sequences (Milne et al. 2004). This analysis revealed 9 of the 39 gene segments that show such evidence for recombination (Pis_Gen_8, Pis_Gen_12, Pis_Gen_28, Psat_EST_163, Psat_EST_178, Psat_EST_190, Psat_EST_191, Psat_EST_196, and Psat_EST_202; supplemental Figure 3 at http://www.genetics.org/supplemental/).
As a final test of the impact of recombination upon the genetic diversity of Pisum and to determine the lineages most affected, we reanalyzed the complete concatenated sequence set used for Figure 1 using the split decomposition approach, which identifies departures from treelike structure in molecular diversity data sets and visualizes them as reticulated networks (Bandelt and Dress 1992; Bryant and Moulton 2004; Huson and Bryant 2006). The results of this approach (Figure 7) show extensive evidence for non-treelike structure, indicated by reticulate structure in the network, and identify the samples exhibiting strongest evidence of recombination. Among the most affected samples are JI3147–JI261 (two P. elatius samples), JI3151–JI241 (P. elatius–P. humile), JI2078–JI228 (P. elatius–P. sativum), and the latter pair with adjacent sectors, mainly comprising P. sativum samples, in the network. There is also evidence for network structure involving the P. fulvum group, both internally and with adjacent P. abyssinicum (JI2385), P. elatius (JI3155, JI1094, JI3149, JI3147, and JI261), and P. humile (JI1794). A statistical ϕ test (Bruen et al. 2006) showed significant evidence for recombination (P = 0).
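The idea of a departure from treelike structure can be made concrete with the four-point condition: in a distance matrix that fits a tree exactly, for every quartet of samples the two largest of the three pairwise distance sums are equal. The Python sketch below scores each quartet by how far it departs from this condition. It is a much simpler stand-in for split decomposition, not the SplitsTree4 algorithm, and the function names and toy matrices are assumptions.

```python
from itertools import combinations


def quartet_delta(dist, i, j, k, l):
    """Departure from the four-point condition for one quartet (0 = treelike)."""
    sums = sorted([
        dist[i][j] + dist[k][l],
        dist[i][k] + dist[j][l],
        dist[i][l] + dist[j][k],
    ])
    spread = sums[2] - sums[0]
    if spread == 0:
        return 0.0  # unresolved star quartet, still consistent with a tree
    # Gap between the two largest sums, scaled to lie between 0 and 1.
    return (sums[2] - sums[1]) / spread


def mean_delta(dist):
    """Average quartet score over all quartets of a square distance matrix."""
    quartets = list(combinations(range(len(dist)), 4))
    if not quartets:
        raise ValueError("need at least four samples")
    return sum(quartet_delta(dist, *q) for q in quartets) / len(quartets)


# Distances from the tree ((A,B),(C,D)) with all branch lengths equal to 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Hypothetical conflicting distances that no single tree can fit.
reticulate = [
    [0, 1, 1.5, 2],
    [1, 0, 2, 1.5],
    [1.5, 2, 0, 1],
    [2, 1.5, 1, 0],
]
print(mean_delta(tree_like), mean_delta(reticulate))
```

Scores near 0 indicate that a single tree describes the distances well, whereas larger values point to the kind of conflicting signal that a reticulated network displays as boxes.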
DISCUSSION
The first goal of this study was to compare the diversity patterns obtained using nuclear gene region sequence and PDR1 retrotransposon insertion data. Our observation that the deep structures of the corresponding diversity trees, comprising many of the wild samples, are conserved between these two approaches is important because it validates both methods, which rely upon very different genomic compartments. In plant genomes retrotransposon insertions are mainly found as nested blocks, either between “gene islands” or in centromeric regions largely devoid of genes (Pélissier et al. 1995; Sanmiguel et al. 1996; Presting et al. 1998). The corresponding location(s) of Pisum retrotransposon insertions is not yet clear, but a Ty3-gypsy group element hybridizes to multiple dispersed spots on in situ metaphase chromosome spreads (Neumann et al. 2001), and most of the identifiable PDR1 insertions in the Pisum genome are either in transposable element DNA or unknown repetitious sequence (Jing et al. 2005). In contrast, the sequence data used here derive from 39 expressed nuclear genes. Our data imply that both sequence classes yield genetic diversity data that represent faithfully the behavior of the genome as a whole. The rapid decay of LD shown here suggests that this is not due to hitchhiking effects; rather, we propose that the great majority of both PDR1 insertions and gene sequence mutations are selectively neutral and accumulate at comparable rates across the Pisum genus.
The reasonably large number of gene loci studied (39) and extensive sequence set (>13 kb) obtained for each give us reason to be confident that the diversity picture obtained for this well-studied Pisum sample set is an accurate one. DNA sequence of genes should reveal deeper evolutionary relationships than retrotransposon insertions, because its rate of evolution (∼10⁻⁸–10⁻¹⁰ mutations site⁻¹ yr⁻¹) is orders of magnitude lower than the estimated transposition rate of PDR1 of 5 × 10⁻⁷ insertions element⁻¹ yr⁻¹ (Jing et al. 2005).
This study has also provided interesting data on the gene-based genetic diversity of Pisum, which is shown to be a diverse genus, with one polymorphic site on average every 15 bp across the sequences studied here. A comparison between our data and analogous gene-based sequence data for two other important crop species, barley (Caldwell et al. 2006) and maize (Tenaillon et al. 2001), is shown in Table 2. The data are partitioned between cultivars, landraces, and wild plants. All three species show high frequencies of mutations, which is consistent with the fact that all still have a widespread extant wild germplasm. Maize displays the highest diversity, at least for landraces and cultivars (169% Θw average relative to Pisum; wild plants were not studied). This is unsurprising, as it is an outbreeder. Pisum shows somewhat higher values for Θw than barley (143% averaged across landraces and cultivars). The reason for this is unclear to us, as both species show quite similar low outcrossing rates, broad geographic distributions, and domestication histories. Nevertheless, the values of Θw obtained here, combined with the estimate of population size from Jing et al. (2005), are consistent with a credible mutation rate in the range 1.8 × 10⁻⁸–1.1 × 10⁻⁹ for those sequences that did not show non-neutral evolution in Tajima's test (Tajima 1989; data not shown).
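The link between Θw, population size, and mutation rate used in this argument can be illustrated with a minimal sketch. It assumes Watterson's estimator in its usual per-site form and the standard neutral diploid relation θ = 4Neμ, neither of which is spelled out in the text above. The function names and all input numbers are hypothetical placeholders, not values from this study.

```python
def watterson_theta_per_site(segregating_sites: int, n_sequences: int, length_bp: int) -> float:
    """Watterson's estimator per site: S / (a_n * L), where a_n = sum_{i=1}^{n-1} 1/i."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    if length_bp <= 0:
        raise ValueError("alignment length must be positive")
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * length_bp)


def mutation_rate_from_theta(theta_per_site: float, effective_population_size: float) -> float:
    """Invert theta = 4 * Ne * mu (assumed neutral diploid model) to get mu per site per generation."""
    if effective_population_size <= 0:
        raise ValueError("effective population size must be positive")
    return theta_per_site / (4.0 * effective_population_size)


if __name__ == "__main__":
    # Hypothetical inputs: 100 segregating sites in a 1500-bp alignment of 48 sequences
    # (one polymorphic site per 15 bp), and a placeholder effective population size.
    theta = watterson_theta_per_site(segregating_sites=100, n_sequences=48, length_bp=1500)
    mu = mutation_rate_from_theta(theta, effective_population_size=1e6)
    print(f"theta_w per site = {theta:.5f}")
    print(f"implied mu       = {mu:.3e}")
```

Because pea is an annual, a per-generation rate from such a calculation can be read as a per-year rate. The 4Neμ relation is a simplification for a strongly inbreeding species.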
All of the Pisum gene loci analyzed here show high intrinsic genetic diversity (Table 1), with the single exception of the Nork locus, which displays only three polymorphisms at two sites in 2 out of 48 genotypes. Both sites are intronic, with the flanking exon sequence (65 bp) being monomorphic in the entire germplasm set. While only a very small sequence has been studied and it would be presumptuous to attach too much significance to this result, such low diversity in such a diverse species is striking and may reflect a high degree of purifying selection in the genomic region. Nork encodes a receptor kinase mediating the nodulation response to the Nod factor of rhizobial bacteria (Endre et al. 2002). It thus plays a crucial role in nitrogen fixation by pea and needs to interact with multiple other factors to achieve this, necessitating strong conservation. Of course, it is not possible at this stage to say whether Nork itself, rather than a closely linked locus, is under selection, but the rapid LD decay in this sample set argues for the former.
Multiple lines of evidence in this study suggest that the genetic structure of Pisum diversity is inadequately represented by a tree structure. The concept of a single genetic distance between two Pisum samples is a poor approximation to the reality that each locus has its own distance value, and the variation between loci for these values (correcting for inherent differences in gene diversity between loci) is of the same order as the values themselves. Thus, different genetic loci in Pisum can display very different pictures of the genetic diversity of the species, and simply averaging these hides the true diversity pattern. Our LD and TOPALi analyses strongly suggest that recombination is the major reason for this. The rapid decay of LD within the noncultivated germplasm, even between closely linked genes, together with statistically significant evidence for intragenic recombination in 23% of the genes studied here (9 of 39 loci), demonstrates that recombination has been very effective in shuffling Pisum genetic diversity between the major lineages of the genus. This has also been found to be the case for Arabidopsis thaliana (Nordborg et al. 2002; Kim et al. 2007), another species with highly restricted outcrossing, and has been proposed to explain the paucity of retrotransposon-based markers that are confined to individual Pisum species (Vershinin et al. 2003).
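The point that a single averaged distance hides locus-to-locus variation can be sketched as follows. This is an illustrative example only: it uses a simple uncorrected p-distance, omits the correction for differences in gene diversity between loci mentioned above, and runs on invented toy alignments with hypothetical locus names.

```python
from statistics import mean, stdev


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Positions carrying a gap ('-') or an 'N' in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def locus_distance_summary(loci: dict[str, tuple[str, str]]) -> tuple[float, float]:
    """Mean and sample standard deviation of per-locus distances for one sample pair."""
    distances = [p_distance(a, b) for a, b in loci.values()]
    return mean(distances), stdev(distances)


if __name__ == "__main__":
    # Toy data for one pair of samples at three hypothetical loci.
    toy_loci = {
        "locus1": ("ACGTACGTAC", "ACGTACGTAC"),
        "locus2": ("ACGTACGTAC", "ACGTTCGAAC"),
        "locus3": ("ACGTACGTAC", "TCGAACCTAG"),
    }
    avg, sd = locus_distance_summary(toy_loci)
    print(f"mean distance = {avg:.3f}, SD across loci = {sd:.3f}")
```

When the standard deviation across loci is of the same order as the mean, as reported for the real data, the mean alone says little about how any individual locus relates the two samples.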
The observations reported here have implications for the management of plant genetic resources and the selection of germplasm for crop plant breeding. Crop plant germplasm collections, typically containing thousands of wild, landrace, and cultivated samples, are seen as a powerful resource for the introduction of new gene allele combinations into cultivated germplasm. The global diversity of such collections is often represented in a tree format, which is used for selecting sample subsets (core collections) for further exploration or exploitation. This way of representing genetic diversity is shown here to be potentially error-prone. Sample pairs that are very closely related by global diversity analysis may nevertheless carry many distantly related gene alleles, and distantly related samples, even including samples supposedly from different species, may carry many identical gene alleles. At the moment such discrepancies cannot be predicted, and this work suggests that fine structure haplotype analysis of germplasm collections will be required to provide the solution to this problem. Unfortunately, our LD data indicate that this will be difficult to achieve in uncultivated Pisum germplasm, which carries the greatest wealth of unexplored alleles needed for future crop improvement, because the high rate of LD decay in such germplasm will necessitate a correspondingly detailed definition of the fine structure of haplotype diversity down to the gene level.
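For readers who want to examine LD between pairs of polymorphic sites in their own data, a minimal sketch of the widely used r² statistic follows. The choice of r² is an assumption, since this section does not say which LD measure was applied. The example treats each sample as a single haplotype, which is reasonable only for largely homozygous inbred lines, and the input strings are invented.

```python
def r_squared(site_a: str, site_b: str) -> float:
    """r^2 between two biallelic sites.

    Each string holds one allele per haplotype, in the same sample order.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and cover the same samples")
    alleles_a = sorted(set(site_a))
    alleles_b = sorted(set(site_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic")
    n = len(site_a)
    # Frequencies of one reference allele at each site and of their joint haplotype.
    p_a = site_a.count(alleles_a[0]) / n
    p_b = site_b.count(alleles_b[0]) / n
    p_ab = sum(
        1 for a, b in zip(site_a, site_b) if a == alleles_a[0] and b == alleles_b[0]
    ) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


if __name__ == "__main__":
    # Toy alleles at two sites across eight hypothetical inbred samples.
    print(r_squared("AAAAGGGG", "CCCCTTTT"))
    print(r_squared("AAAAGGGG", "CTCTCTCT"))
```

Plotting such pairwise values against the physical or genetic distance between sites is one common way to visualize how quickly LD decays.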
We thank Linda Milne for substantial bioinformatics help; David Marshall, Jo Dicks, and Robbie Waugh for many interesting discussions and comments; and finally Deborah Charlesworth for many constructive criticisms of this work. This work was supported by grant FP6-2002-FOOD-1-506223 (Grain Legumes Integrated Project) from the European Commission (EC) under the EC Framework program VI.
- Received August 30, 2007.
- Accepted October 11, 2007.
- Copyright © 2007 by the Genetics Society of America