The first genetic transcript map of the soybean genome was created by mapping one SNP in each of 1141 genes in one or more of three recombinant inbred line mapping populations, thus providing a picture of the distribution of genic sequences across the mapped portion of the genome. Single-nucleotide polymorphisms (SNPs) were discovered via the resequencing of sequence-tagged sites (STSs) developed from expressed sequence tag (EST) sequence. From an initial set of 9459 polymerase chain reaction primer sets designed to a diverse set of genes, 4240 STSs were amplified and sequenced in each of six diverse soybean genotypes. In the resulting 2.44 Mbp of aligned sequence, a total of 5551 SNPs were discovered, including 4712 single-base changes and 839 indels for an average nucleotide diversity of θ = 0.000997. The analysis of the observed genetic distances between adjacent genes vs. the theoretical distribution based upon the assumption of a random distribution of genes across the 20 soybean linkage groups clearly indicated that genes were clustered. Of the 1141 genes, 291 mapped to 72 of the 112 gaps of 5–10 cM in the preexisting simple sequence repeat (SSR)-based map, while 111 genes mapped in 19 of the 26 gaps >10 cM. The addition of 1141 sequence-based genic markers to the soybean genome map will provide an important resource to soybean geneticists for quantitative trait locus discovery and map-based cloning, as well as to soybean breeders who increasingly depend upon marker-assisted selection in cultivar improvement.
THE gene space of soybean [Glycine max (L.) Merr.] is as yet undefined. Mudge et al. (2004) estimated that most genes in soybean are clustered in ∼25% of the genome (275 Mbp). One suggestion for providing information relating to gene space was to place 2000–3000 cDNA sequences onto the physical map (Stacey et al. 2004). Another approach would be to genetically map coding sequences onto the existing simple sequence repeat (SSR)-based map (Song et al. 2004). The resulting genetic map would not only indicate the positions of coding sequences, but also answer questions about relationships of coding sequences with existing SSR and RFLP markers. These relationships are important in map-based cloning, quantitative trait locus discovery, and marker-assisted plant improvement. While a fairly extensive set of >1000 genetically mapped SSR markers is available to soybean breeders and geneticists (Cregan et al. 1999; Song et al. 2004) the current map has 138 gaps of >5 cM in which no SSR marker is present. Twenty-six of these 138 gaps are >10 cM, which is troubling if these regions of low SSR marker density are also gene rich.
The presence of SSRs in expressed sequence tag (EST) sequence provides one means for the genetic mapping of the EST; however, the number of soybean ESTs that contain polymorphic SSRs appears rather limited. Song et al. (2004) were successful in developing and mapping only 24 polymorphic SSR markers from >136,000 soybean ESTs. Alternatively, discovery of single-nucleotide polymorphisms (SNPs, which include single-base changes and insertion/deletions) in genic sequence would provide a source of markers that can be analyzed via PCR and other approaches. Because SNPs are more abundant than SSRs, they improve the odds of success in a diversity of applications including positional cloning, association analysis, QTL mapping, and the determination of genetic relationships among individuals. As of December 2006, almost 28 million human SNPs had been cataloged by the National Center for Biotechnology Information SNP database, dbSNP, of which >5.6 million had been validated. The discovery and application of SNPs in plant species has lagged behind that in humans. However, the pace of SNP discovery and application has increased in species including Arabidopsis thaliana (Jander et al. 2002; Schmid et al. 2003, 2005; Nordborg et al. 2005), maize (Zea mays L.) (Tenaillon et al. 2001; Ching et al. 2002), rice (Oryza sativa L.) (Nasu et al. 2002; Feltus et al. 2004), barley (Rostoks et al. 2005), poplar (Populus trichocarpa Torr. & Gray) (Tuskan et al. 2006), and a number of other species.
Two important factors influencing SNP discovery are the frequency of sequence variants and the presence of genome duplications. The frequency of sequence variants in soybean is low due to historical genetic bottlenecks and low sequence diversity in soybean's wild ancestor G. soja (Sieb and Zucc.) (Hyten et al. 2006). Zhu et al. (2003) reported that SNP frequency as measured by nucleotide diversity (θ) was ∼0.00053 in 28.7 kbp of coding sequence analyzed in a diverse set of 25 soybean genotypes. There was more than a twofold higher nucleotide diversity in untranslated regions (UTRs), introns, and genomic DNA in close proximity to coding sequence (θ = 0.00111). These values are similar to those reported in humans (π = 0.000751) from 1.2 billion bp of aligned human sequence (Sachidanandam et al. 2001), which indicates that despite the low level of sequence variation, successful SNP discovery is still possible. In addition, >356,000 soybean EST sequences are available in GenBank, suggesting that in silico SNP discovery should be successful in soybean. However, in silico SNP discovery in soybean is impeded by the nature of the soybean genome and limited genotypic diversity of available EST data. Nelson et al. (2005) suggested the in silico analysis of EST FASTA data as an approach to SNP discovery in an inbred species such as soybean. However, the duplicated nature of the soybean genome and the paucity of a large set of EST data from alternative genotypes as well as sequencing errors inherent in FASTA data make in silico SNP discovery difficult. As a result of the duplicated nature of the soybean genome, a large proportion of soybean ESTs are members of paralogous sets of expressed genes. The soybean is likely an ancient tetraploid with a chromosome number of 2n = 40 (Hymowitz 2004). Recent analyses suggest that the soybean genome has undergone two major duplication events (Blanc and Wolfe 2004; Schlueter et al. 2004). This complicates development of sequence-tagged sites (STSs) required for resequencing to either validate in silico-derived SNPs or for de novo SNP discovery. The analysis of EST data to create the set of unique soybean genes or unigenes might reduce the problems associated with genome duplications and expedite selection of robust STSs. Vodkin et al. (2004) compiled an initial set of 61,127 soybean unigenes on the basis of the alignment of 5′ ESTs. This was followed by 3′-end sequencing of one cDNA from each of 27,513 unigenes. These sequence data provide an important resource for soybean SNP discovery.
The objective of this study was to develop STSs using EST and 3′-unigene sequence and then to use the STSs for the discovery of SNPs via the resequencing of six diverse soybean genotypes. SNPs were discovered in >2000 STSs, and >1000 SNPs derived from genic sequence were genetically mapped to create a transcript map that provides a map-based definition of the soybean gene space and will play a key role in the assembly of the whole-genome shotgun sequence.
MATERIALS AND METHODS
Zhu et al. (2003) indicated that the sequence analysis of six diverse soybean genotypes Archer, Minsoy, Noir 1, Evans, PI 209332, and Peking would permit discovery of 93% of the common SNPs (frequency >0.10) discovered in a diverse group of 25 genotypes. Among these six genotypes are the parents of four recombinant inbred line (RIL) mapping populations used in our laboratories. DNA of the six genotypes was isolated from bulked leaf tissue as described by Keim et al. (1988).
PCR primer selection in EST clusters:
All ESTs available as of August 2003 generated from cultivars Williams and Williams 82 were clustered using the assembly program CAP3 with the requirement that all ESTs within a cluster should have >95% similarity. Each of the clusters was then individually analyzed for the presence of high Phred quality polymorphisms. All like sequences within a cluster were further grouped into smaller sets of similar sequences that were likely to be paralogs. Paralog sequence variants (PSVs) were identified and primers were designed to include the PSVs at the 3′ end of the primers with the intention of providing paralog-specific amplification. Primers were selected using Primer3 (Whitehead Institute, Massachusetts Institute of Technology, MIT).
PCR primer selection in 3′ unigene sequence:
PCR primers were designed to 8587 3′-end soybean unigene sequences deposited in GenBank as a result of work reported by Vodkin et al. (2004) with a range of predicted fragments lengths between 300 and 800 bp. A total of 5798 primers were designed with OLIGO (National Biolabs, St. Paul) and Array Designer 2 (Premier Biosoft International, Palo Alto, CA) using only FASTA data while 2789 primer sets were designed with Primer 3 (Whitehead Institute, MIT) using FASTA data and Phred quality scores. When quality scores were used, all bases within primers were required to have a minimum quality score of 20. In an attempt to provide specificity, the reverse primer was positioned as far to the 3′ end of the cDNA sequence as possible with the intention of maximizing the likelihood of priming from the 3′ UTR.
Preliminary analysis of PCR primers:
Each primer pair was used to amplify genomic DNA of Archer soybean. Amplification reactions were performed with 20 ng of DNA, 0.1 μm of forward and reverse primers, 1× FailSafe PCR PreMix B buffer (Epicentre Technologies, Madison, WI) or a buffer consisting of 20 mm Tris–HCl (pH 8.0), 50 mm KCl, 200 μm each dNTP, 1.0% glycerol, 1.5 mm MgCl2, 2.0 ng/μl single-stranded DNA-binding protein, and Taq DNA polymerase in a 10-μl reaction volume (45 sec at 92°, denaturation; 45 sec at 58°, annealing; and 45 sec 68°, extension) for 40 cycles. PCR products were resolved on a 2.5% agarose gel stained with ethidium bromide. Reactions that gave no product or multiple products were reamplified with either a lower (no product) or a higher (multiple products) annealing temperature. The primer pairs that amplified a single discrete product were selected for further analysis.
The amplicon from each selected primer pair was prepared for sequence analysis by treatment with 4 units of shrimp alkaline phosphatase (SAP) and 4 units of exonuclease I incubated at 37° for 1 hr followed by 72° for 15 min to deactivate the enzymes. Labeling reactions were performed with 1 μl of PCR product, 0.5 μl of BigDye Terminators, version 1.1 or 3.1 (Applied Biosystems, Foster City, CA), 0.3 μm of one of the original PCR primers, 1× Taq DNA polymerase buffer (Promega, Madison, WI), and 1.75 mm of MgCl2 in a 5-μl reaction volume (10 sec at 90°, denaturation; 5 sec at 50°, annealing; 60 sec at 60°, extension) for 40 cycles. The PCR products were labeled from both ends and the resulting termination products were analyzed on an ABI 3730 DNA analyzer. The two resulting sequence traces derived from opposite ends of each amplicon were analyzed and aligned with standard DNA analysis software Phred (Ewing and Green 1998) and Phrap (http://www.phrap.org/). Resulting alignments and trace data were visually inspected in the Consed viewer (Gordon et al. 1998) to distinguish those amplicons that were locus specific and those that apparently resulted from amplification of two or more paralogous loci. Those primer sets that produced what appeared to be single-locus amplicons were used for PCR amplification of genomic DNA of the other five soybean genotypes Minsoy, Noir 1, Evans, PI 209332, and Peking. Resulting amplicons were treated with SAP and exonuclease I as described above, followed by sequence analysis on the ABI 3730. Forward and reverse sequence traces from the five genotypes as well as that from Archer were analyzed and aligned as described above. SNP discovery was carried out in the sequence alignments with a machine learning algorithm based on PolyBayes SNP discovery software (Marth et al. 1999; Matukumalli et al. 2006a).
Sequence analysis for STS verification and SNP discovery:
Sequence traces for each putative SNP identified were visually inspected to verify sequence polymorphisms. Single-nucleotide changes and indels present in each alignment as well as the haplotypes present among the six genotypes were recorded as described by Matukumalli et al. (2006b).
Analysis of introns:
Consensus sequence obtained from the Phrap alignment of the six soybean genotypes was aligned with the EST sequence to which primers were designed with the software alignment program bl2seq (http://www.ncbi.nlm.nih.gov) to determine the presence and length of introns.
Genetic mapping of SNP-containing loci:
One SNP was mapped from SNP-containing STSs derived from the EST clusters and 3′ unigene sequence, as well as from the sequence of soybean genes deposited in GenBank. The allele present at each SNP locus was determined using single-base extension on either the Sequenom MassARRAY platform or the Luminex flow cytometer. The MassARRAY system is based on single-base primer extension technology. The MassARRAY technology uses matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry to measure directly the mass of the extension product(s) and then correlates the detected mass with a specific genotype (Griffin et al. 1999). Details of the protocol are available on the Sequenom web site: http://www.sequenom.com/applications/hme_assay.php. Single-base extension technology was also used to map SNP-containing loci on the Luminex flow cytometer as described by Chen et al. (2000). SNPs were mapped in three different mapping populations including the University of Utah Minsoy × Noir 1 (M × N) and Minsoy × Archer (M × A) as well as the Evans × PI 209332 (E × PI) (Concibido et al. 1996) RIL mapping populations. The first two populations were used in creating the current integrated linkage map (Song et al. 2004). A total of 89 RILs from the M × A population were assayed using the MassARRAY technology and the Luminex flow cytometry system. In the case of the M × N population, 89 RILs were assayed using MassARRAY genotyping and 75 RILs were genotyped using the Luminex flow cytometer. A total of 77 F6-derived RILs from E × PI were used for genetic mapping. SNPs in 500 genes in the M × N and 501 genes in the M × A populations, respectively, were analyzed using the MassARRAY system. SNPs in 337 genes in the M × N and 128 genes in the M × A populations, respectively, were analyzed using the Luminex flow cytometer. One hundred forty-six genes were mapped in the E × PI population using the Luminex flow cytometer. A total of 233 SNPs in the M × N and 115 SNPs in the M × A populations were genotyped using both analysis platforms. In those cases in which the same SNP was mapped using both the MassARRAY and Luminex systems, genotyping data with the least amount of missing data were used in map construction. To provide a set of markers common to the three mapping populations to expedite the JoinMap linkage analysis, 77 RILs of the Evans × PI 209332 were genotyped with 155 SSR loci using the analysis system described by Wang et al. (2003).
Nucleotide diversity (θ) was estimated as per Halushka et al. (1999),where K is the number of SNPs identified in an alignment of n genotypes, L bp in length. Nucleotide diversity was determined in both the intron and the exon sequence.
Haplotype diversity was calculated in a manner identical to the calculation of gene diversity (Weir 1996) as where Pij is the frequency of the jth haplotype for the ith locus summed across all haplotypes in the locus.
Distribution of SNPs in STSs:
To determine if SNPs were evenly distributed in the fragments assayed, the theoretical SNP cumulative frequency distribution for SNPs was calculated on the basis of the assumption of uniform distribution. This distribution was compared with the actual cumulative frequency distribution in these fragments using a Kolmogorov–Smirnov (KS) test (Gibbons 1976). The KS test assessed the degree of agreement between a sample of empirically gathered values and a target theoretical distribution.
Consensus map construction:
The data sets for JoinMap analysis consisted of 1023 A81-356022 × PI 468916, 1610 M × N, 1072 M × A, and 374 E × PI markers. Of 2807 unique markers in the data sets, a total of 533, 212, and 36 markers were common to two, three, and four populations, respectively. The markers in each of the four populations were grouped separately on the basis of their LOD scores and then integrated using JoinMap (Van Ooijen and Voorrips 2001). The soybean linkage groups were identified using the alphanumeric codes described by Cregan et al. (1999) and Song et al. (2004). Recombination values were converted to genetic distances using the Kosambi mapping function.
Distribution of genic sequences—test of goodness-of-fit of theoretical and observed numbers of newly mapped genes to each linkage group:
The chi-square statistic is commonly used to test for goodness-of-fit of an observed to a theoretical distribution. However, the test of each individual linkage group is problematic due to null degrees of freedom. Therefore, a permutation algorithm was designed to assess the significance level. For each permutation, a total of 1141 SNP markers were randomly assigned to any position in a total length of 2388.61 cM (the total length of the 20 linkage groups) at 0.01-cM intervals. The theoretical number of genes assigned to each linkage group was compared to the observed number of genes mapped to each linkage group for each of 5000 permutations. The probability of goodness-of-fit was measured by the proportion of times that the observed number of genes in each linkage group was larger or smaller than expectation.
Distribution of genes within linkage groups:
The theoretical distribution of map distances between adjacent genic sequences within linkage groups was estimated on the basis of the assumption of a random distribution of markers over the total length of the linkage map. The goodness-of-fit between observed and theoretical distributions was tested using chi-square analysis.
Distribution of genes vs. SSR and RFLP loci:
To estimate the relative proximities of genes on the newly developed transcript map with the preexisting SSR and RFLP loci, the genetic distance between each SSR and RFLP locus to the two flanking genic loci was determined. Genic loci included both the loci mapped in this study as well as classical genes that were previously mapped and 37 genes that were mapped by virtue of an SSR they contained. The SSRs in these 37 genes were not included in the analysis. The proportions of genetic distances that fell in classes of 0.0–0.1 cM, 0.1–0.2 cM, 0.2–0.3 cM, etc., intervals were calculated both for SSRs vs. genes and for RFLPs vs. genes. The proportions of genetic distances in the various distance classes allowed a comparison of the relative genetic proximities of SSR and RFLP loci to adjacent genes.
SNPs in EST clusters:
A total of 160,000 ESTs from cultivars Williams and Williams 82 were clustered using CAP3 and then further grouped into subsets believed to correspond to paralogs. Primers were then designed to high Phred quality polymorphisms with the intention of providing paralog specificity. Of 872 primers tested, one-third failed to produce an amplicon and 516 (59.2%) produced a single band on agarose (Table 1). Upon sequence analysis of these amplicons in the genotype Archer, 367 (42.1%) produced generally high-quality sequence data allowing alignment of the 5′ and 3′ traces. In a number of instances, the sequence traces contained what appeared to be “heterozygous” positions indicative of either haplotype variation or variation between very similar homeologous loci. The 367 primers pairs that produced good-quality sequence data were used to amplify the corresponding fragment from the five additional genotypes Minsoy, Noir 1, Evans, Peking, and PI 209332 followed by sequence analysis from both ends. Alignment and analysis of these sequence traces with those from Archer soybean indicated that 46 fragments that contained heterozygous positions in Archer soybean also had identical heterozygous bases at the same positions in each of the five other genotypes. A typical example of this phenomenon is illustrated in supplemental Figure 1 at http://www.genetics.org/supplemental/ and suggested that the variation was PSV rather than haplotype variation. The analysis of the remaining 321 alignments indicated that they were the result of amplification from a single locus, i.e., an STS. One hundred eighty-one of these STSs contained one or more introns and 225 (25.8% of 872 primer pairs analyzed) contained at least one single-base change or indel, as detected by PolyBayes and verified by visual inspection of the alignments. The 321 STSs contained 214.2 kbp of aligned sequence and a nucleotide diversity of θ = 0.001396.
SNP discovery in primers designed to 3′ unigene sequence data:
In an attempt to increase primer specificity, 3′ unigene sequence data were used in primer design such that one primer was positioned as far to the 3′ end of the unigene as possible. Of those primers designed to FASTA data, 24.2% did not produce a PCR product, which is slightly higher than the 21.6% that did not produce a PCR product when quality scores were used in primer design (Table 1). The proportion of primer sets that amplified multiple or paralogous loci was slightly different between the two groups. However, the proportion of primer pairs that resulted in an STS when primers were designed to sequence with a Phred score of ≥20 was 49.2%, which was greater than the 43.9% of successful STSs produced when primers were designed to FASTA sequence.
A total 8587 primer sets were designed to 3′ unigene sequence of which 4346 produced high-quality sequence data from the analysis of the STSs amplified from genomic DNA of Archer soybean. As was the case with primers designed to the EST clusters, a number of the sequence traces contained heterozygous positions. The subsequent analysis of the 4346 amplicons from the five genotypes Minsoy, Noir 1, Evans, Peking, and PI 209332 indicated that in 427 alignments of traces from the six genotypes each of the genotypes had “heterozygotes” at the same positions, indicating that variation was between paralogs as described earlier and illustrated in supplemental Figure 1 at http://www.genetics.org/supplemental/. A total of 3919 of the primers designed to 3′ unigene sequence resulted in what appeared to be a robust STS as indicated by the sequence analysis of the six genotypes and 1807 of these contained at least one SNP. Thus, 21% of the primer pairs tested resulted in an SNP-containing amplicon.
A total of 22.7% of the primers designed and tested produced (1) multiple products on an agarose gel (9.2%), (2) multiple amplicons as determined by sequence analysis (8.5%), or (3) evidence of PSV as depicted in supplemental Figure 1 at http://www.genetics.org/supplemental/ (5.0%).
Heterogeneity of nucleotide diversity among genes:
The analysis of 9459 primer pairs resulted in the development of 4240 sequence-tagged sites with an average length of 575 bp and a mean of 1.31 SNPs per fragment. The nucleotide diversity in the 2.44 Mbp of aligned genic sequence was θ = 0.000997. More than half of the gene fragments had no sequence variation, which suggested an uneven distribution of sequence variation among the fragments. The Kolmogorov–Smirnov test was conducted to compare the observed cumulative frequency distributions of SNPs in fragments with the theoretical distributions based upon the assumption of mutations being evenly distributed across the 4240 fragments. The observed and theoretical frequency distributions were determined to be significantly different (P < 0.01), indicating that there was heterogeneity in the nucleotide diversity of gene fragments analyzed in this study.
Characteristics of SNPs in exons and introns:
The aligned sequence of the 4240 STSs resulted in the discovery of 5551 SNPs (Table 2). The analysis of the aligned sequence indicated that 91.6 kbp (37.6%) was intron sequence with a mean of 1.64 introns per gene fragment and an average intron length of 279 bp. The number of introns per gene fragment ranged from 1 to 8 and there were 2.66 SNPs/kbp in introns vs. 2.04 SNPs/kbp in exon sequence. Thus, the intron nucleotide diversity (θ = 0.001168) was somewhat greater than that in the exons (θ = 0.000895). Approximately 85% of the SNPs were single-base changes of which 55.7% were transitions and 44.3% transversions. In five cases, triallelic single-base changes were discovered in the six genotypes. The remaining 15% of the SNPs were indels that ranged in length from 1 to 104 bp, of which 14.3% were >5 bp in length and 4.8% were >10 bp in length. Fifty-one percent of the 827 indels were 1 bp in length. The distribution of indel lengths in exons and introns indicated that the proportion of indels in different length classes was fairly similar in exons and introns (Figure 1). There was a slightly higher proportion of indels of 1, 2, and >10 bp in introns while the proportion of 3- and 6-bp indels was greater in exons. Information relating to each of the 2032 SNP-containing STSs and the 5551 SNPs discovered as a result of the resequencing analysis is provided at http://bfgl.anri.barc.usda.gov/soybean/ (files: 2032-SNP-containing STS.xls and 5551-SNPs-2032 STS.xls).
Among the six genotypes analyzed, just two haplotypes were found in 1502 (73.9%) of the 2032 gene fragments that contained a sequence variant (Table 3). Only two haplotypes were found in 837 fragments with a single sequence variant, as well as many with multiple SNPs. Four or more haplotypes were observed in <5% of the gene fragments. There was a mean of 2.31 haplotypes in the 2032 SNP-containing genes. Haplotype diversity in gene fragments that contained two or more SNPs was 0.48 ± 0.13.
Molecular function of genes:
Since genes were chosen randomly, it was anticipated that a diverse set of genes would be included in this study. This anticipation was realized as is suggested by Figure 2. There were small differences among the proportions of genes in the various gene prediction classes for which primers were designed vs. STSs developed as opposed to those in which sequence diversity was discovered to permit genetic mapping. For example, ∼12% of genes to which primers were designed were in the category of catalytic activity whereas 14% of loci that were mapped were in this category. Similarly, 1.3% of genes to which primers were designed were predicted to function in signal transduction whereas only 0.52% of these genes were mapped. In general, for most gene prediction categories, such differences were relatively minor.
Genetic Mapping of STSs:
A total of 1141 genes were placed on the genetic map by virtue of a segregating SNP mapped in one or more of the three RIL mapping populations. In the case of the M × A population, 115 SNPs were mapped using both the Sequenom Mass Spectrometer MassARRAY technology and the Luminex flow cytometer on the same set of 89 RILs. Greater than 98.5% of RIL genotype calls were identical using both SNP detection platforms. Loci were placed on the existing framework map, which consists of 1015 SSR and 709 RFLP loci (Song et al. 2004). The genetic map with SSR, RFLP, SNP, and other loci is provided in supplemental data file 1 at http://www.genetics.org/supplemental/. On the basis of the lengths of the 20 linkage groups in the framework map, the predicted number of genes mapped per linkage group was determined assuming a random distribution of genes. A significantly greater than predicted gene density occurred in 3 of the 20 linkage groups (E, J, and K) (Table 4), while a smaller than expected number of genes mapped to linkage group M. An additional analysis was undertaken to determine the nature of the distribution of genes within linkage groups. The analysis of the theoretical distribution of map distances between adjacent genic sequences clearly indicated that genes were clustered (Figure 3).
Distribution of genes vs. SSR and RFLP loci:
To estimate the relative proximities of genes on the newly developed transcript map to the preexisting SSR and RFLP loci, genetic distances of each SSR and RFLP locus to the two closest flanking genic loci were determined. The graph of the proportion of SSR and RFLP loci that fall at various intervals from the closest two flanking genes suggested little difference between the proximities of SSR and RFLP loci to genes (Figure 4). However, a higher proportion of SSRs appeared to be in very close proximity (0.0–0.5 cM) to genes than were RFLP loci. In addition, correlations of the number of genes with SSRs and genes with RFLPs on the 20 soybean linkage groups indicated a moderately strong relationship between SSR density and gene density. The correlation of SSR numbers per linkage group with genes per linkage group was r = 0.58 (P = 0.0075), whereas that between the numbers of RFLP loci and genic loci was substantially lower, r = 0.25 (P = 0.29). Together these data suggested an apparent association between SSRs and genic sequence.
Gaps in the linkage map now populated with SNP markers:
In the Song et al. (2004) linkage map, a total of 112 and 26 intervals between adjacent SSR markers are >5 and 10 cM, respectively. In the new map, a total of 291 genes were mapped in 72 of the 112 gaps with distances between 5 and 10 cM and 111 genes were mapped in 19 of 26 gaps >10 cM. Thus, sequence-based markers effectively filled many of the gaps between the preexisting sequence-based markers in the linkage map.
A soybean SNP database:
To provide web-based access to the mapped SNP markers generated in this study, a database was created that can be accessed at http://bfgl.anri.barc.usda.gov/soybean/. Information includes descriptive data for each SNP-containing STS, STS information including primer sequences, and SNP positions in the STSs as well as the allele present in each of the six genotypes. The complete integrated map with positions of all SSR, RFLP, and SNP loci is also available. In addition, data related to SNP detection using (1) the Sequenom MassARRAY system including the assay ID, PCR, and single-base extension primer sequences and (2) the Luminex flow cytometer including PCR and single-base extension primer sequences are available in the database.
SNP discovery in soybean:
Of a total of 9459 primer sets designed to EST sequence and examined for locus-specific PCR amplification followed by resequencing for SNP discovery, 21.5% were determined to contain a sequence variant. The low rate of SNP discovery reported here was, first, the result of the inability to develop robust STSs, and, second, the result of the low level of sequence variation in cultivated soybean. The difficulty of STS development stems from a number of factors. To facilitate specificity, the 872 primers designed to ESTs clustered using CAP3 frequently contained a PSV at their 3′ end. In some instances, the PSV bases may have been reverse transcriptase errors that would result in poor or absent PCR amplification. Reverse transcriptase errors have been documented to occur at frequencies of 1/1000–4000 bases in cDNA synthesis in Drosophila (Stapleton et al. 2002). Primer design across intron–exon splice sites would also reduce the success of PCR amplification of primers designed to EST sequence. Likewise, the high level of duplication in the soybean genome complicated the development of STSs. More than 20% of primers designed produced either multiple amplicons as determined by agarose gel analysis or single-band amplicons that yielded traces indicative of two or more amplicons. This would be anticipated in light of reports that, on average, a given chromosomal segment is duplicated 2.55 times in the soybean genome and that some segments are present as many as six times (Shoemaker et al. 1996).
The relatively low nucleotide diversity of cultivated soybean also contributed to the difficulty of SNP discovery. In the 2.44 Mbp of aligned sequence, nucleotide diversity (θ = 0.000997) was similar to that reported by Zhu et al. (2003) (θ = 0.00086) in 66.6 kbp of coding sequence and associated introns, untranslated regions, and perigenic sequence. The level of sequence diversity in cultivated soybean in relation to other cultivated species is relatively low. For example, in rice, Feltus et al. (2004) reported 1.7 single-base changes plus 0.11 indels/kbp in 358 Mbp of low-copy DNA sequence on the basis of a comparison of draft sequences of the two rice subspecies O. sativa ssp. indica and japonica. This is equivalent to a nucleotide diversity of θ = 0.00181. A calculation from Kanazin et al. (2002) indicated nucleotide diversity of θ = 0.0025 in 21.3 kbp of sequence analyzed in five diverse barley cultivars. Similarly, nucleotide diversity, θ = 0.0023, in sorghum (Sorghum bicolor) (Hamblin et al. 2004) is more than twice that of soybean. In modern maize (Z. mays L.) inbreds, Wright et al. (2005) reported nucleotide diversity of θ = 0.00627, which is more than sixfold that of soybean. A similarly high nucleotide diversity of θ = 0.0077 was reported in a comparison of two sugar beet genotypes by Schneider et al. (2001). Along with low sequence diversity, the heterogeneity of nucleotide diversity across the 2.44 Mbp of sequence analyzed in this study also reduced the number of SNP-containing genic fragments discovered vs. what would have been anticipated if polymorphisms had been distributed evenly across the 4240 STSs. This heterogeneity is likely the product of selective sweeps that occurred during soybean domestication, which resulted in regions of the cultivated soybean genome in which little or no sequence variation is present. Hyten et al. (2006) reported that while only 6.8% of genes assayed in 26 wild soybeans [G. soja (Sieb and Zucc.)] contained no sequence variants, 24.5% showed no variation in a set of 52 exotic G. max germ-plasm accessions. This suggests that nearly one-fifth of the cultivated soybean genome is genetically invariant as a result of selective sweeps associated with domestication.
The proportion of indels to total sequence variants in genic and perigenic sequence in soybean (15%) is quite similar to that reported in Arabidopsis. Schmid et al. (2003) determined that 14% of sequence polymorphisms they detected via the resequencing of 12 genotypes were indels. In cultivated barley the comparable proportion is somewhat lower (8%) on the basis of an analysis of sequence polymorphisms in five diverse barley cultivars (Kanazin et al. 2002). In contrast, the frequency of indels in and around maize genes greatly exceeds that reported to date for soybean and other plant species. Bhattramakki et al. (2002) discovered 655 indels via the sequence analysis of 502 genic loci (180,618 bp of aligned sequence) by the resequencing of eight diverse maize inbreds. This was an average of one indel every 276 bp. In this study, one indel was found every 2905 bp. While only six genotypes were analyzed in the case of soybean vs. eight in the work of Bhattramakki et al. (2002), it is clear that the indel frequency is substantially lower in soybean genic sequence than in maize.
The mean number of haplotypes present among the 6 soybean genotypes in the 2032 SNP-containing STSs was 2.31, which was only a little less than the 2.74 haplotypes found in 96 SNP-containing genic fragments analyzed in 25 diverse soybean genotypes by Zhu et al. (2003). Thus, using the 6 genotypes identified by Zhu et al. (2003) as a subset that would identify 93% of the common SNPs (frequency ≥0.10) was apparently successful. The limited haplotype diversity in soybean is comparable to that reported in other crop species. For example, Kilian et al. (2006) determined haplotype diversity among 20 domesticated barley lines in 7 barley genes and reported an average of 2.4 haplotypes per locus. Another report from barley found an average of 2.81 haplotypes per STS via the analysis of 309 STSs with an average length of 466 bp in 7 cultivated and 1 wild barley (Hordeum vulgare ssp. spontaneum) genotype (Rostoks et al. 2005). Ching et al. (2002) analyzed fragments of 18 genes in 36 maize inbreds and found from 2 to 8 haplotypes with an average of only 4.4 haplotypes per gene. In cultivated soybean, the relatively limited haplotype diversity is suggested to be mainly a result of limited diversity in the wild progenitor compounded by a further loss of diversity as a result of domestication (Hyten et al. 2006).
A more comprehensive soybean genetic map:
The previous version of the soybean genetic map (Song et al. 2004) included 1015 PCR-based markers (SSRs). The addition of 1141 markers based upon gene sequence provides the first transcript map of soybean (supplemental data file 1 at http://www.genetics.org/supplemental/). The diversity of gene function associated with these transcripts offers researchers an opportunity to identify potential candidate genes for >1150 QTL reported thus far in soybean for a variety of traits related to biotic and abiotic stresses, plant growth and morphology, and seed quality (SoyBase: http://soybase.agron.iastate.edu/). This initial transcript map and the additional markers it contains should enhance both applied and basic soybean genetics and genomics research, including QTL discovery, marker-assisted selection, map-based cloning, and the anchoring of the physical to the genetic map. Doubling of the number of genetically mapped sequence-based markers is a step forward in creating the resources that will be needed to assemble the whole-genome shotgun sequence of soybean (Joint Genome Institute 2006).
Distribution of genes and SSR and RFLP markers:
Clustering of the mapped genic loci reported here was not unanticipated. Mudge et al. (2004) hybridized RFLP probes derived from PstI digestion of genomic DNA to arrayed BAC clones. Numerous cases of nonhomologous probes hybridizing to common BAC clones indicated that the gene-derived RFLPs were clustered in the genome. These authors concluded that most genes in soybean are clustered in ∼275 Mbp of the genome, which is ∼25% of the 1100-Mbp genome. Our analysis clearly supported gene clustering on the genetic map, although it would be difficult to draw conclusions about the proportion of the genome that is gene rich vs. gene poor. Such an analysis should be forthcoming as the soybean physical map progresses and the genome sequence becomes available.
An analysis of DNA sequence data from A. thaliana, rice, soybean, maize, and wheat (Triticum aestivium) (Morgante et al. 2002) concluded that the frequency of SSRs was significantly higher in transcribed regions and that microsatellites (SSRs) are associated with low-copy portions of plant genomes rather than with regions of repetitive DNA. In contrast, empirical evidence from soybean indicated that end sequences of BAC clones identified with PstI-derived RFLP probes were reported to have 50% more gene-like sequences and 45% less repetitive sequence than end sequences of BAC clones identified with microsatellite markers (Marek et al. 2001). These data suggested that in relation to SSRs, RFLPs tended to be more closely associated with gene-rich regions. Thus, the 138 gaps of >5 cM between adjacent PCR-based markers (SSRs) in the previous SSR-based genetic map of soybean (Song et al. 2004) may include regions of interest to soybean genomicists and breeders because these gaps frequently contain one or more RFLP loci. On the basis of the transcript map developed via the positioning of 1141 genes on the preexisting SSR/RFLP map, SSR loci appear to be at least as closely associated with genic sequence as RFLP loci (Figure 4). Thus, whatever the relationship between genic sequence and SSR and RFLP markers, the gene-based SNPs mapped in this study effectively filled many of the 5- and 10-cM gaps in the previous map with at least one new PCR-based marker.
Access to SNP marker technology:
The SNP detection used here was conducted using two SNP detection platforms. The Sequenom MassARRAY technology is well established and all the information required including redesigned PCR primers to amplify the SNP-containing fragment as well as the single-base extension primer for detection of the one SNP mapped per STS is available at http://bfgl.anri.barc.usda.gov/soybean/ (file: Sequenom Information.xls). Multiplex assays using the MassARRAY technology can be designed from this information and genotyping services are available commercially. The assays conducted on the Luminex flow cytometer also used single-base extension. Single-base extension primers for the detection of one SNP in each of 502 STSs are available at http://bfgl.anri.barc.usda.gov/soybean/ (file: Luminex Information.xls). The Luminex flow cytometer has been demonstrated to be flexible in that, in addition to single-base extension assays, hybridization-based as well as oligonucleotide ligation assays (OLA) can be used on this platform. A comparison of single-base extension, hybridization, and OLA assays for SNP detection in soybean was reported by Lee et al. (2004). The single-base extension assays were reported to be quite robust, but once optimized the hybridization system was actually more rapid and less costly per datapoint, making the latter assay more suitable for high-throughput marker-assisted selection. The 1141 gene fragments mapped in this study contained a total of 2928 SNPs. The data relating to the allele at each SNP locus in each STS will allow the user to design assays on the basis of an ever increasing number of SNP detection assay systems. These have been reviewed by Syvänen (2001, 2005) and Ng and Liu (2006). Existing systems plus the promise of new systems suggest an ever-improving SNP detection throughput coupled with decreasing cost per datapoint. While the newest technologies described by Syvänen (2005) may not be immediately available to soybean geneticists, a number of systems are currently available in addition to the Sequenom MassARRAY technology and the Luminex flow cytometer. An inexpensive and widely used alternative for the detection of a limited number of SNPs is based upon the alteration in a restriction endonuclease site by the presence of the SNP. These so-called cleaved amplified polymorphic sequences (CAPS) markers (Glazebrook et al. 1998) have been successfully used for quite some time. Thus, if a QTL is discovered in a particular region of the genome, SNPs in that region can be identified, assayed, and converted to CAPS markers and used in marker-assisted selection. CAPS assays can be conducted on agarose gels that are available in many plant-breeding laboratories. Indeed, a CAPS marker linked to the important ms2 gene for male sterility at the bottom of linkage group O was obtained in this way (J. M. Chaky and J. E. Specht, unpublished results). As additional SNP markers are placed on the map, more potential SNPs will be available in any given region of the genome from which to design assays for marker-assisted selection.
We thank Tina Sphon, Tad Sonstegard, the Bovine Functional Genomics Laboratory, Animal and Natural Resources Institute, and Beltsville Agricultural Research Center East DNA Sequencing Facility for assistance with the acquisition of sequence data. This work was supported by grants 3212, 4212, and 5212 from the United Soybean Board. The support of the United Soybean Board is greatly appreciated. The authors also thank Monsanto for their funding of the SNP genotyping that was conducted at Genaissance Pharmaceuticals by Min Seob Lee (Sequenom, San Diego).
↵1 Present address: NICEM, CALS, Seoul National University, San 56-1, Sillim-9-dong, Gwanak-gu, Seoul, 151-921, South Korea.
Communicating editor: J. A. Birchler
- Received January 12, 2007.
- Accepted February 16, 2007.
- Copyright © 2007 by the Genetics Society of America