We studied microsatellite frequency and distribution in 21.76-Mb random genomic sequences, 0.67-Mb BAC sequences from the Z chromosome, and 6.3-Mb EST sequences of Bombyx mori. We mined microsatellites of ≥15 bases of mononucleotide repeats and ≥5 repeat units of other classes of repeats. We estimated that microsatellites account for 0.31% of the genome of B. mori. Microsatellite tracts of A, AT, and ATT were the most abundant whereas their number drastically decreased as the length of the repeat motif increased. In general, tri- and hexanucleotide repeats were overrepresented in the transcribed sequences except TAA, GTA, and TGA, which were in excess in genomic sequences. The Z chromosome sequences contained shorter repeat types than the rest of the chromosomes in addition to a higher abundance of AT-rich repeats. Our results showed that base composition of the flanking sequence has an influence on the origin and evolution of microsatellites. Transitions/transversions were high in microsatellites of ESTs, whereas the genomic sequence had an equal number of substitutions and indels. The average heterozygosity value for 23 polymorphic microsatellite loci surveyed in 13 diverse silkmoth strains having 2–14 alleles was 0.54. Only 36 (18.2%) of 198 microsatellite loci were polymorphic between the two divergent silkworm populations and 10 (5%) loci revealed null alleles. The microsatellite map generated using these polymorphic markers resulted in 8 linkage groups. B. mori microsatellite loci were the most conserved in its immediate ancestor, B. mandarina, followed by the wild saturniid silkmoth, Antheraea assama.
MICROSATELLITES or simple sequence repeats (SSRs), tandemly repeated units of one to six nucleotides, are abundant in prokaryotic and eukaryotic genomes (Weber 1990; Field and Wills 1996). They are ubiquitously distributed in the genome, both in protein-coding and in noncoding regions (Toth et al. 2000). The advent of polymerase chain reaction (PCR) and the availability of high-throughput automated sequencers have increased the use, detectability, and popularity of microsatellite markers, which have become a highly informative and versatile class of genetic markers (Litt and Luty 1989; Tautz 1989; Weber and May 1989; Schlötterer 2004).
Information about the distribution and variability of microsatellite sequences in the genome of a species can elucidate its genetic history from the standpoint of evolution and artificial selection. Population and evolutionary studies in humans and Drosophila have shown that highly polymorphic microsatellite markers can provide information on the differentiation of populations and can detect recent selective sweeps and bottleneck events (Di Rienzo et al. 1994; Slatkin 1995; Schlötterer et al. 1997). On the other hand, more stable SSR markers with lower variability can be used to reconstruct more ancient evolutionary events (Meyer et al. 1995). Since mutation rates at microsatellite sequences vary substantially between species and between loci (reviewed in Schlötterer 2000), it is important to investigate and account for factors influencing microsatellite variability while using SSR markers. These include the structure and length of SSRs (Brinkmann et al. 1998; Schlötterer 1998; Streelman et al. 1998). In a variety of organisms, it has been demonstrated that microsatellite mutation rates are positively correlated with repeat number (Wierdl et al. 1997; Schlötterer et al. 1998). Initial studies of humans reported a higher mutation rate of tetranucleotide repeats (Weber and Wong 1993), whereas a later study that compared microsatellite variability in different human populations found strong evidence for an inverse correlation of microsatellite repeat unit length and mutation rate (Chakraborty et al. 1997). More work in different genomes is required to answer this open issue and to find out common patterns, if any. It has been shown that SSRs are less abundant in exons than in noncoding regions and that different taxa exhibit relative abundance of different SSR types (Toth et al. 2000; Morgante et al. 2002). The silkworm, Bombyx mori is an insect of agronomic importance and a lepidopteran molecular model (Nagaraju et al. 2000). Development of molecular markers is important in silkworm for construction of linkage maps, fingerprinting of strains for breeding purposes, and marker-assisted selection. In earlier studies, fingerprinting analyses of silkworm strains have been carried out using RAPDs (Nagaraja and Nagaraju 1995); a heterologous minisatellite probe, Bkm(2)8 (Nagaraju et al. 1995; Sharma et al. 1999); inter-(I)SSR-PCR (Reddy et al. 1999b); and microsatellite loci (Reddy et al. 1999a). Genetic linkage maps of silkworm have been constructed using RAPDs (Promboon et al. 1995; Yasukochi 1998), RFLPs (Shi et al. 1995; Kadono-Okuda et al. 2002), AFLPs (Tan et al. 2001), and SSRs and ISSRs (our unpublished data). Considering the advantages offered by SSR loci, we started an SSR discovery program in the silkworm genome. In the previous study, we analyzed 28 microsatellite loci cloned and characterized from a B. mori subgenomic library (Reddy et al. 1999a). Given the task of constructing a high-density linkage map for silkworm, the need for microsatellites for marker-assisted selection, and the high cost of developing and sequencing new genomic libraries for microsatellite discovery, we sought an alternative means of increasing our collection of these highly informative markers. Hence, in this study we used an in silico approach to characterize the distribution and frequency of different types of microsatellite motifs in the silkworm genome by analyzing the sequences derived from whole-genome shotgun (WGS) sequencing (Mita et al. 2004), BACs from the Z chromosome (Koike et al. 2003), and silkworm expressed sequence tags (ESTs; Mita et al. 2003). We also analyzed microsatellite flanking sequences for their sequence content and mutations in different classes of microsatellite repeats. In addition, we designed primers for 198 SSR flanking sequences and tested for their polymorphism in two divergent parental strains, Nistari and NB4D2, which show contrasting characters for various visible traits. Using a biphasic backcross strategy, we constructed a framework linkage map of microsatellite markers. We also estimated allele frequencies for 23 polymorphic loci using 13 strains of B. mori. Further, extending our study, we also tested the conservation of a selected number of loci in different lepidopteran insects, including B. mandarina, the nearest wild relative to silkworm, a group of saturniids, and a noctuid.
MATERIALS AND METHODS
The sequences analyzed for microsatellites included 21.76-Mb random sequences from WGS sequence data downloaded from DDBJ (http://www.ddbj.nig.ac.jp/whatsnew/040423-e.html; Mita et al. 2004), 0.67-Mb BAC sequences from the Z chromosome (Koike et al. 2003), and ∼9300 nonredundant ESTs accounting for 6.3 Mb, downloaded from SilkBase (http://www.ab.a.u-tokyo.ac.jp/silkbase; Mita et al. 2003). To avoid the double counting, the Z chromosome BAC sequences were removed from the genomic sequences on the basis of similarity search using BLAST (Altschul et al. 1990). The three sets of sequences from WGS, Z chromosome BACs, and ESTs are referred to as genomic, Z chr-BAC, and EST, respectively in this article. The sequences were analyzed for microsatellites and their flanking sequences using SSR finder (SSRF, http://220.127.116.11/MIC/index.html) and Allflank (to be published elsewhere), respectively. In our analysis, only those microsatellites with ≥5 repeat units were counted except in the case of mononucleotide tracts, where ≥15 bases were considered.
Some microsatellite loci identified by the above analysis, composed of di-, tri-, tetra-, penta-, hexanucleotide compound and imperfect repeats with varying lengths, were chosen for the study of allele frequency, polymorphism, and conservation. Primers were designed for the flanking sequences of these loci using the Amplify program (http://bip.weizmann.ac.il/mb/bioguide/pcr/PCRsofAmplify.html). Primers were also designed for microsatellites identified in B. mori gene sequences archived in the GenBank database. The microsatellite primers developed in an earlier study (Reddy et al. 1999a) were also used. All the primers were standardized for optimum annealing temperature and MgCl2 concentration using genomic DNA from the Nistari strain of B. mori. The details of the primer sequences, size range of the alleles, and PCR amplification conditions for each of the 198 microsatellite loci developed and validated in this study are available at our website (http://18.104.22.168:9999/PHP/SILKSAT/index.php?f=design_primer).
Silkworm strains and genomic DNA:
The detection of polymorphism of the newly identified loci was carried out in two silkmoth strains, Nistari (male individuals) and NB4D2 (female individuals), which have nondiapausing and diapausing characters, respectively. These two strains are quite divergent for various qualitative and quantitative traits and produce highly heterotic hybrids when they are crossed and are being used as parental strains to construct a mapping population. Thirteen diverse silkworm strains, which included 6 diapausing (HU204, KA, NB1, NB7, NB18, and NB4D2) and 7 nondiapausing (C.nichi, Moria, Nistari, Pure Mysore, Daizo, Gungnong, and Sarupat) strains, were used for microsatellite analysis. In each of the strains, DNA was extracted from 10 individuals, except for NB4D2, in which DNA from only 5 individuals was extracted. The saturniid silkmoths, Antheraea mylitta, A. assama, A. roylei, A. proylei, A. pernyi, A. yamamai, and Samia cynthia ricini; the bombycid, B. mandarina; and the noctuid moth, Helicoverpa armigera, were used to study the conservation of B. mori microsatellite loci in heterologous lepidopterans. DNA was isolated from adult moths as described earlier (Prasad and Nagaraju 2003).
Choice of primers and locus detection:
For population analysis, 26 microsatellite loci involving di-, tri-, and tetranucleotide repeats with 11 loci identified from the SilkBase, 12 from GenBank, and 3 from subgenomic library screening were used. For polymorphism analysis, 198 loci (34 from ESTs, 91 from total genomic sequences, 19 from subgenomic library screening, and 54 from GenBank) involving di, tri-, tetra-, penta-, hexanucleotide, imperfect and compound repeats were used. For the evaluation of conservation of B. mori microsatellites in the heterologous lepidopteran species, 30 loci (12 loci identified from the subgenomic library, 16 from GenBank, and 2 from SilkBase) were tested. Out of these, 4 loci that amplified in most of the heterologous species were cloned, sequenced, and analyzed.
PCR amplification and electrophoresis:
PCR was typically performed in a 5-μl reaction volume containing 50 μm dNTPs, 0.2 μm fluorescent dUTP (Tamara or R110 or R6G; PE Biosystems), 0.4 μm of each primer, 0.25 units of AmpliTaq Gold polymerase (PE Biosystems), 1× PCR buffer (100 mm Tris-HCl pH 8.8, 500 mm KCl, 0.8% Nonidet P40), 1.5–4 mm MgCl2, and ∼5 ng genomic DNA. Thermal cycling was carried out for 30 cycles on a Thermal Cycler PTC 100 (MJ Research, Watertown, MA) with locus-specific annealing temperature. Samples containing 1 μl of PCR products, 0.3 μl of GeneScan 500 ROX size standard (PE Biosystems), and 2.0 μl of loading buffer (4 parts de-ionized formamide: 1 part 500 mg/ml blue dextran, 25 mm EDTA) were heated at 92° for 5 min and then placed on ice. Denatured samples (1 μl) were immediately loaded on 5% denaturing (6 m urea) Long ranger (FMC, Vallensbaek, Denmark) gels (36 cm well-to-read) in 1× TBE buffer (89 mm Tris, 89 mm borate, 2 mm EDTA pH 8.3) and electrophoresed at 3000 V for 3 hr on an automated DNA sequencer (PE Biosystems 377). Allelic sizes were determined using GeneScan v 2.1 software (PE Biosystems) and data were compiled using Genotyper v 3.0 (PE Biosystems). Polymorphic status for 198 microsatellite loci between two parental strains used to construct the mapping population was tested on 3% 3:1 Metaphor agarose (FMC) and agarose (United States Biochemicals, Cleveland; Amersham Pharmacia, Piscataway, NJ) gel electrophoresis. The primer sets that did not reveal polymorphism on agarose gels were subjected to GeneScan analysis and the polymorphic loci, if any, were detected.
Construction of genetic linkage map:
In B. mori, achiasmatic oogenesis results in absolute linkage in females. In males, linkage depends upon the crossover events occurring during spermatogenesis. This biphasic linkage behavior typical to lepidopterans aids mapping by sequential approach (Heckel et al. 1999; Nagaraju and Goldsmith 2002). F1 individuals from a cross NB4D2 × Nistari were backcrossed to Nistari (the recurrent parent) to generate the mapping population. In the first step, F1 females were backcrossed with Nistari males [backcross (BC)-I] to identify the linkage groups. In a second backcross, F1 males were mated with Nistari females, generating recombinant progeny (BC-II) to obtain the order of the markers within each linkage group. Forty individuals from BC-I and 60 individuals from BC-II were genotyped along with the parents and F1. The data matrix was input into MAPMAKER version 3.0 (Lander et al. 1987). Two-point linkage was determined at LOD = 3.0 and multipoint analysis based on Kosambi function (Kosambi 1944) yielded the order of the markers in each linkage group.
Heterozygosity values (H) for alleles found in silkworm populations were calculated using in which n is the number of alleles scored and Pi is the frequency of the ith allele. Significance of the differences in abundance and length of different repeat types within and between three sequence sources was measured using student's t-test. Nei's standard genetic distance (Nei 1972) between 13 silkworm populations based on 26 microsatellite loci was computed. Departure from mutation drift equilibrium was tested using the Bottleneck program (Cornuet and Luikart 1996) for both diapausing and nondiapausing populations.
Cloning, sequencing, and phylogeny:
Bmsat020, Bmsat110, Bmsat103, and Bmsat109 loci were used to study the conservation of B. mori loci in heterologous lepidopteran insects (Table 8). These loci were amplified by PCR as described earlier. The PCR products were purified with a column (QIAGEN, Valencia, CA) and cloned in pCR 2.1 vector (Invitrogen, San Diego). The recombinant plasmids were sequenced with the primers used for PCR for both strands with the ABI PRISM BigDye terminator cycle sequencing ready reaction kit on a 3100 genetic analyzer (Applied Biosystems, Foster City, CA) and the sequences were aligned using CLUSTAL W with manual corrections.
Phylogeny based on mitochondrial sequences:
For amplification of mitochondrial 12S ribosomal DNA (rDNA) from B. mori, wild silkmoths, and H. armigera, the forward primer SR-J-14199, tactatgttacgacttat, and reverse primer SR-N-14594, aaactaggattagataccc, were used. The amplicons were purified and cloned in the vector pCR 2.1. The recombinant plasmids were sequenced on a 3100ABI Genetic Analyzer (Applied Biosystems). rDNA sequences were submitted to GenBank with the following accession numbers: AF389421–22 and AY037827–35. The sequences were aligned using CLUSTAL W and a phylogenetic tree was constructed using the Neighbor-Joining program of PHYLIP software with Drosophila nasuta 12S rDNA as an outgroup.
We examined the distribution of microsatellites in genomic, Z chr-BAC, and EST sequences. We found that microsatellites were widely distributed in the B. mori genome, with ∼3 kb of repeats per megabase of genomic and EST sequences (Table 1). We estimated that the silkworm genome of 530 Mb accounted for 1.63 Mb of microsatellite repeats, equivalent to 0.31% of the genome (Table 1). Under this consideration, mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats represented 0.110, 0.116, 0.053, 0.018, 0.006, and 0.003% of the genome, respectively.
A/T mononucleotide stretches were much more abundant than C/G stretches in genomic, Z chr-BAC, and EST sequences (Figures 1 and 2). Greater than 20 repeat unit tracts of A/T were common in genomic and EST sequences (Figure 2). Among dinucleotides, CG repeats were found to be the least abundant. Among dinucleotides in ESTs CA and GA were as abundant as TA repeats, and CG repeats were relatively more compared to genomic and Z chr-BAC sequences (Figure 2). Among trinucleotides, TAA repeats were the most abundant repeats in the genomic sequences, composing almost 50% of trinucleotide repeats followed by GTA and TGA (Table 1). These three trinucleotide repeat types were significantly less abundant in ESTs (P < 0.01). All the remaining repeats were significantly overrepresented in ESTs (Figure 1). CGA was the least represented among the trinucleotide repeats.
Since tetra-, penta-, and hexanucleotide repeats have large groups of repeat types, we classified them on the basis of AT percentage from 0 to 100. For example, ATTT was considered to be a 100% AT-rich tetranucleotide repeat, CAAA was a 75% AT-rich tetranucleotide repeat, ACGT was a 50% AT-rich tetranucleotide repeat, and so on. The maximum number of tetranucleotide repeats was observed in the 100% AT-rich repeats followed by the 75% AT-rich category (with one C/G in the repeat) in genomic sequences whereas in EST sequences the 50 and 75% AT-rich categories contributed the most (Figure 2). Eighty percent (with one C/G) and 100% AT-rich repeat types together constituted >80 and >60% of the total number of pentanucleotide repeats in the genomic and EST sequences, respectively. Three groups of hexanucleotide repeats, 66.7% (with two C/G), 83.3% (with one C/G), and 100% AT rich, were found to be significantly bigger in ESTs compared to the genomic sequences. Overall, we observed a common phenomenon whereby repeat number was inversely proportional to repeat length; this was more pronounced in trinucleotide repeats, especially because of their abundance in both the genome and ESTs (Table 1). The average length of mononucleotides was more as compared to other repeat classes. There was no significant difference in the lengths of repeats among genomic, ESTs, and Z chr-BAC sequences (Table 2).
Z chromosome vs. other chromosomes and ESTs:
Our analysis of mononucleotide repeat tracts in Z chr-BAC sequences yielded only A/T motifs (Table 1). Among trinucleotide repeat tracts, TAA repeats were significantly overrepresented in Z chr-BAC compared to genomic and EST sequences. GCA, CGA, and CGG were significantly more represented than those in genomic sequences whereas GAA and GGA were completely absent. Tetra-, penta-, and hexanucleotide repeat motifs were very scarce in the Z chr-BACs (Table 1).
Flanking sequence analysis:
To test the base composition surrounding the microsatellite repeats, we calculated GC content for 100 bases upstream and 100 bases downstream flanking different types of repeat motifs. Overall, the GC content of the sequences flanking different repeat types was more or less similar in both genomic and EST sequences. Also, there was no significant difference in AT/GC composition between upstream and downstream flanking sequences except in the case of 100% AT-rich pentanucleotide repeats, where downstream flanking sequences showed 5% lower GC content compared to that in upstream sequences in both genomic and EST sequences (data not shown). Hence, the upstream and downstream flanking sequences of both genomic and EST sequences were combined and analyzed. GC content was between 30 and 40% in sequences flanking A, T, AT, and AAT repeat tracks and increased up to 60% with increase in G/C content of the repeat unit (Figure 3). In other words, the GC contents of the repeat unit and the flanking sequences were positively correlated.
We analyzed silkworm genomic and EST sequences for imperfections in di-, tri-, tetra-, and pentanucleotide repeats. Incorporation of imperfections in perfect repeat tract by means of mutations was predominantly seen in the center of the repeat tract (398 out of 491). Mutations toward the 3′ end of the repeat motif were less abundant (152/491), whereas mutations toward the 5′ end of the repeat motif were rarely found although they do exist (41/491; supplementary Table S1 at http://www.genetics.org/supplemental/). This distribution was irrespective of the repeat length and we did not observe any length-dependent distribution. The shortest imperfect repeats were identified in di-, tri-, and tetranucleotide repeats of 15 bases in length. The longest was a tetranucleotide repeat of 611 bases (supplementary Table S1).
We determined if the mutations were transitions, transversions, insertions, or deletions. An insertional mutation within the repeat unit, e.g., (ATC)3G(ATC)3, (TGA)3TCGA(TGA)2, and a deletion of a base or more in the repeat tract, e.g., (ATT)3_TT(ATT)2, disturb the continuity of the perfect repeat units.
Transitions (29/74) were most abundant in EST sequences compared to other types of mutations, whereas insertions were the least abundant (7/74; Table 3). Changes from A to G and T to A were the most common mutations disrupting repeat continuity (Table 4). Transition/transversions accounted for 74% of the interruptions of perfect repeats, whereas insertion/deletions were observed in only 26% of the interruptions in ESTs (Table 3). In genomic sequences, repeats were interrupted most often by deletions (153/417) and least by insertions (19/417; Table 3). A to G transitions were the most common substitution disrupting the repeat continuity (Table 4). The ratio of transition/transversion to insertion/deletion was found to be ∼3:2 (Table 3). Double mutations, in which two types of mutations coexisted, were observed in 53 repeats (4 in ESTs and 49 in genomic). Among these, double transitions and double transversions occurred together in 10 and 5 repeat motifs, respectively; transitions and transversions were together in 11/53; insertions/deletions (indels) were together in 4/53; and indels and substitutions were associated in 23/53 imperfect repeats. Triple mutations occurred together in 9 repeat motifs while one (AC)n motif showed all four possible mutations.
The allelic distribution of 26 loci was tested in 13 different strains of diapausing and nondiapausing silkworm strains. Polymorphisms were detected for 23 of the 26 loci. Two microsatellite loci (Bmsat012 and Bmsat015), which were present in the coding region, and one locus (Bmsat109) with an unknown location were found to be monomorphic. The number of alleles scored at each locus in the 13 strains varied from as few as 2 to as many as 14 (Table 5). The heterozygosity values varied from 0.20 to 0.87. We did not observe any association of allele number to any particular repeat type or repeat length.
The average heterozygosity value for 23 polymorphic microsatellite loci was 0.54. Bmsat103, an imperfect dinucleotide repeat locus, was found to be the most polymorphic with 14 alleles and a heterozygosity value of 0.87. The lowest heterozygosity value was found for the trinucleotide locus Bmsat018 with two alleles. A few loci, like Bmsat010, Bmsat013, Bmsat016, Bmsat018, and Bmsat063, showed null alleles. Three loci, Bmsat132, Bmsat066, and Bmsat110, revealed alleles specific to diapausing and nondiapausing strains (Table 6). The 134-bp allele in locus Bmsat132 was specific to all diapausing strains, whereas the 146-bp allele was specific to all nondiapausing strains, except for C.nichi, which did not share either of these alleles. The 215-bp allele in locus Bmsat066 was specific only to diapausing strains except for HU204, which did not share this allele. Likewise, the 90-bp allele of Bmsat110 was unique to diapausing strains.
We tested the influence of domestication of diapause and nondiapause silkworm strains on the level of heterozygosity using bottleneck analysis. The observed heterozygosity exceeded the average of the corresponding distribution of heterozygosities expected at equilibrium in both diapausing and nondiapausing strains. Under the infinite-allele model (IAM) 17 of the 18 polymorphic loci (P = 0.000) and in the stepwise mutation model (SMM) 15 of the 18 polymorphic loci (P = 0.01243) showed an excess of heterozygotes in the case of diapausing populations. Similarly, for nondiapausing strains, all 17 polymorphic loci in IAM (P = 0.000) and 16 of 17 polymorphic loci in SMM (P = 0.000) showed a significant heterozygosity excess. Both the bottleneck tests reject the hypothesis of mutation drift equilibrium and show the occurrence of a recent bottleneck in both diapausing and nondiapausing populations, i.e., a recent decline in effective population size (Ne). These results are not surprising since the populations are highly domesticated and are being maintained by inbreeding. Nei's standard genetic distance based on 23 polymorphic loci in 13 silkworm populations yielded population trees that reflected their geographic origin, voltinism (diapausing/nondiapausing), and pedigree (supplementary Figure 1 at http://www.genetics.org/supplemental/).
The detection of polymorphism of the newly identified loci was carried out in two silkmoth strains, Nistari (male individuals) and NB4D2 (female individuals), which have nondiapausing and diapausing characters, respectively. These two strains are quite divergent for various qualitative and quantitative traits and produce highly heterotic hybrids when they are crossed and are being used as parental strains to construct the mapping population. Out of 198 loci tested, which included 34 from the EST database, 19 from genomic library screening, 54 from GenBank, and 91 from genomic sequences, 36 loci showed polymorphism (18.2%) and 10 loci revealed null alleles (5.1%; Table 7). Perfect tetra- (36.7%), imperfect di- (33.3%), and trinucleotide repeats (23.5%) showed higher levels of polymorphism, respectively, whereas perfect tri- (11.4%) and imperfect tetra- (0%) and pentanucleotide (0%) repeats showed the lowest levels of polymorphism (Table 7).
Genetic linkage map:
A total of 46 polymorphic SSR markers were detected in the study. Among them, 10 markers exhibited null alleles in the male parent, Nistari, and segregated as dominant markers. Four markers showed Z-specific inheritance, which was confirmed by homology search in Z chr-BACs. Four markers exhibited monomorphism among BC progeny and were thus excluded. We carried out χ2 analyses to carry forward only those markers fitting the expected 1:1 ratio in the BC-II population. Five of the markers showed segregation distortion at P = 0.05. Finally, 37 markers were employed for the construction of the SSR linkage map. Eight linkage groups including one Z-group were separated (Figure 4). They ranged from 58.7 cM (group 2) to 6.6 cM (group 5). Eight markers remained unlinked.
Conservation in heterologous species:
To test the conservation of B. mori microsatellite loci in heterologous species (Table 8), we used 30 B. mori microsatellite loci. All species except H. armigera were from the Bombycoidea superfamily. Twenty-two (73.3%) B. mori microsatellite loci were found to be conserved in B. mandarina, the ancestral species of B. mori, and 16 (53.3%) were conserved in A. assama, an Indian muga silkmoth, endemic to northeastern India. The species of S. c. ricini, A. mylitta, A. yamamai, and H. armigera shared four (13.3%); A. pernyi shared three (10%); whereas A. roylei and A. proylei shared only two (6.6%) conserved microsatellite loci with B. mori (supplementary Table S2 at http://www.genetics.org/supplemental/).
The alleles of four loci (Bmsat020, Bmsat110, Bmsat103, and Bmsat109) that showed amplification in most of the heterologous species were cloned and sequenced. We identified both repeat variations and mutations in the flanking sequences in most of the alleles. For example, out of two alleles of Bmsat020 we sequenced, one allele of 200 bp harbored one TTA repeat and the other of 171 bp contained both TTA and TA motifs in H. armigera (Figure 5). The locus Bmsat103 harboring an imperfect (GT) motif showed complex mutations characterized by variation in the repeat motif length and point mutations in the repeat motif as well as in the flanking sequences. The point mutations either created or destroyed the repeat perfection, in addition to creating a new class of repeat motif. In three species, A. roylei, A. yamamai, and H. armigera, a new pentanucleotide microsatellite repeat GGTTA was created in one of the alleles due to multiple point mutations (Figure 5). The locus Bmsat109 revealed an allele in A. pernyi, in which there was a duplication of 59 bp containing a GT repeat motif (Figure 5). The locus Bamsat110, which harbored both GATA and TA repeats, did not show much variation in repeat motif except in B. mandarina, which had only two repeat units for each of these motifs.
In this article we report in silico analysis of microsatellites and experimentally validate a few representative loci for allelic distribution, polymorphism, and conservation in heterologous species. Also, we mapped the polymorphic loci onto a backcross mapping population. Earlier reports on the distribution of microsatellites in insect species showed that, unlike higher animals, they are rich in short repeat tracts (Kruglyak et al. 1998; Chambers and Macavoy 2000; Li et al. 2002). In this study we mined microsatellites of ≥15 bases of mononucleotide repeats and ≥5 repeat units of other classes of repeats.
The B. mori genome is AT rich:
With the sequence data we used for analysis, poly(A/T) was far more abundant than poly(C/G) in both genomic and transcribed sequences, similar to earlier reports (Katti et al. 2001; Morgante et al. 2002). Among dinucleotide repeats TA was in excess in genomic sequences, whereas CG was the rarest (Table 1 and Figure 2). The excess of TA repeats found here deviated from the overall distribution found in arthropods, where CA was found to be the most frequent (Toth et al. 2000). Dinucleotide repeats have been found to be in excess, contributing nearly 50% in many species, and mononucleotide repeats are abundant in primates (Li et al. 2002). In B. mori, mono- and dinucleotide repeats accounted for 36 and 37.9% of the total repeats, respectively. Among trinucleotides, TAA was the most common trinucleotide repeat, whereas CGA and GCA were rare in both genomic and transcribed sequences. In Drosophila, where microsatellite distribution has been studied in detail, AGC is the highest, followed by AAC (Katti et al. 2001). The preponderance of AT-rich trinucleotide repeats in silkworm is consistent with findings in the genomes of Arabidopsis and the yeast Sacccharomyces cervisiae (Cardle et al. 2000; Young et al. 2000). On the contrary, the genomes of maize, human, and rice were rich in GC-rich microsatellites (Jurka and Pethiyagoda 1995; Chin et al. 1996; Temnykh et al. 2001). In accordance with the literature, most of the trinucleotide repeats were excess in ESTs compared to genomic sequences whereas other repeat types were fewer in ESTs (Li et al. 2002; Toth et al. 2000; Morgante et al. 2002). This increase was contributed by trinucleotide repeats other than TAA, GTA, and TGA (Table 1). Tetra-, penta-, and hexanucleotide repeats with one G or C in the repeat unit were found to be abundant in addition to 100% AT-rich repeat units. This kind of distribution might indicate that the presence of a single G/C in the repeat is probably stabilizing the repeat tract.
The most abundant tetra-, penta-, and hexanucleotide repeats were TAAA, CAAA, TTAA, CATA, TAAAA, TTAAA, TATAA, CAAAA, TAAAAA, TTAAAA, and CAAAAA. Long tetra- and pentanucleotide repeats were not found; instead, surprisingly, hexanucleotide repeat arrays were found to have longer repeat tracts (Figure 2). The occurrence of very high amounts of AT-rich repeats, irrespective of repeat type, can be explained by (a) the AT-rich genome of B. mori (∼60%); (b) low Tm, thus easy strand separation; and (c) mutations in poly(A/T) tracts at different time points leading to a combination of AT-rich repeats. It remains to be studied whether the AT-rich portion of the silkworm genome has any relationship with high-order chromatin structure similar to that documented for plant species (Pedersen et al. 1996; Schmidt and Heslop-Harrison 1996). In particular, this knowledge will be useful in light of the fact that silkworm chromosomes are holocentric in nature.
We estimated that microsatellites account for 0.31% of the genome of B. mori (Table 1), which is similar to that of yeast, less than that of plants and vertebrates, and greater than that of Caenorhabditis elegans. (Crollius et al. 2000; Toth et al. 2000; Morgante et al. 2002). The shorter repeat tracts were abundant in the Bombyx genome, similar to those in Drosophila (Table 2, Figure 2); repeat tracts >15 repeat units were sparse in contrast to those in mammals (Kruglyak et al. 1998). This shows that probably there is a limiting mechanism for the length of the repeat tract, as observed in most of the organisms studied (Schlötterer 2000). The lack of longer microsatellites with many repeats could be due to their short persistence time due to mutations and probably they are underrepresented (Harr and Schlötterer 2000).
Sex chromosomes have been found to differ in microsatellite density from autosomes in many eukaryotes. Human, rat (Beckman and Weber 1992), and mouse (Jarne et al. 1998) X chromosomes were found to have a lower abundance of microsatellites compared to autosomes, whereas the reverse is the case for dinucleotide repeats in the Drosophila X chromosome (Bachtrog et al. 1999). The B. mori Z chromosome, equivalent to the X chromosome of mammals and Drosophila, had a higher density of ATT repeats compared to both the autosomes and transcribed sequences (Table 1). These repeats accounted for 59.7% of the total trinucleotides in the Z chromosome. The relative overall trinucleotide microsatellite density was higher in the Z chromosome than in autosomes and ESTs. The Z chromosome sequences contained very few tetra- and hexanucleotide repeats and were devoid of pentanucleotides (Figure 2).
Flanking sequences and microsatellite origin:
Zhu et al. (2000) gave experimental evidence of microsatellite genesis from random sequences. It was shown that the birth of proto-microsatellites happens by short duplications of neighboring sequences without disturbing the stability of the surrounding region. The short repeat then expands by a replication slippage mechanism leading to variability. The generation and perpetuation of proto-microsatellites may be influenced by the nature and composition of the flanking sequences. Our results showed a positive correlation of the AT/GC distribution with flanking sequence, where the GC content of flanking sequence increased from a minimum of 30% to a maximum of 60% as the repeat composition increased from 0% GC to 100% GC (Figure 3). In a sequence that is AT rich, there is a higher possibility of an AT-rich repeat evolving rather than a GC-rich repeat or vice versa. This observation of positive correlation was similar irrespective of the length and type of the repeat motif analyzed. Although the origin and evolution of microsatellites is influenced by the flanking sequences, we did not observe any variation in the flanking sequence content with varying length of repeat unit. A similar observation of positive correlation was made by Bachtrog et al. (1999) in D. melanogaster with respect to TA repeat density, where length was independent of flanking AT content. Considering these results together, we propose that base composition of the flanking sequence may influence the origin and evolution of microsatellites but is independent of repeat unit length (Bachtrog et al. 1999).
Mutations in the repeat motifs:
Microsatellites grow until they reach a threshold length, after which they are abrogated by point mutations or indels disrupting the perfect repeats and thus preventing their infinite growth (Kruglyak et al. 1998; Ellegren 2000). The study of imperfections might indicate previous presence of perfect repeats, which can be reverted back to perfection by reverse mutations (Harr et al. 2000) or evolve into a different repeat motif by further mutations. Mutations in the center of the repeat motif accounted for 60.7% of the imperfect repeats as compared to 30.9% toward the 3′ end of the repeat and 8.4% toward the 5′ end (occurring rarely). This result is in sharp contrast with previous findings (Brohede and Ellegren 1999), which inferred from substitution rate in (CA)n microsatellites that the rate of nucleotide substitution is higher in the ends of the repeat regions than in the middle of the repeat regions. The discrepancy between our observations and that inferred by Brohede and Ellegren (1999) may be that their observations were solely based on 22 (CA)n repeats derived from noncoding regions whereas our observations are based on a large number of di- (138), tri- (120), tetra- (148), and pentanucleotide (85) repeats from both coding and noncoding regions (supplementary Table S1). Imperfect repeats <15 bases were not found (supplementary Table S1). Transitions and transversions constituted 74% of repeat disruptions in silkworm ESTs (Table 3) because they did not disturb the reading frame, which is not the case with indels. By contrast, in genomic sequences transitions/transversions and indels shared equal numbers of repeat disruptions without any bias toward any particular repeat type (Table 3).
In this study, we went beyond in silico discovery to test the loci experimentally for allelic distribution and polymorphism. Eighty-nine alleles for 23 loci were identified in 13 diverse silkworm varieties (Table 5). Most of the loci showed 2–4 alleles. An imperfect dinucleotide repeat locus revealed as many 14 alleles, followed by a tetranucleotide locus with 10 alleles. Imperfect dinucleotide repeat motifs also showed higher polymorphism between Nistari and NB4D2 strains (Table 7). It has been observed that dinucleotide repeats have more slippage events per unit length of DNA than do other repeats (Kruglyak et al. 1998). Generally it is observed that, the longer the repeat stretch is, the higher the mutation rate and allele number are (Brinkmann et al. 1998), but in our results we did not observe any association of the repeat type/length with allele numbers. Our results may be explained by the small number of loci chosen for study combined with locus-specific mutation rates (Schlötterer et al. 1998).
The heterozygosity values ranged from 0.20 to 0.87 (Table 5), showing good variation in the 13 strains for the loci chosen. The three alleles, 215, 134, and 90 bp of Bmsat066, Bmsat132, and Bmsat110 loci, respectively, were present only in diapausing strains, and one allele of 146 bp of Bmsat132 was specific to nondiapausing strains (Table 6). However, further study of these loci may be useful to ascertain their demographic association. Bmsat012 and Bmsat015 loci derived from coding regions showed no polymorphism or allelic variation in the 13 silkmoth populations. This reinforced the observation that coding regions have lower variance than noncoding regions.
About 5% of 198 microsatellite loci revealed null alleles in the parental strains, Nistari and NB4D2, used for assaying for polymorphism. Most of these alleles inherited and segregated in Mendelian fashion (data not shown). Thus, the presence of segregating null alleles will not corrupt the linkage data. If undetected, a null allele will merely result in that individual being scored as a homozygote, resulting in loss of informativeness. However, in population studies, such null alleles, at an appreciable frequency, may result in depression of observed heterozygosity, compared with that expected on the basis of Hardy-Weinberg equilibrium. In almost all the articles reporting microsatellite markers in Lepidoptera, a significant departure from Hardy-Weinberg equilibrium was observed at many of the loci due to deficit of heterozygotes, suggesting the presence of null alleles (Bogdanovicz et al. 1997; Meglecz and Solignac 1998; Harper et al. 2000; Caldas et al. 2002; Cassel 2002; Flanagan et al. 2002; Keyghobadi et al. 2002; Amsellem et al. 2003).
Microsatellite mapping initiative:
Lepidoptera has received little attention in terms of genetics and genetic maps despite immense consequences in terms of economy and ecology. In B. mori, all the linkage maps except the RFLP map are based on dominant markers (Promboon et al. 1995; Shi et al. 1995; Yasukochi 1998; Tan et al. 2001; Kadono-Okuda et al. 2002). Here, we have presented a preliminary linkage map based on 37 microsatellite markers. This effort is a part of our ongoing program to develop a dense genetic linkage map of silkworm using fluorescent inter-simple sequence repeat (FISSR) markers (Nagaraju et al. 2002) and microsatellite markers. These mapped codominant markers will be useful in anchoring a genetic map on a physical map.
Cross-species amplification of microsatellite loci:
The cross-species amplification of silkworm microsatellite loci shows that mutation patterns of microsatellites are often complex: some loci tend to gain repeats (Weber and Wong 1993; Eichler et al. 1994), while others tend to lose repeats (Zhang et al. 1994), and size variations at microsatellite loci are caused by indels in the flanking sequences as well (Angers and Bernatchez 1997; Grimaldi and Crouau-Roy 1997; Colson and Goldstein 1999). In Drosophila, ∼60% of examined microsatellite loci divergence is in the lengths of flanking regions between species (Hutter et al. 1998; Colson and Goldstein 1999). In this study, 71 alleles sequenced from four loci revealed that complexity in interallelic variation is affected by both contraction and expansion of repeat motifs and indels in the flanking regions. At least 20 out of 71 alleles (28.2%) are affected by indels. Many alleles that are identical in length revealed sequence differences (identical by state) as against identical-sized alleles that have similar sequences (identical by descent). Such studies in lepidopteran microsatellites can be particularly useful for population studies and for increasing understanding of the origin, mutational processes, and structure of microsatellites, in addition to allowing more informed interpretation of genotyping data.
Of the 30 loci, 26 showed amplification in at least one of the heterologous species. Twenty-two loci showed amplification in B. mandarina and 15 showed amplification in A. assama. The number of alleles per locus observed was more in B. mori, the focal species, than in B. mandarina at 11 loci and the remaining 11 loci showed no difference in the number of alleles. Similarly 7 loci showed a greater number of alleles in B. mori than in A. assama while the remaining 8 loci showed no difference. The allelic length variation did not show any particular trend in the focal and nonfocal species. These preliminary results could be attributed to ascertainment bias (Ellegren et al. 1995; Forbes et al. 1995; Amos et al. 2003) rather than to directional evolution (Rubinsztein et al. 1995; Amos et al. 1996, 2003). Further studies on microsatellite variation by performing a reciprocal study surveying microsatellite allelic variation in number and length using microsatellite loci characterized from each of these species should shed additional light on directional evolution and ascertainment bias hypothesis.
The successful amplification of silkworm SSR markers across related wild species particularly in B. mandarina and A. assama provides “connectivity” for future consolidation of genetic and physical maps and provides the foundation for association of these maps with particular traits of interest. Also, it provides an opportunity to use SSR markers for investigating the wide range of genetic diversity that exists in wild species outside the gene pool of domesticated silkworm, B. mori.
It is observed that isolation of microsatellites from lepidopterans is very difficult (Meglecz et al. 2004; Zhang 2004) and thus we find as few as 20 reports describing microsatellites from lepidopterans in the literature in contrast with high numbers found in the hymenoptera. Hence, the present report on abundance of microsatellite motifs along with identification and validation of 198 microsatellite loci from B. mori and their utility in population studies, and conservation in heterologous moth species, will be of great use for researchers working on microsatellites in general and lepidopteran species in particular. In addition, the availability of the full sequence of the silkworm genome in the near future will greatly expand the repertoire of SSRs; the more polymorphic ones can be effectively exploited for mapping, annotation, and comparative genomics leading to a better utilization of silkworm, both as a model lepidopteran insect and as an economically important insect on which millions of rural people in the developing countries like China, India, and Thailand depend for their livelihood.
We particularly thank M. R. Goldsmith for useful suggestions on the manuscript. We thank K. P. Arun Kumar for assistance in in silico analysis of the sequence data and B. Srilatha and P. Chandrashekar for technical assistance. This work was supported by the Department of Biotechnology, Government of India by way of grants to J. Nagaraju.
Communicating editor: M. J. Simmons
- Received May 7, 2004.
- Accepted October 1, 2004.
- Genetics Society of America