Transposons and transposon-like repetitive elements collectively occupy 44% of the human genome sequence. In an effort to measure the levels of genetic variation that are caused by human transposons, we have developed a new method to broadly detect transposon insertion polymorphisms of all kinds in humans. We began by identifying 606,093 insertion and deletion (indel) polymorphisms in the genomes of diverse humans. We then screened these polymorphisms to detect indels that were caused by de novo transposon insertions. Our method was highly efficient and led to the identification of 605 nonredundant transposon insertion polymorphisms in 36 diverse humans. We estimate that this represents 25–35% of ∼2075 common transposon polymorphisms in human populations. Because we identified all transposon insertion polymorphisms with a single method, we could evaluate the relative levels of variation that were caused by each transposon class. The average human in our study was estimated to harbor 1283 Alu insertion polymorphisms, 180 L1 polymorphisms, 56 SVA polymorphisms, and 17 polymorphisms related to other forms of mobilized DNA. Overall, our study provides significant steps toward (i) measuring the genetic variation that is caused by transposon insertions in humans and (ii) identifying the transposon copies that produce this variation.
TRANSPOSONS and transposon-like repetitive elements collectively occupy an impressive 44% of the human genome sequence (Lander et al. 2001). Alu and LINE (L1) elements alone account for ∼30% of the genome sequence and are the most abundant transposable elements in humans (Lander et al. 2001). Both Alu and L1 also are actively mobile in the genome today and serve as ongoing sources of human genetic variation (Moran et al. 1996; Ostertag and Kazazian 2001; Batzer and Deininger 2002; Brouha et al. 2003; Dewannieux et al. 2003). The remaining transposon-like elements in the genome have some or all of the hallmark features of transposons, such as target site duplications (TSDs), terminal repeats, and/or poly(A) tails, but are not known to remain functional (Smit and Riggs 1996; Smit 1999; Lander et al. 2001).
Alu elements have been actively mobile in primate genomes during the past 65 million years and consequently have expanded to >1 million copies in the human genome today (Batzer and Deininger 2002 and references therein). The earliest Alu elements appear to have been monomeric derivatives of 7SL RNA, and these monomers later gave rise to dimeric Alu elements (Ullu and Tschudi 1984; Slagel et al. 1987; Britten et al. 1988; Jurka and Zuckerkandl 1991). Alu J elements are the oldest dimeric elements in the human genome (Jurka and Smith 1988; Batzer and Deininger 2002). Although these elements were highly active ∼55–65 million years ago, they are thought to have lost the ability to transpose long ago (Jurka and Smith 1988; Batzer and Deininger 2002). Likewise, Alu S elements, which are intermediate in age, are thought to have become inactive at least 35 million years ago (Jurka and Smith 1988; Batzer and Deininger 2002; Johanning et al. 2003). Alu Y elements, in contrast, are the youngest Alu elements in the genome and these elements remain actively mobile today (Batzer and Deininger 2002; Dewannieux et al. 2003). The Alu J, S, and Y families (and their subfamilies) contain a series of hierarchical DNA sequence changes that arose during Alu evolution (Slagel et al. 1987; Jurka and Smith 1988; Batzer and Deininger 2002; Jurka et al. 2002). Each Alu family contains a unique set of diagnostic base changes that can be used to identify copies belonging to that family.
The second most abundant class of transposons in humans, the LINE (L1) elements, are autonomous poly(A) retrotransposons (Ostertag and Kazazian 2001 and references therein). These elements also have reached high copy numbers in the human genome (∼500,000) and collectively occupy ∼17% of the genome sequence (Lander et al. 2001). Like Alu, L1 elements have been actively mobile over a long period of time and have been classified according to their respective ages using specific base changes (Boissinot et al. 2000; Ovchinnikov et al. 2002; Brouha et al. 2003). The oldest L1 elements in the genome have accumulated deleterious mutations that render them inactive. However, younger L1 elements have been identified that remain actively mobile today (Moran et al. 1996; Brouha et al. 2003). These active copies contain two intact open reading frames, ORF1 and ORF2, which encode proteins that are necessary for L1 retrotransposition (Feng et al. 1996; Moran et al. 1996). ORF1 encodes a 40-kD protein with RNA-binding activity (Hohjoh and Singer 1996, 1997a,b; Kolosha and Martin 1997; Martin et al. 2000, 2003; Martin and Bushman 2001), whereas ORF2 encodes a protein with both endonuclease (EN) and reverse transcriptase (RT) activities (Mathias et al. 1991; Feng et al. 1996; Moran et al. 1996; Cost et al. 2002). EN and RT work together in a process known as target-primed reverse transcription (TPRT; Luan et al. 1993) that integrates a newly synthesized L1 cDNA into a DNA target site (Cost et al. 2002). Alu RNA (and other cellular RNAs) can compete for the L1 machinery during the TPRT process, leading to the retrotransposition of these alternative RNAs instead of the normal L1 mRNA (Esnault et al. 2000; Wei et al. 2001; Dewannieux et al. 2003). This “trans” replication mechanism is thought to account for the massive expansion of Alu elements in the human genome and for the existence of processed pseudogenes.
Because Alu and L1 remain actively mobile in the human genome today, they serve as ongoing sources of genetic variation by generating new transposon insertions (reviewed in Ostertag and Kazazian 2001 and Batzer and Deininger 2002). For example, estimates suggest that a new Alu insertion occurs approximately once every 200 live births (Deininger and Batzer 1999). As a consequence, a large number of polymorphic Alu and L1 insertions have accumulated in human populations. Many of these insertions are expected to be genetically neutral and, therefore, would have little or no impact on human phenotypes. However, other insertions (primarily those within genes) have been found to cause altered human phenotypes, including diseases. For example, disease-causing Alu insertions have been observed in the BRCA2 gene (Miki et al. 1996), the glycerol kinase gene (Zhang et al. 2000), and others (Deininger and Batzer 1999). Disease-causing L1 insertions likewise have been observed in at least 14 different genes, causing cancers (Morse et al. 1988; Miki et al. 1992; Liu et al. 1997), hemophilia (Kazazian et al. 1988), muscular dystrophy (Narita et al. 1993), and other diseases. It is likely that additional transposon insertions will be found to affect human phenotypes as well.
As an initial step toward studying the potential phenotypic variation that is caused by Alu and L1 elements, it is necessary to identify all of the polymorphic insertions that exist in human populations. Only a fraction of such insertions have been identified to date, largely because the methods for detecting transposon insertion polymorphisms are labor intensive. Most of the known Alu and L1 insertion polymorphisms have been identified by systematically screening individual element copies in human populations using PCR assays (Carroll et al. 2001; Roy-Engel et al. 2001; Myers et al. 2002; Abdel-Halim et al. 2003; reviewed in Ostertag and Kazazian 2001 and Batzer and Deininger 2002). Transposon display assays also have been used to identify transposon insertion polymorphisms (Sheen et al. 2000; Badge et al. 2003). Although these methods have been useful for identifying polymorphisms, they are not likely to be sufficient on a genome-wide scale to identify all of the polymorphic Alu and L1 copies that exist in human populations. Thus, new and more efficient methods are necessary to identify transposon insertion polymorphisms.
In addition to Alu and L1 elements, some of the remaining transposons and transposon-like elements in the genome also might be polymorphic and, therefore, would contribute to human genetic diversity. Despite the fact that there are many families of such elements in humans (Smit and Riggs 1996; Smit 1999; Lander et al. 2001), no comprehensive studies have been conducted to examine whether these elements are polymorphic or remain actively mobile. As is the case for Alu and L1, such elements would be of interest because they represent sources of human genetic variation and might also cause mutations that lead to human diseases.
In an effort to measure the levels of genetic variation that are caused by human transposons, we have developed an efficient method to broadly detect transposon insertion polymorphisms of all kinds in humans. The method exploits DNA sequencing traces that originally were generated from diverse humans for single-nucleotide polymorphism (SNP) discovery projects (Sachidanandam et al. 2001; International HapMap Consortium 2003). We have developed a computational pipeline that now analyzes these traces to identify transposon insertion polymorphisms. Our study provides significant steps toward (i) measuring the genetic variation that is caused by transposon insertions in humans and (ii) identifying the transposon copies that produce this variation.
MATERIALS AND METHODS
Identifying insertion and deletion candidates using DNA sequencing traces from diverse humans:
DNA sequencing traces and accompanying quality files were obtained from Cold Spring Harbor Laboratory [traces generated by the SNP Consortium (TSC)] or from the Trace DB archive at the National Center for Biotechnology Information (NCBI). Insertion and deletion (indel) and transposon insertion polymorphisms were identified from these traces using a sequential series of computer programs and databases as outlined in Figure 1. Many of these programs were obtained from NCBI or from other sources as indicated below. Other custom Perl programs were developed for indel and transposon polymorphism discovery as necessary and are available upon request. Most of these programs and databases were installed locally on Dell workstations running Microsoft Windows 2000, Windows XP, or Red Hat Linux operating systems. A 12-CPU Linux cluster also was constructed and utilized for the RepeatMasker and MegaBLAST steps of the pipeline (Figure 1).
A total of 16.4 million DNA sequencing traces were processed using the pipeline depicted in Figure 1. The traces first were screened for vector contamination using the VecScreen system developed by NCBI and were trimmed as necessary. Low-quality regions of the traces then were identified and trimmed with a custom Perl program that uses the Phred quality scores in the accompanying quality files to identify such regions (Ewing and Green 1998; Ewing et al. 1998). Our method identified the longest high-quality region of each trace and then trimmed the flanking data upon encountering 5 bases in a row with Phred scores <25. The longest high-quality interval from each trace was chosen for further analysis and the remaining data were set aside. Trimmed traces also were required to have average Phred scores of at least 25 and minimum lengths of 100 bases.
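The quality-trimming rule described above can be illustrated with a short Python sketch. This is not the authors' Perl program; it is a minimal interpretation that assumes a trace is split wherever 5 consecutive bases fall below Phred 25, that the longest remaining interval is kept, and that the kept interval must be at least 100 bases long with a mean Phred score of at least 25. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def longest_high_quality_interval(
    quals: List[int], min_q: int = 25, run_len: int = 5
) -> Tuple[int, int]:
    """Return (start, end) of the longest interval that contains no run of
    `run_len` consecutive bases with quality below `min_q` (end is exclusive)."""
    best = (0, 0)
    start = 0    # start of the interval currently being extended
    low_run = 0  # length of the current run of low-quality bases
    for i, q in enumerate(quals):
        if q < min_q:
            low_run += 1
            if low_run == run_len:
                # The interval ends where the low-quality run began.
                end = i - run_len + 1
                if end - start > best[1] - best[0]:
                    best = (start, end)
            if low_run >= run_len:
                start = i + 1  # restart after the low-quality run
        else:
            low_run = 0
    # Close the final interval at the end of the trace.
    if len(quals) - start > best[1] - best[0]:
        best = (start, len(quals))
    return best


def trim_trace(
    seq: str, quals: List[int], min_q: int = 25, min_len: int = 100
) -> Optional[Tuple[str, List[int]]]:
    """Keep the longest high-quality interval, or return None if it is
    shorter than `min_len` or its mean quality is below `min_q`."""
    start, end = longest_high_quality_interval(quals, min_q)
    kept = quals[start:end]
    if len(kept) < min_len or sum(kept) / len(kept) < min_q:
        return None
    return seq[start:end], kept
```

Under these assumptions, a trace with 120 good bases, a 5-base low-quality run, and 50 more good bases would be trimmed to its first 120 bases, whereas a trace whose longest clean interval is only 50 bases would be set aside.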
After trimming, each trace then was mapped to a unique location in the human genome sequence (build 33 for the TSC traces and build 34 for the remaining traces). Builds 33 and 34 of the human genome sequence database were obtained from the University of California (Santa Cruz) and installed locally to perform this step (Kent et al. 2002). All known repeats (including all transposons and transposon-like repetitive elements defined in Repbase Volume 7, Issue 7; Jurka 2000) first were temporarily masked in the traces using RepeatMasker (version 2001/07/07; A. Smit, unpublished data) and MaskerAid (Bedell et al. 2000). The single longest unmasked “anchor sequence” of the trace then was used to assign each trace to a unique genomic location using MegaBLAST (NCBI). The anchor sequence was required to have a minimum of a 50-base match at 100% identity for a trace to be mapped successfully. Traces with anchor sequences that matched to more than one genomic location with 100% identity, or that did not have a minimum of a 50-base match at 100% identity, were set aside to avoid traces that mapped to duplicated regions of the genome (Bailey et al. 2002). After the traces were successfully mapped to unique genomic locations, they were unmasked and aligned to their assigned genomic locations using the Bl2Seq program (NCBI). The Bl2Seq program allowed for as much as a 16-base gap in the alignments and led to identification of indels as large as 16 bases in length.
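The anchor-selection and uniqueness rules described above can be sketched as follows. The sketch assumes that masked bases appear as "N" in the masked trace and that genomic hits are summarized as hypothetical (percent identity, aligned length) pairs; neither detail is specified in the text, and the actual MegaBLAST output handling may well have differed.

```python
import re
from typing import List, Optional, Tuple


def longest_unmasked_anchor(
    masked_seq: str, min_len: int = 50
) -> Optional[Tuple[int, str]]:
    """Return (start, anchor) for the longest run of unmasked bases, or
    None if no run reaches `min_len`. Masked bases are assumed to be 'N'."""
    runs = list(re.finditer(r"[ACGT]+", masked_seq))
    if not runs:
        return None
    best = max(runs, key=lambda m: m.end() - m.start())
    if best.end() - best.start() < min_len:
        return None
    return best.start(), best.group()


def is_uniquely_mapped(hits: List[Tuple[float, int]], min_len: int = 50) -> bool:
    """Accept a trace only if exactly one genomic hit is a 100%-identity
    match of at least `min_len` bases (an interpretation of the rule above)."""
    perfect = [h for h in hits if h[0] == 100.0 and h[1] >= min_len]
    return len(perfect) == 1
```

In this reading, a trace whose anchor has two perfect hits of sufficient length is set aside as a likely match to a duplicated region, as is a trace with no perfect hit at all.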
A new algorithm also was developed to identify indels that were >16 bp in length. Our strategy was designed to split trace data into two blocks upon encountering a region in the pairwise alignment that no longer matched the query. The first block of sequence that matched was maintained in the correct position, and the nonmatching sequence was moved over as a block, 1 base at a time, until a match was obtained. The Perl program that was developed to accomplish this task moved the nonmatching block until it detected either a perfect alignment or a distance of 10,000 bases (the maximum distance allowed by the program). The 5 bases on each side of an indel candidate were required to have Phred scores of ≥20 to ensure that high-quality bases were being used to locate the indel junctions. Indel candidates were deposited into dbSNP under accession nos. ss8029278–ss8176133, ss8475737–ss8484870, ss14926095–ss15354938, and ss15357378–ss15378640.
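A minimal Python sketch of the split-and-slide idea is given below. It is an illustration under stated assumptions rather than the authors' Perl program: it handles only the case in which the reference contains sequence that is absent from the trace, it requires an exact match of the shifted block, and it omits the Phred checks on the junction bases. The function name and arguments are hypothetical.

```python
from typing import Optional, Tuple


def find_large_gap(
    trace: str, ref: str, ref_start: int, max_shift: int = 10_000
) -> Optional[Tuple[int, int]]:
    """Return (break_pos, shift): the trace offset where the first block ends
    and the number of reference bases skipped before the remaining block
    matches perfectly. Return None if no gap is needed or none is found."""
    # Extend the first block for as long as the trace matches the reference.
    n = 0
    while (
        n < len(trace)
        and ref_start + n < len(ref)
        and trace[n] == ref[ref_start + n]
    ):
        n += 1
    if n == len(trace):
        return None  # the whole trace matches; no indel candidate

    # Slide the nonmatching block along the reference, 1 base at a time.
    tail = trace[n:]
    for shift in range(1, max_shift + 1):
        pos = ref_start + n + shift
        if pos + len(tail) > len(ref):
            break  # ran off the end of the reference
        if ref[pos:pos + len(tail)] == tail:
            return n, shift
    return None
```

The symmetric case, in which the trace carries extra sequence relative to the reference, could be handled by swapping the roles of the two sequences.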
Identifying transposon insertion polymorphisms by screening human indels:
Transposon insertion polymorphisms were identified among indels using a custom computer algorithm. First, indels were identified for which at least 80% of the indel sequence was occupied by a known transposon, according to the human transposon and repeat definitions in Repbase (Vol. 7, Issue 7; Jurka 2000). This step was accomplished by querying an Oracle database that stored RepeatMasker output data (and other information) for each indel. Next, selected candidates were examined with a custom Perl program to determine whether potential TSDs were present. Such duplications generally flank transposon insertions and are hallmarks of most transposons (Berg and Howe 1989). Therefore, if an indel was caused by a transposon insertion, it generally would be expected to be flanked by a TSD (one copy of the duplicated sequence actually is contained within the indel itself, since the duplication is created during the insertion of the transposon). Candidate transposon insertions also were screened with a custom Perl program to identify potential poly(A) tails, which are associated with certain retrotransposons. Finally, the genomic contexts of all transposon indel candidates were examined to identify true de novo insertions vs. indels that were caused by deletions or duplications within existing transposon copies. All indels that met at least the first test were inspected and curated manually (see supplemental Table 1 at http://www.genetics.org/supplemental/ for the final curated set). Six hundred and five nonredundant polymorphisms were identified that were caused by de novo transposon insertions (these are listed in the “Alu,” “L1,” “SVA,” and “Other” sections of supplemental Table 1). Another 50 nonredundant polymorphisms were caused by deletions or duplications within existing transposons (these are listed in the “Deletions and duplications” section of supplemental Table 1).
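The three automated screens described above (repeat content, TSD, and poly(A) tail) can be sketched in Python as shown below. Only the 80% repeat-content cutoff comes from the text. The maximum TSD length examined (30 bases), the minimum poly(A) run (8 bases), and all function names are assumptions made for illustration, and the sketch does not attempt the genomic-context check or the manual curation.

```python
import re
from typing import Dict, List, Tuple


def repeat_fraction(indel_len: int, repeat_hits: List[Tuple[int, int]]) -> float:
    """Fraction of the indel covered by annotated repeats, given half-open
    (start, end) intervals in indel coordinates; overlaps are counted once."""
    covered, last_end = 0, 0
    for start, end in sorted(repeat_hits):
        start, end = max(start, last_end), min(end, indel_len)
        if end > start:
            covered += end - start
            last_end = end
    return covered / indel_len if indel_len else 0.0


def tsd_length(indel: str, left_flank: str, right_flank: str, max_len: int = 30) -> int:
    """Length of the best candidate TSD. One copy lies inside the indel, so
    the indel's end is compared with the end of the left flank and the
    indel's start with the start of the right flank."""
    suffix = 0
    while (suffix < min(len(indel), len(left_flank), max_len)
           and indel[-1 - suffix] == left_flank[-1 - suffix]):
        suffix += 1
    prefix = 0
    while (prefix < min(len(indel), len(right_flank), max_len)
           and indel[prefix] == right_flank[prefix]):
        prefix += 1
    return max(suffix, prefix)


def has_poly_a(indel: str, min_run: int = 8) -> bool:
    """True if the indel contains a run of A (or T, for the opposite strand)."""
    return re.search(r"A{%d,}|T{%d,}" % (min_run, min_run), indel) is not None


def screen_indel(indel: str, left_flank: str, right_flank: str,
                 repeat_hits: List[Tuple[int, int]]) -> Dict[str, object]:
    """Collect the screening evidence for one indel candidate."""
    frac = repeat_fraction(len(indel), repeat_hits)
    return {
        "repeat_fraction": frac,
        "passes_repeat_test": frac >= 0.8,  # the 80% cutoff from the text
        "tsd_length": tsd_length(indel, left_flank, right_flank),
        "has_poly_a": has_poly_a(indel),
    }
```

The sketch returns the evidence as a record instead of a single yes/no answer because, as noted above, every indel that met the first test was still inspected manually.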
Analysis of transposon subfamilies:
Alu transposon insertions were classified initially using RepeatMasker (A. Smit, unpublished data) and Repbase (Vol. 7, Issue 7; Jurka 2000). Each polymorphic copy also was compared independently to the consensus sequences of all known Alu subfamilies (Repbase Vol. 7, Issue 7; Jurka 2000). To accomplish this goal, all Alu insertions identified were coaligned with the consensus sequences of all Alu subfamilies using the Clustal W program. Key diagnostic bases then were analyzed to further assist with the assignments of these elements to specific subfamilies. Each copy then was compared to the assigned subfamily consensus using Bl2seq (NCBI). In some cases, element copies also were compared to the consensus sequences of several neighboring families. A final assignment was made on the basis of the best match obtained. L1-Hs and L1-P elements were classified initially using RepeatMasker (A. Smit, unpublished data). The L1-Hs elements then were assigned to a given subfamily using the classification system described by Brouha et al. (2003). All other transposons were classified using the RepeatMasker system (A. Smit, unpublished data) and Repbase (Vol. 7, Issue 7; Jurka 2000).
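The diagnostic-base step can be illustrated with a small Python sketch that scores an aligned element copy against each subfamily's diagnostic positions and reports the best match. The subfamily names, positions, and bases used in the accompanying example are toy placeholders and are not real Alu diagnostic sites; the scoring rule (fraction of diagnostic positions matched) is likewise an assumption, since the text does not state how the best match was computed.

```python
from typing import Dict


def score_subfamilies(
    aligned_copy: str, diagnostics: Dict[str, Dict[int, str]]
) -> Dict[str, float]:
    """For each subfamily, return the fraction of its diagnostic positions
    (0-based alignment columns) at which the copy carries the expected base."""
    scores = {}
    for name, sites in diagnostics.items():
        matches = sum(
            1
            for pos, base in sites.items()
            if pos < len(aligned_copy) and aligned_copy[pos] == base
        )
        scores[name] = matches / len(sites)
    return scores


def best_subfamily(aligned_copy: str, diagnostics: Dict[str, Dict[int, str]]) -> str:
    """Return the subfamily with the highest diagnostic-base score."""
    scores = score_subfamilies(aligned_copy, diagnostics)
    return max(scores, key=scores.get)
```

For example, with the toy table `{"toyA": {2: "G", 5: "T"}, "toyB": {2: "C", 5: "T", 7: "A"}}` and the aligned copy `"AACAATAA"`, the copy matches one of two toyA sites and all three toyB sites, so it would be assigned to toyB.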
Validation of the computational pipeline by PCR:
Sixty-one transposon insertions were chosen arbitrarily from the TSC data set and examined by PCR to evaluate the accuracy of our computational predictions (Table 6). PCR assays were designed for each of the 61 polymorphic transposon copies using primers that either flanked (A and D primers) or were located within (B and C primers) a given transposon as depicted in Figure 3. All primers used in these studies are listed in supplemental Table 2 at http://www.genetics.org/supplemental/. A total of 68 PCR assays were designed initially. Seven (10%) of these assays failed due to technical reasons and these assays were abandoned. The remaining 61 assays (90%) yielded band(s) of the expected size(s) and were used to assay 12–24 DNA samples from the Coriell diversity panel (Figure 3 and Table 6). The Coriell diversity panel of 24 DNA samples was obtained from the Coriell Repository, Camden, New Jersey (Collins et al. 1999). Lymphocyte cultures of this panel also were obtained from Coriell and, in some cases, DNA was prepared from these cells. PCR reactions were carried out in 50-μl volumes as described previously (Kimmel et al. 1997). PCR products were run on 1.5% agarose gels and sized using a 1-kb ladder marker (Invitrogen, San Diego).
Analysis of additional genomic SVA elements:
In addition to the SVA copies identified in the trace experiments, 28 other genomic SVA copies were selected from the human genome sequence using SVA element query sequences and the BLAT program (Kent 2002). These SVA copies were examined by PCR to assess whether they were polymorphic in at least one individual of the Coriell panel. PCR primers were developed to examine the status of each SVA copy as described in Figure 3 and supplemental Table 2. PCR reactions were carried out as outlined above and in Figure 3. An SVA copy was considered to be polymorphic if both alleles (one with and one without the transposon insertion) could be identified at least once. Fifty-nine additional SVA element copies were identified by manual inspection of the first 50 Mb of human chromosome 1 using the University of California, Santa Cruz genome browser (Kent et al. 2002). The genomic regions surrounding all of these SVA copies were compared to the equivalent chimp genomic sequences to determine whether the chimp contained an SVA element at the equivalent position (supplemental Table 1).
RESULTS
A strategy for detecting genetic variation caused by transposon insertions in humans:
Our strategy for detecting transposon insertion polymorphisms in humans involved identifying a large number of indel polymorphisms in human populations and then screening these polymorphisms to identify de novo transposon insertions. We reasoned that this strategy should be successful since transposon insertion polymorphisms are equivalent to insertions and deletions in genomes. Relatively few indels had been identified in human populations prior to our study, despite the fact that indels are abundant in the genomes of model organisms such as Drosophila melanogaster (Berger et al. 2001) and Caenorhabditis elegans (Wicks et al. 2001) and were likely to be abundant in humans as well. Therefore, we began our study by developing new computational methods to discover indel polymorphisms in the genomes of diverse humans (materials and methods).
Our strategy involved mining indels from DNA sequencing traces that previously had been generated for SNP discovery projects. All of the traces used in our study originally were generated at genome centers by resequencing pools of genomic DNA from diverse humans. For example, a set of 7.1 million traces, which originally had been generated by shotgun sequencing the DNA of 24 diverse humans (Sachidanandam et al. 2001), was obtained from TSC. A second set of 8.2 million whole-genome shotgun (WGS) traces, which originally had been generated by shotgun sequencing the DNA of eight unrelated African-American adults (four males and four females from the Baylor Polymorphism Resource; International HapMap Consortium 2003), was obtained from the Baylor and Whitehead Genome Centers. Finally, a much smaller set of 0.9 million whole-chromosome shotgun (WCS) traces, which had been generated by shotgun sequencing chromosome 20-specific libraries from four diverse humans (International HapMap Consortium 2003), was obtained from the Sanger Center. Because these DNA sequencing traces were derived from diverse humans, we expected them to harbor various forms of genetic variation, including indels. We developed a computational pipeline to identify indels within these traces by comparing them to the human genome reference sequence (builds 33 and 34; Figure 1).
A total of 606,093 indel candidates were identified by analyzing 16.4 million traces with our computational pipeline (Figure 1 and materials and methods). The majority of these indels (428,838 or 70.8%) were identified from the WGS traces. An additional 155,992 indels (25.7%) were identified from the TSC traces, and 21,263 indels (3.5%) were identified from the WCS traces. Overall, these indel candidates were distributed throughout the human genome and were found on all 24 chromosomes (data not shown). They ranged from 1 to 9969 bp in length and contained a wide array of different DNA sequences. All indel candidates were deposited into dbSNP under the “Devine_lab” handle (http://www.nih.nlm.gov/SNP).
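As a quick arithmetic check, the per-data-set counts quoted above sum to the reported total, and the quoted percentages follow from them. The short Python sketch below uses only numbers stated in the text; the variable and function names are illustrative.

```python
# Indel candidates per trace set, as quoted in the text.
indel_counts = {"WGS": 428_838, "TSC": 155_992, "WCS": 21_263}

total = sum(indel_counts.values())


def share_percent(data_set: str) -> float:
    """Percentage of all indel candidates contributed by one trace set."""
    return round(100 * indel_counts[data_set] / total, 1)
```

Here `total` evaluates to 606,093, and the shares for WGS, TSC, and WCS round to 70.8, 25.7, and 3.5%, respectively.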
We next developed a computer algorithm to identify indels that were caused by de novo transposon insertions. The method was designed to identify indels for which a single transposon copy and its associated sequences (e.g., its target site duplication) accounted for the indel (see materials and methods). Eight hundred and two transposon insertion polymorphisms were detected with these methods in the three populations examined (Table 1). Four major classes of transposon insertions were identified in these experiments: (i) Alu insertions, (ii) L1 insertions, (iii) SVA insertions, and (iv) insertions of “other” elements.
Alu insertion polymorphisms were by far the most abundant polymorphisms identified in the three experiments (Table 1). A total of 173 of 207 (83.6%) of the polymorphisms in the TSC data set were Alu insertions. Likewise, 487 of 583 (83.5%) of the polymorphisms in the WGS set were Alu insertions, and 10 of 12 (83.3%) of the polymorphisms in the WCS set were Alu insertions. L1 insertions were the next most abundant polymorphisms identified, representing 12.6 and 11.0% of the TSC and WGS data sets, respectively (Table 1). Although the L1-Ta class was the most abundant subfamily of L1, other non-Ta L1 elements were identified as well (see below). SVA element insertions were the third most abundant class of transposon polymorphisms identified, representing 2.9 and 4.3% of the TSC and WGS data sets, respectively. Finally, the remaining transposon insertion polymorphisms were caused by a miscellaneous collection of low-frequency insertions. These elements were pooled into a single group of other polymorphisms (Table 1).
It is important to note that our measurements were remarkably consistent between the data sets. This was particularly true for the TSC and WGS data sets, which were significantly larger than the remaining WCS set. For example, as noted above, Alu polymorphisms represented ∼83% of the transposon insertions in all three of the populations examined. The percentages of L1 insertions likewise were very similar in these experiments (Table 1). Overall, the TSC and WGS experiments were remarkably similar given the differences in the populations that were used to generate these trace sets (Table 1). Nevertheless, the results were not completely identical between the populations. For example, Alu Ya5 polymorphisms represented 32.4% of the insertions in the TSC population and 25.4% of the insertions in the WGS population (Table 1). Therefore, at least some of these element families might have amplified at slightly different rates in the populations examined.
We next inspected all of the polymorphic transposon insertions from the three populations to determine whether any of the copies were redundant in the three data sets. In fact, 149 polymorphisms were identified in which the same transposon allele was detected more than once in our trace experiments. In most of these cases (115 of 149 or 77.2%), the alleles were detected independently twice (supplemental Table 1). Another 25 of 149 alleles (16.8%) were detected three times and the remaining 9 of 149 alleles (6%) were detected four to six times (supplemental Table 1). These results provide confidence in our method and suggest that at least some of our transposon insertion polymorphisms are present at high frequencies in human populations. To perform additional analyses of these transposons, we developed a nonredundant data set of 605 transposon insertions (supplemental Table 1).
Alu Y insertion polymorphisms:
A total of 505 nonredundant Alu insertion polymorphisms were identified in the three populations of our study, including both full-length and partial Alu insertions (supplemental Table 1 and Table 2). These elements were compared to all known Alu families and were classified to determine which Alu elements were detected in our experiments (materials and methods). The vast majority of our Alu insertions were Alu Y elements, with 500 of 505 (99%) of the insertions falling in this category (supplemental Table 1 and Table 2). Alu Ya5 elements were the most abundant subfamily in our study, representing 33.7% of the insertions (supplemental Table 1 and Table 2). Alu Yb8 polymorphisms also were abundant, representing 25.5% of the insertions (supplemental Table 1 and Table 2). Alu Y, Alu Yc1, and Alu Ya4 elements were present at intermediate levels (between 5.9 and 7.5%), and these three families together represented 19.7% of all nonredundant Alu insertions in our study. Most of the remaining Alu Y-related insertions were present at relatively low levels and were distributed among 15 different Alu Y subfamilies (supplemental Table 1 and Table 2). Notably, Alu polymorphisms were detected from most of the known Alu Y subfamilies, including Alu Ya, Yb, Yc, Yd, Ye, Yf, Yg, and Yi (supplemental Table 1 and Table 2). Moreover, although we did detect several new small groups of Alu Y insertions that might be considered novel subfamilies (see below and Figure 2), no new extended Alu Y families of significant size were detected in our study.
As outlined above, Alu Ya5 and Alu Yb8 insertions were the most abundant Alu elements in our data sets. Carroll et al. (2001) previously demonstrated that these two Alu subfamilies were highly polymorphic in human populations. In fact, they estimated that 25% of Alu Ya5 elements and 20% of Alu Yb8 elements were polymorphic in at least one individual of a panel of 80 diverse humans (Carroll et al. 2001). On the basis of their copy number estimates for these two elements, we can predict that at least 660 Alu Ya5 insertion polymorphisms and 370 Alu Yb8 insertion polymorphisms should exist in human populations. We found a total of 170 nonredundant Alu Ya5 insertions and 129 nonredundant Alu Yb8 insertions in our study (supplemental Table 1 and Table 2). Only 8 of these polymorphic insertions (4 Alu Ya5 and 4 Alu Yb8) were identified by Carroll et al. (2001). Therefore, 291 of 299 (97.3%) of our Alu Ya5 and Alu Yb8 polymorphisms had not been detected previously. Similar results were obtained with the remaining Alu classes, indicating that our method efficiently identified a large number of novel Alu insertion polymorphisms in human populations.
Polymorphic ancient Alu elements:
In addition to Alu Y elements, we also identified four polymorphic copies of older Alu S elements (Table 2). Two of these examples (Alu ss14941867 and Alu ss8480425) were intact, full-length Alu S insertions with all of the expected features of Alu retrotransposition events, including poly(A) tails and target site duplications (supplemental Table 1). Two additional examples of 5′-truncated or otherwise fragmented copies of Alu S also were identified (supplemental Table 1 and Table 2). One of these insertions (ss14931773) was a 5′-truncated Alu Sc element with a perfect target site duplication. The second insertion (ss15143442) was an Alu Sq element that was truncated at both the 5′ and the 3′ ends and lacked a target site duplication altogether (supplemental Table 1). It is not clear how this second Alu polymorphism was formed. One possibility is that it was caused by an endonuclease-independent mechanism of retrotransposition involving partial Alu RNA templates. Both Alu and L1 elements are known to use an endonuclease-independent mechanism that does not generate target site duplications surrounding the newly transposed copy (Morrish et al. 2002; Abdel-Halim et al. 2003). Perhaps this older Alu Sq element was mobilized by the L1 machinery using this alternative mechanism.
The fact that we identified four ancient Alu S insertion polymorphisms indicates that at least some of the Alu S copies are likely to have retained the ability to transpose long after the majority of Alu S elements became transpositionally inactive. This is most probable for the intact Alu copies discussed above (Alu ss14941867 and Alu ss8480425). These copies do not appear to have been caused by gene conversion events and have estimated ages of 7–23 million years, suggesting that they are younger than most of the Alu S elements (supplemental Table 1). Prior to our study, only the Alu Y elements were thought to be polymorphic in humans, whereas the older Alu S, Alu J, and Alu monomers were thought to have only fixed alleles in human populations. Recent evidence from Johanning et al. (2003) showed that at least some Alu Sx elements appear to have transposed later than previously estimated (∼35 million years ago); however, Alu S insertion polymorphisms were not detected in humans prior to our study.
Sequence variation within polymorphic copies of Alu indicates patterns of Alu evolution:
Significant DNA sequence variation was noted within the polymorphic Alu insertions identified in our study. In most cases, a given element copy could be placed unambiguously within a known Alu subfamily using key diagnostic base changes (materials and methods). Nevertheless, a large number of additional single- and multiple-base changes were noted in these elements relative to their respective consensus sequences (supplemental Table 1). Of particular interest were small groups of Alu elements that clearly belonged to a given element family, but differed from the consensus by one or more shared base changes. Since CpG changes occur independently at a high frequency, it was possible that some of these groups were caused by independent changes at CpG hotspots. However, at least 10 of these groups possessed shared base changes at non-CpG sites (or had unusually high frequencies of a given CpG change along with additional shared changes). Most of these groups also showed evidence for the progressive accumulation of shared mutations. In these cases, a single base change was shared by an initial subset of the elements and additional shared changes appear to have been acquired later. We propose that these groups represent novel evolutionary lineages of Alu elements that are defined by these new base changes (Figure 2). These data suggest that a significant number of Alu insertions go on to serve as new source genes for small numbers of additional retrotransposition events (Deininger et al. 1992).
L1 insertion polymorphisms:
Although most of the ∼500,000 L1 copies in the haploid human genome have accumulated deleterious mutations that render them inactive, some of the younger L1 copies remain actively mobile today (Moran et al. 1996; Brouha et al. 2003). These younger copies belong to the L1-Hs (Human-specific) family of elements (Brouha et al. 2003). The Hs family has been subdivided further into the Ta-0, Ta-1, Ta-nd, and Ta-d subfamilies on the basis of the presence or absence of specific nucleotide changes within the L1 sequence (Boissinot et al. 2000; Ovchinnikov et al. 2002; Brouha et al. 2003). Reflective of their younger ages, L1-Hs elements are highly polymorphic in human populations (Sheen et al. 2000; Ovchinnikov et al. 2001; Myers et al. 2002; Badge et al. 2003; Brouha et al. 2003).
We identified 65 nonredundant L1 insertion polymorphisms in our study (supplemental Table 1 and Table 2). Each of these L1 elements ended in a poly(A) sequence and was flanked by a typical L1 target site duplication (supplemental Table 1). We classified these elements using the system described by Brouha et al. (2003) and found that most of the copies belonged to Ta subfamilies of L1 elements. In fact, elements belonging to the L1 Ta-0, Ta-1, Ta-nd, and Ta-d subfamilies were identified along with some older pre-Ta elements (supplemental Table 1 and Table 2). These results are consistent with the observation that 13 of the 14 L1 insertions that have been found to cause human diseases were L1-Ta elements and the remaining element was a pre-Ta element (reviewed in Moran 1999). Our results also are consistent with previous studies demonstrating that L1-Ta elements are highly polymorphic (Sheen et al. 2000; Ovchinnikov et al. 2001; Myers et al. 2002; Brouha et al. 2003). Although some of our L1 insertion polymorphisms were identified previously, most were unique to our study (supplemental Table 1).
Interestingly, we also identified six polymorphic copies of older L1-P insertions, including five polymorphic L1PA2 insertions and a single L1PA3 insertion (supplemental Table 1 and Table 2). Therefore, in addition to L1-Hs elements, older L1-P elements also are polymorphic in humans. In fact, these elements collectively accounted for 9.2% of the L1 insertion polymorphisms in our study (Table 2). Thus, the spectrum of L1 elements that cause human genetic variation, and perhaps human disease, is broader than previously established (Ovchinnikov et al. 2002). Moreover, since a high level of polymorphism is associated with active transposons, our results suggest that at least some of the L1-P elements in humans and chimps might also remain actively mobile today.
SVA insertion polymorphisms are abundant in humans:
The human SVA element is a transposon-like repetitive element that was first identified within the RP gene on human chromosome 6 (Shen et al. 1994). The authors of this original report proposed that SVA represented a composite retrotransposon that contains two previously identified elements (SINE-R and Alu) as well as a variable number tandem repeat (VNTR) region. Although the authors of this study had no evidence that their proposed element was actively mobile, they suggested that SVA is a retrotransposon because it ended in a poly(A) tail and was flanked by an apparent target site duplication (Shen et al. 1994). Strichman-Almashanu et al. (2001) later estimated that the haploid human genome contains ∼5000 copies of the SVA element.
We identified 28 nonredundant SVA insertion polymorphisms in our trace experiments (supplemental Table 1, Tables 2 and 3). These insertion polymorphisms have all of the hallmark features of retrotransposon insertions. In each case: (i) both empty and SVA-occupied sites were identified in different humans, (ii) the newly inserted SVA copy ended in a poly(A) tail, and (iii) each inserted copy was precisely flanked by a new target site duplication (Table 3). These copies ranged in length from 396 to 2806 bp, with the shorter elements lacking 5′ ends due to truncation, or lacking internal VNTR repeats (supplemental Table 1 and Table 3). The target site duplications of all insertions closely resembled (in both length and sequence) the target site duplications of Alu and L1 (Table 3).
To further confirm that SVA insertion polymorphisms were indeed abundant in human populations, we developed PCR assays to individually examine 28 additional genomic copies of SVA (materials and methods). These copies were identified arbitrarily from the ∼5000 copies in the human genome by searching the human genome database with SVA element query sequences. Eleven of the 28 copies tested (39%) were found to be polymorphic for insertion in at least one individual of the Coriell panel (Collins et al. 1999; Table 4). Thus, together with the 28 SVA polymorphic copies identified in the trace experiments (Table 3), we have identified a total of 39 independent SVA insertion polymorphisms in human populations. These insertion polymorphisms have all of the sequence features of bona fide SVA retrotransposition events (supplemental Table 1, Tables 3 and 4). These results indicate that not only Alu and L1, but also a third transposon, SVA, is highly polymorphic in human populations (supplemental Table 1, Tables 1–4). Collectively, our results indicate that Alu, L1, and SVA provide the bulk of genetic variation that is caused by transposon insertion polymorphisms in humans.
The recent completion of a draft sequence for the chimpanzee genome allowed us to determine whether the SVA element also is present in the chimp genome. We manually inspected the SVA copies listed in Table 3 and found that 27 of 28 (96.4%) of these copies were absent from the equivalent positions of the chimp genome (supplemental Table 1). The remaining copy appeared to be present, but had only partial sequence coverage in the chimp genome sequence. Therefore, most of these 28 polymorphic SVA copies appear to have been generated relatively recently in humans, at a point in time following the divergence of chimps and humans. However, since these 28 copies were selected on the basis of the fact that they were polymorphic in humans, it was possible that these copies were not representative of all SVAs in the human genome. Therefore, we arbitrarily selected 87 additional copies from the human genome to determine whether they were present in the chimp genome. Twenty-eight of these copies are listed in Table 4, and 59 additional copies were identified in the first 50 Mb interval of human chromosome 1, for a total of 87 copies (supplemental Table 1). Our analysis revealed that only 11 of these 87 SVA copies (12.6%) were precisely present at the equivalent positions of the chimp genome (supplemental Table 1). Seven additional copies had partial sequence coverage in the chimp genome and thus also appeared to be present at equivalent positions. Therefore, the available evidence indicates that ∼18 of the 87 SVA copies (20.7%) are likely to be present at equivalent positions of the human and chimp genomes. The remaining 69 SVA copies (79.3%) were completely absent from the chimp genome sequence. In a few cases, the chimp genome completely lacked sequence coverage in the area of the element, so it is unclear whether the SVA is truly absent in these cases (supplemental Table 1). 
However, in most of these 69 cases, the SVA element and 1 of the 2 copies of the target site duplication were precisely absent from the chimp genome (supplemental Table 1). Taken together, these data indicate that ∼20.7% of SVA insertions in the human genome were generated prior to the evolutionary divergence of chimps and humans, and that the remaining insertions (up to ∼79.3%) were generated after the divergence of these species. Thus, SVA is a relatively young transposon that has expanded in the human genome during the past several million years. At least some of the SVA copies appear to have been mobilized very recently, suggesting that SVA might also remain actively mobile in humans and chimps today.
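The proportions quoted for the 87 arbitrarily selected SVA copies follow directly from the counts given in the text. A minimal Python sketch of that arithmetic is shown here; the variable and function names are illustrative and are not taken from the original analysis.

```python
# Counts taken from the text: 87 arbitrarily selected human SVA copies
# compared against the draft chimp genome sequence.
total_copies = 87
precisely_present = 11  # precisely present at the equivalent chimp position
partial_coverage = 7    # partial chimp coverage, but appear to be present

likely_shared = precisely_present + partial_coverage
human_specific = total_copies - likely_shared


def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shared_pct = percent(likely_shared, total_copies)
human_specific_pct = percent(human_specific, total_copies)

print(f"likely shared with chimp: {likely_shared} ({shared_pct}%)")
print(f"absent from chimp: {human_specific} ({human_specific_pct}%)")
```

Note that the "absent" category is an upper bound, since it includes the few copies for which the chimp genome lacked sequence coverage.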
HERV-K and other examples of mobilized DNA:
In addition to the three most abundant groups of insertion polymorphisms in this study (Alu, L1, and SVA), we also identified several classes of less abundant insertion polymorphisms that were caused by mobilized DNA. For example, two human endogenous retrovirus K (HERV-K) insertion polymorphisms were identified in our trace experiments (Tables 2 and 5). One of these HERV-K copies was a 969-bp solo LTR element that was flanked by a perfect 5-bp target site duplication (Table 5). The other HERV-K element was a full-length 9462-bp copy that was flanked by a perfect 6-bp target site duplication (Table 5). This full-length copy had intact LTRs at its termini and four intact open reading frames capable of encoding homologs of the retroviral Gag, protease, Pol, and Env proteins (supplemental Table 1 and data not shown). We also identified four examples of mobilized small cellular RNAs, including two polymorphic copies of 5S rDNA and single examples of mobilized U2 and U5 RNA (Table 5). All four of these polymorphic insertions were flanked by L1-like target site duplications, strongly suggesting that these RNAs were mobilized by the L1 machinery. However, none of these mobilized elements contained a poly(A) tail, perhaps suggesting that the poly(A) sequences on template RNAs are not strictly required for the TPRT process (Boeke 1997; Roy-Engel et al. 2003). Finally, we identified a single example of a Mariner dependent-1 (Made1) insertion with an inverted repeat structure that was flanked by a perfect 5-bp target site duplication (Table 5).
Several lines of evidence suggested that our computational methods were highly accurate. For example, we detected 149 redundant transposon polymorphisms in the three data sets (supplemental Table 1). Therefore, our methods independently detected identical transposon insertion polymorphisms using totally different traces from different populations. Moreover, we also detected a number of Alu and L1 insertion polymorphisms that had been detected in previous studies with totally different methods. Nevertheless, as outlined below, we also conducted a systematic validation study to further evaluate the accuracy of our computational pipeline (Figure 3; Table 6; materials and methods).
Sixty-one transposon insertion polymorphisms were selected from the TSC data set to conduct this validation study (Table 6). We focused on the TSC data set because DNA was available for the entire panel of 24 diverse humans used in that study (Collins et al. 1999). Since all DNA traces in that experiment were derived from only these 24 individuals (Sachidanandam et al. 2001), any transposon polymorphism that was predicted from the TSC data set also should be found by PCR in at least 1 of these 24 individuals. If the polymorphism was not found in any of the 24, then we would know with certainty that our bioinformatics prediction was incorrect. PCR assays were developed for all 61 of the selected transposon insertion polymorphisms and 12–24 members of the Coriell panel were evaluated to determine whether the polymorphism could be verified. In all 61 cases, the allele predicted by the trace was confirmed in at least 1 of the 24 individuals of the panel (Figure 3 and Table 6). Therefore, our methods produced a 100% success rate for this arbitrarily selected sample of 61 TSC polymorphisms, indicating that our methods are highly accurate.
In five cases, we detected only the allele predicted by the trace in the panel of 24 individuals, and the allele predicted by the reference human genome sequence was not detected by PCR (Table 6). The most likely explanation for these results is that the person(s) represented by the reference human genome sequence had rare “private” transposon insertions at these positions that were absent from the majority of humans. We (and others) have observed similar results with SNPs identified from the TSC traces (Sachidanandam et al. 2001; Tsui et al. 2003), and Myers et al. (2002) have reported similar results with private alleles of transposon insertions. In cases where they have been examined, these private alleles have been verifiable in the DNA clones that were used to sequence the human genome (Myers et al. 2002).
Estimating the number of transposon insertion polymorphisms in humans:
Our study provided a unique opportunity to measure the levels of variation that are caused by transposon insertions in humans. Because our methods utilized DNA sequencing traces, it was possible to determine exactly how many bases of the human genome were sampled for a given trace experiment. In the case of the TSC experiment, 989,283,997 bp were sampled (equivalent to 30% of the haploid human genome). Similarly, 2,271,983,242 bp were sampled in the WGS experiment, equivalent to ∼69% of the human genome. Therefore, it was possible to normalize the data from these experiments to a genome size of 3.3 billion base pairs (100%) to estimate the total number of transposon insertion polymorphisms that were present in the average haploid genome in our study. By doubling these estimates, we determined that the average (diploid) human in our study harbored ∼1283 Alu insertion polymorphisms, 180 L1 polymorphisms, 56 SVA polymorphisms, and 17 other polymorphisms (Table 7). The TSC and WGS populations gave estimates that generally differed by less than twofold, with the WGS giving a higher estimate. Given that the WGS data were generated from eight African-Americans, these results are consistent with the observation of higher levels of genetic diversity in African populations (International HapMap Consortium 2003).
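The normalization described above can be sketched in Python as follows. The sampled base counts and the 3.3-billion-bp genome size come from the text; the function names, the assumption of simple proportional scaling, and the observed count of 30 insertions are hypothetical and serve only to illustrate the calculation.

```python
# Haploid genome size used for normalization in the text (base pairs).
GENOME_SIZE_BP = 3_300_000_000

# Bases sampled in each trace experiment, as reported in the text.
SAMPLED_BP = {"TSC": 989_283_997, "WGS": 2_271_983_242}


def sampled_fraction(sampled_bp, genome_size=GENOME_SIZE_BP):
    """Fraction of the haploid genome covered by the sampled bases."""
    return sampled_bp / genome_size


def diploid_estimate(observed_count, sampled_bp, genome_size=GENOME_SIZE_BP):
    """Scale an observed count to a whole haploid genome, then double it.

    Simple proportional scaling is an assumption of this sketch.
    """
    haploid = observed_count / sampled_fraction(sampled_bp, genome_size)
    return 2 * haploid


for name, bp in SAMPLED_BP.items():
    print(f"{name}: {100 * sampled_fraction(bp):.1f}% of the genome sampled")

# Hypothetical observed count of 30 insertions, for illustration only.
example = diploid_estimate(30, SAMPLED_BP["TSC"])
print(f"example diploid estimate: {example:.0f}")
```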
We also used our data to estimate the total number of common transposon insertion polymorphisms that are present in human populations. We detected 26% of the 660 polymorphic insertions of Alu Ya5 estimated to exist in human populations by Carroll et al. (2001). Similarly, we identified 35% of the 370 polymorphic insertions of Alu Yb8 estimated to exist in humans by Carroll et al. (2001). We also identified 25% of the 234 polymorphic L1-Ta insertions predicted by Myers et al. (2002). Thus, using the Carroll et al. and Myers et al. studies to calibrate our study, we conclude that we are detecting between 25 and 35% of a given class of transposon insertion polymorphisms. Therefore, given that we identified 605 nonredundant transposon polymorphisms, we estimate that human populations harbor a total of 1730–2420 common transposon insertion polymorphisms (for an average estimate of 2075).
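The range of 1730–2420 can be reproduced by dividing the 605 observed polymorphisms by the upper and lower detection fractions. The following Python sketch illustrates this; rounding to the nearest ten is an assumption made here to match the figures quoted in the text, and the names are illustrative.

```python
NONREDUNDANT_FOUND = 605        # polymorphisms identified in this study
DETECTION_RANGE = (0.25, 0.35)  # fraction of each class detected


def estimate_total(found, detection_fraction):
    """Estimate the population total from the number found and the fraction detected."""
    return found / detection_fraction


# The lower detection fraction gives the upper estimate, and vice versa.
upper = estimate_total(NONREDUNDANT_FOUND, DETECTION_RANGE[0])
lower = estimate_total(NONREDUNDANT_FOUND, DETECTION_RANGE[1])

# Assumption: the text appears to round each bound to the nearest ten.
lower_rounded = round(lower, -1)
upper_rounded = round(upper, -1)
average = (lower_rounded + upper_rounded) / 2

print(f"range: {lower_rounded:.0f}-{upper_rounded:.0f}, average: {average:.0f}")
```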
Polymorphism frequencies for Alu, L1, and SVA:
Since we recovered data on Alu, L1, and SVA insertion polymorphisms using a single method, we could calculate the relative polymorphism frequencies for these elements (Table 7). To perform these calculations, we compared the number of polymorphisms that were identified for each transposon to the genomic copy numbers for each element (Table 7). We found that the average polymorphism frequency for all copies of L1 was the lowest, at 0.00018 (one polymorphic L1 insertion per 5556 copies in the genome; Table 7). The average polymorphism frequency for all genomic Alu copies likewise was relatively low at 0.00058 (one polymorphic insertion per 1724 copies in the genome). The average polymorphism frequency for SVA, in contrast, was an order of magnitude higher, at 0.0057 (one polymorphic insertion per 175 copies). Therefore, according to this analysis, SVA is the most polymorphic element in humans and is likely to be one of the youngest elements to expand in the human genome. Nevertheless, when the L1-Ta, Alu Ya5, and Alu Yb8 subfamilies were examined separately from all L1 and Alu copies, these younger L1 and Alu subfamilies had even higher polymorphism frequencies than SVA (Table 7). The L1-Ta subfamily, for example, with ∼1040 copies in the diploid genome, has the highest polymorphism frequency at 0.161 (one insertion per 6.2 copies; Table 7). The Alu Ya5 and Alu Yb8 subfamilies likewise have much higher polymorphism frequencies than all genomic Alu elements (Table 7). Therefore, the most active subfamilies of L1 and Alu have the highest polymorphism frequencies, followed by SVA. Like these other elements, SVA might also harbor extremely polymorphic subfamilies that remain to be discovered but are as polymorphic for insertion as the L1-Ta, Alu Ya5, and Alu Yb8 subfamilies. Alternatively, SVA might have a uniformly lower rate of polymorphism but collectively produces relatively high levels of genetic variation through its higher copy number (SVA has almost 10 times the number of L1-Ta copies). Either of these models would account for the relatively high levels of genetic variation that are caused by SVA insertion polymorphisms (Tables 1 and 7).
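The "one polymorphic insertion per N copies" figures quoted above are the reciprocals of the polymorphism frequencies. A short Python sketch of the conversion, using only the frequencies stated in the text (the dictionary labels and function name are illustrative):

```python
# Polymorphism frequencies quoted in the text (from Table 7).
FREQUENCIES = {
    "L1 (all copies)": 0.00018,
    "Alu (all copies)": 0.00058,
    "SVA": 0.0057,
    "L1-Ta": 0.161,
}


def copies_per_polymorphism(frequency):
    """Number of genomic copies per polymorphic insertion (reciprocal of frequency)."""
    return 1.0 / frequency


for element, freq in FREQUENCIES.items():
    print(f"{element}: one polymorphic insertion per "
          f"{copies_per_polymorphism(freq):.1f} copies")

# Fold difference between SVA and all genomic Alu copies.
sva_vs_alu = FREQUENCIES["SVA"] / FREQUENCIES["Alu (all copies)"]
print(f"SVA vs. Alu: {sva_vs_alu:.1f}-fold")
```

The roughly tenfold ratio between the SVA and Alu frequencies corresponds to the "order of magnitude" difference described in the text.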
The spectrum of mobile DNA in humans:
In an effort to measure the overall levels of genetic variation that are caused by human transposons, we have developed a new method to broadly detect transposon insertion polymorphisms of all kinds in humans. Our strategy was highly efficient and led to the identification of 605 nonredundant transposon insertion polymorphisms in 36 diverse humans. Since the majority of these insertion polymorphisms had not been identified previously, our method was highly successful at discovering novel transposon polymorphisms. In fact, we estimate that our collection of 605 polymorphisms represents ∼25–35% of all common transposon insertion polymorphisms in human populations (see below). Our strategy, in principle, now could be used to identify all of the common transposon insertion polymorphisms that exist in human populations. Together with all previously identified Alu and L1 insertion polymorphisms, our 605 insertions provide significant progress toward this goal. Approximately 20 million additional human traces (beyond the 16.4 million used here) currently are available from SNP discovery projects, and more traces are being generated by ongoing SNP discovery projects daily. Another ∼17 million chimp traces are available that could be used to identify transposon insertion polymorphisms in the chimp genome relative to the human genome. Thus, our method is likely to be useful in humans as well as other organisms.
Unlike previous strategies, our polymorphism discovery strategy yielded data regarding the relative levels of genetic variation from all classes of transposons. The three most abundant classes of insertion polymorphisms in our study were Alu, L1, and SVA insertions. Although Alu and L1 insertion polymorphisms were expected to occur at high frequencies, no studies had been conducted previously to measure the polymorphism frequency of the SVA element or the remaining elements in the human genome. Therefore, our method has revealed that three transposons, Alu, L1, and SVA, are highly polymorphic in humans, and that these three elements together provide the bulk of genetic variation that is caused by transposon insertions in humans (Tables 1 and 7). Our data also indicate that few, if any, insertion polymorphisms exist for the remaining classes of elements in the human genome.
Most human transposon families are not highly polymorphic:
As mentioned above, an interesting finding of our study is that many transposon families in humans have not generated insertion polymorphisms to any great extent in recent history. Although Alu, L1, and SVA account for a little more than 30% of the human genome sequence, a total of 44% of the genome sequence is occupied by transposons and transposon-like repetitive elements. Therefore, ∼14% of the human genome is occupied by essentially extinct transposon-like families that contain mostly (or totally) inactive, fixed transposon alleles. Our study does not necessarily indicate that these elements are completely inactive, since we have sampled only 36 human genomes. Therefore, it is likely that we have identified only the most polymorphic classes of elements in the genome, and we may have missed polymorphic copies that occur at lower frequencies within smaller families. Such elements would be of great interest, and we do not rule out the existence of these elements, particularly since the heterochromatic regions of the human genome remain unsequenced. For example, our data indicate that elements such as HERV-K have been mobile recently enough to generate polymorphic insertions in human populations (Table 5). Although such elements do not generate a great deal of genetic variation, they would be of great interest if at least some of the polymorphic copies have retained the ability to function as autonomous retrotransposons. Additional studies will be required to determine whether the full-length HERV-K copy discovered in this study (ss15143090, Table 5) remains actively mobile today.
Alu and L1 insertion polymorphisms:
Our results regarding Alu element polymorphisms generally are in good agreement with a large number of previous studies examining the young Alu Y subfamilies of the human genome (reviewed in Batzer and Deininger 2002). Because we detected all Alu subfamilies with a single method, we also were able to measure the relative levels of variation that are caused by each subfamily (Tables 1, 2, and 7). Consistent with previous studies, we found that Alu Ya5 and Alu Yb8 insertion polymorphisms are highly abundant in humans. We also found that Alu Y, Alu Yc1, and Alu Ya4 insertion polymorphisms are moderately abundant in human populations (Table 2). The remaining Alu Y insertions were less abundant and were distributed among 15 different Alu Y subfamilies (Table 2). Our results further indicate that additional polymorphic Alu Y families of any significant size are not likely to exist in humans. Finally, we unexpectedly found a small number of ancient Alu S insertion polymorphisms in humans (Table 2).
Our results regarding L1 element polymorphisms likewise are in good agreement with previous studies examining L1-Hs insertion polymorphisms in humans (Sheen et al. 2000; Ovchinnikov et al. 2001; Myers et al. 2002; Badge et al. 2003; Brouha et al. 2003). However, we also found that older L1-P (primate) insertions represent a significant source of human genetic variation (Ovchinnikov et al. 2002). In fact, L1-P insertions represented close to 10% of all L1 insertion polymorphisms in our study (Table 2). Because we detected all L1 subfamilies with a single method, it was possible to assess the relative levels of variation that were caused by each subfamily and integrate these values with all other elements in the genome (Tables 1, 2, and 7).
SVA is highly polymorphic in humans:
Since the initial discovery of the SVA element, a number of retrotransposon-like insertions have been reported that might have been caused by SVA retrotransposition events (Hassoun et al. 1994; Kobayashi et al. 1998; Rohrer et al. 1999). One of the best candidates in this regard is a 3-kb retrotransposon insertion in the Fukutin gene that was reportedly responsible for 70% of Fukuyama-type muscular dystrophy cases in Japan (Kobayashi et al. 1998). The element described in that report has some of the features of a de novo SVA insertion; however, it was not referred to as an SVA element in that article, and the sequence of the insertion was not provided (Kobayashi et al. 1998). Two additional retrotransposon insertions also have been referred to as SVA elements by Ostertag and Kazazian (2001) and Ostertag et al. (2003). In one of these cases, the element was reported in the original study to be a SINE-R element rather than an SVA element (Rohrer et al. 1999). Since SINE-R is itself a retrotransposon and also a component of the SVA element, the insertion could be a SINE-R element or a truncated SVA. In the final case, no SVA sequences were actually present within the DNA insertion that was identified (Hassoun et al. 1994), and the inserted DNA segment was proposed to have been mobilized by a 3′-transduction event sponsored by an adjacent SVA element (Ostertag et al. 2003).
We now provide clear evidence for the existence of at least 39 de novo SVA element insertions in humans (Tables 3 and 4). For these 39 insertions: (i) both empty and SVA-occupied sites were identified in the human genome in different individuals, (ii) the SVA copies ended in poly(A) tails, and (iii) the inserted copies were precisely flanked by new target site duplications. Thus, our study indicates that SVA insertion polymorphisms are highly abundant in humans. In fact, SVA insertion polymorphisms provide about one-third the level of genetic variation that is caused by L1 insertions (Tables 1 and 7). Finally, our data also indicate that SVA has amplified independently and perhaps at different rates in the genomes of humans and chimps. Approximately 79% of the SVA insertions in the human genome are absent from the equivalent positions of the chimp genome, indicating that these insertions occurred relatively recently in human history (within the past ∼6 million years).
Since a high rate of polymorphism is a hallmark feature of an active transposon, our data suggest that SVA might be actively mobile in the human genome today. Because SVA does not encode any obvious proteins of its own (it lacks substantial open reading frames), it is likely to be a nonautonomous element that relies upon another transposon for its own transposition. Several aspects of our SVA insertions suggest that they might be mobilized by L1 elements in trans by the same mechanism that mobilizes Alu elements (Dewannieux et al. 2003). For example, the target site duplications of our SVA insertions closely resemble those of Alu and L1 elements in length and sequence (Tables 3 and 4). Our SVA insertions also have poly(A) tails, indicating that they are poly(A) retrotransposons (supplemental Table 1). Finally, many of our polymorphic SVA insertions had 5′-truncations that were similar to the 5′-truncations of L1 elements. Given these similarities to Alu and L1 elements, SVA is likely to be mobilized in trans by L1-encoded proteins (Esnault et al. 2000; Ostertag and Kazazian 2001; Wei et al. 2001; Dewannieux et al. 2003; Ostertag et al. 2003). Therefore, it appears that all three classes of highly polymorphic elements in our study (Alu, L1, and SVA) were generated by the L1 retrotransposition machinery in cis or in trans.
Estimating the levels of variation caused by transposon insertions in human populations:
Our method provided a unique opportunity to measure the levels of genetic variation that are caused by transposon polymorphisms in humans. The measure of variation that is most commonly reported for transposons is the percentage of copies that are polymorphic in at least one individual of a population (Sheen et al. 2000; Carroll et al. 2001; Ovchinnikov et al. 2001, 2002; Roy-Engel et al. 2001; Myers et al. 2002; Abdel-Halim et al. 2003; reviewed in Ostertag and Kazazian 2001 and Batzer and Deininger 2002). This approach is useful from the viewpoint of assessing whether a given transposon family is polymorphic in populations; however, it tends to overestimate the levels of genetic variation that are caused by transposons. This is because high-frequency and low-frequency alleles are counted equally with this type of measurement. In contrast, we measured the polymorphism rates in a manner that included the allelic frequencies of the transposon alleles. High-frequency alleles are encountered more often than rare alleles in our trace experiments, and therefore our method naturally takes into account the allelic frequencies of the transposon insertions. Thus, with this approach, we have been able to estimate the levels of genetic variation that are caused by transposon insertions in populations. As a consequence of factoring in the allelic frequencies, our estimates for polymorphism rates are four- to fivefold lower than those reported previously. For example, 25% of the Alu Ya5 copies previously were reported to be polymorphic in at least one individual of a population of 80 humans (Carroll et al. 2001). We now estimate that ∼6.8% of the Alu Ya5 copies are polymorphic in the average human of our study (Table 7). Likewise, 45% of the L1-Ta element copies previously were reported to be polymorphic in at least one member of a large population (Myers et al. 2002), whereas we now calculate that ∼13.6% of the L1-Ta copies are polymorphic in the average human of our study (Table 7).
We also generated an estimate of the total number of common transposon insertion polymorphisms that exist in human populations. We generated this estimate by comparing the number of insertion polymorphisms discovered in our study for a given element such as Alu Ya5 to the total number expected. For example, Carroll et al. (2001) previously had predicted that ∼25% of the 2640 genomic Alu Ya5 copies were polymorphic for insertion in at least one member of a population of 80 diverse humans. Therefore, since we identified 170 Alu Ya5 insertion polymorphisms, we determined that we had identified 26% of all expected Alu Ya5 insertion polymorphisms in humans. By calibrating our study with several of these previous studies, we estimate that our 605 transposon insertions represent between 25 and 35% of all common transposon insertion polymorphisms in human populations. Therefore, on the basis of these comparisons, human populations are estimated to harbor between 1730 and 2420 common transposon insertion polymorphisms (for an average of 2075). Even when our 605 polymorphisms are included, fewer than half of these polymorphisms have been identified to date, indicating that additional efforts will be required to identify the full set of polymorphic transposon insertions (i.e., the “transposon insertion polymorphome”) in humans.
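The Alu Ya5 calibration step described above can be written out as a short Python sketch. All numbers are taken from the text; the variable names are illustrative.

```python
# Figures from the text for the Alu Ya5 subfamily.
genomic_ya5_copies = 2640     # genomic Alu Ya5 copies
fraction_polymorphic = 0.25   # predicted polymorphic (Carroll et al. 2001)
ya5_found_here = 170          # Alu Ya5 polymorphisms identified in this study

# Expected number of polymorphic Alu Ya5 insertions in human populations.
expected_polymorphic = genomic_ya5_copies * fraction_polymorphic

# Fraction of the expected polymorphisms recovered in this study.
detection_fraction = ya5_found_here / expected_polymorphic

print(f"expected polymorphic Ya5 insertions: {expected_polymorphic:.0f}")
print(f"detected in this study: {100 * detection_fraction:.0f}%")
```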
Together with previous studies, our analysis indicates that SNPs, indels, and transposon insertion polymorphisms represent significant sources of genetic variation in humans. Human populations are estimated to harbor ∼10 million common SNPs (Judson et al. 2002), ∼2 million common indels (our unpublished data), and ∼2000 common transposon insertion polymorphisms (this study). Therefore, with 10 million bases of variation, SNPs account for the majority of common human genetic variation, followed by indels and then transposon insertion polymorphisms. On the other hand, if we assume that the average transposon polymorphism in humans is ∼500–1000 bp in length, then the total amount of variation caused by common transposon insertions is 1–2 million base pairs (equivalent to 10–20% of the base pair variation caused by SNPs). Thus, in terms of the number of base pairs, common transposon insertions cause significant levels of human genetic variation. Moreover, humans also are likely to harbor >10 million rare private transposon insertions (cases in which only one or a few individuals have the insertion). Therefore, transposon insertion polymorphisms cause significant levels of human variation. A number of studies now have shown that SNPs, indels, and transposon insertions all may cause serious phenotypic changes when positioned at critical sites within genes (Collins et al. 1987; Kazazian et al. 1988; Sachidanandam et al. 2001). Nevertheless, a comprehensive map of genetic variation that integrates SNPs, indels, and transposon insertions currently is lacking. A fully integrated map that includes all forms of genetic variation will be necessary to efficiently identify genetic polymorphisms that influence human phenotypes and diseases.
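The base-pair comparison between common transposon insertions and SNPs can be checked with the following Python sketch. The counts and the assumed 500–1000 bp average insertion length come from the text; the variable names are illustrative.

```python
# Estimates quoted in the text.
common_snps = 10_000_000            # one variable base per SNP
common_insertions = 2_000           # common transposon insertion polymorphisms
insertion_length_bp = (500, 1000)   # assumed average length range

# Total base pairs of variation contributed by common transposon insertions.
bp_low = common_insertions * insertion_length_bp[0]
bp_high = common_insertions * insertion_length_bp[1]

# Expressed relative to the base-pair variation caused by SNPs.
relative_low = 100 * bp_low / common_snps
relative_high = 100 * bp_high / common_snps

print(f"transposon variation: {bp_low:,}-{bp_high:,} bp")
print(f"relative to SNPs: {relative_low:.0f}-{relative_high:.0f}%")
```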
We thank the people at the SNP Consortium, the Sanger Centre, the Baylor Genome Center, and the Whitehead Genome Center for the use of their trace data, and University of California, Santa Cruz for its human genome database. We also thank Shari Corin for helpful advice on this project and for critical review of the manuscript. Finally, we thank Karen Ventii and Summer Goodson for help with some PCR experiments. This work was supported by training grant 2T32GM008490-11 from the National Institutes of Health (E.A.B.), grant 2-80302 from the Emory University Research Council (S.E.D.), grant RSG-01-173-01-MBC from the American Cancer Society (S.E.D.), and grant 1R01HG02898-01A1 from the National Institutes of Health (S.E.D.).
1 These authors contributed equally to this work.
Communicating editor: D. Voytas
- Received May 27, 2004.
- Accepted June 18, 2004.
- Genetics Society of America