The BDGP Gene Disruption Project
Hugo J. Bellen, Robert W. Levis, Guochun Liao, Yuchun He, Joseph W. Carlson, Garson Tsang, Martha Evans-Holm, P. Robin Hiesinger, Karen L. Schulze, Gerald M. Rubin, Roger A. Hoskins, Allan C. Spradling

Abstract

The Berkeley Drosophila Genome Project (BDGP) strives to disrupt each Drosophila gene by the insertion of a single transposable element. As part of this effort, transposons in >30,000 fly strains were localized and analyzed relative to predicted Drosophila gene structures. Approximately 6300 lines that maximize genomic coverage were selected to be sent to the Bloomington Stock Center for public distribution, bringing the size of the BDGP gene disruption collection to 7140 lines. It now includes individual lines predicted to disrupt 5362 of the 13,666 currently annotated Drosophila genes (39%). Other lines contain an insertion at least 2 kb from others in the collection and likely mutate additional incompletely annotated or uncharacterized genes and chromosomal regulatory elements. The remaining strains contain insertions likely to disrupt alternative gene promoters or to allow gene misexpression. The expanded BDGP gene disruption collection provides a public resource that will facilitate the application of Drosophila genetics to diverse biological problems. Finally, the project reveals new insight into how transposons interact with a eukaryotic genome and helps define optimal strategies for using insertional mutagenesis as a genomic tool.

MUTATIONS represent an essential tool for analyzing gene function. In recent years, organized efforts to generate genome-wide mutant collections have progressed substantially in model organisms such as Saccharomyces cerevisiae (Bidlingmaier and Snyder 2002; Giaever et al. 2002; reviewed in Vidan and Snyder 2001), Caenorhabditis elegans (Jansen et al. 1997), Arabidopsis thaliana (Alonso et al. 2003), Danio rerio (Golling et al. 2002; reviewed in Amsterdam 2003), Mus musculus (Mitchell et al. 2001; Mikkers et al. 2002; reviewed in Stanford et al. 2001), and many other organisms (Roos et al. 1997; Akerley et al. 2002; Firon et al. 2003; Uhl et al. 2003). Transposable elements are now widely used in such efforts (Gueiros-Filho and Beverley 1997; Fadool et al. 1998; Klinakis et al. 2000; Bessereau et al. 2001; Zagoraiou et al. 2001).

Insertional mutagenesis using engineered transposable elements has proved to be one of the most productive and versatile approaches to disrupting and manipulating Drosophila genes on a genome-wide scale (Cooley et al. 1988; Bellen et al. 1989; Bier et al. 1989; Grossniklaus et al. 1989; Berg and Spradling 1991; Gaul et al. 1992; Karpen and Spradling 1992; Chang et al. 1993; Török et al. 1993; Erdelyi et al. 1995; Rørth 1996; Deak et al. 1997; Salzberg et al. 1997; Rørth et al. 1998; Sekelsky et al. 1999; Mata et al. 2000; Bourbon et al. 2002; Häcker et al. 2003; Oh et al. 2003). Collections of insertion mutations have been created with independently scorable genetic markers such as eye color, body color, drug resistance, or dominant visible characters, allowing multiple insertions to be easily manipulated. Moreover, specialized transposons have been utilized that trap enhancers (O'Kane and Gehring 1987; Bier et al. 1989; Wilson et al. 1989), drive GAL4 production (Brand and Perrimon 1993; Manseau et al. 1997; Lukacsovich et al. 2001; Horn et al. 2003), misexpress adjacent genes (Rørth 1996; Lengyel and Merriam 2001; Toba et al. 1999; Mata et al. 2000; Aigaki et al. 2001; Brennecke et al. 2003), fuse endogenous proteins to green fluorescent protein (Morin et al. 2001), or combine several of these properties.

The P-transposable element has been the vehicle most widely used to disrupt Drosophila genes because it transposes at high rates, depends completely on exogenous transposase, inserts in heterochromatic as well as euchromatic regions (Zhang and Spradling 1994; Roseman et al. 1995; Wallrath and Elgin 1995; Yan et al. 2002; Konev et al. 2003), preferentially transposes near promoters (Spradling et al. 1995), excises imprecisely, generates local deletions from single elements or between element pairs (Preston et al. 1996; Cooley et al. 1990; Huet et al. 2002; reviewed in Gray 2000), transposes locally (Tower et al. 1993; Timakov et al. 2002), induces male recombination (Preston and Engels 1996), preferentially replaces existing elements (Heslip and Hodgetts 1994; Sepp and Auld 1999), and induces unequal recombination in tandem repeats (Thompson-Stewart et al. 1994). However, these advantages must be balanced against the inefficiency resulting from transposon hotspots (Spradling et al. 1999) and the possibility that not all genes are P-element targets. Recently, the TTAA-specific piggyBac element (Cary et al. 1989) has been shown to function as an alternative insertion vector with many attractive features (Horn et al. 2003), including compatibility with P-containing strains (Häcker et al. 2003).

Beginning in 1993, the Berkeley Drosophila Genome Project established a gene disruption library encompassing 1045 genes with mostly vital function (Spradling et al. 1995, 1999). These lines were selected from seven P-element insertional mutagenesis screens and, following insert localization by polytene chromosome in situ hybridization, were verified and associated with genes by complementation tests. While this collection proved extremely useful, its coverage was limited by the requirement for mutations with a scorable phenotype and by the amount of time required for extensive complementation testing.

The sequencing of the euchromatic portion of the Drosophila genome (Adams et al. 2000; Celniker et al. 2002) and the partial sequencing of the heterochromatic portion (Celniker et al. 2002), as well as the detailed annotation of these sequences (Hoskins et al. 2002; Misra et al. 2002) using expressed sequence tag (EST) and full-length cDNA sequences (Stapleton et al. 2002), provided an opportunity to greatly expand the collection's coverage. Transposon insertions in newly generated lines could now be precisely localized by sequencing genomic DNA flanking the insertions and computationally associated with known or predicted genes. Using this approach to rapidly select a subset of lines bearing insertions in genes that had not previously been disrupted was proposed as a way to further expand the Berkeley Drosophila Genome Project (BDGP) collection (Spradling et al. 1999).

There are significant challenges to applying a sequence-based strategy successfully on a large scale, however. The P-element target sequences are broad but nonrandom (Liao et al. 2000). Why certain genes act as hotspots, while others are rarely targeted, remains unknown. How design parameters such as the structure and location of the starting mutator transposon affect the spectrum of hotspots and the diversity of genes that are targeted remains poorly characterized. The piggyBac transposon has been suggested to be superior to the P element as an insertional mutagen with a broad specificity (Häcker et al. 2003), but variables affecting piggyBac screens are also little known. Nor can such information be easily determined. In a production-oriented project, the number of different transposition schemes that can be evaluated is limited. Preparing and testing new screen designs requires months of lead time and risks productivity should the screen prove to be inefficient in practice.

Another challenge in a sequence-based strategy is selecting which insertion lines to save. Because of the limited capacity of public Drosophila stock centers it is crucial to preserve lines whose insertions are most likely to disrupt independent genes. Without phenotypes and complementation tests to serve as guides, choosing lines that disrupt distinct genes depends on having a highly accurate annotation of the genome sequence. Drosophila genes undergo complex splicing patterns, reside close to their neighbors, and often overlap. Line selection based on inaccurate or incomplete annotation would substantially reduce the project's output by mistakenly causing genetically redundant strains to be retained and novel strains to be discarded.

Here we report expanding the BDGP gene disruption collection from 1045 to 7140 strains using a sequence-based strategy. Lines in the collection are predicted to bring at least 5362 of the 13,666 annotated genes under experimental control. In the process we have begun to answer some of the questions concerning the efficient design of insertional mutagenesis screens.

MATERIALS AND METHODS

The EP collection:

The original EP screen (Rørth 1996; Rørth et al. 1998) was carried out in collaboration with the BDGP. The 2266 lines generated in this project served as the test bed for developing high-throughput methods for sequencing transposon flanks (Liao et al. 2000). This screen utilized the original EP element (Rørth 1996) whose heat-shock promoter-derived misexpression cassette cannot be activated in the female germ line (Mata et al. 2000; see Table 1). Because of this limitation, lines from other sources were favored and only 374 EP lines remain in the primary collection (Tables 2 and 3).

TABLE 1. Mutator transposons

TABLE 2. Line summary

TABLE 3. Primary collection summary

The BG collection:

The BG (Baylor genetrap) screen used the “gene trap” P{GT1} element developed by Lukacsovich et al. (2001). P{GT1} is designed to express the white+ gene only when inserted within a gene and to fuse a GAL4-containing exon with this target gene (Table 1). The BG screen was carried out as shown below in one of six isogenized backgrounds. The w; Iso2A/Iso2A; Iso3A/Iso3A isogenized stocks (Iso A to Iso F) were obtained from Cahir O'Kane at the University of Cambridge (C. O'Kane, personal communication). They were tested in the following behavioral assays and most were judged similar to wild-type Canton-S flies: (1) benzaldehyde jump responses at different drug concentrations, (2) locomotor activity, (3) circadian rhythm, and (4) heat avoidance in an associative learning paradigm. Six pairs of isogenized male and female starting stocks (see first cross below) were constructed, thereby avoiding mixing of genetic backgrounds. The starting site of the mutator element (which we termed BG00000) was sequenced (GenBank accession CL004094), but was found to reside entirely within a blastopia transposon and hence its exact genomic position on CyO was not localized. The crossing scheme used was as follows: [crossing scheme diagram not reproduced]

A single w+; Cy+; Sb+ fly was selected per vial to avoid clusters. The jumping rate was 1 or more in 15% of the vials. These flies have the genotype [genotype not reproduced] and have an insert of P{GT1} that is w+. They were crossed to y1 w67c23; L2/CyO; D1/TM3, Sb1 and w+, Cy, and Sb progeny were kept. Upon determination of the insertion site, the appropriate chromosome was balanced by backcrossing to the y w; L2/CyO; D1/TM3, Sb1 flies. We generated 2869 BG strains, 482 of which were selected for the primary collection (Tables 2 and 3). Approximately 1500 of the original BG stocks are available from Trudi Mackay at North Carolina State University. Because of its uniform genetic background, the BG collection has proved useful in studies of quantitative traits including bristle number (Norga et al. 2003) and starvation resistance (Harbison et al. 2004).

The KG collection:

The KG (Karpen genome) screen made use of the P{SUPor-P} element [Roseman et al. 1995; FlyBase identification number (ID) FBtp0001587; Table 1], which was designed to facilitate insertion recovery by reducing position effects on the white+ gene due to the presence of two suppressor of Hairy-wing [su(Hw)] binding regions that can act as chromatin insulators. Another potential benefit was the possibility that this P element may enhance the rate of mutagenesis as reported previously (Roseman et al. 1995; Bellen 1999). Indeed, when inserted between an enhancer and its cognate promoter, a situation likely to be common due to the P element's strong promoter target preference (Spradling et al. 1995), the insulators may alter gene expression.

The laboratory of Gary Karpen generated 1236 of the 10,587 KG strains (strains KG00001–KG00560 and KG01121–KG01798) as a byproduct of their screen for heterochromatic P-element insertions (Yan et al. 2002; Konev et al. 2003). They used nine mating schemes with three different P{SUPor-P} starting sites and saved exceptional progeny in which there was variegated expression of the yellow+ transgene. They sent us progeny with new insertions in which the yellow transgene was expressed normally. We created lines by crossing them to y; ry506 flies of the opposite sex. The insertion-bearing chromosome of lines selected for the primary collection was balanced, using FM4/Df(1)260-1, y for X chromosome insertions; y1; SM6a/In(2LR)Gla, wgGla-1 for chromosome 2 insertions; y1; TM3, Sb1/D1 for chromosome 3 insertions; and y1; ciD/eyD for chromosome 4 insertions.

We generated the other KG strains using stocks provided by Gary Karpen's lab. These stocks employed an isogenized y; ry506 background that had been found in a previous large screen to be uniform and free of hobo elements and other sources of “background” mutations (Karpen and Spradling 1992; Spradling et al. 1999). We mapped the starting site of the P{SUPor-P} mutator element (which we term KG00000) to chromosome 2L position 2748009 (Celniker et al. 2002; equivalent to scaffold segment AE003582.3, position 220758; GenBank accession CL004095).

The crossing scheme was [crossing scheme diagram not reproduced] and y+, Cy+, ry flies were selected. The jumping rate was 1 or more in 35% of vials. These flies have the genotype [genotype not reproduced] and carry a P{SUPor-P} element on one of the chromosomes. They were backcrossed to y/y; +/+; ry506/ry506. After having been chosen for the primary collection some KG strains were selected for homozygosity. All X chromosome insertions were kept in this genetic background and balanced with FM7. Many but not all second, third, and fourth chromosome insertions were rebalanced with y1 w67c23; L2/CyO or y1 w67c23; D1/TM3, Sb1 or y1; ciD/eyD, respectively.

The EY (EP yellow) collection:

In an effort to broaden its target specificity, we modified the second-generation EP element of Mata et al. (2000) that supports germ cell expression. An intronless yellow+ gene marker was inserted adjacent to the original mini-white gene in this P{EPg} element (see below). The resulting element, P{EPgy2}, was termed the EY element. Misexpression is still driven from an outwardly directed promoter at the 3′ end (Table 1; rightward-pointing arrow). We localized the starting site for the EY screen (EY00000) at nucleotide position 21451923 (Celniker et al. 2002) on the minus strand of the 2L chromosome arm (equivalent to nucleotide 57866 of scaffold segment AE003781.4; GenBank accession CL004093).

The following crossing scheme was used to generate the EY lines: [crossing scheme diagram not reproduced]

We selected y+ w+, Cy+, Sb+ flies and crossed them to y1 w67c23/y1 w67c23; +/+; +/+. The jumping rate was 1 or more in 65% of vials. Upon sequencing the genomic DNA adjacent to the P element the insertion chromosomes were balanced with FM7 or y1 w67c23; L2/CyO or y1 w67c23; D1/TM3, Sb1 or y1; ciD/eyD.

Donated collections:

Several collections of strains containing a variety of transposon mutators were donated to the Gene Disruption Project (Table 1). With the exception of the PA and PC collections, the insertion site flanking sequences of all donated strains described in this article were amplified, sequenced, and mapped using the same procedures, described below, that were used for lines generated within the project. In most cases, we extracted genomic DNA from samples of frozen flies collected from unbalanced stocks that were provided by the lab donating the strains.

The PA and PC strains were donated by Brian Ring and Daniel Garza. Each strain contained a single autosomal insertion of a piggyBac element. The mutator transposon in the PA strains was PBac{5HPw+} (FlyBase ID FBtp0016567), marked with mini-white+, while the PC strains carried PBac{3HPy+} (FlyBase ID FBtp0016566), marked with yellow+. DNA segments flanking the insertion sites were amplified and sequenced by Exelixis Corporation. The Gene Disruption Project received data on the insertion sites of 1055 PA and PC lines with insertions that could be mapped to unique euchromatic sites. We initially selected 471 of these lines as candidates for the primary collection, but some of these lines were lost before balanced stocks could be established. Brian Ring and Kathy Matthews constructed balanced stocks of the 342 surviving lines. Kathy Matthews prepared samples of frozen flies from the balanced stocks, which we used to recheck the insertion site flanking sequences (see below).

The KV strains were generated in the laboratory of Gary Karpen using the P-element mutator P{SUPor-P}. They employed a variety of starting sites and genetic crossing schemes, as described by Yan et al. (2002) and Konev et al. (2003), to maximize the recovery of heterochromatic insertions. Many sequences flanking KV insertion sites mapped to WGS3 heterochromatic scaffolds whose chromosomal origin is currently unknown. Gary Karpen provided unpublished fluorescence in situ hybridization mapping data for some of these lines; we mapped others to a chromosome by genetic segregation of the transgene markers. We balanced the insertion-bearing chromosomes using FM4/Df(1)260-1, y for X chromosome insertions; y1 w67c23; SM6a/In(2LR)Gla, wgGla-1 for chromosome 2 insertions; y1 w67c23; TM3, Sb1/D1 for chromosome 3 insertions; and y1; ciD/eyD for chromosome 4 insertions.

The DG strains were made in the laboratory of William Gelbart using the P{wHy} element (Huet et al. 2002; FlyBase ID FBtp0016141) as a mutator. This is a compound element with P-element ends flanking a nonautonomous hobo element. Frozen fly samples of 1384 lines were provided.

The PL strains have insertions of the piggyBac PBac{GAL4D, EYFP} mutator element (Horn et al. 2003; FlyBase ID FBtp0017476) that is marked with EYFP and can act as an enhancer trap to express the GAL4Δ variant. Häcker et al. (2003) have described a screen in which 798 lines that had an insertion of this mutator on chromosome 3 were created. The third chromosome used as a target had a P{FRT} insertion at the base of both chromosome arms that can be used to generate germ-line clones. Udo Häcker provided samples of 634 lines with homozygous-viable insertions from this collection. Because the samples used for this determination came from stocks in which the insertion-bearing third chromosome had already been made homozygous, the lines that we selected for the primary collection were sent to the Bloomington Stock Center without rechecking the insertion flanks.

The LA strains were made in the laboratories of John Merriam, Judith Lengyel, and Stephen Poole using the P-element mutator P{Mae-UAS.6.11} (J. Merriam and S. Poole, unpublished data; FlyBase ID FBtp0001327). This vector is similar to P{EP} in having a GAL4-inducible promoter for misexpression of flanking genes, but differs in that it is marked with yellow+ rather than with mini-white. The mutator was mobilized from an X chromosome site in males and transpositions to the autosomes were recovered as exceptional y+ males (Lengyel and Merriam 2001). We determined this X chromosome starting site to be 11734628 on the plus strand (Celniker et al. 2002; equivalent to scaffold AE003487.2, position 295085). Insertions were subsequently screened for phenotypes when combined with a GAL4 driver, usually the P{w+mC = Act5C-GAL4}25FO1 driver expressing GAL4 under control of the Actin 5C promoter (Akieda and Merriam 2001). Samples from 1045 strains displaying a phenotype were sent for sequencing.

Construction of P{EPgy2}:

The P{EPgy2} element used in the EY screen was a derivative of P{EPg} (Mata et al. 2000; FlyBase ID FBtp0012862; Table 1). The major differences were that P{EPgy2} contained an intronless yellow+ gene module and lacked the plasmid rescue module of P{EPg}. The plasmid pP{EPgy2} was constructed from two plasmid precursors, p1462 and yellow-BSX. p1462 was an intermediate used in the construction of pP{EPg} that lacks the plasmid rescue module. It was obtained from Pernille Rørth (EMBL, Heidelberg, Germany). The yellow-BSX plasmid was used as the source of the intronless yellow+ gene for pP{EPgy2} and was obtained from Tim Parnell in the laboratory of Pamela Geyer (University of Iowa). It had a SalI fragment containing the intronless yellow+ gene cassette (Patton et al. 1992) inserted into the SalI site of a modified pBluescript vector, pBS-X. This vector had the KpnI site of the polylinker converted into an XbaI site. The yellow+ SalI fragment was the same as the segment designated by FlyBase as y+mDint25.2(S,S) (FlyBase ID FBms0003824). DNA of the yellow-BSX plasmid was digested with a combination of NotI and PspOMI, which generate the same 5′ overhang. NotI cut yellow-BSX at a single site in the polylinker sequences closest to the 3′ end of the yellow+ gene and PspOMI cut yellow-BSX at a single site in the polylinker sequences closest to the 5′ end of the yellow+ gene. The 5.8 kb NotI-PspOMI fragment containing the intronless yellow+ gene was gel purified and ligated with DNA from p1462 that had been cut by NotI and dephosphorylated with shrimp alkaline phosphatase. p1462 had a unique NotI site located between mini-white+ and the GAGA/GAL4-UAS modules. Transformants of the Escherichia coli strain DH5α were recovered in which the intronless yellow+ gene fragment had inserted into the NotI site of p1462 in each of the two relative orientations and these were named pP{EPgy1} and pP{EPgy2}. 
The mini-white+ and intronless yellow+ genes of pP{EPgy2} are transcribed in the same direction, which is opposite to that of the P-element promoter (Table 1). Portions of both plasmids were sequenced, including the junctions between the two fragments. A compiled sequence for P{EPgy2} (P-element portion only) is available in FlyBase (FlyBase ID FBrf0157089).

Initial transgenic Drosophila lines containing P{EPgy2} were made by Alexei Tulin, using the transformation method described by Tulin et al. (2002). Lines with an insertion of P{EPgy2} on the CyO second chromosome balancer were generated by mobilizing the element from the X chromosome of one of the initial transgenic lines using the TMS, P{ry+t7.2, Delta2-3}99B, Sb1 chromosome as a source of transposase.

Determination of flanking sequences:

Genomic sequences flanking P element or piggyBac insertions were determined by sequencing inverse PCR products (Liao et al. 2000). A detailed protocol is available on the P-Screen webpage at http://flypush.imgen.bcm.tmc.edu/pscreen/.

Genomic DNA was prepared from ∼15 insertion-bearing adults. Flies were collected and frozen at −80° in microfuge tubes. Samples were thawed on ice, and three autoclaved stainless steel ball bearings (BALL-1B; Wheels Manufacturing, Broomfield, CO) and 400 μl of Buffer A (100 mM Tris-HCl, pH 7.5, 100 mM EDTA, 100 mM NaCl, 0.5% SDS) were added. Samples were disrupted by vigorous vortexing and incubated at 65° for 30 min. Debris was precipitated by addition of 800 μl of a 4.3 M LiCl/1.4 M KOAc solution, incubation on ice for 10 min, and centrifugation at room temperature in a microcentrifuge at 14,000 rpm for 15 min. The supernatant was collected, and DNA was precipitated by addition of 800 μl of isopropanol and centrifugation at 14,000 rpm for 10 min. The precipitate was washed with 70% ethanol, air dried, and resuspended in 75 μl of TE (10 mM Tris pH 7.5, 1 mM EDTA). Subsequent steps were performed in 96-well format.

Genomic DNA samples (10 μl) were digested with an appropriate restriction enzyme (5–20 units) and RNase A (8 μg/ml) in a 25-μl reaction at 37° for 2.5 hr. The restriction enzyme was inactivated at 65° for 20 min. Digested samples (10 μl) were self-ligated with 2 units of T4 DNA ligase at 4° for 12 hr in a dilute reaction (400 μl) to favor the generation of circular products. Ligated samples were precipitated with the addition of 40 μl 3 M NaOAc and 1 ml ethanol, and precipitates were washed in 70% ethanol and resuspended in 150 μl TE, pH 7.5. Ligation products (10 μl) were used as templates in inverse PCR reactions (50 μl) with 100 mM dNTPs, 0.2 mM oligonucleotide primers, and 2 units of Taq DNA polymerase (Amersham, Arlington Heights, IL). Reactions were denatured at 95° for 5 min and subjected to 35 cycles of denaturation at 95° for 30 sec, annealing at the appropriate temperature for 1 min, and extension at 68° for 2 min, followed by a final extension at 72° for 10 min.

Flanking sequences were determined by direct sequencing of the inverse PCR products. To remove excess PCR primers and dNTPs, exonuclease I (5 units) and shrimp alkaline phosphatase (2 units) were added directly to an aliquot of PCR reaction (10 μl), the mixture was incubated at 37° for 30 min, and the enzymes were inactivated by incubation at 70° for 15 min. Sequencing reactions were performed using BigDye terminator chemistry (Perkin-Elmer, Norwalk, CT) at one-quarter of the manufacturer's recommended scale, and sequence data were collected using an ABI 3700 capillary device. With the exception of the LA screen, amplification and sequencing were attempted on both the 5′ and 3′ flanks of each insertion.

For the BG collection (P{GT1} insertions), genomic DNA was digested with HinP1I; 3′ flanks were amplified with the oligonucleotide primers Pry1 (CCTTAGCATGTCCGTGGGGTTTGAAT) and Pry4 (CAATCATATCGCTGTCTCACTCA) at an annealing temperature of 55° and sequenced with Spep1 (GACACTCAGAATACTATTC); 5′ flanks were amplified with pGT1.5a (CCGCACGTAAGGGTTAATG) and pGT1.5d (GAAGTTAAGCGTCTCCAGG) at an annealing temperature of 55° and sequenced with Sp1 (ACACAACCTTTCCTCTCAACAA).

For the KG and KV collections (P{SUPor-P} insertions), genomic DNA was digested with HpaII; 3′ flanks were amplified with Pry4 (CAATCATATCGCTGTCTCACTCA) and 3.rev.hpa2 (TTGCCACTTGCTCATACGTC) at an annealing temperature of 55° and sequenced with 3.SUP.seq1 (TATCGCTGTCTCACTCAG); 5′ flanks were amplified with Plac1 (CACCCAAGGCTCTGCTCCCACATT) and Pwht1 (GTAACGCTAATCACTCCGAACAGGTCACA) at an annealing temperature of 60° and sequenced with 5.SUP.seq1 (TCCAGTCACAGCTTTGCAGC).

For the EY collection (P{EPgy2} insertions), genomic DNA was digested with HpaII; 3′ flanks were amplified with Pry1 and Pry4 as described above and sequenced with 3.SUP.seq1; and 5′ flanks were amplified with Plac1 and Pwht1 as described above and sequenced with 5.SUP.seq1.

For the LA collection (P{Mae-UAS.6.11} insertions), genomic DNA was digested with RsaI; 5′ flanks were amplified with LA(f).1 (GGGAATTGGGAATTCGTTAA) and LA(r).1 (TAGCGACGTGTTCACTTTGC) at an annealing temperature of 55° and sequenced with LA(f)seq1 (CTCTCAACAAGCAAACGTGC).

For the PL collection (PBac{GAL4D, EYFP} insertions), genomic DNA was digested with HaeIII; 3′ flanks were amplified with PRF (CCTCGATATACAGACCGATAAAACACATGC) and PRR (AGTCAGTCAGAAACAACTTTGGCACATATC) at an annealing temperature of 65° and sequenced with PRF; 5′ flanks were amplified with PLF (CTTGACCTTGCCACAGAGGACTATTAGAGG) and PLR (CAGTGACACTTACCGCATTGACAAGCACGC) at an annealing temperature of 65° and sequenced with PLR.

The initial determination of the flanking sequences of the PA and PC strains was done by the Exelixis Corporation in collaboration with Brian Ring and Daniel Garza, prior to the donation of these strains to our project. We rechecked the flanking sequences of balanced or homozygous stocks of strains selected for the primary collection. Genomic DNA was digested with HinP1I (3′ flank) or Sau3A (5′ flank); 3′ flanks were amplified with 3F1 (CCTCGATATACAGACCGATAAAAC) and 3R1 (TGCATTTGCCTTTCGCCTTAT) at an annealing temperature of 55° and sequenced with pB-3SEQ (CGATAAAACACATGCGTCAATT); 5′ flanks were amplified with 5F1 (GACGCATGATTATCTTTTACGTGAC) and 5R1 (TGACACTTACCGCATTGACA) at an annealing temperature of 55° and sequenced with pB-5SEQ (CGCGCTATTTAGAAAGAGAGAG).

Analysis and alignment of flanking sequences:

Sequence traces were processed using phred (Ewing and Green 1998; Ewing et al. 1998) to generate base calls with associated quality scores (error probabilities). Proximal vector-genome junction sequences were identified by text searches for several short sequences from the P-element or piggyBac ends, allowing as many as three nucleotide mismatches per short sequence match, and identified vector sequences were removed. This approach was taken because sequence quality near the beginnings of the traces was variable, so that exact matches to the vector end sequence were not identified in all cases. It achieved almost the same recognition rate as that of human curators. Distal genome-vector junction sequences were identified by text searching for the appropriate restriction site, and sequences beyond identified restriction sites were removed. Using this approach, the restriction site could be missed due to low sequence quality. To avoid extending flanking sequences into the vector sequence in such cases, each sequence was compared to the P-element or piggyBac sequence using BLASTN (Altschul et al. 1997), and likely vector sequences were removed.
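The mismatch-tolerant junction search described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, only substitutions (no insertions or deletions) are tolerated, and the probe sequence used in any call is a placeholder supplied by the caller rather than a confirmed vector end.

```python
def find_with_mismatches(read: str, probe: str, max_mismatches: int = 3) -> int:
    """Return the leftmost index at which `probe` matches `read` with at most
    `max_mismatches` substitutions, or -1 when no such position exists."""
    read, probe = read.upper(), probe.upper()
    for start in range(len(read) - len(probe) + 1):
        mismatches = 0
        for a, b in zip(read[start:start + len(probe)], probe):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this offset
        else:
            return start  # inner loop finished without exceeding the limit
    return -1


def trim_vector(read: str, vector_end: str, max_mismatches: int = 3) -> str:
    """Remove everything up to and including the matched vector end.
    The read is returned unchanged when the vector end is not found."""
    start = find_with_mismatches(read, vector_end, max_mismatches)
    if start < 0:
        return read
    return read[start + len(vector_end):]
```

Tolerating a few substitutions reflects the stated reason for the approach: base calls near the start of a trace are often unreliable, so an exact-match search would miss some junctions.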

Each vector-trimmed sequence was further trimmed to remove low-quality sequences that might result in spurious alignments to the genomic sequence. A region of low sequence quality was defined as either 5 or more consecutive nucleotides at least 30 nucleotides beyond the insertion site and each with a quality score of less than q20 (error probability >1%) or 10 or more consecutive nucleotides at least 10 nucleotides beyond the insertion site and each with a quality score of less than q15 (error probability >3.2%), whichever criterion resulted in the shorter quality-trimmed sequence. Using these criteria, at least 25 bases of high-quality flanking sequence were obtained in most cases. Excluding EP, PA, and PC lines, one or both flanking sequences at least 25 bases in length were obtained for 24,157 insertions in 27,642 lines (87%).
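One possible reading of the two quality-trimming criteria is sketched below. The assumptions are ours: quality scores are given as a list whose index 0 is the first base beyond the insertion site, a low-quality run is counted only from bases at or beyond the stated offset, and the sequence is cut where the first qualifying run begins. All function names are hypothetical.

```python
def phred_error(q: float) -> float:
    """Error probability implied by a phred quality score."""
    return 10 ** (-q / 10)


def first_low_quality_run(quals, threshold, min_run, min_offset):
    """Index at which the first run of at least `min_run` consecutive scores
    below `threshold` begins, considering only positions >= `min_offset`.
    Returns len(quals) when no such run exists."""
    run_start = None
    for i in range(min_offset, len(quals)):
        if quals[i] < threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None
    return len(quals)


def quality_trim_length(quals):
    """Apply both criteria and keep the shorter of the two trimmed lengths."""
    cut_q20 = first_low_quality_run(quals, threshold=20, min_run=5, min_offset=30)
    cut_q15 = first_low_quality_run(quals, threshold=15, min_run=10, min_offset=10)
    return min(cut_q20, cut_q15)


def is_usable(quals, min_length=25):
    """A flank is kept when at least `min_length` bases survive trimming."""
    return quality_trim_length(quals) >= min_length
```

The q20 and q15 thresholds correspond to error probabilities of 1% and about 3.2%, which `phred_error` reproduces.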

Flanking sequences at least 25 bases in length were aligned to the Release 2 or Release 3 genomic sequence using BLASTN. The 5′ and 3′ flanking sequences of each insertion were aligned independently. Sequence matches with >90% identity over >90% of the flanking sequence were saved as alignments. BLASTN results for flanking sequences that did not yield alignments by these criteria were examined by human curators and curated alignments were used in some cases. If a sequence aligned to multiple locations, indicating a repetitive sequence, or to no location, usually due to a short sequence, then the results were examined by a human curator and assigned an insertion coordinate if possible. If both the 5′ and 3′ flanking sequences of an insertion were available but aligned to different genomic sites separated by >10 bp and if neither flanking sequence showed evidence of cross-contamination from samples in nearby wells, then the two alignments were assumed to correspond to separate insertions in the same fly stock.
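The acceptance thresholds and the rule for reconciling 5′ and 3′ flanks can be expressed compactly. The sketch below assumes a simplified alignment record of our own design (the `Alignment` class and its fields are hypothetical, not BLASTN output fields), and it cannot check for cross-contamination between wells, which the text treats as a separate precondition.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    arm: str             # chromosome arm or scaffold name
    position: int        # inferred insertion coordinate
    identity: float      # fraction of identical bases in the aligned region
    aligned_length: int  # number of flank bases included in the alignment
    flank_length: int    # total length of the trimmed flanking sequence


def accept(aln: Alignment, min_identity=0.90, min_coverage=0.90) -> bool:
    """Keep matches with >90% identity over >90% of the flanking sequence."""
    coverage = aln.aligned_length / aln.flank_length
    return aln.identity > min_identity and coverage > min_coverage


def classify_pair(five_prime: Alignment, three_prime: Alignment,
                  max_separation=10) -> str:
    """Flanks on the same arm within `max_separation` bp are treated as one
    insertion; otherwise they may represent separate insertions."""
    same_arm = five_prime.arm == three_prime.arm
    gap = abs(five_prime.position - three_prime.position)
    if same_arm and gap <= max_separation:
        return "single insertion"
    return "possible separate insertions"
```

Alignments that fail `accept` would, following the text, be passed to a human curator rather than discarded outright.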

The orientation of each mapped insertion relative to the genomic sequence was defined relative to each vector as shown in Table 1. The position of a mapped insertion in the genomic sequence was defined as the first base at the 5′ end of the 8-bp target site duplication of P-element insertions or the 4-bp target site duplication (always TTAA) of piggyBac insertions. In cases in which the vector-insert junction was not recovered in the flanking sequence, usually due to low sequence quality, the insertion site was defined as the first base of the alignment to the genome sequence. In some cases, a flanking sequence aligned to the genomic sequence along only a portion of its length, indicating a sequence dimorphism between the strain used in the genetic screen and the strain used to produce the reference genome sequence (y; cn bw sp; Adams et al. 2000). In most such cases, the dimorphic sequence corresponded to a known transposable element (Kaminker et al. 2002). When an insertion mapped within a dimorphic sequence, the genomic insertion site was defined as the position of the most 5′ base in the flanking sequence that aligned to the reference genomic sequence.

Excluding EP, PA, and PC lines, a total of 21,928 insertions (91% of those from which flanking sequences were recovered) were mapped to unique sites in the genome during this phase of the BDGP Gene Disruption Project. Including previously described results (Rørth et al. 1998; Spradling et al. 1999), new lines, and recheck sequencing, >50,000 insertion ends have been successfully mapped to the Release 3 genomic sequence in this ongoing project.

Brian Ring and Daniel Garza provided sequence data produced at Exelixis Corporation on the insertion sites of 1055 PA and PC lines that they donated to the project. The insertion site data were in the form of 1-kb segments of Release 2 genomic sequence centered near the insertion site. The target site for piggyBac transposons is TTAA and we were told that the insertion site in each mutant strain corresponds to the TTAA closest to the center of the 1-kb segment. We were able to align the 1-kb genomic segments of R2 genomic sequence within unique segments of the R3 genomic sequence for 1046 of these strains. Upon rechecking the flanking sequences for 242 PA and PC lines selected for the primary collection, we confirmed many of these sites, while others differed from the originally reported site by an average of <100 bp. When a difference was found the sequence determined by the project was taken as correct.
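The rule used to infer the insertion site from each donated 1-kb segment (the TTAA closest to the segment center) can be illustrated as follows. This is a minimal sketch, not the project's actual procedure: the function name is hypothetical, distance is measured from the middle of the 4-bp site, and ties are resolved in favor of the leftmost site, which is an arbitrary choice.

```python
def ttaa_nearest_center(segment: str):
    """Return the 0-based start of the TTAA closest to the segment center.

    Returns None when the segment contains no TTAA.
    """
    seq = segment.upper()
    center = len(seq) / 2
    best = None  # (distance to center, start index)
    start = seq.find("TTAA")
    while start != -1:
        # Measure from the middle of the 4-bp target site.
        dist = abs((start + 2) - center)
        if best is None or dist < best[0]:
            best = (dist, start)
        start = seq.find("TTAA", start + 1)
    return None if best is None else best[1]
```

A returned index would still need to be converted to a genomic coordinate using the position of the segment in the Release 3 sequence.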

Line selection:

During the initial phases of the project lines were selected if their insertion was located within or <2 kb upstream of an annotated transcription unit not previously mutated in the BDGP collection. Lines were also retained if the insertion was between genes or within an intron and >2 kb from any insertion already in the collection. The Release 2 sequence annotations displayed on the GeneSeen browser (N. Harris and S. E. Lewis, unpublished data) were used for these determinations. After completion of the Release 3 genome sequence, all remaining new lines and all previously selected lines were reanalyzed as follows. First, a Perl script was used to record for each insertion those transcripts in which it was located (defined as from 500 bp upstream of the annotated transcript to the 3′ end). A FileMaker Pro script was then used to search each annotated euchromatic gene against the transcript list and record all lines meeting this criterion.
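The transcript-association step (an insertion counts as a hit if it lies between 500 bp upstream of the annotated transcript start and the transcript 3′ end) can be sketched as below. The project used a Perl script and a FileMaker Pro script for this purpose; the Python version here is only an illustration, and the `Transcript` record, the function names, and the gene names in the usage note are hypothetical.

```python
from typing import NamedTuple

UPSTREAM = 500  # bp upstream of the transcript start scored as a hit


class Transcript(NamedTuple):
    gene: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def hit_window(t: Transcript):
    """Genomic interval in which an insertion is associated with t."""
    if t.strand == "+":
        # The 5' end is on the left, so extend the window leftward.
        return t.start - UPSTREAM, t.end
    # Minus strand: the 5' end is on the right, so extend rightward.
    return t.start, t.end + UPSTREAM


def genes_hit(pos: int, transcripts):
    """Return the set of genes whose hit window contains the insertion."""
    genes = set()
    for t in transcripts:
        lo, hi = hit_window(t)
        if lo <= pos <= hi:
            genes.add(t.gene)
    return genes
```

For a divergently transcribed pair such as `Transcript("geneA", 1000, 2000, "+")` and `Transcript("geneB", 400, 900, "-")`, an insertion at position 950 falls in both windows and is associated with both genes.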

Using this information as a starting point, final decisions for retention were based on a manual examination by a curator of the insertion position relative to nearby gene models, cDNAs, and other data, using the Apollo genome browser (Lewis et al. 2002). To display insertions in Apollo, XML files describing the Release 3.1 sequence annotation were modified by addition of new data “tiers” including the insertion sites and associated descriptors. To be selected, a strain had to be judged likely to mutate or misexpress a novel gene not currently in the BDGP collection (see results). In addition, lines whose elements were inserted >2 kb from the nearest neighboring P element in the collection were generally also retained. These criteria were designed to minimize unnecessary stock maintenance without severely compromising the long-term utility of the collection. The functionalities of the different transposons vary substantially (Table 1) and there is no general consensus as to which characteristics (e.g., enhancer trapping, gene misexpression, or deletion generation) deserve the highest priority. When multiple lines that disrupted a gene existed, the decision on which line to keep was based on a variety of factors, including its verification status, associated mutant phenotype, and element type. Because gene models are less certain in the current annotation of heterochromatin, only manual annotation was used in these regions. Overall, manual curation increased the total number of genes with associated insertions from 5045 (automated curation) to 5362. For the studies of insertion site distribution that are presented here, automated curation was used exclusively to ensure that uniform criteria were applied to all data.

Verification:

Only lines selected for the primary collection were balanced or made homozygous. The flanking sequences of stocks destined for the primary collection were determined and analyzed again after balancing. In most cases (>90%), the initial and recheck coordinates were consistent. When no readable sequence or a different location was obtained, the line was either recycled for another round of sequencing or discarded. When the initial sequence indicated the presence of a second insertion on the same chromosome, we looked for the presence of both sites on the recheck. Rarely, previously undetected second insertions were discovered in the recheck phase. Lines donated to the project as balanced or homozygous stocks were usually not rechecked. More than 96% of selected BG and KG lines were verified. Recheck verification of the other screens has not yet been completed at the time of this article (see Table 2).

P-Screen webpage:

All strains generated by the project (BG, KG, and EY), as well as selected other lines, were made publicly available via an online database (http://flypush.imgen.bcm.tmc.edu/pscreen/) at the time they were selected for the primary collection. This site contains the following information: P-element constructs used, strain name (BG, KG, or EY number), genomic insertion site in Release 3 coordinates, inferred cytological location, associated gene (hit or nearby), and availability status after incorporation into the Bloomington Stock Center collection. Lines selected from the donated collections (see below) were not listed until they could be distributed by the Stock Center because they were not available to the project for early distribution.

Data submission:

Data describing all insertions and stocks selected for the primary collection were submitted to public repositories after the fly strains were sent to the Bloomington Stock Center (Figure 1). Flanking sequences of the selected insertions were deposited at GenBank (http://www.ncbi.nlm.nih.gov/). Stock descriptions, including phenotype and balancer information, were submitted along with the insertion stocks to the Bloomington Stock Center (http://flystocks.bio.indiana.edu/). Detailed descriptions of the selected lines, including insertion coordinates and associated genes, were deposited at FlyBase (http://flybase.bio.indiana.edu/). Data submission to FlyBase is ongoing and not yet completed at the time of this article.

Figure 1.—

Schematic of project workflow. The arrows show how new Drosophila strains from single P-element mutagenesis screens are processed by the project. Lines are sequenced, sequences aligned to unique genomic sites, and insertions likely to disrupt genes not already mutated in the collection are selected (central boxes). Selected lines are balanced, rechecked to verify quality, and sent to the Bloomington Stock Center for public distribution. Lines failing to meet these criteria are recycled or discarded. The percentages indicate the fraction of lines falling into the indicated categories along each path.

RESULTS

Strategy:

We initiated a new strategy to expand the coverage of the BDGP gene disruption collection shortly after the Drosophila genome sequence was first released (Adams et al. 2000). Lines with single transposon insertions would be either newly generated by the project or received from other laboratories. Lines with new insertions would be recognized solely by a change in the genetic linkage of the transposon marker gene, rather than by any phenotype associated with the insertion. DNA would be prepared from adult flies of each unbalanced insertion line and inverse PCR products containing the genomic region flanking the insertion would be sequenced. Lines whose insertion points could be uniquely localized by sequence comparison to the reference genome sequence would then be added to the primary collection or discarded depending on whether they appeared, on the basis of insert location, to mutate a gene that had not already been disrupted. Information on each newly selected line would be posted on the project website (http://flypush.imgen.bcm.tmc.edu/pscreen/) and the unbalanced strains distributed to the community until stable stocks could be generated. After balancing, the flanking sequence would be rechecked to verify that the desired insertion was still present. If so, the line would be sent to the Bloomington Stock Center for public distribution, and associated information forwarded to the Bloomington Stock Center, FlyBase, and GenBank. An outline of the overall strategy is given in Figure 1, and a detailed description of each step is in materials and methods.

Designing screens to generate new lines with the broadest possible gene coverage presented the first major challenge. There was little theoretical or empirical information on how factors such as element structure or starting site affect coverage, yet the ability of the project to test these variables was limited due to the time required. It has been difficult in the past to compare the intrinsic efficiency of different screens because large numbers of molecularly analyzed lines are necessary to obtain statistically significant information regarding anything but a few highly mutable genes (hotspots; Berg and Spradling 1991). High levels of transposition have been associated with the generation of secondary mutations (Spradling et al. 1999), so the products of such screens were excluded from the project. To obtain baseline data on the feasibility and efficiency of our experimental plan, we first molecularly analyzed the EP collection (Rørth et al. 1998). Subsequently, we applied the strategy of Figure 1 to the products of three other screens carried out by us, as well as to donated lines from six additional screens, including three that utilized the piggyBac transposon (see Table 1).

Associating lines with genes:

Before the gene coverage of individual screens can be compared, it is necessary to address inherent ambiguities in the association of transposon insertions and genes on the basis of insert location. It would be conceptually simple to score as a hit only insertions lying within the annotated 5′ and 3′ limits of a given gene. However, particularly in the case of P elements, such an approach would be a highly inaccurate measure of gene disruption. One reason is that P elements lying a short distance upstream from the 5′ end have been shown in many cases to generate a gene mutation (Spradling et al. 1995). We used 500 bp as an approximate guide for the maximum distance a P element can be located 5′ to a transcription start site and still be likely to disrupt its function. Second, the Release 1 and Release 2 versions of the Drosophila genome annotation that were available during the first 3 years of the project utilized computationally predicted gene models that usually lacked 5′-untranslated exons. Since P elements insert preferentially near the true 5′ ends of genes (Spradling et al. 1995), and promoters are located on average 1.4 kb upstream from the start codon (Ohler et al. 2002), many insertions at true gene promoters would appear to lie >500 bp upstream from the nearest gene using the available annotation. Anticipating this problem, during the first 3 years we saved lines at novel “intergenic” positions and reanalyzed all our data after the Release 3 (R3) annotation became available (Misra et al. 2002). All data reported in this article are based on the most recent sequence and annotation (Release 3.1), which includes many more 5′- and 3′-untranslated regions (UTRs) than previous releases. This strategy significantly increased the completeness and accuracy of insertion-gene associations (Figure 2A).

Figure 2.—

Computationally associating insertions with genes. (A) The top shows a sample insertion, KG10308, as it appears in the Release 2-based GeneSeen display. The position of the insertion (triangle and vertical line) is shown relative to the local DNA sequence (horizontal line) and gene models (blue boxes) following the convention that genes above the line are oriented left to right and below the line they are oriented oppositely (arrowheads). KG10308 fails to meet project criteria for a gene association under R2 because it maps ∼2 kb upstream from the CG8249 annotation and 3.5 kb 5′ to the CG8253 annotation. Below, the same region is displayed on the basis of Release 3 sequence annotations (Misra et al. 2002), using the Apollo browser (Lewis et al. 2002). The inclusion of more information on 5′-UTRs in Release 3 reveals that KG10308 actually lies at the 5′ end of CG8253 and likely mutates this gene. (B) A histogram showing the distance between the P-element insertions in 5630 gene-associated primary collection lines and their associated transcript 5′ ends (blue). For comparison, a similar plot of the 267 lines with transcript-associated piggyBac insertions is shown (red). The last point on the right shows all remaining lines >500 bp from +1. (C) KG00786 is located at −10 relative to CG8315 and at −49 relative to CG8320. Nearby, the gene ATPCL (CG8322) is seen overlapping with CG8320. Both close gene spacing and overlapping transcription units are quite common in the Drosophila genome and account for the fact that 20% of single insertions in the primary collection likely affect two genes. (D) KG05287 is shown near the 5′ end of CG31849. This gene lies within a large intron of CG5287, which is transcribed from the opposite strand. 
The occurrence of genes within the large introns of other genes is common in Drosophila and motivated us to retain insertions in the large introns of already mutated transcription units if they were separated from other insertions by at least 2 kb. (E) KG02679 is one of 60 insertions predicted to lie within or close to an RNA gene. (F) Insertions upstream from CG12462 that lie >10 kb from any annotated gene are shown. Many insertions in this category using Release 2 annotation were later shown to be located near the 5′ ends of genes. Because annotation remains highly imperfect, insertions were saved if they lay >2 kb from any existing insertion in the primary collection. (G) An example of a manually determined insertion-gene association. Many Release 3 gene models (such as CG32767 shown) are still computationally derived only from their protein-coding sequences and lack sequences 5′ to the predicted methionine start codon. Automated annotation fails in these cases because P elements preferentially insert near 5′ ends, which commonly lie >500 bp from the start codon. In the example shown, BG01357 lies 1.8 kb 5′ to the R3 annotation of CG32767 but was manually associated by considering cDNAs such as the 5′ EST RE54443.5 displayed in Apollo.

We used the more complete Release 3.1 gene models to obtain further information on the P-element 5′ preference using these large data sets. The locations of the insertions in 5630 primary collection lines relative to their associated transcript 5′ ends are plotted in Figure 2B. It can be seen that P elements strongly tend to insert within 100 bp symmetrically about the transcription start site. This sharp peak in the distribution could not arise by chance, because annotated R3 start sites are separated on average by 5.6 kb in the genome. Moreover, no such preference for start sites is seen when piggyBac insertions are analyzed in an identical manner (Figure 2B). It can also be seen that a large fraction of all P-element insertions associated with genes fall within 500 bp of the transcript start site.

Several other factors in associating insertions and genes were considered. Many Drosophila genes lie near or within neighboring genes (Figure 2C) often within large introns (Figure 2D). Over 1000 of the R3 euchromatic genes (7.5%) are nested in the introns of other genes and >2000 genes (15%) have annotated transcripts overlapping those of other genes (Misra et al. 2002). There are also many divergently transcribed pairs of genes whose 5′ ends lie <500 bp apart. Overall, ∼20% of insertions were judged likely to disrupt two rather than just one gene on the basis of our criteria. Other insertions were located within known or predicted RNA genes (Lai et al. 2003; Figure 2E). As little knowledge of their cis-regulatory regions is available, lines with insertions located up to 500 bp 5′ or 3′ of such genes were saved.

Three additional classes of lines were saved even though they were not associated with genes by the criteria described above. First, a significant number of insertions lie outside and >500 bp 5′ of any known transcript (Figure 2F). Such insertions might disrupt unannotated genes and/or regulatory elements. Consequently, we saved a skeleton set of insertions in such regions such that no insertion was closer than 2 kb to its nearest neighbor. Second, unannotated genes may lie within introns of known transcripts, so we applied the 2-kb spacing criteria to inserts in large gene introns as well. However, neither of these types of lines were counted as gene disruptions as reported here. Third, ∼24% of Release 3 gene models lack an annotated 5′ UTR (Misra et al. 2002) and are prone to the same problems we experienced with Release 2 models. For example, in Figure 2G the BG01357 insertion lies 1866 bp upstream from the R3 annotation for CG32767, but this annotation begins only at the putative methionine start codon. Sequence data from cDNA RE54443 (which may not be full length) indicate that the true 5′ end(s) lies farther upstream and closer to the P element. To deal with such problems we manually annotated each insertion. Lines with insertions located 500–2000 bp upstream from the annotated 5′ end of a novel gene were sometimes retained in the collection if the available cDNA, EST, and modeling data indicated to a human curator that it would likely provide a primary reagent to researchers wishing to genetically manipulate the gene in question. This process resulted in ∼300 additional gene associations not recognized by automated annotation. For these reasons, all 7140 primary lines are currently useful as reagents and will likely prove to disrupt substantially more genes than the current estimate of 5362.
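The 2-kb spacing rule for the skeleton set can be illustrated with a simple greedy pass over sorted insertion coordinates. This is a sketch under assumptions rather than the procedure actually used: lines were in practice added to the collection incrementally and reviewed by curators, the function name is hypothetical, and the positions are assumed to lie on a single chromosome arm.

```python
def skeleton_set(positions, min_spacing=2000):
    """Keep insertions so that no two kept sites are closer than min_spacing.

    Scans coordinates from left to right and keeps a site only if it lies
    at least min_spacing bp from the previously kept site.
    """
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_spacing:
            kept.append(pos)
    return kept
```

A left-to-right scan is only one possible order. Choosing among closely spaced lines by verification status or element type, as described for gene-associated lines, would give a different but equally valid skeleton.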

Insertional mutagenesis screens vary widely in genomic coverage:

First, we used the methods described above to determine how many genes are associated with lines in the EP collection. Our results indicated that a sequence-based strategy of gene disruption could be highly efficient. An average of 686 ± 10 genes are associated with 1000 EP insertions (Figure 3A). Moreover, the rate of double insertions is only ∼3% of total jumps using this transposon (Rørth et al. 1998). Despite these attractive parameters, we did not elect to continue generating new EP lines because misexpression from this element is ineffective in some tissues, such as the female germ line (Mata et al. 2000). Nevertheless, these numbers set a standard by which other screens utilizing elements with other desirable properties could be judged and allowed the primary collection to be expanded by 374 strains.

Figure 3.—

Individual screens differ widely in mutagenic efficiency. (A) The average number of genes disrupted by 1000 lines from the indicated screens is shown. Each bar represents the average of two to four sets of 1000 lines except for PA/PC, where only 1046 lines were available (668 PA + 332 PC were used). (B) The cumulative number of genes disrupted as successive sets of 1000 lines are added for the indicated screens. (C) Screen synergy. The relative number of total genes disrupted when a set of 1000 additional lines from the indicated screens is added to a collection of either 7000 KG lines (bottom set, KG-) or 7000 EY lines (top set, EY-) is shown. (D) The mean percentage of 1000 lines that hit genes is shown for five screens. Standard deviations are given in the text. All of the lines used for these analyses were localized to a unique site in the euchromatic genome.

The initial screen carried out by BDGP (“the BG screen”) utilized a gene trap mutator element designed to stimulate GAL4 production under the control of a gene near the insertion site (Lukacsovich et al. 2001; Table 1). The BG screen utilized a genetic background that had been extensively isogenized, generating lines that minimize between-line genetic diversity (Norga et al. 2003). However, after generating 2869 lines we realized that this approach was not optimal for the purposes of genomic coverage (Table 2). First, the rate of BG element jumping was only one jump per seven vials, less than half the rate observed with the EP screen. Second, the rate of genes hit per 1000 insertions was also much lower, only 339 ± 40. This provided the first evidence that the intrinsic genomic coverage of insertional mutagenesis screens is highly dependent on the structure and/or location of the mutator element that is mobilized. Finally, we were troubled by the frequent recovery of lines in which GAL4 was oriented in the opposite direction to the targeted gene. While this might indicate the existence of many more unannotated genes than anticipated in the Drosophila genome, the goals of our project required a more efficient, predictable mutator. Nonetheless, we added 482 new BG strains to the primary collection.

In search of a better mutator, we switched to generating lines using a previously tested element known as P{SUPor-P} (Roseman et al. 1995). We refer to this as the KG element (see materials and methods). The KG mutator contains two chromatin insulator elements designed to minimize chromosomal position effects and enhance mutability via enhancer blocking. In addition, it houses an intronless yellow gene, which has proven to be much less sensitive to position effects than the mini-white gene used in many previous P-element mutators (Roseman et al. 1995). We thought that this might substantially increase screen efficiency because many transpositions, even within euchromatin, may not be detected using the mini-white gene due to position effects. We generated and analyzed 10,587 new KG transpositions with generally favorable results, adding 2129 lines to the final collection (Table 2). However, the efficiency of KG gene disruption remained significantly below the EP benchmark, i.e., 541 ± 22 vs. 686 ± 10 genes per 1000 lines (Figure 3A), and the KG element does not support gene misexpression.

Consequently, we switched to generating new lines using a modified version of the EP element (Table 1, materials and methods). An intronless yellow+ gene was inserted into the EPg version of EP that allows female germ-line expression (Mata et al. 2000). We called this element EY and used it to generate 10,310 new lines. As hoped, EY transpositions were linked to genes at the same rate as EP jumps, 691 ± 25 vs. 686 ± 10 genes/1000 lines (Figure 3A). This is significantly more efficient than the KG screen and allowed the project to add 2338 new lines to the final collection.

The final 17% of lines analyzed by the project were donated from five external laboratories. The Karpen lab provided additional KG insertion lines, which we termed KV lines, in which the expression of the yellow gene is variegated. As expected, such lines frequently result from insertion within heterochromatin (Yan et al. 2002; Konev et al. 2003). The Gelbart lab contributed 1384 lines, which we refer to as DG lines, containing the hybrid P-hobo element P{wHy} that facilitates the generation of local deletions at the site of insertion (Mohr and Gelbart 2002; Huet et al. 2002). The Garza lab contributed 1055 lines generated using two piggyBac mutators that we refer to as PA and PC (Table 1), while Udo Häcker donated 634 piggyBac lines produced from a screen with a different vector we termed PL (Häcker et al. 2003). To gain an initial comparison of mutagenesis using piggyBac vs. P element, we calculated the gene disruption efficiency of the PA/PC piggyBac lines. Somewhat surprisingly, our standard efficiency measure indicated that they hit 677 genes, about the same number as 1000 EP or EY lines. Thus, by this initial test, the efficiency of piggyBac mutagenesis equaled, but did not exceed, that of the best P elements.

Synergy between element types:

Further insight into screen strategy came from examining the cumulative number of genes disrupted for different elements over time in a large screen (Figure 3B). In a large screen, the incremental yield of new gene disruptions continually decreases during the course of the screen as more and more of the preferred target genes have already been hit. Thus, in designing a screening strategy, consideration must be given not only to the initial gene targeting efficiency, but also to how rapidly the yield decreases as new insertions are added. As expected from the initial efficiency measures, more genes were disrupted by EY jumps compared to KG jumps at each point in the screens. BG transpositions were far less efficient than either of them.

Next, we investigated whether there was any advantage to using a combination of elements rather than a single element (Figure 3C). We calculated the incremental gene yield resulting from 1000 new KG, EY, or PC/PA lines, in a project that had already incorporated 7000 KG or EY lines. If all mutator elements target the same universe of genes, then their relative efficiency would always be proportional to their initial efficiency. However, if elements target sets of genes that only partially overlap, then the element used initially will become less efficient with time (due to the saturation of its targets) in comparison to a new element. This is in fact what was observed. After 7000 KG insertions, switching to PA/PC lines was even more favorable than expected from the initial rate measurements. A total of 1000 piggyBac lines at this point added 421 more gene associations compared to 248 for 1000 added EY lines or just 162 for 1000 more KG lines. The high synergy between P and piggyBac elements was also seen with EY elements. After 7000 EY lines, 1000 PA/PC lines added 358 new genes vs. 188 for an equal number of new EY lines. These results begin to quantitate the broader spectrum of gene targeting exhibited by piggyBac compared to P elements (Häcker et al. 2003).
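The cumulative and incremental yield calculations behind Figure 3, B and C, amount to set operations on the genes associated with each line. The sketch below shows one way to express them. It assumes each line is represented as a set of associated gene identifiers, and the function names are hypothetical.

```python
def genes_covered(lines):
    """Union of the genes associated with a collection of lines."""
    covered = set()
    for genes in lines:
        covered.update(genes)
    return covered


def cumulative_curve(batches):
    """Total distinct genes disrupted after each successive batch of lines."""
    covered = set()
    curve = []
    for batch in batches:
        covered |= genes_covered(batch)
        curve.append(len(covered))
    return curve


def incremental_yield(existing_lines, new_lines):
    """Number of genes hit by new_lines that existing_lines do not hit."""
    return len(genes_covered(new_lines) - genes_covered(existing_lines))
```

With this representation, synergy between two screens appears as a larger `incremental_yield` for lines from the second screen than for an equal number of additional lines from the first.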

In contrast, comparison between the KG and EY mutators revealed only limited synergy. A total of 1000 KG lines became somewhat more efficient relative to 1000 EY lines after 7000 previous EY insertions, now associating with only 25 fewer (163 vs. 188) rather than 100 fewer genes. Curiously, almost no synergy was seen in the reverse direction. After 7000 KG lines, 1000 additional EY lines targeted ∼100 more genes than 1000 added KG lines did. This is about the same as the number of additional genes hit by EY vs. KG lines initially. The very limited synergy indicates that different P-element screens target substantially the same subsets of total genes (at least in the case of these two elements).

Screen-specific hotspots affect screen efficiency:

As documented above, we observed large differences between screens in the total number of insert-associated genes. Thus, a sample of 1000 unselected KG lines hit an average of 541 ± 22 genes compared to 691 ± 25 genes for an equal number of EY lines (see Table 2). To determine if more insertions in a less efficient screen land between genes, we calculated the fraction of insertions in different screens that actually hit a gene (Figure 3D). For comparison, we note that Release 3.1 identifies 52,560 kb of the Drosophila genome as intergenic (42%). Correcting for the 500 bp upstream of 13,666 genes that we also scored as potential gene hits indicates that 63% of random insertions would be associated with a gene. The KG and EY screens hit genes at much higher frequencies (80 ± 1.3% and 81 ± 1.1%, respectively). These values, which reflect P-element gene targeting, are very similar and cannot explain the differences in efficiency. However, the rate of gene targeting appears to differ somewhat in other screens. BG elements hit genes only 72% of the time. This result is paradoxical as the white+ transgene within this element, which has a splice donor but no 3′ polyadenylation site, was designed to be expressed only when its transcript is spliced to an endogenous 3′ exon(s). Apparently, this system actually reduced rather than increased the frequency with which annotated genes are targeted. About 75% of piggyBac (PA/PC) jumps hit genes. Thus, piggyBac mutators are unlikely to target TTAA sequences randomly within the genome, but insert preferentially in genes, although to a lesser extent than P elements and with a reduced 5′ bias.
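The 63% expectation for random insertions can be reproduced from the figures quoted above. The short calculation below is a sketch under two simplifying assumptions that the text does not state: the total genome size is inferred from the intergenic length and its percentage, and the 500-bp upstream windows are taken to lie entirely in intergenic DNA without overlapping one another. The function name is hypothetical.

```python
INTERGENIC_KB = 52_560       # intergenic sequence in Release 3.1 (kb)
INTERGENIC_FRACTION = 0.42   # fraction of the genome that is intergenic
N_GENES = 13_666             # annotated genes
UPSTREAM_KB = 0.5            # upstream window scored as a gene hit (kb)


def expected_random_hit_fraction() -> float:
    """Fraction of uniformly random insertions that would be scored as hits."""
    genome_kb = INTERGENIC_KB / INTERGENIC_FRACTION      # about 125,000 kb
    genic_fraction = 1.0 - INTERGENIC_FRACTION           # 0.58
    upstream_fraction = N_GENES * UPSTREAM_KB / genome_kb
    return genic_fraction + upstream_fraction


print(f"{expected_random_hit_fraction():.1%}")
```

The upstream windows add roughly 5 percentage points to the 58% of the genome annotated as genic, giving approximately 63%.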

The major source of efficiency differences between P screens proved to be transposon hotspots. We analyzed the frequency with which genes are hit in all the screens analyzed during the current phase of the project. Some of these results are shown in Table 4, where it can be seen that the number of times a gene is hit varies widely. The most frequently hit genes were considered “hotspots” (Table 5) and the fraction of all insertions in this class varied significantly between screens (Table 4).

TABLE 4

Frequency distribution of targeted genes between screens

TABLE 5

Gene targeting rates

Our results suggest that there are two previously unrecognized subclasses of hotspots. All P-element screens (and frequently also piggyBac screens) hit certain hotspot genes at elevated frequencies (Table 5, “common hotspots”). These loci must possess some intrinsic affinity for transposon binding and/or integration, perhaps due to the local chromatin state or the presence of particular proteins. Strikingly, however, a second class of hotspots was highly preferential for a particular screen or screens (Table 5). Most dramatically, the KG screen displayed a class of “super hotspots.” For example, CG9894 alone accounts for a staggering 10% of all KG lines (Figure 4A) and Hr39 for another 2.5%. Five other sites are hit more frequently in the KG screen (>0.56%) than are any of the common hotspots. These screen-preferential hotspots most likely explain the relative inefficiency in the KG screen. All the super hotspots, and almost all of a larger number of less dramatic KG-associated hotspots, are located on chromosome 2 and clustered in three small regions: 22F–23A, 38B–44A, and 49F. A few screen-preferential hotspots were also detected in certain other screens, although their specificity appeared to be lower than that for the KG hotspots (Table 5).
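Per-gene hit frequencies of the kind quoted above (for example, the share of all KG lines falling in CG9894) can be tabulated as in the following sketch. It assumes each line is represented as a set of associated gene identifiers. The function names are hypothetical, and the threshold passed to `hotspots` is an illustrative parameter, not a cutoff defined by the project.

```python
from collections import Counter


def hit_fractions(lines):
    """Fraction of lines in a screen associated with each gene."""
    total = len(lines)
    if total == 0:
        return {}
    counts = Counter(gene for genes in lines for gene in genes)
    return {gene: n / total for gene, n in counts.items()}


def hotspots(lines, threshold):
    """Genes hit by more than `threshold` of the lines, most frequent first."""
    fractions = hit_fractions(lines)
    selected = [g for g, f in fractions.items() if f > threshold]
    return sorted(selected, key=lambda g: fractions[g], reverse=True)
```

Running the same tabulation separately on each screen and comparing the resulting fractions is one way to distinguish common hotspots from screen-preferential ones.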

Figure 4.—

Screen-preferential hotspots. (A) An Apollo display of the region surrounding the major KG super hotspot at gene CG9894, which contains the screen starting site. Note that insertions (triangles) are distributed on both strands and at multiple sites. Not all the insertions could be represented as separate triangles. (B) An Apollo display of a major EY hotspot in gene CG3979 (Indy). 1360 1044 is a repetitive element. (C) Pairing diagram of the CyO balancer (black line) with its wild-type homolog (green line) in the germ cells in which new jumps occur. The position of the starting transposon (red bar) and four major groups of hotspots (orange bars) are shown. All reside close to the central region where normal pairing is disrupted due to the multiple inversions on CyO. (D) Model for the generation of screen-associated hotspots by local jumping from the starting site to other chromosome regions that happen to lie nearby in germ cell nuclei.

Some screen-enriched hotspots resemble local transpositions:

P elements and many other transposons preferentially jump locally on their starting chromosome (Tower et al. 1993). We considered whether a relationship exists between screen-enriched hotspots and the site of the starting transposon. In the case of the KG screen, the CG9894 super hotspot corresponds exactly to the position of the starting insertion on the CyO balancer chromosome, which contains multiple chromosome inversions to block the recovery of recombinants. As in the case of local jumping, elevated frequencies of integration are not confined to a single nucleotide site, but extend along the chromosome in both directions (Figure 4A). The broad distribution of insertions seen in Figure 4A continues along the chromosome. Indeed, the elevated number of KG insertions recovered on chromosome 2 compared to that on chromosome 3 (Table 3) is due primarily to the recovery of a higher density of disrupted genes in the vicinity of the super hotspots, rather than uniformly across the entire chromosome. Thus, in their frequency, site dependence, and regional specificity, the KG screen-enriched hotspots resemble local transpositions, but on the homologous chromosome (Tower and Kurapati 1994).

However, these results could not be explained by a simple “homolog-hopping” model (Tower and Kurapati 1994). Similar hotspots were not generally observed near the starting site in the case of other screens. For example, the starting site for the EY screen was localized in region 39C, yet this is not among the hotspots in this screen (Table 5). Moreover, most of the KG hotspots do not lie directly opposite the starting site, but are located at several distant sites, including a few on other chromosomes. One possibility is that some aspect of the local chromatin structure near the starting site is the critical variable. We noted that, when plotted on a diagram showing the pairing pattern expected for CyO, the KG (but not the EY) starting site and all three super hotspot-containing regions were located near CyO breakpoints that may associate in vivo (Figure 4C). The chromatin surrounding these sites may have been altered in a manner that enhances local jumping, allowing nearby sites on the homolog and even on other chromosomes to be targeted. We suggest that screen-associated hotspots may generally arise via local jumping to sites that happen to reside close to the starting transposon in the chromatin of germ cell nuclei (Figure 4D).

P-element gene class preferences:

We examined the spacing of insertions in the primary collection throughout the genome by calculating the interelement distances (average of 16.7 kb). Sometimes, as expected, the insertion density seemed to correlate with the density of genes/promoters, which varies significantly over relatively short regions (Ashburner et al. 1999). The likely existence of other influences was indicated by the nature of the two largest gaps, both measuring ∼290 kb in length. These correspond to the Antennapedia and Bithorax complexes, neither of which was hit by P elements in this phase of the project. (A single P insertion, fs(3)05649 at AbdB, is in the collection from the earlier phase.) The fact that homeotic clusters are insertion cold spots provides further evidence that even in germ cells the genome presents a nonuniform target for transposition.
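The spacing calculation described above can be sketched as follows. This is a minimal illustration and not the project's actual pipeline: the function names and the toy coordinates are hypothetical, and it assumes that insertion positions on a single chromosome arm are available as base-pair integers.

```python
from typing import List, Tuple


def interelement_distances(positions: List[int]) -> List[int]:
    """Distances (bp) between neighbouring insertions on one chromosome arm."""
    ordered = sorted(set(positions))  # collapse insertions at an identical site
    return [b - a for a, b in zip(ordered, ordered[1:])]


def largest_gaps(positions: List[int], n: int = 2) -> List[Tuple[int, int, int]]:
    """Return the n largest gaps as (length, start, end) tuples, longest first."""
    ordered = sorted(set(positions))
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return sorted(gaps, reverse=True)[:n]


# Hypothetical coordinates (bp), for illustration only; not project data.
toy_positions = [1_000, 12_000, 30_000, 320_000, 335_000, 350_000]

distances = interelement_distances(toy_positions)
mean_kb = sum(distances) / len(distances) / 1000  # average spacing in kb
top_gap = largest_gaps(toy_positions, n=1)[0]     # (length, start, end)
```

Listing the largest gaps alongside the mean is what would flag regions such as the homeotic complexes, where the spacing is far above the genome-wide average.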

Previously, we noted that the frequency of insertion seems to vary for different classes of genes (Spradling et al. 1999). To investigate further, we calculated the number of disrupted genes in various functional classes (Figure 5A). Common signaling pathways, stress response genes, and other genes likely to be active in early germ cells (but not ribosomal protein genes) generally had an above-average probability of being disrupted. In contrast, genes encoding cell type-specific proteins expressed late in development such as cuticle proteins, glue proteins, or chorion proteins were rarely if ever hit. Although insertions in ribosomal protein genes might be haplo-insufficient, there should have been no selection against insertions in structural protein genes, and chemically induced mutations in some of these genes have been recovered. The arrangement of these infrequently hit structural protein genes in chromosomal clusters suggests that some distinctive aspect of their chromatin structure or their promoter elements (Ohler et al. 2002) reduces their susceptibility to P-element insertion. Unlike the homeotic clusters, the dearth of inserts in clustered cell-specific genes is unlikely to simply reflect low promoter density (Figure 5B).

Figure 5.—

Target selectivity. (A) Different pathways and gene classes are differentially susceptible to P-element insertion. The fraction of genes in various classes hit in the primary collection is shown. (B) Apollo display from the 65A larval cuticle protein gene cluster spanning ∼45 kb. No insertions in this region (solid regions above and below maps) or in regions housing several other similar clusters of genes expressed in terminally differentiated cells were recovered.

It has frequently been suggested that gene activity in germ cells might influence transposon accessibility. To study this variable we examined genes whose embryonic expression has been characterized using whole-mount in situ hybridization by the BDGP (http://www.fruitfly.org/cgi-bin/ex/insitu.pl). Of 104 genes expressed in pole cells or embryonic germ cells, 70% contained an associated insertion in the primary collection, far more than the overall average of 40%. However, 61% of 123 genes that are expressed in the embryonic salivary gland but not in germ cells also have an associated insertion in our project, so the importance of germ cell expression remains uncertain. Taking another tack, we examined whether hotspot genes share any diagnostic features of their expression programs. No commonalities were observed. While some hotspot genes such as CG9894 are highly expressed maternally and/or in embryos, RNA from other hotspot genes (Hr39, cpo) was weak or not detected. Consequently, a simple explanation for the gene selectivity observed in the project remains elusive.
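One way to see why the germ cell comparison is inconclusive is a two-proportion z-test on the two gene sets. The sketch below is our illustration, not an analysis reported by the project. The hit counts are reconstructed from the rounded percentages given above (70% of 104 and 61% of 123), so they are approximate, and the function name is hypothetical.

```python
import math


def two_proportion_z(hits1: int, n1: int, hits2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Counts reconstructed from rounded percentages (an assumption).
germ_n, gland_n = 104, 123
germ_hits = round(0.70 * germ_n)    # genes expressed in pole/germ cells
gland_hits = round(0.61 * gland_n)  # salivary gland, not germ cells

z, p_value = two_proportion_z(germ_hits, germ_n, gland_hits, gland_n)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these approximate counts, the difference between the two gene sets falls short of conventional significance thresholds, which is consistent with the cautious conclusion in the text.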

The new primary collection:

On the basis of an analysis of lines from all 10 screens, 7140 lines have been designated for the primary collection (Table 3). Most of these lines have already been verified and forwarded to the Bloomington Stock Center for distribution. Insertions in the collection are distributed rather uniformly across the entire genome, including heterochromatic contigs and chromosome 4. EY, EP, and LA insertions within the collection are positioned to misexpress 1400 different genes. The BDGP primary collection will provide important reagents for a wide range of biological research.

DISCUSSION

Status of the project:

During the past 3.5 years the BDGP gene disruption project primary collection has expanded from 1000 to >7100 strains and now contains insertions associated with at least 5362 genes. It has proved possible to generate and molecularly analyze large numbers of lines and to produce a collection of high-quality strains with diverse capabilities. High-throughput methods were developed for generating, tracking, and mapping large numbers of insertions, as well as bioinformatic methods for recording and manipulating the data (see materials and methods). We find that a large number of protein-coding genes can be targeted near their promoters and that transposon insertions in RNA genes and heterochromatic genes can be obtained as well. The generation and distribution of these new mutants throughout the course of the project are greatly assisting Drosophila researchers to investigate diverse biological questions. Finally, the project's large, well-characterized data sets from multiple large screens utilizing different mutator elements and starting sites allowed us to gain a better idea of how to optimally design transposon mutagenesis projects.

Two classes of insertion hotspots:

Our work suggests the existence of two classes of genes that act as transposon hotspots. The first class comprises genes that evidently possess favorable chromatin accessibility, DNA target sequences, or bound proteins that mediate high-efficiency association with freely diffusing transposition complexes. These sites may be highly specific at the nucleotide level (Figure 4B) and may be responsible for the nonrandom primary DNA sequence context of P-element integration sites (Bellen et al. 1992; Liao et al. 2000). Many common hotspots appear to be hit frequently in multiple mutagenesis screens utilizing structurally different mutators. Our experiments better documented many members of this class, most of which were already well known as P-element hotspots (Table 5).

The disruption rates of genes in the second class are highly screen dependent (Table 5). These screen-associated hotspots may arise in a variety of ways. One mechanism is likely to be physical proximity to the starting transposon, as suggested by the location of the KG hotspots with respect to the rearranged CyO chromosome. This class may depend primarily on the specific transposon starting site. Besides the KG super hotspots, we found several other potential examples of this type of hotspot in the other screens. It is known that the specific sequences within a mutator may influence target sites (Kassis 2002). In our experiments, the EY and EP elements are similar in structure but were launched from different starting sites. We found that the hotspots in these two screens (and often also BG) were very similar, as expected if element structure was also important.

Screen-associated hotspots may provide insight into nuclear organization:

Insertional mutagens typically do not integrate with equal efficiency across genomes (Sandmeyer et al. 1990; Spradling et al. 1995; Alonso et al. 2003). Now that the products of large mutagenesis screens can be thoroughly analyzed without prior selection, it may be possible to use insertional preferences as tools for probing chromosome organization and function. Generating new insertions from a starting site located close to a chromosome rearrangement might generate super hotspots within predictable regions of the chromosome. Such a procedure might increase the rate of mutagenesis in the targeted region by >10-fold, as we observed in the 23B region, allowing genes in the vicinity to be mutated to saturation and chromatin structure to be probed.

The importance of genome annotation:

Molecularly based insertional mutagenesis projects for some species have the luxury that all such lines can be safely stored for later retrieval and use. However, in many other species, including Drosophila, it is necessary to analyze newly generated lines and preserve only those with special value as experimental reagents. Our results illustrate how this latter type of screen depends crucially on accurate genome sequence annotation. The difficulty of making accurate gene-insertion associations is further exacerbated in organisms such as Drosophila that contain small dense genomes rich in overlapping and differentially spliced transcription units. The use of a transposon that inserts preferentially near promoters compounds the difficulty, as promoter prediction programs are accurate only ∼50% of the time, even when large, accurate training sets are available (Ohler et al. 2002).

During the first 3 years of the project we worked with gene models based largely on computational predictions that frequently provided incomplete information on gene structure and location. Approximately 100 genes were lost from the project when lines located <2 kb from existing lines were discarded and only later found to disrupt a separate, previously unannotated gene. In retrospect, it would probably have been worthwhile to maintain a higher density of insertions in intergenic regions and large introns. Our project suggests that a high priority should be placed on transcript mapping in combination with insertional mutagenesis projects.
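The spacing rule mentioned above (discarding lines located <2 kb from existing lines) can be illustrated with a simple greedy filter. This is a hypothetical sketch, not the project's selection procedure, which also weighed gene annotation and line quality. The function name, the constant, and the coordinates are assumptions made for illustration.

```python
from bisect import bisect_left, insort
from typing import Iterable, List, Optional

MIN_SPACING_BP = 2_000  # the 2-kb threshold discussed in the text


def select_spaced_insertions(
    candidates: Iterable[int],
    kept: Optional[List[int]] = None,
    min_spacing: int = MIN_SPACING_BP,
) -> List[int]:
    """Keep a candidate only if it lies >= min_spacing bp from every kept site."""
    kept_sorted = sorted(kept or [])
    for pos in candidates:
        i = bisect_left(kept_sorted, pos)
        # Only the nearest kept neighbours on each side need checking.
        left_ok = i == 0 or pos - kept_sorted[i - 1] >= min_spacing
        right_ok = i == len(kept_sorted) or kept_sorted[i] - pos >= min_spacing
        if left_ok and right_ok:
            insort(kept_sorted, pos)
    return kept_sorted


# Hypothetical positions (bp) on one chromosome arm, in order of arrival.
selected = select_spaced_insertions([10_000, 10_500, 12_000, 13_999, 14_000])
```

Because a purely distance-based filter like this knows nothing about gene boundaries, it will discard an insertion in a separate, unannotated gene that happens to lie within 2 kb of a retained line, which is the kind of loss described above. The result also depends on the order in which candidates arrive.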

Making stocks publicly available:

As insertional mutagenesis of the Drosophila genome progresses, the issue of how to maintain all the valuable lines becomes increasingly acute. Frequently, multiple alleles of a gene that might each provide unique and valuable information regarding gene function are obtained. In the case of genes with multiple promoters, often encoding distinct protein splice variants in different tissues, insertions near the start site of each distinct transcript would allow their individual roles to be investigated. Complex patterns of gene expression during development might be efficiently studied using other alleles that sensitively report patterns of gene expression and, in some cases, reveal the subcellular location of the protein product(s) by fusing transcripts or protein domains to reporters. Much valuable information on gene function can likewise be derived from insertion alleles bearing regulatory elements that allow a gene to be misexpressed under experimental control. Thus, an average of four alleles per gene, rather than one, would likely be necessary to take full advantage of the experimental potential of Drosophila gene disruption collections. Unfortunately, at present, the world capacity for public storage and distribution of Drosophila stocks is much more limited than this. Unless a solution to this problem is found, it is likely that many valuable tools will have to be discarded and the full value of publicly supported projects will be diminished.

The future of Drosophila gene disruption:

Despite the progress toward genetic saturation reported here, many genes remain to be disrupted and readily available tools are still lacking for understanding their biological roles throughout the life cycle. How should our project continue to address these remaining needs in an efficient manner? First, it is clear that a simple continuation of the current strategy using EY elements would be well worthwhile. The last set of 1000 EY insertions scored still yielded 188 new genes, along with another 50–70 lines hitting previously missed intergenic regions or allowing gene misexpression. Consequently, the “yield” of worthwhile lines remains >20%, so that another 30,000 lines might be expected to yield 26,000 single insertions and up to perhaps 4000 additional genes (15%). Switching to a piggyBac vector for the next 30,000 lines would yield insertions associated with a significantly larger number of genes. This conclusion is strongly reinforced by the successful construction of several large collections of piggyBac insertions (Häcker et al. 2003).
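The yield figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The variable names are ours, and the reading of "15%" as additional genes per single insertion is an interpretation, not something the text states explicitly.

```python
# Figures from the last set of 1000 EY insertions scored.
lines_scored = 1_000
new_genes = 188
other_useful_low, other_useful_high = 50, 70  # intergenic or misexpression lines

yield_low = (new_genes + other_useful_low) / lines_scored
yield_high = (new_genes + other_useful_high) / lines_scored

# Projection for a further 30,000 lines, as stated in the text.
projected_lines = 30_000
projected_single_insertions = 26_000
projected_new_genes = 4_000

single_insertion_rate = projected_single_insertions / projected_lines
genes_per_single_insertion = projected_new_genes / projected_single_insertions
genes_per_line = projected_new_genes / projected_lines

print(f"current yield of worthwhile lines: {yield_low:.1%} to {yield_high:.1%}")
print(f"projected new genes per single insertion: {genes_per_single_insertion:.1%}")
print(f"projected new genes per line generated: {genes_per_line:.1%}")
```

The current yield works out to roughly 24 to 26% of lines, which agrees with the ">20%" statement. The projected average of about 13% new genes per line is lower than the current 18.8% per line, as would be expected if returns diminish while the collection approaches saturation.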

At what point does working to attain further genomic coverage using transposon mutagenesis become unattractive? Experimental data suggest that ultimately even P-element mutagenesis can disrupt the great majority of Drosophila genes. Recently, Oh et al. (2003) reported that they had increased the coverage of second chromosome vital genes from 25 to 80%. Likewise, Timakov et al. (2002) recently demonstrated that a high fraction of genes are susceptible to P-element insertion when rates are elevated by local hopping. However, our data suggest that some gene subclasses such as the cuticle protein genes may be refractory to this approach. Consequently, to disrupt every Drosophila gene will likely require a directed finishing strategy. Fortunately, several methods are available in Drosophila that should be adequate for this task (McCallum et al. 2000; Rong et al. 2002). Indeed, we can now look forward to a period when attention can shift from obtaining mutations to analyzing and understanding the biological processes they disrupt.

Acknowledgments

We thank Meiying Ji, Lara Chetkovich, Ruidong Ma, Ping Dang, Yaojuan Lu, Hongyuan Zhang, Jin Yue, Xingjie Shen, and Mengfei Huang for creating, balancing, and maintaining the fly stocks. Nicole Mozden, Dianne Williams, and Tiffany Jackson assisted in the line maintenance and balancing at Carnegie. We are grateful to Alexei Tulin for transforming P{EPgy2} into flies and to Christine Norman for help with the figures. We thank Soo Park and Kenneth Wan at Lawrence Berkeley National Laboratory for sequencing the inverse PCR products. Pernille Rørth, Tim Parnell, and Pamela Geyer provided plasmids and unpublished data used in the construction of the pP{EPgy2}. We are grateful to Exelixis Corporation for providing a protocol for inverse PCR and sequencing of the flanking sequences of the PA and PC insertions. We are indebted to Gary Karpen, Daniel Garza, Udo Häcker, John Merriam, Stephen Poole, Judith Lengyel, William Gelbart, and the members of their laboratories for donating their collections of insertion mutants to this project. We thank Kathy Matthews for balancing some PA/PC lines and for sending frozen samples for sequencing. This work was supported by a National Institutes of Health Drosophila genome center grant (to G.M.R., P.I.) and by a special supplement to this grant (to A.C.S., P.I.) from the National Institute of General Medical Sciences. Additional funds were provided through the support of the Spradling, Bellen, and Rubin labs from the Howard Hughes Medical Institute.

Footnotes

  • 1 Present address: Roche Palo Alto, Palo Alto, CA 94304.

  • Communicating editor: K. G. Golic

  • Received January 13, 2004.
  • Accepted March 1, 2004.

References
