The Carnegie Protein Trap Library: A Versatile Tool for Drosophila Developmental Studies
Michael Buszczak, Shelley Paterno, Daniel Lighthouse, Julia Bachman, Jamie Planck, Stephenie Owen, Andrew D. Skora, Todd G. Nystul, Benjamin Ohlstein, Anna Allen, James E. Wilhelm, Terence D. Murphy, Robert W. Levis, Erika Matunis, Nahathai Srivali, Roger A. Hoskins, Allan C. Spradling


Metazoan physiology depends on intricate patterns of gene expression that remain poorly known. Using transposon mutagenesis in Drosophila, we constructed a library of 7404 protein trap and enhancer trap lines, the Carnegie collection, to facilitate gene expression mapping at single-cell resolution. By sequencing the genomic insertion sites, determining splicing patterns downstream of the enhanced green fluorescent protein (EGFP) exon, and analyzing expression patterns in the ovary and salivary gland, we found that 600–900 different genes are trapped in our collection. A core set of 244 lines trapped different identifiable protein isoforms, while insertions likely to act as GFP-enhancer traps were found in 256 additional genes. At least 8 novel genes were also identified. Our results demonstrate that the Carnegie collection will be useful as a discovery tool in diverse areas of cell and developmental biology and suggest new strategies for greatly increasing the coverage of the Drosophila proteome with protein trap insertions.

THE central challenge of postsequence genomics is to learn how an enhanced knowledge of genes, transcripts, and proteins can be applied to better understand the biology of multicellular organisms. Gaining an accurate picture of where and when metazoan genes are expressed remains a prerequisite for many such advances (Stathopoulos and Levine 2005). The discovery of distinctive, regulated programs of gene expression at a fine scale has the potential to reveal new cell types and substructures that make up tissues and the biological processes that govern their function. However, sensitive and widely applicable methods will be required to detect and distinguish developmentally programmed gene expression changes from those caused simply by cell cycling or environmental perturbation.

Several methods for analyzing gene expression within tissues are currently available. Particular cell types can sometimes be cultured in vitro into populations of useful size. However, isolated cells in artificial media frequently behave differently from cells in vivo interacting with precisely positioned neighbors in three-dimensional microenvironments. Another approach is to isolate tissue cells by flow sorting, microdissection, or laser capture and then determine their expression profiles in depth (reviewed in Espina et al. 2006). Visualizing patterns of gene expression within the intact tissues of transgenic organisms containing gene expression reporters may be the most general method (Tomancak et al. 2002). Epitope tagging, enhancer trapping, and gene trapping all have the added advantage that gene expression can subsequently be observed in living tissues, revealing dynamic processes that are largely beyond the reach of methods based on fixed material (reviewed in Herschman 2003; Dirks and Tanke 2006).

Protein trapping is a variation of gene trapping in which endogenous genes are engineered to produce under normal controls protein segments fused to a reporter such as GFP. The great potential of this technology has been extensively documented in yeast, where large collections of strains that each trap a different gene have been generated using transposable elements (Ross-Macdonald et al. 1999) or by homologous recombination (Huh et al. 2003). Extensive gene and protein trapping has also been carried out in cultured embryonic stem (ES) cells (Gossler et al. 1989; Friedrich and Soriano 1991), where fusions with more than half of annotated mouse genes have been recovered (see Skarnes et al. 2004). However, relatively few of these ES cell lines have so far been used to generate corresponding mouse strains where the versatility and sensitivity of the method for analyzing tissue structure can be tested.

Large-scale protein trap screens may also reveal new information about genome structure and function. Identifying in an unbiased manner locations throughout a genome where a coding exon can be expressed tests the accuracy and completeness of its annotation. Characterizing the splicing patterns that lead to normal or aberrant GFP expression tests the current catalog of transcript isoforms generated by alternative splicing. Moreover, by recovering insertions in the same gene that splice differently and produce GFP with varying efficiency, such a project might generate a data set useful for studying the determinants of splice site selection and transcript stability.

Drosophila provides a favorable system for applying gene traps to diverse developmental and genomic studies. The genome sequence has been extensively annotated on the basis of experimental data (Misra et al. 2002). Thousands of enhancer trap lines have been generated in large-scale transposon screens and culled of redundant strains by the gene disruption project (see Bellen et al. 2004). In contrast, producing Drosophila protein trap lines has remained difficult. Several hundred such lines were generated using a mobile GFP-containing exon flanked by both splice acceptor and donor sites (Morin et al. 2001; Clyne et al. 2003). However, the process was highly inefficient, with as few as 1 in 1500 progeny flies expressing GFP. Positive lines often contained more than one insertion, preferentially tagged a small number of hotspot loci, and tagged many sites not predicted to fuse the GFP exon in frame to any known coding region (Morin et al. 2001). Kelso et al. (2004) found that the recovery of lines could be increased by using an automated embryo sorter to select GFP-positive embryos and established a website, FlyTrap, to gather information on Drosophila protein trap lines. Consequently, we initiated a large-scale protein trap screen to increase gene coverage, test the genome annotation, and address some of the remaining technical difficulties in efficient line production.

Here we report the production of lines that trap 600–900 Drosophila genes, including 244 where one or more trapped proteins can currently be identified. Using the Drosophila ovary as a test system we confirm that protein trap lines reveal fine-scale details of developmentally regulated protein expression, making them exceptionally valuable discovery tools for a wide range of studies. Finally, mapping RNA splicing patterns downstream from >1200 insertions provides insight into how an added exon affects splicing and suggests how the production of protein trap lines can be expanded to cover a larger fraction of the Drosophila proteome.


Generation of P-element lines for protein trap screening:

The P-element-based protein trap screens presented here utilized the pPGA, pPGB, and pPGC vectors described in Morin et al. (2001). These elements carry a mini-white transgene in the opposite orientation to an enhanced green fluorescent protein (EGFP) exon, which is composed of EGFP sequence, without start or stop codons, flanked by splice acceptor and donor sites from the Drosophila MHC locus. A, B, and C refer to the position of the splice sites within the first and last codons of the EGFP exon sequence. Previously used pPGA, pPGB, and pPGC third chromosome insertions (Morin et al. 2001) were remobilized in the presence of balancer chromosomes. New insertions that mapped to the CyO balancer chromosome, did not express EGFP, and exhibited remobilization rates off the CyO balancer of at least 60% in single-pair mating assays were recovered and used as starting stocks in the screen (see below).

piggyBac protein trap vectors:

To make a shuttle vector for subcloning the EGFP exon into different transposable elements, the entire EGFP exons from the pPGA, pPGB, or pPGC plasmids were excised from the original P-element plasmids (kind gift of W. Chia) using EcoRI and PstI and subcloned into pBluescript (Stratagene, La Jolla, CA). These plasmids were then cut with EcoRV and KpnI, end filled using Klenow, and religated to themselves to create pBS-GFPA, pBS-GFPB, and pBS-GFPC. The resulting plasmids carry the EGFP exon sequence between a unique EcoRI site at the 5′ end and unique PstI, SmaI, BamHI, and XbaI sites at the 3′ end. New tagging sequences can be inserted between the splice acceptor and donor sites of the exon using unique NcoI and XhoI sites.

Two different piggyBac protein trap vectors (Figure 1) were constructed using pBac{D. m. w+} (Handler and Harrell 1999) (kind gift of A. Handler). The pBac{D. m. w+} plasmid was digested with ClaI to remove most of the mini-white sequence and a linker containing HpaI, XhoI, and SpeI sites was inserted in its place to form pBAC{ΔClaI}. To create pBAC{BglII-GFP}, the EGFP exons from pBS-GFPA, pBS-GFPB, and pBS-GFPC were subcloned into the BglII and MfeI sites of pBAC{ΔClaI}. To create pBAC{HpaI-GFP}, EGFP exon sequences were inserted between the MfeI and HpaI sites of pBAC{ΔClaI}. An intronless yellow transgene from the yellow-BSX plasmid (Bellen et al. 2004) was then subcloned into the unique SpeI site of both pBAC{BglII-GFP} and pBAC{HpaI-GFP} to form either pBAC{BglII-GFP; y+}, which has the EGFP exon and yellow transgene oriented away from each other, or pBAC{HpaI-GFP; y+}, which has the EGFP exon and yellow transgene pointing toward each other (Figure 1A). Both pBAC{BglII-GFP; y+} and pBAC{HpaI-GFP; y+} vectors carrying the EGFP exon in the A frame were transformed into y w flies using the phspBac helper plasmid (Handler and Harrell 1999) (kind gift from A. Handler).

We created stable genomic sources of the piggyBac transposase using P-element transformation vectors. To place the piggyBac transposase under control of the ubiquitin promoter, the piggyBac transposase ORF was excised from phspBac using BamHI and DraI and ligated into the BamHI and SmaI sites of the pCasper3-Up2-RX poly(A) P-element vector (Ward et al. 1998) (kind gift of R. Fehon), which carried a modified multiple cloning site (kind gift of A. Hudson), to form pP{Ub-pBACtrans}. To make an inducible piggyBac transposase source, phspBac was digested with EcoRI and DraI and the fragment containing both the Drosophila hsp70 promoter and piggyBac transposase ORF was ligated into the EcoRI and StuI sites of pCasper4 to form pP{hsp70-pBACtrans}. These vectors were used to transform y w flies, using standard P-element transformation techniques.

To test the activity of the piggyBac transposase transgenes, single-pair matings were set up using the pBAC{HpaI; y+}24.3 insertion, which mapped to the X chromosome, and pP{Ub-pBACtrans} or pP{hsp70-pBACtrans} stocks. The pBAC{HpaI; y+}24.3 insertion was mobilized in males that were then outcrossed to y w females. Phenotypically yellow+ males in the next generation were scored as new insertions. The pP{hsp70-pBACtrans} was able to remobilize the pBAC{HpaI; y+}24.3 insert in 43% (n = 30) of single-pair matings tested whereas the pP{Ub-pBACtrans} was able to remobilize the pBAC{HpaI; y+}24.3 insert in 48% (n = 21) of single-pair matings tested. The pBAC{HpaI; y+}24.3 insertion was remobilized in the presence of a CyO balancer chromosome. New insertions that did not express EGFP and mapped to the CyO chromosome were used in the pBAC-based protein trap screen.

Generation of embryos with novel transpositions:

We isolated new EGFP-expressing P-element insertions using the following genetic scheme: [Embedded image: crossing scheme]

New pBAC insertions were generated through a similar genetic scheme: [Embedded image: crossing scheme]

Hereafter, P-element and piggyBac element-based protein trap lines were treated the same. For the F1 cross several hundred males and females of the appropriate genotypes were mated in bottles to produce several thousand males in which the elements were mobilized for the F2 cross. These males were crossed to 8000–10,000 virgin y w females in a population cage. These virgin females were obtained using a virgining stock that carried a heat-shock-inducible hid transgene on the Y chromosome (kind gift of R. Lehmann). Two separate overnight embryo collections from each population cage were screened for EGFP expression. We limited the number of times we screened embryos from a particular cage to minimize the number of identical insertions recovered due to premeiotic insertion events.

Embryo sorting and line establishment:

We screened for EGFP expression in embryos using a COPAS Drosophila embryo sorter (Union Biometrica). Embryos were dechorionated in 50% bleach for 2.5 min and washed extensively with water. Dechorionated embryos were then washed into sorting solution (0.5× PBS, 2% Tween-20). With the exception of the sorting solution, the COPAS sorter was used according to the manufacturer's protocol, using the manufacturer's solutions. The sorter and sample pressures of the COPAS machine and embryo density were maintained so that the COPAS sorter screened 15–20 embryos/sec. The sorter used a 488, 514 nm multiline argon laser. EGFP fluorescence was detected using PMT1 set to 510 nm. Red fluorescence, used as a measure of embryo autofluorescence, was detected using a second PMT set to 580 nm. Baseline values for each fluorescent axis were set empirically using previously isolated fly strains that express low levels of EGFP and y w non-EGFP expressing embryos (Figure 1). Approximately 250,000 embryos were sorted in five 50,000-embryo batches per day. Sorted embryos were collected and washed in dH2O. All the embryos from a single batch were placed together in standard food vials. We estimate that ∼80% of the sorted embryos survived to adulthood. Sorted flies that survived to adulthood and did not carry the Ki, P{ry+t7.2; Δ2-3}99B or P{w+; pBACtrans} chromosomes were individually outcrossed to a y w stock. New lines that carried EGFP-expressing insertions that did not map to the starting CyO chromosome were maintained as stocks.
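As a rough check on the sorting workload, the daily embryo count and the sorting rate quoted above can be converted into sorter time. This is only a back-of-the-envelope sketch: it assumes continuous sorting at the stated rate, with no allowance for loading, washing, or downtime, and the names used are illustrative.

```python
# Sorter time implied by the figures quoted in the text.
EMBRYOS_PER_DAY = 250_000   # five batches of 50,000 embryos
RATE_RANGE = (15, 20)       # embryos screened per second (stated range)

def sorting_hours(n_embryos: int, rate_per_sec: float) -> float:
    """Hours of continuous sorting needed to screen n_embryos."""
    return n_embryos / rate_per_sec / 3600

# The slower rate gives the longer time, the faster rate the shorter one.
slow_hours = sorting_hours(EMBRYOS_PER_DAY, RATE_RANGE[0])
fast_hours = sorting_hours(EMBRYOS_PER_DAY, RATE_RANGE[1])
print(f"{fast_hours:.1f}-{slow_hours:.1f} h of sorting per day")
```

Under these assumptions a full day's batches occupy the sorter for roughly 3.5 to 4.6 hr.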

DNA sequencing, RT–thermal asymmetric interlaced PCR analyses, and prediction of fusion potential:

Genomic sequences flanking either P-element or piggyBac protein trap insertions were determined by members of the Lawrence Berkeley Lab group using an established protocol for sequencing inverse PCR products from genomic DNA (Bellen et al. 2004). Database software developed for the annotation of the Drosophila gene disruption project (Bellen et al. 2004) was used to manage the sequence data. Once the insertion site of a given protein trap line was determined, a FileMaker Pro database that contained information [version 3.2 of the Drosophila genome annotation (Misra et al. 2002)] for all Drosophila transcripts, exons and introns, and their reading frames was used to predict which gene(s) and transcript(s) were being trapped by a given protein trap insertion.
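The frame prediction described above can be illustrated with a short sketch. The phase of an intron is the number of bases of an incomplete codon carried across it, which follows from the cumulative coding length of the upstream exons. The mapping of vector frames A, B, and C to phases 0, 1, and 2 below is a hypothetical placeholder, not taken from the vector descriptions, and the function names are illustrative; the actual predictions were made with the FileMaker Pro database.

```python
# Hypothetical mapping of vector frame to the intron phase it is compatible with.
# ASSUMPTION: the true A/B/C-to-phase correspondence is not stated in the text.
VECTOR_PHASE = {"A": 0, "B": 1, "C": 2}

def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1, or 2) of the intron following each coding exon.

    Lengths count coding bases only, starting at the ATG. The last exon has
    no downstream intron, so it contributes no phase.
    """
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

def predicted_in_frame(vector_frame, intron_phase):
    """True if an EGFP exon of this frame would keep the reading frame."""
    return VECTOR_PHASE[vector_frame] == intron_phase

# Example: a made-up gene with coding exons of 100, 50, and 200 bases.
phases = intron_phases([100, 50, 200])
print(phases, [predicted_in_frame("B", p) for p in phases])
```

Because the EGFP exon lacks start and stop codons and carries the same phase at both splice sites, a match at the insertion intron is enough to keep the downstream exons in frame in this simplified model.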

We developed a reverse transcriptase coupled thermal asymmetric interlaced PCR (RT–TAIL) protocol largely on the basis of methods used to determine T-DNA insertion sites in Arabidopsis (Singer and Burke 2003). This method allowed us to determine the mRNA sequence adjacent to the EGFP exon without using gene-specific primers. Total RNA was isolated from 15 adult flies using an RNAqueous-96 automated kit (catalog no. 1812; Ambion, Austin, TX). The samples were ground in 200 μl of sample buffer and spun for 5 min at 14,000 rpm. The supernatant was placed in a 96-well plate and 100 μl of 100% EtOH was mixed with each sample. The sample was transferred to the filter plate, washed, and then treated with DNase I (Ambion) for 15 min. Rebinding buffer was added to each well of the filter plate, and the plate was washed extensively. The RNA was eluted off the filter and precipitated with 7.5 M LiCl solution (Ambion). The resulting RNA pellet was washed with 75% EtOH and then retreated with DNase I for 30 min at 37°. DNase inactivation reagent (Ambion) was added to the samples. The RNA samples were spun and the supernatant was transferred to a new plate. A detailed protocol is available upon request.

The following GFP-specific primers were used for RT–TAIL PCR:







The arbitrary degenerate (AD) primers used in this study were originally described by Singer and Burke (2003) but are listed here for convenience:





A pool of the AD primers was mixed according to Singer and Burke (2003).

The first round of RT–TAIL PCR was set up in 96-well format using a one-step RT–PCR kit (QIAGEN, Valencia, CA). For every reaction 5 μl of total RNA was mixed with 10 μl 5× buffer, 2 μl 10 mM dNTP solution, 1 μl GFP-For1 or -Rev1 primer, 12.5 μl AD primer mix, 2 μl enzyme mix, and 17.5 μl of dH2O. The reverse transcription reaction was carried out at 50° for 30 min. The sample was then heated to 95° for 15 min and then cycled for primary TAIL–PCR according to Singer and Burke (2003), using an MJ Research (Watertown, MA) thermal cycler. The secondary and tertiary TAIL–PCR reactions were carried out according to Singer and Burke (2003), using GFP-For2 or -Rev2 primers and GFP-For3 or -Rev3 primers, respectively, and regular Taq DNA polymerase (Roche, Indianapolis). The PCR products of the tertiary reaction were treated with exoSAP (United States Biochemical, Cleveland) and sequenced using GFP-For3 or -Rev3 primers.
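The component volumes listed for the first-round reaction can be checked for internal consistency. The sketch below sums them and derives the final buffer and dNTP concentrations, reading the dNTP stock as 10 mM; the dictionary keys are illustrative labels only.

```python
# Volumes (in microliters) for one first-round RT-TAIL PCR reaction, as listed.
components_ul = {
    "total RNA": 5.0,
    "5x buffer": 10.0,
    "10 mM dNTP": 2.0,
    "GFP-For1 or -Rev1 primer": 1.0,
    "AD primer mix": 12.5,
    "enzyme mix": 2.0,
    "dH2O": 17.5,
}

total_ul = sum(components_ul.values())

# Final concentrations follow from simple dilution: C_final = C_stock * V / V_total.
buffer_final_x = 5 * components_ul["5x buffer"] / total_ul
dntp_final_um = 10_000 * components_ul["10 mM dNTP"] / total_ul  # 10 mM = 10,000 uM

print(f"total {total_ul} ul, buffer {buffer_final_x}x, dNTP {dntp_final_um} uM")
```

The volumes add up to a 50-μl reaction in which the 5× buffer is diluted to 1× and the dNTP stock to 400 μM.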

The RT–TAIL PCR protocol using the three GFP-For primers, which amplified off of the 3′ end of EGFP, consistently yielded better results than the same reaction using the Rev primers. Therefore most of the RT–TAIL PCR data define splicing products at the 3′ end of the EGFP sequence. To identify sequence fusing to the 5′ end of EGFP, we employed a 5′ RACE kit according to the manufacturer's protocol (Ambion), using the EGFP reverse primers listed above.

Analysis of protein expression in tissues:

Samples were dissected in Grace's medium, placed in 48-well plates outfitted with a nylon mesh bottom, and fixed in 4% paraformaldehyde buffered in 1× PBS for 10 min at room temperature. The plate was washed extensively with PBT (1× PBS, 0.5% Triton X-100, 0.3% BSA) and incubated overnight at 4° with rabbit anti-GFP antibody (Torrey Pines) (1:2000) in PBT. The samples were then washed extensively with PBT and incubated with goat anti-rabbit Alexa488 (Molecular Probes, Eugene, OR) (1:400) for 4 hr at room temperature. The samples were then washed with PBT, stained with 2 μg/ml of DAPI, and mounted in Vectashield (Vector Laboratories, Burlingame, CA). Images were collected using a Leica SP2 confocal microscope.


Generating a large initial collection of tagged strains expressing EGFP:

Our initial strategy was to generate a much larger number of lines containing new protein trap vector insertions than in previous screens and to institute additional technical improvements. Because of their proven utility, we used the same P-element-based protein trap vectors employed by Morin et al. (2001), but we also constructed a similar set of vectors with piggyBac (Figure 1A). As described in materials and methods, we set up crosses in small population cages to limit the recovery of clusters, utilized dominant markers to remove the transposase source from all new lines, and identified rare GFP-expressing embryos rapidly and sensitively using an automated embryo sorter (Figure 1B). This protocol allowed us to screen >60 million embryos over a period of 2.5 years, to identify >7500 “green” embryos, and to use each one to start an individual culture (see Table 1). Ultimately, 7404 strains were successfully established, maintained by selection for white+ eye color, and analyzed further as diagrammed in Figure 1C.

Figure 1.—

Generation and classification of protein trap vector insertions. (A) Schematic of protein trap vectors (after Morin et al. 2001). (B) Sample output from automated sorting of Drosophila embryos mobilized from a site not expressing GFP. Rare GFP+ embryos (red circles) registering above a threshold value are diverted by the machine and later used to start individual cultures. (C) Scheme for characterization of putative protein trap lines (see text). (D) Classification of the general types of relationships between transposon inserts and the local genome annotation. Classes 1–4 consist of insertions in the appropriate orientation located within a coding intron (class 1), a noncoding transcribed region (class 2), an upstream genomic region (class 3), or an exon (class 4). For each class, the insert was either of the appropriate frame (subclass A) or of nonappropriate frame (subclass B) to fuse to the protein if splicing continued to the next annotated exon splice acceptor site. Class 5 consists of transposons inserted >0.5 kb from a correctly oriented annotated gene. (E) The structure of cryptic transcripts initiated within the Drosophila mini-white marker gene that contain an ATG codon and splice in frame to EGFP, thereby allowing expression independent of an endogenous transcript in some lines. (F) Western blot analysis of Dlg1 and eIF-4E protein production in control animals (y w, CC00380) and insertion lines predicted to trap Dlg1 (CC01936) or eIF-4E (CC00392, CC00375, and CC01492). (G) Abnormal nuclear accumulation of CG15015-EGFP in line CC01311 whose insertion lies within the FHC domain (left). Tissue culture cells expressing N-terminal or C-terminal fusions are found in the cytoplasm (center and right).

Table 1.—

Project summary

The same scheme was used with both transposons; however, in practice the piggyBac vectors were not nearly as efficient at generating EGFP-positive candidate lines as the P-element vectors (Table 1). P-element vectors typically exhibited 70% mobilization and generated ∼1 EGFP-expressing embryo per 1000 sorted. In comparison, the piggyBac vectors displayed nearly 50% mobilization, but they yielded only 1 EGFP-expressing embryo per 50,000 sorted. Thus, the piggyBac vectors were slightly less efficient at mobilization, but drastically less efficient at generating EGFP-positive lines upon insertion. Consequently, we soon abandoned attempts to generate large numbers of piggyBac protein trap insertions (Table 1), but continued to characterize the lines we did recover to learn if they would shed any light on the lower frequency of trapping observed.
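The relative contributions of mobilization and trapping to this difference can be separated with a rough calculation. The sketch below assumes, as a simplification not made in the text, that the yield of EGFP-expressing embryos scales linearly with the mobilization rate; the variable names are illustrative.

```python
from fractions import Fraction

# Figures quoted in the text for the two vector types.
p_mobilization = Fraction(70, 100)      # P element: ~70% mobilization
p_gfp_per_sorted = Fraction(1, 1000)    # ~1 EGFP+ embryo per 1000 sorted
pb_mobilization = Fraction(50, 100)     # piggyBac: nearly 50% mobilization
pb_gfp_per_sorted = Fraction(1, 50_000) # 1 EGFP+ embryo per 50,000 sorted

# Raw difference in yield of EGFP+ embryos per embryo sorted.
raw_fold = p_gfp_per_sorted / pb_gfp_per_sorted

# Part of that difference attributable to mobilization alone.
mobilization_fold = p_mobilization / pb_mobilization

# ASSUMPTION: yield is proportional to mobilization rate, so the remainder
# reflects how often a new insertion produces detectable EGFP.
trapping_fold = raw_fold / mobilization_fold

print(f"raw {float(raw_fold):.0f}x, mobilization {float(mobilization_fold):.1f}x, "
      f"per-insertion trapping {float(trapping_fold):.1f}x")
```

On these assumptions the 50-fold difference in yield breaks down into a 1.4-fold difference in mobilization and a roughly 36-fold difference in the frequency with which new insertions express EGFP.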

Localizing insertions on the annotated genome:

To identify candidate proteins that may have been fused within individual lines, we determined the genomic DNA sequence flanking the insertion(s) in collaboration with the Berkeley Drosophila Genome Project (BDGP) gene disruption project (materials and methods). In most cases, the sequences from both the 5′ and 3′ vector end junctions mapped by BLAST analysis to a unique insertion site within the Drosophila genome sequence. Lines for which the sequencing reaction failed, the sequence matched repetitive DNA, or the 5′ and 3′ sequences differed (indicating that two or more insertions were present in the stock) were recycled back into the starting pool, and frequently a unique single insert was eventually identified. Altogether the insertions in 1375 C frame, 3172 B frame, and 1009 A frame P-element and 164 piggyBac A frame lines were localized to unique genomic sites.

Knowing the genomic location of an insertion allowed us to predict which transcripts would incorporate the EGFP exon and whether they would undergo splicing and translation into a functional fusion protein. First, we removed ∼1550 duplicate lines derived from premeiotic clusters that were identified because they bore insertions identical in position and orientation to those in one or more sibling lines. Of the 4170 independent lines remaining, 2149 (52%) were associated with a gene correctly oriented for possible fusion (i.e., located between −500 and the 3′ end). We also classified the ways an insert can be located relative to its closest annotated transcript into general categories as diagrammed in Figure 1D and classified all the lines (Table 2).
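The line counts given in this section can be reconciled with a few lines of arithmetic. The sketch uses only the numbers quoted in the text; the duplicate count is approximate in the original (∼1550), so the agreement is a consistency check rather than an exact audit.

```python
# Lines localized to unique genomic sites, by vector and frame.
localized = {
    "P element, C frame": 1375,
    "P element, B frame": 3172,
    "P element, A frame": 1009,
    "piggyBac, A frame": 164,
}
total_localized = sum(localized.values())

duplicates_removed = 1550   # approximate, from premeiotic clusters
independent = total_localized - duplicates_removed

gene_associated = 2149      # lines near a correctly oriented gene
gene_associated_pct = 100 * gene_associated / independent

print(total_localized, independent, f"{gene_associated_pct:.0f}%")
```

The totals agree with the 4170 independent lines and the 52% figure reported above.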

Table 2.—

Line types

Expression of a fusion protein is expected when the GFP exon resides between two coding exons within an intron of matching reading frame (class 1A). Such insertions made up only 23% of the total localized insertions and defined 192 different genes (Table 4). Forty percent of insertions were close to an annotated gene but were not predicted to express the EGFP exon (classes 2–4), while 37% of the lines were not located within 0.5 kb of a correctly oriented gene. EGFP production from these lines might be explained by the use of unannotated genic elements, by noncanonical splicing events, or by the presence of a second insertion at a canonical site.
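The classification scheme of Figure 1D can be expressed as a small lookup. This is an illustrative sketch only: the type and function names are hypothetical, the caller is assumed to have already determined the insert's position relative to the nearest correctly oriented gene, and it is not the database procedure used in the project.

```python
from enum import Enum
from typing import Optional

class Region(Enum):
    """Position of an insert relative to a correctly oriented gene (classes 1-4)."""
    CODING_INTRON = 1
    NONCODING_TRANSCRIBED = 2
    UPSTREAM = 3
    EXON = 4

def classify_insertion(region: Optional[Region], frame_compatible: bool) -> str:
    """Return the class label for an insert.

    region is None when no correctly oriented annotated gene lies within
    0.5 kb (class 5). Subclass A marks an appropriate frame, B otherwise.
    """
    if region is None:
        return "5"
    subclass = "A" if frame_compatible else "B"
    return f"{region.value}{subclass}"

def fusion_expected(label: str) -> bool:
    """Only class 1A inserts are predicted to yield an in-frame fusion."""
    return label == "1A"

print(classify_insertion(Region.CODING_INTRON, True),
      classify_insertion(Region.EXON, False),
      classify_insertion(None, False))
```

Keeping the region and the frame as separate inputs mirrors the two-part labels used in the text, such as class 1A.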

The Drosophila genome annotation is highly supported by experimental evidence (Misra et al. 2002); hence the frequency of these discrepancies was surprising, but a similar outcome has been reported in previous protein trap screens using both yeast (Ross-Macdonald et al. 1999) and Drosophila (Morin et al. 2001). Some protein-coding genes may have been missed within cDNA libraries (Hild et al. 2003), and a substantial number of genes may contain unannotated far-upstream promoters and alternative translation start sites. There might be a large class of RNA genes that have escaped detection but that can drive expression of EGFP using cryptic start sites. Alternatively, the high selective pressure used to isolate EGFP+ strains may have led to the recovery of rare events in which the normal gene or transcript structure has changed.

Analysis of fusion transcripts:

We sequenced portions of the fusion transcripts to address how EGFP expression arises in lines of various classes. A limited number of 5′ RNA sequences were obtained by 5′ RACE analysis or by TAIL–PCR (see materials and methods). As expected, several lines in class 1A were found to initiate in normal exons upstream from the insertion site. However, we discovered several CB lines that spliced into the EGFP exon from the 5′ P-element sequences of the vector. The P-element promoter is highly efficient at enhancer trapping, and the entire first exon of the transposase gene is present in the vector along with the start of intron 1, which is in the frame compatible with CB lines. These lines were associated with nuclear localized EGFP, possibly due to fusion of the first exon of P transposase with EGFP. Further evidence of enhancer trapping was observed in the analysis of line CA07138. The EGFP RNA was fused to sequences, including an in-frame ATG start codon, derived by transcription and splicing from the noncoding strand of the mini-white transgene carried in the P-element vector (Figure 1E). These observations suggest that EGFP expression in a significant number of the lines depends on transcripts initiated from within the transposon itself by enhancer trapping rather than on EGFP exon addition by splicing into exogenous transcripts.

Analysis of downstream transcript sequences:

To gain additional information, we analyzed the sequences downstream from the EGFP exon from many lines in the collection by carrying out RT–TAIL–PCR in a 96-well format (materials and methods). 3′ sequences up to 700 bp in length and defining the location of one to six downstream exons were obtained for >1200 lines (Table 3). The pattern of downstream splicing allowed productive fusions to be identified and indicated lines that splice out-of-frame and likely become subject to nonsense-mediated decay (NMD) (Vasudevan and Peltz 2003). Fusions within lines of classes 2–4 often were predicted to encode a “linker peptide,” which might or might not include a stop codon, derived from the translation of a small segment of upstream nucleotides. The RNA analysis also revealed the presence of a second insertion in 14% of class 1A lines, but in 24–50% of lines in the other classes. The second insertions found within class 2–5 lines were often valid protein trap alleles (class 1A) and were frequently the true source of the lines' EGFP production. This information allowed us to identify additional candidate fusions (Table 4), to correct many initial line classifications, and to more accurately estimate the total number of trapped genes (Table 1: 600–900). By the time lines were selected and balanced, secondary insertions or damage did not contribute substantially to the phenotypes reported in Table 4. Tests estimated the frequency of background lethal mutations among balanced, saved lines at 7–21%, similar to the best transposon screens (Spradling et al. 1999).

Table 3.—

RNA analysis

Table 4.—

Identified trapped proteins

Novel splicing suggests new genes and exons:

We compared the splicing observed downstream from the inserted exons with that of the genome annotation (Misra et al. 2002) to identify new Drosophila gene and transcript isoform candidates. To identify new candidate genes, we focused on lines inserted >0.5 kb from an appropriately oriented known gene and for which RNA sequence data were also available. In 114 of these 205 lines, the RNA sequence coincided with the position and orientation of the insertion and therefore indicated the splicing pattern downstream of the single EGFP exon. Most of the lines spliced to one or more novel exons. At least 8 probably correspond to unannotated genes because they match previous gene predictions (Hild et al. 2003) or are supported by EST data (see Table 4). Most of the remaining exons do not predict proteins with homologs in other species and represent either aberrant splicing events or novel or untranslated exons.

Similar analysis of 297 lines with intron insertions allowed us to test for novel exons and transcript isoforms. We examined 443 splicing events and identified a total of 35 (7.9%) that did not correspond to current gene models (Misra et al. 2002). Since at least some of these differences probably resulted from aberrant splicing induced by the insertions, this represents a maximum estimate of the fraction of unannotated genomic exons and emphasizes the high accuracy and completeness of current Drosophila gene models, at least for abundant transcript forms. Often, the RNA data indicated which isoform among several predicted to fuse in frame is likely to predominate in ovarian tissue. For example, we could determine that line CA06613 in ovarian tissue predominantly fuses the Su(Tpl) gene rather than Mi-2, in whose transcription unit it also lies in frame.

The nature of the noncanonical splices observed was interesting. The most common events (21/35) were for insertions in large introns to splice to a novel exon(s) prior to joining the predicted downstream exon. Some of these events may have been induced by the abnormal position of the EGFP exon within the primary transcript. However, several lines appear to define alternative isoforms, because they skip exons or utilize combinations of known exons not found in any previously documented transcript isoform. Three lines utilized 5′ start sites for exons that differed by 6, 21, or 27 bp from the annotated exon. The CC01473 transcript reads through an annotated exon into the adjacent intron and probably defines a novel alternate transcript 3′ end. Although we consider it likely that many of these differences reflect endogenous Drosophila gene expression, all of the candidate novel genes and transcript isoforms require independent confirmation in strains lacking protein trap insertions. Such tests were beyond the scope of our project.

Protein trap insertions likely vary in the fraction of the endogenous protein that is tagged with EGFP for a variety of reasons. First, in many lines only some of the multiple transcript isoforms contributing to protein production are tagged by the insertion and fused in frame. Second, the splicing efficiency of the EGFP exon might vary due to its surrounding genomic context. To investigate this issue, we analyzed the protein products of tagged genes by Western blotting. The tagged proteins were easily distinguished from their wild-type counterparts on the basis of size and by probing with protein-specific and anti-EGFP antibodies (Figure 1F). In line CC01936 all three isoforms are predicted to incorporate the EGFP exon in frame, and nearly all of the ovarian Dlg1 protein incorporated EGFP as indicated by its mobility. A similar result was reported previously in the case of line CB02119 (Buszczak and Spradling 2006), where the precursors of five of six annotated transcripts are predicted to contain the insertion, although only two fuse in frame. In contrast, only ∼50% of the ovarian eIF-4E protein is tagged with EGFP (Figure 1F) despite the fact that six of seven annotated eIF-4E transcripts initiate upstream from the insertion site. These examples suggest that the EGFP incorporation level varies between genes in large part due to the tagging of a subset of gene isoforms that themselves display differing expression levels.

In contrast, we found little evidence of short-range context effects. Less than twofold variation in protein expression as measured by Western blotting with anti-EGFP antibodies was observed between lines with insertions at sites within the same intron (N. Srivali and A. Spradling, unpublished data). However, these experiments did reveal that insertions of the piggyBac-based vector consistently produced less EGFP protein than lines with the corresponding P-element-based vector that were inserted in the same intron (N. Srivali and A. Spradling, unpublished data). This suggests that some aspect of the structure of the piggyBac vector used compromised splicing efficiency.
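
The dependence of EGFP incorporation on which isoforms are tagged can be made concrete with a small calculation. The following Python sketch is illustrative only: the function name `tagged_fraction` and the isoform abundances are hypothetical and are not measurements from this study. It assumes that the tagged fraction equals the summed relative abundance of the in-frame tagged isoforms divided by the total, and it ignores differences in splicing efficiency or protein stability.

```python
def tagged_fraction(isoforms):
    """Return the expected fraction of total protein that carries EGFP.

    `isoforms` is a list of (relative_abundance, tagged_in_frame) pairs.
    This is a simplifying assumption: protein output is taken to be
    proportional to the relative abundance of each isoform.
    """
    total = sum(abundance for abundance, _ in isoforms)
    if total <= 0:
        raise ValueError("total isoform abundance must be positive")
    tagged = sum(abundance for abundance, in_frame in isoforms if in_frame)
    return tagged / total


# Hypothetical gene: two tagged isoforms and one abundant untagged isoform.
example = [(2.0, True), (1.0, True), (3.0, False)]
print(tagged_fraction(example))  # 0.5
```

Under these assumptions, a gene in which every isoform is tagged in frame gives a fraction of 1, consistent with the near-complete tagging of Dlg1 in line CC01936, whereas an abundantly expressed untagged isoform lowers the fraction.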

Identification of protein trap and enhancer trap core collections:

To help identify a core set of valid gene trap lines we examined the EGFP expression patterns of many nonredundant lines in both the adult ovary and larval salivary gland. There was a strong correlation between insert location and the nature of the staining patterns observed. More than 95% of lines in class 1A, the in-frame fusions, produced patterned EGFP expression above background in at least some ovarian cells or in the salivary gland. In contrast, a much smaller, but still significant, fraction of lines in classes 2–5 also expressed EGFP in a regulated manner. Combining information on insert location, genome annotation, RNA transcript sequence, and EGFP pattern, we identified a set of 244 lines predicted to produce fusions between EGFP and 431 protein isoforms of 233 distinct genes (Table 4). These new protein trap lines express EGFP in a wide variety of cellular compartments under diverse developmental controls (see Figures 2 and 3).

Figure 2.—

Protein traps for the study of protein subcellular localization. Patterns of subcellular localization of EGFP expressed from the following lines that trap the indicated genes were observed: (A) cytoplasmic, CA06607 (CG17342); (B) nuclear, BA00164 (dom); (C) enhancer trap nuclear, CB04353 (Dad) in stem cells and early cystocytes; (D) endoplasmic reticulum, CA06523 (Rtnl1); (E) extracellular, CA06735 (cathepsin K); (F) membrane, CA07474 (Picot); (G) apical, CC01941 (Baz); (H) chromatin, CA07249 (stwl); (I) nuclear membrane, CA07301 (Fs(2)Ket); (J) lipid droplets, CA07051 (Lsd-2); (K) novel structure, CA07332 (CG6854); (L) novel structure, CC01326 (polo).

Figure 3.—

Protein traps for the study of developmental regulation during oogenesis. The expression in the ovary of various protein trap lines is shown to illustrate how they can be used to associate genes with developmental processes. (A) Schematic of an ovariole tip. The terminal filament (TF), cap cells (CpC), germline stem cells (GSC), cystoblast (CB), and escort cells (ES) are illustrated. (B–H) Cell type identification. (B) Terminal filament, CB02069 (CG14207); (C) cap cells, CB03410 (CG1600); (D) escort cells, CC01359 (fax); (E) follicle cells, CC06135 (CG12785); (F) outer border cells and posterior follicle cells, CB02349 (NK7.1); (G) oocyte nucleus equals the germinal vesicle (arrow), CB04219 (CG13776); (H) novel sheath cell type, CC01646 (CG12920). (I–L) Analyzing developmental processes. (I) Novel structure in center of midstage follicle, CC00523 (Msn); (J) fusome, CC01436 (Sec61); (K) germline and somatic ring canals, CA07004 (Vsg); (L) chorion gene amplification, CB04400 (Orc2). (M–P) Localization of proteins in the oocyte. (M) Posterior pole, CC01442 (EIF-4E); (N) posterior pole, CA06517 (Tral); (O) posterior pole, CC00236 (CG32423); (P) anterior pole, CA07529 (CG6151). (Q–T) Developmental regulation of gene expression in early germ cells. (Q) Control with little change, CC01961 (Actn); (R) GSC/CB enriched, CC06238 (CG11963); (S) GSC and early cyst enriched, CC01915 (Eff); (T) GSC and forming cyst enriched, CC01442 (EIF-4E).

A second major class of lines in the collection showed the properties expected of EGFP enhancer traps (Table 5). These CB lines were susceptible to enhancer trapping from the P-element promoter, were located mostly upstream of the annotated start site, expressed EGFP in nuclei, and showed no in-frame downstream fusion in the RNA analysis, where available. The expression patterns of such insertions in well-characterized genes supported this interpretation. For example, line CB02030 in ptc showed strong expression in the inner sheath cells of the germarium (Forbes et al. 1996), while line CB04353 in Dad was strongly expressed in the germline stem cells and immediately downstream germ cells (Kai and Spradling 2003; Casanueva and Ferguson 2004).

Enhancer trap alleles

The characterization of a significant number of lines in the collection remains incomplete (Table 1). Many of these contain insertions located >0.5 kb from an appropriately oriented annotated gene, for which RNA sequence data were not obtained. Others are inserted within genes at locations not predicted to generate protein fusions or gene traps. Some of these lines express EGFP in the ovary, and we cannot rule out that others express transcripts in other tissues. Our experience in processing previous lines in these same classes suggests that a significant number of new enhancer traps and a handful of new protein traps could be sorted out from a larger number of lines with secondary insertions in already trapped genes. Consequently, the number of different genes trapped in the collection is likely to increase beyond the 600 or so currently characterized.

Even when a protein is tagged in frame, the insertion of the EGFP sequence is expected to disrupt its normal structure and localization some fraction of the time. For example, line CC01311 traps CG15015, the Drosophila homolog of mammalian Cip4, a modular protein that interacts with Cdc42 and helps to regulate the actin cytoskeleton (Aspenstrom 1997). The CC01311 P element is inserted between the first two coding exons and thus disrupts the FES/Cip4 domain of CG15015 (Figure 1G). The protein trap fusion product localizes to the nucleolus while transgenes of CG15015 tagged at either the very N or C termini localize to the cytoplasm when expressed in S2 cells. We observed that 3 other protein trap fusion products of 107 analyzed accumulated in the nucleus when they were expected to reside in the cytoplasm.

Diverse behavior of tagged proteins:

The 244 identified protein trap lines of the core collection exhibit extremely diverse patterns of EGFP expression, suggesting that proteins occupying a wide range of cellular compartments can be tagged in vivo. We observed many lines with EGFP fluorescence in the cytoplasm (Figure 2A) or nucleus (Figure 2B) as expected. Localization to intracellular membranous structures was also commonly seen, as illustrated by a trap in the ER structural component Reticulon-1 (Figure 2D). Gene trapping of secretory proteins is thought to be inefficient due to retention of the fusion proteins in the ER where the activity of the fusion gene may be affected (Skarnes et al. 1995). The full-length protein traps we constructed could label secreted proteins, as indicated by the extracellular localization of EGFP in CA06735, a fusion with CG8947, a Drosophila cathepsin K homolog (Figure 2E), and the membrane location of a trap in Picot, a phosphate symporter (Figure 2F). Proteins that are apically localized in polarized epithelia such as ovarian follicle cells were easily visualized, as observed for Bazooka (Par3) (Figure 2G). Subcompartmentalized nuclear proteins were also readily apparent. For example, a trap of the Stonewall (Stwl) HMG-related protein involved in germ cell chromatin organization (Clark and McKearin 1996) labeled nurse cell nuclei and the oocyte nucleus (inset) differently (Figure 2H). Fs(2)Ket, a protein involved in nuclear import, was localized to the nuclear periphery (Figure 2I). In some cases, cell-specific cellular compartments were labeled, such as in CA07051, which traps Lsd-2 and exhibited EGFP localization to lipid droplets that arise in late-stage nurse cells (Figure 2J).

These studies provide a high-resolution view of known protein locations in living cells and also identify many proteins that were not previously known to reside within these compartments. In addition, the value of protein trapping as a discovery tool was illustrated by the fact that we observed new patterns of localization as well. For example, the HDAC4 protein, fused by the CA07134 trap, labeled a small body often found in only one nurse cell within an egg chamber (Figure 2K). A spindle-like structure in young nurse cells was labeled with a protein trap in the polo gene encoding a mitotic kinase (Figure 2L). Antibodies specific for the trapped protein can be used to isolate and further investigate the proteins present in these novel structures.

Analysis of developmental regulation—ovarian cells:

All the lines in the core collection were characterized on the basis of their patterns of expression in germ cells and follicle cells during oogenesis. These experiments identified lines expressing in the major classes of somatic cells, including terminal filament cells (Figure 3B), cap cells (Figure 3C), escort cells (Figure 3D), prefollicle cells (Figure 3E), and border and posterior cells (Figure 3F). Other lines expressed in germ cells of various ages, including some that were highly enriched in the oocyte nucleus (germinal vesicle) (Figure 3G). As in the case of subcellular compartments, these studies documented patterns of developmental expression for many genes that were not previously known. These genes become attractive candidates for study of their function in the corresponding processes.

Strikingly, the collection also revealed the likely existence of new cell types and novel biological processes previously unrecognized despite many years of study of ovarian biology. Line CC01646 traps the CG12920 protein and is expressed in a small subset of ovarian sheath cells that likely represent a novel cell type (Figure 3H). In line CC00523 we observed accumulation of Msn-EGFP preferentially at the center of midstage growing follicles (Figure 3I). It was not previously known that this region was the site of unique protein accumulation. Msn encodes a protein involved in Jun kinase signaling, suggesting that a special intercellular junction may assemble in this region to structurally organize the nurse cells. We observed a similar expression program (not shown) for the line CB03040 that fuses the Pli gene, encoding a protein associated with the NF-κB signaling response.

Developmentally specific subcellular structures, including the fusome (Figure 3J, arrow) and both somatic and germline ring canals (Figure 3K), were also labeled by rare lines. A general and extremely useful application of the collection is to identify new proteins that are associated with such structures and analyze the effect of mutations. For example, the preferential accumulation of Sec61 in the fusome observed here has been validated in recent studies (Snapp et al. 2004). Proof of principle experiments of this type that focus on the fusome will be described elsewhere.

Another valuable capacity of protein trap lines is the ability to follow important developmental processes at high resolution and in living cells. During oogenesis, at least four major clusters of chorion genes undergo specific gene amplification in stage 10B follicles, a process that can be visualized as small “amplification dots” of BrdU incorporation (Calvi et al. 1998). The amplifying genes specifically contain substantial amounts of replication initiation proteins such as Orc2 at this time, whereas normally Orc2 is found throughout the cell nuclei (Royzman et al. 1999). A protein trap line in Orc2 allows the amplifying loci to be directly visualized (Figure 3L). Inspection shows that the dots are not present in preamplification stage follicle cells but strongly label amplifying gene loci at stage 10B (Figure 3L, arrow).

The Drosophila oocyte represents an important model system for studying RNA and protein localization. Several biochemical and genetic studies have identified proteins enriched at either the anterior or the posterior pole of the oocyte (Lasko 2003; Wilhelm and Smibert 2005). Protein traps in genes identified in these studies, including EIF-4E (Figure 3M) and Tral (Figure 3N), faithfully recapitulate the localization patterns of their endogenous counterparts to the posterior pole of the oocyte (Wilhelm et al. 2003, 2005). Several other proteins tagged in the collection display posterior localization patterns including a trap in CG32423 (Figure 3O), a largely uncharacterized RNA-binding protein. In addition, a trap in CG6151 appears to be enriched at the anterior of the oocyte (Figure 3P). While future work will clarify the role of these proteins in oocyte patterning, these examples show that protein trapping can complement other approaches and be used to identify new components of localized RNP complexes within the cells.

Protein traps provide unique opportunities to analyze gene regulation during development in populations of cells that are difficult to isolate and in those that are sensitive to loss of cellular context. We illustrate the potential of this approach using the regulation of germ cell development within and just downstream from the germline stem cells (GSCs). Many protein trap lines, including CC01961 in Actn, showed uniform expression in GSCs, CBs, and developing germline cysts (Figure 3Q). However, it was possible to find other examples where expression levels between GSCs and early germ cells differed from those in other germ cells within the germarium. One of the most striking examples was line CC06238 that traps the putative Drosophila succinate CoA ligase gene. Expression was stronger in stem cells (and sometimes early cystoblasts) than in later germ cells as illustrated in Figure 3R. Several other genes were downregulated shortly after GSC division, including effete (Figure 3S), encoding the UbcD1 ubiquitin-conjugating enzyme that has been shown to affect germline cyst formation (Lilly et al. 2000). Another line whose EGFP expression was downregulated slightly later, at the completion of cyst formation, trapped the Drosophila eIF-4E gene (Figure 3T). Downregulation of a related gene CG8023 was previously observed at a slightly earlier time, during cyst divisions (Kai et al. 2005).

Studies on the limitations of current protein trap methods:

We also tested the sensitivity of the approach used here to detect Drosophila genes by looking at the expression of the lines in the core collection in germline stem cells. First, although lines were selected on the basis of expression in embryos, we found ovarian expression above background in >90% of lines in the core collection. However, this does not address whether many other genes exist that were fused but expressed EGFP at levels too low to detect in either tissue. Analysis of germline stem cell RNA by hybridization to Affymetrix arrays detected transcripts from ∼6500 Drosophila genes over an ∼1000-fold dynamic range (Kai et al. 2005). Although translational regulation and differential protein stability, not to mention differences in staining sensitivity between different preparations, would be expected to introduce potential variation, we were curious whether protein trap lines could detect stem cell gene transcripts across the full range of expression levels.

We observed a strong correspondence between these two measures of stem cell gene expression (Figure 4, A–E). Lines with very strong EGFP expression tended to have RNA levels at least 10-fold higher than lines with above background but relatively weak expression. These lines in turn had signals higher than most lines scored as below the level of detection on arrays. The correlation was not perfect; for example, some lines showed more EGFP expression in stem cells than might be expected from the microarray study (Figure 4F). The existence of such lines was not unexpected, because some lines likely still carry second insertions, and the microarray used was based on release 1 gene models. Overall, we could detect EGFP above background in GSCs from nearly all lines whose levels of mRNA are called as “present” on Affymetrix arrays (Kai et al. 2005). This suggests that protein trap lines are not limited to a relatively small number of highly expressed genes, but can be used to follow a large fraction of gene activity.

Figure 4.—

Studies of protein trap expression. We compared the apparent intensity of GFP-protein staining in the GSCs with the RNA level of the corresponding gene as determined by Affymetrix arrays (Kai et al. 2005). (A–F) The pattern of protein trap expression of the indicated gene (see Table 4 for strain names). The expression level from Affymetrix software (mean of three measurements) is given.


The Carnegie protein trap collection—a versatile research tool:

These experiments significantly expand the number of protein trap lines available for studies of gene expression within a complex multicellular animal (also see accompanying article by Quiñones-Coello et al. 2007, this issue). Our initial characterization of these lines extends previous documentation that the behavior of the EGFP-tagged protein often corresponds to the behavior of the protein to which it is fused. Moreover, we demonstrate that collections such as ours are exceptionally useful as tools of gene discovery. Candidate genes can be selected on the basis of the developmental expression, subcellular localization, or dynamic behavior of particular protein isoforms. The same line can subsequently be used to purify the protein and its associated complexes and to create deletions for further genetic analysis. The Carnegie collection is available for research use from the Carnegie Institution. Information is available at and at

Subcellular distribution of protein location:

Previously, gene and protein trapping in yeast has been used to estimate the fraction of proteins that are localized to various cellular compartments (Ross-Macdonald et al. 1999; Kumar et al. 2002; Huh et al. 2003). More than half of all proteins showed a simple localization to the cytoplasm or nucleus. Other subcellular structures labeled by tagged proteins in yeast included the plasma membrane, the ER, mitochondria, the lysosome, and the peroxisome. We obtained similar results. The distribution patterns of EGFP-tagged proteins within Drosophila ovarian cells generally matched those of yeast and most known structures within egg chambers have now been labeled with at least one protein trap line (this study; Morin et al. 2001; Clyne et al. 2003). Moreover, the large size of ovarian cells often allowed us to distinguish the fine structure of several subcellular compartments labeled by EGFP fusion proteins generated in this screen.

Developmental regulation of protein expression:

Despite the fact that only one tissue was examined closely, a large number of proteins in the core collection were expressed and many were developmentally regulated. A diverse array of cell types within the germarium including the terminal filament, cap cells, escort cells, germline stem cells, and prefollicle cells were labeled in various lines. However, expression frequently varies from stage to stage, not only in cell type but also in subcellular location, complicating the problem of accurate annotation. Currently, protein trap images within the ovary are being curated in the FlyTrap database. Because of the relative cellular simplicity of the germarium and developing ovarian follicles, it may be possible to develop tools for displaying expression patterns at single-cell resolution in this tissue. It will be particularly valuable to add data from many other tissues and developmental stages for these same lines, to facilitate comparisons.

Identification of insertions not predicted by genome annotation:

One of the surprising results of our studies was the relatively high frequency of EGFP-positive lines that were located at sites not predicted to fuse to any annotated Drosophila transcript. However, similar results were observed in previous studies of gene trap transposons. At least 44% of insertions analyzed by Morin et al. (2001) were not within annotated genes; moreover, the reading frame of insertions in genes was not determined. In yeast, Ross-Macdonald et al. (1999) observed that while 1346 EGFP-positive insertions were in the correct frame, another 480 were not. Since most lacked an alternative start site, they postulated that a higher than expected frequency of translational frameshifting may occur. In a recent study in the mouse, 24% of genes were trapped in more than one reading frame (De-Zolt et al. 2006). We also observed this phenomenon; however, our studies emphasized the difficulty of drawing final conclusions until the location of every insertion and the actual pattern of splicing within the mutant strains have been characterized.

Sensitivity of gene traps:

A potential limitation of protein trapping in vivo is that many gene products may be expressed at levels so low that EGFP expressed at the same level could not be detected above background. Only 20% of mouse secretory traps that are G418 resistant express detectable CD2, even though the neomycin phosphotransferase gene is fused to CD2 (De-Zolt et al. 2006). Only 33% of β-geo lines resistant to G418 express detectable lacZ. This probably indicates that many genes exist that generate enough neomycin phosphotransferase to confer G418 resistance, but not enough β-galactosidase to be scored as lacZ positive (De-Zolt et al. 2006). Consistent with this, Ross-Macdonald et al. (1999) found that 415 of 1340 in-frame fusions (31%) could be detected above background by immunofluorescence. In contrast, Huh et al. (2003), who tagged complete proteins, detected signals above background for 4156 of 6029 (69%) genes. The system we employed is also designed to tag full-length proteins, and this may have enhanced its sensitivity.
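
The detection rates quoted above for the two yeast studies follow directly from the reported counts. The short Python sketch below simply redoes that arithmetic; the function name `detection_rate` and the dictionary labels are chosen here for illustration only.

```python
def detection_rate(detected, total):
    """Return the percentage of fusions detected above background."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total


# Counts as quoted in the text: (detected, total).
counts = {
    "Ross-Macdonald et al. (1999), in-frame fusions": (415, 1340),
    "Huh et al. (2003), full-length tags": (4156, 6029),
}

for study, (detected, total) in counts.items():
    print(f"{study}: {detection_rate(detected, total):.0f}%")
```

On these counts, tagging complete proteins roughly doubled the fraction of fusions detected above background.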

The requirement that each line generate EGFP fluorescence in embryos might provide a limitation on the number of genes that could be tagged. However, our experiments argue that this poses relatively little selection on which genes can be fused. We found that genes expressed in germline stem cells at a wide variety of levels on the basis of microarray studies had been fused in our collection of protein trap lines. There was a rough correlation between the levels observed using antibody staining in these cells and the microarray results. This would indicate that the protein trap methodology can potentially be used to analyze thousands of diverse Drosophila genes.

Increasing proteome coverage:

Our analysis revealed two major limitations of the current strategy for generating protein traps using P elements. Despite the advantages of embryo sorting, the inherent 5′ bias of P-element insertion (Bellen et al. 2004) greatly limits the rate at which new genes can be trapped. Many of the insertions were recovered when an insertion occurred at an internal promoter that lies within a coding intron of another gene isoform. Many genes lack such alternative promoters, so it will be necessary to use different methods to efficiently recover a more diverse collection of protein trap strains.

At least two alternative approaches are worthy of consideration. First, it should be possible to take advantage of the extensive collection of P-element insertions in Drosophila genes that have been generated by the BDGP gene disruption project and other members of the Drosophila community (reviewed in Matthews et al. 2005). We calculate that ∼2000 genes already have an existing P-element insertion within a coding intron. Moreover, P elements can recombine into the sites of existing P elements in the presence of transposase (Sepp and Auld 1999). Consequently, a protein trap allele of each of these genes could, in principle, be generated by combining a protein trap insertion of the appropriate reading frame with the “target” gene insertion in a single strain, ideally using inserts bearing scorable markers and then screening for replacement.
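
Choosing a protein trap insertion of the appropriate reading frame amounts to matching the phase of the target intron. The Python sketch below illustrates the bookkeeping under stated assumptions: intron phase is taken as the number of coding nucleotides upstream of the intron modulo 3, the trap exon is assumed to preserve the host reading frame only when the vector was built for that same phase, and the exon lengths and function names are hypothetical examples rather than data from any annotated gene.

```python
from itertools import accumulate


def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1, or 2) of each intron in a coding region.

    The phase is the number of nucleotides of an incomplete codon that
    precede the intron, i.e. the cumulative coding length modulo 3.
    Only the coding portion of each exon should be counted.
    """
    cumulative = list(accumulate(coding_exon_lengths))
    # The last cumulative value marks the end of the CDS, not an intron.
    return [total % 3 for total in cumulative[:-1]]


def vector_matches_intron(vector_phase, intron_phase):
    """True if a trap vector built for `vector_phase` suits the intron."""
    return vector_phase % 3 == intron_phase % 3


# Hypothetical gene with three coding exons of 100, 200, and 50 nucleotides.
phases = intron_phases([100, 200, 50])
print(phases)  # [1, 0]
```

In this hypothetical example, an existing insertion in the first coding intron would call for a phase 1 trap vector and one in the second intron for a phase 0 vector.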

Alternatively, transposons with a broader insertional specificity than the P element would be worthwhile. piggyBac elements are suitable for widespread mutagenesis of Drosophila genes (Thibault et al. 2004) and as gene traps (Bonin and Mann 2004). Minimal sequences for piggyBac transposition were defined recently (Li et al. 2005). We obtained several hundred piggyBac protein trap insertions that express EGFP, but the piggyBac gene trap vector appeared to be less efficient, on the basis of EGFP intensity and Western blotting, than P-element gene traps in nearby locations (also see accompanying article by Quiñones-Coello et al. 2007, this issue). New vectors containing different splice acceptor and donor sites should be tested within the context of a piggyBac element. Several new transposons with diverse insertional specificities, including Minos (Arensburger et al. 2005; Metaxakis et al. 2005), are also worthy of consideration. Continued efforts to provide greater coverage within the Drosophila proteome are warranted because of the exceptional utility of protein traps in analyzing the development and physiology of multicellular organisms.


We thank Lynn Cooley for helpful discussions. We thank X. Morin, W. Chia, A. Hudson, R. Lehmann, and A. Handler for reagents. We are grateful to the following people for their assistance with diverse aspects of the project: Alison Pinder, Joseph Carlson, Martha Evans-Holm, Crista Sewald, Becca Sheng, Melanie Issigonis, Megan Kutzer, Emily Seay, and Dianne Williams. We thank the Howard Hughes Medical Institute, Carnegie Institution, and the National Institutes of Health for support. M.B. was a fellow of the American Cancer Society. T.G.N. is a Howard Hughes Fellow of the Life Sciences Research Foundation.


  • Communicating editor: T. Schüpbach

  • Received September 18, 2006.
  • Accepted December 18, 2006.

