Abstract
A fundamental goal of genetics and functional genomics is to identify and mutate every gene in model organisms such as Drosophila melanogaster. The Berkeley Drosophila Genome Project (BDGP) gene disruption project generates single P-element insertion strains that each mutate unique genomic open reading frames. Such strains strongly facilitate further genetic and molecular studies of the disrupted loci, but it has remained unclear if P elements can be used to mutate all Drosophila genes. We now report that the primary collection has grown to contain 1045 strains that disrupt more than 25% of the estimated 3600 Drosophila genes that are essential for adult viability. Of these P insertions, 67% have been verified by genetic tests to cause the associated recessive mutant phenotypes, and the validity of most of the remaining lines is predicted on statistical grounds. Sequences flanking >920 insertions have been determined to exactly position them in the genome and to identify 376 potentially affected transcripts from collections of EST sequences. Strains in the BDGP collection are available from the Bloomington Stock Center and have already assisted the research community in characterizing >250 Drosophila genes. The likely identity of 131 additional genes in the collection is reported here. Our results show that Drosophila genes have a wide range of sensitivity to inactivation by P elements, and provide a rationale for greatly expanding the BDGP primary collection based entirely on insertion site sequencing. We predict that this approach can bring >85% of all Drosophila open reading frames under experimental control.
The nucleotide sequences of several complex eukaryotic genomes, including those of Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Mus musculus, and Homo sapiens, are virtually complete or scheduled for completion during the next several years (Collins et al. 1998; Meinke et al. 1998). Large-scale sequencing of human and model organism genomes, cDNAs, and expressed sequence tags (ESTs) is identifying tens of thousands of genes about which little is known. Obtaining mutations in these loci on chromosomes free of additional lesions is essential for their functions to be deduced using model organisms. However, mutations in particular open reading frames must still usually be obtained in piecemeal fashion, by producing specifically tailored gene knockouts or by identifying the desired strains within large, randomly mutagenized collections. Both approaches remain slow and uncertain. These problems could be circumvented by constructing complete mutation libraries, whose strains each disrupt single distinct genes. Genome-wide collections of gene knockouts would provide a vital resource for gene-based approaches to biological research.
Insertional mutagenesis provides a highly advantageous strategy for constructing mutations in advance throughout entire genomes, because it simplifies the problem of determining which genes have been disrupted. Insertional screens at low multiplicity have been carried out in bacteria (Kleckner et al. 1977), yeast (Burns et al. 1994; Garraway et al. 1997), Arabidopsis (Sundaresan et al. 1995; Bhatt et al. 1996; Smith et al. 1996), C. elegans (Plasterk 1993; Korswagen et al. 1996), Drosophila (Cooley et al. 1988; Rørth 1996), zebrafish (Allende et al. 1996; Gaiano et al. 1996), and mice (Jaenisch 1988; Gossler et al. 1989; Wurst et al. 1995; Zambrowicz et al. 1998). However, converting the products of raw screens into a complete mutation library is a challenging task. The site selectivity of the mutagenic element must be extremely broad to target all genes. High-throughput methods must be developed and used to identify screen products that contain single insertions located within distinct genes, because strains bearing just one new insertion each are needed to assess gene function. Consequently, it remains uncertain whether it is possible in practice to construct a complete library of mutations using this approach.
Among model multicellular eukaryotes, insertional mutagenesis has been used for genetic analysis and functional genomics most extensively in Drosophila. Low-multiplicity mutageneses using engineered P elements have been carried out frequently (Cooley et al. 1988; Bellen et al. 1989; Bier et al. 1989; Berg and Spradling 1991; Karpen and Spradling 1992; Gaul et al. 1992; Törok et al. 1993; Chang et al. 1993; Erdelyi et al. 1995; Rørth 1996; Deak et al. 1997; Sozen et al. 1997; Rørth et al. 1998). While the raw strain collections produced in such studies are highly redundant and contain lines with multiple mutations, they provide ideal starting material for constructing a genome-wide mutation library. In 1993, the Berkeley Drosophila Genome Project (BDGP) gathered ∼3900 lines associated with mutant phenotypes (mostly lethality) from seven existing raw collections and began to construct a gene disruption library for use by the Drosophila research community (Spradling et al. 1995). The Drosophila genome is thought to contain ∼3600 vital genes (Miklos and Rubin 1996), so the project had the potential to encompass a substantial fraction of genes that can mutate to a phenotype recognizable in large laboratory screens (primarily lethality). In contrast, the total number of genes is thought to be about three times larger (Miklos and Rubin 1996). The analysis of these collections also promised to indicate the feasibility of eventually using P-element insertional mutagenesis to disrupt all Drosophila genes.
Table 1. Mutational saturation
The seven initial collections have now been analyzed, and we document here a library of strains that disrupts ∼1000 different genes. A total of 450 of the strains mutate genes that have been described previously in Drosophila or represent novel loci defined by homology to well-studied genes in other organisms. These associations were made with the assistance of researchers throughout the Drosophila community who have used the collection to help characterize >250 genes, and through the efforts of the BDGP project, including 138 new gene-mutation links reported here. Another 135 disrupted genes are associated only with EST sequences that predict novel proteins or products related to proteins of unknown function in other organisms. An additional 138 lines are inserted within sequenced regions containing candidate open reading frames (ORFs). Thus, >700 of the 1045 mutant strains already link mutant phenotypes with specific open reading frames, and the remaining lines only await completion of the genomic DNA sequence. The ∼1000 genes already represented in the library constitute ∼25% of all Drosophila genes readily defined by mutation, and specify more gene-mutation links than are currently available in other model multicellular eukaryotes. Most important, on the basis of this work we have developed a new program to disrupt the remaining genes while the Drosophila genome sequence is being completed and annotated (see Table 1).
Table 2. Screen summaries
MATERIALS AND METHODS
Drosophila strains: Flies were grown on standard corn meal/agar media (Ashburner 1990) at 22°. Approximately 3900 lethal, semilethal, sterile, semisterile, or visible lines (Table 2) were collected from seven P-element screens (Cooley et al. 1988; Bier et al. 1989; Gaul et al. 1992; Karpen and Spradling 1992; Chang et al. 1993; Törok et al. 1993; M. Scott and M. Fuller, unpublished results) as described (Spradling et al. 1995). Three different P-element vectors were used: PZ[ry] (Mlodzik et al. 1990), PlacW (Bier et al. 1989), or Puc-hsneo (Steller and Pirrotta 1986). About 40% of the starting lines were marked with rosy+ and 60% with white+. The Gaul et al. (1992) collection was stained for enhancer trap patterns in third instar larval eye-antenna imaginal discs, and only lines showing expression were analyzed further. Lines from the Törok et al. (1993) screen that share the first three numbers in their designator (see Nomenclature) were obtained from the same parents and may derive from premeiotic clusters. When two or more such lines were found to contain insertions at the same polytene site, only one was retained and the others were treated as duplicates. Many lines containing multiple insertions from this screen were discarded prior to localization because they exhibited a diagnostic strong eye coloration.
Deficiency strains were obtained from the Bloomington Stock Center and from many individual laboratories. The deficiencies used are listed in Table 3.
Strain names: BDGP strain names start with a prefix that indicates the chromosome and phenotypic effect of their single P-element insertion. For example, third chromosome strain names begin with either “l(3)” (lethal or strong semilethal), “fs(3)” (female sterile or strong semisterile), “ms(3)” (male sterile or strong semisterile), “v(3)” (visible), or “n(3)” (no obvious phenotype). Semilethal and semisterile mutations were utilized only if they were strong enough to score in complementation tests. Only the effect of the P insertion, not of any secondary mutations on the same chromosome, whether present initially or acquired later, is indicated by the prefix. The phenotypic prefix is followed by a unique designator to distinguish individual lines and to preserve the original names of the lines. Designators for lines from Cooley et al. (1988) take the form “neo” and a 1–3 digit number (e.g., “neo63”); from Karpen and Spradling (1992), lines retain their original names (e.g., “06253”); from Bier et al. (1989), the letter “j” precedes the original name (e.g., “5C2” becomes “j5C2”); from Gaul et al. (1992), the letter “r” is contained within the original name (e.g., “rJ713”); from Törok et al. (1993), the letter “k” precedes the original name and the slash is omitted (e.g., “133/45” becomes “k13345”); from the Scott and Fuller screen, the letter “s” precedes the original name (e.g., “1629” becomes “s1629”); and for regular names from Chang et al. (1993), the L is moved to the start, the R omitted, and a zero added after the number (e.g., “534RL” becomes “L5340”). While the phenotypic prefix may rarely be changed to reflect new information about the effect of the P insertion, the designator is invariant. Thus, l(2)06253 and n(2)06253 refer to a single BDGP strain, whose P insertion was initially thought to cause lethality, but was subsequently shown to cause no obvious phenotype. Because phenotypic prefixes can change, it is wise to search Internet databases using the designator.
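The designator rules above can be sketched in a few lines of Python. This is only an illustration of the naming convention as stated in the text: the screen labels (`"cooley1988"`, `"bier1989"`, and so on) and the function names are hypothetical and are not part of any BDGP or FlyBase tool.

```python
import re

def bdgp_designator(screen: str, original: str) -> str:
    """Map an original line name to its BDGP designator.

    `screen` is a hypothetical label for the source screen; the rules
    follow the nomenclature paragraph above.
    """
    if screen in ("cooley1988", "karpen1992", "gaul1992"):
        # "neo63", "06253", and "rJ713" style names are kept unchanged.
        return original
    if screen == "bier1989":
        return "j" + original                    # "5C2" -> "j5C2"
    if screen == "torok1993":
        return "k" + original.replace("/", "")   # "133/45" -> "k13345"
    if screen == "scott_fuller":
        return "s" + original                    # "1629" -> "s1629"
    if screen == "chang1993":
        match = re.fullmatch(r"(\d+)RL", original)
        if match is None:
            raise ValueError(f"not a regular Chang et al. name: {original!r}")
        return "L" + match.group(1) + "0"        # "534RL" -> "L5340"
    raise ValueError(f"unknown screen label: {screen!r}")

def strain_name(prefix: str, chromosome: int, designator: str) -> str:
    """Combine the phenotypic prefix, chromosome, and invariant designator."""
    return f"{prefix}({chromosome}){designator}"

print(strain_name("l", 2, bdgp_designator("torok1993", "103/25")))
```

Because only the prefix can change, two names such as `strain_name("l", 2, "06253")` and `strain_name("n", 2, "06253")` share a designator and refer to the same strain.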
Periodically updated information on the BDGP strains can be obtained by searching the BDGP website at http://www.fruitfly.org/p_disrupt/, or from FlyBase (the Drosophila database project) at http://flybase.bio.indiana.edu/transposons/fbinsquery.hform.
Table 3. Deficiency stocks
Gene names: Symbols for Drosophila gene names are as given by FlyBase. For potentially novel loci defined only by a BDGP insertion strain, the name of the primary strain constitutes the provisional gene name, in accordance with FlyBase rules. Allele names for all the mutations are represented using the designator as the allele superscript. For example, because strain l(2)k10325 is part of the complementation group whose primary strain is l(2)03350 defining a new gene, its mutation is designated l(2)03350^k10325. The P-element mutation in strain l(2)s4771 that is allelic to kismet (kis) is designated kis^s4771. Again, because it is the designator that is presented in allele tables, it is wise to search FlyBase with the wild-carded designator.
Localization of inserts by in situ hybridization: P elements were localized by in situ hybridization to polytene chromosomes as described previously (Spradling et al. 1995); see also http://www.fruitfly.org/methods. Digitized images of these localizations are available at http://www.fruitfly.org/p_disrupt/. A few lines were localized by others; these were assumed to be less accurate and are given only to a polytene lettered section, rather than a range of specific bands. To reduce the number of in situ localizations, many alleles of seven known hotspots were removed from the Törok et al. (1993) collection by complementing each starting strain with the following tester loci: l(2)07815 (kis), l(2)01209 (vkg), l(2)04208 (Eif4A), l(2)02657 (wg), l(2)00255 (bun), l(2)00642 (lola), and l(2)03505 (mam). The insertion(s) in lines that failed to complement were not localized, and they are not included in the tabulation of hotspot allele numbers. Consequently, the allele numbers for these loci are lower than would have otherwise been the case.
Complementation testing: Complementation crosses were carried out among single-insertion lines whose insertions were localized within six to eight polytene bands of each other. A two-stage strategy was used to limit the number of crosses and to minimize redundancy. Each line was first crossed to a representative of any locus within range having multiple alleles. Lines failing to complement were identified as additional alleles and eliminated from further crosses. Lines not allelic to such local “hotspots” were subsequently crossed to representatives of the other complementation groups within the relevant zone. As soon as two complementation groups were joined, it was assumed that their behavior was uniform, and few additional crosses between the subgroups were carried out. Generally this strategy worked well. However, in a small number of cases, incomplete or inconsistent complementation behavior was observed due to localization errors larger than four to eight bands, to intergenic complementation, to semilethality, to inadvertent selection of a rearranged allele as the representative allele, to stock instability, or to errors in obtaining or recording complementation data. Problem complementation groups were reanalyzed on a case-by-case basis and the source of the contradiction resolved.
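The bookkeeping implied by this strategy, in which any failure to complement joins two groups and the merged group is then treated as uniform, can be illustrated with a small union-find sketch. The line names and cross results below are hypothetical, and the code assumes perfectly transitive allelism, which the text notes did not always hold in practice.

```python
def complementation_groups(lines, failed_pairs):
    """Group lines into complementation groups.

    `lines` is an iterable of line names; `failed_pairs` lists pairs of
    lines whose cross FAILED to complement (i.e., behaved as alleles).
    Groups are merged transitively, mirroring the assumption that joined
    groups behave uniformly.
    """
    parent = {line: line for line in lines}

    def find(line):
        # Follow parent links to the group representative (path halving).
        while parent[line] != line:
            parent[line] = parent[parent[line]]
            line = parent[line]
        return line

    for a, b in failed_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for line in parent:
        groups.setdefault(find(line), []).append(line)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical lines localized within the same six-to-eight band window.
lines = ["k00101", "k00202", "j1A1", "s0303"]
failed = [("k00101", "k00202"), ("k00202", "j1A1")]
print(complementation_groups(lines, failed))
```

One member of each resulting group would then be the candidate for the primary collection; the remaining members correspond to reserve alleles.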
Verification: Strains from the primary collection were crossed to deficiencies (see Table 3) to verify that the P insertion caused the recessive phenotype. In 1717 single-insert strains, the cytogenetic locus of the P element clearly fell within the boundaries of existing deficiency (Df) chromosomes (Table 2). An uncertainty of four to six bands in the cytogenetic breakpoints was assumed, and the previous results of complementation tests with verified lines in the region were also considered (see Spradling et al. 1995). Complementation with deficiencies that unequivocally remove the P insertion site was taken as proof that the P element did not cause the associated phenotype. Failure to complement indicated that the strain was “verified.” While lines with secondary mutations closely linked to the P insertion might be erroneously verified by this procedure, further molecular and genetic analyses suggest that the frequency of such errors is small. The results of the complementation and verification crosses are summarized in Tables 2, 4, and 5. The data are also available on the BDGP website (http://www.fruitfly.org/p_disrupt/).
The availability of DNA sequence information that can link insertion sites to nearby ESTs, transcripts, and predicted genes is expected to significantly change the way decisions to retain or discard lines are made. Except within the Adh region (Ashburner et al. 1999), we retained insertions only if they caused or were likely to cause a detectable mutant phenotype. However, in the future, as genomic sequences become more highly annotated, it will increasingly be possible to select strains solely on the basis of whether they are likely to disrupt a novel ORF, regardless of whether a recessive phenotype can be observed. In a few cases reported here, viable insertions reside near or within novel transcripts recognized by nucleotide sequence. The prefixes of these lines were changed to n(2) or n(3) to indicate the absence of a scorable phenotype. Only within the Adh region, where sequence annotation is now extensive (Ashburner et al. 1999), did a significant fraction of the retained lines lack strong phenotypes.
Flanking sequence determination: Flanking sequences from one or both ends of most P-element insertions in the primary collection were determined by one or both of two methods. Plasmids containing the 5′ P element and flanking genomic sequences were rescued from many strains. Prior to rescue, the line was expanded, and 40–100 adult flies were collected and frozen at –20°. The plasmid rescue procedure (based on Hamilton et al. 1991) entails macerating 30–40 flies in a grinding buffer, then one cycle of freeze-thaw, followed by a 20-min incubation at 70°. Subsequently, residual proteins and SDS were removed by addition of potassium acetate (KOAc) and incubation on ice for 30 min. The supernatant obtained after removal of particulate matter was ethanol precipitated to recover genomic DNA. Finally, the samples were treated with RNase A at 37° for 2 hr.
For plasmid rescue, a sample of genomic DNA equivalent to two to four flies was digested with an appropriate restriction enzyme (e.g., XbaI for the PZ lines), then ligated at low DNA concentration to circularize the restriction fragments. Subsequently, DH10B cells were transformed by electroporation. The resulting colonies had acquired the circularized restriction fragment containing the selectable marker, the bacterial origin of replication, one P-element inverted repeat, and a variable amount of flanking genomic DNA. For each rescue, four to six transformants were screened by DNA miniprep and restriction digestion. In cases where at least three of the four (or five of the six) transformants exhibited identical patterns, a plasmid was chosen for sequencing that represented the major class. Occasionally, when a transformation experiment yielded more than one plasmid form, the appropriate plasmid was identified by in situ hybridization. These plasmids were sequenced directly using a primer designed to the P-element inverted repeat. The success rate in this procedure was ∼80%.
The remaining lines were analyzed by recovering a smaller amount of DNA using inverse PCR according to the method of J. Rehm (http://www.fruitfly.org/methods/). This method was successfully adapted to a 96-well format where the success rate in obtaining 25 bp or more of flanking sequences has been >85%.
Association with ESTs: BDGP is generating a collection of 80,000 Drosophila EST sequences with support from Howard Hughes Medical Institute (accessible at http://www.fruitfly.org/EST/). During the preparation of this article, ∼48,000 ESTs were available for comparison. Each flanking sequence was searched against this EST database, matches validated by inspection, and the position of the P insertion relative to the EST-homologous portion of the flanking sequence determined. The names of ESTs with strong matches are given in Tables 4 and 5. Only ESTs that were located within ∼100 bp of the P element are reported; more distant sequence matches might represent adjacent transcripts and were not included in the tables.
Stock distribution: To hasten the availability of the gene disruptions, verified lines from the primary collection were sent to the Bloomington Stock Center in several batches beginning in 1993; the number of strains reached 700 by late in 1994. All 1045 primary collection strains have been available from the Bloomington Stock Center since October 1997. Reserve alleles are maintained at the Carnegie Institution (chromosome 3) or at Berkeley (chromosome 2), and have also been available on request since 1993. Information about stocks is updated periodically on the BDGP website and strains found to be inappropriate are removed from the Bloomington Stock Center. Information derived from further study of any of the BDGP stocks is welcome and should be forwarded to the corresponding author's e-mail address.
Statistical analysis of saturation: Previous attempts to estimate the saturation behavior of P elements have utilized inadequately characterized data sets. We focused on the 737 independent lines from chromosome 2 and 535 independent lines from chromosome 3 that contain a single verified P insertion lying clearly within the validated deficiencies used in the verification analysis. Within this group, for a known number of total lines (transposition events), the number of genes mutated and how many times each was hit should have been determined with complete accuracy. Because the deficiencies included a majority of chromosome 2 and 3 genes (62.0 and 60.3%, respectively), and should be distributed effectively at random, this sample should accurately represent all insertions that cause a phenotype.
Table 4. Chromosome 2 stocks
Table 5. Chromosome 3 stocks
Focusing on the less-mutagenized chromosome first, we determined that 154 of the 535 third chromosome genes had been hit once, 43 twice, 16 three times, 6 four times, and 5 five times and that 18 were previously discussed hotspot loci hit six or more times (Table 1). Despite the small number of hotspot loci, they accounted for 204 of the 535 insertions (38%). First, we attempted to fit the data to a Poisson distribution, ignoring for the moment the obvious presence of hotspot genes. The best distribution (λ = 0.558; Table 1) fit the data poorly because it predicted only 8.0 genes (instead of 16) hit three times, only 1.1 (instead of 6) hit four times, and 0.1 instead of 5 hit five times (χ² = 270, P ≪ 0.001).
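The single-Poisson figures quoted above can be reproduced with a short calculation. The sketch below assumes, purely for illustration, that λ is chosen so that the distribution matches the observed numbers of genes hit once and twice (λ = 2n₂/n₁); this happens to give λ ≈ 0.558, but it is not necessarily the fitting procedure that was actually used, and the implied total number of genes is a derived quantity not stated in the text.

```python
import math

# Observed numbers of chromosome 3 genes hit k times (k = 1..5), from the text.
observed = {1: 154, 2: 43, 3: 16, 4: 6, 5: 5}

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k hits under a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Assumption: fix lambda from the ratio of twice-hit to once-hit genes.
lam = 2 * observed[2] / observed[1]

# Implied number of target genes (including those never hit).
n_genes = observed[1] / poisson_pmf(1, lam)

expected = {k: n_genes * poisson_pmf(k, lam) for k in observed}

print(f"lambda = {lam:.3f}")
for k in sorted(observed):
    print(f"hit {k}x: observed {observed[k]:3d}, expected {expected[k]:6.1f}")
```

Under this assumption the classes hit three, four, and five times are predicted at roughly 8.0, 1.1, and 0.1 genes, which is the shortfall relative to the observed 16, 6, and 5 that the text describes.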
To determine if the observed “excess” of genes hit three to five times was caused by the statistical tail from the hotspot loci, we used a binomial distribution to model their contribution (Table 1). The distribution used maximizes the contribution of hotspot genes to the classes of genes hit 3–5 times, while yielding the observed number of genes hit 6–12 times. Despite this, the results reveal that there are too few hotspot genes to account for the excess of genes hit 3–5 times (Table 1; χ² = 20, P ≪ 0.001). Consequently, a class of genes of intermediate mutability must exist (warmspot genes). To estimate the size of this class, we fit the data for genes hit one to five times on the assumption of two mutability classes, warmspot and coldspot genes. Postulating 115 “warmspot” genes (λ = 1.51) and 613 “coldspot” genes (λ = 0.241) produced a good fit to the data (Table 1; χ² = 0.81, P ≫ 0.05). Extrapolating the warmspot and coldspot data to the entire chromosome and adding the whole chromosome hotspot data, the following was predicted for the third chromosome: 27 hotspot loci + 115/0.603 = 191 warmspot loci + 613/0.603 = 1017 coldspot loci.
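The two-class model can be checked numerically by summing the Poisson expectations of the warmspot and coldspot classes. The sketch below uses only the parameters quoted in the text; it deliberately omits the binomial contribution from hotspot genes, so it should not be expected to reproduce the reported χ² value, and the class hit five times is underestimated for that reason.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k hits under a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# (number of genes, lambda) for each mutability class on chromosome 3.
classes = {"warmspot": (115, 1.51), "coldspot": (613, 0.241)}
observed = {1: 154, 2: 43, 3: 16, 4: 6, 5: 5}

# Expected number of genes hit k times, summed over both classes.
expected = {
    k: sum(n * poisson_pmf(k, lam) for n, lam in classes.values())
    for k in observed
}
for k in sorted(observed):
    print(f"hit {k}x: observed {observed[k]:3d}, expected {expected[k]:6.1f}")

# Extrapolate from the deficiency-covered sample to the whole chromosome.
DF_COVERAGE_CHR3 = 0.603   # fraction of chromosome 3 genes within deficiencies
HOTSPOTS_CHR3 = 27         # whole-chromosome hotspot count, from the text
whole_chromosome = {
    name: round(n / DF_COVERAGE_CHR3) for name, (n, _) in classes.items()
}
print(HOTSPOTS_CHR3, whole_chromosome)
```

For the classes hit one to four times, the summed expectations fall within one gene of the observed counts, which illustrates why two mutability classes fit where a single Poisson distribution does not.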
We next considered the second chromosome and found that 737 independent verified lines defined 190 genes that had been hit once, 57 twice, 19 three times, 17 four times, 5 five times, and 32 that were hotspot loci hit six or more times (Table 1). Hotspot insertions accounted for 288 of these lines (39%). Again, at least two general classes of mutability were required to fit the data from non-hotspot lines, even after correcting for the contribution of hotspot genes (Table 1; χ² = 36, P ≪ 0.001). Because we expected genes on the second and third chromosomes to have the same average mutabilities, we reasoned that the Poisson parameters for chromosome 2 warmspot and coldspot loci should correspond to parameters of the corresponding chromosome 3 genes corrected for the more extensive mutagenesis that was carried out on chromosome 2. The relative fraction of independent single-insert lines analyzed on chromosome 2 compared to chromosome 3 was 737/535 = 1.38. Multiplying the warmspot and coldspot class Poisson parameters determined for chromosome 3 by 1.38 gave the expected values on chromosome 2 (λ = 2.08 and 0.331). Values of 110 warmspot and 680 coldspot loci on chromosome 2 were then determined to fit the distribution (Table 1; χ² = 3.8, P ≫ 0.05). Thus, chromosome 2 is predicted to house 47 hotspot genes, 110/0.62 = 177 warmspot genes and 680/0.62 = 1097 coldspot genes.
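The scaling step and the whole-chromosome extrapolation for chromosome 2 amount to a few lines of arithmetic, sketched below with the values given in the text. The variable names are illustrative only. Small differences in the last digit of the scaled coldspot parameter depend on where rounding is applied, so the code checks only that it lies close to the quoted value.

```python
# Independent verified single-insert lines within deficiencies.
LINES_CHR2 = 737
LINES_CHR3 = 535

# Chromosome 3 Poisson parameters for the two mutability classes.
LAMBDA_CHR3 = {"warmspot": 1.51, "coldspot": 0.241}

# Chromosome 2 was mutagenized more extensively; scale lambda accordingly.
ratio = LINES_CHR2 / LINES_CHR3
lambda_chr2 = {name: lam * ratio for name, lam in LAMBDA_CHR3.items()}
print(f"ratio = {ratio:.2f}")
print({name: round(lam, 3) for name, lam in lambda_chr2.items()})

# Class sizes fitted within the deficiency-covered part of chromosome 2.
FITTED_CHR2 = {"warmspot": 110, "coldspot": 680}
DF_COVERAGE_CHR2 = 0.62    # fraction of chromosome 2 genes within deficiencies

predicted_chr2 = {
    name: round(n / DF_COVERAGE_CHR2) for name, n in FITTED_CHR2.items()
}
predicted_chr2["hotspot"] = 47   # whole-chromosome hotspot count, from the text
print(predicted_chr2)

# Autosomal hotspot total (27 on chromosome 3 plus 47 on chromosome 2).
autosomal_hotspots = 27 + predicted_chr2["hotspot"]
print(autosomal_hotspots)
```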
RESULTS
Rationale: The gene disruption library was assembled from ∼3900 starting lines that had been produced in seven separate single P-element mutagenesis screens (Table 2). Each starting line contained one (or a few) P-element insertions on an autosome bearing a newly induced scorable recessive phenotype. The process of going from this amalgamated raw collection to the finished library involved (1) localizing the insertions by in situ hybridization to polytene chromosomes at high resolution; (2) identifying strains with allelic insertions by inter se complementation crosses; (3) verifying that insertions were responsible for the mutant phenotype by crossing to chromosomes bearing deficiencies; and (4) sequencing DNA flanking the insertions and comparing it to EST and genomic sequence databases (Spradling et al. 1995). Single-insert-bearing strains that appear to disrupt distinct genes based on all these criteria constitute the final library, or “primary collection.” Because of the requirement for complementation testing, the project was designed initially to focus on genes that mutate to a recognizable lethal, sterile, or visible phenotype.
Identifying single-insert lines: The P insertion(s) in each line was cytogenetically localized by in situ hybridization as described previously (Spradling et al. 1995; see also materials and methods). The number of lines localized from each screen is given in Table 2. Highly consistent and accurate localizations were required for the success of the complementation analysis. Images of each localization were digitized and stored (http://www.fruitfly.org/p_disrupt/). Based on these results, a significant number of strains were immediately eliminated from consideration for the primary collection. Two hundred seven lines were discarded because they did not represent independent insertions (see materials and methods) and >900 lines were eliminated because they contained two or more insertions or a rearrangement on the mutation-bearing chromosome. A total of 2695 independently derived strains bearing single P-element insertions (1643 on II, 1052 on III) on intact chromosomes were retained. The P-element insertion site within each primary collection strain is listed in Tables 4 and 5.
Identifying allelic mutations: Complementation crosses were carried out inter se between lines whose insertions were located near each other (see materials and methods). The maximum cytogenetic distance between the reported positions of insertions that is required to ensure they are not allelic depends critically on the accuracy of the in situ localization. We complementation tested lines when the distance between their elements was six to eight bands or less. This should have been sufficient to eliminate errors even in cytogenetically difficult regions, because the divergence in the reported positions of allelic insertions averaged less than one band (Spradling et al. 1995). The molecular analyses reported below provide further independent verification that allelic lines were not missed due to localization errors.
The complementation analysis provided considerable insight into the frequency with which individual loci are mutated by P elements (see Tables 4 and 5). In particular, 74 complementation groups on the autosomes were identified that are hotspots for P-element insertion with between 6 and 37 alleles each. Because of the size of our data set, these loci likely comprise virtually all the P-element insertion hotspots on the autosomes. Following completion of the complementation analysis, one allele of each complementation group was retained for the primary collection (see Table 2).
Verifying that the insertions cause mutations: Three criteria were used to determine if the P-element insertion in a given single insert line was likely to be responsible for the observed mutant phenotype. First, if the mutation failed to complement one or more independently derived strains whose insertions had been localized nearby, then it was considered verified along with all the other insertions in the complementation group so defined. (The chance that such lines actually contained identical secondary or background mutations was negligible as indicated by test crosses with lines whose insertions were at different sites.) To apply the second verification test, the strain in question was crossed to deficiency chromosomes whose cytogenetically determined breakpoints (Table 3) indicated that they might lack the disrupted gene. Crosses were scored based on the presumed phenotype of the insertion (Tables 4 and 5, “Df comp” and “Df noncomp”). Lines that failed to complement were considered to be verified, because the chance that a background mutation was closely linked to the P element was acceptably small. If complementation was observed, the line was discarded if its insertion clearly fell within the deficiency boundaries; otherwise it was retained but remained unverified. These two tests, combined with further verification based on the analysis of flanking DNA sequences as described below, allowed the total number of lines in the primary collection to be reduced to 1045, of which 725 (69%) are verified (see Table 2). Of these lines, 93% disrupt vital genes, while most of the remainder cause male or female sterility. The phenotype and verification status of each line are shown in Tables 4 and 5.
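The first two verification tests described above form a simple decision rule, sketched here for a single line. The function, its argument names, and the `Status` labels are hypothetical; the sketch covers only the genetic tests and leaves out the additional check based on flanking DNA sequence.

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    VERIFIED = "verified"
    DISCARDED = "discarded"
    UNVERIFIED = "retained, unverified"

def classify_line(
    fails_with_nearby_insertion: bool,
    complements_deficiency: Optional[bool],
    clearly_within_deficiency: bool,
) -> Status:
    """Apply the two genetic verification tests to one single-insert line.

    fails_with_nearby_insertion: the line fails to complement an
        independently derived line with a nearby insertion (test 1).
    complements_deficiency: outcome of the deficiency cross (test 2),
        or None if no suitable deficiency was available.
    clearly_within_deficiency: the insertion site lies unequivocally
        inside the deficiency boundaries.
    """
    if fails_with_nearby_insertion:
        return Status.VERIFIED
    if complements_deficiency is None:
        return Status.UNVERIFIED
    if not complements_deficiency:
        return Status.VERIFIED
    # The deficiency complements the mutation.
    if clearly_within_deficiency:
        return Status.DISCARDED
    return Status.UNVERIFIED

print(classify_line(False, True, True))
```

The asymmetry in the last branch reflects the text: complementation disproves the insertion only when the deficiency certainly removes the insertion site.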
We can estimate the approximate number of bogus lines that remain in the library. First, the overall fraction of verified lines arising from each screen is calculated by restricting our analysis to those lines whose insertions clearly fall within the boundaries of valid deficiencies and hence can be reliably tested (Table 2, “in Df”). This subgroup represents >60% of all the lines and should be representative of each screen as a whole. The proportion of lines that were verified ranges from 48–88% among the seven screens. Assuming that insertions falling outside the deficiencies are as likely to be valid as those inside allowed us to estimate the number of unverified primary collection lines from each starting screen that are likely to be valid. After making this final correction, the final number of different genes disrupted by P insertions in the collection is estimated to be 953 (Table 2). Using this information we also determined an overall efficiency for each of the seven screens, defined as the percentage of raw lines that contain a single insertion causing its associated phenotype (Table 2, “screen efficiency”).
Deficiency chromosomes with accurate breakpoints are a valuable genetic resource. Knowledge of the true extent of material deleted in deficiency stocks was improved as a by-product of verifying the P insertions. The location of each verified insertion predicted the expected complementation behavior with relevant deficiencies. In cases where contradictions were observed, the breakpoints of the deficiency could sometimes be refined on the basis of the cytogenetic localizations of the terminal P elements (see Tables 4 and 5). A number of such corrections have been incorporated into FlyBase (see flybase.bio.indiana.edu:80/.bin/fbabsq.html). Table 3 shows current estimates of the deficiency breakpoints used in these studies.
Characterizing insertions using flanking DNA sequence: The genomic DNA sequence flanking the insertion sites in the primary collection lines was needed to complete the verification process and to begin associating lines with specific genes. Physically associating as many insertions as possible with specific sites in the genome would also enhance the usefulness of the primary collection for gene mapping and for directed mutational screening using accurately positioned starting strains (Spradling et al. 1995). Consequently, we attempted to recover genomic DNA adjacent to the 5′, 3′, or both sides of the P element from every remaining candidate primary collection line following completion of the genetic verification tests. Both plasmid rescue and inverse PCR were used. A single sequencing run was carried out beginning at the insertion site of all recovered flanks (see materials and methods). Despite a shorter average amount of sequence recovered, inverse PCR was successful at a slightly higher average frequency (85% vs. 80%) and could be carried out in a 96-well format that allowed lines to be analyzed more rapidly. If both 5′ and 3′ sequences were obtained, the two runs were merged into a single contig.
The sequences flanking the insertions were initially compared among themselves as an additional verification test. We wished to eliminate lines whose insertions were very close together but that behaved genetically like separate genes. Such lines are likely to be produced when chromosomes bearing nonallelic background mutations acquire insertions within the same nonvital gene. The genetic behavior of the resulting strains will cause them to survive into the primary collection if their insertions lie outside existing deficiencies. On the other hand, we did not want to eliminate valid insertions in adjacent genes. Consequently, in the absence of additional information, nonallelic insertions separated by 100 bp or more were assumed to represent distinct genes. When the separation was <100 bp, usually only one (if verified) or neither line was retained in the primary collection. Rarely, this might have led to the loss of valid lines, for example, in cases of overlapping genes or intragenic complementation, but it allowed us to discard nearly 100 questionable strains from the collection.
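The 100-bp rule can be expressed as a small decision function. The sketch below is a simplification under stated assumptions: the function name, the use of integer coordinates on a shared reference, and the tie-break when both lines are verified are all hypothetical, and the text notes that the rule was applied only "usually" and in the absence of additional information.

```python
MIN_SEPARATION_BP = 100  # threshold stated in the text


def lines_to_retain(pos_a, pos_b, verified_a, verified_b):
    """Decide which of two nonallelic lines to keep in the collection.

    pos_a, pos_b: insertion coordinates in bp on the same reference sequence
    verified_a, verified_b: whether each line passed genetic verification
    Returns a set containing "a", "b", both, or neither.
    """
    if abs(pos_a - pos_b) >= MIN_SEPARATION_BP:
        # Far enough apart: assumed to be insertions in distinct genes.
        return {"a", "b"}
    # Closer than 100 bp: keep at most one line, and only a verified one.
    # Preferring "a" when both are verified is an arbitrary choice made here.
    if verified_a:
        return {"a"}
    if verified_b:
        return {"b"}
    return set()


print(lines_to_retain(1000, 1250, False, False))  # both lines kept
print(lines_to_retain(1000, 1040, False, True))   # only the verified line kept
```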
After completing these tests 1045 lines remained in the primary collection. Flanking sequence information has been obtained from 921 of the lines in this final group (88%). Accession numbers for each strain are listed in Tables 4 and 5 (“Sequence”). These sequences, including the position of the insertion, are listed on the BDGP website (http://www.fruitfly.org/p_disrupt/).
Associating primary collection lines with genes: The primary collection provides an opportunity to link ∼1000 Drosophila genes with a genetic phenotype. Because these strains and genetic data have been publicly available from the inception of the project, the Drosophila research community has extensively utilized many lines from the primary collection (and the precursor raw collections). Publications describing at least 250 different Drosophila genes have employed strains from the collection (see Tables 4 and 5, “References”). In many cases, the P-element disruption strain played a major role in the initial characterization of the gene in question.
To identify as many additional genes as possible the P-element flanks were searched against all Drosophila sequences in GenBank and ∼26 Mb of genomic sequence (most searches are current as of December 1998). To test the accuracy of flanking sequence recovery, the polytene location of the P element in each of the 286 lines whose flanking sequences matched genomic sequence determined by BDGP was compared to the independently mapped polytene location of the corresponding P1 clones. Only a few discrepancies resulted, presumably due to the rare recovery of sequence from a cryptic P element, and in these cases a correct flanking sequence was sought. These searches provided a wide variety of valuable information. They confirmed most of the 250 published gene assignments, identified many additional characterized Drosophila genes disrupted by strains in the collection, and molecularly positioned the insertion sites within all these loci. Of the additional Drosophila genes, 55 had previously been characterized only at the molecular level (Table 6).
Further links to well-characterized genes were discovered by associating the insertions with Drosophila transcripts defined by EST sequencing. About 48,000 Drosophila EST sequences were available for these comparisons. A total of 376 insertions were located close to or within an EST sequence, usually near the 5′ end (see Tables 4 and 5). Mutation-causing P elements are known to preferentially cluster in the 5′ region of the affected genes (see Spradling et al. 1995), a tendency that probably increases the chance of recovering overlaps between the short flanking sequences and 5′ ESTs. For each line with a matching EST, the relevant “clot” (consensus sequence of overlapping ESTs) sequence was conceptually translated and used to search protein databases. These comparisons associated 76 more primary collection lines with previously undescribed Drosophila genes encoding proteins related to characterized genes from other species (Table 7). These new Drosophila genes have been named on the basis of the name of their ortholog. Genes are listed in Table 7 only if there is a strong match within the region of overlap and if a study of the ortholog's properties has been published. BDGP has determined complete complementary DNA (cDNA) sequences for some of these genes (Table 7). The approaches described so far linked 450 primary collection lines with known Drosophila genes or with orthologs of characterized genes in other organisms (Tables 4 and 5).
Although the insertions in the remaining lines were not associated with a well-characterized gene or ortholog, it was still possible to link many of them with predicted transcripts and ORFs. The sequence comparisons associated the insertions in 135 additional lines with ESTs whose clots either predicted novel proteins or matched proteins conceptually encoded by ESTs or ORFs from other organisms. BLAST reports of these searches, including periodic updates, are available by searching the BDGP website using the appropriate EST (Tables 4 and 5). Finally, the insertions within 138 of the remaining lines not associated with genes or ESTs were localized within sequenced portions of the Drosophila genome. Bioinformatic analyses of the sequences flanking these insertions reveal candidate ORFs, although such studies have not yet been carried out systematically. In sum, therefore, 706 of the 1045 primary collection strains (67%) already link known or candidate genes with mutant phenotypes. It should be possible to make most of the remaining gene-mutant associations by the time genome and EST sequencing nears completion.
Table 6. New gene-mutant associations
Table 7. New genes
P-element selectivity: This study reveals the identity of most genes that are hotspots for P-element insertion on the autosomes (Tables 4 and 5, “Alleles”). We searched for common properties that might explain their efficiency as P-element insertional targets. Hotspot genes are not associated with generally high transcription levels, because only 30% of the genes in the primary collection with more than five alleles have an associated EST sequence, compared to 36% for the collection as a whole. Hotspot genes might be those actively transcribed in premeiotic germline cells, where P elements usually transpose; however, the few genes in the collection whose transcripts are abundant in early germ cells, including vasa, bam, and hsp83, were each hit only once. Indeed, our comparisons uncovered no common biological features such as size, location, or regulation that might explain why hotspot genes are highly susceptible to P-element insertion.
We also considered whether strong preferences exist for insertion within certain classes of genes among all those disrupted in the collection. The primary collection includes an estimated 30% of readily mutable autosomal genes. Genes involved in signal transduction were usually well represented, because the collection mutates ∼50% of all autosomal genes known to be involved in the EGFR, dpp, ras, wg, hh, or N signaling pathways. In addition, disruptions were obtained in 46% of autosomal posterior group genes, 31% of trithorax and Polycomb group genes, but only 14% of ribosomal protein genes. It remains unclear if these differences reflect more than the research priorities of the Drosophila research community.
Not all insertion sites were associated with protein-coding genes. One P element was located within a 5S rDNA repeat and four interrupted tRNA clusters. Nine lines, two of which disrupt the genes Distal-less and fruitless, were found by sequence analysis to contain insertions within the LTR sequence of a Drosophila retrotransposon related to the yoyo element of the Mediterranean fruit fly Ceratitis capitata (Zhou and Haymer 1997; see also FBgn0021759). The abundance of this element was low overall and all the insertions clustered in a small part of the LTR, a likely hotspot. Two other multicopy target sites were the telomere-associated sequence (TAS) element, with six insertions, and the hoppel element, with one insertion. Both elements have been shown previously to be frequent targets of P-element insertion (Karpen and Spradling 1992; Zhang and Spradling 1995; D. Stewart and A. Spradling, personal communication). Because most insertions within repetitive sequences would not be expected to disrupt vital functions, these observations probably reflect which repetitive target sequences are frequently located within the introns or immediate flanks of vital genes in the strains used.
Modeling mutational saturation: The gene disruption project provides a much larger and better-characterized data set than has been previously available for analyzing the site specificity of P-element transposition. This is an important question for determining the appropriate strategy to expand the collection. The insertional specificity of P elements must be extremely broad to achieve complete or nearly complete coverage of all Drosophila genes. In contrast, previous studies inferred that a significant percentage of Drosophila genes, perhaps as great as 50%, are refractory to mutation using P elements (see Kidwell 1986; Török et al. 1993). If true, this would imply that a different method of mutagenesis is needed to complete the gene disruption project (Spradling et al. 1995). However, these conclusions remain highly uncertain, because previous studies of saturation behavior utilized raw collections of unverified lines that differ in P-element content and did not correct for locus-specific differences in mutagenesis rates. The total number of different genes mutagenized clearly rises more slowly than expected by assuming that nearly all genes are equally susceptible to P-element insertion. However, this observation alone cannot distinguish between the presence of genes refractory to P-element insertion and the presence of gene classes that differ significantly in P-element mutability. Fortunately, the very information gathered to build the primary collection also allows one to more accurately deduce the saturation behavior of P elements.
We focused on the large subset of the P-element lines from the collections whose insertions lie within the boundaries of validated deficiencies. Within this group, for a known number of total lines (transposition events), the number of genes mutated and how many times each was hit has been determined with complete accuracy. Because the deficiencies included a majority of chromosome 2 and 3 genes (60.3 and 62.0%, respectively), and should be distributed effectively at random, this sample should accurately represent all insertions that cause a phenotype. When we analyzed the distribution of insertional mutations among this set of genes, it was clear that the data did not fit a simple Poisson distribution (see materials and methods; Table 1). The most obvious problem was the hotspot loci. On chromosomes 2 and 3, just 18 or 32 loci account for 38 or 39% of all insertions, respectively. However, even after subtracting the contribution of these hotspot loci, the distribution of gene mutabilities remained skewed (see materials and methods; Table 1). Consequently, a class of warmspot genes was inferred whose mutability is intermediate between the hotspot loci and the large group of low mutability coldspot genes. Assuming the existence of three major mutability classes allowed a good fit to the data.
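To illustrate why a mixture of mutability classes departs from a simple Poisson distribution, the following sketch compares the expected number of unhit genes under a single-class model and a three-class model with the same total number of insertions. The class sizes and per-gene hit rates here are hypothetical and chosen only for illustration; the fitted values are those described in materials and methods and Table 1, not these.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k hits when hits per gene are Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_gene_counts(classes, max_hits):
    """Expected number of genes hit 0..max_hits times.

    classes: list of (number_of_genes, mean_hits_per_gene) tuples.
    """
    return [sum(n * poisson_pmf(k, mean) for n, mean in classes)
            for k in range(max_hits + 1)]


# Hypothetical classes: (genes, mean hits per gene) for cold, warm, hot.
three_class = [(1000, 0.3), (180, 2.0), (40, 8.0)]
total_genes = sum(n for n, _ in three_class)
total_hits = sum(n * mean for n, mean in three_class)

# Single-class model with the same number of genes and insertions.
single_class = [(total_genes, total_hits / total_genes)]

unhit_mixture = expected_gene_counts(three_class, max_hits=30)[0]
unhit_single = expected_gene_counts(single_class, max_hits=30)[0]
print(f"unhit genes, three classes: {unhit_mixture:.0f}")
print(f"unhit genes, single class:  {unhit_single:.0f}")
```

With these illustrative numbers the three-class model leaves more genes unhit than the single-class model for the same number of insertions, which is the qualitative behavior the text attributes to hotspot and warmspot loci.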
This model provides several useful insights into P-element behavior. The third chromosome is predicted to contain 27 hotspot loci + 191 warmspot loci + 1017 coldspot loci, while the second chromosome should house 47 hotspot genes + 177 warmspot genes + 1097 coldspot genes. Despite accounting for only 17% of all genes, the 368 warmspot and 74 hotspot genes account for ∼70% of all transposition events. As a result, virtually all the hotspot loci and 80–90% of the warmspot loci have already been defined by strains in the primary collection. On the other hand, only 22–28% of the coldspot loci have so far been disrupted. However, assuming that there are 1400 vital loci per major autosome (Miklos and Rubin 1996), and considering that 93% of the disruptions in our collection are of vital genes, then the model predicts that at least 2556 × 0.93/2800 = 85% of vital genes can eventually be mutated using P elements. Thus, the existence of the hotspot and warmspot genes is the reason that mutational saturation proceeds more slowly than expected on the basis of a single class Poisson analysis, but the final level of saturation is higher than previously appreciated. Indeed, if gene mutabilities actually vary more broadly than three discrete classes, as seems likely, the true level of saturation will exceed 85%. There is no reason to suspect that P-element insertional preferences differ between vital and nonvital genes, so the conclusions drawn here should apply to Drosophila genes generally. These results suggest that a much larger fraction of Drosophila genes than previously supposed, at least 85% and possibly 100%, are susceptible to inactivation by P-element insertion.
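The arithmetic behind these figures can be checked directly. The short script below uses only the numbers quoted in the paragraph above; the variable names are ours.

```python
# Predicted class sizes per chromosome, as stated in the text.
predicted = {
    "chromosome 3": {"hotspot": 27, "warmspot": 191, "coldspot": 1017},
    "chromosome 2": {"hotspot": 47, "warmspot": 177, "coldspot": 1097},
}

# Sum each class over the two major autosomes.
totals = {}
for class_counts in predicted.values():
    for gene_class, n in class_counts.items():
        totals[gene_class] = totals.get(gene_class, 0) + n

all_genes = sum(totals.values())
hot_warm_share = (totals["hotspot"] + totals["warmspot"]) / all_genes

VITAL_FRACTION = 0.93   # share of disruptions that are in vital genes
VITAL_LOCI = 2 * 1400   # 1400 vital loci per major autosome
mutable_vital_fraction = all_genes * VITAL_FRACTION / VITAL_LOCI

print(totals)                            # hotspot 74, warmspot 368, coldspot 2114
print(all_genes)                         # 2556
print(f"{hot_warm_share:.1%}")           # 17.3%
print(f"{mutable_vital_fraction:.1%}")   # 84.9%
```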
DISCUSSION
Collections of gene disruptions as tools for functional genomics: It is now possible in theory to mutate virtually any gene that has been molecularly identified in the major multicellular model organisms and to isolate the mutant allele on a standard genetic background free of secondary lesions. In practice, obtaining mutants remains a time-consuming task that constitutes the largest current impediment to progress in understanding gene function in vivo. While it has become widely accepted that gene sequence and structure can be more efficiently analyzed on a genome-wide scale, a similar consensus on the value of whole genome gene disruption has been slow to develop. As a result, linking genes with mutations remains a cottage industry pursued by individual laboratories. The work reported here has been motivated by the belief that complete gene mutation libraries are feasible and have the potential to greatly accelerate the rate at which gene function can be analyzed. We feel that whole genome mutant collections belong together with complete genome and cDNA sequences as essential tools for future biological research.
The BDGP gene disruption library represents a significant step toward the ultimate goal of stockpiling an identified mutation in every Drosophila transcription unit. The current collection of single P-element insertions provides a particularly useful type of link between the genetic and molecular properties of ∼1000 different autosomal genes that can mutate to a readily recognizable phenotype. This is more than the number of genes that have been characterized at both the genetic and molecular levels in any of the other widely used model multicellular eukaryotes, including Arabidopsis, C. elegans, zebrafish, or mice, and exceeds the number of gene-mutation links known in humans. As a reflection of its utility, lines from the BDGP collection have been utilized in publications characterizing more than 250 different genes since 1988 (Tables 4 and 5).
Expanding the collection: Because the Drosophila genome is believed to house ∼12,000 genes (Miklos and Rubin 1996), the current primary collection is still far from complete. Two basic approaches can be considered for expanding its coverage. A targeted strategy would avoid reisolating new mutations in genes that have already been disrupted in the existing collection or by individual Drosophila researchers. A general strategy for identifying mutations in any gene encoding a protein that can be detected with a specific antiserum has been developed (Dolph et al. 1992). However, a substantial number of genes that express proteins only at low levels may be refractory to disruption by this approach. Consequently, continuing the insertional mutagenesis strategy used previously in some form appears to be the most promising approach to completing the collection.
Significant improvements are possible in the short term by incorporating several new collections of insertions that have already been constructed since the project was initiated (Erdelyi et al. 1995; Deak et al. 1997; Rørth et al. 1998). The third chromosome collection described by Deak et al. (1997) is similar in size to the collection of Török et al. (1993) on chromosome 2, but preliminary estimates by the authors indicate a higher screen efficiency. Incorporating these lines into the existing collection should increase the number of third chromosome lines to >600 and equalize the saturation levels of the two major autosomes.
It will also be of value to carry out new mutagenesis screens. A major variable in the generation of single P-element-induced mutations is the wide variation in screen efficiency that is documented here (Table 2). One factor that can affect screen efficiency is the overall rate of P transposition. High transposition rates like those in the screen of Törok et al. (1993) produce an excess of lines with more than one P-element insertion (>23% in this case). High transposition rates probably also cause secondary mutations as elements transpose and excise at multiple sites over several germ cell division cycles. However, our results imply that the rate of transposition and amount of secondary damage are not always correlated and are not simply a function of the P elements used (Table 2). Both Bier et al. (1989) and Törok et al. (1993) employed the PlacW and Δ2-3 P elements but obtained very different frequencies of multiple insert lines, rates of background mutation, and overall screen efficiencies. In contrast, the screen of Cooley et al. (1988) using PUChsneo and a weak mobilizing P element exhibited a low transposition rate but still gave an efficiency of only ∼50%. Consequently, our results suggest that currently unidentified factors in the genetic backgrounds used for P-element mutagenesis affect the prevalence of damage at chromosomal sites that do not retain P-element sequences. Unfortunately, the nature of these factors remains poorly understood.
The number of new lines that needs to be characterized to substantially complete the gene disruption project can be estimated from our analysis of saturation. The genome contains ∼3600 vital genes, at least 3100 of which fall into the coldspot class. Statistically, twice this number of insertions, 6200, must be recovered in this class of genes to achieve 87% saturation. Because only 30% of raw insertions target the coldspot class, and because the best screens produce only 85% verified single insert lines, achieving 87% saturation would require the isolation and analysis of 6200/(0.3 × 0.85) = 24,300 autosomal insertions associated with phenotypes. This represents about six times as many lines as were analyzed in the current project.
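The estimate above can be reproduced with a short calculation. The sketch assumes a simple Poisson model in which all coldspot genes are equally mutable; the text does not spell out the formula, so this is our reading of it. Under that assumption, an average of two insertions per gene leaves a fraction e^-2 of genes unhit, giving about 86.5% saturation, close to the 87% quoted. Function names are ours.

```python
import math


def poisson_saturation(insertions, genes):
    """Fraction of genes hit at least once, assuming equal mutability."""
    return 1.0 - math.exp(-insertions / genes)


def raw_lines_needed(target_insertions, fraction_in_class, fraction_usable):
    """Raw lines required so that target_insertions land in the class.

    fraction_in_class: share of raw insertions that hit the gene class
    fraction_usable:   share of raw lines that are verified single inserts
    """
    return target_insertions / (fraction_in_class * fraction_usable)


COLDSPOT_VITAL_GENES = 3100            # from the text
target = 2 * COLDSPOT_VITAL_GENES      # 6200 insertions

saturation = poisson_saturation(target, COLDSPOT_VITAL_GENES)
lines = raw_lines_needed(target, fraction_in_class=0.3, fraction_usable=0.85)

print(f"saturation: {saturation:.1%}")          # 86.5%
print(f"raw lines needed: {lines:,.0f}")        # about 24,300
```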
A molecular strategy for finishing the mutation library: Even a project of this size is feasible, although a very large effort would be required. However, a continuation of the current approach would not address the estimated two-thirds of all genes that do not mutate to a readily detectable phenotype in genetic screens. To obtain P-element insertions that disrupt such genes, it will be necessary to look directly for changes in their structure. With large amounts of genomic and EST sequences becoming available and a strong commitment to completing the Drosophila genome sequence within 1–3 years (Collins et al. 1998; Venter et al. 1998), a strategy based entirely on molecular mapping is becoming feasible. A new generation of P-element misexpression vectors (Rørth 1996) are attractive candidates for use with this approach. These insertions not only can disrupt genes but also are frequently able to program the controlled misexpression of the affected protein. This option should accelerate the collection of functional information, especially on the many genes whose loss does not produce an immediately recognizable phenotype.
We propose to inaugurate a phase two gene disruption project whose goal would be to disrupt all Drosophila genes, regardless of phenotype. Flanking DNA will be recovered from a large number of raw insertion lines and sequenced, much as was done with the primary collection lines in the current collection. The short sequences obtained will allow most new insertions to be precisely positioned on the genomic sequence. Consulting EST and cDNA sequences, gene predictions, ORF homologies, and other relevant data in the vicinity of the insertion sites will allow rapid predictions as to whether each new insertion is likely to disrupt or misexpress an ORF not currently represented in the collection. Lines that do not appear to do so would be quickly discarded. Recently, this strategy has received a valuable test within the fully sequenced 2.9-Mb Adh region (Ashburner et al. 1999). By mapping all available P elements onto the genomic DNA sequence, not just those causing phenotypes as described here, the number of gene-mutation links was increased substantially (see Table 4).
The phase two strategy has several distinct advantages. First, it broadens the project to include all Drosophila genes. In addition, it greatly simplifies the work required to characterize new candidate lines, compensating in part for the much larger number of lines that will need to be analyzed. Polytene localizations are unnecessary, because multiinsert lines can be detected through their production of more than one distinct P-element flanking sequence. Balancing most of the newly mutagenized chromosomes is not required. Genetic complementation is not necessary, because redundant lines can increasingly be identified on the basis of their location. However, there are several requirements for success. First, the Drosophila genome sequence must be completed in a timely manner. Second, semiautomated methods for recovering and sequencing flanking DNA segments must be further improved. Finally, bioinformatic tools to assist decision making about line retention must be developed.
We can calculate the approximate number of lines that will need to be analyzed during the phase two project. About 11,000 of the estimated 12,000 Drosophila genes are predicted to fall into the coldspot class, assuming that the P-element mutability of all genes is similar to that of vital genes. Therefore, if 30% of new insertions fall in the coldspot class, as is the case with lethal insertions, and 95% of raw lines contain only one insertion, then 2 × 11,000/(0.3 × 0.95) = 77,000 lines would be required for 87% saturation. However, two observations suggest that some unselected insertions will fail to disrupt any gene, increasing the total number of lines that will need to be analyzed. First, P elements are attracted to at least some repetitive sequences such as yoyo, TAS, and hoppel, which are often located at nonmutagenic sites within the genome. The fraction of insertions that land in such sites might be significant. Second, P insertions that cause phenotypes cluster around the 5′ region of genes (Spradling et al. 1995; data not shown). Previously, insertions located too far upstream from transcription start sites, or at nonmutagenic sites within large introns, have been edited out by the requirement for a phenotypic effect. In phase two, they would be recovered and analyzed, lowering efficiency.
The relative fraction of unselected insertions that disrupt genes can be estimated, however. If all insertions mutated genes, then 33% of new transpositions should cause a recognizable phenotype, because about one-third of genes are thought to mutate in this manner. Instead, only ∼15% of raw insertions recovered on clean chromosomes cause a recognizable phenotype (see citations in Table 2), suggesting that only about half (15%/33% ≈ 0.5) of unselected insertions disrupt a gene. Consequently, as many as 77,000/0.5 = 154,000 insertions might need to be screened to obtain 87% saturation across all Drosophila genes. However, in practice, this may be an overestimate. P elements can be excised imprecisely to generate deletions adjacent to the insertion site. Because of the large number of mapped insertions that will be available by the time phase two is only partially complete, a strategy in which some genes are disrupted by excising nearby nonmutagenic insertions might substantially reduce the final number of strains that need to be generated and analyzed.
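The phase two estimates in the two preceding paragraphs can be laid out step by step. This sketch uses only the figures quoted in the text; the variable and function names are ours, and the rounding of the gene-disrupting fraction to 0.5 follows the text's own calculation.

```python
def lines_for_two_hits_per_gene(genes, fraction_in_class, fraction_single_insert):
    """Raw lines needed to average two insertions per gene in a class."""
    return 2 * genes / (fraction_in_class * fraction_single_insert)


# Step 1: lines needed if every insertion disrupted a gene.
baseline = lines_for_two_hits_per_gene(genes=11_000,
                                       fraction_in_class=0.30,
                                       fraction_single_insert=0.95)

# Step 2: share of unselected insertions that actually disrupt a gene,
# from observed (~15%) versus expected (33%) phenotype frequencies.
disrupting_fraction = 0.15 / 0.33

# Step 3: the text rounds the baseline to 77,000 and the fraction to 0.5.
upper_bound = 77_000 / 0.5

print(f"baseline lines: {baseline:,.0f}")                 # about 77,000
print(f"disrupting fraction: {disrupting_fraction:.2f}")  # 0.45
print(f"upper bound: {upper_bound:,.0f}")                 # 154,000
```

Using the unrounded fraction of 0.45 instead of 0.5 would raise the upper bound somewhat, so the 154,000 figure is best read as an order-of-magnitude estimate.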
A gene disruption library represents a fundamental and indispensable resource for analyzing gene function on a genome-wide scale. The BDGP gene disruption project has already accelerated studies of Drosophila gene function and is likely to be even more valuable as coverage increases. A pilot screen for phase two has already been completed in collaboration with several laboratories (Rørth et al. 1998). A total of 2400 lines from this project have been mapped and initially analyzed (BDGP, unpublished results; see http://www.fruitfly.org/bfd/). We believe that researchers using Drosophila (and other model multicellular organisms) are rapidly approaching an era where obtaining mutations, the basic tools for understanding gene function in vivo, will no longer limit the progress of research.
Acknowledgments
BDGP acknowledges all those researchers who participated in constructing the strains that were used in this project. These include L. Ackerman, M. Alvarado, S. Barbel, C. Berg, E. Bier, S. Bockheim, M. Boedingheimer, R. Carretto, Z. Chang, L. Cooley, M. Fuller, U. Gaul, R. Glaser, E. Grell, B. Harkins, M. Heck, L. Higgins, L. Jan, Y.-N. Jan, G. Karpen, R. Kelley, I. Kiss, A. Laughon, K. Lee, L. Lee, G. Mardon, K. McCall, D. McKearin, C. Montell, D. Montell, T. Overbode, B. Price, J. Riesgo, M. Scott, S. Shepherd, R. Smith, D. Thompson, T. Tick, T. Törok, J. Tower, T. Uemura, H. Vassin, E. Verheyen, S. Wasserman, and L. Yue. We are also grateful to many workers who in the course of this study communicated complementation results and other information on specific P-element strains. In particular, John Roote and Paul Lasko shared complementation data for 2L divisions 24-36 and 37-38. H. Bellen (various), Erica Roulier (29A), Ken Howard (45), Jordan Raff (46A), Elliott Goldstein (46), Robert Burgess (47EF), Claire Russell (49EF), Paul Wes (52E), and Boris Dunkov (99F) contributed and confirmed results in the cytogenetic regions indicated. We thank A. deGrey for assistance in analyzing chromosome 2 data. This work was supported by a genome center grant (P50NIHHG750) from the National Institutes of Health. A.C.S. and G.M.R. are Howard Hughes Medical Institute Investigators.
Footnotes
- Communicating editor: R. S. Hawley
- Received January 29, 1999.
- Accepted April 26, 1999.
- Copyright © 1999 by the Genetics Society of America