Corresponding author: Allan C. Spradling, Howard Hughes Medical Institute Research Laboratories, Department of Embryology, Carnegie Institution of Washington, 115 W. University Pkwy., Baltimore, MD 21210., spradling{at}mail1.ciwemb.edu (E-mail)
Communicating editor: R. S. HAWLEY
| ABSTRACT |
|---|
A fundamental goal of genetics and functional genomics is to identify and mutate every gene in model organisms such as Drosophila melanogaster. The Berkeley Drosophila Genome Project (BDGP) gene disruption project generates single P-element insertion strains that each mutate unique genomic open reading frames. Such strains strongly facilitate further genetic and molecular studies of the disrupted loci, but it has remained unclear if P elements can be used to mutate all Drosophila genes. We now report that the primary collection has grown to contain 1045 strains that disrupt more than 25% of the estimated 3600 Drosophila genes that are essential for adult viability. Of these P insertions, 67% have been verified by genetic tests to cause the associated recessive mutant phenotypes, and the validity of most of the remaining lines is predicted on statistical grounds. Sequences flanking >920 insertions have been determined to exactly position them in the genome and to identify 376 potentially affected transcripts from collections of EST sequences. Strains in the BDGP collection are available from the Bloomington Stock Center and have already assisted the research community in characterizing >250 Drosophila genes. The likely identity of 131 additional genes in the collection is reported here. Our results show that Drosophila genes have a wide range of sensitivity to inactivation by P elements, and provide a rationale for greatly expanding the BDGP primary collection based entirely on insertion site sequencing. We predict that this approach can bring >85% of all Drosophila open reading frames under experimental control.
THE nucleotide sequences of several complex eukaryotic genomes, including those of Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Mus musculus, and Homo sapiens, are virtually complete or scheduled for completion during the next several years.
Insertional mutagenesis provides a highly advantageous strategy for constructing mutations in advance throughout entire genomes, because it simplifies the problem of determining which genes have been disrupted. Insertional screens at low multiplicity have been carried out in bacteria.
Among model multicellular eukaryotes, insertional mutagenesis has been used for genetic analysis and functional genomics most extensively in Drosophila. Low-multiplicity mutageneses using engineered P elements have been carried out frequently.
The seven initial collections have now been analyzed, and we document here a library of strains that disrupts ~1000 different genes. A total of 450 of the strains mutate genes that have been described previously in Drosophila or represent novel loci defined by homology to well-studied genes in other organisms. These associations were made with the assistance of researchers throughout the Drosophila community who have used the collection to help characterize >250 genes, and through the efforts of the BDGP project, including 138 new gene-mutation links reported here. Another 135 disrupted genes are associated only with EST sequences that predict novel proteins or products related to proteins of unknown function in other organisms. An additional 138 lines are inserted within sequenced regions containing candidate open reading frames (ORFs). Thus, >700 of the 1045 mutant strains already link mutant phenotypes with specific open reading frames, and the remaining lines only await completion of the genomic DNA sequence. The ~1000 genes already represented in the library constitute ~25% of all Drosophila genes readily defined by mutation, and specify more gene-mutation links than are currently available in other model multicellular eukaryotes. Most important, on the basis of this work we have developed a new program to disrupt the remaining genes while the Drosophila genome sequence is being completed and annotated (see Table 1).
| MATERIALS AND METHODS |
|---|
Drosophila strains:
Flies were grown on standard corn meal/agar media.
Deficiency strains were obtained from the Bloomington Stock Center and from many individual laboratories. The deficiencies used are listed in Table 3.
Strain names:
BDGP strain names start with a prefix that indicates the chromosome and phenotypic effect of their single P-element insertion. For example, third chromosome strain names begin with either "l(3)" (lethal or strong semilethal), "fs(3)" (female sterile or strong semisterile), "ms(3)" (male sterile or strong semisterile), "v(3)" (visible), or "n(3)" (no obvious phenotype). Semilethal and semisterile mutations were utilized only if they were strong enough to score in complementation tests. Only the effect of the P insertion, not of any secondary mutations on the same chromosome, whether present initially or acquired later, is indicated by the prefix. The phenotypic prefix is followed by a unique designator to distinguish individual lines and to preserve the original names of the lines.
Gene names:
Symbols for Drosophila gene names are as given by FlyBase. For potentially novel loci defined only by a BDGP insertion strain, the name of the primary strain constitutes the provisional gene name, in accordance with FlyBase rules. Allele names for all the mutations are represented using the designator as the allele superscript. For example, because strain l(2)k10325 is part of the complementation group whose primary strain is l(2)03350 defining a new gene, its mutation is designated l(2)03350<sup>k10325</sup>. The P-element mutation in strain l(2)s4771 that is allelic to kismet (kis) is designated kis<sup>s4771</sup>. Again, because it is the designator that is presented in allele tables, it is wise to search FlyBase with the wild-carded designator.
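The naming rules above lend themselves to a small parser. The following is a minimal sketch, not part of the BDGP or FlyBase tooling: the function names `parse_strain` and `allele_name` are hypothetical, the regular expression covers only the five prefixes listed in the text, and square brackets are used here as an assumed plain-text stand-in for the allele superscript.

```python
import re
from typing import NamedTuple

# Phenotypic prefixes as described in the text.
PHENOTYPES = {
    "l": "lethal or strong semilethal",
    "fs": "female sterile or strong semisterile",
    "ms": "male sterile or strong semisterile",
    "v": "visible",
    "n": "no obvious phenotype",
}

# prefix, chromosome in parentheses, then the unique designator
STRAIN_RE = re.compile(r"^(l|fs|ms|v|n)\((\d)\)(\S+)$")


class Strain(NamedTuple):
    phenotype: str
    chromosome: int
    designator: str


def parse_strain(name: str) -> Strain:
    """Split a strain name such as 'l(2)k10325' into its parts (hypothetical helper)."""
    match = STRAIN_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized strain name: {name!r}")
    prefix, chromosome, designator = match.groups()
    return Strain(PHENOTYPES[prefix], int(chromosome), designator)


def allele_name(gene_symbol: str, strain_name: str) -> str:
    """Build an allele name; [..] is an assumed plain-text form of the superscript."""
    return f"{gene_symbol}[{parse_strain(strain_name).designator}]"


print(parse_strain("l(2)k10325"))
print(allele_name("l(2)03350", "l(2)k10325"))
print(allele_name("kis", "l(2)s4771"))
```

Because the designator is what survives into the allele name, extracting it first is also a convenient way to build a wild-carded search term.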
Localization of inserts by in situ hybridization:
P elements were localized by in situ hybridization to polytene chromosomes as described previously.
Complementation testing:
Complementation crosses were carried out among single-insertion lines whose insertions were localized within six to eight polytene bands of each other. A two-stage strategy was used to limit the number of crosses and to minimize redundancy. Each line was first crossed to a representative of any locus within range having multiple alleles. Lines failing to complement were identified as additional alleles and eliminated from further crosses. Lines not allelic to such local "hotspots" were subsequently crossed to representatives of the other complementation groups within the relevant zone. As soon as two complementation groups were joined, it was assumed that their behavior was uniform, and few additional crosses between the subgroups were carried out. Generally this strategy worked well. However, in a small number of cases, incomplete or inconsistent complementation behavior was observed due to localization errors larger than four to eight bands, to intergenic complementation, to semilethality, to inadvertent selection of a rearranged allele as the representative allele, to stock instability, or to errors in obtaining or recording complementation data. Problem complementation groups were reanalyzed on a case-by-case basis and the source of the contradiction resolved.
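The bookkeeping behind this strategy, in which a single failure to complement joins two groups and the merged group is then treated as uniform, resembles a disjoint-set (union-find) structure. The sketch below is only an illustration of that idea under that assumption; the class name, method names, and line labels are hypothetical and do not describe software used in the project.

```python
class ComplementationGroups:
    """Track complementation groups with a disjoint-set structure (illustrative only)."""

    def __init__(self):
        self._parent = {}

    def find(self, line):
        """Return the representative line of the group containing `line`."""
        self._parent.setdefault(line, line)
        while self._parent[line] != line:
            # Path halving keeps later lookups short.
            self._parent[line] = self._parent[self._parent[line]]
            line = self._parent[line]
        return line

    def record_cross(self, line_a, line_b, complements):
        """Failure to complement is taken as allelism and joins the two groups."""
        root_a, root_b = self.find(line_a), self.find(line_b)
        if not complements and root_a != root_b:
            self._parent[root_b] = root_a

    def same_group(self, line_a, line_b):
        return self.find(line_a) == self.find(line_b)


# Hypothetical line labels, used only to exercise the structure.
groups = ComplementationGroups()
groups.record_cross("lineA", "lineB", complements=False)
groups.record_cross("lineB", "lineC", complements=False)
groups.record_cross("lineC", "lineD", complements=True)
print(groups.same_group("lineA", "lineC"), groups.same_group("lineA", "lineD"))
```

Once two groups are merged, no further crosses between their members are needed to answer `same_group`, which mirrors the reduction in crosses described above. The structure does not model the exceptions the text lists, such as intergenic complementation or semilethality.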
Verification:
Strains from the primary collection were crossed to deficiencies (see Table 3) to verify that the P insertion caused the recessive phenotype. In 1717 single-insert strains, the cytogenetic locus of the P element clearly fell within the boundaries of existing deficiency (Df) chromosomes (Table 2). An uncertainty of four to six bands in the cytogenetic breakpoints was assumed, and the previous results of complementation tests with verified lines in the region were also considered.
The availability of DNA sequence information that can link insertion sites to nearby ESTs, transcripts, and predicted genes is expected to significantly change the way decisions to retain or discard lines are made.
Flanking sequence determination:
Flanking sequences from one or both ends of most P-element insertions in the primary collection were determined by one or both of two methods. Plasmids containing the 5' P element and flanking genomic sequences were rescued from many strains. Prior to rescue, the line was expanded, and 40–100 adult flies were collected and frozen at -20°.
For plasmid rescue, a sample of genomic DNA equivalent to two to four flies was digested with an appropriate restriction enzyme (e.g., XbaI for the PZ lines), then ligated at low DNA concentration to circularize the restriction fragments. Subsequently, DH10B cells were transformed by electroporation. The resulting colonies had acquired the circularized restriction fragment containing the selectable marker, the bacterial origin of replication, one P-element inverted repeat, and a variable amount of flanking genomic DNA. For each rescue, four to six transformants were screened by DNA miniprep and restriction digestion. In cases where at least three of the four (or five of the six) transformants exhibited identical patterns, a plasmid was chosen for sequencing that represented the major class. Occasionally, when a transformation experiment yielded more than one plasmid form, the appropriate plasmid was identified by in situ hybridization. These plasmids were sequenced directly using a primer designed to the P-element inverted repeat. The success rate in this procedure was ~80%.
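The rule for picking a plasmid to sequence can be written as a simple majority check. This is a hedged sketch: the function name and the pattern labels are hypothetical, and the threshold for five transformants (four of five) is an assumption, since the text states the rule only for four and six.

```python
from collections import Counter


def choose_major_class(patterns):
    """Return the dominant restriction pattern, or None if no class qualifies.

    `patterns` holds one label per screened transformant (four to six).
    The text requires 3 of 4 or 5 of 6 identical patterns; treating the rule
    as "all but one" for five transformants is an assumption made here.
    """
    if not 4 <= len(patterns) <= 6:
        raise ValueError("expected four to six transformants")
    pattern, count = Counter(patterns).most_common(1)[0]
    return pattern if count >= len(patterns) - 1 else None


# Hypothetical digest pattern labels.
print(choose_major_class(["A", "A", "A", "B"]))  # a major class exists
print(choose_major_class(["A", "A", "B", "B"]))  # no major class
```

A result of `None` corresponds to the cases described above in which more than one plasmid form was recovered and a further test was needed.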
The remaining lines were analyzed by recovering a smaller amount of DNA using inverse PCR according to the method of J. Rehm (http://www.fruitfly.org/methods/). This method was successfully adapted to a 96-well format where the success rate in obtaining 25 bp or more of flanking sequences has been >85%.
Association with ESTs:
BDGP is generating a collection of 80,000 Drosophila EST sequences with support from Howard Hughes Medical Institute (accessible at http://www.fruitfly.org/EST/). During the preparation of this article, ~48,000 ESTs were available for comparison. Each flanking sequence was searched against this EST database, matches validated by inspection, and the position of the P insertion relative to the EST-homologous portion of the flanking sequence determined. The names of ESTs with strong matches are given in Table 4 and Table 5. Only ESTs that were located within ~100 bp of the P element are reported; more distant sequence matches might represent adjacent transcripts and were not included in the tables.
Stock distribution:
To hasten the availability of the gene disruptions, verified lines from the primary collection were sent to the Bloomington Stock Center in several batches beginning in 1993; the number of strains reached 700 by late in 1994. All 1052 primary collection strains have been available from the Bloomington Stock Center since October 1997. Reserve alleles are maintained at the Carnegie Institution (chromosome 3) or at Berkeley (chromosome 2), and have also been available on request since 1993. Information about stocks is updated periodically on the BDGP website and strains found to be inappropriate are removed from the Bloomington Stock Center. Information derived from further study of any of the BDGP stocks is welcome and should be forwarded to the corresponding author's e-mail address.
Statistical analysis of saturation:
Previous attempts to estimate the saturation behavior of P elements have utilized inadequately characterized data sets. We focused on the 737 independent lines from chromosome 2 and 535 independent lines from chromosome 3 that contain a single verified P insertion lying clearly within the validated deficiencies used in the verification analysis. Within this group, for a known number of total lines (transposition events), the number of genes mutated and how many times each was hit should have been determined with complete accuracy. Because the deficiencies included a majority of chromosome 2 and 3 genes (60.3 and 62.0%, respectively), and should be distributed effectively at random, this sample should accurately represent all insertions that cause a phenotype.
Focusing on the less-mutagenized chromosome first, we determined that 154 of the 535 third chromosome genes had been hit once, 43 twice, 16 three times, 6 four times, and 5 five times and that 18 were previously discussed hotspot loci hit six or more times (Table 1). Despite the small number of hotspot loci, they accounted for 204 of the 535 insertions (38%). First, we attempted to fit the data to a Poisson distribution, ignoring for the moment the obvious presence of hotspot genes. The best distribution (λ = 0.558; Table 1) fit the data poorly because it predicted only 8.0 genes (instead of 16) hit three times, only 1.1 (instead of 6) hit four times, and 0.1 instead of 5 hit five times (χ² = 270, P << 0.001). To determine if the observed "excess" of genes hit three to five times was caused by the statistical tail from the hotspot loci, we used a binomial distribution to model their contribution (Table 1). The distribution used maximizes the contribution of hotspot genes to the classes of genes hit 3–5 times, while yielding the observed number of genes hit 6–12 times. Despite this, the results reveal that there are too few hotspot genes to account for the excess of genes hit 3–5 times (Table 1; χ² = 20, P << 0.001).
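The single-Poisson expectations quoted above can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the Poisson parameter 0.558 is taken from the text, but the total of 483 genes is not given in the text and was back-calculated here only because it reproduces the quoted expectations of 8.0, 1.1, and 0.1.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a gene is hit exactly k times under a Poisson model."""
    return math.exp(-lam) * lam**k / math.factorial(k)


LAMBDA = 0.558                   # Poisson parameter quoted in the text
N_GENES = 483                    # ASSUMPTION: back-calculated, not stated in the text
OBSERVED = {3: 16, 4: 6, 5: 5}   # observed chromosome 3 genes hit 3, 4, 5 times

expected = {k: N_GENES * poisson_pmf(k, LAMBDA) for k in OBSERVED}
for k in sorted(OBSERVED):
    print(f"hit {k} times: expected {expected[k]:.1f}, observed {OBSERVED[k]}")
```

The observed counts exceed the single-Poisson expectations by a growing factor as the hit number rises, which is the shortfall the paragraph describes.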
Consequently, a class of genes of intermediate mutability must exist (warmspot genes). To estimate the size of this class, we fit the data for genes hit one to five times on the assumption of two mutability classes, warmspot and coldspot genes. Postulating 115 "warmspot" genes (λ = 1.51) and 613 "coldspot" genes (λ = 0.241) produced a good fit to the data (Table 1; χ² = 0.81, P >> 0.05). Extrapolating the warmspot and coldspot data to the entire chromosome and adding the whole chromosome hotspot data, the following was predicted for the third chromosome: 27 hotspot loci + 115/0.603 = 191 warmspot loci + 613/0.603 = 1017 coldspot loci.
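The two-class fit and the extrapolation can be reproduced approximately as a mixture of two Poisson distributions. The sketch below uses only the class sizes, Poisson parameters, and the 0.603 scaling factor quoted above. It deliberately omits the binomial hotspot tail that the authors also modeled, so it is expected to underestimate the higher hit classes; the function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)


def mixture_expected(k: int, classes) -> float:
    """Expected number of genes hit k times, summed over (n_genes, lam) classes."""
    return sum(n_genes * poisson_pmf(k, lam) for n_genes, lam in classes)


CHR3_CLASSES = [(115, 1.51), (613, 0.241)]      # warmspot, coldspot (from the text)
OBSERVED_CHR3 = {1: 154, 2: 43, 3: 16, 4: 6, 5: 5}
DF_FRACTION = 0.603                              # scaling factor used in the text

expected_chr3 = {k: mixture_expected(k, CHR3_CLASSES) for k in OBSERVED_CHR3}
for k, observed in OBSERVED_CHR3.items():
    print(f"hit {k} times: expected {expected_chr3[k]:.1f}, observed {observed}")

# Extrapolate class sizes from the deficiency-covered sample to the whole chromosome.
warm_total = round(115 / DF_FRACTION)
cold_total = round(613 / DF_FRACTION)
print(warm_total, cold_total)
```

The classes of genes hit once and twice, which dominate the data, come out within one gene of the observed counts.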
We next considered the second chromosome and found that 737 independent verified lines defined 190 genes that had been hit once, 57 twice, 19 three times, 17 four times, 5 five times, and 32 that were hotspot loci hit six or more times (Table 1). Hotspot insertions accounted for 288 of these lines (39%). Again, at least two general classes of mutability were required to fit the data from non-hotspot lines, even after correcting for the contribution of hotspot genes (Table 1; χ² = 36, P << 0.001). Because we expected genes on the second and third chromosomes to have the same average mutabilities, we reasoned that the Poisson parameters for chromosome 2 warmspot and coldspot loci should correspond to parameters of the corresponding chromosome 3 genes corrected for the more extensive mutagenesis that was carried out on chromosome 2. The relative fraction of independent single-insert lines analyzed on chromosome 2 compared to chromosome 3 was 737/535 = 1.38. Multiplying the warmspot and coldspot class Poisson parameters determined for chromosome 3 by 1.38 gave the expected values on chromosome 2 (λ = 2.08 and 0.331). Values of 110 warmspot and 680 coldspot loci on chromosome 2 were then determined to fit the distribution (Table 1; χ² = 3.8, P >> 0.05). Thus, chromosome 2 is predicted to house 47 hotspot genes, 110/0.62 = 177 warmspot genes and 680/0.62 = 1097 coldspot genes.
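The chromosome 2 scaling can be followed step by step in the same way. This sketch rescales the chromosome 3 parameters by the ratio of lines analyzed and then evaluates the two-class mixture with the parameters quoted in the text. The rescaled coldspot value differs from the quoted 0.331 in the third decimal place, presumably because the published parameters were rounded; that explanation is an assumption.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)


LINES_CHR2, LINES_CHR3 = 737, 535
scale = LINES_CHR2 / LINES_CHR3          # about 1.38, as in the text

warm_lam = 1.51 * scale                  # text quotes 2.08
cold_lam = 0.241 * scale                 # text quotes 0.331
print(f"scale {scale:.2f}, warmspot {warm_lam:.2f}, coldspot {cold_lam:.3f}")

# Evaluate the mixture with the class sizes and parameters quoted in the text.
CHR2_CLASSES = [(110, 2.08), (680, 0.331)]
OBSERVED_CHR2 = {1: 190, 2: 57}
expected_chr2 = {
    k: sum(n * poisson_pmf(k, lam) for n, lam in CHR2_CLASSES)
    for k in OBSERVED_CHR2
}
print({k: round(v, 1) for k, v in expected_chr2.items()})

# Whole-chromosome extrapolation with the 0.62 factor used in the text.
warm_total = round(110 / 0.62)
cold_total = round(680 / 0.62)
print(warm_total, cold_total)
```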
| RESULTS |
|---|
Rationale:
The gene disruption library was assembled from ~3900 starting lines that had been produced in seven separate single P-element mutagenesis screens (Table 2). Each starting line contained one (or a few) P-element insertions on an autosome bearing a newly induced scorable recessive phenotype. The process of going from this amalgamated raw collection to the finished library involved (1) localizing the insertions by in situ hybridization to polytene chromosomes at high resolution; (2) identifying strains with allelic insertions by inter se complementation crosses; (3) verifying that insertions were responsible for the mutant phenotype by crossing to chromosomes bearing deficiencies; and (4) sequencing DNA flanking the insertions and comparing it to EST and genomic sequence databases.
Identifying single-insert lines:
The P insertion(s) in each line was cytogenetically localized by in situ hybridization as described previously.
Identifying allelic mutations:
Complementation crosses were carried out inter se between lines whose insertions were located near each other (see MATERIALS AND METHODS). The maximum cytogenetic distance between the reported positions of insertions that is required to ensure they are not allelic depends critically on the accuracy of the in situ localization. We complementation tested lines when the distance between their elements was six to eight bands or less. This should have been sufficient to eliminate errors even in cytogenetically difficult regions, because the divergence in the reported positions of allelic insertions averaged less than one band.
The complementation analysis provided considerable insight into the frequency with which individual loci are mutated by P elements (see Table 4 and Table 5). In particular, 74 complementation groups on the autosomes were identified that are hotspots for P-element insertion with between 6 and 37 alleles each. Because of the size of our data set, these loci likely comprise virtually all the P-element insertion hotspots on the autosomes. Following completion of the complementation analysis, one allele of each complementation group was retained for the primary collection (see Table 2).
Verifying that the insertions cause mutations:
Three criteria were used to determine if the P-element insertion in a given single insert line was likely to be responsible for the observed mutant phenotype. First, if the mutation failed to complement one or more independently derived strains whose insertions had been localized nearby, then it was considered verified along with all the other insertions in the complementation group so defined. (The chance that such lines actually contained identical secondary or background mutations was negligible as indicated by test crosses with lines whose insertions were at different sites.) To apply the second verification test, the strain in question was crossed to deficiency chromosomes whose cytogenetically determined breakpoints (Table 3) indicated that they might lack the disrupted gene. Crosses were scored based on the presumed phenotype of the insertion (Table 4 and Table 5, "Df comp" and "Df noncomp"). Lines that failed to complement were considered to be verified, because the chance that a background mutation was closely linked to the P element was acceptably small. If complementation was observed, the line was discarded if its insertion clearly fell within the deficiency boundaries; otherwise it was retained but remained unverified. These two tests, combined with further verification based on the analysis of flanking DNA sequences as described below, allowed the total number of lines in the primary collection to be reduced to 1045, of which 725 (69%) are verified (see Table 2). Of these lines, 93% disrupt vital genes, while most of the remainder cause male or female sterility. The phenotype and verification status of each line are shown in Table 4 and Table 5.
We can estimate the approximate number of bogus lines that remain in the library. First, the overall fraction of verified lines arising from each screen is calculated by restricting our analysis to those lines whose insertions clearly fall within the boundaries of valid deficiencies and hence can be reliably tested (Table 2, "in Df"). This subgroup represents >60% of all the lines and should be representative of each screen as a whole. The proportion of lines that were verified ranges from 48–88% among the seven screens. Assuming that insertions falling outside the deficiencies are as likely to be valid as those inside allowed us to estimate the number of unverified primary collection lines from each starting screen that are likely to be valid. After making this final correction, the final number of different genes disrupted by P insertions in the collection is estimated to be 953 (Table 2). Using this information we also determined an overall efficiency for each of the seven screens, defined as the percentage of raw lines that contain a single insertion causing its associated phenotype (Table 2, "screen efficiency").
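The correction described above amounts to applying the verified fraction measured inside deficiencies to the lines that could not be tested. The sketch below illustrates that arithmetic for a single screen. Every number in it is hypothetical and chosen only for illustration (the per-screen values are in Table 2), and the function name is not from the source.

```python
def estimate_valid_lines(verified_in_df: int, tested_in_df: int,
                         verified_total: int, unverified_total: int) -> float:
    """Estimate how many lines from one screen carry a valid insertion.

    Assumes, as the text does, that untested lines outside deficiencies are
    as likely to be valid as lines tested inside deficiencies.
    """
    fraction_valid = verified_in_df / tested_in_df
    return verified_total + fraction_valid * unverified_total


# HYPOTHETICAL screen: 70 of 100 lines tested within deficiencies were verified,
# 90 lines are verified overall, and 40 retained lines remain unverified.
estimate = estimate_valid_lines(verified_in_df=70, tested_in_df=100,
                                verified_total=90, unverified_total=40)
print(f"estimated valid lines: {estimate:.0f}")
```

Summing such per-screen estimates over the seven screens is the kind of calculation that would yield a collection-wide figure like the one reported.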
Deficiency chromosomes with accurate breakpoints are a valuable genetic resource. Knowledge of the true extent of material deleted in deficiency stocks was improved as a by-product of verifying the P insertions. The location of each verified insertion predicted the expected complementation behavior with relevant deficiencies. In cases where contradictions were observed, the breakpoints of the deficiency could sometimes be refined on the basis of the cytogenetic localizations of the terminal P elements (see Table 4 and Table 5). A number of such corrections have been incorporated into FlyBase (see flybase.bio.indiana.edu:80/.bin/fbabsq.html). Table 3 shows current estimates of the deficiency breakpoints used in these studies.
Characterizing insertions using flanking DNA sequence:
The genomic DNA sequence flanking the insertion sites in the primary collection lines was needed to complete the verification process and to begin associating lines with specific genes. Physically associating as many insertions as possible with specific sites in the genome would also enhance the usefulness of the primary collection for gene mapping and for directed mutational screening using accurately positioned starting strains.
The sequences flanking the insertions were initially compared among themselves as an additional verification test. We wished to eliminate lines whose insertions were very close together but that behaved genetically like separate genes. Such lines are likely to be produced when chromosomes bearing nonallelic background mutations acquire insertions within the same nonvital gene. The genetic behavior of the resulting strains will cause them to survive into the primary collection if their insertions lie outside existing deficiencies. On the other hand, we did not want to eliminate valid insertions in adjacent genes. Consequently, in the absence of additional information, nonallelic insertions separated by 100 bp or more were assumed to represent distinct genes. When the separation was <100 bp, usually only one (if verified) or neither line was retained in the primary collection. Rarely, this might have led to the loss of valid lines, for example, in cases of overlapping genes or intragenic complementation, but it allowed us to discard nearly 100 questionable strains from the collection.
After completing these tests 1045 lines remained in the primary collection. Flanking sequence information has been obtained from 921 of the lines in this final group (88%). Accession numbers for each strain are listed in Table 4 and Table 5 ("Sequence"). These sequences, including the position of the insertion, are listed on the BDGP website (http://www.fruitfly.org/p_disrupt/).
Associating primary collection lines with genes:
The primary collection provides an opportunity to link ~1000 Drosophila genes with a genetic phenotype. Because these strains and genetic data have been publicly available from the inception of the project, the Drosophila research community has extensively utilized many lines from the primary collection (and the precursor raw collections). Publications describing at least 250 different Drosophila genes have employed strains from the collection (see Table 4 and Table 5, "References"). In many cases, the P-element disruption strain played a major role in the initial characterization of the gene in question.
To identify as many additional genes as possible the P-element flanks were searched against all Drosophila sequences in GenBank and ~26 Mb of genomic sequence (most searches are current as of December 1998). To test the accuracy of flanking sequence recovery, the polytene location of the P element in each of the 286 lines whose flanking sequences matched genomic sequence determined by BDGP was compared to the independently mapped polytene location of the corresponding P1 clones. Only a few discrepancies resulted, presumably due to the rare recovery of sequence from a cryptic P element, and in these cases a correct flanking sequence was sought. These searches provided a wide variety of valuable information. They confirmed most of the 250 published gene assignments, identified many additional characterized Drosophila genes disrupted by strains in the collection, and molecularly positioned the insertion sites within all these loci. Of the additional Drosophila genes, 55 had previously been characterized only at the molecular level (Table 6).
Further links to well-characterized genes were discovered by associating the insertions with Drosophila transcripts defined by EST sequencing. About 48,000 Drosophila EST sequences were available for these comparisons. A total of 376 insertions were located close to or within an EST sequence, usually near the 5' end (see Table 4 and Table 5). Mutation-causing P elements are known to preferentially cluster in the 5' region of the affected genes.
Although the insertions in the remaining lines were not associated with a well-characterized gene or ortholog, it was still possible to link many of them with predicted transcripts and ORFs. The sequence comparisons associated the insertions in 135 additional lines with ESTs whose clots either predicted novel proteins or matched proteins conceptually encoded by ESTs or ORFs from other organisms. BLAST reports of these searches, including periodic updates, are available by searching the BDGP website using the appropriate EST (Table 4 and Table 5). Finally, the insertions within 138 of the remaining lines not associated with genes or ESTs were localized within sequenced portions of the Drosophila genome. Bioinformatic analyses of the sequences flanking these insertions reveal candidate ORFs, although such studies have not yet been carried out systematically. In sum, therefore, 706 of the 1045 primary collection strains (67%) already link known or candidate genes with mutant phenotypes. It should be possible to make most of the remaining gene-mutant associations by the time genome and EST sequencing nears completion.
P-element selectivity:
This study reveals the identity of most genes that are hotspots for P-element insertion on the autosomes (Table 4 and Table 5, "Alleles"). We searched for common properties that might explain their efficiency as P-element insertional targets. Hotspot genes are not associated with generally high transcription levels, because only 30% of the genes in the primary collection with more than five alleles have an associated EST sequence, compared to 36% for the collection as a whole. Hotspot genes might be those actively transcribed in premeiotic germline cells, where P elements usually transpose; however, the few genes in the collection whose transcripts are abundant in early germ cells, including vasa, bam, and hsp83, were each hit only once. Indeed, our comparisons uncovered no common biological features such as size, location, or regulation that might explain why hotspot genes are highly susceptible to P-element insertion.
We also considered whether strong preferences exist for insertion within certain classes of genes among all those disrupted in the collection. The primary collection includes an estimated 30% of readily mutable autosomal genes. Genes involved in signal transduction were usually well represented, because the collection mutates ~50% of all autosomal genes known to be involved in the EGFR, dpp, ras, wg, hh, or N signaling pathways. In addition, disruptions were obtained in 46% of autosomal posterior group genes, 31% of trithorax and Polycomb group genes, but only 14% of ribosomal protein genes. It remains unclear if these differences reflect more than the research priorities of the Drosophila research community.
Not all insertion sites were associated with protein-coding genes. One P element was located within a 5S rDNA repeat and four interrupted tRNA clusters. Nine lines, two of which disrupt the genes Distal-less and fruitless, were found by sequence analysis to contain insertions within the LTR sequence of a Drosophila retrotransposon related to the yoyo element of the Mediterranean fruit fly Ceratitis capitata.
Modeling mutational saturation:
The gene disruption project provides a much larger and better-characterized data set than has been previously available for analyzing the site specificity of P-element transposition. This is an important question for determining the appropriate strategy to expand the collection. The insertional specificity of P elements must be extremely broad to achieve complete or nearly complete coverage of all Drosophila genes. In contrast, previous studies inferred that a significant percentage of Drosophila genes, perhaps as great as 50%, are refractory to mutation using P elements.
We focused on the large subset of the P-element lines from the collections whose insertions lie within the boundaries of validated deficiencies. Within this group, for a known number of total lines (transposition events), the number of genes mutated and how many times each was hit has been determined with complete accuracy. Because the deficiencies included a majority of chromosome 2 and 3 genes (60.3 and 62.0%, respectively), and should be distributed effectively at random, this sample should accurately represent all insertions that cause a phenotype. When we analyzed the distribution of insertional mutations among this set of genes, it was clear that the data did not fit a simple Poisson distribution (see MATERIALS AND METHODS; Table 1). The most obvious problem was the hotspot loci. On chromosomes 3 and 2, just 18 or 32 loci account for 38 or 39% of all insertions, respectively. However, even after subtracting the contribution of these hotspot loci, the distribution of gene mutabilities remained skewed (see MATERIALS AND METHODS; Table 1). Consequently, a class of warmspot genes was inferred whose mutability is intermediate between the hotspot loci and the large group of low mutability coldspot genes. Assuming the existence of three major mutability classes allowed a good fit to the data.
This model provides several useful insights into P-element behavior. The third chromosome is predicted to contain 27 hotspot loci + 191 warmspot loci + 1017 coldspot loci, while the second chromosome should house 47 hotspot genes + 177 warmspot genes + 1097 coldspot genes. Despite accounting for only 17% of all genes, the 368 warmspot and 74 hotspot genes account for ~70% of all transposition events. As a result, virtually all the hotspot loci and 80–90% of the warmspot loci have already been defined by strains in the primary collection. On the other hand, only 22–28% of the coldspot loci have so far been disrupted.
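The summary percentages above can be recovered approximately from the model parameters given in MATERIALS AND METHODS. The sketch below is one plausible reconstruction and not necessarily how the authors derived their figures: it assumes that the fraction of a class already disrupted is the Poisson probability of at least one hit, and that warmspot insertions within the deficiency sample equal class size times Poisson parameter.

```python
import math


def fraction_hit(lam: float) -> float:
    """Probability that a gene with Poisson parameter lam has been hit at least once."""
    return 1.0 - math.exp(-lam)


# Poisson parameters quoted in the text (chromosome 3, chromosome 2).
warm_coverage = [fraction_hit(1.51), fraction_hit(2.08)]
cold_coverage = [fraction_hit(0.241), fraction_hit(0.331)]
print("warmspot coverage:", [f"{c:.0%}" for c in warm_coverage])
print("coldspot coverage:", [f"{c:.0%}" for c in cold_coverage])

# Share of genes that are warmspot or hotspot loci (whole-chromosome predictions).
share_genes = (368 + 74) / (368 + 74 + 1017 + 1097)

# Share of insertions in the deficiency sample landing in those loci:
# hotspot insertions are counted directly, warmspot insertions are modeled.
hot_insertions = 204 + 288
warm_insertions = 115 * 1.51 + 110 * 2.08
share_events = (hot_insertions + warm_insertions) / (535 + 737)
print(f"genes: {share_genes:.0%}, transposition events: {share_events:.0%}")
```

Under these assumptions the warmspot coverage comes out slightly below the quoted range on chromosome 3, so the published figures should be read as rounded.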
| DISCUSSION |
|---|
Collections of gene disruptions as tools for functional genomics:
It is now possible in theory to mutate virtually any gene that has been molecularly identified in the major multicellular model organisms and to isolate the mutant allele on a standard genetic background free of secondary lesions. In practice, obtaining mutants remains a time-consuming task that constitutes the largest current impediment to progress in understanding gene function in vivo. While it has become widely accepted that gene sequence and structure can be more efficiently analyzed on a genome-wide scale, a similar consensus on the value of whole genome gene disruption has been slow to develop. As a result, linking genes with mutations remains a cottage industry pursued by individual laboratories. The work reported here has been motivated by the belief that complete gene mutation libraries are feasible and have the potential to greatly accelerate the rate at which gene function can be analyzed. We feel that whole genome mutant collections belong together with complete genome and cDNA sequences as essential tools for future biological research.
The BDGP gene disruption library represents a significant step toward the ultimate goal of stockpiling an identified mutation in every Drosophila transcription unit. The current collection of single P-element insertions provides a particularly useful type of link between the genetic and molecular properties of ~1000 different autosomal genes that can mutate to a readily recognizable phenotype. This is more than the number of genes that have been characterized at both the genetic and molecular levels in any of the other widely used model multicellular eukaryotes, including Arabidopsis, C. elegans, zebrafish, or mice, and exceeds the number of gene-mutation links known in humans. As a reflection of its utility, lines from the BDGP collection have been utilized in publications characterizing more than 250 different genes since 1988 (Table 4 and Table 5).
Expanding the collection:
Because the Drosophila genome is believed to house ~12,000 genes (![]()
Significant improvements are possible in the short term by incorporating several new collections of insertions that have already been constructed since the project was initiated (![]()
It will also be of value to carry out new mutagenesis screens. A major variable in the generation of single P-element-induced mutations is the wide variation in screen efficiency that is documented here (Table 2). One factor that can affect screen efficiency is the overall rate of P transposition. High transposition rates like those in the screen of ![]()
![]()
![]()
2-3 P elements but obtained very different frequencies of multiple insert lines, rates of background mutation, and overall screen efficiencies. In contrast, the screen of ![]()
The number of new lines that needs to be characterized to substantially complete the gene disruption project can be estimated from our analysis of saturation. The genome contains ~3600 vital genes, at least 3100 of which fall into the coldspot class. Statistically, twice this number of insertions, 6200, must be recovered in this class of genes to achieve 87% saturation. Because only 30% of raw insertions target the coldspot class, and because the best screens produce only 85% verified single insert lines, achieving 87% saturation would require the isolation and analysis of 6200/(0.3 x 0.85) = 24,300 autosomal insertions associated with phenotypes. This represents about six times as many lines as were analyzed in the current project.
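The arithmetic in this estimate can be reproduced with a short sketch. It assumes, as the factor of two and the 87% figure suggest, that hits within the coldspot class follow a Poisson distribution with equal mutability for every gene, so that the fraction of genes hit at least once is 1 - exp(-m) for a mean of m insertions per gene. The function and variable names are illustrative only; the numerical inputs are those quoted in the text.

```python
import math

def saturation(mean_hits_per_gene):
    """Fraction of genes hit at least once under a Poisson model (assumed)."""
    return 1.0 - math.exp(-mean_hits_per_gene)

def lines_needed(genes, mean_hits_per_gene, class_fraction, single_insert_fraction):
    """Raw lines required to recover genes * mean_hits useful insertions.

    class_fraction: share of raw insertions that land in the target class.
    single_insert_fraction: share of lines that are verified single inserts.
    """
    useful_insertions = genes * mean_hits_per_gene
    return useful_insertions / (class_fraction * single_insert_fraction)

# Values quoted in the text for vital coldspot genes.
coldspot_vital_genes = 3100
mean_hits = 2  # "twice this number of insertions"

sat = saturation(mean_hits)
total_lines = lines_needed(coldspot_vital_genes, mean_hits, 0.30, 0.85)

if __name__ == "__main__":
    print(f"saturation at {mean_hits} hits per gene: {sat:.1%}")
    print(f"lines required: {total_lines:,.0f}")
```

Under this assumption two insertions per gene give a saturation of about 86.5%, in line with the 87% quoted, and the required number of lines rounds to the 24,300 given above.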
A molecular strategy for finishing the mutation library:
Even a project of this size is feasible, although a very large effort would be required. However, a continuation of the current approach would not address the estimated two-thirds of all genes that do not mutate to a readily detectable phenotype in genetic screens. To obtain P-element insertions that disrupt such genes, it will be necessary to look directly for changes in their structure. With large amounts of genomic and EST sequences becoming available and a strong commitment to completing the Drosophila genome sequence within 1-3 years (![]()
We propose to inaugurate a phase two gene disruption project whose goal would be to disrupt all Drosophila genes, regardless of phenotype. Flanking DNA will be recovered from a large number of raw insertion lines and sequenced, much as was done with the primary collection lines in the current collection. The short sequences obtained will allow most new insertions to be precisely positioned on the genomic sequence. Consulting EST and cDNA sequences, gene predictions, ORF homologies, and other relevant data in the vicinity of the insertion sites will allow rapid predictions as to whether each new insertion is likely to disrupt or misexpress an ORF not currently represented in the collection. Lines that do not appear to do so would be quickly discarded. Recently, this strategy has received a valuable test within the fully sequenced 2.9-Mb Adh region (![]()
The phase two strategy has several distinct advantages. First, it broadens the project to include all Drosophila genes. In addition, it greatly simplifies the work required to characterize new candidate lines, compensating in part for the much larger number of lines that will need to be analyzed. Polytene localizations are unnecessary, because multiinsert lines can be detected through their production of more than one distinct P-element flanking sequence. Balancing most of the newly mutagenized chromosomes is not required. Genetic complementation is not necessary, because redundant lines can increasingly be identified on the basis of their location. However, there are several requirements for success. First, the Drosophila genome sequence must be completed in a timely manner. Second, semiautomated methods for recovering and sequencing flanking DNA segments must be further improved. Finally, bioinformatic tools to assist decision making about line retention must be developed.
We can calculate the approximate number of lines that will need to be analyzed during the phase two project. About 11,000 of the estimated 12,000 Drosophila genes are predicted to fall into the coldspot class, assuming that the P-element mutability of all genes is similar to that of vital genes. Therefore, if 30% of new insertions fall in the coldspot class, as is the case with lethal insertions, and 95% of raw lines contain only one insertion, then 2 x 11,000/(0.3 x 0.95) = 77,000 lines would be required for 87% saturation. However, two observations suggest that some unselected insertions will fail to disrupt any gene, increasing the total number of lines that will need to be analyzed. First, P elements are attracted to at least some repetitive sequences such as yoyo, TAS, and hoppel, which are often located at nonmutagenic sites within the genome. The fraction of insertions that land in such sites might be significant. Second, P insertions that cause phenotypes cluster around the 5' region of genes (![]()
The relative fraction of unselected insertions that disrupt genes can be estimated, however. If all insertions mutated genes, then 33% of new transpositions should cause a recognizable phenotype, because about one-third of genes are thought to mutate in this manner. Instead, only ~15% of raw insertions recovered on clean chromosomes cause a recognizable phenotype (see citations in Table 2), suggesting that only about half (15/33) of unselected insertions disrupt a gene. Consequently, as many as 77,000/0.5 = 154,000 insertions might need to be screened to obtain 87% saturation across all Drosophila genes. In practice, however, this may be an overestimate. P elements can be excised imprecisely to generate deletions adjacent to the insertion site. Because of the large number of mapped insertions that will be available by the time phase two is only partially complete, a strategy in which some genes are disrupted by excising nearby nonmutagenic insertions might substantially reduce the final number of strains that need to be generated and analyzed.
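The phase two estimate can be followed step by step in a minimal sketch. It uses the figures quoted in the text and assumes that the fraction of unselected insertions that disrupt a gene can be approximated by the ratio of observed to expected phenotype frequencies. The function and parameter names are illustrative, not part of any existing tool.

```python
def phase_two_lines(genes, mean_hits, class_fraction,
                    single_insert_fraction, gene_disrupting_fraction=1.0):
    """Raw lines needed so that each target gene receives mean_hits insertions.

    gene_disrupting_fraction: share of unselected insertions that actually
    disrupt a gene (1.0 reproduces the uncorrected estimate).
    """
    useful = genes * mean_hits
    yield_per_line = class_fraction * single_insert_fraction * gene_disrupting_fraction
    return useful / yield_per_line

# Figures quoted in the text.
coldspot_genes = 11_000
base = phase_two_lines(coldspot_genes, 2, 0.30, 0.95)

# Observed (~15%) versus expected (33%) phenotype frequency; the text
# rounds this ratio to 0.5.
observed_ratio = 0.15 / 0.33
adjusted = phase_two_lines(coldspot_genes, 2, 0.30, 0.95, 0.5)
unrounded = phase_two_lines(coldspot_genes, 2, 0.30, 0.95, observed_ratio)

if __name__ == "__main__":
    print(f"uncorrected estimate: {base:,.0f}")
    print(f"with ratio rounded to 0.5: {adjusted:,.0f}")
    print(f"with unrounded ratio {observed_ratio:.2f}: {unrounded:,.0f}")
```

Using the unrounded ratio gives a somewhat larger total than the rounded factor of 0.5, which shows how sensitive the projection is to this correction.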
A gene disruption library represents a fundamental and indispensable resource for analyzing gene function on a genome-wide scale. The BDGP gene disruption project has already accelerated studies of Drosophila gene function and is likely to be even more valuable as coverage increases. A pilot screen for phase two has already been completed in collaboration with several laboratories (![]()
| ACKNOWLEDGMENTS |
|---|
BDGP acknowledges all those researchers who participated in constructing the strains that were used in this project. These include L. Ackerman, M. Alvarado, S. Barbel, C. Berg, E. Bier, S. Bockheim, M. Boedigheimer, R. Carretto, Z. Chang, L. Cooley, M. Fuller, U. Gaul, R. Glaser, E. Grell, B. Harkins, M. Heck, L. Higgins, L. Jan, Y.-N. Jan, G. Karpen, R. Kelley, I. Kiss, A. Laughon, K. Lee, L. Lee, G. Mardon, K. McCall, D. McKearin, C. Montell, D. Montell, T. Overbode, B. Price, J. Riesgo, M. Scott, S. Shepherd, R. Smith, D. Thompson, T. Tick, T. Törok, J. Tower, T. Uemura, H. Vassin, E. Verheyen, S. Wasserman, and L. Yue. We are also grateful to many workers who in the course of this study communicated complementation results and other information on specific P-element strains. In particular, John Roote and Paul Lasko shared complementation data for 2L divisions 24-36 and 37-38. H. Bellen (various), Erica Roulier (29A), Ken Howard (45), Jordan Raff (46A), Elliott Goldstein (46), Robert Burgess (47EF), Claire Russell (49EF), Paul Wes (52E), and Boris Dunkov (99F) contributed and confirmed results in the cytogenetic regions indicated. We thank A. deGrey for assistance in analyzing chromosome 2 data. This work was supported by a genome center grant (P50NIHHG750) from the National Institutes of Health. A.C.S. and G.M.R. are Howard Hughes Medical Institute Investigators.
Manuscript received January 29, 1999; Accepted for publication April 26, 1999.
| LITERATURE CITED |
|---|
ALLENDE, M. L., A. AMSTERDAM, T. BECKER, K. KAWAKAMI, and H. GAIANO et al., 1996 Insertional mutagenesis in zebrafish identifies two novel genes, pescadillo and dead eye, essential for embryonic development. Genes Dev. 10:3141-3155.
ALSINA, B., F. SERRAS, J. BAGUNA, and M. COROMINAS, 1998 patufet, the gene encoding the Drosophila melanogaster homologue of selenophosphate synthetase, is involved in imaginal disc morphogenesis. Mol. Gen. Genet. 257:113-123.
ANDREW, D. J., A. BAIG, P. BHANOT, S. M. SMOLIK, and K. D. HENDERSON, 1997 The Drosophila dCREB-A gene is required for dorsal/ventral patterning of the larval cuticle. Development 124:181-193.
ARORA, K., H. DAI, S. G. KAZUKO, J. JAMAL, and M. B. O'CONNOR et al., 1995 The Drosophila schnurri gene acts in the Dpp/TGF-beta signaling pathway and encodes a transcription factor homologous to the human MBP family. Cell 81:781-790.
ASHBURNER, M., 1990 Drosophila: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
ASHBURNER, M., S. MISRA, J. ROOTE, S. LEWIS, and R. BLAZEJ et al., 1999 An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster: the Adh region. Genetics 153:179-219.
BARRETT, K., M. LEPTIN, and J. SETTLEMAN, 1997 The Rho GTPase and a putative RhoGEF mediate a signaling pathway for the cell shape changes in Drosophila gastrulation. Cell 91:905-915.
BAUMGARTNER, S., D. MARTIN, C. HAGIOS, and R. CHIQUET-EHRISMANN, 1994 Ten-m, a Drosophila gene related to tenascin, is a new pair-rule gene. EMBO J. 13:3728-3740.
BAUMGARTNER, S., D. MARTIN, R. CHIQUET-EHRISMANN, J. SUTTON, and A. DESAI et al., 1995 The HEM proteins: a novel family of tissue-specific transmembrane proteins expressed from invertebrates through mammals with an essential function in oogenesis. J. Mol. Biol. 251:41-49.
BAUMGARTNER, S., J. T. LITTLETON, K. BROADIE, M. A. BHAT, and R. HARBECKE et al., 1996 A Drosophila neurexin is required for septate junction and blood-nerve barrier formation and function. Cell 87:1059-1068.
BEGEMANN, G., A. M. MICHON, L. VAN DER VOORN, R. WEPF, and M. MLODZIK, 1995 The Drosophila orphan nuclear receptor Seven-up requires the Ras pathway for its function in photoreceptor determination. Development 121:225-235.
BELLAICHE, Y., I. THE, and N. PERRIMON, 1998 Tout-velu is a Drosophila homologue of the putative tumour suppressor EXT-1 and is needed for Hh diffusion. Nature 394:85-88.
BELLEN, H. J., C. J. O'KANE, C. WILSON, U. GROSSNIKLAUS, and R. K. PEARSON et al., 1989 P-element-mediated enhancer detection: a versatile method to study development in Drosophila. Genes Dev. 3:1288-1300.
BERG, C. and A. SPRADLING, 1991 Studies on the rate and site-specificity of P element transposition. Genetics 127:515-524.
BHAT, M. A., A. V. PHILP, D. M. GLOVER, and J. BELLEN, 1996 Chromatid segregation at anaphase requires the barren product, a novel chromosome-associated protein that interacts with Topoisomerase II. Cell 87:1103-1114.
BHATT, A. M., T. PAGE, E. J. LAWSON, C. LISTER, and C. DEAN, 1996 Use of Ac as an insertional mutagen in Arabidopsis. Plant J. 9:935-945.
BIER, E., H. VASSIN, S. SHEPHERD, K. LEE, and K. MCCALL et al., 1989 Searching for pattern and mutation in the Drosophila genome with a P-lacZ vector. Genes Dev. 3:1273-1287.
BOULIANNE, G. L., A. DE LA CONCHA, J. A. CAMPOS-ORTEGA, L. Y. JAN, and Y. N. JAN, 1991 The Drosophila neurogenic gene neuralized encodes a novel protein and is expressed in precursors of larval and adult neurons. EMBO J. 10:2975-2983.
BRAUN, A., J. A. HOFFMANN, and M. MEISTER, 1997 Drosophila immunity: analysis of larval hemocytes by P-element-mediated enhancer trap. Genetics 147:623-634.
BRUMMEL, T., S. ABDOLLAH, T. E. HAERRY, M. J. SHIMELL, J. MERRIAM, L. RAFTERY, J. L. WRANA, and M. B. O'CONNOR, 1999 The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development. Genes Dev. 13:98-111.
BURNS, N., B. GRIMWADE, P. B. ROSS-MACDONALD, E. Y. CHOI, and K. FINBERG et al., 1994 Large-scale analysis of gene expression, protein localization, and gene disruption in Saccharomyces cerevisiae. Genes Dev. 8:1087-1105.
CAMPBELL, G., H. GORING, T. LIN, E. P. SPANA, and S. ANDERSON et al., 1994 RK2, a glial-specific homeodomain protein required for embryonic nerve cord condensation and viability in Drosophila. Development 120:2957-2966.
CAMPBELL, S. D., F. SPRENGER, B. A. EDGAR, and P. H. O'FARRELL, 1995 Drosophila Wee1 kinase rescues fission yeast from mitotic catastrophe and phosphorylates Drosophila Cdc2 in vitro. Mol. Biol. Cell 6:1333-1347.
CASTRILLON, D. H., P. GONCZY, R. RAWSON, C. G. EBERHART, and S. VISWANATHAN et al., 1993 Toward a molecular genetic analysis of spermatogenesis in Drosophila melanogaster: characterization of male-sterile mutants generated by single P element mutagenesis. Genetics 135:489-505.
CHANG, H. C. and G. M. RUBIN, 1997 14-3-3 epsilon positively regulates Ras-mediated signaling in Drosophila. Genes Dev. 11:1132-1139.
CHANG, Z., B. D. PRICE, S. BOCKHEIM, M. J. BOEDIGHEIMER, and R. SMITH et al., 1993 Molecular and genetic characterization of the Drosophila tartan gene. Dev. Biol. 160:315-332.
CLARK, K. A. and D. M. MCKEARIN, 1996 The Drosophila stonewall gene encodes a putative transcription factor essential for germ cell development. Development 122:937-950.