People who cannot cope with confusion should not be in genomics.
Mary-Claire King (August 24, 2007, speaking at the University of California, Santa Cruz, CA)
What a wonderful time to be a biologist! This month we celebrate the publication of the “complete” genomic sequences of 10 species of Drosophila to complement those of Drosophila melanogaster (Adams et al. 2000) and D. pseudoobscura (Richards et al. 2005). A 13th genome, that of D. mauritiana, has been sequenced in part, but the data have yet to be analyzed (L. Hillier, Washington University, personal communication). This achievement is marked by the publication of two “community” articles in Nature in November 2007 (Drosophila 12 Genomes Consortium 2007; Stark et al. 2007) and nearly 50 articles in other journals, including Genetics. Rather than attempt to summarize these articles (in effect this is done by the community publications), here I will attempt to draw some more general biological and sociological lessons from the sequencing of these genomes.
What are the deep biological questions that large-scale sequencing should seek to answer, in particular large-scale sequencing of organisms such as drosophilids? And what are the analytical capabilities required?
First and foremost, in my view, is to understand the evolution and function of genomes. It is too easily forgotten in these days of genomic hyperbole that we have a very poor understanding of genome evolution, structure, and function. Even in the relatively small and compact genome of D. melanogaster only 30% or so of the bases are within protein-coding regions; those known to encode nontranslated RNAs are <1%. About 25% is “heterochromatic,” composed of both satellite DNAs with a very low information content and scrambled transposable elements. Relatively few regulatory regions of the genome have been well characterized. The role (I hesitate to write “function”) of at least 50% of the bases in this genome is unknown. This problem is far more acute for the human genome than for the Drosophila genome; nevertheless, it is a general problem of some importance. The problem is exacerbated by two facts. First, the fraction of the genome that is conserved, at least between mammals, is much higher than that currently known to be “functional” (Encode Project Consortium 2007). Second, the fraction of the genome present in transcripts is much greater than would be expected on the basis of known genome content (for Drosophila data, see Manak et al. 2006). These and other considerations have led some (e.g., Pheasant and Mattick 2007) to propose that the fraction of genomes that is functionally constrained has been grossly underestimated (see Andolfatto 2005; Halligan and Keightley 2006). Discovering the role of these regions is far from simple. There is, for example, evidence in flies, zebrafish, and mice that many conserved regions of the genome can function in reporter gene assays as enhancers; the proportion of these that function in vivo is unknown, but it is certainly not all of them.
The enormous utility of comparative sequence data for the analysis of non-protein-coding sequences and, indeed, for the refinement of gene models of protein-coding genes is beautifully illustrated in the overview by Stark et al. (2007).
So, what is all of this DNA doing? The simple answer is: we do not yet know. Between 3 and 25% of drosophilid genomes consist of recognizable transposable elements (TEs) (I expect this fraction to be much higher in species that have grossly amplified their heterochromatin, such as D. orena and D. nasutoides). In D. melanogaster, the great majority of these TEs fall into >100 families of two major classes: those that transpose via an RNA intermediate and those that do not. The transposable element content of the genome of D. melanogaster is now reasonably well known (Quesneville et al. 2005); that of the other sequenced species is not, and preliminary results should prepare us for some surprises.
Recently, novel classes of TEs have been discovered by Kapitonov and Jurka (2007): helitrons that transpose by a rolling-circle mechanism and polintrons that may transpose by integration of replicas of an excised DNA molecule. Helitrons have, so far, been found only as remnants in D. melanogaster (Kapitonov and Jurka 2003). Polintron elements (also called Maverick elements), by comparative analysis of the 12 genomes, have now been found in drosophilids (Pritham et al. 2007). These are enigmatic elements, encoding up to 10 proteins, including both a retroviral-like integrase and a DNA polymerase B enzyme. Comparative analysis (Casola et al. 2007) has also defined a new family of class II elements in the drosophilids: the P instability factor (PIF) family, TEs that encode both a transposase and a Myb/SANT domain protein. In D. melanogaster, these elements are present only as grossly deleted MITEs, but are abundant in the genomes of D. pseudoobscura, D. persimilis, and D. willistoni. This family of elements shows evidence of “domestication,” that is, the evolution of some copies into functional genes integral to a genome, as well as of horizontal transfer between species. Each class of transposable element, if not each family, has its own evolutionary dynamics within the genome of a species, as illustrated recently by Bergman and Bensasson (2007), who show that non-LTR retroelements in the genome of D. melanogaster are, on average, much older than LTR retroelements.
The domestication of transposable elements (see Volff 2006) is only one of the dramatic consequences of the symbiotic relationship between these elements and their host genomes. Another is their recruitment as cis-regulatory elements for genes under strong positive selection, for example, alleles of Cyp6g1, in both D. melanogaster and D. simulans, that confer insecticide resistance due to overexpression resulting from closely inserted transposable elements (Daborn et al. 2002; Schlenke and Begun 2004). The retrotransposases encoded by retroelements are the driving force for the origin of retrotransposed pseudogenes. Although processed pseudogenes are surprisingly rare in drosophilid genomes (in comparison, at least, to mammalian genomes), Betrán and her colleagues (Bai et al. 2007; see also Bhutkar et al. 2007) have shown that there is a slow but steady generation of novel genes and functions by retrotransposition in the drosophilids. There is a strong bias for these retrogenes to originate from the X chromosome and for them to be specifically expressed in the male germline.
The interaction between host genome and transposable elements remains a fruitful field of study and is becoming experimentally more tractable with the discovery of small RNAs that may play a role in the control of TE activity (e.g., Brennecke et al. 2007). One burning and difficult question is the mechanism of horizontal transfer of elements between species. The evidence for horizontal transfer is necessarily indirect, yet it is compelling in both the drosophilids and other taxa. It is no criticism of the studies of the 12 genomes so far to say that there is much more to learn about the evolution of these extraordinary genome parasites by comparative studies of drosophilids.
The drosophilids are a very rich source of phenotypic variation—morphological, ecological, and behavioral. Everyone interested in morphological variation in these flies should browse Hardy's wonderful monograph on the Hawaiian fauna (Hardy 1965) and Grimaldi's monograph of the family Drosophilidae, richly illustrated with scanning electron micrographs (Grimaldi 1990).
This variation gives rise to the second deep question in genomics: the relationship between sequence and this diversity in phenotype. This is one area in which some progress is indeed being made by studies of Drosophila, and we now know that evolutionary changes in cis-regulatory modules are responsible for at least some morphological changes (e.g., Prud'homme et al. 2006; McGregor et al. 2007).
We know very little about the genetic basis for ecological specialization. D. melanogaster and its sibling species D. simulans, of course, are famously known as domestic species, most populations showing a very close association with humans. Yet, among the 12 species of drosophilid now sequenced, 3 have exceptional ecologies: D. sechellia, D. erecta, and D. mojavensis. D. sechellia seems to be restricted to the foul-smelling fruits of the shrub Morinda citrifolia for breeding. These fruits are rich in nasty chemicals (6- and 8-carbon carboxylic acids) that are toxic to any other self-respecting drosophilid. D. erecta shows a strong preference for breeding on the fallen fruits of West African species of Pandanus, although it will breed elsewhere when these fruits are unavailable. Interestingly, both species show exceptional rates of evolution of genes involved in taste and olfaction (McBride and Arguello 2007; see Matsuo et al. 2007) and in genes whose products are required for detoxification of xenobiotic chemicals (Drosophila 12 Genomes Consortium 2007). The third species, D. mojavensis, is a representative of a large group of species specialized to breed in the cacti of the American deserts, plants full of nasty allelochemicals. Its sequencing will open the door to a more detailed study of chemical ecology (see Fogleman et al. 1998).
Genes encoding proteins involved in RNA metabolism also evolve exceptionally fast, presumably as a response to RNA viruses and the like (Obbard et al. 2006; Heger and Ponting 2007). One curious observation is that D. willistoni is the first animal known not to cope with dietary selenium by its incorporation into selenocysteine; this species lacks selenoproteins and the selenocysteinyl tRNA (Drosophila 12 Genomes Consortium 2007).
Behavioral repertoires in the drosophilids are rich, particularly where sex is concerned. Studies of the function of the fruitless transcription factor in D. melanogaster (Manoli et al. 2006) are beginning to throw light on the proximate control of sexual behavior, and recent spectacular advances in our ability to control the activities of whole classes of neurons by exposing flies to light, and thus uncaging a ligand for an engineered ion channel (see Lima and Miesenböck 2005), offer the chance for a deep understanding of the neural control of behavior. It is clear that one future emphasis in this field must be the development of a comprehensive atlas of the central nervous system of D. melanogaster. This will open the path to comparative studies of neural function and the evolution of its genetic control.
It has been known for some time that genes that encode proteins that function in sex and reproduction, particularly male sexual and reproductive functions, evolve at a particularly fast rate, and this is confirmed by the comparative analysis of the 12 genomes. The morphological characters involved in reproduction, again especially those of the male, also evolve very fast, as all who have used the shape of the genital arch to distinguish D. melanogaster from D. simulans know. One of the species sequenced, D. grimshawi, is (as far as I am aware) unique among the 12 species in having specialized pheromonal glands secreting fluid into the male's anus for deposition on the substrate of the mating arena (Hodosh et al. 1979). Related picture-winged species, e.g., D. clavisetae, show an extraordinary mating behavior in which a male arches its abdomen over its thorax and sprays the head of the courted female with secretions from these glands (Spieth 1984; K. Kaneshiro, unpublished data).
Finally, what can these genomes tell us about the evolution of the genomes themselves at the chromosome level? In 1941, Sturtevant and Novitski, having reviewed the literature concerning the comparative genetics of the drosophilids, concluded that “the six chromosome arms of D. melanogaster retain their essential identity among the species of Drosophila so far studied.” Muller (1940) had previously called these arms A, B, C, D, E, and F to overcome the differences in nomenclature used by those working with different species, and “Muller's elements” is how they are now known. Their integrity is amply confirmed by the genomes now sequenced (Drosophila Chromosome Working Group, unpublished results). It is now reasonably clear that five major mechanisms contribute to the evolution of chromosomes in the drosophilids: whole-element translocations, that is, the fusion and fission of elements; paracentric inversion, leading to gene order rearrangement within elements; accretion and loss of heterochromatic sequences, especially in the pericentromeric and telomeric regions; pericentric inversion; and transposition. The last two of these mechanisms are rare, and the first three are common. The logic of constructing chromosome phylogenies, and hence population or species phylogenies, on the basis of an analysis of overlapping chromosome inversions was discovered by Sturtevant and Dobzhansky (1936) and is an art that reached its apogee in Wasserman's phylogenies of the repleta group species (Wasserman 1992) and Carson and Stalker's phylogeny of the picture-winged Hawaiian species (see Carson 1992). Despite some brave attempts (Stalker 1972), it has been very hard to use this method between even closely related species groups, but these sequenced genomes now offer the opportunity of phylogenetically distant comparisons by a more tractable method.
The rates of chromosomal evolution vary enormously between groups, and we can now estimate from sequences the number of inversions that have been fixed during the evolution of the species of the two subgenera. For example, we estimate that between the subgenera Drosophila and Sophophora there have been 617 fixed paracentric inversions (M. von Grottus and J. Ranz, unpublished results). The rates of chromosomal evolution are, however, very heterogeneous. For example, D. melanogaster and D. yakuba differ by 29 fixed inversions; 28 of these occurred in the lineage to D. yakuba and only 1 in the lineage to D. melanogaster (Ranz et al. 2007). Similarly, we find that the rate of chromosomal evolution in D. willistoni is extraordinarily high (M. von Grottus, unpublished results; see also Drosophila 12 Genomes Consortium 2007). Whether or not chromosomal breaks occur randomly, their fixation certainly does not, as there are several marked “hot” and “cold” spots of fixed breaks. The forces that constrain the fixation of inversions can only be guessed at, although there are hints that the relative placement of genes (e.g., overlapping genes) is one constraint. As well as relieving us from the tedium of cytological analysis, the sequenced genomes throw light on the mechanisms of chromosomal evolution. The prior expectation was that inversions would result from exchange between dispersed transposable elements (e.g., Casals et al. 2005) but, at least in the case of the inversions fixed in D. simulans and D. yakuba, this seems not to be so: here the inversions arise by a mechanism that leaves a footprint of inverted flanking duplications of otherwise unique sequences (Ranz et al. 2007). Such unanticipated results illustrate how the sequenced genomes provide a more focused means of studying these mechanisms.
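To make the idea of estimating fixed inversions from sequence more concrete, here is a minimal sketch. It is not the method used in the analyses cited above, which are unpublished; it only illustrates a standard textbook approach. It assumes that each chromosome arm has been reduced to a signed order of orthologous genes (the sign standing for strand), models a paracentric inversion as a reversal with a strand flip, and counts breakpoints between two gene orders. Because a single inversion can create at most two breakpoints, half the breakpoint count is a lower bound on the number of inversions separating the two orders. The function names and the toy gene order are hypothetical.

```python
from math import ceil


def invert(order, start, end):
    """Return a copy of a signed gene order with order[start:end] inverted.

    An inversion reverses the segment and flips the strand (sign) of each gene.
    """
    segment = [-g for g in reversed(order[start:end])]
    return order[:start] + segment + order[end:]


def count_breakpoints(order, reference):
    """Count adjacencies of `reference` that are disrupted in `order`.

    Both arguments are signed permutations of the same set of genes.
    Sentinels 0 and n + 1 frame the arm so that changes at its ends count.
    """
    n = len(reference)
    # Relabel genes so that the reference reads +1, +2, ..., +n.
    rank = {}
    for i, g in enumerate(reference, start=1):
        rank[abs(g)] = i if g > 0 else -i
    relabeled = [rank[abs(g)] if g > 0 else -rank[abs(g)] for g in order]
    framed = [0] + relabeled + [n + 1]
    # After relabeling, an adjacency (a, b) is conserved exactly when
    # b - a == 1; this also covers pairs read on the opposite strand.
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def min_inversions(order, reference):
    """Lower bound on inversions: each one creates at most two breakpoints."""
    return ceil(count_breakpoints(order, reference) / 2)


# Toy example: an ancestral arm of eight genes and two successive inversions.
ancestral = [1, 2, 3, 4, 5, 6, 7, 8]
one_step = invert(ancestral, 2, 5)   # [1, 2, -5, -4, -3, 6, 7, 8]
two_steps = invert(one_step, 4, 7)   # [1, 2, -5, -4, -7, -6, 3, 8]
```

In this toy case the first inversion leaves two breakpoints and the second raises the count to four, so the lower bound recovers one and two inversions, respectively. With real data the bound is generally not tight, because inversions can reuse breakpoints, which is one reason that observed “hot” spots of fixed breaks complicate such estimates.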
The publication of these genomes comes at a critical point in biology. Hitherto, the sequencing and analysis of genomes, indeed, any large-scale DNA sequencing project, was an activity economically achieved only at a dedicated sequencing center, such as the Sanger Institute (Hinxton, UK), the Broad Institute (Cambridge, MA), the centers at the Baylor College of Medicine (Houston), the Joint Genome Institute (Walnut Creek, CA), and the Washington University Medical School (St. Louis). This is changing with almost alarming rapidity. The promise of very-high-throughput sequencing at a reasonable cost has now been delivered with the machines now available from Roche (454), Illumina/Solexa, and Applied Biosystems. Individual laboratories can now consider producing 2–4 Gb of sequence a week at a production cost on the order of $10,000. There is every prospect that the cost will decrease and the throughput will increase by at least an order of magnitude over the next 5 years. The implications of these technological developments for biologists cannot be overestimated. Indeed, one might be concerned that we will enter a period of “irresponsible sequencing,” i.e., sequencing simply for the sake of it rather than for answering deep biological or practical questions. The concern must be that facile sequencing will outstrip our abilities to analyze and digest the information.
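As a back-of-envelope illustration of these figures, the short sketch below converts the quoted weekly output (2–4 Gb) and production cost (on the order of $10,000) into a cost per megabase. The projection simply assumes a tenfold rise in throughput at an unchanged weekly cost, which is one reading of “an order of magnitude”; the function name and the scaling factor are illustrative assumptions, not figures from any vendor.

```python
def cost_per_megabase(weekly_cost_usd, weekly_output_gb):
    """Production cost in US dollars per megabase (1 Gb = 1000 Mb)."""
    return weekly_cost_usd / (weekly_output_gb * 1000)


WEEKLY_COST_USD = 10_000   # "on the order of $10,000" (from the text)
ASSUMED_FOLD_INCREASE = 10  # assumption: tenfold more sequence per week

for output_gb in (2, 4):
    now = cost_per_megabase(WEEKLY_COST_USD, output_gb)
    projected = cost_per_megabase(
        WEEKLY_COST_USD, output_gb * ASSUMED_FOLD_INCREASE
    )
    print(f"{output_gb} Gb/week: ${now:.2f}/Mb now, ${projected:.2f}/Mb projected")
```

On these assumptions the present cost works out to between $2.50 and $5.00 per megabase, falling to between $0.25 and $0.50 per megabase after a tenfold increase in throughput.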
The sequencing and preliminary analyses of the genomes of these drosophilids are a magnificent achievement. Large-scale sequencing and its associated analytical tools offer exciting new methodologies for investigating critical biological questions such as the evolution and function of genomes and the relationship between sequence and phenotypic diversity, as just described. I must, however, express two concerns about the achievement that we now celebrate. One reflects on scientific judgments, albeit scientific judgments burdened by political reality. The second reflects the social pressures that come to bear on complex community projects.
For comparative biology to show its full power, we need to adopt standard methods of study and standard methods for data representation. One technical problem that we and many others have had in the analysis of these genomes is that the process of sequencing and assembly has not been standard for all of the species. Five sequencing centers contributed to this study: Baylor College of Medicine (for D. pseudoobscura), the J. C. Venter Research Institute (D. willistoni), the Broad Institute (the low-coverage sequences of D. sechellia and D. persimilis), Washington University (D. simulans and D. yakuba), and Agencourt (the remaining six species) (plus, of course, Celera and the Berkeley Drosophila Genome Project for D. melanogaster). Each used its own strategy for constructing clone libraries; most used mixed embryonic stages as a source of DNA, but some of the D. simulans projects used adults. The success of the whole-genome shotgun strategy for sequencing and assembly is critically dependent on good short- and long-insert libraries; yet, for each species, the methods used to prepare the plasmid, fosmid, and BAC libraries were not uniform among sequencing centers. In addition, the ratio of the different-size libraries sequenced differed for each species. For one species, D. yakuba, finishing reads were determined; for six species, EST libraries from embryos and/or adults were sequenced. Each of the sequencing centers then used its own pipelines for the removal of contaminating sequences and for sequence assembly. The well-recognized problems with different assembly pipelines led to an effort to reconcile the assemblies, but for only six of the species (Zimin et al. 2007). These technical differences clearly affect the final product, although the full extent of this is hard to evaluate on the basis of the available data.
This must, however, lead to some caution when claiming, for example, lineage-specific differences in sequence or sequence organization; there is a risk of artifact, and the strong recommendation must be that any such claim be supported by independent data.
Ideally, all of these species would have been sequenced by a single center using an identical strategy. Variability in methodologies unnecessarily complicates the use of these data for informative comparative analyses and thus depletes the value of such an enormous investment in sequencing. There is no doubt that the current state of affairs arose because of the political reality that each of the funded sequencing centers in the United States must be given slices of the National Human Genome Research Institute cake. I hope that future comparative sequencing projects will learn from this experience. Pork-barrel politics should have no place in science.
Finally, I must express a concern with respect to the social process that intervened between sequence release and publication. It is an open secret that there has been very considerable community disquiet about the way in which the publications on these sequences have been prepared. This perhaps was inevitable, given that no single center was responsible for leading the overall project and that no funding was set aside for the analyses, a state of affairs unfortunately not atypical for large-scale sequencing projects. (The unsung heroes of this story are the small number of graduate students who took on the burden of much of the analyses.) On the (very) positive side, some in the community took an early initiative to establish an open-access website at Berkeley (http://rana.lbl.gov/drosophila), and this, and the related wiki pages (http://rana.lbl.gov/drosophila/wiki/), have been the major portal for both data and discussion over the course of this project. The first assemblies were released to this site as early as May 2004. The “rules” under which researchers can use genome sequence data were established by a meeting hosted by the Wellcome Trust (2003) at Fort Lauderdale, Florida, in 2003, and most, although not all, researchers have been constrained by these. The pity is that the Fort Lauderdale agreement, which attempts to balance the interests of the data producers (i.e., the sequencing centers) and the community at large, who want to analyze the data, is less than clear as to the responsibilities of all of the parties concerned. It is rightly recognized that the sequencing centers have the right of the first shot at analysis. But how long can the community expect to wait before they can publish on the basis of these data? One year? Two years? There is a general feeling that, in this case, the Fort Lauderdale agreement has not served the community, the sequencing centers, or the journals well.
No one can doubt the complexity of large community projects such as this. However, this particular project was plagued by confusion and ever-retreating deadlines that created considerable angst, particularly among the most vulnerable scientists who did much of the work—graduate students and postdoctoral fellows. There have been articles that have been accepted for publication and held for >6 months, pending the publication of the main summary articles by Nature; other articles have been published and then removed from the publisher's web site (http://mbe.oxfordjournals.org/cgi/content/abstract/msm129v2) or forced to have corrections published noting premature use of the data (http://www.biomedcentral.com/1471-2105/8/248). There have been cases in which the Fort Lauderdale agreement has been directly flouted and researchers have been scooped; there have been cases in which, even when the Fort Lauderdale agreement was not being breached, researchers have been roundly abused for allegedly having done so. Thus, the time may have come to reevaluate the conditions under which data from publicly funded genome projects can be used by the community at large. This is particularly needed in light of the technological changes to the sequencing business mentioned above since these will produce a glut of data that will require increasing community participation for their interpretation.
But these issues cannot be allowed to hide the wonderful achievement of all concerned with this project, from the authors of the two “white papers” (Begun and Langley 2003; Clark et al. 2003), which convinced the National Institutes of Health to fund these sequences, to all those who have devoted so much time and effort to their analysis. But what of the future? Two things are clear. First, the current group of articles has only begun the analysis of these genomes; there is material here to keep many people busy for many years. Second, the revolution in technology, to which I have already alluded, means that we can look forward to the sequencing of many more genomes in the future. This is important for many reasons, but I will mention only one, stressed to me by Michael Eisen. Comparison of the regulatory landscapes of the genomes of mammals and drosophilids reveals a very different picture. In the mammals (indeed the vertebrates), relatively small regulatory regions often can be recognized computationally by small islands of conservation in a large ocean of diverged sequence; in Drosophila, this is not as straightforward simply because the genomes are too compact. However, comparison of sequences between drosophilids and other Diptera (for example, those from the Tephritids, which have much larger genomes) may allow conservation to be used as a stronger indicator of regulatory sequences. Clearly, a priority for future large-scale sequencing must be the more distant relatives of Drosophila, especially in the acalyptrate and calyptrate groups.
The analysis of population variation by large-scale sequencing within Drosophila species has only just started with the study of Begun et al. (2007b) of six populations of D. simulans. This study is the first comprehensive map of sequence variation in a drosophilid at the whole-genome level, and the surprise is the extent of signatures of adaptive selection in both coding and noncoding regions. Population genetics is where the new sequencing technologies will have an immediate and major impact; indeed, this work has already begun for populations of D. melanogaster using 454 and Illumina sequencing (Begun et al. 2007a; Quinlan et al. 2007).
I am very grateful to the lead authors of the two community articles and of many of the companion articles for preprints and to many friends and colleagues in the community for discussion. It is probably better that they remain nameless.
Copyright © 2007 by the Genetics Society of America