Embryonic gene expression patterns are an indispensable part of modern developmental biology. Currently, investigators must visually inspect numerous images containing embryonic expression patterns to identify spatially similar patterns for inferring potential genetic interactions. The lack of a computational approach to identify pattern similarities is an impediment to advancement in developmental biology research because of the rapidly increasing amount of available embryonic gene expression data. Therefore, we have developed computational approaches to automate the comparison of gene expression patterns contained in images of early stage Drosophila melanogaster embryos (prior to the beginning of germ-band elongation); similarities and differences in gene expression patterns in these early stages have extensive developmental effects. Here we describe a basic expression search tool (BEST) to retrieve best matching expression patterns for a given query expression pattern and a computational device for gene interaction inference using gene expression pattern images and information on the associated genotypes and probes. Analysis of a prototype collection of Drosophila gene expression pattern images is presented to demonstrate the utility of these methods in identifying biologically meaningful matches and inferring gene interactions by direct image content analysis. In particular, the use of BEST searches for gene expression patterns is akin to that of BLAST searches for finding similar sequences. These computational developmental biology methodologies are likely to make the great wealth of embryonic gene expression pattern data easily accessible and to accelerate the discovery of developmental networks.
PATTERNS of gene expression in the fruit fly Drosophila melanogaster have been extensively studied by visualizing the presence or absence of gene products or their markers in the developing embryo (visualization methods are reviewed in Goldstein and Fyrberg 1994). Genetic studies show that genes with similar expression patterns often have mutant alleles that affect the same tissue. In these studies, researchers routinely infer gene interactions by visually comparing gene expression pattern images (e.g., Gieseler et al. 2001; Takaesu et al. 2002). For example, the dorsal/ventral polarity of the Drosophila embryo is controlled by many genes, including the secreted factors decapentaplegic and short gastrulation and the transcription factor brinker. short gastrulation and brinker have very similar expression patterns and their mutant phenotypes are also very similar. Genetic analyses have shown that both of these genes antagonize decapentaplegic signaling (Ashe and Levine 1999; Jazwinska et al. 1999a,b). As this example shows, analysis of similar gene expression patterns is important to understanding the interplay of genes that generate the body plans of fruit flies, humans, and other metazoans (reviewed in Carroll et al. 2000; Davidson 2000; Rougvie 2001).
Familiarity with the wealth of images of gene expression patterns gathered over the past two decades is essential for discovering new genetic interactions. However, the burden of becoming familiar with extensive past literature and the rapidly increasing amount of information on gene expression patterns is an impediment to cross-laboratory endeavors and to building a global genetic framework for embryonic development. Computational tools to automatically identify images with similar gene expression patterns from a large collection of images and to predict potential genetic interactions using these images would greatly facilitate developmental biology research. Such tools will become increasingly important with the advent of large-scale in situ RNA hybridization studies (http://www.fruitfly.org). The need for a computational system is particularly acute in studies of Drosophila, as scientists move beyond studies of single genes or gene families to generate a global view of development. To address these issues, we have developed in silico approaches for automated comparison of gene expression pattern images that mimic some of the visual comparison techniques used by researchers in the laboratory. Our methods enable easy and efficient access to gene expression and interaction data and will likely facilitate new discoveries in developmental biology.
In this article, we describe methodologies for (a) standardizing gene expression pattern images as they are acquired under different illumination conditions and are published in a variety of sizes, orientations, and resolutions; (b) quantifying the amount of dissimilarity between two expression patterns by comparing images containing them; (c) identifying those images from a large collection that contain expression patterns similar to that contained in the given image; and (d) inferring genetic interactions between two genes by comparing the expression pattern of a particular gene in wild-type and mutant backgrounds. In the first phase of our efforts to develop computational methods, we have focused on early embryonic development in D. melanogaster. This focus reflects two relevant features of the field of Drosophila developmental genetics. First, there is an emphasis on understanding early developmental events that have global effects, such as the organization of the embryonic dorsal/ventral axis. Second, the deposition of the cuticle after roughly three-fourths of embryonic development prevents the use of several techniques commonly employed for the analysis of gene expression.
We present results showing the performance of our methodologies using a prototype collection of 982 image-searchable gene expression pattern images captured as whole embryos and retrieved from the literature (Gaul and Jackle 1987, 1990; Tautz 1988; Pankratz et al. 1989, 1990; Hulskamp et al. 1990, 1994; Kania et al. 1990; Pankratz and Jackle 1990; Hulskamp and Tautz 1991; Riddihough and Ish-Horowicz 1991; Sommer and Tautz 1991; Steingrimsson et al. 1991; Pignoni et al. 1992; Gutjahr et al. 1993; Lardelli and Ish-Horowicz 1993; Grossniklaus et al. 1994; Hartmann et al. 1994; Pelegri and Lehmann 1994; Rothe et al. 1994; Schulz and Tautz 1994, 1995; Tsai and Gergen 1994; Margolis et al. 1995; Rivera-Pomar et al. 1995; Sanchez-Herrero 1995; Yu and Pick 1995; Arnosti et al. 1996; Klingler et al. 1996; Kosman and Small 1997; Vincent et al. 1997; Lawrence and Pick 1998; Nibu et al. 1998; Toy et al. 1998; Tsai et al. 1998; Wu et al. 1998; Ashe and Levine 1999; Goldstein et al. 1999; Jazwinska et al. 1999b; La Rosee-Borggreve et al. 1999; Niessing et al. 1999; Zhang and Levine 1999; Janody et al. 2000; Jin et al. 2000; Nasiadka et al. 2000; Wimmer et al. 2000; Casares and Flores-Saaib et al. 2001; Kobayashi et al. 2001; Nibu and Levine 2001).
MATERIALS AND METHODS
To meaningfully compare gene expression images computationally, it is important to standardize them and to remove background information (reviewed in Castleman 1996). Standardization is required because investigators publish digital images in different sizes and orientations and because images are acquired under different illumination conditions. Digital images are composed of many fine dots (called pixels) of different intensities. Thus standardization establishes the pixel-to-pixel correspondence between two images by setting up a uniform size and point of reference for all images. For example, two expression patterns in Figure 1 (A and B) are quite similar to the naked eye; however, they do not have the necessary pixel-to-pixel correspondence for computer-based analysis. The gene expression image standardization procedure involves a number of steps.
Embryo-enclosing algorithm: This procedure starts with fitting the Drosophila embryo boundaries into the smallest possible rectangular area, a process referred to as “edge-fitting” (Castleman 1996; Costa and Cesar 2000). For instance, the dotted lines in Figure 1 (C and D) are the smallest rectangles in which the actual embryonic image is contained. Since early stage embryos have a consistent shape, edge-fitting is an effective way of removing the noise outside the embryo area. It is more effective than a contour-detecting algorithm in the present case because many images contain embryo boundaries that are too faint to generate a complete contour.
For this step, we have developed an algorithm in which the boundaries of the initial rectangular image are moved inward until they touch the Drosophila embryo outline on all four sides. In this algorithm, we first consider an area of 5 × 5 pixels at each of the four corners of the image and then compute the mean (m) and standard error (s) of the color intensities for the pixels in these four corner areas. Next, we find the topmost row in the image where the upper boundary of the embryo is present. For this, we traverse top-to-bottom in each column to identify the pixel closest to the top edge where the absolute difference between the pixel intensity (p) and m is significant (e.g., |m - p| is greater than twice the standard error, s, under the assumption of normality). The row coordinate of the pixel identified closest to the top edge of the image is the top boundary of the embryo. To identify the bottom, left, and right boundaries, we use the same algorithm; the only difference is that we traverse from bottom to top in each column, left to right in each row, and right to left in each row, respectively. This results in an embryo enclosed in the smallest rectangular area possible. The embryo boundaries determined using this algorithm for A and B in Figure 1 are shown in C and D. Note that all images are standardized to an anterior (left)-posterior (right)-dorsal (top)-ventral (bottom) orientation.
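The boundary search described above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes a gray-level image stored as a list of rows, interprets s as the standard error of the mean of the corner pixels, and uses the hypothetical names `enclose_embryo`, `corner`, and `k`. Taking the first and last rows and columns that contain any significant pixel is equivalent to the four directional traversals.

```python
from math import sqrt
from statistics import mean, stdev


def enclose_embryo(img, corner=5, k=2.0):
    """Return inclusive (top, bottom, left, right) bounds of the embryo.

    img is a list of rows of gray-level intensities. Pixels whose intensity
    differs from the corner mean m by more than k * s count as embryo.
    Returns None if no pixel differs significantly from the background.
    """
    height, width = len(img), len(img[0])
    # Background sample: the corner x corner blocks at the four image corners.
    edge_rows = list(range(corner)) + list(range(height - corner, height))
    edge_cols = list(range(corner)) + list(range(width - corner, width))
    samples = [img[r][c] for r in edge_rows for c in edge_cols]
    m = mean(samples)
    s = stdev(samples) / sqrt(len(samples))  # assumed: standard error of the mean

    def is_embryo(r, c):
        return abs(img[r][c] - m) > k * s

    rows = [r for r in range(height) if any(is_embryo(r, c) for c in range(width))]
    cols = [c for c in range(width) if any(is_embryo(r, c) for r in range(height))]
    if not rows:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


def crop(img, bounds):
    """Cut the image down to the enclosing rectangle."""
    top, bottom, left, right = bounds
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

On noisy images a threshold of twice the standard error may be too permissive, so the multiplier `k` is left adjustable in this sketch.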
Size standardization: In the next step, we need to scale all the resultant images to the same size as they are often captured or published in very different sizes (compare Figure 1, A and B). We chose a size of 270 × 100 pixels, which was the average size of gene expression pattern images acquired from the published literature. For scaling, we perform a geometric transformation (simple scaling) followed by an interpolation to derive the pixel values of the new image. In these studies, we used a gray-level bilinear-interpolation scheme (Castleman 1996, p. 124). It is a first-order interpolation, which determines the destination pixel intensity value based on the four nearest neighbor pixels of the source image. This is a simple but effective approach for scaling. Results from this transformation are shown in Figure 1, E and F, which are standardized images in which pixel correspondence has been established.
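To make the scaling step concrete, the following is a small sketch of first-order (bilinear) interpolation to a fixed 270 × 100 target. It assumes a gray-level image stored as a list of rows and a mapping that aligns the corner pixels of source and destination; that mapping and the name `resize_bilinear` are illustrative choices, not details taken from the text.

```python
def resize_bilinear(img, new_w=270, new_h=100):
    """Scale a gray-level image (list of rows) to new_w x new_h pixels.

    Each destination pixel is a weighted average of the four nearest
    source pixels. The corner-aligned coordinate mapping is an assumption.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(new_h):
        # Source row coordinate for this destination row.
        sy = y * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(new_w):
            sx = x * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

Because every standardized image ends up with the same dimensions, pixel (r, c) in one image can be compared directly with pixel (r, c) in any other.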
Expression pattern extraction: The next step in the gene expression pattern image standardization is to eliminate background to focus on the foreground containing the expression pattern. This ensures that only actual patterns of gene expression are compared for biologically meaningful analysis. In Figure 1A, the gene expression pattern is the darkly stained region embedded in a lighter-color background. While this can be recognized easily by a trained eye, the process needs to be automated for large-scale data gathering. Therefore, the expression pattern (relevant visual content) first must be extracted from each image.
Extraction of the gene expression pattern from the background requires the use of a threshold value of pixel intensity (Gonzalez and Woods 1993; Castleman 1996). All pixels with intensity less than the threshold are assigned a white color (background) and all others are left as is. Our preliminary results demonstrated, as expected, that the same threshold value cannot be used for all images due to differences in the intensity distribution caused by variations in investigator equipment, gene expression pattern, and other factors. To compensate for these variations, we automatically derive threshold values for each image by using adaptive thresholding methodology (see Lie 1995 for details). This can be accomplished, for example, by using the variance of the pixel intensity values for a given image as the basis for the choice of the specific threshold value for that image. While this method captures the entire expression pattern, the resulting image also contains some noise. We therefore filter the images to suppress and/or remove the noise by employing wavelet filters and morphological operators to selectively remove the higher frequency bands contributed by noise (Adams and Bischof 1994). The result of this extraction process for a simple region of expression (Figure 2A) is shown in Figure 2B. For images containing multiple regions of expression, we employ the region-growing procedure to extract relevant patterns. Region growing is a procedure that groups pixels or subregions into larger regions starting from a seed pixel (or region) and appending the neighboring pixels that have similar properties. This improves data quality and makes the extracted pattern biologically relevant. For instance, an algorithm implementing the region-growing procedure automatically extracted all areas of expression for genes that affect multiple regions (e.g., pair-rule genes) without requiring manual input (Figure 2, C and D). 
(For multiply stained embryos, use of color-sensitive thresholds corresponding to each stain separately allows for the generation of multiple images.) These methods do not distinguish between quantitative levels of expression; they are meant for identifying spatial similarities in the presence or absence of gene expression.
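The region-growing idea can be illustrated with a short sketch. It assumes gray levels in which darker staining has lower values, 4-connected neighbors, and a fixed similarity tolerance relative to the seed pixel; the seed rule (any unvisited pixel darker than `threshold`) and the names `grow_region`, `extract_regions`, and `tol` are hypothetical, and the wavelet and morphological filtering steps are not included.

```python
from collections import deque


def grow_region(img, seed, tol):
    """Collect pixels connected to seed whose intensity is within tol of it."""
    h, w = len(img), len(img[0])
    ref = img[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors (up, down, left, right).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and (nr, nc) not in region
                    and abs(img[nr][nc] - ref) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


def extract_regions(img, threshold, tol):
    """Grow one region from every dark pixel not already assigned to a region."""
    seen = set()
    regions = []
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if value < threshold and (r, c) not in seen:
                region = grow_region(img, (r, c), tol)
                seen |= region
                regions.append(region)
    return regions
```

For a striped pattern, each stripe is returned as its own region, and the union of the regions gives the extracted expression pattern.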
Finally, we convert the color/gray-scale pixels into simple black-and-white patterns such that only pixels containing gene expression take on a black color and the rest of the image lacks any color (Figure 1, G and H). We do not convert original images directly into black-and-white images prior to conducting the above-mentioned size standardization and pattern extraction procedures because the information contained in differences in color intensities is valuable for reliably separating expression patterns from the background.
Digital representation of expression patterns: The binary images representing gene expression patterns are processed to derive a vector of features describing the image content (gene expression patterns). We represent each image in the form of a string of 1’s and 0’s, where black pixels are denoted by a value of 1 and white pixels by a value of 0. For example, an expression pattern is represented as 0100111000... 00000001111111. This is referred to as the binary sequence vector (BSV) representation. The BSV representation is particularly useful in quantifying image-to-image dissimilarity for finding images with similar or overlapping expression patterns, and it allows for localization of image similarity searches to any section of the embryo.
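As a small illustration of the BSV representation, the sketch below flattens a binary image into a bit string and cuts out the bits belonging to a rectangular section of the embryo. The row-major pixel order and the helper names `to_bsv` and `bsv_window` are assumptions made for this example; the text does not specify the ordering.

```python
def to_bsv(binary_img):
    """Flatten a binary image (rows of 0/1) into a string of 1's and 0's."""
    return "".join("1" if px else "0" for row in binary_img for px in row)


def bsv_window(bsv, width, top, bottom, left, right):
    """Return the bits of a rectangular section of the image.

    width is the image width in pixels; rows top..bottom-1 and columns
    left..right-1 are kept, so a search can be restricted to one section.
    """
    return "".join(bsv[r * width + left:r * width + right]
                   for r in range(top, bottom))


pattern = to_bsv([[0, 1, 0],
                  [1, 1, 0]])
right_two_columns = bsv_window(pattern, width=3, top=0, bottom=2, left=1, right=3)
```

Comparing the same window from two standardized images restricts the comparison to that section of the embryo.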
An alternative set of features can be derived using shape descriptors, which have proven to be effective in establishing the similarity between two images as well as in effecting image retrieval (Costa and Cesar 2000). The latest multimedia standard proposed by the International Standards Organization (ISO) for indexing and retrieval of images, namely MPEG7, proposes shape descriptors for efficient and effective image retrieval. These features are specifically designed for natural images and their utility in the gene expression pattern image analysis is generally unclear. We therefore investigated the usefulness of shape descriptors by adapting them to our specific class of expression pattern images. Our analyses showed that they are not more efficient than the binary sequence vector for the early stage embryos considered here (K. Jayaraman, S. Panchanathan and S. Kumar, unpublished results).
Finding best matching expression patterns with the basic expression search tool (BEST): To find the best matching set of images for a given query image, we need to compute the extent of dissimilarity between gene expression patterns. We use the BSV representation for this purpose. The corresponding bit sequences are compared and the number of bits with different values is counted. We define the expression pattern distance (DE) to be equal to the number of differences between two images divided by the number of pixels depicting the expression pattern in at least one of the two images. In other words, we determine expression pattern similarity by focusing only on the pixels that show gene expression (have value 1) in either image. For example, two images, A and B, with bit sequences of length 30 given below have nine differences (at positions 12-14 and 25-30).
Expression Pattern A: 000110011110000000000111111111
Expression Pattern B: 000110011111110000000111000000
Of the 30 positions, 18 show expression in at least one of the two images; therefore, DE = 9/18 = 0.5. Images showing the highest amount of match (i.e., lowest DE) with the query image are retrieved as the best matches. This simple metric showed excellent performance for early stage embryos in an analysis of a collection of 982 images (see below).
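The distance and the retrieval step can be sketched in a few lines. This is an illustrative implementation under stated assumptions: patterns are equal-length BSV strings, the collection is a dictionary from a hypothetical image label to its BSV, and two patterns with no expression at all are given a distance of 0 (a case the definition leaves open).

```python
def expression_distance(a, b):
    """DE: differing bits divided by bits that are 1 in at least one pattern."""
    if len(a) != len(b):
        raise ValueError("patterns must be standardized to the same length")
    differences = sum(x != y for x, y in zip(a, b))
    expressed = sum(x == "1" or y == "1" for x, y in zip(a, b))
    # Assumption: two empty patterns are treated as identical.
    return differences / expressed if expressed else 0.0


def best_hits(query, collection, top=5):
    """Return the top matches as (label, similarity) with similarity = 1 - DE."""
    scored = sorted((expression_distance(query, bsv), label)
                    for label, bsv in collection.items())
    return [(label, 1 - distance) for distance, label in scored[:top]]


pattern_a = "000110011110000000000111111111"
pattern_b = "000110011111110000000111000000"
distance = expression_distance(pattern_a, pattern_b)  # 9 / 18
```

Sorting by DE and reporting 1 - DE mirrors the way hits are listed with a similarity value; ties are broken here by label, which is an arbitrary choice.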
Inferring gene interactions: Gene interactions are computed by comparing expression patterns of individual genes in wild-type and mutant backgrounds. Normally, investigators conduct this task by simple visual inspection and deduction. However, as the number of images increases, visual inspection becomes cumbersome. Therefore, we have devised an algorithm that uses gene expression pattern images, their genotypes, and their probes to infer the nature of the interaction for that pair of genes. This computational system works well for early stages of embryonic development because gene expression at these stages is often relatively broad, rather than highly localized as in later stage embryos. Also, we find that >80% of the available images are from early stage embryos. Therefore these methods are applicable for large amounts of existing data.
A flowchart for the two-gene case is shown in Figure 3, in which the expression patterns of gene B in the wild-type and mutant backgrounds for gene A are used. By design, these data shed light on how the expression of gene B is influenced by gene A. Figure 3 shows various paths starting from the gene expression images and proceeding to the decisions that correspond to whether the expression of gene B is only positively affected (A → B), only negatively affected (A —| B), positively and negatively affected (A →| B), or not affected (A × B) by gene A. Using the BSV representations of gene expression patterns, the actual algorithm works as follows. We begin by defining a new function, termed the gene expression function (gef), which evaluates whether a given expression pattern contains all 0’s, all 1’s, or a combination of 0’s and 1’s (i.e., no expression, ubiquitous expression, or a localized expression pattern, respectively). It produces a 1 if all pixels in the image are black and a 0 if all pixels are white; otherwise, it returns a P (for partial). To analyze all possible interactions between the gene expression patterns, we derive new image patterns from the two original images using the logical exclusive OR (xor) operation. This operator results in a 0 if the values of the corresponding pixels in the two images under inspection are the same (i.e., if they are both 0 or both 1) and in a 1 if they are different. Outputs of the gef function for this derived image, together with the original data, are input to the flowchart for inferring interactions.
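The two building blocks of this procedure, gef and the pixelwise xor, can be sketched as follows. The sketch assumes BSV strings of equal length and returns the gef result as one of the strings "1", "0", or "P"; the decision paths of Figure 3 are not reproduced because the flowchart itself is not spelled out in the text, and the example patterns are invented for illustration.

```python
def gef(bsv):
    """Gene expression function: "1" ubiquitous, "0" none, "P" partial."""
    if not bsv:
        raise ValueError("empty pattern")
    if "0" not in bsv:
        return "1"
    if "1" not in bsv:
        return "0"
    return "P"


def xor_pattern(a, b):
    """Pixelwise exclusive OR: 1 where the two patterns differ, else 0."""
    if len(a) != len(b):
        raise ValueError("patterns must be standardized to the same length")
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


# Hypothetical patterns of gene B in wild-type and gene A mutant backgrounds.
wild_type = "00111100"
mutant = "00110000"
changed = xor_pattern(wild_type, mutant)
summary = (gef(wild_type), gef(mutant), gef(changed))
```

A gef value of "0" for the xor image means the two patterns are identical, which corresponds to the case in which gene B is not affected by gene A.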
Performance of data standardization methods: We examined the performance of our methodologies for embryo-enclosing and expression pattern extraction using 97 images from eight articles (Lawrence and Pick 1998; Toy et al. 1998; Tsai et al. 1998; Ashe and Levine 1999; Goldstein et al. 1999; Jazwinska et al. 1999b; Zhang and Levine 1999; Jin et al. 2000). We provided the original and the automatic embryo-enclosed images to 16 independent evaluators (graduate students in biology or computer science not involved in the current project) and requested that they give a score of 0.25 for each edge in the edge-fitted image that, in their opinion, enclosed the Drosophila embryo in the tightest fit possible (visually). Therefore, a score of 1.0 refers to perfect fit of all four edges, and a score of 0 represents the poorest fit in all directions. The average edge-fitting score was 0.875; i.e., 3.5 edges fit well on average. While most edge-fitted images scored perfectly, we identified a few images that scored poorly (score <0.5). This appears to be due to poor or highly varying illumination in the image (e.g., embryo boundaries are too faint in some images) or to artifacts such as panel numbers or arrows. In these difficult cases, we needed to manually set boundaries.
We evaluated the performance of our automatic approach for pattern extraction by again asking 16 independent evaluators to assign a score of 1-4 for each extracted pattern (1 being worst and 4 being best). This tests the performance of the above method to approximate what the human eye can easily recognize. The average success score for the combined region-growing and adaptive-thresholding methodology was found to be 3.5 (87.5%). The observed difficulty with automatic pattern extraction appears to relate to the lack of sufficient contrast between the expression pattern and the background. In addition, in the presence of multiple patterns of interest (with different illuminations) the automatic technique needs to be user guided to extract all biologically significant regions. For this reason, a semiautomatic system was developed in which different threshold values and/or new seed points for different regions of interest in the embryo can be specified manually. This improves data quality and makes the extraction biologically relevant.
Finding similar gene expression patterns computationally: We explored the effectiveness of our computational approach in finding similar expression patterns using a prototype collection of 982 patterns from 49 published research articles (Figure 4). These tests were geared toward determining the sensitivity of our computational approach and the biological validity of the image-matching procedure. A computer program was written to automatically retrieve gene expression pattern images showing overlap with a given query pattern based solely on the similarities between images. We refer to the images retrieved by BEST as “BEST hits.”
Here we briefly summarize the characteristics of our 10 test-case expression patterns and then we describe BEST-hit results for each case. Of the 10 patterns, 9 are lateral views (Figure 4, A-E and G-J) and 1 is a dorsal view (Figure 4F). Expression patterns in some images differ considerably from each other (e.g., Figure 4, A and B), some differ only slightly (Figure 4, B and C), some have clearly overlapping domains of expression even though they are quite different overall (Figure 4, D and E and G and I), and others have multiple distinct regions of expression (Figure 4, E, F, I, and J). These query images were chosen without regard to their genotype, probe, or any other consideration except for diversity of expression pattern. Also, BEST hits were based solely on gene expression contained in the images; no prior knowledge of gene interactions was used. For each image, the program queried the entire collection of 982 images. The top five BEST hits for each image are shown in descending order of similarity below the query image in Figure 4, with the percentage of similarity (1 - DE) given in parentheses.
Figure 4A shows an image with expression restricted to the posterior 10% of the embryo (Zhang and Levine 1999). Specifically, the image shows the expression of forkhead RNA in a transgenic embryo with maternal expression of a mutant hairy cDNA fused to the bicoid 3′ untranslated region (UTR). With this as the query image, BEST should retrieve images with posterior restricted expression patterns, especially other panels in the original article. This is indeed the case. Four of the five BEST hits are other panels in the original article. These panels depict tailless, huckebein, and forkhead RNA expression in transgenic embryos with maternal expression of different mutant hairy cDNA fused to the bicoid 3′ UTR. In addition, the BEST hits contain an image with expression of brachyenteron RNA in a huckebein mutant embryo (Goldstein et al. 1999).
Figure 4B shows an image with expression restricted to the anterior 25% of the embryo (Zhang and Levine 1999). Here, the expression of hairy RNA in a transgenic embryo with maternal expression of a hairy cDNA fused to the bicoid 3′ UTR is captured. Three of the five BEST hits to this image are other panels in the original article; they depict hairy RNA expression in transgenic embryos with maternal expression of different mutant hairy cDNA fused to the bicoid 3′ UTR. An image from Hulskamp and Tautz (1991), which depicts bicoid RNA expression in a wild-type embryo, is also identified. The bicoid RNA localization is regulated by sequences in its 3′ UTR (MacDonald 1990); thus the identification of bicoid when using a query image in which gene expression is regulated by bicoid 3′ UTR sequences is clearly meaningful. An image from Flores-Saaib et al. (2001) depicting the expression of Gal4-driven β-galactosidase (lacZ) RNA in a transgenic embryo expressing the C-terminal domain of Dorsal fused to Gal4 was also among the five BEST hits. Results for Figure 4, A and B, clearly show that our search tool effectively retrieves biologically relevant images that match a query image with a simple expression pattern.
Figure 4C shows an image with expression of orthodenticle RNA restricted to the anterior 25% of the embryo, except that expression is absent at the anterior terminus (Tsai and Gergen 1994). Four of the five BEST hits are images with an anterior expression pattern in which expression is also absent at the anterior terminus. Three of these are from another article by the same author (Tsai et al. 1998) and depict orthodenticle RNA expression in different genotypes. An image from Grossniklaus et al. (1994) depicting the expression of sloppy paired1 RNA in a wild-type embryo was also among the five BEST hits. The least similar of the five BEST hits is an image with expression throughout the anterior region of the embryo from Zhang and Levine (1999) showing the expression of hairy RNA in a transgenic embryo with maternal expression of a hairy cDNA fused to the bicoid 3′ UTR. Notably, even though the query image in Figure 4C is quite similar to the query image in Figure 4B, the top-five BEST hits for these query images are not the same. This example demonstrates the sensitivity of the image search system.
Figure 4D shows an image with a single stripe of expression of short gastrulation RNA driven by the even skipped stripe 2 enhancer in a transgenic embryo (Ashe and Levine 1999). The top five BEST hits include RNA and protein expression patterns. The top hit is an image showing Gal4-driven lacZ RNA in an embryo expressing a Knirps-Gal4 fusion protein under the control of the even skipped stripe 2 enhancer (Arnosti et al. 1996). The next two BEST hits are images of Krüppel protein expression in a hunchback mutant embryo (Gaul and Jackle 1990) and Krüppel RNA expression in a bicoid and hunchback double-mutant embryo (Hulskamp et al. 1990). The next BEST hit is an image of hunchback RNA expression in a wild-type embryo (Margolis et al. 1995). The engrailed RNA expression in an embryo with ubiquitous fushi tarazu expression (Nasiadka et al. 2000) rounds out the top five. While the query image and the BEST hits are from six different articles, all of the genes involved are part of a well-characterized, spatially localized hierarchy of gene regulation. The query image and the top BEST hits depict expression of a transgene containing the stripe 2 enhancer that normally directs the expression of the pair-rule gene even skipped. As reviewed by Gaul and Jackle (1990), even skipped expression is regulated by the gap genes hunchback and Krüppel (BEST hits 2, 3, and 4). Finally, the segment polarity pattern of engrailed expression (BEST hit 5) is regulated by the pair-rule gene even skipped (Fujioka et al. 1995). This example demonstrates the use of the expression search system in retrieving data that help in understanding components of developmental pathways.
Figure 4E shows an image of the two domains of hunchback RNA expression in a wild-type embryo. hunchback is expressed in a wide stripe toward the anterior and a narrow stripe toward the posterior of the embryo (Tsai and Gergen 1994). All five BEST hits are images showing hunchback expression (visualized in a variety of ways) in wild-type or mutant backgrounds. The five BEST hits include overlapping expression patterns from the original article and from three other articles (Gaul and Jackle 1990; Margolis et al. 1995; Janody et al. 2000).
Figure 4F shows an image, in dorsal view, with three domains of expression. Expression is seen in the anterior 25% and in two stripes toward the posterior of the embryo. Specifically, the image shows Race RNA expression in a goosecoid mutant embryo with short gastrulation expression driven by the even skipped stripe 2 enhancer (Ashe and Levine 1999). Four of the five BEST hits are images from the same article showing dorsal views of Race or Race and short gastrulation RNA expression in goosecoid mutant embryos with short gastrulation expression driven by the even skipped stripe 2 enhancer. An image, in lateral view, showing the expression of knirps RNA in a maternal and zygotic caudal mutant embryo, is also identified (Rivera-Pomar et al. 1995).
Figure 4G shows an image with expression restricted to the ventral 10% of the embryo. The expression of snail RNA in an embryo derived from a groucho germline clone is shown (Goldstein et al. 1999). Two of the five BEST hits are images from the same article showing snail RNA expression in wild-type and huckebein mutant embryos. In addition, two images of lacZ RNA expression driven by a fusion of rhomboid and twist enhancers are identified (Nibu et al. 1998; Nibu and Levine 2001). The identification of twist expression when using a query image of snail expression is biologically meaningful. Both twist and snail are expressed in the ventral ectoderm and are important for ventral mesoderm formation (Ray et al. 1991). The expression of Gal4-driven lacZ RNA in a transgenic embryo expressing Dorsal fused to Gal4 is also retrieved in this search (Flores-Saaib et al. 2001).
The expression of decapentaplegic RNA is restricted to the dorsal 40% of a wild-type embryo as shown in Figure 4H (Jazwinska et al. 1999b). One of the five BEST hits is an image from the same article showing brinker RNA expression in an embryo derived from a female bearing a dominant Toll mutant allele. Two images of lacZ RNA driven by wild-type and modified versions of the decapentaplegic dorsal enhancer (Flores-Saaib et al. 2001) in wild-type embryos are also identified. The other two top hits are images of zerknüllt RNA expression in wild-type and dCtBP mutant embryos (Nibu et al. 1998). In this case, the query and all five BEST hits depict the expression of genes involved in direct interactions with Decapentaplegic signaling during embryonic dorsal/ventral patterning.
Figure 4I shows an image with stripes of expression in ventral and lateral regions. Specifically, the image shows the expression of short gastrulation in a wild-type embryo (Ashe and Levine 1999). The top BEST hit is an image of brinker RNA expression in a wild-type embryo (Jazwinska et al. 1999b). The fact that the top hit for short gastrulation RNA expression is brinker is consistent with the roles of brinker and short gastrulation in the Decapentaplegic signaling pathway (Jazwinska et al. 1999a,b). One of the five BEST hits is an image from Ashe and Levine (1999) showing the expression of short gastrulation RNA in a transgenic embryo in which short gastrulation is also expressed from the even skipped stripe 2 enhancer. Two images of rhomboid RNA expression in a dCtBP and a snail mutant embryo are also identified (Nibu et al. 1998). All of these genes are involved in embryonic dorsal/ventral patterning. Even skipped protein expression in an embryo derived from a bicoid oskar double-mutant female is also identified (Gaul and Jackle 1990).
Finally, we search for BEST hits using an embryo with seven stripes of expression, a pattern typical of pair-rule genes (Figure 4J). Figure 4J shows the expression of runt RNA in an embryo expressing low levels of knirps under the control of the even skipped stripe 2 enhancer (Kosman and Small 1997). Four of the five BEST hits are images from the same article. The top BEST hit is an image showing the expression of runt RNA in an embryo expressing intermediate levels of knirps under the control of the even skipped stripe 2 enhancer. The other three BEST hits from this article show the expression of fushi tarazu RNA in embryos expressing low or intermediate levels of knirps under the control of the even skipped stripe 2 enhancer. fushi tarazu is a pair-rule gene expressed in the same pattern as even skipped (Lawrence and Johnston 1989). An image of fushi tarazu RNA expression in an embryo ubiquitously expressing a Fushi tarazu-VP16 fusion protein is also identified (Nasiadka et al. 2000). In this image, the fushi tarazu expression pattern appears continuous due to its reduced size. In the original figure (as published in Nasiadka et al. 2000) it is possible to distinguish fushi tarazu’s normal striped pattern above the background of ectopic fushi tarazu expression induced by the fusion protein.
Overall, these results suggest that the basic expression search tool is (i) able to successfully retrieve expression patterns of the same gene from the same article, (ii) sensitive to relatively small changes in expression pattern, (iii) able to retrieve expression patterns of different genes that have similar functions (e.g., dorsal/ventral patterning), and (iv) able to recover expression patterns of different genes within a spatially localized regulatory hierarchy (e.g., gap, pair-rule, and segment polarity genes). Our examples identify numerous known, biologically meaningful matches (e.g., short gastrulation and brinker) and several potentially new matches (e.g., orthodenticle and sloppy paired1) from a large data set. This mirrors what researchers do when they compare their own images by eye with those in the published literature. The use of BEST hits will expedite the search for comparable RNA and protein expression patterns (as the BLAST search does for molecular sequences). As when searching the literature for matching gene expression patterns, a BEST-hits user must determine the biological meaningfulness of the matches retrieved by consulting the original articles.
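The retrieval step summarized above can be illustrated with a minimal sketch. It assumes that each expression pattern has already been extracted and standardized as a set of stained pixel coordinates, and it uses a Jaccard overlap as a stand-in similarity score. The names `jaccard` and `best_hits`, the pixel-set representation, and the choice of measure are hypothetical illustrations and are not taken from the BEST implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed representation: a pattern is the set of (row, col) pixels
# that show expression on a standardized embryo grid.
Pattern = FrozenSet[Tuple[int, int]]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Stand-in similarity: shared stained pixels / all stained pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_hits(query: Pattern, library: Dict[str, Pattern],
              k: int = 5) -> List[Tuple[str, float]]:
    """Return the k library images most similar to the query pattern."""
    scored = [(name, jaccard(query, pattern))
              for name, pattern in library.items()]
    # Highest score first; ties are broken by image name for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

Ranking by a single score and returning the top five mirrors the way results are reported in the examples above; any other similarity measure could be substituted for `jaccard` without changing the ranking logic.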
Inferring gene interactions: Here we present results from five examples in which genetic interactions were computed by following the flowchart in Figure 3. Our algorithm compares the expression pattern of a gene in wild-type and mutant embryos. In each case, inferred interactions are expected to match those that an investigator would deduce from visual inspection of the same images.
Figure 5A shows the effect of bicoid on Krüppel (Kr) expression. On the left, we show Krüppel RNA expression in a wild-type embryo (Hulskamp et al. 1990) and on the right we show Krüppel protein expression in a maternal bicoid mutant (bcd-) embryo (Gaul and Jackle 1990). Our algorithm compares the pattern of Krüppel expression in each image and determines that there is more extensive Krüppel expression in the bcd- embryo than in the wild-type embryo. The program suggests that Krüppel expression is negatively affected by bicoid activity, when we consider the extent of the region of gene expression. This is consistent with the reported refinement of Krüppel expression by bicoid-mediated repression (Gaul and Jackle 1990; Hulskamp et al. 1990).
Figure 5B shows the effect of bicoid on tailless (tll) expression by comparing tailless RNA expression in wild-type and bcd- embryos (Pignoni et al. 1992). The algorithm suggests that tailless expression is also negatively affected by bicoid activity. This is consistent with visual comparison of these images as reported by Pignoni et al. (1992).
In Figure 5C, we examine the interaction between caudal and knirps (kni). For this purpose, we use images containing knirps RNA expression in a wild-type embryo (Arnosti et al. 1996) and in a maternal and zygotic caudal mutant (cad-) embryo (Rivera-Pomar et al. 1995). The algorithm determines that there is more extensive knirps expression in the cad- embryo than in the wild-type embryo and concludes that there is a negative relationship between caudal activity and knirps expression. This is consistent with the Rivera-Pomar et al. (1995) report that giant, a caudal-dependent repressor, refines the expression of knirps.
The posterior refinement of giant expression by dCtBP-mediated repression is shown in the next example (Figure 5D). Here giant RNA expression in a wild-type embryo and a dCtBP germline clone mutant embryo (dCtBP-; Nibu et al. 1998) are compared. The algorithm suggests that giant expression is negatively affected by dCtBP activity as reported by Nibu et al. (1998).
Finally, we consider the effect of bicoid on orthodenticle (otd) expression (Figure 5E). The algorithm compares the expression patterns of Orthodenticle protein in a wild-type embryo (Janody et al. 2000) and orthodenticle RNA in a maternal bicoid mutant (bcd-) embryo (Tsai et al. 1998; Janody et al. 2000). Our algorithm suggests that orthodenticle expression is positively affected by bicoid activity, because there is less extensive orthodenticle expression in the bcd- embryo than in the wild-type embryo. This result is consistent with the known activation of orthodenticle expression by bicoid (Tsai et al. 1998).
These examples illustrate the ability of our computational techniques to infer genetic interactions using image data in a manner similar to that employed visually by researchers. The approach is simple: the entire process can be described by a few logic functions and is therefore easily automated for analyzing all available pairs of images that are suitable for inferring genetic interactions (results from those analyses will be published elsewhere).
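The comparison logic used in these examples can be sketched in a few lines. This is a simplified illustration of the extent-based rule described above (more extensive expression in the mutant implies a negative effect; less extensive expression implies a positive effect), not a transcription of the flowchart in Figure 3. The function name `infer_interaction`, the pixel-set representation, and the `tolerance` threshold for ignoring small differences are assumptions introduced for the example.

```python
from typing import FrozenSet, Tuple

# Assumed representation: a pattern is the set of (row, col) pixels
# that show expression on a standardized embryo grid.
Pattern = FrozenSet[Tuple[int, int]]


def infer_interaction(wild_type: Pattern, mutant: Pattern,
                      tolerance: float = 0.1) -> str:
    """Infer how the mutated gene affects the assayed gene.

    Compares only the extent (pixel count) of expression in the
    wild-type and mutant images; `tolerance` is a hypothetical cutoff
    below which a relative change in extent is treated as no effect.
    """
    wt_area, mut_area = len(wild_type), len(mutant)
    if wt_area == 0 and mut_area == 0:
        return "undetermined"  # no expression detected in either image
    change = (mut_area - wt_area) / max(wt_area, mut_area)
    if change > tolerance:
        # Expression expands when the gene is absent: negative effect.
        return "negative"
    if change < -tolerance:
        # Expression shrinks when the gene is absent: positive effect.
        return "positive"
    return "no detectable effect"
```

Under these assumptions, a pair such as Krüppel in wild-type and bcd- embryos, where the mutant pattern covers more pixels, would be labeled "negative", and a pair such as orthodenticle, where the mutant pattern covers fewer pixels, would be labeled "positive".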
A major problem facing biologists today is that data generation has consistently outpaced advancements in computational techniques for developing biological insights and generating hypotheses. This problem has plagued molecular sequence data for over a decade. With the advent of large-scale high-throughput in situ efforts, we are headed toward biological data overload in developmental biology as well. For example, the Berkeley Drosophila Genome Project has released a large collection of images (>30,000) generated with a standardized RNA in situ hybridization method using whole embryos and cDNA probes from the Drosophila UNIGENE set (http://www.fruitfly.org). About 15,000 of these images are expected to depict gene expression from early stage embryos. These data are released in raw form to the public. How will we view these data efficiently? How will we find images with similar expression patterns, other than by manually browsing through each and every image? How will we compare the newly acquired knowledge to all existing knowledge published in the literature? How will the images be standardized? How can we use all the data to generate a global view of the early stages of Drosophila embryonic development? No computational frameworks currently exist to facilitate efficient analyses of these new images or to integrate them with existing image data from the literature.
We have tackled this challenge by developing methods and algorithms for standardization, extraction, and similarity search of gene expression patterns. We have demonstrated their utility in finding similar and overlapping expression patterns for a wide variety of genes from early stages of Drosophila embryonic development. These bioinformatic methodologies are likely to pave the way for establishing a public computational resource to find similar expression patterns based on the visual content of the query image, a facility that will parallel the BLAST search (Altschul et al. 1990) for finding homologous sequences. It will make the great wealth of embryonic gene expression pattern data easily accessible. While we have focused on methods for early stage embryos here, these methods are adaptable for analysis of images capturing later stages as well as imaginal disc gene expression. Toward this end, our initial results are encouraging and are in preparation for publication elsewhere.
We have also proposed a method for elucidating genetic interactions computationally by comparing expression patterns of a given gene in wild-type and mutant backgrounds. This method will provide researchers for the first time with a tool to exhaustively examine every possible pair of genes for which wild-type and mutant expression patterns are available. It therefore will facilitate construction of developmental networks and help explore the combinatorial nature of genetic interactions that provide the exquisite specificity and diversity of gene expression patterns. These algorithms provide specific predictions for researchers, who can then test the validity of these in silico inferences. In the future, we plan to develop similar algorithms for simultaneous analysis of multiple genes.
In summary, methods presented in this article are efforts to fulfill the need for computational technology for large-scale analysis of gene expression data. For instance, our approach will facilitate the construction of similarity-based clusters of expression pattern images. Examination of similarities and differences in their spatial and temporal contexts and of the genetic background of images within each similarity cluster will likely reveal new genetic interactions and thus accelerate the discovery of hidden links in developmental networks. Therefore, an accelerated, expanded understanding of the interconnected nature of gene function may emerge from in silico analysis.
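As a rough illustration of what similarity-based clustering of expression pattern images might look like, the sketch below groups images whose pairwise similarity meets a threshold, joining groups transitively (single linkage). It is not a method described in this article: the Jaccard score, the pixel-set representation, the `threshold` value, and the function names are all assumptions chosen for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed representation: a pattern is the set of (row, col) pixels
# that show expression on a standardized embryo grid.
Pattern = FrozenSet[Tuple[int, int]]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Stand-in similarity: shared stained pixels / all stained pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_clusters(library: Dict[str, Pattern],
                        threshold: float = 0.5) -> List[List[str]]:
    """Group images linked by chains of pairwise similarity >= threshold."""
    names = sorted(library)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(library[first], library[second]) >= threshold:
                parent[find(second)] = find(first)  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

Each resulting cluster could then be examined for shared genotypes, probes, and developmental stages, as outlined above. The all-pairs comparison is quadratic in the number of images, so a collection of tens of thousands of images would likely need a more scalable strategy.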
We thank Drs. Tom Brody, David Capco, Alan Filipski, Manfred Laubichler, Alan Rawls, Koichiro Tamura, and Jeanne Wilson-Rawls for invaluable comments on an early draft of this article. We greatly appreciate the efforts of Emily Davenport, Veena Ganeshan, Eric Herbig, and Aaron Johnson in image data collection and pattern extraction, and the technical assistance of Dr. Sankar Subramanian. This research was supported in part by funds from the Center for Evolutionary Functional Genomics (S.K.) and research grants from the National Institutes of Health (S.K., S.J.N.) and National Science Foundation (S.P.).
Communicating editor: S. Yokoyama
- Received August 2, 2002.
- Accepted September 30, 2002.
- Copyright © 2002 by the Genetics Society of America