- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Geng, W.
- Articles by Schafer, W. R.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Geng, W.
- Articles by Schafer, W. R.
Quantitative Classification and Natural Clustering of Caenorhabditis elegans Behavioral Phenotypes
Wei Genga, Pamela Cosmana, Joong-Hwan Baekb, Charles C. Berryc, and William R. Schaferda Department of Electrical and Computer Engineering, University of California, San Diego, California 92093-0407,
b School of Electronics, Telecommunications and Computer Engineering, Hankuk Aviation University, Koyang City, 412-791, South Korea,
c Biostatistics Unit, University of California Cancer Center, San Diego, California 92093
d Division of Biology, University of California, San Diego, California 92093-0349
Corresponding author: William R. Schafer, 9500 Gilman Dr., University of California, San Diego, California 92093-0349., wschafer{at}ucsd.edu (E-mail)
Communicating editor: Y. X. FU
| ABSTRACT |
|---|
Genetic analysis of nervous system function relies on the rigorous description of behavioral phenotypes. However, standard methods for classifying the behavioral patterns of mutant Caenorhabditis elegans rely on human observation and are therefore subjective and imprecise. Here we describe the application of machine learning to quantitatively define and classify the behavioral patterns of C. elegans nervous system mutants. We have used an automated tracking and image processing system to obtain measurements of a wide range of morphological and behavioral features from recordings of representative mutant types. Using principal component analysis, we represented the behavioral patterns of eight mutant types as data clouds distributed in multidimensional feature space. Cluster analysis using the k-means algorithm made it possible to quantitatively assess the relative similarities between different behavioral phenotypes and to identify natural phenotypic clusters among the data. Since the patterns of phenotypic similarity identified in this study closely paralleled the functional similarities of the mutant gene products, the complex phenotypic signatures obtained from these image data appeared to represent an effective diagnostic of the mutants' underlying molecular defects.
AMONG the organisms most amenable to the genetic analysis of behavior is the nematode Caenorhabditis elegans. C. elegans has a simple nervous system consisting of 302 neurons of known position, cell lineage, and synaptic connectivity (![]()
![]()
![]()
![]()
![]()
To address this problem, we have explored the use of machine vision approaches to quantitatively characterize and classify C. elegans uncoordinated mutants. In previous work, we built a tracking and imaging system that could follow and record an individual animal's movements over long time periods and save digital image data representing the animal's body posture over the course of the recording (![]()
In this study, we have used image data collected by our automated tracking system to investigate the natural clustering of C. elegans behavioral phenotypes. From a complex data set consisting of 253 features measured from behavioral recordings of 797 individuals representing eight distinct genotypes, we used principal component analysis to represent each mutant type as a cloud of data points in low-dimensional feature space. We have also used k-means clustering and Euclidean distance measurements to explore the natural structure of the behavioral data and to compare the similarities of mutant phenotypic patterns. These results therefore constitute a quantitative definition of several important C. elegans behavioral phenotypes and demonstrate that mutant phenotypes can be clustered using a complex behavioral signature based on quantitative image features.
| MATERIALS AND METHODS |
|---|
Strains and culture methods:
Routine culturing of C. elegans was performed as described (![]()
The alleles and predicted products of the genes used in these experiments were as follows: unc-38 (x20), nicotinic acetylcholine receptor
-subunit (null allele); unc-29(x29), nicotinic acetylcholine receptor non-
-subunit (null allele; ![]()
-subunit (strong loss-of-function allele; ![]()
![]()
-2-subunit (strong loss-of-function allele); unc-2(mu74), N-type voltage-gated calcium channel
-1-subunit (null allele; ![]()
-1-subunit (partial loss-of-function allele; ![]()
Acquisition of image data:
C. elegans locomotion was tracked with a Zeiss Stemi 2000-C stereomicroscope mounted with a Cohu high-performance CCD video camera essentially as described (![]()
Image preprocessing:
To obtain the clean binary image, the background intensity level of the grayscale image was found first by taking the maximum of the values of the four corner points of the trimmed image (at least one of the corner points is always not part of the worm body). After finding the background level (b), a 5 x 5 moving window was scanned over the trimmed image, and the mean (m) and standard deviation (s) of the pixels inside the window were computed at every pixel position. If m < 0.7b or s > 0.3m, then the pixel was considered to be a pixel of the worm body and was assigned a value 1. To clean up the spots inside the worm body, a morphological closing operator (binary dilation followed by erosion) was applied (![]()
![]()
Image feature extraction:
All of the software for binarization, skeletonization, and feature extraction was coded in C and implemented on a UNIX machine. Some features (e.g., the area of the worm, that is, the number of pixels that make up the single binary object in the frame) could be computed on a single frame; these were computed for all 600 frames in the sequence. The average value, the maximum value, and the minimum value were then computed for these 600 measurements. Some of the maximum and minimum values are outliers introduced by noise or errors during image capture and processing. To avoid using these extreme values, it was more useful to summarize the group statistics with such quantities as the 90th and 10th percentile values out of the population of 600 numbers. Hereafter we use max and min to denote the 90th and 10th percentile values. Other features could not be extracted from a single frame, for example, the movement between 2 frames or the movement within 10 sec (20 frames). Since there are
600 frames total in a sequence, the movement between 2 frames could be computed 300 times if we take pairs of frames in a nonoverlapping fashion, or it could be calculated 599 times taking pairs of frames in a sliding window or overlapping fashion. Likewise, for the movement within 20 frames, we could compute 581 values for overlapping 20-frame intervals. Quantities of this type were calculated in a sliding-window fashion. As before, the average, max, min, and other order statistics can be computed from this set of numbers. Features that describe worm body transparency (median pixel value), and head and tail movement relative to centroid were also measured (W. GENG, unpublished data). A complete list of features used in classification, along with their mean and variance for each genotype, is included as supplemental data at http://www.genetics.org/supplemental/.
| RESULTS |
|---|
Collection and normalization of behavioral feature data:
To explore the natural clustering of behavioral phenotypes using defined quantitative parameters, we collected digital image data from eight representative genotypes: the standard wild-type strain N2 and seven loss-of-function mutants affecting different molecules involved in nervous system function. For each genotype, 100 5-min recordings (98 for unc-29, 99 for unc-2) were made of individual adult hermaphrodites, with images captured at a frequency of 2 Hz. For each recording, 253 parameters describing aspects of the animal's movement, body texture, or body posture were measured; the feature measurements for a single recording were designated as a single multidimensional data point. We then analyzed the clustering of these 797 data points with the goal of determining the optimal substructure of the behavioral data. In particular, we sought to determine how the feature data clustered in multidimensional space and to then correlate the clustering pattern of the feature data with the known biology of the mutant types in the study.
Standardizing inputs on a set of carefully selected features plays an important role in pattern recognition. Since our features were measured in different units, it was necessary to normalize them on a common scale to avoid one feature dominating others. The outliers introduced by noise and errors during the feature extraction process tend to give false clusters in clustering analysis; thus, the scaling method also needs to be carefully selected to suppress outliers. We evaluated three standard normalization methods: Min-max (linear transformation of the original input range into [-1, 1]), Zscore [defined as x = (f - mean(f))/stdev(f), where f is the original input feature], and sigmoidal method (![]()
|
Representation of phenotypic patterns in multidimensional feature space:
To visualize the phenotypic patterns as defined by the selected parameters, we used principal component analysis (PCA; ![]()
![]()
-1 and
-2 voltage-gated calcium channel subunits with nearly coincident expression patterns (![]()
|
|
|
Feature selection and classification of phenotypes:
Since one of our main objectives is to identify parameters that define particular mutant types, we wished to identify a small number of features that provide discriminative information. A variance plot (Fig 1A) shows that the top 43 principal components (17% of total PCs) capture >94% of total variance. This gives a strong indication that a few carefully selected features would represent the data well.
To identify best features for distinguishing any two worm types, we screened the entire feature set using a backward elimination process based on the linear Lagrangian support vector machine classifier (![]()
![]()
Feature subsets that are effective to distinguish all worm types were then selected progressively by choosing the most frequent features that appear on the top of all 28 sequences. For example, the first feature was selected as the feature that appeared most frequently as the no. 1 feature in all 28 sequences. The second feature was selected as the feature that appears most frequently as the no. 1 or no. 2 feature in all 28 sequences besides the feature that was already in the subset. A simple 1-nearest neighbor (1-NN) classifier with 10-fold cross-validation (![]()
A small set of features can be readily identified to approximate the data set by following the cross-validation error curve. Table 3 shows the classification results by using all 253 and a subset of 39 features. The 39-feature subset was selected at the first significant dip location (at k = 39) on the error curve. The data were well represented using a subset of 39 features for discriminating phenotypes. These features included several measurements of speed and reversals averaged over different time periods and worm head and tail width and brightness information (Table 4).
|
|
Natural clustering of phenotypic data:
To further investigate the clustering of the data points, we applied the k-means clustering algorithm to find the natural clusters in the behavioral data. For this analysis, each data point was treated individually without regard to mutant type. The k-means algorithm is an elementary but very popular clustering method. It enjoys the benefits of making no assumptions about the underlying data probability distributions, and is thus applicable to many problems. Suppose there are to be k clusters with respective centers C = {c1, ... , ck} and their corresponding nonoverlapping divisions of feature space are defined as D = {D1, ... , Dk}. Let ||·||2 denote "squared Euclidean distance." Our data are xi: i = 1, 2, ... , 797. We choose C = {c1, ... , ck} so that
. While there is no closed form solution to the minimization, ![]()
|
A key issue in k-means clustering is to determine the optimal number of clusters for the data set. We used two algorithms to determine the optimal cluster number for our behavioral data: the gap statistic (![]()
![]()
The idea of the gap statistic is to standardize the graph of log(Wk) by comparing it to its expectation under an appropriate null reference distribution of the data. Wk is the total within-cluster sum of squares around the cluster centers, when there are k clusters. Since we have 797 points in our data set, the null reference distribution is generated by drawing 797 samples from a distribution that is uniform along each feature data dimension. This is repeated B times. The expectation of the null reference E{log(W*kb)} can be estimated as
, where W*kb is the within-cluster sum of squares of the bth reference data set, and B is the number of reference data sets. The distance between these two curves is defined as the gap,
, for k = 1, ... , K, where K is the maximum number of clusters defined by the user according to the expected range of clusters. We use a maximum of 10 centers (K = 10) and five reference data sets (B = 5). The sampling distribution can be measured by
, where SDk is the standard deviation of the reference null distribution. The formula to calculate the optimal number of clusters kopt can be obtained as the first location where the gap curve starts to drop or level off. That is, the first k that satisfies gap(k)
gap(k + 1) - ask+1, where a is a multiplier adjusted to reject the null mode. Here it is set to 3.
The information theoretic approach tries to find the optimal number of clusters by fitting the within-cluster sum of squares curve (distortion curve) with two hyperbolic curves breaking at the location of the optimal k. The location of the break can be measured in a transformed domain when applying a negative power to the distortion curves. The magnitude of the power is controlled by the dimensionality of the data. Here it is set to -7. The transformed distortion curve usually can be approximated reasonably well by a piecewise linear function consisting of two straight lines with a break, or elbow, at the location of the optimal k. The optimal number of clusters can be easily obtained by finding the biggest jump, which is the difference between the successive points on the transformed distortion curve. This article provides theoretic justification and points out that this method can also provide suboptimal solutions by finding smaller jumps in the curve. This is particularly appealing given our objective of exploring the substructure of the data.
As shown in Fig 3C and Fig D, both methods identified six clusters as the optimal number (Table 3). In this optimal classification, the calcium channel mutants unc-36 and unc-2 were grouped into a single cluster and the nicotinic receptor mutants unc-29 and unc-38 into another cluster. In addition, the information theoretic approach identified an additional suboptimal solution of eight clusters with each cluster composed primarily of a single mutant type (Fig 3D and Table 5). Together, these results demonstrated that worms of the same mutant type tend to exhibit similar behavioral patterns and further showed that cluster analysis can be used to assess phenotypic similarities between different mutant classes.
|
| DISCUSSION |
|---|
Quantitative definition of behavioral mutant phenotypes:
We have shown here that quantitative morphological and locomotion features obtained from digital video recordings can be used to distinguish the behavioral phenotypes of C. elegans mutants. As shown in Table 2, a reduced set of
40 features is sufficient to identify visibly dissimilar mutant types with very high reliability. Furthermore, these features can often be used to distinguish between types with highly similar phenotypes (e.g., unc-2 and unc-36) that cannot be reliably identified even by an experienced human observer (![]()
These studies have also provided insight into the nature of specific mutant phenotypes. For example, unc-36, unc-29, unc-38, and unc-2 have all been categorized as weak kinkers, a term that has been difficult to define precisely (![]()
With the collection of larger data sets, it should be possible to use this approach to define and subdivide other widely cited phenotypic classes of C. elegans. For example, it should be possible to obtain precise definitions for other classes of uncoordinated mutants, such as coilers, shrinkers, and loopy mutants. In addition, although we have focused here on the analysis of phenotypes associated with abnormal locomotion, the image parameters we have used in this study could also be used to categorize other classes of behavioral or developmental mutants that involve alterations in body morphology. Such studies would provide valuable insight into the nature of these additional phenotypic types; in addition, it would be interesting from an informatics perspective to learn how the inclusion of genes whose focus of action is outside the neuromuscular system would impact the importance of features used in classification.
Prospects for using behavioral phenotypes for bioinformatic analysis:
The application of machine-based pattern recognition methods also allowed us to probe the similarities between different behavioral patterns on the basis of their clustering in multidimensional feature space. In general, the pattern of phenotypic clustering mirrored the known similarities in molecular function and cellular site of action of the mutant gene products. For example, unc-29 and unc-38, which respectively encode
and ß nicotinic receptor subunits with overlapping expression patterns, formed a single cluster in the optimal clustering and had centers that were the closest together by Euclidean distance (Fig 3A). Likewise, unc-2 and unc-36 mutants, which are defective in the
-1 and
-2 subunits, respectively, of the neuronal N-type calcium channel, formed a single cluster in the optimal k-means clustering, and the centers of these two types' data clouds were relatively close in feature space. In fact, the centers for all four of these types (which have all been designated as kinker Uncs and all encode excitatory ion channels whose focus of action is primarily at body muscle neuromuscular junctions) were closer to one another than to the other Unc mutants or to wild type. Thus, the quantitative phenotypic signature obtained through behavioral tracking appeared to correspond well to the underlying functional defects of the mutants we analyzed.
We anticipate that this type of comprehensive quantification of mutant behavioral phenotypes will have powerful applications in functional genomic studies. Clustering and pattern recognition analysis of microarray-derived gene expression profiles has provided important information about the likely functions of novel gene products in C. elegans and other organisms (![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
We thank the Caenorhabditis Genetics Center for strains, Zhaoyang Feng for development and maintenance of the tracking system, and Megan Palm, Marika Orlov, and Dan Poole for data collection. This work was supported by a grant from the National Institute of Drug Abuse.
Manuscript received February 5, 2003; Accepted for publication June 27, 2003.
| LITERATURE CITED |
|---|
BAEK, J., P. COSMAN, Z. FENG, J. SILVER, and W. R. SCHAFER, 2002 Using machine vision to analyze and classify C. elegans behavioral phenotypes quantitatively. J. Neurosci. Methods 118:9-21.[Medline]
BRENNER, S., 1974 The genetics of Caenorhabditis elegans. Genetics 77:77-94.
DUDA, R., P. HART and D. STORK, 2001 Pattern Classification, Ed. 2. Wiley, New York.
FLEMING, J. T., M. D. SQUIRE, T. M. BARNES, C. TORNOE, and K. MATSUDA et al., 1997 Caenorhabditis elegans levamisole resistance genes lev-1, unc-29, and unc-38 encode functional nicotinic acetylcholine receptor subunits. J. Neurosci. 17:5843-5857.
FRASER, A. G., R. S. KAMATH, P. ZIPPERLEN, M. MARTINEZ-CAMPOS, and M. SOHRMANN et al., 2000 Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature 408:325-330.[Medline]
GONZALEZ, R., and R. WOODS, 2002 Digital Image Processing, Ed. 2. Prentice-Hall, Englewood Cliffs, NJ.
GROSSMAN, D., 2002 Short Course in Data Warehousing and Data Mining (http://www.ir.iit.edu/~dagr/DataMiningCourse/Spring2001/Notes/Data_Preprocessing.pdf).
HODGKIN, J., 1983 Male phenotypes and mating efficiency in Caenorhabditis elegans. J. Genet. 103:43-64.
JAIN, R., K. RANGACHAR and B. SCHUNCK, 1995 Machine Vision. McGraw-Hill, New York.
KIM, S. K., J. LUND, M. KIRALY, K. DUKE, and M. JIANG et al., 2001 A gene expression map for Caenorhabditis elegans. Science 293:2087-2092.
LEE, R. Y. N., L. LOBEL, M. HENGARTNER, H. R. HORVITZ, and L. AVERY, 1997 Mutations in the
1 subunit of an L-type voltage-activated Ca2+ channel cause myotonia in Caenorhabditis elegans. EMBO J. 16:6066-6076.[Medline]
LLOYD, S. P., 1957 Least squares quantization in PCM. Institute of Mathematical Statistics Meeting, September 1957, Atlantic City, NJ.
MANGASARIAN, O. and R. MUSICANT, 2001 Lagrangian support vector machines. J. Mach. Learning Res. 1:161-177.
MATHEWS, E., E. GARCIA, C. SANTI, G. MULLEN, and C. THACKER et al., 2003 Critical residues of the Caenorhabditis elegans unc-2 voltage-gated calcium channel that affect behavioral and physiological properties. J. Neurosci. 23:537-545.
MENDEL, J. E., H. C. KORSWAGEN, K. S. LIU, Y. M. HADJU-CRONIN, and M. I. SIMON et al., 1995 Participation of the protein Go in multiple aspects of behavior in C. elegans. Science 267:1652-1655.
MODEL, F., P. ADORJAN, A. OLEK, and C. PIEPENBROCK, 2001 Feature selection for DNA methylation based cancer classification. Bioinformatics 17(Suppl S1):57-64.
SCHAFER, W. R. and C. J. KENYON, 1995 A calcium channel homologue required for adaptation to dopamine and serotonin in Caenorhabditis elegans. Nature 375:73-78.[Medline]
SCHAFER, W. R., B. M. SANCHEZ, and C. K. KENYON, 1996 Genes affecting sensitivity to serotonin in Caenorhabditis elegans. Genetics 143:1219-1230.[Abstract]
SEGALAT, L., D. A. ELKES, and J. M. KAPLAN, 1995 Modulation of serotonin-controlled behaviors by Go in Caenorhabditis elegans. Science 267:1648-1651.
SUGAR, C. and G. JAMES, 2003 Finding the number of clusters in a data set: an information theoretic approach. J. Am. Stat. Assoc. in press.
SULSTON, J. and H. HORVITZ, 1977 Post-embryonic cell lineages of the nematode Caenorhabditis elegans.. Dev. Biol. 56:110-156.[Medline]
SULSTON, J., E. SCHIERENBERG, J. WHITE, and J. THOMSON, 1983 The embryonic cell lineage of the nematode Caenorhabditis elegans.. Dev. Biol. 100:64-119.[Medline]
TIBSHIRANI, R., G. WALTHER, and T. HASTIE, 2001 Estimating the number of clusters in a dataset via the Gap statistic. J. R. Stat. Soc. Ser. B 63:411-423.
WHITE, J., E. SOUTHGATE, N. THOMSON, and S. BRENNER, 1986 The structure of the Caenorhabditis elegans nervous system. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 314:1-340.
ZIPPERLEN, P., A. FRASER, R. KAMATH, M. MARTINEZ-CAMPOS, and J. AHRINGER, 2001 Roles for 147 embryonic lethal genes on C. elegans chromosome I identified by RNA interference and video microscopy. EMBO J. 20:3984-3992.[Medline]
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Geng, W.
- Articles by Schafer, W. R.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Geng, W.
- Articles by Schafer, W. R.


