Mapping by Sequencing the Pneumocystis Genome Using the Ordering DNA Sequences V3 Tool
Zheng Xu (b), Britton Lance (a), Claudia Vargas (a), Budak Arpinar (b), Suchendra Bhandarkar (b), Eileen Kraemer (b), Krys J. Kochut (b), John A. Miller (b), Jeff R. Wagner (c), Michael J. Weise (d), John K. Wunderlich (c), James Stringer (e), George Smulian (f), Melanie T. Cushion (f), and Jonathan Arnold (a)
a Department of Genetics, University of Georgia, Athens, Georgia 30602,
b Department of Computer Science, University of Georgia, Athens, Georgia 30602,
c Molecular Genetics Instrumentation Facility, University of Georgia, Athens, Georgia 30602,
d Accelrys, Madison, Wisconsin 53711-1060,
e Department of Molecular Genetics, Biochemistry and Microbiology, University of Cincinnati College of Medicine, Cincinnati, Ohio 45267
f Department of Internal Medicine and the Cincinnati VAMC, University of Cincinnati College of Medicine, Cincinnati, Ohio 45220
Corresponding author: Jonathan Arnold, University of Georgia, Athens, GA 30602. E-mail: arnold{at}uga.edu
Communicating editor: Z-B. ZENG
| ABSTRACT |
|---|
A bioinformatics tool called ODS3 has been created for mapping by sequencing. The tool allows the creation of integrated genomic maps from genetic, physical mapping, and sequencing data and permits an integrated genome map to be stored, retrieved, viewed, and queried in a stand-alone capacity, in a client/server relationship with the Fungal Genome Database (FGDB), and as a web-browsing tool for the FGDB. In that ODS3 is programmed in Java, the tool promotes platform independence and supports export of integrated genome-mapping data in the extensible markup language (XML) for data interchange with other genome information systems. The tool ODS3 is used to create an initial integrated genome map of the AIDS-related fungal pathogen, Pneumocystis carinii. Contig dynamics would indicate that this physical map is ~50% complete with ~200 contigs. A total of 10 putative multigene families were found. Two of these putative families were previously characterized in P. carinii, namely the major surface glycoproteins (MSGs) and HSP70 proteins; three of these putative families (not previously characterized in P. carinii) were found to be similar to families encoding the HSP60 in Schizosaccharomyces pombe, the heat-shock Ψ protein in S. pombe, and the tRNA synthetase family (i.e., MES1) in Saccharomyces cerevisiae. Physical mapping data are consistent with the 16S, 5.8S, and 26S rDNA genes being single copy in P. carinii. No other fungus outside this genus is known to have the rDNA genes in single copy.
In the past 12 years, genomics has provided scientists with a fundamental, comprehensive, and systematic way to understand life. Fungi, as simple eukaryotes, have played a central role in the development of genomics.
Building such an integrated genome map is a required step toward the rational understanding of the general structure, function, and evolution of fungal genomes.
Fungal chromosomes are on the order of 0.2-15 Mb in size and can be separated by pulsed-field gel electrophoresis.
The mapping-by-sequencing strategy
Currently, several computational tools are available for integrating information within genome projects to complement the integrative experimental approach of mapping by sequencing. They are used as (1) a supporting tool to the genome database
Our ODS3 software is a genome visualization tool for fungal genome projects.
In this "sequence-then-map" approach, the genome can be mapped and scanned with the use of sequence-tagged connectors (STCs). The software ODS3 supports an implementation of the mapping-by-sequencing approach and was developed as a Java-based genome mapping visualization system for the Fungal Genome Database (FGDB).
The organization of this article is as follows: in METHODS we describe the methods for construction of an initial physical map of the Pneumocystis carinii genome using a dual strategy of mapping by sequencing and hybridization-based physical mapping, the algorithms used to reconstruct the native order of clones on a chromosome, and the design and implementation of ODS3 to carry out an integrated genome-mapping strategy. Under RESULTS we describe an integrated genomic map of P. carinii created using the new tool ODS3. In the DISCUSSION we frame findings about the Pneumocystis Genome Project and suggest some limitations and needed extensions to ODS3 and other integrated genome-mapping tools.
| METHODS |
|---|
Libraries:
P. carinii, karyotype form 1
DNA preparations:
Cosmid DNAs were isolated from overnight cultures grown in Luria broth (LB) media-carbenicillin (50 µg/ml), using the Concert Rapid Plasmid Miniprep system (GIBCO BRL, Gaithersburg, MD).
Probe preparation:
Totals of 10 µl of plasmid minipreparations and 5 µl of random oligo primers from the Prime-it II kit (Stratagene, La Jolla, CA) were heated in 100° water for 5 min. Then 5 µl of 5x dCTP buffer, 4 µl 33P label (>10^8 cpm), and 1 µl of Exo(-) Klenow (50 units/µl) were added to the mixture. The solution was incubated at 37° for 1 hr to create a 33P-labeled probe. To stop the reaction, 25 µl of 0.25 M EDTA was added to the probe solution.
DNA hybridization:
cDNA libraries were stamped onto nylon membranes (Hybond-XL; Amersham Pharmacia Biotech) in an 8 x 8 array of 55 microtiter plates per membrane with a BioGrid high-density stamping robot (BioRobotics). Inocula were allowed to grow overnight at 37° on LB agar plates containing carbenicillin (50 µg/ml) for the cDNA library. Membranes were treated to lyse the colonies with: (1) 10% SDS for 5 min; (2) 1.5 M NaCl, 0.5 M NaOH for 5 min; (3) 0.5 M Tris-HCl, 1.5 M NaCl, pH 7.2, for 5 min; and (4) 0.3 M NaCl, 0.3 M Na citrate, pH 7.0 (2x SSC), for 5 min. The treated membranes were air dried for 30 min, and the DNA was crosslinked to the nylon membrane by UV radiation (312 min) with a UV crosslinker (Stratagene) or alternatively by baking at 80° for 2 hr.
One membrane representing the complete cDNA library was prehybridized at 65° for 2 hr with 9 ml of modified hybridization buffer containing casein hydrolysate instead of bovine serum albumin.
DNA sequencing:
Cosmids were sequenced using a shotgun subcloning method.
DNA sequencing templates were generated using a Robbins Scientific Hydra-based double-stranded DNA-sequencing template isolation procedure.
Cosmid end sequencing:
Cosmid end sequencing was challenging because the E. coli strain included with the pWEB vector (Epicentre Technologies) was an endA1 strain.
Automated analysis of all sequencing data:
Sequencing data were processed using an automated workflow.
Tree building:
The 186 expressed sequence tags (ESTs) of the HSP70 genes from EST sequencing can be found at http://www.uky.edu/Pc/ and were used to build a gene genealogy. Accessions aab58248 and aad00455 were used in a Framesearch (Accelrys) against all 186 ESTs. The resulting inferred polypeptides were aligned with PILEUP (Accelrys) and arranged in a UPGMA tree. One representative from each of 11 clades was selected to build a consensus parsimony tree with PAUP (Accelrys).
To cross-validate this analysis, a separate analysis was performed. The ESTs were binned into clusters with the program Fragment Assembly System (Accelrys). The resulting consensus sequence of each bin was translated and aligned with PILEUP (Accelrys). The resulting alignment was used to build a consensus parsimony tree with PAUP (Accelrys).
Physical mapping algorithms:
The algorithm used by ODS2 seeks to minimize the sum of the Hamming distances between adjacent probes with respect to probe order by simulated annealing. ODS3 modifies this algorithm in the following ways:
- Calculate the Hamming distances between probes, order the probes, and then fit the clones to the probe order. We write $P = \{p_1, p_2, \ldots, p_m\}$ for the set of $m$ probes, $C = \{c_1, c_2, \ldots, c_n\}$ for the set of $n$ clones, and $\pi$ for an ordering (or permutation) of probes. Let $D$ denote a binary matrix of dimension $|C| \times |P|$, where $D_{i,j}$ is 1 if clone $c_i$ overlaps probe $p_j$ on the basis of the experimental data; otherwise it is 0:

  $$D_{i,j} = \begin{cases} 1 & \text{if clone } c_i \text{ overlaps probe } p_j, \\ 0 & \text{otherwise.} \end{cases}$$

  The Hamming distance objective function is given by

  $$F(P_\pi) = \sum_{j=1}^{m-1} \sum_{i=1}^{n} D^{\pi}_{i,j} \oplus D^{\pi}_{i,j+1},$$

  where $D^{\pi}$ is a matrix derived from $D$ by permuting the columns to the corresponding probe ordering $P_\pi$, and $\oplus$ is the Boolean exclusive or operation. After ordering the probes, the longest existing contiguous sequence of probes that hybridized with the given clone is found. The clone is placed in the map such that it spans those ordered probes. If more than two such pairs of places are found, the clone is randomly placed in the map in one of the possible positions. If a clone is without hybridization to any probe, it cannot be placed in the map. For those clones that have an ambiguous position in the map, we can adjust or correct their position manually. This modification of the algorithm can decrease the run time of the calculation when the set of probes is a subset of the set of clones.
- Use a microcanonical annealing search algorithm (CREUTZ 1983) instead of simulated annealing for ordering clones. Microcanonical annealing was found to achieve levels of optimization as good as simulated annealing and to do so an order of magnitude faster (BHANDARKAR and MACHAKA 1997).
- A weighted penalty value related to the number of misplaced anchored clones is added to the sum of Hamming distances (the objective function F). When computing a given permutation of probes by the algorithm, the subset of clones that are anchored to the genetic map is placed within the probe ordering, which yields a permutation of the anchored clones. A function pos returns the position in the genetic map of the marker to which an anchored clone is anchored, and the penalty value is calculated from these positions. The penalty is weighted by a scaling factor that can be set by the user when creating the genetic map. A higher value of the scaling factor places more emphasis on genetic map data. For a given scaling factor, the penalty is at its minimum when the order of markers implied by the permutation of anchored clones is the same as the order of markers in the genetic map.
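The objective function and the clone-placement step described above can be illustrated with a minimal Python sketch. It is not the ODS3 implementation (ODS3 is written in Java): the function names, the segment-reversal move, the demon cap schedule, and the toy matrix are all assumptions made for illustration, and ties between equally long probe runs are resolved by taking the first run rather than a random one. The genetic-map penalty is omitted.

```python
import random


def hamming_objective(D, order):
    """Sum of Hamming distances between adjacent probe columns of D.

    D is a list of rows (one per clone) of 0/1 entries; `order` lists
    probe (column) indices in their proposed map order.
    """
    return sum(
        sum(row[a] ^ row[b] for row in D)
        for a, b in zip(order, order[1:])
    )


def longest_run(D, order, clone):
    """Return (start, end) positions in `order` of the longest run of
    consecutive probes hit by `clone`, or None if it hits no probe."""
    best, start = None, None
    for pos, probe in enumerate(order):
        if D[clone][probe]:
            if start is None:
                start = pos
            if best is None or pos - start > best[1] - best[0]:
                best = (start, pos)
        else:
            start = None
    return best


def microcanonical_anneal(D, order, demon_cap=4.0, steps=2000, seed=0):
    """Demon-style (microcanonical) search over probe orders.

    A 'demon' carries a non-negative energy budget: a move is accepted
    only if the demon can pay for the change in the objective, and the
    demon absorbs energy released by downhill moves. The cap on the
    demon's energy is lowered linearly (an assumed schedule).
    """
    rng = random.Random(seed)
    current = list(order)
    energy = hamming_objective(D, current)
    best, best_energy = list(current), energy
    demon = demon_cap
    for step in range(steps):
        # Assumed move: reverse a randomly chosen segment of the order.
        i, j = sorted(rng.sample(range(len(current)), 2))
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
        delta = hamming_objective(D, candidate) - energy
        if delta <= demon:
            demon -= delta
            current, energy = candidate, energy + delta
            if energy < best_energy:
                best, best_energy = list(current), energy
        demon = min(demon, demon_cap * (1 - (step + 1) / steps))
    return best, best_energy


# Toy data: 3 clones (rows) by 4 probes (columns).
D = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
start_order = [0, 2, 1, 3]
order, score = microcanonical_anneal(D, start_order)
print(hamming_objective(D, start_order), score, order)
```

Recomputing the full objective for every candidate keeps the sketch short; a practical implementation would update only the terms affected by a move.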
Implementation of ODS3:
The ODS3 tool, available at http://gene.genetics.uga.edu/pub, is implemented in Java and allows access to the University of Georgia FGDB.
Design of the FGDB:
The FGDB
Supporting the HTBLAST workflow application:
A critical requirement for a large genome laboratory is software to control laboratory workflow while managing the data produced in the laboratory.
In our setting a collection of query sequences is to be BLASTed. The BLAST tool helps locate hot spots by dividing a query sequence into all possible subsequences of a given length, which depends on the type of sequence involved.
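The word-splitting idea can be sketched in a few lines of Python. This is an illustration only, not BLAST itself: the function names are hypothetical, only exact word matches are considered, and the scoring, neighborhood words, and extension steps of the real program are left out. The word length `w` is a parameter because it differs by sequence type (commonly 11 for nucleotide and 3 for protein searches).

```python
def words(sequence, w):
    """All overlapping length-w subsequences of `sequence`, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def shared_word_positions(query, subject, w):
    """(query_pos, subject_pos) pairs where a query word occurs exactly
    in the subject; these are candidate hot spots for extension."""
    index = {}
    for j, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(j)
    return [
        (i, j)
        for i, word in enumerate(words(query, w))
        for j in index.get(word, [])
    ]


print(words("ACGTA", 3))
print(shared_word_positions("ACGT", "TACGA", 3))
```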
The prototype of this workflow is based on a series of tasks for generating an HTBLAST search for sequences. The first task loads sequences that have not been searched by HTBLAST from the database. After the creation of the input file for HTBLAST, the sequence file (describing the collection of query sequences) is sent to an SGI Origin 2000 computer, which has 24, 300-MHz MIPS R12000 processors with 4 MB cache memory and 8 GB of system memory. The HTBLAST search is remotely invoked at a particular time by a local machine. When the execution of HTBLAST is finished, the results file is sent back to the database server via ftp. At this point, ODS3 has modules to parse the HTBLAST report and to update sequence data in the database (i.e., FGDB) with the new BLAST information.
ODS data files:
There is a need to develop a standardized but flexible format for representing different types of biological data.
| RESULTS |
|---|
Results in all figures and tables including Pneumocystis genome data (with the exception of the gene genealogy in Fig 3) were generated with ODS3. The tool was and is used to (1) estimate overlap between cosmids to guide the selection of cosmid probes, (2) generate an integrated map from the available sequence and mapping data as they are collected, (3) calculate statistics about the map, (4) examine the physical map for repeats, and (5) correlate features of the map with Pneumocystis sequence. In mapping by sequencing ODS3 estimates an overlap between each Pneumocystis cosmid clone with every other cosmid with an STC to guide the selection of the next cosmid for sequencing.
Coverage of the genome:
A total of 5280 P. carinii cDNA clones contain ~2000 distinct genes, listed at http://gene.genetics.uga.edu, being used to link slightly <~2500 cosmids
~10% in the cosmid library and 14% in the cDNA library; see http://gene.genetics.uga.edu for BLAST reports). The linking EST and cosmid sequences in the physical map can be classified as fungal or rat in origin by BLAST searches of these sequences against public databases generated by the workflow in ODS3.
A chart of the current contig assembly dynamics for the whole genome as 384 cosmid probes are added into the physical map is shown in Fig 1. The project is expected to be complete by 800 probings. Cosmid probes were classified as either sequence probes (113 in number) if they were sequenced or as hybridization probes (271 in number) if they were hybridized to the arrayed cDNA library. The physical map currently contains 1045 cDNAs or 100 x (1045/5280) = 21% of the redundant cDNA library. The dynamics would indicate that we are about halfway through completing the physical map, but there are significantly more contigs than expected, as explained in the next subsection.
18% by manual editing. For example, if the maps with 344 or 384 probes are manually edited after automated assembly, then the contig numbers are reduced to 188 and 203, respectively. This number of 203 contigs is still too high. Theoretically, the total number of contigs generated for 500 probings should be 119 (Fig 1).
As an independent control on the whole-genome assembly, >14 clones were assigned by pulsed-field gel electrophoresis (PFGE) Southern to chromosome 7. These 14 anchors were then used to retrieve all other clones linked by hybridization or sequence to the anchors for an independent assembly. Six contigs were assembled from 250 clones assigned to chromosome 7, and eight fragments of these six contigs from the independent assembly could be found in the whole-genome assembly. A total of 63 of the clones on chromosome 7 were removed by the filtering in the genome-wide assembly. The conclusion is that acquiring a small number of anchors (~15) per chromosome is very useful in validating the genome-wide assembly.
Examining chromosome 7 also allowed an assessment of coverage. Chromosome 7 is measured by PFGE at 500 kbp.
53 nonredundant ESTs assigned currently to the map or ~1 EST assigned per 20 kbp. There were 48 cosmids in the physical map of chromosome 7, and 104 cosmids were sized by BamHI restriction digests, yielding an estimated insert size of 26 kbp.
500 x 2 kbp. Subsequently, C. P. VIVARES (unpublished results) confirmed that chromosome 7 is a doublet by two-dimensional PFGE. As a consequence, the completeness of the map with respect to this mid-sized chromosome is estimated to be 90% (100 x 905/1000). The larger chromosomes are likely to have smaller coverage.
Filtering the clone-probe hybridization matrix:
Reconstructing an integrated map with the physical mapping algorithm described under METHODS depends on the clone-probe hybridization matrix generated from physical mapping data and sequence data (including the STCs, cDNAs, and genomic sequence). The physical mapping algorithm utilizes inferred overlaps detected on the basis of DNA/DNA hybridization and sequence similarity from all available sequence data. Not all of the inferred overlaps, such as those overlaps based on sequences representing E. coli contamination, should be relied on. The filtering rule is quite conservative; any sequence contig containing a significant hit (E-5 or smaller) with the filter's keyword is removed entirely. For example, on average 9% of the sequence per cosmid was removed as E. coli or vector with E-40 (see METHODS). To evaluate the map of the entire genome with respect to filtering, 12 filtered data sets were generated and classified into three groups (Table 1) according to removed keyword(s). Each data set contains 384 probes and was prefiltered for the known E. coli and vector contaminants with an E-40 as described in METHODS.
For evaluation purposes, we used two different methods of filtering the clone/probe hybridization matrix that are based on BLAST reports to discover clone-probe overlaps from sequence data (see Fig 2). Each data set was run 15 times using a scaling factor of 1.0 in the physical mapping algorithm (see METHODS). In the first probe-based approach, we carried out the following operations:
- Determine clone/probe hybridization from DNA/DNA hybridization data and compute "sequence hits." Sequence probes are defined as all cosmids that were sequenced by shotgun subcloning (see METHODS). A sequence hit between a sequence probe and clone is computed with a BLASTN search against all sequence from cosmids (including cosmids with only STCs). An overlap is declared if the E-value on the hit is below that chosen by the researcher (i.e., E-6 in this article).
- Filter each probe and its overlapping clone(s). Another BLASTN and BLASTX search of all sequence associated with each probe and each clone that hit a probe was performed against GenBank and GenPept within an automated workflow (see METHODS). Those contigs with the selected E-value and keyword(s) in the second set of BLAST reports were filtered out.
The results of this probe-based approach are reported in Table 2.
In the second clone-based approach, we carried out the following operations:
- Filter the sequence of every clone in FGDB. A BLASTN and BLASTX search was performed for all sequence with an automated workflow (see METHODS). Those contigs with the selected E-value and keyword(s) were filtered out.
- Determine clone/probe hybridization matrix as in step 1 of the probe-based approach.
The results of this clone-based approach are reported in Table 3.
The two approaches can differ in the resulting filtered clone/probe hybridization matrix because every clone has end sequence. When Table 2 and Table 3 are compared, we find that there are fewer clone-probe hits and clones in the map with the second approach. This method was more effective in eliminating problematic sequences in the Pneumocystis genome project such as particular clones that are enriched for rat sequences, but took on average 20% longer to execute (during ordering of clones and probes with the filtered matrix). The running time for each data set in ODS3 includes the time used to construct the overlap matrix and to compute with the ordering algorithm (see METHODS). In general, the run time is reasonably short, and an average running time of 252 sec was obtained for each of the 12 data sets using the second approach. As a consequence, the clone-based approach was selected and used in all other analyses in this article.
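The clone-based approach can be sketched as two small functions: one that flags sequence contigs whose BLAST reports contain a filter keyword at a significant E-value, and one that builds the clone/probe matrix from the remaining probe-versus-clone hits. This is a hedged illustration rather than the ODS3 code: the `Hit` record, the function names, and the tuple layout of `pair_hits` are assumptions, while the thresholds (E-5 for filtering, E-6 for declaring an overlap) and the keywords are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    contig: str       # sequence contig that was searched (hypothetical id)
    clone: str        # clone the contig belongs to
    description: str  # description line of the database match
    evalue: float


def contaminated_contigs(hits, keywords, max_evalue=1e-5):
    """Contigs with a significant hit whose description contains any
    filter keyword; such contigs are removed entirely."""
    bad = set()
    for h in hits:
        text = h.description.lower()
        if h.evalue <= max_evalue and any(k.lower() in text for k in keywords):
            bad.add(h.contig)
    return bad


def overlap_matrix(pair_hits, clones, probes, removed, max_evalue=1e-6):
    """Build D[clone][probe] from (probe, clone, clone_contig, evalue)
    tuples, skipping hits that rest on a removed contig."""
    D = {c: {p: 0 for p in probes} for c in clones}
    for probe, clone, contig, evalue in pair_hits:
        if contig not in removed and evalue <= max_evalue:
            D[clone][probe] = 1
    return D


# Made-up records, for illustration only.
hits = [
    Hit("ctg1", "cloneA", "Escherichia coli K-12 genome", 1e-40),
    Hit("ctg2", "cloneB", "Rattus norvegicus mRNA", 1e-30),
    Hit("ctg3", "cloneC", "E. coli plasmid", 1e-2),
]
removed = contaminated_contigs(hits, ["coli", "vector", "Pseudomonas", "Adenovirus"])
pair_hits = [
    ("probe1", "cloneA", "ctg1", 1e-50),
    ("probe1", "cloneB", "ctg2", 1e-20),
    ("probe1", "cloneC", "ctg3", 1e-3),
]
D = overlap_matrix(pair_hits, ["cloneA", "cloneB", "cloneC"], ["probe1"], removed)
print(removed, D)
```

Filtering before the overlap step, as here, mirrors the clone-based order of operations; the probe-based approach would instead compute overlaps first and then filter only the probes and the clones they hit.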
As the stringency of the filter is increased, we expect that removing contaminating DNA will reduce false joins (and possibly weak true joins) and hence increase the number of contigs and reduce the average size of contigs. From Table 2, we can see that, as expected, the number of contigs increases and the average size of contigs decreases when the stringency of the filter applied to the BLAST search is increased (i.e., the filter threshold is decreased). The average contig number of these 12 data sets in Table 3 is ~273 (and in Table 2, 267 contigs). The primary pWEB library used to reconstruct the P. carinii genome contained ~2500 clones (only ~2200 clones grew) with an average insert size of 26 kb. The estimated genome size is 7.7 Mb.
18% after manual curation in ODS3 at probings 344 and 384. An additional explanation is required for the departure of 203 contigs from the expected 119 contigs in Fig 1.
Another explanation is that true "weak links" are being eliminated by filtering. The evidence for that explanation is that by manual review of the hit matrix for the doublets labeled 7 and 10/11
1.88 Mb. By extrapolation we would expect 57 contigs for the whole 7.7-Mb genome, which is closer to the expected 119 contigs in Fig 1.
It is natural then to ask what clones are generating the discrepancy. As an example, after filtering these clones with multiple hybridization signals, the average size of contigs decreased on average from 23.2 (data sets 1, 5, and 9) to 7.1 clones (data sets 4, 8, and 12). Most of the hybridizations eliminated by the filtering (see Table 2 or Table 3) occur in clones with three or more hybridization signals (61%), i.e., clones with ambiguous placement on the physical map. These additional hybridization signals will lead to an ambiguous position assignment for the clone with these multiple hybridization signals.
We also evaluated the filtering technique on chromosome 7 of P. carinii. The filtered matrix was manually curated by BLASTing each cosmid probe and its associated sequences individually and was performed by Dr. George Smulian at the University of Cincinnati. The results were compared with a filtered and reordered matrix (generated by ODS3), in which sequences with BLAST reports containing the keyword coli, vector, Pseudomonas, or Adenovirus were removed, and BLAST hit results were filtered by a threshold value E-6 (i.e., BLAST hits with an E-value less than or equal to E-6 were picked). The two hybridization matrices were in substantial agreement. All clones examined either were linked to the same probe or were positioned discordantly because of hybridization to three or more probes (i.e., implying ambiguous placement on the map). It can be seen that the manually prepared map has 6 contigs as opposed to the automatically generated map with 31 contigs.
Multigene families in the physical map:
One of the advantages of the hybridization matrix as a representation of the physical map is the straightforward identification of repeat families.
~100 members of the Pneumocystis major surface glycoprotein (MSG) family
There are also examples of embedded rat sequence needing removal in the physical map. Examples include the region around cDNA S03F07 and cosmid W08D07. These regions contain sequence similar to human and rat sequence found by examining BLAST reports stored in the FGDB (from the HTBLAST workflow) and associated with the clone/probe hybridization matrix viewed with ODS3.
Several other families were detected including genes encoding homologs to HSP70, heat-shock Ψ protein in Schizosaccharomyces pombe, HSP60
Ψ protein family (and its homolog in Saccharomyces cerevisiae, SIS1) are thought to lend specificity to the function of the HSP70 family by means of a DnaJ motif shared by family members. The Ψ protein family is also thought to regulate the HSP70 family members. There are also five additional putative repeat families as yet uncharacterized (see Table 4).
One of the striking features of Pneumocystis is the hypothesis that there is only one rDNA; i.e., the rDNA is not a multigene family in this organism.
| DISCUSSION |
|---|
Physical mapping strategy:
A physical map for the 7.7-Mbp P. carinii genome is being constructed using a dual strategy: mapping by sequencing to generate an efficient sequencing resource with high connectivity to other genome resources and hybridizing cosmids to a cDNA collection to generate a gene-rich map.
~100 cosmids with an average insert size of 26 kbp. In addition, ~300 cosmids thought to be nonoverlapping with the sequenced cosmids were hybridized to date to the arrayed cDNA collection to generate a gene-rich map. There are ~2000 distinct cDNAs in the cDNA library, and ~(2000/5280) x 1045 = 396 of these genes are currently represented in the physical map. The current integrated map represents at least 21% x 7700 kbp = 1617 kbp of the genome. Gene density on the physical map is then 1617/396 or ~1 gene/4 kbp currently from the cDNA library in the physical map. In that the cDNA library contains about half of the genes in the Pneumocystis genome, this is consistent with the previously published gene density estimate.
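The gene density arithmetic in this paragraph can be retraced step by step. The short Python sketch below uses only the figures stated in the text; the variable names are illustrative, the 21% map fraction is taken as given, and the final genome-wide figure simply applies the stated assumption that the cDNA library contains about half of the genes.

```python
distinct_cdnas = 2000   # approximate number of distinct cDNAs in the library
library_clones = 5280   # cDNA clones in the arrayed library
clones_in_map = 1045    # cDNA clones currently placed in the physical map
genome_kbp = 7700       # estimated genome size in kbp
map_fraction = 0.21     # fraction of the library in the map, as stated

# Distinct genes expected among the mapped clones.
genes_in_map = round(distinct_cdnas / library_clones * clones_in_map)

# Portion of the genome represented by the current integrated map.
covered_kbp = map_fraction * genome_kbp

# Observed spacing of library genes on the map.
kbp_per_gene = covered_kbp / genes_in_map

# If the library holds about half of all genes, true spacing is about half.
genome_kbp_per_gene = kbp_per_gene * 0.5

print(genes_in_map, round(covered_kbp), round(kbp_per_gene, 2), round(genome_kbp_per_gene, 2))
```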
The new tool presents views of the data that capitalize on two advantages of the dual physical mapping strategy being used. The tool enables the viewer to inspect biologically interesting regions with repeats as identified by Fig 2. The tool also allows the viewer to inspect links in the physical map to validate that they are not based on contaminating rat DNA.
Coverage and gene density of the physical map of Pneumocystis:
The progress on the physical map is on target with >55% of the genome covered (Fig 1), but there are too many contigs (~200 contigs after manual curation or ~13 contigs per chromosome). The likely explanation for the greater-than-expected number of contigs
Multigene families in P. carinii:
In Table 4 there are 10 putative multigene families from Pneumocystis, and we confirmed an earlier hypothesis that the rDNA family found in most organisms is in fact a single-copy gene in P. carinii. Of the 10 putative multigene families listed, 2 have been previously identified (MSG and HSP70) in Pneumocystis. The families (heat-shock Ψ protein, HSP60, and tRNA synthetase) are multigene families in S. pombe or S. cerevisiae. Three of the 10 families are heat shock related; the abundance of stress-related messages in cDNA libraries has been a frequent finding.
A striking feature of Pneumocystis biology is the MSGs. The MSGs are the predominant antigenic species found on all P. carinii populations and are encoded by a gene family containing ~100 members.
~10% of its genome to this family of genes
The HSP70 is known to contain at least three family members. We found two of these existing family members. This raises the possibility that there is a third member of the family somewhere in the genome, and a third extremely tentative location was detected at cosmid W05B04. To test this possibility we created a multiway alignment of the 186 ESTs with homology to one of the Pneumocystis HSP70 genes previously reported.
Pneumocystis has an unusual biology in several respects, and another example is the argument that rDNA genes are single copy.
~1 kbp. Because chromosome 7 contains the rDNA sequence by PFGE Southern and because coverage of chromosome 7 is estimated to be 90%, it is unlikely that we have missed an extension of the rDNA sequences. These data are consistent with a single copy of the rDNA genes.
Limitations to mapping by sequencing:
The tool ODS3 enabled the identification of certain families of repeats from the clone/probe hybridization matrix. Once found, the precise placement of the repeats is needed, a function that ODS3 cannot currently provide. Further analysis requires the use of associated unique sequence to provide a context for the repeats, and other assembly tools are needed that successfully separate repeats by the associated unique sequence.
Different filtering strategies using the BLAST reports from the HTBLAST workflow application on all sequences in the integrated genome map vary as to their success in removing contaminating DNA. It was found that a clone-based filtering strategy (as opposed to a probe-based strategy) removed more of the problematic sequences in Table 2 and Table 3 when end sequences of all clones were available. Most of the hits (61%) removed were multiple-hybridization signals (beyond two hits) to particular clones. These multiple-hybridization signals will lead to ambiguous placement of clones in the physical map and will complicate the assembly. Moreover, filtering sequences based only on probes might leave sequences in clones (not used as probes) that might mislead researchers using the sequence on the web. Four kinds of sequences were filtered out: E. coli, vector, Adenovirus, and Pseudomonas. The first two contaminants likely arose in the cosmid DNA preparation during shotgun sequencing. The Adenovirus and Pseudomonas sequences are hypothesized to arise because the libraries are derived from an individual immunosuppressed rat lung, which can contain other microbes.
One filter fits all?
One other decision about the filter is what E-value in HTBLAST to use as a filter for declaring an overlap. In constructing a physical map in Fig 2 an E-value of E-5 was used as a cutoff. The impact of the choice can be seen in Table 2 and Table 3. This filter allowed the separation of the HSP70 family members to different locations in the genome, but the MSG family members coalesced into one location. Furthermore, a threshold of E-40 at a prefiltering step (to remove E. coli and vector sequence, see METHODS) entirely removed the rDNA genes because of their high similarity to those of E. coli. Whether or not a particular filter "works" will depend on the sequence similarity of family members. One filter is not likely to work for all families with differing levels of sequence similarity between family members within a genome. Instead what may be required is a filter that looks at the context of surrounding repeats in declaring an overlap.
The immediate challenge:
In spite of the limitations of the tool, ODS3 does a remarkable job of providing a first pass at the genome and tracking how a project is unfolding. Normally, physical mapping of individual fungal chromosomes with on the order of 1000 clones and 100 probes requires 1 month to edit one chromosome.
Near-term challenge of integrating genome projects:
At present, almost all genome information systems are constructed from scratch with little reuse of software developed elsewhere.
XML is a way of sharing, exchanging, and organizing the vast amount of genomics data cooperatively. XML is a meta-language to produce documents that convey content with semantic structure.
The extensible markup language is becoming a prevailing document and information exchange standard on the web (see BioML at htp://www.bioml.com/BIOML/index.html, GAME at http://bioxml.org/Projects/game, GEML at http://www.geml.orgn, and OpenBSA at http://industry.ebi.ack.uk/openBSA/. As long as a genome information system can provide a means to export data and views of the data in XML, then other systems will have the capability to import this information. We propose that XML be used as a data representation and exchange medium for fungal genome projects. To this end we have added to ODS3 the capability to store an integrated genome map in XML format for reuse in other projects. This is a small but important step toward developing a standard representation method for fungal genomics data. In this way standardized XML DTDs can be used to publish genomics information on the web instead of ad hoc representation mechanisms as is currently done. This approach should increase reusability and simplify integration across genome projects (![]()
| ACKNOWLEDGMENTS |
|---|
We gratefully acknowledge support from the National Science Foundation in the form of grants MCB-9630910 (J.A.) and BIR-9512887 (J.A.), from the National Institutes of Health (R01 AI44651 to M.C., J.A., G.S., and J.S.), from the U.S. Department of Agriculture (USDA-2002-35300-12475 to S.B. and J.A.), and from the Georgia Research Alliance (J.A.).
Manuscript received June 14, 2002; Accepted for publication December 19, 2002.
| LITERATURE CITED |
|---|
AIGN, V., U. SCHULTE, and J. D. HOHEISEL, 2001 Hybridization-based mapping of Neurospora crassa linkage groups II and V. Genetics 157:1015-1020.
ALTSCHUL, S. F., W. GISH, W. MILLER, E. W. MYERS, and D. J. LIPMAN, 1990 Basic local alignment search tool. J. Mol. Biol. 215:403-410.
ARNOLD, J., 1997 Editorial. Fungal Genet. Biol. 21:254-257.
ARNOLD, J., 2001 Foreword. Genetics 157:933.
ARNOLD, J. and M. T. CUSHION, 1997 Constructing a physical map of the Pneumocystis genome. J. Eukaryot. Microbiol. 6:8S.
BENNETT, J., and J. ARNOLD, 2001 Genomics for fungi, pp. 267-297 in The Mycota VIII. Biology of the Fungal Cell, edited by R. J. HOWARD and N. A. R. GOW. Springer-Verlag, New York.
BHANDARKAR, S. M. and S. A. MACHAKA, 1997 Chromosome reconstruction from physical maps using a cluster of workstations. J. Supercomput. 11:61-86.
BHANDARKAR, S. M., S. A. MACHAKA, S. S. SHETE, and R. N. KOTA, 2001 Parallel computation of a maximum likelihood estimator of a physical map. Genetics 157:1021-1043.
BURKE, D. T., G. F. CARLE, and M. V. OLSON, 1987 Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 235:1046-1049.
CAMP, N., H. COFER and R. GOMPERTS, 1998 White paper: high throughput BLAST (http://www.sgi.com/solutions/sciences/chembio/resources/papers/HTBlast/HT_Whitepaper.html).
CHIBANA, H., B. B. MAGEE, S. GRINDLE, Y. RAN, and S. SCHERER et al., 1998 A physical map of chromosome 7 of Candida albicans. Genetics 149:1739-1752.
CHURCH, G. and W. GILBERT, 1984 Genomic sequencing. Proc. Natl. Acad. Sci. USA 81:1991-1995.
COULSON, A., C. HUYNH, Y. KOZONO, and R. SHOWNKEEN, 1995 The physical map of the Caenorhabditis elegans genome. Methods Cell Biol. 48:533-550.
CREUTZ, M., 1983 Microcanonical Monte Carlo simulation. Phys. Rev. Lett. 50:1411-1414.
CUSHION, M. T. and J. ARNOLD, 1997 Proposal for a Pneumocystis genome project. J. Eukaryot. Microbiol. 44:7S.
CUSHION, M. T., M. KASELIS, S. L. STRINGER, and J. R. STRINGER, 1993 Genetic stability and diversity of Pneumocystis carinii infecting rat colonies. Infect. Immun. 61:4801-4813.
CUTICCHIA, A. J., 1994 A primer for relational databases, pp. 346-349 in Automated DNA Sequencing and Analysis, edited by M. D. ADAMS, C. FIELDS and J. C. VENTER. Academic Press, New York.
CUTICCHIA, A. J., J. ARNOLD, and W. E. TIMBERLAKE, 1992 The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics 132:591-601.
CUTICCHIA, A. J., J. ARNOLD, and W. E. TIMBERLAKE, 1993 ODS (ordering DNA sequences): a physical mapping algorithm based on simulated annealing. Comput. Appl. Biosci. 9:215-219.
DAVIDSON, S. B., J. CRABTREE, B. BRUNK, J. SCHUG, and V. TANNEN et al., 2001 K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst. J. 40:512-530.
DURBIN, J., and J. THIERRY-MIEG, 1994 The ACEDB genome database, pp. 45-55 in Computer Methods Genome Research, edited by S. SUHAI. Plenum Press, New York.
ELENKO, M., and M. REINERTSEN, 2000 XML & CORBA (http://cgi.omg.org/library/adt.html).
ENKERLI, J., H. REED, A. BRILEY, G. BHATT, and S. F. COVERT, 2000 Physical map of a conditionally dispensable chromosome in Nectria haematococca mating population VI and location of chromosomal breakpoints. Genetics 155:1083-1094.
EWING, B. and P. GREEN, 1998 Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 8:186-194.
EWING, B., L. HILLIER, M. C. WENDL, and P. GREEN, 1998 Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8:175-185.
EZEKOWITZ, R. A. B., D. J. WILLIAMS, H. KOZIEL, M. Y. K. ARMSTRONG, and A. WARNER et al., 1991 Uptake of Pneumocystis carinii mediated by the macrophage mannose receptor. Nature 351:155-158.
GARBE, T. R. and J. R. STRINGER, 1994 Molecular characterization of clustered variants of genes encoding major surface antigens of human Pneumocystis carinii. Infect. Immun. 62:3092-3101.
GIUNTOLI, D., S. L. STRINGER, and J. R. STRINGER, 1994 Extraordinarily low number of ribosomal RNA genes in Pneumocystis carinii. J. Eukaryot. Microbiol. 41:88S.
GOFFEAU, A., B. G. BARRELL, H. BUSSEY, R. W. DAVIS, and B. DUJON et al., 1996 Life with 6000 genes. Science 274:546-567.
GOODMAN, N., S. ROZEN and L. D. STEIN, 1995 The Case for Componentry in Genome Information Systems. Whitehead/MIT Center for Genome Research (http://www-genome.wi.mit.edu/informatics/componentry.html), Cambridge, MA.
GOODMAN, N., S. ROZEN, L. D. STEIN, and A. G. SMITH, 1998 The LabBase system for data management in large scale biology research laboratories. Bioinformatics 14:562-574.
HALL, D., J. WANG, and S. M. BHANDARKAR, 2001 ODS2: a multiplatform software application for creating integrated physical and genetic maps. Genetics 157:1045-1056.
HALL, D., J. A. MILLER, J. ARNOLD, K. J. KOCHUT, A. P. SHETH et al., 2003 Using workflow to build an information management system for a geographically distributed genome sequencing initiative, pp. 359-371 in Genomics of Plants and Fungi, edited by R. A. PRADE and H. J. BOHNERT. Marcel Dekker, New York.
HOHEISEL, J. D., E. MAIER, R. MOTT, L. MCCARTHY, and A. V. GRIGORIEV et al., 1993 High resolution cosmid and P1 maps spanning the 14 Mb genome of the fission yeast S. pombe. Cell 73:109-120.
INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM, 2001 Initial sequencing and analysis of the human genome. Nature 409:860-918.
KECECIOGLOU, J. D. and E. W. MYERS, 1995 Combinatorial algorithms for DNA sequence assembly. Algorithmica 13:7-51.
KELKAR, H. S., J. GRIFFITH, M. E. CASE, S. F. COVERT, and R. D. HALL et al., 2001 The Neurospora crassa genome: cosmid libraries sorted by chromosome. Genetics 157:979-990.
KOCHUT, K. J., J. ARNOLD, J. A. MILLER and W. D. POTTER, 1993 Design of an object-oriented database for reverse genetics, pp. 234-242 in Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, edited by L. HUNTER, D. SEARLS and J. SHAVLIK. AAAI Press, Menlo Park, CA.
KOCHUT, K. J., J. ARNOLD, A. SHETH, J. A. MILLER, and E. KRAEMER et al., 2003 IntelliGEN: a distributed workflow system for discovering protein-protein interactions. Parallel Distributed Databases 13:43-72.
KRAEMER, E., J. WANG, J. GUO, S. HOPKINS, and J. ARNOLD, 2001 An analysis of gene-finding programs for Neurospora crassa. Bioinformatics 17:901-912.
KUSPA, A. and W. F. LOOMIS, 1996 Ordered yeast artificial chromosome clones representing the Dictyostelium discoideum genome. Proc. Natl. Acad. Sci. USA 93:5562-5566.
LIN, J., R. QI, C. ASTON, J. JING, and T. ANANTHARAMAN et al., 1999 Whole-genome shotgun optical mapping of Deinococcus radiodurans. Science 285:1558-1562.
MAHAIRAS, G. G., J. WALLACE, K. SMITH, S. SWARTZELL, and T. HOLZMAN et al., 1999 Sequence-tagged connectors: a sequence approach to mapping and scanning the human genome. Proc. Natl. Acad. Sci. USA 96:9739-9744.
MILLER, J. A., D. PALANISWAMI, A. P. SHETH, K. J. KOCHUT, and H. SINGH, 1998 WebWork: METEOR's web-based workflow management system. J. Intell. Inf. Syst. 10:185-215.
MIZUKAMI, T., W. I. CHANG, I. GARKAVTSEVE, N. KAPLAN, and D. LOMARDI et al., 1993 A 13 kb resolution cosmid map of the 14 Mb fission yeast genome by nonrandom sequence-tagged site mapping. Cell 73:121-132.
MOTT, R., A. GRIGORIEV, E. MAIER, J. HOHEISEL, and H. LEHRACH, 1993 Algorithms and software tools for ordering clone libraries: application to the mapping of the genome of Schizosaccharomyces pombe. Nucleic Acids Res. 21:1965-1974.
OLSON, M. V., J. E. DUTCHIK, M. Y. GRAHAM, G. M. BRODEUR, and C. HELMS et al., 1986 Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl. Acad. Sci. USA 83:7826-7830.
PERKINS, D. D., M. S. SACHS and A. RADFORD, 2001 Chromosomal Loci of Neurospora crassa. Academic Press, New York.
POTTRATZ, S. T., J. PAULSRUD, J. E. SMITH, and W. J. MARTIN, 1991 Pneumocystis carinii attachment to cultured lung cells by Pneumocystis gp120, a fibronectin binding protein. J. Clin. Invest. 88:403-407.
PRADE, R. A., J. GRIFFITH, K. J. KOCHUT, J. ARNOLD, and W. E. TIMBERLAKE, 1997 In vitro reconstruction of the Aspergillus (=Emericella) nidulans genome. Proc. Natl. Acad. Sci. USA 94:14564-14569.
PRADE, R. A., P. AYOUBI, S. KRISHNAN, S. MACWANA, and H. RUSSELL, 2001 Accumulation of stress and cell wall degrading enzyme associated transcripts during asexual development in Aspergillus nidulans. Genetics 157:957-967.
ROBBINS, R. J., 1996 Bioinformatics: essential infrastructure for global biology. J. Comput. Biol. 3:465-478.
ROE, B. A., J. S. CRABTREE and A. S. KHAN, 1996 DNA Isolation and Sequencing. John Wiley & Sons, New York.
ROZEN, S., L. D. STEIN, and N. GOODMAN, 1995 LabBase: a database to manage laboratory data in a large-scale genome-mapping project. IEEE Comput. Med. Biol. 14:702-709.
SCHOENFELD, T., J. MENDEZ, D. R. STORTS, E. PORTMAN, and B. PATTERSON et al., 1995 Effects of bacterial strains carrying the endA1 genotype on DNA quality isolated with Wizard™ plasmid purification systems. Promega Notes Mag. 53:12-21.
SENGER, B., L. DESPONS, P. WALTER, H. JAKUBOWSKI, and F. FASIOLO, 2001 Yeast cytoplasmic and mitochondrial methionyl-tRNA synthetases: two structural frameworks for identical functions. J. Mol. Biol. 311:205-216.
SILVER, P. A. and J. C. WAY, 1993 Eukaryotic dnaJ homologues and the specificity of HSP70 activity. Cell 74:5-6.
SLONIM, D., L. KRUGLYAK, L. STEIN, and E. LANDER, 1997 Building human genome maps with radiation hybrids. J. Comput. Biol. 4:487-504.
SMULIAN, A. G., T. SESTERHENN, R. TANAKA, and M. T. CUSHION, 2001 The ste3 pheromone receptor gene of Pneumocystis carinii is surrounded by a cluster of signal transduction genes. Genetics 157:991-1002.
STEDMAN, T. T. and G. A. BUCK, 1996 Identification, characterization, and expression of the BiP endoplasmic reticulum resident chaperonins in Pneumocystis carinii. Infect. Immun. 64:4463-4471.
STEDMAN, T. T., D. R. BUTLER, and G. A. BUCK, 1998 The HSP70 gene family in Pneumocystis carinii: molecular and phylogenetic characterization of cytoplasmic members. J. Eukaryot. Microbiol. 45:589-599.
STEIN, L., 2002 Creating a bioinformatics nation: a web-services model will allow biological data to be fully exploited. Nature 417:119-120.
STRINGER, J. R., 1996 Pneumocystis carinii: What is it, exactly? Clin. Microbiol. Rev. 9:489-498.
SUNKIN, S. M. and J. R. STRINGER, 1996 Translocation of surface antigen genes to a unique telomeric expression site in Pneumocystis carinii. Mol. Microbiol. 19:283-295.
TALBOT, C. C., and A. J. CUTICCHIA, 1998 Human mapping databases, pp. 1.13.1-1.13.12 in Current Protocols in Human Genetics, edited by N. DRACOPOLI, J. HAINES, B. KORF et al. John Wiley & Sons, New York.
THOMAS, S. W., E. A. RUNDENSTEINER and A. J. LEE, 1995 Visualization and database tools for YAC and cosmid contig construction. Project-Oriented Databases and Knowledge Bases in Genome Research, Biotechnology Computing Track, Twenty-Seventh Hawaii International Conference of System Sciences, HICSS-28, Kihei, HI, No. 128.
VARGAS, C., 2002 The genomic study of multigene families of Pneumocystis carinii for potential drug targets. Honors Thesis, B.S. Microbiology, University of Georgia, Athens, GA.
VENTER, J. C., H. O. SMITH, P. W. LI, R. J. MURAL, and L. HOOD, 1996 A new strategy for genome sequencing. Nature 381:364-366.
VENTER, J. C., M. D. ADAMS, E. W. MYERS, P. W. LI, and R. J. MURAL et al., 2001 The sequence of the human genome. Science 291:1304-1351.
WADA, M., S. M. SUNKIN, J. R. STRINGER, and Y. NAKAMURA, 1995 Antigenic variation by positional control of major surface glycoprotein gene expression in Pneumocystis carinii. J. Infect. Dis. 171:1563-1568.
XIONG, M., H. J. CHEN, R. A. PRADE, Y. WANG, and J. GRIFFITH et al., 1996 On the consistency of a physical mapping method to reconstruct a chromosome in vitro. Genetics 142:267-284.
ZHANG, M. Q. and T. G. MARR, 1993 Genome mapping by nonrandom anchoring: a discrete theoretical analysis. Proc. Natl. Acad. Sci. USA 90:600-604.

) are compared with the observed values (
) during genome reconstruction (

