The karyotype of the African malaria mosquito Anopheles gambiae contains two pairs of autosomes and a pair of sex chromosomes. The Y chromosome, constituting ∼10% of the genome, remains virtually unexplored, despite the recent completion of the A. gambiae genome project. Here we report the identification and characterization of Y chromosome sequences of total length approaching 150 kb. We developed 11 Y-specific PCR markers that consistently yielded male-specific products in specimens from both laboratory colony and natural populations. The markers are characterized by low sequence polymorphism in samples collected across Africa and by presence in more than one copy on the Y. Screening of the A. gambiae BAC library using these markers allowed detection of 90 Y-linked BAC clones. Analysis of the BAC sequences and other Y-derived fragments showed massive accumulation of a few transposable elements. Nevertheless, more complex sequences are apparently present on the Y; these include portions of an ∼48-kb-long unmapped AAAB01008227 scaffold from the whole genome shotgun assembly. Anopheles Y appears not to harbor any of the genes identified in Drosophila Y. However, experiments suggest that one of the ORFs from the AAAB01008227 scaffold represents a fragment of a gene with male-specific expression.
SEX chromosomes of many groups of animals and plants originated independently from a pair of ordinary autosomes after acquisition of a major sex-determining locus (Muller 1932; Charlesworth and Charlesworth 2000). In the course of evolution an increasing portion of the Y chromosome was selected to stop recombining with its counterpart (Lahn and Page 1999; Skaletsky et al. 2003), probably in response to accumulation of sexually antagonistic alleles, advantageous to the heterogametic but harmful to the homogametic sex (Fisher 1931). Recombination suppression is responsible for gradual genetic degeneration of the Y chromosome, a strikingly common feature of nonrecombining sex chromosomes in all organisms (Rice 1994; Bachtrog 2003b). This inevitable process, involving point mutations, deletions, insertions, tandem duplications, and a massive spread of transposable elements, leads to silencing of most genes present on the proto-Y (Steinemann and Steinemann 2000; Bachtrog 2003a). As a consequence, in older sex chromosome systems only a few genes vital for male fertility remain functional on the Y (Carvalho et al. 2000, 2001). Furthermore, in such old systems, Y chromosome degeneration is manifested by a change in chromatin conformation from the euchromatic to the heterochromatic state. In Drosophila melanogaster, for example, the Y heterochromatinization process, thought to be mediated by extensive transposable element insertions (Steinemann and Steinemann 1998), has gone to completion.
The mosquito Anopheles gambiae, a major vector of human malaria, has a karyotype consisting of two pairs of autosomes and a pair of sex chromosomes. The Y chromosome contains a male-determining factor(s) that dominantly induces male development by its presence in an XX/XY system (Clements 1992). Cytogenetic evidence indicates that the Y is variable in size and banding pattern in natural A. gambiae populations and, as in Drosophila melanogaster, is fully heterochromatic (Bonaccorsi et al. 1980). The Y, constituting ∼10% of the genome, remains virtually unexplored. Despite whole-genome shotgun sequencing of A. gambiae from male and female templates, no sequence data have yet been assigned to that chromosome (Holt et al. 2002). The only exception is a recent description of a Y-derived retrotransposon expressed specifically in males (Rohr et al. 2002).
We initiated a study aiming at the isolation of A. gambiae Y chromosome sequences for two main reasons. First, identification of such sequences may allow new insights into the evolution of Y chromosome sequence and structure, which is of great general interest. Second, it may allow development of new markers that would lead to a better characterization of anopheline population history and geographic structure. Thorough understanding of population structure and gene flow among A. gambiae populations is critical for effective implementation of malaria control strategies. Evidence based on existing markers generally suggests that this species, despite being broadly distributed across sub-Saharan Africa, has a shallow population structure and a strikingly weak effect of distance on differentiation (Lehmann et al. 1996; Besansky et al. 1997; Kamau et al. 1999). These observations are consistent with the hypothesis of a recent expansion of A. gambiae populations (Donnelly et al. 2001); however, available markers may not be sensitive enough to detect significant differentiation among populations. Markers on the paternally transmitted Y chromosome, because of its smaller effective population size compared to loci on other chromosomes, its nonrecombining nature, and its inability to cross species boundaries due to male hybrid sterility, may be a more robust alternative to conventional markers. The Y, because of its greater information content relative to mitochondrial DNA and autosomal systems, became a uniquely powerful tool in the study of histories and geographic structure of human populations (Hammer and Zegura 1996; Hurles and Jobling 2001).
Male-specific markers are difficult to isolate, because the nonrecombining Y chromosome consists mostly of repetitive elements that share high similarity with sequences on other chromosomes. Yet Y chromosome DNA fragments have been identified in a variety of organisms: primarily mammals, but also fish, flies, and plants. The strategies used in those studies include construction of libraries from flow-sorted Y chromosomes (Oosthuizen et al. 1990), differential hybridization of male and female recombinant phage libraries (Anleitner and Haymer 1992), Y chromosome microdissection followed by degenerate oligonucleotide primed PCR (Shibata et al. 1999), random amplified polymorphic DNA PCR (Olivier and Lust 1998), genomic subtraction (Donnison et al. 1996), and in silico searches of genome sequences (Carvalho et al. 2000). Here we describe the identification and characterization of numerous Y chromosome-linked sequences from A. gambiae, using some of the strategies listed, and the development of 11 Y chromosome-specific PCR markers. This project was initiated with a differential hybridization strategy prior to the A. gambiae genome sequencing; however, progress and ultimate completion of the genome project allowed us to successfully implement other approaches to identify Y chromosome sequences.
MATERIALS AND METHODS
DNA samples: Specimens used in the study comprised mosquitoes field collected in Senegal and Burkina Faso in 1997, F1 progeny of females collected in Kenya in 1987 (McLain et al. 1989), and mosquitoes from the A. gambiae PEST laboratory colony. Genomic DNA was isolated from individual adult males and virgin females according to Collins et al. (1987). The distal part of the abdomen containing spermathecae was removed prior to DNA extraction from field-collected females or their F1 female progeny to avoid contamination with male DNA transferred in sperm, if mating had occurred.
Southern blot hybridization: Restriction endonuclease-digested DNA was separated by electrophoresis on a 0.8% agarose gel and transferred by capillary blotting onto Hybond-N+ membranes (Amersham Biosciences) in 10× standard saline citrate (SSC) buffer (Sambrook et al. 1998). Southern blots were hybridized overnight as previously described (Severson 1997) with probes radioactively labeled using specific primers. Three posthybridization washes at high stringency were performed in 0.1× SSC, 0.1% SDS at 65° for 15 min each.
Differential hybridization: A λDASH II (Stratagene, La Jolla, CA) genomic library prepared from partially digested A. gambiae SUA strain DNA of both sexes (Salazar et al. 1994) was plated at ∼30,000 PFU on each of 12 (150-mm) plates. Plaque lifts using Duralose membranes (Stratagene) were performed in quadruplicate. Two sets of filters each were differentially screened with equivalent amounts of male or female A. gambiae ZAN/U strain genomic DNA radiolabeled by random priming using a HighPrime kit (Roche, Indianapolis). The filters were washed at high stringency. Phage that reproducibly hybridized only with the male total genomic DNA were plugged into 1 ml of SM phage dilution buffer and rescreened to obtain purified putative male-specific phage. The phage were harvested from liquid lysates and DNA was isolated according to Salazar et al. (1994).
Subcloning and sequencing: Phage inserts were subcloned into pBluescript SK+ (Stratagene) and, after electroporation, amplified in Escherichia coli DH10B (GIBCO BRL, Gaithersburg, MD). Plasmids were purified using a QIAprep Spin Miniprep kit (QIAGEN, Valencia, CA). PCR products were gel purified using a QIAquick Gel Extraction kit (QIAGEN) and sequenced directly or after cloning into the pGEM-T Easy vector (Promega, Madison, WI). Cloned PCR templates were PCR amplified and gel purified prior to sequencing. Sequencing was performed using ABI BigDye terminator chemistry (PE Applied Biosystems) on an ABI 377 sequencer. Sequences were assembled and verified by inspection of both strands using ABI Sequence Navigator software. Similarity searches of the obtained DNA sequences against the GenBank nr database were performed using BLASTN and BLASTX (Altschul et al. 1997).
Bacterial artificial chromosome library screening: To identify clones containing Y chromosome-derived inserts, a bacterial artificial chromosome (BAC) genomic DNA library (ND-TAM) constructed from DNA prepared from both males and females of the A. gambiae PEST strain (Hong et al. 2003) was screened by PCR using primers amplifying male-specific DNA fragments (Table 1). To identify individual Y-linked BAC clones, successively less complex pools of BAC clone DNA were used as templates in PCR reactions with the conditions described below. The screening was performed in three steps. The first step was performed on 80 individual pools of DNA isolated from clones gridded on each 384-well microtiter plate; the second step involved 4 row pools and 6 column pools for each positive plate; and the third step was performed on 16 bacterial samples representing individual clones. BAC DNA was isolated using a QIAGEN Plasmid Midi kit following the manufacturer's protocol with slight modification: DNA was eluted with QF buffer prewarmed to 65°.
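The arithmetic of the three-step pooling scheme can be illustrated with a short Python sketch. It assumes the standard 16-row by 24-column layout of a 384-well plate and assumes that each of the 4 row pools combines 4 adjacent rows and each of the 6 column pools combines 4 adjacent columns; the text gives only the pool counts, so the pool composition and all function names here are hypothetical.

```python
from string import ascii_uppercase

ROWS, COLS = 16, 24          # standard 384-well layout (assumed)
ROW_POOLS, COL_POOLS = 4, 6  # pool counts stated in the text
PLATES = 80                  # plate pools screened in the first step


def candidate_wells(row_pool, col_pool):
    """Wells at the intersection of one row pool and one column pool.

    Pools are numbered from 0. Each pool is assumed to cover a block of
    adjacent rows (or columns); this grouping is an illustration only.
    """
    rows_per_pool = ROWS // ROW_POOLS
    cols_per_pool = COLS // COL_POOLS
    rows = range(row_pool * rows_per_pool, (row_pool + 1) * rows_per_pool)
    cols = range(col_pool * cols_per_pool, (col_pool + 1) * cols_per_pool)
    return [f"{ascii_uppercase[r]}{c + 1}" for r in rows for c in cols]


def reactions_per_positive_plate():
    """PCRs needed for steps two and three when a plate has one positive
    row pool and one positive column pool."""
    return ROW_POOLS + COL_POOLS + len(candidate_wells(0, 0))


print(PLATES * ROWS * COLS)           # total clones in the gridded library
print(len(candidate_wells(0, 0)))     # individual clones tested in step three
print(reactions_per_positive_plate())
```

Under these assumptions one positive row pool and one positive column pool intersect in 16 wells, which agrees with the 16 individual clones tested in the third step. A plate carrying several positive clones would produce more intersections and require more reactions.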
Subtraction of BAC clones: Preparation of driver and tester: BAC clone DNA was sonicated in 30% glycerol until the resulting random fragments ranged from 100 bp to 2 kb in size. To eliminate fragments <100 bp, the fragmented DNA was purified using the StrataPrep PCR purification kit (Stratagene) and eluted in 50 μl of water. Driver DNA (8 μg) was biotinylated with Photoprobe biotin (Vector Laboratories, Burlingame, CA) by thermal coupling and, following purification according to the manufacturer's recommendation, resuspended in 8 μl of Tris-EDTA (pH 8.0). Tester DNA (3 μg) was treated with mung bean nuclease (1 unit/μg of DNA) for 30 min at 30° to remove single-stranded extensions and, after phenol/chloroform extraction and ethanol precipitation, was resuspended in 4 μl of Tris-EDTA. Adapter sequences were generated from mixtures of complementary oligonucleotides OL1 (5′-ACCGTCGTCCATCCAGTCGCAATCC-3′) and OL2 (5′-GGATTGCGACTGGATGGA-3′) by heat denaturation and slow cooling to room temperature. Adapter sequences were ligated to the tester DNA for 14 hr at 14° using 400 units of T4 DNA ligase. Prior to hybridization the tester was diluted in 20 mM HEPES-HCl (pH 6.6), 50 mM NaCl, and 0.2 mM EDTA (pH 8.0).
Subtractive hybridization: A total volume of 5 μl of the hybridization mixture containing 2 μg of driver DNA mixed with 40 ng of tester DNA, 50 mM HEPES-HCl (pH 8.0), 0.5 M NaCl, and 0.2 mM EDTA (pH 8.0) covered with a drop of mineral oil was denatured at 98° for 5 min, cooled to 68°, and incubated at 68° for 24 hr. Then 2 μg of freshly denatured driver DNA was added to the hybridization mixture and incubated at 68° for an additional 48 hr. DNA from the hybridization mixture was precipitated and resuspended in TBST binding buffer [0.1 M Tris (pH 7.5), 150 mM NaCl, 0.1% Tween 20], and biotinylated homo- and heteroduplexes were removed using Vectrex Avidin D (Vector Laboratories) according to the manufacturer's protocol. The resulting supernatant was used as a template in PCR amplification using the OL1 oligonucleotide as a primer. PCR products were cloned into pGEM-T Easy vector and individual clones were sequenced.
In silico searches of A. gambiae genome: Searches for Y-linked scaffolds harboring potential coding sequences were performed as described (Carvalho et al. 2000, 2001). Scaffolds assembled only from fragments originating from male libraries were used to build a database by using the FORMATDB program of STANDALONE BLAST (downloaded from the National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov/). BLAST searches were performed on a local Linux computer. For further details see RESULTS. A table containing a list of scaffolds constituting the entire A. gambiae genome, with their GenBank accession numbers, lengths, linkage to a given chromosome (if mapped), and the origin of fragments used for each scaffold's assembly, is available as supplementary data at http://www.genetics.org/supplemental/.
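The selection of candidate scaffolds can be sketched in Python as a simple filter. The record layout below (field names, and the toy accessions and read counts) is entirely hypothetical and does not reproduce the supplementary table; the sketch only illustrates the criterion of keeping unmapped scaffolds whose fragments all come from male libraries.

```python
def male_only_unmapped(scaffolds):
    """Return accessions of unmapped scaffolds built solely from male-library reads.

    Each scaffold is a dict with hypothetical keys:
      accession     - scaffold identifier
      chromosome    - chromosome assignment, or None if unmapped
      male_reads    - number of fragments from male libraries
      female_reads  - number of fragments from female libraries
    """
    return [
        s["accession"]
        for s in scaffolds
        if s["chromosome"] is None
        and s["male_reads"] > 0
        and s["female_reads"] == 0
    ]


# Toy records with made-up identifiers and counts, for illustration only.
toy_table = [
    {"accession": "scaffold_A", "chromosome": "2", "male_reads": 50, "female_reads": 48},
    {"accession": "scaffold_B", "chromosome": None, "male_reads": 12, "female_reads": 0},
    {"accession": "scaffold_C", "chromosome": None, "male_reads": 7, "female_reads": 9},
    {"accession": "scaffold_D", "chromosome": "X", "male_reads": 4, "female_reads": 0},
]

print(male_only_unmapped(toy_table))
```

The sequences of the retained scaffolds would then be written to a FASTA file and indexed for local BLAST searches.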
RNA extraction and reverse transcriptase-PCR experiments: Total RNA was extracted using TRIZOL (GIBCO BRL). Residual DNA contamination was eliminated with DNase I (Invitrogen, Carlsbad, CA). Reverse transcriptase-PCR (RT-PCR) was performed using the Superscript One-Step RT-PCR kit (GIBCO BRL). All experiments were done according to the manufacturer's protocol.
PCR assays: PCR mixtures contained 1 μl template DNA (1/100 of the DNA extracted from a single mosquito), 1.5 mM MgCl2, 20 mM Tris-HCl (pH 8.4), 50 mM KCl, 0.2 mM each dNTP, 25 pmol of each primer, and 2.5 units Taq polymerase in a total volume of 50 μl. The PCR reactions were performed in the Perkin-Elmer (Norwalk, CT) 9600 thermocycler with an initial denaturation at 94° for 3 min, followed by 35 cycles of 94° for 20 sec, 55°–58° for 30 sec, and 72° for 1 min, followed by a final elongation step at 72° for 10 min.
RESULTS
Identification of Y-linked clones through differential hybridization: An A. gambiae phage genomic library was screened using the total genomic DNA of males or females as probes. Phage that produced signals only with the male probe were purified by rescreening. Of 34 isolated phage from which DNA was extracted, restriction digested, blotted, and probed with genomic DNA from males or females (reverse Southern blot), 16 phage hybridized uniquely or significantly more strongly with male DNA. The 10 phage giving the most distinct male-specific signals were subcloned, and selected individual fragments were used as probes in genomic Southern blot analysis to confirm association of the phage inserts with the Y chromosome. Subclones from 5 phage showed much stronger hybridization to male DNA and/or hybridized to fragments unique to males, in addition to fragments common to both sexes. Subclones from other tested phage appeared to be present in both sexes, with comparable intensity of hybridization signal.
Two phage, 3-7 and 6-8, hybridized only with male DNA on reverse Southern blots and the subclones derived from them hybridized strongly to fragments unique to males on genomic Southern blots. These two phage were characterized further by restriction mapping and subclone sequencing. GenBank nr database searches using subclone sequences revealed an accumulation in both 3-7 and 6-8 of densely packed transposable elements (TEs), including mtanga and undescribed TEs similar to mag and Pao from Bombyx mori and mdg1 from D. melanogaster.
A 4.1-kb subclone from 3-7 contained a fragment of an mdg1-like retrotransposon with its open reading frame (ORF2) interrupted by a 682-bp fragment lacking similarity to any known coding sequence, followed by a mag-like retrotransposon with an apparently complete ORF1 and a partial ORF2. To confirm Y chromosome linkage of the 3-7 clone we implemented a PCR approach based on the premise that primers designed from the Y should allow amplification exclusively from male DNA, provided target sequences on the Y had diverged enough from sequences on other chromosomes. Initially a primer pair designed within the noncoding fragment was used (3-7_4H2F 5′-CATAGTGTCATAACCAGCACG-3′ and 3-7_4H2R 5′-TCTTCTTCGGGAACACGATG-3′). However, a product of the expected size was amplified from genomic DNA of both sexes, although usually with more robust product from male samples. This result together with the male-specific bands observed in the genomic Southern analysis (data not shown) suggested association of the noncoding sequence with the Y, without Y specificity. Two additional primers (mag-mdgF and mag-mdgR; Table 1), each designed within the coding sequences of the flanking retrotransposons, permitted amplification of a male-specific product, providing convincing evidence for Y linkage of the 3-7 sequences (Figure 1). Here, the PCR pinpoints a unique genomic rearrangement caused by an integration of the mag-like retrotransposon into an mdg1-like retrotransposon on the Y chromosome. The male specificity of the mag-mdg1 marker was confirmed by PCR on males and females from the PEST laboratory colony and on field-derived specimens of A. gambiae.
The nature of the noncoding region from the 4.1-kb subclone was explored further. A BLASTN search of the A. gambiae genome using the 4.1-kb subclone sequence as a query identified a single 458-kb scaffold (AAAB01008977) mapped to chromosome 2, containing sequence 97–99% identical with both the noncoding region and ORFs of the mag-like element. Remarkably, a 187-bp fragment from the 5′ end of the noncoding region was directly repeated on the scaffold, with the repeats spaced by 4482 nucleotides that bore, at the amino acid level, high similarity to mag element proteins, suggesting the presence of a long terminal repeat (LTR) retrotransposon. This putative element has identical LTR sequences and is bracketed by 5-bp direct repeats (CTGTT) probably representing a duplication of the chromosomal target sequence produced during element insertion. Thus, the noncoding sequence in the 4.1-kb subclone of phage 3-7 appears to be an integral part of the retrotransposon, consisting of the 5′ LTR and a long putative leader region preceding ORF1. Presence of a highly similar copy of the element on chromosome 2 explains amplification of the noncoding region from both sexes, and the more abundant PCR product from males suggests that the mag-like element is present in multiple copies on the Y. However, stronger amplification in males could also result from mismatches between one of the primers and the target site on the autosomal copy of the element. The primer's 5′ end contains a putative target site for the retroelement, the 5-bp sequence CATAG, located directly upstream of the LTR, which differs from the duplicated target site on chromosome 2 in four of five positions. The 3′ end of the mag-like element from the 3-7 phage has not been sequenced entirely and it is not known if the CATAG sequence represents a genuine target site. If it does, the mag-like retroelement would have no sequence specificity for insertions, similar to the mag retrotransposon (Garel et al. 1994).
Marker development from end sequence of Y-linked BAC clone: The availability of the ND-TAM BAC genomic library containing DNA from both sexes of A. gambiae allowed us to screen for large DNA inserts derived from the Y chromosome using a male-specific mag-mdg1 PCR marker. This library, representing ∼14 genome equivalents and containing 30,720 clones with an average insert size of 133 kb (Hong et al. 2003), was screened by a three-step PCR strategy to identify individual BAC clones. Screening of the plate pools revealed presence of clones containing the mag-mdg1 marker on 79 of 80 plates. Partial screening of 3 selected plates led to identification of 5 Y-linked clones (104B1, 104C2, 106G16, 106K11, and 174B1). Sequencing of the clone ends and a few subclones, followed by sequence analysis using BLASTX, showed massive accumulation of transposable elements in these BAC clones: among 19 fragments, 9 were similar to mdg1 proteins, 3 to ninja from D. melanogaster, 3 to mtanga, 1 to T1-2 from A. gambiae, and 1 to mag. Two sequences, the SP6 ends of 104C2 and 174B1 clones, lacked similarity to any known transposable elements. Interestingly, however, both sequences were nearly identical in the 5′ region spanning ∼750 bp. Extending both sequences by primer walking revealed their complete divergence downstream from a short fragment highly similar to a middle repetitive sequence identified earlier in an A. gambiae autosomal telomere (Biessmann et al. 1998). Subsequent BLASTN searches of the A. gambiae genome confirmed the presence of repetitive DNA in the 5′ and other regions of both extended sequences.
A microsatellite (AT)16 was found in the 104C2 sequence downstream from the region shared with 174B1. This was the first microsatellite identified as linked to the A. gambiae Y, a potentially valuable finding, because high mutation rates within microsatellite regions make them powerful markers for population genetics studies. A primer pair (104C2SP6F2 and 104C2SP6R; Table 1) designed on the flanks of the repeat amplified a male-specific PCR product of the expected size (Figure 1). Occasionally a single fragment was amplified in females; however, its size was ∼1 kb larger in females than in males.
Direct sequencing of the 104C2SP6F2-R PCR products from individual PEST males resulted in initially unambiguous sequence, which deteriorated at the end of the microsatellite region. Multiple peaks observed downstream from simple repeats are often a consequence of Taq polymerase slippage on low complexity sequence; yet another explanation for this phenomenon is more likely here. BLASTN searches of the A. gambiae genome with the 104C2SP6 marker identified 12 scaffolds with microsatellite flanking sequences identical to those of the marker, but with the microsatellite regions containing from 13 to 65 dinucleotide repeats. Apparently, the 104C2SP6F2-R primers amplify several distinct Y-linked loci harboring microsatellites of different lengths. Interestingly, some of the size divergent loci seem to be closely linked in the genome. Screening of the BAC library with 104C2SP6F2-R primers identified clones containing more than one such locus. Using those BAC clones as template, the 104C2SP6F2-R primers produced either two fragments easily separated on a 2% agarose gel or a smear of PCR products ranging in size from 250 to 300 bp, rather than the expected single band. Similarly, amplification of the marker from some males derived from natural populations yielded two or more bands.
All but one scaffold identified by BLASTN using the 104C2SP6 marker sequence were unmapped and very short—undoubtedly representing fragments of the Y chromosome that could not be assembled into larger contigs. In the exceptional scaffold (AAAB01008960) mapped to chromosome 2, two sequence regions were nearly identical to the marker sequence. Both regions had target sites perfectly matching the primers, which should have yielded 270- and 301-bp PCR products. However, none of the tested female specimens yielded the expected products. This discrepancy between in silico analysis and laboratory experiments likely resulted from incorrect sequence assembly of this chromosome 2 scaffold within a genomic region containing repetitive sequences in common with those flanking the marker. We have further evidence for incorrect assembly of A. gambiae scaffolds (see below).
Development of Y markers through BAC DNA subtraction: Finding identical or highly similar sequences in the random Y-linked clones led us to the hypothesis that large portions of the Y chromosome in A. gambiae are composed of a few highly abundant transposable elements, among which may exist small islands of more complex sequences. On the basis of this hypothesis we designed a subtractive hybridization strategy to eliminate repetitive sequences and enrich for low-copy-number fragments. DNA of BAC clone 104B1 was used as a “driver” and pooled DNA of clones 104C2, 106G16, 106K11, and 174B1 was used as a “tester.” After random fragmentation, driver DNA was biotinylated and tester DNA was blunt ended and ligated to an adaptor. After liquid hybridization of tester with driver DNA and removal of homo- and heteroduplexes, the remaining molecules were PCR amplified, cloned, and sequenced. Seven clones containing putative nonrepetitive sequences were evaluated as potential Y chromosome PCR markers. Three of the clones contained sequences that largely overlapped the same genomic region and of those one sequence was selected for primer design. In total, of five candidates under investigation, two new markers, S23 and S291, were developed (Figure 1). The primer sequences for the markers are given in Table 1. Rarely, nonspecific products were amplified in females using the S291 marker primers.
Large-scale BAC library screening for Y-linked clones: The ends of the ND-TAM BAC library were sequenced as part of the A. gambiae genome sequencing effort (Holt et al. 2002). The availability of these sequences created an opportunity for ready access to nucleotide sequence of the Y that might allow further insight into its structure. We performed more extensive screening of the library using the mag-mdg1, S23, S291, and 104C2SP6 markers, which resulted in the identification of an additional 85 Y-linked BAC clones from 29 library plates. A list of all identified BAC clones is given in the appendix.
The emerging view of the A. gambiae Y as a sink for a small number of highly repetitive TEs was reinforced from analysis of the available 134 BAC end sequences. Collectively, they constituted >80 kb, assuming an average sequence read of 600 bp. BLAST searches revealed that all of this sequence is repetitive. Nearly 40% of the sequences were mdg1- or 412-like retrotransposon fragments. Other frequent retrotransposon sequences were mag-like (12%) and mtanga (10%). This composition may have resulted from a disproportionate accumulation of those elements compared to other repetitive sequences on the Y and/or from a bias arising during the BAC library construction. The repetitive character of BAC ends bearing no similarity to known TEs was confirmed by BLASTN searches against the A. gambiae genome. These sequences did not allow us to identify Y-linked scaffolds with complex, low-copy-number sequences. Scaffolds identified in silico using the BAC end sequences fell into two categories: very short unmapped scaffolds, likely originating from the Y, that were composed entirely of repetitive elements and scaffolds mapped to other chromosomes by independent evidence, apparently identified because highly repetitive query sequences are ubiquitous in the A. gambiae genome.
Screening of the BAC library suggested close linkage of different markers, because many clones were hit with more than one marker. To evaluate the extent of linkage, each clone confirmed to be Y linked with one marker was screened for the presence of the remaining available markers. Among all 90 identified clones, 27 clones were hit with all four markers, 34 were hit with three markers, 23 were hit with two markers, and only 6 were hit with one marker (see appendix). Many of the multiple-hit clones may contain overlapping sequences, with markers embedded in the shared genomic regions. Several clones found to share one identical end sequence seem to support this hypothesis, although it was not tested whether their DNA originated from the same region of the Y chromosome. This is apparently not the case with the 5 initially identified BAC clones, because analysis of their DNA subjected to restriction digestion showed restriction patterns lacking any common-sized bands. It is conceivable that most of the identified Y-linked BAC clones carry different regions of the Y but that they are identified with multiple markers because of the ubiquity of the marker sequences on that chromosome. Further study could elucidate the relative location of these markers on the Y.
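The marker co-occurrence counts reported above can be checked with a few lines of Python. The dictionary simply restates the four counts from the text; the derived summary values (share of multi-marker clones, mean markers per clone) are computed here for illustration and are not figures reported by the study.

```python
# Number of Y-linked BAC clones hit by exactly k of the four markers (from the text).
clones_by_marker_count = {4: 27, 3: 34, 2: 23, 1: 6}

total_clones = sum(clones_by_marker_count.values())
multi_marker_clones = sum(n for k, n in clones_by_marker_count.items() if k >= 2)
total_marker_hits = sum(k * n for k, n in clones_by_marker_count.items())

print(total_clones)                                  # should match the 90 clones reported
print(round(multi_marker_clones / total_clones, 3))  # share of clones with >= 2 markers
print(round(total_marker_hits / total_clones, 2))    # mean markers per clone
```

The categories sum to the 90 identified clones, and most clones carry two or more markers, which is the basis for inferring either close linkage or high marker copy number on the Y.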
Search for Y-linked scaffolds within the A. gambiae genome: The A. gambiae genome has been sequenced from genomic libraries constructed separately from female and male DNA (Holt et al. 2002). These whole-genome shotgun fragments were used in the assembly process, which resulted in 8987 scaffolds constituting the A. gambiae genome. Because there was approximately equal coverage from male and female mosquitoes in the final data set (Holt et al. 2002), we expected the contribution of fragments from male libraries to autosomal and X chromosome scaffolds to be ∼50 and ∼33%, respectively, and our investigation agreed with this expectation (∼51% autosome; ∼35% X). For scaffolds with undetermined chromosome linkage (“unmapped”), the male contribution was ∼51%, suggesting that Y chromosome sequences may not be well represented in the unmapped scaffold set because of cloning problems and/or due to the small size of the Y chromosome relative to other chromosomes.
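The expected male contributions follow from chromosome copy number in each sex. The Python sketch below works through that reasoning under the simplifying assumption that fragments are sampled in proportion to chromosome copy number and that male and female libraries contribute equal total coverage; the function name and parameters are hypothetical.

```python
from fractions import Fraction


def expected_male_fraction(copies_in_male, copies_in_female,
                           male_coverage=1, female_coverage=1):
    """Expected share of fragments from male libraries for one chromosome type.

    copies_in_male / copies_in_female: chromosome copies per diploid genome.
    male_coverage / female_coverage: relative sequencing coverage per sex
    (equal by default, as an assumption matching the text).
    """
    male = Fraction(copies_in_male) * male_coverage
    female = Fraction(copies_in_female) * female_coverage
    return male / (male + female)


print(expected_male_fraction(2, 2))  # autosomes: two copies in each sex
print(expected_male_fraction(1, 2))  # X: one copy in males, two in females
print(expected_male_fraction(1, 0))  # Y: present only in males
```

Under these assumptions an autosomal scaffold is expected to draw half of its fragments from male libraries, an X-linked scaffold one-third, and a Y-linked scaffold all of them, which is why scaffolds built exclusively from male fragments are candidates for Y linkage.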
We identified 975 scaffolds containing fragments originating exclusively from male libraries among all 8845 unmapped scaffolds (see supplementary data at http://www.genetics.org/supplemental/). These scaffolds constituted a male-derived sequence database created using STANDALONE BLAST. In an attempt to detect scaffolds with coding sequences, a strategy implemented by Carvalho et al. (2000, 2001) was used. Their approach is based on the assumption that heterochromatic genes are prone to have enormous-sized introns composed of repetitive sequences that cannot be assembled by a whole-genome shotgun approach. Gene fragments with unique sequences should appear as small isolated scaffolds that will remain undetected by normal annotation procedures; however, coding sequences of such genes could be recovered if a suitable query sequence (cDNA or protein) is used. In a BLAST report such a query is expected to create a staggered pattern of hits because different scaffolds match nonoverlapping regions of a query. Similar to Carvalho et al. (2001), we used all the protein sequences from the National Center for Biotechnology Information nr database as queries in a TBLASTN search against the male scaffolds and looked for proteins with hits in two or more scaffolds arranged in a staggered pattern. To avoid false positives during the search we used a stringent cutoff value of e = 0.0001. Nineteen protein queries produced the expected staggered pattern; however, none of the corresponding scaffolds proved male specific when tested for Y linkage. There was a high rate of amplification failure. In 17 of 19 cases PCR yielded nonspecific products or failed. The combined evidence of negative PCR results and high sequence similarity to genes from the bacterial genus Aeromonas strongly suggested bacterial rather than mosquito genomic DNA origin of at least 12 of those scaffolds. The source of bacterial sequences among the Anopheles scaffolds remains unknown. 
The genus Aeromonas encompasses aquatic species of bacteria, none of them reported as symbiotic to Anopheles. In two cases PCR yielded products of expected size present uniformly in both sexes. This TBLASTN strategy also identified a number of single-hit proteins, i.e., proteins with significant similarity to single unmapped scaffolds. Among those, our attention was drawn to scaffolds with sequence similarity to sex-determining or male fertility-related proteins. In total, 27 such scaffolds were tested and 1 scaffold (AAAB01008227), identified by using D. melanogaster Bkm-like sex-determining region hypothetical protein CS314 (GenBank accession no. B21124), was found to be Y specific. However, the analysis of AAAB01008227 revealed that its sequence is apparently not homologous to Bkm-like protein because their similarity is limited to a short low-complexity (microsatellite) region. The same appears to be true with other scaffolds hit by sex-determining or male function genes—in each case the areas of similarity concentrated in low-complexity sequences. The remaining single-hit scaffolds have not been tested.
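The staggered-hit criterion used in the TBLASTN screen can be sketched as follows. This is a minimal illustration, not the procedure actually run: the hit tuple layout, the function names, and the zero-overlap rule are assumptions, while the E-value cutoff of 0.0001 is taken from the text.

```python
from itertools import combinations

EVALUE_CUTOFF = 1e-4  # stringency stated in the text


def significant(hits, cutoff=EVALUE_CUTOFF):
    """Keep hits at or below the E-value cutoff.

    Each hit is (scaffold_id, query_start, query_end, evalue), with 1-based
    inclusive query coordinates (a hypothetical layout).
    """
    return [h for h in hits if h[3] <= cutoff]


def is_staggered(hits, max_overlap=0):
    """True if two hits on different scaffolds cover non-overlapping
    regions of the same query protein."""
    for (s1, a1, b1, _), (s2, a2, b2, _) in combinations(significant(hits), 2):
        if s1 == s2:
            continue
        overlap = min(b1, b2) - max(a1, a2) + 1  # <= 0 means disjoint
        if overlap <= max_overlap:
            return True
    return False


# Toy hits for one protein query (identifiers and values are made up).
example = [("scaf_1", 1, 120, 1e-20), ("scaf_2", 130, 300, 1e-15)]
print(is_staggered(example))
```

A query whose hits all fall on one scaffold, or whose hits on different scaffolds cover the same region of the protein, is not flagged; the latter pattern is more typical of a repeated domain than of a gene split across small scaffolds.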
Characterization of the AAAB01008227 scaffold: The AAAB01008227 scaffold, assembled from 441 fragments, has a total predicted length of 48,385 bp encompassing 43,438 bp of determined sequence and two sequence gaps. A primer pair 128125A (Table 1) designed in the middle of the scaffold amplified a PCR product exclusively in males. To evaluate the distribution of male-specific sequences within the scaffold, to test for the scaffold's integrity and develop new potential Y-specific markers, additional primer pairs (Table 1) were designed along the scaffold (Figure 2). At least one primer from each pair was located outside of repetitive regions according to BLASTN searches against the A. gambiae genome. Six of those primer pairs yielded male-specific products in all specimens tested, demonstrating Y linkage of the scaffold sequence and Y specificity of the markers (Figure 3). However, three other primer pairs, including primers flanking two microsatellite regions, yielded no PCR products. Neither adjusting amplification conditions nor redesigning primers resolved the problem, suggesting incorrect assembly of those genomic regions. Indeed, examination of the fragments (single sequence reads corresponding to a clone end sequence) used for the scaffold's assembly (retrieved from A. gambiae Trace Archive at http://www.ncbi.nlm.nih.gov/blast/mmtrace.html) revealed that none of the fragments spanned the microsatellite regions in their entirety; i.e., the fragments invariably ended within the microsatellites. Further support for the notion of incorrect assembly of the scaffold comes from the mate pair information obtained from Trace Archive (mate pairs are sequences originating from the opposite ends of a single clone). We examined five regions of the scaffold that contained Y marker sequences and compared orientation and distance of mate pairs predicted from the size of the source clones vs. the predictions based on the scaffold assembly. 
Only in one case did the mate pair position on the scaffold match the expectation based on the source clone. In four other cases, mate pairs were oriented on the scaffold in the same direction and in two such instances were separated by distances twice the expected length. Local misassembly problems regarding the A. gambiae genome sequences were reported earlier by Holt et al. (2002).
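The mate pair comparison described above can be sketched as follows. This is an illustrative check under stated assumptions, not the analysis performed in the study: the `Read` record, the function name `mate_pair_problems`, and the 25% distance tolerance are hypothetical. A consistent pair is assumed to consist of two reads on opposite strands that face each other and span roughly the size of the source clone.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    """Placement of one clone-end read on the scaffold (hypothetical record)."""
    start: int   # leftmost scaffold coordinate of the read
    end: int     # rightmost scaffold coordinate of the read
    strand: str  # '+' or '-'


def mate_pair_problems(
    a: Read, b: Read, expected_insert: int, tolerance: float = 0.25
) -> List[str]:
    """List the inconsistencies between a mate pair's placement on the
    scaffold and the expectation from its source clone (empty if none)."""
    problems = []
    left, right = sorted((a, b), key=lambda r: r.start)
    if left.strand == right.strand:
        # Mates come from opposite ends of one clone, so they should lie
        # on opposite strands.
        problems.append("same orientation")
    elif (left.strand, right.strand) != ("+", "-"):
        # Opposite strands, but the reads point away from each other.
        problems.append("facing outward")
    # Distance covered by the pair on the scaffold, end to end.
    span = max(a.end, b.end) - min(a.start, b.start) + 1
    if abs(span - expected_insert) > tolerance * expected_insert:
        problems.append("unexpected distance")
    return problems
```

Under these assumptions, a pair placed on the same strand and spanning twice the clone size would be reported with both problems, which corresponds to the pattern observed in two of the examined regions.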
Although the AAAB01008227 scaffold is evidently chimeric, parts of the scaffold have been experimentally demonstrated to be linked to the Y chromosome and, as such, were analyzed further. A BLASTX search against the GenBank nr database revealed the presence of sequences highly similar to fragments of putative retrotransposons from other parts of the A. gambiae genome (Table 2). BLASTN searches against A. gambiae genome revealed other highly repetitive sequences scattered along the scaffold, with homologs on autosomes and the X chromosome; three fragments, >500 bp long, are repeated on the scaffold itself, two directly and one in inverted orientation (data not shown). The evidence suggests that the scaffold also contains low-copy-number or, possibly, single-copy sequences interspersed among the highly repetitive ones.
The scaffold sequences contain 28 ORFs detected using the Artemis annotation tool, release 4 (Rutherford et al. 2000). All of them are short (<440 bp) and either are similar to retrotransposon genes or bear no similarity to any known coding sequence. Selected ORFs from the latter group were tested for expression by RT-PCR using RNA template extracted from abdomens of late larvae/pupae. Three RT-PCR products from ORF24 were present in males, one of which appeared to be male specific (Figure 4). Sequencing of each product from males and females revealed that only the male-specific RT-PCR product was identical to ORF24 on the scaffold (Figure 4). Furthermore, only the male-specific product had an uninterrupted reading frame (data not shown). ORF24 is neither flanked by nor positioned in the vicinity of any transposable element on the scaffold, suggesting that it does not constitute a fragment of a mobile element.
In an attempt to isolate BAC clones containing portions of the AAAB01008227 sequence, all seven markers designed from the scaffold were used to screen the ND-TAM BAC library. Surprisingly, contrary to screening with other markers, no clone was hit. Lack of hits in the BAC library reinforces the notion that Y markers derived from this scaffold represent sequences that are present in low copy number on the Y. However, the unexpected absence of BAC clones containing any of these markers may also be a consequence of biased library construction or instability of clones containing these sequences (Song et al. 2001).
Screening field-derived specimens using Y chromosome markers: The developed markers were evaluated for their utility in population genetics studies. To maximize the chance of finding variation we sampled natural populations from West Africa (Senegal and Burkina Faso) and from East Africa (Kenya), locales separated by up to 6000 kilometers. Following amplification and purification, PCR products were directly sequenced from at least 10 specimens from both West and East Africa. All sequences were identical at mag-mdg1, S23, and S291 loci. Some variation between individuals existed in the remaining markers, although without clear differentiation among populations. Furthermore, in those cases intraindividual variation confounding sequencing results was also observed. The two markers containing microsatellites with di- and pentanucleotide repeats (104C2SP6 and 128125D, respectively) were amplified from more than one locus per individual, each with a different number of repeats, resulting in ambiguous sequences within and downstream from the microsatellite region.
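A comparison of directly sequenced PCR products across specimens, as described above, amounts to looking for alignment columns at which specimens differ. The sketch below is a minimal illustration, not the analysis used in the study: the function name `variable_sites` is hypothetical, the sequences are assumed to be already aligned and of equal length, and ambiguous calls (`N`) and gaps (`-`) are simply ignored, which is a simplifying assumption.

```python
from typing import Dict, List


def variable_sites(aligned: Dict[str, str]) -> List[int]:
    """Return 0-based alignment columns at which the specimens differ.

    `aligned` maps a specimen label to its aligned marker sequence.
    """
    seqs = list(aligned.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i, column in enumerate(zip(*seqs)):
        # Ignore ambiguous base calls and alignment gaps (assumption).
        bases = {b.upper() for b in column} - {"N", "-"}
        if len(bases) > 1:
            sites.append(i)
    return sites
```

An empty list corresponds to a locus at which all sequenced specimens are identical. Multicopy markers that yield ambiguous reads, such as the microsatellite-containing ones, would need a different treatment.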
Prior to this study sequence information on the Y chromosome of A. gambiae was limited to a short fragment harboring a mtanga transposable element (Rohr et al. 2002). Our analysis provides extensive new data on both primary sequence and genomic organization of the Y. Here we report characterization of Y chromosome-linked DNA fragments nearly 150 kb in total length and development of 11 Y chromosome-specific PCR markers using molecular biology and bioinformatics methods. Ninety Y-derived BAC clones, each with an average length of 133 kb (Hong et al. 2003), were identified using the developed markers. In addition, a 48-kb Y-linked scaffold was found to harbor a sequence specifically expressed in males.
Complex, low-copy or single-copy sequences are the most illuminating markers in studies of both population genetics and sex chromosome evolution. Initially, in our search of Y chromosome sequences we implemented a differential hybridization strategy, in which labeled total genomic DNA from males and females was used in turn to screen for clones preferentially hybridizing to the male probe in a dual-sex genomic library. The hybridization kinetics of a total genomic probe dictates that only highly or moderately repeated DNA should hybridize. Thus, this method targets male-specific repetitive DNA or sequences that are highly amplified on the Y chromosome but may be present elsewhere in the genome. The premise of this approach was that single-copy- or low-copy-number sequences could be found among repetitive fragments. However, the A. gambiae Y appears to be highly degenerated, making application of this approach to finding more complex sequences laborious and impractical. Neither in the Y-linked phage inserts isolated by differential hybridization nor among sequences derived from Y-linked BAC clones subsequently identified have such complex sequences been detected. All fragments were repetitive and had very high similarity to sequences on other chromosomes. These results suggest that more complex sequences may constitute a few small, isolated islands scattered in an ocean of repetitive DNA, in agreement with the entirely heterochromatic state of the Y chromosome in A. gambiae. Most of the identified sequences belong to transposable elements, consistent with the Y chromosome serving as a trap for retrotransposons and with the expected tendency of the Y chromosome to degenerate during evolution (Steinemann and Steinemann 1992; Jukanovic et al. 1998; Charlesworth and Charlesworth 2000; Bachtrog 2003a). Moreover, massive accumulation of repetitive sequences suggests a long history of recombinational isolation of the Y chromosome in A. gambiae.
Ubiquitous repetitive sequences found in nearly all characterized fragments made development of population genetic markers very difficult. Although the BLASTN searches suggested otherwise, BAC library screening and direct sequencing of PCR products amplified from individual genomic templates showed that none of the markers appear to be present as single-copy sequences within the genome. Even primers flanking microsatellite-containing markers, found to be the most variable in this study, amplified multiple products, each with a different number of microsatellite repeats. Although their sequence data are difficult to analyze, the potential to utilize such multicopy PCR products in fragment analysis by treating them as compound haplotypes (Ewis et al. 2002) remains to be explored.
The highly repetitive nature of a few ubiquitous transposable elements and possibly of other sequences present on the A. gambiae Y chromosome prevented the assembly of individual sequence reads generated during shotgun genome sequencing into larger scaffolds. Similar problems were encountered in the D. melanogaster genome assembly after whole-genome shotgun sequencing: only a single 15-kb scaffold containing a portion of the kl-5 gene, previously found to be Y specific (Gepner and Hays 1993), could be assigned to the Y chromosome. Only after implementation of a strategy that targeted Y chromosome genes were a number of other Y scaffolds recovered (Carvalho et al. 2000, 2001). All of them contained usually short fragments of coding sequences corresponding to exons of single-copy genes. Nine such genes have been identified thus far (Gepner and Hays 1993; Carvalho et al. 2000, 2001). However, information regarding Drosophila Y chromosome genes provides no clue to the composition of the Y in A. gambiae. Our study suggests that gene content on the Y is not conserved between these dipterans. BLAST searches of the A. gambiae genome show that the most closely related homologs of all nine fruit fly genes are autosomal in Anopheles. Furthermore, PCR experiments suggested that none of the unmapped scaffolds, harboring sequences more distantly related to these genes, are Y linked (data not shown). Lack of gene content conservation on the Y is not surprising considering that ∼260 million years has elapsed since the last common ancestor of both organisms (Gaunt and Miles 2002).
The degree of Y degeneration and the number of functional genes on the Y in A. gambiae remain unknown. There is evidence that at least a single locus corresponding to a male-sex-determining factor is still present there (Clements 1992). However, coding sequences are most likely not limited to this locus. Our finding of an expressed sequence without apparent characteristics of a transposable element gene among the Y fragments is an important first step toward answering questions regarding A. gambiae Y chromosome genes. The RT-PCR results show that expressed sequences similar to ORF24 are present on an autosome or the X. However, unlike ORF24 these transcripts have stop codons in all six reading frames, suggesting that they represent expressed pseudogenes. It is conceivable that one of these loci is parental to ORF24. Transfer of the gene onto the Y may have resulted in retention of its functionality, because it may confer male function, whereas copies from other chromosomes degenerated. Further study aiming at the isolation of a full-length cDNA containing ORF24 and of corresponding sequences from an autosome or the X will allow elucidation of the origin and function of this Y chromosome sequence.
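The distinction drawn above between an uninterrupted reading frame and a sequence with stop codons in all six reading frames can be checked with a short sketch. It is an illustration only, not the tool used in the study (the ORFs were detected with Artemis): the function names are hypothetical, the standard stop codons TAA, TAG, and TGA are assumed, and only complete codons are examined.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq: str) -> List[int]:
    """Return the reading frames (+1..+3 forward, -1..-3 reverse) that
    contain no in-frame stop codon."""
    seq = seq.upper()
    frames = []
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            # Walk the strand codon by codon, skipping a trailing partial codon.
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            if not any(codon in STOP_CODONS for codon in codons):
                frames.append(sign * (offset + 1))
    return frames
```

An empty result means that every frame is interrupted, as reported for the transcripts from an autosome or the X; a sequence such as ORF24 would be expected to return at least one open frame.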
Fragments of other Y chromosome genes are likely present among the unmapped scaffolds, as was the case with the Drosophila Y genes. We limited our search to a small subset of the unmapped scaffolds, only those derived from male-only libraries, which may have resulted in a failure to recover more coding sequences. It remains to be seen if our more exhaustive ongoing search, encompassing all unmapped scaffolds, including those assembled from a mixture of male and female sequence fragments, will be more successful.
Our study has yielded an overview of structure and organization of the Y in A. gambiae, but clearly shows that assembly of the whole-genome shotgun fragments alone contributes little, if anything, to the elucidation of the Y chromosome sequence. Without a significant amount of additional evidence generated with approaches that take into account the peculiarity of Y chromosome organization, shotgun sequence data from the Y are unrecoverable or remain in the form of small isolated scaffolds that cannot be assembled into larger sequences by any available method (Carvalho et al. 2003). Our results provide a solid starting point for further research on the Y chromosome in A. gambiae. The identified BAC clones presumably represent a large portion of the Y available for analysis. Although composed predominantly of repetitive sequences, these BAC clones also likely harbor more complex sequences that may be informative in studies of A. gambiae population genetics and Y chromosome evolution. However, in the search for Y chromosome-specific genes, the method of Carvalho et al. (2000) seems to hold the greatest promise.
We thank Patricia Romans for a critical reading of the manuscript. Mathew Chrystal provided excellent computer support. This work was supported by grant AI44003 from the National Institutes of Health.
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. CG865069–CG865110.
Communicating editor: M. A. F. Noor
- Received August 4, 2003.
- Accepted December 3, 2003.
- Copyright © 2004 by the Genetics Society of America