Abstract
Substantial intrastrain variation at the nucleotide level complicates molecular and genetic studies in zebrafish, such as the use of CRISPRs or morpholinos to inactivate genes. In the absence of robust inbred zebrafish lines, we generated NHGRI-1, a healthy and fecund strain derived from founder parents we sequenced to a depth of ∼50×. Within this strain, we have identified the majority of the genome that matches the reference sequence and documented most of the variants. This strain has utility for many reasons, but in particular it will be useful for any researcher who needs to know the exact sequence (with all variants) of a particular genomic region or who wants to be able to robustly map sequences back to a genome with all possible variants defined.
The zebrafish (Danio rerio) is a powerful tool for understanding vertebrate biology. The usefulness of this model organism is bolstered by the availability of a “finished” sequenced and annotated genome (Howe et al. 2013; Flicek et al. 2014). As a natural extension of this resource, there are several high-throughput efforts to systematically mutagenize all zebrafish protein-coding genes (Moens et al. 2008; Kettleborough et al. 2013; Varshney et al. 2013a,b).
In addition to such projects, the combination of a sequenced genome and developments in targeted nuclease technology means that the zebrafish community is now able to rapidly take advantage of custom genome-editing technologies (Doyon et al. 2008; Bedell et al. 2012; Hruscha et al. 2013; Hwang et al. 2013; Jao et al. 2013). CRISPRs in particular provide an efficient, easy, and inexpensive means of manipulating and interrogating the genome (Jinek et al. 2012; Cong et al. 2013; Mali et al. 2013). However, because there are very few hardy inbred zebrafish lines (overinbreeding tends to result in unhealthy stocks) and polymorphism rates are close to 1 in every 100 bases, variants frequently have the potential to interfere with target site design (Stickney et al. 2002; Guryev et al. 2006; Bowen et al. 2012) or with regions of homology used for homologous recombination. In general, genome targeting is heavily dependent on an exact match to the primary sequence. Depending on the sequence, even a single mismatch can severely reduce the cutting efficiency (Hsu et al. 2013). In addition, other techniques such as RNA-Seq or ChIP-Seq are substantially less accurate without having fully characterized variants in the background strain. Therefore, it is preferable to carry out studies in a zebrafish strain in which the regions of invariant sequence are known with a high degree of confidence and all variants are categorized to allow for robust genomic mapping.
With these concerns in mind, we derived the zebrafish line NHGRI-1. NHGRI-1 fish were derived from an original strain known as “TAB-5” made from a hybrid cross between fish from two of the most commonly used zebrafish lines: Tübingen and AB (Streisinger et al. 1981; Haffter et al. 1996). The F1 fish from this cross were inbred and screened to be clear of any mutations affecting the first 5 days of development. Since the initial isolation of TAB-5 in 1997, we have carried the strain in the laboratory to the present day without introducing outside genetic diversity. We selected several mating pairs from the TAB-5 pool, and the most robust mating pair was chosen as the founding pair for NHGRI-1. We are now on the third generation of NHGRI-1, and their fecundity and overall health remain strong.
We carried out high-throughput sequencing to a depth of ∼50× for each parent. The male and female sequencing libraries had a combined 1,289,142,362 nonduplicate reads, with a median coverage of 52× and 47×, respectively. By doing so, we identified >10 million previously unreported single-nucleotide variants (SNVs). The raw sequence data have been deposited in the NCBI Sequence Read Archive [BioProject ID: 246102]. In addition, we have identified nearly all the regions of the genome that are invariant relative to the Zv9 reference sequence. We generated a browser extensible data (BED) file of invariant nucleotides, which indicates the regions in which there were both a lack of alternative alleles and sufficient read depth and genotype confidence to call bases as invariant (Figure 1). Seventy-one percent of the genome fits these criteria. The invariant file is hosted on the NHGRI-1 website at http://research.nhgri.nih.gov/manuscripts/Burgess/zebrafish/download.shtml, a University of California, Santa Cruz (UCSC) data hub called “ZebrafishGenomics” has been established at http://genome.ucsc.edu/cgi-bin/hgHubConnect, and data have been transferred to http://zfin.org/. Information on the variants themselves can be downloaded from dbSNP (submitter handle, NHGRI_DGS; submitter batch ID, NHGRI-1_founders). The invariant regions are easily identified by using the BED file, simplifying CRISPR target design, amplicon primer design, the selection of regions for homologous recombination, Morpholino design, or essentially any experiment that requires high confidence in the exact sequence of the genomic region of interest.
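As a rough illustration of how the invariant BED file might be used when choosing a target site, the sketch below checks whether a candidate interval lies entirely within invariant sequence. It assumes standard BED conventions (0-based, half-open coordinates); the function names, chromosome labels, and coordinates are hypothetical and are not taken from the NHGRI-1 files.

```python
from bisect import bisect_right


def load_invariant(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}, sorted and merged.

    BED intervals are 0-based and half-open. Header and blank lines are skipped.
    """
    raw = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        raw.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlapping or directly adjacent: extend
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def is_invariant(regions, chrom, start, end):
    """Return True if [start, end) lies inside a single merged invariant interval."""
    intervals = regions.get(chrom, [])
    starts = [s for s, _ in intervals]
    i = bisect_right(starts, start) - 1  # last interval starting at or before `start`
    return i >= 0 and intervals[i][1] >= end


# Hypothetical example: a 23-bp candidate site (20-nt protospacer plus PAM).
example_bed = ["track name=invariant", "chr1\t100\t200", "chr1\t200\t250", "chr1\t300\t400"]
example_regions = load_invariant(example_bed)
print(is_invariant(example_regions, "chr1", 190, 213))
```

Merging adjacent intervals first means a candidate that spans two touching BED records is still reported as invariant; a candidate that crosses any gap between records is rejected.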
Figure 1. Screenshot of the UCSC browser custom tracks for NHGRI-1. Twenty mating pairs from 6-month-old TAB-5 fish were screened to select a robust founding pair with good clutch size and healthy progeny; the most fecund pair was renamed NHGRI-1. Fin clips from the NHGRI-1 male and female were prepared as separate genomic DNA libraries and sequenced on the Illumina HiSeq 2000 by the National Institutes of Health (NIH) Intramural Sequencing Center. Both libraries were subjected to paired-end sequencing with 101-bp reads. We aligned the sequence to the zebrafish genome [Zv9 (Howe et al. 2013)] with Novoalign version 2.08.02 (http://www.novocraft.com/). We removed PCR duplicates via SAMtools version 0.1.18 (Li et al. 2009). We used bam2mpg to identify the most probable genotype (MPG) for nucleotides in both parents (Teer et al. 2010). Bases that did not have an MPG score of at least 10, coverage of at least 20×, and a ratio of MPG score to coverage >0.5 were discarded. Regions of low sequence complexity were not specifically excluded from the analysis unless they failed to meet these criteria. The bases that matched the reference and met the above criteria in both fish were used to build the BED track of invariant nucleotides. The top track indicates the bases that were invariant in both fish sequenced. The white regions indicate either variation in at least one fish or insufficient read depth to confidently call the region as invariant. The second track indicates two nonsense mutations detected in this region. The letter indicates the alternative allele, and the color indicates whether the mutation was homozygous (red) or heterozygous (blue) in the NHGRI-1 population. Both tracks are available on the ZebrafishGenomics track hub, which is hosted at http://research.nhgri.nih.gov/manuscripts/Burgess/zebrafish/downloads/NHGRI-1/hub.txt and accessible through http://genome.ucsc.edu/cgi-bin/hgHubConnect.
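The three filtering thresholds described in the legend can be written as a simple predicate. This is a minimal sketch and not the bam2mpg interface: the Call record, its field names, and the two-letter genotype encoding are assumptions made for illustration only.

```python
from typing import NamedTuple


class Call(NamedTuple):
    genotype: str   # hypothetical encoding, e.g. "AA" (homozygous) or "AG" (heterozygous)
    ref: str        # reference base at this position
    score: float    # MPG score
    coverage: int   # read depth


MIN_SCORE = 10
MIN_COVERAGE = 20
MIN_RATIO = 0.5


def passes_quality(call):
    """Apply the three thresholds: score >= 10, coverage >= 20, score/coverage > 0.5."""
    return (
        call.score >= MIN_SCORE
        and call.coverage >= MIN_COVERAGE
        and call.score / call.coverage > MIN_RATIO
    )


def invariant_sites(male, female):
    """Return positions where both fish have a passing homozygous-reference call.

    `male` and `female` map (chrom, pos) -> Call.
    """
    def is_confident_ref(call):
        return passes_quality(call) and set(call.genotype) == {call.ref}

    shared = male.keys() & female.keys()
    return sorted(p for p in shared if is_confident_ref(male[p]) and is_confident_ref(female[p]))
```

A position is kept only if it clears all three thresholds and matches the reference in both parents, which mirrors the "in both fish" requirement in the legend.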
We detected >17 million total variants upon merging the variant calls from the two libraries. Of that total, 236,301 were in exons of Ensembl transcripts (Table 1). Variants were called as homozygous only if they were homozygous in both fish; such variants will stably retain the variant allele in future generations.
To underscore the issues related to background variation in the commonly used zebrafish lines, we detected 669 variants that formed premature stop codons in at least one transcript, 105 of which were homozygous mutant in both sexes (Table 2). We have generated a BED track of these variants, indicating the location, the alternative allele, and the homo/heterozygosity. This track is available on the ZebrafishGenomics hub and the NHGRI-1 website (Figure 1). A list of affected genes can also be found in supporting information, Table S1.
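For readers who want to screen their own variants of interest, the sketch below shows one simple way to test whether a single-base substitution converts a sense codon into a stop codon. It assumes the standard nuclear genetic code, a coding sequence supplied in frame on the coding strand, and 0-based offsets; the function name and example sequence are hypothetical, and this is not the annotation procedure used to produce Table 2.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code


def creates_premature_stop(cds, offset, alt):
    """Return True if replacing cds[offset] with `alt` turns a sense codon into a stop.

    `cds` must start at the first base of a codon; `offset` is 0-based.
    """
    frame = offset % 3
    codon_start = offset - frame
    codon = cds[codon_start:codon_start + 3].upper()
    if len(codon) < 3:
        raise ValueError("offset falls in an incomplete trailing codon")
    mutated = codon[:frame] + alt.upper() + codon[frame + 1:]
    return codon not in STOP_CODONS and mutated in STOP_CODONS


# Hypothetical coding sequence: ATG TGG CAA TAA
example_cds = "ATGTGGCAATAA"
print(creates_premature_stop(example_cds, 5, "A"))  # TGG -> TGA
```

Because a variant can fall in several overlapping transcripts, a check like this would need to be run once per transcript, which is consistent with counting variants that form a stop "in at least one transcript".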
We detected 3160 deletion or insertion variants (DIVs) in exons. DIVs of a length divisible by three were highly represented and comprise ∼60% of the DIVs (Figure 2A). Presumably, this is because the resultant nonframeshift mutations would be less likely to be selected against than those that produce frameshifts. A similar profile has been reported in human indels (Chen et al. 2007). This trend is not present in the genome-wide set of 2,210,080 NHGRI-1 DIVs (Figure 2B).
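The in-frame comparison can be reproduced from any list of reference and alternative alleles. The following is a minimal sketch under the assumption that each DIV is given as a (ref, alt) pair of allele strings; the helper names and the toy alleles are illustrative and do not come from the NHGRI-1 call set.

```python
def div_length(ref, alt):
    """Signed length change: negative for deletions, positive for insertions."""
    return len(alt) - len(ref)


def in_frame_fraction(variants):
    """Fraction of DIVs whose length is a multiple of three (no frameshift)."""
    lengths = [div_length(ref, alt) for ref, alt in variants]
    lengths = [n for n in lengths if n != 0]  # ignore substitutions of equal length
    if not lengths:
        raise ValueError("no insertions or deletions supplied")
    return sum(1 for n in lengths if abs(n) % 3 == 0) / len(lengths)


# Toy example: length changes of -3, +3, -1, +2, and -6.
toy = [("ACGT", "A"), ("A", "ACGT"), ("AC", "A"), ("A", "ACG"), ("ACGTACG", "A")]
print(in_frame_fraction(toy))
```

Applying the same function separately to exonic and genome-wide DIVs would give the two proportions being contrasted in the text.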
Figure 2. Deletion and insertion variant length distribution within exons. (A) The 3160 DIVs in exons. (B) The 2,210,080 DIVs detected genome-wide. Red bars indicate the number of deletions of a given length; blue bars represent insertions.
We compared the SNVs identified in NHGRI-1 with dbSNP (Build ID: 139) and a publicly available data set obtained from low-coverage sequencing of multiple zebrafish lines (Sherry et al. 2001; Bowen et al. 2012). For simplicity, we compared only biallelic SNVs for which the reference sequence is known (i.e., no “N”s). The majority of NHGRI-1 SNVs had not been previously reported in either data set (Figure 3). We find that the rate of SNVs per sequenced base in NHGRI-1 is 0.01, or ∼12.5–20× higher than the rate in humans (Kidd et al. 2008). It is important to note that, while the 0.01 number is relevant for NHGRI-1, the regions of homozygosity created by inbreeding mean it likely underestimates the SNV load in zebrafish as a whole.
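The comparison logic (restrict to biallelic SNVs with an unambiguous reference base, then intersect with previously reported sets) can be sketched as follows. The tuple layout, function names, and toy records are assumptions for illustration; they do not reflect the actual file formats of dbSNP or the Bowen et al. (2012) download.

```python
UNAMBIGUOUS = set("ACGT")


def comparable_snvs(records):
    """Keep biallelic SNVs whose reference and alternative bases are A, C, G, or T.

    Each record is (chrom, pos, ref, alts), where `alts` is a tuple of alternative alleles.
    Returns a set of (chrom, pos, ref, alt) keys.
    """
    return {
        (chrom, pos, ref, alts[0])
        for chrom, pos, ref, alts in records
        if len(alts) == 1 and ref in UNAMBIGUOUS and alts[0] in UNAMBIGUOUS
    }


def overlap_summary(ours, *published):
    """Count SNVs in `ours` that are novel versus already present in any published set."""
    known = set().union(*published)
    return {"novel": len(ours - known), "shared": len(ours & known)}
```

Keying on chromosome, position, reference base, and alternative base means that two data sets reporting different alternative alleles at the same position are not counted as overlapping; a position-only key would be a looser alternative.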
Figure 3. SNV overlap with publicly available data sets. This comparison incorporates only SNVs that were biallelic and for which the reference base was an unambiguous A, C, G, or T. The Bowen et al. (2012) SNVs were downloaded from http://fishbonelab.org/harris/Resources_files/parental_variants.tar; both data sets were downloaded on March 12th, 2014.
We also compared the mutational profile of NHGRI-1 to that reported for a zebrafish captured from the wild and sequenced at 39× coverage (Patowary et al. 2013). Different cutoffs had been applied for variant calling in said study, such as a minimum of 32 reads to call an SNV and 5 reads to call a DIV, but the ratios of variant types can be compared. The differences are statistically significant, but small. Among the SNVs in the wild zebrafish, 22.3% were reported as being homozygous, compared to 17.8% in NHGRI-1 (Fisher’s exact test, P < 2.2 × 10⁻¹⁶). Deletions are more prevalent than insertions in both studies, with the wild zebrafish reported as having 53.9% deletions, compared to 51.6% in NHGRI-1 (P < 2.2 × 10⁻¹⁶).
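To make the form of this comparison concrete, the sketch below implements a two-sided Fisher's exact test on a 2 × 2 table (for example, homozygous versus heterozygous SNV counts in two fish). It is a plain-Python illustration suited to small counts only; the underlying variant counts are not given in the text, so the numbers used here are hypothetical, and at genome scale an established statistics package would be the practical choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    tolerance = 1 + 1e-7  # guard against floating-point ties
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * tolerance)
    return min(1.0, p_value)


# Hypothetical counts: 223 of 1000 SNVs homozygous in one fish, 178 of 1000 in another.
print(fisher_exact_two_sided(223, 777, 178, 822))
```

With millions of variants per sample, even a difference of a few percentage points yields a vanishingly small P value, which is why the text stresses that the differences are significant but small.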
This fish line will have utility in terms of automated design for targeted nucleases, as well as for studies such as ChIP-Seq or RNA-Seq where SNVs or DIVs might reduce the accuracy of mapping the raw sequence data. In addition, techniques such as homologous recombination are very sensitive to variants (te Riele et al. 1992), and NHGRI-1 will allow researchers to target genomic regions that do not contain any variant nucleotides. Thus, NHGRI-1 will prove useful in a variety of circumstances where absolute knowledge of the possible sequence variation is needed. The line will be distributed by the Zebrafish International Resource Center (http://zebrafish.org) and the European Zebrafish Resource Center (http://www.ezrc.kit.edu).
Acknowledgments
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.
Footnotes
Available freely online through the author-supported open access option.
Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.166769/-/DC1.
Communicating editor: D. Parichy
- Received May 30, 2014.
- Accepted June 28, 2014.
- Copyright © 2014 by the Genetics Society of America