Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against that derived from the reference human genome, and identified structural variations that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential of disrupting gene function or regulation.
WHOLE-GENOME short-read sequencing is now routine and affordable. However, three challenges remain in genome analysis: genome sequence assembly, structural variation detection, and separation of the two parental genomes. In addition to the fact that humans are diploid, with cells harboring two genomes from the parents, the presence of numerous repetitive elements that are longer than the usual sequencing library insert size makes it close to impossible to assemble genome sequences with short-read sequencing alone (El-Metwally et al. 2013). Consequently, almost all whole-genome sequencing projects map the sequencing reads onto the human reference genome sequence without performing whole-genome assemblies (Ley et al. 2008). When whole-genome assembly is attempted, it is done by the laborious and expensive approach of generating paired-end sequencing of cloned genomic DNA fragments to provide scaffolds for sequence assembly (Siegel et al. 2000). Alignment of short sequencing reads to the human reference genome sequence reveals single-nucleotide variation and small indels in the individuals sequenced, but larger structural variants and repetitive regions in the genome are more difficult to detect. As structural variation can disrupt genes or regulatory elements, whole-genome sequencing without assembly and detection of structural variation produces an incomplete picture of the genome. Recently, clone-free approaches (e.g., Hi-C scaffolding) have been used to generate sequence motif maps or long sequences to serve as scaffolds for the assembly of highly accurate short-read sequences (Burton et al. 2013; Kaplan and Dekker 2013), including the de novo assembly of a diploid human genome (Pendleton et al. 2015). These “hybrid assembly” approaches rely on three sets of data—short read sequences, long read sequences (5- to 20-kb reads), and genome maps (150–500 kb)—to overcome repetitive elements and duplicated regions larger than the typical contigs assembled from short-read sequences.
A fully assembled and phased diploid genome makes it possible to identify all structural variants present with direct access to the breakpoints involved. However, high-quality human genome sequence assembly with base-pair resolution, while feasible, is still a costly and laborious endeavor. In this report, we demonstrate the utility of genome mapping, an approach based on massively parallel analysis of extremely long single DNA molecules fluorescently labeled at specific sequence motifs in nanochannel arrays in genome-wide identification of structural variation at 5-kb resolution without sequencing. In contrast to the short reads (hundreds of bases) used in next-generation sequencing (NGS) approaches, genome mapping analyzes individual DNA molecules of hundreds of thousands of base pairs, thus preserving the long-range genome architecture and enabling direct interrogation of structural variants. Genome mapping has been used in several previous studies to provide scaffolds for genome sequence assembly (Hastie et al. 2013; Cao et al. 2014; English et al. 2015; Pendleton et al. 2015; Usher et al. 2015; Xiao et al. 2015). The DNA sample is prepared with a protocol that preserves the integrity of the DNA. Because native DNA is used, no amplification bias is present. Currently, analyzing a genome by genome mapping (at >60× coverage) takes <2 days and costs <$1000. While whole-genome haplotyping is not achieved with genome mapping alone, local phasing across regions of at least 150 kb is routine with our single-molecule analysis approach, as molecules derived from the parental chromosomes are examined separately.
We generated genome maps from a trio from the 1000 Genomes Project where the individuals had been sequenced to high depths and with their structural variations published previously. We compared the genome maps obtained from the trio against those derived from the human reference genome and identified all the structural variations that are >5 kb. Comparing the genome maps of the parents and the child allows us to check for consistency in Mendelian inheritance and separate the haplotypes. Our study shows that these individuals have many more structural variants than those published and that some of these variants have the potential to disrupt gene function or regulation. Using nicking endonuclease Nt.BspQI (GCTCTTCN^), one label is observed about every 8 kb in the human genome. Without sequencing or cloning, we are able to map breakpoints of the structural variation within 8 kb, making this a novel and efficient approach to whole-genome structural variation analysis. Furthermore, our maps pinpoint the Epstein–Barr Virus (EBV) integration sites in the lymphoblastoid cell lines used and provide size estimates of two-thirds of the large “N-base gaps” in the hg38 human reference genome sequence.
Materials and Methods
High-molecular-weight DNA extraction
Cells from the trio cell line were washed with PBS, resuspended in cell suspension buffer, and embedded in thin low-melting-point agarose layers (CHEF Genomic DNA Plug Kit, Bio-Rad). The thin agarose layers were incubated with lysis buffer and proteinase K for 4 hr at 50°. The plugs were washed and then solubilized with GELase (Epicentre). The purified DNA was subjected to 4 hr of drop-dialysis. It was quantified using Nanodrop 1000 (Thermal Fisher Scientific) and/or a Quant-iTdsDNA Assay Kit (Invitrogen/Molecular Probes), and the quality was assessed using pulsed-field gel electrophoresis.
The DNA was labeled according to commercial protocols using the IrysPrep Reagent Kit (BioNano Genomics). Specifically, 300 ng of purified genomic DNA was nicked with 7 U of nicking endonuclease Nt.BspQI [New England BioLabs (NEB)] at 37° for 2 hr in NEB Buffer 3. The nicked DNA was labeled with a fluorescent-dUTP nucleotide analog using Taq polymerare (NEB) for 1 hr at 72°. After labeling, the nicks were ligated with Taq ligase (NEB) in the presence of dNTPs. The backbone of fluorescently labeled DNA was counterstained with YOYO-1 (Invitrogen).
The DNA was loaded onto the nanochannel array of BioNano Genomics IrysChip by electrophoresis of DNA, automated by the Irys system. Linearized DNA molecules were imaged using the BioNano Genomics Irys system. The DNA backbone (outlined by YOYO-1 staining) and locations of fluorescent labels along each molecule were detected using the image detection software of the Irys system, IrysView, which is available upon request (http://www.bionanogenomics.com/products/irysview/). The set of label locations of each DNA molecule defines an individual single-molecule map.
De novo genome map assembly
Single-molecule maps were assembled de novo into genome maps using IrysSolve software tools developed at BioNano Genomics (available upon request) (Cao et al. 2014). Briefly, the assembler is a custom implementation of the overlap-layout-consensus paradigm with a maximum-likelihood model. An overlap graph was generated based on pairwise comparison of all molecules as input. Redundant and spurious edges were removed. The assembler output the longest path in the graph, and consensus maps were derived. Consensus maps were further refined by mapping single-molecule maps to consensus maps, and label positions were recalculated. Refined consensus maps were extended by mapping single molecules to the ends of the consensus and calculating label positions beyond the initial maps. After merging of overlapping maps, a final set of consensus maps (genome maps) was output and used for subsequent analysis.
Generation of in silico BspQI maps
EMBOSS restrict (Rice et al. 2000) was used to detect in silico BspQI sites and generate maps for the hg38 reference genome and the accompanied EBV genome. To match the resolution of the in silico map, which was originally at a 1-bp resolution, to the experimental data, which was at an ∼500-bp resolution, the in silico maps were condensed to 700 bp such that a midpoint position would be taken if two or more BspQI sites were within 700 bp of each other.
Identification EBV integration sites
Two tandem copies of the EBV in silico BspQI maps were used as the EBV in silico reference map to account for the circular nature of the EBV genome and prevent mapping artifacts due to a linearized map. Single-molecule maps were aligned to the EBV in silico reference map using software tools developed at BioNano Genomics (Shelton et al. 2015). For single-molecule maps that were partially aligned to the EBV reference maps, unmapped portions that were >100 kb and had a minimum of 10 nicks were mapped to hg38 to determine potential EBV integration sites. High-confidence EBV integration sites were supported by at least 20 single-molecule maps.
Sizing N-base gaps
N-base gaps in hg38 that were >5 kb were identified and those that corresponded to centromeres and telomeres were removed. Centromere and telomere annotations were based on hg38 centromeres and gap annotations on the University of California at Santa Cruz (UCSC) Table Browser. Using alignment of genome maps to the hg38 in silico BspQI map (described below), genome maps that span across the entire N-base gaps were used to estimate the size of the N-base gaps.
Structural variation detection
Structural variants (SVs) were found by identifying outlier alignments between single-molecule maps or genome maps from a sample and the hg38 reference map. A structural variation detection pipeline that takes advantage of three detection strategies was used to detect SVs (see below). SVs were verified manually to confirm supporting evidence from single-molecule maps. Briefly, each of the >1000 SVs identified by the software tools was checked by two expert analysts to confirm that (1) at least 50 high-quality single-molecule maps covered the SV region and (2) there was clear evidence that a deletion or an insertion event was found in at least 20% of the high-quality single-molecule maps (for a heterozygous SV call) or in >90% of the high-quality single-molecule maps (for a homozygous SV call).
Structural variation detection pipeline
The first SV list was generated using BioNano software tools. Structural variants were found by identifying outliers between genome maps and the hg38 reference in silico map. An outlier is defined as a discrepancy between the two maps at the 0.01% level or worse. Outliers that are found between two high-scoring regions of at least 50 kb are considered SVs.
Two other software tools are also used to identify SVs based on alignment of single-molecule maps directly to the hg38 reference map. A pipeline was set up to apply these two methods for SV detection. Each single-molecule map was aligned to the reference maps by two different alignment algorithms. The alignment results were integrated and supplied as inputs to two SV calling methods that used different approaches to identify SVs. Detailed descriptions of the pipeline are in unpublished results (K. Y. Yip et al. and T. F. Chan et al.). Below are the key steps of this pipeline.
Alignment of single-molecule maps to reference maps:
To align single-molecule maps to the hg38 reference maps, we used RefAligner (BioNano Genomics) (Cao et al. 2014) and OMBlast (Leung et al., unpublished results; software can be downloaded at http://www.hkbic.cuhk.edu.hk/software/omblast). RefAligner aligns each molecule map to the reference maps by finding the best-matching region using a dynamic programming algorithm. The match score depends on the distribution of fluorescent labels on the molecule and the in silico nicking sites in the region on the reference sequence. OMBlast uses a BLAST-like seed-and-extension approach to aligning molecules to the reference. Each molecule map can be split-mapped to multiple locations in the forward or reversed orientation, allowing for detection of large insertions, deletions, and inversions in the SV detection step.
We integrated the two sets of molecule alignments by taking their consensus as follows. If both alignment methods were able to align a molecule, and their aligned locations were similar (within half the length of the molecule), a consensus was reached, and we included the OMBlast result in our integrated list. On the other hand, if only one method was able to align a molecule, we included this alignment directly regardless of the method. Finally, if both methods were able to align a molecule, but the aligned locations were different (more than half the length of the molecule apart), we would exclude both alignments.
Alignment of de novo assembled genome maps to reference maps:
We first aligned the molecules to the genome maps, followed by trimming all molecule alignments by three flanking signals. Then we counted the signal coverage on the assembled genome maps based on the number of molecules aligned. The coverage of the start and the end of the assembled genome maps was usually low, and hence the first and last five signals on the assembled genome maps were ignored. Signals with coverage lower than five supporting alignments were regarded as the low-coverage regions. After splitting the maps at these low-coverage regions, we used OMBlast to align the split fragments with a minimum of 10 signals to reference maps and to detect large insertions, deletions, and inversions using partial-alignment-based strategy as described below, except that the minimum number of fragments supporting the variant call was set to one.
Identification of structural variants:
The integrated list of molecule alignments was then used to identify SVs using a probabilistic method and a partial-alignment-based method. In total, six different types of variations were identified, namely single-site insertions, single-site deletions, segmental insertions, segmental deletions, inversions, and translocations. The first two types involved single nicking sites that were observed on the molecules only but were not present on the reference, or vice versa. These variations could be due to SVs or smaller-scale mutations such as single-nucleotide variants or small insertions and deletions. Since the current study focuses on SVs, below we consider only the latter four types of variants identified by the two methods.
In the probabilistic method, an error model was defined to describe various types of errors possibly contained in the data, including incomplete enzymatic digestion, non-sequence-specific nicks, molecule stretch variation, measurement resolution, and errors. The model parameters were estimated based on the alignment results. For every two adjacent nicking sites on the reference map or that were supported by at least one molecule, the model was used to compare the likelihoods of the null hypothesis of having no SVs between them with four alternative hypotheses, namely having homozygous/heterozygous insertions and having homozygous/heterozygous deletions between the two sites. SVs were then called for cases having a significant P-value and alternative-to-null likelihood ratio.
The partial-alignment-based method examined molecules that aligned to multiple genomic regions. SVs were called depending on (1) whether the two aligned regions were on the same chromosome, (2) the distance between the two aligned regions if they were on the same chromosome, and (3) the orientations of the two alignments. Small inversions were also identified by checking if an inverted pattern was observed in the Compact Idiosyncratic Gapped Alignment Report (CIGAR; https://samtools.github.io/hts-specs/SAMv1.pdf) string of an alignment. For example, the CIGAR string of IIMDD (two inserted nicking sites followed by a match followed by two deletions) would be considered a potential inversion since the inversion of the sequence could exactly explain the CIGAR string. Finally, SVs identified from different molecules that were aligned to nearby regions were merged into a single SV.
To minimize the number of false positives, both the probabilistic and partial-alignment-based methods required a minimum number of molecules supporting the variant allele for an SV to be called.
NGS-based validation of BioNano-based SV calls
We attempted to use publicly available NGS data derived from cell lines of the same individuals to validate BioNano-based SV calls. We looked at both the changes in the coverage depth and the changes in the number of unpaired alignments around the SV regions for confirmation. We expected the coverage depth to drop around deletion regions. There might be fluctuation around insertion breakpoints, but the depth was expected to be relatively normal except for duplication events. We also expected an increase in the number of unpaired alignments around SV breakpoints.
Alignment .bam files for the Illumina high-coverage paired-end 100-base data for the trio samples were downloaded from the Genome in a Bottle Consortium (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data, accessed in August 2013). The alignment was done against hg19. The BioNano-based SV calls were in hg38 coordinates, so they were converted via the batch LiftOver tool (UCSC Genome Browser) into hg19 coordinates.
Calculation of s/(s + p) ratios:
From the .bam files, we extracted properly paired alignments using SAMtools (Li et al. 2009) with the command “samtools view -f 3.” We extracted unpaired (or single-end) alignments using “samtools view -f 4 -F 264” and “samtools view -f 8 -F 260.” We divided the genome into 500-bp nonoverlapping bins. For each bin, we counted the number of single-end aligned (denoted as “s”) reads and all properly paired aligned (denoted as “p”) reads. We then calculated the ratio between single-end aligned reads and all aligned reads (including properly paired reads and single-end aligned reads). Peaks were called for each region. If a peak overlapped with the BioNano SV region, the SV was considered supported.
Calculation of normalized coverage depth:
The alignments from the three samples were pooled, and the total number of properly paired reads per bin was determined. GC content (%GC) was calculated for each 500-bp nonoverlapping bin. Ten thousand random bins for each chromosome were randomly sampled for fitting. A Loess fit was applied to model the correlation between %GC and coverage counts. Zero-coverage bins were removed to avoid N-base regions. The normalized counts were obtained by subtracting the per-bin coverage counts from the GC correction minus the median coverage count of the sample. The coverage profiles were segmented using the R/Bioconductor package DNAcopy. The minimum width of a segment was set to be five bins. If a significance coverage change overlapped with the BioNano SV region, the SV was considered supported.
Gene annotations of structural variants
Homozygous deletions that overlapped with gene exons were identified using ANNOVAR based on RefSeq Gene hg38 coordinates. Among genes located in deleted regions, candidate genes related to disease susceptibility were identified based on the Online Mendelian Inheritance in Man (OMIM) database as well as a literature review. For these candidate genes, deletion boundaries were refined by examining NGS coverage depth and s/(s + p) ratios, as described above. We finalized the list of deletions of genes related to disease susceptibility by verifying that the updated deletion boundaries still overlapped with gene exons.
We applied a genome-mapping strategy that uses long DNA molecules with fluorescent labels marking a specific DNA sequence motif, GCTCTTCN^, for de novo genome assembly and structural variation detection. An overview of our genome mapping strategy is outlined in Figure 1.
De novo genome assembly of a CEU (Northern European from Utah) trio
The trio samples (NA12878, NA12891, and NA12892) used in this study came from the CEU collection. We prepared long DNA molecules from the cell lines and labeled them with fluorescent nucleotides (Alexa 546-dUTP), specifically at nicking sites created by nicking endonuclease Nt.BspQI. The labeled DNA molecules were then stained with the DNA intercalating dye YOYO-1 and introduced into nanochannel arrays for imaging. We collected single-molecule data (>150 kb) to a minimum of 71× depth of coverage for each individual (Table 1). The image of each molecule was converted to a single-molecule map based on molecular length and the locations of the fluorescent labels. The single-molecule maps were assembled de novo into genome maps. The number of genome maps for each individual ranged from 990 to 1049 (Table 1). Over 98% of genome maps aligned to hg38 and covered at least 96% of hg38 (Table 1 and Figure 2). The remaining 2% of the genome maps correspond to genomic regions substantially different between the trio and the reference genome or with incomplete genomic sequences in the reference (Pendleton et al. 2015). The N50 of the three genome assemblies ranged from 4.6 to 5.0 Mb, with the longest map being 29.0 Mb.
In addition to molecules derived from the human genome, we also observed some molecules (0.05–0.40%) aligning to the 170-kb EBV genome. Using molecules partially aligned to EBV, we identified 3 to 18 high-confidence EBV integration sites in these samples (Figure 3). The difference in EBV integration sites between the three samples is most likely due to a random EBV integration event during passaging of the immortalized cell lines. We also observed that the EBV genome integrated into the lymphoblastoid cell lines was slightly different from the EBV in silico map released with the hg38 genome (Supporting Information, Figure S1).
Sizing N-base gaps in hg38
In hg38, there are 267 gaps (>5 kb) that are represented as “Ns” with arbitrary lengths, including 65 located in the subtelomeric and subcentromeric regions (gray in Figure 2). Our de novo genome maps spanned across 95 of the 202 N-base gaps outside of known telomeric/centromeric regions, allowing the use of genome maps to estimate actual gap size. As an example, accurate sizing of a 50-kb N-base gap at the hg38 chromosome 18 (47M) region is shown in Figure 4. Genome maps in this region reveal that it represents a copy-number variation where the trio samples contained 7–14 copies of a 6-kb tandem repeat. In contrast to the reference genome, which contains 3 copies of the tandem repeat, NA12878 is heterozygous with 7 and 9 copies, giving N-base gap sizes of 24 and 36 kb, respectively, while NA12891 has 9 and 10 copies (36- and 42-kb gaps), and NA12892 has 7 and 14 copies (24- and 66-kb gaps). In this particular case, the 6-kb tandem repeat resides in the TCEB3 gene locus and is likely to affect the number of copies of the TCEB3 gene (Figure 4D).
Detection of structural variants
Detection of SVs that are >5 kb in size was performed using two complementary approaches, a consensus-based approach and a single-molecule approach. This was followed by manual inspection of every identified SV to confirm that the SV calls were supported by the presence of single molecules containing the variants. In heterozygotes, the region must be covered with sufficient depth (>50×) and the variant seen in >20% of the molecules to be considered valid. As shown in Table 2, we identified 909 insertions and 661 deletions (>5 kb in size) in these three samples, significantly more than the 59 insertions and 156 deletions previously found (1000 Genomes Project Consortium et al. 2010, 2012). We also identified a total of 27 inversions and 44 large copy-number variants in these individuals.
Insertion and deletion
We first examined the single-molecule maps for the loci where 59 insertions and 156 deletions (>5 kb) were previously reported in the 1000 Genomes Consortium pilot and phase 1 data (1000 Genomes Project Consortium et al. 2010, 2012). Our maps provided supporting evidence for 39 reported insertions (66%) and 125 reported deletions (80%) (Table 2, “By novelty…”). For the rest of the reported insertions and deletions, our data showed that either the reported SVs were not present (11 insertions and 21 deletions) or the SVs present were of a type different from those reported (2 insertions). Fifteen of the reported SVs (7 insertions and 8 deletions) were not detected because our maps did not have sufficient depth to make an SV call. Apart from validating previously reported insertions and deletions, our single-molecule maps and genome maps identified an additional 870 novel insertions and 536 novel deletions (Table 2, “By novelty…”) with a Mendelian concordance rate of 96% (Table 2, “By Mendelian inheritance”). All these insertions and deletions were carefully verified manually to confirm supporting evidence from single-molecule maps (Table S1 and Table S2). Over 85% of the deletions were validated by next-generation sequence data using read-depth or single-end or paired-end ratio approaches (Table S3).
Among the deletions identified, we found homozygous deletions that affected five genes (GSTM1, LCE3B/C, CR1, SIGLEC14) that may affect disease susceptibility (Table 3). After refining deletion boundaries using short-read sequencing data, the GSTM1, LCE3B, and LCE3C genes were completely deleted while the last four exons were deleted from SIGLEC14 and two exons were deleted from CR1. Complete deletions of GSTM1, LCE3B, and LCE3C have previously been reported. GSTM1 deletion is associated with reduced metabolism of chemical carcinogens and other toxins, leading to increased susceptibility to multiple cancers as well as other disorders such as aplastic anemia (Zhong et al. 1993; Trizna et al. 1995; Lee et al. 2001). Deletion of LCE3B/C is a known risk factor for psoriasis (De Cid et al. 2009). A null allele of SIGLEC14 has been associated with increased susceptibility to group B Streptococcus (Ali et al. 2014), while polymorphisms in CR1 determine the Knops system blood group and have been associated with resistance to malaria (Cockburn et al. 2004).
We attempted to verify 51 inversions previously reported in NA12878 by Kidd et al. (2008). Of these, our genome mapping data provided single-molecule support for only 16 inversions (31%). An example is found in Figure 5A. Ten of the 16 detected inversions showed partial symmetry in the BspQI nicking pattern, suggesting that these previously reported inversions are palindromes. This hypothesis was verified by generating a dot plot of these 10 hg38 regions (data not shown). For the remaining 35 published inversions, 11 reside in complex regions consisting of multiple types of structural variants. The single-molecule data provide no support for inversions in the remaining 24 regions. Since these inversions were originally identified in the GRCh35 (hg17) reference map, we used the National Center for Biotechnology Information Genome Remapping Service to map the inversions together with 80-kb flanking regions from hg38 to hg17. We observed that 12 of 24 of the regions containing the unverified inversions were significantly different between hg38 and hg17, indicating that half of the discrepancies were due to the use of different reference genome sequence assemblies. In addition, we identified 11 novel inversions in NA12878. In total, we identified 27 inversions in NA12878, including 16 published and 11 novel inversions (Table S4). As some of the inversions reside in complex regions with multiple SVs, long single-molecule maps provide long-range information and direct evidence of complex structural variation (Figure 5B).
Sixty-two large copy-number variation (CNV) gain regions associated with NA12878, NA12891, or NA12892 have been reported by Conrad et al. (2010), Wang et al. (2007), Pinto et al. (2007), and Cooper et al. (2008). Of the 62 CNV gain regions, 44 regions map to two or more locations in hg38, indicating that they are known segmental duplications in hg38. Four of these segmental duplications are located in the subtelomeric regions, consistent with previous findings that there are paralogous blocks of subtelomeric repeat elements (Stong et al. 2014). The remaining 18 published CNV gain regions cannot be located in our genome maps.
Resolving zygosity of structural variation
With long single-molecule data, zygosity of structural variation can be directly observed when molecules spanning each haplotype are present, and they can be used to build haplotype-resolved genome maps. An example is illustrated in Figure 1E in which a previously reported 9-kb deletion in NA12892 was found to be heterozygous. In addition, a novel 4-kb insertion was found right next to this deletion locus and is out of phase with the 9-kb deletion, allowing us to construct the haplotypes based on these two structural variants.
The importance of having long-range information in genome assembly (and, by extension, whole-genome SV detection) has been demonstrated in recent studies using genome maps, long-read sequences, fosmid libraries (Cao et al. 2015), or a combination of these for de novo assembly of the human genome (Cao et al. 2014, 2015; Pendleton et al. 2015). In this report, we focus on the use of long single-molecule and genome maps for SV detection without sequencing or cloning. The single-molecule and de novo genome maps that we generated for a well-studied CEU trio allowed us to efficiently identify structural variants >5 kb.
The de novo genome maps that we produced with single-molecule data alone have assembly N50 of ≥4.6 Mb. Insertions, deletions, and inversions >5 kb were detected and validated with single-molecule maps and/or de novo genome maps against the human reference genome and against each other. Overall, we detected and validated seven times more large insertions/deletions than previously found in the 1000 Genome Consortium pilot and phase 1 data (1000 Genomes Project Consortium et al. 2010, 2012). A likely explanation of this discrepancy is that it is harder to detect large insertions with NGS paired-end reads with small inserts. Generally, NGS reads help identify insertion breakpoints, but additional assembly is required to identify the inserted sequence. Large insertions are quite obvious with single-molecule and/or de novo genome maps. The major advantage of using single-molecule maps is the direct, inference-free supporting evidence for the presence of an SV provided by these maps.
Inversions are usually located at structurally complex regions associated with microdeletions, segmental duplications, and clustered regions of the X chromosome (Kidd et al. 2008). Consistent with this observation, we found that at least 22% of the previously reported inversions are embedded in regions with complex structural variants. The majority of the previously reported inversions were not detected by genome mapping partly because inversions found using an older version of the reference genome (hg17) were no longer present in the latest version (hg38). The genome-mapping strategy provides high-level map information to differentiate simple inversion from the more challenging palindromes, especially when the nicking site patterns are easily resolved with long DNA molecules. For SVs larger than the span of single molecules, de novo genome maps assembled from single-molecule maps provide an additional level of supporting evidence.
Structural variants are known to be associated with diseases (Weischenfeldt et al. 2013). Using our genome-mapping strategy, we identified at least four homozygous deletions that disrupt genes that are known to affect disease susceptibility (Table 3). These genes include GSTM1, LCE3B/C, CR1, and SIGLEC14. While we do not have clinical information about these anonymous DNA donors to associate these genotypes with phenotype, genome mapping revealed gene disruptions that may have clinical implications.
With native long DNA molecules without amplification, nonhuman DNA integrated into the genome, such as the EBV genome found in the transformed cell lines studied, can be detected readily. Using single-molecule maps that partially aligned to EBV reference maps and partially to the hg38 reference maps, we determined the EBV integration sites in all three samples. Although EBV integration in these cell lines is only an artifact of the transformation process, genome mapping has potential applications in the study of viral integration in diseases.
While genome mapping has a number of advantages over current approaches in whole-genome SV detection, there are four major limitations. First, the resolution of the imaging system and uncertainties in DNA measurements keep the method from resolving fluorescent labels that are within 1–2 kb of each other—hence our focus on SVs that are >5 kb in length. With engineering and algorithmic advances, one may be able to improve the mapping resolution. Second, the presence of neighboring nicking sites on opposite strands creates “fragile sites” leading to double-stranded breaks during the nick-labeling process that cannot be bridged by any DNA molecules. For the enzyme used in this study, Nt.BspQI, these fragile sites occur, on average, every 1 Mb in the genome, keeping the genome map N50 to <5 Mb. Bridging the fragile sites requires other data sets, such as long-read sequences or genome maps created by a second nicking enzyme. The third limitation is partly a result of the first two limitations and partly because of the complexity of the human genome. To reduce the number of fragile sites present and keep the fluorescent labels from being too close to each other, we have chosen an enzyme that nicks the human genome at an average interval of 8 kb. This choice has several consequences. Because genome mapping relies solely on the repetitive nicking patterns to detect CNVs, but the labels are ∼8 kb apart, only large CNVs containing multiple labels with a unique nicking pattern can be detected when the extra copies are not in tandem of each other. The sparseness of nicking sites makes it hard to pin point the breakpoints of the SVs detected to less than the 8-kb interval between the outermost pairs of nicking sites marking an SV. For balanced SVs like inversions, reliable detection depends on the number of BspQI nicks and the uniqueness of the nick pattern. Smaller inversions or ones without a unique pattern will not be detected by genome mapping alone. A possible solution is to use two nicking enzymes to create two different maps in separate experiments or to nick-label the same sample with two nicking enzymes to create a doubly labeled map. While these maps can theoretically improve map length and assembly accuracy, added efforts and cost are required with the former approach and the number of fragile sites are increased with the latter approach. Furthermore, the complex regions of the genome will likely pose similar challenges no matter what enzyme is used because of the near identity of large repetitive sequences. The fourth limitation is that genome mapping alone cannot map the extra copies of large CNVs if they are not found in tandem with the original copy. There are three possible scenarios when the extra copy/duplications are not in tandem to the original copy: (1) When the extra copy is too small to have a unique nicking pattern (with nicking frequency of Nt.BspQ1 at 8 kb in the human genome, it takes at least 40–50 kb to have five or more nicks to give a unique pattern), it is seen as an insertion at the new location and nothing more. (2) In some cases, the extra copy has a unique nicking pattern, but it is not identical to the original copy because it has additional nicking sites, is missing some nicking sites, or has additional insertion/deletions. Consequently, a large insertion is seen at the new location but cannot be linked back to the original copy. (3) It is only when a large insertion found elsewhere in the genome with a nicking pattern identical to the original copy that one can determine with some certainty that they are copies of each other. They are previously published segmental duplications, and we have been able to find them in our analysis. When genome maps are combined with long-read sequencing data and short-read sequencing data, even the small CNVs and those with different nicking patterns can be identified because the genome assembly approaches single-base-pair resolution.
A similar approach to genome mapping is optical mapping pioneered by the Schwartz group where long DNA molecules on solid support are restriction-digested in situ and ordered restriction maps are produced by sizing the restriction fragments. It has been in numerous genome assembly studies and, more recently, on cancer genomes (Ray et al. 2013; Gupta et al. 2015). However, fundamental challenges, in terms of sample preparation, data analysis, and information density, with traditional optical mapping remain and have been discussed previously. Alternative methods such as denaturation mapping in nanofluidic channels also show promise but have not been applied to large genomes (Reisner et al. 2010).
Although sequence-level data are required to understand the exact arrangement of SV, genome mapping is an efficient way to detect SVs that are >5 kb. Single-molecule maps provide direct evidence and a clear indication of the presence of SV without the need for inference while the de novo genome maps allow the detection of SV larger than the span of molecule maps. Combining short-read and long-read technologies, we will finally be able to characterize a full range of genome variation and assemble phased genome sequences that will in turn increase our understanding of the relationship of genetic variation with phenotypes and diseases.
This research is supported by a National Institutes of Health (NIH) grant to P.-Y.K. and M.X. (R01 HG005946). A.C.Y.M. was supported by NIH training grant T32 AR007175.
Author Contributions: Y.Y.Y.L., C.C., C.L., A.R.H., W.S., S.C., and J.S. generated data. A.K.Y.L., T.-P.K., E.T.L., A.C.Y.M. designed and performed structural variant analyses. T.P.K., A.K.Y.L., E.T.L., A.C.Y.M., A.W.C.P., J.J.K.W., C.M.L.L., Z.D., T.A., W.A., X.Z., and H.D. developed analytical tools. E.T.L., A.P., A.C.Y.M., J.-W.L., and A.K.Y.Y. validated structural variants. Y.M. and Z.D. studied and reviewed the biological relevance of structural variants. P.-Y.K., K.Y.Y., T.-F.C., S.-M.Y., M.X., and H.C. supervised the project. A.C.Y.M., E.T.L., Y.M., P.-Y.K., and K.Y.Y. wrote the manuscript.
Communicating editor: M. Johnston
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.183483/-/DC1.
- Received October 5, 2014.
- Accepted October 28, 2015.
- Copyright © 2016 by the Genetics Society of America
Available freely online through the author-supported open access option.