Abstract
PacBio sequencing is a powerful approach for studying DNA or RNA sequences over long ranges. It is especially useful for exploring complex structural variants generated by random integration or multiple rearrangements of endogenous or exogenous sequences. Here, we present a tool, TSD, for complex structural variant discovery using PacBio targeted sequencing data. It allows researchers to identify and visualize the genomic structures of targeted sequences by unlimited splitting, alignment and assembly of long PacBio reads. Application to sequencing data derived from an HBV-integrated human cell line (PLC/PRF/5) indicated that TSD could recover the full profile of HBV integration events, especially for regions with complex human-HBV genome integrations and multiple HBV rearrangements. Compared to other long-read analysis tools, TSD showed better performance in detecting complex genomic structural variants. TSD is publicly available at: https://github.com/menggf/tsd.
Keywords: structural variants; long reads; genomic structure; PacBio
Structural variations (SVs), such as copy number variations, inversions and translocations, are commonly observed in genomes (Feuk et al. 2006). In humans, SVs affect approximately 13% of the genome in the normal population (Sudmant et al. 2015). Some of these SVs contribute to phenotypic diversity and susceptibility to disease (Brandler et al. 2016; Truty et al. 2019), which has drawn increasing attention in disease studies. Complex SVs can also be observed in genomes with genetic instability (Weischenfeldt et al. 2013; Lupski 2015), virus integration or transgenic modification (Zhao et al. 2016a; Meng 2018), which may generate complex rearrangements of exogenous DNA sequences and random integration in the host genome.
Next-generation sequencing (NGS) technologies have been widely used for SV discovery (Abel and Duncavage 2013; Tubio 2015). In such studies, NGS platforms typically generate millions of short reads, ranging from 50 to 300 bp in length. SV discovery is performed by analyzing the short reads derived from the SV regions, using signals such as discordant paired-end reads, split reads and sequencing depth (Zhao et al. 2016b). However, NGS-based SV discovery is limited by the read length, especially for long complex SVs, so the detailed structure of complex SVs is usually invisible to computational algorithms. Therefore, most NGS approaches focus primarily on low-complexity copy number variants or rearrangements.
Third-generation sequencing technologies, such as the PacBio sequencers released by Pacific Biosciences, which generate reads up to 60 kbp long (Rhoads and Au 2015), have emerged as a powerful approach for studying the genome over long ranges. However, the reads generated by PacBio sequencers are error-prone, with indel errors being especially frequent (Rhoads and Au 2015). This makes SV discovery with tools designed for NGS difficult. Many computational methods have been developed for long reads, e.g., for de novo assembly, isoform studies and other applications (Koren et al. 2017; Chin et al. 2013; Chaisson and Tesler 2012; English et al. 2015). However, most of these tools are not designed or optimized for genomic regions with complex structure, such as complex chromosome relocation, integration and rearrangements. Meanwhile, the throughput and cost of the PacBio platform limit its application to small genomes, e.g., bacterial genomes (Ferrarini et al. 2013; Liao et al. 2015). Targeted sequencing, which captures only the regions of interest, has been widely applied to work around this limit. With long reads, how to use the resulting redundant reads for accurate SV discovery remains a challenge for data analysts.
Here, we present a tool, TSD, for identifying and visualizing structural variants using PacBio targeted sequencing data. It is specially designed for DNA regions with complex integration and rearrangement, and it allows multiple rounds of splitting and mapping of the long PacBio reads. The genomic structure of targeted sequences is recovered by assembling the mapped PacBio fragments. TSD was applied to the PLC/PRF/5 cell line, which contains complex HBV rearrangement and integration events, and identified 9 HBV integration events. Evaluation suggests that TSD performs as well as or better than existing tools in discovering the structure of SVs, especially when the targeted sequences have complex structure in the genome.
Materials and Methods
Targeted sequencing on PacBio Sequel of PLC/PRF/5 cell line
The targeted regions with HBV integrations in PLC/PRF/5 cell genome were sequenced on PacBio Sequel SMRT system. Briefly, genomic DNA was extracted from PLC/PRF/5 cell (ATCC, CRL-8024) using PureLink Genomic DNA Kit (Invitrogen, CAT K182002) followed by random fragmentation to 5-9 kbp long. Fragments containing HBV sequence were captured and enriched using Roche NimbleGen’s SeqCap EZ enrichment technology with customized HBV specific probes. SMRTbell library was prepared according to the manufacturer’s guidelines and sequenced on PacBio Sequel by GENEWIZ Company. The quality control and processing of raw PacBio reads were performed using SMRT Link v5.1.0. The subreads are used as the initial input for TSD analysis.
Simulated PacBio reads
We generated a set of simulated PacBio reads by randomly extracting the genomic DNA fragments and connecting them to form complex SVs. In this process, the genomic fragments were set to have a random length ranging from 500 bp to 2000 bp. We checked the genomic annotation for the selected fragments and found that they covered diverse regions in human genome, including the coding regions and low-complexity repeat regions. Finally, 5,000,000 genomic fragments were collected and connected in a random way to form 1,000,000 PacBio reads. Each simulated PacBio read carried 3-7 fragments. To simulate the error-prone feature of PacBio reads, indel errors were also added to the reads in a random way by keeping the error ratio to be about 15%. The simulated PacBio reads were used to evaluate the accuracy of alignment mapping and assembly of long PacBio reads.
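The simulation described above can be illustrated with a minimal sketch. It is not the script used in this work: the function names, the equal split between insertions and deletions, and the toy `genome` dictionary are assumptions made for illustration, and fragment orientation is not simulated.

```python
import random

def add_indel_errors(seq, error_rate=0.15, rng=random):
    """Add random indels so that about `error_rate` of positions are affected."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                continue                       # deletion: drop this base
            out.append(base)
            out.append(rng.choice("ACGT"))     # insertion: random base after it
        else:
            out.append(base)
    return "".join(out)

def simulate_read(genome, n_min=3, n_max=7, len_min=500, len_max=2000, rng=random):
    """Join 3-7 random fragments of `genome` (dict: name -> sequence) into one read.

    Returns the error-containing read and the true fragment coordinates.
    Every sequence in `genome` must be at least `len_max` bases long.
    """
    pieces, truth = [], []
    for _ in range(rng.randint(n_min, n_max)):
        chrom = rng.choice(sorted(genome))
        length = rng.randint(len_min, len_max)
        start = rng.randint(0, len(genome[chrom]) - length)
        pieces.append(genome[chrom][start:start + length])
        truth.append((chrom, start, start + length))
    return add_indel_errors("".join(pieces), rng=rng), truth
```

Keeping the true coordinates next to each simulated read is what makes it possible to score mapping and assembly accuracy afterwards.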
Reference index for targeted sequences
PacBio reads are mapped to the reference genome using the Burrows-Wheeler Aligner (BWA-MEM) (Li and Durbin 2009). For data with exogenous sequences, e.g., a virus or transgenic vector, TSD supports two ways to build the reference index. The first is to treat the exogenous sequences as extra chromosomes and build a single combined reference index. The second is to build separate reference indexes for the genome sequences and the targeted sequences. In most cases, the two approaches give the same analysis result. However, when the targeted sequences are derived from the genome or are homologous to the host genome sequence, e.g., the LINE repeats in the human genome, the corresponding genome regions should be masked, either with a tool such as RepeatMasker (Tarailo-Graovac and Chen 2009) or by manually replacing the corresponding nucleotides with “N”s. Otherwise, the homologous regions will produce ambiguous mappings that confuse TSD in the final output.
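As an illustration of the manual masking option, the following sketch replaces given regions of a sequence with “N”. The function name and the 0-based, half-open coordinate convention are assumptions made for this example and are not part of TSD.

```python
def mask_regions(sequence, regions):
    """Return `sequence` with each 0-based, half-open (start, end) region set to N."""
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range coordinates are harmless.
        for k in range(max(start, 0), min(end, len(masked))):
            masked[k] = "N"
    return "".join(masked)

# Toy example: mask positions 2, 3 and 4 of a short sequence.
print(mask_regions("ACGTACGT", [(2, 5)]))
```

The masked sequence would then be written back to FASTA before the index is built with “bwa index”.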
Long reads alignment to the genome
The long PacBio reads are first aligned to the reference sequences using the “bwa mem -x pacbio” command. If a long read is only partially mapped and the unmapped part is longer than a minimum length (e.g., 200 bp), the unmapped sequence is cut out for a further round of alignment. This process is repeated until no new mapping can be generated. In this way, PacBio reads with complex structure are represented as a line of mapped fragments.
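The iterative split-and-map procedure can be sketched as follows. This is a simplified illustration rather than the TSD implementation: `align` is a hypothetical callable standing in for a call to the aligner, and it is assumed to return only the best local hit for a sequence.

```python
from collections import deque

def split_and_map(read, align, min_len=200):
    """Align `read`, then re-align its unmapped flanks until nothing new maps.

    `align(seq)` is a hypothetical callable that returns None or a tuple
    (q_start, q_end, ref_name, ref_start, ref_end) for the best local hit.
    Returns the hits sorted by their position in the original read.
    """
    hits = []
    pending = deque([(0, read)])            # (offset in the original read, sequence)
    while pending:
        offset, seq = pending.popleft()
        hit = align(seq)
        if hit is None or hit[1] <= hit[0]:
            continue                        # nothing mappable in this piece
        q_start, q_end, ref_name, ref_start, ref_end = hit
        hits.append((offset + q_start, offset + q_end, ref_name, ref_start, ref_end))
        if q_start > min_len:               # unmapped left flank: next round
            pending.append((offset, seq[:q_start]))
        if len(seq) - q_end > min_len:      # unmapped right flank: next round
            pending.append((offset + q_end, seq[q_end:]))
    return sorted(hits)
```

Each flank is strictly shorter than the piece it came from, so the loop always terminates. The sorted hit list corresponds to the "line of mapped fragments" described above.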
Fragment similarity and clustering
PacBio reads derived from complex SVs may comprise multiple mapped DNA fragments, which usually have different origins and orientations. We used three values to record each mapped DNA fragment: chromosome, starting point and ending point. Because PacBio sequencing data are error-prone, the mapped fragments were clustered into consensus fragments (CFs). In this process, a similarity-based clustering algorithm was applied to cluster DNA fragments from the same chromosome with similar starting and ending points. In this way, PacBio reads were further transformed into a line of connected CFs.
For two CFs chrX:f1-t1 and chrX:f2-t2, their similarity is measured by two scores, S1 and S2. Assuming two CFs have an overlap of s bp, S1 and S2 are defined as
In this work, we used as the cutoff to determine if the CFs were derived from the same DNA regions. Alternatively, if the fragments were the first or the last one of PacBio reads, the less strict cutoff was used.
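A rough sketch of an overlap-based similarity test is given below. The definitions of S1 and S2 and the cutoff are not reproduced in the text above, so everything here is an assumption made for illustration: S1 and S2 are taken to be the overlap length divided by the length of each CF, the value 0.8 is a placeholder, and the relaxed rule for terminal fragments is taken to require only one score to pass.

```python
def overlap_scores(cf1, cf2):
    """Return (S1, S2) for two CFs given as (chrom, start, end).

    ASSUMPTION: S1 and S2 are the overlap length s divided by the length
    of the first and second CF, respectively.
    """
    chrom1, f1, t1 = cf1
    chrom2, f2, t2 = cf2
    if chrom1 != chrom2:
        return 0.0, 0.0
    s = max(0, min(t1, t2) - max(f1, f2))      # overlap length in bp
    return s / (t1 - f1), s / (t2 - f2)

def same_region(cf1, cf2, cutoff=0.8, terminal=False):
    """Decide whether two CFs come from the same DNA region.

    ASSUMPTION: both scores must reach `cutoff` (a placeholder value);
    for the first or last fragment of a read, one passing score is enough.
    """
    s1, s2 = overlap_scores(cf1, cf2)
    if terminal:
        return max(s1, s2) >= cutoff
    return min(s1, s2) >= cutoff
```

A relaxed rule for terminal fragments is a natural choice because the first and last fragments of a read are cut at arbitrary positions by DNA fragmentation, so they can be much shorter than the CF they belong to.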
Local alignment of consensus fragments
To determine whether two PacBio reads originated from the same SV, local alignment of PacBio reads was performed using a modified dynamic programming (DP) algorithm. In this algorithm, CFs instead of nucleotides were used to calculate the matching scores. Suppose two PacBio reads A and B have CFs {a1, a2, ..., am} and {b1, b2, ..., bn}, respectively. A matrix M was constructed and initialized with 0. The element values of M were then determined based on the matching status of reads A and B, as described by:
for i in 1:m
    for j in 1:n
        match <- M(i-1, j-1) + 1
        gap_in_A <- M(i-1, j)
        gap_in_B <- M(i, j-1)
        M(i, j) <- max(match, gap_in_A, gap_in_B)
The final alignment of reads A and B was traced back in matrix M. Reads A and B were not treated as reads from the same SV if any mismatch was observed. Gaps were allowed only when they were located in unmapped regions, which usually resulted from low sequencing quality.
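The recurrence above can be written out as a small runnable sketch. It is a plain longest-common-subsequence formulation under stated assumptions: each CF is represented by a hashable identifier, the diagonal step is taken only when the two CFs are identical, and the mismatch and gap restrictions described above are not enforced.

```python
def align_cfs(a, b):
    """Align two reads given as lists of CF identifiers.

    Returns the number of matched CFs and the matched index pairs (i, j),
    meaning that a[i] and b[j] are the same CF.
    """
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1   # match: extend the diagonal
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])   # gap in B or gap in A
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return M[m][n], pairs[::-1]

score, pairs = align_cfs(["c1", "c2", "c3", "c4"], ["c2", "c3", "c4", "c5"])
print(score, pairs)
```

Working on CFs rather than nucleotides keeps the matrix tiny, since a read usually contains only a handful of fragments.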
Seed-based assembly of PacBio reads
Targeted sequencing can generate redundant reads for the same regions. We applied a seed-based clustering method to group PacBio reads. The long PacBio reads were first ranked by their sequence length or fragment number. The longest read was selected as a seed to assemble PacBio reads, and the other reads were aligned against the seed. If a read was similar to the seed and within the mapping range of the seed, it was assigned to the seed as a supporting read. If a read partially overlapped with the seed, the seed was extended to construct a new seed. The final seed and its supporting reads were reported as the final outcome.
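The seed-and-extend idea can be sketched as follows. This is a simplified single-pass illustration, not the TSD implementation: reads are assumed to be lists of CF identifiers, ranking uses fragment number only, fragment orientation is ignored, and all function names are hypothetical.

```python
def find_sublist(seed, read):
    """Index at which `read` occurs contiguously inside `seed`, or -1."""
    for k in range(len(seed) - len(read) + 1):
        if seed[k:k + len(read)] == read:
            return k
    return -1

def overlap_len(left, right):
    """Largest k such that the last k CFs of `left` equal the first k of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def assemble(reads):
    """Greedy seed-based assembly; returns (seed, supporting reads, unused reads)."""
    reads = sorted(reads, key=len, reverse=True)   # longest read becomes the seed
    seed, support, unused = list(reads[0]), [reads[0]], []
    for read in reads[1:]:
        right = overlap_len(seed, read)            # read continues the seed's end
        left = overlap_len(read, seed)             # read precedes the seed's start
        if find_sublist(seed, read) >= 0:
            support.append(read)                   # fully inside the seed
        elif right > 0:
            seed = seed + read[right:]             # extend the seed to the right
            support.append(read)
        elif left > 0:
            seed = read[:-left] + seed             # extend the seed to the left
            support.append(read)
        else:
            unused.append(read)
    return seed, support, unused
```

In a fuller version, the unused reads would be revisited after each extension and then used to start new seeds, so that every SV region ends up with its own seed and set of supporting reads.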
Visualization to the structural organization of targeted sequences
The DNA fragment plots are generated in R to visualize the organization structure of the targeted sequences. Each plot shows (1) the DNA fragments annotated with their origins and corresponding locations; (2) the orientation of the DNA fragments in the host genome; and (3) the supporting reads. Each SV region is reported in a single plot. Users can modify the output plots by passing R parameters.
Data Availability
The source codes, exemplary data and scripts are publicly available at: https://github.com/menggf/tsd. Users can repeat part of the analysis results presented in this article by running the “example.sh”. Supplemental material available at Figshare: https://doi.org/10.25387/g3.7356776.
Results and Discussion
Identify the complex structure of targeted sequences
TSD is designed to identify the organization structure of complex SVs, which may be generated by genomic instability, transgene integration, virus infection or even just low-complexity genomic regions. As shown in Figure 1(a), such regions may comprise multiple fragmented DNA pieces with different origins. Finding the detailed structure of the inserted DNA sequences requires computational mapping and assembly of the DNA fragments.
Figure 1(b) summarizes the flowchart of TSD in identifying the complex SV structure using PacBio sequencing data. The main idea is that the long PacBio reads are split into mappable fragments and the SV structures are recovered by computational assembly of the mapped DNA fragments (see Figure 1(c)). Compared with the existing tools, TSD has several features: (1) TSD is able to identify the SV structure of any complexity. (2) TSD is designed to analyze targeted sequencing data, which are captured and enriched with sequence-specific probes, and TSD can utilize the redundant reads for accurate SV discovery. (3) TSD can display the full structural profile of the complex SVs in the form of plots.
To evaluate its performance, we generated simulated PacBio data comprising 1,000,000 reads built from 5,000,000 genomic fragments. These reads covered diverse genomic regions, including tandem repeats, interspersed repeats and coding regions. In the evaluation of mapping ability, TSD accurately identified the genomic location of 99.4% of the DNA fragments (see Figure 1(d)). The fragments mapped to wrong locations were found to come from repeat regions, especially satellite DNA. These errors were mainly caused by the error-tolerant parameter settings that BWA uses for PacBio reads, which make BWA less specific for large tandem repeats. In the evaluation of assembling ability, TSD recovered the SV structure with an accuracy of 100% when using only the reads with accurate mapping locations. In our evaluation, a genomic structure was counted as correct only if both the breakpoint positions and the directions were correct. Both results indicate that TSD can accurately identify the organization structure of complex SVs.
TSD discovers HBV integration events in PLC/PRF/5 cells
TSD was applied to study HBV integration events in HBV-infected PLC/PRF/5 cells. After DNA fragmentation, HBV-specific probes were used to capture and enrich the DNA pieces carrying HBV sequences. Using 2 million subreads as input, TSD discovered 9 HBV integration events, including HBV rearrangement, HBV integration and genomic relocation. In total, 12 chromosomes were involved. Figure 2(a) illustrates an exemplary HBV-integrated region. This region was extremely complex, consisting of 6 fragments, including two HBV rearrangements generated by linking HBV:2656 to HBV:1446 and by linking HBV:2876 to HBV:2694. The left side of the HBV sequence was integrated at chr1:143240209 and the right side at chr8:35446393. The rearranged HBV sequence was about 2400 bp long and no single PacBio read covered the whole region; it was recovered by assembling PacBio reads. As targeted sequencing was performed, redundant reads were mapped to this region, including 538 reads whose location and direction were consistent with the seed read. In this example, each rearrangement and integration site was supported by multiple reads. However, we also noticed that more reads were enriched in the HBV regions and that their abundance was correlated with the length of the HBV sequences. Figure 2(b) shows the genomic organization of the other HBV integration events. As in the example in Figure 2(a), we observed complex genomic organization of HBV sequences in the human genome. Only three integration events had no HBV rearrangement.
To validate the TSD analysis results, we performed next-generation sequencing (NGS) on the same cell line. The integration and rearrangement sites were discovered by analyzing the reads crossing the breakpoints. In Figure 2(b), we marked the integration events validated by the NGS analysis results. We found that all the identified breakpoints could also be discovered in the NGS data with the same location and direction. Overall, these results support the reliability of the TSD analysis results.
Evaluation with other tools
We evaluated the performance of TSD by comparing it with the analysis results of HGAP, a de novo assembly tool developed by Pacific Biosciences (Chin et al. 2013). Using the same PacBio sequencing data from HBV-infected PLC/PRF/5 cells as input, HGAP output the assembled DNA sequences. As the PacBio reads were generated after probe-specific enrichment for the HBV sequence, the output of HGAP consisted of long DNA fragments, 59 sequences in total. We mapped them to the human genome and to the HBV sequence with BLAST. We found one complete integration event and several partial integration events. The complete integration event mapped to both the human genome and the HBV sequence, which allowed us to infer its integration location and direction. It started at chr17:82105786 and ended at chr17:82107610 (see Region 7 in Figure 2(b)). There were also partial integration events on chromosomes 3 (Region 1, left), 4 (Region 9, left), 5 (Region 4, right) and 11 (Region 8, left), where only the left or the right integration site was discovered. By checking the integration locations for both complete and partial integration events, we found that TSD discovered all these events at exactly the same genomic and HBV locations. In addition, TSD discovered the corresponding matched site for all the partial integration events. Overall, TSD is better at discovering the complex genomic organization structure of targeted sequences than the de novo assembly tool.
Another evaluation was performed with Sniffles, a tool designed for SV discovery (Sedlazeck et al. 2018). Using the PacBio sequencing data, Sniffles identified many SVs. After filtering out the SVs without HBV integration or rearrangement and those with low read coverage, we identified 16 HBV integration sites. Of these, 14 were also discovered by TSD. According to the NGS analysis results, all 14 overlapping integration sites could be validated as true positives, while the remaining two could not. Thus, all the validated predictions reported by Sniffles were also identified by TSD, which is consistent with the algorithm settings of TSD. Meanwhile, the emphasis of TSD is on generating the full profile of complex genomic structure. In contrast, extra analysis is needed to generate a similar profile from the output of Sniffles.
In Table 1, we summarize the evaluation results of the three tools and the NGS analysis. By checking the HBV integration events, we observed several reasons for the differences in output. For example, long SVs, especially those not covered by any single long read, require computational tools to split and assemble the reads to obtain a full profile of the SVs. This is affected by the sequencing depth and quality, the accuracy of the computational analysis and the availability of a reference genome. For long complex SVs, TSD can achieve better performance. Meanwhile, the genomic location of SVs can also affect SV discovery: repeat regions may lead alignment tools to report wrong genomic locations.
Table 1. Evaluation with HGAP and Sniffles.
Region | HGAP Int. | HGAP Rea. | Sniffles Int. | Sniffles Rea. | TSD Int. | TSD Rea. | NGS Int. | NGS Rea.
---|---|---|---|---|---|---|---|---
Region1 | left | no | right | no | both | yes | right | —
Region2 | no | no | both | no | both | yes | both | —
Region3 | no | no | right | no | both | yes | both | —
Region4 | right | no | both | yes | both | yes | both | —
Region5 | no | no | both | yes | both | yes | right | —
Region6 | no | no | both | yes | both | yes | both | —
Region7 | both | yes | both | yes | both | yes | both | —
Region8 | left | yes | both | yes | both | yes | both | —
Region9 | left | yes | both | yes | both | yes | both | —
Int.: Integration sites; Rea.: Rearrangement sites; left/right: only the left/right site of an integration event is identified.
Overall, our evaluation suggests that TSD can achieve an equal or better performance in identifying the organization structure of complex SVs, especially when the SVs consist of multiple rearrangement events.
Command-line implementation of TSD
TSD is written in Perl and runs from the command line. Before use, two prerequisites must be met. First, BWA must be installed and its location added to the $PATH variable on a Linux system. Second, the genome index must be built using the “bwa index” command.
To simplify usage, TSD has only two mandatory inputs: “-s” and “-G”. The former is the path to the PacBio reads file in “*.fastq” or “*.fq” format, and the latter specifies the prefix of the BWA genome index. Users can use “-i” to specify the targeted sequence. If the targeted sequence is foreign DNA, e.g., a virus sequence, the “-i” option is mandatory because TSD cannot assemble the targeted sequences de novo. The whole analysis can be run with a command like:
perl LongAssembly.pl -s PacBio_reads.fastq -G genome.index -i virus.fa
The other parameters are optional. By default, TSD is optimized for targeted sequencing data, in which each SV is expected to be supported by multiple PacBio reads. When the input data are not generated by targeted sequencing, the “-r” and “-o” options should be changed, e.g., to “-r 1” and “-o 1”, to capture regions with low sequencing depth. However, this may generate redundant output. To organize the output, users can use the “-d” option to set the directory in which output and temporary files are stored, which also allows an interrupted analysis to be continued (see the software manual for detailed information).
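For scripted runs, the command line can be assembled programmatically. The helper below is a hypothetical wrapper and not part of TSD; it uses only the options named above (“-s”, “-G”, “-i”, “-r”, “-o”, “-d”) and makes no assumption about what “-r” and “-o” mean beyond passing their values through.

```python
import shlex

def tsd_command(reads, genome_index, target=None, r=None, o=None, out_dir=None):
    """Build the argument list for LongAssembly.pl from the documented options."""
    cmd = ["perl", "LongAssembly.pl", "-s", reads, "-G", genome_index]
    # Optional flags are added only when a value is supplied.
    for flag, value in (("-i", target), ("-r", r), ("-o", o), ("-d", out_dir)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

# Non-targeted data: relax -r and -o as suggested in the text.
cmd = tsd_command("PacBio_reads.fastq", "genome.index",
                  target="virus.fa", r=1, o=1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once BWA and the genome index are in place.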
Footnotes
Communicating Editor: M. Cherry
Literature Cited
- Abel H. J., Duncavage E. J., 2013. Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches. Cancer Genet. Cytogenet. 206: 432–440. 10.1016/j.cancergen.2013.11.002
- Brandler W. M., Antaki D., Gujral M., Noor A., Rosanio G., et al., 2016. Frequency and complexity of de novo structural mutation in autism. Am. J. Hum. Genet. 98: 667–679. 10.1016/j.ajhg.2016.02.018
- Chaisson M. J., Tesler G., 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): Application and theory. BMC Bioinformatics 13: 238.
- Chin C. S., Alexander D. H., Marks P., Klammer A. A., Drake J., et al., 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10: 563–569. 10.1038/nmeth.2474
- English A. C., Salerno W. J., Hampton O. A., Gonzaga-Jauregui C., Ambreth S., et al., 2015. Assessing structural variation in a personal genome-towards a human reference diploid genome. BMC Genomics 16: 286. 10.1186/s12864-015-1479-3
- Ferrarini M., Moretto M., Ward J. A., Šurbanovski N., Stevanović V., et al., 2013. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics 14: 670.
- Feuk L., Carson A. R., Scherer S. W., 2006. Structural variation in the human genome. Nat. Rev. Genet. 7: 85–97. 10.1038/nrg1767
- Koren S., Walenz B. P., Berlin K., Miller J. R., Bergman N. H., et al., 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27: 722–736.
- Li H., Durbin R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760. 10.1093/bioinformatics/btp324
- Liao Y. C., Lin S. H., Lin H. H., 2015. Completing bacterial genome assemblies: Strategy and performance comparisons. Sci. Rep. 5: 8747.
- Lupski J. R., 2015. Structural variation mutagenesis of the human genome: Impact on disease and evolution. Environ. Mol. Mutagen. 56: 419–436. 10.1002/em.21943
- Meng G., 2018. Transgener: a one-stop tool for transgene integration and rearrangement discovery using sequencing data. bioRxiv. 10.1101/462267
- Rhoads A., Au K. F., 2015. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13: 278–289. 10.1016/j.gpb.2015.08.002
- Sedlazeck F. J., Rescheneder P., Smolka M., Fang H., Nattestad M., et al., 2018. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15: 461–468. 10.1038/s41592-018-0001-7
- Sudmant P. H., Rausch T., Gardner E. J., Handsaker R. E., Abyzov A., et al., 2015. An integrated map of structural variation in 2,504 human genomes. Nature 526: 75–81. 10.1038/nature15394
- Tarailo-Graovac M., Chen N., 2009. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics Chapter 4: Unit 4.10. 10.1002/0471250953.bi0410s25
- Truty R., Paul J., Kennemer M., Lincoln S. E., Olivares E., et al., 2019. Prevalence and properties of intragenic copy-number variation in Mendelian disease genes. Genet. Med. 21: 114–123.
- Tubio J. M. C., 2015. Somatic structural variation and cancer. Brief. Funct. Genomics 14: 339–351. 10.1093/bfgp/elv016
- Weischenfeldt J., Symmons O., Spitz F., Korbel J. O., 2013. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 14: 125–138. 10.1038/nrg3373
- Zhao L.-H., Liu X., Yan H.-X., Li W.-Y., Zeng X., et al., 2016a. Genomic and oncogenic preference of HBV integration in hepatocellular carcinoma. Nat. Commun. 7: 12992. 10.1038/ncomms12992
- Zhao X., Emery S. B., Myers B., Kidd J. M., Mills R. E., 2016b. Resolving complex structural genomic rearrangements using a randomized approach. Genome Biol. 17: 126. 10.1186/s13059-016-0993-1