Abstract
Third-generation sequencing can be used in human cancer genomic and epigenomic research. Oxford Nanopore Technologies (ONT) recently released the R10.4 flow cell, which is claimed to offer improved read accuracy compared with the R9.4.1 flow cell. To evaluate the benefits and limitations of the R10.4 flow cell for cancer cell profiling on MinION devices, we used the human non-small-cell lung-carcinoma cell line HCC78 to construct libraries for both single-cell whole-genome amplification (scWGA) and whole-genome shotgun sequencing. The R10.4 and R9.4.1 reads were benchmarked in terms of read accuracy, variant detection, modification calling, and genome recovery rate, and were compared with next-generation sequencing (NGS) reads. The results showed that R10.4 reads outperform R9.4.1 reads, achieving a higher modal read accuracy of over 99.1%, superior variation detection, a lower false-discovery rate (FDR) in methylation calling, and a comparable genome recovery rate. To achieve scWGA sequencing yields on the ONT platform as high as those of NGS, we recommend multiple displacement amplification with a modified T7 endonuclease Ⅰ cutting procedure as a promising method. In addition, we provide a possible solution for filtering likely false-positive sites across the whole genome in R10.4 data by using scWGA sequencing results as a negative control. Our study is the first benchmark of whole-genome single-cell sequencing using ONT R10.4 and R9.4.1 MinION flow cells, and it clarifies the capacity for genomic and epigenomic profiling within a single flow cell. A promising method for scWGA sequencing, together with the methylation calling results, can benefit researchers who work on cancer cell genomic and epigenomic profiling using third-generation sequencing.
Keywords: Long read, Nanopore DNA sequencing, Methylation, Single-cell whole genome amplification sequencing, Whole genome shotgun sequencing
1. Introduction
The development of sequencing-based techniques has facilitated genomic research aimed at understanding human diseases, especially rare diseases and cancers [1], [2], [3]. With the introduction of next-generation sequencing (NGS), whole-genome shotgun (WGS) sequencing technologies have been applied to profile the genomic landscape of human diseases. However, short reads, GC-content bias, and low-complexity sequences prevent NGS from providing a complete picture of the genome and methylome, although algorithms and techniques have been designed to optimize the results by estimating biased and ambiguous sequences [4]. Towards a complete and unbiased genome map, third-generation sequencing departs from NGS by directly inspecting single molecules, without wash steps during DNA synthesis, to obtain long reads [5]. Oxford Nanopore Technologies (ONT or Nanopore) is a third-generation sequencing platform that enables long-read sequencing and direct modification calling to be performed on a portable sequencer (a MinION sequencer). Compared with other third-generation sequencing platforms such as PacBio, ONT MinION is a flexible platform with a short turnaround time [6]. Library preparation for a native DNA sample using the Ligation Sequencing Kit can be completed within 2 h, without the amplification step required in NGS. Basecalling can be performed in real time from the squiggle, the disruption in current produced as a molecule passes through the sequencing pore. This rapid turnaround, together with the small size of the MinION sequencer, benefits clinical applications such as pathogen detection in under 6 h [7]. In particular, the MinION sequencer also aids cancer diagnosis and treatment by revealing both genomic and epigenomic patterns in cancer cells [8].
However, the high error rates of ONT reads are key concerns for users who wish to migrate from NGS-based DNA sequencing. R9.4.1 flow cells combined with the updated super-accurate Guppy (Version 6) basecalling model can provide reads with a modal accuracy of 97.6%, equivalent to a Phred score of Q16 [9], [10]. Moreover, the recently released R10.4 flow cells can achieve a modal accuracy of Q20 for native reads, which is comparable with that of NGS reads [10]. Owing to its significantly better accuracy than R9.4.1, R10.4 enables microbial genome assembly from metagenomic samples without NGS reads or reference polishing [10]. Additionally, the average sequence accuracy improved from 96.52% for R9.4.1 to 98.34% for R10.4 when sequencing the SARS-CoV-2 whole genome, and the mutation analysis was consistent with the NGS results [11]. However, the increased accuracy of R10.4 flow cells is coupled with a decreased yield. Thus, there is a need for a comprehensive benchmark to illustrate the benefits and limitations of using R10.4 flow cells and Q20+ chemistry.
The analysis of ONT reads is an especially powerful approach for the detection of DNA modification, as the differences between the signals of modified nucleotides and those of canonical nucleotides enable the direct detection of whole-genome DNA modifications from ONT raw signals. Compared with the whole genome bisulfite sequencing results, the ONT reads have less bias and cover more CpG sites [12]. R9.4.1 flow cells can be used in combination with deep learning tools such as Megalodon [13] to detect 5-methylcytosine (5mC) in CpG sites with an accuracy of over 95% [14]. Therefore, the 5mC calling ability of the R10.4 flow cell must be benchmarked against that of the R9.4.1 flow cell, as the R10.4 flow cell has a higher read accuracy but a lower yield than the R9.4.1 flow cell [15].
Single-cell whole genome amplification (scWGA) sequencing can provide extensive information on target cells. The limited amount of DNA available from rare samples, including tumour biopsies, circulating tumour cells and embryos, is a challenge for sequencing, and scWGA may therefore be valuable in clinical research and diagnoses [16], [17], [18]. scWGA sequencing has been used for the detection of alterations in cancer genomes [19], [20], [21] and the preimplantation genetic screening of embryos [22] at the individual cell level. Owing to the limited amount of DNA in a single cell, it must be subjected to whole-genome amplification (WGA) before sequencing [23]. The two WGA methods commonly used on NGS platforms are multiple displacement amplification (MDA) [24] and the multiple annealing and looping-based amplification cycles (MALBAC) method [25]. However, NGS reads cannot detect large variations, and their coverage of certain genomic regions may be limited owing to sequencing bias. Long reads can aid the detection of large variations and provide read-level support for structural variations (SVs) and copy number variations (CNVs) [26], [27]. Moreover, library preparation in Nanopore is PCR-free, which avoids the amplification bias encountered in NGS library preparation. However, there has been little application of single-cell DNA sequencing on third-generation platforms, and the performance of scWGA sequencing on the ONT platform remains to be evaluated.
We chose the HCC78 cell line for our sequencing benchmark because it has various levels of genomic variations. This cell line was established from the pleural effusion of a 65-year-old man with non-small-cell lung carcinoma (NSCLC) and carries known SVs, such as the ROS1 rearrangement, and single-nucleotide variations (SNVs) in the TP53 gene [28], [29], [30]. Moreover, TP53 mutations may lead to genomic instability by causing genome rearrangements, deletions, or duplications [31], [32], which manifest as CNVs. Therefore, we expected the HCC78 cell line to harbor multiple novel CNVs. We also recognized that the various calling abilities of different Nanopore flow cells could be well characterized by sequencing the HCC78 cell line.
Therefore, in this study, we performed WGS and scWGA sequencing of the HCC78 cell line and benchmarked Nanopore R10.4 and R9.4.1 reads in terms of read quality, variation calling, and methylation detection to evaluate their potential utility (Supplementary Fig. 1). Moreover, we demonstrated a promising method of scWGA sequencing using ONT, which can be applied to reduce the false discovery rate in methylation calling, and we clarified the tasks that can be performed using a single MinION R9.4.1 or R10.4 flow cell.
2. Materials and methods
2.1. Cell culture
HCC78 cells were maintained in Dulbecco's Modified Eagle Medium with 10% fetal bovine serum and 1% Penicillin-Streptomycin-Glutamine in a humid atmosphere with 5% CO2 at 37 °C. All the reagents for cell culture were Gibco products.
2.2. Single-cell whole genome amplification
Single HCC78 cells were acquired using the KuipicK system (NeuroInDx, Inc., US) and processed with the REPLI-g Single Cell Kit (Qiagen, Germany) and the MALBAC® Single Cell WGA Kit (Yikon, China) for whole genome amplification (WGA). The products were purified with VAHTS DNA Clean Beads (Vazyme, China) according to the protocol and quantified using the Qubit™ 1X dsDNA HS Assay Kit (Thermo, US). The amplification efficiency of the WGA products was validated before sequencing by qPCR with six targets on different chromosomes (Yikon, China).
2.3. Library preparation and sequencing
Approximately 2 μg of genomic DNA extracted from HCC78 cells using E.Z.N.A.® Tissue DNA Kit (Omega Bio-tek, Inc., US) was used for library preparation according to Nanopore genomic DNA ligation sequencing protocol. Two individual libraries constructed by SQK-LSK110 or SQK-LSK112 (Q20EA) kits were loaded on R9.4.1 or R10.4 flow cells, respectively, and sequenced with the MinION Mk1B sequencer.
Purified amplification products (5 μg) from the REPLI-g Single Cell Kit (Qiagen) were used as input for scWGA sequencing on the Nanopore platform. T7 endonuclease Ⅰ (1.5 μL, NEB) and 3 μL of 10 × NEBuffer 2 were mixed with the amplified products in a 30 μL reaction. The reactions were incubated for 2 h at 37 °C in a thermal cycler. TE buffer (pH 8) was used to bring the mix to a total volume of 50 μL. The product was purified with 0.75 × AMPure XP beads. The DNA pellet was resuspended and eluted in 25.5 μL of nuclease-free water, mixed with 1.5 μL of T7 endonuclease Ⅰ and 3 μL of 10 × NEBuffer 2, and incubated for another 2 h at 37 °C. The purified DNA was diluted to a final volume of 48 μL and used for library construction with SQK-LSK110 or SQK-LSK112. The constructed libraries were loaded on R9.4.1 and R10.4 flow cells, respectively, and sequenced on the MinION Mk1B sequencer.
2.4. Read basecalling
All raw ONT .fast5 data were basecalled into .fastq reads using Guppy v.6.0.1 with the “sup” and “hac” models on an NVIDIA RTX 3090 GPU. The config files “dna_r9.4.1_450bps_sup.cfg” and “dna_r9.4.1_450bps_hac.cfg” were used for R9.4.1 reads, and “dna_r10.4_e8.1_sup.cfg” and “dna_r10.4_e8.1_hac.cfg” were used for R10.4 reads. Reads shorter than 200 bp were discarded.
2.5. Reads quality statistics and homopolymer analysis
The reads were aligned to the human reference genome (GRCh38, hg38) using Minimap2 v.2.22 [33] with the argument “--secondary=no”. Reads spanning gap regions in the GRCh38 genome or containing large indels (indel length > 100 bp) were discarded, and only those with a high mapping quality (MAPQ = 60) were retained for downstream quality analysis. To compare the read accuracy of different libraries, we calculated the estimated accuracy (Eq. 1), observed accuracy (Eq. 2), substitution proportion (Eq. 3), insertion proportion (Eq. 4), and deletion proportion (Eq. 5) for each aligned read:
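$$\text{Estimated accuracy} = 1 - \frac{1}{N}\sum_{i=1}^{N} 10^{-q_i/10} \tag{1}$$
$$\text{Observed accuracy} = \frac{N(\mathrm{mat})}{N(\mathrm{total})} \tag{2}$$
$$\text{Substitution proportion} = \frac{N(\mathrm{sub})}{N(\mathrm{total})} \tag{3}$$
$$\text{Insertion proportion} = \frac{N(\mathrm{ins})}{N(\mathrm{total})} \tag{4}$$
$$\text{Deletion proportion} = \frac{N(\mathrm{del})}{N(\mathrm{total})} \tag{5}$$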
Here, N was the number of bases in each read and q_i was the Guppy quality score of the i-th base. N(sub), N(mat), N(ins) and N(del) were the numbers of substitutions, matches, insertions, and deletions in each read, respectively. N(total) was the sum of the numbers of substitutions, matches, insertions, and deletions. For the read accuracy in homopolymers, we summarized the accuracy (match proportion) of each homopolymer with a length > 3 using the Pysam v0.18.0 package [34]. Only the homopolymers shared by the four libraries with depths between 3 and 6 were extracted for comparison.
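The per-read metrics described above can be sketched in Python. This is a minimal illustration, assuming that the estimated accuracy averages the per-base error probabilities 10^(-q/10) implied by the Guppy quality scores and that each proportion divides a count by N(total). The function names are hypothetical and are not taken from the authors' scripts.

```python
def estimated_accuracy(quality_string, offset=33):
    """Mean per-base accuracy implied by a Phred+33 FASTQ quality string."""
    if not quality_string:
        raise ValueError("empty quality string")
    # Each Phred score q corresponds to an error probability of 10^(-q/10).
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return 1.0 - sum(error_probs) / len(error_probs)


def alignment_proportions(n_mat, n_sub, n_ins, n_del):
    """Observed accuracy and error proportions from per-read alignment counts."""
    n_total = n_mat + n_sub + n_ins + n_del
    if n_total == 0:
        raise ValueError("read has no aligned positions")
    return {
        "observed_accuracy": n_mat / n_total,
        "substitution": n_sub / n_total,
        "insertion": n_ins / n_total,
        "deletion": n_del / n_total,
    }


if __name__ == "__main__":
    # The character '5' encodes Q20 under the Phred+33 convention.
    print(estimated_accuracy("5" * 10))
    print(alignment_proportions(n_mat=960, n_sub=20, n_ins=10, n_del=10))
```

In practice the four counts would come from the alignment itself, for example from a CIGAR string together with a mismatch count; how they are extracted is left out of this sketch.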
2.6. DNA methylation analysis
5mC DNA methylation at CpG sites was detected by Megalodon v.2.4.1 [35] with the Remora super-accurate model [36]. Each chromosome was split into 500 kb bins, and the mean methylation proportion was calculated for each bin using in-house scripts (https://github.com/lrslab/Benchmarking_for_ONT_reads). The correlation coefficient (R value) between bins from R10.4 and R9.4.1 was calculated with the R package ggpubr using the argument method = “pearson”. The distribution of 5mC along the TSS and gene regions was plotted with the R package EnrichedHeatmap [37]. The 5mC distributions of R10.4 and R9.4.1 reads were merged to the same regions for comparison. False positive sites were used as input for findMotifsGenome.pl in the HOMER package [38] to find genomic motifs. The 5mC proportion at different read coverages was calculated by subsampling the Megalodon modified-base BAM (modbam) file.
To better understand the difference in methylation calling performance between R9.4.1 and R10.4, we used the reduced representation bisulfite sequencing (RRBS) data of the HCC78 cell line from the Cancer Cell Line Encyclopedia (CCLE) (file name: CCLE_RRBS_TSS1kb_20181022.txt) as the reference for comparison with the ONT data. Before the comparison, 1 kb intervals with a mean depth of less than 10 were filtered out, leaving 10,075 intervals for our analysis. To build a performance matrix, the methylation proportion of each interval was converted to a binary representation: an interval with a methylation proportion of less than 10% was treated as unmodified, while an interval with a methylation proportion larger than 80% was treated as modified. A similar transformation was applied to our ONT methylation profiles. The transformed R9.4.1 and R10.4 methylation profiles were then compared with the bisulfite sequencing data to plot receiver operating characteristic (ROC) curves, and the area under the curve (AUC) was used to summarize the performance of R9.4.1 and R10.4 methylation calling.
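The binarization and AUC steps can be illustrated with a small Python sketch. It makes two assumptions that the text does not spell out: intervals whose reference methylation falls between the two thresholds are dropped, and the ONT methylation proportion is used as a continuous score when computing the AUC. The helper names are hypothetical, and the pairwise AUC is written for clarity, not for speed.

```python
def binarize(proportion, low=0.10, high=0.80):
    """Return 0 (unmodified), 1 (modified), or None for intermediate values."""
    if proportion < low:
        return 0
    if proportion > high:
        return 1
    return None


def auc(labels, scores):
    """Area under the ROC curve as the probability that a modified interval
    scores higher than an unmodified one (ties count as one half)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values only; these are not taken from the RRBS or ONT data.
    rrbs = [0.02, 0.95, 0.50, 0.05, 0.90]
    ont = [0.10, 0.85, 0.40, 0.30, 0.20]
    pairs = [(binarize(r), o) for r, o in zip(rrbs, ont) if binarize(r) is not None]
    labels = [lab for lab, _ in pairs]
    scores = [score for _, score in pairs]
    print(auc(labels, scores))
```

For the roughly ten thousand intervals used here, the quadratic loop is still tractable; a rank-based formulation would be preferable for larger inputs.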
2.7. Variation detection
The reference genome we used was the GRCh38.p13/hg38 human reference genome, which was downloaded from Ensembl (Ensembl genome browser 109). The reads were aligned to the reference using Minimap2 v.2.22 [33] with default arguments for ONT data and BWA v0.7.17 with “mem” for NGS data. The .bam files were then used as input for Control-FREEC v.11.6 [39] to detect CNVs. The config file included the parameters “window = 1000000, ploidy = 2, breakPointThreshold = 0.8, sex = XY”. We chose “ploidy = 2” according to the ploidy values inferred using the ABSOLUTE algorithm from CCLE (file name: CCLE_ABSOLUTE_combined_20181227.xlsx, sheet name: ABSOLUTE_combined.table), which indicate that the ploidy of the HCC78 cell line is 2.15. The copy number in every 1 Mb window was calculated from the Control-FREEC output file, and the averages were normalized to 2.
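One simple reading of the normalization step is a rescaling of the per-window copy numbers so that their mean equals 2. The sketch below shows that reading only; it is an assumption about the procedure, and the function name is hypothetical.

```python
def normalize_copy_numbers(window_copy_numbers, target_mean=2.0):
    """Rescale per-window copy numbers so that their average equals target_mean."""
    if not window_copy_numbers:
        raise ValueError("no windows supplied")
    mean = sum(window_copy_numbers) / len(window_copy_numbers)
    if mean == 0:
        raise ValueError("mean copy number is zero")
    scale = target_mean / mean
    return [cn * scale for cn in window_copy_numbers]


if __name__ == "__main__":
    # Toy copy numbers for three 1 Mb windows.
    print(normalize_copy_numbers([2.0, 3.0, 4.0]))
```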
Structural variants (SVs) were called using cuteSV v1.0.13 [40] with the parameter settings suggested for ONT data, along with “min_size=1000” and “min_support” values from 3 to 5. The breakpoints of the SVs were further confirmed with Integrative Genomics Viewer v2.12.3 [41].
Single nucleotide variation (SNV) calling was performed using BCFtools v1.15.1 [42] with the command “bcftools mpileup -f hg38.fasta file.bam | bcftools call -mv -o file.vcf”, where “file.bam” and “file.vcf” were the alignment file and the variation calling result file, respectively. The homozygous variation records with QUAL >= 20 were collected for comparison with 445 known variations in the HCC78 cell line from DepMap. To better understand how sequencing platforms and basecalling models differ in their capability for SNV identification, we calculated performance metrics for each ONT library using the results from NGS data from CCLE (BioProject: PRJNA523380, SRA Experiment: SRR8619116) as the baseline. To reduce the effect of coverage on our results, we subsampled the data of the other three ONT libraries to the same yield as the WGS R10.4 data and called the variations again. Only sites with a depth over 4 in the .vcf file were used. Here, the depth is the sum of the numbers of forward reference alleles, reverse reference alleles, forward non-reference alleles, and reverse non-reference alleles, which can be found in the DP4 tag of the .vcf file. Sites with a variant allele frequency (VAF) over 0.95 were regarded as real SNV sites. For each ONT library, the sites shared between the NGS result and the library itself were used for calculating sensitivity (Eq. 6) and specificity (Eq. 7).
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{6}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{7}$$
Here, TP (true positive) means the number of bases regarded as the SNV from both reference and prediction, while TN (true negative) means the number of sites that were not. FP (false positive) and FN (false negative) were the numbers of those bases that were regarded as the SNV by prediction but not by reference, and the opposite situation, respectively.
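The site filtering and the two metrics can be sketched as follows. The sketch assumes that the four strand-specific allele counts are read from the DP4 INFO tag written by bcftools, that the VAF is the non-reference fraction of that depth, and that sensitivity and specificity take their standard forms TP/(TP+FN) and TN/(TN+FP). All function names are hypothetical, and the sites in the usage example are invented.

```python
def parse_dp4(info_field):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from a VCF INFO string."""
    for entry in info_field.split(";"):
        if entry.startswith("DP4="):
            return tuple(int(x) for x in entry[4:].split(","))
    raise ValueError("INFO field has no DP4 tag")


def depth_and_vaf(info_field):
    """Depth as the sum of the four DP4 counts, and the non-reference fraction."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = parse_dp4(info_field)
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    vaf = (alt_fwd + alt_rev) / depth if depth else 0.0
    return depth, vaf


def is_confident_snv(info_field, min_depth=4, min_vaf=0.95):
    """Keep sites with depth over min_depth and VAF over min_vaf."""
    depth, vaf = depth_and_vaf(info_field)
    return depth > min_depth and vaf > min_vaf


def sensitivity_specificity(reference_calls, predicted_calls):
    """Both arguments map a site to True (SNV) or False (no SNV).
    Only sites present in both dictionaries are evaluated."""
    shared = reference_calls.keys() & predicted_calls.keys()
    tp = sum(1 for s in shared if reference_calls[s] and predicted_calls[s])
    tn = sum(1 for s in shared if not reference_calls[s] and not predicted_calls[s])
    fp = sum(1 for s in shared if not reference_calls[s] and predicted_calls[s])
    fn = sum(1 for s in shared if reference_calls[s] and not predicted_calls[s])
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    print(depth_and_vaf("DP=12;DP4=0,0,6,5;MQ=60"))
    ngs = {"chr1:100": True, "chr1:200": True, "chr1:300": False, "chr1:400": False}
    ont = {"chr1:100": True, "chr1:200": False, "chr1:300": False, "chr1:400": True}
    print(sensitivity_specificity(ngs, ont))
```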
2.8. Genome recovery rate and bias
We summed up the total genome size of the uncovered region and defined the genome coverage ratio as
$$\text{Genome coverage ratio} = \frac{N(\mathrm{all}) - N(\mathrm{uncov})}{N(\mathrm{all})} \tag{8}$$
N(all) represented the total length of reference genome GRCh38, and N(uncov) was the length of reference genome not covered with the read alignments.
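As a small worked example, the ratio can be computed from a list of aligned intervals on a single sequence. The sketch assumes half-open (start, end) coordinates and takes the ratio to be (N(all) - N(uncov)) / N(all); the function names are hypothetical, and the actual analysis relied on bedtools.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def genome_coverage_ratio(covered_intervals, genome_length):
    """Fraction of the reference covered by at least one alignment."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(covered_intervals))
    n_uncov = genome_length - covered
    return (genome_length - n_uncov) / genome_length


if __name__ == "__main__":
    # Toy alignments on a 100 bp reference; positions 80-89 remain uncovered.
    print(genome_coverage_ratio([(0, 50), (40, 80), (90, 100)], 100))
```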
The intersected length of uncovered regions and repeat annotation were calculated using “intersect” and “subtract” functions from bedtools v.2.30.0 [43] (see online code for detail). Before counting the summed length of intersection, the gap regions (represented as “N” base) from the reference genome were removed, and the fragments less than 100 bp were also discarded. The repeat types were divided into different classes, including DNA, LINE, SINE, LTR, RC/Helitron, and unknown repeat.
2.9. Data and materials availability
All .fastq reads used in this study, from both Nanopore and Illumina platforms, have been submitted to the NCBI Sequence Read Archive (SRA) as part of BioProject: PRJNA875576.
All of the data processing and figure plotting scripts used in this study are available at https://github.com/lrslab/Benchmarking_for_ONT_reads.
3. Results
3.1. Read statistics
Two types of libraries were built for benchmarking of the R10.4 and R9.4.1 flow cells: (1) WGS sequencing libraries, built using bulk DNA from HCC78 cells, and (2) scWGA sequencing libraries, built with MDA-processed HCC78 single-cell DNA. After basecalling, the raw yields were 12.8 Gb, 26.2 Gb, 18.8 Gb, and 28.8 Gb for WGS R10.4, WGS R9.4.1, scWGA R10.4, and scWGA R9.4.1 reads, respectively (Supplementary Table 1). The R10.4 flow cell generated approximately half the number of reads generated by the R9.4.1 flow cell in both WGS and scWGA sequencing, which was attributable to these flow cells’ different run speeds (200 and 420 bp/s for R10.4 and R9.4.1 flow cells, respectively).
Several filters were applied to obtain clean data for further benchmarking: reads with a length smaller than 200 bp, low mapping quality (MAPQ < 60), or large insertions and deletions (indels; > 100 bp) were discarded. After filtering, over 80% of the high-quality reads were retained (Supplementary Fig. 2). The proportions and details of reads retained after each filtering step are given in Supplementary Table 2.
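The three filters can be expressed as a single predicate on an alignment record. The sketch below parses the CIGAR string with a regular expression and uses the thresholds stated above; the function names are hypothetical, and the additional exclusion of reads spanning reference gap regions is not covered.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def max_indel_length(cigar):
    """Length of the longest insertion (I) or deletion (D) in a CIGAR string."""
    lengths = [int(n) for n, op in CIGAR_OP.findall(cigar) if op in "ID"]
    return max(lengths, default=0)


def keep_alignment(read_length, mapq, cigar,
                   min_length=200, min_mapq=60, max_indel=100):
    """True if the read passes the length, mapping-quality and indel filters."""
    return (read_length >= min_length
            and mapq >= min_mapq
            and max_indel_length(cigar) <= max_indel)


if __name__ == "__main__":
    print(keep_alignment(5000, 60, "2000M5I2995M"))   # small insertion only
    print(keep_alignment(5000, 60, "100M150D50M"))    # contains a 150 bp deletion
```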
3.2. Read accuracy
A key requirement of high-throughput sequencing is that the estimated quality score must reflect the real quality of the reads. To determine read accuracy, read quality scores estimated by Guppy (version 6.0.1) were compared with those calculated by mapping reads back to the reference genome (GRCh38) (Fig. 1). In the latter calculation (observed read accuracy), every inconsistency between the read and reference genome, such as a substitution or an indel (an insertion or a deletion), was treated as an error.
Fig. 1.
The quality of nanopore reads from whole-genome shotgun (WGS) and single-cell whole-genome amplification (scWGA) sequencing using R9.4.1 and R10.4 flow cells. A R10.4 reads from both WGS and scWGA sequencing libraries outperformed the R9.4.1 reads in terms of the accuracy estimated by the Guppy basecaller (grey) and the accuracy observed from read mapping (white), shown with medians and quartiles. The boxplots also show that the observed read accuracy is higher than the estimated accuracy. B The density distribution plot indicates that the R10.4 reads had a higher modal read accuracy than the R9.4.1 reads in both estimated (top) and observed (bottom) accuracy. C R10.4 had a higher average accuracy than R9.4.1 on homopolymers ranging from 4 to 9 bp for both WGS and scWGA libraries. The plot also shows that nanopore reads are more accurate on adenine (A) and thymine (T) homopolymers than on cytosine (C) and guanine (G) homopolymers.
The observed accuracy of R10.4 reads was approximately 1% higher than that of R9.4.1 reads for both WGS and scWGA sequencing (Fig. 1A). The average observed accuracy for R10.4 reads was approximately 96.8%, corresponding to a Phred score of Q15, which is considerably higher than that of R9.4.1 reads (Q13.5). Furthermore, the modal accuracy for R10.4 reads was approximately 99.2%, corresponding to a Phred score of Q21 (Fig. 1B), considerably higher than that of R9.4.1 reads (Q17). The accuracy of R10.4 reads was higher than that of R9.4.1 reads in terms of both indel and nucleotide substitution, especially in terms of deletions (Supplementary Fig. 3A).
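The correspondence between accuracy and Phred score quoted above follows from Q = -10 · log10(1 - accuracy). A short Python check, with hypothetical function names:

```python
import math


def accuracy_to_phred(accuracy):
    """Phred-scaled quality for a read accuracy given as a fraction in [0, 1)."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q):
    """Accuracy implied by a Phred-scaled quality score."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for acc in (0.968, 0.992):
        print(acc, round(accuracy_to_phred(acc), 1))
```

An accuracy of 96.8% gives about Q14.9, which rounds to Q15, and 99.2% gives about Q21.0, in line with the values reported in the text.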
Most of the estimated read accuracies were lower than the observed read accuracies, although the estimated and observed accuracies for R9.4.1 WGS reads were similar (Fig. 1A and 1B). The greatest difference between the estimated and observed quality scores was in the mean and modal accuracies of R10.4 scWGA reads. This indicates that the Guppy basecaller needs to be optimized for scWGA reads, which contain no DNA modifications. The library preparation procedures may also account for the difference between the estimated read accuracy and the observed read accuracy. Moreover, a correlation analysis between the estimated and observed accuracies indicated that the R10.4 quality scoring system had a better correlation than the R9.4.1 quality scoring system, but the quality of each read generated by the R10.4 flow cell seemed to be systematically underestimated (Supplementary Fig. 3B).
To test whether the read accuracy from different chromosomes exhibited any regional bias, we calculated the proportions of insertions, deletions, and substitutions in various genomic regions for the four libraries (WGS R10.4, WGS R9.4.1, scWGA R10.4, and scWGA R9.4.1). For WGS and scWGA reads, the R10.4 flow cell outperformed the R9.4.1 flow cell in all regions (Supplementary Fig. 3C). We further split the whole genome into 500 kb bins and calculated the average observed read accuracy in each bin for reads from the four libraries. The results showed that 5064 out of 5502 bins from WGS reads and 5482 out of 5502 bins from scWGA reads were improved by R10.4 in terms of observed read accuracy, while the remaining bins were not improved (Supplementary Fig. 4).
The read error in homopolymers has been a critical problem in Nanopore sequencing since the introduction of R7 flow cells [44]. To determine if R10.4 flow cells generated better homopolymer reads than R9.4.1 flow cells, we evaluated the read accuracies mapped over homopolymer regions of various sizes (4–9 bp), as shown in Fig. 1C. The accuracies of R10.4 reads were higher than those of R9.4.1 reads for all four bases. Moreover, the basecalling of A-base and T-base homopolymers was more accurate than the basecalling of C-base and G-base homopolymers in the four libraries (Supplementary Fig. 3D).
We also assessed all the reads basecalled by the “high accuracy” model. The detailed results can be found in Supplementary Table 3 and Supplementary Fig. 5. All the reads basecalled by the “super accurate” model were more accurate than those basecalled by the “high accuracy” model. Basecalling with the “super accurate” model can take 2–5 times longer than with the “high accuracy” model on our sequencing machine with an RTX 3090. We recommend using the “super accurate” model to basecall both R9.4.1 and R10.4 data for better estimated and observed accuracy if time and computational resources permit.
We also tested whether duplex reads could improve the read accuracy. We re-basecalled the R10.4 reads using the dedicated Guppy duplex basecaller, “guppy_basecaller_duplex”. Only 1% of the WGS reads were identified as duplex, and 93.38% of the bases in the duplex reads were assigned a quality score higher than Q30. No duplex reads could be detected among the R10.4 scWGA reads, and this option is not available for R9.4.1 reads.
3.3. Detection of CpG methylation
R9.4.1 flow cells are the best tools for whole-genome methylation profiling, as they cover more than 99% of all CpG sites in the human genome [12]. However, to the best of our knowledge, the 5mC calling ability of R10.4 flow cells remains unknown. The methylation calling procedure using ONT reads depends on the chemistry and bioinformatics pipeline that is employed. Emerging tools have provided various strategies for methylation calling, which contribute to better prediction accuracy. According to the benchmarking study of Nanopore methylation calling tools, Megalodon can achieve the highest correlation and lowest root mean square error with control datasets among the tools, including Tombo, Nanopolish, Guppy, DeepSignal and DeepMod [13]. Thus, we applied the state-of-the-art tool Megalodon (version 2.4.1) for both R9.4.1 and R10.4 reads to profile 5mC in CpG sites, thereby ensuring the consistency of our bioinformatics pipeline.
We profiled genomic methylation by dividing each chromosome into 500 kb bins and then determining the mean proportion of 5mC in each bin (Fig. 2A). The average proportions of 5mC identified from the R9.4.1 and R10.4 reads were similar (67.44% and 67.70%, respectively). R9.4.1 and R10.4 reads indicated that in the promoter regions of protein-coding genes, 54.61% and 52.54% of the CpG sites were methylated, respectively (Figs. 2B and 2C). There were strong correlations between the results from R9.4.1 and R10.4 reads, both across the 500 kb genomic bins and across 21,455 gene promoters (Fig. 2A and 2B). When the partition was further refined to the level of individual CpG sites, the Pearson correlation coefficient (r) between R9.4.1 and R10.4 reads was 0.64 (data not shown). We further investigated the distribution of DNA methylation across the 3000 bp before and after the transcription start sites (TSSs) of the associated genes (Fig. 2D) and across gene bodies (Fig. 2E). The pattern of 5mC distribution around gene bodies was similar between the R9.4.1 and R10.4 results, indicating a good correlation between the two flow cells in various genomic contexts. We also used the methylation proportion within the promoter region 1 kb upstream of the TSS, generated by RRBS from CCLE, as the gold standard to validate the accuracy of the ONT data. The areas under the curve for R9.4.1 and R10.4 reads were 0.98 and 0.97, respectively, indicating the high accuracy of ONT reads in methylation detection (Supplementary Fig. 6).
Fig. 2.
DNA methylation detected in WGS sequencing reads from R9.4.1 and R10.4 flow cells. The 5-methylcytosine (5mC) level between R10.4 and R9.4.1 data showed high consistency on A whole genome level with 500 kb window and B promoter level. C The distribution of 5mC proportion in whole genome level divided by 500 kb bins and promoter regions detected from R10.4 and R9.4.1 reads. D The distribution of 5mC across the 3000 bp before and after the transcription start sites (TSSs) of the associated genes and E gene bodies.
The false discovery rate (FDR) is a key parameter in methylation detection. In theory, scWGA reads do not contain 5mC signals and can thus be used as a negative control. They can indicate the possible false-positive methylation results when the same methylation calling protocol is applied as for WGS reads. We found that the scWGA reads of R9.4.1 and R10.4 flow cells contained 3.69% and 1.47% 5mC signals, respectively (Figs. 3A and 3B).
Fig. 3.
Methylation profiling of 5mC using R10.4 and R9.4.1 reads and false positive site filtering in mitochondrial region. A The density plot and B boxplot showing the distribution of methylation CpG (meCpG) proportion in each read. C IGV snapshot for a 500 bp region with reads carrying predicted methylation CpG sites. The methylated CpG is highlighted in red. Please note that the scWGA libraries should be free of methylation and the positive sites are likely to be false positives.
The R10.4 reads had a lower background than the R9.4.1 reads, especially in regions with high GC content (Fig. 3C). Given that the prediction models for both R9.4.1 and R10.4 were trained on a similar dataset, we can conclude that the R10.4 flow cell outperformed the R9.4.1 flow cell in terms of the FDR.
To further eliminate the influence of background noise and increase the reliability of the methylation results, we filtered out the CpG sites that contribute to the potential false discovery. The threshold of likely false positives is set as the value of two-fold standard deviations above the average of the scWGA data as the negative control. The sites defined as false positives in scWGA data were filtered out accordingly from WGS data. We first summarized the false positive sites across the whole genome detected above the threshold and discovered the most frequent pattern using HOMER Motif Analysis software and determined the possible cause of false positives. As shown in Homer's results, the most common false positive CG sites generated from R9.4.1 reads were after C-base-rich regions (Supplementary Fig. 7A), which was consistent with the accuracy evaluation results.
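The filtering rule can be sketched as follows: a threshold is set at the mean plus two standard deviations of the scWGA methylation proportions, the scWGA sites above it are flagged, and the same sites are removed from the WGS calls. This is a minimal illustration with hypothetical names and invented values; the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def false_positive_threshold(control_values, n_sd=2.0):
    """Mean plus n_sd standard deviations of the negative-control values."""
    return mean(control_values) + n_sd * pstdev(control_values)


def filter_sites(wgs_sites, control_sites, n_sd=2.0):
    """Remove from wgs_sites every site flagged as a likely false positive
    in the negative control. Both arguments map a site identifier to a
    methylation proportion."""
    threshold = false_positive_threshold(list(control_sites.values()), n_sd)
    false_positive = {s for s, v in control_sites.items() if v > threshold}
    kept = {s: v for s, v in wgs_sites.items() if s not in false_positive}
    return kept, false_positive, threshold


if __name__ == "__main__":
    # Toy negative control: nine unmethylated sites and one spurious call.
    control = {f"s{i}": 0.0 for i in range(9)}
    control["s9"] = 1.0
    wgs = {"s0": 0.8, "s9": 0.9, "s10": 0.5}
    print(filter_sites(wgs, control))
```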
Given the homopolymer results above, in which the basecalling accuracy of C-base and G-base homopolymers from R9.4.1 reads was lower than that of A-base and T-base homopolymers, we suggest that these false positives may relate to the higher error rate in regions with high GC content. We also classified the false positive sites into CpG islands (CpGi), shores (up to 2 kb from CpGi), shelves (from 2 kb to 4 kb from CpGi) and open sea (the rest of the genome). More than 80% of the false positive sites we filtered out were in open sea regions, and fewer CpGi sites were filtered out from R10.4 data than from R9.4.1 data (Supplementary Fig. 7B); filtering may therefore have less effect on the functional analysis of regions at or near the transcription start sites of genes when R10.4 reads are used. It has been reported that there is no CpG methylation within the mitochondrial DNA of human cells [45]. We therefore also checked the mitochondrial region of R10.4 and R9.4.1 reads to evaluate the false discovery rate and filtering efficiency. For each flow cell, the WGS and scWGA reads showed similar 5mC distributions across the mitochondrial genome, while the 5mC proportion from R10.4 reads was lower and more stable than that from R9.4.1 reads (Supplementary Fig. 7C to 7H). After filtering, the sites with high 5mC levels in the WGS data that were defined as false positives in the scWGA data were removed as expected, and the 5mC level across the mitochondrial genome was close to 0 in the R10.4 data (Supplementary Fig. 7E). However, sites with high 5mC levels remained after filtering in the R9.4.1 mitochondrial reads (Supplementary Fig. 7H), which in turn indicated a higher FDR in R9.4.1 reads than in R10.4 reads.
The estimated methylation level in the human genome is a useful measurement for informing cancer diagnoses and methylation inhibitor treatments. Shallow WGS sequencing with Nanopore would be the most convenient way to obtain this information. To test the robustness of methylation calling, we subsampled the total R9.4.1 and R10.4 reads to 20 Mb, 50 Mb, 125 Mb, 250 Mb, 500 Mb, and 3 Gb, 100 times each, and summarized the methylation proportion over all detectable CpG sites. The average proportion of methylated CpG sites (meCpGs) at the whole-genome level was consistent with the original data for both R9.4.1 and R10.4 reads, even when reads with only 20 Mb of yield were considered (Supplementary Fig. 8A). Furthermore, the 5mC proportion of the promoter region detected by R10.4 reads was closer to the overall result than that detected by R9.4.1 reads. Pearson correlation coefficients were calculated for the methylated percentages of the subsampled vs. total CpG sites (Supplementary Fig. 8B). The R9.4.1 flow cell results exhibited a higher correlation for this relationship than the R10.4 flow cell results, indicating that the R9.4.1 flow cell results were superior to the R10.4 flow cell results for methylation calling based on low-coverage sequencing.
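The repeated subsampling can be pictured with the following Python sketch. It is a simplification under stated assumptions: each read is reduced to a hypothetical tuple of (length in bp, number of CpG calls, number of methylated CpG calls), reads are drawn at random until the target yield is reached, and the methylated proportion is pooled over the drawn reads. The study itself subsampled the Megalodon modbam file, which this sketch does not reproduce.

```python
import random


def subsample_to_yield(reads, target_bases, rng):
    """Draw reads in random order until their total length reaches target_bases."""
    order = list(range(len(reads)))
    rng.shuffle(order)
    chosen, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        chosen.append(reads[idx])
        total += reads[idx][0]
    return chosen


def methylated_proportion(reads):
    """Pooled fraction of methylated CpG calls across a set of reads."""
    calls = sum(r[1] for r in reads)
    methylated = sum(r[2] for r in reads)
    return methylated / calls if calls else float("nan")


def repeated_subsampling(reads, target_bases, n_repeats=100, seed=0):
    """Methylated proportion for n_repeats independent subsamples."""
    rng = random.Random(seed)
    return [methylated_proportion(subsample_to_yield(reads, target_bases, rng))
            for _ in range(n_repeats)]


if __name__ == "__main__":
    # Toy reads: (length_bp, n_cpg_calls, n_methylated_calls).
    toy_rng = random.Random(1)
    toy_reads = []
    for _ in range(1000):
        n_cpg = toy_rng.randint(5, 50)
        toy_reads.append((toy_rng.randint(500, 20000), n_cpg, toy_rng.randint(0, n_cpg)))
    props = repeated_subsampling(toy_reads, target_bases=1_000_000)
    print(min(props), max(props))
```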
Megalodon is also capable of distinguishing genomic 5-hydroxymethylcytosine (5hmC) from 5mC, which is useful for studying the importance of 5hmC in DNA methylation associated with development or disease [46]. The proportion of 5hmC across human genomes shows even more tissue specificity and is always lower than the 5mC level [47]. In our study, the average proportions of 5hmC detected by R10.4 and R9.4.1 were 2.55% and 5.86%, respectively, which are much lower than those of 5mC. Meanwhile, as for 5mC, we found false-positive methylation results in scWGA reads, with 2.52% and 5.46% 5hmC signals from R10.4 and R9.4.1, respectively (Supplementary Fig. 9A and 9B). We filtered out the false positive sites from the WGS results using the same method, with the threshold set at two standard deviations above the average of the scWGA data. The mitochondrial filtering results showed that the 5hmC proportion detected in the remaining sites from R10.4 reads was below 25%, while that from R9.4.1 reads remained high (Supplementary Fig. 9C to 9H). This filtering indicated that R10.4 also has a lower FDR in 5hmC calling than R9.4.1, and that the filtering strategy using scWGA reads can effectively reduce the FDR, especially for R10.4 reads.
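The negative-control filter can be sketched as follows. This is a minimal illustration, assuming that per-site modification levels are available as dictionaries keyed by site and that the population standard deviation is used (the text specifies only "two standard deviations above the average"); the function names are hypothetical.

```python
from statistics import mean, pstdev


def false_positive_sites(scwga_levels, n_sd=2.0):
    """Sites whose level in the scWGA negative control exceeds mean + n_sd * SD.

    scwga_levels: {site: modification fraction in amplified, unmodified DNA}.
    Assumption: population SD (pstdev); the sample SD would be an equally
    plausible reading of the described threshold.
    """
    values = list(scwga_levels.values())
    threshold = mean(values) + n_sd * pstdev(values)
    return {site for site, level in scwga_levels.items() if level > threshold}


def filter_wgs(wgs_levels, scwga_levels, n_sd=2.0):
    """Drop from the WGS calls every site flagged in the scWGA control."""
    flagged = false_positive_sites(scwga_levels, n_sd)
    return {site: level for site, level in wgs_levels.items() if site not in flagged}


# Toy example: nine clean control sites and one site with a spurious signal.
toy_scwga = {f"chrM:{i}": 0.0 for i in range(1, 10)}
toy_scwga["chrM:10"] = 1.0
toy_wgs = {"chrM:1": 0.8, "chrM:10": 0.9}
kept = filter_wgs(toy_wgs, toy_scwga)
```

Because amplified DNA should carry no modifications, any site with a consistently high signal in the control is treated as a systematic caller artefact and removed from the native-DNA result.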
3.4. Detection of variation
To evaluate whether a single R10.4 flow cell was capable of CNV detection from a single cell, we applied the Control-FREEC tool [39] to analyze the WGS and scWGA sequencing results of HCC78 generated in an R9.4.1 flow cell and in an R10.4 flow cell on the MinION sequencer. An extensive range of copy number gains was observed on chromosome 14 of the HCC78 genome (30–65 Mb), and the instability on chromosome 16 (Fig. 4A and 4B) corresponded to the regions with the most frequent gains. Genomic gains were also observed in chromosome arms 1q, 5p, 7q, 11q, 17q, and 19q. The CNV patterns detected by the R9.4.1 and R10.4 flow cells were identical, and the patterns detected in a single HCC78 cell were consistent with those in the bulk cell samples. However, the read coverage of scWGA libraries was less even than that of WGS libraries because of the additional noise generated by unbalanced amplification during the MDA process.
Fig. 4.
Representative copy number variation (CNV), structural variation (SV) and single nucleotide variation (SNV) patterns detected from HCC78 using WGS and single-cell WGA sequencing reads from R9.4.1 and R10.4 flow cells. A Genome-wide CNV distribution with 1 Mb bin size. B A zoomed-in view of the CNV gain event on chromosome 14. C and D The intersection number (C) and proportion (D) of SV calls from R10.4 and R9.4.1 reads, with the minimum number of supporting reads (n) ranging from 3 to 5. E The IGV snapshot showing reads spanning the breakpoints of the ROS1 and SLC34A2 genes in the HCC78 cell line from four libraries. The DNA strand broke at 25,655,005 bp on chromosome 4 and joined the ROS1 gene at 117,337,144 bp on chromosome 6, while chromosome 6 also broke at 117,337,162 bp and joined SLC34A2 on chromosome 4 at 25,665,008 bp. The unaligned nucleotides are color-coded (A: green, T: red, C: blue, G: gold) while the aligned ones are colored in grey. F Schematic of the translocation between the SLC34A2 (top) and ROS1 (bottom) genes. Breakpoints were detected in the DNA sequence between ROS1 exons 32 and 33, and between SLC34A2 exons 4 and 5, in both WGS and scWGA data, which would result in the RNA fusion of ROS1-SLC34A2. G The number of somatic point mutations and indels of the HCC78 cell line recorded in the DepMap database, together with the number of records intersecting with each of the four libraries. H The Venn diagram of mutations intersecting with records from the DepMap database among the four libraries suggests considerable reproducibility of R10.4 and R9.4.1 data in SNV detection. WGS, whole-genome shotgun; scWGA, single-cell whole-genome amplification.
Next, we used cuteSV [40] to detect and validate the SVs from R9.4.1 and R10.4 reads. The total number of detected SVs and the intersection between R10.4 and R9.4.1 reads are shown in Fig. 4C. More SV records were called from R9.4.1 reads than from R10.4 reads. The minimum number of supporting reads ranged from 3 to 5. In the intersection validation, there were fewer R10.4-unique SV records than R9.4.1-unique SV records in both WGS and scWGA sequencing results. The intersection between SVs called from R9.4.1 and R10.4 reads shows that R10.4 reads were more representative and credible than R9.4.1 reads in SV detection, with the former having a higher proportion of intersecting records than the latter (Fig. 4D). The SV types and record details are presented in Supplementary Tables 4 and 5, respectively. The most common SV in the HCC78 cell line, the SLC34A2–ROS1 fusion, was detected from WGS and scWGA reads obtained using either the R9.4.1 flow cell or the R10.4 flow cell. We validated the translocation and the specific breakpoints of the SLC34A2 and ROS1 genes by identifying the reads spanning both regions (Fig. 4E). All of the breakpoints were supported by at least two R10.4 reads and four R9.4.1 reads of HCC78 genomic DNA extracted from bulk cell samples. The same breakpoints were also observed for single-cell samples, where they were supported by at least two reads. The translocation of these two chromosomes could result in RNA fusion [29], [30], as shown in Fig. 4F. Half of the reads covering both regions showed no translocation in single-cell samples, which indicated that the fusion corresponded to a heterozygous mutation.
We compared the performance of R10.4 and R9.4.1 flow cells in SNV detection by using the 445 known variations in the HCC78 cell line from the Cancer Cell Line Encyclopedia (CCLE) data extracted from the Broad Cancer Dependency Map Project (DepMap) as the positive control. Approximately 40% of CCLE records were recovered as the same mutation in R10.4 and R9.4.1 data obtained on a single MinION sequencer with 6 × to 8 × read coverage (Fig. 4G and 4H). The sensitivities for WGS (R10.4), WGS (R9.4.1), scWGA (R10.4) and scWGA (R9.4.1) were 38.88%, 43.37%, 32.81% and 30.78%, respectively. We further used the NGS results of HCC78 provided by CCLE (Bioproject: PRJNA523380, SRA Experiments: SRR8619116) as the baseline to calculate the performance metrics (Supplementary Table 6). The results showed that R10.4 reads have higher specificity than R9.4.1 reads for both WGS and scWGA libraries. In SNV detection, the sensitivity of R10.4 reads is comparable with that of R9.4.1 reads for the WGS library and higher than that of R9.4.1 reads for the scWGA libraries. The sensitivity is limited by the low coverage of the MinION output, yet it is close to the average sensitivity of SNP calling results (0.743) using NGS data with a coverage cut-off over 4 [48].
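For reference, sensitivity and specificity against a truth set can be computed as sketched below. The sketch assumes that variants are represented as hashable keys (for example, chromosome, position, reference and alternate allele) and that a universe of assessed sites is defined for counting true negatives; the function names and the toy sets are hypothetical.

```python
def confusion_counts(called, truth, universe):
    """Return (TP, FP, FN, TN) for a call set against a truth set.

    called, truth: sets of variant keys; universe: set of all assessed sites.
    """
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(universe - called - truth)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of non-variant sites correctly left uncalled: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy example with integer site identifiers.
toy_truth = {1, 2, 3, 4}
toy_called = {1, 2, 5}
toy_universe = set(range(1, 11))
tp, fp, fn, tn = confusion_counts(toy_called, toy_truth, toy_universe)
```

As a consistency check on the reported figures, a sensitivity of 38.88% against 445 known variations corresponds to 173 recovered records (173/445 ≈ 0.3888); this count is back-calculated from the percentage and is not read from Fig. 4G.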
Additionally, we examined the signature TP53 mutation (TP53: c.722C>T, p.Ser241Phe, rs28934573) in HCC78 cells (Supplementary Fig. 10). All four libraries detected the TP53 mutation, with at least one read covering this region. Fewer known CCLE SNVs were identified in scWGA sequencing than in WGS sequencing, owing to the lower coverage of the former. Moreover, SNVs were detected at a higher rate by R10.4 reads than by R9.4.1 reads, despite the former being based on two-thirds of the yield of the latter (19 G vs. 28 G), indicating that R10.4 flow cells may be more powerful than R9.4.1 flow cells in single-cell DNA sequencing applications.
3.5. Genome recovery rate
We compared the genome recovery rates associated with different sequencing methods by subsampling the reads into different proportions and then calculating their genome recovery rates. The R9.4.1 flow cell had a higher yield than the R10.4 flow cell from both WGS and scWGA sequencing (Fig. 5A). As a result, the WGS R9.4.1 reads covered 94.71% of the GRCh38 genome with at least one read, whereas the WGS R10.4 reads covered only 91.78%. The reads from scWGA covered 92.14% and 88.46% of the genome using the R9.4.1 and R10.4 flow cells, respectively. The genome coverage of scWGA was thus lower than that of WGS for both R9.4.1 and R10.4 flow cells, potentially because of the amplification bias in MDA. Additionally, we sequenced an NGS scWGA library as the control. The genome recovery rate from NGS reads was higher than that from R10.4 reads but lower than that from R9.4.1 reads when the NGS reads were subsampled to the same yield as that of the scWGA reads from the R9.4.1 or R10.4 flow cells (Fig. 5).
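The genome recovery rate, taken here as the fraction of reference bases covered by at least one read, can be sketched as an interval-merging calculation. The snippet assumes alignments are given as half-open (start, end) intervals per contig; the function names and toy values are hypothetical, and a real analysis would derive coverage from the alignment files.

```python
def covered_bases(intervals):
    """Number of bases covered by at least one half-open (start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def genome_recovery_rate(alignments, contig_lengths):
    """Fraction of the reference covered by at least one alignment.

    alignments: {contig: [(start, end), ...]}; contig_lengths: {contig: length}.
    """
    covered = sum(covered_bases(alignments.get(c, [])) for c in contig_lengths)
    return covered / sum(contig_lengths.values())


# Toy example: two 100-bp contigs, only the first has alignments.
toy_alignments = {"chr1": [(0, 50), (40, 80), (90, 100)]}
toy_lengths = {"chr1": 100, "chr2": 100}
rate = genome_recovery_rate(toy_alignments, toy_lengths)
```

Applying the same function to progressively smaller random subsets of alignments gives the kind of recovery-versus-yield curve plotted in Fig. 5A.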
Fig. 5.
Genome recovery rate and repeat types of uncovered region from different sequencing methods. A The genome recovery rate of HCC78 DNA sequencing using different sequencing methods. One R9.4.1 flow cell can achieve a higher genome recovery rate with more data yield than R10.4 in both WGS and WGA sequencing. Single cell genome recovery rate increased with the extension of T7 endonuclease Ⅰ incubation time for MDA products. Each grey dot indicates a sequencing library with a different treatment, while the colored lines and dots indicate subsampled reads from WGS or scWGA sequencing. B The sum length for uncovered genomic regions which falls into different repeat types in five libraries. C The sum length of uncovered repeat regions and the proportion (upper right) of repetitive sequence in uncovered regions with the same read coverage. All libraries were subsampled into the same yield as WGS R10.4 data. MDA, multiple displacement amplification. MALBAC, multiple annealing and looping based amplification cycles. T7E1, T7 endonuclease Ⅰ.
In Nanopore sequencing, the library preparation process can be modified to maximize the yield of scWGA sequencing. For example, the MALBAC method was designed to generate short amplicons and can be expected to afford a low yield in Nanopore sequencing (Fig. 5A and Supplementary Table 7). Thus, the MDA method using the phi29 enzyme (which can produce long products) is superior to the MALBAC method for generating WGA DNA for Nanopore sequencing. However, the MDA method generates branches and forms complex secondary structures, which can block the pores in Nanopore flow cells and decrease yields. Consequently, T7 endonuclease Ⅰ must be used to decrease the number of branches generated during MDA. We found that the read yield and genome recovery rate increased with the extension of T7 endonuclease I treatment time (Fig. 5A). Moreover, a yield comparable to that from WGS was obtained from 4 h of T7 endonuclease Ⅰ treatment; this material was then used to construct the final scWGA library for both R9.4.1 and R10.4 flow cells.
To clarify the nature of uncovered regions in R9.4.1 and R10.4 flow cells, we determined the total lengths of uncovered regions in different libraries. The total length of uncovered regions in a single R9.4.1 flow cell was considerably smaller than that in a single R10.4 flow cell, owing to the nearly doubled yield (Fig. 5B and Supplementary Table 8) of both WGS and scWGA sequences in the R9.4.1 flow cells. When the same yield was used for each library, the total lengths were similar (Fig. 5C and Supplementary Table 8). Most of the uncovered regions were repetitive sequences in the human genome (Fig. 5C). By classifying the uncovered repeats, we determined that the R9.4.1 flow cell was slightly better at covering interspersed repeats (LINE, SINE, LTR, and DNA repeats) than the R10.4 flow cell, and both the R10.4 and R9.4.1 reads were superior to NGS reads in terms of covering local tandem repeats, such as satellites (Fig. 5C and Supplementary Tables 9 and 10).
3.6. Validation of variation calling in scWGA sequencing using NGS
We first checked the read accuracy of the four scWGA libraries sequenced on R9.4.1 flow cells for protocol optimization (Supplementary Table 11). The observed read accuracies in these libraries were all higher than the estimated ones for both the “high accurate” and “super accurate” Guppy basecalling models, and the R9.4.1 scWGA read accuracy was lower than that of R10.4. We next compared the SNV and CNV detection results of the three scWGA libraries. To ensure a fair comparison, we subsampled NGS (94 G) and R9.4.1 (26 G) scWGA reads to the same yield as R10.4 scWGA reads (18 G).
In terms of the SNV and small indel calling, approximately half (730,864 out of 1,472,634) of the called variants from ONT reads (both R10.4 and R9.4.1 reads) were validated by NGS, and the R10.4 results included 140,470 more records than R9.4.1 reads when intersected with NGS data. This intersection demonstrated the superiority of R10.4 reads to R9.4.1 reads for the detection of single-cell variations (Supplementary Fig. 11A). To exclude possible artefacts introduced by NGS, we limited the intersection of SNV to the 445 well-known sites of variation in the genome of the HCC78 cell line (as determined by DepMap). Most of the sites captured by NGS were also observed in ONT reads, with only 24 being unique to NGS (Supplementary Fig. 11B).
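The cross-platform validation can be expressed with set operations on variant keys. The sketch below assumes that each call is a hashable key such as (chromosome, position, reference allele, alternate allele) and that the ONT total refers to the union of R10.4 and R9.4.1 calls, which is one reading of the text; the function name and toy data are hypothetical.

```python
def validation_summary(r10_calls, r9_calls, ngs_calls):
    """Summarize how many ONT calls are confirmed by an NGS call set."""
    r10, r9, ngs = set(r10_calls), set(r9_calls), set(ngs_calls)
    ont = r10 | r9  # assumption: ONT total = union of both flow cells
    validated = ont & ngs
    return {
        "ont_total": len(ont),
        "validated": len(validated),
        "validated_fraction": len(validated) / len(ont) if ont else float("nan"),
        # Positive when R10.4 shares more records with NGS than R9.4.1 does.
        "r10_minus_r9_validated": len(r10 & ngs) - len(r9 & ngs),
        "ngs_only": len(ngs - ont),
    }


# Toy example with (chrom, pos, ref, alt) keys.
toy_r10 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
toy_r9 = {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")}
toy_ngs = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr4", 5, "A", "T")}
summary = validation_summary(toy_r10, toy_r9, toy_ngs)
```

With the reported counts, 730,864 validated calls out of 1,472,634 correspond to a validated fraction of about 49.6%.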
We also evaluated the consistency of CNV calling in scWGA sequencing via different library construction protocols. The CNV patterns captured by the ONT reads were identical to those captured by NGS, even with a limited yield of 12 Gb (corresponding to the lowest recovery rate) (Supplementary Fig. 12).
4. Discussion
We evaluated the performances of multiple sequencing runs on R9.4.1 and R10.4 flow cells and highlighted a promising method for library construction for single-cell whole-genome sequencing on a third-generation sequencing platform. The latest Q20+ chemistry combined with the R10.4 sequencing enzyme exhibited improved read accuracies and SNV detection capabilities relative to those obtained using R9.4.1 techniques. Moreover, the SV and overall methylation detection results based on R10.4 data from a single MinION run were comparable with those based on R9.4.1 data from a single MinION run.
The read yield was affected by the average translocation speeds of DNA passing through the nanopore, which were 400 and 240 bases per second for the R9.4.1 and R10.4 flow cells, respectively. According to the genome coverage plot, the WGS data from one R9.4.1 flow cell achieved a read coverage of 9 × and covered more than 94% of the genome, with genome recovery saturating within one flow cell. In contrast, the WGS data from one R10.4 flow cell achieved a read coverage of approximately 4 × ; thus fewer R10.4 reads than R9.4.1 reads were required to support the mutation targets. Moreover, the read yield for a single flow cell was also determined by the number of “active” pores during the 72-h sequencing process. Pores may be blocked or damaged during sequencing, such that the read yield per hour decreases over time. Different flow cells may have different numbers of usable pores, so the total read yields from the same input DNA may vary. In our experience and that of other researchers, the reproducibility of MinION sequencing is good [49]. In this study, we used R10.4 and R9.4.1 flow cells with similar numbers of initially registered pores (1400–1500 pores) and the same amount of DNA as input for benchmarking. The active pore numbers were close to each other without outliers (all within the range of mean ± 3 standard deviations), and there was no significant difference (P > 0.05) between the R9.4.1 and R10.4 groups. Compared with the pores of the R9.4.1 flow cell, the pores of the R10.4 flow cell required more input DNA and were more robust to DNA overloading, as they produced a more even read length when loaded with 5 µg DNA (i.e., four times higher than the recommended value). Therefore, R10.4 flow cells should achieve a higher read yield than R9.4.1 flow cells after the DNA translocation speed has been adjusted in new releases of the motor protein in library construction kits.
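The outlier criterion mentioned above (all values within the mean ± 3 standard deviations) can be checked with a few lines. This is a minimal sketch using the sample standard deviation; the function name and the pore counts are hypothetical toy values, not the counts recorded in this study.

```python
from statistics import mean, stdev


def outliers(values, n_sd=3.0):
    """Return the values lying more than n_sd sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > n_sd * s]


# Hypothetical active-pore counts for a group of flow cells.
toy_pores = [1400, 1450, 1500, 1420, 1480]
flagged = outliers(toy_pores)
```

Note that with only a handful of flow cells a 3-SD rule can never flag a value, because a single observation cannot lie more than (n − 1)/√n sample standard deviations from the mean; the criterion becomes informative only for larger groups.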
The accuracies of R10.4 reads were higher than those of R9.4.1 reads generated by both the “high accurate” and “super accurate” Guppy basecalling models. The decrease in error rate from R9.4.1 to R10.4 is possibly due to the improvement in homopolymer calling. The accuracy for C and G homopolymers is worse than that for A and T homopolymers. Similar results were reported in a previous study, which found that C and G homopolymers are sequenced significantly worse than A and T ones. This is likely due to the challenges of processing the ionic current signal in high GC-content regions [50], [51]. Despite the defects in G and C homopolymers, R10.4 shows an approximately 10% increase in average homopolymer accuracy compared with R9.4.1 reads, especially for longer homopolymers.
The amplification enzyme used in the scWGA reaction is phi29 DNA polymerase, which exhibits strong strand displacement activity and can generate long fragments of up to 10 kb with a low mutation rate. Random hexamers included in the reaction can be annealed to the denatured DNA template to elongate the new strands. When the synthesized DNA reaches another starting point, phi29 displaces the newly generated strand and continues the extension for the binding of more primers [52], [53]. MDA is more productive and achieves a higher coverage than other WGA methods, such as MALBAC, which achieved a low yield on the ONT platform (Fig. 5A). However, the hyperbranched structure formed during this process can block pores and thereby decrease the final sequencing yield. According to the ONT Premium WGA protocol, a 30-min treatment with T7 endonuclease Ⅰ can reduce the number of branches generated by phi29 during amplification. However, we found this treatment time was inadequate, and the data output was low even when the treatment time was increased to 2 h. Accordingly, to maximize the efficiency of T7 endonuclease Ⅰ, we performed another 2 h of T7 endonuclease Ⅰ treatment after purifying the enzyme cleavage products. The total yields of R9.4.1 and R10.4 data achieved a human genome coverage of more than 8 × and 6 × , respectively, which are comparable to the routine WGS output.
Epigenetic changes such as 5mC methylation are key indicators of human disease. Nanopore WGS sequencing of native DNA enables accurate sequencing of nearly every CpG site in the human genome. Thus, Nanopore WGS sequencing is a highly suitable tool for methylation profiling in clinical settings. Although R9.4.1 and R10.4 reads can effectively detect 5mC methylation, the FDR may be high at several genomic sites, especially in C-base homopolymers in R9.4.1 reads. Our scWGA dataset can be used by other researchers to filter potential false-positive sites in their own methylation calls. In this study, we also attempted to call other methylation signals using ONT data; however, the 6-methyladenine models in Megalodon or Remora exhibited high FDRs in our dataset. This highlights the need to develop a more accurate method for calling low-abundance methylation signals. R10.4 flow cells may be suitable for use in such a method, as they generate more accurate basecalls than R9.4.1 flow cells.
Additionally, abnormal copy numbers and karyotypes were observed in both R10.4 and R9.4.1 data obtained from the HCC78 cell line, as shown in Fig. 4A. The presence of the TP53 mutation (TP53: c.722C>T, p.Ser241Phe, rs28934573) in HCC78 was confirmed with at least one read covering the target region, consistent with reported results [28]. Our scWGA and WGS sequencing reads also validated the signature mutation and SVs within HCC78, which demonstrated the potential of Nanopore sequencing for the detection of single-nucleotide mutations and complex DNA rearrangements in cancer cells. Patients with NSCLC harboring driver-gene mutations and rearrangements, such as those in EGFR, KRAS, NRAS, BRAF, ALK, and ROS1, may benefit from targeted therapies [54], as the progression of NSCLC can be driven by multiple molecular alterations [16]. Therefore, progression monitoring and implementation of therapeutic strategies for cancers such as NSCLC could be facilitated by mutation landscape assessment and genomic profiling performed using comprehensive detection methods such as Nanopore long-read sequencing.
There are some limitations to our benchmark. We only included the MinION as the sequencing platform. ONT also offers the PromethION platform, which is a better choice when high sequencing depth is needed. We limited our analysis to previously known variations and genome-wide methylation profiling for the human sample because the sequencing depth of the MinION limited the precise evaluation of variation calling results. Because sensitivity and specificity can depend on coverage [48], further benchmarking on the PromethION platform would allow a better comparison between Nanopore R9.4.1 and R10.4 flow cells. Few duplex reads, which offer higher accuracy, were generated from R10.4 data, so further optimization is also needed to confirm the results. We look forward to a higher proportion of duplex reads from R10.4.1 (reported to be over 20%) and a further version of the library preparation kit for better performance.
As another important third-generation sequencing platform, PacBio can use HiFi mode to generate long, high-fidelity circular consensus reads. Combined with a Tn5 transposase-based library preparation method, SMOOTH-seq is a breakthrough in the field of single-cell DNA sequencing [27]. Compared with our scWGA results, SMOOTH-seq can achieve higher throughput with pooled single cells within one run on the PacBio platform, while its genome recovery rate for an individual cell is lower. With the upgrades of Nanopore flow cells and chemistry, including the R10.4.1 flow cell and the V14 kit, further optimization toward single-cell sequencing with both high throughput and high coverage can be expected.
5. Conclusion
Our study is the first benchmark to evaluate single- and bulk-cell whole-genome sequencing reads generated on Nanopore R10.4 and R9.4.1 MinION flow cells. It provides a promising method for Nanopore library construction for single-cell whole-genome sequencing, whose results can be used to filter out possible false positive sites in methylation calling. The low cost of entry and the portability of the MinION sequencer also make it a useful tool for rapid point-of-care testing. We found that the R10.4 flow cell outperforms the R9.4.1 flow cell in terms of read accuracy, variation detection, and FDR in methylation calling, even with a reduced yield. Moreover, the genome recovery rates of R10.4 and R9.4.1 flow cells are comparable, while the R9.4.1 flow cell is more robust than the R10.4 flow cell in methylation calling when subsampled to low coverage. For WGA in Nanopore sequencing, we recommend the use of MDA for single-cell DNA amplification and an optimized T7 endonuclease I cutting procedure to achieve high yields. Moreover, this study provides scWGA sequencing results as a resource for filtering false positive sites in ONT methylation profiling and highlights that whole-genome DNA sequencing using Nanopore R10.4 can provide promising results for single-cell variation detection and methylation profiling, which can promote the exploration of its potential in human clinical WGS and scWGA sequencing.
Authors’ contributions
R.L., M.Y. and Y.N. designed the study. Y.N. and Z.S. performed the laboratory experiments and sequencing. X.L., Y.N. and R.L. performed the data analysis and visualisations. X.L. uploaded the basecalled data to SRA for data sharing. Y.N., X.L. and R.L. wrote the draft, with all authors providing feedback. All authors reviewed and approved the final manuscript.
Funding
This work was supported by the Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone Shenzhen Park Project (HZQB-KCZYZ-2021017) to M.Y.; and by the Hong Kong Branch of Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) (SMSEGL20SC02), the Early Career Scheme (project number 9048204) from the Hong Kong Research Grant Council, the Hong Kong Health and Medical Research Fund (project number 9211280) and new Research Initiatives support from City University of Hong Kong (project number 9610497) to R.L.
CRediT authorship contribution statement
Ying Ni: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing. Xudong Liu: Methodology, Software, Formal analysis, Validation, Data curation, Writing – original draft, Writing – review & editing. Zemenu Mengistie Simeneh: Methodology, Investigation, Writing – review & editing. Mengsu Yang: Conceptualization, Resources, Writing – review & editing, Supervision, Funding acquisition. Runsheng Li: Conceptualization, Methodology, Software, Validation, Resources, Writing – review & editing, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that there is no conflict of interest associated with this study.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.03.038.
Contributor Information
Mengsu Yang, Email: bhmyang@cityu.edu.hk.
Runsheng Li, Email: runsheli@cityu.edu.hk.
References
- 1.Souche E., Beltran S., Brosens E., Belmont J.W., Fossum M., et al. Recommendations for whole genome sequencing in diagnostics for rare diseases. Eur J Hum Genet. 2022;16 doi: 10.1038/s41431-022-01113-x. (vol) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Smedley D., Smith K.R., Martin A., Thomas E.A., McDonagh E.M., et al. 100,000 Genomes pilot on rare-disease diagnosis in health care - preliminary report. N Engl J Med. 2021;vol. 385(20):1868–1880. doi: 10.1056/NEJMoa2035790. Nov 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Berger M.F., Mardis E.R. The emerging clinical relevance of genomics in cancer medicine. Nat Rev Clin Oncol. 2018;vol. 15(6):353–365. doi: 10.1038/s41571-018-0002-6. (Jun) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Alkhateeb A., Rueda L. Zseq: an approach for preprocessing next-generation sequencing data. J Comput Biol. 2017;vol. 24(8):746–755. doi: 10.1089/cmb.2017.0021. (Aug) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Schadt E.E., Turner S., Kasarskis A. A window into third-generation sequencing. Hum Mol Genet. 2010;vol. 19(R2):R227–R240. doi: 10.1093/hmg/ddq416. Oct 15. [DOI] [PubMed] [Google Scholar]
- 6.Amarasinghe S.L., Su S., Dong X., Zappia L., Ritchie M.E., et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;vol. 21(1):30. doi: 10.1186/s13059-020-1935-5. Feb 7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Greninger A.L., Naccache S.N., Federman S., Yu G., Mbala P., et al. Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med. 2015;vol. 7:99. doi: 10.1186/s13073-015-0220-9. Sep 29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wang Y., Zhao Y., Bollas A., Wang Y., Au K.F. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;vol. 39(11):1348–1365. doi: 10.1038/s41587-021-01108-x. (Nov) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Jain M., Tyson J.R., Loose M., Ip C.L.C., Eccles D.A., et al. MinION analysis and reference consortium: phase 2 data release and analysis of R9.0 chemistry, F1000Res. 2017;vol. 6:760. doi: 10.12688/f1000research.11354.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Sereika M., Kirkegaard R.H., Karst S.M., Michaelsen T.Y., Sørensen E.A., et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods. 2022;vol. 19(7):823–826. doi: 10.1038/s41592-022-01539-7. (Jul) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Luo J., Meng Z., Xu X., Wang L., Zhao K., et al. Systematic benchmarking of nanopore Q20+ kit in SARS-CoV-2 whole genome sequencing. Front Microbiol. 2022;vol. 13 doi: 10.3389/fmicb.2022.973367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Rocio Esteban, Irina-Alexandra Vasilescu, Marcus H. Stoiber, Daniel J. Turner, David Stoddart et al., "Directly detect and phase genomic 5mC methylation with high reproducibility and low bias using Nanopore sequencing." pp. 549–549.
- 13.Yuen Z.W., Srivastava A., Daniel R., McNevin D., Jack C., et al. Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. Nat Commun. 2021;vol. 12(1):3438. doi: 10.1038/s41467-021-23778-6. Jun 8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Simpson J.T., Workman R.E., Zuzarte P.C., David M., Dursi L.J., et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods. 2017;vol. 14(4):407–410. doi: 10.1038/nmeth.4184. (Apr) [DOI] [PubMed] [Google Scholar]
- 15.Oxford Nanopore Technology News, Oxford Nanopore Technology Update: CTO Clive G Brown Unveils Latest Sequencing Chemistry with Highest Performance to Date, Short Fragment Mode and Latest Methylation Performance Evaluations, 30th March 2022.
- 16.Reck M., Popat S., Reinmuth N., De Ruysscher D., Kerr K.M., et al. Metastatic non-small-cell lung cancer (NSCLC): ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann Oncol. 2014;vol. 25(Suppl 3):iii27–iii39. doi: 10.1093/annonc/mdu199. (Sep) [DOI] [PubMed] [Google Scholar]
- 17.Möhrmann L., Werner M., Oleś M., Mock A., Uhrig S., et al. Comprehensive genomic and epigenomic analysis in cancer of unknown primary guides molecularly-informed therapies despite heterogeneity. Nat Commun. 2022;vol. 13(1):4485. doi: 10.1038/s41467-022-31866-4. Aug 2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Gross A.M., Ajay S.S., Rajan V., Brown C., Bluske K., et al. Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease. Genet Med. 2019;vol. 21(5):1121–1130. doi: 10.1038/s41436-018-0295-y. (May) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Oulhen M., Pawlikowska P., Tayoun T., Garonzi M., Buson G., et al. Circulating tumor cell copy-number heterogeneity in ALK-rearranged non-small-cell lung cancer resistant to ALK inhibitors. NPJ Precis Oncol. 2021;vol. 5(1):67. doi: 10.1038/s41698-021-00203-1. Jul 16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Lohr J.G., Adalsteinsson V.A., Cibulskis K., Choudhury A.D., Rosenberg M., et al. Whole-exome sequencing of circulating tumor cells provides a window into metastatic prostate cancer. Nat Biotechnol. 2014;vol. 32(5):479–484. doi: 10.1038/nbt.2892. (May) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Nadeu F., Royo R., Massoni-Badosa R., Playa-Albinyana H., Garcia-Torre B., et al. Detection of early seeding of Richter transformation in chronic lymphocytic leukemia. Nat Med. 2022;vol. 28(8):1662–1671. doi: 10.1038/s41591-022-01927-8. (Aug) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wells D., Kaur K., Grifo J., Glassner M., Taylor J.C., et al. Clinical utilisation of a rapid low-pass whole genome sequencing technique for the diagnosis of aneuploidy in human embryos prior to implantation. J Med Genet. 2014;vol. 51(8):553–562. doi: 10.1136/jmedgenet-2014-102497. (Aug) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Huang L., Ma F., Chapman A., Lu S., Xie X.S. Single-cell whole-genome amplification and sequencing: methodology and applications. Annu Rev Genom Hum Genet. 2015;vol. 16:79–102. doi: 10.1146/annurev-genom-090413-025352. [DOI] [PubMed] [Google Scholar]
- 24.Dean Frank B., Hosono Seiyu, Fang Linhua, Wu Xiaohong, Fawad Faruqi A., et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci USA. 2002;vol. 99(8):5261–5266. doi: 10.1073/pnas.082089499. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Zong Chenghang, Lu Sijia, Chapman Alec R., Sunney Xie X. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Sci (N Y, N Y ) 2012;vol. 338(6114):1622–1626. doi: 10.1126/science.1229164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Xie H., Li W., Hu Y., Yang C., Lu J., et al. De novo assembly of human genome at single-cell levels. Nucleic Acids Res. 2022;50(13):7479–7492. doi: 10.1093/nar/gkac586. Jul 22.
- 27.Fan X., Yang C., Li W., Bai X., Zhou X., et al. SMOOTH-seq: single-cell genome sequencing of human cells on a third-generation sequencing platform. Genome Biol. 2021;22(1):195. doi: 10.1186/s13059-021-02406-y. Jun 30.
- 28.Iwakawa R., Kohno T., Enari M., Kiyono T., Yokota J. Prevalence of human papillomavirus 16/18/33 infection and p53 mutation in lung adenocarcinoma. Cancer Sci. 2010;101(8):1891–1896. doi: 10.1111/j.1349-7006.2010.01622.x. (Aug)
- 29.Rikova K., Guo A., Zeng Q., Possemato A., Yu J., et al. Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell. 2007;131(6):1190–1203. doi: 10.1016/j.cell.2007.11.025. Dec 14.
- 30.Klijn C., Durinck S., Stawiski E.W., Haverty P.M., Jiang Z., et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat Biotechnol. 2015;33(3):306–312. doi: 10.1038/nbt.3080. (Mar)
- 31.Bjaanæs M.M., Nilsen G., Halvorsen A.R., Russnes H.G., Solberg S., et al. Whole genome copy number analyses reveal a highly aberrant genome in TP53 mutant lung adenocarcinoma tumors. BMC Cancer. 2021;21(1):1089. doi: 10.1186/s12885-021-08811-7. Oct 9.
- 32.Rausch T., Jones D.T., Zapatka M., Stütz A.M., Zichner T., et al. Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell. 2012;148(1–2):59–71. doi: 10.1016/j.cell.2011.12.013. Jan 20.
- 33.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. doi: 10.1093/bioinformatics/bty191. Sep 15.
- 34.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352. Aug 15.
- 35.Oxford Nanopore Technologies. Megalodon. GitHub; 2020. https://github.com/nanoporetech/megalodon.
- 36.Oxford Nanopore Technologies. Remora. GitHub; 2022. https://github.com/nanoporetech/remora.
- 37.Gu Z., Eils R., Schlesner M., Ishaque N. EnrichedHeatmap: an R/Bioconductor package for comprehensive visualization of genomic signal associations. BMC Genom. 2018;19(1):234. doi: 10.1186/s12864-018-4625-x. Apr 4.
- 38.Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–589. doi: 10.1016/j.molcel.2010.05.004. May 28.
- 39.Boeva V., Popova T., Bleakley K., Chiche P., Cappo J., et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012;28(3):423–425. doi: 10.1093/bioinformatics/btr670. Feb 1.
- 40.Jiang T., Liu Y., Jiang Y., Li J., Gao Y., et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 2020;21(1):189. doi: 10.1186/s13059-020-02107-y. Aug 3.
- 41.Thorvaldsdóttir H., Robinson J.T., Mesirov J.P. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–192. doi: 10.1093/bib/bbs017. (Mar)
- 42.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–2993. doi: 10.1093/bioinformatics/btr509. Nov 1.
- 43.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–842. doi: 10.1093/bioinformatics/btq033. Mar 15.
- 44.Rang F.J., Kloosterman W.P., de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018;19(1):90. doi: 10.1186/s13059-018-1462-9. Jul 13.
- 45.Bicci I., Calabrese C., Golder Z.J., Gomez-Duran A., Chinnery P.F. Single-molecule mitochondrial DNA sequencing shows no evidence of CpG methylation in human cells and tissues. Nucleic Acids Res. 2021;49(22):12757–12768. doi: 10.1093/nar/gkab1179. Dec 16.
- 46.Ponnaluri V.K., Ehrlich K.C., Zhang G., Lacey M., Johnston D., et al. Association of 5-hydroxymethylation and 5-methylation of DNA cytosine with tissue-specific gene expression. Epigenetics. 2017;12(2):123–138. doi: 10.1080/15592294.2016.1265713. (Feb)
- 47.Nestor C.E., Ottaviano R., Reddington J., Sproul D., Reinhardt D., et al. Tissue type is a major modifier of the 5-hydroxymethylcytosine content of human genes. Genome Res. 2012;22(3):467–477. doi: 10.1101/gr.126417.111. (Mar)
- 48.Yu X., Sun S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinforma. 2013;14:274. doi: 10.1186/1471-2105-14-274. Sep 17.
- 49.Tyler A.D., Mataseje L., Urfano C.J., Schmidt L., Antonation K.S., et al. Evaluation of Oxford Nanopore's MinION Sequencing Device for Microbial Whole Genome Sequencing Applications. Sci Rep. 2018;8(1):10931. doi: 10.1038/s41598-018-29334-5. Jul 19.
- 50.Schreiber J., Wescoe Z.L., Abu-Shumays R., Vivian J.T., Baatar B., et al. Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands. Proc Natl Acad Sci USA. 2013;110(47):18910–18915. doi: 10.1073/pnas.1310615110. Nov 19.
- 51.Delahaye C., Nicolas J. Sequencing DNA with nanopores: troubles and biases. PLoS One. 2021;16(10). doi: 10.1371/journal.pone.0257521.
- 52.Nelson J.R. Random-primed, Phi29 DNA polymerase-based whole genome amplification. Curr Protoc Mol Biol. 2014;105:Unit 15.13. doi: 10.1002/0471142727.mb1513s105. Jan 6.
- 53.Lovmar L., Syvänen A.C. Multiple displacement amplification to create a long-lasting source of DNA for genetic studies. Hum Mutat. 2006;27(7):603–614. doi: 10.1002/humu.20341. (Jul)
- 54.Duma N., Santana-Davila R., Molina J.R. Non-small cell lung cancer: epidemiology, screening, diagnosis, and treatment. Mayo Clin Proc. 2019;94(8):1623–1640. doi: 10.1016/j.mayocp.2019.01.013. (Aug)