Abstract
Motivation
The existence of quasispecies in the viral population causes difficulties for disease prevention and treatment. High-throughput sequencing provides an opportunity to detect rare quasispecies, and long sequencing reads covering full genomes reduce quasispecies determination to a clustering problem. The challenge is the high similarity of quasispecies and the high error rate of long sequencing reads.
Results
We developed QuasiSeq using a novel signature-based self-tuning clustering method, SigClust, to profile viral mixtures with high accuracy and sensitivity. QuasiSeq can correctly identify quasispecies even using low-quality sequencing reads (accuracy <80%) and produce quasispecies sequences with high accuracy (≥99.55%). Using high-quality circular consensus sequencing reads, QuasiSeq can produce quasispecies sequences with 100% accuracy. QuasiSeq has higher sensitivity and specificity than similar published software. Moreover, the requirement of the computational resource can be controlled by the size of the signature, which makes it possible to handle big sequencing data for rare quasispecies discovery. Furthermore, parallel computation is implemented to process the clusters and further reduce the runtime. Finally, we developed a web interface for the QuasiSeq workflow with simple parameter settings based on the quality of sequencing data, making it easy to use for users without advanced data science skills.
Availability and implementation
QuasiSeq is open source and freely available at https://github.com/LHRI-Bioinformatics/QuasiSeq. The current release (v1.0.0) is archived and available at https://zenodo.org/badge/latestdoi/340494542.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Due to high mutation rates, individuals infected with retroviruses are populated with genetically related but different mutant viral strains which are referred to as viral quasispecies. The existence of dynamic quasispecies in the viral population causes additional difficulties for disease prevention and treatment (Domingo et al., 2019). To better understand outbreaks, migration and pathogenesis for better treatment planning, we need efficient methods to characterize the viral quasispecies composition from patients. Next-generation sequencing has greatly reduced the cost of DNA sequencing and numerous methods have been developed to analyze viral quasispecies. Due to the dominance of short-read sequencing platforms, most methods and software developed or adapted for analysis of haplotypes in viral quasispecies aim to use short reads with high-sequencing depth (Chen et al., 2018; Huang et al., 2019; Posada-Cespedes et al., 2017). However, high similarities between haplotypes in the quasispecies and uneven sequencing depth along the viral genome make it difficult to reconstruct haplotypes of quasispecies (Ahn and Vikalo, 2018; Chen et al., 2018; Jayasundara et al., 2015). Advancements in long-read technologies, represented by PacBio Single-Molecule Real-Time (SMRT) sequencing and Oxford Nanopore Technologies (ONT), provide the potential to reliably profile viral quasispecies. Currently, the average read length of SMRT sequencing from PacBio is ≥30 kb and the record read length for ONT sequencing is 2.3 Mb with 10–30 kb genome libraries being common (Amarasinghe et al., 2020). These long reads can cover the full length of most viral RNA genomes including HIV (∼9.7 kb) and Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (∼30 kb). As a result, viral quasispecies determination can be reduced to a clustering problem without the need to assemble the haplotypes from fragments. 
Several groups, including ours, have explored this application with PacBio long sequencing reads (Artyomenko et al., 2017; Dilernia et al., 2015; Huang et al., 2016, 2018). The most challenging aspect of using long reads is that these reads are error-prone and the difference between quasispecies is small. Therefore, it is critical to differentiate true variants from sequencing errors, and different methods have been implemented for this purpose: a statistical algorithm to select positions (Dilernia et al., 2015) and selection of the variant which is closely correlated with another variant (i.e. linked pair of variants/Tags; Artyomenko et al., 2017; Huang et al., 2018). Recently, CliqueSNV was developed to identify true viral variants by merging cliques in the graph which is constructed based on linkage information between single nucleotide variants (SNVs; Knyazev et al., 2021). However, these procedures use all selected variants in the clustering or graph construction step, which requires significant computational resources when a large quasispecies population exists. To overcome this limitation, we previously developed a clustering method using tag-sequences based on previously known HIV drug resistance-related positions (Huang et al., 2016). However, the sensitivity of this method is significantly reduced by the limited number of known HIV drug-resistance positions. In this study, we developed a novel workflow, QuasiSeq, to automatically identify haplotypes of a quasispecies using PacBio long sequencing reads. The main component of the workflow is SigClust (i.e. signature clustering), a novel clustering method we developed to cluster PacBio long sequencing reads of a quasispecies. The signature of a read is created by concatenating the top SNVs filtered by variant frequency and statistical significance based on the alignment of all full-length reads with a reference. 
The reads are then clustered with the self-tuning spectral clustering algorithm developed by Zelnik-Manor and Perona (2004). The consensus sequences of the final clusters are the haplotypes and the relative cluster sizes are the proportions of the haplotypes. The advantage of the QuasiSeq workflow is that the signature size can be adjusted based on the sequencing data size and the capability of the available computational resources.
2 Materials and methods
2.1 Overview of the SigClust method and viral quasispecies detection workflow
In this study, we developed a viral quasispecies analysis workflow, QuasiSeq, to determine the quasispecies in a viral sample using PacBio long sequencing data from a mixture of influenza or HIV samples. The key component of QuasiSeq is the SigClust algorithm which can be used to cluster highly similar sequences based on read signatures (Fig. 1). Briefly, all sequencing reads are aligned to a reference sequence with BLASR (Chaisson and Tesler, 2012) to separate the full-length reads (Step 1). The consensus of these full-length reads is then generated with Sparc (Ye and Ma, 2016), followed by aligning the full-length reads to the consensus sequence (Step 2) and the nucleotide variants are detected with VarScan (Koboldt et al., 2012) and then ranked by variant frequency or variant significance (Step 3). If there is no significant variant found (Step 4, thresholds are set based on data quality and quantity), the current group of reads is determined to have originated from a pure viral species and the consensus sequence is considered the final genome sequence for the cluster (Step 8). If significant variants are detected, the read signatures are constructed with the selected SNV positions (Step 5). The reads are then clustered with an adapted spectral clustering method (Step 6). It is essential to calculate distances between sequencing reads for clustering. Multiple sequence alignment of high-throughput sequencing data is computationally intensive and not practical for handling thousands of long reads which have frequent random short insertion and deletion sequencing errors. Therefore, we remove the homozygous sites (sites containing alleles common to all strains) and variants under thresholds from the alignments to leave only positions exhibiting real genetic diversity (i.e. read signatures, Step 5 and Fig. 1B) for the distance calculation between reads. 
The distance differences between sequences calculated with read signatures are much higher than those calculated with full-length sequences, as demonstrated by 60 SNVs in 10 Influenza A virus (IAV) clones (Supplementary Fig. S1). The difference can be significantly increased by only comparing the 60 SNV positions instead of all 2300 positions (Supplementary Table S1A and B). Since the difference between quasispecies is small, increased differences using sequence signatures makes it easier to differentiate the subspecies and enhances the sensitivity of the QuasiSeq pipeline. Each group of reads within an individual cluster is then used to make a consensus (Step 7) and processed with the same procedure described above until no significant variants are found for any group or the subsequent subgroups generated during the process. This results in a top-down hierarchical clustering mechanism to partition all full-length reads recursively into a tree structure (Fig. 1A, right panel). The consensus sequences of these final clusters (leaves) are the haplotypes of the quasispecies.
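The top-down recursion described above can be sketched in a few lines. This is a minimal illustration, assuming noise-free, pre-aligned reads of equal length; the frequency filter and the grouping by identical signature are toy stand-ins for the VarScan ranking and the spectral clustering used in QuasiSeq, and all function names are hypothetical.

```python
from collections import Counter

def consensus(reads):
    """Majority base at each column of equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def significant_positions(reads, min_freq):
    """Columns where a non-consensus base reaches min_freq (toy variant filter)."""
    cons = consensus(reads)
    positions = []
    for i, col in enumerate(zip(*reads)):
        minor = Counter(b for b in col if b != cons[i])
        if minor and max(minor.values()) / len(reads) >= min_freq:
            positions.append(i)
    return positions

def split_by_signature(reads, positions):
    """Toy stand-in for spectral clustering: group reads with identical signatures."""
    groups = {}
    for read in reads:
        sig = "".join(read[i] for i in positions)
        groups.setdefault(sig, []).append(read)
    return list(groups.values())

def partition(reads, min_freq=0.05):
    """Split recursively until no group has significant variants; return the leaves."""
    positions = significant_positions(reads, min_freq)
    if not positions:
        return [reads]  # pure group: its consensus is a haplotype
    leaves = []
    for group in split_by_signature(reads, positions):
        leaves.extend(partition(group, min_freq))
    return leaves

# Three toy haplotypes at 60%, 30% and 10% abundance.
reads = ["AAAA"] * 6 + ["AATA"] * 3 + ["GATA"]
leaves = partition(reads, min_freq=0.2)
haplotypes = {consensus(group): len(group) for group in leaves}
```

With a 20% threshold the rare `GATA` variant is invisible in the first cycle and only becomes significant inside the subgroup it shares with `AATA`, which mirrors why repeated cycles on subgroups can recover low-abundance haplotypes.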
Fig. 1.

Viral quasispecies detection workflow. (A) The diagram of viral quasispecies detection workflow. (B) Diagram of read signature construction and clustering
2.2 Datasets
Three PacBio sequencing datasets were used to develop and test the signature clustering workflow in this study. The first dataset, Pro19, was generated with HIV single genome amplicons (SGAs) from a previous study (Imamichi et al., 2016). Pro19 was created from 19 nearly full-length SGAs with known sequences (Supplementary Data S1). The mixture of these 19 SGAs was sequenced in a single SMRT cell on a PacBio RS II instrument without fragmentation using the P4/C2 chemistry (180-min movies) by Pacific Biosciences (Menlo Park, CA, USA) and a total of 99 738 subreads were obtained. The second dataset, Pro4, was created with four SGAs of known sequences (Supplementary Data S3) from two patients, two SGAs for each patient: Pt1_1 and Pt1_2 from patient 1 were collected on April 27, 2000, and Pt2_1 and Pt2_2 from patient 2 were collected on June 5, 2000. These four SGAs were sequenced separately in four SMRT cells on the PacBio RS II instrument using the P6/C4 chemistry and 240-min movies by the NIH Intramural Sequencing Center. The sequencing reads were mixed in silico and a total of 60 062 full-length sequencing reads were extracted to form this test dataset. The in silico combination of the data from these four SMRT cells, which mimics sequencing of a mixture of the four SGAs, produced a large amount of data. The Pro4 dataset was used to develop a strategy to improve the scalability of the QuasiSeq workflow, allowing it to handle large sequencing datasets. Four subsets were further extracted from Pro4 based on the quality of the sequencing reads: circular consensus sequencing (CCS), QV < 90, QV < 85 and QV < 80. The third test dataset and its reference sequences were obtained from NCBI (SRA accession number: SRR2042468); it contains sequencing data for 10 IAV clones generated from the same parent clone. 
The data had been used in the development of 2SNV (Artyomenko et al., 2017) and CliqueSNV (Knyazev et al., 2021), providing the opportunity to compare the performance of QuasiSeq with that of peer software and tools. Earth mover’s distance (EMD), precision and recall were calculated as described by the CliqueSNV developers (Knyazev et al., 2021). The IAV dataset was downsampled to 16K, 8K and 4K subsets to compare the performance of these tools on low-depth data. Raw sequencing data and extracted test datasets have been submitted to NCBI SRA under accession number: BioProject PRJNA771375.
2.3 Variant calling and ordering
Sparc (January 2, 2015 version; Ye and Ma, 2016) was used to generate a consensus sequence from a group of reads. BLASR (Version 5.1; Chaisson and Tesler, 2012) was used to align the PacBio sequencing reads to consensus or reference sequences. We used VarScan (Version 2.4.2; Koboldt et al., 2012) to detect variants based on the alignments. For every non-consensus nucleotide at every position in the alignment, the probability for that nucleotide to be a sequencing error or a true SNV is determined by a Fisher’s exact test with a given estimated error rate. We modified FisherExact.java provided by VarScan to enable the error rate as a user input.
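To illustrate how an estimated error rate can enter a Fisher's exact test, the following sketch compares the observed reference/variant counts at one position with the counts expected from sequencing error alone at the same depth. The 2×2 layout, the function name and the rounding of the expected count are assumptions made for illustration; this is not a transcription of the modified FisherExact.java.

```python
from math import comb

def right_tailed_fisher(ref_reads, var_reads, error_rate):
    """One-sided Fisher's exact p-value that var_reads exceeds sequencing error.

    ASSUMPTION: the table rows are (observed, expected-from-error) and the
    columns are (reference, variant); the expected row has the same depth.
    """
    depth = ref_reads + var_reads
    exp_var = round(depth * error_rate)   # variant reads expected from error
    exp_ref = depth - exp_var
    col_var = var_reads + exp_var         # column total: variant
    col_ref = ref_reads + exp_ref         # column total: reference
    total = col_var + col_ref
    # Hypergeometric tail: probability of seeing >= var_reads variant reads
    # in the observed row, given the fixed row and column totals.
    upper = min(depth, col_var)
    tail = sum(comb(col_var, x) * comb(col_ref, depth - x)
               for x in range(var_reads, upper + 1))
    return tail / comb(total, depth)

# 10 variant reads in 1000 at a 1% error rate: consistent with noise.
p_noise = right_tailed_fisher(ref_reads=990, var_reads=10, error_rate=0.01)
# 100 variant reads in 1000 at the same error rate: a likely true SNV.
p_snv = right_tailed_fisher(ref_reads=900, var_reads=100, error_rate=0.01)
```

Raising the assumed error rate makes the test more conservative, which is why exposing it as a user input matters for error-prone long reads.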
2.4 Read signature construction and clustering
The substitution and deletion variants detected in a group of reads aligned to a consensus (reference) sequence were selected to construct the read signatures for the downstream clustering analysis of these reads. Each signature can be thought of as a string of alleles of signature length l, while each read can be a short, randomly positioned and potentially erroneous substring of one of the signatures. The number of unique signatures depends on the number of reads (N), the signature length (l) and the noise in sequencing and mapping reads to reference. Typically, this number increases with an increase of noise and l until it reaches its maximum value, N. The number of variants can be adjusted by changing the significance threshold cut-off in order to control the read signature size.
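As a small illustration of signature construction, the sketch below concatenates the alleles of each aligned read at the selected positions and counts the distinct signatures. It assumes reads already projected onto reference coordinates, 0-based positions and `-` for a deletion; the function names are hypothetical.

```python
def read_signature(aligned_read, snv_positions):
    """Concatenate the alleles of one aligned read at the selected positions."""
    return "".join(aligned_read[p] for p in snv_positions)

def unique_signatures(aligned_reads, snv_positions):
    """Map each distinct signature to the number of reads carrying it."""
    counts = {}
    for read in aligned_reads:
        sig = read_signature(read, snv_positions)
        counts[sig] = counts.get(sig, 0) + 1
    return counts

# Four toy reads; positions 1, 2 and 5 are the selected variant sites.
reads = ["ACGTAC", "ACGTAC", "AC-TAT", "ATGTAT"]
signatures = unique_signatures(reads, [1, 2, 5])
```

Adding positions can only keep or increase the number of distinct signatures, so shortening the signature is a direct way to bound the size of the clustering problem.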
Edit distances between reads were determined based on the SNV positions selected in the previous step. The final distance is defined as the percentage of differences over the total number of positions included in the calculation, which is l. The self-tuning spectral clustering algorithm developed by Zelnik-Manor and Perona (2004) was adapted and used to perform clustering analysis.
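The distance and the locally scaled affinity used by self-tuning spectral clustering can be sketched as follows. This is an illustration under stated assumptions: the distance is computed as the fraction of mismatching positions between equal-length signatures, and the affinity follows the local scaling of Zelnik-Manor and Perona (2004), with the scale of each signature set to its distance to the k-th nearest neighbour. How QuasiSeq adapts the algorithm is not shown, and the eigenvector analysis is omitted.

```python
import math

def signature_distance(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def self_tuning_affinity(signatures, k=7):
    """Return (dist, affinity) with A[i][j] = exp(-d_ij^2 / (s_i * s_j))."""
    n = len(signatures)
    dist = [[signature_distance(a, b) for b in signatures] for a in signatures]
    scales = []
    for i in range(n):
        neighbours = sorted(dist[i][j] for j in range(n) if j != i)
        # Local scale: distance to the k-th nearest neighbour (or the farthest
        # one when fewer than k neighbours exist); floor avoids division by zero.
        kth = neighbours[min(k, len(neighbours)) - 1] if neighbours else 0.0
        scales.append(max(kth, 1e-12))
    affinity = [[0.0] * n for _ in range(n)]  # diagonal stays zero
    for i in range(n):
        for j in range(n):
            if i != j:
                affinity[i][j] = math.exp(-dist[i][j] ** 2 / (scales[i] * scales[j]))
    return dist, affinity

dist, affinity = self_tuning_affinity(["ACGT", "ACGA", "TTTT"], k=1)
```

In this toy case the two signatures that differ at one position out of four receive a much higher affinity than either has with the distant third signature.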
2.5 Implementation of QuasiSeq
The SigClust algorithm is implemented in MATLAB and requires two input files: a BAM file containing the read alignment to the consensus sequence and a ranked SNV file containing the SNV positions selected for signature construction. First, the BAM file is loaded into MATLAB using its Bioinformatics Toolbox and the signature of each read is extracted based on the given SNV positions. If the number of signatures exceeds the maximum allowed (limited by computational resources), a binary search is invoked to prune the SNVs until the number of signatures falls just below the maximum allowed. After the read signatures are constructed, the edit distances between the signatures are determined and the percentage of differences over the length of the signatures is used to construct an N by N pairwise affinity matrix for spectral clustering. The SigClust MATLAB code was compiled to a Java executable file. All software and tools used for the quasispecies analysis workflow were built into the QuasiSeq pipeline and have been installed in a Docker (Docker.Web.Site) image with their dependencies. The parallel computation was implemented with Celery (Celery.Project). The web interface for the pipeline in the Docker image is implemented with Django (Django.Project) and Python.
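The pruning step can be illustrated with a binary search over the ranked SNV list. The sketch assumes that the top-ranked SNVs are kept as a prefix and relies on the fact that adding a position never reduces the number of distinct signatures; the function names and the toy data are hypothetical and do not reproduce the MATLAB implementation.

```python
def count_unique_signatures(aligned_reads, positions):
    """Number of distinct signatures obtained from the given positions."""
    return len({tuple(read[p] for p in positions) for read in aligned_reads})

def prune_snvs(aligned_reads, ranked_positions, max_signatures):
    """Longest prefix of ranked SNVs whose signature count stays within the cap."""
    lo, hi = 0, len(ranked_positions)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always advances
        if count_unique_signatures(aligned_reads, ranked_positions[:mid]) <= max_signatures:
            lo = mid       # mid SNVs fit; try keeping more
        else:
            hi = mid - 1   # too many signatures; keep fewer SNVs
    return ranked_positions[:lo]

reads = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT"]
ranked = [3, 2, 1, 0]  # SNV positions, most significant first
kept = prune_snvs(reads, ranked, max_signatures=3)
```

Each probe recounts the signatures once, so the number of counting passes grows with the logarithm of the number of candidate SNVs rather than linearly.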
3 Results
3.1 Development of the QuasiSeq workflow
We developed the QuasiSeq workflow to characterize the quasispecies in a mixture of 19 HIV SGAs using PacBio sequencing reads. These SGA sequences are listed in Supplementary Data S1 and have high similarity (Supplementary Table S2). Since the long reads have a high error rate, we need to differentiate the real SNVs from sequencing errors. Fortunately, it is well documented that the sequencing errors occur randomly, and the consensus sequence acquired from the reads should be free of systematic bias (Amarasinghe et al., 2020). The PacBio sequencing reads of the 19 SGA mixture, Pro19, were aligned to the reference sequence (HXB2 strain, NCBI accession: K03455). Based on the alignment of 19 SGA reference sequences with HXB2 (Supplementary Fig. S2), we defined reads covering the reference sequence from position 841 to 7870 as full-length reads in order to have enough reads for the development of the workflow. A total of 5599 full-length reads were selected out of 99 738 raw subreads. The consensus sequence was then assembled using these full-length reads and these reads were then aligned to the consensus sequence (Fig. 2) followed by identification of the SNVs. These SNVs were filtered by variant frequency and variant significance leaving a total of 793 SNVs used to construct the read signatures with each full-length read represented by one of 5599 read signatures (Fig. 2). These read signatures were then clustered with the adapted spectral clustering method by exploiting the structure of the eigenvectors of a pairwise signature-to-signature similarity matrix (18 clusters, Fig. 2). Each group of reads was then processed with the same procedure again. Only one group still had significant variants after the first SigClust cycle. After two SigClust cycles, no group had significant variants and all 19 SGAs in the mixture had been successfully identified (Fig. 2). 
To evaluate the accuracy of the clustering workflow, we mapped the sequencing reads to the known sequences of the 19 SGAs (Table 1). The results of the QuasiSeq clustering with the read signatures agreed closely with the results obtained by mapping the reads to the known SGA sequences, as shown in the clustering confusion matrix (Supplementary Table S3). The reads were assigned to the correct SGAs with an average accuracy of 99.97%, specificity of 99.68%, sensitivity of 99.98% and precision of 99.66% (Supplementary Table S4). The consensus sequences acquired from QuasiSeq have an identity range of 99.54–99.92% compared with the known sequences of the SGAs (Table 1). The consensus sequences assembled with the reads in the final clusters are listed in Supplementary Data S2.
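For readers who want to derive such summary statistics from a clustering confusion matrix, the sketch below computes one-vs-rest accuracy, specificity, sensitivity and precision for each class and averages them. The macro-averaging scheme and the assumption that cluster j has already been matched to source j are ours; the text does not state how the averages were formed, and the matrix shown is toy data.

```python
def one_vs_rest_metrics(confusion):
    """Macro-averaged accuracy, specificity, sensitivity and precision.

    confusion[i][j] = number of reads from true source i assigned to cluster j.
    Assumes every class has at least one true read and one assigned read.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"accuracy": 0.0, "specificity": 0.0, "sensitivity": 0.0, "precision": 0.0}
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        sums["accuracy"] += (tp + tn) / total
        sums["specificity"] += tn / (tn + fp)
        sums["sensitivity"] += tp / (tp + fn)
        sums["precision"] += tp / (tp + fp)
    return {name: value / n for name, value in sums.items()}

# Toy two-source example: 2 reads of source 0 and 1 read of source 1 are misassigned.
metrics = one_vs_rest_metrics([[98, 2], [1, 99]])
```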
Fig. 2.
Pro19 clustering tree. Signatures are presented as number of signatures (n) × length of the signature (l)
Table 1.
Summary of the clusters of Pro19 full-length dataset (Con_Seq identity: Consensus Sequence Identity)
| SGA ID | Mapped to SGA | Frequency (%) | Cluster ID | Read count | Frequency (%) | Con_Seq identity (%) |
|---|---|---|---|---|---|---|
| 1995_1 | 70 | 1.25 | Clust_3 | 86 | 1.54 | 99.86 |
| 1995_2 | 456 | 8.14 | FL_Clust_2 | 457 | 8.16 | 99.65 |
| 1995_3 | 431 | 7.70 | FL_Clust_8 | 430 | 7.68 | 99.68 |
| 1995_4 | 271 | 4.84 | FL_Clust_14 | 270 | 4.82 | 99.89 |
| 1995_5 | 382 | 6.82 | FL_Clust_6_1 | 382 | 6.82 | 99.86 |
| 1995_6 | 252 | 4.50 | FL_Clust_13 | 250 | 4.47 | 99.79 |
| 1995_7 | 293 | 5.23 | FL_Clust_5 | 294 | 5.25 | 99.65 |
| 1995_8 | 225 | 4.02 | FL_Clust_16 | 224 | 4.00 | 99.54 |
| 1995_9 | 139 | 2.48 | FL_Clust_6_2 | 138 | 2.46 | 99.86 |
| 1995_10 | 265 | 4.73 | FL_Clust_1 | 263 | 4.70 | 99.80 |
| 1995_11 | 218 | 3.89 | FL_Clust_7 | 218 | 3.89 | 99.77 |
| 1995_12 | 204 | 3.64 | FL_Clust_17 | 204 | 3.64 | 99.92 |
| 1995_13 | 321 | 5.73 | FL_Clust_15 | 317 | 5.66 | 99.84 |
| 1995_14 | 232 | 4.14 | FL_Clust_18 | 230 | 4.11 | 99.72 |
| 2001_1 | 337 | 6.02 | FL_Clust_11 | 336 | 6.00 | 99.80 |
| 2001_2 | 673 | 12.02 | FL_Clust_4 | 672 | 12.00 | 99.73 |
| 2001_3 | 267 | 4.77 | FL_Clust_10 | 267 | 4.77 | 99.79 |
| 2001_4 | 244 | 4.36 | FL_Clust_9 | 243 | 4.34 | 99.73 |
| 2001_5 | 319 | 5.70 | FL_Clust_12 | 318 | 5.68 | 99.77 |
Note: Memory required for storing matrix (GB) = n² × 8/1024³, where n is the number of read signatures (see Table 2), because each element in a double-precision numerical matrix requires eight bytes.
In summary, the QuasiSeq workflow characterized the composition of a mixture of viral genomes well: it detected all 19 SGAs from 5599 full-length reads, including one with an abundance as low as 1.25% (Clust_3 for SGA 1995_1; Table 1).
3.2 Improvement of QuasiSeq workflow
To improve the scalability of the QuasiSeq workflow, allowing it to handle large sequencing datasets for rare quasispecies identification, we generated the second dataset, Pro4, an in silico mixture of sequencing data of four HIV SGAs from two patients using four flowcells. The advantage of using an in silico mixture is that we can accurately assess the efficiency of clustering with the QuasiSeq workflow because we can identify the source of the sequencing reads by their ID. The similarity between the consensus sequences of the four SGAs was > 93% (Supplementary Table S5 and Data S3). The PacBio sequencing reads from the Pro4 dataset were aligned to the genome sequence of HIV strain HXB2 as reference sequence. The reads covering the reference sequence from position 801 to 8800 were defined as full-length reads. A total of 60 062 full-length sequencing reads were separated from this dataset. The consensus of these reads was assembled, the full-length reads aligned to this consensus and 479 SNVs were detected with a variant frequency >3%. Using these 479 SNVs to form the read signature, 60 062 reads could be represented by 60 061 read signatures requiring ∼26.88 GB of memory to store the signature matrix alone (Table 2), plus more memory required for clustering and other computational processes, which may be too large for some systems. On the other hand, clustering 11 519 read signatures using the top 11 SNVs or 16 008 read signatures using the top 12 SNVs would require about 0.99 and 1.91 GB memory, respectively, which would be applicable to most modern computation systems. Based on this calculation, we tested two read signature thresholds (12 000 and 17 000) using a Docker container with four CPUs and 16 GB memory. 
While the QuasiSeq processing of the dataset with a 17 000 read signatures threshold could not finish within 2 days, the processing of the dataset with a 12 000 read signatures threshold was completed within 211 min with a total of 60 047 sequencing reads grouped correctly into four clusters and 15 unassigned reads (Fig. 3A). The consensus sequences acquired from QuasiSeq with the 12 000 read signatures threshold have an identity range of 99.90–99.96% compared with the reference sequences determined by the Sanger sequencing method (Table 3A). The consensus sequences calculated with the reads in the final clusters are listed in Supplementary Data S4A. The comparison of the reads assigned to each cluster with their source is listed in a clustering confusion matrix (Supplementary Table S6A). The reads were assigned to the correct SGAs with an average of 99.94% accuracy, 99.96% specificity, 99.87% sensitivity and 99.89% precision (Supplementary Table S7A). Therefore, we set 12 000 read signatures as the default threshold.
Table 2.
Memory requirement for the Spectral Clustering of read signatures with different length
| SNV number (l) | Signature number (n) | n² | Memory required for storing matrix (GB) |
|---|---|---|---|
| 479 | 60 061 | 3 607 323 721 | 26.88 |
| 16 | 27 219 | 740 873 961 | 5.52 |
| 15 | 24 376 | 594 189 376 | 4.43 |
| 14 | 21 766 | 473 758 756 | 3.53 |
| 13 | 18 571 | 344 882 041 | 2.57 |
| 12 | 16 008 | 256 256 064 | 1.91 |
| 11 | 11 519 | 132 687 361 | 0.99 |
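The memory figures in Table 2 follow from storing an n by n double-precision matrix. The short sketch below reproduces that arithmetic and inverts it to estimate the largest signature count that fits a given budget for the matrix alone; the function names are hypothetical, and the memory needed by the clustering itself is not included.

```python
import math

BYTES_PER_DOUBLE = 8

def affinity_matrix_gb(n_signatures):
    """Memory (GB) needed to store an n x n double-precision matrix."""
    return n_signatures ** 2 * BYTES_PER_DOUBLE / 1024 ** 3

def max_signatures_for(memory_gb):
    """Largest n whose n x n double-precision matrix fits in memory_gb."""
    return math.isqrt(int(memory_gb * 1024 ** 3) // BYTES_PER_DOUBLE)

full = round(affinity_matrix_gb(60061), 2)    # all 479 SNVs
top12 = round(affinity_matrix_gb(16008), 2)   # top 12 SNVs
top11 = round(affinity_matrix_gb(11519), 2)   # top 11 SNVs
cap_1gb = max_signatures_for(1)
```

Because the requirement grows with the square of the signature count, halving the number of signatures cuts the matrix memory to a quarter.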
Fig. 3.

Pro4 clustering tree. (A) All full-length subreads: 60 047 reads were assigned to four final clusters. (B) CCS reads: 22 789 reads were assigned to four final clusters. (C) Subreads QV <90: 19 988 subreads were assigned to four final clusters. (D) Subreads QV <85: 3809 subreads were assigned to four final clusters. (E) Subreads QV <80: 818 subreads were assigned to four final clusters
Table 3.
Summary of the clusters of Pro4
| SGA | Reads not clustered | Reads in clusters | Cluster | No. of reads | Consensus accuracy (%) |
|---|---|---|---|---|---|
| (A) All full-length subreads | | | | | |
| Pt1_1 | 6 | 5248 | clust_2_1 | 5285 | 99.96 |
| Pt1_2 | 4 | 21 210 | clust_2_2 | 21 237 | 99.96 |
| Pt2_1 | 2 | 19 537 | clust_1_2 | 19 518 | 99.90 |
| Pt2_2 | 3 | 14 052 | clust_1_1 | 14 007 | 99.92 |
| Total | 15 | 60 047 | 4 | 60 047 | Avg: 99.94 |
| (B) Full-length CCS reads | | | | | |
| Pt1_1 | 1 | 2015 | clust_1_2 | 2015 | 100 |
| Pt1_2 | 8 | 7871 | clust_1_1 | 7871 | 100 |
| Pt2_1 | 12 | 7437 | clust_2_2 | 7437 | 100 |
| Pt2_2 | 3 | 5466 | clust_2_1 | 5466 | 100 |
| Total | 24 | 22 789 | 4 | 22 789 | Avg: 100 |
| (C) Full-length reads with accuracy <90% | | | | | |
| Pt1_1 | 5 | 2018 | clust_2_2 | 2018 | 99.95 |
| Pt1_2 | 3 | 7457 | clust_2_1 | 7456 | 99.95 |
| Pt2_1 | 2 | 5572 | clust_1_1 | 5571 | 99.86 |
| Pt2_2 | 2 | 4941 | clust_1_2 | 4943 | 99.85 |
| Total | 12 | 19 988 | 4 | 19 988 | Avg: 99.90 |
| (D) Full-length reads with accuracy <85% | | | | | |
| Pt1_1 | 4 | 487 | clust_1_2 | 490 | 99.88 |
| Pt1_2 | 2 | 1395 | clust_1_1 | 1395 | 99.88 |
| Pt2_1 | 4 | 980 | clust_2 | 977 | 99.75 |
| Pt2_2 | 2 | 947 | clust_3 | 947 | 99.76 |
| Total | 12 | 3809 | 4 | 3809 | Avg: 99.81 |
| (E) Full-length reads with accuracy <80% | | | | | |
| Pt1_1 | 2 | 42 | clust_2_2 | 53 | 99.79 |
| Pt1_2 | 3 | 306 | clust_2_1 | 297 | 99.79 |
| Pt2_1 | 1 | 240 | clust_1 | 238 | 99.55 |
| Pt2_2 | 0 | 230 | clust_3 | 230 | 99.61 |
| Total | 6 | 818 | 4 | 818 | Avg: 99.68 |
To further test the robustness of the QuasiSeq workflow to noise, we filtered the 60 062 reads into several sub-datasets: 22 813 CCS reads with >99% accuracy and Q ≥ 20, 20 000 subreads with accuracy <90% (QV < 90), 3821 subreads with accuracy <85% (QV < 85) and 824 subreads with accuracy <80% (QV < 80). Most reads of these sub-datasets (22 789 CCS, 19 988 QV < 90, 3809 QV < 85 and 818 QV < 80) were all clustered into four clusters successfully (Table 3B–E, Fig. 3B–E) and the details of the sequencing reads assigned to each cluster compared with their source were listed in clustering confusion matrix tables (Supplementary Table S6B–E). The reads have been assigned to the correct SGAs with high accuracy, specificity, sensitivity and precision. Higher-quality reads result in clusters with higher accuracy, specificity, sensitivity and precision (Supplementary Table S7B–E). Moreover, the final consensus sequences have high accuracy compared with the SGA sequences in all groups (Table 3B–E). The consensus sequences calculated with the reads of different quality in the final clusters were listed in Supplementary Data S4B–E. It is worth noting that even the 824 sequencing reads with very low quality (accuracy <80%, mean accuracy =76.9%) could be assigned to the correct clusters with 99.15% accuracy, 99.56% specificity, 97.94% sensitivity and 98.66% precision (Supplementary Table S7E). Furthermore, the accuracy of the final consensus sequences obtained with these low-quality reads reached 99.68% (99.55–99.79%; Table 3E), demonstrating the power of the QuasiSeq pipeline. It is also worth noting that the consensus sequences of clusters obtained from CCS sequencing reads are 100% accurate compared with the sequences assembled from Sanger sequencing reads (Table 3B).
3.3 Test of QuasiSeq workflow with publicly available dataset
To validate our QuasiSeq workflow, we analyzed published sequencing data from a mixture of 10 IAV clones: SRR2042468 (Artyomenko et al., 2017). These 10 IAV clones have 1–13 SNVs, resulting in a total of 60 SNVs, as compared with the sequence of the parent clone (Supplementary Fig. S1). For completeness, the sequence of the parent clone and the 10 child clones were downloaded and enclosed (Supplementary Data S5 and S6). As described above, the difference can be significantly increased by only comparing 60 SNV positions instead of all 2300 positions (Supplementary Table S1A and B).
The PacBio sequencing reads of the 10-clone mixture were aligned to the sequence of the parent clone. A total of 22 161 full-length reads defined as covering the full 2300 bp reference were extracted based on the alignment (Fig. 4). The consensus sequence was then calculated using these full-length reads and these reads were then aligned to the consensus sequence and the SNVs were identified. These SNVs were sorted based on their significance and the top 28 SNVs were used to construct the read signatures with each full-length read represented by one of 599 read signatures. These read signatures were clustered automatically into four clusters. Each group of reads was then processed with the same procedure again until no significant variants were found for any group or the subsequent subgroups generated during the process. After four SigClust cycles, all 10 clones in the mixture were successfully identified with 10 final clusters containing a total of 22 161 sequencing reads (Fig. 4). The consensus sequences acquired from QuasiSeq were listed in Supplementary Data S7 having 100% identity compared to the published reference sequences (Table 4). The frequency of each clone calculated with the read count in each cluster is very similar to the expected frequency of the original mixture (Table 4).
Fig. 4.

Hierarchical Clustering results for fluA data. 26 481 full-length sequencing reads were selected from the benchmark dataset SRR2042468. In total, 26 480 of them were assigned to 10 clusters by the QuasiSeq workflow
Table 4.
Summary of the clusters of SRR2042468 dataset generated from 10 IAV Clones
| Clone ID | Exp Freq (%) | Predictive Freq (%), CliqueSNV | Predictive Freq (%), 2SNV | Predictive Freq (%), PH | Cluster ID | Con_Seq Identity (%) | Read count | Freq (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 50 | 52.6 | 51.8 | 56.7 | 3_3 | 100 | 11 682 | 52.71 |
| 2 | 25 | 23.7 | 23.7 | 23.8 | 1 | 100 | 5223 | 23.57 |
| 3 | 12.5 | 12.6 | 12.5 | 13.7 | 2 | 100 | 2791 | 12.59 |
| 4 | 6.25 | 6.4 | 6.4 | 0 | 3_1 | 100 | 1412 | 6.37 |
| 5 | 3.13 | 2.3 | 2.3 | 3.1 | 4 | 100 | 539 | 2.43 |
| 6 | 1.56 | 1.17 | 1.2 | 0 | 3_2_1_1 | 100 | 260 | 1.17 |
| 7 | 0.78 | 0.7 | 0.7 | 1.5 | 3_2_2 | 100 | 153 | 0.69 |
| 8 | 0.39 | 0.35 | 0.3 | 1.2 | 3_2_1_2 | 100 | 65 | 0.29 |
| 9 | 0.19 | 0.12 | 0.1 | 0 | 3_2_1_3 | 100 | 24 | 0.11 |
| 10 | 0.097 | 0.051 | 0 | 0 | 3_2_1_4 | 100 | 12 | 0.054 |
| FP | 0 | 1 | ||||||
Note: The values in column 2 (expected frequency), column 3 (CliqueSNV frequency), column 4 (2SNV frequency) and column 5 (PH frequency) were obtained from the published papers (Artyomenko et al., 2017; Knyazev et al., 2021); Con_Seq Identity is the identity of the assembled consensus sequence compared with the known reference sequence; FP, false positive.
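The QuasiSeq frequencies in Table 4 are the cluster read counts divided by the total number of clustered reads. A minimal sketch of that calculation, using the read counts from the table and a hypothetical function name:

```python
def cluster_frequencies(read_counts):
    """Percentage of clustered reads that fall in each cluster."""
    total = sum(read_counts.values())
    return {cluster: 100.0 * n / total for cluster, n in read_counts.items()}

# Read counts per QuasiSeq cluster, taken from Table 4.
table4_counts = {
    "3_3": 11682, "1": 5223, "2": 2791, "3_1": 1412, "4": 539,
    "3_2_1_1": 260, "3_2_2": 153, "3_2_1_2": 65, "3_2_1_3": 24, "3_2_1_4": 12,
}
freqs = cluster_frequencies(table4_counts)
```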
Compared with the performance of CliqueSNV, 2SNV and PredictHaplo (PH) reported by others on the same dataset (Artyomenko et al., 2017; Knyazev et al., 2021), the QuasiSeq workflow detected the quasispecies with the same accuracy and sensitivity as CliqueSNV and at a lower abundance (0.097%) than 2SNV and PH, with no false-positive clones, whereas PH could only detect 6 out of the 10 clones (Table 5). This demonstrates higher sensitivity and specificity for QuasiSeq and CliqueSNV. The predicted frequencies of the 10 clones are close to the expected frequencies (Table 4, Supplementary Table S8). The EMD of the QuasiSeq-predicted quasispecies (2.30) is comparable to that of CliqueSNV (2.2), and the recall (100%) and precision (100%) of the QuasiSeq-predicted quasispecies are the same as those of CliqueSNV (Knyazev et al., 2021). Analysis of subsamples of this dataset (16K, 8K and 4K) with QuasiSeq shows higher sensitivity than CliqueSNV (Supplementary Table S8).
Table 5.
Comparison of QuasiSeq with CliqueSNV, 2SNV and PH on the full 10 IAV data
| Method | Clone ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | FP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | True Freq (%) | 50 | 25 | 12.5 | 6.25 | 3.125 | 1.56 | 0.78 | 0.39 | 0.19 | 0.097 | 0 |
| QuasiSeq | Match | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0 |
| | Freq (%) | 52.7 | 23.6 | 12.6 | 6.4 | 2.4 | 1.17 | 0.7 | 0.29 | 0.11 | 0.054 | 0 |
| CliqueSNV | Match | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0 |
| | Freq (%) | 52.6 | 23.7 | 12.6 | 6.4 | 2.3 | 1.17 | 0.7 | 0.35 | 0.12 | 0.051 | 0 |
| 2SNV | Match | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 1 |
| | Freq (%) | 51.8 | 23.7 | 12.5 | 6.4 | 2.3 | 1.2 | 0.7 | 0.3 | 0.1 | 0 | 1.0 |
| PH | Match | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | 0 |
| | Freq (%) | 56.7 | 23.7 | 13.7 | 0 | 3.1 | 0 | 1.5 | 1.2 | 0 | 0 | 0 |
Note: The performance data for CliqueSNV, 2SNV and PH were obtained from the published paper by others (Artyomenko et al., 2017; Knyazev et al., 2021).
To assess the performance of SNV calling, we compared the SNV positions in the read signatures (Supplementary Table S9) with the known positions (Supplementary Fig. S1); they were 100% consistent.
3.4 Optimization of computation
Given that the subgroups of a group of reads can be processed separately, we further improved the efficiency of QuasiSeq by implementing parallel computation in the QuasiSeq Docker container, which processes each subgroup of reads from a cluster independently. As shown in Supplementary Table S10A, parallel processing saves 20–36% of the total computation time for the three datasets used in this study. Since the first round of clustering is processed in the same way in both modes, the saving occurs only after the first round, where it amounts to 35–56% for these datasets. The efficiency of QuasiSeq can be further improved by increasing the number of CPUs and the memory size (Supplementary Table S10B). Increasing the resources from 4 CPUs and 16 GB to 8 CPUs and 32 GB, and further to 16 CPUs and 64 GB of memory, reduces computation time in both serial and parallel modes.
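The idea of processing independent subgroups in parallel can be sketched with Python's standard library. This is not the QuasiSeq implementation: `subgroup_consensus` is a hypothetical stand-in for the real per-subgroup work (here a simple column-wise majority vote over equal-length read signatures), and `process_cluster` is an illustrative name.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def subgroup_consensus(signatures):
    """Hypothetical per-subgroup task: column-wise majority vote."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*signatures))


def process_cluster(subgroups, workers=4, executor_cls=ProcessPoolExecutor):
    """Process each subgroup independently; results keep the input order.

    ProcessPoolExecutor suits CPU-bound work; ThreadPoolExecutor can be
    passed instead for quick experiments.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(subgroup_consensus, subgroups))
```

For example, calling `process_cluster([["ACG", "ACG", "ATG"], ["TTA", "TTA", "TCA"]], workers=2)` from a script's main entry point would compute one consensus per subgroup. Because the subgroups share no state, the only serial part left is the step that produces them, which is consistent with the saving being confined to the rounds after the first.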
3.5 Graphical User Interface (GUI) of QuasiSeq
The final QuasiSeq pipeline has been implemented with a GUI to facilitate usage (Fig. 5, Supplementary Fig. S3). The input interface includes four sections: (i) Essential Arguments: ‘Input Fastq’ for the path to the sequencing data file in fastq format, ‘Reference Fasta’ for the path to the reference sequence file in fasta format, and ‘Ranking Method’ for choosing between ‘P-value’ and ‘percentage’ to rank variants. (ii) Options: ‘Start & End’ for the start and end coordinates in the reference sequence that define full-length sequences. (iii) Thresholds: ‘Estimate sequencing substitution error rate’ is set based on sequencing quality. For PacBio datasets, most errors are insertions and deletions, and the substitution error rate is about 1.07% when BLASR is used as the alignment tool (Dohm et al., 2020); therefore, we set the default to 0.01. ‘Maximum number of read signatures’ controls the computing resources required; as described above, its default is 12 000. ‘Maximum SNV P-value’ and ‘Minimum SNV frequency’ determine the SNVs used for the read signature. ‘Minimum number of reads supporting a quasispecies’ is the lowest read count allowed for a cluster, and ‘Minimum nucleotides differentiate two quasispecies’ sets the number of SNVs required to separate two quasispecies. The values set for these thresholds are listed in Supplementary Table S11. (iv) Examples: the example data we provide to test the installation. The user can use the example feature to obtain example settings for ‘Essential Arguments’ and ‘Options’ (Fig. 5). After the input data are successfully analyzed, the results are presented in the output interface, which provides a pie chart and a frequency table to visualize the quasispecies composition of the sample. The chart and the table can be downloaded, as can the final consensus sequences of all quasispecies (Supplementary Fig. S3).
All essential files for the pipeline and GUI are publicly available in GitHub: https://github.com/LHRI-Bioinformatics/QuasiSeq. The current release (v1.0.0) used in this report is archived and available at https://zenodo.org/badge/latestdoi/340494542.
Fig. 5.

GUI input interface of the QuasiSeq pipeline. All essential arguments and options can be selected and/or entered. An example dataset is included for testing of the installation
4 Discussion
In this study, we have developed a novel analytical workflow that allows for the efficient use of single-pass, continuous long reads generated by the PacBio SMRT® Sequencing technology to deconvolute complex mixtures of HIV-1 and influenza genomes. Based on the design of the algorithm, the correct number and sequences of the different variants present in the original samples can be obtained even when variants differ from each other by as few as a single nucleotide. Importantly, the workflow described does not require a priori definition of the number of putative unique genomes comprising the sample to get an accurate result and explores the entire dataset to derive an independent set of unique genetic variants present in the original sample. Overall, the results shown in the present study demonstrate that it is possible to overcome the error rate present in raw reads derived from SMRT sequencing to obtain highly accurate sequences comprising complex genetic mixtures. We also demonstrated that final consensus sequences with high accuracy could be obtained with very low-quality sequencing data (Table 3E). Our results further demonstrated that higher-quality data can produce final consensus sequences of quasispecies with higher accuracy (Table 3B–E), suggesting that quasispecies sequences can be determined accurately by the current PacBio Hi-Fi sequencing procedure.
Since the sequence read signatures are composed of only significant SNPs with homozygous positions removed, using a signature to cluster sequencing reads has three advantages: (i) As described above, the differences calculated with read signatures are much higher than those calculated with full-length sequences (Supplementary Table S1A and B). These larger differences make it easier to differentiate the subspecies and enhance the sensitivity and specificity of the QuasiSeq pipeline (Table 5). (ii) The computational power requirement is significantly reduced compared with using full-length sequences. (iii) The computational power requirement can be further adjusted by self-tuning the length of the signatures based on the available computational resources (Table 2). ‘Maximum number of read signatures’ is used to limit the memory needed. After the sequencing reads have been partitioned into clusters, these clusters are independent and can be processed in parallel, further reducing computation time.
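Advantage (i) can be illustrated with a toy calculation. The sequences, SNV positions and helper names below are invented for illustration and are not taken from the datasets in this study; the point is only that the same two mismatches weigh far more in a short signature than in a full-length sequence.

```python
def fraction_different(a, b):
    """Fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def signature(sequence, snv_positions):
    """Keep only the bases at the chosen (0-based) SNV positions."""
    return "".join(sequence[p] for p in snv_positions)


def mutate(sequence, positions, base="G"):
    """Return a copy of `sequence` with `base` written at each position."""
    chars = list(sequence)
    for p in positions:
        chars[p] = base
    return "".join(chars)


# Two toy haplotypes of length 1000 that differ at two sites.
reference = "A" * 1000
hap1 = mutate(reference, [100, 500])
hap2 = mutate(reference, [500, 900])
snv_positions = [100, 500, 900]

full_diff = fraction_different(hap1, hap2)
sig_diff = fraction_different(
    signature(hap1, snv_positions), signature(hap2, snv_positions)
)
print(f"full-length difference: {full_diff:.3%}")
print(f"signature difference:   {sig_diff:.1%}")
```

Comparing three-base signatures instead of 1000-base sequences also shows why advantage (ii) holds: every pairwise comparison touches far fewer positions.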
Two thresholds, ‘Maximum SNV P-value’ and ‘Minimum SNV frequency’, are used to define the SNVs for building the read signatures. The ‘Minimum SNV frequency’ setting is based on the quality of the sequencing reads: a higher minimum SNV frequency needs to be set for lower-quality sequencing reads in order to select true SNVs. ‘Minimum number of reads supporting a quasispecies’ and ‘Minimum nucleotides differentiate two quasispecies’ are used to define a cluster. Lowering the values of these two thresholds increases the sensitivity of the QuasiSeq workflow. Together, these thresholds define the stop points of the QuasiSeq analysis of a sample.
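A minimal sketch of how the two SNV thresholds could act as a filter before signatures are built is given below. The `CandidateSNV` record, the function names and all numeric values are hypothetical and chosen only for illustration; the thresholds actually used are those listed in Supplementary Table S11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSNV:
    position: int     # 0-based position in the aligned read
    frequency: float  # fraction of reads carrying the variant
    p_value: float    # significance of the variant call


def select_snvs(candidates, max_p_value, min_frequency):
    """Keep positions passing both thresholds, in genomic order."""
    kept = [
        c for c in candidates
        if c.p_value <= max_p_value and c.frequency >= min_frequency
    ]
    return sorted(c.position for c in kept)


def read_signature(aligned_read, positions):
    """Concatenate the read's bases at the selected SNV positions."""
    return "".join(aligned_read[p] for p in positions)


# Illustrative values only.
candidates = [
    CandidateSNV(position=2, frequency=0.30, p_value=1e-8),
    CandidateSNV(position=5, frequency=0.004, p_value=1e-9),  # too rare
    CandidateSNV(position=7, frequency=0.20, p_value=0.2),    # not significant
]
positions = select_snvs(candidates, max_p_value=0.01, min_frequency=0.01)
print(positions, read_signature("ACGTACGT", positions))
```

Raising `min_frequency` in this sketch discards more low-frequency candidates, which mirrors the recommendation to use a higher minimum SNV frequency for lower-quality reads.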
The QuasiSeq workflow can reconstruct low-abundance haplotypes because reads are clustered on read signatures that are built from SNVs selected by frequency and P-value. As a consequence, the abundant haplotypes are constructed first; the reads belonging to them are absent from the remaining clusters, so the remaining haplotypes become relatively more abundant and can be reconstructed in turn. These cycles continue until all haplotypes are constructed. As a result, QuasiSeq can identify haplotypes with higher sensitivity than peer tools, including the most recently developed CliqueSNV (Knyazev et al., 2021).
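The enrichment effect described above can be made concrete with the expected clone frequencies from Table 5. The sketch below is a deliberate simplification that assumes exactly one haplotype, the most abundant, is resolved and removed per cycle; the actual workflow partitions reads into clusters and may resolve several haplotypes at once. The function name is hypothetical.

```python
# Expected clone frequencies (%) from Table 5.
TRUE_FREQ = [50, 25, 12.5, 6.25, 3.125, 1.56, 0.78, 0.39, 0.19, 0.097]


def rarest_share_per_cycle(frequencies):
    """Share (%) of the rarest haplotype among the reads still unassigned,
    before each cycle, when the most abundant haplotype is removed per cycle."""
    remaining = sorted(frequencies, reverse=True)
    rarest = remaining[-1]
    shares = []
    while remaining:
        shares.append(rarest / sum(remaining) * 100)
        remaining.pop(0)  # reads of the most abundant haplotype leave the pool
    return shares


for cycle, share in enumerate(rarest_share_per_cycle(TRUE_FREQ), start=1):
    print(f"cycle {cycle}: rarest clone is {share:.3f}% of remaining reads")
```

Under this simplification, the rarest clone roughly doubles its share of the remaining reads after the dominant clone is removed, and its share keeps growing with every further cycle.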
Moreover, the workflow has been tested with low-quality PacBio sequencing reads and produced acceptable results (Table 3D and E, Supplementary Tables S6D, E and S7D, E). Since the error rate of ONT sequencing data is at a similar level and read errors are also random, the pipeline can probably be used with ONT sequencing data, which can provide longer reads than the PacBio technology.
Furthermore, we have added a web interface to QuasiSeq with straightforward threshold settings based on the sequencing data quality, making it easy for a laboratory researcher to use without the help of a professional bioinformatician or computational biologist.
In summary, the QuasiSeq workflow developed in this study can identify quasispecies with high accuracy and high sensitivity using sequencing data generated by third-generation sequencing technology. The workflow exhibits superior robustness to sequencing errors and the scalability to handle big datasets for rare quasispecies discovery.
Supplementary Material
Acknowledgements
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government. This research was supported [in part] by the National Institute of Allergy and Infectious Diseases.
Funding
This work has been supported by federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E.
Conflict of Interest: none declared.
Contributor Information
Xiaoli Jiao, Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
Hiromi Imamichi, Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA.
Brad T Sherman, Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
Rishub Nahar, Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
Robin L Dewar, Virus Isolation and Serology Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
H Clifford Lane, Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA.
Tomozumi Imamichi, Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
Weizhong Chang, Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
References
- Ahn S., Vikalo H. (2018) aBayesQR: a Bayesian method for reconstruction of viral populations characterized by low diversity. J. Comput. Biol., 25, 637–648.
- Amarasinghe S.L. et al. (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol., 21, 30.
- Artyomenko A. et al. (2017) Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants. J. Comput. Biol., 24, 558–570.
- Celery.Project. Celery - Distributed Task Queue. https://docs.celeryproject.org/en/stable/, celery@afc3a22b9d67 v4.0.0.
- Chaisson M.J., Tesler G. (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 13, 238.
- Chen J. et al. (2018) De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics, 34, 2927–2935.
- Dilernia D.A. et al. (2015) Multiplexed highly-accurate DNA sequencing of closely-related HIV-1 variants using continuous long reads from single molecule, real-time sequencing. Nucleic Acids Res., 43, e129.
- Django.Project. Django: The Web Framework for Perfectionists with Deadlines. https://www.djangoproject.com/, Version 1.8.
- Docker.Web.Site. Docker (Software). https://docs.docker.com/, Community Version 20.10.14.
- Dohm J.C. et al. (2020) Benchmarking of long-read correction methods. NAR Genome Bioinform., 2, lqaa037.
- Domingo E. et al. (2019) Viral fitness: history and relevance for viral pathogenesis and antiviral interventions. Pathog. Dis., 77, ftz021.
- Huang C. et al. (2018) Towards personalized medicine: an improved de novo assembly procedure for early detection of drug resistant HIV minor quasispecies in patient samples. Bioinformation, 14, 449–454.
- Huang D.W. et al. (2016) Towards better precision medicine: PacBio single-molecule long reads resolve the interpretation of HIV drug resistant mutation profiles at explicit quasispecies (haplotype) level. J. Data Mining Genomics Proteomics, 7.
- Huang S.W. et al. (2019) Application of deep sequencing methods for inferring viral population diversity. J. Virol. Methods, 266, 95–102.
- Imamichi H. et al. (2016) Defective HIV-1 proviruses produce novel protein-coding RNA species in HIV-infected patients on combination antiretroviral therapy. Proc. Natl. Acad. Sci. U S A, 113, 8783–8788.
- Jayasundara D. et al. (2015) ViQuaS: an improved reconstruction pipeline for viral quasispecies spectra generated by next-generation sequencing. Bioinformatics, 31, 886–896.
- Knyazev S. et al. (2021) Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res., 49, e102.
- Koboldt D.C. et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res., 22, 568–576.
- Posada-Cespedes S. et al. (2017) Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res., 239, 17–32.
- Ye C., Ma Z.S. (2016) Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads. PeerJ, 4, e2016.
- Zelnik-Manor L., Perona P. (2004) Self-tuning spectral clustering. In: Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS'04, pp. 1601–1608.