PLOS Computational Biology. 2022 Jun 1;18(6):e1009730. doi: 10.1371/journal.pcbi.1009730

Improved transcriptome assembly using a hybrid of long and short reads with StringTie

Alaina Shumate 1,2, Brandon Wong 1,2,3,4, Geo Pertea 5, Mihaela Pertea 1,2,3,*
Editor: Jinyan Li 6
PMCID: PMC9191730  PMID: 35648784

Abstract

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Author summary

Identifying the genes that are active in a cell is a critical step in studying cell development, disease, the response to infection, the effects of mutations, and much more. During the last decade, high-throughput RNA-sequencing data have proven essential in characterizing the set of genes expressed in different cell types and conditions, which has driven a strong need for highly efficient, scalable and accurate computational methods to process these data. As sequencing costs have dropped, ever-larger experiments have been designed, often capturing hundreds of millions or even billions of reads in a single study. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Recently developed long-read technology now allows researchers to capture entire transcripts in a single long read, enabling more accurate reconstruction of the full exon-intron structure of genes, although these reads have higher error rates and higher costs. In this study we use the high accuracy of short reads to correct the alignments of long RNA reads, with the goal of improving the identification of novel gene isoforms, and ultimately our understanding of transcriptome complexity.


This is a PLOS Computational Biology Software paper.

Introduction

Uncovering the transcriptome of an organism is crucial to understanding the functional elements of the genome. This requires being able to accurately identify transcript structure and quantify transcript expression levels. In eukaryotes, this task is more challenging due to alternative splicing. It occurs frequently, with an estimated 92%-94% of human genes undergoing alternative splicing [1]. Short-read RNA-sequencing (RNA-seq) has been a useful tool in uncovering the transcriptome of many organisms when coupled with computational methods for transcriptome assembly and abundance estimation. Short-read sequencing provides the advantage of deep coverage and highly accurate reads. Second-generation sequencers such as those from Illumina can produce millions of reads with an error rate of less than 1% [2]. While second-generation sequencers produce very large numbers of reads, their read lengths are typically quite short, in the range of 75–125 bp for most RNA-seq experiments today. These short reads often align to more than one location in the genome, and also suffer the limitation that they rarely span more than two exons, resulting in a difficult and sometimes impossible task of constructing an accurate assembly of genes with multiple exons and many diverse isoforms, no matter how deeply those genes are sequenced. These issues can be alleviated by third-generation sequencing technologies such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Reads from these technologies can be greater than 10 kilobases long, allowing full-length transcripts to be sequenced. However, practical limitations often impede the ability to capture full-length transcripts. These include the rapid rate of RNA degradation, shearing of the RNA during library preparation, or incomplete synthesis of cDNA [3]. 
Additionally, long reads have a high error rate relative to Illumina short reads [4], and the throughput of long-read RNA-seq is much lower than that of short-read RNA-seq. This can make it difficult in some cases to define precise splice sites. Using a combination of short reads and long reads for transcriptome assembly allows us to take advantage of the strengths of each technology and mitigate the weaknesses. While there are many tools that use either short reads or long reads for transcriptome assembly and quantification, there are very few that use a hybrid of the two. These tools include Trinity [5], IDP-denovo [6], and rnaSPAdes [7], which only perform de novo transcriptome assembly. If a high-quality reference genome of the target organism is available, as it is for human and for a large number of plants, animals, and other species, de novo transcriptome assembly usually produces lower-quality assemblies compared to reference-based approaches. This is due to technical challenges resulting from the presence of gene families, large variations in gene expression, and extensive alternative splicing [8]. StringTie is a reference-based transcriptome assembler that can assemble either long reads or short reads, and has been shown to be more accurate than existing short and long read assemblers [9].

In this work we present a new release of StringTie which allows transcriptome assembly and quantification using a hybrid dataset containing both short and long reads. We show with simulated data from the human transcriptome that hybrid-read assemblies result in more accurate assembly and coverage estimates than using long reads or short reads alone. Additionally, we evaluate the assembly accuracy on 9 real datasets from 3 well-studied species (human, Mus musculus, and Arabidopsis thaliana) and demonstrate that the hybrid-read assemblies are more accurate than both the long-read only and short-read only assemblies. We also demonstrate that hybrid-read assembly is more accurate and also substantially faster than a strategy of correcting long reads prior to assembly.

Results

Our hybrid transcriptome assembly algorithm takes advantage of the strengths of both long and short read RNA sequencing, by combining the capacity of long reads to capture longer portions of transcripts with the high accuracy and coverage of short-read data to produce better transcript structures as well as better expression estimates. Fig 1A shows examples of alignment artifacts that are often present in long reads because of the high error rate. These include “fuzzy” splice sites as well as retained introns, spurious extra exons, falsely skipped exons, and false alternative splice sites. Fig 1B shows a specific example of a 9-exon isoform of a human gene that can only be correctly assembled using both long and short reads. There are no long reads mapped to the first 3 exons of this isoform, and we see a retained intron in the long-read alignments. Among the short-read alignments, the 4th and 7th introns are each spanned by only a single spliced read, and exons 5 and 8 are not completely covered. As a result, short reads alone assemble the transcript in 3 fragments. Using both long and short reads we were able to correctly assemble the transcript by using the short reads to support the splice sites found in the long-read alignments (see Methods). The adequate short-read coverage of exons 1–3 also allowed us to assemble these despite the lack of coverage in the long reads.

Fig 1.

A) Artifacts present in the long-read alignments: i) retained introns; ii) disagreement around the splice sites; iii) spurious extra exons; iv) falsely skipped exons; v) false alternative splice sites. B) Example of a human transcript that can only be correctly assembled using both the long and short reads. This is human transcript ENST000000361722.7 from the TBKBP1 gene. Blue lines in the middle of the reads (gray boxes) indicate a spliced alignment. Purple lines within the reads indicate mismatches in the alignment. The long-read alignments do not have coverage of exons 1–3 and contain a retained intron. The short-read alignments lack adequate splice-site support across the 4th intron and the 7th intron and do not have complete coverage of exons 5 and 8.

Next, we present results for StringTie’s performance with hybrid long and short read sequences on simulated data as well as on three real RNA-seq data sets, from human, mouse, and the model plant Arabidopsis thaliana.

Simulated data

Since it is not possible to know the true transcripts that are present in real RNA-seq datasets, we first used simulated data to assess the accuracy of hybrid-read assembly and quantification across the transcriptome. To this end, we simulated two human RNA-seq datasets, one with short reads and one with ONT direct RNA long reads (see Methods), and assembled them with StringTie.

To evaluate the accuracy of hybrid-read assemblies compared to long-read only and short-read only assemblies, we generated 4 different assemblies of each read type (long, short, and hybrid) with 4 different sets of parameters (Fig 2A). We then computed the precision and sensitivity for each assembly. Precision is defined as the percent of assembled transcripts that match a true transcript, and sensitivity is defined as the percent of true transcripts that match an assembled transcript (see Methods). For these calculations, we considered a transcript to be truly expressed only if it was fully covered by either the short or long simulated reads. For each hybrid-read assembly, we calculated the relative percent increase in precision and sensitivity over the long-read and short-read assemblies with the same parameters (see Methods). When we report the percent increase of any metric, we are referring to the relative percent increase. Averaging these results, we saw that hybrid-read assemblies had an increase in precision of 9.8% over the long-read assemblies, and an increase in sensitivity of 24.4%. As compared to the short-read assemblies, the hybrid-read assemblies had an increase in precision of 12.5% and an increase in sensitivity of 22.1%. To confirm that these improvements were not simply due to increased coverage in the hybrid reads, we performed the same experiment where the long, short, and hybrid-read datasets all had approximately equal coverage (See Methods and S1 File). When controlled for coverage, we still see that the hybrid-read assembly clearly outperforms both the long and short-read only assemblies in precision and sensitivity (S1 Fig).
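The definitions of precision, sensitivity, and relative percent increase given above can be made concrete with a small sketch. It assumes that each transcript has already been reduced to a hashable key (for instance an identifier or an intron chain), so that "matching" is simple set membership; the actual matching criterion is the one described in Methods. The function names and the toy transcript sets are hypothetical.

```python
def precision(assembled, truth):
    """Percent of assembled transcripts that match a true transcript."""
    assembled, truth = set(assembled), set(truth)
    if not assembled:
        return 0.0
    return 100.0 * len(assembled & truth) / len(assembled)


def sensitivity(assembled, truth):
    """Percent of true transcripts that match an assembled transcript."""
    assembled, truth = set(assembled), set(truth)
    if not truth:
        return 0.0
    return 100.0 * len(assembled & truth) / len(truth)


def relative_percent_increase(new, old):
    """Relative (not absolute) percent change of a metric from old to new."""
    return 100.0 * (new - old) / old


# Toy example (made-up transcript keys, not data from the paper).
truth = {"t1", "t2", "t3", "t4"}
long_only = {"t1", "t2", "x1"}        # x1 does not match any true transcript
hybrid = {"t1", "t2", "t3", "x1"}

long_sens = sensitivity(long_only, truth)    # 2 of 4 true transcripts found
hybrid_sens = sensitivity(hybrid, truth)     # 3 of 4 true transcripts found
sens_gain = relative_percent_increase(hybrid_sens, long_sens)
```

In this toy case sensitivity rises from 50% to 75%, which is an absolute gain of 25 percentage points but a relative increase of 50%. The percentages reported in this paper are of the relative kind.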

Fig 2.

A) Sensitivity and precision for StringTie assemblies of simulated data with varying sensitivity parameters. The two StringTie parameters varied were the minimum read coverage allowed for a transcript (-c) and the minimum isoform abundance as a fraction of the most abundant transcript at a given locus (-f). Each shape represents a different combination of -c,-f parameters with the values indicated in the legend. The default values of -c and -f are 1.0 and 0.01 respectively and are represented by the circle marker. B) Calculated coverage vs. expected coverage for long-read, short-read, and hybrid-read assemblies of simulated data. Coverage values are normalized to log2(1 + coverage). C) Precision of long-read, short-read, and hybrid-read assemblies of simulated data with and without guide annotation. D) Sensitivity of long-read, short-read, and hybrid-read assemblies of simulated data with and without guide annotation.

We also compared the coverage computed by StringTie to the actual coverage of long-read only, short-read only, and hybrid-read assemblies created with default parameters (Fig 2B). StringTie’s computed coverage of the hybrid-read assembly was closest to the true coverage. We found that the correlation between true and calculated coverage for hybrid-read assembly yielded an R² value of 0.966, higher than the R² values for both the short-read (0.959) and long-read (0.933) only assemblies.
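As an illustration of how such a comparison could be computed, the sketch below applies the log2(1 + coverage) normalization from the Fig 2B caption and then takes the squared Pearson correlation. Whether the reported R² is a squared Pearson correlation or a regression coefficient of determination is not stated here, so that choice is an assumption, as are the function names and the example values.

```python
import math


def log2p1(values):
    """Normalize coverage values as log2(1 + coverage)."""
    return [math.log2(1.0 + v) for v in values]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Made-up per-transcript coverages: expected (simulated) vs. computed.
expected = [0.0, 1.0, 3.0, 7.0, 15.0]
computed = [0.0, 1.2, 2.5, 8.0, 13.0]
r2 = r_squared(log2p1(expected), log2p1(computed))
```

The log transform keeps a handful of very highly expressed transcripts from dominating the correlation.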

If the reference genome annotation is reliable, some methods (including StringTie) can use that annotation to improve the accuracy of the transcriptome assembly. Note that not all transcripts in the reference annotation will be expressed in the data; the assembler therefore needs to accurately determine which of the transcripts are present. Moreover, reference annotations are usually incomplete, so StringTie’s default behavior when annotation is provided is to assume that novel transcripts could be present as well. We wanted to assess whether StringTie’s performance improves on hybrid data if the human reference annotation is provided. As shown in Fig 2C and 2D, both precision and sensitivity improved when the reference annotation was provided, and hybrid data assembly had the highest sensitivity and precision regardless of whether the reference annotation was provided or not. The improvement in precision was small with short-read data because more annotated isoforms were assembled even though they were not expressed. This was not a problem with the other data sets, as long reads were able to better recover the full exon-intron structure of the expressed transcripts. The use of hybrid-read data plus annotation had an increase in precision of 10.7% and an increase in sensitivity of 23.5% as compared to the hybrid-read data assembled without annotation.
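For orientation, the options discussed in this subsection could be combined into a command line roughly as sketched below. The --mix flag and the order of the two alignment files (short reads first, long reads second) follow the Methods section, and -c and -f follow the Fig 2 caption. The -o (output GTF) and -G (guide annotation) flags are taken from our recollection of the StringTie usage text and should be checked against `stringtie --help`. The helper name and all file names are hypothetical.

```python
def build_stringtie_mix_command(short_bam, long_bam, out_gtf,
                                guide_gff=None, min_cov=1.0, min_frac=0.01):
    """Build an argument list for a hybrid-read StringTie run.

    min_cov maps to -c and min_frac to -f; the defaults mirror the values
    quoted in the Fig 2 caption (1.0 and 0.01). Nothing is executed here.
    """
    cmd = ["stringtie", "--mix", "-o", out_gtf,
           "-c", str(min_cov), "-f", str(min_frac)]
    if guide_gff is not None:
        # Annotation-guided assembly (flag name assumed, see lead-in).
        cmd += ["-G", guide_gff]
    # Short-read alignments first, long-read alignments second.
    cmd += [short_bam, long_bam]
    return cmd


# Hypothetical file names, for illustration only.
unguided = build_stringtie_mix_command("short.bam", "long.bam", "hybrid.gtf")
guided = build_stringtie_mix_command("short.bam", "long.bam", "hybrid_guided.gtf",
                                     guide_gff="annotation.gtf")
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the assembly on a machine where `stringtie` is on the PATH.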

Real data

Next, we evaluated the accuracy of hybrid-read assemblies on real data, which is in general much more challenging than simulated data, in part because the real data may contain biases or other artifacts not always captured by simulated data. From publicly available data, we chose a total of 9 combinations of long and short reads from 3 well-studied species: Arabidopsis thaliana, Mus musculus, and human. The Mus musculus samples include samples from brain and liver tissue. The human samples are from the NA12878 cell line. Each combination of long and short reads is derived from the same sample. All three species have well-characterized reference annotation available, even though their level of completeness is not fully established [10]. The short-read libraries were all generated through poly-A selection and sequenced with Illumina sequencers. The long reads were generated by a variety of technologies including ONT direct RNA, ONT cDNA, and PacBio cDNA (Table 1). The quality of these long reads varies with error rates ranging from 3.2% to 17.2% (S1A Table) and the percentage of reads that cover full-length isoforms ranges from 25.4% to 67.2% (S2 Table).

Table 1. Availability of real RNA-seq datasets and descriptions of sequencing technology used including chemistry and basecaller version for ONT datasets.

Accession Number | Database | Species (Tissue) | Sequencing Technology
ERR3486096 | European Nucleotide Archive | A. thaliana | Illumina HiSeq 4000
ERR3764345 | European Nucleotide Archive | A. thaliana | ONT direct RNA; SQK-RNA001; MinION; Guppy v2.3.1
ERR3486098 | European Nucleotide Archive | A. thaliana | Illumina HiSeq 4000
ERR3764349 | European Nucleotide Archive | A. thaliana | ONT direct RNA; SQK-RNA001; MinION; Guppy v2.3.1
ERR3486099 | European Nucleotide Archive | A. thaliana | Illumina HiSeq 4000
ERR3764351 | European Nucleotide Archive | A. thaliana | ONT direct RNA; SQK-RNA001; MinION; Guppy v2.3.1
ERR2680378 | European Nucleotide Archive | M. musculus (brain) | Illumina HiSeq 4000
ERR2680375 | European Nucleotide Archive | M. musculus (brain) | ONT direct RNA; SQK-RNA001; MinION; Albacore 2.1.10
ERR2680377 | European Nucleotide Archive | M. musculus (brain) | ONT cDNA; MinION; SQK-PCS108; Albacore 2.1.10
ERR2680380 | European Nucleotide Archive | M. musculus (liver) | Illumina HiSeq 4000
ERR2680379 | European Nucleotide Archive | M. musculus (liver) | ONT direct RNA; SQK-RNA001; MinION; Albacore 2.1.10
SRR4235527 | Sequence Read Archive | H. sapiens | Illumina Genome Analyzer IIx
NA12878-dRNA | github.com/nanopore-wgs-consortium | H. sapiens | ONT direct RNA; SQK-RNA001; MinION; Guppy v3.2.6
NA12878-cDNA | github.com/nanopore-wgs-consortium | H. sapiens | ONT cDNA; SQK-PCS108; MinION; Albacore 2.1
SRR1153470 | Sequence Read Archive | H. sapiens | Illumina HiSeq 2000
SRR1163655 | Sequence Read Archive | H. sapiens | PacBio cDNA; PacBio RS

Although we cannot know exactly which transcript molecules are present in the samples, it can be assumed that an assembly with more transcripts matching known annotations is more sensitive, and an assembly is more precise if known transcripts comprise a higher percentage of the total number of assembled transcripts. Therefore, to evaluate the accuracy of the assemblies of real data, we report two values: (1) the number of assembled transcripts matching an annotated transcript, and (2) precision, which we define as the percentage of assembled transcripts matching known annotations. We chose to report the number of transcripts matching the annotation instead of sensitivity, because it is impossible to know exactly which transcripts are truly expressed in real experimental data. As with the simulated data, we report the relative percent increase/decrease of both metrics. Since short-read data offers much higher coverage of the expressed transcriptome, for these calculations we only consider loci with long-read coverage, but the results are similar when we look at all loci (S2 Fig).

We also compare hybrid-read assembly to the strategy of correcting long reads prior to assembly, which is a common approach to handling the high error rate of long reads. Multiple previous algorithms have been proposed to combine long and short reads into high-accuracy long reads [11], but those approaches were primarily intended to be applied to whole-genome data with the aim of improving the quality of genome assemblies. Only recently was a new method, called TALC [12], developed for long-read correction in the context of RNA-seq data by incorporating coverage analysis throughout the correction process. Using the corresponding short-read sample, we corrected each long-read sample with TALC. On average TALC decreased the error rate by 9.5% (S1B Table). We created additional long-read and hybrid-read assemblies with the TALC-corrected reads and then compared the accuracy of the hybrid-read assemblies to the corrected long-read assemblies. We also assessed whether using corrected long reads in a hybrid-read assembly substantially improved the accuracy. As we show below, TALC is quite effective at correcting errors; however, it is far slower than StringTie (running on a single RNA-seq sample takes TALC a day or longer, compared to less than one hour for StringTie), and it does not improve transcript assembly as compared to our new hybrid assembly algorithm.

Arabidopsis thaliana

The hybrid-read assemblies of the Arabidopsis thaliana samples achieved higher precision and contained more annotated transcripts than both the long-read and short-read assemblies (Fig 3A–3C). The average percent increase in precision in the hybrid-read assemblies was 8.0% over the long-read assemblies, and 4.1% over the short-read assemblies. The increase in the number of annotated transcripts was 21.7% and 5.0% over the long-read and short-read assemblies respectively. When comparing the results of hybrid-read assembly to an assembly of corrected long reads, the hybrid-read assembly had a very small decrease in precision of 1.0%, but an increase in the number of annotated transcripts of 14.4%. Finally, using the TALC-corrected long reads instead of the uncorrected long reads in a hybrid-read assembly only increased precision by 0.5% and increased the number of annotated transcripts by 0.4%.

Fig 3. Precision and the number of annotated transcripts assembled for 9 real datasets from Arabidopsis thaliana, Mus musculus, and human.

Only loci with long-read expression are considered for these calculations. The circle markers represent assemblies created from uncorrected reads, and the stars represent assemblies created from long reads corrected with TALC. The short and long read combinations analyzed from Arabidopsis thaliana were A) ERR3486096 and ERR3764345 B) ERR3486098 and ERR3764349 C) ERR3486099 and ERR3764351. The short and long read combinations analyzed from Mus musculus were D) ERR2680378 and ERR2680375 E) ERR2680378 and ERR2680377 F) ERR2680380 and ERR2680379. The short and long read combinations analyzed from human were G) SRR4235527 and NA12878-cDNA H) SRR4235527 and NA12878-dRNA I) SRR1153470 and SRR1163655.

Mus musculus

In the Mus musculus samples, the hybrid-read assemblies showed an even greater improvement in precision versus the long-read only and short-read only assemblies (Fig 3D–3F). The percent increase was 38.6% over the long-read assemblies and 18.9% over the short-read assemblies. The number of annotated transcripts assembled increased substantially over the long-read assemblies with a relative increase of 118%; however, there was a slight decrease over the short-read assemblies of 0.6%. As with Arabidopsis thaliana, we saw that the hybrid-read assemblies outperform the corrected long-read assemblies with a 24.3% increase in precision and a 96.0% increase in the number of annotated transcripts. Hybrid-read assemblies using the TALC-corrected long reads again did not appear considerably different than the hybrid-read assemblies with the uncorrected reads: precision decreased by 0.5% while the number of annotated transcripts increased by 0.8%.

Human

In the human data, we saw an increase in precision of 26.0% in the hybrid-read assemblies over the long-read assemblies, and an increase of 22.7% over the short-read assemblies (Fig 3G–3I). The number of annotated transcripts was also higher in the hybrid-read assemblies with an increase of 47.2% over the long-read assemblies and 36.5% over the short-read assemblies. As with the Arabidopsis thaliana and Mus musculus samples, the hybrid-read assemblies were still better than corrected long-read assemblies with 21.4% greater precision and 45.0% more annotated transcripts. The increase in precision and number of annotated transcripts in the hybrid-read assembly with corrected long reads compared to hybrid-read assembly with the uncorrected reads was again small, at 1.1% and 1.0% respectively. Because the human genome is the largest of the 3 genomes, we also compared the runtime of hybrid-read assembly to that of TALC. On average, hybrid-read assembly of the human samples took 48.8 minutes using 1 thread. In comparison, TALC took an average of 7143 minutes using 12 threads.

It is worth noting that long-read sequencing technology, including sequencing chemistry and basecalling, has improved since the long reads in this analysis were generated. We also ran a similar analysis on a direct RNA ONT dataset generated with newer chemistry (SQK-RNA002) and basecalled with Guppy v5.0.7 [13]. Just as with the older data, the hybrid-read assembly achieves better accuracy than either the long or the short-read assembly. We also used these data to evaluate the level of support of the assembled transcripts according to the RefSeq annotation. As expected, the hybrid-read assembly contains more curated (highly supported) transcripts and more predicted (poorly supported) transcripts. The results are shown in S2 File and S3 Fig.

Annotation-Guided assembly

As with the simulated data, we also performed annotation-guided assembly for each dataset shown in Fig 3 and evaluated the precision (Fig 4A) and number of annotated transcripts assembled (Fig 4B). We compared these results to the hybrid-read assemblies created without guide annotation. The average precision of the Arabidopsis thaliana hybrid-read assemblies increased from 62.5% to 80.2% and the average number of annotated transcripts assembled increased from 18,711 to 26,214. The average precision of the Mus musculus hybrid-read assemblies increased from 41.7% to 78.4% and the number of annotated transcripts increased from 15,342 to 42,541. Lastly, the precision of the human hybrid-read assemblies increased from 35.7% to 75.1% and the number of annotated transcripts assembled more than doubled, increasing from 20,369 to 43,363. Across all samples in all species, the annotation-guided hybrid-read assemblies had greater precision than the annotation-guided long and short read assemblies. In all Mus musculus and human samples, the annotation-guided hybrid-read assemblies also contain a greater number of annotated transcripts. The Arabidopsis thaliana hybrid-read assemblies contain more annotated transcripts than the long-read assemblies, but slightly fewer than the short-read assemblies.

Fig 4.

A) Precision of assemblies of all real datasets with and without guide annotation B) The number of annotated transcripts assembled in assemblies of all real datasets with and without guide annotation.

Discussion

The new StringTie algorithm described here uses the strengths of both long- and short-read RNA-seq data to improve transcriptome assembly. By using the short reads to support or adjust splice sites identified in the long-read alignments, we were able to reduce noise caused by the high error rate of long reads. Using simulated data, we demonstrated that hybrid-read assemblies achieve greater precision and sensitivity than both the long-read only and short-read only assemblies across a range of sensitivity parameters. We also showed that the calculated transcript coverage correlates better with the true coverage in the hybrid-read assemblies. Using real data from 3 different species, we showed that hybrid-read assemblies are more precise than long and short-read assemblies across all samples in all species. The hybrid-read assemblies also contained more transcripts that precisely matched the reference annotations as compared to the long and short-read assemblies in all but 2 Mus musculus datasets (Fig 3D and 3F). In these 2 datasets, the hybrid-read assemblies contained more annotated transcripts than the long-read assemblies, but slightly fewer than the short-read assemblies.

Performing hybrid assembly with the new StringTie algorithm is akin to correcting the long reads prior to assembly; therefore, we compared StringTie’s hybrid assembly to assembling long reads corrected by TALC. Notably, read correction with TALC took 146 times longer to run than StringTie on human data. Furthermore, we found that all of the hybrid-read assemblies contained more annotated transcripts than the assemblies of TALC-corrected long reads. All but 2 Arabidopsis thaliana hybrid-read assemblies also achieved greater precision. We also tested whether using corrected long reads in a hybrid-read assembly would be more accurate than using uncorrected reads. As shown in Fig 3, the difference between using corrected versus uncorrected long reads with StringTie’s hybrid algorithm is very small, ranging from ~0.5% to 1% for both precision and the number of annotated transcripts assembled. When considering the substantial increase in runtime and the marginal increase in accuracy, we conclude that using StringTie’s hybrid assembly algorithm with uncorrected long reads is the preferable method of transcriptome assembly. The lowest error rate observed in any of the long-read datasets after correction was 1.8% in the human PacBio data. Hybrid-read assembly was still more accurate than this corrected long-read assembly suggesting that even as long-read technology inevitably improves and error rates decline, hybrid-read assembly will still offer benefits over long-read only assembly.

Because Arabidopsis thaliana, Mus musculus, and human are well-studied organisms, they have high-quality reference annotations. This allowed us to perform separate experiments in which StringTie was run with a guide annotation. Across all datasets among the simulated and real data we saw substantial improvements in accuracy. This evidence indicates that the best results are achieved with annotation-guided hybrid assembly for species with high-quality reference annotations. We have demonstrated that hybrid-read assembly with StringTie is better than long-read, short-read, or corrected long-read assemblies. As the first reference-based, hybrid-read transcriptome assembler, we believe this new release of StringTie will be a valuable tool leading to improvements in transcriptomic studies of many species.

Methods

StringTie algorithm for hybrid data

As previously described, StringTie takes as input an alignment file of all reads from a sample in either SAM or BAM format [14], uses these alignments to create a splice graph, and then assembles transcripts by iterating through two steps: first, it identifies the heaviest path in the splicing graph and makes that the candidate transcript; and second, it assigns a coverage level to that transcript by solving a maximum-flow problem [8]. If a known annotation is provided as input to StringTie, then the first step above is initially restricted to paths in the splicing graph that correspond to transcripts in the annotation. After all transcripts in the annotation have been exhausted, if there are still paths in the splicing graph that are covered by reads, the algorithm resumes using its default heuristic to identify the heaviest path in the graph.

This new release of StringTie follows the same two steps to assemble transcripts, but also supports input alignment data in CRAM format as it now makes use of the HTSlib C library [15] and can operate in hybrid data mode, enabled by the --mix option. In this new mode of operation, StringTie takes as input two alignment files, the first file on the command line containing the short-read alignment data and the second one having the long-read alignments. These two alignment files are parsed in parallel to identify clusters of reads that represent potential gene loci.

Errors in the reads or the alignments, which are commonly present in the long-read data, propagate to the construction of the splice graph, creating vastly more paths through the graph, which not only slows down the algorithm, but also makes it much more difficult to choose the correct set of isoforms (each of which corresponds to a path) at a particular gene locus. As illustrated in Fig 5, each mis-aligned long read can create a "noisy" transcript that appears to have alternative donor and acceptor sites, extra exons, or skipped exons. In the figure, we show two noisy transcripts, one with an extra exon and an erroneous acceptor (AG) site, and the other with two erroneous donor (GT) sites. These two noisy transcripts together contribute four additional exons to the splice graph, shown on the right side of the figure. These additional exons then generate 8 additional, erroneous edges in the graph, shown in orange. Thus, while the clean splice graph has only 4 nodes and 4 edges, the noisy splice graph has 8 nodes and 11 edges. Because every possible path through a splice graph is a possibly valid isoform, the number of isoforms grows exponentially as we add edges. In this simplified example, the clean splice graph shown on the upper right, based on 2 error-free transcripts, has only 2 paths, each representing a correct transcript. The noisy splice graph, in contrast, has 8 possible paths, only 2 of which correspond to genuine transcripts. Note that a splicing graph implicitly assumes independence of local events, and thus it typically contains many more legal paths than the number of transcripts used to create it.
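The growth in the number of candidate isoforms can be illustrated by counting source-to-sink paths in a directed acyclic graph. The sketch below uses two toy edge lists whose topology is hypothetical: they were chosen only so that the node, edge, and path counts agree with the figures quoted above (4 nodes, 4 edges, 2 paths versus 8 nodes, 11 edges, 8 paths), and they are not a transcription of the graphs drawn in Fig 5.

```python
from functools import lru_cache


def count_paths(edges):
    """Count source-to-sink paths in a DAG given as (u, v) edge pairs."""
    successors = {}
    nodes = set()
    has_incoming = set()
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        nodes.update((u, v))
        has_incoming.add(v)

    @lru_cache(maxsize=None)
    def paths_from(node):
        nxt = successors.get(node, [])
        if not nxt:              # a sink ends exactly one path
            return 1
        return sum(paths_from(n) for n in nxt)

    # Sources are the nodes that no edge points to.
    return sum(paths_from(s) for s in nodes - has_incoming)


# Toy "clean" graph: 4 nodes, 4 edges (one exon is optionally skipped).
CLEAN_EDGES = [(2, 4), (4, 5), (5, 7), (4, 7)]
# Toy "noisy" graph: the clean edges plus 7 spurious ones (8 nodes, 11 edges).
NOISY_EDGES = CLEAN_EDGES + [(1, 2), (1, 3), (3, 4), (4, 6), (6, 7), (7, 8), (4, 8)]

clean_paths = count_paths(CLEAN_EDGES)
noisy_paths = count_paths(NOISY_EDGES)
```

Memoizing the per-node counts lets the paths be counted without enumerating them, which matters because the number of paths can grow exponentially with the number of independent branch points.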

Fig 5. Noisy alignments make the splice graph vastly more complicated.

Fig 5

The clean splice graph on the upper right is based on the two error-free transcripts, while the noisy splice graph is based on all four of the transcripts shown on the left. Regions shown in orange are errors due to mis-alignments.
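
To make the growth in candidate isoforms concrete, the following minimal sketch counts source-to-sink paths in a small directed acyclic graph. The graph is a hypothetical toy example (node names, edges, and the spurious exon "2b" are assumptions chosen for illustration), not a transcription of the graph in Fig 5, and the function is not part of StringTie.

```python
from functools import lru_cache


def count_paths(edges, source, sink):
    """Count distinct source-to-sink paths in a directed acyclic graph."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def paths_from(node):
        # A path ends when it reaches the sink; otherwise sum over successors.
        if node == sink:
            return 1
        return sum(paths_from(nxt) for nxt in successors.get(node, []))

    return paths_from(source)


# Hypothetical clean graph: exon 1, then either exon 2 or exon 3, then exon 4.
clean_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]

# One spurious exon "2b" (for example, from a mis-aligned long read)
# adds three edges and doubles the number of paths.
noisy_edges = clean_edges + [(1, "2b"), ("2b", 3), ("2b", 4)]

print(count_paths(clean_edges, 1, 4))  # 2
print(count_paths(noisy_edges, 1, 4))  # 4
```

In this toy case a single erroneous node doubles the number of paths that an assembler would have to consider, even though only the two original paths are genuine.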

With a hybrid data set containing both long and short reads, we can take advantage of the highly accurate short reads to fix most of these problems. The strategy we employ is to scan all the splice sites at a locus and evaluate how well each site is supported by the read alignments. We consider a splice site well supported if, for example, it is supported by at least one short read, or by most of the long reads that have splice sites in a small window around it. If a splice site is not well supported, we search for a nearby splice site with the best support (i.e., the one with the largest number of alignments agreeing with it) and adjust the long-read alignment accordingly. We found that this strategy can greatly reduce the number of spurious splice sites. Relying on short-read data, we can also fix other long-read alignment artifacts. For instance, one common problem that we and others [16] noticed is the ambiguity of the strand of origin for long reads. Due to their high error rate, the aligner sometimes infers the wrong strand for a long-read alignment. We fix this by scanning nearby splice sites and choosing the strand that is best supported by the short-read data. Another common problem is the presence of false "exons" introduced by insertions in the long reads. These insertions tend to be small (usually less than 35 bp), so to address this issue, we remove exons that are supported only by long reads and that are contained within introns that are well supported by short-read alignments.
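
The splice-site adjustment described above can be illustrated with a deliberately simplified sketch. It is not StringTie's implementation: the function name, the 10 bp window, and the tie-breaking rule are assumptions made for illustration, and support is reduced to a simple count of alignments that use each position.

```python
from collections import Counter


def correct_splice_sites(long_read_sites, short_read_sites, window=10):
    """Snap poorly supported long-read splice sites to the best nearby site.

    Both arguments are lists of genomic positions, one entry per alignment
    that uses that splice site. `window` is a hypothetical search radius.
    """
    short_support = Counter(short_read_sites)
    long_support = Counter(long_read_sites)
    all_sites = set(short_support) | set(long_support)

    corrected = []
    for pos in long_read_sites:
        if short_support[pos] > 0:
            # Supported by at least one short read: keep as is.
            corrected.append(pos)
            continue
        # Otherwise choose the site in the window with the most agreeing
        # alignments; ties go to the closer site, then the smaller position.
        candidates = [p for p in all_sites if abs(p - pos) <= window]
        best = max(
            candidates,
            key=lambda p: (short_support[p] + long_support[p], -abs(p - pos), -p),
        )
        corrected.append(best)
    return corrected


long_sites = [100, 100, 100, 103, 250]
short_sites = [100, 100, 248, 248, 248]
print(correct_splice_sites(long_sites, short_sites))
# [100, 100, 100, 100, 248]
```

Because the candidate set always contains the original position, a long-read site that already has the most support in its window is left unchanged, which loosely mirrors the "supported by most of the long reads" criterion.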

After the splice graph has been pruned to remove erroneous splice sites and nodes, the hybrid version of StringTie will execute the next two steps:

  1. First, it will cluster all compatible long-read alignments. We can do this efficiently by taking advantage of the sparse bit vector representation of the splice graph already employed by StringTie, where each node or edge in the graph corresponds to a bit in the vector. A read or a paired read (in the case of short-read data) will therefore be represented by a vector of bits where only the bits that represent the nodes or edges spanned by the read and its pair are set to 1. The bit representation provides a quick way to check compatibilities between long reads. Each cluster will represent a path in the splice graph that will have an initial expression level estimate E(l) based on the number of long reads covering that path. Note that a cluster does not always have to be a full transcript (e.g., if all long reads in the cluster come from a truncated cDNA molecule), although in most cases it will be.

  2. For each cluster path P inferred in the previous step, starting from the one with the largest number of long reads supporting it, StringTie will use the short-read alignments to output an assembled transcript and an expression level estimate. First, StringTie will choose the heaviest path in the splice graph that includes P. This will represent a candidate transcript. Then StringTie will use its maximum flow algorithm to compute an expression level estimate E(s) based on short-read data only. The final expression level of the transcript will be equal to E(l) + E(s), and short-read alignments that contribute to E(s) will be removed from the subsequent expression level computations.
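
As a rough illustration of the first step, the sketch below encodes reads as bit vectors over a small hypothetical splice graph and clusters compatible reads greedily. Everything here is an assumption made for illustration: the graph, the `encode` helper, the use of Python integers as bit vectors, and the simplified definition of compatibility (two reads agree on every node and edge bit in the region they both span). StringTie's actual data structures and compatibility rules are more involved.

```python
# Hypothetical splice graph: exons in genomic order and the junctions between them.
NODES = ["A", "B", "C", "D"]
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
BIT = {name: i for i, name in enumerate(NODES + EDGES)}  # one bit per node or edge


def encode(path):
    """Return (bits, mask) for a read traversing the nodes in `path`.

    bits: nodes and edges the read actually uses.
    mask: every node and edge lying inside the read's span.
    """
    bits = 0
    for node in path:
        bits |= 1 << BIT[node]
    for edge in zip(path, path[1:]):
        bits |= 1 << BIT[edge]
    lo, hi = NODES.index(path[0]), NODES.index(path[-1])
    mask = 0
    for node in NODES[lo:hi + 1]:
        mask |= 1 << BIT[node]
    for u, v in EDGES:
        if lo <= NODES.index(u) and NODES.index(v) <= hi:
            mask |= 1 << BIT[(u, v)]
    return bits, mask


def cluster_reads(reads):
    """Greedily merge each read into the first cluster it is compatible with."""
    clusters = []
    for bits, mask in reads:
        for c in clusters:
            shared = c["mask"] & mask
            # Compatible: spans overlap and the bit patterns agree on the overlap.
            if shared and (c["bits"] ^ bits) & shared == 0:
                c["bits"] |= bits
                c["mask"] |= mask
                c["n_reads"] += 1
                break
        else:
            clusters.append({"bits": bits, "mask": mask, "n_reads": 1})
    # Clusters with the most long reads are processed first in step 2.
    return sorted(clusters, key=lambda c: -c["n_reads"])


paths = [["A", "B", "D"], ["A", "C", "D"], ["B", "D"], ["A", "B", "D"]]
clusters = cluster_reads([encode(p) for p in paths])
print([c["n_reads"] for c in clusters])  # [3, 1]
```

Here the truncated read B-D is absorbed into the A-B-D cluster because the two agree wherever they overlap, while A-C-D forms its own cluster. The read count of each cluster plays the role of the initial estimate E(l) in this simplified picture.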

Note that some gene loci might have either only long-read or short-read alignments present. For those cases, StringTie will follow its previously implemented algorithms to assemble those loci [9].

Reference genomes and annotations

The human reads (simulated and real) were aligned to GRCh38 and compared to the RefSeq annotation version GRCh38.p8 for accuracy. The Mus musculus reads were aligned to GRCm39 (GenBank Accession GCA_000001635.9) and accuracy was computed using the GENCODE annotation version M26. The Arabidopsis thaliana reads were aligned to TAIR10.1 (GenBank Accession GCA_000001735.2).

Simulated data generation

We used the same simulated short-read data from FluxSim [17] as was used to evaluate StringTie2 [9]. We used NanoSim [18] to simulate ONT direct RNA sequencing reads. Using the NA12878-dRNA reads, we built a model of the reads by using the read_analysis.py module of NanoSim in transcriptome mode with the following command:

read_analysis.py transcriptome -i ONT_dRNA_reads.fq -rg GRCh38.fa -rt transcripts.fa -annot hg38c_protein_and_lncRNA_sorted.gtf -o training

where transcripts.fa is the human reference transcriptome obtained by using gffread [19].

We simulated 13,361,612 reads (the same number of reads as in the NA12878-dRNA sample used to build the model) by running the simulator.py module of NanoSim in transcriptome mode with the following command:

simulator.py transcriptome -rt transcripts.fa -rg GRCh38.fa -e expression_levels.tpm -r dRNA -n 13361612 --fastq -o simulated_dRNA -b guppy -c training

To match the expression levels of the long reads to the short reads, we used the .pro file generated by FluxSim to calculate the TPM of each transcript. These values were given as input to the NanoSim simulation with the -e parameter.

Equal coverage simulation

To control for coverage in the simulated data, we first calculated the coverage of each dataset simply by summing the lengths of every read and dividing by the sum of the lengths of the transcripts expressed. Doing this we found that the coverage of the short reads was 164.9, the coverage of the long reads was 195.7, and the coverage of the hybrid reads was 356.1. To increase the coverage of the short reads, we reran FluxSim and increased the number of simulated reads from 150,000,000 to 323,636,363, and this resulted in a coverage of 355.9. To increase the coverage of the long reads, we re-ran NanoSim and increased the number of simulated reads from 13,361,612 to 24,306,253. This resulted in a coverage of 342. Because the lengths of the long reads vary extensively, unlike the short reads, the increase in coverage is not always proportional to the increase in the number of reads. Nonetheless the coverage of this dataset is much higher than the original coverage of 195.7 and quite close to the target value of 356.1. For both new simulations, the transcripts were simulated at the same TPMs as in the original simulation.
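
The coverage calculation above is simple arithmetic, and can be sketched as follows. The function names are hypothetical, and `reads_for_target` assumes that coverage scales linearly with the number of reads, which is a reasonable approximation for fixed-length short reads but, as noted above, only approximate for long reads of variable length.

```python
def coverage(read_lengths, expressed_transcript_lengths):
    """Total sequenced bases divided by total length of expressed transcripts."""
    return sum(read_lengths) / sum(expressed_transcript_lengths)


def reads_for_target(current_reads, current_coverage, target_coverage):
    """Estimate the read count needed to reach a target coverage.

    Assumes coverage is proportional to the number of reads.
    """
    return round(current_reads * target_coverage / current_coverage)


# Toy numbers for illustration only (not the values from the simulation).
toy_coverage = coverage([100] * 30, [1000, 500])
print(toy_coverage)                               # 2.0
print(reads_for_target(1000, toy_coverage, 5.0))  # 2500
```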

Alignment and assembly

All short reads were aligned with HISAT2 with default parameters [20] using the following command:

hisat2 -x hisat2_index -1 short_reads_R1.fastq -2 short_reads_R2.fastq -S short_aligned.sam

Long reads were aligned with Minimap2 [21] using the default parameters for spliced alignment with the following command:

minimap2 -ax splice GRCh38.fa long_reads.fastq -o long_aligned.sam

Alignment files were sorted and converted to BAM format using samtools [14]. Transcriptome assembly and quantification were performed with StringTie version 2.2.0. We used the following StringTie commands to assemble the input alignment file for each assembly type:

  • For long-read data: stringtie -L long_reads.bam

  • For short-read data: stringtie short_reads.bam

  • For hybrid data: stringtie --mix short_reads.bam long_reads.bam

In the case of annotation-guided assembly, we added to all commands above the following option: -G reference_annotation.gtf

Accuracy analysis

We define sensitivity as TP/(TP + FN) and precision as TP/(TP + FP) where TP (true positives) are correctly assembled transcripts, FP (false positives) are transcripts that are assembled but do not match the reference annotation, and FN (false negatives) are expressed transcripts that are missing from the assembly. We used gffcompare [19] to obtain these metrics in addition to the number of annotated transcripts assembled. All numbers reported are at the ‘transcript’ level (as opposed to the intron or base level accuracy also reported by gffcompare). The ‘true positive’ reference sets provided to gffcompare (with the -r option) are as follows:

  • Simulated data with varying sensitivity parameters (Fig 1A): Human reference transcripts fully covered by either the long or short simulated reads. We define full coverage for multi-exon transcripts as coverage across all splice sites. A single-exon transcript is considered fully covered if there is coverage across ≥ 80% of its length.

  • Simulated data with default sensitivity parameters (Fig 1C and 1D): The full set of expressed transcripts in the simulated data.

  • Real data (Figs 3 and 4): The reference annotation for the given species filtered to only include loci covered by at least one long read.

The -Q option was used with gffcompare to only consider loci present in the reference set provided.

Our main metric for comparing the accuracy of the long, short, and hybrid-read assemblies is the relative percent increase in sensitivity and precision, defined as (S1-S2)/S2 and (P1-P2)/P2, where S1 and P1 are the sensitivity and precision of the hybrid-read assembly and S2 and P2 are the sensitivity and precision of the assembly we are comparing it to. For example, a 10% absolute increase in sensitivity from S2 = 20% to S1 = 30% results in a relative increase of 50% [9]. For the real data, S is the number of annotated transcripts assembled.
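
For readers who want to reproduce these metrics from raw counts, a minimal sketch follows. The function names are ours, not gffcompare's; gffcompare reports sensitivity and precision directly.

```python
import math


def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of expressed transcripts that were assembled."""
    return tp / (tp + fn)


def precision(tp, fp):
    """TP / (TP + FP): fraction of assembled transcripts that are correct."""
    return tp / (tp + fp)


def relative_increase(new, old):
    """(new - old) / old, e.g. (S1 - S2) / S2 for sensitivity."""
    return (new - old) / old


# The worked example from the text: S2 = 20%, S1 = 30%.
print(math.isclose(relative_increase(0.30, 0.20), 0.5))  # True
```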

Coverage analysis of simulated data

The expected coverage for the long-read only and short-read only assemblies was obtained by taking the sum of the lengths of all the reads covering a transcript, and dividing it by the transcript length. For the hybrid-read assemblies, the expected coverage was calculated by taking the sum of the short-read and long-read expected coverages of each transcript. The computed read coverages were taken from StringTie’s output for each type of assembly. All coverages were exported to R and normalized to log2(1 + coverage). To make the comparison fair, we only plotted the coverages and calculated the R² for the transcripts that were shared between the long-read only, short-read only, and hybrid-read assemblies.
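
The analysis itself was carried out in R. Purely as an illustration, the same quantities can be sketched in Python as below. We assume here that R² denotes the squared Pearson correlation between expected and computed normalized coverages; the text does not specify how it was computed, so this is an assumption, as are the function names.

```python
import math


def expected_coverage(read_lengths, transcript_length):
    """Sum of the lengths of reads covering a transcript over its length."""
    return sum(read_lengths) / transcript_length


def normalize(cov):
    """log2(1 + coverage), as used for plotting."""
    return math.log2(1 + cov)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hybrid expected coverage is the sum of the short- and long-read values.
short_cov = expected_coverage([100, 100, 50], 500)   # 0.5
long_cov = expected_coverage([450, 300], 500)        # 1.5
print(normalize(short_cov + long_cov))
```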

Long-read correction with TALC

For each set of long reads, we first counted all 21-mers in the short reads from the sample using Jellyfish [22]. The k-mer counts were obtained with the following commands:

jellyfish count --mer 21 -s 100M -o kmers.jf -t 8 $short_reads_1.fa $short_reads_2.fa

jellyfish dump -c kmers.jf > kmers.dump

Using the Jellyfish output, we ran TALC with the following command:

talc $long_reads.fa --SRCounts kmers.dump -k 21 -o $long_reads_TALC.fa -t 12

Error-rate calculations and full-length isoform analysis

Annotated transcript sequences for each species were extracted from the reference genome using gffread [19] with the following command:

gffread -w transcripts.fa -g reference_genome.fa reference_annotation.gtf

We then aligned each long-read dataset to the transcript sequences using Minimap2 and output the alignments in PAF format. To calculate the indel and mismatch rates, we first selected the primary alignment for each read. To calculate the indel rate, we summed the number of insertions and deletions in the alignment (using the CIGAR string) and divided by the alignment length (column 11 in the PAF output). To calculate the mismatch rate, we subtracted the number of matches (column 10 in the PAF output) and the number of insertions and deletions from the alignment length and divided the result by the alignment length. The total error rate is the sum of the indel and mismatch rates.
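
The per-alignment arithmetic described above can be sketched as follows. This is an illustrative parser, not the script used in the study. It assumes that each PAF record carries a `cg:Z:` CIGAR tag and a `tp:A:` alignment-type tag, as Minimap2 can emit, and that insertions and deletions are counted in bases rather than events; both are assumptions on our part.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_primary(fields):
    """True if the record carries the tp:A:P (primary alignment) tag."""
    return "tp:A:P" in fields[12:]


def error_rates(paf_line):
    """Return (indel_rate, mismatch_rate, total_rate) for one PAF record."""
    fields = paf_line.rstrip("\n").split("\t")
    matches = int(fields[9])     # column 10: number of matching bases
    block_len = int(fields[10])  # column 11: alignment block length
    cigar = next(f[5:] for f in fields[12:] if f.startswith("cg:Z:"))
    # Count inserted and deleted bases from the CIGAR string.
    indels = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "ID")
    indel_rate = indels / block_len
    mismatch_rate = (block_len - matches - indels) / block_len
    return indel_rate, mismatch_rate, indel_rate + mismatch_rate


# Hypothetical record: 995 aligned columns, 10 inserted and 5 deleted bases.
example = "\t".join([
    "read1", "1005", "0", "1005", "+", "tx1", "1200", "100", "1100",
    "950", "1010", "60", "tp:A:P", "cg:Z:500M10I300M5D195M",
])
if is_primary(example.split("\t")):
    print(error_rates(example))
```

For this record the indel rate is 15/1010, the mismatch rate is 45/1010, and the total error rate is 60/1010.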

To identify full-length isoforms, we filtered for reads that spanned all intron/exon boundaries of a multi-exon transcript or 80% of the length of a single-exon transcript. The reference annotations were used to identify the coordinates of the intron/exon boundaries.
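
A minimal sketch of this filter is given below. It works in transcript coordinates, which is an assumption based on the reads having been aligned to transcript sequences; the function name, argument layout, and use of 0-based half-open intervals are ours.

```python
from itertools import accumulate


def is_full_length(aln_start, aln_end, exon_lengths, min_fraction=0.8):
    """Decide whether an alignment to a transcript counts as full length.

    aln_start, aln_end: 0-based, half-open alignment interval on the transcript.
    exon_lengths: exon lengths in transcript order, taken from the annotation.
    """
    if len(exon_lengths) == 1:
        # Single-exon transcript: require at least 80% of its length.
        return (aln_end - aln_start) >= min_fraction * exon_lengths[0]
    # Positions in the transcript where consecutive exons meet.
    boundaries = list(accumulate(exon_lengths))[:-1]
    # The alignment must start before the first boundary and end after the last.
    return aln_start < boundaries[0] and aln_end > boundaries[-1]


print(is_full_length(50, 320, [100, 200, 150]))   # True
print(is_full_length(120, 450, [100, 200, 150]))  # False
print(is_full_length(0, 800, [1000]))             # True
```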

Supporting information

S1 Fig. Transcript assembly accuracy at all expressed loci in short, long, and hybrid simulated data sets.

A) Sensitivity and precision of the assemblies created from the original dataset where the hybrid read coverage is the combination of the long read and the short read coverage. B) Sensitivity and precision of the assemblies created from the dataset where the coverage of the short, long, and hybrid reads is approximately equal. The two StringTie parameters varied were the minimum read coverage allowed for a transcript (-c) and the minimum isoform abundance as a fraction of the most abundant transcript at a given locus (-f). Each shape represents a different combination of -c,-f parameters with the values indicated in the legend.

(EPS)

S2 Fig. Sensitivity and the number of annotated transcripts assembled for 9 real datasets from Arabidopsis thaliana, Mus musculus, and human.

All loci are included in these calculations. The circle markers represent assemblies created from uncorrected reads, and the stars represent assemblies created from long-reads corrected with TALC. The long and short read combinations analyzed from Arabidopsis thaliana were A) ERR3486096 and ERR3764345 B) ERR3486098 and ERR3764349 C) ERR3486099 and ERR3764351. The long and short read combinations analyzed from Mus musculus were D) ERR2680378 and ERR2680375 E) ERR2680378 and ERR2680377 F) ERR2680380 and ERR2680379. The long and short read combinations analyzed from human were G) SRR4235527 and NA12878-cDNA H) SRR4235527 and NA12878-dRNA I) SRR1153470 and SRR1163655.

(EPS)

S3 Fig. Transcript assembly accuracy on RNA-seq data from the HepG2 cell line.

A) Precision and number of annotated transcripts assembled from long, short, and hybrid-read assemblies generated from reads from the HepG2 cell line. B) The number of predicted and curated transcripts assembled in the long, short, and hybrid-read assemblies generated from reads from the HepG2 cell line. Predicted means that the transcript is poorly supported according to the RefSeq annotation and curated means the transcript is highly supported.

(EPS)

S1 Table. Error rates of all long-read datasets before (a) and after (b) correction with TALC.

(DOCX)

S2 Table. Percentage of reads that are full-length isoforms and the number of unique full-length isoforms captured in each long-read dataset.

(DOCX)

S1 File. Equal Coverage Simulation Results.

(DOCX)

S2 File. Hybrid transcriptome assembly of short and long read data from the HepG2 cell line.

(DOCX)

Acknowledgments

The authors thank Steven Salzberg for proofreading the manuscript.

Data Availability

StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Funding Statement

This study was funded by the National Science Foundation (grant DBI-1759518) awarded to MP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456. doi: 10.1038/nature07509
  • 2. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics. 2021;3. doi: 10.1093/nargab/lqab019
  • 3. Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nature Reviews Genetics. 2019. doi: 10.1038/s41576-019-0150-2
  • 4. Buck D, Weirather JL, de Cesare M, Wang Y, Piazza P, Sebastiano V, et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research. 2017;6. doi: 10.12688/f1000research.10571.2
  • 5. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011;29. doi: 10.1038/nbt.1883
  • 6. Fu S, Ma Y, Yao H, Xu Z, Chen S, Song J, et al. IDP-denovo: De novo transcriptome assembly and isoform annotation by hybrid sequencing. Bioinformatics. 2018. doi: 10.1093/bioinformatics/bty098
  • 7. Prjibelski AD, Puglia GD, Antipov D, Bushmanova E, Giordano D, Mikheenko A, et al. Extending rnaSPAdes functionality for hybrid transcriptome assembly. BMC Bioinformatics. 2020;21. doi: 10.1186/s12859-020-03614-2
  • 8. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology. 2015;33. doi: 10.1038/nbt.3122
  • 9. Kovaka S, Zimin A v., Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology. 2019;20. doi: 10.1186/s13059-019-1910-1
  • 10. Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC, et al. CHESS: A new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biology. BioMed Central; 2018;19:1–14.
  • 11. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis.
  • 12. Broseus L, Thomas A, Oldfield AJ, Severac D, Dubois E, Ritchie W. TALC: Transcript-level Aware Long-read Correction. Bioinformatics. 2020;36. doi: 10.1093/bioinformatics/btaa634
  • 13. Pyatnitskiy MA, Arzumanian VA, Radko SP, Ptitsyn KG, Vakhrushev I v., Poverennaya E v., et al. Oxford nanopore minion direct rna-seq for systems biology. Biology [Internet]. MDPI; 2021. [cited 2022 Mar 8];10:1131. Available from: doi: 10.3390/biology10111131
  • 14. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25. doi: 10.1093/bioinformatics/btp352
  • 15. Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al. HTSlib: C library for reading/writing high-throughput sequencing data. GigaScience. 2021;10.
  • 16. Wilks C, Schatz MC. LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy. bioRxiv. 2020.
  • 17. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Research. 2012;40. doi: 10.1093/nar/gks666
  • 18. Yang C, Chu J, Warren RL, Birol I. NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience. 2017. doi: 10.1093/gigascience/gix010
  • 19. Pertea M, Pertea G. GFF Utilities: GffRead and GffCompare. F1000Research. 2020;9. doi: 10.12688/f1000research.23297.2
  • 20. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology. 2019;37. doi: 10.1038/s41587-019-0201-4
  • 21. Li H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34. doi: 10.1093/bioinformatics/bty191
  • 22. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27. doi: 10.1093/bioinformatics/btr011
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009730.r001

Decision Letter 0

Ilya Ioshikhes, Jinyan Li

26 Jan 2022

Dear Dr Pertea,

Thank you very much for submitting your manuscript "Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Both of the reviewers made very positive comments and liked this research topic, while they also made some critical suggestions for the author to revise the manuscript. Especially, reviewer 1 mentioned how to choose simulated data sets and real data sets to strengthen the results.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jinyan Li

Associate Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Shumate et al describe a new release of the transcriptome assembler StringTie. In this manuscript, they set out to determine whether hybrid assemblies generated with Illumina and Nanopore data result in greater accuracy than assemblies generated with just one data type. In principle, the results for three higher eukaryotes support their claims although I think a deeper analysis of the data would prove more informative and strengthen the manuscript considerably.

Major comments

1. The Abstract describes long read sequencing as low throughput. While this is true when comparing a MinION to a high output Illumina platform (NextSeq, HiSeq, NovaSeq), it is not true when comparing against the GridION or PromethION.

2. Introduction.

a. The first is that proteomic diversity and phenotypic complexity is not exclusively limited to higher eukaryotes. Indeed many plants, bacteria, and viruses exhibit remarkably complex transcriptomes.

b. The second is that the error-rate quoted for nanopore direct RNA-Seq is way out of date. Indeed the most recent chemistries (SQK-RNA002) combined with later Guppy versions (v4 onwards) reduce the error rate to ~3-4%.

c. Finally, the authors state that practical limitations impede the ability to capture full-length transcripts. It would help significantly if the authors could put actual numbers on this. For instance, in our hands (and others), we typically observe between 30-60% full-length transcripts (SQK-RNA002 + Guppy v4).

3. Results

a. I have many questions about the underlying data, both simulated and real. In the case of the ONT data, it is imperative to include information on the chemistry (SQK-RNA001 or SQK-RNA002 for DRS) and basecalling (which Guppy versions), as this will very clearly impact the accuracy of the final ONT fastq files and the resulting alignments. Indeed, one suspects that for the latest chemistry + basecaller combinations the number of artefacts in DRS datasets (false exons, incorrect splice junctions) will be significantly reduced. These queries also extend to the simulated data. What do the simulated DRS datasets look like? Are they modelled on SQK-RNA001 or SQK-RNA002 chemistry? Similar questions pertain to the Illumina data. For instance, was this all poly(A) selected or were these ribo-depleted libraries?

b. Extending this further, do the authors see any difference in transcriptome assembly quality when comparing chemistry or basecaller versions? While it is abundantly clear that hybrid assembly is still the best option, it would be valuable to know whether further improvements to chemistry/basecalling would increase the relative accuracy of DRS alone approaches.

c. For the real-world data, it would be great if Table 1 were to include more information on the source/cell types sequenced and also how the various datasets link together. For instance, Fig 3 pairs together Nanopore and Illumina datasets, but it is not clear how/why these pairs specifically were chosen.

d. A deeper dive into the assembled transcriptomes is warranted. For instance, transcripts in the existing HG38 annotation are variously classified according to support (experimental, theoretical, etc). It would be interesting to know, for a given hybrid assembly, how many poorly supported transcripts in the existing annotation are actually confirmed by hybrid assembly?

e. The comparison to TALC is useful but what about other assemblers that offer ‘hybrid-like’ assembly? For instance, FLAIR uses Illumina data to infer splice junctions that are then ‘corrected’ in nanopore-derived alignments. While I suspect the StringTie approach to be faster and more accurate, it would be useful to demonstrate this definitively.

4. Methods

a. Linking back to comment 3a, more information on how NanoSim was run and what the resulting data look like is needed. What % of NanoSim reads are full length and how does this compare to the real-world datasets? What chemistry (SQK-RNA001 or SQK-RNA002) and basecalling (Guppy version) is simulated?

Minor comments:

1. Line numbers and page numbers would have made this manuscript easier to review! I do however appreciate the inclusion of figures within the manuscript body rather than at the end as this makes it much easier to read.

2. While I understand the choice of red, blue, purple for the figures, these colours are not easy to differentiate. The authors could consider lightening the shade of each or finding an alternative combination that segregates better.

3. It would be useful to have the test datasets available (or at least linked to) within the repository. Alternatively, a collection of ‘light’ test datasets containing reads for just a few select transcripts would be useful for those who wish to explore the various parameters and outputs.

4. In the Methods section, it would be useful to spell out the commands used for HiSat2 and MiniMap.

5. Was any filtering performed on the RefSeq annotations?

Reviewer #2: The manuscript "Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie" by Shumate et al. describes the authors' latest developments on the StringTie software, incorporating the capability to assemble transcript alignments from both short (tens to hundred base length Illumina reads) and long (several kilobase length PacBio or ONT) RNA-seq reads. The initial version of StringTie (2015, Nature Biotechnology) assembled short RNA-seq read alignments, StringTie2 (2019, Genome Biology) had specializations for assembling long read alignments, and now the latest version of StringTie (perhaps more aptly named StringTie3?) is presented here for co-assembly of short and long read alignments. The authors demonstrate with both simulated and real RNA-seq data that this latest StringTie provides more accurate assemblies when both the short and long read alignments are available.

StringTie continues to be one of the most popular genome-guided alignment assemblers for reconstructing transcripts, and is my personal favorite for routine use in my own work. I'm pleased to see that this new functionality of assembling both short and long reads is now integrated.

The manuscript is very well written and the experiments performed are all relevant and necessary for this work. I do have a few critical comments and suggestions as follows:

Major

When combining short read alignments with the long read alignments, the effective coverage of transcripts will increase. Within low to moderately expressed transcripts, increased effective coverage would be expected to yield increased rates of full reconstruction, separately from read types provided. The authors might provide some supplementary study to demonstrate that the improved accuracy for reconstruction with hybrid data is more due to having the hybrid data types available rather than due to increased effective alignment coverage. This could be done with simulated data by controlling for total effective coverage when assembling in short-only, long-only, or hybrid data.

Figure 1 would ideally show a more impactful example of where both read types are needed to assemble the full-length isoform. In the example shown, a splicing graph based on short read alignments alone looks as though it would contain the full-length splicing pattern, and if it was not assembled by StringTie, it might be due to StringTie parameter settings or StringTie-specific logic. A different, even naive splice graph assembler has the potential to reconstruct the full-length splicing pattern based on just those short reads alone.

It is useful that StringTie can make use of 'dirty' long reads (those with high error rates), but it is worth noting that the modern PacBio IsoSeq and ONT transcriptome sequencing methods are currently producing much higher fidelity data. The authors might comment on this in the manuscript. Characteristics of the real long read data being used with StringTie in this manuscript should be made available, perhaps as supplementary materials, to shed insight into the overall quality of those data - ie. are they early error-prone or later higher fidelity long reads being leveraged. In particular, it would be helpful to know characteristics (ie. percent identity or error rates) before and after applying TALC for error correction.

In Figures 2 and 3 that provide accuracy statistics for the different StringTie invocations, it would be helpful to know for the long reads what the full-length sensitivity and specificity would be without doing any StringTie assembly (for example, running a method like cd-hit to reduce redundancy, and examine accuracy statistics for the cd-hit compiled reference isoforms aligned with minimap2). This would provide a nice reference point for defining the baseline quantity of full-length isoforms from which StringTie would further augment through assembly.

The experiments performed with real data involved gathering transcriptome data from public databases. Unlike the simulated data, where the long and short reads appear to be derived from matched data (same genes being expressed), it isn't clear that the real data are similarly matched (so potentially many different genes being expressed between long and short read data). Table 1 doesn't indicate the tissue type that the data correspond to, and I couldn't easily ascertain it from the original source either. Accuracy analysis leveraging these real data were restricted to those genes having long read coverage, although the authors state (indicating data not shown) that results are similar when all loci are examined. There are a couple of issues to comment on here. First, I encourage the authors to provide (in supplement) results observed when considering all loci. This will address the issue of how much more comprehensive and accurate an assembly derived from StringTie when provided all data as opposed to just short or long reads alone, which would be useful to know in practice. Second, by restricting to only loci with long reads, because the derived samples may be unmatched, there is little assurance that relevant genes expressed with long reads will be expressed in the samples from which the short reads were derived. This disparity might account for lower sensitivity of short-only as compared to long-only in Figures 3G,H. A paucity of short reads for a long-read-covered locus would also presumably impact the ability of TALC for error correction at corresponding loci, but I suspect that has negligible effect on StringTie reconstruction given TALC-related results from other experiments presented here.

Figures 2C,D and Figure 4 involve barplots showing sensitivity and specificity separately instead of plotting together in scatter form (as in Figure 3). I think it's more transparent to assess sensitivity and specificity together in scatter form rather than separately in the barplots, unless there's some specific reason to show them separately here.
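For reference, the transcript-level accuracy measures discussed in these comments are commonly computed by matching intron chains between assembled and reference transcripts. The Python sketch below illustrates that calculation under stated assumptions: sensitivity is taken as matched transcripts over reference transcripts, precision as matched transcripts over assembled transcripts, and single-exon transcripts are ignored. The data layout and function names are hypothetical, and this is not the authors' evaluation pipeline.

```python
from typing import List, Tuple

Exon = Tuple[int, int]
# Hypothetical layout: (chromosome, strand, exons sorted by start coordinate)
Transcript = Tuple[str, str, List[Exon]]


def intron_chain(tx: Transcript):
    """Return a hashable key describing the introns of a transcript."""
    chrom, strand, exons = tx
    introns = tuple(
        (exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)
    )
    return (chrom, strand, introns)


def sensitivity_precision(reference: List[Transcript],
                          assembled: List[Transcript]):
    """Compare multi-exon transcripts by exact intron-chain match.

    Assumption: transcripts sharing an intron chain count once,
    and single-exon transcripts are skipped.
    """
    ref = {intron_chain(t) for t in reference if len(t[2]) > 1}
    asm = {intron_chain(t) for t in assembled if len(t[2]) > 1}
    matched = ref & asm
    sensitivity = len(matched) / len(ref) if ref else 0.0
    precision = len(matched) / len(asm) if asm else 0.0
    return sensitivity, precision


# Toy data: four reference transcripts, three assembled, two exact matches.
reference = [
    ("chr1", "+", [(100, 200), (300, 400)]),
    ("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    ("chr1", "-", [(1000, 1100), (1200, 1300)]),
    ("chr2", "+", [(50, 80), (120, 160)]),
]
assembled = [
    ("chr1", "+", [(90, 200), (300, 410)]),    # same introns, different ends
    ("chr1", "-", [(1000, 1100), (1200, 1300)]),
    ("chr2", "+", [(50, 85), (120, 160)]),     # splice site off by 5 bases
]

sens, prec = sensitivity_precision(reference, assembled)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```

Matching on intron chains rather than full exon coordinates tolerates differences at transcript ends, while a single shifted splice site (as in the third assembled transcript) still counts as a miss.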

Minor

Abstract: "unable to span multiple exons" -> "rarely span multiple exons" (as 'unable' is not true, and 'rarely' is used elsewhere in the manuscript).

Intro: "Additionally, long reads have a high error rate between 13% and 15% [4], and the throughput of long-read RNAseq is much lower than that of short-read RNA-seq." This was a true statement for the earlier state of the technology, but due to recent rapid advances, it is no longer true. This recent advance would be worth commenting on in the Intro.

Reference annotation-assisted assembly has been used by StringTie and earlier methods to leverage reference isoform structures as part of generating a more complete assembly. I encourage the authors to include more information in the methods on how StringTie does this, or reference earlier StringTie paper(s) that describe it sufficiently. In particular, describing how it contributes to improving specificity of reconstruction would be most helpful, as I find it less intuitive than improving sensitivity.

Please review Figure 5 to ensure that the "Noisy" splice graph accurately reflects the read alignments. I think the bottom pink arrow needs to be shifted to the right.

Please be sure to make inputs (aligned bams, simulated fastqs, etc.) available on a data sharing site.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: No: I request the simulated data (fastqs) and aligned bams be made available on a data sharing site.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009730.r003

Decision Letter 1

Ilya Ioshikhes, Jinyan Li

14 Apr 2022

Dear Dr Pertea,

Thank you very much for submitting your manuscript "Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

In particular, Reviewer 2 made some technical comments about the change in performance when including vs. excluding the reference annotations; the reviewer also requested clarification and further explanation of Figures 2 and 3. Another round of revision would further improve the manuscript.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jinyan Li

Associate Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am quite satisfied with the detailed and considerate responses provided by authors. I thus recommend this paper be accepted for publication.

Reviewer #2: I appreciate the authors' responses to the reviewers' comments, and the manuscript is substantially improved and strengthened as a result of the additional experiments performed and related revisions. I have only a few more technical comments that the authors may further consider.

The results provided in Figure 4 indicate the accuracy differences between methods while leveraging the reference annotation as a guide. It would be useful to see the difference in accuracy for each sample as a result of including vs. excluding the reference annotations as well, as shown for simulated data in Figure 2C,D, providing those results in the supplementary materials if they are not easily integrated into the main figure.

In Figure 2C, it is unusual or unexpected that the precision appears largely unchanged for short reads when annotations are added as a guide. This would be worth reviewing and understanding further.

The results statement "The use of hybrid-read data plus annotation had an increase in precision of 10.7% and an increase in sensitivity of 23.5% as compared to using short reads plus annotation, which in turn was superior to using long reads plus annotation." on page 9 is not clear from Fig 2 with respect to how the short and long read (non-hybrid executions) rank compared to each other.

In Figure 3 and particularly in the new Supplementary Figure 2, the heavily increased precision for corrected long reads is strikingly different for Arabidopsis as compared to human and mouse. Similarly, the lower precision for the hybrid model compared to corrected long reads is surprising and deserves explanation if possible. One feature of the Arabidopsis genome compared to mouse and human would be the compact gene structures and prevalence of retained introns in transcriptome sequencing data. Could this be partly responsible for the different behavior of StringTie on Arabidopsis? Are there other reasons?

I appreciate that the authors added the read error characteristics in Table S1. This is very useful. However, this is limited to mismatches and doesn't include statistics on indels, which are most relevant to the long-read data. Please include the indel rates as well, along with methods information on how this was done according to current best practices.

Minor

Figure 2B: should indicate in the legend that coverage refers to log2(1+coverage) as described in the Methods.

A small suggestion for wording: Page 9 "Although we cannot know exactly which transcript 'molecules' are present in the samples"

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009730.r005

Decision Letter 2

Ilya Ioshikhes, Jinyan Li

11 May 2022

Dear Dr Pertea,

We are pleased to inform you that your manuscript 'Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Jinyan Li

Associate Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************************************************

In the revised versions of your manuscript, you have addressed all of the concerns and suggestions from the reviewers. The reviewers are now satisfied with these changes and revisions to the manuscript. I therefore recommend that the paper be accepted for publication.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: Thank you for addressing my earlier concerns. Congratulations on a beautiful manuscript and your outstanding work on StringTie.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009730.r006

Acceptance letter

Ilya Ioshikhes, Jinyan Li

27 May 2022

PCOMPBIOL-D-21-02222R2

Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

Dear Dr Pertea,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Transcript assembly accuracy at all expressed loci in short, long, and hybrid simulated data sets.

    A) Sensitivity and precision of the assemblies created from the original dataset where the hybrid read coverage is the combination of the long read and the short read coverage. B) Sensitivity and precision of the assemblies created from the dataset where the coverage of the short, long, and hybrid reads is approximately equal. The two StringTie parameters varied were the minimum read coverage allowed for a transcript (-c) and the minimum isoform abundance as a fraction of the most abundant transcript at a given locus (-f). Each shape represents a different combination of -c,-f parameters with the values indicated in the legend.

    (EPS)
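The parameter sweep described in the S1 Fig caption can be sketched as a small Python helper that builds one StringTie command line per (-c, -f) combination. This is an illustrative sketch only: the grid values, file names, and the `stringtie_sweep` helper are hypothetical placeholders (the values actually used appear in the figure legend), and it assumes a StringTie release in which hybrid assembly is invoked with `--mix` followed by the short-read and then the long-read alignment file.

```python
from itertools import product


def stringtie_sweep(short_bam, long_bam, c_values, f_values,
                    out_prefix="hybrid"):
    """Build one hybrid-assembly command per (-c, -f) combination.

    Commands are returned as argument lists and are NOT executed here.
    """
    commands = []
    for c, f in product(c_values, f_values):
        out_gtf = f"{out_prefix}_c{c}_f{f}.gtf"
        commands.append([
            "stringtie", "--mix",
            "-c", str(c),      # minimum read coverage allowed for a transcript
            "-f", str(f),      # minimum isoform abundance fraction at a locus
            "-o", out_gtf,
            short_bam,         # assumption: short-read alignments come first
            long_bam,
        ])
    return commands


# Placeholder grid; not the values used in the paper.
cmds = stringtie_sweep("short.bam", "long.bam", [1, 2.5], [0.01, 0.05])
for cmd in cmds:
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` to perform the assembly, after which the resulting GTF files would be compared against the annotation to obtain one sensitivity/precision point per parameter combination.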

    S2 Fig. Sensitivity and the number of annotated transcripts assembled for 9 real datasets from Arabidopsis thaliana, Mus musculus, and human.

    All loci are included in these calculations. The circle markers represent assemblies created from uncorrected reads, and the stars represent assemblies created from long reads corrected with TALC. The long and short read combinations analyzed from Arabidopsis thaliana were A) ERR3486096 and ERR3764345, B) ERR3486098 and ERR3764349, and C) ERR3486099 and ERR3764351. The long and short read combinations analyzed from Mus musculus were D) ERR2680378 and ERR2680375, E) ERR2680378 and ERR2680377, and F) ERR2680380 and ERR2680379. The long and short read combinations analyzed from human were G) SRR4235527 and NA12878-cDNA, H) SRR4235527 and NA12878-dRNA, and I) SRR1153470 and SRR1163655.

    (EPS)

    S3 Fig. Transcript assembly accuracy on RNA-seq data from the HepG2 cell line.

    A) Precision and number of annotated transcripts assembled from long, short, and hybrid-read assemblies generated from reads from the HepG2 cell line. B) The number of predicted and curated transcripts assembled in the long, short, and hybrid-read assemblies generated from reads from the HepG2 cell line. Predicted means that the transcript is poorly supported according to the RefSeq annotation and curated means the transcript is highly supported.

    (EPS)

    S1 Table. Error rates of all long-read datasets before (a) and after correction (b) with TALC.

    (DOCX)

    S2 Table. Percentage of reads that are full-length isoforms and the number of unique full-length isoforms captured in each long-read dataset.

    (DOCX)

    S1 File. Equal Coverage Simulation Results.

    (DOCX)

    S2 File. Hybrid transcriptome assembly of short and long read data from the HepG2 cell line.

    (DOCX)

    Attachment

    Submitted filename: hybrid_read_stringtie_review_responses.docx

    Attachment

    Submitted filename: StringTie_response_to_reviewers_April29.docx

    Data Availability Statement

    StringTie is freely available as open source software at https://github.com/gpertea/stringtie.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
