Nucleic Acids Research. 2024 Mar 21;52(8):e42. doi: 10.1093/nar/gkae203

txtools: an R package facilitating analysis of RNA modifications, structures, and interactions

Miguel Angel Garcia-Campos, Schraga Schwartz
PMCID: PMC11077046  PMID: 38512053

Abstract

We present txtools, an R package that enables the processing, analysis, and visualization of RNA-seq data at the nucleotide-level resolution, seamlessly integrating alignments to the genome with transcriptomic representation. txtools’ main inputs are BAM files and a transcriptome annotation, and the main output is a table, capturing mismatches, deletions, and the number of reads beginning and ending at each nucleotide in the transcriptomic space. txtools further facilitates downstream visualization and analyses. We showcase, using examples from the epitranscriptomic field, how a few calls to txtools functions can yield insightful and ready-to-publish results. txtools is of broad utility also in the context of structural mapping and RNA:protein interaction mapping. By providing a simple and intuitive framework, we believe that txtools will be a useful and convenient tool and pave the path for future discovery. txtools is available for installation from its GitHub repository at https://github.com/AngelCampos/txtools.

Graphical Abstract


Introduction

The most widespread use of RNA sequencing data is quantifying gene expression, which merely requires counting the number of sequencing reads overlapping each gene. However, in addition to the identity of the gene to which they map, sequencing reads store a wealth of additional information, to which we will refer as ‘read-outs’, including (i) the relative position of read start, (ii) the relative position of read ends, (iii) the nucleotide identity frequency and (iv) the deletion counts in comparison to a reference sequence. These ‘read-outs’ have been leveraged to retrieve information pertaining to diverse layers of the RNA metabolism, including (but not limited to) RNA structure, RNA modifications, and RNA binding proteins. While information regarding these layers is typically lost when an RNA is reverse-transcribed into cDNA, pre-treatment of an RNA with diverse chemistries, enzymes, and/or affinity reagents can allow such information to be also maintained upon reverse transcription and to be captured in the form of one or more of the four above described ‘read-outs’. For instance, a wide range of RNA modifications—including m6A, m1A, pseudouridine, m5C and ac4C—are all detected on the basis of RT-truncations (leading to an accumulation of read starts) (1–3), RT misincorporations (leading to mismatches) (2,4–8) or cleavage events (giving rise to accumulations of both read-starts and ends) that these modifications induce (9–11), typically after treatment with specific chemistries or enzymes. Similarly, approaches for mapping RNA structure, such as DMS-seq, CIRS-seq, SHAPE-seq and LASER-seq, all rely on chemistries that preferentially interact and modify unstructured RNAs leading to RT-induced misincorporations (12–15). Finally, CLIP-based approaches for mapping out interaction sites between RBPs and mRNA, or derivatives used to map interactions between modification-specific antibodies and mRNA, also culminate in RT-misincorporations and truncations (16,17).

Thus, while quantification of gene expression levels requires recording only counts of reads at the gene level, detection and quantification of the above-discussed regulatory levels require quantifying the above-described ‘read-outs’ at the single-nucleotide level. Several tools have been developed to date to supply such single-nucleotide read-outs. A widely used computational tool to obtain this kind of data is the samtools mpileup utility (18). mpileup is a very fast tool and outputs a long table with three main values per genomic position: the number of reads covering the position, read bases, and sequencing quality. Another similar tool is JACUSA2 (19), which deals mainly with the detection of single-nucleotide variants and RT-arrests in NGS data. JACUSA2 supports two modes of sample setups: single-sample mode, which identifies variants against a reference sequence; and paired-samples mode, which identifies variants by comparing samples from two conditions. Its output is a table that reports a score for each variant and the number of read bases at each genomic position. Yet another tool is RNAframework (20), a comprehensive tool for analyzing data derived from an array of RNA structure and RNA modification detection assays. While these tools, along with others (21), offer many advantages, including speed and low memory consumption, they suffer from two key, partially related limitations.

One limitation is that these tools were designed to operate in a single ‘space’, i.e. either genomic or transcriptomic. For RNA-centered quantifications, choosing only one of these spaces is limiting. On the one hand, it makes intuitive sense to consider the entirety of the genomic space for mapping purposes, to avoid erroneous mapping of reads to wrong transcripts, which can occur if the transcripts from which they originated are not represented in the transcriptome annotation. On the other hand, the functional unit from which the read originated is the transcript, and hence accurate interpretation of the read mapping patterns (such as misincorporation or premature termination), as well as downstream analyses, must be based on the transcriptomic, rather than genomic, space. A second, critical limitation of current tools is in their handling of paired-end reads. Sequencing both ends of RNA fragments can be synergistically more informative than sequencing a single end. Consider the case of using premature RT-truncations for the purpose of modification detection (e.g. pseudouridine, via CMC-based profiling (22–25)). If only a single, short (e.g. 40 nt) read from one end is sequenced, each such read informs that no RT-truncations occurred at any of the 39 3′ terminal nucleotides of the read. However, if a mate read (again 40 nt long) is available, with an insert size of 150 nt with respect to the first read, this mate read informs that no RT-truncations occurred not only anywhere along the 40 nt of the sequenced part of the read but also anywhere within the 70 (unsequenced) nucleotides separating the sequenced parts of the first and second read (150 − 40 − 40 = 70). Thus, despite the absence of any sequence coverage within the interval between the two reads, the availability of paired-end reads allows drawing inferences with respect to modification status.

As an additional example, some protocols for mapping mRNA modifications rely on modification-dependent cleavage by RNases sensitive to the modification (11,26). Again, paired-end reads provide critical information, in this case about the absence of cleavage, along the entirety of the intervening, non-sequenced region between the two sequenced mates. As a final example, when it comes to peak-calling in transcriptional data, bridging paired-end reads boosts the accuracy of peak-calling, as we demonstrate below. Yet, to our knowledge, available tools do not integrate information originating from each of the two reads, and hence do not capture the full extent of information present in paired-end reads.

Here we present txtools, an R package facilitating genomic data analysis, exploration, and visualization at the single-nucleotide resolution. txtools’ main output and data structure contains transcriptome-wide nucleotide-resolution coverage and ‘read-outs’, while seamlessly connecting genomic and transcriptomic coordinate systems. We provide several vignettes, showcasing the ability of txtools to perform start-to-end analysis and visualization of different types of datasets, via a few short and streamlined lines of code.

Materials and methods

Core functionality

txtools is an R package, readily available for installation through its GitHub repository AngelCampos/txtools. It relies strongly on Bioconductor infrastructure (27,28), especially on the GenomicRanges and GenomicAlignments packages (29), to load and process alignment data. Its main output and workhorse is an object of the data.table class (Barrett et al., 2024), a high-performance extension of R’s base data.frame class. For its plotting capabilities, txtools makes use of the ggplot2 package (30).

Its core inputs are BAM files harboring genomically aligned reads, a transcriptome annotation in BED-12 or BED-6 format, and, optionally, a genomic reference sequence. The key output of the package is a table summarizing each of the four above-defined ‘read-outs’ at each position within the provided transcriptome space. Data processing is performed in two steps: (i) converting genomic-space alignments into transcriptomic-space alignments in a manner that explicitly takes read mates into account (rather than treating them as single-end reads) when available and (ii) summarizing the four above-defined ‘read-outs’ into a single-nucleotide resolution, transcriptome-space data table (Figure 1, top panel).
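To make step (i) more concrete, the following is a minimal sketch of mapping a genomic position into transcript coordinates from BED-12-style exon blocks. It is written in Python for illustration only and is not txtools code; the function name `genomic_to_tx`, the example exon coordinates, and the 1-based output convention are assumptions made for this example.

```python
def genomic_to_tx(pos, chrom_start, block_sizes, block_starts, strand):
    """Map a 0-based genomic position to a 1-based transcript position.

    chrom_start, block_sizes and block_starts follow the BED12 convention:
    block starts are offsets relative to chrom_start. Returns None when the
    position falls in an intron or outside the transcript.
    """
    tx_len = sum(block_sizes)
    upstream = 0  # exonic bases seen before the current block (genomic order)
    for size, offset in zip(block_sizes, block_starts):
        exon_start = chrom_start + offset
        exon_end = exon_start + size  # half-open interval, as in BED
        if exon_start <= pos < exon_end:
            plus_pos = upstream + (pos - exon_start) + 1
            # On the minus strand the transcript reads right-to-left.
            return plus_pos if strand == "+" else tx_len - plus_pos + 1
        upstream += size
    return None


# Hypothetical two-exon gene: exons at [100, 150) and [200, 260).
print(genomic_to_tx(100, 100, [50, 60], [0, 100], "+"))  # first exonic base
print(genomic_to_tx(205, 100, [50, 60], [0, 100], "+"))  # inside exon 2
print(genomic_to_tx(170, 100, [50, 60], [0, 100], "+"))  # intronic position
print(genomic_to_tx(259, 100, [50, 60], [0, 100], "-"))  # minus-strand gene
```

Keeping both coordinate systems side by side, as the summarized table does, lets results be reported per transcript while remaining traceable to the genome.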

Figure 1.

Top. Main txtools processing of RNA-seq reads. txtools allows easy loading of the needed data: RNA-seq data in the form of reads mapped to a reference genome from BAM files, transcriptomic annotations in BED-12 or BED-6 format, and reference genomes from FASTA files. Using these data, txtools can process the genomic reads into transcriptomic space with tx_reads(), and then generate a transcriptome-wide summarized data table (txDT) using the tx_makeDT_*() family of functions. Each row of a txDT corresponds to a nucleotide in the transcriptome and contains data on coverage, read-starts, read-ends, nucleotide frequency, and deletions. Bottom. Coverage counts in paired-end reads. Unlike other tools, which count only the sequenced stretch of each read towards coverage, txtools counts the whole width of the paired-end fragment, that is, from the start of read1 to the end of read2.

Processing of the genomic alignments into transcriptomic space is conducted with the tx_reads() function. This function implements filtering steps that select only reads that are on the appropriate strand and within the gene structure, assigning to each gene (or isoform) all the reads that fit within its boundaries as defined in the gene annotation. Importantly, in the case of paired-end alignments, read-1 and read-2 are merged into a single range, so that coverage metrics adequately consider the full insert, including the unsequenced portion between read-1 and read-2 (Figure 1, bottom panel). txtools is able to process single-end reads, paired-end reads, and long RNA reads, such as those output by ONT sequencing technologies.
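The effect of merging mates into a single range can be illustrated with a small sketch. The Python code below is not the tx_reads() implementation; the helper names `coverage` and `bridge` are hypothetical, and the numbers reuse the example of a 150 nt fragment sequenced 40 nt from each end.

```python
def coverage(ranges, length):
    """Per-position coverage for 1-based inclusive (start, end) ranges."""
    cov = [0] * length
    for start, end in ranges:
        for i in range(start - 1, end):
            cov[i] += 1
    return cov


def bridge(read1, read2):
    """Merge two mates into one range spanning the whole insert."""
    return (min(read1[0], read2[0]), max(read1[1], read2[1]))


# One 150 nt fragment sequenced 40 nt from each end; coordinates are
# transcript positions (illustrative values).
read1, read2 = (1, 40), (111, 150)

unbridged = coverage([read1, read2], 150)
bridged = coverage([bridge(read1, read2)], 150)

print(sum(1 for c in unbridged if c > 0))  # positions covered by sequenced ends
print(sum(1 for c in bridged if c > 0))    # positions covered by the full insert
```

Counting the mates separately leaves the 70 unsequenced positions with zero coverage, whereas the bridged range covers the whole fragment once.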

The second step consists of summarizing the transcriptomic reads into a table structure. This is performed by the tx_makeDT_*() family of functions, which offers three options for the summarized count data table: (i) coverage, read-start, and read-end information; (ii) only nucleotide frequencies and deletions; or (iii) both data types. Both genomic and transcriptomic coordinate systems are provided in the summarizing table. If a genomic sequence is provided as input, the reference nucleotide at each transcriptomic position is also reported. The output table is of the ‘data.table’ class from its eponymous package (Barrett et al., 2024), which offers enhanced functionality for large tables compared to the base-R ‘data.frame’ class.
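As a rough illustration of option (i), the sketch below tabulates coverage, read-starts, and read-ends per transcript position from a list of transcriptomic ranges. It is a Python toy, not the tx_makeDT_coverage() implementation; the column names cov, start_5p and end_3p follow the labels used in the txtools plots, while the `pos` column and the `summarize` helper are hypothetical.

```python
from collections import Counter


def summarize(ranges, length):
    """Return one row per transcript position with cov, start_5p and end_3p."""
    starts = Counter(start for start, _ in ranges)
    ends = Counter(end for _, end in ranges)
    rows = []
    for pos in range(1, length + 1):
        cov = sum(1 for start, end in ranges if start <= pos <= end)
        rows.append({"pos": pos, "cov": cov,
                     "start_5p": starts[pos], "end_3p": ends[pos]})
    return rows


# Three toy reads (1-based inclusive ranges) on a 10 nt transcript.
reads = [(1, 6), (3, 8), (3, 10)]
rows = summarize(reads, 10)
for row in rows:
    print(row)
```

Each read contributes exactly one start and one end, so the start_5p and end_3p columns each sum to the number of reads, which is a convenient sanity check on any such table.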

Additional functionality

An array of additional functions is provided in txtools to facilitate common computational tasks required to analyze and visualize RNA-seq data at single-nucleotide resolution. These include modules for calculating diverse metrics on the basis of the four read-outs, modules for sequence analyses, modules for metagene analyses, and modules for comparative, statistical analyses. Most txtools functions have the prefix ‘tx_’ and are grouped by the keyword that follows it:

  • tx_load_*: Facilitate the input of files into the R session. The main input files for txtools are read alignments in BAM format (either paired-end or single-end), gene annotations in BED-12 or BED-6 format, genome sequences in FASTA format, and previously saved txDTs in RDS format.

  • tx_add_*: Add new information to an existing txDT, outputting the original txDT with the new column appended. Examples of these functions are tx_add_startRatio(), which adds the start-to-coverage ratio; tx_add_motifPresence(), which adds the location of RNA sequence motifs across the transcriptome; and tx_add_misincRate(), which adds the misincorporation rate relative to the reference genome and gene annotation. Importantly, these functions allow the seamless use of the pipe operator from the magrittr package (Bache and Wickham 2014), enabling the chaining of one function after another to create straightforward pipelines and easy-to-read code.

  • tx_get_*: Extract information from a txDT and generate an output that is not a txDT. These functions enable downstream analyses and checks. Examples are tx_get_flanksFromLogicAnnot(), which extracts data from selected columns in a window centered at positions specified by a logical variable, generating a matrix where each row represents a site of interest and each column a position in the window surrounding that site; and tx_get_metageneRegions(), which outputs a metagene matrix with each row representing a gene and each column a bin in one of the regions of a coding gene.

  • tx_plot_*: The txtools plotting functions serve two objectives: inspection of summarized data and presentation of results. Examples are tx_plot_nucFreq() and tx_plot_staEndCov(), which plot nucleotide-frequency counts and read-start, read-end, and coverage counts, respectively; tx_plot_seqlogo(), which plots the sequence motif at sites annotated with a logical vector; and tx_plot_metageneRegions(), which creates metagene plots to compare values aggregated by gene regions.

  • tx_test_*: Use txDT objects from experimental data to compare metrics between groups. High-speed t-tests and likelihood-ratio tests, from the Bioconductor packages genefilter (Gentleman et al., 2023) and edgeR (31), respectively, are implemented in txtools. These tests can be used to detect differences between groups in continuous variables in the first case, or in ratios of count data in the second; for example, the start-to-coverage ratio when testing for RT-stoppage in m6A detection using miCLIP.

Results

The core txtools workflow for most projects would be as follows:

  • Load data into R: read alignments (BAM file), genomic annotation (BED12 format), and optionally a reference genome (FASTA).

  • Process alignments into txDTs.

  • Calculate metrics to support analyses: e.g. read-start to coverage rate, misincorporation rate, gene regions, etc.

  • Detect nucleotides that are relevant to the study.

  • Plot results.

Below, we briefly present three use cases showcasing the ability of txtools to facilitate streamlined analysis of RNA-seq datasets at the nucleotide level. While the examples we chose are all in the context of detecting RNA modifications, similar workflows can be used for quantifying RNA structure and for analysis of CLIP-seq datasets.

Use case 1: rRNA pseudouridylation detection

For the first use case, we analyzed pseudouridine mapping data acquired via Pseudo-seq, a method for genome-wide, single-nucleotide resolution identification of pseudouridine (22). Carlile and collaborators employed N-cyclohexyl-N′-(2-morpholinoethyl)-carbodiimide metho-p-toluenesulphonate (CMC), which reacts with pseudouridines to create an adduct that blocks reverse transcription one nucleotide downstream of the pseudouridylated site. The FASTQ data of a CMC-treated sample and a control sample were downloaded from GEO (GSE58200) and aligned to the yeast ribosomal reference sequence (32) using Rsubread (33) to create the respective BAM files. Below are the steps and txtools functions used to analyze the data. The resulting plots are shown in Figure 2.

Figure 2.

Use case #1 results - rRNA pseudouridylation. (A) Top. txtools Starts-Ends-Coverage plot for the mock-treatment sample centered at position 2973-T of the yeast 28s rRNA, a known pseudouridine site. cov: coverage, start_5p: read-starts, end_3p: read-ends. Middle. Similar plot as above but for the CMC-treated sample. A dramatic difference in read-starts is evident at 1 bp downstream from the pseudouridylated site. Bottom. Lineplot showing the SR_1bpDS (Start-ratio-1bp-downstream) for both CMC and control treatment and the resulting difference, SRD_1bpDS, at 28s:2973 and surrounding nucleotides. (B) Scatterplots of full rRNA transcripts showing the SRD_1bpDS per nucleotide. Marked in green are the known pseudouridylated sites, marked in blue is the sole 2′-O-methylated pseudouridine, and colored in red are the rest of the nucleotides.

  • Load the reference genome with tx_load_genome(), and the gene annotation with tx_load_bed().

  • Process the resulting BAM files using the bam2txDT.R script, included in the txtools installation. This script processes the genomic alignments into transcriptomic ones using tx_reads() and generates a data table with summarized counts for coverage, read-starts and read-ends using tx_makeDT_coverage().

  • Visualize the processed data with tx_plot_staEndCov() and observe the effect of CMC treatment on RT premature stoppage at a known pseudouridylated site, which manifests as an abrupt increase of read-starts compared to the control sample (Figure 2A, top, and middle panel).

  • Calculate the start-rate 1bp down-stream (SR_1bpDS), using tx_add_startRatio1bpDS() in both samples and subtract the resulting SR_1bpDS of the control from that of the CMC-treated sample, yielding the start-rate difference 1bp down-stream (SRD_1bpDS), a metric that we can use to detect pseudouridines in CMC-treatment data.

  • Plot these results using the tx_plot_numeric() function to plot numeric variables in a txDT along a window in a gene (Figure 2A, bottom panel).

  • Plot the results for all four ribosomal transcripts with a standard ggplot2 scatterplot call on the resulting txDT (Figure 2B). The SRD_1bpDS metric shows that a simple threshold can discriminate between pseudouridine-harboring and non-harboring sites on rRNA.
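The metric used in the steps above can be sketched in a few lines. The Python code below is an illustration rather than the tx_add_startRatio1bpDS() implementation: it assumes that SR_1bpDS at position i is the read-start to coverage ratio of position i + 1, and the counts and the 0.2 cutoff are made-up values.

```python
def start_ratio_1bp_ds(start_5p, cov):
    """Start-to-coverage ratio of the next position, assigned to position i."""
    ratios = []
    for i in range(len(cov)):
        j = i + 1  # one nucleotide downstream (3') of position i
        if j < len(cov) and cov[j] > 0:
            ratios.append(start_5p[j] / cov[j])
        else:
            ratios.append(None)  # undefined at the 3' end or without coverage
    return ratios


def difference(treated, control):
    """Element-wise treated minus control, keeping None where undefined."""
    return [None if t is None or c is None else t - c
            for t, c in zip(treated, control)]


# Toy counts for five positions; the modified site is at index 2.
cmc_starts, cmc_cov = [0, 1, 2, 60, 1], [100, 100, 100, 100, 40]
ctl_starts, ctl_cov = [0, 1, 2, 3, 1], [100, 100, 100, 100, 97]

srd = difference(start_ratio_1bp_ds(cmc_starts, cmc_cov),
                 start_ratio_1bp_ds(ctl_starts, ctl_cov))
threshold = 0.2  # arbitrary cutoff chosen for this toy example
hits = [i for i, d in enumerate(srd) if d is not None and d > threshold]
print(hits)
```

Subtracting the control removes read-start pile-ups that occur regardless of CMC treatment, so only treatment-dependent stops pass the threshold.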

Use case 2: single-nucleotide resolution detection of m6A using miCLIP2

For this use case we analyzed the data of the study that presented miCLIP2 (17), a miCLIP enhancement allowing antibody-based single-nucleotide resolution mapping of m6A sites, relying on crosslinking of an antibody to methylated sites (1). Similarly to its predecessor, miCLIP2 relies on premature termination of reverse transcription during cDNA synthesis at the cross-linked residue. Data was downloaded from GEO: GSE163500, selecting for samples of wild-type (WT) and methyltransferase-like 3 (Mettl3) knockout in mESC, and subsequently aligned to the mm9 reference genome using Rsubread (33). Below are the steps and txtools functions used to analyze the data. The resulting plots for this code are shown in Figure 3.

Figure 3.

Use case #2 results: m6A epitranscriptome in mESC using miCLIP2. (A) Volcano plot showing the 1 bp-down-stream mean start-rate difference at each queried position of the transcriptome (x-axis) and the –log10(P-value), calculated using a t-test (y-axis). Sites centered in a DRACH motif are colored in blue and all other sites in red. The thresholds used to call putative sites are marked with black lines: startRatio_1bpDS difference >0.05 and P-value <0.01, considering only adenines. (B) Metagene plot aligned at the end of the CDS, showing the relative abundance of putative m6A sites in blue compared to the baseline presence of the DRACH motif in red.

  • Load the reference genome with tx_load_genome(), and the gene annotation with tx_load_bed().

  • Process all BAM files by looping through each with:

    • tx_load_bam(): Loading the BAM file

    • tx_reads(): Processing into transcriptomic reads

    • tx_makeDT_coverage(): Processing into a table with summarized data on coverage, read-start, and read-ends.

    • tx_add_startRatio1bpDS(): Adding the start ratio 1bp down-stream

  • Unify all the resulting tables with tx_unifyTxDTL() so that they share the same transcriptomic coordinates, by selecting the intersection of genes across all data tables.

  • Perform t-tests using the txtools’ inbuilt ‘genefilter’ wrapper tx_test_ttests(), testing for differences in startRatio_1bpDS between WT and KO samples both with the immunoprecipitation and crosslinking treatment.

  • Add DRACH motif presence with tx_add_motifPresence().

  • Plot a volcano plot using a call to ggplot2’s scatterplot, color-coding for DRACH presence, and select putative m6A sites at mean difference >0.05 and P-value <0.01 (Figure 3A).

  • Plot a metagene profile using tx_plot_metageneRegions(), showing the previously reported enrichment of m6A sites just after the CDS end compared to the baseline DRACH motif presence (Figure 3B).
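The motif annotation and site-selection steps above can be sketched as follows. This Python code is illustrative and does not reproduce tx_add_motifPresence() or tx_test_ttests(); the helper names, the toy sequence, and the toy statistics are assumptions, while the DRACH expansion uses the standard IUPAC codes and the thresholds are those of Figure 3A.

```python
import re

# IUPAC codes: D = A/G/T, R = A/G, H = A/C/T. The lookahead allows overlaps.
DRACH = re.compile(r"(?=[AGT][AG]AC[ACT])")


def drach_centers(seq):
    """0-based indices of the central A of every DRACH match in seq."""
    return {m.start() + 2 for m in DRACH.finditer(seq)}


def putative_sites(seq, mean_diff, p_values, min_diff=0.05, max_p=0.01):
    """Indices of adenines passing both thresholds from Figure 3A."""
    return [i for i, nuc in enumerate(seq)
            if nuc == "A" and mean_diff[i] > min_diff and p_values[i] < max_p]


seq = "TTGGACTCCAAGT"
diff = [0.0] * len(seq)
pval = [1.0] * len(seq)
diff[4], pval[4] = 0.12, 0.001   # strong signal on the A of GGACT
diff[9], pval[9] = 0.08, 0.2     # large difference but not significant

sites = putative_sites(seq, diff, pval)
print(sites, [i in drach_centers(seq) for i in sites])
```

Keeping motif presence as an annotation rather than a filter, as in the volcano plot, lets the DRACH enrichment among called sites serve as an independent check on the calls.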

Use case 3: dynamic RNA acetylation using ac4C-seq

For this use case we analyzed RNA acetylation data acquired via ac4C-seq, a chemical method for the transcriptome-wide quantitative mapping of N4-acetylcytidine (ac4C) at single-nucleotide resolution (Sas-Chen et al. 2020). Sas-Chen and collaborators employed the reaction of ac4C with sodium cyanoborohydride (NaCNBH3) under acidic conditions, which leads to C→T mutations at acetylated positions (Thomas et al. 2018). A key result in this study was the discovery of an abundance of acetylation sites on rRNA derived from the hyperthermophilic archaeon T. kodakarensis. Data from GEO (GSE135826) were downloaded and aligned to the T. kodakarensis genome using Rsubread (33). Below are the steps and txtools functions used to analyze the data. The resulting plots are shown in Figure 4.

Figure 4.

Use case #3 results: dynamic RNA acetylation using ac4C-seq. T. kodakarensis’ ac4C levels increase across a temperature gradient. (A) txtools’ nucleotide frequency plot centered on a detected ac4C site at rRNA_23S:1462. Top. NaCNBH3-treated sample showing misincorporation of adenines at position 1432-C. Bottom. Control sample showing no misincorporations at the same position as above. (REF = equal to reference nucleotide, A = adenine, C = cytosine, G = guanine, T = thymine, – = deletion, N = ambiguous nucleotide read). (B) Boxplot showing the Misincorporation-rate-difference from C to T (MRD_CtoT) at putative ac4C and background cytidine sites at increasing growth temperatures.

  • Load the reference genome with tx_load_genome(), and the gene annotation with tx_load_bed().

  • Process the BAM files with the bam2txDT.R script provided with the txtools installation, using the parameters ‘-p TRUE -d covNuc -r 300 -m 0’ to generate a summarized count data table with coverage, read-starts, read-ends, nucleotide frequency, and deletion frequency information.

  • Plot a nucleotide frequency plot using the tx_plot_nucFreq() function to observe the high levels of misincorporation of thymines in place of cytidines exclusively in the NaCNBH3-treated samples (Figure 4A).

  • Calculate the C to T misincorporation rate at every position of the reference transcriptome with tx_add_CtoTMR(), and subtract the rate of the control from that of the NaCNBH3-treated sample for each growth temperature to obtain the C to T misincorporation rate difference (MRD_CtoT).

  • Call putative ac4C sites if a position in any of the samples surpassed a threshold of MRD_CtoT > 0.005.

  • Plot a boxplot showing the calculated MRD_CtoT across temperatures (Figure 4B). The boxplot shows how the stoichiometry of ac4C sites increases as the temperature of growth increases, reaching the highest levels at 95°C.
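The rate-difference and site-calling steps above can be sketched as below. This Python example is not the tx_add_CtoTMR() implementation: it assumes the C to T misincorporation rate is T / (C + T) at reference cytidines, the read counts are made up, and only the 0.005 threshold is taken from the text.

```python
def c_to_t_rate(ref, c_count, t_count):
    """T / (C + T) at reference cytidines; None elsewhere or without reads."""
    rates = []
    for nuc, c, t in zip(ref, c_count, t_count):
        rates.append(t / (c + t) if nuc == "C" and (c + t) > 0 else None)
    return rates


def call_sites(treated, control, threshold=0.005):
    """Positions whose treated-minus-control rate exceeds the threshold."""
    calls = []
    for i, (t, c) in enumerate(zip(treated, control)):
        if t is not None and c is not None and t - c > threshold:
            calls.append(i)
    return calls


# Toy reference and per-position C and T read counts.
ref = "ACGCC"
treated = c_to_t_rate(ref, [0, 700, 0, 995, 1000], [0, 300, 0, 5, 0])
control = c_to_t_rate(ref, [0, 998, 0, 996, 1000], [0, 2, 0, 4, 0])
print(call_sites(treated, control))
```

In this toy data the cytidine at index 3 shows a small C to T rate in both samples, so the subtraction keeps it below the threshold, which is the purpose of the control.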

The code to reproduce the three use cases presented above is available at: https://github.com/AngelCampos/txtools_uc.

Showcasing the importance of bridging paired-end reads

Bridging paired-end reads is implemented as an option in txtools but not, to the best of our knowledge, in other available tools. To showcase its importance, we analyzed m6A-seq data. In m6A-seq (34–36), RNA is first fragmented and then subjected to immunoprecipitation using an anti-m6A antibody, resulting in selective capture of methylated fragments. These fragments are then sequenced from both ends. The typical analytic pipeline consists of coverage-based peak calling in the immunoprecipitated data. While m6A-seq is not inherently considered a single-nucleotide resolution methodology, in an idealized scenario (infinite coverage, no sources of noise) the signal should peak precisely over the methylated adenosine in the DRAC motif.

Given that the size of the immunoprecipitated fragments is around 100–150 nt and oftentimes only ∼30 nt are sequenced from each end, relying only on the sequenced ends results in only partial coverage of the insert. txtools offers the possibility of computationally bridging the reads, thereby restoring the full-length fragment. To assess to what extent the lack of complete coverage can result in inaccurate peak calling, we analyzed existing m6A-seq data in yeast (35) using txtools in two modes: paired-end mode (in which paired ends are bridged) or single-end mode (in which each read is considered a separate entity). We observed a ∼36% relative increase in the fraction of peaks detected precisely centered on a DRAC motif (330 out of 1475 peaks were centered on the motif in single-end processing versus 429 out of 1405 peaks in paired-end processing) (Supplementary Figure S1A). This was accompanied by a mild increase in the number of peaks called in single-end mode in comparison to paired-end mode (Supplementary Figure S1B). Thus, without bridging paired-end reads, more peaks are called but they are of poorer quality. An investigation into the source of these miscalled peaks revealed that a single peak, centered within a DRAC motif, in paired-end mode was oftentimes split into multiple smaller peaks in single-end mode, resulting in the calling of multiple erroneous sites instead of a single correct one (Supplementary Figure S1C). This analysis thus highlights the added value of properly processed paired-end read information for the accurate calling of sites.
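The ∼36% figure is a relative increase in the fraction of motif-centered peaks, which can be checked directly from the peak counts reported above. The short Python calculation below uses only those counts; the variable names are ours.

```python
single_end = 330 / 1475   # fraction of motif-centered peaks, single-end mode
paired_end = 429 / 1405   # same fraction with bridged paired-end reads

relative_increase = paired_end / single_end - 1
extra_peaks = 1475 / 1405 - 1  # excess of peaks called in single-end mode

print(f"single-end: {single_end:.1%}, paired-end: {paired_end:.1%}")
print(f"relative increase: {relative_increase:.1%}")
print(f"extra peaks in single-end mode: {extra_peaks:.1%}")
```

The fraction of motif-centered peaks rises from roughly 22% to roughly 31%, while single-end mode calls about 5% more peaks overall.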

Limitations

A current limitation of txtools is its potentially long processing times. To ameliorate this issue, txtools provides the ability to run many of its functions using multiple cores/threads, taking advantage of the R package parallel (R Core Team, 2021) to reduce the processing time. Of note, this capability is available only on UNIX-based systems and not on Windows systems. To provide a guide to, and an expectation of, the time taken to process data in different scenarios, we arranged a set of benchmarks using the hg19 genome and gene annotation to generate artificial paired-end reads. All scenarios simulate paired-end RNA-seq reads, 50 bp long for each of read1 and read2 with an insert of 100 bp between the reads, extracted from the hg19 transcriptome and then aligned to the genome using STAR (37). The resulting BAM files were subjected to four tasks:

  • T1 = Converting RNA-seq genomic ranges into transcriptomic ranges without their nucleotide sequence.

  • T2 = Generating a summarized table with coverage, read-starts, and read-ends data.

  • T3 = Converting RNA-seq genomic ranges into transcriptomic ranges including their nucleotide sequence.

  • T4 = Generating a summarized table with nucleotide frequency, insert coverage, and deletion frequency, as well as coverage, read-starts, and read-ends data.

Both T1 and T3 are performed by the function tx_reads(), T2 by tx_makeDT_coverage() and T4 by tx_makeDT_covNucFreq(). In a common txtools processing workflow T1 will be followed by T2, and T3 by T4 to obtain a summarized count-data table.

The four tasks were carried out under different scenarios: (i) doubling the number of genes in the gene annotation from 2500 up to 20 000 genes (Figure 5A), (ii) doubling the number of processed paired-end reads from 4 million to 32 million (Figure 5B), (iii) increasing the number of cores from 4 to 8 to 12 (Figure 5C) and (iv) increasing both the number of cores and the number of genes (Figure 5D), to show the time taken by the main processing functions of txtools. All tasks were run 5 times for each benchmark.

Figure 5.

Processing time of the main txtools processing functions for different simulated scenarios. (A) Barplot showing the processing time (y-axis) to perform tasks 1 through 4 (x-axis) in doubling numbers of genes (color-coded). (B) Barplot showing the processing time (y-axis) to perform the benchmarking tasks (x-axis) in doubling numbers of simulated RNA-seq paired-end reads (color-coded). (C) Barplot showing the processing time (y-axis) to perform the benchmarking tasks (x-axis) using 4, 8 and 12 cores (color-coded). (D) Lineplot showing the processing time (y-axis) to perform task ‘T3’ using 2, 4, 8 and 12 cores and doubling the number of genes (x-axis). Additional parameters used for each scenario are written in the top corner of each plot. M = millions; K = thousands. Each value represents the mean of 5 repeats and whiskers show the standard deviation.

The benchmarks show that the lengthier tasks are T1 and T3, which process the genomic reads into transcriptomic ones, merging each read1 with its respective read2, and are carried out by the tx_reads() function. T3 is only slightly lengthier than T1, as adding the nucleotide sequence carries a slight overhead. On the other hand, T2 and T4, which summarize the transcriptomic reads into count-data tables and are carried out by the tx_makeDT_*() functions, are much less time-consuming in comparison, again with a slight overhead when nucleotide sequence information is incorporated.

Processing times scale roughly linearly with both the number of genes in the gene annotation and the number of reads/alignments in the simulated RNA-seq library (Figure 5A, B). Running any task with an increasing number of cores reduces the processing time in all tasks (Figure 5C, D). Nevertheless, the extent of this effect is not linear, as spreading the process across multiple cores carries the overhead of collecting and integrating the resulting outputs. In the scenario of Figure 5D, doubling the number of cores from 2 to 4 reduced the processing time by 45%, while doubling from 4 to 8 reduced it by a further ∼40%.
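The sub-linear scaling can be expressed as speedup and parallel efficiency. The Python sketch below is an illustration under one reading of the reported reductions, namely that each percentage is relative to the preceding run (45% of the 2-core time, then about 40% of the 4-core time); this interpretation and the helper name `speedup_table` are assumptions.

```python
def speedup_table(relative_times):
    """Speedup and parallel efficiency relative to the smallest core count."""
    base_cores = min(relative_times)
    base_time = relative_times[base_cores]
    table = {}
    for cores, time in sorted(relative_times.items()):
        speedup = base_time / time
        ideal = cores / base_cores  # perfect scaling would reach this speedup
        table[cores] = (speedup, speedup / ideal)
    return table


# Times relative to the 2-core run, treating the reported percentages as
# successive relative reductions (an assumption about the text).
times = {2: 1.0, 4: 1.0 * (1 - 0.45), 8: 1.0 * (1 - 0.45) * (1 - 0.40)}

table = speedup_table(times)
for cores, (speedup, efficiency) in table.items():
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under this reading, efficiency falls as cores are added, consistent with the overhead of collecting and integrating outputs described above.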

Discussion

txtools facilitates the processing and analysis of RNA-seq datasets at the single-nucleotide resolution level. The illustrated case studies showcase the power of txtools and its ability to streamline analyses that are otherwise non-trivial to implement. txtools offers unique advantages in the processing of paired-end reads, where it allows quantifying information that, to our knowledge, is not quantified by any other tool. The seamless integration between genomic and transcriptomic space is another unique advantage of txtools over other available tools. Finally, its availability as an R package allows it to be readily integrated into analytic workflows.

In addition to the case studies shown here, which rely on Illumina-based RNA-seq data, we have also successfully employed txtools for the analysis of RNA-seq data from other platforms, such as long reads from the Nanopore MinION.

A key limitation of txtools is its relatively long processing time. Processing time scales primarily with the number of genes and with the number of reads/alignments to be processed. Depending on needs and research focus, trimming down the annotated gene set can help substantially reduce running times. Another option is the txDT subsampling function tx_sampleByGenes(), which enables working only on a random sub-sample of the transcriptome. Once the analytic pipeline is more mature it can then be easily run with the full dataset to generate complete results. Additionally, to avoid long waiting times during an interactive R session, we offer the command line bam2txDT.R script that implements the core functionality of txtools and that can be run in the background. Nevertheless, we hope to optimize processing times in future versions of txtools.

We believe that txtools will be a useful and convenient tool for the RNA and bioinformatics communities. By providing a simple, intuitive, streamlined, and rich framework for processing, analyzing, and visualizing RNA-seq data at the nucleotide level, it paves the path for future discovery for both entry-level and advanced users.

Supplementary Material

gkae203_Supplemental_File

Contributor Information

Miguel Angel Garcia-Campos, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Central District, 761000, Israel.

Schraga Schwartz, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Central District, 761000, Israel.

Data availability

txtools source code is available from its GitHub repository at: https://github.com/AngelCampos/txtools. A manual for all txtools functions and vignettes showing examples of their functionality is available at the txtools package website. Links to the use cases and benchmarking code are also available through the txtools GitHub page. The provided code readily downloads needed data stored at Zenodo with the doi: https://doi.org/10.5281/zenodo.10612727.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

H2020 European Research Council [913/21]; Israeli Science Foundation [543165]. Funding for open access charge: ISF [543165].

Conflict of interest statement. None declared.

References

  • 1. Linder B., Grozhik A.V., Olarerin-George A.O., Meydan C., Mason C.E., Jaffrey S.R. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat. Methods. 2015; 12:767–772.
  • 2. Safra M., Sas-Chen A., Nir R., Winkler R., Nachshon A., Bar-Yaacov D., Erlacher M., Rossmanith W., Stern-Ginossar N., Schwartz S. The m1A landscape on cytosolic and mitochondrial mRNA at single-base resolution. Nature. 2017; 551:251–255.
  • 3. Hussain S., Sajini A.A., Blanco S., Dietmann S., Lombard P., Sugimoto Y., Paramor M., Gleeson J.G., Odom D.T., Ule J. et al. NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. Cell Rep. 2013; 4:255–261.
  • 4. Schaefer M., Pollex T., Hanna K., Lyko F. RNA cytosine methylation analysis by bisulfite sequencing. Nucleic Acids Res. 2009; 37:e12.
  • 5. Squires J.E., Patel H.R., Nousch M., Sibbritt T., Humphreys D.T., Parker B.J., Suter C.M., Preiss T. Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA. Nucleic Acids Res. 2012; 40:5023–5033.
  • 6. Edelheit S., Schwartz S., Mumbach M.R., Wurtzel O., Sorek R. Transcriptome-wide mapping of 5-methylcytidine RNA modifications in bacteria, archaea, and yeast reveals m5C within archaeal mRNAs. PLoS Genet. 2013; 9:e1003602.
  • 7. Li X., Xiong X., Zhang M., Wang K., Chen Y., Zhou J., Mao Y., Lv J., Yi D., Chen X.-W. et al. Base-resolution mapping reveals distinct m1A methylome in nuclear- and mitochondrial-encoded transcripts. Mol. Cell. 2017; 68:993–1005.
  • 8. Liu C., Sun H., Yi Y., Shen W., Li K., Xiao Y., Li F., Li Y., Hou Y., Lu B. et al. Absolute quantification of single-base m6A methylation in the mammalian transcriptome using GLORI. Nat. Biotechnol. 2022; 41:355–366.
  • 9. Birkedal U., Christensen-Dalsgaard M., Krogh N., Sabarinathan R., Gorodkin J., Nielsen H. Profiling of ribose methylations in RNA by high-throughput sequencing. Angew. Chem. Int. Ed. 2015; 54:451–455.
  • 10. Marchand V., Blanloeil-Oillo F., Helm M., Motorin Y. Illumina-based RiboMethSeq approach for mapping of 2′-O-me residues in RNA. Nucleic Acids Res. 2016; 44:e135.
  • 11. Garcia-Campos M.A., Edelheit S., Toth U., Safra M., Shachar R., Viukov S., Winkler R., Nir R., Lasman L., Brandis A. et al. Deciphering the ‘m6A code’ via antibody-independent quantitative profiling. Cell. 2019; 178:731–747.
  • 12. Aviran S., Trapnell C., Lucks J.B., Mortimer S.A., Luo S., Schroth G.P., Doudna J.A., Arkin A.P., Pachter L. Modeling and automation of sequencing-based characterization of RNA structure. Proc. Natl. Acad. Sci. U.S.A. 2011; 108:11069–11074.
  • 13. Incarnato D., Neri F., Anselmi F., Oliviero S. Genome-wide profiling of mouse RNA secondary structures reveals key features of the mammalian transcriptome. Genome Biol. 2014; 15:491.
  • 14. Umeyama T., Ito T. DMS-seq for In vivo genome-wide mapping of protein-DNA interactions and nucleosome centers. Curr. Protoc. Mol. Biol. 2018; 123:e60.
  • 15. Zinshteyn B., Chan D., England W., Feng C., Green R., Spitale R.C. Assaying RNA structure with LASER-Seq. Nucleic Acids Res. 2019; 47:43–55.
  • 16. Van Nostrand E.L., Pratt G.A., Shishkin A.A., Gelboin-Burkhart C., Fang M.Y., Sundararaman B., Blue S.M., Nguyen T.B., Surka C., Elkins K. et al. Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods. 2016; 13:508–514.
  • 17. Körtel N., Rücklé C., Zhou Y., Busch A., Hoch-Kraft P., Sutandy F.X.R., Haase J., Pradhan M., Musheev M., Ostareck D. et al. Deep and accurate detection of m6A RNA modifications using miCLIP2 and m6Aboost machine learning. Nucleic Acids Res. 2021; 49:e92.
  • 18. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M. et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021; 10:giab008.
  • 19. Piechotta M., Naarmann-de Vries I.S., Wang Q., Altmüller J., Dieterich C. RNA modification mapping with JACUSA2. Genome Biol. 2022; 23:115.
  • 20. Incarnato D., Morandi E., Simon L.M., Oliviero S. RNA Framework: an all-in-one toolkit for the analysis of RNA structures and post-transcriptional modifications. Nucleic Acids Res. 2018; 46:e97.
  • 21. Picardi E., Pesole G. REDItools: high-throughput RNA editing detection made easy. Bioinformatics. 2013; 29:1813–1814.
  • 22. Carlile T.M., Rojas-Duran M.F., Zinshteyn B., Shin H., Bartoli K.M., Gilbert W.V. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014; 515:143–146.
  • 23. Lovejoy A.F., Riordan D.P., Brown P.O. Transcriptome-wide mapping of pseudouridines: pseudouridine synthases modify specific mRNAs in S. cerevisiae. PLoS One. 2014; 9:e110799.
  • 24. Schwartz S., Bernstein D.A., Mumbach M.R., Jovanovic M., Herbst R.H., León-Ricardo B.X., Engreitz J.M., Guttman M., Satija R., Lander E.S. et al. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell. 2014; 159:148–162.
  • 25. Li X., Zhu P., Ma S., Song J., Bai J., Sun F., Yi C. Chemical pulldown reveals dynamic pseudouridylation of the mammalian transcriptome. Nat. Chem. Biol. 2015; 11:592–597.
  • 26. Zhang Z., Chen L.-Q., Zhao Y.-L., Yang C.-G., Roundtree I.A., Zhang Z., Ren J., Xie W., He C., Luo G.-Z. Single-base mapping of m6A by an antibody-independent method. Sci. Adv. 2019; 5:eaax0250.
  • 27. Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods. 2015; 12:115–121.
  • 28. Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004; 5:R80.
  • 29. Lawrence M., Huber W., Pagès H., Aboyoun P., Carlson M., Gentleman R., Morgan M.T., Carey V.J. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 2013; 9:e1003118.
  • 30. Wickham H. ggplot2. Wiley Interdiscip. Rev. Comput. Stat. 2011; 3:180–185.
  • 31. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
  • 32. Taoka M., Nobe Y., Yamaki Y., Yamauchi Y., Ishikawa H., Takahashi N., Nakayama H., Isobe T. The complete chemical structure of Saccharomyces cerevisiae rRNA: partial pseudouridylation of U2345 in 25S rRNA by snoRNA snR9. Nucleic Acids Res. 2016; 44:8951–8961.
  • 33. Liao Y., Smyth G.K., Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019; 47:e47.
  • 34. Dominissini D., Moshitch-Moshkovitz S., Schwartz S., Salmon-Divon M., Ungar L., Osenberg S., Cesarkas K., Jacob-Hirsch J., Amariglio N., Kupiec M. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature. 2012; 485:201–206.
  • 35. Schwartz S., Agarwala S.D., Mumbach M.R., Jovanovic M., Mertins P., Shishkin A., Tabach Y., Mikkelsen T.S., Satija R., Ruvkun G. et al. High-resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis. Cell. 2013; 155:1409–1421.
  • 36. Meyer K.D., Saletore Y., Zumbo P., Elemento O., Mason C.E., Jaffrey S.R. Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell. 2012; 149:1635–1646.
  • 37. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29:15–21.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
