Abstract
Motivation
Long-read RNA sequencing enables the mapping of RNA modifications, structures, and protein-interaction sites at the resolution of individual transcript isoforms. To understand the functions of these RNA features, it is critical to analyze them in the context of transcriptomic and genomic annotations, such as open reading frames and splice junctions.
Results
We have developed R2Dtool, a bioinformatics tool that integrates transcript-mapped information with transcript and genome annotations, allowing for the isoform-resolved analytics and graphical representation of RNA features in their genomic context. We illustrate R2Dtool’s capability to integrate and expedite RNA feature analysis using epitranscriptomics data. R2Dtool facilitates the comprehensive analysis and interpretation of alternative transcript isoforms.
Availability and implementation
R2Dtool is freely available under the MIT license at github.com/comprna/R2Dtool.
1 Introduction
Long-read sequencing enables accurate mapping of diverse RNA features with transcript isoform resolution. Such features include, among others, chemical modifications (Liu et al. 2019, Stephenson et al. 2022, Acera Mateos et al. 2024, Bansal et al. 2024), structured regions (Aw et al. 2021, Stephenson et al. 2022), and RNA–protein interacting sites (Lin et al. 2022). The functional roles of these RNA features are often linked to their position relative to transcript landmarks, such as transcription start sites, splice junctions, start and stop codons, and polyadenylation sites. For example, positional enrichment of N6-methyladenosine (m6A) downstream of stop codons led to the discovery of an m6A role in transcript deadenylation at the 3ʹ end of transcripts (Dominissini et al. 2012, Lee et al. 2020). Similarly, the identification of isoform-specific exon secondary structures made possible a link with transcript-specific translational efficiency (Aw et al. 2021); and the positional distribution of RNA–protein interaction sites along transcripts revealed specific modes of RNA regulation (Van Nostrand et al. 2020, Lin et al. 2022). The isoform-resolved visualization of RNA features in the context of transcriptomic and genomic landmarks is thus critical to facilitate the discovery of isoform-specific regulatory mechanisms.
The transcriptome-wide distributions of RNA features are often represented using metagene plots, which describe the positional density of these features in the context of simplified gene models. However, current methods for generating metagene plots generally select a single representative isoform per gene, often the longest or the most abundant one, against which the positions of all RNA features are calculated (Olarerin-George and Jaffrey 2017, Fournier et al. 2022). This may result in a misassignment of the distances and distributions of RNA features relative to transcriptomic or genomic landmarks at loci that simultaneously express multiple isoforms, which in mammals is expected to occur in around one in every five genes (Tapial et al. 2017). Isoform-resolved transcriptomic methods, such as long-read sequencing, offer a solution to this challenge by enabling the direct assignment of RNA features to specific transcripts. However, available metagene methods do not capitalize on these data to enable isoform-aware analysis.
Here, we describe R2Dtool, a computational method for the integration and visualization of isoform-resolved RNA features in the context of transcriptomic and genomic annotations. R2Dtool exploits the isoform-resolved mapping of RNA features, such as those obtained from long-read sequencing, to enable simple, reproducible, and lossless integration, annotation, and visualization of isoform-specific RNA features. We illustrate R2Dtool’s capabilities with the analysis of isoform-resolved messenger RNA (mRNA) modification data. Such data can be accurately obtained from nanopore long-read sequencing (Hendra et al. 2022, Acera Mateos et al. 2024), and their interpretation in the correct mRNA isoform context has uncovered important properties and regulatory mechanisms (Uzonyi et al. 2022, Gleeson et al. 2024).
2 Implementation
R2Dtool starts with a set of RNA feature positions in transcriptomic coordinates, such as those obtained from long-read sequencing reads mapped directly to transcript isoforms (Hendra et al. 2022, Stephenson et al. 2022, Acera Mateos et al. 2024, Gleeson et al. 2024) (Fig. 1A). R2Dtool operates with these transcript features and the corresponding gene annotations to analyze, integrate, and visualize isoform-resolved RNA feature maps. R2Dtool’s core function liftover transposes the transcript-centric coordinates of the isoform-mapped sites to genome-centric coordinates. Another core function, annotate, performs positional annotation of RNA features with isoform-aware metagene coordinates and distances to annotated transcriptomic and genomic landmarks. Furthermore, R2Dtool provides isoform-aware visualization in the form of metaplots. These include the metatranscript plot, which visualizes the isoform-specific distribution of RNA features in the context of a rescaled mRNA coordinate system. Additionally, R2Dtool enables visualization of the positional distribution of RNA features around transcript landmarks, such as isoform-specific splice junctions (metajunction plot) and start or stop codons (metacodon plot).
R2Dtool is applicable to any organism with a transcriptome annotation, either obtained from an existing reference or user-built. R2Dtool’s functions are implemented as standalone command-line tools to follow pipelining and format preservation principles. This facilitates rapid and flexible coupling with other bioinformatic workflows, such as RNA modification profiling, isoform-resolved RNA structure detection, or protein interaction studies. R2Dtool can also be applied to short-read experiments in cases where RNA features can be mapped to specific isoforms, e.g. when analyzing in vitro transcribed sequences that map to specific transcripts. Software and usage details are provided at https://github.com/comprna/R2Dtool.
2.1 Input and output formats
R2Dtool is designed to work with RNA features along transcripts described in BED3+ format (Niu et al. 2022) and gene annotations provided in gene transfer format (GTF2) (Pertea and Pertea 2020) (Fig. 1B). Columns 1–3 of the input BED3+ file must provide the positions of the RNA features on the reference transcriptome. In concordance with the BED3+ definition, this input file can contain any number of additional columns to encode experiment-specific data, including scores, labels, likelihoods, or any metadata generated during the upstream analysis. R2Dtool uses gene information from a GTF2 input file, which minimally includes the annotations of coding and noncoding exons but can incorporate additional information. This enables the integration and comparison of the transcript-resolved data with genomic annotations (Fig. 1B–D). R2Dtool generates BED6+ files as outputs, which include the transcriptomic and genomic coordinates of the given sites together with additional positional properties. The BED6+ output file format is compatible with other tools that handle BED files in genome-based coordinates, such as Samtools, Bedtools, and the IGV genome browser (Li et al. 2009, Thorvaldsdottir et al. 2013, Quinlan 2014).
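As an illustration of the BED3+ layout described above, the following minimal Python sketch splits one input line into the three mandatory position columns and an arbitrary tail of extra columns. It is not part of R2Dtool: the function name `parse_bed3_plus`, the record type, and the example transcript identifier are hypothetical, and the coordinates are assumed to follow the standard BED convention (0-based start, exclusive end).

```python
from typing import NamedTuple


class Bed3PlusRecord(NamedTuple):
    transcript: str  # column 1: reference transcript identifier
    start: int       # column 2: 0-based start (assumed standard BED convention)
    end: int         # column 3: exclusive end
    extra: tuple     # columns 4+: experiment-specific fields, kept verbatim


def parse_bed3_plus(line: str) -> Bed3PlusRecord:
    """Split one tab-separated BED3+ line into positions and extra columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED3+ line needs at least three tab-separated columns")
    return Bed3PlusRecord(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


# Hypothetical example line: one site with a label and a stoichiometry column.
record = parse_bed3_plus("ENST_example\t1050\t1051\tm6A\t0.86\n")
```

Keeping the extra columns as an untouched tuple mirrors the format-preservation principle: whatever the upstream tool wrote can be appended unchanged to any output row.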
2.2 Liftover of RNA-mapped features to DNA coordinates
To enable the comparison between isoform-mapped RNA features and genomic annotations, R2Dtool lifts the positions of transcriptome-mapped RNA features over to the corresponding genome reference, using a GTF2 transcript annotation file to calculate the relevant genome positions (Fig. 1C). This operation is performed with the command “r2d liftover -i <Sites (BED3+)> -g <Annotation (GTF2)>.” The output is a BED6+ file, where columns 1–6 contain the new genomic coordinates of each RNA feature in standard BED6 format, whereas the input data, including the original transcriptome-reference coordinates, are losslessly preserved in output columns 7 to n + 7, where n is the original number of columns in the input data. To illustrate this operation, we used R2Dtool’s liftover to visualize m6A modification calls generated from nanopore direct RNA sequencing of HeLa transcriptomic RNA (SQK-RNA004 kit). DRACH-context m6A basecalling was performed with Dorado v0.5.3 (github.com/nanoporetech/dorado) and the methylation calls were processed with ModKit v0.2.6 (github.com/nanoporetech/modkit), before analysis by R2Dtool (code for this analysis is available at github.com/comprna/R2Dtool/). While the m6A calls were made in transcript-centric coordinates, R2Dtool’s liftover enabled the visualization of m6A sites in their genomic context. This is illustrated by showing different m6A sites in a transcript from gene PUS1 in their genomic context and in proximity to an adjacent intron (Fig. 1E). This operation highlights that, by transposing sites from alternative transcript isoforms to their genomic coordinates, it is possible to assess features in their correct molecular context that would otherwise be missed by genomic-based methods. This is further illustrated with the example of the ANKRD10 gene, which has two alternative transcript isoforms, each with an m6A modification with different stoichiometry (Fig. 1F).
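The coordinate transposition behind a transcript-to-genome liftover can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not R2Dtool’s implementation: the function `transcript_to_genome` is hypothetical, exons are given as 0-based half-open genomic intervals (GTF2 records are 1-based and inclusive, so a real pipeline would convert them first), and the transcript position is 0-based.

```python
def transcript_to_genome(tx_pos, exons, strand):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons  -- list of (start, end) genomic intervals, 0-based half-open
    strand -- "+" or "-"
    """
    # Walk the exons in the direction of transcription.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = tx_pos
    for start, end in ordered:
        length = end - start
        if offset < length:
            # On the minus strand the transcript runs from the exon end backwards.
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length  # skip this exon and continue into the next one
    raise ValueError("position lies beyond the end of the transcript")


# Toy two-exon transcript separated by a 100 nt intron.
toy_exons = [(100, 200), (300, 400)]
plus_pos = transcript_to_genome(150, toy_exons, "+")
minus_pos = transcript_to_genome(150, toy_exons, "-")
```

Because introns are skipped while walking the exons, two sites that are adjacent on the transcript can land far apart on the genome, which is why the lifted coordinates are needed to view sites next to an intron.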
2.3 Isoform-specific positional annotation
R2Dtool leverages the specificity of isoform-resolved RNA feature maps to systematically annotate transcript-mapped RNA feature positions with absolute and relative distances to transcriptomic and genomic landmarks, including transcript starts and ends, splice junctions, stop codons and start codons (Fig. 1D). This operation is performed with the command “r2d annotate -i <Sites (BED3+)> -g <Annotation (GTF2)>,” where the first 3 columns of the input specify the position of the features in transcriptome-specific BED3 coordinates, assumed to be on the plus strand. The output of this command is a BEDn + 12 file, where the original n columns of the input are preserved, and 12 additional columns are added to the output. These additional columns correspond to the gene ID, gene name, transcript biotype, feature metatranscript coordinates, and the absolute and relative distances to local features, such as stop and start codons and adjacent splice junctions.
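To make the notion of isoform-specific distances to splice junctions concrete, the sketch below computes the distance from a transcript position to its nearest upstream and downstream exon-exon junctions. It is an illustrative assumption rather than the tool’s code: the function `junction_distances` is hypothetical, positions are 0-based, and each junction is placed at the transcript index of the first base of the downstream exon, a convention that may differ from the one R2Dtool uses.

```python
from bisect import bisect_right
from itertools import accumulate


def junction_distances(tx_pos, exon_lengths):
    """Return (upstream, downstream) distances in nt to the nearest junctions.

    exon_lengths -- exon lengths in transcript order (5' to 3')
    A missing junction on either side is reported as None.
    """
    # Cumulative exon lengths give junction positions; the last value is the
    # transcript end, which is not a junction.
    junctions = list(accumulate(exon_lengths))[:-1]
    i = bisect_right(junctions, tx_pos)  # number of junctions at or before tx_pos
    upstream = tx_pos - junctions[i - 1] if i > 0 else None
    downstream = junctions[i] - tx_pos if i < len(junctions) else None
    return upstream, downstream


# Toy three-exon isoform: junctions fall at transcript positions 100 and 150.
distances = junction_distances(120, [100, 50, 200])
```

Since the exon lengths come from the specific isoform a site was mapped to, the same genomic position can yield different junction distances on different isoforms.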
2.4 Isoform-aware metatranscript coordinates and plots
R2Dtool calculates an isoform-aware metacoordinate for features localized on transcripts that are annotated as protein-coding. This metacoordinate represents the normalized position of a given feature on a fixed-length virtual transcript model using the same method as previously described, where the positions of features are linearly scaled to a position between 0 and 3, depending on the relative position of the feature across the span of the 5ʹ UTR [0–1), CDS (1–2), or 3ʹ UTR (2–3] (Olarerin-George and Jaffrey 2017). However, unlike other metagene methods that use a single transcript model to calculate the metacoordinates for all features assigned to a given gene, R2Dtool leverages the individual isoform assignment of each RNA feature to calculate distinct metacoordinates for each alternative transcript that spans a given genomic position, accommodating potential diversity in UTR and ORF start/end positions present in alternative transcript isoforms. We highlight this capability by showing how a single genomic position on ANKRD10 is heavily m6A methylated when mapped to a transcript isoform where the site is located in the isoform 3ʹ UTR (86% m6A/A; metacoordinate of 2.26), but unmethylated when mapped to an alternative isoform, where the same genomic position corresponds to the transcript CDS (0% m6A/A; metacoordinate of 1.54) (Fig. 1F).
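The linear rescaling onto the 0 to 3 virtual transcript can be sketched as follows. This is a minimal illustration under assumptions, not R2Dtool’s code: the function `metacoordinate` and its argument names are hypothetical, all positions are 0-based transcript coordinates with an exclusive CDS end, and the handling of the exact region boundaries is a simplification (here each region maps to a half-open unit interval).

```python
def metacoordinate(tx_pos, cds_start, cds_end, tx_length):
    """Scale a transcript position onto a virtual transcript of length 3.

    5' UTR maps to [0, 1), CDS to [1, 2), 3' UTR to [2, 3).
    cds_start is inclusive and cds_end exclusive, in transcript coordinates.
    """
    if not 0 <= tx_pos < tx_length:
        raise ValueError("position lies outside the transcript")
    if tx_pos < cds_start:  # 5' UTR
        return tx_pos / cds_start
    if tx_pos < cds_end:    # CDS
        return 1 + (tx_pos - cds_start) / (cds_end - cds_start)
    # 3' UTR
    return 2 + (tx_pos - cds_end) / (tx_length - cds_end)


# Toy isoform: 100 nt 5' UTR, 400 nt CDS, 400 nt 3' UTR.
in_cds = metacoordinate(150, cds_start=100, cds_end=500, tx_length=900)
in_utr3 = metacoordinate(700, cds_start=100, cds_end=500, tx_length=900)
```

Calling the function once per isoform, with that isoform’s own CDS boundaries and length, is what makes the result isoform-aware: a site shared by two isoforms receives one metacoordinate for each of them.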
Metatranscript coordinates are calculated for all features mapped onto protein-coding RNAs during the annotation analysis step and can be readily visualized through publication-grade isoform-resolved feature distribution plots using R2Dtool’s plotting functions. These include the metatranscript plot, which shows the normalized positional density of RNA features. The R2Dtool command “r2d plotMetaTranscript” produces density plots showing the distribution of RNA features from the output of the annotate command. The density of RNA features is determined across bins spanning the virtual metatranscript model as the proportion of positive RNA features (e.g. with stoichiometry above or P-value below a certain cutoff) among all tested sites in the given metatranscript bin. For the metatranscript plot, we segment the metatranscript into 120 bins of width 0.025. To illustrate this operation, we produced a metatranscript plot for the density of m6A sites with >10% stoichiometry along R2Dtool-annotated transcript-centric m6A calls performed in HeLa (Fig. 1G).
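The per-bin density described above amounts to a ratio of positive sites to tested sites in each of the 120 bins. The following sketch shows one way to compute it; the function `binned_proportion`, its parameters, and the toy input values are hypothetical and are not taken from R2Dtool’s plotting code.

```python
def binned_proportion(sites, n_bins=120, span=3.0, cutoff=0.10):
    """Proportion of positive sites per metatranscript bin.

    sites  -- iterable of (metacoordinate, stoichiometry) pairs
    cutoff -- a site counts as positive when its stoichiometry exceeds this value
    Returns a list of length n_bins; bins without tested sites hold None.
    """
    positive = [0] * n_bins
    tested = [0] * n_bins
    for meta, stoichiometry in sites:
        # 120 bins over a span of 3 gives a bin width of 0.025; the min()
        # keeps a metacoordinate of exactly 3 inside the last bin.
        idx = min(int(meta / span * n_bins), n_bins - 1)
        tested[idx] += 1
        if stoichiometry > cutoff:
            positive[idx] += 1
    return [p / t if t else None for p, t in zip(positive, tested)]


# Toy input: two sites in the same 3' UTR bin, one site in a CDS bin.
toy_sites = [(2.26, 0.86), (2.27, 0.05), (1.54, 0.0)]
proportions = binned_proportion(toy_sites)
```

Dividing by the number of tested sites, rather than plotting raw counts, keeps the profile from simply reflecting where sequencing coverage was high enough to test a site.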
2.5 Isoform-specific landmark-centric plots
R2Dtool also calculates the distribution of absolute distances between RNA features and reference landmarks, such as the start and end of the ORF, the start and end of the transcript, or the nearest upstream or downstream splice junction, all in an isoform-specific manner, enabling a range of downstream positional analyses. We illustrate this operation with the analysis of the relative distances of m6A modifications to their nearest upstream or downstream exon-exon junctions obtained from the annotate command above (Fig. 1H). This plot, produced with the command “r2d plotMetaJunction,” shows a clear exclusion of m6A sites in a window of 200 nt around the exon-exon junctions, in agreement with previous reports (Uzonyi et al. 2022). Like the metatranscript plot, the metajunction plot displays the proportion of positive features at intervals around each splice-site, in increments of 1 nt. Similarly, “r2d plotMetaCodon” can be used to plot the distribution of RNA features in absolute distance around start and stop codons, also in 1 nt intervals (also available at github.com/comprna/R2Dtool).
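A landmark-centric profile at 1 nt resolution can be sketched in the same spirit, by grouping sites on their signed distance to the landmark. The function `proportion_by_distance`, the sign convention (negative for upstream), and the window size are illustrative assumptions and do not describe R2Dtool’s plotting commands.

```python
from collections import defaultdict


def proportion_by_distance(sites, window=200):
    """Proportion of positive sites at each signed distance from a landmark.

    sites  -- iterable of (signed_distance_nt, is_positive) pairs
    window -- sites further than this many nt from the landmark are ignored
    Returns a dict mapping distance -> proportion, ordered by distance.
    """
    tested = defaultdict(int)
    positive = defaultdict(int)
    for distance, is_positive in sites:
        if abs(distance) <= window:
            tested[distance] += 1
            positive[distance] += int(is_positive)
    return {d: positive[d] / tested[d] for d in sorted(tested)}


# Toy input: the last site falls outside the window and is dropped.
profile = proportion_by_distance([(-5, True), (-5, False), (10, False), (350, True)])
```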
3 Conclusions
R2Dtool provides a simple, robust, and reproducible framework to integrate and visualize isoform-resolved RNA feature maps, enabling a comprehensive annotation of transcript-centric sites. R2Dtool provides isoform-aware identification of RNA feature distributions and can inform on potential RNA regulatory mechanisms from long-read transcriptomics. R2Dtool is particularly useful for epitranscriptomics, a field with significant recent advances in the isoform-specific identification of RNA modifications using long-read direct RNA sequencing, and in which most methods generate their estimates in transcript-centric coordinates. We anticipate that R2Dtool could empower future epitranscriptomic studies to discover new RNA regulatory mechanisms, particularly those involving the interplay between RNA modifications and genomic features. By seamlessly integrating transcriptomic and genomic annotations using standard formats, R2Dtool empowers researchers to fully utilize the potential of isoform-resolved transcriptomics.
Acknowledgements
We thank Dr Akanksha Srivastava and Favour Oyelami for their constructive feedback on our manuscript and code.
Contributor Information
Aditya J Sethi, Shine-Dalgarno Centre for RNA Innovation, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia; Centre for Computational Biomedical Sciences, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia; EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, Acton ACT 2601, Australia.
Pablo Acera Mateos, Children’s Cancer Institute, Lowy Cancer Centre, University of New South Wales, Sydney, Kensington NSW 2033, Australia.
Rippei Hayashi, Shine-Dalgarno Centre for RNA Innovation, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia.
Nikolay E Shirokikh, Shine-Dalgarno Centre for RNA Innovation, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia.
Eduardo Eyras, Shine-Dalgarno Centre for RNA Innovation, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia; Centre for Computational Biomedical Sciences, John Curtin School of Medical Research, Australian National University, Canberra, Acton ACT 2601, Australia; EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, Acton ACT 2601, Australia.
Conflict of interest
None declared.
Funding
This work was supported by the Australian Research Council (ARC) Discovery Project grants [DP210102385 to R.H. and E.E.] and [DP220101352 to E.E.], by the National Health and Medical Research Council (NHMRC) through an Investigator Grant [GNT1175388 to N.E.S.] and an Ideas Grant [2018833 to E.E.], by a Bootes grant (2021–2022) [to A.J.S.], and by an Innovator grant from the Talo Computational Biology Accelerator Program [to A.J.S.].
Data availability
The data underlying this article is available in Figshare at https://doi.org/10.6084/m9.figshare.25730082.v1.
References
- Acera Mateos P, Sethi AJ, Ravindran A. et al. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun 2024;15:3899. 10.1038/s41467-024-47953-7
- Aw JGA, Lim SW, Wang JX. et al. Determination of isoform-specific RNA structure with nanopore long reads. Nat Biotechnol 2021;39:336–46. 10.1038/s41587-020-0712-z
- Bansal M, Kundu A, Gibson A. et al. Transcriptome-wide quantitative profiling of PUS7-dependent pseudouridylation by nanopore direct long read RNA sequencing. BioRxiv, 2024, preprint: not peer reviewed. 10.1101/2024.01.31.578250
- Dominissini D, Moshitch-Moshkovitz S, Schwartz S. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 2012;485:201–6. 10.1038/nature11112
- Fournier E, Beauparlant CJ, Lippens C. et al. metagene2: A package to produce metagene plots. R package version 1.20.0, 2022. 10.18129/B9.BIOC.METAGENE2
- Gleeson J, Madugalle SU, McLean C. et al. Isoform-level profiling of m6A epitranscriptomic signatures in human brain. BioRxiv, 2024, preprint: not peer reviewed. 10.1101/2024.01.31.578088
- Hendra C, Pratanwanich PN, Wan YK. et al. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat Methods 2022;19:1590–8. 10.1038/s41592-022-01666-1
- Lee Y, Choe J, Park OH. et al. Molecular mechanisms driving mRNA degradation by m6A modification. Trends Genet 2020;36:177–88. 10.1016/j.tig.2019.12.007
- Li H, Handsaker B, Wysoker A. et al.; 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9. 10.1093/bioinformatics/btp352
- Lin Y, Kwok S, Hein AE. et al. RNA molecular recording with an engineered RNA deaminase. Nat Methods 2023;20:1887–99. 10.1101/2022.09.06.506853
- Liu H, Begik O, Lucas MC. et al. Accurate detection of m6A RNA modifications in native RNA sequences. Nat Commun 2019;10:4079. 10.1038/s41467-019-11713-9
- Niu J, Denisko D, Hoffman MM. The browser extensible data (BED) format. File Format Stand 2022;1:8.
- Olarerin-George AO, Jaffrey SR. MetaPlotR: a Perl/R pipeline for plotting metagenes of nucleotide modifications and other transcriptomic sites. Bioinformatics 2017;33:1563–4. 10.1093/bioinformatics/btx002
- Pertea G, Pertea M. GFF utilities: GffRead and GffCompare. F1000Res 2020;9:304. 10.12688/f1000research.23297.2
- Quinlan AR. BEDTools: The Swiss-Army tool for genome feature analysis. Curr Protoc Bioinformatics 2014;47:11.12.1–2.34. 10.1002/0471250953.bi1112s47
- Stephenson W, Razaghi R, Busan S. et al. Direct detection of RNA modifications and structure using single-molecule nanopore sequencing. Cell Genom 2022;2:100097. 10.1016/j.xgen.2022.100097
- Tapial J, Ha KCH, Sterne-Weiler T. et al. An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms. Genome Res 2017;27:1759–68. 10.1101/gr.220962.117
- Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92. 10.1093/bib/bbs017
- Uzonyi A, Dierks D, Nir R. et al. Exclusion of m6A from splice-site proximal regions by the exon junction complex dictates m6A topologies and mRNA stability. Mol Cell 2023;83:237–251.e7. 10.1101/2022.06.29.498130
- Van Nostrand EL, Pratt GA, Yee BA. et al. Principles of RNA processing from analysis of enhanced CLIP maps for 150 RNA binding proteins. Genome Biol 2020;21:90. 10.1186/s13059-020-01982-9