In Silico Identification of RNA Modifications from High-Throughput Sequencing Data Using HAMR

Pavel P Kuksa; Yuk-Yee Leung; Lee E Vandivier; Zachary Anderson; Brian D Gregory; Li-San Wang

doi:10.1007/978-1-4939-6807-7_14

Author manuscript; available in PMC: 2020 May 18.

Published in final edited form as: Methods Mol Biol. 2017;1562:211–229. doi: 10.1007/978-1-4939-6807-7_14

PMCID: PMC7233376 NIHMSID: NIHMS1579281 PMID: 28349463

Abstract

RNA molecules are often altered post-transcriptionally by the covalent modification of their nucleotides. These modifications are known to modulate the structure, function, and activity of RNAs. When reverse transcribed into cDNA during RNA sequencing library preparation, atypical (modified) ribonucleotides that affect Watson-Crick base pairing will interfere with reverse transcriptase (RT), resulting in cDNA products with mis-incorporated bases or prematurely terminated cDNA products. These interactions with RT can therefore be inferred from mismatch patterns in the sequencing reads, and are distinguishable from simple base-calling errors, single-nucleotide polymorphisms (SNPs), or RNA editing sites. Here, we describe a computational protocol for the in silico identification of modified ribonucleotides from RT-based RNA-seq read-out using the High-throughput Analysis of Modified Ribonucleotides (HAMR) software. HAMR can identify these modifications transcriptome-wide with single-nucleotide resolution, and also differentiate between different types of modifications to predict modification identity. Researchers can use HAMR to identify and characterize RNA modifications using RNA-seq data from a variety of common RT-based sequencing protocols such as Poly(A), total RNA-seq, and small RNA-seq.

Keywords: RNA modification, RNA posttranscriptional modification, RNA covalent modification, Small RNA, Small RNA sequencing, Messenger RNA, RNA sequencing, Machine learning, Classification

1. Introduction

Covalent posttranscriptional modifications of specific nucleotide bases in RNA molecules are known to be highly prevalent and physiologically important [1–6]. RNA modifications play a role in maintaining the structure and stability of RNAs [7–10] and affect their maturation [3, 10], translation and cellular abundance [10, 15]. All known classes of RNA molecules harbor various levels of diverse modifications [1, 8, 9, 11–13]. However, the overall abundance and biological function of these modifications are incompletely understood [10, 14, 15].

Experimental methods for low-throughput detection of some types of RNA modifications are well established [16–19]. One such method is primer extension, which relies on the differential ability of reverse transcriptase to produce cDNAs with base-pair substitutions or premature termination at positions occupied by modified nucleotides [17]. However, because many RNA sequencing library preparation protocols require RNA to cDNA conversion by reverse transcriptase (RT), it is possible to use RNA sequencing data to identify sites of modified nucleotides across many classes of RNAs in a high-throughput and transcriptome-wide fashion by identifying positions within RNAs with significant mismatch rates after alignment. The protocol for in silico identification of RNA modifications presented here uses our previously developed HAMR software [20]. The protocol allows for fast and reliable identification of modified nucleotides at single-nucleotide resolution in all RNA classes transcriptome-wide through the analysis of nucleotide substitutions in high-throughput RNA sequencing datasets. This software provides an important tool for detecting and classifying modified RNA ribonucleotides that modulate RT incorporation, e.g., by affecting Watson-Crick base pairing. The examples of HAMR-detectable modifications include but are not limited to m1A, m3C, m5C, and pseudouridine [1, 8, 10, 13, 14, 20–22]. Modifications such as m6A [23], which do not significantly interfere with RT, will not be detected by HAMR.

In this chapter, we describe how to use HAMR [20] for in silico identification of RNA modifications using RNA sequencing data.

HAMR in a nutshell

The starting point of the HAMR analysis is the mapped sequencing reads in the standard BAM file format (https://samtools.github.io/hts-specs/SAMv1.pdf). The HAMR software is fully automated and the user can perform genome-wide analysis of RNA modifications in one command (see Subheading 3.5) using BAM files with the corresponding reference genome sequence. The user can also start from raw sequencing data and turn it into mapped data (BAM) as summarized in Subheading 3.4 for preparing raw sequencing data for HAMR analysis. Subheading 3.1 additionally describes suggestions for RNA library preparation.

Sample study

To illustrate RNA modification analysis using the HAMR software [20], we also provide a sample study of RNA modifications in human brain using RNA-seq data in Subheading 3.8.

Availability

HAMR is available as a stand-alone command-line pipeline (https://github.com/wanglab-upenn/HAMR/) and as a Web-based application (accessible at http://lisanwanglab.org/hamr/). The HAMR source code is freely available under the MIT license for academic and nonprofit use.

2. Materials

Always use the recommended version of all software dependencies. We also recommend performing a setup as described (see Subheading 3.3) before preprocessing the sequencing data from FASTQ to BAM file (see Subheading 3.4) and analyzing it using HAMR (see Subheadings 3.5 and 3.6).

2.1. Hardware Requirements

Desktop computer or server (multi-core server with 32GB RAM is recommended when starting from raw sequenced reads in human).

2.2. Software Requirements

Linux-based operating system (CentOS, Ubuntu, Debian, etc.).
C++ compiler (g++).
Standard POSIX programs (awk, grep, bash) (see Note 1).
Samtools suite [24] http://www.htslib.org/download/ (version 1.0 or later).
Python (version 2.7.x or above) (see Note 2).
R program https://www.r-project.org/ (version 3.1.x or above).
Additionally, the following software is required for processing raw reads (see Subheading 3.4):
1. SRA toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software) when analyzing data from the Sequence Read Archive.
2. Cutadapt [25] https://code.google.com/p/cutadapt/ (version 1.7 and above) to trim adapter sequences for sequenced reads.
3. STAR [26] https://github.com/alexdobin/STAR/releases (version 2.4.x and above) for mapping reads to the genome.

The required external programs can be installed from Linux repositories (e.g., using package managers such as apt-get in Ubuntu):

sudo apt-get install samtools
sudo apt-get install r-base

Alternatively, packages can be manually installed (see Note 3).

3. Methods

3.1. RNA Sequencing Data Generation

The protocol begins with extraction of the target population of RNAs such as mRNA or small RNA using a phenol-based extraction method from tissues or cells. The extracted material is then prepared as a sequencing library and sequenced using any standard RT-based Illumina sequencing protocol for RNA class of interest (e.g., mRNAs or small RNAs).

RNA extraction: Use a standard phenol-based RNA extraction to obtain a minimum of 4 μg total RNA for use with the Illumina TruSeq® RNA Library Prep Kit or Illumina TruSeq® Stranded Total RNA Library Prep Kit; alternatively, obtain a minimum of 1 μg total RNA from tissues or cell lines for use with the Illumina TruSeq® Small RNA Library Preparation.
Library preparation for RNA sequencing: Use the Illumina TruSeq® RNA, Stranded Total RNA, or TruSeq Small RNA library preparation kit protocol for standard RT-based sequencing. These protocols include adaptor ligation, cDNA synthesis, and library amplification steps, all adapted from the original Illumina library preparation kit protocol. The libraries should be sequenced on an Illumina Genome Analyzer machine (e.g., HiSeq2000, GAIIx, or above) with a read length of 50–100 bp.
Following sequencing is a series of preprocessing steps (not included in the HAMR software, summarized in Fig. 1) for turning the FASTQ file obtained from the Illumina sequencer to mapped (BAM) file format. These preprocessing steps ensure the input BAM file is compatible and ready for use with HAMR. The preprocessing steps include read trimming, read mapping with mismatches, read filtering, and handling of multi-mapped reads (optional for studying highly repetitive RNAs such as tRNAs, or other small RNAs). The HAMR software (see Fig. 2) can then be applied to the prepared BAM file. The HAMR software is generally applicable to any mRNA or small RNA populations [14].

Fig. 1 — Preparing the raw sequencing data (from FASTQ to BAM) for running HAMR software [20]

Fig. 2 — Steps used by HAMR software [20] for in silico detection and classification of RNA modifications

3.2. Downloading and Installing HAMR

All HAMR programs are packaged as compiled “binaries” (see Note 4) that can be run from the command line by typing in the program’s file name. In general, HAMR programs will also require other external programs (see Subheading 2.2 on software requirements). To install HAMR:

Create a directory for HAMR analyses and enter this directory:
```
mkdir <HAMRdirectory>
cd <HAMRdirectory>
```
Download HAMR from https://github.com/wanglab-upenn/HAMR to the HAMR directory <HAMRdirectory>. For example, to download HAMR release v1.2 from the command line:
```
wget https://github.com/wanglab-upenn/HAMR/archive/v1.2.tar.gz -O HAMR-v1.2.tar.gz
```
Extract HAMR source code from archive:
```
tar xzvf HAMR-v1.2.tar.gz
```
Go to the HAMR program directory:
```
cd HAMR-1.2/
```
Compile HAMR source code (see Note 4):
```
make clean
make
```

3.3. Preparing the Reference Genome

Download a reference genomic sequence. Human genome data as well as genomes of other organisms can be obtained from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/downloads.html). Users can also go to the HAMR website (http://lisanwanglab.org/hamr/genomes/) to get the genome FASTA files for human (hg19), mouse (mm9), D. melanogaster (dm3), C. elegans (ce6), yeast (sacCer3), and A. thaliana (TAIR10). Plant genomes, such as Arabidopsis, can be downloaded from the Ensembl Plant project (see Note 5). For instance, to download human reference genome:
```
cd <HAMRdirectory>
wget http://lisanwanglab.org/hamr/genomes/hg19_all_chr.fas
```
Index genome FASTA file (see Note 6):
```
samtools faidx hg19_all_chr.fas
```

3.4. Preparing RNA-Seq Data for HAMR

When starting from raw sequenced reads, the user needs to use a Linux-capable computer to execute the following steps (see Fig. 1) to prepare an input BAM file for HAMR analysis (see Subheading 3.8 for an example of HAMR analysis starting from raw sequenced data). All sorting is done with respect to chromosomal coordinates, not read name.

Read trimming

Use Cutadapt software [25] to trim the adaptor sequences from sequencing reads (see Subheading 3.8 for an example of how to run cutadapt).
1. We suggest using the following trimming parameters: a maximum error rate of 6 % when matching the adapter sequence, an adapter overlap of at least 6 nt, and a read length of at least 15 nt after trimming.
2. For smRNA-seq: Reads are 3′ adapter-trimmed, requiring at least 6 bp of adapter sequence with at most a 6 % mismatch rate. All untrimmed reads and trimmed reads shorter than 14 bp should be discarded (see Note 7). Only trimmed reads should be used.
3. For mRNA-seq: Use all reads (both trimmed and untrimmed) that are at least 15 nt.
Read mapping
1. Use STAR [26] for mapping reads to the genome. The reads should be mapped allowing for mismatches between sequencing reads and the reference genomic sequence (see Subheading 3.8 on HAMR analysis of RNA modifications in the human brain for an example of mapping with STAR).
2. For both smRNAs and mRNAs, map reads with mismatches. For shorter reads (e.g., smRNA-seq), allowing up to 2 mismatches is suggested. For longer reads (e.g., mRNA-seq), setting the number of mismatches to ≤5 % of the mapped read length is suggested (see Note 8).
3. Allow for soft-clipping (see Note 9).
4. For mRNA only, filter out or resolve the spliced alignments (see Note 10).
5. Sort and index the resulting BAM file (see Note 11).
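The mismatch limits suggested in the read mapping steps above reduce to simple arithmetic. The following is a minimal sketch, not part of HAMR; the function name `max_mismatches` and its arguments are hypothetical, and rounding down is an assumption chosen so that the limit never exceeds 5 % of the read length.

```python
def max_mismatches(read_length, small_rna=False, percent=5):
    """Suggested mismatch cap for the aligner, following the text above.

    small_rna=True returns the fixed cap of 2 suggested for short reads;
    otherwise the cap is `percent` % of the read length, rounded down
    (the rounding direction is an assumption, not stated in the protocol).
    """
    if read_length <= 0:
        raise ValueError("read_length must be a positive integer")
    if small_rna:
        return 2
    # Integer arithmetic avoids floating-point surprises such as 0.05 * 60.
    return read_length * percent // 100


if __name__ == "__main__":
    for length in (50, 75, 100):
        print(length, max_mismatches(length))
```

The resulting value would be passed to the aligner's mismatch option, for example `--outFilterMismatchNmax` in the STAR command shown in Subheading 3.8.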
Handling multimapped reads (optional).
We strongly suggest using only uniquely mapped reads for general purposes and for analyzing longer RNAs such as mRNAs. For example, when using STAR for mapping, the multimapped reads can be filtered out with the --outFilterMultimapNmax 1 option (see Subheading 3.8). For analysis of highly repetitive classes of RNA (e.g., tRNA), additional steps need to be performed to assign multimapped reads to the repeat family before running HAMR. Resolving multimapped reads in the case of tRNAs is illustrated below:
1. Take the nuclear tRNA annotations from the “tRNAs” table in the UCSC genome browser (e.g., GRCh37) (see Note 12).
2. Generate the annotations for mitochondrial tRNAs by running tRNAscan-SE (see Note 13) in organelle mode on the mitochondrial genome (“chrM” in GRCh37/hg19).
3. Discard reads aligning to repeat regions or annotated RNAs other than tRNAs. This step will keep only the reads mapped exclusively to the target tRNAs. Discard reads mapped to more than one tRNA iso-acceptor family.
4. Optional: filter out alignments with mismatches overlapping with single nucleotide polymorphisms (SNPs) based on dbSNP for human (see Note 14). This will prioritize alignments whose mismatches had no apparent explanation (i.e., likely caused by RNA modification) over mismatches caused by SNPs.
5. Sort and index the output BAM file (see Note 11).

3.5. Running HAMR Pipeline

Description:

The HAMR program [20] takes as input a BAM file consisting of the sequencing reads aligned to the reference genome allowing for mismatches between reads and reference genomic sequences (see Subheading 3.4). HAMR first produces a table of observed reference and non-reference (mismatched) read nucleotide frequencies per candidate genomic site. Only high-quality (e.g., Q score of at least 30) read positions are considered when tabulating mismatches. The resulting candidate set of genomic sites are strictly filtered for sequencing errors using a binomial test that assumes a sequencing error-rate tenfold higher than the error-rate corresponding to the minimum quality score. The sites that pass this filter are enriched for mismatches and consist of potential SNPs, RNA editing events, and RNA modification sites. To filter out sites corresponding to homozygous or heterozygous SNPs and RNA editing events, HAMR performs a series of binomial tests using an ensemble of biallelic null hypotheses corresponding to all possible genotypes (AA, AC, AG, AT, CC, etc.). Sites that cannot be explained by any of the homozygous and biallelic genotypes are then inferred to be modifications. These inferred modification sites will be classified as particular modification types based on machine learning models built from the observed mismatch patterns at known tRNA modification sites, as annotated in the MODOMICS database [20, 27, 28] (see Note 15).
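The two statistical filters described above can be illustrated with a short sketch. This is not HAMR's own code (the actual tests are implemented in the R script detect_mods.R, see Subheading 3.6.6); the function names, the use of an exact binomial upper tail, and the choice of "reads matching neither genotype allele" as the test statistic are all assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_pvalue(counts, ref, min_qual=30, inflate=10):
    """p-value that all non-reference reads are sequencing errors.

    The assumed error rate is `inflate` times the rate implied by the
    minimum quality score (tenfold, as described in the text).
    """
    n = sum(counts.values())
    nonref = n - counts.get(ref, 0)
    return binom_upper_tail(nonref, n, inflate * phred_to_error(min_qual))


def genotype_pvalue(counts, seq_err):
    """Largest p-value over all homozygous and biallelic genotypes.

    For each genotype, reads matching neither allele are treated as errors
    (an assumed test statistic). A small maximum means no genotype explains
    the site, which is the pattern expected at a modification.
    """
    n = sum(counts.values())
    best = 0.0
    for a, b in combinations_with_replacement("ACGT", 2):
        unexplained = n - sum(counts.get(x, 0) for x in {a, b})
        best = max(best, binom_upper_tail(unexplained, n, seq_err))
    return best


if __name__ == "__main__":
    site = {"A": 60, "G": 25, "T": 15}  # hypothetical read counts
    print(error_pvalue(site, ref="A"), genotype_pvalue(site, seq_err=0.05))
```

In this sketch a heterozygous SNP such as 52 A reads and 48 G reads is fully explained by the AG genotype, whereas a site with three well-supported nucleotides is not explained by any single genotype.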

Here, we describe how to use HAMR for processing mapped BAM files into predictions of RNA modification positions and identity (see Fig. 2) (if starting from raw sequencing reads, see Subheading 3.4 on read preprocessing steps). The starting point for HAMR analysis is the mapped BAM. For the following steps, we assume that sequenced reads have been aligned, filtered and are stored in BAM format (see Subheading 3.4). All HAMR programs are operated from the command line. Using a command-line interface makes it easy to reproduce results and keep track of the analyses performed. Note that variable names within HAMR commands enclosed in <angle brackets> are placeholders for file names and analysis parameters and should be replaced with appropriate names without angle brackets.

HAMR analysis can be performed genome-wide (default) or for specific genomic regions (targeted mode). The only difference between the two modes is that, for the targeted mode, the user needs to prepare a BED file (BED6 format; see Note 16) containing the genomic regions of interest.
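As an illustration of the BED6 layout expected for the targeted mode, the sketch below formats one region per line with the six standard tab-separated fields (chromosome, 0-based start, end, name, score, strand). The helper name `bed6_line` and the example region are hypothetical and are not taken from the HAMR distribution.

```python
def bed6_line(chrom, start, end, name=".", score=0, strand="+"):
    """Format one genomic interval as a tab-separated BED6 line.

    BED coordinates are 0-based and half-open: `start` is the first base
    minus one, `end` is the last base (1-based).
    """
    if start < 0 or end <= start:
        raise ValueError("require 0 <= start < end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


if __name__ == "__main__":
    # Hypothetical region of interest, for illustration only.
    regions = [("chr1", 100, 172, "example_tRNA", 0, "+")]
    with open("target_regions.bed", "w") as handle:
        for region in regions:
            handle.write(bed6_line(*region) + "\n")
```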

Besides preparing a BAM file <prepared_sorted.bam> to run HAMR, the user needs to provide a reference genome file <genome.fa> in FASTA format (see Subheading 3.3 on how to download and prepare a reference genome sequence).

HAMR usage:

python hamr.py [-h] [--target_bed [TARGET_BED]] [--paired_ends] [--filter_ends] <bam> <genome_fas> <prediction_training_set> <output_folder> <output_prefix> <min_qual> <min_cov> <seq_err> <hypothesis> <max_p> <max_fdr> <refpercent>

Required parameters

<bam> = prepared sorted BAM file (see Subheading 3.4 on preparing BAM file from raw sequenced read data)
<genome_fas> = indexed <genome.fa>(FASTA file)
<prediction_training_set> = data used for training HAMR
<output_folder> = output directory
<output_prefix> = prefix for the output file names; the following outputs are produced:
- <output_prefix>.mods.bed = BED file containing the detected modification sites
- <output_prefix>.mods.txt = text file containing the detected modification sites
- <output_prefix>.raw.txt = full table containing the candidate sites for statistical testing
<min_qual> = Minimum base calling quality, suggested 30
<min_cov> = Minimum number of non-reference nucleotide counts, suggested 10 (see Note 17)
<seq_err> = expected sequencing error rate / mismatch rate, suggested 0.01–0.05 (see Note 18)
<hypothesis> = Statistical test to be performed, H4 is recommended (see Note 19)
<max_p> = Maximum p-value (0.01 is recommended)
<max_fdr> = Maximum FDR with replacement
<refpercent>= Minimum percentage of reads with reference nucleotide

Optional parameters

[-h] = prints help screen
[--target_bed [TARGET_BED]] = targeted BED file containing genomic regions of interest (see Note 16)
[--paired_ends] = set to 1 for paired-end sequencing data
[--filter_ends] = set to 1 to filter out the read ends from read-pileups

Example for running HAMR genome-wide mode:

python hamr.py trial.human.bam hg19_all_chr.fas models/euk_trna_mods.Rdata HAMR_out brain 30 10 0.05 H4 0.01 0.05 0.05 --filter_ends

Sample output: HAMR_out/brain.mods.txt

Example for running HAMR in targeted mode for specific genomic regions:

python hamr.py trial.human.bam hg19_all_chr.fas models/euk_trna_mods.Rdata HAMR_out brain 30 10 0.05 H4 0.01 0.05 0.05 --filter_ends --target_bed <target_regions.bed>
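Because hamr.py takes twelve positional arguments in a fixed order, it is easy to misplace a value. The sketch below assembles the argument list from named parameters following the usage line above; the helper `hamr_command` is hypothetical, it assumes that --paired_ends and --filter_ends are bare flags (as in the examples), and it only builds the command without running it.

```python
import shlex

# Positional arguments in the order given by the HAMR usage line.
POSITIONAL = [
    "bam", "genome_fas", "prediction_training_set", "output_folder",
    "output_prefix", "min_qual", "min_cov", "seq_err", "hypothesis",
    "max_p", "max_fdr", "refpercent",
]


def hamr_command(params, target_bed=None, paired_ends=False, filter_ends=False):
    """Return the hamr.py invocation as an argument list."""
    missing = [key for key in POSITIONAL if key not in params]
    if missing:
        raise KeyError("missing required parameters: " + ", ".join(missing))
    cmd = ["python", "hamr.py"] + [str(params[key]) for key in POSITIONAL]
    # Optional arguments go last, mirroring the examples in the text.
    if target_bed:
        cmd += ["--target_bed", target_bed]
    if paired_ends:
        cmd.append("--paired_ends")
    if filter_ends:
        cmd.append("--filter_ends")
    return cmd


EXAMPLE = {
    "bam": "trial.human.bam", "genome_fas": "hg19_all_chr.fas",
    "prediction_training_set": "models/euk_trna_mods.Rdata",
    "output_folder": "HAMR_out", "output_prefix": "brain",
    "min_qual": 30, "min_cov": 10, "seq_err": 0.05, "hypothesis": "H4",
    "max_p": 0.01, "max_fdr": 0.05, "refpercent": 0.05,
}

if __name__ == "__main__":
    cmd = hamr_command(EXAMPLE, filter_ends=True)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the HAMR program directory once the parameters have been reviewed.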

3.6. Running HAMR Step-by-Step

Besides running HAMR in a genome-wide or target-bed batch mode, the HAMR pipeline (see Fig. 2) can also be run step-by-step.

3.6.1. HAMR-Step-1: RNA Pileup

Description: this step produces per-position read pileup for a given BAM file containing mapped reads.
Usage:
```
./rnapileup <align.bam> <genome.fa> > <pileup.raw>
```
<align.bam>: sorted BAM file (see Subheading 3.4) containing uniquely-mapped reads and continuous, un-spliced alignments

<genome.fa>: indexed reference genome (FASTA file)

<pileup.raw>: output file in pileup format

Example:

./rnapileup trial.human.bam hg19_all_chr.fas > trial.human.bam.pileup

Sample RNA pileup output:

This output is in a pileup format (see http://samtools.sourceforge.net/pileup.shtml for details) and contains chromosome, physical position, reference nucleotide, the number of reads covering the site, read nucleotides, base qualities. In the read nucleotide column:

^~	indicates 5′ (start) of the read segment
.	indicates a match between the read nucleotide and reference nucleotide on the forward strand
,	indicates a match between the read nucleotide and reference nucleotide on the reverse strand
[ACGT]	indicates a mismatch (non-reference read nucleotide) on the forward strand
[acgt]	indicates a mismatch (non-reference nucleotide) on the reverse strand
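To make the read nucleotide column concrete, the sketch below tallies matches and mismatches per strand from one such string. It assumes the rnapileup output follows the samtools pileup conventions referenced above, including the mapping-quality character after "^", the "$" read-end marker, and "+N"/"-N" indel runs; the function name `count_read_bases` is hypothetical.

```python
import re
from collections import Counter

_INDEL = re.compile(r"[+-](\d+)")


def count_read_bases(bases):
    """Tally matches and mismatches per strand in a pileup read-base string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read start marker plus the mapping-quality character
            continue
        if c in "+-":
            m = _INDEL.match(bases, i)
            # Skip the length digits and the inserted/deleted sequence itself.
            i = m.end() + int(m.group(1)) if m else i + 1
            continue
        if c == ".":
            counts["match_fwd"] += 1
        elif c == ",":
            counts["match_rev"] += 1
        elif c in "ACGT":
            counts["mismatch_fwd"] += 1
        elif c in "acgt":
            counts["mismatch_rev"] += 1
        # "$" (read end) and any other symbols are skipped without counting.
        i += 1
    return counts


if __name__ == "__main__":
    print(count_read_bases("^~.,,A.g$+2AC."))
```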

3.6.2. HAMR-Step-2: Filter Pileup

Description: this step removes low-quality reads (i.e., reads with low base calling quality) from pileups and (optionally) filters reads whose 5′ or 3′ ends coincide with the predicted pileup position.
Usage:
```
./filter_pileup <pileup.raw> <minQ> <filter_ends=0|1> > <pileup.filtered>
```
This command reflects the following required parameters:

<pileup.raw>: raw pileup file from step 1

<minQ>: minimum base calling quality (suggested 30)

<filter_ends=0|1>: read end filtering option
Example:
```
./filter_pileup trial.human.bam.pileup 30 1 > trial.human.bam.pileup.filtered
```
The output <pileup.filtered> has the same format as <pileup.raw>.
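The quality filter can be pictured as follows. This sketch is not the filter_pileup program: it assumes Phred+33 encoded base qualities and that the read-base string has already been stripped of start, end, and indel markers so that it aligns one-to-one with the quality string; read-end filtering is not shown. The name `filter_by_quality` is hypothetical.

```python
def filter_by_quality(bases, quals, min_q=30, offset=33):
    """Keep only the bases whose Phred quality is at least `min_q`.

    `bases` and `quals` must have equal length; `offset=33` assumes the
    Sanger / Illumina 1.8+ quality encoding.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    kept = [(b, q) for b, q in zip(bases, quals) if ord(q) - offset >= min_q]
    return "".join(b for b, _ in kept), "".join(q for _, q in kept)


if __name__ == "__main__":
    # 'I' encodes Q40, '#' encodes Q2, '?' encodes Q30 under Phred+33.
    print(filter_by_quality(".,A", "I#?"))
```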
Sample output after filtering RNA pileup:

3.6.3. HAMR-Step-3: Filter Out Pileups with Below Minimum Coverage

Description: this step filters out pileups with below minimum coverage.
Usage:
```
awk '$4>=<min_cov>' <pileup.filtered> > <pileup.filtered.min_cov>
```
This command reflects the following required parameters:

<min_cov>: minimum coverage

Example:

awk '$4>=10' trial.human.bam.pileup.filtered > trial.human.bam.pileup.filtered.min10
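For readers less familiar with awk, the same coverage filter can be written in Python. This is an equivalent sketch rather than part of HAMR; it assumes, as the awk command does, that the fourth whitespace-separated column holds the number of reads covering the site. The function name `filter_min_coverage` is hypothetical.

```python
import sys


def filter_min_coverage(lines, min_cov=10):
    """Yield pileup lines whose 4th column (read coverage) is >= min_cov."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and int(fields[3]) >= min_cov:
            yield line


if __name__ == "__main__":
    # Reads a filtered pileup on stdin and writes passing lines to stdout.
    sys.stdout.writelines(filter_min_coverage(sys.stdin, min_cov=10))
```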

Sample output of pileups filtered by read coverage:

3.6.4. HAMR-Step-4: Convert Filtered Pileups to BED Format

Description: this step converts filtered pileups to BED format.

Usage:

./rnapileup2mismatchbed <filtered.min_cov> > <nuc2nuc.BED>

Example:

./rnapileup2mismatchbed trial.human.bam.pileup.filtered.min10 > trial.human.bam.mismatch.bed

Output is sorted by genomic coordinates.
The output BED file contains the number of observed reference-to-read nucleotide transitions per candidate site.
Sample read nucleotide mismatch BED output

This output contains the following information:

Fourth column = type of mismatch (A>T, A>G, A>C, etc.) or match (A>., T>., C>., G>.) between the reference and read nucleotide

Fifth column = first number is the total number of reads with match or mismatch event, followed by the detailed list of positions within the reads where the event is observed

3.6.5. HAMR-Step-5: Convert BED to HAMR-FREQ-TABLE

Description: this step converts BED to HAMR-FREQ-TABLE while keeping only positions with >0 non-reference nucleotides, and filters the nucleotide frequency table by removing sites (rows) with insufficient sequencing-based information.

Usage:

bash mismatchbed2table.sh <nuc2nuc.BED> > <HAMR-FREQ-TABLE>

Example:

bash mismatchbed2table.sh trial.human.bam.mismatch.bed > trial.human.bam.freq.table

Sample output of read nucleotide frequencies:
Additional filtering of sites based on observed read nucleotide counts is imposed, including:

minimum number of reads with non-reference nucleotide [=10]

minimum number of reference reads [=10]

minimum ratio of non-reference/reference [=1 %]

Usage:

awk '{cov=$5+$6+$7+$8; nonref=$9; ref=cov-nonref; if (ref/cov>=<refpercent>) print;}' <HAMR-FREQ-TABLE> > <HAMR-FREQ-TABLE-FINAL>

This command reflects the following required parameters:

<refpercent> = Minimum percentage of reads with reference nucleotide

Example:

awk '{cov=$5+$6+$7+$8; nonref=$9; ref=cov-nonref; if (ref/cov>=0.05) print;}' trial.human.bam.freq.table > trial.human.bam.freq.final.table

The output table contains the HAMR frequency information for the finalized set of candidate sites with sufficient sequencing support.
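The reference-fraction filter applied by the awk command can be expressed as a small Python function. The sketch assumes the column layout implied by that command (columns 5 to 8 hold the four per-nucleotide read counts and column 9 holds the non-reference count); the contents of columns 1 to 4 in the example rows are hypothetical placeholders.

```python
def passes_ref_fraction(fields, refpercent=0.05):
    """Return True if the fraction of reference reads is >= refpercent.

    `fields` is one row of the frequency table split into columns.
    """
    coverage = sum(int(x) for x in fields[4:8])  # columns 5-8: nucleotide counts
    nonref = int(fields[8])                      # column 9: non-reference reads
    if coverage == 0:
        return False  # avoids the division by zero that awk would report
    return (coverage - nonref) / coverage >= refpercent


if __name__ == "__main__":
    # Hypothetical rows; only columns 5-9 matter for this filter.
    rows = [
        ["chr1", "100", "+", "A", "60", "0", "25", "15", "40"],
        ["chr1", "200", "+", "A", "2", "0", "98", "0", "98"],
    ]
    for row in rows:
        print(row[1], passes_ref_fraction(row))
```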

3.6.6. HAMR-Step-6: Finding Modification Sites Based on Statistical Testing

Description: this step finds modification sites based on significant deviations of the observed nucleotide frequencies from the expected homozygous or biallelic distributions.
Usage:
```
Rscript detect_mods.R <HAMR-NUC-TABLE> <EXPECTED-SEQ-ERROR-RATE> <HYPOTHESIS-TYPE> <P-VALUE-THRESHOLD> <Q-VALUE-THRESHOLD> > <RAW-MOD-TABLE>
```
This command reflects the following required parameters:

<HAMR-NUC-TABLE> = HAMR frequency table from step 5

<EXPECTED-SEQ-ERROR-RATE> = expected sequencing error rate / mismatch rate, suggested 0.01–0.05 (see Note 18)

<HYPOTHESIS-TYPE> = Statistical test to be performed, H4 is recommended (see Note 19)

<P-VALUE-THRESHOLD> = Maximum p-value (0.01 is recommended)

<Q-VALUE-THRESHOLD> = Maximum FDR with replacement

Example:

Rscript detect_mods.R trial.human.bam.freq.final.table 0.05 H4 0.01 0.05 > trial.human.bam.raw.txt
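The p-value and FDR thresholds act together: a site is reported only if it passes both. The chapter does not state which FDR procedure detect_mods.R applies, so the sketch below uses the Benjamini-Hochberg adjustment purely as a common, assumed choice; the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), input order kept."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def select_sites(pvalues, max_p=0.01, max_fdr=0.05):
    """Indices of sites with p <= max_p and q <= max_fdr."""
    qvalues = benjamini_hochberg(pvalues)
    return [i for i, (p, q) in enumerate(zip(pvalues, qvalues))
            if p <= max_p and q <= max_fdr]


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-site p-values
    print(benjamini_hochberg(pvals))
    print(select_sites(pvals))
```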

Sample output of HAMR predicted modification sites:

3.6.7. HAMR-Step-7: Predicting Modification Identity

Description: predict the identity of modifications for each of the modification sites (see Note 20).
Usage:
```
Rscript classify_mods.R <TABLE-significant.sites.only> <training.set.Rdata> > <TABLE-with-predicted-mod-types>
```
This command reflects the following required parameters:

<TABLE-significant.sites.only> = table containing the detected modification sites that are statistically significant

<training.set.Rdata> = prediction training set (e.g., the training set models/euk_trna_mods.Rdata provided in HAMR, which consists of mismatch patterns observed at confirmed modification sites in tRNAs)

Example:

Rscript classify_mods.R trial.human.bam.raw.txt models/euk_trna_mods.Rdata > trial.human.bam.mods.txt

Sample output of HAMR modification sites and predicted modification types:

3.7. Using HAMR Webserver

HAMR is also available as a web-based application, which can be accessed at http://lisanwanglab.org/hamr/.

The web interface allows specification of a remote, indexed BAM file and BED file with targeted intervals for querying. The user may specify parameters for the preprocessing steps, such as minimum base call quality score, minimum read coverage at a site, assumed sequencing error rate, and significance level. Additionally, the user may use the webserver to predict the modification type based on mismatch patterns.

3.8. Application of HAMR to Existing Human Small RNA-Seq Data

Here, we present the application of HAMR to Illumina small RNA-seq data generated from human brain tissue (GSE48552, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48552). With a few preprocessing steps prior to HAMR, users can run a single command, and all information about modification sites and types will be summarized in a table.

Download the SRA file from SRA database (http://www.ncbi.nlm.nih.gov/geo/):

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP026/SRP026562/SRR1103937/SRR1103937.sra -O brainpfc.sra

Build genomic index:

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir genomes/star/hg19 --genomeFastaFiles hg19_all_chr.fas

Convert SRA to FASTQ file using SRAtoolkit:
```
fastq-dump --split-3 brainpfc.sra
```

Trim adapter sequences from reads (Illumina v1.5 small RNA 3′ adapter in this case):

cutadapt brainpfc.fastq -a ATCTCGTATGCCGTCTTCTGCTTG -e 0.06 -O 6 -m 14 --too-short-output=brainpfc.tooshort.fastq --untrimmed-output=brainpfc.untrimmed.fastq --length-tag="length=" -o brainpfc.trimmed.fastq

Map trimmed reads to the human reference genome hg19 using STAR:

mkdir -p STAR_brainpfc
STAR --genomeDir genomes/star/hg19 --genomeLoad LoadAndKeep --readFilesIn brainpfc.trimmed.fastq --runThreadN 4 --alignIntronMax 1 --outSAMattributes NH HI NM MD --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outFilterMismatchNmax 2 --outFilterMatchNmin 15 --outFileNamePrefix STAR_brainpfc/

Sort and index output BAM:

samtools sort -@ 8 -O bam -T Aligned.out.sort STAR_brainpfc/Aligned.out.sam > STAR_brainpfc/Aligned.out.sorted.bam
samtools index STAR_brainpfc/Aligned.out.sorted.bam

Perform modification analysis with HAMR:

python hamr.py STAR_brainpfc/Aligned.out.sorted.bam genomes/hg19.fa models/euk_trna_mods.Rdata HAMR_results brainpfc 30 10 0.05 H4 0.01 0.05 0.05 --filter_ends

HAMR_results/brainpfc.mods.txt will contain a list of RNA modifications found by HAMR.

Acknowledgments

This work is supported by the National Institute of General Medical Sciences [R01-GM099962 to P.P.K., Y.Y.L., B.D.G., and L.S.W.], the National Institute on Aging [U24-AG041689 to L.S.W.], and the National Science Foundation [CAREER Award MCB-1053846, MCB-1243947, and IOS-1444490 to B.D.G.]. We thank Alexandre Amlie-Wolf and other members of the Wang and Gregory labs for their comments and help with this work.

Abbreviations

3′: 3-prime (3′)
5′: 5-prime (5′)
bp: Base pair
cDNA: Complementary DNA
HAMR: High-throughput annotation of modified ribonucleotides
mRNA: Messenger RNA
mRNA-seq: messenger RNA sequencing
nt: Nucleotide
RNA-seq: RNA sequencing
RT: Reverse transcriptase
smRNA: Small RNA
smRNA-seq: Small RNA sequencing
SNP: single nucleotide polymorphism
tRNA: Transfer RNA

4. Notes

1. These programs are readily available in popular Linux distributions such as Ubuntu, CentOS, or Debian.

2. You can check the version of Python by typing python --version on the command line.

3. The simplest way to install programs from source is to build them in your own directory. Download the appropriate package from the source websites (see Subheading 2.2 on software requirements). Extract the files and check the readme files for installation instructions and a list of required dependencies.

4. HAMR is packaged as compiled “binaries” that can be run by typing in the program’s file path. On certain systems, however, HAMR needs to be recompiled. To do so, run make clean, then make.

5. Arabidopsis genome FASTA file can be downloaded here: ftp://ftp.ensemblgenomes.org/pub/release-25/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.25.dna.genome.fa.gz

6. Make sure to index the genome FASTA file before running HAMR for faster execution.

7. Untrimmed reads are likely to correspond to larger RNA fragments that are outside the small RNA population.

8. If no mismatches are allowed, HAMR is not capable of detecting any modifications.

9. Soft clipping allows mapping of reads that contain untrimmed partial adapters at the 3′ end (e.g., shorter than the minimum overlap length), partial exon-exon junctions, etc.

10. The HAMR program handles continuous read alignments; spliced reads require additional preprocessing.

11. BAM files need to be sorted by chromosome and start position, not by read name.

12. The tRNAs table in the UCSC genome browser (hg19) can be downloaded here: http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=482380859_VTSVW7eMgPDQJ0DdDzlKALAe02bM&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=tRNAs&hgta_table=0&hgta_regionType=genome&position=chr1%3A1103243--1103332&hgta_outputType=primaryTable&hgta_outFile-Name=.

13. tRNAscan-SE can be downloaded and installed from http://gtrnadb.ucsc.edu/.

14. dbSNP can be downloaded from http://www.ncbi.nlm.nih.gov/projects/SNP/.

15. tRNA modification information from MODOMICS is chosen as this is by far the best documented list of RNA modification sites. Data from MODOMICS is downloadable at http://modomics.genesilico.pl/downloads/.

16. See https://genome.ucsc.edu/FAQ/FAQformat.html#format1 for a description of the BED file format.

17. Filter out genomic positions with an insufficient number (e.g., fewer than 10) of non-reference nucleotide counts. Such positions do not provide sufficient evidence for HAMR to reliably establish the mismatch pattern necessary to detect RNA modifications.

18. Filter out genomic positions with a non-reference/reference read count ratio that is less than, e.g., 1 %. Such positions are likely to be attributable to sequencing errors, not modification-related mismatches.

19. HAMR tests for homozygous, bi-allelic, and multi-allelic loci; see [20] for more information on statistical testing and hypotheses.

20. For a list of HAMR-detectable modification types, refer to Supplementary Table 1 in [20].

Contributor Information

Pavel P. Kuksa, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

Yuk-Yee Leung, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

Lee E. Vandivier, Department of Biology, University of Pennsylvania, Philadelphia, PA, USA

Zachary Anderson, Department of Biology, University of Pennsylvania, Philadelphia, PA, USA

Brian D. Gregory, Department of Biology, University of Pennsylvania, Philadelphia, PA, USA

Li-San Wang, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

References

1. Dominissini D, Nachtergaele S, Moshitch-Moshkovitz S et al. (2016) The dynamic N(1)-methyladenosine methylome in eukaryotic messenger RNA. Nature 530:441–446
2. Eigenbrod T, Keller P, Kaiser S et al. (2015) Recognition of specified RNA modifications by the innate immune system. Methods Enzymol 560:73–89
3. Li S, Mason CE (2014) The pivotal regulatory landscape of RNA modifications. Annu Rev Genomics Hum Genet 15:127–150
4. Lee M, Kim B, Kim VN (2014) Emerging roles of RNA modification: m(6)A and U-tail. Cell 158:980–987
5. Satterlee JS, Basanta-Sanchez M, Blanco S et al. (2014) Novel RNA modifications in the nervous system: form and function. J Neurosci 34:15170–15177
6. Delatte B, Wang F, Ngoc LV et al. (2016) Transcriptome-wide distribution and function of RNA hydroxymethylcytosine. Science 351:282–285
7. Sundaram M, Durant PC, Davis DR (2000) Hypermodified nucleosides in the anticodon of tRNALys stabilize a canonical U-turn structure. Biochemistry 39:12575–12584
8. Kierzek E, Malgowska M, Lisowiec J, Turner DH, Gdaniec Z, Kierzek R (2014) The contribution of pseudouridine to stabilities and structure of RNAs. Nucleic Acids Res 42:3492–3501
9. Schwartz S, Mumbach MR, Jovanovic M et al. (2014) Perturbation of m6A writers reveals two distinct classes of mRNA methylation at internal and 5′ sites. Cell Rep 8:284–296
10. Meyer KD, Jaffrey SR (2014) The dynamic epitranscriptome: N6-methyladenosine and gene expression control. Nat Rev Mol Cell Biol 15:313–326
11. Karijolich J, Yu YT (2015) The new era of RNA modification. RNA 21:659–660
12. Sun WJ, Li JH, Liu S et al. (2016) RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic Acids Res 44:D259–D265
13. Schwartz S, Bernstein DA, Mumbach MR, Jovanovic M, Herbst RH, León-Ricardo BX, Engreitz JM, Guttman M, Satija R, Lander ES, Fink G, Regev A (2014) Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell 159:148–162
14. Vandivier LE, Campos R, Kuksa PP et al. (2015) Chemical modifications mark alternatively spliced and uncapped messenger RNAs in Arabidopsis. Plant Cell 27:3024–3037
15. Meyer KD, Saletore Y, Zumbo P et al. (2012) Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell 149:1635–1646
16. Gupta RC, Randerath K (1977) Use of specific endonuclease cleavage in RNA sequencing. Nucleic Acids Res 4:1957–1978
17. Woodson SA, Muller JG, Burrows CJ et al. (1993) A primer extension assay for modification of guanine by Ni(II) complexes. Nucleic Acids Res 21:5524–5525
18. Motorin Y, Muller S, Behm-Ansmant I, Branlant C (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol 425:21–53
19. Behm-Ansmant I, Helm M, Motorin Y (2011) Use of specific chemical reagents for detection of modified nucleotides in RNA. J Nucleic Acids 2011:408053
20. Ryvkin P, Leung YY, Silverman IM et al. (2013) HAMR: high-throughput annotation of modified ribonucleotides. RNA 19:1684–1692
21. Dominissini D, Moshitch-Moshkovitz S, Schwartz S et al. (2012) Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 485:201–206
22. Squires JE, Patel HR, Nousch M et al. (2012) Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA. Nucleic Acids Res 40:5023–5033
23. Horowitz S, Horowitz A, Nilsen TW et al. (1984) Mapping of N6-methyladenosine residues in bovine prolactin mRNA. Proc Natl Acad Sci U S A 81:5667–5671
24. Li H, Handsaker B, Wysoker A et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
25. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17(1):10–12
26. Dobin A, Davis CA, Schlesinger F et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
27. Machnicka MA, Milanowska K, Osman OO et al. (2013) MODOMICS: a database of RNA modification pathways: 2012 update. Nucleic Acids Res 41:D262–D267
28. Leung YY, Ryvkin P, Ungar LH et al. (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res 41:e137
