Abstract
Motivation
RNA sequences of a gene can have single nucleotide variants (SNVs) due to single nucleotide polymorphisms (SNPs) in the genome, or RNA editing events within the RNA. By comparing RNA-seq data of a given cell type before and after a specific perturbation, we can detect and quantify SNVs in the RNA and discover SNVs with altered frequencies between distinct cellular states. Such differential variants in RNA (DVRs) may reflect allele-specific changes in gene expression or RNA processing, as well as changes in RNA editing in response to cellular perturbations or stimuli.
Results
We have developed rMATS-DVR, a convenient and user-friendly software program to streamline the discovery of DVRs between two RNA-seq sample groups with replicates. rMATS-DVR combines a stringent GATK-based pipeline for calling SNVs including SNPs and RNA editing events in RNA-seq reads, with our rigorous rMATS statistical model for identifying differential isoform ratios using RNA-seq sequence count data with replicates. We applied rMATS-DVR to RNA-seq data of the human chronic myeloid leukemia cell line K562 in response to shRNA knockdown of the RNA editing enzyme ADAR1. rMATS-DVR discovered 1372 significant DVRs between knockdown and control. These DVRs encompassed known SNPs and RNA editing sites as well as novel SNVs, with the majority of DVRs corresponding to known RNA editing sites repressed after ADAR1 knockdown.
Availability and Implementation
rMATS-DVR is available at https://github.com/Xinglab/rMATS-DVR.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
RNAs transcribed from a single gene may contain single nucleotide variants (SNVs) due to single nucleotide polymorphisms (SNPs) in the genome, or RNA editing events within the RNA. Using RNA sequencing (RNA-seq), we can discover SNVs in RNA by comparing RNA-seq reads to the genome sequence. Since comparing the transcriptome profiles of a given cell type before and after a perturbation is a widely used RNA-seq study design, a valuable and increasingly popular type of RNA-seq data analysis is to quantify and contrast the levels of SNVs in RNA-seq reads among distinct cellular states. Numerous RNA-seq studies have globally identified RNA editing sites with altered editing levels in response to perturbation (Nishikura, 2016). In addition, altered allelic ratios of genomic variants (e.g. SNPs) between RNA-seq samples with an identical genetic background can reveal allele-specific changes in gene expression or RNA processing after perturbations.
Here we report rMATS-DVR, a new computational tool that combines comprehensive identification of SNVs with robust discovery of differential variants in RNA (DVRs), i.e. SNVs with altered frequencies between two RNA-seq sample groups with replicates. rMATS-DVR implements a GATK (Genome Analysis Toolkit) (McKenna et al., 2010) based pipeline with stringent guidelines and filters to call SNVs including SNPs and RNA editing events in RNA-seq reads (Lee et al., 2013; Piskol et al., 2013). It then applies our widely used and rigorous rMATS (replicate Multivariate Analysis of Transcript Splicing) statistical model for differential isoform analysis (Shen et al., 2014) to identify DVRs using RNA-seq read counts of SNVs in replicate RNA-seq data. Specifically, rMATS uses a generalized linear mixed model (GLMM) to simultaneously account for two sources of variation: the RNA-seq estimation uncertainty in the mRNA isoform ratios, as influenced by sequencing coverage in individual samples, and the variability in isoform ratios among replicates (Shen et al., 2014). Although initially developed for identifying differential alternative splicing, the rMATS statistical model is generic and can be applied to RNA-seq count data on SNPs and RNA editing sites.
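As a rough sketch of how such a hierarchical model can be written for a single SNV, the two sources of variation described above correspond to a binomial sampling layer and a replicate-level random effect. The notation here is illustrative and not taken from the software: A and R denote alternative and reference read counts, j indexes the sample group, k indexes the replicate, and c is a user-chosen difference threshold. It also assumes that, for a single-nucleotide site, no length normalization is needed between the two alleles.

```latex
\begin{aligned}
A_{jk} \mid \psi_{jk} &\sim \mathrm{Binomial}\left(n_{jk} = A_{jk} + R_{jk},\; p = \psi_{jk}\right), \\
\mathrm{logit}(\psi_{jk}) &\sim \mathrm{Normal}\left(\mathrm{logit}(\psi_{j}),\; \sigma_{j}^{2}\right), \\
H_0 &: |\psi_{1} - \psi_{2}| \le c \quad \text{versus} \quad H_1 : |\psi_{1} - \psi_{2}| > c .
\end{aligned}
```

Under this reading, low coverage (small n) widens the binomial layer, while inconsistent replicates inflate the group-level variance; both reduce the evidence for a difference between groups.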
2 Materials and methods
rMATS-DVR is a single command line program that takes RNA-seq alignment files (.bam files) as input. The major steps of rMATS-DVR are shown in Figure 1A. RNA-seq alignments are subject to sorting, adding read groups, and removal of PCR duplicates by Picard (https://broadinstitute.github.io/picard/). Then rMATS-DVR uses GATK (McKenna et al., 2010) for splitting reads with ‘N’ CIGAR operations (i.e. splice junction reads) and mapping quality reassignment (program: SplitNCigarReads), base quality score recalibration (program: BaseRecalibrator), and variant discovery across all RNA-seq samples (program: UnifiedGenotyper). For the identified variants, rMATS-DVR uses Samtools (Li et al., 2009) (program: mpileup) to count the reads supporting the reference and alternative nucleotides. Next, the rMATS statistical model (Shen et al., 2014) is used to calculate the P values and FDRs (False Discovery Rates) for DVRs between the two sample groups. Finally, all the SNVs and DVRs are annotated for locations within genes, matches to known SNPs in dbSNP (Sherry et al., 2001), matches to known RNA editing sites in the RADAR database (Ramaswami and Li, 2014) and overlap with repeats (http://www.repeatmasker.org/).
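The counting and multiple-testing steps above can be illustrated with a minimal Python sketch. It is not the rMATS-DVR implementation: the function names and read counts are hypothetical, and the Benjamini-Hochberg procedure is assumed here as one common way to turn P values into FDRs, since the text does not specify the correction method.

```python
from typing import List, Optional, Sequence


def variant_ratio(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Fraction of reads carrying the alternative nucleotide at one site."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage: the ratio is undefined
    return alt_reads / total


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values (an assumed FDR procedure)."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (reference, alternative) read counts for one site
# in three replicates per group.
control = [(40, 60), (35, 65), (50, 50)]
knockdown = [(90, 10), (85, 15), (95, 5)]

control_ratios = [variant_ratio(r, a) for r, a in control]
knockdown_ratios = [variant_ratio(r, a) for r, a in knockdown]

# Hypothetical per-site P values from a differential test.
fdrs = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Sites whose adjusted value falls at or below the chosen cutoff would then be reported as DVRs.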
Fig. 1.
(A) Major steps of rMATS-DVR. (B) Classifications of DVRs into known SNPs, known RNA editing sites, and novel variants in the ADAR1 knockdown RNA-seq data. The variants are denoted as the reference → alternative nucleotide on the sense RNA strand. (C) Barplot showing the allelic ratios of a SNP-type DVR (rs143204328; C→T on the RNA sense strand) in control and ADAR1 knockdown samples. The error bar indicates the standard deviation based on the binomial distribution.
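For readers who want to reproduce an error bar of the kind described for panel C, the sketch below computes a binomial standard deviation for an allelic ratio. The caption does not give the formula, so this assumes the usual standard deviation of an estimated proportion, sqrt(p(1 − p)/n); the function name is hypothetical, and whether reads are pooled across replicates is not stated in the text.

```python
import math


def binomial_sd_of_ratio(ref_reads: int, alt_reads: int) -> float:
    """Standard deviation of the estimated allelic ratio alt / (ref + alt),
    assuming the alternative read count is binomially distributed."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("at least one read is required")
    p = alt_reads / n
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical counts: 50 reference and 50 alternative reads.
sd = binomial_sd_of_ratio(50, 50)
```

The error bar narrows as coverage increases, because n appears in the denominator.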
3 Results
As ADAR1 (adenosine deaminase, RNA specific) encodes the major A-to-I RNA editing enzyme, shRNA knockdown of ADAR1 decreases RNA editing levels (Nishikura, 2016). We applied rMATS-DVR to RNA-seq data of the K562 cell line with ADAR1 knockdown and control (see Supplementary Materials). We selected the K562 cell line because it has a high level of endogenous ADAR1 expression and RNA-seq data from ADAR1 knockdown and control cells are available from ENCODE. We discovered 49 507 SNVs from the RNA-seq data, among which 1372 were DVRs between knockdown and control (FDR ≤ 5%). These DVRs included 1052 known RNA editing sites, 135 known SNPs and 185 novel variants (Fig. 1B). Not surprisingly, over 75% of the DVRs were known RNA editing sites, and all of these RNA editing sites had decreased RNA editing levels upon ADAR1 knockdown. The majority of the 185 DVRs corresponding to novel variants were ‘A’ in the reference genome and ‘G’ in the RNA, suggesting that they were likely novel A-to-I RNA editing sites. The 135 DVRs corresponding to known SNPs may reflect allele-specific changes in RNA processing or steady-state mRNA level upon ADAR1 knockdown, due to editing-independent functions of ADAR1 in microRNA targeting, pre-mRNA splicing and 3’ end processing, or mRNA stability (Bahn et al., 2015; Wang et al., 2013). For example, as shown in Figure 1C, ADAR1 knockdown caused a significant shift (FDR = 1.4 × 10⁻⁵) in the allelic ratio of a known SNP, rs143204328, in the RNA-seq reads, suggesting a candidate SNP that may mediate allele-specific regulation of the RNA by ADAR1.
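The reported class counts can be checked with a few lines of Python. The numbers are those given above; the variable names are only illustrative.

```python
# DVR counts by annotation class, as reported in the text.
dvr_classes = {
    "known RNA editing sites": 1052,
    "known SNPs": 135,
    "novel variants": 185,
}
total_snvs = 49507

total_dvrs = sum(dvr_classes.values())

# Percentage of all DVRs that falls into each class.
percent_of_dvrs = {
    name: round(100.0 * count / total_dvrs, 1)
    for name, count in dvr_classes.items()
}

# Percentage of all called SNVs that were significant DVRs.
percent_dvr_among_snvs = round(100.0 * total_dvrs / total_snvs, 1)
```

The three classes sum to the 1372 DVRs, and known RNA editing sites account for about 76.7% of them, consistent with the "over 75%" stated above.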
4 Conclusions
The rMATS-DVR software is designed for analyzing and comparing two groups of RNA-seq samples with an identical genetic background, e.g. a given cell type before and after a specific cellular perturbation or environmental stimulus. We anticipate that rMATS-DVR will become an effective and widely used tool for the detection and quantitative analysis of SNVs in diverse RNA-seq research projects.
Funding
National Institutes of Health (R01GM088342 & R01GM117624 to Y.X.).
Conflict of Interest: none declared.
References
- Bahn J.H. et al. (2015) Genomic analysis of ADAR1 binding and its involvement in multiple RNA processing pathways. Nat. Commun., 6, 6355.
- Lee J.H. et al. (2013) Analysis and design of RNA sequencing experiments for identifying RNA editing and other single-nucleotide variants. RNA, 19, 725–732.
- Li H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
- McKenna A. et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
- Nishikura K. (2016) A-to-I editing of coding and non-coding RNAs by ADARs. Nat. Rev. Mol. Cell Biol., 17, 83–96.
- Piskol R. et al. (2013) Reliable identification of genomic variants from RNA-seq data. Am. J. Hum. Genet., 93, 641–651.
- Ramaswami G., Li J.B. (2014) RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res., 42, D109–D113.
- Shen S. et al. (2014) rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl. Acad. Sci. U. S. A., 111, E5593–E5601.
- Sherry S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
- Wang I.X. et al. (2013) ADAR regulates RNA editing, transcript stability, and gene expression. Cell Rep., 5, 849–860.