Author manuscript; available in PMC: 2018 Nov 28.
Published in final edited form as: Methods Mol Biol. 2018;1724:27–41. doi: 10.1007/978-1-4939-7562-4_3

Genome-Wide circRNA Profiling from RNA-seq Data

Daphne A Cooper 1, Mariela Cortés-López 2, Pedro Miura 3
PMCID: PMC6261425  NIHMSID: NIHMS992255  PMID: 29322438

Abstract

The genome-wide expression patterns of circular RNAs (circRNAs) are of increasing interest for their potential roles in normal cellular homeostasis, development, and disease. Thousands of circRNAs have been annotated from various species in recent years. Analysis of publicly available or user-generated rRNA-depleted total RNA-seq data can be performed to uncover new circRNA expression trends. Here we provide a primer for profiling circRNAs from RNA-seq datasets. The description is tailored for the wet lab scientist with limited or no experience in analyzing RNA-seq data. We begin by describing how to access and interpret circRNA annotations. Next, we cover converting circRNA annotations into junction sequences that are used as scaffolds to align RNA-seq reads. Lastly, we cover quantifying circRNA expression trends from the alignment data.

Keywords: circRNA, Circular RNAs, Expression analysis, Ribo-depleted total RNA-seq

1. Introduction

Detection of circRNAs in RNA-seq data is limited to stranded, ribosomal RNA (rRNA/ribo)-depleted total RNA-seq, as opposed to library preparation protocols requiring enrichment of polyadenylated RNAs. Total RNA-seq libraries are becoming increasingly common since they detect a variety of RNAs, including both mRNAs and non-polyadenylated RNAs. Over the past few years, thousands of circRNAs have been annotated in humans, mice, fish, insects, plants, and yeast from total RNA-seq data, using algorithms designed to detect “out-of-order” splicing [1–12]. These analysis pipelines include find_circ [2], circRNA_finder [8], CIRCexplorer [13], and CIRI [14], among others [15, 16]. Most of these algorithms and analysis pipelines can detect exonic, intronic, and intergenic circRNAs, whereas CIRCfinder is a pipeline designed to exclusively detect intronic circRNAs [17]. Despite thousands of existing circRNA annotations, novel circRNAs continue to be annotated, and the user should consider that specific tissues or cell types of interest might express novel circRNAs, in which case one of the previously mentioned circRNA detection algorithms is needed.

Total RNA-seq datasets analyzed for circRNA expression trends generally need to be sequenced to very high depth, because only reads that span a circRNA junction are used in the analysis. CircRNA junction-spanning reads typically comprise less than 0.1% of the reads generated in a total RNA-seq experiment [5].

Here, we provide a primer for profiling circRNAs from RNA-seq datasets using existing circRNA annotations. Table 1 provides a reference for reports and online databases containing circRNA annotations from various organisms. In this example, a circRNA annotation set is used to generate a circRNA junction scaffold Bowtie2 aligner index. Reads that align to the circRNA junction scaffold are counted using featureCounts, and the number of reads that align to each circRNA junction is normalized to the read depth of the library. This normalization allows the direct comparison of circRNA expression between conditions so that differentially expressed circRNAs can be identified.

Table 1.

circRNA annotations

Annotations in reports

Organism | Genome annotation | Conditions | References
Mus musculus | mm9 | Brain; synaptoneurosomes; liver; heart; embryonic stem cell neuronal diff.; P19 cell neuronal diff.; primary neuron diff. | [25]
Homo sapiens | hg19 | Brain; SH-SY5Y cell neuronal diff.; HEK293 cells; CD34+ cells; CD19+ cells; neutrophils; Hs68 cells; DLD-1 cells; DKO-1 cells, DKs-8 cells | [1, 2, 4, 32, 33]
Drosophila melanogaster | dm3 | Heads; S2 cells | [3, 8]
Caenorhabditis elegans | ce6 | Various tissues and developmental stages | [2, 34]

Annotations in online databases

Database name | Web address | Description | References
circBase | www.circbase.org | circRNAs from various tissues, cell types, and developmental stages from human, mouse, fruit fly, C. elegans, and others | [35]
circRNAdb | http://reprod.njmu.edu.cn/circrnadb | Human circRNAs by cell type, protein-coding potential, gene symbol and PubMed ID | [36]
CircNet | http://circnet.mbc.nctu.edu.tw | Human tissue-specific circRNA expression information and miRNA binding potential | [37]
Tissue-specific circRNA database | http://gb.whu.edu.cn/TSCD | Human and mouse circRNA from various tissues | [38]

2. Materials

  1. Computer running Linux or Mac OSX.

  2. UNIX shell (e.g., bash).

  3. Installed open-access software:
    1. bedtools, Bowtie2, and samtools (used in Subheadings 3.1 and 3.2).
    2. featureCounts [22]: subread.sourceforge.net.
  4. rRNA-depleted total RNA-seq data in FASTQ format (user generated or downloaded from repository).

  5. Genome sequence for species of interest in FASTA format.

  6. circRNA annotations for species of interest in BED6 format.

  7. Optional: Integrative Genomics Viewer (IGV) [23, 24].

  8. Optional: SRAtools: github.com/ncbi/sra-tools.

3. Methods

3.1. Generate circRNA Junction Sequences

  1. View contents of a circRNA annotation BED file. Many circRNA annotations are provided as a list of genomic coordinates or as Browser Extensible Data (BED) records that describe the genomic coordinates and other information (https://genome.ucsc.edu/FAQ/FAQformat#format1). The BED format includes a chromosome (chrom), a 0-based start (chromStart) position in base pair units, a stop position (chromStop), a feature name, a score, and strand information (Fig. 1a). The score can provide information about the expression level of a particular circRNA; however, it is often assigned a “0” value. The strand assignment provides information about whether the circRNA is encoded from the sense or antisense DNA strand. Figure 1a illustrates a BED file (circRNA.bed) that contains annotations for two circRNAs (circ_1 and circ_2). The BED annotation is used to generate junction sequences for circ_1 (Fig. 1b) and circ_2 (Fig. 1c). Both circ_1 and circ_2 are 20 nucleotides (nts) in length and have the same genomic coordinates; however, circ_1 is encoded on the sense DNA strand and circ_2 is encoded on the antisense DNA strand to highlight how strand orientation impacts the directionality and sequence of circRNAs. CircRNAs encoded on the sense strand of DNA are assigned a “+” in the sixth “strand” field (Fig. 1a: circ_1), while those encoded on the antisense DNA strand are assigned a “−” (Fig. 1a: circ_2). Genes encoded on the antisense DNA strand run 3´ to 5´ in relation to the sense strand; therefore the chromStop field represents the “start” of the gene, and the chromStart field represents the “stop” of the gene. Because circ_2 is encoded on the antisense DNA strand, it is reverse complemented in relation to circ_1 (Fig. 1c). Note that the 10-nt junction sequences for circ_1 and circ_2 join the last 5 nts with the first 5 nts of the circularizing exon and run in the 5´–3´ direction (Fig. 1b, c; junction sequence).

    For this primer on circRNA expression profiling, we used the first four records in the mouse cortex annotation from Supplemental Table 2 in Gruner et al. (2016) [5], to make cortex_circRNA.bed. These four BED records will be used to generate circRNA junction sequences of 200 nt length in the manner illustrated in Fig. 1.
    # View contents of cortex_circRNA.bed
    $ more cortex_circRNA.bed
    chrX   58436422  58439349   CDR1as           0    +
    chr1   154691165 154691537  mm9_circ_000025  0    
    chr1   16430403  16453275   mm9_circ_000040  0    -
    chr18  51462263  51463167   mm9_circ_000042  0    +
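
Because the strand field decides which end of each interval forms which half of the junction, it can help to confirm that every record has six fields and a valid strand before going further. A record without a sixth field would otherwise be silently dropped by the strand-based split in the next step. The following is a minimal sketch of such a check; the function name check_bed6 and the third toy record are hypothetical and are not part of the original protocol.

```shell
# check_bed6: report BED records that cannot be used to build junctions.
# Reads BED text on stdin; prints one line per problem and a final summary.
check_bed6() {
  awk '
    NF != 6 { printf "line %d: expected 6 fields, found %d\n", NR, NF; bad++; next }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { printf "line %d: non-numeric coordinates\n", NR; bad++; next }
    ($2 + 0) >= ($3 + 0) { printf "line %d: chromStart is not smaller than chromStop\n", NR; bad++; next }
    $6 != "+" && $6 != "-" { printf "line %d: strand must be + or -\n", NR; bad++; next }
    END { printf "%d record(s) checked, %d problem(s)\n", NR, bad + 0 }
  '
}

# Toy input modelled on Fig. 1a, plus one hypothetical record that lacks a strand.
printf 'chr1\t10\t30\tcirc_1\t0\t+\nchr1\t10\t30\tcirc_2\t0\t-\nchr1\t40\t90\tcirc_3\t0\n' | check_bed6
# prints:
# line 3: expected 6 fields, found 5
# 3 record(s) checked, 1 problem(s)
```

On real data the same function could be run as `check_bed6 < cortex_circRNA.bed`.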
    
  2. Split the annotations into circRNAs encoded from the sense (“+”) strand and circRNAs encoded from the antisense (“−”) strand. Write circRNAs encoded on the “+” strand to a file named “sense.cortex_circRNA.bed” and write circRNAs encoded on the “−” strand to a file named “antisense.cortex_circRNA.bed”.
    $ awk '($6 == "+")' cortex_circRNA.bed > sense.cortex_circRNA.bed
    $ awk '($6 == "-")' cortex_circRNA.bed > antisense.cortex_circRNA.bed
    
  3. Modify the BED start and stop coordinates to represent half the length of the desired junction scaffold sequence for the “left” and “right” sides of the circRNA junction (Fig. 1b, c). For this example, we generate a junction of 200 nts in length to accommodate RNA-seq reads of 125 nts in length [5]. Under these conditions, any 125-nt read that aligns end-to-end with Bowtie2 must extend at least 25 nts across the circRNA junction. RNA-seq data generated from the Illumina platform will generally range in length from 50 to 150 nts. When using real RNA-seq data, consider the junction overlap, and choose a junction scaffold length that will accommodate the read length of your RNA-seq data.

    The desired junction scaffold length for this example is 200 nts; therefore, the start and stop coordinates reported in the modified BED annotation for the “left” (Fig. 1b, c; blue arrows) and “right” (Fig. 1b, c; red arrows) sides of the junction will each span 100 nts (half of the total junction scaffold length). The 100-nt “left” and “right” sequences will be joined to form the out-of-order circRNA junction scaffold sequence, similar to the junctions depicted in Fig. 1b, c. For circRNAs encoded on the “+” strand, the “chromStop” position is used to generate the “left” side of the circRNA junction by subtracting 100 nts from the chromStop position, and then writing a new BED record to a file called “l_junc.sense.cortex_circRNA.bed” to represent the 100-nt left half of the junction. For the right side of the junction, 100 nts are added to the chromStart position, and the BED record is written to a file called “r_junc.sense.cortex_circRNA.bed” to represent the 100-nt right half of the junction. For circRNAs encoded on the “−” strand, the “chromStart” position is used to generate the “left” side of the circRNA junction by adding 100 nts to the chromStart position, and then writing the BED record to a file called “l_junc.antisense.cortex_circRNA.bed” to represent the 100-nt left half of the junction. For the right side of the junction, 100 nts are subtracted from the chromStop position, and the BED record is written to a file called “r_junc.antisense.cortex_circRNA.bed” to represent the 100-nt right half of the junction. There are some limitations with capturing circRNAs that are smaller than the desired circRNA junction scaffold length (see Note 1):
    $ awk 'BEGIN {OFS="\t"} {$2=$3-100; print $0}' sense.cortex_circRNA.bed \
    > l_junc.sense.cortex_circRNA.bed
    $ awk 'BEGIN {OFS="\t"} {$3=$2+100; print $0}' sense.cortex_circRNA.bed \
    > r_junc.sense.cortex_circRNA.bed
    $ awk 'BEGIN {OFS="\t"} {$3=$2+100; print $0}' antisense.cortex_circRNA.bed \
    > l_junc.antisense.cortex_circRNA.bed
    $ awk 'BEGIN {OFS="\t"} {$2=$3-100; print $0}' antisense.cortex_circRNA.bed \
    > r_junc.antisense.cortex_circRNA.bed
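
The coordinate arithmetic of this step is easier to follow on the toy circRNAs of Fig. 1 (chr1, 10 to 30) with 5-nt halves. The sketch below handles both strands in one pass and labels each half; the function name junction_halves and the extra seventh column are illustrative assumptions, not part of the original commands.

```shell
# junction_halves HALF: read BED6 on stdin and print the "left" and "right" half
# of each junction as BED6 plus a seventh column naming the side.
junction_halves() {
  awk -v half="$1" 'BEGIN { OFS = "\t" }
    $6 == "+" {
      print $1, $3 - half, $3, $4, $5, $6, "left"    # last nts of the circRNA
      print $1, $2, $2 + half, $4, $5, $6, "right"   # first nts of the circRNA
    }
    $6 == "-" {
      print $1, $2, $2 + half, $4, $5, $6, "left"    # reverse complemented later by getfasta -s
      print $1, $3 - half, $3, $4, $5, $6, "right"
    }
  '
}

printf 'chr1\t10\t30\tcirc_1\t0\t+\nchr1\t10\t30\tcirc_2\t0\t-\n' | junction_halves 5
# prints (tab-separated):
# chr1  25  30  circ_1  0  +  left
# chr1  10  15  circ_1  0  +  right
# chr1  10  15  circ_2  0  -  left
# chr1  25  30  circ_2  0  -  right
```

With a half-length of 100 this reproduces the intervals written by the four awk commands of this step.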
    
  4. Extract the “left” and “right” 100-nt sequences that will form the circRNA junction with the bedtools “getfasta” subcommand. Both a BED file (generated in the previous step) and a FASTA file containing the appropriate genome sequence are required to build the junction scaffold with the bedtools “getfasta” subcommand. Ensure that the annotations in the BED and genome FASTA file have the same chromosome name. iGenomes and Ensembl annotate genome FASTA chromosomes as a number only, whereas UCSC genome browser includes a “chr” prefix prior to the chromosome number (e.g., 1 vs. chr1). FASTA files of the genome or different genomes of interest organized by chromosome or contigs can be downloaded directly from UCSC Genome Browser: http://hgdownload.cse.ucsc.edu/downloads.html.

    Ensembl: ftp://ftp.ensembl.org/pub/release-87/fasta.
    Illumina iGenomes: support.illumina.com/sequencing/sequencing_software/igenome.html.
    # Extract sequences for “left” and “right” sides of circRNA junctions for circRNAs encoded on “+” strand.
    $ bedtools getfasta -fi mm9.genome.fa -bed l_junc.sense.cortex_circRNA.bed \
    -s -name -tab -fo l_junc.sense.cortex_circRNA.seq
    $ bedtools getfasta -fi mm9.genome.fa -bed r_junc.sense.cortex_circRNA.bed \
    -s -name -tab -fo r_junc.sense.cortex_circRNA.seq
    # Extract sequences for “left” and “right” sides of circRNA junctions for circRNAs encoded on the “-” strand.
    $ bedtools getfasta -fi mm9.genome.fa -bed l_junc.antisense.cortex_circRNA.bed \
    -s -name -tab -fo l_junc.antisense.cortex_circRNA.seq
    $ bedtools getfasta -fi mm9.genome.fa -bed r_junc.antisense.cortex_circRNA.bed \
    -s -name -tab -fo r_junc.antisense.cortex_circRNA.seq
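
The -s option asks bedtools to report the reverse complement for features on the “−” strand, which is why antisense junction halves come out in the 5´–3´ orientation of the circRNA. As an illustration of that operation only (not of how bedtools implements it), a reverse complement can be sketched in awk; the function name revcomp and the toy sequence are made up for this example.

```shell
# revcomp: print the reverse complement of each DNA sequence read from stdin.
revcomp() {
  awk 'BEGIN { c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A"; c["N"] = "N" }
    {
      s = toupper($0)
      out = ""
      for (i = length(s); i >= 1; i--) {
        b = substr(s, i, 1)
        out = out ((b in c) ? c[b] : "N")   # symbols other than A, C, G, T become N
      }
      print out
    }'
}

echo "AAACCG" | revcomp
# prints: CGGTTT
```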
    
  5. Combine the “left” and “right” sequences to form the 200-nt junction scaffold sequence in FASTA format. The FASTA output for the circRNA junctions is shown in Fig. 2.
    $ paste l_junc.sense.cortex_circRNA.seq r_junc.sense.cortex_circRNA.seq \
    | awk '{print ">"$1"\n"$2$4}' > circRNA_junctions.fa
    # Append “-” strand circRNA junction sequences to “circRNA_junctions.fa”
    $ paste l_junc.antisense.cortex_circRNA.seq r_junc.antisense.cortex_circRNA.seq \
    | awk '{print ">"$1"\n"$2$4}' >> circRNA_junctions.fa
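
Since every later step assumes scaffolds of one fixed length, a quick length check on the joined sequences can catch truncated halves, for instance if an interval runs past the end of a contig. The sketch below repeats the join on a toy record and then verifies the length; join_halves, check_lengths, and the 5-nt toy sequences are hypothetical, and the check assumes one sequence line per FASTA record, as produced by this step.

```shell
# join_halves: stdin is the pasted output of two "name<TAB>sequence" files
# (four columns); stdout is FASTA with the left and right halves concatenated.
join_halves() {
  awk '{ print ">" $1 "\n" $2 $4 }'
}

# check_lengths LEN: read FASTA on stdin, list records whose sequence length
# differs from LEN, and return a non-zero status if any were found.
check_lengths() {
  awk -v want="$1" '
    /^>/ { name = substr($0, 2); next }
    length($0) != want { printf "%s\t%d\n", name, length($0); bad++ }
    END { exit (bad > 0) }
  '
}

printf 'circ_1\tGGGTA\tcirc_1\tACCTT\n' | join_halves | check_lengths 10 \
  && echo "all junction scaffolds are 10 nt"
# prints: all junction scaffolds are 10 nt
```

For the scaffolds built above, the equivalent call would be `check_lengths 200 < circRNA_junctions.fa`.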
    

Fig. 1.


circRNA junction formation from BED annotations. (a) Representative BED file, circRNA.bed, a tab-delimited file that reports the chromosomal location and features of circRNAs. (b) Model of a 3-exon protein-coding gene with exon 2 encoding a circRNA. circ_1 maps to chr1 from nt 10 to 30. The downstream “left” end of the exon (at nt 30) joins with the upstream “right” end of the exon to form the circRNA junction. (c) circ_2 is encoded on the antisense DNA strand, but has the same start and stop coordinates as circ_1. The upstream “left” side of the exon joins with the downstream “right” side of the exon to form the circRNA junction. Because circ_2 is encoded on the antisense DNA strand, the sequence is reverse-complemented relative to circ_1. FASTA-formatted junction sequences (Subheading 3.1, step 4) are used to build the aligner index (Subheading 3.2, step 1) necessary for alignment of RNA-seq data to circRNA junctions (Subheading 3.2)

Fig. 2.


circRNA junction FASTA sequences. The resulting 200 nt circRNA junction sequences in FASTA format after joining the 100-nt “left” and “right” sequences (Subheading 3.1, step 5)

3.2. Align RNA-seq Reads to the circRNA Junction Scaffold

  1. Build the Bowtie2 aligner index using the circRNA junction scaffold. Six index files with the extension .bt2 are generated by bowtie2-build.
    $ bowtie2-build circRNA_junctions.fa circRNA_junctions
    
  2. Obtain RNA-seq datasets in FASTQ format if user-generated RNA-seq data is not available. Ensure that the library strategy used to generate the RNA-seq data is rRNA-depleted total RNA-seq. Publicly available RNA-seq datasets can be downloaded from ENCODE: www.encodeproject.org

    NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo

    NCBI Sequence Read Archive (SRA): http://www.ncbi.nlm.nih.gov/sra

    SRAtools (https://github.com/ncbi/sra-tools) is needed to extract FASTQ data from SRA files downloaded from GEO and SRA. Obtain at least two RNA-seq datasets from different conditions to compare changes in circRNA expression. For the purposes of this example, we use the SRA prefetch tool to obtain total RNA/rRNA-depleted RNA-seq datasets deposited on the SRA for two conditions: one replicate each for 1-month-old (SRR4280863) mouse cortex and 22-month-old (SRR4280956) mouse cortex [5]. Once the SRA files have been downloaded, the SRA tool fastq-dump is used to extract the FASTQ files from the SRA files. Since these RNA-seq datasets are paired end, the read 1 data are contained in the “_1.fastq.gz” file and the read 2 data are contained in the “_2.fastq.gz” file. Both the read 1 and read 2 files will have the same number of records, and each record will have the same read ID (Fig. 3). The quality of the datasets can be assessed using FastQC (see Note 2):
    #use the prefetch tool from the SRA toolkit to download datasets from the SRA
    #The downloaded files will have a .sra extension
    $ prefetch SRR4280863
    $ prefetch SRR4280956
    #extract FASTQ data from .sra files and split into read1 and read2 using the SRA tool fastq-dump
    $ fastq-dump --split-files -F --gzip SRR4280863
    $ fastq-dump --split-files -F --gzip SRR4280956
    #rename datasets
    $ mv SRR4280863_1.fastq.gz 1m-cortex.R1.fastq.gz
    $ mv SRR4280863_2.fastq.gz 1m-cortex.R2.fastq.gz
    $ mv SRR4280956_1.fastq.gz 22m-cortex.R1.fastq.gz
    $ mv SRR4280956_2.fastq.gz 22m-cortex.R2.fastq.gz
    
  3. Align FASTQ datasets to the junction scaffold using Bowtie2. In this example, we use a minimum alignment score of −15 regardless of read length. Stringency of alignment output can be altered by various parameters in Bowtie2. These parameters are covered in the Bowtie2 manual (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). By default, Bowtie2 writes its output to standard out (stdout). In this example, stdout is piped to samtools view, which keeps only mapped reads, and then to samtools sort, which writes a Binary Alignment/Map (BAM) output file sorted by reference sequence and leftmost alignment position. BAM files are compressed binary versions of Sequence Alignment/Map (SAM) files. Checking the SAM/BAM alignment reports is critical to ensuring accurate alignment to the circRNA junctions (see Note 3). The circRNA alignment information provided in the SAM/BAM file is covered in Fig. 4 for the first aligned read reported in 1m-cortex.bam:
    $ bowtie2 --score-min=C,-15,0 -x circRNA_junctions \
    -1 1m-cortex.R1.fastq.gz -2 1m-cortex.R2.fastq.gz \
    | samtools view -ShuF4 - \
    | samtools sort - -m 8G 1m-cortex
    #samtools 1.3 and later take the output file with -o instead of a prefix, e.g.: samtools sort -m 8G -o 1m-cortex.bam -
    #repeat alignment for second condition
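
To get a feel for what a constant minimum score of −15 permits, the threshold can be compared with the penalties that the Bowtie2 manual lists as end-to-end defaults (--mp 6,2 for mismatches, --rdg 5,3 and --rfg 5,3 for gaps). The arithmetic below is a rough sketch that assumes those defaults are unchanged; the variable names are arbitrary. Mismatches at low-quality bases are penalized less, so more of them can be tolerated.

```shell
min_score=-15   # constant threshold set by --score-min C,-15,0
mx=6            # maximum mismatch penalty (high-quality base), from --mp 6,2
gap_open=5      # gap open penalty, from --rdg 5,3 / --rfg 5,3
gap_extend=3    # penalty per gap position

# Largest number of high-quality mismatches that still meets the threshold.
max_mismatches=$(( (0 - min_score) / mx ))
# Longest single gap that still meets the threshold: open + n * extend <= 15.
max_gap=$(( (0 - min_score - gap_open) / gap_extend ))

echo "mismatches allowed: $max_mismatches; longest single gap: $max_gap nt"
# prints: mismatches allowed: 2; longest single gap: 3 nt
```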
    

Fig. 3.


FASTQ records in read 1 and read 2 RNA-seq datasets. Each sequence record in the FASTQ file is composed of four lines. The first line always begins with an “@” character, followed by the read ID (in this example: HWI-D00269:140:C7FJYANXX:3:1101:1083:1885). The second line of FASTQ record is the read sequence, the third line is a “+” character with or without other read information, and the fourth line contains the sequencing qualities reported for each base. The RNA-seq FASTQ datasets for read 1 (1m-cortex.R1.fastq.gz) and read 2 (1m-cortex.R2.fastq.gz) contain the same number of records and the same read IDs

Fig. 4.


1m-cortex.bam output from Bowtie2 alignment for first mapped read. The meaning of fields in SAM/BAM output reports is described at samtools.github.io/hts-specs/. In 1m-cortex.bam, generated from the Bowtie2 alignment (Subheading 3.2, step 3), the first reported read is assigned a FLAG value of 163. This FLAG value indicates that the read is properly paired, is the second in the pair, and its mate is reverse complemented relative to the reference. The output also indicates that the read aligns to the CDR1as circRNA junction scaffold reference starting at nt 1 (field 4). The mapping quality (MAPQ) field value is 42, and the entire 125-nt (125M) RNA-seq read aligned to the CDR1as junction. An “=” character is assigned to field 7 to indicate that the mate read aligns to the same reference sequence. Field 8 reports that the mate read aligns at position 29 of the reference sequence. The inferred fragment length of 153 nts from the read1 and read2 alignment is reported in field 9. The sequence and the quality scores for the read are reported in fields 10 and 11, and the alignment score is reported in field 12. A perfect alignment in Bowtie2 end-to-end mode will produce a score of 0, while various alignment penalties (e.g., mismatches, insertions, gaps) will report negative scores
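
A FLAG value such as 163 is a sum of bit values defined in the SAM specification (163 = 128 + 32 + 2 + 1). The helper below is a small sketch for decoding any FLAG in the shell; the function name decode_flag and the short bit labels are choices made for this example.

```shell
# decode_flag FLAG: print one label per SAM FLAG bit that is set.
decode_flag() {
  flag=$1
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
              256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}    # numeric bit value before the colon
    name=${pair#*:}    # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      printf '%s\n' "$name"
    fi
  done
}

decode_flag 163
# prints:
# paired
# proper_pair
# mate_reverse
# second_in_pair
```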

3.3. Count Reads that Align to circRNA Junctions Using featureCounts

  1. Generate a Gene Transfer Format (GTF) file using the cortex_circRNA.bed file that contains the circRNA name information. The circRNA GTF output is shown in Fig. 5.
    $ awk 'BEGIN {OFS="\t"}
    {print $4, "Gruner_2016", "junction", "1", "200", ".", "+", ".", "circID \""$4"\""}' \
    cortex_circRNA.bed > circRNA.gtf
    
  2. Run featureCounts using the BAM files and the newly generated circRNA GTF file as input to count the reads that aligned to each circRNA junction. Multiple BAM files representing replicates and different conditions can be fed into featureCounts simultaneously. The featureCounts output is an easy-to-interpret data frame that includes the read count for each circRNA for each condition (Fig. 6):
    $ featureCounts -C -T 4 -t junction -g circID -a circRNA.gtf -o circRNA.counts.txt \
    1m-cortex.bam 22m-cortex.bam
    

Fig. 5.


circRNA GTF for featureCounts. The GTF generated in 3.3.1 provides (1) the circRNA name, (2) the source of data, (3) a feature, (4) the start position of the feature (1-based), (5) the end position, (6) the score (can be replaced with a “.”), (7) the strand, (8) the frame (can be replaced with a “.”), and (9) an attribute, which is a semicolon-separated list of tag-value pairs

Fig. 6.


FeatureCounts output. Output generated from featureCounts (Subheading 3.3, step 2) provides the count of reads for each circRNA feature. For example, 29 individual reads aligned to CDR1as circRNA junction in the 1m-cortex sample versus 45 individual reads in the 22m-cortex sample

3.4. Normalize circRNA Counts to Compare circRNA Expression Across Conditions

  1. Determine total library size for normalization using transcripts per million (TPM). TPM is one approach for library normalization [5]. For this normalization approach, the total number of reads generated per library is required. For paired-end RNA-seq libraries, this is the sum of raw FASTQ records from both the read 1 (.R1.fastq.gz) and read 2 (.R2.fastq.gz) RNA-seq files. In this example, 1m-cortex.R1.fastq.gz includes 51,789,102 records, and 1m-cortex.R2.fastq.gz includes 51,789,102 records; thus, the total size of library is 103,678,204 reads. Because each FASTQ record is made up of four lines, the total number of lines in the RNA-seq FASTQ data file divided by four can be used to determine the total library size. In addition, the output log of Bowtie2 reports the total number of reads processed:
    $ zcat 1m-cortex.R1.fastq.gz | wc -l
    207156408
    #Number of reads in Read1 FASTQ file: 207156408/4 = 51789102 reads
    $ zcat 1m-cortex.R2.fastq.gz | wc -l
    207156408
    #Number of reads in Read2 FASTQ file: 207156408/4 = 51789102 reads
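
The line count divided by four can be wrapped in a small function so that read 1 and read 2 are counted the same way and compared before summing. The sketch below uses two tiny made-up FASTQ files; count_fastq_records is a hypothetical name. It calls gzip -dc rather than zcat because zcat on macOS expects files ending in .Z.

```shell
# count_fastq_records FILE.gz: number of records in a gzip-compressed FASTQ file
# (four lines per record); fails if the line count is not a multiple of four.
count_fastq_records() {
  gzip -dc "$1" | awk 'END {
    if (NR % 4 != 0) { print "incomplete record"; exit 1 }
    print NR / 4
  }'
}

# Toy paired-end data: two records per file.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/toy.R1.fastq.gz"
printf '@r1\nCCGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n' | gzip -c > "$tmp/toy.R2.fastq.gz"

r1=$(count_fastq_records "$tmp/toy.R1.fastq.gz")
r2=$(count_fastq_records "$tmp/toy.R2.fastq.gz")
[ "$r1" -eq "$r2" ] && echo "library size: $(( r1 + r2 )) reads"
# prints: library size: 4 reads
rm -r "$tmp"
```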
    
  2. Normalize counts to TPM. Divide the number of reads that aligned to each circRNA by total library size. The result is multiplied by 1,000,000, and then divided by the scaling factor. The scaling factor is the length of the circRNA junction in kilobases (kb). The junction scaffold used for this example was 200 nts in length; thus the scaling factor is 0.2 kb:
    $ awk '(NR > 2)' circRNA.counts.txt \
    | awk -v lib_size_1m=103678204 -v lib_size_22m=102442864 'BEGIN {OFS="\t"}
    {TPM_1m=((($7/lib_size_1m)*10^6)/0.2)}
    {TPM_22m=((($8/lib_size_22m)*10^6)/0.2)}
    {print $1, TPM_1m, TPM_22m}' \
    > circRNA_TPM.txt
    #view normalized data
    $ cat circRNA_TPM.txt
    CDR1as            1.39856     2.19635
    mm9_circ_000025   0.530488    1.02496
    mm9_circ_000040   0.241131    0.0488077
    mm9_circ_000042   0.192905    0.780923
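
The same normalization can be written as a one-line function, which makes the formula explicit: reads per million library reads, divided by the scaffold length in kb. The sketch below reuses the CDR1as counts from Fig. 6 and the library sizes passed to awk above; junction_tpm is a hypothetical helper name.

```shell
# junction_tpm COUNT LIBRARY_SIZE SCAFFOLD_NT
# (COUNT / LIBRARY_SIZE) * 10^6, divided by the scaffold length in kb.
junction_tpm() {
  awk -v n="$1" -v lib="$2" -v len="$3" \
    'BEGIN { printf "%.5f\n", (n / lib) * 1e6 / (len / 1000) }'
}

junction_tpm 29 103678204 200   # CDR1as, 1m-cortex
# prints: 1.39856
junction_tpm 45 102442864 200   # CDR1as, 22m-cortex
# prints: 2.19635
```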
    
  3. Quantify circRNA expression trends. Now that the circRNA counts are normalized, the different libraries/conditions can be directly compared. In our example, libraries of cortex RNA from 1-month-old mice have 1.4 TPM for CDR1as versus 2.2 TPM from 22-month-old mice. From here, fold change can be calculated to determine changes in circRNA expression (e.g., a 1.6-fold increase in CDR1as circRNA from 1 month to 22 months). The R base package is useful for performing statistical tests, and can be used in addition to the ggplot2 library for plotting expression changes in circRNAs [25, 26]. Expression trends of circRNAs observed with RNA-seq data should be validated using experimental methods such as RT-qPCR and Northern blot analysis. Additional chapters within this book detail these methodologies to study circRNAs (see Note 4).
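
The fold change quoted for CDR1as can be reproduced directly from the normalized values. The helper below is a minimal sketch (fold_change is a hypothetical name) that also reports the log2 ratio. With a single replicate per condition and only a handful of junction reads, such ratios are noisy and are best treated as a starting point for validation.

```shell
# fold_change A B: print B/A and log2(B/A), tab-separated.
fold_change() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a <= 0 || b <= 0) { print "NA\tNA"; exit }   # ratio undefined for zero values
    fc = b / a
    printf "%.2f\t%.2f\n", fc, log(fc) / log(2)
  }'
}

fold_change 1.39856 2.19635   # CDR1as TPM at 1 month, then at 22 months
# prints: 1.57	0.65
```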

4. Notes

  1. A caveat of this method for generating junction sequences is that it does not take into account the size of the circRNA. If a circRNA is derived from a single exon that is smaller than the desired junction sequence length, the above script will grab flanking non-circRNA sequence, such as introns or intergenic sequence. For circRNAs smaller than the desired junction, options include (1) generating a shorter junction sequence or (2) maintaining the desired length of the junction sequence by concatenating the circRNA sequence. A rough circRNA length can be determined by subtracting the chromStart position from the chromStop position in the BED record.
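
The rough length described in this note can be used to list circRNAs that deserve a closer look before the junction scaffold is built. The following sketch prints the name and genomic span of every record whose span is below a chosen scaffold length; flag_short_circs is a hypothetical name, and the second input record is the toy circRNA from Fig. 1.

```shell
# flag_short_circs SCAFFOLD_NT: read BED6 on stdin and print the name and span
# (chromStop - chromStart) of circRNAs whose span is shorter than SCAFFOLD_NT.
flag_short_circs() {
  awk -v scaffold="$1" 'BEGIN { OFS = "\t" }
    ($3 - $2) < scaffold { print $4, $3 - $2 }'
}

printf 'chr18\t51462263\t51463167\tmm9_circ_000042\t0\t+\nchr1\t10\t30\tcirc_1\t0\t+\n' \
  | flag_short_circs 200
# prints: circ_1	20
```

Because the span includes any introns between circularized exons, it overestimates the mature length of multi-exon circRNAs, so records that pass this check may still be shorter than they appear.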

  2. The quality of RNA-seq libraries can be assessed with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Among the quality measures of FastQC is sequence duplication, or repetitive reads due to PCR duplicates. Although the impact of PCR duplicates on RNA expression is debatable [27–29], PCR duplicates may impact circRNA profiling due to the low frequency of circRNA junction-specific reads. Picard MarkDuplicates (https://github.com/broadinstitute/picard) or samtools rmdup can be used to remove PCR duplicates from alignment BAM files. FastQC also reports adapter contamination in the RNA-seq reads. Adapter contamination occurs if the cDNA insert is shorter than the desired read length, and the adapter sequence or a portion of the adapter sequence is included in the read output. Reads with considerable adapter sequence will not align to the circRNA junction scaffold in Bowtie2 end-to-end mode, so trimming adapter sequence can potentially enhance alignment to the circRNA junction scaffold. An option for trimming adapter sequence is Cutadapt [30]; however, adapter trimming results in variable read lengths, so the alignments must then be filtered to require overlap of the circRNA junction.
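
With a 200-nt scaffold, a full-length 125-nt read necessarily crosses the junction, but a trimmed read can lie entirely on one side of it. One possible way to enforce junction overlap is to filter SAM records by their start position and CIGAR string, as sketched below. The function name filter_junction_overlap, its two arguments, and the toy record (modelled on the read described in Fig. 4, with sequence and qualities replaced by “*”) are assumptions for illustration.

```shell
# filter_junction_overlap HALF MIN: keep SAM records (stdin) that cover at least
# MIN nt on both sides of a junction lying between scaffold positions HALF and HALF+1.
filter_junction_overlap() {
  awk -v half="$1" -v min="$2" 'BEGIN { FS = "\t" }
    # Reference length covered by an alignment, computed from its CIGAR string.
    function ref_len(cigar,    n, op, len) {
      len = 0
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        n  = substr(cigar, 1, RLENGTH - 1) + 0
        op = substr(cigar, RLENGTH, 1)
        if (op ~ /[MDN=X]/) len += n        # operations that consume the reference
        cigar = substr(cigar, RLENGTH + 1)
      }
      return len
    }
    /^@/ { print; next }                    # pass header lines through
    {
      first = $4
      last  = first + ref_len($6) - 1
      if (half - first + 1 >= min && last - half >= min) print
    }'
}

rec=$(printf 'read1\t163\tCDR1as\t1\t42\t125M\t=\t29\t153\t*\t*')
printf '%s\n' "$rec" | filter_junction_overlap 100 25 | wc -l   # 1: read covers 100 nt left, 25 nt right
printf '%s\n' "$rec" | filter_junction_overlap 100 26 | wc -l   # 0: right side falls short of 26 nt
```

On real data the input could come from `samtools view -h 1m-cortex.bam`, with the filtered output converted back to BAM before counting.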

  3. Spot check the junction sequences by loading circRNA BED annotations on IGV [23, 24] or UCSC genome browser [31] to ensure that the proper junction is accurately captured. The alignment BAM files can also be loaded onto IGV with the junction scaffold FASTA file (as the genome). A BAM index file (.bam.bai) is required for each BAM file loaded onto IGV. To generate the index file, use samtools “index”. Visualizing the reads aligned onto the circRNA scaffold through IGV can help the user spot low mapping quality and other systematic errors.

  4. Although a huge number of circRNAs have been predicted, very few annotations have been validated by independent methods such as Sanger sequencing of circRNA RT-PCR products, RNase R treatment, or Northern blot techniques (see additional chapters in the book). It is important for the user to validate circRNA expression trends using independent molecular techniques.

Acknowledgments

This work was supported by the National Institute of General Medical Sciences grant P20 GM103650 and National Institute on Aging grant R15 AG052931. We would also like to thank Matthew Bauer and David Knupp for critical review of this chapter.

References
