Author manuscript; available in PMC: 2020 Nov 2.
Published in final edited form as: Methods Mol Biol. 2013;971:295–312. doi: 10.1007/978-1-62703-269-8_17

High-Throughput RNA Sequencing in B-Cell Lymphomas

Wenming Xiao 1, Bao Tran 1, Louis M Staudt 1, Roland Schmitz 1
PMCID: PMC7605114  NIHMSID: NIHMS1635698  PMID: 23296971

Abstract

High-throughput mRNA sequencing (RNA-seq) uses massively parallel sequencing to allow an unbiased analysis of both genome-wide transcription levels and mutation status of a tumor. In the RNA-seq method, complementary DNA (cDNA) is used to generate short sequence reads by immobilizing millions of amplified DNA fragments onto a solid surface and performing the sequence reaction. The resulting sequences are aligned to a reference genome or transcript database to create a comprehensive description of the analyzed transcriptome. This chapter describes a protocol to perform RNA-seq using the Illumina sequencing platform, presents sequencing data quality metrics and outlines a bioinformatic pipeline for sequence alignment, digital gene expression, and mutation discovery.

Keywords: Next-generation sequencing, Transcriptome, Gene expression, B cell, B-cell lymphoma, Immunoglobulin genes, Mutation

1. Introduction

Advances in high-throughput sequencing have opened new horizons in cancer research. The recent application of this method has led to immense progress in our understanding of the biology and pathogenesis of B-cell lymphomas by identifying gene mutations and unbalanced expression of transcript isoforms (1–6). These studies included analyses of whole genomes, exomes (the coding sequences of the genome), or transcriptomes using genomic DNA, enriched exomic DNA, or mRNA, respectively. High-throughput sequencing of mRNA represents an elegant way to study both gene mutations and the type and quantity of the transcripts in a cell. Compared with alternative high-throughput gene expression technologies, such as microarrays, RNA-seq achieves base pair-level resolution, a higher dynamic range of expression levels, and a lower background level (7). Because RNA-seq does not rely on a predetermined set of probe sequences, it also provides information on alternatively spliced isoforms and novel transcripts. RNA-seq can detect rare transcripts with an abundance of 1–10 RNA molecules per cell (8), facilitating the detection of mutations in oncogenes and tumor suppressor genes even when these transcripts are present at low abundance (9).

This chapter contains the relevant protocols for performing RNA-seq on Illumina Genome Analyzer or Illumina HiSeq2000 Systems using kits and reagents from commercial sources. Following RNA sample quantification and quality control, a double-stranded cDNA library is generated, controlled for size range, and quantified. In the cluster station, the denatured cDNA library is annealed to oligonucleotides that are covalently bound to the surface of the flow cell. These oligonucleotides prime the synthesis of a complementary DNA strand, creating a double-stranded DNA template of which one strand is immobilized on the flow cell. Following denaturation, the free ends of these covalently attached DNA molecules are bound by neighboring oligonucleotides on the surface of the flow cell, which in turn prime the synthesis of a complementary DNA strand that is consequently also covalently bound to the flow cell. This process is repeated to create local “clusters” of identical DNA molecules. The sequencing reaction in the Genome Analyzer or HiSeq2000 is performed using a sequencing primer that is complementary to the adapter sequence attached to each template strand. The four different bases (A, G, T, and C) are labeled with different colored fluorophores. Following incorporation of the labeled bases, the Genome Analyzer’s camera records the colors for each cluster, and Illumina software modules (Firecrest and Bustard) convert the fluorophore information into DNA sequence data with associated intensity and quality scores.

The subsequent bioinformatic analysis of RNA-seq data is done in several stages. In the first step, the sequencing reads are mapped. The correct alignment of short RNA sequences is a crucial, albeit complex, task. Depending on the experimental setup, RNA-seq generates tens of millions of sequence reads corresponding to complex transcriptomes. Several short-read assembly programs have been developed to tackle these problems, applying different mathematical algorithms and scoring schemes (10). The mapping strategy outlined in this chapter applies sequential alignment steps with the mapping software BWA (11) and has proved to be robust yet computationally inexpensive. Once the data have been mapped to a reference, the aligned reads are parsed to assign single nucleotide variants (SNV), which are validated by additional alignments using Bowtie (12) and Novoalign (http://www.novocraft.com). In a final step, an expression score for every mapped read is computed.

2. Materials

2.1. Sample Quality Control

  1. Agilent RNA 6000 Nano Kit (Agilent Technologies).

  2. Rat Brain Total RNA (Life Technologies).

  3. Agilent 2100 Bioanalyzer (Agilent Technologies).

  4. RNaseZAP (Ambion).

  5. DNAse/RNAse-free water (DEPC-treated).

  6. Bioanalyzer Chip vortexer (IKA MS 3).

2.2. Library Preparation

  1. TruSeq RNA Sample Preparation Kit-Set A (Illumina).

  2. 0.2 mL clear thin wall PCR strip tubes.

  3. DynaMag-96 Side (Invitrogen).

  4. SuperScript II Mix (Invitrogen).

  5. Agencourt AMPure XP 60 μL (Beckman Coulter).

  6. Ethanol absolute.

  7. Tris-Cl 10 mM, pH 8.5 with 0.1% Tween 20.

2.3. Library Quality Control

  1. Agilent DNA 1000 Kit (Agilent Technologies).

2.4. qPCR

  1. Illumina Eco Real-Time PCR System (Illumina).

  2. SYBR FAST Master Mix (KAPA Biosystems).

  3. Illumina GA Primer Premix (KAPA Biosystems).

  4. 6 x Illumina GA DNA Standards (KAPA Biosystems).

  5. DNAse/RNAse-free Water.

  6. 10 mM Tris-Cl + 0.05% Tween-20 (DNase/RNase-free).

  7. Eco Plates (Illumina).

  8. Eco Adhesive Seals (Illumina).

2.5. Cluster Generation and Sequencing

Illumina Cluster Station

Illumina HiSeq2000 or GAIIx Systems

2.6. RNA-Seq Data Quality Control and Assessment

Software:

Illumina Software: http://www.illumina.com/support/sequencing/sequencing_software.ilmn

FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc

FASTX_toolkit: http://hannonlab.cshl.edu/fastx_toolkit

2.7. Alignment of Sequences to Reference

Software:

BWA: http://bio-bwa.sourceforge.net/

IGV: http://www.broadinstitute.org/igv/home

Download reference sequences:

RefSeq database: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian

Ensembl database: http://useast.ensembl.org/info/data/ftp/index.html

Genome: ftp://ftp.ncbi.nlm.nih.gov/genomes/

2.8. Extraction of Putative Single Nucleotide Variants

Software:

GATK: http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page

PERL V5.8.8 or later

Download transcriptome mapping files from UCSC Human Genome Browser Project: http://genome.ucsc.edu/cgi-bin/hgTables?command=start

2.9. Verification of Putative Sequence Variants with Novoalign and Bowtie

Software:

Bowtie: http://bowtie-bio.sourceforge.net/index.shtml

Novoalign: http://www.novocraft.com/main/index.php

2.10. Annotation of Sequence Variants

Software:

PERL V5.8.8 or later

2.11. Digital Gene Expression

Software:

PERL V5.8.8 or later

2.12. Immunoglobulin Gene Assembly

Software:

BLAST: http://blast.ncbi.nlm.nih.gov

SSAKE: http://www.bcgsc.ca/platform/bioinfo/software/ssake

Formatdb: ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html

Download reference sequences: Immunoglobulin gene segments: http://www.imgt.org/IMGTdownloads.html

3. Methods

3.1. Sample Quality Control

RNA quality is very important for successful library preparation. Degraded RNA can result in low yield, over representation of the 5′ ends of RNA, or failure of library preparation. As a proxy for RNA quality, the Agilent Bioanalyzer measures the fluorescence intensity of 18S and 28S ribosomal RNA and calculates from that the RNA integrity number (RIN).

  1. Prepare the gel by pipetting 550 μL of RNA-6000 nano gel matrix into a spin filter. Centrifuge at 1,500 × g for 10 min at room temperature. Aliquot 65 μL into RNase-free microfuge tubes (see Note 1).

  2. Vortex the RNA 6000 Nano dye concentrate for 10 s, spin down and add 1 μL of dye into a 65 μL aliquot of filtered gel. Vortex and spin at 13,000 × g for 10 min (see Note 2).

  3. Heat denature RNA ladder for 2 min at 70°C and cool on ice for 1 min (see Note 3).

  4. Put an RNA 6000 Nano chip on the chip priming station. Pipette 9.0 μL of gel–dye mix in the well marked G. Position the plunger at 1 μL and close the chip priming station. Press the plunger until it is held by the clip. Wait for exactly 30 s and release the clip. Wait for an additional 5 s. Pull the plunger slowly back to the 1 μL position. Open the chip priming station and pipette 9.0 μL of gel–dye mix in the wells marked G.

  5. Pipette 5 μL of RNA 6000 Nano marker in all 12 sample wells and in the well marked with the ladder symbol.

  6. Pipette 1 μL of prepared ladder in well marked with the ladder symbol. Pipette 1 μL of sample in each of the 11 sample wells and Rat Brain Total RNA control in well 12 (see Note 4). Pipette 1 μL of RNA 6000 Nano Marker in each unused sample well. Put the chip in the adapter of the chip vortexer and vortex for 1 min at 2,400 rpm. Run the chip in the Agilent 2100 Bioanalyzer within 5 min using the program RNA-Eukaryote Total RNA Nano Series II.xsy.

3.2. Library Preparation

The library preparation is outlined in Fig. 1.

Fig. 1.

Workflow for cDNA library preparation. Asterisks indicate steps after which library preparation can be interrupted.

  • 1

    1 μg of total RNA is required for cDNA library construction (see Note 5). Dilute the total RNA with DNAse/RNAse-free water to final volume of 50 μL.
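The dilution in step 1 is simple arithmetic, sketched here in Python. The function name `dilution_volumes` and the example stock concentration of 250 ng/μL are hypothetical and not part of the protocol; only the 1 μg input and the 50 μL final volume are taken from the step above.

```python
def dilution_volumes(stock_ng_per_ul, target_ng=1000.0, final_volume_ul=50.0):
    """Return (RNA volume, water volume) in microliters.

    target_ng and final_volume_ul default to the 1 ug / 50 uL of step 1.
    stock_ng_per_ul is the measured RNA concentration (an input assumed here).
    """
    if stock_ng_per_ul <= 0:
        raise ValueError("stock concentration must be positive")
    rna_ul = target_ng / stock_ng_per_ul
    if rna_ul > final_volume_ul:
        raise ValueError("stock is too dilute to fit the target amount into the final volume")
    return rna_ul, final_volume_ul - rna_ul


# Hypothetical sample at 250 ng/uL
rna_ul, water_ul = dilution_volumes(250.0)
print(f"RNA: {rna_ul:.1f} uL, water: {water_ul:.1f} uL")
```

A stock below 20 ng/μL cannot deliver 1 μg in 50 μL, which the sketch reports as an error rather than returning a negative water volume.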

Purification and Fragmentation of mRNA

  • 2

    Add 50 μL of RNA Purification Beads to each well of the RBP plate using a multichannel pipette to bind mRNA to oligo-dT magnetic beads (see Note 6). Mix by gently pipetting up and down six times.

  • 3

    Seal the plate and denature RNA in a thermal cycler (65°C for 5 min, 4°C hold). Remove from thermal cycler.

  • 4

    Incubate at room temperature for 5 min to allow binding to beads.

  • 5

    Place the plate on the magnetic stand at room temperature for 5 min to separate beads from the solution (see Note 7).

  • 6

    Carefully remove and discard supernatant using a multichannel pipette. Remove the plate from the magnetic stand.

  • 7

    Wash beads by adding 200 μL of Bead Washing Buffer to remove unbound RNA (see Note 8).

  • 8

    Place the plate on the magnetic stand at room temperature for 5 min.

  • 9

    Carefully remove and discard supernatant. Remove the plate from the magnetic stand.

  • 10

    Add 50 μL of Elution Buffer. Mix gently by pipetting.

  • 11

    Seal the plate and elute mRNA in a thermal cycler (80°C for 2 min, 25°C hold).

  • 12

    Remove the plate from thermal cycler and add 50 μL of Bead Binding Buffer and incubate at room temperature for 5 min.

  • 13

    Place the plate on the magnetic stand at room temperature for 5 min (see Note 7).

  • 14

    Carefully remove and discard entire supernatant.

  • 15

    Remove the plate from the magnetic stand and wash beads by adding 200 μL of bead wash buffer to remove unbound RNA.

  • 16

    Place the plate on the magnetic stand at room temperature for 5 min.

  • 17

    Carefully remove and discard supernatant. Remove the plate from the magnetic stand.

  • 18

    Add 19.5 μL of Elute, Fragment Mix. Mix gently by pipetting (see Note 9).

  • 19

    Place the sealed plate in a thermal cycler to elute, fragment, and prime the RNA using the program: 94°C for 8 min, 4°C hold.

  • 20

    Remove from thermal cycler and spin briefly. Proceed immediately to Synthesize First Strand cDNA.

Synthesize First Strand cDNA

  • 21

    Place the plate on the magnetic stand at room temperature for 5 min.

  • 22

    Transfer 17 μL of supernatant (fragmented and primed mRNA) to a new plate.

  • 23

    Add 8 μL of First Strand Master Mix and SuperScript II mix. Mix by pipetting.

  • 24

    Incubate the plate in a thermal cycler: 25°C for 10 min, 42°C for 50 min, 70°C for 15 min, 4°C hold.

  • 25

    Remove from thermal cycler and proceed immediately to Synthesize Second Strand cDNA.

Synthesize Second Strand cDNA

  • 26

    Add 25 μL of second strand master mix. Mix by pipetting.

  • 27

    Incubate in a thermal cycler, at 16°C for 2.5 h.

  • 28

    Remove the plate from thermal cycler. Bring the reaction mixture to room temperature.

Ampure XP Clean Up

  • 29

    Vortex the Ampure XP beads until they are well dispersed and add 90 μL Ampure XP beads to reaction mixture. Mix by vortexing.

  • 30

    Incubate at room temperature for 15 min.

  • 31

    Place the plate on the magnetic stand at room temperature for 5 min (see Note 7).

  • 32

    Carefully remove and discard 135 μL of supernatant. Some liquid may remain in the wells (see Note 10).

  • 33

    With the plate remaining on the magnetic stand, carefully add 200 μL of freshly prepared 80% EtOH.

  • 34

    Incubate at room temperature for 30 s and carefully remove and discard entire supernatant. Repeat steps 33 and 34 once for a total of two 80% EtOH washes.

  • 35

    Dry beads at room temperature for 15 min and remove the plate from the magnetic stand.

  • 36

    Add 52.5 μL Resuspension Buffer. Mix by pipetting.

  • 37

    Incubate at room temperature for 2 min.

  • 38

    Place the plate on the magnetic stand at room temperature for 5 min (see Note 7).

  • 39

    Transfer 50 μL of supernatant (double-stranded cDNA) to a new PCR plate. Library preparation can be interrupted at this step. cDNA can be stored up to 7 days at −20°C.

End Repair

  • 40

    Add 10 μL of Resuspension Buffer.

  • 41

    Add 40 μL of End Repair Mix. Mix by pipetting.

  • 42

    Incubate in a thermal cycler at 30°C for 30 min. Library preparation can be interrupted at this step. cDNA can be stored up to 7 days at −20°C.

Ampure XP Clean Up

  • 43

    Add 160 μL of mixed Ampure XP beads. Mix by pipetting.

  • 44

    Incubate at room temperature for 15 min.

  • 45

    Place the plate on the magnetic stand at room temperature for 5 min (see Note 7).

  • 46

    Carefully remove and discard 127.5 μL of supernatant. Some liquid may remain in the wells (see Note 10).

  • 47

    Wash as in steps 33–35.

  • 48

    Resuspend the dried pellet with 52.5 μL Resuspension Buffer. Mix well by pipetting.

  • 49

    Incubate at room temperature for 2 min.

  • 50

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 51

    Transfer 17.5 μL of supernatant to a new PCR plate. Library preparation can be interrupted at this step. cDNA can be stored up to 7 days at −20°C.

Adenylate 3′ Ends

  • 52

    Add 12.5 μL of A-Tailing Mix. Mix by pipetting.

  • 53

    Incubate at 37°C for 30 min.

  • 54

    Proceed immediately to Ligate Adapters.

Ligate Adapters

  • 55

    Add 2.5 μL of DNA Ligase Mix, 2.5 μL of Resuspension Buffer, and 2.5 μL of each RNA Adapter Index. Mix by pipetting.

  • 56

    Incubate in a thermal cycler at 30°C for 10 min.

  • 57

    Remove the plate from thermal cycler.

  • 58

    Inactivate ligation by adding 5 μL of Stop Ligase Mix. Mix by pipetting.

Ampure XP Clean Up

  • 59

    Add 42 μL of mixed Ampure XP beads. Mix by pipetting.

  • 60

    Incubate at room temperature for 15 min.

  • 61

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 62

    Carefully remove and discard supernatant (see Note 10).

  • 63

    Wash as in steps 33–35.

  • 64

    Resuspend the dried pellet with 52.5 μL Resuspension Buffer. Mix by pipetting.

  • 65

    Incubate at room temperature for 2 min.

  • 66

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 67

    Transfer 50 μL of supernatant into new plate.

  • 68

    Add 50 μL of mixed Ampure XP beads. Mix by pipetting.

  • 69

    Incubate at room temperature for 15 min.

  • 70

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 71

    Carefully remove and discard 95 μL of supernatant. Some liquid may remain in each well (see Note 10).

  • 72

    Wash as in steps 33–35.

  • 73

    Add 22.5 μL Resuspension Buffer. Mix by pipetting.

  • 74

    Incubate at room temperature for 2 min.

  • 75

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 76

    Transfer 20 μL of supernatant to corresponding wells of the plate of step 67. Library preparation can be interrupted at this step. cDNA can be stored up to 7 days at −20°C.

Enrich DNA Fragments

  • 77

    Add 5 μL of PCR Primer Cocktail.

  • 78

    Add 25 μL of PCR Master Mix. Mix by pipetting.

  • 79

    Amplify cDNA library using the PCR program: 98°C for 30 s, 15× (98°C for 10 s, 60°C for 30 s, 72°C for 30 s), 72°C for 5 min, hold at 4°C.

  • 80

    Add 50 μL of mixed Ampure XP beads. Mix by pipetting.

  • 81

    Incubate at room temperature for 15 min.

  • 82

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 83

    Carefully remove and discard 95 μL of the supernatant. Some liquid may remain in the wells (see Note 10).

  • 84

    Wash as in steps 33–35.

  • 85

    Resuspend dried pellet with 15 μL Resuspension Buffer. Mix by pipetting.

  • 86

    Incubate at room temperature for 2 min.

  • 87

    Place the plate on the magnetic stand at room temperature for at least 5 min (see Note 7).

  • 88

    Carefully transfer 15 μL of supernatant to a new plate.

3.3. Library Quality Control

The cDNA library is analyzed using an Agilent DNA 1000 Bioanalyzer chip to validate the quality, size range, and quantity of the library. The anticipated yield of the cDNA library preparation is at least 10 nM. Most of the cDNA molecules are expected to be 300–350 bp in size. If the library contains high quantities of primer-dimers, which are typically around 50 bp in size, an additional Ampure XP Clean Up step is recommended.

  1. Bring DNA dye concentrate and DNA gel matrix to room temperature. Vortex the DNA dye concentrate for 10 s and spin down (see Note 11).

  2. Pipette 25 μL of the dye concentrate into DNA gel matrix vial.

  3. Vortex for 10 s. Check proper mixing of gel and dye.

  4. Transfer the gel–dye mix to the top receptacle of a spin filter. Place the spin filter in a microcentrifuge and spin for 15 min at room temperature at 2,240 × g.

  5. Place DNA chip on the chip priming station.

  6. Pipette 9.0 μL of the gel–dye mix at the bottom of the well marked G (see Note 12). Position plunger at 1 μL and close the chip priming station.

  7. Press the plunger of the syringe down until it is held by the clip. Wait for exactly 60 s and then release the plunger with the clip release mechanism.

  8. Wait for 5 s, then slowly pull back the plunger to the 1 μL position. Open the chip priming station.

  9. Pipette 9.0 μL of the gel–dye mix in each of the wells marked.

  10. Pipette 5 μL of DNA marker into the well marked with the ladder symbol and into each of the 12 sample wells (see Note 13).

  11. Pipette 1 μL of the DNA ladder in the well marked with the ladder symbol. In each of the 12 sample wells pipette 1 μL of sample for used wells or 1 μL of deionized water for unused wells.

  12. Place the chip horizontally in the adapter of the chip vortexer. Vortex for 60 s at 2,400 rpm. Run within 5 min using the program dsDNA-DNA 1000 Series II.xsy.

3.4. qPCR

Accurate quantification of the cDNA library is critical for successful cluster generation. Low concentrations of cDNA libraries result in lower cluster density and reduced sequencing yield, whereas excessively high concentrations can produce overly dense clusters, leading to poor cluster resolution during sequencing.

  1. Dilute cDNA Libraries 1:10,000 (see Note 14).

  2. Prepare the qPCR reaction mix (Table 1).

  3. Place a 48-well qPCR plate on the dock assembly and adjust the angle of the dock as desired.

  4. Pipette 8 μL of the qPCR reaction mix into the sample plate.

  5. Add 2 μL of either DNA, Illumina DNA Standards (1–6) or water into sample wells.

  6. Seal the 48-well microplate with adhesive film.

  7. Vortex and spin plate. No air bubbles should be visible.

  8. Run the sample plate in the Eco qPCR System using the PCR program: 95°C 5 min, 35× (95°C for 30 s, 60°C 45 s) (see Note 15).

Table 1.

qPCR reaction mix

| Volume (μL) | Reagent |
|---|---|
| 5 | 2× SYBR Master Mix |
| 1 | 10× Illumina GA Primer Premix |
| 2 | Water |
| 2 | DNA template, GA Standard, or water |
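When many wells are run, the first three components of Table 1 (8 μL per well) are usually combined into a master mix before the 2 μL of template is added. The following Python sketch scales the per-reaction volumes; the function name `master_mix` and the 10% pipetting overage are assumptions for illustration, not part of the protocol.

```python
# Per-reaction volumes (uL) of the template-free components in Table 1.
MIX_PER_REACTION_UL = {
    "2x SYBR Master Mix": 5.0,
    "10x Illumina GA Primer Premix": 1.0,
    "Water": 2.0,
}


def master_mix(n_reactions, overage=0.10):
    """Scale the Table 1 mix to n_reactions wells.

    overage is an assumed safety margin for pipetting loss (10% by default).
    """
    if n_reactions < 1:
        raise ValueError("need at least one reaction")
    factor = n_reactions * (1.0 + overage)
    return {reagent: round(vol * factor, 2)
            for reagent, vol in MIX_PER_REACTION_UL.items()}


# Example: 6 standards, 1 water control and 5 libraries, each in triplicate
for reagent, volume in master_mix(36).items():
    print(f"{reagent}: {volume} uL")
```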

3.5. Cluster Generation and Sequencing

The cDNA samples are clustered on the flow cell using the protocol cBot_PE_Amplification_Linearization_Blocking_PrymHyb_v8.0.xml. Sequencing is performed on an Illumina GAIIx or HiSeq2000 instrument.

3.6. RNA-Seq Data Quality Control and Assessment

There is a variety of software tools, including both published open-source algorithms and programs developed in-house, for RNA-Seq data QC and analysis. This section focuses on data quality control and assessment based on Illumina software tools.

3.6.1. Instrument Run Monitoring and Data Evaluation

During the Illumina instrument run for both HiSeq2000 and GAIIx platforms, the data quality control can be performed based on the output from instrument control software and off instrument (remote) Sequencing Analysis Viewer software. Table 2 shows software tools and metrics for data evaluation.

Table 2.

Instrument run monitoring and data evaluation

| Platform | Quality metrics | Software | Software function |
|---|---|---|---|
| HiSeq2000, GAIIx | Raw Cluster Numbers and %PassFilter Clusters allow the assessment of cluster density and performance | HCS/SCS (Instrument Control Software) | Instrument Control Software provides a graphical interface that allows the user to view run status and statistics while running the instrument |
| HiSeq2000, GAIIx | Intensity and Phasing/Prephasing values allow assessment of the sequencing chemistry performance | RTA (Real-Time Analysis) | RTA is a component within the Instrument Control Software that monitors the run’s progress, optimizes run conditions, and provides run-time quality statistics |
| HiSeq2000, GAIIx | Basecall Quality Scores allow assessment of base call quality cycle by cycle | SAV (Sequencing Analysis Viewer) | SAV is an application that allows the user to view, in real time and from an off-instrument remote location, run quality metrics generated by RTA |

Table 3 lists acceptable run performance metrics based on a control sample (e.g., PhiX). Performance may vary for a given mRNA sample based on sample quality, cluster density, and other experimental factors.

Table 3.

Run performance metrics

| Metric | Description | Normal value range |
|---|---|---|
| Cluster Density (1,000/mm²) | Density of clusters detected by image analysis | 200–820 K/mm² |
| Cluster PF (%) | Percentage of clusters passing intensity filter criteria | 60–90% |
| Phasing/Prephasing | Value used by RTA for the percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead of (prephasing) the current cycle within a read | Phasing < Prephasing; Phasing <0.5; Prephasing <0.7 |
| Q30+ (%) | Percentage of bases with a Phred score of 30 or higher | ≥80% |
| Mismatch Error Rate | The calculated error rate, as determined by the alignment of the control (e.g., PhiX) | <2% |
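The ranges in Table 3 can be applied programmatically to a run summary. The Python sketch below is only an illustration: the dictionary keys and the function name `check_run_metrics` are hypothetical, and the values would have to be copied by hand or parsed from the instrument software output, which is not shown here.

```python
def check_run_metrics(m):
    """Return the names of metrics that fall outside the Table 3 ranges.

    m is a dict with the (hypothetical) keys used below.
    """
    checks = {
        "cluster_density_k_per_mm2": 200 <= m["cluster_density_k_per_mm2"] <= 820,
        "cluster_pf_pct": 60 <= m["cluster_pf_pct"] <= 90,
        # Phasing must be below 0.5 and also below the prephasing value.
        "phasing": m["phasing"] < 0.5 and m["phasing"] < m["prephasing"],
        "prephasing": m["prephasing"] < 0.7,
        "q30_pct": m["q30_pct"] >= 80,
        "error_rate_pct": m["error_rate_pct"] < 2,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up example values for a single lane
example_run = {
    "cluster_density_k_per_mm2": 600,
    "cluster_pf_pct": 85,
    "phasing": 0.2,
    "prephasing": 0.3,
    "q30_pct": 88,
    "error_rate_pct": 0.5,
}
print(check_run_metrics(example_run))  # prints []
```

A value outside the listed range is flagged for inspection rather than treated as a failure, since the table describes normal ranges for a control sample.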

3.6.2. Post Sequencing Data Processing

  1. Convert sequencing data to FASTQ files using Illumina CASAVA or OLB software. There is an option to automatically filter low quality reads that have failed Illumina specific thresholds.

  2. Generate quality reports using the FastQC software (Table 4).

  3. Perform adapter clipping and low quality trimming using FASTX_toolkit software.

  4. Specify adapter and contaminant sequence files (e.g., ribosomal DNA or other contaminant sequences) in the CASAVA contaminant file path, so that the CASAVA eland_rna module automatically filters these contaminants during mapping.

  5. For sequence quality evaluation, align reads to the reference genome and splice junctions using the eland_paired and eland_rna modules of the CASAVA pipeline.

Table 4.

Quality assessment using FastQC

| Metric | Fail QC value range |
|---|---|
| Per base quality score | If the lower quartile for any base is less than 5 or if the median for any base is less than 20 |
| Per sequence quality score | If the most frequently observed mean quality for sequences is below 20 |
| Per base sequence content | If the difference between A and T, or G and C, is greater than 20% in any position |
| Sequence GC content | If the sum of the deviations from the normal distribution represents more than 30% of the reads |
| PCR duplicates | If non-unique sequences make up more than 50% of the total |
| Overrepresented sequences | If any sequence is found to represent more than 1% of the total |
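Two of the failure criteria in Table 4 are restated below as Python predicates to make the thresholds explicit. This is a sketch of the rules as listed in the table, not FastQC code; the function names and input layouts are assumptions.

```python
def per_base_quality_fails(positions):
    """positions: iterable of (lower_quartile, median) Phred values, one pair per cycle.

    Fails if any cycle has a lower quartile below 5 or a median below 20.
    """
    return any(lq < 5 or med < 20 for lq, med in positions)


def per_base_content_fails(fractions):
    """fractions: iterable of dicts with the percentage of A, C, G and T per cycle.

    Fails if A and T, or G and C, differ by more than 20 percentage points.
    """
    return any(abs(f["A"] - f["T"]) > 20 or abs(f["G"] - f["C"]) > 20
               for f in fractions)


# Made-up per-cycle summaries
good_quality = [(30, 36), (28, 35), (22, 31)]
late_cycle_drop = [(30, 36), (28, 35), (4, 18)]
print(per_base_quality_fails(good_quality), per_base_quality_fails(late_cycle_drop))
```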

3.7. Alignment of Sequences to Reference

In essence, there are two assembly strategies for RNA-seq data. The sequence reads can be aligned directly to a reference such as the genome or transcriptome, or assembled de novo and subsequently aligned. While de novo assembly provides the opportunity to construct a genome-wide transcriptome map, including the discovery of previously unknown transcripts, this method requires greater computational resources and is less sensitive in mapping transcripts at low abundance. Thus, except for immunoglobulin gene assembly (Subheading 3.12), this chapter focuses on reference-based mapping strategies. Alignment to the genome allows mapping of reads from unannotated loci. However, this method alone carries a high risk of false positive SNVs due to mismapped reads (e.g., reads spanning one or more intron-exon junctions). To address these problems, we outline a sequential alignment strategy using the BWA alignment tool (11) (Fig. 2). Following download and indexing of the reference, RNA-seq reads (in FASTQ format) are aligned to the RefSeq database (see Note 16). The aligned and unaligned reads are stored in separate files in SAM format. Unaligned reads from this step are used for immunoglobulin gene assembly in Subheading 3.12. In the next alignment step, reads that failed to map to RefSeq are mapped to the Ensembl database, which includes additional transcripts and pseudogenes (13). Reads that do not align to Ensembl are then aligned to the genome sequence. This sequential mapping strategy usually results in the following mapping efficiencies: RefSeq, 70–80%; Ensembl, 5%; and genome, 5–10%. The three final alignment files are merged and converted into BAM format (the binary version of SAM) for visualization of aligned target sequences in the Integrative Genomics Viewer (IGV) (14). The BAM file is sorted and indexed before being imported into IGV (see Note 17).

Fig. 2.

Flowchart of the sequential alignment process using BWA. In the first step of the mapping process, RNA-seq reads are aligned to the RefSeq database. Aligned and unaligned reads are stored in separate SAM format files. Unaligned reads are mapped and assembled for Ig transcripts, or aligned to the Ensembl database. Reads that fail to align at this step are mapped to the genome.
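Between two alignment rounds, the SAM output has to be separated into aligned and unaligned reads so that only the unaligned reads enter the next round. The following Python sketch shows one way to do this; it relies only on the SAM convention that bit 0x4 of the FLAG field (second column) marks an unmapped read. The function name and the example records are hypothetical, and the chapter does not prescribe this particular implementation.

```python
def split_sam_records(sam_lines):
    """Split SAM text lines into (headers, aligned, unaligned).

    A record is treated as unaligned when bit 0x4 of its FLAG is set.
    """
    headers, aligned, unaligned = [], [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)  # header lines such as @SQ or @PG
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:
            unaligned.append(line)
        else:
            aligned.append(line)
    return headers, aligned, unaligned


# Made-up records: r1 maps to a transcript, r2 is unmapped
example_sam = [
    "@SQ\tSN:NM_000001\tLN:2000",
    "r1\t0\tNM_000001\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
headers, aligned, unaligned = split_sam_records(example_sam)
print(len(aligned), len(unaligned))  # prints 1 1
```

The unaligned records from the RefSeq round would be converted back to FASTQ and passed to the Ensembl round, and likewise from Ensembl to the genome.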

3.8. Extraction of Putative Single Nucleotide Variants

The resulting alignments are used to generate single nucleotide variant (SNV) calls. To reduce the impact of method-specific artifacts, the following considerations should be taken into account. Reads that map to an identical starting position (redundant reads) are discarded. Only reads that have a Phred quality score of 20 or more are used to create SNV calls. To ensure both high sensitivity and a reduction of false positive results, SNVs are included only if they are observed in more than three reads and the ratio of mutant reads to total coverage is greater than 20%. A wide variety of software is available to make variant calls, such as mpileup in SAMtools combined with BCFtools, or the Variant Discovery Tools in the Genome Analysis Toolkit (GATK). Alternatively, individual PERL scripts can be developed to “look up” mismatches in alignments. Based on the mapping of RefSeq and Ensembl transcripts to the genome, which is available from the UCSC Human Genome Browser Project, the identified SNVs are subsequently assigned to their corresponding genomic coordinates.
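The filtering criteria above can be sketched for a single reference position as follows. This Python example is an illustration under stated assumptions: each read is reduced to a (start position, observed base, Phred score) tuple, one read is kept per start position, and the function name `call_snv` is hypothetical. It is not the script used by the authors.

```python
from collections import Counter


def call_snv(reads, ref_base, min_phred=20, min_alt_reads=3, min_alt_fraction=0.20):
    """Apply the read-level filters and SNV thresholds at one position.

    reads: iterable of (start_position, base_at_site, phred_at_site).
    Returns (alt_base, alt_read_count, coverage) or None if no SNV is called.
    """
    seen_starts = set()
    bases = []
    for start, base, phred in reads:
        if phred < min_phred:
            continue                 # require a Phred score of 20 or more
        if start in seen_starts:
            continue                 # discard redundant reads (same start)
        seen_starts.add(start)
        bases.append(base)

    alt_counts = Counter(b for b in bases if b != ref_base)
    if not alt_counts:
        return None
    alt_base, n_alt = alt_counts.most_common(1)[0]
    coverage = len(bases)
    # More than three mutant reads and a mutant fraction above 20%.
    if n_alt > min_alt_reads and n_alt / coverage > min_alt_fraction:
        return alt_base, n_alt, coverage
    return None


# Made-up pileup: 4 reads show T, 6 reads show the reference C
pileup = [(i, "T", 30) for i in range(1, 5)] + [(i, "C", 30) for i in range(5, 11)]
print(call_snv(pileup, ref_base="C"))  # prints ('T', 4, 10)
```

Applying the quality filter before the redundancy filter is a design choice of this sketch, so that a low-quality read does not suppress a high-quality read with the same start position.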

3.9. Verification of Putative Sequence Variants with Novoalign and Bowtie

A major challenge of variant detection with RNA-seq is the reduction of false positive SNVs. High sequencing coverage of an observed variant can minimize incorrect variant calls. However, false positive SNVs can also result from mismapping of sequencing reads to highly homologous sequences (e.g., pseudogenes) present in the genome or transcriptome database. To tackle this problem, it is helpful to use multiple alignment algorithms with different mathematical scoring metrics. To this end, reads containing SNVs from the first alignment using BWA (Subheading 3.7) are extracted from the SAM file (e.g., using PERL) and are split based on exon structure using RefSeq or Ensembl transcriptome definitions as the reference. Reads shorter than 16 bp are discarded. The remaining reads are aligned to the human genome using the Bowtie and Novoalign mapping tools. Since both programs can generate alignment results in SAM format, the subsequent analyses can be performed with the same tools as used for the BWA alignment. A variant is declared confirmed if it is called by BWA and by either the Bowtie or the Novoalign algorithm. As determined by Sanger resequencing, the combination of BWA with Bowtie and Novoalign yields a true positive rate of up to 95% (6).
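The confirmation rule stated above (called by BWA and by at least one of Bowtie or Novoalign) is a simple set operation. In the Python sketch below, each variant is represented as a (chromosome, position, reference base, variant base) tuple; this representation, the function name, and the example coordinates are assumptions for illustration.

```python
def confirmed_variants(bwa_calls, bowtie_calls, novoalign_calls):
    """Return variants called by BWA and by Bowtie or Novoalign (or both)."""
    return set(bwa_calls) & (set(bowtie_calls) | set(novoalign_calls))


# Made-up calls
bwa = {("chr1", 1000, "C", "T"), ("chr2", 500, "G", "A"), ("chr3", 42, "A", "G")}
bowtie = {("chr1", 1000, "C", "T")}
novoalign = {("chr2", 500, "G", "A"), ("chr9", 7, "T", "C")}

for variant in sorted(confirmed_variants(bwa, bowtie, novoalign)):
    print(variant)
```

In this example the chr3 call is dropped because only BWA reports it, and the chr9 call is dropped because BWA does not report it.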

3.10. Annotation of Sequence Variants

To understand the biological effects of an SNV, the annotation of the observed variant is paramount. While the majority of SNVs discovered by RNA-seq are single nucleotide polymorphisms (SNP), the main focus in cancer biology lies on the identification of non-SNP mutations with the potential to activate or inactivate gene functions. In the first step of the annotation process, SNVs that correspond to known SNPs are identified based on their genomic coordinates. If an SNV occurs within the coding region of a transcript, a PERL script can be used to determine the resulting amino acid substitution. For this conversion, the relative position of a given SNV within the coding sequence is used to determine the resulting codon change. Using the genetic code, this codon change is translated into an amino acid substitution. The interpretation of non-synonymous SNVs can be facilitated using the SIFT (Sorting Intolerant from Tolerant) algorithm, which can predict the effect of amino acid substitutions (15). SIFT presumes that functionally important amino acids will be conserved within protein families. Thus, amino acid substitutions at conserved positions tend to be predicted as deleterious. SIFT also considers the type of amino acid change. Based on the amino acids appearing at each position in the alignment, SIFT calculates the probability that an amino acid at a position is tolerated, conditional on the most frequent amino acid being tolerated.
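The conversion from an SNV position within the coding sequence to an amino acid substitution can be sketched as follows. The chapter mentions a PERL script for this step; the Python version here is an illustrative stand-in that assumes the standard genetic code, a coding sequence that starts at the first base of the start codon, and a 0-based SNV position. The function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def amino_acid_change(cds, snv_pos, alt_base):
    """Return (substitution, is_synonymous) for an SNV in a coding sequence.

    cds: coding sequence starting at the start codon.
    snv_pos: 0-based position of the SNV within cds (assumed convention).
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    codon_start = (snv_pos // 3) * 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("SNV falls in an incomplete codon")
    offset = snv_pos - codon_start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    residue_number = snv_pos // 3 + 1       # 1-based amino acid position
    return f"{ref_aa}{residue_number}{alt_aa}", ref_aa == alt_aa


# Toy coding sequence ATG GAA GTT (Met-Glu-Val); A>T at the middle base of codon 2
print(amino_acid_change("ATGGAAGTT", 4, "T"))  # prints ('E2V', False)
```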

3.11. Digital Gene Expression

Digital gene expression can be derived from the initial transcriptome alignment by BWA. RNA-seq reads are assigned to genes based on the genomic coordinates assigned in Subheading 3.8 (Fig. 3). The total extent of a gene is considered to be the union of all exons that match RNA-seq reads. RPKM (reads per kilobase per million) is calculated from the total number of reads that map to the gene, the length of the gene, and the total number of reads from the test sample.

Fig. 3.

Digital gene expression. RNA-seq reads from the BWA alignment are assigned to genes based on their genomic coordinates using RefSeq and Ensembl mapping results from the UCSC Human Genome Browser Project. The total length of a gene is the sum of all exons covered by RNA-seq reads.

RPKM = Number of reads for a gene / (gene length [kb] × (Number of reads from analyzed sample / 1,000,000))
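As a worked example of the RPKM formula, the following Python sketch computes the value for a single gene. The function name rpkm and the read counts are hypothetical and chosen only for illustration; the gene length is assumed to be given in base pairs and is converted to kilobases inside the function.

```python
def rpkm(gene_reads: int, gene_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of gene length per million reads in the sample."""
    if gene_length_bp <= 0 or total_reads <= 0:
        raise ValueError("gene length and total read count must be positive")
    gene_length_kb = gene_length_bp / 1000
    million_reads = total_reads / 1_000_000
    return gene_reads / (gene_length_kb * million_reads)


# Illustrative numbers: 500 reads on a gene whose covered exons sum to
# 2,000 bp, in a sample with 20 million reads in total.
example = rpkm(gene_reads=500, gene_length_bp=2000, total_reads=20_000_000)
print(example)  # 500 / (2 kb * 20 million reads) = 12.5
```

Normalizing by both gene length and sample size makes values comparable between long and short genes and between samples sequenced to different depths.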

3.12. Immunoglobulin Gene Assembly

Most B-cell lymphomas express a B cell receptor (BCR) (16). The BCR is a membrane-bound antibody composed of two identical heavy chain and two identical light chain immunoglobulin (Ig) polypeptides, non-covalently bound to the signaling components CD79A and CD79B. Immunoglobulins consist of variable (V) regions and constant (C) regions. The heavy chain V-regions are composed of three gene segments (V, D, and J), whereas the light chain V-regions consist of two gene segments (V and J). During B cell development, these gene segments are assembled by somatic DNA rearrangement to encode a functional Ig. This Ig gene segment assembly results in a large diversity of non-germ-line encoded immunoglobulin molecules. To map and assemble Ig transcripts from RNA-seq, Ig germ-line segments including V, D, J and constant regions of heavy and light chains are downloaded from the ImMunoGeneTics (IMGT) database. The sequences in FASTA format are used to create a BLAST-compatible database with formatdb. To reduce computational complexity, only reads that were not aligned in the first alignment step (Subheading 3.7) are used for Ig gene assembly, since Ig genes are not contained in the RefSeq database. Reads with a median Phred score of more than 20 are extracted and aligned separately to the reference databases for V, D, J and constant region segments. The output of the BLAST alignment to the Ig V, D, J and C segments is parsed using PERL to determine the mapping of each read to Ig segments, and the number of reads mapping to each individual Ig segment is counted. Reads for the most prevalent V, D, J and constant region segments are extracted, merged and assembled separately for heavy chain and light chain using the SSAKE assembly tool.
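The read filtering and segment counting steps can be sketched in Python as shown below. The chapter performs the parsing in PERL; this example only illustrates the logic and rests on two assumptions that are not stated in the protocol: base qualities are encoded with an ASCII offset of 33, and the BLAST output is in 12-column tabular format with the read name in column 1, the Ig segment name in column 2 and the bit score in column 12. The function names and the example hits are hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Iterable


def median_phred(quality: str, offset: int = 33) -> float:
    """Median Phred score of a FASTQ quality string (offset 33 is assumed)."""
    return median(ord(char) - offset for char in quality)


def passes_quality(quality: str, threshold: float = 20) -> bool:
    """Keep reads whose median Phred score is greater than the threshold."""
    return median_phred(quality) > threshold


def count_segment_hits(blast_lines: Iterable[str]) -> Counter:
    """Count reads per Ig segment, using the best-scoring hit of each read."""
    best_hit = {}  # read name -> (segment name, bit score)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        read, segment, bit_score = fields[0], fields[1], float(fields[11])
        if read not in best_hit or bit_score > best_hit[read][1]:
            best_hit[read] = (segment, bit_score)
    return Counter(segment for segment, _ in best_hit.values())


# Illustrative (made-up) tabular BLAST hits against a V segment database.
hits = [
    "read1\tIGHV3-23*01\t98.0\t50\t1\t0\t1\t50\t1\t50\t1e-20\t90.0",
    "read1\tIGHV3-30*01\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-10\t60.0",
    "read2\tIGHV3-23*01\t96.0\t50\t2\t0\t1\t50\t1\t50\t1e-18\t85.0",
    "read3\tIGHV4-34*01\t94.0\t50\t3\t0\t1\t50\t1\t50\t1e-15\t75.0",
]
segment_counts = count_segment_hits(hits)
print(segment_counts.most_common(1))  # most prevalent V segment
```

The reads assigned to the most prevalent segment of each type would then be collected and passed to the assembler.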

Acknowledgments

This work was supported by the Dr. Mildred Scheel Stiftung für Krebsforschung (Deutsche Krebshilfe). We are grateful to Yuliya Kriga, Jyoti Shetty, Yongmei Zhao, John Powell, and George Wright, who were instrumental in establishing the protocols described here.

4. Notes

1. The kit should be used at room temperature and protected from light for the whole procedure.

2. Use the prepared gel–dye mix within 1 day.

3. Aliquots can be stored at −70°C. Thaw aliquots on ice.

4. It is recommended to heat denature all RNA samples and the RNA ladder before use.

5. The RIN of total RNA used for library preparation should be greater than or equal to 8.

6. Vortex the thawed RNA Purification Beads tube vigorously to completely resuspend the oligo-dT beads.

7. All beads should be attached to the side of the wells. The liquid should appear clear.

8. Remove all ethanol from the bottom of the wells. Residual ethanol can hamper downstream enzymatic reactions and may contain contaminants.

9. The Elute, Prime, Fragment Mix contains reaction buffer and random hexamers for priming of reverse transcription.

10. Leave the plate on the magnetic stand while performing the following 80% EtOH washes.

11. The gel–dye mix can be stored for 4 weeks at 4°C.

12. When pipetting the gel–dye mix, do not draw up particles that may sit at the bottom of the gel–dye mix vial. Insert the tip of the pipette to the bottom of the chip well when dispensing. This prevents a large air bubble from forming under the gel–dye mix. Placing the pipette at the edge of the well may lead to poor results.

13. Empty wells may cause improper running of the chip. Add 5 μL of DNA marker plus 1 μL of deionized water to each unused sample well.

14. Optimal quantification of cDNA libraries is achieved using an input DNA concentration of 1 pM.

15. If the concentration for any sample is greater than 50 nM, the sample should be diluted and rerun in the qPCR.

16. The index of the reference sequence database needs to be built only once unless a new version of BWA is used.

17. Conversion of the SAM file format into BAM format can be performed in SAMtools with the command: samtools view -bS yourSAMfile > yourBAMfile. Sorting and indexing: samtools sort yourBAMfile yourBAMfile.sorted and samtools index yourBAMfile.sorted.bam (with this sort syntax, SAMtools writes the sorted output to yourBAMfile.sorted.bam). These processes typically take a few hours.

References

1. Morin RD, Mendez-Lago M, Mungall AJ et al. (2011) Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 476:298–303
2. Pasqualucci L, Trifonov V, Fabbri G et al. (2011) Analysis of the coding genome of diffuse large B-cell lymphoma. Nat Genet 43:830–837
3. Ngo VN, Young RM, Schmitz R et al. (2011) Oncogenically active MYD88 mutations in human lymphoma. Nature 470:115–119
4. Puente XS, Pinyol M, Quesada V et al. (2011) Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475:101–105
5. Kridel R, Meissner B, Rogic S et al. (2012) Whole transcriptome sequencing reveals recurrent NOTCH1 mutations in mantle cell lymphoma. Blood 119:1963–1971
6. Schmitz R, Young RM, Ceribelli M et al. (2012) Burkitt lymphoma pathogenesis and therapeutic targets from structural and functional genomics. Nature 490:116–120
7. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
8. Mortazavi A, Williams BA, McCue K et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628
9. Frischmeyer PA, Dietz HC (1999) Nonsense-mediated mRNA decay in health and disease. Hum Mol Genet 8:1893–1900
10. Garber M, Grabherr MG, Guttman M et al. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8:469–477
11. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760
12. Langmead B, Trapnell C, Pop M et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
13. Larsson TP, Murray CG, Hill T et al. (2005) Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery. FEBS Lett 579:690–698
14. Robinson JT, Thorvaldsdottir H, Winckler W et al. (2011) Integrative genomics viewer. Nat Biotechnol 29:24–26
15. Kumar P, Henikoff S, Ng PC (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4:1073–1081
16. Rui L, Schmitz R, Ceribelli M et al. (2011) Malignant pirates of the immune system. Nat Immunol 12:933–940
