Significance
RNA sequencing (RNA-Seq) is a common tool for measuring relative gene expression levels. However, as an absolute quantitative tool, the data are prone to various distortions due to biases from library preparation. We improve the quantitative aspects of RNA-Seq by barcoding individual cDNA molecules to correct for amplification bias, distinguish clonal replicates, and obtain absolute measurements of gene expression. We have also developed a set of barcoded synthetic RNAs that can be spiked into samples as easy-to-use quantitative controls. Additionally, we demonstrate the combined use of capture enrichment with molecular barcoding for the sequencing of targeted genes. These results demonstrate low library preparation efficiency leading to the stochastic loss of low-abundance transcripts, which cannot be overcome by simply increasing sequencing depth.
Keywords: cDNA library, molecular barcoding, RNA-seq
Abstract
We present a simple molecular indexing method for quantitative targeted RNA sequencing, in which mRNAs of interest are selectively captured from complex cDNA libraries and sequenced to determine their absolute concentrations. cDNA fragments are individually labeled so that each molecule can be tracked from the original sample through the library preparation and sequencing process. Multiple copies of cDNA fragments of identical sequence become distinct through labeling, and replicate clones created during PCR amplification steps can be identified and assigned to their distinct parent molecules. Selective capture enables efficient use of sequencing for deep sampling and for the absolute quantitation of rare or transient transcripts that would otherwise escape detection by standard sequencing methods. We have also constructed a set of synthetic barcoded RNA molecules, which can be introduced as controls into the sample preparation mix and used to monitor the efficiency of library construction. The quantitative targeted sequencing revealed extremely low efficiency in standard library preparations, which was further confirmed by using synthetic barcoded RNA molecules. This finding shows that standard library preparation methods result in the loss of rare transcripts and highlights the need for monitoring library efficiency and for developing more efficient sample preparation methods.
RNA sequencing (RNA-Seq) is a powerful method for the measurement of global gene expression (1, 2). As a discovery tool, the method has dramatically increased our knowledge of the transcriptome, providing new insights into transcript diversity, including the discovery of new structural variants such as alternative splicing, gene fusions or rearrangements, and low-expressed molecules. As a profiling tool, the method is primarily challenged by the large dynamic range of expression levels of mRNAs in a library. Sequencing of millions to tens of millions of copies of high-abundance transcripts is required to detect rare transcripts of interest (3, 4). To compensate, increased numbers of reads are often used, despite the low efficiency of this strategy. Reports estimate that ∼40 million reads may be required for the reliable measurement of gene expression for transcripts of high and moderate abundance, and as many as 500 million reads may be required to cover the full sequence diversity of a complex transcript library (1, 5, 6). In routine use, sparse coverage is further compromised by the practice of multiplexing samples in a single RNA-Seq run, primarily driven by cost constraints when designing studies involving large numbers of samples, such as those required in clinical applications.
To efficiently sample the rare or low-abundance isoforms of transcripts, capture methods have recently been used (7, 8). These promising methods use a targeted strategy whereby genomic regions of interest are enriched through hybridization capture and amplification before sequence sampling. When applied to generate the sequencing library for RNA-Seq, the method can effectively enrich rare transcripts and reduce inefficiencies associated with sampling large numbers of abundant transcripts present in an unmodified library. However, because targeted sequencing procedures introduce biases caused by unequal capture and amplification efficiencies across different genes, accurate gene expression measurements cannot be assured (7).
We have incorporated a molecular indexing and capture-enrichment strategy into RNA-Seq in which individual cDNA fragments are labeled with a diverse set of molecular indexing adaptors during library preparation and before any PCR amplification. Our method is based on the concept of stochastic labeling (9) and other similar molecular-tagging or identification strategies (10–15). Each cDNA molecule is labeled at random from a reservoir of ∼10,000 distinct molecular-indexing barcode combinations. Individual cDNA fragments of identical sequence become distinguishable after molecular indexing and can be tracked in subsequent analysis. Once cDNA molecules are indexed, information on the abundance levels of each molecule is retained within the population of molecules. Biases and nucleotide-calling errors from PCR amplification or sequencing steps can be detected and corrected. Reads for new observations can be distinguished from the resampling of clonal replicates, and, upon sufficient sequencing depth, the entire library can be analyzed to determine absolute numbers of each cDNA.
We show that the combination of molecular indexing with RNA-Capture can be used quantitatively to count the absolute copy numbers of abundant or rare RNA transcripts within a library. By using a set of indexed control cDNA sequences, we show that multiple targets can be captured from a sequencing library and reamplified as desired without concern for distortions in the absolute abundance or relative ratios. In addition, digital PCR experiments and indexed reference molecules in the RNA starting material directly measure the efficiency of RNA-Seq library construction. These measurements indicate an extremely low efficiency in the standard RNA-Seq library preparation method, which leads to the exclusion of low-abundance transcripts during library construction. Because library preparation is independent of sequencing depth, poor efficiency renders many low-abundance transcripts nondetectable. Use and implications of the molecular indexing approach are discussed.
Results
Construction of Molecular Indexing RNA-Seq Libraries.
RNA-Seq libraries were prepared by using standard protocols recommended by Illumina, with substitution of the molecular indexing adaptors for standard ligation adaptors (Materials and Methods). Briefly, mRNAs were purified and fragmented, and cDNA fragments were synthesized and end-repaired, followed by the adenylation of 3′ ends. Two sets of 96 3′T overhang double-stranded ligation adaptors harboring either PCR primer A or B annealing sites were produced (Fig. 1A) with an 8-bp error-correcting sequence barcode (16, 17) at the ligating end of each of the 96 adaptors. In the ligation reaction, these adaptors are present in vast excess over the concentration of cDNA fragments and serve as a nondepleting reservoir of molecular labels. Each end of a 3′A overhang cDNA fragment ligates to one of the 96 labels at random, generating 9,216 (96 × 96) possible label combinations. We have found this method to be sufficient for most RNA-Seq measurements, because exceeding 9,216 copies of identical cDNA fragments for any given mRNA transcript is highly unlikely. Paired-end reads reveal the molecular index combination along with the adjoining cDNA fragment sequence of an mRNA transcript (Fig. 1B). If desired, another set of 8-bp sample-specific barcodes can be appended during PCR. Although the use of randomized nucleotides as molecular barcodes has been reported (10–14), we avoided this strategy because some sequence combinations could be problematic, and the presence of sequencing errors could cause barcode identification ambiguity.
Fig. 1.

(A) Illustration of the addition of molecular indices to cDNA molecules. Each end of the 3′ adenylated cDNA fragment (light blue bars) ligates randomly to one of 96 possible 3′ T overhang A or B adaptors (purple or dark blue bars). On each adaptor, the molecular index is represented as an 8-nucleotide sequence barcode (yellow). cDNAs with ligated adaptors can be amplified by using PCR primers A and B. If desired, a sample index barcode (brown bar) can be added to primer A. (B) PCR-amplified clones can be sequenced by using primers corresponding to the adaptor sequences. (C) An outline of the experiment workflow is shown.
Sequence Capture Enrichment for Targeted RNA-Seq.
We prepared molecular-indexing RNA-Seq libraries using 500 ng of human lymphocyte total RNA as input (Fig. 1C). External RNA Controls Consortium (ERCC) RNAs were spiked into the input RNA sample to serve as an internal control (18–20). These RNAs are formulated to contain 92 poly-adenylated synthetic RNA transcripts spanning a 10⁶-fold range in concentration. Upon completion of the library preparation steps, we estimated the total number of resulting PCR-amplified fragments in the library by quantitative PCR (qPCR). A sequencing run was performed on the MiSeq instrument, and 2.9 × 10⁶ mapped paired-end fragments were obtained. For targeted deep sequencing, we performed sequence enrichment capture for a subset of seven ERCC RNAs (Materials and Methods). The postcapture library was sequenced, and 4.5 × 10⁶ mapped paired-end fragments were obtained. Enrichment was demonstrated by an alignment plot confirming increased amplitude of fragments corresponding to the capture probe used (Fig. S1). Molecular indexing enabled an absolute count of each of the mRNA transcripts present in the library. We counted unique molecules by the following rationale. When a group of sample RNA transcripts of identical sequence is fragmented, and in the absence of any clonal replication, fragments that overlap each other by at least 1 nucleotide are determined to originate from different copies of a transcript (Materials and Methods). After amplification, this determination is complicated by clonal replication, and in conventional RNA-Seq, distinct molecules cannot be distinguished from clonal duplicates when multiple fragments of identical sequence are present. With the added information gained from molecular indexing, fragments of identical sequence become distinct, and resampling of clonal duplicates can be identified (Fig. S2).
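The counting rationale above can be illustrated with a minimal Python sketch. The record layout (start, stop, index A, index B) and the function name are hypothetical and are not taken from the authors' pipeline; the sketch also ignores the 1-nt overlap requirement and the decoding of error-correcting barcodes.

```python
from collections import Counter

def count_molecules(fragments):
    """Summarize mapped paired-end fragments for one transcript.

    Each fragment is a hypothetical tuple (start, stop, index_a, index_b):
    the mapped start and stop coordinates plus the two adaptor barcodes.
    """
    # Fragments that share coordinates AND both barcodes are treated as PCR clones.
    clones = Counter(fragments)
    # Without molecular indices, only the coordinates are available.
    sites = {(start, stop) for start, stop, _, _ in fragments}
    return {
        "reads": len(fragments),
        "unique_sites": len(sites),
        "unique_molecules": len(clones),
    }

reads = [
    (100, 350, 12, 47),  # molecule 1
    (100, 350, 12, 47),  # clonal duplicate of molecule 1
    (100, 350, 3, 88),   # same start/stop, different indices: molecule 2
    (120, 330, 12, 47),  # different start/stop: molecule 3
]
summary = count_molecules(reads)
print(summary)
```

In this toy input, counting by start/stop sites alone would report two molecules, whereas the index-aware count reports three.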
Using the molecular-indexing information, we determined the absolute number of RNA transcripts contained in the RNA-Seq library for a set of seven captured ERCC RNAs. As listed in Table 1, the seven ERCC RNA species spiked into the initial sample ranged from 8,790 to 9,000,000 copies, and the number of paired-end reads acquired from RNA-Seq that mapped to each of these species ranged from 3,080 to 1,059,847. From these sequencing reads, the absolute number of transcripts of each ERCC RNA species contained in the RNA-Seq library was counted and ranged from 17 to 41,331. Therefore, in this experiment, each unique ERCC transcript in the library is sequenced on average 26–224 times (e.g., 1,059,847/41,331 or 40,592/181). Although it is commonly assumed that fragments with the same start and stop sites are PCR replicates, these results do not support this assumption. Molecular indexing demonstrates that far more often than expected, distinct fragments share the same start/stop sites. For example, in ERCC136, a total of 3,579 unique transcript fragments shared 2,255 start/stop sites. However, ∼147,534 [(1,033 − 299) × 201] fragment frames are possible for fragments of 200- to 400-bp size range that are derived from a 1,033-bp RNA, assuming a uniform distribution of break points. The highest abundance control ERCC130 had an even greater number of identical fragments (41,331 distinct cDNA fragments with 11,206 start/stop sites). These observations suggest that one or more of the processes (RNA fragmentation, random priming, or reverse transcription) is not random and that considering start/stop sites alone is not always a reliable criterion for identifying clonal duplicates.
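As a rough check on the frame count quoted above, the following Python sketch enumerates the possible start/stop pairs for a 1,033-bp transcript. It then estimates how many distinct start/stop sites 3,579 independent fragments would be expected to occupy if every frame were equally likely. The function names and the uniform model are illustrative assumptions, not part of the original analysis.

```python
import math

def fragment_frames(transcript_len, min_len=200, max_len=400):
    """Number of (start, stop) pairs for fragments of min_len..max_len bp."""
    return sum(
        transcript_len - frag_len + 1
        for frag_len in range(min_len, min(max_len, transcript_len) + 1)
    )

def expected_occupied(frames, fragments):
    """Expected distinct frames hit when fragments fall uniformly at random."""
    # Numerically stable form of frames * (1 - (1 - 1/frames) ** fragments).
    return frames * -math.expm1(fragments * math.log1p(-1.0 / frames))

frames = fragment_frames(1033)                   # ERCC136 is 1,033 bp
print(frames)                                    # 147534, i.e. (1033 - 299) * 201
print(round(expected_occupied(frames, 3579)))    # compare with the 2,255 sites observed
```

Under this simplified uniform model, nearly all of the 3,579 fragments would be expected to fall on distinct sites, whereas only 2,255 sites were observed, which is in line with the conclusion that fragmentation or priming is not random.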
Table 1.
Absolute quantitation of ERCC transcripts in the library by molecular indexing and sequence capture
| RNA | Length | Copies of input RNA in library | Paired-end reads | Unique start/stop sites* | Unique transcripts† | Yield‡ |
| ERCC130 | 1,059 | 9,000,000 | 1,059,847 | 11,206 | 41,331 | 0.0046 |
| ERCC136 | 1,033 | 562,500 | 310,315 | 2,255 | 3,579 | 0.0064 |
| ERCC108 | 1,022 | 281,250 | 76,479 | 1,198 | 1,603 | 0.0057 |
| ERCC116 | 1,991 | 140,625 | 40,592 | 157 | 181 | 0.0013 |
| ERCC092 | 1,124 | 70,314 | 36,347 | 263 | 308 | 0.0044 |
| ERCC095 | 521 | 35,157 | 5,565 | 41 | 42 | 0.0012 |
| ERCC019 | 644 | 8,790 | 3,080 | 15 | 17 | 0.0019 |
*The number of sequenced clones of different start/stop sites and overlapping by at least a single nucleotide.
†The number of sequenced clones of different start/stop sites with distinct molecular indexing and overlapping by at least a single nucleotide.
‡The ratio of resulting transcripts in the library to the total number of copies added to the sample used for library preparation.
Strikingly, this experiment revealed significant losses of the input RNA, resulting in a low overall efficiency of 0.12% (42/35,157) to 0.64% (3,579/562,500), as shown in Table 1. For every 1,000 copies of a transcript in the starting sample, only 1–6 copies remained in the sequencing library.
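The yields quoted above follow directly from the counts in Table 1. The short Python sketch below recomputes them; the dictionary layout and variable names are chosen here for illustration only.

```python
# (copies spiked into the sample, unique transcripts counted) from Table 1
table1 = {
    "ERCC130": (9_000_000, 41_331),
    "ERCC136": (562_500, 3_579),
    "ERCC108": (281_250, 1_603),
    "ERCC116": (140_625, 181),
    "ERCC092": (70_314, 308),
    "ERCC095": (35_157, 42),
    "ERCC019": (8_790, 17),
}

# Yield = transcripts recovered in the library / copies added to the sample
yields = {name: counted / spiked for name, (spiked, counted) in table1.items()}

for name, y in yields.items():
    print(f"{name}: yield {y:.4f} ({y * 1000:.1f} per 1,000 input copies)")

lowest = min(yields, key=yields.get)
highest = max(yields, key=yields.get)
print(lowest, highest)
```

The lowest and highest values correspond to ERCC095 and ERCC136, the two species used for the 0.12% and 0.64% figures in the text.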
Estimating RNA-Seq Library Preparation Efficiency by Digital PCR.
It is possible that the low library efficiency could be due to cumulative losses from the large number of steps in the library preparation protocol. The 14 sequential steps in the protocol are listed in Table 2. Assuming a moderately efficient stepwise yield of 70% for each of the 12 processing steps that follow RNA input (steps 2–13), and a 50% size selection recovery at the 14th step, we obtain an overall efficiency of ∼0.7%, close to the experimental results (0.1–0.6%) from targeted sequencing.
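The cumulative estimate can be reproduced with a few lines of Python. Table 2 assigns the RNA input (step 1) a yield of 1, so the product runs over twelve 70% steps and one 50% step; the variable names below are illustrative, and the 10,000,000 starting GAPDH copies are the estimate given in Table 2.

```python
import math

# Step 1 (RNA input) = 1.0, steps 2-13 = 0.7 each, step 14 (size selection) = 0.5
step_yields = [1.0] + [0.7] * 12 + [0.5]

overall = math.prod(step_yields)
print(f"overall efficiency: {overall:.4f}")        # 0.0069, i.e. about 0.7%

# Estimated GAPDH copies remaining after step 14, starting from 10,000,000
remaining = round(10_000_000 * overall)
print(remaining)                                   # 69206
```

Even with a fairly optimistic 70% recovery at every step, fewer than 1 in 100 input molecules is expected to reach the size-selected library.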
Table 2.
Estimation and digital PCR validation of RNA-Seq library preparation stepwise yields
| Step | Procedure | Est. yield* | Est. copies of GAPDH transcripts† | Digital PCR-measured GAPDH transcripts‡ | Est. total no. of all DNA fragments§ |
| 1 | Total RNA input | 1 | 10,000,000 | 13,400,000 | 53,600,000,000 |
| 2 | Poly(A) RNA isolation | 0.7 | 7,000,000 | | |
| 3 | RNA fragmentation | 0.7 | 4,900,000 | | |
| 4 | Ethanol precipitation purification | 0.7 | 3,430,000 | | |
| 5 | Random primer reverse transcription | 0.7 | 2,401,000 | 772,800 | 3,091,200,000 |
| 6 | Second strand cDNA synthesis | 0.7 | 1,680,700 | | |
| 7 | Qiagen column purification | 0.7 | 1,176,490 | | |
| 8 | End-repair | 0.7 | 823,543 | | |
| 9 | Qiagen column purification | 0.7 | 576,480 | | |
| 10 | 3′A base addition | 0.7 | 403,536 | | |
| 11 | Qiagen column purification | 0.7 | 282,475 | | |
| 12 | Adaptor ligation | 0.7 | 197,733 | | |
| 13 | Qiagen column purification | 0.7 | 138,413 | | |
| 14 | Electrophoresis size selection (Pippin prep) | 0.5 | 69,206 | 240,000 | 960,000,000 |
| 15 | Viable templates for PCR amplification | | | 3,840¶ | 15,360,000‖ |
| | Cumulative yield** | | 0.0069 | 0.0003 | |
*Estimated efficiency for each listed step, including an assumed moderately efficient stepwise yield of 70% for each of steps 2–13 and an assumed 50% size selection recovery at the 14th step.
†Based on stepwise efficiency estimates, 200 copies of GAPDH and 10 pg of total RNA per cell, and 500-ng input of total RNA.
‡The number of copies determined by digital PCR measurement starting with 500 ng of input total RNA (average of triplicate experiments).
§Assuming four cDNA fragments per transcript and GAPDH abundance at 1:1,000 of all transcripts.
¶Calculated from ‖ (cannot be measured directly with digital PCR).
‖Measured by digital PCR using primers corresponding to ligated adaptors on cDNA fragment ends.
**Ratio of resulting transcripts or fragments in the library over the starting amount in the input sample.
In an attempt to directly assess the losses in the protocol, we carried out digital PCR on GAPDH and measured 1.3 × 10⁷ copies in the input RNA sample, corresponding to ∼0.1% of all of the transcripts by mass (Table 2). After the first five steps of library preparation (which include RNA isolation, fragmentation, purification, and reverse transcription), 5.8% (772,800/13,400,000) were left, suggesting an average stepwise yield of ∼60% for the first five steps. We also interrogated the library after the 14th step with digital PCR and found 1.8% (240,000/13,400,000) of the initial molecules remaining.
However, not all of these 240,000 molecules are viable templates for the final PCR amplification (15th) step because some were not successfully modified by the earlier enzymatic steps (second-strand cDNA synthesis, end repair, 3′A addition, or adaptor ligation) and, as a result, lack the necessary adaptors required for the final PCR amplification. To estimate the number of viable GAPDH templates, we used the adaptor primers for digital PCR and determined that there were a total of 1.5 × 10⁷ viable cDNA molecules, regardless of gene identity. Assuming that, on average, four cDNA fragments were generated from each full-length RNA transcript and 0.1% of all cDNAs were GAPDH, we estimate that 3,840 (15,360,000/4/1,000) viable GAPDH transcripts remain (Table 2). This result implies an overall efficiency of ∼0.03%, 3–20 times lower than the results from sequencing analysis. This discrepancy could be due to assay interference from excess Y-adaptor primers remaining after size selection (melting curve analysis and polyacrylamide gel electrophoresis on individual digital PCR products clearly demonstrated a large number of these interfering artifacts). The lack of suitable measurement methods motivated us to develop a more reliable and precise technique to measure the library efficiency.
Direct Encoding and Counting of Individual RNA Molecules to Monitor Library Efficiency.
A more straightforward way to measure library preparation efficiency is to add a known number of barcoded RNA molecules into the sample and determine how many make it through the library preparation steps. We created a set of 960 sequence-barcoded synthetic control RNAs (Fig. 2) and mixed it into the sample for processing through the library preparation steps. The number of barcoded RNA molecules in the input sample, the proportion successfully converted to cDNA, and the viable sequencing templates that remained in the final library were determined by PCR amplification and counting of the barcode tags (Fig. 2). Dye-labeled PCR products were directly hybridized to a detector of oligonucleotides complementary to the set of barcodes and scored by fluorescence imaging. By using this approach, library efficiencies for each sample were determined, demonstrating an overall yield of 2–3 transcripts for every 1,000 introduced into the sample (Table 3), approximately equivalent to the efficiency measured from quantitative targeted sequencing. This direct-counting endpoint PCR approach served as a useful internal control for the monitoring of RNA-Seq library preparation efficiency.
Fig. 2.

Library efficiency determination by direct spike-in of synthetic, barcoded, poly-adenylated RNA transcripts. Molecules of RNA are identical in sequence except for a 21-nt-long barcode sequence (molecular index) embedded within each transcript. A total of 960 barcodes were created, and each can be identified by RT-PCR amplification with dye-labeled PCR primers, followed by hybridization detection on an array of oligonucleotide probes. Arrows show the location of the PCR primers used.
Table 3.
Library efficiency determination by indexed RNA molecules
| Control RNA input | Low | Medium | High |
| Starting amount of liver total RNA input (∼50,000 cells), ng | 500 | 500 | 500 |
| Copies of indexed RNA molecules spiked in* | 200,000 | 2,000,000 | 20,000,000 |
| Resulting copies after first strand cDNA synthesis† | 4,510 | 25,010 | 246,820 |
| Resulting copies in final library† | 600 | 3,000 | 50,100 |
| Yield‡ | 0.003 | 0.002 | 0.003 |
*Added to the liver RNA before any library preparation steps.
†Individual copies counted by PCR amplification followed by detection of molecular index by hybridization to printed array.
‡Ratio of cDNA copies to input RNA copies.
Discussion
Quantitative measurements of low-abundance transcripts by massively parallel sequencing are challenging because of the Poisson sampling nature of the method. To increase the sequencing efficiency of rare molecules, capture strategies have been used to enrich for specific transcripts, usually at the cost of distorting relative transcript levels. We have shown that the combination of molecular indexing with sequence capture uniquely identifies cDNAs in RNA-Seq libraries and can be used to determine the transcript concentration independent of capture bias and PCR amplification distortions. This targeted approach should prove useful for studies in which the sampling of low-abundance mRNA is critical or in which a low sample input requires PCR amplification.
The incorporation of molecular indexing also allows us to ascertain the absolute numbers of transcripts present in a library. Quantitative experiments reveal that only a small portion of the initial RNA successfully converts to sequencing-ready cDNA templates. This low conversion rate was independently confirmed by directly adding barcoded synthetic RNA controls into the sample. We attribute the low efficiency to cumulative losses arising from the multiple steps involved in library preparation, directly impacting library complexity. For example, if 100 copies of a particular RNA are present in the sample, fewer than one copy on average (0.1–0.6) would be expected to survive at a library preparation efficiency between 0.1% and 0.6%. At these low efficiencies, stochastic loss of various low-abundance transcripts can be significant, and because many transcripts may be excluded from the library, information cannot be recovered by increasing sequencing depth.
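The scale of this stochastic loss can be sketched in Python under the simplifying assumption that each input copy survives library preparation independently with a probability equal to the library efficiency. This binomial model and the function name are assumptions made here for illustration; the source does not specify a loss model.

```python
def prob_all_lost(copies, efficiency):
    """Probability that none of `copies` molecules survives, assuming
    independent survival with probability `efficiency` per molecule."""
    return (1.0 - efficiency) ** copies

copies = 100
for efficiency in (0.001, 0.006):
    expected = copies * efficiency
    lost = prob_all_lost(copies, efficiency)
    print(f"efficiency {efficiency:.1%}: expected survivors {expected:.1f}, "
          f"P(transcript absent from library) = {lost:.2f}")
```

Under this model, a transcript present at 100 copies is more likely than not to be missing from the library across the measured efficiency range, and no amount of additional sequencing can recover it.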
It is useful to consider the implications of low library efficiency in the context of a typical RNA-Seq experiment. Although in early experiments, it was routine to use 1 µg or more of total RNA as the input sample (1), more recent emphasis on cellular specificity has driven many investigators to use 50 ng or less total RNA. Based on the average value of 10 pg of RNA per cell, this value represents ∼5,000 cell equivalents. Assuming ∼200,000 mRNAs in a single cell, we can expect ∼1 × 10⁹ mRNA molecules in the sample. At an overall efficiency of ∼0.1%, the resulting library will be on the order of 1 × 10⁶ molecules (ignoring RNA fragmentation for simplicity). Based on Poisson sampling, a library of 1 × 10⁶ fragments quickly reaches saturation well before 2 × 10⁷ reads (20× sampling) (Fig. S3). At this sequencing depth, effectively all unique molecules in the library have been sampled, and very little new information will be obtained by additional sequencing. Any transcripts that were present in the starting material at less than ∼1,000 copies are simply lost. This problem is further exacerbated with studies on alternative splicing, allelic variation, or rare isoforms or where the transcripts of interest are limited to a subset of the cells.
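The back-of-the-envelope numbers in this paragraph can be laid out as a short Python calculation. All inputs are the round figures quoted in the text; the function and key names are illustrative.

```python
def library_estimate(input_rna_ng, rna_per_cell_pg=10, mrna_per_cell=200_000,
                     efficiency=0.001):
    """Estimate library size from RNA input, using the round numbers in the text."""
    cells = input_rna_ng * 1000 / rna_per_cell_pg    # ng -> pg, then cell equivalents
    mrna_molecules = cells * mrna_per_cell           # mRNA molecules in the sample
    library_molecules = mrna_molecules * efficiency  # molecules surviving library prep
    # Input copy number at which one surviving molecule is expected on average
    copies_per_survivor = 1 / efficiency
    return {
        "cells": cells,
        "mrna_molecules": mrna_molecules,
        "library_molecules": library_molecules,
        "copies_per_survivor": copies_per_survivor,
    }

estimate = library_estimate(50)   # 50 ng of total RNA
for key, value in estimate.items():
    print(f"{key}: {value:,.0f}")
```

The last value corresponds to the ∼1,000-copy threshold in the text: below it, fewer than one molecule per transcript is expected to reach the library.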
Given the rapid evolution of sequencing technologies, and the desire to study small samples to the limit of single cells, more efficient library preparation methods are required. Several new approaches have recently been introduced with fewer steps and should provide higher library efficiency (21, 22). Although our method labels and tracks cDNA molecules through PCR amplification, other upstream biases created before cDNA synthesis and adaptor ligation are not detected. Therefore, future improvements may include the direct indexing of RNA molecules during reverse transcription using barcoded cDNA synthesis primers. Special attention should also be devoted to RNA isolation and poly-A enrichment procedures, which are other potential sources of distortion (20, 23).
Materials and Methods
Molecular Index Library Preparation.
First, 500 ng of Human Normal Lymphocyte Total RNA (Biochain) was mixed with 1 µL of a 1–100 dilution of ERCC control RNA (Life Technologies). The sample was processed by using the mRNA Seq Sample Prep Kit (Illumina) following the manufacturer-recommended protocol from purification and mRNA fragmentation through the adenylation of end-repaired cDNA fragments and cleanup. The subsequent adapter ligation step was performed in 1× T4 rapid ligase buffer (Illumina) with 1 µM molecular indexing adapters (Cellular Research) and 600 units of T4 DNA ligase (Illumina) at room temperature for 15 min. After a Qiagen PCR purification step, one half of the eluate (30 µL) was run on a Pippin cassette (Sage Science) for extraction of the 200- to 400-bp fragments. The collected sample was cleaned with AMPure XP beads (Beckman Coulter) and eluted in 50 µL of 10 mM Tris buffer, pH 8. A 50-µL PCR was run in 1X ThermoPol buffer (NEB) with 0.2 mM dNTPs, 0.2 µM CR P1 primer, 0.2 µM CR IDX D1 primer (Cellular Research), 1 unit of Vent R (exo-) DNA Polymerase (NEB), and 5 µL of purified template cDNA. The thermocycler was set to run the following program: 72 °C for 2 min, 94 °C for 1 min, 15 cycles of 94 °C for 15 s, 60 °C for 15 s, and 72 °C for 30 s, then 72 °C for 2 min and 4 °C hold. After PCR, the sample was cleaned by using AMPure XP beads (Beckman Coulter) and eluted in 30 µL of Qiagen buffer EB. For quality-control assessment, the sample was run on the Agilent Bioanalyzer (Agilent Technologies). A paired-end 150-bp sequencing run was performed on the MiSeq (Illumina) with a final library concentration of 6 pM as determined by qPCR (KAPA Biosystems).
Sequence-Specific Target Capture.
Capture probes for seven representative ERCC control transcripts were generated by PCR and biotinylated by using Terminal transferase labeling with DLR (Affymetrix). Forty-three percent of the original library was amplified by PCR for 15 cycles in preparation for annealing to the capture probes. Each 14-µL ERCC transcript capture reaction used 200 ng of reamplified library, 100 ng of target capture probe, 20 µg of denatured herring sperm DNA, and 2.4 µM P1 and IDX D1 blocking primer (Cellular Research). The reaction was incubated at 95 °C for 5 min and then placed on ice. Six microliters of 20× saline-sodium phosphate EDTA (SSPE) buffer, pH 7.4 and 2 µL of 1% SDS were added, and the reaction was incubated at 65 °C for 24 h. The annealed sample was added to 500 ng of prewashed MyOne Streptavidin C1 beads (Invitrogen) and incubated for 30 min at room temperature. Beads were captured with a magnet and washed once with 0.5 mL of 1× SSC and 0.1% SDS for 15 min at room temperature. This step was followed by three washes each for 10 min at 65 °C with prewarmed 0.1× SSC and 0.1% SDS, with a full resuspension of beads after each wash. The supernatant was discarded, and the beads were resuspended in 100 µL of PCR mix containing 1× Taq buffer, 0.2 mM dNTPs, 0.2 µM P1 and IDX D1 primers, and 5 units of Taq Polymerase. After 15 cycles of PCR, the tubes were placed on a magnet, and the sample was aspirated to a new tube and cleaned by using a modified AMPure XP bead cleanup protocol (1:1 ratio of beads to sample). The purified product was eluted in 20 µL of water and quantitated on the NanoDrop Spectrophotometer (Thermo Scientific). Samples were pooled and checked on the Agilent Bioanalyzer, and a paired-end 150-bp sequencing run was performed on the MiSeq (Illumina) with a final library concentration of 6 pM as determined by qPCR (KAPA Biosystems).
Digital PCR Validation of RNA-Seq Library Preparation Stepwise Yields.
Three replicates of 500 ng each of human liver total RNA were processed by using the mRNA-Seq Sample Prep Kit (Illumina). Sample aliquots were collected at selected steps. An outline of the procedure is as follows. Poly(A) RNA was isolated from the total RNA with a two-step magnetic bead protocol. The resulting mRNA was fragmented in a buffer containing divalent cation at 94 °C for 5 min and purified by ethanol precipitation. The RNA was resuspended in 11.1 µL of water, and a reverse transcription reaction was performed following manufacturer’s instructions. To each sample, 62.8 µL of water was added to dilute the reaction. At this point, a 2-µL aliquot was taken for digital PCR analysis (24) from each reaction. A 100-µL second-strand synthesis reaction was performed following manufacturer’s instructions. The samples were purified with a PCR Purification Kit (Qiagen) and eluted in 50 µL of buffer EB (Qiagen). After second-strand synthesis, an end-repair reaction was carried out, followed by a cleanup with a PCR purification Kit (Qiagen) and elution in 32 µL of buffer EB (Qiagen). The repaired ends were adenylated in a 50-µL reaction with 32 µL of eluted DNA, 5 µL of A-tailing buffer, 10 µL of 1 mM dATP, and 3 µL of Klenow exo-enzyme at 37 °C for 30 min. The reactions were cleaned with a MinElute PCR Purification Kit (Qiagen) and eluted in 23 µL of buffer EB (Qiagen). Y adaptors from the Illumina library preparation kit were ligated following kit instructions for 15 min followed by a cleanup using a PCR Purification Kit (Qiagen) and elution in 60 µL of buffer EB (Qiagen). Thirty microliters of sample was run on the Pippin Prep gel extractor (Sage Science) for purification of the 200- to 400-bp fragments and eluted in 53 µL of electrophoresis buffer. At this point, a 2-µL aliquot from each reaction was taken for digital PCR analysis and Picogreen quantitation.
One-half of the eluate was used in a 50-µL amplification reaction with Phusion DNA polymerase (NEB) for 15 cycles. Products were purified with AMPure beads (Beckman Coulter) using a 1:1 beads-to-sample ratio and eluted in 15 µL of buffer EB (Qiagen). The library was validated on the Bioanalyzer (Agilent Technologies). Digital PCR measurements were carried out on the ABI 7300 real-time PCR instrument (Life Technologies) by using the iTaq SYBR green PCR mix (Bio-Rad). Digital PCR primers amplified either the ligated sequencing adaptors or GAPDH.
Library Efficiency Determination by Indexed RNA Molecules.
A set of 960 barcoded DNAs (1,092 bp) was generated by PCR (Fig. 2). Each of the 960 DNAs consisted of a T7 RNA promoter sequence at the 5′ end and a poly(A) 3′ end tail and was identical to the others, except for a unique 21-bp molecular index sequence at positions 728–748. The barcoded DNAs were used as templates in a bulk T7 in vitro transcription (IVT) reaction to synthesize a corresponding set of 960 indexed RNA molecules. After IVT, the remaining DNA template in the reaction was destroyed by DNase I digestion, and the RNA was purified by using RNeasy (Qiagen) and quantitated by measuring absorbance at 260 nm. This bulk preparation of the IVT-synthesized RNA migrates as a single-molecular-weight species of the expected size. Presence of each of the 960 barcodes in the synthesized RNA was verified by RT-PCR analysis and determined to be approximately equimolar (data not shown).
The synthetic barcoded control RNA was diluted, and 2 × 10⁵, 2 × 10⁶, or 2 × 10⁷ copies were mixed into 500 ng of liver total RNA for library preparation using the Illumina mRNA Seq Sample Prep Kit. An aliquot was removed immediately after first-strand cDNA synthesis and also from the final library, and serial dilutions of the aliquots were amplified with PCR by using F1 and R1 primers specific for the synthetic RNA. Nested PCR with a dye-labeled primer was performed, and products were hybridized to a custom array of oligonucleotide probes complementary to the 960 RNA barcodes (Cellular Research). Fluorescence scans were carried out on a custom imaging system (Cellular Research), and intensity threshold cutoffs were applied to count detected barcodes (9). For a diversity of 960 barcode types (m), the number of copies of the control RNA (n) could be accurately determined from the number of different barcodes detected when n < m in the F1/R1 amplification reaction. When n approaches and exceeds m, the likelihood of different molecules with the same barcode type increases, resulting in reduced precision in measurements of n (discussed in detail in ref. 9). Therefore, to increase precision, dilution can be used to reduce n before PCR amplification, and the resulting calculation of n is corrected by the dilution factor.
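One way to turn a count of detected barcodes into an estimate of n is sketched below in Python. It assumes that molecules are distributed uniformly over the m = 960 barcode types, so that the expected number of distinct barcodes is m[1 − (1 − 1/m)^n], and inverts that expression. This estimator and the function name are assumptions made for illustration; the exact procedure used in the study is described in ref. 9.

```python
import math

def estimate_copies(k_detected, m_barcodes=960, dilution_factor=1.0):
    """Estimate the number of molecules in the undiluted aliquot from the
    number of distinct barcodes detected after dilution."""
    if not 0 <= k_detected < m_barcodes:
        # With every barcode detected the estimate diverges: dilute further.
        raise ValueError("k_detected must satisfy 0 <= k < m_barcodes")
    # Invert E[k] = m * (1 - (1 - 1/m) ** n) for n.
    n = math.log1p(-k_detected / m_barcodes) / math.log1p(-1.0 / m_barcodes)
    # Correct for the dilution applied before PCR amplification.
    return n * dilution_factor

# Example: 100 distinct barcodes detected in a 1:10 dilution of the aliquot
n_diluted = estimate_copies(100)
n_original = estimate_copies(100, dilution_factor=10)
print(round(n_diluted, 1), round(n_original))
```

When few barcodes are detected relative to m, the estimate stays close to the raw barcode count; the correction grows as the detected count approaches m, which is the regime the dilution step is meant to avoid.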
Bioinformatics and Data Analysis.
Adapter sequences were removed, and paired-end reads were mapped by using BWA (25) to the hg19 RefSeq RNA sequences (http://genome.ucsc.edu) and the 92 ERCC artificial RNA sequences obtained from the manufacturer. A bash script was used to filter reads to require a minimum mapping quality of 30. For the absolute counting of transcripts in the postcapture library, a minimum overlap of 1 bp was required for read fragments in the capture probe region (Figs. S1 and S2). Fragments with identical start and stop sites were determined to represent distinct transcripts only if the molecular index was different. The 1-bp window sampling was repeated every 10 bp along the length of the capture probe. Estimation and plotting of the sequencing sampling of libraries were performed in R by using the equation
$$E[k] = m\left[1 - \left(1 - \frac{1}{m}\right)^{n}\right]$$
where E[k] is the expected number of distinct molecules sequenced, m is the total number of distinct clones in the original library, and n is the number of sequencing reads (9).
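The fragment counting rule and the sampling estimate can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline (which used a bash script and R): the `Fragment` record and both function names are hypothetical, and the sampling function assumes the occupancy form E[k] = m[1 − (1 − 1/m)^n] from ref. 9.

```python
from typing import Iterable, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical record for one mapped paired-end fragment."""
    transcript: str
    start: int
    stop: int
    molecular_index: str
    mapq: int


def count_distinct_molecules(fragments: Iterable[Fragment], min_mapq: int = 30) -> int:
    """Count distinct parent molecules among mapped fragments.

    Fragments below the mapping-quality cutoff are discarded. Fragments with
    identical start and stop sites count as separate molecules only if their
    molecular index differs.
    """
    seen = set()
    for f in fragments:
        if f.mapq < min_mapq:
            continue
        seen.add((f.transcript, f.start, f.stop, f.molecular_index))
    return len(seen)


def expected_distinct(m: float, n: float) -> float:
    """Expected number of distinct molecules seen in n reads from m clones."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)
```

Plotting `expected_distinct(m, n)` against n for a fixed m gives a saturation curve: once n is several times m, additional reads mostly resample clones that have already been seen.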
Acknowledgments
We thank Christina Fan for helpful discussions. The work was supported in part by National Institutes of Health Grants R24-GM102656 (to W. Xiao), P01-HG00205 (to R.W.D.), and R43HG007129 and R43HG007130 (to G.K.F.).
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323732111/-/DCSupplemental.
References
- 1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–628. doi: 10.1038/nmeth.1226.
- 2. Nagalakshmi U, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–1349. doi: 10.1126/science.1158441.
- 3. Clark MB, et al. The reality of pervasive transcription. PLoS Biol. 2011;9(7):e1000625, discussion e1001102. doi: 10.1371/journal.pbio.1000625.
- 4. Birney E, et al.; ENCODE Project Consortium; NISC Comparative Sequencing Program; Baylor College of Medicine Human Genome Sequencing Center; Washington University Genome Sequencing Center; Broad Institute; Children’s Hospital Oakland Research Institute. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816. doi: 10.1038/nature05874.
- 5. Blencowe BJ, Ahmad S, Lee LJ. Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev. 2009;23(12):1379–1386. doi: 10.1101/gad.1788009.
- 6. Toung JM, Morley M, Li M, Cheung VG. RNA-sequence analysis of human B-cells. Genome Res. 2011;21(6):991–998. doi: 10.1101/gr.116335.110.
- 7. Halvardson J, Zaghlool A, Feuk L. Exome RNA sequencing reveals rare and novel alternative transcripts. Nucleic Acids Res. 2013;41(1):e6. doi: 10.1093/nar/gks816.
- 8. Mercer TR, et al. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol. 2012;30(1):99–104. doi: 10.1038/nbt.2024.
- 9. Fu GK, Hu J, Wang PH, Fodor SP. Counting individual DNA molecules by the stochastic attachment of diverse labels. Proc Natl Acad Sci USA. 2011;108(22):9026–9031. doi: 10.1073/pnas.1017621108.
- 10. Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 2011;39(12):e81. doi: 10.1093/nar/gkr217.
- 11. Jabara CB, Jones CD, Roach J, Anderson JA, Swanstrom R. Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proc Natl Acad Sci USA. 2011;108(50):20166–20171. doi: 10.1073/pnas.1110064108.
- 12. Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA. 2011;108(23):9530–9535. doi: 10.1073/pnas.1105422108.
- 13. Kivioja T, et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012;9(1):72–74. doi: 10.1038/nmeth.1778.
- 14. Schmitt MW, et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA. 2012;109(36):14508–14513. doi: 10.1073/pnas.1208715109.
- 15. Shiroguchi K, Jia TZ, Sims PA, Xie XS. Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes. Proc Natl Acad Sci USA. 2012;109(4):1347–1352. doi: 10.1073/pnas.1118018109.
- 16. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008;5(3):235–237. doi: 10.1038/nmeth.1184.
- 17. Mamanova L, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010;7(2):111–118. doi: 10.1038/nmeth.1419.
- 18. Baker SC, et al.; External RNA Controls Consortium. The External RNA Controls Consortium: A progress report. Nat Methods. 2005;2(10):731–734. doi: 10.1038/nmeth1005-731.
- 19. External RNA Controls Consortium. Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics. 2005;6:150. doi: 10.1186/1471-2164-6-150.
- 20. Jiang L, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011;21(9):1543–1551. doi: 10.1101/gr.121095.111.
- 21. Islam S, et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011;21(7):1160–1167. doi: 10.1101/gr.110882.110.
- 22. Ramsköld D, et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol. 2012;30(8):777–782. doi: 10.1038/nbt.2282.
- 23. Qing T, Yu Y, Du T, Shi L. mRNA enrichment protocols determine the quantification characteristics of external RNA spike-in controls in RNA-Seq studies. Sci China Life Sci. 2013;56(2):134–142. doi: 10.1007/s11427-013-4437-9.
- 24. Vogelstein B, Kinzler KW. Digital PCR. Proc Natl Acad Sci USA. 1999;96(16):9236–9241. doi: 10.1073/pnas.96.16.9236.
- 25. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324.
