Genome-wide identification of transcript start and end sites by Transcript Isoform Sequencing, TIF-Seq

Vicent Pelechano; Wu Wei; Petra Jakob; Lars M Steinmetz

doi:10.1038/nprot.2014.121

Author manuscript; available in PMC: 2015 Jan 1.

Published in final edited form as: Nat Protoc. 2014 Jun 26;9(7):1740–1759. doi: 10.1038/nprot.2014.121

PMCID: PMC4111111 NIHMSID: NIHMS603849 PMID: 24967623

Abstract

Hundreds of transcript isoforms with varying boundaries and alternative regulatory signals are transcribed from the genome, even in a genetically homogeneous population of cells. To study this transcriptional heterogeneity, we developed Transcript Isoform Sequencing (TIF-Seq), a method that allows the genome-wide profiling of full-length transcript isoforms defined by their exact 5′ and 3′ boundaries. TIF-Seq entails generating full-length cDNA libraries, followed by their circularization and the sequencing of the junction fragments spanning the 5′ and 3′ transcript ends. By determining the co-occurrence of start and end sites of individual transcript molecules, TIF-Seq can distinguish variations that conventional approaches for mapping single ends cannot, such as short abortive transcripts, bicistronic messages, and overlapping transcripts that differ in length. The TIF-Seq protocol we describe here can be applied to any eukaryotic organism (e.g., yeast, human) and requires 6-10 days to generate TIF-Seq libraries, 10 days for sequencing and 2-3 days for analysis.

Keywords: Transcript isoforms, UTR variation, ncRNAs, RNA-Seq, 5′CAP, alternative polyadenylation, next generation sequencing

Introduction

Genome-wide study of transcription has proven invaluable for understanding the mechanisms regulating gene expression and characterizing gene function. However, most of our genome-wide knowledge is based on the study of changes in expression level (i.e., variations in mRNA abundance). Inferring function from expression abundance rests on the simplified assumption that each gene transcribes identical RNA molecules. In reality, one gene may express diverse transcript isoforms using alternative promoters, exons and terminators¹. Transcription thus often generates alternative RNA molecules (i.e., isoforms) that differ in length and sequence from each gene. These transcripts can differ dramatically in their function, localization, and life cycle^2-4. Genome-wide methods like DNA microarrays⁵ and RNA-Seq⁶ have been instrumental in characterizing eukaryotic RNA populations and novel transcript classes^7,8. However, these methods only detect the cumulative signal of transcripts overlapping a given genomic region (because they either fragment the RNA or cDNA, or probe for cumulative signal in specific regions); they cannot resolve the boundaries of individual RNA molecules. Methods analysing variation of either transcription start sites^9,10 or polyadenylation sites^11-14 have indicated considerable variability in transcript boundaries and suggested effects on RNA stability, translation or localization. These methods, however, cannot detect which start sites co-occur with which end sites, a property that determines the functional potential of each RNA molecule².

This led us to develop the approach presented here, transcript isoform sequencing (TIF-Seq), which allows us to concurrently determine the start and end sites of individual RNA molecules within a sample and discriminate between overlapping molecules¹⁵. We have investigated transcript isoform variation in Saccharomyces cerevisiae using TIF-Seq, revealing extensive transcript isoform diversity that affects messenger RNA stability, localization and translation, or generates truncated versions of proteins that differ in localization or function¹⁵. Similar approaches based on paired-end sequencing have previously been developed for the study of the transcriptome using Sanger sequencing^16,17 and more recently have also been applied to next generation sequencing^18,19. TIF-Seq, also based on next generation sequencing, avoids initial sample size selection and introduces intermediate amplification steps and molecular barcodes to limit molecular bottlenecks and thus increase sample complexity. This technology has enabled an unprecedented glimpse into the vast transcriptional diversity generated by a genome, with several functional implications, such as variability in mRNA stability, localization, or generation of truncated proteins¹⁵.

Overview of TIF-Seq

The TIF-Seq procedure can be conceptually divided into four main stages, each of which is described below.

RNA oligo-capping. The first step of the protocol consists of obtaining an RNA sample suitable for the generation of full-length cDNA (Procedure steps 1 to 26, and Fig. 1). There are multiple approaches that can be used to select for full-length RNA molecules. Common approaches include cap-trapping²⁰, template switching (e.g., SMART)²¹ or oligo-capping methods^22,23 (reviewed in ²⁴). We selected the oligo-capping method²³ because it does not entail removing the ribosomal RNA (rRNA) in a preceding step, and thus allows the use of rRNA integrity as a general indicator of RNA quality throughout the process (Fig. 2a). The oligo-capping method consists of the selective ligation of an oligo of known sequence to the previously capped RNA molecules. As a first step, total RNA is dephosphorylated to remove any existing 5′ phosphates (e.g., on RNA degradation intermediates). After this step, a treatment with Tobacco Acid Pyrophosphatase (TAP) removes the 5′ cap structure and exposes the 5′ phosphate group necessary for subsequent single-stranded RNA ligation. These two steps are applied to ensure that any non-capped molecules are dephosphorylated prior to oligo ligation, and thus guarantee that only previously capped molecules are processed further. Although it is almost impossible to determine the exact number of non-capped RNA molecules present in our final dataset, when TIF-Seq analysis is instead focused on 5′ non-capped molecules (i.e., by substituting the TAP treatment with a T4 PNK treatment¹⁵), the profile obtained with respect to the transcription start site is completely different. Additionally, background rRNA contamination should be low (i.e., around 20% depending on the sequence mapping approach) and correspond mainly to polyadenylated rRNA degradation intermediates²⁵. After TAP treatment, an oligo with known sequence is ligated to the 5′ end of the formerly capped RNA molecules.

Previous studies have observed that the use of different oligos for single-stranded ligation can lead to significant differences between samples²⁶. Thus, to be able to effectively compare different samples, we chose to avoid any early sample barcoding and use the same oligo for all samples. Additionally, to estimate the accuracy of our 5′ and 3′ end identification, we recommend that capped and polyadenylated in vitro synthesized transcripts be used as a ‘spike-in’ (Box 1).

Production of full-length cDNA. Once total RNA has been subjected to the oligo-capping protocol, barcoded full-length cDNA is generated by selectively amplifying RNA molecules with both a 5′ cap structure and a poly(A) tail (Procedure steps 28 to 41, and Fig. 1). For this procedure, each sample is subjected to reverse transcription and full-length cDNA generation. Notably, in order to control for subsequent intermolecular ligation, these reactions are carried out in duplicate (i.e., in two separate tubes) using different sets of oligos, producing equivalent full-length cDNA libraries that should differ only in their terminal barcodes (referred to here as chimera control barcodes A and B). This information will be used after the circularization step to estimate the percentage of intermolecular and intramolecular full-length cDNA ligations (similar to Fullwood et al. ¹⁸). RNA is subjected to reverse transcription (optimized to produce long cDNA molecules by increasing reaction time and temperature, and by using trehalose combined with an efficient enzyme) followed by a brief PCR amplification using the common sequences added during single-stranded RNA ligation and reverse transcription. This initial amplification step is essential to maintain sample complexity (i.e., diversity of isoforms). In preliminary experiments, where the PCR amplification was omitted and the double-stranded cDNA was directly used in subsequent steps, the complexity of the sample was dramatically reduced: the same molecules were sequenced multiple times, producing artifactual homogeneity. This loss in complexity is likely due to a bottleneck in the number of RNA molecules that reach the final library stage, and can be addressed by performing intermediate sample amplification as described here.

While excessive PCR amplification can also reduce molecular complexity²⁷, a few PCR cycles have in our experience minimized sample loss and thus effectively enhanced molecular complexity, enabling us to detect hundreds of thousands of isoforms per sample¹⁵. Another important consideration for TIF-Seq and similar methods (e.g., RNA-PET¹⁹) is the removal of PCR duplicates, as we expect multiple identical isoforms to be transcribed. We have found that the introduction of random barcodes as molecular identifiers (Fig. 1) during reverse transcription allows straightforward identification of PCR duplicates and estimation of library complexity. Specifically, the oligos used for reverse transcription (PET-3RT-A and PET-3RT-B, in Table 1) contain 8 random nucleotides (i.e., 7 N (A, T, C, or G) and 1 V (A, C, or G)). Thus, during the data analysis each transcript will be identified by a specific 5′ and 3′ end and also by a particular random sequence (in green in Fig. 1). As this random sequence is introduced before any amplification step, molecules with the same 5′ and 3′ ends that also contain the same random barcode can be discarded as PCR duplicates.
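
The duplicate-removal logic described above can be sketched as follows. This is a minimal illustration, not the analysis pipeline used by the authors: the record layout (chromosome, strand, start, end, random barcode), the function names and the example coordinates are all hypothetical. It also computes how many distinct barcodes the 7 N + 1 V design can encode.

```python
from collections import Counter

def umi_space_size(n_positions=7, v_positions=1):
    """Distinct random barcodes: each N has 4 options, each V (A, C or G) has 3."""
    return 4 ** n_positions * 3 ** v_positions

def collapse_pcr_duplicates(molecules):
    """Keep one record per (chrom, strand, start, end, barcode) combination.

    Records sharing both transcript ends AND the random barcode are treated
    as PCR duplicates; records sharing only the ends are kept, because
    identical isoforms are expected to be transcribed many times.
    """
    seen = set()
    unique = []
    for record in molecules:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def isoform_counts(unique_molecules):
    """Count deduplicated molecules per isoform (chrom, strand, start, end)."""
    return Counter(record[:4] for record in unique_molecules)

# Hypothetical example records (coordinates and barcodes are made up).
reads = [
    ("chrI", "+", 1000, 2500, "ACGTACGA"),
    ("chrI", "+", 1000, 2500, "ACGTACGA"),  # same ends, same barcode: duplicate
    ("chrI", "+", 1000, 2500, "TTGCAAGC"),  # same ends, new barcode: kept
    ("chrI", "+", 1000, 2300, "ACGTACGA"),  # different 3' end: other isoform
]
unique = collapse_pcr_duplicates(reads)
counts = isoform_counts(unique)
```

Because the barcode space is finite, two independent molecules of a very abundant isoform can occasionally share a barcode; the sketch above would collapse them, so molecule counts for highly expressed isoforms should be read as lower bounds.
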
Intramolecular circularization. For circularizing full-length cDNA molecules, the independently barcoded samples are pooled, digested, and subjected to intramolecular ligation (Procedure steps 43 to 60 and Fig. 1). To increase ligation efficiency, we generate sticky ends by digesting the full-length cDNA sample with the NotI enzyme. To maximize the number of intramolecular ligations and minimize intermolecular ligations, we perform the ligation reaction at a low DNA concentration (1 ng μL^-1 or less). The earlier introduction of chimera control barcodes during the full-length cDNA generation step will now allow the estimation of the percentage of circular molecules originating from intermolecular ligation. As true intramolecular ligations would result in either A-A or B-B barcode combinations, the combinations A-B or B-A in the final products can only arise due to intermolecular ligations. The expected occurrence of intermolecular events is therefore calculated as double the abundance of A-B and B-A reads present in the final library (since A-A and B-B combinations could also originate from intermolecular ligations at the same frequency as A-B and B-A). After cDNA circularization, any remaining linear DNA molecules are exonucleolytically degraded.
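
The chimera estimate described above amounts to simple arithmetic on the barcode-pair counts. The sketch below assumes the counts have already been tallied per configuration; the function name and the example numbers are hypothetical, and expressing the result as a fraction of all classified reads is our choice of normalization.

```python
def intermolecular_fraction(counts):
    """Estimate the fraction of circles that arose from intermolecular ligation.

    counts maps "A-A", "B-B", "A-B" and "B-A" to read counts. Only A-B and
    B-A are visibly chimeric; the text assumes that A-A and B-B hide the same
    number of intermolecular events, hence the factor of two.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified reads")
    visible_chimeras = counts.get("A-B", 0) + counts.get("B-A", 0)
    return 2 * visible_chimeras / total

# Hypothetical counts, for illustration only.
example = {"A-A": 4800, "B-B": 4900, "A-B": 150, "B-A": 150}
rate = intermolecular_fraction(example)  # 2 * 300 / 10000 = 0.06
```

The factor of two is only appropriate when the A and B libraries are pooled in comparable amounts; with a strongly unbalanced pool, hidden same-barcode chimeras would no longer match the visible ones in frequency.
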

Sequencing library construction. To construct the final sequencing library (Procedure steps 61 to 109, and Fig. 1), the circularized full-length cDNA is sonicated and the junction fragments (which are biotinylated) are purified using magnetic beads. Once the biotinylated sample is bound to the beads, a standard Illumina library (or the equivalent for the sequencing platform used) is constructed, including DNA blunt ending, dA addition and ligation of forked adaptors. In this case the library remains bound to the streptavidin beads during all steps. If multiple samples are to be run in the same sequencing lane (as we recommend), barcoded forked adaptors should be used during the library preparation (Table 1). After the enrichment PCR (step 99), it is very important to perform an accurate size selection step (Fig. 2b). This step is especially important because only a fraction of the sonicated molecules will contain fragments spanning the 5′ and 3′ ends that are long enough to map uniquely to the genome and short enough for the sequencing read to identify the exact 5′ and 3′ transcript ends.

Figure 1. Detailed experimental workflow of TIF-Seq. The protocol is composed of: (a) 5′ RNA oligo-capping using a DNA/RNA oligo of known sequence (5RNAGsuI), (b) generation and PCR amplification of full-length cDNA, (c) intramolecular circularization after NotI digestion, and (d) purification of paired-end cDNA tags followed by generation of Illumina-compatible sequencing libraries. Individual steps are labeled in black and the structure of the final sequenced library is depicted in the bottom right side of panel d. CIP stands for Calf Intestinal Alkaline Phosphatase, TAP for Tobacco Acid Pyrophosphatase, and ssRNA for single-stranded RNA. Refer to the main text for specific details.

Figure 2. TIF-Seq quality controls and anticipated results. (a) Example of a good-quality HS RNA Bioanalyzer trace of *S. cerevisiae* RNA after single-stranded RNA ligation (step 27). rRNA peaks (*e.g.*, 18S and 26S for *S. cerevisiae*) should be clearly visible, and the sample should not be enriched in short RNA molecules that could arise from RNA degradation. FU stands for Fluorescence Units. (b) Example of an eGel size-selected TIF-Seq library, ready for sequencing, analyzed using a High Sensitivity DNA Bioanalyzer (step 107). A clear peak should be observed between 270 and 280 nt. Optimal TIF-Seq samples should not contain the Illumina primer-dimer band (usually located around 130 nt). (c) Example of aligned TIF-Seq reads visualized in IGV³³ for *S. cerevisiae* (data from¹⁵). Each identified transcript is depicted by a line that connects the 5′ and 3′ ends of the transcript isoform, without any information regarding internal splicing events. Transcripts are depicted in red (+ strand) and blue (- strand). The overall coverage of each strand is depicted by a grey box. An optimal sample should have high complexity (without a significant number of PCR duplicates).

BOX 1. Preparation of capped and polyadenylated in vitro transcript. TIMING 6 h.

The following protocol describes how to prepare a mix of in vitro transcripts (IVTs) that should be added to each sample to control for the quality of exact 5′ and 3′ nucleotide identification. In this case we use IVTs derived from B. subtilis (ATCC 87482 (pGIBS-LYS), ATCC 87483 (pGIBS-PHE) and ATCC 87484 (pGIBS-THR)) that contain an encoded poly(A) tail in their DNA template. In general, however, any polyadenylated IVT of known sequence that is subsequently capped can be used.

Generation of in vitro transcripts

Increase the volume of 200 ng of linearized DNA template to 22.5 μL with RNase-free water, and set up the following 50 μL reaction:


| Component | Amount (μL) | Final |
|---|---|---|
| Linearized DNA template (200 ng) | 22.5 | |
| Transcription Optimized 5x buffer | 10 | 1x |
| DTT (0.1 M) | 5 | 10 mM |
| NTP mix (2.5 mM each) | 10 | 0.5 mM |
| T3 RNA polymerase (10 U μL^-1) | 2 | 0.4 U μL^-1 |
| RNasin Plus | 0.5 | - |
| Total | 50 | |
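The "Final" column of a reaction table like the one above follows from the dilution relation C_final = C_stock × V_added / V_total. The short sketch below recomputes the values for this reaction as a sanity check; the helper name and the dictionary layout are illustrative only.

```python
import math

def final_concentration(stock, volume_ul, total_ul):
    """Dilution relation: C_final = C_stock * V_added / V_total (same units as stock)."""
    return stock * volume_ul / total_ul

# (stock concentration, volume in uL) taken from the in vitro transcription table.
components = {
    "buffer (x)": (5.0, 10.0),         # 5x buffer
    "DTT (mM)": (100.0, 5.0),          # 0.1 M stock
    "NTP each (mM)": (2.5, 10.0),
    "T3 polymerase (U/uL)": (10.0, 2.0),
}
# Template (22.5 uL) and RNasin Plus (0.5 uL) complete the reaction volume.
total_volume = 22.5 + 0.5 + sum(vol for _, vol in components.values())

finals = {
    name: final_concentration(stock, vol, total_volume)
    for name, (stock, vol) in components.items()
}
```

Running the same check on any other reaction mix is a quick way to catch pipetting-table typos before setting up the reaction.
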

Table 1.

| Oligo | Sequence |
|---|---|
| 5RNAGsuI | CACTCTrGrArGrCrArArUrArCrC |
| PET-3RT-A | ACATGTATAGCGGCCGCTAGANNNNNNNVTTTTTTTTTTTTTTTTVN |
| PET-3RT-B | ACATGTATAGCGGCCGCATCTNNNNNNNVTTTTTTTTTTTTTTTTVN |
| PET-5ABio | TATAGCGGCCGCAC[BtndT]GCACTCTGAGCAATACC |
| PET-5BBio | TATAGCGGCCGCTGA[BtndT]CACTCTGAGCAATACC |
| PET-5A | TATAGCGGCCGCACTGCACTCTGAGCAATACC |
| PET-5B | TATAGCGGCCGCTGATCACTCTGAGCAATACC |
| PET-3-A | ACATGTATAGCGGCCGCTAGA |
| PET-3-B | ACATGTATAGCGGCCGCATCT |
| Barcoded forked adaptors^# | |
| mp1PE1 | [Phos]AGCGCTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp1PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTAGCGCT^* T |
| mp5PE1 | [Phos]ACAGTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp5PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACTGT^* T |
| mp19PE1 | [Phos]CGGAATAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp19PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTATTCCG^* T |
| mp22PE1 | [Phos]CTATACAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp22PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTATAG^* T |
| mp34PE1 | [Phos]GGTAGCAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp34PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGCTACC^* T |
| mp37PE1 | [Phos]GTTTCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG |
| mp37PE2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTCGAAAC^* T |


| Cycle number | Denature | Anneal | Extend |
|---|---|---|---|
| 1 | 98°C, 30 s | | |
| 2-11 | 98°C, 20 s | 50°C, 30 s (adding 1°C/cycle) | 72°C, 5 min (adding 10 s/cycle) |
| 12 | | | 72°C, 5 min |
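To make the per-cycle increments in the cycling table explicit, the sketch below expands cycles 2-11 into individual annealing temperatures and extension times. It assumes that the first cycle of the block (cycle 2) runs at the stated base values (50°C, 5 min) and that each later cycle adds 1°C and 10 s; this reading of "adding ... per cycle" and the function name are our assumptions, so check how your thermocycler applies increments.

```python
def incremental_schedule(first_cycle=2, last_cycle=11,
                         anneal_start_c=50, anneal_step_c=1,
                         extend_start_s=300, extend_step_s=10):
    """Return (cycle, annealing temperature in C, extension time in s) per cycle.

    Assumes the first cycle of the block uses the base values and every
    following cycle adds one increment of each.
    """
    rows = []
    for i, cycle in enumerate(range(first_cycle, last_cycle + 1)):
        rows.append((cycle,
                     anneal_start_c + i * anneal_step_c,
                     extend_start_s + i * extend_step_s))
    return rows

schedule = incremental_schedule()
total_extension_s = sum(extend for _, _, extend in schedule)
```

Under these assumptions the last amplification cycle anneals at 59°C and extends for 6 min 30 s, and the extension steps of cycles 2-11 alone add up to 57.5 min.
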

| Configuration | Sequence |
|---|---|
| A-A | 3′end-AAAAAAAAAAAAAAAABNNNNNNNTCTAGCGGCCGCACTGCACTCTGAGCAATACC-5′end |
| B-B | 3′end-AAAAAAAAAAAAAAAABNNNNNNNAGATGCGGCCGCTGATCACTCTGAGCAATACC-5′end |
| A-B (chimera) | 3′end-AAAAAAAAAAAAAAAABNNNNNNNTCTAGCGGCCGCTGATCACTCTGAGCAATACC-5′end |
| B-A (chimera) | 3′end-AAAAAAAAAAAAAAAABNNNNNNNAGATGCGGCCGCACTGCACTCTGAGCAATACC-5′end |
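The four junction configurations above differ only in the 4 nt on either side of the re-ligated NotI site (GCGGCCGC), so a junction read can be classified by inspecting those flanks. The following is a minimal sketch under simplifying assumptions: the read is given in the orientation shown above, the first NotI site found is the junction, and no mismatches are tolerated. The function and dictionary names are hypothetical.

```python
NOTI_SITE = "GCGGCCGC"
# 4 nt flanks read off the configuration table, in the orientation shown there.
THREE_PRIME_BARCODES = {"TCTA": "A", "AGAT": "B"}  # just upstream of the NotI site
FIVE_PRIME_BARCODES = {"ACTG": "A", "TGAT": "B"}   # just downstream of the NotI site

def classify_junction(read):
    """Return (configuration, random_barcode) or None if the read is unclassifiable.

    The configuration is written as "<3' barcode>-<5' barcode>", e.g. "A-B".
    The random barcode is the 8 nt preceding the 3' chimera control barcode
    (the B + 7 N stretch in the table).
    """
    pos = read.find(NOTI_SITE)
    if pos < 12:  # also covers pos == -1 (site not found)
        return None
    three = THREE_PRIME_BARCODES.get(read[pos - 4:pos])
    five = FIVE_PRIME_BARCODES.get(read[pos + 8:pos + 12])
    if three is None or five is None:
        return None
    return f"{three}-{five}", read[pos - 12:pos - 4]

# Example built from the A-A row, with an arbitrary random barcode filled in.
example_read = "A" * 16 + "CACGTACG" + "TCTAGCGGCCGCACTGCACTCTGAGCAATACC"
result = classify_junction(example_read)
```

A production pipeline would additionally need to handle reverse-complemented reads, sequencing errors in the barcodes and genomic NotI-like sequences near the junction; none of that is attempted here.
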

Table 3. Troubleshooting advice.

| Step | Problem | Possible reason | Possible solution |
|---|---|---|---|
| 3, 26 | RNA degradation | Contamination with RNase | Include an RNase inhibitor during the protocol (as already recommended). Always use small aliquots of RNase-free reagents and verify the integrity and quality of the starting RNA. |
| 107 | Insufficient amount of final library produced | Low efficiency in RNA ligation, reverse transcription or production of full-length cDNA molecules | Perform parallel reactions in different tubes and pool the obtained samples. As molecular barcodes can be used to overcome amplification bias, the number of PCR cycles could also be increased. |
| 116 | High number of reads containing Illumina primer dimer sequences or decreased library complexity | Low amount of library produced, or an excess of Illumina linkers used for the ligation step | Increase the amount of input material. Optimize the number of PCR cycles to minimize the number of PCR duplicates while preventing the formation of a molecular bottleneck. Run a pilot sequencing lane (all libraries multiplexed with other samples or in a lower throughput machine, e.g., MiSeq) to assess the quality of the libraries before sequencing at high coverage with multiple HiSeq lanes. |


| Component | Amount (μL) | Final |
|---|---|---|
| 60 μg total RNA | 35 | 1.2 μg μL^-1 |
| Capped IVT controls (1x, 500 pg μL^-1, see Box 1) | 6 | 10 pg μL^-1 |
| Turbo DNase buffer 10x | 5 | 1x |
| Turbo DNase enzyme (2 U μL^-1) | 3 | 6 U |
| RNasin plus | 1 | |
| Total | 50 | |


| Component | Amount (μL) | Final |
|---|---|---|
| DNase treated RNA sample | 50 | - |
| RNase-free water | 36 | - |
| NEBuffer 3 (10x) | 10 | 1x |
| CIP (Calf Intestinal Alkaline Phosphatase, 10 U μL^-1) | 3 | 30 U |
| RNasin plus | 1 | - |
| Total | 100 | |


| Component | Amount (μL) | Final |
|---|---|---|
| Phosphatase treated sample | 16.3 | - |
| TAP reaction buffer (10x) | 2 | 1x |
| DTT (0.1 M) | 0.2 | 1 mM |
| Tobacco Acid Pyrophosphatase (TAP, 10 U μL^-1) | 0.5 | 5 U |
| RNasin plus | 1 | - |
| Total | 20 | |


| Component | Amount (μL) | Final |
|---|---|---|
| TAP treated sample | 4.5 | |
| 100 mM DNA/RNA oligo 5RNAGsuI | 1 | 10 mM |
| buffer T4 RNA Ligase 1 (10x, containing 10 mM ATP) | 1 | 1x |
| T4 RNA Ligase 1 (10 U μL^-1) | 2 | 20 U |
| DMSO | 1 | 10% (vol/vol) |
| RNasin plus | 0.5 | - |
| Total | 10 | |


