Nucleic Acids Research. 2024 Feb 21;52(7):e35. doi: 10.1093/nar/gkae120

Correcting 4sU induced quantification bias in nucleotide conversion RNA-seq data

Kevin Berg 1,2, Manivel Lodha 3, Isabel Delazer 4, Karolina Bartosik 5, Yilliam Cruz Garcia 6,7, Thomas Hennig 8, Elmar Wolf 9,10, Lars Dölken 11, Alexandra Lusser 12, Bhupesh K Prusty 13, Florian Erhard 14,15
PMCID: PMC11039982  PMID: 38381903

Abstract

Nucleoside analogues like 4-thiouridine (4sU) are used to metabolically label newly synthesized RNA. Chemical conversion of 4sU before sequencing induces T-to-C mismatches in reads sequenced from labelled RNA, making it possible to obtain total and labelled RNA expression profiles from a single sequencing library. Cytotoxicity due to extended periods of labelling or high 4sU concentrations has been described, but the effects of extensive 4sU labelling on expression estimates from nucleotide conversion RNA-seq have not been studied. Here, we performed nucleotide conversion RNA-seq with escalating doses of 4sU with short-term labelling (1 h) and over a progressive time course (up to 2 h) in different cell lines. With high concentrations or at later time points, expression estimates were biased in an RNA half-life dependent manner. We show that this bias arose from a combination of reduced mappability of reads carrying multiple conversions and a global, unspecific underrepresentation of labelled RNA that emerges during library preparation and potentially from a global reduction of RNA synthesis. We developed a computational tool to rescue unmappable reads, which performed favourably compared to previous read mappers, and a statistical method, which could fully remove the remaining bias. All methods developed here are freely available as part of our GRAND-SLAM pipeline and grandR package.

Graphical Abstract


Introduction

Nucleotide conversion sequencing of metabolically labelled RNA (1–3) enables the direct analysis of the temporal dynamics of RNA expression upon different perturbations in bulk or single cells (4–7) and of different quantitative parameters (8,9). Cells are simultaneously subjected to a certain condition of interest and labelled with the nucleoside analogue 4-thiouridine (4sU), which is readily incorporated into nascent RNA. 4sU can be chemically converted in extracted RNA (1–3) or intact cells (7) to induce T-to-C mismatches in genome-mapped reads that originate from labelled RNA. Only a minor fraction (2–10%) of uridines is substituted by 4sU in nascent RNA, such that only a fraction of reads from labelled RNA molecules cover sites of 4sU incorporation (10). Statistical approaches nevertheless provide unbiased estimates of the new-to-total RNA ratio (NTR) and quantify the uncertainty in these estimates (11). Recent methods carry these uncertainties forward for the estimation of biophysical parameters of the temporal kinetics of RNA expression such as synthesis rates and half-lives and for identifying differentially regulated genes (12).

The precision and accuracy of the quantification of labelled RNA are strongly affected by the frequency of 4sU incorporation: the more 4sU is incorporated into each newly synthesized RNA molecule, the better the estimates of the NTR (see Supplementary Note S1 and ref. (11)). In principle, this incorporation frequency depends on the relative concentrations of triphosphorylated 4sU and U available for incorporation into nascent RNA. Thus, a straightforward way to optimize the quantification of labelled RNA is to increase the 4sU concentration. However, labelling with very high concentrations of 4sU, especially over several hours, affects cell viability (1) and rRNA processing (13). We recently showed that expression estimates from nucleotide conversion RNA-seq experiments can be affected before significant effects on cell viability are detectable (12). In multiple data sets, principal component analyses revealed marked differences between samples labelled with 4sU and unlabelled but otherwise biologically equivalent control samples. In addition, in a differential gene expression analysis of total RNA levels comparing 4sU labelled cells against equivalent unlabelled controls, genes with short RNA half-lives preferentially appeared to be downregulated. This observed downregulation can have biological or technical reasons (see Figure 1): Excessive 4sU labelling might have direct effects on RNA metabolism, e.g. incorporation of 4sU into nascent RNA might result in reduced processivity of RNA polymerase II, or 4sU containing RNA might be less stable than unlabelled RNA molecules. Excessive labelling might also induce indirect effects, e.g. due to the activation of cellular stress pathways.

Figure 1.


Overview of 4sU-induced quantification bias in new RNA. Cells are labeled with 4-thiouridine (4sU), which is incorporated into newly synthesized RNA. Incorporated 4sU could globally reduce transcriptional activity or induce degradation of labeled mRNAs (A). 4sU has been shown to interfere with reverse transcription and must be handled carefully to be well represented in the sequencing library (15) (B). T-to-C mismatches within read sequences make it harder to correctly map reads (C). All three effects result in dropout of 4sU reads, mainly affecting genes with short half-lives and therefore introducing quantification bias. The incorporation frequency is defined as the relative frequency of T-to-C conversions in all observed reads corresponding to newly synthesized RNA (RNAPII: RNA polymerase II; RT: reverse transcriptase; obs: observed; unobs: unobserved).

The observed downregulation of short-lived RNA might also have technical reasons. In a recent study, reverse transcription efficiency was found to be reduced for RNA containing 4sU converted with iodoacetamide (14), as it is used for SLAM-seq (1). The consequence of such a strong reduction of reverse transcription due to 4sU is that labelled RNA is underrepresented in the sequencing library for 4sU-labelled samples. While this manuscript was under review, a preprint showed very convincingly that sub-optimal sample handling can also lead to a loss of labelled RNA independent of reverse transcription (15). Moreover, mismatched bases generally have a negative impact on read mappability. If mappability is strongly impaired, reads corresponding to labelled RNA are underrepresented among the mapped reads used for quantifying gene expression. Since genes with short-lived RNAs have a higher percentage of labelled RNA in the total RNA pool than genes with long-lived RNAs, both an underrepresentation of labelled RNA in the library and an underrepresentation of labelled RNA in the mapped reads could explain quantification bias correlating with RNA half-lives (11).
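A short worked example may help to see why a uniform loss of labelled RNA produces a bias that depends on RNA half-life. It assumes steady state and first-order decay, so that the new-to-total ratio after a labelling time $t$ follows from the half-life $t_{1/2}$; the dropout fraction $d = 0.3$ and the two half-lives are purely hypothetical values chosen for illustration.

```latex
\begin{align*}
\mathrm{NTR}(t) &= 1 - 2^{-t/t_{1/2}}, \qquad
\frac{\text{observed total}}{\text{true total}} = 1 - d \cdot \mathrm{NTR} \\[4pt]
t = 2\,\mathrm{h},\ t_{1/2} = 1\,\mathrm{h}: \quad
  \mathrm{NTR} &= 0.75, \qquad \log_2(1 - 0.3 \cdot 0.75) = \log_2(0.775) \approx -0.37 \\
t = 2\,\mathrm{h},\ t_{1/2} = 8\,\mathrm{h}: \quad
  \mathrm{NTR} &\approx 0.16, \qquad \log_2(1 - 0.3 \cdot 0.16) \approx \log_2(0.952) \approx -0.07
\end{align*}
```

Under these assumptions the same 30% loss of labelled RNA shifts the short-lived transcript by about five times as much on the log2 scale as the long-lived one.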

Here, we performed nucleotide conversion RNA-seq with increasing concentrations of 4sU and with several periods of labelling in different cell types. We used these data to study how biased expression estimates due to 4sU labelling depend on cell type, labelling duration and 4sU concentration, and to assess the contribution of technical causes. To counter these effects, we propose a new method to rescue previously unmappable reads. We compared it to existing read mapping tools and evaluated it using in-silico simulated and real data sets. Furthermore, we devised a scaling strategy to correct for the underrepresentation of new RNA in the sequencing library or among mapped reads. Our data provide evidence that this correction completely removed this effect from the data, enabling the analysis of samples that otherwise suffer from quantification bias due to excessive 4sU treatment.

Materials and methods

Cell culture and 4sU labelling

NIH-3T3 (ATCC CRL-1658) Swiss murine embryonic fibroblasts, human U2OS (RRID:CVCL_0042), HFF-TerT (ATCC CRL-4001) hTERT-immortalized human foreskin fibroblasts and HCT116 (RRID:CVCL_0291) cells were grown in DMEM (Dulbecco's Modified Eagle's Medium) supplemented with 100 IU/mL Penicillin and 100 mg/mL Streptomycin at 37°C/5% CO2. The medium for NIH-3T3 cells was supplemented with 10% NCS (newborn calf serum); the medium for U2OS, HFF-TerT and HCT116 cells was supplemented with 10% FBS.

All cells were seeded in six-well plates at 5 × 10⁶ cells/well followed by 4sU-labelling the next day. NIH-3T3 cells were labelled with 800 μM 4sU for 15, 30, 60, 90 and 120 min. U2OS, HFF-TerT and HCT116 cells were labelled with 0, 100, 200, 400 or 800 μM 4sU for 1 h.

The cell lines were routinely checked for Mycoplasma by PCR and tested negative at all times.

Nucleotide conversion RNA-seq

Cells subjected to 4sU-labelling were harvested using TRI reagent (Sigma) and RNA isolation was carried out for U2OS, HFF-TerT and HCT116 cells following a protocol recommended for TRI reagent. For NIH-3T3 cells, RNA was extracted using the Zymo DirectZol RNA-microprep kit (R2062) as described by the manufacturer and re-suspended in 1X PBS buffer. SLAM-seq (1) was conducted as described before (16) using IAA (iodoacetamide) to mediate U > C conversions at 50°C for 20 min for NIH-3T3 and at 37°C for 1 h for U2OS, HFF-TerT and HCT116. The reaction was quenched using excess DTT (dithiothreitol). RNA was then purified using the RNeasy MinElute kit (Qiagen) and subjected to quality control via gel electrophoresis for 18S and 28S RNA followed by Bioanalyzer assessment (Agilent 2100). Library preparation (Illumina TruSeq) and sequencing (2 × 75 paired-end) were conducted by the Core Unit SysMed (Würzburg) using NextSeq500 as described previously (16).

For the experiment involving TUC-seq, total RNA from 4sU-labelled and unlabelled cells was isolated as described above and eluted in RNase-free H2O. RNA (5 μg) conversion using TUC chemistry was performed as previously described (17), except that incubation with OsO4/NH4Cl was performed for 1 h at 40°C. SLAM conversion was done at 50°C for 30 min. RNA was purified by precipitation as described above and subjected to quality control using a Bioanalyzer instrument (Agilent 2100). For a side-by-side comparison of SLAM-seq and TUC-seq (Figure 8F), cell labelling, RNA conversion, library preparation (DNBSEQ stranded mRNA) and sequencing (2 × 100 nt paired-end; DNBseq) were performed in parallel.

Figure 8.


Various steps during sample handling result in 4sU dropout. (A) Scatter plot of nucleotide content in sequenced RNA fragments in 4sU naïve sample versus 2 h 4sU labelling. n = 105 genes with an RNA half-life <30 min, >10 TPM and an estimated major isoform percentage of >90% are shown. (B) Boxplots showing the log2 fold changes for all n = 105 genes of nucleotide content in 2 h 4sU labelling vs. 4sU naïve sample for all nucleotides. P values (<2.2 × 10⁻¹⁶, Wilcoxon test) are indicated. (C) Line plots showing the average log2 fold changes across the n = 105 genes of nucleotide content over all labelling times versus 4sU naïve sample and both replicates. (D) Heatmaps showing the average log2 fold changes across all n = 105 genes for all dinucleotides over all labelling times vs. 4sU naïve sample in both replicates. (E) Experimental design for the conversion experiment. (F) 4sU dropout values for the two replicates involving labelling with U and no conversion, and labelling with 4sU and either no conversion, SLAM conversion or TUC conversion. (G) Experimental design for the methanol fixation experiment. (H) 4sU dropout values for the two replicates involving conversion in tubes with 200 μM or 800 μM 4sU, or for conversion in methanol fixed cells as indicated. (I, J) Scatter plot of log2 fold changes for 2 h labelling versus 4sU naïve sample in replicate A against log2 fold changes in replicate B before (I) and after (J) correction. The Spearman correlation coefficient and associated P values (asymptotic t test) are indicated.

For the methanol-fixed cells, 4sU-labelled NIH-3T3 cells were trypsinized and fixed in PBS and methanol at a ratio of 1:4. Methanol-fixed cells were stored at –80°C. On the day of the experiment, cells were thawed on ice for 30 min. Subsequently, IAA was added to the methanol-fixed cells, which were kept at 4°C overnight protected from light. The next day, cells were pelleted by centrifugation to remove the methanol. Cells were then rehydrated in PBS for 30 min and used for RNA extraction using TRI reagent. Conversion for the ‘Tube’ samples of this experiment was done at 37°C for 1 h. Total RNA was processed for library preparation as for the other samples, except that the QuantSeq FWD Kit was used for library preparation and sequencing was done in 100 nt single-end mode on the Illumina NextSeq2000 platform.

RNA-seq/SLAM-seq data processing

All publicly available RNA-seq and newly generated SLAM-seq data used here were processed using the GRAND-SLAM pipeline (11). Fastq files of publicly available RNA-seq data were downloaded from the SRA database. The accession numbers were: GSE162264 (sample: GSM4948135) from ref. (4) for the simulation of the effect of mismatches on read mappability and the evaluation of read mapping tools, and GSE124167 (samples: GSM3523316–GSM3523318) and GSE109480 (samples: GSM2944116–GSM2944120) for the comparison of read mappability after T > C mismatch introduction in TruSeq (18) and QuantSeq (19) data sets, respectively.

Adapter sequences were trimmed using cutadapt (version 3.5) with parameters ‘-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT’ for the increasing concentrations and progressive labelling data and ‘-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA’ for data from ref. (4). Then, bowtie2 (version 2.3.0) was used to map reads against an rRNA (NR_046233.2 for TruSeq and QuantSeq data, and U13369.1 for increasing concentrations, progressive labelling and ref. (4)) and Mycoplasma database using default parameters. Remaining reads were mapped against target databases using STAR (version 2.7.10b) with parameters ‘--outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --alignEndsType Extend5pOfReads12 --outSAMattributes nM MD NH --outSAMunmapped Within’. We used the murine genome for TruSeq and QuantSeq data, and the human genome for increasing concentrations, progressive labelling and data from ref. (4). All genome sequences were taken from the Ensembl database (version 90 for human, version 102 for mouse). Bam files for each data set were merged and converted into a CIT file using the GEDI toolkit and then processed using GRAND-SLAM (version 2.0.7) with parameters ‘-trim5p 15 -modelall’ to generate read counts and NTR values on the gene level, taking into account all reads that are compatible with at least one isoform of a gene. For the newly generated SLAM-seq data sets, only genes with >200 reads in half of the samples were retained for the evaluation of 4sU dropout scaling.

Incorporation frequency saturation curves

We assume that the incorporation frequency of 4sU only depends on the relative concentrations of (triphosphorylated) U and 4sU. Ignoring 4sU uptake into cells and all steps necessary to make 4sU available for transcription, the incorporation frequency $p$ can be computed from the U concentration $c_{\mathrm{U}}$ and 4sU concentration $c_{\mathrm{4sU}}$ as $p = c_{\mathrm{4sU}} / (c_{\mathrm{U}} + c_{\mathrm{4sU}})$. This $p$ is shown as a function of $c_{\mathrm{4sU}}$ in Figure 2D. The unknown U concentrations are estimated by solving this equation for $c_{\mathrm{U}}$, and taking the average of the $c_{\mathrm{U}}$ computed from the 100 μM and 200 μM samples.
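The saturation model can be sketched in a few lines of Python. The sketch assumes the competitive form in which the incorporation frequency equals the 4sU concentration divided by the sum of the U and 4sU concentrations, which is what the assumption of purely relative concentrations implies; the function names and the example incorporation frequencies are hypothetical and not taken from the data.

```python
def incorporation_frequency(c_4su, c_u):
    """Expected fraction of U positions carrying 4sU (both concentrations in the same unit)."""
    return c_4su / (c_u + c_4su)


def estimate_u_concentration(c_4su, p_observed):
    """Solve p = c_4su / (c_u + c_4su) for the unknown U concentration c_u."""
    return c_4su * (1.0 - p_observed) / p_observed


def average_u_concentration(observations):
    """Average the c_u estimates from several (c_4su, p_observed) pairs."""
    estimates = [estimate_u_concentration(c, p) for c, p in observations]
    return sum(estimates) / len(estimates)


# Hypothetical incorporation frequencies at 100 and 200 uM 4sU (not measured values).
c_u = average_u_concentration([(100.0, 0.02), (200.0, 0.04)])

# Theoretically expected curve at the concentrations used in the dose escalation.
expected_curve = {c: incorporation_frequency(c, c_u) for c in (100, 200, 400, 800)}
```

Averaging the estimates from the two lowest concentrations mirrors the procedure described above; observed frequencies that fall below `expected_curve` at higher concentrations would then indicate saturation.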

Figure 2.


Evaluation of increasing concentrations on quantification bias. (A) SLAM-seq experiments were conducted with U2OS, HCT116 and HFF-TerT cells labeled for 1 h with 4sU concentrations of 0 μM, 100 μM, 200 μM, 400 μM or 800 μM. (B) Observed T-to-C mismatches across all reads per cell line and 4sU concentration. (C) Percentages of new RNA content per cell line and 4sU concentration estimated by GRAND-SLAM. (D) Incorporation frequency of 4sU into newly synthesized RNA per cell line and 4sU concentration estimated by GRAND-SLAM (solid line). The dashed lines represent the theoretically expected incorporation frequency (see Materials and methods). (E) 4sU dropout plots of n = 6454 genes for the 800 μM 4sU samples versus 4sU naïve samples for all three cell lines. The x axis shows the RNA half-life, the y axis the median centered log2 fold change of total RNA expression for the 4sU labelled sample versus the corresponding 4sU naïve control sample. A local polynomial regression (loess) fit is indicated in red. Spearman's correlation coefficient (ρ) and the associated P value (approximate t test) are given. Only genes with estimated half-lives <15 h are shown to focus on the trend for short-lived RNAs.

4sU dropout plots

4sU dropout plots are computed for a sample labelled with 4sU and a biologically equivalent 4sU naïve control sample. The x axis of 4sU dropout plots is the RNA half-life computed from the 4sU labelled sample using the formula (11)

$$t_{1/2} = -\frac{t \cdot \ln 2}{\ln(1 - \mathrm{NTR})}$$

Here, $t$ is the labelling time and $\mathrm{NTR}$ the new-to-total RNA ratio (NTR) estimated by GRAND-SLAM (11). Alternatively, the x axis is the NTR rank among all genes. The y axis is the log2 fold change of the labelled vs the naïve sample computed using the lfc package (20). These plots can be generated using the function Plot4sUDropout or Plot4sUDropoutRank of grandR (12). We want to point out that genes with estimated half-lives above 15h are removed from these plots by default.
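As a minimal sketch of the x axis computation, the following Python function converts an NTR into a half-life. It assumes steady state and first-order decay, under which the NTR after labelling time t equals 1 − exp(−δt) with degradation rate δ, so that the half-life is ln(2)/δ. The function names and the filtering helper are illustrative and are not part of grandR.

```python
import math


def half_life_from_ntr(ntr, labelling_time):
    """RNA half-life (same unit as labelling_time) from an NTR strictly between 0 and 1."""
    if not 0.0 < ntr < 1.0:
        raise ValueError("NTR must lie strictly between 0 and 1")
    return -labelling_time * math.log(2.0) / math.log(1.0 - ntr)


def keep_for_dropout_plot(ntr_by_gene, labelling_time, max_half_life=15.0):
    """Half-lives of all genes at or below the cutoff (15 h mirrors the default described above)."""
    half_lives = {
        gene: half_life_from_ntr(ntr, labelling_time)
        for gene, ntr in ntr_by_gene.items()
        if 0.0 < ntr < 1.0
    }
    return {gene: hl for gene, hl in half_lives.items() if hl <= max_half_life}
```

For example, an NTR of 0.75 after 2 h of labelling corresponds to a half-life of 1 h, whereas an NTR of 0.05 after 2 h corresponds to roughly 27 h and would be excluded by the 15 h cutoff.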

For the experiment involving labelling with uridine, the NTR ranks were computed from the SLAM-seq samples.

Simulation of nucleotide conversion RNA-seq with defined incorporation rates

We used the mapped reads from ref. (4) to obtain the corresponding mapping positions and genes for each read sequence while unmapped reads were removed from the simulation. To classify reads as new or old, we used NTR values estimated from the data as follows: For a gene with a read count of $n$ and an NTR value $\mathrm{NTR}$, we randomly selected $n \cdot \mathrm{NTR}$ reads and defined them as new.

In each new read, we randomly introduced additional T-to-C mismatches into the sequence by mutating a T in the read sequence into a C with probability equal to a defined incorporation rate. Pre-existing T-to-C mismatches were kept. To simulate reads with shorter read lengths we trimmed the 3′ ends to the desired length.
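The read-level simulation described above can be sketched as follows. This is an illustrative reimplementation, not the code used for the study: the function name is hypothetical, the number of new reads is rounded to the nearest integer (an assumption), and reads are treated as plain strings in which T corresponds to a potential 4sU site.

```python
import random


def simulate_conversions(reads, ntr, incorporation_rate, rng, read_length=None):
    """Mark round(n * ntr) reads of one gene as new and add T-to-C conversions to them.

    Returns a list of (sequence, is_new) tuples in the input order.
    """
    n_new = round(len(reads) * ntr)
    new_idx = set(rng.sample(range(len(reads)), n_new))
    simulated = []
    for i, seq in enumerate(reads):
        is_new = i in new_idx
        if is_new:
            # Each T is converted independently; existing C (including
            # pre-existing T-to-C mismatches) is left untouched.
            seq = "".join(
                "C" if base == "T" and rng.random() < incorporation_rate else base
                for base in seq
            )
        if read_length is not None:
            seq = seq[:read_length]  # trim the 3' end to the desired length
        simulated.append((seq, is_new))
    return simulated


# Toy usage with made-up reads for a single gene.
rng = random.Random(0)
toy_reads = ["ACGTTTGACT"] * 10
toy_result = simulate_conversions(toy_reads, ntr=0.3, incorporation_rate=0.1, rng=rng)
```

Passing a seeded `random.Random` instance keeps the simulation reproducible, which is convenient when the same reads are re-simulated at several incorporation rates.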

Simulations based on the QuantSeq (19) and Illumina TruSeq (18) experiments were performed in the same manner to evaluate the effect of additional T-to-C mismatches on mappability in different library preparation methods and read lengths.

To evaluate and compare the mapping accuracy of STAR, grand-Rescue, SLAM-DUNK, HISAT2 and HISAT-3N we generated read sequences fully in silico. We created 75 bp read sequences equal to the read counts per gene in the dataset from ref. (4) from random exonic locations of the respective genes. To simulate polymerase and technical errors, the reads were subjected to a 0.2% error rate by randomly altering single nucleotides. Then, T-to-C conversions were introduced as described above.

These reads were then mapped by all mapping tools, using the following parameters: For STAR we used ‘--outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --alignEndsType Extend5pOfReads12 --outSAMattributes nM MD NH --outSAMunmapped Within’, for HISAT2 ‘--no-repeat-index’, for HISAT-3N ‘--base-change T,C --no-repeat-index’ and standard parameters for SLAM-DUNK.

Gene ontology analysis

Gene ontology analysis was performed using GOrilla (21) with two unranked lists as the running mode, the default p-value threshold of 10⁻³ and the process ontology output. We used all genes in the data set as the background list and genes with a log fold change lower than –0.1 as the target list.

Pseudotranscriptome generation

As a basis for the creation of the pseudotranscriptomes of the Homo sapiens and Mus musculus genomes, we used the fasta and gtf files from the Ensembl database (versions 90 and 102, respectively). For each genome, we processed the gtf files to keep all entries with the gene, exon or CDS feature. Coordinates were adapted to reduce intronic and intergenic regions to 100 nucleotide spacers, and genes on the negative strand were projected onto the plus strand. We then processed the fasta files accordingly, removing intronic and intergenic sequences and replacing them by a uniform spacer of 100 N nucleotides. To transfer genes from the negative strand to the plus strand, their sequences were replaced by their reverse complements. Finally, all T nucleotides were replaced by C.
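The construction can be illustrated on a toy chromosome. The sketch below is a simplified stand-in for the actual procedure: it assumes one chromosome given as a string, genes given as a strand plus sorted, non-overlapping exon intervals (0-based, half-open), and it ignores gtf parsing and overlapping isoforms entirely. All names are hypothetical.

```python
SPACER = "N" * 100
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudotranscriptome(chrom_seq, genes):
    """Build a T-to-C converted pseudotranscriptome for one chromosome.

    genes: list of (strand, exons), where exons is a sorted list of
    non-overlapping 0-based half-open (start, end) genomic intervals.
    """
    pieces = []
    for strand, exons in genes:
        exon_seqs = [chrom_seq[start:end] for start, end in exons]
        if strand == "-":
            # Project onto the plus strand: reverse the exon order and
            # reverse complement each exon.
            exon_seqs = [reverse_complement(x) for x in reversed(exon_seqs)]
        # Introns collapse to a 100 N spacer.
        pieces.append(SPACER.join(exon_seqs))
    # Intergenic regions collapse to a 100 N spacer as well.
    pseudo = SPACER + SPACER.join(pieces) + SPACER
    return pseudo.replace("T", "C")


# Toy chromosome: a plus-strand gene with two exons and a minus-strand gene.
toy_chrom = "AAAT" + "GGGG" + "TTCA" + "CCCC" + "GATT"
toy_genes = [("+", [(0, 4), (8, 12)]), ("-", [(16, 20)])]
toy_pseudo = build_pseudotranscriptome(toy_chrom, toy_genes)
```

Because every T in the reference is already a C, a read in which all T have been replaced by C can align without any T-to-C mismatch, regardless of how many conversions it originally carried.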

grand-Rescue

grand-Rescue is a two-step process that starts from a bam file (containing read mappings without rescued 4sU labelled reads and unmappable reads) and generates a new bam file (additionally containing the mappings of rescued reads).

After mapping fastq files with STAR, grand-Rescue extracts unmapped reads using the command ‘gedi -e ExtractReads’ with standard parameters. This step writes all unmapped read sequences to a new fastq file with all T nucleotides converted to C, and saves the original sequence of each read along with its read ID and all of its bam file tags to an idMap file. The resulting fastq file is then mapped to the pseudotranscriptome using STAR with the following parameters: ‘--outFilterMismatchNmax 10 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --alignEndsType Extend5pOfReads12 --outSAMattributes nM MD NH --outSAMmode Full’. Afterwards, all multimapped reads are removed from this file using samtools (version 1.13) with the parameters ‘view -b -F 256’.

Subsequently, ‘gedi -e RescuePseudoReads’ is used to transfer the mapping position to the original genome by using the mapped position in the pseudotranscriptome. We first identified the gene a read was mapped to in the pseudotranscriptome and the gene's location on the plus or minus strand in the original genome. We calculate the distance of the alignment start position to the gene start position in the pseudotranscriptome and determine its alignment start position in the original genome by adding this distance to the gene start position in the original genome (or subtracting it from the gene end position, if the gene is originally on the negative strand and reverse complementing the sequence), skipping over intronic regions that may exist. Then, we recover the original read sequence before full T-to-C conversion along with all saved bam file tags and recalculate the nM and MD tags.
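The coordinate transfer can be sketched for a single base as follows. This is an illustration of the idea rather than the RescuePseudoReads implementation: it assumes that each gene is laid out in the pseudotranscriptome as its exons separated by 100-nucleotide spacers, that exons are sorted, non-overlapping, 0-based half-open genomic intervals, and that the offset is measured from the gene start in the pseudotranscriptome. The function name is hypothetical.

```python
def pseudo_offset_to_genome(offset, exons, strand, spacer=100):
    """Map an offset within a gene in the pseudotranscriptome to a genomic position.

    offset: 0-based distance from the gene start in the pseudotranscriptome.
    exons:  sorted, non-overlapping 0-based half-open (start, end) genomic intervals.
    strand: "+" or "-" (strand of the gene in the original genome).
    """
    # Minus-strand genes were reverse complemented, so their last genomic
    # exon comes first in the pseudotranscriptome.
    ordered = exons if strand == "+" else list(reversed(exons))
    pos = 0  # start of the current exon in pseudotranscriptome gene coordinates
    for start, end in ordered:
        length = end - start
        if pos <= offset < pos + length:
            within = offset - pos
            # Plus strand: add the distance to the exon start.
            # Minus strand: subtract it from the exon end.
            return start + within if strand == "+" else end - 1 - within
        pos += length + spacer  # skip the exon and the spacer that replaces the intron
    raise ValueError("offset falls into a spacer or outside the gene")
```

For a read on a minus-strand gene, the first aligned base in the pseudotranscriptome maps to the rightmost genomic position of the alignment, so the leftmost genomic coordinate is obtained by mapping the last aligned base instead.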

Finally, we remove unmapped reads from the original bam file and merge it with the rescued bam file from the pseudotranscriptome mapping using samtools.

Correcting 4sU dropout

The percentage of 4sU dropout can be estimated for a sample labelled with 4sU if there is a biologically equivalent 4sU naïve control sample. It is estimated by numerically finding a factor Inline graphic such that, if the NTR is multiplied by Inline graphic, the Spearman correlation coefficient of the log2 fold change 4sU/no4sU vs the NTR rank is 0. The 4sU dropout percentage then is Inline graphic. To correct for 4sU dropout, the expression of labelled RNA is multiplied by Inline graphic, and the total expression estimate and the NTR are changed accordingly.
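The idea of tuning a factor until the fold change no longer correlates with the NTR can be illustrated with a self-contained sketch. It uses its own parametrization, which is an assumption and may differ from the one in grandR: a fixed fraction d of labelled RNA is taken to be lost, so labelled RNA is multiplied by a scale s, and d = 1 − 1/s. The scale is found by bisection, assuming that the Spearman correlation increases monotonically with s. All function names and the synthetic data are hypothetical.

```python
import math
import random


def ranks(values):
    """Rank positions (0 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def rescaled_lfc(total, ntr, control, scale):
    """log2 fold change vs. the control after multiplying labelled RNA by `scale`."""
    return [
        math.log2((t * (1.0 - n) + t * n * scale) / c)
        for t, n, c in zip(total, ntr, control)
    ]


def estimate_dropout(total, ntr, control, lo=1.0, hi=20.0, iterations=50):
    """Bisect for the scale at which the fold change no longer tracks the NTR."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if spearman(rescaled_lfc(total, ntr, control, mid), ntr) < 0:
            lo = mid  # short-lived RNAs still look downregulated: scale up more
        else:
            hi = mid
    scale = 0.5 * (lo + hi)
    return scale, 1.0 - 1.0 / scale


# Synthetic check with a known dropout of 30% (all numbers are made up).
rng = random.Random(1)
true_dropout = 0.3
true_ntr = [rng.uniform(0.02, 0.9) for _ in range(2000)]
true_total = [rng.uniform(100.0, 1000.0) for _ in range(2000)]
control = [t * 2.0 ** rng.gauss(0.0, 0.05) for t in true_total]
new = [t * n * (1.0 - true_dropout) for t, n in zip(true_total, true_ntr)]
old = [t * (1.0 - n) for t, n in zip(true_total, true_ntr)]
observed_total = [a + b for a, b in zip(new, old)]
observed_ntr = [a / (a + b) for a, b in zip(new, old)]
scale, dropout = estimate_dropout(observed_total, observed_ntr, control)
```

Median centering of the fold changes is omitted because the Spearman correlation is invariant to a constant shift, and the observed NTR can be used for ranking because rescaling labelled RNA by a common factor does not change the order of the NTRs.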

RNA half-lives and testing for mis-normalization

RNA half-lives and 95% confidence intervals were estimated from the progressive labelling data using the non-linear least squares method (12) using the grandR function FitKinetics after recalibrating effective labelling times using the grandR function CalibrateEffectiveLabelingTimeKineticFit. The likelihood ratio test for an upward or downward trend in the total RNA of uncorrected data was performed by using the 4sU labelling time as independent variable in the target model, and only an intercept term for the background model. Testing was performed using the LikelihoodRatioTest function of grandR.

Dropout fragment nucleotide compositional analysis

To determine the RNA fragments for the paired-end reads, we first used kallisto (version 0.44.0) with parameter --rf-stranded to infer transcript level expression for the two pooled 4sU naïve samples from our progressive labelling time course. We determined the major isoform for each gene by identifying the transcript with the highest TPM value per gene, and the major isoform percentage by dividing the TPM of the major isoform by the total TPM of all transcripts for a gene. All genes with RNA half-life <30 min, TPM > 10 and a major isoform percentage of >90% were considered further. For each of these genes, and each sample, we collected all mapped read pairs, and determined the RNA fragment by connecting the two mates according to the exon-intron pattern of the major isoform. The corresponding sequence was used to count all k-mers with k = 1…3.
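The two bookkeeping steps described above, selecting the major isoform and counting k-mers in a fragment sequence, can be sketched as follows. The function names and the toy inputs are hypothetical; reconstructing fragments from read pairs is not shown.

```python
from collections import Counter


def major_isoform(tpm_by_transcript):
    """Return (transcript id, fraction of the gene's total TPM) for the most abundant transcript."""
    best = max(tpm_by_transcript, key=tpm_by_transcript.get)
    total = sum(tpm_by_transcript.values())
    return best, tpm_by_transcript[best] / total


def count_kmers(fragment, max_k=3):
    """Count all overlapping k-mers of a fragment sequence for k = 1..max_k."""
    counts = {k: Counter() for k in range(1, max_k + 1)}
    for k in range(1, max_k + 1):
        for i in range(len(fragment) - k + 1):
            counts[k][fragment[i:i + k]] += 1
    return counts


# Toy usage with made-up values.
toy_isoform, toy_fraction = major_isoform({"tx1": 90.0, "tx2": 10.0})
toy_counts = count_kmers("ACGTT")
```

Summing such counters over all fragments of a gene in a sample gives the nucleotide (k = 1), dinucleotide (k = 2) and trinucleotide (k = 3) content that can then be compared between labelled and naïve samples.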

Statistical analyses

Hierarchical clustering for the heatmaps in Figure 8D and Supplementary Figure S10B was performed using complete linkage and Euclidean distances. The DESeq2 (22) analysis for Figure 7E was conducted using a likelihood ratio test comparing the background linear model, where only an intercept term was fitted to all samples (replicates and time points), to the target linear model, where the time point was included as independent variable for regression. The test was performed on total RNA levels separately before and after correction.

Figure 7.


Evaluation of 4sU dropout correction in progressive labelling data. (A) Comparison of the 4sU dropout percentage before and after correction in progressive labelling data over all time points in replicate A. (B, C) 4sU dropout rank plots of the 120 min sample in replicate A before and after correction with overlaid boxplots as in Figure 6B, C. Spearman's correlation coefficient (ρ) and the associated p value (approximate t test) are given. (D) Old, new and total gene expression of Dusp4 and Eif3l over 2 hours of labelling before (left) and after (right) correction. The kinetic model fits are indicated as dashed lines. (E) Volcano plot of genes showing an upwards or downwards trend on total RNA level before (left) and after correction (right). The y axis shows -log10 of the DESeq2 P value (likelihood ratio test comparing a model with the 4sU labelling time as independent variable vs a model with intercept only) adjusted for multiple testing (Benjamini–Hochberg; FDR, false discovery rate). The numbers of genes above and below 5% FDR and log2 fold changes of >0.25 are indicated. Gene half-lives are represented by color.

Results

Excessive 4sU treatment results in quantification bias preferentially for short-lived RNAs

Since high incorporation frequencies are beneficial for quantifying labelled RNA (see Supplementary Note S1), we investigated to what extent an increase of 4sU concentrations improves the incorporation frequency, and whether this has a negative effect on the samples. We previously observed short-lived RNAs to be downregulated when comparing samples that were treated with 4sU for long periods of time (8 h) to 4sU naïve samples (12). We reasoned that this could be caused by three effects (Figure 1): First, long-term treatment by 4sU could globally impact RNA metabolism by reducing transcriptional activity or accelerating degradation of labelled RNA. With reduced transcriptional activity, all RNAs are inhibited by the same factor, but levels of short-lived RNAs would drop more rapidly than levels of long-lived RNAs, thereby explaining our observation. Second, as described previously, converted 4sU might reduce reverse transcription efficiency (14), or other steps of the library preparation might be implemented in a sub-optimal way for 4sU (15), such that labelled RNA is underrepresented in the sequencing library. The total number of T-to-C conversions for short-lived RNAs is larger than for long-lived RNAs, and, therefore, this could also explain the apparent downregulation of short-lived RNAs in 4sU treated samples. Third, the probability that a read is correctly mapped to its genomic locus of origin declines with increasing numbers of mismatched bases, which would also have its strongest effect on short-lived RNAs. Importantly, in all three cases fewer reads corresponding to newly synthesized RNA are mapped in the 4sU treated sample than in the 4sU naïve sample. In the first case, these reads are missing due to reduced RNA levels in the cells. In the second case, RNA levels are unaltered, but the composition of the sequencing library is changed. In the third case, the representation of genes by sequencing reads is unchanged but the read mapping algorithm could not assign them to their correct genomic loci.

To further investigate these, not mutually exclusive, causes, we generated several nucleotide conversion RNA-seq data sets of 4sU labelled samples: We performed dose escalation experiments by labelling with 0 μM, 100 μM, 200 μM, 400 μM and 800 μM of 4sU for 1h in the two cancer cell lines U2OS and HCT116 as well as in human telomerase reverse transcriptase immortalized primary foreskin fibroblasts (HFF-TerT; Figure 2A). The observed 4sU-induced T-to-C conversions among all mapped reads increased with higher concentrations for all three cell lines (Figure 2B). Interestingly, the maximal value at 800 μM was remarkably similar among the three cell lines (HFF-TerT, 0.80%; U2OS, 0.87%; HCT116, 0.94%), but the temporal kinetics were quite different. To investigate this further, we used GRAND-SLAM (11) to estimate the 4sU incorporation frequency (percentage of T-to-C conversions among labelled RNA only) and the percentage of labelled RNA for all samples. Except for HCT116, the estimated percentage of labelled RNA was largely constant for all concentrations indicating that GRAND-SLAM could reliably deconvolute the observed T-to-C conversions into contributions of the percentage of labelled RNA and different 4sU incorporation frequencies (Figure 2C). Consistent with previous observations (23), incorporation frequencies among the three cell lines differed substantially and, as expected, increased with higher concentrations (Figure 2D). The incorporation frequency only depends on the relative concentrations of activated 4sU and uridine (U) and is therefore expected to be approximately a linear function of the 4sU concentration in the regime well below the U concentration (see Materials and methods). However, this increase saturated for all three cell lines well below 100%, indicating that, with increasing 4sU concentrations, import or activation of 4sU became rate limiting or that 4sU containing reads were underrepresented for the three reasons introduced above.

To investigate the effect of 4sU on short-lived RNAs we performed 4sU dropout analysis by correlating the log2 fold change of each 4sU treated sample to the corresponding 4sU naïve control sample of the same cell line vs the NTR. Interestingly, U2OS, which had the overall lowest incorporation frequencies (Figure 2D), did not show downregulation of short-lived RNA even with 800 μM 4sU (Figure 2E). By contrast, HCT116 and especially HFF-TerT showed this effect at higher concentrations (Figure 2E).

We also sequenced a time course of murine NIH-3T3 fibroblasts labelled using 800 μM 4sU for 0, 15, 30, 60, 90 and 120 min (Figure 3A). Here, as expected, the raw T-to-C conversions as well as the percentage of newly synthesized RNA increased with longer periods of labelling (Figure 3B, C). Interestingly, consistent with the observations made by us and others that activated 4sU accumulates only slowly in cells (12,24), the incorporation frequencies of 4sU also increased from 2% for the 15 min and 30 min timepoints to more than 4% at the 1 and 2 h time point (Figure 3D). 4sU dropout analyses did not reveal any effects of 4sU up to 30 min treatment but showed increasingly stronger downregulation of short-lived RNAs at later time points (Figure 3E).

Figure 3.

Evaluation of progressive labelling durations on quantification bias. (A) SLAM-seq experiments were conducted with NIH-3T3 cells with labelling for 0, 15, 30, 60, 90 or 120 min with 800 μM of 4sU. (B) Observed T-to-C mismatches over all reads per time point and replicate. (C) Percentages of new RNA per time point and replicate estimated by GRAND-SLAM. (D) Incorporation frequency of 4sU into newly synthesized RNA per time point and replicate estimated by GRAND-SLAM. (E) 4sU dropout plots of n = 9072 genes for all time points of replicate A. The x axis shows the RNA half-life, the y axis the median centered log2 fold change of total RNA expression for the 4sU labelled sample vs. the corresponding 4sU naïve control sample. A local polynomial regression (loess) fit is indicated in red. Spearman's correlation coefficient (ρ) and the associated p value (approximate t test) are given. Only genes with estimated half-lives <15 h are shown to focus on the trend for short-lived RNAs.

In summary, both increasing concentrations of 4sU and extended periods of labelling bias the quantification of total RNA because fewer sequencing reads are observed for short-lived RNAs.

Introduction of additional mismatches impairs read mappability

We first investigated whether the observed downregulation of short-lived RNA is solely due to diminished mappability of reads with T-to-C conversions. To this end, we considered all mapped reads (76 bp, single-end) from a 4sU naïve control sample of a recent study (4) as a starting point and artificially and randomly introduced T-to-C conversions with varying incorporation frequencies ranging from 0% (equal to the original data) up to a maximum of 25% into the reads. This was done only for a fraction of the reads corresponding to the gene-wise NTR estimated in the original data, thus simulating a realistic nucleotide conversion sequencing experiment with controlled 4sU incorporation. We then used STAR (25) to map these reads back to the reference genome.
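The simulation step can be illustrated with a short sketch. It is a simplification under stated assumptions: reads are given in sense orientation (so conversions appear as T-to-C), each read is treated as labelled with a probability equal to the gene-wise NTR, and the function name and parameters are hypothetical rather than taken from the authors' code.

```python
import random

def simulate_conversions(reads, ntr, rate, seed=0):
    """Introduce random T-to-C conversions into a fraction of reads.

    reads: list of read sequences (strings over A, C, G, T)
    ntr:   probability that a read stems from labelled RNA (gene-wise NTR)
    rate:  probability that a T in a labelled read is converted to C
    Returns the modified reads and the number of conversions per read.
    """
    rng = random.Random(seed)
    converted, counts = [], []
    for read in reads:
        labelled = rng.random() < ntr  # only labelled reads can carry conversions
        bases, n = [], 0
        for base in read:
            if labelled and base == "T" and rng.random() < rate:
                bases.append("C")
                n += 1
            else:
                bases.append(base)
        converted.append("".join(bases))
        counts.append(n)
    return converted, counts

# Hypothetical example: one gene with NTR 0.5 and a 25% conversion rate.
reads = ["ACGTTTGACTTAGT"] * 5
new_reads, n_conv = simulate_conversions(reads, ntr=0.5, rate=0.25)
for read, n in zip(new_reads, n_conv):
    print(read, n)
```

The modified reads would then be passed to a read mapper, and the conversion counts kept as ground truth for the comparison with the mapped data.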

First, we investigated how many of the introduced T-to-C conversions were lost due to reduced mappability. As expected, higher incorporation frequencies directly correlated with the number of lost T-to-C conversions, already reaching 9.8% at a 10% incorporation rate and a maximum of 38.7% in the 25% sample (Figure 4A). Interestingly, the number of lost mismatches did not rise linearly with increasing incorporation rates, indicating that reads with multiple mismatches are increasingly difficult to map. To test this, we binned the reads according to their number of introduced T-to-C conversions and analysed the count ratio of mappable reads vs the simulated reads in each bin. This ratio dropped steeply with every additional mismatch, resulting in a loss of 19% of reads with 3, and almost 50% of the reads with 5 T-to-C conversions (Figure 4B). Thus, mappability suffers substantially in the presence of multiple 4sU induced nucleotide conversions on the same read. Generally, mappability of reads decreases with increasing T-to-C conversion rates, but to a varying degree depending on the sequencing technique and read length (Supplementary Figure S1).
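The binning described above amounts to a per-bin ratio of mapped to simulated reads. A minimal sketch, assuming that the number of introduced conversions and a mapped/unmapped flag are known for every simulated read (both inputs and the function name are hypothetical):

```python
from collections import Counter

def mappable_ratio_by_conversions(conversion_counts, mapped_flags):
    """Fraction of simulated reads that were mapped, per number of T-to-C conversions."""
    simulated = Counter(conversion_counts)
    mapped = Counter(n for n, ok in zip(conversion_counts, mapped_flags) if ok)
    return {n: mapped[n] / simulated[n] for n in sorted(simulated)}

# Hypothetical toy data: one of four reads with 3 conversions is lost.
counts = [0, 0, 1, 1, 3, 3, 3, 3]
flags = [True, True, True, True, True, True, True, False]
print(mappable_ratio_by_conversions(counts, flags))  # {0: 1.0, 1: 1.0, 3: 0.75}
```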

Figure 4.

In-silico simulation of nucleotide-conversion RNA-seq. (A) Line plots showing the number of simulated, observed and unobserved T-to-C mismatches after simulating T-to-C conversion rates of 0% up to 25% in 4sU naïve reads and subsequent remapping with STAR. (B) Line plots showing the percentage of reads with 0 up to 14 T-to-C mismatches observed after mapping versus the true number of reads created by SLAM-seq simulations with 5% to 25% conversion rates. (C) GRAND-SLAM estimates of incorporation frequencies in newly synthesized RNA after introduction of T-to-C mismatches into read sequences and subsequent mapping (empty dots, with mapping) and into already mapped reads (solid dots, without mapping) for conversion rates of 5% to 25%. (D) 4sU dropout plot of a simulated 25% conversion rate sample versus a 4sU naïve sample. Only genes with estimated half-lives <25 h are shown to focus on the trend for short-lived RNAs. (E) 4sU dropout plot of a simulated 25% conversion rate sample vs. the true read counts from the 0% conversion rate sample. All genes with a log2 fold change <–0.12 are highlighted.

Next, we used GRAND-SLAM to estimate incorporation frequencies in labelled RNA. Interestingly, the incorporation frequencies were underestimated by a fixed factor of approximately 0.83 for all simulated samples (Figure 4C). This was an effect of read mapping, as introducing the mismatches into already mapped reads before running GRAND-SLAM resulted in unbiased estimates (Figure 4C). Underestimation by a fixed factor is not unexpected since GRAND-SLAM utilizes the proportions of read counts with >1 T-to-C conversions for estimation of the incorporation frequency which suffer to the same extent from reduced mappability independent of the true incorporation frequency (Figure 4B). The estimated incorporation frequency is an important parameter for the estimation of gene-wise NTRs. However, gene-wise NTR estimates were not biased due to underestimated incorporation frequencies of up to 15% (Supplementary Figure S2) and were underestimated specifically for high NTR values for incorporation frequencies above 15%. This indicates that the GRAND-SLAM model inherently compensates for biased estimates of incorporation frequencies when T-to-C mismatches are unobserved and only suffers when a substantial fraction of the reads is missing.

Finally, we compared our simulated samples with a 4sU naïve sample from the original data set to mimic 4sU dropout analyses of real data. Interestingly, similar to real data with high 4sU concentrations, short-lived RNAs appeared to be downregulated (Figure 4D). However, this effect was only apparent at simulated incorporation frequencies of >15% and generally less pronounced than in the extreme cases of real data. We also compared the 25% sample with the 0% sample, which reflects the original reads without introduced T-to-C conversions, thereby removing variance between replicates. This revealed the loss of reads of short-lived key transcription factors like MYC and JUN or central signalling molecules like CYR61 (Figure 4E). Gene set enrichment analysis revealed 200 gene ontology terms that consist of short-lived RNAs and therefore appear to be downregulated due to reduced mappability (Supplementary Table S1). We concluded that reduced mappability, albeit contributing, cannot explain the drastic loss of reads from short-lived RNAs observed in real samples treated with high concentrations of 4sU but might still result in biased fold changes for highly relevant classes of genes.

grand-Rescue improves mappability of T-to-C conversion reads

Since reduced mappability of reads with T-to-C conversions has significant effects on quantification, we wondered whether read mapping could be improved. A promising approach that has been used in the past for other applications that involve nucleotide conversion such as bisulfite sequencing is to perform read mapping under a three-letter alphabet, e.g. after changing all T to C in both reads and reference (26). In principle, after switching to a three-letter alphabet, any standard read mapping tool can be used. Among the plethora of available read mapping tools, there are large differences in terms of mapping accuracy also without nucleotide conversion (27). We therefore favoured an approach that can use any available tool to do the actual read mapping, instead of adapting an existing tool or developing a new tool. Our method termed grand-Rescue is a two-step algorithm that first tries to map all reads to the reference genome without any modification, and then subsequently tries to map all unmappable reads to a pseudotranscriptome with a reduced alphabet. The final mapping locations are then transferred to the original reference genome (Figure 5A). We use STAR (25) as the internal read mapper, which we found to have superior performance over several other read mapping tools.
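The two-step idea can be illustrated with a toy example. This sketch is not grand-Rescue: exact string search stands in for a real read mapper such as STAR, the same short reference is used in both steps (grand-Rescue uses a pseudo-transcriptome in the second step), and all function names are hypothetical. Because replacing T by C does not change sequence length, positions found in the reduced alphabet carry over directly to the original sequence.

```python
def to_three_letter(seq):
    """Reduce the alphabet by converting every T to C."""
    return seq.replace("T", "C")

def find_all(text, pattern):
    """All start positions of exact matches of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def map_with_rescue(read, reference):
    """Step 1: try to map the unmodified read.
    Step 2 (only if step 1 fails): map in the three-letter alphabet."""
    hits = find_all(reference, read)
    if hits:
        return hits, "direct"
    hits = find_all(to_three_letter(reference), to_three_letter(read))
    if hits:
        return hits, "rescued"
    return [], "unmapped"

reference = "ACGTTAGCTTGACCTA"
print(map_with_rescue("TTAGCTTG", reference))  # ([3], 'direct')
print(map_with_rescue("CCAGCCTG", reference))  # three T-to-C conversions: ([3], 'rescued')
print(map_with_rescue("GGGGGGGG", reference))  # ([], 'unmapped')
```

Restricting the reduced-alphabet search to reads that failed in the first step keeps the four-letter mapping for all reads that do not need rescue, so only the rescued reads are exposed to the additional ambiguity of a three-letter alphabet.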

Figure 5.

Evaluation of grand-Rescue and comparison to existing mapping tools. (A) grand-Rescue first extracts unmappable reads, converts all T in their sequences to C and maps these reads with a read mapping tool of choice to a three-letter (T converted to C) pseudo-transcriptome. Rescued reads are then transferred to the original genome. (B) Percentage of unmapped reads for HISAT2, HISAT3N, SLAM-DUNK, STAR and STAR + Rescue for different simulated incorporation frequencies. (C) Percentage of uniquely mapped reads in relation to all reads per sample for HISAT2, HISAT3N, SLAM-DUNK, STAR and STAR + Rescue. (D) Correctly mapped reads in relation to all reads for HISAT2, HISAT3N, SLAM-DUNK, STAR and STAR + Rescue. (E) GRAND-SLAM estimates of incorporation frequencies before (empty dots) and after (solid dots) rescue. (F) 4sU dropout plots of a simulated 25% conversion rate sample vs. the true read counts from the 0% conversion rate sample before (top) and after (bottom) rescue.

We used simulated nucleotide conversion sequencing data to evaluate and compare the performance of grand-Rescue with STAR (25) and HISAT2 (28), two standard read mappers, as well as SLAM-DUNK (29) and HISAT3N (30), tools that have been developed specifically for nucleotide conversion RNA-seq. Instead of using STAR mapped reads as starting point as above, which would favour STAR based read mapping, we randomly redistributed reads across their mRNA (see Methods). As expected, the percentage of unmappable reads for STAR and especially HISAT2 increased drastically with the conversion rate (Figure 5B). Surprisingly, this was also the case for the T-to-C conversion aware read mapper SLAM-DUNK, while HISAT3N and grand-Rescue remained unaffected by increasing T-to-C conversions. Among the reads mappable by each individual tool, the percentage of uniquely mappable reads with increasing T-to-C conversions stayed constant for HISAT3N and only dropped slightly for both grand-Rescue and SLAM-DUNK (Figure 5C). Importantly, however, HISAT3N and SLAM-DUNK only mapped 96.6% of the reads uniquely, while 97.8% were uniquely mapped by grand-Rescue even for the 25% sample. We observed a similar picture when analyzing the percentage of correctly mapped reads among unique reads, i.e. the mapping accuracy: For both grand-Rescue and HISAT3N, the accuracy did not drop with increasing T-to-C conversions, but HISAT3N had overall lower performance than grand-Rescue across all samples (Figure 5D). Using STAR alone and adapting its mapping parameters improved its performance only minimally and remained suboptimal compared to grand-Rescue (Supplementary Figure S3).

We concluded that all three T-to-C conversion aware read mappers, which follow different strategies, can indeed improve read mapping and that grand-Rescue performs favourably. Importantly, however, the internally used read mapping algorithm, which can be changed for grand-Rescue, also has a great effect on read mappability.

grand-Rescue mitigates effects of reduced mappability

Rescuing previously unmappable T-to-C conversion reads using grand-Rescue substantially improved the estimates of the T-to-C conversion frequency, which were now only slightly underestimated by a factor of roughly 0.97 instead of 0.83 before rescue (Figure 5E). More importantly, after rescue, the apparent downregulation of short-lived RNA due to read mappability was no longer observed (Figure 5F). To account for the impact of different library preparation protocols and read lengths on the estimation of 4sU incorporation rates by GRAND-SLAM, we used different starting points for our simulation, including data generated using a 3′ end sequencing protocol (QuantSeq) as well as paired-end and single-end data sets based on random priming (TruSeq), and simulated 4sU incorporation rates from 0% to 10%. The QuantSeq data consisted of 75 bp single-end reads, whereas the TruSeq data were sequenced with 2 × 125 bp paired-end reads. To mimic other sequencing modes, we in-silico trimmed the TruSeq reads to 100 or 75 bp reads and also discarded the second reads, and thus analyzed overall six settings based on TruSeq data, each with different incorporation rates (Supplementary Figure S4). The percentage of rescued reads was highest in QuantSeq, especially in the sample with a 10% conversion rate, where 0.58% of all reads were rescued, whereas fewer reads were rescued in the single-end TruSeq samples (Supplementary Figure S4A). These findings are also reflected in the estimated incorporation frequency, which was underestimated most in the 75 bp QuantSeq data, where it could be rescued, and to a minor extent in the TruSeq data sets, whereas longer reads showed few to no signs of underestimation (Supplementary Figure S4B).

Both the dose escalation as well as progressive time course data sets were generated using the TruSeq protocol and sequenced with 2 × 76 bp paired end reads. Indeed, in accordance with the simulated data, the estimated 4sU incorporation frequency did not change significantly for these data sets (Supplementary Figures S5 and S6), and the effect on short-lived RNA was still clearly visible after rescue (Supplementary Figures S7 and S8).

We concluded that even though improved read mapping can fully mitigate the effect of reduced mappability of T-to-C conversion reads, short-lived RNA still appears to be downregulated with high 4sU concentrations at long labelling times.

Labelled RNA is underrepresented by a constant factor

We hypothesized that a global and unspecific underrepresentation of labelled RNA in the sequencing libraries is responsible for the observed downregulation of short-lived RNAs and that gene-specific differences in RNA half-lives can explain gene-specific differences in downregulation. In this case, in each sample the same fraction of labelled RNA is missing for each gene. To test this hypothesis, we devised an algorithm to estimate this percentage of 4sU dropout and used this parameter to scale up the estimated newly synthesized RNA per gene (Figure 6A).

Figure 6.

Correction of 4sU dropout by scaling labelled RNA. (A) Correction of three example genes affected by 4sU dropout with short (a), medium (b) and long RNA half-life (c). A dropout factor ‘d’ is calculated and subsequently expression of labelled RNA is multiplied by 1/(1 − d). (B, C) 4sU dropout rank plots of the 800 μM HFF-TerT sample before and after correction. The x axis here is the rank of the new-to-total RNA ratio (largest NTR left). Boxplots showing the log2 fold changes of the 4sU labelled sample vs unlabelled control are overlaid for 10 equisized bins along the x axis. Spearman's correlation coefficient (ρ) and the associated P value (approximate t test) are given. (D) Comparison of 4sU dropout percentage before and after correction in all three cell lines and for all 4sU concentrations.

We estimated the 4sU dropout such that it minimizes the absolute correlation of the log2 fold change of the 4sU treated sample vs the corresponding 4sU naïve sample (4sU versus no4sU) against the new-to-total RNA ratio per gene. Before this correction, this correlation was strong and highly significant for the 800 μM HFF-TerT sample (Figure 6B, Spearman's ρ = 0.29, P < 2.2 × 10⁻¹⁶, asymptotic t test). After correction, the correlation vanished (Figure 6C, Spearman's ρ = 0, P = 0.91, asymptotic t test). Importantly, we did not observe any signs of a non-monotonic correlation after correction: The distributions of the 4sU versus no4sU log2 fold change for 10 equisized bins along the NTRs were indistinguishable (P = 0.11, Kruskal–Wallis test). This result suggests that scaling by the percentage of transcriptional loss completely removed the observed effect of preferential downregulation of short-lived RNAs.
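The estimation and scaling can be sketched as follows. This is an illustration under explicit assumptions and not the grandR implementation: total expression is split into unlabelled and labelled parts using the observed NTR, the labelled part is scaled by 1/(1 − d), and d is found by a simple grid search (how the minimization is actually carried out is not stated above). All names are hypothetical, and all expression values are assumed to be positive.

```python
import math
import random

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks (0.0 if one input is constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def corrected_total(total, ntr, d):
    """Keep unlabelled RNA as is and scale labelled RNA by 1 / (1 - d)."""
    return total * (1 - ntr) + total * ntr / (1 - d)

def estimate_dropout(treated, control, ntr, grid_size=100):
    """Grid search over d in [0, 1) for the minimal absolute Spearman correlation
    between the corrected log2 fold change (4sU vs. no4sU) and the NTR."""
    best_d, best_abs_rho = 0.0, float("inf")
    for step in range(grid_size):
        d = step / grid_size
        lfc = [math.log2(corrected_total(t, n, d) / c)
               for t, c, n in zip(treated, control, ntr)]
        abs_rho = abs(spearman(lfc, ntr))
        if abs_rho < best_abs_rho:
            best_d, best_abs_rho = d, abs_rho
    return best_d

# Hypothetical simulation: 30% of labelled RNA is lost in the treated sample.
rng = random.Random(1)
true_ntr = [rng.random() for _ in range(500)]
control = [rng.uniform(50, 500) for _ in range(500)]
old = [c * (1 - n) for c, n in zip(control, true_ntr)]
new = [c * n * (1 - 0.3) for c, n in zip(control, true_ntr)]
treated = [(o + w) * math.exp(rng.gauss(0, 0.05)) for o, w in zip(old, new)]
observed_ntr = [w / (o + w) for o, w in zip(old, new)]
print(estimate_dropout(treated, control, observed_ntr))  # expected to be close to 0.3
```

Median centering of the log2 fold changes is omitted here because a constant shift does not change a rank correlation.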

The 4sU dropout percentage can not only be used to correct for this effect, but is also a convenient way to quantify the extent of this effect per sample as an alternative to visually inspecting the corresponding 4sU dropout plots. Indeed, the dropout values for the 4sU dose escalation experiments mirrored our visual impression (Figure 6D): For HFF-TerTs, the 4sU dropout rose to >40% at 800 μM and was lower for all other samples. For the two cancer cell lines, only the 800 μM sample of HCT116 was above 30%. In summary, the 4sU dropout percentage can be used as a statistic to quantify preferential downregulation of short-lived RNA and to correct for it.

4sU dropout scaling mitigates biased expression estimates

To further investigate whether scaling using the 4sU dropout percentage can mitigate the effect of downregulation of short-lived RNAs, we analysed our progressive labelling time course data set. The transcriptional loss was remarkably consistent among replicates and increased steadily and almost linearly with longer labelling time up to a value of 31.6% and 33.5% for the two replicates with 2 h labelling (Figure 7A). Again, scaling labelled RNA based on the 4sU dropout percentage corrected for the downregulation effect without clear signs of non-monotonic correlations (Figure 7B, C).

4sU dropout does not only bias estimates of the NTRs, but also has profound effects on normalization across samples with distinct labelling times, e.g. for progressive labelling time courses. This is because under 4sU dropout, the fundamental assumption of no global changes in gene expression that most normalization methods make is violated. Indeed, normalization resulted in downward trends along the labelling time for short-lived RNAs, and in upward trends for long-lived RNAs (Figure 7D, E). For example, the levels of the short-lived RNA of Dusp4 declined at a rate of 12% per hour after size factor normalization (22) whereas the levels of the long-lived RNA of Eif3l increased at a rate of 5% per hour (Figure 7D). Both trends were fully corrected by 4sU dropout scaling (Figure 7D). Of note, these downward and upward trends due to normalization also bias half-life estimates: Upon correction, the estimated half-lives changed from 28.7 min (95% CI: 25.5–32.8 min) to 23.0 min (95% CI: 20.6–26.0 min) for Dusp4 and from 9:39 h (95% CI: 7:56–12:19 h) to 5:51 h (95% CI: 5:22–6:25 h) for Eif3l. Globally, n = 162 genes showed a strong and significant upward or downward trend without correction (absolute log2 fold change per hour >0.25, P value <5%, likelihood ratio test, Benjamini–Hochberg adjusted for multiple testing, Figure 7E), and only n = 23 remained after 4sU dropout factor scaling (Figure 7E).
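A trend along the labelling time, expressed as a relative change per hour, can be approximated by a log-linear fit. The sketch below uses ordinary least squares on log2 expression values. This is an assumption made for illustration and not the likelihood ratio test framework used above. The function name and the example values are hypothetical.

```python
import math

def trend_per_hour(times_h, expression):
    """Least-squares slope of log2(expression) over time, returned as the
    relative change per hour (e.g. -0.12 for a decline of 12% per hour)."""
    logs = [math.log2(v) for v in expression]
    mean_t = sum(times_h) / len(times_h)
    mean_l = sum(logs) / len(logs)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(times_h, logs))
             / sum((t - mean_t) ** 2 for t in times_h))
    return 2 ** slope - 1

# Hypothetical normalized expression values declining by 12% per hour.
times = [0, 0.25, 0.5, 1, 1.5, 2]
declining = [100 * 0.88 ** t for t in times]
print(round(trend_per_hour(times, declining), 3))  # -0.12
```

The slope itself is the log2 fold change per hour, which is the quantity thresholded at an absolute value of 0.25 in the text above.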

To test our correction approach further, we considered SLAM-seq data previously published by us (4). This data set is a progressive labelling time course with 0, 1, 2 or 4 h of labelling and is part of a larger study on the molecular mechanisms of the transcription elongation factor SPT6. Since we had noticed 4sU dropout for the 2 and 4 h time points, we had to exclude labelled RNA from these for estimating kinetic parameters in the previous publication and focused on synthesis rates only (4). Indeed, half-lives estimated without correction using all time points deviated drastically from the ones estimated with correction (Supplementary Figure S9A). Interestingly, bias was strongest for long half-lives, where the estimator is more dependent on the later time points. However, bias was substantially reduced when labelled RNA from the 2 and 4 h time points was excluded as done in the publication (4) (Supplementary Figure S9B). Since a lot of useful information was excluded, it was not unexpected that there was strong variance especially for half-lives longer than 2 h (Supplementary Figure S9B) and that the estimated confidence intervals were much larger when only using the 1 h time point than when using all time points with correction (Supplementary Figure S9C). Thus, with the correction approach proposed here, the 2 and 4 h time points can now also be used, e.g. to assess the impact of SPT6 depletion on RNA half-lives.

In summary, scaling by the 4sU dropout percentage removed global 4sU induced effects on expression estimates that occur with high 4sU concentrations and long periods of labelling.

Multiple sample handling steps result in 4sU dropout

In principle, extensive 4sU dropout observed in our progressive labelling data set could be due to a direct or indirect effect of 4sU on RNA metabolism in the living cells, or because labelled RNA is underrepresented in the sequencing library. This underrepresentation could be due to diminished reverse transcription efficiency of 4sU containing RNA (14) or due to other properties of 4sU that interfere with sample handling and library preparation (15). To test this hypothesis, we determined the RNA fragments that were reverse transcribed from the paired-end sequencing data for all samples for n = 105 genes that had an RNA half-life of <30 min, were strongly expressed (>10 TPM) and had an estimated major isoform percentage of >90%. Interestingly, RNA fragments across all 105 genes that were sequenced from cells that were treated with 4sU for 2 h had significantly lower U or 4sU content than RNA fragments sequenced from 4sU naïve cells (P < 2.2 × 10⁻¹⁶, Wilcoxon test, Figure 8A, B). The lower U or 4sU content was consistent for both replicates, and the content gradually decreased with the labelling time (Figure 8C). This was not due to issues with read mappability since reads were mapped using grand-Rescue, and we also observed the same differences in nucleotide content for the non-sequenced parts of the RNA fragments in between the read pair (Supplementary Figure S10). Notably, by counting di- and trinucleotides, we found that underrepresentation of U or 4sU in the RNA fragments in 4sU labelled samples depended on the sequence context with neighbouring U (or 4sU), A or G nucleotides resulting in stronger underrepresentation (Figures 8D and S10). Taken together, this suggests that RNA fragments that contain 4sU residues have a lower probability to be reverse transcribed into cDNA than RNA fragments without 4sU. This can be a direct effect of 4sU on reverse transcription efficiency of iodoacetamide-converted 4sU nucleotides, as described (14), or because 4sU containing RNA drops out earlier during sample preparation, e.g. because it interacts with the material of the tubes, as also described (15).

To further test these possibilities, we performed nucleotide conversion RNA-seq in NIH-3T3 cells, where we either did not label cells at all (control) or labelled cells with 800 μM uridine (U) or 800 μM 4sU for 2 h. The latter sample was split and either not converted at all, converted with the SLAM-seq approach (1) (which converts 4sU into a cytosine analogue) or the TUC-seq approach (3) (which converts 4sU into an actual cytosine) (Figure 8E). While the U labelled samples showed no signs of dropout, 4sU labelling alone without a conversion step resulted in intermediate levels of dropout (10–20%), and both SLAM and TUC conversion resulted in high levels of dropout (25–40%). This indicates that inefficient reverse transcription does not play a major role in the dropout of 4sU and that at least half of the RNA dropped out during SLAM or TUC conversion.

To further investigate this, we performed another experiment, where we labelled cells with either 0 μM (control), 200 μM or 800 μM 4sU, and then performed SLAM conversion either in intact, methanol fixed cells, or in tubes as in all other experiments (Figure 8G). Surprisingly, in these experiments, both 4sU concentrations resulted in intermediate dropout levels (∼20%) when conversion was performed in tubes. Here, SLAM conversion was done at 37°C in contrast to the 50°C used in the other experiments, indicating that conversion done at lower temperatures mitigated 4sU dropout. In fixed cells, dropout increased from ∼20% with 200 μM to ∼30% with 800 μM 4sU (Figure 8H). Thus, substantial 4sU dropout occurred in a concentration dependent manner even when 4sU containing RNA was not in contact with the potentially sticky plastic of the tubes. Taken together, these data show that there are multiple steps during sample handling that can cause 4sU containing RNA to drop out.

If 4sU containing RNA drops out during library preparation, it is a global and random effect that affects all fragments from labelled RNAs to roughly the same extent. Any other covariate that correlates with the 4sU versus no4sU log2 fold change would also result in a correlation of the 4sU versus no4sU log2 fold change among replicates. Indeed, these fold changes of the uncorrected 2 h replicates were strongly correlated (Spearman's ρ = 0.42, P < 2.2 × 10⁻¹⁶, asymptotic t test, Figure 8I). However, after scaling this correlation disappeared completely (Spearman's ρ = 0.01, P = 0.47, asymptotic t test, Figure 8J). Thus, any other additional factor resulting in differences between 4sU treated and 4sU naïve samples was minor in comparison to biological variability among samples. In summary, these findings indicate that scaling by the 4sU dropout percentage could fully correct for bias in expression estimates due to excessive 4sU labelling in this experiment.

Discussion

Nucleotide conversion RNA-seq requires high 4sU concentrations: The 4sU concentration correlates with the 4sU incorporation frequency in labelled RNA, which in turn determines how many reads originating from labelled RNA carry a T-to-C mismatch. For the data from ref. (1), we estimated an incorporation frequency in labelled RNA of 2% (11). Based on binomial statistics, with the 50 bp single-end reads that were used in this study, 75% of all reads originating from labelled RNA are expected to carry no T-to-C mismatch (10). With higher concentrations resulting in an incorporation frequency of 10%, about 80% would carry at least one T-to-C mismatch. Statistical approaches such as GRAND-SLAM can deal with such missing observations, but also benefit substantially from higher incorporation frequencies (11). In addition, the accuracy of half-life estimates drops severely when the labelling time is much shorter than the RNA half-life (12). Thus, in addition to high concentrations, long periods of labelling are required for accurately estimating the whole spectrum of RNA half-lives for mammalian genes. However, excessive labelling with 4sU reduces cell viability (1) and has been shown to affect rRNA processing (13). In addition to these biological effects, we show here that excessive labelling also affects sequencing data due to dropout of 4sU containing RNA molecules during library preparation and reduced mappability of reads with many mismatches.
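The binomial argument can be made concrete with a short calculation. The number of uridines covered by a read is an assumption of this sketch: 14 uridines per 50 bp read (28% U content) is a hypothetical value chosen because it roughly reproduces the percentages quoted above. Reference (10) may use different details.

```python
def fraction_without_conversion(n_uridines, incorporation):
    """Probability that a read from labelled RNA covering n_uridines uridines
    carries no T-to-C mismatch (binomial probability of zero conversions)."""
    return (1 - incorporation) ** n_uridines

N_URIDINES = 14  # assumed number of uridines per 50 bp read (hypothetical)

no_mismatch_2pct = fraction_without_conversion(N_URIDINES, 0.02)
at_least_one_10pct = 1 - fraction_without_conversion(N_URIDINES, 0.10)
print(round(no_mismatch_2pct, 3))    # 0.754
print(round(at_least_one_10pct, 3))  # 0.771
```

Under this assumption, about three quarters of the reads from labelled RNA are uninformative at 2% incorporation, whereas about three quarters are informative at 10% incorporation.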

String matching allowing for mismatches is computationally a much harder problem than exact string matching (31). Therefore, all available read mapping tools use a two-step approach to quickly map reads: First, using a data structure for exact string matching and some heuristics to allow for mismatches, candidate mapping positions are identified. Second, the candidate positions are then filtered according to user-defined criteria such as the number of maximal mismatches. To improve read mapping for T-to-C mismatches, two different strategies have been proposed: HISAT-3N (30) operates on a genome with reduced, three-letter alphabet, and SLAM-dunk (29) uses adapted criteria that do not penalize T-to-C mismatches in the filtering step. Thus, HISAT-3N is aware of 4sU induced conversions for both candidate generation and filtering, while SLAM-dunk considers conversions only for filtering. Both strategies have disadvantages that could be observed for our simulated reads: Mapping with reduced alphabets generates more multi-mappers, whereas conversion aware filtering misses true mapping locations with increasing numbers of mismatches. Our grand-Rescue approach is also based on a reduced alphabet, but we mitigate the effect of multi-mappers by only mapping previously unmappable reads against a three-letter pseudo-transcriptome. To additionally reduce the search space while still conserving the original genome organisation into exonic and intronic regions, we replaced all intronic regions with 100 N spacers. This allows grand-Rescue to be agnostic of how the increased number of multimapping reads in long intronic sequences with a 3-letter alphabet are handled by the underlying read mapping tool, which also has major impact on the overall performance.
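The construction of a three-letter pseudo-transcript with spacers, and the transfer of mapping positions back to the genome, can be sketched as follows. This is a simplified illustration for a single gene on the forward strand. The coordinate convention (0-based, end-exclusive), the function names and the toy sequence are assumptions of this example and not details of grand-Rescue.

```python
SPACER = "N" * 100  # each intron is replaced by a fixed 100 N spacer

def build_pseudo_transcript(genome, exons):
    """Concatenate exons (sorted, 0-based, end-exclusive) separated by spacers.

    Returns the T-to-C reduced sequence and, for every exon, a tuple
    (offset in pseudo-transcript, genomic start, length)."""
    parts, blocks, offset = [], [], 0
    for i, (start, end) in enumerate(exons):
        if i > 0:
            parts.append(SPACER)
            offset += len(SPACER)
        blocks.append((offset, start, end - start))
        parts.append(genome[start:end].replace("T", "C"))
        offset += end - start
    return "".join(parts), blocks

def to_genome(pos, blocks):
    """Transfer a pseudo-transcript position to the genome (None inside a spacer)."""
    for offset, genomic_start, length in blocks:
        if offset <= pos < offset + length:
            return genomic_start + (pos - offset)
    return None

# Toy genome: two 8 bp exons separated by a 50 bp intron.
genome = "AAAATTTT" + "G" * 50 + "CCCCTTAA"
pseudo, blocks = build_pseudo_transcript(genome, [(0, 8), (58, 66)])
print(len(pseudo))             # 116
print(to_genome(3, blocks))    # 3
print(to_genome(108, blocks))  # 58
print(to_genome(50, blocks))   # None
```

Because every intron has the same short length in the pseudo-transcript, the underlying mapper still sees an exon-intron structure but never has to search long three-letter intronic sequence.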

While we have shown that read mappability, especially for short read lengths, has an effect, it could not explain the extent of 4sU dropout observed in our experiments. Previous reports have shown that converted 4sU residues can act as a roadblock for the reverse transcriptase (14), and that suboptimal sample handling and 4sU binding to sample tubes results in dropout (15). Here, by comparing SLAM conversion and TUC conversion chemistries, we have shown that blocking reverse transcription is not a major cause of the dropout observed in our SLAM-seq experiments, since TUC-seq converts 4sU into an actual cytosine, and both chemistries resulted in highly similar levels of dropout. This experiment also included samples that were labelled with 4sU but not subjected to any conversion step. The intermediate dropout levels of these samples provide clear evidence that loss of 4sU occurred before conversion and during conversion. Furthermore, performing the conversion in intact, methanol fixed cells also showed substantial 4sU dropout. Importantly, this does not rule out 4sU containing RNA sticking to plastic as a major cause of dropout, which has been shown convincingly recently (15). Rather, this means that at high 4sU concentrations, there are multiple factors that result in an underrepresentation of 4sU containing RNA in sequencing libraries.

The central question regarding 4sU dropout is whether it is a global and unspecific phenomenon. If dropout is mainly caused by steps during library preparation, or due to an unspecific effect on transcription or degradation that applies to all genes, the effects of 4sU dropout on quantitative expression estimates can be corrected for by the approach proposed here. By contrast, if high 4sU concentrations have specific effects and, e.g., activate cellular stress response pathways, the biology of the cells is changed by labelling, which can affect the quantitative parameters to be measured using nucleotide conversion RNA-seq. In this case, our correction approach cannot remove the effects of high 4sU concentrations from data. Importantly, as shown here, specific effects that are not corrected for can be revealed by comparing the 4sU versus no4sU log fold changes among replicates. Thus, we recommend always performing this comparison after using our correction approach.

In principle, the uridine content of mRNAs could be used as a covariate when correcting 4sU dropout. For several reasons, we opted for a more parsimonious model here: First, it is impossible to evaluate whether including uridine content would provide improved quantifications. Second, we expect the influence of uridine content to even out when RNA fragments over full length mRNAs are sequenced. Third, we also observed dependence on surrounding nucleotides, which would suggest even more complex models. Finally, our data indicate that the influence of RT efficiency could be removed from the data even with a simpler model.

To achieve optimal accuracy in quantifying labelled RNA using nucleotide conversion RNA-seq experiments, we suggest the following guidelines: First, due to the cell type and labelling time specificity of 4sU uptake and incorporation, for a new experimental system a small-scale sequencing experiment should be performed first to test several 4sU concentrations in duplicates with shallow sequencing. This experiment can then be used to assess (i) incorporation frequency (at least 4%, better 8%, see Supplementary Note S1), (ii) 4sU dropout and (iii) whether the effects of 4sU dropout can be removed using our correction approach (a vignette on this is provided as part of our grandR package). Second, if only insufficient incorporation frequencies can be achieved without causing 4sU dropout, the conversion temperature or other steps of the experimental protocol should be optimized to minimize loss of 4sU (15). Third, at least 2 × 75 bp read lengths should be used to facilitate accurate read mapping. Fourth, especially for shorter reads, either HISAT3N or our read mapping rescue tool in conjunction with STAR should be used to improve read mappability. Fifth, for all experiments, a no4sU control should be included and 4sU dropout should be assessed. If 4sU dropout is identified, it should be removed from data using our correction approach, and the efficacy of this correction should be assessed by comparing replicates.

Implementations of the two methods introduced here are available as part of our GRAND-SLAM/grandR pipeline. grandRescue is a stand-alone program that is integrated as an additional step into the pipeline. Its only input is a single BAM file containing mapped and unmapped reads, and it generates a new BAM file containing the rescued read mappings in addition to the previously mapped reads. Thus, it can be added to any existing pipeline as an additional step. Computation of the 4sU dropout percentage and the scaling approach to correct for dropout are implemented as functions in our grandR package (12), and we provide a vignette to showcase the usage of these functions. In principle, computing the 4sU dropout percentage of a sample that has been labelled with 4sU requires a reference: either an otherwise biologically equivalent control sample without 4sU labelling, or a reference sample without a global change in RNA synthesis or stability.
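The idea behind such a scaling correction can be illustrated with a minimal sketch in Python. This is not the grandR implementation: all function names are hypothetical, the data are simulated, and the sketch assumes a single multiplicative factor f applied to labelled RNA, chosen so that the rank correlation between the new-to-total RNA ratio (NTR) and the log2 fold change against a no4sU control is as close to zero as possible. Under this assumed parametrisation, the dropout fraction is 1 - 1/f; grandR may parametrise the factor differently.

```python
import numpy as np

def ordinal_ranks(x):
    """Ranks 0..n-1; ties are broken by position (adequate for continuous data)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ordinal_ranks(x), ordinal_ranks(y))[0, 1])

def scale_labelled(total, ntr, f):
    """Multiply labelled RNA by f; return the corrected total and corrected NTR."""
    new = total * ntr * f
    old = total * (1.0 - ntr)
    return old + new, new / (old + new)

def log2_fold_change(sample, control):
    """log2 fold change after library-size normalisation (all values assumed > 0)."""
    return np.log2(sample / sample.sum()) - np.log2(control / control.sum())

def estimate_factor(total, ntr, control, grid):
    """Grid search for the factor that leaves the weakest NTR-dependent trend."""
    def residual(f):
        corrected, _ = scale_labelled(total, ntr, f)
        return abs(spearman(ntr, log2_fold_change(corrected, control)))
    return float(min(grid, key=residual))

# Simulated toy data (assumption): half of all labelled RNA is lost.
rng = np.random.default_rng(0)
n_genes = 2000
true_total = rng.lognormal(mean=5.0, sigma=1.0, size=n_genes)
true_ntr = rng.uniform(0.05, 0.95, size=n_genes)
dropout = 0.5

control = true_total * rng.lognormal(0.0, 0.05, n_genes)   # no4sU control
obs_new = true_total * true_ntr * (1.0 - dropout)          # labelled RNA after dropout
obs_old = true_total * (1.0 - true_ntr)                    # unlabelled RNA is unaffected
sample_total = (obs_old + obs_new) * rng.lognormal(0.0, 0.05, n_genes)
sample_ntr = obs_new / (obs_old + obs_new)

rho_before = spearman(sample_ntr, log2_fold_change(sample_total, control))
f_hat = estimate_factor(sample_total, sample_ntr, control, np.arange(1.0, 4.01, 0.05))
corrected_total, corrected_ntr = scale_labelled(sample_total, sample_ntr, f_hat)
rho_after = spearman(sample_ntr, log2_fold_change(corrected_total, control))
dropout_hat = 1.0 - 1.0 / f_hat

print(f"rho before: {rho_before:.2f}, rho after: {rho_after:.2f}")
print(f"estimated factor: {f_hat:.2f}, estimated dropout: {dropout_hat:.2f}")
```

In this toy setting, genes with high NTR appear downregulated before correction, and the trend largely vanishes once labelled RNA is rescaled. A grid search is used here only for transparency; a root finder on the signed correlation would serve the same purpose.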

Here, we report that high concentrations of 4sU or prolonged labelling resulted in an apparent downregulation of short-lived RNAs, which can have a profound impact on results if it remains unnoticed. We therefore advocate that checking for this effect should be a mandatory part of quality control for nucleotide conversion RNA-seq. If such quantification bias is observed, it is important to investigate its causes. Technical issues such as reduced read mappability for labelled RNA or various steps during library preparation can result in 4sU dropout and therefore in apparent downregulation of short-lived RNA. It is not unlikely that such technical issues might be the only cause for 4sU dropout in this experiment, since after correction by our scaling approach, no quantification bias was observed anymore, and the correlation of log2 fold changes between replicates disappeared completely. However, downregulation of short-lived RNA can also be a sign of 4sU affecting the living cells biologically. If such an effect of 4sU on RNA metabolism cannot be excluded, all obtained results can be misleading and must be interpreted with care.

Acknowledgements

We thank Christophe Toussaint and Emmanuel Saliba (HIRI Würzburg) for support with library preparation and sequencing and Ronald Micura (University of Innsbruck) for support with TUC-seq.

Author contributions: Kevin Berg: Formal Analysis, Methodology, Software, Visualization, Writing-original draft, Writing-review & editing. Manivel Lodha: Investigation. Isabel Delazer: Investigation. Karolina Bartosik: Investigation. Yilliam Cruz Garcia: Investigation. Thomas Hennig: Supervision. Lars Dölken: Supervision. Elmar Wolf: Supervision. Alexandra Lusser: Supervision. Bhupesh K Prusty: Investigation, Supervision. Florian Erhard: Conceptualization, Formal Analysis, Funding acquisition, Methodology, Software, Supervision, Visualization, Writing-original draft, Writing-review & editing.

Contributor Information

Kevin Berg, Chair of Computational Immunology, University of Regensburg, Regensburg, Germany; Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Manivel Lodha, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Isabel Delazer, Medical University of Innsbruck, Biocenter, Institute of Molecular Biology, Innsbruck, Austria.

Karolina Bartosik, Institute of Organic Chemistry, Center for Molecular Biosciences Innsbruck, University of Innsbruck, Innsbruck, Austria.

Yilliam Cruz Garcia, Cancer Systems Biology Group, Theodor Boveri Institute, University of Würzburg, Würzburg, Germany; Institute of Biochemistry, University of Kiel, Kiel, Germany.

Thomas Hennig, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Elmar Wolf, Cancer Systems Biology Group, Theodor Boveri Institute, University of Würzburg, Würzburg, Germany; Institute of Biochemistry, University of Kiel, Kiel, Germany.

Lars Dölken, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Alexandra Lusser, Medical University of Innsbruck, Biocenter, Institute of Molecular Biology, Innsbruck, Austria.

Bhupesh K Prusty, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Florian Erhard, Chair of Computational Immunology, University of Regensburg, Regensburg, Germany; Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany.

Data availability

Raw data generated here have been deposited at GEO under accession numbers GSE229504 (Increasing concentrations), GSE229506 (Progressive labelling), GSE253169 (SLAM- & TUC-seq) and GSE253370 (Fixed cell SLAM-seq).

All processed data (GRAND-SLAM outputs) are available on Zenodo:

https://doi.org/10.5281/zenodo.7805929 (Progressive labelling)

https://doi.org/10.5281/zenodo.7753460 (Increasing concentrations)

https://doi.org/10.5281/zenodo.7760483 (QuantSeq (19))

https://doi.org/10.5281/zenodo.7760437 (Illumina TruSeq (18))

https://doi.org/10.5281/zenodo.8225022 (Reanalysis Spt6 (4))

https://doi.org/10.5281/zenodo.10470570 (SLAM- & TUC-seq, Fixed cell SLAM-seq)

R notebooks and data files for generating all figures are available on Zenodo (https://doi.org/10.5281/zenodo.10478770).

The release version, source code and documentation of grandRescue can be found on GitHub (https://github.com/erhard-lab/grandRescue). All other methods, including 4sU dropout plots and the dropout scaling method, are available as part of our grandR package (https://cran.r-project.org/package=grandR).

Supplementary data

Supplementary Data are available at NAR Online.

Funding

F.E. received funding by the Bavarian State Ministry of Science and Arts (Bavarian Research Network FOR-COVID); Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [ER 927/2-1] and framework of the Research Unit FOR5200 DEEP-DV [443644894] project ER 927/4-1, and by project ID 492620490 – SFB1583/Z2; A.L. received funding from the Austrian Science Foundation [FWF F8009-B]. Funding for open access charge: DFG grant, OA subsidy of the University of Regensburg.

Conflict of interest statement. None declared.

References

  • 1. Herzog V.A., Reichholf B., Neumann T., Rescheneder P., Bhat P., Burkard T.R., Wlotzka W., von Haeseler A., Zuber J., Ameres S.L. Thiol-linked alkylation of RNA to assess expression dynamics. Nat. Methods. 2017; 14:1198.
  • 2. Schofield J.A., Duffy E.E., Kiefer L., Sullivan M.C., Simon M.D. TimeLapse-seq: adding a temporal dimension to RNA sequencing through nucleoside recoding. Nat. Methods. 2018; 15:221–225.
  • 3. Riml C., Amort T., Rieder D., Gasser C., Lusser A., Micura R. Osmium-mediated transformation of 4-thiouridine to cytidine as key to study RNA dynamics by sequencing. Angew. Chem. Int. Ed. Engl. 2017; 56:13479–13483.
  • 4. Narain A., Bhandare P., Adhikari B., Backes S., Eilers M., Dölken L., Schlosser A., Erhard F., Baluapuri A., Wolf E. Targeted protein degradation reveals a direct role of SPT6 in RNAPII elongation and termination. Mol. Cell. 2021; 81:3110–3127.
  • 5. Finkel Y., Gluck A., Nachshon A., Winkler R., Fisher T., Rozman B., Mizrahi O., Lubelsky Y., Zuckerman B., Slobodin B. et al. SARS-CoV-2 uses a multipronged strategy to impede host protein synthesis. Nature. 2021; 594:240–245.
  • 6. Erhard F., Baptista M.A.P., Krammer T., Hennig T., Lange M., Arampatzi P., Jürges C.S., Theis F.J., Saliba A.-E., Dölken L. scSLAM-seq reveals core features of transcription dynamics in single cells. Nature. 2019; 571:419–423.
  • 7. Cao J., Zhou W., Steemers F., Trapnell C., Shendure J. Sci-fate characterizes the dynamics of gene expression in single cells. Nat. Biotechnol. 2020; 38:980–988.
  • 8. Schott J., Reitter S., Lindner D., Grosser J., Bruer M., Shenoy A., Geiger T., Mathes A., Dobreva G., Stoecklin G. Nascent Ribo-Seq measures ribosomal loading time and reveals kinetic impact on ribosome density. Nat. Methods. 2021; 18:1068–1074.
  • 9. Jürges C.S., Lodha M., Le-Trilling V.T.K., Bhandare P., Wolf E., Zimmermann A., Trilling M., Prusty B., Dölken L., Erhard F. Multi-omics reveals principles of gene regulation and pervasive non-productive transcription in the human cytomegalovirus genome. 2022; bioRxiv doi: 10.1101/2022.01.07.472583, 07 January 2022, preprint: not peer reviewed.
  • 10. Erhard F., Saliba A.-E., Lusser A., Toussaint C., Hennig T., Prusty B.K., Kirschenbaum D., Abadie K., Miska E.A., Friedel C.C. et al. Time-resolved single-cell RNA-seq using metabolic RNA labelling. Nat. Rev. Methods Primers. 2022; 2:77.
  • 11. Jürges C., Dölken L., Erhard F. Dissecting newly transcribed and old RNA using GRAND-SLAM. Bioinformatics. 2018; 34:i218–i226.
  • 12. Rummel T., Sakellaridi L., Erhard F. grandR: a comprehensive package for nucleotide conversion sequencing data analysis. Nat. Commun. 2023; 14:3559.
  • 13. Burger K., Mühl B., Kellner M., Rohrmoser M., Gruber-Eber A., Windhager L., Friedel C.C., Dölken L., Eick D. 4-thiouridine inhibits rRNA synthesis and causes a nucleolar stress response. RNA Biol. 2013; 10:1623–1630.
  • 14. Watson M.J., Park Y., Thoreen C.C. Roadblock-qPCR: a simple and inexpensive strategy for targeted measurements of mRNA stability. RNA. 2021; 27:335–342.
  • 15. Zimmer J.T., Vock I.W., Schofield J.A., Kiefer L., Moon M.H., Simon M.D. Improving the study of RNA dynamics through advances in RNA-seq with metabolic labeling and nucleotide-recoding chemistry. 2023; bioRxiv doi: 10.1101/2023.05.24.542133, 24 May 2023, preprint: not peer reviewed.
  • 16. Whisnant A.W., Jürges C.S., Hennig T., Wyler E., Prusty B., Rutkowski A.J., L’hernault A., Djakovic L., Göbel M., Döring K. et al. Integrative functional genomics decodes herpes simplex virus 1. Nat. Commun. 2020; 11:2038.
  • 17. Lusser A., Gasser C., Trixl L., Piatti P., Delazer I., Rieder D., Bashin J., Riml C., Amort T., Micura R. Thiouridine-to-cytidine conversion sequencing (TUC-Seq) to measure mRNA transcription and degradation rates. In: LaCava J., Vaňáčová Š. (eds). The Eukaryotic RNA Exosome: Methods and Protocols. 2020; NY: Methods in Molecular Biology. Springer; 191–211.
  • 18. Sarantopoulou D., Tang S.Y., Ricciotti E., Lahens N.F., Lekkas D., Schug J., Guo X.S., Paschos G.K., FitzGerald G.A., Pack A.I. et al. Comparative evaluation of RNA-seq library preparation methods for strand-specificity and low input. Sci. Rep. 2019; 9:13477.
  • 19. Lee J.W., Stone M.L., Porrett P.M., Thomas S.K., Komar C.A., Li J.H., Delman D., Graham K., Gladney W.L., Hua X. et al. Hepatocytes direct the formation of a pro-metastatic niche in the liver. Nature. 2019; 567:249–252.
  • 20. Erhard F. Estimating pseudocounts and fold changes for digital expression measurements. Bioinformatics. 2018; 34:4054–4063.
  • 21. Eden E., Navon R., Steinfeld I., Lipson D., Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinf. 2009; 10:48.
  • 22. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550.
  • 23. Dölken L., Ruzsics Z., Rädle B., Friedel C.C., Zimmer R., Mages J., Hoffmann R., Dickinson P., Forster T., Ghazal P. et al. High-resolution gene expression profiling for simultaneous kinetic parameter analysis of RNA synthesis and decay. RNA. 2008; 14:1959–1972.
  • 24. Smalec B.M., Ietswaart R., Choquet K., McShane E., West E.R., Churchman L.S. Genome-wide quantification of RNA flow across subcellular compartments reveals determinants of the mammalian transcript life cycle. 2022; bioRxiv doi: 10.1101/2022.08.21.504696, 21 August 2022, preprint: not peer reviewed.
  • 25. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29:15–21.
  • 26. Krueger F., Andrews S.R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011; 27:1571–1572.
  • 27. Baruzzo G., Hayer K.E., Kim E.J., Di Camillo B., FitzGerald G.A., Grant G.R. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat. Methods. 2017; 14:135–139.
  • 28. Kim D., Paggi J.M., Park C., Bennett C., Salzberg S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019; 37:907–915.
  • 29. Neumann T., Herzog V.A., Muhar M., von Haeseler A., Zuber J., Ameres S.L., Rescheneder P. Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets. BMC Bioinf. 2019; 20:258.
  • 30. Zhang Y., Park C., Bennett C., Thornton M., Kim D. Rapid and accurate alignment of nucleotide conversion sequencing reads with HISAT-3N. Genome Res. 2021; 31:1290–1295.
  • 31. Cohen-Addad V., Feuilloley L., Starikovskaya T. Lower bounds for text indexing with mismatches and differences. 2019; arXiv doi: https://arxiv.org/abs/1812.09120, 21 December 2018, preprint: not peer reviewed.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press