Whole blood RNA extraction efficiency contributes to variability in RNA sequencing data sets

William W Wilfinger; Hamid R Eghbalnia; Karol Mackey; Robert Miller; Piotr Chomczynski

doi:10.1371/journal.pone.0291209

2023 Nov 16;18(11):e0291209. doi: 10.1371/journal.pone.0291209

William W Wilfinger ¹,*, Hamid R Eghbalnia ², Karol Mackey ¹, Robert Miller ³, Piotr Chomczynski ¹

Editor: Surya Saha⁴

PMCID: PMC10653446 PMID: 37972054

Abstract

Numerous methodologies are used for blood RNA extraction, and large quantitative differences in recovered RNA content are reported. We evaluated three archived data sets to determine how extraction methodologies might influence mRNA and lncRNA sequencing results. The total quantity of RNA recovered per ml of blood affects RNA sequencing by impacting the recovery of weakly expressed mRNA and lncRNA transcripts. Transcript expression (TPM counts) plotted in relation to transcript size (base pairs, bp) revealed a 30% loss of short to midsized transcripts in some data sets. Quantitative recovery of RNA is of considerable importance, and it should be viewed more judiciously. Transcripts common to the three data sets were subsequently normalized, and transcript mean TPM counts and TPM count coefficient of variation (CV) were plotted in relation to increasing transcript size. Regression analysis of mean TPM counts versus transcript size revealed negative slopes in two of the three data sets, suggesting a reduction of TPM transcript counts with increasing transcript size. In the third data set, the regression slope line of mRNA transcript TPM counts approximated zero and TPM counts increased in proportion to transcript size over a range of 200 to 30,000 bp. Similarly, transcript TPM count CV values were uniformly distributed over the range of transcript sizes. In the other data sets, the regression CV slopes increased in relation to transcript size. The recovery of weakly expressed and/or short to midsized mRNA and lncRNA transcripts varies with different RNA extraction methodologies, thereby altering the fundamental sequencing relationship between transcript size and TPM counts. Our analysis identifies differences in RNA sequencing results that are dependent upon the quantity of total RNA recovered from whole blood.
We propose that incomplete RNA extraction directly impacts the recovery of mRNA and lncRNA transcripts from human blood and speculate these differences contribute to the “batch” effects commonly identified between sequencing results from different archived data sets.

Introduction

Human blood has been recognized as an important diagnostic resource for centuries. Blood is a complex fluid in continuous contact with all body tissues, thereby providing information from a variety of unique compartments that include nucleated white blood cells (WBC), enucleated red blood cells (RBC) and cell-free RNA: ribonucleoprotein complexes and ancillary vesicular debris from assorted tissues in the body [1, 2]. Since blood collection is considered a non-invasive procedure, it is frequently used for the evaluation of an assortment of disease related biomarkers. In addition, the growing application of personalized medicine in the treatment of chronic diseases has shown that RNA signatures can be employed to specifically optimize treatment strategies best suited for the patient. This has supplied the impetus for the development of a variety of unique methods for the collection, stabilization, and extraction of RNA from blood.

In the clinic, blood is routinely collected in K₂EDTA or ACD Vacutainer tubes, or with blood collection tubes designed to stabilize RNA/DNA for subsequent extraction at some later time (e.g., PAXgene® [3–6, 8, 9, 12, 14–18, 20, 21, 24–26], Tempus® [12, 14–17, 19–21, 23–26], RNAgard® [8, 9]). Each of the various blood stabilization tubes has unique proprietary ingredients designed to stabilize the nucleic acids. The extraction procedures routinely employed to purify and recover RNA from blood samples add additional variability since they employ different extraction technologies such as: phenol-based extractions [3–11, 13, 22], silica gel column purification procedures [3–6, 8–26], glass fiber extraction columns [26], magnetic bead extractions [16, 17, 19, 23] and assorted blood cell enrichment methodologies coupled with various extraction protocols [6, 8–11, 17]. The total quantity of RNA recovered from whole blood differs significantly between these various extraction methodologies [3–26], but the reported RNA purity (A260/280 ratios > 1.9 and A260/230 ratios > 1.7) and integrity-based RIN values (RIN > 7) are in the acceptable range for microarray and RNA sequencing studies [27, 28].

Although investigators employing these different extraction technologies use RNA of a quality considered acceptable for RNA sequencing, when the sequenced transcripts from identical samples are compared, greater variation is observed between methods than across different blood samples [6, 8–11, 14, 16, 21, 26]. Therefore, although substantial amounts of data have been generated with these various blood collection and extraction methodologies, attempts to pool the data sets for more comprehensive meta-analysis have had limited success. Several reports evaluating different extraction methodologies concluded that RNA yield contributes significantly to technical variation across methods [11, 12, 14, 25, 29]. We reported that RNA content in human blood ranges from 6–22 μg / ml [7], reaching concentrations greater than previously reported in the literature [3–6, 8–26].

Advancements in next generation sequencing (NGS) have significantly reduced the cost of RNA analysis and expanded the interest in applying RNA sequencing to an array of disease conditions. Current dogma relating to the suitability of RNA for sequencing applications, simply based on RNA purity and integrity, is inadequate. Other criteria are needed to reduce variability and improve agreement across various extraction platforms. To address this question, we evaluated three archived data sets in which blood was collected and extracted with different methodologies, but the resulting raw sequencing counts were processed under identical conditions to minimize analytical pipeline induced variability [27, 28]. We evaluated a variety of parameters such as the number of gene calls, transcript size distributions and call variance with the goal of identifying factor(s) that might reduce sequencing variability and provide a testable explanation for the large batch effects frequently reported when comparing similar sequencing data sets [29]. Based on our analysis, we propose that the differential recovery of short to midsized mRNA and lncRNA transcripts during RNA extraction directly affects the character and breadth of the RNA library and its amplification, thereby disproportionately altering RNA calls over the entire range of transcripts. In addition, we demonstrate that the proportional relationship between transcript size and TPM counts, considered to be the fundamental requirement for RNA sequencing, is only attained when RNA is uniformly extracted and recovered across a complete range of transcript sizes. To address these shortcomings, we strongly recommend reporting RNA yield in all sequencing studies (e.g., μg RNA / ml of blood). We encourage investigators to select RNA extraction protocols that provide a mean RNA recovery approximating 14 μg RNA / ml of human whole blood [7] and that efficiently recover RNA over the entire range of transcript sizes.

Methods

Ethics approval and consent to participate

The blood samples employed in this study were collected in accordance with the protocol approved by Chesapeake Research Review, LLC (CIRBI Protocol # Pro00009509) [7]. Participants provided signed and witnessed written informed consent in accordance with Chesapeake IRB guidelines. All methods were performed in accordance with the relevant guidelines and regulations as outlined in the Declaration of Helsinki. The ethics requirements for the other archived data cited in this report are available in the public forum and were previously met when the data were originally published [33, 34].

RNA extraction, sequencing, and data analysis

In an earlier report, we analyzed human whole blood RNA concentrations and differential blood cell counts in thirty-five individuals ranging in age from 50 to 89 years [7]. Fasting venous blood was collected with K₂EDTA Vacutainer tubes, stored at room temperature for approximately 15 minutes and aliquoted into RNAzol-BD [7]. The blood:RNAzol-BD lysates were aggressively shaken to solubilize denatured proteins before storage at -70 °C. Extraction of large RNA transcripts greater than 200 base pairs (bp) was performed according to the manufacturer’s protocol (https://www.mrcgene.com/product/rnazol-bd). The 200 bp cutoff was established based on Bioanalyzer electropherogram plots of total, large and small RNA profiles [30]. The RNA was DNase-treated, and 1 μg of large RNA was sent to the University of Cincinnati Genomics, Epigenetics and Sequencing Core Facility for sequencing. After passing quality control analysis, the samples were depleted of globin and ribosomal transcripts prior to library formation and sequencing on the Illumina HiSeq 2000 platform (GSE169359). Standard procedures were employed to evaluate the quality of the raw data and the resulting FASTQ data files held 53–77 million single-end reads [31, 32].

We surveyed the Sequence Read Archive public repository (https://www.ncbi.nlm.nih.gov/sra) to select additional data sets for comparison. The first archived data set (GSE109313), which contained nine normal controls ranging in age from 18 to 70 years, was included in our analysis [33]. In this study, blood was collected with PAXgene collection tubes and then extracted with the PAXgene RNA extraction kit. A total of 500 ng of RNA was ribo-depleted and used for poly(A) selection. A second archived data set contained twelve control subjects (GSE112057) of unreported age [34]. Blood samples were collected in Tempus blood collection tubes and the RNA was extracted with the Tempus Spin RNA Isolation kit. The FASTQ data files held 21.5–49.3 million paired-end reads.

These data sets represent diverse RNA extraction methodologies, providing an opportunity to examine how different extraction procedures might impact sequencing results. To minimize data processing variability, all three data sets were processed through our pipeline under identical conditions, as outlined in an earlier report [31]. Briefly, FASTQ data files were trimmed and processed. Single-end reads were aligned to reference genome GRCh37.p13[hg19] using the Bowtie2 aligner supporting gapped alignments. Cufflinks and HTSeq software were used to provide quality control for our analysis [31]. All counts were expressed as DESeq-normalized TPM (Transcripts Per Kilobase Million).
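For readers unfamiliar with the TPM unit, the following minimal sketch shows the textbook TPM calculation: read counts are divided by transcript length in kilobases and then rescaled so that each sample sums to one million. It is an illustration only and not the authors' pipeline, which additionally applies the normalization described in [31]; the function name `tpm` and the toy counts and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Textbook TPM (an illustrative assumption, not the pipeline in [31])."""
    # Reads per kilobase of transcript
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(rpk)
    # Rescale so that the sample sums to one million
    return [r / total * 1e6 for r in rpk]


# Hypothetical toy sample: counts that scale exactly with transcript length
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 3000]
toy_tpm = tpm(toy_counts, toy_lengths)
```

In this toy sample the three transcripts receive equal TPM values because their raw counts are exactly proportional to their lengths, which is the size relationship examined later in the Results.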

A list of 25,354 sequenced read assignments was identified with the annotated reference genome. RNA transcripts containing < 0.1 TPM count were designated as 0 and RNA transcripts with means < 0.1 TPM counts were omitted from the analysis. The original source files were screened for copy number variants [32, 35, 36] and multiple assignments for individual genes were removed and expressed as a single gene ID. These data sets were used to compare and evaluate mRNA and lncRNA size distribution and transcript expression levels.
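The two threshold rules described above can be sketched as follows. This is a hedged illustration, not the RAnGER implementation: the function name `floor_and_filter`, the dictionary layout and the gene labels are hypothetical, and computing the mean after the per-sample floor has been applied is an assumption about the order of the two steps.

```python
from statistics import mean

TPM_FLOOR = 0.1  # threshold stated in the text


def floor_and_filter(tpm_by_transcript, floor=TPM_FLOOR):
    """Set per-sample TPM values below `floor` to 0, then omit
    transcripts whose mean TPM is below `floor`.
    Assumption: the mean is taken after flooring."""
    kept = {}
    for transcript_id, values in tpm_by_transcript.items():
        floored = [v if v >= floor else 0.0 for v in values]
        if mean(floored) >= floor:
            kept[transcript_id] = floored
    return kept


# Hypothetical example with three samples per transcript
example = {
    "GENE_A": [0.05, 0.5, 1.2],   # one value floored, mean stays above 0.1
    "GENE_B": [0.02, 0.09, 0.15], # mean falls below 0.1 after flooring
}
filtered = floor_and_filter(example)
```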

Characterization of reference genome RNA transcript lengths

We examined the list of known RNA sequences from reference genome GRCh37.p13[hg19] to establish a baseline for comparison of transcript lengths in the three data sets [37]. This reference genome contained 258,705 entries ranging in size from 5 to 347,561 bp (S1 Table) with both mRNA and lncRNA sequences as well as known RNA sequences lacking a name or function (32,686 entries). The blue filled area in Fig 1 depicts the size and number of named transcripts assigned to each size interval (left y-axis). Each RNA transcript was assigned to an interval (bin, n = 200) based on a Log₁₀ size range of 1 to 360,000 base pairs (i.e., Log₁₀ 0–5.556 bp using a 0.028 increment interval). To improve the clarity of the transcript size distribution plot in Fig 1, the gene Log₁₀ scale was limited to values of 1.7–4.4, thereby covering a size range of 50 to 30,000 bp.
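The binning scheme described above (200 equal-width intervals on a Log₁₀ scale from 1 to 360,000 bp) can be sketched in a few lines. The names `size_bin` and `bin_counts` are hypothetical, and placing the upper edge of the range in the last bin is an assumption made for the illustration.

```python
import math

N_BINS = 200
LOG_MAX = math.log10(360_000)  # about 5.556
BIN_WIDTH = LOG_MAX / N_BINS   # about 0.028


def size_bin(length_bp):
    """Return the 0-based index of the Log10 size interval for a transcript."""
    if length_bp < 1:
        raise ValueError("transcript length must be at least 1 bp")
    index = int(math.log10(length_bp) / BIN_WIDTH)
    # Assumption: values at or above the upper edge go into the last bin
    return min(index, N_BINS - 1)


def bin_counts(lengths_bp):
    """Count how many transcripts fall into each size interval."""
    counts = [0] * N_BINS
    for length in lengths_bp:
        counts[size_bin(length)] += 1
    return counts


# Lengths quoted in the text: 180 bp, about 550 bp and 43,816 bp
histogram = bin_counts([180, 550, 43_816])
```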

Transcript size does not follow a normal distribution and there is an obvious peak of 28,182 transcripts at ~550–566 bp. Most of the transcripts falling within this size range consist of lncRNA [38, 39]. In addition, larger transcripts (e.g., > 1,000 bp) code for mRNA genes with multiple variants. To avoid averaging mRNA length measurements, transcript size is based on the size of variant one.

To further characterize the size distribution of the RNA reference genome depicted in Fig 1, we downloaded two additional files from NCBI containing annotated lists of mRNAs (82,961) and lncRNA transcripts (10,782) relevant to the same reference genome (right y-axis). Although the annotations for lncRNA transcripts relating to assigned function and accepted identification labels are not as mature as the assignments for mRNA transcripts, the size distribution of lncRNA is interspersed among the mRNA genes [38] and it may provide information relating to transcript recovery during RNA extraction. The three sequenced experimental data sets described in this study were evaluated in relation to the size distribution of the mRNA and lncRNA reference files depicted in Fig 1 (S1 Table of Transcript sizes).

In our analysis, single-pair sequencing transcripts were assigned to a single gene ID [31, 32] and specific variants were not identified. Therefore, variant one was selected to represent the gene length assignment of genes with multiple size variants. Based on the selection of variant one, a list of 19,608 NCBI mRNA transcripts was identified for our analysis (Fig 1, red line, right y-axis). These RNA transcripts range in size from 180 (ETDC) to 43,816 (MUC16) bp (S1 Table). Since less information is available relating to the predominant size of specific lncRNAs [38, 39], multiple size designations for specific lncRNA transcripts were averaged and a second list of 6,725 unique lncRNA transcripts was identified (Fig 1, green line, right y-axis). The established transcript size of the NCBI reference genome was used as the basis for characterizing the sequenced RNA in the three experimental data sets (S1 Table of Transcript Sizes).

Characterization of RNA size distributions in experimental samples

The aligned reads from the three data sets were evaluated with SAMtools to quantify the number of reads per transcript [31]. Transcripts with mean raw counts ≥ 3 were used to establish a baseline minimum count prior to TPM transcript size normalization. After TPM size adjustment, the source file containing a list of 25,354 transcripts was further evaluated. To provide the largest pool of transcript candidates, a preliminary list of transcripts was selected by limiting individual TPM counts to values > 0.1.

Analytical methods

Statistical calculations were performed using the Microsoft Windows Excel Platform using the Analysis ToolPak application (Excel, version 2304 within Microsoft 365, Version 16.0.16327). ToolPak is Charles Zaiontz’s Real Statistics Resource Pack for Excel 2010, 2013, 2016, 2019, 2021 or 365 for Windows (Release 8.7). RAnGER data management software previously described in detail [32, 35] was employed to establish gene count minima, consolidate copy number variants [36], and perform standard statistical calculations. One-way ANOVA was employed in conjunction with the Student-Newman-Keuls range test to evaluate computed means across the three data sets (S3 Normalization and Statistical Analysis).

Results

Overview of sequencing results for RNA recovered from whole blood by different methodologies

The selected data files differ markedly in the way RNA was extracted and processed prior to RNA sequencing. We speculated that an examination and comparison of the sequencing results might indicate how these methods impact variability during RNA sequencing.

After processing the FASTQ data files, we looked at the total number of sequenced transcripts with mean TPM counts > 0.1 that corresponded to the NCBI reference genome. A summary of the analysis is presented in the Fig 2A table. The three data sets are designated by their number of subjects: C9 (GSE109313, PAXgene), C12 (GSE112057, Tempus) and C35 (GSE169359, RNAzol-BD). The total number of identified transcripts ranged from 10,042 to 15,082 (33.4% difference) representing 55.4, 39.6 and 59.5% of the 25,354 annotated transcripts in the C9, C12 and C35 data sets, respectively. To further characterize these transcripts, mRNA and lncRNA transcripts with TPM counts > 0.1 and known bp size assignments (Fig 1, S1 Table) were used to evaluate the original lists of total sequenced transcripts. Among the three data sets, mRNA and lncRNA transcripts with bp size assignments accounted for 84.8 and 6.70% of the identified transcripts, respectively. Collectively, total mRNA and lncRNA transcripts in the three data sets constitute 91.5% of the sequenced transcripts while transcripts without bp size assignments account for only 8.5% of the transcripts. Therefore, the transcripts with known bp size assignments should provide a representative assessment of the sequencing results. The RNA used for sequencing in the C9 and C35 data sets, on average, improved mRNA and lncRNA recovery relative to the C12 data set by 24 and 56%, respectively (S2).
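The percentages quoted above can be reproduced from the transcript totals given in the Fig 2 legend (C9: 14,051; C12: 10,042; C35: 15,082). The short check below is only an arithmetic illustration; the variable names are hypothetical, and expressing the 33.4% difference relative to the largest total is an assumption that happens to reproduce the reported figure.

```python
ANNOTATED = 25_354  # annotated transcripts in the source file
identified = {"C9": 14_051, "C12": 10_042, "C35": 15_082}

# Share of the annotated transcripts detected in each data set (percent)
pct_of_annotated = {
    name: round(100 * n / ANNOTATED, 1) for name, n in identified.items()
}

# Difference between the largest and smallest totals,
# expressed relative to the largest total (assumption)
largest = max(identified.values())
smallest = min(identified.values())
spread_pct = round(100 * (largest - smallest) / largest, 1)

print(pct_of_annotated, spread_pct)
```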

Fig 2 — FASTQ-formatted data files from the three data sets were aligned and the mapped reads were assembled into transcripts. A) Tabular summary of identified sequenced RNA transcripts. The largest number of sequenced mRNA and lncRNA transcripts with TPM counts > 0.1 was found in the C35 data set, but the total number of transcripts with known bp size measurements was almost identical in the C9 and C35 data sets. In contrast, the number of mRNA and lncRNA transcripts was ~27% lower in the C12 data set (C35: 13,077 vs C12: 9,563 identified mRNA and lncRNA transcripts). B) Venn Plot analysis of the sequenced transcripts in the C9 (14,051), C12 (10,042) and C35 (15,082) data files identified 8721 mRNA and 481 lncRNA transcripts common to all three data sets. C) Size distributions of the total sequenced transcripts with mean TPM counts < 1. The number of sequenced transcripts with mean TPM counts < 1 is markedly higher in the C9 and C35 data sets implying an improved ability to detect weakly expressed transcripts.

To consider how RNA recovery affected reported transcript size, 19,608 mRNA and 6,725 lncRNA reference genome transcripts with known size assignments were used to characterize the sequenced mRNA and lncRNA transcripts identified in the data sets (S1 and S2). In the C35 data file, 11,882 mRNA transcripts and 1,195 lncRNA transcripts with known size measurement assignments were identified (Fig 2A). The number of mRNA and lncRNA transcripts identified in the C35 data set (13,077) was almost identical to the number identified in the C9 data set (13,002) even though the two data sets were processed with markedly different methods (i.e., EDTA Vacutainer followed by phenol-based extraction with globin and rRNA depletion vs. PAXgene blood collection with silica gel column purification and poly-A selection). In contrast to the C9 and C35 data sets, fewer mRNA (24%) and lncRNA (56%) transcripts were identified in the C12 data set (Tempus blood collection with silica gel column purification of total RNA). The total number of transcripts identified in the C35 data set was 7 to 33% higher than in the other two data sets; however, a core group of 8,721 mRNA transcripts was identified in all three data files and used in subsequent studies (Fig 2B).

If the number of mRNA transcripts identified in the C9 and C35 data sets is due to an improved recovery of RNA, one might expect to see more transcripts with TPM counts < 1 due to an improved recovery of low expression transcripts. Fig 2C supports this inference. The total number of transcripts with TPM counts < 1 is greater within the C9 and C35 data sets (C9 = 3,413, C12 = 81 and C35 = 4,409). Furthermore, the distribution of C9 and C35 transcripts presented in Fig 2C is shifted towards smaller transcripts as compared to C12. In summary, different numbers of sequenced mRNA and lncRNA transcripts were recovered with different extraction methodologies and in some cases, these differences are due to the improved recovery of weakly expressed short to midsized transcripts (Fig 2A and 2C).

Size distribution of sequenced mRNA and lncRNA recovered with different extraction methodologies

To consider if a disproportionate recovery of RNA transcripts occurred during RNA extraction, the number of recovered transcripts was portrayed graphically in Fig 3, as a function of transcript size. The size distribution of 19,608 mRNA genes (Fig 3A, red circles) and 6,725 lncRNA transcripts (Fig 3B, red circles) from the reference genome are presented with the size distributions of the transcripts recovered from the three data sets. The identified mRNAs and lncRNA for C9 (yellow), C12 (purple) and C35 (green) data sets were assigned a size designation and plotted relative to transcript size (Fig 3A and 3B). To improve clarity, only transcripts between 200 and 30,000 bp were reported. While the mRNA size plots for the C9 and C35 data sets overlap, the C12 transcript size distribution is smaller and shifted slightly to the right suggesting that short and midsized transcripts were omitted. When compared to the 2,752 smallest transcripts identified in the NCBI mRNA reference list that range in size from 200–1260 bp, the C9, C12 and C35 data sets contained 43.9, 24.7 and 42.4% of the reference transcripts, respectively.

The lncRNA transcripts between 200 and 30,000 bp identified in the NCBI reference list are depicted in Fig 3B (red circles). The biggest difference between the lncRNA reference list and the transcript profiles is the small number of identified lncRNA transcripts in the C9, C12 and C35 data sets, 15.2, 7.3, and 17.8% respectively. Like the mRNA profiles in Fig 3A, differences in the relative number of recovered lncRNA transcripts were most notable among the short and midsized transcripts in the three files.

Analysis of 8,721 transcripts common among the three data sets

A foundational assumption in RNA sequencing predicts that the relative number of transcript counts should be proportional to the size of the transcript. We used the 8,721 transcripts identified in all three data sets to see if this assumption held true for RNA extracted with the various methods.

All three data sets were adjusted to an equivalent number of total counts (200,000, S3) before preparing scatter graphs of the mean TPM counts plotted in relation to transcript lengths. Fig 4A–4C represents the normalized mean transcript expression levels for the three data sets plotted in relation to increasing transcript size over a range of 200 to 30,000 bp (Log₁₀ 2.3–4.4). After normalizing the three files to an identical number of total counts, the calculated mean for the 8,721 individual mRNA transcripts was 22.93 (solid red lines). The regression slope line for each distribution is depicted by the red-dashed line in each panel. The dashed red lines should have a slope of zero if the above assumption holds true.
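The stated mean of 22.93 follows directly from the normalization: once every data set is rescaled to 200,000 total counts spread over 8,721 transcripts, the mean count per transcript is fixed regardless of how the counts are distributed. A minimal sketch of that rescaling is shown below; the function name `rescale_to_total` is hypothetical and the details of the authors' procedure are given in S3.

```python
TARGET_TOTAL = 200_000  # total counts after normalization (from the text)
N_COMMON = 8_721        # transcripts common to all three data sets


def rescale_to_total(values, target=TARGET_TOTAL):
    """Scale values proportionally so that they sum to `target`."""
    total = sum(values)
    if total <= 0:
        raise ValueError("total must be positive")
    return [v * target / total for v in values]


# The mean is determined by the total and the number of transcripts
expected_mean = TARGET_TOTAL / N_COMMON
print(round(expected_mean, 2))
```

Because the mean is identical in all three panels, any difference between the regression slopes reflects how the counts are distributed across transcript sizes rather than a difference in total signal.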

Fig 4 — The mean TPM count for the individual transcripts was plotted in relation to the transcript size. The normalized sample mean of 22.93 is depicted by the solid red line. The mRNA transcript distribution plots depicted for C9 and C12 are similar, and both display negative regression lines (red dashed line). The solid and dashed red lines depicting the sample mean and regression line, respectively, of the C35 data sets are almost parallel. Since the total number of transcripts in all three data sets are identical, this result could only occur if the relative distribution of counts assigned to the distinct size transcripts has changed.

The plots of the C9 and C12 data sets have negative regression slope lines of -1.1144 and -0.6469, respectively. The slope line of the C35 scatter plot approximates zero (-0.0643) and the distribution of mRNA transcript counts remains proportional to the relative size of the transcripts. The breadth of the scatter plot at any given transcript size range is explained by differences in the level of gene expression for transcripts of equivalent size. The three distribution plots represent an identical number of transcripts (8,721), so the computed slope lines could only be different from zero if TPM counts were disproportionate relative to transcript size. The C35 data set was the only file in which the number of TPM counts was proportional to the size of the transcript. It is also important to note that simply increasing the total number of sequenced transcripts in the C9 data set by poly-A selection did not correct or reduce the negative slope line of the mean transcript TPM count vs transcript size relationship. In fact, the negative slope line was 1.7-fold greater in the C9 data set than in the C12 data set. This result further emphasizes the importance of efficient transcript recovery on RNA sequencing outcomes.
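The slope comparison can be illustrated with an ordinary least-squares fit of expression against Log₁₀ transcript size. The sketch below is not the authors' Excel workflow: the function name `slope_intercept` and the toy data are hypothetical, and taking Log₁₀ of the TPM values as well as of the sizes is an assumption about how the axes in Fig 4 are scaled.

```python
import math


def slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data in which counts fall tenfold per tenfold size increase
sizes_bp = [200, 2_000, 20_000]
mean_tpm = [100.0, 10.0, 1.0]
log_size = [math.log10(s) for s in sizes_bp]
log_tpm = [math.log10(t) for t in mean_tpm]  # assumption: log-scaled counts
toy_slope, toy_intercept = slope_intercept(log_size, log_tpm)

# Ratio of the two negative slopes reported in the text
fold = 1.1144 / 0.6469
print(round(toy_slope, 3), round(fold, 1))
```

A slope near zero, as reported for C35, would correspond to toy data in which expression does not drift systematically with transcript size.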

Coefficient of variation estimates of transcript variability

It is well known that variability, as estimated by the standard deviation (SD), increases in proportion to the calculated mean. To compare the variability among transcripts with different mean TPM counts, the coefficient of variation (CV) was employed. As previously noted, normalized TPM counts were used in these comparisons. To determine how this variability was affected by transcript size, scatter plots of the mean TPM count CV values for the C9, C12 and C35 data sets are presented in Fig 5A–5C. The 8,721 transcript TPM count CVs are plotted as a function of transcript size (Log₁₀ bp). The solid red line (-0.5 Log₁₀) represents a mean TPM count CV of ~ 0.31 or 31% for the 8,721 transcripts. The scatter plots of the TPM count CV for the C9 and C12 data sets are similar and the positive regression slope lines suggest that the variability is increasing in relation to transcript size.
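For reference, the CV of a transcript is its SD across samples divided by its mean, and the position of the solid red line converts from the Log₁₀ scale as 10^(-0.5), which is about 0.316. The sketch below illustrates both calculations; the function name `cv` is hypothetical, and the use of the sample (n - 1) SD is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample SD divided by the mean.
    Assumption: the sample (n - 1) standard deviation is used."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


# Hypothetical TPM counts for one transcript in three samples
toy_cv = cv([8.0, 10.0, 12.0])

# Back-transform the red line at -0.5 on the Log10 axis
red_line_cv = 10 ** -0.5
print(round(toy_cv, 2), round(red_line_cv, 3))
```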

Fig 5 — The positive slope line depicted by the red dashed line in the C9 and C12 data sets suggests that variability is increasing in relation to transcript size. Unlike the C9 and C12 data sets, the CV distribution within the C35 data set is less variable and more evenly distributed across all transcript sizes. The regression slope line of CV dispersion in the C35 data set is negative suggesting that the smaller mRNA transcripts are more variable, and the CV values decline with increasing transcript size.

In contrast to the C9 and C12 data sets, the distribution plot of the mean TPM count CV values across the range of transcripts in the C35 data is uniform and decreases slightly, indicating the variability was greater among the smallest transcripts. This would be consistent with the premise that smaller transcripts yield fewer sequencing fragments, thereby providing fewer and more variable sequencing results [40, 41]. Finally, the narrow range of CV scatter across transcript size indicates that the C35 data file has the smallest overall level of variability as measured by the CV of 32.7% (C9 CV = 55% and C12 CV = 52%, respectively).

Based on our analysis of the 8,721 transcripts common to all three data sets, using the mean TPM counts and their variability presented in Figs 4 and 5, we speculated that the differences in the recovery of short and midsized transcripts contribute to these observed changes.

Slope and intercept analysis of all transcripts in the three data sets

In our initial analysis of transcript TPM counts vs transcript size presented in Fig 4, total TPM counts were normalized across the three data sets and the regression analysis was limited to 8721 common transcripts. It is possible that the different slopes and intercepts observed in Fig 4, were influenced by the TPM count normalization or the selection of a reduced number of total transcripts. To examine this possibility, we computed the individual slope and intercept values for the lncRNA and mRNA transcripts with known size measurements and TPM counts > 0.1 in all 56 samples. Since the individual samples have different numbers of total TPM counts and identified transcripts, regression analysis of the individual samples is likely to contain the greatest variability among the three data sets. The results of this comprehensive analysis are presented in the Box plots depicted in Fig 6.

Fig 6 — Box plots depicting the mean (x) slope (A) and intercept (B) values for individuals in the C9, C12, and C35 data sets. All identified mRNA and lncRNA transcripts with a TPM count > 0.1 were regressed in relation to transcript size. The average number of mRNA and lncRNA transcripts with TPM counts > 0.1 and known transcript size in the three data sets were 12,164 ± 500.7, 9108 ± 6.68 and 11,847 ± 261.1, respectively. The mean slope and intercept values recorded for the three data sets were proportional to values previously noted in Fig 4.

One-way analysis of variance of the computed slopes (C9: -0.600 ± 0.198, C12: -0.650 ± 0.083, and C35: 0.096 ± 0.038; mean ± SD) and intercepts (C9: 2.539 ± 0.620, C12: 3.142 ± 0.267, and C35: -0.083 ± 0.124; mean ± SD) of the regressed transcript TPM count vs transcript size relationships in the three data sets demonstrates that the slope and intercept of the C35 data set are markedly different from the values identified in the C9 and C12 data sets (df = 2, 53; slope: F = 411, P < 5.3E-33; intercept: F = 711, P < 5.24E-39). Furthermore, the observed slopes and intercepts of the samples in the C35 data set are consistent with the expectations that form the basis for RNA sequencing analysis.
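The degrees of freedom reported above (2 and 53) follow from three groups and 56 samples (9 + 12 + 35). A minimal sketch of the one-way ANOVA F statistic is given below for orientation. It is not the Excel procedure used in the study, the function name `one_way_anova_f` and the toy groups are hypothetical, and no P value is computed because that requires the F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy groups (not the slopes measured in the study)
toy_f, toy_df1, toy_df2 = one_way_anova_f(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
)
print(toy_f, toy_df1, toy_df2)
```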

The colored area within each box plot represents approximately 50% of the values, those between the 25th and 75th percentiles, and the solid black line represents the sample median. In Fig 6, note the shift in the sample median from the mean in the C9 and C12 data sets relative to the C35 sample group and the degree of dispersion between the slopes and intercepts of the data sets. Regression analysis of the sequencing results of the individual samples without TPM count normalization or the survey of identical transcripts in the various data sets did not alter the unique slope and intercept profiles previously described in Fig 4. However, the degree of sample dispersion is reduced in the C35 data set relative to that noted for the C9 and C12 samples. A detailed summary of the statistical analysis is provided in S3.

Discussion

A variety of methodologies are available for the extraction and recovery of RNA from whole blood. In addition, novel collection tubes, such as PAXgene and Tempus tubes, can also be employed for RNA stabilization [3, 4, 6, 8, 12–21, 23–26]. Although published microarray and RNA sequencing studies consistently use “good quality” RNA that is defined by A260/280, A260/230 ratios and RIN values, differences in gene expression are frequently reported and findings are difficult to replicate across various experimental platforms. The resulting differences observed between similar studies performed using different extraction methodologies preclude pooling the data for potentially more informative “meta” analysis, irrespective of the fact that these sequencing studies begin with “good quality” RNA. Apparently, extracted RNA is not identical across studies even though it has been deemed “good quality.” Since the quality and integrity of the RNA are theoretically identical, other factors must be contributing to the variability of the sequencing results. In this report, we compared three NCBI archived data sets of RNA sequencing results where markedly different extraction methods were used to obtain RNA from the blood of human control subjects. We sought to identify additional factors contributing to the disparate results commonly reported for sequencing results. Table 1 summarizes the factors addressed in this study.

Table 1. Summary of sequencing results recovered from data sets employing different extraction methodology.

Item	Experimental Parameter	Comment:
A	Extraction Capacity	The largest number of mRNA and lncRNA transcripts recovered from the FASTQ data files was found in the C35 data set (Fig 2A). Total transcript recovery was improved in relation to the C9 and C12 data sets by 6.8 and 33.4%, respectively.
B	Extraction Efficiency	The ability to efficiently recover short or weakly expressed transcripts independent of transcript size. The largest number of small and midsized transcripts with TPM counts < 1 was identified in the C35 data set (C9 = 3,413, C12 = 88 and C35 = 4,409, Fig 2A and 2C).
C	Recovery of small to midsized RNA transcripts	A comparative survey of mRNA and lncRNA transcripts based on transcript size revealed the differential loss of short to midsized transcripts in some data sets (Fig 3).
D	Proportional relationship of transcript counts to transcript size	The C35 data set was the only file in which the number of transcript counts could be shown to increase in parallel to transcript size, thereby fulfilling a fundamental precept of RNA sequencing (Fig 4C).
E	Experimental error is independent of transcript size	The coefficient of variation in the C35 data set was uniformly expressed over the entire range of transcript sizes (Fig 5C). In contrast, the CV increased in proportion to transcript size in the C9 and C12 data sets (Fig 5A and 5B), possibly due to incomplete transcript recovery.
F	Differential gene expression	If short to midsized mRNA and lncRNA transcripts are not efficiently and quantitatively extracted from whole blood, the increased variability attributed to transcript recovery across samples cannot be distinguished from differential gene expression.
G	RNA extraction	In addition to relying on the stability and purity of RNA recovered from whole blood, greater efforts should be directed at establishing criteria for evaluating the efficiency and capacity of the RNA extraction protocols that are routinely used for RNA sequencing.

In our initial survey of the sequencing results, we detected sizeable differences between the three data sets in the total number of mRNA and lncRNA transcripts, as noted in Fig 2A. The number of sequenced transcripts in the C12 data file was ~28.5% and 33.7% lower than in the C9 and C35 data sets, respectively, even though the raw FASTQ files were processed identically. Furthermore, the identification of lncRNA transcripts in the C12 data set also was reduced by more than 50%. This is a substantial difference since total RNA was extracted for RNA sequencing in this data set. When evaluating RNA extraction, greater importance should be given to the quantity of RNA obtained per volume of blood (Table 1, A).

In our initial survey of mRNA and lncRNA transcripts, we identified major differences in the total number of transcripts as noted in Fig 2A. An examination of the number of transcripts with TPM counts < 1 among the three data sets also identified larger numbers of weakly expressed transcripts in the C9 and C35 data sets (Fig 2C). This finding implies that there is an improvement in the recovery of smaller and weakly expressed RNA transcripts in these two data sets. Therefore, the quantitative recovery of RNA is of considerable importance, and it should be viewed more judiciously (Table 1, B).

To gain additional perspective on the transcript size relationships of the known mRNA and lncRNA transcripts identified in the three data sets, the sequenced RNA transcripts were ranked according to their size measured in base pairs (bp). The size distribution of sequenced mRNA and lncRNA transcripts recovered from the three data files is depicted in Fig 3A and 3B. The mRNA and lncRNA transcripts identified in the three data files are superimposed with the list of mRNA and lncRNA transcripts from the NCBI reference files. A comparison of the mRNA and lncRNA transcript size distributions in the three data files to the known transcript sizes of the NCBI reference genome provides some perspective on the range of transcripts recovered during RNA extraction (Fig 3A and 3B). The visible absence of weakly expressed short to midsized mRNA and lncRNA transcripts in the C12 data set supports the conclusion that these transcripts may not be efficiently recovered in some extractions (Table 1, C).

Although major differences exist in the total number of identified transcripts between the three data files, 8721 common transcripts were identified in all three data sets (Fig 2B). When examining these transcripts, one would expect some degree of commonality since they represent blood drawn from healthy control subjects and they are expressed at sufficient levels to be detectable in all three data sets. When examined in relation to transcript size, the normalized mean TPM counts identified among the short to midsized transcripts in the C9 and C12 data sets appeared to be disproportionately higher than the mean TPM counts of the largest transcripts. This unexpected result contradicts the long-held view that the number of sequencing calls should be proportional to the relative size of the transcript. However, the transcript count distributions in the C35 data set displayed proportional numbers of mean TPM counts across the entire transcript size range of 200 to 30,000 bp (Fig 4C). We recommend confirming that the number of transcript counts remains proportional to the relative size of the sequenced transcripts (Table 1, D).
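The check recommended above can be sketched as a simple regression of mean TPM counts on transcript size. This is a minimal illustration and not the RAnGER workflow used in the study: the log10 transform of both axes is an assumption, the function name `size_slope` is hypothetical, and the input values are invented for demonstration. A slope near zero corresponds to the flat profile described for the C35 data set, whereas a negative slope corresponds to the decline described for the C9 and C12 data sets.

```python
import numpy as np


def size_slope(transcript_bp, mean_tpm):
    """Least-squares slope and intercept of log10(mean TPM) vs log10(size in bp).

    The log-log scale is an assumption made for this sketch.
    """
    x = np.log10(np.asarray(transcript_bp, dtype=float))
    y = np.log10(np.asarray(mean_tpm, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


# Invented example values (not taken from the data sets in the paper).
sizes_bp = np.array([200, 1000, 5000, 30000], dtype=float)

# Case 1: mean TPM does not change with transcript size.
flat_tpm = np.full(sizes_bp.shape, 10.0)
flat_slope, _ = size_slope(sizes_bp, flat_tpm)

# Case 2: mean TPM falls off as size ** -0.5.
declining_tpm = 100.0 * sizes_bp ** -0.5
declining_slope, _ = size_slope(sizes_bp, declining_tpm)

print(f"flat profile slope:      {flat_slope:+.3f}")
print(f"declining profile slope: {declining_slope:+.3f}")
```

In practice the inputs would be the per-transcript mean TPM counts of the transcripts common to all data sets, and the slope would be compared between data sets rather than interpreted in isolation.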

After observing differences in the slope lines of the transcript TPM count distributions across transcript lengths, it was of interest to see how these changes might impact data set variability. Since the CV is typically employed to evaluate the precision of a technique, ideally one would expect the CV to remain stable and independent of any changes in transcript size. Therefore, the coefficient of variation was employed to characterize transcript normalized TPM count mean variance among the three data sets. The CV plots for the C9 and C12 data sets (Fig 5A and 5B) are similar; however, the profile of the C35 data set displayed a much smaller range of dispersion over the entire transcript size range with a mean CV of 32.7% thereby approximating the inter-individual CV of 30.7% for total blood RNA concentrations [7]. The scatter plot slope line of the mRNA transcript means and their respective CVs, in the C35 data set, remained constant (slope ≅ 0) over the entire range of transcript sizes from 200 to 30,000 bp’s (Figs 4C and 5C). Therefore, when analyzing RNA sequencing data files, it may be useful to confirm that experimental error is evenly distributed and independent of transcript size (Table 1, E).
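As an illustration of the CV check described above, the following sketch computes a per-transcript CV across samples and then regresses CV on transcript size. It is not the authors' implementation: the function names, the use of the sample standard deviation (`ddof=1`), the log10 size axis and the example matrix are all assumptions made for the sketch.

```python
import numpy as np


def transcript_cv(tpm):
    """Percent CV per transcript; rows are transcripts, columns are samples."""
    tpm = np.asarray(tpm, dtype=float)
    mean = tpm.mean(axis=1)
    sd = tpm.std(axis=1, ddof=1)  # sample standard deviation (assumption)
    return 100.0 * sd / mean


def cv_size_slope(transcript_bp, cv_percent):
    """Slope of CV (%) against log10(transcript size in bp)."""
    x = np.log10(np.asarray(transcript_bp, dtype=float))
    slope, _intercept = np.polyfit(x, np.asarray(cv_percent, dtype=float), 1)
    return slope


# Invented example: three transcripts measured in three samples.
sizes_bp = [200, 2000, 20000]
tpm_matrix = [
    [8.0, 10.0, 12.0],
    [80.0, 100.0, 120.0],
    [800.0, 1000.0, 1200.0],
]

cv = transcript_cv(tpm_matrix)
slope = cv_size_slope(sizes_bp, cv)

print("CV per transcript (%):", np.round(cv, 1))
print(f"CV vs size slope: {slope:+.3f}")
```

A slope close to zero indicates that the dispersion does not depend on transcript size, which is the behaviour the text describes for the C35 data set.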

Many studies have been published to identify the factors that contribute to sequencing variability; however, our overall understanding has not progressed beyond the fact that RNA stabilization and extraction are fundamental sources of this variation [3–26]. Initial efforts to explain differences in the number of mapped reads identified protocol differences relating to the selection of RNA species [5, 27, 42]. rRNA and globin make up a substantial number of total RNA transcripts and if they are not removed prior to library formation, they will constitute most of the sequencing reads. Therefore, selection of total RNA versus poly-A selected RNA, or globin and rRNA depletion, have a major impact on library complexity and the number of mapped reads [27, 42]. The development of dependable and reproducible ribosomal and globin depletion protocols has significantly improved the number of exonic and intronic reads that can now be detected during RNA sequencing [5, 14, 43–45]. Protocols that use total RNA extraction in conjunction with globin and ribosomal RNA-depletion procedures demonstrate significant improvement in the total number of mapped reads.

Although the mRNA and lncRNA transcript size distributions of the C9 and C35 data sets are identical (Fig 3A), the normalized transcript mean and CV distribution plots for these two data files are markedly different (Figs 4–6). While RNA transcript recovery is important, it is of equal importance to demonstrate that the sequenced transcripts exhibit a proportional transcript size/TPM count relationship as depicted in Fig 4C. Although similar numbers of total transcripts are recovered in both data sets, only the sequenced genes in the C35 data set show this proportionality. Apparently, the presence of a large number of transcripts is unable to correct deficiencies in RNA recovery and sustain the proportional transcript size/TPM count relationship observed in the C35 data set.

Based on the overview of the three data sets employed in this study, we speculate that the inefficient recovery of weakly expressed short to midsized RNA transcripts has a significant impact on RNA sequencing results. Sultan et al. [5] previously reported an improved recovery of short RNA transcripts during phenol-based extractions and proposed that these transcripts are lost during complicated and tedious silica column-based extractions. Yip et al. [21] have also reported that sample processing-dependent differences in gene expression were due to the loss of transcripts during RNA extraction. Therefore, it is reasonable to conclude that the loss of as many as 30–50% of the mRNA and lncRNA transcripts directly impacts the character and size of the resulting sequencing library. Sequencing results are further compromised during library amplification by a reduction in the total number of recovered transcripts as well as the incomplete and variable recovery of other transcripts, resulting in over- and under-amplification of segments of the resulting library (Figs 4 and 5). If the primary character of the RNA library is skewed or misrepresented in any way, library amplification will further distort the sequencing results. Therefore, it is critically important to extract and efficiently recover the complete range of transcripts from the blood during RNA extraction. The subsequent removal of globin and ribosomal RNA transcripts provides the best opportunity for the construction of a “complete” library that can be representatively amplified, thereby significantly reducing the between-file variance (Table 1, F).

We previously reported that the average RNA content in human whole blood is 14.58 μg / ml of blood with inter-individual variations ranging from 6.7 to 22.7 μg / ml (inter-individual CV of 30.7% and intra-individual CV of 5.9% [7]). Since the variation in inter-individual RNA recovery spans a 3.4-fold range, RNA extractions must have sufficient capacity to efficiently cover this dynamic range of blood RNA concentrations. Based on an average blood concentration of 14.58 μg of RNA / ml of blood, sequencing 1 μg of RNA from the C35 samples constitutes ~7% of the RNA in the sample. Previously reported column-based RNA yields for human blood are much lower, ranging from 1 to 8 μg of RNA / ml of blood [3, 4, 6, 8, 12–21, 23–26]. Using 1 μg of RNA from these extractions would constitute 12–100% of the RNA from the sample. The impact of these dramatic quantitative differences requires additional consideration, and we encourage investigators to pay greater attention to the amount of total RNA that is recovered during blood RNA extraction. Furthermore, we strongly recommend that the quantity of recovered RNA be reported for every sample used for RNA sequencing and included in all publications. If the expected range of total RNA is routinely recovered and efficient globin and ribosomal RNA depletion protocols are employed, there is a greater opportunity for a “highly complex” library to be identified in every sample and representatively amplified, thereby significantly reducing sample variance [27, 42, 46].
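The percentages quoted in this paragraph follow from simple ratios, which can be reproduced as below. The sketch uses only the figures given in the text and assumes, for illustration, that the 1 μg sequencing input is compared with the RNA recovered from 1 ml of blood; the constant and function names are hypothetical.

```python
# Figures quoted in the text (reference [7] and the cited column-based yields).
MEAN_RNA_UG_PER_ML = 14.58            # average RNA content of whole blood
RANGE_UG_PER_ML = (6.7, 22.7)         # inter-individual range
COLUMN_YIELD_UG_PER_ML = (1.0, 8.0)   # reported column-based yields
INPUT_UG = 1.0                        # RNA used for sequencing


def percent_of_recovered_rna(input_ug: float, yield_ug_per_ml: float) -> float:
    """Percent of the RNA recovered from 1 ml of blood that the input represents."""
    return 100.0 * input_ug / yield_ug_per_ml


fold_range = RANGE_UG_PER_ML[1] / RANGE_UG_PER_ML[0]
pct_average_yield = percent_of_recovered_rna(INPUT_UG, MEAN_RNA_UG_PER_ML)
pct_column_high_yield = percent_of_recovered_rna(INPUT_UG, COLUMN_YIELD_UG_PER_ML[1])
pct_column_low_yield = percent_of_recovered_rna(INPUT_UG, COLUMN_YIELD_UG_PER_ML[0])

print(f"inter-individual range: {fold_range:.1f}-fold")
print(f"1 ug at 14.58 ug/ml:    {pct_average_yield:.1f}% of recovered RNA")
print(f"1 ug at 8 ug/ml:        {pct_column_high_yield:.1f}% of recovered RNA")
print(f"1 ug at 1 ug/ml:        {pct_column_low_yield:.1f}% of recovered RNA")
```

The lower bound of 12.5% is rounded to 12% in the text.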

In conclusion, we believe that differences in RNA recovery resulting from incomplete RNA extraction are a primary source of the RNA sequencing batch effects previously reported in the literature. The disproportionate loss of short to midsized mRNA and lncRNA transcripts during RNA extraction excludes these transcripts from any subsequent downstream application such as qPCR, RNA microarray analysis or RNA sequencing. This issue may be extremely important when studying physiological pathways containing large numbers of short regulatory genes, such as cytokines, as well as the physiological role of short lncRNA transcripts [15, 16]. We speculate that these issues are further exacerbated when the cDNA libraries are amplified prior to qPCR or RNA sequencing. Finally, if the efficiency of RNA transcript recovery changes from sample to sample with various extraction methodologies, it becomes impossible to determine the extent to which RNA recovery or the level of differential expression is contributing to the observed TPM count differences (Table 1, G). Although the methodological improvements mentioned here do not leap to the forefront when analyzing RNA sequencing data, addressing these concerns using the steps outlined here may uncover information buried in the data and strengthen study conclusions.

Supporting information

S1 File. Distribution of TPM counts across individual samples.

Maximum number of transcripts (N) identified in the C9, C12 and C35 data sets after processing FASTQ data files under identical conditions. A) A total of 12,059, 9,978 and 10,616 transcripts were found in every sample in the C9, C12 and C35 data sets, respectively. Further analysis identified meaningful numbers of transcript counts as the number of samples per file was sequentially decreased to one. For example, in the C35 data set, 233 lncRNA transcripts were found in only 1 of the 35 samples but the counts ranged from 3.5–613 with a mean of 55.9 ± 85.9 TPM counts. Transcripts within this count range clearly represent legitimate gene expression values and thereby may provide some indication of the sensitivity of the extraction process.

(XLSX)

S2 File. Normalization and statistical analysis.

Summary of the experimental rationale and ANOVA statistical analysis.

(DOCX)

S1 Table. List of NCBI reference genome transcripts, mRNA and lncRNA transcripts with base pair size assignments.

To evaluate RNA recovery, transcripts in reference genome GRCh37.p13 (hg19) [37] were used to identify the sequenced transcripts in the three data files. The lists of 19,608 mRNA and 6,725 lncRNA transcripts used to identify sequenced mRNA and lncRNA transcripts in the three data sets are provided.

(XLSX)

Acknowledgments

Robert Miller has furnished the programming for the RAnGER software that was employed in our analysis, and he has agreed to provide the use of the software in this publication. Copies of the software are available at rmillerllc927@gmail.com along with detailed video instructions for using the program [35].

Abbreviations

mRNA: messenger RNA
lncRNA: long noncoding RNA
GO: Gene Ontology
RNA-seq: RNA sequencing
SD: Standard Deviation
TPM: Transcripts Per Million
RIN: RNA Integrity Number
M: mean
CV: Coefficient of Variation
DGE: differential gene expression

Data Availability

• The original archived data files are available at the NCBI Gene Expression Omnibus site (https://www.ncbi.nlm.nih.gov/geo/), which contains the sequencing data for the C9, C12 and C35 data sets in the following listings: GSE109484, GSE112057 and GSE169359.

Funding Statement

The authors received no specific funding for this work.

References

1. Liew CC, Ma J, Tang HC, Zheng R, Dempsey AA. The peripheral blood transcriptome dynamically reflects system wide biology: a potential diagnostic tool. J Lab Clin Med. 2006 Mar;147(3):126–32. doi: 10.1016/j.lab.2005.10.005.
2. Savelyeva AV, Kuligina EV, Bariakin DN, Kozlov VV, Ryabchikova EI, Richter VA, et al. Variety of RNAs in Peripheral Blood Cells, Plasma, and Plasma Fractions. Biomed Res Int. 2017; 2017:7404912. doi: 10.1155/2017/7404912. Epub 2017 Jan 3; PMCID: PMC5239830.
3. Chai V, Vassilakos A, Lee Y, Wright JA, Young AH. Optimization of the PAXgene blood RNA extraction system for gene expression analysis of clinical samples. Journal of Clinical Laboratory Analysis. 2005; 19(5):182–188. doi: 10.1002/jcla.20075; PMCID: PMC6807908.
4. Kim JH, Jin HO, Park JA, Chang YH, Hong YJ, Lee JK. Comparison of three different kits for extraction of high-quality RNA from frozen blood. Springerplus. 2014; 3:76. Published 2014 Feb 8. doi: 10.1186/2193-1801-3-76.
5. Sultan M, Amstislavskiy V, Risch T, et al. Influence of RNA extraction methods and library selection schemes on RNA-seq data. BMC Genomics. 2014;15: 675. doi: 10.1186/1471-2164-15-675.
6. Liu X, Li Q, Wang X, Zhou X, Liao Q, He X, et al. Comparison of six different pretreatment methods for blood RNA extraction. Biopreserv Biobank. 2015 Feb;13(1):56–60. doi: 10.1089/bio.2014.0090.
7. Chomczynski P, Wilfinger WW, Eghbalnia HR, Kennedy A, Rymaszewski M, Mackey K. Inter-Individual Differences in RNA Levels in Human Peripheral Blood. PLoS ONE 2016; 11(2): e0148260. doi: 10.1371/journal.pone.0148260.
8. Donohue DE, Gautam A, Miller S-A, Srinivasan S, Abu-Amara D, Campbell R, et al. Gene expression profiling of whole blood: A comparative assessment of RNA-stabilizing collection methods. PLoS ONE 2019; 14(10): e0223065. doi: 10.1371/journal.pone.0223065.
9. Gautam A, Donohue D, Hoke A, Miller SA, Srinivasan S, Sowe B, et al. Investigating gene expression profiles of whole blood and peripheral blood mononuclear cells using multiple collection and processing methods. PLoS ONE 2019; 14(12): e0225137. doi: 10.1371/journal.pone.0225137.
10. Rodríguez A, Duyvejonck H, Van Belleghem JD, Gryp T, Van Simaey L, Vermeulen S, et al. Comparison of procedures for RNA-extraction from peripheral blood mononuclear cells. PLoS One. 2020 Feb 21;15(2): e0229423. doi: 10.1371/journal.pone.0229423; PMCID: PMC7034890.
11. Li S, Chen L, Li J, Liu J. Comparison of different protocols of RNA preparation from circulating blood for RNA sequencing. Biotechnol Lett. 2021 Sep;43(9):1685–1698. Epub 2021 Jun 25. doi: 10.1007/s10529-021-03152-8.
12. Asare AL, Kolchinsky SA, Gao Z, Wang R, Raddassi K, Bourcier K, et al. Differential gene expression profiles are dependent upon method of peripheral blood collection and RNA isolation. BMC Genomics. 2008; 9:474. doi: 10.1186/1471-2164-9-474.
13. Muyal JP, Muyal V, Kaistha BP, Seifart C, Fehrenbach H. Systematic comparison of RNA extraction techniques from frozen and fresh lung tissues: checkpoint towards gene expression studies. Diagn Pathol. 2009 Mar 24; 4:9. doi: 10.1186/1746-1596-4-9; PMCID: PMC2669047.
14. Menke A, Rex-Haffner M, Klengel T, Binder EB, Mehta D. Peripheral blood gene expression: it all boils down to the RNA collection tubes. BMC Res Notes. 2012 Jan 4; 5:1. doi: 10.1186/1756-0500-5-1; PMCID: PMC3280191.
15. Duale N, Brunborg G, Rønningen KS, Briese T, Aarem J, Aas KK, et al. Human blood RNA stabilization in samples collected and transported for a large biobank. BMC Res Notes. 2012; 5:510. doi: 10.1186/1756-0500-5-510.
16. Häntzsch M, Tolios A, Beutner F, Nagel D, Thiery J, Teupser D, et al. Comparison of whole blood RNA preservation tubes and novel generation RNA extraction kits for analysis of PC and MiRNA profiles. PLoS One. 2014; 9: e113298. doi: 10.1371/journal.pone.0113298.
17. Bayatti N, Cooper-Knock J, Bury JJ, Wyles M, Heath PR, Kirby J, et al. Comparison of blood RNA extraction methods used for gene expression profiling in amyotrophic lateral sclerosis. PLoS One. 2014 Jan 27;9(1): e87508. doi: 10.1371/journal.pone.0087508; PMCID: PMC3903649.
18. Meyer A, Paroni F, Günther K, Dharmadhikari G, Ahrens W, Kelm S, et al. Evaluation of Existing Methods for Human Blood mRNA Isolation and Analysis for Large Studies. PLoS One. 2016 Aug 30;11(8): e0161778. doi: 10.1371/journal.pone.0161778; PMCID: PMC5004844.
19. Aarem J, Brunborg G, Aas KK, Harbak K, Taipale MM, Magnus P, et al. Comparison of blood RNA isolation methods from samples stabilized in Tempus tubes and stored at a large human biobank. BMC Res Notes. 2016 Sep 1;9(1):430. doi: 10.1186/s13104-016-2224-y; PMCID: PMC5009671.
20. Skogholt AH, Ryeng E, Erlandsen SE, Skorpen F, Schønberg SA, Sætrom P. Gene expression differences between PAXgene and Tempus blood RNA tubes are highly reproducible between independent samples and biobanks. BMC Res Notes. 2017; 10:136. doi: 10.1186/s13104-017-2455-6.
21. Yip L, Fuhlbrigge R, Atkinson MA, et al. Impact of blood collection and processing on peripheral blood gene expression profiling in type 1 diabetes. BMC Genomics. 2017; 18:636. doi: 10.1186/s12864-017-3949-2.
22. Shen Y, Li R, Tian F, Chen Z, Lu N, Bai Y, et al. Impact of RNA integrity and blood sample storage conditions on the gene expression analysis. Onco Targets Ther. 2018 Jun 20; 11:3573–3581. doi: 10.2147/OTT.S158868; PMCID: PMC6016255.
23. Richards J, Unger ER, Rajeevan MS. Simultaneous extraction of mRNA and microRNA from whole blood stabilized in tempus tubes. BMC Res Notes. 2019; 12:39. doi: 10.1186/s13104-019-4087-5.
24. Lamot L, Niemietz I, Brown KL. Comparable type I interferon score determination from PAXgene and Tempus whole blood RNA collection and isolation systems. BMC Res Notes. 2019; 12:511. doi: 10.1186/s13104-019-4562-z.
25. Carrillo-Ávila JA, de la Puente R, Catalina P, et al. Evaluation of RNA purification methods by using different blood stabilization tubes: identification of key features for epidemiological studies. BMC Res Notes. 2020; 13:77. doi: 10.1186/s13104-020-04943-4.
26. Lee DE, Lee H, Lee SD, Han HS, Choe JY, Yun S, et al. Comparison of different methods of RNA preparation from peripheral blood for nucleic acid amplification assay. Indian J Med Microbiol. 2018 Jan-Mar;36(1):77–80. doi: 10.4103/ijmm.IJMM_18_104.
27. Kukurba KR, Montgomery SB. RNA Sequencing and Analysis. Cold Spring Harb Protoc. 2015; 11:951–69. doi: 10.1101/pdb.top084970.
28. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016; 17:13. doi: 10.1186/s13059-016-0881-8.
29. Scholes AN, Lewis JA. Comparison of RNA isolation methods on RNA-Seq: implications for differential expression and meta-analyses. BMC Genomics. 2020; 21:249. doi: 10.1186/s12864-020-6673-2.
30. Chomczynski P, Wilfinger W, Kennedy A, Rymaszewski M, Mackey K. RNAzol® BD: a reagent for the effective isolation of RNA from whole blood. Nat Methods ii. 2013. doi: 10.1038/nmeth.f.364.
31. Eghbalnia HR, Wilfinger WW, Mackey K, Chomczynski P. Coordinated analysis of exon and intron data reveals novel differential gene expression changes. Sci Rep. 2020; 10:15669. doi: 10.1038/s41598-020-72482-w.
32. Wilfinger WW, Miller R, Eghbalnia HR, Mackey K, Chomczynski P. Strategies for detecting and identifying biological signals amidst the variation commonly found in RNA sequencing data. BMC Genomics. 2021; 22:322. doi: 10.1186/s12864-021-07563-9.
33. Mangul S, Yang HT, Strauli N, et al. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biol. 2018; 19:36. doi: 10.1186/s13059-018-1403-7.
34. Mo A, Marigorta UM, Arafat D, Chan LHK, Ponder L, Jang SR, et al. Disease-specific regulation of gene expression in a comparative analysis of juvenile idiopathic arthritis and inflammatory bowel disease. Genome Med. 2018; 10:48. doi: 10.1186/s13073-018-0558-x.
35. Miller R. RAnGER© (Rank, Analyze, Graph and Evaluate Results), 2021. www.rmillerllc927@gmail.com.
36. Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, et al. Diversity of human copy number variation and multicopy genes. Science. 2010; 330:641–6. doi: 10.1126/science.1197005.
37. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41(Database issue): D991–5. Gene Expression Omnibus database, a public functional data repository. (https://www.ncbi.nlm.nih.gov/sites/GDSbrowser/). doi: 10.1093/nar/gks1193.
38. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012; 22:1775–89. doi: 10.1101/gr.132159.111.
39. Ma L, Cao J, Liu L, Du Q, Li Z, Zou D, et al. LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Research. 2019;47: D128. doi: 10.1093/nar/gky960.
40. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009; 4:14. doi: 10.1186/1745-6150-4-14.
41. Mandelboum S, Manber Z, Elroy-Stein O, Elkon R. Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias. PLoS Biol. 2019;17: e3000481. doi: 10.1371/journal.pbio.3000481.
42. Head SR, Komori HK, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR, et al. Library construction for next-generation sequencing: overviews and challenges. Biotechniques. 2014 Feb 1;56(2):61–4, 66, 68, passim. doi: 10.2144/000114133; PMCID: PMC4351865.
43. Liu J, Walter E, Stenger D, Thach D. Effects of globin mRNA reduction methods on gene expression profiles from whole blood. J Mol Diagn. 2006 Nov;8(5):551–8. doi: 10.2353/jmoldx.2006.060021; PMCID: PMC1876175.
44. Raz T, Kapranov P, Lipson D, Letovsky S, Milos P, Thompson J. Protocol Dependence of Sequencing-Based Gene Expression Measurements. PLoS One. 2011; 6: e19287. doi: 10.1371/journal.pone.0019287.
45. Jang JS, Berg B, Holicky E, Eckloff B, Mutawe M, Carrasquillo MM, et al. Comparative evaluation for the globin gene depletion methods for mRNA sequencing using the whole blood-derived total RNAs. BMC Genomics. 2020 Dec 11;21(1):89021. doi: 10.1186/s12864-020-07304-4; PMCID: PMC7733259.
46. Bhargava V, Head SR, Ordoukhanian P, Mercola M, Subramaniam S. Technical variations in low-input RNA-seq methodologies. Sci Rep. 2014 Jan 14; 4:3678. doi: 10.1038/srep03678; PMCID: PMC3890974.

PLoS One. doi: 10.1371/journal.pone.0291209.r001

Decision Letter 0

Surya Saha

26 May 2023

PONE-D-23-05931

Whole Blood RNA Extraction Efficiency Contributes to Variability in RNA Sequencing Data Sets

PLOS ONE

Dear Will,

I'm glad to report back that the manuscript looks good overall but requires minor updates.

Please include more details of the methods as requested by both reviewers.

Here are the Table formatting guidelines from the PLOS One website:

Tables

Cite tables in ascending numeric order upon first appearance in the manuscript file.

Place each table in your manuscript file directly after the paragraph in which it is first cited (read order). Do not submit your tables in separate files.

Tables require a label (e.g., “Table 1”) and brief descriptive title to be placed above the table. Place legends, footnotes, and other text below the table.

Please add the table to the main paper text.

I look forward to receiving the updated manuscript soon. Once again, I apologize for the delays in the review process.

Cheers,

Surya

Please submit your revised manuscript by Jul 10 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Surya Saha, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

3. Please amend your manuscript to include your abstract after the title page.

4. Please ensure that you include a title page within your main document. You should list all authors and all affiliations as per our author instructions and clearly indicate the corresponding author.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Will,

I'm glad to report back that the manuscript looks good overall but requires minor updates.

Please include more details of the methods as requested by both reviewers.

Here are the Table formatting guidelines from the PLOS One website:

Tables

Cite tables in ascending numeric order upon first appearance in the manuscript file.

Place each table in your manuscript file directly after the paragraph in which it is first cited (read order). Do not submit your tables in separate files.

Tables require a label (e.g., “Table 1”) and brief descriptive title to be placed above the table. Place legends, footnotes, and other text below the table.

Please add the table to the main paper text.

I look forward to receiving the updated manuscript soon. Once again, I apologize for the delays in the review process.

Cheers,

Surya

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Great study idea. I'd suggest in the future to have one data source (whole blood) with multiple aliquots so that your team can personally oversee the RNA extraction via different methods. There is likely more variability inherent within the sample sets than caused by extraction methods (i.e., blood sample degradation, age of samples, environmental factors influencing gene expression).

In addition, find a way to include small non-coding RNAs in future assessments. Because of their short length, meaningful data can often be gained despite degraded blood samples. It would give a more comprehensive view: mRNA, lncRNA, and miRNAs paint a better picture than just mRNA or lncRNA.

I'd also suggest in future studies to outline your statistical methods in more detail. I understand the RAnGER was used for data management, but validation with a statistical software (even R, using the ranger package) would be a bonus.

Lastly, the figures need to be of higher quality. They are likely generated by a program or software (probably Ranger or similar) but find a way to export the figures in high quality or a larger size that you can then resize.

Reviewer #2: Table 1 is missing. This article cannot be accepted until it is included as it is referred to many times in the discussion section. While the article raises excellent points on the variability of RNA extraction and purification methods, it would have been stronger for them to perform all the RNA upstream steps themselves, as inter-lab variability is a common issue even for well established SOPs.

In the methods, it would be helpful to include the pipeline ran including versions, instead of referring to a previous paper. This applies to the extraction methods as well. A supplementary figure/table would suffice.

Overall the points the paper makes are valid. Including total RNA recovered would be useful for evaluating extraction procedures, with the understanding that any experimental samples could have largely different RNA amounts.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Nov 16;18(11):e0291209. doi: 10.1371/journal.pone.0291209.r002

Author response to Decision Letter 0

19 Jun 2023

Response to Reviewers

On behalf of my coauthors, I would like to thank the reviewers for their constructive comments and editorial suggestions. To aid the reviewers in evaluating our responses, we reference specific text lines in the revised manuscript so the reviewers can quickly identify our edits.

Reviewer 1:

General Comments to the author, Reviewers 1 and 2:

2. Has the statistical analysis been performed appropriately and rigorously?

Both reviewers commented on this issue. We used the Excel Analysis ToolPak application for our statistical analysis; the ToolPak employed in our study is part of Excel Version 2304 (Build 16327.20248), within Microsoft 365 version 16.0.16327. ToolPak is Charles Zaiontz’s Real Statistics Resource Pack for Excel 2010, 2013, 2016, 2019, 2021 or 365 for Windows (Release 8.7). We have added this information in line 244 of the manuscript. In addition, in line 251 we reference supplemental file S3 as the location of more detailed information relating to our statistical analysis.

We did not statistically evaluate the descriptive data summarized in Figure 2 since a comparison of the means of the three data sets would only provide an N of one. Our primary objective was simply to document that the three data sets contain different numbers of transcripts and that the recovery of small to midsized transcripts appears to be responsible for these differences (Figures 2 and 3). Previous publications [3-26] have reported significant differences in the total number of sequenced transcripts with various extraction protocols, but these observations have not resulted in the improvement of sequencing results or provided an explanation for the observed differences in differential gene expression between similar experimental data sets.

We believe the principal contributions of this paper are the results presented in Figures 3, 4 and 5. Since the results presented in these figures were generated using the mean TPM counts of the various data sets, we followed the reviewers’ suggestions and performed a more rigorous statistical analysis. We have now calculated the individual slopes and intercepts for all fifty-six samples used in this study to further quantitate the observations provided in Figures 3-5. The results of this additional analysis are presented in Figure 6 and the statistical analysis is described in supplemental file S3. We believe your suggestions strengthen and provide additional support for the important observations presented in Figures 4 and 5.
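As an illustration of the per-sample analysis described above, the following is a minimal sketch of fitting a separate regression line (slope and intercept) to each sample's TPM counts as a function of transcript size. It is not the authors' Excel workflow: the function names, the toy data, and the choice to regress TPM on log10 of transcript size are all assumptions made for this example.

```python
import math


def ols_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def per_sample_fits(sizes_bp, tpm_by_sample):
    """Fit one regression line per sample.

    Assumption: transcript size is log10-transformed before fitting;
    the source does not state which axis transform was used.
    """
    x = [math.log10(size) for size in sizes_bp]
    return {name: ols_fit(x, tpm) for name, tpm in tpm_by_sample.items()}


# Hypothetical toy data: three transcript sizes (bp) and two made-up samples.
sizes = [200, 2000, 20000]
samples = {
    "flat": [10.0, 10.0, 10.0],      # TPM independent of transcript size
    "declining": [12.0, 10.0, 8.0],  # TPM falls as transcript size increases
}
fits = per_sample_fits(sizes, samples)
for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Collecting one slope and one intercept per sample, rather than a single fit to the group means, gives a distribution of values within each data set that can then be compared between data sets.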

General Comments to the author, Reviewer 1:

1. Prepare multiple aliquots of blood drawn from individuals and extract these samples with assorted RNA extraction methodology. We agree with your suggestion that this would provide useful information; however, this is a costly and time-consuming project that is unlikely to be covered by research grants or private funding. Therefore, a primary goal of our manuscript is to raise awareness of the fact that RNA recovery can be markedly different with various extraction kits and that these differences have a significant impact on RNA sequencing results. More importantly, we report that expressing transcript TPM counts in relation to transcript size is particularly useful in documenting how RNA extracts influence sequencing results independent of RNA purity and integrity, thereby providing a mechanism for evaluating the fundamental premises underlying RNA sequencing. We are hopeful that our manuscript will provide the rationale for the NIH to design and fund the project that you have suggested.

2. Include small non-coding RNA in future assessments. In human blood, small RNA constitutes approximately 23% of the total RNA [7]. In our study, we elected to sequentially extract both the large and small RNA fractions in accordance with the RNAzol-BD extraction protocol. Based on our earlier report [7], using the large RNA fraction provides an additional 23% enrichment of mRNA and lncRNA transcripts. The fact that small to midsized RNA transcripts were noticeably greater with the large RNA extraction protocol than with the total RNA protocols employed with the other two methods further implicates the RNA recovery issue. Since we saved the small RNA fractions as you suggested, this could be the source of a future study.

3. In future studies outline the source of your statistical methods. Thank you for alerting us to this omission. We used the Excel Analysis ToolPak for our statistical analysis. ToolPak is Charles Zaiontz’s Real Statistics Resource Pack for Excel 2010, 2013, 2016, 2019, 2021 or 365 for Windows (Release 8.7). This information is added to line 244 of the manuscript. In addition, more detailed information is given in supplemental file S3, which is also referenced in line 251 of the revised manuscript.

4. Rework the Figures to provide better clarity. We have downloaded and processed the Figures with the PLOS ONE PACE software and they were approved. If the academic editor considers that they may require additional modifications, we will comply with the directives of the journal.

Specific Comments to the author, Reviewer 2:

1. Is the manuscript technically sound, and does the data support the conclusions?

The response of Reviewer #2 to this question is “Partly”. To add to the weight of evidence, we have provided further quantitative data. The revised manuscript contains a slope and intercept analysis, presented in Figure 6, to provide stronger support for the conclusions presented in Figures 4 and 5. Please see my comments to Reviewer #1 for a more comprehensive explanation.

Specific Comments Reviewer 2:

1. Table 1 is missing. Table 1 was originally sent to the journal as a separate file, and it appears to have not been included in the manuscript package sent to reviewer #2. The summary Table used in the Discussion section of the manuscript has now been incorporated into the revised manuscript. Please accept our apology for this omission!

2. While the article raises excellent points on the variability of RNA extraction and purification methods, it would have been stronger for them to perform all the RNA upstream steps themselves, as inter-lab variability is a common issue even for well-established SOPs.

Based on the literature assessment of extraction efficiency, we agree with your view that inter-lab variability is an issue. In a previous publication [7], we demonstrated that the intra-individual coefficient of variation (CV) for our samples was 5.87% while inter-individual variation was 5.2-fold larger (30.7%). Our samples were extracted by three different individuals over a period of approximately 1 month; however, the variability observed in data sets C9 and C12 is still markedly higher than in the C35 sample group as illustrated in Figures 5 and 6. We believe the variability attributed to inter-lab variability is due to incomplete RNA recovery during RNA extraction as well as during the column purification steps. We addressed this issue in our comments to Reviewer #1.
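For readers who wish to check the arithmetic above, the following is a minimal sketch of a coefficient of variation calculation and of the fold difference between the two reported CV values. The function name and the toy input are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the source does not state which form was used.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: standard deviation divided by the mean, times 100.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical toy measurements, only to show the calculation.
toy_cv = coefficient_of_variation([9.0, 10.0, 11.0])

# CV values reported in the response above (percent).
intra_individual_cv = 5.87
inter_individual_cv = 30.7
fold_difference = inter_individual_cv / intra_individual_cv

print(f"toy CV = {toy_cv:.1f}%")
print(f"inter/intra fold difference = {fold_difference:.1f}")
```

The ratio 30.7 / 5.87 rounds to 5.2, which matches the fold difference quoted in the response.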

3. In the methods, it would be helpful to include the pipeline ran including versions, instead of referring to a previous paper. We have added the additional information relating to our data processing pipeline in the paragraph beginning at line 163 in the revised manuscript. Also note lines 223-229 in the original manuscript.

I have also provided an excerpt from one of our previous reports [31] to provide additional information for the reviewers.

Workflow. Raw data from thirty-five samples in the FASTQ format was submitted for quality control to evaluate adapter contamination, average base quality score per read, the GC content distribution, and other relevant parameters. The software FastQC was used to perform quality control. All samples received passing marks according to FastQC benchmark criteria. FastQC software was further utilized to verify the quality of aligned BAM files. The alignments were generated by the Bowtie2 aligner, a BWT-based unspliced aligner with recently added support for gapped alignments. Bowtie2 extracts multiple substrings or seeds from each read and aligns them using a BWT approach with no gaps, then extends alignments using a Smith-Waterman-like scoring scheme. We used the results from Cufflinks for comparative analysis with the unique counts we extracted using HTSeq and our in-house scripts. The GRCh37.p13 [hg19] reference genome was used in this study.

4. Overall, the points the paper makes are valid. Including total RNA recovered would be useful for evaluating extraction procedures, with the understanding that any experimental samples could have different RNA amounts. We agree that individual samples have different RNA amounts, and this information was reported for our thirty-five samples in our initial report (S1 Table, Ref 7), but sample RNA recovery was not reported with the sequencing files of the other two data sets. We also considered comparing the total number of identified mRNA and lncRNA transcripts among the three data sets. In Figure 2, we demonstrate that the total numbers of recovered transcripts with known size measurements in the C9 and C35 data sets are nearly identical (C9: 13,002 vs C35: 13,077 transcripts); however, the data comparisons in Figures 3, 4 and 5 illustrate major differences between the two data sets. Although a similar number of transcripts were recovered in both data sets, the presumed incomplete recovery of small to midsized transcripts in the C9 data set accounts for the differences observed between the two data sets in Figures 3-6. Therefore, we believe the most crucial point presented in our paper is that the RNA extracted for RNA-seq applications must adhere to the basic principles that form the foundation of RNA sequencing analysis as illustrated in Figures 4 and 6.

Attachment

Submitted filename: Response to Reviewers.docx

PLoS One. doi: 10.1371/journal.pone.0291209.r003

Decision Letter 1

Surya Saha

24 Aug 2023

Whole Blood RNA Extraction Efficiency Contributes to Variability in RNA Sequencing Data Sets

PONE-D-23-05931R1

Dear Dr. Wilfinger,

Congratulations!

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Surya Saha, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: Thanks for addressing the concerns from the first review. This is an interesting study and the results are well written.

Reviewer #2: The authors have addressed my comments adequately and thoroughly. I recommend that they continue to push for better standards in the RNA field so that it's easier to compare between studies and datasets.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0291209.r004

Acceptance letter

Surya Saha

29 Aug 2023

PONE-D-23-05931R1

Whole Blood RNA Extraction Efficiency Contributes to Variability in RNA Sequencing Data Sets

Dear Dr. Wilfinger:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Surya Saha

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File. Distribution of TPM counts across individual samples.

(XLSX)

S2 File. Normalization and statistical analysis.

Summary of the experimental rationale and ANOVA statistical analysis.

(DOCX)

S1 Table. List of NCBI reference genome transcripts, mRNA and lncRNA transcripts with base pair size assignments.

(XLSX)

Data Availability Statement

[pone.0291209.ref001] 1.Liew CC, Ma J, Tang HC, Zheng R, Dempsey AA. The peripheral blood transcriptome dynamically reflects system wide biology: a potential diagnostic tool. J Lab Clin Med. 2006. Mar;147(3):126–32. doi: 10.1016/j.lab.2005.10.005 . [DOI] [PubMed] [Google Scholar]

[pone.0291209.ref002] 2.Savelyeva AV, Kuligina EV, Bariakin DN, Kozlov VV, Ryabchikova EI, Richter VA, et al. Variety of RNAs in Peripheral Blood Cells, Plasma, and Plasma Fractions. Biomed Res Int. 2017; 2017:7404912. 10.1155/2017/7404912. Epub 2017 Jan 3. ; PMCID: PMC5239830. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref003] 3.Chai V, Vassilakos A, Lee Y, Wright JA, Young AH. Optimization of the PAXgene blood RNA extraction system for gene expression analysis of clinical samples. Journal of Clinical Laboratory Analysis. 2005; 19(5):182–188. doi: 10.1002/jcla.20075 ; PMCID: PMC6807908. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref004] 4.Kim JH, Jin HO, Park JA, Chang YH, Hong YJ, Lee JK. Comparison of three different kits for extraction of high-quality RNA from frozen blood. Springerplus. 2014; 3:76. Published 2014 Feb 8. doi: 10.1186/2193-1801-3-76 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref005] 5.Sultan M, Amstislavskiy V, Risch T, et al. Influence of RNA extraction methods and library selection schemes on RNA-seq data. BMC Genomics. 2014;15: 675. doi: 10.1186/1471-2164-15-675 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref006] 6.Liu X, Li Q, Wang X, Zhou X, Liao Q, He X, et al. Comparison of six different pretreatment methods for blood RNA extraction. Biopreserv Biobank. 2015. Feb;13(1):56–60. doi: 10.1089/bio.2014.0090 . [DOI] [PubMed] [Google Scholar]

[pone.0291209.ref007] 7.Chomczynski P, Wilfinger WW, Eghbalnia HR, Kennedy A, Rymaszewski M, Mackey K, Inter-Individual Differences in RNA Levels in Human Peripheral Blood. PLoS ONE 2016; 11(2): e0148260. doi: 10.1371/journal.pone.0148260 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref008] 8.Donohue DE, Gautam A, Miller S-A, Srinivasan S, Abu-Amara D, Campbell R, et al. Gene expression profiling of whole blood: A comparative assessment of RNA-stabilizing collection methods. PLoS ONE 2019; 14(10): e0223065. doi: 10.1371/journal.pone.0223065 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref009] 9.Gautam A, Donohue D, Hoke A, Miller SA, Srinivasan S, Sowe B, et al. Investigating gene expression profiles of whole blood and peripheral blood mononuclear cells using multiple collection and processing methods. PLoS ONE 2019; 14 (12): e0225137. doi: 10.1371/journal.pone.0225137 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref010] 10.Rodríguez A, Duyvejonck H, Van Belleghem JD, Gryp T, Van Simaey L, Vermeulen S, et al. Comparison of procedures for RNA-extraction from peripheral blood mononuclear cells. PLoS One. 2020. Feb 21;15(2): e0229423. doi: 10.1371/journal.pone.0229423 ; PMCID: PMC7034890. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref011] 11.Li S, Chen L, Li J, Liu J. Comparison of different protocols of RNA preparation from circulating blood for RNA sequencing. Biotechnol Lett. 2021. Sep;43(9):1685–1698. Epub 2021 Jun 25. doi: 10.1007/s10529-021-03152-8 . [DOI] [PubMed] [Google Scholar]

[pone.0291209.ref012] 12.Asare AL, Kolchinsky SA, Gao Z, Wang R, Raddassi K, Bourcier K, et al. Differential gene expression profiles are dependent upon method of peripheral blood collection and RNA isolation. BMC Genomics. 2008; 9:474. doi: 10.1186/1471-2164-9-474 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref013] 13.Muyal JP, Muyal V, Kaistha BP, Seifart C, Fehrenbach H. Systematic comparison of RNA extraction techniques from frozen and fresh lung tissues: checkpoint towards gene expression studies. Diagn Pathol. 2009. Mar 24; 4:9. doi: 10.1186/1746-1596-4-9 ; PMCID: PMC2669047. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref014] 14.Menke A, Rex-Haffner M, Klengel T, Binder EB, Mehta D. Peripheral blood gene expression: it all boils down to the RNA collection tubes. BMC Res Notes. 2012. Jan 4; 5:1. doi: 10.1186/1756-0500-5-1 ; PMCID: PMC3280191. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref015] 15.Duale N, Brunborg G, Rønningen KS, Briese T, Aarem J, Aas KK, et al. Human blood RNA stabilization in samples collected and transported for a large biobank. BMC Res Notes. 2012; 5:510. doi: 10.1186/1756-0500-5-510 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref016] 16.Häntzsch M, Tolios A, Beutner F, Nagel D, Thiery J, Teupser D, et al. Comparison of whole blood RNA preservation tubes and novel generation RNA extraction kits for analysis of PC and MiRNA profiles. PLoS One. 2014; 9: e113298. 10.1371/journal.pone.0113298. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref017] 17.Bayatti N, Cooper-Knock J, Bury JJ, Wyles M, Heath PR, Kirby J, et al. Comparison of blood RNA extraction methods used for gene expression profiling in amyotrophic lateral sclerosis. PLoS One. 2014. Jan 27;9(1): e87508. doi: 10.1371/journal.pone.0087508 ; PMCID: PMC3903649. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref018] 18.Meyer A, Paroni F, Günther K, Dharmadhikari G, Ahrens W, Kelm S, et al. Evaluation of Existing Methods for Human Blood mRNA Isolation and Analysis for Large Studies. PLoS One. 2016. Aug 30;11(8): e0161778. doi: 10.1371/journal.pone.0161778 ; PMCID: PMC5004844. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref019] 19.Aarem J, Brunborg G, Aas KK, Harbak K, Taipale MM, Magnus P, et al. Comparison of blood RNA isolation methods from samples stabilized in Tempus tubes and stored at a large human biobank. BMC Res Notes. 2016. Sep 1;9(1):430. doi: 10.1186/s13104-016-2224-y ; PMCID: PMC5009671. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref020] 20.Skogholt AH, Ryeng E, Erlandsen SE, Skorpen F, Schønberg SA, Sætrom P. Gene expression differences between PAXgene and Tempus blood RNA tubes are highly reproducible between independent samples and biobanks. BMC Res Notes. 2017; 10:136. doi: 10.1186/s13104-017-2455-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref021] 21.Yip L., Fuhlbrigge R., Atkinson M.A. et al. Impact of blood collection and processing on peripheral blood gene expression profiling in type 1 diabetes. BMC Genomics 18, 636 (2017). doi: 10.1186/s12864-017-3949-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref022] 22.Shen Y, Li R, Tian F, Chen Z, Lu N, Bai Y, et al. Impact of RNA integrity and blood sample storage conditions on the gene expression analysis. Onco Targets Ther. 2018. Jun 20; 11:3573–3581. doi: 10.2147/OTT.S158868 ; PMCID: PMC6016255. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref023] 23.Richards J., Unger E.R. & Rajeevan M.S. Simultaneous extraction of mRNA and microRNA from whole blood stabilized in tempus tubes. BMC Res Notes 2019; 12, 39. doi: 10.1186/s13104-019-4087-5 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref024] 24.Lamot L., Niemietz I. & Brown K.L. Comparable type I interferon score determination from PAXgene and Tempus whole blood RNA collection and isolation systems. BMC Res Notes. 2019; 12, 511. doi: 10.1186/s13104-019-4562-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref025] 25.Carrillo-Ávila J.A., de la Puente R., Catalina P. et al. Evaluation of RNA purification methods by using different blood stabilization tubes: identification of key features for epidemiological studies. BMC Res Notes 13, 77 (2020). doi: 10.1186/s13104-020-04943-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref026] 26.Lee DE, Lee H, Lee SD, Han HS, Choe JY, Yun S, et al. Comparison of different methods of RNA preparation from peripheral blood for nucleic acid amplification assay. Indian J Med Microbiol. 2018. Jan-Mar;36(1):77–80. doi: 10.4103/ijmm.IJMM_18_104 . [DOI] [PubMed] [Google Scholar]

[pone.0291209.ref027] 27.Kukurba KR, Montgomery SB. RNA Sequencing and Analysis. Cold Spring Harb Protoc. 2015; 11:951–69. doi: 10.1101/pdb.top084970 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref028] 28.Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016; 17:13. doi: 10.1186/s13059-016-0881-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref029] 29.Scholes AN, Lewis JA. Comparison of RNA isolation methods on RNA-Seq: implications for differential expression and meta-analyses. BMC Genomics. 2020; 21:249. doi: 10.1186/s12864-020-6673-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0291209.ref030] 30.Chomczynski P, Wilfinger W, Kennedy A, Rymaszewski M and Mackey K. RNAzol® BD: a reagent for the effective isolation of RNA from whole blood. Nat Methods ii. 2013. 10.1038/nmeth.f.364 [DOI] [Google Scholar]

[pone.0291209.ref031] 31.Eghbalnia HR, Wilfinger WW, Mackey K, Chomczynski P. Coordinated analysis of exon and intron data reveals novel differential gene expression changes. Sci Rep. 2020; 10:15669. doi: 10.1038/s41598-020-72482-w [DOI] [PMC free article] [PubMed] [Google Scholar]

32. Wilfinger WW, Miller R, Eghbalnia HR, Mackey K, Chomczynski P. Strategies for detecting and identifying biological signals amidst the variation commonly found in RNA sequencing data. BMC Genomics. 2021; 22:322. doi: 10.1186/s12864-021-07563-9

33. Mangul S, Yang HT, Strauli N, et al. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biol. 2018; 19:36. doi: 10.1186/s13059-018-1403-7

34. Mo A, Marigorta UM, Arafat D, Chan LHK, Ponder L, Jang SR, et al. Disease-specific regulation of gene expression in a comparative analysis of juvenile idiopathic arthritis and inflammatory bowel disease. Genome Med. 2018; 10:48. doi: 10.1186/s13073-018-0558-x

35. Miller R. RAnGER© (Rank, Analyze, Graph and Evaluate Results), 2021. www.rmillerllc927@gmail.com.

36. Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, et al. Diversity of human copy number variation and multicopy genes. Science. 2010; 330:641–6. doi: 10.1126/science.1197005

37. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013; 41(Database issue):D991–5. Gene Expression Omnibus database, a public functional data repository (https://www.ncbi.nlm.nih.gov/sites/GDSbrowser/). doi: 10.1093/nar/gks1193

38. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012; 22:1775–89. doi: 10.1101/gr.132159.111

39. Ma L, Cao J, Liu L, Du Q, Li Z, Zou D, et al. LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res. 2019; 47:D128. doi: 10.1093/nar/gky960

40. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009; 4:14. doi: 10.1186/1745-6150-4-14

41. Mandelboum S, Manber Z, Elroy-Stein O, Elkon R. Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias. PLoS Biol. 2019; 17:e3000481. doi: 10.1371/journal.pbio.3000481

42. Head SR, Komori HK, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR, et al. Library construction for next-generation sequencing: overviews and challenges. Biotechniques. 2014 Feb 1; 56(2):61–4, 66, 68, passim. doi: 10.2144/000114133; PMCID: PMC4351865.

43. Liu J, Walter E, Stenger D, Thach D. Effects of globin mRNA reduction methods on gene expression profiles from whole blood. J Mol Diagn. 2006 Nov; 8(5):551–8. doi: 10.2353/jmoldx.2006.060021; PMCID: PMC1876175.

44. Raz T, Kapranov P, Lipson D, Letovsky S, Milos P, Thompson J. Protocol Dependence of Sequencing-Based Gene Expression Measurements. PLoS One. 2011; 6:e19287. doi: 10.1371/journal.pone.0019287

45. Jang JS, Berg B, Holicky E, Eckloff B, Mutawe M, Carrasquillo MM, et al. Comparative evaluation for the globin gene depletion methods for mRNA sequencing using the whole blood-derived total RNAs. BMC Genomics. 2020 Dec 11; 21(1):890. doi: 10.1186/s12864-020-07304-4; PMCID: PMC7733259.

46. Bhargava V, Head SR, Ordoukhanian P, Mercola M, Subramaniam S. Technical variations in low-input RNA-seq methodologies. Sci Rep. 2014 Jan 14; 4:3678. doi: 10.1038/srep03678; PMCID: PMC3890974.
