PLOS One. 2025 Apr 21;20(4):e0321631. doi: 10.1371/journal.pone.0321631

Reliable RNA-seq analysis from FFPE specimens as a means to accelerate cancer-related health disparities research

Mitchell J Frederick 1,*, Dannelys Perez-Bello 1, Pedram Yadollahi 1, Patricia Castro 2, Alan Frederick 3, Andrew Frederick 3, Rashid A Osman 4, Fonma Essien 1, Imelda Yebra 1, Ashley Hamlin 1, Thomas J Ow 5,6, Heath D Skinner 7, Vlad C Sandulache 1,8,9,*
Editor: Kazunori Nagasaka 10
PMCID: PMC12011225  PMID: 40258023

Abstract

Whole transcriptome sequencing (WTS/RNA-Seq) is a ubiquitous tool for investigating cancer biology. Reliance on RNA isolated from frozen sources limits the studies that can analyze associations with phenotypes or clinical variables requiring long-term follow-up. Although good correlations are reported in RNA-Seq data from paired frozen and formalin fixed paraffin embedded (FFPE) samples, uncertainties regarding RNA quality, methods of extraction, and data reliability are hurdles to utilization of archival samples. We compared three different platforms for performing RNA-seq using archival FFPE oropharyngeal squamous carcinoma (OPSCC) specimens stored up to 20 years, as part of an investigation of transcriptional profiles related to health disparities. We developed guidelines to purify DNA and RNA from FFPE tissue and perform downstream RNA-seq and DNA SNP arrays. RNA was extracted from 150 specimens, with an average yield of 401.8 ng/cm2 of tissue. Most samples yielded sufficient RNA reads for >13,000 protein coding genes, which could be used to differentiate HPV-associated from HPV-independent OPSCCs. Co-isolated DNA was used to reliably define patient ancestry, which correlated well with patient-reported race. The methods described in this study provide a robust, reliable, and standardized means of DNA and RNA extraction from FFPE tissue, as well as a means to assure the quality of the data generated. Optimized RNA extraction techniques, combined with robust bioinformatic approaches designed to optimize data homogenization, analysis, and biological validation, can revolutionize our ability to transcriptomically profile large solid tumor sets derived from ancestrally varied patient populations.

Introduction

Owing to technical advances in next generation sequencing (NGS) and widespread optimization of protocols and commercial kits for RNA extraction, whole transcriptome RNA expression profiling (i.e., RNA-seq) has become a ubiquitous research tool. Often, RNA-seq is performed using RNA isolated from fresh or snap-frozen samples [1–5]. While such investigations have provided insight into the biology of diseases, the availability of fresh or frozen samples can constrain study designs, create selection bias, and limit research endeavors that require long-term follow-up. This is unfortunate because many research institutions and laboratories have amassed large biobanks of archival specimens collected over decades during routine diagnosis, medical practice, or drug studies, which far outnumber collections of fresh/frozen tissue [6–8]. Therefore, the ability to reliably perform bulk RNA-seq profiling from archival formalin fixed paraffin embedded (FFPE) specimens would greatly expand scientific research [9]. Despite evidence that RNA-seq can successfully be performed using FFPE specimens of varying age [10–12], uncertainties regarding optimal RNA extraction procedures and the variability/reliability of whole transcriptome expression data generated remain the biggest obstacles preventing more widespread acceptance and application [13,14].

One area that would benefit immensely from routine RNA-seq analysis of archival FFPE samples is health disparities research. In addition to socio-economic and cultural differences, there is evidence that disparities in clinical disease and outcome among minority populations could also have a biological basis [15–17]. If so, it is likely that known differences in single nucleotide polymorphisms (SNPs) underpinning ancestral differences at the genomic level could exert their phenotypic impact by influencing RNA expression levels of specific genes [15,18]. Moreover, demographic data collected through medical records can be imperfect [19] or altogether missing, making genomic ancestry a potentially more accurate method [20]. To address this need, we describe optimized procedures to extract both RNA and DNA from FFPE tumor sections, guidelines for performing quality control (QC) of extracted nucleic acid, recommendations for specific RNA-seq and DNA SNP platforms, and suggested approaches for standardized data processing and data QC.

Materials and Methods

Ethics approval and consent to participate

All specimen collection, retrieval and analysis performed followed approval from Baylor College of Medicine and Michael E. DeBakey Veterans Affairs Medical Center (MEDVAMC) Institutional Review Boards. Specimens were accessed for research purposes on 01/08/2023.

Basic steps in our RNA data processing pipeline include: 1) Filtering out non-protein coding genes; 2) Excluding zero count data to calculate a 75th percentile (UQ) read value for each sample; 3) Calculating the cohort median of the upper quartile values (MUQ); 4) Replacing any gene sizes ≤252 bp with a minimum value of 252 (i.e., size of smallest known human gene) to prevent data distortion or a division-by-zero error; 5) Calculating the global cohort gene size median (GMGS); 6) Normalizing each count value by dividing by the sample-specific 75th percentile and the individual gene size (either constant or sample specific); 7) Multiplying the quotient by the product MUQ × GMGS to bring numbers back to the scale of counts; 8) Performing a log2 transformation after adding 0.01 to every value (Log2 [X + 0.01]); 9) Identifying and removing “technical” outlier samples using the average median absolute deviation (MAD) approach coupled with a downstream statistical method to detect MAD outliers; 10) Re-scaling the data so the distribution has a global median of 7.0; and 11) Replacing negative log2 values with zero, to avoid the pitfalls of artificially inflating standard deviations and fold changes associated with very low expression values. Importantly, because the data are rescaled to a median of 7 before replacing low expression values, zero replacements are handled consistently across data with different sequencing depths. Furthermore, it becomes feasible to select a single low expression threshold cut-off (e.g., Log2 ≤ 2) that suffices for every experiment across time and different sequencing depths. The protocol described in this peer-reviewed article is published on protocols.io at dx.doi.org/10.17504/protocols.io.6qpvr9kpzvmk/v1 and is included for printing as supporting information file 1 with this article.
Specific Python code and the supporting Gene_GC.csv input files are freely available from the GitHub repository at https://github.com/Mjfreder/RNA_normalize_qc.
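
The normalization steps listed above (UQ calculation through the log2 transformation) can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function name `uq_normalize`, the genes x samples layout, and the use of a single constant gene size per gene are assumptions made for the example, and filtering to protein-coding genes is assumed to have been done beforehand.

```python
import numpy as np
import pandas as pd

MIN_GENE_SIZE = 252  # minimum gene size (bp) stated in the text


def uq_normalize(counts: pd.DataFrame, gene_size: pd.Series) -> pd.DataFrame:
    """Upper-quartile and gene-size normalization, returned on a log2 scale.

    counts: genes x samples raw counts (protein-coding genes only).
    gene_size: per-gene size in bp (assumed constant across samples here).
    """
    # Floor gene sizes to avoid distortion or division by zero.
    size = gene_size.reindex(counts.index).clip(lower=MIN_GENE_SIZE)

    # Per-sample 75th percentile of the non-zero counts (UQ).
    uq = counts.where(counts > 0).quantile(0.75, axis=0)

    muq = uq.median()     # cohort median of the UQ values (MUQ)
    gmgs = size.median()  # global median gene size (GMGS)

    # Divide by the sample UQ and the gene size, then scale back toward counts.
    norm = counts.div(uq, axis=1).div(size, axis=0) * (muq * gmgs)

    # Log2 transform after adding 0.01 to every value.
    return np.log2(norm + 0.01)
```

With RSEM-style effective lengths that differ by sample, the `gene_size` argument would instead need to be a genes x samples table; that variant is not shown here.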

Results and discussion

Isolation and quality control of RNA from FFPE specimens

Various publications have compared in-house and commercial kit-based procedures for optimally extracting RNA, DNA, or both from FFPE tissue [14,21–25]. Here, rather than re-investigate all the variables, our goal was to provide researchers with a detailed set of guidelines and instructions, all in one place, to successfully purify DNA and RNA from FFPE tissue and to perform downstream analyses like RNA-seq and DNA SNP arrays. Our procedures were extensively vetted and yielded robust results when applied by multiple individuals in our group, regardless of experience level. Although we make no claim that our methods produce superior results, we present them as best practices that could enhance the likelihood of experimental success using FFPE prepared tissue.

We opted to first isolate RNA using the AllPrep DNA/RNA FFPE kit from Qiagen and further process residual DNA pellets using the COBAS kit from Roche, with some modifications to the manufacturers’ instructions. All RNA/DNA isolations began from pre-cut slides of oropharyngeal squamous cell carcinomas (OPSCC) prepared from archival blocks that ranged in storage time from 1 to 20 years. Alternatively, sections could be sliced directly into Eppendorf tubes and stored at room temperature in a desiccator prior to processing. FFPE slide sections were cut to 6 µm thickness; Supplementary Table 1 in S1 Table summarizes the physical properties from 43 FFPE samples processed for RNA isolation and commercially sequenced, along with an additional 150 samples also analyzed for RNA yield but not sequenced. A breakdown of yields for the 43 sequenced specimens and for all 193 samples processed is provided in Supplementary Table 2 in S1 Table. For the 43 sequenced samples, the average total RNA yield was 3629 ng (range of 485 ng to 13,247 ng), average area of tissue per slide was 2 cm2, average number of slides used was ~9, average total area of tissue processed was 18 cm2, and average yield per total area of tissue was 402 ng/cm2 (median = 274 ng/cm2). Most standard commercial RNA sequencing platforms can work with 200–500 ng RNA, albeit the true yields measured by fluorescent methods were usually half that determined by absorbance as discussed further below. Summary sample metrics for the entire cohort of 193 specimens were not fundamentally different from the subset of 43 selected for sequencing (Supplementary Table 2 in S1 Table). We did not find a statistically significant relationship between how long tissue blocks had been stored and the RNA yields per square centimeter of tissue (Fig 1A).

Fig 1. RNA yields.

There was no correlation between RNA yields and relative age of the individual clinical specimens (A,B). C) Laboratory concentrations were approximately 50% of concentrations measured at the commercial vendor. D) Technical outlier detection using the MAD method, showing the average MAD values from the TruSeq cohort, which cluster well except for outlier sample OPSCC_02 (Adj P < 0.0001, Supplementary Table 5 in S1 Table).

We qualitatively and quantitatively compared the reliability of three different library preparations/platforms for performing whole transcriptomic RNA-Seq from archival FFPE specimens (Supplementary Table 1 in S1 Table). RNA extracted from a subset of FFPE specimens was processed through one commercial sequencing company that used either the TruSeq RNA exome kit from Illumina (recommended specifically for FFPE samples) for library preparation or the Illumina TruSeq Stranded kit followed by removal of noncoding RNA with the Ribo-zero kit. Another set of RNA samples was processed through a different vendor for massive analysis of cDNA ends (MACE-seq), which captures 3’ ends of polyadenylated RNA, but these samples were unavoidably stored at 4 °C for several days during transit. However, MACE-seq potentially performs better than other platforms with partially degraded RNA [26,27].

We compared the RNA concentration measured in-house by absorbance to that obtained through fluorescence methods performed commercially (Supplementary Table 3 in S1 Table, Fig 1B, S1 Fig), excluding samples that were improperly stored during transit. On average, in-house concentrations measured by absorbance were about half that obtained through fluorescence techniques (Fig 1B). Not surprisingly for FFPE specimens, RIN values fell below the recommended minimum of 7, ranging from 1.2 to 2.5 (Supplementary Table 3 in S1 Table). A more common metric for fragmented RNA is the DV200 (i.e., percentage of fragments ≥200 bp), which varied widely from 1.48% to 71.47% (Supplementary Table 3 in S1 Table), with a median value of just 18.65%, excluding the samples used for MACE-seq that were stored at warmer temperatures for an extended time.

Recommendations for processing RNA-Seq data

There are many publications examining approaches for processing RNA-Seq data [28–32], each with its merits. Here, we present a user-friendly pipeline (S2 Fig) and Python code that normalizes RNA expression data in a standardized fashion from raw reads, puts gene expression values on the same scale regardless of the sequencing platform or depth of coverage, and controls for inter-sample variability. Furthermore, we incorporate procedures for consistently identifying technical outlier samples and minimizing the influence of low expression or zero values. These steps are not intended to substitute for batch corrections but do allow some gross comparisons and inferences to be made across experiments spanning different platforms. Nonetheless, our normalization methods and their outputs should be compatible with batch adjustment methods like the frequently used ComBat algorithm [33,34]. Popular RNA-seq normalization methods include calculating fragments per kilobase of transcript per million reads (FPKM), or transcripts per million (TPM). We use an upper quartile (UQ) normalization approach known as FPKM-UQ, frequently adopted by The Cancer Genome Atlas (TCGA) project [35–37]. Each of these methods has been used successfully and they all control for gene size, which, while not strictly necessary for analyzing differences in gene expression, should be employed whenever single sample gene set enrichment (ssGSEA) is a downstream analysis [9,38]. The UQ normalization in our pipeline divides gene expression values by a sample specific UQ value (i.e., 75th percentile excluding zeros), rather than a sample total, and has been shown by others to control well for inter-sample variability [39,40].

According to our data processing pipeline shown in S2 Fig, the starting input file should contain metadata (i.e., combined sample data) with individual gene sizes and either raw counts (e.g., HTSeq-Count) or expected counts if RNA-Seq by Expectation Maximization (RSEM) was used. Gene sizes are constant across samples and derived purely through reference annotation when HTSeq-Counts are employed, but for RSEM the effective gene sizes will vary slightly from sample to sample. It’s worth noting that many sequencing cores and companies already provide FPKM and TPM normalizations with the data [41,42]. However, such values should be used cautiously, because these normalizations frequently do not filter out non-protein coding genes in the process. It is not uncommon for the gene expression data provided by sequencing centers to include tens of thousands of non-protein coding genes, with 40,000–60,000 “genes” listed when in fact the true number of protein coding human genes is in the range of 19,000–20,000 (depending on whether immune specific genes are included). For easy reference, we provide our comprehensive list of 19,709 human protein coding genes (Supplementary Table 4 in S1 Table) derived from the HGNC website along with their official symbols, crossmatched to their Entrez gene IDs, Ensembl IDs, and HGNC IDs.

A consequence of our normalization approach is that some downstream software packages for differential analysis of gene expression (DEG) that require raw data input and perform their own internal normalization, such as DESeq2 [43] or edgeR [44], are not appropriate. Packages like limma [45] or voom [46], which accept pre-normalized data for linear modeling, can be used instead. For simpler analyses between groups, we typically combine multiple t-tests with the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR). The main advantage of our data processing pipeline is better control over what is happening during normalization.
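
For the simpler group comparisons mentioned above, per-gene t-tests followed by a Benjamini-Hochberg adjustment might look like the sketch below. The function names are hypothetical, and the choice of Welch's unequal-variance test is an assumption; the text does not specify which t-test variant is used.

```python
import numpy as np
from scipy import stats


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out


def gene_ttests(group_a, group_b):
    """Per-gene Welch t-tests on log2 expression (genes x samples arrays)."""
    # Assumption: unequal variances; the source does not name the variant.
    _, p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    return p, bh_adjust(p)
```

Genes would then be called significant when their adjusted value falls below the chosen FDR threshold.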

Analysis and comparison of RNA-Seq data sets generated from FFPE specimens using different sequencing approaches

The methods described here were developed to aid in comparing RNA-Seq data from multiple platforms, but they are appropriate for routine analysis of RNA-Seq experiments. The total mapped reads or sequencing depth varied greatly between the different library preparations/platforms we examined, with an average of roughly 17.3 million, 6.6 million, and 0.67 million for the TruSeq, TruSeq stranded, and MACE-seq approaches, respectively (Supplementary Table 3 in S1 Table, S3 Fig). We examined the relationship between total mapped reads and reads mapped to protein coding genes (Supplementary Table 3 in S1 Table, S3 Fig) and found best results with MACE-Seq (average = 88%), followed by TruSeq (average = 82%), compared to Stranded/Ribo-zero (average = 55%). Differences in sequencing depth between methods were reflected in the UQ values used during normalization (Supplementary Table 3 in S1 Table).

The distributions of gene expression (i.e., median UQ normalized log2 values) for each platform/cohort, along with the TCGA OPSCC RNA-Seq dataset similarly normalized, are shown in S4 Fig A,C,E, and G before global rescaling. To directly compare the behavior of RNA values it is helpful to re-scale each cohort to have a common global median. However, samples that represent technical outliers should first be removed to avoid data distortion. Outliers were identified using a two-step procedure based on the Median Absolute Deviation (MAD) approach. First, for every gene, MAD values were computed by taking the absolute value of a sample’s gene expression minus the cohort median, and these delta values were then averaged across all genes to calculate a sample specific average MAD value. In the second step, a linear model was fit across all the sample average MAD values to generate residuals, which were normalized to calculate the probability (i.e., p-value) of observing each sample’s average MAD score using the cumulative distribution function with an FDR = 0.01 for multiple testing. Samples OPSCC_02 (TruSeq RNA exome), OPSCC_32 and 33 (Stranded RNA/Ribo-zero), OPSCC_34 and 35 (MACE-Seq) were all identified as outliers (Supplementary Table 5 in S1 Table, FDR < 0.0001, Fig 1C, S5 Fig) and tended to have some of the lowest 75th PCTL raw counts relative to other samples in their respective cohorts.
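
The first step of the outlier procedure (a per-sample average MAD score) is straightforward to sketch. The second step is described only briefly, so the example below swaps in a simple robust z-score (median and scaled MAD of the scores) in place of the linear-model residuals; that substitution, along with the function names, is an assumption for illustration and not the authors' implementation.

```python
import numpy as np
from scipy import stats


def average_mad(log_expr):
    """Mean absolute deviation from the per-gene cohort median, per sample.

    log_expr: genes x samples array of normalized log2 expression.
    """
    gene_median = np.median(log_expr, axis=1, keepdims=True)
    return np.abs(log_expr - gene_median).mean(axis=0)


def outlier_pvalues(scores):
    """Upper-tail normal p-values for robustly standardized scores.

    Assumption: a robust z-score stands in for the normalized residuals
    of the linear model described in the text.
    """
    center = np.median(scores)
    # 1.4826 rescales the MAD to match a normal standard deviation.
    scale = 1.4826 * np.median(np.abs(scores - center))
    z = (scores - center) / scale
    return stats.norm.sf(z)
```

The resulting p-values would still need a multiple-testing adjustment (the text uses FDR = 0.01) before samples are flagged as technical outliers.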

The global medians of each data set (excluding outliers) were then used to re-scale cohorts to obtain distributions with a new and common global median of 7. Essentially, each normalized log2 gene expression value was modified by adding (7 − initial global median) to obtain rescaled or adjusted values, except that gene expression values corresponding to true zeros were not adjusted. Histograms for the newly rescaled cohorts were generated (S4 Fig B,D,F,H). After this global adjustment, low expression values could be treated uniformly across cohorts. To minimize data distortion, noise, and analysis artifacts associated with small values, all negative re-scaled log2 expression values were uniformly set to zero. For downstream applications like DEG that require testing multiple genes and p-value corrections, it is helpful to minimize the number of statistical tests performed or genes analyzed. One advantage of our normalization procedure that rescales all cohorts to the same global median of 7 is that a universal low expression threshold can be set. We recommend filtering out genes where the maximum average log2 expression of any group is ≤2, which typically results in anywhere from 12,000–16,000 genes being tested (e.g., Supplementary Table 5 in S1 Table).
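
A rough sketch of the rescaling and low-expression filter follows. The helper names are hypothetical, and two details are assumptions: true zeros are recognized by their log2(0.01) value, and the global median is computed over expressed (non-zero) values only, since the text does not state which values enter the median.

```python
import numpy as np

TARGET_MEDIAN = 7.0
ZERO_LOG = np.log2(0.01)  # log2 value of a true zero count after adding 0.01


def rescale(log_expr, target=TARGET_MEDIAN):
    """Shift log2 values to a common global median, then floor negatives at zero."""
    is_zero = np.isclose(log_expr, ZERO_LOG)
    # Assumption: the global median is taken over expressed (non-zero) values.
    shift = target - np.median(log_expr[~is_zero])
    # True zeros are left unshifted, as described in the text.
    shifted = np.where(is_zero, log_expr, log_expr + shift)
    return np.maximum(shifted, 0.0)


def keep_expressed(rescaled, groups, threshold=2.0):
    """Boolean mask of genes whose highest group mean exceeds the threshold."""
    groups = np.asarray(groups)
    means = [rescaled[:, groups == g].mean(axis=1) for g in np.unique(groups)]
    return np.max(means, axis=0) > threshold
```

Because every cohort ends up on the same scale, the same `threshold` can be reused across experiments, which is the point the paragraph above makes about a universal cut-off.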

RNA sample quality metrics driving successful sequencing

As mentioned, RNA RIN values are of little use for assessing RNA isolated from FFPE samples. After examining the different sequencing metrics described above, we re-considered which, if any, RNA sample QC measurements might be useful to predict good quality sequencing results. TapeStation readouts for representative samples sequenced with the Stranded library plus Ribo-zero approach (Fig 2) or the TruSeq library (S6 Fig) illustrated several differences among samples.

Fig 2. Relationship between DV200 and biological richness of gene expression.

DV200 values and RNA fragment size are shown for representative samples along with the relative number of usable genes and outlier associated p-values from Supplementary Table 5. * denotes potentially erroneous DV200 reading.

Samples with large DV200 (i.e., percent of RNA molecules ≥200 bp) values, such as OPSCC_25 (DV200 = 63.7%) and OPSCC_31 (DV200 = 47.1%), yielded over 14,000 genes having log2 expression values ≥2 and were not among the outliers. Even samples with considerably lower DV200 values like OPSCC_09 (DV200 = 13.7%) or OPSCC_13 (DV200 = 19.9%) yielded a substantial number of genes expressed above the low threshold cut-off. Outliers like OPSCC_02, OPSCC_32, and OPSCC_33, on the other hand, visibly had some of the lowest DV200 values (<14%). However, several samples like OPSCC_01 and OPSCC_16 also had DV200 values below 10%, but performed reasonably well with over 13,000 genes each expressed above the low threshold. These latter samples had very subtle shoulders visible in their TapeStation curves that distinguished them from the outlier OPSCC_02 sample in the TruSeq cohort. Generally, the samples used in the Stranded/Ribo-zero library were of better quality based on TapeStation curves, with a greater proportion of RNA fragments in the 100 bp range. Collectively, our experience suggests that although a high DV200 percentage is very likely to produce good reads, samples with peaks or shoulders in the 100 bp range may also generate acceptable sequencing data.

Biological Reliability of RNA-seq data from FFPE specimens

We found that wholesale examination of correlations among all protein coding genes between the TCGA cohort and our datasets was not particularly informative and seemingly good correlations existed even when tested against highly unrelated cancer types like pancreatic tumors. Very likely, the majority of genes do not contribute to biological variation and serve as a buffer when included in such analyses. Furthermore, if subsets of genes expected to be more highly expressed in a tissue specific manner were chosen for cross-dataset correlations using cohort medians, results were disappointing. By randomly modeling the cohort makeups, we concluded that this likely stems from the high variability introduced to cohort medians when analyzing small sample sets.

We theorized that a more robust approach for examining the biological fidelity of RNA-Seq datasets would be to look for known or expected patterns of gene co-expression, considering both positive and negative correlations. Clinically, OPSCCs can be divided by the presence or absence of high-risk human papillomavirus (HPV) as a cancer driver, which is reflected in very different biology and treatment outcomes [47–52]. The incidence of HPV-associated OPSCC has been steadily increasing in the last few decades and as of 2021, ~75% of cancers occurring in this anatomical subsite are associated with HPV [6,7]. We previously identified that 51 out of 78 TCGA OPSCC were HPV-associated based on re-analyzing publicly available whole exome sequencing data, and used this to define genes differentially expressed between HPV-associated and HPV-independent OPSCC after normalizing gene expression for the TCGA cohort (Supplementary Table 6 in S1 Table). Next, we selected genes that showed a significant (FDR < 0.1) ≥3-fold up- or downregulation based on HPV status in the OPSCC TCGA cohort (S7 Fig, S6 Table) and filtered out immune genes to focus on differences in tumor biology, leaving 855 differentially regulated genes. Theoretically, these genes should show a similar pattern of expression in our cohorts, based on the assumption that both HPV-associated and HPV-independent tumors are expected. In reality, some amount of data noise is expected in our datasets due to platform differences or smaller cohort sizes. Therefore, we examined the co-correlation of these 855 differentially expressed genes in our largest FFPE cohort, the TruSeq library data set that included 20 samples.

Co-correlation coefficients from the 855 differentially expressed genes calculated using the TruSeq cohort are listed in Supplementary Table 7 in S1 Table and hierarchical clustering was used to identify modules of genes with patterns of co-correlation or anti-correlation (S8 Fig), with directionality of fold change expected from the TCGA cohort annotated. It is worth noting that genes with increased or decreased expression in the TCGA cohort according to HPV status were clearly segregating in the TruSeq FFPE cohort (S8 Fig), which is already indicative that biology was being strongly preserved. We selected genes in clusters 1 and 3 (collectively 598 out of 855 genes) to see if they could predict HPV status of our FFPE specimens. Because we do not have the HPV status of the samples used in our study and the p16 (CDKN2A) immunohistochemistry (IHC) sometimes used as a surrogate was not available for all samples, we relied on CDKN2A mRNA levels. Overexpression of CDKN2A RNA, which also drives p16 protein detection by IHC, results from a negative feedback loop in HPV-associated cancers with diminished Rb and highly active E2F transcription factor. Using the OPSCC TCGA cohort we demonstrated that HPV-associated tumors abundantly overexpress CDKN2A RNA with an average of 34-fold elevation (Supplementary Table 6 in S1 Table) that is highly significant (p < 0.0001, S7 Fig B).

We used the 598 DEGs to cluster samples from all 3 of our FFPE cohorts (Supplementary Tables 8–10 in S1 Table) to predict their HPV status (Figs 3 and 4, S9 Fig) based on the patterns of their gene expression. In the TruSeq cohort (Fig 3), two main gene clusters were found which were enriched or depleted for genes previously determined to be differentially expressed in the TCGA cohort based on HPV status. The segregation of these genes was highly significant (p < 0.00001), indicating strong preservation of tumor biology. This was additionally validated by significant elevation of CDKN2A RNA among the FFPE specimens predicted to be HPV-associated (p < 0.005, Fig 3B), despite the fact that CDKN2A had been purposely removed from the list of genes for clustering. Highly similar results were found for the Stranded/Ribo-zero and MACE-Seq datasets regarding segregation of the HPV-signature genes, although genes associated with HPV-independent tumors formed two subclusters. Predicted differences in CDKN2A RNA similarly validated the HPV status predictions, albeit the p-values were not as robust because of the smaller sample numbers in these cohorts.
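
The clustering and enrichment test described here can be illustrated with a small sketch built on SciPy. Ward linkage and chi-square testing are named in the figure legends; everything else (function names, cutting the tree into exactly two clusters, using unscaled expression values) is an assumption made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import chi2_contingency


def two_clusters(matrix):
    """Ward hierarchical clustering of rows, cut into two clusters (labels 1, 2)."""
    tree = linkage(matrix, method="ward")
    return fcluster(tree, t=2, criterion="maxclust")


def enrichment_pvalue(cluster_labels, up_in_hpv):
    """Chi-square test of gene cluster membership against TCGA direction.

    cluster_labels: cluster (1 or 2) assigned to each gene.
    up_in_hpv: True if the gene was upregulated in HPV-associated TCGA tumors.
    """
    table = np.zeros((2, 2), dtype=int)
    for label, up in zip(cluster_labels, up_in_hpv):
        table[label - 1, int(up)] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value
```

Passing a genes x samples matrix clusters genes, and passing its transpose clusters samples, which together give the two-way clustering shown in the heatmaps.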

Fig 3. Correlation of HPV associated genes in TruSeq cohort.

Expression of HPV-associated DEGs (identified from the TCGA OPSCC reference cohort) was used for unsupervised 2-way clustering of TruSeq samples (i.e., Ward’s agglomerative hierarchical clustering) to predict HPV status (A). Samples predicted to be HPV-associated (i.e., HPV-Pos) are annotated with purple boxes. The regulation status of genes is annotated across the top of the heatmap with black boxes if they were also upregulated in HPV-associated TCGA samples or grey boxes if they were upregulated in HPV-independent (i.e., HPV-Neg) TCGA samples. Significant enrichment of genes upregulated in TCGA HPV-associated cancers was found in gene cluster 1 (red cluster) and enrichment of genes upregulated in HPV-independent TCGA samples is found in gene cluster 2 (blue cluster), demonstrating significant separation (e.g., P < 0.00001 by Chi-square testing). Specimens from sample cluster 1 (purple) predicted to be HPV-associated based on their gene expression pattern had significantly higher CDKN2A (p16) expression (P < 0.0005) than specimens from sample cluster 2 (grey) predicted to be HPV-independent (B), validating predictions.

Fig 4. Correlation of HPV associated genes in the Stranded Cohort.

Expression of HPV-associated DEGs (identified from the TCGA OPSCC reference cohort) was used for unsupervised 2-way clustering of Stranded cohort samples (i.e., Ward’s agglomerative hierarchical clustering) to predict HPV status (A). Samples predicted to be HPV-associated (i.e., HPV-Pos) are annotated with purple boxes. The regulation status of genes is annotated across the top of the heatmap with black boxes if they were also upregulated in HPV-associated TCGA samples or grey boxes if they were upregulated in HPV-independent (i.e., HPV-Neg) TCGA samples. Significant enrichment of genes upregulated in TCGA HPV-associated cancers was found in gene cluster 1 (red cluster) and enrichment of genes upregulated in HPV-independent TCGA samples is found in gene cluster 2 (blue cluster), demonstrating significant separation (e.g., P < 0.00001 by Chi-square testing). Specimens from sample cluster 1 (purple) predicted to be HPV-associated based on their gene expression pattern had higher CDKN2A (p16) expression than specimens from sample cluster 2 (grey) predicted to be HPV-independent (B), which nearly reached significance.

Assessing gene dropout

The phenomenon whereby low sequencing depth randomly leads to genes not being represented is well described for single cell RNA-Seq (scRNA-Seq) experiments [53,54]. We wondered whether gene dropout might also occur in FFPE samples for different biological or technical reasons, such as selective RNA instability. Put in mathematical terms, gene dropout means observing more samples with zero gene expression than expected by chance. We modeled the expected probability of observing samples with zero expression individually for every gene using the existing OPSCC TCGA cohort (N = 71) as a gold standard because the dataset was comparatively large and derived using frozen tissue. Because our cohort of samples was predicted to contain roughly 50% HPV-associated tumors, we used an average of the observed number of zero reads between HPV-associated and HPV-independent samples in the TCGA (Supplementary Table 11 in S1 Table) to derive an estimate for each gene of the probability of getting zero reads. For every gene (excluding variable immunoglobulin or T cell receptor genes), the binomial probability distribution was applied to approximate the individual probability (p-value) and adjusted p-value (for multiple testing) of observing at least as many samples with zero counts as were found in our cohorts. The gene-wise adjusted p-values for all 3 of our RNA-seq cohorts are provided in the supplementary tables (Supplementary Tables 12–14 in S1 Table). To compare gene dropout between sequencing platforms, we plotted the frequency distribution of genes against the log adjusted p-values (Fig 5A). Counting the number of genes with adjusted p-values < 0.01, dropout was highest in the TruSeq cohort (N = 3,100), lowest for the Stranded/Ribo-zero dataset (N = 583), and in between with the MACE-Seq samples (N = 2,181). A Venn diagram identifying overlap of gene dropout is shown in Fig 5B.
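
The binomial dropout test can be written compactly with SciPy. The sketch below assumes the reference zero probability is the unweighted average of the zero fractions in the two HPV groups, as described above; the function names and argument layout are hypothetical.

```python
import numpy as np
from scipy.stats import binom


def zero_probability(zeros_pos, n_pos, zeros_neg, n_neg):
    """Reference probability of a zero count, weighting both HPV groups equally.

    zeros_pos / zeros_neg: per-gene number of reference samples with zero reads.
    n_pos / n_neg: number of HPV-associated / HPV-independent reference samples.
    """
    return 0.5 * (np.asarray(zeros_pos) / n_pos + np.asarray(zeros_neg) / n_neg)


def dropout_pvalue(k_zero, n_samples, p_zero):
    """P(at least k_zero samples with zero counts) under Binomial(n_samples, p_zero)."""
    # sf(k - 1) gives P(X >= k) for a discrete distribution.
    return binom.sf(np.asarray(k_zero) - 1, n_samples, p_zero)
```

The per-gene p-values would then be adjusted for multiple testing before counting genes below the 0.01 cut-off.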

Fig 5. Gene dropout.

Frequency distribution for number of genes versus the −log10 Adj p-values derived from binomial probability calculations (A) demonstrated dropout was highest in the TruSeq cohort (N = 3,100), lowest for the Stranded/Ribo-zero dataset (N = 583), and in between with the MACE-Seq samples (N = 2,181); overlapping dropout genes (B). C) Cohort medians for genes with dropout in the TruSeq or MACE-seq dataset plotted against their medians in the Stranded/Ribo-zero dataset demonstrate a bias towards dropout for low expression genes.

To better understand the study limitations or biological bias potentially introduced from gene dropout, we performed a Gene Ontology Enrichment Analysis on the 283 common genes, and then for each cohort’s dropout after removing these common genes. The 283 genes commonly missing from all cohorts were enriched for genes related to sexual reproduction and actually depleted for genes involved in DNA damage or metabolism (Supplementary Table 15 in S1 Table). As implied by the analysis of common gene dropout, genes associated with metabolism or nucleic acid regulation (e.g., synthesis, repair) were also statistically depleted from the dropout lists for all three sequencing platforms (Supplementary Tables 16–18 in S1 Table). Genes with higher-than-expected loss from both the TruSeq and MACE-Seq cohorts were enriched for neuro-regulation pathways (Supplementary Tables 16–17 in S1 Table), which might be attributed to a higher inclusion of adjacent nerve cells in the TCGA cohort, possibly due to the larger average size of specimens in the latter. Although far fewer in number, genes with dropout from the Stranded/Ribo-zero dataset were enriched for keratinization, which could be attributed to differences in the cohort makeup given the small sample size and high variability of differentiation markers naturally present among OPSCC tumors. To provide insight into why gene dropout differed for samples prepared using the Stranded/Ribo-zero kit, we plotted the cohort medians for genes with dropout in the TruSeq or MACE-seq dataset against their medians in the Stranded/Ribo-zero dataset (Fig 5C). Genes that dropped out had a bias towards lower expression in the Stranded/Ribo-zero dataset and this was most prominent when examining gene dropout in the TruSeq cohort.

Unlike scRNA-seq, where dropout events are randomly distributed among samples, missing genes in the FFPE analysis were consistently absent across all samples within a given platform, though variability was observed between platforms. To better understand the source(s) of these differences, we compared the expected expression of genes well represented in the three platforms to the expected expression of genes with significant dropout using their average expression values derived from the TCGA OPSCC cohort, computed after equally weighting HPV-independent and HPV-associated samples (S10 Fig). Compared to the expected distribution of all genes (S10 Fig A), the 283 genes that commonly dropped out from all platforms and those specifically missing from the TruSeq or Stranded libraries were biased towards low expression (S10 Fig B–D), whereas the 1017 genes specifically absent from the MACE-Seq library preparation were more evenly distributed across all levels of expression (S10 Fig E). Underrepresentation of GC-rich regions has been reported when RNA sequencing is performed from fixed specimens [55], possibly because GC-rich regions can form more stable secondary structures, increasing the frequency of formalin cross-linking. Indeed, genes with dropout were found to have on average significantly higher GC content when compared to genes that were well represented (S10 Fig F and G) among all three library platforms, but the differences were greatest among the common dropout genes (51% GC) and those missing from the Stranded library preparation (51%) compared to well represented genes, which ranged from 46% to 47% GC. However, genes missing from the MACE-Seq library preparation were less impacted by GC content. Collectively, the analysis suggests a bias in dropout towards genes with low expression and higher GC content when RNA is prepared using the TruSeq or Stranded libraries.
However, MACE-Seq captures only the 3’ end of transcripts and relies more heavily on the presence of poly(A) tails, which are susceptible to degradation during fixation [56]. This degradation may explain the distinct pattern of gene dropout observed.

GC-bias among different library platforms

Because GC content was one of the components influencing gene dropout, we examined how it impacted expression more broadly. For each library preparation, we plotted the average log2 gene expression values against the % GC content of the canonical gene transcripts and compared the smoothed curves to average values derived from the TCGA OPSCC cohort as a gold standard. Comparisons were made after stratifying samples according to their known (TCGA) or predicted HPV status. For TCGA OPSCC samples, regardless of HPV status, there was a slight GC-bias with peak gene expression occurring at about 50% GC content, with average expression gradually declining as the % GC content increased further (S11 Fig A and B). Relative to the TCGA reference, the GC-bias (i.e., decline in expression with increasing % GC) was more apparent in samples prepared using the TruSeq library than in samples analyzed from the Stranded or MACE-Seq libraries (S11 Fig A–D), with no observed differences based on predicted HPV status. Multiple bioinformatic approaches exist to correct for GC-bias in RNA-seq data. Therefore, we developed an analytical tool called “GCplotQC” to facilitate identification and quantification of GC-bias among specific samples and to guide decisions about whether these GC correction techniques would be appropriate for experimental data. The program utilizes expression data from each sample along with known GC content for canonical transcripts to overlay plots of expression against % GC for the sample and a reference (S12 Fig), which can be derived either from the average of samples or from a comparator like the TCGA. The program then calculates the area between the two curves wherever the sample curve lies beneath the reference curve (i.e., the deviation), which is displayed on the plot and conveniently aggregated in a master table.
Although not specifically driven by GC-bias, technical outliers identified earlier by our pipeline were easily distinguished from other samples by their much larger deviations (S12 Fig B, D, and F). When examined individually, the average deviations for samples prepared with the TruSeq library were significantly higher than those derived from samples prepared using the Stranded or MACE-Seq library approach (S12 Fig G).
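The deviation described above can be illustrated with a short Python sketch. It is not the GCplotQC source code: binned means stand in for the smoothed curves, and the function names (`binned_curve`, `area_below_reference`), the 5% bin width, and the toy values are all assumptions made for illustration.

```python
from statistics import mean


def binned_curve(gc_percent, log2_expr, bin_width=5.0):
    """Average log2 expression per %GC bin.

    Returns (bin_center, mean_expression) pairs sorted by %GC.
    Binned means are an assumption standing in for a smoothed spline.
    """
    bins = {}
    for gc_value, expr in zip(gc_percent, log2_expr):
        bins.setdefault(int(gc_value // bin_width), []).append(expr)
    return [((i + 0.5) * bin_width, mean(v)) for i, v in sorted(bins.items())]


def area_below_reference(sample_curve, reference_curve):
    """Trapezoidal area of max(reference - sample, 0) over shared bin centers.

    Only regions where the sample curve lies below the reference contribute.
    """
    reference = dict(reference_curve)
    gaps = [(x, max(reference[x] - y, 0.0)) for x, y in sample_curve if x in reference]
    area = 0.0
    for (x0, g0), (x1, g1) in zip(gaps, gaps[1:]):
        area += 0.5 * (g0 + g1) * (x1 - x0)
    return area


# Toy data (hypothetical): eight genes with the same %GC values in both sets.
gc_content = [32, 38, 42, 48, 52, 58, 62, 68]
reference_expr = [6.0, 6.5, 7.0, 7.2, 7.2, 7.0, 6.8, 6.5]
sample_expr = [6.0, 6.5, 7.0, 7.1, 6.8, 6.2, 5.6, 5.0]

reference_curve = binned_curve(gc_content, reference_expr)
sample_curve = binned_curve(gc_content, sample_expr)
delta_area = area_below_reference(sample_curve, reference_curve)
print(f"delta area = {delta_area:.2f}")
```

In this toy example the sample tracks the reference at low % GC and falls away above roughly 50% GC, so all of the area accumulates at the GC-rich end. A sample identical to the reference yields an area of zero.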

Co-extraction of DNA and SNP profiling patient ancestry

Using a modified COBAS protocol, we isolated DNA from pellets left over during RNA extraction for a subset of samples and analyzed it for ancestry using the Illumina Global Diversity SNP array. Our average DNA yield was 245 ng or 24 ng/cm2 of tissue (Supplementary Table 19 in S1 Table), which was sufficient for the SNP platform, where 200 ng is the recommended amount. For comparison, we isolated DNA from additional specimens that were not processed for RNA and ran them on the same SNP array. We found that the leftover DNA extracted during the RNA isolations performed satisfactorily but had slightly diminished SNP call rates, which averaged 83.5% compared to 92.0% for samples where only DNA was isolated (P = 0.05, Fig 6A). In all cases, we were able to use the SNP data generated to predict ancestry (Supplementary Table 20 in S1 Table) using freely available software. Using the K18M model of SNPs, we identified a variety of ethnic groups at various percentages, but 9 out of 14 (i.e., 64%) were of Atlantic/Northern European descent, which was consistent with patient demographics for OPSCC. Two patients were found to be African American, and another was of Western Mediterranean ancestry. In samples from patient #1 (OPSCC_03/OPSCC_26) and patient #8 (e.g., OPSCC_24), we isolated DNA directly or after RNA purification and compared their ancestry predictions. For both patients (Fig 6B, 6C), the predicted ancestries matched, with one showing strong descent from Equatorial Africa and the other a mixed ancestry predominantly of Atlantic/Northern European origin.

Fig 6. Ancestry measurements.

A) SNP call rate was calculated for each DNA extraction approach for each individual sample. B, C) Comparison of ancestry results from two unique samples where DNA was extracted twice using either the COBAS method alone (red) or from residual pellets following RNA isolation in a combined procedure (black).

Conclusions

In the present work we demonstrate the feasibility of utilizing archival FFPE specimens of varying age for whole transcriptome sequencing (i.e., RNA-Seq) to obtain biologically meaningful data. Further, we provide: 1) technical details for isolating both RNA and DNA, 2) insight into the most meaningful quality control metrics, 3) methods for standardizing data analysis, and 4) suggestions for data quality control that increase confidence in the data. Lastly, we show that residual DNA recovered from specimens during RNA isolation is of sufficient quality for SNP array platforms, which can be used to confirm ancestry or racial identity of study participants. We recommend the AllPrep DNA/RNA FFPE kit to isolate RNA because in our hands it performed robustly with respect to yield for nearly 200 specimens and led to high quality downstream RNA-seq reads using three different library prep methods. On average, a total of 18 cm2 of tissue (i.e., fewer than 10 slides) cut at standard thickness (6 µm) generated more than sufficient RNA for the different sequencing options available. If RNA yields are less than 246 ng/cm2 (Supplementary Table 2 in S1 Table, lower 90% confidence interval of mean) based on A260 measurements, then appropriate troubleshooting steps are warranted. FFPE RNA is known to be fragmented, so typical metrics like RIN are not appropriate. DV200 measurements should be interpreted cautiously, because it is still possible to generate reliable RNA-Seq data from samples with a DV200 < 20%. Rather, the shape of the TapeStation curve closer to 100 bp may be a better indicator.

After filtering out non-protein coding genes (e.g., Supplementary Table 4 in S1 Table), our quality control process consists of five main steps that are easily applied, followed by some type of analysis for biological fidelity that by necessity will be context specific. We have incorporated a UQ normalization into our ready-to-use Python code (Supplementary Methods), but other methods could work as well. The percentage of mapped reads belonging to protein coding genes should be examined for each sample, with the caveat that stranded library approaches may lower the values because of the additional requirement that captured fragmented DNA also be from a specific strand. Our method of detecting technical outliers based on MAD residuals works differently than principal component analysis (PCA) and should be applied upstream of the latter [57]. Unlike PCA, this method makes use of information from every gene rather than a few thousand most variable genes. Therefore, the MAD method may have better sensitivity for detecting just a single outlier in larger cohorts, where the most variable genes selected for PCA may be driven largely by biological variability. In fact, it is probably more consistent than PCA when it comes to separating technical from biologic variation. However, the method is not a substitute for PCA, which could be applied afterwards to resolve differences based on biologic variation that may reveal further heterogeneity to be considered when comparing gene expression.
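The general idea of scoring every sample against per-gene cohort medians, using all genes, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline from the Supplementary Methods: the scoring rule (mean absolute deviation from gene medians), the robust z-score with its 0.6745 scaling constant, the cutoff of 3.5, and the names `sample_deviation_scores` and `flag_outliers` are all hypothetical choices.

```python
from statistics import median


def sample_deviation_scores(expr):
    """Mean absolute deviation of each sample from the per-gene cohort medians.

    expr maps sample name -> list of log2 values, with genes in the same order.
    """
    samples = list(expr)
    n_genes = len(expr[samples[0]])
    gene_medians = [median(expr[s][g] for s in samples) for g in range(n_genes)]
    return {
        s: sum(abs(expr[s][g] - gene_medians[g]) for g in range(n_genes)) / n_genes
        for s in samples
    }


def flag_outliers(scores, cutoff=3.5):
    """Return samples whose score has a robust z-score above the cutoff.

    The robust z-score and the cutoff value are assumptions for illustration.
    """
    values = list(scores.values())
    center = median(values)
    spread = median(abs(v - center) for v in values)
    if spread == 0:
        return []
    return [s for s, v in scores.items() if 0.6745 * (v - center) / spread > cutoff]


# Toy cohort (hypothetical): five concordant samples and one degraded sample.
toy_expr = {
    "S1": [7.0, 5.0, 9.0, 3.0],
    "S2": [7.1, 5.2, 8.9, 3.1],
    "S3": [6.9, 4.9, 9.1, 2.8],
    "S4": [7.2, 5.1, 9.0, 3.2],
    "S5": [6.8, 4.8, 9.2, 2.9],
    "S6": [3.0, 1.0, 4.0, 0.0],
}
scores = sample_deviation_scores(toy_expr)
outliers = flag_outliers(scores)
print(outliers)
```

Because every gene contributes to the score, a single sample with globally shifted expression stands out even when no small set of highly variable genes would separate it.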

The number of protein coding genes with log2 RNA expression ≥ 2 should be calculated only after re-adjusting the global median to equal 7. This re-centering allows the same standard threshold for low gene expression to be more consistently applied, regardless of sequencing depth, batch effects, or platforms used. By way of comparison, the OPSCC TCGA cohort samples average 14,565 genes (range: 13,383–15,389) with log2 expression ≥ 2. Typically, samples with too few protein coding genes above the low expression threshold are easily identified as technical outliers. Using the MAD residuals method avoids having to set a precise cut-off value for the number of genes above the threshold to consider acceptable, provided most samples in the cohort express reasonable numbers of genes. For our datasets, excluding outliers, the average number of genes above the threshold of 2 was ~14,000. It is useful to compare the distribution (i.e., histogram) of normalized log2 RNA expression values to a standard like the TCGA data, which follows a quasi-log normal distribution. In our experience, the re-centered RNA-seq data from the TCGA has a similar appearance regardless of the cancer chosen.
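A minimal sketch of the re-centering step, assuming the adjustment is a simple additive shift of each sample's log2 values so that their median equals 7. The helper names `recenter` and `count_expressed` and the toy values are hypothetical, and which genes enter the median is left to the caller.

```python
from statistics import median


def recenter(log2_values, target_median=7.0):
    """Shift log2 values by a constant so that their median equals target_median."""
    shift = target_median - median(log2_values)
    return [v + shift for v in log2_values]


def count_expressed(log2_values, threshold=2.0):
    """Count genes at or above the low-expression threshold."""
    return sum(v >= threshold for v in log2_values)


# Toy example (hypothetical): the same profile at two sequencing depths,
# which on the log2 scale differ only by a constant offset.
shallow = [-4.0, 1.0, 5.0, 6.0, 9.0]
deep = [v + 1.5 for v in shallow]

raw_counts = (count_expressed(shallow), count_expressed(deep))
recentered_counts = (
    count_expressed(recenter(shallow)),
    count_expressed(recenter(deep)),
)
print(raw_counts, recentered_counts)
```

Before re-centering, the two versions of the same profile give different gene counts at the fixed threshold of 2. After re-centering they agree, which is the reason for applying the threshold only to re-centered data.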

In one of the last simple QC steps, we recommend identifying genes with an abnormal degree of sample dropout, if the expected frequency of samples with zero expression can be appropriately modeled with an orthogonal data set. We observed the least amount of gene dropout in the cohort prepared using the Stranded library kit; however, it is likely that the average quality of RNA was better for these specimens, judging by the TapeStation curves. Regardless, for the TruSeq and Stranded datasets the genes lost seemed unrelated to tumor biology and likely represented low expression genes. However, gene dropout in the MACE-Seq cohort was independent of expression level and likely reflects higher instability of RNA at the poly A tail. Nevertheless, genes related to metabolism or nucleic acid processing were depleted from the lists of dropout in all cohorts and therefore were reasonably represented.
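The dropout test can be illustrated with a one-sided binomial tail probability. The sketch below assumes that the expected zero-expression frequency for a gene is taken from a reference cohort and that each sample is an independent trial. The function name `binomial_upper_tail` and the numbers are hypothetical, and the multiple-testing adjustment applied across genes is omitted.

```python
from math import comb, log10


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: samples observed with zero expression for the gene
    n: samples in the cohort
    p: expected zero-expression frequency from the reference data set
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy example (hypothetical): a gene expected to be absent in 5% of samples
# is absent in 10 of 20 FFPE samples.
p_value = binomial_upper_tail(10, 20, 0.05)
print(f"p = {p_value:.3g}, -log10(p) = {-log10(p_value):.1f}")
```

Genes with very small tail probabilities after adjustment for the number of genes tested would be the candidates for abnormal dropout.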

Although our cohort sizes were limited, we detected substantially more GC-bias among samples prepared with the TruSeq library. The GCplotQC tool we developed could be useful to detect samples with unusual GC-bias that may stem from over-fixation. Although our RNA isolation procedure includes an incubation step at elevated temperatures to reverse formalin cross-links, samples that are over-fixed would be predicted to preferentially lose expression of RNA transcripts with higher GC content. Higher GC content is known to stabilize RNA secondary structure, thereby increasing the odds of formalin cross-linking. Compared to the TCGA cohort, which utilized frozen specimens, our FFPE samples did exhibit decreased RNA expression with increasing % GC content. However, the deviations we observed were more uniform across a range of GC content and samples, suggesting a more generalized effect inherent to formalin fixation that was minimally present in samples from the MACE-Seq and Stranded cohorts. The GC-bias present in TruSeq samples was significantly more pronounced. A variety of bioinformatic approaches to correct for GC-bias in RNA-seq data exist [58,59], including methods that perform adjustment during alignment (e.g., BBMap from BBTools), before quantitation (e.g., correctGCBias from deepTools), or after quantitation (e.g., RUVSeq). We recommend using the GCplotQC tool to gauge the extent of GC-bias among the entire cohort and individual samples, preferably using TCGA data as a reference. Individual samples that suffer from over-fixation may stand out as technical outliers using our pipeline or possibly have significantly fewer genes expressed above our recommended log2 threshold of 2.0. Alternatively, cases where the impact of over-fixation is more subtle would likely be apparent as a sharp decline in gene expression at some % GC threshold, which could be visualized through our GCplotQC tool.

If possible, the biological fidelity of new data should be evaluated independent of the scientific questions being explored. For our data set, we leveraged the fact that a substantial subset of OPSCCs are linked to high-risk HPV, which leads to programs of genes that are both up- and downregulated. We then used hierarchical clustering to examine whether these genes, defined by an independent cohort (i.e., TCGA data), behaved similarly in our datasets. This same approach could be applied using just about any set of genes from pathways known to be co-regulated in the tissue under study. Co-regulation or anti-correlations among gene clusters can be examined statistically using simple Chi-square tests. Lastly, we showed that residual DNA, co-isolated during RNA purification, is of sufficient quality for SNP arrays and can be expected to have call rates that average above 80%, which may be sufficient for many applications and gives comparable results to DNA directly isolated from FFPE. Our data, methods, and recommended analytical approaches should provide investigators with helpful guidelines and tools to better leverage archival specimens to answer important biological questions.
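A Chi-square test of this kind can be sketched for a 2 × 2 table that cross-tabulates cluster membership against the direction of regulation in the reference cohort. The counts below are invented for illustration, the function name `chi_square_2x2` is hypothetical, and no continuity correction is applied.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return statistic, erfc(sqrt(statistic / 2))


# Toy counts (hypothetical): rows are two gene clusters, columns are genes
# upregulated in HPV-associated vs HPV-independent reference samples.
statistic, p_value = chi_square_2x2(40, 5, 6, 45)
print(f"chi-square = {statistic:.1f}, p = {p_value:.2g}")
```

A table in which both clusters contain the same mix of up- and downregulated genes gives a statistic of 0 and a p-value of 1, whereas strongly segregated clusters such as the toy counts give a very small p-value.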

Supporting information

S1 Fig. RNA quantitation.

RNA concentrations measured via fluorescence compared to internal laboratory measures using optical density (OD).

(TIF)

pone.0321631.s001.tif (484.3KB, tif)
S2 Fig. Analytical pipeline.

Pipeline for analysis of RNA-seq data inclusive of normalization.

(TIF)

pone.0321631.s002.tif (862.6KB, tif)
S3 Fig. Coding vs non-coding reads.

Quantification of coding vs non-coding reads mapped using the TruSeq and Stranded RNA with Ribo-zero platforms (A) and the MACE RNAseq platform (B). Head-to-head comparison across platforms (C). Sample OPSCC_33, which had the lowest % of protein coding reads (red circle), was also a technical outlier (Supplementary Table 5 in S1 Table).

(TIF)

pone.0321631.s003.tif (1.1MB, tif)
S4 Fig. Distribution of RNA median gene expression.

The distributions of gene expression (i.e., median UQ normalized log 2 values) for each platform/cohort, along with the TCGA OPSCC RNA-Seq dataset similarly normalized (A,C,E,G) before global rescaling. Histograms for the rescaled cohorts were also generated (B,D,F,H).

(TIF)

pone.0321631.s004.tif (1.5MB, tif)
S5 Fig. Average Median Absolute Deviation (MAD).

MAD values were averaged across all genes to calculate a sample specific average MAD value. Samples OPSCC_32 and 33 (Stranded RNA/Ribo-zero), OPSCC_34 and 35 (MACE-Seq) were identified as outliers with significantly different MAD values (Supplementary Table 5 in S1 Table).

(TIF)

pone.0321631.s005.tif (874.6KB, tif)
S6 Fig. Relationship between DV200 and biological richness of gene expression.

DV200 values and RNA fragment size are shown for representative samples along with the relative number of usable genes and associated p-values from outlier analysis (Supplementary Table 5 in S1 Table).

(TIF)

pone.0321631.s006.tif (2.1MB, tif)
S7 Fig. Correlation of HPV-associated genes and CDKN2A expression in TCGA samples.

A) Genes showing a significant (FDR < 0.1), ≥3-fold up- or downregulation according to HPV status in the OPSCC TCGA cohort were identified. B) Confirmation that CDKN2A expression is highly upregulated in HPV-associated (HPV-pos) TCGA samples. ****P < 0.0001.

(TIF)

pone.0321631.s007.tif (743KB, tif)
S8 Fig. Cross-correlation of HPV associated genes.

Cross-correlation coefficients of gene expression values within the TruSeq cohort, using the list of 855 DEGs previously associated with HPV status in the TCGA OPSCC samples, were used for unsupervised clustering to identify modules of genes (black boxes) that behaved similarly. Gene clusters 1 and 3 behaved robustly. Genes are annotated vertically and horizontally by whether they were upregulated (red boxes) or downregulated (grey boxes) in the original TCGA cohort according to HPV status.

(TIF)

pone.0321631.s008.tif (7.2MB, tif)
S9 Fig. Correlation of HPV associated genes in the MACE-Seq cohort.

Expression of HPV-associated DEGs (identified from the TCGA OPSCC reference cohort) was used for unsupervised 2-way clustering of MACE-Seq cohort samples (i.e., Ward’s agglomerative hierarchical clustering) to predict HPV status (A). Samples predicted to be HPV-associated (i.e., HPV-Pos) are annotated with purple boxes. The regulation status of genes is annotated across the top of the heatmap with black boxes if they were also upregulated in HPV-associated TCGA samples or grey boxes if they were upregulated in HPV-independent (i.e., HPV-Neg) TCGA samples. Significant enrichment of genes upregulated in TCGA HPV-associated cancers was found in gene cluster 1 (red cluster) and enrichment of genes upregulated in HPV-independent TCGA samples is found in gene cluster 3 (dark blue cluster), demonstrating significant separation (e.g., P < 0.00001 by Chi-square testing). Specimens from sample cluster 1 (purple) predicted to be HPV-associated based on their gene expression pattern had higher CDKN2A (p16) expression than specimens from sample cluster 2 (grey) predicted to be HPV-independent (B). * P < 0.05.

(TIF)

pone.0321631.s009.tif (2.4MB, tif)
S10 Fig. Sources of gene dropout.

A) Distribution of HPV-associated and HPV-independent weighted averages of RNA expression for all genes using the TCGA OPSCC cohort. B) Distribution of average expression from the TCGA OPSCC cohort for the 283 common dropout genes across all FFPE library platforms. C) Distribution of average expression from the TCGA OPSCC cohort for the 1681 genes that uniquely dropped out from the TruSeq platform. D) Distribution of average expression from the TCGA OPSCC cohort for the 1017 genes that uniquely dropped out from the MACE-Seq platform. E) Distribution of average expression from the TCGA OPSCC cohort for the 248 genes (21 + 267) that uniquely dropped out from the Stranded and/or TruSeq platforms. F) Distribution of GC content among genes that were covered well or dropped out for each platform. G) Statistical comparisons between genes that dropped out and those well covered for each of the platforms or the common 283 dropout genes. The P-values (**** P < 0.00001, *** P < 0.0005) were derived from individual comparisons of averages connected by solid lines after a Tukey’s multiple comparison test.

(TIF)

pone.0321631.s010.tif (1.6MB, tif)
S11 Fig. GC-bias in RNA-Seq data from FFPE samples.

A) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for HPV-independent FFPE samples prepared using the TruSeq (dark blue line) or Stranded (light blue line) libraries compared to HPV-independent TCGA OPSCC samples (black line). B) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-associated and prepared using the TruSeq (dark blue line) or Stranded (light blue line) libraries compared to HPV-associated TCGA OPSCC samples (black line). C) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-independent and prepared using the MACE-Seq approach (red line) compared to HPV-independent TCGA OPSCC samples (black line). D) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-associated and prepared using the MACE-Seq approach (red line) compared to HPV-associated TCGA OPSCC samples (black line).

(TIF)

pone.0321631.s011.tif (907KB, tif)
S12 Fig. Qualitative and quantitative assessment of GC bias using the GCplotQC tool.

Smoothed spline curves of gene expression versus % GC content of canonical transcripts for individual samples processed using the TruSeq library (A,B), the Stranded library (C,D), or the MACE-Seq library (E,F) compared to a weighted average of HPV-associated and HPV-independent OPSCC TCGA samples for reference. The program computes the area between the reference (OPSCC TCGA samples) and individual sample curves wherever the sample curve has lower gene expression, and displays the value as delta area. Samples identified as technical outliers (B, D, and F) in other steps of the pipeline also show dramatic increases in delta area. G) Scatter plot of average sample delta area values shows significantly more deviation in samples prepared with the TruSeq library compared to those prepared with the other two libraries. **** P < 0.00001.

(TIF)

pone.0321631.s012.tif (3.2MB, tif)
S1 Table. Supplemental Tables.

(XLSX)

pone.0321631.s013.xlsx (35MB, xlsx)

Acknowledgments

Not applicable.

Data Availability

All data analyzed in the manuscript are included in the supplementary tables.

Funding Statement

This work was supported by the National Institute of Dental and Craniofacial Research (R21DE032344- VCS), the National Institutes of Health (R01DE0323337, P50 CA097190, R01 DE028061- HDS), the Dan L Duncan Comprehensive Cancer Center (P30-CA125123), the Human Tissue Acquisition & Pathology Core Baylor College of Medicine, a Price Family Foundation Pilot Project to Develop Experimental Cancer Therapeutics for Underrepresented Populations (TJO), awarded by the Montefiore Einstein Comprehensive Cancer Center and an Administrative Supplement to Support Diversity Research awarded under NIH-NCI. 3U54CA274321-02S1 (TJO). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Yu W, Chen Y, Putluri N, Osman A, Coarfa C, Putluri V, et al. Evolution of cisplatin resistance through coordinated metabolic reprogramming of the cellular reductive state. Br J Cancer. 2023;128(11):2013–24. doi: 10.1038/s41416-023-02253-7
  • 2. Kazi MA, Veeramachaneni R, Deng D, Putluri N, Cardinas M, Sikora A, et al. Glutathione peroxidase 2 is a metabolic driver of the tumor immune microenvironment and immune checkpoint inhibitor response. J. Immunotherap. Cancer. 2022.
  • 3. Frederick M, Skinner HD, Kazi SA, Sikora AG, Sandulache VC. High expression of oxidative phosphorylation genes predicts improved survival in squamous cell carcinomas of the head and neck and lung. Sci Rep. 2020;10(1):6380. doi: 10.1038/s41598-020-63448-z
  • 4. Yu W, Chen Y, Dubrulle J, Stossi F, Putluri V, Sreekumar A, et al. Cisplatin generates oxidative stress which is accompanied by rapid shifts in central carbon metabolism. Sci Rep. 2018;8(1):4306. doi: 10.1038/s41598-018-22640-y
  • 5. Jacobsen SB, Tfelt-Hansen J, Smerup MH, Andersen JD, Morling N. Comparison of whole transcriptome sequencing of fresh, frozen, and formalin-fixed, paraffin-embedded cardiac tissue. PLoS One. 2023;18(3):e0283159. doi: 10.1371/journal.pone.0283159
  • 6. Wilde DC, Castro PD, Bera K, Lai S, Madabhushi A, Corredor G, et al. Oropharyngeal cancer outcomes correlate with p16 status, multinucleation and immune infiltration. Mod Pathol. 2022;35(8):1045–54. doi: 10.1038/s41379-022-01024-8
  • 7. Elhalawani H, Mohamed ASR, Elgohari B, Lin TA, Sikora AG, Lai SY, et al. Tobacco exposure as a major modifier of oncologic outcomes in human papillomavirus (HPV) associated oropharyngeal squamous cell carcinoma. BMC Cancer. 2020;20(1):912. doi: 10.1186/s12885-020-07427-7
  • 8. Gao XH, Li J, Gong HF, Yu GY, Liu P, Hao LQ, et al. Comparison of fresh frozen tissue with formalin-fixed paraffin-embedded tissue for mutation analysis using a multi-gene panel in patients with colorectal cancer. Front Oncol. 2020;10:310. doi: 10.3389/fonc.2020.00310
  • 9. Zhao Y, Mehta M, Walton A, Talsania K, Levin Y, Shetty J, et al. Robustness of RNA sequencing on older formalin-fixed paraffin-embedded tissue from high-grade ovarian serous adenocarcinomas. PLoS One. 2019;14(5):e0216050. doi: 10.1371/journal.pone.0216050
  • 10. Choi Y, Kim A, Kim J, Lee J, Lee SY, Kim C. Optimization of RNA extraction from formalin-fixed paraffin-embedded blocks for targeted next-generation sequencing. J Breast Cancer. 2017;20(4):393–9. doi: 10.4048/jbc.2017.20.4.393
  • 11. Cannizzo MD, Wood CE, Hester SD, Wehmas LC. Case study: Targeted RNA-sequencing of aged formalin-fixed paraffin-embedded samples for understanding chemical mode of action. Toxicol Rep. 2022;9:883–94. doi: 10.1016/j.toxrep.2022.04.012
  • 12. Matsunaga H, Arikawa K, Yamazaki M, Wagatsuma R, Ide K, Samuel AZ, et al. Reproducible and sensitive micro-tissue RNA sequencing from formalin-fixed paraffin-embedded tissues for spatial gene expression analysis. Sci Rep. 2022;12(1):19511. doi: 10.1038/s41598-022-23651-6
  • 13. Eikrem O, Beisland C, Hjelle K, Flatberg A, Scherer A, Landolt L, et al. Transcriptome Sequencing (RNAseq) Enables Utilization of Formalin-Fixed, Paraffin-Embedded Biopsies with Clear Cell Renal Cell Carcinoma for Exploration of Disease Biology and Biomarker Development. PLoS One. 2016;11(2):e0149743. doi: 10.1371/journal.pone.0149743
  • 14. Marczyk M, Fu C, Lau R, Du L, Trevarton AJ, Sinn BV, et al. The impact of RNA extraction method on accurate RNA sequencing from formalin-fixed paraffin-embedded tissues. BMC Cancer. 2019;19(1):1189. doi: 10.1186/s12885-019-6363-0
  • 15. Liu Y, Kramer JR, Sandulache VC, Yu R, Li G, Chen L, et al. Immunogenetic determinants of susceptibility to head and neck cancer in the million veteran program cohort. Cancer Res. 2023;83(3):386–97. doi: 10.1158/0008-5472.CAN-22-1641
  • 16. Deshmukh SK, Azim S, Ahmad A, Zubair H, Tyagi N, Srivastava SK, et al. Biological basis of cancer health disparities: resources and challenges for research. Am J Cancer Res. 2017;7(1):1–12.
  • 17. Neagu A-N, Bruno P, Johnson KR, Ballestas G, Darie CC. Biological basis of breast cancer-related disparities in precision oncology era. Int J Mol Sci. 2024;25(7):4113. doi: 10.3390/ijms25074113
  • 18. Bachtiar M, Jin Y, Wang J, Tan TW, Chong SS, Ban KHK, et al. Architecture of population-differentiated polymorphisms in the human genome. PLoS One. 2019;14(10):e0224089. doi: 10.1371/journal.pone.0224089
  • 19. Polubriaginof FCG, Ryan P, Salmasian H, Shapiro AW, Perotte A, Safford MM, et al. Challenges with quality of race and ethnicity data in observational databases. J Am Med Inform Assoc. 2019;26(8–9):730–6. doi: 10.1093/jamia/ocz113
  • 20. Plichta JK, Rushing CN, Lewis HC, Rooney MM, Blazer DG, Thomas SM, et al. Implications of missing data on reported breast cancer mortality. Breast Cancer Res Treat. 2023;197(1):177–87. doi: 10.1007/s10549-022-06764-4
  • 21. Patel PG, Selvarajah S, Guérard K-P, Bartlett JMS, Lapointe J, Berman DM, et al. Reliability and performance of commercial RNA and DNA extraction kits for FFPE tissue cores. PLoS One. 2017;12(6):e0179732. doi: 10.1371/journal.pone.0179732
  • 22. Kresse SH, Namløs HM, Lorenz S, Berner J-M, Myklebost O, Bjerkehagen B, et al. Evaluation of commercial DNA and RNA extraction methods for high-throughput sequencing of FFPE samples. PLoS One. 2018;13(5):e0197456. doi: 10.1371/journal.pone.0197456
  • 23. Sarnecka AK, Nawrat D, Piwowar M, Ligęza J, Swadźba J, Wójcik P. DNA extraction from FFPE tissue samples - a comparison of three procedures. Contemp Oncol (Pozn). 2019;23(1):52–8. doi: 10.5114/wo.2019.83875
  • 24. Ondracek RP, Chen J, Marosy B, Szewczyk S, Medico L, Mohan AS, et al. Results and lessons from dual extraction of DNA and RNA from formalin-fixed paraffin-embedded breast tumor tissues for a large Cancer epidemiologic study. BMC Genomics. 2022;23(1):614. doi: 10.1186/s12864-022-08837-6
  • 25. Atanesyan L, Steenkamer MJ, Horstman A, Moelans CB, Schouten JP, Savola SP. Optimal fixation conditions and DNA extraction methods for MLPA Analysis on FFPE tissue-derived DNA. Am J Clin Pathol. 2017;147(1):60–8. doi: 10.1093/ajcp/aqw205
  • 26. Boneva S, Schlecht A, Böhringer D, Mittelviefhaus H, Reinhard T, Agostini H, et al. 3’ MACE RNA-sequencing allows for transcriptome profiling in human tissue samples after long-term storage. Lab Invest. 2020;100(10):1345–55. doi: 10.1038/s41374-020-0446-z
  • 27. Jang JS, Holicky E, Lau J, McDonough S, Mutawe M, Koster MJ, et al. Application of the 3’ mRNA-Seq using unique molecular identifiers in highly degraded RNA derived from formalin-fixed, paraffin-embedded tissue. BMC Genomics. 2021;22(1):759. doi: 10.1186/s12864-021-08068-1
  • 28. Koch CM, Chiu SF, Akbarpour M, Bharat A, Ridge KM, Bartom ET, et al. A beginner’s guide to analysis of RNA sequencing data. Am J Respir Cell Mol Biol. 2018;59(2):145–57. doi: 10.1165/rcmb.2017-0430TR
  • 29. Tan Z, Chen X, Zuo J, Fu S, Wang H, Wang J. Comprehensive analysis of scRNA-Seq and bulk RNA-Seq reveals dynamic changes in the tumor immune microenvironment of bladder cancer and establishes a prognostic model. J Transl Med. 2023;21(1):223. doi: 10.1186/s12967-023-04056-z
  • 30. Zoabi Y, Shomron N. Processing and analysis of RNA-seq data from public resources. Methods Mol Biol. 2021;2243:81–94. doi: 10.1007/978-1-0716-1103-6_4
  • 31. Deshpande D, Chhugani K, Chang Y, Karlsberg A, Loeffler C, Zhang J, et al. RNA-seq data science: From raw data to effective interpretation. Front Genet. 2023;14:997383. doi: 10.3389/fgene.2023.997383
  • 32. Lee G-Y, Ham S, Lee S-JV. Brief guide to RNA sequencing analysis for nonexperts in bioinformatics. Mol Cells. 2024;47(5):100060. doi: 10.1016/j.mocell.2024.100060
  • 33. Chen AA, Luo C, Chen Y, Shinohara RT, Shou H; Alzheimer’s Disease Neuroimaging Initiative. Privacy-preserving harmonization via distributed ComBat. Neuroimage. 2022;248:118822. doi: 10.1016/j.neuroimage.2021.118822
  • 34. Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform. 2020;2(3):lqaa078. doi: 10.1093/nargab/lqaa078
  • 35. Lehrer S, Rheinstein PH. EARS2 significantly coexpresses with PALB2 in breast and pancreatic cancer. Cancer Treat Res Commun. 2022;32:100595. doi: 10.1016/j.ctarc.2022.100595
  • 36. Padegal G, Rao MK, Boggaram Ravishankar OA, Acharya S, Athri P, Srinivasa G. Analysis of RNA-Seq data using self-supervised learning for vital status prediction of colorectal cancer patients. BMC Bioinformatics. 2023;24(1):241. doi: 10.1186/s12859-023-05347-4
  • 37. Molania R, Foroutan M, Gagnon-Bartsch JA, Gandolfo LC, Jain A, Sinha A, et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat Biotechnol. 2023;41(1):82–95. doi: 10.1038/s41587-022-01440-w
  • 38. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7. doi: 10.1186/1471-2105-14-7
  • 39.Evans C, Hardin J, Stoebel DM. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief Bioinform. 2018;19(5):776–92. doi: 10.1093/bib/bbx008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Xia Y. Statistical normalization methods in microbiome data with application to microbiome cancer research. Gut Microbes. 2023;15(2):2244139. doi: 10.1080/19490976.2023.2244139 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Zhao Y, Li M-C, Konaté MM, Chen L, Das B, Karlovich C, et al. TPM, FPKM, or normalized counts? a comparative study of quantification measures for the analysis of RNA-seq Data from the NCI patient-derived models repository. J Transl Med. 2021;19(1):269. doi: 10.1186/s12967-021-02936-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ilgisonis EV, Ponomarenko EA, Tarbeeva SN, Lisitsa AV, Zgoda VG, Radko SP, et al. Gene-centric coverage of the human liver transcriptome: QPCR, Illumina, and Oxford Nanopore RNA-Seq. Front Mol Biosci. 2022;9:944639. doi: 10.3389/fmolb.2022.944639 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. doi: 10.1093/bioinformatics/btp616 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi: 10.1093/nar/gkv007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29. doi: 10.1186/gb-2014-15-2-r29 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Dahlstrom KR, Calzada G, Hanby JD, Garden AS, Glisson BS, Li G, et al. An evolution in demographics, treatment, and outcomes of oropharyngeal cancer at a major cancer center: a staging system in need of repair. Cancer. 2013;119(1):81–9. doi: 10.1002/cncr.27727 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Fakhry C, Blackford AL, Neuner G, Xiao W, Jiang B, Agrawal A, et al. Association of oral human papillomavirus dna persistence with cancer progression after primary treatment for oral cavity and oropharyngeal squamous cell carcinoma. JAMA Oncol. 2019;5(7):985–92. doi: 10.1001/jamaoncol.2019.0439 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Fakhry C, Zhang Q, Gillison ML, Nguyen-Tân PF, Rosenthal DI, Weber RS, et al. Validation of NRG oncology/RTOG-0129 risk groups for HPV-positive and HPV-negative oropharyngeal squamous cell cancer: Implications for risk-based therapeutic intensity trials. Cancer. 2019;125(12):2027–38. doi: 10.1002/cncr.32025 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Gleber-Netto FO, Rao X, Guo T, Xi Y, Gao M, Shen L, et al. Variations in HPV function are associated with survival in squamous cell carcinoma. JCI Insight. 2019;4(1):e124762. doi: 10.1172/jci.insight.124762 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Masterson L, Lechner M, Loewenbein S, Mohammed H, Davies-Husband C, Fenton T, et al. CD8+ T cell response to human papillomavirus 16 E7 is able to predict survival outcome in oropharyngeal cancer. Eur J Cancer. 2016;67:141–51. doi: 10.1016/j.ejca.2016.08.012 [DOI] [PubMed] [Google Scholar]
  • 52.Welters MJP, Ma W, Santegoets SJAM, Goedemans R, Ehsan I, Jordanova ES, et al. Intratumoral HPV16-Specific T Cells Constitute a Type I-oriented tumor microenvironment to improve survival in HPV16-driven oropharyngeal cancer. Clin Cancer Res. 2018;24(3):634–47. doi: 10.1158/1078-0432.CCR-17-2140 [DOI] [PubMed] [Google Scholar]
  • 53.Saliba A-E, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res. 2014;42(14):8845–60. doi: 10.1093/nar/gku555 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 2018;50(8):1–14. doi: 10.1038/s12276-018-0071-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Esteve-Codina A, Arpi O, Martinez-García M, Pineda E, Mallo M, Gut M, et al. A Comparison of RNA-seq results from paired formalin-fixed paraffin-embedded and fresh-frozen glioblastoma tissue samples. PLoS One. 2017;12(1):e0170632. doi: 10.1371/journal.pone.0170632 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Nardon E, Donada M, Bonin S, Dotti I, Stanta G. Higher random oligo concentration improves reverse transcription yield of cDNA from bioptic tissues and quantitative RT-PCR reliability. Exp Mol Pathol. 2009;87(2):146–51. doi: 10.1016/j.yexmp.2009.07.005 [DOI] [PubMed] [Google Scholar]
  • 57.Chen X, Zhang B, Wang T, Bonni A, Zhao G. Robust principal component analysis for accurate outlier sample detection in RNA-Seq data. BMC Bioinformatics. 2020;21(1):269. doi: 10.1186/s12859-020-03608-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;12:480. doi: 10.1186/1471-2105-12-480 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Zheng W, Chung LM, Zhao H. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics. 2011;12:290. doi: 10.1186/1471-2105-12-290 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Kazunori Nagasaka

13 Jan 2025

PONE-D-24-53608

Reliable RNA-seq analysis from FFPE specimens as a means to accelerate cancer-related health disparities research

PLOS ONE

Dear Dr. Sandulache,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Feb 27 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1.Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Please note that funding information should not appear in any section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript.

4. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections does not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

5. Thank you for stating the following financial disclosure:

 “VCS- R21DE032344

HDS- R01DE0323337; R01 DE028061”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

6. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

7. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide any URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data.

8. We note you have not yet provided a protocols.io PDF version of your protocol and/or a protocols.io DOI. When you submit your revision, please provide a PDF version of your protocol as generated by protocols.io (the file will have the protocols.io logo in the upper right corner of the first page) as a Supporting Information file. The filename should be S1_file.pdf, and you should enter “S1 File” into the Description field. Any additional protocols should be numbered S2, S3, and so on. Please also follow the instructions for Supporting Information captions [https://journals.plos.org/plosone/s/supporting-information#loc-captions]. The title in the caption should read: “Step-by-step protocol, also available on protocols.io.”

Please assign your protocol a protocols.io DOI, if you have not already done so, and include the following line in the Materials and Methods section of your manuscript: “The protocol described in this peer-reviewed article is published on protocols.io (https://dx.doi.org/10.17504/protocols.io.[...]) and is included for printing purposes as S1 File.” You should also supply the DOI in the Protocols.io DOI field of the submission form when you submit your revision.

If you have not yet uploaded your protocol to protocols.io, you are invited to use the platform’s protocol entry service [https://www.protocols.io/we-enter-protocols] for doing so, at no charge. Through this service, the team at protocols.io will enter your protocol for you and format it in a way that takes advantage of the platform’s features. When submitting your protocol to the protocol entry service please include the customer code PLOS2022 in the Note field and indicate that your protocol is associated with a PLOS ONE Lab Protocol Submission. You should also include the title and manuscript number of your PLOS ONE submission.

9. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Authors,

Thank you for submitting your manuscript to PLOS One.

Based on the comments from our reviewers, our decision is "Minor Revision." Please revise the manuscript accordingly, and we look forward to receiving your revised submission.

Sincerely,

Kazunori Nagasaka

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

To answer this question, please click the link to protocols.io in the Materials and Methods section of the manuscript (if a link has been provided) or consult the step-by-step protocol in the Supporting Information files.

The step-by-step protocol should contain sufficient detail for another researcher to be able to reproduce all experiments and analyses.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Does the protocol describe a validated method?

The manuscript must demonstrate that the protocol achieves its intended purpose: either by containing appropriate validation data, or referencing at least one original research article in which the protocol was used to generate data.

Reviewer #1: Yes

Reviewer #2: No

**********

4. If the manuscript contains new data, have the authors made this data fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the article presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please highlight any specific errors that need correcting in the box below.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this submitted protocol, the authors optimized guidelines for DNA and RNA purification from FFPE oropharynx carcinoma tissue, followed by RNA-seq and DNA SNP analysis.

Introduction

The rationale for establishing the protocol and the given requirements are properly described in the Introduction.

Results

The provided protocol, with the details included in supplementary documents, fills a gap. The presented results are well summarized and logically described. A correct and critical comparison to TCGA data is provided and discussed, and the reasons for cross-dataset correlation failures are explained.

Discussion

The Discussion highlights the main points in this lab method process and provides suggestions.

This submitted material is a great contribution to the research in the corresponding field, but also valuable for interdisciplinary use.

Reviewer #2: Large collections of FFPE archival biopsies from hospitals and research institutes are valuable samples for generating transcriptome and exome data. The authors have provided a detailed protocol to isolate RNA and DNA from FFPE samples, generate RNA-seq and SNP array data, and perform bioinformatics to remove outlier samples and normalize expression levels between samples. Although they do not develop any new methods, the authors provide an end-to-end protocol from RNA/DNA isolation to bioinformatics analysis by combining existing methods with some modifications. The protocol may be useful to researchers without prior experience working with FFPE samples.

A major deficiency of the described bioinformatics method is that it does not take into consideration the impact of the chemistry of formalin fixation on the sequences. It is well known that formalin fixation primarily causes modification of amino groups on the four bases (A, C, G, U) with differing probabilities and can impact the GC content of the sequenced reads. Also, the decreased gene coverage in RNA-seq (described as dropouts by the authors) is not completely random and is very different from the dropouts observed in single-cell RNA-seq. Thus, QC checks specific to FFPE, especially for formalin overfixation, are essential. The method described by the authors may work when the majority of the samples produce good-quality data and a small proportion of outlier samples need to be identified, but not when a large proportion of samples have QC issues due to the fixation protocol or FFPE storage. Please include additional QC steps that are specific to FFPE samples and steps to identify formalin overfixation.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes:  Jozsef Dudas

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Apr 21;20(4):e0321631. doi: 10.1371/journal.pone.0321631.r003

Author response to Decision Letter 1


17 Feb 2025

Response to reviewer comments:

Reviewer #1: This submitted material is a great contribution to the research in the corresponding field, but also valuable for interdisciplinary use.

Response: Thank you.

Reviewer #2: A major deficiency of the described bioinformatics method is that it does not take into consideration the impact of the chemistry of formalin fixation on the sequences. It is well known that formalin fixation primarily causes modification of amino groups on the four bases (A, C, G, U) with differing probabilities and can impact the GC content of the sequenced reads. Also, the decreased gene coverage in RNA-seq (described as dropouts by the authors) is not completely random and is very different from the dropouts observed in single-cell RNA-seq. Thus, QC checks specific to FFPE, especially for formalin overfixation, are essential. The method described by the authors may work when the majority of the samples produce good-quality data and a small proportion of outlier samples need to be identified, but not when a large proportion of samples have QC issues due to the fixation protocol or FFPE storage. Please include additional QC steps that are specific to FFPE samples and steps to identify formalin overfixation.

Response: We appreciate the astute observation and suggestions and agree completely. We now include a newly created tool called “GCplotQC”, developed to analyze GC bias in individual samples and to quantitatively compare deviation from a reference standard such as TCGA data or the cohort average. The tool generates smoothed curves from plots of %GC content versus gene expression and calculates the area of deviation compared to a designated standard reference. The area of deviation for all samples is then written to a table for easy reference, while the curves for individual samples are stored as PNG images and aggregated as a PDF file. This can be used to gauge overall GC bias across the entire cohort or to identify individual samples that have steeper GC bias, presumably stemming from overfixation. We have now added two paragraphs to the Results section that describe the impact of GC bias in our study, how that relates to gene dropout, and the differences we observed between the three library preparations used. These results are illustrated in three new supplementary figures (Supplementary Figures 10, 11, and 12). We also added a paragraph to the Discussion section that outlines the utility of the QC tool and references bioinformatic techniques developed by others that could be applied to correct for GC bias at various steps in the analysis.
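The area-of-deviation idea described in this response can be illustrated with a minimal sketch. This is not the GCplotQC code: the function names (`binned_curve`, `deviation_area`), the use of a binned mean as a stand-in for the tool's smoother, the bin edges, and the toy values are all assumptions made for illustration only.

```python
from statistics import mean

def binned_curve(gc_percent, log_expr, bin_edges):
    """Crude smoothing: mean log-expression of the genes in each %GC bin.

    Returns None for empty bins. A binned mean is only a stand-in for
    whatever smoother the real tool applies (assumption).
    """
    curve = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        vals = [e for g, e in zip(gc_percent, log_expr) if lo <= g < hi]
        curve.append(mean(vals) if vals else None)
    return curve

def deviation_area(sample_curve, reference_curve, bin_edges):
    """Trapezoidal area between two binned curves, with bin midpoints as x."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    pts = [(x, abs(s - r))
           for x, s, r in zip(mids, sample_curve, reference_curve)
           if s is not None and r is not None]
    return sum((x1 - x0) * (d0 + d1) / 2
               for (x0, d0), (x1, d1) in zip(pts, pts[1:]))

# Toy data (hypothetical): seven genes with known %GC, two samples.
edges = [30, 40, 50, 60, 70]
gc = [35, 40, 45, 50, 55, 60, 65]
reference = binned_curve(gc, [5.0] * 7, edges)  # flat reference curve
samples = {
    "sample_flat": [5.0] * 7,
    "sample_steep": [5.0, 5.0, 5.0, 4.0, 4.0, 3.0, 3.0],  # expression falls with GC
}
table = {name: deviation_area(binned_curve(gc, expr, edges), reference, edges)
         for name, expr in samples.items()}

# Largest deviation first, as a simple per-sample QC table.
for name, area in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{area:.1f}")
```

In this sketch a sample whose expression drops as GC content rises accumulates a larger area than one that tracks the reference, which is the kind of ranking the response describes for flagging samples with steeper GC bias.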

To answer the reviewer’s questions more specifically, we agree that the zero dropout we observed did not arise through the same mechanisms as in single-cell RNA-seq, because the latter affects samples randomly whereas our zero dropouts occurred uniformly across samples. This is now explained better in our modified Results section. We find that zero dropout in our cohorts was driven by both low expected gene expression and, to some degree, GC content. However, in our case the GC bias was more likely systemic and uniform across samples within a cohort, which likely reflects a general limitation of formalin fixation. Indeed, this is why our sample processing protocol includes an elevated-temperature incubation to reverse and mitigate the effects of fixation. We report that the GC bias was much more pronounced for the TruSeq samples than for those prepared with the Stranded or MACE-Seq libraries. As for how much GC bias is too much before an analytical correction should be applied, we leave that to investigators to determine for themselves. Our tool allows quantitation and visualization of the effect (i.e., QC), which can inform the decision whether to use one of the many GC correction pipelines already available. The use of a reference standard such as TCGA data should help detect whether some samples are overfixed or the entire cohort is compromised, and our methods employ a quantitative metric derived from the area between a sample’s curve and the curve generated from a chosen standard. We are very grateful for the reviewer’s challenging question and believe that addressing the concern has significantly improved our manuscript and the utility of our published pipeline.
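The distinction drawn above between dropout that is uniform across samples and dropout that hits samples sporadically can be checked with a simple per-gene tally. The sketch below is illustrative only and is not part of the published pipeline; the function names, the 0.9 cutoff, and the toy counts are hypothetical.

```python
def zero_fraction_per_gene(counts):
    """Fraction of samples with a zero count, per gene.

    `counts` maps gene name -> list of raw counts (one per sample);
    every list is assumed to be non-empty.
    """
    return {gene: sum(1 for c in row if c == 0) / len(row)
            for gene, row in counts.items()}

def classify_dropout(counts, uniform_cutoff=0.9):
    """Label each gene by how consistently it drops out across samples.

    uniform_cutoff is a hypothetical threshold: genes with zeros in at
    least this fraction of samples are called 'uniform' dropouts.
    """
    labels = {}
    for gene, frac in zero_fraction_per_gene(counts).items():
        if frac >= uniform_cutoff:
            labels[gene] = "uniform"
        elif frac > 0:
            labels[gene] = "sporadic"
        else:
            labels[gene] = "detected"
    return labels

# Toy count matrix (hypothetical): three genes across four samples.
counts = {
    "GENE_A": [0, 0, 0, 0],      # absent everywhere
    "GENE_B": [12, 0, 7, 9],     # missing in one sample only
    "GENE_C": [30, 25, 41, 28],  # detected in all samples
}
labels = classify_dropout(counts)
print(labels)
```

A cohort in which most zero-count genes fall in the "uniform" class would be consistent with the systemic dropout described in the response, whereas a large "sporadic" class would look more like the random pattern seen in single-cell data.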

Decision Letter 1

Kazunori Nagasaka

10 Mar 2025

Reliable RNA-seq analysis from FFPE specimens as a means to accelerate cancer-related health disparities research

PONE-D-24-53608R1

Dear Dr. Sandulache,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Authors,

Thank you very much for submitting your revised manuscript. I am pleased to inform you that, after careful consideration, our expert reviewers have recommended your paper for acceptance in PLOS ONE.

We believe your research makes a clear and valuable contribution to the clinical field, and we anticipate that your findings will be widely recognized and cited in future studies.

Congratulations once again!

We look forward to seeing your published work and hope you continue your excellent research.

Sincerely,

Kazunori Nagasaka

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

Reviewer #2: Yes

**********

3. Does the protocol describe a validated method?

Reviewer #2: Yes

**********

4. If the manuscript contains new data, have the authors made this data fully available?

Reviewer #2: Yes

**********

5. Is the article presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #2: Authors have updated the manuscript to address the GC bias caused by formalin fixation. Although the reason for the GC bias is not adequately or accurately described, the authors describe it more as observations and should be adequate for the protocol reporting. Authors have provided code and ways to potentially detect or minimize GC bias.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Kazunori Nagasaka

PONE-D-24-53608R1

PLOS ONE

Dear Dr. Sandulache,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Kazunori Nagasaka

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. RNA quantitation.

    RNA concentrations measured via fluorescence compared to internal laboratory measures using optical density (OD).

    (TIF)

    pone.0321631.s001.tif (484.3KB, tif)
    S2 Fig. Analytical pipeline.

    Pipeline for analysis of RNA-seq data inclusive of normalization.

    (TIF)

    pone.0321631.s002.tif (862.6KB, tif)
    S3 Fig. Coding vs non-coding reads.

    Quantification of coding vs non-coding reads mapped using the TruSeq and Stranded RNA with Ribo-zero platforms (A) and the MACE RNAseq platform (B). Head-to-head comparison across platforms (C). Sample OPSCC_33, which had the lowest % of protein coding reads (red circle), was also a technical outlier (Supplementary Table 5 in S1 Table).

    (TIF)

    pone.0321631.s003.tif (1.1MB, tif)
    S4 Fig. Distribution of RNA median gene expression.

    Distributions of gene expression (i.e., median UQ normalized log2 values) are shown for each platform/cohort, along with the similarly normalized TCGA OPSCC RNA-Seq dataset, before global rescaling (A,C,E,G). Histograms for the rescaled cohorts were also generated (B,D,F,H).

    (TIF)

    pone.0321631.s004.tif (1.5MB, tif)
    S5 Fig. Average Median Absolute Deviation (MAD).

    MAD values were averaged across all genes to calculate a sample-specific average MAD value. Samples OPSCC_32 and 33 (Stranded RNA/Ribo-zero) and OPSCC_34 and 35 (MACE-Seq) were identified as outliers with significantly different MAD values (Supplementary Table 5 in S1 Table).

    (TIF)

    pone.0321631.s005.tif (874.6KB, tif)
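
The per-sample average described for S5 Fig can be illustrated with a minimal sketch. It reflects one possible reading of the caption, not the authors' pipeline: for each gene, take the median across samples, record each sample's absolute deviation from that median, then average those deviations across genes. The function name `average_abs_deviation`, the nested-dictionary input layout, and the toy values are all hypothetical.

```python
from statistics import mean, median


def average_abs_deviation(expr):
    """Return {sample: mean absolute deviation from per-gene medians}.

    expr maps gene -> {sample: log2 expression}. This layout is an
    assumption made for illustration only.
    """
    deviations = {}
    for gene, values in expr.items():
        # Median of this gene across all samples that report it
        gene_median = median(values.values())
        for sample, value in values.items():
            deviations.setdefault(sample, []).append(abs(value - gene_median))
    # Average the per-gene deviations to get one number per sample
    return {sample: mean(devs) for sample, devs in deviations.items()}


# Toy data (hypothetical): sample "C" deviates strongly for gene G2
expr = {
    "G1": {"A": 1.0, "B": 2.0, "C": 3.0},
    "G2": {"A": 4.0, "B": 4.0, "C": 10.0},
}
scores = average_abs_deviation(expr)
```

A sample whose score sits far from the rest of its cohort would be a candidate technical outlier; the statistical test used to call outliers is not specified here and is left out of the sketch.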
    S6 Fig. Relationship between DV200 and biological richness of gene expression.

    DV200 values and RNA fragment size are shown for representative samples along with the relative number of usable genes and associated p-values from outlier analysis (Supplementary Table 5 in S1 Table).

    (TIF)

    pone.0321631.s006.tif (2.1MB, tif)
    S7 Fig. Correlation of HPV-associated genes and CDKN2A expression in TCGA samples.

    A) Genes that were significantly (FDR < 0.1) up- or downregulated by ≥3-fold based on HPV status in the OPSCC TCGA cohort were identified. B) Confirmation that CDKN2A expression is highly upregulated in HPV-associated (HPV-pos) TCGA samples. ****P < 0.0001.

    (TIF)

    pone.0321631.s007.tif (743KB, tif)
    S8 Fig. Cross-correlation of HPV associated genes.

    Cross-correlation coefficients of gene expression values within the TruSeq cohort, using the list of 855 DEGs previously associated with HPV status in the TCGA OPSCC samples, were used for unsupervised clustering to identify modules of genes (black boxes) that behaved similarly. Gene clusters 1 and 3 behaved robustly. Genes are annotated vertically and horizontally by whether they were upregulated (red boxes) or downregulated (grey boxes) in the original TCGA cohort according to HPV status.

    (TIF)

    pone.0321631.s008.tif (7.2MB, tif)
    S9 Fig. Correlation of HPV associated genes in the MACE-Seq cohort.

    Expression of HPV-associated DEGs (identified from the TCGA OPSCC reference cohort) was used for unsupervised 2-way clustering of MACE-Seq cohort samples (i.e., Ward’s agglomerative hierarchical clustering) to predict HPV status (A). Samples predicted to be HPV-associated (i.e., HPV-Pos) are annotated with purple boxes. The regulation status of genes is annotated across the top of the heatmap with black boxes if they were also upregulated in HPV-associated TCGA samples or grey boxes if they were upregulated in HPV-independent (i.e., HPV-Neg) TCGA samples. Significant enrichment of genes upregulated in TCGA HPV-associated cancers was found in gene cluster 1 (red cluster) and enrichment of genes upregulated in HPV-independent TCGA samples is found in gene cluster 3 (dark blue cluster), demonstrating significant separation (e.g., P < 0.00001 by Chi-square testing). Specimens from sample cluster 1 (purple) predicted to be HPV-associated based on their gene expression pattern had higher CDKN2A (p16) expression than specimens from sample cluster 2 (grey) predicted to be HPV-independent (B). * P < 0.05.

    (TIF)

    pone.0321631.s009.tif (2.4MB, tif)
    S10 Fig. Sources of gene dropout.

    A) Distribution of HPV-associated and HPV-independent weighted averages of RNA expression for all genes using the TCGA OPSCC cohort. B) Distribution of average expression from the TCGA OPSCC cohort for the 283 common dropout genes across all FFPE library platforms. C) Distribution of average expression from the TCGA OPSCC cohort for the 1681 genes that uniquely dropped out from the TruSeq platform. D) Distribution of average expression from the TCGA OPSCC cohort for the 1017 genes that uniquely dropped out from the MACE-Seq platform. E) Distribution of average expression from the TCGA OPSCC cohort for the 248 genes (21 + 267) that uniquely dropped out from the Stranded and/or TruSeq platforms. F) Distribution of GC content among genes that were covered well or dropped out for each platform. G) Statistical comparisons between genes that dropped out and those well covered for each of the platforms or the common 283 dropout genes. The P-values (**** P < 0.00001, *** P < 0.0005) were derived from individual comparisons of averages connected by solid lines after a Tukey’s multiple comparison test.

    (TIF)

    pone.0321631.s010.tif (1.6MB, tif)
    S11 Fig. GC-bias in RNA-Seq data from FFPE samples.

    A) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for HPV-independent FFPE samples prepared using the TruSeq (dark blue line) or Stranded (light blue line) libraries compared to HPV-independent TCGA OPSCC samples (black line). B) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-associated and prepared using the TruSeq (dark blue line) or Stranded (light blue line) libraries compared to HPV-associated TCGA OPSCC samples (black line). C) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-independent and prepared using the MACE-Seq approach (red line) compared to HPV-independent TCGA OPSCC samples (black line). D) Smoothed spline plots comparing average gene expression versus % GC content of canonical transcripts for FFPE samples predicted to be HPV-associated and prepared using the MACE-Seq approach (red line) compared to HPV-associated TCGA OPSCC samples (black line).

    (TIF)

    pone.0321631.s011.tif (907KB, tif)
    S12 Fig. Qualitative and quantitative assessment of GC bias using the GCplotQC tool.

    Smoothed spline curves of gene expression versus % GC content of canonical transcripts for individual samples processed using the TruSeq library (A,B), the Stranded library (C,D), or the MACE-Seq library (E,F) compared to a weighted average of HPV-associated and HPV-independent OPSCC TCGA samples for reference. The program computes the area between the reference (OPSCC TCGA samples) and individual sample curves wherever the sample curve has lower gene expression, and displays the value as delta area. Samples identified as technical outliers (B, D, and F) in other steps of the pipeline also show dramatic increases in delta area. G) Scatter plot of average sample delta area values shows significantly more deviation in samples prepared with the TruSeq library compared to those prepared with the other two libraries. **** P < 0.00001.

    (TIF)

    pone.0321631.s012.tif (3.2MB, tif)
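
The delta area described for S12 Fig can be sketched as a one-sided area between two curves. This is not the GCplotQC code: the function name `delta_area`, the trapezoidal integration, and the assumption that the reference and sample curves are already evaluated on a shared, ascending grid of % GC values are illustrative choices.

```python
def delta_area(gc, reference, sample):
    """Area between reference and sample curves, counted only where
    the sample lies below the reference.

    gc        -- ascending % GC grid points (hypothetical input layout)
    reference -- reference expression at each grid point
    sample    -- sample expression at each grid point
    """
    if not (len(gc) == len(reference) == len(sample)):
        raise ValueError("gc, reference and sample must have equal length")
    # Shortfall of the sample relative to the reference, clipped at zero
    # so regions where the sample is higher contribute nothing.
    gap = [max(r - s, 0.0) for r, s in zip(reference, sample)]
    area = 0.0
    for i in range(1, len(gc)):
        width = gc[i] - gc[i - 1]
        # Trapezoid rule on the clipped gap; this is approximate when the
        # curves cross between two grid points.
        area += 0.5 * (gap[i - 1] + gap[i]) * width
    return area


# Toy curves (hypothetical): the sample dips below the reference at 50% GC
gc_grid = [40.0, 50.0, 60.0]
ref_curve = [5.0, 5.0, 5.0]
sample_curve = [5.0, 4.0, 6.0]
example_area = delta_area(gc_grid, ref_curve, sample_curve)
```

Clipping the gap at zero is what makes the measure one-sided: a sample with higher expression than the reference at high GC content does not offset a shortfall elsewhere.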
    S1 Table. Supplemental Tables.

    (XLSX)

    pone.0321631.s013.xlsx (35MB, xlsx)

    Data Availability Statement

    All data analyzed in the manuscript are included in the supplementary tables.


    Articles from PLOS One are provided here courtesy of PLOS
