Skip to main content
PLOS One logoLink to PLOS One
. 2023 Apr 24;18(4):e0284458. doi: 10.1371/journal.pone.0284458

Transforming RNA-Seq gene expression to track cancer progression in the multi-stage early to advanced-stage cancer development

Michelle Livesey 1, Sophia Catherine Rossouw 1, Renette Blignaut 2, Alan Christoffels 1, Hocine Bendou 1,*
Editor: Enrique Hernandez-Lemus3
PMCID: PMC10124877  PMID: 37093793

Abstract

Background

Cancer progression can be tracked by gene expression changes that occur throughout early-stage to advanced-stage cancer development. The accumulated genetic changes can be detected when gene expression levels in advanced-stage are less variable but show high variability in early-stage. Normalizing advanced-stage expression samples with early-stage and clustering of the normalized expression samples can reveal cancers with similar or different progression and provide insight into clinical and phenotypic patterns of patient samples within the same cancer.

Objective

This study aims to investigate cancer progression through RNA-Seq expression profiles across the multi-stage process of cancer development.

Methods

RNA-sequenced gene expression of Diffuse Large B-cell Lymphoma, Lung cancer, Liver cancer, Cervical cancer, and Testicular cancer were downloaded from the UCSC Xena database. Advanced-stage samples were normalized with early-stage samples to consider heterogeneity differences in the multi-stage cancer progression. WGCNA was used to build a gene network and categorized normalized genes into different modules. A gene set enrichment analysis selected key gene modules related to cancer. The diagnostic capacity of the modules was evaluated after hierarchical clustering.

Results

Unnormalized RNA-Seq gene expression failed to segregate advanced-stage samples based on selected cancer cohorts. Normalization with early-stage revealed the true heterogeneous gene expression that accumulates across the multi-stage cancer progression, this resulted in well segregated cancer samples. Cancer-specific pathways were enriched in the normalized WGCNA modules. The normalization method was further able to stratify patient samples based on phenotypic and clinical information. Additionally, the method allowed for patient survival analysis, with the Cox regression model selecting gene MAP4K1 in cervical cancer and Kaplan-Meier confirming that upregulation is favourable.

Conclusion

The application of the normalization method further enhanced the accuracy of clustering of cancer samples based on how they progressed. Additionally, genes responsible for cancer progression were discovered.

1. Introduction

Cancer is an ever-changing disease that generally becomes more heterogeneous as the disease progresses [1]. Different cancers progress and evolve in different ways. Some cancers are fast-growing and can cause mortality soon after diagnosis, while other cancers can be successfully treated [2]. One way of tracking cancer progression is to assess gene expression differences across the multi-stage process of cancer development. To our knowledge, limited research has focused on the progression of cancer in relation to gene expression. The numerous genetic changes that accrue over the course of early-stage to advanced-stage cancer development can be traced by RNA-Seq.

RNA-Seq is a high-throughput sequencing technology with computational methods to determine the quantity of RNA present in a biological sample. The method examines the continuously changing cellular transcriptome, allowing for an efficient and comprehensive description of gene expression profiles between different conditions over time [3]. RNA-Seq data is often in the format of a gene-by-sample count matrix, with genes in rows, and samples along the columns. The elements in the matrix report for each sample, the number of reads that could be uniquely aligned to a particular gene. The raw read counts have to be adjusted or “transformed” to aid our understanding of cancer progression.

To demonstrate our approach to investigating RNA-Seq cancer progression over the course of early-stage to advanced-stage cancer, we illustrate a bar graph of a single raw count gene expression profile in two cancer types (Fig 1). The dark blue and light blue bars represent advanced-stage and early-stage cancer gene expression, respectively, for gene x. In advanced-stage, gene x shows an identical expression profile in cancer types 1 and 2. Based on the same raw expression value, both cancer types will group together. However, when considering the early-stage gene expression profiles in both cancer types, it’s worth noting that the difference in expression between advanced-stage and early-stage cancer gene expression in cancer type 1 is greater than the difference in cancer type 2.

Fig 1. Raw RNA-Seq data of advanced-stage and early-stage gene expression of gene x in two cancer types.

Fig 1

Cancer type 1 and cancer type 2 show a gene expression fold increase of 4 and 1, respectively, from early to advanced-stage cancer.

The present study aims to normalize advanced-stage with early-stage RNA-Seq data to investigate cancer progression in relation to gene expression. The normalization method corrects for genes that display less expression variability in advanced-stage cancer samples but display a high variability in early-stage cancer samples. As a result, more meaningful information is available in which the two distinct cancer types can be differentiated based on the differences in gene expression profiles, or cancer progression, from early-stage to advanced-stage cancer. The development of such high-throughput genome analysis techniques for research on cancer has a significant impact on clinical treatment, as the discovery of cancers that differentiate in gene expression profiles (subtypes) is useful for guiding clinical treatment of multiple cancer [4].

The normalization method evaluated was performed by Frost and colleagues [5]. This method involves calculating the quotient of cancerous samples (dividend) and normal/non-cancerous samples (divisor), thereby producing normalized differential RNA expression profiles within a specific condition. However, many RNA-Seq research projects do not generate normal sequenced samples. Accordingly, we propose that early-stage cancer samples be used. We further hypothesize that using early-stage cancer samples will provide a more accurate genetic landscape of the multi-stage cancer progression.

2. Materials and methods

2.1. Data acquisition and processing

Cancer progression was investigated in early-stage and advanced-stage cancer. The datasets examined were selected based on cancers known to have an increased survival risk among patients due to associated autoimmune diseases. This includes five cancers; Diffuse Large B-cell Lymphoma (DLBCL) [6, 7], Lung Cancer [8], Cervical cancer [9], Liver cancer [10, 11], and Testicular cancer [12].

RNA-sequenced gene expression profiles for both early- and advanced-stage cancer were downloaded from the UCSC Xena database using cancer-specific data from The Cancer Genome Atlas cohort, from the Genomic Data Commons (GDC-TCGA) [13] (Table 1). Each patient’s expression profile was organized in a gene-by-sample genomic matrix. Additional metadata includes the associated phenotypic and survival profiles of each patient (S1 File).

Table 1. Cancer datasets.

The cancer cohorts were limited according to clinical or tumor stage and the primary site involved in each cancer. Patient samples were categorized in early-stage and advanced-stage, as well as the primary sites.

Number of samples
Cancer cohort Primary site Early-stage Advanced-stage
Diffuse Large B-cell Lymphoma Lymph Node 4 8
Lung Adenocarcinoma Bronchus and Lung 28 28
Cervical Cancer Cervix uteri 8 22
Liver Cancer Liver and intrahepatic bile ducts 20 6
Testicular Cancer Testis 15 15

The cancer datasets were made up of 60,483 unique Ensembl identifiers, which included transcript-non-specific expression data for all coding genes plus long non‐coding RNA (lncRNA), pseudogenes, and multiple forms of non‐coding transcripts [14]. The datasets quantified gene expression as log2(x+1) with x referring to the count of reads mapped to a specific genetic region in the human reference genome (GRCh38.p2, gencode release 22).

Genes having ENSG identifiers annotated with a protein-coding biotype were extracted using Ensembl BioMart (GRCh38.p13, Ensembl 104, May 2021) [15]. This eliminated 40,927 (67,7%) non‐coding entries leaving 19,556 protein‐coding entries. The gene expression of the 19,556 protein-coding genes as log2(x+1) was converted to raw counts for further analysis, as it was found that raw RNA-Seq data may perform better for capturing more original transcriptome patterns in different disease conditions [16].

2.2. Data normalization

The normalization method involves calculating the quotient of advanced-stage gene expression and early-stage gene expression (GitHub code: https://github.com/3270006/tracking-cancer-progression). We followed the same calculations established by [5].

2.2.1. Gene and tissue correction

The gene-by-sample matrices from each cancer cohort in Table 1 were used to assemble early-stage (E) and advanced-stage (A) gene expression matrices. This included:

  • A, s×q matrix for advanced-stage gene expression and,

  • E, s×r matrix for early-stage gene expression.

Where q and r represent the number of advanced-stage and early-stage cancer samples, respectively, and s the number of protein-coding genes.

Two binary primary site classification matrices were created for each gene expression matrix. This included:

  • PA, t×q matrix for advanced-stage cancer primary sites and,

  • PE, t×r matrix for early-stage cancer primary sites.

Where q and r represent the number of advanced-stage and early-stage cancer samples, respectively, and t the number of primary sites.

The advanced-stage cancer expression vector of gene i in matrix A was multiplied by the binary classification vector for primary site I in matrix PA as shown in Eq 1, resulting in a vector of tissue‐specific advanced-stage cancer gene expression Xi.

Xi=PAIAi (1)

The early-stage expression vector of gene i in matrix E was multiplied by the binary classification vector for primary site I in matrix PE as shown in Eq 2, resulting in a vector of tissue‐specific early-stage gene expression Yi.

Yi=PEIEi (2)

Xi and Yi, were computed based on the series of vectors of all primary sites and all protein-coding genes to build three‐dimensional matrices for X (advanced-stage cancer) and Y (early-stage cancer). The Xi,j,I three‐dimensional matrix represents the raw count gene expression value for gene i in advanced-stage cancer j of primary site I. While, the three‐dimensional matrix of Yi,k,I represents the raw count gene expression value for gene i in early-stage cancer k of primary site I.

The initial phase of calculating for the normalized dataset (subsequently called ‘Tissue‐corrected’), involved creating a mean normalized expression Gtissue for gene i at each primary site I, as given in Eq 3. To summarize, the sum of early-stage gene i within each primary site I was calculated.

Gi,Itissue=1mIk=1rYi,k,I (3)

Where r is the number of early-stage cancer samples in primary site I. The calculation to determine for mI are shown in Eq 4, where the sum of a given primary site in the binary matrix PE were calculated for all early-stage samples.

mI=k=1rPk,IE (4)

Finally, the tissue‐corrected gene expression matrix Ltissue was calculated as shown in Eq 5.

Li,j,Itissue=lnXi,j,IGi,Itissue (5)

2.3. Weighted gene co-expression network analysis

Both the advanced-stage cancer gene expression as raw count (uncorrected) and the normalized tissue-corrected datasets were analysed. The 19,556 protein-coding genes were subjected to Weighted Gene Co-expression Network Analysis (v. 1.70–3) (WGCNA) R package [17, 18].

2.3.1. Data pre-processing

The uncorrected matrix was filtered of genes that had a count of less than 10 in more than 90% of samples as recommended by the WGCNA authors, resulting in 17,436 protein-coding genes. The tissue-corrected matrix was filtered by removing all genes that had a row sum of zero, resulting in 19,350 protein-coding genes.

2.3.2. Gene co-expression network construction

To construct a weighted network, a correlation matrix between each pair of genes across all samples was calculated. A soft threshold power β was calculated to amplify the correlation between genes. The optimal power value was selected based on a scale-free topology criterion (R2 > 0.8). Based on this, an adjacency matrix was constructed, followed by the generation of a topological overlap matrix (TOM), and computation of the corresponding dissimilarity (1-TOM) values [19, 20].

To group the protein-coding genes, an average linkage hierarchical clustering based on the hclust function in conjunction with the dissimilarity TOM was used, resulting in a gene hierarchical clustering tree (tree graph). A novel dynamicTreeCut algorithm (v. 1.63–1) was employed to identify the clusters, in which branches of the dendrogram were sliced to determine the modules. Modules represent the partitioning of protein-coding genes into distinct groups based on expression values co-correlated and variable across the cancer cohorts. Modules were named using the default WGCNA settings, which assign a colour to each module.

2.4. Pathways and transcription factor enrichment analyses

A popular gene set enrichment analysis tool, WebGestalt (WEB-based GEne SeT AnaLysis Toolkit) was used to extract biological information from genes of interest [21]. The over-representation analysis (ORA) in the WebGestaltR package (v. 0.4.4) was used to characterize the genes of interest that were grouped inside each module found by WGCNA [2224]. The ORA used all protein-coding genes as a reference set, the WikiPathways [25, 26] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [27] databases for functional annotations, and the Benjamini‐Hochberg method for multiple testing correction [28].

Transcription factor (TF) enrichment analysis was performed on the genes of interest that were grouped inside each module found by WGCNA using the ChEA3 database web-server application [29]. To estimate the TF-target enrichment, the ARCHS4 resource were selected as it uses a co-expression method to compile a list of genes that are controlled by each TF.

2.5. Clustering by transcript profiling

The clustering of cancer samples is the most basic and exploratory analysis to find groups of samples sharing similar gene expression patterns, which can lead to the discovery of new cancer subtypes. Therefore, gene expression profiles will be subjected to clustering analysis to investigate the grouping of cancer samples. Accordingly, the computation model was used to predict cancer clusters (subtypes) that progressed differently and/or similarly.

The cosine distance between the expression profiles of the genes included in the modules and Ward’s method for agglomeration were used to create clusters of similar cancers established by hierarchical clustering [30, 31]. The number of clusters was identified using the find_k function, which estimates k using maximal average silhouette widths [32]. This function forms part of the dendextend (v. 1.15.2) R package. Finally, the dendrograms were split into k groups to assign samples to a cluster.

2.6. Survival analysis

The genes categorized in each module by WGCNA across the clusters were subjected to a Cox regression model based on the Lasso algorithm of the glmnet R package (v. 4.1–3) [3335]. The model reduces the number of candidate genes and selects the most significant genes for a patient’s survival, assigning a regression coefficient value to each gene. The product of the coefficient value and the corresponding gene’s expression value resulted in a prognostic risk score for each patient. The patient scores were used to calculate a median risk score. A status value of 1 or 0 was assigned to each patient based on whether the patient’s score was above or below the median risk score. Kaplan-Meier (K–M) estimates for overall survival (OS) were generated according to the patient status information. The K–M curves were created using the ggsurvplot function from the survminer R package (v. 0.4.9).

2.7. Statistics

The statistical analysis was performed using the car (v. 3.0–11), DescTools (v. 0.99.43), and agricolae (v. 1.3–5) R packages. The statistics were conducted to evaluate for different gene expression in each module and primary sites across the clusters.

The differences in the gene expression were first evaluated for normality and equal variance using Shapiro-Wilk test of normality [36] and Levene’s test of homogeneity [37], respectively. If the Shapiro-Wilk null hypothesis was not rejected (P ≥ 0.05; H0: normal distribution) and Levene’s test null hypothesis were not rejected (P ≥ 0.05; H0: equal variance across groups), an analysis of variance (ANOVA) [38] was employed. If the ANOVA null hypothesis of equal mean gene expression in each module and primary site was rejected by chance (P ≤ 0.05), a Tukey’s post-hoc test was used for pairwise comparisons [39].

In the event that Levene’s test null hypothesis was rejected (P ≤ 0.05; H1: difference in variances between groups) and Shapiro-Wilk test resulted in either normal (P ≥ 0.05) or not normal distribution (P ≤ 0.05), then the Kruskal-Wallis test [40] was used to evaluate for differences in the gene expression in each module and primary site across clusters. If the Kruskal-Wallis was rejected, it can be concluded that equal median gene expression across groups was rejected, a post-hoc analysis was performed using Dunn’s test [41].

3. Results and discussion

Both the uncorrected and tissue-corrected matrices were evaluated to determine if the normalization method represents differences in the true gene expression. The normalization method is considered effective if the normalized gene expression has an increased power in differentiating samples based on cancer type and clinical and phenotypic information.

3.1. Uncorrected RNA-Seq

The uncorrected protein-coding genes were inserted into WGCNA. The soft-thresholding power was defined as 20, with a scale-free topological index of above 0.8. This resulted in a gene tree and corresponding module colours. Similar modules were merged using the associated adjacency heatmap. The merged modules and the number of genes in each module was used for further analysis (S1 Fig).

A total of 3175 genes were categorized into 32 modules using WGCNA. Of those, only 10 modules were enriched for functional pathway annotations with WikiPathways: brown, cyan, grey60, magenta, purple, dark green, dark grey, light cyan, light steel blue 1, and tan. The first five modules were enriched for tissue-specific processes (ORA, P ≤ 0.047). The latter five modules were enriched for cancer-relevant processes (ORA, P ≤ 0.045).

It was found that the tan module had the highest total genes detected in biological pathways. It was also noteworthy that a repetition of the same pathway description appeared in several different modules. The same behaviour was noted with KEGG pathway analysis (S2 Fig). This indicates that the uncorrected dataset, which did not undergo normalization, did not efficiently depict gene expression differences.

The hierarchical clustering of cancer samples using the 3175 genes resulted in two cancer clusters (Fig 2). The primary site composition of each cluster was evaluated to determine if each primary site corresponded to the cluster assignment. Both clusters were primary sites heterogeneous. Cluster 1 was composed of samples of DLBCL (13.2%), lung (35.8%), liver (5.7%), cervical (22.6%), and testicular cancer (22.6%). While cluster 2 was composed of DLBCL (3.8%), lung (34.6%), liver (11.5%), cervical (38.5%), and testicular cancer (11.5%). The uncorrected dataset failed to correctly segregate the cancer samples in different clusters (Fig 2).

Fig 2. Heatmap of uncorrected RNA-Seq data illustrating module expression within cancer clusters.

Fig 2

The colour bar on the left shows modules identified by WGCNA and enriched for functional pathway annotations. The rows are further composed of protein-coding genes with raw count values. Clusters of similar cancer cohorts are indicated across the top, and cancer cohorts are displayed by the colour bar along the top with the key on the right. *Primary sites abbreviations: CESC = Cervical squamous cell carcinoma; DLBCL = Diffuse Large B-cell Lymphoma; LIHC = Liver Hepatocellular Carcinoma; LUAD = Lung Adenocarcinoma; TGCT = Testicular Germ Cell Tumors.

The statistical analysis outlined in the methods section was performed to compare each module across the cancer clusters. From the 10 enriched modules, seven modules; cyan, dark green, dark grey, grey60, light cyan, light steel blue 1, and tan were characterized by significantly different expressions (Kruskal-Wallis P ≤ 0.0008) across cancer clusters. While the magenta, purple (ANOVA, P ≥ 0.08) and brown modules (Kruskal-Wallis, P = 0.31) did not show differential expression across clusters. That is, WGCNA selected genes with less differential power, because of non-normalization, resulting in heterogeneous clusters composed of samples from different primary sites (Fig 2).

The same statistical analysis was performed to compare each primary site in Cluster 1 to the equivalent primary site in Cluster 2 for each module. This computation was performed to determine if the segregation of primary sites into Clusters 1 and 2 was based on changes in the gene expression. The statistical test showed no significant difference between sample groups of the same primary sites from the two different clusters. It can be said that the clustering of the uncorrected dataset failed to segregate the primary sites based on different gene expression. Evidently, the unnormalized genes failed to show differentiation.

3.2. Tissue-corrected RNA-Seq data

The tissue-corrected protein-coding genes were inserted into WGCNA. A soft threshold selection of the lowest β value that leads to R2 > 0.8 was selected as 21. This resulted in a gene tree and corresponding module colours. Similar modules were merged using the associated adjacency heatmap. The merged modules and the number of genes in each module was used for further analysis (S3 Fig).

WGCNA identified 617 genes distributed into seven modules. The module that composed the most and least genes was the brown and pink modules, respectively. Of the seven modules, KEGG analysis enriched five modules (S4 Fig), while a total of four modules were found to be enriched for functional pathway annotations with WikiPathways. This included the black, brown, magenta and turquoise modules (Fig 3), of which all four modules were enriched for cancer-related processes (ORA, P ≤ 0.038). The pathway descriptions identified in the four modules are indicated in the bar chart in Fig 3. Each colour bar represents the module colour and shows the number of genes that were enriched for that module. Analysing the degree of enrichment and terms further signifies the difference of each module.

Fig 3. WikiPathways enrichment of gene modules detected by WGCNA from the tissue-corrected dataset using the ORA, WebGestalt.

Fig 3

The black module was enriched for cytoplasmic ribosomal proteins (ORA, P < 0.001). The brown module was enriched for NK cell, T cell or inflammatory signalling (ORA, P ≤ 0.021). It was also found that the brown module has the highest total genes detected in biological pathways. The magenta module enriched for mRNA processing (ORA, P < 0.0001). Meanwhile, processes relevant to the cell cycle progression were enriched in the turquoise module (Fig 3). The turquoise module was the largest module comprising 139 genes and also identified pathways that were related to other cancers such as breast cancer, gastric cancer, and retinoblastoma. Gastric adenocarcinoma has been reported to be correlated to the investigated cancers including liver carcinoma and lung cancer through specific genes [42]. It was noted that some genes were shared between the detected cancer pathways, this included the AURKA gene, which was involved in the gastric and breast cancer pathways. An increased gene expression of the AURKA gene has been previously identified in the liver and lung cancer [43]. Gastric and the retinoblastoma pathways further shared the MCM4, TOP2A and RFC4 genes, that have been reported in the studied cancers, where MCM4 is overexpressed in liver cancer [44], TOP2A promotes lung cancer [45], and RFC4 has a high expression in liver, lung, and cervical cancer [46].

Moreover, cancer progression and the retinoblastoma pathway are closely connected [47, 48]. It was found that the retinoblastoma and the breast cancer pathways shared the CHEK1 gene, a gene that has been reported in the development of human malignant tumors, such as lung and cervical cancers [49]. Therefore, the enriched module genes detected in the studied cancers could suggest that they play a role in cancer development and thus could also be relevant to other cancer types.

The WGCNA module genes were further subjected to TF enrichment analysis, to gain evidence for potential mechanistic connection of transcriptome changes to specific TFs. ChEA3 TF analysis revealed associations between the observed gene expression changes and involved TFs. The top 5 prioritized TFs for each module are presented in S2 File, with documented information about their biological involvement in the context of cancer (S2 File). The analysis confirms, with supported literature, several TF relationships with the multiple cancers evaluated in this study.

Hierarchical clustering of the 617 genes in WGCNA modules detected eight clusters characterized by distinct expression of the four enriched modules (Kruskal-Wallis Test, P < 0.0001) (Fig 4). Post hoc analysis by Dunn’s Test to assess pairwise differences across clusters in each module showed differential expression for 21 of 28 cluster comparisons for the black module, 25 of 28 comparisons for the brown module, 24 of 28 comparisons for the magenta module, and 27 out of 28 comparisons for the turquoise module. The high proportion of pairwise cluster comparisons with significant differences highlights the distinctive expression patterns in each module across clusters.

Fig 4. Heatmap of tissue-corrected RNA-Seq data illustrating module expression within cancer clusters.

Fig 4

The colour bar on the left shows modules identified by WGCNA and enriched for functional pathway annotations. The rows are further composed of protein-coding genes with expression values obtained after data normalization. Clusters of similar cancer cohorts are indicated across the top and the cancer cohort are displayed by the colour bar along the top with the key on the right. *Primary sites abbreviations: CESC = Cervical squamous cell carcinoma; DLBCL = Diffuse Large B-cell Lymphoma; LIHC = Liver Hepatocellular Carcinoma; LUAD = Lung Adenocarcinoma; TGCT = Testicular Germ Cell Tumors.

The primary site composition of each cluster was evaluated to determine if the cancer primary site corresponded to the cluster assignment. Cluster 1 was primary site homogenous, composed of only DLBCL samples, while Cluster 2 was primary site heterogeneous, composed of DLBCL and liver samples. Clusters 3 and 4 were primary site homogenous, however shows a segregation of lung samples. The same was observed in Clusters 5 and 6 with cervical samples and Clusters 7 and 8 composed of testicular samples (Fig 4).

The associated metadata of the cancer samples were investigated to determine if distinct phenotypes could have caused similar cancer cohorts to partition into separate clusters in Fig 4. The DLBCL samples present in Cluster 1 show gene profiles that are more upregulated in comparison to the Cluster 2 DLBCL samples. In addition, it was noted that DLBCL samples in Cluster 1 showed a higher number of extranodal sites involvement (≥ 2), while those in Cluster 2 showed no or low number of extranodal sites involvement (≤ 2). Common sites of extranodal spread are lung, liver, kidney, and bone marrow [50]. It has also been reported that DLBCL can be involved in virtually any organ [51]. Therefore, the DLBCL Cluster 2 found grouped with liver samples is an interesting finding, given the high prevalence of secondary liver involvement by lymphoma including DLBCL and indicates advanced disease [52]. However, this information of secondary liver involvement in the metadata associated to DLBCL is unavailable, and requires further investigation to support the claim that DLBCL patients have liver infection, as well as the use of a higher sample number, which was not possible for this study since the public data was not available. The phenotypic data for lung samples in Clusters 4 and 5 did not provide a clear reason for the segregation of the cancer cohort as some clinical information on the samples were incomplete.

It was discovered that the average overall survival of patients with cervical cancer represented in Cluster 5 were greater than the average overall survival of cervical cancer patients in Cluster 6. This led to a survival analysis in which the Cox regression model selected MAP4K1 (ENSG00000104814) categorized in the brown module as a prognostic gene. The upregulation of the MAP4K1 gene has been found to be favourable in cervical cancer [53, 54]. According to Kaplan-Meier results in a recent study, the high expression of the MAP4K1 gene was beneficial to cervical cancer patients [55]. Their research focussed on PDCD1, a gene that is most typically related to its expression on tumor-infiltrating lymphocytes. Moreover, they showed that PDCD1 significantly co-expressed with the following 15 genes, whose high expression is beneficial for cervical cancer patients; MAP4K1, ACAP1, CST7, CXCR6, GPR171, GZMH, GZMK, P2RY10, RASAL3, SH2D1A, TBC1D10C, ZNF831, GZMM, JAKMIP1, and PSTPIP1 [55]. We compared their finding to the results of our normalization method and discovered the PDCD1 gene as well as the first 12 of the 15 genes were co-expressed within the brown module. This finding validates the normalization method in this study, as upregulation is observed in the brown module for Cluster 5, whereas the brown module in Cluster 6 mainly illustrates downregulation (Fig 4). The normalized gene expression of MAP4K1 in cervical patient samples from Clusters 5 and 6 were extracted from the brown module and shown in Fig 5.

Fig 5. Boxplot of gene MAP4K1 from cervical cancer samples categorized in the brown module by WGCNA.

Fig 5

The red box plot, constructed with Cluster 5 samples, shows upregulation of gene MAP4K1, while the blue box plot, constructed with Cluster 6 samples, shows downregulation of MAP4K1 gene.

We corroborate the previous findings [5355] in that the upregulation of gene MAP4K1, in Cluster 5, is favourable in cervical cancer patients as shown by the Kaplan-Meier curve, in Fig 6. Cluster 5 presents a longer life expectancy than the patient samples in Cluster 6.

Fig 6. Kaplan-Meier of MAP4K1 gene in cervical cancer patients.

Fig 6

Analysis shows the correlation between normalized gene expression level and patient survival in days. Patients were divided as detected in Clusters 5 and 6 after clustering according to transcript profiling.

The brown module were further subjected to TF enrichment analysis using an established computational tool to offer a better understanding of the associations between the observed gene expression changes and TFs in the context of cervical cancer. The TFs that were associated with the MAP4K1 gene in which the TF was found to effect cervical cancer survival was extracted and documented (S3 File). Several co-expressed genes that also play a role in cervical cancer survival identified in [55] were also linked to the TFs and highlighted (S3 File).

Lastly, the phenotypic data of testicular cancer, divided in Clusters 7 and 8, showed that the primary diagnosis of the patients in Cluster 7 was seminomas, while Cluster 8 were made up of patient samples that were primary diagnosis with type embryonal carcinoma testicular cancer, mixed germ cell tumor or Teratoma malignant.

To further demonstrate the significance of late-stage cancer samples normalized with early-stage cancer samples, an investigation was carried out with a normal tissue expression dataset from the Genotype-Tissue Expression (GTEx) Portal [56]. Normalized gene expression profiles using normal tissue samples were clustered and allowed for the segregation between distinct cancer types (S5 Fig). However, it failed to provide in-depth clustering based on subtypes within cancer types. As a result, the variations in gene expression, such as in cervical cancer that was associated with survival, could not be stratified by normalizing late-stage cancer samples with normal tissue. The results obtained with our method by normalizing late-stage with early-stage cancer samples demonstrate the ability of the method to cluster samples by cancer progression, rather than simply by cancer type as with the use of normal samples.

4. Conclusion

The RNA-Seq read count before normalization showed discrepancies in comparison to normalized gene expression. The goal of our normalization method was achieved, since it shows that advanced-stage cancer gene expression data can be normalized using early-stage cancer gene expression data. WGCNA analysis validated the results of the tissue-corrected matrix as the correct relationships between normalized gene expression were presented. It was further illustrated that the biological information was preserved and allowed more meaningful comparisons of each cancer cohort, including survival analyses.

The benefit of the normalization method used in the present study was twofold; (i) it was able to segregate tumor samples with different and similar progression, (ii) and it could cluster samples from distinct cancer types as well as samples within the same cancer type. A significant result of the latter was in the case of cervical cancer, in which gene MAP4K1 was segregated based on the gene’s prognosis. This discovery demonstrated that the normalization method can be used in conjunction with cancer clustering to identify areas of higher cancer risk as well as the cause of the increased risk.

The value of this method thus aids with hypotheses that seek to explore various novel cancer subtypes that segregate by different gene expression profiles and further investigate the biological association, clinical, or prognostic features linked to the cancer subtypes (clusters). Additionally, hypotheses that investigate cancer progression and identify cancer subtypes with different progression. New users can further use this method to find new subtypes in their data and associate it with the clinical data that they have.

Supporting information

S1 Fig. Uncorrected RNA-Seq data were inserted into WGCNA to identify gene modules.

(A) Soft threshold power. (B) Gene clustering tree. Each colour underneath the dendrogram shows the module assignment, and branches above represent the genes. The dynamic tree cut shows the initial module detection and merged dynamic indicates the modules divided according to their similarity. (C1) Module eigengene dendrogram identified groups of correlated modules. The red line indicates the module eigengene threshold of 0.25 and (C2) Eigengene adjacency heatmap of different gene co-expression modules. In the heatmap, the blue colour represents low adjacency, while the red represents high adjacency. (D) Barplot of 32 co-expression modules constructed after similar modules were merged with module size at the top of each bar.

(TIF)

S2 Fig. KEGG enrichment of gene modules detected by WGCNA from the uncorrected RNA dataset using the ORA, WebGestalt.

(TIF)

S3 Fig. Tissue-corrected dataset were inserted into WGCNA to identify gene modules.

(A) Soft threshold power. (B) Gene clustering tree. Each colour underneath the dendrogram shows the module assignment, and branches above represent the genes. The dynamic tree cut shows the initial module detection and merged dynamic indicates the modules divided according to their similarity. (C1) Module eigengene dendrogram identified groups of correlated modules. The red line indicates the module eigengene threshold of 0.25 and (C2) Eigengene adjacency heatmap of different gene co-expression modules. In the heatmap, the blue colour represents low adjacency, while the red represents high adjacency. (D) Barplot of seven co-expression modules constructed after merged modules with module size at the top of each bar.

(TIF)

S4 Fig. KEGG enrichment of gene modules detected by WGCNA from the tissue-corrected RNA dataset using the ORA, WebGestalt.

(TIF)

S5 Fig. Heatmap of tissue-corrected RNA-Seq data of late-stage cancer samples normalized with normal tissue samples, illustrating module expression within cancer clusters.

Normal tissue expression dataset was obtained from the Genotype-Tissue Expression (GTEx) Portal. To match the number of male/female ratios as in the late-stage cancer samples, the same number normal tissue samples of male/female ratios were randomly selected, except for cervical cancer, which only had 10 normal tissue samples. The colour bar on the left shows modules identified by WGCNA and enriched for functional pathway annotations. The rows are further composed of protein-coding genes with expression values obtained after data normalization. Clusters of similar cancer cohorts are indicated across the top and the cancer cohort are displayed by the colour bar along the top with the key on the right. *Primary sites abbreviations: CESC = Cervical squamous cell carcinoma; DLBCL = Diffuse Large B-cell Lymphoma; LIHC = Liver Hepatocellular Carcinoma; LUAD = Lung Adenocarcinoma; TGCT = Testicular Germ Cell Tumors.

(TIF)

S1 File. The data underlying the results presented in this study are publicly accessible from the UCSC Xena data browser (https://xenabrowser.net) from individual cancer cohorts.

(PDF)

S2 File. Top 5 TFs derived from the ChEA3 enrichment analysis of each tissue-corrected WGCNA module.

The biological role indicates the role of the identified TF in cancer according to literature.

(PDF)

S3 File. Transcription factors (TFs) enrichment analysis of tissue-corrected WGCNA brown module.

A list of TFs and their corresponding rank according to ARCHS4 co-expression, with documented information about their biological function associated with survival in the context of cervical cancer. The genes in bold were previously found [55] to play a role in cervical cancer survival.

(PDF)

Data Availability

The RNA-Seq gene expression and curated clinical public datasets underlying the results presented in the study are available from the following direct URLs: TCGA-DLBC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-DLBC.htseq_counts.tsv.gz and https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/GTEX_phenotype.gz),TCGA-LUAD (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.GDC_phenotype.tsv.gz), TCGA-LIHC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LIHC.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LIHC.GDC_phenotype.tsv.gz), TCGA-CESC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CESC.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CESC.GDC_phenotype.tsv.gz), and TCGA-TGCT (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-TGCT.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-TGCT.GDC_phenotype.tsv.gz). The normal tissue were obtained from the Genotype-Tissue Expression (GTEx) Portal as follows: https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/gtex_gene_expected_count.gz and https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/GTEX_phenotype.gz, for the expression and phenotype information, respectively (S1 File).

Funding Statement

This work was supported by the South African Medical Research Council and National Research Foundation of South Africa grant: 121787 (M.L). https://www.samrc.ac.za & http://www.nrf.ac.za. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Dagogo-Jack I, Shaw AT. Tumour heterogeneity and resistance to cancer therapies. Nat Rev Clin Oncol. 2018. Feb;15(2):81–94. doi: 10.1038/nrclinonc.2017.166 Epub 2017 Nov 8. . [DOI] [PubMed] [Google Scholar]
  • 2.Natrajan R, Sailem H, Mardakheh FK, Arias Garcia M, Tape CJ, Dowsett M, et al. Microenvironmental Heterogeneity Parallels Breast Cancer Progression: A Histology–Genomic Integration Analysis. PloS Med. 2016. Feb 16;13(2): e1001961. doi: 10.1371/journal.pmed.1001961 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009. Jan;10(1):57–63. doi: 10.1038/nrg2484 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Berger MF, Mardis ER. The emerging clinical relevance of genomics in cancer medicine. Nat Rev Clin Oncol. 2018. Jun;15(6):353–365. doi: 10.1038/s41571-018-0002-6 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Frost FG, Cherukuri PF, Milanovich S, Boerkoel CF. Pan‐cancer RNA‐seq data stratifies tumours by some hallmarks of cancer. J Cell Mol Med. 2020. Jan;24(1):418–430. doi: 10.1111/jcmm.14746 Epub 2019 Nov 15. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.de Carvalho PS, Leal FE, Soares MA. Clinical and molecular properties of human immunodeficiency virus related diffuse large b-cell lymohoma. Front Oncol. 2021. Apr 29;11:675353. doi: 10.3389/fonc.2021.675353 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Mörth C, Valachis A, Abu Sabaa A, Marshall K, Hedström G, Flogegård M, et al. Autoimmune disease in patients with diffuse large B-cell lymphoma: occurrence and impact on outcome. Acta Oncol. 2019. Aug;58(8):1170–1177. doi: 10.1080/0284186X.2019.1619936 Epub 2019 May 25. . [DOI] [PubMed] [Google Scholar]
  • 8.Shiels MS, Cole SR, Kirk GD, Poole C. A meta-analysis of the incidence of non-AIDS cancers in HIV-infected individuals. J Acquir Immune Defic Syndr. 2009. Dec;52(5), 611–22. doi: 10.1097/QAI.0b013e3181b327ca . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Dugué PA, Rebolj M, Garred P, Lynge E. Immunosuppression and risk of cervical cancer. Expert Rev Anticancer Ther. 2013. Jan;13(1):29–42. doi: 10.1586/era.12.159 . [DOI] [PubMed] [Google Scholar]
  • 10.Clifford GM, Rickenbach M, Polesel J, Dal Maso L, Steffen I, Ledergerber B, et al. Influence of HIV-related immunodeficiency on the risk of hepatocellular carcinoma. AIDS. 2008. Oct 18;22(16):2135–41. doi: 10.1097/QAD.0b013e32831103ad . [DOI] [PubMed] [Google Scholar]
  • 11.Lleo A, de Boer YS, Liberal R, Colombo M. The risk of liver cancer in autoimmune liver diseases. Ther Adv Med Oncol. 2019. Jul 12;11:1758835919861914. doi: 10.1177/1758835919861914 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Goedert JJ, Purdue MP, McNeel TS, McGlynn KA, Engels EA. Risk of germ cell tumors among men with HIV/acquired immunodeficiency syndrome. Cancer Epidemiol Biomarkers Prev. 2007. Jun;16(6):1266–9. doi: 10.1158/1055-9965.EPI-07-0042 . [DOI] [PubMed] [Google Scholar]
  • 13.Goldman MJ, Craft B, Hastie M, Repečka K, McDade F, Kamath A, et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol. 2020. Jun;38(6):675–678. doi: 10.1038/s41587-020-0546-8 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, et al. The Ensembl gene annotation system. Database (Oxford). 2016. Jun 23;2016:baw093. doi: 10.1093/database/baw093 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Smedley D, Haider S, Durinck S, Pandini L, Provero P, Allen J, et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 2015. Jul 1;43(W1):W589‐W98. doi: 10.1093/nar/gkv350 Epub 2015 Apr 20. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Han H, Men K. How does normalization impact RNA-seq disease diagnosis? J Biomed Inform. 2018. Sep;85:80–92. doi: 10.1016/j.jbi.2018.07.016 Epub 2018 Jul 21. . [DOI] [PubMed] [Google Scholar]
  • 17.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008. Dec 29;9:559. doi: 10.1186/1471-2105-9-559 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhao W, Langfelder P, Fuller T, Dong J, Hovarth S. Weighted gene coexpression network analysis: state of the art. J Biopharm Stat. 2010. Mar;20(2):281–300. doi: 10.1080/10543400903572753 . [DOI] [PubMed] [Google Scholar]
  • 19.Yip AM, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007. Jan 24; 8:22. doi: 10.1186/1471-2105-8-22 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:Article17. doi: 10.2202/1544-6115.1128 Epub 2005 Aug 12. . [DOI] [PubMed] [Google Scholar]
  • 21.Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005. Jul 1;33(Web Server issue):W741–8. doi: 10.1093/nar/gki475 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Wang J, Duncan D, Shi Z, Zhang B. WEB-based Gene SeT AnaLysis Toolkit (WebGestalt). Update 2013. Nucleic Acids Res. 2013. Jul;41(Web Server issue):W77–83. doi: 10.1093/nar/gkt439 Epub 2013 May 23. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017. Jul 3;45(W1):W130–W137. doi: 10.1093/nar/gkx356 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Liao Y, Wang J, Jaehnig EJ, Shi Z, Zhang B. WebGestalt 2019: gene set analysis toolkit with revamped Uis and AsPIs. Nucleic Acids Res. 2019. Jul 2;47(W1):W199–W205. doi: 10.1093/nar/gkz401 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Kelder T, Pico AR, Hanspers K, van Iersel MP, Evelo C, Conklin BR. Mining biological pathways using wikiPathways web services. PloS ONE. 2009. Jul 30;4(7):e6447. doi: 10.1371/journal.pone.0006447 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Slenter DN, Kutmon M, Hanspers K, Riutta A, Windsor J, Nuner N, et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018. Jan 4;46(D1):D661–D667. doi: 10.1093/nar/gkx1064 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017. Jan 4;45(D1):D353–D361. doi: 10.1093/nar/gkw1092 Epub 2016 Nov 28. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995; 57(1):289–300. [Google Scholar]
  • 29.Keenan AB, Torre D, Lachmann A, Leong AK, Wojciechowicz ML, Utti V, et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 2019. Jul 2;47(W1):W212–W224. doi: 10.1093/nar/gkz446 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Jaskowiak PA, Campello RJ, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S2. doi: 10.1186/1471-2105-15-S2-S2 Epub 2014 Jan 24. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ward JH. Hierarchical grouping to optimize an objective function. Journal of American Statistical Association. 1963; 58:236–244. [Google Scholar]
  • 32.Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. [Google Scholar]
  • 33.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22. . [PMC free article] [PubMed] [Google Scholar]
  • 34.Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011. Mar;39(5):1–13. doi: 10.18637/jss.v039.i05 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, et al. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Series B Stat Methodol. 2012. Mar;74(2):245–266. doi: 10.1111/j.1467-9868.2011.01004.x . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Shapiro SS, Wilk MB. An Analysis of Variance Test for Normality (Complete Samples). Biometrika. 1965;52:591–611. [Google Scholar]
  • 37.Levene H, Olkin II, Hotelling H. Robust Tests for Equality of Variances. Contributions to Probability and Statistics; Essays in Honor of Harold Hotelling. 1960;78–92. [Google Scholar]
  • 38.Fisher RA. On the probable error of a coefficient of correlation deduced from a small sample. Metron. 1921;1;3–32. [Google Scholar]
  • 39.Tukey JW. Comparing Individual Means in the Analysis of Variance. Biometrics. 1949. Jun;5(2):99–114. . [PubMed] [Google Scholar]
  • 40.Kruskal WH, Wallis WA. Use of Ranks in One-Criterion Variance Analysis. J Am Stat Assoc. 1952;47(260):583–621. [Google Scholar]
  • 41.Dunn OJ. Multiple Comparisons Using Rank Sums. Technometrics. 1964;6:241–52. [Google Scholar]
  • 42.Salarikia SR, Kashkooli M, Taghipour MJ, Malekpour M, Negahdaripour M. Identification of hub pathways and drug candidates in gastric cancer through systems biology. Sci Rep. 2022. Jun 1;12(1):9099. doi: 10.1038/s41598-022-13052-0 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Miralaei N, Majd A, Ghaedi K, Peymani M, Safaei M. Integrated pan-cancer of AURKA expression and drug sensitivity analysis reveals increased expression of AURKA is responsible for drug resistance. Cancer Med. 2021. Sep;10(18):6428–6441. doi: 10.1002/cam4.4161 Epub 2021 Aug 1. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Zheng R, Lai G, Li R, Hao Y, Cai L, Jia J. Increased expression of MCM4 is associated with poor prognosis in patients with hepatocellular carcinoma. J Gastrointest Oncol. 2021. Feb;12(1):153–173. doi: 10.21037/jgo-20-574 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Kou F, Sun H, Wu L, Li B, Zhang B, Wang X, Yang L. TOP2A Promotes Lung Adenocarcinoma Cells’ Malignant Progression and Predicts Poor Prognosis in Lung Adenocarcinoma. J Cancer. 2020. Feb 10;11(9):2496–2508. doi: 10.7150/jca.41415 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Li Y, Gan S, Ren L, Yuan L, Liu J, Wang W, et al. Multifaceted regulation and functions of replication factor C family in human cancers. Am J Cancer Res. 2018. Aug 1;8(8):1343–1355. . [PMC free article] [PubMed] [Google Scholar]
  • 47.Du W, Searle JS. The rb pathway and cancer therapeutics. Curr Drug Targets. 2009. Jul;10(7):581–9. doi: 10.2174/138945009788680392 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Marshall AE, Roes MV, Passos DT, DeWeerd MC, Chaikovsky AC, Sage J, et al. RB1 Deletion in Retinoblastoma Protein Pathway-Disrupted Cells Results in DNA Damage and Cancer Progression. Mol Cell Biol. 2019. Jul 29;39(16):e00105–19. doi: 10.1128/MCB.00105-19 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Wu M, Pang JS, Sun Q, Huang Y, Hou JY, Chen G, et al. The clinical significance of CHEK1 in breast cancer: a high-throughput data analysis and immunohistochemical study. Int J Clin Exp Pathol. 2019. Jan 1;12(1):1–20. . [PMC free article] [PubMed] [Google Scholar]
  • 50.Jamil A, Mukkamalla SKR. Lymphoma. 2022. Jul 18. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan–. . [Google Scholar]
  • 51.Beham-Schmid C. Aggressive lymphoma 2016: revision of the WHO classification. Memo. 2017;10(4):248–254. doi: 10.1007/s12254-017-0367-8 Epub 2017 Nov 30. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Rajesh S, Bansal K, Sureka B, Patidar Y, Bihari C, Arora A. The imaging conundrum of hepatic lymphoma revisited. Insights Imaging. 2015. Dec;6(6):679–92. doi: 10.1007/s13244-015-0437-6 Epub 2015 Oct 6. . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, et al. A pathology atlas of the human cancer transcriptome. Science. 2017. Aug 18;357(6352):eaan2507. doi: 10.1126/science.aan2507 . [DOI] [PubMed] [Google Scholar]
  • 54.The human protein atlas (HPA) [Internet]. Human Pathology Atlas [cited 2022 Jun 7]: MAP4K1 gene available from: https://www.proteinatlas.org/ENSG00000104814-MAP4K1/pathology.
  • 55.Kannan S, O’Connor GM, Bakker EY. Molecular Mechanisms of PD-1 and PD-L1 Activity on a Pan-Cancer Basis: A Bioinformatic Exploratory Study. Int J Mol Sci. 2021. May 22;22(11):5478. doi: 10.3390/ijms22115478 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017. Oct 11;550(7675):204–213. doi: 10.1038/nature24277 Erratum in: Nature. 2017 Dec 20;: . [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Enrique Hernandez-Lemus

7 Oct 2022

PONE-D-22-16548Tracking cancer progression: A normalization method to identify differential RNA-Seq gene expression in the multi-stage early to advanced-stage cancer development.PLOS ONE

Dear Dr. Bendou,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please consider carefully the comments of the reviewers. Particularly relevant is to highlight the advantages of your proposed method and its contrast with the one presented in reference 4, as well as the biological relevance of your findings so that the full extent of your contribution will be clearer.

Please submit your revised manuscript by Nov 21 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors show that by normalizing the gene expression from late stage cancer using early stage cancer samples to track progression.

The normalization method has been applied before using non tumor samples to normalize instead of early stage samples as in this case.

Minor suggestions/comments:

1. Figure 1 could include depiction of foldchange sizes in the two cancer types.

2. Most, if not all, of the subfigures in Figure 2 should be supplementary figures, sames as with Figure 5.

3. In the methods section, in the vector multiplication it needs to be clarified which operation is being applied; the notation implies cross product, but probably the one being used is element-wise multiplication (Hadamard product).

4. Including the code for normalization and analysis steps would improve the paper, it would clarify steps, and would be of use to the commuity. The code could be included as either supplementary or with a github (or similar link.

5. An explanation of the rational for the differential expression steps would be very useful. Why not use a standard differential expression analysis package such as DeSEQ2 or EdgeR?

6. It would be interesting to know if using early stage works like a surrogate of normal tissue (as in the original publication for the normalization method) or it is actually better or gives a different insight.

Reviewer #2: The manuscript by Livesey et al describes a computational method aimed at the identification of gene expression signatures associated with cancer development from early to advanced stages. A central approach is to analyze multiple cancer types from multiple tissue sites, normalize all expression values for each gene at advanced cancer stage by the average expression value at the early stage, and apply the WGCNA method to identify gene modules corresponding to cancer progression. This method is a modification of a previously described approach (reference 4) to the identification of gene expression signatures in various combined cancer types compared to normal tissue samples. The method proposed in this manuscript is methodologically identical to the “tissue-corrected” normalization approach of reference 4, only applied to the comparison of samples from early vs late stages of cancer. The authors describe the application of their method to publicly available gene expression data from early and late stages of five cancer types, with 4 to 28 samples in each group. They compared the WGCNA results based on original expression data to the results based on the normalized data and observed a stronger pathway enrichment in WGCNA gene modules and a more clinically relevant clustering by sample type based on the normalized data. The manuscript is clearly written, with the exception of a few minor passages.

My general concerns are the novelty of the proposed methodology, the evaluation approach, and a very minimal discussion of the actual known and new gene signatures detected by this method.

Major comments:

1. The direct advantage of this method to the discovery of known and new markers of advanced cancer stages is not described to a full extent. Detecting MAP4K1 (Fig. 8,9) whose upregulation is known to be associated with better outcomes is an interesting finding. Could this finding be made by a more basic differential expression analysis between early and late stages of cervical cancer? Does this method detect other markers of late cervical cancer?

More generally, it is not fully clear how the analysis of multiple cancers and the use of WGCNA modules bring new specific predictive gene markers.

2. It would be important to put these findings in the context of

a. known progression of gene expression in these specific cancers described in the literature, and

b. more basic approach of simply detecting differentially expressed genes (DEGs) between early and late stage for each cancer type in the evaluation dataset.

3. Lines 420-421:

“The clustering of DLBCL and liver samples in Cluster 2, could suggest that the two cancers progress in a very similar manner.”

This grouping of two cancers in the same cluster is not a sufficiently strong argument to make this far-reaching conclusion. To make a more compelling argument, it would be important to analyze and interpret specific gene signatures that are shared between progression of these two very different cancers from very different tissue environments.

4. Conclusions about a clinically relevant clustering of samples from different subtypes of testicular cancer (lines 448-451):

Could these and similar conclusions about grouping of cancer subtypes be achieved by a more basic DEG analysis and clustering these samples within each cancer type by the expression of DEGs?

5. The rationale behind the choice of a relatively limited evaluation dataset was not fully described. Lines 108-110:

“The datasets examined were selected based on cancers known to have an increased survival risk among patients due to associated autoimmune diseases.”

The numbers of these patient samples (Table 1) are relatively small.

Why this selection of cancer types was made? Would autoimmune signatures contribute as an unrelated confounding factor to late-stage cancer expression signatures?

6. All conclusions about functional gene enrichment in detected WGCNA modules are based on the WikiPathway database. WikiPathway is only one of the public pathway databases, and arguably not the best one. It would be beneficial to analyze the enrichment of pathways from other more established resources, e.g. MSigDB, KEGG, Reactome, ChEA gene categories, as well as TF LOF and TF perturbation gene collections available from EnrichR.

7. Although some of the pathways in Fig. 6 are general pan-cancer gene sets, e.g. pathways related to cell cycle or DNA damage, some of these pathways correspond to specific cancers (gastric cancers, breast cancer, etc) that are unrelated to the analyzed cancer types. This inconsistency and specific genes detected in these unrelated cancer pathways need to be addressed.

Minor comments:

1. Lines 126-127: Xena database provides not read counts but TPM values. x should be noted as TPM.

2. Text labels in Figures 2,3,5 are not legible.

3. Language needs to be clarified in a few passage throughout the text, e.g. lines 425-426:

“the average overall survival rate of patients with cervical cancer represented in Cluster 5 lived longer than the patients with cervical cancer in Cluster 6”

4. Heatmaps and color bars in Fig. 3,4,7: Coloring modules by bars and labeling these modules in text by the same color name is redundant. It would be more informative to have at least a very general and brief text labels by the functions of these modules.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Enrique Hernandez-Lemus

8 Dec 2022

PONE-D-22-16548R1A normalization method to track cancer progression with RNA-Seq gene expression in the multi-stage early to advanced-stage cancer developmentPLOS ONE

Dear Dr. Bendou,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

 Please present a point-by-point response to all the comments by Reviewer 2 and modify your manuscript accordingly.

Please submit your revised manuscript by Jan 22 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed satisfactorily my comments and I appreciate that the authors made their code available.

Reviewer #2: The paper describes application of a previously published method proposed (“tissue-corrected” normalization approach of reference 4) to the comparison of samples from early vs late stages of cancer. Applying an existing method in a somewhat different context does not amount to a new methodological development.

My few main concerns have not been fully addressed in the revised manuscript.

1. This manuscript does not include any benchmarking, apart from showing an interesting but anecdotal consistency with previously reported biological results, even in the revised version, which describes a biologically meaningful clustering of 12 PDCD1 associated genes. A more rigorous large-scale methodological evaluation and comparison with other computational approaches would be important.

2. To show the value of the method to potential users, in addition to valuable examples of confirming previously known gene associations, it would be important to illustrate how this method can help generate new hypotheses of previously uncharacterized biological associations or mechanisms.

Based on the authors' response about the rationale for using a small dataset focused on a particular cancer type, this manuscript almost seems a pilot to a bigger project focused on HIV effects on Diffuse Large-B cell Lymphoma. This bigger project, hopefully with a larger patient cohort, larger body of results and hopefully new mechanistic hypotheses, may warrant a publication in this journal.

Minor comments:

1. My question about the interaction of cancer-specific signatures and autoimmune responses was not addressed by the authors:

Would autoimmune signatures contribute as an unrelated confounding factor to late-stage cancer expression signatures?

2. “Therefore, the use of the WikiPathway database was sufficient in providing the knowledge we required in the context of this specific investigation."

Using multiple databases of functional gene categories in addition to WikiPathways would be important, especially in the context of this manuscript where functional enrichment is a major source of confirming that results are reasonable. Some of these additional databases include more mechanistic knowledge, e.g. empirically validated TF targets, which would be beneficial for generating new mechanistic hypotheses based on the method's results.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 2

Enrique Hernandez-Lemus

7 Mar 2023

PONE-D-22-16548R2A normalization method to track cancer progression with RNA-Seq gene expression in the multi-stage early to advanced-stage cancer developmentPLOS ONE

Dear Dr. Bendou,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Some minor issues mostly related to the presentation are made by the reviewers (figures, p-value formats, etc.) please address them in your further revised version.

Please submit your revised manuscript by Apr 21 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Overall I think the manuscript lacks some clarity. I have the following comments:

- Figure 2 is not needed in my opinion. The point that it makes is that the same pathways were enriched in several modules, this could just be mentioned in the text.

- In line 75-82 the use of the word normalized is confusing. The authors give an introduction to RNA-seq and where they mention that raw counts need to be normalized to be used for further analysis. I do agree but normalizing raw counts reads is the normal standard practice, although in a different way in which authors propose normalization in this work. The paper in which they base their normalization method actually uses TPM which is a normalization strategy for RNA-seq. I think the authors need to address why they used unnormalized values (raw reads) as input instead of TPMs, and also point out in the introduction that they are proposing a normalization strategy that does not replace standard normalization but it is specific to their research question. Furthermore, comparing unnormalized gene expression of different samples makes no sense. For figure 1 they would have to explain that even normalized expression values such as tpm would still not address the issue depicted in Figure 1. The authors here propose normalization of gene expression for late stage cancer samples using early stage cancer samples. I understand what the author means but then need to clarify that their normalization method is not the same kind of normalization done to RNA-seq raw counts data such as accounting for library size.

- In line 244 the use of the word implemented is inaccurate. In computer science "implementing" implies writing code, not just executing or running a program with your own dataset.

- Most of the p values are not represented in a standard way (missing zeros before decimal points among others).

- I dont think screened is the appropriate verb here (line 360).

- Not sure what the next phase menas: "It was also found that the brown module has the highest enrichment". Enrichment score?, number of enriched pathways? pathways with the most genes?

- Modules can be enriched with certain pathways, but genes are not enriched, therefore all instances of "enriched genes" should be corrected.

- Line 401 "The WGCNA module genes were further subjected to TF" -> TF enrichment analysis

- "Moreover, they showed that PDCD1 significantly co-expressed with the following 15 genes, whose high expression is beneficial for cervical cancer patients; MAP4K1, ACAP1, CST7, CXCR6, GPR171, GZMH, GZMK, P2RY10, RASAL3, SH2D1A, TBC1D10C, ZNF831, GZMM, JAKMIP1, and PSTPIP1 [54]." Here the reference is a link to the human protein atlas for MAP4K1, so the reference only refers to the first gene and it seems like the authors claim it is for all 15 genes. If it is the case that they tried all 15 genes maybe they should include those figures as supplementary, for all 15 genes, or further explain in the text that they tested the genes in a website. As it currently stands readers expect to find a reference to a paper that claims that high expression of all 15 genes are beneficial for cervical cancer patients which is not what the reference is.

- In line 483 there is a claim that more than one tool was utilized for TF enrichment analyisis and I believe only one was used.

Reviewer #2: The authors addressed most of my comments in this new revision. As a response to the authors' question in the rebuttal, I suggest that functional pathway enrichment analyses based on KEGG database (Fig. 2 and 3 of rebuttal) should be included in the manuscript as supplementary figure(s). This is a minor recommendation that I trust the authors to follow through.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 3

Enrique Hernandez-Lemus

3 Apr 2023

Transforming RNA-Seq gene expression to track cancer progression in the multi-stage early to advanced-stage cancer development

PONE-D-22-16548R3

Dear Dr. Bendou,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Enrique Hernandez-Lemus

13 Apr 2023

PONE-D-22-16548R3

Transforming RNA-Seq gene expression to track cancer progression in the multi-stage early to advanced-stage cancer development.

Dear Dr. Bendou:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Enrique Hernandez-Lemus

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Uncorrected RNA-Seq data were inserted into WGCNA to identify gene modules.

    (A) Soft threshold power. (B) Gene clustering tree. Each colour underneath the dendrogram shows the module assignment, and branches above represent the genes. The dynamic tree cut shows the initial module detection and merged dynamic indicates the modules divided according to their similarity. (C1) Module eigengene dendrogram identified groups of correlated modules. The red line indicates the module eigengene threshold of 0.25 and (C2) Eigengene adjacency heatmap of different gene co-expression modules. In the heatmap, the blue colour represents low adjacency, while the red represents high adjacency. (D) Barplot of 32 co-expression modules constructed after similar modules were merged with module size at the top of each bar.

    (TIF)

    S2 Fig. KEGG enrichment of gene modules detected by WGCNA from the uncorrected RNA dataset using the ORA, WebGestalt.

    (TIF)

    S3 Fig. Tissue-corrected dataset were inserted into WGCNA to identify gene modules.

    (A) Soft threshold power. (B) Gene clustering tree. Each colour underneath the dendrogram shows the module assignment, and branches above represent the genes. The dynamic tree cut shows the initial module detection and merged dynamic indicates the modules divided according to their similarity. (C1) Module eigengene dendrogram identified groups of correlated modules. The red line indicates the module eigengene threshold of 0.25 and (C2) Eigengene adjacency heatmap of different gene co-expression modules. In the heatmap, the blue colour represents low adjacency, while the red represents high adjacency. (D) Barplot of seven co-expression modules constructed after merged modules with module size at the top of each bar.

    (TIF)

    S4 Fig. KEGG enrichment of gene modules detected by WGCNA from the tissue-corrected RNA dataset using the ORA, WebGestalt.

    (TIF)

    S5 Fig. Heatmap of tissue-corrected RNA-Seq data of late-stage cancer samples normalized with normal tissue samples, illustrating module expression within cancer clusters.

    Normal tissue expression dataset was obtained from the Genotype-Tissue Expression (GTEx) Portal. To match the number of male/female ratios as in the late-stage cancer samples, the same number normal tissue samples of male/female ratios were randomly selected, except for cervical cancer, which only had 10 normal tissue samples. The colour bar on the left shows modules identified by WGCNA and enriched for functional pathway annotations. The rows are further composed of protein-coding genes with expression values obtained after data normalization. Clusters of similar cancer cohorts are indicated across the top and the cancer cohort are displayed by the colour bar along the top with the key on the right. *Primary sites abbreviations: CESC = Cervical squamous cell carcinoma; DLBCL = Diffuse Large B-cell Lymphoma; LIHC = Liver Hepatocellular Carcinoma; LUAD = Lung Adenocarcinoma; TGCT = Testicular Germ Cell Tumors.

    (TIF)

    S1 File. The data underlying the results presented in this study are publicly accessible from the UCSC Xena data browser (https://xenabrowser.net) from individual cancer cohorts.

    (PDF)

    S2 File. Top 5 TFs derived from the ChEA3 enrichment analysis of each tissue-corrected WGCNA module.

    The biological role indicates the role of the identified TF in cancer according to literature.

    (PDF)

    S3 File. Transcription factors (TFs) enrichment analysis of tissue-corrected WGCNA brown module.

    A list of TFs and their corresponding rank according to ARCHS4 co-expression, with documented information about their biological function associated with survival in the context of cervical cancer. The genes in bold were previously found [55] to play a role in cervical cancer survival.

    (PDF)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    The RNA-Seq gene expression and curated clinical public datasets underlying the results presented in the study are available from the following direct URLs: TCGA-DLBC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-DLBC.htseq_counts.tsv.gz and https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/GTEX_phenotype.gz),TCGA-LUAD (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.GDC_phenotype.tsv.gz), TCGA-LIHC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LIHC.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LIHC.GDC_phenotype.tsv.gz), TCGA-CESC (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CESC.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CESC.GDC_phenotype.tsv.gz), and TCGA-TGCT (https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-TGCT.htseq_counts.tsv.gz and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-TGCT.GDC_phenotype.tsv.gz). The normal tissue were obtained from the Genotype-Tissue Expression (GTEx) Portal as follows: https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/gtex_gene_expected_count.gz and https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/GTEX_phenotype.gz, for the expression and phenotype information, respectively (S1 File).


    Articles from PLOS ONE are provided here courtesy of PLOS

    RESOURCES