Computational and Structural Biotechnology Journal. 2023 Aug 28;21:4238–4251. doi: 10.1016/j.csbj.2023.08.029

Comprehensive analysis of circulating cell-free RNAs in blood for diagnosing non-small cell lung cancer

Yulin Liu, Yin Liang, Qiyan Li, Qingjiao Li
PMCID: PMC10491804  PMID: 37692082

Abstract

Early screening and detection of non-small cell lung cancer (NSCLC) are crucial because of the low survival rate in advanced stages. Blood-based liquid biopsy is a non-invasive test that assists disease diagnosis, and cell-free RNA is one of the promising biomarkers in blood. However, disease-related signatures have not been fully explored in most cell-free RNA transcriptome sequencing (cfRNA-Seq) datasets. To address this gap, we developed a comprehensive cfRNA-Seq data analysis pipeline and constructed a machine learning model to facilitate non-invasive early diagnosis of NSCLC. We identified differential mRNAs, lncRNAs and miRNAs from cfRNA-Seq that were significantly associated with the development and progression of lung cancer. The classifier based on gene expression signatures achieved an area under the curve (AUC) of up to 0.9, indicating high specificity and sensitivity in both cross-validation and an independent test. Furthermore, analysis of the T cell and B cell immune repertoires extracted from cfRNA-Seq provided insights into the immune status of cancer patients, while microbiome analysis revealed distinct bacterial and viral profiles between NSCLC and normal samples. In future work, we aim to validate the existence of cancer-associated T cell receptors (TCR)/B cell receptors (BCR) and microorganisms, and subsequently integrate all identified signatures into the diagnostic model to improve prediction accuracy. This study not only provides a comprehensive analysis pipeline for cfRNA-Seq datasets but also highlights the potential of cfRNAs as promising biomarkers and models for early NSCLC diagnosis, emphasizing their importance in clinical settings.

Keywords: Non-small cell lung cancer, Diagnosis model, CfRNA-Seq, Microbiome, Immune repertoire

1. Introduction

Lung cancer is the most prevalent cancer globally and the leading cause of cancer-related deaths in the United States, claiming approximately 350 lives each day [1]. There are two main types of lung cancer, small-cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), with the latter accounting for approximately 80%–85% of cases. NSCLC can be further categorized into subtypes such as lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), large cell carcinoma and others, based on the specific cells affected by cancer. The primary risk factor for NSCLC is tobacco smoking. However, new NSCLC cases have been decreasing in the United States in recent years owing to extensive tobacco cessation efforts [1], [2]. Advances in molecular targeted therapy and immunotherapy have significantly improved the survival rates of NSCLC patients. Nevertheless, individuals diagnosed with advanced-stage NSCLC still face limited survival times. In contrast, early detection of NSCLC offers the best prognosis, with an 80% five-year survival rate [3]. Imaging techniques such as X-ray and CT scanning are frequently used for the early diagnosis and screening of lung cancer [4]. However, these approaches have limitations, including radiation exposure and limited sensitivity. Recurrence, progression, and metastasis can occur in patients with NSCLC, and invasive tissue biopsy offers only a glimpse of the localized state of the tumor at a specific moment, so it cannot give a holistic picture of the patient's general condition [5]. As a result, exploring methodologies for the early detection of NSCLC holds the key to improving treatment outcomes.

Liquid biopsy provides a non-invasive approach for detecting, monitoring and investigating diseases by analyzing various body fluids, mainly peripheral blood but also urine, cerebrospinal fluid and others. Because samples can be taken repeatedly, liquid biopsy allows a patient's condition to be assessed on an ongoing basis. This enables real-time monitoring of dynamic molecular-level alterations within tumors and a prompt understanding of changes in the patient's condition. It encompasses the analysis of different materials, including DNA, RNA, proteins, exosomes and circulating whole cells. In blood, two types of circulating RNAs provide valuable information about health status: RNA fragments released by apoptotic cells and exosomal RNAs secreted by various live cells. These are collectively referred to as circulating cell-free RNAs (cfRNAs). Recent studies have highlighted the significance of various types of RNAs in blood, such as messenger RNA (mRNA), long non-coding RNA (lncRNA), microRNA (miRNA) and circular RNA (circRNA), as valuable biomarkers for disease diagnosis. cfRNA transcriptome sequencing (cfRNA-Seq) has been employed to identify biomarkers in a range of medical fields, including obstetrics, oncology, bone marrow transplantation, Alzheimer's disease, pregnancy-related complications, and liver disease [6], [7], [8], [9], [10], [11]. By examining the abundance of different RNA species, alterations in gene expression associated with diseases can be identified. Moreover, variations such as alternative splicing can also be identified through cfRNA sequencing data analysis. During tumor progression, the adaptive immune system responds to tumor antigens, which involves the proliferation of T cells and B cells [12], [13]. It is anticipated that the TCR and BCR in tumor tissues undergo cancer-specific changes, which may be discernible from cfRNA sequencing data obtained from blood samples.

In addition to RNA from the host, circulating cfRNA from the microbiome is also present in the blood. The human microbiome plays crucial roles in various physiological processes and in the development of disease through its interactions with the host across multiple organs [14]. Studies have indicated that plasma cell-free DNA (cfDNA) originating from the microbiome can distinguish different types of cancer, and that including cfRNA from both the human host and the microbiome in plasma can improve the accuracy of cancer diagnosis. Despite the potential significance of cfRNAs from both human and microbial sources, most cfRNA-Seq studies have primarily focused on gene expression analysis, largely disregarding the RNA fragments derived from bacteria and viruses. As a result, valuable microbiome signals are overlooked, and the insights that could be gained from publicly available cfRNA-Seq datasets are not fully exploited.

In this work, we aim to investigate the potential of using cfRNA-Seq technology to identify biomarkers for the diagnosis of NSCLC (Fig. 1). We utilized two publicly available cohorts, comprising 35 NSCLC samples and 46 normal controls, and 6 NSCLC and 6 normal samples, respectively. In addition, we enrolled 10 healthy donors as the Inhouse cohort and performed cfRNA-Seq on them. We first analyzed the transcriptome characteristics of cfRNA in NSCLC patients and constructed machine learning models for non-invasive early diagnosis. The results demonstrated that the gene expression profiles obtained through cfRNA-Seq exhibited a robust capability to distinguish NSCLC patients from normal individuals. Moreover, we extracted TCR/BCR information as well as microbial information from the cfRNA-Seq datasets. The analysis of these data revealed altered immune statuses and variations in the microbiome composition among cancer patients. The identification of cancer-associated TCR/BCR signatures and microbial markers derived from cfRNAs holds promise for further improving the diagnosis of NSCLC in future studies.

Fig. 1. Schematics of the study design. (A) A total of 41 lung cancer samples and 52 healthy human samples were obtained from two publicly available cfRNA datasets, the TP cohort (PRJNA729258) and the SU cohort (PRJNA589238). Additionally, 10 healthy samples were collected to form the Inhouse cohort. (B) The research methodology involved conducting differential gene analysis, functional enrichment analysis, and survival analysis to investigate the association of cfRNAs with the progression of NSCLC. (C) The primary objectives of the study were to identify cfRNA markers associated with NSCLC and to develop a diagnostic algorithm capable of predicting early-stage non-small cell carcinoma noninvasively. (D) TCR/BCR and microbial information in blood were extracted from cfRNA-Seq data to investigate their potential roles in the context of NSCLC.

2. Materials and methods

2.1. Data download

The cfRNA-Seq data used in this study were obtained from two cohorts: the Southeast University Cohort (SU) and the Tsinghua University and Peking University Joint Cohort (TP), which together consisted of 41 lung cancer samples and 52 normal samples. The raw sequencing reads and corresponding clinical information were downloaded from the NCBI Sequence Read Archive (SRA) database under accession numbers PRJNA729258 (TP cohort) and PRJNA589238 (SU cohort). The TP cohort comprised 35 NSCLC samples and 46 normal samples, while the SU cohort included 6 NSCLC samples and 6 normal samples. Notably, a large proportion (86%) of the cancer patients in the TP cohort were diagnosed at an early stage, and all cancer patients in the SU cohort were also in the early stage of the disease.

2.2. Sample collection

The Inhouse cohort for this study comprised blood samples collected from 10 healthy individuals at the Eighth Affiliated Hospital of Sun Yat-Sen University. The clinical study was conducted in accordance with ethical guidelines and approved by The Medical Scientific Research Ethics Committee of The Eighth Affiliated Hospital, Sun Yat-Sen University (approval number: 2022–013–02). All necessary ethical regulations were adhered to throughout the study, and written informed consent was obtained from all participants prior to their involvement in the research.

2.3. Cell-free RNA isolation and whole transcriptome sequencing

The blood samples were centrifuged at 1600 × g for 10 min at 4 °C to collect the plasma fraction. The resulting plasma was carefully transferred to a separate tube. Subsequently, the plasma was centrifuged once again at 16,000 × g for another 10 min at 4 °C to eliminate any residual cells. The double-spun plasma was then stored at −80 °C and thawed at room temperature before use. Total RNA was extracted from 1 mL of plasma using the miRNeasy serum/plasma kit (Qiagen, cat 217204). Following the extraction, the sequencing libraries were prepared from 8 μL of RNA using the Revelo RNA-Seq High Sensitivity library preparation kit (Tecan, cat 30184204). The libraries were then sequenced on the Novaseq 6000 platform (Novogene, Beijing, China) in 150 bp paired-end mode.

2.4. Data processing

In this study, an in-house bioinformatics workflow managed with Snakemake [15] was employed to process the cfRNA-Seq data in the following steps. First, the SRA files were converted to FASTQ files using fasterq-dump (v.2.10), and the data quality was assessed using FastQC (v.0.11.9). Second, TrimGalore (v.0.6.7) (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with Cutadapt (v.4.0) [16] was used to trim adapters and remove low-quality bases from the reads. Third, STAR (v.2.7.10a) [17] was employed to align the clean reads to the human reference genome (GRCh38), followed by PCR duplicate removal with Samtools (v.1.15.1) [18]. Finally, HTSeq-count (v.2.0) [19] was employed to quantify the abundance of RNAs (Homo_sapiens.GRCh38.105) with the parameter --stranded reverse for the TP cohort and --stranded no for the SU cohort.

The reads in cfRNA libraries that failed to align with the human reference genome were extracted for microbial analysis. We built a customized database from the RefSeq bacterial reference, which was downloaded from the NCBI on 4 May 2021, using Kraken2-build, and performed taxonomic assignment using the k-mer based classifier Kraken2 [20] with default parameters. In combination with the Kraken2 classifier, Bracken (v.2.6.2) [21] was used to estimate, based on Bayesian probability, the number of reads originating from each taxon (including bacteria and viruses) present in each sample.

2.5. Sample quality control

To assess sample quality, several metrics were evaluated, including the mapping rate to the human, bacterial and viral reference genomes, the rRNA rate and DNA contamination. The rRNA rate is determined by dividing the counts mapped to known ribosomal RNAs by the total counts mapped to human genes. DNA contamination is estimated by quantifying the reads mapped to intronic regions as compared to exonic regions of the genome. To ensure the quality of the samples for downstream analysis, the following criteria were applied: (1) raw reads > 1 million; (2) clean reads (reads remaining after trimming low-quality and adaptor sequences) > 1 million; (3) the number of detected genes > 5000; (4) the ratio of classified reads (mapped to either human or microbiome) > 0.5; (5) rRNA rate < 0.5; (6) the ratio of DNA contamination < 5. Samples meeting all of these criteria were selected for downstream analysis.
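The six criteria above can be written as a single predicate. The following is a minimal sketch rather than the pipeline's actual code: the `SampleQC` field names are hypothetical, and DNA contamination is assumed to be the plain ratio of intronic to exonic reads, which the text implies but does not define exactly.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Per-sample QC inputs; field names are illustrative, not from the paper."""
    raw_reads: int
    clean_reads: int
    detected_genes: int
    classified_ratio: float  # fraction of reads assigned to human or microbial references
    rrna_counts: int         # counts mapped to known ribosomal RNAs
    human_gene_counts: int   # total counts mapped to human genes
    intronic_reads: int
    exonic_reads: int


def rrna_rate(s: SampleQC) -> float:
    """rRNA counts divided by total counts mapped to human genes."""
    if s.human_gene_counts == 0:
        return float("inf")
    return s.rrna_counts / s.human_gene_counts


def dna_contamination(s: SampleQC) -> float:
    """Assumed definition: intronic reads divided by exonic reads."""
    if s.exonic_reads == 0:
        return float("inf")
    return s.intronic_reads / s.exonic_reads


def passes_qc(s: SampleQC) -> bool:
    """True only if the sample meets all six criteria listed in the text."""
    return (
        s.raw_reads > 1_000_000
        and s.clean_reads > 1_000_000
        and s.detected_genes > 5000
        and s.classified_ratio > 0.5
        and rrna_rate(s) < 0.5
        and dna_contamination(s) < 5
    )
```

A sample is kept only when every condition holds, so failing any single threshold (for example, fewer than 5000 detected genes) is enough to exclude it.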

2.6. Gene expression analysis

DESeq2 was used to identify differentially expressed genes (DEGs) between lung cancer and healthy samples, including mRNAs, lncRNAs and microRNAs. DEGs were determined based on |log2FC| > 1 and adjusted p-value < 0.05. For functional annotation of the differential mRNAs, clusterProfiler [22], an R package, was used to perform Gene Ontology (GO) [23] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [24]. The GO terms were categorized into biological process (BP), molecular function (MF) and cellular component (CC). The enriched GO terms and KEGG pathways were selected based on a significance level of p-value < 0.05. To gain further insights into the interactions among the differential mRNAs, Protein-Protein Interaction (PPI) network analysis was conducted. The PPI network was constructed using the STRING database [25] with a score_threshold = 900. Subsequently, CytoHubba [26], a plugin of Cytoscape, was used to identify and visualize the hub genes (with a score value ≥ 60). Moreover, MCODE [27], another plugin of Cytoscape, was applied to identify modules with highly interconnected regions.
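The DEG thresholds stated above amount to a simple rule applied to each row of the DESeq2 output. The sketch below illustrates that rule only; the `DEResult` type and the gene names are hypothetical, and a positive log2 fold change is assumed to mean higher expression in NSCLC than in healthy samples.

```python
from typing import NamedTuple, Optional


class DEResult(NamedTuple):
    gene: str
    log2fc: float
    padj: Optional[float]  # DESeq2 can report NA; represented here as None


def classify_deg(result, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return 'up', 'down', or 'ns' (not significant) for one gene."""
    if result.padj is None or result.padj >= padj_cutoff:
        return "ns"
    if result.log2fc > lfc_cutoff:
        return "up"
    if result.log2fc < -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical rows, for illustration only.
rows = [
    DEResult("GENE_A", 2.3, 0.001),
    DEResult("GENE_B", -1.8, 0.01),
    DEResult("GENE_C", 0.4, 0.0001),  # significant but below the fold-change cutoff
    DEResult("GENE_D", 3.0, None),    # no adjusted p-value available
]
labels = {r.gene: classify_deg(r) for r in rows}
```

Both conditions must hold at once: a gene with a very small adjusted p-value but a modest fold change is not counted as a DEG.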

The function of differential lncRNAs was investigated through two methods: (1) conducting extensive literature reviews; (2) annotating lncRNAs directly using lncSEA [28].

To explore the function of differential miRNAs, we employed two methods: (1) performing a comprehensive literature review; (2) predicting the target mRNAs of differential miRNAs using both miRDB (V6.0) [29] and TargetScan (V7.2) (https://www.targetscan.org/vert_72/), and performing GO and KEGG pathway enrichment analysis on the genes present in both databases. Additionally, a miRNA-mRNA network was constructed and visualized using Cytoscape to depict the interactions between the differentially expressed miRNAs and their target mRNAs.

2.7. Survival analysis

The differential mRNAs for TCGA-LUAD (lung adenocarcinoma) and TCGA-LUSC (lung squamous cell carcinoma) were downloaded from the GEPIA2 website (http://gepia2.cancer-pku.cn/), using the Limma method with the criteria of padj < 0.05 and |log2FoldChange| > 1. The overlap between the differential mRNAs identified from cfRNA-Seq data and the TCGA datasets was visualized with VennDiagram (R package) for LUAD and LUSC. The common differential mRNAs between cfRNA-Seq and TCGA-LUAD or TCGA-LUSC were selected for survival analysis. The overall survival (OS) analysis based on gene expression was performed on the samples from the TCGA database using GEPIA2. A log-rank test was performed to evaluate the survival time difference between the high-expression cohort (top 75%) and the low-expression cohort (bottom 25%).
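GEPIA2 carries out the log-rank comparison internally. To make the statistic itself concrete, the following standard-library sketch implements the usual two-group log-rank test. The function name and inputs are hypothetical, an event flag of 1 is assumed to mean death and 0 censoring, and the sketch is not a description of GEPIA2's implementation.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event was observed, 0 if censored.
    Returns (chi-square statistic, p-value) with one degree of freedom.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        n = n_a + n_b
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d = sum(1 for ti, e, _ in data if ti == t and e)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

The two groups would be the high-expression and low-expression patients for one gene; how those groups are cut from the expression distribution is left to the caller.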

2.8. Diagnosis model constructed with the gene expression signature

Two classical machine learning approaches (random forest and logistic regression) were used to construct a novel classification model for distinguishing NSCLC samples from normal samples, and their performance was compared. Random forest (RF) is an ensemble learning method for classification, which operates by constructing many decision trees and aggregating their results by voting. Because of the imbalanced distribution of lung cancer and normal samples, a balanced random forest classifier with a subsampling strategy was adopted to yield unbiased predictions. Logistic regression (LR) is one of the simplest machine learning algorithms; it is easy to implement and interpret, and efficient to train. LR assumes no multicollinearity between the predictor variables.

The sample size in this work is small and the feature space is high-dimensional; therefore, we need to exclude as many redundant features as possible to avoid overfitting. Partial Least-Squares Discriminant Analysis (PLS-DA) is a supervised version of Principal Component Analysis (PCA), which achieves dimensionality reduction with full awareness of the class labels. PLS-DA has been recommended for use in omics data analysis, especially for feature selection and classification [30]. A VIP (variable importance in the projection) score is a measure of a variable's importance in the PLS-DA model. The features with a VIP score larger than a cutoff were selected as the top features. For fair evaluation, the classification performance (AUC, specificity and sensitivity) of the novel diagnosis model was validated by both cross-validation and an independent test set.
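For reference, the performance measures named above can be computed directly from labels and predictions. The following is a standard-library sketch, not the authors' ROCR-based evaluation: labels are assumed to be coded 1 for NSCLC and 0 for normal, both classes are assumed to be present, and the function names are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # fraction of NSCLC samples called NSCLC
    specificity = tn / (tn + fp)  # fraction of normal samples called normal
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    sample scores higher than a random negative one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The AUC depends only on how the classifier ranks samples, whereas sensitivity and specificity depend on the decision threshold used to turn scores into class calls.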

The model was constructed in the following steps. First, we randomly split the TP cohort into training and test datasets at a ratio of 2:1. Then, for each split, PLS-DA was used to select features based on the training set, and the classification model was built on the training set and validated on the test set. These two steps were repeated 100 times through random sampling. Finally, the same feature selection and classification were performed on the whole TP cohort, and the performance was validated on the independent test set, which combined the SU cohort and the Inhouse cohort.
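The repeated 2:1 splitting described above can be sketched as a small generator. This is an illustration under stated assumptions, not the authors' code: the split is a plain random shuffle (the text does not say whether it was stratified by class), and the function name, seed and sample identifiers are hypothetical.

```python
import random


def repeated_splits(sample_ids, n_repeats=100, train_fraction=2 / 3, seed=0):
    """Yield (train_ids, test_ids) pairs from repeated random 2:1 splits."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    n_train = round(len(ids) * train_fraction)
    for _ in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# Hypothetical sample identifiers, for illustration only.
cohort = [f"sample_{i}" for i in range(74)]
for train_ids, test_ids in repeated_splits(cohort):
    # Feature selection (PLS-DA VIP scores) and classifier fitting would use
    # train_ids only; test_ids are held out for evaluating that split.
    pass
```

Running feature selection inside each split, on the training samples only, keeps the held-out samples from influencing which genes are chosen.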

The normalized gene expression profile in fragments per kilobase per million mapped fragments (FPKM) was used as the input. PLS-DA was implemented with the ropls package in R. The machine learning algorithms were implemented with Scikit-learn in Python (https://scikit-learn.org/). We set the number of decision trees to 1000. All predictions on the test datasets were used to calculate the area under the ROC curve, sensitivity, specificity and accuracy with the ROCR package in R [31].
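As a reminder of what the FPKM unit means, the standard definition can be sketched as follows. The text does not state how gene lengths were derived or which fragments were counted in the total, so both are assumptions here, and the function and gene names are hypothetical.

```python
def fpkm(fragment_counts, gene_lengths_bp):
    """Standard FPKM: fragments per kilobase of gene per million mapped fragments.

    fragment_counts: {gene: mapped fragment count}
    gene_lengths_bp: {gene: gene length in base pairs}
    The total is taken over the genes supplied, which is an assumption.
    """
    total = sum(fragment_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in fragment_counts}
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total)
        for gene, count in fragment_counts.items()
    }


# Hypothetical two-gene example: equal counts, but GENE_B is twice as long.
values = fpkm({"GENE_A": 500, "GENE_B": 500}, {"GENE_A": 1000, "GENE_B": 2000})
```

In the example the two genes have equal counts, but the longer gene receives half the FPKM value, which is the length correction the unit is designed to provide.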

2.9. Immune repertoire reconstruction from RNA-seq data

The TRUST4 tool [32] was used to extract the CDR3 sequences and relative abundance of TCR/BCR. The BAM file (including unmapped reads) was used to infer the CDR3 RNA and amino acid sequences based on the contigs assembled from the unaligned reads of each sample. The complexity of the immune system is manifested through the diversity and clonality of TCR and BCR. The diversity of the TCR/BCR was quantified using the Shannon–Wiener index (Shannon entropy) [33], which takes into account both the number of clonotypes present and the relative abundance or distribution of each clonotype. To assess the clonality of the TCR/BCR, we classified the degree of clonal amplification into four classes for comparison: 0–0.001, 0.001–0.01, 0.01–0.1, and 0.1–1.
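The two repertoire summaries described above can be sketched from a list of per-clonotype read counts. Several details are assumptions: the natural logarithm is used for the Shannon index, each class interval is treated as closed on its upper bound, and the clonality summary counts clonotypes per class rather than the share of the repertoire they occupy. The function names are hypothetical.

```python
import math

CLASS_UPPER_BOUNDS = [0.001, 0.01, 0.1, 1.0]
CLASS_LABELS = ["0-0.001", "0.001-0.01", "0.01-0.1", "0.1-1"]


def shannon_index(clone_counts):
    """Shannon entropy H = -sum(p_i * ln p_i) over clonotype frequencies."""
    total = sum(clone_counts)
    return -sum((c / total) * math.log(c / total) for c in clone_counts if c > 0)


def clonality_classes(clone_counts):
    """Count clonotypes whose relative frequency falls in each of the four classes."""
    total = sum(clone_counts)
    tally = {label: 0 for label in CLASS_LABELS}
    for c in clone_counts:
        if c <= 0:
            continue
        freq = c / total
        for upper, label in zip(CLASS_UPPER_BOUNDS, CLASS_LABELS):
            if freq <= upper:  # assumed: upper bound inclusive
                tally[label] += 1
                break
    return tally
```

A repertoire dominated by a few expanded clones has a low Shannon index and most of its abundant clonotypes in the 0.1–1 class, whereas an evenly distributed repertoire has a high index.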

DeepCAT [34] is a convolutional neural network-based computational approach for identifying cancer-associated beta chain TCR hypervariable CDR3 sequences. We prepared the TCR/BCR data from the RNA-seq source in accordance with the specifications of the README file to be used as DeepCAT's input data. We then downloaded the CHKP folder and used the PredictCancer function in the package according to the README file. Finally, we generated the plots with the ggplot2 package in RStudio.

2.10. Microbiome analysis

To eliminate potential contamination, we applied a filtering process to exclude bacterial genera previously reported as common laboratory contaminants [35]. Additionally, we removed bacterial species belonging to these genera to further minimize contamination. Subsequently, we performed predominant microbiome analysis, alpha-diversity (Richness index, Shannon index) analysis and beta-diversity (Bray–Curtis distance) analysis on the relative abundance profiles at the species and genus levels for bacteria and viruses, respectively, with the R package vegan; p-values were calculated with the Wilcoxon test. To identify significantly differential microbial abundance, we employed the Wilcoxon test and considered p-value < 0.05 as statistically significant.
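The diversity measures named above have simple closed forms. The sketch below shows richness and the Bray–Curtis dissimilarity on relative abundance profiles using only the standard library; it mirrors the usual definitions rather than the vegan code used in the study, and the function and taxon names are hypothetical.

```python
def relative_abundance(read_counts):
    """Convert {taxon: read count} into {taxon: fraction of all reads}."""
    total = sum(read_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: count / total for taxon, count in read_counts.items()}


def richness(profile):
    """Number of taxa with non-zero abundance in one sample."""
    return sum(1 for value in profile.values() if value > 0)


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0) - profile_b.get(t, 0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0) + profile_b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical taxa, for illustration only.
sample_1 = relative_abundance({"taxon_x": 50, "taxon_y": 50})
sample_2 = relative_abundance({"taxon_x": 50, "taxon_z": 50})
distance = bray_curtis(sample_1, sample_2)
```

The dissimilarity is 0 for identical profiles and 1 for samples that share no taxa; the two example samples share half of their abundance and therefore sit midway.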

2.11. Statistical analysis

All statistical analyses and visualizations were performed in R version 4.0.5 (https://cran.r-project.org) and RStudio version 1.2.5033 (https://www.rstudio.com).

3. Results

3.1. Data quality of samples

The sequencing information for all samples is summarized in Supplementary Table S1. Two NSCLC samples and five normal samples from the TP cohort were excluded from the analysis. Exclusion was based on the following criteria: (1) the number of detected genes < 5000; (2) the ratio of DNA contamination > 5; (3) the ratio of unclassified reads > 0.5. All samples from the SU and Inhouse cohorts met the quality criteria and were included in the analysis.

3.2. cfRNA profile alterations in NSCLC patients

In the TP cohort, we identified 4742 differential mRNAs (648 up-regulated and 4094 down-regulated), 1289 differential lncRNAs (155 up-regulated and 1134 down-regulated), and 10 differential microRNAs (all down-regulated) between NSCLC and normal controls (top of Fig. 2A). Likewise, in the SU cohort, we discovered 619 differential mRNAs (400 up-regulated and 219 down-regulated), 61 differential lncRNAs (25 up-regulated and 36 down-regulated), and 1 differential microRNA (bottom of Fig. 2A). To ensure the robustness of our findings, we compared the results of the two cohorts. As shown in Fig. 2B, we identified 241 commonly up-regulated mRNAs, 70 commonly down-regulated mRNAs, 16 commonly up-regulated lncRNAs, 8 commonly down-regulated lncRNAs and 1 commonly down-regulated miRNA. These shared dysregulated genes suggest the reliability and consistency of disease biomarker identification through cfRNA analysis in peripheral blood for NSCLC.
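The significance of such overlaps is assessed with a hypergeometric test according to the Fig. 2 legend. A minimal sketch of that test is given here; the function name is hypothetical, and the size of the gene universe, which the text does not report, must be supplied by the user.

```python
from math import comb


def hypergeom_overlap_pvalue(universe, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn at random from a universe
    that contains size_a 'marked' genes (one-sided enrichment test)."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: two sets of 5 genes out of 10 overlapping completely.
p_small = hypergeom_overlap_pvalue(universe=10, size_a=5, size_b=5, overlap=5)
```

For the up-regulated mRNAs, `size_a`, `size_b` and `overlap` would be 648, 400 and 241, with `universe` set to the number of genes tested in both cohorts.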

Fig. 2. (A) Volcano plots of differential mRNAs, lncRNAs and miRNAs between NSCLC and healthy controls in the TP cohort (top, NSCLC = 33, Healthy = 41) and the SU cohort (bottom, NSCLC = 6, Healthy = 6). (B) Venn diagrams illustrating the overlap of the differential mRNAs, lncRNAs and miRNAs between the TP cohort and the SU cohort, respectively. P-values were calculated with the hypergeometric test, and p-value < 0.05 was considered statistically significant. DEGs were determined based on |log2FC| > 1 and adjusted p-value < 0.05.

Given the larger sample size of the TP cohort, we performed further analysis based on this dataset and validated the findings using the SU cohort. To illustrate the differential expression levels, we generated boxplots for the top 5 differential mRNAs, lncRNAs, and microRNAs with the highest absolute log2FoldChange in the TP cohort (Fig. S1). These plots show the distinct expression patterns between NSCLC and normal control samples.

3.3. The development of NSCLC is reflected by the mRNA in the peripheral blood

To explore the biological relevance of plasma cfRNAs in patients with NSCLC, we performed functional annotation analysis for differential mRNAs, encompassing GO terms and KEGG pathways (Fig. 3). Detailed information on the enriched terms and p-values is listed in Supplementary Table S2. We observed a significant association between the up-regulated mRNAs in NSCLC and vascular function. The enriched cellular components identified through GO analysis were predominantly related to platelet function, including platelet alpha granule, platelet alpha granule lumen, and platelet alpha granule membrane. These components are likely involved in biological processes associated with coagulation and hemostasis. Furthermore, the KEGG pathway analysis revealed enrichment in pathways such as Platelet activation, Focal adhesion, Vascular smooth muscle contraction, and the Rap1 signaling pathway. Previous studies have indicated that patients with NSCLC often exhibit activation of the coagulation fibrinolytic system, leading to subclinical changes in coagulation fibrinolytic indices. These findings align with existing literature, which consistently emphasizes the significant role of hemostasis, coagulation, and platelet activation in both subtypes of NSCLC [36].

Fig. 3. The enriched GO terms and KEGG pathways of up-regulated and down-regulated mRNAs between NSCLC and healthy control in TP cohort. (A) Top 20 significant GO-BP terms of up-regulated mRNAs (left) and down-regulated mRNAs (right). (B) Top 20 significant GO-MF terms of up-regulated mRNAs (left) and down-regulated mRNAs (right). (C) Top 20 significant GO-CC terms of up-regulated mRNAs (left) and down-regulated mRNAs (right). (D) Top 20 significant KEGG pathways of up-regulated mRNAs (left) and down-regulated mRNAs (right). The enriched GO terms and KEGG pathways are selected based on a significance level of p-value < 0.05.

Among down-regulated mRNAs, 1147, 175 and 139 enriched GO terms were identified in BP, MF and CC, respectively. The GO analysis found that the majority of the genes were enriched for ribosome-related cellular components and molecular functions, and were associated with RNA production and expression processes. In addition, 47 KEGG pathways were enriched among the down-regulated mRNAs. Many of these pathways are related to immune function, including Th1 and Th2 cell differentiation, Type I diabetes mellitus, Primary immunodeficiency, Graft-versus-host disease, Antigen processing and presentation, MAPK signaling pathway, and Th17 cell differentiation. Detailed information on the enriched terms, overlapping genes and p-values is listed in Supplementary Table S2. The immune response of the organism to lung cancer is a highly complex process. In general, strong immune evasion is associated with advanced stages of solid tumor progression. However, research has revealed that LUAD and LUSC can exhibit features such as impaired antigen presentation, loss of heterozygosity in the human leukocyte antigen (HLA) region, silencing of neoantigens, activation of immune checkpoints, alterations in TH1/TH2 cytokine ratios, and immune microenvironment evolution. Our results align closely with these findings [37]. Normally, Th1 and Th2 cells are in a state of dynamic equilibrium. An imbalance in Th1 and Th2 cytokines can potentially contribute to the development of lung cancer. Additionally, lung cancer tumors secrete a multitude of cytokines that shape the microenvironment, tilting the Th1/Th2 balance further towards Th2 dominance and leading to Th2 drift. Th2 cells are a subtype of T helper cells that produce specific cytokines, thereby suppressing the in vivo anti-tumor immune response and allowing tumors to evade the local T cell response [38]. Immune evasion is also closely related to the processing and presentation of tumor antigens.
Dendritic cells (DC) play a central role in the initiation and maintenance of anti-tumor immunity. Tumor antigens must be taken up and presented by DC to activate CD8+ T cells [39]. Previous research has highlighted the significant role of Th17 cells in anti-tumor immune responses. Nevertheless, other studies have indicated that Th17 cells may promote tumor formation through the secretion of IL-17, which in turn stimulates stromal cells, epithelial cells, and tumor cells to express various pro-angiogenic factors. The precise cellular mechanisms by which Th17 cells promote or inhibit tumor growth in lung adenocarcinoma remain unclear and warrant further in-depth investigation. Taken together, in NSCLC, key components of innate immunity such as natural killer cells (NK) and DC are insufficient in number and dysfunctional, leading to an immunosuppressive tumor microenvironment (TME). A single-cell transcriptomic study conducted on CD45+ immune cells from NSCLC tissues, normal tissues, and blood revealed suppressed immune activity within the TME [40]. It can be inferred that these functional alterations associated with immune evasion during lung cancer progression are also manifested in the cell-free RNAs of peripheral blood.

3.4. PPI network analysis and survival analysis reveal potential of mRNA in blood as a biomarker

To unveil the regulatory networks of differential mRNAs and identify key hub genes within these networks, we constructed a PPI network. The resulting network consisted of 1351 nodes and 2853 edges. Among the hub genes identified based on their degree (score value ≥ 60) in the PPI network, the top 8 genes were RPL18A, RPL19, RBX1, FBL, TIMP1, EXOSC5, CX3CL1 and DRD4. The expression levels of these genes were visualized through boxplots, as shown in Fig. S2A. According to the literature, the expression of RBX1 in NSCLC tissues was significantly higher than that in corresponding non-cancerous lung tissues [41], which is consistent with the cfRNA results and further supports the feasibility of identifying biomarkers of NSCLC from cfRNA in peripheral blood. A subset of PPI networks exhibits strongly connected regions, which are more likely to play a crucial role in biological regulation, whereas nodes with weaker connections may have a lesser impact on the integrity of the entire network. In addition to the hub genes, we identified 11 modules within the PPI network that exhibited high network density. The top 3 modules are depicted in Fig. S2B, providing an overview of their interconnectedness. These modules, derived from their interactions within the PPI network, can help unveil potential key players in biological pathways and processes associated with NSCLC. The close relationships among genes in these modules suggest their potential involvement in shared cellular functions or pathways relevant to NSCLC. These findings contribute to the growing understanding of NSCLC and provide insights into potential therapeutic targets and prognostic markers in the field of cancer research.
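Degree-based hub selection, as described above, reduces to counting each gene's distinct interaction partners. The sketch below illustrates the idea on a plain edge list. It is not CytoHubba's implementation, the function and gene names are hypothetical, and self-loops and duplicate edges are assumed not to count toward the degree.

```python
from collections import defaultdict


def node_degrees(edges):
    """Map each node to its number of distinct neighbours in an undirected network."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {node: len(partners) for node, partners in neighbours.items()}


def hub_genes(edges, min_degree=60):
    """Return genes with degree >= min_degree, highest degree first."""
    degrees = node_degrees(edges)
    hubs = [gene for gene, degree in degrees.items() if degree >= min_degree]
    return sorted(hubs, key=lambda gene: (-degrees[gene], gene))


# Hypothetical toy network: HUB interacts with three partners.
toy_edges = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3"), ("G1", "G2")]
toy_hubs = hub_genes(toy_edges, min_degree=3)
```

With the threshold used in the text (`min_degree=60`), only genes that interact with at least 60 distinct partners in the filtered STRING network would be returned.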

We further investigated whether the differential mRNAs identified in cfRNA-Seq were associated with survival time in patients with NSCLC. Survival analysis was conducted for the differential mRNAs identified in both peripheral blood samples and the TCGA dataset (TCGA-LUAD or TCGA-LUSC). As shown in Fig. 4A, 41 up-regulated mRNAs (out of 648) and 939 down-regulated mRNAs (out of 4094) in cfRNA were also present in the TCGA-LUAD or TCGA-LUSC dataset. Subsequently, we found 8 up-regulated mRNAs and 83 down-regulated mRNAs to be significantly associated with the survival time of NSCLC. Detailed information on these mRNAs, along with the results of the literature review, is summarized in Supplementary Table S3. We selected several differential mRNAs as examples, including 4 up-regulated mRNAs (MAGOHB, SKA3, PKM, CHMP4C) as shown in Fig. 4B and 4 down-regulated mRNAs (MGP, MCOLN2, DENND1C, PLEKHB1) as shown in Fig. 4C. SKA3 was highly expressed in NSCLC tissues and cells, and the overexpression of SKA3 significantly accelerated cell proliferation and metastasis, inhibited apoptosis of NSCLC cells, and promoted lung metastasis [42]. CHMP4C is a unique gene associated with the cell cycle and was highly expressed in LUSC tissues [43]. Through Human Protein Atlas analysis, MCOLN2 was confirmed to be down-regulated in LUSC and LUAD [44]. These results indicate that the dysregulated mRNAs identified from cfRNA-Seq have the potential to serve as noninvasive biomarkers for predicting the prognosis of patients with NSCLC.

Fig. 4.

(A) The overlap of differential mRNAs and lncRNAs between cfRNA-Seq and the TCGA datasets. LUAD: lung adenocarcinoma; LUSC: lung squamous cell carcinoma. (B-C) Overall survival (OS) analysis of common differential mRNAs identified from cfRNA-Seq, TCGA-LUAD and TCGA-LUSC: up-regulated mRNAs (B) and down-regulated mRNAs (C). The difference in survival time between the high-expression cohort (top 75%) and the low-expression cohort (bottom 25%) was assessed by log-rank test.
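For readers unfamiliar with the log-rank test used to compare the high- and low-expression cohorts, the following is a minimal two-group sketch in plain Python. It is not the software used in the study (which is not named in this section); the function name and the toy follow-up times are hypothetical. An event flag of 1 means death was observed and 0 means the patient was censored.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value with 1 d.f.)."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for s, _, g in samples if s >= t and g == 0)
        n_b = sum(1 for s, _, g in samples if s >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in samples if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    return chi2, p_value


# Hypothetical follow-up times (months); every patient has an observed event.
short_survivors = [5, 6, 7, 8, 9, 10]
long_survivors = [20, 22, 25, 27, 30, 33]
chi2, p = logrank_test(short_survivors, [1] * 6, long_survivors, [1] * 6)
```

The statistic compares the observed number of events in one group with the number expected if both groups shared the same hazard, accumulated over all event times.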

Some differentially expressed genes were found only in cfRNA-Seq (Fig. 4A), such as BDR3, HMGB1 and DUS6P, which have been shown to be related to the occurrence and development of NSCLC. For example, BDR3, which is up-regulated in cfRNA-Seq, has been found to promote the proliferation and migration of lung cancer cells, while its knockdown reduces these processes and increases apoptosis rates [45]. Elevated levels of HMGB1, detected as up-regulated in cfRNA-Seq, have been significantly associated with poor survival outcomes [46]. On the other hand, DUS6P, down-regulated in cfRNA-Seq, acts as a tumor suppressor and is involved in the regulation of cell migration, motility, and tumor growth [47]. These results highlight the ability of cfRNA-Seq to identify novel differentially expressed genes that may not be captured through solid biopsy due to intra-tumor heterogeneity.

3.5. The potential of differential lncRNAs and miRNAs from cfRNA-Seq as biomarkers for NSCLC

We performed enrichment analysis for all differentially expressed lncRNAs using LncSEA, and significant functional enrichment sets were identified (Supplementary Table S4). Among the differential lncRNAs, several were enriched in pathways associated with NSCLC, including GAS5, GHRLOS, CASC2, MEG3 and MIR4435-2HG. We observed down-regulation of GAS5, which is consistent with previous research demonstrating that overexpression of GAS5 inhibits cell invasion and migration in cell lines [48]. Similarly, reduced expression of GHRLOS in NSCLC cell lines and tissues, as observed in our study, has been associated with a poor prognosis in patients with NSCLC [49]. CASC2, identified as a tumor suppressor, is expressed at low levels in NSCLC cell lines, suggesting a potential role in tumorigenesis [50]. The decreased expression of MEG3, an inhibitory lncRNA, may contribute to the promotion of cell migration and invasion in NSCLC [51]. Additionally, MIR4435-2HG is highly expressed in lung cancer tissues, particularly in NSCLC, consistent with the results obtained from our cfRNA analysis. Functional experiments have further demonstrated that knockdown of MIR4435-2HG decreases the proliferation and migration abilities of NSCLC cells [52]. Further studies and validation are necessary to elucidate the functional implications and clinical significance of these differential lncRNAs and to explore their potential as diagnostic or prognostic biomarkers for NSCLC.

Ten down-regulated miRNAs were identified from the cfRNA-Seq data, namely MIR1282, MIR486-1, MIR6090, MIR486-2, MIR3648-1, MIR3661, ENSG00000275110, MIR150, MIR25 and ENSG00000277437. Among these, MIR1282 encodes MIR128, which has been frequently reported as a tumor suppressor in the nervous system [53]. Abnormal expression of MIR150 is believed to directly affect oncogenes and/or tumor suppressor genes involved in tumor formation and progression [54]. MIR150, a miRNA specific to hematopoietic cells, plays a role in malignant hematological diseases [55]. Increasing the expression of MIR150 has been shown to effectively inhibit proliferation in vitro and in vivo and to promote cell apoptosis. To identify potential target mRNAs, we used miRDB (v6.0) and TargetScan (v7.2), resulting in a final set of 79 miRNA-mRNA pairs (Supplementary Table S5). The miRNA-mRNA network is visualized in Fig. S3A. To gain further insight into the function of the target mRNAs, GO and KEGG pathway analyses were performed (Fig. S3B). The KEGG pathway analysis revealed strong associations of NSCLC with pathways such as Longevity regulating pathway - multiple species, PI3K-Akt signaling pathway and Longevity regulating pathway. The PI3K-Akt pathway is an intracellular signal transduction pathway that promotes metabolism, proliferation, cell survival, growth and angiogenesis in response to extracellular signals. Activation of this pathway stimulates cell growth and proliferation, and its overactivation can lead to abnormal cell proliferation and carcinogenesis. These results imply the potential of miRNAs as diagnostic biomarkers for NSCLC.

3.6. NSCLC diagnostic model constructed on cfRNA gene expression profile

Based on the results above, the cfRNA signals have the potential to distinguish NSCLC samples from normal samples. To validate this, the TP cohort samples were randomly divided into two groups. Two-thirds of the samples (22 lung cancer and 27 normal samples) were used for training, and one-third (11 lung cancer and 14 normal samples) were used for testing. For each split, the PLS-DA method was used to identify important features on the training set with VIP > 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1 and 2.2, respectively. A random forest (RF) or logistic regression (LR) model with balanced subsampling was then built on the selected features. The results from 100 repetitions of this process are summarized in Table 1. As the VIP cutoff increases, fewer features are selected and classification performance drops. With VIP > 1.8, 770 genes on average are selected as features. The RF model performs satisfactorily in diagnosing NSCLC, with an average AUC of 0.907, sensitivity of 0.874 and specificity of 0.886 (Fig. 5A). The LR model also performs satisfactorily but slightly worse than RF, with an average AUC of 0.882, sensitivity of 0.86 and specificity of 0.84 (Fig. 5B).
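The repeated two-thirds/one-third split can be sketched as follows. This is an illustrative, standard-library-only version: the class sizes (33 NSCLC, 41 normal) are inferred from the training and testing counts given above, and the function name, seed and label strings are assumptions rather than details of the authors' code. Feature selection and model fitting would be run inside each repetition on the training indices only.

```python
import random


def stratified_split(labels, rng, num=2, den=3):
    """Split sample indices so each class contributes num/den of its members to training."""
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_train = len(members) * num // den  # integer arithmetic avoids rounding surprises
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)


# 33 NSCLC and 41 normal samples, as implied by the 22/27 and 11/14 counts in the text.
labels = ["NSCLC"] * 33 + ["normal"] * 41
rng = random.Random(0)  # fixed seed (arbitrary) so the sketch is reproducible
splits = [stratified_split(labels, rng) for _ in range(100)]
```

Splitting within each class keeps the cancer-to-normal ratio the same in every repetition, which makes the averaged AUC, sensitivity and specificity comparable across runs.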

Table 1.

The performance of classification models in cross-validation and independent test.

       Cross-Validation                           Independent Test
VIP    # of features   AUC of RF   AUC of LR      # of features   AUC of RF   AUC of LR
1.5    2719            0.907       0.904          2877            0.969       1
1.6    1821            0.909       0.892          1884            0.917       1
1.7    1197            0.907       0.881          1054            0.859       1
1.8    770             0.907       0.885          535             0.703       0.938
1.9    481             0.905       0.882          276             0.333       0.5
2.0    293             0.905       0.877          161             0.109       0.354
2.1    176             0.901       0.875          105             0.031       0.302
2.2    105             0.892       0.868          56              0.052       0.531

Fig. 5.

The performance of classification models for distinguishing NSCLC samples from normal samples through cfRNAs. (A-B) The average AUC curves for classifiers with VIP > 1.8 for feature selection in the cross-validation scenario, Random Forest model (A) and Logistic regression model (B). (C) The AUC curves for our NSCLC diagnosis strategies based on an independent dataset for evaluating the performance. (D) The overlap among PLS-DA selected features and DEGs identified from TP and SU cohort independently.

We then performed feature selection (Fig. S4) and model building on the full TP cohort and validated the models on an independent test set (Supplementary Table S6) that combined the SU and Inhouse cohorts. There were no statistically significant differences in age (p-value = 0.883, Wilcoxon test) or sex (p-value = 0.189, Chi-square test) between the NSCLC and healthy groups in the independent test set. Both the RF and LR models perform well with VIP cutoffs from 1.5 to 1.8. With VIP > 1.8, 535 features are selected from the TP cohort, resulting in AUCs of 0.703 and 0.938 for RF and LR, respectively (Fig. 5C). As the VIP cutoff increases further and fewer features are selected, classification performance decreases dramatically.
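The evaluation metrics reported here (AUC, sensitivity, specificity) can be computed from predicted scores as in the sketch below. It is a plain-Python illustration and not the ROCR-based code the study may have used; the function names, the 0.5 decision threshold and the toy scores are assumptions. Labels are 1 for NSCLC and 0 for normal.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each positive/negative pair contributes 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities for three NSCLC and three normal samples.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
```

Because AUC depends only on the ranking of scores, an AUC well below 0.5 (as in the lower rows of Table 1) means the model ranks normal samples above NSCLC samples more often than not.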

Fig. 6.

Immune repertoire analysis from cfRNA-Seq for NSCLC. (A) The length distribution of T cell and B cell CDR3 amino acid sequences. (B) Degree of expansion and frequency distribution of T cell and B cell clones. (C) Predicted true Shannon diversity of TCR and BCR in NSCLC and healthy control samples. (D) Cancer scores of the TP cohort from the DeepCAT model. The black diamonds are the final sample scores, and the blue line marks a value of 0.58. P-values were calculated by Wilcoxon test; p-value < 0.05 was considered statistically significant.

In summary, the performance of the classification model improves as more features are selected within a specific range. However, a clinical test with thousands of biomarkers is not feasible. A VIP cutoff of 1.8 therefore offers acceptable performance with a manageable number of biomarkers. Furthermore, around half of the PLS-DA features (256 out of 535, 47%) were differentially expressed genes in the two cohorts (Fig. 5D), which indicates that PLS-DA serves as an effective feature selection strategy in this study. For reference, the feature importances in the RF model with VIP cutoff > 1.8 are listed in Supplementary Table S7.

3.7. Immune repertoire constructed from cfRNA-Seq for NSCLC

We further explored the immune activity of T cells and B cells in NSCLC using circulating RNAs in blood. The TCR plays a pivotal role in conferring antigen specificity to T cells, with its hypervariable CDR3 region engaging amino acid residues presented by major histocompatibility complex (MHC) molecules. The analysis mainly focused on the amino acid sequences of CDR3. The length of TCR β chain CDR3 sequences ranged from 6 to 23 amino acids, with 14 amino acids being the most frequent length in both NSCLC and normal control samples (Fig. 6A). Intriguingly, the expanded clonotypes were concentrated in the large and hyperexpanded clonal expansion intervals. The clonal frequency distribution of β-CDR3 in NSCLC samples was higher than in normal control samples in the interval 0.1–1 and lower in the interval 0.01–0.1 (Fig. 6B). Additionally, the diversity of β-CDR3 in NSCLC samples was significantly lower than in normal control samples (Fig. 6C). These findings suggest an abnormality in T cell clones, which may be associated with immune response, regulation, and overall immune status in NSCLC. The results of the clonality analysis align with the functional enrichment analysis of differential mRNAs, collectively indicating an unbalanced T cell population in NSCLC.
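To make the two repertoire metrics above concrete, the sketch below computes a plain Shannon index and the share of the repertoire occupied by each clonal expansion interval from per-clonotype read counts. Only the intervals 0.01–0.1 (large) and 0.1–1 (hyperexpanded) are named in the text; the two lower bins, the function names and the toy CDR3 counts are assumptions for illustration. The study reports a predicted true Shannon diversity, which may involve additional estimation steps not shown here.

```python
import math


def clone_frequencies(cdr3_counts):
    """Convert per-clonotype read counts into frequencies that sum to 1."""
    total = sum(cdr3_counts.values())
    return [count / total for count in cdr3_counts.values()]


def shannon_index(frequencies):
    """Shannon diversity (natural log); higher means a more even repertoire."""
    return -sum(p * math.log(p) for p in frequencies if p > 0)


# (name, low, high) with membership low < p <= high; the two lower bins are assumed.
BINS = [
    ("small", 0.0, 0.001),
    ("medium", 0.001, 0.01),
    ("large", 0.01, 0.1),
    ("hyperexpanded", 0.1, 1.0),
]


def expansion_profile(frequencies):
    """Fraction of the repertoire occupied by clones in each frequency interval."""
    occupied = {name: 0.0 for name, _, _ in BINS}
    for p in frequencies:
        for name, low, high in BINS:
            if low < p <= high:
                occupied[name] += p
                break
    return occupied


# Hypothetical clonotype counts for one sample.
toy_counts = {"CASSLGQGNTEAFF": 60, "CASSPDRGYEQYF": 30, "CASSQETQYF": 9, "CASRTGELFF": 1}
toy_freqs = clone_frequencies(toy_counts)
toy_profile = expansion_profile(toy_freqs)
toy_diversity = shannon_index(toy_freqs)
```

A repertoire dominated by a few hyperexpanded clones, as in this toy sample, yields a lower Shannon index than an even repertoire with the same number of clonotypes.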

Previous research showed that tracking changes in the T cell pool in peripheral blood holds potential as a biomarker for cancer diagnosis. Beshnova et al. created a deep learning algorithm named DeepCAT to predict cancer-associated TCR sequences. A convolutional neural network (CNN) serves as the model's kernel, and the input data is the TCR CDR3 sequence. The CNN encodes the amino acid properties of the input sequence and outputs the likelihood of each CDR3 sequence being associated with cancer. The model then calculates the mean probability over all CDR3 sequences within a given sample to obtain the cancer score of the sample. The validation results showed that TCR data obtained from RNA were also suitable as input for this model [34]. To further explore the availability of TCR signals in cfRNA, we applied the CDR3 sequences obtained from the TP cohort to DeepCAT as an independent validation set (see Method 2.9). Cancer scores were obtained for a total of 60 samples, with an AUC value of 0.55. Regrettably, the predictive performance was lower than that reported in the original paper. The boxplot of sample cancer scores (Fig. 6D) shows that the scores of the healthy group are concentrated in a narrow interval and relatively stable. Although the original paper did not differentiate between cancer and non-cancer scores, it is apparent from our data that the scores of the majority of healthy individuals are less than 0.6. In contrast, the cancer group exhibits substantial variability in the number of CDR3 sequences within each sample, leading to corresponding variations in the final cancer scores, with most samples displaying higher scores. In addition, fewer CDR3 sequences were observed in cancer patients than in healthy individuals.
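The sample-level scoring step described above (the mean of the per-sequence probabilities) is simple enough to sketch directly. The snippet below does not call DeepCAT; it assumes the per-CDR3 cancer probabilities have already been produced by such a model, and the function name and the numbers are hypothetical.

```python
def sample_cancer_score(cdr3_probabilities):
    """Mean of the per-CDR3 cancer-association probabilities of one sample."""
    if not cdr3_probabilities:
        raise ValueError("sample has no CDR3 sequences to score")
    return sum(cdr3_probabilities) / len(cdr3_probabilities)


# Hypothetical per-sequence probabilities for two samples.
many_cdr3 = [0.55, 0.60, 0.50, 0.58, 0.52, 0.57, 0.54, 0.56]
few_cdr3 = [0.20, 0.95]

score_many = sample_cancer_score(many_cdr3)
score_few = sample_cancer_score(few_cdr3)
```

Because the score is an average, a sample with only a handful of CDR3 sequences can swing far from the cohort centre on the strength of one or two sequences, which is consistent with the wider spread of scores noted for the cancer group.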

The metrics of BCR H chain CDR3 sequences showed a pattern similar to that observed for TCR β-CDR3. The length of BCR H chain CDR3 sequences ranged from 7 to 33 amino acids, with 11 amino acids being the most frequent length in both NSCLC and normal control samples (Fig. 6A). As with TCR β-CDR3, the abundance of BCR H-CDR3 in the intervals 0.1–1 and 0.01–0.1 was much higher than in the other two intervals. Moreover, hyperexpanded BCR H-CDR3 clones were more abundant in NSCLC samples than in normal control samples, while large-expanded H-CDR3 clones showed the opposite trend (Fig. 6B). Although the Shannon index showed a small drop in NSCLC samples compared to normal control samples, this difference did not reach statistical significance (Fig. 6C). These observations suggest that B cells mount a stronger immune response in NSCLC samples than in normal control samples.

These results indicate that cfRNA has the capacity to reflect the immune response of cancer patients and non-cancer patients. In particular, the decreased diversity of TCR β-CDR3 in cancer patients compared to non-cancer patients suggests sustained lymphocyte activity and a reduction in TCR diversity distribution. This is closely related to the development of lung cancer, as evidenced by our previous functional enrichment analysis, where we observed enrichment of abnormal genes in immune evasion functions, including dysregulation of Th1/Th2 differentiation. In the pre-invasive stage of lung cancer, Th2 shift is commonly observed, leading to an immunosuppressive state. Therefore, it is reasonable to speculate that the decrease in TCR diversity may indirectly reflect a progression in lung cancer. These results highlight the potential utility of TCR β-CDR3 in cfRNA for constructing prediction models.

3.8. Microbial-derived cfRNAs can be identified from cfRNAs in peripheral blood

We compared the composition of the microbial community between NSCLC and normal samples using cfRNA and analyzed the relative abundance of bacteria at the genus level, focusing on the top 15 genera (Fig. 7A). Staphylococcus was the most abundant bacterial genus in both NSCLC and normal samples, with similar relative abundance. However, there were noticeable differences in the proportions of other genera, such as Klebsiella, Cutibacterium, Priestia and Halomonas. The Richness and Shannon index showed a slight drop in NSCLC samples, but the difference was not significant (Fig. 7B). These results suggest that the number of genera in NSCLC samples tended to be lower than in normal samples, indicating a trend toward reduced diversity. Furthermore, the composition of the bacterial community exhibited partial clustering, indicating that the two groups are not entirely similar (Fig. 7C). To identify the bacterial genera responsible for these differences, differential abundance analysis was performed at the genus level. We detected 66 low-abundance bacterial genera and 13 high-abundance bacterial genera in NSCLC samples, including Halobacillus, Kibdelosporangium, Nesterenkonia, Tepiditoga, etc. (Fig. 7D). Some of these genera have been associated with inflammation. For instance, Virgibacillus showed higher abundance in patients with mild enteritis based on 16S rRNA sequencing [56].
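The genus-level quantities used in this comparison can be illustrated with a short sketch: relative abundance, richness (number of detected genera) and the Bray-Curtis dissimilarity on which the PCoA is based. The function names and read counts below are hypothetical; genus names are borrowed from the text purely as labels, and the study's actual pipeline (Kraken 2 and Bracken followed by downstream analysis) is not reproduced here.

```python
def relative_abundance(counts):
    """Scale per-genus read counts so that they sum to 1 within a sample."""
    total = sum(counts.values())
    return {genus: n / total for genus, n in counts.items()}


def richness(counts):
    """Number of genera detected (non-zero count) in a sample."""
    return sum(1 for n in counts.values() if n > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared genera."""
    genera = set(sample_a) | set(sample_b)
    diff = sum(abs(sample_a.get(g, 0) - sample_b.get(g, 0)) for g in genera)
    total = sum(sample_a.get(g, 0) + sample_b.get(g, 0) for g in genera)
    return diff / total


# Hypothetical read counts for two samples.
sample_x = {"Staphylococcus": 50, "Klebsiella": 30, "Halomonas": 20}
sample_y = {"Staphylococcus": 50, "Cutibacterium": 50}

profile_x = relative_abundance(sample_x)
profile_y = relative_abundance(sample_y)
dissimilarity = bray_curtis(profile_x, profile_y)
```

Computing the dissimilarity on relative abundances rather than raw counts keeps samples with different sequencing depths comparable.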

Fig. 7.

The microbiome analysis of bacteria and viruses at the genus level. (A) The relative abundance of dominant bacteria (left) and viruses (right) in NSCLC and healthy control samples. (B) The Richness and Shannon index of bacteria (top) and viruses (bottom) in NSCLC and healthy control samples. (C) PCoA based on Bray-Curtis distance of bacteria (left) and viruses (right) in NSCLC and healthy control samples. (D) Top 10 differential bacteria (left) and viruses (right) between NSCLC and healthy control samples. P-values were calculated by Wilcoxon test; p-value < 0.05 was considered statistically significant.

For viral genera, the abundance of the dominant viruses differed markedly between NSCLC and normal control samples (Fig. 7A). Gorganvirus, the most abundant virus, accounted for around 70% in NSCLC samples, almost double its proportion in normal samples. Orthobunyavirus, Orthohantavirus and Betabaculovirus, ranked 2nd to 4th, showed higher abundance in normal than in NSCLC samples. These findings indicate that while the composition of the viral community may be similar, the abundances differ between NSCLC and normal control samples. The Richness and Shannon index of viruses were smaller in NSCLC samples than in normal control samples (Fig. 7B). These trends indicate that the number of viral genera in NSCLC samples was smaller than in normal control samples, and the diversity was lower as well. Principal Coordinate Analysis (PCoA) showed that NSCLC samples overlapped with only part of the normal control samples, suggesting differences in the composition of the viral community between the groups (Fig. 7C). Differential abundance analysis identified only a few viral genera that were more or less abundant in normal control samples than in NSCLC samples. We also applied the microbiome analysis to bacterial and viral species; the composition of species followed the same trend as that of genera (Fig. S5). The enrichment of viral infection-related pathways identified in the earlier KEGG enrichment analysis indicates that genes or proteins associated with viral infection play significant roles in these pathways in the studied samples. This observation may suggest the presence of viral infection or virus-related biological processes in the samples.

4. Conclusions and discussion

In this work, we performed a comprehensive analysis of cfRNAs profiled by the cfRNA-Seq technique for NSCLC. cfRNA in plasma carries multiple identifiable signals, some of which are differentially expressed between NSCLC and normal samples and are linked to clinical features of NSCLC patients. Random forest and logistic regression methods were used to construct diagnostic models based on gene expression signatures. Our data support the diagnostic value of cfRNA for cancer.

Due to the low integrity and low concentration of cfRNA in blood, it is challenging to construct high-throughput sequencing libraries. The experimental protocols for generating cfRNA-Seq data in the two cohorts are different, which results in different detection power for circulating RNAs in blood. This observation raises two important questions. Can the biomarkers discovered in one cohort be validated in the other cohort despite technological biases? Can a classification model constructed for diagnosing cancer based on cfRNA-Seq achieve good performance on an independent test set generated with a different experimental protocol?

The high-dimensional nature of transcriptome data, coupled with the small sample size in this study, necessitated the use of PLS-DA as a feature selection strategy to mitigate the risk of overfitting. PLS-DA is a supervised statistical method for discriminant analysis. The Variable Importance for the Projection (VIP) score can be used to assess the influence and explanatory power of individual genes on the classification and discrimination of each group of samples. Although the number of genes in the two cohorts differed, the prediction model built from their overlap performed very well on the independent set, indicating that the biomarkers we discovered are credible and that the model is robust. Most NSCLC patients in the cohort we used were in the early stage (see Methods 2.1), and the positive outcomes obtained from our data support the notion that a noninvasive diagnostic method based on cfRNA-Seq has the potential to be used in clinical practice.
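For reference, the VIP score is commonly defined as follows. This is the standard textbook form, given here as an aid to the reader; the notation is ours, and the exact implementation used in the study may differ in detail.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\, p \;
  \frac{\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a \left( \frac{w_{ja}}{\lVert w_a \rVert} \right)^{2}}
       {\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a} \,}
```

Here p is the number of genes, A is the number of PLS components, w_{ja} is the weight of gene j on component a, and SSY_a is the amount of variance in the class labels explained by component a. Because the squared VIP scores average to 1 across all genes, a score above 1 marks a gene with above-average contribution, and raising the cutoff (1.5 to 2.2 in this study) retains progressively fewer genes.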

Additionally, we explored the immune repertoire signals and microbial-derived transcriptional information present in human-derived RNA and attempted to interpret these signals.

The enrichment analysis showed that the down-regulated differential mRNAs were concentrated in immune function pathways, so we further characterized the TCR profile of NSCLC patients from cfRNA. Indeed, the TCR information obtained from cfRNA samples reflected the host's immune state to a certain extent, providing valuable information on the tumor-immune interaction in NSCLC. In addition, the TCR sequences extracted from our data were applied to the cancer prediction model developed by Beshnova et al. [34]. The results did not achieve the performance reported in the original paper, and there are two potential reasons for this difference. First, the degradation of cfRNA in the samples and the low sequencing depth led to the loss of TCR data. It is evident from the number of observed species that the CDR3 count in some samples is low, which may affect the final score. Second, NSCLC is characterized by its extreme immunodiversity. In a study of the NSCLC tumor microenvironment using single-cell sequencing, the authors identified four distinct immunophenotypes based on the tumor microenvironment infiltration pattern. TCR CDR3 levels in patients' blood are linked to the tumor and its microenvironment. Notably, one of these types is the immune desert type, which shows no significant immune cell infiltration and may contribute to the difference in CDR3 levels in blood [57].

Apart from the human transcriptional information, we also investigated the transcriptional signals originating from microbial sources. The bacterial richness in NSCLC patients and healthy individuals shows some differentiation at the genus level but lacks statistical significance. This may arise from the intricate relationship between lung microbes and external/oral microorganisms associated with the upper respiratory tract and the gut. Furthermore, the existing literature on the specific microbiome composition in lung cancer is substantially inconsistent. Nevertheless, there is strong interest in elucidating the connection between microbiota and cancer immunity. Future research should further investigate the immune signals and microbiome information underlying cfRNAs in NSCLC. In-depth exploration of these immune signals and their associations with the microbiome will pave the way for improved diagnostic and therapeutic strategies in the field. Our study has the following limitations: (1) The classification model made good predictions on the independent test dataset, which suggests it is generalizable. However, the sample sizes of both the training set and the independent test set are small, which might limit the performance of the machine learning models. The models also need to be confirmed by future research, including larger cohorts and samples with well-recorded tumor stages. (2) A classification model utilizing multi-omics integration needs to be developed in the future, which might improve the diagnostic performance. Larger cohorts are needed to characterize the blood microbial and immune profiles of NSCLC, and recently developed deep learning strategies may help us understand the complex links between the two, better predict the emergence of early lung cancer and provide insights into its treatment.
In future studies, we will verify the existence of different microorganisms and evaluate the value of TCR, then add these signals into the lung cancer prediction model, which is expected to improve the prediction accuracy.

Authors' contributions

QL (Qingjiao Li): conceived and supervised the study. YL (Yulin Liu): searched the dataset, processed the sequence data and performed the data analysis with the input from QL (Qiyan Li). YL (Yin Liang) performed the classification model. YL (Yulin Liu) and YL (Yin Liang) wrote, reviewed and edited the article. All authors are involved in the revision and approved the submitted manuscript.

Informed consent statement

Not applicable.

Funding

This work was supported by the Natural Science Foundation of Guangdong Province [2019A1515111174]; the National Natural Science Foundation of China [32000466]; and the Shenzhen Science and Technology Program [JCYJ20190808100817047, RCBS20200714114909234].

Declaration of Competing Interest

The authors, including Yulin Liu, Yin Liang, Qiyan Li, and Qingjiao Li, declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgements

The authors acknowledge the research groups for their cfRNA-seq data.

Footnotes

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.08.029.

Contributor Information

Yulin Liu, Email: yulinliu0218@163.com.

Yin Liang, Email: liangy355@mail2.sysu.edu.cn.

Qiyan Li, Email: liqy83@mail2.sysu.edu.cn.

Qingjiao Li, Email: liqj23@mail.sysu.edu.cn.

Data availability

The dataset supporting the conclusions of this article is available in the NCBI’s Sequence Read Archive (SRA) repository with access number PRJNA589238 and PRJNA729258. For Inhouse cohort, data access can be obtained through a request to the corresponding author liqj23@mail.sysu.edu.cn.

References

  • 1. Siegel R.L., Miller K.D., Fuchs H.E., Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022;72:7–33. doi: 10.3322/caac.21708.
  • 2. Siegel R.L., Miller K.D., Jemal A. Cancer statistics, 2018. CA Cancer J Clin. 2018;68:7–30. doi: 10.3322/caac.21442.
  • 3. Blandin Knight S., Crosbie P.A., Balata H., Chudziak J., Hussell T., Dive C. Progress and prospects of early detection in lung cancer. Open Biol. 2017;7. doi: 10.1098/rsob.170070.
  • 4. Blandin Knight S., Crosbie P.A., Balata H., Chudziak J., Hussell T., Dive C. Progress and prospects of early detection in lung cancer. Open Biol. 2017;7. doi: 10.1098/rsob.170070.
  • 5. Zhang M., Chen J. Advances in clinical application of liquid biopsy in non-small cell lung cancer. Zhongguo Fei Ai Za Zhi. 2021;24:723–728. doi: 10.3779/j.issn.1009-3419.2021.102.33.
  • 6. Ngo T.T.M., Moufarrej M.N., Rasmussen M.-L.H., Camunas-Soler J., Pan W., Okamoto J., et al. Noninvasive blood tests for fetal development predict gestational age and preterm delivery. Science. 2018;360:1133–1136. doi: 10.1126/science.aar3819.
  • 7. Wang L., Wang J., Jia E., Liu Z., Ge Q., Zhao X. Plasma RNA sequencing of extracellular RNAs reveals potential biomarkers for non-small cell lung cancer. Clin Biochem. 2020;83:65–73. doi: 10.1016/j.clinbiochem.2020.06.004.
  • 8. Toden S., Zhuang J., Acosta A.D., Karns A.P., Salathia N.S., Brewer J.B., et al. Noninvasive characterization of Alzheimer’s disease by circulating, cell-free messenger RNA next-generation sequencing. Sci Adv. 2020;6:eabb1654. doi: 10.1126/sciadv.abb1654.
  • 9. Ibarra A., Zhuang J., Zhao Y., Salathia N.S., Huang V., Acosta A.D., et al. Non-invasive characterization of human bone marrow stimulation and reconstitution by cell-free messenger RNA sequencing. Nat Commun. 2020;11:400. doi: 10.1038/s41467-019-14253-4.
  • 10. Del Vecchio G., Li Q., Li W., Thamotharan S., Tosevska A., Morselli M., et al. Cell-free DNA methylation and transcriptomic signature prediction of pregnancies with adverse outcomes. Epigenetics. 2021;16:642–661. doi: 10.1080/15592294.2020.1816774.
  • 11. Chalasani N., Toden S., Sninsky J.J., Rava R.P., Braun J.V., Gawrieh S., et al. Noninvasive stratification of nonalcoholic fatty liver disease by whole transcriptome cell-free mRNA characterization. Am J Physiol-Gastrointest Liver Physiol. 2021;320:G439–G449. doi: 10.1152/ajpgi.00397.2020.
  • 12. Schreiber R.D., Old L.J., Smyth M.J. Cancer immunoediting: integrating immunity’s roles in cancer suppression and promotion. Science. 2011;331. doi: 10.1126/science.1203486.
  • 13. Gubin M.M., Zhang X., Schuster H., Caron E., Ward J.P., Noguchi T., et al. Checkpoint blockade cancer immunotherapy targets tumour-specific mutant antigens. Nature. 2014;515. doi: 10.1038/nature13988.
  • 14.Chen S., Jin Y., Wang S., Xing S., Wu Y., Tao Y., et al. Cancer type classification using plasma cell-free RNAs derived from human and microbes. ELife. 2022;11 doi: 10.7554/eLife.75181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Köster J., Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2018;34 doi: 10.1093/bioinformatics/bty350. 3600–3600. [DOI] [PubMed] [Google Scholar]
  • 16.Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. Embnet J. 2011:17. doi: 10.14806/ej.17.1.200. [DOI] [Google Scholar]
  • 17.Turner S. Faculty Opinions recommendation of STAR: ultrafast universal RNA-seq aligner. 2012:793464455. 10.3410/f.717961569.793464455. [DOI]
  • 18.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Putri G.H., Anders S., Pyl P.T., Pimanda J.E., Zanini F. Analysing high-throughput sequencing data in Python with HTSeq 2.0. Bioinformatics. 2022:btac166. doi: 10.1093/bioinformatics/btac166.
  • 20. Wood D.E., Lu J., Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257. doi: 10.1186/s13059-019-1891-0.
  • 21. Lu J., Breitwieser F.P., Thielen P., Salzberg S.L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci. 2017;3 doi: 10.7717/peerj-cs.104.
  • 22. Yu G., Wang L.G., Han Y., He Q.Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics-a J Integr Biol. 2012;16:284–287. doi: 10.1089/omi.2011.0118.
  • 23. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
  • 24. Kanehisa M., Goto S., Furumichi M., Tanabe M., Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38:D355–D360. doi: 10.1093/nar/gkp896.
  • 25. Szklarczyk D., Gable A.L., Nastou K.C., Lyon D., Kirsch R., Pyysalo S., et al. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49:D605–D612. doi: 10.1093/nar/gkaa1074.
  • 26. Chin C.-H., Chen S.-H., Wu H.-H., Ho C.-W., Ko M.-T., Lin C.-Y. cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst Biol. 2014;8(Suppl 4):S11. doi: 10.1186/1752-0509-8-S4-S11.
  • 27. Bader G.D., Hogue C.W.V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinforma. 2003;4:2. doi: 10.1186/1471-2105-4-2.
  • 28. Chen J., Zhang J., Gao Y., Li Y., Feng C., Song C., et al. LncSEA: a platform for long non-coding RNA related sets and enrichment analysis. Nucleic Acids Res. 2020 doi: 10.1093/nar/gkaa806.
  • 29. Chen Y., Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic Acids Res. 2020;48:D127–D131. doi: 10.1093/nar/gkz757.
  • 30. Ruiz-Perez D., Guan H., Madhivanan P., Mathee K., Narasimhan G. So you think you can PLS-DA. BMC Bioinforma. 2020;21:2. doi: 10.1186/s12859-019-3310-7.
  • 31. Sing T., Sander O., Beerenwinkel N., Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941. doi: 10.1093/bioinformatics/bti623.
  • 32. Song L., Cohen D., Ouyang Z., Cao Y., Hu X., Liu X.S. TRUST4: immune repertoire reconstruction from bulk and single-cell RNA-seq data. Nat Methods. 2021;18:627–630. doi: 10.1038/s41592-021-01142-2.
  • 33. Liu H., Pan W., Tang C., Tang Y., Wu H., Yoshimura A., et al. The methods and advances of adaptive immune receptors repertoire sequencing. Theranostics. 2021;11:8945–8963. doi: 10.7150/thno.61390.
  • 34. Beshnova D., Ye J., Onabolu O., Moon B., Zheng W., Fu Y.-X., et al. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci Transl Med. 2020;12:eaaz3738. doi: 10.1126/scitranslmed.aaz3738.
  • 35. Salter S.J., Cox M.J., Turek E.M., Calus S.T., Cookson W.O., Moffatt M.F., et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014;12:87. doi: 10.1186/s12915-014-0087-z.
  • 36. Hoang L.T., Domingo-Sabugo C., Starren E.S., Willis-Owen S.A.G., Morris-Rosendahl D.J., Nicholson A.G., et al. Metabolomic, transcriptomic and genetic integrative analysis reveals important roles of adenosine diphosphate in haemostasis and platelet activation in non-small-cell lung cancer. Mol Oncol. 2019;13:2406–2421. doi: 10.1002/1878-0261.12568.
  • 37. Anichini A., Perotti V.E., Sgambelluri F., Mortarini R. Immune escape mechanisms in non small cell lung cancer. Cancers. 2020;12:3605. doi: 10.3390/cancers12123605.
  • 38. Ito N., Nakamura H., Tanaka Y., Ohgi S. Lung carcinoma: analysis of T helper type 1 and 2 cells and T cytotoxic type 1 and 2 cells by intracellular cytokine detection with flow cytometry. Cancer. 1999;85:2359–2367.
  • 39. Jhunjhunwala S., Hammer C., Delamarre L. Antigen presentation in cancer: insights into tumour immunogenicity and immune evasion. Nat Rev Cancer. 2021;21:298–312. doi: 10.1038/s41568-021-00339-z.
  • 40. Zilionis R., Engblom C., Pfirschke C., Savova V., Zemmour D., Saatcioglu H.D., et al. Single-Cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity. 2019;50:1317–1334.e10. doi: 10.1016/j.immuni.2019.03.009.
  • 41. Xing R., Chen K.-B., Xuan Y., Feng C., Xue M., Zeng Y.-C. RBX1 expression is an unfavorable prognostic factor in patients with non-small cell lung cancer. Surg Oncol. 2016;25:147–151. doi: 10.1016/j.suronc.2016.05.006.
  • 42. Xie L., Cheng S., Fan Z., Sang H., Li Q., Wu S. SKA3, negatively regulated by miR-128-3p, promotes the progression of non-small-cell lung cancer. Per Med. 2022;19:193–205. doi: 10.2217/pme-2020-0095.
  • 43. Liu B., Guo S., Li G.-H., Liu Y., Liu X.-Z., Yue J.-B., et al. CHMP4C regulates lung squamous carcinogenesis and progression through cell cycle pathway. J Thorac Dis. 2021;13:4762–4774. doi: 10.21037/jtd-21-583.
  • 44. St O., B B., B O., B F., Lemamy G.J., B N., et al. Exogenous central angiotensin fails to stimulate a sodium appetite in diabetes insipidus Brattleboro rats. Physiol Behav. 2021;230 doi: 10.1016/j.physbeh.2020.113308.
  • 45. Lee J.H., Yoo S.S., Hong M.J., Choi J.E., Kang H.-G., Do S.K., et al. Epigenetic readers and lung cancer: the rs2427964C>T variant of the bromodomain and extraterminal domain gene BRD3 is associated with poorer survival outcome in NSCLC. Mol Oncol. 2022;16:750–763. doi: 10.1002/1878-0261.13109.
  • 46. Ren Y., Cao L., Wang L., Zheng S., Zhang Q., Guo X., et al. Autophagic secretion of HMGB1 from cancer-associated fibroblasts promotes metastatic potential of non-small cell lung cancer cells via NFκB signaling. Cell Death Dis. 2021;12:858. doi: 10.1038/s41419-021-04150-4.
  • 47. Moncho-Amor V., Pintado-Berninches L., Ibañez de Cáceres I., Martín-Villar E., Quintanilla M., Chakravarty P., et al. Role of Dusp6 Phosphatase as a tumor suppressor in non-small cell lung cancer. Int J Mol Sci. 2019;20:2036. doi: 10.3390/ijms20082036.
  • 48. Zhu L., Zhou D., Guo T., Chen W., Ding Y., Li W., et al. LncRNA GAS5 inhibits invasion and migration of lung cancer through influencing EMT process. J Cancer. 2021;12:3291–3298. doi: 10.7150/jca.56218.
  • 49. Ren K., Sun J., Liu L., Yang Y., Li H., Wang Z., et al. TP53-Activated lncRNA GHRLOS regulates cell proliferation, invasion, and apoptosis of non-small cell lung cancer by modulating the miR-346/APC axis. Front Oncol. 2021;11 doi: 10.3389/fonc.2021.676202.
  • 50. Li L., Zhang H., Wang X., Wang J., Wei H. Long non-coding RNA CASC2 enhanced cisplatin-induced viability inhibition of non-small cell lung cancer cells by regulating the PTEN/PI3K/Akt pathway through down-regulation of miR-18a and miR-21. RSC Adv. 2018;8:15923–15932. doi: 10.1039/c8ra00549d.
  • 51. Lv D., Bi Q., Li Y., Deng J., Wu N., Hao S., et al. Long non‑coding RNA MEG3 inhibits cell migration and invasion of non‑small cell lung cancer cells by regulating the miR‑21–5p/PTEN axis. Mol Med Rep. 2021;23:191. doi: 10.3892/mmr.2021.11830.
  • 52. Yang M., He X., Huang X., Wang J., He Y., Wei L. LncRNA MIR4435-2HG-mediated upregulation of TGF-β1 promotes migration and proliferation of nonsmall cell lung cancer cells. Environ Toxicol. 2020;35:582–590. doi: 10.1002/tox.22893.
  • 53. Zhuang L., Xu L., Wang P., Meng Z. Serum miR-128-2 serves as a prognostic marker for patients with hepatocellular carcinoma. PLoS One. 2015;10 doi: 10.1371/journal.pone.0117274.
  • 54. Brouwer E.S., Napravnik S., Eron J.J., Simpson R.J., Brookhart M.A., Stalzer B., et al. Validation of medicaid claims-based diagnosis of myocardial infarction using an HIV clinical cohort. Med Care. 2015;53:e41–e48. doi: 10.1097/MLR.0b013e318287d6fd.
  • 55. Wang F., Ren X., Zhang X. Role of microRNA-150 in solid tumors. Oncol Lett. 2015;10:11–16. doi: 10.3892/ol.2015.3170.
  • 56. Wang Z., Wang Q., Wang X., Zhu L., Chen J., Zhang B., et al. Gut microbial dysbiosis is associated with development and progression of radiation enteritis during pelvic radiotherapy. J Cell Mol Med. 2019;23:3747–3756. doi: 10.1111/jcmm.14289.
  • 57. Salcher S., Sturm G., Horvath L., Untergasser G., Kuempers C., Fotakis G., et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell. 2022;40:1503–1520.e8. doi: 10.1016/j.ccell.2022.10.008.

Supplementary Materials

Supplementary material: mmc1.pptx (5.1 MB, pptx)

Supplementary material: mmc2.xlsx (1,020.9 KB, xlsx)

Data Availability Statement

The datasets supporting the conclusions of this article are available in the NCBI Sequence Read Archive (SRA) repository under accession numbers PRJNA589238 and PRJNA729258. For the in-house cohort, data access can be obtained by request to the corresponding author (liqj23@mail.sysu.edu.cn).
