Abstract
Cell-free RNA (cfRNA) in plasma reflects phenotypic alterations of both localized sites of cancer and the systemic host response. Here we report that cfRNA sequencing enables the discovery of messenger RNA (mRNA) biomarkers in plasma that are specific to the tissue of origin for cancer types and precancerous conditions in both solid and hematologic malignancies. To explore the diagnostic potential of total cfRNA from blood, we sequenced plasma samples of eight hepatocellular carcinoma (HCC) and ten multiple myeloma (MM) patients, 12 patients with their respective precancerous conditions, and 20 non-cancer (NC) donors. We identified distinct gene sets and built classification models using Random Forest and linear discriminant analysis algorithms that could distinguish cancer patients from premalignant conditions and NC individuals with high accuracy. Plasma cfRNA biomarkers of HCC are liver-specific genes, and biomarkers of MM are highly expressed in the bone marrow compared to other tissues and are related to cell cycle processes. The cfRNA level of these biomarkers displayed a gradual transition from noncancerous states through precancerous conditions to cancer. Sequencing data were cross-validated by quantitative reverse transcription PCR, and cfRNA biomarkers were validated in an independent sample set (20 HCC, 9 MM, and 10 NC) with AUC greater than 0.86. cfRNA results observed in precancerous conditions require further validation. This work demonstrates a proof of principle for using mRNA transcripts in plasma with a small panel of genes to distinguish between cancers, noncancerous states, and precancerous conditions.
Subject terms: Diagnostic markers, Cancer genomics, Hepatocellular carcinoma, Myeloma
Introduction
Although recent advances in cancer research offer new methods to treat cancer, early detection of malignancy still confers the highest chance of improving long-term patient survival. Currently, only 2.4% of metastatic liver cancer patients survive for more than 5 years1. Early detection of liver cancer, which has the most rapidly increasing incidence in the United States, has the potential to extend 5-year survival rates to 33% with current treatment options. Even with hematologic malignancies like multiple myeloma (MM), 95% of patients are diagnosed when cancer has already spread systemically, resulting in at least a 20% decrease in 5-year survival rates compared to detection at earlier stages2. Noninvasive, low cost and reliable cancer diagnostic assays could greatly benefit patients by facilitating accessibility to early cancer screening.
In many cancers, there are disease states known to be precursors of malignant disease. For example, MM, a cancer of antibody-producing plasma cells, is often preceded by monoclonal gammopathy of undetermined significance (MGUS), which is characterized by lower levels of abnormal antibodies. The prevalence of MGUS is about 3% in the Caucasian population, and the conversion rate from MGUS to multiple myeloma is ~1% per year3,4. Hepatocellular carcinoma (HCC), the most common form of liver cancer, is often preceded by liver cirrhosis (Cirr), which is characterized by irreversible fibrosis of the liver. The prevalence of cirrhosis is between 4.5 and 9.5% of the global population5–7. The risk of developing de novo HCC in patients with liver cirrhosis ranges between 1 and 5% per year, depending on the etiology of the cirrhosis5–11. Most early cancer detection studies to date have focused on distinguishing cancer from healthy controls, rather than discriminating between cancer and common premalignant conditions. Therefore, there is an unmet clinical need for a simple blood test that can identify patients with premalignant conditions who require further intervention due to a higher likelihood of cancer incidence.
With current clinical practices, a cancer diagnosis is primarily initiated based upon costly imaging studies or invasive screening procedures. Alternatively, some cancers may only come to attention with clinical symptoms that present at more advanced stages. Liquid biopsy, a minimally invasive method for sampling and analyzing biomarkers in various body fluids, has the potential to improve cancer diagnosis and prognosis12–15. Several blood-based analytes have been explored for use in liquid biopsies for cancer detection such as circulating cells (circulating tumor cells (CTCs), circulating hybrid cells (CHCs), tumor-associated macrophages (TAMs))16–21, circulating tumor DNA (ctDNA)22–24, platelets25–27, and protein panels28. However, ctDNA and circulating cells are present at low levels, have varied characteristics between patients, and only weakly correlate with phenotypic changes in cancer17,29,30. Epigenetic features of ctDNA such as methylation and 5-hydroxymethylcytosine signatures, or ctDNA protected patterns may provide information about the tissue of origin for pan-cancer detection31–38. However, these methods may require large deep sequencing coverage to be effective and may have inadequate sensitivity and specificity. Recent transcriptome analysis of tumor-educated platelets has shown promise for pan-cancer detection25–27, but platelets are fragile, can be easily activated in vitro, and have highly variable characteristics depending on their preparation which make them challenging to utilize with existing clinical blood tests39. There is thus a need for robust liquid biopsy technology that can overcome these challenges in a safe, reliable, and cost-effective manner.
Circulating cell-free RNA (cfRNA) in the blood is released from cells by active secretion or through apoptosis and necrosis40,41. Plasma cfRNA has the potential to reflect the systemic response to growing tumors and provide information about the tissue of tumor origin specifically by cancer type. Previous work has demonstrated that global cfRNA profiles indicate temporal changes in organ-specific transcripts. Further analysis of these transcripts facilitated the prediction of pregnancy delivery, preterm birth, and distinction of cancer from healthy controls42–46. Here, we explore the potential of cfRNA profiles to distinguish cancers from their premalignant conditions. We sequenced total plasma cfRNA from plasma samples of patients with HCC and MM and their precancerous conditions including liver cirrhosis (Cirr) and MGUS, and non-cancer (NC) donors. We identified potential cfRNA biomarkers using plasma cfRNA-sequencing of a pilot sample set and validated potential cfRNA biomarkers in an independent sample set. We further validated the sequencing data using orthogonal measurement by quantitative reverse transcription PCR. Feature selection and classification models were built to explore the potential of cfRNA profiles in differentiating malignant from premalignant conditions.
Results
Identification of plasma cfRNA biomarkers by sequencing
To identify cfRNA transcripts that potentially distinguish cancer patients from NC individuals, we prospectively collected blood samples from the following individuals: a pilot set of ten MM and eight HCC patients; 12 patients with premalignant conditions including eight MGUS and four Cirr; and 20 NC donors. Detailed clinical information of the samples is listed in the supplementary information (Supplementary Table 1). Plasma cfRNA samples were sequenced to saturation with a mean of 33.8 M raw reads with a range of 27.7 to 52.3 M (Supplementary Table 2 and Supplementary Figs. 2, 3a). After selecting for reads that mapped uniquely to the human genome, the cfRNA libraries had an average read depth of 14 M with a range from 2.3 to 43 M. On average, 80% of reads mapped to exons (Supplementary Table 2 and Supplementary Fig. 3b). A total of 39,374 annotated features were detected with at least one mapped read across all samples. The majority of detected RNAs were protein-coding with a mean fraction of 82% and a range from 65 to 89% (Supplementary Table 2 and Supplementary Fig. 3c). The fraction of reads mapping to exons and the distribution of read depths were uniform across all sample groups (Supplementary Fig. 3b, c).
We then determined if cfRNA profiles can distinguish HCC and MM from NC donors. Principal component analysis (PCA) using the top 500 genes with the largest variance across all samples through pairwise comparison showed separation of HCC and MM cfRNA profiles from that of NC donors (Fig. 1b, c). Differential expression (DE) analysis of the pairwise comparison between individual cancer types with respect to NC donors using DESeq2 yielded 110 and 12 differentiating genes (adjusted p value <0.01) for MM and HCC, respectively (Supplementary Table 3 and Supplementary Fig. 4a, b). Permutations of random sample shuffling in each pair with 500 rounds resulted in zero significant differentiating genes (padj < 0.01) in more than 95 and 94% of permutations for the comparisons of MM and HCC to non-cancer donors, respectively (Supplementary Table 4 and Supplementary Fig. 4c, d). Gene ontology analysis revealed that MM upregulated genes were enriched for oxygen transport and gas transport (Supplementary Fig. 5a). In HCC, the upregulated gene set was enriched for plasminogen activation (Supplementary Fig. 5b). Together, these data indicate the separation of cfRNA profiles in HCC and MM compared to NC donors.
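The label-shuffling check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used in the study: the toy expression values are made up, and `count_by_mean_gap` is a hypothetical stand-in for a real differential expression test such as the DESeq2 call with padj < 0.01.

```python
import random


def permutation_null(labels, count_significant, n_permutations=500, seed=0):
    """Shuffle group labels repeatedly and record the hit count from each shuffle."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_counts.append(count_significant(shuffled))
    return null_counts


# Toy expression values for six samples (made up, for illustration only)
expression = {
    "geneA": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
true_labels = ["NC", "NC", "NC", "MM", "MM", "MM"]


def count_by_mean_gap(labels, threshold=3.0):
    """Hypothetical stand-in for a DE test: count genes whose group means differ by more than threshold."""
    hits = 0
    for values in expression.values():
        mm = [v for v, lab in zip(values, labels) if lab == "MM"]
        nc = [v for v, lab in zip(values, labels) if lab == "NC"]
        if abs(sum(mm) / len(mm) - sum(nc) / len(nc)) > threshold:
            hits += 1
    return hits


observed = count_by_mean_gap(true_labels)
null_counts = permutation_null(true_labels, count_by_mean_gap)
# Fraction of shuffles that produce no "significant" genes at all
fraction_zero = sum(1 for c in null_counts if c == 0) / len(null_counts)
```

Comparing `observed` with the distribution in `null_counts` mirrors the histogram comparison reported in the text: if most shuffles yield zero hits, the genes found with the correct labels are unlikely to be a chance result.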
To further explore the potential of cell-free RNA for cancer detection, we applied linear discriminant analysis (LDA) and a Random Forest (RF) algorithm to find combinations of discriminating genes to separate cancer from non-cancer individuals. Two independent methods were used to identify specific input gene lists for the classifying algorithms. First, discriminating genes using DESeq2 analysis with adjusted p value < 0.01 (Supplementary Table 3) were used as one feature set (DE gene set). Second, we implemented the learning vector quantization (LVQ) method to find the most important features that distinguished the two groups and selected the top ten as another feature set (LVQ gene set) (Supplementary Table 5). The linear combination for each gene set by LDA showed significant separation of HCC and MM from NC donors with p values of 6.7 × 10⁻⁸, 6.7 × 10⁻¹⁰, and 6.4 × 10⁻⁷, 6.4 × 10⁻⁷ using the DE and top ten LVQ gene sets, respectively (Fig. 1d, e). We further employed the RF method to develop orthogonal classification models. The area under the receiver operating characteristic (ROC) curve (AUC) was higher than 0.92 in both LDA and RF models with both DE and LVQ feature sets for the two cancer types (Fig. 1f, g).
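For readers less familiar with the AUC metric used here, the following Python sketch computes it directly from classifier scores, using the equivalent definition as the probability that a randomly chosen cancer sample scores higher than a randomly chosen NC sample. The function name `roc_auc` and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Made-up classifier scores, for illustration only
cancer_scores = [0.9, 0.8, 0.4]
nc_scores = [0.3, 0.5, 0.1]
auc = roc_auc(cancer_scores, nc_scores)  # 8 of 9 pairs are ranked correctly
```

An AUC of 1.0 means every cancer sample outscores every NC sample, while 0.5 corresponds to random ranking.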
To evaluate the significance and accuracy of our classification models, we employed the leave-one-out cross-validation (LOOCV) method. Both LDA and RF algorithms were trained on the described DE and LVQ gene sets, resulting in four classification models (Fig. 1f–i). Classifying MM from non-cancer donors yielded 90% accuracy (27/30) for all four models tested. HCC was correctly differentiated from NC donors with accuracies of 100% (28/28) and 93% (26/28) using the LDA method or 96% (27/28) and 96% (27/28) using the RF method with LVQ and DE feature sets, respectively. Overall, the LOOCV test confirmed that the biomarker sets determined by DESeq2 and LVQ methods, combined with our classification models using LDA and RF algorithms are statistically significant. LVQ gene sets yielded higher accuracy for both cancer types and were used as the feature sets for further validation.
cfRNA profiles distinguish multiple myeloma from its premalignant condition, MGUS, and MGUS from non-cancer
We next examined if cfRNA profiles were able to recapitulate the transition from a precancerous condition to a cancerous one and distinguish between them. We chose to test our hypothesis on MM as it has the well-defined precancerous condition of MGUS. The top ten most significant genes that discriminate MM from NC donors as identified by LVQ displayed a gradual transition in cfRNA level from the non-cancer donors through MGUS to MM (Fig. 2a). Among these ten most significant genes, seven genes (CA1, EPB42, HBG1, HBG2, CENPE, CPOX, and NUSAP1) have higher expression in bone marrow, where cancerous plasma cells accumulate, compared to other tissue and cell types in publicly available data from the Human Protein Atlas47,48 (Fig. 2b). Three genes resulting from the LVQ analysis are related to cell cycle processes: Centromere protein E (CENPE), a kinesin-like motor protein that accumulates in the G2 phase of the cell cycle and is highly expressed in the bone marrow49,50; Serine/threonine-protein kinase (NEK2), which is involved in mitotic regulation50,51; and Nucleolar and spindle associated protein 1 (NUSAP1), a nucleolar-spindle-associated protein that plays a role in spindle microtubule organization52.
An LDA plot using a combination of the top ten LVQ genes from pairwise comparisons MM—NC, and MGUS—NC displayed the separation of all three groups (Fig. 2c). An RF model using the top ten most important LVQ genes from MGUS—NC pairwise comparison yielded an accuracy of 89.3% (19/20 NC donors and 6/8 MGUS patients) (Fig. 2d). Classification of MM from MGUS yielded an accuracy of 100% (8/8 MGUS and 10/10 MM) using LOOCV with the RF classification method using the top ten most important genes from LVQ analysis of MM versus NC comparison as a feature set (Fig. 2e). The three-group classification resulted in an accuracy of 86.8% (18/20 NC, 6/8 MGUS, and 9/10 MM) defined by LOOCV using the RF method with the feature set composed of the combination of the top 10 LVQ genes from the comparison MM versus NC and MGUS versus NC donors (Fig. 2f).
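The overall accuracies quoted above follow from the per-group counts. A small Python check, using only the counts reported in this paragraph (the helper name `overall_accuracy` is ours), makes the arithmetic explicit:

```python
def overall_accuracy(correct_and_total):
    """Pool per-group (correct, total) counts into one accuracy value."""
    correct = sum(c for c, _ in correct_and_total.values())
    total = sum(t for _, t in correct_and_total.values())
    return correct / total


# Counts as reported in the text
mgus_vs_nc = {"NC": (19, 20), "MGUS": (6, 8)}
mm_vs_mgus = {"MGUS": (8, 8), "MM": (10, 10)}
three_group = {"NC": (18, 20), "MGUS": (6, 8), "MM": (9, 10)}

mgus_vs_nc_pct = round(100 * overall_accuracy(mgus_vs_nc), 1)    # 25/28
mm_vs_mgus_pct = round(100 * overall_accuracy(mm_vs_mgus), 1)    # 18/18
three_group_pct = round(100 * overall_accuracy(three_group), 1)  # 33/38
```

Because the groups differ in size, the pooled accuracy is dominated by the larger NC group; the MGUS group alone is classified correctly in 6 of 8 cases.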
cfRNA profiles distinguish liver cancer from its premalignant condition, cirrhosis, and cirrhosis from non-cancer
Next, we asked if we could distinguish between a solid tumor such as HCC and its precancerous condition, Cirr. Among the top ten most important genes that discriminate HCC from NC identified by the LVQ analysis, five genes also significantly differentiate HCC from Cirr (Fig. 3a). Interestingly, eight out of the top ten genes are expressed specifically in the liver and the corresponding proteins are secreted into the blood47,48 (Fig. 3b). Apolipoprotein E (APOE) binds to specific liver and peripheral cell receptors and is essential for the normal catabolism of triglyceride-rich lipoprotein constituents53. Complement C3 (C3) is synthesized in the liver and secreted to the plasma and is involved in both innate and adaptive immune responses54. Ceruloplasmin (CP) is a secreted plasma metalloprotein from the liver that binds copper in the plasma and is involved in the peroxidation of Fe(II) transferrin to Fe(III) transferrin55. 24-dehydrocholesterol reductase (DHCR24) catalyzes the reduction of sterol intermediates56. Fibrinogen alpha chain (FGA), fibrinogen beta chain (FGB), and fibrinogen gamma chain (FGG) encode the coagulation factor fibrinogen, which is a component of blood clotting57. Histidine-rich glycoprotein (HRG) is a plasma glycoprotein that binds heparin sulfate on the surface of the liver, lung, kidney, and heart endothelial cells58.
We explored the potential of cfRNA to distinguish HCC from Cirr and Cirr from NC individuals. An LDA plot using the feature set comprised of a combination of the top 10 LVQ genes identified for the pairwise comparisons of HCC—NC and Cirr—NC, shows a distinct separation between these groups (Fig. 3c). RF methods using the top ten important genes from Cirr—NC pairwise comparisons yielded 100% accuracy in classifying Cirr from NC samples using LOOCV (Fig. 3d). Classification of HCC from Cirr also yielded 100% accuracy using LOOCV with RF (Fig. 3e). We further attempted to classify three classes including NC, Cirr, and HCC in one model. The three-group classification resulted in 90.6% accuracy using LOOCV with RF (Fig. 3f).
Validation of cfRNA biomarkers
We designed a primer panel for the LVQ gene set to validate the sequencing data by quantitative reverse transcription PCR (RT-qPCR). RT-qPCR results from the pilot sample set were consistent with the sequencing data with a Pearson correlation coefficient >0.77 and a p value of 2.2 × 10⁻¹⁶ (Fig. 4a). We confirmed that the differential levels of cfRNA transcripts of genes identified by the LVQ algorithm (HBG1, HBG2, and NUSAP1 for MM and C3, CP, FGA, and FGB for HCC) from RNA-sequencing were also observed with RT-qPCR (Fig. 4b).
To confirm that the feature sets and classification models defined in our pilot cohort were robust and generalizable, we collected a set of independent validation samples from ten NC controls, nine MM patients, and 20 HCC patients (Supplementary Table 1 and Supplementary Figs. 6, 7). We validated the cfRNA biomarkers identified from the pilot set in silico by measuring the classification accuracy on this independent sample set using the models trained with the pilot dataset using the LVQ gene sets. The linear combination identified by LDA in the pilot cohort of the LVQ feature set showed significant separation in the validation sample set between MM and HCC from NC donors, consistent with our previous results (Fig. 5a, c). Furthermore, both LDA and RF models trained on the pilot cohort with this same feature set were able to classify cancer from NC controls in our validation cohort, with an AUC >0.86 and 0.9 when classifying NC donors from MM and HCC, respectively (Fig. 5b, d).
Our cfRNA classification model performed well for early and late clinical stages in the pilot set (Fig. 6a–d). In the validation sample set, the model displayed stage-dependent discrimination. It was validated with an AUC of 0.74 for Barcelona Clinic Liver Cancer (BCLC) stage A in HCC (Fig. 6e, f) and an AUC of 0.64 for stage I in MM (Fig. 6g, h). For later stages, the model achieved a higher AUC of 0.91 for BCLC stages B and C in HCC (Fig. 6e, f) and 0.83 for stages II and III in MM (Fig. 6g, h) in the validation sample set. This stepwise increase in discrimination suggests that these biomarkers become more prevalent with cancer progression. HCC classification also showed significant discrimination compared to NC for different etiologies (Fig. 7), and both HCC and MM showed discrimination for males and females (Supplementary Fig. 9) and are not age-dependent (Supplementary Fig. 9) in our pilot and validation sample sets.
Discussion
We sequenced cfRNA from patients with two cancer types, one solid (HCC) and the other hematologic (MM), and their precancerous conditions (Cirr and MGUS, respectively), and from NC donors. Both cancer types can be distinguished from non-cancer controls and precancerous conditions using their cfRNA profiles. To differentiate each cancer type from individuals without cancer, the combination of ten genes identified by learning vector quantization (LVQ) analysis in each pairwise comparison yielded higher accuracy compared to the use of a larger set of differentiating genes as evaluated by leave-one-out cross-validation (LOOCV). RT-qPCR confirmation for a panel of selected biomarkers was consistent with the sequencing data. Plasma cfRNA biomarkers identified from the sequencing data were further validated in an independent sample cohort. The use of a small gene panel potentially enables a cost-effective office-based assay for pan-cancer detection that can be highly useful in broad clinical applications.
To date, most investigations into the potential of blood-based methods for cancer detection have only focused on distinguishing cancers from healthy controls15,22,25,26,28,36. However, many cancer types have etiologies associated with precursor states such as MGUS for MM and Cirr for HCC. Here, we report that cfRNA profiles can recapitulate the transition from a precancerous condition to cancer, and can effectively do so for both solid and hematologic cancers. We, therefore, propose that cfRNA panels containing a small number of genes may distinguish cancers from premalignant conditions and precursors from healthy individuals. This development might potentially enable a cost-effective screening strategy for early cancer detection during routine exams in high-risk patients.
Liver and bone marrow have been reported to contribute heavily to the abundance of cell-free nucleic acids in plasma42,45,46. This may explain the source of cfRNA biomarkers found in these cancer types. In HCC, eight out of the top ten genes used in the classification model are specifically synthesized in the liver and encode secreted proteins found in the blood that mediate plasminogen activation and fibrinolysis processes. In MM, seven out of ten genes among the most important cfRNA biomarkers have relatively high expression in bone marrow compared to other tissue and cell types and are related to cell cycle processes. These findings indicate that the identified cfRNA biomarkers potentially originate from the tissue of origin of the tumor. Further investigation is needed to better define the tissue and cell-type origin of the biomarkers, and how they may associate with disease initiation and progression.
Our study has important limitations. This is a cross-sectional study with single sampling and a small sample size for both the discovery and validation sets. Furthermore, the sample sets do not represent the wider distribution of cancer subtypes and precursor lesions in the overall population with different underlying etiologies. Another limitation is that the majority of patients and controls are white, so further studies are needed to examine if these results can be extrapolated to a more racially diverse population. However, accurate classification is not sex- or age-dependent. Although the control population has a higher female/male ratio than the cases, our classification model showed significant discrimination for both males and females in the pilot and validation sets. Despite the median age of controls being 9 and 6 years younger than cases for the pilot and validation sets, respectively, discrimination does not depend on age. Our results for precancerous conditions are promising but require further validation. In addition, we have not fully characterized the stability of cell-free RNA and the biological origin of the identified cfRNA biomarkers. Before the tests developed from this work can be clinically applied, large-scale clinical studies will be required to validate the potential of cfRNA as a cancer biomarker and to build robust classification models. Such large-scale clinical studies will also help to determine if the test can be applied to a broader risk population without specific predispositions.
In summary, we report a proof of principle that global profiling of cell-free mRNA has the potential to establish a platform for longitudinal monitoring of disease progression across both solid and hematologic cancers. This work lays the foundation for developing inexpensive assays that measure transcript levels of mRNA in plasma for a small panel of genes that can differentiate pan-cancer from premalignant conditions and otherwise healthy donors. Intriguingly, organ-specific enriched mRNA transcripts were identified as biomarkers that might indicate the tissue of origin for the tumor. These cell-free plasma RNA biomarkers could be readily combined with other nucleic acid-based and protein-based approaches for potentially increased diagnostic sensitivity and specificity.
Methods
For a detailed summary of the methods used and the general workflow of our study please see Supplementary Fig. 1.
Patient samples
Blood samples from non-cancer donors and patients with monoclonal gammopathy of undetermined significance (MGUS), multiple myeloma, liver cirrhosis, and liver cancer were obtained from Oregon Health and Science University (OHSU) by Knight Cancer Institute Biolibrary and Oregon Clinical and Translational Research Institute (OCTRI). All samples were collected under institutional review board (IRB) approved protocols by Oregon Health and Science University. Participants provided written informed consent to take part in the study. Individuals who had no recorded previous history of cancer were considered to be non-cancer donors.
All samples with various diagnoses within the same sample set were collected and processed using a uniform protocol by the same staff at Oregon Health and Science University. The validation set and the pilot set were collected and processed independently by two groups of staff. The clinical information regarding study participants is given in Supplementary Table 1. The pilot set includes ten MM and eight HCC patients; 12 patients with premalignant conditions including eight MGUS and four Cirr; and 20 NC donors. The validation set includes ten NC controls, nine MM patients, and 20 HCC patients. All Cirr patients underwent abdominal US or MRI, and all MGUS patients had an evaluation of bone marrow or assessment of serum-free light chain ratio within 6 months of blood collection for this study.
Processing of whole blood
For all cohorts, whole blood samples were collected in EDTA-anticoagulated vacutainers. Within 2 h of collection, blood samples were first centrifuged at 1000×g for 10 min at 4 °C followed by 15,000×g for 10 min at 4 °C. Plasma was then stored at −80 °C until RNA isolation.
cfRNA isolation
Samples were randomly shuffled for RNA extraction, library preparation, and sequencing in Illumina flow cells (Fig. 1a). Total RNA purification was performed by using a plasma/serum circulating and exosomal RNA purification kit (Norgen Biotek) from 3 ml of human plasma according to the manufacturer’s protocol. To digest trace amounts of contaminating DNA, RNA was treated with 10X Baseline-ZERO DNase. DNase I treated RNA samples were purified and further concentrated using RNA clean and concentrator-5 (Zymo Research) according to the manufacturer’s manuals. Final eluted RNA was stored immediately at −80 °C.
Library preparation
We prepared stranded RNA-Seq libraries using the Clontech SMARTer stranded total RNA-seq kit v2 - pico input mammalian (Takara Bio) according to the manufacturer’s instructions. For cDNA synthesis, we used option 2 (without fragmentation), starting from highly degraded RNA. An input of 7 µl of each RNA sample was used to generate cDNA libraries suitable for next-generation sequencing. For the addition of adapters and indexes, we employed the SMARTer RNA unique dual index kit - 96U. Unique dual-index 5′ and 3′ PCR primers were added to each sample so that pooled libraries could be distinguished from each other. The amplified RNA-seq library was purified by immobilization onto the AMPure XP PCR purification system (Beckman Coulter). Library fragments originating from rRNA and mitochondrial rRNA were treated with ZapR v2 and R-Probes according to the manufacturer’s protocols. For final RNA-seq library amplification, 16 cycles of PCR were performed, and after purification the amplified library was eluted in a final volume of 20 µl of Tris buffer. The amplified RNA-seq library was stored at −20 °C prior to sequencing.
Sequencing data processing and quality control
Each sample was sequenced to more than 20 million paired-end reads using an Illumina NextSeq or HiSeq sequencer. Adapter sequences were trimmed using the sickle tool59. After trimming, the quality of the reads was checked using FastQC (v0.11.7)60,61 and RSeQC (v2.6.4)62. Reads were aligned to the hg38 human genome using the STAR aligner (v2.5.3a)63 in two-pass mode. Duplicated reads were removed using the Picard tool (v1.119)64. Read counts for each gene were calculated using the htseq-count tool (v0.11.2)65 in intersection-strict mode. The number of reads mapped to each gene was normalized to the total number of reads in the whole transcriptome (reads per million, RPM). For each sample, we calculated exon, intron, intergenic fractions, and protein-coding fractions (CDS exons) using RSeQC62 and the read_distribution.py script. Samples with an exon fraction larger than 0.35 were kept for further analysis.
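The normalization and sample filter described above can be illustrated with a short Python sketch. It is not the pipeline code used in the study: the function names and toy counts are hypothetical, and the total is assumed here to be the sum of the per-gene counts for the sample.

```python
def reads_per_million(gene_counts):
    """Scale raw per-gene read counts so that they sum to one million."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in gene_counts.items()}


def passes_exon_filter(exon_fraction, threshold=0.35):
    """Keep a sample only if its exon fraction is larger than the threshold."""
    return exon_fraction > threshold


# Toy counts for one sample (made up, for illustration only)
counts = {"FGA": 150, "FGB": 50, "C3": 300}
rpm = reads_per_million(counts)
```

RPM corrects for differences in sequencing depth between samples but not for gene length, which is acceptable when the same gene is compared across samples.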
Identification of cfRNA biomarkers (DESeq and LVQ and GO analysis)
Two independent methods were applied to select cfRNA features for building classification models. In the first, differentiating genes for each pairwise comparison were identified with the R package DESeq2 (v1.24.0) using the Wald test66, and genes with an adjusted p value (padj) < 0.01 (Supplementary Table 3) were used as one feature set (DE gene set). The second method for feature selection used the LVQ algorithm as implemented in the R package caret (v6.0-84), with tenfold cross-validation repeated three times67. The top ten most important features were selected by ranking the varImp importance scores (LVQ gene set) (Supplementary Table 5). Gene Ontology (GO) analysis was performed on the top differentiating genes from the DESeq2 analysis with padj < 0.01 using the package topGO (v2.37.0) and Fisher’s exact test to measure significant enrichment of each Gene Ontology term68.
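As a rough illustration of the two selection rules, the Python sketch below thresholds adjusted p values and ranks features by an importance score. It assumes the Benjamini-Hochberg adjustment, which is the DESeq2 default, and ignores DESeq2's independent filtering step; the gene names, p values, and importance scores are invented, and the helper names are ours rather than part of DESeq2 or caret.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def top_features(importance, n=10):
    """Rank features by importance score and keep the n highest."""
    return sorted(importance, key=importance.get, reverse=True)[:n]


# Made-up Wald-test p values, for illustration only
pvals = {"geneA": 0.0001, "geneB": 0.004, "geneC": 0.03, "geneD": 0.5}
padj = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
de_gene_set = [gene for gene, p in padj.items() if p < 0.01]

# Made-up importance scores standing in for the LVQ ranking
importance = {"geneA": 0.95, "geneB": 0.80, "geneC": 0.60, "geneD": 0.10}
lvq_gene_set = top_features(importance, n=2)
```

The first rule yields a gene set whose size depends on the data, whereas the second always yields a fixed-size panel, which is why the LVQ set contains exactly ten genes per comparison.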
Cancer type classification (LDA and RF)
Two methods were used to build models for classifying cancer types using feature sets identified from pairwise comparison using DESeq2 and LVQ methods. LDA models were built using the R package MASS (v7.3–51.4)69. Random Forest models were built using the R package randomForest (v4.6-14)70.
Statistical consideration (permutation test and leave-one-out cross-validation)
To test the significance of the differential expression results for each pairwise comparison of cancer to NC donors, we performed a permutation test in which differential expression analysis was run with the DESeq2 package on two groups of samples with randomly shuffled labels. For each pair, 500 permutations of random shuffling were performed, and the number of differentiating genes with padj < 0.01 in each permutation was recorded to build a histogram, which was compared to the number of significant genes (padj < 0.01) obtained with the correct labels. To determine the significance and accuracy of our classification models, we employed the LOOCV method. Briefly, in LOOCV, the LDA or RF algorithm classified each sample using a model trained on all other samples (the total number of samples in each pair minus the test sample). The test was repeated until every sample had been classified and cross-validated.
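The LOOCV loop can be sketched in plain Python as follows. To keep the example self-contained, a simple nearest-centroid classifier is swapped in for the LDA and RF models used in the study; the data are made up, and all function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the sample to the class whose mean feature vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], sample))


def loocv_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the held-out call."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy data: two well-separated groups in a two-gene feature space (made up)
features = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [4.8, 5.1], [5.3, 4.9]]
labels = ["NC", "NC", "NC", "MM", "MM", "MM"]
accuracy = loocv_accuracy(features, labels)
```

Any classifier with the same `predict(train_x, train_y, sample)` signature can be passed to `loocv_accuracy`. Note that this sketch keeps the feature set fixed across folds; if feature selection were repeated inside each fold, it would need to be moved into the loop.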
Tissue specificity of LVQ feature sets using publicly available databases
To evaluate whether our LVQ gene sets were specific to the tissue of origin (TOO), we downloaded publicly available average tissue-level expression values (transcripts per million; TPMs) from the Human Protein Atlas (ref: https://www.proteinatlas.org/about/download). The methodology used to normalize and calculate average expression values can be found here: https://www.proteinatlas.org/about/assays+annotation#hpa_rna. We then subsetted this matrix of expression values for our two gene sets (top ten LVQ for MM versus non-cancer, and top ten LVQ for HCC versus non-cancer), and calculated a z-score for each gene across tissue types to evaluate which tissue types the genes were enriched in. Next, we generated a heatmap of this transformed matrix using ComplexHeatmap (v2.4.3).
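The per-gene z-score transformation can be illustrated with the following Python sketch. The TPM values are invented rather than taken from the Human Protein Atlas, the function name is ours, and the use of the population standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, pstdev


def zscore_across_tissues(tpm_by_tissue):
    """Standardize one gene's expression across tissues (mean 0, unit spread)."""
    values = list(tpm_by_tissue.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene with identical expression everywhere has no tissue enrichment
        return {tissue: 0.0 for tissue in tpm_by_tissue}
    return {tissue: (value - mu) / sigma for tissue, value in tpm_by_tissue.items()}


# Made-up TPM values for one gene, for illustration only
tpm = {"liver": 90.0, "bone marrow": 10.0, "lung": 20.0}
z = zscore_across_tissues(tpm)
top_tissue = max(z, key=z.get)
```

Scaling each gene separately puts highly and lowly expressed genes on the same footing, so that a heatmap highlights where each gene is relatively enriched rather than which genes are most abundant overall.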
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
This work was supported by the Cancer Early Detection Advanced Research Center (CEDAR), Knight Cancer Institute, Oregon Health and Science University (CEDAR3250918), Cancer Research UK/OHSU Project Award (C63763/A27122 to T.T.M.N.), the OCTRI CTSA grant (UL1TR000128), the Kuni Foundation, the Department of Defense (W81XWH2110853 to T.T.M.N.) and the Susan G. Komen Foundation (CCR21663959 to T.T.M.N.). We would like to acknowledge the CEDAR repository and the Biolibrary for helping with sample collection and processing and Gordon Mills, Joseph Estabrook, and Ashley Woodfin for helpful discussion.
Author contributions
T.T.M.N. designed and supervised all aspects of the study and wrote the manuscript; All authors contributed to and edited the manuscript; H.J.K., J.T.W., C.W.K., E.S., and T.T.M.N. designed experiments and carried out RNA extraction, library preparation, and RT-qPCR measurement; T.T.M.N., P.A., B.R.-H., and R.C. processed the sequencing data; T.T.M.N., P.A., and B.R.-H. developed statistical tools and analyzed the data; F.C., P.S., R.F.T., K.F., and W.E.N. contributed to the study design and the manuscript. B.R-H and H.J.K. are co-first authors.
Data availability
cfRNA sequencing data have been deposited in the Gene Expression Omnibus Repository (GSE182824).
Code availability
The in-house scripts used in this manuscript, including those for data processing, downstream analysis, and figure generation, are publicly available on GitHub.
Competing interests
Oregon Health and Science University has filed patent applications based on this work. The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Breeshey Roskams-Hieter, Hyun Ji Kim.
Supplementary information
The online version contains supplementary material available at 10.1038/s41698-022-00270-y.
References
- 1. SEER. Cancer Stat Facts: Liver and Intrahepatic Bile Duct Cancer (National Cancer Institute, 2018).
- 2. Howlader N, et al. SEER Cancer Statistics Review, 1975–2018 (National Cancer Institute, 2021).
- 3. Kyle RA, Rajkumar SV. Management of monoclonal gammopathy of undetermined significance (MGUS) and smoldering multiple myeloma (SMM). Oncol. 2011;25:578–586.
- 4. Dhodapkar MV. MGUS to myeloma: a mysterious gammopathy of underexplored significance. Blood. 2016;128:2599. doi: 10.1182/blood-2016-09-692954.
- 5. Llovet JM, et al. Hepatocellular carcinoma. Nat. Rev. Dis. Prim. 2016;2:16018. doi: 10.1038/nrdp.2016.18.
- 6. Fateen W, Ryder SD. Screening for hepatocellular carcinoma: patient selection and perspectives. J. Hepatocell. Carcinoma. 2017;4:71–79. doi: 10.2147/JHC.S105777.
- 7. Starr SP, Raines D. Cirrhosis: diagnosis, management, and prevention. Am. Fam. Physician. 2011;84:1353–1359.
- 8. Laursen L. A preventable cancer. Nature. 2014;516:S2. doi: 10.1038/516S2a.
- 9. Goh GB, Chang PE, Tan CK. Changing epidemiology of hepatocellular carcinoma in Asia. Best. Pr. Res Clin. Gastroenterol. 2015;29:919–928. doi: 10.1016/j.bpg.2015.09.007.
- 10. Wong VW, et al. Clinical scoring system to predict hepatocellular carcinoma in chronic hepatitis B carriers. J. Clin. Oncol. 2010;28:1660–1665. doi: 10.1200/JCO.2009.26.2675.
- 11. Yang HI, et al. Risk estimation for hepatocellular carcinoma in chronic hepatitis B (REACH-B): development and validation of a predictive score. Lancet Oncol. 2011;12:568–574. doi: 10.1016/S1470-2045(11)70077-8.
- 12. Bai Y, Zhao H. Liquid biopsy in tumors: opportunities and challenges. Ann. Transl. Med. 2018;6:S89. doi: 10.21037/atm.2018.11.31.
- 13. Palmirotta R, et al. Liquid biopsy of cancer: a multimodal diagnostic tool in clinical oncology. Ther. Adv. Med Oncol. 2018;10:1758835918794630. doi: 10.1177/1758835918794630.
- 14. Marrugo-Ramírez J, Mir M, Samitier J. Blood-based cancer biomarkers in liquid biopsy: a promising non-invasive alternative to tissue biopsy. Int. J. Mol. Sci. 2018;19:2877.
- 15. Esposito A, et al. Liquid biopsies for solid tumors: understanding tumor heterogeneity and real time monitoring of early resistance to targeted therapies. Pharm. Ther. 2016;157:120–124. doi: 10.1016/j.pharmthera.2015.11.007.
- 16. Sundling KE, Lowe AC. Circulating tumor cells: overview and opportunities in cytology. Adv. Anat. Pathol. 2019;26:56–63. doi: 10.1097/PAP.0000000000000217.
- 17. Millner LM, Linder MW, Valdes R, Jr. Circulating tumor cells: a review of present methods and the need to identify heterogeneous phenotypes. Ann. Clin. Lab Sci. 2013;43:295–304.
- 18. Thiele JA, et al. Circulating tumor cells: fluid surrogates of solid tumors. Annu. Rev. Pathol. Mechanisms Dis. 2017;12:419–447. doi: 10.1146/annurev-pathol-052016-100256.
- 19. Liu Y, Cao X. The origin and function of tumor-associated macrophages. Cell. Mol. Immunol. 2014;12:1. doi: 10.1038/cmi.2014.83.
- 20. Adams DL, et al. Circulating giant macrophages as a potential biomarker of solid tumors. Proc. Natl Acad. Sci. USA. 2014;111:3514. doi: 10.1073/pnas.1320198111.
- 21. Gast CE, et al. Cell fusion potentiates tumor heterogeneity and reveals circulating hybrid cells that correlate with stage and survival. Sci. Adv. 2018;4:eaat7828. doi: 10.1126/sciadv.aat7828.
- 22. Newman AM, et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 2014;20:548–554. doi: 10.1038/nm.3519.
- 23. Corcoran RB, Chabner BA. Application of cell-free DNA analysis to cancer treatment. N. Engl. J. Med. 2018;379:1754–1765. doi: 10.1056/NEJMra1706174.
- 24. Abbosh C, et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature. 2017;545:446–451. doi: 10.1038/nature22364.
- 25. Best MG, et al. RNA-Seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell. 2015;28:666–676. doi: 10.1016/j.ccell.2015.09.018.
- 26. Best MG, Wesseling P, Wurdinger T. Tumor-educated platelets as a noninvasive biomarker source for cancer detection and progression monitoring. Cancer Res. 2018;78:3407–3412. doi: 10.1158/0008-5472.CAN-18-0887.
- 27. In't Veld SGJG, Wurdinger T. Tumor-educated platelets. Blood. 2019;133:2359–2364.
- 28. Cohen JD, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science. 2018;359:926. doi: 10.1126/science.aar3247.
- 29. Abbosh C, Birkbak NJ, Swanton C. Early stage NSCLC - challenges to implementing ctDNA-based screening and MRD detection. Nat. Rev. Clin. Oncol. 2018;15:577–586. doi: 10.1038/s41571-018-0058-3.
- 30. Haque IS, Elemento O. Challenges in using ctDNA to achieve early detection of cancer. Preprint at bioRxiv 237578 (2017).
- 31. Salta S, et al. A DNA methylation-based test for breast cancer detection in circulating cell-free DNA. J. Clin. Med. 2018;7:420.
- 32. Xu R-h, et al. Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma. Nat. Mater. 2017;16:1155. doi: 10.1038/nmat4997.
- 33. Song C-X, et al. 5-Hydroxymethylcytosine signatures in cell-free DNA provide information about tumor types and stages. Cell Res. 2017;27:1231. doi: 10.1038/cr.2017.106.
- 34. Shen SY, et al. Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature. 2018;563:579–583. doi: 10.1038/s41586-018-0703-0.
- 35. Moss J, et al. Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 2018;9:5068. doi: 10.1038/s41467-018-07466-6.
- 36. Cristiano S, et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature. 2019;570:385–389. doi: 10.1038/s41586-019-1272-6.
- 37. Liu MC, et al. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 2020;31:745–759. doi: 10.1016/j.annonc.2020.02.011.
- 38. Chen X, et al. Non-invasive early detection of cancer four years before conventional diagnosis using a blood test. Nat. Commun. 2020;11:3475. doi: 10.1038/s41467-020-17316-z.
- 39. Gemmell CH. Activation of platelets by in vitro whole blood contact with materials: increases in microparticle, procoagulant activity, and soluble P-selectin blood levels. J. Biomater. Sci. Polym. Ed. 2001;12:933–943. doi: 10.1163/156856201753113114.
- 40. Heitzer E, et al. Current and future perspectives of liquid biopsies in genomics-driven oncology. Nat. Rev. Genet. 2019;20:71–88. doi: 10.1038/s41576-018-0071-5.
- 41. Wan JCM, et al. Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nat. Rev. Cancer. 2017;17:223. doi: 10.1038/nrc.2017.7.
- 42. Koh W, et al. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proc. Natl Acad. Sci. USA. 2014;111:7361–7366.
- 43. Pan W, et al. Simultaneously monitoring immune response and microbial infections during pregnancy through plasma cfRNA sequencing. Clin. Chem. 2017;63:1695–1704.
- 44. Ngo TTM, et al. Noninvasive blood tests for fetal development predict gestational age and preterm delivery. Science. 2018;360:1133. doi: 10.1126/science.aar3819.
- 45. Larson MH, et al. A comprehensive characterization of the cell-free transcriptome reveals tissue- and subtype-specific biomarkers for cancer detection. Nat. Commun. 2021;12:2357. doi: 10.1038/s41467-021-22444-1.
- 46. Ibarra A, et al. Non-invasive characterization of human bone marrow stimulation and reconstitution by cell-free messenger RNA sequencing. Nat. Commun. 2020;11:400. doi: 10.1038/s41467-019-14253-4.
- 47. GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 2013;45:580–585.
- 48. Uhlén M, et al. Tissue-based map of the human proteome. Science. 2015;347:394. doi: 10.1126/science.1260419.
- 49. Sardar HS, Gilbert SP. Microtubule capture by mitotic kinesin centromere protein E (CENP-E). J. Biol. Chem. 2012;287:24894–24904. doi: 10.1074/jbc.M112.376830.
- 50. Uhlen M, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347:1260419. doi: 10.1126/science.1260419.
- 51. Fry AM. The Nek2 protein kinase: a novel regulator of centrosome structure. Oncogene. 2002;21:6184–6194. doi: 10.1038/sj.onc.1205711.
- 52. Mills CA, et al. Nucleolar and spindle-associated protein 1 (NUSAP1) interacts with a SUMO E3 ligase complex during chromosome segregation. J. Biol. Chem. 2017;292:17178–17189. doi: 10.1074/jbc.M117.796045.
- 53. Srivastava RA, Bhasin N, Srivastava N. Apolipoprotein E gene expression in various tissues of mouse and regulation by estrogen. Biochem. Mol. Biol. Int. 1996;38:91–101.
- 54. Jia Q, et al. Association between complement C3 and prevalence of fatty liver disease in an adult population: a cross-sectional study from the Tianjin Chronic Low-Grade Systemic Inflammation and Health (TCLSIHealth) cohort study. PLoS ONE. 2015;10:e0122026. doi: 10.1371/journal.pone.0122026.
- 55. Zeng DW, et al. Serum ceruloplasmin levels correlate negatively with liver fibrosis in males with chronic hepatitis B: a new noninvasive model for predicting liver fibrosis in HBV-related liver disease. PLoS ONE. 2013;8:e77942. doi: 10.1371/journal.pone.0077942.
- 56. Waterham HR, et al. Mutations in the 3beta-hydroxysterol Delta24-reductase gene cause desmosterolosis, an autosomal recessive disorder of cholesterol biosynthesis. Am. J. Hum. Genet. 2001;69:685–694. doi: 10.1086/323473.
- 57. Fort A, et al. A liver enhancer in the fibrinogen gene cluster. Blood. 2011;117:276–282. doi: 10.1182/blood-2010-07-295410.
- 58. Gram J, et al. Plasma histidine-rich glycoprotein and plasminogen in patients with liver disease. Thromb. Res. 1985;39:411–417. doi: 10.1016/0049-3848(85)90164-1.
- 59. Joshi NA, Fass JN. Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files. https://github.com/najoshi/sickle (2011).
- 60. Leggett RM, et al. Sequencing quality assessment tools to enable data-driven informatics for high throughput genomics. Front. Genet. 2013;4:288. doi: 10.3389/fgene.2013.00288.
- 61. Andrews S. FastQC: a quality control tool for high throughput sequence data. (2010).
- 62. Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28:2184–2185. doi: 10.1093/bioinformatics/bts356.
- 63. Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 64. Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinforma. 2013;43:11.10.1–33. doi: 10.1002/0471250953.bi1110s43.
- 65. Anders S, Pyl PT, Huber W. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–169. doi: 10.1093/bioinformatics/btu638.
- 66. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8.
- 67. Kuhn M. Building predictive models in R using the caret package. J. Stat. Softw. 2008;28:1–26.
- 68. Alexa A, Rahnenfuhrer J. topGO: enrichment analysis for gene ontology. R package version 2.36.0. (2019).
- 69. Venables WN, Ripley BD. Modern Applied Statistics with S (Springer, 2002).
- 70. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.