Current Issues in Molecular Biology. 2024 Jul 9;46(7):7291–7302. doi: 10.3390/cimb46070432

Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer

Yeonuk Jeong 1, Jinah Chu 2, Juwon Kang 1,3, Seungjun Baek 1, Jae-Hak Lee 1, Dong-Sub Jung 1, Won-Woo Kim 1, Yi-Rang Kim 1, Jihoon Kang 1,*, In-Gu Do 2,*
Editor: Giulia Fiscon
PMCID: PMC11276602  PMID: 39057073

Abstract

Identifying the primary site of origin of metastatic cancer is vital for guiding treatment decisions, especially for patients with cancer of unknown primary (CUP). Despite advanced diagnostic techniques, CUP remains difficult to pinpoint and is responsible for a considerable number of cancer-related fatalities. Understanding its origin is crucial for effective management and potentially improving patient outcomes. This study introduces a machine learning framework, ONCOfind-AI, that leverages transcriptome-based gene set features to enhance the accuracy of predicting the origin of metastatic cancers. We demonstrate its potential to facilitate the integration of RNA sequencing and microarray data by using gene set scores for characterization of transcriptome profiles generated from different platforms. Integrating data from different platforms resulted in improved accuracy of machine learning models for predicting cancer origins. We validated our method using external data from clinical samples collected through the Kangbuk Samsung Medical Center and Gene Expression Omnibus. The external validation results demonstrate a top-1 accuracy ranging from 0.80 to 0.86, with a top-2 accuracy of 0.90. This study highlights that incorporating biological knowledge through curated gene sets can help to merge gene expression data from different platforms, thereby enhancing the compatibility needed to develop more effective machine learning prediction models.

Keywords: cancer of unknown primary, metastatic cancer, machine learning, gene expression, transcriptome

1. Introduction

Identifying the primary site of a cancer is important for determining its treatment regimen. Cancer of unknown primary (CUP) describes the diagnosis of metastatic cancer where the primary site of origin eludes detection despite comprehensive diagnostic evaluations [1]. CUP remains a perplexing challenge in oncology, accounting for approximately 2–4% of all malignancies in recent years, down from 3–5% a decade ago [1,2]. Nevertheless, CUP ranks as the third to fourth leading cause of cancer-related mortality [3]. CUP patients also exhibit greater levels of anxiety and depression than patients with known primary cancer and non-metastatic known primary cancer, along with impaired physical and mental health and social relationships [4]. Indeed, the majority of patients (80–90%) diagnosed with CUP belong to an unfavorable prognostic group; their median overall survival (OS) ranges from 3 to 11 months, with a 1-year OS rate of only 25–40% [5].

Identification of primary cancer site characteristics and associated expression targets enables the use of targeted anticancer drugs, improving prognosis over broad chemotherapy [3,6,7]. Ding et al. performed a meta-analysis and determined that identifying the tumor of origin and administering targeted therapy are efficacious approaches, particularly for CUP patients with responsive tumor types [8]. However, traditional pathology methods, while considered the gold standard, often face limitations in exhaustively identifying the primary site due to tissue constraints and the complexity of diagnostic stains. Recent studies have demonstrated the potential of machine learning (ML) algorithms trained on diverse tumor and normal tissue datasets in discerning tissue-specific and tumor-specific patterns from high-resolution molecular data. Molecular diagnostic methodologies, exemplified by genome screening, have the potential to facilitate the identification of elusive origins in CUP cases [9,10]. Researchers are also using transcriptome-wide profiling to identify genes responsible for cancer or disease [11,12]. By utilizing gene expression data and employing sophisticated ML techniques, researchers have achieved notable successes in improving diagnostic accuracy for various cancers [13,14]. With gene expression profiling (GEP) analysis, both classical statistics and machine learning classification techniques can be used to predict primary cancers [15,16]. Researchers have used public transcriptome data to develop deep learning models that can identify genetic markers of primary and metastatic cancers, validating the models using newly acquired clinical samples [17,18]. SCOPE was trained on a collection of 10,688 untreated primary tumor samples and tested on 201 metastatic cancers. The model was validated on 15 cancer types, achieving an overall mean accuracy rate of 86% [17]. 
CUP-AI-Dx was trained on the transcriptome of 18,217 primary cancer samples, and external validation was performed on 92 cancer samples of 18 types collected from clinical laboratories in the US and Australia [18]. The resulting CUP-AI-Dx models showed top-1 accuracies of 86.96% and 72.46% on the US and Australian samples, respectively. Moon et al. developed OncoNPC, an XGBoost classifier based on next-generation sequencing, which identified distinct subgroups within CUP and improved treatment outcomes through genomically guided therapies [19]. OncoNPC was trained on 36,445 tumors from three medical centers and is able to classify 22 cancers; 971 CUP tumors were collected from the Dana-Farber Cancer Institute, and the model was able to identify the primary cancer type from metastatic carcinoma with 41.2% accuracy.

With the increase in tumor sample transcriptome data obtained through RNA sequencing, the amount of data available for learning is steadily increasing. However, a significant volume of cancer transcriptome data has been generated using micro-arrays, and compatibility must be ensured in order to utilize data from both platforms. Inter-platform compatibility can be achieved through featurization using gene set information that reflects biological knowledge [20]. By employing this approach, it is possible to construct a more extensive dataset for training, which is expected to improve accuracy. In this manner, our model was able to train on more data than previous studies and make predictions for both micro-array and RNA sequencing data.

In this study, we used transcriptome data for 17 solid tumors from The Cancer Genome Atlas project (TCGA) [21] and Oncopression (OCP) datasets [22], and featurized the training data into gene set enrichment scores. We then trained an ensemble classification model by combining logistic regression, LightGBM, and SVM through a voting method (Figure 1). Furthermore, we conducted external validation using clinical samples collected from metastatic sites. Our ONCOfind-AI model showed robust performance, with an accuracy of 96.8 ± 2% in 5-fold cross-validation on 27,941 primary cancer samples. This is the largest training set among the studies that have built models from public data, and ONCOfind-AI produces the highest accuracy among the compared models, with 86.2% for top-1 predicted sites and 90.0% for top-2 predicted sites on metastatic cancer samples in external validation.

Figure 1.

Scheme of the modeling process. RNA sequencing and micro-array data were collected from The Cancer Genome Atlas project (TCGA) and Oncopression (OCP), respectively, and featurization was conducted using gene sets representing the characteristics of each tissue and organ. Using the resulting feature scores, we developed a model to predict the primary site among 17 organs.

2. Methods and Materials

2.1. Data Source

The data sources for training included transcriptome data for primary cancer collected from the OCP and TCGA databases. OCP data were generated via micro-array and provided in normalized form [22], whereas the TCGA data, downloaded from Firehose (https://gdac.broadinstitute.org/, accessed on 1 July 2023), were generated via RNA-Seq and normalized using RNA-Seq by Expectation Maximization (RSEM). A total of 27,941 samples for 17 cancer tissue types were selected for training (Table 1). For external validation, we used 103 formalin-fixed paraffin-embedded (FFPE) tissue samples from patients collected by Kangbuk Samsung Medical Center (KBSMC) between 2018 and 2022. Patient samples were collected after informed consent was obtained; the study followed the guidelines of the Declaration of Helsinki and received approval from the Institutional Review Board (KBSMC 2022-11-018). Additionally, public datasets comprising 107 samples across seven cohorts were obtained from the Gene Expression Omnibus (GEO) to further validate our findings (Table 2).

Table 1.

Training datasets.

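Cancer Type | Data Source | Sample No.
Adrenal Gland | OCP | 311
Adrenal Gland | TCGA | 261
Bile Duct | OCP | 184
Bile Duct | TCGA | 36
Bladder | OCP | 310
Bladder | TCGA | 408
Brain | OCP | 3054
Brain | TCGA | 696
Breast | OCP | 5543
Breast | TCGA | 1093
Colorectal | OCP | 3074
Colorectal | TCGA | 381
Head And Neck | OCP | 622
Head And Neck | TCGA | 520
Kidney | OCP | 356
Kidney | TCGA | 891
Liver | OCP | 413
Liver | TCGA | 373
Lung | OCP | 2243
Lung | TCGA | 1018
Ovary | OCP | 1143
Ovary | TCGA | 307
Pancreas | OCP | 207
Pancreas | TCGA | 178
Prostate | OCP | 247
Prostate | TCGA | 497
Skin | OCP | 294
Skin | TCGA | 103
Stomach | OCP | 920
Stomach | TCGA | 599
Thyroid | OCP | 298
Thyroid | TCGA | 501
Uterus | OCP | 322
Uterus | TCGA | 538
Total | OCP | 19,541
Total | TCGA | 8400
Total | OCP + TCGA | 27,941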

Table 2.

Validation datasets.

Cancer Type | Data Source | Sample No.
Bile Duct | KBSMC | 4
Bladder | KBSMC | 6
Breast | GSE14017 | 29
Breast | GSE147995 | 13
Breast | GSE191230 | 7
Breast | KBSMC | 5
Colorectal | KBSMC | 20
Colorectal | GSE40367 | 7
Head And Neck | KBSMC | 2
Kidney | KBSMC | 6
Liver | GSE40367 | 15
Liver | KBSMC | 13
Lung | KBSMC | 5
Ovary | KBSMC | 5
Pancreas | KBSMC | 10
Prostate | KBSMC | 5
Skin | KBSMC | 2
Stomach | KBSMC | 11
Stomach | GSE246963 | 8
Stomach | GSE191139 | 4
Thyroid | GSE60542 | 24
Thyroid | KBSMC | 3
Uterus | KBSMC | 6
Total | KBSMC | 103
Total | GEO | 107
Total | GEO + KBSMC | 210

2.1.1. RNA Sequencing and Gene Expression Profiling

mRNA was extracted from FFPE tissue samples taken from patients using an RNeasy FFPE Kit (Qiagen, Hilden, Germany) according to the manufacturer’s instructions. In summary, FFPE tissue sections were deparaffinized by treatment with deparaffinization solution and lysed by proteinase K digestion followed by heat treatment. Next, the supernatants treated with DNase were added to Buffer RBC and ethanol to adjust the binding conditions for RNA. The samples were applied to the RNeasy MinElute spin column, where the total RNA was bound to the membrane and contaminants were efficiently washed away. RNA was then eluted in RNase-free water. The RNA concentration was determined using a NanoDrop (Thermo Fisher Scientific, Waltham, MA, USA). Subsequently, RNA-seq was performed by Macrogen (Seoul, Republic of Korea) on the Illumina (San Diego, CA, USA) RNA-Seq platform for paired-end sequencing employing the SureSelectXT RNA Direct Reagent Kit (Agilent, Santa Clara, CA, USA). The raw FASTQ RNA-Seq data were trimmed using the Trimmomatic-0.39-1 tool [23]. Alignment and quantification were performed using STAR 2.7.8a and RSEM 1.3.3 with the GRCh38.105 genome reference [24,25].

2.2. Featurization and Feature Selection

Clinical data for external validation from KBSMC and GEO were quantified by transcriptome profile, as described in Section 2.1.1. Data from TCGA and OCP, the primary site cancer samples for training, are available in public databases and provide preprocessed transcriptome profiles. However, because the two public databases measure transcriptomes differently, they are normalized differently, and even samples from the same cancer type show different behavior. Figure 2A shows that the average expression levels of each gene in the TCGA and OCP groups for breast cancer differ in both range and distribution pattern. To integrate the different characteristics of these data, we converted the gene-wise information of the transcriptome to the gene set dimension. We created gene set enrichment scores for all samples; statistical values were extracted using the Kolmogorov–Smirnov test with 8300 gene sets from the “Hallmark gene sets”, “C2 curated gene sets”, “C6 oncogenic signature gene sets”, and “C8 cell type signature gene sets” obtained from MSigDB (www.gsea-msigdb.org, accessed on 5 March 2024, v2023.2) [26,27,28,29,30,31,32,33,34,35]. Positive or negative signs were assigned based on the directionality of the expression difference. Each gene set was defined by canonical pathway, cancer type, tissue type, cell type, or oncogenic gene. Therefore, we used the transcriptomic profiles collected from the tumors to calculate which gene sets are activated, allowing us to infer which tissues, tumors, and pathways the samples originate from. To calculate the score per gene set, we used Gene Set Enrichment Analysis (GSEA) and calculated the normalized enrichment score for each gene set [26].
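The featurization idea can be illustrated with a minimal sketch of a signed, unweighted Kolmogorov–Smirnov-style running-sum score for one sample and one gene set. This is not the authors' implementation: it omits the permutation-based normalization that yields a normalized enrichment score, and the function name, gene names, and expression values are hypothetical.

```python
from typing import Dict, Set


def ks_enrichment_score(expression: Dict[str, float], gene_set: Set[str]) -> float:
    """Signed, unweighted KS-style running-sum score for a single sample.

    Simplified illustration only; no normalization against permutations.
    """
    # Rank genes from highest to lowest expression.
    ranked = sorted(expression, key=expression.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # the statistic needs both hits and misses

    running, extreme = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        # Keep the deviation from zero with the largest magnitude, with its sign.
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Toy profile with hypothetical gene names and expression values.
toy_expression = {"G1": 5.0, "G2": 4.0, "G3": 3.0, "G4": 2.0, "G5": 1.0, "G6": 0.0}
score_top = ks_enrichment_score(toy_expression, {"G1", "G2"})     # set at top of ranking
score_bottom = ks_enrichment_score(toy_expression, {"G5", "G6"})  # set at bottom of ranking
```

A gene set concentrated among the most highly expressed genes receives a positive score and one concentrated among the least expressed genes a negative score, which mirrors the sign assignment by direction of expression difference. Because the score depends only on gene ranks within a sample, it is less sensitive to platform-specific scaling than raw expression values.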

Figure 2.

Featurization and feature selection: (A) Average expression of each gene by group for breast cancer (BRCA); (B) Histogram of the AUC values for distinguishing TCGA from OCP, by gene set. The red line marks 0.5; (C) Average enrichment score of each gene set by group for BRCA; (D) Distribution of samples by t-SNE. (C-1,D-1) show the results when all 8300 gene sets are used, while (C-2,D-2) show the results when the 1249 gene sets with AUC values in the range 0.45–0.55 are used.

After computing the 8300 gene set scores, we investigated which of these could provide a comprehensive representation of the OCP and TCGA data. In feature selection, the receiver operating characteristic (ROC)-based feature selection approach can be used as an effective tool to evaluate individual features [36,37]. In particular, for binary-class problems the single-feature classifier constructed from feature f_i can establish an appropriate threshold θ. If x ≥ θ, then x is classified as the TCGA class, and if x < θ, then x is classified as the OCP class. If one feature f_i has an area under the ROC curve (AUC) value for a single-feature classifier farther from 0.5 than another feature f_j, we can say that f_i is more discriminative than f_j between the two classes [36]. We calculated the AUC for each gene set feature in order to discriminate between the OCP cohort and TCGA cohort. Figure 2B shows a histogram of the AUC values calculated for each gene set feature. An AUC of 0.5 means that the classifier has no discriminative capacity at all; features with an AUC near 0.5 therefore do not separate OCP from TCGA, so the two cohorts can be pooled when using these features. Accordingly, feature selection was performed in steps of ±0.05 around a baseline AUC of 0.5, and these selected feature groups were used for modeling.
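The selection rule above can be sketched as follows, assuming each gene set has one score per sample in each cohort. The AUC of the single-feature threshold classifier is computed by pairwise comparison (equivalent to the Mann–Whitney statistic); the function names, gene set names, and scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def single_feature_auc(tcga: Sequence[float], ocp: Sequence[float]) -> float:
    """AUC of the rule 'score >= threshold -> TCGA', via pairwise comparison."""
    wins = 0.0
    for t in tcga:
        for o in ocp:
            if t > o:
                wins += 1.0
            elif t == o:
                wins += 0.5  # ties count as half
    return wins / (len(tcga) * len(ocp))


def select_platform_neutral(
    scores: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    half_width: float = 0.05,
) -> List[str]:
    """Keep gene sets whose TCGA-vs-OCP AUC lies within 0.5 +/- half_width."""
    return [
        name
        for name, (tcga, ocp) in scores.items()
        if abs(single_feature_auc(tcga, ocp) - 0.5) <= half_width
    ]


# Hypothetical gene set scores: (TCGA samples, OCP samples).
toy_scores = {
    "SET_A": ([0.1, 0.4, 0.6, 0.9], [0.2, 0.3, 0.7, 0.8]),  # cohorts overlap
    "SET_B": ([0.8, 0.9, 1.0, 1.1], [0.1, 0.2, 0.3, 0.4]),  # TCGA always higher
    "SET_C": ([0.1, 0.2], [0.5, 0.6]),                      # TCGA always lower
}
selected = select_platform_neutral(toy_scores, half_width=0.05)
```

In this toy example only `SET_A` is retained: `SET_B` (AUC of 1.0) and `SET_C` (AUC of 0.0) both separate the cohorts perfectly, in opposite directions, and are therefore treated as platform-dependent features.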

We found that simply converting to gene set scores brought the ranges and distribution patterns of the OCP and TCGA cohorts substantially closer together. Figure 2(C-1) shows all 8300 gene sets, demonstrating that the TCGA and OCP data are more highly correlated at the gene set level than at the gene level. Moreover, filtering out gene sets that show differences between groups by AUC range can achieve a greater correlation (Figure 2(C-2)). When examining the distribution through t-SNE clustering for major cancer types such as lung, stomach, and breast cancer, compatibility between TCGA and OCP can be observed (Figure 2D). Dimension reduction was performed on the features using t-SNE from Scikit-learn in order to visualize the distribution of samples and groups in a two-dimensional space, further aiding in understanding and optimizing the model. Figure 2(D-1) shows the result when all 8300 gene sets are used, while Figure 2(D-2) shows the result when the 1249 gene sets in the AUC range of 0.45–0.55 are used. In Figure 2(D-2), it can be seen that when using only the 1249 features selected by AUC, the samples cluster more by cancer type than by data type.

2.3. Cancer Primary Site Classification Model

To create a classification ensemble model based on machine learning, we used logistic regression, LightGBM, and support vector machine (SVM) with a soft voting classification approach. For this algorithm, we used the Scikit-learn (version 0.22.1) and LightGBM (version 3.1.1) Python packages. LightGBM is renowned for its high efficiency and low memory usage, which make it particularly effective for handling large datasets with high dimensionality. It also excels in terms of speed and accuracy. Additionally, SVM is advantageous for its effectiveness in high-dimensional spaces and its versatility through the use of different kernel functions, enabling it to model nonlinear boundaries. The ensemble approach combining these powerful classifiers leverages their individual strengths to enhance the predictive performance and achieve significant results. The parameters were set using a grid search.
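The soft voting step can be illustrated with a minimal sketch that averages the class probabilities returned by each base model and ranks the classes by the averaged value. The probabilities below are hypothetical, equal model weights are an assumption, and the sketch does not train any of the three classifiers.

```python
from typing import Dict, List


def soft_vote(prob_by_model: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-class probabilities over models (equal weights assumed)."""
    classes = prob_by_model[0].keys()
    n_models = len(prob_by_model)
    return {c: sum(model[c] for model in prob_by_model) / n_models for c in classes}


# Hypothetical class probabilities for one sample from the three base models.
logistic_regression = {"Lung": 0.6, "Breast": 0.3, "Liver": 0.1}
lightgbm_model = {"Lung": 0.2, "Breast": 0.7, "Liver": 0.1}
svm_model = {"Lung": 0.5, "Breast": 0.4, "Liver": 0.1}

averaged = soft_vote([logistic_regression, lightgbm_model, svm_model])
ranking = sorted(averaged, key=averaged.get, reverse=True)
```

Here two of the three models individually favor lung, yet the averaged probability ranks breast first because one model is far more confident. This is the behavior that distinguishes soft voting from majority (hard) voting, and the averaged probabilities also provide the ranking needed for top-1 and top-2 predictions.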

2.4. External Validation

For external validation in metastatic cancer, predictions were made through the developed model using KBSMC and GEO data and a ranking was created based on the probability. To further validate the model under real-world conditions, we trained the model on public data and then evaluated it on a prospectively collected sample of cancer patients from the hospital. Through this process, the top-1 and top-2 prediction accuracy was calculated.
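The top-1 and top-2 accuracy calculation can be sketched as follows: for each sample, classes are ranked by predicted probability, and a prediction counts as correct at top-k if the true primary site appears among the first k ranked classes. The samples and probabilities are hypothetical.

```python
from typing import Dict, List


def top_k_accuracy(probabilities: List[Dict[str, float]], truths: List[str], k: int) -> float:
    """Fraction of samples whose true label is among the k most probable classes."""
    correct = 0
    for probs, truth in zip(probabilities, truths):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:k]:
            correct += 1
    return correct / len(truths)


# Hypothetical predicted probabilities for four metastatic samples.
predicted = [
    {"Liver": 0.7, "Stomach": 0.2, "Lung": 0.1},
    {"Liver": 0.3, "Stomach": 0.6, "Lung": 0.1},
    {"Liver": 0.1, "Stomach": 0.2, "Lung": 0.7},
    {"Liver": 0.5, "Stomach": 0.4, "Lung": 0.1},
]
true_sites = ["Liver", "Liver", "Lung", "Lung"]

top1 = top_k_accuracy(predicted, true_sites, k=1)
top2 = top_k_accuracy(predicted, true_sites, k=2)
```

In this toy case the second sample is missed at top-1 but recovered at top-2, while the fourth is missed at both levels.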

3. Results and Discussion

We demonstrate that more integrated data lead to better prediction performance on datasets consisting of real-world clinical data. In addition, we show how to integrate the two largest sources of transcriptomic data, namely, micro-arrays and RNA sequencing, and use them as features in a machine learning model. The model is found to be robust, with accuracy scores of 0.988, 0.981, 0.965, 0.963, and 0.942 when performing a 5-fold cross-validation using randomized shuffling of the full TCGA + OCP data. Figure 3A shows the average F1 scores for each model with features in the range of AUC values shown in Figure 2B. In this case, the average F1 score is the average of the predictions across the 17 cancer primary organs. The model was trained with TCGA and then tested with cancer types from OCP data; conversely, the model was also trained with OCP data and then tested with TCGA cancer types. The overall line does not show the median, as there is a larger amount of OCP data than TCGA data (19,541 vs. 8400) (Table 1). We found that feature AUC ranges from 0.5 ± 0.1 to 0.5 ± 0.3 were associated with an average F1 score of 0.9 or higher (Figure 3A); therefore, we selected features from these ranges to build our model and conduct external data validation (Figure 3B).
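As a quick arithmetic check, the five fold accuracies listed above can be summarized with the standard library. Which standard deviation estimator underlies the reported ± 2% is not stated, so both the sample and population versions are computed here as an assumption-free comparison.

```python
from statistics import mean, pstdev, stdev

# Fold accuracies as reported for the 5-fold cross-validation.
fold_accuracy = [0.988, 0.981, 0.965, 0.963, 0.942]

mean_accuracy = mean(fold_accuracy)    # average over the five folds
sample_sd = stdev(fold_accuracy)       # sample standard deviation (n - 1)
population_sd = pstdev(fold_accuracy)  # population standard deviation (n)
```

The mean is 0.968 to three decimals, matching the 96.8% quoted in the Introduction, and both spread estimates fall just below 0.02, which is consistent with the reported ± 2% after rounding.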

Figure 3.

(A) TCGA vs. OCP cross-validation. Models for evaluating metastatic carcinoma were built using the feature AUC ranges that yielded F1 scores of 0.9 and above, as indicated by the red line; (B) Accuracy of external validation by feature selection groups; (C) Confusion matrix of external validation for KBSMC and GEO data; (D) Confusion matrix of external validation for only KBSMC data.

Models trained on TCGA data, OCP data, and a combination of both TCGA and OCP were validated using external data. The best scores for the validation sets in each model are shown in Figure 3B and Table 3. Notably, models trained on TCGA data generated through RNA sequencing exhibited significantly lower performance when validated with KBSMC data, which were also produced through RNA sequencing. However, we found a significant improvement in performance when micro-array data from OCP were included in the training with our feature integration method. The models trained on the combined TCGA + OCP dataset demonstrated an accuracy range of approximately 0.80 to 0.86 depending on the range of feature selection (Figure 3B, Table 3). Figure 3C,D shows the confusion matrix for the top-1 prediction of the external validation set. The feature selection group adopted in the evaluation model had AUC values within 0.5 ± 0.25. Most of the predictions matched the actual primary site, and for those that were incorrect, the predicted site was most often a neighboring organ. For example, when bile duct was the correct answer, the incorrect answers were pancreas and colorectum, whereas when the correct answer was uterus, the incorrect answers were ovary and kidney. In particular, the prostate, uterus, bile duct, and pancreas, which had relatively high incorrect answer rates, each had ten or fewer data points, making it difficult to perform sufficient validation. Our model had an average accuracy of 0.9 when calculating the accuracy for KBSMC + GEO data up to the top-2 predictions (Table 4).

Table 3.

External validation results.

Training Data | Test Data | Feature AUC Range (0.5 ±) | Weighted Accuracy
OCP + TCGA | GEO + KBSMC | 0.25 | 0.862
OCP + TCGA | GEO + KBSMC | 0.3 | 0.852
OCP + TCGA | GEO + KBSMC | 0.15 | 0.843
OCP + TCGA | GEO + KBSMC | 0.2 | 0.843
OCP + TCGA | GEO + KBSMC | 0.1 | 0.833
OCP + TCGA | KBSMC | 0.25 | 0.825
OCP + TCGA | KBSMC | 0.3 | 0.806
OCP | GEO + KBSMC | 0.1 | 0.79
OCP + TCGA | KBSMC | 0.15 | 0.777
OCP | GEO + KBSMC | 0.2 | 0.776
OCP | GEO + KBSMC | 0.15 | 0.776
OCP + TCGA | KBSMC | 0.2 | 0.767
OCP | GEO + KBSMC | 0.3 | 0.757
OCP + TCGA | KBSMC | 0.1 | 0.748
OCP | GEO + KBSMC | 0.25 | 0.738
TCGA | GEO + KBSMC | 0.3 | 0.724
TCGA | GEO + KBSMC | 0.15 | 0.719
TCGA | GEO + KBSMC | 0.1 | 0.719
TCGA | GEO + KBSMC | 0.2 | 0.71
TCGA | GEO + KBSMC | 0.25 | 0.71
OCP | KBSMC | 0.15 | 0.689
OCP | KBSMC | 0.2 | 0.689
OCP | KBSMC | 0.1 | 0.68
OCP | KBSMC | 0.3 | 0.65
OCP | KBSMC | 0.25 | 0.631
TCGA | KBSMC | 0.3 | 0.534
TCGA | KBSMC | 0.1 | 0.524
TCGA | KBSMC | 0.15 | 0.515
TCGA | KBSMC | 0.25 | 0.495
TCGA | KBSMC | 0.2 | 0.495

Table 4.

Top-2 accuracy of external validation.

Cancer Type | Top-2 Accuracy for GEO + KBSMC Validation
Bile Duct | 0.250
Bladder | 1.000
Breast | 0.926
Colorectal | 0.963
Head And Neck | 0.500
Kidney | 1.000
Liver | 0.964
Lung | 0.800
Ovary | 1.000
Pancreas | 0.900
Prostate | 0.800
Skin | 1.000
Stomach | 0.870
Thyroid | 0.889
Uterus | 0.667
Weighted Average | 0.900
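The weighted average in Table 4 can be reproduced from the per-type top-2 accuracies and the per-type validation sample counts, where the counts are obtained by summing the KBSMC and GEO rows of Table 2 for each cancer type. The sketch below is a consistency check under that reading of the two tables; the variable names are illustrative.

```python
# Validation sample counts per cancer type (KBSMC + GEO rows of Table 2 summed).
sample_counts = {
    "Bile Duct": 4, "Bladder": 6, "Breast": 29 + 13 + 7 + 5, "Colorectal": 20 + 7,
    "Head And Neck": 2, "Kidney": 6, "Liver": 15 + 13, "Lung": 5, "Ovary": 5,
    "Pancreas": 10, "Prostate": 5, "Skin": 2, "Stomach": 11 + 8 + 4,
    "Thyroid": 24 + 3, "Uterus": 6,
}

# Top-2 accuracy per cancer type (Table 4).
top2_accuracy = {
    "Bile Duct": 0.250, "Bladder": 1.000, "Breast": 0.926, "Colorectal": 0.963,
    "Head And Neck": 0.500, "Kidney": 1.000, "Liver": 0.964, "Lung": 0.800,
    "Ovary": 1.000, "Pancreas": 0.900, "Prostate": 0.800, "Skin": 1.000,
    "Stomach": 0.870, "Thyroid": 0.889, "Uterus": 0.667,
}

total_samples = sum(sample_counts.values())
# Weight each cancer type's accuracy by its share of the validation samples.
weighted_top2 = sum(
    sample_counts[site] * top2_accuracy[site] for site in sample_counts
) / total_samples
```

The counts sum to the 210 validation samples, and the sample-weighted mean rounds to 0.900, in agreement with the last row of the table. Cancer types with very few samples, such as bile duct and head and neck, therefore contribute little to the overall figure despite their low individual accuracies.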

3.1. Performance of ONCOfind-AI

We compared ONCOfind-AI to previous studies that have used machine learning to predict the primary site of metastatic carcinoma over the past five years (Table 5).

Table 5.

Performance comparison.

Model | Train Set | Test Set (Metastatic) | Average Accuracy
SCOPE [17] | 10,688 | 201 (33 from origin site) | 15 types, 86%
CUP-AI-Dx [18] | 18,217 | 92 (23 from origin site) | 18 types, 83.33%
OncoNPC [19] | 36,445 | 971 | 22 types, 41.2%
ONCOfind-AI | 27,941 | 210 | 17 types, 86.2%

As it is very difficult to collect metastatic carcinomas, the models were all trained on cancer samples from the primary site. This study presents an integrated feature calculation method for the data types that provide transcriptome profiles, which makes it the largest of these studies built on public data. The most recent study, OncoNPC from 2023, was a collaboration across three clinical research centers and collected the largest number of cancer samples; however, it achieved relatively low accuracy using a simple XGBoost classifier model. SCOPE used 201 metastatic cancer samples for validation, with 168 obtained from metastatic sites and the remaining 33 from the origin site. CUP-AI-Dx was developed at the Jackson Laboratory for Genomic Medicine and trained on RNA-Seq data from TCGA and the International Cancer Genome Consortium (ICGC). For clinical validation, 92 FFPE samples representing 18 cancer types were collected from two clinical laboratories in the USA and Australia. Of these, 23 samples were from the JAX CLIA lab, six of which were primary cancers; for these 23 samples, the primary site was predicted with 86.96% accuracy. The remaining 69 samples, covering 18 types of metastatic cancer, were from the University of Melbourne, and their primary site was predicted with 72.46% accuracy. Primary cancer data are abundant because primary tumors are frequently resected, whereas patients commonly do not undergo surgery once metastasis has occurred, making metastatic cancer data very rare and difficult to obtain. Therefore, most studies have only used a very small amount of testing data for clinical validation compared to the amount of training data. However, in our study all 210 samples were taken from metastatic tissues, and ONCOfind-AI predicted the primary site with the highest top-1 accuracy. Furthermore, as the predictions that missed at top-1 tended to name organs similar to the correct answer, it can be interpreted that our model learned the characteristics of tissues. 
This suggests that with more comprehensive feature selection based on the featurization approach outlined in our research, we can anticipate even better performance as the volume of data available across all cancer types increases.

3.2. Limitation and Future Work

This study shows the potential for further data integration by including other public databases such as ICGC, which we plan to address in the future. Collecting cancer samples from metastatic sites is very challenging because patients with metastatic cancer often do not undergo biopsies, and this limits any study based on such data. Our laboratory will continue to work with KBSMC to collect data and refine the model. In future research, we will validate whether our model performs as well for non-Asian ethnicities. Cancer genomic variation differs significantly by race, and only 672 out of 11,122 patients in the public TCGA database are Asian [21]; therefore, we expect our model to perform better for Americans.

Author Contributions

Y.J.: writing—original draft, writing—review and editing, conceptualization, methodology, validation, data curation. J.C.: methodology, resources, data curation, writing—review. J.K.: writing—original draft, conceptualization, methodology, validation. S.B.: data curation. J.-H.L.: supervision, writing—review. D.-S.J.: supervision, writing—review. W.-W.K.: supervision, project administration, funding acquisition. Y.-R.K.: supervision, project administration, funding acquisition. J.K.: supervision, project administration, funding acquisition, writing—review and editing. I.-G.D.: supervision, project administration, funding acquisition, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

This study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the Institutional Review Board of Kangbuk Samsung Hospital (KBSMC 2022-11-018).

Informed Consent Statement

Complete written informed consent was obtained from the patient for the publication of this study and accompanying images.

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author, In-Gu Do. The data are not publicly available due to their containing information that could compromise the privacy of research participants. The source code for this project has been distributed on GitHub (http://github.com/yeonuk-Jeong/ONCOfind, accessed on 2 July 2024).

Conflicts of Interest

Authors Yeonuk Jeong, Juwon Kang, Seungjun Baek, Jae-Hak Lee, Dong-Sub Jung, Won-Woo Kim, Yi-Rang Kim, and Jihoon Kang were employed by the company Oncocross Ltd. Authors Jinah Chu and In-Gu Do were employed by the Kangbuk Samsung Hospital. The authors declare that this study received funding from the Seoul Business Agency (SBA). The funder was not involved in the study design, the collection, analysis, and interpretation of data, the writing of this article, or the decision to submit it for publication.

Funding Statement

This research was supported by the Seoul R&BD Program (BT220252) through the Seoul Business Agency (SBA) funded by the Seoul Metropolitan Government.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

1. Pavlidis N., Pentheroudakis G. Cancer of unknown primary site. Lancet. 2012;379:1428–1435. doi: 10.1016/S0140-6736(11)61178-1.
2. Varadhachary G., Abbruzzese J.L. Abeloff’s Clinical Oncology. Elsevier; Amsterdam, The Netherlands: 2020. Carcinoma of unknown primary; pp. 1694–1702.
3. Qaseem A., Usman N., Jayaraj J.S., Janapala R.N., Kashif T. Cancer of unknown primary: A review on clinical guidelines in the development and targeted management of patients with the unknown primary site. Cureus. 2019;11:e5552. doi: 10.7759/cureus.5552.
4. Hyphantis T., Papadimitriou I., Petrakis D., Fountzilas G., Repana D., Assimakopoulos K., Carvalho A.F., Pavlidis N. Psychiatric manifestations, personality traits and health-related quality of life in cancer of unknown primary site. Psycho-Oncology. 2013;22:2009–2015. doi: 10.1002/pon.3244.
5. Ma W., Wu H., Chen Y., Xu H., Jiang J., Du B., Wan M., Ma X., Chen X., Lin L., et al. New techniques to identify the tissue of origin for cancer of unknown primary in the era of precision medicine: Progress and challenges. Briefings Bioinform. 2024;25:bbae028. doi: 10.1093/bib/bbae028.
6. Rassy E., Pavlidis N. Progress in refining the clinical management of cancer of unknown primary in the molecular era. Nat. Rev. Clin. Oncol. 2020;17:541–554. doi: 10.1038/s41571-020-0359-1.
7. Shuel S.L. Targeted cancer therapies: Clinical pearls for primary care. Can. Fam. Physician. 2022;68:515. doi: 10.46747/cfp.6807515.
8. Ding Y., Jiang J., Xu J., Chen Y., Zheng Y., Jiang W., Mao C., Jiang H., Bao X., Shen Y., et al. Site-specific therapy in cancers of unknown primary site: A systematic review and meta-analysis. ESMO Open. 2022;7:100407. doi: 10.1016/j.esmoop.2022.100407.
9. Massard C., Loriot Y., Fizazi K. Carcinomas of an unknown primary origin—Diagnosis and treatment. Nat. Rev. Clin. Oncol. 2011;8:701–710. doi: 10.1038/nrclinonc.2011.158.
10. Varghese A., Arora A., Capanu M., Camacho N., Won H., Zehir A., Gao J., Chakravarty D., Schultz N., Klimstra D., et al. Clinical and molecular characterization of patients with cancer of unknown primary in the modern era. Ann. Oncol. 2017;28:3015–3021. doi: 10.1093/annonc/mdx545.
11. Mai J., Lu M., Gao Q., Zeng J., Xiao J. Transcriptome-wide association studies: Recent advances in methods, applications and available databases. Commun. Biol. 2023;6:899. doi: 10.1038/s42003-023-05279-y.
12. Cao C., Kwok D., Edie S., Li Q., Ding B., Kossinna P., Campbell S., Wu J., Greenberg M., Long Q. kTWAS: Integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Briefings Bioinform. 2021;22:bbaa270. doi: 10.1093/bib/bbaa270.
13. Petinrin O.O., Saeed F., Toseef M., Liu Z., Basurra S., Muyide I.O., Li X., Lin Q., Wong K.C. Machine learning in metastatic cancer research: Potentials, possibilities, and prospects. Comput. Struct. Biotechnol. J. 2023;21:2454–2470. doi: 10.1016/j.csbj.2023.03.046.
14. Divate M., Tyagi A., Richard D.J., Prasad P.A., Gowda H., Nagaraj S.H. Deep learning-based pan-cancer classification model reveals tissue-of-origin specific gene expression signatures. Cancers. 2022;14:1185. doi: 10.3390/cancers14051185.
15. Zheng Y., Ding Y., Wang Q., Sun Y., Teng X., Gao Q., Zhong W., Lou X., Xiao C., Chen C., et al. 90-gene signature assay for tissue origin diagnosis of brain metastases. J. Transl. Med. 2019;17:331. doi: 10.1186/s12967-019-2082-1.
16. Jiang W., Shen Y., Ding Y., Ye C., Zheng Y., Zhao P., Liu L., Tong Z., Zhou L., Sun S., et al. A naive Bayes algorithm for tissue origin diagnosis (TOD-Bayes) of synchronous multifocal tumors in the hepatobiliary and pancreatic system. Int. J. Cancer. 2018;142:357–368. doi: 10.1002/ijc.31054.
17. Grewal J.K., Tessier-Cloutier B., Jones M., Gakkhar S., Ma Y., Moore R., Mungall A.J., Zhao Y., Taylor M.D., Gelmon K., et al. Application of a neural network whole transcriptome–based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Netw. Open. 2019;2:e192597. doi: 10.1001/jamanetworkopen.2019.2597.
18. Zhao Y., Pan Z., Namburi S., Pattison A., Posner A., Balachander S., Paisie C.A., Reddi H.V., Rueter J., Gill A.J., et al. CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine. 2020;61:103030. doi: 10.1016/j.ebiom.2020.103030.
19. Moon I., LoPiccolo J., Baca S.C., Sholl L.M., Kehl K.L., Hassett M.J., Liu D., Schrag D., Gusev A. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat. Med. 2023;29:2057–2067. doi: 10.1038/s41591-023-02482-6.
20. van der Kloet F.M., Buurmans J., Jonker M.J., Smilde A.K., Westerhuis J.A. Increased comparability between RNA-Seq and microarray data by utilization of gene sets. PLoS Comput. Biol. 2020;16:e1008295. doi: 10.1371/journal.pcbi.1008295.
21. Yuan J., Hu Z., Mahal B.A., Zhao S.D., Kensler K.H., Pi J., Hu X., Zhang Y., Wang Y., Jiang J., et al. Integrated analysis of genetic ancestry and genomic alterations across cancers. Cancer Cell. 2018;34:549–560. doi: 10.1016/j.ccell.2018.08.019.
22. Lee J., Choi C. Oncopression: Gene expression compendium for cancer with matched normal tissues. Bioinformatics. 2017;33:2068–2070. doi: 10.1093/bioinformatics/btx121.
23. Bolger A.M., Lohse M., Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
  • 24.Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Li B., Dewey C.N. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 2011;12:323. doi: 10.1186/1471-2105-12-323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S., et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Liberzon A., Birger C., Thorvaldsdóttir H., Ghandi M., Mesirov J.P., Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1:417–425. doi: 10.1016/j.cels.2015.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Shao X., Gomez C.D., Kapoor N., Considine J.M., Grams C., Gao Y., Naba A. MatrisomeDB 2.0: 2023 updates to the ECM-protein knowledge database. Nucleic Acids Res. 2023;51:D1519–D1530. doi: 10.1093/nar/gkac1009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Newman J.C., Weiner A.M. L2L: A simple tool for discovering the hidden significance in microarray expression data. Genome Biol. 2005;6:R81. doi: 10.1186/gb-2005-6-9-r81. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Zeller K.I., Jegga A.G., Aronow B.J., O’Donnell K.A., Dang C.V. An integrated database of genes responsive to the Myc oncogenic transcription factor: Identification of direct genomic targets. Genome Biol. 2003;4:R69. doi: 10.1186/gb-2003-4-10-r69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Nishimura D. BioCarta. Biotech Softw. Internet Rep. Comput. Softw. J. Sci. 2001;2:117–120. doi: 10.1089/152791601750294344. [DOI] [Google Scholar]
  • 32.Kanehisa M., Goto S., Furumichi M., Tanabe M., Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38:D355–D360. doi: 10.1093/nar/gkp896. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Schaefer C.F., Anthony K., Krupa S., Buchoff J., Day M., Hannay T., Buetow K.H. PID: The pathway interaction database. Nucleic Acids Res. 2009;37:D674–D679. doi: 10.1093/nar/gkn653. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Jassal B., Matthews L., Viteri G., Gong C., Lorente P., Fabregat A., Sidiropoulos K., Cook J., Gillespie M., Haw R., et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020;48:D498–D503. doi: 10.1093/nar/gkz1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Pico A.R., Kelder T., Van Iersel M.P., Hanspers K., Conklin B.R., Evelo C. WikiPathways: Pathway editing for the people. PLoS Biol. 2008;6:e184. doi: 10.1371/journal.pbio.0060184. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Sun L., Wang J., Wei J. AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity. BMC Bioinform. 2017;18:73–89. doi: 10.1186/s12859-017-1468-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Fawcett T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006;27:861–874. doi: 10.1016/j.patrec.2005.10.010. [DOI] [Google Scholar]


Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author, In-Gu Do. The data are not publicly available because they contain information that could compromise the privacy of research participants. The source code for this project is available on GitHub (http://github.com/yeonuk-Jeong/ONCOfind, accessed on 2 July 2024).


Articles from Current Issues in Molecular Biology are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
