This cross-sectional diagnostic study evaluates the accuracy of a machine learning method that uses the whole transcriptome to identify gene markers in primary and metastatic tumors.
Key Points
Question
What is the practical use of a computational method trained to classify cancer types using the whole transcriptome?
Findings
For this cross-sectional diagnostic analysis, a set of neural networks was trained using the whole transcriptomes of normal and tumor tissues; the resultant classifier had a 99% accuracy rate for identifying primary cancers in an independent cohort, showed stable performance in treatment-resistant metastases, and identified 12 of 15 putative primary tumors for cancers with unknown site of origin.
Meaning
According to results of this study, machine learning–based cancer classifiers that use the whole transcriptome to automatically learn tissue- and tumor-specific gene markers, without prior knowledge of those markers, may resolve cases refractory to routine pathology diagnosis.
Abstract
Importance
A molecular diagnostic method that incorporates information about the transcriptional status of all genes across multiple tissue types can strengthen confidence in cancer diagnosis.
Objective
To determine the practical use of a whole transcriptome–based pan-cancer method in diagnosing primary and metastatic cancers and resolving complex diagnoses.
Design, Setting, and Participants
This cross-sectional diagnostic study assessed Supervised Cancer Origin Prediction Using Expression (SCOPE), a machine learning method using whole-transcriptome RNA sequencing data. Training was performed on publicly available primary cancer data sets, including The Cancer Genome Atlas. Testing was performed retrospectively on untreated primary cancers and treated metastases from volunteer adult patients at BC Cancer in Vancouver, British Columbia, from January 1, 2013, to March 31, 2016. Training spanned 10 822 samples and 66 output classes representing untreated primary cancers (n = 40) and adjacent normal tissues (n = 26). SCOPE’s performance was demonstrated on 211 untreated primary mesotheliomas and 201 treatment-resistant metastatic cancers. Finally, SCOPE was used to identify the putative site of origin in 15 cases with initial presentation as cancers with unknown primary of origin.
Results
A total of 10 688 adult patient samples representing 40 untreated primary tumor types and 26 adjacent-normal tissues were used for training. Demographic data were not available for all data sets. Among the training data set, 5157 of 10 244 (50.3%) were male and the mean (SD) age was 58.9 (14.5) years. Testing was performed on 211 patients with untreated primary mesothelioma (173 [82.0%] male; mean [SD] age, 64.5 [11.3] years); 201 patients with treatment-resistant cancers (141 [70.1%] female; mean [SD] age, 55.6 [12.9] years); and 15 patients with cancers of unknown primary of origin; among the treatment-resistant cancers, 168 were metastatic, and 33 were the primary presentation. An accuracy rate of 99% was obtained for primary epithelioid mesotheliomas tested (125 of 126). The remaining 85 mesotheliomas had a mixed etiology (sarcomatoid mesotheliomas) and were correctly identified as a mixture of their primary components, with potential implications in resolving subtypes and incidences of mixed histology. SCOPE achieved an overall mean (SD) accuracy rate of 86% (11%) and F1 score of 0.79 (0.12) on the 201 treatment-resistant cancers and matched 12 of 15 of the putative diagnoses for cancers with indeterminate diagnosis from conventional pathology.
Conclusions and Relevance
These results suggest that machine learning approaches incorporating multiple tumor profiles can more accurately identify the cancerous state and discriminate it from normal cells. SCOPE uses the whole transcriptomes from normal and tumor tissues, and results of this study suggest that it performs well for rare cancer types, primary cancers, treatment-resistant metastatic cancers, and cancers of unknown primary of origin. Genes most relevant in SCOPE’s decision making were examined, and several are known biological markers of respective cancers. SCOPE may be applied as an orthogonal diagnostic method in cases where the site of origin of a cancer is unknown, or when standard pathology assessment is inconclusive.
Introduction
Identification of the site of origin of a tumor in a patient is currently used to guide cancer treatment. It also informs any subsequent analysis through alignment with relevant tumor literature and expected molecular background. Currently, established pathology approaches are used for cancer diagnosis and are considered the criterion standard. These approaches use morphology and histochemistry to provide a diagnosis that also determines eligibility for drug regimens and clinical trials. Modern pathology is a process of sequential exclusion and prioritization across candidate diagnoses, but an exhaustive search is rendered impossible by limited tissue and diagnostic stains.
The efficiency of cancer diagnosis by pathologists may be improved if an automated method can be developed to approach this task with some knowledge of cancer biology, similar to a pathologist. A machine learning method trained across diverse tumors and normal tissues can learn what characterizes each cancer, rather than its tissue site. Training on high-resolution molecular data will allow it to discover tissue-specific and tumor-specific biological patterns from the whole transcriptome.
The use of gene-expression data has outperformed traditional pathology workflows for cancer diagnosis in several landmark studies.1,2,3,4 Studies have also shown that transcriptome-wide profiling offers greater information about tumors than microarrays,5,6 with practical use in precision oncology.7,8 We can therefore use high-resolution transcriptomic data as an orthogonal approach to improve diagnostic accuracy in many cancers.9,10 Although analyzing such high-dimensional data within a diagnostic workflow is not manually feasible, machine learning methods can be trained to do so instead.
We have developed and validated Supervised Cancer Origin Prediction Using Expression (SCOPE), a set of neural networks that use the whole transcriptome to identify the closest match for a tumor from among 40 cancer types and 26 normal tissues. We account for the influence of differentiation and biopsy site by including normal tissues (classes) from The Cancer Genome Atlas (TCGA) in our training data set.11 We determine genes weighted heavily for decision making and demonstrate that SCOPE is able to prioritize genes relevant to each class without any prior information. Our method takes the reads per kilobase of transcript per million (RPKM) values of 17 688 genes to predict a tumor type in less than 2 minutes per sample on a CPU machine with 32 GB RAM, and can be extended as new data become available.
Diagnostic Challenges in Pathology
Pathology protocols for cancer diagnosis work best when the tissue specimens display high-quality and recognizable histologic features in a substantial number of cells. Generic histologic features alone are often not sufficient to determine the subtype of a tumor; hence, the confirmation of cell of origin, typically via immunohistochemical analysis, remains the bedrock of modern pathology practice.12 Therefore, diagnosis can become a challenging task of tiered, single-plex immunohistochemical analyses for lineage-specific proteins, iteratively evaluating the next-likely diagnostic candidates. Limited tissue availability and a limited list of unambiguous immunohistochemical antibodies restrict the extent of validation workups. Interobserver variability and sample-related issues further confound pathology diagnoses.13
Diagnostic Confounders
Misdiagnosis rates for metastases in clinical practice can range between 45% and 94% in the event of challenging presentation (eg, suboptimal sample quality, histologic similarity between tissues, poor differentiation).14 This is concerning because metastases can form up to 60% of distant recurrences and cause upward of 90% of cancer-associated deaths for cancers detected in the gastrointestinal tract and across certain gynecologic cancers.13,15,16 Biomarker conversion in metastases can confound diagnosis using immunohistochemistry or biomarker-based assays.5 The site of biopsy is another confounder, particularly in the case of the liver.17 Previous work using expression microarrays has indicated that the microenvironment can contribute to the enrichment of hepatic gene expression in liver metastases, confounding an accurate diagnosis.18 These issues are magnified in cancers of unknown primary of origin (CUPs), where developing specific diagnostic protocols remains a challenge for pathology.3,9,10
Application of Machine Learning in Cancer Diagnosis
The first application of machine learning in molecular cancer diagnosis discriminated small, round blue cell tumors using microarray data.19 Khan et al19 observed that difficulties in interpreting morphologic features and reverse transcriptase polymerase chain reaction results were easily overcome when using neural networks to classify these cancers. Other machine learning algorithms have been applied to larger cohorts of cancer microarrays.3,20,21 Within these studies, samples are separated into training and test sets, and representative genes (features) are selected to maximize accuracy on the test set.
Including rare cancer types and providing a refined diagnosis remain challenges for current computational diagnostic methods. To optimize training, rare cancer types are often excluded, and anatomically proximal cancers are merged. This inevitably leads to loss of granularity and limited scope in the application of the models trained.3,20 Performance is evaluated on the test set, which can either be held out from the initial cohort or, preferably (but rarely), a cohort of samples generated and processed at different centers.3,20,21
RNA Sequencing: Incorporating High-Resolution Sequencing Data in Diagnostic Methods
RNA sequencing has largely replaced microarrays for transcriptome-wide profiling. However, the current repertoire of diagnostic methods does not draw on the high dynamic range and comprehensive coverage provided by RNA sequencing.8,22 Large-scale sequencing projects (TCGA,23 International Cancer Genome Consortium [ICGC]24) have amassed RNA sequencing data from approximately 10 000 patients with untreated primary cancers. The size and diversity of this data set provides unprecedented opportunity to apply machine learning approaches to improve the classification of all cancer types. With the availability of high-performance computing systems, it is now possible to train models using information about the transcriptional status of all genes.
Methods
We trained an ensemble of neural networks using four-fifths of the TCGA cohort of primary cancers and normal tissues. We evaluated the method on the remaining one-fifth of the TCGA cohort and tested for robustness on the Genentech cohort of primary mesotheliomas.25 We demonstrated its application on extensively pretreated metastatic lesions and CUPs. A linear classifier using analysis of variance (ANOVA)–selected features26 was used to establish the baseline performance for the classifier in the metastatic cohort (eAppendix 1 in the Supplement). The study was conducted on retrospective data from January 1, 2013, to March 31, 2016, and no follow-up was required. The Personalized OncoGenomics project at BC Cancer is approved by the University of British Columbia BC Cancer Agency Research Ethics Board. Eligible patients living in British Columbia were referred to the program by their treating oncologist. Written informed consent for the selected patient population was obtained between January 1, 2013, and March 31, 2016. This study followed the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guidelines for diagnostic analyses.
Data Preparation
Training Data
Multiplatform RNA sequencing data were obtained from TCGA, the National Cancer Institute’s non-Hodgkin lymphoma data set, and in-house data sets of secondary glioblastoma, adult medulloblastoma, and follicular lymphoma, available for research and development within our institution by agreement. Colon and rectal adenocarcinomas were combined into a single cohort (COADREAD) based on TCGA consortium findings.23 This resulted in 10 822 samples spanning 40 different tumor types and 26 adjacent normal classes, with each sample represented by 17 688 distinct gene reads per kilobase of transcript per million (RPKM) values (eAppendix 2 in the Supplement).
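For readers unfamiliar with the input unit, RPKM normalizes a gene's read count by transcript length and by sequencing depth. The following is a minimal sketch of the standard definition; the function name and the example numbers are illustrative assumptions and are not taken from the study's pipeline.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # Divide by transcript length in kilobases, then by library size in millions of reads.
    return read_count / (gene_length_bp / 1_000) / (total_mapped_reads / 1_000_000)


# Hypothetical gene: 500 reads on a 2 kb transcript in a library of 20 million mapped reads.
example_value = rpkm(500, 2_000, 20_000_000)
```

Each sample would then be represented by one such value per gene, giving a vector of 17 688 RPKM values as described above.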
Test Data
Testing was performed retrospectively. An independent set of 211 adult primary untreated mesothelioma cancers was obtained from the Genentech mesothelioma cohort.25 One hundred twenty-six of these samples are classic epithelioid mesotheliomas, whereas 85 are sarcomatoid variants. Because the training set of mesotheliomas was histologically classic epithelioid mesotheliomas, testing was performed as follows: For the epithelioid mesotheliomas, we tested whether the classification was exclusively for mesothelioma. For the biphasic and sarcomatoid variants, we tested whether the classification was split between sarcomas and mesotheliomas, as would be expected based on mixed histology of the samples. Test sets for adult metastatic disease and 15 CUPs were obtained retrospectively from the Personalized OncoGenomics study at BC Cancer.7 Biopsy specimens of 168 of the 201 metastases were obtained from their site of metastasis (24 cancer types), and the remaining 33 from their site of origin (12 cancer types) (eAppendix 3, eTable 1, and eTable 2 in the Supplement).
Data Preprocessing and Model Selection
For the initial selection of the optimal classification algorithm, gene RPKMs were used as input (eAppendix 4 and eAppendix 5 in the Supplement). Support vector machines, random forests, extra trees, and a fully connected neural network were compared. Five-fold cross-validation with grid search was used to identify the best parameters for each of these algorithms. The trained models were subsequently tested on the one-fifth held-out set.
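The model selection procedure can be sketched as follows. This is an illustration under stated assumptions, not the study's code: `k_fold_indices`, `grid_search`, and the caller-supplied `score_fn` are hypothetical names, the folds are not stratified by class, and the actual parameter grids are described only in the Supplement.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return the parameter setting with the best mean validation score.

    score_fn(params, train_idx, val_idx) is a hypothetical callback that trains
    a model on train_idx and returns its score on val_idx.
    """
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

In this arrangement the held-out one-fifth of the cohort would be set aside before calling `grid_search`, so that it plays no part in parameter selection.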
Because the other ensemble models (random forest, extra trees) had 5-fold cross-validation results nearly equivalent to those of the neural network during training, we evaluated the utility of extending the neural network model. An ensemble was developed by training multiple neural networks with different linear transformations of the data. The resultant classifier (SCOPE) contained 5 neural networks. For 1 of these neural networks, we generated synthetic samples to expand the rarer classes during training (Synthetic Minority Oversampling Technique [SMOTE]27). The differences in the 5 networks are described in detail in eAppendix 1 and eTable 3 in the Supplement.
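The core idea of SMOTE is to create new minority-class samples by interpolating between an existing sample and one of its nearest neighbors in the same class. The sketch below illustrates that idea only; the function name, the choice of k, and the toy data are assumptions, and the settings actually used for SCOPE are given in the Supplement.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic samples by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbors of `base` within the minority class, excluding itself.
        others = [s for s in minority if s is not base]
        neighbors = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy example: three samples of a rare class described by two features.
rare_class = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_samples = smote(rare_class, n_synthetic=4, k=2)
```

Because every synthetic point lies on a segment between two real samples, the new samples stay within the region already occupied by the rare class.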
Assessing Use of Feature Selection
After selection of the optimal algorithm as described, we tested the practical use of feature selection in improving classification performance. Guided by previous work,28 we used pairwise ANOVA of log-transformed training data to identify a subset of 3000 genes that discriminate the training classes with statistical significance. We also trained a classifier using the Catalogue of Somatic Mutations in Cancer’s list of 552 genes harboring somatic mutations.29 Neural network architectures optimal for each input space were identified using grid search across parameters, and trained with 5-fold cross-validation for comparison.
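ANOVA-based gene selection can be illustrated by ranking genes on a one-way ANOVA F statistic computed from log-transformed expression. This is a simplified sketch, assuming a single F test across all classes rather than the pairwise procedure used in the study, and a log2(x + 1) transform chosen here for illustration; the gene names and values are made up.

```python
import math
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene across classes."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_genes(expression, labels, n_keep):
    """expression maps gene name -> list of RPKM values, one per sample."""
    scores = {
        gene: anova_f([math.log2(v + 1) for v in vals], labels)
        for gene, vals in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


# Hypothetical data: GENE_A separates normal (n) from tumor (t); GENE_B does not.
expression = {"GENE_A": [1, 1, 1.5, 50, 55, 60], "GENE_B": [5, 6, 5, 6, 5, 6]}
labels = ["n", "n", "n", "t", "t", "t"]
selected = top_genes(expression, labels, n_keep=1)
```

With 17 688 genes as input, `n_keep=3000` would correspond to the size of the ANOVA-selected subset described above.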
Analysis of Results
The confidence score for a prediction was calculated as follows: Each neural network in the ensemble generated prediction probabilities between 0 and 1 for each class, all of which sum to 1. The class with the highest probability was considered the top-voted class for that neural network. The class top-voted most frequently across the ensemble was identified. The confidence score was then calculated as the mean of those top-voting scores.
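The voting and confidence calculation described above can be sketched as follows. This is one reading of the description, assuming that the confidence score averages the top-vote probabilities of only those networks that voted for the winning class; the function name and the example probabilities are hypothetical, and ties between classes are not handled specially.

```python
from collections import Counter


def ensemble_prediction(network_probs):
    """network_probs: one dict per network mapping class -> probability (summing to 1)."""
    # Top-voted class and its probability for each network.
    top_votes = [max(p.items(), key=lambda kv: kv[1]) for p in network_probs]
    counts = Counter(cls for cls, _ in top_votes)
    winner, _ = counts.most_common(1)[0]
    # Confidence: mean top-vote probability among networks that voted for the winner.
    winning_scores = [prob for cls, prob in top_votes if cls == winner]
    return winner, sum(winning_scores) / len(winning_scores)


# Hypothetical outputs of 5 networks over two classes.
probs = [
    {"BRCA": 0.9, "LUAD": 0.1},
    {"BRCA": 0.8, "LUAD": 0.2},
    {"BRCA": 0.4, "LUAD": 0.6},
    {"BRCA": 0.7, "LUAD": 0.3},
    {"BRCA": 0.6, "LUAD": 0.4},
]
label, confidence = ensemble_prediction(probs)
```

In this example 4 of the 5 networks vote for the same class, and the confidence is the mean of those 4 probabilities.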
Weight analysis of neural network connections was used to identify genes that were most important for predicting each class (eAppendix 6 in the Supplement).
Statistical Analysis
Precision, recall, and F1 score were used to evaluate models and demonstrate their performance. Aggregate precision and F1 scores, where reported in text, are accompanied by SDs. Precision is defined as (true-positives)/(true-positives + false-positives), and intuitively represents the classifier’s ability to distinguish between positive and negative cases. Recall is defined as (true-positives)/(true-positives + false-negatives), and intuitively represents the classifier’s ability to correctly identify all positive cases. The F1 score is the harmonic mean of the precision and recall. These metrics are calculated for each individual class, and the mean reported as the cohort metric. Accuracy is reported as (true-positives + true-negatives)/(total cases), and is calculated for the entire cohort.
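These definitions can be made concrete with a short sketch that computes the per-class metrics and their unweighted (macro) mean. The function names are hypothetical; two choices here are assumptions rather than details from the study: a metric with a zero denominator is reported as 0, and cohort accuracy is computed as the fraction of samples whose predicted class equals the diagnosed class.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} using one-vs-rest counts."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics


def cohort_summary(y_true, y_pred):
    """Macro-averaged (precision, recall, F1) and overall accuracy."""
    per_class = per_class_metrics(y_true, y_pred)
    n = len(per_class)
    macro = tuple(sum(m[i] for m in per_class.values()) / n for i in range(3))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return macro, accuracy


# Toy example with two classes.
truth = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
macro_scores, overall_accuracy = cohort_summary(truth, predicted)
```

Averaging per class gives small classes the same weight as large ones, which is why a cohort-level mean can differ from the overall fraction of correct predictions.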
A paired χ² test for association between prediction accuracy and tumor content was performed on the metastatic test cohort, with the null hypothesis being, “the classification accuracy of SCOPE is independent of tumor content.” Tumor content was determined by pathology analysis. A paired t test was used to test the association between prediction accuracy and confidence score (null hypothesis: no correlation exists between prediction accuracy and confidence score). The level of significance was 2-sided P = .05 for all tests of association. Pearson correlation was used to evaluate association between class-specific accuracy and training class size. Statistical tests were conducted using the base statistics package available in R (R version 3.5.0; RStudio API version 1.1.442; R Project for Statistical Computing).
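The study ran these tests in R. As an illustration only, the two simplest statistics can be written out in Python: a Pearson chi-square statistic for a contingency table (for example, correct vs incorrect predictions across tumor content bins) and the Pearson correlation coefficient. The sketch assumes an unpaired test of independence without continuity correction, returns the statistic rather than a P value, and uses made-up counts.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of counts.

    Assumes every row and every column contains at least one count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 2 x 2 table: rows are low/high tumor content, columns are correct/incorrect.
statistic = chi_square_statistic([[10, 20], [20, 10]])
```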
Results
A total of 10 688 adult patient samples representing 40 untreated primary tumor types and 26 adjacent-normal tissues were used for training. Demographic data were not available for all data sets. Among the training data set, 5157 of 10 244 (50.3%) were male and the mean (SD) age was 58.9 (14.5) years. Testing was performed on 211 patients with untreated primary mesothelioma (173 [82.0%] male; mean [SD] age, 64.5 [11.3] years); 201 patients with treatment-resistant cancers (141 [70.1%] female; mean [SD] age, 55.6 [12.9] years); and 15 patients with cancers of unknown primary of origin; among the treatment-resistant cancers, 168 were metastatic, and 33 were the primary presentation. In our study, SCOPE achieved 97% accuracy and a macro F1 score of 0.92 on the 2780 cases in the TCGA held-out set. We found that the whole transcriptome had improved performance over the Catalogue of Somatic Mutations in Cancer cancer gene set and ANOVA-selected genes (Figure 1A; eFigure 1 and eAppendix 7 in the Supplement). The single neural network outperformed other machine learning algorithms (Figure 1B; eFigure 2 in the Supplement). For 46 of 66 classes, 80% to 100% of the samples in each class were correctly classified (Figure 1C; eFigure 3 in the Supplement). We found that 7 classes were refractory to appropriate classification, among which 3 were cancer types (esophageal carcinomas and adenocarcinomas and cervical cancers), and all 7 had 50 or fewer training examples (class size range, 3-50). On closer investigation of the 5 neural networks in the ensemble, we found that the neural network trained with SMOTE-supplemented training examples showed improved performance on smaller classes compared with the other 4 (eFigure 4 in the Supplement).
Association of Classification Anomalies and Biological Similarities in Held-Out Set
Among the poor-performing classes in the TCGA held-out set, certain patterns were evident. The 3 kidney adjacent-normal classes (KICH, KIRC, KIRP) had significant cross-calling, which was as expected because all 3 represent healthy kidney tissue (eFigure 3B in the Supplement).
Esophageal carcinomas and adenocarcinomas were often misclassified as stomach adenocarcinomas (Figure 1C). For cervical cancers, which can be squamous, adenosquamous, and adenocarcinomas,30 subtypes were also challenging to distinguish by SCOPE. We found these trends were replicated in unsupervised clustering of the RNA sequencing data, suggesting biological rationale for the same (eFigure 5 and eFigure 6 in the Supplement).
As further evidence, we observed other molecular patterns previously noted in literature in our results. The endometrium is a common site of occurrence for uterine carcinosarcomas, and an endometrioid carcinomalike profile is a well-documented molecular subtype of uterine carcinosarcomas. We found that uterine carcinosarcoma was frequently misclassified as uterine corpus endometrial carcinoma. The Cancer Genome Atlas analysis has found that a majority of uterine carcinosarcoma samples had serouslike endometrial carcinoma precursors.16 This cross-calling was also observed by another group using this data set for classification.21
Prioritization of Known Diagnostic Gene Features Without Prior Knowledge
Manual review showed that high-importance genes for a given class were biologically relevant to the corresponding cancer or normal tissue type. For example, 2 kidney-specific genes, UMOD and AQP2, were exclusively associated with the adjacent normal tissues from all 3 renal cancer types in training. Known diagnostic markers for renal clear cell carcinoma, namely CA9 and CA12, were associated with renal clear cell carcinoma. Important genes for testicular germline cancers, POU5F1, GDF3, and NANOG, are known and proposed biomarkers. High POU5F1 (OCT4) and NANOG expression is associated with spermatogenesis dysregulation.31 Unexpectedly, in the absence of a healthy tissue class corresponding to a primary tumor type, some important genes for the cancer reflect biological characteristics of the progenitor healthy tissue, such as DPPA3/5 for testicular germline cancers, and TYR and MLANA for uveal melanomas. These observations underscore the value in including adjacent normal tissues for a pan-cancer classifier. Genes associated with each cancer type are detailed in eTable 4 in the Supplement.
External Validation on Primary Cancers
Mesothelioma is a cancer that arises in the pleura, which lines the lungs. Three main histologic categories have been defined within mesothelioma: epithelioid, sarcomatoid, and a biphasic type that presents a combination of features from the former two.32 Subtype diagnosis in mesothelioma influences patient prognosis and disease management, but without specialized histopathologist training, there is low agreement between diagnoses.33 We applied SCOPE on a previously published cohort of primary, untreated mesothelioma subtypes.
Characterizing Cancers With Mixed Histology
We obtained 99.2% accuracy (125 of 126) in identifying epithelioid mesotheliomas and biphasic-epithelioid cancers in this cohort. This is as expected, because SCOPE was trained to identify epithelioid mesotheliomas (this subtype was exclusively represented in the mesothelioma training set). Twenty-three of 29 sarcomatoid mesotheliomas (79.3%) and 55 of 56 biphasic-sarcomatoid mesotheliomas (98.2%) were predicted with split confidence between mesothelioma and sarcoma (Table 1). In addition, 4 of the remaining 6 sarcomatoid subtype samples were predicted confidently as sarcomas.
Table 1. Performance of SCOPE on Genentech Cohort of Primary Mesotheliomas(a)
Mesothelioma Subtype | Total Cases With Subtype, No. | Precision | Recall | F1 Score | Predicted | Predicted Cases, No. |
---|---|---|---|---|---|---|
Biphasic epithelioidlike | 72 | 1 | 1 | 1 | Epithelioid mesothelioma | 72 |
Epithelioid | 54 | 1 | 0.98 | 0.99 | Epithelioid mesothelioma | 53 |
Sarcomatoid | 29 | NA | NA | NA | Sarcomatoid mesothelioma | 18 |
 | | NA | NA | NA | Epithelioid mesothelioma | 5 |
 | | NA | NA | NA | Sarcoma | 4 |
 | | NA | NA | NA | Other | 2 |
Biphasic sarcomalike | 56 | NA | NA | NA | Epithelioid mesothelioma | 38 |
 | | NA | NA | NA | Sarcomatoid mesothelioma | 17 |
 | | NA | NA | NA | Other | 1 |
Abbreviations: NA, not applicable; SCOPE, Supervised Cancer Origin Prediction Using Expression.
(a) Data from Bueno et al,25 2016.
Providing Diagnosis for Complex Metastases
In an independent set of 201 posttreatment metastatic cancers, SCOPE performed well above the baseline linear classifier, achieving an overall mean (SD) accuracy of 86% (11%) and a mean (SD) F1 score of 0.79 (0.12) (Figure 2A; Table 2; eFigure 7, eTable 5, and eTable 6 in the Supplement). Among the 41 mispredictions, 7 (17.1%) matched the site of biopsy (for example, predicting hepatocarcinoma for a breast cancer biopsy specimen from the liver), and 13 (31.7%) matched a cancer type with the same organ system of origin instead (for example, predicting uterine carcinosarcoma as ovarian cancer, or predicting stomach adenocarcinoma as esophageal adenocarcinoma). For the remaining 21 cases, no obvious explanation was found for misclassification. Because our method provides a confidence score for each prediction, we could also restrict evaluation to the confident diagnoses from the ensemble (118 of 201; confidence score of ≥80%; spanning 20 cancer types), among which accuracy rose to 92%.
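Restricting evaluation to confident predictions amounts to a simple filter on the confidence score. The sketch below uses hypothetical records and a hypothetical function name; only the 0.80 threshold comes from the text.

```python
def accuracy_above_threshold(records, threshold=0.80):
    """records: iterable of (diagnosed_type, predicted_type, confidence) tuples.

    Returns (number of confident predictions, accuracy among them or None).
    """
    confident = [(t, p) for t, p, c in records if c >= threshold]
    if not confident:
        return 0, None
    correct = sum(t == p for t, p in confident)
    return len(confident), correct / len(confident)


# Hypothetical predictions: 3 of the 4 meet the threshold, and 2 of those 3 are correct.
records = [
    ("breast", "breast", 0.95),
    ("sarcoma", "breast", 0.85),
    ("sarcoma", "sarcoma", 0.60),
    ("colorectal", "colorectal", 0.80),
]
n_confident, confident_accuracy = accuracy_above_threshold(records)
```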
Table 2. Performance on the Metastatic Cohort
Diagnosed Type(a) | Total Cases, No. | TPR | FPR | TP | TN | FP | FN | Precision(b) | Recall | F1 Score | Predicted: Diagnosis(c) | Predicted: Biopsy Site(c) | Predicted: Organ System(c) | Predicted: Other(c) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metastatic Site Biopsies | ||||||||||||||
Mesothelioma | 1 | 1.00 | 0.00 | 1 | 130 | 0 | 0 | 1.00 | 1.00 | 1.00 | 1 | 0 | 0 | 0 |
Colorectal AC | 21 | 0.81 | 0.00 | 17 | 114 | 0 | 4 | 1.00 | 0.81 | 0.89 | 17 | 1 | 2 | 1 |
UCEC | 5 | 0.40 | 0.00 | 2 | 129 | 0 | 3 | 1.00 | 0.40 | 0.57 | 2 | 0 | 1 | 2 |
Uterine carcinosarcoma | 4 | 0.25 | 0.00 | 1 | 130 | 0 | 3 | 1.00 | 0.25 | 0.40 | 1 | 0 | 2 | 1 |
Breast carcinoma | 65 | 0.97 | 0.03 | 63 | 68 | 2 | 2 | 0.97 | 0.97 | 0.97 | 63 | 1 | 0 | 1 |
LNG_group | 14 | 1.00 | 0.01 | 14 | 117 | 1 | 0 | 0.93 | 1.00 | 0.97 | 14 | 0 | 0 | 0 |
Sarcoma | 17 | 0.53 | 0.01 | 9 | 122 | 1 | 8 | 0.90 | 0.53 | 0.67 | 9 | 1 | 0 | 7 |
Ovarian carcinoma | 7 | 0.86 | 0.01 | 6 | 160 | 1 | 1 | 0.86 | 0.86 | 0.86 | 6 | 0 | 0 | 1 |
Pancreatic AC | 9 | 0.33 | 0.01 | 3 | 158 | 1 | 6 | 0.75 | 0.33 | 0.46 | 3 | 1 | 4 | 1 |
MISC_group | 9 | 0.88 | 0.00 | 8 | 125 | 1 | 1 | 0.73 | 0.88 | 0.77 | 8 | 0 | 0 | 1 |
Cholangio-carcinoma | 5 | 0.80 | 0.02 | 4 | 127 | 2 | 1 | 0.67 | 0.80 | 0.73 | 4 | 0 | 1 | 0 |
GEJ_group | 11 | 0.29 | 0.01 | 3 | 151 | 5 | 8 | 0.31 | 0.29 | 0.26 | 3 | 3 | 1 | 4 |
Primary Site Biopsies | ||||||||||||||
CNS_group | 6 | 1.00 | 0.00 | 6 | 23 | 0 | 0 | 1.00 | 1.00 | 1.00 | 6 | 0 | 0 | 0 |
Breast carcinoma | 4 | 1.00 | 0.00 | 4 | 25 | 0 | 0 | 1.00 | 1.00 | 1.00 | 4 | 0 | 0 | 0 |
Colorectal AC | 1 | 1.00 | 0.00 | 1 | 28 | 0 | 0 | 1.00 | 1.00 | 1.00 | 1 | 0 | 0 | 0 |
GEJ_group | 1 | 1.00 | 0.00 | 1 | 28 | 0 | 0 | 1.00 | 1.00 | 1.00 | 1 | 0 | 0 | 0 |
MISC_group | 2 | 1.00 | 0.00 | 2 | 27 | 0 | 0 | 1.00 | 1.00 | 1.00 | 2 | 0 | 0 | 0 |
Pancreatic AC | 2 | 1.00 | 0.00 | 2 | 27 | 0 | 0 | 1.00 | 1.00 | 1.00 | 2 | 0 | 0 | 0 |
Uterine carcinosarcoma | 1 | 1.00 | 0.00 | 1 | 28 | 0 | 0 | 1.00 | 1.00 | 1.00 | 1 | 0 | 0 | 0 |
Sarcoma | 6 | 0.83 | 0.00 | 5 | 24 | 0 | 1 | 1.00 | 0.83 | 0.91 | 5 | 0 | 0 | 1 |
Mesothelioma | 4 | 0.75 | 0.00 | 3 | 26 | 0 | 1 | 1.00 | 0.75 | 0.86 | 3 | 0 | 0 | 1 |
LNG_group | 5 | 0.88 | 0.02 | 4 | 23 | 1 | 1 | 0.75 | 0.88 | 0.76 | 4 | 0 | 1 | 0 |
UCEC | 1 | 0.00 | 0.00 | 0 | 29 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0 | 1 | 0 |
Total | 201 | 0.76 | 0.005 | 160 | 3128 | 19 | 41 | 0.86 | 0.76 | 0.79 | 160 | 7 | 13 | 21 |
Abbreviations: AC, adenocarcinoma; CESC-AC, cervical/endocervical adenocarcinoma; FN, false-negative count; FP, false-positive count; FPR, false-positive rate; SCC, squamous cell carcinoma; TN, true-negative count; TP, true-positive count; TPR, true-positive rate; UCEC, uterine corpus endometrial carcinoma.
(a) CNS_group includes lower-grade glioma and glioblastoma multiforme. LNG_group includes lung AC and lung SCC. GEJ_group includes esophageal AC, esophageal SCC, stomach AC, liver hepatocellular carcinoma, and papillary kidney carcinoma. MISC_group includes prostate AC, testicular germ cell tumor, CESC-AC, subcutaneous melanoma, diffuse large B-cell lymphoma, follicular lymphoma, thymoma, and adrenocortical carcinoma.
(b) Precision, as indicated, is equivalent to class-specific accuracy.
(c) Cases where the predicted cancer type matched the pathology diagnosis (diagnosis), was the same as the tissue type of the biopsy site (biopsy site), matched a cancer type with the same organ system of origin (organ system), or did not match any of the above (other).
In our assessment of the metastatic cohort, we found no association between classification accuracy and tumor content (P = .59), and a weak correlation with the size of training class (Pearson correlation coefficient, 0.39). There was an association between classification accuracy and confidence score (n = 201; P < .001). These observations are evident in eFigure 8 in the Supplement. In biopsies from sites of metastasis (n = 168), an association was found between low tumor content and the diagnosis of another cancer type with the same organ system of origin (Figure 2B and Table 2). This association was not found in primary cancer biopsies (Figure 2C and Table 2).
Identification of Putative Primary Tumor Type for Cancers of Unknown Primary
We retrospectively predicted the cancer type for 15 cancers where the primary site of origin was unknown after initial pathology assessment. These tumors were therefore refractory to standard pathology protocols. Subsequent diagnosis was determined by analysis of whole-genome sequencing and RNA-Seq data, and validated by pathology review and immunohistochemistry.7 The prediction by SCOPE was compared against this putative diagnosis (eAppendix 3 in the Supplement). As shown in Figure 3, the classifier’s prediction matched all putative diagnoses except 1 Ewing sarcoma, 1 neuroendocrine tumor, and 1 salivary carcinoma; these 3 cancer types were not present in training.
Discussion
We present a cancer-type classifier that leverages the entire gene-expression profile of a tumor sample. Our method achieves 97% overall accuracy and a mean (SD) F1 score of 0.92 (0.06) on our held-out set. This performance level is maintained on external cohorts, with an overall accuracy of 99% on primary cancers and mean (SD) accuracy of 86% (11%) for metastatic disease. We use the confidence score values (equivalent to probabilities) for predictions to characterize cancers with mixed histology (eFigure 9 in the Supplement).
Our findings support observations in literature that physiologically proximal and morphologically similar cancer types, such as stomach adenocarcinomas and some esophageal adenocarcinomas, are highly similar at the whole-transcriptome level, in spite of having distinct clinical designations.15,34 This similarity also reflects the existing challenge with using glass-based pathology (even with the aid of immunohistochemistry) to discern these tumors.
We observed poorer performance of the method in pretreated metastases. This may be driven by known biological differences in the metastatic space. SCOPE also has difficulty discriminating metastatic cancers that share the same organ system of origin, if tumor content of the sequenced sample is low. It is possible that although there is diluted signal for the correct cancer type, low tumor content limits an accurate prediction. Incorporating more metastatic cancers in training should address these deficiencies.
Limitations
A limitation of SCOPE is the lack of external validation sets for all classes. We intend to incorporate these data sets as they become available. A challenge for general application of this method is transcriptomic data that have been generated from RNA extracted from formalin-fixed paraffin-embedded (FFPE) tissue, rather than from snap-frozen tissue. Formalin-fixed paraffin-embedded specimens are persistent morphologic records of tissue biopsies, and highly prevalent in pathology laboratories worldwide. However, controllable and uncontrollable variables, including tissue characteristics, fixation technique, and storage conditions, can affect the yield and quality of total RNA in FFPE blocks. We obtained 100% accuracy on 5 in-house primary FFPE samples. Nonetheless, FFPE application of this method will require additional validation.
Rare tumors are typically underrepresented in public data sets, which is currently a challenge in generalizing classification methods such as ours. They are also challenging to diagnose with conventional pathology methods. We show the utility of synthetic oversampling to legitimately generate additional training samples for rare cancers. Cancers of unknown primary site form 3% to 5% of all cancer diagnoses.35 This is a powerful method to identify diagnostic candidates where results from conventional pathology diagnosis are inconclusive.
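The synthetic oversampling mentioned above can be pictured with the interpolation step at the heart of SMOTE (reference 27): a new minority-class sample is placed at a random point on the line between an existing sample and one of its nearest same-class neighbors. The sketch below is a simplified illustration under that description, not the study's training pipeline; `smote_like`, its parameters, and the 2-feature toy data are hypothetical.

```python
import random


def smote_like(samples, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by squared distance to the base
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the base-to-neighbor segment
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy "rare cancer" class with 4 samples and 2 features (hypothetical values)
rare = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = smote_like(rare, n_new=6)
```

Because every synthetic point lies between two real samples of the same class, the new points stay inside the region the rare class already occupies. In practice, oversampling is applied to training data only, so that held-out performance is measured on real samples.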
Conclusions
A challenging part of building molecular diagnostics is selection of relevant features. With recent advances in computational frameworks, we can manipulate high-dimensional data quickly and efficiently, allowing us to explore a machine learning approach that leverages large training sets across multiple tumor types. This is also a hypothesis-generating method for discovery of new diagnostic biomarkers. The method is available online.36
As demonstrated by previous studies, SCOPE has proven valuable for orthogonal assessment of common cancers26 and for contextualizing the biological features of complex, rare cancer types.37 As shown by its performance on CUPs, it is particularly useful in expediting precision oncology workflows and in clinical laboratories where access to a plethora of immunostains for sequential diagnosis may be limited.
References
- 1. Hamblin A, Wordsworth S, Fermont JM, et al. Clinical applicability and cost of a 46-gene panel for genomic analysis of solid tumours: retrospective validation and prospective audit in the UK National Health Service. PLoS Med. 2017;14(2):. doi: 10.1371/journal.pmed.1002230
- 2. Meiri E, Mueller WC, Rosenwald S, et al. A second-generation microRNA-based assay for diagnosing tumor tissue origin. Oncologist. 2012;17(6):801-. doi: 10.1634/theoncologist.2011-0466
- 3. Monzon FA, Medeiros F, Lyons-Weiler M, Henner WD. Identification of tissue of origin in carcinoma of unknown primary with a microarray-based gene expression test. Diagn Pathol. 2010;5:3. doi: 10.1186/1746-1596-5-3
- 4. Zoon CK, Starker EQ, Wilson AM, Emmert-Buck MR, Libutti SK, Tangrea MA. Current molecular diagnostics of breast cancer and the potential incorporation of microRNA. Expert Rev Mol Diagn. 2009;9(5):455-467. doi: 10.1586/erm.09.25
- 5. Stefanovic S, Wirtz R, Deutsch TM, et al. Tumor biomarker conversion between primary and metastatic breast cancer: mRNA assessment and its concordance with immunohistochemistry. Oncotarget. 2017;8(31):51416-51428. doi: 10.18632/oncotarget.18006
- 6. Gröschel S, Bommer M, Hutter B, et al. Integration of genomics and histology revises diagnosis and enables effective therapy of refractory cancer of unknown primary with PDL1 amplification. Cold Spring Harb Mol Case Stud. 2016;2(6):a001180. doi: 10.1101/mcs.a001180
- 7. Laskin J, Jones S, Aparicio S, et al. Lessons learned from the application of whole-genome analysis to the treatment of patients with advanced cancers. Cold Spring Harb Mol Case Stud. 2015;1(1):a000570. doi: 10.1101/mcs.a000570
- 8. Cheng DT, Mitchell TN, Zehir A, et al. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. J Mol Diagn. 2015;17(3):251-264. doi: 10.1016/j.jmoldx.2014.12.006
- 9. Varadhachary GR, Raber MN, Matamoros A, Abbruzzese JL. Carcinoma of unknown primary with a colon-cancer profile-changing paradigm and emerging definitions. Lancet Oncol. 2008;9(6):596-599. doi: 10.1016/S1470-2045(08)70151-7
- 10. Bender RA, Erlander MG. Molecular classification of unknown primary cancer. Semin Oncol. 2009;36(1):38-43. doi: 10.1053/j.seminoncol.2008.10.002
- 11. Rapin N, Bagger FO, Jendholm J, et al. Comparing cancer vs normal gene expression profiles identifies new disease entities and common transcriptional programs in AML patients. Blood. 2014;123(6):894-904. doi: 10.1182/blood-2013-02-485771
- 12. Wang HL, Kim CJ, Koo J, et al. Practical immunohistochemistry in neoplastic pathology of the gastrointestinal tract, liver, biliary tract, and pancreas. Arch Pathol Lab Med. 2017;141(9):1155-1180. doi: 10.5858/arpa.2016-0489-RA
- 13. Vennalaganti P, Kanakadandi V, Goldblum JR, et al. Discordance among pathologists in the United States and Europe in diagnosis of low-grade dysplasia for patients with Barrett’s esophagus. Gastroenterology. 2017;152(3):564-570.e4. doi: 10.1053/j.gastro.2016.10.041
- 14. Meyer AN, Payne VL, Meeks DW, Rao R, Singh H. Physicians’ diagnostic accuracy, confidence, and resource requests: a vignette study. JAMA Intern Med. 2013;173(21):1952-1958. doi: 10.1001/jamainternmed.2013.10081
- 15. Kim J, Bowlby R, Mungall AJ; Cancer Genome Atlas Research Network; Analysis Working Group: Asan University; BC Cancer Agency; Brigham and Women’s Hospital; Broad Institute; Brown University; Case Western Reserve University; Dana-Farber Cancer Institute; Duke University; Greater Poland Cancer Centre; Harvard Medical School; Institute for Systems Biology; KU Leuven; Mayo Clinic; Memorial Sloan Kettering Cancer Center; National Cancer Institute; Nationwide Children’s Hospital; Stanford University; University of Alabama; University of Michigan; University of North Carolina; University of Pittsburgh; University of Rochester; University of Southern California; University of Texas MD Anderson Cancer Center; University of Washington; Van Andel Research Institute; Vanderbilt University; Washington University; Genome Sequencing Center: Broad Institute; Washington University in St. Louis; Genome Characterization Centers: BC Cancer Agency; Broad Institute; Harvard Medical School; Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University; University of North Carolina; University of Southern California Epigenome Center; University of Texas MD Anderson Cancer Center; Van Andel Research Institute; Genome Data Analysis Centers: Broad Institute; Brown University; Harvard Medical School; Institute for Systems Biology; Memorial Sloan Kettering Cancer Center; University of California Santa Cruz; University of Texas MD Anderson Cancer Center; Biospecimen Core Resource: International Genomics Consortium; Research Institute at Nationwide Children’s Hospital; Tissue Source Sites: Analytic Biologic Services; Asan Medical Center; Asterand Bioscience; Barretos Cancer Hospital; BioreclamationIVT; Botkin Municipal Clinic; Chonnam National University Medical School; Christiana Care Health System; Cureline; Duke University; Emory University; Erasmus University; Indiana University School of Medicine; Institute of Oncology of Moldova; International Genomics Consortium; Invidumed; Israelitisches Krankenhaus Hamburg; Keimyung University School of Medicine; Memorial Sloan Kettering Cancer Center; National Cancer Center Goyang; Ontario Tumour Bank; Peter MacCallum Cancer Centre; Pusan National University Medical School; Ribeirão Preto Medical School; St. Joseph’s Hospital & Medical Center; St. Petersburg Academic University; Tayside Tissue Bank; University of Dundee; University of Kansas Medical Center; University of Michigan; University of North Carolina at Chapel Hill; University of Pittsburgh School of Medicine; University of Texas MD Anderson Cancer Center; Disease Working Group: Duke University; Memorial Sloan Kettering Cancer Center; National Cancer Institute; University of Texas MD Anderson Cancer Center; Yonsei University College of Medicine; Data Coordination Center: CSRA Inc.; Project Team: National Institutes of Health. Integrated genomic characterization of oesophageal carcinoma. Nature. 2017;541(7636):169-175. doi: 10.1038/nature20805
- 16. Cherniack AD, Shen H, Walter V, et al.; Cancer Genome Atlas Research Network. Integrated molecular characterization of uterine carcinosarcoma. Cancer Cell. 2017;31(3):411-423. doi: 10.1016/j.ccell.2017.02.010
- 17. Robinson DR, Wu YM, Lonigro RJ, et al. Integrative clinical genomics of metastatic cancer. Nature. 2017;548(7667):297-303. doi: 10.1038/nature23306
- 18. Clark AM, Ma B, Taylor DL, Griffith L, Wells A. Liver metastases: microenvironments and ex-vivo models. Exp Biol Med (Maywood). 2016;241(15):1639-1652. doi: 10.1177/1535370216658144
- 19. Khan J, Wei JS, Ringnér M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7(6):673-679. doi: 10.1038/89044
- 20. Ma XJ, Patel R, Wang X, et al. Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch Pathol Lab Med. 2006;130(4):465-473.
- 21. Li Y, Kang K, Krahn JM, et al. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics. 2017;18(1):508. doi: 10.1186/s12864-017-3906-0
- 22. Zararsız G, Goksuluk D, Korkmaz S, et al. A comprehensive simulation study on classification of RNA-Seq data. PLoS One. 2017;12(8):e0182507. doi: 10.1371/journal.pone.0182507
- 23. Weinstein JN, Collisson EA, Mills GB, et al.; Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113-1120. doi: 10.1038/ng.2764
- 24. Hudson TJ, Anderson W, Artez A, et al.; International Cancer Genome Consortium. International network of cancer genome projects [published correction appears in Nature. 2010;465(7300):966]. Nature. 2010;464(7291):993-998. doi: 10.1038/nature08987
- 25. Bueno R, Stawiski EW, Goldstein LD, et al. Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations. Nat Genet. 2016;48(4):407-416. doi: 10.1038/ng.3520
- 26. Grewal JK, Eirew P, Jones M, et al. Detection and genomic characterization of a mammary-like adenocarcinoma. Cold Spring Harb Mol Case Stud. 2017;3(6):a002170. doi: 10.1101/mcs.a002170
- 27. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321-357. doi: 10.1613/jair.953
- 28. Haury AC, Gestraud P, Vert JP. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One. 2011;6(12):e28210. doi: 10.1371/journal.pone.0028210
- 29. Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45(D1):D777-D783. doi: 10.1093/nar/gkw1121
- 30. Burk RD, Chen Z, Saller C, et al.; Cancer Genome Atlas Research Network; Albert Einstein College of Medicine; Analytical Biological Services; Barretos Cancer Hospital; Baylor College of Medicine; Beckman Research Institute of City of Hope; Buck Institute for Research on Aging; Canada’s Michael Smith Genome Sciences Centre; Harvard Medical School; Helen F. Graham Cancer Center & Research Institute at Christiana Care Health Services; HudsonAlpha Institute for Biotechnology; ILSbio, LLC; Indiana University School of Medicine; Institute of Human Virology; Institute for Systems Biology; International Genomics Consortium; Leidos Biomedical; Massachusetts General Hospital; McDonnell Genome Institute at Washington University; Medical College of Wisconsin; Medical University of South Carolina; Memorial Sloan Kettering Cancer Center; Montefiore Medical Center; NantOmics; National Cancer Institute; National Hospital, Abuja, Nigeria; National Human Genome Research Institute; National Institute of Environmental Health Sciences; National Institute on Deafness & Other Communication Disorders; Ontario Tumour Bank, London Health Sciences Centre; Ontario Tumour Bank, Ontario Institute for Cancer Research; Ontario Tumour Bank, The Ottawa Hospital; Oregon Health & Science University; Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center; SRA International; St Joseph’s Candler Health System; Eli & Edythe L. Broad Institute of Massachusetts Institute of Technology & Harvard University; Research Institute at Nationwide Children’s Hospital; Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University; University of Bergen; University of Texas MD Anderson Cancer Center; University of Abuja Teaching Hospital; University of Alabama at Birmingham; University of California, Irvine; University of California Santa Cruz; University of Kansas Medical Center; University of Lausanne; University of New Mexico Health Sciences Center; University of North Carolina at Chapel Hill; University of Oklahoma Health Sciences Center; University of Pittsburgh; University of São Paulo, Ribeirão Preto Medical School; University of Southern California; University of Washington; University of Wisconsin School of Medicine & Public Health; Van Andel Research Institute; Washington University in St Louis. Integrated genomic and molecular characterization of cervical cancer. Nature. 2017;543(7645):378-384. doi: 10.1038/nature21386
- 31. Song HW, Wilkinson MF. Transcriptional control of spermatogonial maintenance and differentiation. Semin Cell Dev Biol. 2014;30:14-26. doi: 10.1016/j.semcdb.2014.02.005
- 32. Hylebos M, Van Camp G, van Meerbeeck JP, Op de Beeck K. The genetic landscape of malignant pleural mesothelioma: results from massively parallel sequencing. J Thorac Oncol. 2016;11(10):1615-1626. doi: 10.1016/j.jtho.2016.05.020
- 33. Brcic L, Vlacic G, Quehenberger F, Kern I. Reproducibility of malignant pleural mesothelioma histopathologic subtyping. Arch Pathol Lab Med. 2018;142(6):747-752. doi: 10.5858/arpa.2017-0295-OA
- 34. Barra WF, Moreira FC, Pereira Cruz AM, et al. GEJ cancers: gastric or esophageal tumors? searching for the answer according to molecular identity. Oncotarget. 2017;8(61):104286-104294. doi: 10.18632/oncotarget.22216
- 35. Losa F, Soler G, Casado A, et al. SEOM clinical guideline on unknown primary cancer (2017). Clin Transl Oncol. 2018;20(1):89-96. doi: 10.1007/s12094-017-1807-y
- 36. Grewal J. cancerscope on GitHub. http://www.github.com/jasgrewal/cancerscope/. Accessed April 3, 2019.
- 37. Chahal M, Pleasance E, Grewal J, et al. Personalized oncogenomic analysis of metastatic adenoid cystic carcinoma: using whole-genome sequencing to inform clinical decision-making. Cold Spring Harb Mol Case Stud. 2018;4(2):a002626. doi: 10.1101/mcs.a002626
Associated Data