PLOS ONE. 2020 May 8;15(5):e0226461. doi: 10.1371/journal.pone.0226461

Predicting cancer origins with a DNA methylation-based deep neural network model

Chunlei Zheng 1, Rong Xu 1,*
Editor: Serdar Bozdag 2
PMCID: PMC7209244  PMID: 32384093

Abstract

Cancer origin determination combined with site-specific treatment of metastatic cancer patients is critical to improve patient outcomes. Existing pathology and gene expression-based techniques often have limited performance. In this study, we developed a deep neural network (DNN)-based classifier for cancer origin prediction using DNA methylation data of 7,339 patients of 18 different cancer origins from The Cancer Genome Atlas (TCGA). This DNN model was evaluated using four strategies: (1) when evaluated by 10-fold cross-validation, it achieved an overall specificity of 99.72% (95% CI 99.69%-99.75%) and sensitivity of 92.59% (95% CI 91.87%-93.30%); (2) when tested on hold-out testing data of 1,468 patients, the model had an overall specificity of 99.83% and sensitivity of 95.95%; (3) when tested on 143 metastasized cancer patients (12 cancer origins), the model achieved an overall specificity of 99.47% and sensitivity of 95.95%; and (4) when tested on an independent dataset of 581 samples (10 cancer origins), the model achieved an overall specificity of 99.91% and sensitivity of 93.43%. Compared to existing pathology and gene expression-based techniques, the DNA methylation-based DNN classifier showed higher performance and had the unique advantage of easy implementation in clinical settings. In summary, our study shows that DNA methylation-based DNN models have potential in both diagnosis of cancer of unknown primary and identification of cancer cell types of circulating tumor cells.

Introduction

Identification of cancer origins is routinely performed in clinical practice because site-specific treatments improve patient outcomes [1–4]. While some cancer origins are easy to determine, others are difficult, especially for metastatic and undifferentiated cancers. Cancer origin determination is typically carried out with immunohistochemistry panels on the tumor specimen and imaging tests, which require considerable resources, time, and expense. In addition, pathology-based procedures have limited accuracy (66–88%) in determining the origins of metastatic cancer [5–8].

Several gene expression- or microRNA-based molecular classifiers have been developed to identify cancer origin. A k-nearest neighbor classifier based on 92 genes showed an accuracy of 84% in identifying the primary site of metastatic cancer via cross-validation [9]. Pathwork, a commercially available platform based on a similarity score of 1,550 genes between cancer tissue and reference tissue, achieved an overall sensitivity of 88%, an overall specificity of 99% and an accuracy of 89% in identifying tissue of origin [10, 11]. A decision-tree classifier based on 48 microRNAs showed an accuracy of 85–89% in identifying cancer primary sites [12, 13], and an updated version, the 64-microRNA-based assay, exhibited an overall sensitivity of 85% [14, 15]. A recent support vector machine-based classifier that integrated gene expression and histopathology showed an accuracy of 88% on cancer samples of known origin [16]. Although these molecular platforms have shown better performance in identifying tissue of origin than pathology-based methods, gene expression- or microRNA-based classifiers are not easy to implement in clinical settings, partly due to the instability of RNA [17, 18]. In addition, these classifiers have accuracies below 90%, which may further limit their wide adoption in clinical settings [17]. Hence, it is desirable to develop higher-performance prediction tools for cancer origin determination that can also be easily implemented in clinical settings.

DNA methylation is a process by which methyl groups are added to the DNA molecule, and 70–80% of the human genome is methylated [19]. It has been shown that DNA methylation is established in a tissue-specific manner during development [20, 21]. Though the genomes of cancer patients exhibit overall demethylation, tissue-specific DNA methylation markers might be conserved [21]. Indeed, a random forest-based cancer origin classifier using DNA methylation was reported to achieve 88.6% precision and 97.7% recall in the validation set [18], which demonstrated the usefulness of methylation data in cancer origin prediction. Recently, deep learning technologies have been rapidly applied to the biomedical field, including protein structure prediction, gene expression regulation, behavior prediction, disease diagnosis and drug development [22, 23]. Studies show that deep learning-based models often achieve higher performance than traditional machine learning methods (e.g., random forest and support vector machine) in many settings, such as gene expression inference [24], transcription factor binding prediction [25], protein-protein interaction prediction [26], detection of rare disease-associated cell subsets [27], variant calling [28] and clinical trial outcome prediction [29], among others. In this study, we trained and robustly evaluated a high-performance cancer origin predictive model by leveraging the large amount of DNA methylation data available in The Cancer Genome Atlas (TCGA) and recent developments in deep neural network learning techniques. We demonstrated that our model performed better than traditional pathology- or gene expression-based models as well as a methylation-based random forest prediction model.

Materials and methods

Datasets

DNA methylation data (Illumina human methylation 450k BeadChip) and clinical information of 8,118 patients across 24 tissue types were obtained from the GDC data portal [30] using TCGAbiolinks (Bioconductor package, version 2.5.12) [31]. We excluded six tissue types with fewer than 100 cases in TCGA in order to build a robust cancer origin classifier. The final data include DNA methylation data and clinical information from 7,339 patients of 18 cancer origins. TCGA data were used for both training and evaluation of the cancer origin classifier and were split randomly, with stratification by cancer origin, into a training set (n = 4,403), a development set (n = 1,468) and a test set (n = 1,468) (Fig 1).

Fig 1. Distribution of cancer samples in TCGA by tissue of origin.

A total of 7,339 patients were split randomly, with stratification by cancer origin, into training, development and test sets in a 60:20:20 ratio.
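
A stratified 60:20:20 split of this kind can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the random seed and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, dev, test) index lists, splitting within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_dev = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test

# Toy labels (hypothetical): 100 "breast" and 50 "liver" samples.
labels = ["breast"] * 100 + ["liver"] * 50
train, dev, test = stratified_split(labels)
print(len(train), len(dev), len(test))
```

Splitting inside each cancer origin keeps the class proportions roughly equal across the three sets, which matters here because the origins differ widely in sample count.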

To evaluate the classifier trained on the TCGA dataset using independent data, we obtained 11 DNA methylation datasets (Illumina 450k platform) from the Gene Expression Omnibus (GEO) [32] using GEOquery (Bioconductor package, version 2.42.0) [33]. A total of 581 cancer patients covering 10 cancer origins were obtained; the information for each dataset is described in Table 1.

Table 1. Characteristics of GEO datasets.

GEO ID | Disease | Cancer origin | Cancer type | Num. of patients
GSE77871 | Adrenocortical carcinomas | Adrenal gland | Primary | 18
GSE78751 | Triple negative breast cancer | Breast | Primary, metastatic | 23, 12
GSE101764 | Colorectal cancer | Colorectal | Primary | 112
GSE38268 | Head and neck squamous cell carcinoma | Head and neck | Primary | 6
GSE89852 | Hepatocellular carcinomas | Liver | Primary | 37
GSE49149 | Pancreatic cancer | Pancreas | Primary | 167
GSE112047 | Prostate cancer | Prostate | Primary | 31
GSE38240 | Prostate cancer | Prostate | Primary, metastatic | 2, 6
GSE73549 | Prostate cancer | Prostate | Metastatic | 18
GSE86961 | Papillary thyroid cancer | Thyroid | Primary | 82
GSE52955 | Urology cancer | Kidney, bladder, prostate | Primary | 17, 25, 25

The third DNA methylation dataset comes from 1,001 cancer cell lines, which were reported in a large-scale study [34] and deposited in GEO (GSE68379). These cell lines were not treated with any drug or compound, and DNA methylation data were obtained from the Illumina 450K BeadChip platform. We used this dataset as a case study, i.e., applying our DNN-based tissue classifier to identify the tissue sources of these cancer cell lines. After excluding cancer cell lines whose tissue sources are not covered by our classifier, a total of 391 cell lines from 11 tissue sources were used in this study. Fig 2 shows the distribution of these cancer cell lines by tissue source.

Fig 2. Distribution of cancer cell lines by tissue source.

Feature selection

Only the training data (n = 4,403) from TCGA were used for feature selection. Currently, Illumina 450K and 27K are two commonly used platforms for genome-wide analysis of DNA methylation, which measure DNA methylation at around 450K and 27K CpG sites respectively. The DNA methylation level of a CpG site is expressed as a beta value, calculated from the ratio of intensities between methylated and unmethylated alleles. The beta value lies between 0 and 1, with 0 being unmethylated and 1 fully methylated. All data used in this study are from the 450K platform. To reduce dimensionality while keeping the feature set backward-compatible with the 27K platform, we restricted the 450K-derived samples to the 27K probes, as the probes used in the 450K platform include all probes in the 27K platform. To further remove noise in the data, we used one-way analysis of variance (one-way ANOVA) to filter out CpG sites whose beta values are not significantly different (p > 0.01) among tissues, resulting in 18,976 CpG sites. We then used Tukey's honestly significant difference (Tukey's HSD) test to remove CpG sites for which the maximal difference of mean beta values among all tissues is less than 0.15. Tukey's HSD is a multiple comparison procedure for finding means that are significantly different from each other. It is essentially a t-test that controls the family-wise error rate and is commonly used as a post hoc test for ANOVA [35]. The results of Tukey's HSD test are pairwise tissue comparisons with statistical significance and mean difference. Here, we used the pairwise tissue test results to obtain the maximal difference of the mean beta value among all tissues. Tukey's HSD test resulted in 10,360 CpG sites, which were used for the input layer of the neural network.
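
The final effect-size filter can be illustrated with a small sketch. It covers only the 0.15 cut-off on the maximal difference between per-tissue mean beta values; the ANOVA p-value filter and the significance part of Tukey's HSD are deliberately left out. The function names, probe IDs and beta values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def max_mean_difference(betas, tissues):
    """Largest difference between per-tissue mean beta values for one CpG site."""
    groups = defaultdict(list)
    for beta, tissue in zip(betas, tissues):
        groups[tissue].append(beta)
    tissue_means = [mean(values) for values in groups.values()]
    # The largest pairwise difference equals max(mean) - min(mean).
    return max(tissue_means) - min(tissue_means)

def select_sites(beta_by_site, tissues, min_diff=0.15):
    """Keep CpG sites whose maximal per-tissue mean difference reaches min_diff."""
    return [site for site, betas in beta_by_site.items()
            if max_mean_difference(betas, tissues) >= min_diff]

# Toy data (hypothetical probe IDs and beta values).
tissues = ["liver", "liver", "lung", "lung", "brain", "brain"]
beta_by_site = {
    "cg_A": [0.10, 0.12, 0.80, 0.78, 0.45, 0.47],  # strongly tissue-specific
    "cg_B": [0.50, 0.52, 0.51, 0.49, 0.50, 0.53],  # nearly constant
}
print(select_sites(beta_by_site, tissues))
```

In this toy input only `cg_A` survives, because its tissue means span far more than 0.15 while those of `cg_B` differ by less than 0.02.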

Training a deep neural network (DNN) model for cancer origin classification

We used DNA methylation data from the training set (n = 4,403) to build a DNN model to predict cancer origins. TensorFlow [36], an open-source framework that facilitates deep learning model training, was used for this purpose. Four well-established techniques were used to optimize the training process: weight initialization by the Xavier method [37], Adam optimization [38], learning rate decay and mini-batch training. The Xavier method helps avoid the vanishing or exploding gradients that random initialization may bring. Adam, which combines stochastic gradient descent with momentum [39] and RMSprop [40], speeds up training. Exponential learning rate decay (decay every 1,000 steps with a base of 0.96) was used to improve model performance. Training was performed with mini-batches of 128 samples for 30 epochs to use the data efficiently.
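
The decay schedule described above corresponds to the usual exponential form, learning rate = base rate × 0.96^(step / 1,000). The sketch below is a plain-Python illustration rather than the authors' TensorFlow code. The base rate of 0.001 is the value selected in the hyperparameter search, and the `staircase` switch (decay in discrete jumps versus continuously) is an assumption, since the text does not say which variant was used.

```python
def decayed_learning_rate(step, base_lr=0.001, decay_rate=0.96,
                          decay_steps=1000, staircase=True):
    """Exponential decay: base_lr * decay_rate ** (step / decay_steps)."""
    # staircase=True drops the rate once per decay_steps (assumed variant).
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent

for step in (0, 1000, 5000):
    print(step, decayed_learning_rate(step))
```

If the mini-batch size is read as 128, one epoch over 4,403 training samples takes 35 steps, so 30 epochs amount to about 1,050 steps and the rate would be decayed roughly once during training.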

We employed a multilayer perceptron (MLP) to construct the neural network. Three hyperparameters (learning rate, number of hidden layers and number of units per hidden layer) were optimized according to development set performance (1,468 patients with the same distribution of cancer origins as the training set). Three learning rates (α = 0.001, 0.01 and 0.1), three numbers of hidden layers (L = 2, 3, 4) and three hidden layer sizes (N = 32, 64, 128) were tested. We used a grid search strategy to optimize these parameters, and the best combination according to development set performance was α = 0.001, L = 2 and N = 64.
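
The search space is small enough to enumerate exhaustively: three values for each of three hyperparameters gives 27 combinations. The sketch below shows the shape of such a grid search. The scoring function `toy_dev_score` is a hypothetical placeholder, deliberately built to peak at the reported optimum; in practice it would train a model and return its development set performance.

```python
from itertools import product

learning_rates = (0.001, 0.01, 0.1)
hidden_layers = (2, 3, 4)
hidden_units = (32, 64, 128)

def grid_search(evaluate):
    """Return the best (lr, layers, units) combination and the grid size."""
    grid = list(product(learning_rates, hidden_layers, hidden_units))
    return max(grid, key=lambda combo: evaluate(*combo)), len(grid)

# Placeholder scorer (hypothetical): stands in for "train on the training set,
# then measure performance on the development set".
def toy_dev_score(lr, layers, units):
    return -abs(lr - 0.001) - 0.01 * abs(layers - 2) - 0.001 * abs(units - 64)

best, n_combos = grid_search(toy_dev_score)
print(n_combos, best)
```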

Validating and testing DNN-based cancer origin prediction model

We used four strategies to evaluate the performance of the DNN cancer origin classifier: (1) evaluation by 10-fold cross-validation in the training dataset to obtain the overall specificity, sensitivity, PPV and NPV of the model as well as the corresponding confidence intervals; (2) evaluation in the hold-out testing dataset to obtain both the overall model performance and tissue-wise performance; (3) evaluation in the subset of metastatic cancer samples nested in the testing dataset to assess the performance of the model in predicting the primary sites of metastatic cancers, which are often more difficult to identify in clinical practice and more clinically relevant; (4) evaluation in independent datasets from GEO to test the robustness and generalizability of the DNN model. Metrics including specificity, sensitivity, positive predictive value (PPV) and negative predictive value (NPV) were reported. A receiver operating characteristic (ROC) curve was also calculated for each test dataset.
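
For a multi-class classifier, these four metrics are usually computed per class in a one-vs-rest fashion from the confusion matrix. The following sketch shows that computation. The function name and the toy counts are illustrative assumptions, and it does not reproduce how the overall figures in the tables were aggregated.

```python
def one_vs_rest_metrics(confusion, class_index):
    """Specificity, sensitivity, PPV and NPV for one class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp
    fp = sum(confusion[i][class_index] for i in range(n)) - tp
    tn = total - tp - fn - fp
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy 3-class confusion matrix (hypothetical counts).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(one_vs_rest_metrics(confusion, 0))
```

Note that PPV is undefined for a class that is never predicted (the denominator is zero), so a production version would need to guard against that case.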

Source code, data availability, and reproducibility

Source code used in this study is publicly available in a GitHub repository (https://github.com/thunder001/Cancer_origin_prediction). We also shared a Jupyter Notebook to replicate all the machine learning experiments, from data processing, model building and optimization to model evaluation. To execute this notebook, the environment first needs to be created according to a YAML file available in the GitHub repository. In addition, we created a Docker image, available on Docker Hub (https://hub.docker.com/r/thunder001/cancer_origin_prediction), which can be downloaded and run as a container directly on a local computer.

Results

The overall performance of the DNN-based cancer origin classifier in 10-fold cross-validation setting

We used DNA methylation data of 7,339 patients from TCGA across 18 primary tissues to train and test a DNN-based cancer origin classifier. The sample distribution across cancer origins is shown in Fig 1. The final DNN architecture consists of one input layer (10,360 neurons), two hidden layers (64 neurons each) and one output layer (18 neurons) representing the 18 cancer origins (Fig 3).
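
To give a sense of the model size, the number of trainable parameters implied by this architecture can be counted directly. The sketch assumes plain fully connected layers with one bias per unit and no other trainable parameters, which the text does not state explicitly.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

layer_sizes = [10360, 64, 64, 18]  # input, two hidden layers, output
print(mlp_parameter_count(layer_sizes))  # 668434
```

Under these assumptions almost all of the roughly 668,000 parameters sit in the first layer (10,360 × 64 weights), which is why reducing the input to 10,360 CpG sites matters for a training set of 4,403 samples.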

Fig 3. Schematic representation of DNN architecture of cancer origin classifier.

Evaluated in a 10-fold cross-validation setting, the model achieved an overall precision (positive predictive value, PPV) of 0.9503 (95% CI: 0.9373–0.9633) and an overall recall (sensitivity) of 0.9259 (95% CI: 0.9187–0.9330). In addition, the model achieved a high specificity of 0.9972 (95% CI: 0.9969–0.9975) (Table 2).

Table 2. DNN model performance using 10-fold cross validation of training data.

Metric Mean SD 95% CI
Specificity 0.9972 0.0001 0.9969, 0.9975
Sensitivity (Recall) 0.9259 0.0032 0.9187, 0.9330
PPV (Precision) 0.9503 0.0057 0.9373, 0.9633
NPV 0.9973 0.0001 0.9970, 0.9976

PPV: positive predictive value; NPV: negative predictive value.
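
As a reading aid for Table 2, the tabulated intervals are numerically close to mean ± t × (SD column), using the two-sided 95% Student's t critical value for 9 degrees of freedom (10 folds). This is an observation from the arithmetic, not a statement of how the authors computed the intervals; the function name is hypothetical.

```python
T_CRIT_9DF = 2.262  # two-sided 95% critical value of Student's t, 9 d.o.f.

def t_interval(mean, spread, t_crit=T_CRIT_9DF):
    """Return mean +/- t_crit * spread, rounded to four decimals."""
    half_width = t_crit * spread
    return round(mean - half_width, 4), round(mean + half_width, 4)

print(t_interval(0.9259, 0.0032))  # sensitivity row of Table 2
print(t_interval(0.9503, 0.0057))  # PPV row of Table 2
```

The results agree with the tabulated bounds to within about 0.0001, a gap that is plausibly explained by the rounding of the SD column to four decimals.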

DNN-based cancer origin classifier shows high performance in testing dataset

We tested the classifier using the test dataset, which includes 1,468 samples with a distribution similar to that of the training set (Fig 1). Cancer origin classifications and a confusion matrix for all samples are shown in S1 and S2 Tables respectively. Model performance metrics are shown in Table 3. The specificity and negative predictive value (NPV) for individual cancer origin predictions were consistently higher than 0.99. The overall precision (PPV) and recall (sensitivity) reached 0.9608 and 0.9595 respectively. For many cancer tissue origin predictions, including brain, colorectal, prostate, skin, testis, thymus and thyroid, the DNN model achieved a precision of 100% (Table 3), and the average AUC was 0.99 (Fig 4).

Table 3. DNN model performance in test set.

CANCER ORIGIN SPECIFICITY SENSITIVITY (RECALL) PPV (PRECISION) NPV
AG 0.9993 0.9787 0.9787 0.9993
BLADDER 0.9986 0.9878 0.9759 0.9993
BRAIN 1.0000 1.0000 1.0000 1.0000
BREAST 0.9977 1.0000 0.9810 1.0000
COLORECTAL 1.0000 0.9861 1.0000 0.9993
ESOPHAGUS 0.9909 0.7410 0.7579 0.9902
HN 0.9971 0.9099 0.9619 0.9927
KIDNEY 0.9993 1.0000 0.9925 1.0000
LIVER 0.9993 0.9851 0.9851 0.9993
LUNG 0.9984 0.9740 0.9894 0.9961
PANCREAS 0.9979 1.0000 0.9167 1.0000
PROSTATE 1.0000 1.0000 1.0000 1.0000
SKIN 1.0000 1.0000 1.0000 1.0000
SOFT TISSUE 0.9993 0.9825 0.9825 0.9993
STOMACH 0.9921 0.9375 0.8721 0.9964
TESTIS 1.0000 1.0000 1.0000 1.0000
THYMUS 1.0000 0.8889 1.0000 0.9979
THYROID 1.0000 1.0000 1.0000 1.0000
OVERALL 0.9983 0.9595 0.9608 0.9983

PPV: positive predictive value; NPV: negative predictive value; AG: Adrenal Gland; HN: Head and Neck

Fig 4. AUCs for individual cancer origin prediction in TCGA test set.

Precision and recall vary across cancer origin predictions. The lowest performance occurred in esophagus origin prediction, with a precision of 0.7579 and a recall of 0.7410. A total of 10 of 39 esophagus-origin samples were incorrectly predicted as stomach origin (S1 and S2 Tables). Given that the esophagus is a broad area, if a tumor is located at the border of the stomach and esophagus, it might be difficult for the classifier to distinguish these two tissues. In addition, tissues from adjacent regions may have similar methylation profiles, so the methylation-based prediction model has difficulty differentiating cancers with adjacent origins (e.g., esophagus vs stomach).

DNN-based cancer tissue classifier shows high performance in determining the origins of metastasized cancers

We evaluated the performance of the classifier in determining the origins of the metastatic cancers nested in our test data. Our data contained 701 samples from distantly metastasized cancers, 558 of which had been used for model development. We then used the remaining 143 samples from 12 cancer origins, with various sample sizes, for evaluation (Fig 5A). Cancer origin predictions and the corresponding confusion matrix are shown in S3 and S4 Tables. Model performance metrics and ROC curves are shown in Table 4 and Fig 5B. Consistent with the results above, the DNN model showed robust, high performance in predicting metastatic cancer origins.

Fig 5. Performance of the DNN-based cancer origin classifier in metastatic cancer samples from TCGA test set.

(A) Distribution of metastatic cancer samples by tissue of origin. (B) AUCs for individual cancer origin prediction.

Table 4. DNN model performance in metastatic cancer samples.

CANCER ORIGIN SPECIFICITY SENSITIVITY (RECALL) PPV (PRECISION) NPV
ADRENAL GLAND 1.0000 1.0000 1.0000 1.0000
BLADDER 1.0000 0.9643 1.0000 0.9914
BREAST 0.9929 1.0000 0.7500 1.0000
COLORECTAL 1.0000 1.0000 1.0000 1.0000
ESOPHAGUS 0.9504 1.0000 0.2222 1.0000
HEAD AND NECK 1.0000 0.8833 1.0000 0.9222
KIDNEY 1.0000 1.0000 1.0000 1.0000
LIVER 0.9929 1.0000 0.6667 1.0000
LUNG 1.0000 0.6667 1.0000 0.9929
PANCREAS 1.0000 1.0000 1.0000 1.0000
STOMACH 1.0000 1.0000 1.0000 1.0000
THYROID 1.0000 1.0000 1.0000 1.0000
OVERALL 0.9947 0.9595 0.8866 0.9922

PPV: positive predictive value; NPV: negative predictive value.
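
A quick arithmetic check suggests that the OVERALL row of Table 4 is the unweighted (macro) average of the per-origin rows. This is inferred from the numbers rather than stated in the text. The sketch below recomputes the averages from the tabulated values, listed in row order.

```python
from statistics import mean

# Per-origin values copied from Table 4, in row order.
specificity = [1.0000, 1.0000, 0.9929, 1.0000, 0.9504, 1.0000,
               1.0000, 0.9929, 1.0000, 1.0000, 1.0000, 1.0000]
sensitivity = [1.0000, 0.9643, 1.0000, 1.0000, 1.0000, 0.8833,
               1.0000, 1.0000, 0.6667, 1.0000, 1.0000, 1.0000]
ppv = [1.0000, 1.0000, 0.7500, 1.0000, 0.2222, 1.0000,
       1.0000, 0.6667, 1.0000, 1.0000, 1.0000, 1.0000]
npv = [1.0000, 0.9914, 1.0000, 1.0000, 1.0000, 0.9222,
       1.0000, 1.0000, 0.9929, 1.0000, 1.0000, 1.0000]

macro = {name: round(mean(values), 4)
         for name, values in [("specificity", specificity),
                              ("sensitivity", sensitivity),
                              ("ppv", ppv), ("npv", npv)]}
print(macro)
```

Because every origin carries equal weight in such an average, origins with only two or three samples (esophagus, liver, lung) influence the overall PPV and sensitivity as much as head and neck with 60 samples.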

We noticed that performance metrics for several cancer origin predictions were poor: a precision of 0.22 for esophagus origin prediction, a precision of 0.67 for liver origin prediction and a recall of 0.67 for lung origin prediction. The poor performance in these three cancer origin predictions may be due to small sample sizes. As mentioned above, metastatic cancer samples comprise only a small subset of the TCGA test dataset, the majority of which are primary tumors. Only 2, 2 and 3 metastatic cancer samples of esophagus, liver and lung origin respectively were included in the test dataset (Fig 5A). The classifier misclassified 6 of 60 head and neck cancers as esophagus origin and 1 of 3 lung cancers as liver origin (S4 Table). Because of the small sample sizes for esophagus, liver and lung cancers, a few misclassifications had a large impact on the precision and recall metrics.

DNN-based cancer tissue classifier shows high performance in independent testing datasets

The DNN model was trained using DNA methylation data from TCGA. We then tested it on independent data: 11 data series consisting of 581 tumor samples covering 10 tissue origins, downloaded from the Gene Expression Omnibus (GEO). The sample distribution is shown in Fig 6A and cancer origin predictions are listed in S5 Table. Evaluated using these independent datasets, the DNN model achieved high performance, with an overall precision and recall of 98.69% and 93.43% respectively (Table 5). High performance was also achieved in individual cancer origin predictions (Table 5), with an average AUC of 0.99 (Fig 6B). Importantly, the model achieved 100% accuracy in predicting the origins of metastatic cancers in these datasets, including 24 prostate cancers that metastasized to bone, lymph node or soft tissue and 12 breast cancers that metastasized to lymph node (see Table 1 for these samples).

Fig 6. Performance of the DNN-based cancer origin classifier in GEO dataset.

(A) Distribution of cancer samples obtained from GEO by tissue of origin. (B) AUCs for individual cancer origin prediction.

Table 5. DNN model performance using independent cancer samples (GEO).

CANCER ORIGIN SPECIFICITY SENSITIVITY (RECALL) PPV (PRECISION) NPV
ADRENAL GLAND 1.0000 0.7778 1.0000 0.9929
BLADDER 1.0000 1.0000 1.0000 1.0000
BREAST 0.9963 0.9714 0.9444 0.9982
COLORECTAL 1.0000 0.9643 1.0000 0.9915
HEAD AND NECK 1.0000 0.8333 1.0000 0.9983
KIDNEY 1.0000 1.0000 1.0000 1.0000
LIVER 0.9945 1.0000 0.9250 1.0000
PANCREAS 1.0000 0.8084 1.0000 0.9283
PROSTATE 1.0000 1.0000 1.0000 1.0000
THYROID 1.0000 0.9878 1.0000 0.9980
OVERALL 0.9991 0.9343 0.9869 0.9907

PPV: positive predictive value; NPV: negative predictive value.

Application of DNN-based cancer tissue classifier in predicting cancer cell type

We next investigated how the cancer tissue-trained classifier can be used in cancer cell type prediction. DNA methylation data from 391 cancer cell lines covering 11 tissue sources were obtained from a large-scale study [34]. Applying our classifier to these cancer cell lines, we obtained an overall accuracy, precision and recall of 0.8104, 0.8613 and 0.8255 respectively (Table 6). The overall AUC reached 0.98 (Fig 7). The predicted tissue source for each cancer cell line is listed in S7 Table.

Table 6. DNN model performance in cancer cell type prediction.

CANCER CELL TYPE SPECIFICITY SENSITIVITY (RECALL) PPV (PRECISION) NPV
BLADDER 0.9946 0.7778 0.8750 0.9893
BRAIN 1.0000 0.7959 1.0000 0.9716
BREAST 0.9942 0.8958 0.9556 0.9855
COLORECTAL 0.9942 0.9333 0.9545 0.9914
HEAD AND NECK 1.0000 0.8421 1.0000 0.9833
KIDNEY 0.9889 0.9355 0.8788 0.9944
LIVER 0.9602 0.8571 0.4444 0.9945
LUNG 0.9842 0.6267 0.9038 0.9174
PANCREAS 0.9973 0.4167 0.9091 0.9632
SKIN 0.9826 1.0000 0.8868 1.0000
TESTIS 0.9974 1.0000 0.6667 1.0000
OVERALL 0.9903 0.8255 0.8613 0.9810

Fig 7. AUCs for individual cancer cell type prediction.

Similarly, we noticed variation in model performance across individual cancer cell types. Both precision and recall are high for cancer cell types derived from brain, breast, colorectal, head and neck and skin. However, for cancer cell lines from liver, recall is relatively high (0.8571) but precision is low (0.4444). On further examination of the confusion matrix (S8 Table), we found that this is caused by misprediction of lung cancer cell lines as liver cancer cell lines. Likewise, for pancreatic cell lines precision is high (0.9091) but recall is low (0.4167), which is mainly caused by misprediction of pancreatic cell lines as stomach and esophagus (S8 Table).

Discussion

We developed a deep neural network model to predict cancer origins based on a large amount of DNA methylation data from 7,339 patients of 18 different cancer origins. By combining DNA methylation data with a deep learning algorithm, our cancer origin classifier achieved high performance, as demonstrated in four different evaluation settings. Compared with Pathwork, a commercially available cancer origin classifier based on gene expression [10], our DNN model showed higher precision (95.03% vs 89.4%) and recall (92.3% vs 87.8%) and comparable specificity (99.7% vs 99.4%). Compared with the DNA methylation-based random forest model [18], our DNN model achieved higher PPV (precision) (95.03% in cross-validation and 96.08% in test vs 88.6%) and comparable specificity, sensitivity and NPV. In addition, we showed that our DNN model is highly robust and generalizable, as evaluated in an independent testing dataset of 581 samples (10 cancer origins), with an overall specificity of 99.91% and sensitivity of 93.43%. Therefore, high performance in both primary and metastatic cancer origin prediction and the potential for easy implementation in clinical settings make the methylation-based DNN model a promising tool for determining cancer origins.

DNA methylation is established in a tissue-specific manner and conserved during cancer development [21], which makes the DNA methylation profile a very useful feature in cancer origin prediction. Deep neural networks (DNNs) excel in capturing hierarchical features inherent in many complicated biological mechanisms. Our study indicates that the trained DNN model may be able to capture hierarchical patterns of cancer origins from DNA methylation data. Interpretation of deep learning-based models is a rapidly developing field, and we expect that our model can be explained in a meaningful way in the future.

Our DNN model has potential in predicting the origins of cancer of unknown primary (CUP). CUP is a subgroup of heterogeneous metastatic cancers whose primary site remains elusive even after standard pathological examination [41]. It is estimated that 3–5% of metastatic cancers are CUP and that the majority of CUP patients (80%) have a poor prognosis, with an overall survival of 6–10 months [41]. Identifying the primary site of CUP poses challenges for treatment decisions in clinical practice. Currently, intensive pathologic examination still leaves 30% of them unidentified [42, 43]. The high performance of our DNA methylation-based DNN model may provide an opportunity in this scenario, when the pathology-based approach fails. However, compared to pathologic examination, DNA methylation-based DNN prediction models have limited interpretability due to the "black-box" nature of deep learning methods and our limited understanding of the mechanistic connections between DNA methylation and cancer origins. We envision that a hybrid approach combining existing pathological examinations with DNA methylation-based prediction may offer both interpretability and high predictive power.

Due to the limited CUP data in both TCGA and GEO, we are currently unable to test the DNN model in predicting the origins of CUP. Our future direction is to collaborate with hospitals to collect DNA methylation data from CUP patients to test our model. One challenge is to obtain the true primary sites for these patients. One way is to follow patients over time, because the true primary site of a CUP may become apparent later in cancer development [18]. Another is through post-mortem examination of patients, since 75% of primary sites of CUP were found at autopsy [44].

Another potential use of our model is to determine the tissue source or cell type of circulating tumor cells (CTCs). CTCs are cells that are shed from primary and metastatic tumors into the blood. The enumeration of CTCs has been shown to be an independent prognostic biomarker of overall survival in breast cancer, and characteristics of CTCs have been shown to predict patient response to therapy [45, 46]. A role for CTCs in non-invasive diagnosis of cancer is also emerging [47, 48]. However, identifying their tissue source or tumor type is challenging. Zou J et al. developed eTumorType, which is based on copy number variation (CNV) and shows promise in diagnosing the tumor type of CTCs [49]. Here, we applied our DNA methylation-based model to cancer cell type identification, and our model showed relatively high performance, with an overall specificity of 0.9903 and an overall sensitivity of 0.8255 for 11 cancer types (Table 6). CTCs may have different properties from cancer cell lines, and we expect that our model can be directly tested on CTCs when sufficient DNA methylation data are available. We are aware that the cell types our model can identify are still limited and that performance varies across cancer types. Further improvements of our model are warranted.

One limitation of this study is the small number of metastatic cancer samples in our data. Two sources of metastatic cancer samples were used in this study: TCGA and GEO. TCGA has 701 metastatic cancer samples (12 tissues) with available methylation data from the Illumina Human Methylation 450K platform. While the model achieved an overall specificity of 99.47% and sensitivity of 95.95% on the metastatic samples in the TCGA test set, we were unable to robustly test it using an independent dataset, since methylation data of metastatic cancers are limited in GEO. Further independent validation of our DNN-based model in predicting the origins of metastatic cancers, especially poorly differentiated or undifferentiated metastatic cancer samples, is needed.

Conclusion

We developed a DNN-based cancer origin classifier using large-scale DNA methylation data. This model shows high performance in predicting cancer tissue origins of solid tumors. We also demonstrated that the model can be used for cancer cell type identification. In summary, DNA methylation-based DNN models have potential in diagnosing the cancer origin of CUP as well as identifying the cancer cell type of CTCs.

Supporting information

S1 Table. Cancer origin predictions for 1468 patient samples from TCGA.

(DOCX)

S2 Table. Confusion matrix for TCGA test set predictions.

(CSV)

S3 Table. Cancer tissue origin predictions for 143 metastatic cancer samples.

(DOCX)

S4 Table. Confusion matrix for metastatic cancer samples in TCGA test set.

(CSV)

S5 Table. Cancer origin predictions for 581 samples from GEO datasets.

(DOCX)

S6 Table. Confusion matrix for GEO sample predictions.

(CSV)

S7 Table. Cancer cell type predictions for 391 cancer cell lines.

(DOCX)

S8 Table. Confusion matrix for cancer cell type prediction.

(CSV)

Funding Statement

RX and CLZ are funded by the NIH Director’s New Innovator Award under the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health (DP2HD084068, Xu), NIH National Institute of Aging (R01 AG057557-01, R01 AG061388-01, R56 AG062272-01, Xu) and American Cancer Society Research Scholar Grant (RSG-16-049-01-MPC, Xu). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Hainsworth JD, Rubin MS, Spigel DR, Boccia RV, Raby S, Quinn R, et al. Molecular gene expression profiling to predict the tissue of origin and direct site-specific therapy in patients with carcinoma of unknown primary site: a prospective trial of the Sarah Cannon research institute. J Clin Oncol. 2013;31:217–23. doi:10.1200/JCO.2012.43.3755
  • 2. Varadhachary GR, Raber MN, Matamoros A, Abbruzzese JL. Carcinoma of unknown primary with a colon-cancer profile-changing paradigm and emerging definitions. Lancet Oncol. 2008;9:596–9. doi:10.1016/S1470-2045(08)70151-7
  • 3. Varadhachary GR, Spector Y, Abbruzzese JL, Rosenwald S, Wang H, Aharonov R, et al. Prospective gene signature study using microRNA to identify the tissue of origin in patients with carcinoma of unknown primary. Clin Cancer Res. 2011;17:4063–70. doi:10.1158/1078-0432.CCR-10-2599
  • 4. Varadhachary GR, Karanth S, Qiao W, Carlson HR, Raber MN, Hainsworth JD, et al. Carcinoma of unknown primary with gastrointestinal profile: immunohistochemistry and survival data for this favorable subset. Int J Clin Oncol. 2014;19:479–84. doi:10.1007/s10147-013-0583-0
  • 5. Brown RW, Campagna LB, Dunn JK, Cagle PT. Immunohistochemical identification of tumor markers in metastatic adenocarcinoma. A diagnostic adjunct in the determination of primary site. Am J Clin Pathol. 1997;107:12–9. doi:10.1093/ajcp/107.1.12
  • 6. DeYoung BR, Wick MR. Immunohistologic evaluation of metastatic carcinomas of unknown origin: an algorithmic approach. Semin Diagn Pathol. 2000;17:184–93.
  • 7. Dennis JL, Hvidsten TR, Wit EC, Komorowski J, Bell AK, Downie I, et al. Markers of adenocarcinoma characteristic of the site of origin: development of a diagnostic algorithm. Clin Cancer Res. 2005;11:3766–72. doi:10.1158/1078-0432.CCR-04-2236
  • 8. Park SY, Kim BH, Kim JH, Lee S, Kang GH. Panels of immunohistochemical markers help determine primary sites of metastatic adenocarcinoma. Arch Pathol Lab Med. 2007;131:1561–7.
  • 9. Ma XJ, Patel R, Wang X, Salunga R, Murage J, Desai R, et al. Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch Pathol Lab Med. 2006;130:465–73.
  • 10. Monzon FA, Lyons-Weiler M, Buturovic LJ, Rigl CT, Henner WD, Sciulli C, et al. Multicenter validation of a 1,550-gene expression profile for identification of tumor tissue of origin. J Clin Oncol. 2009;27:2503–8. doi:10.1200/JCO.2008.17.9762
  • 11. Pillai R, Deeter R, Rigl CT, Nystrom JS, Miller MH, Buturovic L, et al. Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens. J Mol Diagn. 2011;13:48–56. doi:10.1016/j.jmoldx.2010.11.001
  • 12. Rosenfeld N, Aharonov R, Meiri E, Rosenwald S, Spector Y, Zepeniuk M, et al. MicroRNAs accurately identify cancer tissue origin. Nat Biotechnol. 2008;26:462–9. doi:10.1038/nbt1392
  • 13. Rosenwald S, Gilad S, Benjamin S, Lebanony D, Dromi N, Faerman A, et al. Validation of a microRNA-based qRT-PCR test for accurate identification of tumor tissue origin. Mod Pathol. 2010;23:814–23. doi:10.1038/modpathol.2010.57
  • 14. Meiri E, Mueller WC, Rosenwald S, Zepeniuk M, Klinke E, Edmonston TB, et al. A second-generation microRNA-based assay for diagnosing tumor tissue origin. Oncologist. 2012;17:801–12. doi:10.1634/theoncologist.2011-0466
  • 15. Pentheroudakis G, Pavlidis N, Fountzilas G, Krikelis D, Goussia A, Stoyianni A, et al. Novel microRNA-based assay demonstrates 92% agreement with diagnosis based on clinicopathologic and management data in a cohort of patients with carcinoma of unknown primary. Mol Cancer. 2013;12:57. doi:10.1186/1476-4598-12-57
  • 16. Tothill RW, Shi F, Paiman L, Bedo J, Kowalczyk A, Mileshkin L, et al. Development and validation of a gene expression tumour classifier for cancer of unknown primary. Pathology. 2015;47:7–12. doi:10.1097/PAT.0000000000000194
  • 17. Greco FA, Lennington WJ, Spigel DR, Hainsworth JD. Molecular profiling diagnosis in unknown primary cancer: accuracy and ability to complement standard pathology. J Natl Cancer Inst. 2013. June 5;105(11):782–90. doi:10.1093/jnci/djt099
  • 18. Moran S, Martínez-Cardús A, Sayols S, Musulén E, Balañá C, Estival-Gonzalez A, et al. Epigenetic profiling to classify cancer of unknown primary: a multicentre, retrospective analysis. Lancet Oncol. 2016;17:1386–1395. doi:10.1016/S1470-2045(16)30297-2
  • 19. Kulis M, Esteller M. DNA methylation and cancer. Adv Genet. 2010;70:27–56. doi:10.1016/B978-0-12-380866-0.60002-2
  • 20. Ohgane J, Yagi S, Shiota K. Epigenetics: the DNA methylation profile of tissue-dependent and differentially methylated regions in cells. Placenta. 2008;29 Suppl A:S29–35.
  • 21. Fernandez AF, Assenov Y, Martin-Subero JI, Balint B, Siebert R, Taniguchi H, et al. A DNA methylation fingerprint of 1628 human samples. Genome Res. 2012;22:407–19. doi:10.1101/gr.119867.110
  • 22. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform. 2017;18:851–869. doi:10.1093/bib/bbw068
  • 23. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141). doi:10.1098/rsif.2017.0387
  • 24. Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–9. doi:10.1093/bioinformatics/btw074
  • 25. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8. doi:10.1038/nbt.3300
  • 26. Du T, Liao L, Wu CH, Sun B. Prediction of residue-residue contact matrix for protein-protein interaction with Fisher score features and deep learning. Methods. 2016;110:97–105. doi:10.1016/j.ymeth.2016.06.001
  • 27. Arvaniti E, Claassen M. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat Commun. 2017;8:14825. doi:10.1038/ncomms14825
  • 28. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983–987. doi:10.1038/nbt.4235
  • 29. Artemov AV, Putin E, Vanhaelen Q, Aliper A, Ozerov IV, Zhavoronkov A, et al. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv [Preprint]. 2016. doi:10.1101/095653
  • 30. GDC data portal. https://portal.gdc.cancer.gov. Accessed 7 August 2019.
  • 31. Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: a R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44:e71. doi:10.1093/nar/gkv1507
  • 32. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/. Accessed 7 August 2019.
  • 33. Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23:1846–7. doi:10.1093/bioinformatics/btm254
  • 34. Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, Schubert M, et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell. 2016. July 28;166(3):740–754. doi:10.1016/j.cell.2016.06.017
  • 35. McDonald JH. Handbook of Biological Statistics (3rd ed). Sparky House Publishing, Baltimore, Maryland: 2014.
  • 36. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. In: OSDI'16 Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation. 2016;265–283.
  • 37. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2010;249–256.
  • 38. Kingma DP, Ba JL. Adam: A method for stochastic optimization. arXiv. 2014;1412.6980v9.
  • 39. Qian N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999;12:145–151. doi:10.1016/s0893-6080(98)00116-6
  • 40. McMahan HB, Streeter M. Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning. Advances in Neural Information Processing Systems (Proceedings of NIPS). 2014;1–9.
  • 41. Varadhachary GR, Raber MN. Cancer of unknown primary site. N Engl J Med. 2014;371:757–65. doi:10.1056/NEJMra1303917
  • 42. Krämer A, Hübner G, Schneeweiss A, Folprecht G, Neben K. Carcinoma of Unknown Primary—an Orphan Disease? Breast Care (Basel). 2008;3:164–170.
  • 43. Ettinger DS, Agulnik M, Cates JM, Cristea M, Denlinger CS, Eaton KD, et al. NCCN Clinical Practice Guidelines Occult primary. J Natl Compr Canc Netw. 2011;9:1358–95. doi:10.6004/jnccn.2011.0117
  • 44. Pentheroudakis G, Golfinopoulos V, Pavlidis N. Switching benchmarks in cancer of unknown primary: from autopsy to microarray. Eur J Cancer. 2007;43:2026–36. doi:10.1016/j.ejca.2007.06.023
  • 45. Paoletti C, Hayes DF. Circulating Tumor Cells. Adv Exp Med Biol. 2016;882:235–58. doi:10.1007/978-3-319-22909-6_10
  • 46. Chinen LTD, Abdallah EA, Braun AC, Flores BCTCP, Corassa M, Sanches SM, et al. Circulating Tumor Cells as Cancer Biomarkers in the Clinic. Adv Exp Med Biol. 2017;994:1–41. doi:10.1007/978-3-319-55947-6_1
  • 47. Sindeeva OA, Verkhovskii RA, Sarimollaoglu M, Afanaseva GA, Fedonnikov AS, Osintsev EY, et al. New Frontiers in Diagnosis and Therapy of Circulating Tumor Markers in Cerebrospinal Fluid In Vitro and In Vivo. Cells. 2019. October 2;8(10).
  • 48. Potdar PD, Lotey NK. Role of circulating tumor cells in future diagnosis and therapy of cancer. J Cancer Metastasis Treat. 2015;1:44–56.
  • 49. Zou J, Wang E. eTumorType, An Algorithm of Discriminating Cancer Types for Circulating Tumor Cells or Cell-free DNAs in Blood. Genomics Proteomics Bioinformatics. 2017. April;15(2):130–140. doi:10.1016/j.gpb.2017.01.004

Decision Letter 0

Serdar Bozdag

13 Feb 2020

PONE-D-19-32944

Predicting cancer origins with a DNA methylation-based deep neural network model

PLOS ONE

Dear Dr Xu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Mar 29 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Serdar Bozdag, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. To comply with PLOS ONE submission guidelines, in your Methods section, please provide additional information regarding your statistical analyses. For more information on PLOS ONE's expectations for statistical reporting, please see https://journals.plos.org/plosone/s/submission-guidelines.#loc-statistical-reporting.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors should include a case study that uses the method. For example, CTC methylation data could be used to demonstrate the usefulness of the application; CTC data were used in one such example (PMID: 28389380).

Reviewer #2: General:

The authors proposed a deep neural network-based model to predict the origin of cancers using DNA methylation data available in the TCGA repository. They divided the dataset into training, development, and test sets using a 60:20:20 split. In addition to 10-fold cross-validation, they also used an independent dataset available in the GEO database.

Strength:

• The authors clearly explained how cancer site classification is useful

• Idea of model development and independent validation is good

• Prediction of metastatic samples provides another level of validation

Weakness:

• Major drawback of this model is that it cannot tell which features are responsible for classification or related to the origin of cancers.

• Though the DNN model configuration is mentioned, the author did not specify which version of DNN is used, for example MLP, Autoencoder, etc.

• The design of the DNN architecture was not explained clearly. How were the number of hidden layers and the number of nodes in each hidden layer decided?

• Feature selection is the key component of this study, which was not explained clearly. Given the information provided, it would be difficult to reproduce the results

• There is no section with “Conclusion.”

Minor Comments:

• Line 61: “All these molecular platforms have shown better performance in identifying tissue of origin as compared to pathology-based methods.” This sentence does not fit the context of the material presented in this section and needs rephrasing.

• Line 62: “However, gene expression- or microRNA-bases classifiers need to handle RNA that is unstable and less convenient in clinic settings.” This statement requires supporting references. There is also a spelling mistake: “microRNA-bases” should be “microRNA-based”.

• Line 114: “Then we used the Tukey honest test to remove….” A brief explanation of the “Tukey honest test” and how it works would improve the quality of the paper.

• Line 262: “High performance of our DNA methylation based DNN model may provide an opportunity in this scenario when pathology-based approach fails.” However, the manuscript does not indicate which DNA methylation sites contribute to this high performance.

Major Comments:

• Need to add a section called “Conclusion”

• Feature Selection: Line 110: “To make the model with good compatibility and also reduce the dimensionality, we firstly reduced CpG sites to 27K for 450K derived samples.” Concern 1: The authors did not mention what technique was used to reduce the dimension from 450K to 27K. Concern 2: After reducing the features from 450K to 27K, there will be two sets of 27K features. It is not clear what is done with these two sets. Are they combined? The step-by-step procedure leading to the final set of features, 10,360 CpG sites, needs to be stated clearly so that the results can be reproduced.

• Line 127: “In addition, three hyperparameters (learning rate, number of hidden layer and hidden layer unit) were optimized to obtain best performance according to development set performance.” More details are needed on how each of these hyperparameters was chosen so that the results can be reproduced.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Edwin Wang

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Serdar Bozdag

23 Apr 2020

Predicting cancer origins with a DNA methylation-based deep neural network model

PONE-D-19-32944R1

Dear Dr. Xu,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Serdar Bozdag, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Acceptance letter

Serdar Bozdag

28 Apr 2020

PONE-D-19-32944R1

Predicting cancer origins with a DNA methylation-based deep neural network model

Dear Dr. Xu:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Serdar Bozdag

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Cancer origin predictions for 1468 patient samples from TCGA.

    (DOCX)

    S2 Table. Confusion matrix for TCGA test set predictions.

    (CSV)

    S3 Table. Cancer tissue origin predictions for 143 metastatic cancer samples.

    (DOCX)

    S4 Table. Confusion matrix for metastatic cancer samples in TCGA test set.

    (CSV)

    S5 Table. Cancer origin predictions for 581 samples from GEO datasets.

    (DOCX)

    S6 Table. Confusion matrix for GEO sample predictions.

    (CSV)

    S7 Table. Cancer cell type predictions for 391 cancer cell lines.

    (DOCX)

    S8 Table. Confusion matrix for cancer cell type prediction.

    (CSV)

    Attachment

    Submitted filename: 2_Response to Reviewers_final_March27.docx


    Articles from PLoS ONE are provided here courtesy of PLOS
