Abstract
Thyroid nodules are neoplasms commonly found among adults, with papillary thyroid carcinoma (PTC) being the most prevalent malignancy. However, current diagnostic methods often subject patients to unnecessary surgical burden. In this study, we developed and validated an automated, highly accurate multi-study-derived diagnostic model for PTC using personalized biological pathways coupled with a regularized machine learning algorithm. The algorithm achieved near-perfect performance in discriminating PTCs from non-tumoral thyroid samples, with an overall cross-study-validated area under the receiver operating characteristic curve (AUROC) of 0.999 (95% confidence interval [CI]: 0.995–1) and a Brier score of 0.013 on three independent development cohorts. In addition, the algorithm showed excellent generalizability and transferability on two large-scale external blind PTC cohorts: The Cancer Genome Atlas (TCGA), which is the largest genomic PTC cohort studied to date, and the post-Chernobyl cohort, which includes PTCs reported after exposure to radiation from the Chernobyl accident. When applied to the TCGA cohort, the model yielded an AUROC of 0.969 (95% CI: 0.950–0.987) and a Brier score of 0.109. On the post-Chernobyl cohort, it yielded an AUROC of 0.962 (95% CI: 0.918–1) and a Brier score of 0.073. The algorithm was also robust across other clinical scenarios, discriminating malignant from benign lesions as well as clinically aggressive thyroid cancers with poor prognosis from indolent ones. Furthermore, we discovered novel pathway alterations and prognostic signatures for PTC, which can provide directions for follow-up studies.
Keywords: papillary thyroid carcinomas, tall cell variants, molecular diagnosis, machine learning
Introduction
Papillary thyroid carcinoma (PTC) has been reported to have the largest increase in overall occurrence among all cancers [1]. This may be a result of improvement in small tumor detection, in particular, the increased use of ultrasound and fine-needle aspiration biopsy (FNAB) [2]. Currently, the best diagnostic tool for PTC is considered to be FNAB. However, its limitations include the need for a highly experienced cytopathologist for accurate interpretation and the frequently indeterminate cytology due to overlapping pathology. Patients with indeterminate thyroid nodules from repeated cytologic exams are referred for thyroid surgery. However, only 20–30% of these nodules harbor malignancy [3]. In other words, up to 80% of patients are exposed to unnecessary surgical risk [4], requiring thyroid hormone replacement therapy for the rest of their lives. Recently, gene mutation tests for BRAF or RAS have been helping to diagnose indeterminate thyroid nodules. However, the BRAF mutation is present only in some cases of thyroid cancer, with an overall prevalence of 45% [5, 6]. RAS mutations can also be present in adenoma, making it difficult to diagnose malignancy [7]. Therefore, it is crucial to identify more accurate predictors that can help with the diagnosis of thyroid lesions.
With the enormous amount of large-scale omics data generated in recent years, meta-analysis and large-scale computational modeling have become indispensable tools for overcoming the insufficient statistical power of individual studies and for gaining evidence-based insights. However, conventional meta-analysis approaches are often univariate and consider each feature independently. The highly correlated nature of high-throughput genomic data violates the independence assumption required by classic statistical models. In addition, typical high-throughput data have many thousands of predictors and often very small sample sizes, leading to the so-called High-Dimension Low Sample Size (HDLSS) problem and thus posing significant statistical challenges [8, 9]. The HDLSS setting is often computationally demanding and renders many traditional classification algorithms impractical, because these algorithms tend to overfit the data and consequently fail to validate on independent, ‘real-world’ data.
Regularized regression learning methods such as lasso, ridge and elastic-net have recently been widely applied to handle high-dimensional problems [9]. Lasso is one of the most popular regularized learning methods and has very broad applications in data mining and machine learning. Lasso imposes an L1-norm (the sum of the absolute values of the coefficients) penalty on the regression coefficients, which enables both continuous shrinkage and automatic variable selection. However, in cases where the number of predictors, p, is larger than the number of observations, n, such as HDLSS, the lasso selects at most n predictors. In addition, lasso handles multicollinearity poorly: from a group of highly correlated predictors it tends to select one and ignore the rest. Ridge regression utilizes an L2-norm (the Euclidean norm) penalty and keeps all the predictors in the model. Consequently, ridge regression cannot produce a parsimonious model. Elastic-net regression combines an L1-norm penalty for variable selection with an L2-norm penalty for robustness [9–11]. Elastic-net is highly data-adaptive, applicable to high-dimensional settings and able to account for correlation among features. This makes elastic-net particularly useful for HDLSS genomic models.
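For reference, the three penalties can be written in one expression. The form below assumes the glmnet-style parametrization with a mixing parameter α and an overall penalty strength λ; the text does not state the exact parametrization used, so the symbols and scaling here are an assumption. Here ℓ denotes the log-likelihood of the model on n samples.

$$
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right],
\qquad
\lVert\beta\rVert_1 = \sum_{j=1}^{p} \lvert\beta_j\rvert,
\quad
\lVert\beta\rVert_2^2 = \sum_{j=1}^{p} \beta_j^2
$$

Under this form, α = 1 gives the lasso, α = 0 gives ridge regression, and intermediate values give the elastic-net.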
With the growing amount of high-throughput transcriptomic, genomic and proteomic data, developing algorithms that integrate data from heterogeneous sources to provide a holistic molecular perspective of biological systems is a primary research challenge [12]. Multi-omic approaches can be broadly divided into two categories: vertical integration, which combines multiple omic layers measured on the same samples, and horizontal integration, which combines the same variables across studies [13]. However, because of the complexity of biological systems and the high dimensionality relative to sample size, integrative analysis is of limited use in many studies, where gene expression data are the most abundant.
Omics-based machine learning modeling often lacks biological interpretability. Integrating prior knowledge from signaling pathways into these models can not only reduce the number of dimensions but also increase statistical power and model interpretability. Cancer is an example of a heterogeneous disease implicated with various clinical variables and molecular pathway signatures. A growing body of evidence suggests that pathway-based analysis can provide more insights into the complex biological mechanisms of disease than the gene level analysis [14, 15].
Nevertheless, most existing pathway-based approaches measure pathway activity across the entire sample set and therefore cannot explain deregulation in individual samples. The few exceptions that quantify pathway activity for individual samples include five methods: individPath [16], the individualized pathway aberrance score (iPAS) [17], PathOlogists [18], PARADIGM [19] and Pathifier [15]. PathOlogists and PARADIGM share a common drawback: they rely on network structures, which are not yet fully known. In contrast, reference-based methods (e.g. those using normal samples as the reference) such as individPath, iPAS and Pathifier convert gene-level information into pathway-level information with reduced dimensionality, providing direct personalized pathway-level analysis. Pathifier has been successfully employed in personalized analysis by utilizing principal curves [20–22]. However, these methods require a reference cohort, and the number of reference samples is often limited [23].
In an effort to overcome this limitation, our study combined personalized pathway analysis with a meta-analytic approach to enhance the robustness of the data and minimize the risk of false negatives when sample sizes are limited. Using regularized machine learning algorithms, we constructed a multi-study-derived personalized diagnostic model for PTC that showed excellent accuracy on external blind cohorts.
Results
To construct a robust and generalized binary classification model of PTC and normal tissue based on personalized pathway information, we developed a workflow that integrates a multi-study-derived penalized machine learning method with individual pathway dysregulation in PTC samples (Figure 1). PTCs and normal thyroid tissues from three microarray studies (Supplementary Table S1) were used as a development study set for model training. For external blind validation, two independent cohorts of PTC, The Cancer Genome Atlas (TCGA, n = 568) and the post-Chernobyl cohort (n = 94), were used to test the algorithm’s generalizability and transferability, because they differ from the development set in platform technology or tumorigenesis. The NGS-based TCGA cohort is a highly imbalanced dataset. The Chernobyl cohort includes samples from 23 patients from Ukraine who developed radiation-associated PTC after the Chernobyl accident. Detailed descriptions of cohorts and the technical variables used in the studies are presented in Supplementary Table S1.
Figure 1.

Workflow for building a multi-study-derived, individualized pathway learning model for detecting PTC. The pipeline consists of three main parts: cross-study normalization, pathway mapping and prediction model construction. The study cohorts were preprocessed and divided into internal development and validation cohorts and external blind validation cohorts. For cross-study normalization, an EB method was used. Pathway mapping for each individual sample was conducted using public pathway databases (KEGG, BioCarta and PID) and the Pathifier algorithm. The regularized regression model was built using elastic-net. The optimal values of the hyperparameters for the model were obtained from LOOCV with the EPSGO algorithm. QC, quality control; PTC, papillary thyroid cancer; EB, empirical Bayes; LOOCV, leave-one-out cross-validation; EPSGO, Efficient Parameter Selection via Global Optimization; KEGG, Kyoto Encyclopedia of Genes and Genomes; PID, Pathway Interaction Database.
We merged the three development study cohorts through an empirical Bayes algorithm [24] (Supplementary Figure S2). The merged development study set, originally composed of gene expression-level data, was then transformed into a pathway-level matrix using the Pathifier algorithm [15]. Pathifier is an algorithm designed to quantify the degree of pathway abnormality. It uses the algorithm of Hastie and Stuetzle [25] to find a principal curve, which is a nonparametric, nonlinear generalization of the first few principal components, for dimension reduction. With this algorithm, a one-dimensional principal curve can be generated from a cloud of data points in a high-dimensional space. The method yields a pathway deregulation score (PDS) for each sample in a context-specific manner. Each score is the distance along the principal curve from its starting point (the centroid of the control samples) to the point onto which the individual sample is projected, which yields a compact pathway-level representation of each individual sample [15] (see Supplementary Methods).
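The idea of a per-sample deregulation score can be illustrated with a deliberately simplified sketch. The code below is not Pathifier: it replaces the principal curve with a straight line (the first principal component, found by power iteration) and scores each sample by its distance along that line from the control centroid. The function names `first_pc` and `linear_pds` and the toy expression values are hypothetical.

```python
import math

def first_pc(rows, n_iter=100):
    """Leading principal direction of `rows`, found by power iteration."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p  # deterministic start vector
    for _ in range(n_iter):
        # Multiply v by the unnormalized covariance matrix: X^T (X v)
        proj = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def linear_pds(samples, is_control):
    """Linear stand-in for a deregulation score: distance along the first
    principal direction from the centroid of the control samples."""
    v = first_pc(samples)
    controls = [s for s, c in zip(samples, is_control) if c]
    p = len(v)
    centroid = [sum(s[j] for s in controls) / len(controls) for j in range(p)]
    return [abs(sum((s[j] - centroid[j]) * v[j] for j in range(p)))
            for s in samples]

# Toy pathway with two genes: three controls near the origin, three tumors
# displaced along the same direction (values are made up for illustration).
expression = [[0.0, 0.0], [0.2, 0.1], [-0.2, -0.1],
              [2.0, 1.0], [3.0, 1.5], [4.0, 2.0]]
is_control = [True, True, True, False, False, False]
scores = linear_pds(expression, is_control)
```

In this toy example the tumor samples receive larger scores than the controls. A principal curve would additionally follow nonlinear trends in the sample cloud, which a straight line cannot capture.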
Using 11726 merged genes from three studies as input features with pathway information extracted from manually curated databases, including the Kyoto Encyclopedia of Genes and Genomes (KEGG) [26], the National Cancer Institute (NCI)-Pathway Interaction Database (PID) [27] and BioCarta [28], we obtained a principal curve for each pathway (Figure 2b) and a PDS matrix with 752 rows (pathway signatures) (Figure 2a). With this PDS matrix, we used regularized regression to build a prediction model for PTC. The elastic-net penalty is a linear combination of the ridge and lasso penalties. Two hyperparameters (α and λ) need to be tuned for a proper elastic-net penalty function. The hyperparameter α controls the trade-off between the ridge and lasso penalties, whereas λ controls the overall amount of penalization [11]. Because the grid used in a conventional fixed grid search is chosen arbitrarily, we used the Efficient Parameter Selection via Global Optimization (EPSGO) algorithm [29] (see Supplementary Methods) to find the values of α and λ with minimum binomial deviance (Figure 2c). At the hyperparameter values that gave the lowest binomial deviance, the EPSGO-tuned elastic-net yielded a parsimonious set of predictors with 12 non-zero pathway dysregulation coefficients (Figure 2d–e, Supplementary Table S2, Supplementary Figure S4). The final model produced near-perfect performance on cross-study validation (Figure 2f). An additional comparison of our model with 12 random, ranked pathways from the PDS matrix, with thyroid oncogenic pathways from the Cancer and Biological Pathway Associations Database [30], and with two widely used classifiers, random forest (RF) and support vector machine (SVM), is presented in Supplementary Figure S5 and Supplementary Table S3.
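To make the roles of α and λ concrete, here is a minimal sketch of elastic-net penalized logistic regression fitted by plain proximal gradient descent. It is an illustration under stated assumptions, not the study's implementation: the penalty form λ[α‖β‖₁ + (1 − α)/2‖β‖₂²] is assumed, EPSGO tuning is not implemented (α and λ are simply passed in), and the function names and toy data are hypothetical.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t*|z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(X, b0, beta):
    return [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) for row in X]

def fit_enet_logistic(X, y, alpha, lam, step=0.5, n_iter=2000):
    """Minimize mean logistic loss + lam*(alpha*L1 + (1-alpha)/2*L2^2)."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [pi - yi for pi, yi in zip(predict_proba(X, b0, beta), y)]
        b0 -= step * sum(resid) / n  # the intercept is not penalized
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n
            grad += lam * (1.0 - alpha) * beta[j]  # smooth ridge part
            # Gradient step, then soft-thresholding for the lasso part
            beta[j] = soft_threshold(beta[j] - step * grad, step * lam * alpha)
    return b0, beta

def binomial_deviance(X, y, b0, beta):
    """Mean binomial deviance (-2 * mean log-likelihood)."""
    probs = predict_proba(X, b0, beta)
    return -2.0 * sum(math.log(p) if yi == 1 else math.log(1.0 - p)
                      for p, yi in zip(probs, y)) / len(y)

# Toy data: column 0 separates the classes, column 1 is unrelated noise.
X = [[-1.0, 1.0], [-1.0, -1.0], [-0.5, 1.0], [-0.5, -1.0],
     [0.5, 1.0], [0.5, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
b0, beta = fit_enet_logistic(X, y, alpha=1.0, lam=0.1)
```

With α = 1 the noise coefficient is driven exactly to zero while the informative coefficient stays positive, which is the variable-selection behavior the text relies on. A tuning procedure would repeat such fits over candidate (α, λ) pairs and keep the pair with the lowest cross-validated deviance.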
Figure 2.

Model construction. (a) PDS matrix for the three development cohorts. Each row represents the z-score-normalized PDS for each individual sample in each cohort. The color bars at the bottom indicate the study cohort and the sample types. (b) Principal curves of selected pathways, learned on the development cohorts. The data points and the principal curve are projected onto the first three PCs (PC1 to PC3). The principal curve goes through the cloud of samples and is directed so that control samples (non-tumoral) are near the beginning of the curve. The PTC samples are projected onto the curve. (c) Hyperparameter optimization for elastic-net with EPSGO. Cross-study validation deviance as a function of both tuning hyperparameters α and λ. α controls the trade-off between the ridge and lasso penalties, whereas λ controls the overall amount of penalization. The arrow highlights the final EPSGO solution where the deviance is within 1 SE of the minimum (α = 1 and λ = 0.002 with deviance = 0.168). (d) Coefficient paths for elastic-net penalized regression models. Each curve represents a coefficient in the model. The solution path is plotted against log λ on the x-axis. The axis above shows the number of non-zero coefficients at the current λ. (e) Heatmap of the pathways with non-zero coefficients. (f) Estimated probabilities for each sample in cross-study validation. Within each cohort and subclass, samples are sorted by the probability of the true class. PC, principal component; EPSGO, Efficient Parameter Selection via Global Optimization; NT, non-tumoral.
The overall area under the receiver operating characteristic curve (AUROC) for the three development cohorts was 0.999 (95% confidence interval [CI]: 0.995–1) with an area under the precision-recall curve (AUPRC) of 1.000, a Brier score of 0.013, a sensitivity of 100% (95% CI: 93.5–100%) and a specificity of 94.4% (95% CI: 72.7–99.9%) (Table 1, Figure 3a, Supplementary Table S4). We further validated the generalizability and transferability of our model using external blind validation cohorts (TCGA and post-Chernobyl cohorts). Surprisingly, our diagnostic algorithm showed excellent performance on both independent test sets. For the TCGA PTC cohort, it yielded an AUROC of 0.969 (95% CI: 0.950–0.987), an AUPRC of 0.996 and a Brier score of 0.109. The sensitivity and specificity of the algorithm were 84.5 and 96.6%, respectively. For the post-Chernobyl cohort, it yielded an AUROC of 0.962 (95% CI: 0.918–1), an AUPRC of 0.970 and a Brier score of 0.073. The sensitivity and specificity of the algorithm were 93.9 and 82.2%, respectively (Figure 3b, Table 1).
Figure 3.

Internal and external evaluation of model performance. Receiver operating characteristic (ROC) and precision-recall curves for the binary classifier’s ability to distinguish PTCs in the internal cross-study validation (a) and in the external blind validation (b).
Table 1.
Performance measures in overall cross-study validation and external validation
| | Overall cross-study validation (Giordano TJ et al., Ismael R et al., Schulten H et al.) | External cohort validation: TCGA | External cohort validation: Post-Chernobyl |
|---|---|---|---|
| AUROC | 0.999 (0.995, 1.000) | 0.969 (0.950, 0.987) | 0.962 (0.918, 1.000) |
| AUPRC | 1.000 | 0.996 | 0.970 |
| Brier score | 0.013 | 0.109 | 0.073 |
| Confusion matrix metrics | |||
| - Sensitivity (recall) | 1.000 (0.935, 1.000) | 0.845 (0.810, 0.875) | 0.939 (0.831, 0.987) |
| - Specificity | 0.944 (0.727, 0.999) | 0.966 (0.883, 0.996) | 0.822 (0.679, 0.920) |
| - Precision | 0.988 (0.935, 1.000) | 0.995 (0.983, 0.999) | 0.852 (0.729, 0.934) |
| - Likelihood ratio+ | 18.000 (2.679, 120.918) | 24.921 (6.379, 97.363) | 5.281 (2.806, 9.939) |
| - Likelihood ratio– | 0 | 0.161 (0.130, 0.198) | 0.074 (0.025, 0.225) |
| - F1 | 0.994 | 0.914 | 0.893 |
Values in parentheses are 95% CI.
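The likelihood ratios and F1 values in Table 1 follow from sensitivity, specificity and precision. The sketch below recomputes them from the rounded table entries using the standard definitions; the function names are illustrative, and small differences from the published values are expected because the table was presumably computed from unrounded counts.

```python
import math

def likelihood_ratio_positive(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity); infinite when specificity is 1."""
    if specificity == 1.0:
        return math.inf
    return sensitivity / (1.0 - specificity)

def likelihood_ratio_negative(sensitivity, specificity):
    """LR- = (1 - sensitivity) / specificity."""
    return (1.0 - sensitivity) / specificity

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Rounded values copied from Table 1: (sensitivity, specificity, precision).
table1 = {
    "cross-study": (1.000, 0.944, 0.988),
    "TCGA": (0.845, 0.966, 0.995),
    "post-Chernobyl": (0.939, 0.822, 0.852),
}

derived = {
    cohort: {
        "LR+": likelihood_ratio_positive(sens, spec),
        "LR-": likelihood_ratio_negative(sens, spec),
        "F1": f1_score(prec, sens),
    }
    for cohort, (sens, spec, prec) in table1.items()
}
```

For example, the TCGA row gives LR+ ≈ 0.845 / 0.034 ≈ 24.9 and F1 ≈ 0.914, in line with the tabulated values.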
Although most patients with PTC have high survival rates, some histological variants are associated with more aggressive clinical behavior and worse prognosis. The tall cell variant (TCV) of PTC, the most common aggressive variant with a 6% prevalence [31], has been reported to exhibit higher rates of extrathyroidal extension, lymph node metastasis and distant metastasis [31–33]. The TCGA cohort had an imbalanced dataset with 38 TCV samples and only 3 normal tissues adjacent to the tumor as a control. We evaluated the algorithm’s performance using only these TCV samples for the blind test set. Again, our algorithm showed an excellent ability to discriminate between normal adjacent tissue and TCVs with an AUROC of 0.981 (95% CI: 0.939–1), an AUPRC of 0.998, a Brier score of 0.029, a sensitivity of 97.1% (95% CI: 85.0–100%) and a specificity of 100% (95% CI: 29.2–100%).
Although limited in sample size, the development cohorts contained thyroid neoplasms other than PTCs (Supplementary Table S1), allowing us to run our model through various diagnostic scenarios. When we included the adenomas (normal/adenoma versus PTCs) in the model, the overall AUROC of the three studies was 0.986 (95% CI: 0.972–1) with an AUPRC of 0.984, a Brier score of 0.053, a sensitivity of 95.5% (95% CI: 88.9–98.8%) and a specificity of 90.5% (95% CI: 77.4–97.3%). Using the rare thyroid neoplasms in the GSE27155 dataset, including anaplastic thyroid carcinoma (ATC), medullary thyroid cancer (MTC), follicular thyroid cancer (FTC) and oncocytic carcinoma (OC), we further explored the algorithm’s discrimination capacity with different groupings: OC versus oncocytic adenoma (OA), thyroid cancers with better prognosis (PTC and FTC) versus thyroid cancers with poor prognosis (MTC, ATC and OC), and follicular adenoma (FA) versus FTC. OC has been reported to be more aggressive than other conventional thyroid carcinomas, with higher frequencies of extrathyroidal extension, local recurrence and metastasis to lymph nodes [34, 35]. Historically, there have been controversies over the accuracy of histological criteria for differentiating between benign and malignant oncocytic thyroid tumors [36]. When we applied the algorithm to GSE27155, it perfectly discriminated OA from OC with an AUROC of 1 based on leave-one-out cross-validation (LOOCV). The other clinical scenarios mentioned above consistently showed excellent results (Table 2).
Table 2.
Evaluation of model performance in different clinical settings
| | OA vs. OC | PTC, FTC vs. MTC, ATC, OC | FA vs. FTC |
|---|---|---|---|
| AUROC | 1.000 | 0.993 (0.981, 1.006) | 0.962 (0.891, 1.000) |
| AUPRC | 1.000 | 0.975 | 0.969 |
| Brier score | 0.057 | 0.035 | 0.117 |
| Confusion matrix metrics | |||
| - Sensitivity (recall) | 1.000 (0.631, NaN) | 1.000 (0.768, NaN) | 0.9231 (0.6397, 0.998) |
| - Specificity | 1.000 (0.590, NaN) | 0.922 (0.827, 0.974) | 0.9000 (0.5550, 0.997) |
| - Precision | 0.979 (0.924, 0.987) | 0.736 (0.532, NaN) | 0.9231 (0.6245, 0.998) |
| - Likelihood ratio+ | Inf | 2.800 (5.517, 29.697) | 9.2308 (1.4284, 59.653) |
| - Likelihood ratio– | 0 | 0 | 0.0855 (0.0129, 0.568) |
| - F1 | 0.933 | 0.88 | 0.833 |
OA, oncocytic adenoma; OC, oncocytic carcinoma; FTC, follicular thyroid cancer; MTC, medullary thyroid cancer; ATC, anaplastic thyroid cancer; FA, follicular adenoma.
Values in parentheses are 95% CI.
Our machine learning algorithm, which is based on regression and pathway information, provides more biologically interpretable outcomes than other so-called ‘black box’ machine learning algorithms. The algorithm yielded the 12 most informative PTC-associated pathways with non-zero coefficients in elastic-net modeling. The 12 final non-zero pathway predictors contain a total of 579 unique genes, which are listed in Supplementary Table S2. We additionally conducted univariate and multivariate survival analyses on the TCGA dataset to assess the prognostic significance of these genes (Figure 4). After checking that important covariates such as age, gender, pathologic stage and histological type met the proportional hazards assumption, we entered these covariates into a multivariate model (Supplementary Table S5). In multivariate Cox regression analysis for overall survival (OS), 11 genes in 4 non-zero pathways were identified as significant, independent prognostic factors (Figure 4b). The most significant negative prognostic factor for OS was the MAPK9 gene in the ‘IL12 SIGNALING MEDIATED BY STAT4’ pathway [hazard ratio (HR) 55.3, 95% CI: 4.3–710.9, P < 0.01], whereas the most significant independent positive prognostic factor for OS was RANBP3 in the ‘HTLV-I INFECTION’ pathway (HR 0.09, 95% CI: 0.02–0.55, P < 0.001). MAPK9 was also associated with worse recurrence-free survival (RFS) in multivariate Cox analysis (HR 5.7, 95% CI: 1.7–19.2, P < 0.01; Supplementary Figure S6), which was consistent with the univariate Kaplan–Meier analysis (log rank P = 0.026 for OS, P = 0.009 for RFS; Figure 4c). To the best of our knowledge, MAPK9 has not previously been reported as a significant prognostic factor for survival in thyroid cancer.
Figure 4.

Assessment of the most prognostic genes in 12 non-zero pathways. Unadjusted (a) and multivariate adjusted (b) OS HRs for each gene expression in TCGA cohort. Ratios greater than 1 (blue) indicate worse prognosis for the elevated expression levels of indicated genes. Significantly altered genes are marked with asterisks (**, P < 0.01, ***, P < 0.001). (c) Kaplan–Meier curve of the most positive and negative prognostic genes (MAPK9 and RANBP3, respectively) indicating OS (upper) and recurrence free survival (bottom) across patients with low-, medium- and high-gene expression levels. Numbers given below the curves are Kaplan–Meier estimate (number at risk).
Because TCGA contains abundant multi-omic layers, multi-omic profiling and hub gene network analysis were performed on the genes in the 12 final non-zero pathway predictors. The most abundant variant classification and variant type were missense mutation and SNP, respectively (Figure 5a–c). The top mutated genes were NRAS and HRAS, shown as a bar graph in Figure 5e. Copy number variations were assigned GISTIC (Genomic Identification of Significant Targets in Cancer [37]) values of −2, −1, 0, 1 and 2, representing homozygous deletion, single-copy deletion, diploid normal copy, low-level copy number amplification and high-level copy number amplification, respectively. Gene expression increased with copy number for both CDC23 and EP300 (Figure 5f). The expression of three genes, HLA-DPA1, TNFRSF1A and ITGB2, was most inversely correlated with methylation (Figure 5g), which suggests that these three genes may be regulated by epigenetic mechanisms. Functional interactions of the 12 pathway genes within a network were analyzed and visualized with a minimum spanning tree (MST) algorithm (Supplementary Methods). PRKACB was identified as a hub gene for the normal group with a weight factor of 1.466. VCAM1 was identified as a hub gene for the PTC group with a weight factor of 1.671, whereas its weight factor in the normal group was 0.6 (Supplementary Figure S7).
Figure 5.

Multi-omic profiling of 12 non-zero coefficient pathway genes. (a–e) Summary of mutation, which displays variant types (a–c) and number of variants in each sample as a stacked barplot (d) and top 10 mutated genes (e). (f) Gene expression with regards to GISTIC. (g) Gene expression with regards to methylation status.
Older age is a major independent risk factor for PTC [38–41]. Unlike the staging systems for other cancers, the staging system for PTC takes the age of the patient into account. In the American Joint Committee on Cancer staging system for PTC, the age cutoff is 55 years [41]. Thus, we investigated the discrimination capacity of the algorithm, focusing mainly on the high-risk group aged over 55. In the primary scheme, Stages I and II were annotated as early-stage tumors (i.e. localized cancers) and Stages III and IV as late-stage tumors (i.e. regional and distant spread). Two alternative dichotomizations were also used: one annotated Stage I alone as early-stage and Stages II, III and IV as late-stage, and the other was limited to Stage I (early) versus Stage II (late). To measure classifier performance, 70% of the TCGA data were allocated for training and the remaining 30% were held out as a test set. Our TCGA-derived pathway elastic-net classifier achieved an AUC of 0.739 for the primary dichotomization into early (Stages I, II) and late (Stages III, IV) tumors. A similar AUC of 0.705 was obtained for early (Stage I) versus late (Stages II, III, IV) tumors, and the AUC for the limited comparison of early (Stage I) versus late (Stage II) tumors was 0.733. All of these values are relatively high compared with those of RF or SVM (Supplementary Figure S8).
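The three staging dichotomizations and the 70/30 split can be sketched as follows. This is only an illustration of the bookkeeping: the scheme names, the helper functions and the fixed random seed are hypothetical, and the text does not say whether the split was stratified, so a simple random split is assumed.

```python
import random

# Stage-to-label mappings for the three dichotomizations described above.
SCHEMES = {
    "primary": {"I": "early", "II": "early", "III": "late", "IV": "late"},
    "stage_I_vs_rest": {"I": "early", "II": "late", "III": "late", "IV": "late"},
    "stage_I_vs_II": {"I": "early", "II": "late"},  # Stages III and IV excluded
}

def dichotomize(stages, scheme):
    """Map each stage to 'early' or 'late'; None marks an excluded sample."""
    mapping = SCHEMES[scheme]
    return [mapping.get(stage) for stage in stages]

def split_70_30(items, seed=0):
    """Randomly assign 70% of the items to training and 30% to testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

stages = ["I", "II", "III", "IV"]
primary_labels = dichotomize(stages, "primary")
limited_labels = dichotomize(stages, "stage_I_vs_II")
train_ids, test_ids = split_70_30(range(10))
```

Samples mapped to `None` under the limited scheme would be dropped before training and testing.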
It has been previously reported that both the innate and adaptive immune systems are highly involved in the association between cancer-related inflammation and thyroid cancer pathogenesis [42, 43]. We therefore hypothesized that immune profiling signatures could be used as classifier features. To analyze tumor microenvironment heterogeneity across tumor tissues and patients, immune cell profiling was performed for immune infiltrates. Using xCell [44], a cell-type enrichment tool based on gene signature expression, we obtained cell-type enrichment scores (ES) across 64 immune and stromal cell types for 3 datasets (GSE27155, GSE3678 and GSE54958). With this cell-type ES matrix, regularized regression was used to build a prediction model for PTC, which yielded 8 predictors with non-zero coefficients (Supplementary Table S6). The model showed high performance on both independent test sets, the TCGA PTC and post-Chernobyl cohorts. For the TCGA PTC cohort, it yielded an AUROC of 0.908 with a Brier score of 0.075. For the post-Chernobyl cohort, it yielded an AUROC of 0.894 with a Brier score of 0.253 (Figure 6). These results suggest that core immune cell signatures could be viewed as generalizable features of thyroid cancer. To our knowledge, this is the first report of cell-type ES as a possible machine learning predictor.
Figure 6.

Model performance of immune cell profiling. ROC and precision-recall curve for the classifier to distinguish PTCs in the external blind validation.
Discussion
Array- and NGS-based technologies have quickly become widespread in the last decade, with significant improvements in the reliability, speed and cost of genetic sequencing. In the near future, it is expected that most tumors will be completely sequenced, giving physicians the full genetic information required to treat their patients. Advances in liquid biopsy techniques coupled with rapid reductions in sequencing costs could make routine familial screening possible. With accelerated development in machine learning algorithms, this wealth of genetic information is becoming more valuable for both researchers and clinicians. However, despite the progress in these two fields, one of the main limitations of current machine learning-based approaches is that it is very difficult to understand the rationale behind their results, which limits their utility for clinicians. The algorithm used in this study is based on regression and personalized pathways and is more intuitive and interpretable than alternative state-of-the-art algorithms such as deep learning, artificial neural networks, support vector machines and random forests, which are often called ‘black box’ models because they have large numbers of decision rules or hidden layers.
Explanatory modeling with high generalizability is especially important in personalized medicine. The novelty and strength of this study is that we applied individual pathway mapping to multiple study cohorts. Through this, we achieved excellent statistical power and generalizability without losing the model’s interpretability. Parsimony is another important consideration for predictive modeling of high-dimensional genomic data. Our study employed a two-step approach to dimension reduction: cross-study pathway-level representation and penalized regression with a global-tuning algorithm. First, without any prior filtering, the initial input of 11726 merged genes from multiple cohorts was converted into 752 pathway-level scores per individual sample. Then, we used penalized regression analysis and obtained a highly accurate and parsimonious model with 12 core pathway predictors. Detailed descriptions of the 12 core pathways are presented in the Supplementary Discussion.
The model’s high discrimination capability, with an AUPRC of 0.996 on the TCGA cohort without any class imbalance correction, is notable because precision-recall curves are generally more sensitive to imbalance than ROC curves [45]. These results suggest that the model is robust to imbalanced data. In addition, the model showed high transferability across different platforms and modes of tumorigenesis. The discovery cohorts were generated by microarray, whereas the external validation cohorts included one generated on an NGS platform. The external cohorts also included radiation-induced PTC samples from the Chernobyl dataset, and previous studies have reported marked genomic differences between radiation-induced PTCs and sporadic PTCs [46, 47]. Despite the technical and biological differences between the discovery and external cohorts, the performance metrics indicate that the model maintained its accuracy, which supports its transferability. To our knowledge, this is the first study to report such high diagnostic performance across NGS and microarray platforms in thyroid cancer.
Because multilayer data on thyroid cancer are limited, our study primarily used transcriptomic data for meta-analysis. Although this approach identified core features with near-perfect prediction performance and is generically applicable in a high-throughput data environment, a more comprehensive understanding of the biological processes may be feasible with additional omic data layers. This study mainly focused on PTC because other subtypes of thyroid cancer lacked a sufficient number of publicly available datasets for building and validating the algorithm. Therefore, further research on other subtypes of thyroid cancer using this model may be beneficial if an adequate number of samples can be obtained. Moreover, follow-up studies are necessary to evaluate the practicality of using this diagnostic tool in a clinical setting to improve personalized medicine. Additionally, based on immune profiling data obtained from tumor microenvironment heterogeneity analysis, we have demonstrated the possible use of immune cell signatures as machine learning predictors, which, to our knowledge, has not been reported previously. Further studies are needed to test the transferability and generalizability of these findings.
Methods
Study cohort composition
The PTC cohorts used in this study were selected from publicly available gene expression profiles that included PTCs and normal thyroid tissues as a control group. We restricted our selection to human datasets and excluded studies with insufficient information, redundant data, extremely small sample sizes or inappropriate control groups. This selection process left four studies for the study cohort (Supplementary Figure S1). GSE33630 [48] included PTCs and normal thyroid tissues from the Chernobyl Tissue Bank (www.chernobyltissuebank.com). We used this cohort as an external blind test set to test the model’s transferability. The other studies were used for model training and internal cross-study validation. Additionally, although the RNA-seq platform used in TCGA differs from the microarrays used in the other cohorts, we used the TCGA cohort [49], the largest PTC cohort studied to date, as an external blind validation set to evaluate the model’s generalizability and transferability. Detailed descriptions of cohorts and the technical variables used in the studies are presented in Supplementary Table S1.
Development of the algorithm
Our study employed a two-step approach to model building: multi-study-derived pathway-level representation and penalized regression with a global-tuning algorithm. The first step converts individual gene-level information into pathway-level information. The second step applies regularization to select a model that balances explanatory power and parsimony. Detailed descriptions of the algorithms are presented in the Supplementary Methods.
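To make the second step concrete, one common form of penalized regression for a binary outcome is elastic-net logistic regression. The expression below is a sketch under that assumption and is not taken from the Supplementary Methods: the penalty actually used and its tuning may differ. Here $x_i$ is assumed to be the vector of pathway-level scores for sample $i$, $y_i \in \{0, 1\}$ indicates PTC versus non-tumoral tissue, $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ mixes the ridge and lasso terms.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\, \beta}
    \left[
      -\frac{1}{N} \sum_{i=1}^{N}
        \left\{ y_i \left( \beta_0 + x_i^{\top} \beta \right)
          - \log\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right\}
      + \lambda \left( \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2
          + \alpha \lVert \beta \rVert_1 \right)
    \right]
```

A larger $\lambda$ shrinks more coefficients toward zero, and with $\alpha > 0$ some are set exactly to zero, which is how such a penalty trades explanatory power against parsimony.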
Evaluation strategies
We mainly used the AUROC, the area under the precision-recall curve (AUPRC) and the Brier score loss (squared error) [50] to evaluate model performance. The ROC curve plots the true positive rate (also known as sensitivity or recall) as a function of the false positive rate (1 − specificity), and the precision-recall curve plots precision as a function of recall; the AUROC and AUPRC are the areas under these respective curves. Additional descriptions of the evaluation methodology are presented in the Supplementary Methods.
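As an illustration of two of these metrics, the following minimal Python sketch computes the AUROC through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half) and the Brier score as the mean squared difference between predicted probabilities and observed labels. It is not the study's R code; the function names and the toy labels and probabilities are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical toy data: 1 = tumor, 0 = non-tumoral tissue.
toy_labels = [1, 1, 1, 0, 0]
toy_probs = [0.9, 0.8, 0.4, 0.5, 0.1]
print(round(auroc(toy_labels, toy_probs), 3), round(brier_score(toy_labels, toy_probs), 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large cohorts. A lower Brier score indicates better-calibrated and more accurate probabilities, whereas the AUROC depends only on how the samples are ranked.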
Survival analysis
The Kaplan–Meier method [51] and the log-rank test [52] were used to determine the univariate significance of the variables in relation to OS and RFS. Multivariate analyses with Cox proportional hazards (Cox PH) regression were used to examine the effects of multiple covariates on survival [53]. The Schoenfeld residuals test [54] was used to test the PH assumption in the Cox model. The significance of the Cox PH parameters was tested using the Wald test and described by the HR with its 95% CI. All statistical analyses were performed using R version 3.2.3 (R Foundation for Statistical Computing).
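For readers unfamiliar with the Kaplan–Meier method, the sketch below implements the product-limit estimator in plain Python: at each observed event time, the running survival probability is multiplied by one minus the ratio of events to subjects still at risk. It is an illustration only and not the R code used for the analyses; the function name and the toy follow-up times are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (event time, estimated survival probability) pairs.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    subject was censored at that time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    # Number of observed events at each distinct event time.
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    curve = []
    survival = 1.0
    for t in sorted(event_counts):
        # Subjects whose follow-up reaches at least t are still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - event_counts[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (arbitrary units); 0 marks a censored subject.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
print([(t, round(s, 3)) for t, s in kaplan_meier(toy_times, toy_events)])
```

With the toy data, the estimated survival falls to about 0.8, 0.6 and 0.3 at times 2, 3 and 5. Censored subjects contribute to the at-risk count up to their censoring time but never trigger a drop in the curve.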
Analyzing cellular heterogeneity
Cell-type ES was obtained using the R package xCell [44]. xCell analyzes gene expression data for 64 immune and stromal cell types based on signatures previously learned from thousands of pure cell-type profiles from various sources. The tool is effective in reducing associations between closely related cell types and, therefore, reliably portrays the cellular heterogeneity landscape.
Data and code availability
The data set and R code in this paper are publicly available online at https://malcogene.github.io/PTC/.
Conflict of interest
The authors declare that they have no conflicts of interest with the contents of this article.
Contributors
S.Y.K. and K.S.P. conceived, designed, performed and analyzed the experiments. S.Y.K. performed mathematical and statistical analyses. All authors wrote the paper. All authors analyzed the results and approved the final version of the article.
Key Points
We developed a highly accurate multi-study-derived diagnostic model for papillary thyroid carcinomas using personalized pathways and sophisticated machine learning algorithms.
The model achieved near-perfect discrimination performance, with a cross-study-validated area under the curve (AUC) of over 0.99, and excellent generalization performance on two large-scale cohorts of PTC samples (TCGA and the post-Chernobyl cohort), with an AUC of over 0.96.
The model showed high transferability and excellent accuracy in discriminating thyroid neoplasms other than PTCs, including anaplastic carcinomas, medullary thyroid cancer, follicular thyroid cancer and oncocytic carcinoma.
The model showed greater interpretability and excellent transferability regardless of the platform or the number and type of samples.
Supplementary Material
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1F1A1062023 and NRF-2020M3A9D8038014) and the National Institutes of Health (NIH)/National Cancer Institute (NCI) Cancer Center Support Grant P30 CA008748 and R21 CA234752.
Kyoung Sik Park, MD, PhD, is a professor at Konkuk University School of Medicine, South Korea. His main interests focus on thyroid and breast cancer, precision medicine, cancer genomics and cancer predictive modeling.
Seong Hoon Kim, MD, is a staff surgeon at Konkuk University School of Medicine, South Korea. His main interests focus on precision medicine and cancer genomics.
Jung Hun Oh, PhD, is an attending physicist in the Department of Medical Physics at Memorial Sloan Kettering Cancer Center, USA. His main interests focus on machine learning algorithms, precision medicine in cancer treatment and computational genomics.
Sung Young Kim, MD, PhD, is a professor at Konkuk University School of Medicine, South Korea. His main interests focus on precision medicine, computational genomics and the development of machine learning algorithms.
Contributor Information
Kyoung Sik Park, Konkuk University School of Medicine, South Korea.
Seong Hoon Kim, Konkuk University School of Medicine, South Korea.
Jung Hun Oh, Department of Medical Physics at Memorial Sloan Kettering Cancer Center, USA.
Sung Young Kim, Konkuk University School of Medicine, South Korea.
References
- 1. Pellegriti G, Frasca F, Regalbuto C, et al. Worldwide increasing incidence of thyroid cancer: update on epidemiology and risk factors. J Cancer Epidemiol 2013;2013:965212.
- 2. Cho YJ, Kim DY, Park E-C, et al. Thyroid fine-needle aspiration biopsy positively correlates with increased diagnosis of thyroid cancer in South Korean patients. BMC Cancer 2017;17:114.
- 3. Krauss EA, Mahon M, Fede JM, et al. Application of the Bethesda classification for thyroid fine-needle aspiration: institutional experience and meta-analysis. Arch Pathol Lab Med 2016;140:1121–31.
- 4. Gonçalves Filho J, Kowalski LP. Surgical complications after thyroid surgery performed in a cancer hospital. Otolaryngol Head Neck Surg 2005;132:490–4.
- 5. Tufano RP, Teixeira GV, Bishop J, et al. BRAF mutation in papillary thyroid cancer and its value in tailoring initial treatment: a systematic review and meta-analysis. Medicine (Baltimore) 2012;91:274–86.
- 6. Chang H, Shin BK, Kim A, et al. DNA methylation analysis for the diagnosis of thyroid nodules - a pilot study with reference to BRAF(V600E) mutation and cytopathology results. Cytopathology 2016;27:122–30.
- 7. Howell GM, Hodak SP, Yip L. RAS mutations in thyroid cancer. Oncologist 2013;18:926–32.
- 8. Clarke R, Ressom HW, Wang A, et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 2008;8:37–49.
- 9. Lever J, Krzywinski M, Altman N. Points of significance: regularization. Nat Methods 2016;13:803–4.
- 10. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22.
- 11. Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Statistical Soc B 2005;67:301–20.
- 12. Wei Z, Zhang Y, Weng W, et al. Survey and comparative assessments of computational multi-omics integrative methods with multiple regulatory networks identifying distinct tumor compositions across pan-cancer data sets. Brief Bioinform 2020;10.1093/bib/bbaa102.
- 13. Ulfenborg B. Vertical and horizontal integration of multi-omics data with miodin. BMC Bioinformatics 2019;20:649.
- 14. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17:440–52.
- 15. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci U S A 2013;110:6388–93.
- 16. Wang H, Cai H, Ao L, et al. Individualized identification of disease-associated pathways with disrupted coordination of gene expression. Brief Bioinform 2016;17:78–87.
- 17. Ahn T, Lee E, Huh N, et al. Personalized identification of altered pathways in cancer using accumulated normal tissue data. Bioinformatics 2014;30:i422–9.
- 18. Song T, Cao S, Tao S, et al. A novel unsupervised algorithm for biological process-based analysis on cancer. Sci Rep 2017;7:4671.
- 19. Vaske CJ, Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 2010;26:i237–45.
- 20. Fa B, Luo C, Tang Z, et al. Pathway-based biomarker identification with crosstalk analysis for robust prognosis prediction in hepatocellular carcinoma. EBioMedicine 2019;44:250–60.
- 21. Huang S, Chong N, Lewis NE, et al. Novel personalized pathway-based metabolomics models reveal key metabolic pathways for breast cancer diagnosis. Genome Med 2016;8:34.
- 22. Livshits A, Git A, Fuks G, et al. Pathway-based personalized analysis of breast cancer expression data. Mol Oncol 2015;9:1471–83.
- 23. Vitali F, Li Q, Schissler AG, et al. Developing a “personalome” for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes. Brief Bioinform 2019;20:789–805.
- 24. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8:118–27.
- 25. Hastie T, Stuetzle W. Principal curves. J Am Stat Assoc 1989;84:502–16.
- 26. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27–30.
- 27. Schaefer CF, Anthony K, Krupa S, et al. PID: the pathway interaction database. Nucleic Acids Res 2009;37:D674–9.
- 28. Nishimura D. BioCarta. Biotech Software & Internet Report 2. Open Access Library, 2001, 117–20.
- 29. Sill M, Hielscher T, Becker N, et al. c060: extended inference with lasso and elastic-net regularized cox and generalized linear models. J Stat Softw 2014;62:1–22.
- 30. Li F, Wu T, Xu Y, et al. A comprehensive overview of oncogenic pathways in human cancer. Brief Bioinform 2020;21:957–69.
- 31. Wang X, Cheng W, Liu C, et al. Tall cell variant of papillary thyroid carcinoma: current evidence on clinicopathologic features and molecular biology. Oncotarget 2016;7:40792–9.
- 32. Morris LGT, Shaha AR, Tuttle RM, et al. Tall-cell variant of papillary thyroid carcinoma: a matched-pair analysis of survival. Thyroid 2010;20:153–8.
- 33. Shi X, Liu R, Basolo F, et al. Differential clinicopathological risk and prognosis of major papillary thyroid cancer variants. J Clin Endocrinol Metab 2016;101:264–74.
- 34. Montone KT, Baloch ZW, LiVolsi VA. The thyroid Hürthle (oncocytic) cell and its associated pathologic conditions: a surgical pathology and cytopathology review. Arch Pathol Lab Med 2008;132:1241–50.
- 35. Tsybrovskyy O, Rössmann-Tsybrovskyy M. Oncocytic versus mitochondrion-rich follicular thyroid tumours: should we make a difference? Histopathology 2009;55:665–82.
- 36. Boronat M, Cabrera JJ, Perera C, et al. Late bone metastasis from an apparently benign oncocytic follicular thyroid tumor. Endocrinol Diabetes Metab Case Rep 2013;2013:130051.
- 37. Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 2011;12:R41.
- 38. Nixon IJ, Kuk D, Wreesmann V, et al. Defining a valid age cutoff in staging of well-differentiated thyroid cancer. Ann Surg Oncol 2016;23:410–5.
- 39. Ho AS, Luu M, Zalt C, et al. Mortality risk of nonoperative papillary thyroid carcinoma: a corollary for active surveillance. Thyroid 2019;29:1409–17.
- 40. NCCN Clinical Practice Guidelines in Oncology: Thyroid carcinoma. National Comprehensive Cancer Network. Version 2. 2017.
- 41. Amin MB, Greene FL, Edge SB, et al. The eighth edition AJCC cancer staging manual: continuing to build a bridge from a population-based to a more “personalized” approach to cancer staging. CA Cancer J Clin 2017;67:93–9.
- 42. Ferrari SM, Fallahi P, Galdiero MR, et al. Immune and inflammatory cells in thyroid cancer microenvironment. Int J Mol Sci 2019;20:4413.
- 43. Galdiero MR, Varricchi G, Marone G. The immune network in thyroid cancer. Onco Targets Ther 2016;5:e1168556.
- 44. Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol 2017;18:220.
- 45. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015;10:e0118432.
- 46. Boltze C, Riecke A, Ruf CG, et al. Sporadic and radiation-associated papillary thyroid cancers can be distinguished using routine immunohistochemistry. Oncol Rep 2009;22:459–67.
- 47. Handkiewicz-Junak D, Swierniak M, Rusinek D, et al. Gene signature of the post-Chernobyl papillary thyroid cancer. Eur J Nucl Med Mol Imaging 2016;43:1267–77.
- 48. Dom G, Tarabichi M, Unger K, et al. A gene expression signature distinguishes normal tissues of sporadic and radiation-induced papillary thyroid carcinomas. Br J Cancer 2012;107:994–1000.
- 49. Cancer Genome Atlas Research Network. Integrated genomic characterization of papillary thyroid carcinoma. Cell 2014;159:676–90.
- 50. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950;78:1–3.
- 51. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457.
- 52. Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J Roy Stat Soc Ser A 1972;135:185–207.
- 53. Cox DR. Regression models and life-tables. J R Stat Soc B Methodol 1972;34:187–220.
- 54. Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika 1982;69:239.
