2021 Dec;13(12):7006–7020. doi: 10.21037/jtd-21-806

Table 2. Characteristics of the included studies on diagnosing NSCLC from gene profiles analyzed by AI models.

| Authors | Publication year | Number of datasets | Number of cases | Number of genes (total) | Subtypes (cases) | Training set (cases) | Validation set (cases) | Test set (cases) | Independent test datasets (cases) | Classifier | Results: ACC | Results: SN | Results: SP | Results: AUC | Results: Precision | Conclusion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Xiao et al. (31) | 2017 | 1 | 162 | 1,385 | TCGA: ADC (n=162) | NR | NR | NR | 0 | DL-based multi-model (KNN, SVM, DT, RF, GBDT) | KNN: 88.00%; SVM: 97.20%; DT: 96.80%; RF: 93.20%; GBDT: 96.80%; majority voting: 97.20%; DL-based method: 99.20% | DT: 97.37% | NR | NR | DT: 98.46% | The DL-based multi-model algorithm could obtain more information and achieved an accuracy of 99.20% for distinguishing ADCs from normal samples |
| Yuan et al. (32) | 2020 | 1 | 150 | 1,100, 260, 43 (n=20,502) | GEO: ADC (n=77), SCC (n=73) | NR | NR | NR | 0 | SVM, RF, RIPPER | SVM: 0.867; RF: 0.880; RIPPER: 0.867 | SVM: 0.987; RF: 0.974; RIPPER: 0.867 | SVM: 0.740; RF: 0.781; RIPPER: 0.872 | NR | SVM: 0.800; RF: 0.772; RIPPER: 0.877 | On the gene expression dataset of NSCLC subtypes, the RIPPER algorithm performed almost as well as the SVM and RF classifiers at subtyping NSCLC |
| Podolsky et al. (33) | 2016 | 3 | 480 | NR | DFCI: ADC (n=139), SCC (n=21), other (n=26), normal (n=17); UMD: ADC (n=86), normal (n=10); BWHD: ADC (n=150), other (n=31) | 235 | 96 | 149 | 0 | KNN, NB, SVM, DT | NR | NR | NR | KNN, k=1: 0.87; KNN, k=5: 0.96; KNN, k=10: 0.97; NB_normal: 0.85; NB_histogram: 0.84; SVM: 0.91; C4.5 DT: 0.92 | NR | Compared with other machine learning algorithms, SVM was the optimal tool for NSCLC morphology classification based on gene expression level evaluation |
| Cai et al. (34) | 2015 | 2 | 1,099 | 16 (n=45) | TCGC: ADC (n=126), SCC (n=134); GEO: SCLC (n=28); TCGA: ADC (n=452), SCC (n=359) | 288 | 0 | 811 | 0 | RF and multi-SVMs | Training datasets: 86.54%; Independent datasets: 84.60% | Training datasets: 84.37%; Independent datasets: 85.52% | NR | NR | Training datasets: 66.79%; Independent datasets: 85.94% | The accuracies of the multi-SVM model with the 16 top features for diagnosing NSCLC subtypes were 86.54% and 84.60% in the training and test sets, respectively |
| Li et al. (35) | 2018 | 2 | 853 | 20 (n=107) | TCGA: ADC (n=286), normal (n=59); GEO: ADC (n=387), normal (n=121) | 2/3 of each dataset | 0 | 1/3 of each dataset | 0 | RF, SVM, and ANN | TCGA: 98.68%; GSE68465: 99.51%; GSE10072: 97.91% | TCGA: 99.28%; GSE68465: 99.95%; GSE10072: 98.05% | TCGA: 95.68%; GSE68465: 92.83%; GSE10072: 97.75% | NR | NR | Machine learning models with twenty ADC signature genes were robust for early ADC diagnosis |
| Dong et al. (36) | 2019 | 1 | 369 | 699 | TCGA: ADC (n=369) | NR | NR | NR | 0 | SVM, KNN, LR, RF, gcForest and the ensemble MLW-gcForest | Methylation: 0.751; RNA: 0.689; CNV: 0.645; multi-modal: 0.908 | Methylation: 0.763; RNA: 0.679; CNV: 0.677; Multi-modal: 0.882 | NR | Multi-model: 0.96 | Methylation: 0.771; RNA: 0.659; CNV: 0.675; Multi-modal: 0.896 | The MLW-gcForest algorithm had an AUC of 0.96 and an accuracy of 0.908 for ADC staging, better than those achieved by traditional machine learning algorithms |
| Yang et al. (37) | 2020 | 2 | 600 | 42, 26, 16 (n=528) | TCGA: ADC (n=470); GSE62182: ADC (n=94); GSE83527: ADC (n=36) | 376 | 94 | 0 | 130 | SVM | NR | NR | NR | TCGA: 0.62; GSE62182: 0.66; GSE83527: 0.63 | NR | The 16-miRNA signature analyzed by the LIBSVM algorithm showed an ability to classify ADC pathological stages similar to that of the combinations of 42 or 26 miRNAs |

NR, not reported; AI, artificial intelligence; DL, deep learning; SVM, support vector machine; KNN, K-nearest neighbors; GBDT, gradient boosting decision trees; LR, logistic regression; RF, random forest; DT, decision tree; ANN, artificial neural networks; NB, naive Bayes; RIPPER, Repeated Incremental Pruning to Produce Error Reduction algorithm; ADC, adenocarcinoma; SCC, squamous-cell carcinoma; NSCLC, non-small cell lung cancer; SCLC, small cell lung cancer; TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus; DFCI, Dana-Farber Cancer Institute; UMD, University of Michigan Dataset; BWHD, Brigham and Women’s Hospital Dataset; CNV, copy number variation; AUC, area under the receiver-operating characteristic (ROC) curve; ACC, accuracy; SN, sensitivity; SP, specificity.
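
For readers comparing the ACC, SN, SP, AUC, and Precision values across studies, the following minimal sketch shows the standard definitions of these metrics for a binary classifier. It is illustrative only and is not taken from any of the included studies. The function names (`binary_metrics`, `auc`) and the toy labels and scores are assumptions introduced here, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which may differ from the procedures the individual studies used.

```python
from __future__ import annotations

from typing import Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Compute ACC, SN, SP and Precision for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),  # accuracy
        "SN": _ratio(tp, tp + fn),                  # sensitivity (recall)
        "SP": _ratio(tn, tn + fp),                  # specificity
        "Precision": _ratio(tp, tp + fp),           # positive predictive value
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case; tied scores count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 4 positive and 4 negative cases.
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    predictions = [1, 1, 1, 0, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
    print(binary_metrics(labels, predictions))
    print(auc(labels, scores))
```

Because precision depends on the ratio of positive to negative cases while SN and SP do not, values from studies with different class balances are not directly comparable on that column.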