2021 Dec;13(12):7006–7020. doi: 10.21037/jtd-21-806

Table 2. The characteristics of the included studies for diagnosing NSCLC through the gene profiles analyzed by AI models.

Authors	Publication year	Number of datasets	Number of cases	Number of genes (total)	Subtypes (cases)	Training set (cases)	Validation set (cases)	Test set (cases)	Independent test datasets (cases)	Classifier	ACC	SN	SP	AUC	Precision	Conclusion
Xiao et al. (31)	2017	1	162	1,385	TCGA: ADC (n=162)	NR	NR	NR	0	DL-based multi-model (KNN, SVM, DT, RF, GBDT)	KNN: 88.00%; SVM: 97.20%; DT: 96.80%; RF: 93.20%; GBDT: 96.80%; majority voting: 97.20%; DL-based method: 99.20%	DT: 97.37%	NR	NR	DT: 98.46%	The DL-based multi-model algorithm could obtain more information and achieved an accuracy of 99.20% for distinguishing ADCs from normal samples
Yuan et al. (32)	2020	1	150	1,100, 260, 43 (n=20,502)	GEO: ADC (n=77), SCC (n=73)	NR	NR	NR	0	SVM, RF, RIPPER	SVM: 0.867; RF: 0.880; RIPPER: 0.867	SVM: 0.987; RF: 0.974; RIPPER: 0.867	SVM: 0.740; RF: 0.781; RIPPER: 0.872	NR	SVM: 0.800; RF: 0.772; RIPPER: 0.877	On the gene expression dataset of NSCLC subtypes, the RIPPER algorithm performed almost as well as the SVM and RF classifiers at subtyping NSCLCs
Podolsky et al. (33)	2016	3	480	NR	DFCI: ADC (n=139), SCC (n=21), other (n=26), normal (n=17); UMD: ADC (n=86), normal (n=10); BWHD: ADC (n=150), other (n=31)	235	96	149	0	KNN, NB, SVM, DT	NR	NR	NR	KNN, k=1: 0.87; KNN, k=5: 0.96; KNN, k=10: 0.97; NB_normal: 0.85; NB_histogram: 0.84; SVM: 0.91; C4.5 DT: 0.92	NR	Compared with other machine learning algorithms, SVM was the optimal tool in NSCLC morphology classification based on gene expression level evaluation
Cai et al. (34)	2015	2	1,099	16 (n=45)	TCGC: ADC (n=126), SCC (n=134); GEO: SCLC (n=28); TCGA: ADC (n=452), SCC (n=359)	288	0	811	0	RF and multi-SVMs	Training datasets: 86.54%; Independent datasets: 84.60%	Training datasets: 84.37%; Independent datasets: 85.52%	NR	NR	Training datasets: 66.79%; Independent datasets: 85.94%	The accuracies of the multi-SVM model with the 16 top features for diagnosing NSCLC subtypes were 86.54% and 84.60% in the training and test sets, respectively
Li et al. (35)	2018	2	853	20 (n=107)	TCGA: ADC (n=286), normal (n=59); GEO: ADC (n=387), normal (n=121)	2/3 of each dataset	0	1/3 of each dataset	0	RF, SVM, and ANN	TCGA: 98.68%; GSE68465: 99.51%; GSE10072: 97.91%	TCGA: 99.28%; GSE68465: 99.95%; GSE10072: 98.05%	TCGA: 95.68%; GSE68465: 92.83%; GSE10072: 97.75%	NR	NR	Machine learning models with twenty ADC signature genes were robust for early ADC diagnosis
Dong et al. (36)	2019	1	369	699	TCGA: ADC (n=369)	NR	NR	NR	0	SVM, KNN, LR, RF, gcForest and the ensemble MLW-gcForest	Methylation: 0.751; RNA: 0.689; CNV: 0.645; Multi-modal: 0.908	Methylation: 0.763; RNA: 0.679; CNV: 0.677; Multi-modal: 0.882	NR	Multi-modal: 0.96	Methylation: 0.771; RNA: 0.659; CNV: 0.675; Multi-modal: 0.896	The MLW-gcForest algorithm had an AUC of 0.96 and an accuracy of 0.908 for ADC staging, better than those achieved by traditional machine learning algorithms
Yang et al. (37)	2020	2	600	42, 26, 16 (n=528)	TCGA: ADC (n=470); GSE62182: ADC (n=94); GSE83527: ADC (n=36)	376	94	0	130	SVM	NR	NR	NR	TCGA: 0.62; GSE62182: 0.66; GSE83527: 0.63	NR	The 16‑miRNA signature analyzed by the LIBSVM algorithm classified ADC pathological stages about as well as the combinations of 42 or 26 miRNAs

NR, not reported; AI, artificial intelligence; DL, deep learning; SVM, Support Vector Machine; KNN, K-nearest neighbors; GBDT, gradient boosting decision trees; LR, logistic regression; RF, Random Forest; DT, Decision Tree; ANN, artificial neural networks; NB, Naive Bayes; RIPPER, Repeated Incremental Pruning to Produce Error Reduction algorithm; ADC, adenocarcinoma; SCC, squamous-cell carcinoma; NSCLC, non-small cell lung cancer; SCLC, small cell lung cancer; TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus; DFCI, Dana-Farber Cancer Institute; UMD, University of Michigan Dataset; BWHD, Brigham and Women’s Hospital Dataset; CNV, copy number variation; AUC, area under the receiver-operating characteristic (ROC) curve; ACC, accuracy; SN, sensitivity; SP, specificity.
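
For readers comparing the ACC, SN, SP, and Precision columns, the following is a minimal sketch of how these four measures are conventionally derived from the counts of a binary confusion matrix. It is an illustration only: the `ConfusionCounts` class, the `summarize` function, and the example counts are hypothetical and are not taken from any of the included studies, which may have computed their metrics differently (for example, averaged over cross-validation folds or over several classes). AUC is omitted because it needs the classifier's continuous scores rather than a single confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical helper for illustration)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def summarize(c: ConfusionCounts) -> dict[str, float]:
    """Compute the four confusion-matrix metrics named in the table header."""
    return {
        "ACC": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),  # accuracy
        "SN": _ratio(c.tp, c.tp + c.fn),         # sensitivity (recall)
        "SP": _ratio(c.tn, c.tn + c.fp),         # specificity
        "Precision": _ratio(c.tp, c.tp + c.fp),  # positive predictive value
    }


# Made-up counts, not data from any study in the table.
example = ConfusionCounts(tp=90, fp=5, tn=45, fn=10)
metrics = summarize(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity use different denominators, a classifier can score high on one and low on the other, which is why the table reports them in separate columns alongside accuracy.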