Table 2.
Publication | ML method | Sample size | Sequencing data type | Performance | Validation method | Feature selection | Highlight/advantage | Shortcoming |
---|---|---|---|---|---|---|---|---|
Mathios et al. [78] | LR model with a LASSO penalty | 799 | cfDNA fragment | AUC (0.98) | 10-fold cross-validation | cfDNA fragment features, clinical risk factors, and CT imaging features | This study provides a framework for combining cfDNA fragmentation profiles with other markers for lung cancer detection | DNA variations in late-stage disease may affect cfDNA detection |
Lung-CLiP [79] | 5-nearest neighbor; 3-nearest neighbor; NB; LR; DT | 160 | cfDNA | AUC (0.69–0.98) | Leave-one-out cross-validation | SNV + CNV features | This study establishes an ML framework for the early detection of lung cancers using cfDNA | The training dataset has sampling bias (most are smokers) |
Liang et al. [80] | LR | 296 | ctDNA | AUC (0.816) | 10-fold cross-validation | Nine DNA methylation markers | This study establishes an ML framework for the early detection of lung cancers using DNA methylation markers | The selected features comprise only nine methylation biomarkers, which limits assay performance |
Raman et al. [81] | RF; SVM; LR with ridge, elastic net, or LASSO regularization | 843 | cfDNA | mAUC (0.896–0.936) | Leave-one-out cross-validation | Copy number profiling of cfDNA | The model provides a framework for using copy number profiling of cfDNA as a biomarker in lung cancer detection | Feature selection methods could be applied to reduce overfitting and might yield a higher AUC |
Kobayashi et al. [83] | Diet Networks with EIS | 954 | Somatic mutation | Accuracy (0.8) | 5-fold cross-validation | SNVs, insertions, and deletions across 1796 genes | The EIS helps to stabilize the training process of Diet Networks | The interpretable hidden representations obtained from EIS may vary between different datasets |
Whitney et al. [86] | LR | 299 | RNA-seq of BECs | AUC (0.81) | 10-fold cross-validation | Lung cancer-associated and clinical covariate RNA markers | The model maintains sensitivity for small and peripheral suspected lesions | The selected genes vary greatly under different feature selection processes and parameters |
Podolsky et al. [87] | KNN; NB (normal distribution of attributes); NB (distribution through histograms); SVM; C4.5 DT | 529 | RNA-seq | AUC (0.91) | Hold-out | RNA-seq | This study systematically compares different models of lung cancer subtype classification across different datasets | Feature selection methods could be applied to reduce overfitting |
Choi et al. [88] | An ensemble model based on elastic net LR; SVM; hierarchical LR | 2285 | RNA-seq of bronchial brushing samples | AUC (0.74) | 5-fold cross-validation | RNA-seq of 1232 genes with clinical covariates | The model integrates RNA-seq features and clinical information to improve the accuracy of risk prediction | Sample sizes in certain subgroups are small and may cause unbalanced training |
Aliferis et al. [89] | Linear SVM; polynomial-kernel SVM; KNN; NN | 203 | RNA-seq | AUC (0.8783–0.9980) | 5-fold cross-validation | RNA-seq of selected genes using RFE and UAF | The study uses different gene selection algorithms to improve the classification accuracy | The selected genes vary greatly across different training cohorts |
Aliferis et al. [91] | DT; KNN; linear SVM; polynomial-kernel SVM; RBF-kernel SVM; NN | 37 | CNV measured by CGH | Accuracy (0.892) | Leave-one-out cross-validation | Copy number of 80 selected genes based on linear SVM | The study systematically compares different models of lung cancer subtype classification | The sample size is small |
Daemen et al. [92] | HMM; weighted LS-SVM | 89 | CNV measured by CGH | Accuracy (0.880–0.955) | 10-fold cross-validation | CNV measured by CGH | The use of recurrent HMMs for CNV detection provides high accuracy for cancer classification | Benchmarked comparisons are needed to demonstrate the superiority of using the HMM model |
Jurmeister et al. [93] | NN; SVM; RF | 972 | DNA methylation | Accuracy (0.878–0.964) | 5-fold cross-validation | Top 2000 variable CpG sites | The study provides a framework for using DNA methylation data to predict tumor metastases | The model cannot accurately predict samples with low tumor cellularity through methylation data |
Note: LASSO, least absolute shrinkage and selection operator; cfDNA, cell-free DNA; NB, naive Bayes; DT, decision tree; SNV, single-nucleotide variant; CNV, copy number variation; ctDNA, circulating tumor DNA; mAUC, mean area under the curve; EIS, element-wise input scaling; BEC, bronchial epithelial cell; KNN, K-nearest neighbors; NN, neural network; RFE, recursive feature elimination; UAF, univariate association filtering; CGH, comparative genomic hybridization; HMM, hidden Markov model; LS-SVM, least-squares support vector machine; RNA-seq, RNA sequencing. Compared with hold-out validation, cross-validation is usually more robust because it evaluates the model on several different splits of the data into training, validation, and test sets, and so accounts for more of the variance between possible splits. However, cross-validation is more time-consuming than simple hold-out validation because the model must be retrained once per fold.
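
The contrast between the validation methods listed in the table can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the synthetic one-dimensional data, the toy threshold classifier, and the helper names (`make_data`, `fit_threshold`, `hold_out`, `k_fold`) are all assumptions made for illustration. Hold-out produces a single accuracy estimate from one split, k-fold cross-validation produces k estimates whose spread can be inspected, and leave-one-out cross-validation is the special case in which k equals the number of samples.

```python
import random
from statistics import mean, stdev


def make_data(n, seed=0):
    """Synthetic two-class data (illustrative only): class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    rng = random.Random(seed)
    data = [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]
    rng.shuffle(data)
    return data


def fit_threshold(train):
    """Toy classifier: decision threshold halfway between the two class means."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, test):
    """Fraction of test samples on the correct side of the threshold."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in test)


def hold_out(data, test_fraction=0.2):
    """One fixed split: a single model is trained and a single score is returned."""
    n_test = int(len(data) * test_fraction)
    test, train = data[:n_test], data[n_test:]
    return accuracy(fit_threshold(train), test)


def k_fold(data, k):
    """k splits: every sample is tested exactly once, and k models are trained."""
    scores = []
    for i in range(k):
        test = data[i::k]  # samples whose index j satisfies j % k == i
        train = [d for j, d in enumerate(data) if j % k != i]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


data = make_data(100)
single = hold_out(data)            # 1 model fit
folds = k_fold(data, 10)           # 10 model fits
loo = k_fold(data, len(data))      # leave-one-out: 100 model fits

print(f"hold-out accuracy:       {single:.3f}")
print(f"10-fold accuracy:        {mean(folds):.3f} (SD {stdev(folds):.3f})")
print(f"leave-one-out accuracy:  {mean(loo):.3f}")
```

The number of model fits (1, 10, and 100 here) is what makes cross-validation slower than hold-out, while the per-fold scores give a sense of how much the estimate depends on the particular split.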