2022 Dec 1;20(5):850–866. doi: 10.1016/j.gpb.2022.11.003

Table 2.

Publications relevant to ML on early detection and diagnosis using sequencing data

| Publication | ML method | Sample size | Sequencing data type | Performance | Validation method | Feature selection | Highlight/advantage | Shortcoming |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathios et al. [78] | LR model with a LASSO penalty | 799 | cfDNA fragment | AUC (0.98) | 10-fold cross-validation | cfDNA fragment features, clinical risk factors, and CT imaging features | This study provides a framework for combining cfDNA fragmentation profiles with other markers for lung cancer detection | DNA variations in late-stage disease may affect cfDNA detection |
| Lung-CLiP [79] | 5-nearest neighbor; 3-nearest neighbor; NB; LR; DT | 160 | cfDNA | AUC (0.69–0.98) | Leave-one-out cross-validation | SNV + CNV features | This study establishes an ML framework for the early detection of lung cancers using cfDNA | Sampling bias exists (most are smokers) in the training dataset |
| Liang et al. [80] | LR | 296 | ctDNA | AUC (0.816) | 10-fold cross-validation | Nine DNA methylation markers | This study establishes an ML framework for the early detection of lung cancers using DNA methylation markers | The selected features are comprised of only nine methylation biomarkers, which poses a limitation on assay performance |
| Raman et al. [81] | RF; SVM; LR with ridge, elastic net; LASSO regularization | 843 | cfDNA | mAUC (0.896–0.936) | Leave-one-out cross-validation | Copy number profiling of cfDNA | The model provides a framework for using copy number profiling of cfDNA as a biomarker in lung cancer detection | Feature selection methods can be used to reduce overfitting and may have the potential to achieve higher AUC |
| Kobayashi et al. [83] | Diet Networks with EIS | 954 | Somatic mutation | Accuracy (0.8) | 5-fold cross-validation | SNVs, insertions, and deletions across 1796 genes | The EIS helps to stabilize the training process of Diet Networks | The interpretable hidden interpretations obtained from EIS may vary between different datasets |
| Whitney et al. [86] | LR | 299 | RNA-seq of BECs | AUC (0.81) | 10-fold cross-validation | Lung cancer-associated and clinical covariate RNA markers | The model keeps sensitivity for small and peripheral suspected lesions | The selected genes vary greatly under different feature selection processes and parameters |
| Podolsky et al. [87] | KNN; NB normal distribution of attributes; NB distribution through histograms; SVM; C4.5 DT | 529 | RNA-seq | AUC (0.91) | Hold-out | RNA-seq | This study systematically compares different models of lung cancer subtype classification across different datasets | Feature selection methods can be used to reduce overfitting |
| Choi et al. [88] | An ensemble model based on elastic net LR; SVM; hierarchical LR | 2285 | RNA-seq of bronchial brushing samples | AUC (0.74) | 5-fold cross-validation | RNA-seq of 1232 genes with clinical covariates | The model integrates RNA-seq features and clinical information to improve the accuracy of risk prediction | Sample sizes in certain subgroups are small and may cause unbalanced training |
| Aliferis et al. [89] | Linear SVM; polynomial-kernel SVM; KNN; NN | 203 | RNA-seq | AUC (0.8783–0.9980) | 5-fold cross-validation | RNA-seq of selected genes using RFE and UAF | The study uses different gene selection algorithms to improve the classification accuracy | The selected genes vary greatly across different training cohorts |
| Aliferis et al. [91] | DT; KNN; linear SVM; polynomial-kernel SVM; RBF-kernel SVM; NN | 37 | CNV measured by CGH | Accuracy (0.892) | Leave-one-out cross-validation | Copy number of 80 selected genes based on linear SVM | The study systematically compares different models of lung cancer subtype classification | The sample size is small |
| Daemen et al. [92] | HMM; weighted LS-SVM | 89 | CNV measured by CGH | Accuracy (0.880–0.955) | 10-fold cross-validation | CNV measured by CGH | The use of recurrent HMMs for CNV detection provides high accuracy for cancer classification | Benchmarked comparisons are needed to demonstrate the superiority of using the HMM model |
| Jurmeister et al. [93] | NN; SVM; RF | 972 | DNA methylation | Accuracy (0.878–0.964) | 5-fold cross-validation | Top 2000 variable CpG sites | The study provides a framework for using DNA methylation data to predict tumor metastases | The model cannot accurately predict samples with low tumor cellularity through methylation data |

Note: LASSO, least absolute shrinkage and selection operator; cfDNA, cell-free DNA; NB, naive Bayes; DT, decision tree; SNV, single-nucleotide variant; CNV, copy number variation; ctDNA, circulating tumor DNA; mAUC, mean area under the curve; EIS, element-wise input scaling; BEC, bronchial epithelial cell; KNN, K-nearest neighbors; NN, neural network; RFE, recursive feature elimination; UAF, univariate association filtering; CGH, comparative genomic hybridization; HMM, hidden Markov model; LS-SVM, least squares support vector machines; RNA-seq, RNA sequencing. Compared with a single hold-out split, cross-validation is usually more robust because it evaluates the model on several partitions of the data and therefore accounts for more of the variance between possible splits into training, validation, and test data. However, cross-validation is more time-consuming than the simple hold-out method, since the model must be refit once per fold.
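
The trade-off between the validation methods listed in the table can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the synthetic one-dimensional data, the toy threshold classifier, and the helper names (`holdout_split`, `kfold_splits`, `fit_threshold`, `evaluate`) are all assumptions made for illustration. Hold-out fits one model on one split, k-fold cross-validation fits k models, and leave-one-out is treated here as the special case where k equals the number of samples.

```python
import random
import statistics

def holdout_split(n, test_fraction=0.2, seed=0):
    """Return one (train_idx, test_idx) pair from a single random split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def kfold_splits(n, k=5, seed=0):
    """Yield k (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def fit_threshold(xs, ys):
    """Toy classifier (illustrative only): cut halfway between the class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    """Fraction of samples whose predicted label (x > threshold) matches y."""
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

def evaluate(splits, xs, ys):
    """Fit on each training split, score on its test split, return all scores."""
    scores = []
    for train, test in splits:
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(t, [xs[i] for i in test], [ys[i] for i in test]))
    return scores

# Synthetic data (an assumption): two balanced classes with shifted means.
n = 100
rng = random.Random(42)
ys = [i % 2 for i in range(n)]
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]

holdout_scores = evaluate([holdout_split(n)], xs, ys)   # 1 model fit
cv_scores = evaluate(kfold_splits(n, k=10), xs, ys)     # 10 model fits
loo_scores = evaluate(kfold_splits(n, k=n), xs, ys)     # n model fits

print("hold-out accuracy:", holdout_scores[0])
print("10-fold mean accuracy:", statistics.mean(cv_scores),
      "spread across folds:", statistics.stdev(cv_scores))
print("leave-one-out accuracy:", statistics.mean(loo_scores))
```

The hold-out estimate depends on which samples happen to land in the single test set, whereas the cross-validated estimates average over many test sets at the cost of proportionally more model fits. For small cohorts such as the 37-sample study in the table, leave-one-out keeps almost all samples available for training in every fit.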