2022 Jul 12;24(7):e36490. doi: 10.2196/36490

Table 4.

Study analysis for journal publications on the prediction phase.

Reference	Objective, data set, and methodology	Performance and remarks
[131]	Objective: ALL^a relapse prediction. Data set: 336 newly diagnosed children with ALL. Methodology: Random forest algorithm.	Performance: Accuracy 0.829; AUC^b 0.902. Strengths: Usage of 4 ML^c algorithms and 104 features; good model performance in all risk-level groups; adoption of a special feature selection strategy (100-fold Monte Carlo cross validation combined with 10-fold cross validation). Limitations: Data set imbalance (relapsed and nonrelapsed children); strong predictors were excluded from the variable set. Validation: 10-fold cross validation.
[144]	Objective: Prediction of patients with CML^d and non-CML using complete blood count records. Data set: Complete blood count records of 1623 patients with a BCR-ABL1 test, extracted from the US Veterans Health Administration. Methodology: XGBoost and LASSO.	Performance: AUC range 0.87-0.96 at the time of diagnosis. Strengths: Use of 2 models; use of 2 feature selection methods. Limitations: Imbalanced data set (predominant gender is male); nonstandard data collection process. Validation: Split sample validation (20% of the data for validation).
[1]	Objective: Leukemia detection based on biomedical data. Data set: 401 leukemia data points from Z H Sikder Medical College and Hospital. Methodology: Decision tree.	Performance: Accuracy 100%. Strengths: Use of 4 supervised ML algorithms. Limitations: Overfitting. Validation: 10-fold cross validation.
[50]	Objective: Prediction of leukemia survivability. Data set: 131,615 records and 133 attributes for patients with leukemia from the SEER^e database. Methodology: Deep neural network model.	Performance: Accuracy 74.85%. Strengths: Use of a DNN^f ensemble method. Limitations: Many problems in the leukemia data set (redundant attributes, missing values, and unknown values). Validation: 10-fold cross validation; ensemble method.
[96]	Objective: Predictive identification of patients at risk during treatment. Data set: 737 samples of patients diagnosed with CLL^g at Mayo Clinic. Methodology: Logistic regression, support vector machine, gradient boosting machine, and random forest.	Performance: ROC^h-AUC above 80%. Strengths: Binary classification outperforms survival analytic methods. Limitations: Lack of actionable information provided by the ML algorithms. Validation: 100 runs of 5-fold cross validation.

^aALL: acute lymphoblastic leukemia.

^bAUC: area under the curve.

^cML: machine learning.

^dCML: chronic myeloid leukemia.

^eSEER: Surveillance, Epidemiology, and End Results.

^fDNN: deep neural network.

^gCLL: chronic lymphocytic leukemia.

^hROC: receiver operating characteristic.
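
Several studies in Table 4 report AUC under repeated or k-fold cross validation (for example, 100 runs of 5-fold cross validation in [96]). The following is a minimal sketch of that style of evaluation in plain Python, not a reconstruction of any study's pipeline. The function names, the stratified fold assignment, and the toy scorer `fit_oriented_feature` are illustrative assumptions; the table does not say whether the studies stratified their folds.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A positive scored above a negative counts 1; a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_k_fold(labels, k, rng):
    """Yield (train, test) index lists; each class is spread over the k folds."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def repeated_cv_auc(features, labels, fit, k=5, runs=100, seed=0):
    """Mean test-fold AUC over `runs` repetitions of stratified k-fold CV."""
    rng = random.Random(seed)
    fold_aucs = []
    for _ in range(runs):
        for train, test in stratified_k_fold(labels, k, rng):
            model = fit([features[i] for i in train], [labels[i] for i in train])
            scores = [model(features[i]) for i in test]
            fold_aucs.append(auc([labels[i] for i in test], scores))
    return sum(fold_aucs) / len(fold_aucs)


def fit_oriented_feature(xs, ys):
    """Hypothetical stand-in for a real classifier: score a single numeric
    feature, flipped if needed so the positive class tends to score higher."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    sign = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda x: sign * x


# Synthetic, perfectly separable toy data (not taken from any cited study).
toy_labels = [0] * 10 + [1] * 10
toy_features = list(range(20))
toy_mean_auc = repeated_cv_auc(toy_features, toy_labels, fit_oriented_feature,
                               k=5, runs=3)
```

Because each model is fitted on the training folds only and scored on the held-out fold, the averaged AUC estimates out-of-sample discrimination. With imbalanced data sets such as those noted for [131] and [144], a non-stratified split can leave a fold with a single class, in which case AUC is undefined; that is the reason this sketch stratifies.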