2022 Jul 12;24(7):e36490. doi: 10.2196/36490

Table 4.

Study analysis for journal publications on the prediction phase.

Reference | Objective, data set, and methodology | Performance and remarks
[131]
  • Objective: ALL^a relapse prediction

  • Data set: 336 newly diagnosed children with ALL

  • Methodology: Random forest algorithm

Performance:
  • Accuracy: 0.829

  • AUC^b: 0.902

Strengths:
  • Use of 4 ML^c algorithms and 104 features

  • Good model performance in all risk-level groups

  • Adoption of a special feature selection strategy: 100-fold Monte Carlo cross validation combined with 10-fold cross validation

Limitations:
  • Data set imbalance (relapsed and nonrelapsed children)

  • Strong predictors were excluded from the variable set

Validation:
  • 10-fold cross validation

[144]
  • Objective: Distinguishing patients with CML^d from patients without CML using complete blood count records

  • Data set: Complete blood count records of 1623 patients with a BCR-ABL1 test extracted from the US Veterans Health Administration

  • Methodology: XGBoost and LASSO

Performance:
  • AUC range: 0.87-0.96 at the time of diagnosis

Strengths:
  • Use of 2 models

  • Use of 2 feature selection methods

Limitations:
  • Imbalanced data set (patients are predominantly male)

  • Nonstandard data collection process

Validation:
  • Split sample validation (20% of the data for validation)

[1]
  • Objective: Leukemia detection based on biomedical data

  • Data set: 401 leukemia data points from Z H Sikder Medical College and Hospital

  • Methodology: Decision tree

Performance:
  • Accuracy: 100%

Strengths:
  • Use of 4 supervised ML algorithms

Limitations:
  • Overfitting

Validation:
  • 10-fold cross validation

[50]
  • Objective: Prediction of leukemia survivability

  • Data set: 131,615 records and 133 attributes for patients with leukemia from the SEER^e database

  • Methodology: Deep neural network model

Performance:
  • Accuracy: 74.85%

Strengths:
  • Use of a DNN^f ensemble method

Limitations:
  • Many problems in the leukemia data set (redundant attributes, missing values, and unknown values)

Validation:
  • 10-fold cross validation

  • Ensemble method

[96]
  • Objective: Predictive identification of patients at risk during treatment

  • Data set: 737 samples of patients diagnosed with CLL^g at Mayo Clinic

  • Methodology: Logistic regression, support vector machine, gradient boosting machine, and random forest

Performance:
  • ROC^h-AUC: above 80%

Strengths:
  • Binary classification outperforms survival analytic methods

Limitations:
  • Lack of actionable information provided by the ML algorithms

Validation:
  • 100 runs of 5-fold cross validation

^a ALL: acute lymphoblastic leukemia.

^b AUC: area under the curve.

^c ML: machine learning.

^d CML: chronic myeloid leukemia.

^e SEER: Surveillance, Epidemiology, and End Results.

^f DNN: deep neural network.

^g CLL: chronic lymphocytic leukemia.

^h ROC: receiver operating characteristic.
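
Several studies in Table 4 report AUC values obtained under k-fold or repeated k-fold cross validation (for example, 100 runs of 5-fold cross validation in [96]). The following is a minimal sketch of those two ideas in plain Python, not code from any of the cited studies. The function names `auc` and `repeated_kfold` are hypothetical, the toy labels and scores are invented for illustration, and the splitter is not stratified, which may matter for imbalanced data sets such as the one in [131].

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: iterable of 0/1 class labels; scores: predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Fraction of positive/negative pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k, runs, seed=0):
    """Yield (train_idx, test_idx) pairs for `runs` reshuffled k-fold splits.

    Illustrative assumption: plain (unstratified) shuffling.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Deal shuffled indices into k folds of near-equal size.
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# 100 runs of 5-fold cross validation produce 500 train/test splits.
print(sum(1 for _ in repeated_kfold(n_samples=50, k=5, runs=100)))  # 500
```

In practice, a model would be fitted on each `train` index set and scored on the matching `test` set, and the per-split AUC values would then be averaged. Repeating the shuffle reduces the dependence of the estimate on any single partition of the data.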