Machine Learning (ML) advantages
|
Effective task-specific algorithms are available for a wide range of research applications |
Allows for accurate prediction and detection of complex patterns/relationships in data |
Permits agnostic and unbiased exploratory research, freedom from assumptions about underlying data |
Well-suited for high dimensional “omics” datasets (where number of input variables exceeds observations) |
Able to accommodate several types of input variables (continuous, categorical, imaging features, etc.) |
Can simultaneously account for linear and non-linear relationships between variables |
Models can autonomously improve while learning in real-time from new data |
|
ML challenges and pitfalls
|
Pitfall avoidance
|
|
Models trained on small datasets often have poor generalizability in other datasets |
Collaborative data sharing and harmonization (particularly important in PAH, a rare disease) |
No gold standard approaches exist for algorithm selection or hyperparameter tuning |
Apply heuristic data-driven methods to objectively select algorithm and set hyperparameters |
Algorithms can be oversensitive to noise (mislabeled data, confounding signal, assay technical artifact) |
Careful attention to data collection, quality control, and pre-processing (normalization, standardization, missing value handling, batch adjustment) |
“Black box” models (difficult to interpret) |
Explain model decision processes (graphically); delineate which input variables drive model decisions (feature selection methods, variable importance measures) |
Model decisions can unfairly disadvantage certain patient subgroups (“algorithmic bias”) |
Select a cohort representative of wider disease population; include socioeconomic input variables |
Inadequate model validation |
Independent cohort validation is critical (resampling-based cross-validation on training dataset is not adequate); compare model vs. existing gold standard |
Lack of transparency in model reporting |
Full disclosure of methods; share model code and anonymized data at publication; adhere to emerging ML reporting guidelines (e.g., TRIPOD-ML) |