Author manuscript; available in PMC: 2023 Apr 29.
Published in final edited form as: Circ Res. 2022 Apr 28;130(9):1423–1444. doi: 10.1161/CIRCRESAHA.121.319969

Table 1: Machine Learning.

Advantages, challenges and pitfalls with avoidance strategies.

Machine Learning (ML) advantages
Effective task-specific algorithms are available for a wide range of research applications
Allows for accurate prediction and detection of complex patterns/relationships in data
Permits agnostic and unbiased exploratory research, free from assumptions about the underlying data
Well-suited for high-dimensional “omics” datasets (where the number of input variables exceeds the number of observations)
Able to accommodate several types of input variables (continuous, categorical, imaging features, etc.)
Can simultaneously account for linear and non-linear relationships between variables
Models can autonomously improve by learning in real time from new data

ML challenges and pitfalls | Pitfall avoidance

Models trained on small datasets often have poor generalizability in other datasets | Collaborative data sharing and harmonization (particularly important in PAH, a rare disease)
No gold standard approaches exist for algorithm selection or hyperparameter tuning | Apply heuristic data-driven methods to objectively select algorithm and set hyperparameters
Algorithms can be oversensitive to noise (mislabeled data, confounding signal, assay technical artifact) | Careful attention to data collection, quality control, and pre-processing (normalization, standardization, missing value handling, batch adjustment)
“Black box” models (difficult to interpret) | Explain model decision processes (graphically); delineate which input variables drive model decisions (feature selection methods, variable importance measures)
Model decisions can unfairly disadvantage certain patient subgroups (“algorithmic bias”) | Select a cohort representative of wider disease population; include socioeconomic input variables
Inadequate model validation | Independent cohort validation is critical (resampling-based cross-validation on training dataset is not adequate); compare model vs. existing gold standard
Lack of transparency in model reporting | Full disclosure of methods; share model code and anonymized data at publication; adhere to emerging ML reporting guidelines (e.g., TRIPOD-ML)
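
The hyperparameter tuning, pre-processing, and validation rows of the table can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the source article: the data are synthetic, the classifier is a simple k-nearest-neighbour vote, and all function names, cohort sizes, and the `shift` parameter (a stand-in for a site or batch offset) are hypothetical. The sketch standardizes features using training-cohort statistics only, selects k by cross-validation on the training cohort, and then reports accuracy on a separate cohort that played no part in tuning.

```python
import random
import statistics


def make_cohort(n, seed, shift=0.0):
    """Synthetic cohort (illustrative only): two features and a binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 is offset by 1.5 on both features; `shift` mimics a site/batch offset.
        x = [rng.gauss(1.5 * label + shift, 1.0) for _ in range(2)]
        rows.append((x, label))
    return rows


def fit_scaler(rows):
    """Per-feature mean and SD, estimated from the rows passed in."""
    columns = list(zip(*(x for x, _ in rows)))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def apply_scaler(rows, scaler):
    """Standardize features with previously fitted means and SDs."""
    return [([(v - m) / s for v, (m, s) in zip(x, scaler)], y) for x, y in rows]


def knn_predict(train, x, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def cross_validate(rows, k, folds=5):
    """Mean held-out accuracy over `folds` interleaved folds of the training cohort."""
    scores = []
    for f in range(folds):
        held_out = rows[f::folds]
        kept = [r for i, r in enumerate(rows) if i % folds != f]
        scaler = fit_scaler(kept)  # fitted inside the fold so held-out rows do not leak in
        scores.append(accuracy(apply_scaler(kept, scaler), apply_scaler(held_out, scaler), k))
    return statistics.mean(scores)


train_cohort = make_cohort(200, seed=1)
external_cohort = make_cohort(100, seed=2, shift=0.3)  # independent cohort, never used for tuning

# Data-driven hyperparameter choice: pick k by cross-validation on the training cohort only.
candidate_k = [1, 5, 15]  # odd values avoid tied votes
cv_scores = {k: cross_validate(train_cohort, k) for k in candidate_k}
best_k = max(cv_scores, key=cv_scores.get)

# Final check: refit the scaler on the full training cohort and score the independent cohort.
scaler = fit_scaler(train_cohort)
external_accuracy = accuracy(
    apply_scaler(train_cohort, scaler), apply_scaler(external_cohort, scaler), best_k
)

print(f"cross-validated accuracy by k: {cv_scores}")
print(f"selected k = {best_k}; independent-cohort accuracy = {external_accuracy:.2f}")
```

In this sketch the cross-validated score serves only to choose k. The independent-cohort accuracy is the figure that speaks to generalizability, and it can differ from the cross-validated estimate when the external cohort is distributed differently, as simulated here with `shift`.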