Table 1: Machine Learning.

Advantages, challenges and pitfalls with avoidance strategies.

Machine Learning (ML) advantages
Effective task-specific algorithms are available for a wide range of research applications
Allows for accurate prediction and detection of complex patterns/relationships in data
Permits agnostic and unbiased exploratory research, freedom from assumptions about underlying data
Well-suited for high dimensional “omics” datasets (where number of input variables exceeds observations)
Able to accommodate several types of input variables (continuous, categorical, imaging features, etc.)
Can simultaneously account for linear and non-linear relationships between variables
Models can autonomously improve while learning in real-time from new data

ML challenges and pitfalls	Pitfall avoidance

Models trained on small datasets often have poor generalizability in other datasets	Collaborative data sharing and harmonization (particularly important in PAH, a rare disease)
No gold standard approaches exist for algorithm selection or hyperparameter tuning	Apply heuristic data-driven methods to objectively select algorithm and set hyperparameters
Algorithms can be oversensitive to noise (mislabeled data, confounding signal, assay technical artifact)	Careful attention to data collection, quality control, and pre-processing (normalization, standardization, missing value handling, batch adjustment)
“Black box” models (difficult to interpret)	Explain model decision processes (graphically); delineate which input variables drive model decisions (feature selection methods, variable importance measures)
Model decisions can unfairly disadvantage certain patient subgroups (“algorithmic bias”)	Select a cohort representative of wider disease population; include socioeconomic input variables
Inadequate model validation	Independent cohort validation is critical (resampling-based cross-validation on training dataset is not adequate); compare model vs. existing gold standard
Lack of transparency in model reporting	Full disclosure of methods; share model code and anonymized data at publication; adhere to emerging ML reporting guidelines (e.g., TRIPOD-ML)