Table 2.
Preprocessing step | Substep or rationale/use-case | Techniques |
---|---|---|
Exploratory data analysis (EDA) / descriptive statistics | Provides an overview of patterns and potential imbalance in the data. For stratified analyses (e.g., by sex) or subset analyses (e.g., those completing follow-up), comparison with a table of characteristics and statistical tests for differences are useful | ▪ Plots (ordinal/continuous data: histograms; nominal data: bar plots) ▪ Tables (suitable for nearly any type of data; informative, but preferably combined with visual presentations for ease of interpretation) |
Outlier/invalid data management | Outliers may be naturally occurring, but may also be due to measurement error, erroneous data transfer, etc. Removing outliers/invalid data may be sensible (to improve representativeness and decrease the impact of unusual measurements), depending on the data and the model/task at hand; exploratory data analysis is needed to evaluate this. Importantly, over-removing outliers may reduce generalizability and increase the risk of overfitting [109]. If a negative impact is suspected, sensitivity analyses can be performed to compare robustness/trends in the output. The ML models used also influence the need to remove outliers, as some are (more) robust to them. Invalid data management requires manual assessment with exploratory data analysis and domain expertise | ▪ Manual removal, e.g., of measurements below/above a certain multiple (commonly 1.5; 3 may be considered for extreme outliers) of the interquartile range [IQR] from the quartiles. If a Gaussian distribution is assumed, removing measurements outside the 95% interval (approximately mean ± 2 standard deviations) may also be sensible ▪ Data-driven approaches (e.g., local outlier factor [LOF] [110]) ▪ For unsupervised learning, DBSCAN, HDBSCAN, and trimming approaches can facilitate identifying and handling noise in the data ▪ Model diagnostics can identify outliers after fitting a model; sometimes this is needed to see that an observation is outlying with respect to the modelled general pattern |
Missingness management | The (likely) mechanism of missingness [111] must be evaluated to decide on a suitable approach for substituting missing values or excluding subjects with (certain degrees of) missingness. Exploratory data analysis is key in this evaluation. Multiple imputation is generally preferred, as it reduces the uncertainty of individual substitutions by producing several imputed datasets. The minimum number of imputed datasets is debated, with various rules of thumb [112]; there are also data-driven methods to decide the number of datasets needed for a specified precision of the standard error [112]. Selecting an imputation method requires attention to the data type (e.g., cross-sectional or longitudinal [113]) and the associations between variables [114]. In many cases, the analysis should be performed in each imputed dataset and the results pooled [115] | ▪ Complete case analysis (either as the main analysis or as a sensitivity analysis) ▪ Available case analysis (similar to the above, but dynamically selecting subjects with sufficient data for each individual analysis) [111] ▪ Imputation (e.g., multiple imputation) [113, 114, 116, 117] |
Feature/predictor/label selection | Some datasets, particularly high-dimensional ones, may contain irrelevant/non-informative (“noisy”) variables. While it is important to consider all variables, including, for example, variables that overlap to a high degree clinically may decrease model performance, cause overfitting, and complicate interpretation | ▪ Literature-/expert-based selection, based on prior literature or domain experience ▪ Data-driven methods [118]: - Recursive feature elimination (RFE) [78] - Sequential feature algorithms [119, 120] - LASSO (combines variable selection and regularization to improve prediction accuracy and model interpretability [121]) - Tree-based approaches (e.g., decision trees, random forests) [118], and wrappers around such approaches, e.g., the Boruta algorithm |
Data splitting | To reduce overfitting, data typically needs to be split into a training set (for learning patterns) and a validation set (for performance assessment, to guide hyperparameter fine-tuning). Ideally, a third test set, agnostic of the preceding steps, is also kept to assess model performance on unseen data. In supervised learning, results from several approaches can be compared on a data subset not used for tuning; in unsupervised learning, performance measurement is less obvious | ▪ Cross-validation (k-fold validation, leave-one-out [LOO], etc.) [122] ▪ Bootstrapping ▪ Stability exploration in unsupervised learning |
Feature engineering | Feature scaling: useful for continuous data with varying scales between features (e.g., height and annual income, the latter often on a scale many times larger). Models relying on distance metrics (e.g., k-nearest neighbors) [123] perform better after feature scaling, as the sheer magnitude of certain variables otherwise over-influences the algorithm. Feature scaling should be performed after data splitting, as the sets otherwise introduce information to each other | ▪ Normalization (rescales data to [typically] the range 0–1; sensitive to outliers, useful when the scale of variables is of higher importance, e.g., neural networks) [124] ▪ Standardization (rescales data to mean 0 and standard deviation 1; more robust to outliers and generally recommended for linear models, e.g., support vector machines) [124] |
Feature engineering | Discretizing/continuizing variables: discretizing converts continuous data into categorical bins, simplifying modeling, handling non-linear relationships, and improving interpretability by mitigating the effects of outliers; however, it causes information loss, which is often a reason to avoid it. In contrast, continuizing converts categorical or ordinal data into continuous formats, capturing ordinal relationships and enhancing performance for algorithms that prefer numerical inputs, although this may not always be feasible | ▪ Entropy-MDL discretization ▪ Equal-frequency discretization ▪ Equal-width discretization ▪ One-hot encoding ▪ Frequency or mean encoding |
Feature extraction | In addition to feature selection, dimensionality can also be reduced by creating new features that aggregate information from the original features. Many algorithms perform suboptimally with high-dimensional data (“curse of dimensionality”) [125]; in such cases, dimensionality reduction can be applied to reduce runtime and improve performance | ▪ Linear techniques (e.g., principal component analysis [PCA; continuous data], multiple correspondence analysis [MCA; categorical data], multiple factor analysis [MFA] or factor analysis of mixed data [FAMD; mixed data]; perform feature scaling prior to this step [126]) ▪ Non-linear techniques (e.g., autoencoders, t-SNE, UMAP) ▪ Subject-matter knowledge to define informative indexes summarizing several features |
Class imbalance management | In supervised learning, there may be a substantial imbalance of labels among the subjects. This imbalance can reduce the performance of the model | ▪ Oversampling of minority-class subjects (e.g., the adaptive synthetic sampling approach [ADASYN] [127]) and/or undersampling of majority-class subjects (e.g., by random exclusion) |
The table describes common steps taken during preprocessing. The list is intended as a guide, with the steps presented in the order in which they should be performed, but it is not meant to be comprehensive or universally applicable. It is recommended, particularly for unique applications, to evaluate previous similar implementations or relevant technical literature. Brief, illustrative code sketches for several of the listed techniques follow below.
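For the exploratory data analysis row, a minimal sketch of descriptive tables and plots using pandas and matplotlib; the cohort, column names (`age`, `crp`, `sex`), and distributions are synthetic stand-ins for illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in cohort; column names are illustrative, not from any real study
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(60, 10, 500),
    "crp": rng.lognormal(1.5, 0.8, 500),
    "sex": rng.choice(["female", "male"], 500),
})

print(df.describe(include="all"))           # tabular overview of all variables
df["age"].plot(kind="hist", bins=30)        # histogram for a continuous variable
plt.show()
df["sex"].value_counts().plot(kind="bar")   # bar plot for a nominal variable
plt.show()
```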
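For the outlier management row, a sketch combining the manual 1.5 × IQR rule with the data-driven local outlier factor (here via scikit-learn), again on synthetic data; the flags mark candidates for review rather than automatic removal.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(60, 10, 500),
                   "crp": rng.lognormal(1.5, 0.8, 500)})

# Manual rule: flag values beyond 1.5 x IQR from the quartiles (3 x IQR for extreme outliers)
q1, q3 = df["crp"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flag = (df["crp"] < q1 - 1.5 * iqr) | (df["crp"] > q3 + 1.5 * iqr)

# Data-driven alternative: local outlier factor over several continuous features
lof = LocalOutlierFactor(n_neighbors=20)
lof_flag = lof.fit_predict(df[["age", "crp"]]) == -1  # -1 marks suspected outliers

print(f"IQR flags: {iqr_flag.sum()}, LOF flags: {lof_flag.sum()}")
```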
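For the missingness row, one way to generate multiple imputed datasets is scikit-learn's IterativeImputer with stochastic draws; this only sketches the mechanics, and the choice of m, the imputation model, and the pooling of downstream analyses should follow the cited guidance [112, 115].

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

# Synthetic feature matrix with roughly 10% of values set to missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# Multiple imputation: m stochastic imputations, each yielding a complete dataset;
# the analysis of interest is then run in each dataset and the results pooled
m = 5
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]
```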
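For the feature selection row, a sketch of recursive feature elimination and an L1-penalized (LASSO-type) logistic regression on synthetic classification data; the number of retained features and the penalty strength are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

# Recursive feature elimination wrapped around a simple linear model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("RFE-selected features:", np.where(rfe.support_)[0])

# LASSO-type selection: the L1 penalty shrinks uninformative coefficients exactly to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("L1-retained features:", np.where(l1_model.coef_.ravel() != 0)[0])
```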
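For the data splitting row, a sketch of a held-out test set plus k-fold cross-validation within the training data; the 80/20 split and k = 5 are common but not universal choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Hold out a test set that stays untouched during model development
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# k-fold cross-validation within the training data for tuning/model selection
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("Mean 5-fold CV accuracy:", scores.mean())
```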
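For the feature scaling row, a sketch of the key point that the scaler is fitted on the training set only and then applied to the held-out data, so that the sets do not introduce information to each other.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardization: fit on the training set only (mean 0, SD 1), then apply to both sets
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Normalization to the 0-1 range would instead use MinMaxScaler().fit(X_train)
```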
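For the discretizing/continuizing row, a sketch of equal-frequency binning of a continuous variable and one-hot encoding of a nominal one; the variables and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

df = pd.DataFrame({
    "age": [34, 51, 62, 45, 70, 58],
    "smoking": ["never", "current", "former", "never", "former", "current"],
})

# Discretization: equal-frequency bins ("uniform" would give equal-width bins instead)
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(df[["age"]])

# One-hot encoding of a nominal variable into indicator columns
smoking_dummies = pd.get_dummies(df["smoking"], prefix="smoking")
```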
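For the feature extraction row, a sketch of principal component analysis with feature scaling performed first, as the table recommends; the number of components kept is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # synthetic moderately high-dimensional data

# Scale first, then project onto a small number of principal components
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=5))
X_reduced = pca_pipeline.fit_transform(X)
print(X_reduced.shape)  # (200, 5)
```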
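For the class imbalance row, a sketch of ADASYN oversampling and random undersampling, assuming the third-party imbalanced-learn package (imported as `imblearn`) is installed; resampling is applied to the training data only, never to the test set.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler

# Synthetic labels with roughly a 9:1 imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Oversample the minority class with ADASYN
X_over, y_over = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_over))

# Random undersampling of the majority class is the mirror-image alternative
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
```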