Table 1.
Statistical techniques for commonly encountered data imperfections.
| Data/Study Imperfection | Examples of appropriate techniques and software |
|---|---|
| Model misspecification errors; unknown shapes of exposure-response dependencies | Flexible nonparametric models (e.g., MARS, https://cran.r-project.org/web/packages/earth/earth.pdf) and deep learning; non-parametric model ensembles (e.g., random forest, https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) and superlearning (https://rdrr.io/cran/SuperLearner/f/vignettes/Guide-to-SuperLearner.Rmd) for model combination |
| Exposure estimation errors and errors in estimated or measured covariates (explanatory variables) | Errors-in-variables methods (e.g., the mmc package in R, https://cran.r-project.org/web/packages/mmc/mmc.pdf; see also https://www.jstatsoft.org/article/view/v048i02, https://cran.r-project.org/web/packages/GLSME/GLSME.pdf, https://arxiv.org/pdf/1510.07123.pdf) |
| Omitted variables; unobserved or unmeasured risk factors, confounders, and modifiers | Latent variable techniques and finite mixture distribution modeling methods (e.g., https://www.jstatsoft.org/article/view/v011i08; https://www.jstatsoft.org/article/view/v048i02; PROC CALIS in SAS) |
| Missing data values | Multiple imputation algorithms (e.g., MICE, https://cran.r-project.org/web/packages/mice/mice.pdf); data augmentation and EM (expectation-maximization) algorithms |
| Inter-individual heterogeneity and variability in causal exposure-response curves | Finite mixture distribution modeling, clustering, individual conditional expectation methods (e.g., https://cran.r-project.org/web/packages/ICEbox/ICEbox.pdf) |
| Correlated or interdependent explanatory variables | Probabilistic graphical methods, e.g., Bayesian networks (https://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf; https://cran.r-project.org/web/packages/CompareCausalNetworks/index.html) |
| Interactions among risk factors or other explanatory variables | Nonparametric detection, estimation, and visualization of interactions (https://rdrr.io/cran/npIntFactRep/; https://rdrr.io/cran/npregfast/) |
| Uncertain internal validity (soundness of causal inferences) | Quasi-experimental designs (or randomization and design of experiments where possible) to control for standard threats to internal validity, e.g., using PlanOut and PlanAlyzer software (https://hci.stanford.edu/publications/2014/planout/planout-www2014.pdf; https://dl.acm.org/doi/pdf/10.1145/3360608) |
| Uncertain external validity (generalizability of findings) | Multisite causal mediation analysis (https://cran.r-project.org/web/packages/MultisiteMediation/index.html); Bayesian evidence synthesis and hierarchical meta-analysis (https://cran.r-project.org/web/packages/jarbes/index.html) |
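
As an illustration of two techniques named in the table (finite mixture distribution modeling, fitted with an EM algorithm), the following minimal sketch estimates a two-component univariate Gaussian mixture. It is not taken from any of the packages cited above. The simulated data, the subgroup means (0 and 5), the 300/200 split, and the function names `normal_pdf` and `fit_two_component_mixture` are all assumptions made for the example; a real analysis would more likely use one of the listed packages.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(data, n_iter=200):
    """Fit a two-component Gaussian mixture by EM (illustrative sketch only)."""
    n = len(data)
    lo, hi = min(data), max(data)
    # Crude starting values: means at the quarter points of the data range,
    # both standard deviations equal to the overall standard deviation.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs
        # to each component, given the current parameter estimates.
        resp = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            if total == 0.0:
                resp.append((0.5, 0.5))  # guard against numerical underflow
            else:
                resp.append((p0 / total, p1 / total))

        # M-step: re-estimate weights, means, and standard deviations
        # as responsibility-weighted averages.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # keep the variance from collapsing

    return weights, means, sds


# Hypothetical data: two latent subgroups with different mean responses.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(5.0, 1.0) for _ in range(200)]

weights, means, sds = fit_two_component_mixture(data)
print("mixing weights:", weights)
print("component means:", means)
print("component standard deviations:", sds)
```

The fitted mixing weights estimate the share of each latent subgroup, and the per-observation responsibilities computed in the E-step could be used to assign individuals to subgroups. The sketch fixes the number of components at two; choosing that number from the data is a separate model-selection problem that it does not address.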