2020 May 19;187:109638. doi: 10.1016/j.envres.2020.109638

Table 1.

Statistical techniques for commonly encountered data imperfections.

| Data/study imperfection | Examples of appropriate techniques and software |
| --- | --- |
| Model misspecification errors; unknown shapes of exposure-response dependencies | Flexible nonparametric models (e.g., MARS, https://cran.r-project.org/web/packages/earth/earth.pdf) and deep learning; nonparametric model ensembles (e.g., random forest, https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) and superlearning (https://rdrr.io/cran/SuperLearner/f/vignettes/Guide-to-SuperLearner.Rmd) for model combination |
| Exposure estimation errors and errors in estimated or measured covariates (explanatory variables) | Errors-in-variables methods (e.g., the mmc package in R, https://cran.r-project.org/web/packages/mmc/mmc.pdf; see also https://www.jstatsoft.org/article/view/v048i02, https://cran.r-project.org/web/packages/GLSME/GLSME.pdf, https://arxiv.org/pdf/1510.07123.pdf) |
| Omitted variables; unobserved or unmeasured risk factors, confounders, and modifiers | Latent variable techniques and finite mixture distribution modeling methods (e.g., www.jstatsoft.org/article/view/v011i08; https://www.jstatsoft.org/article/view/v048i02; PROC CALIS in SAS) |
| Missing data values | Multiple imputation algorithms (e.g., MICE, https://cran.r-project.org/web/packages/mice/mice.pdf); data augmentation and EM (expectation-maximization) algorithms |
| Inter-individual heterogeneity and variability in causal exposure-response curves | Finite mixture distribution modeling, clustering, individual conditional expectation methods (e.g., https://cran.r-project.org/web/packages/ICEbox/ICEbox.pdf) |
| Correlated or interdependent explanatory variables | Probabilistic graphical methods, e.g., Bayesian networks (https://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf; https://cran.r-project.org/web/packages/CompareCausalNetworks/index.html) |
| Interactions among risk factors or other explanatory variables | Nonparametric detection, estimation, and visualization of interactions (https://rdrr.io/cran/npIntFactRep/; https://rdrr.io/cran/npregfast/) |
| Uncertain internal validity (soundness of causal inferences) | Quasi-experimental designs (or randomization and design of experiments where possible) to control for standard threats to internal validity, e.g., using PlanOut and PlanAlyzer software (https://hci.stanford.edu/publications/2014/planout/planout-www2014.pdf; https://dl.acm.org/doi/pdf/10.1145/3360608) |
| Uncertain external validity (generalizability of findings) | Multisite causal mediation analysis (https://cran.r-project.org/web/packages/MultisiteMediation/index.html); Bayesian evidence synthesis and hierarchical meta-analysis (https://cran.r-project.org/web/packages/jarbes/index.html) |
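
Table 1 lists finite mixture distribution modeling and EM (expectation-maximization) algorithms among the techniques for unobserved risk factors, missing data, and inter-individual heterogeneity. The following is a minimal sketch of how the two fit together, not the method of any package cited in the table: it fits a two-component univariate Gaussian mixture by EM using only the Python standard library. The function names (`normal_pdf`, `fit_two_component_mixture`, `simulate_two_subpopulations`), the simulated data, and the min/max initialization are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(data, n_iter=200):
    """Fit a two-component Gaussian mixture by EM.

    Returns (weight of component 1, (mean_0, mean_1), (sd_0, sd_1)).
    """
    n = len(data)
    # Crude initialization (an assumption of this sketch): means at the data
    # extremes, both standard deviations equal to the overall one.
    means = [min(data), max(data)]
    grand_mean = sum(data) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in data) / n)
    sds = [grand_sd, grand_sd]
    weight = 0.5  # mixing proportion of component 1

    for _ in range(n_iter):
        # E-step: posterior probability that each observation came from component 1.
        resp = []
        for x in data:
            p0 = (1.0 - weight) * normal_pdf(x, means[0], sds[0])
            p1 = weight * normal_pdf(x, means[1], sds[1])
            resp.append(p1 / (p0 + p1))

        # M-step: responsibility-weighted updates of proportion, means, and sds.
        n1 = sum(resp)
        n0 = n - n1
        weight = n1 / n
        means[0] = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
        means[1] = sum(r * x for r, x in zip(resp, data)) / n1
        var0 = sum((1.0 - r) * (x - means[0]) ** 2 for r, x in zip(resp, data)) / n0
        var1 = sum(r * (x - means[1]) ** 2 for r, x in zip(resp, data)) / n1
        # Floor the sds to avoid a component collapsing onto a single point.
        sds[0] = max(math.sqrt(var0), 1e-6)
        sds[1] = max(math.sqrt(var1), 1e-6)

    return weight, tuple(means), tuple(sds)


def simulate_two_subpopulations(n_low=300, n_high=200, seed=0):
    """Simulated responses from two latent subgroups (hypothetical data)."""
    rng = random.Random(seed)
    low = [rng.gauss(0.0, 1.0) for _ in range(n_low)]
    high = [rng.gauss(5.0, 1.0) for _ in range(n_high)]
    return low + high


if __name__ == "__main__":
    data = simulate_two_subpopulations()
    weight, means, sds = fit_two_component_mixture(data)
    print("estimated share of higher-mean component:", round(weight, 3))
    print("estimated component means:", [round(m, 2) for m in means])
    print("estimated component sds:", [round(s, 2) for s in sds])
```

The subgroup label plays the role of the unmeasured variable: the E-step estimates each individual's probability of belonging to each subgroup, and the M-step re-estimates the subgroup parameters given those probabilities. In practice a maintained implementation with multiple random starts and a convergence check would be preferable to this fixed-iteration sketch.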