Table 1. Data analysis tasks.
task | key characteristics and concepts | example analytical tools/methods | causal knowledge needed? | example question | example verbiage & interpretation |
---|---|---|---|---|---|
description | A quantitative overview of the data. The metrics of interest may range from simple descriptive statistics to complex visualization techniques. | mean ± s.d., box plots, proportions, unsupervised cluster analyses, time trends, generalized regressionᵃ | no | What is the central tendency and spread of T-cell count, a marker of immune function, in wild spotted hyenas in Kenya? | Summarize a feature based on the metric of interest. This includes, but is not limited to: prevalence of a phenomenon (%, proportions), central tendency and dispersion (mean ± s.d., median, minimum, maximum), natural correlations among groups of variables (clusters or latent constructs), and/or trends in a variable over time (e.g. average rate of increase per year). Ex: In wild spotted hyenas, the median T-cell count is 850 cells/mm² (range: 500 to 1,200 cells/mm²). |
prediction | Identification of a set of explanatory variables that maximizes the variation explained in a dependent variable, with no focus on the causal or temporal structure among the explanatory variables of interest. This task often involves use of automated procedures to maximize model fit and leverages the joint distribution of multiple variables. | tree-based techniques, recurrent neural networks, unsupervised machine learning algorithms, generalized regressionᵃ | some | What set of social and ecological factors explains the maximum variation in T-cell count in wild spotted hyenas in Kenya? | This task focuses on how well a set of X-variables (predictors) explains variance in (predicts) Y. Interpretation for individual X-variables focuses on assessments of model fit (e.g. AIC/BIC or adjusted R²) or predictive capacity (e.g. area under the receiver operating characteristic curve [AUC]) rather than a causal effect. Ex: In wild spotted hyenas, a model that includes prey density, average yearly rainfall, and social rank as predictors of T-cell count yields the best model fit (lowest AIC and BIC, and/or highest adjusted R²) in comparison to other combinations of predictors derived from available data. |
association | Assessment of the unadjusted relationship between two variables of interest. This relationship may be explored within strata of a few key other variables that may influence the association of interest and can inform future causal inference studies. | Pearson or Spearman correlation coefficients, estimates from unadjusted generalized regressionᵃ | some | How does social connectedness correlate with T-cell count in wild spotted hyenas in Kenya? | Use non-causal language to describe the crude relationship between X and Y. The estimate for X can be interpreted as an association, relation, or correlation, but not a causal effect. Ex: Higher social connectedness is associated with higher T-cell count in wild spotted hyenas. |
causal inference | Estimation of a causal (i.e. unbiased) effect of X on Y. This type of analysis requires knowledge of the causal and temporal relationship between X and Y, as well as of third variables (confounders, mediators, effect modifiers, colliders) that may influence this relationship, in order to control bias. | use of directed acyclic graphs to reflect the research question, followed by an appropriate analytical strategy, which can involve, but is not limited to, generalized regressionᵃ, inverse probability weighting, structural equation modelling, path analysis, Rubin causal inference, and G-methods | yes | Does social connectedness affect T-cell count in wild spotted hyenas in Kenya? | Causal interpretation of the relationship between X and Y via use of words such as ‘effect’ to describe the relationship between X and Y, or ‘affect’ or ‘cause’ to indicate whether and how X influences Y. Ex: There is a direct positive effect of social connectedness on T-cell count in wild spotted hyenas. |
ᵃ Includes generalized regression models (e.g. linear, Poisson, negative binomial, logistic) and generalized mixed models (e.g. linear mixed models, segmented mixed models, mixed models with splines).
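
As a rough illustration of the first and third rows of Table 1, the sketch below contrasts a description task (central tendency and spread of one variable) with an association task (a crude Pearson correlation between two variables). All values are made up for illustration and are not data from any study; the variable names and the helper functions `describe` and `pearson_r` are hypothetical.

```python
import statistics

# Hypothetical, made-up values for illustration only (not study data).
t_cell_count = [500, 650, 800, 850, 900, 1000, 1200]   # T-cell counts
social_connectedness = [1, 2, 3, 4, 4, 6, 7]           # arbitrary index


def describe(values):
    """Description task: central tendency and spread of one variable."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def pearson_r(x, y):
    """Association task: crude (unadjusted) Pearson correlation."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


summary = describe(t_cell_count)
r = pearson_r(social_connectedness, t_cell_count)

# Description: report the summary itself.
print(f"median {summary['median']} (range: {summary['min']} to {summary['max']})")
# Association: report with non-causal wording.
print(f"Pearson r = {r:.2f} (an association, not a causal effect)")
```

The same correlation coefficient would come out of a causal inference analysis only by coincidence. What separates the tasks is the causal knowledge and adjustment strategy brought to the question, not the arithmetic shown here.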