Causes of Outcome Learning: a causal inference-inspired machine learning approach to disentangling common combinations of potential causes of a health outcome

Andreas Rieckmann; Piotr Dworzynski; Leila Arras; Sebastian Lapuschkin; Wojciech Samek; Onyebuchi Aniweta Arah; Naja Hulvej Rod; Claus Thorn Ekstrøm

2022 May 8;51(5):1622–1636. doi: 10.1093/ije/dyac078

PMCID: PMC9799206 PMID: 35526156

Abstract

Nearly all diseases are caused by different combinations of exposures. Yet, most epidemiological studies focus on estimating the effect of a single exposure on a health outcome. We present the Causes of Outcome Learning approach (CoOL), which seeks to discover combinations of exposures that lead to an increased risk of a specific outcome in parts of the population. The approach allows for exposures acting alone and in synergy with others. The road map of CoOL involves (i) a pre-computational phase used to define a causal model; (ii) a computational phase with three steps, namely (a) fitting a non-negative model on an additive scale, (b) decomposing risk contributions and (c) clustering individuals based on the risk contributions into subgroups; and (iii) a post-computational phase on hypothesis development, validation and triangulation using new data before eventually updating the causal model. The computational phase uses a tailored neural network for the non-negative model on an additive scale and layer-wise relevance propagation for the risk decomposition through this model. We demonstrate the approach on simulated and real-life data using the R package ‘CoOL’. The presentation focuses on binary exposures and outcomes but can also be extended to other measurement types. This approach encourages and enables researchers to identify combinations of exposures as potential causes of the health outcome of interest. Expanding our ability to discover complex causes could eventually result in more effective, targeted and informed interventions prioritized for their public health impact.

Keywords: Causes of effects, sufficient component cause model, inductive–deductive, machine learning, neural networks, explanations, precision public health, complex epidemiology, interactions, supervised clustering

Key Messages.

Most diseases are caused by a combination of multiple exposures but most epidemiological studies focus on one single exposure and one single health outcome.
Using causal inference and machine learning, the Causes of Outcome Learning approach addresses explorative questions such as ‘Given a particular health outcome, what are the most common combinations of exposures, which might have been its causes?’.
Using simulated data and real-life data, we demonstrate the usefulness of the approach.
A tutorial is included in the Supplementary material of this paper (available as Supplementary data at IJE online) and the R package ‘CoOL’ is available to assist researchers with the computational phase.

Introduction

Most diseases are multifactorial, and exposures may act together to produce combined effects that exceed the sum of their individual effects on an additive scale, which is called synergism.¹⁻³ A classic example is how the combined effect of smoking and asbestos on lung cancer exceeds the sum of their individual effects.⁴ The most established theoretical framework for understanding synergism in epidemiology is the sufficient cause model. This model uses causal pie illustrations of component causes to indicate that when all components of one cause are present, it is sufficient to cause disease.⁵ Assessing synergisms may lead to improved public health in two ways: (i) better disease prevention and treatment through insight into the causation of a disease (aetiology) and (ii) quantification of the disease burden in high-risk subgroups who may benefit from risk-mitigating interventions. For decades, these points have been recognized as important for effective preventive strategies. Rose, for example, said that ‘risk assessment must consider all relevant factors together rather than confine attention to a single test, for nearly all diseases are multifactorial’ when discussing effective policy decisions.⁶

Few epidemiological studies try to identify larger combinations of causes for specific outcomes despite the policy relevance. We suspect that the apparent lack of epidemiological studies into causes of outcomes has several reasons: (i) frequently taught frameworks for epidemiologists warn against type 1 errors (false-positive findings) from multiple testing,⁷ (ii) various confounding structures for each exposure complicate causal interpretation,⁸ (iii) the overwhelming number of combinations among exposures challenges the model fitting,⁹ (iv) insufficient statistical power in small data samples hides true phenomena and (v) theoretically founded approaches are lacking.⁹,¹⁰ Frameworks for identifying component causes exist, though they are not commonly applied in epidemiology. These frameworks select on either outcome¹¹ or exposure.¹² Unfortunately, they can only consider a few exposures at a time and they do not allow for the estimation of risk, which is often of public health interest. In the social sciences, configurational comparative methods deal with sufficient causes (referencing earlier work¹³). The most famous of these methods is qualitative comparative analysis,¹⁴,¹⁵ which has been applied in the public health domain.¹⁶ Qualitative comparative analysis works by analysing all combinations of exposures and uses a top-down search for exposure combinations that fulfil some chosen criteria, such as a risk threshold.¹⁵ Using pre-defined risk thresholds has advantages, such as transparent protocols, and disadvantages, such as being threshold-sensitive and confined to unadjusted tabular data.

Moreover, assessing exposure synergisms through standard approaches based on calculating all possible combinations of exposures is rarely feasible in practice. First, such analysis would be based on a large number of parameters requiring large sample sizes and posing computational challenges. Second, the numerous parameters returned from regression are not interpretable and potentially misleading as demonstrated in Supplementary Comparison 1 (available as Supplementary data at IJE online).

We introduce a causal inference-inspired machine learning approach called the Causes of Outcome Learning (CoOL) approach. CoOL is aimed at generating insights regarding questions like ‘Given a particular health outcome, what are the most common combinations of exposures, which might have been its causes?’. It utilizes the flexibility of a tailored machine learning model and an explanation technique to discover meaningful combinations of exposures while avoiding certain causal biases. Examples of questions to ask using CoOL could be ‘What are the most common combinations of environmental and household exposures measured before 6 weeks of age causing child mortality in Guinea-Bissau between 6 weeks and 3 years of age?’ or ‘What are the most common combinations of stressful events in childhood causing a high disease burden in early adult life in Denmark?’. To answer these questions well, we must explore many combinations of exposures. Targeting subgroups for interventions aimed at these combinations of exposures may provide a large public health impact. We present the approach assisted by a simple simulated example solely for pedagogical purposes, but CoOL also works on complex scenarios with higher-order interacting synergy. A step-by-step tutorial and six simulations of various complexity are included in Supplementary Simulations 1–6 (available as Supplementary data at IJE online). Three robustness checks are found in Supplementary Simulations 7–9 (available as Supplementary data at IJE online). A six-page real-life application using cohort data from the Centers for Disease Control and Prevention is available in the Supplementary real-life analysis (available as Supplementary data at IJE online). A glossary can be found in Supplementary Table 1 (available as Supplementary data at IJE online).

Simulated example

We generate a healthy study population of 10 000 individuals (half men and half women); 20% are exposed to Drug A and 20% are exposed to Drug B. Sex, Drug A and Drug B are independent. In this scenario, all individuals have a baseline risk of developing any atopic disease of 5% throughout a 10-year follow-up period; men who are exposed to Drug A have a risk of developing atopy that is 15 percentage points higher, and so do women who are exposed to Drug B. The simulated example uses two two-way interactions as a pedagogical example, but CoOL can identify any higher-order interacting synergy if it exists in the data and the data set is large enough.
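The data-generating mechanism described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code used for the paper: the function names `true_risk` and `simulate_population`, the seed and the use of independent Bernoulli draws for sex (rather than an exact 50/50 split) are assumptions of the sketch.

```python
import random


def true_risk(sex, drug_a, drug_b):
    """Risk on an additive scale: 5% baseline plus 15 percentage points
    for men (sex == 0) on Drug A and for women (sex == 1) on Drug B."""
    risk = 0.05
    if sex == 0 and drug_a == 1:
        risk += 0.15
    if sex == 1 and drug_b == 1:
        risk += 0.15
    return risk


def simulate_population(n=10_000, seed=1):
    """Draw sex, Drug A and Drug B independently, then the binary outcome."""
    rng = random.Random(seed)  # seed is arbitrary (assumption)
    rows = []
    for _ in range(n):
        sex = 1 if rng.random() < 0.5 else 0     # 0 = man, 1 = woman
        drug_a = 1 if rng.random() < 0.2 else 0  # 20% exposed to Drug A
        drug_b = 1 if rng.random() < 0.2 else 0  # 20% exposed to Drug B
        y = 1 if rng.random() < true_risk(sex, drug_a, drug_b) else 0
        rows.append({"sex": sex, "drug_a": drug_a, "drug_b": drug_b, "y": y})
    return rows
```

Under this mechanism, the two high-risk subgroups each make up about 10% of the population (50% of one sex times 20% exposed to the relevant drug).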

The CoOL approach

CoOL is enabled by recent advances in understanding why machine learning models produce the results they do [explainable artificial intelligence such as layer-wise relevance propagation (LRP)¹⁷⁻¹⁹] and by the science of causal structures for causal inference.²⁰ CoOL is a three-phase inductive–deductive scientific process (Figure 1). The bulk of our method’s contribution is related to the second phase. The goal of CoOL is to generate hypotheses for further testing. The road map for applying CoOL is as follows:

Figure 1. The phases of CoOL towards inference to the best explanation

(a) Pre-computational phase: scoping the research question and causal structure assumptions. (b) Computational phase: (i) a non-negative model as close as possible to the assumed causal model is fitted, (ii) risk contributions are decomposed and (iii) individuals are clustered into subgroups. (iv) Manual validation of the results is suggested in an internal validation data set to assess the stability of the results. (c) Post-computational phase: the results are held against existing evidence in order to develop new hypotheses that can be tested in new studies. New understandings will update our initial assumed causal model.

The pre-computational phase: Propose a causal model using a directed acyclic graph (DAG) of the exposures and the outcome based on prior domain expertise of selected actionable exposures and contextual factors. This phase aids the identification of exposure variables to include in the analysis.
The computational phase: The goal of this phase is to identify subgroups of the population who have certain combinations of exposures that together were found to elevate their risk for the health outcome [we provide the R package ‘CoOL’ (Supplementary Information 1, available as Supplementary data at IJE online)]:
1. Training data:
  1. Fit a non-negative model on an additive scale based on the features from the assumed causal model. We suggest a tailored neural network that can capture synergistic effects using activation functions, which allow combinations of interacting covariates to predict higher risks.
  2. Decompose the risk contributions.
  3. Cluster individuals based on the risk contributions.
2. Internal validation data:
  1. Ensure the robustness of the findings in an internal validation data set.
The post-computational phase: Based on learnings from the computational phase and existing knowledge, develop hypotheses to be assessed in further (intervention) studies on new temporal or external validation data. The approach focuses on common high-risk subgroups and directs researchers towards potentially large public health impact. The aim of this phase is to suggest one or several sound hypotheses by combining the empirical findings from one’s own data with a critical assessment.

Inference from CoOL relies on how risk contributions cluster in subgroups of the population. These risk contributions rely on causal assumptions specified in the pre-computational phase, but they are not counter-factual estimates, i.e. they do not reflect what would have happened had the exposure been absent. The main challenges for a causal interpretation in CoOL, as in standard approaches, are first that the measured covariates may be insufficient to adjust for confounding, second that the total effects of exposures are diluted if mediators are included in the model and third that the effect of synergistically co-acting exposures is divided between the risk contributions estimated via CoOL. However, CoOL is designed to avoid biases in the following ways: (i) guiding the inclusion of relevant exposures through expertise-based knowledge in the pre-computational phase, (ii) using a relaxed monotonic model to prevent the introduction of collider bias⁹ when clustering risk contributions (Supplementary Comparison 2, available as Supplementary data at IJE online), (iii) adjusting for calendar effects to prevent spurious time-trend associations (Supplementary Method 1, available as Supplementary data at IJE online), (iv) re-weighting the study population if some individuals are censored during follow-up to prevent selection bias⁹ (Supplementary Method 2, available as Supplementary data at IJE online) and (v) designing the model set-up on an additive scale to allow us to identify synergisms, which a multiplicative model could not.

We use the following notation in the next sections: $X_{i}$ denotes the exposures, indexed by $i$, $Y$ denotes the outcome and $SC_{j}$ denotes the unknown sets of sufficient causes for the outcome, indexed by $j$ (inspired by the notation by VanderWeele and Robins²¹). $U_{SC_{i}}$ and $U$ denote different types of unmeasured (including unmeasurable and unknown) causes: $U_{SC_{i}}$ denotes the unmeasured component causes of $SC_{j}$, whereas $U$ denotes unmeasured causes of $Y$. $R^{b+}$ denotes a baseline risk assumed to affect all individuals. Activation functions are denoted as $S^{+}$. Connection parameters from the exposures to the activation functions are denoted as $\beta_{i,j}^{+}$. Intercepts are denoted as $\alpha_{j}^{-}$. The superscript ⁺ denotes restrictions to non-negative values (≥0, positive or zero) and the superscript ⁻ denotes restrictions to non-positive values (≤0, negative or zero).

Pre-computational phase

Causal structures are commonly depicted with DAGs,²² which allow a causal interpretation of associations given a set of causal assumptions: exchangeability, positivity, consistency, no measurement error and no model misspecification.⁹

The intuition of CoOL is to link exposures to unknown sufficient causes²¹ (with probabilistic effects, not deterministic) as illustrated in Figure 2a and c. The theoretical DAG in Figure 2c makes no assumptions about the existence of causal effects between exposures and outcomes, and the computational steps aim at reducing these causal effects towards the minimal sets of component causes. The assumed causal model assists in exposure selection: actionable exposures that we can intervene on, such as drug intake, and contextual factors, which describe subgroups in risk. It also helps to decide whether proximal non-actionable exposures should be excluded if they mediate effects of actionable exposures and thus mask their effects. Further, the assumed causal model is used for the interpretation of the results because only direct and joint effects are returned.⁸

Figure 2. Sufficient causes, causal model and non-negative neural network

The pictogram shows the relation between epidemiological theory, structural models and a non-negative neural network. The left column is a generic presentation and the right column shows the simulated example. (a) and (b) An illustration of sufficient causes. The example to the right shows that a certain disease occurs if men are exposed to Drug A and some unknown factors and if women are exposed to Drug B and some unknown factors. (c) and (d) An assumed causal model illustrated using a directed acyclic graph, where $X_{i}$ denotes the exposures, $U_{SC_{i}}$ denotes the unmeasured causes of the sufficient causes, $U$ denotes the unmeasured causes of $Y$ assumed to affect all individuals, $SC_{j}$ denotes hidden sufficient causes and $Y$ denotes the outcome. (e) and (f) A non-negative neural network resembling the assumed causal model. $X_{i}$ denotes exposures, $\beta_{i,j}^{+}$ denotes non-negative parameters, $S_{j}^{+}$ denotes hidden activation functions, $\alpha_{j}^{-}$ denotes non-positive intercepts acting as activation thresholds for activation functions and $R^{b+}$ denotes the baseline risk.

A common drawback of existing synergistic risk estimation models is their positive monotonicity assumption, i.e. exposures either have no effect or always act in the same direction on the outcome.¹,²³ The proposed non-negative model (next section) relaxes the monotonicity assumption by letting us explore all directions of the exposures' effects on the outcome simultaneously, whether these effects act independently or in synergy with others (e.g. if some exposures are especially harmful for men and other exposures are especially harmful for women). If we had applied a model with both positive and negative parameters, the risk contributions would also take both negative and positive values, which would be difficult to interpret. Further, since the risk contributions are conditioned on the outcome, clustering risk contributions from a model with positive and negative parameters could lead to collider stratification bias⁹ and thus result in spuriously inversely correlated risk contributions (Supplementary Comparison 2). Applying a non-negative model, and thereby imposing monotonicity (including the relaxed version), prevents collider bias in Step 3 of the computational phase, when the risk contributions are clustered.

In causal inference studies, covariates that cause confounding are included solely for adjustment.⁹ In CoOL, all potential causes of the outcome are of relevance (i.e. covariates are also considered as potential exposures of interest). Including them carefully allows quantification of individual and joint direct exposure effects, adjusted for situations in which one exposure confounds the effect of another (because it lies on the latter exposure's backdoor path to the outcome).²⁰ However, researchers need to consider issues with unmeasured confounding, selection or collider bias and measurement bias. In studies where data are gathered over a longer time span, calendar time may introduce spurious correlations if changes occur in exposure prevalence and in diagnostic criteria. The model can be adjusted for calendar time without attributing it a risk contribution (Supplementary Method 1, available as Supplementary data at IJE online). Also, selection bias may occur if at-risk individuals become systematically censored. To prevent selection bias due to censoring during follow-up, the model can be adjusted using inverse probability of censoring weights, assuming a correct model specification of the probability of not being censored during follow-up (Supplementary Method 2, available as Supplementary data at IJE online).

For our motivating example, we assume that sex, Drug A and Drug B do not share a common cause. Ideally, we want to identify the sufficient causes shown in Figure 2b; the DAG showing our scientific interest is drawn in Figure 2d. Had other information been available, we might or might not have included it, depending on the assumed causal structure for developing atopy.

Computational phase

The many potential combinations of exposures increase the risk of identifying spurious associations. To manually validate the findings before developing hypotheses, data are split into a training data set and an internal validation data set. We suggest fitting the model on the training data (with regularization to reduce overfitting to noise, which could produce ungeneralizable predictions) until it converges based on the error function. A training scheme using k-fold splits of the training data may be useful in very large data sets but needs further investigation.

Fitting a non-negative model

We suggest a non-negative, single-hidden-layer neural network on an additive scale (Figure 2e) as the mathematical model designed to mimic our assumed causal model (Figure 2c). This model resembles a linear regression model estimating risk differences but with two key modifications. First, the model includes a series of latent interactions that can combine the effects of various exposures. The latent interactions are estimated using what is known in machine learning as activation functions, $S^{+}()$, represented in the hidden layer between the exposures and the outcome. Second, we restrict all connection parameters to have non-negative values (≥0, positive or zero)²⁴ so that exposures can only increase the occurrence of the outcome.¹ Further, each category of each variable is binary (one-hot) encoded into its own new variable, coded 1 if the category is present and 0 if not, so that the model meets a relaxed version of the monotonicity assumption. The disease outcome is coded 0 and 1. Each activation function returns its input value if that value is positive and zero otherwise, so its output is always non-negative (≥0, positive or zero). The intercepts can only take non-positive (≤0, negative or zero) values and act as activation thresholds that only allow combinations of exposures with a large $\beta_{i,j}^{+}$-weighted sum to pass $S^{+}()$. The baseline risk can only take non-negative (≥0, positive or zero) values. The non-negative and non-positive restrictions are made so that the risk contributions can be decomposed and clustered without suggesting spurious subgroups due to collider bias (Supplementary Comparison 2, available as Supplementary data at IJE online). If a person has no risk contribution from any exposure, the person is assumed to have a risk equal to the baseline risk. The connection parameters between the activation functions and the outcome have a fixed value of 1. The model estimates the risk on an additive scale so that synergisms are defined as combined effects that are larger than the sum of individual effects.⁵

This model can be formulated as below. It satisfies the assumption that the added risk is independent of the baseline risk, i.e. it is an ‘independent of background’ model according to Beyea and Greenland:²⁵

$$P(Y = 1 \mid X) = \sum_{j} S^{+}\left(\sum_{i} X_{i} \cdot \beta_{i,j}^{+} + \alpha_{j}^{-}\right) + R^{b+}$$
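To make the formula concrete, the following minimal Python sketch evaluates it for a single individual. It assumes that $S^{+}$ returns its input when positive and zero otherwise; the function names and the parameter values are hand-picked for illustration (they are not fitted values from the paper), and the inputs are simplified to four indicators (man, woman, Drug A, Drug B).

```python
def s_plus(z):
    """Activation S+: pass a positive input through, return zero otherwise."""
    return z if z > 0.0 else 0.0


def predict_risk(x, beta, alpha, baseline):
    """P(Y=1|X) = sum_j S+(sum_i x[i] * beta[i][j] + alpha[j]) + baseline."""
    hidden = [
        s_plus(sum(x[i] * beta[i][j] for i in range(len(x))) + alpha[j])
        for j in range(len(alpha))
    ]
    return sum(hidden) + baseline


# Hypothetical parameters: hidden unit 0 needs man AND Drug A to pass its
# threshold, hidden unit 1 needs woman AND Drug B.
BETA = [
    [0.15, 0.00],  # man
    [0.00, 0.15],  # woman
    [0.15, 0.00],  # drug_a
    [0.00, 0.15],  # drug_b
]
ALPHA = [-0.15, -0.15]  # non-positive intercepts (activation thresholds)
BASELINE = 0.05         # non-negative baseline risk

risk_man_drug_a = predict_risk([1, 0, 1, 0], BETA, ALPHA, BASELINE)
risk_man_only = predict_risk([1, 0, 0, 0], BETA, ALPHA, BASELINE)
risk_drug_a_only = predict_risk([0, 0, 1, 0], BETA, ALPHA, BASELINE)
```

With these values, neither being a man nor taking Drug A adds risk on its own, because each weighted input alone only just reaches the threshold set by the intercept. Together they add about 0.15 to the baseline, which is the additive-scale signature of synergism.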

Fitting the model is done using stochastic gradient descent on the training data set: in a step-wise procedure run on one individual at a time, the model estimates the individual’s risk of the disease outcome, $P(Y = 1 \mid X)$, calculates the squared prediction error $(Y - P(Y = 1 \mid X))^{2}$ and adjusts the model parameters to reduce this error.²⁶ By iterating through all individuals for multiple epochs, we obtain model parameters that minimize the sum of prediction errors across the entire population. The initial values, derivatives, learning rates and regularization parameter are described in Supplementary Information 2 (available as Supplementary data at IJE online).
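A single update of this kind might look as follows in plain Python. This is a sketch under stated assumptions, not the package's training routine: the sign restrictions are enforced here by clipping after each update, regularization is omitted, and the function name `sgd_step` and the learning rate are hypothetical. The actual initial values, derivatives and regularization are those given in Supplementary Information 2.

```python
def sgd_step(x, y, beta, alpha, baseline, lr=0.01):
    """One stochastic gradient step on (y - p)^2 for one individual.

    Returns new (beta, alpha, baseline) without modifying the inputs.
    Clipping to the allowed sign is an assumption of this sketch.
    """
    n_in, n_hidden = len(x), len(alpha)
    # Forward pass: pre-activation of each hidden unit and predicted risk.
    z = [sum(x[i] * beta[i][j] for i in range(n_in)) + alpha[j]
         for j in range(n_hidden)]
    p = sum(max(zj, 0.0) for zj in z) + baseline
    d_loss_d_p = -2.0 * (y - p)  # derivative of (y - p)^2 with respect to p

    new_beta = [row[:] for row in beta]
    new_alpha = alpha[:]
    for j in range(n_hidden):
        if z[j] > 0.0:  # the gradient only flows through active units
            for i in range(n_in):
                step = lr * d_loss_d_p * x[i]
                new_beta[i][j] = max(beta[i][j] - step, 0.0)   # keep >= 0
            new_alpha[j] = min(alpha[j] - lr * d_loss_d_p, 0.0)  # keep <= 0
    new_baseline = max(baseline - lr * d_loss_d_p, 0.0)          # keep >= 0
    return new_beta, new_alpha, new_baseline
```

Looping this step over all individuals in the training data, for several epochs, gives the overall fitting procedure described above.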

Our simulated example data are split into a training data set and an internal validation data set. Figure 2f presents the model for our motivating example. We binary-encode new variables for each possible category of each exposure, such that sex (coded 0 if man, 1 if woman) becomes two factors: man (coded 1 if man, 0 if not man) and woman (coded 1 if woman, 0 if not woman) and so forth for Drug A and Drug B. If, for example, we had strong expert knowledge that Drug B could only be harmful (and never beneficial), we could have used this causal information to limit the degrees of freedom in the model and decrease the chance of discovering false-positive findings. The training data set is used to fit the proposed non-negative model with 10 hidden activation functions. Figure 4a–c shows how the error decreases with each epoch and visualizes the neural network connections and the receiver operating characteristic curve. Although the predictive performance measured by the area under the receiver operating characteristic curve (AUC) provides a useful metric for evaluating model discriminatory performance across the entire population, a model with low AUC can still capture important sets of causes for particular subgroups.²⁷
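The binary encoding of each category into its own variable can be sketched as below. The helper `one_hot_encode` is hypothetical and written for this illustration only; the column naming (`sex_0`, `drug_a_1` and so on) follows the labels used for the working example later in the text.

```python
def one_hot_encode(rows, variables):
    """Expand each categorical variable into one 0/1 column per category."""
    levels = {v: sorted({row[v] for row in rows}) for v in variables}
    columns = [f"{v}_{level}" for v in variables for level in levels[v]]
    encoded = [
        [1 if row[v] == level else 0 for v in variables for level in levels[v]]
        for row in rows
    ]
    return columns, encoded


rows = [
    {"sex": 0, "drug_a": 1, "drug_b": 0},  # a man exposed to Drug A
    {"sex": 1, "drug_a": 0, "drug_b": 1},  # a woman exposed to Drug B
]
columns, encoded = one_hot_encode(rows, ["sex", "drug_a", "drug_b"])
# columns: sex_0, sex_1, drug_a_0, drug_a_1, drug_b_0, drug_b_1
```

Encoding both the presence and the absence of a drug as separate inputs is what lets a non-negative model represent either direction of effect, which is the relaxed monotonicity discussed earlier.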

Figure 4. Results of the computational phase of CoOL

The main results are combined in one plot. (a) Prediction performance measured by the mean squared error by epoch. (b) A visualization of the fitted non-negative neural network. The width of the line indicates the strength of each connection. (c) A plot of prediction performance as measured by the area under the receiver operating characteristic curve. (d) A dendrogram of the three subgroups. (e) The mean risk and prevalence by subgroups. (f) The table with the main results for the working example. ‘n’ is the total number of individuals in the subgroup, ‘e’ is the number of events/individuals with the outcome in the subgroup, ‘prev’ is the prevalence of the subgroup, ‘risk’ is the mean risk in the subgroup based on the model, ‘excess’ is the excess fraction, i.e. the proportion out of all cases that are more than expected (more than the baseline risk) in this subgroup (see Supplementary Information 4, available as Supplementary data at IJE online), ‘obs risk’ is the observed risk in this subgroup (95% CI is calculated using the Wald method⁷⁴) and ‘risk based on the sum of individual effects’ is the risk summed up where all other exposures are set to zero. For the three estimates presented at each variable by each subgroup, the first estimate is the mean risk contribution, the estimate in parentheses is the standard deviation and the estimate in brackets is the risk contribution had all other exposures been set to zero. The baseline risk is by definition the same for all groups.

Decomposing risk contributions

Machine learning models are commonly referred to as black boxes due to the limited interpretability of their parameters and the way they interact with the input variables.²⁸ Instead of attempting to interpret the model parameters directly, we use LRP¹⁷⁻¹⁹ to decompose the risk of the outcome into risk contributions for each individual (in particular, we use the LRP$_{\alpha=1, \beta=0}$ rule). LRP was introduced by Bach et al. in 2015¹⁷ as a decomposition technique for pre-trained neural networks and was later justified via Deep Taylor Decomposition.²⁹ As opposed to other explanation techniques for neural networks, LRP is aimed at conserving the information such that all relevance measures sum to the probability of the outcome. In CoOL, the predicted risk of the outcome, $P(Y = 1 \mid X)$, is decomposed into a baseline risk, $R^{b+}$, and the risk contributions by each exposure, $R_{i}^{X}$ (where $P(Y = 1 \mid X)$ can take values between 0 and 1):

$$R^{b+} + \sum_{i} R_{i}^{X} = P(Y = 1 \mid X)$$

These risk contributions may be interpreted as an expression of the exposures’ positive contribution to the risk given the model and the individual’s set of exposures. The estimation is designed to prevent spurious associations and to direct researchers in identifying combinations of exposures associated with elevated risk of a specific health outcome, but the risk contributions cannot directly be interpreted as the counter-factual effect of what would have happened had the exposure been absent. No risk contributions are decomposed to the intercepts, $\alpha_{j}^{-}$. The procedure below is conducted for all individuals in a one-by-one fashion. The baseline risk, $R^{b+}$, is represented by its own parameter (Figure 2e) and is therefore estimated as part of fitting the non-negative neural network. More precisely, the decomposition of the risk contributions for exposures, $R_{i}^{X}$, takes three steps:

Step 1: Subtract the baseline risk, $R^{b+}$:

$$R_{total}^{X} = P(Y = 1 \mid X) - R^{b+}$$

Step 2: Decompose risk contributions to the hidden activation functions, where $S_{j}$ is the value returned by each of the $j$ activation functions given the exposure distribution, $X_{i}$, parameters, $\beta_{i,j}^{+}$, and intercepts, $\alpha_{j}^{-}$:

$$R_{j}^{X} = \frac{S_{j}}{\sum_{j'} S_{j'}} R_{total}^{X}$$

Step 3: Decompose risk contributions from the hidden activation functions to the exposures:

$$R_{i}^{X} = \sum_{j} \left(\frac{X_{i} \cdot \beta_{i,j}^{+}}{\sum_{i'} \left(X_{i'} \cdot \beta_{i',j}^{+}\right)} R_{j}^{X}\right)$$

As a result of the risk decomposition, each individual is assigned a set of risk contributions, $R_{i}^{X}$, one for each exposure, plus a baseline risk, $R^{b+}$. The decomposition of risk contributions is illustrated in Figure 3e and f using the motivating example, with an explanation in the figure legend.
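The three steps can be traced for one individual with a short Python sketch. It assumes the activation returns its input when positive and zero otherwise, and it reuses hand-picked, hypothetical parameters (man, woman, Drug A, Drug B as inputs) rather than fitted ones; the function name `decompose_risk` is likewise an assumption of the sketch.

```python
def decompose_risk(x, beta, alpha, baseline):
    """Return (predicted risk, list of risk contributions R_i^X)."""
    n_in, n_hidden = len(x), len(alpha)
    # Forward pass: weighted inputs and activation value S_j per hidden unit.
    weighted = [[x[i] * beta[i][j] for i in range(n_in)] for j in range(n_hidden)]
    s = [max(sum(weighted[j]) + alpha[j], 0.0) for j in range(n_hidden)]
    risk = sum(s) + baseline

    # Step 1: subtract the baseline risk.
    r_total = risk - baseline
    contributions = [0.0] * n_in
    if r_total <= 0.0:
        return risk, contributions  # nothing above baseline to distribute

    for j in range(n_hidden):
        if s[j] <= 0.0:
            continue  # inactive units carry no relevance
        # Step 2: share of the excess risk assigned to hidden unit j.
        r_j = s[j] / sum(s) * r_total
        # Step 3: split r_j over exposures by their weighted input;
        # the intercept alpha[j] receives no contribution.
        denom = sum(weighted[j])
        for i in range(n_in):
            contributions[i] += weighted[j][i] / denom * r_j
    return risk, contributions


BETA = [[0.15, 0.0], [0.0, 0.15], [0.15, 0.0], [0.0, 0.15]]  # hypothetical
ALPHA = [-0.15, -0.15]
BASELINE = 0.05
risk, contrib = decompose_risk([1, 0, 1, 0], BETA, ALPHA, BASELINE)
```

For a man exposed to Drug A, this sketch splits the excess risk of about 0.15 equally between being a man and Drug A (about 0.075 each), and the baseline plus all contributions sums back to the predicted risk, as the conservation equation requires.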

Figure 3. Workflow of the computational phase of CoOL

The flowchart of how subgroups are identified as part of the computational phase of Causes of Outcome Learning. (a) The expanded data set of sex (one variable for man, one for woman), Drug A (one variable for Drug A, one for no Drug A) and Drug B (one variable for Drug B, one for no Drug B). (b) The fitted non-negative model is illustrated. Wide edges indicate large connection parameters. (c) and (d) The predicted risk, $P(Y = 1 \mid X)$. (e) The predicted risk is decomposed using LRP into risk contributions of the baseline, $R^{b+}$, and exposures, $R^{X}$. (f) The risk contribution matrix. (g) A dendrogram to help decide on the number of subgroups. (h) The risk contribution matrix clustered into subgroups. (i) Prevalence and mean risk by subgroup plot. This plot indicates areas for greater public health impact. (j) A table with the mean of risk contributions by subgroups. It can hold more information that can be useful when developing hypotheses, such as quantifications of the excess proportion of all cases found in a subgroup when considering the prevalence of the subgroup, the risk in the subgroup and the baseline risk.

Clustering of risk contributions

We suggest grouping the individuals into subgroups based on their risk contributions using Manhattan distances and Ward’s method.³⁰,³¹ A dendrogram may help decide the number of relevant subgroups (Figure 3g).³² Plotting the prevalence and mean risk of each subgroup can help researchers to identify the subgroups with the highest public health impact (Figure 3i).³³ A table of mean risk contributions and standard deviations by subgroups may illuminate which exposures are associated with elevated risk in each subgroup (Figure 3j). An indication of synergism is when the combined risk contribution of a set of exposures is higher than the sum of the stand-alone risk contributions of each of the exposures (Supplementary Information 3, available as Supplementary data at IJE online), although deviations may occur in noisy data sets. Final reporting of synergism should use the yet unseen internal validation data set before hypotheses are developed in the post-computational phase.³⁴
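As a rough illustration of this clustering step, the sketch below runs agglomerative clustering on a tiny risk contribution matrix, using Manhattan distances and the generic Lance–Williams update for Ward's method. It is a naive standard-library implementation for small inputs only; whether the 'CoOL' package applies the Ward update to raw or squared distances, or uses another implementation, is not stated here, so the details are assumptions. The toy contribution values are the hypothetical ones from a model with a 0.05 baseline and a 0.15 synergistic excess.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two contribution vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def ward_clusters(points, n_clusters):
    """Merge the closest clusters until n_clusters remain; return labels."""
    clusters = {i: [i] for i in range(len(points))}
    dist = {(i, j): manhattan(points[i], points[j])
            for i in range(len(points)) for j in range(i + 1, len(points))}
    next_id = len(points)
    while len(clusters) > n_clusters:
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            d_ak = dist[(min(a, k), max(a, k))]
            d_bk = dist[(min(b, k), max(b, k))]
            # Lance-Williams update for Ward's method.
            updated[k] = ((na + nk) * d_ak + (nb + nk) * d_bk - nk * d_ab) / (na + nb + nk)
        merged = clusters.pop(a) + clusters.pop(b)
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        clusters[next_id] = merged
        for k, d in updated.items():
            dist[(k, next_id)] = d  # existing ids are always below next_id
        next_id += 1
    labels = [None] * len(points)
    for label, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = label
    return labels


# Rows: risk contributions for [man, woman, drug_a, drug_b] (toy values).
contributions = [
    [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
    [0.075, 0.0, 0.075, 0.0], [0.075, 0.0, 0.075, 0.0],
    [0.0, 0.075, 0.0, 0.075], [0.0, 0.075, 0.0, 0.075],
]
labels = ward_clusters(contributions, n_clusters=3)
```

On this toy matrix the three recovered groups are the individuals with no excess risk, the men exposed to Drug A and the women exposed to Drug B. For real data sets a library implementation would be preferable to this cubic-time loop.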

Given that the combined risk contributions causally affect the outcome and meet the assumption of positive monotonicity, the excess fraction (also referred to as grouped partial attributable risks³⁵ or formally as the attributable proportion in the population²³) is the area within a subgroup above the baseline risk (Figure 3i) and can be defined for a subgroup Z as:

$$\frac{P(Y = 1) - P(Y_{X_{z} = \bar{X_{z}}} = 1)}{P(Y = 1)}$$

where $X_{z} = \bar{X_{z}}$ denotes eliminating risk contributors in subgroup Z and is calculated as (Supplementary Information 4, available as Supplementary data at IJE online):

$$\frac{P(X = x_{z}) \cdot \left(P(Y_{X_{z}} = 1) - R^{b+}\right)}{P(Y = 1)}$$

Yet, as the inductive–deductive process of CoOL aims to identify causes and test hypotheses, the excess fractions may be cautiously interpreted as the potential for a public health impact if the hypothesis is true.
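As a worked illustration of the excess fraction formula, the sketch below plugs in the design values of the simulated example (baseline risk 5%, two subgroups with 10% prevalence and 20% risk). These are the values used to generate the data, not the fitted estimates reported in Figure 4f, and the function name `excess_fraction` is hypothetical.

```python
def excess_fraction(prevalence, subgroup_risk, baseline, overall_risk):
    """P(X = x_z) * (P(Y_{X_z} = 1) - R^b) / P(Y = 1)."""
    return prevalence * (subgroup_risk - baseline) / overall_risk


baseline = 0.05
subgroups = {
    "men on Drug A": {"prevalence": 0.10, "risk": 0.20},
    "women on Drug B": {"prevalence": 0.10, "risk": 0.20},
    "everyone else": {"prevalence": 0.80, "risk": 0.05},
}

# Overall risk P(Y = 1) as the prevalence-weighted mean of subgroup risks.
overall_risk = sum(g["prevalence"] * g["risk"] for g in subgroups.values())

fractions = {
    name: excess_fraction(g["prevalence"], g["risk"], baseline, overall_risk)
    for name, g in subgroups.items()
}
```

Under these design values the overall risk is 0.08, so each high-risk subgroup accounts for 0.015 / 0.08 = 18.75% of all cases in excess of the baseline, and the remaining 62.5% of cases are attributed to the baseline risk.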

Analysing our motivating example, we apply the fitted non-negative model, decompose the risk contributions using LRP and show a dendrogram of how similar the populations are (Figure 4d), which suggests three groups. Figure 4e shows the risk and prevalence of the three subgroups, where one subgroup has a risk of 5%, a second subgroup has a risk of ∼20% with a prevalence of 10% and a third subgroup has a risk of ∼20% with a prevalence of 10%. Figure 4f shows us that we correctly identified that men (sex_0) who are exposed to Drug A (drug_a_1) have a 5% baseline risk, which reaches a near 20% risk through the contributions from being a man and Drug A. Similar are the findings for women (sex_1) and Drug B (drug_b_1). In general, we expect that the predicted risks are slightly underestimated due to regularization.

Post-computational phase

The results of the computational phase may provide learnings about different sets of exposures, which may have led to a higher risk of the outcome in specific subgroups. When exploring causes of outcomes, some findings may be spurious. Therefore, appropriately selecting exposures, using a well-defined study design, using regularization parameters for model fitting, critically selecting findings and ensuring replicability in internal validation data are all important before developing new hypotheses. This evidence for hypothesis development should be interpreted in light of the domain expertise formalized in the assumed causal model (the pre-computational phase). New hypotheses about multifactorial aetiology may be denoted in an updated DAG.²¹ In contrast to other machine learning approaches, CoOL allows us to identify subgroups through combinations of risk contributions that are easily communicated in words.

New learnings may be formulated as a hypothetical intervention and assessed using established methodological frameworks for causal inference modelling.⁹^,²¹ The post-computational phase for triangulating the hypotheses is conducted in external populations (in temporal validation data or more desirably, external validation data). If replicable, the researchers should provide sufficient evidence that the replicated finding is causal (and not due to similar bias structures). This may be done using various triangulation approaches with orthogonal bias structures (i.e. designs with biases in different directions) including studies outside the epidemiological field.³⁶ Eventually and if possible, the hypotheses generated using CoOL need to be tested using a randomized set-up.

In our example, we now have some learnings to inform two hypotheses: men taking Drug A seem to be at a higher-than-normal risk and women taking Drug B seem to be at a higher-than-normal risk. We may test the findings in observational data from other populations before we eventually intervene (stop exposure to Drug A for men and Drug B for women), possibly in a randomized way if justified by equipoise.

Real-life application

Below is a summary of an application of CoOL on publicly available real-life data that focuses on demonstrating the computational phase and highlighting the importance of the pre-computational and post-computational phases.