Abstract
Life-course epidemiology relies on specifying complex (causal) models that describe how variables interplay over time. Traditionally, such models have been constructed by perusing existing theory and previous studies. By comparing data-driven and theory-driven models, we investigated whether data-driven causal discovery algorithms can help in this process. We focused on a longitudinal data set on a cohort of Danish men (the Metropolit Study, 1953–2017). The theory-driven models were constructed by 2 subject-field experts. The data-driven models were constructed by use of the temporal Peter-Clark (TPC) algorithm. The TPC algorithm utilizes the temporal information embedded in life-course data. We found that the data-driven models recovered some, but not all, causal relationships included in the theory-driven expert models. The data-driven method was especially good at identifying direct causal relationships that the experts had high confidence in. Moreover, in a post hoc assessment, we found that most of the direct causal relationships proposed by the data-driven model but not included in the theory-driven model were plausible. Thus, the data-driven model may propose additional meaningful causal hypotheses that are new or have been overlooked by the experts. In conclusion, data-driven methods can aid causal model construction in life-course epidemiology, and combining both data-driven and theory-driven methods can lead to even stronger models.
Keywords: causal discovery, causal models, directed acyclic graphs, life-course studies, longitudinal studies, Peter-Clark algorithm, structure learning
Abbreviations
- CPDAG
completed partially directed acyclic graph
- DAG
directed acyclic graph
- FCI
fast causal inference
- PC
Peter-Clark
- TPC
temporal Peter-Clark
- TPC-P
temporal Peter-Clark prespecified
- TPC-S
temporal Peter-Clark search
- TPDAG
temporal partially directed acyclic graph
Life-course studies require specification of complex (causal) models in order to estimate causal relationships between exposures and outcomes of interest. These models are systems of assumptions about potential cause-effect relationships, and they are typically produced by reviewing relevant theory and previous empirical studies. While this tried-and-tested strategy has generated numerous important results, it is fundamentally confirmatory by nature. Thus, it may suffer from confirmation bias, because the model specification by design must fit into existing theories of cause and effect across the life course.
Causal discovery (also referred to as (causal) structure learning) provides a data-driven alternative to theory-driven model specification. Here, statistical algorithms are used to infer causal models directly from observational data. There exist several correct and complete algorithms, which can infer all identifiable causal information from a given data set, but the algorithms require certain assumptions to be fulfilled, some of which are not testable (1, 2). Moreover, it is not yet clear how the algorithmic guarantees of correctness translate to finite-sample data (3, 4). Hence, although causal discovery may in principle be tremendously useful for constructing causal life-course models, its practical utility is still not well understood.
In this study, we aimed to compare the classical, theory-driven approach to constructing life-course models with a data-driven causal discovery approach. Life-course studies are an ideal use case for such a methodological comparison because the temporal information embedded in life-course data eases the causal discovery task and helps guide theoretical model construction as well.
We focused the comparison on a life-course epidemiologic data set based on the Metropolit Study cohort (5). The Metropolit Study is a longitudinal observational study following a cohort of Danish men born in 1953, with several data collections throughout their lives, as well as register-based follow-up. Two subject-matter experts constructed theory-driven causal models for the variables in this data set (without access to the actual data). We compared these expert models with models obtained by applying a causal discovery method called the temporal Peter-Clark (TPC) algorithm (6) to the data set.
CAUSAL MODELS AND ASSUMPTIONS
In this paper, we represent systems of causal assumptions using directed acyclic graphs (DAGs), and we also refer to these as causal models. Here we provide a brief and informal introduction to relevant concepts and assumptions, and we refer the reader to Hernán and Robins (7) for a more thorough description.
DAGs are useful for summarizing causal assumptions and thereby guiding an appropriate statistical analysis targeting the causal estimands of interest (8, 9). The use of DAGs is becoming increasingly widespread in health and clinical research (10). A causal DAG uses nodes to represent random variables, while directed edges indicate causal relationships. A directed edge from one variable X to another variable Y, denoted X → Y, means that X is a direct cause of Y (relative to the variables in the system). This means that intervening on X will have an effect on Y but not vice versa, and that this effect is not entirely mediated via other variables in the system. Two variables are said to be adjacent if there exists an edge between them. A DAG constructed like this will fulfill the Markov property (2), which ensures that we can use the causal information in the DAG to deduce what conditional independence relationships must be fulfilled by data that come from the DAG.
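The link between a DAG and the conditional independencies it implies is given by the graphical criterion of d-separation. The following minimal Python sketch checks d-separation in a small DAG by enumerating paths; it is an illustration only (not the causalDisco software used later), and the graphs and variable names are hypothetical.

```python
def descendants(edges, v):
    """Strict descendants of v in a DAG given as a set of (parent, child) pairs."""
    children = {}
    for u, w in edges:
        children.setdefault(u, set()).add(w)
    seen, stack = set(), [v]
    while stack:
        for w in children.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def undirected_paths(edges, a, b):
    """All simple paths between a and b, ignoring edge direction."""
    nbrs = {}
    for u, w in edges:
        nbrs.setdefault(u, set()).add(w)
        nbrs.setdefault(w, set()).add(u)
    out, stack = [], [[a]]
    while stack:
        p = stack.pop()
        if p[-1] == b:
            out.append(p)
            continue
        for n in nbrs.get(p[-1], ()):
            if n not in p:
                stack.append(p + [n])
    return out

def d_separated(edges, a, b, given):
    """True if every path between a and b is blocked given the set `given`."""
    for path in undirected_paths(edges, a, b):
        blocked = False
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            collider = (prev, v) in edges and (nxt, v) in edges
            if collider:
                # A collider blocks unless it (or a descendant) is conditioned on.
                if v not in given and not (descendants(edges, v) & given):
                    blocked = True
                    break
            elif v in given:  # A conditioned-on non-collider blocks the path.
                blocked = True
                break
        if not blocked:
            return False
    return True

# Chain X -> Z -> Y: X and Y are dependent marginally, independent given Z.
chain = {("X", "Z"), ("Z", "Y")}
print(d_separated(chain, "X", "Y", set()))   # False
print(d_separated(chain, "X", "Y", {"Z"}))   # True
```

Running the same check on the collider X → Z ← Y gives the opposite answers, which is exactly the asymmetry that lets data distinguish some causal structures from others.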
Relying on DAGs to represent our causal models implies some assumptions about the causal mechanisms studied. First, we assume causal sufficiency, that is, that we are considering all relevant variables. More specifically, we need to include all confounding variables, and we are not allowed to condition on any collider variables that are not in the DAG (see Lipsky and Greenland (9) for definitions). If this is not fulfilled, a DAG is not a meaningful representation of the causal mechanism (2). Secondly, we assume acyclicity, which means that there cannot be feedback loops between variables: X cannot be a cause of Y if Y is also a cause of X. Note that these assumptions are always required when using DAGs over observed variables to represent causal models, and hence they are (sometimes implicitly) applied in all epidemiologic studies utilizing DAGs.
DAGs are attractive causal model representations because they are relatively intuitive to work with, and because many useful statistical results exist for DAGs (11). However, it is not generally possible to infer a DAG from observational data. The problem is that different DAGs can imply the same conditional independencies, and hence potentially the same observed data; the structure embedded in a DAG is not sufficient to uniquely specify the statistical properties of observed data. Instead of inferring DAGs from the data, we therefore infer Markov equivalence classes of DAGs. A Markov equivalence class is the collection of all DAGs that produce data with the same conditional independencies. Fortunately, these equivalence classes are very well understood, so we can still provide a detailed description of which causal models agree with a given observational data set and which causal models are not compatible with it.
A Markov equivalence class can be represented by a graph itself, namely a completed partially directed acyclic graph (CPDAG). A CPDAG has directed edges wherever all DAGs in the Markov equivalence class agree on the orientation of that edge, while they have undirected edges whenever at least 2 DAGs in the equivalence class disagree on the orientation of an edge. Note that this means that all DAGs in an equivalence class share the same skeleton; that is, there are edges between the same variables. Only certain edge orientations may differ among members from the same Markov equivalence class. Figure 1 provides an example of 2 DAGs that make up a Markov equivalence class along with the CPDAG that represents the equivalence class.
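The classical characterization (due to Verma and Pearl) states that 2 DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures (colliders a → v ← b whose parents a and b are not adjacent). A small Python sketch of this check, using hypothetical 3-variable graphs:

```python
def skeleton(edges):
    """Adjacencies of the DAG, ignoring direction."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Triples (a, v, b) with a -> v <- b where a and b are not adjacent."""
    adj = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    return {
        (a, v, b)
        for v, ps in parents.items()
        for a in ps
        for b in ps
        if a < b and frozenset((a, b)) not in adj
    }

def markov_equivalent(e1, e2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

chain = {("X", "Y"), ("Y", "Z")}     # X -> Y -> Z
fork = {("Y", "X"), ("Y", "Z")}      # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}  # X -> Y <- Z

print(markov_equivalent(chain, fork))      # True: same skeleton, no v-structures
print(markov_equivalent(chain, collider))  # False: the collider has a v-structure
```

The chain and the fork imply the same conditional independencies and hence cannot be told apart from observational data alone, while the collider can.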
Figure 1.

An example of 2 directed acyclic graphs (panels A and B) that belong to the same Markov equivalence class and the completed partially directed acyclic graph that represents this equivalence class (panel C). If the variables X and Y are assigned to one period while Z and W are assigned to a different period, the graph in panel A is furthermore the temporal partially directed acyclic graph.
If additional external background knowledge is available, it may be possible to rule out certain members of the equivalence class. In this article, we focus on a specific type of external temporal information, which assigns each variable to one specific period in life (e.g., childhood or adulthood). This additional temporal information rules out certain edge orientations: It is not possible to have causal links against the direction of time. We refer to a CPDAG that respects such temporal information as a temporal partially directed acyclic graph (TPDAG). As an example, consider the CPDAG in Figure 1C. If we assume that X and Y belong to one period (e.g., childhood) while Z and W belong to another period (adulthood), then we can rule out the Figure 1B DAG from the equivalence class, as this model then has a causal link against the direction of time. Therefore, the Figure 1A DAG is the only remaining member of the equivalence class, and hence DAG A is the TPDAG for this causal mechanism.
In order to infer a CPDAG or TPDAG from observational data, an extra assumption is generally needed: faithfulness. This assumption is perhaps better understood by its complement: A key example of nonfaithfulness is to have causal effects with opposite signs cancel each other out. When we assume faithfulness, we assume that such phenomena do not occur. More generally, faithfulness rules out that 2 variables appear to be unrelated in the data (in the large sample limit), even though they are causally related in the data-generating mechanism. More formally, the faithfulness assumption ensures that all conditional independencies observed in the data originate from causal unrelatedness in the data-generating mechanisms (7), and it is hence the reverse implication of the Markov property.
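The cancellation scenario behind the faithfulness assumption can be made concrete with a small simulation (a hypothetical linear system, not the Metropolit data): X affects Y directly with effect +1 and also through a mediator Z with effect −1, so the 2 pathways cancel and X and Y appear marginally unrelated even though X is a cause of Y.

```python
import random

rng = random.Random(7)
n = 100_000
xs, ys = [], []
for _ in range(n):
    x = rng.gauss(0, 1)
    z = x                          # X -> Z with effect +1
    y = x - z + rng.gauss(0, 1)    # direct +1 and indirect -1 cancel exactly
    xs.append(x)
    ys.append(y)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
print(round(cov, 3))  # approximately 0, despite X being a cause of Y
```

On data like these, any independence-based method would conclude that X and Y are unrelated; assuming faithfulness amounts to assuming that such coincidental cancellations do not occur.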
We used a causal discovery algorithm to construct data-driven models. We stress that causal discovery algorithms do not violate the general rule of causal inference, namely that correlation does not imply causation, but instead make use of its counterpart: Causation does in fact imply association (under the assumption of faithfulness). In other words, causal relationships in the data-generating mechanism will leave behind certain associational patterns in the data. The totality of the associations found in a data set can be used to rule out certain causal relationships and help us reconstruct a causal model for the potential relationships that remain. Under the additional assumption of causal sufficiency, a CPDAG (or TPDAG) summarizes which parts of the causal mechanism can be reconstructed (directed edges) and which parts remain unknown (undirected edges). For a more formal introduction to how and why this works, see Petersen (12).
METHODS
Data
We used data from a longitudinal study, the Metropolit Study (5). This study includes a cohort of Danish men born in the Copenhagen metropolitan area in 1953. Several data sources were combined: information from birth registers, a study conducted in childhood (age 12 years), conscript board examination data (ages 18–30 years), and survey information (age 51 years). We furthermore used information from the Danish national registers, which contain extensive additional administrative data. This provided information for an additional follow-up period after the final survey, until the men were 64 years of age in 2017. The records from these data sources were matched using unique personal identification numbers. We only considered cohort members who participated in the 2004 survey, and who were furthermore alive and residing in Denmark at the end of follow-up on December 31, 2017. The study was thus retrospective.
We considered a total of 22 variables, and each variable was assigned to one of 5 periods: birth, childhood, youth, adulthood, or early old age (Table 1). All variables were either numerical or binary, and we considered only complete cases, which gave us a sample size of 3,145. Web Appendix 1 (available at https://doi.org/10.1093/aje/kwad144) provides further details about the data, including the exclusion process (13% of participants were excluded due to missing or erroneous information; see Web Figure 1) and marginal distributions (Web Table 1).
Table 1.
Overview of the Variables Used in This Study, Their Time Periods, and the Age and Year Associated With Each Period, Metropolit Study Cohort, Denmark, 1953–2017
| Period | Age, years | Calendar Year(s) | Variables |
|---|---|---|---|
| Birth | 0 | 1953 | Mother married, low paternal social class, length, weight |
| Childhood | 12 | 1965 | Maternal smoking, intelligence test score, positive attitude towards school, bullied in school, low paternal social class |
| Youth | 18–30 | 1971–1983 | BMI, intelligence test score |
| Adulthood | 51 | 2004 | Employment status, binge drinking, BMI, total years of smoking, weekly contact with friends, cohabitation status, number of children, undergraduate education |
| Early old age | 52–64 | 2005–2017 | Depression, hospital contact due to heart disease, retired |
Abbreviation: BMI, body mass index.
Theory-driven model construction
Two health researchers with experience in epidemiology were recruited to construct theory-driven models. We refer to them as the experts below. Expert 1 has a background in biology and a doctorate in heart disease epidemiology, while expert 2 is a medical doctor specializing in psychiatry.
The 2 experts were provided with a list of the 22 variables, but no access to data, and were asked to construct a causal DAG from these variables. They were instructed to annotate each edge with an indication of whether they had “high” or “moderate” confidence in the suggested direct causal link. We provide the full instructions given to the experts in Web Appendix 2. The clarity of these instructions was tested in a small pilot study using a third medical doctor with only a little research experience.
The 2 experts first constructed their own models independently without consulting each other. Afterwards, we organized and moderated a meeting where the 2 experts brought their individual models and together constructed a new consensus model. This consensus model served as our primary expert model, but we also compare the individual models in the “Theory-Driven Models” section.
Data-driven model construction
The data-driven models were constructed using the TPC algorithm (6). The TPC algorithm is an extension of the Peter-Clark (PC) algorithm, a well-known causal discovery algorithm, modified to take into account the temporal structure of life-course data.
The TPC algorithm uses information about conditional independencies among variables to infer a TPDAG. Therefore, it requires a statistical test for conditional independence. However, it is mathematically impossible to construct a test of conditional independence between combinations of numerical and binary variables that controls both type I and type II errors simultaneously (no matter how large the sample), if we do not make any assumptions about the distributions of these variables (13). Instead, we used a test of nonassociation suggested by Petersen et al. (6), which tests a necessary (but not sufficient) condition for conditional independence using generalized linear models with spline expansions to handle nonlinearity.
The TPC algorithm requires choosing a significance level, α, for the tests. No general guidelines exist for the choice of α for the TPC algorithm (or the PC algorithm). Which value of α performs best depends on graph size and sample size (4), and most likely also on the data distribution, the choice of test, and whether background information is included. We considered 2 approaches for choosing this significance level:
- TPC search (TPC-S): Choose the significance level such that the number of edges in the estimated graph is as close as possible to the number of edges in the expert consensus graph.
- TPC prespecified (TPC-P): Use a prespecified significance level of α = 0.05.
TPC-S makes use of the fact that adjacency retention is generally high as α decreases (6). This means that the difference between the skeletons discovered at a larger and a smaller value of α will be that the former graph can be constructed by adding edges to the latter graph. Hence, we can make sure that the final graph has a certain number of edges by searching for the relevant value of α.
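The TPC-S selection step can be sketched as a simple search: evaluate candidate significance levels on a log-spaced grid and keep the one whose graph size comes closest to the target edge count. The `edge_count` function below is a hypothetical stand-in for running TPC at a given α and counting the resulting edges; it is not the causalDisco implementation.

```python
def search_alpha(edge_count, target, grid=None):
    """Return the alpha whose graph size is closest to `target`.

    Ties are broken toward the smaller (more conservative) alpha,
    exploiting that the discovered skeletons are nested as alpha shrinks.
    """
    if grid is None:
        grid = [10 ** (-k / 2) for k in range(2, 25)]  # 0.1 down to ~3e-13
    return min(grid, key=lambda a: (abs(edge_count(a) - target), a))

# Hypothetical step function standing in for "run TPC at alpha, count edges".
def edge_count(alpha):
    if alpha >= 0.05:
        return 43
    if alpha >= 1e-4:
        return 30
    return 18

alpha = search_alpha(edge_count, target=30)
print(edge_count(alpha))  # 30
```

Because the skeletons are nested in α, the edge count is monotone in α and a coarse grid search of this kind is sufficient in practice.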
This approach requires an expert graph, which may not be available in many use cases where the TPC algorithm could be applied. Therefore, we also consider the TPC-P approach. This resembles the arbitrary choice of significance level often used in statistical testing. Our specific choice of α = 0.05 has also been considered by others (14), but we wish to emphasize that the value is nonetheless somewhat arbitrary.
We used a complete-case analysis; that is, we excluded all observations that had missing information. In a sensitivity analysis, we furthermore applied testwise deletion, as suggested by Witte et al. (15). When using testwise deletion, observations are excluded test by test rather than globally, and therefore, the statistical power is reduced less.
We investigated the stability of the results by use of bootstrapping. We repeated the TPC-S and TPC-P procedures 100 times on random bootstrap samples, and we report the stability of an edge as the number of times it occurred across the bootstrap repetitions.
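The bootstrap stability assessment can be sketched generically: resample rows with replacement, rerun the discovery procedure, and count how often each edge appears. The `discover` function below is a hypothetical stand-in for a TPC fit on one bootstrap sample.

```python
import random

def edge_stability(data, discover, n_boot=100, seed=1):
    """Count, per edge, how many of the n_boot bootstrap fits include it."""
    rng = random.Random(seed)
    counts = {}
    n = len(data)
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        for edge in discover(sample):
            counts[edge] = counts.get(edge, 0) + 1
    return counts

# Hypothetical discovery rule on (x, y) rows: claim X -> Y if x and y co-vary positively.
def discover(rows):
    return {("X", "Y")} if sum(x * y for x, y in rows) > 0 else set()

data = [(1.0, 1.0), (-1.0, -1.0)] * 25  # strongly positively related toy data
counts = edge_stability(data, discover)
print(counts[("X", "Y")])  # 100: the edge appears in every bootstrap sample
```

Edges that survive most resamples are the ones whose detection does not hinge on a handful of observations.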
All computations were conducted in R (R Foundation for Statistical Computing, Vienna, Austria), primarily using the causalDisco package (12). The software code is available in Web Appendix 3.
RESULTS
We first describe the expert models and the data-driven models separately, and afterwards we compare the two.
Note that we can only determine whether 2 models agree on the orientation of a specific edge if that edge is in fact included in both models. Therefore, when comparing 2 models, we first describe agreement with respect to adjacencies and then discuss orientation agreement among the shared edges.
Note also that the maximal number of possible edges in a DAG (or CPDAG) with 22 variables is 22 × 21/2 = 231.
Theory-driven models
Consensus model.
Figure 2 presents the expert consensus DAG. This DAG has a total of 30 edges, 7 of which the experts have high confidence in (marked in blue). Most proposed direct causal links are either within a time period or between variables from 2 adjacent time periods. We find only 2 direct causal links with a larger time lag (i.e., connections that skip one or more periods), namely the edge between "positive attitude towards school" in childhood and "undergraduate education" in adulthood and the edge between "maternal smoking" in childhood and "total years of smoking" in adulthood. Time-lagged direct causal effects are often of special interest in life-course epidemiology because they may provide evidence in favor of, for example, critical period theories, which propose that certain early exposures have effects much later in life that cannot be intervened upon or diminished once the critical period has ended (16).
Figure 2.
The expert consensus model constructed for 22 variables from the Metropolit Study (Denmark, 1953–2017). The model is presented using a directed acyclic graph (DAG). The DAG has a total of 30 edges. Edges colored blue are those that the experts marked as having high confidence in, while the remaining edges were marked as moderate confidence by the experts. BMI, body mass index.
Interexpert agreement.
The 2 experts provided rather different individual model suggestions (Web Figures 2 and 3). The expert 1 model has 19 edges, while the expert 2 model has 37 edges. Table 2 compares the adjacencies found in the 2 expert models. We see that the 2 experts agree on 15 adjacencies, while expert 2 claims 22 adjacencies that expert 1 does not and expert 1 claims 4 adjacencies that expert 2 does not. Among the 15 adjacencies that the experts agree on, they also agree on the orientation of 13 edges.
Table 2.
Comparison of Adjacenciesa Included in 2 Individual Expert Models, Metropolit Study Cohort, Denmark, 1953–2017b
|  | Expert 1: Adjacency | Expert 1: Nonadjacency |
|---|---|---|
| Expert 2: Adjacency | 15 | 22 |
| Expert 2: Nonadjacency | 4 | 190 |
a For each of the 231 potential adjacencies in the directed acyclic graph, we classified whether both experts included the adjacency (n = 15, 13 of which also agreed on orientation), neither expert included the adjacency (n = 190), only expert 1 included the adjacency (n = 4), or only expert 2 included the adjacency (n = 22).
bNote that the 2 individual expert models did not contain the same number of edges.
Expert 1 marked 6 edges as high confidence, while expert 2 marked 13 edges as high confidence. Five of these edges were shared by both experts, and both experts agreed on the orientation of those edges.
Data-driven models
The TPC-S model.
TPC-S was used to construct a model with the same number of edges as the expert consensus model (30 edges). This was achieved by searching for the significance level α as described above. Figure 3 shows the estimated TPDAG. We note that this TPDAG is quite informative, as only 2 of its 30 edges are undirected. Hence it makes claims about the causal direction of the remaining 28 edges.
Figure 3.
Model obtained using the temporal Peter-Clark algorithm search (TPC-S) on data from the Metropolit Study (Denmark, 1953–2017). The TPC-S procedure targeted the same number of edges as the expert consensus model (30) by searching over significance levels. BMI, body mass index.
We find a high degree of stability for the edges included in this model: Using bootstrapping, we find that 26 of the 30 edges in the TPC-S model occur more than 60% of the time, and 15 of these edges occur more than 80% of the time (Web Appendix 4, Web Table 2). If we consider only adjacencies and disregard edge orientation, even higher stability measures are found.
The TPC-P model.
We also considered using the TPC-P method with a prespecified significance level of α = 0.05 (Web Figure 4). All adjacencies found in the TPC-S model are also present in the resulting TPC-P model, but the TPC-P model furthermore has 13 additional edges.
The TPC-P model is also very informative, since only 2 edges are unoriented. The unoriented edges are the same as those for the TPC-S model. The 2 procedures agree on the orientation of 26 of their shared 30 edges, while 4 edges are oriented differently between the two graphs. These are edges between low paternal social class and intelligence test score (both in childhood), binge drinking and total years of smoking (both in adulthood), weekly contact with friends and cohabitation status (both in adulthood), and depression and retirement (both in early old age).
We again find a high degree of stability: 32 of the 43 edges in the TPC-P model occur more than 70% of the time, and an additional 5 edges occur in more than 75% of the bootstrap repetitions if we disregard edge orientation (Web Appendix 4, Web Table 3).
Handling of missing information.
Both TPC-S and TPC-P produced models identical to those presented above when we used testwise deletion rather than complete-case analysis to handle missing information. However, the significance level found by the TPC-S search was somewhat different when testwise deletion was used.
Comparison: theory-driven and data-driven models
We first compared the TPC-S model and the consensus expert model (Table 3). We found large agreement about absent adjacencies but less agreement for claimed adjacencies. Only 10 adjacencies are present in both the TPC-S model and the expert consensus model, while both models have 20 additional adjacencies that the other does not. The symmetry is by design, since both models have 30 edges. Among the 10 shared adjacencies, the 2 models largely agree on orientation: 9 of these have the same orientation, while only 1 does not. This is the edge between weight at birth and length at birth, which is undirected in the TPC-S model.
Table 3.
Comparison of Adjacenciesa Found in the TPC-S Model and the Expert Consensus Model, Metropolit Study Cohort, Denmark, 1953–2017b
|  | Expert Consensus Model: Adjacency | Expert Consensus Model: Nonadjacency |
|---|---|---|
| TPC-S Model: Adjacency | 10 | 20 |
| TPC-S Model: Nonadjacency | 20 | 181 |
Abbreviation: TPC-S, temporal Peter-Clark algorithm search.
a For each of the 231 potential adjacencies in the directed acyclic graph, we classified whether both approaches included the adjacency (n = 10, 9 of which also agreed on orientation), neither approach included the adjacency (n = 181), only the expert consensus model included the adjacency (n = 20), or only the TPC-S model included the adjacency (n = 20).
b Note that the 2 approaches resulted in the same total number of adjacencies by design, and hence the symmetry in the 2 disagreement categories is not incidental.
If we consider only edges marked as high-confidence in the expert consensus model, we see that TPC-S performs quite well: Of the 7 edges marked with high confidence in the expert consensus model, 6 are also present in the TPC-S model, and they all have the same orientation in both models. Only the edge from “bullied in school” to “positive attitude towards school” (both in childhood) is not present in the TPC-S model. The high-confidence edges are found to be very stable: 5 of them are found in all bootstrap repetitions, while the last edge is found in 88% of the bootstrap repetitions for TPC-S.
We also note that the TPC-S model has more lagged causal relationships spanning several periods than the expert models, including a direct causal link that skips 2 periods (“maternal smoking” in childhood to “hospital contact due to heart disease” in early old age).
Turning to the TPC-P model, we see that it recovers 3 more of the edges in the expert consensus model than TPC-S does, but this comes at the cost of 30 suggested edges that are not present in the expert consensus model, which is 10 more than for the TPC-S model (Table 4). Moreover, 10 of the 13 adjacencies shared with the expert consensus model also had the same orientation. All 6 high-confidence edges were found in all bootstrap repetitions, and hence their estimation was highly stable.
Table 4.
Comparison of Adjacenciesa Found in the TPC-P Modelb and the Expert Consensus Model, Metropolit Study Cohort, Denmark, 1953–2017
|  | Expert Consensus Model: Adjacency | Expert Consensus Model: Nonadjacency |
|---|---|---|
| TPC-P Model: Adjacency | 13 | 30 |
| TPC-P Model: Nonadjacency | 17 | 171 |
Abbreviation: TPC-P, temporal Peter-Clark algorithm prespecified.
a For each of the 231 potential adjacencies in the directed acyclic graph, we classified whether both approaches included the adjacency (n = 13, 10 of which also agree on orientation), neither approach included the adjacency (n = 171), only the expert consensus model included the adjacency (n = 17), or only the TPC-P model included the adjacency (n = 30).
b Using the significance level α = 0.05.
Causes of depression.
To exemplify what the differences in the 2 models can imply, we now focus on depression in early old age. Zooming in on this variable, we find quite large differences between the theory-driven and data-driven models. The data-driven TPC-S model identifies only 3 direct causes of depression (employment status in adulthood, retirement in early old age, and heart disease in early old age), while the expert consensus model proposes 5 additional direct causes (cohabitation, body mass index, contact with friends, binge drinking, and number of children, all in adulthood). Moreover, all of these variables additionally serve as confounders of the effect of heart disease on depression according to the expert consensus model, while this effect is unconfounded in the TPC-S model. Hence, the differences in how the 2 models describe causal links to depression not only include potential causes but also have implications for how these causal effects may be estimated, since it is necessary to adjust for confounders in order to obtain unbiased causal effect estimates.
If we move beyond considering only direct causal effects, we find some similarities between the 2 models. In several instances, the models agree on indirect causes of depression, although they do not agree on the intermediate causal pathways. For example, although the 2 models disagree on whether binge drinking in adulthood is a direct cause of depression in early old age, they do agree that there is an indirect causal pathway from binge drinking to depression. The same can be said for the effect of body mass index.
On the other hand, the 2 models do not agree about whether cohabitation, contact with friends, and number of children in adulthood are indirect causes of depression in early old age. In the TPC-S model, these variables are not indirect causes of depression, while they are direct (and sometimes indirect) causes in the expert consensus model. Hence, the 2 models agree more closely about the roles of binge drinking and body mass index than about the other 3 variables for which they disagreed on direct effects.
Plausibility of additional edges proposed by the data-driven model.
We conducted a post hoc investigation of whether the extra edges proposed by the TPC algorithm but not included in the expert model may aid causal model-building. Table 5 provides an overview of the extra edges, and we have assigned a plausibility classification to each of them to describe whether the proposed direct causal relationship is supported by existing knowledge. Web Appendix 5 presents arguments underlying the classifications. The classifications were conducted post hoc, and hence should be interpreted cautiously. We find that only 3 of the extra edges are classified as low-plausibility, while 6 edges are moderately plausible and 11 are highly plausible.
Table 5.
Edges Proposed by the Temporal Peter-Clark Algorithm Search Model That Were Not Present in the Expert Consensus Model, Metropolit Study Cohort, Denmark, 1953–2017
| Source Variablea | Edge Type | Destination Variable | Plausibilityb |
|---|---|---|---|
| Mother married (B) | — | Low paternal social class (B) | Low |
| Mother married (B) | → | Intelligence test score (C) | Moderate |
| Low paternal social class (B) | → | Intelligence test score (C) | High |
| Low paternal social class (B) | → | Intelligence test score (Y) | High |
| Weight (B) | → | Body mass index (Y) | High |
| Maternal smoking (C) | → | Hospital contact due to heart disease (E) | High |
| Intelligence test score (C) | → | Low paternal social class (C) | Low |
| Positive attitude towards school (C) | → | Intelligence test score (Y) | Low |
| Low paternal social class (C) | → | Undergraduate education (A) | High |
| Intelligence test score (Y) | → | Total years of smoking (A) | High |
| Intelligence test score (Y) | → | Retirement (E) | Moderate |
| Employment status (A) | → | Total years of smoking (A) | High |
| Employment status (A) | → | Cohabitation status (A) | Moderate |
| Binge drinking (A) | → | Total years of smoking (A) | High |
| Binge drinking (A) | → | Cohabitation status (A) | Moderate |
| Total years of smoking (A) | → | Retirement (E) | High |
| Weekly contact with friends (A) | → | Cohabitation status (A) | Moderate |
| No. of children (A) | → | Cohabitation status (A) | Moderate |
| Undergraduate education (A) | → | Body mass index (A) | High |
| Undergraduate education (A) | → | Retirement (E) | High |
Abbreviations: A, adulthood; B, birth; C, childhood; E, early old age; Y, youth.
a Variable names are annotated with their life-course period (B, C, Y, A, or E).
b We used 3 categories of plausibility. “Low” indicates that no evidence in favor of the relationship existed and that we had no suggestions for convincing new hypotheses that explained the relationship. “Moderate” indicates that some evidence in favor of the relationship existed or that we had meaningful suggestions for new hypotheses that explained the relationship. “High” indicates that strong evidence in favor of the relationship existed. Motivations for each plausibility classification are provided in Web Appendix 5.
DISCUSSION
For the theory-driven approach, we found that the 2 experts provided somewhat different individual model suggestions, although we found higher agreement about proposed adjacencies between the 2 experts than between the TPC-S and expert consensus models. We consider this to be evidence in favor of expert models having higher credibility than data-driven estimates. However, the proposed consensus strategy will likely produce an even stronger model, and, of course, the more experts involved, the higher the credibility. Involving a large number of experts is often not feasible, however, since causal model construction is very time-consuming and requires highly specialized knowledge. Hence, including a few experts with different specializations, as we did in this study, can be a reasonable compromise.
For the data-driven approach, we found that the TPC algorithm identified some but not all of the direct causal links proposed in the expert consensus model. Notably, the TPC-S model correctly identified almost all direct causal links marked with high confidence by the experts. This indicates that TPC-S does indeed capture causal relationships. TPC-S also produced a number of direct causal links not proposed by the experts, and correspondingly missed some direct causal links from the expert model. However, we found that the majority of the edges proposed by TPC-S but not by the expert model were in fact epidemiologically plausible, while only a very few suggested edges were outright implausible. Hence, conducting data-driven model construction does indeed seem to contribute with new insights and meaningful hypotheses that may have been overlooked in the expert model construction process.
We therefore propose that the TPC algorithm can be a useful supplement to classical theory-driven model construction, and that combining theory-driven and data-driven approaches may lead to stronger causal models that include both well-known and newly discovered causal relationships. We found it useful to use the TPC-S approach specifically, so that the TPC algorithm obtains the same number of edges as the expert model, because this will allow the data-driven approach to contribute with as much information as the theory-driven approach. TPC-P provided insights into what may happen if we allow the data-driven method to contribute with more information: While TPC-P added 13 additional edges, compared with TPC-S, only 3 of them were also found in the expert consensus model. We consider this a worse trade-off between informativeness and correctness than what TPC-S produced. We thus suggest a pipeline where 1) the initial expert model is first constructed, 2) a data-driven search is then performed targeting the same number of edges, and finally 3) the results of the data-driven search are critically assessed and incorporated into a new combined model.
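To make step 2 of this pipeline concrete, the edge-count-matching and conflict-flagging logic can be sketched as follows. This is a minimal Python illustration, not the implementation used in our analysis; the variable names and the ranked candidate list are hypothetical, and in practice the candidates would come from a TPC search.

```python
# Minimal sketch of the proposed pipeline's combination step (our
# illustration). Edge lists are hypothetical; in practice they would come
# from the expert model and from a data-driven search ranked by, e.g.,
# test statistics.

def combine_models(expert_edges, ranked_candidates):
    """Keep the top-k data-driven edges, where k matches the expert model,
    and sort them into agreements, new suggestions, and conflicts."""
    k = len(expert_edges)
    data_edges = ranked_candidates[:k]  # edge-count-matched search (step 2)
    expert_pairs = {frozenset(edge) for edge in expert_edges}

    agree, extra, conflict = [], [], []
    for a, b in data_edges:
        if (a, b) in expert_edges:
            agree.append((a, b))        # same edge, same orientation
        elif frozenset((a, b)) in expert_pairs:
            conflict.append((a, b))     # same adjacency, reversed orientation
        else:
            extra.append((a, b))        # new hypothesis for critical review
    return agree, extra, conflict

expert = [("smoking", "heart_disease"), ("education", "smoking")]
ranked = [("smoking", "heart_disease"), ("smoking", "education")]
agree, extra, conflict = combine_models(expert, ranked)
```

Conflicting orientations are surfaced explicitly so that, in step 3, the experts' attention is drawn to potentially problematic edges rather than either model silently overwriting the other.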
To our knowledge, this study is the first to have compared theory-driven and data-driven causal model construction for life-course studies and epidemiology more generally. A few related studies exist. Spirtes and Cooper (17) searched for causal effects in a database of pneumonia patients and compared the discovered effects with expert judgments. However, they did not construct full causal models but instead only considered whether potential instrumental variables can be used to identify causal relationships between pairs of variables. They did not incorporate temporal background information into the search. Oates et al. (18) proposed using the PC algorithm to add additional edges to expert graphs (used in the PC algorithm as background knowledge), thereby allowing the data-driven method to make further suggestions for direct causal links. However, their procedure failed and resulted in faulty causal conclusions if there were any wrongly oriented edges in the expert model. In contrast, our proposed pipeline would still have the chance to find a conflicting orientation in the data-driven model and thereby draw the experts’ attention towards a potentially problematic edge. Twardy et al. (19) constructed causal models for coronary heart disease using both causal discovery and so-called knowledge engineering. For the latter, they considered regression models from published studies on coronary heart disease, and they constructed causal models where all covariates from these regression models were considered (potential) direct causes of the outcome but were otherwise unrelated. This may not be an advisable strategy, since covariates are often included in regression models because they are potential confounders, and hence this approach may not describe the intended causal model. Hashem and Cooper (20) also compared human and computer methods for constructing causal models for health sciences. 
In that study, however, the baseline true models were constructed with computer simulations, and medical students were given data simulated from those models and tasked to construct causal models using the data (20). This did not prove an easy task, and we suggest that it is hence more useful to have humans construct models from theory and computers infer models from data, as in the current study.
An open question is how much the results presented here depend on sample size. The TPC algorithm conducts a large number of tests of nonassociation, and hence sample size can potentially have a notable impact on the statistical decisions made along the way. The PC algorithm is quite sensitive towards sample size and may require large samples to adequately construct a causal model (4). However, due to the use of external information, the TPC algorithm does not conduct as many tests as the PC algorithm, and hence it may perform relatively better for small or moderate samples. We consider the high degree of bootstrap stability found in this study to support this conjecture. Further work should be devoted to investigating this systematically.
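To illustrate the idea of bootstrap edge stability, the following minimal Python sketch resamples the data with replacement and records how often each pair of variables is connected. A simple marginal-correlation cutoff stands in for the full TPC search, which the real analysis would rerun on each resample; the data and cutoff are hypothetical.

```python
import random

# Sketch of bootstrap edge stability (our illustration). A marginal-
# correlation cutoff stands in for the full causal discovery search.

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def edge_stability(rows, names, n_boot=200, cutoff=0.3, seed=1):
    """Fraction of bootstrap resamples in which each pair is 'connected'."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(len(names)) for j in range(i + 1, len(names))]
    counts = {p: 0 for p in pairs}
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        for i, j in pairs:
            xs = [r[i] for r in sample]
            ys = [r[j] for r in sample]
            if abs(correlation(xs, ys)) > cutoff:
                counts[(i, j)] += 1
    return {(names[i], names[j]): c / n_boot for (i, j), c in counts.items()}

# Hypothetical data: y depends on x; z is independent noise.
gen = random.Random(0)
rows = [(x, x + gen.gauss(0, 0.5), gen.gauss(0, 1))
        for x in (gen.gauss(0, 1) for _ in range(100))]
stab = edge_stability(rows, ["x", "y", "z"], n_boot=50)
```

In this toy setup, the x-y edge is selected in essentially every resample while the x-z edge is not, which is the kind of marginal per-edge stability we report in this study.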
The choice of test may also influence the performance of the TPC algorithm. We use a test that tests a necessary, but not sufficient, condition for conditional independence (namely nonassociation in a generalized linear model setting). If the test is not testing a good proxy for conditional independence, the algorithm is not guaranteed to work well. The fact that we did identify all but one direct causal relationship marked with high confidence in the expert consensus model indicates that we are testing something that can be used to infer conditional independence (and thereby causal models). We thus believe that studies such as the present one may help inform whether the test is appropriate.
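To make the notion of such a proxy test concrete, the following Python sketch (our illustration, not the exact generalized-linear-model test used in the analysis) tests nonassociation via the partial correlation of 2 variables given a third, using the standard Fisher z transformation; the simulated variables are hypothetical.

```python
import math
import random

# Sketch of a conditional-independence proxy test (our illustration):
# the partial correlation of x and y given a single covariate z,
# assessed with a Fisher z test.

def residuals(ys, xs):
    """Residuals from a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return [y - my - beta * (x - mx) for x, y in zip(xs, ys)]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ci_test_pvalue(xs, ys, zs):
    """Approximate p-value for H0: x independent of y given z."""
    r = correlation(residuals(xs, zs), residuals(ys, zs))
    stat = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(len(xs) - 4)
    return 1 - math.erf(abs(stat) / math.sqrt(2))

gen = random.Random(42)
z = [gen.gauss(0, 1) for _ in range(500)]
x = [zi + gen.gauss(0, 1) for zi in z]        # z -> x
y_indep = [zi + gen.gauss(0, 1) for zi in z]  # z -> y; x and y share only z
y_dep = [xi + gen.gauss(0, 1) for xi in x]    # x -> y directly

p_indep = ci_test_pvalue(x, y_indep, z)  # large under H0
p_dep = ci_test_pvalue(x, y_dep, z)      # tiny: direct dependence remains
```

Like the test used in this study, this is a proxy: zero partial correlation is necessary but not sufficient for conditional independence outside of, for example, jointly Gaussian settings.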
Another, more direct consequence of the choice of test is that the study was restricted to binary and numerical variables, as the test used here only accommodates these data types. As a result, we used dichotomized versions of some categorical variables that were originally measured using more categories (bullying in childhood, positive attitude towards school in childhood, and paternal social class at birth and in childhood). Such a dichotomization may introduce bias if the true causal effects operate at the level of the more nuanced categories that were collapsed into 2 categories. However, the original categories may be less well-defined (i.e., how much bullying a person was subjected to) and hence suffer from poor external validity. All in all, it is not clear whether the restriction towards binary variables was problematic in this study, and we suggest that investigators critically assess this in each application of the TPC algorithm. If dichotomization is not meaningful in a specific application, the TPC algorithm can still be used, but the test we used here needs to be replaced by an alternative that accommodates categorical variables with more than 2 categories.
We did not find any differences in the fitted TPC-S or TPC-P models depending on whether we used complete-case analysis or testwise deletion to handle missing information. This may be because the amount of missing information was rather low (13%), so that testwise deletion offered no noticeable gain in statistical power.
Another, more severe issue related to missing information is that both complete-case analysis and testwise deletion are in general only guaranteed to be unbiased if missingness is independent of all observed and unobserved variables (i.e., the data are missing completely at random in Rubin's terminology (21)). This is a quite strong assumption which is likely to have been violated to some extent in the current study. However, we believe that the rather low amount of missing information makes it unlikely that such bias drives the conclusions. Some authors have investigated alternative approaches to handling missing information in the PC algorithm under less restrictive assumptions (15, 22). An interesting line of future research would be to extend these analyses to the TPC algorithm as well.
More generally, there are still many open questions about the statistical properties of causal discovery algorithms. While their mathematical foundations are well-understood, we still need more knowledge about how finite data and imperfect statistical testing affect their performance. In particular, we lack means to control type I and II errors, and this has important implications: While we may find the mathematical assumption of faithfulness uncontroversial, we may still be subject to “empirical unfaithfulness” due simply to an underpowered study leading to type II errors in statistical tests of conditional independence. Moreover, we do not yet have a systematic way to describe the statistical uncertainty of the procedures. Using bootstrapping, we have investigated how stably each edge is estimated across resampling, and this provides some insight into the marginal uncertainty of each estimated edge. However, the goal of causal discovery as applied here is not to estimate singular edges but to provide models for the full data-generating mechanism. Hence, an appropriate uncertainty measure should quantify uncertainty over the full model, not edge by edge. We hope that new methodological developments will allow us to supplement analyses such as the current one with more adequate uncertainty quantifications in the future.
A more fundamental limitation of the current study, and of the TPC algorithm more generally, is the assumption of causal sufficiency. In many epidemiologic data sets, it may not be reasonable to assume no unobserved confounding. In the presence of unobserved confounding, the PC algorithm (and by extension the TPC algorithm as well) is known to produce models with spurious adjacencies, but not with spuriously absent ones (1). Hence, absent adjacencies (i.e., nonconnected variables) will generally still be correct (of course, subject to statistical uncertainty) even in the presence of unobserved confounding. We do believe that this issue is likely to affect the models presented here. Although we aimed to include as many relevant variables as possible, there were most likely some unobserved confounders that we left out, and hence, the TPC-S and TPC-P models may have included spurious adjacencies. This could explain the implausible edges discussed above, although these could also have resulted from statistical uncertainty, even if there was no unobserved confounding. In any case, these imperfect models may be the best solution at this time, and this corresponds well with current practice in epidemiology, where DAGs are often used, even though their use implicitly assumes no unobserved confounding.
Extending the current work to handle unobserved confounding may be possible, but there are several challenges. First, a DAG is no longer adequate for describing the causal data-generating mechanism, and instead experts would need to construct, for example, acyclic directed mixed graphs. Second, the TPC algorithm cannot be used. Instead, the fast causal inference (FCI) algorithm (1) can be applied, since this algorithm is correct even under nonsufficiency. However, a temporal extension of the FCI algorithm would have to be developed and implemented in R. Andrews et al. (23) have proposed a related extension of the FCI algorithm, which does include temporal background knowledge, but this algorithm assumes that there is no unobserved confounding between variables from different periods. We do not consider this a realistic assumption in the present context. Third, the FCI algorithm also learns an equivalence class rather than the full causal model, and when there is unobserved confounding, this equivalence class is described by a partial ancestral graph. However, partial ancestral graphs are quite complex and difficult to interpret. Moreover, comparing a partial ancestral graph to an acyclic directed mixed graph is not straightforward, since their edges do not have the same interpretation. Even if the acyclic directed mixed graph belongs to the equivalence class that the partial ancestral graph represents, they do not even share the same adjacencies. Hence, all in all, handling unobserved confounding when constructing data-driven causal life-course models requires new methodological developments, and we plan to address this in future research.
ACKNOWLEDGMENTS
Author affiliations: Section of Biostatistics, Department of Public Health, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark (Anne Helby Petersen, Claus Thorn Ekstrøm); Department of Philosophy, Dietrich College of Humanities and Social Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States (Peter Spirtes); Center for Clinical Research and Prevention, Bispebjerg and Frederiksberg Hospitals, Copenhagen, Denmark (Merete Osler); and Section of Epidemiology, Department of Public Health, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark (Merete Osler).
This work was funded by Independent Research Fund Denmark (grant 8020-00031B) and the US National Institutes of Health (contract R01HL159805).
These data are not available online for replication because they cannot be anonymized. Researchers interested in gaining access to the data may apply to the Public Health Database (https://publichealth.ku.dk/research/databases-for-collaboration/publich-health-database/) at the Department of Public Health, University of Copenhagen.
We owe great thanks to Drs. Ida Kim Wium-Andersen and Matilde Winther-Jensen for the time and effort they devoted to constructing the expert models.
Conflict of interest: none declared.
REFERENCES
- 1. Spirtes P, Glymour CN, Scheines R, et al. Causation, Prediction, and Search. Cambridge, MA: MIT Press; 2001.
- 2. Peters J, Janzing D, Schölkopf B. Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA: MIT Press; 2017.
- 3. Scheines R, Ramsey J. Measurement error and causal discovery. In: Eberhardt F, Bareinboim E, Maathuis M, et al., eds. Proceedings of the UAI 2016 Workshop on Causation: Foundation to Application. (CEUR Workshop Proceedings, vol. 1792, paper 1). https://ceur-ws.org/Vol-1792/paper1.pdf. Published February 8, 2017. Accessed April 12, 2022.
- 4. Petersen AH, Ramsey J, Ekstrøm CT, et al. Causal discovery for observational sciences using supervised machine learning. J Data Sci. 2023;21(2):255–280.
- 5. Osler M, Lund R, Kriegbaum M, et al. Cohort profile: the Metropolit 1953 Danish male birth cohort. Int J Epidemiol. 2006;35(3):541–545.
- 6. Petersen AH, Osler M, Ekstrøm CT. Data-driven model building for life-course epidemiology. Am J Epidemiol. 2021;190(9):1898–1907.
- 7. Hernán MA, Robins JM. Causal Inference: What If. Boca Raton, FL: Chapman & Hall/CRC Press; 2020.
- 8. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48.
- 9. Lipsky AM, Greenland S. Causal directed acyclic graphs. JAMA. 2022;327(11):1083–1084.
- 10. Tennant PWG, Murray EJ, Arnold KF, et al. Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int J Epidemiol. 2021;50(2):620–632.
- 11. Pearl J. Causality: Models, Reasoning, and Inference. 2nd ed. New York, NY: Cambridge University Press; 2009.
- 12. Petersen AH. Package ‘causalDisco’: Tools for Causal Discovery on Observational Data. (R package, version 0.9.1). https://cran.r-project.org/web/packages/causalDisco/causalDisco.pdf. Published October 22, 2022. Accessed March 28, 2023.
- 13. Shah RD, Peters J. The hardness of conditional independence testing and the generalised covariance measure. Ann Stat. 2020;48(3):1514–1538.
- 14. Ramsey JD, Andrews B. A comparison of public causal search packages on linear, gaussian data with no latent variables [preprint]. arXiv. 2017. (10.48550/arXiv.1709.04240). Accessed March 28, 2023.
- 15. Witte J, Foraita R, Didelez V. Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data. Stat Med. 2022;41(23):4716–4743.
- 16. Lucas A, Fewtrell MS, Cole TJ. Fetal origins of adult disease—the hypothesis revisited. BMJ. 1999;319(7204):245–249.
- 17. Spirtes P, Cooper GF. An experiment in causal discovery using a pneumonia database. In: Heckerman D, Whittaker J, eds. Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics (PMLR R2). http://proceedings.mlr.press/r2/spirtes99a/spirtes99a.pdf. Published 1999. Accessed April 25, 2022.
- 18. Oates CJ, Kasza J, Simpson JA, et al. Repair of partly misspecified causal diagrams. Epidemiology. 2017;28(4):548–552.
- 19. Twardy CR, Nicholson AE, Korb KB, et al. Epidemiological data mining of cardiovascular Bayesian networks. Electron J Health Inform. 2006;1:3.
- 20. Hashem AI, Cooper GF. Human causal discovery from observational data. Proc AMIA Annu Fall Symp. 1996:27–31.
- 21. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592.
- 22. Tu R, Zhang C, Ackermann P, et al. Causal discovery in the presence of missing data. Proc Mach Learn Res. 2019;89:1762–1770.
- 23. Andrews B, Spirtes P, Cooper GF. On the completeness of causal discovery in the presence of latent confounding with tiered background knowledge. Proc Mach Learn Res. 2020;108:4002–4011.