Abstract
Several studies have provided strong evidence that long-term exposure to air pollution, even at low levels, increases the risk of mortality. As regulatory actions become increasingly expensive, robust evidence is needed to guide the development of targeted interventions that protect the most vulnerable. In this paper, we introduce a novel statistical method that (i) discovers subgroups whose causal effects differ substantially from the population mean, and (ii) uses randomization-based tests to assess the discovered heterogeneous effects. We also develop a sensitivity analysis method to assess the robustness of the conclusions to unmeasured confounding bias. Via simulation studies and theoretical arguments, we demonstrate that hypothesis testing focused on the discovered subgroups can substantially increase the statistical power to detect heterogeneity of the exposure effects. We apply the proposed de novo method to data on 1,612,414 Medicare beneficiaries in the New England region of the United States for the period 2000 to 2006. We find that seniors aged 81–85 with low income and seniors aged 85 and above experience statistically significantly greater causal effects of long-term exposure to PM2.5 on the 5-year mortality rate than the population mean.
Keywords: Causal effect, Causal inference, Unmeasured confounding, Particulate Matter, Observational study, Recursive partitioning, Sample split
1. Introduction
Air pollution is a major environmental risk to health. Over the past few decades, researchers have estimated the association between air pollution exposure and a wide range of health outcomes, from respiratory diseases to death (Dockery et al., 1993; Samet et al., 2000; Dominici et al., 2006; Loomis et al., 2013; Di et al., 2017; Makar et al., 2017). Recently, Wu et al. (2020) reported statistically significant evidence of increased mortality risk associated with long-term exposure to fine particulate matter with an aerodynamic diameter < 2.5 μm (PM2.5), even when exposure levels were below 12 μg/m3, the current national PM2.5 standard. The World Health Organization (WHO)’s International Agency for Research on Cancer (IARC) has concluded that exposure to PM2.5 is carcinogenic to humans (Loomis et al., 2013). Rückerl et al. (2011) and Rajagopalan et al. (2018) provide extensive overviews of the impact of air pollution on a variety of health outcomes in epidemiological studies. Along with these findings, regulations to limit emissions have been formulated and updated in order to decrease air pollution and promote public health, as the Clean Air Act requires. However, as the costs of air quality regulation rise, robust evidence for more targeted regulatory actions is required to determine the appropriate allocation of regulatory efforts and resources. To establish such regulatory programs, it is important and informative to discover population subgroups that are potentially more vulnerable to air pollution.
Our aim is to develop a new methodological approach in causal inference to address this critically important public health question: “Which population subgroups have causal effects of air pollution on mortality that are statistically significantly different from the population average?” In epidemiology, a similar question is often formulated in terms of assessing evidence of effect modification of air pollution effects on health by some pre-specified variables, such as age, sex, and race. Many air pollution epidemiological studies have been conducted to assess effect modification. For example, Di et al. (2017) and Di et al. (2017) conducted subgroup analyses using up to one-way interaction terms to study the short-term and long-term effects of air pollution. Also, Pope et al. (2019) used National Health Interview Surveys (NHIS) between 1986 and 2014 to study the long-term effect of air pollution on mortality, using Cox proportional hazards models that account for the complex sample design; a subgroup analysis was conducted by examining pre-specified variables one at a time. However, most of these studies have three main limitations: (a) variables that are suspected to be potential effect modifiers must be selected a priori, before analysis; (b) effect modification is assessed in a model-dependent way, by testing whether there is evidence of an interaction between the exposure variable and a modifier; and (c) the existing approaches rely on standard regression adjustments for confounders, without conducting systematic sensitivity analyses for unmeasured confounding.
In the statistical literature, there have been developments that can address some of these three issues. Many methods aim primarily at estimating conditional average treatment effects. For example, Wager and Athey (2018) propose the Causal Forest method to estimate covariate-specific treatment effects, and Su et al. (2009) use recursive partitioning to estimate treatment effects across subpopulations. Hahn et al. (2020) propose a nonlinear regression model using a Bayesian version of the Causal Forest. These tree-based approaches provide a flexible way to parameterize the covariate space: if any subgroup contributes effect modification, a tree (or a combination of trees) is likely to keep growing to capture this subgroup, which leads to high estimation accuracy. Despite such significant advances, most of these contributions do not assess the robustness of causal findings about effect modification. In particular, when the assumption of no unmeasured confounding is violated, apparent evidence of effect modification can be explained by unmeasured bias. A sensitivity analysis can then be conducted to characterize how extensive an unmeasured confounding bias would have to be in order to alter the conclusions.
In this paper we propose a novel approach, called the de novo method, to overcome these limitations. The de novo method is developed within a causal inference framework, and in the context of matched observational studies. More specifically, we split the sample into two parts. In the first subsample, we let the data discover “promising” subgroups with air pollution effects that differ from the population mean. Here, we apply two machine learning algorithms to discover vulnerable subgroups: (a) the classification and regression tree (CART) method proposed by Breiman et al. (1984), and (b) the Causal Tree (CT) method proposed by Athey and Imbens (2016), which uses different criteria for constructing partitions. There are more sophisticated techniques, such as Random Forests; however, we use tree methods for their clearer interpretability. Results from tree-based methods can be easily explained to non-experts, and thus can be used directly to inform regulatory policy.
In the second subsample, we develop randomization-based hypothesis tests to confirm whether there is evidence that exposure effects for the newly discovered subgroups are statistically significantly different from the population average causal effect. The null hypothesis is that each subgroup’s exposure effect is not different from the population mean. Rejecting the null hypothesis can provide statistical evidence and practical guidance for future researchers to look more closely at a particular subgroup. Note that our definition of heterogeneous exposure effects differs from the more common definition, used in previous studies, that considers deviation from the null effect (Hsu et al., 2013; Lee et al., 2018). Furthermore, we develop a sensitivity analysis by generalizing randomization inference to assess the impact of unmeasured confounding bias on causal conclusions. In addition to hypothesis testing, we propose a method that accounts for unmeasured bias both in estimating the population mean and in discovering population subgroups.
The outline of the rest of this paper is as follows. In Section 2, we introduce notation and assumptions. In Section 3, we present the methodological approach and theoretical results for continuous and binary outcomes. In Section 4, we illustrate the performance of the proposed method in several simulated situations. In Section 5, we apply this new approach to an observational study of the effect of long-term PM2.5 exposure on mortality for the Medicare population in the New England region of the United States from 2000 to 2006, and identify subgroups that have higher (or lower) mortality rates than the population mean. Section 6 concludes with a discussion in which we review and compare our approach with other existing methods.
2. Notation and review of observational studies
2.1. Notation For Stratified Randomized Experiments
Consider a stratified randomized experiment with groups g = 1, …, G. Within group g, let gi denote the i-th stratum, i = 1, …, Ig, containing ngi observations, and let gij be the j-th individual in stratum gi. Within gi, we assume that mgi individuals receive the treatment and ngi − mgi individuals receive the control, with min{mgi, ngi − mgi} = 1. For simplicity, we assume mgi = 1, which means that each stratum gi has only one treated individual. Let Zgij be a binary treatment assignment; if individual gij receives the treatment, Zgij = 1, and otherwise Zgij = 0. For each gi, the sum of the Zgij is one, Σj=1,…,ngi Zgij = 1. Under the potential outcome framework with a binary treatment, each individual has two potential outcomes (Neyman, 1990; Rubin, 1974): one under treatment, rTgij, and the other under control, rCgij. Only one of the two potential outcomes can be observed, according to Zgij; thus the individual treatment effect, rTgij − rCgij, cannot be observed. The individual exhibits the observed response Rgij = rTgijZgij + rCgij(1 − Zgij). Within pair gi (with ngi = 2), the observed treated-minus-control difference in responses is Ygi = (Zgi1 − Zgi2)(Rgi1 − Rgi2). If the treatment has an additive effect, rTgij − rCgij = τ for all gij, then Ygi = τ + (Zgi1 − Zgi2)(rCgi1 − rCgi2).
Let F = {(rTgij, rCgij, xgij, ugij) : g = 1, …, G; i = 1, …, Ig; j = 1, …, ngi}, where xgij and ugij denote observed and unobserved covariates, and let Ω be the set containing all possible values z of the treatment assignment vector Z. Write |A| for the number of elements in a finite set A. Then, |Ω| = Πg Πi ngi. In a randomized experiment, a treatment assignment Z is randomly chosen from Ω. Therefore, Pr(Z = z | F, Ω) = 1/|Ω|, and Pr(Zgij = 1 | F, Ω) = 1/ngi from the independence between strata. The response Rgij is a random variable due to Z, whereas F is fixed. This randomization enables researchers to make inference for treatment effects in a randomized experiment (Rosenbaum, 2002a, 2017).
2.2. Matching and Observational Studies
The main challenge in observational studies is to remove confounding bias. Matching is a simple and transparent way to adjust for biases due to measured confounders (Stuart, 2010). Roughly speaking, for each treated individual (e.g., a Medicare patient exposed to levels of PM2.5 higher than 12 μg/m3), matching produces a stratum gi composed of that treated individual and one or more untreated individuals, i.e., controls (e.g., Medicare patients exposed to levels of PM2.5 lower than 12 μg/m3), who are as similar as possible to the treated individual in terms of potential confounders.
In this paper, we consider a matched pair design containing only one matched untreated individual for each treated individual (ngi = 2). However, our method can be readily extended to matching with multiple controls. In practice, it is difficult to find a control with exactly the same covariate values, especially for continuous covariates. Instead, we find a control as similar to the targeted treated individual as possible, and then assess how similar the matched pairs are by checking the overall covariate balance. The most common diagnostic for checking balance is the standardized difference; see Rosenbaum (2010) for more detail. The quality of the matched pairs produced by a matching method should be assessed and reported before making causal inference.
Under the assumption of no unmeasured confounding bias, inferences can be made by treating matched sets as stratified randomized experiments (Rosenbaum, 2002b; Hansen, 2004). This assumption implies that the probability of receiving the treatment, πgij = Pr(Zgij = 1|xgij), depends only on the observed covariates xgij, meaning that if two individuals gij and gij′ within the same stratum gi have the same covariates (i.e., xgij = xgij′), then πgij = πgij′. This property also implies Pr(Zgi1 = 1 | F, Ω) = 1/2 within a matched pair, since the two individuals in the pair share the same observed covariates. In addition to the assumption of no unmeasured confounding, another assumption is required: common support for Pr(Zgij = 1|xgij). The common support assumption means that every treated or control individual must have a positive probability of receiving the treatment (and of not receiving it), that is, 0 < Pr(Zgij = 1|xgij) < 1. An individual with probability 1 of receiving the treatment cannot be compared, since no control individual with the same covariates exists.
2.3. Sensitivity to Unmeasured Biases in Observational Studies
In an observational study, matching methods can adjust for measured confounders, however, it might be possible that two individuals gij and gij′ with xgij = xgij′ have different probabilities of receiving the treatment due to different values of an unmeasured confounder ugij ≠ ugij′. In this context, we introduce an approach that builds upon the sensitivity analysis framework proposed by Rosenbaum (2002b). We introduce a sensitivity parameter Γ, and assume that for two individuals (gij and gij′) within the same stratum (gi) with xgij = xgij′ but with ugij ≠ ugij′ the odds of treatment assignment may differ at most by Γ where Γ ≥ 1,
1/Γ ≤ {πgij(1 − πgij′)} / {πgij′(1 − πgij)} ≤ Γ.  (1)
When Γ = 1, the model (1) is equivalent to assuming that there are no unmeasured confounders, and the distribution of treatment assignment is the same as the randomization distribution discussed in Section 2.1. Then, the null hypothesis can be tested by using randomization inference, and the P-value can be obtained as a point estimate. When Γ > 1, the resulting distribution cannot be explicitly obtained, but bounds for Pr(Z = z | F, Ω) that are controlled by Γ can be obtained. For example, in the context of a matched pair design, the probability that the first individual in pair gi is treated is bounded as 1/(1 + Γ) ≤ Pr(Zgi1 = 1 | F, Ω) ≤ Γ/(1 + Γ). For each Γ > 1, randomization inference produces an interval of P-values instead of a point estimate.
For Γ > 1, a sensitivity analysis can be conducted by considering the upper bound of the P-value interval. If the upper bound is still less than a significance level α, the null hypothesis can be rejected even in the presence of unmeasured confounders, since the worst-case P-value is less than α. The P-value interval becomes wider as Γ increases, and at some point it contains α. When α is located in the middle of the interval, the P-value interval is uninformative, and the null hypothesis cannot be rejected. In practice, we report the largest value of Γ for which the upper bound P-value is less than α; a larger value indicates that the conclusion is more robust to unmeasured confounding. An approximation of the upper P-value bound can be used; see Gastwirth et al. (2000) for more detailed discussion.
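To make the computation concrete, the following sketch (ours, not the authors' code) computes the worst-case one-sided P-value for Wilcoxon's signed rank statistic under the sensitivity model, using the large-sample Normal approximation of the kind discussed by Gastwirth et al. (2000); the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm, rankdata

def sensitivity_upper_pvalue(y, gamma):
    """Worst-case (upper bound) one-sided P-value for Wilcoxon's signed
    rank statistic applied to matched-pair differences y, under the
    sensitivity model with parameter gamma >= 1."""
    y = np.asarray(y, dtype=float)
    y = y[y != 0]                          # drop zero differences
    ranks = rankdata(np.abs(y))            # ranks of |Y_gi|
    t_obs = ranks[y > 0].sum()             # observed signed rank statistic
    p_plus = gamma / (1.0 + gamma)         # worst-case probability of a + sign
    mu = p_plus * ranks.sum()              # mean of the bounding statistic
    var = p_plus * (1 - p_plus) * np.sum(ranks ** 2)
    return norm.sf((t_obs - mu) / np.sqrt(var))
```

At Γ = 1 this reduces to the usual randomization-based P-value; reporting the largest Γ whose upper bound stays below α gives the robustness measure described above.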
3. The De Novo Method: A Combined Exploratory and Confirmatory Method
In this section, we propose a novel de novo method for discovering and analyzing heterogeneous causal effects. To apply the de novo method, we use matched pair data by assuming that matching successfully eliminates measured confounding bias. By using a sample-splitting approach discussed in Section 3.3, matched pair data can be split into (i) one subsample for discovering effect modification structure as a tree and (ii) the other subsample for making inference based on the discovered tree (Section 3.2).
3.1. The Null Hypothesis with a Nuisance Parameter
Suppose that the outcome is continuous. Let τ be the population average treatment effect. We propose a hypothesis testing approach to identify subgroups with heterogeneous causal effects by considering the null hypothesis H0 : rTgij − rCgij = τ for all g, i, j. This null hypothesis is Fisher’s sharp null hypothesis, but with an additive constant effect τ. When there is no heterogeneity of the causal effects at all, every individual gij has the same constant causal effect, rTgij − rCgij = τ. Therefore, if H0 is rejected, there is evidence that some individuals have effects different from the population mean. When τ is a known, fixed value, the null hypothesis H0 can be tested using randomization inference by imputing the missing potential outcomes.
In the following subsection, we assume a known value of τ and describe a level-α one-sided testing procedure against the alternative hypothesis that the treatment effects rTgij − rCgij are larger than τ. A level-α two-sided test can be easily constructed by applying the procedure twice at level α/2: once to test the positive direction (i.e., larger than τ) and once to test the negative direction (i.e., smaller than τ). However, τ is unknown in practice, and is a nuisance parameter that has to be estimated.
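As an illustration (a minimal sketch, not the authors' implementation), the sharp null with a known τ can be tested by randomization inference on the matched-pair differences: under H0, Ygi − τ is symmetric under random sign flips within pairs.

```python
import numpy as np

def randomization_pvalue(y, tau, n_perm=10000, seed=0):
    """One-sided Monte Carlo randomization P-value for the sharp null
    H0: rT - rC = tau, given matched-pair differences y."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(y, dtype=float) - tau           # differences under H0
    t_obs = adj.sum()                                # mean-type test statistic
    signs = rng.choice([-1.0, 1.0], size=(n_perm, adj.size))
    t_null = (signs * np.abs(adj)).sum(axis=1)       # sign-flip distribution
    return (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)
```

A two-sided version at level α follows by running this once against each direction at level α/2, as described above.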
3.2. Testing the Null Hypothesis With a Given Tree Structure
Suppose we are given a tree that partitions the total sample. Assume that there are G terminal nodes and that each subgroup g represents a terminal node in the partition. To utilize the structure of trees, we trace back how the partition is built: when growing a tree, a certain internal node is chosen and split into two subsequent nodes. Since there are G terminal nodes, the total number of nodes is 2G − 1. Excluding the initial (root) node, we consider G − 2 internal nodes and G terminal nodes. Write Π = (Π1, …, Π2G−2)⊤ for these 2G − 2 nodes in total. To illustrate Π, consider a simple example tree with two binary variables, male and young, shown in Figure 1. The first split is made on the male variable, and the second split is made, for males, on the young variable. There are three terminal nodes, from left to right: (1) female, (2) old male, and (3) young male. The single internal node besides the root, which corresponds to the male sample, is the union of old male and young male; we simply denote this internal node as 2 ∪ 3. Technically, the total sample 1 ∪ 2 ∪ 3 is also an internal node, but it is not considered since the treatment effect within 1 ∪ 2 ∪ 3 is just the population average. In this example, the tree can be represented by Π = (1, 2, 3, 2 ∪ 3)⊤.
Figure 1: An example tree Π.
With G terminal nodes, G − 2 internal nodes are additionally considered for de novo discovery of population subgroups with heterogeneous causal effects. This may seem counter-intuitive, since more comparisons imply paying a higher price for multiple testing. However, the inclusion of the internal nodes has several beneficial aspects. First, some of the terminal nodes may have a small number of matched pairs, which leads to a lack of power for detecting effect modification; including the internal nodes can compensate for this lack of power. Second, when Π is much more complicated than the true structure, considering only the terminal nodes is misleading. This is especially important when Π is not given and has to be estimated: overfitting a tree leads to an unnecessarily complex structure, but including internal nodes can correct this problem.
We consider a test statistic Tg = Σi Σj Zgij qgij for the terminal node g, where qgij is a function of the responses within gi. Under the null H0 : rTgij − rCgij = τ, Rgij and qgij are fixed by conditioning on F. The comparison vector of (2G − 2) test statistics is constructed from Tg by using the (2G − 2) × G conversion matrix C. The matrix C creates the (2G − 2) correlated test statistics from the mutually independent statistics Tg; see Lee et al. (2018) for a discussion of the conversion matrix in a factorial design. To illustrate the matrix C, let us revisit the example shown in Figure 1. There are four nodes in total, and the matrix C can be constructed as

C = [ 1 0 0
      0 1 0
      0 0 1
      0 1 1 ].
The last row represents the internal node indicating the male subgroup 2 ∪ 3, and the first three rows represent the terminal nodes 1, 2, and 3. Now, let T = (T1, …, TG)⊤ be the vector of test statistics for the terminal nodes. Then, the (2G − 2) test statistics for all nodes, S = (S1, …, S2G−2)⊤, can be obtained as S = CT. In the above example, S = (T1, T2, T3, T2 + T3)⊤ is the comparison vector corresponding to Π = (1, 2, 3, 2 ∪ 3)⊤.
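For the Figure 1 example, constructing C and forming S = CT is straightforward; the numbers in T below are made up purely for illustration.

```python
import numpy as np

# Rows of C correspond to the nodes Pi = (1, 2, 3, 2 U 3);
# columns correspond to the terminal nodes 1 (female),
# 2 (old male), and 3 (young male).
C = np.array([
    [1, 0, 0],   # terminal node 1
    [0, 1, 0],   # terminal node 2
    [0, 0, 1],   # terminal node 3
    [0, 1, 1],   # internal node 2 U 3 (male)
])

T = np.array([1.2, -0.3, 2.1])   # illustrative terminal-node statistics
S = C @ T                        # comparison vector for all 2G - 2 nodes
```

Here S = (T1, T2, T3, T2 + T3)⊤, matching the comparison vector in the text.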
Under the null hypothesis H0, randomization inference gives the exact null distribution of the test statistic Tg for Γ = 1. However, for Γ > 1, the exact distribution of Tg cannot be obtained, but it is bounded above by a bounding statistic T̄g with expectation μΓg and variance νΓg. A large sample approximation can be applied to the joint bounding distribution of T̄ = (T̄1, …, T̄G)⊤ when G is fixed. Write μΓ = (μΓ1, …, μΓG)⊤ and VΓ for the G × G diagonal matrix with g-th diagonal element νΓg. Under H0, with mild regularity conditions, the joint distribution of T̄ converges to a multivariate Normal distribution N(μΓ, VΓ). Specifically, the test statistic can be represented as Tg = Σi sgn(Ygi) hgi, where sgn(Ygi) = 1 if Ygi > 0 and sgn(Ygi) = 0 otherwise, and hgi is a function of |Ygi|. For instance, if hgi is the rank of |Ygi|, then Tg is Wilcoxon’s signed rank test statistic. Under the sensitivity analysis model (1), the bounding statistic is T̄g = Σi c̄gi hgi, where the bounding random variable c̄gi is 1 with probability Γ/(1 + Γ) and 0 with probability 1/(1 + Γ) (Rosenbaum, 2002b). From the above specification of c̄gi and hgi, μΓg = {Γ/(1 + Γ)} Σi hgi and νΓg = {Γ/(1 + Γ)²} Σi hgi². The theorem below guarantees a Normal approximation.
Theorem 1 Under H0, given that G is fixed, if, as ming∈{1,…,G}(Ig) → ∞,

max1≤i≤Ig hgi² / Σi=1,…,Ig hgi² → 0 for each g ∈ {1, …, G},  (2)

then the random vector VΓ^(−1/2)(T̄ − μΓ) converges in distribution to the G-dimensional normal distribution N(0, IG).
The condition (2) is the condition for the convergence of each univariate bounding statistic T̄g. Therefore, (2) implies that if each distribution of T̄g can be approximated by N(μΓg, νΓg), then the distribution of T̄ can be approximated by N(μΓ, VΓ). The proof of Theorem 1 is given in the supplementary materials.
Define θΓ = CμΓ and ΣΓ = CVΓC⊤, noting that ΣΓ is not typically diagonal. Write θΓk for the k-th coordinate of θΓ and σΓk² for the k-th diagonal element of ΣΓ. Define DΓk = (Sk − θΓk)/σΓk and DΓ = (DΓ1, …, DΓ,2G−2)⊤. Finally, write ρΓ for the (2G − 2) × (2G − 2) correlation matrix formed by dividing the element of ΣΓ in row k and column k′ by σΓkσΓk′. For Γ > 1, as T is bounded by T̄, which converges to the Normal distribution N(μΓ, VΓ), the deviate vector DΓ is bounded by the Normal distribution N(0, ρΓ). Write ΦρΓ(x) for the probability of the (2G − 2)-dimensional lower orthant (−∞, x] × … × (−∞, x] under N(0, ρΓ), and DΓmax for the maximum deviate among DΓ, i.e., DΓmax = max1≤k≤2G−2 DΓk. For Γ ≥ 1, consider the following testing procedure,
Reject H0 if DΓmax ≥ κΓ,α,  (3)
where κΓ,α is the critical value at level α that satisfies ΦρΓ(κΓ,α) = 1 − α. The value κΓ,α can be obtained using the qmvnorm function in the mvtnorm package in R; see Genz and Bretz (2009). The procedure (3) takes the maximum deviate, which is governed by the orthant probability ΦρΓ. Therefore, a sensitivity analysis using this procedure has the correct level when H0 is true in large samples, because Pr(DΓmax ≥ κΓ,α) ≤ α; see Proposition 1 of Rosenbaum (2012) for the proof of this argument.
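In Python, κΓ,α can be approximated by bisection on the lower-orthant probability of N(0, ρΓ), a stand-in for qmvnorm (note that scipy's multivariate Normal CDF uses a randomized quadrature, so the result is accurate only to a few decimal places; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def critical_value(rho, alpha=0.05, lo=0.0, hi=6.0, tol=1e-4):
    """Find kappa with Pr{N(0, rho) in (-inf, kappa]^k} = 1 - alpha,
    i.e. the level-alpha critical value for the maximum deviate."""
    k = rho.shape[0]
    mvn = multivariate_normal(mean=np.zeros(k), cov=rho, seed=0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mvn.cdf(np.full(k, mid)) < 1 - alpha:
            lo = mid           # orthant probability too small: raise kappa
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For independent deviates the answer has a closed form, Φ(κ)^k = 1 − α, which gives a quick sanity check of the search.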
It is important to note that the procedure (3) is designed to test the null hypothesis H0 : rTgij − rCgij = τ for a known τ. In practice, we do not know τ, but we can estimate it with a (1 − η) level confidence interval, denoted CI(1 − η). When τ is unknown, we consider DΓ,max(τ) as a function of τ and define the minimum of the maximum deviate, DΓ,minmax,
DΓ,minmax = minτ∈CI(1−η) DΓ,max(τ).  (4)
Since τ is unknown, we take the minimum of DΓ,max(τ) over the confidence interval CI(1 − η), which leads to DΓ,minmax ≤ DΓ,max(τ*) whenever the true value τ* lies in CI(1 − η). The critical value κΓ,α does not depend on τ for many test statistics, such as Wilcoxon’s signed rank test. Therefore, DΓ,minmax can be directly compared with κΓ,α. In cases where κΓ,α depends on τ, the maximum critical value across CI(1 − η) can be used.
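A sketch of computing DΓ,minmax by a grid search over the confidence interval (our illustrative implementation, using the large-sample Normal approximation of the signed rank deviates from Section 3.2; function names are ours):

```python
import numpy as np
from scipy.stats import rankdata

def max_deviate(groups, tau, gamma=1.0):
    """Maximum standardized deviate over subgroups for H0: effect = tau,
    using the bounding Normal approximation of the signed rank statistic."""
    p = gamma / (1.0 + gamma)
    devs = []
    for y in groups:                       # y: pair differences in one node
        adj = np.asarray(y, dtype=float) - tau
        r = rankdata(np.abs(adj))
        t = r[adj > 0].sum()
        mu, var = p * r.sum(), p * (1 - p) * np.sum(r ** 2)
        devs.append((t - mu) / np.sqrt(var))
    return max(devs)

def minmax_deviate(groups, ci, gamma=1.0, n_grid=101):
    """D_minmax: minimize the maximum deviate over tau in CI(1 - eta)."""
    grid = np.linspace(ci[0], ci[1], n_grid)
    return min(max_deviate(groups, t, gamma) for t in grid)
```

By construction, the value returned is no larger than the maximum deviate at any single τ on the grid, including the true τ* when it falls in the interval.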
3.3. Sample Splitting and Discovering a Tree
In the previous subsections, we assumed that a tree Π is given for hypothesis testing, but in reality Π is unknown. We divide the data into two subsamples: one for building a tree and one for making inferences. The first subsample is used to estimate the tree. We apply CART and CT; see Breiman et al. (1984) and Zhang and Singer (2010) for discussion of CART, and Athey and Imbens (2016) for discussion of CT. These two approaches are both designed to discover tree structures, but they use different criteria for constructing the partition and for cross-validation. We compare these approaches for different sample splitting ratios in Section 4. The second subsample is used for conducting hypothesis tests using the testing procedures discussed in Section 3.2. Our proposed testing procedure (3) relies on a large-sample approximation that holds for a fixed, known G, but G is unknown in practice. We therefore specify a priori the maximum depth of the tree, which can be large but must be substantially smaller than the sample size. We elaborate on the adequacy of this assumption in the discussion.
Using the second subsample, hypothesis testing can further trim the discovered tree and find the hidden structure within it. One may be concerned that sample splitting leads to a loss of power for discovering heterogeneous subgroups, but there is a significant benefit that offsets the loss. Without splitting, the power to detect heterogeneity decreases as the number of potential effect modifiers increases. By using part of the data to discover a tree, a handful of important covariates can be selected first, so the loss of power can be minimized in high-dimensional settings.
3.4. Testing Subgroup-specific Null Hypotheses
Our primary interest is to test the null hypothesis H0 : rTgij − rCgij = τ for all g ∈ {1, …, G}, where τ is the population mean; H0 is a test for effect modification in the whole population. However, testing the null hypothesis H0g for a particular subgroup may also be of interest. Rejecting H0g implies that the corresponding subgroup has a treatment effect significantly different from the population mean. Just as the test statistic DΓmax is used for testing H0, to test H0g we consider a test statistic with respect to the subtree Πsub, a sub-vector of Π that contains all subgroups included in the targeted subgroup. Let K be the index set that indicates the inclusion of Πsub in Π; K is a subset of {1, …, 2G − 2}. The new test statistic is defined by focusing only on the deviates DΓk for k ∈ K. Since the number of considered deviates is reduced, a new critical value must be computed. The computation can be done by using the sub-correlation matrix that contains the (k, k′) elements of ρΓ for k, k′ ∈ K. In particular, when K corresponds to a single subgroup g, the critical value can be easily computed as Φ−1(1 − α/2), where Φ(·) is the cumulative distribution function of the standard Normal distribution. The subtree test statistic is then compared to this new critical value.
3.5. Binary Outcome
When outcomes are binary, the individual treatment effect is δgij = rTgij − rCgij, and δgij is an element of {−1, 0, 1}. The average treatment effect δ is the average difference between the two potential outcomes across all individuals. An unbiased estimator of δ is the average of the estimated average treatment effects within the strata gi. Also, instead of testing H0 : rTgij − rCgij = τ, which is defined for continuous outcomes, we consider a test of the null hypothesis H0 : δ = δ0. Since δgij can be either −1, 0, or 1, the considered δ0 has to be a member of the grid of values attainable as an average of elements of {−1, 0, 1}. The null H0 in fact tests the set of all configurations of individual effects whose average equals δ0, which means that rejecting H0 rejects every configuration in that set.
We adopt the testing method of Fogarty et al. (2016) for binary outcomes, and combine it with the de novo method. For each terminal node g, a test statistic Tg is constructed from the stratum-level estimates of the treatment effect and standardized by its estimated standard error, which aggregates the variance contribution of each stratum gi to var(Tg); we collect these in the test statistic vector T = (T1, …, TG)⊤. For each g, the standardized Tg has an approximately standard Normal distribution N(0, 1). For Γ = 1, Fogarty et al. (2016) propose a method based on randomization inference. To account for worst-case biases, it uses integer programming to find the maximal variance; see Section 5 and Theorem 1 in Fogarty et al. (2016) for more details. For Γ > 1, a similar approach can be used, but it requires more complicated computations, solving an integer quadratic program; see Fogarty et al. (2017) for the detailed computation. The rest of our proposed procedure is the same as the method in Section 3.2. For instance, a tree can be discovered with CART by regressing the stratum-level effect estimates on the covariates in the first subsample obtained from sample splitting. From the second subsample, the (1 − η) level confidence interval for δ can be constructed by inverting hypothesis tests. Then, the proposed testing procedures (3) and (4) with a level α can be applied.
In applying the above approach, however, difficulties arise due to the discreteness of δ0. When a tree is considered, each terminal node has a different sample size, which causes incompatible hypothesis tests. To illustrate this, consider a simple example with 10 matched pairs (N = 20) using the tree in Figure 1. Suppose that the female subgroup has 5 matched pairs (Nfemale = 10). The null for the entire sample tests whether δentire = δ0, where δ0 must be one of {−20/20, −19/20, …, 19/20, 20/20}. To discover effect modification, we ultimately want to test H0 : δentire = δfemale(= δold male = δyoung male). This implies that a test of δfemale = δ0 should be considered. However, this test is incompatible with, for instance, δ0 = 3/20, because δfemale can be tested only for the values (−10/10, −9/10, …, 10/10). A remedy is to take the two closest compatible values that bracket the incompatible value, conduct hypothesis tests for these two values, and take the larger P-value. For δ0 = 3/20, when testing the female group, the two closest compatible values are 1/10 and 2/10. This fix is slightly conservative; however, as the number of matched pairs Ig increases, the grid of compatible δ0 becomes finer, and the obtained P-value converges to the true value. Technically, for each δ0, let δ0g− and δ0g+ be the two closest compatible values for subgroup g.
Theorem 2 Under H0 : δg = δ0, if the variance of Tg diverges as Ig → ∞, then δ0g− → δ0 and δ0g+ → δ0, so the P-values obtained at the two closest compatible values converge to the P-value at δ0.
The proof of Theorem 2 is given in the supplementary materials.
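The bracketing step can be written down directly: compatible values of the subgroup average effect are integer multiples of 1/Ng, so the two closest compatible values around an incompatible δ0 are obtained by flooring and ceiling (a small helper of ours, not from the paper):

```python
import math

def closest_compatible(delta0, n_g):
    """Return the two closest values of the subgroup average treatment
    effect compatible with a subgroup of n_g individuals; compatible
    values are integer multiples of 1/n_g."""
    lo = math.floor(delta0 * n_g) / n_g
    hi = math.ceil(delta0 * n_g) / n_g
    return lo, hi
```

For the example in the text, δ0 = 3/20 with Nfemale = 10 gives the bracketing values 1/10 and 2/10; the larger of the two resulting P-values is then reported.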
4. Simulation
In this section, we use simulations to evaluate the performance of the de novo method under various settings. We consider three main factors that may affect the performance: (1) the choice of tree algorithm, (2) the splitting ratio (first, second), and (3) the degree of heterogeneity of the group-specific causal effects with respect to the population average. First, as discussed in Section 3.3, the CART and CT approaches are compared. Second, the ratio of the first subsample to the second may affect the performance: if we invest too much in the first discovery step, we lose power for detecting heterogeneity; on the other hand, if we invest too little, some important structure may not be discovered, also resulting in a loss of power. We consider three ratios: (10%, 90%), (25%, 75%), and (50%, 50%). Finally, we examine the performance according to the amount of heterogeneity; if the heterogeneity is small, the de novo method may not have enough power to detect it. All simulation settings assume the absence of unmeasured confounding.
We consider two simulation studies, one with continuous outcomes and the other with binary outcomes. For both studies, we set N = 4000 with 2000 matched pairs and a true effect size of 0.5 on average. Also, we consider five covariates, x1, …, x5, and assume that at most two of them, say x1 and x2, lead to heterogeneity (i.e., are true effect modifiers). For continuous outcomes, suppose that an individual with covariate values x1 = a, x2 = b has a treatment effect drawn from a Normal distribution with mean τab. Define τ = (τ00, τ01, τ10, τ11). We consider five situations: (1) τ = (0.4, 0.4, 0.6, 0.6), (2) τ = (0.3, 0.3, 0.7, 0.7), (3) τ = (0.4, 0.4, 0.5, 0.7), (4) τ = (0.3, 0.3, 0.6, 0.8), and (5) τ = (0.2, 0.5, 0.5, 0.8). For example, the first situation, τ = (0.4, 0.4, 0.6, 0.6), means that there is small effect modification by x1 but not x2. The third situation, τ = (0.4, 0.4, 0.5, 0.7), means that there is small effect modification by both x1 and x2. Wilcoxon’s signed rank test is used for continuous outcomes. Similarly, for binary outcomes, suppose that an individual treatment effect has a Binomial distribution with mean δab for x1 = a, x2 = b, and define δ = (δ00, δ01, δ10, δ11). We again consider five situations: (1) δ = (0.45, 0.45, 0.55, 0.55), (2) δ = (0.4, 0.4, 0.6, 0.6), (3) δ = (0.45, 0.45, 0.5, 0.6), (4) δ = (0.4, 0.4, 0.5, 0.7), and (5) δ = (0.35, 0.5, 0.5, 0.65).
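To make the data-generating design concrete, here is a minimal sketch for the continuous-outcome case (the within-cell spread of the effects and the pair-level noise level are our assumptions for illustration; the paper's exact values may differ):

```python
import numpy as np

def simulate_pairs(tau, n_pairs=2000, effect_sd=0.1, noise_sd=1.0, seed=0):
    """Matched-pair differences under effect modification by two binary
    covariates x1, x2; tau = (tau00, tau01, tau10, tau11)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_pairs, 2))          # (x1, x2) per pair
    cell = 2 * x[:, 0] + x[:, 1]                       # 0..3 indexes tau_ab
    effect = rng.normal(np.asarray(tau)[cell], effect_sd)
    y = effect + rng.normal(0.0, noise_sd, n_pairs)    # treated-minus-control
    return x, y
```

With x1 and x2 balanced, each of the five τ configurations above averages to the overall effect size of 0.5.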
Table 1 reports the simulated power for each situation for both continuous and binary outcomes. The upper part of the table shows the simulated power for continuous outcomes. As expected, when there is a small amount of heterogeneity, as in the first and third situations, both the CT and CART methods produce low power for all three splitting ratios. However, when there is moderate or large heterogeneity, the power of the test is much higher. The CT method generally has higher power than the CART method. Also, the CT method performs best with the (25%, 75%) ratio, whereas the CART method performs best with the (50%, 50%) ratio. If the size of the first subsample is small, the CART method is highly likely to produce a conservative tree. The CT method, on the other hand, accounts for the size of the second subsample and performs a more exploratory search for tree structures, although it produces false discoveries with higher probability.
Table 1:
Simulated power (from 10,000 replications) for hypothesis tests to discover heterogeneous subgroups. The upper table is for continuous outcomes and the lower table is for binary outcomes. The true effect size is 0.5 on average for the entire population and the sample size is 4000 with 2000 matched pairs.
| | Degree of heterogeneity (x1, x2) | τ or δ | (10%, 90%) CT | (10%, 90%) CART | (25%, 75%) CT | (25%, 75%) CART | (50%, 50%) CT | (50%, 50%) CART |
|---|---|---|---|---|---|---|---|---|
| Continuous outcomes | | τ = (τ00, τ01, τ10, τ11) | | | | | | |
| 1 | (Small, No) | (0.4, 0.4, 0.6, 0.6) | 0.09 | 0.04 | 0.10 | 0.05 | 0.07 | 0.06 |
| 2 | (Large, No) | (0.3, 0.3, 0.7, 0.7) | 0.55 | 0.35 | 0.85 | 0.70 | 0.82 | 0.84 |
| 3 | (Small, Small) | (0.4, 0.4, 0.5, 0.7) | 0.11 | 0.05 | 0.14 | 0.07 | 0.09 | 0.07 |
| 4 | (Large, Small) | (0.3, 0.3, 0.6, 0.8) | 0.54 | 0.36 | 0.85 | 0.70 | 0.83 | 0.84 |
| 5 | (Moderate, Moderate) | (0.2, 0.5, 0.5, 0.8) | 0.49 | 0.31 | 0.73 | 0.51 | 0.63 | 0.57 |
| Binary outcomes | | δ = (δ00, δ01, δ10, δ11) | | | | | | |
| 1 | (Small, No) | (0.45, 0.45, 0.55, 0.55) | 0.05 | 0.02 | 0.05 | 0.03 | 0.04 | 0.03 |
| 2 | (Large, No) | (0.40, 0.40, 0.60, 0.60) | 0.61 | 0.33 | 0.79 | 0.69 | 0.76 | 0.78 |
| 3 | (Small, Small) | (0.45, 0.45, 0.50, 0.60) | 0.07 | 0.02 | 0.07 | 0.04 | 0.05 | 0.04 |
| 4 | (Large, Small) | (0.40, 0.40, 0.50, 0.70) | 0.68 | 0.38 | 0.90 | 0.77 | 0.90 | 0.89 |
| 5 | (Moderate, Moderate) | (0.35, 0.50, 0.50, 0.65) | 0.53 | 0.25 | 0.65 | 0.47 | 0.54 | 0.50 |
In the supplementary materials, Table 5 summarizes the discovery rates. In the first situation with the (50%, 50%) ratio, we found that the sole true effect modifier, x1, is discovered by the CT method with probability 0.54 and by the CART method with probability 0.40. The CT method falsely discovers other covariates with probability 0.17, whereas the CART method does so with probability 0.07. Although the CT method has a higher false discovery rate than CART, falsely discovered partitions are subsequently tested using the second subsample and are eventually trimmed. The simulated power for binary outcomes is shown in the lower part of Table 1. As in the upper part, the CT method also generally outperforms the CART method for binary outcomes. In the analysis of our study, discussed in the next section, we use the (25%, 75%) ratio for sample-splitting since this ratio shows the best compromise (measured by power of the test) between discovery and confirmation of effect modification.
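The split-discover-test pipeline evaluated in this section can be sketched as follows, using scikit-learn's CART implementation as a stand-in for the discovery step. The data-generating values and the `min_samples_leaf` setting are our assumptions, not the paper's, and the held-out test compares each discovered leaf against the overall average effect.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Simulated matched-pair differences: x1 is the sole true effect modifier.
n_pairs = 2000
X = rng.integers(0, 2, (n_pairs, 5)).astype(float)
diff = 0.3 + 0.4 * X[:, 0] + rng.normal(0.0, 1.0, n_pairs)

# (25%, 75%) split: the tree is grown on the first subsample only.
n1 = n_pairs // 4
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100)
tree.fit(X[:n1], diff[:n1])

# Each discovered leaf is then tested on the held-out second subsample,
# comparing the leaf's effect with the overall average effect.
leaves = tree.apply(X[n1:])
pop_avg = diff[n1:].mean()
pvals = {}
for leaf in np.unique(leaves):
    d = diff[n1:][leaves == leaf]
    pvals[leaf] = wilcoxon(d - pop_avg).pvalue
```

Because the leaves are chosen on the first subsample and tested on the second, the per-leaf P-values are not contaminated by the search over tree structures.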
5. Causal Effect of Exposure to PM2.5 on 5-year Mortality in New England
We consider 1,612,414 beneficiaries who entered the Medicare cohort on January 1, 2002 (the reference date). For each enrollee, we calculate exposure to PM2.5 during the two years prior to entry into the cohort, that is, from January 1, 2000 to December 31, 2001.
The two-year average of PM2.5 is obtained on a continuous scale. We create a binary treatment variable using a cutoff value of 12 μg/m3 based on the national ambient air quality standard. Among the 1,612,414 individuals, there are 584,374 treated individuals (i.e., PM2.5 > 12 μg/m3) and 1,028,040 controls (i.e., PM2.5 ≤ 12 μg/m3). We note that the level of PM2.5 is estimated at the centroid of each ZIP code; individuals living in the same ZIP code area share the same value of PM2.5 and thus the same exposure. We use previously published and validated methods to estimate PM2.5 levels; see Di et al. (2016) for details of the exposure estimation methods.
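For illustration, dichotomizing the continuous exposure at the 12 μg/m3 standard amounts to the following; the PM2.5 values below are hypothetical, not from the Medicare data.

```python
import numpy as np

# Hypothetical two-year average PM2.5 (μg/m3) at five ZIP-code centroids.
pm25_two_year_avg = np.array([10.8, 12.4, 11.9, 13.1, 9.6])

# Treated if the two-year average exceeds the national standard of 12 μg/m3;
# everyone in the same ZIP code receives the same treatment label.
treated = pm25_two_year_avg > 12.0
print(treated.sum(), (~treated).sum())  # → 2 3
```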
The outcome is death by the end of the study, December 31, 2006. The outcome is 1 if he/she died at any point before this date and 0 otherwise. In addition to the exposure and the outcome, we consider both individual-level covariates and ZIP code-level covariates. All covariates are measured in 2001 before the reference date. Table 2 displays a summary of the treated and control populations for individual-level and ZIP code-level covariates. Before matching, among the treated individuals we observed a high percentage of Medicaid eligible (thus poorer) individuals, smaller percentage of male and smaller percentage of white.
Table 2:
Summary statistics and covariate balance before and after matching.
| Summary Statistics | Standardized Differences | ||||
|---|---|---|---|---|---|
| Covariates | Treated | Control (Before) | Control (After) | Before | After |
| Individual-level | |||||
| Male (%) | 38.5 | 39.9 | 38.5 | −0.02 | 0.00 |
| White (%) | 92.8 | 96.9 | 92.8 | −0.19 | 0.00 |
| Medicaid Eligible (%) | 10.8 | 9.1 | 10.8 | 0.05 | 0.00 |
| Age (Group, 1–5) | 2.6 | 2.6 | 2.6 | 0.02 | 0.00 |
| Age (65–107) | 76.3 | 76.1 | 76.3 | 0.02 | 0.00 |
| ZIP code-level | |||||
| Temperature (°C) | 10.4 | 9.8 | 10.3 | 0.55 | 0.06 |
| Relative Humidity (%) | 76.1 | 76.9 | 76.1 | −0.44 | 0.01 |
| BMI (%) | 26.1 | 26.3 | 26.1 | −0.44 | −0.06 |
| Smoker Rate (%) | 49.9 | 52.6 | 49.7 | −0.72 | 0.07 |
| Black Population (%) | 6.2 | 3.2 | 6.0 | 0.33 | 0.03 |
| Median Household Income (1000s of $) | 56.1 | 53.8 | 56.7 | 0.10 | −0.03 |
| Median Value of Housing (1000s of $) | 207.5 | 184.8 | 205.9 | 0.20 | 0.01 |
| Below Poverty Level (%) | 8.3 | 9.1 | 8.3 | −0.09 | 0.01 |
| Below High School Education (%) | 30.6 | 30.1 | 30.2 | 0.03 | 0.03 |
| Owner-occupied Housing (%) | 62.9 | 68.9 | 62.7 | −0.33 | 0.01 |
| Population Density (log-scale) | −6.9 | −8.1 | −7.0 | 0.89 | 0.06 |
To adjust for measured confounders and discover heterogeneity, we use a matching method that produces exact matched pairs on four individual-level covariates: white, male, Medicaid eligibility, and age group. To obtain pairs exactly matched on these covariates, the age variable is transformed into five categories (1: 65–70, 2: 71–75, 3: 76–80, 4: 81–85, and 5: above 85). The dataset is stratified into 40 = 2 × 2 × 2 × 5 strata according to the levels of the individual-level covariates. Within each stratum, the ZIP code-level covariates are matched as closely as possible. Matching can be performed using the Optmatch R package. To achieve better covariate balance, we randomly select about 20% of the treated individuals from the entire dataset. This allows us to construct 110,091 matched pairs. Covariate balance is shown in Table 2. Since two matched individuals have the same values of the individual-level covariates, the standardized differences for those covariates are zero. The standardized differences of the ZIP code-level covariates lie between −0.06 and 0.07, which indicates no systematic difference between the treated and control populations.
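A hedged sketch of this exact-matching scheme follows; it is not the Optmatch algorithm itself, the data are simulated, and the within-stratum greedy pairing rule is our simplification of "matched as closely as possible."

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated individual-level data: four exact-match covariates, a binary
# treatment indicator, and one standardized ZIP-code-level covariate.
n = 400
white = rng.integers(0, 2, n)
male = rng.integers(0, 2, n)
medicaid = rng.integers(0, 2, n)
agegrp = rng.integers(1, 6, n)          # five age categories
z = rng.integers(0, 2, n)               # treatment indicator
zipcov = rng.normal(0.0, 1.0, n)        # ZIP-code-level covariate

strata = list(zip(white, male, medicaid, agegrp))
pairs = []
for s in set(strata):
    idx = [i for i, st in enumerate(strata) if st == s]
    # Sort treated and control units by the ZIP-level covariate so that
    # greedy pairing matches similar values; zip() drops unmatched units.
    t = sorted((i for i in idx if z[i] == 1), key=lambda i: zipcov[i])
    c = sorted((i for i in idx if z[i] == 0), key=lambda i: zipcov[i])
    pairs += list(zip(t, c))

# Every pair agrees exactly on the four individual-level covariates, so
# their standardized differences are zero by construction.
assert all(strata[i] == strata[j] for i, j in pairs)
```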
To apply the de novo method, we start by dividing the matched pairs into two subsamples with a (25%, 75%) ratio. The first subsample of 27,500 matched pairs is used for discovering heterogeneous subgroups, and the remaining 82,591 matched pairs are used for conducting hypothesis tests. Figure 2 displays the discovered tree with six disjoint subgroups and four combined subgroups (1 ∪ 2, 1 ∪ 2 ∪ 3, 4 ∪ 5, 4 ∪ 5 ∪ 6) from the first subsample. We grew a tree with the restriction that the maximum depth be d = 3 and trimmed it with cross-validation. This restricted the number of terminal nodes (i.e., G) to be at most 2^d = 8. The tree in the figure was obtained from the CT method; the CART method produced a coarser tree with terminal nodes (1 ∪ 2, 3, 4 ∪ 5, 6). Our simulation studies indeed showed that the CART method is slightly more conservative in creating subgroups than the CT method. To examine the stability of the tree shown in Figure 2 with respect to a different subsample, we considered 1000 trees built from 1000 bootstrapped samples of the first subsample. We found that all trees split first on age. Also, they eventually created the same four age subgroups (1 ∪ 2, 3, 4 ∪ 5, 6), although the order of the age splits differed across bootstrapped trees. Subgroup 1 ∪ 2 was further divided by white 62% of the time, and subgroup 4 ∪ 5 was divided by Medicaid eligibility 52% of the time.
Figure 2:

Discovered tree from the first subsample. Actions are represented on edges. Subgroups whose null hypotheses are rejected at a total significance level α + η = 0.05 are represented by solid rectangles; otherwise, represented by dashed rectangles. The point estimates with the 95% confidence intervals for the subgroup causal effects are computed from the second subsample.
Before conducting tests for heterogeneity, we test whether there is any causal effect of being exposed to PM2.5 > 12 μg/m3 on mortality. The obtained matched pairs are used for testing Fisher's hypothesis of no effect. We consider the truncated product method proposed by Hsu et al. (2013) with the six discovered subgroups (1, …, 6) in Figure 2. This method computes upper bounds on the P-values for each of the six subgroups and then combines the P-values using the truncated product of Zaykin et al. (2002). The null hypothesis can be tested using McNemar tests on the second subsample of 82,591 matched pairs. Applying this method, we found that the hypothesis of no effect is rejected, leading to the conclusion that exposure to high-level PM2.5 significantly increases the risk of death. In addition, Table 3 reports the sensitivity analysis with upper bounds on the P-values. At Γ = 1.28, the hypothesis is still rejected at the 0.044 level, but at Γ = 1.29, it is not rejected at the 0.05 level. We conclude that exposure to high-level PM2.5 increased the 5-year mortality rate even in the presence of unmeasured biases up to Γ = 1.28.
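The two ingredients of this test can be sketched as follows: a Rosenbaum-style upper bound on the one-sided McNemar P-value under bias at most Γ, and a Monte Carlo version of the truncated product combination. The subgroup P-values below are hypothetical, and the paper uses the closed-form null distribution of Zaykin et al. (2002) rather than simulation.

```python
import numpy as np
from scipy.stats import binom

def mcnemar_upper_pvalue(t_died, n_discordant, gamma=1.0):
    # Sensitivity bound for the one-sided McNemar test: under bias at most
    # Gamma, the count of discordant pairs in which the treated unit died is
    # stochastically bounded by Binomial(n_discordant, Gamma / (1 + Gamma)).
    p = gamma / (1.0 + gamma)
    return binom.sf(t_died - 1, n_discordant, p)

def truncated_product_pvalue(pvals, trunc=0.2, n_mc=200_000, seed=0):
    # Zaykin-style truncated product: multiply only P-values <= trunc; the
    # null distribution is approximated here by Monte Carlo (a sketch).
    pvals = np.asarray(pvals)
    def stat(p):
        return np.prod(np.where(p <= trunc, p, 1.0), axis=-1)
    w_obs = stat(pvals)
    rng = np.random.default_rng(seed)
    w_null = stat(rng.uniform(size=(n_mc, len(pvals))))
    return np.mean(w_null <= w_obs)

# Hypothetical subgroup P-values (illustrative only, not Table 3's values).
p_sub = [0.56, 0.18, 0.003, 1e-4, 1e-5, 1e-6]
print(truncated_product_pvalue(p_sub) < 0.05)
```

Raising `gamma` in `mcnemar_upper_pvalue` inflates the upper bound on the P-value, which is how the Γ rows of Table 3 are generated in spirit.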
Table 3:
Sensitivity analysis for testing Fisher’s hypothesis of no effect: Upper bounds on P-values for various Γ
| Γ | Subgroup 1 | Subgroup 2 | Subgroup 3 | Subgroup 4 | Subgroup 5 | Subgroup 6 | Truncated Product |
|---|---|---|---|---|---|---|---|
| 1.00 | 0.558 | 0.183 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1.10 | 1.000 | 0.697 | 0.879 | 0.201 | 0.001 | 0.000 | 0.000 |
| 1.20 | 1.000 | 0.965 | 1.000 | 0.989 | 0.024 | 0.000 | 0.000 |
| 1.25 | 1.000 | 0.992 | 1.000 | 1.000 | 0.073 | 0.000 | 0.006 |
| 1.28 | 1.000 | 0.997 | 1.000 | 1.000 | 0.124 | 0.001 | 0.044 |
| 1.29 | 1.000 | 0.998 | 1.000 | 1.000 | 0.146 | 0.003 | 0.072 |
Furthermore, the sensitivity parameter Γ can be represented as a curve of two parameters (Λ, Δ) with Γ = (ΔΛ + 1)/(Δ + Λ); see Rosenbaum and Silber (2009). The parameter Λ describes the relationship between an unmeasured confounder ugij and the treatment assignment Zgij, and the parameter Δ describes the relationship between ugij and the potential outcome rCgij. For example, Γ = 1.28 corresponds to Λ = 2.17 and Δ = 2. To illustrate, consider an unmeasured variable ugij, such as time spent outdoors, that is negatively associated with both the treatment and the outcome. Here, (Λ, Δ) = (2.17, 2) implies that ugij increases the odds of exposure to high-level PM2.5 by 2.17-fold and doubles the odds of death. Our sensitivity analysis shows that the conclusion holds in the presence of any ugij with (Λ, Δ) satisfying (ΔΛ + 1)/(Δ + Λ) ≤ 1.28.
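The amplification map can be checked directly; a minimal sketch:

```python
def gamma_from(lam, delta):
    # Rosenbaum-Silber amplification: each Gamma corresponds to the curve
    # of (Lambda, Delta) pairs with Gamma = (Delta*Lambda + 1)/(Delta + Lambda).
    return (delta * lam + 1.0) / (delta + lam)

# (Lambda, Delta) = (2.17, 2) lies on the Gamma = 1.28 curve; note the map
# is symmetric in its two arguments.
print(round(gamma_from(2.17, 2.0), 2))  # → 1.28
```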
Returning to testing the null hypothesis H0 of no heterogeneity, the second subsample is used for confirming and identifying subgroups with heterogeneous causal effects within the discovered tree structure shown in Figure 2. Since we do not know the true value of the population average δ, we first estimate the 100(1 − η)% confidence interval for δ with η = 0.01, obtaining (1.12%, 2.38%), and test the global null hypothesis H0 of no heterogeneity for each value within this interval. Table 4 shows the ten deviates for the discovered subgroups at various δ0 with Γ = 1. A negative deviate means that the corresponding subgroup causal effect is below the population average, and a positive deviate means the opposite. The critical value at Γ = 1 is κΓ,α = 2.80, obtained from the multivariate Normal distribution with α = 0.04 to achieve a total significance level of α + η = 0.05. At Γ = 1, the maximum absolute deviate DΓ,max is reported in the last column, and the minimum test statistic DΓ,minmax is 7.30, which is larger than κΓ,α = 2.80. This indicates statistically significant evidence of heterogeneity when there is no unmeasured confounding.
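For intuition about the magnitude of κΓ,α, an independence-based approximation for the maximum of G absolute standardized deviates can be computed as below. The paper instead uses the full multivariate Normal distribution with the deviates' correlation structure (Genz and Bretz, 2009), which yields the slightly smaller value 2.80; treating the deviates as independent is our simplifying assumption.

```python
from scipy.stats import norm

# If the ten deviates were independent standard Normals, the critical value
# kappa for max|D| at level alpha would solve (2*Phi(kappa) - 1)**G = 1 - alpha.
G, alpha = 10, 0.04
kappa = norm.ppf(0.5 * (1.0 + (1.0 - alpha) ** (1.0 / G)))
print(round(kappa, 2))
```

Accounting for positive correlation among the deviates (the combined subgroups overlap the disjoint ones) lowers this value toward the paper's 2.80.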
Table 4:
Sensitivity analysis for testing the null hypothesis of no heterogeneity and description of the discovered subgroups. The upper table shows ten deviates from the subgroups with the maximum absolute deviate where the critical values κΓ,α = 2.80 for Γ = 1 when α = 0.04 and η = 0.01 and κΓ,α = 2.72 for Γ > 1 when α = 0.05 and η = 0, and the lower table shows the proportions of the subgroups and comparisons of outcomes between treated and control populations.
| Subgroups | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 1 ⋃ 2 | 1 ⋃ 2 ⋃ 3 | 4 ⋃ 5 | 4 ⋃ 5 ⋃ 6 | |||
| Γ | δ0 | DΓ1 | DΓ2 | DΓ3 | DΓ4 | DΓ5 | DΓ6 | DΓ7 | DΓ8 | DΓ9 | DΓ10 | DΓ,max |
| 1 | 0.0112 | −3.43 | −0.29 | 0.27 | 2.23 | 3.30 | 8.66 | −3.37 | −2.55 | 3.20 | 8.14 | 8.66 |
| 0.0143 | −4.37 | −0.57 | −0.31 | 1.80 | 3.16 | 8.25 | −4.34 | −3.68 | 2.75 | 7.54 | 8.25 | |
| 0.0175 | −5.31 | −0.85 | −0.91 | 1.36 | 2.99 | 7.85 | −5.33 | −4.83 | 2.28 | 6.92 | 7.85 | |
| 0.0207 | −6.27 | −1.13 | −1.50 | 0.92 | 2.85 | 7.45 | −6.32 | −5.97 | 1.82 | 6.31 | 7.45 | |
| 0.0238 | −7.20 | −1.41 | −2.09 | 0.48 | 2.71 | 7.05 | −7.30 | −7.11 | 1.36 | 5.70 | 7.30 | |
| Γ | δ0 | DΓ1 | DΓ2 | DΓ3 | DΓ4 | DΓ5 | DΓ6 | DΓ7 | DΓ8 | DΓ9 | DΓ10 | DΓ,minmax |
| 1.010 | 0.0242 | −6.39 | −1.16 | −1.52 | 0.44 | 2.50 | 6.53 | −6.45 | −6.10 | 1.46 | 5.85 | 6.53 |
| 1.050 | 0.0285 | −4.07 | −0.45 | 0.00 | 0.00 | 1.61 | 4.17 | −4.03 | −3.26 | 0.80 | 3.80 | 4.17 |
| 1.070 | 0.0306 | −2.97 | −0.13 | 0.00 | 0.00 | 1.18 | 3.02 | −2.87 | −2.37 | 0.63 | 2.84 | 3.02 |
| 1.075 | 0.0312 | −2.72 | −0.05 | 0.00 | 0.00 | 1.06 | 2.73 | −2.61 | −2.16 | 0.58 | 2.57 | 2.73 |
| 1.076 | 0.0312 | −2.63 | −0.02 | 0.00 | 0.00 | 1.04 | 2.69 | −2.52 | −2.09 | 0.57 | 2.54 | 2.69 |
| Subgroups | Total | |||||||||||
| Proportion (%) | 46.2 | 4.2 | 22.3 | 13.7 | 1.7 | 11.9 | 50.4 | 72.8 | 15.4 | 27.2 | 100.0 | |
| Treated (%) | 14.5 | 15.2 | 26.5 | 39.6 | 54.6 | 61.7 | 14.6 | 18.3 | 41.2 | 50.1 | 26.9 | |
| Control (%) | 14.6 | 14.4 | 25.3 | 36.8 | 46.6 | 53.7 | 14.6 | 17.8 | 37.9 | 44.8 | 25.2 | |
| Risk difference (%) | 0.0 | 0.8 | 1.3 | 2.7 | 8.0 | 7.9 | 0.0 | 0.4 | 3.3 | 5.3 | 1.8 | |
| Odds ratio | 1.00 | 1.07 | 1.07 | 1.12 | 1.38 | 1.39 | 1.00 | 1.03 | 1.15 | 1.24 | 1.10 | |
In addition, one may be interested in testing, for a certain subgroup, the null hypothesis that the exposure effect in that subgroup equals the population average. For instance, policymakers may want to know whether Medicare beneficiaries aged between 81–85 (i.e., 4 ∪ 5) are at high risk of death compared to the whole-population average. To test this, we can focus on the subset of deviates {DΓ4, DΓ5, DΓ9}. The corresponding critical value is 2.38, which is smaller than κΓ,α = 2.80. At Γ = 1, the null hypothesis for 4 ∪ 5 is rejected since the maximum of these deviates exceeds 2.38 for all values of δ0 in the interval. For the terminal nodes, the sub-hypotheses can be tested with the critical value obtained from the standard Normal distribution with α = 0.04. Other critical values are 2.37 for 1 ∪ 2, 2.55 for 1 ∪ 2 ∪ 3, and 2.56 for 4 ∪ 5 ∪ 6. Figure 2 represents subgroups whose null hypotheses are rejected at Γ = 1 with solid rectangles, and the others with dashed rectangles. Subgroup 1 (white, aged between 65–75) has an effect size significantly lower than the population average, while subgroups 5 and 6 have effect sizes significantly higher than the population average. Also displayed in Figure 2 are the point estimates and 95% confidence intervals for the subgroup causal effects. We note that each subgroup's confidence interval is computed by inverting the null hypothesis for that subgroup; for example, the confidence interval for subgroup 1 is an inversion of the test of H0 : δ = δ1, not of H0 : δ = δ0, where δ1 is the average exposure effect within subgroup 1. The lower part of Table 4 provides detailed descriptions of the discovered subgroups.
Table 4 also reports a sensitivity analysis for unmeasured confounding in discovering heterogeneous subgroups. We set η = 0 and α = 0.05 for Γ > 1, which produces the critical value κΓ,α = 2.72. For each value of Γ, only the minimum of DΓ,max (i.e., DΓ,minmax) is reported in the table. For example, at Γ = 1.01, the minimum DΓ,minmax = 6.53 is obtained at δ0 = 0.0242. DΓ,minmax is attained at either the deviate DΓ1 or the deviate DΓ6, so it can be inferred that subgroups 1 and 6 are least sensitive to unmeasured biases. Figure 3 displays the maximum absolute deviate DΓ,max across δ0 in the interval [0, 0.04] for each value of Γ. The curve of DΓ,max has a V-shape. All the curves attain their minimum DΓ,minmax within the interval, and for Γ ≤ 1.07 the curves lie above the horizontal line at the critical value κΓ,α = 2.72. This implies that the null hypothesis H0 of no heterogeneity is rejected only up to Γ ≤ 1.07. Table 4 shows more finely calibrated values of Γ: DΓ,minmax remains larger than κΓ,α = 2.72 until Γ = 1.075. This sensitivity analysis thus shows statistically significant evidence of heterogeneity up to Γ = 1.075. The value Γ = 1.075 corresponds to a situation where an unobserved confounder could increase the odds of exposure to high-level PM2.5 by 1.5-fold and increase the odds of death by more than 1.44-fold (i.e., (Λ, Δ) = (1.5, 1.44)).
Figure 3:

The maximum absolute deviate DΓ,max across δ0 in the interval [0, 0.04] for various Γ. The dashed line represents the critical value κΓ,α = 2.72.
Our findings about effect modification in the Medicare population can be compared to previous findings. We showed that the older subgroups (i.e., Subgroups 5 and 6) have significantly higher mortality effects, consistent with Di et al. (2017). Di et al. (2017) conducted ad-hoc subgroup analyses for sex, race, and Medicaid eligibility, and found that the Black population was most vulnerable to air pollution exposure, while other factors did not make notable differences. We also found that the non-white subpopulation was more vulnerable than the white subpopulation among those aged below 75. However, we did not find that non-whites were more vulnerable in the 75+ subpopulation; instead, those who were eligible for Medicaid had a higher mortality effect in this subpopulation. These specific subpopulations had not been discovered in the previous literature, and, more importantly, we made valid statistical inferences with a thorough sensitivity analysis for unmeasured confounding.
6. Discussion
We introduced a new approach for de novo discovery of subgroups in which the causal effect of a binary treatment (or exposure) on a binary or continuous outcome differs from the average causal effect for the whole population. We considered both exploratory and confirmatory statistical analyses. Instead of determining a set of covariates a priori before making an inference, an exploratory search reveals the tree; randomization-based tests are then conducted to identify subgroups within the tree with significantly different effects relative to the population average. We also developed a sensitivity analysis to assess the effect of unmeasured confounding bias on the conclusions regarding both the population average and the subgroup-specific causal effects.
The de novo method relies on a sample-splitting approach that divides the entire sample into two subsamples. However, there is little literature on selecting the optimal splitting ratio; in particular, when applying the method, it is not known which ratio provides the highest power. We considered three ratios in the simulation studies of Section 4 and found that the optimal ratio among the three depended on the size of the effect modification. However, these simulation results cannot serve as a general guideline for those without prior knowledge of how large the effect modification might be. Selecting the optimal ratio without any prior information is an interesting problem for future research.
When discovering a tree Π, the size of Π needs to be controlled. As discussed in Section 3.3, the number of terminal nodes in the tree should not be large compared to the sample size, in order to justify the large-sample approximation. However, when discovering Π, its size is selected by cross-validation and is therefore random, not fixed. To the best of our knowledge, there are no theoretical results on the tree size obtained from cross-validation. When too large a value of G is selected, the large-sample approximation may fail since min(Ig) may not be large enough. To avoid this issue, we restrict the maximum depth d of a tree when growing it; for example, in our application, we set d = 3. This restriction guarantees that the final number of terminal nodes is at most 2^d for a fixed value of d.
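The depth restriction is easy to enforce with any standard tree implementation; a brief sketch using scikit-learn with simulated data (our choice of implementation, not the authors'):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(size=1000)

# Restricting the maximum depth d caps the number of terminal nodes G at
# 2**d, keeping min(I_g) large enough for the large-sample approximation.
d = 3
tree = DecisionTreeRegressor(max_depth=d).fit(X, y)
assert tree.get_n_leaves() <= 2 ** d
```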
To assess the degree of heterogeneity in the presence of unmeasured confounding, a sensitivity analysis can be conducted for various values of Γ. When an unmeasured bias is present, the distribution of Z is governed by Γ. The change in the distribution of Z affects both (a) estimating the 100(1 − η)% confidence interval and (b) testing the null hypothesis H0 at level α. For Γ > 1, as Γ increases, the 100(1 − η)% confidence interval for τ rapidly converges to the real line, even for a comparatively large η. When choosing a large value of η, the obtained confidence interval may be narrow enough up to a certain Γ. However, there is a trade-off between η and α: a large η means a small α for testing, which may lead to a loss of power. It is difficult to find the optimal balance between η and α since the optimal balance depends on the true size of Γ, which is unknown. More transparently, we propose to set η = 0, which means considering all values of τ on the real line. For many test statistics, such as Wilcoxon's signed rank statistic, DΓ,max is substantially larger when τ is too small or too large, and the minimum of DΓ,max is attained within a sizable range. In practice, a wide enough range of τ can be chosen by verifying that DΓ,max is large enough at the ends of the range, even for a large Γ. This approach requires more intensive computation, but we may expect a minimal power loss because η = 0.
The proposed method for testing the sub-hypothesis for subgroup g can be improved by considering the logical implications between the sub-hypothesis for g and the sub-hypotheses for all g′ ≠ g: rejecting the latter collection leads to rejection of the former. Therefore, if either the sub-hypothesis for g or the collection of sub-hypotheses for all g′ ≠ g is rejected, then we can reject the sub-hypothesis for g. This logical implication increases the power of the test. Furthermore, there is no need to construct the confidence interval for τ; instead, every value of τ on the real line can be checked. If the specified value of τ is too large or too small, it is highly likely that one of these hypotheses is rejected. Therefore, for a given total significance level α + η, α can be maximized by setting η = 0.
Finally, the proposed method is not limited to air pollution studies; it can be applied to any research question regarding effect modification. For instance, in social science, discovering effect modification can help reveal causal mechanisms in which effects vary with background variables. Also, in precision medicine, discovering patient-specific subgroups that benefit most (or least) from a treatment could be of interest, allowing practitioners to maximize treatment efficacy and minimize side effects. In practice, the tree discovery step can be tuned according to the questions of interest. For public policy implications, discovering a tree with a few simple subgroups is particularly important; in precision medicine, by contrast, discovering a larger tree with more refined subgroups could be of interest. When a larger discovered tree is preferred, it is better to assign a larger portion of the total sample to the discovery subsample.
Supplementary Material
Acknowledgment
We are grateful for helpful feedback from the editor, the associate editor, four anonymous referees, and session participants at JSM and European Causal Inference Meeting.
Funding
This work was supported by NIH grants (R01GM111339, R01ES024332, R01ES026217, P50MD010428, DP2MD012722, R01ES028033, R01MD012769) and HEI grant (4953-RFA14-3/16-4).
Footnotes
SUPPLEMENTARY MATERIAL
Appendix The online appendix contains proofs for Theorems 1–2, additional simulations for the discovery step. (.pdf file)
R Code for Application and Simulations An R script illustrates our methods with a simulated data set. Code for both implementing the de novo method and producing the simulation results is provided. (.R file)
References
- Athey S and Imbens G (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113(27), 7353–7360.
- Breiman L, Friedman JH, Olshen RA, and Stone CJ (1984). Classification and Regression Trees. New York, NY: Chapman & Hall/CRC.
- Di Q, Dai L, Wang Y, Zanobetti A, Choirat C, Schwartz JD, and Dominici F (2017). Association of short-term exposure to air pollution with mortality in older adults. JAMA 318(24), 2446.
- Di Q, Kloog I, Koutrakis P, Lyapustin A, Wang Y, and Schwartz J (2016). Assessing PM2.5 exposures with high spatiotemporal resolution across the continental United States. Environmental Science & Technology 50(9), 4712–4721.
- Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C, Dominici F, and Schwartz JD (2017). Air pollution and mortality in the Medicare population. New England Journal of Medicine 376(26), 2513–2522.
- Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, Ferris BG, and Speizer FE (1993). An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine 329(24), 1753–1759.
- Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, and Samet JM (2006). Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA 295(10), 1127.
- Fogarty CB, Mikkelsen ME, Gaieski DF, and Small DS (2016). Discrete optimization for interpretable study populations and randomization inference in an observational study of severe sepsis mortality. Journal of the American Statistical Association 111(514), 447–458.
- Fogarty CB, Shi P, Mikkelsen ME, and Small DS (2017). Randomization inference and sensitivity analysis for composite null hypotheses with binary outcomes in matched observational studies. Journal of the American Statistical Association 112(517), 321–331.
- Gastwirth JL, Krieger AM, and Rosenbaum PR (2000). Asymptotic separability in sensitivity analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(3), 545–555.
- Genz A and Bretz F (2009). Computation of Multivariate Normal and t Probabilities. Berlin: Springer.
- Hahn PR, Murray JS, and Carvalho CM (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis 15(3), 965–1056.
- Hansen BB (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association 99(467), 609–618.
- Hsu JY, Small DS, and Rosenbaum PR (2013). Effect modification and design sensitivity in observational studies. Journal of the American Statistical Association 108(501), 135–148.
- Lee K, Small DS, and Rosenbaum PR (2018). A powerful approach to the study of moderate effect modification in observational studies. Biometrics 74(4), 1161–1170.
- Loomis D, Grosse Y, Lauby-Secretan B, Ghissassi FE, Bouvard V, Benbrahim-Tallaa L, Guha N, Baan R, Mattock H, and Straif K (2013). The carcinogenicity of outdoor air pollution. The Lancet Oncology 14(13), 1262–1263.
- Makar M, Antonelli J, Di Q, Cutler D, Schwartz J, and Dominici F (2017). Estimating the causal effect of low levels of fine particulate matter on hospitalization. Epidemiology 28(5), 627–634.
- Neyman J (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
- Pope CA, Lefler JS, Ezzati M, Higbee JD, Marshall JD, Kim S-Y, Bechle M, Gilliat KS, Vernon SE, Robinson AL, and Burnett RT (2019). Mortality risk and fine particulate air pollution in a large, representative cohort of U.S. adults. Environmental Health Perspectives 127(7), 077007.
- Rajagopalan S, Al-Kindi SG, and Brook RD (2018). Air pollution and cardiovascular disease. Journal of the American College of Cardiology 72(17), 2054–2070.
- Rückerl R, Schneider A, Breitner S, Cyrys J, and Peters A (2011). Health effects of particulate air pollution: A review of epidemiological evidence. Inhalation Toxicology 23(10), 555–592.
- Rosenbaum PR (2002a). Covariance adjustment in randomized experiments and observational studies. Statistical Science 17(3), 286–327.
- Rosenbaum PR (2002b). Observational Studies. New York: Springer.
- Rosenbaum PR (2010). Design of Observational Studies. New York: Springer.
- Rosenbaum PR (2012). Testing one hypothesis twice in observational studies. Biometrika 99(4), 763–774.
- Rosenbaum PR (2017). Observation and Experiment. Cambridge, MA: Harvard University Press.
- Rosenbaum PR and Silber JH (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association 104(488), 1398–1405.
- Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.
- Samet JM, Dominici F, Curriero FC, Coursac I, and Zeger SL (2000). Fine particulate air pollution and mortality in 20 U.S. cities, 1987–1994. New England Journal of Medicine 343(24), 1742–1749.
- Stuart EA (2010). Matching methods for causal inference: A review and a look forward. Statistical Science 25(1), 1–21.
- Su X, Tsai C-L, Wang H, Nickerson DM, and Li B (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 10(5), 141–158.
- Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
- Wu X, Braun D, Schwartz J, Kioumourtzoglou MA, and Dominici F (2020). Evaluating the impact of long-term exposure to fine particulate matter on mortality among the elderly. Science Advances 6(29), eaba5692.
- Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS (2002). Truncated product method for combining P-values. Genetic Epidemiology 22(2), 170–185.
- Zhang H and Singer B (2010). Recursive Partitioning and Applications. New York: Springer.