Abstract
Several studies have provided strong evidence that long-term exposure to air pollution, even at low levels, increases the risk of mortality. As regulatory actions become increasingly expensive, robust evidence is needed to guide the development of targeted interventions that protect the most vulnerable. In this paper, we introduce a novel statistical method that (i) discovers subgroups whose causal effects differ substantially from the population mean, and (ii) uses randomization-based tests to assess the discovered heterogeneous effects. We also develop a sensitivity analysis method to assess the robustness of the conclusions to unmeasured confounding bias. Via simulation studies and theoretical arguments, we demonstrate that hypothesis testing focused on the discovered subgroups can substantially increase the statistical power to detect heterogeneity of the exposure effects. We apply the proposed de novo method to data on 1,612,414 Medicare beneficiaries in the New England region of the United States for the period 2000 to 2006. We find that seniors aged 81–85 with low income and seniors aged 85 and above experience statistically significantly greater causal effects of long-term exposure to PM2.5 on the 5-year mortality rate than the population mean.
Keywords: Causal effect, Causal inference, Unmeasured confounding, Particulate Matter, Observational study, Recursive partitioning, Sample split
1. Introduction
Air pollution is a major environmental risk to health. Over the past few decades, researchers have estimated the association between air pollution exposure and a wide range of health outcomes, from respiratory diseases to death (Dockery et al., 1993; Samet et al., 2000; Dominici et al., 2006; Loomis et al., 2013; Di et al., 2017; Makar et al., 2017). Recently, Wu et al. (2020) reported statistically significant evidence of increased mortality risk associated with long-term exposure to fine particulate matter with an aerodynamic diameter < 2.5 μm (PM2.5), even when exposure levels were below 12 μg/m3, the current national PM2.5 standard. The World Health Organization (WHO)’s International Agency for Research on Cancer (IARC) has concluded that exposure to PM2.5 is carcinogenic to humans (Loomis et al., 2013). Rückerl et al. (2011) and Rajagopalan et al. (2018) provide extensive overviews of the impact of air pollution on a variety of health outcomes in epidemiological studies. Along with these findings, regulations to limit emissions have been formulated and updated in order to decrease air pollution and promote public health, as the Clean Air Act requires. However, as the costs of air quality regulation rise, robust evidence for more targeted regulatory actions is required to determine the appropriate allocation of regulatory efforts and resources. To establish such regulatory programs, it is important and informative to discover population subgroups that are potentially more vulnerable to air pollution.
Our aim is to develop a new methodological approach in causal inference to address this critically important public health question: “Which population subgroups have causal effects of air pollution on mortality that are statistically significantly different from the population average?” In epidemiology, a similar question is often formulated in terms of assessing evidence of effect modification of air pollution effects on health by some pre-specified variables, such as age, sex, and race. Many air pollution epidemiological studies have been conducted to assess effect modification. For example, Di et al. (2017) and Di et al. (2017) conducted subgroup analyses using up to one-way interaction terms to study the short-term and long-term effects of air pollution. Also, Pope et al. (2019) used National Health Interview Surveys (NHIS) between 1986 and 2014 to study the long-term effect of air pollution on mortality, using Cox proportional hazards models that account for the complex sample design; a subgroup analysis was conducted by examining pre-specified variables one at a time. However, most of these studies have three main limitations: (a) variables that are suspected to be potential effect modifiers must be selected a priori, before analysis; (b) effect modification is assessed in a model-dependent way, by testing whether there is evidence of an interaction between the exposure variable and a modifier; and (c) the existing approaches rely on standard regression adjustments for confounders, without conducting systematic sensitivity analyses for unmeasured confounding.
In the statistical literature, there have been developments that can address some of these three issues. Many methods aim primarily at estimating conditional average treatment effects. For example, Wager and Athey (2018) propose the Causal Forest method to estimate covariate-specific treatment effects, and Su et al. (2009) use recursive partitioning to estimate treatment effects across subpopulations. Hahn et al. (2020) propose a nonlinear regression model using a Bayesian version of the Causal Forest. These tree-based approaches provide a flexible way to parameterize the covariate space: if any subgroup contributes effect modification, a tree (or a combination of trees) is likely to keep growing to capture this subgroup, which leads to high estimation accuracy. Despite such significant advances, most of these contributions do not assess the robustness of causal findings about effect modification. In particular, when the assumption of no unmeasured confounding is violated, apparent evidence of effect modification can be explained by unmeasured bias. A sensitivity analysis can then be conducted to characterize how extensive an unmeasured confounding bias would have to be in order to alter the conclusions.
In this paper we propose a novel approach, called the de novo method, to overcome these limitations. The de novo method is developed within a causal inference framework, and in the context of matched observational studies. More specifically, we split the sample into two parts. In the first subsample, we let the data discover “promising” subgroups with air pollution effects that differ from the population mean. Here, we apply two machine learning algorithms to discover vulnerable subgroups: (a) the classification and regression tree (CART) method proposed by Breiman et al. (1984), and (b) the Causal Tree (CT) method proposed by Athey and Imbens (2016), which uses different criteria for constructing partitions. There are more sophisticated techniques, such as Random Forests; however, we use tree methods for their clearer interpretability. Results from tree-based methods can be easily explained to non-experts, and thus can be used directly to inform regulatory policy.
In the second subsample, we develop randomization-based hypothesis tests to confirm whether there is evidence that exposure effects for the newly discovered subgroups are statistically significantly different from the population average causal effect. The null hypothesis is that each subgroup’s exposure effect is not different from the population mean. Rejecting the null hypothesis can provide statistical evidence and practical guidance for future researchers to look more closely at a particular subgroup. Note that our definition of heterogeneous exposure effects differs from the more common definition, used in previous studies, that considers deviation from the null effect (Hsu et al., 2013; Lee et al., 2018). Furthermore, we develop a sensitivity analysis by generalizing randomization inference to assess the impact of unmeasured confounding bias on causal conclusions. In addition to hypothesis testing, we propose a method that accounts for unmeasured bias both in estimating the population mean and in discovering population subgroups.
The outline of the rest of this paper is as follows. In Section 2, we introduce notation and assumptions. In Section 3, we present the methodological approach and theoretical results for continuous and binary outcomes. In Section 4, we illustrate the performance of the proposed method in several simulated situations. In Section 5, we apply this new approach to an observational study of the effect of long-term PM2.5 exposure on mortality for the Medicare population in the New England region of the United States from 2000 to 2006, and identify subgroups that have higher (or lower) mortality rates than the population mean. Section 6 concludes with a discussion in which we review and compare our approach with other existing methods.
2. Notation and review of observational studies
2.1. Notation For Stratified Randomized Experiments
Consider a stratified randomized experiment with groups g = 1, …, G. Within group g, let gi denote the i-th stratum, i = 1, …, Ig, containing ngi observations, and let gij be the j-th individual in stratum gi. Within gi, we assume that mgi individuals receive the treatment and ngi − mgi individuals receive the control, with min{mgi, ngi − mgi} = 1. For simplicity, we assume mgi = 1, which means that each stratum gi has only one treated individual. Let Zgij be a binary treatment assignment; if individual gij receives the treatment, Zgij = 1, and otherwise Zgij = 0. For each gi, the sum of the Zgij is one, Σj=1,…,ngi Zgij = 1. Under the potential outcome framework with a binary treatment, each individual has two potential outcomes (Neyman, 1990; Rubin, 1974): one under treatment, rTgij, and the other under control, rCgij. Only one of the two potential outcomes can be observed, according to Zgij; thus the individual treatment effect, rTgij − rCgij, cannot be observed. The individual exhibits the observed response Rgij = rTgijZgij + rCgij(1 − Zgij). Within pair gi (with ngi = 2), the observed treated-minus-control difference in responses is Ygi = (Zgi1 − Zgi2)(Rgi1 − Rgi2). If the treatment has an additive effect, rTgij − rCgij = τ for all gij, then Ygi = τ + (Zgi1 − Zgi2)(rCgi1 − rCgi2).
Let F = {(rTgij, rCgij, xgij, ugij) : g = 1, …, G; i = 1, …, Ig; j = 1, …, ngi}, where xgij and ugij denote observed and unobserved covariates, and let Ω be the set containing all possible values z of the treatment assignment vector Z. Write |A| for the number of elements in a finite set A. Then, |Ω| = Πg Πi ngi. In a randomized experiment, a treatment assignment Z is randomly chosen from Ω. Therefore, Pr(Z = z | F, Ω) = 1/|Ω|, and Pr(Zgij = 1 | F, Ω) = 1/ngi from the independence between strata. The response Rgij is a random variable due to Z, whereas F is fixed. This randomization enables researchers to make inference for treatment effects in a randomized experiment (Rosenbaum, 2002a, 2017).
2.2. Matching and Observational Studies
The main challenge in observational studies is to remove confounding bias. Matching is a simple and transparent way to adjust for biases due to measured confounders (Stuart, 2010). Roughly speaking, for each treated individual (e.g., a Medicare patient exposed to levels of PM2.5 higher than 12 μg/m3), matching produces a stratum gi composed of that treated individual and one or more untreated individuals, i.e., controls (e.g., Medicare patients exposed to levels of PM2.5 lower than 12 μg/m3), who are as similar as possible to the treated individual in terms of potential confounders.
In this paper, we consider a matched pair design containing only one matched untreated individual for each treated individual (ngi = 2). However, our method can be readily extended to matching with multiple controls. In practice, it is difficult to find a control with exactly the same covariate values, especially for continuous covariates. Instead, we find a control as similar to the targeted treated individual as possible, and then assess how similar the matched pairs are by checking the overall covariate balance. The most common diagnostic for checking balance is the standardized difference; see Rosenbaum (2010) for more detail. The quality of the matched pairs produced by a matching method should be assessed and reported before making causal inference.
Under the assumption of no unmeasured confounding bias, inferences can be made by treating matched sets as stratified randomized experiments (Rosenbaum, 2002b; Hansen, 2004). This assumption implies that the probability of receiving the treatment, πgij = Pr(Zgij = 1|xgij), depends only on the observed covariates xgij, meaning that if two individuals gij and gij′ within the same stratum gi have the same covariates (i.e., xgij = xgij′), then πgij = πgij′. This property also implies Pr(Zgi1 = 1 | F, Ω) = 1/2 within a matched pair, since the two individuals in the pair share the same observed covariates. In addition to the assumption of no unmeasured confounding, another assumption is required: common support for Pr(Zgij = 1|xgij). The common support assumption means that every treated or control individual must have a positive probability of receiving the treatment (and of not receiving it), that is, 0 < Pr(Zgij = 1|xgij) < 1. An individual with probability 1 of receiving the treatment cannot be compared, since no control individual with the same covariates exists.
2.3. Sensitivity to Unmeasured Biases in Observational Studies
In an observational study, matching methods can adjust for measured confounders, however, it might be possible that two individuals gij and gij′ with xgij = xgij′ have different probabilities of receiving the treatment due to different values of an unmeasured confounder ugij ≠ ugij′. In this context, we introduce an approach that builds upon the sensitivity analysis framework proposed by Rosenbaum (2002b). We introduce a sensitivity parameter Γ, and assume that for two individuals (gij and gij′) within the same stratum (gi) with xgij = xgij′ but with ugij ≠ ugij′ the odds of treatment assignment may differ at most by Γ where Γ ≥ 1,
1/Γ ≤ {πgij(1 − πgij′)} / {πgij′(1 − πgij)} ≤ Γ.  (1)
When Γ = 1, the model (1) is equivalent to assuming that there are no unmeasured confounders, and the distribution of treatment assignment is the same as the randomization distribution discussed in Section 2.1. Then, the null hypothesis can be tested by using randomization inference, and the P-value can be obtained as a point estimate. When Γ > 1, the resulting distribution cannot be explicitly obtained, but bounds for Pr(Z = z | F, Ω) that are controlled by Γ can be obtained. For example, in the context of a matched pair design, the probability that the first individual in pair gi is treated is bounded as 1/(1 + Γ) ≤ Pr(Zgi1 = 1 | F, Ω) ≤ Γ/(1 + Γ). For each Γ > 1, randomization inference produces an interval of P-values instead of a point estimate.
For Γ > 1, a sensitivity analysis can be conducted by considering the upper bound of the P-value interval. If the upper bound is still less than a significance level α, the null hypothesis can be rejected even in the presence of unmeasured confounders, since the worst-case P-value is less than α. The P-value interval becomes wider as Γ increases, and at some point it contains α. When α is located in the middle of the interval, the P-value interval is uninformative, and the null hypothesis cannot be rejected. In practice, we report the largest value of Γ for which the upper bound P-value is less than α; a larger value indicates that the conclusion is more robust to unmeasured confounding. An approximation of the upper P-value bound can be used; see Gastwirth et al. (2000) for more detailed discussion.
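To make the computation concrete, the following sketch (ours, not the authors' code) computes the worst-case one-sided P-value for Wilcoxon's signed rank statistic under the sensitivity model, using the large-sample Normal approximation of the kind discussed by Gastwirth et al. (2000); the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm, rankdata

def sensitivity_upper_pvalue(y, gamma):
    """Worst-case (upper bound) one-sided P-value for Wilcoxon's signed
    rank statistic applied to matched-pair differences y, under the
    sensitivity model with parameter gamma >= 1."""
    y = np.asarray(y, dtype=float)
    y = y[y != 0]                          # drop zero differences
    ranks = rankdata(np.abs(y))            # ranks of |Y_gi|
    t_obs = ranks[y > 0].sum()             # observed signed rank statistic
    p_plus = gamma / (1.0 + gamma)         # worst-case probability of a + sign
    mu = p_plus * ranks.sum()              # mean of the bounding statistic
    var = p_plus * (1 - p_plus) * np.sum(ranks ** 2)
    return norm.sf((t_obs - mu) / np.sqrt(var))
```

At Γ = 1 this reduces to the usual randomization-based P-value; reporting the largest Γ whose upper bound stays below α gives the robustness measure described above.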
3. The De Novo Method: A Combined Exploratory and Confirmatory Method
In this section, we propose a novel de novo method for discovering and analyzing heterogeneous causal effects. To apply the de novo method, we use matched pair data by assuming that matching successfully eliminates measured confounding bias. By using a sample-splitting approach discussed in Section 3.3, matched pair data can be split into (i) one subsample for discovering effect modification structure as a tree and (ii) the other subsample for making inference based on the discovered tree (Section 3.2).
3.1. The Null Hypothesis with a Nuisance Parameter
Suppose that the outcome is continuous. Let τ be the population average treatment effect. We propose a hypothesis testing approach to identify subgroups with heterogeneous causal effects by considering the null hypothesis H0 : rTgij − rCgij = τ for all g, i, j. This null hypothesis is Fisher’s sharp null hypothesis, but with an additive constant effect τ. When there is no heterogeneity of the causal effects at all, every individual gij has the same constant causal effect, rTgij − rCgij = τ. Therefore, if H0 is rejected, there is evidence that some individuals have effects different from the population mean. When τ is a known, fixed value, the null hypothesis H0 can be tested using randomization inference by imputing the missing potential outcomes.
In the following subsection, we assume a known value of τ and describe a level-α one-sided testing procedure against the alternative hypothesis that the treatment effects rTgij − rCgij are larger than τ. A level-α two-sided test can be easily constructed by applying the procedure twice at level α/2: once to test the positive direction (i.e., larger than τ) and once to test the negative direction (i.e., smaller than τ). However, τ is unknown in practice, and is a nuisance parameter that has to be estimated.
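As an illustration (a minimal sketch, not the authors' implementation), the sharp null with a known τ can be tested by randomization inference on the matched-pair differences: under H0, Ygi − τ is symmetric under random sign flips within pairs.

```python
import numpy as np

def randomization_pvalue(y, tau, n_perm=10000, seed=0):
    """One-sided Monte Carlo randomization P-value for the sharp null
    H0: rT - rC = tau, given matched-pair differences y."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(y, dtype=float) - tau           # differences under H0
    t_obs = adj.sum()                                # mean-type test statistic
    signs = rng.choice([-1.0, 1.0], size=(n_perm, adj.size))
    t_null = (signs * np.abs(adj)).sum(axis=1)       # sign-flip distribution
    return (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)
```

A two-sided version at level α follows by running this once against each direction at level α/2, as described above.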
3.2. Testing the Null Hypothesis With a Given Tree Structure
Suppose we are given a tree that partitions the total sample. Assume that there are G terminal nodes and that each subgroup g represents a terminal node in the partition. To utilize the structure of trees, we trace back how the partition is built: when growing a tree, a certain internal node is chosen and split into two subsequent nodes. Since there are G terminal nodes, the total number of nodes is 2G − 1. Excluding the initial (root) node, we consider G − 2 internal nodes and G terminal nodes. Write Π = (Π1, …, Π2G−2)⊤ for these 2G − 2 nodes in total. To illustrate Π, consider a simple example tree with two binary variables, male and young, shown in Figure 1. The first split is made on the male variable, and the second split is made, for males, on the young variable. There are three terminal nodes, from left to right: (1) female, (2) old male, and (3) young male. The single internal node besides the root, which corresponds to the male sample, is the union of old male and young male; we simply denote this internal node as 2 ∪ 3. Technically, the total sample 1 ∪ 2 ∪ 3 is also an internal node, but it is not considered since the treatment effect within 1 ∪ 2 ∪ 3 is just the population average. In this example, the tree can be represented by Π = (1, 2, 3, 2 ∪ 3)⊤.
Figure 1: An example tree Π.
With G terminal nodes, G − 2 internal nodes are additionally considered for de novo discovery of population subgroups with heterogeneous causal effects. This may seem counter-intuitive, since more comparisons imply paying a higher price for multiple testing. However, the inclusion of the internal nodes has several beneficial aspects. First, some of the terminal nodes may have a small number of matched pairs, which leads to a lack of power for detecting effect modification; including the internal nodes can compensate for this lack of power. Second, when Π is much more complicated than the true structure, considering only the terminal nodes is misleading. This is especially important when Π is not given and has to be estimated: overfitting a tree leads to an unnecessarily complex structure, but including internal nodes can correct this problem.
We consider a test statistic Tg = Σi Σj Zgij qgij for the terminal node g, where qgij is a function of the responses within gi. Under the null H0 : rTgij − rCgij = τ, Rgij and qgij are fixed by conditioning on F. The comparison vector of (2G − 2) test statistics is constructed from Tg by using the (2G − 2) × G conversion matrix C. The matrix C creates the (2G − 2) correlated test statistics from the mutually independent statistics Tg; see Lee et al. (2018) for a discussion of the conversion matrix in a factorial design. To illustrate the matrix C, let us revisit the example shown in Figure 1. There are four nodes in total, and the matrix C can be constructed as

C = [ 1 0 0
      0 1 0
      0 0 1
      0 1 1 ].
The last row represents the internal node indicating the male subgroup 2 ∪ 3, and the first three rows represent the terminal nodes 1, 2, and 3. Now, let T = (T1, …, TG)⊤ be the vector of test statistics for the terminal nodes. Then, the (2G − 2) test statistics for all nodes, S = (S1, …, S2G−2)⊤, can be obtained as S = CT. In the above example, S = (T1, T2, T3, T2 + T3)⊤ is the comparison vector corresponding to Π = (1, 2, 3, 2 ∪ 3)⊤.
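For the Figure 1 example, constructing C and forming S = CT is straightforward; the numbers in T below are made up purely for illustration.

```python
import numpy as np

# Rows of C correspond to the nodes Pi = (1, 2, 3, 2 U 3);
# columns correspond to the terminal nodes 1 (female),
# 2 (old male), and 3 (young male).
C = np.array([
    [1, 0, 0],   # terminal node 1
    [0, 1, 0],   # terminal node 2
    [0, 0, 1],   # terminal node 3
    [0, 1, 1],   # internal node 2 U 3 (male)
])

T = np.array([1.2, -0.3, 2.1])   # illustrative terminal-node statistics
S = C @ T                        # comparison vector for all 2G - 2 nodes
```

Here S = (T1, T2, T3, T2 + T3)⊤, matching the comparison vector in the text.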
Under the null hypothesis H0, randomization inference gives the exact null distribution of the test statistic Tg for Γ = 1. However, for Γ > 1, the exact distribution of Tg cannot be obtained, but it is bounded above by a bounding statistic T̄g with expectation μΓg and variance νΓg. A large sample approximation can be applied to the joint bounding distribution of T̄ = (T̄1, …, T̄G)⊤ when G is fixed. Write μΓ = (μΓ1, …, μΓG)⊤ and VΓ for the G × G diagonal matrix with g-th diagonal element νΓg. Under H0, with mild regularity conditions, the joint distribution of T̄ converges to a multivariate Normal distribution N(μΓ, VΓ). Specifically, the test statistic can be represented as Tg = Σi sgn(Ygi) hgi, where sgn(Ygi) = 1 if Ygi > 0 and sgn(Ygi) = 0 otherwise, and hgi is a function of |Ygi|. For instance, if hgi is the rank of |Ygi|, then Tg is Wilcoxon’s signed rank test statistic. Under the sensitivity analysis model (1), the bounding statistic is T̄g = Σi c̄gi hgi, where the bounding random variable c̄gi is 1 with probability Γ/(1 + Γ) and 0 with probability 1/(1 + Γ) (Rosenbaum, 2002b). From the above specification of c̄gi and hgi, μΓg = {Γ/(1 + Γ)} Σi hgi and νΓg = {Γ/(1 + Γ)²} Σi hgi². The theorem below guarantees a Normal approximation.
Theorem 1 Under H0, given that G is fixed, if, as ming∈{1,…,G}(Ig) → ∞,

max1≤i≤Ig hgi² / Σi=1,…,Ig hgi² → 0 for each g ∈ {1, …, G},  (2)

then the random vector VΓ^(−1/2)(T̄ − μΓ) converges in distribution to the G-dimensional normal distribution N(0, IG).
The condition (2) is the condition for the convergence of each univariate bounding statistic T̄g. Therefore, (2) implies that if each distribution of T̄g can be approximated by N(μΓg, νΓg), then the distribution of T̄ can be approximated by N(μΓ, VΓ). The proof of Theorem 1 is given in the supplementary materials.
Define θΓ = CμΓ and ΣΓ = CVΓC⊤, noting that ΣΓ is not typically diagonal. Write θΓk for the k-th coordinate of θΓ and σΓk² for the k-th diagonal element of ΣΓ. Define DΓk = (Sk − θΓk)/σΓk and DΓ = (DΓ1, …, DΓ,2G−2)⊤. Finally, write ρΓ for the (2G − 2) × (2G − 2) correlation matrix formed by dividing the element of ΣΓ in row k and column k′ by σΓkσΓk′. For Γ > 1, as T is bounded by T̄, which converges to the Normal distribution N(μΓ, VΓ), the deviate vector DΓ is bounded by the Normal distribution N(0, ρΓ). Write ΦρΓ(x) for the probability of the (2G − 2)-dimensional lower orthant (−∞, x] × … × (−∞, x] under N(0, ρΓ), and DΓmax for the maximum deviate among DΓ, i.e., DΓmax = max1≤k≤2G−2 DΓk. For Γ ≥ 1, consider the following testing procedure,
Reject H0 if DΓmax ≥ κΓ,α,  (3)
where κΓ,α is the critical value at level α that satisfies ΦρΓ(κΓ,α) = 1 − α. The value κΓ,α can be obtained using the qmvnorm function in the mvtnorm package in R; see Genz and Bretz (2009). The procedure (3) takes the maximum deviate, which is governed by the orthant probability ΦρΓ. Therefore, a sensitivity analysis using this procedure has the correct level when H0 is true in large samples, because Pr(DΓmax ≥ κΓ,α) ≤ α; see Proposition 1 of Rosenbaum (2012) for the proof of this argument.
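In Python, κΓ,α can be approximated by bisection on the lower-orthant probability of N(0, ρΓ), a stand-in for qmvnorm (note that scipy's multivariate Normal CDF uses a randomized quadrature, so the result is accurate only to a few decimal places; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def critical_value(rho, alpha=0.05, lo=0.0, hi=6.0, tol=1e-4):
    """Find kappa with Pr{N(0, rho) in (-inf, kappa]^k} = 1 - alpha,
    i.e. the level-alpha critical value for the maximum deviate."""
    k = rho.shape[0]
    mvn = multivariate_normal(mean=np.zeros(k), cov=rho, seed=0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mvn.cdf(np.full(k, mid)) < 1 - alpha:
            lo = mid           # orthant probability too small: raise kappa
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For independent deviates the answer has a closed form, Φ(κ)^k = 1 − α, which gives a quick sanity check of the search.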
It is important to note that the procedure (3) is designed to test the null hypothesis H0 : rTgij − rCgij = τ for a known τ. In practice, we do not know τ, but we can estimate it with a (1 − η) level confidence interval, denoted CI(1 − η). When τ is unknown, we consider DΓ,max(τ) as a function of τ and define the minimum of the maximum deviate, DΓ,minmax,
DΓ,minmax = minτ∈CI(1−η) DΓ,max(τ).  (4)
Since τ is unknown, we take the minimum of DΓ,max(τ) over the confidence interval CI(1 − η), which leads to DΓ,minmax ≤ DΓ,max(τ*) whenever the true value τ* lies in CI(1 − η). The critical value κΓ,α does not depend on τ for many test statistics, such as Wilcoxon’s signed rank test. Therefore, DΓ,minmax can be directly compared with κΓ,α. In cases where κΓ,α depends on τ, the maximum critical value across CI(1 − η) can be used.
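A sketch of computing DΓ,minmax by a grid search over the confidence interval (our illustrative implementation, using the large-sample Normal approximation of the signed rank deviates from Section 3.2; function names are ours):

```python
import numpy as np
from scipy.stats import rankdata

def max_deviate(groups, tau, gamma=1.0):
    """Maximum standardized deviate over subgroups for H0: effect = tau,
    using the bounding Normal approximation of the signed rank statistic."""
    p = gamma / (1.0 + gamma)
    devs = []
    for y in groups:                       # y: pair differences in one node
        adj = np.asarray(y, dtype=float) - tau
        r = rankdata(np.abs(adj))
        t = r[adj > 0].sum()
        mu, var = p * r.sum(), p * (1 - p) * np.sum(r ** 2)
        devs.append((t - mu) / np.sqrt(var))
    return max(devs)

def minmax_deviate(groups, ci, gamma=1.0, n_grid=101):
    """D_minmax: minimize the maximum deviate over tau in CI(1 - eta)."""
    grid = np.linspace(ci[0], ci[1], n_grid)
    return min(max_deviate(groups, t, gamma) for t in grid)
```

By construction, the value returned is no larger than the maximum deviate at any single τ on the grid, including the true τ* when it falls in the interval.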
3.3. Sample Splitting and Discovering a Tree
In the previous subsections, we assumed that a tree Π is given for hypothesis testing, but in reality Π is unknown. We divide the data into two subsamples: one for building a tree and one for making inferences. The first subsample is used to estimate the tree. We apply CART and CT; see Breiman et al. (1984) and Zhang and Singer (2010) for discussion of CART, and Athey and Imbens (2016) for discussion of CT. These two approaches are both designed to discover tree structures, but they use different criteria for constructing the partition and for cross-validation. We compare these approaches for different sample splitting ratios in Section 4. The second subsample is used for conducting hypothesis tests using the testing procedures discussed in Section 3.2. Our proposed testing procedure (3) relies on a large-sample approximation that holds for a fixed, known G, but G is unknown in practice. We therefore specify a priori the maximum depth of the tree, which can be large but must be substantially smaller than the sample size. We elaborate on the adequacy of this assumption in the discussion.
Using the second subsample, hypothesis testing can further trim the discovered tree and find the hidden structure within it. One may be concerned that sample splitting leads to a loss of power for discovering heterogeneous subgroups, but there is a significant benefit that offsets the loss. Without splitting, the power to detect heterogeneity decreases as the number of potential effect modifiers increases. By using part of the data to discover a tree, a handful of important covariates can be selected first, so the loss of power can be minimized in high-dimensional settings.
3.4. Testing Subgroup-specific Null Hypotheses
Our primary interest is to test the null hypothesis H0 : rTgij − rCgij = τ for all g ∈ {1, …, G}, where τ is the population mean; H0 is a test for effect modification in the whole population. However, testing the null hypothesis H0g for a particular subgroup may also be of interest. Rejecting H0g implies that the corresponding subgroup has a treatment effect significantly different from the population mean. Just as the test statistic DΓmax is used for testing H0, to test H0g we consider a test statistic with respect to the subtree Πsub, a sub-vector of Π that contains all subgroups included in the targeted subgroup. Let K be the index set that indicates the inclusion of Πsub in Π; K is a subset of {1, …, 2G − 2}. The new test statistic is defined by focusing only on the deviates DΓk for k ∈ K. Since the number of considered deviates is reduced, a new critical value must be computed. The computation can be done by using the sub-correlation matrix that contains the (k, k′) elements of ρΓ for k, k′ ∈ K. In particular, when K corresponds to a single subgroup g, the critical value can be easily computed as Φ−1(1 − α/2), where Φ(·) is the cumulative distribution function of the standard Normal distribution. The subtree test statistic is then compared to this new critical value.
3.5. Binary Outcome
When outcomes are binary, the individual treatment effect is δgij = rTgij − rCgij, and δgij is an element of {−1, 0, 1}. The average treatment effect δ is the average difference between the two potential outcomes across all individuals. An unbiased estimator of δ is the average of the estimated average treatment effects within the strata gi. Also, instead of testing H0 : rTgij − rCgij = τ, which is defined for continuous outcomes, we consider a test of the null hypothesis H0 : δ = δ0. Since δgij can be either −1, 0, or 1, the considered δ0 has to be a member of the grid of values attainable as an average of elements of {−1, 0, 1}. The null H0 in fact tests the set of all configurations of individual effects whose average equals δ0, which means that rejecting H0 rejects every configuration in that set.
We adopt the testing method of Fogarty et al. (2016) for binary outcomes, and combine it with the de novo method. For each terminal node g, a test statistic Tg is constructed from the stratum-level estimates of the treatment effect and standardized by its estimated standard error, which aggregates the variance contribution of each stratum gi to var(Tg); we collect these in the test statistic vector T = (T1, …, TG)⊤. For each g, the standardized Tg has an approximately standard Normal distribution N(0, 1). For Γ = 1, Fogarty et al. (2016) propose a method based on randomization inference. To account for worst-case biases, it uses integer programming to find the maximal variance; see Section 5 and Theorem 1 in Fogarty et al. (2016) for more details. For Γ > 1, a similar approach can be used, but it requires more complicated computations, solving an integer quadratic program; see Fogarty et al. (2017) for the detailed computation. The rest of our proposed procedure is the same as the method in Section 3.2. For instance, a tree can be discovered with CART by regressing the stratum-level effect estimates on the covariates in the first subsample obtained from sample splitting. From the second subsample, the (1 − η) level confidence interval for δ can be constructed by inverting hypothesis tests. Then, the proposed testing procedures (3) and (4) with a level α can be applied.
In applying the above approach, however, difficulties arise due to the discreteness of δ0. When a tree is considered, each terminal node has a different sample size, which causes incompatible hypothesis tests. To illustrate this, consider a simple example with 10 matched pairs (N = 20) using the tree in Figure 1. Suppose that the female subgroup has 5 matched pairs (Nfemale = 10). The null for the entire sample tests whether δentire = δ0, where δ0 must be one of {−20/20, −19/20, …, 19/20, 20/20}. To discover effect modification, we ultimately want to test H0 : δentire = δfemale(= δold male = δyoung male). This implies that a test of δfemale = δ0 should be considered. However, this test is incompatible with, for instance, δ0 = 3/20, because δfemale can be tested only for the values (−10/10, −9/10, …, 10/10). A remedy is to take the two closest compatible values that bracket the incompatible value, conduct hypothesis tests for these two values, and take the larger P-value. For δ0 = 3/20, when testing the female group, the two closest compatible values are 1/10 and 2/10. This fix is slightly conservative; however, as the number of matched pairs Ig increases, the grid of compatible δ0 becomes finer, and the obtained P-value converges to the true value. Technically, for each δ0, let δ0g− and δ0g+ be the two closest compatible values for subgroup g.
Theorem 2 Under H0 : δg = δ0, if the variance of Tg diverges as Ig → ∞, then δ0g− → δ0 and δ0g+ → δ0, so the P-values obtained at the two closest compatible values converge to the P-value at δ0.
The proof of Theorem 2 is given in the supplementary materials.
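The bracketing step can be written down directly: compatible values of the subgroup average effect are integer multiples of 1/Ng, so the two closest compatible values around an incompatible δ0 are obtained by flooring and ceiling (a small helper of ours, not from the paper):

```python
import math

def closest_compatible(delta0, n_g):
    """Return the two closest values of the subgroup average treatment
    effect compatible with a subgroup of n_g individuals; compatible
    values are integer multiples of 1/n_g."""
    lo = math.floor(delta0 * n_g) / n_g
    hi = math.ceil(delta0 * n_g) / n_g
    return lo, hi
```

For the example in the text, δ0 = 3/20 with Nfemale = 10 gives the bracketing values 1/10 and 2/10; the larger of the two resulting P-values is then reported.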
4. Simulation
In this section, we use simulations to evaluate the performance of the de novo method under various settings. We consider three main factors that may affect the performance: (1) the choice of tree algorithm, (2) the splitting ratio (first, second), and (3) the degree of heterogeneity of the group-specific causal effects with respect to the population average. First, as discussed in Section 3.3, the CART and CT approaches are compared. Second, the ratio of the first subsample to the second may affect the performance: if we invest too much in the first discovery step, we lose power for detecting heterogeneity; on the other hand, if we invest too little, some important structure may not be discovered, also resulting in a loss of power. We consider three ratios: (10%, 90%), (25%, 75%), and (50%, 50%). Finally, we examine the performance according to the amount of heterogeneity; if the heterogeneity is small, the de novo method may not have enough power to detect it. All simulation settings assume the absence of unmeasured confounding.
We consider two simulation studies, one with continuous outcomes and the other with binary outcomes. For both studies, we set N = 4000 with 2000 matched pairs and a true effect size of 0.5 on average. Also, we consider five covariates, x1, …, x5, and assume that at most two of them, say x1 and x2, lead to heterogeneity (i.e., are true effect modifiers). For continuous outcomes, suppose that an individual with covariate values x1 = a, x2 = b has a treatment effect drawn from a Normal distribution with mean τab. Define τ = (τ00, τ01, τ10, τ11). We consider five situations: (1) τ = (0.4, 0.4, 0.6, 0.6), (2) τ = (0.3, 0.3, 0.7, 0.7), (3) τ = (0.4, 0.4, 0.5, 0.7), (4) τ = (0.3, 0.3, 0.6, 0.8), and (5) τ = (0.2, 0.5, 0.5, 0.8). For example, the first situation, τ = (0.4, 0.4, 0.6, 0.6), means that there is small effect modification by x1 but not x2. The third situation, τ = (0.4, 0.4, 0.5, 0.7), means that there is small effect modification by both x1 and x2. Wilcoxon’s signed rank test is used for continuous outcomes. Similarly, for binary outcomes, suppose that an individual treatment effect has a Binomial distribution with mean δab for x1 = a, x2 = b, and define δ = (δ00, δ01, δ10, δ11). We again consider five situations: (1) δ = (0.45, 0.45, 0.55, 0.55), (2) δ = (0.4, 0.4, 0.6, 0.6), (3) δ = (0.45, 0.45, 0.5, 0.6), (4) δ = (0.4, 0.4, 0.5, 0.7), and (5) δ = (0.35, 0.5, 0.5, 0.65).
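To make the data-generating design concrete, here is a minimal sketch for the continuous-outcome case (the within-cell spread of the effects and the pair-level noise level are our assumptions for illustration; the paper's exact values may differ):

```python
import numpy as np

def simulate_pairs(tau, n_pairs=2000, effect_sd=0.1, noise_sd=1.0, seed=0):
    """Matched-pair differences under effect modification by two binary
    covariates x1, x2; tau = (tau00, tau01, tau10, tau11)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_pairs, 2))          # (x1, x2) per pair
    cell = 2 * x[:, 0] + x[:, 1]                       # 0..3 indexes tau_ab
    effect = rng.normal(np.asarray(tau)[cell], effect_sd)
    y = effect + rng.normal(0.0, noise_sd, n_pairs)    # treated-minus-control
    return x, y
```

With x1 and x2 balanced, each of the five τ configurations above averages to the overall effect size of 0.5.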
Table 1 reports the simulated power for each situation for both continuous and binary outcomes. The upper part of the table shows the simulated power for continuous outcomes. As expected, when there is a small amount of heterogeneity, as in the first and third situations, both the CT and CART methods produce low power for all three splitting ratios. However, when there is moderate or large heterogeneity, the power of the test is much higher. The CT method generally has higher power than the CART method. Also, the CT method performs best with the (25%, 75%) ratio, whereas the CART method performs best with the (50%, 50%) ratio. If the size of the first subsample is small, the CART method is highly likely to produce a conservative tree. The CT method, on the other hand, accounts for the size of the second subsample and performs a more exploratory search for tree structures, although it produces false discoveries with higher probability.
Table 1:
Simulated power (from 10,000 replications) for hypothesis tests to discover heterogeneous subgroups. The upper table is for continuous outcomes and the lower table is for binary outcomes. The true effect size is 0.5 on average for the entire population and the sample size is 4000 with 2000 matched pairs.
| | Degree of heterogeneity (x1, x2) | τ or δ | (10%, 90%) CT | (10%, 90%) CART | (25%, 75%) CT | (25%, 75%) CART | (50%, 50%) CT | (50%, 50%) CART |
|---|---|---|---|---|---|---|---|---|
| Continuous outcomes | | τ = (τ00, τ01, τ10, τ11) | | | | | | |
| 1 | (Small, No) | (0.4, 0.4, 0.6, 0.6) | 0.09 | 0.04 | 0.10 | 0.05 | 0.07 | 0.06 |
| 2 | (Large, No) | (0.3, 0.3, 0.7, 0.7) | 0.55 | 0.35 | 0.85 | 0.70 | 0.82 | 0.84 |
| 3 | (Small, Small) | (0.4, 0.4, 0.5, 0.7) | 0.11 | 0.05 | 0.14 | 0.07 | 0.09 | 0.07 |
| 4 | (Large, Small) | (0.3, 0.3, 0.6, 0.8) | 0.54 | 0.36 | 0.85 | 0.70 | 0.83 | 0.84 |
| 5 | (Moderate, Moderate) | (0.2, 0.5, 0.5, 0.8) | 0.49 | 0.31 | 0.73 | 0.51 | 0.63 | 0.57 |
| Binary outcomes | | δ = (δ00, δ01, δ10, δ11) | | | | | | |
| 1 | (Small, No) | (0.45, 0.45, 0.55, 0.55) | 0.05 | 0.02 | 0.05 | 0.03 | 0.04 | 0.03 |
| 2 | (Large, No) | (0.40, 0.40, 0.60, 0.60) | 0.61 | 0.33 | 0.79 | 0.69 | 0.76 | 0.78 |
| 3 | (Small, Small) | (0.45, 0.45, 0.50, 0.60) | 0.07 | 0.02 | 0.07 | 0.04 | 0.05 | 0.04 |
| 4 | (Large, Small) | (0.40, 0.40, 0.50, 0.70) | 0.68 | 0.38 | 0.90 | 0.77 | 0.90 | 0.89 |
| 5 | (Moderate, Moderate) | (0.35, 0.50, 0.50, 0.65) | 0.53 | 0.25 | 0.65 | 0.47 | 0.54 | 0.50 |
In the supplementary materials, Table 5 summarizes the discovery rates. In the first situation with the (50%, 50%) ratio, we found that the sole true effect modifier, x1, is discovered by the CT method with probability 0.54 and by the CART method with probability 0.40. The CT method falsely discovers other covariates with probability 0.17, whereas the CART method does so with probability 0.07. Although the CT method has a higher false discovery rate than CART, falsely discovered partitions are subsequently tested using the second subsample and are eventually trimmed. The simulated power for binary outcomes is shown in the lower part of Table 1. As in the upper part, the CT method also generally outperforms the CART method for binary outcomes. In the analysis of our study, discussed in the next section, we use the (25%, 75%) ratio for sample-splitting since this ratio shows the best compromise (measured by power of the test) between discovery and confirmation of effect modification.
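The split-discover-test pipeline evaluated in this section can be sketched as follows, using scikit-learn's CART implementation as a stand-in for the discovery step. The data-generating values and the `min_samples_leaf` setting are our assumptions, not the paper's, and the held-out test compares each discovered leaf against the overall average effect.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Simulated matched-pair differences: x1 is the sole true effect modifier.
n_pairs = 2000
X = rng.integers(0, 2, (n_pairs, 5)).astype(float)
diff = 0.3 + 0.4 * X[:, 0] + rng.normal(0.0, 1.0, n_pairs)

# (25%, 75%) split: the tree is grown on the first subsample only.
n1 = n_pairs // 4
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100)
tree.fit(X[:n1], diff[:n1])

# Each discovered leaf is then tested on the held-out second subsample,
# comparing the leaf's effect with the overall average effect.
leaves = tree.apply(X[n1:])
pop_avg = diff[n1:].mean()
pvals = {}
for leaf in np.unique(leaves):
    d = diff[n1:][leaves == leaf]
    pvals[leaf] = wilcoxon(d - pop_avg).pvalue
```

Because the leaves are chosen on the first subsample and tested on the second, the per-leaf P-values are not contaminated by the search over tree structures.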
5. Causal Effect of Exposure to PM2.5 on 5-year Mortality in New England
We consider 1,612,414 beneficiaries who entered the Medicare cohort on January 1, 2002 (the reference date). For each enrollee, we calculate exposure to PM2.5 during the two years prior to entry into the cohort, that is, from January 1, 2000 to December 31, 2001.
The two-year average of PM2.5 is obtained on a continuous scale. We create a binary treatment variable using a cutoff value of 12 μg/m3 based on the national ambient air quality standard. Among the 1,612,414 individuals, there are 584,374 treated individuals (i.e., PM2.5 > 12 μg/m3) and 1,028,040 controls (i.e., PM2.5 ≤ 12 μg/m3). We note that the level of PM2.5 is estimated at the centroid of each ZIP code; individuals living in the same ZIP code area share the same value of PM2.5 and thus the same exposure. We use previously published and validated methods to estimate PM2.5 levels; see Di et al. (2016) for details of the exposure estimation methods.
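For illustration, dichotomizing the continuous exposure at the 12 μg/m3 standard amounts to the following; the PM2.5 values below are hypothetical, not from the Medicare data.

```python
import numpy as np

# Hypothetical two-year average PM2.5 (μg/m3) at five ZIP-code centroids.
pm25_two_year_avg = np.array([10.8, 12.4, 11.9, 13.1, 9.6])

# Treated if the two-year average exceeds the national standard of 12 μg/m3;
# everyone in the same ZIP code receives the same treatment label.
treated = pm25_two_year_avg > 12.0
print(treated.sum(), (~treated).sum())  # → 2 3
```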
The outcome is death by the end of the study, December 31, 2006. The outcome is 1 if he/she died at any point before this date and 0 otherwise. In addition to the exposure and the outcome, we consider both individual-level covariates and ZIP code-level covariates. All covariates are measured in 2001 before the reference date. Table 2 displays a summary of the treated and control populations for individual-level and ZIP code-level covariates. Before matching, among the treated individuals we observed a high percentage of Medicaid eligible (thus poorer) individuals, smaller percentage of male and smaller percentage of white.
Table 2:
Summary statistics and covariate balance before and after matching.
| Summary Statistics | Standardized Differences | ||||
|---|---|---|---|---|---|
| Covariates | Treated | Control (Before) | Control (After) | Before | After |
| Individual-level | |||||
| Male (%) | 38.5 | 39.9 | 38.5 | −0.02 | 0.00 |
| White (%) | 92.8 | 96.9 | 92.8 | −0.19 | 0.00 |
| Medicaid Eligible (%) | 10.8 | 9.1 | 10.8 | 0.05 | 0.00 |
| Age (Group, 1–5) | 2.6 | 2.6 | 2.6 | 0.02 | 0.00 |
| Age (65–107) | 76.3 | 76.1 | 76.3 | 0.02 | 0.00 |
| ZIP code-level | |||||
| Temperature (°C) | 10.4 | 9.8 | 10.3 | 0.55 | 0.06 |
| Relative Humidity (%) | 76.1 | 76.9 | 76.1 | −0.44 | 0.01 |
| BMI (%) | 26.1 | 26.3 | 26.1 | −0.44 | −0.06 |
| Smoker Rate (%) | 49.9 | 52.6 | 49.7 | −0.72 | 0.07 |
| Black Population (%) | 6.2 | 3.2 | 6.0 | 0.33 | 0.03 |
| Median Household Income (1000s of $) | 56.1 | 53.8 | 56.7 | 0.10 | −0.03 |
| Median Value of Housing (1000s of $) | 207.5 | 184.8 | 205.9 | 0.20 | 0.01 |
| Below Poverty Level (%) | 8.3 | 9.1 | 8.3 | −0.09 | 0.01 |
| Below High School Education (%) | 30.6 | 30.1 | 30.2 | 0.03 | 0.03 |
| Owner-occupied Housing (%) | 62.9 | 68.9 | 62.7 | −0.33 | 0.01 |
| Population Density (log-scale) | −6.9 | −8.1 | −7.0 | 0.89 | 0.06 |
To adjust for measured confounders and discover heterogeneity, we use a matching method that produces exact matched pairs on four individual-level covariates: white, male, Medicaid eligibility, and age group. To obtain pairs exactly matched on these covariates, the age variable is transformed into five categories (1: 65–70, 2: 71–75, 3: 76–80, 4: 81–85, and 5: above 85). The dataset is stratified into 40 = 2 × 2 × 2 × 5 strata according to the levels of the individual-level covariates. Within each stratum, the ZIP code-level covariates are matched as closely as possible. Matching can be performed using the Optmatch R package. To achieve better covariate balance, we randomly select about 20% of the treated individuals from the entire dataset. This allows us to construct 110,091 matched pairs. Covariate balance is shown in Table 2. Since two matched individuals have the same values of the individual-level covariates, the standardized differences for those covariates are zero. The standardized differences of the ZIP code-level covariates lie between −0.06 and 0.07, which indicates no systematic difference between the treated and control populations.
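A hedged sketch of this exact-matching scheme follows; it is not the Optmatch algorithm itself, the data are simulated, and the within-stratum greedy pairing rule is our simplification of "matched as closely as possible."

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated individual-level data: four exact-match covariates, a binary
# treatment indicator, and one standardized ZIP-code-level covariate.
n = 400
white = rng.integers(0, 2, n)
male = rng.integers(0, 2, n)
medicaid = rng.integers(0, 2, n)
agegrp = rng.integers(1, 6, n)          # five age categories
z = rng.integers(0, 2, n)               # treatment indicator
zipcov = rng.normal(0.0, 1.0, n)        # ZIP-code-level covariate

strata = list(zip(white, male, medicaid, agegrp))
pairs = []
for s in set(strata):
    idx = [i for i, st in enumerate(strata) if st == s]
    # Sort treated and control units by the ZIP-level covariate so that
    # greedy pairing matches similar values; zip() drops unmatched units.
    t = sorted((i for i in idx if z[i] == 1), key=lambda i: zipcov[i])
    c = sorted((i for i in idx if z[i] == 0), key=lambda i: zipcov[i])
    pairs += list(zip(t, c))

# Every pair agrees exactly on the four individual-level covariates, so
# their standardized differences are zero by construction.
assert all(strata[i] == strata[j] for i, j in pairs)
```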
To apply the de novo method, we start by dividing the matched pairs into two subsamples with a (25%, 75%) ratio. The first subsample of 27,500 matched pairs is used for discovering heterogeneous subgroups, and the remaining 82,591 matched pairs are used for conducting hypothesis tests. Figure 2 displays the discovered tree with six disjoint subgroups and four combined subgroups (1 ∪ 2, 1 ∪ 2 ∪ 3, 4 ∪ 5, 4 ∪ 5 ∪ 6) from the first subsample. We grew a tree with the restriction that the maximum depth be d = 3 and trimmed it with cross-validation. This restricted the number of terminal nodes (i.e., G) to be at most 2^d = 8. The tree in the figure was obtained from the CT method; the CART method produced a coarser tree with terminal nodes (1 ∪ 2, 3, 4 ∪ 5, 6). Our simulation studies indeed showed that the CART method is slightly more conservative in creating subgroups than the CT method. To examine the stability of the tree shown in Figure 2 with respect to a different subsample, we considered 1000 trees built from 1000 bootstrapped samples of the first subsample. We found that all trees split first on age. Also, they eventually created the same four age subgroups (1 ∪ 2, 3, 4 ∪ 5, 6), although the order of the age splits differed across bootstrapped trees. Subgroup 1 ∪ 2 was further divided by white 62% of the time, and subgroup 4 ∪ 5 was divided by Medicaid eligibility 52% of the time.
Figure 2:

Discovered tree from the first subsample. Actions are represented on edges. Subgroups whose null hypotheses are rejected at a total significance level α + η = 0.05 are represented by solid rectangles; otherwise, represented by dashed rectangles. The point estimates with the 95% confidence intervals for the subgroup causal effects are computed from the second subsample.
Before conducting tests for heterogeneity, we test whether there is any causal effect of being exposed to PM2.5 > 12 μg/m3 on mortality. The obtained matched pairs are used for testing Fisher's hypothesis of no effect. We consider the truncated product method proposed by Hsu et al. (2013) with the six discovered subgroups (1, …, 6) in Figure 2. This method computes upper bounds on the P-values for each of the six subgroups and then combines the P-values using the truncated product of Zaykin et al. (2002). The null hypothesis can be tested using McNemar tests on the second subsample of 82,591 matched pairs. Applying this method, we found that the hypothesis of no effect is rejected, leading to the conclusion that exposure to high-level PM2.5 significantly increases the risk of death. In addition, Table 3 reports the sensitivity analysis with upper bounds on the P-values. At Γ = 1.28, the hypothesis is still rejected at the 0.044 level, but at Γ = 1.29, it is not rejected at the 0.05 level. We conclude that exposure to high-level PM2.5 increased the 5-year mortality rate even in the presence of unmeasured biases up to Γ = 1.28.
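The two ingredients of this test can be sketched as follows: a Rosenbaum-style upper bound on the one-sided McNemar P-value under bias at most Γ, and a Monte Carlo version of the truncated product combination. The subgroup P-values below are hypothetical, and the paper uses the closed-form null distribution of Zaykin et al. (2002) rather than simulation.

```python
import numpy as np
from scipy.stats import binom

def mcnemar_upper_pvalue(t_died, n_discordant, gamma=1.0):
    # Sensitivity bound for the one-sided McNemar test: under bias at most
    # Gamma, the count of discordant pairs in which the treated unit died is
    # stochastically bounded by Binomial(n_discordant, Gamma / (1 + Gamma)).
    p = gamma / (1.0 + gamma)
    return binom.sf(t_died - 1, n_discordant, p)

def truncated_product_pvalue(pvals, trunc=0.2, n_mc=200_000, seed=0):
    # Zaykin-style truncated product: multiply only P-values <= trunc; the
    # null distribution is approximated here by Monte Carlo (a sketch).
    pvals = np.asarray(pvals)
    def stat(p):
        return np.prod(np.where(p <= trunc, p, 1.0), axis=-1)
    w_obs = stat(pvals)
    rng = np.random.default_rng(seed)
    w_null = stat(rng.uniform(size=(n_mc, len(pvals))))
    return np.mean(w_null <= w_obs)

# Hypothetical subgroup P-values (illustrative only, not Table 3's values).
p_sub = [0.56, 0.18, 0.003, 1e-4, 1e-5, 1e-6]
print(truncated_product_pvalue(p_sub) < 0.05)
```

Raising `gamma` in `mcnemar_upper_pvalue` inflates the upper bound on the P-value, which is how the Γ rows of Table 3 are generated in spirit.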
Table 3:
Sensitivity analysis for testing Fisher’s hypothesis of no effect: Upper bounds on P-values for various Γ
| Γ | Subgroup 1 | Subgroup 2 | Subgroup 3 | Subgroup 4 | Subgroup 5 | Subgroup 6 | Truncated Product |
|---|---|---|---|---|---|---|---|
| 1.00 | 0.558 | 0.183 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1.10 | 1.000 | 0.697 | 0.879 | 0.201 | 0.001 | 0.000 | 0.000 |
| 1.20 | 1.000 | 0.965 | 1.000 | 0.989 | 0.024 | 0.000 | 0.000 |
| 1.25 | 1.000 | 0.992 | 1.000 | 1.000 | 0.073 | 0.000 | 0.006 |
| 1.28 | 1.000 | 0.997 | 1.000 | 1.000 | 0.124 | 0.001 | 0.044 |
| 1.29 | 1.000 | 0.998 | 1.000 | 1.000 | 0.146 | 0.003 | 0.072 |
Furthermore, the sensitivity parameter Γ can be represented as a curve of two parameters (Λ, Δ) with Γ = (ΔΛ + 1)/(Δ + Λ); see Rosenbaum and Silber (2009). The parameter Λ describes the relationship between an unmeasured confounder ugij and the treatment assignment Zgij, and the parameter Δ describes the relationship between ugij and the potential outcome rCgij. For example, Γ = 1.28 corresponds to Λ = 2.17 and Δ = 2. To illustrate, consider an unmeasured variable ugij, such as time spent outdoors, that is negatively associated with both the treatment and the outcome. Here, (Λ, Δ) = (2.17, 2) implies that ugij increases the odds of exposure to high-level PM2.5 by 2.17-fold and doubles the odds of death. Our sensitivity analysis shows that the conclusion holds in the presence of any ugij with (Λ, Δ) satisfying (ΔΛ + 1)/(Δ + Λ) ≤ 1.28.
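The amplification map can be checked directly; a minimal sketch:

```python
def gamma_from(lam, delta):
    # Rosenbaum-Silber amplification: each Gamma corresponds to the curve
    # of (Lambda, Delta) pairs with Gamma = (Delta*Lambda + 1)/(Delta + Lambda).
    return (delta * lam + 1.0) / (delta + lam)

# (Lambda, Delta) = (2.17, 2) lies on the Gamma = 1.28 curve; note the map
# is symmetric in its two arguments.
print(round(gamma_from(2.17, 2.0), 2))  # → 1.28
```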
Returning to testing the null hypothesis H0 of no heterogeneity, the second subsample is used for confirming and identifying subgroups with heterogeneous causal effects within the discovered tree structure shown in Figure 2. Since we do not know the true value of the population average δ, we first estimate the 100(1 − η)% confidence interval for δ with η = 0.01, obtaining (1.12%, 2.38%), and test the global null hypothesis H0 of no heterogeneity for each value within this interval. Table 4 shows the ten deviates for the discovered subgroups at various δ0 with Γ = 1. A negative deviate means that the corresponding subgroup causal effect is below the population average, and a positive deviate means the opposite. The critical value at Γ = 1 is κΓ,α = 2.80, obtained from the multivariate Normal distribution with α = 0.04 to achieve a total significance level of α + η = 0.05. At Γ = 1, the maximum absolute deviate DΓ,max is reported in the last column, and the minimum test statistic DΓ,minmax is 7.30, which is larger than κΓ,α = 2.80. This indicates statistically significant evidence of heterogeneity when there is no unmeasured confounding.
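For intuition about the magnitude of κΓ,α, an independence-based approximation for the maximum of G absolute standardized deviates can be computed as below. The paper instead uses the full multivariate Normal distribution with the deviates' correlation structure (Genz and Bretz, 2009), which yields the slightly smaller value 2.80; treating the deviates as independent is our simplifying assumption.

```python
from scipy.stats import norm

# If the ten deviates were independent standard Normals, the critical value
# kappa for max|D| at level alpha would solve (2*Phi(kappa) - 1)**G = 1 - alpha.
G, alpha = 10, 0.04
kappa = norm.ppf(0.5 * (1.0 + (1.0 - alpha) ** (1.0 / G)))
print(round(kappa, 2))
```

Accounting for positive correlation among the deviates (the combined subgroups overlap the disjoint ones) lowers this value toward the paper's 2.80.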
Table 4:
Sensitivity analysis for testing the null hypothesis of no heterogeneity and description of the discovered subgroups. The upper table shows ten deviates from the subgroups with the maximum absolute deviate where the critical values κΓ,α = 2.80 for Γ = 1 when α = 0.04 and η = 0.01 and κΓ,α = 2.72 for Γ > 1 when α = 0.05 and η = 0, and the lower table shows the proportions of the subgroups and comparisons of outcomes between treated and control populations.
| Subgroups | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 1 ⋃ 2 | 1 ⋃ 2 ⋃ 3 | 4 ⋃ 5 | 4 ⋃ 5 ⋃ 6 | |||
| Γ | δ0 | DΓ1 | DΓ2 | DΓ3 | DΓ4 | DΓ5 | DΓ6 | DΓ7 | DΓ8 | DΓ9 | DΓ10 | DΓ,max |
| 1 | 0.0112 | −3.43 | −0.29 | 0.27 | 2.23 | 3.30 | 8.66 | −3.37 | −2.55 | 3.20 | 8.14 | 8.66 |
| 0.0143 | −4.37 | −0.57 | −0.31 | 1.80 | 3.16 | 8.25 | −4.34 | −3.68 | 2.75 | 7.54 | 8.25 | |
| 0.0175 | −5.31 | −0.85 | −0.91 | 1.36 | 2.99 | 7.85 | −5.33 | −4.83 | 2.28 | 6.92 | 7.85 | |
| 0.0207 | −6.27 | −1.13 | −1.50 | 0.92 | 2.85 | 7.45 | −6.32 | −5.97 | 1.82 | 6.31 | 7.45 | |
| 0.0238 | −7.20 | −1.41 | −2.09 | 0.48 | 2.71 | 7.05 | −7.30 | −7.11 | 1.36 | 5.70 | 7.30 | |
| Γ | δ0 | DΓ1 | DΓ2 | DΓ3 | DΓ4 | DΓ5 | DΓ6 | DΓ7 | DΓ8 | DΓ9 | DΓ10 | DΓ,minmax |
| 1.010 | 0.0242 | −6.39 | −1.16 | −1.52 | 0.44 | 2.50 | 6.53 | −6.45 | −6.10 | 1.46 | 5.85 | 6.53 |
| 1.050 | 0.0285 | −4.07 | −0.45 | 0.00 | 0.00 | 1.61 | 4.17 | −4.03 | −3.26 | 0.80 | 3.80 | 4.17 |
| 1.070 | 0.0306 | −2.97 | −0.13 | 0.00 | 0.00 | 1.18 | 3.02 | −2.87 | −2.37 | 0.63 | 2.84 | 3.02 |
| 1.075 | 0.0312 | −2.72 | −0.05 | 0.00 | 0.00 | 1.06 | 2.73 | −2.61 | −2.16 | 0.58 | 2.57 | 2.73 |
| 1.076 | 0.0312 | −2.63 | −0.02 | 0.00 | 0.00 | 1.04 | 2.69 | −2.52 | −2.09 | 0.57 | 2.54 | 2.69 |
| Subgroups | Total | |||||||||||
| Proportion (%) | 46.2 | 4.2 | 22.3 | 13.7 | 1.7 | 11.9 | 50.4 | 72.8 | 15.4 | 27.2 | 100.0 | |
| Treated (%) | 14.5 | 15.2 | 26.5 | 39.6 | 54.6 | 61.7 | 14.6 | 18.3 | 41.2 | 50.1 | 26.9 | |
| Control (%) | 14.6 | 14.4 | 25.3 | 36.8 | 46.6 | 53.7 | 14.6 | 17.8 | 37.9 | 44.8 | 25.2 | |
| Risk difference (%) | 0.0 | 0.8 | 1.3 | 2.7 | 8.0 | 7.9 | 0.0 | 0.4 | 3.3 | 5.3 | 1.8 | |
| Odds ratio | 1.00 | 1.07 | 1.07 | 1.12 | 1.38 | 1.39 | 1.00 | 1.03 | 1.15 | 1.24 | 1.10 | |
In addition, one may be interested in testing, for a certain subgroup, the null hypothesis that the exposure effect in that subgroup equals the population average. For instance, policymakers may want to know whether Medicare beneficiaries aged between 81–85 (i.e., 4 ∪ 5) are at high risk of death compared to the whole-population average. To test this, we can focus on the subset of deviates {DΓ4, DΓ5, DΓ9}. The corresponding critical value is 2.38, which is smaller than κΓ,α = 2.80. At Γ = 1, the null hypothesis for 4 ∪ 5 is rejected since the maximum of these deviates exceeds 2.38 for all values of δ0 in the interval. For the terminal nodes, the sub-hypotheses can be tested with the critical value obtained from the standard Normal distribution with α = 0.04. Other critical values are 2.37 for 1 ∪ 2, 2.55 for 1 ∪ 2 ∪ 3, and 2.56 for 4 ∪ 5 ∪ 6. Figure 2 represents subgroups whose null hypotheses are rejected at Γ = 1 with solid rectangles, and the others with dashed rectangles. Subgroup 1 (white, aged between 65–75) has an effect size significantly lower than the population average, while subgroups 5 and 6 have effect sizes significantly higher than the population average. Also displayed in Figure 2 are the point estimates and 95% confidence intervals for the subgroup causal effects. We note that each subgroup's confidence interval is computed by inverting the null hypothesis for that subgroup; for example, the confidence interval for subgroup 1 is an inversion of the test of H0 : δ = δ1, not of H0 : δ = δ0, where δ1 is the average exposure effect within subgroup 1. The lower part of Table 4 provides detailed descriptions of the discovered subgroups.
Table 4 also reports a sensitivity analysis for unmeasured confounding in discovering heterogeneous subgroups. We set η = 0 and α = 0.05 for Γ > 1, which produces the critical value κΓ,α = 2.72. For each value of Γ, only the minimum of DΓ,max (i.e., DΓ,minmax) is reported in the table. For example, at Γ = 1.01, the minimum DΓ,minmax = 6.53 is obtained at δ0 = 0.0242. DΓ,minmax is attained at either the deviate DΓ1 or the deviate DΓ6, so it can be inferred that subgroups 1 and 6 are least sensitive to unmeasured biases. Figure 3 displays the maximum absolute deviate DΓ,max across δ0 in the interval [0, 0.04] for each value of Γ. The curve of DΓ,max has a V-shape. All the curves attain their minimum DΓ,minmax within the interval, and for Γ ≤ 1.07 the curves lie above the horizontal line at the critical value κΓ,α = 2.72. This implies that the null hypothesis H0 of no heterogeneity is rejected only up to Γ ≤ 1.07. Table 4 shows more finely calibrated values of Γ: DΓ,minmax remains larger than κΓ,α = 2.72 until Γ = 1.075. This sensitivity analysis thus shows statistically significant evidence of heterogeneity up to Γ = 1.075. The value Γ = 1.075 corresponds to a situation where an unobserved confounder could increase the odds of exposure to high-level PM2.5 by 1.5-fold and increase the odds of death by more than 1.44-fold (i.e., (Λ, Δ) = (1.5, 1.44)).
Figure 3:

The maximum absolute deviate DΓ,max across δ0 in the interval [0, 0.04] for various Γ. The dashed line represents the critical value κΓ,α = 2.72.
Our findings about effect modification in the Medicare population can be compared to previous findings. We showed that the older subgroups (i.e., Subgroups 5 and 6) have significantly higher mortality effects, consistent with Di et al. (2017). Di et al. (2017) conducted ad-hoc subgroup analyses for sex, race, and Medicaid eligibility, and found that the Black population was most vulnerable to air pollution exposure, while other factors did not make notable differences. We also found that the non-white subpopulation was more vulnerable than the white subpopulation among those aged below 75. However, we did not find that non-whites were more vulnerable in the 75+ subpopulation; instead, those who were eligible for Medicaid had a higher mortality effect in this subpopulation. These specific subpopulations had not been discovered in the previous literature, and, more importantly, we made valid statistical inferences with a thorough sensitivity analysis for unmeasured confounding.
6. Discussion
We introduced a new approach for de novo discovery of subgroups in which the causal effect of a binary treatment (or exposure) on a binary or continuous outcome differs from the average causal effect for the whole population. We considered both exploratory and confirmatory statistical analyses. Instead of determining a set of covariates a priori before making an inference, an exploratory search reveals the tree; randomization-based tests are then conducted to identify subgroups within the tree with significantly different effects relative to the population average. We also developed a sensitivity analysis to assess the effect of unmeasured confounding bias on the conclusions regarding both the population average and the subgroup-specific causal effects.
The de novo method relies on a sample-splitting approach that divides the entire sample into two subsamples. However, there is little literature on selecting the optimal splitting ratio; in particular, when applying the method, it is not known which ratio provides the highest power. We considered three ratios in the simulation studies of Section 4 and found that the optimal ratio among the three depended on the size of the effect modification. However, these simulation results cannot serve as a general guideline for those without prior knowledge of how large the effect modification might be. Selecting the optimal ratio without any prior information is an interesting problem for future research.
When discovering a tree Π, the size of Π needs to be controlled. As discussed in Section 3.3, the number of terminal nodes in the tree should not be large compared to the sample size, in order to justify the large-sample approximation. However, when discovering Π, its size is selected by cross-validation and is therefore random, not fixed. To the best of our knowledge, there are no theoretical results on the tree size obtained from cross-validation. When too large a value of G is selected, the large-sample approximation may fail since min(Ig) may not be large enough. To avoid this issue, we restrict the maximum depth d of a tree when growing it; for example, in our application, we set d = 3. This restriction guarantees that the final number of terminal nodes is at most 2^d for a fixed value of d.
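The depth restriction is easy to enforce with any standard tree implementation; a brief sketch using scikit-learn with simulated data (our choice of implementation, not the authors'):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(size=1000)

# Restricting the maximum depth d caps the number of terminal nodes G at
# 2**d, keeping min(I_g) large enough for the large-sample approximation.
d = 3
tree = DecisionTreeRegressor(max_depth=d).fit(X, y)
assert tree.get_n_leaves() <= 2 ** d
```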
To assess the degree of heterogeneity in the presence of unmeasured confounding, a sensitivity analysis can be conducted for various values of Γ. When an unmeasured bias is present, the distribution of Z is governed by Γ. The change in the distribution of Z affects both (a) estimating the 100(1 − η)% confidence interval and (b) testing the null hypothesis H0 at level α. For Γ > 1, as Γ increases, the 100(1 − η)% confidence interval for τ rapidly converges to the real line, even for a comparatively large η. When choosing a large value of η, the obtained confidence interval may be narrow enough up to a certain Γ. However, there is a trade-off between η and α: a large η means a small α for testing, which may lead to a loss of power. It is difficult to find the optimal balance between η and α since the optimal balance depends on the true size of Γ, which is unknown. More transparently, we propose to set η = 0, which means considering all values of τ on the real line. For many test statistics, such as Wilcoxon's signed rank statistic, DΓ,max is substantially larger when τ is too small or too large, and the minimum of DΓ,max is attained within a sizable range. In practice, a wide enough range of τ can be chosen by verifying that DΓ,max is large enough at the ends of the range, even for a large Γ. This approach requires more intensive computation, but we may expect a minimal power loss because η = 0.
The proposed method for testing the sub-hypothesis for subgroup g can be improved by considering the logical implications between the sub-hypothesis for g and the sub-hypotheses for all g′ ≠ g: rejecting the latter collection leads to rejection of the former. Therefore, if either the sub-hypothesis for g or the collection of sub-hypotheses for all g′ ≠ g is rejected, then we can reject the sub-hypothesis for g. This logical implication increases the power of the test. Furthermore, there is no need to construct the confidence interval for τ; instead, every value of τ on the real line can be checked. If the specified value of τ is too large or too small, it is highly likely that one of these hypotheses is rejected. Therefore, for a given total significance level α + η, α can be maximized by setting η = 0.
Finally, the proposed method is not limited to air pollution studies; it can be applied to any research question regarding effect modification. For instance, in social science, discovering effect modification can help reveal causal mechanisms in which effects vary with background variables. Also, in precision medicine, discovering patient-specific subgroups that benefit most (or least) from a treatment could be of interest, allowing practitioners to maximize treatment efficacy and minimize side effects. In practice, the tree discovery step can be tuned according to the questions of interest. For public policy implications, discovering a tree with a few simple subgroups is particularly important; in precision medicine, by contrast, discovering a larger tree with more refined subgroups could be of interest. When a larger discovered tree is preferred, it is better to assign a larger portion of the total sample to the discovery subsample.
Supplementary Material
Acknowledgment
We are grateful for helpful feedback from the editor, the associate editor, four anonymous referees, and session participants at JSM and European Causal Inference Meeting.
Funding
This work was supported by NIH grants (R01GM111339, R01ES024332, R01ES026217, P50MD010428, DP2MD012722, R01ES028033, R01MD012769) and HEI grant (4953-RFA14-3/16-4).
Footnotes
SUPPLEMENTARY MATERIAL
Appendix The online appendix contains proofs for Theorems 1–2, additional simulations for the discovery step. (.pdf file)
R Code for Application and Simulations An R script illustrates our methods with a simulated data set. Code for both implementing the de novo method and producing the simulation results is provided. (.R file)
References
- Athey S and Imbens G (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113(27), 7353–7360.
- Breiman L, Friedman JH, Olshen RA, and Stone CJ (1984). Classification and Regression Trees. New York, NY: Chapman & Hall/CRC.
- Di Q, Dai L, Wang Y, Zanobetti A, Choirat C, Schwartz JD, and Dominici F (2017). Association of short-term exposure to air pollution with mortality in older adults. JAMA 318(24), 2446.
- Di Q, Kloog I, Koutrakis P, Lyapustin A, Wang Y, and Schwartz J (2016). Assessing PM2.5 exposures with high spatiotemporal resolution across the continental United States. Environmental Science & Technology 50(9), 4712–4721.
- Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C, Dominici F, and Schwartz JD (2017). Air pollution and mortality in the Medicare population. New England Journal of Medicine 376(26), 2513–2522.
- Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, Ferris BG, and Speizer FE (1993). An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine 329(24), 1753–1759.
- Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, and Samet JM (2006). Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA 295(10), 1127.
- Fogarty CB, Mikkelsen ME, Gaieski DF, and Small DS (2016). Discrete optimization for interpretable study populations and randomization inference in an observational study of severe sepsis mortality. Journal of the American Statistical Association 111(514), 447–458.
- Fogarty CB, Shi P, Mikkelsen ME, and Small DS (2017). Randomization inference and sensitivity analysis for composite null hypotheses with binary outcomes in matched observational studies. Journal of the American Statistical Association 112(517), 321–331.
- Gastwirth JL, Krieger AM, and Rosenbaum PR (2000). Asymptotic separability in sensitivity analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(3), 545–555.
- Genz A and Bretz F (2009). Computation of Multivariate Normal and t Probabilities. Berlin: Springer.
- Hahn PR, Murray JS, and Carvalho CM (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis 15(3), 965–1056.
- Hansen BB (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association 99(467), 609–618.
- Hsu JY, Small DS, and Rosenbaum PR (2013). Effect modification and design sensitivity in observational studies. Journal of the American Statistical Association 108(501), 135–148.
- Lee K, Small DS, and Rosenbaum PR (2018). A powerful approach to the study of moderate effect modification in observational studies. Biometrics 74(4), 1161–1170.
- Loomis D, Grosse Y, Lauby-Secretan B, Ghissassi FE, Bouvard V, Benbrahim-Tallaa L, Guha N, Baan R, Mattock H, and Straif K (2013). The carcinogenicity of outdoor air pollution. The Lancet Oncology 14(13), 1262–1263.
- Makar M, Antonelli J, Di Q, Cutler D, Schwartz J, and Dominici F (2017). Estimating the causal effect of low levels of fine particulate matter on hospitalization. Epidemiology 28(5), 627–634.
- Neyman J (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
- Pope CA, Lefler JS, Ezzati M, Higbee JD, Marshall JD, Kim S-Y, Bechle M, Gilliat KS, Vernon SE, Robinson AL, and Burnett RT (2019). Mortality risk and fine particulate air pollution in a large, representative cohort of U.S. adults. Environmental Health Perspectives 127(7), 077007.
- Rajagopalan S, Al-Kindi SG, and Brook RD (2018). Air pollution and cardiovascular disease. Journal of the American College of Cardiology 72(17), 2054–2070.
- Rückerl R, Schneider A, Breitner S, Cyrys J, and Peters A (2011). Health effects of particulate air pollution: A review of epidemiological evidence. Inhalation Toxicology 23(10), 555–592.
- Rosenbaum PR (2002a). Covariance adjustment in randomized experiments and observational studies. Statistical Science 17(3), 286–327.
- Rosenbaum PR (2002b). Observational Studies. New York: Springer.
- Rosenbaum PR (2010). Design of Observational Studies. New York: Springer.
- Rosenbaum PR (2012). Testing one hypothesis twice in observational studies. Biometrika 99(4), 763–774.
- Rosenbaum PR (2017). Observation and Experiment. Cambridge, MA: Harvard University Press.
- Rosenbaum PR and Silber JH (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association 104(488), 1398–1405.
- Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.
- Samet JM, Dominici F, Curriero FC, Coursac I, and Zeger SL (2000). Fine particulate air pollution and mortality in 20 U.S. cities, 1987–1994. New England Journal of Medicine 343(24), 1742–1749.
- Stuart EA (2010). Matching methods for causal inference: A review and a look forward. Statistical Science 25(1), 1–21.
- Su X, Tsai C-L, Wang H, Nickerson DM, and Li B (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 10(5), 141–158.
- Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
- Wu X, Braun D, Schwartz J, Kioumourtzoglou MA, and Dominici F (2020). Evaluating the impact of long-term exposure to fine particulate matter on mortality among the elderly. Science Advances 6(29), eaba5692.
- Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS (2002). Truncated product method for combining P-values. Genetic Epidemiology 22(2), 170–185.
- Zhang H and Singer B (2010). Recursive Partitioning and Applications. New York: Springer.