Abstract
Big data and (deep) machine learning have become powerful tools in digital medicine, but these tools focus mainly on association, whereas intervention in medicine is about causal effects. The average treatment effect has long been studied as a measure of causal effect, under the assumption that the effect size is the same across the whole population. However, no “one-size-fits-all” treatment seems to work in some complex diseases, and treatment effects may vary by patient. Estimating heterogeneous treatment effects (HTE) may therefore have a high impact on developing personalized treatment. Many advanced machine learning models for estimating HTE have emerged in recent years, but there has been limited translational research into the real-world healthcare domain. To fill the gap, we reviewed and compared eleven recent HTE estimation methodologies, including meta-learners, representation learning models, and tree-based models. We performed a comprehensive benchmark experiment based on nationwide healthcare claim data, with an application to Alzheimer’s disease drug repurposing. We discuss challenges and opportunities in HTE estimation for the healthcare domain, aiming to close the gap between innovative HTE models and their deployment to real-world healthcare problems.
Keywords: Causal inference, Target trial, Conditional average treatment effect, Drug development, Deep learning, Machine learning
1. Introduction
Causal inference identifies the causes of an effect. Although randomized experiments (e.g., randomized clinical trials, A/B tests) are the de facto gold standard for identifying causation, they are sometimes economically infeasible or unethical if the intervention harms subjects [1]. Treatment effect estimation using observational data (e.g., real-world data) is an alternative strategy to emulate randomized experiments and infer causation. However, observational data inevitably contain bias. A confounding variable is a variable that influences both exposure to the treatment and the outcome (Fig. 1A). It is one of the major sources of bias and can mislead us into the wrong conclusion that the treatment has an effect on the outcome when it does not [2]. Statistical approaches to reducing such bias in observational data for treatment effect estimation have long been studied in multiple disciplines. For example, the target trial framework in epidemiology and biostatistics has focused on hypothesis testing to infer the average treatment effect by adjusting for confounders via matching or weighting [3–12] (Fig. 1Bc, 1Bd).
Fig. 1.
Illustrations of treatment effects analysis in the medical science field. A. An example of a causal relationship. B. Estimating causal treatment effects from real-world data under the Neyman-Rubin framework. (a) Subjects in RCTs are randomly assigned to a treatment group and a control group, thus the subjects in both groups have similar characteristics. (b) Subjects in real-world data are not randomly assigned to a treatment group and control group due to disease indication. (c) Matching subjects in each group can reduce bias [13]. (d) Weighting subjects by their propensity for treatment can create a comparable pseudo population [14,15] (Details described in S.1.1). (e) Neyman-Rubin causal effect calculation. C. Heterogeneous treatment effects vs. Average treatment effect. Patients are diverse and treatment effects vary. Estimating the average treatment effect (ATE) may oversimplify the heterogeneity of each patient.
However, patients are diverse and treatment effects vary. Decades of drug development in complex diseases have shown that there is no “one-size-fits-all” treatment [16]. Estimating the average treatment effect (ATE) may oversimplify the heterogeneity of each patient. The need for personalized treatment is tremendous. Therefore, it is important to estimate treatment effects for each individual or similar subgroups of patients, which is the so-called heterogeneous treatment effect (HTE) (Fig. 1C). The HTE estimation has transformative potential in personalized medicine by respecting the disease and patients’ heterogeneity.
The HTE estimation is recently gaining attention in econometrics [17] (e.g. uplift modeling), and machine learning (non-parametric HTE) [18], but is rarely investigated in the computational medicine area [19]. To close the gap in this translational effort, we review and compare some recent HTE methodologies and perform benchmark experiments to test the feasibility of the methodologies in emulating clinical trials for personalized treatment development. Our benchmark experiments use nationwide electronic health records with ~ 60 M patients in the US under the target trial protocol. Our scope is within a translational biomedical research perspective, particularly with digital health. This review and benchmark paper adapts notations and naming strategies from various sources including econml [20], causalml [21], Künzel et al. [22], and Bica et al. [19]. For a theoretical and methodological comparison, see [23–25]. For an econometric perspective, see [26]. For a clinical pharmacology perspective, see [19].
2. Preliminaries
2.1. Potential outcome framework
In this paper, we investigate models built under the Neyman-Rubin potential outcome framework [27,28]. Suppose that we have N subjects (i = 1, ⋯, N) with features X. For each subject, T denotes the treatment assignment: T = 1 if the subject is in the treatment group with potential outcome Y(1), and T = 0 if the subject is in the control (placebo) group with potential outcome Y(0). The Neyman-Rubin framework estimates the treatment effect given subject features X (Fig. 1Be) by
τ(x) = E[Y(1) − Y(0) | X = x]     (1)
The potential outcome framework requires several assumptions:
Strong ignorability (or exchangeability). We assume that no unobserved confounders exist, i.e., we observe all variables X affecting the treatment assignment T and the outcomes Y: (Y(1), Y(0)) ⊥ T | X [27,29]. In the language of real-world drug administration, we observe a sufficient set of confounding variables on patient characteristics that determine both the outcome and the administration of the drug.
Positivity. The probability P(T|X) of receiving the treatment is not deterministic, i.e., 0 < P(T|X) < 1 [30]. For example, a patient’s features X do not guarantee with 100 % certainty the administration of a particular drug.
Stable Unit Treatment Value Assumption (SUTVA). Each subject’s potential outcomes remain the same regardless of what treatments the other subjects receive (no interference between subjects) [27]. For example, a patient taking a drug does not affect other patients’ choice of drugs.
2.2. Counterfactual outcome
A fundamental challenge of treatment effect estimation is that it is impossible to observe Y(1) and Y(0) simultaneously. We call the outcome that the subject actually experiences the factual outcome, and the hypothetical outcome under the alternative treatment the counterfactual outcome. A common approach to address the missing counterfactual outcome is to calculate the average of the potential outcomes in the treatment group and the control group separately after randomization: if the treatment assignment is randomized, we can estimate the average treatment effect (ATE) by E[Y(1)] − E[Y(0)]. However, in real-world observational data, patients take drugs based on indication, not at random, so the patients exposed to a drug and those not exposed are not equivalent; the confounding variables create selection bias between the treated and untreated. Several techniques to adjust for confounding variables use the propensity score e(x) = P(T|X), such as matching (Fig. 1Bc), stratification, and inverse weighting (Fig. 1Bd), as well as doubly robust estimation [31] and conditional independence (g-estimation) [32]. More details are in Supplementary Material S.1.1.
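As an illustration, the following minimal sketch on synthetic data (assuming a single confounder and a correctly specified logistic propensity model; all names are our own) shows how inverse propensity weighting recovers the ATE that a naive difference in means gets wrong:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000

# A single confounder X drives both treatment assignment and outcome.
X = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-2.0 * X[:, 0]))          # true propensity e(x)
T = rng.binomial(1, p_true)
Y = 1.0 * T + 3.0 * X[:, 0] + rng.normal(size=n)   # true ATE = 1.0

# Naive difference in means is badly biased by confounding.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Estimate the propensity score, then apply self-normalized (Hajek) IPW.
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
w1, w0 = T / e, (1 - T) / (1 - e)
ipw = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
```

The self-normalized (Hajek) form is used here because it is more stable than raw inverse weights when some propensities are small.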
3. Heterogeneous treatment effects
3.1. Overview
Heterogeneous treatment effect estimation quantifies the treatment effect of individuals or subgroups by accounting for the heterogeneity of patients’ conditions with respect to the outcome while reducing selection bias. The HTE of treatment T ∈ {1, 0} given the patient’s condition X is formulated as Eq. (1), and varies with the subject’s features X. A general recipe for estimating the HTE is to first learn the “nuisance” or “context” (the likelihood of being exposed to treatment T and the expected outcomes Y given subject features X) via arbitrary supervised models (Fig. 2A and Fig. 2B), and then estimate the HTE either by learning a coefficient of a structural equation model or by imputing the missing counterfactual outcomes in non-parametric methods (Fig. 2C). Many HTE estimation methods have been proposed [18,22,33–37]. Among them, double machine learning, meta-learners, representation learning, and causal forests are the most popular and are flexible enough to be implemented in many scenarios. We focus on these four families of methods, summarized in Table 1.
Fig. 2.
A toy example of heterogeneous treatment effects with high-dimensional features. Based on the subject’s feature X, the subject can be exposed to treatment T or not (i.e.,E[T|X]) and has different levels of outcomes (i.e.,E[Y|X]). The treatment effects also vary based on these different contexts or nuisances stemming from the subject’s features.
Table 1.
Comparison of HTE estimation methods.
Models | Characteristics | Treatment type | How is the treatment assignment incorporated? | How is the selection bias handled? |
---|---|---|---|---|
Double machine learning [33] | Learn HTE by the coefficient of a partially linear structural equation Widely used in the econometric community | Continuous, discrete | T is a function of X to learn. | T is treated as a function of X to learn. |
Single learner | Learn single outcome prediction model for both the treated and untreated group (E[Y|X,T]) The treated and untreated group share the same regression models | Continuous, discrete | T is a variable (together with X) to predict outcome | Not considered |
Two-learner | Learn multiple outcome prediction models for each treatment group. | Discrete | Separate outcome prediction model based on T | Not considered |
X-learner [22] | T learner with additional treatment effect estimation regression models for each of the treated and untreated groups. Handle data size imbalance in the treated and untreated group | Discrete | Separate outcome prediction model based on T | Weighted average of each treatment effect estimation based on propensity scores |
BNN [34] | Minimize the distribution distance between the transformed representation of the treated and untreated | Binary | Not considered | Covariate shift |
TARNET [18] | BNN with multi-task heads for each treatment group, without covariate shift | Binary | Separate potential outcome prediction layers on each T | Not considered |
CFR [38] | BNN with multi-task heads for each treatment group | Binary | Separate potential outcome prediction layers on each T | Covariate shift |
DR-CFR [35] | Separate representations for propensity score and outcomes prediction | Binary | Separate potential outcome prediction layers on each T | Covariate shift |
SCIGAN [36] | Estimate counterfactual outcomes via generative adversarial networks. | Continuous, discrete | T is a variable to predict dose-dependent outcome | Covariate shift |
Dragonnet [37] | Predict Y using X that are predictive of T (Sufficiency of Propensity Scores). Regularize the outcome prediction to satisfy augmented IPTW estimator | Discrete | Separate potential outcome prediction layers on each T | Learn neural network to predict outcome satisfying augmented IPTW estimator |
Double sample causal tree [39] | Use an ‘honest’ method to grow a tree. The splitting criterion is to maximize the variance of estimated treatment effects among samples in the training subset. | Discrete | Not considered; only for randomized data, in which treatment assignment is fully randomized | Honest splitting criterion and cross-validation |
Causal forest [40] | Uses multiple causal trees to obtain predictions of treatment effects (ensembling). Point estimates from the model are asymptotically normal and unbiased, allowing confidence intervals to be calculated. | Discrete | Not considered; only for randomized data, in which treatment assignment is fully randomized | Honest splitting criterion and cross-validation |
3.2. Double machine learning
Double machine learning shows how parametric causal inference evolves into semi-parametric causal inference with machine learning. A challenge that classical causal inference faces is the high-dimensional and complex nuisance parameter around the causal estimates. Complex nuisance parameters are often difficult to estimate within a classical semi-parametric framework (e.g., 10 binary features yield 2^10 combinations of features to test whether the treatment effects are distinctive from one another). To address the high dimensionality of features, Chernozhukov et al. propose replacing the semi-parametric nuisance estimation with arbitrary supervised machine learning models, the so-called double machine learning [33]. The double machine learning architecture is composed of two arbitrary supervised machine learning models, which estimate the conditional probability of taking the treatment and the conditional expectation of the outcome, respectively. One can then estimate the HTE by fitting a regression model on the residuals (details in Supplementary Material S.1.2). The architecture was later generalized to the R-learner, which was shown to achieve a quasi-oracle asymptotic error rate.
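The double machine learning recipe can be sketched on synthetic data (a constant-effect partially linear setup of our own construction, not the authors’ implementation): residualize Y and T on X with cross-fitted machine learning models, then regress residual on residual:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-X[:, 0]))                       # confounded assignment
T = rng.binomial(1, p)
Y = 2.0 * T + np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=n)  # true effect 2.0

# Stage 1 (cross-fitted nuisances): E[Y|X] and E[T|X] with flexible ML models.
y_res, t_res = np.zeros(n), np.zeros(n)
for tr, te in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    y_hat = RandomForestRegressor(random_state=0).fit(X[tr], Y[tr]).predict(X[te])
    t_hat = RandomForestClassifier(random_state=0).fit(X[tr], T[tr]).predict_proba(X[te])[:, 1]
    y_res[te], t_res[te] = Y[te] - y_hat, T[te] - t_hat

# Stage 2: regress outcome residuals on treatment residuals
# (the coefficient of the partially linear structural equation).
theta = LinearRegression(fit_intercept=False).fit(t_res.reshape(-1, 1), y_res).coef_[0]
```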
3.3. Meta-learners
Meta-learners are nonparametric HTE estimation methods that treat HTE estimation for discrete treatments as missing counterfactual outcome imputation. They decompose the HTE estimation into several sub-regression problems in which any arbitrary supervised machine learning model can be utilized. We follow the naming strategy introduced by Künzel et al. [22].
Single-learner (or S-learner), the most basic method in meta-learner, treats the treatment assignment variable T as just another feature (in addition to X) and builds a “single” supervised model to estimate the outcomes Y (Fig. 3Aa).
Two-learner (T-learner) fits separate regression models for the treatment and control groups (Fig. 3Ab). The advantage of the T-learner over the S-learner is that it performs relatively better when the outcome under treatment and the outcome under control are not very similar [22].
X-learner is a variant of the T-learner with extra steps to separate the HTE functions (Fig. 3Ac) [22]. It is named after the “X”-shaped use of training data for counterfactual outcome estimation. The main advantage of the X-learner over the T-learner is that it separates the treatment effect regressions of the treatment and placebo groups so that the HTE estimators can capture information about their differences. This strategy is beneficial when the numbers of treated and untreated subjects are not balanced.
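The three meta-learners above can be sketched in a few lines with scikit-learn base learners (a toy illustration on randomized synthetic data with a known effect; under randomization the propensity score used by the X-learner is simply 0.5):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 3000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 0.5, size=n)
tau = 1.0 + X[:, 0]                              # heterogeneous effect, known here
Y = X[:, 1] + tau * T + 0.1 * rng.normal(size=n)

# S-learner: one model with T appended as an extra feature.
s = RandomForestRegressor(random_state=0).fit(np.column_stack([X, T]), Y)
tau_s = (s.predict(np.column_stack([X, np.ones(n)]))
         - s.predict(np.column_stack([X, np.zeros(n)])))

# T-learner: separate outcome models per arm.
m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
tau_t = m1.predict(X) - m0.predict(X)

# X-learner: impute each arm's counterfactual, fit effect regressions,
# then blend with the propensity score.
d1 = Y[T == 1] - m0.predict(X[T == 1])           # imputed effects for the treated
d0 = m1.predict(X[T == 0]) - Y[T == 0]           # imputed effects for the controls
g1 = RandomForestRegressor(random_state=0).fit(X[T == 1], d1)
g0 = RandomForestRegressor(random_state=0).fit(X[T == 0], d0)
e = 0.5                                          # propensity under randomization
tau_x = e * g0.predict(X) + (1 - e) * g1.predict(X)
```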
Fig. 3.
Illustrations of different architectures of HTE models (A) A toy example to compare meta-learners in Künzel et al.[22] Metalearners decompose the HTE estimation into several sub-regression problems that any arbitrary supervised machine learning model can be utilized. (B) Illustration of covariate shift problem and counterfactual regression network [18,38]. Balanced representation learning via covariate shift using neural networks [18,38] with the treatment-invariant representation of a patient’s feature X.
Despite their flexibility and simple intuition, meta-learners have several limitations. They are non-parametric models with full flexibility to select any arbitrary supervised machine learning model, which makes it difficult to obtain valid confidence intervals. Also, meta-learners only handle discrete treatments; with multiple discrete treatments, they require a model for each treatment, posing extra computation. For more details, see [22] and the open-source implementations causalml [41] and econml [42].
3.4. Representation learning
Representation learning based on deep neural networks has also been actively used in nonparametric HTE estimation. Recent works focus on learning a representation under covariate shift, such that the feature representations of the treated and untreated follow similar distributions (Fig. 3B), and study the trade-off between balance and predictive power of such algorithms. BNN [34], CFRNet [38], and TARNet [18] propose a family of algorithms to predict the HTE using balanced representations learned from observational data (Fig. 3B), with the last two providing a bound for the HTE estimation error. These theoretical approaches have also been applied to many synthetic and real-world datasets [38].
Specifically, the balanced learning approach finds a representation ϕ : X → R and treatment-specific heads h1 and h0 that minimize the evaluation measure PEHE:

ε_PEHE = ∫_X (τ̂(x) − τ(x))² p(x) dx     (2)

where τ̂(x) = h1(ϕ(x)) − h0(ϕ(x)). Training minimizes a loss function that upper-bounds the PEHE; the loss consists of the expected factual losses of the treated and control outcome regressions plus the distributional distance between ϕ(x) given T = 1 and T = 0 for covariate shift (Fig. 3B).
3.5. Tree-based methods
Another class of models is tree-based models [39,40,43]. Tree-based models, including causal trees and causal forests, are nonparametric models that use recursive splitting criteria to find subgroups in which the sub-samples can be viewed as coming from randomized experiments and the divergence between the outcomes of the treatment and control groups is maximized. One characteristic of tree-based models is that they retain good asymptotic properties, allowing users to conduct solid statistical inference about the models’ point estimates. In addition, tree-based models provide generated rules for further interpretation and external validation [44].
Double sample causal tree, one of the most representative tree methods, uses an ‘honest’ estimation method to grow a tree. It splits the training sub-sample into two parts: one part is used for predicting outcomes, and the other is used to find the splits for the nodes. ‘Honest’ means that for a sample i, the response Yi can be used either for estimating treatment effects within a leaf or for finding the split, but not for both. The propensity tree is another method, which incorporates propensity score estimation in the model to adjust for confounders in observational datasets [40].
Causal forest generates ensembles of many causal trees and estimates treatment effects by averaging the predictions of the ensembled trees. By aggregating results from many trees, causal forests provide more robust estimators and smoother decision rules.
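The honest-splitting idea can be illustrated with a single split on synthetic randomized data (a deliberately crude stand-in for the causal-tree criterion: one half of the data picks the threshold that maximizes the gap between leaf-level effect estimates, and the held-out half estimates the leaf effects):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(-1, 1, size=n)
T = rng.binomial(1, 0.5, size=n)            # randomized assignment
tau = np.where(x > 0, 2.0, 0.0)             # effect jumps at x = 0
Y = tau * T + rng.normal(size=n)

# Honesty: one half chooses the split, the other half estimates leaf effects.
split_half = np.arange(n) < n // 2
est_half = ~split_half

def leaf_effect(mask):
    """Difference in mean outcomes between treated and control in a leaf."""
    return Y[mask & (T == 1)].mean() - Y[mask & (T == 0)].mean()

# Choose the threshold maximizing heterogeneity between the two leaves
# on the splitting half (a crude proxy for the causal-tree criterion).
best_c, best_gap = None, -np.inf
for c in np.linspace(-0.9, 0.9, 37):
    gap = abs(leaf_effect(split_half & (x <= c)) - leaf_effect(split_half & (x > c)))
    if gap > best_gap:
        best_c, best_gap = c, gap

# Honest leaf estimates come from the held-out estimation half.
tau_left = leaf_effect(est_half & (x <= best_c))
tau_right = leaf_effect(est_half & (x > best_c))
```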
3.6. Methodology comparison
The HTE estimation methods we discussed have their own strengths and weaknesses. Double machine learning utilizes structural equations to model discrete or continuous treatments with confidence intervals. Meta-learners allow users to use any base learner for outcome prediction and treatment effect regression, so researchers have the autonomy to choose the supervised models that work best for their data. HTE estimation using representation learning is an end-to-end approach in which all tasks (reducing selection bias, predicting potential outcomes, and calculating treatment effects) are seamlessly connected in one neural network framework. Thanks to the neural network’s flexibility as a function approximator, various modeling hypotheses (e.g., multi-head architectures for the respective treatments [18], disentangled representations [35]) can be incorporated and tested.
Let us compare the methods on three criteria: treatment type, treatment assignment, and selection bias (Table 1). For the treatment type, the T-learner and X-learner handle binary treatments, but extending them to multiple discrete treatments is straightforward. In contrast, BNN, CFRNet, and TARNet minimize the distance between the two treatment groups’ representations, posing greater computational challenges when extending from binary to multiple discrete or continuous treatments. Double machine learning and the S-learner, meanwhile, can incorporate continuous treatments. For the treatment assignment variable, most methods for discrete treatments adopt separate outcome prediction (i.e., either separate neural network layers/heads or independent prediction models) to handle the different potential outcomes; this setting is advantageous when the sizes of the treatment groups are not balanced. For selection bias handling, covariate shift is the main approach in representation learning [18]; covariate selection by propensity scores has also been investigated, and weighting by propensity scores is widely used [22].
3.7. Evaluation metric
Directly evaluating the accuracy of estimated HTEs is challenging because the ground-truth treatment effect is never observed in data; randomized experiments are the only way to obtain the ground truth. Researchers have therefore used two indirect measurements: robustness and estimated goodness-of-fit.
Robustness: To evaluate whether the model’s estimation is robust to different data, one can train the same model on the training data and the test data respectively, and then calculate the “estimated” root mean squared error (ERMSE) between the treatment effects estimated from the training data and from the test data. High robustness (low ERMSE) means the model consistently generates similar HTE estimates regardless of the input data, which supports the validity of the estimation.
Estimated goodness-of-fit: To evaluate how accurately the models predict the HTE, the precision of estimating heterogeneous effects (PEHE) is the direct metric [45]. The PEHE is defined as the expected squared difference between the true HTE and the estimated HTE (Eq. (2)). As the true HTE is never observed, Alaa and van der Schaar proposed the influence function PEHE (IF-PEHE), which approximates the true PEHE via “derivatives” of the PEHE functional rather than relying directly on the unobserved counterfactual outcomes [46]. The approximation is composed of two parts: a plug-in estimate and an influence function that compensates for its bias. A well-designed plug-in estimate is easy to train and gives a partial guess at the true PEHE; the remaining bias of the plug-in estimate is approximated by the influence function, which is analogous to the derivative of a function in standard calculus. See details in the paper [46] or Supplementary Material S.1.4.
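Both metrics can be sketched on simulated data, where (unlike in real data) the true HTE is known and the PEHE is directly computable; the T-learner here is just a stand-in estimator, and all names are our own illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def make_data(n):
    X = rng.normal(size=(n, 3))
    T = rng.binomial(1, 0.5, size=n)
    tau = X[:, 0]                            # true HTE, known only in simulation
    Y = X[:, 1] + tau * T + 0.1 * rng.normal(size=n)
    return X, T, Y, tau

def fit_t_learner(X, T, Y):
    m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
    m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
    return lambda Xq: m1.predict(Xq) - m0.predict(Xq)

Xtr, Ttr, Ytr, _ = make_data(2000)
Xte, Tte, Yte, tau_te = make_data(2000)

# Robustness (ERMSE): the same model class fit on training vs. test data
# should yield similar HTE estimates on a common evaluation set.
tau_from_train = fit_t_learner(Xtr, Ttr, Ytr)(Xte)
tau_from_test = fit_t_learner(Xte, Tte, Yte)(Xte)
ermse = np.sqrt(np.mean((tau_from_train - tau_from_test) ** 2))

# PEHE: RMSE against the true HTE, computable only because tau is simulated.
pehe = np.sqrt(np.mean((tau_from_train - tau_te) ** 2))
```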
4. Benchmark experiments
We tested the feasibility of HTE estimation in identifying personalized drug effectiveness. We focused on emulating a randomized clinical trial (NCT03991988) [47] that tests the treatment effect of Montelukast on AD. Our emulation is based on a target trial framework [11] that mimics the actual trial’s eligibility criteria, treatment strategy, and observation period using nationwide large-scale claim data. From the emulated trial, we aim to estimate the HTE of Montelukast (as well as other anti-asthma drugs) in reducing AD risk and ultimately evaluate the feasibility of HTE estimation as a new tool for identifying personalized drug effectiveness.
4.1. Background in AD and Montelukast
AD is a progressive neurologic disorder involving the death of brain cells. Chronic inflammation is known to be linked to AD [48,49]. Montelukast is an anti-inflammatory drug for chronic inflammatory conditions such as asthma. Some researchers speculate that Montelukast might reduce AD risk, not only by reducing chronic peripheral inflammation but also by targeting the leukotriene signaling pathway, which mediates various aspects of AD pathology [49]. An ongoing Phase II trial is investigating the treatment effect of Montelukast on AD [47]. However, the average cost of such trials is ~13 million dollars in the U.S. [1], and AD treatment trials have a 99 % failure rate [49]. Emulating RCTs via treatment effect estimation before the actual trials may offer a cost-effective and safe tool to reduce failure rates. However, there are several confounding variables in observational data. Even if Montelukast and AD turn out to have a significant association, the association may or may not be due to a direct preventive effect of Montelukast on AD; it may instead reflect the anti-inflammatory effect of Montelukast on chronic inflammation. In addition, AD develops clinical symptoms through various pathologies (such as neuroinflammation and protein misfolding), so the treatment effect of Montelukast on AD may vary by the pathology a patient has.
4.2. Experimental setting
We defined eligibility criteria, standard-of-care arm, follow-up period, and outcome based on NCT03991988 [47] (Table 2, Supplementary Material S.2.1).
Table 2.
Comparison of our experiment setting and the actual trial NCT03991988 [47].
Randomized clinical trial on Montelukast (NCT03991988) | Our target trial design | |
---|---|---|
Aim | Assess the effects of Montelukast initiation on amyloid and tau accumulation and cognitive symptoms of prodromal AD. | Assess the effect of one type of anti-asthma drug on AD onset prevention from routine clinical care setting in real-world data (2007–2020) |
Eligibility criteria |
|
|
Treatment strategies |
Treatment arm:
|
Treatment arm:
|
Assignment procedure | Random assignment | Reduce selection bias computationally (Propensity score, covariate shift in balanced learning) |
Follow-up period | 2019.09.25 – 2022.10 | Follow-up starts at first records meeting eligibility criteria Follow-up ends at ADRD onset or last observation, whichever occurs first |
Outcome |
|
AD and related dementia onset (PheWas dx codes and medication codes) |
Montelukast is a cysteinyl leukotriene receptor type 1 (CysLT1) antagonist used to treat asthma and allergy symptoms. Animal model studies show Montelukast’s efficacy in reducing amyloid-beta toxicity and neuroinflammation [50,51]. A retrospective study reports an association between Montelukast and reduced use of dementia medicine in older adults, compared with other anti-asthma drugs [52]. Two Phase 2 clinical trials (NCT03402503, NCT03991988) are ongoing to test the effectiveness of Montelukast on AD’s neuropsychological progression.
To emulate clinical trials, we used real-world patient drug claim data. Claim data capture routine clinical care, although there are ongoing debates about whether such real-world healthcare administrative data (including electronic health records and claim data) are suitable sources for inferring treatment effects [53]. We used the Optum Clinformatics® Data Mart subscribed to by UTHealth. It comprises administrative health claims from January 2007 to June 2020 for commercial and Medicare Advantage health plan members. These claims are submitted for payment by providers and pharmacies and are verified, adjudicated, adjusted, and de-identified before inclusion in the Clinformatics® Data Mart. It contains 6.5 billion claims with 7.6 billion diagnosis codes (in ICD-9/10), 2.7 billion medication codes, and 2.5 billion lab results.
We defined the study treatment as leukotriene receptor antagonists (Montelukast, Zileuton, Zafirlukast, and Pranlukast) and the standard-of-care treatment as the remaining active anti-asthma drug classes (beta-2 adrenergic receptor agonists, anticholinergic drugs, xanthines, and corticosteroids). See Supplementary Table S1 for specific drug names and their DrugBank IDs.
We selected cohorts of older adults meeting the eligibility criteria in NCT03991988 [47]. We included 2,740 patients taking the study treatment and 8,545 patients taking standard-of-care treatments, for a final cohort size of 11,285. We randomly split the subjects into training, validation, and test datasets at a 6:2:2 ratio (6,771 for training, 2,257 for validation, and 2,257 for test).
Potential confounding variables that affect both exposure to the study treatment and ADRD onset include age at the start of follow-up, sex, race, and comorbidities. For comorbidities, we converted ICD-9 and ICD-10 diagnosis codes into PheWas codes to increase the clinical relevance of the billing codes [54]. A PheWas code is a hierarchical grouping of ICD codes based on statistical co-occurrence, code frequency, and human review. A total of 242 PheWas diagnosis codes were included in the dataset. We used AD onset as the outcome, as an implicit measure of increased AD risk; AD onset was detected as the presence of either an ADRD diagnosis code or an ADRD medication.
Based on Eq. (1), the HTE is defined as the difference between the estimated AD onset when the patient is assigned the study treatment (leukotriene receptor antagonist) and when the patient is assigned the control (standard-of-care treatments). A negative HTE value means a reduced risk of AD onset due to the study treatment. To evaluate the performance of our HTE estimation models, we measured robustness (ERMSE) and estimated goodness-of-fit (IF-PEHE). The detailed plug-in function setting for IF-PEHE is in Supplementary Material S.2.3.
Some HTE estimation methods require a propensity score, the likelihood that a patient takes the study treatment, to weight their estimates and reduce selection bias. We chose a random forest as the propensity score estimation model (Supplementary Material S.2.4). To estimate the propensity score, we used all available features (the 242 diagnosis codes and three demographic variables). We checked whether the treatment assignment was random or affected by the features by calculating the prediction accuracy of the propensity score model (Supplementary Material S.2.4, Table S2).
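The randomization check described above can be sketched as follows: if a cross-validated propensity model discriminates treated from untreated well above chance (AUC ≈ 0.5), treatment assignment depends on the covariates (the synthetic setup here is illustrative, not our actual claims-data pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 3000
X = rng.normal(size=(n, 10))
T = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))   # assignment depends on X

# Cross-validated AUC of the propensity model: well above 0.5 signals
# confounded treatment selection that needs adjustment.
auc = cross_val_score(RandomForestClassifier(random_state=0), X, T,
                      cv=3, scoring="roc_auc").mean()

# Comparator with fully randomized assignment: AUC should sit near 0.5.
T_rand = rng.binomial(1, 0.5, size=n)
auc_rand = cross_val_score(RandomForestClassifier(random_state=0), X, T_rand,
                           cv=3, scoring="roc_auc").mean()
```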
Meta-learners require several prediction tasks: i) potential outcome classification models for the treatment and control groups, respectively, and ii) treatment effect regression (for the X-learner and R-learner). We compared logistic regression, random forest, and XGBoost as choices for each prediction task.
4.3. Results
The benchmark experiments show that most HTE estimation models with low variance (S-learners, DML, DragonNet) had better robustness (low ERMSE) and goodness-of-fit (low IF-PEHE). In particular, the S-learner with random forest as the base learner gave the best HTE estimation on both metrics (Table 3). We discuss possible reasons for the strong performance of this simple model in the Discussion section. The representation learning methods generally had low ERMSE but high IF-PEHE. Although they were not competitive with the low-variance models, DragonNet performed best among the representation learning methods on both metrics, because our sparse and noisy input data can benefit from the sufficiency of the propensity score for causal adjustment: since it uses only the information relevant to the treatment, it gives better estimates when many covariates influence the outcome but have no effect on the treatment.
Table 3.
Comparison of various HTE estimation models in a target trial for NCT03991988 using nationwide claim data.
| Category | Model | Learner configuration | Robustness (ERMSE) | Goodness of fit (IF-PEHE) |
|---|---|---|---|---|
| Meta learners | S-learner | Logistic Regression | 0.0225 | 22.12 |
| Meta learners | S-learner | Random Forest | 0.0029 | 9.82 |
| Meta learners | S-learner | XGBoost Classifier | 0.0380 | 19.01 |
| Meta learners | T-learner | Logistic Regression | 0.2933 | 155.63 |
| Meta learners | T-learner | Random Forest | 0.0737 | 21.66 |
| Meta learners | T-learner | XGBoost | 0.2926 | 57.77 |
| Meta learners | X-learner | Outcome learner: Logistic Regression; Effect learner: ElasticNet | 0.0705 | 35.78 |
| Meta learners | X-learner | Outcome learner: Random Forest Classifier; Effect learner: Random Forest Regressor | 0.1147 | 22.46 |
| Meta learners | X-learner | Outcome learner: XGBoost Classifier; Effect learner: XGBoost Regressor | 0.1980 | 79.33 |
| Meta learners | R-learner | Outcome learner: Logistic Regression; Effect learner: ElasticNet (equivalent to DML) | 0.0110 | 22.61 |
| Meta learners | R-learner | Outcome learner: XGBoost Classifier; Effect learner: XGBoost Regressor | 0.5873 | 385.34 |
| Balanced representation learning | CFRNet | Representation learner: deep neural network; Hypothesis learner: deep neural network | 0.0547 | 102.60 |
| Balanced representation learning | DRLearner | Representation learner: three neural networks, with two regression networks (one per treatment arm); Confounder learner: two logistic networks that model the logging policy and design weights for confounder impact | 0.0597 | 152.99 |
| Representation learning (others) | DragonNet | Outcome learner: neural network regressor that learns the outcome as well as propensity scores; Effect learner: neural network regressor | 0.0401 | 53.57 |
| Causal forest | Generalized Random Forest | Single tree estimator: double-sample causal tree | 0.0060 | 18.90 |
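To make the winning approach in Table 3 concrete, the S-learner logic can be sketched in a few lines of scikit-learn. The toy cohort below is synthetic and the hyperparameters are illustrative, not the study's actual pipeline; the sketch only shows the estimation pattern: fit one model on the covariates plus the treatment indicator, then take the difference of predicted outcome probabilities under treatment and control.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic toy cohort (NOT the study data): X = baseline covariates,
# t = treatment indicator, y = binary outcome (e.g., AD onset).
n, p = 2000, 10
X = rng.normal(size=(n, p))
t = rng.integers(0, 2, size=n)
# Outcome with a small treatment effect modified by the first covariate.
y = (rng.random(n) < 0.30 + 0.05 * t * (X[:, 0] > 0)).astype(int)

# S-learner step 1: fit ONE shared model on the covariates plus the
# treatment indicator appended as an extra feature.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(np.column_stack([X, t]), y)

# S-learner step 2: per-patient HTE = difference between the predicted
# outcome probabilities with the treatment indicator set to 1 vs 0.
p1 = model.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
p0 = model.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]
hte = p1 - p0
```

The same pattern is offered, with more options, by the S-learner classes in the CausalML and EconML packages cited above; the point here is only that a single shared model serves both arms.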
4.4. Important features in the best model
Using the best model (S-learner with Random Forest), we investigated which patient features contribute to high or low HTE. We calculated feature importance scores for the HTE estimates using Shapley values (Fig. 4) [55]. The Shapley value measures the average marginal contribution of a feature, obtained by including or excluding the feature over varying subsets of the other features. Age at baseline was the second most important factor: older patients benefit less from leukotriene receptor antagonists in reducing the risk of AD. On the other hand, leukotriene receptor antagonists have a beneficial treatment effect on patients with disorders of bone and cartilage (PheWas dx: 733) or gout and other crystal arthropathies (PheWas dx: 274). This implies that patients with chronic inflammatory bone and joint disorders are more likely to benefit from leukotriene receptor antagonists in reducing AD risk. Indeed, accumulating evidence suggests significant interplay between peripheral immune activity (e.g., chronic inflammatory bone and joint disorders), blood–brain barrier permeability, microglial activation/proliferation, and AD-related neuroinflammation [56]. In an in vivo study, montelukast (one type of leukotriene receptor antagonist) inhibited inflammation-induced osteoclastogenesis in a calvarial model [57]. This finding warrants in-depth biological investigation.
Fig. 4.
Feature importance values of the best-performing model (S-learner with Random Forest). A negative SHAP value means the feature pushes the HTE negative, implying a reduced AD onset risk due to the study treatment. Features are sorted in decreasing order of importance. Blue points = low feature values; red points = high feature values.
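The "average marginal contribution" idea behind the Shapley attribution can be illustrated with a small Monte Carlo sketch. The data, the surrogate model, and the background-mean masking scheme below are illustrative assumptions (the study itself used the SHAP package [55], which implements far more efficient tree-specific algorithms); the sketch only demonstrates the core definition: average, over random feature orderings, the change in prediction when a feature is revealed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: covariates X and per-patient HTE estimates.
n, p = 500, 4
X = rng.normal(size=(n, p))
hte = 0.8 * X[:, 0] - 0.5 * X[:, 2] + 0.05 * rng.normal(size=n)

# Surrogate model mapping covariates to the estimated HTE.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, hte)
background = X.mean(axis=0)  # reference values used to "hide" a feature

def shapley_mc(x, predict, background, n_orderings=100, seed=0):
    """Monte Carlo Shapley values for a single instance x.

    For each random feature ordering, the contribution of feature j is
    its marginal effect on the prediction: the prediction after revealing
    j minus the prediction before, with hidden features held at the
    background values.
    """
    rng = np.random.default_rng(seed)
    p = len(x)
    phi = np.zeros(p)
    for _ in range(n_orderings):
        order = rng.permutation(p)
        z = background.copy()
        prev = predict(z.reshape(1, -1))[0]
        for j in order:
            z[j] = x[j]  # reveal feature j
            cur = predict(z.reshape(1, -1))[0]
            phi[j] += cur - prev
            prev = cur
    return phi / n_orderings

phi = shapley_mc(X[0], model.predict, background)
# Efficiency property: the contributions sum to f(x) - f(background).
gap = model.predict(X[:1])[0] - model.predict(background.reshape(1, -1))[0]
```

The efficiency property checked at the end holds exactly here because each ordering's contributions telescope; in practice one would use `shap.TreeExplainer` rather than this brute-force estimator.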
5. Discussion on the HTE estimation methods
We have reviewed and compared recent HTE estimation methods to assess their feasibility for translational research in biomedicine and drug development. Our extensive benchmark experiments compared several state-of-the-art and popular HTE estimation methodologies, including meta-learners, causal trees, and counterfactual representation learning. We found that the simple S-learner with Random Forest estimated the HTE most stably on sparse and noisy drug claim data. Possible reasons why this simple model performs best on our sparse claim data are three-fold. i) A shared function across treatment and control groups: the two groups may share similar mechanisms linking features to the outcome regardless of whether the study treatment was received. Incorporating the treatment assignment as one of the features and building a single shared model for both groups helps the model learn these shared mechanisms, particularly when the features are sparse and the data are not rich. ii) A biologically zero treatment effect: if the study treatment truly has no biological effect, the S-learner is known to be more accurate in estimating a zero HTE [22]; indeed, there is no clear scientific evidence that leukotriene receptor antagonists prevent AD. iii) Avoiding overfitting: the S-learner has few model parameters (low variance in predicted values) and is therefore less prone to overfitting than meta-learners with more parameters (e.g., the X-learner requires three prediction tasks). Our empirical finding is corroborated by prior theoretical investigations of inductive bias in HTE estimation [58]. Based on this finding, we suggest that future HTE estimation models should consider the shared structure between the arms.
In addition, we found several limitations and challenges in data and problem formulation when applying these methods to real-world healthcare data and problems. The methods we compared primarily focus on achieving more robust and accurate HTE estimation through different model architectures, assuming an ideal form of data is given. However, real-world data are incomplete in observation and often lack an unbiased control group for comparison. Here we elaborate on the challenges we identified.
5.1. Unobserved confounding variables
We assumed that our data contain a sufficient set of confounding variables that determine the outcome (e.g., AD onset) and the treatment (e.g., initiation of the study treatment). Caution is needed because this assumption can be violated with real-world drug administrative observational data. Some important confounding variables, such as socioeconomic status, are not included in healthcare administrative data because these data are collected for billing purposes, not for research.
5.2. Variables in healthcare administrative data
HTE estimation requires a large set of variables to capture the various patient conditions that might modify treatment effects. Healthcare administrative data (e.g., EHRs, claim data) are a useful source because they contain diverse comorbid conditions and co-medication patterns. Two data issues arise: sparsity and multimodality. Healthcare administrative data are sparse, noisy, and missing not at random. Although we mitigated the sparsity by mapping all diagnosis codes to PheWas codes [59], which yields compact, dense variables and increases the clinical relevance of the billing-purpose codes, the data were still not rich enough to train data-hungry deep learning models (e.g., CFRNet, DRLearner, DragonNet). Previous work on deep learning with healthcare administrative data [60] shows that the medical event prediction accuracy of deep learning models is only marginally better than that of traditional feature-based baselines. This data sparsity might be a barrier to applying representation learning methods to healthcare administrative data and to general machine learning tasks. To mitigate the sparsity, one can consider feature selection. Extra care is needed when applying feature selection to some HTE models because, unlike general supervised learning, they consist of multiple supervised prediction sub-tasks. Several feature selection methods for HTE estimation have been proposed [61]. We tested these on the best-performing model, the S-learner with Random Forest, to see whether feature selection improved the HTE performance measures; it turned out not to be helpful (Supplementary S.2.5).
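As a concrete illustration of filter-style feature selection for uplift/HTE models, one crude heuristic scores each feature by how differently it correlates with the outcome in the treated versus control arm. This toy criterion is only in the spirit of, and not identical to, the methods of [61], and the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: feature 0 modifies the treatment effect,
# feature 1 only shifts the outcome, the rest are noise.
n, p = 2000, 8
X = rng.normal(size=(n, p))
t = rng.integers(0, 2, size=n)
y = 0.5 * X[:, 1] + t * (0.8 * X[:, 0]) + rng.normal(scale=0.1, size=n)

def uplift_filter_scores(X, t, y):
    """Score each feature by the gap between its outcome correlation
    in the treated arm and in the control arm (a filter-style heuristic:
    effect modifiers should relate to the outcome differently by arm)."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r1 = np.corrcoef(X[t == 1, j], y[t == 1])[0, 1]
        r0 = np.corrcoef(X[t == 0, j], y[t == 0])[0, 1]
        scores[j] = abs(r1 - r0)
    return scores

scores = uplift_filter_scores(X, t, y)
ranking = np.argsort(scores)[::-1]  # feature indices, strongest first
```

Note the weakness of such naive filters: a purely prognostic feature can also receive a nonzero score when the arms differ in outcome variance, which is one reason dedicated uplift feature selection methods [61] are preferable in practice.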
Another challenge is data multimodality. Medication orders and diagnosis codes capture a patient's comorbid conditions from different perspectives. One goal of HTE estimation is to reduce the selection bias induced by non-randomized treatment assignment by incorporating comprehensive multimodal variables. It is a study design choice which modality to include, or how to incorporate both modalities when they follow different distributions. Healthcare experts' knowledge is also critical for selecting the variables most predictive of propensity to treatment.
5.3. Define standard-of-care treatments
Estimating treatment effects from observational data requires defining a virtual placebo (a standard-of-care treatment) against which the study treatment is compared. The target trial framework offers two choices of standard-of-care treatment: active drugs vs. no active drugs [9,11,12]. Defining a "good" standard-of-care treatment is critical: it should be distinct enough from the study treatment to allow comparison, not co-administered with the study treatment (to avoid patients taking both at the same time), and share similar disease indications with the study treatment to avoid confounding. A tradeoff arises in this choice. Active drugs as the standard of care (e.g., other active anti-asthma drugs as the control in our benchmark experiment) help avoid confounding by indication (e.g., all patients in the cohort have asthma), but yield treatment effects of the study treatment that are only marginal relative to the active drugs. No active drugs as the standard of care (e.g., no active anti-asthma drugs as the control) allow estimation of treatment effects through the stronger contrast between study treatment and control, but confounding by indication remains [62,63].
6. Conclusion
Recently, many machine learning models for estimating HTE have been proposed, but there has been limited effort to apply them to real-world healthcare problems. Our methodology review and benchmark study provided an overview of current methods and benchmarked them on the task of emulating a clinical trial: we estimated the HTE of leukotriene receptor antagonists in reducing the risk of AD using nationwide healthcare claim data.
Our study has several limitations. First, our methodology comparison focuses only on non-parametric approaches using machine learning. Parametric HTE estimation methods not discussed in this review include generalized linear models with the stratification-multilevel method or the matching-smoothing method [64]. Also, in our benchmark experiment, we could not set a time zero separating pre-treatment and post-treatment comorbidities because the time of first treatment assignment is ambiguous in our short-term data; this might reduce the estimated treatment effect size. For future investigation, we plan to incorporate interactions between the treatment and post-treatment covariates to capture the underlying causal structure among the variables, which can lead to a more accurate and robust estimator of treatment effects [65].
Nonetheless, we delivered insights that may benefit future research: i) simple models work better when the true treatment effect is close to zero, as is the case in most clinical trials; ii) observational healthcare data, particularly administrative data (e.g., claims, EHRs), are incomplete, so researchers must be cautious about unobserved confounding variables; and iii) using active drugs that share the same indication as the treatment of interest as a virtual placebo may help reduce confounding by indication. As future research directions, we envision that i) it is promising to develop HTE estimation methods that leverage the similar mechanisms of treatment and placebo (e.g., active drugs sharing the same indication), either through a single learner or parameter sharing; ii) it is critical to infuse human knowledge (e.g., PheWas [54]) to complement the incomplete and sparse records in healthcare administrative data; and iii) as healthcare data always contain timestamps, it is worthwhile to incorporate the temporal relationships of the variables when estimating HTE. Our benchmark experiment code is publicly available to facilitate transparent comparison.1 As a future benchmark study, we will investigate the utility of HTE methods on clinical registry and trial data.
Funding
This work was supported by the Cancer Prevention and Research Institute of Texas [RR180012 to X.J.]; the National Institute on Aging [R01AG066749]; the University of Texas System STARs Program; and the University of Texas Health Center startup funds.
Footnotes
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jbi.2022.104256.
Data availability
Data will be made available on request.
References
- [1] What is the cost of a clinical trial?, (2021). http://www.sofpromed.com/what-is-the-cost-of-a-clinical-trial/ (accessed October 1, 2021).
- [2] VanderWeele TJ, Shpitser I, On the definition of a confounder, Ann. Stat. 41 (2013) 196.
- [3] Pearl J, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.
- [4] Pearl J, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Elsevier, 2014.
- [5] McGough JJ, Faraone SV, Estimating the Size of Treatment Effects: Moving Beyond P Values, Psychiatry 6 (n.d.) 21.
- [6] Ohlsson H, Kendler KS, Applying Causal Inference Methods in Psychiatric Epidemiology: A Review, JAMA Psychiat. 77 (2020) 637–644.
- [7] Parascandola M, Weed DL, Causation in epidemiology, J. Epidemiol. Community Health 55 (2001), 10.1136/jech.55.12.905.
- [8] Zenil H, Kiani NA, Zea AA, Tegnér J, Causal deconvolution by algorithmic generative models, Nat. Mach. Intell. 1 (1) (2019) 58–66.
- [9] Prosperi M, Guo Y, Sperrin M, Koopman JS, Min JS, He X, Rich S, Wang M, Buchan IE, Bian J, Causal inference and counterfactual prediction in machine learning for actionable healthcare, Nat. Mach. Intell. 2 (7) (2020) 369–375.
- [10] Glymour MM, Spiegelman D, Evaluating Public Health Interventions: 5. Causal Inference in Public Health Research-Do Sex, Race, and Biological Factors Cause Health Outcomes?, Am. J. Public Health 107 (2017) 81–85.
- [11] Hernán MA, Robins JM, Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available, Am. J. Epidemiol. 183 (2016), 10.1093/aje/kwv254.
- [12] Chen Z, Zhang H, Guo Y, George TJ, Prosperi M, Hogan WR, He Z, Shenkman EA, Wang F, Bian J, Exploring the feasibility of using real-world data from a large clinical data research network to simulate clinical trials of Alzheimer's disease, npj Digital Med. 4 (2021), 10.1038/s41746-021-00452-1.
- [13] Dehejia RH, Wahba S, Propensity score-matching methods for nonexperimental causal studies, Rev. Econ. Stat. 84 (1) (2002) 151–161.
- [14] Wang Y, Shah RD, Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders, arXiv [stat.ME] (2020). http://arxiv.org/abs/2011.08661.
- [15] Austin PC, Stuart EA, Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies, Stat. Med. 34 (28) (2015) 3661–3679.
- [16] Lam B, Masellis M, Freedman M, Stuss DT, Black SE, Clinical, imaging, and pathological heterogeneity of the Alzheimer's disease syndrome, Alzheimers Res. Ther. 5 (2013) 1.
- [17] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J, Double/Debiased Machine Learning for Treatment and Causal Parameters, (2016). http://arxiv.org/abs/1608.00060 (accessed October 1, 2021).
- [18] Shalit U, Johansson FD, Sontag D, Estimating individual treatment effect: generalization bounds and algorithms, (2016). http://arxiv.org/abs/1606.03976 (accessed July 21, 2021).
- [19] Bica I, Alaa AM, Lambert C, van der Schaar M, From Real-World Patient Data to Individualized Treatment Effects Using Machine Learning: Current and Future Methods to Address Underlying Challenges, Clin. Pharmacol. Ther. 109 (2021) 87–100.
- [20] Microsoft, microsoft/EconML, (n.d.). https://github.com/microsoft/EconML (accessed March 25, 2021).
- [21] Chen H, Harinen T, Lee J-Y, Yung M, Zhao Z, CausalML: Python Package for Causal Machine Learning, arXiv [cs.CY] (2020). http://arxiv.org/abs/2002.11631.
- [22] Künzel SR, Sekhon JS, Bickel PJ, Yu B, Metalearners for estimating heterogeneous treatment effects using machine learning, Proc. Natl. Acad. Sci. U.S.A. 116 (10) (2019) 4156–4165.
- [23] Curth A, Svensson D, Weatherall J, van der Schaar M, Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation, (2021). https://openreview.net/pdf?id=FQLzQqGEAH (accessed December 1, 2021).
- [24] Curth A, van der Schaar M, Nonparametric Estimation of Heterogeneous Treatment Effects: From Theory to Learning Algorithms, in: Banerjee A, Fukumizu K (Eds.), Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1810–1818.
- [25] Curth A, van der Schaar M, On Inductive Biases for Heterogeneous Treatment Effect Estimation, arXiv [stat.ML] (2021). http://arxiv.org/abs/2106.03765.
- [26] Jacob D, CATE meets ML – The Conditional Average Treatment Effect and Machine Learning, (2021). http://arxiv.org/abs/2104.09935 (accessed July 22, 2021).
- [27] Rubin DB, Causal Inference Using Potential Outcomes, J. Am. Stat. Assoc. 100 (2005) 322–331, 10.1198/016214504000001880.
- [28] He H, Wu P, Chen D-G (Din), Statistical Causal Inferences and Their Applications in Public Health Research, Springer, 2016.
- [29] Pearl J, Mackenzie D, The Book of Why: The New Science of Cause and Effect, Basic Books, 2018.
- [30] Hernán MA, Robins JM, Estimating causal effects from epidemiological data, J. Epidemiol. Community Health 60 (2006) 578–586.
- [31] Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M, Doubly robust estimation of causal effects, Am. J. Epidemiol. 173 (2011) 761–767.
- [32] Naimi AI, Cole SR, Kennedy EH, An introduction to g methods, Int. J. Epidemiol. 46 (2017) 756.
- [33] Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK, Chernozhukov V, Double machine learning for treatment and causal parameters, (2016). 10.1920/wp.cem.2016.4916.
- [34] Johansson FD, Shalit U, Sontag D, Learning Representations for Counterfactual Inference, (2016). http://arxiv.org/abs/1605.03661 (accessed July 21, 2021).
- [35] Hassanpour N, Greiner R, Learning Disentangled Representations for CounterFactual Regression, in: Eighth International Conference on Learning Representations, 2020. http://www.openreview.net/pdf?id=HkxBJT4YvB (accessed July 21, 2021).
- [36] Bica I, Jordon J, van der Schaar M, Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks, (2020). http://arxiv.org/abs/2002.12326 (accessed July 21, 2021).
- [37] Shi C, Blei DM, Veitch V, Adapting Neural Networks for the Estimation of Treatment Effects, arXiv [stat.ML] (2019). http://arxiv.org/abs/1906.02120.
- [38] Johansson F, Shalit U, Sontag D, Learning Representations for Counterfactual Inference, in: International Conference on Machine Learning, PMLR, 2016, pp. 3020–3029.
- [39] Athey S, Imbens G, Recursive partitioning for heterogeneous causal effects, Proc. Natl. Acad. Sci. U.S.A. 113 (27) (2016) 7353–7360.
- [40] Wager S, Athey S, Estimation and Inference of Heterogeneous Treatment Effects using Random Forests, J. Am. Stat. Assoc. 113 (523) (2018) 1228–1242.
- [41] Welcome to Causal ML's documentation — causalml documentation, (n.d.). https://causalml.readthedocs.io/en/latest/ (accessed August 4, 2021).
- [42] Welcome to econml's documentation! — econml 0.12.0b5 documentation, (n.d.). https://econml.azurewebsites.net (accessed August 4, 2021).
- [43] Athey S, Wager S, Estimating Treatment Effects with Causal Forests: An Application, Observational Studies 5 (2) (2019) 37–51.
- [44] Nie X, Wager S, Quasi-oracle estimation of heterogeneous treatment effects, Biometrika 108 (2020) 299–319.
- [45] Hill JL, Bayesian Nonparametric Modeling for Causal Inference, J. Comput. Graph. Stat. 20 (2011) 217–240, 10.1198/jcgs.2010.08162.
- [46] Alaa A, van der Schaar M, Validating Causal Inference Models via Influence Functions, in: Chaudhuri K, Salakhutdinov R (Eds.), Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 191–201.
- [47] Montelukast Therapy on Alzheimer's Disease, (n.d.). https://clinicaltrials.gov/ct2/show/NCT03991988 (accessed July 8, 2021).
- [48] Bozek A, Jarzab J, Improved activity and mental function related to proper antiasthmatic treatment in elderly patients with Alzheimer's disease, Allergy Asthma Proc. 32 (2011) 341–345, 10.2500/aap.2011.32.3459.
- [49] Michael J, Marschallinger J, Aigner L, The leukotriene signaling pathway: a druggable target in Alzheimer's disease, Drug Discov. Today 24 (2019) 505–516, 10.1016/j.drudis.2018.09.008.
- [50] Lai J, Mei ZL, Wang H, Hu M, Long Y, Miao MX, Li N, Hong H, Montelukast rescues primary neurons against Aβ1–42-induced toxicity through inhibiting CysLT1R-mediated NF-κB signaling, Neurochem. Int. 75 (2014) 26–31, 10.1016/j.neuint.2014.05.006.
- [51] Mansour RM, Ahmed MAE, El-Sahar AE, El Sayed NS, Montelukast attenuates rotenone-induced microglial activation/p38 MAPK expression in rats: Possible role of its antioxidant, anti-inflammatory and antiapoptotic effects, Toxicol. Appl. Pharmacol. 358 (2018) 76–85.
- [52] Grinde B, Engdahl B, Prescription database analyses indicates that the asthma medicine montelukast might protect against dementia: a hypothesis to be verified, Immun. Ageing 14 (2017) 20.
- [53] Beaulieu-Jones BK, Finlayson SG, Yuan W, Altman RB, Kohane IS, Prasad V, Yu K, Examining the Use of Real-World Evidence in the Regulatory Process, Clin. Pharmacol. Ther. 107 (2020) 843–852, 10.1002/cpt.1658.
- [54] Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, Field JR, Pulley JM, Ramirez AH, Bowton E, Basford MA, Carrell DS, Peissig PL, Kho AN, Pacheco JA, Rasmussen LV, Crosslin DR, Crane PK, Pathak J, Bielinski SJ, Pendergrass SA, Xu H, Hindorff LA, Li R, Manolio TA, Chute CG, Chisholm RL, Larson EB, Jarvik GP, Brilliant MH, McCarty CA, Kullo IJ, Haines JL, Crawford DC, Masys DR, Roden DM, Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data, Nat. Biotechnol. 31 (12) (2013) 1102–1111.
- [55] Lundberg SM, Lee S-I, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4768–4777.
- [56] Culibrk RA, Hahn MS, The Role of Chronic Inflammatory Bone and Joint Disorders in the Pathogenesis and Progression of Alzheimer's Disease, Front. Aging Neurosci. 12 (2020) 583884.
- [57] Kang J-H, Lim H, Lee D-S, Yim M, Montelukast inhibits RANKL-induced osteoclast formation and bone loss via CysLTR1 and P2Y12, Mol. Med. Rep. 18 (2018) 2387–2398.
- [58] Curth A, van der Schaar M, On inductive biases for heterogeneous treatment effect estimation, Adv. Neural Inf. Process. Syst. (2021). https://proceedings.neurips.cc/paper/2021/hash/8526e0962a844e4a2f158d831d5fddf7-Abstract.html.
- [59] Bastarache L, Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS, Annu. Rev. Biomed. Data Sci. 4 (1) (2021) 1–19.
- [60] Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J, Scalable and accurate deep learning with electronic health records, npj Digital Med. 1 (2018) 1–10.
- [61] Zhao Z, Zhang Y, Harinen T, Yung M, Feature Selection Methods for Uplift Modeling, arXiv [cs.LG] (2020). http://arxiv.org/abs/2005.03447.
- [62] Salas M, Hofman A, Stricker BH, Confounding by indication: an example of variation in the use of epidemiologic terminology, Am. J. Epidemiol. 149 (1999), 10.1093/oxfordjournals.aje.a009758.
- [63] Kyriacou DN, Lewis RJ, Confounding by Indication in Clinical Research, JAMA 316 (2016) 1818–1819.
- [64] Xie Y, Brand JE, Jann B, Estimating Heterogeneous Treatment Effects with Observational Data, Sociol. Methodol. 42 (1) (2012) 314–347.
- [65] Tran C, Zheleva E, Improving Data-driven Heterogeneous Treatment Effect Estimation Under Structure Uncertainty, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1787–1797.