Abstract
We propose a test-based elastic integrative analysis of the randomised trial and real-world data to estimate treatment effect heterogeneity with a vector of known effect modifiers. When the real-world data are not subject to bias, our approach combines the trial and real-world data for efficient estimation. Utilising the trial design, we construct a test to decide whether or not to use real-world data. We characterise the asymptotic distribution of the test-based estimator under local alternatives. We provide a data-adaptive procedure to select the test threshold that promises the smallest mean square error and an elastic confidence interval with a good finite-sample coverage property.
Keywords: counterfactual outcome, least favourable confidence interval, non-regularity, precision medicine, pre-test estimator, semiparametric efficiency
1. Introduction
Precision medicine (Hamburg & Collins, 2010), which aims at customising medical treatments to individual patient characteristics, has recently received considerable attention. A critical step toward precision medicine is to characterise the heterogeneity of treatment effect (HTE; Rothwell, 2005; Rothwell et al., 2005), which describes how patient characteristics are related to the treatment effect. Randomised trials (RTs) are the gold-standard method for treatment effect evaluation because randomisation of treatment ensures that treatment groups are comparable and biases are minimised to the extent possible. However, due to high costs and eligibility criteria for recruiting patients, the trial sample is often small and limited in patient diversity, which renders the trial underpowered to estimate the HTE and unable to estimate the HTE for specific patient characteristics. On the other hand, extensive real-world (RW) data are increasingly available for research purposes, such as electronic health records, claims databases, and disease registries, with much larger sample sizes and broader demographic diversity than RT cohorts. Several national organisations (Norris et al., 2010) and regulatory agencies (Sherman et al., 2016) have recently advocated using RW data to achieve a faster and less costly drug discovery process. Indeed, big data provide unprecedented opportunities for new scientific discovery; however, they also present challenges with possible incomparability with RT data due to selection bias, unmeasured confounding, lack of concurrency, data quality, outcome validity, etc. (US Food and Drug Administration, 2019).
The motivating application is to evaluate adjuvant chemotherapy for resected non-small cell lung cancer (NSCLC) at early-stage disease. Adjuvant chemotherapy for resected NSCLC was shown to be effective in late-stage II and IIIA disease based on RTs (Le Chevalier, 2003). However, the benefit of adjuvant chemotherapy in stage IB NSCLC disease is unclear. Cancer and Leukemia Group B (CALGB) 9633 is the only RT designed specifically for stage IB NSCLC (Strauss et al., 2008); however, it comprises about 300 patients, which was undersized to detect clinically meaningful improvements for adjuvant chemotherapy (Katz & Saad, 2009). ‘Who can benefit from adjuvant chemotherapy with stage IB NSCLC?’ remains an important clinical question. An exploratory analysis of CALGB 9633 showed that patients with tumour size ≥ 4 cm might benefit from adjuvant chemotherapy (Strauss et al., 2008). On the other hand, the National Cancer Database (NCDB) is a clinical oncology registry database that captures the information from approximately 70% of all newly diagnosed cancer patients in the USA. Our goal is to integrate the CALGB 9633 trial with a cohort selected under the same trial eligibility criteria from the NCDB. We expect that an integrated analysis of the CALGB 9633 and NCDB data can considerably improve the efficiency of the HTE estimation on adjuvant chemotherapy regarding tumour size over the RT-only analysis. Although such population-based disease registries provide rich information reflecting the real-world usage of adjuvant chemotherapy, the concern is the potential bias associated with RW data.
Many authors have proposed methods for generalising treatment effects from RTs to the target population, whose covariate distribution can be characterised by the RW data (Buchanan et al., 2018; Colnet et al., 2020; Lee, Yang, Dong, et al., 2022; Lee, Yang, & Wang, 2022; Zhao et al., 2019). When both RT and RW data provide covariate, treatment, and outcome information, there are two main approaches for integrative analysis: meta-analyses of summary statistics (e.g., Verde & Ohmann, 2015) and pooled patient data (Sobel et al., 2017). The major drawback of meta-analyses of the first kind is that they use only aggregated information and do not distinguish the roles of the RT and RW data, both having unique strengths and weaknesses. Pooled analyses of the second kind include all patients, but pooling the data from two sources breaks the randomisation of treatments and relies on causal inference methods to adjust for confounding bias (e.g., Prentice et al., 2005). More importantly, one cannot rule out possible unmeasured confounding in the RW data. In addition, most existing integrative methods have focused on average treatment effects (ATEs) but not on HTEs, which lie at the heart of precision medicine.
To acknowledge the advantages of the RT and RW data, we propose an elastic algorithm for combining the RT and RW data for accurate and robust estimation of the HTE function with a vector of known effect modifiers. The primary identification assumptions underpinning our method are (i) the transportability of the HTE from the RT data to the target population and (ii) the strong ignorability of treatment assignment in the RT data. Transportability is a common assumption in the trial generalisability literature, which holds if the HTE function captures all the treatment effect modifiers, or the study sample is a random sample from the target population. The well-controlled trial design can also ensure the strong ignorability of treatment assignment. If the RW sample satisfies the parallel assumptions (i) and (ii), it is comparable to the RT sample in estimating the HTE. In this case, integrating the RW sample would increase the efficiency of HTE estimation. Toward this end, we use the semiparametric efficiency theory (Bickel et al., 1993; Robins, 1994) to derive the semiparametrically efficient integrative estimator of the HTE. However, due to many practical limitations, the RW sample may violate the desirable comparability assumption (i) or (ii). In this case, integrating the RW sample would lead to bias in HTE estimation. Utilising the design advantage of RTs, we derive a preliminary test statistic to gauge the comparability and reliability of the RW data and decide whether or not to use the RW data in an integrative analysis. Therefore, our test-based elastic integrative estimator uses the efficient combination strategy for estimation if the violation test is insignificant and retains only the RT data if the violation test is significant.
The proposed estimator belongs to pre-test estimation by construction (Giles & Giles, 1993) and is non-regular. We consider null, local, and fixed alternative hypotheses for the pre-testing, representing three scenarios when the comparability assumption required for the RW data is not, weakly, or strongly violated, respectively. Notably, the fixed alternative formulates the bias of the RW score of the HTE parameter to be fixed, under which the pre-test statistic goes to infinity as the sample size increases. Thus, the inference under the fixed alternative cannot capture the finite-sample behaviour of the test and estimator well and lacks uniform validity. A common strategy to obtain uniform inference validity for non-regular estimators is considering the local alternative, which formulates the bias of the RW score to be in the neighbourhood of zero. The inference under the local alternative provides a better approximation of the finite-sample behaviour of the proposed estimator. Such strategies have been considered in designing trials for sample size/power calculation and in the weak instrument, partial identification, and classification literature (Cheng, 2008; Laber & Murphy, 2011; Staiger & Stock, 1997). Under the local alternative, when the testing distribution is non-degenerate, exact inference for pre-test estimation is complex because the estimator depends on the randomness of the test procedure. This issue cannot be solved by splitting the sample into two parts for testing and estimation separately (Toyoda & Wallace, 1979). The reason is that sample splitting cannot bypass the issue of the additional randomness due to pre-testing, and therefore the impact of pre-testing remains. Also, our test statistic and estimator are constructed based on the whole sample data. To account for the effect of pre-testing, we decompose the test-based elastic integrative estimator into orthogonal components; one is affected by the pre-testing, and the other is not.
This step reveals the asymptotic distributions of the proposed estimator to be mixture distributions involving a truncated normal component with ellipsoid truncation and a normal component. Under this framework, we provide a data-adaptive procedure to select the threshold of the test statistic that promises the smallest mean square error (MSE) of the proposed estimator. Lastly, we propose an elastic procedure to construct confidence intervals (CIs), which are adaptive to the local and fixed alternative and have good finite-sample coverage properties.
This article is organised as follows. Section 2 introduces the basic setup, HTE, identification assumptions, and semiparametric efficient estimation. Section 3 establishes a test statistic for gauging the comparability of the RW data with the RT data, a test-based elastic integrative estimator, the asymptotic properties, and an elastic inference procedure. Section 4 presents a simulation study to evaluate the performance of the proposed estimator in terms of robustness and efficiency. Section 5 applies the proposed method to combined CALGB 9633 (RT) and NCDB (RW) data to characterise the HTE of adjuvant chemotherapy in patients with stage IB non-small cell lung cancer. We relegate technical details and all proofs to the Online supplementary material.
2. Basic set-up
2.1. Notation, the HTE, and two data sources
Let A ∈ {0, 1} be the binary treatment, Z a vector of pre-treatment covariates of interest with the first component being 1, X a vector of auxiliary variables including Z, and Y the outcome of interest. We consider Y to be continuous or binary to fix ideas, although our framework can be extended to general-type outcomes, including the survival outcome. To define causal effects, we follow the potential outcomes framework (Neyman, 1923; Rubin, 1974). Under the Stable Unit Treatment Value Assumption, let Y(a) be the potential outcome had the subject been given treatment a, for a = 0, 1. And, by the causal consistency assumption, the observed outcome is Y = Y(A).
Based on the potential outcomes, the individual treatment effect is Y(1) − Y(0), and τ(Z) = E{Y(1) − Y(0) | Z} characterises the HTE. For a binary outcome, τ(Z) is also called the causal risk difference. In clinical settings, a parametric family of HTEs is desirable and has wide applications in precision medicine to discover optimal treatment regimes tailored to individual characteristics (Chakraborty & Moodie, 2013). We assume the HTE function to be

τ(Z) = τ_{ψ0}(Z), (1)

where ψ0 is a p-vector of unknown parameters and p is fixed.
We illustrate the HTE function in the following examples.
Example 1 Shi et al., 2016; Tian et al., 2014 —
For a continuous outcome, a linear HTE function is τ_{ψ0}(Z) = Z^⊤ψ0, where each component of ψ0 quantifies how the treatment effect varies over the corresponding component of Z.
Example 2 Richardson et al., 2017; Tian et al., 2014 —
For a binary outcome, an HTE function for the causal risk difference is τ_{ψ0}(Z) = {exp(Z^⊤ψ0) − 1}/{exp(Z^⊤ψ0) + 1}, ranging from −1 to 1.
To evaluate the effect of adjuvant chemotherapy, let Y be the indicator of cancer recurrence within one year of surgery. Consider the HTE function in Example 2 with Z = (1, age, tumour size)^⊤ and ψ0 = (ψ00, ψ01, ψ02)^⊤. This model entails that, on average, the treatment would increase or decrease the risk of cancer recurrence by τ_{ψ0}(Z) had the patient received adjuvant chemotherapy, and the magnitude of the change depends on age and tumour size. If τ_{ψ0}(Z) < 0, it indicates that the treatment is beneficial for this patient. Moreover, if ψ01 < 0 and ψ02 < 0, then older patients with larger tumour sizes would benefit more from adjuvant chemotherapy.
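The bounded HTE function of Example 2 is straightforward to compute. The sketch below evaluates the risk-difference HTE for two hypothetical patients; the coefficient values are purely illustrative, not estimates from the CALGB 9633 or NCDB data.

```python
import numpy as np

def hte_risk_difference(Z, psi):
    """Causal risk difference tau_psi(Z) = {exp(Z'psi) - 1} / {exp(Z'psi) + 1}.

    Bounded in (-1, 1), as a risk difference for a binary outcome must be.
    Z carries a leading column of ones for the intercept.
    """
    lin = Z @ psi
    return (np.exp(lin) - 1.0) / (np.exp(lin) + 1.0)

# Hypothetical coefficients for (intercept, age, tumour size):
psi = np.array([-0.5, -0.01, -0.02])
Z = np.array([[1.0, 65.0, 4.0],   # older patient, larger tumour
              [1.0, 50.0, 2.0]])  # younger patient, smaller tumour
tau = hte_risk_difference(Z, psi)
# Negative tau: treatment reduces the recurrence risk (a benefit); with both
# coefficients negative, the older patient with the larger tumour benefits more.
```

Because both non-intercept coefficients are negative here, the first patient's HTE is more negative than the second's, illustrating the interpretation in the paragraph above.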
We consider two independent data sources: one from the RT study and the other from the RW study. Let δ = 1 denote RT participation, and let δ = 0 denote RW study participation. Let V = (X, A, Y) summarise the entire record of observed variables. The RT data consist of {V_i : i ∈ 𝒜} with sample size m, and the RW data consist of {V_i : i ∈ ℬ} with sample size n, where 𝒜 and ℬ are sample index sets for the two data sources. Our setup requires the RT and RW samples to contain Z's information but may include different sets of auxiliary information in X. The total sample size is N = m + n. Generally, n is larger than m. In our asymptotic framework, we assume both m and n go to infinity, and m/N → ρ, where 0 < ρ < 1.
For simplicity of exposition, we use the following notation throughout the paper: ℙ_N denotes the empirical measure over the combined RT and RW data; M^⊗2 denotes MM^⊤ for a vector or matrix M; 𝔼 and 𝕍 denote the asymptotic expectation and variance of a random variable; A ⊥ B denotes that A is independent of B; X ∼ Y denotes that X follows the same distribution as Y; and X ≃ Y denotes that X and Y have the same asymptotic distribution. Let e_δ(X) = P(A = 1 | X, δ) be the propensity score.
2.2. Identification of the HTE from the RT and RW data
The fundamental problem of causal inference is that Y(0) and Y(1) are not jointly observable. Therefore, the HTE is not identifiable without additional assumptions.
We view the RT sample as the gold standard for HTE estimation, satisfying the following assumption.
Assumption 1 RT validity —
(i) 𝔼{Y(1) − Y(0) | Z, δ = 1} = τ_{ψ0}(Z), and (ii) Y(a) ⊥ A | (X, δ = 1) for a = 0, 1, and 0 < e_1(X) < 1 for all X.
Assumption 1(i) states that the HTE function is transportable from the RT sample to the target population. This assumption is a common assumption in the data integration literature. Stronger versions of Assumption 1(i) have also been considered in the literature, including the ignorability of study participation, i.e., δ ⊥ {Y(0), Y(1)} | X (Buchanan et al., 2018; Stuart et al., 2011), or the mean exchangeability, i.e., 𝔼{Y(a) | X, δ = 1} = 𝔼{Y(a) | X} for a = 0, 1 (Dahabreh et al., 2019). Assumption 1(i) holds if Z captures the heterogeneity of effect modifiers or if the study sample is a random sample from the target population. Under the structural equation model framework, Pearl and Bareinboim (2011) provided graphical conditions for transportability. The graphical representation can aid the investigator in assessing the plausibility of Assumption 1(i). Assumption 1(ii) entails that treatment assignment in the RT study follows a randomisation mechanism based on the pre-treatment variables X, and all subjects have positive probabilities of receiving each treatment. Assumption 1(ii) holds by the design of complete randomisation of treatment, where the treatment is independent of the potential outcomes and covariates, i.e., A ⊥ {Y(0), Y(1), X} | δ = 1. It also holds by the design of stratified block randomisation of treatment based on discrete X, where the treatment is independent of the potential outcomes within each stratum of X. The propensity score e_1(X) is known by design.
We consider a parallel assumption for the RW sample, termed RW comparability.
Assumption 2 RW comparability —
(i) 𝔼{Y(1) − Y(0) | Z, δ = 0} = τ_{ψ0}(Z), and (ii) Y(a) ⊥ A | (X, δ = 0) for a = 0, 1, and 0 < e_0(X) < 1 for all X.
Although Assumption 2 appears similar to Assumption 1, its implications differ substantively. Assumption 2(i) states that the HTE function is transportable from the RW sample to the target population. To make this assumption more plausible, one can use the same trial eligibility criteria to select the RW sample to ensure a sufficient overlap of the RW covariate space with the RT sample. However, this assumption can be violated in various ways. For example, RT and RW studies may be conducted in different care settings (large academic medical centres versus smaller community hospitals), contexts (geography, policy-related, or socio-structural factors), or time frames. Each of these concerns can violate Assumption 2(i). In addition, due to the lack of control of treatment assignment in RW data, Assumption 2(ii) implies that the observed covariates X capture all the confounding variables related to the treatment and outcome. This assumption may also be restrictive in practice. For example, in the NCDB cohort, the physicians or patients decided, based on experiences or preferences, whether patients received adjuvant chemotherapy after tumour resection. While the database captures much site-level and patient-level information, there may be unmeasured confounding variables associated with the treatment selection and clinical outcome, e.g., financial status and accessibility to health care facilities.
By trial design, we assume Assumption 1 for the RT data holds throughout the paper; however, we regard Assumption 2 for the RW data as an idealistic assumption, which may be violated. If Assumption 2 holds, we will use a semiparametric efficient strategy to combine both data sources for optimal estimation. However, if Assumption 2 is violated, our proposed method will automatically detect the violation and retain only the RT data for estimation. In practice, it is important to identify a ‘similar’ RW sample to be integrated with the RT sample. Hernán and Robins (2016) provided a framework for using big real-world data to emulate a target trial when a randomised trial is unavailable. When selecting an RW sample, we can check the rubrics for the eligibility criteria that define the target population, treatment definitions, assignment procedures, follow-up time, outcome, and effect contrast of interest, to increase the chance of successfully integrating the RW sample with the RT sample.
Unlike our focus on testing the comparability of the RW data in HTE estimation, testing transportability alone may be of more importance in some contexts. Under Assumptions 1(ii) and 2(ii), i.e., when the treatment ignorability holds in both samples, tests of the equality of the HTE across the two data sources can be adopted, e.g., the U-statistics-based test (Luedtke et al., 2019).
Under Assumptions 1 and 2, the following identification formula holds for the HTE:

𝔼[AY/e_δ(X) − (1 − A)Y/{1 − e_δ(X)} | Z, δ] = τ_{ψ0}(Z), for δ = 0, 1. (2)
The identification formula motivates regression analysis based on the modified outcome AY/e_δ(X) − (1 − A)Y/{1 − e_δ(X)} to estimate the HTE. This approach involves the inverse of the treatment probability, and thus the resulting estimator may be unstable if some estimated treatment probabilities are close to zero or one. This calls for a principled way to construct improved estimators of the HTE. Rudolph and van der Laan (2017) derived the semiparametric efficiency score (SES) and bound for the average treatment effect. In the next subsection, we derive the SES of the HTE under Assumptions 1 and 2, which motivates improved estimators.
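To make the identification formula concrete, the following simulation sketch (with an assumed data-generating process, not the paper's application) constructs the inverse-probability-weighted modified outcome and recovers a linear HTE by least-squares regression on Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
psi_true = np.array([1.0, 0.5])               # linear HTE: tau(Z) = Z @ psi_true
e = 1 / (1 + np.exp(-0.3 * Z[:, 1]))          # propensity score e(X), known here
A = rng.binomial(1, e)
Y = Z[:, 1] + A * (Z @ psi_true) + rng.normal(size=n)

# Modified (IPW) outcome: E(Y_mod | Z) = tau(Z) by the identification formula
Y_mod = A * Y / e - (1 - A) * Y / (1 - e)

# Regressing the modified outcome on Z recovers the HTE parameters
psi_hat = np.linalg.lstsq(Z, Y_mod, rcond=None)[0]
```

The estimate is consistent but, as noted above, its variance inflates when the propensity scores approach zero or one, which motivates the efficient score in the next subsection.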
2.3. Semiparametric efficiency score
The semiparametric model consists of model (1) with the parameter of interest ψ0 and an otherwise unspecified distribution of V. Assumptions 1 and 2 impose restrictions on this model. To see this, define

H_{ψ0} = Y − Aτ_{ψ0}(Z). (3)

Intuitively, H_{ψ0} subtracts from the subject's observed outcome Y the treatment effect Aτ_{ψ0}(Z) of the subject's observed treatment A, which mimics the potential outcome Y(0). Formally, following Robins (1994), we can show that 𝔼(H_{ψ0} | X, A, δ) = 𝔼{Y(0) | X, δ}. Therefore, by Assumptions 1 and 2, H_{ψ0} must satisfy the restriction:

𝔼(H_{ψ0} | X, A, δ) = 𝔼(H_{ψ0} | X, δ). (4)
For simplicity of exposition, denote
μ(X, δ) = 𝔼(H_{ψ0} | X, δ) and σ²(X, δ) = 𝕍(H_{ψ0} | X, δ),
where μ(X, δ) is the outcome mean function and σ²(X, δ) is the outcome variance function. By viewing (X, δ) jointly as the set of confounders, we invoke the SES of the structural nested mean model in Robins (1994). We further make a simplifying assumption that

𝕍(H_{ψ0} | X, A, δ) = 𝕍(H_{ψ0} | X, δ), (5)
which is a natural extension of (4). This assumption allows us to derive the SES of ψ0 as

S_{ψ0}(V) = {∂τ_{ψ0}(Z)/∂ψ} σ^{-2}(X, δ) {H_{ψ0} − μ(X, δ)} {A − e_δ(X)}, (6)

which separates the term with the outcome, i.e., H_{ψ0} − μ(X, δ), and the term with the treatment, i.e., A − e_δ(X). This feature relaxes model assumptions on the nuisance functions while retaining root-n consistency in the estimation of ψ0; see Section 2.4. Even without the simplifying assumption in (5), by the mean independence property in (4), we can verify that 𝔼{S_{ψ0}(V)} = 0.
Therefore, if (5) holds, S_{ψ0}(V) is the SES of ψ0; if (5) does not hold, S_{ψ0}(V) remains unbiased and permits robust estimation. We provide examples to elucidate the SES below before delving into robust estimation in the following subsection.
Example 3
For a continuous outcome and the HTE function given in Example 1, the SES of ψ0 is S_{ψ0}(V) = σ^{-2}(X, δ) Z {Y − AZ^⊤ψ0 − μ(X, δ)} {A − e_δ(X)}.
For a binary outcome and the HTE function given in Example 2, the SES of ψ0 is S_{ψ0}(V) = σ^{-2}(X, δ) [2 exp(Z^⊤ψ0)/{exp(Z^⊤ψ0) + 1}²] Z {H_{ψ0} − μ(X, δ)} {A − e_δ(X)}.
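The unbiasedness of the score rests on the mean-independence property (4): H_{ψ0} mimics Y(0) and is uncorrelated with A − e(X) given X. This can be illustrated with a simplified, unweighted version of the score (dropping the variance weighting and outcome-mean centring, which affect efficiency but not unbiasedness), under an assumed linear-HTE data-generating process:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = Z[:, 1]
psi_true = np.array([1.0, 0.5])
e = 1 / (1 + np.exp(-0.5 * X1))              # propensity score, taken as known
A = rng.binomial(1, e)
Y = X1 + A * (Z @ psi_true) + rng.normal(size=n)

# Unbiased estimating equation (simplified, unweighted version of the score):
#   sum_i Z_i (A_i - e_i) {Y_i - A_i Z_i' psi} = 0
# H_psi = Y - A * tau_psi(Z) mimics Y(0), so (A - e)(H_psi) has mean 0 given X.
# For a linear HTE the equation is linear in psi and can be solved directly:
w = A - e
lhs = (Z * (w * A)[:, None]).T @ Z            # sum_i Z_i Z_i' A_i (A_i - e_i)
rhs = Z.T @ (w * Y)                           # sum_i Z_i (A_i - e_i) Y_i
psi_hat = np.linalg.solve(lhs, rhs)
```

Adding the σ^{-2} weight and subtracting μ(X, δ) from Y, as in (6), leaves the solution consistent while improving efficiency.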
Remark 1 Comparison with other doubly robust approaches —
The identification formula (2) motivates the inverse probability weighted (IPW)-adjusted regression. However, IPW is known to be inefficient and sensitive to model misspecification of the propensity score. Alternatively, Kennedy (2020) proposed a pseudo-outcome regression approach using augmented IPW (AIPW) pseudo-outcomes that leverages weighting and outcome mean functions and improves the performance of IPW-adjusted regression. The doubly robust loss function for the treatment contrast or blip function in Luedtke and van der Laan (2016) also exploits weighting and outcome mean functions. Both IPW and AIPW use weighting to remove confounding biases; differently, the SES in (6) uses the mean independence of and to construct unbiased estimating equations. The simulation study in Online Supplementary Material, Section S4.1 shows that the SES approach outperforms the AIPW-adjusted approach when the propensity score can be close to zero or one.
2.4. From SES to robust estimation
In principle, an efficient estimator for ψ0 can be obtained by solving ℙ_N S_ψ(V) = 0. However, S_ψ(V) depends on the unknown distribution through e_0(X), μ(X, δ), and σ²(X, δ), and thus solving ℙ_N S_ψ(V) = 0 is infeasible. Nevertheless, the state-of-the-art causal inference literature suggests that estimators constructed based on the SES are robust to approximation errors of machine learning methods, the so-called rate double robustness; see, e.g., Chernozhukov et al. (2018) and Rotnitzky et al. (2019).
In order to obtain a robust estimator with good efficiency properties, we consider approximating the unknown functions using non-parametric or machine learning methods. In summary, our algorithm for the estimation of ψ0 proceeds as follows.
Step 1. Obtain an estimator ê_0(X) of the RW propensity score e_0(X) using non-parametric or machine learning methods, based on the RW data.
Step 2. Obtain a preliminary estimator ψ̂_p by solving a simplified version of the estimating equation (with working choices of μ and σ²), based on the RT data.
Step 3. Obtain estimators μ̂(X, δ) and σ̂²(X, δ) of μ(X, δ) and σ²(X, δ) using non-parametric or machine learning methods applied to H_{ψ̂_p}, based on the RT and RW data, respectively.
Step 4. Let Ŝ_ψ(V) be S_ψ(V) with the unknown quantities replaced by the estimates from Steps 1 and 3. Obtain the efficient integrative estimator ψ̂_eff by solving

ℙ_N Ŝ_ψ(V) = 0. (7)
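A schematic version of Steps 1–4 is sketched below, with simple parametric stand-ins (logistic and linear regression) for the non-parametric/machine-learning nuisance fits, an assumed linear-HTE data-generating process, and the simplified unweighted score; it shows the flow of the algorithm, not the recommended nuisance estimators:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n, e_fun, rng, psi=np.array([1.0, 0.5])):
    X = rng.normal(size=n)
    Z = np.column_stack([np.ones(n), X])
    e = e_fun(X)
    A = rng.binomial(1, e)
    Y = X + A * (Z @ psi) + rng.normal(size=n)
    return Z, X, A, Y, e

def solve_linear_score(Z, A, Y, e, mu):
    # Solve sum_i Z_i (A_i - e_i) {(Y_i - mu_i) - A_i Z_i' psi} = 0 for psi
    w = A - e
    lhs = (Z * (w * A)[:, None]).T @ Z
    rhs = Z.T @ (w * (Y - mu))
    return np.linalg.solve(lhs, rhs)

# RT sample: propensity known by design (e = 0.5); RW sample: unknown propensity
Zt, Xt, At, Yt, et = simulate(2_000, lambda x: np.full_like(x, 0.5), rng)
Zw, Xw, Aw, Yw, _ = simulate(20_000, lambda x: 1 / (1 + np.exp(-0.5 * x)), rng)

# Step 1: estimate the RW propensity score (logistic regression via Newton steps)
D = np.column_stack([np.ones_like(Xw), Xw])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-D @ beta))
    beta += np.linalg.solve((D * (p * (1 - p))[:, None]).T @ D, D.T @ (Aw - p))
ew = 1 / (1 + np.exp(-D @ beta))

# Step 2: preliminary estimator from the RT data alone (working mu = 0)
psi_prelim = solve_linear_score(Zt, At, Yt, et, np.zeros_like(Yt))

# Step 3: estimate the mean of H_psi = Y - A * tau_psi(Z) given X in each sample
def fit_mu(Z, X, A, Y, psi):
    H = Y - A * (Z @ psi)
    B = np.column_stack([np.ones_like(X), X])
    coef = np.linalg.lstsq(B, H, rcond=None)[0]
    return B @ coef

mut = fit_mu(Zt, Xt, At, Yt, psi_prelim)
muw = fit_mu(Zw, Xw, Aw, Yw, psi_prelim)

# Step 4: efficient integrative estimator pooling both scores
Z_all = np.vstack([Zt, Zw]); A_all = np.concatenate([At, Aw])
Y_all = np.concatenate([Yt, Yw]); e_all = np.concatenate([et, ew])
mu_all = np.concatenate([mut, muw])
psi_eff = solve_linear_score(Z_all, A_all, Y_all, e_all, mu_all)
```

The estimating equation remains unbiased for any plug-in μ̂ because A − e_δ(X) has conditional mean zero, which is the robustness property exploited by the algorithm.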
The estimator ψ̂_eff depends on the approximation of the nuisance functions. To establish the asymptotic properties of ψ̂_eff, we provide the regularity conditions.
Assumption 3
(i) ‖ê_0 − e_0‖ = o_P(1) and ‖μ̂ − μ‖ = o_P(1); (ii) ‖ê_0 − e_0‖ · ‖μ̂ − μ‖ = o_P(N^{-1/2}); and (iii) additional regularity conditions in Online Supplementary Material, Assumption S1.
Assumption 3 comprises typical regularity conditions for Z-estimation or M-estimation (van der Vaart, 2000). Assumption 3(i) states that we require the posited models for the two nuisance functions to be consistent. Assumption 3(ii) states that the combined rate of convergence of the posited models is o_P(N^{-1/2}). Online Supplementary Material, Assumption S1 regularises the complexity of the functional space. Importantly, these conditions ensure that ψ̂_eff retains parametric-rate consistency, allowing flexible data-adaptive models and not restricting to stringent parametric models.
Theorem 1
Suppose Assumptions 1–3 hold. Then, ψ̂_eff is root-n consistent for ψ0 and asymptotically normal.
Theorem 1 implies that, asymptotically, ψ̂_eff can be viewed as the solution to ℙ_N S_ψ(V) = 0 with the nuisance functions known. Therefore, for consistent variance estimation of ψ̂_eff, we can use the standard sandwich formula (Stefanski & Boos, 2002) or perturbation-based resampling (Hu & Kalbfleisch, 2000), treating the nuisance functions as known.
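Treating the nuisance functions as known, the sandwich estimator takes the usual bread-inverse, meat, bread-inverse form. A minimal sketch for the simplified linear score under an assumed simulation (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
X = rng.normal(size=n)
Z = np.column_stack([np.ones(n), X])
e = np.full(n, 0.5)                           # randomised treatment, e known
A = rng.binomial(1, e)
psi_true = np.array([1.0, 0.5])
Y = X + A * (Z @ psi_true) + rng.normal(size=n)

# Point estimate from the linear score sum_i Z_i (A_i - e_i)(Y_i - A_i Z_i' psi) = 0
w = A - e
B = (Z * (w * A)[:, None]).T @ Z / n          # "bread": minus the score derivative
psi_hat = np.linalg.solve(B * n, Z.T @ (w * Y))

# Sandwich variance, with the nuisance functions treated as known (Theorem 1)
S = Z * (w * (Y - A * (Z @ psi_hat)))[:, None]     # per-subject scores
M = S.T @ S / n                                    # "meat": outer products of scores
V = np.linalg.solve(B, np.linalg.solve(B, M).T) / n  # B^{-1} M B^{-1} / n
se = np.sqrt(np.diag(V))                           # standard errors for psi_hat
```

Wald confidence intervals follow as psi_hat ± 1.96 · se componentwise; perturbation-based resampling gives an alternative when the sandwich is awkward to derive.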
3. Test-based elastic integrative analysis
A major concern for integrating the RT and RW data lies in the possibly poor quality of the RW data. Then, combining the RT and RW data into an integrative analysis would lead to a biased HTE estimator. This section addresses the critical challenge of preventing any biases present in the RW data from leaking into the proposed estimator.
3.1. Detection of the RW incompatibility
We assume that all conditions in Theorem 1 hold except that Assumption 2 may be violated. We derive a test that detects the violation of this crucial assumption for using the RW data. For simplicity, we denote the SES based solely on the RT or RW data as
S_ψ^rt(V) and S_ψ^rw(V), respectively. Moreover, let Ŝ_ψ^rt(V) and Ŝ_ψ^rw(V) be S_ψ^rt(V) and S_ψ^rw(V) with the nuisance functions replaced by their estimates, and let I_rt = 𝔼[{S_{ψ0}^rt(V)}^⊗2] and I_rw = 𝔼[{S_{ψ0}^rw(V)}^⊗2] be the corresponding Fisher information matrices.
We now formulate the null hypothesis for the case when Assumption 2 holds and fixed and local alternatives and for the case when Assumption 2 is violated:
(Null) H0: 𝔼{S_{ψ0}^rw(V)} = 0.
(Fixed alternative) Ha: 𝔼{S_{ψ0}^rw(V)} = b, where b is a p-vector of constants with at least one non-zero component.
(Local alternative) Ha,n: 𝔼{S_{ψ0}^rw(V)} = n^{-1/2}η, where η is a p-vector of constants with at least one non-zero component.
Considering the fixed alternative is common in establishing asymptotic properties of standard estimators and tests; however, the local alternative is useful for studying finite-sample properties and the regularity of non-standard estimators and tests. In finite samples, the violation of Assumption 2 may be weak; e.g., there exists a hidden confounder in the RW data, but the association between the hidden confounder and the outcome or the treatment is small. In such cases, the test statistic can be small or moderate. The fixed alternative formulates the bias of the RW score to be fixed, implying that the test statistic goes to infinity with the sample size. Consequently, the fixed alternative inference cannot capture the finite-sample behaviour well in the cases of weak violation and does not have uniform validity. That is, there exist scenarios where the finite-sample coverage probability from standard inference is far from the nominal level for any sample size. The local alternative asymptotics is a common approach to obtaining uniform inference validity for non-regular estimators. In the local alternative Ha,n, the bias of the RW score may be small as quantified by n^{-1/2}η. The values of η represent different tracks that the bias of the RW score follows to converge to zero. We will show that the test statistic is O_p(1) under Ha,n, thus better capturing the finite-sample behaviour in the weak violation cases. The local alternative encompasses the null and fixed alternative as special cases by considering different values of η. In particular, H0 corresponds to Ha,n with η = 0. Also, Ha corresponds to Ha,n with ‖η‖ = ∞; hence, considering Ha alone is not informative about the finite-sample behaviours of the proposed test and estimator.
We detect biases in the RW data based on the following two key insights. First, we obtain an initial estimator ψ̂_rt by solving the estimating equation based solely on the RT data, ℙ_N Ŝ_ψ^rt(V) = 0. It is important to emphasise that the propensity score in the RT is known by design and, therefore, ψ̂_rt is always consistent. Second, if Assumption 2 holds for the RW data, Ŝ_ψ^rw(V) is unbiased, but it is no longer unbiased if Assumption 2 is violated. Therefore, large values of ‖ℙ_N Ŝ_{ψ̂_rt}^rw(V)‖ provide evidence of the violation of Assumption 2.
To detect the violation of Assumption 2 for using the RW data, we construct the test statistic

T = N {ℙ_N Ŝ_{ψ̂_rt}^rw(V)}^⊤ Σ̂^{-1} {ℙ_N Ŝ_{ψ̂_rt}^rw(V)}, (8)

where Σ is the asymptotic variance of N^{1/2} ℙ_N Ŝ_{ψ̂_rt}^rw(V), accounting for the variability of ψ̂_rt, and Σ̂ is a consistent estimator of Σ. The test statistic T measures the distance between ℙ_N Ŝ_{ψ̂_rt}^rw(V) and zero. If the idealistic assumption holds, we expect T to be small. By the standard asymptotic theory, we show in the Online supplementary material that, under H0, T converges in distribution to a Chi-square distribution with degrees of freedom p, as N → ∞. This result serves to detect the violation of the assumption required for the RW data.
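The construction of T can be mimicked in a small Monte Carlo experiment. The sketch below uses the simplified (unweighted) score, treats the propensity scores as known, and uses our own simplified stand-in for Σ̂ that combines the RW sampling variability with the variability of the RT-only estimate; the data-generating process is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
psi0 = np.array([1.0, 0.5])
CHI2_95_DF2 = 5.991  # 95th percentile of the chi-square distribution with 2 d.f.

def score(Z, A, Y, e, psi):
    """Per-subject simplified scores Z (A - e)(Y - A Z'psi) and their 'bread'."""
    w = A - e
    S = Z * (w * (Y - A * (Z @ psi)))[:, None]
    B = (Z * (w * A)[:, None]).T @ Z / len(Y)
    return S, B

def one_replicate(rng, bias=0.0, m=500, n=1000):
    # RT sample: randomised treatment, e = 0.5 known by design
    Xt = rng.normal(size=m); Zt = np.column_stack([np.ones(m), Xt])
    At = rng.binomial(1, 0.5, size=m).astype(float)
    Yt = Xt + At * (Zt @ psi0) + rng.normal(size=m)
    # RW sample: 'bias' acts like an unmeasured confounder shifting treated outcomes
    Xw = rng.normal(size=n); Zw = np.column_stack([np.ones(n), Xw])
    ew = 1 / (1 + np.exp(-0.5 * Xw))
    Aw = rng.binomial(1, ew).astype(float)
    Yw = Xw + Aw * (Zw @ psi0) + bias * Aw + rng.normal(size=n)

    # RT-only estimate (exact solution of the linear score) and its variance
    wt = At - 0.5
    psi_rt = np.linalg.solve((Zt * (wt * At)[:, None]).T @ Zt, Zt.T @ (wt * Yt))
    St, Bt = score(Zt, At, Yt, np.full(m, 0.5), psi_rt)
    V_rt = np.linalg.solve(Bt, np.linalg.solve(Bt, St.T @ St / m).T) / m

    # RW score at the RT estimate; its variance has two independent sources:
    # RW sampling variability plus the variability of psi_rt itself
    Sw, Bw = score(Zw, Aw, Yw, ew, psi_rt)
    sbar = Sw.mean(axis=0)
    Sigma = Sw.T @ Sw / n**2 + Bw @ V_rt @ Bw.T
    return sbar @ np.linalg.solve(Sigma, sbar)

rej_null = np.mean([one_replicate(rng) > CHI2_95_DF2 for _ in range(2000)])
rej_alt = np.mean([one_replicate(rng, bias=1.0) > CHI2_95_DF2 for _ in range(500)])
# rej_null should sit near the 5% level; rej_alt should be close to 1
```

Under the null the rejection rate tracks the nominal 5% level, consistent with the χ²_p limit; with a confounding shift in the RW data, the test rejects with high probability.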
3.2. Elastic integration
Let c_γ be the (1 − γ)th quantile of the χ²_p distribution. For a small γ, if T ≥ c_γ, there is strong evidence to reject H0 for the RW data; i.e., there is a detectable bias for the RW data estimator. In this case, we would only use the RT data for estimation. On the other hand, if T < c_γ, there is no strong evidence that the RW data estimator is biased; therefore, we would combine both the RT and RW data for optimal estimation. Our strategy leads to the elastic integrative estimator ψ̂_elas solving

ℙ_N{Ŝ_ψ^rt(V) + 1(T < c_γ) Ŝ_ψ^rw(V)} = 0. (9)

The choice of γ involves the bias-variance trade-off. On the one hand, under H0, the acceptance probability of integrating the RW data is 1 − γ. Therefore, for a relatively large sample size, we will accept good-quality RW data with probability 1 − γ and reject good-quality RW data with type I error γ. Hence, a small γ is desirable; similarly for Ha,n with small η. On the other hand, under Ha,n with large η, the reverse is true, and hence a large γ is desirable.
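The elastic rule itself is a one-line decision given the test statistic and the two candidate estimates. A minimal sketch (the numerical estimates are hypothetical):

```python
import numpy as np

CHI2_95_DF2 = 5.991  # chi-square(2) 95th percentile, i.e. c_gamma for gamma = 0.05

def elastic_estimate(T, psi_rt, psi_eff, c_gamma=CHI2_95_DF2):
    """Test-based elastic integrative estimator: keep the combined (efficient)
    estimate when the comparability test accepts the RW data; otherwise fall
    back to the RT-only estimate."""
    return psi_rt if T >= c_gamma else psi_eff

# Hypothetical estimates from the two analyses:
psi_rt = np.array([1.05, 0.47])    # RT-only
psi_eff = np.array([1.01, 0.50])   # integrative
accepted = elastic_estimate(3.2, psi_rt, psi_eff)   # T below threshold: accept RW
rejected = elastic_estimate(9.7, psi_rt, psi_eff)   # T above threshold: reject RW
```

In practice the same threshold c_γ enters both the test and the indicator in (9), so the decision and the estimator are built from the same statistic.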
To formally investigate the trade-off, we characterise the asymptotic distributions of the elastic integrative estimator under the null, fixed, and local alternatives. We do not discuss the trivial cases when γ = 0 and 1, corresponding to always accepting or always rejecting the RW data. With 0 < γ < 1, the distribution of ψ̂_elas mixes two distributions, namely, those of ψ̂_eff and ψ̂_rt. Each distribution can be non-standard because the estimators and test are constructed based on the same data and, therefore, may be asymptotically dependent.
To characterise those non-standard distributions, we decompose this task into three steps. First, by the standard asymptotic theory, it follows that T ≃ Z1^⊤Z1, where Z1 is a p-variate normal random vector with mean η̃, a standardised version of η, and identity covariance, and that N^{1/2}(ψ̂_eff − ψ0) and N^{1/2}(ψ̂_rt − ψ0) are asymptotically equivalent to p-variate normal random vectors with variances V_eff and V_rt, respectively.
Second, we find another standard p-variate normal random vector Z2 that is independent of Z1, and decompose the two normal distributions into two orthogonal components: (i) one corresponding to Z1 and (ii) the other corresponding to Z2. Importantly, component (i) is affected by the test constraints induced by T, but component (ii) is not. For N^{1/2}(ψ̂_eff − ψ0), we show that its stochastic part is fully represented by Z2. Therefore, its distribution is not affected by the pre-testing; that is, its limit is the same whether or not we condition on the acceptance event.
For N^{1/2}(ψ̂_rt − ψ0), we show that it is a linear combination of both Z1 and Z2, with the Z1-component subject to the ellipsoid constraint induced by the test. Due to the independence between Z1 and Z2, the limiting distribution of N^{1/2}(ψ̂_elas − ψ0) is a mixture distribution
mixing a non-normal component, where Z1,t represents the truncated normal distribution Z1 | (Z1^⊤Z1 ≥ c_γ), and a normal component. For illustration, Figure 1 demonstrates the geometry of the decomposition of distributions with scalar variables.
Figure 1.
Representation of the normal distributions of N^{1/2}(ψ̂_eff − ψ0) and N^{1/2}(ψ̂_rt − ψ0) based on Z1 and Z2 with p = 1.
Third, we formally characterise the distribution of , a multivariate normal distribution with ellipsoid truncation (Li et al., 2018; Tallis, 1963). This step enables us to quantify the asymptotic bias and variance of the proposed estimator; see Section 3.3.
Let Φ_p(·) be the cumulative distribution function (CDF) of a χ²_p random variable, and Φ_{p,λ}(·) be the CDF of a χ²_p(λ) random variable, where χ²_p and χ²_p(λ) are the central Chi-square distribution with p degrees of freedom and the non-central Chi-square distribution with non-centrality parameter λ, respectively. Theorem 2 summarises the asymptotic distribution of ψ̂_elas.
Theorem 2
Suppose assumptions in Theorem 1 hold except that Assumption 2 may be violated. Let Z1 and Z2 be independent p-variate normal random vectors with mean η̃ (a standardised version of η) and 0, respectively, and identity covariance. Let Z1,t be the truncated normal distribution Z1 | (Z1^⊤Z1 ≥ c_γ). Let the elastic integrative estimator ψ̂_elas be obtained by solving (9). Then, N^{1/2}(ψ̂_elas − ψ0) has a limiting mixture distribution

1(Z1^⊤Z1 ≥ c_γ) × {a linear combination of Z1,t and Z2} + 1(Z1^⊤Z1 < c_γ) × {a drift term plus a linear combination of Z2}, (10)

where the rejection event {Z1^⊤Z1 ≥ c_γ} has limiting probability 1 − Φ_{p,λ}(c_γ) with λ = η̃^⊤η̃.
Under H0, η̃ = 0 and λ = 0, so the acceptance probability is Φ_p(c_γ) = 1 − γ.
Under Ha, λ = ∞; i.e., (10) reduces to a normal distribution with mean 0 and variance V_rt.
Under Ha,n, (10) is a genuine mixture, with acceptance probability Φ_{p,λ}(c_γ) and λ = η̃^⊤η̃ ∈ (0, ∞).
In Theorem 2, (10) is a general characterisation of the asymptotic distribution of ψ̂_elas. It implies different asymptotic behaviours of ψ̂_elas depending on whether Assumption 2 is strongly, weakly, or not violated. First, Ha corresponds to the situation where Assumption 2 is strongly violated. Under Ha, T rejects the RW data (i.e., T ≥ c_γ holds) with probability converging to one, ψ̂_elas becomes ψ̂_rt, and (10) becomes a normal distribution with mean 0 and variance V_rt. As expected, under Ha, ψ̂_elas is asymptotically normal and regular. Second, H0 and Ha,n correspond to the situations when Assumption 2 is not and weakly violated, respectively. Under H0 and Ha,n, T has positive probabilities of accepting and rejecting the RW data, ψ̂_elas switches between ψ̂_eff and ψ̂_rt, and N^{1/2}(ψ̂_elas − ψ0) follows the limiting mixture distribution (10), indexed by η. Although the exact form of (10) is complicated, the entire distribution and summary statistics such as the mean, variance, and quantiles can be simulated by rejective sampling. Importantly, under H0 and Ha,n, ψ̂_elas is non-normal and non-regular. The non-regularity is determined by the local parameter η, which entails that the asymptotic distribution of ψ̂_elas may change abruptly when H0 is slightly violated. It is worth emphasising that the local asymptotics provides a better approximation of the finite-sample properties of the test and estimators than the fixed asymptotics does.
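The mixture distribution in the local-limit experiment can be simulated directly by rejective sampling. The sketch below uses a scalar (p = 1) toy version with purely illustrative loadings on the two orthogonal components Z1 and Z2; the exact loadings follow from the decomposition above and are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(5)
c_gamma = 3.841          # chi-square(1) 95th percentile (p = 1, gamma = 0.05)
eta = 1.0                # local-alternative drift of the testing component

# Illustrative (hypothetical) loadings linking the limit to the two
# orthogonal components: Z1 drives the test; Z2 is unaffected by it.
a1, a2 = 0.6, 0.8        # rejection branch (RT-only limit) loads on both
b2 = 0.7                 # acceptance branch (integrative limit) loads on Z2 only

n_draws = 200_000
Z1 = rng.normal(eta, 1.0, size=n_draws)   # component subject to the test
Z2 = rng.normal(0.0, 1.0, size=n_draws)   # orthogonal, test-free component

accept = Z1**2 <= c_gamma                 # elastic rule in the limit experiment
draws = np.where(accept, b2 * Z2, a1 * Z1 + a2 * Z2)

# Rejective sampling delivers the whole mixture distribution; any summary
# (mean, variance, quantiles) follows directly from the simulated draws:
mix_mean = draws.mean()
mix_q = np.quantile(draws, [0.025, 0.975])
accept_rate = accept.mean()
```

Conditional on rejection, the Z1-component is a truncated normal with the ellipsoid (here, interval) truncation, which is exactly why the mixture is non-normal.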
3.3. Asymptotic bias and MSE
Based on Theorem 2, it is essential to understand the asymptotic behaviours of and the truncated multivariate normal distribution in general. Toward that end, we derive the moment generating functions (MGFs) of such distributions in the Online supplementary material, which shed light on the moments of .
Corollary 1 provides the analytical formula of the asymptotic bias and MSE of .
Corollary 1
Suppose assumptions in Theorem 1 hold except that Assumption 2 may be violated.
Under the bias and MSE of are and
Under the bias and MSE of are and
Under the bias and MSE of are and
(11) with .
(12)
Corollary 1 enables us to demonstrate the potential advantages and disadvantages of compared with and under different scenarios. To illustrate, we consider the case of a scalar , , and Figure 2 shows as a function of η by varying compared to . For a given , when η is small, is more efficient than ; as η increases, the MSE of increases, exceeds, and gradually returns to the MSE of . This phenomenon reveals the super-efficiency (related to the problem of non-regularity) of at small values of η, at the cost of MSE inflation at some other values of η. LeCam (1953) obtained an early result on super-efficiency for the famous Hodges estimator. Also, with a smaller γ achieves a larger reduction of the MSE at small values of η but also a larger inflation of the MSE at large values of η compared to , and vice versa. This observation motivates our adaptive selection of γ in Section 3.5 to produce an elastic integrative estimator with small bias and MSE for any possible value of η. Also, super-efficiency and non-regularity are the root causes of the failure of standard asymptotic inference, which motivates the proposed elastic confidence intervals with uniform validity (Section 3.4); however, they can be conservative at certain parameter values when the sample size is small (Section 4).
Figure 2.
Illustration of the super-efficiency of in terms of MSE as a function of η by varying compared to .
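The qualitative shape of Figure 2 can be reproduced with a toy Monte Carlo; the trial-only and combined limiting laws below are illustrative assumptions, not the paper's:

```python
import numpy as np

def pretest_mse(eta, c_gamma=3.84, n_draws=400_000, seed=2):
    """Monte Carlo MSE of a toy pre-test estimator at local parameter eta.
    The trial-only estimator has limiting law N(0, 1) (MSE = 1); the
    combined estimator is N(0.6 * eta, 0.36) (toy bias/variance choices).
    The combined estimator is used only when the test accepts the RW data."""
    rng = np.random.default_rng(seed)
    z_test = rng.normal(eta, 1.0, n_draws)
    accept = z_test ** 2 <= c_gamma
    z_eff = rng.normal(0.6 * eta, 0.6, n_draws)
    z_rt = rng.normal(0.0, 1.0, n_draws)
    return float(np.mean(np.where(accept, z_eff, z_rt) ** 2))

# Super-efficiency at eta = 0, MSE inflation at moderate eta, and a
# return toward the trial-only MSE (= 1) for large eta:
curve = {eta: pretest_mse(eta) for eta in (0.0, 3.0, 10.0)}
```

The curve dips below the trial-only MSE near zero, overshoots it at moderate non-centrality, and converges back to it, mirroring the trade-off discussed above.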
Remark 2 Sample splitting and cross fitting —
Sample splitting and cross fitting are helpful tactics to simplify asymptotic analyses by removing the dependence between nuisance parameter estimation and primary parameter estimation (Chernozhukov et al., 2018; Kennedy, 2020). To apply sample splitting in our context, one can divide the sample into two parts for testing and estimation separately. While sample splitting and cross fitting are beneficial in theoretical development, they may come at the expense of heavier computation and fewer data for estimating the different components. Thus, we do not use sample splitting or cross fitting as a device to establish the theoretical properties of the proposed pre-test estimator. Without sample splitting, the test and estimators are intimately related, requiring careful decompositions of the estimators into components that are asymptotically dependent on and independent of the test statistic, as shown in our three steps toward Theorem 2. Also, sample splitting cannot resolve the non-regularity issue of the pre-test estimator (Toyoda & Wallace, 1979), because it cannot bypass the additional randomness due to pre-testing. Thus, the impact of pre-testing and super-efficiency remains an issue; see the simulation study in Online Supplementary Material, Section S4.6.
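For reference, generic K-fold cross-fitting (which, as discussed above, we deliberately do not use) can be sketched as follows; the callables `fit_nuisance` and `estimate` are hypothetical placeholders:

```python
import numpy as np

def cross_fit(data, fit_nuisance, estimate, n_folds=2, seed=0):
    """Generic K-fold cross-fitting (Chernozhukov et al., 2018): nuisance
    functions are fit on the complement of each fold and the target
    parameter is estimated on the held-out fold, so the two estimation
    steps use disjoint data. A minimal sketch with illustrative names."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(data)) % n_folds
    fold_estimates = [
        estimate(data[folds == k], fit_nuisance(data[folds != k]))
        for k in range(n_folds)
    ]
    return float(np.mean(fold_estimates))
```

For example, estimating a population mean with the complement-fold mean as a plug-in nuisance recovers the usual sample mean up to fold averaging.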
Remark 3 Soft thresholding to mitigate the non-regularity —
The proposed elastic integrative estimator involves an indicator function that makes a binary decision to include or exclude the RW data from the analysis. The indicator function serves as hard thresholding. To alleviate the non-regularity issue and refine the proposed estimator, one may use soft thresholding by smoothing the indicator function. For example, similar to Yang and Ding (2018), one can use a smooth weight function to replace , where is the normal cumulative distribution function with zero mean and variance . As , becomes closer to . Also, as suggested by a reviewer, one can weight the RW data based on the p-value from the test, i.e., . A small p-value indicates a large bias in the RW data, so the RW data should receive less weight; conversely, a large p-value suggests a small bias, so the RW data should receive more weight. A third idea is to create bootstrap replications of the elastic integrative estimator and average them to impose smoothness. Chakraborty et al. (2010) showed in simulation that soft thresholding reduces the non-regularity of the Q-learner in the dynamic treatment regime literature; however, they also cautioned that soft thresholding cannot eliminate the non-regularity. Heuristically, standard inference under the fixed alternative still has poor finite-sample coverage properties. Therefore, one still requires the local alternative asymptotics to derive inference procedures with uniform validity, as we did for the hard thresholding estimator. We leave this topic for future research.
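The first two weighting schemes can be sketched as follows, assuming for the p-value weight that the test statistic is chi-square with one degree of freedom (the paper's test has a different, vector-valued form):

```python
import math

def hard_weight(T, c):
    """Hard thresholding: include the RW data iff T <= c."""
    return 1.0 if T <= c else 0.0

def smooth_weight(T, c, h):
    """Gaussian-smoothed indicator Phi((c - T) / h), in the spirit of
    Yang and Ding (2018); tends to hard_weight as h -> 0."""
    return 0.5 * (1.0 + math.erf((c - T) / (h * math.sqrt(2.0))))

def pvalue_weight(T):
    """Weight the RW data by the p-value of a chi-square(1) statistic:
    a small p-value (large T, strong bias signal) gives a small weight.
    For df = 1 the survival function reduces to erfc(sqrt(T / 2))."""
    return math.erfc(math.sqrt(T / 2.0))
```

Both soft schemes interpolate continuously between fully including and fully excluding the RW data, which removes the discontinuity of the indicator but, as noted above, not the underlying non-regularity.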
3.4. Inference
The non-parametric bootstrap provides consistent inference for many regular estimators. However, the indicator function of the preliminary test in (9) renders a non-smooth and non-regular estimator, which precludes non-parametric bootstrap inference for (Shao, 1994). We formally show in the Online supplementary material the inconsistency of the non-parametric bootstrap inference for . Alternatively, Laber and Murphy (2011) proposed an adaptive confidence interval for the test error in classification, a non-regular statistic, by bootstrapping the upper and lower bounds of the test error. In this article, we propose an adaptive procedure for robust inference of accommodating the strength of violation of Assumption 2 in finite samples.
Let be a p-vector of zeros except that the kth component is one, and let be the kth component of , for . Because the asymptotic distribution of differs under the local and fixed alternatives, we propose different strategies for constructing CIs: under , the asymptotics is non-standard, so we construct a least favourable CI that guarantees good coverage properties uniformly over possible values of the local parameter; under , the asymptotics is standard, so we construct the usual Wald CI based on the normal limiting distribution.
First, under , we rewrite in (10) as , where is the non-regular component with having mean , is the regular component, and and are independent. For a fixed , let be the approximate th quantile of , which can be obtained by rejective sampling. We can construct a confidence interval of as . Different CIs are required for different values of . To accommodate the different possible values of , one solution is to construct the least favourable CI by taking the infimum of the lower bound and the supremum of the upper bound of the CI over all possible values of . However, the range of can be vast, rendering the least favourable CI non-informative. We identify the plausible values of following a multivariate normal distribution with mean and variance . Let , such that , and let be a bounded region of a standard p-variate normal distribution. Then,
is a bounded region of with asymptotic probability . We construct the least favourable CI for as . Here, using the wider quantile range of instead of the quantile range is necessary to guarantee the coverage of due to ignoring other possible values of outside .
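The construction above can be sketched as follows; `draw_law` is a hypothetical sampler standing in for rejective sampling from the limiting law at a given local parameter:

```python
import numpy as np

def least_favourable_ci(draw_law, t_grid, alpha=0.05, n_draws=100_000, seed=4):
    """Least favourable CI over a bounded set of plausible local
    parameters: for each t in t_grid, form the (alpha/2, 1 - alpha/2)
    quantile interval of the (simulated) limiting law, then take the
    infimum of the lower bounds and the supremum of the upper bounds.
    draw_law(t, size, rng) is an assumed sampler for the limiting law
    at local parameter t; the centring at the estimate is omitted."""
    rng = np.random.default_rng(seed)
    lower, upper = np.inf, -np.inf
    for t in t_grid:
        draws = draw_law(t, n_draws, rng)
        lo, hi = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
        lower, upper = min(lower, lo), max(upper, hi)
    return float(lower), float(upper)
```

With a shifted-normal law, the interval widens just enough to cover every local parameter in the grid, which is exactly the price paid for uniformity.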
Second, under , Assumption 2 is strongly violated. As shown in Theorem 2, is regular and asymptotically normal, denoted by . Therefore, a confidence interval of can be constructed based on the - and th quantiles of the normal distribution , denoted by .
Finally, because the least favourable CI may be unnecessarily wide under , we require a strategy to distinguish between corresponding to finite values of and corresponding to . To do this, we use the test statistic T. Under , ; while under . Therefore, we specify a sequence of thresholds that diverges to infinity as and compare T to . Many choices of can be considered, e.g., , which is similar to the BIC criterion (Andrews & Soares, 2010; Cheng, 2008). If , we choose the local alternative strategy to construct the least favourable CI, and if , we choose the fixed alternative strategy to construct a normal CI, leading to an elastic CI
(13)
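The switching rule in (13) can be sketched as follows, taking kappa_n = sqrt(log n) as one admissible BIC-like choice (an assumption for illustration; the paper leaves the exact sequence open):

```python
import math

def elastic_ci(T, n, least_favourable, wald):
    """Elastic CI: compare the test statistic T to a slowly diverging
    threshold kappa_n. Use the least favourable CI in the local regime
    (T <= kappa_n) and the Wald CI in the fixed/strong-violation regime.
    kappa_n = sqrt(log n) is one BIC-like choice, assumed here."""
    kappa_n = math.sqrt(math.log(n))
    return least_favourable if T <= kappa_n else wald
```

Because kappa_n diverges, the rule eventually always selects the Wald CI under a fixed alternative, while retaining the least favourable CI under local alternatives.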
Theorem 3
Suppose assumptions in Theorem 1 hold except that Assumption 2 may be violated. The asymptotic coverage rate of the elastic CI of in (13) satisfies
and the equality holds under .
3.5. Adaptive selection of γ
The selection of γ involves the bias-variance trade-off and is therefore important in determining the MSE of . Corollary 1 indicates that under , the MSE of in (12) involves two terms: Term 1 is , and Term 2 involves . If η is small, the MSE is dominated by Term 1, which can be made small by selecting a small γ; if η is large, the MSE is dominated by Term 2, which can be made small by selecting a large γ.
The above observation motivates an adaptive selection of γ. We propose to estimate η by and select the γ that minimises , where is given by (12) or approximated by rejective sampling. In practice, we can specify a grid of values from 0 to 1 for γ, denoted by , simulate the distribution of for all , and finally choose γ to be the value in that minimises the MSE of . As corroborated by simulation, this selection strategy is effective in the sense that when the signal of violation is weak, the selected value of γ is small, and when the signal of violation is strong, the selected value of γ is large.
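The grid search can be sketched with the same illustrative toy limiting law as before; the bias and variance constants and the grid are assumptions for illustration:

```python
import statistics
import numpy as np

def mse_at(eta, gamma, n_draws=200_000, seed=5):
    """Simulated MSE of a toy pre-test estimator when the test rejects
    the RW data at level gamma. The chi-square(1) threshold is obtained
    via the standard normal quantile; bias/variance constants are toy
    assumptions (trial-only N(0,1), combined N(0.6*eta, 0.36))."""
    c_gamma = statistics.NormalDist().inv_cdf(1.0 - gamma / 2.0) ** 2
    rng = np.random.default_rng(seed)
    z_test = rng.normal(eta, 1.0, n_draws)
    accept = z_test ** 2 <= c_gamma
    z_eff = rng.normal(0.6 * eta, 0.6, n_draws)
    z_rt = rng.normal(0.0, 1.0, n_draws)
    return float(np.mean(np.where(accept, z_eff, z_rt) ** 2))

def select_gamma(eta_hat, gamma_grid=(0.01, 0.05, 0.2, 0.5, 0.8)):
    """Choose the gamma on the grid minimising the simulated MSE at the
    estimated local parameter eta_hat."""
    return min(gamma_grid, key=lambda g: mse_at(eta_hat, g))
```

A weak violation signal favours a small γ (lenient test, more pooling); a strong signal favours a large γ, matching the behaviour reported in the simulation.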
4. Simulation study
We evaluate the finite-sample performance of the proposed elastic estimator via simulation for robustness against unmeasured confounding and adaptive inference. Specifically, we compare the RT estimator, the efficient combining estimator, and the elastic estimator under settings that vary the strength of unmeasured confounding in the RW data. We also carry out a simulation under a setting where the transportability assumption is violated in the RW data; see Online Supplementary Material, Section S4.3.
We first generate populations of size . For each population, we generate the covariate , where for , and the treatment effect modifier is . We generate by
(14)
for . Throughout the simulation, we fix to be zero and consider two cases for : a) zero effect modification and b) nonzero effect modification .
We then generate two samples from the target population. We generate the RT selection indicator by , where . Under this selection mechanism, the selection rate is around , which results in RT subjects. We also take a random sample of size from the population to form an RW sample. In the RT sample, the treatment assignment is , where . In the RW sample, , where logit with α chosen adaptively to ensure the mean of is around 0.5. In addition, we vary b to indicate different strengths of unmeasured confounding in the analysis (violation of Assumption 2). The observed outcome Y in both samples is .
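A toy analogue of this design can be sketched as follows; all coefficients are illustrative, with b controlling how strongly the unmeasured confounder U drives the RW treatment assignment:

```python
import numpy as np

def generate_rw_sample(n, b, psi=(0.0, 1.0), seed=6):
    """Toy analogue of the RW data-generating design: covariate X, an
    unmeasured confounder U whose influence on treatment has strength b
    (the violation of Assumption 2), and HTE tau(X) = psi0 + psi1 * X.
    All coefficients other than b's role are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)
    U = rng.normal(size=n)
    # RW propensity depends on the unmeasured U whenever b != 0
    propensity = 1.0 / (1.0 + np.exp(-(0.5 * X + b * U)))
    A = rng.binomial(1, propensity)
    tau = psi[0] + psi[1] * X
    Y = X + b * U + A * tau + rng.normal(size=n)  # U also affects the outcome
    return X, U, A, Y
```

At b = 0 the treatment is independent of U and no unmeasured confounding is present; as b grows, A and U become correlated and naive RW analyses are biased.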
To assess the robustness of the elastic integrative estimator against unmeasured confounding, we consider the omission of in all estimators, resulting in unmeasured confounding in the RW data. The strength of unmeasured confounding is indexed by b in (14); high values of b indicate strong unmeasured confounding and vice versa. We specify the range of b by 10 values on an irregular grid from 0 to 2, which places more emphasis on the scenarios where Assumption 2 is weakly violated. We compare the following estimators for the HTE parameter ψ:
RT : the efficient estimator based only on the RT data solving (9) with ;
Eff : the efficient integrative estimator solving (9) with ;
Elastic : the proposed elastic integrative estimator solving (9) with adaptive selection of γ.
For all estimators, we estimate the propensity score function by a logistic sieve model with the power series X, and their two-way interactions (omitting ) and the outcome mean functions by linear sieve models with the power series X, and their two-way interactions (omitting ). If a higher-order series is specified, it is necessary to select the series to balance the bias and variance in estimating the nuisance functions, for example using the penalised estimating equation approach (Lee, Yang, Dong, et al., 2022). The CIs are constructed for , and based on perturbation-based resampling with replication size 100 and for based on the elastic approach with . A sensitivity analysis shows that the coverage rates and widths of the CIs stay close for (Online Supplementary Material, Section S4.4).
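A second-order power-series basis of this kind can be sketched as follows (a generic sketch: the paper omits certain terms, which this version does not replicate):

```python
import numpy as np

def sieve_basis(X):
    """Second-order power-series basis for sieve working models:
    intercept, linear terms, squares, and all two-way interactions.
    X is an (n, p) design matrix; returns an (n, 1 + 2p + p(p-1)/2)
    feature matrix usable in logistic or linear sieve regression."""
    n, p = X.shape
    cols = [np.ones(n)]
    cols += [X[:, j] for j in range(p)]
    cols += [X[:, j] ** 2 for j in range(p)]
    cols += [X[:, j] * X[:, k] for j in range(p) for k in range(j + 1, p)]
    return np.column_stack(cols)
```

As the basis grows, the working models approximate the nuisance functions more flexibly, at the cost of variance, which is the bias-variance balance mentioned above.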
Figure 3 presents the Monte Carlo biases, variances, and MSEs of the estimators based on 2,000 simulated datasets, with numerical results reported in Online Supplementary Material, Table S3. Table 1 reports the coverage rates and widths of the CIs. The RT estimator is unbiased across different scenarios, and the coverage rates are close to the nominal level. However, has larger variances than the other integrative estimators due to the small RT sample size. The efficient integrative estimator gains efficiency over by leveraging the large sample size of the RW data. However, the bias of increases as b increases. Thus, has smaller MSEs than for small values of b but larger MSEs for large values of b. The coverage rates of the CIs for deviate from the nominal level as b increases. This can lead to an uncontrolled false discovery of important treatment effect modifiers (see the case of zero effect modification with ). The elastic integrative estimator with the adaptive selection of γ reduces 's biases across all scenarios regardless of the strength of unmeasured confounding. The challenging scenarios are indexed by b around and , where small biases of occur. In these scenarios, the pre-testing (built into the elastic estimator) has difficulty detecting the RW sample's biases. However, with an adaptive selection of γ achieves the smallest MSE among all estimators across all scenarios (Figure 3 and Online Supplementary Material, Table S3).
Figure 3.
Summary statistics plots of estimators of with respect to the strength of unmeasured confounding labelled by 'b'. In each plot, the three estimators , , and are labelled 'RT', 'Eff', and 'Elastic'. Each row of plots corresponds to a different metric: 'bias' for bias, 'var' for variance, and 'MSE' for mean square error; each column of plots corresponds to one component of in the two cases: , , and with .
Table 1.
Simulation results for coverage rates and widths of 95% confidence intervals for and (labelled as ‘RT’, ‘Eff’, and ‘Elastic’) in the two cases: zero effect modification (left) and nonzero effect modification (right) with ; the slightly wider ECIs for (than CIs for ) are bolded
RT | Eff | Elastic | RT | Eff | Elastic | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Case 1: zero effect modification | Case 2: nonzero effect modification | |||||||||||
b | ||||||||||||
Coverage Rate () | ||||||||||||
0 | 94.1 | 94.1 | 93.8 | 93.7 | 92.7 | 92.5 | 94.3 | 93.8 | 95.0 | 94.2 | 92.7 | 92.5 |
0.11 | 94.1 | 94.1 | 92.2 | 92.7 | 93.2 | 92.8 | 94.3 | 93.8 | 93.3 | 92.9 | 92.9 | 92.7 |
0.23 | 94.1 | 94.0 | 88.5 | 89.8 | 92.8 | 92.8 | 94.3 | 93.8 | 89.8 | 89.0 | 93.3 | 92.7 |
0.34 | 94.1 | 94.0 | 83.2 | 84.5 | 94.0 | 93.8 | 94.3 | 93.8 | 84.9 | 83.5 | 94.4 | 93.5 |
0.46 | 94.1 | 94.0 | 74.7 | 76.3 | 94.5 | 94.5 | 94.3 | 93.8 | 76.8 | 75.8 | 94.5 | 94.4 |
0.57 | 94.1 | 94.0 | 66.4 | 66.1 | 95.5 | 95.2 | 94.3 | 93.8 | 67.2 | 66.8 | 95.5 | 94.8 |
0.69 | 94.1 | 94.1 | 56.1 | 56.3 | 95.5 | 95.8 | 94.3 | 93.8 | 56.8 | 55.9 | 95.3 | 94.6 |
0.8 | 94.1 | 94.0 | 46.3 | 46.8 | 95.5 | 95.6 | 94.3 | 93.8 | 46.5 | 45.2 | 95.3 | 95.0 |
1 | 94.1 | 94.0 | 31.5 | 31.1 | 95.5 | 95.0 | 94.3 | 93.8 | 30.9 | 29.4 | 95.5 | 94.9 |
2 | 94.1 | 94.0 | 2.9 | 3.6 | 94.3 | 94.4 | 94.3 | 93.8 | 2.6 | 3.0 | 94.7 | 94.2 |
Width () | ||||||||||||
0 | 528 | 528 | 243 | 242 | 472 | 473 | 529 | 530 | 243 | 243 | 472 | 474 |
0.11 | 527 | 528 | 242 | 242 | 488 | 487 | 529 | 530 | 242 | 243 | 479 | 480 |
0.23 | 527 | 528 | 241 | 242 | 496 | 497 | 529 | 530 | 241 | 242 | 498 | 500 |
0.34 | 528 | 528 | 241 | 241 | 516 | 516 | 529 | 530 | 241 | 242 | 511 | 514 |
0.46 | 528 | 528 | 239 | 240 | 530 | 530 | 529 | 530 | 240 | 240 | 524 | 526 |
0.57 | 528 | 528 | 238 | 238 | 535 | 535 | 529 | 530 | 238 | 239 | 530 | 532 |
0.69 | 528 | 528 | 235 | 236 | 534 | 534 | 529 | 530 | 236 | 236 | 529 | 531 |
0.8 | 528 | 528 | 233 | 234 | 532 | 532 | 529 | 530 | 233 | 234 | 530 | 532 |
1 | 528 | 528 | 229 | 230 | 529 | 529 | 529 | 530 | 229 | 230 | 530 | 532 |
2 | 528 | 528 | 207 | 208 | 527 | 527 | 529 | 530 | 208 | 209 | 528 | 530 |
To inspect the performance of the proposed data-adaptive selection strategy, Online Supplementary Material, Table S8 reports Monte Carlo averages and standard deviations of the selected values for the local parameter η, the threshold , and the proportion of combining the RT and RW samples. As expected, increases as b increases, indicating increased biases in the RW sample. The selected γ increases (as a result, the proportion of combining the RT and RW samples decreases) as b increases, which shows the proposed adaptive selection strategy is effective. To compare the adaptive selection strategy with the fixed threshold strategy, a simulation study in Online Supplementary Material, Section S4.5 shows that the elastic integrative estimator with a fixed threshold can have increased biases compared to a data-adaptive selected threshold.
The coverage rates of the ECIs for are close to the nominal level in all settings with different values of b. The ECIs are narrower than the CIs for when b is small ( for and for ), are wider than the CIs for as b increases, and become close to the CIs for when b reaches 1 or larger. However, the conservativeness of the ECIs diminishes as n increases, and the ECIs can perform at least as well as the CIs for for any b (see Online Supplementary Material, Table S6 for ).
5. An application
We illustrate the potential benefit of the proposed elastic estimator by evaluating the effect of adjuvant chemotherapy for early-stage resected non-small cell lung cancer (NSCLC) using the CALGB 9633 data and a large clinical oncology database, the NCDB. In CALGB 9633, we include 319 patients, with 163 randomly assigned to observation () and 156 randomly assigned to chemotherapy (). The NCDB cohort is selected based on the same patient eligibility criteria as the CALGB 9633 trial; see Online Supplementary Material, Section S5. The comparable NCDB sample includes patients diagnosed with stage IB NSCLC between 2004 and 2016, with on observation and receiving chemotherapy after surgery. The numbers of treated and controls are relatively balanced in the CALGB 9633 trial but unbalanced in the NCDB sample. We include five covariates in the analysis: gender (, ), age, the indicator for histology (, ), race (), and tumour size in centimetres. The outcome is overall survival within three years after the surgery, i.e., if the patient died from any cause and otherwise. We are interested in estimating the HTE of adjuvant chemotherapy over observation after resection for the patient population satisfying the same eligibility criteria as CALGB 9633.
Table 2 reports the covariate means by sample and treatment group. Due to treatment randomisation, covariates are balanced between the treated and the control in the CALGB 9633 trial sample. In contrast, due to the lack of treatment randomisation, covariates are relatively unbalanced in the NCDB sample. Older patients with histology and smaller tumours are more likely to choose the conservative option of observation. Moreover, we cannot rule out the possibility of unmeasured confounders in the NCDB sample.
Table 2.
Covariate means with standard errors in parentheses by sample and treatment group in the CALGB 9633 trial and NCDB samples
A | N | Age | tumour size | Male | Squamous | White | |
---|---|---|---|---|---|---|---|
(years) | (cm) | () | () | () | |||
RT: | 319 | 60.8 (9.62) | 4.60 (2.08) | 63.9 | 39.8 | 89.3 | |
CALGB 9633 | 1 | 156 | 60.6 (10) | 4.62 (2.09) | 64.1 | 40.4 | 90.4 |
0 | 163 | 61.1 (9.25) | 4.57 (2.07) | 63.8 | 39.3 | 88.3 | |
RW: | 15,166 | 67.9 (10.2) | 4.82 (1.71) | 54.6 | 39.1 | 89.6 | |
NCDB | 1 | 4,263 | 63.9 (9.23) | 5.19 (1.79) | 54.3 | 35.6 | 88.6 |
0 | 10,903 | 69.4 (10.1) | 4.67 (1.65) | 54.8 | 40.5 | 90.0 |
We assume a linear HTE function with tumour size as the treatment effect modifier. We compare the same set of estimators and variance estimators considered in the simulation study, together with the efficient estimator applied to the real-world NCDB cohort, denoted by . Table 3 reports the results. Figure 4 shows the estimated treatment effect as a function of the standardised tumour size. Due to the limited sample size of the trial sample, no component of is significant. Due to the large sample size of the NCDB sample, and are close and reveal that adjuvant chemotherapy significantly improved overall survival within three years after the surgery. Patients with larger tumour sizes benefit more from adjuvant chemotherapy. However, this finding may be subject to possible biases of the NCDB sample. In the proposed elastic integrative analysis, the test statistic is ; there is no strong evidence that the NCDB presents hidden confounding in our analysis. As a result, the elastic integrative estimator remains the same as . Reflecting the pre-testing procedure, the estimated standard error of is larger than that of . From Figure 4, patients with tumour sizes in significantly benefit from adjuvant chemotherapy in improving overall survival within three years after the surgery.
Table 3.
Point estimate, standard error, and 95% Wald confidence interval of the causal risk difference between adjuvant chemotherapy and observation based on the CALGB 9633 trial sample and the NCDB sample:
Intercept () | tumour size* () | |||||
---|---|---|---|---|---|---|
Est. | S.E. | C.I. | Est. | S.E. | C.I. | |
RT | −0.094 | 0.054 | (−0.202, 0.015) | 0.002 | 0.055 | (−0.107, 0.111)
RW | −0.076 | 0.0085 | (−0.093, −0.059) | −0.029 | 0.009 | (−0.046, −0.011)
Eff | −0.076 | 0.0083 | (−0.093, −0.059) | −0.026 | 0.009 | (−0.043, −0.009)
Elastic | −0.076 | 0.0196 | (−0.115, −0.037) | −0.026 | 0.029 | (−0.084, 0.032)
Figure 4.
Estimated treatment effect as a function of the (standardised) tumour size along with the Wald confidence intervals: RT, RW, and Eff are the efficient estimators applied to the RT, RW, and combined samples, respectively, and Elastic is the proposed elastic combining estimator.
6. Concluding remarks
The proposed elastic estimator integrates ‘high-quality small data’ with ‘big data’ to simultaneously leverage small but carefully controlled unbiased experiments and massive but possibly biased RW datasets for HTEs. Most causal inference methods require the no unmeasured confounding assumption. However, this assumption may not hold for the RW data due to the uncontrolled, real-world data collection mechanism and is unverifiable based only on the RW data. Utilising the design advantage of RTs, we can gauge the reliability of the RW data and decide whether or not to use RW data in an integrative analysis.
The key assumptions underpinning our framework are the structural HTE model, i.e., Model (1), HTE transportability, and no unmeasured confounding. In practice, RTs usually consider much narrower populations than seen in the real world. Improving the generalisability or external validity of RT findings has been an important research topic in the data integration literature (e.g., Cole & Stuart, 2010; Lee, Yang, Dong, et al., 2022; Rudolph & van der Laan, 2017). Besides Assumption 1(i), the positivity of trial participation or the overlap of the covariate distribution between the RT and RW samples is required in the problem of generalisability. We emphasise that although, formally, we do not require the overlap assumption between the RT and RW samples, its violation renders Model (1) and transportability vulnerable. When transporting from the narrow RT sample to the broader RW sample, the reliable information of treatment effects for the non-overlapping region essentially hinges on the extrapolation from the RT sample. If there is no strong prior knowledge, Model (1) and transportability may not hold. In this case, the RT estimate and the RW estimate of the HTE can be inconsistent due to model misspecification even when there are no unmeasured confounders. See a simulation study in Online Supplementary Material, Section S4.3. The inconsistency of the RW estimator with the RT estimator may reflect violation of either transportability (e.g., due to model misspecification) or unmeasured confounding. Some practical strategies (e.g., matching) can be implemented to select an RW sample with sufficient overlap with the RT sample to improve their comparability and the chance of successfully integrating the information from two separate sources; see Online Supplementary Material, Section S5.2.
The elastic integrative estimator gains efficiency over the RT-only estimator by integrating reliable RW data, while automatically detecting bias in the RW data and gearing toward the RT data when bias is present. However, the proposed estimator is non-regular and, by construction, belongs to pre-test estimation (Giles & Giles, 1993). To demonstrate the non-regularity issue, we characterise the distribution of the elastic integrative estimator under local alternatives, which better approximates its finite-sample behaviour. Moreover, we provide a data-adaptive selection of the threshold in the testing procedure, which guarantees small MSEs of the estimator. Nonetheless, fixing the threshold may not control bias well under ; see a simulation study in Online Supplementary Material, Section S4.5. If the investigator prefers small biases in the elastic combining estimator, we recommend setting a higher lower bound for the grid used to select γ. Although the elastic confidence intervals demonstrate good coverage properties in our simulation under all hypotheses , , and , an open problem remains for post-selection inference after a data-adaptive selection of the threshold in the testing procedure, which we will analyse rigorously, theoretically and empirically, in future work.
The proposed framework can also be extended to individualised treatment regime learning (Chu et al., 2022; Wu & Yang, 2021, 2022) and the data integration problem of combining probability and non-probability samples (Yang & Kim, 2020; Yang et al., 2019, 2021). However, an additional complication arises due to the mixed design-based and super-population inference framework, which will be overcome in future research.
Supplementary Material
Contributor Information
Shu Yang, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
Chenyin Gao, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
Donglin Zeng, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Xiaofei Wang, Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
Funding
This research is supported by NSF grant DMS-1811245 and NIH grants 1R01AG066883, 1R01ES031651, R01GM124104, MH123487, and NS073671. The authors would also like to thank the Associate Editor and anonymous reviewers for their valuable comments and suggestions, which improved the quality of the paper.
Data availability
The authors have access to the data from CALGB 9633 through the Alliance Statistics and Data Center, and to the NCDB database through a licence approved by the American College of Surgeons for Duke University School of Medicine. De-identified individual patient data were used for the application, and the results of the analysis are summarised in the manuscript. Readers may contact the Alliance data sharing working group to request access to the CALGB 9633 data and initiate an application process with the American College of Surgeons to gain access to the NCDB database.
Supplementary material
Supplementary material is available at Journal of the Royal Statistical Society: Series B online.
References
- Andrews D. W., & Soares G. (2010). Inference for parameters defined by moment inequalities using generalized moment selection. Econometrica, 78(1), 119–157. 10.3982/ECTA7502 [DOI] [Google Scholar]
- Bickel P. J., Klaassen C., Ritov Y., & Wellner J. (1993). Efficient and adaptive inference in semiparametric models. Johns Hopkins University Press. [Google Scholar]
- Buchanan A. L., Hudgens M. G., Cole S. R., Mollan K. R., Sax P. E., Daar E. S., Adimora A. A., Eron J. J., & Mugavero M. J. (2018). Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(4), 1193– 1209. 10.1111/rssa.12357 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chakraborty B., & Moodie E. E. (2013). Statistical methods for dynamic treatment regimes. Springer. [Google Scholar]
- Chakraborty B., Murphy S., & Strecher V. (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3), 317–343. 10.1177/0962280209105013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng X. (2008). Robust confidence intervals in nonlinear regression under weak identification, Manuscript, Department of Economics, Yale University.
- Chernozhukov V., Chetverikov D., Demirer M., Duflo E., Hansen C., Newey W., & Robins J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1),C1–C68. 10.1111/ectj.12097 [DOI] [Google Scholar]
- Chu J., Lu W., & Yang S. (2022). ‘Targeted optimal treatment regime learning using summary statistics’, arXiv, arXiv:2201.06229, preprint: not peer reviewed.
- Cole S. R., & Stuart E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172(1), 107–115. 10.1093/aje/kwq084 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Colnet B., Mayer I., Chen G., Dieng A., Li R., Varoquaux G., Vert J.-P., Josse J., & Yang S. (2020). ‘Causal inference methods for combining randomized trials and observational studies: a review’, arXiv, arXiv:2011.08047, preprint: not peer reviewed.
- Dahabreh I. J., Robertson S. E., Tchetgen E. J., Stuart E. A., & Hernán M. A. (2019). Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75(2), 685–694. 10.1111/biom.13009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Giles J. A., & Giles D. E. A. (1993). Pre-test estimation and testing in econometrics: Recent developments. Journal of Economic Surveys, 7(2), 145–197. 10.1111/j.1467-6419.1993.tb00163.x [DOI] [Google Scholar]
- Hamburg M. A., & Collins F. S. (2010). The path to personalized medicine. New England Journal of Medicine, 363(4), 301–304. 10.1056/NEJMp1006304 [DOI] [PubMed] [Google Scholar]
- Hernán M. A., & Robins J. M. (2016). Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8), 758–764. 10.1093/aje/kwv254 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hu F., & Kalbfleisch J. D. (2000). The estimating function bootstrap. Canadian Journal of Statistics, 28(3), 449–481. 10.2307/3315958 [DOI] [Google Scholar]
- Katz A., & Saad E. D. (2009). CALGB 9633: An underpowered trial with a methodologically questionable conclusion. Journal of Clinical Oncology, 27(13), 2300–2301. 10.1200/JCO.2008.21.1565 [DOI] [PubMed] [Google Scholar]
- Kennedy E. H. (2020). ‘Optimal doubly robust estimation of heterogeneous causal effects’, arXiv, arXiv:2004.14497, preprint: not peer reviewed.
- Laber E. B., & Murphy S. A. (2011). Adaptive confidence intervals for the test error in classification. Journal of the American Statistical Association, 106(495), 904–913. 10.1198/jasa.2010.tm10053
- Le Cam L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics, 1, 277–330.
- Le Chevalier T. (2003). Results of the Randomized International Adjuvant Lung Cancer Trial (IALT): Cisplatin-based chemotherapy (CT) vs no CT in 1867 patients with resected non-small cell lung cancer (NSCLC). Lung Cancer, 21, 238. 10.1016/S0169-5002(03)91656-4
- Lee D., Yang S., Dong L., Wang X., Zeng D., & Cai J. (2022). Improving trial generalizability using observational studies. Biometrics. 10.1111/biom.13609
- Lee D., Yang S., & Wang X. (2022). ‘Generalizable survival analysis of randomized controlled trials with observational studies’, arXiv, arXiv:2201.06595, preprint: not peer reviewed.
- Li X., Ding P., & Rubin D. B. (2018). Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences, 115(37), 9157–9162. 10.1073/pnas.1808191115
- Luedtke A., Carone M., & van der Laan M. J. (2019). An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1), 75–99. 10.1111/rssb.12299
- Luedtke A. R., & van der Laan M. J. (2016). Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1), 305–332. 10.1515/ijb-2015-0052
- Neyman J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. English translation of excerpts by Dabrowska D. and Speed T. Statistical Science, 5, 465–472.
- Norris S., Atkins D., Bruening W., Fox S., Johnson E., Kane R., Morton S. C., Oremus M., Ospina M., Randhawa G., Schoelles K., Shekelle P., & Viswanathan M. (2010). Selecting observational studies for comparing medical interventions. In Methods guide for effectiveness and comparative effectiveness reviews [Internet]. Agency for Healthcare Research and Quality.
- Pearl J., & Bareinboim E. (2011). Transportability of causal and statistical relations: A formal approach. In 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW) (pp. 540–547). IEEE.
- Prentice R. L., Langer R., Stefanick M. L., Howard B. V., Pettinger M., Anderson G., Barad D., Curb J. D., Kotchen J., Kuller L., Limacher M., & Wactawski-Wende J. (2005). Combined postmenopausal hormone therapy and cardiovascular disease: Toward resolving the discrepancy between observational studies and the women’s health initiative clinical trial. American Journal of Epidemiology, 162(5), 404–414. 10.1093/aje/kwi223
- Richardson T. S., Robins J. M., & Wang L. (2017). On modeling and estimation for the relative risk and risk difference. Journal of the American Statistical Association, 112(519), 1121–1130. 10.1080/01621459.2016.1192546
- Robins J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and Methods, 23(8), 2379–2412. 10.1080/03610929408831393
- Rothwell P. M. (2005). Subgroup analysis in randomised controlled trials: Importance, indications, and interpretation. The Lancet, 365(9454), 176–186. 10.1016/S0140-6736(05)17709-5
- Rothwell P. M., Mehta Z., Howard S. C., Gutnikov S. A., & Warlow C. P. (2005). From subgroups to individuals: General principles and the example of carotid endarterectomy. The Lancet, 365(9455), 256–265. 10.1016/S0140-6736(05)70156-2
- Rotnitzky A., Smucler E., & Robins J. M. (2019). ‘Characterization of parameters with a mixed bias property’, arXiv, arXiv:1904.03725, preprint: not peer reviewed.
- Rubin D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. 10.1037/h0037350
- Rudolph K. E., & van der Laan M. J. (2017). Robust estimation of encouragement design intervention effects transported across sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5), 1509–1525. 10.1111/rssb.12213
- Shao J. (1994). Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society, 122(4), 1251–1262. 10.1090/S0002-9939-1994-1227529-8
- Sherman R. E., Anderson S. A., Dal Pan G. J., Gray G. W., Gross T., Hunter N. L., LaVange L., Marinac-Dabic D., Marks P. W., Robb M. A., Shuren J., Temple R., Woodcock J., Yue L. Q., & Califf R. M. (2016). Real-world evidence—what is it and what can it tell us. New England Journal of Medicine, 375(23), 2293–2297. 10.1056/NEJMsb1609216
- Shi C., Song R., & Lu W. (2016). Robust learning for optimal treatment decision with np-dimensionality. Electronic Journal of Statistics, 10(2), 2894–2921. 10.1214/16-EJS1178
- Sobel M., Madigan D., & Wang W. (2017). Causal inference for meta-analysis and multi-level data structures, with application to randomized studies of Vioxx. Psychometrika, 82(2), 459–474. 10.1007/s11336-016-9507-z
- Staiger D., & Stock J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3), 557–586. 10.2307/2171753
- Stefanski L. A., & Boos D. D. (2002). The calculus of M-estimation. The American Statistician, 56(1), 29–38. 10.1198/000313002753631330
- Strauss G. M., Herndon J. E., Maddaus M. A., Johnstone D. W., Johnson E. A., Harpole D. H., Gillenwater H. H., Watson D. M., Sugarbaker D. J., Schilsky R. L., Vokes E. E., & Green M. R. (2008). Adjuvant paclitaxel plus carboplatin compared with observation in stage IB non–small-cell lung cancer: CALGB 9633 with the Cancer and Leukemia Group B, Radiation Therapy Oncology Group, and North Central Cancer Treatment Group Study Groups. Journal of Clinical Oncology, 26(31), 5043–5051. 10.1200/JCO.2008.16.4855
- Stuart E. A., Cole S. R., Bradshaw C. P., & Leaf P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2), 369–386. 10.1111/j.1467-985X.2010.00673.x
- Tallis G. M. (1963). Elliptical and radial truncation in normal populations. The Annals of Mathematical Statistics, 34(3), 940–944. 10.1214/aoms/1177704016
- Tian L., Alizadeh A. A., Gentles A. J., & Tibshirani R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508), 1517–1532. 10.1080/01621459.2014.951443
- Toyoda T., & Wallace T. D. (1979). Pre-testing on part of the data. Journal of Econometrics, 10(1), 119–123. 10.1016/0304-4076(79)90071-X
- US Food and Drug Administration (2019). Rare diseases: Natural history studies for drug development, https://www.fda.gov/media/122425/ (accessed 1 May 2022).
- van der Vaart A. W. (2000). Asymptotic statistics. Cambridge University Press.
- Verde P. E., & Ohmann C. (2015). Combining randomized and non-randomized evidence in clinical research: A review of methods and applications. Research Synthesis Methods, 6(1), 45–62. 10.1002/jrsm.1122
- Wu L., & Yang S. (2021). ‘Transfer learning of individualized treatment rules from experimental to real-world data’, arXiv, arXiv:2108.08415, preprint: not peer reviewed.
- Wu L., & Yang S. (2022). Integrative r-learner of heterogeneous treatment effects combining experimental and observational studies. In Proceedings of Machine Learning Research (Vol. 140, pp. 1–S5).
- Yang S., & Ding P. (2018). Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika, 105(2), 487–493. 10.1093/biomet/asy008
- Yang S., & Kim J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. 10.1007/s42081-020-00093-w
- Yang S., Kim J. K., & Hwang Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47(1), 29–58.
- Yang S., Kim J. K., & Song R. (2019). Doubly robust inference when combining probability and non-probability samples with high-dimensional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2), 445–465. 10.1111/rssb.12354
- Zhao Y.-Q., Zeng D., Tangen C. M., & Leblanc M. L. (2019). Robustifying trial-derived optimal treatment rules for a target population. Electronic Journal of Statistics, 13, 1717–1743. 10.1214/19-EJS1540
Data Availability Statement
The authors have access to the data from CALGB 9633 through the Alliance Statistics and Data Center, and to the NCDB database through a license from the American College of Surgeons approved for Duke University School of Medicine. De-identified individual patient data were used for the application, and the results of the analysis are summarized in the manuscript. Readers may contact the Alliance data sharing working group to request access to the CALGB 9633 data, and may initiate an application process with the American College of Surgeons to gain access to the NCDB database.