Author manuscript; available in PMC: 2019 Jul 17.
Published in final edited form as: Stat Biosci. 2016 May 26;9(2):320–338. doi: 10.1007/s12561-016-9149-9

Strengthening Instrumental Variables Through Weighting

Douglas Lehmann 1, Yun Li 1, Rajiv Saran 2, Yi Li 1
PMCID: PMC6636680  NIHMSID: NIHMS1030209  PMID: 31316679

Abstract

Instrumental variable (IV) methods are widely used to deal with the issue of unmeasured confounding and are becoming popular in health and medical research. IV models are able to obtain consistent estimates in the presence of unmeasured confounding, but rely on assumptions that are hard to verify and often criticized. An instrument is a variable that influences or encourages individuals toward a particular treatment without directly affecting the outcome. Estimates obtained using instruments with a weak influence over the treatment are known to have larger small-sample bias and to be less robust to the critical IV assumption that the instrument is randomly assigned. In this work, we propose a weighting procedure for strengthening the instrument within an IV-matching framework. Through simulations, weighting is shown to strengthen the instrument and improve robustness of resulting estimates. Unlike existing methods, weighting is shown to increase instrument strength without compromising match quality. We illustrate the method in a study comparing mortality between kidney dialysis patients receiving hemodialysis or peritoneal dialysis as treatment for end-stage renal disease.

Keywords: Causal inference, End-stage renal disease, Instrumental variables, Unmeasured confounding, Weak instruments

1. Introduction

The randomized controlled trial (RCT) has long been considered the gold standard for obtaining treatment effects. When the treatment has been randomized to subjects, it is reasonable to assume that measured and unmeasured risk factors will balance between groups, and treatment effects can be obtained through direct comparisons. While this is a major benefit of RCTs, they can be costly, and in some cases it is impossible or unethical to randomize the treatment. Observational data are a popular alternative to RCTs but come at the cost of removing control over treatment assignment from the hands of the researcher, giving rise to the possibility that treatment groups will differ in unmeasured ways that confound the relationship between treatment and outcome. Statistical methods that ignore this unmeasured confounding may give biased and misleading results [2,35]. This is a primary concern in any observational study, and much research has gone into this problem.

Instrumental variable (IV) methods are widely used to deal with this issue of unmeasured confounding. These methods rely on an additional variable, termed the instrument, that influences or encourages individuals toward the treatment and only affects the outcome indirectly through its effect on treatment. In this sense, the instrument mimics randomization by randomly “assigning” individuals to different likelihoods of receiving treatment. Instruments with little influence over treatment assignment are termed weak instruments, and there are a number of problems associated with using them. Results obtained when using weak instruments suffer from greater small sample bias, and are less robust to violations of the key assumption that the instrument is randomly assigned or independent of unmeasured confounders [7,32]. This assumption cannot be verified and is often criticized, and thus, using a strong instrument is important for obtaining credible results.

The literature relating to weak instrumental variables has primarily focused on detailing the problems and limitations associated with using them. See, for example, Bound et al. [7], Staiger and Stock [33], Angrist et al. [1], Small and Rosenbaum [32] or Baiocchi et al. [2]. Variable selection methods to select a strong subset among a pool of weak instruments have been proposed [5,6,9]. In a landmark paper related to working with a single weak instrument, Baiocchi et al. [3] proposed near-far matching, a novel method to extract a smaller study with a stronger instrument from a larger study (see also [4,39]). This matching-based IV methodology aims to construct pairs that are “near” on covariates but “far” in the instrument. In other words, pairs consist of subjects with similar characteristics who have received substantially different amounts of encouragement toward the treatment, with a greater difference indicating a stronger instrument. This difference is increased in near-far matching using penalties to discourage pairs with similar instrument values, while allowing a certain number of individuals to be removed from the analysis entirely. This results in a stronger instrument across a smaller number of pairs. One limitation of near-far matching is that it may strengthen the instrument at the cost of match quality.

We propose weighted IV-matching, an alternative for strengthening the instrument within this IV-matching framework. Rather than using penalties to discourage pairs who received similar encouragement, we suggest strengthening the instrument after matches have been formed through weighting, with a pair's weight being a function of the instrument within that pair. A fundamental difference between these two techniques is the stage at which the instrument is strengthened. Weighted IV-matching strengthens the instrument after matches have been formed, allowing the matching algorithm to focus only on creating good matches with similar covariate values. Near-far matching, on the other hand, strengthens the instrument and matches on covariates simultaneously, requiring the algorithm to share priority between the two goals. This generally leads to better quality matches for weighted IV-matching, a major benefit since failing to properly match on observed confounders may lead to bias in estimation.

We illustrate these methods with a comparison of hemodialysis (HD) and peritoneal dialysis (PD) on 6-month mortality among patients with end-stage renal disease (ESRD) using data from the United States Renal Data System (USRDS). PD has several benefits over HD, including cost benefits, an improved quality of life, and the preservation of residual renal function [12,23,34]. Despite this, PD remains underutilized in the United States [15]. One explanation for this may be a lack of consensus regarding the effect of PD on patient survival. An RCT to investigate this question was stopped early due to insufficient enrollment [17]. Many observational studies suggest that PD is associated with decreased mortality, although results are often conflicting [13,16,18,24,36,37]. Complicating the issue is a strong selection bias, with PD patients tending to be younger and healthier than HD patients. Studies have dealt with this issue by measuring and controlling for important confounders, but to our knowledge none have addressed the possibility of unmeasured confounding that likely remains. We define PD as the treatment and consider a binary outcome for 6-month survival. The focus on 6-month survival is to study the influence of initial dialysis modality on early mortality, which tends to be high for dialysis patients. Studying early mortality can provide guidance for selecting the initial dialysis modality in order to reduce this early mortality. See, for example, Noordzij and Jager [26], Sinnakirouchenan and Holley [31], or Heaf et al. [13].

A possible instrument in the data is the mean PD usage at the facility level. Instruments based on mean treatment usage in a geographic region, facility, or other group are often called preference-based instruments [8,20], because it is believed that these groups may have preferences that at least partially override both measured and unmeasured patient characteristics when making treatment decisions. In other words, facilities with high PD usage are more likely to “encourage” their patients toward PD than those with low usage. Preference-based instruments are among the most commonly used instruments in practice [11], and methods to improve upon them may have broad applications.

The remainder of this article is organized as follows. In Sect. 2, we outline the proposed weighted IV-matching procedure and briefly compare it to near-far matching. Inference and sensitivity are discussed in Sect. 3. The finite sample performances of these methods are compared in Sect. 4 through simulation, and they are illustrated with a data analysis in Sect. 5. We conclude with a discussion in Sect. 6.

2. Weighted IV-Matching

We begin with an outline of the IV-matching framework [3,4] and then propose weighted IV-matching for strengthening the instrument within this framework. We briefly compare weighted IV-matching with near-far matching and highlight key differences.

With a preference-based instrument, two rounds of matching are implemented [4]. In the context of our motivating data example, an optimal non-bipartite matching algorithm first pairs facilities [10,21]. After facilities have been paired, the instrument is dichotomized into encouraging and unencouraging. This is done by comparing instrument values within each facility pair and considering the facility with the higher value to be an encouraging facility and the other to be an unencouraging facility. An optimal bipartite matching algorithm then pairs patients at the PD encouraging facility with patients in the other. This results in I pairs of two subjects with similar patient and facility characteristics that received different levels of encouragement toward PD. Instrument strength can be assessed by the average difference, or separation, of this encouragement across pairs. For example, the instrument is considered stronger in a study in which the average encouraged and unencouraged subjects were treated at facilities with 85 and 30 % treatment usage, respectively, than in one with average treatment usages of 60 and 45 %.
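To make the two-round design concrete, the following is a minimal sketch (not the authors' code) using generic optimal matching routines: `networkx.min_weight_matching` and `scipy.optimize.linear_sum_assignment` stand in for the optimal non-bipartite and bipartite algorithms of [10,21], and the array names and Euclidean distance are illustrative placeholders for the covariates and Mahalanobis distance used in practice.

```python
# A minimal sketch of the two-round IV-match, assuming numpy arrays of covariates.
import numpy as np
import networkx as nx
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_facilities(X_fac):
    """Round 1: optimal non-bipartite pairing of facilities on facility covariates."""
    d = cdist(X_fac, X_fac)          # Euclidean here; Mahalanobis in practice
    n = len(X_fac)
    G = nx.Graph()
    G.add_weighted_edges_from((i, j, d[i, j])
                              for i in range(n) for j in range(i + 1, n))
    return list(nx.min_weight_matching(G))   # list of (facility, facility) pairs

def match_patients(X_enc, X_unenc):
    """Round 2: optimal bipartite pairing of patients across a facility pair."""
    rows, cols = linear_sum_assignment(cdist(X_enc, X_unenc))
    return list(zip(rows, cols))
```

Within each facility pair returned by `match_facilities`, the facility with the higher instrument value is labeled encouraging before the second round is run.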

Creating a stronger instrument in this framework is thus equivalent to increasing this separation. We propose increasing this separation by assigning more weight to pairs more likely to be influenced by the instrument. Specifically, we propose weighting by the probability that the encouraged subject receives the treatment while the unencouraged subject receives the control. This can be thought of as the probability that a pair “complies” with encouragement, and giving more weight to pairs more likely to comply creates a stronger instrument across all pairs. Without loss of generality, assume subject j in pair i was treated at the encouraging facility and subject j′ at the unencouraging facility, with Zij = 1 indicating encouragement and Zij′ = 0 indicating unencouragement. Let Dij indicate treatment received. The weight for pair i is then calculated as

$$w_i = P(D_{ij} = 1 \mid Z_{ij} = 1)\, P(D_{ij'} = 0 \mid Z_{ij'} = 0). \qquad (1)$$

Similar to separation of the instrument, this probability is a measure of instrument strength. However, rather than an average across all pairs, it is a measure of the influence of the instrument within pair i. A stronger instrument is created when more weight is given to pairs in which the instrument has more influence over treatment. This has the effect of redistributing the data in a way that highlights "good" pairs that are more influenced by the instrument, increasing separation of the instrument in the process.

In practice, the probabilities in Eq. (1) are unlikely to be known and will need to be estimated. Using facility-level mean PD usage as the instrument, $P(D_{ij} = 1 \mid Z_{ij} = 1)$ is estimated by the mean PD usage at the encouraging facility, while $P(D_{ij'} = 0 \mid Z_{ij'} = 0)$ is estimated by one minus the mean PD usage at the unencouraging facility. Weights can be standardized to maintain the effective sample size and statistical power.
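As a small illustration of Eq. (1) under this estimation strategy, the sketch below computes pair weights from the facility-level instrument; the array names are assumptions, and the standardization shown (rescaling the weights to sum to the number of pairs) is one common choice.

```python
import numpy as np

def pair_weights(pd_enc, pd_unenc, standardize=True):
    """Weights of Eq. (1): P(D=1|Z=1) * P(D=0|Z=0) for each matched pair.

    pd_enc / pd_unenc: mean PD usage (on the [0, 1] scale) at the encouraging
    and unencouraging facility of each pair.
    """
    w = np.asarray(pd_enc) * (1.0 - np.asarray(pd_unenc))
    if standardize:
        w = w * len(w) / w.sum()   # rescale so the weights sum to the number of pairs
    return w
```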

The near-far matching procedure of Baiocchi et al. [3,4] forces separation of the instrument in the matching process. This is done in the first round by adding a penalty to the distance measure between facilities whose instrument values are within a certain threshold, and allowing a certain number to be removed. This requires the matching algorithm to pair facilities with similar covariates and enforce separation of encouragement simultaneously, and generates an implicit tradeoff. A large penalty will dominate the distance used to reflect similarity on covariates, thereby increasing instrument separation but at the expense of match quality, whereas a small penalty may get overshadowed by the covariate distance, leading to better matches, but with less separation. Removing a number of facilities serves to alleviate some of the damage to match quality, although it may not be entirely preserved, since the algorithm is still sharing priority between creating good matches and enforcing instrument separation.

A fundamental difference between weighted IV-matching and near-far matching is the stage at which the instrument is strengthened. Weighted IV-matching strengthens the instrument after matches have been formed, allowing the matching algorithm to focus only on creating good matches with similar covariate values. Near-far matching, on the other hand, strengthens the instrument in the matching process, forcing the algorithm to balance creating good matches and enforcing separation of the instrument. This difference highlights a theme that we will see when comparing the performances of these two techniques: in a tradeoff between match quality and instrument strength, weighted IV-matching tends to favor match quality, while near-far matching tends to favor instrument strength. Strength in either of these areas has implications on the resulting analysis.

3. Inference and Sensitivity

3.1. Notation

We define causal effects of interest using the potential outcomes framework [1,25,30]. Let Zij = 1 if subject j in pair i is encouraged toward treatment, Zij = 0 otherwise. Let Dij(Zij) indicate treatment received for subject j in pair i given their encouragement, and let Yij(Zij, Dij(Zij)) indicate mortality. Dij(Zij) and Yij(Zij, Dij(Zij)) are referred to as "potential outcomes." For encouraged subjects, with Zij = 1, we observe treatment Dij(1) and response Yij(1, Dij(1)). Similarly for unencouraged subjects, we observe Dij(0) and response Yij(0, Dij(0)). Our interest lies in estimating the parameter

$$\lambda = \frac{\sum_{ij}\left[Y_{ij}(1, D_{ij}(1)) - Y_{ij}(0, D_{ij}(0))\right]}{\sum_{ij}\left[D_{ij}(1) - D_{ij}(0)\right]}. \qquad (2)$$

This parameter is often referred to as the local average treatment effect [1,14]. In contrast to an average treatment effect, which is applicable to the entire population, the local effect is interpreted as an average treatment effect among a subgroup of the population known as "compliers." As depicted in Table 1, compliers are individuals who will take the treatment that they are encouraged to take. Unfortunately, since we never observe subjects under both states of encouragement, we never observe both Yij(1, Dij(1)) and Yij(0, Dij(0)) or both Dij(1) and Dij(0), and we must estimate λ from the data. We impose the following five assumptions to aid us in estimation [1,2].

Table 1.

Population subgroups defined by the effect of encouragement on treatment

            D(1)
D(0)        0               1

 0          Never-takers    Compliers
 1          Defiers         Always-takers

D(1) denotes the treatment a subject will receive if they are encouraged toward treatment, while D(0) denotes the treatment they will receive if they are not

A1. Stable unit treatment value assumption (SUTVA)

Often known as no interference, SUTVA requires that individuals' outcomes be unaffected by the treatment assignment of others, and will be violated if spillover effects exist between treatment and control groups. SUTVA allows us to consider a subject's potential outcomes as a function of only their own treatment and encouragement, rather than the treatment and encouragement assignments across the entire population.

A2. Random assignment of the instrument

The instrument is assumed to be randomly assigned, which implies that it is independent of any unobserved confounders. This assumption is often stated conditional on measured confounders. It cannot be verified to hold, and weak instruments are especially sensitive to its violations [2,7,32,33].

A3. Exclusion restriction

The instrument can only affect the outcome through its effect on treatment. This requires that Yij(1, Dij(1) = d) = Yij(0, Dij(0) = d) for all i, j and for d = 0, 1, which cannot be verified since both potential outcomes are never observed for any individual.

A4. Nonzero association between instrument and treatment

A nonzero association between the instrument and treatment implies that E[Dij(1) − Dij(0)] ≠ 0.

A5. Monotonicity

The monotonicity assumption states that there are no defiers, or subjects that always do the opposite of what they are encouraged to do, and implies that Dij (1) ≥ Dij (0) for all i, j.

Assumptions A1 and A2 allow for unbiased estimation of the instrument's effect on the outcome and treatment, that is, the numerator and denominator in (2). By the exclusion restriction, never-takers and always-takers do not contribute to estimation since their treatment and response values do not vary with encouragement. Monotonicity ensures that the group of defiers is empty, while a nonzero association between the instrument and treatment ensures that the group of compliers is not empty. Thus, with the addition of A3–A5, λ is interpreted as the average causal effect of the treatment among the compliers. Further discussion of these assumptions can be found in Imbens and Angrist [14], Angrist et al. [1] or Baiocchi et al. [2], among many others.

3.2. Estimation and Inference

Let $Y_{ij} = Z_{ij} Y_{ij}(1, D_{ij}(1)) + (1 - Z_{ij}) Y_{ij}(0, D_{ij}(0))$ denote the observed response and $D_{ij} = Z_{ij} D_{ij}(1) + (1 - Z_{ij}) D_{ij}(0)$ the observed treatment for subject j in pair i.

Estimate λ as

$$\hat{\lambda} = \frac{\sum_{i=1}^{I} \hat{w}_i \sum_{j=1}^{2}\left[Z_{ij} Y_{ij} - (1 - Z_{ij}) Y_{ij}\right]}{\sum_{i=1}^{I} \hat{w}_i \sum_{j=1}^{2}\left[Z_{ij} D_{ij} - (1 - Z_{ij}) D_{ij}\right]}. \qquad (3)$$

For inferences regarding λ, Baiocchi et al. [3] develop an asymptotically valid test of the null hypothesis $H_0(\lambda_0): \lambda = \lambda_0$. $H_0(\lambda_0)$ is true under many population distributions, and thus is a composite null hypothesis. The size of a test for a composite null is the supremum over all null hypotheses in the composite null, and a test is considered valid if it has size less than or equal to its nominal level. Using the statistics

$$T(\lambda_0) = \frac{1}{I}\sum_{i=1}^{I} \hat{w}_i\left[\sum_{j=1}^{2} Z_{ij}(Y_{ij} - \lambda_0 D_{ij}) - \sum_{j=1}^{2} (1 - Z_{ij})(Y_{ij} - \lambda_0 D_{ij})\right] = \frac{1}{I}\sum_{i=1}^{I} V_i(\lambda_0)$$

and

$$S^2(\lambda_0) = \frac{1}{I(I-1)}\sum_{i=1}^{I}\left[V_i(\lambda_0) - T(\lambda_0)\right]^2,$$

we can test $H_0(\lambda_0)$ by comparing $T(\lambda_0)/S(\lambda_0)$ to a standard normal distribution for large I. Inverting this test by solving $T(\lambda_0)/S(\lambda_0) = 0$ and $\pm 1.96$ provides an estimate and 95 % confidence interval for λ. We refer interested readers to Baiocchi et al. [3] for a detailed discussion of this test statistic, its distribution and related issues.
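A sketch of this inversion is below, assuming pair-level arrays `Z`, `Y`, `D` of shape (I, 2) and per-pair weights `w`; scipy's `brentq` root-finder inverts the test, under the assumption that the bracket `[lo, hi]` contains the solutions.

```python
import numpy as np
from scipy.optimize import brentq

def t_over_s(lam0, Z, Y, D, w):
    """T(lambda0)/S(lambda0) for pair-level arrays of shape (I, 2)."""
    adj = Y - lam0 * D
    V = w * ((Z * adj).sum(axis=1) - ((1 - Z) * adj).sum(axis=1))
    I = len(V)
    T = V.mean()
    S = np.sqrt(((V - T) ** 2).sum() / (I * (I - 1)))
    return T / S

def estimate_and_ci(Z, Y, D, w, lo=-1.0, hi=1.0):
    """Invert the test: T/S = 0 gives the estimate, T/S = +/-1.96 the 95 % CI."""
    f = lambda lam: t_over_s(lam, Z, Y, D, w)
    est = brentq(f, lo, hi)
    a = brentq(lambda lam: f(lam) - 1.96, lo, hi)
    b = brentq(lambda lam: f(lam) + 1.96, lo, hi)
    return est, (min(a, b), max(a, b))
```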

This inference procedure does not, however, provide a standard error estimate. A sandwich-type variance estimate can be obtained following a procedure similar to that discussed in Lunceford and Davidian [22] and Li and Greene [19]. Define the following estimating equations with respect to $\theta = (\mu_{Y1}, \mu_{Y0}, \mu_{D1}, \mu_{D0}, \beta)$:

$$0 = \sum_{i=1}^{I}\sum_{j=1}^{2} \phi_{ij}(\theta) = \sum_{i=1}^{I}\sum_{j=1}^{2}
\begin{pmatrix}
w_i Z_{ij}(Y_{ij} - \mu_{Y1}) \\
w_i (1 - Z_{ij})(Y_{ij} - \mu_{Y0}) \\
w_i Z_{ij}(D_{ij} - \mu_{D1}) \\
w_i (1 - Z_{ij})(D_{ij} - \mu_{D0}) \\
S_\beta(\beta)
\end{pmatrix} \qquad (4)$$

where $\mu_{Y1} = E(w_i Z_{ij} Y_{ij})/E(w_i Z_{ij})$, $\mu_{Y0} = E(w_i (1 - Z_{ij}) Y_{ij})/E(w_i (1 - Z_{ij}))$, and similarly for $\mu_{D1}$ and $\mu_{D0}$. $S_\beta(\beta)$ corresponds to the score equations for estimating the parameters $\beta$, often from a logistic regression, of the probabilities used in Eq. (1) to determine the weights. This procedure allows for simultaneous estimation of $w_i$ and $\lambda$. We estimate $\mathrm{var}(\hat\theta)$ with $(2I)^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-T}$, where $\hat{A} = \sum_{i=1}^{I}\sum_{j=1}^{2} \partial\phi_{ij}(\theta)/\partial\theta\,\big|_{\theta=\hat\theta}$ and $\hat{B} = \sum_{i=1}^{I}\sum_{j=1}^{2} \phi_{ij}(\theta)\phi_{ij}^T(\theta)\,\big|_{\theta=\hat\theta}$. Applying the multivariate delta method with $g(\theta) = (\mu_{Y1} - \mu_{Y0})/(\mu_{D1} - \mu_{D0})$, an estimate of $\mathrm{var}(\hat\lambda)$ is obtained as $\nabla g(\hat\theta)^T\, \widehat{\mathrm{var}}(\hat\theta)\, \nabla g(\hat\theta)$. This approach does not take into account the matching process and can be expected to overestimate the variance, although it was found to perform well in simulations. In Sects. 4 and 5, intervals and coverage results will be based on the permutation inference procedure.
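Once the four means and a covariance estimate for them are in hand (e.g., the corresponding block of the sandwich estimate above), the delta-method step reduces to a few lines. This is a sketch of that final step only, with the gradient of g computed analytically; the argument names are placeholders.

```python
import numpy as np

def delta_var_lambda(theta_hat, cov_theta):
    """Delta-method variance for lambda = (muY1 - muY0) / (muD1 - muD0).

    theta_hat: estimates (muY1, muY0, muD1, muD0);
    cov_theta: their 4x4 covariance matrix, e.g. from the sandwich estimate.
    """
    muY1, muY0, muD1, muD0 = theta_hat
    d = muD1 - muD0
    lam = (muY1 - muY0) / d
    grad = np.array([1.0 / d, -1.0 / d, -lam / d, lam / d])   # d g / d theta
    return grad @ cov_theta @ grad
```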

3.3. Sensitivity

An important benefit of working with stronger instruments is the increased robustness of resulting estimates to violations of the assumption that the instrument is randomly assigned or independent of unmeasured confounders. In this section, we describe a sensitivity analysis outlined in [28] and applied to IV-matching in [3,4]. The goal of this sensitivity analysis is to determine how far an instrument can deviate from being randomly assigned before the qualitative results of the study are altered, with more robust results remaining consistent under larger deviations. In other words, how large would an unmeasured instrument-outcome confounder have to be to explain what appears to be a significant treatment effect?

Following Rosenbaum [28], deviation from random assignment is quantified by assuming that within pair i matched on covariates X, subjects j and j′ differ in their odds of receiving encouragement by at most a factor of Γ ≥ 1, where

$$\frac{1}{\Gamma} \le \frac{\pi_{ij}(1 - \pi_{ij'})}{\pi_{ij'}(1 - \pi_{ij})} \le \Gamma \quad \text{for all } i, j, j' \text{ with } X_{ij} = X_{ij'}, \qquad (5)$$

and $\pi_{ij} = P(Z_{ij} = 1 \mid X_{ij})$. Under the random assignment assumption, $\pi_{ij} = \pi_{ij'}$ and Γ = 1. As the assumption becomes increasingly violated, these probabilities diverge and Γ increases.

The sensitivity analysis is conducted by using Γ in inference procedures to obtain bounds on the p-value associated with testing H0 : λ = 0. For matched pairs, this involves comparing the sum of events in the encouraged group among discordant pairs with two binomial distributions, one with probability $p^- = 1/(1 + \Gamma)$ and another with probability $p^+ = \Gamma/(1 + \Gamma)$. This is done for increasing values of Γ until a previously rejected H0 becomes accepted, e.g. a significant effect is no longer significant. The maximum deviation that can be sustained is given by the largest Γ value in which the upper bound on the p-value remains less than 0.05, with larger maximum deviations indicating more robust results. When pairs are weighted, normal approximations to the binomials can be used for obtaining p-values using the weighted sum of events in the encouraged group among discordant pairs.
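A sketch of the unweighted version of this procedure follows, assuming a one-sided test based on the count of events in the encouraged group among discordant pairs; the step size for the Γ grid is an arbitrary choice.

```python
from scipy.stats import binom

def pvalue_upper_bound(t_obs, n_discordant, gamma):
    """Upper bound on the one-sided p-value at bias level gamma.

    t_obs: events in the encouraged group among discordant pairs;
    n_discordant: number of discordant pairs.
    """
    p_plus = gamma / (1.0 + gamma)
    return binom.sf(t_obs - 1, n_discordant, p_plus)   # P(T >= t_obs) under p+

def max_gamma(t_obs, n_discordant, alpha=0.05, step=0.01):
    """Largest gamma at which the upper-bound p-value stays below alpha."""
    gamma = 1.0
    while pvalue_upper_bound(t_obs, n_discordant, gamma + step) < alpha:
        gamma += step
    return gamma
```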

Rosenbaum and Silber [29] discuss how the univariate parameter Γ can be mapped to two components as $\Gamma = (\Delta\Lambda + 1)/(\Delta + \Lambda)$, where Λ represents the effect of an unmeasured confounder on the instrument and Δ the effect of an unmeasured confounder on the outcome. For example, an unmeasured confounder that triples the odds of receiving encouragement (Λ = 3) while doubling the odds of experiencing the event (Δ = 2) is equivalent to a deviation from random assignment of size Γ = 1.4.

This mapping of Γ allows the sensitivity analysis to remain simple while providing a useful interpretation of its magnitude.
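The mapping itself is a one-liner; the usage line reproduces the Γ = 1.4 calculation above.

```python
def amplify(lam, delta):
    """Map (Lambda, Delta) from [29] to the single sensitivity parameter Gamma."""
    return (delta * lam + 1.0) / (delta + lam)

print(amplify(3, 2))   # (2*3 + 1)/(2 + 3) = 1.4
```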

4. Simulation

In this section, we compare the finite sample performances of three IV-matching techniques through simulation. The standard IV-match (IVM) uses the full data and makes no attempt to strengthen the instrument, while weighted IV-matching (WIVM) and near-far matching (NFM) will create stronger instruments as described in Sect. 2. For the NFM procedure, we add a penalty to the distance between facilities if their instruments are within a distance equal to the interquartile range of instrument values. As in Baiocchi et al. [3], we specify a penalty function that begins at 0 and increases exponentially as a pair's instrument values become closer, and allow 50 % of facilities to be removed during the matching process.

4.1. Setup

One thousand datasets are generated containing i = 1,..., 200 facilities with j = 1,..., 40 subjects at each. Binary treatment D and binary outcome Y are randomly assigned with

$$P(D_{ij} = 1) = \mathrm{logit}^{-1}(\gamma_i + \alpha X_{1,i} + \delta X_{2,ij} + \nu_{ij}), \qquad (6)$$
$$P(Y_{ij} = 1) = \mathrm{logit}^{-1}(\beta D_{ij} + \alpha X_{1,i} + \delta X_{2,ij} + \epsilon_{ij}). \qquad (7)$$

$\gamma_i \sim N(0, 1)$ represents a facility effect. Standard normal covariates $X_{1,i}$ and $X_{2,ij}$ represent observed confounders and are used for matching. $X_{1,i}$ is a facility-level confounder, and $X_{2,ij}$ is a patient-level confounder. Coefficients α, δ, and β represent the effects of X1, X2, and D, respectively. Unobserved confounding is created by generating $(\nu_{ij}, \epsilon_{ij})$ as bivariate normal with correlation ρ = 0.75. The proportion of treated individuals at each facility serves as the instrument.
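As an illustration, a sketch of one simulated dataset under this design is below (Python stands in for whatever software was actually used; the seed, parameter values, and variable names are arbitrary choices within the setup described above).

```python
import numpy as np
from scipy.special import expit   # inverse logit

rng = np.random.default_rng(0)
n_fac, n_pat = 200, 40
alpha, delta, beta, rho = 0.25, 0.25, 0.6, 0.75

gamma = rng.normal(size=(n_fac, 1))       # facility effect
X1 = rng.normal(size=(n_fac, 1))          # facility-level confounder
X2 = rng.normal(size=(n_fac, n_pat))      # patient-level confounder

# correlated errors create unobserved confounding
cov = [[1.0, rho], [rho, 1.0]]
nu, eps = rng.multivariate_normal([0, 0], cov, size=(n_fac, n_pat)).transpose(2, 0, 1)

D = rng.binomial(1, expit(gamma + alpha * X1 + delta * X2 + nu))      # Eq. (6)
Y = rng.binomial(1, expit(beta * D + alpha * X1 + delta * X2 + eps))  # Eq. (7)
Z = D.mean(axis=1)   # instrument: proportion treated at each facility
```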

To obtain the “true” local average treatment effect that we wish to estimate, or λ in (2), we need counterfactual treatments and responses for every individual. These are not easily obtained under the current setup, since γ, not encouragement, is in Eq. (6). Furthermore, we do not know which counterfactual state an individual will be considered to have been observed in until after matching, since subjects are determined to have been observed in an encouraging or unencouraging facility by comparing instrument values within pairs. Despite this caveat, suitable counterfactuals can be obtained in the following way.

Consider patients treated at facilities with $\gamma_i > 0$ to be observed in the encouragement state, and those at facilities with $\gamma_i \le 0$ to be observed in the unencouragement state. For individuals in the encouragement state, we have $D_{ij} = D_{ij}(1)$ and $Y_{ij} = Y_{ij}(1, D_{ij}(1))$ from Eqs. (6) and (7). For counterfactuals, sample a γ from the unencouragement group and denote it $\gamma_i^*$. $D_{ij}(0)$ is then obtained using Eq. (6) with $P(D_{ij}(0) = 1) = \mathrm{logit}^{-1}(\gamma_i^* + \alpha X_{1,i} + \delta X_{2,ij} + \nu_{ij})$, and $Y_{ij}(0, D_{ij}(0))$ is obtained using Eq. (7) with $P(Y_{ij}(0, D_{ij}(0)) = 1) = \mathrm{logit}^{-1}(\beta D_{ij}(0) + \alpha X_{1,i} + \delta X_{2,ij} + \epsilon_{ij})$. Counterfactuals for patients observed in the unencouragement state can be obtained similarly. After obtaining $D_{ij}(1)$, $D_{ij}(0)$, $Y_{ij}(1, D_{ij}(1))$, and $Y_{ij}(0, D_{ij}(0))$, these are plugged into Eq. (2) for the true effect λ.
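Continuing the data-generation sketch above, the counterfactuals for patients observed in the encouragement state can be sketched as follows (again illustrative, not the authors' code); the unencouragement state is handled symmetrically.

```python
# Counterfactual D(0) and Y(0, D(0)) for patients at facilities with gamma > 0,
# reusing rng, gamma, X1, X2, nu, eps, alpha, delta, beta from the sketch above.
enc = gamma[:, 0] > 0
gamma_star = rng.choice(gamma[~enc, 0], size=int(enc.sum()))[:, None]  # gamma* drawn from unencouraged facilities
D0 = rng.binomial(1, expit(gamma_star + alpha * X1[enc] + delta * X2[enc] + nu[enc]))
Y0 = rng.binomial(1, expit(beta * D0 + alpha * X1[enc] + delta * X2[enc] + eps[enc]))
```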

4.2. Simulation Results

4.2.1. Instrument Strength

The present work is motivated by the desire to create a stronger instrument by increasing the separation of encouragement within pairs. Table 2 shows that both WIVM and NFM were able to do so, increasing the standardized difference in encouragement by approximately 25 and 65 %, respectively. All things being equal, the stronger instrument is preferred. Looking at match quality in the next section, however, we will see that all things are not equal.

Table 2.

Separation of encouragement within pairs based on 1000 simulations

        Z̄_U     Z̄_E     St diff

IVM     37 %    62 %    141
WIVM    35 %    65 %    175
NFM     30 %    70 %    232

Reported are the mean treatment usage at unencouraging facilities $\bar{Z}_U$, at encouraging facilities $\bar{Z}_E$, and the standardized difference between them, calculated as $\text{St Diff} = 100(\bar{Z}_E - \bar{Z}_U)/\sqrt{0.5(s_{Z_E}^2 + s_{Z_U}^2)}$, where $s_{Z_E}^2$ and $s_{Z_U}^2$ are the sample variances of the instrument in each group
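For reference, the standardized difference in the footnote can be computed as follows (a sketch; sample variances with the usual n − 1 denominator are assumed).

```python
import numpy as np

def st_diff(z_enc, z_unenc):
    """Standardized difference in instrument values, as in the Table 2 footnote."""
    num = 100.0 * (np.mean(z_enc) - np.mean(z_unenc))
    return num / np.sqrt(0.5 * (np.var(z_enc, ddof=1) + np.var(z_unenc, ddof=1)))
```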

4.2.2. Match Quality

Table 3 reports balance of covariates X1 and X2 as indicated by the standardized difference within pairs. The WIVM procedure produced consistently better covariate balance than the NFM procedure. The particularly poor balance of the facility-level X1 under the NFM procedure shows that introducing penalties to the match negatively affected the ability to properly match on X1 in the first round.

Table 3.

Standardized differences in covariates X1 and X2 within pairs

     (α, δ)         IVM    WIVM   NFM

X1   (0, 0)         0.01   0.01    0.34
     (0.25, 0.25)   0.15   0.14   18.01
     (0.50, 0.50)   0.14   0.16   36.10
X2   (0, 0)         0.01   0.02    0.10
     (0.25, 0.25)   0.58   0.68    1.02
     (0.50, 0.50)   1.35   1.57    2.10

Results based on 1000 simulations

The pattern seen in Tables 2 and 3 shows a tradeoff of instrument strength and match quality between WIVM and NFM. WIVM allows the matching algorithm to focus entirely on matching on covariates, and strengthens the instrument through weighting after the matches have been formed. NFM, on the other hand, incorporates penalties into the match to enforce separation of the instrument, requiring the matching algorithm to share priority between matching on covariates and strengthening the instrument. A large penalty might dominate the distance used for matching and diminish the ability to properly match on covariates. In the tradeoff between instrument strength and match quality, WIVM is willing to trade less instrument strength for higher-quality matches, while NFM is willing to trade lower-quality matches for a stronger instrument. In results that follow, we will see that strength or weakness in either area has important implications on inferences and sensitivity.

4.2.3. Estimation and Coverage

Table 4 presents simulation results relating to estimation and coverage of λ under increasing magnitudes of observed confounding. When α and δ are zero and matching on X1 and X2 is trivial, each method is nearly unbiased and maintains nominal coverage. WIVM and NFM achieved lower mean squared error than IVM, which is one benefit associated with stronger instruments [38]. As α and δ increase and matching on X1 and X2 becomes more important, the performances of IVM and WIVM remain mostly unchanged. NFM, on the other hand, results in increased bias and mean squared errors and low coverage rates, which can be attributed to the inability of the NFM procedure to properly match on X1.

Table 4.

Bias, mean squared error (MSE), and 95% coverage probabilities (CP) for estimation of λ based on 1000 simulations

                            IVM                  WIVM                 NFM
(α, δ)          β    λ      Bias  MSE   CP       Bias  MSE   CP       Bias  MSE   CP

(0, 0)          0.0  0.0     4.6   2.3  94.3      4.5   1.6  93.9      2.5   1.7  94.2
                0.6  0.14    1.4   2.0  94.3      1.3   1.4  95.1      0.3   1.4  95.6
                1.0  0.23    4.6   1.9  94.6      3.7   1.4  95.2      2.5   1.4  95.0
(0.25, 0.25)    0.0  0.0     3.0   2.1  94.8      4.9   1.5  94.6     29.2   2.5  84.9
                0.6  0.14    4.9   1.9  95.2      4.9   1.5  95.2     25.8   2.3  87.6
                1.0  0.23    4.4   1.8  95.0      4.7   1.3  96.0     26.4   2.2  86.5
(0.50, 0.50)    0.0  0.0     8.7   2.4  93.6      9.7   1.7  94.2     93.9  10.5  28.3
                0.6  0.14    8.7   2.3  94.3      7.9   1.7  93.6     88.6   9.7  30.9
                1.0  0.23    4.0   2.0  93.7      3.7   1.5  93.8     78.6   7.8  38.3

Bias and MSE are multiplied by 1000. Coverage probabilities are based on confidence intervals obtained using the permutation inference procedure discussed in Sect. 3

4.2.4. Sensitivity

In this section, we report simulation results for studying violations of random assignment of the instrument. Figure 1 presents the size of the deviation from random assignment, as quantified by Γ, that estimates are robust to. Larger values of Γ correspond with more robust results. These curves are naturally upward sloping since larger effects are more robust, all else being equal.

Fig. 1. Sensitivity results based on 1000 simulations. Lines represent the size of an unobserved bias, as quantified by Γ, that would be required to explain a significant finding. Larger values of Γ correspond with more robust estimates. Left: (α, δ) = (0, 0); Right: (α, δ) = (0.5, 0.5)

Results in Fig. 1 show that more robust results were obtained after creating stronger instruments. An interesting finding can be seen when comparing results from the left panel in Fig. 1 to the right panel. As α and δ increase from 0 in the left panel to 0.5 in the right panel, results for IVM and WIVM are unchanged, but those for the NFM seem to improve greatly. This apparent improvement arises from the biased estimates obtained after failing to properly match on X1. These biased estimates appear more robust than their unbiased counterparts, and cause the curve to shift left by about the size of this bias. This serves as a warning that this sensitivity analysis assumes measured confounders have been properly adjusted for, so that unbiased estimates can be obtained. Γ can therefore be misleading if match quality is poor.

5. Data Analysis

In this section, we illustrate IV-matching (IVM), weighted IV-matching (WIVM), and near-far matching (NFM) with a study comparing the mortality rates in the first 6 months between patients receiving hemodialysis (HD) or peritoneal dialysis (PD) as treatment for end-stage renal disease. Complete information on 164,195 adults initiating dialysis for the first time between January 1, 2010 and December 31, 2013 was obtained from the United States Renal Data System. The analysis was restricted to patients being treated at dialysis facilities with at least ten patients that used both HD and PD during the study period. The analysis was conducted as intention-to-treat, with treatment defined as the modality prescribed at the onset of dialysis.

The instrument, facility-level mean PD usage, was calculated using data from 2007 to 2009 to avoid correlation with any patient-level confounders. The instrument varied greatly across facilities, ranging from 0 to 100 % with a mean of 9.8 %. The correlation coefficient between a facility's PD usage in 2007–2009 and 2010–2013 was 0.68.

Figure 2 and Table 6 of the appendix confirm that patients treated with PD are generally healthier than those treated with HD. On average, they are 6 years younger, receive more pre-ESRD care, suffer fewer comorbidities, and are more likely to be employed than HD patients. In addition, facilities with higher PD usage tend to be larger, as indicated by the higher number of nurses, social workers, and hemodialysis stations. Since these factors could be related to unmeasured confounders that affect patient outcomes, it is important to control for these variables when matching.

Fig. 2. Covariate balance before and after matching as indicated by the standardized differences within pairs. Dashed gray lines are at ±10. Standardized differences larger than this have been suggested to represent imbalances [27]

5.1. Constructing Matches

We follow the two-round matching procedure described in Sect. 2 for constructing matches. An optimal non-bipartite match first pairs facilities. Within each of these pairs, an optimal bipartite match then pairs patients from one facility with patients in the other.

For the first-round facility-level match, we defined the distance between facilities using a Mahalanobis distance based on the facility covariates in Fig. 2. For the NFM procedure, a penalty was added to this distance if facilities' instrument values were within 14 % of each other (the interquartile range), and half of the facilities were dropped from the analysis. For the second-round patient-level match, we matched on a prognostic score based on the patient-level covariates in Fig. 2. For the WIVM procedure, a weight was assigned to each pair based on Eq. (1), with probabilities estimated using the instrument, the facility mean PD usage from 2007 to 2009.

5.2. Results

Of the 164,195 patients, 128,700 were paired using the IVM and WIVM procedures, while 67,904 were paired using the NFM procedure. The average unencouraged and encouraged patients were treated at facilities with 2007–2009 PD usage of 4.7 and 15.3 % under the IVM procedure; 6.3 and 27.8 % under the WIVM procedure; and 3.8 and 25.3 % under the NFM procedure. For both WIVM and NFM, the increased separation corresponds to roughly a 100 % increase in the standardized difference in encouragement, with neither procedure performing notably better than the other in terms of instrument strength.

Covariate balance after matching is presented in Fig. 2 as well as Table 7 of the appendix. Each method is seen to improve covariate balance on average. The IVM and WIVM, however, generally resulted in better balance than the NFM, particularly for facility-level covariates where the NFM seems to struggle. These results are similar to those seen in the simulations of Sect. 4. Estimation results reported in Table 5 indicate that PD has a protective effect on mortality in the first 6 months. For example, λ^ = −0.09 suggests that for every 100 subjects that are encouraged to switch from HD to PD, there are nine fewer deaths in the first 6 months. Both the WIVM and the NFM decreased the width of the confidence intervals associated with λ compared to IVM, with the NFM leading to the narrowest interval. While the WIVM and NFM created equally strong instruments, the WIVM ultimately led to the more robust results since the NFM estimated a smaller effect. Although results appear similar for each of the three methods in this particular analysis, they could differ quite substantially in other scenarios, particularly when important group-level covariates are present or difficult to adjust for.

Table 5.

Estimates and 95 % confidence intervals for the local average treatment effect, λ, as well as sensitivity parameter Γ

        λ̂        95 % CI            Γ

IVM     −0.09    (−0.14, −0.03)     1.03
WIVM    −0.09    (−0.15, −0.06)     1.09
NFM     −0.07    (−0.10, −0.04)     1.07

6. Discussion

Weak instrumental variables present many problems for an IV analysis. Of particular concern is that results obtained using weak instruments are sensitive to even small unmeasured confounders affecting instrument assignment. This problem cannot be alleviated by increasing sample sizes. While we cannot verify the assumption that the instrument is independent of unmeasured confounders, working with stronger instruments increases robustness to its violations [3,7,32].

In this article, we proposed a weighting procedure for building a stronger instrument in the IV-matching framework. The key idea is that we can redistribute the data through weighting to highlight pairs in a way that increases the overall instrument strength. The proposed weights were based on the probability that a pair complies with encouragement, that is, the probability that within the pair the encouraged subject receives treatment while the unencouraged subject receives control. Other weights could be considered, the only requirement being that more weight is assigned to pairs that are more influenced by the instrument. In future work, we will consider the possibility of an "optimal" weight, perhaps subject to a constraint on covariate balance.

Compared with the existing methods, weighting is able to build a stronger instrument without compromising match quality. This is because weights are applied to strengthen the instrument after matches have been formed, as opposed to methods that strengthen the instrument simultaneously with matching. This is a major strength of the proposed method since failing to properly match on important covariates leads to biased effect estimates and misleading sensitivity results.

Using data from the United States Renal Data System, the proposed method was illustrated in a study comparing mortality in the first 6 months between patients receiving hemodialysis or peritoneal dialysis as treatment for end-stage renal disease. The proposed weighting procedure was able to create a stronger instrument while maintaining the integrity of matches. A protective effect of peritoneal dialysis was found, suggesting that there are nine fewer deaths for every 100 patients that are encouraged to switch from hemodialysis to peritoneal dialysis.

While the current work focused on building a stronger instrument within an IV-matching framework, the idea need not be limited to this setting. In future research, we will investigate the use of weighting to increase instrument strength in more common instrumental variable procedures.

Acknowledgements

The authors have benefited from conversations with Michael Baiocchi. This project has been funded in whole or in part through funds from the National Institute of Diabetes and Digestive and Kidney Diseases, the National Institutes of Health, and the Department of Health and Human Services, under Contract No. HHSN276201400001C. Yun Li is funded by the National Institutes of Health (R01-DK070869).

Appendix

Table 6.

Summary of covariates before matching

Patient covariates            HD         PD         St diff

N                             142,737    21,458
Outcome
 Death w/in 6 months          14 %       4 %        35.7
Covariates
 Age                          64         58         37.7
 Male                         57 %       55 %       3.8
 BMI                          29.6       29.5       1.9
 6+ months pre-ESRD care      45 %       69 %       −49.3
 # of comorbidities           2.4        1.9        44.1
 Hemoglobin                   9.9        10.6       −4.2
 Serum creatinine             6.6        6.4        1.0
 No insurance                 7 %        8 %        −6.9
 White                        68 %       71 %       −5.1
 Black                        26 %       22 %       8.7
 Asian                        4 %        5 %        −7.4
 Hispanic                     13 %       12 %       2.2
 Employed                     9 %        26 %       −45.2

Facility covariates           Q1         Q4         St diff

Instrument
 PD usage                     3 %        30 %       −208
Covariates
 For profit                   85 %       86 %       −3.3
 # of nurses                  6.7        8.7        −43.3
 # of technicians             8.2        8.1        2.0
 # of social workers          0.8        1.1        −36.4
 # of HD stations             20.3       21.9       −19.1
 Median income                $51,086    $50,850    1.2
 Bachelors degree +           23.7 %     23.4 %     4.5

Patient-level covariates are compared across dialysis modalities, and facility-level covariates are compared across the first and fourth quartiles of PD usage

Table 7.

Summary of covariates after matching, by matching algorithm

                            IVM (64,350 pairs)         WIVM (64,350 pairs)        NFM (33,702 pairs)
                            U        E        St diff   U        E        St diff   U        E        St diff

Instrument
 Facility % PD 2007–2009    4.7 %    15.3 %   −96.3     6.3 %    27.8 %   −194.8    3.8 %    25.3 %   −195.2
Treatment
 PD                         10.0 %   16.3 %   −18.7     11.5 %   23.5 %   −35.7     9.0 %    23.8 %   −44.1
Outcome
 Died w/in 6 months         11.9 %   11.3 %   1.7       11.7 %   10.7 %   3.3       11.8 %   10.8 %   3.1
Patient covariates
 Age                        62.8     62.7     0.5       62.6     62.1     3.5       63.0     62.2     5.0
 Male                       57.0 %   56.9 %   0.2       57.3 %   56.7 %   1.3       57.0 %   56.7 %   0.6
 BMI                        30.5     30.4     0.5       30.5     30.2     1.0       31.0     31.2     −0.7
 6+ mos pre-ESRD care       47.8 %   50.3 %   −5.1      50.2 %   53.1 %   −5.7      50.4 %   53.2 %   −5.6
 # of comorbidities         2.5      2.4      1.8       2.5      2.4      2.1       2.6      2.4      10.1
 Hemoglobin                 9.9      10.0     −0.4      9.9      10.0     −0.5      10.0     9.9      1.0
 Serum creatinine           6.6      6.5      0.4       6.7      6.5      0.6       6.5      6.5      0.2
 No insurance               7.1 %    6.9 %    0.9       6.8 %    7.5 %    −2.7      6.1 %    7.4 %    −4.9
 White                      68.6 %   65.9 %   5.7       68.9 %   63.9 %   10.6      68.8 %   64.1 %   10.1
 Black                      25.3 %   28.1 %   −6.2      25.1 %   29.2 %   −9.4      26.7 %   29.4 %   −6.1
 Asian                      3.8 %    3.8 %    0.4       4.1 %    3.8 %    1.7       2.6 %    4.2 %    −8.4
 Hispanic                   13.8 %   12.6 %   3.7       13.4 %   12.4 %   3.1       9.9 %    12.2 %   −7.0
 Employed                   11.4 %   12.4 %   −3.1      12.3 %   13.7 %   −4.2      11.6 %   13.6 %   −6.3
Facility covariates
 For profit                 84.3     84.3     0.1       81.1     81.1     0.0       83.3     83.4     −0.3
 # of nurses                9.1      9.2      −2.0      10.0     10.2     −3.8      9.2      10.7     −26.5
 # of technicians           9.8      9.9      −0.5      9.7      9.8      1.2       9.0      9.7      −10.6
 # of social workers        1.1      1.3      −13.9     1.1      1.4      −16.8     1.1      1.3      −10.4
 # of HD stations           24.0     24.0     −0.5      24.2     24.4     −1.2      23.5     24.6     −10.9
 Median income              $50,874  $51,343  −2.32     $50,618  $50,496  0.6       $50,470  $51,368  −4.5
 Bachelors degree +         23.4     25.0     −10.8     24.0     25.1     −8.2      23.5     25.7     −15.4

U and E correspond to patients considered to have been treated at unencouraging and encouraging PD facilities, respectively

References

  • 1. Angrist JD, Imbens GW, Rubin DB (1996) Identification of causal effects using instrumental variables. J Am Stat Assoc 91(434):444–455
  • 2. Baiocchi M, Cheng J, Small DS (2014) Instrumental variable methods for causal inference. Stat Med 33(13):2297–2340
  • 3. Baiocchi M, Small D, Lorch S, Rosenbaum P (2010) Building a stronger instrument in an observational study of perinatal care for premature infants. J Am Stat Assoc 105(492):1285–1296
  • 4. Baiocchi M, Small DS, Yang L, Polsky D, Groeneveld PW (2012) Near/far matching: a study design approach to instrumental variables. Health Serv Outcomes Res Methodol 12(4):237–253
  • 5. Belloni A, Chen D, Chernozhukov V, Hansen C (2012) Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369–2429
  • 6. Belloni A, Chernozhukov V, Hansen C (2010) Lasso methods for Gaussian instrumental variables models. arXiv:1012.1297
  • 7. Bound J, Jaeger DA, Baker RM (1995) Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Stat Assoc 90(430):443–450
  • 8. Brookhart MA, Schneeweiss S (2007) Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. Int J Biostat 3(1):1–25
  • 9. Caner M, Fan Q (2010) The adaptive lasso method for instrumental variable selection. Technical report, North Carolina State University
  • 10. Derigs U (1988) Solving non-bipartite matching problems via shortest path techniques. Ann Oper Res 13(1):225–261
  • 11. Garabedian LF, Chu P, Toh S, Zaslavsky AM, Soumerai SB (2014) Potential bias of instrumental variable analyses for observational comparative effectiveness research. Ann Intern Med 161(2):131–138
  • 12. Goodlad C, Brown E (2013) The role of peritoneal dialysis in modern renal replacement therapy. Postgrad Med J 89(1056):584–590
  • 13. Heaf JG, Løkkegaard H, Madsen M (2002) Initial survival advantage of peritoneal dialysis relative to haemodialysis. Nephrol Dial Transpl 17(1):112–117
  • 14. Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62(2):467–475
  • 15. Jiwakanon S, Chiu YW, Kalantar-Zadeh K, Mehrotra R (2010) Peritoneal dialysis: an underutilized modality. Curr Opin Nephrol Hypertens 19(6):573–577
  • 16. Kim H, Kim KH, Park K, Kang SW, Yoo TH, Ahn SV, Ahn HS, Hann HJ, Lee S, Ryu JH, Kim SJ, Kang DH, Choi KB, Ryu DR (2014) A population-based approach indicates an overall higher patient mortality with peritoneal dialysis compared to hemodialysis in Korea. Kidney Int 86(5):991–1000
  • 17. Korevaar JC, Feith G, Dekker FW, van Manen JG, Boeschoten EW, Bossuyt PM, Krediet RT (2003) Effect of starting with hemodialysis compared with peritoneal dialysis in patients new on dialysis treatment: a randomized controlled trial. Kidney Int 64(6):2222–2228
  • 18. Kumar VA, Sidell MA, Jones JP, Vonesh EF (2014) Survival of propensity matched incident peritoneal and hemodialysis patients in a United States health care system. Kidney Int 86(5):1016–1022
  • 19. Li L, Greene T (2013) A weighting analogue to pair matching in propensity score analysis. Int J Biostat 9(2):215–234
  • 20. Li Y, Lee Y, Wolfe RA, Morgenstern H, Zhang J, Port FK, Robinson BM (2015) On a preference-based instrumental variable approach in reducing unmeasured confounding-by-indication. Stat Med 34(7):1150–1168
  • 21. Lu B, Greevy R, Xu X, Beck C (2011) Optimal nonbipartite matching and its statistical applications. Am Stat 65(1):21–30
  • 22. Lunceford JK, Davidian M (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 23(19):2937–2960
  • 23. Marrón B, Remón C, Pérez-Fontán M, Quirós P, Ortíz A (2008) Benefits of preserving residual renal function in peritoneal dialysis. Kidney Int 73:S42–S51
  • 24. Mehrotra R, Chiu YW, Kalantar-Zadeh K, Bargman J, Vonesh E (2011) Similar outcomes with hemodialysis and peritoneal dialysis in patients with end-stage renal disease. Arch Intern Med 171(2):110
  • 25. Neyman J (1923) On the application of probability theory to agricultural experiments. Stat Sci 5:463–480
  • 26. Noordzij M, Jager K (2012) Survival comparisons between haemodialysis and peritoneal dialysis. Nephrol Dial Transpl 27(9):3385–3387
  • 27. Normand SLT, Landrum MB, Guadagnoli E, Ayanian JZ, Ryan TJ, Cleary PD, McNeil BJ (2001) Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. J Clin Epidemiol 54(4):387–398
  • 28. Rosenbaum P (2002) Observational studies. Springer, New York
  • 29. Rosenbaum PR, Silber JH (2009) Amplification of sensitivity analysis in matched observational studies. J Am Stat Assoc 104(488):1398–1405
  • 30. Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688
  • 31. Sinnakirouchenan R, Holley JL (2011) Peritoneal dialysis versus hemodialysis: risks, benefits, and access issues. Adv Chronic Kidney Dis 18(6):428–432
  • 32. Small DS, Rosenbaum PR (2008) War and wages: the strength of instrumental variables and their sensitivity to unobserved biases. J Am Stat Assoc 103(483):924–933
  • 33. Staiger DO, Stock JH (1997) Instrumental variables regression with weak instruments. Econometrica 65:557–586
  • 34. Tam P (2009) Peritoneal dialysis and preservation of residual renal function. Perit Dial Int 29(Supplement 2):S108–S110
  • 35. VanderWeele TJ, Arah OA (2011) Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology 22(1):42–52
  • 36. Vonesh E, Snyder J, Foley R, Collins A (2006) Mortality studies comparing peritoneal dialysis and hemodialysis: what do they tell us? Kidney Int 70:S3–S11
  • 37. Weinhandl ED, Foley RN, Gilbertson DT, Arneson TJ, Snyder JJ, Collins AJ (2010) Propensity-matched mortality comparison of incident hemodialysis and peritoneal dialysis patients. J Am Soc Nephrol 21(3):499–506
  • 38. Wooldridge JM (2001) Econometric analysis of cross section and panel data. MIT Press, Cambridge
  • 39. Zubizarreta JR, Small DS, Goyal NK, Lorch S, Rosenbaum PR (2013) Stronger instruments via integer programming in an observational study of late preterm birth outcomes. Ann Appl Stat 7(1):25–50. doi:10.1214/12-AOAS582
