Author manuscript; available in PMC 2021 Apr 7. Published in final edited form as: Annu Rev Stat Appl. 2018 Nov 28;6(1):125–148. doi: 10.1146/annurev-statistics-031017-100353

Handling Missing Data in Instrumental Variable Methods for Causal Inference

Edward H Kennedy 1, Jacqueline A Mauro 1, Michael J Daniels 2, Natalie Burns 2, Dylan S Small 3
PMCID: PMC8025985  NIHMSID: NIHMS1007650  PMID: 33834080

Abstract

It is very common in instrumental variable studies for there to be missing instrument data. For example, in the Wisconsin Longitudinal Study one can use genotype data as a Mendelian randomization-style instrument, but this information is often missing when subjects do not contribute saliva samples, or when the genotyping platform output is ambiguous. Here we review missing-at-random assumptions one can use to identify instrumental variable causal effects, and discuss various approaches for estimation and inference. We consider likelihood-based methods, regression and weighting estimators, and doubly robust estimators. The likelihood-based methods yield the most precise inference, and are optimal under the model assumptions, while the doubly robust estimators can attain the nonparametric efficiency bound while allowing flexible nonparametric estimation of nuisance functions (e.g., instrument propensity scores). The regression and weighting estimators can sometimes be easiest to describe and implement. Our main contribution is an extensive review of this wide array of estimators under varied missing-at-random assumptions, along with discussion of asymptotic properties and inferential tools. We also implement many of the estimators in an analysis of the Wisconsin Longitudinal Study, to study effects of impaired cognitive functioning on depression.

Keywords: causal inference, instrumental variable, missing data, observational study, semiparametric efficiency

1. Introduction

A central goal of many studies is to make inferences about the causal effect of a treatment. The gold standard study design is a randomized trial with perfect compliance, but often such an approach is not ethical or feasible, and instead either an observational study or a randomized trial with imperfect compliance is the only feasible study design. Instrumental variables (IV) methods are used to address a wide variety of problems spanning numerous disciplines, including economics, public policy, and epidemiology.

One major motivation for using IV methods is to handle potential biases introduced by unmeasured confounding, i.e., the presence of unmeasured pre-treatment differences between the treated and control groups (another motivation for using IV methods, which we do not focus on in this review, is handling measurement error [Carroll et al. 2006, Chapter 6]). IV methods for causal inference control for unmeasured confounders by (i) finding a variable (the IV) that influences treatment but is independent of unmeasured confounders and has no direct effect on the outcome except through its effect on treatment; (ii) using the IV to extract variation in the treatment that is free of unmeasured confounders; and (iii) using this confounder-free variation in the treatment to estimate the causal effect of the treatment [Angrist et al. 1996, Baiocchi et al. 2014, Brookhart et al. 2010, Hernán and Robins 2006].

In practical settings, missing data presents specific challenges for drawing valid inferences; data can be incomplete on one or more of the outcome, exposure, instrument, or covariates. The extant literature on methods for handling missing data is considerable and wide-ranging. Broadly speaking, these methods can be divided into those relying on reweighting by inverse selection probabilities [Horvitz and Thompson 1952, Li et al. 2013] and those relying on imputing missing values under an assumed model or other set of assumptions; the latter includes multiple imputation [Rubin 1996] and resampling methods (e.g., hot deck [Andridge and Little 2010]). All of these methods rely on some version of the missing at random (MAR) assumption, originally explicated by Rubin [1976], or, for the case of likelihood-based inference, the ignorability assumption [Little and Rubin 2014].

In spite of an enormous literature on missing data in general, the problem of missing instrumental variables has received relatively little attention; however, the issue has relevance to a number of practical settings, especially in the domain of Mendelian randomization. Two examples help to illustrate.

Family size and child outcomes.

In economics, there is a large literature on the effect of family size on child outcomes, starting with the seminal paper by Becker [1960]. There is concern that family size is endogenous, so a twin birth has frequently been used as an instrument for family size [Åslund and Grönqvist 2010, Black et al. 2005, Cáceres-Delpiano 2006, Rosenzweig and Wolpin 1980]. However, whether a twin birth versus a single birth occurs is only observed (or defined) for non-randomly selected sub-samples [Mogstad and Wiswall 2012]. For example, twins on the second birth are missing for the sub-sample of families with only one child, twins on the third birth are missing for the sub-sample with two or fewer children, and so on.

Mendelian randomization.

Mendelian randomization uses genetic variants as IVs to estimate causal effects of risk factors on an outcome when there is potential unmeasured confounding or reverse causation [Smith and Ebrahim 2003]. Missing data on genetic variants typically arise because there is some difficulty in interpreting the outcome of genotyping platforms [Burgess et al. 2011]. If the output is not clear, a “missing” result is recorded. Burgess et al. [2011] studied the effect of C-reactive protein on fibrinogen using three genetic variants as IVs using data from the British Women’s Heart and Health Study. There was missingness in 2.1% of participants for C-reactive protein, 2.4% for fibrinogen and 10.8%, 1.9% and 2.6% for the three genotypes.

The outline of the paper is as follows. In Section 2, we describe data from the Wisconsin Longitudinal Study concerning the effect of cognitive impairment on depression in the elderly. These data provide the primary motivation for this investigation, and are analyzed in detail in Section 9. In Section 3, we introduce the data structure and target estimand of interest. In Section 4 we describe three possible missing-at-random mechanisms and corresponding identification results. In Sections 5-8, we detail likelihood-based (MLE and Bayes), regression, weighting, and doubly robust estimators, respectively, under the three missing-at-random mechanisms, describing the assumptions required and corresponding theoretical properties of the methods. Finally in Section 9 we apply the methods to study effects of impaired cognitive functioning on depression, using the aforementioned Wisconsin Longitudinal Study data.

2. Motivation: Wisconsin Longitudinal Study

One source for IVs is genetic variants that affect treatment variables [Smith and Ebrahim 2003]. For example, Voight et al. [2012] studied the effect of HDL cholesterol on myocardial infarction using the genetic variant LIPG 396Ser allele as an IV, for which carriers have higher levels of HDL cholesterol but similar levels of other lipid and non-lipid risk factors compared with noncarriers. The approach of using genetic variants as an IV is called Mendelian randomization because it makes use of the random assignment of genetic variants conditional on parents’ genes, as discovered by Mendel. Didelez and Sheehan [2007], Lawlor et al. [2008] and Burgess and Thompson [2015] provide good reviews of Mendelian randomization methods. We will consider the following Mendelian randomization example using data from the Wisconsin Longitudinal Study (WLS), a longitudinal study of Wisconsin residents who graduated from high school in 1957 and their siblings [Herd et al. 2014]. We consider subjects who were alive and able to be contacted for a follow-up survey in 2011. Our goal is to study the effect of cognitive functioning on depression in the elderly; here there are concerns about unmeasured confounding and also the possibility of reverse causation. We will use as an IV the APOE ϵ4 genotype, which has been found to affect cognitive function [Farlow et al. 2004].

Specifically, our IV is coded as 1 for a person with one or two C alleles at the SNP location rs429358 in the APOE gene and 0 for a person with two T alleles; an IV value of 1 confers an increased risk for impaired memory. The exposure is defined as having impaired memory in 2011, at which point subjects were an average of 71 years old. We define impaired memory as having a score on a delayed word recall task that is more than one standard deviation below the mean, following Ganguli et al. [2004]. In the delayed word recall task, subjects are read ten words and asked to recall them immediately; they then answer several other survey questions for about ten minutes, without being told that the words will be asked about again, at which point they are asked to recall as many of the ten words as possible. The mean among WLS participants on the delayed word recall task in 2011 was 4.14 and the standard deviation was 2.08, so impaired memory (>1 standard deviation below the mean) is defined as a delayed word recall score of two or less. The outcome is a modified CES-D depression score that ranges from 0 to 132, where higher scores indicate more depression [Radloff 1977].

A number of covariates were collected in this study, including age in 2011, years of education (specifically, equivalent years of regular education, which is 12 for a high school graduate with less than one year of college, regardless of the number of years it took to complete high school, 16 for anyone with a college degree, regardless of number of years it took to complete college, etc.) and IQ score in high school (specifically the Henmon-Nelson IQ score). The delayed word recall test was only administered to a random sample of 80% of subjects and we will only consider these 80% of subjects. There was a small amount of missingness in the exposure of impaired memory, as well as in the covariates and the outcome of CES-D depression score (approximately 6% for the exposure and 5% for the outcome). However there is a substantial amount of missingness in the IV. To focus on main issues, we will only consider subjects with complete data on the exposure, covariates and outcome. There are n = 4390 such subjects and 1282 (29%) are missing the IV.

The missingness in the IV comes from two sources. First, participants were asked to submit saliva samples in order for them to be genotyped. There is a large amount of missingness from participants who did not submit saliva samples: more than 25% of participants in the WLS who were alive and asked to submit saliva samples did not do so. Second, there is a small amount of missingness (~2%) among study participants who did submit saliva samples for genotyping. Missing data on genetic variants typically arise because there can be some difficulty in interpreting the outcome of genotyping platforms [Burgess et al. 2011]. If the output is not clear, a “missing” result is recorded.

As shown in Table 1, the missingness in the IV is related to the covariates, the exposure, and the outcome. In particular, the subjects with missing IV values tend to be older and less educated, and to have lower high school IQs, more impaired memory, and higher depression scores.

Table 1:

Sample covariate means from WLS, among those with missing/non-missing IV values, along with a confidence interval and p-value based on a t-test for equality in means.

Covariate         Non-missing IV (n = 3108)   Missing IV (n = 1282)   95% CI for difference   p-value
Age in 2011       71.01                       71.24                   (−0.39, −0.07)          0.004
Education         13.74                       13.29                   (0.31, 0.59)            < 0.001
IQ                103.68                      101.32                  (1.41, 3.31)            < 0.001
Impaired Memory   0.12                        0.16                    (−0.06, −0.01)          0.006
Depression        15.12                       17.02                   (−2.90, −0.90)          < 0.001

In Section 9 we will return to the WLS example and analyze the data using methods discussed throughout.

3. Data structure & estimand

We assume we observe an iid sample (O1,..., On) for

O = (X, R, RZ, A, Y) ∼ ℙ

where X ∈ ℝd denotes baseline covariates, R ∈ {0,1} is an indicator for whether a binary instrument Z ∈ {0,1} is observed/non-missing, A ∈ ℝ is a treatment variable, and Y ∈ ℝ is some outcome of interest. For example, in the WLS data covariates X include age, education, and IQ score, the instrument Z is an indicator of having at least one C allele at rs429358 in the APOE gene, treatment A is an indicator of impaired memory based on the word recall task, and the outcome Y is the CES-D depression score.

To focus ideas we consider a covariate-adjusted version of the classic Wald estimand

$$\psi = \frac{\mathbb{E}\{\mathbb{E}(Y \mid X, Z=1) - \mathbb{E}(Y \mid X, Z=0)\}}{\mathbb{E}\{\mathbb{E}(A \mid X, Z=1) - \mathbb{E}(A \mid X, Z=0)\}} \tag{1}$$

However we are agnostic about the exact causal quantity it is assumed to represent, since it can equal different causal effects under different assumptions.

For example, let (A_z, Y_z) denote the potential treatment and outcome under Z = z, and similarly Y_{za} the potential outcome under (Z = z, A = a), so that Y_a = Y_{Za} is the potential outcome under A = a and the observed Z. Following Imbens and Angrist [1994] and Angrist et al. [1996], if treatment A is binary, and in addition to the usual consistency (A = A_Z, Y = Y_{ZA}) and positivity (ℙ(Z = 1 | X) bounded away from zero and one) assumptions, one assumes:

A1. Unconfounded IV: Z ⫫ (A_z, Y_z) | X.

A2. Exclusion restriction: Y_{za} = Y_a for z = 0, 1.

A3. Relevance: ℙ(A_{z=1} > A_{z=0}) > 0.

A4. Monotonicity: ℙ(A_{z=1} ≥ A_{z=0}) = 1.

Then

$$\psi = \mathbb{E}(Y_{a=1} - Y_{a=0} \mid A_{z=1} > A_{z=0})$$

is an average treatment effect among “compliers” who would take treatment only when encouraged by the IV. We refer to Abadie [2003], Angrist et al. [1996], Imbens and Angrist [1994], Tan [2006] and Ogburn et al. [2015] for detailed discussion of these assumptions. Note the monotonicity assumption (A4) rules out the presence of any “defiers” with A_{z=1} < A_{z=0}, who would do the exact opposite of what the instrument encourages them to do. If A is multi-valued and (A1)–(A4) hold, then the same observed data quantity ψ in 1. instead equals a particular weighted average of complier effects [Angrist and Imbens 1995].

Rather than using a monotonicity assumption as above, an alternative framework restricts effect heterogeneity to identify causal effects with instrumental variables. For example, the parameter ψ arises as a coefficient in the classical additive linear constant effects model, as discussed for example by Holland [1988] and Small [2007]. Note the linearity and constant-effect assumptions there allow the use of non-binary treatments, and thus also identify average treatment effects in the full population. The parameter ψ also arises in a semiparametric version of the classical IV model, in which the first stage regression model E(A|X,Z) is unrestricted while

Y=h(X)+ψA+ϵ

with ϵ a random error term satisfying cov(Z,ϵ|X)=0. Therefore arbitrary non-linear covariate relationships are allowed in the so-called second-stage model, but the treatment effect is linear and there are no treatment-covariate interactions [Okui et al. 2012]. This follows since this model implies

$$\mathbb{E}(Y \mid X, Z=1) - \mathbb{E}(Y \mid X, Z=0) = \psi\{\mathbb{E}(A \mid X, Z=1) - \mathbb{E}(A \mid X, Z=0)\}$$

which holds for the ψ from 1. by taking an expectation of both sides.

Other authors such as Hernán and Robins [2006], Robins [1994], and Tan [2010] have also used heterogeneity restrictions for identification, but by limiting general effect modification without using linearity assumptions. For example, if one assumes the no-interaction condition E(Y_{a=1} − Y_{a=0} | A_{z=0} = 1) = E(Y_{a=1} − Y_{a=0} | A_{z=1} = 1) in place of the monotonicity assumption (A4) then

$$\psi = \mathbb{E}(Y_{a=1} - Y_{a=0} \mid A_{z=1} = 1)$$

is an average treatment effect among those who would take treatment when encouraged to or not [Hernán and Robins 2006]. If one makes an even stronger no-effect-modification assumption that the average treatment effect is the same in all four principal strata (A_{z=1}, A_{z=0}) ∈ {0,1}², then of course

$$\psi = \mathbb{E}(Y_{a=1} - Y_{a=0})$$

is simply an average treatment effect across the population.

Thus, since the parameter ψ means different things under different causal assumptions, we leave the preferred set of causal assumptions up to the reader/user, and instead focus on the observed data quantity ψ in 1. Note this is an “observed data” quantity in the sense that it does not depend explicitly on counterfactuals, but as discussed next it is not directly identifiable since Z is not always observed.

4. Missingness mechanisms & identification

Importantly, without further assumptions, the IV estimand 1. is not identified when the instrument Z is missing for some subjects. This is because

$$\mathbb{E}(Y \mid X, Z) = \mathbb{E}(Y \mid X, Z, R=0)\,\mathbb{P}(R=0 \mid X, Z) + \mathbb{E}(Y \mid X, Z, R=1)\,\mathbb{P}(R=1 \mid X, Z)$$

and the data contains no information about the first quantity E(Y|X,Z,R=0) on the right-hand side, since Z is not observed when R = 0 (and similarly for regressions of A on X and Z). Thus in this section we detail three missingness mechanisms that can yield identification of the IV estimand ψ. Our logic largely follows that of seminal work by Rubin [1976], Robins et al. [1994, 1995], Robins and Rotnitzky [1995], van der Laan and Robins [2003], Tsiatis [2006], Little and Rubin [2014], and others; we refer to those works for more details.

Remark 1 (Connection to missing exposures problem). Before detailing missing-at-random assumptions and corresponding identification results, however, we first point out a connection between the missing instruments problem and a problem involving missing unconfounded exposures. Note that viewing Z as an unconfounded exposure, the parameter ψ in 1. involves quantities such as E{E(A|X,Z=z)} and E{E(Y|X,Z=z)} that are mathematically equivalent to average effects of Z on treatment and outcome variables A and Y, respectively. Therefore the missing IV problem (in the formulation we consider) can be viewed as a missing unconfounded treatment problem, with multiple outcomes, as noted by Kennedy [2018]. For this reason, throughout the paper we will make some use of results from the relatively small but important literature on missing treatments [Molinari 2010, Williamson et al. 2012, Zhang et al. 2016].

4.1. Missing completely at random

The simplest and perhaps strongest identifying assumption that could be used is the missing completely at random (MCAR) assumption given by

R ⫫ (X, Z, A, Y).    (MCAR)

This assumption would only hold if the missingness in Z were completely random, akin to a coin flip not depending on any measured factors or on the underlying value of Z itself. Note MCAR is a statement about independence between the missingness indicator R and the true IV values Z (together with the other data), which exist regardless of whether they are observed; it is not a statement about independence between R and the “observed” values RZ, which will of course be correlated. MCAR would seem unlikely to hold in practice, except in unusual cases (e.g., in the WLS data, if some completely random portion of saliva samples were tainted or lost). The MCAR assumption is partly testable since it implies that R ⫫ (X, A, Y), which can be assessed with the observed data, e.g., by regressing R on (X, A, Y) and looking for non-null associations (see the sketch at the end of this subsection). Importantly, since MCAR implies

R ⫫ Y | X, Z,   R ⫫ A | X, Z,   and   R ⫫ X

it immediately follows that the IV estimand is identified from ℙ as

$$\psi_{mc} = \frac{\mathbb{E}\{\mathbb{E}(Y \mid X, Z=1, R=1) - \mathbb{E}(Y \mid X, Z=0, R=1) \mid R=1\}}{\mathbb{E}\{\mathbb{E}(A \mid X, Z=1, R=1) - \mathbb{E}(A \mid X, Z=0, R=1) \mid R=1\}} \tag{2}$$

which is just a complete-case version of the estimand ψ from 1., computed in the subset of the population with R = 1, i.e., among just those for whom the instrument Z is observed. This indicates that estimation is straightforward under MCAR, since those subjects with R = 0 can simply be excluded and standard estimation methods can be applied. However, such complete-case estimators will not be efficient in general, as will be discussed in subsequent sections. Kennedy and Small [2017] considered efficiency theory for estimation with missing IVs under a MCAR assumption, but only in a setting with one-sided noncompliance and no covariates.
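
For concreteness, the partial check of MCAR just described can be carried out with a single logistic regression of R on (X, A, Y). The sketch below is illustrative only: the column names and the mcar_check helper are hypothetical and not part of the WLS analysis.

```python
# Rough sketch of the partial MCAR check described above: under MCAR, the
# missingness indicator R should show no association with (X, A, Y).
# Column names (age, educ, iq, A, Y, R) are hypothetical placeholders.
import statsmodels.api as sm

def mcar_check(df):
    """Regress R on (X, A, Y); non-null coefficients are evidence against MCAR."""
    design = sm.add_constant(df[["age", "educ", "iq", "A", "Y"]])
    fit = sm.Logit(df["R"], design).fit(disp=0)
    return fit.summary()  # inspect the Wald tests for each coefficient
```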

4.2. Missing at random given covariates

A second identifying assumption that weakens MCAR to allow IV missingness to depend on measured covariates, rather than being completely random, is given by

R ⫫ (Z, A, Y) | X.    (MAR-X)

Note MAR-X is a restricted missing at random (MAR) assumption, which like MCAR says missingness is (conditionally) independent of the post-instrument treatment and outcome variables, in addition to the underlying instrument values. Thus MAR-X can be viewed as removing the assumption that R ⫫ X from MCAR. Like the MCAR assumption, MAR-X is also partly testable with observed data since it implies R ⫫ (A, Y) | X; this could be assessed for example by regressing the missingness indicator R on both X and (A, Y), and testing dependence on the latter. This assumption might be reasonable if the missingness occurs temporally prior to the treatment and outcome variables being measured, though this condition alone would not be enough to ensure MAR-X holds.

Since MAR-X implies

R ⫫ Y | X, Z   and   R ⫫ A | X, Z

it follows that the IV estimand is identified under this assumption as

$$\psi_{mx} = \frac{\mathbb{E}\{\mathbb{E}(Y \mid X, Z=1, R=1) - \mathbb{E}(Y \mid X, Z=0, R=1)\}}{\mathbb{E}\{\mathbb{E}(A \mid X, Z=1, R=1) - \mathbb{E}(A \mid X, Z=0, R=1)\}} \tag{3}$$

Note the conditional quantities in the numerator and denominator are the same as those from the MCAR estimand ψ_mc in 2. However, these quantities must be averaged over the marginal distribution of the covariates X, rather than just the distribution among those with R = 1, since IV missingness R is no longer assumed to be independent of X.

4.3. Missing at random

The weakest missing-at-random-type identifying assumption would be

R ⫫ Z | X, A, Y.    (MAR)

This MAR assumption allows IV missingness R to depend on all of (X, A, Y), only requiring its conditional independence from the underlying IV values. That this assumption is strictly weaker than MCAR and MAR-X follows since MCAR implies MAR as well as the testable assumption R ⫫ (X, A, Y), while MAR-X implies MAR as well as the testable assumption R ⫫ (A, Y) | X. Thus, in terms of the assumptions made, MAR makes the weakest assumptions and we have the nesting MAR ⊃ MAR-X ⊃ MCAR. This also means that, in general, any estimators that are valid under MAR are also valid under MAR-X and MCAR.

The MAR assumption has been considered by a number of authors in the missing instruments or related problems. Chaudhuri and Guilkey [2016] proposed it in a missing instruments setting, where the parameter of interest is defined as a zero of a finite-dimensional moment condition. It was also considered by Kennedy and Small [2017] in a missing instrument setting with one-sided noncompliance and no covariates. Zhang et al. [2016] and Kennedy [2018] considered a similar MAR assumption in the missing unconfounded treatment problem. Kennedy [2018] showed that in the missing instruments case it leads to the identification result

$$\psi_{mar} = \frac{\mathbb{E}\!\left[\dfrac{\mathbb{E}\{Y\mathbb{E}(Z|X,A,Y,R=1)\,|\,X\}}{\mathbb{E}\{\mathbb{E}(Z|X,A,Y,R=1)\,|\,X\}} - \dfrac{\mathbb{E}\{Y\mathbb{E}(1-Z|X,A,Y,R=1)\,|\,X\}}{\mathbb{E}\{\mathbb{E}(1-Z|X,A,Y,R=1)\,|\,X\}}\right]}{\mathbb{E}\!\left[\dfrac{\mathbb{E}\{A\mathbb{E}(Z|X,A,Y,R=1)\,|\,X\}}{\mathbb{E}\{\mathbb{E}(Z|X,A,Y,R=1)\,|\,X\}} - \dfrac{\mathbb{E}\{A\mathbb{E}(1-Z|X,A,Y,R=1)\,|\,X\}}{\mathbb{E}\{\mathbb{E}(1-Z|X,A,Y,R=1)\,|\,X\}}\right]} \tag{4}$$

To gain some intuition about why this result holds, consider the first conditional fraction in the numerator. Since R ⫫ Z | (X, A, Y) under MAR, this term equals

$$\frac{\mathbb{E}\{\mathbb{E}(YZ \mid X, A, Y) \mid X\}}{\mathbb{E}\{\mathbb{E}(Z \mid X, A, Y) \mid X\}} = \frac{\mathbb{E}(YZ \mid X)}{\mathbb{E}(Z \mid X)} = \mathbb{E}(Y \mid X, Z=1)$$

using iterated expectation, and the same logic holds for the other three conditional fraction terms.

Remark 2. Zhang et al. [2016] gave an identification result related to 4. in the missing unconfounded treatment problem, which is equal to the numerator term, but presented in a different weighting form. We present an analog of this result in Section 6.3. Zhang et al. [2016] used models based on factorizing the joint distribution of O = (X, R, RZ, A, Y) as

$$p(R, Z, A, Y \mid X) = p(Z \mid X)\, p(A, Y \mid X, Z)\, p(R \mid X, A, Y) \tag{5}$$

which mimics the causal inference-based factorization of the likelihood, if Z was fully observed. Parametrizations based on this factorization are not variation-independent, since the instrument propensity score

$$\mathbb{E}(Z \mid X) = \mathbb{E}\{\mathbb{E}(Z \mid X, A, Y, R=1) \mid X\} = \mathbb{E}\left\{\frac{\mathbb{E}(RZ \mid X, A, Y)}{\mathbb{E}(R \mid X, A, Y)} \,\middle|\, X\right\}$$

depends on the distribution of R given (X, A, Y), as well as the distribution of (A, Y) given X. This means that models for the individual components can conflict and/or restrict each other. In contrast the form of the estimand ψmar in 4. is based on the variation-independent factorization

$$p(R, RZ, A, Y \mid X) = p(A, Y \mid X)\, p(R \mid X, A, Y)\, p(Z \mid X, A, Y)^{R}$$

for which the individual components can be set to arbitrary distributions, without conflict.

4.4. Other mechanisms

Beyond the above missing-at-random-type assumptions, one could also consider a sensitivity analysis approach, in the same spirit as Daniels and Hogan [2008], Robins et al. [2000], Scharfstein et al. [1999]. For example one could specify a certain deviation away from MAR, e.g., depending on a known scalar parameter, and vary the values of this parameter to assess how the corresponding inference for ψ changes. Alternatively one could instead construct and estimate bounds, e.g., under an assumption that |E(Z | X, A, Y, R=1) − E(Z | X, A, Y, R=0)| ≤ δ. We focus on estimation and inference under missing-at-random-type mechanisms in this work, but other alternatives would also be useful for practice, especially since MAR-type assumptions are often not known to hold with certainty in cases where missingness does not occur by design.

5. Likelihood-based methods

A full likelihood-based approach (for a continuous outcome Y, binary exposure A, and binary instrument Z) typically relies on the following structural equation model,

$$Y = \beta\, I\{A^* > 0\} + \eta_y' X + \epsilon_y, \qquad A^* = z\delta + \eta_a' X + \epsilon_a, \qquad Z \mid X \sim \mathrm{Ber}\{\theta_z(x)\} \tag{6}$$

where Φ⁻¹(θ_z(x)) = η_z′X, X is the vector of covariates (and includes an intercept), A = I{A* > 0}, and ϵ = (ϵ_y, ϵ_a) ∼ F_ϵ with E[ϵ] = 0. Let θ_a = (δ, η_a) and θ_y = (β, η_y).

Note that, assuming the above model is correctly specified, β = ψ from 1.

Fϵ is typically assumed to be a bivariate normal distribution with covariance matrix,

$$\Sigma = \begin{pmatrix} \sigma_y^2 & \rho\sigma_y \\ \rho\sigma_y & 1 \end{pmatrix}.$$

The variance of ϵ_a is fixed at one for identifiability. In what follows, define θ = (θ_a, θ_y, θ_z, σ_y², ρ). We discuss more robust and flexible formulations for a likelihood-based approach in Section 10.

5.1. Ignorable missingness

In the context of a likelihood-based analysis, it is important to introduce the concept of ignorable missingness. This is defined in the context of a probability model for the joint distribution of (Y, A, Z, R) given X. Assume a model for (Y, A, Z | X), p(y, a, z | x; θ), parameterized by θ, and a model for (R | Y, A, Z, X), p(r | y, a, z, x; ξ), parameterized by ξ. Then missingness in the IV is ignorable if

  1. the missingness is MAR as defined in Section 4.3 (of course, the mechanisms in sections 4.1 and 4.2 are special cases)

  2. θ and ξ are distinct.

Note that for Bayesian inference, one additionally needs a priori independence, i.e., p(θ, ξ) = p(θ)p(ξ). A major advantage of ignorability is that the explicit form of the missing data mechanism is not needed for inference. However, the full probability model in (6) needs to be correctly specified. Full likelihood inference often results in reduced uncertainty, as a result of the extra modelling assumptions.

As pointed out in Section 4, under MCAR and MAR-X, an analysis using only completers will result in valid inferences about θ (and β in particular). However, as pointed out earlier, this will be inefficient.

5.2. Observed data likelihood

To write down the observed data likelihood, we integrate out A* and subsequently derive the forms of p(y | a, z, x) and p(a | z, x). The two conditional distributions have the following forms:

$$p(y \mid a, z, x) = \frac{\exp\!\left\{-\frac{1}{2\sigma_y^2}\bigl(y - [\beta a + \mu_y]\bigr)^2\right\}}{\sqrt{2\pi}\,\sigma_y}\;\frac{\Phi\!\left(-\frac{\frac{\rho}{\sigma_y}(y-\mu_y)+\mu_a}{(1-\rho^2)^{1/2}}\right)^{1-a}\left\{1-\Phi\!\left(-\frac{\frac{\rho}{\sigma_y}(y-[\beta+\mu_y])+\mu_a}{(1-\rho^2)^{1/2}}\right)\right\}^{a}}{\Phi(-\mu_a)^{1-a}\,\{1-\Phi(-\mu_a)\}^{a}}$$

where μ_a = zδ + η_a′X and μ_y = η_y′X, and

$$p(a \mid z, x) = \mathrm{Ber}\{\Phi(z\delta + \eta_a' x)\}.$$

The second expression is just a probit model for the exposure.

The contribution of the ith unit to the likelihood (suppressing the i index for clarity) takes the following form, depending on the missingness pattern:

$$\bigl\{p(y \mid z, a, x; \theta_y)\, p(a \mid z, x; \theta_a)\, p(z \mid \theta_z(x))\bigr\}^{r}\left\{\sum_{z=0}^{1} p(y \mid z, a, x; \theta_y)\, p(a \mid z, x; \theta_a)\, p(z \mid \theta_z(x))\right\}^{1-r}$$

An easy way to maximize the likelihood is to use a quasi-Newton algorithm (the form of the gradient can be found in the web appendix). Under standard regularity conditions, correct specification of the model in (6), and ignorable missingness, the MLE of θ is consistent and asymptotically normal (CAN), with asymptotic covariance matrix given by the inverse of the expected information matrix. In the application in Section 9, we use the nonparametric bootstrap to compute estimates of uncertainty.
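
The following sketch illustrates this maximization for the model in (6), assuming a binary instrument coded as NaN when missing and a probit link for θ_z(x); the parametrization, helper names, and use of numerical gradients are our own illustrative assumptions rather than the paper's implementation.

```python
# A rough sketch of maximizing the observed-data likelihood for model (6)
# with a quasi-Newton routine.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm

def unpack(theta, p):
    beta, delta = theta[0], theta[1]
    eta_y = theta[2:2 + p]
    eta_a = theta[2 + p:2 + 2 * p]
    eta_z = theta[2 + 2 * p:2 + 3 * p]
    sigma_y = np.exp(theta[-2])   # keep sigma_y > 0
    rho = np.tanh(theta[-1])      # keep |rho| < 1
    return beta, delta, eta_y, eta_a, eta_z, sigma_y, rho

def log_joint(y, a, z, X, beta, delta, eta_y, eta_a, eta_z, sigma_y, rho):
    """log p(y, a | z, x) + log p(z | x) under the bivariate-normal model (6)."""
    mu_y, mu_a = X @ eta_y, delta * z + X @ eta_a
    resid = (y - beta * a - mu_y) / sigma_y
    lin = (rho * resid + mu_a) / np.sqrt(1.0 - rho ** 2)
    log_p_ya = norm.logpdf(resid) - np.log(sigma_y) + \
        np.where(a == 1, norm.logcdf(lin), norm.logcdf(-lin))
    log_p_z = np.where(z == 1, norm.logcdf(X @ eta_z), norm.logcdf(-(X @ eta_z)))
    return log_p_ya + log_p_z

def neg_loglik(theta, y, a, z, r, X):
    pars = unpack(theta, X.shape[1])
    ll_obs = log_joint(y, a, np.where(r == 1, z, 0.0), X, *pars)     # observed-Z rows
    ll_mis = logsumexp(np.stack([log_joint(y, a, np.zeros_like(y), X, *pars),
                                 log_joint(y, a, np.ones_like(y), X, *pars)]), axis=0)
    return -np.sum(np.where(r == 1, ll_obs, ll_mis))

def fit_mle(y, a, z, r, X):
    """Quasi-Newton (BFGS) maximization; X should include an intercept column."""
    theta0 = np.zeros(2 + 3 * X.shape[1] + 2)
    return minimize(neg_loglik, theta0, args=(y, a, z, r, X), method="BFGS")
```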

Note that for the case of a continuous exposure, the probit model for A is replaced by a normal linear regression and the covariate in the Y model would be A as opposed to I {A* > 0}.

For this setting, maximization of the observed data likelihood is much simpler (in terms of forms of derivatives) and an EM algorithm could be used easily for missing Z. We also note the model in (6) can easily be modified for binary outcomes (using a probit formulation), and continuous instruments.

5.3. Bayesian inference

Under a likelihood framework, an alternative to frequentist (ML) inference is Bayesian inference. A Bayesian approach allows exact (non-large-sample) inference through the posterior distribution and easily allows incorporation of external or historical information via informative priors; the latter could be very helpful in the case of weak instruments. It also allows for incorporating uncertainty regarding assumptions about the instrument that cannot be checked from the data (see for example, Kraay [2012]) or regarding the missingness [Daniels and Hogan 2008]. In the context of genetic marker instruments (i.e., Mendelian randomization), as we explore here, McKeigue et al. [2010] use a Bayesian approach under a standard parametric likelihood with a continuous exposure and a binary outcome under ignorable missingness.

In the setting here of ignorable missingness, one additional condition is required: a priori independence of θ and ξ.

The posterior distribution of the parameters in 6. is not available in closed form. However, the posterior can be sampled from using a relatively straightforward Markov chain Monte Carlo (MCMC) algorithm. In the absence of prior information, diffuse priors can be specified for the parameters. In particular, we consider here diffuse normal priors on regression parameters and relatively diffuse gamma priors on the inverses of the variances. In terms of prior specification, we propose a reparameterization of the model, somewhat similar to that proposed in Lopes and Polson [2014] but here in the case of a binary exposure. In particular, we can derive the conditional distribution of Y | A*, Z, X, which is not the first expression in 6. due to the correlation of the error terms. This conditional distribution is normal with mean given by η_y′x_i + β I{A_i* > 0} + ρσ_y(A_i* − x_i′η_a − δz_i) and variance given by σ_y²(1 − ρ²). Setting α = ρσ_y and τ² = σ_y²(1 − ρ²), the distribution looks like a standard linear regression. We use a diffuse normal prior on the regression coefficient α and a diffuse prior on the variance τ². Data augmentation can be used both to avoid explicitly integrating out A* and to impute missing Z’s. Details on the MCMC algorithm can be found in the supplementary materials; we note the distribution to be sampled in each step of the algorithm is available in closed form and is easy to sample.
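
As a small illustration of the data-augmentation idea, the sketch below draws each missing Z_i from its full conditional given current parameter values; it reuses the hypothetical log_joint helper from the maximum-likelihood sketch above and omits the remaining parameter-update steps, which are detailed in the supplementary materials.

```python
# Rough sketch of one data-augmentation step for missing Z under ignorability:
# P(Z_i = 1 | y_i, a_i, x_i, theta) is proportional to p(y_i, a_i | z = 1, x_i)
# times theta_z(x_i).  Reuses the (hypothetical) log_joint helper defined in
# the maximum-likelihood sketch; the remaining MCMC updates are omitted.
import numpy as np

def impute_missing_z(rng, y, a, z, r, X, pars):
    lp1 = log_joint(y, a, np.ones_like(y), X, *pars)    # z = 1 branch
    lp0 = log_joint(y, a, np.zeros_like(y), X, *pars)   # z = 0 branch
    p1 = 1.0 / (1.0 + np.exp(lp0 - lp1))                # full-conditional P(Z_i = 1)
    return np.where(r == 1, z, rng.binomial(1, p1))     # keep observed Z as-is

# usage: rng = np.random.default_rng(0); z_imp = impute_missing_z(rng, y, a, z, r, X, pars)
```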

5.4. Discussion

The model presented in this section is completely parametric including the choice of (bivariate) normality for the residual distribution and linear additive effects for the covariates. There have been a variety of more flexible specifications proposed in the literature. Chib and Hamilton [2002] use a Dirichlet process scale mixture of normals for Fϵ. Chib and Greenberg [2007] include exogenous covariates additively using splines (i.e., GAM) but with a multivariate normal error. In particular, they replace ηax and ηyx with ha(x) and hy(x) respectively, where h.(·) are specified as splines. Conley et al. [2008] propose a Bayesian nonparametric alternative to the multivariate normal errors; in particular, a Dirichlet process mixture of bivariate normals.

To assess sensitivity to the MAR assumption (as defined in a likelihood-based context in Daniels and Hogan [2008]), we need to consider models for the joint distribution of the full data, p(y, a, z, r | x), and not specify them via factorizations as in (5). We are currently working on these extensions.

6. Weighting methods

In this section we describe inverse-probability-weighting (IPW) methods for estimating the IV parameter ψ from 1.. We consider estimation and inference under the MCAR, MAR-X, and MAR assumptions described in Section 4.

6.1. MCAR

IPW estimators under MCAR can be constructed based on the identifying expression

$$\psi_{ipw,mc} = \frac{\mathbb{E}\left\{\frac{ZY}{\mathbb{E}(Z|X,R=1)} - \frac{(1-Z)Y}{1-\mathbb{E}(Z|X,R=1)} \,\middle|\, R=1\right\}}{\mathbb{E}\left\{\frac{ZA}{\mathbb{E}(Z|X,R=1)} - \frac{(1-Z)A}{1-\mathbb{E}(Z|X,R=1)} \,\middle|\, R=1\right\}} \tag{7}$$

This expression follows since MCAR implies R ⫫ Z | X, so that the instrument propensity score equals E(Z | X, R=1) = E(Z | X), and since R ⫫ (X, Z, A, Y) so that

$$\mathbb{E}\left\{\frac{ZY}{\mathbb{E}(Z \mid X, R=1)} \,\middle|\, R=1\right\} = \mathbb{E}\left\{\frac{ZY}{\mathbb{E}(Z \mid X)}\right\} = \mathbb{E}\{\mathbb{E}(Y \mid X, Z=1)\}$$

The same logic applies to the other terms in expression 7.

A natural estimator based on 7. is given by

$$\hat\psi_{ipw,mc} = \frac{\mathbb{P}_n\left\{\frac{ZY}{\hat{\mathbb{E}}(Z|X,R=1)} - \frac{(1-Z)Y}{1-\hat{\mathbb{E}}(Z|X,R=1)} \,\middle|\, R=1\right\}}{\mathbb{P}_n\left\{\frac{ZA}{\hat{\mathbb{E}}(Z|X,R=1)} - \frac{(1-Z)A}{1-\hat{\mathbb{E}}(Z|X,R=1)} \,\middle|\, R=1\right\}} \tag{8}$$

where we use the shorthand

$$\mathbb{P}_n\{f(O) \mid V = v\} = \frac{\sum_{i=1}^n f(O_i)\,\mathbb{1}(V_i = v)}{\sum_{j=1}^n \mathbb{1}(V_j = v)}$$

to denote sample averages among those units with V = v, so that for example the unconditional expression ℙ_n{f(O)} = n⁻¹ Σ_{i=1}^n f(O_i) denotes a usual sample average, and Ê(Z | X, R=1) is an estimate of the instrument propensity score (e.g., based on regressing Z on X among those with Z observed). The above estimator 8. is simply the usual IPW estimator of the parameter ψ in 1., excluding at the outset any subjects with R = 0 for whom the instrument Z was not observed.
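
A minimal sketch of the complete-case IPW estimator 8. is given below, with a logistic working model for the instrument propensity score; the array names and the ipw_mcar helper are illustrative assumptions, not the paper's code.

```python
# Complete-case IPW estimator (8) under MCAR, as a minimal sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_mcar(y, a, z, r, X):
    cc = (r == 1)                                       # complete cases only
    y_c, a_c, z_c, X_c = y[cc], a[cc], z[cc], X[cc]
    pi = LogisticRegression(max_iter=1000).fit(X_c, z_c).predict_proba(X_c)[:, 1]
    w = z_c / pi - (1 - z_c) / (1 - pi)                 # signed IPW weights
    return np.mean(w * y_c) / np.mean(w * a_c)
```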

It can be shown using a standard estimating equation / M-estimator analysis [van der Vaart 2000, 2002] that the IPW estimator ψ̂_ipw,mc in 8. will be √n-consistent and asymptotically normal (√n-CAN) whenever the instrument propensity score estimator Ê(Z | X, R=1) is, e.g., if it is estimated with a correctly specified smooth parametric model. In this case inference (e.g., confidence intervals and hypothesis tests) could be obtained using the bootstrap. If Ê(Z | X, R=1) is estimated nonparametrically, e.g., using kernel smoothing, nearest-neighbor or other flexible methods, it may not be asymptotically normal in general, let alone √n-consistent. However, √n-CAN behavior may be attained if the propensity score is sufficiently smooth (e.g., has enough derivatives, in the sense that it is contained in a Hölder class of high enough order), and is estimated with particular methods (e.g., splines) that incorporate undersmoothing, as in for example Hirano et al. [2003].

An advantage of the IPW estimator ψ̂_ipw,mc is that it is easy to implement; it only requires modeling the instrument propensity score, and then can be computed by simple weighted averages of (Z, A, Y) among those with Z observed. A disadvantage is that it will generally only be √n-CAN under correct parametric model specification, and even then it is not typically optimally efficient. Beyond these estimation issues, of course, it also relies on the strong MCAR identifying assumption, which may be unlikely to hold in practice unless the missingness is by design.

6.2. MAR-X

IPW estimators under MAR-X can be constructed based on the identifying expression

$$\psi_{ipw,mx} = \frac{\mathbb{E}\left\{\frac{RZY}{\mathbb{E}(RZ|X)} - \frac{R(1-Z)Y}{\mathbb{E}(R(1-Z)|X)}\right\}}{\mathbb{E}\left\{\frac{RZA}{\mathbb{E}(RZ|X)} - \frac{R(1-Z)A}{\mathbb{E}(R(1-Z)|X)}\right\}} \tag{9}$$

which follows since

$$\mathbb{E}\left\{\frac{RZY}{\mathbb{E}(RZ \mid X)}\right\} = \mathbb{E}\{\mathbb{E}(Y \mid X, R=Z=1)\}$$

and this equals E{E(Y | X, Z=1)} under MAR-X by virtue of the fact that R ⫫ Y | X, Z. The same logic equally applies to the other terms in the expression.

A natural estimator based on 9. is given by

$$\hat\psi_{ipw,mx} = \frac{\mathbb{P}_n\left\{\frac{RZY}{\hat{\mathbb{E}}(RZ|X)} - \frac{R(1-Z)Y}{\hat{\mathbb{E}}(R(1-Z)|X)}\right\}}{\mathbb{P}_n\left\{\frac{RZA}{\hat{\mathbb{E}}(RZ|X)} - \frac{R(1-Z)A}{\hat{\mathbb{E}}(R(1-Z)|X)}\right\}} \tag{10}$$

which simply replaces the unknown missingness/instrument propensity scores E(RZ | X) and E(R(1−Z) | X) with estimates, and replaces population expectations with sample averages. Note that E(RZ | X) can be easily estimated by regressing RZ on X, and similarly for E(R(1−Z) | X).

The above IPW estimator ψ̂_ipw,mx in 10. for the MAR-X setting has essentially the same statistical properties and advantages/disadvantages as the IPW estimator ψ̂_ipw,mc in 8. described in the previous MCAR subsection, with the exception of course that it requires the weaker MAR-X assumption rather than the stronger MCAR assumption. Nonetheless it will be √n-CAN whenever its propensity score estimators are, and otherwise would require careful nonparametric estimation with undersmoothing and potentially strong smoothness assumptions. Despite having similar statistical properties and requiring weaker identifying assumptions, it is also essentially as easy to implement as the MCAR IPW estimator, though it does require estimating two propensity scores, E(RZ | X) and E(R(1−Z) | X), instead of just the one, E(Z | X, R=1), required under MCAR.
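
A corresponding sketch for the MAR-X IPW estimator 10. follows; the two composite propensity scores are estimated by logistic regressions of RZ and R(1 − Z) on X over the full sample, and all names are illustrative assumptions.

```python
# MAR-X IPW estimator (10), as a minimal sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_marx(y, a, z, r, X):
    rz = r * np.nan_to_num(z)        # RZ (missing Z is irrelevant when R = 0)
    rz0 = r * (1 - np.nan_to_num(z)) # R(1 - Z)
    p1 = LogisticRegression(max_iter=1000).fit(X, rz).predict_proba(X)[:, 1]
    p0 = LogisticRegression(max_iter=1000).fit(X, rz0).predict_proba(X)[:, 1]
    num = np.mean(rz * y / p1 - rz0 * y / p0)
    den = np.mean(rz * a / p1 - rz0 * a / p0)
    return num / den
```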

6.3. MAR

IPW estimation under MAR is motivated by the identifying expression

$$\psi_{mar,ipw} = \frac{\mathbb{E}\left\{\frac{RZY}{\mathbb{E}(R|X,A,Y)\,\mathbb{E}(Z|X)} - \frac{R(1-Z)Y}{\mathbb{E}(R|X,A,Y)\,\mathbb{E}(1-Z|X)}\right\}}{\mathbb{E}\left\{\frac{RZA}{\mathbb{E}(R|X,A,Y)\,\mathbb{E}(Z|X)} - \frac{R(1-Z)A}{\mathbb{E}(R|X,A,Y)\,\mathbb{E}(1-Z|X)}\right\}} \tag{11}$$

essentially identical to that given by Zhang et al. [2016]. This expression follows since

E(Z|X)=E{E(Z|X,A,Y,R=1)|X}

as described in Section 4.3 and since under MAR we have

$$\mathbb{E}(YZ \mid X) = \mathbb{E}\{Y\,\mathbb{E}(Z \mid X, A, Y, R=1) \mid X\} = \mathbb{E}\left\{\frac{RZY}{\mathbb{E}(R \mid X, A, Y)} \,\middle|\, X\right\}$$

so that the first term in the numerator equals

$$\mathbb{E}\left\{\frac{RZY}{\mathbb{E}(R \mid X, A, Y)\,\mathbb{E}(Z \mid X)}\right\} = \mathbb{E}\left\{\frac{\mathbb{E}(YZ \mid X)}{\mathbb{E}(Z \mid X)}\right\} = \mathbb{E}\{\mathbb{E}(Y \mid X, Z=1)\}$$

by iterated expectation. Again the logic is the same for the other terms.

The estimator based on ψmar,ipw in 11. is given by

$$\hat\psi_{mar,ipw} = \frac{\mathbb{P}_n\left\{\frac{RZY}{\hat{\mathbb{E}}(R|X,A,Y)\,\hat{\mathbb{E}}(Z|X)} - \frac{R(1-Z)Y}{\hat{\mathbb{E}}(R|X,A,Y)\,\hat{\mathbb{E}}(1-Z|X)}\right\}}{\mathbb{P}_n\left\{\frac{RZA}{\hat{\mathbb{E}}(R|X,A,Y)\,\hat{\mathbb{E}}(Z|X)} - \frac{R(1-Z)A}{\hat{\mathbb{E}}(R|X,A,Y)\,\hat{\mathbb{E}}(1-Z|X)}\right\}} \tag{12}$$

This estimator is slightly more burdensome to implement than the previous IPW estimators, though still somewhat straightforward using standard software. One nuance is that it requires a two-stage approach for estimation of the instrument propensity score E(Z | X), which is not identified with a simple regression of Z on X among those with observed instrument, since missingness can depend on post-instrument treatment and outcome. Zhang et al. [2016] discuss approaches for estimating this propensity score based on a parametric model assumption of the form E(Z | X) = g(X; β) for some known smooth function g and finite-dimensional parameter β ∈ ℝ^d, e.g., a logistic model g(x; β) = expit(β^T x). Specifically, they give maximum-likelihood, inverse-probability-weighted, and doubly robust approaches for jointly estimating the indexing parameters β, together with the parameters of a parametric model for the distribution of (A, Y) given Z and X.

Similarly, based on the fact that E(Z | X) = E{E(Z | X, A, Y, R=1) | X}, a regression-based approach for estimating the instrument propensity score would be to estimate the fully conditional instrument propensity score E(Z | X, A, Y, R=1) based on a regression of Z on (X, A, Y) among those with observed instruments, and then regress the predicted values from this regression under R = 1 on covariates X, i.e.,

E^(Z|X)=E^{E^(Z|X,A,Y,R=1)|X}.

Alternatively, one could use an inverse-probability-weighted approach, noting that we also have E(Z|X)=E{RZ/E(R|X,A,Y)|X}, suggesting the estimator

$$\hat{\mathbb{E}}(Z \mid X) = \hat{\mathbb{E}}\left\{\frac{RZ}{\hat{\mathbb{E}}(R \mid X, A, Y)} \,\middle|\, X\right\}$$

that takes estimates of the missingness propensity score E(R | X, A, Y) and then regresses the pseudo-outcome RZ/Ê(R | X, A, Y) on the covariates X. This approach may be particularly useful since the missingness propensity score E(R | X, A, Y) has to be estimated anyway to construct the other denominator weight in the IPW estimator ψ̂_mar,ipw in 12. Yet another method for estimating the instrument propensity score would be the doubly robust estimator

$$\hat{\mathbb{E}}(Z \mid X) = \hat{\mathbb{E}}\left[\frac{R\{Z - \hat{\mathbb{E}}(Z \mid X, A, Y, R=1)\}}{\hat{\mathbb{E}}(R \mid X, A, Y)} + \hat{\mathbb{E}}(Z \mid X, A, Y, R=1) \,\middle|\, X\right].$$

In the case where one uses parametric models for each regression, and correctly specifies the model for E(Z | X) in the outer regression, this approach would only require either of the nuisance estimators Ê(Z | X, A, Y, R=1) or Ê(R | X, A, Y) to be correctly specified (not necessarily both), in contrast to the previous approaches, which require the one nuisance regression they rely on to be correctly specified. In the nonparametric case, the above doubly robust estimator could still attain faster rates of convergence than the others, due to a smaller second-order bias arising from the doubly robust structure, similar to that found in other infinite-dimensional functional estimation problems such as Robins and Rotnitzky [2001], Rubin and van der Laan [2005], van der Laan [2013], and Kennedy et al. [2017].
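
To make the two-stage idea concrete, the sketch below implements the regression-based estimate of E(Z | X); the model choices and names are placeholders, and the IPW or doubly robust variants above would simply change the pseudo-outcome used in the second-stage regression.

```python
# Two-stage, regression-based estimate of the instrument propensity score
# E(Z | X) under MAR: first fit E(Z | X, A, Y, R = 1) among complete cases,
# then regress those fitted values on X alone.  Illustrative sketch only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def instrument_propensity_mar(z, a, y, r, X):
    cc = (r == 1)
    W = np.column_stack([X, a, y])                      # (X, A, Y) design
    lam = LogisticRegression(max_iter=1000).fit(W[cc], z[cc])
    lam_hat = lam.predict_proba(W)[:, 1]                # E(Z | X, A, Y, R = 1) for everyone
    outer = LinearRegression().fit(X, lam_hat)          # second-stage regression on X
    return np.clip(outer.predict(X), 1e-3, 1 - 1e-3)    # E-hat(Z | X), truncated away from 0/1
```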

The statistical properties of the IPW estimator ψ̂_mar,ipw in 12. are relatively straightforward in the parametric case, where smooth finite-dimensional models are used for the missingness propensity score E(R | X, A, Y) and instrument propensity score E(Z | X) (as well as corresponding secondary nuisance functions required for estimation of the latter, as described above). There, standard estimating equation techniques [van der Vaart 2000, 2002] will show that the estimator ψ̂_mar,ipw will be √n-CAN whenever the corresponding parametric models it relies on are correctly specified. In such cases, inference would be most straightforward using the bootstrap. However, the statistical properties are somewhat more subtle in the nonparametric case, due to the fact that the nuisance function E(Z | X) itself cannot be estimated with a single regression, and instead has to be estimated in a two-stage approach involving its own secondary nuisance functions (as described in the previous paragraph). For particular estimators (e.g., splines) using undersmoothing, and incorporating some strong smoothness assumptions, it may be possible for the estimator ψ̂_mar,ipw to be √n-CAN even when nuisance functions are estimated nonparametrically, extending the kind of logic found for example in Hirano et al. [2003]. However, one would not expect this √n-CAN-type behavior in cases where the nuisance functions are estimated with general nonparametric methods.

6.4. Summary

In terms of ease of implementation, the IPW estimator under MCAR is most straightforward, followed by the IPW estimators under MAR-X, and then MAR. However, this ordering is exactly reversed when comparing strength of identifying assumptions: MAR is most flexible and makes the weakest identifying assumptions, then MAR-X, and then MCAR (which seems unlikely to hold in practice, at least in general). Estimation and inference are relatively straightforward for all three estimators when nuisance functions (such as instrument and missingness propensity scores) are modeled parametrically, with the MCAR and MAR-X estimators most easily dealt with; in this setting the bootstrap is available for inference in all cases. None of the estimators is tailored to do particularly well in the nonparametric case, where nuisance functions are estimated flexibly. There, inference is somewhat complicated for all three estimators, with √n-consistency and asymptotic normality not attained in general, apart from perhaps some cases where nuisance estimators and their tuning parameters are carefully chosen in a way that is not optimal for estimating the nuisance functions themselves. At a high level, the behavior of the IPW estimators presented here will be similar to that of the regression estimators discussed in the next section. However, we will see that the approach discussed in Section 8 has some new advantages, particularly in the nonparametric case.

7. Regression methods

Here we describe regression-based methods for estimating the IV parameter ψ, which in contrast to the methods from Section 5, do not necessarily require full likelihood specification, and in contrast to the methods from Section 6, model outcome processes rather than instrument processes. As in the previous section we consider estimation and inference under the MCAR, MAR-X, and MAR assumptions from Section 4.

7.1. MCAR

Regression estimators under MCAR can be constructed based on the identified expression ψmc in 2., i.e., by replacing the unknown conditional expectations with estimates (and replacing the population covariate distribution with its empirical counterpart), leading to the complete-case regression estimator given by

$$\hat\psi_{reg,mc} = \frac{\mathbb{P}_n\{\hat{\mathbb{E}}(Y \mid X, Z=1, R=1) - \hat{\mathbb{E}}(Y \mid X, Z=0, R=1) \mid R=1\}}{\mathbb{P}_n\{\hat{\mathbb{E}}(A \mid X, Z=1, R=1) - \hat{\mathbb{E}}(A \mid X, Z=0, R=1) \mid R=1\}} \tag{13}$$

We use the same ℙn shorthand from the previous section, and write E^(Y|X,Z,R=1) and E^(A|X,Z,R=1) for estimates of the outcome and treatment regressions among those with observed instrument values. The estimator ψ^reg,mc is simply a usual regression estimator applied to subjects with non-missing instrument values.
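
A minimal sketch of this complete-case regression estimator is given below, using simple linear working models; the names and model choices are illustrative assumptions. For the MAR-X version in 14., the same fitted differences would be averaged over all subjects' covariates rather than only the complete cases.

```python
# Complete-case regression estimator (13) under MCAR, as a minimal sketch.
import numpy as np
from sklearn.linear_model import LinearRegression

def reg_mcar(y, a, z, r, X):
    cc = (r == 1)
    XZ = np.column_stack([X[cc], z[cc]])
    X1 = np.column_stack([X[cc], np.ones(cc.sum())])    # set Z = 1
    X0 = np.column_stack([X[cc], np.zeros(cc.sum())])   # set Z = 0
    fy = LinearRegression().fit(XZ, y[cc])
    fa = LinearRegression().fit(XZ, a[cc])
    num = np.mean(fy.predict(X1) - fy.predict(X0))
    den = np.mean(fa.predict(X1) - fa.predict(X0))
    return num / den
```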

As with the IPW estimator in 8., a standard estimating equation analysis shows that the regression estimator ψ̂_reg,mc will be √n-CAN whenever the outcome and treatment regression estimators are, e.g., when estimated with correctly specified smooth parametric models. The asymptotic variance can be obtained in closed form via Taylor expansion arguments, or else inference can be conducted via the bootstrap. When the regressions Ê(Y | X, Z, R=1) and Ê(A | X, Z, R=1) are estimated nonparametrically, e.g., with nearest-neighbor matching or splines, then √n-consistency will generally require particular estimators to be used with some undersmoothing, following similar arguments from Hahn [1998], Abadie and Imbens [2006], and others.

Advantages of the regression estimator ψ̂_reg,mc include that it is easy to implement, as it only involves fitting two regression models among those with observed instrument values; further, it will typically have smaller variance than the weighting-based estimators from the previous section. However, like the weighting estimators, it will often only be √n-CAN under correct parametric model specification, or else will require more careful nonparametric estimation than is needed for the influence-function-based estimators presented in Section 8. Further, it relies on the strong MCAR identifying assumption.

7.2. MAR-X

Regression estimators under MAR-X can be constructed based on the identifying expression given by ψmx in 3., with a sample analog given by

$$\hat\psi_{reg,mx} = \frac{\mathbb{P}_n\{\hat{\mathbb{E}}(Y \mid X, Z=1, R=1) - \hat{\mathbb{E}}(Y \mid X, Z=0, R=1)\}}{\mathbb{P}_n\{\hat{\mathbb{E}}(A \mid X, Z=1, R=1) - \hat{\mathbb{E}}(A \mid X, Z=0, R=1)\}} \tag{14}$$

This estimator requires no extra model fitting beyond that needed for the MCAR estimator ψ^mc in 13.; the only difference is that it averages the corresponding predictions across all subjects’ covariates, not just among those with R =1 who have non-missing instrument values. This is required since the missingness is no longer assumed to be independent of covariate information.

Apart from the differences in MCAR/MAR-X identifying assumptions, the above regression estimator ψ̂_reg,mx has similar statistical properties and advantages/disadvantages as the previous MCAR/MAR-X estimators ψ̂_ipw,mc, ψ̂_ipw,mx, and ψ̂_reg,mc. It will be √n-CAN if the regression estimators it depends on are, and otherwise would require some careful undersmoothing, particular estimators, and potentially strong smoothness or low-dimension assumptions. The MAR-X regression estimator is essentially as easy to implement as the MCAR estimator.

7.3. MAR

Regression-based estimation under MAR is motivated by the identifying expression ψ_mar in 4., yielding the estimator

$$\hat\psi_{mar,reg} = \frac{\mathbb{P}_n\!\left[\dfrac{\hat{\mathbb{E}}\{Y\hat{\mathbb{E}}(Z|X,A,Y,R=1)\,|\,X\}}{\hat{\mathbb{E}}\{\hat{\mathbb{E}}(Z|X,A,Y,R=1)\,|\,X\}} - \dfrac{\hat{\mathbb{E}}\{Y\hat{\mathbb{E}}(1-Z|X,A,Y,R=1)\,|\,X\}}{\hat{\mathbb{E}}\{\hat{\mathbb{E}}(1-Z|X,A,Y,R=1)\,|\,X\}}\right]}{\mathbb{P}_n\!\left[\dfrac{\hat{\mathbb{E}}\{A\hat{\mathbb{E}}(Z|X,A,Y,R=1)\,|\,X\}}{\hat{\mathbb{E}}\{\hat{\mathbb{E}}(Z|X,A,Y,R=1)\,|\,X\}} - \dfrac{\hat{\mathbb{E}}\{A\hat{\mathbb{E}}(1-Z|X,A,Y,R=1)\,|\,X\}}{\hat{\mathbb{E}}\{\hat{\mathbb{E}}(1-Z|X,A,Y,R=1)\,|\,X\}}\right]} = \frac{\mathbb{P}_n\!\left\{\dfrac{\hat{\mathbb{E}}(YZ|X)}{\hat{\mathbb{E}}(Z|X)} - \dfrac{\hat{\mathbb{E}}(Y(1-Z)|X)}{\hat{\mathbb{E}}(1-Z|X)}\right\}}{\mathbb{P}_n\!\left\{\dfrac{\hat{\mathbb{E}}(AZ|X)}{\hat{\mathbb{E}}(Z|X)} - \dfrac{\hat{\mathbb{E}}(A(1-Z)|X)}{\hat{\mathbb{E}}(1-Z|X)}\right\}} \tag{15}$$

As with the IPW estimator under MAR, this estimator requires somewhat more computation than those under MCAR/MAR-X, but can still be constructed relatively easily with standard software. It similarly presents several choices for two-stage approaches to estimating nuisance functions. The same discussion from Section 6.3 applies to estimation of the instrument propensity scores E(Z | X) = E{E(Z | X, A, Y, R=1) | X}, which are needed for the denominators in the conditional terms of the numerator and denominator of the estimator.

The issues are similar for estimation of the numerators in the conditional terms, E(YZ|X) and E(AZ|X). One could use the regression based estimator

E^(YZ|X)=E^{YE^(Z|X,A,Y,R=1)|X}

that regresses Z on (X, A, Y) among those with observed instruments, and then regresses the corresponding predicted values – multiplied by Y – on covariates. The same procedure could be used replacing Y with A, for the terms E(AZ|X) in the denominator. Alternatively, one could use a weighting-based approach

$$\hat{\mathbb{E}}(YZ \mid X) = \hat{\mathbb{E}}\left\{\frac{RZY}{\hat{\mathbb{E}}(R \mid X, A, Y)} \,\middle|\, X\right\}$$

that takes estimates of the missingness propensity score and regresses the pseudo-outcome RZY/Ê(R | X, A, Y) on covariates. It is somewhat aesthetically displeasing to use a regression-based estimator ψ̂_mar,reg for ψ while using inverse-probability-weighting to estimate the nuisance functions: since the nuisance estimation then requires consistent estimation of the missingness propensity score, this raises the issue of why one would not also use a consistent missingness propensity score estimator for estimation of the parameter ψ_mar of interest (and vice versa). Finally, similar to the doubly robust approach proposed in the previous section, one could instead use

$$\hat{\mathbb{E}}(YZ \mid X) = \hat{\mathbb{E}}\left[\frac{YR\{Z - \hat{\mathbb{E}}(Z \mid X, A, Y, R=1)\}}{\hat{\mathbb{E}}(R \mid X, A, Y)} + Y\hat{\mathbb{E}}(Z \mid X, A, Y, R=1) \,\middle|\, X\right].$$

The properties of this estimator are essentially exactly equivalent to those of the doubly robust nuisance estimator for E(Z|X) discussed in Section 6.3.

Overall, the statistical properties of the regression estimator ψ̂_mar,reg in 15. are similar to those of ψ̂_mar,ipw in 12.: relatively straightforward if smooth parametric models are used for all nuisance functions, so that standard estimating equation techniques [van der Vaart 2000, 2002] show that the estimator ψ̂_mar,reg is √n-CAN under correct model specification, and inference can be obtained via the bootstrap. The nonparametric case is not so straightforward, since the nuisance functions themselves have nuisance functions, and require two-stage estimation approaches. Nonetheless, following similar logic as Hahn [1998] and Abadie and Imbens [2006], it should be possible to show √n-consistency and asymptotic normality at least for particular undersmoothed nuisance estimators, under smoothness assumptions and with low-dimensional covariates. The semiparametric, influence-function-based approach described in the next section loosens the conditions required for √n-consistency and asymptotic normality in nonparametric models, and has been shown to attain the nonparametric efficiency bound under weak conditions.

8. Efficient doubly robust methods

In this section we describe efficient nonparametric methods for estimating the parameter ψ from 1., using ideas from semiparametric efficiency theory (e.g., influence functions) to correct bias that comes from nonparametric smoothing; the methods are doubly robust. We focus on the MAR case since, following Robins and Rotnitzky [1995], the nonparametric efficiency bounds under MAR and MCAR assumptions coincide.

Kennedy [2018] proposes the doubly-robust estimator

$$\hat\psi_{mar} = \frac{\mathbb{P}_n(\hat\phi_{Y,1} - \hat\phi_{Y,0})}{\mathbb{P}_n(\hat\phi_{A,1} - \hat\phi_{A,0})} \tag{16}$$

where for z ∈ {0, 1} and V ∈ {A, Y}, the quantity φ̂_{V,z} = φ_{V,z}(O; ℙ̂) is an estimated version of the (uncentered) influence function term φ_{V,z}(O; ℙ) = φ_{V,z} given by

$$\phi_{V,z} = \frac{V - \mathbb{E}(V\mathbb{1}_z \mid X)/\mathbb{E}(\mathbb{1}_z \mid X)}{\mathbb{E}(\mathbb{1}_z \mid X)}\left[\frac{R\{\mathbb{1}_z - \mathbb{E}(\mathbb{1}_z \mid X, A, Y, R=1)\}}{\mathbb{E}(R \mid X, A, Y)} + \mathbb{E}(\mathbb{1}_z \mid X, A, Y, R=1)\right] + \frac{\mathbb{E}(V\mathbb{1}_z \mid X)}{\mathbb{E}(\mathbb{1}_z \mid X)}$$

with 1_z = 1(Z = z) and

$$\mathbb{E}(V\mathbb{1}_z \mid X) = \mathbb{E}\{V\,\mathbb{E}(\mathbb{1}_z \mid X, A, Y, R=1) \mid X\}$$

for V ∈ {1, A, Y}. The estimator ψ̂_mar in 16. is a bias-corrected version of the plug-in regression-based estimator

$$\hat\psi_{mar,reg} = \frac{\mathbb{P}_n\left\{\frac{\hat{\mathbb{E}}(YZ|X)}{\hat{\mathbb{E}}(Z|X)} - \frac{\hat{\mathbb{E}}(Y(1-Z)|X)}{\hat{\mathbb{E}}(1-Z|X)}\right\}}{\mathbb{P}_n\left\{\frac{\hat{\mathbb{E}}(AZ|X)}{\hat{\mathbb{E}}(Z|X)} - \frac{\hat{\mathbb{E}}(A(1-Z)|X)}{\hat{\mathbb{E}}(1-Z|X)}\right\}}$$

given in 15., where the bias correction is exactly the sample average of the first inverse-probability-weighted residual term in the efficient influence function term φ_{V,z} above. The form of the bias correction comes from the influence function for the parameter ψ_mar in 4., which can be viewed as representing the derivative term in a distributional Taylor-type expansion (or von Mises expansion) of the functional ψ_mar = ψ_mar(ℙ) in 4. As will be discussed in more detail shortly, such bias correction yields second-order error terms that allow for fast √n rates of convergence and asymptotic normality under general and weak nonparametric conditions. In other words, √n-CAN behavior is possible even when the nuisance functions (e.g., instrument and missingness propensity scores) are estimated flexibly at slower rates, for example using modern regression and machine learning methods. The efficient influence function is crucial not only for constructing bias-corrected estimators like that above, but also because its variance gives an efficiency bound, providing a benchmark for efficient estimation in nonparametric models. In particular, as described in detail by Bickel et al. [1993], van der Vaart [2002], Tsiatis [2006], and others, the variance of the efficient influence function acts as a local asymptotic minimax lower bound, indicating that the asymptotic variance of any regular asymptotically linear estimator must be at least as large as the variance of the efficient influence function.
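
To illustrate how the pieces fit together, the following sketch plugs simple (logistic/linear) nuisance estimates into the influence-function terms φ_{V,z} and forms the ratio in 16.; the model choices, helper names, and the absence of sample splitting are all simplifying assumptions rather than the paper's implementation.

```python
# Rough sketch of the doubly robust estimator (16) with simple working models.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_mar(y, a, z, r, X, eps=1e-3):
    cc = (r == 1)
    W = np.column_stack([X, a, y])                                  # (X, A, Y) design
    pi_r = np.clip(LogisticRegression(max_iter=1000).fit(W, r)
                   .predict_proba(W)[:, 1], eps, 1 - eps)           # E(R | X, A, Y)
    lam1 = np.clip(LogisticRegression(max_iter=1000).fit(W[cc], z[cc])
                   .predict_proba(W)[:, 1], eps, 1 - eps)           # E(1{Z=1} | X, A, Y, R=1)

    def phi(v, zval):
        lam = lam1 if zval == 1 else 1.0 - lam1                     # E(1_z | X, A, Y, R=1)
        ind = np.where(cc, (np.nan_to_num(z) == zval).astype(float), 0.0)      # 1_z when observed
        den = np.clip(LinearRegression().fit(X, lam).predict(X), eps, 1 - eps) # E(1_z | X)
        num = LinearRegression().fit(X, v * lam).predict(X)         # E(V 1_z | X)
        m = num / den                                               # E(V | X, Z = z)
        aug = r * (ind - lam) / pi_r + lam                          # doubly robust version of 1_z
        return (v - m) / den * aug + m

    numer = np.mean(phi(y, 1) - phi(y, 0))
    denom = np.mean(phi(a, 1) - phi(a, 0))
    return numer / denom
```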

Kennedy [2018] derived the efficient influence function for ψ_mar from 4., and gave nonparametric conditions under which the estimator ψ̂_mar in 16. is √n-CAN with asymptotic variance equaling that of the efficient influence function. This result indicates that, under the given conditions, the estimator ψ̂_mar is optimally efficient in the sense described above. A closed-form expression for the asymptotic variance is also given, which permits the construction of straightforward Wald-type confidence intervals and hypothesis tests. The conditions are described in detail in Kennedy [2018], with efficiency essentially requiring that the nuisance functions

{E(R|X,A,Y),E(Z|X,A,Y,R=1),E(Z|X),E(AZ|X),E(YZ|X)}

be consistently estimated at O(n^{−1/4}) rates of convergence (in L₂(ℙ) norm). However, this is a sufficient and not necessary condition; slower rates on some nuisance estimators can in some cases be traded off against faster rates on others. We refer to Kennedy [2018] for a precise statement. Further, √n-consistency may be possible even when not attainable by ψ̂_mar, for example by using estimators with extra bias correction coming from higher-order influence functions [Robins et al. 2008, 2017]. Nonetheless, O(n^{−1/4}) rates can be achieved under relatively weak nonparametric conditions on the nuisance functions (e.g., sparsity, smoothness, or other structure, such as additive model or bounded variation conditions). Further, since the conditions are stated in a method-agnostic way, only involving rates for the nuisance estimators, this allows the behavior of the estimator ψ̂_mar to be easily described even for general nuisance estimators. Importantly, these conditions also do not require any impractical undersmoothing (unlike the regression and IPW-based estimators), so that one can often optimally tune the nuisance estimators (e.g., via cross-validation) and still achieve efficiency when it comes to estimation of ψ_mar.

Finally, Kennedy [2018] shows that the estimator ψ^mar is doubly robust, in the sense that it is consistent if either the instrument propensity score (given covariates) and missingness propensity score

{E(Z|X),E(R|X,A,Y)}

are consistently estimated, or if the restricted and full instrument propensity scores

{E(Z|X),E(Z|X,A,Y,R=1)}

given covariates and given both covariates and outcomes, are consistently estimated. Although double robustness is often presented as offering two chances at a consistent estimator, to us the main benefit is that it can give the estimator parametric-type √n-CAN behavior even when the nuisance functions are estimated flexibly and nonparametrically, at slower than √n rates.

In terms of implementation, ψ^mar in 16. is roughly as burdensome to compute as the other estimators for the MAR setting, i.e., the regression estimator ψ^mar,reg in 15. and the IPW estimator ψ^mar,ipw in 12.. The efficient estimator ψ^mar requires estimating the instrument propensity score E(Z|X), as well as the similarly estimable quantities E(ZY|X) and E(AZ|X), for which the same discussion as given in Sections 6.3 and 7.3 applies.

9. Application

Here we return to the Wisconsin Longitudinal Study (WLS) example discussed earlier in Section 2. Recall the goal is to estimate effects of cognitive functioning (measured via a dichotomized score on a word recall task) on depression (measured via the CES-D depression score). We use the APOE ϵ4 genotype as an instrument, and adjust for covariates that measure age, education, and IQ. Importantly, APOE ϵ4 information is missing for 29% of the population, typically due to subjects not submitting saliva samples, but sometimes because of ambiguity in output from the genotyping platform. Full details are given in Section 2.

We analyzed the WLS data using the approaches outlined in previous sections. We implemented the likelihood-based MLE and Bayes estimators from Section 5, together with estimators that operate under each of the MCAR, MAR-X, and MAR assumptions discussed in Section 4. In particular, for the MCAR case we constructed four estimators: an unadjusted Wald estimator, two-stage least squares (which can be viewed as a regression-based estimator), an inverse-probability-weighted estimator, and a doubly robust estimator. For the MAR-X and MAR cases we constructed doubly robust estimators, i.e., the estimator 16. in the latter case.

For the MLE and Bayes estimators we used the parametric model assumptions described in Section 5. For the two-stage least-squares estimator we used the classic linear model setup. For the complete-case IPW estimator we used logistic regression to model the IV propensity score. For the doubly robust estimators we used generalized additive models to estimate all nuisance functions. Standard errors and confidence intervals were generated via the nonparametric bootstrap, except for the Bayes estimator, for which interval estimates were obtained from MCMC draws from the posterior. All results are given in Table 2.
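As a concrete illustration of the uncertainty quantification, the following sketch shows a generic nonparametric bootstrap for a standard error and percentile confidence interval. The estimator function, the data-frame structure, and the number of resamples are illustrative assumptions rather than the exact choices used in our analysis.

import numpy as np

def bootstrap_ci(estimator, data, n_boot=1000, alpha=0.05, seed=0):
    """Nonparametric bootstrap SE and percentile CI for a generic estimator.

    `estimator` maps a pandas DataFrame to a point estimate; `data` holds the
    instrument, treatment, outcome, covariates, and missingness indicator.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([
        estimator(data.iloc[rng.integers(0, n, size=n)])  # resample rows with replacement
        for _ in range(n_boot)
    ])
    se = estimates.std(ddof=1)
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return se, (lower, upper)

For the Bayes estimator, by contrast, interval estimates come directly from posterior draws obtained via MCMC.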

Table 2: Estimation results

Method                   Estimate   SE      95% CI
Likelihood-based
  MLE                    23.54      0.37    (22.81, 24.27)
  Bayes                  23.69      0.75    (22.19, 25.14)
Complete-case MCAR
  Wald                   13.01      11.88   (−3.80, 38.38)
  TSLS                   12.23      11.39   (−7.79, 37.53)
  IPW                    12.29      11.49   (−6.71, 38.48)
  DR                     12.32      11.78   (−6.75, 39.23)
Doubly robust MAR-X      10.07      11.43   (−9.84, 34.76)
Doubly robust MAR        12.21      11.89   (−8.46, 38.16)

The MLE and Bayes approaches estimate that impaired cognitive functioning causes a substantial and statistically significant increase in depression. However, all of the other estimators give confidence intervals that contain zero, so a null hypothesis of no effect cannot be rejected. Their point estimates are all closer to the null than those from the MLE and Bayes approaches, and their confidence intervals are much wider.

10. Discussion

In this paper we reviewed instrumental variable methods for causal inference, in cases where some subjects have missing instrument data. A motivating application comes from the Wisconsin Longitudinal Study, which uses genotype data as an instrument for exploring the effect of impaired cognitive functioning on depression; many subjects have unknown genotype because they did not contribute saliva samples, or because the results from the genotyping platform were unclear. We detailed the different kinds of missing-at-random assumptions one could use to identify the instrumental variable estimand, which are based on assuming missingness occurs completely at random, or at random given covariates, or at random given covariates as well as post-instrument treatment and outcome data. We described varied methods for estimation and inference under each of these assumptions, including likelihood-based, regression, weighting, and doubly robust approaches. In summary, the likelihood-based methods require parametric assumptions about the likelihood, and are optimally efficient under such assumptions; the regression and weighting methods can be relatively straightforward to implement, depending on the particular missing-at-random assumption; the doubly robust methods can be optimally efficient under a nonparametric model, but require modeling multiple nuisance functions.

There are many crucial areas where more work is needed. First, to focus ideas we considered a particular IV estimand in this work, but other estimands may also be useful to study. Further, we considered missingness in the instrument, since this is common in practice and leads to interesting and unique complications. However, in practice there is frequently also missingness in the covariates, treatment, and/or outcome, and the missingness may not be monotone; except for the setting of ignorable missingness in the likelihood-based framework, this substantially complicates not only estimation and inference but also the assumptions required for identification. Finally, we have not pursued sensitivity analysis methods in this work, but these are a crucial part of any analysis with missing data.

Supplementary Material

Details on Bayesian computations

Acknowledgements

This work was partly supported by NIH grants R01 CA183854 and R01 GM112327. The authors thank Joe Hogan for insightful discussions.

References

  1. Åslund O and Grönqvist H Family size and child outcomes: is there really no trade-off? Labour Economics, 17:130–139, 2010.
  2. Abadie A Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263, 2003.
  3. Abadie A and Imbens GW Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.
  4. Andridge RR and Little RJ A review of hot deck imputation for survey non-response. International Statistical Review, 78(1):40–64, 2010.
  5. Angrist JD and Imbens GW Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90(430):431–442, 1995.
  6. Angrist JD, Imbens GW, and Rubin DB Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
  7. Baiocchi M, Cheng J, and Small DS Instrumental variable methods for causal inference. Statistics in Medicine, 33(13):2297–2340, 2014.
  8. Becker G An economic analysis of fertility. In Demographic Change in Developed Countries, pages 209–240. Princeton University Press, Princeton, NJ, 1960.
  9. Bickel PJ, Klaassen CA, Ritov Y, and Wellner JA Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press, 1993.
  10. Black W, Devereux P, and Salvanes K The more the merrier? The effects of family size and birth order on children's education. Quarterly Journal of Economics, 120:669–700, 2005.
  11. Brookhart M, Rassen J, and Schneeweiss S Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiology and Drug Safety, 19(6):537–554, 2010.
  12. Burgess S and Thompson SG Mendelian randomization: methods for using genetic variants in causal estimation. CRC Press, 2015.
  13. Burgess S, Seaman S, Lawlor DA, Casas JP, and Thompson SG Missing data methods in Mendelian randomization studies with multiple instruments. American Journal of Epidemiology, 174(9):1069–1076, 2011.
  14. Caceres-Delpiano J The impacts of family size on investment in child quality. Journal of Human Resources, 41:738–754, 2006.
  15. Carroll RJ, Ruppert D, Stefanski LA, and Crainiceanu CM Measurement error in nonlinear models: a modern perspective. CRC Press, 2006.
  16. Chaudhuri S and Guilkey DK GMM with multiple missing variables. Journal of Applied Econometrics, 31(4):678–706, 2016.
  17. Chib S and Greenberg E Semiparametric modeling and estimation of instrumental variable models. Journal of Computational and Graphical Statistics, 16(1):86–114, 2007.
  18. Chib S and Hamilton B Semiparametric Bayes analysis of longitudinal data treatment models. Journal of Econometrics, 110(1):67–89, 2002.
  19. Conley T, Hansen CB, McCulloch R, and Rossi P A semi-parametric Bayesian approach to the instrumental variable problem. Journal of Econometrics, 144(1):276–305, 2008.
  20. Daniels MJ and Hogan JW Missing data in longitudinal studies: strategies for Bayesian modeling and sensitivity analysis. CRC Press, 2008.
  21. Didelez V and Sheehan N Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research, 16(4):309–330, 2007.
  22. Farlow M, He Y, Tekin S, Xu J, Lane R, and Charles H Impact of APOE in mild cognitive impairment. Neurology, 63(10):1898–1901, 2004.
  23. Ganguli M, Dodge HH, Shen C, and DeKosky ST Mild cognitive impairment, amnestic type: an epidemiologic study. Neurology, 63(1):115–121, 2004.
  24. Hahn J On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331, 1998.
  25. Herd P, Carr D, and Roan C Cohort profile: Wisconsin Longitudinal Study (WLS). International Journal of Epidemiology, 43(1):34–41, 2014.
  26. Hernán MA and Robins JM Instruments for causal inference: an epidemiologist's dream? Epidemiology, 17(4):360–372, 2006.
  27. Hirano K, Imbens GW, and Ridder G Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
  28. Holland PW Causal inference, path analysis and recursive structural equations models. Sociological Methodology, 18:449–484, 1988.
  29. Horvitz DG and Thompson DJ A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  30. Imbens GW and Angrist JD Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.
  31. Kennedy EH Efficient nonparametric causal inference with missing exposures. arXiv preprint arXiv:1802.08952, 2018.
  32. Kennedy EH and Small DS Paradoxes in instrumental variable studies with missing data and one-sided noncompliance. arXiv preprint arXiv:1705.00506, 2017.
  33. Kennedy EH, Ma Z, McHugh MD, and Small DS Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B, 79(4):1229–1245, 2017.
  34. Kraay A Instrumental variables regressions with uncertain exclusion restrictions: a Bayesian approach. Journal of Applied Econometrics, 27(1):108–128, 2012. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jae.1148.
  35. Lawlor DA, Harbord RM, Sterne JA, Timpson N, and Davey Smith G Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine, 27(8):1133–1163, 2008.
  36. Li L, Shen C, Li X, and Robins JM On weighting approaches for missing data. Statistical Methods in Medical Research, 22(1):14–30, 2013.
  37. Little RJ and Rubin DB Statistical analysis with missing data. John Wiley & Sons, 2014.
  38. Lopes H and Polson N Bayesian instrumental variables: priors and likelihoods. Econometric Reviews, 33(1):100–121, 2014.
  39. McKeigue PM, Campbell H, Wild S, Vitart V, Hayward C, Rudan I, Wright AF, and Wilson JF Bayesian methods for instrumental variable analysis with genetic instruments ('Mendelian randomization'): example with urate transporter SLC2A9 as an instrumental variable for effect of urate levels on metabolic syndrome. International Journal of Epidemiology, 39(3):907–918, 2010.
  40. Mogstad M and Wiswall M Instrumental variables estimation with partially missing instruments. Economics Letters, 114(2):186–189, 2012.
  41. Molinari F Missing treatments. Journal of Business & Economic Statistics, 28(1):82–95, 2010.
  42. Ogburn EL, Rotnitzky A, and Robins JM Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B, 77(2):373–396, 2015.
  43. Okui R, Small DS, Tan Z, and Robins JM Doubly robust instrumental variable regression. Statistica Sinica, 22(1):173–205, 2012.
  44. Radloff LS The CES-D scale: a self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3):385–401, 1977.
  45. Robins JM Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.
  46. Robins JM and Rotnitzky A Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
  47. Robins JM and Rotnitzky A Comments on: Inference for semiparametric models: some questions and an answer. Statistica Sinica, 11:920–936, 2001.
  48. Robins JM, Rotnitzky A, and Zhao LP Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
  49. Robins JM, Rotnitzky A, and Zhao LP Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429):106–121, 1995.
  50. Robins JM, Rotnitzky A, and Scharfstein DO Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. Statistical Models in Epidemiology, the Environment, and Clinical Trials, pages 1–94, 2000.
  51. Robins JM, Li L, Tchetgen Tchetgen EJ, and van der Vaart AW Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
  52. Robins JM, Li L, Mukherjee R, Tchetgen Tchetgen E, and van der Vaart AW Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
  53. Rosenzweig M and Wolpin K Testing the quantity-quality fertility model: the use of twins as a natural experiment. Econometrica, 48:227–240, 1980.
  54. Rubin DB Inference and missing data. Biometrika, 63(3):581–592, 1976.
  55. Rubin DB Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489, 1996.
  56. Rubin DB and van der Laan MJ A general imputation methodology for nonparametric regression with censored data. UC Berkeley Division of Biostatistics Working Paper Series, 194, 2005.
  57. Scharfstein DO, Rotnitzky A, and Robins JM Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.
  58. Small DS Sensitivity analysis for instrumental variables regression with overidentifying restrictions. Journal of the American Statistical Association, 102(479):1049–1058, 2007.
  59. Smith GD and Ebrahim S Mendelian randomization: can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32(1):1–22, 2003.
  60. Tan Z Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476):1607–1618, 2006.
  61. Tan Z Marginal and nested structural models using instrumental variables. Journal of the American Statistical Association, 105(489):157–169, 2010.
  62. Tsiatis AA Semiparametric Theory and Missing Data. New York: Springer, 2006.
  63. van der Laan MJ Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. UC Berkeley Division of Biostatistics Working Paper Series, 317:1–90, 2013.
  64. van der Laan MJ and Robins JM Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
  65. van der Vaart AW Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.
  66. van der Vaart AW Semiparametric statistics. In: Lectures on Probability Theory and Statistics, pages 331–457, 2002.
  67. Voight BF, Peloso GM, Orho-Melander M, Frikke-Schmidt R, Barbalic M, Jensen MK, Hindy G, Hólm H, Ding EL, Johnson T, et al. Plasma HDL cholesterol and risk of myocardial infarction: a Mendelian randomisation study. The Lancet, 380(9841):572–580, 2012.
  68. Williamson E, Forbes A, and Wolfe R Doubly robust estimators of causal exposure effects with missing data in the outcome, exposure or a confounder. Statistics in Medicine, 31(30):4382–4400, 2012.
  69. Zhang Z, Liu W, Zhang B, Tang L, and Zhang J Causal inference with missing exposure information: methods and applications to an obstetric study. Statistical Methods in Medical Research, 25(5):2053–2066, 2016.
