Abstract
This paper examines the relationship between various treatment parameters within a latent variable model when the effects of treatment depend on the recipient’s observed and unobserved characteristics. We show how this relationship can be used to estimate the treatment parameters when they are identified and to bound the parameters when they are not.
This paper uses the latent variable or index model of econometrics and psychometrics to impose structure on the Neyman (1)–Fisher (2)–Cox (3)–Rubin (4) model of potential outcomes used to define treatment effects. We demonstrate how the local instrumental variable (LIV) parameter (5) can be used within the latent variable framework to generate the average treatment effect (ATE), the effect of treatment on the treated (TT) and the local ATE (LATE) of Imbens and Angrist (6), thereby establishing a relationship among these parameters. LIV can be used to estimate all of the conventional treatment effect parameters when the index condition holds and the parameters are identified. When they are not, LIV can be used to produce bounds on the parameters with the width of the bounds depending on the width of the support for the index generating the choice of the observed potential outcome.
Models of Potential Outcomes in a Latent Variable Framework
For each person i, assume two potential outcomes (Y0i, Y1i) corresponding, respectively, to the potential outcomes in the untreated and treated states. Let Di = 1 denote the receipt of treatment; Di = 0 denotes nonreceipt. Let Yi be the measured outcome variable so that

Yi = DiY1i + (1 − Di)Y0i.
This is the Neyman-Fisher-Cox-Rubin model of potential outcomes. It is also the switching regression model of Quandt (7) or the Roy model of income distribution (8, 9).
This paper assumes that a latent variable model generates the indicator variable D. Specifically, we assume that the assignment or decision rule for the indicator is generated by a latent variable D*i:
D*i = μD(Zi) − UDi,  Di = 1 if D*i ≥ 0; Di = 0 otherwise,  [1]
where Zi is a vector of observed random variables and UDi is an unobserved random variable. D*i is the net utility or gain to the decision-maker from choosing state 1. The index structure underlies many models in econometrics (10) and in psychometrics (11).
The potential outcome equation for the participation state is Y1i = μ1(Xi, U1i), and the potential outcome for the nonparticipation state is Y0i = μ0(Xi, U0i), where Xi is a vector of observed random variables and (U1i, U0i) are unobserved random variables. It is assumed that Y0 and Y1 are defined for everyone and that these outcomes are independent across persons so that there are no interactions among agents. Important special cases include models with (Y0, Y1) generated by latent variables, including μj(Xi, Uji) = μj(Xi) + Uji if Y is continuous and μj(Xi, Uji) = 1(Xiβj + Uji ≥ 0) if Y is binary, where 1(A) is the indicator function that takes the value 1 if the event A is true and takes the value 0 otherwise. We do not restrict the (μ1, μ0) functions except through integrability condition iv given below.
We assume: (i) μD(Z) is a nondegenerate random variable conditional on X = x; (ii) (UD, U1) and (UD, U0) are absolutely continuous with respect to Lebesgue measure on ℜ2; (iii) (UD, U1) and (UD, U0) are independent of (Z, X); (iv) Y1 and Y0 have finite first moments; and (v) Pr(D = 1) > 0.
Assumption i requires an exclusion restriction: There exists a variable that determines the treatment decision but does not directly affect the outcome. Let FUD be the distribution of UD with the analogous notation for the distribution of the other random variables. Let P(z) denote Pr(D = 1|Z = z) = FUD(μD(z)). P(z) is sometimes called the “propensity score”, following ref. 12. Let ŨD denote the probability transform of UD: ŨD = FUD(UD). Note that, because UD is absolutely continuous with respect to Lebesgue measure, ŨD ∼ Unif(0,1). Let Δi denote the treatment effect for person i: Δi = Y1i − Y0i.
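To fix ideas, the latent index model and the probability transform can be illustrated with a small simulation. This sketch is not part of the original analysis; the logistic distribution for UD and the linear index are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical specification: mu_D(Z) linear in a scalar Z, U_D logistic.
Z = rng.normal(size=n)
mu_D = 0.5 + 1.0 * Z
U_D = rng.logistic(size=n)

# Latent index: D* = mu_D(Z) - U_D; treatment indicator D = 1(D* >= 0).
D = (mu_D - U_D >= 0).astype(int)

# Propensity score P(z) = F_{U_D}(mu_D(z)); logistic CDF here.
F = lambda v: 1.0 / (1.0 + np.exp(-v))
P = F(mu_D)

# Probability transform: U~_D = F_{U_D}(U_D) is Unif(0,1), and
# D = 1(U~_D <= P(Z)) reproduces the same treatment indicator.
U_tilde = F(U_D)
D_alt = (U_tilde <= P).astype(int)

print(U_tilde.mean(), np.mean(D == D_alt), D.mean() - P.mean())
```

The last check confirms that selecting on 1(D* ≥ 0) is equivalent to selecting on ŨD ≤ P(Z), which is the representation used in the derivations that follow.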
It is the index structure on D that plays the crucial role in this paper. An index structure on the potential outcomes (Y0, Y1) is not required, although it is both conventional and convenient in many applications.
Definition of Parameters
We examine four different mean parameters within this framework: the ATE, the effect of treatment on the treated (TT), the local ATE (LATE), and the LIV parameter. The average treatment effect is given by:
ΔATE(x) ≡ E(Δ|X = x) = E(Y1 − Y0|X = x).  [2]
From assumption iv, it follows that E(Δ|X = x) exists and is finite a.e. FX. The expected effect of treatment on the treated is the most commonly estimated parameter for both observational data and social experiments (13, 14). It is defined as:
ΔTT(x, D = 1) ≡ E(Δ|X = x, D = 1).  [3]
From iv, ΔTT(x, D = 1) exists and is finite a.e. FX|D=1, where FX|D=1 denotes the distribution of X conditional on D = 1. It will be useful to define a version of ΔTT(x, D = 1) conditional on P(Z):
ΔTT(x, P(z), D = 1) ≡ E(Δ|X = x, P(Z) = P(z), D = 1),

so that

ΔTT(x, D = 1) = E(ΔTT(x, P(Z), D = 1)|X = x, D = 1).  [4]
From our assumptions, ΔTT(x, P(z), D = 1) exists and is finite a.e. FX,P(Z)|D=1. In the context of a latent variable model, the LATE parameter of Imbens and Angrist (6) using P(Z) as the instrument is
ΔLATE(x, P(z), P(z′)) ≡ [E(Y|X = x, P(Z) = P(z)) − E(Y|X = x, P(Z) = P(z′))]/[P(z) − P(z′)].  [5]
Without loss of generality, assume that P(z) > P(z′). From assumption iv, it follows that ΔLATE(x, P(z), P(z′)) is well defined and is finite a.e. FX,P(Z) × FX,P(Z). For interpretative reasons, Imbens and Angrist (6) also assume that P(z) is monotonic in z, a condition that we do not require. However, we do require that P(z) ≠ P(z′) for any (z, z′) where the parameter is defined.
The fourth parameter that we analyze is the LIV parameter introduced in ref. 5 and defined in the context of a latent variable model as
ΔLIV(x, P(z)) ≡ ∂E(Y|X = x, P(Z) = P(z))/∂P(z).  [6]
LIV is the limit form of the LATE parameter. In the next section, we demonstrate that ΔLIV(x, P(z)) exists and is finite a.e. FX,P(Z) under our maintained assumptions.
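As a numerical illustration of the LATE and LIV parameters defined above, consider a simulated design in which the functional forms are hypothetical and chosen so that E(Δ|X = x, ŨD = u) = 2 − 2u is known in closed form. LATE is then a slope of E(Y|P(Z) = p) between two evaluation points, and each such slope estimates LIV at the interval midpoint:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hypothetical design: P(Z) supported on a grid; U~_D uniform; outcomes
# depend on U~_D so that E(Delta | U~_D = u) = 2 - 2u.
grid = np.linspace(0.1, 0.9, 9)
P = rng.choice(grid, size=n)
U = rng.uniform(size=n)
D = (U <= P).astype(int)
Y1 = 2.0 - U + rng.normal(scale=0.1, size=n)
Y0 = U + rng.normal(scale=0.1, size=n)
Y = D * Y1 + (1 - D) * Y0          # observed outcome

# E(Y | P(Z) = p) on the grid, then LATE for adjacent evaluation points.
EY = np.array([Y[P == p].mean() for p in grid])
late = (EY[1:] - EY[:-1]) / (grid[1:] - grid[:-1])

# As the evaluation points approach one another, LATE approaches LIV;
# here each LATE estimates Delta_LIV at the interval midpoint.
mid = (grid[1:] + grid[:-1]) / 2
print(np.max(np.abs(late - (2 - 2 * mid))))
```

Shrinking the distance between adjacent evaluation points (a finer grid with more observations) recovers LIV as the limit of LATE.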
A more general framework defines the parameters in terms of Z. The latent variable or index structure implies that defining the parameters in terms of Z or P(Z) results in equivalent expressions. In the index model, Z enters the model only through the μD(Z) index, so that for any measurable set A,

Pr(Y ∈ A|Z = z, D = d) = Pr(Y ∈ A|μD(Z) = μD(z), D = d),  d = 0, 1.
Because FUD is nondecreasing and, by assumption ii, continuous, conditioning on μD(Z) and conditioning on P(Z) = FUD(μD(Z)) generate the same conditional distributions:

Pr(Y ∈ A|μD(Z) = μD(z), D = d) = Pr(Y ∈ A|P(Z) = P(z), D = d),  d = 0, 1.
Relationship Between Parameters Using the Index Structure
Given the index structure, a simple relationship exists among the four parameters. From the definition it is obvious that
ΔATE(x) = E(Δ|X = x) = ∫01 E(Δ|X = x, ŨD = u)du.  [7]
Next, consider ΔLATE(x, P(z), P(z′)). Note that
E(Y|X = x, P(Z) = P(z)) = ∫0P(z) E(Y1|X = x, ŨD = u)du + ∫P(z)1 E(Y0|X = x, ŨD = u)du,  [8]
so that

E(Y|X = x, P(Z) = P(z)) − E(Y|X = x, P(Z) = P(z′)) = ∫P(z′)P(z) E(Δ|X = x, ŨD = u)du,

and thus

ΔLATE(x, P(z), P(z′)) = [1/(P(z) − P(z′))] ∫P(z′)P(z) E(Δ|X = x, ŨD = u)du.  [9]
LIV is the limit of this expression as P(z) → P(z′). In Eq. 8, E(Y1|X = x, Ũ) and E(Y0|X = x, Ũ) are integrable with respect to dFŨ a.e. FX. Thus, E(Y1|X = x, P(Z) = P(z)) and E(Y0|X = x, P(Z) = P(z)) are differentiable a.e. with respect to P(z), and thus E(Y|X = x, P(Z) = P(z)) is differentiable a.e. with respect to P(z) with derivative given by†
ΔLIV(x, P(z)) = ∂E(Y|X = x, P(Z) = P(z))/∂P(z) = E(Δ|X = x, ŨD = P(z)).  [10]
From assumption iv, the derivative in Eq. 10 is finite a.e. FX,Ũ. The same argument could be used to show that ΔLATE(x, P(z), P(z′)) is continuous and differentiable in P(z) and P(z′).
We rewrite these relationships in succinct form in the following way:

ΔATE(x) = ∫01 ΔLIV(x, u)du

and

ΔLATE(x, P(z), P(z′)) = [1/(P(z) − P(z′))] ∫P(z′)P(z) ΔLIV(x, u)du.  [11]
Each parameter is an average value of LIV, E(Δ|X = x, ŨD = u), but for values of ŨD lying in different intervals. LIV defines the treatment effect more finely than do LATE, ATE, or TT.
ΔLIV(x, p) is the average effect for people who are just indifferent between participating or not at the given value of the instrument (i.e., for people who are indifferent at P(Z) = p). ΔLIV(x, p) for values of p close to zero is the average effect for individuals with unobservable characteristics that make them the most inclined to participate, and ΔLIV(x, p) for values of p close to one is the average treatment effect for individuals with unobservable characteristics that make them the least inclined to participate. ATE integrates ΔLIV(x, p) over the entire support of ŨD (from p = 0 to p = 1). It is the average effect for an individual chosen at random. ΔTT(x, P(z), D = 1) is the average treatment effect for persons who chose to participate at the given value of P(Z) = P(z); ΔTT(x, P(z), D = 1) integrates ΔLIV(x, p) up to p = P(z). As a result, it is primarily determined by the average effect for individuals whose unobserved characteristics make them the most inclined to participate in the program. LATE is the average treatment effect for someone who would not participate if P(Z) ≤ P(z′) and would participate if P(Z) ≥ P(z). ΔLATE(x, P(z), P(z′)) integrates ΔLIV(x, p) from p = P(z′) to p = P(z).
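The averaging relationships described in the preceding paragraph can be checked numerically for an assumed LIV function. The choice ΔLIV(x, u) = 2 − 2u below is hypothetical, made only so that the integrals have known closed forms:

```python
import numpy as np

# Assumed (hypothetical) LIV function at a fixed x: Delta_LIV(x, u) = 2 - 2u.
def liv(u):
    return 2.0 - 2.0 * u

def integral(f, a, b, k=100_000):
    # midpoint rule on [a, b]
    u = a + (np.arange(k) + 0.5) * (b - a) / k
    return f(u).mean() * (b - a)

# ATE integrates LIV over the whole unit interval.
ate = integral(liv, 0.0, 1.0)

# TT at P(z) = 0.4 averages LIV over u in [0, 0.4] (those who participate).
tt = integral(liv, 0.0, 0.4) / 0.4

# LATE averages LIV over u between the two evaluation points.
late = integral(liv, 0.2, 0.6) / (0.6 - 0.2)

print(ate, tt, late)   # 1.0, 1.6, 1.2 analytically
```

TT exceeds ATE here because ΔLIV is decreasing in u: those most inclined to participate (low u) have the largest gains.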
To derive TT, use Eq. 4 to obtain
ΔTT(x, P(z), D = 1) = [1/P(z)] ∫0P(z) ΔLIV(x, u)du, so that ΔTT(x, D = 1) = ∫01 [(1/p) ∫0p ΔLIV(x, u)du] dFP(Z)|X=x,D=1(p).  [12]
Using Bayes rule, one can show that
dFP(Z)|X=x,D=1(p) = [Pr(D = 1|X = x, P(Z) = p)/Pr(D = 1|X = x)] dFP(Z)|X=x(p).  [13]
Because Pr(D = 1|X = x, P(Z) = p) = p,
ΔTT(x, D = 1) = [1/Pr(D = 1|X = x)] ∫01 [∫0p ΔLIV(x, u)du] dFP(Z)|X=x(p).  [14]
Note further that, because Pr(D = 1|X = x) = E(P(Z)|X = x) = ∫01 (1 − FP(Z)|X=x(t))dt, we can reinterpret Eq. 14 as a weighted average of LIV parameters in which the weighting is the same as that from a “length-biased,” “size-biased,” or “P-biased” sample:
ΔTT(x, D = 1) = ∫01 ΔLIV(x, u)gx(u)du,  [15]
where gx(u) = [1 − FP(Z)|X=x(u)]/∫01 [1 − FP(Z)|X=x(t)]dt. Replacing P(Z) with length-of-spell, gx(u) is the density of a length-biased sample of the sort that would be obtained from stock sampling in duration analysis (16). Here we sample from the distribution of P(Z) conditional on D = 1 and obtain an analogous density used to weight up LIV. gx(u) is a nonincreasing function of u. ΔLIV(x, p) is given zero weight for p ≥ pmax(x).
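A numerical sketch of this length-biased weighting, under the hypothetical choices P(Z) ∼ Unif(0,1) and ΔLIV(x, u) = 2 − 2u: then FP(Z)(u) = u, the normalizing constant is 1/2, and gx(u) = 2(1 − u). The weighted average of LIV should match TT computed directly as the average effect among the treated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Hypothetical design: P(Z) ~ Unif(0,1), so F_{P(Z)}(u) = u and the
# length-biased weight is g(u) = (1 - u) / 0.5 = 2(1 - u).
P = rng.uniform(size=n)
U = rng.uniform(size=n)           # U~_D, uniform by the probability transform
D = (U <= P).astype(int)
delta = 2.0 - 2.0 * U             # treatment effect, Delta_LIV(u) = 2 - 2u

# TT computed directly: average effect among the treated.
tt_direct = delta[D == 1].mean()

# TT computed as the weighted average of LIV with weight g(u).
u = (np.arange(100_000) + 0.5) / 100_000   # midpoint grid on [0, 1]
g = (1 - u) / 0.5
tt_weighted = np.mean((2 - 2 * u) * g)     # integral of Delta_LIV * g over [0, 1]

print(tt_direct, tt_weighted)     # both approximately 4/3
```

Both routes give 4/3, above the ATE of 1, because the nonincreasing weight g concentrates on low-u individuals, who are the most inclined to participate.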
Identification of Treatment Parameters
Assume access to an infinite independently and identically distributed sample of (D, Y, X, Z) observations, so that the joint distribution of (D, Y, X, Z) is identified. Let 𝒫(x) denote the closure of the support of P(Z) conditional on X = x, and let 𝒫c(x) = (0, 1)∖𝒫(x). Let pmax(x) and pmin(x) be the maximum and minimum values in 𝒫(x).
LATE and LIV are defined as functions of the distribution of (Y, X, Z) and are thus straightforward to identify. ΔLATE(x, P(z), P(z′)) is identified for any (P(z), P(z′)) ∈ 𝒫(x) × 𝒫(x). ΔLIV(x, P(z)) is identified for any P(z) that is a limit point of 𝒫(x). The larger the support of P(Z) conditional on X = x, the bigger the set of LIV and LATE parameters that can be identified.
ATE and TT are not defined directly as functions of (Y, X, Z), so a more involved discussion of their identification is required. We can use LIV or LATE to identify ATE and TT under the appropriate support conditions: (i) If 𝒫(x) = [0, 1], then ΔATE(x) is identified from ΔLIV. If {0, 1} ⊂ 𝒫(x), then ΔATE(x) is identified from ΔLATE. (ii) If (0, P(z)) ⊂ 𝒫(x), then ΔTT(x, P(z), D = 1) is identified from ΔLIV. If {0, P(z)} ⊂ 𝒫(x), then ΔTT(x, P(z), D = 1) is identified from ΔLATE.
Note that TT is identified under weaker conditions than is ATE. To identify TT, one needs to observe P(Z) arbitrarily close to 0 (pmin(x) = 0) and to observe some positive P(Z) values whereas to identify ATE, one needs to observe P(Z) arbitrarily close to 0 and arbitrarily close to 1 (pmax(x) = 1 and pmin(x) = 0). Note that the conditions involve the closure of the support of P(Z) conditional on X = x and not the support itself. For example, to identify ΔTT(x, D = 1) from ΔLATE, we do not require that 0 be in the support of P(Z) conditional on X = x but that points arbitrarily close to 0 be in the support. This weaker requirement follows from ΔLIV(x, P(z)) being a continuous function of P(z) and ΔLATE(x, P(z), P(z′)) being a continuous function of P(z) and P(z′).
Without these support conditions, we can still construct bounds if Y1 and Y0 are known to be bounded with probability one. For ease of exposition and to simplify the notation, assume that Y1 and Y0 have the same bounds,‡ so that

Pr(yxl ≤ Y1 ≤ yxu|X = x) = 1

and

Pr(yxl ≤ Y0 ≤ yxu|X = x) = 1.

For example, if Y is an indicator variable, then the bounds are yxl = 0 and yxu = 1 for all x. For any P(z) ∈ 𝒫(x), we can identify
∫0P(z) E(Y1|X = x, Ũ = u)du = P(z)E(Y|X = x, P(Z) = P(z), D = 1)  [16]
and
∫P(z)1 E(Y0|X = x, Ũ = u)du = (1 − P(z))E(Y|X = x, P(Z) = P(z), D = 0).  [17]
In particular, we can evaluate Eq. 16 at P(z) = pmax(x) and can evaluate Eq. 17 at P(z) = pmin(x). The distribution of (D, Y, X, Z) contains no information on ∫pmax(x)1 E(Y1|X = x, Ũ = u)du and ∫0pmin(x) E(Y0|X = x, Ũ = u)du, but we can bound these quantities:
yxl(1 − pmax(x)) ≤ ∫pmax(x)1 E(Y1|X = x, Ũ = u)du ≤ yxu(1 − pmax(x))

and

yxl pmin(x) ≤ ∫0pmin(x) E(Y0|X = x, Ũ = u)du ≤ yxu pmin(x).  [18]
We thus can bound ΔATE(x) by§

pmax(x)E(Y|X = x, P(Z) = pmax(x), D = 1) + (1 − pmax(x))yxu − (1 − pmin(x))E(Y|X = x, P(Z) = pmin(x), D = 0) − pmin(x)yxl ≥ ΔATE(x) ≥ pmax(x)E(Y|X = x, P(Z) = pmax(x), D = 1) + (1 − pmax(x))yxl − (1 − pmin(x))E(Y|X = x, P(Z) = pmin(x), D = 0) − pmin(x)yxu.

The width of the bounds is thus

(yxu − yxl)[(1 − pmax(x)) + pmin(x)].
The width is linearly related to the distance between pmax(x) and 1 and the distance between pmin(x) and 0. These bounds are directly related to the “identification at infinity” results of refs. 9 and 18. Such identification at infinity results require the condition that μD(Z) takes arbitrarily large and arbitrarily small values if the support of UD is unbounded. The condition is sometimes criticized as being not credible. However, as is made clear by the width of the above bounds, the proper metric for measuring how close one is to identification at infinity is the distance between pmax(x) and 1 and the distance between pmin(x) and 0. It is credible that these distances may be small. In practice, semiparametric methods that use identification at infinity arguments to identify ATE are implicitly extrapolating E(Y1|X = x, Ũ = u) for u > pmax(x) and E(Y0|X = x, Ũ = u) for u < pmin(x).
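A numerical sketch of these ATE bounds, under hypothetical conditional means (bounded in [0, 1]) and a limited support [pmin(x), pmax(x)] = [0.2, 0.8]; all functional forms below are assumptions made for illustration:

```python
import numpy as np

# Hypothetical conditional means, bounded in [0, 1]:
#   E(Y1 | U~ = u) = 0.8 - 0.2u,  E(Y0 | U~ = u) = 0.2 + 0.2u,
# so the true ATE is 0.4; support of P(Z) is [p_min, p_max].
y_l, y_u = 0.0, 1.0
p_min, p_max = 0.2, 0.8

def m1(u): return 0.8 - 0.2 * u
def m0(u): return 0.2 + 0.2 * u

def integral(f, a, b, k=100_000):
    # midpoint rule on [a, b]
    u = a + (np.arange(k) + 0.5) * (b - a) / k
    return f(u).mean() * (b - a)

# Identified pieces (Eqs. 16 and 17 evaluated at p_max and p_min):
id1 = integral(m1, 0.0, p_max)        # = p_max * E(Y | P = p_max, D = 1)
id0 = integral(m0, p_min, 1.0)        # = (1 - p_min) * E(Y | P = p_min, D = 0)

# Bounds on ATE: replace the unidentified tails by the outcome bounds.
upper = id1 + (1 - p_max) * y_u - id0 - p_min * y_l
lower = id1 + (1 - p_max) * y_l - id0 - p_min * y_u

ate = integral(lambda u: m1(u) - m0(u), 0.0, 1.0)
width = (y_u - y_l) * ((1 - p_max) + p_min)
print(lower, ate, upper, upper - lower, width)
```

The realized width, 0.4, equals (yxu − yxl)[(1 − pmax(x)) + pmin(x)] exactly, and the true ATE lies inside the bounds.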
We can construct analogous bounds for ΔTT(x, P(z), D = 1) for P(z) ∈ 𝒫(x):

E(Y|X = x, P(Z) = P(z), D = 1) − [1/P(z)][∫pmin(x)P(z) E(Y0|X = x, Ũ = u)du + pmin(x)yxl] ≥ ΔTT(x, P(z), D = 1) ≥ E(Y|X = x, P(Z) = P(z), D = 1) − [1/P(z)][∫pmin(x)P(z) E(Y0|X = x, Ũ = u)du + pmin(x)yxu],

where ∫pmin(x)P(z) E(Y0|X = x, Ũ = u)du is identified from Eq. 17 evaluated at pmin(x) and at P(z).

The width of the bounds on ΔTT(x, P(z), D = 1) is thus:

(yxu − yxl)pmin(x)/P(z).
The width of the bounds is linearly decreasing in the distance between pmin(x) and 0. Note that the bounds are tighter for larger P(z) evaluation points because the higher the P(z) evaluation point, the less weight is placed on the unidentified quantity ∫0pmin(x) E(Y0|X = x, Ũ = u)du. In the extreme case, where P(z) = pmin(x), the width of the bounds simplifies to yxu − yxl.
We can integrate the bounds on ΔTT(x, P(z), D = 1) with respect to FP(Z)|X=x,D=1 to bound ΔTT(x, D = 1). The width of the bounds on ΔTT(x, D = 1) is thus:

(yxu − yxl)pmin(x) ∫01 (1/p)dFP(Z)|X=x,D=1(p).
Using Eq. 13, we have
(yxu − yxl)pmin(x)/Pr(D = 1|X = x).  [19]
Unlike the bounds on ATE, the bounds on TT depend on the distribution of P(Z), in particular, on Pr(D = 1|X = x) = E(P(Z)|X = x). The width of the bounds is linearly related to the distance between pmin(x) and 0, holding Pr(D = 1|X = x) constant. The larger Pr(D = 1|X = x) is, the tighter the bounds because the larger P(Z) is on average, the less probability weight is being placed on the unidentified quantity ∫0pmin(x) E(Y0|X = x, Ũ = u)du.
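The width formula in Eq. 19 can be checked numerically under a hypothetical P(Z) distribution. With P(Z) ∼ Unif(pmin, pmax), averaging the per-evaluation-point width (yxu − yxl)pmin/p over FP(Z)|D=1, which reweights FP(Z) by p/E(P(Z)), collapses to (yxu − yxl)pmin/Pr(D = 1):

```python
import numpy as np

# Hypothetical design: P(Z) ~ Unif(p_min, p_max), outcome bounds [0, 1].
y_l, y_u = 0.0, 1.0
p_min, p_max = 0.2, 0.8

# Per-evaluation-point width of the bounds on TT(x, P(z), D=1):
#   (y_u - y_l) * p_min / p.
# Averaging over F_{P(Z)|D=1}, which reweights F_{P(Z)} by p / E(P(Z)),
# gives the width of the bounds on TT(x, D=1).
k = 1_000_000
p = p_min + (np.arange(k) + 0.5) * (p_max - p_min) / k   # midpoint grid
EP = p.mean()                                 # Pr(D=1) = E(P(Z)) = 0.5 here
w_point = (y_u - y_l) * p_min / p             # width at each P(z)
w_tt = np.mean(w_point * p / EP)              # average over F_{P(Z)|D=1}

# Eq. 19: width = (y_u - y_l) * p_min / Pr(D=1)
print(w_tt, (y_u - y_l) * p_min / EP)         # both 0.4
```

The 1/p in the per-point width cancels against the p in the reweighting, which is why only pmin(x) and Pr(D = 1|X = x) appear in Eq. 19.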
Conclusion
This paper uses an index model or latent variable model for the selection variable D to impose some structure on a model of potential outcomes that originates with Neyman (1), Fisher (2), and Cox (3). We introduce the LIV parameter as a device for unifying different treatment parameters. Different treatment effect parameters can be seen as averaged versions of the LIV parameter that differ according to how they weight the LIV parameter. ATE weights all LIV parameters equally. LATE gives equal weight to the LIV parameters within a given interval. TT gives a large weight to those LIV parameters corresponding to the treatment effect for individuals who are the most inclined to participate in the program. The weighting over P(Z) that produces TT is like that obtained in length-biased or size-biased samples.
Identification of LATE and LIV parameters depends on the support of the propensity score, P(Z). The larger the support of P(Z), the larger the set of LATE and LIV parameters that are identified. Identification of ATE depends on observing P(Z) values arbitrarily close to 1 and P(Z) values arbitrarily close to 0. When such P(Z) values are not observed, ATE can be bounded, and the width of the bounds is linearly related to the distance between 1 and the largest P(Z) and the distance between 0 and the smallest P(Z) value. For TT, identification requires that one observe P(Z) values arbitrarily close to 0. If this condition does not hold, then the TT parameter can be bounded and the width of the bounds will be linearly related to the distance between 0 and the smallest P(Z) value, holding Pr(D = 1|X) constant.
Implementation of these methods through either parametric or nonparametric methods is straightforward. In joint work with Arild Aakvik of the University of Bergen (Bergen, Norway), we have developed the sampling theory for the LIV estimator and empirically estimated and bounded various treatment parameters for a Norwegian vocational rehabilitation program.
We conclude this paper with the observation that the index structure for D is not strictly required, nor is any monotonicity assumption necessary to produce results analogous to those presented in this paper. The index structure on D simplifies the derivations and yields the elegant relationships presented here. However, LIV can be defined without using an index structure (5); so can LATE. We can define LIV for different sets of regressors and produce relationships like those given in Eq. 11, defining the integrals over multidimensional sets instead of intervals. The bounds we present can be generalized to cover this case as well. The index structure for D arises in many psychometric and economic models in which the index represents net utilities or net preferences over states, and these are usually assumed to be continuous. In these cases, its application leads to the simple and concise relationships given in this paper.
Acknowledgments
We thank Arild Aakvik, Victor Aguirregabiria, Xiaohong Chen, Lars Hansen, and Justin Tobias for close reading of this manuscript. We also thank participants in the Canadian Econometric Studies Group (September, 1998), the Midwest Econometrics Group (September, 1998), the University of Uppsala (November, 1998), the University of Chicago (December, 1998), and University College London (December, 1998). James J. Heckman is Henry Schultz Distinguished Service Professor of Economics at the University of Chicago and a Senior Fellow at the American Bar Foundation. Edward Vytlacil is a Sloan Fellow at the University of Chicago. This research was supported by National Institutes of Health Grants R01-HD34958-01 and R01-HD32058-03, National Science Foundation Grant 97-09-873, and the Donner Foundation.
ABBREVIATIONS
- LIV
local instrumental variable
- ATE
average treatment effect
- TT
effect of treatment on the treated
- LATE
local ATE
Footnotes
†See, e.g., Kolmogorov and Fomin (15), Theorem 9.8 for one proof.
‡The modifications required to handle the more general case are straightforward.
§The following bounds on ATE also can be derived easily by applying Manski’s (17) bounds for “Level-Set Restrictions on the Outcome Regression.” The bounds for the other parameters discussed in this paper cannot be derived by applying his results.
References
- 1. Neyman J. Stat Sci. 1990;5:465–480.
- 2. Fisher R A. Design of Experiments. London: Oliver and Boyd; 1935.
- 3. Cox D R. The Planning of Experiments. New York: Wiley; 1958.
- 4. Rubin D. Ann Stat. 1978;6:34–58.
- 5. Heckman J. J Human Resources. 1997;32:441–462.
- 6. Imbens G, Angrist J. Econometrica. 1994;62:467–476.
- 7. Quandt R. J Am Stat Assoc. 1972;67:306–310.
- 8. Roy A. Oxford Econ Papers. 1951;3:135–146.
- 9. Heckman J, Honoré B. Econometrica. 1990;58:1121–1149.
- 10. Maddala G S. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge, U.K.: Cambridge Univ. Press; 1983.
- 11. Junker B, Ellis J. Ann Stat. 1997;25:1327–1343.
- 12. Rosenbaum P, Rubin D. Biometrika. 1983;70:41–55.
- 13. Heckman J, Robb R. In: Heckman J, Singer B, editors. Longitudinal Analysis of Labor Market Data. New York: Cambridge Univ. Press; 1985. pp. 156–245.
- 14. Heckman J, LaLonde R, Smith J. In: Ashenfelter O, Card D, editors. Handbook of Labor Economics. Amsterdam: Elsevier; 1999, in press.
- 15. Kolmogorov A N, Fomin S V. Introductory Real Analysis, trans. Silverman R. New York: Dover; 1970.
- 16. Rao C R. In: Feinberg S, editor. A Celebration of Statistics. Berlin: Springer; 1986.
- 17. Manski C. Am Econ Rev. 1990;80:319–323.
- 18. Heckman J. Am Econ Rev. 1990;80:313–318.