Economics Letters. 2016 Sep;146:50–54. doi: 10.1016/j.econlet.2016.06.033

Robust inference for the Two-Sample 2SLS estimator

David Pacini, Frank Windmeijer
PMCID: PMC5026329  PMID: 27667880

Abstract

The Two-Sample Two-Stage Least Squares (TS2SLS) data combination estimator is a popular estimator for the parameters in linear models when not all variables are observed jointly in a single data set. Although the limiting normal distribution has been established, the asymptotic variance formula has only been stated explicitly in the literature for the case of conditional homoskedasticity. By using the fact that the TS2SLS estimator is a function of reduced form and first-stage OLS estimators, we derive the variance of the limiting normal distribution under conditional heteroskedasticity. A robust variance estimator is obtained, which generalises to cases with more general patterns of variable (non-)availability. Stata code and some Monte Carlo results are provided in an Appendix. Stata code for a nonlinear GMM estimator that is identical to the TS2SLS estimator in just identified models and asymptotically equivalent to the TS2SLS estimator in overidentified models is also provided there.

Keywords: Linear model, Data combination, Instrumental variables, Robust inference, Nonlinear GMM

Highlights

  • We derive the variance of the TS2SLS estimator under heteroskedasticity.

  • We propose a new robust variance estimator.

  • We provide Stata code for the TS2SLS estimator and its robust variance estimator.

  • We provide Stata code for an asymptotically equivalent nonlinear GMM estimator.

1. Introduction

The Two-Sample Two-Stage Least Squares (TS2SLS) estimator was introduced by Klevmarken (1982) and applies in cases where one wants to estimate the effects of possibly endogenous explanatory variables x on outcome y, but where y and x are not observed in the same data set. Instead, one has observations on outcomes y and instruments z in one sample (sample 1) and on x and z in another (sample 2). Related Two-Sample IV (TSIV) estimators were proposed by Arellano and Meghir (1992) and Angrist and Krueger (1992). Furthermore, Angrist and Krueger (1995) proposed the TS2SLS estimator as a Split-Sample IV (SSIV) estimator. Inoue and Solon (2010) show that the TS2SLS estimator is more efficient than the TSIV estimator of Angrist and Krueger (1992). For further details, see Angrist and Pischke (2009) and the review of Ridder and Moffitt (2007).

This type of data combination estimation method is popular in economics. It is for example used in research on intergenerational mobility, as earnings of different generations are often not observed in the same data set, see the extensive list of references in Jerrim et al. (2014). A further recent application is van den Berg et al. (in press), who investigate the effect of early-life hunger on late-life health and use the two-sample IV approach to deal with imperfect recollection of conditions early in life. Pierce and Burgess (2013) propose the use of the TS2SLS estimator in epidemiology, in particular when estimating the causal relationship between an exposure and an outcome using genetic factors as instrumental variables, so-called Mendelian randomisation, and where obtaining complete exposure data may be difficult due to high measurement costs.

Under certain assumptions, as stated below, the TS2SLS estimator is consistent and has a limiting normal distribution, see e.g.  Klevmarken (1982) and Inoue and Solon (2010). Here we derive the limiting distribution of the TS2SLS estimator under general, unspecified, forms of conditional heteroskedasticity. As the TS2SLS estimator is a simple function of the reduced form parameters for y in sample 1, and the first-stage parameters for x in sample 2, its asymptotic variance is a function of the variances and covariances of these OLS estimators.

The variance of the limiting normal distribution of the TS2SLS estimator is given in (10) below and the formula for a robust estimator of the asymptotic variance is presented in (12). Neither of these has been derived or proposed in the literature before. The result in Inoue and Solon (2010) for the conditionally homoskedastic case is similar to our result for that case. They derive the limiting variance of the TS2SLS estimator from the optimal nonlinear GMM estimator. For overidentified models, these two estimators are not the same, but they have the same limiting distribution. Inoue and Solon (2010) did not derive the limiting robust variance for this GMM estimator, but did derive the limiting variance of the efficient two-step GMM estimator under general forms of conditional heteroskedasticity in Inoue and Solon (2005), which is also the approach presented in Arellano and Meghir (1992). Our derivation is different as we focus solely on the TS2SLS estimator as defined below in (5). For the conditional homoskedastic case, our variance estimator differs from the one proposed by Inoue and Solon (2010), as it uses the information from the two samples differently.

Applied researchers have constructed robust standard errors for the just-identified single endogenous regressor case by means of the delta method, see e.g.  Dee and Evans (2003). Our result can be seen as a generalisation of this method to situations with multiple regressors and overidentification. Although we consider here a simple cross-sectional setup, other sampling designs can be accommodated and the result is straightforwardly extended to compute, for example, cluster-robust standard errors.

Our result also generalises to situations outside the standard TS2SLS setup. For example, it can accommodate a model with three explanatory variables where one endogenous variable is observed with the outcome variable in sample 1, but not in sample 2, one explanatory variable is only observed in sample 2 and one endogenous variable is observed in both samples 1 and 2. This is discussed in Section  5 below and we present Stata code for this example and for the standard TS2SLS setup in the Appendix (see Appendix A).

In the next section we present the model, assumptions and the TS2SLS estimator. In Section  3, we present our main results. Section  4 compares our results to those derived for nonlinear GMM. The Appendix also presents Stata code for the GMM estimator.

2. Model, assumptions and TS2SLS estimator

The structural linear model of interest is given by

yi = xi′β + εi, (1)

but we cannot estimate this model as yi and xi are not jointly observed. Instead, we have two independent samples. In sample 1 we have observations on y and kz exogenous instruments z. Sample 2 contains observations on the kx explanatory variables x and z. Denoting by subscripts 1 and 2 whether the variables are observed in sample 1 or sample 2, in the first sample we observe {y1i, z1i} for i = 1, …, n1, and in the second sample we observe {x2j, z2j} for j = 1, …, n2. Throughout we assume that kz ≥ kx. Other explanatory variables that enter model (1), but that are observed in both samples and are exogenous, including the constant, have been partialled out.

The TS2SLS estimator is derived as follows. From the information in sample 1, we can estimate the reduced form model for y1i, given by

y1i = z1i′πy1 + u1i. (2)

From sample 2, we can estimate the linear projections

x2j = Πx2′z2j + v2j, (3)

with Πx2 = E(z2jz2j′)⁻¹E(z2jx2j′), a kz×kx matrix of rank kx by assumption. As (3) is a linear projection, it follows that E(z2jv2j′) = 0. Although the x1i are not observed, the data generating process for y1i is given by the structural model (1) and hence it and its reduced form are given by

y1i = x1i′β + ε1i = (Πx1′z1i + v1i)′β + ε1i = z1i′Πx1β + ε1i + v1i′β, (4)

with the linear projection parameters Πx1 = E(z1iz1i′)⁻¹E(z1ix1i′). Again, E(z1iv1i′) = 0. From (2) and (4) it follows that πy1 = Πx1β and u1i = ε1i + v1i′β. Clearly, knowledge of πy1 and Πx1 identifies the structural parameters β, and the standard 2SLS estimator in a sample with y1i, x1i and z1i all observed combines the information contained in the OLS estimators for πy1 and Πx1, denoted by π^y1 and Π^x1, as follows:

β^2sls = (Π^x1′Z1′Z1Π^x1)⁻¹Π^x1′Z1′Z1π^y1,

with Z1 the n1×kz matrix with rows z1i′.

As x1i is not observed, we cannot estimate Πx1, but we can estimate Πx2 using the second sample. Denoting the OLS estimator for Πx2 by Π^x2, the Two-Sample 2SLS estimator is given by

β^ts2sls = (X^1′X^1)⁻¹X^1′y1 = (Π^x2′Z1′Z1Π^x2)⁻¹Π^x2′Z1′y1 = (Π^x2′Z1′Z1Π^x2)⁻¹Π^x2′Z1′Z1π^y1, (5)

with X^1 = Z1Π^x2 the predicted values of the unobserved X1.
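As an illustration, the estimator in (5) can be sketched in a few lines of numpy (the paper's Appendix provides Stata code). The data-generating process below, with two instruments, one endogenous regressor and β = 0.5, is our own assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, beta = 5000, 4000, 0.5
Pi = np.array([1.0, 0.7])                     # assumed first-stage coefficients

def draw(n):
    z = rng.normal(size=(n, 2))
    e = rng.normal(size=n)
    v = 0.3 * e + rng.normal(size=n)          # v correlated with e => x endogenous
    x = z @ Pi + v
    y = beta * x + e
    return y, x, z

y1, _, Z1 = draw(n1)                          # sample 1: only (y, z) are used
_, x2, Z2 = draw(n2)                          # sample 2: only (x, z) are used

# First stage on sample 2, predicted X1 in sample 1, then OLS of y1 on X1_hat
Pi_hat = np.linalg.solve(Z2.T @ Z2, Z2.T @ x2)
X1_hat = Z1 @ Pi_hat
beta_ts2sls = (X1_hat @ y1) / (X1_hat @ X1_hat)
```

With samples of this size the estimate is close to the true β even though x and y never appear in the same sample.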

We make the following assumptions:

  • A1:

    {y1i, z1i}, i = 1, …, n1, and {x2j, z2j}, j = 1, …, n2, are i.i.d. random samples from the same population with finite fourth moments and are independent of each other.

  • A2:

    E(z1iz1i′) = Qzz1; E(z2jz2j′) = Qzz2. Qzz1 and Qzz2 are nonsingular.

  • A3:

    E(z1ix1i′) and E(z2jx2j′) both have rank kx.

  • A4:

    E(z1iε1i) = 0.

  • A5:

    E(u1i²z1iz1i′) = Ωy1, a finite and positive definite matrix.

  • A6:

    E[(Ikx ⊗ z2j)v2jv2j′(Ikx ⊗ z2j)′] = E(v2jv2j′ ⊗ z2jz2j′) = Ωx2, a finite and positive definite matrix. Ikx is the identity matrix of order kx.

  • A7:

    n1/n2 → α as n1, n2 → ∞, for some α > 0.

Assumptions A1–A3 and A7 are standard data combination assumptions, see e.g. Inoue and Solon (2010). Assumptions A2 and A3, combined with A1, result in E(z1iz1i′) = E(z2jz2j′) and E(z1ix1i′) = E(z2jx2j′), and hence Πx1 = Πx2. A1–A3 are clearly sufficient, but not necessary, conditions for Πx1 to be equal to Πx2. The condition Πx1 = Πx2 itself is sufficient for consistency of β^ts2sls, and necessary for the limiting normal distribution of √n1(β^ts2sls − β) to have a mean of zero. In the derivations below we do not (need to) impose Qzz1 = Qzz2. The resulting estimator of the variance of β^ts2sls is a simple function of the variances of π^y1 and vec(Π^x2), and this function is unambiguous about which information from which sample is being utilised.

Assumptions A5 and A6 explicitly allow for general forms of heteroskedasticity. The robust variance estimator for β^ts2sls is obtained by incorporating robust variance estimators for π^y1 and vec(Π^x2). This was done by Dee and Evans (2003) using the delta method for the just-identified single-regressor case, i.e. kx = kz = 1. The result derived below can be seen as a generalisation of this to multiple regressors and overidentified settings.

3. Limiting distribution and variance estimator

The OLS estimators for πy1 and Πx2 are given by

π^y1 = (Z1′Z1)⁻¹Z1′y1;
Π^x2 = (Z2′Z2)⁻¹Z2′X2,

with Z1 the n1×kz matrix with rows z1i′; Z2 the n2×kz matrix with rows z2j′; y1 the n1-vector (y1i) and X2 the n2×kx matrix with rows x2j′. Under Assumptions A1–A4 and A7 we obtain

plim(π^y1) = E(z1iz1i′)⁻¹E(z1ix1i′)β = πy1 = Πx1β = Πx2β;
plim(Π^x2) = E(z2jz2j′)⁻¹E(z2jx2j′) = Πx2,

and hence the TS2SLS estimator is consistent as

plim(β^ts2sls) = plim[(n1⁻¹Π^x2′Z1′Z1Π^x2)⁻¹n1⁻¹Π^x2′Z1′Z1π^y1] = (Πx2′Qzz1Πx2)⁻¹Πx2′Qzz1πy1 = β. (6)

Note that the probability limits obtained here and the limiting distributions derived below are for n1, n2 → ∞.

For the derivation of the asymptotic distribution of β^ts2sls, denote πx2 = vec(Πx2); π^x2 = vec(Π^x2); θ = (πy1′, πx2′)′ and θ^ = (π^y1′, π^x2′)′. Under Assumptions A1–A7,

√n1(π^y1 − πy1) →d N(0, Vπy1); (7)
√n2(π^x2 − πx2) →d N(0, Vπx2), (8)

where

Vπy1 = Qzz1⁻¹Ωy1Qzz1⁻¹;
Vπx2 = (Ikx ⊗ Qzz2⁻¹)Ωx2(Ikx ⊗ Qzz2⁻¹).

Hence

√n1(θ^ − θ) →d N(0, Vθ), (9)

with

Vθ = [ Vπy1     0    ]
     [   0    αVπx2  ].

From the limiting distribution of θ^, the limiting distribution of β^ts2sls is readily obtained and we give a simple proof in the Appendix (see Appendix A). Our main result is:

Under Assumptions A1–A7, the limiting distribution of β^ts2sls is given by

√n1(β^ts2sls − β) →d N(0, Vβ);
Vβ = C(Vπy1 + α(β′ ⊗ Ikz)Vπx2(β ⊗ Ikz))C′ = CVπy1C′ + α(β′ ⊗ C)Vπx2(β ⊗ C′), (10)

where

C = (Πx2′Qzz1Πx2)⁻¹Πx2′Qzz1. (11)

We can obtain an estimator for the asymptotic variance of β^ts2sls as follows. Let Var^(π^y1) and Var^(π^x2) be estimators of the asymptotic variances of π^y1 and π^x2, in the sense that plim(n1Var^(π^y1)) = Vπy1 and plim(n2Var^(π^x2)) = Vπx2. Let C^ be the matrix of least squares coefficients from the regressions of the columns of Z1 on X^1. As plim(C^) = plim((X^1′X^1)⁻¹X^1′Z1) = C, an estimator of the asymptotic variance of β^ts2sls is given by

Var^(β^ts2sls) = C^Var^(π^y1)C^′ + (β^ts2sls′ ⊗ C^)Var^(π^x2)(β^ts2sls ⊗ C^′), (12)

as

n1Var^(β^ts2sls) = C^(n1Var^(π^y1))C^′ + (n1/n2)(β^ts2sls′ ⊗ C^)(n2Var^(π^x2))(β^ts2sls ⊗ C^′) →p Vβ.

When the model is just identified, kz = kx, then C^ = Π^x2⁻¹. When furthermore kx = kz = 1, (12) reduces to the simple expression

Var^(β^ts2sls) = (Var^(π^y1) + β^ts2sls²Var^(π^x2))/π^x2²,

with β^ts2sls = π^y1/π^x2, which is identical to the expression obtained using the delta method as in Dee and Evans (2003).
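For this scalar just-identified case, the calculation can be sketched as follows. The simulated heteroskedastic design is our own illustrative assumption; the HC0-robust variances of the two OLS coefficients play the role of Var^(π^y1) and Var^(π^x2).

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, beta = 3000, 2500, 0.5

def draw(n):
    z = rng.normal(size=n)
    e = rng.normal(size=n) * (1 + 0.5 * np.abs(z))   # conditional heteroskedasticity
    v = 0.4 * e + rng.normal(size=n)
    x = z + v                                        # first-stage coefficient = 1
    y = beta * x + e
    return y, x, z

y1, _, z1 = draw(n1)
_, x2, z2 = draw(n2)

# OLS reduced-form and first-stage coefficients
pi_y = (z1 @ y1) / (z1 @ z1)
pi_x = (z2 @ x2) / (z2 @ z2)
b = pi_y / pi_x                                      # TS2SLS in the scalar case

# HC0-robust variances of the two OLS coefficients
u = y1 - z1 * pi_y
v = x2 - z2 * pi_x
var_pi_y = np.sum((z1 * u) ** 2) / (z1 @ z1) ** 2
var_pi_x = np.sum((z2 * v) ** 2) / (z2 @ z2) ** 2

# Eq. (12) in the scalar case: the delta-method formula of Dee and Evans (2003)
var_b = (var_pi_y + b ** 2 * var_pi_x) / pi_x ** 2
se_b = np.sqrt(var_b)
```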

Specifying Var^(π^y1) and Var^(π^x2) in (12) as being robust to general forms of heteroskedasticity results in a robust variance estimator for β^ts2sls. A small Monte Carlo exercise reported in the Appendix confirms that our asymptotic results reflect the behaviour of the TS2SLS estimator. Although we have here an i.i.d. cross-sectional setup, the results generalise straightforwardly to, e.g., cluster-robust variances.

4. GMM

Assuming conditional homoskedasticity for both u1i and v2j such that

E(u1i²|z1i) = σu²  and  E(v2jv2j′|z2j) = Σv,

we have that

Vπy1 = σu²Qzz1⁻¹  and  Vπx2 = Σv ⊗ Qzz2⁻¹,

and hence

Vβ = σu²(Πx2′Qzz1Πx2)⁻¹ + αβ′Σvβ·CQzz2⁻¹C′.

The variance estimator (12) is then

Var^(β^ts2sls) = σ^u²(X^1′X^1)⁻¹ + β^ts2sls′Σ^vβ^ts2sls·C^(Z2′Z2)⁻¹C^′, (13)

with σ^u² = (y1 − Z1π^y1)′(y1 − Z1π^y1)/n1 and Σ^v = (X2 − Z2Π^x2)′(X2 − Z2Π^x2)/n2.
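A minimal numpy sketch of (13), under an assumed overidentified design with kz = 2 and kx = 1 (so Σ^v is a scalar), might look as follows; the design is our own illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, beta = 4000, 3000, 0.5
Pi = np.array([1.0, 0.7])

def draw(n):
    z = rng.normal(size=(n, 2))
    e = rng.normal(size=n)                     # conditionally homoskedastic errors
    v = 0.3 * e + rng.normal(size=n)
    x = z @ Pi + v
    return beta * x + e, x, z

y1, _, Z1 = draw(n1)
_, x2, Z2 = draw(n2)

Pi_hat = np.linalg.solve(Z2.T @ Z2, Z2.T @ x2)
X1_hat = Z1 @ Pi_hat
b = (X1_hat @ y1) / (X1_hat @ X1_hat)          # TS2SLS estimate

pi_y = np.linalg.solve(Z1.T @ Z1, Z1.T @ y1)
sig_u2 = np.sum((y1 - Z1 @ pi_y) ** 2) / n1    # sigma_hat_u^2
Sig_v = np.sum((x2 - Z2 @ Pi_hat) ** 2) / n2   # Sigma_hat_v (scalar here)
C_hat = (X1_hat @ Z1) / (X1_hat @ X1_hat)      # C_hat = (X1'X1)^{-1} X1'Z1

# Eq. (13): sigma_u^2 (X1'X1)^{-1} + b' Sigma_v b * C_hat (Z2'Z2)^{-1} C_hat'
var_b = sig_u2 / (X1_hat @ X1_hat) \
    + b * Sig_v * b * (C_hat @ np.linalg.solve(Z2.T @ Z2, C_hat))
se_b = np.sqrt(var_b)
```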

Inoue and Solon (2010) derive Vβ from the limiting distribution of the optimal GMM estimator using moment conditions

E[z1i(y1i − z1i′Πx2β)] = 0; (14)
E[z2j(x2j − Πx2′z2j)′] = 0, (15)

and weight matrix

[ Var^(π^y1)       0       ]   =   [ σ^u²(Z1′Z1)⁻¹         0          ]
[      0       Var^(π^x2)  ]       [        0       Σ^v ⊗ (Z2′Z2)⁻¹  ].

Let ψ = (β′, πx2′)′; then this GMM estimator is the same as the minimum distance estimator

ψ˜ = argmin over (β, πx2) of (π^y1 − Πx2β)′(Var^(π^y1))⁻¹(π^y1 − Πx2β) + (π^x2 − πx2)′(Var^(π^x2))⁻¹(π^x2 − πx2),

where Πx2 is the kz×kx matrix such that vec(Πx2) = πx2.

Unless the model is just identified, β˜ ≠ β^ts2sls, but their limiting distributions are the same. This is a situation similar to that of the LIML and 2SLS estimators in the standard IV model. When the model is overidentified, the TS2SLS estimator itself cannot be obtained as a GMM estimator. The limiting variance of √n1(β˜ − β) is obtained from the limiting variance of √n1(ψ˜ − ψ). Inoue and Solon (2010) imposed Qzz1 = Qzz2 and obtained the variance as

Vβ,IS = (σu² + αβ′Σvβ)(Πx2′Qzz1Πx2)⁻¹

and their variance estimator is given by

Var^IS(β^ts2sls) = (σ˜u² + (n1/n2)β^ts2sls′Σ^vβ^ts2sls)(X^1′X^1)⁻¹,

where σ˜u² = (y1 − X^1β^ts2sls)′(y1 − X^1β^ts2sls)/n1. Apart from this difference in the estimation of σu², the main difference is the imposition that Qzz1 = Qzz2. Although this is justified asymptotically given Assumptions A1–A3, the finite sample variance of π^x2 in (12) is clearly more naturally estimated by Σ^v ⊗ (Z2′Z2)⁻¹ than by Σ^v ⊗ ((n2/n1)Z1′Z1)⁻¹. Also, for the example in footnotes 3 and 2 of Inoue and Solon (2010) and Inoue and Solon (2005) respectively, when E(z1ix1i′) = cE(z2jx2j′) and E(z1iz1i′) = cE(z2jz2j′), with c ≠ 1, the TS2SLS estimator is consistent and asymptotically normally distributed, but n1Var^IS(β^ts2sls) is no longer a consistent estimator of the variance of the limiting distribution, whereas n1Var^(β^ts2sls) is.

Inoue and Solon (2010) did not derive the robust variance of β˜. Although this can be obtained from the robust variance of ψ˜, the matrix expressions involved are quite cumbersome. Arellano and Meghir (1992) similarly considered the robust variance of the GMM estimator ψ˜ but also did not derive a variance estimator for β˜ separately. One can of course simply obtain robust standard errors for ψ˜ and hence β˜ using GMM routines that can estimate the parameters using the nonlinear and linear moment conditions (14), (15). These estimates are then obtained using iterative methods, and for just-identified models this produces the TS2SLS estimator with robust standard errors. For overidentified models, the efficient two-step GMM estimator for ψ can then also be obtained together with a Hansen test for the validity of the moment conditions. We present Stata code for this GMM estimation procedure in the Appendix (see Appendix A).
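The equivalence in the just-identified case can be checked numerically. The sketch below is our own illustration in numpy: it computes the minimum distance estimator for the scalar case by alternating between the two first-order conditions of the objective above, rather than by a packaged GMM routine as in the paper's Stata code.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, beta = 2000, 2000, 0.5

def draw(n):
    z = rng.normal(size=n)
    e = rng.normal(size=n)
    v = 0.4 * e + rng.normal(size=n)
    x = z + v
    return beta * x + e, x, z

y1, _, z1 = draw(n1)
_, x2, z2 = draw(n2)

pi_y = (z1 @ y1) / (z1 @ z1)
pi_x = (z2 @ x2) / (z2 @ z2)
b_ts2sls = pi_y / pi_x

# Inverse-variance weights, as in the homoskedastic weight matrix above
w1 = (z1 @ z1) / np.mean((y1 - z1 * pi_y) ** 2)
w2 = (z2 @ z2) / np.mean((x2 - z2 * pi_x) ** 2)

# Alternate over the first-order conditions of
# w1*(pi_y - p*b)^2 + w2*(pi_x - p)^2 in (b, p)
b, p = 0.0, 1.0
for _ in range(100):
    p = (w1 * b * pi_y + w2 * pi_x) / (w1 * b ** 2 + w2)   # FOC for p given b
    b = pi_y / p                                           # FOC for b given p
# just identified: the objective reaches zero and b coincides with TS2SLS
```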

5. Generalising the result

Although we derived the results in Section  3 for the standard TS2SLS estimator, the limiting distribution results (17) and (18) in the Appendix (see Appendix A) apply more generally. Indeed, the only aspect in Vθ that is particular to this specific two-sample setup is the zero covariance between π^y1 and π^x2, due to the samples being independent.

Consider as a generalisation a model with three explanatory variables x1, x2 and x3. Using the same notational convention as before, in sample 1 we observe {y1i, x11i, x31i, z1i}, i = 1, …, n1. In sample 2 we observe {x22j, x32j, z2j}, j = 1, …, n2. In this case, x1 is only observed in sample 1, x2 is only observed in sample 2, whereas x3 is observed in both samples. Let Z = (Z1′, Z2′)′ and x3 = (x31′, x32′)′; then the reduced form and first-stage OLS estimators are given by

π^y1 = (Z1′Z1)⁻¹Z1′y1;  π^x11 = (Z1′Z1)⁻¹Z1′x11;
π^x22 = (Z2′Z2)⁻¹Z2′x22;  π^x3 = (Z′Z)⁻¹Z′x3.

Let Π^x = [π^x11, π^x22, π^x3]; then the two-sample IV estimator is given by

β^2s = (Π^x′Z1′Z1Π^x)⁻¹Π^x′Z1′Z1π^y1.

We differentiate this estimator from the standard two-sample setup above and reserve the name β^ts2sls for that particular setup. Under Assumptions A1–A7, the limiting distribution is as in (17), but as θ^ = (π^y1′, vec(Π^x)′)′, the variance Vθ differs from the standard setup as there is a different covariance structure. There are non-zero covariances between π^y1 and π^x11; π^y1 and π^x3; π^x11 and π^x3; and π^x22 and π^x3, whereas the covariances between π^y1 and π^x22, and between π^x11 and π^x22, are zero. From (18), an estimator for the asymptotic variance is given by

Var^(β^2s) = (δ^′ ⊗ C^)Var^(θ^)(δ^ ⊗ C^′), (16)

where δ^ = (1, −β^2s′)′ and C^ = (X^1′X^1)⁻¹X^1′Z1 = (Π^x′Z1′Z1Π^x)⁻¹Π^x′Z1′Z1.

For the standard TS2SLS setup and the more general structures, one can obtain the robust variance estimates using standard routines. We give Stata code for two examples in the Appendix (see Appendix A). The structure of the algorithm for the general case is:

  • 1.

    Estimate the reduced form and first-stage parameters by OLS, obtain the predicted values X^1 and a robust variance estimate for θ^ = (π^y1′, vec(Π^x)′)′, the matrix Var^(θ^). In Stata, the latter can be obtained using the ‘gmm’ or the ‘suest’ routine.

  • 2.

    Regress y1 on X^1 to obtain the TS2SLS estimator.

  • 3.

    Regress the columns of Z1 on X^1 and collect the parameter estimates in the matrix C^.

  • 4.

    Calculate Var^(β^2s) by the matrix expression in (16).

  • 5.

    Some adjustments have to be made when parameters on exogenous variables and the constant are included in the estimation. These are detailed in the code in the Appendix (see Appendix A).
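The steps above can be sketched in numpy for the standard TS2SLS setup (the paper's Appendix gives the corresponding Stata code). The simulated heteroskedastic design with kz = 2 and kx = 1 is our own assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, beta = 4000, 3000, 0.5
Pi = np.array([1.0, 0.7])

def draw(n):
    z = rng.normal(size=(n, 2))
    e = rng.normal(size=n) * (1 + 0.5 * np.abs(z[:, 0]))   # heteroskedastic
    v = 0.3 * e + rng.normal(size=n)
    x = z @ Pi + v
    return beta * x + e, x, z

y1, _, Z1 = draw(n1)
_, x2, Z2 = draw(n2)

# Step 1: reduced-form and first-stage OLS with HC0-robust variances
pi_y = np.linalg.solve(Z1.T @ Z1, Z1.T @ y1)
pi_x = np.linalg.solve(Z2.T @ Z2, Z2.T @ x2)
u = y1 - Z1 @ pi_y
v = x2 - Z2 @ pi_x
A1 = np.linalg.inv(Z1.T @ Z1)
A2 = np.linalg.inv(Z2.T @ Z2)
V_pi_y = A1 @ (Z1 * (u ** 2)[:, None]).T @ Z1 @ A1
V_pi_x = A2 @ (Z2 * (v ** 2)[:, None]).T @ Z2 @ A2

# Step 2: regress y1 on the predicted values to get the TS2SLS estimator
X1_hat = Z1 @ pi_x
b = (X1_hat @ y1) / (X1_hat @ X1_hat)

# Step 3: regress the columns of Z1 on X1_hat to get C_hat (k_x x k_z)
C_hat = (X1_hat @ Z1) / (X1_hat @ X1_hat)

# Step 4: eq. (12); with k_x = 1 the Kronecker products collapse to b * C_hat
var_b = C_hat @ V_pi_y @ C_hat + b * (C_hat @ V_pi_x @ C_hat) * b
se_b = np.sqrt(var_b)
```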

Footnotes

We would like to thank Helmut Farbmacher, Tom Palmer, Mark Schaffer, Jon Temple, Kate Tilling and the editor, Costas Meghir, for helpful comments. Windmeijer acknowledges funding by the Medical Research Council, grant no. MC_UU_12013/9.

Appendix A

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.econlet.2016.06.033.

mmc1.pdf (97.5KB, pdf)

References

  1. Angrist J.D., Krueger A.B. The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. J. Amer. Statist. Assoc. 1992;87:328–336.
  2. Angrist J.D., Krueger A.B. Split-sample instrumental variables estimates of the return to schooling. J. Bus. Econom. Statist. 1995;13:225–235.
  3. Angrist J.D., Pischke J.-S. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press; Princeton: 2009.
  4. Arellano M., Meghir C. Female labour supply and on-the-job search: An empirical model estimated using complementary data sets. Rev. Econom. Stud. 1992;59:537–559.
  5. Dee T.S., Evans W.N. Teen drinking and educational attainment: Evidence from two-sample instrumental variables estimates. J. Labor Econ. 2003;21:178–209.
  6. Inoue A., Solon G. Two-sample instrumental variables estimators. NBER Technical Working Paper 311; 2005.
  7. Inoue A., Solon G. Two-sample instrumental variables estimators. Rev. Econ. Stat. 2010;92:557–561.
  8. Jerrim J., Choi A., Rodriguez R.S. Two-Sample Two-Stage Least Squares (TSTSLS) estimates of earnings mobility: How consistent are they? Working Paper No. 14-17, Institute of Education, University of London; 2014.
  9. Klevmarken N.A. Missing variables and two-stage least squares estimation from more than one data set. Working Paper Series No. 62, Research Institute of Industrial Economics, Stockholm, Sweden; 1982.
  10. Pierce B.L., Burgess S. Efficient design for Mendelian randomization studies: Subsample and 2-sample instrumental variables estimators. Am. J. Epidemiol. 2013;178:1177–1184. doi: 10.1093/aje/kwt084.
  11. Ridder G., Moffitt R. The econometrics of data combination. In: Heckman J.J., Leamer E.E., editors. Handbook of Econometrics, Vol. 6, Part B. 2007. pp. 5469–5547 (Chapter 75).
  12. van den Berg G.J., Pinger P.R., Schoch J. Instrumental variable estimation of the causal effect of hunger early in life on health later in life. Econom. J. 2015 (in press).
