Entropy. 2022 Sep 2;24(9):1235. doi: 10.3390/e24091235

Simultaneous Maximum Likelihood Estimation for Piecewise Linear Instrumental Variable Models

Shuo Shuo Liu 1,*, Yeying Zhu 2
Editor: Kateřina Hlaváčková-Schindler
PMCID: PMC9497487  PMID: 36141121

Abstract

Analysis of instrumental variables is an effective approach to dealing with endogenous variables and unmeasured confounding in causal inference. We propose using a piecewise linear model to fit the relationship between the continuous instrumental variable and the continuous explanatory variable, as well as the relationship between the continuous explanatory variable and the outcome variable, which generalizes the traditional linear instrumental variable models. The two-stage least squares and limited information maximum likelihood methods are used for the simultaneous estimation of the regression coefficients and the threshold parameters. Furthermore, we study the limiting distribution of the estimators in the correctly specified and misspecified models and provide a robust estimator of the variance-covariance matrix. We illustrate the finite sample properties of the estimation in terms of the Monte Carlo biases, standard errors, and coverage probabilities using simulated data. Our proposed model is applied to an education-salary dataset to investigate the causal effect of children’s years of schooling on estimated hourly wage, with father’s years of schooling as the instrumental variable.

Keywords: causal inference, instrumental variables, piecewise linear, thresholds model

1. Introduction

In observational studies, measured confounders can be controlled by a variety of methods, such as propensity-score-based matching and regression adjustment. However, when a confounding variable is unmeasured, traditional causal inference methods usually lead to biased estimators, since changes in the unmeasured confounder lead to changes in the explanatory variable, and both result in changes in the response variable. Failing to adjust for such a confounder leads to a spurious association between the explanatory variable and the outcome. Analysis of instrumental variables (IV) has gained popularity in causal inference, for example in investigating causal graphical structures [1,2] and controlling for unmeasured confounding [3,4]. An instrument is a variable that is correlated with the explanatory variable but not associated with any unmeasured confounders. In addition, the instrumental variable is supposed to influence the response variable only through the explanatory variable, i.e., there is no direct effect of this variable on the response. Instrumental variable analysis can be applied in many areas and disciplines, such as economics and epidemiology. For example, the causal relationship between years of schooling and earnings has been studied in the economics literature [5]. That study uses college proximity as the instrumental variable because those living near a college or university usually have a significantly higher level of education than others. On the other hand, it is believed that college proximity may improve earnings only by increasing the subject’s years of schooling. Together, these indicate that college proximity is a useful instrumental variable. In biomedical and epidemiological research, the main interest is to investigate the causal effect of an exposure variable on a certain disease outcome. A gene can be regarded as a good instrument if it is closely linked to the exposure but has no direct effect on the disease [6]. The study of genetic variants as instrumental variables is known as Mendelian randomization, which is discussed extensively in the literature (e.g., [7,8]). For instance, a set of 32 recently identified genetic variants is used as instrumental variables to study whether child fat mass causally affects academic achievement and blood pressure [9].

1.1. Related Work

Since the development of instrumental variables, many estimation methods have been proposed for causal effect estimation. Two-stage least squares (2SLS) [10] is one of the most commonly used methods for instrumental variable estimation. Theoretical results such as consistency and asymptotic normality also exist in the literature. When the response variable is binary, the second stage can be replaced with logistic regression in Mendelian randomization studies [11]. Another method is the likelihood-based method, particularly the limited information maximum likelihood (LIML) [12]. It has been shown that the LIML method is more effective in dealing with weak instruments [13]. Weak instruments occur when the correlation between the instrument(s) and the explanatory variable is close to zero. When there are weak instruments, 2SLS is generally unstable and the causal effect estimators are badly biased. The typical rule of thumb for detecting weak instruments is based on the F-statistic: an instrument may be weak if the first-stage F-statistic is less than 10 [14].

Most IV approaches impose linear assumptions among the instrument, explanatory, and response variables. However, linearity does not always hold. For example, a subject’s years of schooling may only have a positive effect on subsequent earnings if the subject obtained at least a high-school degree. There would be no difference in earnings if the subject obtained either an elementary or a middle school degree. In this hypothetical scenario, a linear regression model between the explanatory and response variables is clearly misspecified. When the null hypothesis of a linear relationship is rejected, one strategy is to develop piecewise linear models, which are more interpretable than completely nonlinear models.

In this paper, we propose a piecewise linear instrumental variable (PLIV) model for estimating the causal effect via a continuous threshold function. The continuous threshold function assumes that both the explanatory variable and the instrumental variable are continuous. Instrumental variable models with continuous variables have been studied extensively in the literature. For example, continuous instruments have been used in the classical IV models, developed in a structural equation modeling framework [15]. A recent paper proposes semiparametric doubly robust estimators of causal effects with the continuous instruments [16]. Moreover, some discussions about continuous exposure and a continuous response for Mendelian randomization can be found in a review paper [8].

A threshold in a variable occurs when there is a sudden change in the effect of this variable. We call the point where the change happens a cut-off point or a threshold. A subset causal effect exists when there is a threshold in the explanatory variable. The proposed PLIV model is useful because it can capture the subset causal effect when the true model is not linear, and it degenerates to a linear instrumental variable model when the relationship among the variables is indeed linear. In other words, by using piecewise linear functions, we can quantitatively find the subset effects of the explanatory and the instrumental variables.

We use the Rectified Linear Unit (ReLU) function, mathematically defined in Equation (1), to incorporate the piecewise relationships. The use of the ReLU function for defining subset effects has been studied in the literature, for example in a regression kink model that tests for the presence of the threshold [17] and in the segmented and hinge models for subset effects in logistic regression [18]. In addition, continuous threshold models via the ReLU function with two-way interactions are considered in Cox’s proportional hazards model, where asymptotic normality under mild conditions is established [19]. In this paper, we use a continuous threshold function with multiple thresholds to formulate the piecewise linear instrumental variable models. A similar piecewise linear instrumental variable model based on the random slope approach has been studied in the literature [20]. It divides the data into a few segments and analyzes the data in each segment individually. However, this method suffers from a substantial loss of efficiency and accuracy.

1.2. Contribution of This Article

In this paper, we consider a piecewise linear model when the linearity assumption of the data is inappropriate and provide a rigorous treatment of the statistical properties of the model. Our contributions can be summarized as follows.

  • We simultaneously estimate the coefficients and thresholds of the piecewise linear instrumental variable model by the limited information maximum likelihood (LIML) method, assuming the number of thresholds is known.

  • The proposed piecewise linear instrumental variable model will degenerate to the linear instrumental variable model if there are no thresholds. Therefore, it provides a generalization to the linear instrumental variable model. To our best knowledge, this is the first work on the piecewise linear extension to the traditional linear instrumental variable models.

  • We also study the theoretical properties of the PLIV model, including the consistency and asymptotic normality of the estimators.

2. Piecewise Linear Instrumental Variable Model

Notations: we denote scalars by unbolded lowercase letters (e.g., the sample size $n$ and the $i$-th observation of the outcome variable $y_i$), random variables by unbolded capital letters (e.g., $X$), random vectors by boldface lowercase letters (e.g., $\mathbf{x}_i$ and $\boldsymbol{\beta}$), and matrices by boldface capital letters (e.g., $\mathbf{X}$).

In the ordinary linear regression model $y_i=\mathbf{x}_i\boldsymbol{\beta}+\epsilon_i$, the explanatory variables are assumed to be uncorrelated with the error term, i.e., $\mathrm{cov}(\mathbf{x}_i,\epsilon_i)=0$. However, there are situations where the covariance between the explanatory variables and the error term is nonzero. This endogeneity in $\mathbf{x}$ makes the ordinary least squares estimator inconsistent. One way to deal with this issue is to introduce an instrumental variable, whose changes are related to changes in the explanatory variable but do not lead to changes in the response variable directly.

Let $(x_i,y_i,z_i)$, $i=1,\ldots,n$, denote the observed data for $(X,Y,Z)$, where $X$ is the explanatory variable, $Y$ is the response variable, and $Z$ is the instrumental variable. To estimate the subset causal effect and establish the piecewise linear relationship, for any threshold parameter $t\in\mathbb{R}$, we use a continuous threshold function defined as:

$$\varphi(x_i,t)=(x_i-t)\,I(x_i>t)=(x_i-t)_+, \qquad (1)$$

where $I(\cdot)$ is an indicator function. The ReLU function, commonly used as an activation function in deep learning, is the special case with $t=0$, such that $\varphi(x_i,0)=(x_i-0)\,I(x_i>0)=(x_i-0)_+$.
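As a minimal illustration of Equation (1), the continuous threshold function can be sketched in a few lines of Python. The function names `phi` and `relu` are chosen here for illustration only and do not come from the paper.

```python
def phi(x, t):
    """Continuous threshold function (x - t)_+ = (x - t) * I(x > t)."""
    return x - t if x > t else 0.0


def relu(x):
    """ReLU is the special case of phi with threshold t = 0."""
    return phi(x, 0.0)


# Below the threshold the function is flat at zero; above it, it grows linearly.
below = phi(1.0, 2.0)   # x <= t, so the value is 0
above = phi(3.0, 2.0)   # x > t, so the value is x - t
```

Because the function is continuous at $x=t$, a regression built from such terms has kinks but no jumps.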

The proposed model provides sparsity and computational efficiency compared to the smoothing or approximation approach in the literature. The estimation stage involves indicator functions, but it does not require an approximation of the indicator function. Let $K$ and $J$ denote the number of thresholds in $Z$ and $X$, respectively. Denote $\mathbf{c}=(c_1,\ldots,c_K)^T$ as the vector of thresholds in $Z$ and $\mathbf{t}=(t_1,\ldots,t_J)^T$ as the vector of thresholds in $X$. We propose the following piecewise linear instrumental variable model:

$$x_i=\alpha_0+\alpha_1\varphi(z_i,c_1)+\cdots+\alpha_K\varphi(z_i,c_K)+\alpha_{K+1}z_i+v_i, \qquad (2)$$
$$y_i=\beta_0+\beta_1\varphi(x_i,t_1)+\cdots+\beta_J\varphi(x_i,t_J)+\beta_{J+1}x_i+u_i, \qquad (3)$$

where $\boldsymbol{\beta}=(\beta_0,\ldots,\beta_{J+1})^T$ is the vector of coefficients representing the causal effect of $X$ on $Y$; $\boldsymbol{\alpha}=(\alpha_0,\ldots,\alpha_{K+1})^T$ is the vector of coefficients representing the instrumental effect of $Z$ on $X$; and $u_i$ and $v_i$ are the error terms for the $i$-th observation. In the context of causal inference, we interpret $\boldsymbol{\beta}$ as the causal effect of $X$ on $Y$. More specifically, for $t_j<x\le t_{j+1}$, $1\le j\le J$, with $t_{J+1}$ denoting the maximum value of $x$, a one-unit increase in $x$ leads to a change of $\beta_{J+1}+\sum_{l=1}^{j}\beta_l$ units in $y$. In addition, $\beta_{J+1}$ represents the change in $y$ caused by a one-unit increase in $x$ for $t_0<x\le t_1$, where $t_0$ is the minimum value of $x$. To illustrate this, in Figure 1 we plot the function $y=\varphi(x,2)+3\times\varphi(x,3)+2x$, where $\beta_1=1$, $\beta_2=3$, and $\beta_3=2$. When $2<x\le 3$, the slope is $\beta_1+\beta_3=3$. When $3<x\le 4$, the slope is $\beta_1+\beta_2+\beta_3=6$.

Figure 1. Plot of the function $y=\varphi(x,2)+3\times\varphi(x,3)+2x$.
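The segment slopes in the Figure 1 example can be reproduced with a short sketch. It assumes the thresholds are listed in increasing order; the helper names `piecewise` and `segment_slopes` are hypothetical and used only for illustration.

```python
from itertools import accumulate


def phi(x, t):
    """Continuous threshold function (x - t)_+."""
    return x - t if x > t else 0.0


def piecewise(x, intercept, kink_coefs, thresholds, linear_coef):
    """Evaluate intercept + sum_j b_j * phi(x, t_j) + linear_coef * x."""
    kinks = sum(b * phi(x, t) for b, t in zip(kink_coefs, thresholds))
    return intercept + kinks + linear_coef * x


def segment_slopes(kink_coefs, linear_coef):
    """Slope before the first threshold, then after each successive threshold."""
    return [linear_coef] + [linear_coef + s for s in accumulate(kink_coefs)]


# Example from the text: y = phi(x, 2) + 3 * phi(x, 3) + 2 * x
slopes = segment_slopes([1.0, 3.0], 2.0)
```

Here `slopes` lists the slope for $x\le 2$, for $2<x\le 3$, and for $x>3$, matching the values 2, 3, and 6 implied by the example.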

Here, we assume $K$ and $J$ are prespecified according to some prior knowledge or theoretical justification. In practice, we may use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [21] to select them. A more elegant examination of the condition for the number of thresholds can be found in Newey [22]. In particular, when $\alpha_1=\cdots=\alpha_K=0$ and $\beta_1=\cdots=\beta_J=0$, our proposed model degenerates to the traditional linear instrumental variable model.

For instrumental variable analysis, an instrumental variable is correlated with the explanatory variable but not correlated with the error term. In our model, $(Z-\mathbf{c})_+=\{(Z-c_1)_+,\ldots,(Z-c_K)_+\}$ is the vector of instrumental variables with the following properties:

  • Instrument relevance: $\mathrm{cov}\{(Z-\mathbf{c})_+,X\}\neq 0$, i.e., $(Z-\mathbf{c})_+$ is correlated with the explanatory variable $X$.

  • Instrument exogeneity: $\mathrm{cov}\{(Z-\mathbf{c})_+,U\}=0$, i.e., $(Z-\mathbf{c})_+$ is uncorrelated with the error term $U$.

We assume $K\ge J$ for identifiability, i.e., the number of instruments should be larger than or equal to the number of endogenous variables.

Remark 1.

Note that intensive research about nonlinear instrumental variable models has been conducted in the literature, such as the nonparametric instrumental regression [23,24,25]. We point out that the target of our method is to quantitatively find the thresholds and estimate the subset causal effects. We aim to generalize the traditional linear IV model and fit an interpretable model rather than approximate the data by a nonlinear function.

To estimate the unknown parameters in (2) and (3), we utilize the two-stage least square (2SLS) method and the limited information maximum likelihood (LIML) method. Details about the proposed estimation methods are discussed below.

3. Simultaneous Maximum Likelihood Estimation

We first introduce how the LIML method is used in our model and initialize the naive estimators by the 2SLS method.

3.1. Limited Information Maximum Likelihood

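As discussed in the introduction, limited information maximum likelihood has advantages in the presence of weak instruments and is another popular approach for estimation in instrumental variable models. Here, we assume the error terms $(U,V)$ are jointly normally distributed and correlated to some extent due to the unmeasured confounding effect. Let $\mathbf{0}$ be the mean vector and $\rho$ be the correlation of $(U,V)$. Denote $\sigma_u^2$ and $\sigma_v^2$ as the variances of the error terms $U$ and $V$, respectively. Then the probability density function of the bivariate normal $(U,V)$ can be written as:

$$f(U,V)=\frac{1}{2\pi\sigma_u\sigma_v\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}Q(U,V)\right\},$$

where the quadratic form is $Q(U,V)=\frac{U^TU}{\sigma_u^2}-\frac{2\rho U^TV}{\sigma_v\sigma_u}+\frac{V^TV}{\sigma_v^2}$. For a single observation, the log-likelihood is, up to an additive constant,

$$\ell(u_i,v_i;\boldsymbol{\theta})=-\log\sigma_u\sigma_v-\frac{1}{2}\log(1-\rho^2)-\frac{1}{2(1-\rho^2)}\left(\frac{u_i^2}{\sigma_u^2}-\frac{2\rho u_iv_i}{\sigma_u\sigma_v}+\frac{v_i^2}{\sigma_v^2}\right),$$

where $\boldsymbol{\theta}=(\boldsymbol{\alpha}^T,\boldsymbol{\beta}^T,\mathbf{c}^T,\mathbf{t}^T,\rho,\sigma_u,\sigma_v)^T$ denotes all the model parameters and

$$v_i=x_i-\alpha_0-\alpha_1\varphi(z_i,c_1)-\cdots-\alpha_K\varphi(z_i,c_K)-\alpha_{K+1}z_i,$$
$$u_i=y_i-\beta_0-\beta_1\varphi(x_i,t_1)-\cdots-\beta_J\varphi(x_i,t_J)-\beta_{J+1}x_i.$$

To simplify notation, we let $\ell(\boldsymbol{\theta})=\ell(u_i,v_i;\boldsymbol{\theta})$ denote the log-likelihood. The maximum likelihood estimate of $\boldsymbol{\theta}$ is obtained by maximizing the log-likelihood within the compact set $\Theta\subset\mathbb{R}^{D(\boldsymbol{\theta})}$, such that $\hat{\boldsymbol{\theta}}_n=\arg\max_{\boldsymbol{\theta}\in\Theta}\ell_n(\boldsymbol{\theta})$, where $\ell_n(\boldsymbol{\theta})=\frac{1}{n}\sum_{i=1}^{n}\ell(u_i,v_i;\boldsymbol{\theta})$. However, there is no closed-form solution for $\hat{\boldsymbol{\theta}}_n$, so we use a gradient-based algorithm for estimation. This yields approximate M-estimators. To speed up estimation, we use the two-stage least squares method to initialize the estimators.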
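The objective described above can be written down directly. The following is a minimal sketch, not the authors' implementation: it evaluates the residuals from Equations (2) and (3) and the averaged log-likelihood (with the additive constant dropped). The function names and the tuple layout of `alpha` and `beta` are assumptions made for this example.

```python
import math


def phi(x, t):
    """Continuous threshold function (x - t)_+."""
    return x - t if x > t else 0.0


def residuals(x, y, z, alpha, beta, c, t):
    """Return (u, v); alpha = (a_0, a_1..a_K, a_{K+1}) and beta likewise."""
    v = x - alpha[0] - sum(a * phi(z, ck) for a, ck in zip(alpha[1:-1], c)) - alpha[-1] * z
    u = y - beta[0] - sum(b * phi(x, tj) for b, tj in zip(beta[1:-1], t)) - beta[-1] * x
    return u, v


def loglik_one(u, v, rho, sigma_u, sigma_v):
    """Bivariate normal log-likelihood of one (u, v) pair, constant dropped."""
    q = (u / sigma_u) ** 2 - 2 * rho * u * v / (sigma_u * sigma_v) + (v / sigma_v) ** 2
    return -math.log(sigma_u * sigma_v) - 0.5 * math.log(1 - rho ** 2) - q / (2 * (1 - rho ** 2))


def mean_loglik(data, alpha, beta, c, t, rho, sigma_u, sigma_v):
    """Average log-likelihood over an iterable of (x, y, z) triples."""
    total = 0.0
    for x, y, z in data:
        u, v = residuals(x, y, z, alpha, beta, c, t)
        total += loglik_one(u, v, rho, sigma_u, sigma_v)
    return total / len(data)
```

In practice this function would be handed to a numerical optimizer over all parameters, including the thresholds; the choice of optimizer is left open here.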

3.2. Initialization: Two-Stage Least Square

The traditional two-stage least squares method regresses the explanatory variable on the instrumental variable and computes the predictions x^ in the first stage. In the second stage, it regresses the response variable on the predictions x^. The causal effect of interest is estimated from the second stage. In our method, we employ 2SLS to obtain the initial values of the parameters of the piecewise linear instrumental variable model. Below we describe the 2SLS procedures for initializations:

Stage 1: First, we regress $x$ on $\{(z-\mathbf{c})_+,z\}$ and obtain the fitted values $\hat{x}$, where $(z-\mathbf{c})_+=\{(z-c_1)_+,\ldots,(z-c_K)_+\}$.

Stage 2: We regress $y$ on $\{(\hat{x}-\mathbf{t})_+,\hat{x}\}$, where $(\hat{x}-\mathbf{t})_+=\{(\hat{x}-t_1)_+,\ldots,(\hat{x}-t_J)_+\}$. Thus, in the second stage, we fit the following regression model:

$$y_i=\beta_0+\beta_1\varphi(\hat{x}_i,t_1)+\cdots+\beta_J\varphi(\hat{x}_i,t_J)+\beta_{J+1}\hat{x}_i+u_i.$$

For each combination of the number of thresholds in $X$ and $Z$, we could pick $\mathbf{c}$, $\mathbf{t}$, and the regression coefficients simultaneously through a grid search that minimizes the sum of squared errors (SSE) of $Y$. However, for $J\ge 2$ or $K\ge 2$, a grid search is computationally expensive. Since we only need 2SLS to provide the initialization of the parameters in our method, we choose $\mathbf{c}$ to be a vector of points that are evenly spaced between the 5% and 95% quantiles of $Z$. Similarly, we choose $\mathbf{t}$ to be a vector of points that are evenly spaced between the 5% and 95% quantiles of $X$. We ignore points below the 5% quantile and above the 95% quantile in order to avoid boundary effects. The regression coefficients are obtained accordingly.
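The initialization can be sketched as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions rather than the authors' code: thresholds are placed at evenly spaced interior points of the 5% to 95% quantile range (one possible reading of "evenly spaced"), the empirical quantile uses linear interpolation, and least squares is solved through the normal equations. All function names are hypothetical.

```python
def phi(x, t):
    return x - t if x > t else 0.0


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def initial_thresholds(values, k):
    """k evenly spaced interior points of the 5%-95% quantile range (assumption)."""
    lo, hi = quantile(values, 0.05), quantile(values, 0.95)
    return [lo + (hi - lo) * (j + 1) / (k + 1) for j in range(k)]


def design(values, thresholds):
    """Rows of the form [1, phi(v, t_1), ..., phi(v, t_k), v]."""
    return [[1.0] + [phi(v, t) for t in thresholds] + [v] for v in values]


def ols(rows, target):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, target))]
         for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [ar - f * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def two_stage_init(x, y, z, K, J):
    """Stage 1: x on {phi(z, c), z}; Stage 2: y on {phi(x_hat, t), x_hat}."""
    c = initial_thresholds(z, K)
    alpha = ols(design(z, c), x)
    x_hat = [sum(a * d for a, d in zip(alpha, row)) for row in design(z, c)]
    t = initial_thresholds(x, J)
    beta = ols(design(x_hat, t), y)
    return alpha, beta, c, t


# Noise-free toy data with no thresholds, so the fit should be exact.
z = [i / 10 for i in range(-20, 21)]
x = [1 + 2 * zi for zi in z]
y = [3 + 0.5 * xi for xi in x]
alpha0, beta0, c0, t0 = two_stage_init(x, y, z, K=0, J=0)
```

The returned values are only starting points; the thresholds and coefficients are refined jointly in the likelihood step.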

3.3. Theoretical Analysis

Under mild conditions, we study the statistical properties of the proposed model and establish the robust variance-covariance estimators for the estimated parameters under the correctly specified and misspecified models, separately. To investigate the theoretical properties, we consider the following regularity conditions:

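  • C1. Observations $(X_i,Y_i,Z_i)$, $i=1,\ldots,n$, are independently and identically distributed on a compact set $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}\subset\mathbb{R}^1\times\mathbb{R}^1\times\mathbb{R}^1$. Furthermore, $E(X^2)<\infty$, $E(Y^2)<\infty$, and $E(Z^2)<\infty$.

  • C2. The explanatory variable $X$ and the instrumental variable $Z$ are continuous in the parameter space, i.e., they have continuous probability density functions $f_X(\cdot)$ and $f_Z(\cdot)$. The density functions are uniformly bounded, that is, there exist constants $\underline{c}_1$, $\underline{c}_2$, $\bar{c}_1$, and $\bar{c}_2$ such that
    $$\underline{c}_1\le\inf_{Z\in\mathcal{Z}}f_Z(\cdot)\le\sup_{Z\in\mathcal{Z}}f_Z(\cdot)\le\bar{c}_1\quad\text{and}\quad\underline{c}_2\le\inf_{X\in\mathcal{X}}f_X(\cdot)\le\sup_{X\in\mathcal{X}}f_X(\cdot)\le\bar{c}_2.$$

    Furthermore, the true values of the coefficients for the threshold effects satisfy $\boldsymbol{\alpha}_0\neq\mathbf{0}$ and $\boldsymbol{\beta}_0\neq\mathbf{0}$, where $\boldsymbol{\alpha}_0=(\alpha_{20},\ldots,\alpha_{(K-1)0})$ and $\boldsymbol{\beta}_0=(\beta_{20},\ldots,\beta_{(J-1)0})$.

  • C3. $\ell(\boldsymbol{\theta})$ is upper-semicontinuous for almost all $(X,Y,Z)$, that is, for every $\boldsymbol{\theta}$,
    $$\limsup_{\boldsymbol{\theta}_n\to\boldsymbol{\theta}}\ell(X,Y,Z;\boldsymbol{\theta}_n)\le\ell(X,Y,Z;\boldsymbol{\theta}),\quad\text{a.s.}$$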

Remark 2.

Condition C1 is commonly used in regression models. Condition C2 is used for estimating the unknown thresholds and ensures the model is identifiable. The continuity requirements of X and Z are used to estimate the thresholds. Condition C3 is used to establish the consistency and the asymptotic normality of the maximum likelihood estimator.

In terms of estimation, we use a gradient-based method that depends on the first order derivative $\dot{\ell}(\boldsymbol{\theta})=\partial\ell(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ (details can be found in Appendix A), with the estimators initialized by 2SLS. In this paper, we do not approximate the indicator function by the logistic function as some researchers do (e.g., [18,26,27]). Gradient-based algorithms for the ReLU function have shown success in the context of deep learning and machine learning. Compared to the approximation techniques discussed in Section 1, model estimation with the ReLU function is computationally cheaper since no approximation of the indicator function is required. In fact, as long as Condition C2 is satisfied, which requires variables $X$ and $Z$ to be continuous, the gradients composed of the indicator functions converge to a continuous function of the threshold parameters as $n\to\infty$, for example,

$$\frac{1}{n}\sum_{i=1}^{n}I(z_i>c_k)\xrightarrow{P}E\,I(z_i>c_k)=P(z_i>c_k),$$

for $k=1,\ldots,K$ by the law of large numbers. Therefore, the second order derivative of the ReLU function with respect to the thresholds can be derived based on the resulting continuous probability function. More specifically, the second order derivative with respect to $c_k$ is simply $f_Z(c_k)$.
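The law-of-large-numbers statement above can be checked numerically. The sketch below assumes a standard normal instrument, as in the simulation design of Section 4, and compares the sample average of the indicator with the exact tail probability; the seed, sample size, and threshold value are arbitrary choices for illustration.

```python
import math
import random


def normal_sf(c):
    """P(Z > c) for a standard normal Z, computed from the error function."""
    return 0.5 * (1.0 - math.erf(c / math.sqrt(2.0)))


def indicator_mean(z, c):
    """Sample average of I(z_i > c)."""
    return sum(1 for zi in z if zi > c) / len(z)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Absolute gap between the empirical average and the limiting probability.
gap = abs(indicator_mean(z, 0.5) - normal_sf(0.5))
```

Although each indicator is a step function of the threshold, its average approaches the smooth function $P(Z>c)$, which is what makes derivatives with respect to the thresholds well defined in the limit.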

To prove the asymptotic normality, we first need to show the consistency of the proposed estimators.

Theorem 1.

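Under conditions C1–C3, assume that $\Theta$ is compact and the true parameter vector $\boldsymbol{\theta}_0=\arg\max_{\boldsymbol{\theta}\in\Theta}E\,\ell(\boldsymbol{\theta})$ is unique. Furthermore, for every sufficiently small ball $B\subset\Theta$, assume $\sup_{\boldsymbol{\theta}\in B}\ell(\boldsymbol{\theta})$ is measurable with $E\sup_{\boldsymbol{\theta}\in B}\ell(\boldsymbol{\theta})<\infty$. Then $\hat{\boldsymbol{\theta}}_n\xrightarrow{p}\boldsymbol{\theta}_0$.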

Proof. 

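The proof follows Theorem 5.7 of van der Vaart [28]. For completeness, we include it as Theorem A1 in Appendix B. To apply Theorem 5.7, we need to check the condition that $\ell_n(\hat{\boldsymbol{\theta}}_n)\ge\ell_n(\boldsymbol{\theta}_0)-o_P(1)$ for some $\boldsymbol{\theta}_0\in\Theta_0$. This is true since $\ell_n(\boldsymbol{\theta})$ is continuous in $\boldsymbol{\theta}$, $\ell_n(\boldsymbol{\theta})$ converges to $\ell(\boldsymbol{\theta})$ uniformly, and $\hat{\boldsymbol{\theta}}_n$ (approximately) maximizes $\ell_n(\boldsymbol{\theta})$. Thus, all the conditions are satisfied and the result follows. □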

Theorem 2.

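Under conditions C1–C3, let $\boldsymbol{\theta}_0$ be the true value of $\boldsymbol{\theta}$. Let $\dot{\ell}(\boldsymbol{\theta})$ be a measurable function with $E\{\dot{\ell}(\boldsymbol{\theta})\dot{\ell}(\boldsymbol{\theta})^T\}_{(i,j)}<\infty$ for $i,j=1,\ldots,|\boldsymbol{\theta}|$, where $|\boldsymbol{\theta}|$ denotes the number of elements in $\boldsymbol{\theta}$. Then

$$\sqrt{n}\,(\hat{\boldsymbol{\theta}}_n-\boldsymbol{\theta}_0)\xrightarrow{d}N\left(\mathbf{0},\,V_{\theta_0}^{-1}M_{\theta_0}V_{\theta_0}^{-1}\right),$$

where $M_{\theta_0}=E\,\dot{\ell}(\boldsymbol{\theta}_0)\dot{\ell}(\boldsymbol{\theta}_0)^T$, $\dot{\ell}(\boldsymbol{\theta}_0)$ is the first order derivative of $\ell(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ evaluated at $\boldsymbol{\theta}_0$, and $V_{\theta_0}$ is the second order derivative of $E\{\ell(\boldsymbol{\theta})\}$ with respect to $\boldsymbol{\theta}$ evaluated at $\boldsymbol{\theta}_0$ (derivations in Appendix A). $V_{\theta}$ has the form

$$V_{\theta}=V_{\theta}^{(1)}+V_{\theta}^{(2)}=V_{\theta}^{(1)}+\begin{pmatrix}
\mathbf{0} & \mathbf{0} & V_{\alpha c}^{(2)} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
 & \mathbf{0} & \mathbf{0} & V_{\beta t}^{(2)} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
 & & V_{cc}^{(2)} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
 & & & V_{tt}^{(2)} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
 & & & & 0 & 0 & 0\\
 & & & & & 0 & 0\\
\text{sym.} & & & & & & 0
\end{pmatrix},$$

where $\mathbf{0}$ denotes a zero vector or a zero matrix and $0$ denotes a scalar. Details of $V_{\theta}^{(1)}$ and $V_{\theta}^{(2)}$ are given in Appendix A.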

Proof. 

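First, note that $\ell(\boldsymbol{\theta})$ is Lipschitz continuous in $\boldsymbol{\theta}$. Moreover, the fact that $V_{\theta}$ is continuous in $\boldsymbol{\theta}$ admits the Taylor expansion of $E_{XYZ}\,\ell(\boldsymbol{\theta})$:

$$E_{(X,Y,Z)}\,\ell(\boldsymbol{\theta})=E_{(X,Y,Z)}\,\ell(\boldsymbol{\theta}_0)+\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_0)^TV_{\theta_0}(\boldsymbol{\theta}-\boldsymbol{\theta}_0)+o_p\left(\|\boldsymbol{\theta}-\boldsymbol{\theta}_0\|^2\right).$$

Since $\hat{\boldsymbol{\theta}}$ is the maximum likelihood estimate of $\boldsymbol{\theta}$, $\frac{1}{n}\sum_{i=1}^{n}\ell(u_i,v_i;\hat{\boldsymbol{\theta}})\ge\sup_{\boldsymbol{\theta}}\frac{1}{n}\sum_{i=1}^{n}\ell(u_i,v_i;\boldsymbol{\theta})-o_P(n^{-1})$. Combined with the result from Theorem 1 that $\hat{\boldsymbol{\theta}}_n\xrightarrow{p}\boldsymbol{\theta}_0$, we conclude from Theorem 5.14 of van der Vaart [28] that:

$$\sqrt{n}\,(\hat{\boldsymbol{\theta}}_n-\boldsymbol{\theta}_0)=-V_{\theta_0}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\dot{\ell}_i(\boldsymbol{\theta}_0)+o_P(1),$$

which implies an asymptotic normal distribution with mean $\mathbf{0}$ and variance-covariance matrix $V_{\theta_0}^{-1}M_{\theta_0}V_{\theta_0}^{-1}$. □

For completeness, we include Theorem 5.14 of van der Vaart [28] as Theorem A2 in Appendix B. When the model is correctly specified, $V_{\theta_0}=-M_{\theta_0}$ and the asymptotic variance is the inverse of the Fisher information. Matrices $V_{\theta_0}$ and $M_{\theta_0}$ are estimated by replacing $\boldsymbol{\theta}_0$ with the MLE $\hat{\boldsymbol{\theta}}_n$. Thus, for the correctly specified model, the variance-covariance matrix is estimated by the inverse of $M_{\hat{\theta}_n}$. For the misspecified model, the variance-covariance matrix is estimated by $V_{\hat{\theta}_n}^{-1}M_{\hat{\theta}_n}V_{\hat{\theta}_n}^{-1}$. Let us define $V_n$ as the second derivative of $\ell_n(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$; then we can decompose $V_n$ in the same way as $V_{\theta}$ into two matrices $V_n^{(1)}$ and $V_n^{(2)}$. Note that $V_n$ is the empirical counterpart of $V_{\theta}$ and $V_n\xrightarrow{p}V_{\theta}$ by the law of large numbers, so we use the estimated probability densities $\hat{f}_Z(\hat{c}_k)$ and $\hat{f}_X(\hat{t}_j)$ for $f_Z(c_k)$ and $f_X(t_j)$, for $k=1,\ldots,K$ and $j=1,\ldots,J$, respectively.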
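The robust (sandwich) form of the variance-covariance matrix can be illustrated with a small numerical sketch. The 2×2 matrices below are hypothetical numbers chosen for illustration, not estimates from the paper, and the helper functions are written for the 2×2 case only.

```python
def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul(a, b):
    """Plain matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def sandwich(V, M):
    """Robust variance V^{-1} M V^{-1}."""
    Vi = inv2(V)
    return matmul(matmul(Vi, M), Vi)


# Hypothetical outer-product matrix M and Hessian V (illustration only).
M = [[2.0, 0.5], [0.5, 1.0]]
V_correct = [[-2.0, -0.5], [-0.5, -1.0]]   # Hessian equals -M
V_misspec = [[-3.0, -0.5], [-0.5, -1.5]]   # Hessian differs from -M

robust_correct = sandwich(V_correct, M)    # collapses to the inverse of M
robust_misspec = sandwich(V_misspec, M)    # differs from the inverse of M
```

When the Hessian and the outer-product matrix agree up to sign, the sandwich reduces to the inverse of $M$; otherwise the two estimators differ, which is why the robust form is used under misspecification.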

4. Simulation Studies

In this section, we evaluate the performance of the proposed model using simulated datasets. We consider two scenarios with the same sample size $n=500$. We let the error terms $U$ and $V$ be jointly normally distributed with mean $\mathbf{0}$ and correlation $\rho\in\{0.2,0.5,0.8\}$. Here, we consider a common standard deviation $\sigma_u=\sigma_v=0.3$. In addition, we simulate the instrumental variable $Z\sim N(0,1)$. The first scenario has one threshold in $X$ and one threshold in $Z$, and it takes the following form:

$$x_i=1+0.5\times(z_i-0.5)_++z_i+v_i,$$
$$y_i=0.2+(x_i-0)_++0.5\times x_i+u_i.$$

The true values of the parameters in the PLIV model are $\boldsymbol{\alpha}=(1,0.5,1)$, $\boldsymbol{\beta}=(0.2,1,0.5)$, $c=0.5$, and $t=0$. The second scenario has two thresholds in $X$ and two thresholds in $Z$, and it takes the following form:

$$x_i=1+0.5\times(z_i+1)_++(z_i-1)_++z_i+v_i,$$
$$y_i=1+1.2\times(x_i+1)_++(x_i-2)_++0.5\times x_i+u_i.$$

The true parameters are $\boldsymbol{\alpha}=(1,0.5,1,1)$, $\boldsymbol{\beta}=(1,1.2,1,0.5)$, $\mathbf{c}=(-1,1)$, and $\mathbf{t}=(-1,2)$. We show the simulated piecewise linear instrumental variable models for scenario 1 and scenario 2 in Figure 2. We replicate the simulation 1000 times to evaluate the finite sample properties of the proposed PLIV method.
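A data-generating sketch for scenario 1 is given below. It takes the coefficients as printed in the text, draws the correlated errors from two independent standard normals, and uses the standard library only; the function name, the seed handling, and the choice to return the errors alongside the data are assumptions made for illustration.

```python
import math
import random


def phi(x, t):
    return x - t if x > t else 0.0


def simulate_scenario1(n, rho, sigma=0.3, seed=0):
    """Draw n tuples (x, y, z, u, v) from the scenario 1 design."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # (u, v) are bivariate normal with common sd sigma and correlation rho.
        u = sigma * e1
        v = sigma * (rho * e1 + math.sqrt(1.0 - rho ** 2) * e2)
        x = 1.0 + 0.5 * phi(z, 0.5) + z + v
        y = 0.2 + phi(x, 0.0) + 0.5 * x + u
        rows.append((x, y, z, u, v))
    return rows


sample = simulate_scenario1(n=500, rho=0.5, seed=1)
```

Because `u` enters `y` and `v` enters `x`, a nonzero `rho` makes `x` endogenous, which is the situation the instrument `z` is meant to address.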

Figure 2. Piecewise linear instrumental variable models with simulated data for scenario 1 and scenario 2. The upper panel plots the simulated X versus Z and Y versus X for scenario 1. The lower panel plots the simulated X versus Z and Y versus X for scenario 2.

Table 1 summarizes the biases, standard errors of $\hat{\boldsymbol{\theta}}$, and coverage probabilities of $\boldsymbol{\theta}$ by the proposed PLIV method for scenario 1, where tse is the theoretical standard error and ese is the empirical standard error. As we can see in the table, the biases of $\hat{\boldsymbol{\theta}}$ are close to zero. We also find that the theoretical and empirical standard errors are close for most parameters, which supports the validity of our theoretical results in Section 3. The results show that our model estimation is quite accurate. In addition, the coverage probabilities are around 95% for most parameters under different values of $\rho$, although the coverage for the threshold $c$ is lower (about 84% to 86%). Moreover, the biases and standard errors decrease as we increase $\rho$ because the instrumental variables become stronger.

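Table 1. Empirical biases, theoretical standard errors (tse), and empirical standard errors (ese) of $\hat{\boldsymbol{\theta}}$, as well as 95% coverage probabilities (cp) on $\boldsymbol{\theta}$ for scenario 1.

| Parameter | bias (ρ=0.2) | tse (ρ=0.2) | ese (ρ=0.2) | cp (ρ=0.2) | bias (ρ=0.5) | tse (ρ=0.5) | ese (ρ=0.5) | cp (ρ=0.5) | bias (ρ=0.8) | tse (ρ=0.8) | ese (ρ=0.8) | cp (ρ=0.8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α0 | −19.25 | 41.25 | 45.80 | 937 | −16.43 | 38.26 | 41.56 | 939 | −9.10 | 32.08 | 33.78 | 940 |
| α1 | 7.65 | 98.27 | 102.66 | 927 | 6.36 | 93.13 | 97.02 | 924 | 4.10 | 77.32 | 81.80 | 919 |
| α2 | −16.95 | 46.20 | 47.71 | 931 | −14.79 | 42.82 | 43.64 | 933 | −8.28 | 33.52 | 34.34 | 943 |
| β0 | −7.86 | 55.41 | 54.87 | 950 | −6.88 | 52.37 | 52.74 | 944 | −4.28 | 43.92 | 44.80 | 945 |
| β1 | 0.48 | 80.58 | 77.07 | 955 | −0.35 | 75.48 | 74.69 | 942 | −0.58 | 60.37 | 62.50 | 940 |
| β2 | −4.35 | 34.57 | 34.06 | 947 | −3.84 | 32.49 | 32.60 | 945 | −2.38 | 26.21 | 26.57 | 933 |
| c | −95.15 | 178.21 | 247.82 | 839 | −82.89 | 159.34 | 224.83 | 846 | −46.25 | 113.96 | 165.49 | 864 |
| t | −14.88 | 97.77 | 108.77 | 922 | −12.71 | 87.80 | 101.10 | 908 | −6.76 | 62.69 | 71.68 | 908 |
| ρ | 2.82 | 48.99 | 47.54 | 951 | 2.67 | 37.91 | 36.81 | 947 | 1.62 | 17.70 | 17.22 | 941 |
| σ² | −2.32 | 14.00 | 13.72 | 954 | −1.85 | 15.65 | 15.40 | 953 | −1.10 | 18.12 | 17.82 | 956 |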

Note: all numbers are multiplied by 1000. These results are based on 1000 replications.
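The summary columns reported in the table can be computed from replication output along the following lines. This is a sketch under assumptions: it takes tse to be the average of the model-based standard errors across replications and uses Wald intervals with the normal critical value 1.96, neither of which is spelled out in the text. The toy numbers are made up for illustration.

```python
import math


def mc_summary(estimates, std_errors, truth, z_crit=1.96):
    """Return (bias, tse, ese, cp) for one parameter across replications."""
    r = len(estimates)
    mean_est = sum(estimates) / r
    bias = mean_est - truth
    # Empirical standard error: sample sd of the estimates.
    ese = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (r - 1))
    # Theoretical standard error: average model-based se (assumption).
    tse = sum(std_errors) / r
    # Coverage: share of Wald intervals that contain the true value.
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if abs(e - truth) <= z_crit * s)
    return bias, tse, ese, covered / r


# Toy input: four made-up replications of an estimate whose true value is 1.0.
bias, tse, ese, cp = mc_summary([0.9, 1.1, 1.0, 1.3], [0.1, 0.1, 0.1, 0.1], 1.0)
```

Multiplying each returned value by 1000 would give numbers on the same scale as the table entries.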

Table 2 summarizes the biases, standard errors of $\hat{\boldsymbol{\theta}}$, and 95% coverage probabilities of $\boldsymbol{\theta}$ by the PLIV method for scenario 2, where tse is the theoretical standard error and ese is the empirical standard error. We find patterns similar to those in Table 1 for scenario 1. For instance, all the biases are small, and the theoretical and empirical standard errors are close to each other. Most coverage probabilities are around 95% when $\rho=0.2$ and $\rho=0.5$. We also observe that the coverage probabilities of the thresholds are slightly low when $\rho=0.8$. This might be due to the high correlation between the errors: with multiple thresholds and high correlation, it is challenging to estimate the exact threshold locations.

Table 2. Empirical biases, theoretical standard errors (tse), and empirical standard errors (ese) of $\hat{\boldsymbol{\theta}}$, as well as 95% coverage probabilities (cp) on $\boldsymbol{\theta}$ for scenario 2.

| Parameter | bias (ρ=0.2) | tse (ρ=0.2) | ese (ρ=0.2) | cp (ρ=0.2) | bias (ρ=0.5) | tse (ρ=0.5) | ese (ρ=0.5) | cp (ρ=0.5) | bias (ρ=0.8) | tse (ρ=0.8) | ese (ρ=0.8) | cp (ρ=0.8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α0 | −51.88 | 268.22 | 247.08 | 946 | −38.92 | 232.37 | 226.53 | 939 | −20.83 | 158.06 | 169.46 | 921 |
| α1 | 29.20 | 176.58 | 157.46 | 966 | 24.67 | 157.87 | 143.26 | 965 | 13.44 | 110.56 | 107.65 | 949 |
| α2 | 15.11 | 172.47 | 166.40 | 943 | 11.80 | 178.03 | 163.63 | 949 | 11.40 | 146.19 | 143.76 | 955 |
| α3 | −26.32 | 164.95 | 147.35 | 945 | −19.39 | 144.98 | 135.53 | 931 | −9.21 | 101.13 | 101.32 | 934 |
| β0 | −8.36 | 120.42 | 116.63 | 944 | −8.23 | 111.05 | 108.00 | 950 | −0.84 | 85.31 | 82.56 | 958 |
| β1 | 6.61 | 71.82 | 71.49 | 947 | 6.57 | 66.84 | 66.57 | 948 | 3.39 | 52.07 | 52.12 | 950 |
| β2 | 6.44 | 115.13 | 99.07 | 966 | 5.38 | 106.29 | 90.78 | 969 | 3.30 | 83.05 | 75.06 | 962 |
| β3 | −4.14 | 57.89 | 56.20 | 947 | −4.33 | 53.69 | 52.40 | 950 | −1.10 | 41.80 | 40.31 | 955 |
| c1 | −3.01 | 253.38 | 246.83 | 930 | 9.41 | 221.21 | 257.36 | 924 | 6.90 | 152.06 | 218.68 | 898 |
| c2 | 2.15 | 120.17 | 138.80 | 913 | 5.07 | 139.96 | 140.17 | 901 | 9.10 | 84.42 | 134.44 | 880 |
| t1 | 0.79 | 76.25 | 79.60 | 944 | 1.04 | 68.31 | 72.98 | 939 | 4.57 | 48.70 | 49.52 | 935 |
| t2 | 18.65 | 168.54 | 189.81 | 926 | 17.60 | 149.74 | 174.54 | 911 | 16.26 | 104.90 | 158.56 | 922 |
| ρ | 2.87 | 47.44 | 45.58 | 950 | 3.40 | 36.81 | 35.35 | 953 | 2.14 | 17.37 | 16.77 | 948 |
| σ² | −3.64 | 14.00 | 13.64 | 939 | −2.99 | 15.55 | 15.21 | 946 | −1.84 | 17.99 | 17.63 | 955 |

Note: all numbers are multiplied by 1000. These results are based on 1000 replications.

We include results with a sample size of 1000 in Appendix C, while fixing ρ=0.5. Overall, as n increases, we observe that both biases and standard errors drop.

5. Application

In this section, we revisit Card’s education data [5]. We apply the proposed model to study the causal effect of years of schooling on hourly wage in cents, with father’s years of schooling as the instrumental variable. The interest here is to find a threshold and study the threshold effect of the years of schooling. It is generally believed that a child’s years of schooling has a direct effect on the child’s wage, while parents’ education only affects the child’s income by affecting the child’s education level. In other words, parents’ education level has no direct effect on the child’s wage. Therefore, the father’s years of schooling can be treated as a valid instrumental variable.

In Card’s data, we remove the missing values and include a total of n=2657 observations. The explanatory variable X (child’s years of education) is between 1 and 18 with median 13, and the instrumental variable Z (father’s years of education) has minimum 0, maximum 18, and median 12. Figure 3 indicates that variables X and Y are skewed and have heavy tails so transformations are needed before the analysis. A log transformation is applied to both.

Figure 3. Histogram plots of the raw data X, Y, and Z.

Table 3 shows the point estimate, standard error, and associated 95% confidence interval of $\boldsymbol{\theta}$ by the proposed model with $K=1$ and $J=0$, which are selected by BIC. In the table, $\alpha_1$ and $c$ are the coefficient and threshold for the transformed father’s years of schooling, respectively, and $\beta_1$ is the causal effect of years of schooling on earnings. The estimated causal effect of interest $\hat{\beta}_1$ is 0.87, which means that wage is multiplied by a factor of $\exp(0.87\times a)$ when the log of years of schooling increases by $a$ units. In economics, $\hat{\beta}_1$ is interpreted as an “elasticity”. That is, if years of education increases by 1%, the person’s income will increase by about 0.87% by our estimation. In terms of the instrumental variable, the threshold $c$ is estimated to be 7.86. The corresponding p-value is not calculated since testing $c=0$ is meaningless in this context. The estimate shows that there exists a threshold at around 8 in the father’s years of schooling. That is, the father’s years of schooling only has a positive effect on the child’s years of schooling if the father receives at least 8 years of education. This information cannot be observed if the traditional 2SLS method or nonparametric approaches are applied to analyze the data. The threshold effect and the remaining coefficients are all statistically significant since their corresponding p-values are far less than 0.05.
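The elasticity reading of $\hat{\beta}_1$ can be verified with a line of arithmetic. The sketch below assumes the log-log relationship described in the text, so that a change of $a$ in log schooling multiplies wage by $\exp(\hat{\beta}_1 a)$; the variable and function names are illustrative.

```python
import math

beta1_hat = 0.87  # point estimate reported in Table 3


def wage_ratio(a, beta=beta1_hat):
    """Multiplicative change in wage when log schooling increases by a."""
    return math.exp(beta * a)


# A 1% increase in years of schooling corresponds to a = log(1.01).
pct_change = 100.0 * (wage_ratio(math.log(1.01)) - 1.0)
```

The resulting percentage change is close to 0.87, which is the sense in which the coefficient acts as an elasticity for small relative changes.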

Table 3. Summary table of $\boldsymbol{\theta}$ by the SML-PLIV model.

| Parameter | Estimate | Std. Error | z Value | 95% C.I. | p-Value |
|---|---|---|---|---|---|
| α0: intercept | 2.25 | 0.013 | 168.8 | (2.222, 2.274) | ≈0 |
| α1: (Z−c)+ | −0.02 | 0.003 | −4.8 | (−0.023, −0.009) | ≈0 |
| α2: Z | 0.04 | 0.003 | 14.3 | (0.033, 0.043) | ≈0 |
| β0: intercept | 4.04 | 0.217 | 18.6 | (3.613, 4.464) | ≈0 |
| β1: log X | 0.87 | 0.084 | 10.4 | (0.705, 1.033) | ≈0 |
| c | 7.86 | 0.939 | 8.4 | (6.016, 9.696) | - |

6. Discussion, Limitations, and Future Research

In this paper, we propose a simultaneous maximum likelihood estimation for a piecewise linear instrumental variable model. We use the two-stage least square estimators as the initial values and the limited information maximum likelihood methods to estimate the regression coefficients and the threshold parameters simultaneously. We also provide a robust inference of the proposed model. The proposed model with the piecewise linear functions allows us to find the thresholds for both the explanatory and the instrumental variables, which generalizes the traditional linear instrumental variable models. In the simulation study, we evaluate the performance of the proposed model and find that it behaves well in terms of the biases, standard errors, and coverage probabilities in different settings.

In our model, we include a single continuous explanatory variable and a single continuous instrumental variable. More complicated cases can be considered. For example, developing a piecewise linear model for count data might be interesting. However, finding the optimal number of thresholds as well as their locations is challenging from a theoretical standpoint. Furthermore, we assume the numbers of thresholds $K$ and $J$ are prespecified. Treating the numbers of thresholds as random variables, finding their optimal values, and investigating the theoretical properties are directions for future research.

Acknowledgments

We thank the editor, the associate editor, and the three reviewers for careful reviews and insightful comments, which have improved this article.

Appendix A. Derivation of the Information and Hessian Matrices

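The log-likelihood to be maximized is

$$\ell(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left\{-\log\sigma_u\sigma_v-\frac{1}{2}\log(1-\rho^2)-\frac{1}{2(1-\rho^2)}\left(\frac{u_i^2}{\sigma_u^2}-\frac{2\rho u_i v_i}{\sigma_u\sigma_v}+\frac{v_i^2}{\sigma_v^2}\right)\right\}.$$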
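To make the objective concrete, here is a minimal sketch of this average log-likelihood for the special case of one threshold in each stage (K = J = 1). The parametrization x = α0 + α1(z − c)+ + α2 z + v and y = β0 + β1(x − t)+ + β2 x + u is an assumption based on the parameter labels used in the tables, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def hinge(x, knot):
    """Positive part (x - knot)_+ used by the piecewise linear terms."""
    return max(x - knot, 0.0)


def avg_loglik(data, alpha, c, beta, t, rho, sigma_u, sigma_v):
    """Average bivariate-normal log-likelihood with the additive constant dropped.

    data  : list of (z, x, y) triples
    alpha : (alpha0, alpha1, alpha2), assumed first stage
            x = alpha0 + alpha1 * (z - c)_+ + alpha2 * z + v
    beta  : (beta0, beta1, beta2), assumed second stage
            y = beta0 + beta1 * (x - t)_+ + beta2 * x + u
    """
    total = 0.0
    for z, x, y in data:
        v = x - (alpha[0] + alpha[1] * hinge(z, c) + alpha[2] * z)
        u = y - (beta[0] + beta[1] * hinge(x, t) + beta[2] * x)
        quad = ((u / sigma_u) ** 2
                - 2 * rho * u * v / (sigma_u * sigma_v)
                + (v / sigma_v) ** 2)
        total += (-math.log(sigma_u * sigma_v)
                  - 0.5 * math.log(1 - rho ** 2)
                  - quad / (2 * (1 - rho ** 2)))
    return total / len(data)
```

With residuals equal to zero the value reduces to −log(σuσv) − ½ log(1 − ρ²), which gives a quick sanity check of an implementation.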

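When the model is specified,

$$E_{XYZ}\,\ell(\theta)=-\log\sigma_u\sigma_v-\frac{1}{2}\log(1-\rho^2)-\frac{1}{2(1-\rho^2)}E_{XYZ}\left(\frac{U^TU}{\sigma_u^2}-\frac{2\rho U^TV}{\sigma_v\sigma_u}+\frac{V^TV}{\sigma_v^2}\right).$$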

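To write out the first order derivative $\dot{\ell}(\theta)$ of $\ell(\theta)$ with respect to $\theta$, we define the following notation. $\partial\ell(\theta)/\partial\alpha_c$ is the row concatenation of the first order derivatives of $\ell(\theta)$ with respect to $\alpha$ and $c$, and $\partial\ell(\theta)/\partial\beta_t$ is the row concatenation of the first order derivatives of $\ell(\theta)$ with respect to $\beta$ and $t$. For notational simplicity, we drop the subscript $i$. Let $\alpha I(z>c)=\{\alpha_1 I(z>c_1),\ldots,\alpha_K I(z>c_K)\}$ and $\beta I(x>t)=\{\beta_1 I(x>t_1),\ldots,\beta_J I(x>t_J)\}$. Then we can divide the first order derivative $\dot{\ell}(\theta)$ as follows:

$$\begin{aligned}
\frac{\partial\ell(\theta)}{\partial\alpha_c}&=\frac{1}{n}\sum_{i=1}^{n}\left\{1,(z-c)_+,z,-\alpha I(z>c)\right\}^T\frac{1}{1-\rho^2}\left(\frac{v}{\sigma_v^2}-\frac{\rho u}{\sigma_u\sigma_v}\right),\\
\frac{\partial\ell(\theta)}{\partial\beta_t}&=\frac{1}{n}\sum_{i=1}^{n}\left\{1,(x-t)_+,x,-\beta I(x>t)\right\}^T\frac{1}{1-\rho^2}\left(\frac{u}{\sigma_u^2}-\frac{\rho v}{\sigma_u\sigma_v}\right),\\
\frac{\partial\ell(\theta)}{\partial\rho}&=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\rho}{1-\rho^2}-\frac{\rho}{(1-\rho^2)^2}\left(\frac{u^2}{\sigma_u^2}-\frac{2\rho uv}{\sigma_u\sigma_v}+\frac{v^2}{\sigma_v^2}\right)+\frac{uv}{\sigma_v\sigma_u(1-\rho^2)}\right\},\\
\frac{\partial\ell(\theta)}{\partial\sigma_u}&=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{u^2}{(1-\rho^2)\sigma_u^3}-\frac{\rho uv}{(1-\rho^2)\sigma_v\sigma_u^2}-\frac{1}{\sigma_u}\right\},\\
\frac{\partial\ell(\theta)}{\partial\sigma_v}&=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{v^2}{(1-\rho^2)\sigma_v^3}-\frac{\rho uv}{(1-\rho^2)\sigma_u\sigma_v^2}-\frac{1}{\sigma_v}\right\}.
\end{aligned}\tag{A1}$$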
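The score components in (A1) for ρ, σu, and σv can be sanity-checked numerically. The sketch below compares them, for a single observation with fixed residuals (u, v), against central finite differences of the per-observation log-likelihood; it also checks that the derivative with respect to v equals minus the common factor that appears in the α and c components. The helper names and the test point are arbitrary choices for illustration.

```python
import math


def loglik(u, v, rho, su, sv):
    """Per-observation log-likelihood contribution (additive constant dropped)."""
    quad = u**2 / su**2 - 2 * rho * u * v / (su * sv) + v**2 / sv**2
    return -math.log(su * sv) - 0.5 * math.log(1 - rho**2) - quad / (2 * (1 - rho**2))


def score_rho(u, v, rho, su, sv):
    quad = u**2 / su**2 - 2 * rho * u * v / (su * sv) + v**2 / sv**2
    return (rho / (1 - rho**2) - rho * quad / (1 - rho**2) ** 2
            + u * v / (su * sv * (1 - rho**2)))


def score_sigma_u(u, v, rho, su, sv):
    return (u**2 / ((1 - rho**2) * su**3)
            - rho * u * v / ((1 - rho**2) * sv * su**2) - 1 / su)


def score_sigma_v(u, v, rho, su, sv):
    return (v**2 / ((1 - rho**2) * sv**3)
            - rho * u * v / ((1 - rho**2) * su * sv**2) - 1 / sv)


def factor_v(u, v, rho, su, sv):
    """Common factor multiplying {1, (z-c)_+, z, ...} in the alpha/c score."""
    return (v / sv**2 - rho * u / (su * sv)) / (1 - rho**2)


def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)


u, v, rho, su, sv = 0.7, -0.4, 0.3, 1.5, 0.8  # arbitrary test point
checks = {
    "rho": (central_diff(lambda r: loglik(u, v, r, su, sv), rho),
            score_rho(u, v, rho, su, sv)),
    "sigma_u": (central_diff(lambda s: loglik(u, v, rho, s, sv), su),
                score_sigma_u(u, v, rho, su, sv)),
    "sigma_v": (central_diff(lambda s: loglik(u, v, rho, su, s), sv),
                score_sigma_v(u, v, rho, su, sv)),
    "v": (central_diff(lambda r: loglik(u, r, rho, su, sv), v),
          -factor_v(u, v, rho, su, sv)),
}
for name, (numeric, analytic) in checks.items():
    print(f"{name}: numeric={numeric:.8f} analytic={analytic:.8f}")
```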

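The interchangeability of expectation and differentiation is satisfied here, which implies $\partial E_{XYZ}\ell(\theta)/\partial\theta=E_{XYZ}\dot{\ell}(\theta)$. It is easy to check that $\partial E_{XYZ}\ell(\theta)/\partial\theta=0$ at $\theta_0$, as it should be. We next derive the second order derivative $V_\theta$ of $E_{XYZ}\ell(\theta)$ when the model is specified. We partition the symmetric matrix $V_\theta$ into two symmetric matrices $V_\theta^{(1)}$ and $V_\theta^{(2)}$ such that

$$V_\theta=V_\theta^{(1)}+V_\theta^{(2)}=V_\theta^{(1)}+\begin{pmatrix}
0 & 0 & V_{\alpha c}^{(2)} & 0 & 0 & 0 & 0\\
 & 0 & 0 & V_{\beta t}^{(2)} & 0 & 0 & 0\\
 & & V_{cc}^{(2)} & 0 & 0 & 0 & 0\\
 & & & V_{tt}^{(2)} & 0 & 0 & 0\\
 & & & & 0 & 0 & 0\\
 & & & & & 0 & 0\\
\text{sym.} & & & & & & 0
\end{pmatrix}.$$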

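For the derivation of $V_\theta^{(1)}$, let $z_c=\{1,(z-c)_+,z\}$ and $x_t=\{1,(x-t)_+,x\}$. Since the matrix $V_\theta^{(1)}$ is symmetric, we only need to derive the upper triangular elements. The first row of $V_\theta^{(1)}$ is the row concatenation of $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha^2$, $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial\beta$, $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial c$, $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial t$, $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial\rho$, $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial\sigma_u$, and $\partial^2E_{XYZ}\ell(\theta)/\partial\alpha\partial\sigma_v$, such that

$$V_{1,\theta}^{(1)}=\frac{1}{1-\rho^2}E_{XYZ}\,(z_c)^T\left\{-\frac{z_c}{\sigma_v^2},\ \frac{\rho x_t}{\sigma_v\sigma_u},\ \frac{\alpha I(z>c)}{\sigma_v^2},\ -\frac{\rho\beta I(x>t)}{\sigma_v\sigma_u},\ \frac{2\rho v}{\sigma_v^2(1-\rho^2)}-\frac{u(1+\rho^2)}{(1-\rho^2)\sigma_v\sigma_u},\ \frac{\rho u}{\sigma_v\sigma_u^2},\ \frac{\rho u}{\sigma_v^2\sigma_u}-\frac{2v}{\sigma_v^3}\right\}.$$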

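The second row of $V_\theta^{(1)}$ is the row concatenation of $\partial^2E_{XYZ}\ell(\theta)/\partial\beta^2$, $\partial^2E_{XYZ}\ell(\theta)/\partial\beta\partial c$, $\partial^2E_{XYZ}\ell(\theta)/\partial\beta\partial t$, $\partial^2E_{XYZ}\ell(\theta)/\partial\beta\partial\rho$, $\partial^2E_{XYZ}\ell(\theta)/\partial\beta\partial\sigma_u$, and $\partial^2E_{XYZ}\ell(\theta)/\partial\beta\partial\sigma_v$, such that

$$V_{2,\theta}^{(1)}=\frac{1}{1-\rho^2}E_{XYZ}\,(x_t)^T\left\{-\frac{x_t}{\sigma_u^2},\ -\frac{\rho\alpha I(z>c)}{\sigma_u\sigma_v},\ \frac{\beta I(x>t)}{\sigma_v^2},\ \frac{2\rho u}{\sigma_u^2(1-\rho^2)}-\frac{v(1+\rho^2)}{(1-\rho^2)\sigma_v\sigma_u},\ \frac{\rho v}{\sigma_v^3}-\frac{2u}{\sigma_u^3},\ \frac{\rho v}{\sigma_u\sigma_v^2}\right\}.$$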

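The third row of $V_\theta^{(1)}$ is the row concatenation of $\partial^2E_{XYZ}\ell(\theta)/\partial c^2$, $\partial^2E_{XYZ}\ell(\theta)/\partial c\partial t$, $\partial^2E_{XYZ}\ell(\theta)/\partial c\partial\rho$, $\partial^2E_{XYZ}\ell(\theta)/\partial c\partial\sigma_u$, and $\partial^2E_{XYZ}\ell(\theta)/\partial c\partial\sigma_v$, such that

$$V_{3,\theta}^{(1)}=\frac{1}{1-\rho^2}E_{XYZ}\,\{\alpha I(z>c)\}^T\left\{-\frac{\alpha I(z>c)}{\sigma_v\sigma_u},\ \frac{\beta I(x>t)}{\sigma_u^2},\ \frac{v(\rho^2+1)}{\sigma_u\sigma_v(1-\rho^2)}-\frac{2\rho v}{\sigma_v^2(1-\rho^2)},\ -\frac{\rho u}{\sigma_v\sigma_u^2},\ \frac{2v}{\sigma_v^3}-\frac{\rho u}{\sigma_u\sigma_v^2}\right\}.$$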

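The fourth row of $V_\theta^{(1)}$ is the row concatenation of $\partial^2E_{XYZ}\ell(\theta)/\partial t^2$, $\partial^2E_{XYZ}\ell(\theta)/\partial t\partial\rho$, $\partial^2E_{XYZ}\ell(\theta)/\partial t\partial\sigma_u$, and $\partial^2E_{XYZ}\ell(\theta)/\partial t\partial\sigma_v$, such that

$$V_{4,\theta}^{(1)}=\frac{1}{1-\rho^2}E_{XYZ}\,\{\beta I(x>t)\}^T\left\{-\frac{\beta I(x>t)}{\sigma_v^2},\ \frac{v(1+\rho^2)}{\sigma_v\sigma_u(1-\rho^2)}-\frac{2\rho u}{(1-\rho^2)\sigma_u^2},\ \frac{2u}{\sigma_v^3}-\frac{\rho v}{\sigma_v\sigma_u^2},\ -\frac{\rho v}{\sigma_u\sigma_v^2}\right\}.$$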

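The remaining terms in $V_\theta^{(1)}$ are given by

$$\begin{aligned}
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\rho^2}&=\frac{1+\rho^2}{(1-\rho^2)^2}-\frac{4uv\rho(\rho^2+1)}{\sigma_u\sigma_v(\rho^2-1)^3}+\frac{2\rho uv}{\sigma_u\sigma_v(\rho^2-1)^2},\\
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\rho\,\partial\sigma_u}&=\frac{2\rho u^2}{(1-\rho^2)^2\sigma_u^3}-\frac{2uv\rho^2}{\sigma_u^2\sigma_v(\rho^2-1)^2}-\frac{uv}{\sigma_u^2\sigma_v(1-\rho^2)},\\
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\rho\,\partial\sigma_v}&=\frac{2\rho v^2}{(1-\rho^2)^2\sigma_v^3}-\frac{2uv\rho^2}{\sigma_u\sigma_v^2(\rho^2-1)^2}-\frac{uv}{\sigma_u\sigma_v^2(1-\rho^2)},\\
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\sigma_u^2}&=\frac{2\rho uv}{(1-\rho^2)\sigma_v\sigma_u^3}-\frac{3u^2}{\sigma_u^4(1-\rho^2)}+\frac{1}{\sigma_u^2},\\
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\sigma_u\,\partial\sigma_v}&=\frac{\rho uv}{(1-\rho^2)\sigma_v^2\sigma_u^2},\\
\frac{\partial^2E_{XYZ}\ell(\theta)}{\partial\sigma_v^2}&=\frac{2\rho uv}{(1-\rho^2)\sigma_u\sigma_v^3}-\frac{3v^2}{\sigma_v^4(1-\rho^2)}+\frac{1}{\sigma_v^2}.
\end{aligned}$$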
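Several of these second-order terms can be verified numerically for a single observation by differentiating the σu component of the score once more. The sketch below does this with central finite differences for the (σu, σu), (σu, σv), and (ρ, σu) entries; it takes the signs of the score from the log-likelihood definition, and the function names and test point are hypothetical choices for illustration.

```python
def score_sigma_u(u, v, rho, su, sv):
    """First derivative of the per-observation log-likelihood w.r.t. sigma_u."""
    return (u**2 / ((1 - rho**2) * su**3)
            - rho * u * v / ((1 - rho**2) * sv * su**2) - 1 / su)


def d2_sigma_u2(u, v, rho, su, sv):
    return (2 * rho * u * v / ((1 - rho**2) * sv * su**3)
            - 3 * u**2 / (su**4 * (1 - rho**2)) + 1 / su**2)


def d2_sigma_u_sigma_v(u, v, rho, su, sv):
    return rho * u * v / ((1 - rho**2) * sv**2 * su**2)


def d2_rho_sigma_u(u, v, rho, su, sv):
    return (2 * rho * u**2 / ((1 - rho**2) ** 2 * su**3)
            - 2 * u * v * rho**2 / (su**2 * sv * (rho**2 - 1) ** 2)
            - u * v / (su**2 * sv * (1 - rho**2)))


def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)


u, v, rho, su, sv = 0.7, -0.4, 0.3, 1.5, 0.8  # arbitrary test point
second_order_checks = {
    "sigma_u, sigma_u": (central_diff(lambda s: score_sigma_u(u, v, rho, s, sv), su),
                         d2_sigma_u2(u, v, rho, su, sv)),
    "sigma_u, sigma_v": (central_diff(lambda s: score_sigma_u(u, v, rho, su, s), sv),
                         d2_sigma_u_sigma_v(u, v, rho, su, sv)),
    "rho, sigma_u": (central_diff(lambda r: score_sigma_u(u, v, r, su, sv), rho),
                     d2_rho_sigma_u(u, v, rho, su, sv)),
}
for name, (numeric, analytic) in second_order_checks.items():
    print(f"{name}: numeric={numeric:.8f} analytic={analytic:.8f}")
```

The entries involving σv follow from the same check by symmetry, swapping the roles of (u, σu) and (v, σv).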

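In terms of the matrix $V_\theta^{(2)}$, we decompose the following elements:

$$\begin{aligned}
V_{\alpha c}^{(2)}&=E_{XYZ}\left\{\left(\frac{v}{\sigma_v^2(1-\rho^2)}-\frac{\rho u}{\sigma_u\sigma_v(1-\rho^2)}\right)\times\begin{pmatrix}0&\cdots&0\\-I(z>c_1)&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&-I(z>c_K)\\0&\cdots&0\end{pmatrix}\right\},\\
V_{\beta t}^{(2)}&=E_{XYZ}\left\{\left(\frac{u}{\sigma_u^2(1-\rho^2)}-\frac{\rho v}{\sigma_u\sigma_v(1-\rho^2)}\right)\times\begin{pmatrix}0&\cdots&0\\-I(x>t_1)&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&-I(x>t_J)\\0&\cdots&0\end{pmatrix}\right\},\\
V_{cc}^{(2)}&=E_{XYZ}\left\{\frac{1}{\sigma_u\sigma_v(1-\rho^2)}\right\}\times\begin{pmatrix}\alpha_1f_Z(c_1)&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&\alpha_Kf_Z(c_K)\end{pmatrix},\\
V_{tt}^{(2)}&=E_{XYZ}\left\{\frac{1}{\sigma_v^2(\rho^2-1)}\right\}\times\begin{pmatrix}\beta_1f_X(t_1)&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&\beta_Jf_X(t_J)\end{pmatrix}.
\end{aligned}$$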

It is easy to check that when the model is correctly specified, $V_\theta^{(2)}=0$ and $V_\theta=-E_{XYZ}\,\dot{\ell}(\theta)\dot{\ell}(\theta)^T$.

Appendix B. Theorems

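Define $Pf$ as the expectation $Ef(X)=\int f\,dP$ and abbreviate the average $n^{-1}\sum_{i=1}^{n}f(X_i)$ to $P_nf$, where $P_n$ is the empirical distribution. Furthermore, we define

$$M_n(\theta)=\frac{1}{n}\sum_{i=1}^{n}m_\theta(X_i)=P_nm_\theta\quad\text{and}\quad\Psi_n(\theta)=\frac{1}{n}\sum_{i=1}^{n}\psi_\theta(X_i)=P_n\psi_\theta.$$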

Theorem A1

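(Theorem 5.7 of van der Vaart [28]). Let $M_n$ be random functions and let $M$ be a fixed function of $\theta$ such that for every $\varepsilon>0$

$$\sup_{\theta\in\Theta}\left|M_n(\theta)-M(\theta)\right|\xrightarrow{P}0,\qquad\sup_{\theta:\,d(\theta,\theta_0)\geq\varepsilon}M(\theta)<M(\theta_0).$$

Then every sequence of estimators $\hat{\theta}_n$ with $M_n(\hat{\theta}_n)\geq M_n(\theta_0)-o_P(1)$ converges in probability to $\theta_0$.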

Theorem A2

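(Theorem 5.41 of van der Vaart [28]). For each $\theta$ in an open subset of Euclidean space, let $\theta\mapsto\psi_\theta(x)$ be twice continuously differentiable for every $x$. Suppose that $P\psi_{\theta_0}=0$, that $P\|\psi_{\theta_0}\|^2<\infty$, and that the matrix $P\dot{\psi}_{\theta_0}$ exists and is nonsingular. Assume that the second-order partial derivatives are dominated by a fixed integrable function $\ddot{\psi}(x)$ for every $\theta$ in a neighborhood of $\theta_0$. Then every consistent estimator sequence $\hat{\theta}_n$ such that $\Psi_n(\hat{\theta}_n)=0$ for every $n$ satisfies

$$\sqrt{n}\left(\hat{\theta}_n-\theta_0\right)=-\left(P\dot{\psi}_{\theta_0}\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_{\theta_0}(X_i)+o_P(1).$$

In particular, the sequence $\sqrt{n}(\hat{\theta}_n-\theta_0)$ is asymptotically normal with mean zero and covariance matrix $\left(P\dot{\psi}_{\theta_0}\right)^{-1}P\psi_{\theta_0}\psi_{\theta_0}^T\left(P\dot{\psi}_{\theta_0}\right)^{-1}$.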
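The covariance formula in Theorem A2 is the familiar sandwich form. As a minimal illustration, and not the estimator used in the paper, the sketch below applies it to a toy scalar Z-estimator: the slope θ in y = θx + error with estimating function ψθ(x, y) = x(y − θx). The expectations are replaced by sample averages, and all names are hypothetical.

```python
import math


def sandwich_slope(xs, ys):
    """Slope through the origin with a sandwich (robust) standard error.

    psi_theta(x, y) = x * (y - theta * x), so
    psi_dot = -x**2 (the "bread") and psi**2 is the "meat".
    """
    n = len(xs)
    # Root of Psi_n(theta) = (1/n) * sum(x * (y - theta * x)) = 0
    theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    bread = -sum(x * x for x in xs) / n                       # P_n psi_dot
    meat = sum((x * (y - theta_hat * x)) ** 2
               for x, y in zip(xs, ys)) / n                   # P_n psi^2
    avar = meat / bread**2          # variance of sqrt(n) * (theta_hat - theta_0)
    std_error = math.sqrt(avar / n)
    return theta_hat, std_error


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
theta_hat, se = sandwich_slope(xs, ys)
print(f"theta_hat={theta_hat:.4f}, robust se={se:.4f}")
```

In the multi-parameter case the bread and meat become matrices and the division is replaced by matrix inverses on both sides, exactly as in the covariance expression of the theorem.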

Appendix C. Additional Simulation Results

Table A1.

Empirical biases, theoretical standard errors (tse), and empirical standard errors (ese) of θ^, as well as 95% coverage probabilities (cp) on θ for scenario 1 with sample size 1000.

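| Parameter | bias (ρ=0.2) | tse (ρ=0.2) | ese (ρ=0.2) | cp (ρ=0.2) | bias (ρ=0.5) | tse (ρ=0.5) | ese (ρ=0.5) | cp (ρ=0.5) | bias (ρ=0.8) | tse (ρ=0.8) | ese (ρ=0.8) | cp (ρ=0.8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α0 | −8.30 | 27.04 | 29.78 | 928 | −6.48 | 25.41 | 27.35 | 933 | −3.28 | 22.00 | 22.78 | 942 |
| α1 | 3.08 | 68.29 | 70.76 | 950 | 2.96 | 64.99 | 67.54 | 932 | 2.46 | 53.88 | 55.17 | 949 |
| α2 | −7.90 | 30.92 | 32.50 | 936 | −6.05 | 28.90 | 30.11 | 938 | −2.79 | 23.05 | 23.46 | 955 |
| β0 | −3.61 | 38.76 | 39.70 | 949 | −2.80 | 36.66 | 37.62 | 938 | −1.77 | 30.74 | 31.19 | 945 |
| β1 | −0.46 | 55.44 | 54.45 | 956 | −0.18 | 52.00 | 51.80 | 948 | 0.65 | 41.79 | 42.11 | 939 |
| β2 | −1.21 | 24.18 | 24.78 | 938 | −0.88 | 22.76 | 23.35 | 928 | −0.43 | 18.38 | 18.42 | 949 |
| c | −41.08 | 123.92 | 167.07 | 873 | −31.07 | 111.23 | 148.14 | 873 | −12.70 | 79.47 | 98.09 | 886 |
| t | −7.63 | 68.18 | 76.36 | 919 | −4.90 | 61.13 | 66.47 | 920 | −1.50 | 43.69 | 46.31 | 935 |
| ρ | 1.08 | 34.23 | 34.63 | 948 | 1.08 | 26.53 | 26.77 | 948 | 0.72 | 12.41 | 12.49 | 946 |
| σ2 | −0.86 | 9.82 | 9.68 | 949 | −0.64 | 10.96 | 10.75 | 949 | −0.26 | 12.68 | 12.42 | 946 |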

Note: all numbers are multiplied by 1000. These results are based on 1000 replications.
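Because every entry is scaled by 1000, it can help to convert a row back to its natural units before reading it. The sketch below does this for the threshold c at ρ=0.2 in Table A1 and compares the coverage with a simple binomial Monte Carlo band around the nominal 95% level; the band is a common rule of thumb added here for illustration and is not something the paper reports. All names are hypothetical.

```python
import math

SCALE = 1000         # every table entry is reported multiplied by 1000
REPLICATIONS = 1000  # number of Monte Carlo replications stated in the note


def unscale(entry):
    """Convert a table entry back to its natural units."""
    return entry / SCALE


def coverage_mc_band(nominal=0.95, reps=REPLICATIONS, z_crit=1.96):
    """Approximate range of empirical coverage explained by Monte Carlo error alone."""
    se = math.sqrt(nominal * (1 - nominal) / reps)
    return nominal - z_crit * se, nominal + z_crit * se


# Row "c", rho = 0.2 in Table A1: bias, tse, ese, cp
bias, tse, ese, cp = (unscale(v) for v in (-41.08, 123.92, 167.07, 873))
low, high = coverage_mc_band()
print(f"bias={bias:.5f}, ese/tse={ese / tse:.2f}, cp={cp:.3f}, "
      f"band=({low:.3f}, {high:.3f})")
```

For this row the empirical standard error exceeds the theoretical one by roughly a third and the coverage of 0.873 falls below the band, which is consistent with the threshold being the hardest parameter to estimate in this setting.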

Table A2.

Empirical biases, theoretical standard errors (tse), and empirical standard errors (ese) of θ^, as well as 95% coverage probabilities (cp) on θ for scenario 2 with sample size 1000.

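| Parameter | bias (ρ=0.2) | tse (ρ=0.2) | ese (ρ=0.2) | cp (ρ=0.2) | bias (ρ=0.5) | tse (ρ=0.5) | ese (ρ=0.5) | cp (ρ=0.5) | bias (ρ=0.8) | tse (ρ=0.8) | ese (ρ=0.8) | cp (ρ=0.8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α0 | −25.84 | 176.86 | 168.03 | 943 | −15.74 | 155.65 | 161.09 | 929 | −7.23 | 104.42 | 114.68 | 927 |
| α1 | 8.53 | 115.79 | 106.55 | 956 | 7.00 | 103.98 | 101.28 | 947 | 4.01 | 73.82 | 75.63 | 944 |
| α2 | 8.49 | 112.53 | 105.08 | 964 | 5.55 | 108.98 | 107.91 | 958 | 3.31 | 98.38 | 93.69 | 957 |
| α3 | −11.29 | 108.14 | 99.84 | 951 | −5.51 | 95.98 | 95.87 | 934 | −1.84 | 67.37 | 71.38 | 935 |
| β0 | −2.87 | 83.31 | 84.73 | 942 | −2.03 | 77.04 | 78.41 | 945 | −1.09 | 59.84 | 62.78 | 929 |
| β1 | 3.86 | 49.69 | 50.23 | 945 | 2.72 | 46.32 | 46.96 | 942 | 2.34 | 36.27 | 37.33 | 941 |
| β2 | 5.77 | 72.64 | 67.82 | 960 | 3.88 | 67.56 | 63.16 | 963 | 2.32 | 53.65 | 52.82 | 939 |
| β3 | −0.69 | 39.92 | 40.14 | 940 | −0.48 | 37.14 | 37.55 | 943 | −0.11 | 29.14 | 31.00 | 944 |
| c1 | −16.09 | 171.89 | 185.99 | 923 | 0.96 | 152.10 | 212.49 | 907 | 2.67 | 103.26 | 158.39 | 891 |
| c2 | −2.69 | 81.76 | 95.51 | 912 | 4.37 | 75.10 | 125.84 | 903 | 7.86 | 56.60 | 131.88 | 894 |
| t1 | 2.18 | 53.28 | 57.74 | 933 | 2.08 | 47.82 | 52.28 | 921 | 2.33 | 34.17 | 38.55 | 921 |
| t2 | 20.13 | 111.57 | 136.46 | 925 | 13.61 | 99.30 | 108.45 | 930 | 13.30 | 70.22 | 85.85 | 927 |
| ρ | 1.21 | 32.96 | 33.18 | 953 | 1.52 | 25.63 | 25.73 | 950 | 0.93 | 12.14 | 12.31 | 942 |
| σ2 | −1.41 | 9.81 | 9.64 | 948 | −1.17 | 10.88 | 10.59 | 951 | −0.58 | 12.57 | 12.27 | 948 |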

Note: all numbers are multiplied by 1000. These results are based on 1000 replications.

Author Contributions

Conceptualization, S.S.L. and Y.Z.; methodology, S.S.L. and Y.Z.; experiments and analysis, S.S.L.; original draft writing, S.S.L.; writing review and editing, S.S.L. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used in the application section come from the ivmodel package on CRAN and can be downloaded from https://github.com/hyunseungkang/ivmodel/tree/master/data (accessed on 31 August 2022). Code to simulate the data and to generate the tables and plots in Section 4 can be found at https://github.com/shuoshuoliu/PLIV (accessed on 31 August 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

Zhu’s research is supported by the Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2017-04064).

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Sokolovska N., Wuillemin P.H. The Role of Instrumental Variables in Causal Inference Based on Independence of Cause and Mechanism. Entropy. 2021;23:928. doi: 10.3390/e23080928.
2. Zander B., Liśkiewicz M. On searching for generalized instrumental variables; Proceedings of the Artificial Intelligence and Statistics (PMLR); Cadiz, Spain. 9–11 May 2016; pp. 1214–1222.
3. Angrist J.D., Imbens G.W., Rubin D.B. Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 1996;91:444–455. doi: 10.1080/01621459.1996.10476902.
4. Greenland S. An introduction to instrumental variables for epidemiologists. Int. J. Epidemiol. 2000;29:722–729. doi: 10.1093/ije/29.4.722.
5. Card D. Using Geographic Variation in College Proximity to Estimate the Return to Schooling. National Bureau of Economic Research; Cambridge, MA, USA: 1993. Technical Report.
6. Didelez V., Sheehan N. Mendelian randomization as an instrumental variable approach to causal inference. Stat. Methods Med. Res. 2007;16:309–330. doi: 10.1177/0962280206077743.
7. Lawlor D.A., Harbord R.M., Sterne J.A., Timpson N., Davey Smith G. Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Stat. Med. 2008;27:1133–1163. doi: 10.1002/sim.3034.
8. Burgess S., Small D.S., Thompson S.G. A review of instrumental variable estimators for Mendelian randomization. Stat. Methods Med. Res. 2017;26:2333–2355. doi: 10.1177/0962280215597579.
9. von Hinke S., Smith G.D., Lawlor D.A., Propper C., Windmeijer F. Genetic markers as instrumental variables. J. Health Econ. 2016;45:131–148. doi: 10.1016/j.jhealeco.2015.10.007.
10. Theil H. Economic Forecasts and Policy. 2nd ed. Palgrave Macmillan; Amsterdam, The Netherlands: 1961.
11. Palmer T.M., Holmes M.V., Keating B.J., Sheehan N.A. Correcting the Standard Errors of 2-Stage Residual Inclusion Estimators for Mendelian Randomization Studies. Am. J. Epidemiol. 2017;186:1104–1114. doi: 10.1093/aje/kwx175.
12. Davidson R. Estimation and Inference in Econometrics. Oxford University Press; New York, NY, USA: 1993.
13. Angrist J., Pischke J. Instrumental Variables in Action: Sometimes You Get What You Need. In: Mostly Harmless Econometrics: An Empiricist’s Companion. 2009:113–220.
14. Stock J., Wright J.H., Yogo M. A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments. J. Bus. Econ. Stat. 2002;20:518–529. doi: 10.1198/073500102288618658.
15. Wooldridge J.M. Econometric Analysis of Cross Section and Panel Data. MIT Press; Cambridge, MA, USA: 2010.
16. Kennedy E.H., Lorch S., Small D.S. Robust causal inference with continuous instruments using the local instrumental variable curve. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2019;81:121–143. doi: 10.1111/rssb.12300.
17. Hansen B.E. Regression kink with an unknown threshold. J. Bus. Econ. Stat. 2017;35:228–240. doi: 10.1080/07350015.2015.1073595.
18. Fong Y., Di C., Huang Y., Gilbert P.B. Model-robust inference for continuous threshold regression models. Biometrics. 2017;73:452–462. doi: 10.1111/biom.12623.
19. Liu S.S., Chen B.E. Continuous threshold models with two-way interactions in survival analysis. Can. J. Stat. 2020;48:751–772. doi: 10.1002/cjs.11561.
20. Scheines R., Cooper G., Yoo C., Chu T. Piecewise Linear Instrumental Variable Estimation of Causal Influence. PMLR. 2001:265–271.
21. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464. doi: 10.1214/aos/1176344136.
22. Newey W.K. Efficient instrumental variables estimation of nonlinear models. Econom. J. Econom. Soc. 1990;48:809–837. doi: 10.2307/2938351.
23. Darolles S., Fan Y., Florens J.P., Renault E. Nonparametric instrumental regression. Econometrica. 2011;79:1541–1565. doi: 10.2139/ssrn.1338775.
24. Florens J.P., Johannes J., Van Bellegem S. Identification and estimation by penalization in nonparametric instrumental regression. Econom. Theory. 2011;27:472–496. doi: 10.1017/S026646661000037X.
25. Carroll R.J., Ruppert D., Crainiceanu C.M., Tosteson T.D., Karagas M.R. Nonlinear and nonparametric regression and instrumental variables. J. Am. Stat. Assoc. 2004;99:736–750. doi: 10.1198/016214504000001088.
26. Seo M.H., Linton O. A smoothed least squares estimator for threshold regression models. J. Econom. 2007;141:704–735. doi: 10.1016/j.jeconom.2006.11.002.
27. Lin H., Zhou L., Peng H., Zhou X.H. Selection and combination of biomarkers using ROC method for disease classification and prediction. Can. J. Stat. 2011;39:324–343. doi: 10.1002/cjs.10107.
28. Van der Vaart A.W. Asymptotic Statistics. Volume 3. Cambridge University Press; Cambridge, UK: 2000.

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
