Summary
We consider a high dimensional regression model with a possible change point due to a covariate threshold and develop the lasso estimator of regression coefficients as well as the threshold parameter. Our lasso estimator not only selects covariates but also selects a model between linear and threshold regression models. Under a sparsity assumption, we derive non‐asymptotic oracle inequalities for both the prediction risk and the ℓ1‐estimation loss for regression coefficients. Since the lasso estimator selects variables simultaneously, we show that oracle inequalities can be established without pretesting the existence of the threshold effect. Furthermore, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor that is nearly n^{−1} even when the number of regressors can be much larger than the sample size n. We illustrate the usefulness of our proposed estimation method via Monte Carlo simulations and an application to real data.
Keywords: Lasso, Oracle inequalities, Sample splitting, Sparsity, Threshold models
1. Introduction
The lasso and related methods have received rapidly increasing attention in statistics since the seminal work of Tibshirani (1996). For example, see Bühlmann and van de Geer (2011) as well as Fan and Lv (2010) and Tibshirani (2011) for a general overview and recent developments.
In this paper, we develop a method for estimating a high dimensional regression model with a possible change point due to a covariate threshold, while selecting relevant regressors from a set of many potential covariates. In particular, we propose the penalized least squares (lasso) estimator of parameters, including the unknown threshold parameter, and analyse its properties under a sparsity assumption when the number of possible covariates can be much larger than the sample size.
To be specific, let {(y_i, x_i, q_i) : i = 1, …, n} be a sample of independent observations such that

(1.1)  y_i = x_i′β_0 + x_i′δ_0 1{q_i < τ_0} + u_i,   i = 1, …, n,

where, for each i, x_i is an M×1 deterministic vector, q_i is a deterministic scalar, u_i follows N(0, σ²) and 1{·} denotes the indicator function. The scalar variable q_i is the threshold variable and τ_0 is the unknown threshold parameter. Since q_i is a fixed variable in our set‐up, expression (1.1) includes a regression model with a change point at unknown time (e.g. q_i = i/n). In this paper, we focus on the fixed design for (x_i, q_i) and independent normal errors u_i. This set‐up has been extensively used in the literature (e.g. Bickel et al. (2009)).
A regression model such as model (1.1) offers applied researchers a simple yet useful framework to model non‐linear relationships by splitting the data into subsamples. Empirical examples include cross‐country growth models with multiple equilibria (Durlauf and Johnson, 1995), racial segregation (Card et al., 2008) and financial contagion (Pesaran and Pick, 2007), among many others. Typically, the choice of the threshold variable is well motivated in applied work (e.g. initial per capita output in Durlauf and Johnson (1995), and the minority share in a neighbourhood in Card et al. (2008)), but selection of other covariates is subject to applied researchers' discretion.
However, covariate selection is important in identifying threshold effects (i.e. non‐zero δ_0), since a statistical model favouring threshold effects with a particular set of covariates could be overturned by a linear model with a broader set of regressors. Therefore, it seems natural to consider the lasso as a tool to estimate model (1.1).
The statistical problem that we consider is to estimate the unknown parameters (β_0, δ_0, τ_0) when M is much larger than n. For the classical set‐up (estimation of parameters without covariate selection when M is smaller than n), estimation of model (1.1) has been well studied (e.g. Tong (1990), Chan (1993) and Hansen (2000)). Also, a general method for testing threshold effects in regression (i.e. testing δ_0 = 0 in model (1.1)) is available for the classical set‐up (e.g. Lee et al. (2011)).
Although there are many references on lasso‐type methods and equally many on change points, sample splitting and threshold models, there seem to be only a handful of references at the intersection of the two topics. Wu (2008) proposed an information‐based criterion for carrying out change point analysis and variable selection simultaneously in linear models with a possible change point; however, the method proposed in Wu (2008) would be infeasible in a sparse high dimensional model. In change point models without covariates, Harchaoui and Lévy‐Leduc (2008, 2010) proposed a method for estimating the location of change points in one‐dimensional piecewise constant signals observed in white noise, using a penalized least squares criterion with an ℓ1‐type penalty. Zhang and Siegmund (2012) developed Bayes information criterion like criteria for determining the number of changes in the mean of multiple sequences of independent normal observations when the number of change points can increase with the sample size. Ciuperca (2014) considered a similar estimation problem to ours, but the corresponding analysis was restricted to the case when the number of potential covariates is small.
In this paper, we consider the lasso estimator of the regression coefficients as well as the threshold parameter. Since the change point parameter does not enter additively in model (1.1), the resulting optimization problem in the lasso estimation is non‐convex. We overcome this problem by comparing the values of standard lasso objective functions on a grid over the range of possible values of τ.
Theoretical properties of the lasso and related methods for high dimensional data have been examined by Fan and Peng (2004), Bunea et al. (2007), Candès and Tao (2007), Huang et al. (2008a,b), Kim et al. (2008), Bickel et al. (2009) and Meinshausen and Yu (2009), among many others. Most of these references consider quadratic objective functions and linear or non‐parametric models with an additive mean‐zero error. There has been recent interest in extending this framework to generalized linear models (e.g. van de Geer (2008) and Fan and Lv (2011)), to quantile regression models (e.g. Belloni and Chernozhukov (2011a), Bradic et al. (2011) and Wang et al. (2012)), and to hazards models (e.g. Bradic et al. (2012) and Lin and Lv (2013)). We contribute to this literature by considering a regression model with a possible change point and then deriving non‐asymptotic oracle inequalities for both the prediction risk and the ℓ1‐estimation loss for regression coefficients under a sparsity scenario.
Our theoretical results build on Bickel et al. (2009). Since the lasso estimator selects variables simultaneously, we show that oracle inequalities that are similar to those obtained in Bickel et al. (2009) can be established without pretesting the existence of the threshold effect. In particular, when there is no threshold effect (δ_0 = 0), we prove oracle inequalities that are basically equivalent to those in Bickel et al. (2009). Furthermore, when δ_0 ≠ 0, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor of nearly n^{−1} when the number of regressors can be much larger than the sample size. To achieve this, we develop some sophisticated chaining arguments and provide sufficient regularity conditions under which we prove oracle inequalities. The superconsistency result for the estimator of the threshold parameter is well known when the number of covariates is small (see, for example, Chan (1993) and Seijo and Sen (2011a, b)). To the best of our knowledge, our paper is the first work that demonstrates the possibility of a nearly n^{−1} bound in the context of sparse high dimensional regression models with a change point.
The remainder of this paper is as follows. In Section 2 we propose the lasso estimator, and in Section 3 we give a brief illustration of our proposed estimation method by using a real data example in economics. In Section 4 we establish the prediction consistency of our lasso estimator. In Section 5 we establish sparsity oracle inequalities in terms of both the prediction loss and the ℓ1‐estimation loss for the unknown parameters, while providing low level sufficient conditions for two possible cases of threshold effects. In Section 6 we present results of some simulation studies, and Section 7 concludes. The on‐line appendices consist of six sections: appendix A provides sufficient conditions for one of our main assumptions, appendix B gives some additional discussions on identifiability for τ_0, appendices C, D and E contain all the proofs, and appendix F provides additional numerical results.
1.1. Notation
We collect the notation that is used in the paper here. For observations following model (1.1), let α := (β′, δ′)′ denote the 2M×1 vector of regression coefficients and let X(τ) denote the n×2M matrix whose ith row is x_i(τ)′ := (x_i′, x_i′1{q_i < τ}). For an L‐dimensional vector a, let |a|_p denote the ℓ_p‐norm of a, and |J(a)| denote the cardinality of J(a), where J(a) := {j ∈ {1, …, L} : a_j ≠ 0}. In addition, let |a|_0 denote the number of non‐zero elements of a, i.e. |a|_0 := Σ_{j=1}^L 1{a_j ≠ 0}. Let a_J denote the vector in ℝ^L that has the same co‐ordinates as a on J and zero co‐ordinates on the complement J^c of J. For any n‐dimensional vector W = (W_1, …, W_n)′, define the empirical norm as ‖W‖_n := (n^{−1} Σ_{i=1}^n W_i²)^{1/2}. Let the superscript ‘(j)’ denote the jth element of a vector or the jth column of a matrix depending on the context. Finally, define f_i(α, τ) := x_i(τ)′α, f(α, τ) := (f_1(α, τ), …, f_n(α, τ))′ and f_0 := f(α_0, τ_0), where α_0 := (β_0′, δ_0′)′. Then, we define the prediction risk as

‖f(α̂, τ̂) − f_0‖_n² = n^{−1} Σ_{i=1}^n { x_i(τ̂)′α̂ − x_i(τ_0)′α_0 }².
2. Lasso estimation
Recall that α_0 := (β_0′, δ_0′)′. Then, using the notation defined above, we can rewrite model (1.1) as

(2.1)  y_i = x_i(τ_0)′α_0 + u_i,   i = 1, …, n.

Let y := (y_1, …, y_n)′. For any fixed τ ∈ T_0, where T_0 is a parameter space for τ_0, consider the residual sum of squares

S_n(α, τ) := n^{−1} Σ_{i=1}^n { y_i − x_i′β − x_i′δ 1{q_i < τ} }²,

where α := (β′, δ′)′. We define the following 2M×2M diagonal matrix:

D(τ) := diag{ ‖X^{(1)}(τ)‖_n, …, ‖X^{(2M)}(τ)‖_n }.

For each fixed τ ∈ T_0, define the lasso solution α̂(τ) by

(2.2)  α̂(τ) := argmin_{α ∈ A} { S_n(α, τ) + λ |D(τ)α|_1 },

where λ is a tuning parameter that depends on n and A is a parameter space for α_0.
It is important to note that the scale normalizing factor D(τ) depends on τ since different values of τ generate different dictionaries X(τ). To see this more clearly, define
(2.3)  x^{(j)} := (x_1^{(j)}, …, x_n^{(j)})′   and   x^{(j)}(τ) := (x_1^{(j)} 1{q_1 < τ}, …, x_n^{(j)} 1{q_n < τ})′.

Then, for each τ and for each j=1,…,M, we have X^{(j)}(τ) = x^{(j)} and X^{(M+j)}(τ) = x^{(j)}(τ). Using this notation, we rewrite the ℓ1‐penalty as

λ |D(τ)α|_1 = λ Σ_{j=1}^M ‖x^{(j)}‖_n |β_j| + λ Σ_{j=1}^M ‖x^{(j)}(τ)‖_n |δ_j|.

Therefore, for each fixed τ, α̂(τ) is a weighted lasso that uses a data‐dependent ℓ1‐penalty to balance covariates adequately.
We now estimate τ_0 by

(2.4)  τ̂ := argmin_{τ ∈ T_0} [ S_n{α̂(τ), τ} + λ |D(τ)α̂(τ)|_1 ].

In fact, for any finite n, the minimizer of expression (2.4) is given by an interval and we simply define the maximum of the interval as our estimator. If we wrote the model by using 1{q_i ⩽ τ_0} instead of 1{q_i < τ_0}, then the convention would be that the minimum of the interval is the estimator. Then the estimator of α_0 is defined as α̂ := α̂(τ̂). In fact, our proposed estimator of (α, τ) can be viewed as the one‐step minimizer such that

(2.5)  (α̂, τ̂) = argmin_{α ∈ A, τ ∈ T_0} { S_n(α, τ) + λ |D(τ)α|_1 }.
It is worth noting that in expression (2.5) we penalize β and δ, where δ is the change of the regression coefficients between the two regimes. Model (1.1) can be written as

(2.6)  y_i = x_i′β_0 1{q_i ⩾ τ_0} + x_i′φ_0 1{q_i < τ_0} + u_i,

where φ_0 := β_0 + δ_0. In view of model (2.6), alternatively, one might penalize β and φ instead of β and δ. We opted to penalize δ in this paper since the case δ_0 = 0 corresponds to the linear model. If δ̂ = 0, then this case amounts to selecting the linear model.
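The definitions in expressions (2.2)–(2.5) suggest a simple two‐step computation: for each candidate τ on a grid, rescale the columns of X(τ) by their empirical norms (which is exactly the role of D(τ)), fit a standard lasso, and then pick the τ that minimizes the penalized objective, taking the largest minimizer in case of ties. The following Python sketch illustrates this idea; the function name fit_threshold_lasso, the use of scikit-learn's Lasso and the tie‐breaking rule over an increasing grid are our own implementation choices for illustration, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_threshold_lasso(y, X, q, tau_grid, lam):
    """Grid-search lasso for the threshold model: a sketch of (2.2)-(2.5).

    For each candidate tau, the dictionary X(tau) = [X, X*1{q < tau}] is built,
    each column is rescaled by its empirical norm (the role of D(tau)), and a
    standard lasso is fitted; tau_hat minimises the penalised objective, taking
    the largest minimising grid point (tau_grid is assumed increasing).
    """
    n, M = X.shape
    best = None
    for tau in tau_grid:
        Xt = np.hstack([X, X * (q < tau)[:, None]])     # n x 2M dictionary X(tau)
        norms = np.sqrt((Xt ** 2).mean(axis=0))         # ||X^(j)(tau)||_n for each column
        norms[norms == 0.0] = 1.0                       # guard against an empty regime
        Xs = Xt / norms                                 # rescaling <=> weighted l1-penalty
        # sklearn's Lasso minimises (1/2n)||y - Xw||^2 + a|w|_1, so a = lam/2
        # reproduces the minimiser of (1/n)||y - Xw||^2 + lam|w|_1.
        w = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=50_000).fit(Xs, y).coef_
        obj = np.mean((y - Xs @ w) ** 2) + lam * np.abs(w).sum()
        if best is None or obj <= best[0]:              # '<=' keeps the largest minimiser
            best = (obj, tau, w / norms)                # undo the rescaling: alpha = D(tau)^{-1} w
    _, tau_hat, alpha_hat = best
    return alpha_hat[:M], alpha_hat[M:], tau_hat        # beta_hat, delta_hat, tau_hat
```

For instance, tau_grid could be a set of equispaced points between the 10% and 90% sample quantiles of the threshold variable, in the spirit of the choices made in Sections 3 and 6.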
3. Empirical illustration
In this section, we apply the proposed lasso method to growth regression models in economics. The neoclassical growth model predicts that economic growth rates converge in the long run. This theory has been tested empirically by looking at the negative relationship between long‐run growth rate and initial gross domestic product (GDP) given other covariates (see Barro and Sala‐i‐Martin (1995) and Durlauf et al. (2005) for literature reviews). Although empirical results confirmed the negative relationship between growth rate and initial GDP, there has been some criticism that the results depend heavily on the selection of covariates. Recently, Belloni and Chernozhukov (2011b) showed that lasso estimation can help to select the covariates in the linear growth regression model and that the lasso estimation results reconfirm the negative relationship between long‐run growth rate and initial GDP.
We consider the growth regression model with a possible threshold. Durlauf and Johnson (1995) provided the theoretical background for the existence of multiple steady states and estimated the model with two possible threshold variables. They checked the robustness of their results by adding other available covariates to the model, but this approach is still not free from the criticism of ad hoc variable selection. Our proposed lasso method might be a good alternative in this situation. Furthermore, as we shall show later, our method works well even if there is no threshold effect in the model. Therefore, one might expect more robust results from our approach.
The regression model that we consider has the form
(3.1)  gr_i = x_i′β_0 + x_i′δ_0 1{Q_i < τ_0} + u_i,   x_i := (1, lgdp60_i, z_i′)′,

where gr_i is the annualized GDP growth rate of country i from 1960 to 1985, lgdp60_i is the log‐GDP in 1960 and Q_i is a possible threshold variable for which we use the initial GDP or the adult literacy rate in 1960 following Durlauf and Johnson (1995). Finally, z_i is a vector of additional covariates related to education, market efficiency, political stability, market openness and demographic characteristics. In addition, z_i contains cross‐product terms between lgdp60 and education variables. Table 1 gives a list of all covariates used and a description of each variable. We include as many covariates as possible, which might mitigate the potential omitted variable bias. The data set mostly comes from Barro and Lee (1994), and the additional adult literacy rate is from Durlauf and Johnson (1995). Because of missing observations, we have 80 observations with 46 covariates (including a constant term) when Q is the initial GDP (n=80 and M=46), and 70 observations with 47 covariates when Q is the literacy rate (n=70 and M=47). It is worthwhile to note that the number of covariates in the threshold models is bigger than the number of observations (2M>n in our notation). Thus, we cannot adopt the standard least squares method to estimate the threshold regression model.
Table 1.
List of variables
Variable name | Description
---|---
Dependent variable | |
gr | Annualized GDP growth rate in the period 1960–1985 |
Threshold variables | |
gdp60 | Real GDP per capita in 1960 (1985 price) |
lr | Adult literacy rate in 1960 |
Covariates | |
lgdp60 | Log‐GDP per capita in 1960 (1985 price) |
lr | Adult literacy rate in 1960 (only included when Q=lr) |
ls | log(investment/output) annualized over 1960–1985; a proxy for log(physical savings rate)
lgr | log(population growth rate) annualized over 1960–1985 |
pyrm60 | log(average years of primary schooling) in the male population in 1960 |
pyrf60 | log(average years of primary schooling) in the female population in 1960 |
syrm60 | log(average years of secondary schooling) in the male population in 1960 |
syrf60 | log(average years of secondary schooling) in the female population in 1960 |
hyrm60 | log(average years of higher schooling) in the male population in 1960 |
hyrf60 | log(average years of higher schooling) in the female population in 1960 |
nom60 | Percentage of no schooling in the male population in 1960 |
nof60 | Percentage of no schooling in the female population in 1960 |
prim60 | Percentage of primary schooling attained in the male population in 1960 |
prif60 | Percentage of primary schooling attained in the female population in 1960 |
pricm60 | Percentage of primary schooling complete in the male population in 1960 |
pricf60 | Percentage of primary schooling complete in the female population in 1960 |
secm60 | Percentage of secondary schooling attained in the male population in 1960 |
secf60 | Percentage of secondary schooling attained in the female population in 1960 |
seccm60 | Percentage of secondary schooling complete in the male population in 1960 |
seccf60 | Percentage of secondary schooling complete in the female population in 1960 |
llife | log(life expectancy at age 0) averaged over 1960–1985 |
lfert | log(fertility rate) averaged over 1960–1985 |
edu/gdp | Government expenditure on education per GDP averaged over 1960–1985
gcon/gdp | Government consumption expenditure net of defence and education per GDP averaged over 1960–1985 |
revol | Number of revolutions per year over 1960–1984 |
revcoup | Number of revolutions and coups per year over 1960–1984 |
wardum | Dummy for countries that participated in at least one external war over 1960–1984 |
wartime | Fraction of time over 1960–1985 involved in external war |
lbmp | log(1 + black market premium averaged over 1960–1985) |
tot | Terms‐of‐trade shock
lgdp60 × ‘educ’ | Product of two covariates (interaction of lgdp60 and education variables from pyrm60 to seccf60); total 16 variables |
Table 2 summarizes the model selection and estimation results when Q is the initial GDP. In the on‐line appendix F (see Table 4), we report additional empirical results with Q being the literacy rate. To compare different model specifications, we also estimate a linear model, i.e. all δs are 0s in model (3.1), by standard lasso estimation. In each case, the regularization parameter λ is chosen by the ‘leave‐one‐out’ cross‐validation method. For the range of the threshold parameter, we consider an interval between the 10% and 90% sample quantiles for each threshold variable.
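A leave‐one‐out choice of λ of the kind used here can be sketched as follows, reusing the fit_threshold_lasso helper sketched at the end of Section 2; the candidate grid lam_grid is not specified in the paper and must be supplied by the user.

```python
import numpy as np

def loo_cv_lambda(y, X, q, tau_grid, lam_grid):
    """Leave-one-out cross-validation for lambda (illustrative sketch)."""
    n = len(y)
    cv_err = []
    for lam in lam_grid:
        err = 0.0
        for i in range(n):
            keep = np.arange(n) != i                          # drop observation i
            b, d, tau = fit_threshold_lasso(y[keep], X[keep], q[keep], tau_grid, lam)
            pred = X[i] @ b + (X[i] @ d) * (q[i] < tau)       # threshold-model prediction
            err += (y[i] - pred) ** 2
        cv_err.append(err / n)
    return lam_grid[int(np.argmin(cv_err))]                   # lambda with smallest CV error
```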
Table 2.
Model selection and estimation results with Q=gdp60a
Variable | Value for the linear model | Values for the threshold model: β̂ | Values for the threshold model: δ̂
---|---|---|---
Constant | −0.0923 | −0.0811 | —
lgdp60 | −0.0153 | −0.0120 | —
ls | 0.0033 | 0.0038 | —
lgr | 0.0018 | — | —
pyrf60 | 0.0027 | — | —
syrm60 | 0.0157 | — | —
hyrm60 | 0.0122 | 0.0130 | —
hyrf60 | −0.0389 | — | −0.0807
nom60 | — | — | 2.64 × 10
prim60 | −0.0004 | −0.0001 | —
pricm60 | 0.0006 | −1.73 × 10 |
pricf60 | −0.0006 | — | —
secf60 | 0.0005 | — | —
seccm60 | 0.0010 | — | 0.0014
llife | 0.0697 | 0.0523 | —
lfert | −0.0136 | −0.0047 | —
edu/gdp | −0.0189 | — | —
gcon/gdp | −0.0671 | −0.0542 | —
revol | −0.0588 | — | —
revcoup | 0.0433 | — | —
wardum | −0.0043 | — | −0.0022
wartime | −0.0019 | −0.0143 | −0.0023
lbmp | −0.0185 | −0.0174 | −0.0015
tot | 0.0971 | — | 0.0974
lgdp60 × pyrf60 | — | | —
lgdp60 × syrm60 | — | — | 0.0002
lgdp60 × hyrm60 | — | — | 0.0050
lgdp60 × hyrf60 | — | −0.0003 | —
lgdp60 × nom60 | — | — |
lgdp60 × prim60 | | — | —
lgdp60 × prif60 | | — |
lgdp60 × pricf60 | | — | —
lgdp60 × secm60 | −0.0001 | — | —
lgdp60 × seccf60 | −0.0002 | | —
λ | 0.0004 | 0.0034 |
Number of selected covariates | 28 | 26 |
Number of covariates | 46 | 92 |
Number of observations | 80 | 80 |
The regularization parameter λ is chosen by the ‘leave‐one‐out’ cross‐validation method. ‘Number of selected covariates’ is the number of covariates selected by the lasso estimator, and a dash indicates that the regressor is not selected. Recall that β is the vector of coefficients when Q_i ⩾ τ and that δ is the change of the coefficient values when Q_i < τ.
Main empirical findings are as follows. First, the marginal effect of lgdp60, which is given by

∂gr_i/∂lgdp60_i = β̂_lgdp60 + educ_i′β̂_educ + 1{Q_i < τ̂}( δ̂_lgdp60 + educ_i′δ̂_educ ),

where educ is a vector of education variables and β̂_educ and δ̂_educ are the subvectors of β̂ and δ̂ corresponding to the interactions between lgdp60 and educ, is estimated to be negative for all the observed values of educ. This confirms the theory of the neoclassical growth model. Second, some non‐zero coefficients of interaction terms between lgdp60 and various education variables show the existence of threshold effects in both threshold model specifications. This result implies that the growth convergence rates can vary according to different levels of the initial GDP or the adult literacy rate in 1960. Specifically, in both threshold models, we have δ̂_lgdp60 = 0, but some of the δ̂s on the interaction terms are not 0. Thus, conditionally on other covariates, there are different technological diffusion effects according to the threshold point. For example, a developing country (lower Q) with a higher education level will converge faster, perhaps by absorbing advanced technology more easily and more quickly. Finally, the lasso with the threshold model specification selects a more parsimonious model than that with the linear specification, even though the former doubles the number of potential covariates.
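As an illustration, the marginal effect just described can be computed observation by observation as in the short sketch below; the array names (beta_lgdp60, beta_inter, delta_lgdp60, delta_inter, educ) are hypothetical containers for the estimated coefficients on lgdp60 and on its interactions with the 16 education variables, and are not objects defined in the paper.

```python
import numpy as np

def marginal_effect_lgdp60(beta_lgdp60, beta_inter, delta_lgdp60, delta_inter,
                           educ, Q, tau_hat):
    """Per-country marginal effect of lgdp60 (hypothetical array names).

    educ is the n x 16 matrix of education variables; beta_inter and delta_inter
    hold the estimated coefficients on the lgdp60 x education interaction terms.
    """
    below = (Q < tau_hat).astype(float)                       # regime indicator 1{Q < tau_hat}
    return (beta_lgdp60 + educ @ beta_inter
            + below * (delta_lgdp60 + educ @ delta_inter))
```

Checking that np.all(marginal_effect_lgdp60(...) < 0) holds corresponds to the negativity of the marginal effect reported above.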
4. Prediction consistency of lasso estimator
In this section, we consider the prediction consistency of the lasso estimator. We make the following assumptions.
Assumption 1
(a) Any α in the parameter space A for α_0, including α_0 itself, satisfies max_{1⩽j⩽2M} |α^{(j)}| ⩽ C_1 for some constant C_1 < ∞. In addition, τ_0 belongs to the parameter space T_0.
(b) There are universal constants c_1 and c_2 such that ‖X^{(j)}(τ)‖_n ⩽ c_1 uniformly in j and τ ∈ T_0, and ‖X^{(j)}(τ_0)‖_n ⩾ c_2 > 0 uniformly in j, where j = 1, …, 2M.
(c) There is no i ≠ j such that q_i = q_j.
Assumption 1(a) imposes the boundedness of each component of the parameter vector. The first part of assumption 1(a), which implies that |α|_1 ⩽ C_1|J(α)| for any α ∈ A, seems to be weak, since the sparsity assumption implies that |J(α_0)| is much smaller than M. Furthermore, in the literature on change point and threshold models, it is common to assume that the parameter space is compact. For example, see Seijo and Sen (2011a, b).
The lasso estimator in expression (2.5) can be computed without knowing the value of C_1, but T_0 must be specified. In practice, researchers tend to choose some strict subset of the range of observed values of the threshold variable. Assumption 1(b) imposes that each covariate is of the same magnitude uniformly over τ. In view of the convention that T_0 is a strict subset of the range of the observed threshold variable, it is not stringent to assume that ‖X^{(j)}(τ)‖_n is bounded away from zero.
Assumption 1(c) imposes that there are no ties among the q_i's. This is a convenient assumption under which we can always transform a general threshold variable q_i to its rank without loss of generality. This holds with probability 1 in the random‐design case if q_i is continuously distributed.
Define D_min := min{ min_{1⩽j⩽M} ‖x^{(j)}‖_n, min_{1⩽j⩽M} ‖x^{(j)}(τ_0)‖_n }, where ‖x^{(j)}‖_n and ‖x^{(j)}(τ)‖_n are defined in expression (2.3). Assumption 1(b) implies that D_min is bounded away from zero. In particular, we have that D_min ⩾ c_2 > 0.
Recall that

(4.1)  y = X(τ_0)α_0 + u,

where y := (y_1, …, y_n)′ and u := (u_1, …, u_n)′. To establish the theoretical results in the paper (in particular, the oracle inequalities in Section 5), let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with

(4.2)

for a constant A > 2√2/μ, where μ ∈ (0,1) is a fixed constant. We now present the first theoretical result of this paper.
Theorem 1
(consistency of the lasso). Let assumption 1 hold. Let μ be a constant such that 0<μ<1, and let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by equation (4.2). Then, with probability at least , we have
where .
The non‐asymptotic upper bound on the prediction risk in theorem 1 can be translated easily into asymptotic convergence: theorem 1 implies the consistency of the lasso, provided that n→∞, M→∞ and the upper bound in the theorem converges to 0. Recall that |J(α_0)| represents the sparsity of model (2.1). In view of equation (4.2), this condition allows the sparsity |J(α_0)| to increase with n.
Remark 1
Note that the prediction error increases as A or μ increases; however, the probability of correct recovery increases if A or μ increases. Therefore, there is a trade‐off between the prediction error and the probability of correct recovery.
5. Oracle inequalities
In this section, we establish finite sample sparsity oracle inequalities in terms of both the prediction loss and the ℓ1‐estimation loss for the unknown parameters. First of all, we make the following assumption.
Assumption 2
(uniform restricted eigenvalue (URE) condition). For some integer s such that 1⩽s⩽2M, a positive number c_0 and some set S ⊆ T_0, the following condition holds:

κ(s, c_0, S) := min_{τ ∈ S} min_{J_0 ⊆ {1,…,2M}: |J_0| ⩽ s} min_{γ ≠ 0: |γ_{J_0^c}|_1 ⩽ c_0|γ_{J_0}|_1} ‖X(τ)γ‖_n / |γ_{J_0}|_2 > 0.
If τ_0 were known, then assumption 2 would be just a restatement of the restricted eigenvalue assumption of Bickel et al. (2009) with S = {τ_0}. Bickel et al. (2009) provided sufficient conditions for the restricted eigenvalue condition. In addition, van de Geer and Bühlmann (2009) showed the relationships between the restricted eigenvalue condition and other conditions on the design matrix, and Raskutti et al. (2010) proved that restricted eigenvalue conditions hold with high probability for a large class of correlated Gaussian design matrices.
If τ_0 is unknown, as in our set‐up, it seems necessary to assume that the restricted eigenvalue condition holds uniformly over τ. We consider separately two cases depending on whether δ_0 = 0 or not. On the one hand, if δ_0 = 0, so that τ_0 is not identifiable, then we need to assume that the URE condition holds uniformly on the whole parameter space, S = T_0. On the other hand, if δ_0 ≠ 0, so that τ_0 is identifiable, then it suffices to impose that the URE condition holds uniformly on a neighbourhood of τ_0. In the on‐line appendix A, we provide two types of sufficient conditions for assumption 2. One type is based on modifications of assumption 2 of Bickel et al. (2009) and the other type is in the same spirit as van de Geer and Bühlmann (2009), section 10.1. Using the second type of results, we verify primitive sufficient conditions for the URE condition in the context of our simulation designs. See the on‐line appendix A for details.
The URE condition is useful for us to improve the result in theorem 1. Recall that, in theorem 1, the prediction risk is bounded by a factor that is of the order of λ. This bound is too large to give us an oracle inequality. We shall show below that we can establish non‐asymptotic oracle inequalities for the prediction risk as well as the ℓ1‐estimation loss, thanks to the URE condition.
The strength of the proposed lasso method is that it is not necessary to know or pretest whether δ_0 = 0 or not. It is worth noting that we do not have to know whether there is a threshold in the model to establish oracle inequalities for the prediction risk and the ℓ1‐estimation loss for α_0, although we divide our theoretical results into two cases below. This implies that we can make predictions and estimate α_0 precisely without knowing the presence of a threshold effect and without pretesting for it.
5.1. Case I: no threshold
We first consider the case that δ_0 = 0. In other words, we estimate a threshold model via the lasso method, but the true model is simply a linear model. This is an important case to consider in applications, because one may be unsure not only about covariate selection but also about the existence of the threshold in the model.
Let φ_max denote the supremum (over τ ∈ T_0) of the largest eigenvalue of n^{−1}X(τ)′X(τ). Then, by definition, the largest eigenvalue of n^{−1}X(τ)′X(τ) is bounded uniformly in τ ∈ T_0 by φ_max. The following theorem gives oracle inequalities for the first case.
Theorem 2
Suppose that δ_0 = 0. Let assumptions 1 and 2 hold with S = T_0, for 0<μ<1, and . Let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by expression (4.2). Then, with probability at least , we have
for some universal constant .
To appreciate the usefulness of the inequalities derived above, it is worth comparing the inequalities in theorem 2 with those in theorem 7.2 of Bickel et al. (2009). The latter corresponds to the case in which it is known a priori that δ_0 = 0, so that the model is linear in our notation. If we compare theorem 2 with theorem 7.2 of Bickel et al. (2009), we can see that the lasso estimator in expression (2.5) gives qualitatively the same oracle inequalities as the lasso estimator in the linear model, even though our model is much more overparameterized in that δ and τ are added to β as parameters to estimate.
Also, as in Bickel et al. (2009), there is no requirement on α_0 that the minimum value of the non‐zero components of α_0 be bounded away from zero. In other words, there is no need to assume a minimum strength of the signals. Furthermore, α_0 is well estimated here even if τ_0 is not identifiable at all. Finally, note that the value of the constant is given in the proof of theorem 2 and that theorem 2 can be translated easily into asymptotic oracle results as well, since both κ and D_min are bounded away from zero by the URE condition and assumption 1 respectively.
5.2. Case II: fixed threshold
This subsection explores the case where the threshold effect is well identified and discontinuous. We begin with the following additional assumptions to reflect this.
Assumption 3
(identifiability under sparsity and discontinuity of regression). For a given and for any η and τ such that and , there is a constant c>0 such that
Assumption 3 implies, among other things, that for some and for any and τ such that ,
(5.1) |
This condition can be regarded as identifiability of τ_0. If τ_0 were known, then a sufficient condition for identifiability under the sparsity would be that the URE condition holds. Thus, the main point of result (5.1) is that there is no sparse representation that is equivalent to f_0 when the sample is split by any τ ≠ τ_0. In fact, assumption 3 is stronger than just the identifiability of τ_0, as it specifies the rate of deviation in f as τ moves away from τ_0, which in turn dictates the bound for the estimation error of τ̂. We provide further discussions on assumption 3 in the on‐line appendix B.
Remark 2
The restriction in assumption 3 is necessary since we consider the fixed design for both x_i and q_i. Throughout this section, we implicitly assume that the sample size n is sufficiently large that the quantity appearing in this restriction is very small, implying that the restriction never binds in any of the inequalities below. This is typically true for the random‐design case if q_i is continuously distributed.
Assumption 4
(smoothness of design). For any η>0, there is a constant C<∞ such that
Assumption 4 has been assumed in the classical set‐up with a fixed number of stochastic regressors to exclude cases in which, for example, the distribution of q_i has a point mass at τ_0 or the conditional second moment of x_i given q_i is unbounded. In our set‐up, assumption 4 amounts to a deterministic version of a smoothness assumption on the distribution of the threshold variable q_i. When (x_i, q_i) is a random vector, it is satisfied under the standard assumption that q_i is continuously distributed and that E{(x_i^{(j)})² | q_i = τ} is continuous and bounded in a neighbourhood of τ_0 for each j.
To simplify the notation, in the following theorem, we assume without loss of generality that . Then . In addition, let where is the same constant in theorem 1.
Assumption 5
(well‐defined second moments). For any η such that , is bounded, where
and [·] denotes the integer part of a real number.
Assumption 5 assumes that is well defined for any η such that . Assumption 5 amounts to some weak regularity condition on the second moments of the fixed design. Assumption 3 implies that and that is bounded away from zero. Hence, assumptions 3 and 5 imply that is bounded and bounded away from zero.
To present the theorem below, it is necessary to make one additional technical assumption (see assumption 6 in the on‐line appendix E). We opted not to state assumption 6 here, since we believe that it is just a sufficient condition that does not add much to the understanding of the main result. However, we would like to point out that assumption 6 can hold for all sufficiently large n, provided that , as n→∞. See remark 4 in the on‐line appendix E for details.
We now give the main result of this section.
Theorem 3
Suppose that assumptions 1 and 2 hold with , for 0<μ<1, and . Furthermore, suppose that assumptions 3, 4 and 5 hold and let n be sufficiently large that assumption 6 in the on‐line appendix E holds. Let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by expression (4.2). Then, with probability at least for some positive constants and , we have
for some universal constant .
Theorem 3 gives the same inequalities (up to constants) as those in theorem 2 for the prediction risk as well as the ℓ1‐estimation loss for α_0. It is important to note that the ℓ1‐estimation error for α_0 is bounded by a constant times λ times the sparsity, whereas the estimation error for τ_0 is bounded by a constant times λ² times the sparsity, which is nearly n^{−1}. This can be viewed as a non‐asymptotic version of the superconsistency of τ̂ for τ_0. As noted at the end of Section 5.1, since both κ and D_min are bounded away from zero by the URE condition and assumption 1 respectively, theorem 3 implies asymptotic rate results immediately. The values of the constants are given in the proof of theorem 3.
The main contribution of this section is that we have extended the well‐known superconsistency result for the threshold estimate when M<n (see, for example, Chan (1993) and Seijo and Sen (2011a, b)) to the high dimensional set‐up (M≫n). In both cases, the main reason that we achieve superconsistency for the threshold parameter is that the least squares objective function behaves locally linearly around the true threshold parameter value, rather than locally quadratically as in regular estimation problems. An interesting remaining research question is to investigate whether it would be possible to obtain the superconsistency result for τ̂ under a weaker condition, perhaps without a restricted eigenvalue condition.
6. Monte Carlo experiments
In this section we conduct some simulation studies and check the properties of the proposed lasso estimator. The baseline model is model (1.1), where x_i is an M‐dimensional vector generated from N(0,I), q_i is a scalar generated from the uniform distribution on the interval (0,1) and the error term u_i is generated from N(0, σ²) with σ=0.5. The threshold parameter τ_0 is set to 0.3, 0.4 or 0.5 depending on the simulation design, and the coefficient vectors β_0 and δ_0 are set to fixed sparse values, with δ_0 proportional to the jump scale c, where c=0 or c=1. Note that there is no threshold effect when c=0. The number of observations is set to n=200. Finally, the dimension of x_i in each design is set to M=50, 100, 200, 400, so that the total numbers of regressors are 100, 200, 400 and 800 respectively. The range of τ is .
We can estimate the parameters by the standard lasso–least angle regression algorithm of Efron et al. (2004) without much modification. Given a regularization parameter value λ, we estimate the model for each grid point of τ, where the grid consists of 71 equispaced points spanning the range of τ. This step can be conducted by using the standard linear lasso. Next, we plug the estimated parameters for each τ into the objective function and choose τ̂ as the minimizer, as in expression (2.4). Finally, α_0 is estimated by α̂ := α̂(τ̂). The regularization parameter λ is chosen by expression (4.2), where σ=0.5 is assumed to be known. For the constant A, we use four different values: A=2.8, 3.2, 3.6, 4.0.
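For concreteness, one replication of this design can be generated and estimated as in the sketch below. The non‐zero coefficient values, the endpoints of the τ grid and the exact form of the tuning parameter are placeholders (the paper's choices are given in the text and in expression (4.2), which is not reproduced here); n=200, M=50, σ=0.5, the N(0,I) regressors and the U(0,1) threshold variable follow the description above, and fit_threshold_lasso is the helper sketched at the end of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, c, sigma = 200, 50, 1, 0.5                    # sizes and noise level from Section 6
tau0 = 0.5                                          # one of the threshold values considered

# Placeholder sparse coefficients; the paper's exact beta_0 and delta_0 are not reproduced here.
beta0 = np.zeros(M); beta0[:2] = 1.0
delta0 = np.zeros(M); delta0[:2] = c * 1.0

X = rng.standard_normal((n, M))                     # x_i ~ N(0, I)
q = rng.uniform(0.0, 1.0, n)                        # threshold variable ~ U(0, 1)
u = rng.normal(0.0, sigma, n)                       # errors ~ N(0, sigma^2)
y = X @ beta0 + (X @ delta0) * (q < tau0) + u       # model (1.1)

tau_grid = np.linspace(0.15, 0.85, 71)              # 71 equispaced candidates; endpoints assumed
lam = 3.2 * sigma * np.sqrt(np.log(2 * M) / n)      # tuning of order sigma*sqrt(log M / n) with A = 3.2,
                                                    # used here as a stand-in for expression (4.2)
beta_hat, delta_hat, tau_hat = fit_threshold_lasso(y, X, q, tau_grid, lam)
```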
Table 3 and Figs 1 and 2 summarize the simulation results. To compare the performance of the lasso estimator, we also report the estimation results of least squares estimation (‘least squares’), which is available only when M=50, and of two oracle models (oracle 1 and oracle 2). Oracle 1 assumes that the regressors with non‐zero coefficients are known. In addition to that, oracle 2 assumes that the true threshold parameter is known. Thus, when c≠0, oracle 1 estimates the non‐zero coefficients and τ_0 by least squares, whereas oracle 2 estimates only the non‐zero coefficients. When c=0, both oracle 1 and oracle 2 estimate only the non‐zero coefficients of the linear model. All results are based on 400 replications of each sample.
Table 3.
Simulation results with M=50a
Threshold parameter | Estimation method | Constant for λ | Prediction error, mean | Prediction error, median | Prediction error, standard deviation | Mean number of selected covariates | ℓ1‐error for α | Error for τ
---|---|---|---|---|---|---|---|---
Jump scale: c = 1 | | | | | | | |
 | Least squares | None | 0.285 | 0.276 | 0.074 | 100.00 | 7.066 | 0.008
 | Lasso | A=2.8 | 0.041 | 0.030 | 0.035 | 12.94 | 0.466 | 0.010
 | | A=3.2 | 0.048 | 0.033 | 0.049 | 10.14 | 0.438 | 0.013
 | | A=3.6 | 0.067 | 0.037 | 0.086 | 8.44 | 0.457 | 0.024
 | | A=4.0 | 0.095 | 0.050 | 0.120 | 7.34 | 0.508 | 0.040
 | Oracle 1 | None | 0.013 | 0.006 | 0.019 | 4.00 | 0.164 | 0.004
 | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.163 | 0.000
 | Least squares | None | 0.317 | 0.304 | 0.095 | 100.00 | 7.011 | 0.008
 | Lasso | A=2.8 | 0.052 | 0.034 | 0.063 | 13.15 | 0.509 | 0.016
 | | A=3.2 | 0.063 | 0.037 | 0.083 | 10.42 | 0.489 | 0.023
 | | A=3.6 | 0.090 | 0.045 | 0.121 | 8.70 | 0.535 | 0.042
 | | A=4.0 | 0.133 | 0.061 | 0.162 | 7.68 | 0.634 | 0.078
 | Oracle 1 | None | 0.014 | 0.006 | 0.022 | 4.00 | 0.163 | 0.004
 | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.163 | 0.000
 | Least squares | None | 2.559 | 0.511 | 16.292 | 100.00 | 12.172 | 0.012
 | Lasso | A=2.8 | 0.062 | 0.035 | 0.091 | 13.45 | 0.602 | 0.030
 | | A=3.2 | 0.089 | 0.041 | 0.125 | 10.85 | 0.633 | 0.056
 | | A=3.6 | 0.127 | 0.054 | 0.159 | 9.33 | 0.743 | 0.099
 | | A=4.0 | 0.185 | 0.082 | 0.185 | 8.43 | 0.919 | 0.168
 | Oracle 1 | None | 0.012 | 0.006 | 0.017 | 4.00 | 0.177 | 0.004
 | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.176 | 0.000
Jump scale: c = 0 | | | | | | | |
—b | Least squares | None | 6.332 | 0.460 | 41.301 | 100.00 | 20.936 | —b
 | Lasso | A=2.8 | 0.013 | 0.011 | 0.007 | 9.30 | 0.266 |
 | | A=3.2 | 0.014 | 0.012 | 0.008 | 6.71 | 0.227 |
 | | A=3.6 | 0.015 | 0.014 | 0.009 | 4.95 | 0.211 |
 | | A=4.0 | 0.017 | 0.016 | 0.010 | 3.76 | 0.204 |
 | Oracle 1 and oracle 2 | None | 0.002 | 0.002 | 0.003 | 2.00 | 0.054 |
M denotes the column size of x_i and τ denotes the threshold parameter. Oracle 1 and oracle 2 are estimated by least squares when the sparsity is known and when the sparsity and τ_0 are known, respectively. All simulations are based on 400 replications of a sample with 200 observations.
Not applicable.
Figure 1.
Mean prediction errors and mean numbers of selected covariates (♦, τ=0.3; □, τ=0.4; ◯, τ=0.5; △, c=0): (a) M=100; (b) M=200; (c) M=400
Figure 2.
Mean ℓ1‐errors for α and τ (♦, τ=0.3; □, τ=0.4; ◯, τ=0.5; △, c=0): (a) M=100; (b) M=200; (c) M=400
The reported mean‐squared prediction error PE for each sample is calculated numerically as follows. For each sample s, we have the estimates β̂_s, δ̂_s and τ̂_s. Given these estimates, we generate new data of 400 observations and calculate the prediction error as

(6.1)  PE_s := (1/400) Σ_{i=1}^{400} { x_i′β̂_s + x_i′δ̂_s 1{q_i < τ̂_s} − x_i′β_0 − x_i′δ_0 1{q_i < τ_0} }².

The mean, median and standard deviation of the prediction error are calculated from the 400 replications, s = 1, …, 400. We also report the mean number of selected covariates and the ℓ1‐errors for α and τ. Table 3 reports the simulation results for M=50. For simulation designs with M>50, the least squares estimator is not available, and we summarize the same statistics only for the lasso estimator in Figs 1 and 2.
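Reading expression (6.1) as the mean squared difference between the fitted and the true regression functions over the freshly generated design points (consistent with the oracle prediction errors in Table 3 lying far below σ²=0.25), the per‐replication prediction error can be computed as in the following sketch; the function name and the seed are ours.

```python
import numpy as np

def prediction_error(beta_hat, delta_hat, tau_hat, beta0, delta0, tau0,
                     n_new=400, seed=1):
    """Prediction error of one replication on fresh data, in the spirit of (6.1)."""
    rng = np.random.default_rng(seed)
    M = len(beta0)
    Xn = rng.standard_normal((n_new, M))                 # new regressors from the same design
    qn = rng.uniform(0.0, 1.0, n_new)                    # new threshold variable draws
    f_hat = Xn @ beta_hat + (Xn @ delta_hat) * (qn < tau_hat)
    f_true = Xn @ beta0 + (Xn @ delta0) * (qn < tau0)
    return np.mean((f_hat - f_true) ** 2)                # averaged over the 400 new observations
```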
When M=50, across all designs, the proposed lasso estimator performs better than the least squares estimator in terms of the mean and median prediction errors, the mean number of selected covariates and the ℓ1‐error for α. The performance of the lasso estimator becomes much better when there is no threshold effect, i.e. c=0. This result confirms the robustness of the lasso estimator to whether or not there is a threshold effect. However, the least squares estimator performs better than the lasso estimator in terms of estimation of τ_0 when c=1, although the difference here is much smaller than the differences in the prediction errors and the ℓ1‐error for α.
From Figs 1 and 2, we can reconfirm the robustness of the lasso estimator when M=100, 200, 400. As predicted by the theory developed in the previous sections, the prediction error and the ℓ1‐errors for α and τ increase only slowly as M increases. The graphs also show that the results are quite uniform across the different regularization parameter values, except for A=4.0.
In the on‐line appendix F, we report additional simulation results that allow for correlation between covariates. Specifically, the M‐dimensional vector x_i is generated from a multivariate normal N(0,Σ) distribution, where Σ_{ij}, the (i,j) element of the M×M covariance matrix Σ, is determined by a correlation parameter ρ=0.3. All other random variables are the same as above. We obtained very similar results to those for the previous cases: the lasso outperforms the least squares estimator, and the prediction error, the mean number of selected covariates and the ℓ1‐errors increase very slowly as M increases. See further details in the on‐line appendix F, which also reports satisfactory simulation results regarding the frequencies of selecting the true parameters for both ρ=0 and ρ=0.3.
In sum, the simulation results confirm the theoretical results that were developed earlier and show that the lasso estimator proposed will be useful for the high dimensional threshold regression model.
7. Conclusions
We have considered a high dimensional regression model with a possible change point due to a covariate threshold and have developed the lasso method. We have derived non‐asymptotic oracle inequalities and have illustrated the usefulness of our proposed estimation method via simulations and a real data application.
We conclude this paper by suggesting some areas of future research. First, it would be interesting to extend other penalized estimators (e.g. the adaptive lasso of Zou (2006) and the smoothly clipped absolute deviation penalty of Fan and Li (2001)) to our set‐up and to see whether we would be able to improve the performance of our estimation method. Second, an extension to multiple change points is also an important research topic. There has been some progress in this direction, especially regarding key issues like computational cost and the determination of the number of change points (see, for example, Harchaoui and Lévy‐Leduc (2010) and Frick et al. (2014)). However, these references are confined to a single regressor case, and the extension to a large number of regressors would be highly interesting. Finally, it would also be an interesting research topic to investigate the minimax lower bounds for the proposed estimator and its prediction risk, as Raskutti et al. (2011, 2012) did in high dimensional linear regression set‐ups.
Supporting information
‘Online appendices’.
Acknowledgements
We thank Marine Carrasco, Yuan Liao, Ya'acov Ritov, two referees and seminar participants at various places for their helpful comments. This work was supported by a National Research Foundation of Korea grant funded by the Korean Government (NRF‐2012S1A5A8023573), the Institute of Economic Research of Seoul National University, by the European Research Council (ERC‐2009‐StG‐240910‐ROMETA) and by the Social Sciences and Humanities Research Council of Canada. This work was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (www.sharcnet.ca) and Compute/Calcul Canada.
References
- Barro, R. and Lee, J. (1994) Data set for a panel of 139 countries. Report, National Bureau of Economic Research, Cambridge. (Available from http://admin.nber.org/pub/barro.lee/.)
- Barro, R. and Sala‐i‐Martin, X. (1995) Economic Growth. New York: McGraw‐Hill.
- Belloni, A. and Chernozhukov, V. (2011a) ℓ1‐penalized quantile regression in high‐dimensional sparse models. Ann. Statist., 39, 82–130.
- Belloni, A. and Chernozhukov, V. (2011b) High dimensional sparse econometric models: an introduction. In Inverse Problems and High‐dimensional Estimation (eds P. Alquier, E. Gautier and G. Stoltz), pp. 121–156. Berlin: Springer.
- Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37, 1705–1732.
- Bradic, J., Fan, J. and Jiang, J. (2012) Regularization for Cox's proportional hazards model with NP‐dimensionality. Ann. Statist., 39, 3092–3120.
- Bradic, J., Fan, J. and Wang, W. (2011) Penalized composite quasi‐likelihood for ultrahigh dimensional variable selection. J. R. Statist. Soc. B, 73, 325–349.
- Bühlmann, P. and van de Geer, S. (2011) Statistics for High‐dimensional Data: Methods, Theory and Applications. New York: Springer.
- Bunea, F., Tsybakov, A. and Wegkamp, M. (2007) Sparsity oracle inequalities for the Lasso. Electron. J. Statist., 1, 169–194.
- Candès, E. and Tao, T. (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35, 2313–2351.
- Card, D., Mas, A. and Rothstein, J. (2008) Tipping and the dynamics of segregation. Q. J. Econ., 123, 177–218.
- Chan, K. S. (1993) Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. Ann. Statist., 21, 520–533.
- Ciuperca, G. (2014) Model selection by lasso methods in a change‐point model. Statist. Pap., 55, 349–374.
- Durlauf, S. N. and Johnson, P. A. (1995) Multiple regimes and cross‐country growth behavior. J. Appl. Econmetr., 10, 365–384.
- Durlauf, S., Johnson, P. and Temple, J. (2005) Growth econometrics. In Handbook of Economic Growth (eds P. Aghion and S. N. Durlauf), pp. 555–677. Amsterdam: Elsevier.
- Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407–499.
- Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96, 1348–1360.
- Fan, J. and Lv, J. (2010) A selective overview of variable selection in high dimensional feature space. Statist. Sin., 20, 101–148.
- Fan, J. and Lv, J. (2011) Nonconcave penalized likelihood with NP‐dimensionality. IEEE Trans. Inform. Theor., 57, 5467–5484.
- Fan, J. and Peng, H. (2004) Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist., 32, 928–961.
- Frick, K., Munk, A. and Sieling, H. (2014) Multiscale change point inference (with discussion). J. R. Statist. Soc. B, 76, 495–580.
- van de Geer, S. A. (2008) High‐dimensional generalized linear models and the lasso. Ann. Statist., 36, 614–645.
- van de Geer, S. A. and Bühlmann, P. (2009) On the conditions used to prove oracle results for the Lasso. Electron. J. Statist., 3, 1360–1392.
- Hansen, B. E. (2000) Sample splitting and threshold estimation. Econometrica, 68, 575–603.
- Harchaoui, Z. and Lévy‐Leduc, C. (2008) Catching change‐points with Lasso. In Advances in Neural Information Processing Systems. Cambridge: MIT Press.
- Harchaoui, Z. and Lévy‐Leduc, C. (2010) Multiple change‐point estimation with a total variation penalty. J. Am. Statist. Ass., 105, 1480–1493.
- Huang, J., Horowitz, J. L. and Ma, M. S. (2008a) Asymptotic properties of bridge estimators in sparse high‐dimensional regression models. Ann. Statist., 36, 587–613.
- Huang, J., Ma, S. G. and Zhang, C.‐H. (2008b) Adaptive lasso for sparse high‐dimensional regression models. Statist. Sin., 18, 1603–1618.
- Kim, Y., Choi, H. and Oh, H.‐S. (2008) Smoothly clipped absolute deviation on high dimensions. J. Am. Statist. Ass., 103, 1665–1673.
- Lee, S., Seo, M. and Shin, Y. (2011) Testing for threshold effects in regression models. J. Am. Statist. Ass., 106, 220–231.
- Lin, W. and Lv, J. (2013) High‐dimensional sparse additive hazards regression. J. Am. Statist. Ass., 108, 247–264.
- Meinshausen, N. and Yu, B. (2009) Lasso‐type recovery of sparse representations for high‐dimensional data. Ann. Statist., 37, 246–270.
- Pesaran, M. H. and Pick, A. (2007) Econometric issues in the analysis of contagion. J. Econ. Dynam. Control, 31, 1245–1277.
- Raskutti, G., Wainwright, M. J. and Yu, B. (2010) Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res., 11, 2241–2259.
- Raskutti, G., Wainwright, M. J. and Yu, B. (2011) Minimax rates of estimation for high‐dimensional linear regression over ℓ_q‐balls. IEEE Trans. Inform. Theor., 57, 6976–6994.
- Raskutti, G., Wainwright, M. J. and Yu, B. (2012) Minimax‐optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13, 389–427.
- Seijo, E. and Sen, B. (2011a) Change‐point in stochastic design regression and the bootstrap. Ann. Statist., 39, 1580–1607.
- Seijo, E. and Sen, B. (2011b) A continuous mapping theorem for the smallest argmax functional. Electron. J. Statist., 5, 421–439.
- Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
- Tibshirani, R. (2011) Regression shrinkage and selection via the lasso: a retrospective (with comments). J. R. Statist. Soc. B, 73, 273–282.
- Tong, H. (1990) Non‐linear Time Series: a Dynamical System Approach. New York: Oxford University Press.
- Wang, L., Wu, Y. and Li, R. (2012) Quantile regression for analyzing heterogeneity in ultra‐high dimension. J. Am. Statist. Ass., 107, 214–222.
- Wu, Y. (2008) Simultaneous change point analysis and variable selection in a regression problem. J. Multiv. Anal., 99, 2154–2171.
- Zhang, N. R. and Siegmund, D. O. (2012) Model selection for high dimensional multi‐sequence change‐point problems. Statist. Sin., 22, 1507–1538.
- Zou, H. (2006) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.