Abstract
Although it is usual to find collinearity in econometric models, it is commonly disregarded. A common solution is to eliminate the variable causing the problem but, in some cases, this decision can affect the goal of the research. Alternatively, residualization not only allows collinearity to be mitigated, but it also provides an alternative interpretation of the coefficients by isolating the effect of the residualized variable. This paper fully develops the residualization procedure and justifies its application not only for dealing with multicollinearity but also for separating the individual effects of the regressor variables. This contribution is illustrated by two econometric models with financial and ecological data, although it can also be extended to many different fields.
Keywords: Collinearity, econometric, isolated effect, residualization
1. Introduction
Explanatory variables of an econometric model can present strong collinearity and, consequently, the variance of the ordinary least squares (OLS) estimators may be large relative to the values of the estimated parameters, which can then be statistically insignificant or have the wrong sign. Even when collinearity diagnostic measures indicate that the collinearity is not of concern, it is possible that the individual effects of the variables cannot be separated or displayed clearly. This idea resembles the objective of the Shapley value regression [57], which presents an entirely different strategy for assessing the contribution of regressor variables to the dependent variable. It owes its origin to the theory of cooperative games. The value of $R^2$ obtained by fitting a linear regression model is regarded as the value of a cooperative game played by the independent variables (each variable is a member) against the dependent variable (explaining it). The analyst does not have sufficient information to disentangle the contributions made by the individual members; only their joint contribution is known. The Shapley value decomposition imputes the most likely contribution of each individual member. On the other hand, [3] proposed an alternative methodology to OLS based on ordered variable regression (OVR), originally presented by [64], which entirely resolves the issue of related predictors by creating and using predictors that are perfectly unrelated.
These antecedents lead to residualization, which is a procedure applied in numerous research articles published in relevant social science journals in many different fields, such as linguistics [1,8,26,34–36], environmental issues [28–30] or economic development and policies [4,6,32,38,62], for example. This method has also been applied in previous research under the name of regression with orthogonal variables (see [44,53]). However, this method has not been fully developed in prior works, and we consider that this lack of specification can lead to criticisms such as the one in [66]. [65] also concluded that ‘residualization of predictor variables is not the hoped-for panacea [to collinearity]’. We consider that the key point not taken into consideration until now is that this methodology provides an alternative interpretation for the estimated parameters, apart from the mitigation of collinearity. This could be seen as a limitation, since the methodology is not always applicable, but it can also be seen as an opportunity to obtain new interpretations that are not possible from the initial model. To briefly explain the general concept of the method, it might be said that by residualizing one of the explanatory variables, its effect is isolated from the rest of the variables of the model. Thus, the part of this variable that has no relationship with the rest of the independent variables is included in the model, leading to a new interpretation of the residualized variable.
This paper fully develops the residualization procedure and justifies its application not only for dealing with multicollinearity but also for separating the individual effects of the regressor variables. Main properties and inference are also presented together with the variance inflation factor (VIF) and the condition number (CN), which allow us to check whether the collinearity has been mitigated after the application of residualization.
The structure of this paper is as follows: Section 2 presents the estimation and main properties of the residualization procedure, showing that the estimation of the variance of the random disturbance, the global significance test, the individual significance test of the residualized variable and the goodness of fit will be the same as those of the original model. Section 3 analyzes how residualization mitigates collinearity. Section 4 compares the residualization procedure with OLS and other well-known techniques, such as ridge regression, principal component regression (PCR) and partial least squares regression (PLSR). Section 5 presents successive residualization. Finally, Section 6 illustrates the contribution of this paper with two econometric models: the first shows the application of the method when the main goal of the researcher is to mitigate collinearity, and the second shows its application when the purpose of the study is to obtain new interpretations of the variables. The two empirical examples belong to different fields: the first is a financial model, while the second is usually applied in ecological studies. The main conclusions are summarized in Section 7.
2. Estimation and properties
Consider the following general linear regression model for p exogenous variables and n observations:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \qquad (1)$$
where $\mathbf{y}$ is the $n \times 1$ vector of observations of the dependent variable, $\mathbf{X}$ is the $n \times p$ matrix of regressors whose first column (a column of ones) corresponds to the intercept, $\boldsymbol{\beta}$ is the $p \times 1$ vector of coefficients and the random disturbance, $\mathbf{u}$, is spherical.
The first step is to define the following auxiliary regression:
$$\mathbf{X}_i = \mathbf{X}_{-i}\boldsymbol{\alpha} + \mathbf{v}, \qquad (2)$$
where $\mathbf{X}_{-i}$ is the result obtained after eliminating column (variable) $i$ from matrix $\mathbf{X}$ and $\mathbf{X}_i$ represents variable $i$. That is, after reordering the columns if necessary, $\mathbf{X} = [\mathbf{X}_{-i} \;\; \mathbf{X}_i]$.
OLS estimation of (2) yields the corresponding estimated residuals, $\mathbf{e}_i$. They represent the part of variable $i$ that has no relation with any other exogenous variable of model (1), since the residuals are orthogonal to $\mathbf{X}_{-i}$ (that is, $\mathbf{X}_{-i}^{t}\mathbf{e}_i = \mathbf{0}$, with $\mathbf{0}$ being a vector of zeros of appropriate dimension).
Taking the above into account, the residualization procedure consists of replacing the variable $\mathbf{X}_i$ with the estimated residuals of model (2), $\mathbf{e}_i$, in the original model (1). Thus, the residualized model is given by the following regression:
$$\mathbf{y} = \mathbf{X}_{R}\boldsymbol{\gamma} + \mathbf{u}_{R}, \qquad (3)$$
where $\mathbf{X}_{R} = [\mathbf{X}_{-i} \;\; \mathbf{e}_i]$.
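For illustration, the procedure can be carried out in R along the following lines; the data are simulated and the variable names (x2, x3, y) are hypothetical, so the snippet is a sketch of the steps above rather than the code used in this paper.

```r
# Sketch of residualization with simulated data (hypothetical variable names).
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- 0.9 * x2 + rnorm(n, sd = 0.3)      # x3 strongly related to x2
y  <- 1 + 2 * x2 + 3 * x3 + rnorm(n)

e3 <- resid(lm(x3 ~ x2))                 # auxiliary regression (2): part of x3
                                         # unrelated to the intercept and x2
fit_orig  <- lm(y ~ x2 + x3)             # original model (1)
fit_resid <- lm(y ~ x2 + e3)             # residualized model (3)

coef(fit_orig)["x3"]; coef(fit_resid)["e3"]                 # identical coefficient
summary(fit_orig)$r.squared; summary(fit_resid)$r.squared   # identical R^2
```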
Once the basic procedure is explained, the results of model (1) and model (3) are compared.
2.1. Estimation
The OLS estimator of model (1), $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^{t}\mathbf{X})^{-1}\mathbf{X}^{t}\mathbf{y}$, can be written in partitioned form as
$$\widehat{\boldsymbol{\beta}} = \begin{pmatrix} \widehat{\boldsymbol{\beta}}_{-i} \\ \widehat{\beta}_i \end{pmatrix} = \begin{pmatrix} (\mathbf{X}_{-i}^{t}\mathbf{X}_{-i})^{-1}\mathbf{X}_{-i}^{t}\mathbf{y} - \widehat{\boldsymbol{\alpha}}\,\widehat{\beta}_i \\ \mathbf{e}_i^{t}\mathbf{y}/\mathbf{e}_i^{t}\mathbf{e}_i \end{pmatrix}, \qquad (4)$$
taking into account that $\mathbf{X}_i = \mathbf{X}_{-i}\widehat{\boldsymbol{\alpha}} + \mathbf{e}_i$,
where $\widehat{\boldsymbol{\alpha}}$ and $\mathbf{e}_i^{t}\mathbf{e}_i$ are, respectively, the OLS estimator and the sum of squared residuals of the auxiliary regression (2).
Likewise, since $\mathbf{X}_{-i}^{t}\mathbf{e}_i = \mathbf{0}$ makes $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$ block diagonal, the OLS estimator of model (3), $\widehat{\boldsymbol{\gamma}}$, is given by
$$\widehat{\boldsymbol{\gamma}} = \begin{pmatrix} \widehat{\boldsymbol{\gamma}}_{-i} \\ \widehat{\gamma}_i \end{pmatrix} = (\mathbf{X}_{R}^{t}\mathbf{X}_{R})^{-1}\mathbf{X}_{R}^{t}\mathbf{y} = \begin{pmatrix} (\mathbf{X}_{-i}^{t}\mathbf{X}_{-i})^{-1}\mathbf{X}_{-i}^{t}\mathbf{y} \\ \mathbf{e}_i^{t}\mathbf{y}/\mathbf{e}_i^{t}\mathbf{e}_i \end{pmatrix}. \qquad (5)$$
Thus it is possible to compare the OLS estimators, expression (5), of the residualized model (3) with the OLS estimators of model (1), expression (4), obtaining the following conclusions:
The estimate of the coefficient of the residualized variable does not change in model (3), that is, $\widehat{\gamma}_i = \widehat{\beta}_i = \mathbf{e}_i^{t}\mathbf{y}/\mathbf{e}_i^{t}\mathbf{e}_i$. However, the interpretation is different: it now measures the variation produced in the dependent variable, $\mathbf{y}$, given a unit increase in $\mathbf{e}_i$, that is to say, in the part of the independent variable $\mathbf{X}_i$ that is not related to the rest of the independent variables, $\mathbf{X}_{-i}$. Hence, due to the new interpretation of the residualized variable, residualization can be applied to obtain conclusions that would not otherwise be possible.
The orthogonality of $\mathbf{e}_i$ with $\mathbf{X}_{-i}$ guarantees the ceteris paribus assumption; that is to say, when this variable increases, the other variables remain constant.
In addition, it is interesting to take into consideration that:
For convenience purposes, all of the independent variables in model (1) have been included in the auxiliary regression (2). However, it will be possible to include only some of the independent variables, depending on the interest of the researcher (e.g. trying to obtain interpretable residuals). In this case, the estimations of explanatory variables which are not included in the auxiliary regression will not change their value. Furthermore, if the constant is included in the auxiliary regression, nonessential collinearity will be mitigated because the residuals will be orthogonal to the constant. See Section 3 for details on distinguishing among the different types of collinearity.
The estimate of the non-residualized variables in model (3) coincides with the estimate obtained from the regression of $\mathbf{y}$ on $\mathbf{X}_{-i}$ alone. That is, the estimation and interpretation of the non-residualized variables will be the same as those obtained in a regression in which the residualized variable is eliminated. Nevertheless, this coincidence only occurs when all of the remaining explanatory variables of the original model are introduced in the auxiliary regression. Furthermore, since the two models have different residuals, the inference associated with these coefficients will be different.
Remark 2.1
Another interesting issue is how to select the variable to residualize. This paper presents different criteria that can be applied, or a combination thereof, depending on the goal of the research.
If the goal is to look for new interpretations, the variable to residualize will be the one that leads to the new interpretation desired by the researcher since the only interpretation that changes is that of the residualized variable.
It may also be interesting to rank the independent variables in model (1) according to their relevance, so as to avoid residualizing the variables considered most relevant and thereby maintain the original interpretation of their coefficients. This approach was already proposed in [3] with the use of OVR models.
2.2. Goodness of fit, estimation of the variance of the random disturbance and joint significance
The estimated residuals of the original model (1) will be given by
$$\mathbf{e} = \mathbf{y} - \mathbf{X}_{-i}\widehat{\boldsymbol{\beta}}_{-i} - \mathbf{X}_i\widehat{\beta}_i. \qquad (7)$$
Since $\mathbf{e}_i$ are the residuals of the auxiliary regression (2), it is verified that $\mathbf{X}_i = \mathbf{X}_{-i}\widehat{\boldsymbol{\alpha}} + \mathbf{e}_i$.
And the residuals of the residualized model (3) are
$$\mathbf{e}_{R} = \mathbf{y} - \mathbf{X}_{-i}\widehat{\boldsymbol{\gamma}}_{-i} - \mathbf{e}_i\widehat{\gamma}_i. \qquad (8)$$
Expression (8) coincides with (7); that is to say, the residuals of the original (1) and residualized (3) models coincide (a short derivation is given after the following list). Therefore, it is possible to conclude the following:
The sums of squared residuals of the two models coincide and, consequently, both models yield the same estimate of the variance of the random disturbance.
Since the two models employ the same dependent variable, the total sum of squares will also be the same and, consequently, the coefficients of determination will also coincide.
Since the F statistic of the global significance test can be expressed as a function of the coefficient of determination, the global significance tests of both models will also be the same.
It is clear that $\widehat{\mathbf{y}} = \mathbf{X}\widehat{\boldsymbol{\beta}} = \mathbf{X}_{R}\widehat{\boldsymbol{\gamma}}$; that is to say, the original and residualized models provide the same fitted values.
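The coincidence of the residuals can be made explicit with a short derivation in the notation introduced above (the coefficient identities used in the last step follow from expressions (4) and (5)):
$$\mathbf{e} = \mathbf{y} - \mathbf{X}_{-i}\widehat{\boldsymbol{\beta}}_{-i} - \mathbf{X}_i\widehat{\beta}_i = \mathbf{y} - \mathbf{X}_{-i}\widehat{\boldsymbol{\beta}}_{-i} - (\mathbf{X}_{-i}\widehat{\boldsymbol{\alpha}} + \mathbf{e}_i)\widehat{\beta}_i = \mathbf{y} - \mathbf{X}_{-i}(\widehat{\boldsymbol{\beta}}_{-i} + \widehat{\boldsymbol{\alpha}}\,\widehat{\beta}_i) - \mathbf{e}_i\widehat{\beta}_i = \mathbf{y} - \mathbf{X}_{-i}\widehat{\boldsymbol{\gamma}}_{-i} - \mathbf{e}_i\widehat{\gamma}_i = \mathbf{e}_R,$$
since $\widehat{\boldsymbol{\gamma}}_{-i} = \widehat{\boldsymbol{\beta}}_{-i} + \widehat{\boldsymbol{\alpha}}\,\widehat{\beta}_i$ and $\widehat{\gamma}_i = \widehat{\beta}_i$.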
2.3. Individual inference
Since the random disturbances are spherical, the individual inference in model (1) will be given by the main diagonal of the matrix $\sigma^{2}(\mathbf{X}^{t}\mathbf{X})^{-1}$, that is to say (see expression (4)), by
$$(\mathbf{X}^{t}\mathbf{X})^{-1} = \begin{pmatrix} (\mathbf{X}_{-i}^{t}\mathbf{X}_{-i})^{-1} + \widehat{\boldsymbol{\alpha}}\widehat{\boldsymbol{\alpha}}^{t}/\mathbf{e}_i^{t}\mathbf{e}_i & -\widehat{\boldsymbol{\alpha}}/\mathbf{e}_i^{t}\mathbf{e}_i \\ -\widehat{\boldsymbol{\alpha}}^{t}/\mathbf{e}_i^{t}\mathbf{e}_i & 1/\mathbf{e}_i^{t}\mathbf{e}_i \end{pmatrix}. \qquad (9)$$
Taking into account the following expression:
$$(\mathbf{X}_{R}^{t}\mathbf{X}_{R})^{-1} = \begin{pmatrix} (\mathbf{X}_{-i}^{t}\mathbf{X}_{-i})^{-1} & \mathbf{0} \\ \mathbf{0}^{t} & 1/\mathbf{e}_i^{t}\mathbf{e}_i \end{pmatrix}, \qquad (10)$$
it is evident that the main diagonals of both matrices are different, except for the element corresponding to the residualized variable. Since the estimation of the variance of the random disturbance is the same in both models, it is possible to conclude the following regarding the estimated coefficients:
The inference related to the individual significance (Student's t-test) of the unchanged variables differs between models (1) and (3).
The inference related to the individual significance (Student's t-test) of the residualized variable coincides in models (1) and (3).
Consequently, the residualization of the initial model does not affect the estimation of the variance of the random disturbance, the coefficient of determination, the global significance test or the individual significance test of the residualized variable. It only changes the individual significance of unaltered variables.
Remark 2.2
Another option to select the variable to be residualized is to choose a variable with a coefficient that is significantly different from zero in the original model since the individual significance test of the residualized variable is maintained in the residualized model.
3. Collinearity
Multicollinearity consists of the presence of interdependency between explanatory variables [19], distinguishing between two principal types of multicollinearity: perfect collinearity and near-collinearity [43,47,58]. The first type occurs when the interdependency between variables is exact, and the second occurs when it is approximate. Near-collinearity, also known as imperfect or approximate collinearity, may be divided into essential and nonessential collinearity [39,40,59]. The former concerns the relationship between explanatory variables, excluding the intercept, while the latter involves the relationship between the intercept and at least one of the remaining independent variables of the model.
In addition to the new interpretation of the coefficient of the residualized variable, another result of interest in the residualized model is the effect on the linear relationship between the independent variables of the initial model. To verify that collinearity is mitigated after the residualization of the initial model, the estimated variances of the estimated coefficients, the VIF and the CN are analyzed in the residualized model.
3.1. Decrease in the estimated variances
Considering that the estimation of the random disturbance variance is the same in the original and residualized models, the estimated variances of the coefficients will be determined by the main diagonals of the matrices $\widehat{\sigma}^{2}(\mathbf{X}^{t}\mathbf{X})^{-1}$ and $\widehat{\sigma}^{2}(\mathbf{X}_{R}^{t}\mathbf{X}_{R})^{-1}$. As noted above, the element corresponding to the residualized variable is the same in both matrices and, thus, the estimated variance will also be the same. That is, $\widehat{\mathrm{var}}(\widehat{\gamma}_i) = \widehat{\mathrm{var}}(\widehat{\beta}_i) = \widehat{\sigma}^{2}/\mathbf{e}_i^{t}\mathbf{e}_i$.
For the rest of the variables, given expressions (9) and (10), it is possible to obtain that
$$\widehat{\mathrm{var}}(\widehat{\beta}_j) - \widehat{\mathrm{var}}(\widehat{\gamma}_j) = \widehat{\sigma}^{2}\,\frac{\widehat{\alpha}_j^{2}}{\mathbf{e}_i^{t}\mathbf{e}_i}, \qquad j \neq i,$$
where $\widehat{\alpha}_j$ is the element of $\widehat{\boldsymbol{\alpha}}$ associated with variable $j$. Since $\widehat{\alpha}_j^{2}/\mathbf{e}_i^{t}\mathbf{e}_i \geq 0$, it is verified that $\widehat{\mathrm{var}}(\widehat{\gamma}_j) \leq \widehat{\mathrm{var}}(\widehat{\beta}_j)$ for $j \neq i$. In consequence, the estimated variances in the residualized model will always be less than or equal to those in the original model.
This result is relevant since it demonstrates that the residualization implies a decrease in the estimated variances of the estimated coefficients (which are assumed to be inflated due to the presence of collinearity). Note that this result is contrary to the conclusions presented in [7].
Remark 3.1
The linear relationship between the coefficients of model (1) given in (6) can also be used to reduce the variance of the estimated coefficients, simply by estimating the model with the restricted least-squares estimator. In this case, residualization could be used to mitigate this particular consequence of the existence of severe collinearity in the multiple linear regression model.
3.2. Variance inflation factor
Each explanatory variable $i$ of model (1) has an associated VIF given by
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^{2}}, \qquad (11)$$
where $R_i^{2}$ is the coefficient of determination of model (2). It is generally accepted that values of the VIF higher than 10 indicate severe collinearity [31].
Applying this definition to the residualized model (3), when $\mathbf{e}_i$ is the dependent variable of the auxiliary regression, its coefficient of determination will be zero (since $\mathbf{e}_i$ is orthogonal to the remaining regressors) and the associated VIF will be one (the minimum possible value). Otherwise, the VIF associated with a variable $j \neq i$ will be obtained from the following auxiliary regression:
$$\mathbf{X}_j = [\mathbf{X}_{-ij} \;\; \mathbf{e}_i]\,\boldsymbol{\delta} + \mathbf{w}, \qquad (12)$$
where $[\mathbf{X}_{-ij} \;\; \mathbf{e}_i]$ is the result obtained after eliminating column (variable) $j$ from matrix $\mathbf{X}_{R}$, with $\mathbf{X}_{-ij}$ denoting matrix $\mathbf{X}$ without columns (variables) $i$ and $j$.
Due to the orthogonality of $\mathbf{e}_i$ with $\mathbf{X}_{-ij}$ and with $\mathbf{X}_j$, the residuals of (12) coincide with the residuals of the following model:
$$\mathbf{X}_j = \mathbf{X}_{-ij}\boldsymbol{\theta} + \boldsymbol{\eta}. \qquad (13)$$
Then, models (12) and (13) have the same coefficient of determination since the dependent variable is the same in both models.
However, the coefficient of determination of model (13) cannot be higher than that of the following model:
$$\mathbf{X}_j = \mathbf{X}_{-j}\boldsymbol{\lambda} + \boldsymbol{\epsilon}, \qquad (14)$$
since this latter model contains an additional independent variable, $\mathbf{X}_i$. Then, the coefficient of determination of model (12) is lower than or equal to that of model (14).
Thus, since the VIF associated with variable j in the original model (1) is obtained from the coefficient of determination of the auxiliary regression (14), while in the residualized model (3) it is obtained from the coefficient of determination of model (12), it is clear that the VIF does not increase after residualizing the model. That is, the collinearity present in the model has been diminished.
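As an illustration, the VIFs can be computed directly from the auxiliary regressions' coefficients of determination; the data and variable names below are simulated and hypothetical.

```r
# VIFs before and after residualization, from the auxiliary regressions' R^2.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- 0.9 * x2 + rnorm(n, sd = 0.3)
x4 <- 0.5 * x2 - 0.4 * x3 + rnorm(n)

vif <- function(target, others) 1 / (1 - summary(lm(target ~ others))$r.squared)

vif(x2, cbind(x3, x4)); vif(x3, cbind(x2, x4)); vif(x4, cbind(x2, x3))  # original model

e3 <- resid(lm(x3 ~ x2 + x4))                                           # residualize x3
vif(x2, cbind(e3, x4)); vif(e3, cbind(x2, x4)); vif(x4, cbind(x2, e3))  # residualized model
```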
Remark 3.2
If the goal is to mitigate the collinearity in the model, one suggestion may be to residualize the variable with the highest VIF because, after the residualization, the VIF will be equal to 1. In this case, all independent variables have to be included in the auxiliary regression (2) to mitigate the essential and nonessential collinearities in the most efficient way.
3.3. Condition number
Given the model (1), the CN is given by
$$\mathrm{CN}(\mathbf{X}) = \sqrt{\frac{\mu_{\max}}{\mu_{\min}}},$$
where $\mu_{\min}$ and $\mu_{\max}$ are, respectively, the minimum and maximum eigenvalues of $\mathbf{X}^{t}\mathbf{X}$. Note that the matrix $\mathbf{X}$ should previously be transformed to have unit-length columns; that is to say, each column should be divided by the square root of the sum of its squared elements (see [5]). This author stated that values of the CN between 20 and 30 indicate moderate collinearity and values higher than 30 indicate high collinearity.
Then, the CN associated with model (3) is obtained as
$$\mathrm{CN}(\mathbf{X}_{R}) = \sqrt{\frac{\nu_{\max}}{\nu_{\min}}},$$
where $\nu_{\min}$ and $\nu_{\max}$ are, respectively, the minimum and maximum eigenvalues of $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$, where, once the columns of $\mathbf{X}_{R}$ have been transformed to unit length,
$$\mathbf{X}_{R}^{t}\mathbf{X}_{R} = \begin{pmatrix} \mathbf{X}_{-i}^{t}\mathbf{X}_{-i} & \mathbf{0} \\ \mathbf{0}^{t} & 1 \end{pmatrix},$$
since $\mathbf{X}_{-i}^{t}\mathbf{e}_i = \mathbf{0}$ and $\mathbf{e}_i^{t}\mathbf{e}_i = 1$ after the transformation. Then
$$\det(\mathbf{X}_{R}^{t}\mathbf{X}_{R} - \nu\,\mathbf{I}_{p}) = (1-\nu)\,\det(\mathbf{X}_{-i}^{t}\mathbf{X}_{-i} - \nu\,\mathbf{I}_{p-1}).$$
Thus one of the p eigenvalues of $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$ will be equal to one and the rest will coincide with the eigenvalues of the matrix $\mathbf{X}_{-i}^{t}\mathbf{X}_{-i}$. Supposing that the eigenvalue equal to one is the first one, $\nu_1 = 1$, it is verified that
$$\sum_{k=2}^{p}\nu_k = \mathrm{tr}(\mathbf{X}_{-i}^{t}\mathbf{X}_{-i}) = p-1.$$
If $\nu_1 = 1$ were the minimum eigenvalue of $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$, the rest of the eigenvalues would be equal to or higher than one ($\nu_k \geq 1$ for $k = 2,\dots,p$) and, consequently, their sum would be equal to or higher than $p-1$. However, this sum is equal to $p-1$ (since the trace of $\mathbf{X}_{-i}^{t}\mathbf{X}_{-i}$ is equal to $p-1$). Then, all the eigenvalues would be equal to one ($\nu_k = 1$ for all $k$); that is to say, $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$ would be the identity matrix and all the variables would be orthogonal to each other.
If $\nu_1 = 1$ were the maximum eigenvalue of $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$, the rest of the eigenvalues would be equal to or less than one ($\nu_k \leq 1$ for $k = 2,\dots,p$) and, consequently, their sum would be equal to or less than $p-1$. However, it was justified that this sum is equal to $p-1$. Then, all the eigenvalues would be equal to one ($\nu_k = 1$ for all $k$). Thus, as before, all the variables would be orthogonal to each other.
Since, outside this trivial orthogonal case, the eigenvalue equal to one can be neither the minimum nor the maximum eigenvalue of $\mathbf{X}_{R}^{t}\mathbf{X}_{R}$, both extremes have to be found among the rest of its eigenvalues, that is, among the eigenvalues of $\mathbf{X}_{-i}^{t}\mathbf{X}_{-i}$. Thus the CN of model (3) coincides with that of the auxiliary regression (2): $\mathrm{CN}(\mathbf{X}_{R}) = \mathrm{CN}(\mathbf{X}_{-i})$.
On the other hand, according to Cauchy's interlacing theorem for eigenvalues of Hermitian matrices, since $\mathbf{X}_{-i}^{t}\mathbf{X}_{-i}$ is a principal submatrix of order $p-1$ of $\mathbf{X}^{t}\mathbf{X}$, it has to be verified that
$$\mu_{\min} \leq \nu_{\min} \leq \nu_{\max} \leq \mu_{\max}.$$
Thus the CN of the residualized model (3) has to be equal to or less than the CN of the original model (1).
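The same comparison can be checked numerically (again a sketch with simulated data and hypothetical names; the scaling to unit-length columns follows [5]).

```r
# Condition number of the unit-length-scaled design matrix, before and after residualization.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- 0.9 * x2 + rnorm(n, sd = 0.3)

cn <- function(X) {
  Xs <- apply(X, 2, function(col) col / sqrt(sum(col^2)))   # unit-length columns
  ev <- eigen(crossprod(Xs), symmetric = TRUE)$values
  sqrt(max(ev) / min(ev))
}

e3 <- resid(lm(x3 ~ x2))      # auxiliary regression on the intercept and x2
cn(cbind(1, x2, x3))          # CN of the original model
cn(cbind(1, x2, e3))          # CN of the residualized model (never larger)
```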
Remark 3.3
If the goal is to mitigate the collinearity in the model, one suggestion could be to residualize the variable i whose auxiliary regression (where variable i is the dependent one) presents the lowest CN, since this value coincides with the CN of the residualized model, which will always be equal to or less than the CN of the original model.
4. Comparison of the residualization method with other existing methods
This section presents a Monte Carlo simulation to compare the residualization methodology with other existing methods, such as ridge regression, PCR and PLSR, in relation to the mean square error (MSE) and the prediction error. First, the MSE of the residualization method is derived and compared with the MSE obtained by OLS. Second, the metrics used to measure the prediction capability of each methodology are presented.
4.1. Mean square error
Note that the original model is different from the residualized model and, for this reason, both models should be analyzed separately, so that a direct comparison may not be appropriate. However, some publications have not considered this divergence (see, e.g. [66]). For this reason, given that $\widehat{\boldsymbol{\gamma}}$ is a biased estimator of $\boldsymbol{\beta}$,
$$E[\widehat{\boldsymbol{\gamma}}] = \begin{pmatrix} \boldsymbol{\beta}_{-i} + \widehat{\boldsymbol{\alpha}}\,\beta_i \\ \beta_i \end{pmatrix} \neq \boldsymbol{\beta},$$
it could be interesting to calculate the MSE of residualization and to compare it with the MSE of the OLS estimator.
Given an estimator $\widetilde{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$, its MSE is expressed as
$$\mathrm{MSE}(\widetilde{\boldsymbol{\beta}}) = E\left[(\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta})^{t}(\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta})\right] = \mathrm{tr}\left(\mathrm{var}(\widetilde{\boldsymbol{\beta}})\right) + \mathrm{bias}(\widetilde{\boldsymbol{\beta}})^{t}\,\mathrm{bias}(\widetilde{\boldsymbol{\beta}}).$$
In the case of the OLS estimator, $\widehat{\boldsymbol{\beta}}$ is unbiased, $E[\widehat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$, and, taking into account expression (9), the following is verified:
$$\mathrm{MSE}(\widehat{\boldsymbol{\beta}}) = \sigma^{2}\,\mathrm{tr}\left((\mathbf{X}^{t}\mathbf{X})^{-1}\right). \qquad (15)$$
For the estimator $\widehat{\boldsymbol{\gamma}}$, starting from (10), it is verified that
$$\mathrm{MSE}(\widehat{\boldsymbol{\gamma}}) = \sigma^{2}\,\mathrm{tr}\left((\mathbf{X}_{R}^{t}\mathbf{X}_{R})^{-1}\right) + \beta_i^{2}\,\widehat{\boldsymbol{\alpha}}^{t}\widehat{\boldsymbol{\alpha}}, \qquad (16)$$
where the second term is the squared norm of the bias of $\widehat{\boldsymbol{\gamma}}$.
From expressions (15) and (16), and since $\mathrm{tr}\left((\mathbf{X}^{t}\mathbf{X})^{-1}\right) - \mathrm{tr}\left((\mathbf{X}_{R}^{t}\mathbf{X}_{R})^{-1}\right) = \widehat{\boldsymbol{\alpha}}^{t}\widehat{\boldsymbol{\alpha}}/\mathbf{e}_i^{t}\mathbf{e}_i$, it is clear that
$$\mathrm{MSE}(\widehat{\boldsymbol{\beta}}) - \mathrm{MSE}(\widehat{\boldsymbol{\gamma}}) = \widehat{\boldsymbol{\alpha}}^{t}\widehat{\boldsymbol{\alpha}}\left(\frac{\sigma^{2}}{\mathbf{e}_i^{t}\mathbf{e}_i} - \beta_i^{2}\right),$$
so $\widehat{\boldsymbol{\gamma}}$ has a lower MSE than $\widehat{\boldsymbol{\beta}}$ if
$$\beta_i^{2} < \frac{\sigma^{2}}{\mathbf{e}_i^{t}\mathbf{e}_i}. \qquad (17)$$
4.2. Metrics
The root mean squared error (RMSE) and the mean absolute error (MAE) will be applied to measure the fit capability of each model while the prediction capability will be measured by the root mean squared prediction error (RMSPE) and the mean absolute prediction error (MAPE).
Given a sample with n observations, assume it is divided into two subsamples: the first with m observations and the second with h observations, verifying that m + h = n. The first subsample is applied to measure the fit capability by calculating the RMSE and MAE, and the second subsample is applied to evaluate the prediction capability by obtaining the RMSPE and MAPE. Then, the following expressions are obtained:
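In their standard form (stated here with generic notation), with $\widehat{y}_t$ denoting the fitted value of an observation in the first subsample and $\widetilde{y}_t$ the prediction for an observation in the second subsample, these measures are
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{t=1}^{m}(y_t-\widehat{y}_t)^{2}}, \qquad \mathrm{MAE} = \frac{1}{m}\sum_{t=1}^{m}\left|y_t-\widehat{y}_t\right|,$$
$$\mathrm{RMSPE} = \sqrt{\frac{1}{h}\sum_{t=m+1}^{n}(y_t-\widetilde{y}_t)^{2}}, \qquad \mathrm{MAPE} = \frac{1}{h}\sum_{t=m+1}^{n}\left|y_t-\widetilde{y}_t\right|.$$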
4.3. Simulation
The simulation performed to compare the residualization with other existing methods is described below.
Given the model $\mathbf{y} = \beta_1 + \beta_2\mathbf{X}_2 + \beta_3\mathbf{X}_3 + \mathbf{u}$, the following simulation is performed in order to establish the behavior of condition (17):
It is considered that with .
It is also considered that and , so is generated. Thus, given matrix , a symmetric positive-definite matrix, , is built.
and are generated from .
The random perturbance, , is generated as , where , from which it is calculated that , where .
A comparison of both models (OLS and residualization) is conducted with different sample sizes, $n \in \{25, 50, 75, 100, 125, 150\}$, such that 60,000 simulations are performed in this experiment.
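A minimal sketch of a simulation in this spirit is given below; the distributions, parameter values and degree of collinearity are illustrative choices, not those of the experiment reported in Table 1, and condition (17) is evaluated with the estimated quantities, as described next.

```r
# Illustrative check of how often condition (17), beta_i^2 < sigma^2 / (e_i' e_i),
# holds when x3 is residualized (arbitrary distributions and parameters).
set.seed(123)
reps <- 1000
n    <- 100
beta <- c(1, 2, 3)                               # intercept, beta_2, beta_3
hits <- replicate(reps, {
  x2 <- rnorm(n, mean = 5, sd = 2)
  x3 <- 0.8 * x2 + rnorm(n, sd = 1)              # x3 correlated with x2
  y  <- beta[1] + beta[2] * x2 + beta[3] * x3 + rnorm(n, sd = 2)
  e3  <- resid(lm(x3 ~ x2))                      # auxiliary regression (2)
  fit <- lm(y ~ x2 + e3)                         # residualized model (3)
  coef(fit)["e3"]^2 < summary(fit)$sigma^2 / sum(e3^2)   # estimated condition (17)
})
mean(hits)                                       # proportion of cases verifying (17)
```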
First, once the previous model and the corresponding auxiliary regressions are estimated, condition (17) is calculated from the obtained estimations of , and . In Table 1, it can be observed that there are two types of situations: one where essential collinearity does not imply strong collinearity problems (the mean correlation is equal to 0.4877, which leads to a VIF value of 1.31208) and another where essential collinearity implies strong collinearity problems (the maximum and minimum correlations lead to VIF values of approximately 50.2512). It can also be observed that there are two types of situations in relation to nonessential collinearity: one where it is not worrisome (the mean value of the coefficient of variation (CV) is approximately 6, which implies the data have enough variability) and another where nonessential collinearity is worrisome (the minimum values of CV for each variable are close to zero, which implies slight variability of the data and indicates that the data may be considered almost constant and hence related to the intercept).
Table 1. Simulation results for MSE.
| | n = 25 | n = 50 | n = 75 | n = 100 | n = 125 | n = 150 | Mean |
|---|---|---|---|---|---|---|---|
| Cond. (17), resid. var. $\mathbf{X}_2$ | 8.13% | 6.96% | 7.15% | 6.69% | 6.96% | 7.00% | 7.159% |
| Cond. (17), resid. var. $\mathbf{X}_3$ | 8.42% | 7.15% | 7.12% | 6.71% | 6.77% | 6.84% | |
| Corr($\mathbf{X}_2$, $\mathbf{X}_3$), minimum | −0.9862 | −0.9747 | −0.9883 | −0.9973 | −0.9934 | −0.9915 | 0.4877 |
| Corr($\mathbf{X}_2$, $\mathbf{X}_3$), maximum | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9998 | 0.9999 | |
| CV of $\mathbf{X}_2$, minimum | 0.00958 | 0.00935 | 0.00911 | 0.0111296 | 0.00628 | 0.01072 | 6.6665 |
| CV of $\mathbf{X}_2$, maximum | 30222.48 | 12359.03 | 3249.84 | 3426.69 | 31001.38 | 1498.77 | |
| CV of $\mathbf{X}_3$, minimum | 0.0107 | 0.0116 | 0.00791 | 0.00682 | 0.0108 | 0.0114 | 5.7096 |
| CV of $\mathbf{X}_3$, maximum | 9763.29 | 37893.3 | 5199.45 | 2718.18 | 2655.24 | 3586.79 | |
The first and second rows of Table 1 show the percentage of cases in which condition (17) is verified, considering that variables $\mathbf{X}_2$ and $\mathbf{X}_3$, respectively, are residualized. Note that both results are similar and that there are no material differences across sample sizes. The results show that the condition is verified in only 7.159% of the cases.
Second, Table 2 is obtained from 60,000 additional simulations performed by dividing the sample as described in Section 4.2. The R package pls was applied to obtain the PCR and PLSR results, considering one and two principal components. For ridge regression, the value of k was selected to mitigate the collinearity, considering that it is not worrisome for CN values lower than 20, as shown by [54]. This idea was also applied in [22] using the VIF instead of the CN, but in this case it was considered more appropriate to use the CN, since the VIF ignores nonessential collinearity [52].
Table 2. Simulation results for RMSE, MAE, RMSPE and MAPE.
| Metric | OLS | Resid. var. $\mathbf{X}_2$ | Resid. var. $\mathbf{X}_3$ | Ridge | PCR (1c) | PLSR (1c) | PCR (2c) | PLSR (2c) |
|---|---|---|---|---|---|---|---|---|
| RMSE | 2.635103 | 2.635103 | 2.635103 | 2.637032 | 8.289639 | 5.435111 | 8.810389 | 6.182757 |
| MAE | 1.960508 | 1.960508 | 1.960508 | 1.961239 | 6.611481 | 4.300871 | 8.507369 | 6.266333 |
| RMSPE | 2.725054 | 19.73536 | 19.64029 | 2.721985 | 8.49428 | 5.591137 | 2.725054 | 2.725054 |
| MAPE | 2.087758 | 17.56701 | 17.48688 | 2.084665 | 6.92049 | 4.524651 | 2.087758 | 2.087758 |
From the results of the first subsample, it is obtained that the residualization method and OLS lead to the same results for RMSE and MAE, as was noted in Section 2.2, since both models provide the same fitted values. These values are slightly lower than those of the other techniques.
From the values of RMSPE and MAPE obtained from the second subsample, it is possible to conclude that the residualization method presents the lowest prediction capability. However, the fact that the rest of the methods do not improve the results obtained by OLS could indicate that, when the purpose is prediction, the best way to proceed is to do nothing. These results support the idea provided by [25]: if the goal is simply to predict, then multicollinearity is not a problem because the predictions will still be accurate.
5. Successive residualization
It is possible that the goal of the researcher (to mitigate collinearity or to obtain a new interpretation for the estimated coefficients) has not been achieved after residualizing the first variable. In that case, a second variable can be residualized, and the doubly residualized model will be given by
$$\mathbf{y} = [\mathbf{X}_{-ij} \;\; \mathbf{e}_i \;\; \mathbf{e}_j]\,\boldsymbol{\phi} + \mathbf{u}_{RR}, \qquad (18)$$
where $\mathbf{e}_j$ are the residuals of the auxiliary regression (13).
The goal of this section is not to obtain the estimation and inference of model (18) but to highlight that successive residualization may be of interest; in this case, the residuals $\mathbf{e}_i$ and $\mathbf{e}_j$ will be orthogonal, since it is verified that
$$\mathbf{e}_i^{t}\mathbf{e}_j = \mathbf{e}_i^{t}\left(\mathbf{X}_j - \mathbf{X}_{-ij}\widehat{\boldsymbol{\theta}}\right) = \mathbf{e}_i^{t}\mathbf{X}_j - \mathbf{e}_i^{t}\mathbf{X}_{-ij}\widehat{\boldsymbol{\theta}} = 0,$$
because $\mathbf{e}_i$ is orthogonal to every column of $\mathbf{X}_{-i}$.
This relationship between the residuals will still hold if more variables are residualized; that is, the degree of multicollinearity will continue decreasing. Note that if the process is repeated p−1 times, all the variables of the model will be orthogonal to each other. Interested readers should consult [53] for further information on successive residualization.
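A sketch of two successive residualizations with simulated data (hypothetical variable names) is the following:

```r
# Successive residualization: x3 is residualized on the other regressors and
# x4 on the regressors excluding x3; the two sets of residuals are orthogonal.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x4 <- 0.6 * x2 - 0.5 * x3 + rnorm(n, sd = 0.4)
y  <- 1 + 2 * x2 + 3 * x3 - x4 + rnorm(n)

e3 <- resid(lm(x3 ~ x2 + x4))   # auxiliary regression (2)
e4 <- resid(lm(x4 ~ x2))        # auxiliary regression of x4 on the rest, excluding x3

crossprod(e3, e4)               # numerically zero: the residuals are orthogonal
summary(lm(y ~ x2 + e3 + e4))   # doubly residualized model (18)
```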
6. Empirical application
The methodology proposed herein is applied to two different models in relation to economic-financial and ecological data sets. The first example is focused on the application of residualization when the main goal of the researcher is the mitigation of collinearity and also presents a new interpretation of the modified variable. The second example shows the application of residualization when the major purpose of the study is to obtain new interpretations of the variables. The use of different models from different fields is motivated by the challenge of demonstrating the relevance and application of the method in real-world examples across a wide range of disciplines.
6.1. The economic-financial model
The first empirical application is based on a model developed by Wooldridge [63] with interest rate data from the market yields published by Salomon Brothers in An Analytical Record of Yields and Yield Spreads in the 1990s but modified in this paper for the period June 2008 to April 2019 (end-of-month data) by using a dataset from the U.S. Department of the Treasury:
$$r52_t = \beta_1 + \beta_2\, r13_t + \beta_3\, r26_t + u_t, \qquad (19)$$
where $u_t$ is the random disturbance, which is spherical, and $r13_t$, $r26_t$ and $r52_t$ represent the coupon equivalents for the 13-week, 26-week and 52-week maturity tranches, respectively. The coupon equivalent, also called the bond equivalent or the investment yield, is the bill's yield based on the purchase price, discount and a 365/366-day year. The coupon equivalent can be used to compare the yield on a discount bill with the yield on a nominal coupon bond that pays semiannual interest.
The following matrix represents the correlation between the variables:
From this matrix, it is observed that all explanatory variables are positively related to the dependent variable. In addition, the coefficient of correlation between the explanatory variables $r13_t$ and $r26_t$ is equal to 0.993, which can serve as a first indication of the dependence between them. From this coefficient of correlation, it is determined that the VIF is equal to 71.516, while the CN is equal to 23.233. Both measures confirm the existence of worrisome near-collinearity in this model. In relation to nonessential collinearity, the coefficients of variation of both explanatory variables are higher than the threshold established by [56] and, consequently, it is possible to conclude that there is no relation between the intercept and any of the independent variables.
Table 3 presents the results of the estimation by OLS, ridge regression and residualization. As seen from the OLS results, the negative value of the estimated parameter for $r13_t$ does not make economic sense, particularly when taking into account the correlation matrix previously presented, which suggests the existence of a direct relation between the explanatory and dependent variables.
Table 3. Results of Wooldridge model.
| | OLS | Ridge | Residualization |
|---|---|---|---|
| Intercept | 0.050 ** | 0.058 | 0.183 ** |
| (s.d.) | (0.008) | n/a | (0.007) |
| r13 | −0.557 ** | −0.464 | 1.067 ** |
| (s.d.) | (0.065) | n/a | (0.008) |
| r26 | 1.562 ** | 1.473 | |
| (s.d.) | (0.063) | n/a | |
| e26 | | | 1.562 ** |
| (s.d.) | | | (0.063) |
| $R^2$ | 0.9935 | n/a | 0.9935 |
| F statistic | 9794 | n/a | 9794 |
| p-value (of F) | | n/a | |
**, * mean the coefficient is statistically significant at 0.01 (99% level of confidence) and at 0.05 (95% level of confidence), respectively.
This unexpected sign can be a consequence of the presence of collinearity.
To apply residualization, the first step is to select the variable to be residualized. In this case, it is possible to consider that the medium term, $r26_t$, includes in some way the short term, $r13_t$. From the Shapley values (Table 4), it can also be seen that $r26_t$ contributes more to the $R^2$ of the model. Additionally, the residuals obtained from the auxiliary regression of $r26_t$ on $r13_t$ will be the part of $r26_t$ that is not explained by $r13_t$; that is, the residuals will represent the second 13-week period. These facts indicate that the variable selected to be residualized is $r26_t$. Note that residualization provides a new interpretation that is not possible to obtain from the initial model.
Table 4. Results of Wooldridge model: Shapley values.
| | r13 | r26 | Total |
|---|---|---|---|
| Shapley value (OLS) | 0.4828 | 0.5107 | 0.9935 |
| Share (% of $R^2$) | 48.594% | 51.406% | 100% |
When residualization is applied, the value of the estimated parameter for $r13_t$ becomes positive. Additionally, all the estimated coefficients are individually significant at the 99% level of confidence, the coefficient of determination of the original model is maintained, and the essential near-collinearity problem is mitigated (the lowest possible value for the VIF, 1.000, is obtained). In this case, the CN is equal to 1.878, which reflects the relationship between $r13_t$ and the intercept and shows that this relationship is not worrisome. Finally, it is important to remark that the estimated variances of the estimators have diminished, except in the case of the residualized variable, for which the variance remains constant.
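The residualized column of Table 3 can be reproduced along the following lines; the data frame and column names (tbill, r13, r26, r52) are hypothetical stand-ins for the Treasury series, not the original dataset's identifiers, so synthetic data are generated here for illustration.

```r
# Synthetic stand-in for the coupon-equivalent series (illustrative only).
set.seed(1)
m     <- 131                                     # roughly monthly, June 2008 - April 2019
r13   <- abs(cumsum(rnorm(m, sd = 0.05)))        # 13-week yield
r26   <- 1.02 * r13 + rnorm(m, sd = 0.02)        # 26-week yield, strongly related to r13
r52   <- 0.2 + 1.1 * r26 + rnorm(m, sd = 0.05)   # 52-week yield
tbill <- data.frame(r13, r26, r52)

e26 <- resid(lm(r26 ~ r13, data = tbill))        # part of r26 not explained by r13
summary(lm(r52 ~ r13 + e26, data = tbill))       # residualized model (cf. Table 3)
1 / (1 - summary(lm(r13 ~ e26, data = tbill))$r.squared)   # VIF of r13: equal to 1
```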
It may be interesting to compare these results with those obtained from ridge regression, which is widely applied to estimate models with collinearity. For this, a value of k was selected that results in a VIF value of less than 10 (calculated by following [55]), which is k = 0.047; see Table 3. Note that in this case the signs obtained are not the expected ones. Furthermore, ridge regression does not allow any conclusion to be drawn about the global characteristics of the model, the individual significance, the inference or the analysis of the isolated effect of the two variables, $r13_t$ and $r26_t$.
In addition, Figure 1 compares the estimations in terms of essential near-collinearity (the nonessential collinearity is not analyzed since it is not worrisome in this model), showing that the VIF value for ridge regression for different values of k is always lower than that of the OLS estimation but higher than the VIF obtained by residualization.
Figure 1. VIF values obtained by ridge regression of the Wooldridge model (monthly data), for k in steps of 0.001.
6.2. The ecological model
The second example is based on the STIRPAT model, which is usually applied in environmental economics to observe the influence of some social and economic variables on the atmospheric impact of a country or a group of countries. It can be defined as the stochastic version of the IPAT identity [11,12], which identifies the impact on the atmosphere as a function of the population, the affluence measured from the gross domestic product (GDP), and one or more technology variables (usually variables related to industry; see, e.g. [41] or [42]). Ehrlich and Holdren, who were the first researchers to study the IPAT identity [15–17], were aware of the problem of the relationship between the variables in this identity and, although collinearity is likely to appear, most STIRPAT applications have disregarded it (see, e.g. [2,9,10,13,24,33,42,45,46,48–50,60]). However, there have been some efforts to address collinearity in STIRPAT models (see [14,18,20,27,37,41,51,61]).
This paper applies the STIRPAT model to data from China (1990–2014), the most polluting country in the world as revealed by the World Bank, with a CO2 emissions value of 10,291,926.878 kilotonnes (kt) in 2014. The traditional specification of the STIRPAT model is
$$I_t = \beta_1 + \beta_2 P_t + \beta_3 A_t + \beta_4 T_t + u_t, \qquad (20)$$
where $u_t$ is the random disturbance, which is spherical; $I_t$ represents CO2 emissions (kt); $P_t$ is the total population (billions); $A_t$ is the per capita GDP (expressed in trillions of constant 2010 US$); and, finally, $T_t$ is industrialization (% of GDP). The dataset has been extracted from the World Bank. However, in this paper, the following specification is proposed:
$$I_t = \beta_1 + \beta_2 P_t + \beta_3 e_{GDP,t} + \beta_4 T_t + u_t, \qquad (21)$$
where $e_{GDP,t}$ are the residuals of the following auxiliary regression:
$$GDP_t = \alpha_1 + \alpha_2 P_t + \alpha_3 T_t + v_t, \qquad (22)$$
where $GDP_t$ is the GDP (expressed in trillions of constant 2010 US$).
The use of model (21) instead of model (20) intends to overcome the following disadvantages:
Traditionally, per capita GDP has been used to avoid the existing dependency between the GDP and the population. However, as the reader will see in the following correlation matrix, the linear relationship between per capita GDP and population is higher than the relationship between GDP and population. This means that the linear relationship is not mitigated but increased.
In STIRPAT studies, (per capita GDP) is usually taken as the variable that represents affluence. However, the use of presents a disadvantage. From an interpretative point of view, a very important issue of the economy is ignored when using the per capita GDP (the ratio between GDP and population) since the distribution of income and the level of development of each region of the country are disregarded when all people are considered equal in terms of earnings. An increase in the per capita GDP does not necessarily mean the country is more developed; it may also indicate that the richest people in the country have increased their income.
In the residualization procedure, the relationship between GDP and population is deleted (in this case, the relationship between GDP and industrialization, variable $T_t$, is also deleted) and, unlike with per capita GDP, it is no longer assumed that all people have the same income. Indeed, $e_{GDP,t}$ coincides with the part of GDP that has no relationship with population and industrialization, as has been discussed. If $A_t$ could be interpreted as a tool to measure the enrichment of the people and not the enrichment of the country, $e_{GDP,t}$ would be interpreted as a tool that measures whether the country, and not the people, is richer in economic terms that are unrelated to industrialization.
Furthermore, in model (21), nonessential collinearity represents a significant issue for this empirical example, since the coefficients of variation of $P_t$ and $T_t$ fall below the threshold established in [56]. To mitigate the nonessential collinearity, the variables population and industrialization will be centered. Thus the following model is proposed:
$$I_t = \beta_1 + \beta_2 P_t^{c} + \beta_3 e_{GDP,t} + \beta_4 T_t^{c} + u_t, \qquad (23)$$
where $P_t^{c} = P_t - \overline{P}$ and $T_t^{c} = T_t - \overline{T}$.
With this model, it is verified that the degree of the existing near-multicollinearity is not worrisome: the VIF values of the three explanatory variables are lower than the threshold of 10 and the value of the CN is lower than 30.
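A sketch of how models (22) and (23) can be estimated in R, using a synthetic stand-in for the China series (the column names co2, pop, gdp and ind are hypothetical, not the World Bank identifiers):

```r
# Residualized and centered STIRPAT specification (illustrative data).
set.seed(1)
pop   <- seq(1.14, 1.37, length.out = 25)                 # population, billions
gdp   <- exp(seq(log(1), log(9), length.out = 25))        # GDP, trillions of 2010 US$
ind   <- 40 + 5 * sin(seq(0, 3, length.out = 25))         # industry, % of GDP
co2   <- 2e6 + 8e5 * gdp + 5e6 * (pop - mean(pop)) + rnorm(25, sd = 1e5)
china <- data.frame(co2, pop, gdp, ind)

e_gdp <- resid(lm(gdp ~ pop + ind, data = china))  # auxiliary regression (22)
pop_c <- china$pop - mean(china$pop)               # centered population
ind_c <- china$ind - mean(china$ind)               # centered industrialization
summary(lm(co2 ~ pop_c + e_gdp + ind_c, data = china))   # model (23)
```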
The results obtained by using OLS estimation of models (20) and (23) are shown in Table 5. The reader will observe the following:
In model (20), the intercept has a coefficient that is significantly different from zero and has a negative value. This means that if population and GDP were null, CO2 emissions would be negative. This situation is corrected with model (23).
In model (20), the estimated coefficient for population is not significantly different from zero; by contrast, in model (23), the coefficient is significant and has a positive value. This means that when the population increases, the CO2 emissions also increase. This is in line with economic theory and the correlation matrix.
In models (20) and (23), the GDP coefficient (obtained from $A_t$ and $e_{GDP,t}$, respectively) is significantly different from zero and has a positive value. However, the interpretations of the two estimated coefficients are different. Thus, in model (23), the conclusion is that an increase in the wealth of the country through production of goods and services that is unrelated to industrialization entails an increase in CO2 emissions.
In model (20), the estimated coefficient for industrialization is significant and has a positive value. This is contrary to the sign expected from the correlation matrix; however, this situation is corrected by model (23), in which this coefficient is not significantly different from zero.
Table 5. Results of STIRPAT models (20) and (23).
| | Model (20) | | Model (23) |
|---|---|---|---|
| Intercept | −10,287,191 * | Intercept | 5,405,875 *** |
| (s.d.) | (3,667,784) | (s.d.) | (53,166) |
| $P_t$ | −1,790,259 | $P_t^{c}$ | 35,883,988 *** |
| (s.d.) | (1,861,596) | (s.d.) | (1,004,251) |
| $A_t$ | 1,837,647 *** | $e_{GDP,t}$ | 1,300,875 *** |
| (s.d.) | (77,813) | (s.d.) | (57,278) |
| $T_t$ | 409,211 *** | $T_t^{c}$ | 24,250 |
| (s.d.) | (80,784) | (s.d.) | (82,153) |
| $R^2$ | 0.9924 | $R^2$ | 0.9918 |
| F statistic | 918.2 | F statistic | 851.2 |
| p-value (of F) | | p-value (of F) | |
***, **, * mean the coefficient is statistically significant at 0.001 (99.9% level of confidence), at 0.01 (99% level of confidence) and at 0.05 (95% level of confidence), respectively.
7. Conclusion
The estimation and inference of the multiple linear regression model under the residualization procedure have been exhaustively developed in this paper, and the results have been compared with those of the original model. In this sense, it is important to point out that the application of residualization leads to conclusions about a model that is different from the original one, even though the two models share several identical characteristics (such as the estimate of the variance of the random perturbation, the coefficient of determination and the significance statistics).
The main contributions when applying this technique are:
The new interpretations of the coefficients. The residualized model can answer questions that could not be answered with the initial model.
The possibility of reducing the degree of collinearity in the initial model.
This paper proposes different criteria to select the variable to be residualized, and the option of successive residualization is also presented.
Finally, it is relevant to note that residualization is not always applicable because the interpretations of the new estimated coefficients are not always simple. This also occurs with other well-known techniques, such as ridge regression, PCR or PLSR, whose estimated parameters are likewise controversial to interpret. A Monte Carlo simulation showed that ridge regression, PCR and PLSR present a prediction capability better than that of residualization but not better than that of OLS. Other alternative methodologies, such as raise regression (see [21] and [23]), could also be recommended.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- 1.Ambridge B., Pine J. and Rowland C., Semantics versus statistics in the retreat from locative overgeneralization errors, Cognition 123 (2012), pp. 260–279. doi: 10.1016/j.cognition.2012.01.002
- 2.Azam M. and Khan A., Testing the environmental Kuznets curve hypothesis: A comparative empirical study for low, lower middle, upper middle and high income countries, Renew. Sust. Energ. Rev. 63 (2016), pp. 556–567. doi: 10.1016/j.rser.2016.05.052
- 3.Baird G. and Bieber S., The goldilocks dilemma: Impacts of multicollinearity. A comparison of simple linear regression, multiple regression, and ordered variable regression models, J. Mod. Appl. Stat. Methods 15 (2016), pp. 18.
- 4.Bandelj N. and Mahutga M., How socio-economic change shapes income inequality in post-socialist Europe, Social Forces 88 (2010), pp. 2133–2161. doi: 10.1353/sof.2010.0042
- 5.Belsley D., Conditioning Diagnostics: Collinearity and Weak Data in Regression, John Wiley, New York, 1991.
- 6.Bradshaw Y., Urbanization and underdevelopment: A global study of modernization, urban bias, and economic dependency, Am. Sociol. Rev. 52 (1987), pp. 224–239. doi: 10.2307/2095451
- 7.Buse A., Brickmaking and the collinear arts: A cautionary tale, Canadian J. Econ. 27 (1994), pp. 408–414. doi: 10.2307/135754
- 8.Cohen-Goldberg A., Phonological competition within the word: Evidence from the phoneme similarity effect in spoken production, J. Mem. Lang. 67 (2012), pp. 184–198. doi: 10.1016/j.jml.2012.03.007
- 9.Coondoo D. and Dinda S., Causality between income and emission: A country group-specific econometric analysis, Ecol. Econ. 40 (2002), pp. 351–367. doi: 10.1016/S0921-8009(01)00280-4
- 10.De Bruyn S., Van Den Bergh J. and Opschoor J., Economic growth and emissions: Reconsidering the empirical basis of environmental Kuznets curves, Ecol. Econ. 25 (1998), pp. 161–175. doi: 10.1016/S0921-8009(97)00178-X
- 11.Dietz T. and Rosa E., Rethinking the environmental impacts of population, affluence and technology, Hum. Ecol. Rev. 1 (1994), pp. 277–300.
- 12.Dietz T. and Rosa E., Effects of population and affluence on CO2 emissions, Proc. Natl. Acad. Sci. USA 94 (1997), pp. 175–179. doi: 10.1073/pnas.94.1.175
- 13.Disli M., Ng A. and Askari H., Culture, income, and CO2 emission, Renew. Sust. Energ. Rev. 62 (2016), pp. 418–428. doi: 10.1016/j.rser.2016.04.053
- 14.Dong J., Deng C., Li R. and Huang J., Moving low-carbon transportation in Xinjiang: Evidence from STIRPAT and rigid regression models, Sustainability 9 (2016), pp. 24. doi: 10.3390/su9010024
- 15.Ehrlich P. and Holdren J., The people problem, Saturday Rev. 4 (1970), pp. 42–43.
- 16.Ehrlich P. and Holdren J., The impact of population growth, Science 171 (1971), pp. 1212–1217. doi: 10.1126/science.171.3977.1212
- 17.Ehrlich P. and Holdren J., A bulletin dialogue on the ‘Closing circle’: critique: One dimensional ecology, B. Atom. Sci. 28 (1972), pp. 16–27. doi: 10.1080/00963402.1972.11457930
- 18.Fan Y., Liu L. and Wei Y., Analyzing impact factors of CO2 emissions using the STIRPAT model, Environ. Impact. Assess. Rev. 26 (2006), pp. 377–395. doi: 10.1016/j.eiar.2005.11.007
- 19.Farrar D. and Glauber R., Multicollinearity in regression analysis: The problem revisited, Rev. Econ. Stat. 49 (1967), pp. 92–107. doi: 10.2307/1937887
- 20.Fernández Y., Fernández M., González D. and Olmedillas B., El efecto regulador de los planes nacionales de asignación sobre las emisiones de CO2, Rev. Econ. Mund. 40 (2015), pp. 47–66.
- 21.García C., García J. and Soto J., The raise method: An alternative procedure to estimate the parameters in presence of collinearity, Qual. Quant. 45 (2011), pp. 403–423. doi: 10.1007/s11135-009-9305-0
- 22.García C., Salmerón R. and García C., Choice of the ridge factor from the correlation matrix determinant, J. Stat. Comput. Simul. 89 (2019), pp. 211–231. doi: 10.1080/00949655.2018.1543423
- 23.García J., Salmerón R., García C. and López-Martín M., The raise estimators: Estimation, inference and properties, Commun. Stat. Theory Methods 46 (2017), pp. 6446–6462. doi: 10.1080/03610926.2016.1260738
- 24.Gassebner M., Lamla M. and Sturm J., Determinants of pollution: What do we really know?, Oxf. Econ. Pap. 63 (2011), pp. 568–595. doi: 10.1093/oep/gpq029
- 25.Gujarati D., Basic Econometrics, 4th ed, McGraw-Hill, New York, 2004.
- 26.Jaeger T., Redundancy and reduction: Speakers manage syntactic information density, Cogn. Psychol. 61 (2010), pp. 23–62. doi: 10.1016/j.cogpsych.2010.02.002
- 27.Jia J., Deng H., Duan J. and Zhao J., Analysis of the major drivers of the ecological footprint using the STIRPAT model and the PLS method. A case study in Henan Province, China, Ecol. Econ. 68 (2009), pp. 2818–2824. doi: 10.1016/j.ecolecon.2009.05.012
- 28.Jorgenson A., Global warming and the neglected greenhouse gas: A cross-national study of the social causes of methane emissions intensity, 1995, Social Forces 84 (2006), pp. 1779–1798. doi: 10.1353/sof.2006.0050
- 29.Jorgenson A. and Burns T., The political-economic causes of change in the ecological footprints of nations, 1991–2001: A quantitative investigation, Soc. Sci. Res. 36 (2007), pp. 834–853. doi: 10.1016/j.ssresearch.2006.06.003
- 30.Jorgenson A. and Clark B., The economy, military, and ecologically unequal exchange relationships in comparative perspective: A panel study of the ecological footprints of nations, 1975–2000, Soc. Probl. 56 (2009), pp. 621–646. doi: 10.1525/sp.2009.56.4.621
- 31.Kennedy P., A Guide to Econometrics, 3rd ed, MIT Press, Cambridge, 1992.
- 32.Kentor J. and Kick E., Bringing the military back in: Military expenditures and economic growth 1990 to 2003, J. World-Systems Res. 14 (2008), pp. 142–172. doi: 10.5195/JWSR.2008.342
- 33.Kumar S., Environmentally sensitive productivity growth: A global analysis using Malmquist–Luenberger index, Ecol. Econ. 56 (2006), pp. 280–293. doi: 10.1016/j.ecolecon.2005.02.004
- 34.Kuperman V., Bertram R. and Baayen R., Morphological dynamics in compound processing, Lang. Cogn. Process. 23 (2008), pp. 1089–1132. doi: 10.1080/01690960802193688
- 35.Kuperman V., Bertram R. and Baayen R., Processing trade-offs in the reading of Dutch derived words, J. Mem. Lang. 62 (2010), pp. 83–97. doi: 10.1016/j.jml.2009.10.001
- 36.Lemhöfer K., Dijkstra T., Schriefers H., Baayen R., Grainger J. and Zwitserlood P., Native language influences on word recognition in a second language: A megastudy, J. Exp. Psychol. Learn. Mem. Cognition 34 (2008), pp. 12–31. doi: 10.1037/0278-7393.34.1.12
- 37.Lin S., Zhao D. and Marinova D., Analysis of the environmental impact of China based on STIRPAT model, Environ. Impact. Assess. Rev. 29 (2009), pp. 341–347. doi: 10.1016/j.eiar.2009.01.009
- 38.Mahutga M. and Bandelj N., Foreign investment and income inequality: The natural experiment of central and Eastern Europe, Int. J. Comp. Sociol. 49 (2008), pp. 429–454. doi: 10.1177/0020715208097788
- 39.Marquardt D., A critique of some ridge regression methods: Comment, J. Am. Stat. Assoc. 75 (1980), pp. 87–91.
- 40.Marquardt D. and Snee R., Ridge regression in practice, Am. Stat. 29 (1975), pp. 3–20.
- 41.Martínez-Zarzoso I., Bengochea-Morancho A. and Morales-Lage R., The impact of population on CO2 emissions: Evidence from European countries, Environ. Resour. Econ. 38 (2007), pp. 497–512. doi: 10.1007/s10640-007-9096-5
- 42.Martínez-Zarzoso I. and Maruotti A., The impact of urbanization on CO2 emissions: Evidence from developing countries, Ecol. Econ. 70 (2011), pp. 1344–1353. doi: 10.1016/j.ecolecon.2011.02.009
- 43.Novales A., Econometría, McGraw-Hill, Madrid, 1988.
- 44.Novales A., Salmerón R., García C., García J. and López-Martín M., Tratamiento de la multicolinealidad aproximada mediante variables ortogonales, Anal. Econ. Aplic. XXIX Congreso Int. Econ. Aplic. 29 (2015), pp. 1212–1227.
- 45.Pablo-Romero M. and De Jesús J., Economic growth and energy consumption: The energy-environmental Kuznets curve for Latin America and the Caribbean, Renew. Sust. Energ. Rev. 60 (2016), pp. 1343–1350. doi: 10.1016/j.rser.2016.03.029
- 46.Pao H. and Tsai C., CO2 emissions, energy consumption and economic growth in BRIC countries, Energy Policy 38 (2010), pp. 7850–7860. doi: 10.1016/j.enpol.2010.08.045
- 47.Paul R., Multicollinearity: Causes, effects and remedies, Thesis (M.Sc.) (Agricultural Statistics), Roll No. 4405 (2006), IASRI, New Delhi.
- 48.Rafindadi A., Revisiting the concept of environmental Kuznets curve in period of energy disaster and deteriorating income: Empirical evidence from Japan, Energy Policy 94 (2016), pp. 274–284. doi: 10.1016/j.enpol.2016.03.040
- 49.Roberts J. and Grimes P., Carbon intensity and economic development 1962–1991: A brief exploration of the environmental Kuznets curve, World Dev. 25 (1997), pp. 191–198. doi: 10.1016/S0305-750X(96)00104-0
- 50.Roca J. and Padilla E., Emisiones atmosféricas y crecimiento económico en España: la curva de Kuznets ambiental y el protocolo de Kyoto, Econ. Ind. 351 (2003), pp. 73–86.
- 51.Roy M., Basu S. and Pal P., Examining the driving forces in moving toward a low carbon society: An extended STIRPAT analysis for a fast growing vast economy, Clean Technol. Environ. Policy 19 (2017), pp. 2265–2276. doi: 10.1007/s10098-017-1416-z
- 52.Salmerón R., García C. and García J., Variance inflation factor and condition number in multiple linear regression, J. Stat. Comput. Simul. 88 (2018), pp. 2365–2384. doi: 10.1080/00949655.2018.1463376
- 53.Salmerón R., García J., García C. and García C., Treatment of collinearity through orthogonal regression: An economic application, Bolet. Estad. Invest. Oper. 32 (2016), pp. 184–202.
- 54.Salmerón R., García J., García C. and López-Martín M., Transformation of variables and the condition number in ridge estimation, Comput. Stat. 33 (2018), pp. 1497–1524. doi: 10.1007/s00180-017-0769-4
- 55.Salmerón R., García J., López-Martín M. and García C., Collinearity diagnostic applied in ridge estimation through the variance inflation factor, J. Appl. Stat. 43 (2016), pp. 1831–1849. doi: 10.1080/02664763.2015.1120712
- 56.Salmerón R., Rodríguez A. and García C., Diagnosis and quantification of the non-essential collinearity, Comput. Stat. (2019).
- 57.Shapley L., A value for n-person games, in Contributions to the Theory of Games II (AM-28), Princeton University Press, Princeton, 1953, pp. 307–317.
- 58.Silvey S., Multicollinearity and imprecise estimation, J. Royal Stat. Soc. B (Method.) 31 (1969), pp. 539–552.
- 59.Snee R. and Marquardt D., Comment: Collinearity diagnostics depend on the domain of prediction, the model, and the data, Am. Stat. 38 (1984), pp. 83–87.
- 60.Torras M. and Boyce J., Income, inequality, and pollution: A reassessment of the environmental Kuznets curve, Ecol. Econ. 25 (1998), pp. 147–160. doi: 10.1016/S0921-8009(97)00177-8
- 61.Uddin G., Alam K. and Gow J., Estimating the major contributors to environmental impacts in Australia, Int. J. Ecol. Econ. Stat. 37 (2016), pp. 1–14.
- 62.Walton J. and Ragin C., Global and national sources of political protest: Third world responses to the debt crisis, Am. Sociol. Rev. 55 (1990), pp. 876–890. doi: 10.2307/2095752
- 63.Wooldridge J., Introducción a la Econometría. Un Enfoque Moderno, 2nd ed, Thomson Paraninfo, Madrid, 2008.
- 64.Woolf B., Computation and interpretation of multiple regressions, J. Royal Stat. Soc. B (Method.) 13 (1951), pp. 100–119.
- 65.Wurm L. and Fisicaro S., What residualizing predictors in regression analyses does (and what it does not do), J. Mem. Lang. 72 (2014), pp. 37–48. doi: 10.1016/j.jml.2013.12.003
- 66.York R., Residualization is not the answer: Rethinking how to address multicollinearity, Soc. Sci. Res. 41 (2012), pp. 1379–1386. doi: 10.1016/j.ssresearch.2012.05.014