Abstract
This paper presents a technique based on the intuitively-simple concepts of Sample Domain and Effective Prediction Domain, for dealing with linear regression situations involving collinearity of any degree of severity. The Effective Prediction Domain (EPD) clarifies the concept of collinearity, and leads to conclusions that are quantitative and practically useful. The method allows for the presence of expansion terms among the regressors, and requires no changes when dealing with such situations.
Keywords: collinearity, efficient prediction domain, ill-conditioning, multicollinearity, regression analysis
Introduction
The scientists’ search for relations between measurable properties of materials or physical systems can be effectively helped by the statistical technique known as multiple regression. Even when limited to linear regression, the technique is often of great value, as we shall see below. Often, however, difficulties in interpretation arise because of a condition called collinearity. This condition, which is inherent in the structure of the design points (the X space) of the regression experiment, is often treated, at least implicitly, as a sort of disease of the data that is to be remedied by special mathematical manipulations of the data.
We consider collinearity not as a disease but rather as additional information provided by the data to the data analyst, warning him to limit the use of the regression equation as a prediction tool to specific subspaces of the X space, and telling him precisely what these subspaces are. Thus, collinearity is an indication of limitations inherent in the data. The statistician’s task is to detect these limitations and to express them in a useful manner. If this viewpoint is adopted, there is no need for remedial techniques. All that is required is a method for extracting the additional information from the data. We will present such a method.
The Model
We assume that measurements y have been made at a number of “x-points,” each point being characterized by the numerical values of a number of “regressor-variables” xj. We also assume that y is a linear function of the x-variables. The mathematical model, for p regressors, is:
y = β1x1 + β2x2 + ⋯ + βpxp + ϵ        (1)
where ϵ is the error in the y measurement. We denote by N the number of points, or “design points”, i.e., the combinations of the x’s at which y is measured.
Usually, the variable x1 is identically equal to “one” for all N points, to allow for the presence of a constant term. Then the expected value of y, denoted E(y), is equal to β1 when all the other x’s are zero. This point, called the origin, is seldom one of the design points and is, in fact, quite often far removed from all design points. In many cases this point is even devoid of physical meaning.
First Example:
Firefly Data
We present the problem in terms of two examples of real data. The first data set (Buck [1]) is shown in table 1. It consists of 17 points and has two regressors, in addition to a constant term (x1 = 1). The measurement is the time of the first flash of a firefly after 6:30 p.m. It is studied as a function of ambient light intensity (x2) and temperature (x3).
Table 1.
Data for firefly study.
| x1 | x2 | x3 | y |
|---|---|---|---|
| 1 | 26 | 21.1 | 45 |
| 1 | 35 | 23.9 | 40 |
| 1 | 40 | 17.8 | 58 |
| 1 | 41 | 22.0 | 50 |
| 1 | 45 | 22.3 | 31 |
| 1 | 55 | 23.3 | 52 |
| 1 | 55 | 20.5 | 54 |
| 1 | 56 | 25.5 | 38 |
| 1 | 70 | 21.7 | 40 |
| 1 | 75 | 26.7 | 28 |
| 1 | 79 | 25.0 | 38 |
| 1 | 87 | 24.4 | 36 |
| 1 | 100 | 22.3 | 36 |
| 1 | 100 | 25.5 | 46 |
| 1 | 110 | 26.7 | 40 |
| 1 | 130 | 25.5 | 31 |
| 1 | 140 | 26.7 | 40 |
Definition of Variables
y = time of first flash (number of minutes after 6:30 p.m.)
x2=light intensity (in metercandles, mc)
x3=temperature (°C)
Figure 1 is a plot of x3 versus x2. There is obviously a trend: x3 increases as x2 increases. The existence of a relation of this type between some of the regressor variables often causes difficulties in the interpretation of the regression analysis. To deal with the problem in a general way we propose a method based on two concepts. The first of these we shall call the “sample domain.”
Figure 1—
Sample domain.
For our data, the sample domain consists of the rectangle formed by the vertical straight lines going through the lowest and highest x2 of the experiment, respectively, and by the horizontal straight lines going through the lowest and highest x3, respectively (see fig. 1). The concept is readily generalized to an X space of any number of dimensions, where the sample domain becomes a rectangular box (hyperrectangle). Note that the vertex B of the sample domain is relatively far from any of the design points. This has important consequences.
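As a minimal numerical sketch (our own code and variable names, not part of the original analysis), the sample-domain limits are simply the componentwise minima and maxima of the regressors in table 1:

```python
# Sample domain for the firefly data (table 1): the smallest axis-aligned
# box that contains every design point.
x2 = [26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87, 100, 100, 110, 130, 140]
x3 = [21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
      26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7]

# Each coordinate contributes one pair of parallel boundary lines.
sample_domain = {"x2": (min(x2), max(x2)), "x3": (min(x3), max(x3))}
print(sample_domain)
```

The four corner combinations of these limits are the vertices labeled A through D in figure 1.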
The regression equation
ŷ = b1x1 + b2x2 + b3x3        (2)
allows us to estimate y at any point (x1, x2, x3) (we recall that x1 = 1) and to estimate the variance of ŷ at this point. The point can be inside or outside the sample domain. Obviously the variance of ŷ, which we denote by Var(ŷ), will tend to become larger as the point for which the prediction is made is moved further away from the cluster of points involved in the experiment. Therefore Var(ŷ) at the point B may be considerably larger than at points A, C, and D. Such a condition is associated with the concept of “collinearity.” We define collinearity, in a semi-quantitative way, as the condition that arises when, for at least one of the vertices of the sample domain, Var(ŷ) is considerably larger than for the other vertices. The concept will become clearer as we proceed.
At any rate, the larger variance at one of the vertices of the sample domain is generally the lesser of two concerns, the other being that the regression equation, for which validity may have been reasonably firmly established in the vicinity of the cluster of experimental points, may no longer be valid at a more distant point. It is important to note that the evidence from the data alone cannot justify inferences at such distant points. In order to validate prediction at such points, it is necessary to introduce either additional data or additional assumptions.
For these reasons, we seek to establish a region in the X-space for which prediction is reasonably safe on the basis of the experiment alone. We call this the Effective Prediction Domain, or EPD.
The EPD is the second concept required for our treatment of collinear data. It is closely related to the first concept, the sample domain, as will be shown below.
Establishing the EPD
Our procedure consists of two steps, involving two successive transformations of the coordinate system. The original coordinate system in which the x-regressors are expressed is referred to as the X-system.
1. The Z System
The first step consists of a translation of the X-system (parallel to itself) to a different origin, located centrally within the cluster of experimental points (centering), together with a rescaling of each x to a standard scale. The new system, called the Z-system, is given by the equations
zi1 = xi1/K        (3a)
zij = (xij − Cj)/Rj,  j > 1        (3b)
For Cj and Rj we consider two choices, which we call the Correlation Scale Transformation (CST) and the Range-Midrange Transformation (RMT). We discuss first the Correlation Scale Transformation, defined by the choice
Cj = x̄j = (Σi xij)/N,   Rj = √(Σi (xij − x̄j)²)        (4)
where i = 1 to N.
It easily follows from (3b) that
Σi zij² = 1,  j > 1        (5)
It is then reasonable to choose a value K in (3a) equal to
K = √N        (6)
so as to make Σi zi1² = 1 as well.
The values of Cj and Rj for the firefly data are given in table 2. Contrary to statements found in the literature (see discussion at end of this paper), the centering and rescaling defined by the Correlation Scale Transformation have no effect whatsoever on collinearity. The location of the sample domain relative to the design points remains unchanged, though it is expressed in different coordinates.
Table 2.
Firefly data—parameters for correlation scale transformation.
| j | C | R |
|---|---|---|
| 1 | 0 | 4.123106 |
| 2 | 73.176471 | 135.264447 |
| 3 | 23.582353 | 10.073962 |
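The CST parameters of table 2 can be reproduced in a few lines of Python (a sketch under our own naming, using NumPy):

```python
import numpy as np

# Correlation Scale Transformation (CST) for the firefly data: Cj is the
# mean of regressor j and Rj the root of its centered sum of squares
# (eq 4); the constant column x1 is divided by K = sqrt(N) (eq 6).
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140.0])
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = len(x2)

def cst(x):
    """Return (C, R, z) for one regressor under the CST."""
    C = x.mean()
    R = np.sqrt(((x - C) ** 2).sum())
    return C, R, (x - C) / R

K = np.sqrt(N)                     # so that sum(z1**2) = 1 as well
C2, R2, z2 = cst(x2)
C3, R3, z3 = cst(x3)
print(round(C2, 6), round(R2, 6))  # compare with table 2
```

Each rescaled column has unit sum of squares, which is what makes Z′Z a correlation-type matrix in the next step.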
To arrive at an EPD, a second operation is necessary, viz. a rotation of the Z-coordinate system to a new coordinate system, which we shall call the W-system (of coordinates).
2. The W-System
The rotation from Z to W is accomplished by the method of Principal Components, or its equivalent, the Singular Value Decomposition (SVD). For a discussion of this method the reader is referred to Mandel [2]. Here we merely recall a few facts. Each w-coordinate is a linear combination of all z-coordinates given by the matrix equation:
W = ZV′        (7)
where V is an orthogonal matrix.
In algebraic notation, eq (7) becomes
wk = Σj vkj zj        (8)
where the vkj are the elements of the V matrix. The vkj, for a given k, are simply the direction cosines of the wk axis with respect to the Z-system. Consequently,
Σj vkj² = 1        (9)
Since the rotation is orthogonal, any two distinct w-axes, say wk and wk′, are orthogonal and consequently:
Σj vkj vk′j = 0,  k ≠ k′        (10)
For the firefly data, the V matrix is shown in table 3, and the complete set of z and w coordinates is given in table 4.
Table 3.
Firefly data—V matrix.
| k \ j | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0 | .7071 | .7071 |
| 2 | 1.000 | 0 | 0 |
| 3 | 0 | −.7071 | .7071 |
Table 4.
Firefly data—z and w coordinates (CST).1
| Point | z2 | z3 | w1 | w3 |
|---|---|---|---|---|
| 1 | −.3488 | −.2464 | −.4216 | .0724 |
| 2 | −.2822 | .0315 | −.1780 | .2219 |
| 3 | −.2453 | −.5740 | −.5800 | −.2324 |
| 4 | −.2379 | −.1571 | −.2800 | .0572 |
| 5 | −.2083 | −.1273 | −.2381 | .0573 |
| 6 | −.1344 | −.0280 | −.1156 | .0753 |
| 7 | −.1344 | −.3060 | −.3121 | −.1213 |
| 8 | −.1270 | .1904 | .0440 | .2245 |
| 9 | −.0235 | −.1869 | −.1495 | −.1155 |
| 10 | .0135 | .3095 | .2276 | .2094 |
| 11 | .0431 | .1407 | .1292 | .0691 |
| 12 | .1022 | .0812 | .1289 | −.0148 |
| 13 | .1983 | −.1273 | .0495 | −.2302 |
| 14 | .1983 | .1904 | .2741 | −.0055 |
| 15 | .2722 | .3095 | .4106 | .0264 |
| 16 | .4201 | .1904 | .4309 | −.1624 |
| 17 | .4940 | .3095 | .5674 | −.1304 |
| λ1 = 1.6549 | λ3 = .3451 |
zi1 = 1/√17 = .2425 for all i.
wi2 = .2425 for all i; λ2 = 1.0000.
Note that row 2, as well as column 1, in table 3 consists of the element “one” in one cell and zeros in all other cells. This is a consequence of the orthogonality of z1 with respect to all zj with j > 1. This orthogonality is in turn due to the nature of the Correlation Scale Transformation, as expressed by eq (4).
At the bottom of the w columns we find values labeled 𝜆j. They are simply the sums of squares of all w-values in that column.
λj = Σi wij²        (11)
The λj are also the eigenvalues of the Z′Z matrix which, for our choice of Cj and Rj, is the correlation matrix of the regressors x. Note that w2 is the constant coordinate z1. Consequently wi2 = 1/√N for all i, and λ2 = 1.
We need to consider w1 and w3 only. A similar situation applies to the z coordinates, where zi1 = 1/√N for all i. Figure 2 shows both the z-coordinates (z2 and z3) and the w-coordinates (w1 and w3) for the firefly data. The order of the w-coordinates (w1, w2, w3) is that of the corresponding λ-values, in decreasing order.
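The rotation just described is easy to reproduce numerically. The sketch below (our code, not the paper's) obtains the λ-values of table 4 as the eigenvalues of Z′Z for the firefly data:

```python
import numpy as np

# Rotation of the Z-system to the W-system for the firefly data (CST).
# Z holds the rescaled constant column z1 = 1/sqrt(N) and the centered,
# scaled columns z2, z3; the lambda_j are the eigenvalues of Z'Z.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140.0])
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = len(x2)

def cst(x):
    return (x - x.mean()) / np.sqrt(((x - x.mean()) ** 2).sum())

Z = np.column_stack([np.full(N, 1 / np.sqrt(N)), cst(x2), cst(x3)])

lam, V = np.linalg.eigh(Z.T @ Z)   # eigh returns eigenvalues in ascending order
lam = lam[::-1]                    # decreasing order, as in table 4
W = Z @ V[:, ::-1]                 # w-coordinates of the 17 design points
print(np.round(lam, 4))
```

The sums of squares of the W columns reproduce the λ-values (eq 11), and one eigenvalue is exactly unity because z1 is orthogonal to the centered columns.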
Figure 2—
EPD for firefly data.
3. The Effective Prediction Domain (EPD)
The EPD is simply the sample domain corresponding to the W-system of coordinates. Thus, straight lines parallel to the w3-axis are drawn through the smallest and largest w1, respectively, and lines parallel to the w1-axis are drawn through the smallest and largest w3. Here again generalization is readily made to a p -dimensional W -space. The EPD for the firefly data is also shown in figure 2.
The interpretation of the EPD is straightforward. Unlike the sample domain in either the X-system or the Z-system, the EPD excludes points that are distant from the cluster of regressor points. This has two advantages. In the first place, the use of the regression equation is justified for all points inside, and on the periphery of, the EPD. Secondly, the variance of the predicted value for any such point will not be unduly large. These statements require more detailed treatment. To this effect we introduce the concept of the variance factor (VF).
4. The Variance Factor (VF)
From regression theory we know that the variance of any linear function, say L, of the coefficient estimates is of the form:
Var(L) = f(X)·σ²        (12)
where σ² is the variance of the experimental errors ϵ of the y measurements. The multiplier f(X) is independent of the y and depends only on the X matrix and on the coefficients in the L function. We call this multiplier the variance factor, VF.
Thus, we have:
Var(bj) = VF(bj)·σ²        (13)
and
Var(ŷ) = VF(ŷ)·σ²        (14)
In eq (14), ŷ is the estimated, or predicted, y value at any chosen point in X-space. VF(ŷ) is of course a function of the location of this point.
Returning now to our statements above, it is well-known that a regression equation can show excellent (very small) residuals and yet be very poor for certain prediction purposes. The small residuals merely mean that a good fit has been obtained at the points used in the experiment. This is no guarantee that the fit is good at other points. However, if the regression equation is scientifically reasonable, it is likely that the experimental situation underlying it will also be valid for points that are close to the cluster of the regressor points used in the experiment. Every point in the EPD satisfies this requirement.
Furthermore, the variance of prediction, measured by the VF, will also be reasonably small for all points of the EPD, simply because they are geometrically close to the design points.
The calculation of VF(ŷ) is quite simple, once the V-matrix and the λ values have been calculated. It is based on the equation
VF(ŷ) = Σk uk²/λk        (15)
where uk is defined as:
uk = Σj vkj zj        (16)
Combining eqs (8) and (16), we obtain
uk = wk        (17)
and hence:
VF(ŷ) = Σk wk²/λk        (18)
Figure 3 shows the VF values at the vertices of the original sample domain and of the EPD. Interpreting these results, we see that the collinearity of our data is reflected in the rejection of an appreciable portion of the sample domain for purposes of safe prediction. This does not mean that prediction outside the EPD is impossible, or unacceptable. It merely means that such prediction cannot be justified on the basis of the data alone. Of course, the risk of predicting outside the EPD increases with the distance from the EPD. It will generally be reasonably safe to use the regression equation even outside the EPD, as long as the point for which prediction is made is reasonably close to the borders of the EPD. Using eq (18), the VF for any contemplated prediction point is readily calculated and can serve as a basis for decision.
Figure 3—
VF at vertices of sample domain and of EPD.
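Equation (18) is easy to verify numerically. The sketch below (our code, with a random stand-in design matrix rather than one of the paper's data sets) checks that the eigenvalue form of VF agrees with the direct textbook expression z′(Z′Z)⁻¹z:

```python
import numpy as np

# Numerical check of eq (18): VF(y-hat) computed from eigenvalues and
# w-coordinates equals z'(Z'Z)^{-1} z computed directly.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))           # stand-in design matrix (hypothetical)

lam, V = np.linalg.eigh(Z.T @ Z)       # columns of V are eigenvectors
z_point = np.array([1.0, 0.3, -0.2])   # an arbitrary prediction point

u = V.T @ z_point                      # eq (16): u_k = sum_j v_kj z_j
vf_eig = np.sum(u ** 2 / lam)          # eq (15) / eq (18)
vf_direct = z_point @ np.linalg.inv(Z.T @ Z) @ z_point
print(vf_eig, vf_direct)
```

The two computations agree because (Z′Z)⁻¹ = VΛ⁻¹V′, so the quadratic form decomposes exactly into the Σ uk²/λk of eq (15).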
Second Example:
Calibration for Protein Determination
The instructive and intuitively satisfying graphical display of the EPD becomes impossible when the number of regressors, including the independent term, exceeds 3. We must then replace the graphical procedure by an analytical one, as will now be shown in the treatment of our second example.
The data were presented by Fearn [3], in a discussion of Ridge Regression. They represent the linear regression of percent protein, in ground wheat samples, on near-infrared reflectance at six different wavelengths.
For reasons of simplicity in presentation, we include here only three of the six wavelengths, a change that has a rather small effect on the final outcome of the analysis: it turns out that the regression equation based on these 3 wavelengths is very nearly as precise as that based on 6 wavelengths.
The data, displayed in table 5, are a very good example of the use of regression equations: the regression equation is indeed to be used as a “calibration curve” for the analysis of protein, using rapid spectrometry instead of the far more time-consuming Kjeldahl nitrogen determination. Our data have an N value of 24, and p (including the independent term) is 4.
Table 5.
Protein Calibration Data(*)
| Reflectance | % Protein | |||
|---|---|---|---|---|
| Point | x2 | x3 | x4 | y |
| 1 | 246 | 374 | 386 | 9.23 |
| 2 | 236 | 386 | 383 | 8.01 |
| 3 | 240 | 359 | 353 | 10.95 |
| 4 | 236 | 352 | 340 | 11.67 |
| 5 | 243 | 366 | 371 | 10.41 |
| 6 | 273 | 404 | 433 | 9.51 |
| 7 | 242 | 370 | 377 | 8.67 |
| 8 | 238 | 370 | 353 | 7.75 |
| 9 | 258 | 393 | 377 | 8.05 |
| 10 | 264 | 384 | 398 | 11.39 |
| 11 | 243 | 367 | 378 | 9.95 |
| 12 | 233 | 365 | 365 | 8.25 |
| 13 | 288 | 415 | 443 | 10.57 |
| 14 | 293 | 421 | 450 | 10.23 |
| 15 | 324 | 448 | 467 | 11.87 |
| 16 | 271 | 407 | 451 | 8.09 |
| 17 | 360 | 484 | 524 | 12.55 |
| 18 | 274 | 406 | 407 | 8.38 |
| 19 | 260 | 385 | 374 | 9.64 |
| 20 | 269 | 389 | 391 | 11.35 |
| 21 | 242 | 366 | 353 | 9.70 |
| 22 | 285 | 410 | 445 | 10.75 |
| 23 | 255 | 376 | 383 | 10.75 |
| 24 | 276 | 396 | 404 | 11.47 |
x1 = 1
Table 6 exhibits the correlation matrix of the 24 design points. It is very apparent that the x values at all three wavelengths are highly correlated with each other, thus indicating a high degree of collinearity. At first glance one would be very skeptical about such a set of data, and suspect that the X matrix shows such a high degree of redundancy as to make the regression useless for prediction purposes. Fearn explains that the correlations are more a reflection of particle size variability than of protein content. Our analysis will confirm that, properly interpreted, the data lead to a very satisfactory calibration procedure.
Table 6.
Protein calibration data—correlation matrix of x1 through x4.
|  | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | 1 | 0 | 0 | 0 |
| x2 |  | 1 | .9843 | .9337 |
| x3 |  |  | 1 | .9545 |
| x4 |  |  |  | 1 |
We will find it useful to introduce a slightly different Z transformation, which we call the Range-Midrange Transformation.
The Range-Midrange Transformation
The Range-Midrange Transformation (RMT) is defined as follows:
zi1 = xi1 = 1        (19a)
zij = (xij − Cj)/Rj,  j > 1        (19b)
but now Cj is defined as the midrange of the N values of xj and Rj is one-half the range of these values. With these definitions, it is clear that the smallest z-value, for any regressor, is (−1) and the largest z-value is (+1). It is because of this −1 to +1 scale that this transformation was introduced. The benefits of this scale will become apparent in the following section.
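A minimal sketch of the RMT (our code and naming), checked against the x2 column of table 5 and the parameters of table 7:

```python
import numpy as np

# Range-Midrange Transformation, eqs (19a)-(19b): Cj is the midrange and
# Rj half the range, so every transformed value lies in [-1, +1].
x2 = np.array([246, 236, 240, 236, 243, 273, 242, 238, 258, 264, 243, 233,
               288, 293, 324, 271, 360, 274, 260, 269, 242, 285, 255, 276.0])

def rmt(x):
    C = (x.min() + x.max()) / 2    # midrange
    R = (x.max() - x.min()) / 2    # half-range
    return C, R, (x - C) / R

C2, R2, z2 = rmt(x2)
print(C2, R2)                      # compare with table 7
```

By construction the extreme design points map exactly onto −1 and +1, which is the property exploited in the inequalities of the next section.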
EPD for the Protein Data
The EPD resulting from the Singular Value Decomposition based on the Range-Midrange Transformation will not be the same as the EPD we would have obtained using the Correlation Scale Transformation, but we will see that those features of the EPD that are important for establishing the limitations of the regression equation are practically unaffected.
Table 7 shows the C and R values for the four regressors and table 8 exhibits the V matrix and the λ values obtained from the Singular Value Decomposition. The latter, it may be recalled, simply expresses the rotation of the Z coordinate system to the W system.
Table 7.
Protein calibration data—parameters for Z transformation (RMT).
| j | C | R |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 296.5 | 63.5 |
| 3 | 418.0 | 66.0 |
| 4 | 432.0 | 92.0 |
Table 8.
Protein calibration data—V matrix and λ values (RMT).
| k \ j | 1 | 2 | 3 | 4 | λ |
|---|---|---|---|---|---|
| 1 | −.6665 | .4845 | .4217 | .3784 | 43.7810 |
| 2 | .7365 | .3299 | .3797 | .4523 | 8.3782 |
| 3 | −.1096 | −.5491 | −.2509 | .7896 | .3758 |
| 4 | −.0332 | −.5958 | .7843 | −.1698 | .06624 |
For each wk coordinate, there are 24 values, corresponding to the 24 regressor points.
Table 9 shows the smallest and the largest wk value, for each of the four k.
Table 9.
Protein calibration data—limits defining the EPD.
| Coordinate (k) | Smallest w | Largest w |
|---|---|---|
| 1 | −1.9282 | .6181 |
| 2 | −.4097 | 1.8989 |
| 3 | −.1669 | .3158 |
| 4 | −.0801 | .1324 |
According to table 9, we must have, in the EPD:
−1.9282 ≤ w1 ≤ .6181        (20)
with similar statements for w2, w3, and w4. Applying now eq (8), this double inequality can be written:

−1.9282 ≤ −.6665 z1 + .4845 z2 + .4217 z3 + .3784 z4 ≤ .6181

Since z1 is constant and equal to 1, this double inequality becomes:

−1.2617 ≤ .4845 z2 + .4217 z3 + .3784 z4 ≤ 1.2846        (21a)
With the RMT, the value of any zk is, for any k > 1, between ( −1) and (+1). Thus the expression in the middle has, for all design points, a value between −1.2846 and 1.2846, where 1.2846 is the sum of the absolute values of the three coefficients. Therefore, the double inequality expressed by eq (21a) holds, essentially, for every point in the original sample domain. Thus, w1, the first coordinate of the EPD, which represents its largest dimension, imposes essentially no restrictions on the sample domain.
Doing the same calculations for the three other w-coordinates (see table 9), we obtain, respectively:
−1.1462 ≤ .3299 z2 + .3797 z3 + .4523 z4 ≤ 1.1624        (21b)
−.0573 ≤ −.5491 z2 − .2509 z3 + .7896 z4 ≤ .4254        (21c)
−.0469 ≤ −.5958 z2 + .7843 z3 − .1698 z4 ≤ .1656        (21d)
We see that w2, too, imposes only very light restrictions on the sample domain. On the other hand, w3 and w4 do imply limitations that eliminate appreciable portions of the sample domain from the EPD.
We could readily convert eqs (21c) and (21d) to x coordinates by means of table 7 and eqs (19a) and (19b), but the z-coordinates, using the Range-Midrange Transformation, are more readily interpreted in terms of the severity of collinearity than the x-coordinates.
Thus, the sums of the absolute values of the coefficients in the middle terms of (21c) and (21d) are 1.5896 and 1.5499, respectively. Points for which these linear combinations take the values ±1.5896 and ±1.5499 exist in the original sample domain. The EPD, on the other hand, limits these functions to much narrower intervals.
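The passage from the table 9 limits to these double inequalities can be scripted. The following sketch (our code) reproduces the bounds of eq (21a) from the first row of the V matrix in table 8:

```python
import numpy as np

# From an EPD limit on w1 (table 9) to a double inequality in the z's:
# substitute w1 = sum_j v_1j z_j (eq 8) and absorb the constant z1-term
# (z1 = 1 under the RMT) into the two bounds.
v1 = np.array([-.6665, .4845, .4217, .3784])   # first row of V, table 8
w1_min, w1_max = -1.9282, .6181                # limits from table 9

lo = w1_min - v1[0]    # lower bound on .4845 z2 + .4217 z3 + .3784 z4
hi = w1_max - v1[0]    # upper bound

# Since every z lies in [-1, +1], the middle expression can never exceed
# the sum of the absolute coefficients:
reach = np.abs(v1[1:]).sum()
print(round(lo, 4), round(hi, 4), round(reach, 4))
```

Because the attainable reach of the middle expression never exceeds the upper bound, the w1 constraint excludes essentially nothing from the sample domain, exactly as stated in the text.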
Effect of Type of Z Transformation
We have used two different Z transformations, the Correlation Scale and the Range-Midrange. It is proper to ask how our results for the Protein Calibration Data would have been affected had we used the Correlation Scale instead of the Range-Midrange Transformation. We show the comparison in table 10. Let us recall that with the CST, one of the w coordinates yields a λ-value of unity, and a constant w value for all points. Therefore we obtain, for the CST, only three sets of inequalities, as compared to the four sets for the RMT. To allow the comparison between the two transformations to be made, we have multiplied eqs (21a) through (21d) by positive constants, so as to make the coefficient of z4 equal to ±1. The same was done for the corresponding inequalities obtained by the Correlation Scale Transformation.
Table 10.
Protein calibration data—effect of Z transformation.1
| w coordinate | z Transf. | Inequalities |
|---|---|---|
| 1 | CST | −3.034≤1.021 z2+1.061 z3+z4≤3.082 |
| RMT | −3.334≤1.280 z2+1.114 z3+z4≤3.395 | |
| 2 | CST | |
| RMT | −2.534≤.729 z2 + .840 z3+z4≤2.569 | |
| 3 | CST | −.075≤−.686 z2−.321 z3+z4≤.535 |
| RMT | −.072≤ −.695 z2−.318 z3+z4≤.539 | |
| 4 | CST | −.278≤−3.531 z2+4.640 z3−z4≤.980 |
| RMT | −.276≤−3.509 z2+4.619 z3−z4≤.975 |
All inequalities are expressed in RMT z coordinates.
Of course, since the z coordinates are different for the two transformations, the inequalities for the CST, expressed in the CST z-units, had to be converted to RMT z-units, for a meaningful comparison. As can be seen from table 10, the two smallest dimensions of the EPD are practically the same for the two transformations. Thus, even though the method of principal components is not invariant with respect to linear transformations of scale, our analysis leads, in this case, to very similar results for the small dimensions of the EPD. We believe that this is generally true for all situations in which collinearity is noticeable, i.e., for all situations in which the EPD eliminates considerable portions of the original sample domain. For situations in which this does not apply, i.e., totally non-collinear cases, the inequalities do not matter, since they impose no restrictions on the sample domain.
It is interesting to contrast the remarkable similarity between the inequalities for w3 and w4 for the two transformations in table 10 with the behavior of a commonly advocated measure of collinearity (Belsley, Kuh, and Welsch [4]), the condition number.
The SVD resulting from the CST yields the following eigenvalues: 2.9151, 1.0000, .07176, .01312. The condition number is defined as the ratio of the largest to the smallest eigenvalue. In this case:

c = 2.9151/.01312 = 222.2
On the other hand, the SVD resulting from the RMT on the same data yields the eigenvalues: 43.7810, 8.3782, .37575, .066244. This time we have:

c = 43.7810/.066244 = 660.9
Thus the condition number varies considerably when the data are subjected to different standardizing transformations. It is not clear what useful information can be derived from the condition number.
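The comparison takes only a few lines (our code; the eigenvalues are those quoted above):

```python
# Condition numbers (largest / smallest eigenvalue) for the protein data
# under the two standardizing transformations.
eig_cst = [2.9151, 1.0000, .07176, .01312]
eig_rmt = [43.7810, 8.3782, .37575, .066244]

cond_cst = max(eig_cst) / min(eig_cst)
cond_rmt = max(eig_rmt) / min(eig_rmt)
print(round(cond_cst, 1), round(cond_rmt, 1))
```

The same data yield condition numbers differing by a factor of about three, which is the instability criticized in the text.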
By contrast, the treatment of collinearity we advocate has a useful and readily understood interpretation: the EPD is that part of the X space in which, and near which, prediction is safe. It also indicates what portions of the original sample domain are inappropriate for prediction on the basis of the given data alone. It fulfills this function in a way which is practically invariant with respect to intermediate transformations of scale. We use the qualifier “intermediate” because collinearity has meaning only in terms of a given original coordinate system (the X system). This system, which determines the original sample domain, must be considered fixed. On the other hand, transformations of this system prior to calculating the EPD can be defined in different ways without affecting the practical inferences drawn from the data on the basis of the final EPD derived from the standardizing transformation.
Cross-Validation
We can take advantage of the availability of a second set of protein calibration data, also given in Fearn [3], to verify the correctness of our approach. Fearn lists 26 additional points for which the reflectance measurements, as well as the Kjeldahl nitrogen determination, were made. We applied the Z transformation obtained above (RMT on the first set of 24 points) to each of these 26 points, and noted every point for which at least one of the four sets of inequalities (21a) through (21d) failed to be satisfied. We found 14 such points. This means that 14 “future points” obtained under the same test conditions were outside the EPD established on the basis of the original 24 points. However, as we observed above, as long as a point is not far from the EPD, prediction at that point is likely to be valid. We tested “predictability” at these 14 points by calculating the VF value for each of them, and by comparing the predicted protein value with the measured one. The results are shown in table 11. It is apparent that all VF are relatively small, indicating that even though these 14 points are outside the EPD calculated from the original set, they are not far from that EPD. This is confirmed by the good agreement between the observed and predicted values. The standard deviation of fit for the original set of 24 points was 0.23; the standard deviation for a single measurement derived from the 14 differences in table 11 is 0.30.
Table 11.
Protein calibration data—cross-validation of analysis.
| % Protein | |||
|---|---|---|---|
| Point1 | Observed | Predicted | VF |
| 1 | 8.66 | 9.53 | .281 |
| 4 | 11.77 | 11.97 | .416 |
| 6 | 10.46 | 10.96 | .193 |
| 9 | 12.03 | 11.47 | .212 |
| 10 | 9.43 | 9.54 | .762 |
| 11 | 8.66 | 8.15 | .454 |
| 12 | 14.44 | 13.99 | .881 |
| 14 | 10.41 | 10.17 | .468 |
| 16 | 11.69 | 11.24 | .472 |
| 17 | 12.19 | 11.83 | .390 |
| 18 | 11.59 | 11.39 | .314 |
| 20 | 8.60 | 8.39 | .201 |
| 22 | 9.34 | 8.93 | .151 |
| 26 | 10.89 | 10.94 | .741 |
Point in additional set (Fearn [3]) with its number designation in that set.
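The screening step used above can be sketched as follows (our code; V and the limits are taken from tables 8 and 9, and the two test points are illustrative, not Fearn's):

```python
import numpy as np

# Screening a "future" point against the EPD: rotate its z-coordinates
# into the W-system (eq 8) and test every w_k against the table 9 limits.
V = np.array([[-.6665,  .4845,  .4217,  .3784],
              [ .7365,  .3299,  .3797,  .4523],
              [-.1096, -.5491, -.2509,  .7896],
              [-.0332, -.5958,  .7843, -.1698]])   # table 8, rows = k
w_min = np.array([-1.9282, -.4097, -.1669, -.0801])
w_max = np.array([  .6181, 1.8989,  .3158,  .1324])

def in_epd(z):
    w = V @ np.asarray(z, dtype=float)   # w_k = sum_j v_kj z_j
    return bool(np.all((w_min <= w) & (w <= w_max)))

print(in_epd([1, 0, 0, 0]))    # the center of the sample domain
print(in_epd([1, 1, -1, 1]))   # a remote vertex of the sample domain
```

The center of the sample domain passes all four tests, while the remote vertex violates the w3 constraint, illustrating how the EPD trims the corners of the sample domain.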
Expansion Terms
Quite frequently, a regression equation contains x variables that are non-linear functions of one or more of the other x variables, such as x2², x2⋅x3, etc. Polynomial regressions are necessarily of this type. Since the x variables are non-stochastic in the usual regression models, the least squares solution for the regression equation is not affected by the presence of such “expansion terms.” On the other hand, collinearity can be introduced, or removed, or modified by them.
In our treatment the expansion terms cause no additional problems. Consider for example, the regression
y = β1x1 + β2x2 + β3x3 + ϵ,  where x3 = x2²        (22)
with x1 ≡ 1.
Here we have p = 3. Using RMT, followed by a singular value decomposition, we obtain an EPD of three dimensions, leading to the inequalities

wk,min ≤ wk ≤ wk,max,  k = 1, 2, 3        (23)
Expressing the w as functions of the z, this leads to three double inequalities governing the z, of the form

wk,min ≤ fk(z) ≤ wk,max

where fk(z) denotes the linear combination Σj vkj zj.
Now, since x3 = x2², we have

C3 + R3z3 = (C2 + R2z2)²

Hence:

z3 = .4592 z2² + z2 − .4592        (24)
Because of this relation the functions f1(z), f2(z), f3(z) become functions of z1, z2 (and z2²) only. Using this fact, we interpret the three sets of inequalities (23) exactly as we interpreted eqs (21a) through (21d): by determining which of these inequalities, if any, impose restrictions on the use of the original sample domain.
To illustrate this procedure, consider the small set of artificial data shown in table 12, for which the model is given at the bottom of the table. The term x3 = x2² introduces a high correlation between x2 and x3 and consequently also considerable collinearity.
Table 12.
An artificial quadratic example1.
| Point | x2 | x3 | y |
|---|---|---|---|
| 1 | .2 | .04 | 28.3 |
| 2 | .4 | .16 | 27.5 |
| 3 | 1 | 1.00 | 25.6 |
| 4 | 2.1 | 4.41 | 28.7 |
| 5 | 3.6 | 12.96 | 46.4 |
| 6 | 4.7 | 22.09 | 69.8 |
y = β1x1 + β2x2 + β3x3 + ϵ; β1 = 30, β2 = −8, β3 = 3.5, σe = 0.2, x1 = 1.
Note that x3 = x2².
The inequalities characterizing the EPD, based on a Range-Midrange Transformation and converted to the z-scales, are shown in table 13. Applying eq (24) to express z3 in terms of z2 turns each of the three double inequalities into a double inequality in z2 alone.
Table 13.
Quadratic example—inequalities for EPD.
| w-coordinate | Inequalities |
|---|---|
| w1 | −1.1281≤−.5070 z2+.6214 z3≤1.1284 |
| w2 | −.8541≤.5029 z2+.3511 z3≤.8541 |
| w3 | −.3141≤−.7005 z2+.7008 z3≤.0003 |
It is readily verified that of these six inequalities, all but one are satisfied for all z2 values between −1 and +1. The last one, involving the left side of the third set, is satisfied for all z2 values except for the interval −.156 ≤ z2 ≤ .155. This corresponds to an x2 interval between 2.1 and 2.8, i.e., between the design points x2 = 2.1 and x2 = 3.6 (see table 12). The interpretation of this finding is that while all design points are of course inside the EPD, a small portion of the curve of x3 = x2² versus x2 falls slightly outside the EPD. This is of no practical significance, since the VF for these points, even though they are outside the EPD, does not exceed 0.58. By comparison, the smallest VF value along the curve, for the range x2 = .2 to x2 = 4.7, is of the order of 0.26. Thus we see that the serious collinearity in this data set is merely a consequence of the presence of the expansion term x2².
Any point in X space, in order to be acceptable, must lie on the curve x3 = x2². An x3 with any other value is obviously not valid, and our analysis of the data, through the EPD, calls attention to this fact: in the direction of w3, the width of the EPD is only .31, as compared with widths of 2.26 and 1.71 for w1 and w2.
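The coefficients of eq (24) follow directly from the RMT parameters of table 12, as this sketch (our code) verifies:

```python
import numpy as np

# The constraint x3 = x2**2 carried into the RMT z-scales:
# C3 + R3*z3 = (C2 + R2*z2)**2, whose expansion gives eq (24).
x2 = np.array([.2, .4, 1.0, 2.1, 3.6, 4.7])   # table 12
x3 = x2 ** 2

C2, R2 = (x2.min() + x2.max()) / 2, (x2.max() - x2.min()) / 2
C3, R3 = (x3.min() + x3.max()) / 2, (x3.max() - x3.min()) / 2
z2, z3 = (x2 - C2) / R2, (x3 - C3) / R3

a = R2 ** 2 / R3          # coefficient of z2**2
b = 2 * C2 * R2 / R3      # coefficient of z2
c = (C2 ** 2 - C3) / R3   # constant term
print(round(a, 4), round(b, 4), round(c, 4))
```

Substituting the design points back in confirms that z3 = a z2² + b z2 + c holds exactly, which is why the EPD analysis needs no special machinery for expansion terms.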
Discussion
The common mathematical definition of collinearity is the existence of at least one linear relation between the x’s, of the form
c1xi1 + c2xi2 + ⋯ + cpxip = 0        (25)
where the cj are not all zero, and such that eq (25) holds, with the same cj values, for all i. This defines what we shall call “exact collinearity.” Geometrically, it means that all design points lie in a hyperplane of the x-space going through the origin of the coordinate system. Equation (25) also implies that the matrix X′X is singular, and consequently that the estimates of the β coefficients are not uniquely defined.
Exact collinearity seldom occurs in real experimental situations; indeed, if the X matrix is not the result of a designed experiment, it is highly improbable that a relation such as eq (25) would hold exactly. If, on the other hand, the experiment is designed, care would generally have been taken to avoid a situation of exact collinearity.
While exact collinearity is practically of little concern, near-collinearity is a frequent occurrence in real-life data. This occurs when an equation such as (25) is “approximately” true for all i. Many attempts have been made to define more closely the concept of near-collinearity, but while these endeavors have led to a number of proposals for measuring collinearity, they are of little practical use to the experimenter confronted with the task of interpreting his data.
It is not our intention to discuss here the pros and cons of the various attempts made by a number of authors to “remedy” a near-collinear situation. The best-known of these remedial procedures is Ridge Regression. We merely repeat what we have said in the body of the paper: any attempt to remedy collinearity must necessarily be based on additional assumptions, unless it consists of making additional measurements. The latter alternative is of course logical and valid, but the making of assumptions invented specifically for the purpose of removing collinearity does not appear to us to be a recommendable policy in data analysis.
One easily recognizable condition leading to collinearity is the existence of at least one high correlation coefficient among the non-diagonal elements of the correlation matrix of the x’s. This has given rise to the concept of the Variance Inflation Factor (VIF). The VIF for the coefficient of xj is defined (Draper and Smith [5]) as:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (26)$$
where Rj is the multiple correlation coefficient of xj on all other regressors. If di represents a residual in this regression, the usual formula for $R_j^2$ is given by
$$R_j^2 = 1 - \frac{\sum_i d_i^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \qquad (27)$$
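The computation in eqs (26) and (27) can be sketched as follows. The data are invented for illustration (x3 is nearly a copy of x2, so its VIF is large), and the helper name `vif` is ours, not a standard library routine:

```python
import numpy as np

# VIF_j = 1 / (1 - R_j^2), with R_j^2 obtained by regressing x_j on
# the other regressors, per eqs (26)-(27).  Illustrative data only.
rng = np.random.default_rng(0)
x2 = rng.normal(size=30)
x3 = x2 + 0.1 * rng.normal(size=30)   # nearly collinear with x2

def vif(target, others):
    # Regress `target` on an intercept plus the other regressors.
    Z = np.column_stack([np.ones_like(target)] + others)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    d = target - Z @ coef                       # residuals d_i
    centered_ss = ((target - target.mean())**2).sum()
    r2 = 1 - (d @ d) / centered_ss              # eq (27)
    return 1 / (1 - r2)                         # eq (26)

print(vif(x3, [x2]))   # large, since x3 is nearly a copy of x2
```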
Now, Snee and Marquardt (Belsley [6], “comments”) make, implicitly, a distinction between the two “models”:
$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + e \qquad (28a)$$
with x1≡1, and
$$y = \alpha + \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_p (x_p - \bar{x}_p) + e \qquad (28b)$$
where (28b) is called the “centered” model. For (28b), Snee and Marquardt use eq. (27), but for (28a) they appear to use the definition:
$$R_j^2 = 1 - \frac{\sum_i d_i^2}{\sum_i x_{ij}^2} \qquad (29)$$
Equation 29, in which the denominator of the last term is not centered, is not explicitly given by Snee and Marquardt, but is implied by their statement:
“If the domain of prediction includes the full range from the natural origin through the range of the data, then collinearity diagnostics should not be mean-centered,” and confirmed by the VIF values given in their table 1. In this table, “no centering” results in VIF values of 200,000 and 400,000, while the VIFs for the “centered” data are unity. The quoted statement occurs in a section entitled “Model building must consider the intended or implied domain of prediction.” The basic idea underlying that section is that the analysis of the data, based on the “collinearity diagnostics” (specifically, the VIF values), is governed by the location of the points where one wishes to make predictions and, more specifically, by whether the origin (x1 = 1, x2 = x3 = ⋯ = 0) is such a point. The VIF values which, according to Snee and Marquardt’s formulas, depend heavily on whether or not this origin is included, will then indicate the quality of the predicted values.
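The discrepancy between the centered and non-centered formulas is easy to reproduce. The sketch below uses invented data (not Snee and Marquardt’s table 1): two regressors that are essentially orthogonal after centering, but located far from the origin, so that the uncentered denominator of eq (29) dwarfs the residual sum of squares:

```python
import numpy as np

# Two regressors, nearly orthogonal after centering, but far from the
# origin: eq (29) reports an enormous "VIF" while eq (27) reports a
# value near unity.  (Illustrative data only.)
rng = np.random.default_rng(1)
x2 = 1000 + rng.normal(size=20)
x3 = 2000 + rng.normal(size=20)

def r2(target, others, centered):
    Z = np.column_stack([np.ones_like(target)] + others)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    d = target - Z @ coef
    denom = (((target - target.mean())**2).sum() if centered
             else (target**2).sum())            # eq (27) vs eq (29)
    return 1 - (d**2).sum() / denom

vif_centered   = 1 / (1 - r2(x3, [x2], True))
vif_uncentered = 1 / (1 - r2(x3, [x2], False))
print(vif_centered, vif_uncentered)
```

The huge uncentered value reflects only the distance of the data from the origin, where no measurement was made.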
A more reasonable approach, and one more consistent with the procedures commonly used by scientists, is to limit prediction to the vicinity of the points where one made the measurements, unless additional information is available that justifies extrapolation of the regression equation to more distant points of the sample space. The vicinity of the measured points is determined by the EPD which, in the case of collinearity, may be considerably smaller than the sample domain. In this view, it is the location of the design points, rather than that of the intended points of prediction, that determines predictability. The latter is measured, not by VIF values, but rather by the more concrete VF values, for any desired point of prediction.
The view advocated by Snee and Marquardt sometimes results in an enormous difference in the VIF values between the centered and non-centered forms. Equation (29) serves no useful purpose and is, in fact, unjustified and misleading. It is unjustified because it not only includes the origin (x1 = 1, xk = 0 for k > 1) in the correlation and VIF calculations, but, moreover, gives this point infinite weight in these calculations. Yet no measurement was made at that point. Equation (29) is also misleading because it leads to very large VIF values for some non-centered regressions, implying that severe “ill-conditioning” exists, even when the X matrix is, except for some trivial coding, completely orthogonal (cf. [6]).
The ill-conditioning exists only in terms of the large VIF value. It is an artifact arising from the desire to make the two forms of the regression equation into two distinct “models”.
The two forms, eqs (28a) and (28b), lead to identical estimates for the βj, including β1, and for their standard errors. They also lead to identical values and variances for an estimated (predicted) ŷ, at any point of the X space. There seems to be no valid reason for two distinct equations for the VIF. They only lead to the false impression that centering can reduce or even remove collinearity.
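That the two forms are one and the same model can be checked directly; a minimal sketch with invented data, fitting both forms by least squares and comparing slopes, fitted values, and a prediction at an arbitrary new point:

```python
import numpy as np

# Fit forms (28a) and (28b) to the same data: the slope estimates,
# the fitted values, and any prediction agree exactly.
rng = np.random.default_rng(2)
x2 = rng.uniform(0, 10, size=15)
x3 = rng.uniform(0, 10, size=15)
y = 3 + 2 * x2 - x3 + rng.normal(size=15)

Xa = np.column_stack([np.ones(15), x2, x3])                          # (28a)
Xb = np.column_stack([np.ones(15), x2 - x2.mean(), x3 - x3.mean()])  # (28b)

ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
bb, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Prediction at the new point (x2, x3) = (5, 5) in each parametrization.
pred_a = np.array([1.0, 5.0, 5.0]) @ ba
pred_b = np.array([1.0, 5.0 - x2.mean(), 5.0 - x3.mean()]) @ bb
print(pred_a, pred_b)
```

Centering merely re-expresses the same fit in shifted coordinates; nothing about the data, or the collinearity in it, changes.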
Our viewpoint in this paper is that the usefulness of a regression equation lies in its ability to “predict” y for interesting combinations of the x’s. We also take the position that inferences from the data alone should be confined to x points that are in the general geometric vicinity of the cluster of design points. An inference for points that are well outside this domain (i.e., outside a suitably defined EPD) is, in the absence of additional information, only a tentative conclusion, and not a valid scientific inference. Such conclusions may, however, be very useful, provided their tentative character is recognized, and provided they are subsequently subjected to further experimental verification.
Daniel and Wood [7] discuss briefly the relation between the variance of ŷ and the location of the point at which the prediction is made. However, their discussion is in the context of selecting the best subset of regressors from among the entire set of regressors, a subject different from the one dealt with in this paper.
Another publication that deals explicitly with predictability is a paper by Willan and Watts [8]. These authors define a “Region of Effective Predictability” (REP) as that portion of the X space in which the variance of the predicted ŷ does not exceed twice the variance of ŷ predicted at the centroid of the X matrix. The volume of this region for the actual data, denoted REPA, is then compared with that of a similarly defined region, REPo, for a “fictitious orthogonal reference design” of “orthogonal data with the same N and the same rms values as the actual data.” The ratio of the volume of REPA to that of REPo is taken as “an overall measure of the loss of predictability volume due to collinearity.”
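Willan and Watts’ membership criterion itself is straightforward; a sketch of the test, assuming Var(ŷ) proportional to x′(X′X)⁻¹x and using an invented near-collinear design (the function name `in_rep` is ours):

```python
import numpy as np

# REP membership test: a point x belongs to the REP when
# x'(X'X)^{-1}x <= 2 * (value at the centroid).  Illustrative design
# with x3 nearly equal to x2, hence collinear.
X = np.column_stack([np.ones(6),
                     [1.0, 2, 3, 4, 5, 6],
                     [1.1, 2.1, 2.9, 4.2, 4.8, 6.1]])
XtX_inv = np.linalg.inv(X.T @ X)

centroid = X.mean(axis=0)
v0 = centroid @ XtX_inv @ centroid   # equals 1/N when x1 is the intercept

def in_rep(point):
    return point @ XtX_inv @ point <= 2 * v0

# The centroid is in the REP; a point far off the line x3 ~ x2 is not.
print(in_rep(centroid), in_rep(np.array([1.0, 6.0, 1.0])))
```

For a collinear design the REP collapses toward the direction of the data, which is the same geometry the EPD describes, but without reducing it to a single summary number.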
This concept, apart from its artificial character, suffers from other shortcomings. Like so many other treatments, it attempts to provide a measure of collinearity. But the practitioner who is confronted with a collinear X matrix does not need a measure of collinearity: he needs a way to use the data for the purpose for which they were obtained. Furthermore, this measure loses its meaning when expansion variables are present. For example, for the artificial quadratic set of table 12, Willan and Watts’ measure would indicate a high degree of collinearity which, while literally true, is totally misleading, since the collinearity in no way reduces the usefulness and predicting power of the regression equation, as long as the meaning of the expansion term is taken into account. But even in cases without expansion terms, the measure in question may be misleading. Thus, when applied to the protein calibration data of table 5, it may well lead the analyst to give up on these data as a hopelessly collinear set, whereas, as we have seen, there is nothing wrong with this set and it can indeed be used very effectively for the calibration of a method for protein determination based on reflectance measurements.
Finally, a few words about estimating the β-coefficients considered as rates of change of y with changes in the individual xj. As pointed out by Box [9], this is generally not a desirable use of regression equations. If, however, it is the major purpose of a particular experiment, then this experiment should be designed accordingly, which means, essentially, with an orthogonal X matrix. A collinear X matrix leads to the ability to estimate certain linear combinations of the β’s much better than the β’s themselves. The experimenter can calculate the VF values, not only for any point of X space, but also for any β or combination of β’s, and he can do this without making a single measurement, i.e., in the planning stages of the experiment. If the experimenter does not take advantage of this opportunity, he may be in for considerable disappointment, after having spent time, money, and effort on inadequate experimentation. We believe that the advocacy of remedial techniques, such as Ridge Regression, for collinear data is unwise. One of the most important tasks of a data analyst is to detect, and to call attention to, limitations in the use and interpretation of the data.
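The planning-stage calculation mentioned above requires only the X matrix: Var(c′b)/σ² = c′(X′X)⁻¹c for any combination c of the β’s. A sketch with an invented collinear design (x3 nearly equal to x2; the helper name `vf` is ours):

```python
import numpy as np

# Before any y is measured, compute Var(c'b)/sigma^2 = c'(X'X)^{-1}c
# for any combination c of the betas.  In a collinear design, some
# combinations are estimated far more precisely than the individual
# coefficients.  Illustrative design: x3 ~ x2.
X = np.column_stack([np.ones(5),
                     [1.0, 2, 3, 4, 5],
                     [1.1, 1.9, 3.2, 3.9, 5.1]])
XtX_inv = np.linalg.inv(X.T @ X)

def vf(c):
    c = np.asarray(c, dtype=float)
    return c @ XtX_inv @ c

# The individual slopes beta2 and beta3 are poorly determined,
# but their sum is estimated very well.
print(vf([0, 1, 0]), vf([0, 0, 1]), vf([0, 1, 1]))
```

Such a computation, done before the experiment, would warn the experimenter that this design cannot separate β2 from β3, while their sum is well determined.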
Biography
About the Author: John Mandel is a statistical consultant serving with NBS' National Measurement Laboratory.
Footnotes
Figures in brackets indicate literature references.
We assume that in the X-system, the regressor x1 is identically equal to unity, to allow for an independent term.
References
- [1]. Buck, J. B., Studies on the Firefly, Part I: The Effects of Light and Other Agents on Flashing in Photinus pyralis, with Special Reference to Periodicity and Diurnal Rhythm, Physiological Zoology, 10, 45–58 (1937).
- [2]. Mandel, J., Use of the Singular Value Decomposition in Regression Analysis, The American Statistician, 36, 15–24 (1982).
- [3]. Fearn, T., A Misuse of Ridge Regression in the Calibration of a Near Infrared Reflectance Instrument, Applied Statistics, 32, 73–79 (1983).
- [4]. Belsley, D. A., Kuh, E., and Welsch, R. E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, NY (1980).
- [5]. Draper, N. R., and Smith, H., Applied Regression Analysis, 2nd Edition, Wiley, NY (1981).
- [6]. Belsley, D. A., Demeaning Conditioning Diagnostics Through Centering, The American Statistician, 38, 73–77, and “Comments,” 78–93 (1984).
- [7]. Daniel, C., and Wood, F. S., Fitting Equations to Data, 2nd Edition, Wiley, NY (1980).
- [8]. Willan, A. R., and Watts, D. G., Meaningful Multicollinearity Measures, Technometrics, 20, 407–412 (1978).
- [9]. Box, G. E. P., Use and Abuse of Regression, Technometrics, 8, 625–629 (1966).