Author manuscript; available in PMC: 2014 Oct 24.
Published in final edited form as: J Indian Soc Agric Stat. 2014;68(2):145–162.

Generic Feature Selection with Short Fat Data

B Clarke 1, J-H Chu 2
PMCID: PMC4208697  NIHMSID: NIHMS619926  PMID: 25346546

SUMMARY

Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, without reducing the number of variables inference is impossible. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an Lq norm with high enough q.

Keywords: Large p small n, LASSO, Ridge, Bridge, Clustering, Variance-bias tradeoff, Summary statistics

1. THE BASIC PROBLEM IN THE USUAL SETTINGS

Suppose Y = Yn = (Y1, …, Yn)′ is an n × 1 data vector, X = (X1, …, Xn)′ is an n × p design matrix in which each Xi is a vector of p explanatory variables, and β = (β1, …, βp)′ is the parameter vector. Suppose all the variables are standardized i.e., transformed to have mean zero and variance one so that it will be enough to look at the dependence structure and relative contributions of the Xi’s. Let us write the model

Y = X\beta + \varepsilon \quad (1)

in which ε = (ε1, …, εn)′ is the error term and the constant term usually appearing in a regression model has been subsumed by the rescaling. We want E(ε) = 0, and Var(ε) to be diagonal. Regardless of the distribution of ε, we have

\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{i} (y_i - x_i\beta)^2 = (X'X)^{-1}X'y, \quad (2)

as the least squares estimator of β, provided the inverse exists. If |X′X| is small, the inverse is large in the sense that some of its eigenvalues must be large. When p > n, X is n × p, i.e., short and fat. For Short Fat Data (SFD) |X′X| = 0 so its inverse fails to exist.
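As a quick illustration of the singularity (a minimal R check, not taken from the paper, with arbitrary n = 10 and p = 400):

```r
# Minimal R check (illustrative only): with p > n, X'X is p x p but has rank at
# most n, so the inverse in (2) does not exist.
set.seed(1)
n <- 10; p <- 400
X <- matrix(rnorm(n * p), n, p)   # short fat design matrix
XtX <- crossprod(X)               # X'X, a 400 x 400 matrix
qr(XtX)$rank                      # 10, far below p, so solve(XtX) fails
```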

The central issue here is that the mean function for Y, EY, is in a space of dimension p while only n < p data points are available. That is, the SFD or ‘large p, small n’ problem would disappear if we had more data. However, even though one can imagine arbitrarily large n’s, in practice they do not exist.

Alternatively, we can try to do effective dimension reduction by regressing Y on functions of the Xi’s. The idea is that if we evaluate a comparatively small number of suitably chosen functions on each Xi, i.e., features, and then do penalized regression on those features we will have retained all the information in the data about the response Y. The question is what kind of statistics to use to achieve optimal dimension reduction. Obviously, good statistics on which to regress should encapsulate the information in the explanatory variables relevant to the response.

In many cases, this is done by careful physical modeling, i.e., using domain specific knowledge to restrict the class of models that have to be considered. Recent examples include Stenning et al. (2013) for solar image data and McKay (2004) for musical scores. However, feature selection based on modeling is very time-intensive and may require information not available to a researcher. So, here, we address generic feature selection, done in the absence of modeling information. We look at five classes of features, but only three are independent of the response and could therefore be used in practice. The other two are for comparison purposes to assess how well the first three seem to perform.

One can readily imagine that when p is large enough relative to n, dimension reduction to a ‘reasonable’ p′ statistics may not give p′ ≤ n. This may be the case when the explanatory variables are known to segregate into a number of disjoint classes and the number of these classes is still greater than n. In these cases, it may be reasonable to use a single statistic within each class, but not to permit statistics to depend on variables from more than one class. Thus, even after reducing to regression on statistics one still has SFD. Not as fat as before, but the new ‘X′X’ remains singular.

A second way to correct for p ≫ n is to change the optimality criterion. Since X′X appears in the solution under squared error, let us add a penalty term to shrink the solution towards non-singularity. One general class is

\arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L_1(y_i, x_i, f) + \lambda L_2(f), \quad (3)

where ℱ is a class of functions, and L1 and L2 are two loss functions. The first, L1, expresses the sense in which we want the function f(x) to be close to the response y. The second, L2, ensures that the ‘complexity’ of f is not so large that we overfit the data. Since we don’t want L2 to swamp the information in the data we use a hyperparameter λ to control the tradeoff between how well f summarizes the data and how complicated f may be. Usually, λ is chosen adaptively and sometimes it is called a decay parameter.

Various instances of (3) are of great interest. The polynomial subclass is

\arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^q + \lambda \sum_{j=1}^{p} |\beta_j|^r \quad (4)

where q and r usually are integers. When (λ, q) = (0, 2), (4) corresponds to (1) and yields (2). If the xi’s are replaced by functions of the explanatory variables, then we are doing ‘feature selection’, i.e., regression on statistics formed from the explanatory variables. Indeed, an estimator arising from (4) is Bayesian: For q = 2, the argument of the min in (4) can be regarded as the negative logarithm of the product of an n-fold normal density with mean vector (x1β, …, xnβ) and a prior proportional to exp(−λ Σ_{j=1}^{p} |βj|^r), so that finding the quantity in (4) is equivalent to choosing the mode of a posterior.
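To make that step explicit, here is a one-line worked version of the argument (a sketch in which the error variance is scaled so the factor 1/(2σ²) equals one):

```latex
% Posterior mode argument for (4) with q = 2: take y_i | x_i, beta normal with mean
% x_i beta (error variance scaled so 1/(2 sigma^2) = 1) and prior
% pi(beta) proportional to exp(-lambda sum_j |beta_j|^r). Then, up to a constant,
-\log p(\beta \mid y)
  = \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^r ,
```

so minimizing (4) with q = 2 is maximizing the posterior density, i.e., selecting its mode.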

The main contribution of this paper is to observe that clustering over variables, summarizing the clusters by statistics, and then feeding the statistics into a shrinkage method may be an effective way to do dimension reduction. More formally, if one must reduce the number of explanatory variables by constructing features in the absence of sufficient modeling information (as is often the case) then one may be led to a two step procedure. The first step is to choose a number of clusters, K, by use of a clustering procedure, or by use of physical modeling when this is possible. In either case, the second step is to summarize each cluster separately by a small number of generic statistics. If one does this then the best number of statistics to use per cluster is roughly [n/K], the smallest integer greater than or equal to n/K. Since n ≤ K[n/K] < n + K, the total number of statistics roughly equals the number of data points. These statistics can then be fed into any penalized method such as (3) or (4) to give coefficients and predictors. Essentially this means that even when p ≫ n one can pragmatically reduce to a p′ ≈ n, meaning that elaborate schemes for permitting p′ > n provide little gain.
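A minimal R sketch of this two-step procedure follows; it is illustrative rather than the authors' code, uses K-means for the first step, simple row-wise moment-type summaries as stand-ins for the statistics of Section 2.5, and assumes the lars package for the penalized step.

```r
# Sketch of the two-step procedure (illustrative, not the authors' code).
library(lars)

set.seed(2)
n <- 10; p <- 400; K <- 4
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Step 1: cluster the p columns of X into K blocks.
blocks <- kmeans(t(X), centers = K, nstart = 10)$cluster

# Step 2: summarize each block by m = ceiling(n/K) statistics per observation.
m <- ceiling(n / K)                                 # the '[n/K] rule'
summarize_block <- function(Xb, m) {
  sapply(seq_len(m), function(j) rowMeans(Xb^j))    # first m moment-type summaries
}
Z <- do.call(cbind, lapply(split.data.frame(t(X), blocks),
                           function(B) summarize_block(t(B), m)))

# Penalized regression of y on the K * m summary statistics.
fit <- lars(Z, y, type = "lasso")
```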

In this procedure the variability due to the clustering is neglected in practice, although it is built into our simulations which use repeated calculations of the generalized cross-validation error over repeated generation of the data. However, the focus is not only on variability, bias matters too. In essence, the optimal number of statistics [n/K] represents a bias-variance tradeoff: More statistics means less bias but more variability, fewer statistics means more bias but less variability. Thus, our ‘[n/K] rule’ stems from seeking an optimal variance-bias tradeoff rather than asymptotic optimality because n is small and it is unrealistic to think n will increase without bound.

In the first part of the next section we list several standard families of models and verify that they are of the form of (3) or (4). Equipped with these examples, we discuss the relationship of SFD and generic feature selection. In Section 3, we describe our contribution: It amounts to an investigation of how four modeling factors (penalty, choice of statistics, data type and clustering algorithm) affect regression on generic statistics with SFD. In Section 4, we present the results of this 4 by 4 array that suggest the ‘[n/K] rule’. In Section 5, we show two limitations of the [n/K] rule and, in Section 6, we give our general conclusions.

2. MODEL CLASSES AND SUMMARY STATISTICS

The four classes we briefly review here are OLS, Ridge, Bridge, and LASSO. There are many other model classes that use penalization such as CART (Breiman et al. 1984), SCAD (Fan and Li 2001), elastic net (Zou and Hastie 2005) and so forth. These are usually designed for model identification (CART is the exception) and often have the oracle property to ensure asymptotically good model identification. However, penalized methods with the oracle property do not in general perform as well as other methods do for data summarization and prediction, which are the goals here. (SCAD is the one exception to this because its penalty is so small it compares with, say, model averaging methods.) Predictive comparisons among these approaches are not common, but see Austin et al. (2013) and Clarke and Severinski (2011) for special cases. At the end of this section we discuss the summary statistics used in our computations.

2.1 Ordinary Least Squares

Recall the ordinary least squares regression problem defined by (1) and (2). We obtain β̂OLS by minimizing the residual sum of squares Σ_{i=1}^{n} (y_i − x_i^Tβ)² over β. The estimator β̂OLS is unbiased for β but has a large variance when X is nearly collinear. Also, β̂OLS is not unique when X is less than full rank.

With SFD where n < p we need to replace actual inverses with generalized inverses of some sort to get uniqueness. For an n × m matrix A the procedure begins by trying to solve AX = y when y ∈ Range(A). One definition of a generalized inverse for A is a matrix B for which ABA = A. This reduces to the usual definition of matrix inverse when A is invertible. If BAB = B, i.e., A is a generalized inverse for B, and both AB and BA are orthogonal projections, then B is unique. This is called the Moore-Penrose generalized inverse.

Using the Moore-Penrose inverse gives unique solutions. Indeed, the central results in the theory of linear models – properties of parameter estimates and fitted values, Chi-squared distributions for sums of squares – continue to hold using Moore-Penrose inverses. The cost is substantially inflated variances.
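For instance (a small illustration, assuming the MASS package):

```r
# Small illustration (assumes the MASS package): ginv() computes the Moore-Penrose
# inverse via the SVD, giving the unique minimum-norm least squares solution even
# though X'X is singular for short fat data.
library(MASS)
set.seed(3)
n <- 10; p <- 400
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
beta_mp <- ginv(X) %*% y          # minimum-norm solution of the normal equations
max(abs(X %*% beta_mp - y))       # essentially zero: the n observations are fit exactly
```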

2.2 Ridge Regression

Hoerl and Kennard (1970) introduced ridge regression, RR, which modifies OLS by introducing a penalty term λ to shrink the β’s toward zero. The RR estimator,

\hat{\beta}_{Ridge} = \arg\min_{\beta}\Big\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda\sum_{j}\beta_j^2\Big\} = (X^TX + \lambda I_p)^{-1}X^Ty,

is biased, but the variance is smaller than that of the OLS estimator. Therefore, one can often achieve better estimation in terms of MSE, and better prediction. It is seen that RR adds λ times the identity matrix to X^TX to force non-singularity.
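In R the closed form can be computed directly (a hypothetical helper, not from the paper):

```r
# Direct computation of the ridge estimator (illustrative helper, not from the paper).
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X'X + lambda I_p)^{-1} X'y
}
# Even when p > n, adding lambda * I_p makes the matrix being inverted non-singular.
```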

2.3 LASSO

Tibshirani (1996) introduced the LASSO – Least Absolute Shrinkage and Selection Operator – which uses a factor λ times the absolute value of β as a penalty term. The LASSO estimator is defined to be

\hat{\beta}_{LASSO} = \arg\min_{\beta}\Big\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda\sum_{j}|\beta_j|\Big\}.

The LASSO emerges from a more general treatment called Least Angle Regression, when an extra correlation restriction is enforced on the algorithm, see Efron et al. (2004, Exp. 3.1).

The LASSO combines shrinkage and selection on the regression function. The penalty term itself is often recognized as corresponding to putting a prior on β and ‘shrinking’ the parameter to a point, usually taken to be zero. The selection arises because there are many cases where optimizing in the LASSO leads to setting some of the βj’s to be zero.
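A small example of the shrinkage-plus-selection behaviour, assuming the lars package used later in Section 3.3:

```r
# Sketch assuming the lars package: compute the full LASSO path and extract the
# coefficients part-way along it; most entries are exactly zero (the 'selection').
library(lars)
set.seed(4)
X <- matrix(rnorm(10 * 400), 10, 400)
y <- rnorm(10)
fit <- lars(X, y, type = "lasso")
b <- predict(fit, s = 0.5, type = "coefficients", mode = "fraction")$coefficients
sum(b != 0)                        # only a handful of the 400 coefficients are nonzero
```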

2.4 Bridge Regression

Frank and Friedman (1993) defined bridge regression. It is (4) with q = 2. That is, the penalty on the sum of squares is λ times Σ|βj|^r for some r ≥ 0. We write

\hat{\beta}_{Bridge} = \arg\min_{\beta}\Big\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda\sum_{j}|\beta_j|^r\Big\}

It is seen that RR is r = 2, LASSO is r = 1, and AIC is nearly equivalent to “r = 0” in the normal error case.

2.5 Choice of Summary Statistics

One topic not usually addressed is how to choose functions of the explanatory variables generically so as to achieve more parsimonious models. This is sometimes done in the context of basis expansions, e.g., wavelets, but comparisons from one basis to another are infrequent. Usually, the basis is chosen on the grounds of some sort of physical modeling that may or may not be helpful. Indeed, choosing functions based on modeling assumptions is well-established, but in cases where model information is scant, unreliable, or subjective one has little choice but to proceed generically. For instance, in SFD problems there may not be extra information available on the Xi’s to narrow the class of statistics worth searching over, but one can reduce the number of explanatory variables by requiring that the sample variance of a component of Xi be large enough to provide meaningful discrimination.

Here, we suppose the individual variables in the Xi’s segregate into M classes C1, C2, …, CM. Let class Ck have pk variables and let Mk be the n × pk matrix of predictors in class Ck. Write Mik for the i-th row of the submatrix Mk, and let S(Mik) denote a function of the i-th row of the k-th class of variables in Xi. Our task is to choose functions of the form S in sensible ways to serve as summary statistics of X_{i, p_{k−1}+1}, …, X_{i, p_{k−1}+p_k} on which to regress.

If we choose a single Sk for each Ck the regularized risk from (4) is

\sum_{i=1}^{n}\Big(y_i - \sum_{k=1}^{M}\gamma_k S_k(M_{ik})\Big)^q + \lambda\sum_{k=1}^{M}|\gamma_k|^r.

Using the classes Ck permits the pk’s to be reduced to a smaller number of statistics.

There are many natural choices for sequences of statistics to study. Percentiles and moments are the obvious ones to use first. Principal components, PC’s, provide another way to choose a sequence of statistics generically. Alternatively, as we discuss below, statistics such as partial least squares, PLS’s, or sliced inverse regression, SIR’s, can be used. These last two are qualitatively different from moments, percentiles and PC’s because of their dependence on the outcomes y1, …, yn.

In some cases it is realistic to assume classes Ck are known. Hawkins et al. (2001) have a setting in which the classes can be specified pre-experimentally. However, in general, it is unclear how many statistics one wants to choose for each class Ck whence our ‘[n/K] rule’.

3. THE SIMULATION SETTING

The justification of the ‘[n/K] rule’ rests on the computational investigation of a large matrix of cases representing the predictive performance of commonly occurring regularized risks. The problem can be visualized as a one way table crossed with another one way table crossed with a two way table. The first one way table is what the researcher cannot control: The actual properties of the data. The second one way table represents the pre-processing the researcher must do: This is how the explanatory variables are clustered into classes. The two-way table represents how the researcher models: The optimality criterion and the choice of statistics to summarize the data. We go through these factors in turn.

3.1 Data Type

The factor ‘data type’ is not under the control of the researcher. So, we used a large variety of standard data types to see how each technique performed on each. First we considered independent normal data with equal sized blocks. Then we used unequal sized blocks. Then we used correlated normal data with equal and unequal block sizes. More generally, we turned to ARMA(a, b) data with a, b = 0, 1, 2. For greater realism, we then used non-normal independently generated data. The non-normality came mostly from heavier tails, although the distributions we used were not always symmetric. Finally, to investigate non-normal dependent data we generated correlated normal data but applied transformations to it so the distribution of the data going into the analysis would no longer be normal.

3.2 Clustering Algorithm

The factor ‘clustering method’ reflects how the researcher must preprocess the data so it will be amenable to summarization. We chose 6 levels, i.e., 6 kinds of clustering, to partition the explanatory variables into disjoint classes C1, …, CK; the elements within each class are expected to be more highly dependent on each other than on elements from different classes. For the first ‘level’ we assume the clusters to be known. The other five levels were K-means, three agglomerative procedures differing in the dissimilarity used, and one divisive procedure.

The K-means algorithm is based on a distance between explanatory p-vectors Xi, here taken to be the Euclidean metric denoted by ||·||. We did this for a range of K; the clustering with the smallest error is used as ‘true’. This was implemented with the R function kmeans(), taking the clustering with the minimum error over 10 tries as the globally optimal clustering.

Three of the hierarchical methods were agglomerative. These clustering algorithms use dissimilarities dj. Dissimilarities generalize the concept of distance: For entries of vectors we have values d_j(x_{i,j}, x_{i′,j}) giving d_{i,i′} = D(x_i, x_{i′}) = Σ_{j=1}^{p} d_j(x_{i,j}, x_{i′,j}) on vectors. Agglomerative hierarchical algorithms begin with n singleton clusters and combine two clusters at each step depending on the dissimilarity. The distance between clusters is expressed in terms of dissimilarity and here we use three forms: Single linkage, or nearest neighbor, uses d_NN(G, H) = min_{i∈G, i′∈H} d_{i,i′} for clusters G and H. Complete linkage, or furthest neighbor, is d_FN(G, H) = max_{i∈G, i′∈H} d_{i,i′}. Group average d_GA(G, H) takes the mean of the d_{i,i′}’s over clusters G and H. These were implemented by agnes() in R, see Struyf et al. (1996). Of these three methods, complete linkage seemed to work better than single linkage or group average for our purpose, although the differences are small.

The sixth hierarchical clustering method was divisive. This approach begins by treating the whole data set as a single cluster and recursively divides it at each iteration. This procedure was implemented by diana() in R. For details, see Struyf et al. (1996).
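For instance, the following illustrative sketch (using the cluster package's agnes() and diana() on a hypothetical 10 × 400 matrix) shows the three agglomerative linkages and the divisive alternative, each cut into K = 4 classes:

```r
# Illustrative sketch using the cluster package: agglomerative clustering of the
# p variables under the three linkages, plus the divisive alternative, each cut
# into K = 4 classes with cutree().
library(cluster)
set.seed(5)
X <- matrix(rnorm(10 * 400), 10, 400)
D <- dist(t(X))                                    # dissimilarities between variables
for (link in c("single", "complete", "average")) {
  ag <- agnes(D, diss = TRUE, method = link)       # agglomerative
  print(table(cutree(as.hclust(ag), k = 4)))       # sizes of the K = 4 classes
}
dv <- diana(D, diss = TRUE)                        # divisive
table(cutree(as.hclust(dv), k = 4))
```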

3.3 The Optimality Criterion

The researcher gets to choose the optimality criterion to be employed. Here we consider 3 levels for this factor, i.e., 3 different forms of regularized risk. These are RR, LASSO, and bridge. RR is computed in closed form because the choice of λ gives non-singularity. That is, we can use fits directly from (X′X + λI_p)^{−1} X′y.

The version of LASSO we use here is a modified Least Angle Regression, LARS, see Efron et al. (2004). As described in Section 2, LASSO is a quadratic programming problem. However, using LARS one can obtain solutions readily for all values of λ by a variant of forward stepwise regression. As λ varies from 0 to ∞, the LARS procedure can be used to generate the LASSO solutions.

To implement the third level, bridge regression as in Section 2.4, we used the Fortran code and description of Fu (1998). Following this treatment, for given λ ≥ 0 and r ≥ 1 we compute β̂ and use

p(\lambda) = \mathrm{tr}\big(X(X'X + \lambda W^{-})^{-1}X'\big) - n_0

as the effective number of parameters, in which W⁻ is the generalized inverse of W = diag(2|β̂_j|^{2−r}/r) and n₀ is the number of entries β̂_j for which β̂_j = 0 when r = 1. Note that all generalized inverse procedures give the same results on diagonal matrices and that n₀ represents the number of zero entries on the diagonal of W. It is seen that β̂ solves

\Big(X'X + \tfrac{\lambda r}{2}\,\mathrm{diag}(|\beta_j|^{r-2})\Big)\beta = X'y. \quad (5)
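A crude R sketch of solving (5) by iteratively reweighted ridge is given below; it is only illustrative and is not Fu's code, and it is best behaved for r ≥ 2 since for r < 2 the weights blow up as coefficients approach zero.

```r
# Crude fixed-point sketch of (5): reweighted ridge steps, recomputing the diagonal
# weights lambda*r/2 * |beta_j|^(r-2) from the current iterate. Illustrative only;
# Fu (1998) gives a more careful algorithm, especially for r < 2.
bridge_sketch <- function(X, y, lambda, r, tol = 1e-6, max_iter = 200) {
  p <- ncol(X)
  beta <- rep(0.1, p)                              # crude starting value
  for (it in seq_len(max_iter)) {
    W <- diag((lambda * r / 2) * abs(beta)^(r - 2))
    beta_new <- as.numeric(solve(crossprod(X) + W, crossprod(X, y)))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
```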

3.4 Choice of Statistics

The second factor under the researcher’s control is the choice of statistics. We used 5 levels. These levels consist of 2 subclasses with 3 and 2 levels respectively. The first class consisted of modified moments, percentiles, and PC’s. The second class consisted of partial least squares (PLS) and sliced inverse regressions (SIR) statistics. Unlike the first subclass, these depend on the values of Y for their construction.

Our statistics based on moments separate positive and negative parts for odd exponents; this is not needed for even exponents. Thus, for a data vector X = (X1, …, Xp) the first moment is represented by two statistics which we regress on together: X̄_+ = Σ_{i=1}^{p} X_i I{X_i ≥ 0} and X̄_− = Σ_{i=1}^{p} X_i I{X_i < 0}. The second moment is Σ_{i=1}^{p} X_i². The third is again separated into two positive and negative parts, like the first moment.
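This counting scheme (two statistics per odd moment, one per even moment) can be written as a small helper; the function below is only an illustration of the bookkeeping:

```r
# Illustrative helper for the moment-type summaries above: odd moments are split
# into positive and negative parts, even moments are not, so using 3 moments
# yields 5 statistics, matching the counts quoted in Section 4.
moment_stats <- function(x, n_moments = 3) {
  out <- numeric(0)
  for (m in seq_len(n_moments)) {
    if (m %% 2 == 1) {
      out <- c(out, sum(x^m * (x >= 0)), sum(x^m * (x < 0)))  # positive / negative parts
    } else {
      out <- c(out, sum(x^m))
    }
  }
  out
}
# e.g. length(moment_stats(rnorm(100), 3)) is 5
```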

The percentiles we use are standard. The first percentile statistic is the median of (X1, …, Xp), calculated with linear interpolation. The second set of percentile statistics consists of 3 statistics, the median and the two quartiles. The third set of percentile statistics adds the 4 percentiles midway between all of the quartiles, giving 7 percentiles at 12.5k% for k = 1, …, 7, and so forth. In jumping from one to three to seven percentiles, and so forth, the idea was to explore whether tail behavior was helpful by forcing the statistics to respond to different regions of the distribution. However, the 33rd and 67th percentiles, the quartiles, the quintiles, and so forth could have been used instead, possibly leading to clearer support for the [n/K] rule at the cost of the statistics not being nested.
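In R these nested percentile sets can be computed with a short helper like the one below (illustrative; quantile()'s default type 7 interpolation is the one described in Section 4):

```r
# Illustrative helper: the nested percentile sets (1, 3, 7, 15, 31 percentiles for
# k = 1,...,5) at probabilities j/2^k, computed with quantile()'s default linear
# interpolation (type 7).
percentile_stats <- function(x, k) {
  quantile(x, probs = (1:(2^k - 1)) / 2^k, type = 7)
}
# percentile_stats(x, 1) is the median; percentile_stats(x, 2) adds the quartiles.
```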

The PC’s of a matrix X arise from writing X = UDV′ so that

X'X = VD^2V',

where D is diag(d1, …, dp) with dj ≥ 0 in decreasing order. This is the usual diagonalization for a symmetric matrix giving non-negative real eigenvalues. Write V = (V1, …, Vp). Then, Zj = XVj for j = 1, …, p is a set of directions that can be assumed orthogonal with Var(Zj) = dj²/n. So, the first PC is Z1 and it is the linear combination of explanatory variables having the largest variance. Likewise, the second PC is Z2 and has the second largest variance, and so forth. One can regress on the first Zj’s as a way to ensure the most important contributions to the variability in the data have been modeled. Regression on all the PC’s devolves to the original regression.
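As a small illustration (a hypothetical 10 × 100 standardized block):

```r
# Illustrative computation of the PC scores of one standardized n x p_k block via
# the SVD, as in the display above: Z_j = X V_j and the variances decrease with j.
set.seed(6)
X <- scale(matrix(rnorm(10 * 100), 10, 100))   # one block, standardized columns
s <- svd(X)                                    # X = U D V'
Z <- X %*% s$v                                 # PC scores
round(apply(Z, 2, var)[1:4], 3)                # decreasing variances d_j^2/(n-1)
```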

Partial least squares, PLS, is a different way to construct a sequence of statistics on which to regress. Recall that Y and the Xj’s are standardized. Begin by regressing Y on each of the p explanatory variables. This gives p expressions, say φ̂_{1,j} = 〈x_j, y〉x_j for j = 1, …, p. The first PLS direction is Z_1 = Σ_{j=1}^{p} φ̂_{1,j}. Next, regress Y on Z_1 to get, say, ψ̂_1 and orthogonalize the p explanatory variables with respect to Z_1, i.e., subtract the portion of each explanatory variable X_j that is in the direction of Z_1. Redo the procedure for the orthogonalized explanatory variables, X_j^{(new)} = X_j^{(old)} − [〈Z_1, X_j^{(old)}〉/〈Z_1, Z_1〉] Z_1 for all j, to generate φ̂_{u,j} and ψ̂_u for u = 2, 3, … and obtain Z_2, …; see Hastie et al. (2001), Section 3.4, for further details. Regression on all the PLS directions devolves to the original regression.
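A bare-bones sketch of this construction, written only to make the steps concrete (it assumes standardized X and y and is not the authors' implementation):

```r
# Bare-bones sketch of the PLS construction described above: each direction is a
# y-weighted sum of the current predictors, which are then orthogonalized against
# it before the next step.
pls_directions <- function(X, y, n_dir = 2) {
  Z <- matrix(0, nrow(X), n_dir)
  Xc <- X
  for (u in seq_len(n_dir)) {
    phi <- drop(crossprod(Xc, y))            # <x_j, y> for each column
    Z[, u] <- Xc %*% phi                     # Z_u = sum_j <x_j, y> x_j
    w <- drop(crossprod(Z[, u], Xc)) / sum(Z[, u]^2)
    Xc <- Xc - outer(Z[, u], w)              # subtract the component along Z_u
  }
  Z
}
```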

Sliced Inverse Regression, SIR, is a technique from Li (1991) motivated by partitioning the range of Y, doing inverse regression on each region, pooling the results, and doing a principal components analysis on the weighted covariance matrix. The resulting statistics can be used for regressions.
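A bare-bones version of this construction, included only to make the steps concrete (Li (1991) gives the proper treatment; the slicing and weighting below are simplifications):

```r
# Bare-bones SIR sketch (illustrative): slice the response, average the
# standardized predictors within each slice, and take leading eigenvectors of the
# weighted covariance of the slice means.
sir_directions <- function(X, y, n_slices = 3, n_dir = 2) {
  Xs <- scale(X)
  slice <- cut(rank(y, ties.method = "first"), n_slices, labels = FALSE)
  means <- t(sapply(split(as.data.frame(Xs), slice), colMeans))  # slice means of X
  w <- as.numeric(table(slice)) / length(y)                      # slice weights
  M <- crossprod(means * sqrt(w))                                # weighted covariance
  Xs %*% eigen(M, symmetric = TRUE)$vectors[, seq_len(n_dir)]    # SIR variates
}
```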

3.5 Assessing Performance

In general, in these settings, we are concerned primarily with prediction since the true models are inaccessible. This leads us to use cross-validation, CV, as a performance criterion. The natural choice is leave-one-out CV because it is approximately unbiased for prediction error. However, the variance of leave-one-out CV may be high since any two of the training sets have n − 2 data points in common. Rather than using fivefold or other forms of CV, we actually used Generalized CV, GCV, for its computational convenience.

Suppose there exists a matrix S so that the fitted values ŷ = (ŷ1, …, ŷn) for the outcome y can be expressed as ŷ = Sy. Then, writing tr(S) for the trace of S,

GCV = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{f}(x_i)}{1 - \mathrm{tr}(S)/n}\right)^2, \quad (6)

which can be computed for ridge. GCV is easier to compute than CV because tr(S) needs to be found only once. When a regression method is not linear, i.e., there is no matrix S, a GCV can still be given. For LASSO, the form of the GCV is given by Fu (1998). For bridge, consistent with (5) the GCV error is taken as

GCV = \frac{RSS}{n\,(1 - p(\lambda)/n)^2} \quad (7)

in which RSS = Σ_{i=1}^{n} (y_i − X_iβ̂)². For our work below we chose r = 1.5, 2.5 and 3 and we let λ vary over (0.001, 1000).
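For instance, the ridge version of (6) can be computed with a short helper like the following (illustrative; the grid of λ values is hypothetical):

```r
# Illustrative helper for the ridge GCV in (6): the smoother matrix is
# S = X (X'X + lambda I_p)^{-1} X', and tr(S) only needs to be computed once per lambda.
gcv_ridge <- function(X, y, lambda) {
  S <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))
  yhat <- drop(S %*% y)
  mean(((y - yhat) / (1 - sum(diag(S)) / length(y)))^2)
}
# e.g. sapply(c(0.001, 0.1, 10, 1000), function(l) gcv_ridge(X, y, l)) traces the
# GCV error over a grid of decay parameters.
```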

For the two different classes of statistics – dependent on Y or not – we used two different forms of the GCV. For the first, we used all the data to generate the statistics and then used leave-one-out GCV to find the best number of statistics per cluster. Thus, results for moments, percentiles, and PC’s are an assessment of goodness of fit, averaged over 200 iterations. For the second class – PLS and SIR – we left out one data point, found the statistics, and then did GCV, averaging over 200 iterations. So, this criterion is more predictive and a ‘purer’ form of GCV than for the first class. This seemed appropriate since goodness of fit seems less relevant for statistics that are more complex. Since our focus is on finding the best number of statistics from a class of statistics rather than comparing from one class to another, these different (but very similar) criteria will not affect our conclusions.

4. RESULTS

For each data type we fix a clustering algorithm, an optimality criterion and then look at the GCV for each choice of statistic.

Since the experimenter cannot choose the type of data to be analyzed, we have separated subsections based on 4 data types: Normal with correlation (or independent), independent non-Normal, Normal with serial correlation (ARMA), and dependent non-normal. To present our findings, nested within each of these subsections we have subsubsections, one for each of the three penalties we used: LASSO, Ridge, and Bridge. Within each of the subsubsections we have nested a two way array based on class of statistic and clustering technique. There are 5 classes of statistics (moments, percentiles, PC’s, PLS’s and SIR’s) and six clustering techniques (k-means, agglomerative with 3 link functions, and divisive). Although the clustering must be done before the optimality criterion can be implemented or the statistics calculated, we have put the use of the clustering procedure last in our presentation (when we comment on it at all) because it rarely affected the [n/K] rule in the main cases we studied.

Subject to the slightly different forms of the GCV for moments, percentiles, and PC’s versus PLS’s and SIR’s, it is reasonable to compare different generic choices of statistics within subsubsections because the data type and penalty are common while the clustering technique appears not to matter. Choices of decay parameter are also reasonable to compare but we have not done this; we have defaulted to the automatic selection of decay parameters in the packages we have used. We tested several choices of interval in which to situate the decay parameter but then settled on [.001, 1000].

The common structure among all the results below is a linear regression model with p regressors and n observations. The regression matrix X consists of K blocks, X1, …, XK, and block Xk contains pk variables and so is n × pk with Σ_{k=1}^{K} p_k = p. Our results are for n = 10, p = 400, K = 4 and pk = 100 for all k, and N = 200 iterations unless specified otherwise. We present the computed results below commenting only briefly on the patterns they exhibit.

4.1 Normal Data

Here we chose ε ~ N(0, 1), and supposed the blocks of the regression matrix contained variables drawn from N(0, Σ), where Σ has 1 on the diagonal and ρ on all the off-diagonal terms. The βj’s were also drawn from a N(0, 1).

4.1.1 LASSO

As noted, LASSO fits are computed by using the LARS package by Hastie and Efron, which uses the LARS algorithm of Efron et al. (2004). This package chooses the decay parameter λ by finding the minimum leave-one-out cross validated mean square prediction error.

Moments

Recall we have 2 statistics for each odd order and one for each even order. Table 4.1 shows the GCV error as a function of the correlation and number of moments used in the regression. Here and below an asterisk, *, denotes the minimum in a row. In some cases, we use a dagger, †, to indicate the minimum in a row and an asterisk to indicate the second smallest value in that row. This notation means that we believe the apparent minimum is an artefact of the computing rather than accurately approximating the value of the quantity desired. Also, we often omit columns that merely confirm recognizable patterns. For instance, in Table 4.1, we omit the columns for 6-th and higher moments. On the other hand, for comparison purposes, we sometimes include a column at the right labeled ‘all’ which gives the GCV for the stated penalty using all the data. In Table 4.1, for instance, LASSO is applied to the 400 variables and not all moments are used.

Table 4.1.

Correlated Normal; LASSO; Moments; Known Clusters

#Moments 1 2 3 4 5 all

ρ = 0 43.249 41.727 41.205* 41.743 42.062 44.258
ρ = 0.1 45.351 44.359* 44.844 44.821 45.132 45.151
ρ = 0.3 38.637 34.132* 35.733 35.278 36.317 40.386
ρ = 0.5 35.824 33.211 33.162* 33.375 34.406 35.716
ρ = 0.7 31.432 26.370* 26.662 27.899 28.441 27.894
ρ = 0.9 22.850 16.4458* 18.950 19.780 21.597 18.229

It is seen here and in our further examples that within a row, the GCV score is lowest when the number of moments used per block is close to [n/K], the first integer greater than or equal to n/K = 10/4 = 2.5, i.e. 3, where n is the sample size, here 10, and K is the number of blocks, here 4. So, the total number of statistics is near n.

The upper panel of Fig. 1 shows that for all choices of non-negative correlation, the values in a row form a U-shaped curve as a function of the number of moments. The unique minima occur for 2 or 3 moments, i.e., 3 or 4 statistics tends to give the smallest GCV errors.

Fig. 1. Above: Graph of the rows in Table 4.1. The numbers on the lines correspond to the row, i.e., the value of ρ. Below: Graph of Table 4.3 (see below) showing how the number of clusters (indicated by the numbers on the lines) affects the location of the minimum.

A similar case is shown in Table 4.2.

Table 4.2.

Same as Table 4.1 but n = 6, p = 300, K = 3

#Moments 1 2 3 4 5

ρ = 0 58.351* 59.163 59.619 59.863 60.366

Here, n/K = 6/3 = 2 and two first moment statistics are optimal.

Similar results are obtained if we do not use the knowledge that the data come from K = 4 independent classes of explanatory variables and we are forced to cluster the data into classes; we chose a range K = 2, 3, 4, 5, 6 clusters and used moment based statistics within each block. For clustering to be meaningful, the correlation cannot be zero; we chose ρ = 0.3. Results for the K-means and hierarchical clustering are in Tables 4.3 and 4.4. Again, the starred entries tend to be the ones for which the number of statistics is close to [n/K]. That is, the asterisks are roughly rising from left to right.

Table 4.3.

Normal with ρ = 0.3; LASSO, Moments; K-means

#Moment 1 2 3 4 5

K = 2 39.970 39.498 37.751 37.408* 37.888
K = 3 37.546 36.678* 36.742 37.229 37.802
K = 4 38.614 36.492 36.410* 36.623 36.874
K = 5 34.299* 37.445 37.235 37.775 38.010
K = 6 35.350* 37.841 37.648 38.398 39.185
Table 4.4.

Normal with ρ = 0.3; LASSO; Moments; Hierarchical

#Moment 1 2 3 4 5

K = 2 43.296 41.897 40.317* 40.381 40.586
K = 3 38.025 37.675 36.752* 37.085 36.861
K = 4 36.684* 37.158 37.512 37.875 38.110
K = 5 35.596* 40.193 39.806 40.220 40.196
K = 6 36.033* 38.386 38.968 39.523 38.932

The lower panel of Fig. 1 shows that for the GCV scores in Table 4.3 the optimal number of statistics (as indicated by the location of the minima in each row of Table 4.3 or its corresponding line in Fig. 1) increases as the number of clusters decreases. That is, within each row, a U-shaped curve of the errors as a function of the number of statistics is seen – even if the left arm of the U is trivial for five or six clusters since the number of statistics cannot be less than one. That is, as we choose more clusters, the optimal number of statistics per cluster decreases, preventing the total number of regressors from increasing much. This observation held for the other clustering algorithms we used, so we omit those results.

Percentiles

Next, we use percentiles in place of moment based statistics. We choose the sequence of percentiles corresponding to probabilities j/2^k, where j = 1, …, 2^k − 1, for k = 1, 2, 3, 4, 5. To calculate them we used the R command quantile() (type 7), which uses interpolation to give any quantile for any number of data points. The next 4 tables parallel Tables 4.1–4.4. It is seen that the optimal number of statistics per block remains near [n/K]: In Tables 4.5 and 4.6, 3 statistics are seen to be optimal and n/K = 2.5, except in high dependence settings. That is, the results for percentiles are qualitatively the same as for moment based statistics. Like Tables 4.3 and 4.4, Tables 4.7 and 4.8 show the effect of using 2 of the clustering algorithms is minimal. That is, as the number of clusters increases, the optimal number of statistics per cluster decreases, again staying close to [n/K].

Table 4.5.

Correlated Normal; LASSO; Percentiles; Known Clusters

#Percentiles 1 3 7 15 31 all

ρ = 0 44.216 37.259* 41.459 40.512 41.667 43.225
ρ = 0.1 41.394 37.816* 38.794 38.653 39.292 40.083
ρ = 0.3 38.321 32.530* 35.388 35.577 35.109 39.441
ρ = 0.5 35.929 29.0542* 29.754 29.384 30.469 40.851
ρ = 0.7 23.650 20.700 20.665 20.550 20.347* 28.125
ρ = 0.9 10.971 8.9469 8.781* 9.1723 8.993 17.504
Table 4.6.

Same as Table 4.5 but n = 6, p = 300, K = 3

#Percentiles 1 3 7 15 31

ρ = 0 59.622 58.635* 60.468 61.359 62.398
Table 4.7.

Normal with ρ = 0.3; LASSO, Percentiles; K-means

#Percentiles 1 3 7 15 31

K = 2 41.095 36.550 35.588* 35.619 36.045
K = 3 39.703 35.121 34.963* 35.230 35.120
K = 4 37.624 34.083 34.079* 34.106 34.233
K = 5 36.750 33.581* 33.844 33.766 34.680
K = 6 36.532 34.103* 34.409 35.198 35.308
Table 4.8.

Normal with ρ = .3; LASSO, Percentiles; Hierarchical

#Percentiles 1 3 7 15 31

K = 2 43.221 40.141 38.005* 39.438 39.735
K = 3 41.448 36.665 36.637* 37.494 37.700
K = 4 39.410 35.737* 36.140 36.538 37.585
K = 5 37.609 35.234* 36.226 35.577 36.914
K = 6 36.072 34.620* 36.069 36.340 36.600

The only difference between Tables 4.7 and 4.8 is the clustering procedure and since they are similar – only differing for K = 4 – it is enough to show a plot of Table 4.8. In Fig. 2, the upper panel shows that for K = 4, 5, 6 the minimum is quite strong at three while for K = 2, 3 the minimum is quite strong at seven. Note that since only the binary percentiles were calculated rather than evenly spacing the percentiles over one to 100, the results for percentiles are consistent with the other results even though they do not directly lead to the [n/K] rule.

Fig. 2. Above: Graph of the rows in Table 4.8. The numbers on the lines correspond to the row, i.e., the number of clusters K. Below: Graph of Table 4.11 (see below) showing how the number of clusters (indicated by the numbers on the lines) affects the location of the minimum.

Principal Components

In this case there are a maximum of n = 10 eigenvectors; regressing on all of them is equivalent to using the original explanatory variables, even though there are only n of them.

Tables 4.9 and 4.10 are qualitatively similar to Tables 4.1 and 4.5 and Tables 4.2 and 4.6, respectively. Again, [n/K] = 3 for Table 4.9 and [n/K] = 2 for Table 4.10, consistent with the starred values. Indeed, it appears in Table 4.9 that PC’s may be more stable because three PC’s per block is the optimal choice for all the correlations we used. Note that we used nine PC’s: We could have used 10 PC’s but this would have been degenerate since n = 10. Moreover, the minima occurred strictly between one and nine.

Table 4.9.

Correlated Normal; LASSO; PC’s; Known Clusters

#PC 1 2 3 4 5 9

ρ = 0 45.095 41.994 41.075* 41.955 42.868 44.774
ρ = 0.1 43.515 40.393 39.082* 40.344 41.497 43.148
ρ = 0.3 38.203 38.785 35.925* 37.590 39.278 41.891
ρ = 0.5 32.728 34.967 32.595* 34.962 37.479 40.697
ρ = 0.7 24.966† 31.705 27.689* 30.678 33.308 38.066
ρ = 0.9 12.157† 24.884 18.424* 23.004 26.008 32.320
Table 4.10.

Same as Table 4.9 but n = 6, p = 300, K = 3

#PC’s 1 2 3 4 5 all

ρ = 0 58.676 55.911* 58.331 61.197 61.303 64.641

Note that we have daggered two entries in Table 4.9: These are for ρ = 0.7, 0.9. We suspect, but are unable to establish formally, that the PC algorithm we used broke down. In many computations not shown here we got anomalous results when correlation was high and the number of statistics was at an extreme, either small or large. When we redid Table 4.9 using other values of n (with known clusters), we again found that the optimal number of statistics per cluster was [n/K]; this is seen in Table 4.10, parallel to Tables 4.2 and 4.6.

As in the last two subsubsections, we examine the effects of clustering. All the algorithms gave qualitatively similar results that matched the earlier cases. For completeness, we show one table using the K-means algorithm. Again, Table 4.11 shows the optimal number of statistics per cluster decreases as the number of clusters increases. It is easy to see that n/K is 10/2 = 5, 10/3 ≈ 3.33, 10/4 = 2.5, 10/5 = 2, and 10/6 ≈ 1.67 for K = 2, …, 6, so [n/K] is 5, 4, 3, 2, 2, in rough agreement with where the asterisks in Table 4.11 appear, validating the optimal number of statistics being approximately [n/K].

Table 4.11.

Normal with ρ = 0.3; LASSO, PC’s; K-means

#PC’s 1 2 3 4 5 9

K = 2 41.161 38.308 37.210 37.919 34.038* 40.333
K = 3 39.408 37.278 33.961* 36.262 38.249 42.620
K = 4 37.701 37.307 36.320* 38.227 40.440 42.585
K = 5 36.519 34.148* 37.474 39.715 40.462 42.306
K = 6 36.047 35.652* 38.938 40.207 41.732 41.902

The rows of Table 4.11 are plotted as lines on the lower panel of Fig. 2. For K = 5, 6 the minimum occurs at two PC’s as seen in Table 4.11. When K = 3, 4 three PC’s achieve the minimum and when K = 2, five PC’s achieve the minimum.

Partial Least Squares

When we use partial least squares, PLS’s, as our statistics the results are similar to the earlier cases; however, there is some evidence that the LASSO optimization breaks down as correlation increases. We attribute this to the higher variability of PLS’s due to the dependence of PLS’s on the Y’s as well as the X’s.

In Table 4.12, 4 entries are daggered. It seems clear that when ρ is small the algorithm works well and when ρ is large it breaks down. However, on the mid-range, ρ = 0.3, 0.5, whether to use a dagger or not is a judgement call. We have used a Principle of Insufficient Reason interpolation argument: There is no reason the performance of one PLS should suddenly be better than that of two PLS’s, so we presume it isn’t. Possibly because of the heightened variability of the PLS’s, two statistics per cluster is optimal in GCV while [n/K] = [10/4] = [2.5] = 3. On the other hand, n/K = 2.5, so the discrepancy is the smallest possible nonzero integer difference of one.

Table 4.12.

Correlated Normal; LASSO; PLS’s; Known Clusters

#PLS 1 2 3 4 5

ρ =0 41.404 40.555* 41.152 41.448 41.691
ρ = 0.1 39.904 38.651* 39.106 39.578 39.743
ρ = 0.3 38.518† 39.073* 39.759 39.873 39.943
ρ = 0.5 38.721† 42.230* 43.792 43.966 43.992
ρ = 0.7 29.756† 37.794* 38.619 38.726 38.758
ρ = 0.9 15.748† 40.681* 41.128 41.212 41.211

As before, we investigated performance when the classes of explanatory variable were not known by the procedure. Despite the higher instability seen in Table 4.12, the results when a clustering algorithm was used were qualitatively the same as in the earlier cases: The number of statistics per cluster decreased as the number of clusters increased, at about the same rate. We omit the corresponding tables.

Sliced Inverse Regression

When we used SIR to form summary statistics, the results were somewhat different than for the other choices of statistics. In particular, the results from SIR are much less consistent with the [n/K] rule than the results from PLS are, which in turn are less consistent than for the other three classes of statistics, although not by much. For instance, Table 4.12 shows that the row GCV scores for PLS only have a U-shape for ρ ≤ 0.1 (although another computational procedure might extend this range to higher values of ρ), and Table 4.13 shows the row GCV scores for SIR don’t seem to have a pattern at all. Moreover, the U-shape seems to disappear in Table 4.13: The values in a row seem flat apart from random fluctuations. Note that columns for one or two statistics have been omitted so all the asterisks can be seen and we used a maximum of eight SIR statistics because we reserved one data point for the leave-one-out GCV and noted that using nine SIR statistics would be degenerate.

Table 4.13.

Correlated Normal; LASSO; SIR; Known Clusters

#SIR 3 4 5 6 7 8

ρ = 0 46.556* 46.851 47.326 47.562 47.638 47.790
ρ = 0.1 44.323 44.471 45.086 44.158* 44.677 44.826
ρ = 0.3 45.137* 45.964 46.353 46.667 46.865 46.507
ρ = 0.5 47.854 47.125 46.823 45.691 44.299* 44.796
ρ = 0.7 56.178 54.079 49.925 49.890 47.407 46.938*
ρ = 0.9 47.390 40.164 34.764 34.042 33.386 32.816*

When the effect of K-means is included, Table 4.14 shows that the optimal number of summary statistics is as high as possible. Results from the use of other clustering algorithms were qualitatively the same. (Columns for one or two statistics have been omitted so the asterisks can be seen.) Thus, when clusters are known, SIR does not suggest a convincing optimal number of statistics and, when clusters are unknown, SIR defaults to as many as possible; neither is informative. These phenomena may arise because SIR is such a data dependent procedure and sample size is so small. Indeed, we only have 9 data points since we are doing leave-one-out with n = 10. Consequently, we only have 3 slices with 3 data points each. Using a small amount of data in a relatively complicated, data dependent procedure is likely to be unstable or trivial meaning that no dimension reduction via feature selection is possible. Thus, the performance of SIR reflects the small sample size differently when clustering is used or not.

Table 4.14.

Normal with ρ = 0.3; LASSO, SIR; K-means

#SIR 3 4 5 6 7 8

K = 2 46.358 48.121 46.310 45.781 45.726 45.508*
K = 3 46.927 46.651 46.388 46.600 46.398 46.125*
K = 4 46.461 46.448 46.720 46.315 46.195 45.660*
K = 5 46.225 46.686 46.633 46.504 45.882 45.689*
K = 6 46.948 47.134 47.366 46.861 46.998 46.320*

4.1.2 Ridge

Unlike LASSO, RR does not do variable selection. That is, where LASSO has a tendency to shrink the coefficients of some terms to zero, ridge shrinks coefficients so they approach zero, but rarely get there.

In this subsection we obtain the GCV errors of ridge regression for the same statistics as in the last subsection. We omit further consideration of the various clustering algorithms since, as seen in the earlier subsections, they do not appear to make a substantial difference. (Indeed, the effect of clustering on SIR in the last subsection was merely to change the way the data summarization breaks down.)

As with LASSO, the GCV scores tend to decrease as ρ increases. However, within rows, the patterns are harder to discern. Often there is a well-defined, if shallow, U-shape as the number of summary statistics per block increases. Sometimes the U-shape is degenerate in the sense that one arm is missing: The minimum occurs at an extreme rather than on the mid-range. This is strongest for moments, percentiles, and PC’s and weaker for PLS and SIR because they depend on the Y’s.

We computed the Ridge fits in the closed form (X^TX + λI_p)^{−1} X^Ty, selecting the decay parameter λ by CV over (0.001, 10,000). The optimal λ can be at the boundary (0.001 or 10,000), which suggests that the true optimum will often be effectively zero (no shrinkage) or effectively infinity (shrink everything to zero). We suggest 0.001 and 10,000 are adequate values since the CV did not decrease dramatically when we tested λ’s outside that range.

Moments

The results from our simulations using moments with ridge are qualitatively the same as with LASSO. Table 4.15 shows that when n = 10 and K = 4 choosing 2 or 3 statistics per cluster is optimal, so [n/K] statistics per cluster is still approximately optimal.

Table 4.15.

Correlated Normal; Ridge; Moments; Known Clusters

#Moments 1 2 3 4 5

ρ = 0 46.322 46.069* 49.884 49.915 51.884
ρ = 0.1 43.573* 46.319 45.646 48.244 50.077
ρ = 0.3 42.212 41.807* 42.788 42.117 44.253
ρ = 0.5 31.484* 35.972 34.507 38.824 36.869
ρ = 0.7 30.161* 42.460 36.557 45.776 44.018
ρ = 0.9 20.935* 35.789 34.991 42.996 42.595
Percentiles

In this case, the results are quite different from what was found with LASSO. In fact, this appears to be a degenerate case because the expression for the ridge fits is (XTX +λIp)−1 XTy meaning that when the X’s and errors are generated using a mean zero normal distribution the medians all approach zero. Thus, Table 4.16 shows that one statistic per cluster works best (except when correlation increases). Three statistics per cluster does only slightly worse; we conjecture that using the 33rd and 67th or the first and third quartile i.e., two statistics parallel to the means of the positive and negative values, would give a smaller GCV error than one or three statistics. If this were borne out, the [n/K] rule would be confirmed.

Table 4.16.

Correlated Normal; Ridge; Percentiles; Known Clusters

#Percentiles 1 3 7 15 31

ρ = 0 48.088* 49.056 50.805 51.789 51.973
ρ = 0.1 44.284* 45.296 46.176 46.090 46.600
ρ = 0.3 39.438* 42.475 43.694 43.839 44.004
ρ = 0.5 28.846* 30.857 31.551 31.993 32.251
ρ = 0.7 29.431 26.393* 27.372 27.574 27.766
ρ = 0.9 19.315 16.025 15.443* 16.305 16.325

When the correlation is high enough, another tradeoff is seen between number of statistics and the GCV error. Highly correlated data may be easier to predict, hence smaller GCV errors, but accumulating it reflects less information in the sense that the sampling distribution for a statistic based on correlated data will not concentrate as fast with increasing n as in the independent case. So, more statistics are better and the [n/K] rule breaks down.

Principal Components

In contrast to using LASSO with PC’s, ridge with PC’s achieves the lowest GCV scores when the number of statistics per cluster is maximum, as seen from Table 4.17. This seems to contradict the [n/K] rule.

Table 4.17.

Correlated Normal; Ridge; PC’s; Known Clusters

#PC 1 2 3 4 5 6 7 8 9 all

ρ = 0 35.559 23.521 6.046† 6.107 6.086 6.039 5.975 5.909 5.878 5.874*
ρ = 0.1 32.861 21.913 6.822† 6.694 6.691 6.609 6.551 6.581 6.510* 6.542
ρ = 0.3 30.002 20.808 8.981† 8.828 8.645 8.428 8.357 8.208 8.203* 8.207
ρ = 0.5 24.744 15.961 5.370† 5.077 4.986 4.856 4.819 4.811 4.807 4.780*
ρ = 0.7 16.179 10.885 5.351† 5.130 5.037 5.033 5.050 5.021 5.016 4.996*
ρ = 0.9 7.914 5.338 3.712† 3.617* 3.654 3.624 3.625 3.638 3.661 3.69

However, as with PLS and LASSO, within each row, the largest decrease in GCV occurs when passing from 2 to 3 statistics per cluster – and this is in agreement with [n/K] = [10/4] = [2.5] = 3. We suggest that with the rounded contours of ridge, which shrinks coefficients but rarely sends them to zero, the largest drop may be more meaningful than the smaller, later reductions which may merely be modeling the noise in the data. (In Table 4.17, † indicates the last big decrease in GCV.)

Partial Least Squares

Ridge regression with PLS behaves similarly to the earlier cases. Table 4.18 shows that two statistics per cluster is optimal when ρ is small and the largest decrease in GCV is achieved when three statistics per cluster is used as indicated by †; the dependence on the Y’s seems to make the GCV scores relatively flat for higher correlations. Finding two or three statistics optimal is consistent with [n/K] = [10/4] = [2.5] = 3. (Here, †† indicates computational problems.)

Table 4.18.

Correlated Normal; Ridge; PLS’s; Known Clusters

#PLS 1 2 3 4 5 6 7

ρ = 0 53.165 50.940* 51.816 51.837 51.840 51.840 51.840
ρ = 0.1 49.882 48.639* 49.167 49.158 49.158 49.159 49.159
ρ = 0.3 54.792 49.402 48.782 48.751 48.750 48.749* 48.750
ρ = 0.5 54.056 48.210 46.874 46.852* 46.856 46.856 46.856
ρ = 0.7 56.630†† 64.540 61.737 61.713 61.711 61.71088* 61.711
ρ = 0.9 36.274†† 99.114 96.324* 96.719 96.743 96.743 96.743
Sliced Inverse Regression

The rows in Table 4.19 represent ρ = 0, 0.1, 0.3, 0.5, 0.7, 0.9. In contrast to LASSO, the results are closer to the [n/K] rule for ρ ≤ 0.3. Indeed, the optimality of one statistic for ρ = 0.1 is only by a very small margin. For higher correlation, ρ ≥ 0.5, as with LASSO, the optimal number of statistics defaults essentially to the maximum. The high data dependence coupled with the low number of data points per slice means that for high ρ we are forced to use all the SIR’s, i.e., there is so much less information that all the data is needed. Despite this, the pattern is not strong because the SIR statistics are so variable. Indeed, neither the rows nor columns exhibit strong U-shapes typical of tradeoffs.

Table 4.19.

Correlated Normal; Ridge; SIR; Known Clusters

#SIR 1 2 3 4 5 6 7 8

ρ = 0 49.004 48.026* 48.542 48.992 49.356 49.368 49.455 49.757
ρ = 0.1 46.148* 46.163 46.623 46.959 47.303 47.778 47.595 47.949
ρ = 0.3 44.686 43.109* 43.985 45.068 44.772 44.947 44.818 44.763
ρ = 0.5 41.692 40.654 40.394 40.468 40.733 40.467 39.841* 39.917
ρ = 0.7 45.479 45.792 46.391 44.183 43.989 43.227 42.946 41.690*
ρ = 0.9 44.750 44.460 44.593 38.037 34.775 33.042 31.162 30.500*

4.1.3 Bridge

Bridge fits are computed using the brdgrun package, see Fu (1998). The shrinkage parameter γ, defining the penalty, is fixed at 1.5, 2.5 and 3 and λ is chosen by GCV. Since we are looking primarily at the penalty, and computing time is very high for bridge, we have omitted consideration of correlated normal data. We only used ρ = 0.

Overall, the results for Bridge do not exhibit a strong pattern because we don’t have enough samples. Sometimes there is the expected U-shape, but it is often flat (except for PC’s) and the minimum occurs at different places. The power γ does not seem to have a consistent effect either, except possibly for moments.

Moments

Table 4.20 shows that as the exponent in the penalty term increases, the optimal number of statistics per cluster seems to decrease from 5 with γ = 1.5 to 2 when γ = 2.5. Under ridge, which corresponds to γ = 2, we saw 3 statistics were optimal when ρ = 0. Thus, the cost of statistics to regress on increases with the exponent, making fewer desirable. This table suggests the [n/K] rule holds only for values of γ close enough to two.

Table 4.20.

Independent Normal; Bridge; Moments; Known Clusters

#Moments 1 2 3 4 5

γ=1.5 42.273 41.472 41.048* 43.003 43.187
γ=2.5 42.883 42.760* 43.302 45.177 48.135
γ=3 44.202* 44.280 45.287 47.348 49.922
Percentiles

Table 4.21 shows that 1 percentile is optimal for all 3 choices of γ. Thus, percentiles behave differently from moments; however the behavior here is similar to what we saw under ridge for percentiles in Table 4.16. Under LASSO, which corresponds to γ = 1, 3 percentiles were optimal in the ρ = 0 case. This makes sense because LASSO puts a smaller penalty on the size of the βj’s possibly making it worthwhile to regress on more statistics. Again, the [n/K] rule does not appear to hold.

Table 4.21.

Independent Normal; Bridge; Percentiles; Known Clusters

#Percentiles 1 3 7 15 31

γ=1.5 37.912* 38.018 38.893 39.717 39.658
γ=2.5 38.027* 39.009 40.194 41.338 43.393
γ=3 38.865* 40.148 41.585 43.210 44.786
Principal Components

Here, Bridge replicates what we found under ridge but differs from the corresponding case under LASSO, which found 3 statistics optimal. Table 4.22 shows stability for the various values of γ. However, although not shown, the values also reveal that the last large decrease occurs from 3 to 4 statistics per cluster, suggesting four statistics, whereas [n/K] = [10/4] = 3, meaning the [n/K] rule is only weakly validated.

Table 4.22.

Independent Normal; Bridge; PC’s; Known Clusters

#PC 6 7 8 9

γ=1.5 17.204 16.110 15.578* 16.018
γ = 2.5 16.758 15.385 14.938* 15.239
γ=3 16.324 14.868 14.421* 14.761
Partial Least Squares

It is tempting to suggest Table 4.23 implies that the number of PLS’s to be used should increase with γ rather than decrease as in Table 4.20. However, the flatness of the values in the rows means the U-shape is weak. This is different from the corresponding cases with ridge and LASSO where stronger patterns were seen. However, it is possible that the values of γ here lead to a higher sensitivity to the greater data dependence of the PLS’s than in the earlier cases. (Note that in Table 4.23 the columns correspond to the number of PLS’s used and the rows correspond to γ = 1.5, 2.5, 3.) Overall, there may be more instability with PLS in the Bridge case even as the size of the penalty leads to more statistics being optimal. That is, the [n/K] rule may apply for moderate values of γ only.

Table 4.23.

Independent Normal; Bridge; PLS’s; Known Clusters

#PLS 1 2 3 4 5 6 7 8

γ=1.5 45.104 44.661* 44.794 45.011 45.013 45.014 45.014 45.014
γ=2.5 46.081 47.1416 45.573* 45.637 45.632 45.633 45.633 45.633
γ=3 46.128 50.595 46.039 45.814 45.793 45.792 45.792* 45.792*
Sliced Inverse Regression

Table 4.24 gives results for SIR that are fairly consistent with the [n/K] rule; this was not the case for ridge or LASSO. It seems that the higher sensitivity to the data of SIR does not matter here – the exact opposite from PLS above. On the other hand, this may be similar to what was observed in Table 4.19.

Table 4.24.

Independent Normal; Bridge; SIR; Known Clusters

#SIR 1 2 3 4 5

γ=1.5 40.711 39.857* 41.101 41.534 41.275
γ=2.5 40.893 39.736* 41.433 42.209 42.346
γ=3 41.034 39.773* 41.510 42.324 42.569

4.2 Independent Non-normal Data

The results for the cases we tried using independent non-normal data were qualitatively the same as for independent normal data. To see an example of this, consider some different design matrices. Table 4.25 shows results using LASSO on principal components, with X drawn from the distributions listed.

Table 4.25.

Independent non-normal design matrix; LASSO; PC’s; Known Clusters

#PC 1 2 3 4 5 all

Normal(0, 1) 43.661 42.034 40.362* 41.114 41.485 43.598
Double exp(1) 45.197 42.595 41.638* 42.578 43.021 44.620
Uniform(−5, 5) 44.239 39.591 38.671* 39.232 40.318 44.739
Exp(1) 41.745 39.509 37.717* 38.339 39.515 41.065

The results for these independent non-normal cases all have a U-shape in each row, and all the minima occur near where we expect i.e., at [n/K] = [10/4] =[2.5] = 3, independent of the distribution. This suggests that the independence of the data is more important than its distribution. This makes sense because often dependence affects the informativity of data more than the shape of the distribution does.

4.3 Serially Dependent Normal Data

Although somewhat atypical, we investigate how the optimality criteria and statistic selection perform on ARMA(a, b) data. Our motivation is that ARMA data is one proxy for real data whose structure and properties cannot be safely assumed to be of any form. We chose values of a and b to be small, 0, 1 and 2, with n = 10. We present results for the LASSO criterion with PC’s because it was the easiest to compute. However, other statistics that we tried gave results not too different from before. Results for ridge, too, were similar. We did not investigate Bridge because the computing demands were too high.

For AR(1) data, Table 4.26 was representative for LASSO, AR data, statistics that do not depend on Y (i.e., moments, percentiles and PC’s), and various clustering strategies. It is seen that the optimal number of statistics is three, [n/K] = [10/4] = 3.

Table 4.26.

AR(1), LASSO, PC’s, Known Clusters

#PC 1 2 3 4 5

ρ=0.1 43.229 41.839 39.414* 42.947 41.754
ρ=0.3 43.767 40.952 37.339* 39.781 41.371
ρ=0.5 40.194 39.385 37.971* 38.509 39.985
ρ=0.7 42.341 39.582 35.972* 37.673 38.532
ρ=0.9 39.593 36.738 35.446* 36.180 36.976

For data with MA terms, the variability increases so that for the settings we considered it is difficult to recognize regularities in behavior, as indicated by Table 4.27. Again, [n/K] = 3 but, at best, three is only weakly preferred.

Table 4.27.

MA(1), LASSO, PC’s, Known Clusters

#PC 1 2 3 4 5 all

ρ = 0.1 42.405 40.986 39.489 40.102 39.324 38.014*
ρ = 0.3 47.006 44.731 43.901* 45.731 46.553 48.678
ρ = 0.5 47.004 40.238* 42.018 42.634 44.070 49.080
ρ = 0.7 41.904 40.476 39.236 39.173 38.991* 44.209
ρ = 0.9 41.820 40.343 39.729* 40.610 41.024 42.244

4.4 Dependent Non-normal Data

As a test of our methods and conclusions we generated dependent non-normal data by transformations of serially correlated normal data. In general, apart from random variation, [n/K] again identifies the approximately optimal number of statistics to use. Specifically, we used three transformations on the randomly generated X data so that the ARMA properties were disrupted, namely y = arctan(x), y = x sin(x), and y = log(x − min(x)). For contrast with the last subsection, Table 4.28 presents results for moments rather than for PC's. It is seen that three or, in one case, four statistics per cluster is optimal, i.e., [n/K] or slightly higher.

Table 4.28.

Transform AR(1) by y= arctan(x); LASSO; Moments; Known Clusters

#Moment 1 2 3 4 5

ρ=0.1 44.106 40.841 39.243* 40.034 40.137
ρ=0.3 43.184 39.469 37.968 37.925* 38.565
ρ=0.5 46.770 41.611 39.681* 41.174 42.379
ρ=0.7 46.520 43.252 42.266* 42.684 44.748
ρ=0.9 43.462 39.313 37.677* 38.489 39.647

The results for transformed, non-normal correlated data are consistent across all correlation structures and transformations we tested. That is, the optimal number of statistics per cluster was [n/K] or a little larger, apart from cases of extreme flatness in the rows. Sometimes the identification was through an actual minimum; sometimes it was through the location of the last large decrease, the latter being more typical of ridge.
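
For concreteness, here is a small sketch of the transformations above and of summarizing a block by its first few sample moments. The epsilon shift in the log transform, the block_moments helper and the known_blocks index list (standing for the column indices of each cluster) are our own illustrative choices, not the authors' code.

```python
import numpy as np
from scipy import stats

transforms = {
    "arctan":   np.arctan,
    "x sin(x)": lambda x: x * np.sin(x),
    "log":      lambda x: np.log(x - x.min() + 1e-6),  # small shift keeps the log finite
}

def block_moments(block, s):
    """First s sample moments of each observation's values within one block (n x s)."""
    feats = [block.mean(axis=1), block.std(axis=1),
             stats.skew(block, axis=1), stats.kurtosis(block, axis=1)]
    return np.column_stack(feats[:s])

# e.g.: X_t = transforms["arctan"](X)
#       Z = np.hstack([block_moments(X_t[:, idx], 3) for idx in known_blocks])
```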

We did not examine other optimality criteria (e.g., Bridge), clustering algorithms, or Y-dependent statistics in this setting. However, based on computed results in parallel cases, some included here and many not, we expect that the conclusions would not differ substantially from those for the earlier cases presented here.

5. TWO LIMITATIONS

Here we note two limitations on the main point so far. The first is that a determination of the optimal number of statistics more exact than [n/K] seems hard to justify. Indeed, in the first subsection we present three computations in which the number of statistics per block is sensitive to the block sizes, but less so than might be expected. Second, we address the minimality we have been identifying: even though the standard deviations of the GCV scores are high, the regularities are hard to explain by chance.

5.1 Block Size and Number of Statistics

It is intuitive that summarizing more variables should require more statistics than summarizing fewer variables. However, we find that while one may want an extra statistic or two for relatively large blocks of variables, the benefits may be small.

In Table 5.1 we fixed the total number of statistics at 12 and used 400 explanatory variables distributed over 4 classes for n = 10 samples. Since we are using LASSO and PC's with known clusters and correlated normal data, our results are directly comparable to Table 4.9, where three statistics per cluster were found optimal. We used N = 2000 replications but, even so, the results are suggestive rather than strong.

Table 5.1.

Principal Components; unequal classes (325, 25, 25, 25)

#PC (3, 3, 3, 3) (6, 2, 2, 2) (9, 1, 1, 1) all

ρ=0 N = 2000 39.958 39.852* 43.219 43.452
ρ=0.1 N = 2000 39.722 39.391* 42.823 43.778
ρ=0.3 N = 2000 35.267* 35.755 39.305 39.060
ρ=0.5 N = 2000 30.013* 30.362 33.675 32.383
ρ=0.7 N = 2000 22.489* 22.667 25.405 23.677
ρ=0.9 N = 2000 12.713 12.977 15.220 12.196*

In Table 5.1, one class is much larger than the other three, which are equal. Although 325/25 = 13, the choice of (9, 1, 1, 1) as the numbers of statistics drawn from the blocks is not seen to be optimal. In fact, (6, 2, 2, 2) or (3, 3, 3, 3) is optimal in all cases except extreme dependence, ρ = 0.9, where both perform only slightly worse than using all the data directly.

Similar results are found if one class is unusually small compared to the others, with sizes, say, (125, 125, 125, 25) giving 125/25 = 5, or if there are two large and two small classes, say (175, 175, 25, 25) with 175/25 = 7. That is, using a slightly larger number of statistics for a larger class is better, but only slightly so. More data might improve the discrimination over numbers of statistics per cluster but, in practice, data sets with p/n ≥ 400/10 = 40 are common.
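
The comparison in Table 5.1 can be sketched as follows; reduced_design is a hypothetical helper that spends a fixed budget of principal components across blocks of unequal size, and the allocation tuples are the ones compared above.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduced_design(X, block_sizes, allocation):
    """Stack the leading PC scores of each block, e.g. block_sizes=(325, 25, 25, 25)."""
    feats, start = [], 0
    for size, s in zip(block_sizes, allocation):
        block = X[:, start:start + size]
        feats.append(PCA(n_components=s).fit_transform(block))
        start += size
    return np.hstack(feats)

# Compare, say by the GCV of a LASSO fit on each reduced design:
# for a in [(3, 3, 3, 3), (6, 2, 2, 2), (9, 1, 1, 1)]:
#     Z = reduced_design(X, (325, 25, 25, 25), a)
```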

5.2 Standard Deviations of GCV Scores

To conclude our presentation of results, Table 5.4 reproduces Table 4.9 but with the average standard deviation, SD, of each entry shown in parentheses beside it. Clearly, the SD's are large, typically over half the means.

Table 5.4.

Principal Components

#PC 1 2 3 4 5
ρ = 0 45.095 (23.281) 41.994 (24.348) 41.075* (23.127) 41.955 (23.189) 42.868 (23.420)
ρ = 0.1 43.515 (23.177) 40.393 (23.475) 39.082* (22.897) 40.344 (23.518) 41.497 (23.552)
ρ = 0.3 38.203 (20.228) 38.785 (23.180) 35.925* (21.261) 37.590 (21.831) 39.278 (22.353)
ρ = 0.5 32.728 (18.779) 34.967 (22.973) 32.595* (20.692) 34.962 (21.859) 37.479 (22.774)
ρ = 0.7 24.966 (15.041) 31.705 (23.246) 27.689* (19.098) 30.678 (20.296) 33.308 (22.093)
ρ = 0.9 12.157 (8.224) 24.884 (25.333) 18.424* (14.792) 23.004 (18.682) 26.008 (21.090)

The large SD’s are the result of n being small. So, the reason that our entry-to-entry comparisons within a row of the tables in Section 4 are meaningful is that the findings are consistent. That is, although the variability in the individual entries is very large, the validity of the conclusions rests on the fact that roughly the same number of statistics per cluster, [n/K], is seen to be optimal over a very wide range of scenarios. If one were to set H0: ‘[n/K] is far from the optimal number of statistics to choose’, any reasonable hypothesis test (frequentist or Bayes) would reject it.
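
As a sketch of how entries like those in Table 5.4 could be tabulated, the following averages the GCV score over replications and reports its standard deviation; simulate_once stands in for one pass of the pipeline sketched earlier and is an assumed interface rather than something taken from the paper.

```python
import numpy as np

def tabulate(simulate_once, n_reps=2000, s_grid=(1, 2, 3, 4, 5)):
    """Mean and SD of the GCV score over n_reps replications, one column per s."""
    scores = np.array([[simulate_once(s) for s in s_grid] for _ in range(n_reps)])
    return scores.mean(axis=0), scores.std(axis=0)
```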

6. DISCUSSION

In a large p small n context, we have investigated the roles of clustering, optimality criteria, choice of statistics and data type. As a generality, we found that for each case there was an optimal number of statistics to choose. This was seen by evaluating GCV errors for a range of different numbers of statistics per block of variables and, in some cases, for different allocations of statistics to blocks. The GCV errors often traced out a U-shaped curve when plotted as a function of the number of statistics per block. We found that the penalty term was the most important choice, while the clustering algorithm made little difference. The data type mattered somewhat, mostly through correlation or other forms of dependence, and the choice of statistics mattered somewhat, especially whether or not the statistics depended on the response.

For Ridge Regression and the LASSO, using moments, percentiles, or PC’s, we found that choosing [n/K] statistics per block, or maybe one more, generally worked well. That is, the benefits from using different numbers of statistics (still totalling n or a little more) for the clusters seemed to be small. The same was usually true for PLS’s and SIR’s, but often little or no discrimination was possible, leading to degenerate results. For Bridge regression, the results were not clear at all; we suggest that a simple heuristic will be hard to obtain because the penalty exponents are often large, e.g., γ ≥ 2.5. Indeed, higher values of γ usually led to a smaller optimal number of statistics. This is intuitive and broadly consistent with, say, the representer theorem, which finds at most n terms necessary when there are n data points.

Thus, aside from Bridge regression, our results suggest that moments, percentiles, and PC’s were generally equally good as summary statistics, while PLS’s and SIR’s were often good but exhibited more variability. Indeed, our results strongly suggest that in p ≫ n settings, using feature selection gives smaller GCV errors than using all the data ‘as is’. In the Bridge case, the [n/K] rule sometimes held, but the pattern was not clear. Otherwise put, generic feature selection followed by a shrinkage method typically gave better predictive performance than using all the data.

Let s denote the number of statistics per block. The U-shaped curves are then the result of combining a bias curve with a variance curve. As s increases from 1, the total number of statistics increases, so bias decreases while variance may increase; if we start with a high value of s, we may have low bias but excessive variance. The optimal s, indicated by asterisks in our tables, represents an optimal tradeoff between variance and bias in terms of the number of statistics chosen from a class. Unsurprisingly, the optimal variance-bias tradeoff is often achieved when the number of statistics is related to the block size; however, the improvement over using the same number of statistics per block was small in the cases we examined.
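
Numerically, the tradeoff can be read off by sweeping s and locating the minimum of the replication-averaged GCV curve; gcv_for is an assumed wrapper around the sketches above, not a function from the paper.

```python
import numpy as np

def optimal_s(gcv_for, s_max=5):
    """Trace the U-shaped curve and return (argmin over s, the curve itself)."""
    curve = np.array([gcv_for(s) for s in range(1, s_max + 1)])
    return int(np.argmin(curve)) + 1, curve
```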

We commented in Section 5.2 that n was too small to make discrimination over sets of statistics reliable, even though the [n/K] rule was supported. This opens the question of how large n should be relative to p for good predictive properties. First, over many simulations, choosing [n/K] statistics per cluster worked reasonably well when p/n = 40 and better when p/n was smaller, where p is the number of variables. The usual heuristic is that one wants 10 data points per parameter for estimation; however, here the optimal number of parameters is around K[n/K] ≈ n, given that one has already clustered the data, chosen a class of summary statistics and intends to use a shrinkage criterion. Loosely, the question of how large n must be relative to p for good performance comes down to how small the GCV and its standard error are. Thus, a further simulation study could increase n slowly for fixed p until the GCV and its standard error were small enough that a single use on one data set would be likely to give a satisfactorily small GCV, assuming the model class was large enough that bias was not excessive. While this would be informative, in many data sets n is fixed and cannot be increased, and the goal is to reduce p. Here we have shown that, with shrinkage methods, reducing p to roughly n statistics actually gives the best performance in many cases.

We comment that we have not done a complete bias-variance decomposition for the GCV, since the effect of clustering has been neglected. Obviously, summarizing data by statistics computed on clusters may introduce further bias as well as variability. However, one may regard the bias and variance from clustering as having been implicitly represented in the GCV scores. Alternatively, one may regard the apparent ‘bias’ of using a particular set of summary statistics as having both a pure bias component and a variance component conditional on that bias. That is, the pure bias part of the GCV is the minimal bias that could be achieved on average by a set of statistics of the specified form, while the variance conditional on the bias may be regarded as the variance, implicit in the use of clustering, of finding those statistics. In short, the two stages, data summarization and use of the summary statistics in a shrinkage criterion, are conceptually disjoint even though here they have been combined in one overall GCV score.

Another feature of our work here is the complete neglect of the interpretability of the data summarization. It is our philosophical stance that asking for interpretability in complex short fat data problems restricts the search for good predictors so much that interpretability per se is more harm than help: it prematurely narrows the class of models to the point that poor prediction is nearly assured. Indeed, optimal predictive solutions such as model averaging and kernel methods routinely give ‘models’ that are uninterpretable. The superiority of such methods over interpretable ones shows that asking for interpretability is often predictively harmful.

Overall, the take-home lesson seems to be that, if there are n data points and no sufficient statistic (even in a heuristic sense) is available, then there are n pieces of information that can be regarded as n values of a statistic that may come from any one of a large number of classes. While one might argue that some measurements are more informative than others, on average, no matter how the explanatory variables are summarized, it is difficult for many more, or many fewer, than n statistics to be genuinely useful in the sense of giving better predictive results than just using all the data. Moreover, as long as the statistics do not depend on Y and the penalty is not too different from absolute or squared error, it may not matter much which statistics are used for predictive purposes. That is, for predictive purposes, detailed modeling may not be much more useful than generic feature selection. One can argue that generic feature extraction may give unstable solutions; however, interpretable feature selection can also be unstable, and both are efforts to deal with model uncertainty. In cases where interpretable feature selection is infeasible or unreliable, it is probably better to use generic feature selection and obtain uncertain answers (and admit the uncertainty) than not to get answers at all.

Acknowledgments

Both Clarke and Chu gratefully acknowledge support from an NSERC operating grant Clarke used to hold in Canada. In addition, Chu acknowledges support from NIH grant 1K99HL114651.

References

  1. Austin E, Pan W, Shen X. Penalized regression and risk prediction in genome-wide association studies. Stat Anal Data Mining. 2013;6(4):315–328. doi: 10.1002/sam.11183.
  2. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth; Belmont, CA: 1984.
  3. Clarke B, Severinski C. Subordinators, adaptive shrinkage and a prequential comparison of three sparsity methods. Invited comment on Shrink globally, act locally by Polson and Scott. In: Bernardo JM, et al., editors. Proceedings of the IX Valencia Conference on Bayesian Statistics. Oxford Univ. Press; 2011. pp. 523–528.
  4. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–499.
  5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  6. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35(2):109–135.
  7. Fu W. Penalized regressions: The bridge versus the LASSO. J Comp Graph Stat. 1998;7(3):397–416.
  8. Hawkins D, Basak S, Shi X. QSAR with few compounds and many features. J Chem Inf Comp Sci. 2001;41:663–670. doi: 10.1021/ci0001177.
  9. Hastie T, Tibshirani R, Friedman J. Elements of Statistical Learning. Springer-Verlag; New York: 2001.
  10. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  11. McKay C. Automatic genre classification of MIDI recordings. MA Thesis. Department of Theory, Faculty of Music, McGill University; 2004.
  12. Li K-C. Sliced inverse regression for dimension reduction. J Amer Statist Assoc. 1991;86(414):316–327.
  13. Stenning D, Lee T, van Dyk D, Kashyap V, Sandell J, Young C. Morphological feature extraction for statistical learning with applications to solar image data. Stat Anal Data Mining. 2013;6:329–345.
  14. Struyf A, Hubert M, Rousseeuw P. Clustering in an object oriented environment. J Stat Software. 1996;1.
  15. Tibshirani R. Regression shrinkage and selection via the LASSO. J Roy Statist Soc, Ser B. 1996;58(1):267–288.
  16. Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Statist Soc, Ser B. 2005;67:301–320.
