Cancer Informatics. 2016 Apr 12;14(Suppl 5):109–121. doi: 10.4137/CIN.S30804

High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries

Amin Zollanvari
PMCID: PMC4830639  PMID: 27081307

Abstract

High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.

Keywords: high-dimensional analysis, shrinkage, ridge estimation, sparsity, curse of dimensionality, G-analysis, double asymptotics, random matrix theory, Kolmogorov asymptotics

Introduction

Classical statistical techniques have been fashioned for situations in which the number of data points is much larger than the number of variables.1 This is in large part due to the classical notion of statistical consistency, which guarantees the performance of a statistical technique in situations where the number of measurements unboundedly increases (n → ∞) for a fixed dimensionality p of observations.2–5

However, even though many modern datasets are characterized by a number of variables far exceeding the sample size, many practitioners still utilize classical learning methods to extract information out of such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Nonetheless, one may argue that the so-called curse of dimensionality phenomenon in statistical learning can serve as a justification for utilizing classical techniques, and that no need exists to incorporate many variables in a model. This phenomenon states that, when one attempts to improve performance by increasing the number of variables for a given number of data points, the performance improves up to a certain point, after which it starts deteriorating.6 This phenomenon seems to justify reducing the number of variables (dimensionality reduction) to a small number, perhaps much smaller than the sample size. In this reduced feature space, we are then “safe” to apply classical schemes because the sample size is now potentially much larger than the number of variables. The effect of the curse of dimensionality and its implications will be described in more detail later.

Using classification as an archetype, dimensionality reduction generally follows a common methodology: 1) use a classification rule, including a feature selection step, to design a classifier, and 2) use an error estimation rule to estimate the error of the designed classifier. The performance of many widely used classifiers is guaranteed in situations where n >> p. They are designed to converge (in probability) to the Bayes classifier (optimal classifier) as n → ∞ with p fixed. Likewise, the performance of many error estimation rules lives up to similar asymptotic premises. Therefore, the feature selection strategy serves as an interface that scales the complexity of the data down to one that can be studied through classical methods. Fortunately, two mathematical–statistical machineries exist that are specifically designed to serve in high-dimensional settings: 1) shrinkage, and 2) Girko’s G-analysis. These frameworks can serve as potential machineries for developing mathematical models suitable for analysis in situations wherein the dimensionality of observations is comparable to, or potentially larger than, the sample size. While shrinkage estimation is grounded on the sparsity principle, G-analysis, in its simplest form, is based on the double asymptotics n → ∞, p → ∞, p/n → c, 0 < c < ∞, as well as on some conditions on the existence of moments of the random variables involved.7 However, G-analysis makes no assumption on the sparsity of the parameters to be estimated. Note that having the last two conditions, that is, p → ∞ and p/n → c, implies the first, n → ∞.

The sparsity principle imposes an assumption on the nature of the probabilistic structure of observations; it assumes that only a small number of predictors contribute to the response.8 In other words, while the curse of dimensionality restricts the number of variables feeding a model (through a subset selection strategy), the sparsity principle does not restrict the number of variables. Instead, a model is potentially trained on all variables, and it performs well if the parameter space is sparse. While the effect of parameter sparsity on the behavior of shrinkage estimation has been studied to some extent, the effect of the curse of dimensionality on G-analysis has been generally left unexplored. Understanding the effect of the peaking phenomenon, or the curse of dimensionality, is important because, if it can be avoided, then we can see G-analysis as a potential machinery to follow in situations where parameter sparsity is not well justified. It might be argued that there is nothing wrong with the classical methodologies (which work well when n → ∞ and p is fixed) because (in the context of classification) ultimately it is the error of the designed classifier that matters, and, if the classical methodology does not work, then the price paid will be poor performance. This is a legitimate argument as long as the cost is negligible. Unfortunately, this is not always the case, as the next paragraph illustrates.

Let us consider genomic datasets as a prototypical example of modern, high-dimensional, small-sample data. In 2005, Michiels et al.9 challenged the validity and repeatability of several microarray-based research studies. They reported that a reanalysis of data from the seven largest published microarray-based studies, which attempted to predict the prognosis of cancer patients, revealed that five of those seven did not really classify patients better than a random assignment. There were other studies aimed at reproducing the published results of such prognosis studies, but they too generally failed.10–12 The consequence of the failures in many genomic research studies has been brought into sharp focus by Dr. J. Woodcock, Director of the Center for Drug Evaluation and Research (CDER) at the U.S. Food and Drug Administration. She stated, “We may be out of the general skepticism phase, but we are in the long slog phase”.13 In listing barriers to “coming up with the right diagnostics,” she estimated that 75% of published biomarker applications are not replicable: “This poses a huge challenge for industry in biomarker identification and diagnostics development”.13 From a technical point of view, the irreproducibility crisis that we are facing today14,15 can be attributed in large part to the nature or misuse of our classical statistical techniques.16,17 This state of affairs could have been prevented years ago had we taken more seriously the following lines noted by Ronald A. Fisher, one of the first biologist–geneticist–statisticians. In 1925, he said18,19:

Little experience is sufficient to show that the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data.

Sparsity and Shrinkage Estimation

Let x be a realization of a p-dimensional random vector that is normally distributed with unknown mean θ and identity covariance matrix, ie, x ~ N(θ, Ip). Our goal is to estimate the vector θ. One may consider x as being the sample mean constructed from n observations, in which case the covariance matrix is Ip/n; however, as in many other studies,20–22 we consider the more convenient canonical form, obtained after proper scale transformations, which results in the identity covariance matrix Ip.

In a seminal work,20 Stein astonished the statistical community by showing that, if we take the sum of squared errors as the loss function and p ≥ 3, then there exists a class of estimators of θ that uniformly has a smaller risk than that of the usual maximum likelihood estimator. In other words, the “usual estimator” δML(x) = x is inadmissible for p ≥ 3. Later, James and Stein gave an explicit form of an estimator that uniformly dominates δML(x),23 ie, uniformly has a smaller risk than δML(x); for an estimator δ, the risk R(θ, δ) is the expectation of the loss function over the sample space, given by

R(θ, δ) = E(‖θ − δ‖²),  (1)

where ‖y‖² = yTy is the squared ℓ2 norm.

This achievement led to a large body of work on many aspects of the problem, proposing estimators for situations in which the covariance matrix of the normal population is known or unknown, extending this estimation procedure to non-normal populations, the Bayesian justification of the James–Stein estimator, and various attempts to improve upon the James–Stein estimator. Here, it is not possible to summarize the large amount of work done in this direction. We describe some of the key developments in the field, but for more information the readers are referred to other works.24–28 The James–Stein estimator is given by23

δJS = (1 − (p − 2)/‖x‖²) x.  (2)

It turns out that δJS is itself inadmissible and has a peculiar behavior for small values of ‖x‖. In particular, the shrinkage factor (1 − (p − 2)/‖x‖²) becomes negative whenever ‖x‖² < p − 2, in which case the sign of each component of the vector x is flipped. Baranchik proposed an improved estimator that dominates δJS.21,29,30 This estimator is obtained by taking the “positive part” of the James–Stein estimator as

δJS+ = (1 − (p − 2)/‖x‖²)+ x,  (3)

where, for a real number k, k+ = max(0, k). The estimator δJS+ also has an undesirable property: when ‖x‖² ≤ p − 2, every element of θ is estimated as zero. From the well-known results on the smoothness of admissible estimators,31 it follows that δJS+ is itself inadmissible. Baranchik30 widened the class of Stein’s estimators by showing that any estimator of the form

δB = (1 − ϕ(‖x‖²)/‖x‖²) x,  (4)

where ϕ(·) is (i) monotone nondecreasing and (ii) satisfies 0 ≤ ϕ(·) ≤ 2(p − 2), dominates the usual estimator. Note that δJS is a special case of δB with ϕ(·) equal to any constant satisfying (ii). Several other estimators have been proposed,32,33 but Kubokawa’s estimator34 is one that dominates δJS and is admissible. A compelling intuition behind the James–Stein estimator and all the previous extensions is as follows: Suppose an estimate of θ is desired and, rather than using the usual estimator x, we use an estimator δα ≜ αx, where α is determined by minimizing the risk function. The risk of this estimator is given by

R(θ, δα) = E(‖θ − δα‖²) = (1 − α)²‖θ‖² + pα².  (5)

One can verify that the optimal choice of α that minimizes R(θ,δα) is given by

αopt = 1 − p/(‖θ‖² + p),  (6)

which results in the minimum risk (the Bayes risk), given by

R(θ, δαopt) = p‖θ‖²/(‖θ‖² + p).  (7)

However, note that with the choice of α in (6), αx is no longer an estimator because it depends on the unknown ‖θ‖. Since E[‖x‖²] = ‖θ‖² + p, one can estimate ‖θ‖² + p in the denominator of (6) by ‖x‖² and obtain the estimator (1 − p/‖x‖²)x, which is in the form of the James–Stein estimator. An even better estimator is obtained if we replace p by p − 2, which results in δJS.23,24 Since the risk of the “usual estimator” δML(x) is R(θ, δML) = p, we have

R(θ, δML) − R(θ, δαopt) = p²/(p + ‖θ‖²).  (8)

It is evident that, when ‖θ‖ is large (compared to p), R(θ, δαopt) ≈ R(θ, δML) and we do not gain much by shrinking. However, for small values of ‖θ‖, R(θ, δαopt) < ‖θ‖² << p = R(θ, δML). This shows that, for a model (here, the mean of a multivariate Gaussian distribution) that is sparse (contains many zero elements), the larger the dimension, the more we gain by shrinking.
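This gain from shrinkage is easy to observe numerically. The following sketch (our own NumPy illustration, not from the article; the dimension p = 50, the sparsity pattern of θ, and the trial count are arbitrary choices) compares Monte Carlo estimates of the risks of δML, the James–Stein estimator (2), and its positive part (3) for a sparse mean:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 50, 20000

# A sparse mean vector: only a few nonzero components.
theta = np.zeros(p)
theta[:5] = 1.0

# Draw x ~ N(theta, I_p) many times; each row is one trial.
x = rng.standard_normal((trials, p)) + theta

# Usual (maximum-likelihood) estimator: delta_ML(x) = x.
risk_ml = np.mean(np.sum((x - theta) ** 2, axis=1))

# James-Stein estimator, eq. (2): (1 - (p - 2)/||x||^2) x.
norm2 = np.sum(x ** 2, axis=1, keepdims=True)
js = (1 - (p - 2) / norm2) * x
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))

# Positive-part James-Stein, eq. (3): the factor is clipped at zero.
jsp = np.maximum(1 - (p - 2) / norm2, 0.0) * x
risk_jsp = np.mean(np.sum((jsp - theta) ** 2, axis=1))

print(risk_ml, risk_js, risk_jsp)
```

With this configuration, the estimated δML risk is close to p = 50, while both shrinkage estimators achieve a far smaller risk, and the positive-part variant does at least as well as δJS, as the theory above predicts.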

Similar to James–Stein estimation, ridge estimation is a type of shrinkage originally proposed in Ref. 35 and developed further by many researchers. Consider the linear model

y=Xβ+ε, (9)

where y is an n-dimensional observation vector, X is a known n × p matrix, β = [β1, β2, …, βp]T is a p-dimensional parameter vector to be estimated, and ε is an n-dimensional error vector with mean 0 and covariance matrix σ²In. If we assume X is a full (column) rank matrix (p < n), the ordinary least-squares solution to this familiar linear model is given by

β̂ = (XTX)⁻¹XTy.  (10)

However, when p > n, the solution (10) does not exist because XTX becomes degenerate. Even the solution obtained by a generalized inverse of the matrix XTX does not work well. A possible solution was proposed by Hoerl and Kennard,35–37 who replaced the residual sum of squares by its ℓ2-penalized form, given by

L2(β) ≜ ‖y − Xβ‖² + λ‖β‖²,  (11)

where λ > 0 denotes a penalty factor controlling the length of β. Minimizing L2(β) results in the so-called ridge regression estimator, given by

β̂ = (XTX + λIp)⁻¹XTy.  (12)

In this way, the inverse of the possibly ill-conditioned XTX is stabilized by adding the scalar matrix λIp. The value of λ is a meta-parameter of the procedure and, in general, is estimated by a model-selection strategy such as cross-validation.38 A family of estimators closely related to the ridge estimator is obtained by considering an ℓ1 penalty factor in (11). In that case, the goal is to minimize

L1(β) ≜ ‖y − Xβ‖² + λ‖β‖₁.  (13)

Here, λ determines a trade-off between the approximation error ‖y − Xβ‖² and the sparsity of the parameters, measured by ‖β‖₁. Of course, a more natural penalty for such a trade-off would have been the ℓ0 penalty factor (∑_{i=1}^{p} 1{βi ≠ 0}), but replacing ℓ1 with ℓ0 in (13) and trying to minimize the resulting risk function would be an NP-hard problem.39 The “least absolute shrinkage and selection operator” (lasso) technique proposed by Tibshirani40 is used to minimize L1(β). Therein, Tibshirani formulates the following quadratic programming problem:

minimize_β ‖y − Xβ‖²  subject to  ‖β‖₁ ≤ t,  (14)

which is equivalent to minimizing L1(β); to wit, for any λ in L1(β), there exists a t ≥ 0 such that (14) has the same solution as minimizing L1(β). A key element in the popularity of the lasso is its ability to perform estimation and model selection at the same time. For a design matrix X such that XTX = Ip (orthonormal columns), understanding the mechanism of the lasso’s model selection is straightforward. In this case, the lasso estimate of βi, denoted by β̂i, is obtained by the soft shrinkage (soft-thresholding) operator originally proposed by Donoho and Johnstone41; to wit, we have

β̂i = sign(β̂i⁰)(|β̂i⁰| − λ)+,  (15)

where β̂i⁰ is the ordinary least-squares solution and (x)+ = max(x, 0). Evidently, for λ > |β̂i⁰|, β̂i is exactly zero. Therefore, in this case, having λ > |β̂i⁰| for all i outside a subset of k variables is equivalent to selecting those k variables and setting all other parameters to 0. Hence, for a large value of λ, many βi values are set to 0 and the solution becomes sparse. For a general design matrix X, minimizing L1(β) has no closed-form solution. Solving (14) using the algorithm proposed by Tibshirani40 is not efficient for large values of p, but efficient algorithms have since emerged to compute the solution.42–44 There has been a large body of work generalizing the idea of the lasso, eg, the elastic net,45 adaptive lasso,46 grouped lasso,47 fused lasso,48 and graphical lasso.49 The readers are referred to Ref. 50 for a review of the various generalizations of the lasso technique. Before the emergence of the lasso, the idea of using the ℓ1 norm for sparse representation of signals was already in use in the signal processing community41 (also see Appendix I in Ref. 39). In signal processing, the idea was later generalized and formalized in Ref. 51, resulting in the basis pursuit (BP) framework. In signal processing applications, BP is used to decompose a signal y into an optimal superposition of discrete dictionary waveforms (the columns of X), such as a Fourier dictionary, wavelet dictionary, etc. Optimality is based on minimizing the ℓ1 norm of the coefficients β of the representation, the set of feasible representations being an affine subspace of Rp. Formally, BP solves

minimize_β ‖β‖₁  subject to  Xβ = y.  (16)

Therefore, by minimizing the ℓ1 norm of the coefficients, we seek the sparsest possible solution. In the presence of noise, BP is applied by solving a quadratically constrained linear program that trades off a quadratic misfit against the ℓ1 norm of the coefficients51

minimize_β ‖β‖₁  subject to  ‖y − Xβ‖² ≤ ε,  (17)

in which ε > 0 controls the amount of trade-off. As we have seen, James–Stein estimation and ℓ1 shrinkage (lasso- or BP-based methods) are well suited to situations where the underlying model is sparse. Next, we review an independent machinery for high-dimensional analysis in which the assumption of sparsity plays no role.
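For the orthonormal design discussed above, the contrast between the ridge estimator (12), which shrinks every coefficient proportionally, and the lasso soft-thresholding rule (15), which sets small coefficients exactly to zero, can be sketched as follows (our own NumPy toy example; the true coefficients, noise level, and λ = 1 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 8

# Orthonormal design: X^T X = I_p (Q factor of a random matrix).
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # sparse
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Ordinary least squares, eq. (10); here (X^T X)^{-1} = I_p.
beta_ols = X.T @ y

lam = 1.0

# Ridge, eq. (12): with X^T X = I_p every coefficient is shrunk
# proportionally, beta_ols / (1 + lambda) -- nothing becomes zero.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Lasso, eq. (15): soft thresholding zeroes the small coefficients.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(beta_ridge)   # all entries nonzero, shrunk toward 0
print(beta_lasso)   # entries with |beta_ols| <= lambda are exactly 0
```

Under XTX = Ip, the ridge solution is exactly β̂⁰/(1 + λ), so no coefficient is ever set to zero, whereas the soft thresholding in (15) zeroes every coefficient with |β̂i⁰| ≤ λ, performing estimation and variable selection simultaneously.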

Curse of Dimensionality and G-analysis

Generalized consistent estimation (also known as Girko G-analysis) is a technique to construct estimators specific to situations in which the dimension is comparable to the number of samples. In this setting, an estimator is constructed such that it converges to the actual parameter in a “double-asymptotic” sense, to wit, in an asymptotic scenario in which dimension and sample size increase in a proportional manner, eg, n → ∞, p → ∞, and p/n → c > 0. In this framework, the sparsity principle does not play a role. However, if the curse of dimensionality is intrinsic to frequentist statistics, then regardless of the model we use, there will still be a large gap between the complexity that we can capture through our model and the complexity of the phenomenon under study (if, of course, the phenomenon is complex per se). Therefore, in the subsequent discussion, we first examine the curse of dimensionality, its origin, and implications.

Curse of dimensionality

The curse of dimensionality, also known as the “peaking phenomenon” or the “Hughes phenomenon”, is generally considered the principal justification for dimensionality reduction and feature selection.6,52,53 Regarding the peaking phenomenon, McLachlan stated54:

For training samples of finite size, the performance of a given discriminant rule in a frequentist framework does not keep on improving as the number p of feature variables is increased. Rather, its overall unconditional error rate will stop decreasing and start to increase as p is increased beyond a certain threshold, depending on the particular situation.

Jain and Waller stated52:

Thus, even if the cost of taking measurements is negligible, there exists an optimum measurement complexity, which is a function of the number of available training samples and the probability structure of the model.

Chandrasekaran and Jain pointed out:

It is known that, in general, the number of measurements in a pattern classification problem cannot be increased arbitrarily, when the class-conditional densities are not completely known and only a finite number of learning samples are available. Above a certain number of measurements, the performance starts deteriorating instead of improving steadily.

See Ref. 55 or p. 561 in Ref. 56 for more comments about this phenomenon. The first observation of the peaking phenomenon is attributed to the work of Hughes.53 However, the peaking observed by Hughes was shocking to many scientists since it was contrary to previously reported results on the lack of peaking for Bayes (optimal) classifiers. Hughes noted, “If insufficient sample data are available to estimate the pattern probabilities accurately, then a Bayes recognizer is not necessarily optimal”.53,57 Various researchers correctly criticized Hughes’ work by pointing out that the paradoxical peaking phenomenon observed therein was not real and was due to the estimation of the unknown cell probabilities from the data.57–60 In other words, the peaking phenomenon observed by Hughes was essentially within a frequentist framework, not a Bayesian one.

Nevertheless, it is now the general consensus that, in the frequentist framework, the performance of a constructed classifier does not keep improving as more features are added. To be more precise, it is assumed that there is a certain point after which we should not keep adding features because the expected error rate of the classifier starts to increase (see the above quotes as well as Refs. 6 and 56). Commonly, this certain point is referred to as the “optimal number of features”.52,55 Nevertheless, all the aforementioned studies, and even terminologies such as curse of dimensionality or peaking phenomenon, give the impression that we should not learn from a large number of variables when a finite (and perhaps relatively small) number of samples is available.

Here, I shall try to convince you that the curse of dimensionality is not a phenomenon intrinsic to the frequentist framework. Instead, it is an artifact of many contemporary frequentist approaches. However, let us first review a few theoretical works that show the peaking phenomenon in a frequentist setting.

In Ref. 6, the authors studied the peaking phenomenon in the context of discrete classification using a histogram rule. They considered multinomial distributions governing the data and characterized the expected error rate of the histogram rule over both sample space and a uniform prior distribution on the multinomial parameters. However, the complexity of the expression obtained there for the expected error rate did not let them achieve an analytical solution for the optimal dimension or analytical proof for the existence of a dimension at which the expected error rate is minimized.

Another work in this context is that of Jain and Waller,52 who analytically studied the peaking phenomenon in connection with linear discriminant analysis (LDA) and in the context of Gaussian multivariate models. They used Bowker and Sitgreaves’ approximation61,62 of the expected error of LDA to determine an expression for the minimal increase in δp² that justifies adding another feature (ie, avoids peaking), where δp² is the Mahalanobis distance given by

δp² = (θ1 − θ0)T Σ⁻¹ (θ1 − θ0),  (18)

where p denotes the dimensionality. They showed that the minimal increase in δp² (assuming δp² >> 4p/n) required to keep the same expected error rate is given by

δp+1² − δp² = δp²/(n − 3 − p),  (19)

where n is the total number of samples in both classes. They also used this minimal-increase expression to obtain various curves of the expected error rate exhibiting the peaking phenomenon. Nevertheless, their results were specific to LDA. Their observation should not be generalized to other models, or even interpreted as a proof of the omnipresence of the peaking phenomenon in the frequentist framework (at least the way it is commonly presented). Moreover, the asymptotic expansion that Bowker and Sitgreaves used to develop their approximation of the expected error was essentially developed under the classical regime of n → ∞ with p fixed, which performs poorly in a high-dimensional setting.

The salient point is that, even with the existing classifiers that have been developed through the classical statistical framework (n → ∞, p fixed), the peaking phenomenon is not what is commonly perceived. To show this, in the following, we present an example in which even after the so-called optimal number of features has been found, we still keep adding features to the model. From earlier discussion, recall that the optimal number of features is generally considered as the number after which adding more features deteriorates the performance. However, in this example, we observe that after initial deterioration in the performance of the classifier, the performance again starts to improve after adding many features. Furthermore, we observe that even by considering all features in this example, the performance is still better than the performance at the so-called optimal point. Although these observations depend on the complexity of classifiers and the probabilistic structure of the problem, they demonstrate that learning from a large number of variables is plausible.

Consider a set of n = n0 + n1 independent and identically distributed (i.i.d.) training samples in Rp, where x1, x2, …, xn0 come from population ∏0 and xn0+1, xn0+2, …, xn0+n1 come from population ∏1. The population ∏i is assumed to follow a multivariate Gaussian distribution N(θi, Σ), for i = 0, 1, where θi and Σ denote the mean and the covariance matrix, respectively. Consider the following discriminant function63:

W(x̄0, x̄1, x) = (x − (x̄0 + x̄1)/2)T (x̄0 − x̄1),  (20)

where x̄0 = (1/n0) ∑_{i=1}^{n0} xi and x̄1 = (1/n1) ∑_{i=n0+1}^{n0+n1} xi are the sample means of the two classes. The Euclidean-distance classifier (EDC) is given by

ψ(x) = 1 if W(x̄0, x̄1, x) ≤ 0, and ψ(x) = 0 if W(x̄0, x̄1, x) > 0.  (21)

That is, the sign of W(x̄0, x̄1, x) determines the classification of the sample point x. The true error of ψ(x), denoted by εn,p, is defined to be the probability of misclassification:

εn,p = α0ε0 + α1ε1,  (22)

where αi is the prior probability for class i and εi is the error contributed by class i, which is given by

εi = P((−1)^i W(x̄0, x̄1, x) ≤ 0 | x ∈ ∏i, x̄0, x̄1).  (23)

From (23), we have

εi = Φ((−1)^{i+1} (θi − (x̄0 + x̄1)/2)T(x̄0 − x̄1) / √((x̄0 − x̄1)T Σ (x̄0 − x̄1))).  (24)

Assuming n0 = n1 = n (so that n now denotes the per-class sample size), α0 = α1, and Σ = σ²Ip, from the results of Raudys64,65 (also see Refs. 66 and 67) we can approximate the expected error (over the sampling distribution) of EDC by

E[εn,p] ≈ Φ(−δp² / (2√(δp² + 2J))),  (25)

where δp2 is the Mahalanobis distance given by (18) and J = p/n. This formula is also credited to A.N. Kolmogorov (see p. 3 of Ref. 68). Raudys’ approximations of finite sample expectation of true error for LDA and its derivatives (including EDC) are based on the basic form of the double-asymptotic approach (n → ∞, p → ∞,p/n → c). In Ref. 69, the authors compared Raudys’ approximation to several well-known approximations that have been obtained by classical approaches (n → ∞, p fixed) and showed that the double-asymptotic approximations are significantly more accurate than classical approximations in analyzing the expected true error of LDA. In particular, even with n/p < 3, the double-asymptotic expansions yield excellent approximations that are far more accurate than others.69

Let b(p) denote a column vector of size p with identical elements b. In order to examine the peaking phenomenon, we consider two 1700-dimensional Gaussian distributions with θ1 = −θ0, θ0 = [0.2(10)T, 0.05(190)T, 0.03(1200)T, 0(300)T]T, and Σ = I1700, with Ip being the identity matrix of size p. As described in Ref. 55, “best” features are generally added first and less useful features are added later. In our Gaussian model, the discriminative power of a feature either stays the same or decreases as we move to higher dimensions (see θ0); therefore, the first p features are our best features. At each p, we train a classifier on 100 p-dimensional observations per class taken from the two aforementioned Gaussian distributions. We determine the expected error rate of EDC from (25). Figure 1 shows E[εn,p] versus p on a logarithmic scale. The solid curve is obtained from (25). In order to ensure that this curve is not an artifact of using approximation (25), E[εn,p] has also been estimated by Monte Carlo simulation for a few dimensions as follows:

Figure 1. Expected error of the Euclidean-distance classifier versus dimension for n0 = n1 = 100. The solid curve is obtained from the theoretical results. The small circles are the result of simulation experiments for p = 10, 65, 200, 535, and 1,400.

The Monte Carlo simulation protocol.

  • Step I: Fix a pair of p-dimensional Gaussian distributions ∏0 and ∏1 with identity covariance matrices and means given by the first p elements of the vectors θ0 and θ1, respectively. In simulations, we only consider p = 10, 65, 200, 535, and 1,400.

  • Step II: From each distribution, generate a training set of size n = 100.

  • Step III: Using the training sample, construct EDC using (20) and (21).

  • Step IV: Find the true error of the constructed classifier using (22) and (24); this is possible because we have parameters of our model.

  • Step V: Repeat Steps II–IV 500 times and take the average. The result is an estimate of E[εn,p ].
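The protocol above can be sketched in a few lines of NumPy (our own illustration for p = 10, with Σ = I, σ = 1, and α0 = α1 = 1/2; variable names are ours). It implements Steps I–V and compares the resulting Monte Carlo estimate with approximation (25):

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

rng = np.random.default_rng(2)

# Step I: means are the first p elements of theta0 and theta1 = -theta0.
theta0_full = np.concatenate([0.2 * np.ones(10), 0.05 * np.ones(190),
                              0.03 * np.ones(1200), np.zeros(300)])
p, n, reps = 10, 100, 500
t0, t1 = theta0_full[:p], -theta0_full[:p]

errors = []
for _ in range(reps):
    # Step II: a training set of size n = 100 from each class.
    x0 = rng.standard_normal((n, p)) + t0
    x1 = rng.standard_normal((n, p)) + t1
    # Step III: construct EDC from the sample means, eqs. (20)-(21).
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    d, mid = m0 - m1, (m0 + m1) / 2
    s = np.linalg.norm(d)
    # Step IV: true error from (22) and (24), Sigma = I and alpha0 = alpha1.
    e0 = phi(-(t0 - mid) @ d / s)
    e1 = phi((t1 - mid) @ d / s)
    errors.append(0.5 * (e0 + e1))

mc_estimate = np.mean(errors)     # Step V: average over 500 repetitions

# Double-asymptotic approximation (25), with J = p/n.
d2 = np.sum((t0 - t1) ** 2)       # Mahalanobis distance (18) for Sigma = I
J = p / n
approx = phi(-d2 / (2 * sqrt(d2 + 2 * J)))
print(mc_estimate, approx)
```

For p = 10 both numbers come out near 0.276, consistent with the first circle and the solid curve in Figure 1.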

The result of the Monte Carlo simulation for the five dimensions that we have considered is depicted by small circles in Figure 1: they align well with the theoretical results represented by the curve. As we see in this figure, as soon as we start adding more than 10 features to the EDC model, the performance starts deteriorating, but if we keep adding more and more features, at about p = 65, the performance again starts to improve. At p = 200, it has a local minimum and this behavior repeats one more time, resulting in a multi-hump curve. Interestingly, by considering all 1,700 features in the classifier, we obtain a performance better than the first local minimum at p = 10; to wit, E[ε100,1700] = 0.273 < 0.276 =E[ε100,10]. Nevertheless, the best performance happens when p = 1,400 (E[ε100,1400] = 0.254). Note that the EDC model is essentially a variant of the LDA classifier, which, under a Gaussian model, converges to the Bayes classifier as n → ∞ and p fixed. It is natural to expect development of better classifiers from a mechanism such as the G-analysis framework, which is specifically designed for high-dimensional analysis. The conclusion to be drawn from this example is not to reject the peaking phenomenon – we can cite many examples that demonstrate that the peaking phenomenon is observed in the same way that is classically stated. Instead, this example demonstrates the following: 1) the way the curse of dimensionality is generally stated does not reflect what this phenomenon really is and may give a wrong impression to many practitioners, and 2) a compromise between complexity of the learning model and the number of predictors may achieve a better performance in a large-dimensional space than in a small one.

Double asymptotics and G-analysis

An example from random matrix theory

Random matrix theory (RMT) is a type of double-asymptotic analysis focused on the spectral distribution of random matrices. The spectral distribution of random matrices is an important subject in multivariate analysis, as many statistics can be represented in terms of functionals of the spectral distribution of certain matrices.70 Let {x_{i,j}, i, j = 1, 2, …} be a double array of i.i.d. random variables with mean zero and variance 1. Let xj ≜ [x_{1,j}, x_{2,j}, …, x_{p,j}]T and define the data matrix X = [x1, x2, …, xn]. The p × p sample covariance matrix is then defined by

Cn,p = (1/(n − 1)) ∑_{l=1}^{n} (xl − x̄)(xl − x̄)T,  (26)

where x̄ = (1/n) ∑_{l=1}^{n} xl.

The empirical spectral distribution (ESD) of matrix Cn,p is given by

FCn,p(x) = (1/p) ∑_{i=1}^{p} 1{λi(Cn,p) ≤ x},  (27)

where λi(Cn,p), i = 1, 2, …, p, are the eigenvalues of Cn,p and 1{·} is the indicator function. We consider two cases: 1) x_{i,j} has a standard normal distribution; and 2) x_{i,j} is taken from {−1, +1} with equal probability. A fundamental problem in high-dimensional analysis is to study the convergence of the sequence FCn,p(x). The histograms in Figure 2A–C represent the empirical spectral distribution of one realization of the matrix Cn,p for three cases in which p/n = 1/2: (A) p = 1,000 and n = 2,000; (B) p = 100 and n = 200; and (C) p = 20 and n = 40. Under a double-asymptotic regime where p → ∞, n → ∞, p/n → c > 0, the empirical distribution converges almost surely to a nonrandom distribution function Fc with density given by

f_c(x) = (1 − 1/c)^+ δ(x) + (1/(2πcx)) √((x − a)^+ (b − x)^+),  (28)

where (y)^+ ≜ max(y, 0), a = (1 − √c)^2, b = (1 + √c)^2, and δ(x) is the Dirac delta function. This result was obtained by Marčenko and Pastur in 1967 (Ref. 71) and is today referred to as the M–P law. As these histograms show, for relatively large values of p and n (Figs. 2A and 2B), there is substantial agreement between the M–P law and the ESD of one realization of the matrix C_{n,p}.
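The agreement with the M–P law is easy to verify numerically. The sketch below (with the arbitrary choices p = 400 and n = 800, so c = 1/2) computes the ESD of one realization of C_{n,p} for both Gaussian and ±1 entries and checks that the M–P density integrates to one over its support:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 800                      # aspect ratio c = p/n = 1/2
c = p / n
a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2

def mp_density(x):
    """Marcenko-Pastur density on the bulk [a, b]; since c <= 1 here,
    there is no point mass at zero."""
    return np.sqrt(np.maximum((b - x) * (x - a), 0.0)) / (2 * np.pi * c * x)

def sample_cov_eigs(draw):
    """Eigenvalues of the sample covariance of an n x p data matrix."""
    X = draw((n, p))
    Xc = X - X.mean(axis=0)
    return np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))

eig_normal = sample_cov_eigs(rng.standard_normal)              # Gaussian entries
eig_sign = sample_cov_eigs(lambda s: rng.choice([-1.0, 1.0], size=s))

# the M-P density integrates to 1 over [a, b] ...
x = np.linspace(a + 1e-6, b - 1e-6, 20001)
mass = float(((mp_density(x)[:-1] + mp_density(x)[1:]) / 2 * np.diff(x)).sum())
print(round(mass, 3))
# ... and the ESD concentrates on [a, b] for BOTH entry distributions,
# illustrating the distribution-free (universality) property
for eig in (eig_normal, eig_sign):
    print(round(np.mean((eig > a - 0.2) & (eig < b + 0.2)), 3))
```

Plotting histograms of `eig_normal` and `eig_sign` against `mp_density` reproduces the kind of comparison shown in Figure 2.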

Figure 2.


(A)–(C) Comparing the empirical spectral distribution of one realization of the covariance matrix with the limiting spectral distribution: (A) p = 1,000, n = 2,000; (B) p = 100, n = 200; (C) p = 20, n = 40. (D)–(F) Comparing the average empirical spectral distribution of N realizations of the covariance matrix with the limiting spectral distribution: (D) p = 1,000, n = 2,000, and N = 10; (E) p = 100, n = 200, and N = 100; (F) p = 20, n = 40, and N = 10,000. Marčenko and Pastur71 obtained the closed-form solution of the limiting spectral distribution by using the double-asymptotic framework.

Figure 2D–F shows the comparison of the average empirical spectral distribution of N realizations of the covariance matrix C_{n,p} with the limiting spectral distribution, where N is 10, 100, and 10,000 for p = 1,000, 100, and 20, respectively (p/n = 1/2). As we see, there is substantial agreement between the average empirical distribution and the M–P law in all three cases. Convergence of the histograms in Figures 2A and 2D to the same limiting density illustrates an interesting property of the double-asymptotic regime; to wit, in this regime the dependence of the results on the particular realization of C_{n,p} disappears. Furthermore, the result is independent of the distribution of the elements of the double array: for both normal and binary random variables, the limiting spectral distributions are identical.

The convergence of the histograms in Figure 2D–F to the same density shows another interesting property of this operating regime: if we consider the average (expected) behavior of the ESD, the double-asymptotic result, although theoretically valid for p → ∞ and n → ∞, agrees well with the empirical result even when p and n are relatively small (Fig. 2F). In many practical situations we are interested in the average behavior of a statistic, and in this regard double asymptotics can serve as a potential machinery for analyzing and synthesizing statistics. It is illuminating to compare the result of double-asymptotic analysis with classical asymptotic analysis. The only information classical asymptotic analysis provides is that, as n → ∞ with p fixed, the distribution of λ_i(C_{n,p}) converges (almost surely) to a delta function at 1. Clearly, this information is of no use in estimating the empirical spectral distribution in any of the cases considered in this example.

Emergence of double-asymptotic analysis

In the past few decades, double-asymptotic analysis in general, and RMT in particular, have found eminent roles in various disciplines, including nuclear physics, statistical mechanics, signal processing, wireless communications, biology, and economics.70,72 Some scientists, such as Raj Nadakuditi,73 believe that RMT is “somehow buried deep in the heart of nature”. The first account of double-asymptotic analysis can be traced back to the study of the limiting spectral distribution of random matrices of large dimension. This analysis was carried out by Eugene P. Wigner, the Nobel Prize-winning physicist who, in the context of quantum physics and in connection with the energy levels of heavy nuclei, proved that the expected spectral distribution of Wigner matrices of increasing dimension converges to the semicircle law.74–76 In quantum physics, any measurable physical quantity (a dynamical variable) of a system is represented by a self-adjoint operator (commonly referred to as the Hamiltonian) that acts in the state space. The Hamiltonian operator acting in the state space can be given a matrix representation, resulting in matrix mechanics – which Heisenberg used to formulate quantum mechanics in the first place.77,78 Concerning the rise of RMT in physics, Freeman Dyson writes79:

By assuming all states of a very large ensemble to be equally probable, one obtains useful information about the overall behavior of a complex system, when the observation of the state of the system in all its detail is impossible. What is here required is a new kind of statistical mechanics, in which we renounce exact knowledge not of the state of a system but of the nature of the system itself. We picture a complex nucleus as a ‘black box’ in which a large number of particles are interacting according to unknown laws. The problem then is to define in a mathematically precise way an ensemble of systems in which all possible laws of interaction are equally probable.

And concerning the randomness of the Hamiltonian that appears in Schrödinger’s equation, Mehta writes,80

In the case of the nucleus, however, there are two difficulties. First, we do not know the Hamiltonian and, second, even if we did, it would be far too complicated to attempt to solve the corresponding equation. Therefore, from the very beginning we shall be making statistical hypotheses on H [Hamiltonian], compatible with the general symmetry properties. Choosing a complete set of functions as a basis, we represent the Hamiltonian operators H as matrices. The elements of these matrices are random variables whose distributions are restricted only by the general symmetry properties we might impose on the ensemble of operators.

Here Mehta eloquently describes why pioneers such as Wigner and Dyson associated the randomness of the Hamiltonian and, consequently, of its matrix representation, with the complexity of the nucleus (see Ref. 79 for more details). Besides the curse of dimensionality that we discussed earlier, the principle of parsimony is another motivation for the immediate use of dimensionality reduction regardless of the complexity of the phenomenon under study. While the principle of parsimony tells us that a simpler model is preferable to a competing complex model, it does not say that all phenomena are simple. While sparsity of the parameters describing a phenomenon is a tempting idea, not all phenomena are sparse. A heavy nucleus is a perfect example of a complex system. What Wigner did was not to create a parsimonious model to describe such a system; instead, in his groundbreaking work, he increased the complexity of the model to infinity by considering random matrices of infinite dimension.74,75 As described in Ref. 81, the Bayesian philosophy dominated statistics in the nineteenth century, whereas the twentieth century was more a frequentist one. I believe that after the long 250-year-old debate between Bayesians and frequentists, if there is a revolution in the statistical learning community, it lies in shifting the low-dimensional analytical paradigm to a high-dimensional one (see R.A. Fisher's quote in the Introduction).

In the last 60 years, there have been an enormous number of studies in the context of random matrices of increasing dimension. The field has been developed to a large extent in the hands of F. J. Dyson,79,82,83 M. L. Mehta,84–87 L. A. Pastur,71,88,89 V. L. Girko,4,90–92 J. W. Silverstein,93–96 Z. D. Bai,97–100 and Y. Q. Yin.101–104 As estimated in Ref. 70, there have been more than 2,500 publications in the field from 1955 to 2004. Readers are encouraged to consult Refs. 70, 72, and 80 for historical surveys of some of the important results in the field.

A set of independent but closely related work is that of Raudys, Deev, Meshalkin, Serdobolskii, and Fujikoshi on the application of double asymptotics in classification.66,67,105–112 This body of work is formalized as follows: consider a sequence of Gaussian discrimination problems with a sequence of parameters and sample sizes, (μ_{p,0}, μ_{p,1}, Σ_p, n_{p,0}, n_{p,1}), p = 1, 2, …, where the means and the covariance matrix are arbitrary. Let lim_p denote the limit under n_0 → ∞, n_1 → ∞, p → ∞, p/n_0 → J_0 < ∞, p/n_1 → J_1 < ∞. In this setting, a large body of work has been devoted to characterizing the expected true error and estimated error of LDA and its variants; readers are referred to the review by Raudys and Young of articles in this context.65 It is commonly assumed that the Mahalanobis distance, δ_{μ,p} = (μ_{p,0} − μ_{p,1})^T Σ_p^{−1} (μ_{p,0} − μ_{p,1}), is finite and that lim_p δ_{μ,p} = δ̄_μ, where δ̄_μ denotes an arbitrary finite limiting point (see p. 4 of Refs. 67, 68, and 113). This condition ensures the existence of the limits of performance metrics of relevant statistics.67,68 The aforementioned asymptotics, along with the limiting-point conditions, are also referred to as "Kolmogorov asymptotics".68 Raudys's approximation of the expected error rate of EDC presented in Ref. 25 is essentially based on the Kolmogorov asymptotics. As we discussed earlier, approximations of this type are far more accurate than their counterparts obtained from classical asymptotic settings (see Ref. 69 for a comparison). In the last two decades, the ideas of double asymptotics and, in particular, RMT have found eminent applications in engineering disciplines such as wireless communication.
One of the first applications of this type of analysis in wireless communication was to characterize the performance of large multiuser linear receivers114 and to study various properties of code-division multiple-access (CDMA) channels in which the number of users and the length of the spreading codes are large.115,116 These works paved the way for subsequent applications of RMT to various communication channels; see, for instance, Refs. 88 and 117–120, to cite just a few articles.
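Returning to the classification setting, the role of the fixed ratio p/n and of the finite limiting Mahalanobis distance can be illustrated with a small simulation. For tractability it replaces LDA by the nearest-sample-mean rule with a known identity covariance (a simplifying assumption, not the exact setting of Refs. 66 and 67); the conditional error of this rule is available in closed form, so only the training sample means need to be simulated:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF

def expected_error(p, n, delta=2.0, n_rep=200):
    """E[true error] of the nearest-sample-mean rule with known Sigma = I,
    estimated by Monte Carlo over training sets (n points per class).
    The mean difference is scaled so that the Mahalanobis distance stays
    delta for every p, mimicking the finite limiting-distance condition."""
    mu0 = np.zeros(p)
    mu1 = np.full(p, delta / np.sqrt(p))           # ||mu1 - mu0|| = delta
    errs = []
    for _ in range(n_rep):
        # sample means are exactly N(mu_i, I/n); draw them directly
        m0 = mu0 + rng.standard_normal(p) / np.sqrt(n)
        m1 = mu1 + rng.standard_normal(p) / np.sqrt(n)
        w, mid = m0 - m1, (m0 + m1) / 2.0
        nw = np.linalg.norm(w)
        e0 = Phi(-w @ (mu0 - mid) / nw)            # exact conditional errors
        e1 = Phi(w @ (mu1 - mid) / nw)
        errs.append(0.5 * (e0 + e1))
    return float(np.mean(errs))

# along the sequence p/n = J = 1/2 with delta fixed, the expected error
# settles near a limit strictly above the Bayes error Phi(-delta/2) ~ 0.159
for p, n in ((20, 40), (80, 160), (320, 640)):
    print(p, n, round(expected_error(p, n), 3))
```

The printed values stabilize near a common limit above the Bayes error as p and n grow with p/n fixed, which is precisely the kind of behavior the Kolmogorov asymptotics are designed to capture.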

G-analysis

Regarding the general statistical analysis of observation (G-analysis) and its connection to G-estimation and the Kolmogorov asymptotic conditions, V. L. Girko, one of the pioneers in developing this theory, writes:

The general statistical analysis of observations (G-analysis) is a mathematical theory studying some complex system S, such that the number mn of parameters of its mathematical models can increase together with the growth of the number n of observations of the system S. The use of this theory consists in finding, with the help of observations of the system S, mathematical models (G-estimators) that approach the system S in some sense with a given rate under general assumptions on the observations: the existence of the distribution densities of the observed random vectors and matrices are not needed. The existence of several first moments of their components is all that is required; in addition, the numbers mn and n satisfy the G-condition:

lim̄_n f(m_n, n) < ∞,  (29)

where f(x, y) is some positive function increasing along x and decreasing along y. In most cases the function f(x, y) is equal to xy^{−1}. In this case the G-condition is also called the Kolmogorov condition.

The notation lim̄_n f(m_n, n) in the comment above denotes lim sup_{n→∞} f(m_n, n). By taking f(m_n, n) = p/n, the framework boils down to the "usual" double-asymptotic setting in which p → ∞, n → ∞, p/n → c. Using this machinery, over the course of a decade Girko developed about 50 estimators for various purposes. They are named the G_1-estimator (an estimator of the generalized variance), the G_2-estimator (an estimator of the Stieltjes transform of the normalized spectral function), the G_3-estimator (an estimator of the inverse covariance matrix), and so on. Among them, we find an estimator of the solution of the discrete Kolmogorov–Wiener filter (G_9) and an estimator of the Anderson statistic (G_13). A summary of his results is presented in Ref. 7. As pointed out by Girko, an important property of G-analysis is that the results do not depend on the actual distribution of the data. An example was given in the previous section, in which the convergence of the empirical spectral distribution to the M–P law is independent of the distribution of the data (Fig. 2). This provides us with a mathematical machinery not only to calibrate many traditional estimators from the standpoint of double asymptotics (the so-called generalized consistency3) but also to synthesize distribution-free techniques.
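To make generalized consistency concrete, consider a toy example in the spirit of G-estimation (it is not one of Girko's actual G_k-estimators). For Gaussian data, E[S^{−1}] = m/(m − p − 1) Σ^{−1} with m = n − 1 degrees of freedom, so the classical plug-in estimator (1/p) tr(S^{−1}) of (1/p) tr(Σ^{−1}), although consistent for fixed p, drifts toward 1/(1 − c) times the truth under p/n → c; rescaling by (m − p − 1)/m restores consistency in the double-asymptotic regime:

```python
import numpy as np

rng = np.random.default_rng(3)

def trace_inv_estimates(p, n):
    """Plug-in vs corrected estimates of (1/p) tr(Sigma^{-1}) from n
    samples of N(0, I_p); the true value is therefore 1."""
    X = rng.standard_normal((n, p))
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (n - 1)                 # sample covariance
    m = n - 1                               # Wishart degrees of freedom
    plug_in = float(np.trace(np.linalg.inv(S))) / p
    # exact Gaussian fact: E[S^{-1}] = m/(m - p - 1) * Sigma^{-1}, m > p + 1
    corrected = (m - p - 1) / m * plug_in
    return plug_in, corrected

for p, n in ((50, 100), (200, 400), (800, 1600)):
    pi, co = trace_inv_estimates(p, n)
    # under p/n -> c = 1/2 the plug-in drifts toward 1/(1 - c) = 2,
    # while the corrected estimate stays near the true value 1
    print(p, n, round(pi, 3), round(co, 3))
```

The correction factor depends only on p and n, not on the unknown Σ, which is the hallmark of estimators calibrated for the p/n → c regime.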

In recent years, several research groups, predominantly in the signal processing community, have utilized the idea of G-analysis in various settings. For example, Mestre and Lagunas derived a generalized consistent estimator of the optimum loading factor in spatial filtering.2 In Ref. 3, Rubio and Mestre used G-analysis first to evaluate the performance of a global minimum variance portfolio (GMVP) implementation based on shrinkage covariance matrix estimation and weighted sampling; they then used it to characterize the limiting expression of the realized variance and, based on that, obtained a generalized consistent estimator of the out-of-sample portfolio variance.3 In Ref. 121, the authors developed an estimator of the optimal linear filter for both multiantenna array signals and financial asset returns. In Ref. 5, we utilized G-analysis to calibrate a traditional estimator of the true error of regularized LDA. This classical estimator, known as the plug-in estimator, is consistent in the n → ∞, fixed-p regime but performs poorly in small-sample situations. The calibrated estimator can outperform not only the plug-in estimator but also other estimators of the true error, including bootstrap 0.632 and cross-validation, in many situations in terms of bias and root-mean-square (RMS) error. Other applications of G-analysis include estimating the eigenvalues and eigenvectors of the sample covariance matrix122 and estimating the direction of arrival (DoA) in linear sensor arrays.123

Extending G-analysis to Bayesian settings

In Ref. 124, we characterized the moments of a Bayesian minimum mean-square error (MMSE) error estimator, ε̂_B,125,126 of the true error rate of LDA under a Gaussian model. There, we characterized two sets of performance metrics: 1) the first, second, and cross moments of the estimated and actual errors conditioned on a fixed feature–label distribution, and 2) the unconditional moments. This means that we obtained the performance metrics of ε̂_B for either evaluation scheme – the conditional (unconditional) moments can be used when the estimator is evaluated in a frequentist (Bayesian) framework. We set up a series of conditions that we refer to as the Bayesian–Kolmogorov asymptotic conditions. These conditions allow us to characterize the performance metrics of Bayesian MMSE error estimation in an asymptotic sense. They are based on the assumption of increasing n, p, and certainty parameter ν, with arbitrary constant limiting ratios between p and n, and between ν and n. To our knowledge, these conditions permit, for the first time, the application of G-analysis in a Bayesian setting. To analyze the Bayesian MMSE error estimator ε̂_B, we consider a sequence of Gaussian discrimination problems as

(μ_{p,0}, μ_{p,1}, Σ_p, n_{p,0}, n_{p,1}, m_{p,0}, m_{p,1}, ν_{p,0}, ν_{p,1}),  p = 1, 2, …,  (30)

where Σ and μ_i are the covariance matrix (assumed to be known) and the mean of the Gaussian feature–label distributions, respectively, and Σ/ν_i and m_i are the covariance matrix and the mean of the Gaussian prior distribution on μ_i. We assume that the following limits exist for i, j = 0, 1: lim_p m_{p,i}^T Σ_p^{−1} μ_{p,j} = \overline{m_i^T Σ^{−1} μ_j}, lim_p m_{p,i}^T Σ_p^{−1} m_{p,j} = \overline{m_i^T Σ^{−1} m_j}, and lim_p μ_{p,i}^T Σ_p^{−1} μ_{p,j} = \overline{μ_i^T Σ^{−1} μ_j}, where the overlined quantities are the constants to which the limits converge. These are generalizations of the condition lim_p δ_{μ,p} = δ̄_μ stated previously in the Kolmogorov asymptotics context. All of the aforementioned conditions, along with ν_i → ∞ and ν_i/n_i → γ_i < ∞, constitute the Bayesian–Kolmogorov asymptotic conditions. To summarize, the asymptotic regime that we considered there is characterized by the following limit:

lim_{p→∞, n_i→∞, ν_i→∞} with:
  p/n_0 → J_0,  p/n_1 → J_1,  ν_0/n_0 → γ_0,  ν_1/n_1 → γ_1,  γ_i < ∞,  J_i < ∞,
  m_{p,i}^T Σ_p^{−1} μ_{p,j} = O(1),  m_{p,i}^T Σ_p^{−1} μ_{p,j} → \overline{m_i^T Σ^{−1} μ_j},
  m_{p,i}^T Σ_p^{−1} m_{p,j} = O(1),  m_{p,i}^T Σ_p^{−1} m_{p,j} → \overline{m_i^T Σ^{−1} m_j},
  μ_{p,i}^T Σ_p^{−1} μ_{p,j} = O(1),  μ_{p,i}^T Σ_p^{−1} μ_{p,j} → \overline{μ_i^T Σ^{−1} μ_j}.  (31)

This limit is defined for a situation in which there is conditioning on a specific value of the feature–label distribution parameters such as μ_{p,i}. Therefore, in this case μ_{p,i} is not a random variable; for each p, it is a vector of constants. Absent such conditioning, the sequence of discrimination problems and the above limit reduce to

(Σ_p, n_{p,0}, n_{p,1}, m_{p,0}, m_{p,1}, ν_{p,0}, ν_{p,1}),  p = 1, 2, …,  (32)

and

lim_{p→∞, n_i→∞, ν_i→∞} with:
  p/n_0 → J_0,  p/n_1 → J_1,  ν_0/n_0 → γ_0,  ν_1/n_1 → γ_1,  γ_i < ∞,  J_i < ∞,
  m_{p,i}^T Σ_p^{−1} m_{p,j} = O(1),  m_{p,i}^T Σ_p^{−1} m_{p,j} → \overline{m_i^T Σ^{−1} m_j},  (33)

respectively. The accuracy of the closed-form expressions derived in this work is remarkable. We believe this framework can serve as a potential technique to calibrate various Bayesian estimators from a frequentist point of view: to wit, to optimize the performance of a developed Bayesian method for a specific parametric distribution with unknown parameters.

Discussion

In recent years, various statistical learning rules have been put forward for cancer diagnosis, prognosis, and the discrimination of stages of cancer, types of pathology, and durations of survival based on molecular profiles such as gene or protein expression patterns and single-nucleotide polymorphism genotypes. Such a biomarker discovery process in high-throughput genomic and proteomic profiles has presented the statistical learning community with a challenging problem, namely, how to learn from a large number of variables and a relatively small sample size. The properties of high-dimensional data, though, are not well understood.127 A high-dimensional setting is not the place to rely on intuition, nonrigorous propositions, and heuristics. At the same time, the classical notion of statistical consistency falters, because it guarantees the performance of a statistical technique only in situations where the number of measurements unboundedly increases (n → ∞) for a fixed dimensionality of observations, p. In a finite-sample operating regime, this implies that in order to expect an acceptable performance from a statistical technique, we need many more sample points than variables – a scenario opposite to what we currently face in high-throughput biology. Despite many achievements in the field of statistical learning over the last few decades, some of the most elementary problems remain unsolved. In this regard, Serdobolskii stated68:

It is difficult to describe the recent state of affairs in applied multivariate methods as satisfactory. Unimprovable (dominating) statistical procedures are still unknown except for a few specific cases. The simplest problem of estimating the mean vector with minimum quadratic risk is unsolved, even for normal distributions. Commonly used standard linear multivariate procedures based on the inversion of sample covariance matrices can lead to unstable results or provide no solution in dependence of data. Thus nearly all conventional linear methods of multivariate statistics prove to be unreliable or even not applicable to high-dimensional data.

Two mathematical–statistical machineries, discussed herein, show promising results in constructing techniques of high-dimensional data analysis: 1) shrinkage and 2) Girko's G-analysis. This paper presented a brief history of the development, the underlying assumptions, and some of the important results of each machinery. While in the last decade there has been some effort to create statistical software packages for some of the shrinkage methods, the methods developed through G-analysis remain mostly in the literature and unknown to many theoreticians and practitioners. Some effort from the applied statistics and signal processing communities seems worthwhile in order to create ready-to-use software packages for these methods. In addition, the practical implications of the underlying assumptions in G-analysis need further investigation. For example, we assumed n → ∞, p → ∞, p/n → c, 0 < c < ∞, along with some conditions on the existence of moments of the random variables involved. In an asymptotic sense, the results are applicable to any ratio p/n. However, in a finite-sample regime, it would be interesting to study the robustness of the designed methods with respect to this ratio. Other natural directions that deserve further research in this line of work are 1) understanding the true nature of the so-called curse of dimensionality, 2) the connection between the G-analysis conditions and the curse of dimensionality, 3) the possibility of using G-analysis to create lasso-like operators capable of performing model selection, and 4) extending G-analysis to Bayesian statistics.

Footnotes

ACADEMIC EDITOR: J. T. Efird, Editor in Chief

PEER REVIEW: Four peer reviewers contributed to the peer review report. Reviewers’ reports totaled 424 words, excluding any confidential comments to the academic editor.

FUNDING: This work was partially supported by the Nazarbayev University Social Policy Grant. The author confirms that the funder had no influence over the study design, content of the article, or selection of this journal.

COMPETING INTERESTS: Author discloses no potential conflicts of interest.

Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).

Author Contributions

Conceived the concepts: AZ. Wrote the first draft of the manuscript: AZ. Developed the structure and arguments for the paper: AZ. Made critical revisions: AZ. The author reviewed and approved of the final manuscript.

REFERENCES

  • 1.Efron B. Bayesians, frequentists, and scientists. J Am Stat Assoc. 2005;100:1–5. [Google Scholar]
  • 2.Mestre X, Lagunas MA. Finite sample size effect on minimum variance beam-formers: optimum diagonal loading factor for large arrays. IEEE Trans Signal Process. 2006;54:69–82. [Google Scholar]
  • 3.Rubio F, Mestre X, Palomar DP. Performance analysis and optimal selection of large minimum variance portfolios under estimation risk. IEEE J Sel Topics Signal Process. 2012;6:337–50. [Google Scholar]
  • 4.Girko VL. Statistical Analysis of Observations of Increasing Dimension. Dordrecht: Kluwer Academic Publishers; 1995. [Google Scholar]
  • 5.Zollanvari A, Dougherty ER. Generalized consistent error estimator of linear discriminant analysis. IEEE Trans Signal Process. 2015;63:2804–14. [Google Scholar]
  • 6.Chandrasekaran B, Jain AK. Quantization complexity and independent measurements. IEEE Trans Comput. 1974;23:102–6. [Google Scholar]
  • 7.Girko VL. An Introduction to Statistical Analysis of Random Arrays. Utrecht: VSP; 1998. [Google Scholar]
  • 8.Pourahmadi M. High-Dimensional Covariance Estimation. New York City, NY: Wiley; 2013. [Google Scholar]
  • 9.Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005;365:488–92. doi: 10.1016/S0140-6736(05)17866-0. [DOI] [PubMed] [Google Scholar]
  • 10.Hanczar B, Hua J, Sima C, Weinstein J, Bittner M, Dougherty ER. Small-sample precision of ROC-related estimates. Bioinformatics. 2010;26:822–30. doi: 10.1093/bioinformatics/btq037. [DOI] [PubMed] [Google Scholar]
  • 11.Michiels S, Koscielny S, Hill C. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 2007;99:147–57. doi: 10.1093/jnci/djk018. [DOI] [PubMed] [Google Scholar]
  • 12.Robinson EB, Howrigan D, Yang J, et al. Response to predicting the diagnosis of autism spectrum disorder using gene pathway analysis. Mol Psychiatry. 2014;19:859–61. doi: 10.1038/mp.2013.125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Yousefi MR, Dougherty ER. Performance reproducibility index for classification. Bioinformatics. 2012;28:2824–33. doi: 10.1093/bioinformatics/bts509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Dougherty ER, Braga-Neto UM. Epistemology of computational biology: mathematical models and experimental prediction as the basis of their validity. J Biol Syst. 2006;14:65–90. [Google Scholar]
  • 15.Dougherty ER. On the epistemological crisis in genomics. Curr Genomics. 2008;9:69–79. doi: 10.2174/138920208784139546. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Dougherty ER, Hua J, Bittner ML. Validation of computational methods in genomics. Curr Genomics. 2007;8:1–19. doi: 10.2174/138920207780076956. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Braga-Neto UM. Fads and fallacies in the name of small-sample microarray classification. IEEE Signal Process Mag. 2007;24:91–9. [Google Scholar]
  • 18.Fisher RA. Statistical Methods for Research Workers. Edinburgh; Oliver and Boyd; 1925. [Google Scholar]
  • 19.Martin JK, Hirschberg DS. Small Sample Statistics for Classification Error Rates II: Confidence Intervals and Significance Tests. University of California; Irvine: 1996. (Technical Report). [Google Scholar]
  • 20.Stein C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution; Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA: University of California Press; 1956. pp. 197–206. [Google Scholar]
  • 21.Efron B, Morris C. Stein's estimation rule and its competitors – an empirical Bayes approach. J Am Stat Assoc. 1973;68:117–30. [Google Scholar]
  • 22.Efron B, Morris C. Data analysis using Stein's estimator and its generalizations. J Am Stat Assoc. 1975;70:311–9. [Google Scholar]
  • 23.James W, Stein C. Estimation with quadratic loss; Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA: University of California Press; 1961. pp. 361–79. [Google Scholar]
  • 24.Brandwein AC, Strawderman WE. Stein estimation: the spherically symmetric case. Stat Sci. 1990;5:356–69. [Google Scholar]
  • 25.Brandwein AC, Strawderman WE. Stein estimation for spherically symmetric distributions: recent developments. Stat Sci. 2012;27:1–13. [Google Scholar]
  • 26.Gupta AK, Pena EA. A simple motivation for James-Stein estimator. Stat Probab Lett. 1991;12:337–40. [Google Scholar]
  • 27.Hoffman K. Stein estimation – a review. Statistical Pap. 2000;41:127–58. [Google Scholar]
  • 28.Gruber MHJ. Improving Efficiency by Shrinkage: The James–Stein and Ridge Regression Estimators. New York, NY: CRC Press; 1998. [Google Scholar]
  • 29.Baranchik AJ. Multiple Regression and Estimation of the Mean of a Multivariate Normal Distribution. Stanford Univ; Stanford: 1964. (Technical Report). [Google Scholar]
  • 30.Baranchik AJ. A family of minimax estimators of the mean of a multivariate normal distribution. Ann Math Stat. 1970;41:642–5. [Google Scholar]
  • 31.Harter HL. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann Math Statist. 1971;42:855–903. [Google Scholar]
  • 32.DasGupta A, Sinha BK. A New General Interpretation of the Stein Estimate and How it Adapts: with Applications. Purdue Univ; West Lafayette: 1997. (Technical Report). [Google Scholar]
  • 33.Guo YY, Pal N. A sequence of improvements over the James-Stein estimator. J Multivar Anal. 1992;42:302–17. [Google Scholar]
  • 34.Kubokawa T. An approach to improving the James-Stein estimator. J Multivar Anal. 1991;36:121–6. [Google Scholar]
  • 35.Hoerl AE. Application of Ridge analysis to regression problems. Chem Eng Progress. 1962;58:54–9. [Google Scholar]
  • 36.Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–9. [Google Scholar]
  • 37.Hoerl AE, Kennard RW. Ridge regression: applications to nonorthogonal problems. Technometrics. 1970;12:69–82. [Google Scholar]
  • 38.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
  • 39.Tropp JA. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans Inform Theory. 2006;20:1030–51. [Google Scholar]
  • 40.Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996;58:267–88. [Google Scholar]
  • 41.Donoho D, Johnstone I. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–55. [Google Scholar]
  • 42.Osborne M R, Presnell B, Turlach BA. On the LASSO and its dual. J Comput Graph Stat. 1999;9:319–37. [Google Scholar]
  • 43.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32:407–99. [Google Scholar]
  • 44.Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–35. [Google Scholar]
  • 45.Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc B. 2005;67:301–20. [Google Scholar]
  • 46.Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–29. [Google Scholar]
  • 47.Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc B. 2007;68:49–67. [Google Scholar]
  • 48.Tibshirani RJ, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J R Stat Soc B. 2005;67:91–108. [Google Scholar]
  • 49.Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–41. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Tibshirani RJ. Regression shrinkage and selection via Lasso: a retrospective. J R Stat Soc B. 2005;73:273–82. [Google Scholar]
  • 51.Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci Comput. 1999;20:33–61. [Google Scholar]
  • 52.Jain A, Waller W. On the optimal number of features in the classification of multivariate Gaussian data. Pattern Recogn. 1978;10:365–74. [Google Scholar]
  • 53.Hughes GF. On the mean accuracy of statistical pattern recognizers. IEEE Trans Inf Theory. 1968;14:55–63. [Google Scholar]
  • 54.McLachlan G. Discriminant Analysis and Statistical Pattern Recognition. New York, NY: Wiley; 2004. [Google Scholar]
  • 55.Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell. 1991;13:252–64. [Google Scholar]
  • 56.Devroye L, Gyorfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York, NY: Springer; 1996. [Google Scholar]
  • 57.Abend K, Harley TJ., Jr Comments on the mean accuracy of statistical pattern recognizers. IEEE Trans Inf Theory. 1969;14:420–3. [Google Scholar]
  • 58.Lindley D. The Bayesian approach. Scand J Statist. 1978;5:1–26. [Google Scholar]
  • 59.Waller W, Jain A. On the monotonicity of the performance of Bayesian classifiers. IEEE Trans Inf Theory. 1978;24:392–4. [Google Scholar]
  • 60.Van Campenhout JM. On the peaking of the Hughes mean recognition accuracy: the resolution of an apparent paradox. IEEE Trans Syst Man Cybern Syst. 1978;8:390–5. [Google Scholar]
  • 61.Sitgreaves R. Some results on the distribution of the W-classification. In: Solomon H, editor. Studies in Item Analysis and Prediction. Stanford, CA: Stanford University Press; 1961. pp. 241–51. [Google Scholar]
  • 62.Bowker AH, Sitgreaves R. An asymptotic expansion for the distribution function of the W-classification statistic. In: Solomon H, editor. Studies in Item Analysis and Prediction. Stanford, CA: Stanford University Press; 1961. pp. 292–310. [Google Scholar]
  • 63.Raudys S. Statistical and Neural Classifiers, an Integrated Approach to Design. London: Springer-Verlag; 2001. [Google Scholar]
  • 64.Raudys S. On determining training sample size of a linear classifier. Comput Syst. 1967;28:79–87. In Russian. [Google Scholar]
  • 65.Raudys S, Young DM. Results in statistical discriminant analysis: a review of the former Soviet Union literature. J Multivar Anal. 2004;89:1–35. [Google Scholar]
  • 66.Deev AD. Representation of statistics of discriminant analysis and asymptotic expansion when space dimensions are comparable with sample size. Dokl Akad Nauk SSSR. 1970;195:759–62. In Russian. [Google Scholar]
  • 67.Zollanvari A, Braga-Neto UM, Dougherty ER. Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans Sig Proc. 2011;59:4238–55. [Google Scholar]
  • 68.Serdobolskii VI. Multivariate Statistical Analysis: A High-Dimensional Approach. Berlin: Kluwer Academic Publishers; 2000. [Google Scholar]
  • 69.Wyman FJ, Young DM, Turner DW. A comparison of asymptotic error rate expansions for the sample linear discriminant function. Pattern Recognit. 1990;23:775–83. [Google Scholar]
  • 70.Bai ZD, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Berlin: Springer; 2010. [Google Scholar]
  • 71.Marčenko VA, Pastur LA. Distribution of eigenvalues for some sets of random matrices. Math USSR Sb. 1967;1:457–83. [Google Scholar]
  • 72.Couillet R, Debbah M. Random Matrix Methods for Wireless Communications. New York, NY: Cambridge University Press; 2011. [Google Scholar]
  • 73.Buchanan M. Enter the Matrix: the deep law that shapes our reality. New Scientist. 2010;(2755). [Google Scholar]
  • 74.Wigner EP. Characteristic vectors of bordered matrices with infinite dimensions. Ann Math. 1955;62:548–64. [Google Scholar]
  • 75.Wigner EP. On the distribution of the roots of certain symmetric matrices. Ann Math. 1958;67:325–7. [Google Scholar]
  • 76.Wigner EP. Random matrices in physics. SIAM Rev. 1967;9:1–23. [Google Scholar]
  • 77.Isham CJ. Lectures on Quantum Theory: Mathematical and Structural Foundations. London: Imperial College; 2005. [Google Scholar]
  • 78.Marchildon L. Quantum Mechanics: From Basic Principles to Numerical Methods and Applications. New York, NY: Springer; 2002. [Google Scholar]
  • 79.Dyson FJ. Statistical theory of the energy levels of complex systems, I, II, and III. J Math Phys. 1962;3:140–75. [Google Scholar]
  • 80.Mehta ML. Random Matrices. San Diego, CA: Academic Press; 1991. [Google Scholar]
  • 81.Efron B. Modern Science and the Bayesian-Frequentist Controversy. Stanford Univ; Stanford: 2005. (Technical Report). [Google Scholar]
  • 82.Dyson FJ. A Brownian-motion model for the eigenvalues of a random matrix. J Math Phys. 1962;3:1191–8. [Google Scholar]
  • 83.Dyson FJ. Correlations between eigenvalues of a random matrix. Comm Math Phys. 1970;19:235–50. [Google Scholar]
  • 84.Mehta ML. On the statistical properties of the level-spacings in nuclear spectra. Nucl Phys. 1960;18:395–419. [Google Scholar]
  • 85.Mehta ML, Gaudin M. On the density of eigenvalues of a random matrix. Nucl Phys. 1960;18:420–7. [Google Scholar]
  • 86.Mehta ML. Random matrices in nuclear physics and number theory. Contemp Math. 1986;50:295–309. [Google Scholar]
  • 87.Mehta ML. Determinants of quaternion matrices. J Math Phys Sci. 1974;8:559–70. [Google Scholar]
  • 88.Hachem W, Khorunzhiy O, Loubaton P, Najim J, Pastur L. A new approach for capacity analysis of large dimensional multi-antenna channels. IEEE Trans Inf Theory. 2008;54:3987–4004. [Google Scholar]
  • 89.Pastur L. A simple approach to global regime of the random matrix theory. Mathematical Results in Statistical Mechanics. 1999:429–54. [Google Scholar]
  • 90.Girko VL. Circle law. Theory Probab Appl. 1984;29:694–706. [Google Scholar]
  • 91.Girko VL. On the circle law. Theory Probab Math Statist. 1984;28:15–23. [Google Scholar]
  • 92.Girko VL. Theory of Random Determinants. Dordrecht: Kluwer Academic Publishers; 1990. [Google Scholar]
  • 93.Silverstein JW. Comments on a result of Yin, Bai and Krishnaiah for large dimensional multivariate F matrices. J Multivar Anal. 1984;15:408–9. [Google Scholar]
  • 94.Silverstein JW, Bai ZD. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. J Multivar Anal. 1995;54:175–92. [Google Scholar]
  • 95.Silverstein JW. Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. J Multivar Anal. 1995;55:331–9. [Google Scholar]
  • 96.Silverstein JW, Tulino AM. Theory of large dimensional random matrices for engineers. IEEE Ninth International Symposium on Spread Spectrum Techniques and Applications; Manaus, Brazil; 2006. pp. 458–64. [Google Scholar]
  • 97.Bai ZD, Yin YQ, Krishnaiah PR. On limiting spectral distribution of product of two random matrices when the underlying distribution is isotropic. J Multivar Anal. 1986;19:189–200. [Google Scholar]
  • 98.Bai ZD, Yin YQ. On the convergence of the spectral empirical process of Wigner matrices. Bernoulli. 2005;11:1059–92. [Google Scholar]
  • 99.Bai ZD, Saranadasa H. Effect of high dimension comparison of significance tests for a high dimensional two sample problem. Stat Sin. 1996;6:311–29. [Google Scholar]
  • 100.Bai ZD. Convergence rate of expected spectral distributions of large random matrices. Part I. Wigner matrices. Ann Probab. 1993;21:625–48. [Google Scholar]
  • 101.Yin YQ, Bai ZD, Krishnaiah PR. Limiting behavior of the eigenvalues of a multivariate F matrix. J Multivar Anal. 1983;13:508–16. [Google Scholar]
  • 102.Yin YQ, Bai ZD, Krishnaiah PR. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probab Theory Relat Fields. 1988;78:509–21. [Google Scholar]
  • 103.Yin YQ, Bai ZD, Krishnaiah PR. A limit theorem for the eigenvalues of product of two random matrices. J Multivar Anal. 1983;13:489–507. [Google Scholar]
  • 104.Bai ZD, Yin YQ. Necessary and sufficient conditions for the almost sure convergence of the largest eigenvalue of Wigner matrices. Ann Probab. 1988;16:1729–41. [Google Scholar]
  • 105.Raudys S. On the amount of a priori information in designing the classification algorithm. Tech Cybern. 1972;4:168–174. In Russian. [Google Scholar]
  • 106.Deev AD. Asymptotic expansions for distributions of statistics W, M, and W* in discriminant analysis. Stat Methods Classif. 1972;31:6–57. In Russian. [Google Scholar]
  • 107.Meshalkin LD, Serdobolskii VI. Errors in the classification of multi-variate observations. Theory Prob Appl. 1978;23:741–50. [Google Scholar]
  • 108.Serdobolskii VI. On minimum error probability in discriminant analysis. Soviet Math Dokl. 1983;27:720–5. [Google Scholar]
  • 109.Raudys S. Comparison of the estimates of the probability of misclassification. Proc. International Joint Conference on Pattern Recognition; Kyoto, Japan; 1978. pp. 280–2. [Google Scholar]
  • 110.Raudys S. Evolution and generalization of a single neurone II. Complexity of statistical classifiers and sample size considerations. Neural Networks. 1998;11:297–313. doi: 10.1016/s0893-6080(97)00136-6. [DOI] [PubMed] [Google Scholar]
  • 111.Fujikoshi Y. Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. J Multivar Anal. 2000;73:1–17. [Google Scholar]
  • 112.Fujikoshi Y, Seo T. Asymptotic approximations for EPMC’s of the linear and the quadratic discriminant functions when the samples sizes and the dimension are large. Statist Anal Random Arrays. 1998;6:269–80. [Google Scholar]
  • 113.Zollanvari A, Genton MG. On Kolmogorov asymptotics of estimators of the misclassification error rate in linear discriminant analysis. Sankhya Ser A. 2013;75:300–26. doi: 10.1007/s13171-013-0029-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Tse D, Hanly SV. Multiaccess fading channels. I. Polymatroid structure, optimal resource allocation and throughput capacities. IEEE Trans Inf Theory. 1998;44:2796–815. [Google Scholar]
  • 115.Verdu S, Shamai S. Spectral efficiency of CDMA with random spreading. IEEE Trans Inf Theory. 1999;45:622–40. [Google Scholar]
  • 116.Tse D, Hanly SV. Linear multiuser receivers: effective interference, effective bandwidth and user capacity. IEEE Trans Inf Theory. 1999;45:641–57. [Google Scholar]
  • 117.Hachem W, Loubaton P, Najim J. A CLT for information theoretic statistics of Gram random matrices with a given variance profile. Ann Appl Probab. 2008;18:2071–130. [Google Scholar]
  • 118.Muller R, Verdú S. Design and analysis of low-complexity interference mitigation on vector channels. IEEE J Sel Areas Commun. 2001;19:1429–41. [Google Scholar]
  • 119.Couillet R, Debbah M, Silverstein JW. A deterministic equivalent for the analysis of correlated MIMO multiple access channels. IEEE Trans Inf Theory. 2011;57:3493–514. [Google Scholar]
  • 120.Wagner S, Couillet R, Debbah M, Slock DTM. Large system analysis of linear precoding in correlated MISO broadcast channels under limited feedback. IEEE Trans Inf Theory. 2012;58:4509–37. [Google Scholar]
  • 121.Zhang M, Rubio F, Palomar DP, Mestre X. Finite-sample linear filter optimization in wireless communications and financial systems. IEEE Trans Signal Process. 2013;61:5014–25. [Google Scholar]
  • 122.Mestre X. Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates. IEEE Trans Inf Theory. 2008;54:5113–29. [Google Scholar]
  • 123.Mestre X, Lagunas M. Modified subspace algorithms for DoA estimation with large arrays. IEEE Trans Signal Process. 2008;56:598–614. [Google Scholar]
  • 124.Zollanvari A, Dougherty ER. Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model. Pattern Recognit. 2014;47:2178–92. doi: 10.1016/j.patcog.2013.11.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Dalton L, Dougherty ER. Bayesian minimum mean-square error estimation for classification error – part I: definition and the Bayesian MMSE error estimator for discrete classification. IEEE Trans Signal Process. 2011;59:115–29. [Google Scholar]
  • 126.Dalton L, Dougherty ER. Bayesian minimum mean-square error estimation for classification error – part II: linear classification of Gaussian models. IEEE Trans Signal Process. 2011;59:130–44. [Google Scholar]
  • 127.Clarke R, Ressom HW, Wang A, et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer. 2008;8:37–49. doi: 10.1038/nrc2294. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Cancer Informatics are provided here courtesy of SAGE Publications