Abstract
High dimensional data routinely arises in image analysis, genetic experiments, network analysis, and various other research areas. Many such datasets do not correspond to well-studied probability distributions, and in several applications the data-cloud prominently displays non-symmetric and non-convex shape features. We propose using spatial quantiles and their generalizations, in particular, the projection quantile, for describing, analyzing and conducting inference with multivariate data. Minimal assumptions are made about the nature and shape characteristics of the underlying probability distribution, and we do not require the sample size to be as high as the data-dimension. We present theoretical properties of the generalized spatial quantiles, and an algorithm to compute them quickly. Our quantiles may be used to obtain multidimensional confidence or credible regions that are not required to conform to a pre-determined shape. We also propose a new notion of multidimensional order statistics, which may be used to obtain multidimensional outliers. Many of the features revealed using a generalized spatial quantile-based analysis would be missed if the data was shoehorned into a well-known probabilistic configuration.
Keywords: Multivariate quantile, Spatial quantile, Projection quantile, Generalized spatial quantile, Multidimensional coverage sets, Multivariate order statistics, Brain imaging, High dimensional data visualization
1. Introduction
The use of the multivariate Normal distribution, or certain characteristics of multivariate Normal distributions, is routine in statistical data analysis. Prominent among such characteristics are the elliptic shape of the density function concentration regions, convexity and compactness of such concentration ellipsoids, and an overall symmetry of the density function around the location parameter. These characteristics are useful, for example, in describing confidence sets (or in a Bayesian analysis, credible sets), or acceptance regions for hypothesis tests. Sometimes multivariate heavy-tailed, lifetime, or discrete distributions may be put to use; however, it is not obvious how to proceed when the properties of the data do not match the characteristics of the chosen family of distributions.
In this paper, we propose to address the issue of how to describe, analyze and conduct inference on datasets where routine assumptions like multivariate Normality may not be viable. Minimal assumptions are made about the nature and shape characteristics of the data-cloud. Also, in view of several recent applications where the dimensions of the observations are extraordinarily high, but the sample size may or may not be high, our methodology does not necessarily require that the sample size be higher than the dimensions of the data.
As an example where routine multivariate data analysis assumptions may not be appropriate, consider the problem of the treatment of Alzheimer's disease using Deep Brain Stimulation (DBS). This treatment is conducted by putting DBS electrodes close to the nucleus of the brain, to provide a stimulation deep inside the brain of the patient. The data consists of the location of the electrode placement inside the brain, along with measurements on the changes in the neurological patterns of the patients. The measurements on changes of neurological patterns differ from one location to another inside a person's brain, and from one individual to another. The medical interest in this problem lies in obtaining the region of the human brain where the placement of the electrodes and subsequent stimulation results in prominent changes in the neurological patterns. For example, we may want to obtain the region where electrode placement and stimulation results in a 50% or more improvement in cognitive ability. An assumption that such a region is a convex ellipsoid seems tenuous at best given the geometry of the human brain, and medical professionals are generally unwilling to accept such simplistic statistical assumptions.
An example of a statistical application requiring extraction of high dimensional geometrical features is available from microarray gene experimentation. Typically, a large number n (= $O(10^3)$) of genes are observed a number of times p (= $O(10)$). Such studies are often conducted to understand the role of genes in cell-cycle regulation, typically in the context of a disease like cancer where the regular cell-cycle pattern may be altered due to the over- or under-expression of a number of genes. In the context of a particular type of cancer, most of the n genes do not participate in the cell-cycle regulation process. In order to understand which genes are “out-of-the-ordinary” in a given context, we need to study the p-dimensional profile of each one of the n genes and identify the outlying ones. Standard approaches rely on assumptions like a multivariate Normal distribution pattern, or some characteristic of it, for example, in considering correlation as a sole dependency measure. There is no biological reason to presume that the p (= 10–50) dimensional data-cloud formed by the expressions of thousands of genes would correspond to a p-dimensional Normal density pattern. We need a method for identifying extraordinary genes, without presuming the data fits into a probabilistic model simply because the model is well understood.
These examples illustrate the need for ways of obtaining and using general multivariate quantiles. Multivariate quantiles and coverage sets are important tools for a number of different problems. They may be used for summarizing multivariate Bayesian and resampling-based inference, for simultaneous hypothesis tests, for evaluation of several competing models for a given data, and several other applications. One of the main roles of multivariate quantiles is to capture the geometry of the data, and hence the dependency among the variables. The listing of coordinate-wise quantiles is uninformative about the joint distribution of the variables. Moreover, coordinate-wise quantiles do not retain desirable invariance properties.
The desirable properties for any candidate multivariate quantile include reflecting the shape and other properties of the data, fast and accurate algorithms for computation, and tractable theoretical properties. Moreover, applicability in sparse data in high dimensions should be considered an advantage, since such cases routinely occur in several modern applications. In this paper, we build on the notion of geometric or spatial quantiles presented in [13]. The central idea of Chaudhuri is that multivariate quantiles are indexed by a p-dimensional vector of norm between zero and one, where p is the dimension of the observations. This definition naturally includes the classical definition of a quantile for the univariate case, it extends well-studied notions of multivariate medians [16,2,28] to general quantiles, and conforms to the principle adopted in [3,4,18] and others that multivariate quantiles should have both direction and magnitude. By varying the direction, magnitude and the distance metric, we obtain the class of generalized spatial quantiles, of which Chaudhuri's quantiles are a special case. Another interesting special case is the projection quantile, which stands out in terms of computational ease and theoretical tractability, and is intuitively appealing since it relates to quantiles of one-dimensional projections.
Multivariate quantiles may be used for several purposes, including data description and exploratory analysis, graphical displays, estimation and inference. Some of these tasks may be accomplished by using data-depth, which is essentially a center-outward ranking of multivariate data. Data-depths have been studied comprehensively, see [30,22,24,23,31,36] for several seminal developments. The relationship between multivariate quantiles and depth is similar to that of univariate quantiles and ranks, in the sense that depth (or rank) can be computed from quantiles (see [31] and Section 3.3), but depth/rank does not carry as much information as quantiles. Hence, all methodology, theory and applications based on depth are available when quantiles are used as basic quantities. On the other hand, concepts like quantile regression require a notion of quantiles (see Section 3.2 for the multivariate version), and may not be satisfactorily obtained using a depth function alone. Moreover, several depth functions do not account for shape features, they may require an unviable amount of computational time, and may not be applicable in high dimensions. Also, since the underlying densities could be posterior or bootstrap densities (and hence conditional on data) in many applications, verification of all technical assumptions relating to data-depth could be problematic.
In Section 2, we first present Chaudhuri's spatial quantiles, then develop projection quantiles, and finally present the generalized spatial quantiles. Properties of the generalized spatial quantiles, some applications, and algorithms to compute them are presented in Section 3. First, in Section 3.1, we obtain a number of theoretical results: on the consistency and asymptotic Normality of the sample generalized spatial quantile, on the consistency of approximating the distribution of the sample generalized spatial quantile using generalized bootstrap, and on a Bahadur-type asymptotic representation. We also establish a one-to-one correspondence between projection quantiles and the unit ball in $\mathbb{R}^p$, where p is the data-dimension, which is a multivariate generalization of the well-known relationship between quantiles and probabilities. In Section 3.2, we propose a method for obtaining credible or confidence regions in dimensions greater than one, when only a data-scatter is available. Such confidence regions are not presumed to conform to a pre-determined feature like symmetry or convexity, and are expected to capture the shape of the data-cloud. We prove that the one-dimensional projections of the projection quantiles-based confidence regions have exact coverage probability, thus illustrating the efficacy of the proposed method. We then discuss, in Section 3.3, the notion of multivariate order statistics, and remark on how they may be used for detecting outliers in high dimensional data and for defining data-depths. Finally, in Section 3.4, we present a coordinate descent algorithm for computing the generalized spatial quantiles, which is especially useful when the sample size is lower than the dimension.
Since data-depth measures can accomplish some of the tasks of multivariate quantiles, in Section 4 we first present a simulation example to compare three cases of generalized spatial quantiles and a popular data-depth measure. This simulation example shows that in standard multivariate inferential problems, quantiles and data-depths generally complement and corroborate each other. We then revisit the examples of DBS electrode placement and human cancer cell-cycle regulation that have been briefly introduced above. The advantage of using multivariate quantiles as opposed to data-depth in high dimensions is illustrated in the cell-cycle regulation data. A concluding section collects further remarks, and an Appendix is used for the proofs of some of the theoretical results from Section 3.
2. Spatial quantiles
In this section we describe Chaudhuri's quantiles, projection quantiles and generalized spatial quantiles. In this context we also establish some notations that we follow in the rest of this paper.
2.1. Chaudhuri's spatial quantiles
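In p-dimensional Euclidean space $\mathbb{R}^p$, Chaudhuri's spatial quantiles [13] are maps from the open unit ball $\{u \in \mathbb{R}^p : \|u\| < 1\}$ to $\mathbb{R}^p$. For any random variable $X \in \mathbb{R}^p$ and every u in the open unit ball, the uth quantile Q(u) is defined as the minimizer, over $q \in \mathbb{R}^p$, of
$$E\{\|X - q\| + \langle u, X - q\rangle\}. \tag{1}$$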
The inner product 〈·, ·〉 above is the usual Euclidean inner product, and the norm ∥ · ∥ is the usual Euclidean norm. The existence and uniqueness of Chaudhuri's spatial quantiles are discussed in Section 3. If a random sample X1, . . ., Xn is available, the empirical spatial quantile Qn(u) imitates the above setup, and is defined as the minimizer of
$$\frac{1}{n}\sum_{i=1}^{n}\{\|X_i - q\| + \langle u, X_i - q\rangle\}. \tag{2}$$
Note that, in the 1-dimensional case, the αth sample quantile is traditionally defined as the point below which exactly α-proportion of the data falls, for α ∈ (0, 1). This definition is recovered from (2) using p = 1 and u = 2α – 1 ∈ (–1, 1).
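The one-dimensional reduction can be checked numerically. The sketch below assumes that the sample criterion in (2) is the average of ∥Xi – q∥ + 〈u, Xi – q〉, a form consistent with the special cases stated in the text. The function names and the toy data are illustrative only, and the αth sample quantile is taken to be the order statistic of rank ⌈nα⌉.

```python
import math

def spatial_objective_1d(data, q, u):
    """Assumed sample criterion for p = 1: mean of |x - q| + u * (x - q)."""
    return sum(abs(x - q) + u * (x - q) for x in data) / len(data)

def quantile_via_objective(data, alpha):
    """Minimize the criterion over the data points, using u = 2 * alpha - 1.

    The criterion is convex and piecewise linear in q, so a minimizer
    can always be found among the observations.
    """
    u = 2.0 * alpha - 1.0
    return min(data, key=lambda q: spatial_objective_1d(data, q, u))

data = [2.1, -0.4, 3.3, 0.9, 1.7, -1.2, 4.8, 0.2, 2.6]
alpha = 0.75
q_hat = quantile_via_objective(data, alpha)
# Classical sample quantile: the order statistic of rank ceil(n * alpha).
order_stat = sorted(data)[math.ceil(alpha * len(data)) - 1]
```

Here n = 9 and α = 0.75, so nα is not an integer and the minimizer is unique; when nα is an integer the criterion is flat between two adjacent order statistics.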
Historically, possibly the earliest example of Chaudhuri's quantiles is Haldane's spatial median [16]. Various properties and applications of Chaudhuri's quantiles are available in [6–9,11].
2.2. The projection quantile
Here we present another approach that retains the theme of describing quantiles as functions indexed by the unit ball in $\mathbb{R}^p$. Let U denote the unit vector in the direction of u, i.e., $U = u/\|u\|$. Let $X_U = \langle X, U\rangle = \|u\|^{-1}\langle X, u\rangle$; thus the projection of the random vector X on the one-dimensional space spanned by u is $X_U U = \|u\|^{-2}\langle X, u\rangle u$. Let $q_u$ be the $(1 + \|u\|)/2$th quantile of $X_U$, that is, $P[X_U \le q_u] = (1 + \|u\|)/2$. The uth projection quantile is defined as $Q_{proj}(u) = q_u u/\|u\| = q_u U$.
Thus, the uth projection quantile Qproj(u) is a vector that lies in the subspace spanned by u, and has the intuitive appeal of being related to qu. Moreover, it poses no computational burden of any significance, since projecting X on a 1-dimensional subspace is a simple operation. One of the attractive features of quantiles of univariate, continuous distributions is that they are invertible functions of probabilities, that is, there is a one-to-one map between the quantiles and probabilities. In Section 3 we establish the equivalent property for the projection quantile; i.e. the projection quantile is a one-to-one map of the unit ball in p-dimensions. There may be several interesting applications developed from this important property, which we will pursue in future.
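A minimal sketch of the sample projection quantile follows, to make the computational claim concrete. The function name, the toy data, and the convention that the sample α-quantile is the order statistic of rank ⌈nα⌉ are assumptions of this illustration, not part of the paper.

```python
import math

def projection_quantile(data, u):
    """Sample projection quantile for a nonzero u with norm below one.

    The (1 + ||u||)/2 sample quantile of the projections is taken to be
    the order statistic of rank ceil(n * alpha) (an assumed convention).
    """
    beta = math.sqrt(sum(c * c for c in u))
    if not 0.0 < beta < 1.0:
        raise ValueError("u must be nonzero with Euclidean norm below one")
    unit = [c / beta for c in u]  # U = u / ||u||
    # Sorted one-dimensional projections X_U = <X, U>.
    projections = sorted(sum(xj * cj for xj, cj in zip(x, unit)) for x in data)
    alpha = (1.0 + beta) / 2.0
    rank = min(len(projections), math.ceil(alpha * len(projections)))
    q_u = projections[rank - 1]
    return [q_u * c for c in unit]  # q_u * U lies in the span of u

points = [(0.0, 1.0), (1.0, 3.0), (2.0, -2.0), (-1.0, 0.5), (0.5, 2.0)]
upper = projection_quantile(points, (0.0, 0.5))
lower = projection_quantile(points, (0.0, -0.5))
```

With u along the second coordinate axis, the result is a coordinate-wise quantile placed on that axis. The cost is one sort of n projections, whatever the dimension p.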
The use of projections for studying higher dimensional objects is very standard in geometry and statistics. For example, projection pursuit is used extensively in many applications. An early review of projection pursuit may be found in [19], and an overview of applications may be found in [17]. A notion of data depth based on projections has been developed and studied in [36,35,34] and in several other papers. However, we have not been able to trace a reference for the projection quantile, as described in this section.
2.3. Generalized spatial quantiles
In this section we present a general approach towards spatial quantiles, which obtains Chaudhuri's spatial quantiles as well as the projection quantiles as special cases. As earlier, define U as the unit vector in the direction of u, i.e., $U = u/\|u\|$. Also, for convenience, define $\beta = \|u\|$, thus $u = \beta U$. Let $X_U = \langle X, U\rangle$ and $q_U = \langle q, U\rangle$; thus the projections of X and q in the direction of u are $X_U U$ and $q_U U$ respectively. Let $X_{U\perp} = X - X_U U$ and $q_{U\perp} = q - q_U U$; these are the projections on the space orthogonal to U (or u). In particular, we have $\|X - q\|^2 = (X_U - q_U)^2 + \|X_{U\perp} - q_{U\perp}\|^2$.
Based on this, for every u in the open unit ball and every λ ≥ 0, the generalized spatial quantiles Q(u, λ) are defined as minimizers of the expectation of
$$\Psi_{u,\lambda}(X, q) = \{(X_U - q_U)^2 + \lambda\|X_{U\perp} - q_{U\perp}\|^2\}^{1/2} + \beta(X_U - q_U).$$
Note that for λ = 0 we get the projection quantile, and for λ = 1 we get Chaudhuri's quantiles.
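The two special cases can be verified numerically for a single observation. This sketch assumes the criterion has the form Ψu,λ(X, q) = {(XU – qU)² + λ∥XU⊥ – qU⊥∥²}^{1/2} + β(XU – qU), which is consistent with the λ = 0 and λ = 1 cases named in the text. The function name psi and the numerical values are hypothetical.

```python
import math

def psi(x, q, u, lam):
    """Assumed criterion Psi_{u,lambda}(x, q) for a nonzero u."""
    beta = math.sqrt(sum(c * c for c in u))
    unit = [c / beta for c in u]
    diff = [xj - qj for xj, qj in zip(x, q)]
    along = sum(d * c for d, c in zip(diff, unit))  # X_U - q_U
    # Squared norm of the orthogonal part, via the Pythagorean identity.
    perp_sq = max(sum(d * d for d in diff) - along * along, 0.0)
    return math.sqrt(along * along + lam * perp_sq) + beta * along

x, q, u = (1.0, 2.0), (0.5, -1.0), (0.3, 0.4)
diff = [a - b for a, b in zip(x, q)]
beta = math.sqrt(sum(c * c for c in u))
along = sum(a * d for a, d in zip(u, diff)) / beta

# lambda = 1: Chaudhuri's criterion ||x - q|| + <u, x - q>.
chaudhuri = math.sqrt(sum(d * d for d in diff)) + sum(a * d for a, d in zip(u, diff))
# lambda = 0: univariate criterion |X_U - q_U| + beta * (X_U - q_U).
projection = abs(along) + beta * along
```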
We may consider another level of generalization here, by replacing the Euclidean norm used in Ψu,λ(X, q) with an Lk-norm, for k ≥ 1. The Lk-norm of a vector $x = (x_1, \ldots, x_p)$ is given by $\|x\|_k = (\sum_{i=1}^{p} |x_i|^k)^{1/k}$. Thus, the generalized spatial quantiles Q(u, λ, k) based on the Lk-norm are defined as minimizers of the expectation of Ψu,λ,k(X, q), the counterpart of Ψu,λ(X, q) in which the Lk-norm replaces the Euclidean norm.
The notion of a projection, and the definitions of XU, qU, XU⊥, qU⊥ based on the Euclidean inner product are retained as earlier. The extension of Chakraborty [5] to Chaudhuri's quantiles is obtained with Ψu, 1, 1(X, q). The properties of the quantiles depend on the choice of k, but in this paper, except for the occasional remark, we will keep to the use of the Euclidean norm, and not use k as a part of our notation. Note that Ψu,0,k(X, q) = Ψu,0(X, q), so the choice of the norm does not matter for projection quantiles. Also, when u is chosen along any Cartesian basis direction (0, . . ., 0, 1, 0, . . ., 0), the coordinate-wise quantiles are obtained as a special case of projection quantiles. In applications, certain linear combinations of the elements of X may be of interest, for example, certain contrasts or the cross-section mean. Quantiles from the joint distributions of all such interesting linear combinations are easily obtainable by our method. The definition of generalized spatial quantiles effectively imposes the requirement that the quantile of a random variable should reside in its support, and reflect the topological and geometric properties of the support. Hence, quantiles of p-dimensional random vectors should be p-dimensional, and dependent on the metric and geometry in use.
3. Properties, applications and algorithms
3.1. Properties of generalized spatial quantiles
We now present a few properties of generalized spatial quantiles. Some of these properties have been discovered earlier for special cases like Chaudhuri's spatial quantiles. Our approach below presents a unified and easily understood framework for every fixed u and λ, relying on the convexity of Ψu,λ(X, qU, qU⊥) in (qU, qU⊥). Our first result is to establish this convexity.
Proposition 3.1
The function
is convex in (qU, qU⊥), with the subgradient function
The proof of this result is easy and hence omitted. We restrict ourselves to such random variables for which $E\,\Psi_{u,\lambda}(X, q)$ is finite for our choices of $q = q_U U + q_{U\perp}$. We also assume that the minimizer of $E\,\Psi_{u,\lambda}(X, q)$, denoted by q*, which is the population (u, λ)th quantile, is unique. The conditions of finiteness of the expectation of the population quantile defining function and the uniqueness of the population quantile are mild and necessary assumptions.
Let X1, X2, . . ., Xn be an i.i.d. sample. We denote the minimizer of $n^{-1}\sum_{i=1}^{n}\Psi_{u,\lambda}(X_i, q)$, the sample (u, λ)th quantile, by $q_n = q_{nU}U + q_{nU\perp}$. Our next set of results relate to the behavior of qn, much of which is characterized by the moments of the subgradient function g(X, q*) defined in Proposition 3.1.
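As a worked illustration of Proposition 3.1, the display below differentiates the criterion with respect to (qU, qU⊥) at points where it is smooth. It assumes that Ψu,λ has the form {(XU – qU)² + λ∥XU⊥ – qU⊥∥²}^{1/2} + β(XU – qU), which is consistent with the special cases λ = 0 and λ = 1 described in Section 2.3. The symbol D is shorthand introduced here and is not the paper's notation. At D = 0 the function is not differentiable and only a subgradient exists, which this sketch does not specify.

```latex
\begin{aligned}
D &= \{(X_U - q_U)^2 + \lambda \|X_{U\perp} - q_{U\perp}\|^2\}^{1/2}, \\
\frac{\partial \Psi_{u,\lambda}}{\partial q_U}
  &= -\frac{X_U - q_U}{D} - \beta, \\
\frac{\partial \Psi_{u,\lambda}}{\partial q_{U\perp}}
  &= -\frac{\lambda\,(X_{U\perp} - q_{U\perp})}{D}, \qquad D > 0.
\end{aligned}
```

For λ = 0 the first component reduces to –sign(XU – qU) – β. Setting its expectation to zero gives P[XU < qU] – P[XU > qU] = β, which for a continuous distribution places qU at the (1 + β)/2 quantile of XU, in agreement with the projection quantile of Section 2.2.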
Theorem 3.1
- qn → q* almost surely as n → ∞.
- If and if is twice continuously differentiable at q* with the second derivative H being positive definite, then as n → ∞
where . This implies, in particular, that $n^{1/2}(q_n - q^*)$ is asymptotically Normal, with asymptotic variance $H^{-1} V H^{-1}$ where V = Var g(X, q*).
- Under the conditions of the previous item, the generalized bootstrap approximation for the distribution of $n^{1/2}(q_n - q^*)$ is consistent, and resampling may be used for inference.
- In addition to the conditions of the previous item, assume that
for some s ∈ (0, 1) and r > (8 + p(1 + s))/(1 – s). Then the following asymptotic Bahadur-type representation holds with probability 1:
as n → ∞.
The above results require considerable algebra in some cases, but are otherwise derivable using the results of Haberman [15], Niemiro [26], and Bose and Chatterjee [1]. We omit the proofs of these to avoid lengthy technical discussions. Our next result is to establish an inverse of the projection quantiles. To simplify notations, we assume that the spatial median is the origin $0 \in \mathbb{R}^p$.
Theorem 3.2
Suppose X is an absolutely continuous random variable in $\mathbb{R}^p$. The projection quantile defined as $Q_{proj}(u) = \|u\|^{-1} q_u u$, where $q_u$ is the $(1 + \|u\|)/2$-quantile of $X_U = \|u\|^{-1}\langle X, u\rangle$, and the following function
are inverse functions of each other, for u ≠ 0 and x ≠ 0. The spatial median and u = 0 map to each other.
We prove this result in the Appendix following this paper.
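The inverse map in Theorem 3.2 can be motivated by the following short calculation, offered as a sketch rather than as the paper's proof. It assumes that x = Qproj(u) with u ≠ 0, and that the median of every one-dimensional projection of X is zero, so that qu > 0. The symbol F_x, denoting the distribution function of 〈X, x〉/∥x∥, is notation introduced here.

```latex
\begin{aligned}
x = Q_{proj}(u) = q_u U,\quad q_u > 0
  &\;\Longrightarrow\; U = x/\|x\|, \quad q_u = \|x\|, \\
P[X_U \le q_u] = (1 + \|u\|)/2
  &\;\Longrightarrow\; \|u\| = 2 F_x(\|x\|) - 1, \\
  &\;\Longrightarrow\; u = \{2 F_x(\|x\|) - 1\}\, x/\|x\|.
\end{aligned}
```

Under these assumptions the index u is recovered from x alone: its direction is that of x, and its norm is determined by how far ∥x∥ lies in the upper tail of the projected distribution.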
The projection quantile, and the generalized spatial quantile for all choices of λ ≠ 1, are equivariant under location shifts. That is, the quantiles of Y + a for any $a \in \mathbb{R}^p$ are given by the corresponding quantiles of Y added to a. For Chaudhuri's quantiles, which correspond to λ = 1, both rotation and location equivariance are obtained. Note, however, that when the sample size is considerably larger than the dimension, a simple two-step transformation process is adequate to address invariance issues. This is the transformation–retransformation approach proposed by Chakraborty and Chaudhuri [10]. For data in $\mathbb{R}^p$, isolate p + 1 data points Y0, . . ., Yp, and re-center every other observation by subtracting Y0. Then express the re-centered data in terms of a basis given by {Yi – Y0, i = 1, . . ., p}. The results from the statistical analyses performed on transformed data (excluding the p + 1 isolated points) can be mapped to the original co-ordinate system by a simple back transformation, and would satisfy all the conditions of affine equivariance.
It is clear that the multivariate projection quantile defined in Section 2.2 shares the same kind of robustness properties as a univariate quantile, and Qproj(u) has a breakdown value of (1 + ∥u∥)/2th. The robustness properties of the other generalized spatial quantiles are not so apparent. Chakraborty and Chaudhuri [9] have studied the breakdown value of the spatial median.
3.2. Spatial confidence sets and quantile regression
For any choice of β ∈ (0, 1) and λ ≥ 0, the set of generalized spatial quantiles is a compact, path connected set, and if β1 ≤ β2. Since by varying the choice of β we can consider an entire range of compact sets from the null set to the support of the random vector under study, we propose to use as a generalized spatial confidence set. Different choices of λ correspond to determining the shape of the sets . Later, in Section 4, we show that the choice of the norm also regulates the shape of to some extent.
A challenging task here is to compute the probability . Our next result is to show that projection confidence sets achieve the exact coverage probability of β, for the natural interval resulting from for any linear combination of the coordinates of X.
Theorem 3.3
For every linear combination cTX with ∥c∥ = 1, consider the interval constructed using the projection quantiles corresponding to –βc and βc for any β ∈ (0, 1). This projection quantile based interval has the exact coverage probability of β.
Theorem 3.3 is also proved in the Appendix.
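The coverage claim in Theorem 3.3 can be traced through the definition of the projection quantile. The following is a sketch for a continuous distribution, not a substitute for the proof in the Appendix. The symbols a and b for the interval endpoints along c are introduced here for illustration.

```latex
\begin{aligned}
Q_{proj}(\beta c) &= b\,c, & P[c^T X \le b] &= (1+\beta)/2, \\
Q_{proj}(-\beta c) &= a\,c, & P[-c^T X \le -a] &= (1+\beta)/2
  \;\Longrightarrow\; P[c^T X < a] = (1-\beta)/2, \\
P[a \le c^T X \le b] &= \frac{1+\beta}{2} - \frac{1-\beta}{2} = \beta.
\end{aligned}
```

The second line uses the fact that the projection quantile for –βc is computed from the projection on –c, so its endpoint is a lower quantile of cTX.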
The computation of β for which is achieved for fixed α ∈ (0, 1) and fixed λ is an open problem. For λ = 0 and p = 2, if X follows the uniform distribution on the unit square , we have . For λ = 0, p = 2 and X following the bivariate standard Normal distribution with mean zero and identity dispersion matrix, the relation holds, where Φ(·) is the univariate standard Normal cumulative distribution function. For general multivariate data, we adopt a scheme similar to [33], and in order to find a set with α-level coverage we choose that value of β for which α fraction of the data are inside . Thus, finite-sample coverage properties of our confidence or credible sets are exact.
Multivariate quantiles and coverage sets are important tools for a number of different problems. For example, modern-day Bayesian and resampling-based statistical inference typically involve Monte Carlo sampling from the probability distributions of interest, which are then used to approximate moments, quantiles, credible or confidence regions, and for other statistical purposes. While these inferential tasks are routine when performed for one-dimensional quantities, they can be difficult in higher dimensions. As an illustration, consider a random sample X = (X1, . . ., Xn) from a probability distribution Pθ for some θ ∈ Θ, and suppose g(θ) is the quantity of interest. In a Bayesian study, a prior probability measure π(·) on Θ is used, then typically a Monte Carlo sample θ = (θ1, . . ., θm) is generated from the posterior distribution π(·|X). Posterior quantiles may then be approximated using the order statistics of g(θ1), . . ., g(θm), if $g(\theta) \in \mathbb{R}$. However, if g(θ) is a two- or higher-dimensional vector, obtaining its quantiles or a credible set becomes challenging.
Similarly, if g̃(X) is an estimator of g(θ), bootstrap-based inference will typically proceed by obtaining the Monte Carlo sample $\tilde g(X^*_1), \ldots, \tilde g(X^*_m)$, where the $X^*_j$'s are the resamples of X. Then, functionals of the distribution of g̃(X) can be evaluated empirically in a straightforward way, but if g(θ) is two- or higher-dimensional, obtaining its bootstrap-based confidence region is problematic.
One of the motivating factors for empirical likelihood techniques is that bootstrap confidence sets could not be constructed easily in multi-dimensions. Hence, Owen [27] uses the bootstrap only for calibration. Our methods offer a solution to the open problem of constructing multidimensional bootstrap confidence sets that is different from the depth-based approach advocated by Yeh and Singh [33].
As an example, consider the data on prey of dippers considered in [27]. There, in Fig. 1, 95% confidence regions, constructed from empirical likelihood and Normal theory, are presented for the bivariate means of (Caddis fly larvae, Stonefly larvae) and (Mayfly larvae, other invertebrates). In the top panel of our Fig. 1 we present the bootstrap-based 95% confidence set for the same problem. Notice the lack of convexity for the 95% confidence set for the mean of (Caddis fly larvae, Stonefly larvae), a feature not revealed by the empirical likelihood based region or the Normality-based region. We would like to emphasize that if a convex confidence set is desired, our algorithm can handle that as well with minor changes in the computer code. The extreme variability in the dipper-prey data suggests that the median might be a better choice of location parameter to consider, and the bottom panels of Fig. 1 show the 95% confidence sets for the bivariate medians.
Fig. 1.
Dipper data with 95% projection confidence interval for mean (top row) and median (bottom row).
We describe multivariate quantile regression briefly below. Suppose the ith response is the vector $Y_i \in \mathbb{R}^p$, while the ith covariate is the matrix $X_i \in \mathbb{R}^{p \times d}$. Thus, the data consists of $\{(Y_i, X_i), i = 1, \ldots, n\}$. Multivariate quantile regression models the uth quantile of Yi as a linear transformation of Xi. Adopting notations as earlier of β = ∥u∥, U = u/∥u∥ and for any vector Z that ZU = 〈Z, U〉U and ZU⊥ = Z – ZU, we define the uth quantile regression vector γu as the argument that minimizes
where qi = Xiγu. The simple multivariate quantile regression case is obtained when d = 1. We obtain the classical univariate quantile regression of Koenker and Bassett [20] as a special case with p = 1. Properties of the quantile regression estimator can be derived easily from Section 3.1. The above framework assumes that quantiles of each element of the p-dimensional response are dependent on d covariates. This assumption can be dropped and the number of covariates allowed to vary for each element; while the development is easy the algebra is unwieldy.
3.3. Multivariate order statistics, data-depth and outliers
We now introduce the notion of an order statistic in the context of generalized spatial quantiles. Recall that for a size n sample of real-valued data, the jth order statistic is the value below or equal to which j observations fall, and above which n – j observations fall. The elementary transformation α = j/n ∈ (0, 1] may be used to restate the above notion in terms of the αth order statistic, i.e., the value below or equal to which nα observations fall, and above which n(1 – α) observations fall. For $\mathbb{R}^p$-valued multivariate data X1, . . ., Xn, instead of indexing the order statistics by the values α ∈ (0, 1] we index each observation Xi according to a vector ui in the unit ball of $\mathbb{R}^p$, such that Xi = Qn(ui, λ). That is, for every fixed λ ≥ 0, observation Xi is the uith order statistic for that value of u = ui for which it minimizes
where XjU = 〈Xj, u〉/∥u∥ and XjU⊥ = Xj – XjUu/∥u∥, j = 1, . . ., n. Thus, the order statistics are indexed by directions as well as norms of vectors in the unit ball in $\mathbb{R}^p$.
For illustration, let us consider the λ = 0 case corresponding to the projection quantiles. Here, an observation Xi is the uith order statistic if the sample projection quantile corresponding to ui is Xi itself. In other words, for projection quantiles, observations and their order statistic indices correspond to the sample version of Theorem 3.2 Hence, , where
There are several applications of the above notion of a multivariate order statistic. Define the direction of an order statistic Ui = ui/∥ui∥ and its norm βi = ∥ui∥ for reference. First, all the usage of one-dimensional order statistics and ranks, and similar other univariate summarizations of the data may be carried out for multivariate data by associating βi with Xi (and ignoring Ui). A new notion of data-depth may be developed, with a function of βi being the depth of Xi. Such depths may be used to define another confidence set for multivariate random variables, extending the work of Yeh and Singh [33]. The βi's may also be used for outlier detection. The Ui's are directional data, and can be used for testing whether the data shows spherical symmetry, for example. Tests for multivariate Normality may also be devised using the Ui's and the βi's. Robust analysis of multivariate data, including robust estimation and inference, may be carried out using the above notion of order statistics. These applications will be pursued in future research.
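For the λ = 0 case, the index (Ui, βi) of each observation can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure. The data are assumed to be centred so that no observation is the origin, the sample α-quantile is taken to be the order statistic of rank ⌈nα⌉, and among all α that return Xi as the sample projection quantile the mid-point (rank – 0.5)/n is chosen. The function name and the simulated data are hypothetical.

```python
import math
import random

def order_statistic_index(data, i):
    """Return (U_i, beta_i) for observation i under lambda = 0.

    U_i is the direction of X_i, and beta_i = 2 * alpha_i - 1, where
    alpha_i is derived from the rank of X_i among the projections of
    all observations on U_i.
    """
    x = data[i]
    norm = math.sqrt(sum(c * c for c in x))
    unit = [c / norm for c in x]
    proj = [sum(a * b for a, b in zip(y, unit)) for y in data]
    rank = sum(1 for v in proj if v <= proj[i])  # rank of X_i along U_i
    alpha = (rank - 0.5) / len(data)             # mid-point convention (assumption)
    return unit, 2.0 * alpha - 1.0

rng = random.Random(0)
cloud = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
cloud.append((8.0, 8.0))  # an obvious outlier, added for illustration
betas = [order_statistic_index(cloud, i)[1] for i in range(len(cloud))]
```

The added point receives the largest possible βi because every projection on its direction falls at or below its own. Other points that are extreme in their own directions can share that value, so in practice βi would be read together with Ui. A negative βi would indicate that the centring assumption fails in that direction.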
3.4. Fast computing of generalized spatial quantiles
The computation of projection quantiles Qproj(·) is immediate, and does not require the sample size to be greater than the dimension of the data. However, for arbitrary generalized spatial quantiles, a Newton–Raphson type algorithm may be used when n > p, and for the case of n ≤ p an exhaustive grid-search needs to be carried out for exact computation. Neither alternative is attractive, or viable in high dimensions, hence we present below a coordinate descent algorithm to approximate any generalized spatial quantile, which is applicable regardless of the relationship between n and p, or the choice of λ and β. Recall that generalized spatial quantiles are obtained by minimizing the sample average $n^{-1}\sum_{i=1}^{n}\Psi_{u,\lambda}(X_i, q)$ over q. Our coordinate descent algorithm iterates the following steps till convergence:
1. Start with a tentative minimizer $q^{(0)} = (q_1^{(0)}, \ldots, q_p^{(0)})$ of this criterion. The projection quantiles Qproj(u) may be used for this initial value.
2. For each coordinate i ∈ {1, . . ., p}, sequentially consider the criterion to be a function of $q_i$ only for minimization, and obtain $q_i^{(1)}$ as its minimizer, for i = 1, . . ., p.
3. At the end of the above step, a new vector $q^{(1)} = (q_1^{(1)}, \ldots, q_p^{(1)})$ is obtained. Convergence is achieved if the distance between q(1) and q(0) is small, otherwise the above steps are repeated with q(1) in place of q(0).
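The steps above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation. The criterion is assumed to be the sample average of {(XU – qU)² + λ∥XU⊥ – qU⊥∥²}^{1/2} + β(XU – qU); each coordinate-wise minimization uses a ternary search, which is a choice made here; and the search bracket is assumed wide enough to contain the minimizer. All function names are hypothetical.

```python
import math

def objective(data, q, u, lam):
    """Sample average of the assumed criterion Psi_{u,lambda}(X_i, q)."""
    beta = math.sqrt(sum(c * c for c in u))
    # U = u / ||u||; any fixed direction serves when u = 0.
    unit = [c / beta for c in u] if beta > 0 else [1.0] + [0.0] * (len(u) - 1)
    total = 0.0
    for x in data:
        diff = [xj - qj for xj, qj in zip(x, q)]
        along = sum(d * c for d, c in zip(diff, unit))  # X_U - q_U
        perp_sq = max(sum(d * d for d in diff) - along * along, 0.0)
        total += math.sqrt(along * along + lam * perp_sq) + beta * along
    return total / len(data)

def minimize_1d(f, lo, hi, tol=1e-9):
    """Ternary search for a minimizer of a convex function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def coordinate_descent(data, u, lam, start, tol=1e-6, max_sweeps=200):
    """Cycle through the coordinates of q until a full sweep moves q by less than tol."""
    q = list(start)
    for _ in range(max_sweeps):
        previous = list(q)
        for j in range(len(q)):
            values = [x[j] for x in data]
            span = max(values) - min(values) + 1.0  # assumed-wide bracket

            def along_coordinate(t, j=j):
                # Criterion as a function of the j-th coordinate only.
                return objective(data, q[:j] + [t] + q[j + 1:], u, lam)

            q[j] = minimize_1d(along_coordinate, min(values) - span, max(values) + span)
        if math.dist(q, previous) < tol:
            break
    return q

sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (2.0, 0.5), (-1.0, 0.3), (0.4, -1.5), (0.6, 2.2)]
u, lam = (0.3, 0.2), 1.0
q_hat = coordinate_descent(sample, u, lam, start=[0.5, 0.5])
```

Each sweep performs p one-dimensional minimizations and never needs n > p, which is the property the text emphasizes. The returned point is optimal along each coordinate axis; for criteria that are not differentiable at the solution, that alone does not guarantee a global minimum.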
To check the performance of the above computation method, we implemented it for multivariate Normal data. We simulated 200 multivariate normal random numbers in dimensions ranging between 2 and 8, and computed the correct generalized spatial quantile directly using multidimensional optimization using “nlm” function in the software R, version 2.11.1 and also using the above algorithm. We computed the approximation error of our algorithm and the number of iterations it takes to converge. We defined an iteration as one revision of all the coordinates of the quantile, that is, for Step 2 above being implemented for all i = 1, . . ., p. The relative error in approximation is defined as the Euclidean norm of the difference between the generalized spatial quantile and the approximation obtained by the above algorithm, divided by the norm of the generalized spatial quantile. We use u = (0, . . ., 0, 0.8) for this simulation. The results reported below are not affected by our choice of u, since the coordinate descent methodology is invariant to the choice of u. We considered λ = 0.5, 1, 1.5.
This exercise is repeated 100 times, and the average (E(Rel. Err)) and the standard deviation (SD(Rel. Err)) of the relative error in approximation scaled up by $10^5$, and the average (E(Iter)) and standard deviation (SD(Iter)) of the number of iterations required are reported in Table 1. Note that approximation errors are $O(10^{-5})$ in about 5 iteration steps, thus the above algorithm performs excellently. Also, the number of iterations required does not increase with the dimension. However, since each iteration in p dimensions involves p implementations of Step 2, the actual number of optimizations carried out increases linearly with dimension.
Table 1.
Expectation and standard deviation of the relative approximation error (E(Rel. Err) and SD(Rel. Err)) scaled by a factor of $10^5$, and of the number of iterations (E(Iter) and SD(Iter)) of the coordinate-wise updating algorithm for Normal data, for dimensions 2, 4, 6 and 8 and λ = 0.5, 1, 1.5.
| λ | Statistic | Dimension 2 | Dimension 4 | Dimension 6 | Dimension 8 |
|---|---|---|---|---|---|
| 0.5 | E(Rel. Err) | 0.92 | 0.74 | 0.70 | 0.66 |
| | SD(Rel. Err) | 1.17 | 0.31 | 0.28 | 0.30 |
| | E(Iter) | 5.70 | 5.03 | 4.95 | 4.98 |
| | SD(Iter) | 1.10 | 0.26 | 0.22 | 0.14 |
| 1.0 | E(Rel. Err) | 0.93 | 0.64 | 0.52 | 0.36 |
| | SD(Rel. Err) | 2.01 | 0.29 | 0.25 | 0.22 |
| | E(Iter) | 5.39 | 4.97 | 4.88 | 4.81 |
| | SD(Iter) | 1.14 | 0.17 | 0.33 | 0.39 |
| 1.5 | E(Rel. Err) | 0.62 | 0.56 | 0.36 | 0.30 |
| | SD(Rel. Err) | 0.41 | 0.24 | 0.22 | 0.16 |
| | E(Iter) | 5.33 | 4.91 | 4.73 | 4.56 |
| | SD(Iter) | 0.77 | 0.29 | 0.45 | 0.50 |
4. Simulation examples and applications
We divide this section into three parts. In the first part, we compare three generalized spatial quantiles and the halfspace depth due to [30] on four bivariate densities. We compare the volumes of the 80% coverage sets from each of these four methods, as well as their shape features.
In the second part, we present our projection quantile-based analysis of the DBS electrode placement experiment. We report the 90% confidence set for the region of the human brain where a cognitive ability improvement of 50% or more has been reported. The resulting image clearly shows an asymmetric, non-convex figure, which is in close correspondence with the geometry of the human brain, and in accordance with the medical knowledge relating to Alzheimer's disease.
In the third part, we use projection quantile-based order statistics and Tukey's depth on a microarray experiment, to identify genes that display extraordinary behavior in human cancer cell-cycle regulation. This example illustrates that the prohibitive computational requirements of data-depth, together with the inherent features of high-dimensional data, result in too many points having discretized, low depth values, which in turn results in poor quality inference.
4.1. Comparative inference with quantiles and depth
Data from four bivariate density functions are used for our simulation experiment on comparison of coverage sets obtained by different quantile-based and depth-based methods. The density functions are: (1) the standard bivariate Normal distribution, with the marginals being standard normal and with zero correlation between the two variables, (2) an even mixture of two bivariate Normal components, with the means being (–2, 5) and (2, 5), all variances equal to one and the two correlations being –0.75 and 0.75, respectively:
$$\frac{1}{2}\, N_2\!\left( \begin{pmatrix} -2 \\ 5 \end{pmatrix}, \begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix} \right) + \frac{1}{2}\, N_2\!\left( \begin{pmatrix} 2 \\ 5 \end{pmatrix}, \begin{pmatrix} 1 & 0.75 \\ 0.75 & 1 \end{pmatrix} \right),$$
(3) a standard bivariate-T distribution with 5 and 10 degrees of freedom for the X and Y coordinates, and (4) a standard bivariate double exponential.
We generate a sample of size n = 200 from each of these distributions, and compute the 80% coverage regions obtained by using the projection quantile, generalized spatial quantile using the L1-norm and λ = 1, and Chaudhuri's quantiles, which correspond to the L2-norm and λ = 1. We also use the package depth in R to obtain the 80% coverage region by Tukey's depth, according to the principle of Yeh and Singh [33].
Table 2 provides the volumes of the regions enclosed by the 80% coverage sets for the four distributions. Notice that the volumes differ across distributions for every method, but the four methods give very similar volumes for the bivariate Normal and Student's-t distributions. The shape characteristics of the mixture Normal and the double exponential distributions create some differences in the volumes. However, note from Fig. 2 the difference in shape features of the four coverage sets. The projection quantile method captures the approximate shape of the data-cloud in all the cases and can be non-convex, while the Tukey depth-based and Chaudhuri's quantile-based sets are always near-ellipsoid convex sets. Based on volume alone, the L1-norm based generalized spatial quantile method seems best, with the projection quantile method being a close second.
Table 2.
Volume of the 80% coverage sets in four simulated populations.
| Algorithm | Biv. nor. | Mix nor. | T dist | Double exp. |
|---|---|---|---|---|
| Projection | 9.12 | 22.77 | 14.00 | 36.35 |
| L1 Geometric | 9.30 | 16.52 | 15.18 | 36.30 |
| L2 Geometric | 8.94 | 25.18 | 14.86 | 42.59 |
| Tukey depth | 9.14 | 18.34 | 14.66 | 48.46 |
Fig. 2.
Generalized spatial quantiles in all directions corresponding to 90% coverage for four distributions.
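As an illustration of how the volume of a projection quantile region might be computed in two dimensions, the following sketch traces the boundary over a grid of directions and measures the enclosed area. It assumes the data are centered at the origin, and it takes the projection quantile along a unit vector U to be the (1 + β)/2 sample quantile of ⟨X, U⟩ placed on the ray through U. The function names, the order-statistic quantile without interpolation, the grid of 360 directions and the shoelace area formula are choices made for this example, not the procedure used to produce Table 2.

```python
import math
import random


def sample_quantile(values, prob):
    """Order-statistic estimate of the prob-th quantile (no interpolation)."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, max(0, math.ceil(prob * len(ordered)) - 1))
    return ordered[k]


def projection_quantile(data, direction, beta):
    """(1 + beta)/2 sample quantile of the projections <x, direction>,
    placed back on the ray through the unit vector `direction`."""
    proj = [sum(xj * dj for xj, dj in zip(x, direction)) for x in data]
    r = sample_quantile(proj, 0.5 * (1.0 + beta))
    return tuple(r * dj for dj in direction)


def boundary_2d(data, beta, n_dir=360):
    """Projection quantiles over an equally spaced grid of directions."""
    pts = []
    for k in range(n_dir):
        theta = 2.0 * math.pi * k / n_dir
        direction = (math.cos(theta), math.sin(theta))
        pts.append(projection_quantile(data, direction, beta))
    return pts


def polygon_area(pts):
    """Shoelace formula for a polygon whose vertices are in angular order."""
    s = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        s += x0 * y1 - x1 * y0
    return 0.5 * abs(s)


# Example: area enclosed by the beta = 0.8 projection quantiles
# for 200 simulated bivariate standard Normal points.
rng = random.Random(0)
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
print(polygon_area(boundary_2d(sample, beta=0.8)))
```

Note that β is not itself the coverage level: in this sketch it is simply fixed at 0.8, whereas a coverage set would require β to be tuned until the desired fraction of the sample lies inside the boundary. The boundary is star-shaped about the origin but need not be convex, which is why the area is computed from the ordered boundary points rather than from a convex hull.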
4.2. Analysis of the DBS electrode placement data
The data for this experiment consists of the locations in the brain where the DBS electrodes have been placed, and a binary variable indicating whether more than 50% improvement in cognitive ability has resulted from the brain stimulation. The locations of the DBS electrode placement are given with respect to a common coordinate system defined by the anterior commissure (AC) and posterior commissure (PC) planes and the midline. The medical interest in this experiment centers around the efficacy of the DBS electrode-based treatment for long term improvement in the cognitive ability of a patient suffering from Alzheimer's disease. It is thought that some of the region surrounding the nucleus of the brain should be stimulated for long term improvement; however, the shape or the size of this region is unknown. Our goal here is to map the region of the brain where 90% or more success (defined as >50% improvement in cognitive ability) has been reported. We obtain the region using projection quantiles, by varying β such that 90% coverage is achieved, as described in Section 3.2. This region is displayed in Fig. 3, which also includes the trajectory of the insertion path of some of the electrodes. We also present three two-dimensional cross-sectional plots in the same figure, for greater clarity. Note that the shape of the 90% confidence region in Fig. 3 is irregular, and is neither convex nor symmetric about a point or a line or a plane. However, it closely imitates the shape of the nucleus of the human brain. The region of the brain thus identified from the data using projection quantiles is in agreement with the opinion of scientists and physicians studying Alzheimer's disease; however, biological knowledge about the human brain is still scant.
Fig. 3.
90% 3D confidence set of the final DBS location for patients with 50% or more improvement in cognitive ability, and projections of the set on the lat-ap, ap-vert and lat-vert planes. The paths of the electrodes are also shown.
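One plausible way to vary β until a target coverage is reached can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Section 3.2: the data are assumed to be centered at the origin, a point is counted as inside the region when its projection on its own direction does not exceed the (1 + β)/2 sample quantile of all projections on that direction, and β is found by bisection. The function names and the simulated three-dimensional data are hypothetical.

```python
import math
import random


def radius_along(data, direction, beta):
    """(1 + beta)/2 sample quantile of the projections on a unit vector."""
    proj = sorted(sum(xj * dj for xj, dj in zip(x, direction)) for x in data)
    k = math.ceil(0.5 * (1.0 + beta) * len(proj)) - 1
    return proj[min(len(proj) - 1, max(0, k))]


def empirical_coverage(data, beta):
    """Fraction of sample points inside the projection-quantile region."""
    inside = 0
    for x in data:
        norm = math.sqrt(sum(xj * xj for xj in x))
        if norm == 0.0:  # the center is always inside
            inside += 1
            continue
        direction = tuple(xj / norm for xj in x)
        # Projection of x on its own direction (equals ||x|| up to rounding);
        # computed exactly as in radius_along so that ties compare equal.
        own = sum(xj * dj for xj, dj in zip(x, direction))
        if own <= radius_along(data, direction, beta):
            inside += 1
    return inside / len(data)


def calibrate_beta(data, target, n_bisect=20):
    """Bisection on [0, 1] for a beta whose empirical coverage is >= target.

    Coverage is nondecreasing in beta and equals 1 at beta = 1, so the upper
    end of the bracket always satisfies the coverage requirement.
    """
    lo, hi = 0.0, 1.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if empirical_coverage(data, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


# Example with simulated 3D data standing in for electrode locations.
rng = random.Random(0)
points = [tuple(rng.gauss(0, 1) for _ in range(3)) for _ in range(100)]
beta_90 = calibrate_beta(points, target=0.90)
print(beta_90, empirical_coverage(points, beta_90))
```

Because the empirical coverage is a step function of β, the bisection returns a value at which the coverage is at least the target rather than exactly equal to it.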
4.3. Gene behavior in cell-cycle regulation experimentation
We consider the human cancer cell (HeLa S3) cycle data, available at http://genome-www.stanford.edu/Human-CellCycle/Hela, for this part of our analysis. In this data, [32] identify 1134 genes out of a total of 42 920 as periodic, or cell-cycle regulators, based on a periodicity analysis of the marginal distributional behavior of each gene. Gene-network causality and related dependence across pairs of genes have been reported in [25]. Several other studies report other low-dimensional patterns of genes in this or similar datasets, for example, through the computation of various kinds of correlations between gene-pairs.
Here we are interested in identifying those genes that stand out, compared to the overall data cloud of gene expressions, and thereby are of interest in understanding the cell-cycle regulatory mechanism. A parametric distribution for the underlying population of gene expressions is not easy to express, use as a statistical model, verify in practice, or justify on biological grounds. We use ranking based on the projection quantiles to identify those genes that correspond to extreme quantiles. Also, a ranking based on Tukey's depth is obtained.
Here we report our results on experiment 1 of the HeLa S3 cell cycle data, which has p = 11 time points over which the expressions of the different genes have been obtained. Spellman et al. [29] showed that of the 15 536 genes studied in this experiment, n = 828 are periodic, and are candidates for a possible role in cell-cycle regulation. We ignore the genes that have not been identified as periodic, since they lack relevance in the biological process. For each gene g among the 828 periodic genes, we compute its projection order statistic ug, i.e. the vector such that the sample projection quantile with respect to ug is the gene expression g. Details of this method have been discussed in Section 3.3.
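The idea of a projection order statistic can be sketched as follows. The sketch assumes the data are centered at the origin and uses the plain empirical distribution function of the projections, so that for a nonzero point x the level is β = 2F̂(∥x∥) – 1 along the direction x/∥x∥ and the order statistic is β x/∥x∥. The function names are hypothetical, and the estimator used for the gene data may differ from this one.

```python
import math
import random


def projection_order_statistic(data, x):
    """Return (beta, u) for a nonzero point x, where u = beta * x/||x|| and
    beta = 2 * F_hat(||x||) - 1, with F_hat the empirical distribution
    function of the sample projections on the unit vector x/||x||."""
    norm = math.sqrt(sum(xj * xj for xj in x))
    direction = tuple(xj / norm for xj in x)
    own = sum(xj * dj for xj, dj in zip(x, direction))
    at_or_below = sum(
        1 for y in data
        if sum(yj * dj for yj, dj in zip(y, direction)) <= own
    )
    beta = 2.0 * at_or_below / len(data) - 1.0
    return beta, tuple(beta * dj for dj in direction)


def rank_by_beta(data):
    """Indices of the sample points ordered from most to least extreme,
    together with the list of beta values."""
    betas = [projection_order_statistic(data, x)[0] for x in data]
    order = sorted(range(len(data)), key=lambda k: betas[k], reverse=True)
    return order, betas


# Example: a simulated cloud with one clearly outlying point appended.
rng = random.Random(0)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
cloud.append((10.0, 10.0))
order, betas = rank_by_beta(cloud)
print(order[:5], [betas[k] for k in order[:5]])
```

Points with the largest β are the most extreme along their own direction, so a cutoff on β selects candidate outliers without assuming any particular shape for the data-cloud.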
The genes that have the highest β values are more significant. We set β ≥ 0.9999677 as a cutoff point, based on computations for the projection quantile confidence region of 90% coverage for the standard Normal distribution in $\mathbb{R}^{11}$. However, the data clearly does not fit such a distributional pattern, and only 35 of the genes obtain β ≥ 0.9999677. Tukey's depth for the n = 828 gene profiles in $\mathbb{R}^{11}$ and their β values are presented in Fig. 4. The 35 genes with β values above the cutoff are marked with triangles. Some of these 35 genes are also among those with the least Tukey's depth, but some have higher depth. However, notice that 179 genes among the 828 share the same minimum Tukey's depth. It is extremely unlikely biologically that 179 out of 828 genes would be influential in cell-cycle regulation; thus depth-based inference seems to be greatly influenced by false positives. The high number of genes with very low depth shows that care must be taken with depth-based inference in high dimensions. Note that computing depths precisely in high dimensions is virtually impossible; computing just the Tukey median takes $O(n^{p-1})$ expected time [12]. The potential lack of precision in approximate depth computation may also lead to misleading results.
Fig. 4.

Projection quantile and Tukey depth measure of each periodic gene in cell cycle data. The top 35 genes are indicated by triangles.
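To see why many points can share the same minimal depth, consider the following sketch of an approximate halfspace depth. It is not the computation behind Fig. 4: it replaces the minimum over all directions by a minimum over a finite set of random directions, which yields an upper bound on Tukey's depth, and the function names, the number of directions and the seed are assumptions made for this example.

```python
import bisect
import math
import random


def random_unit_vector(p, rng):
    """Direction drawn uniformly on the unit sphere in p dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def approx_halfspace_depth(data, n_dir=500, seed=0):
    """Upper bound on the halfspace depth of every sample point, obtained by
    minimizing the halfspace fraction over n_dir random directions only."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    depth = [1.0] * n
    for _ in range(n_dir):
        d = random_unit_vector(p, rng)
        proj = [sum(xj * dj for xj, dj in zip(x, d)) for x in data]
        ordered = sorted(proj)
        for k, t in enumerate(proj):
            n_ge = n - bisect.bisect_left(ordered, t)  # points with proj >= t
            n_le = bisect.bisect_right(ordered, t)     # points with proj <= t
            depth[k] = min(depth[k], n_ge / n, n_le / n)
    return depth


# Example: count how many points attain the smallest possible depth 1/n.
rng = random.Random(0)
cloud = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(100)]
depths = approx_halfspace_depth(cloud)
print(sum(1 for v in depths if v == 1 / len(cloud)))
```

Every depth value is a multiple of 1/n, and any point that can be separated from the rest of the sample by a hyperplane attains the smallest value 1/n. As the dimension grows relative to n, a larger share of the sample falls into this tie, which is the discretization effect discussed above.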
We may compare this list of genes with that of Li et al. [21], who studied 20 genes thought to be associated with the human cell-cycle regulatory pattern [14]. Eighteen of these genes are part of our set of 828 genes, and three of these are among the 35 most significant genes according to our projection order statistic based analysis. These genes are PCNA, PLK and CDC20. A match of three out of eighteen possible genes serves as a strong reinforcement of the utility of our approach. Significantly, the Tukey depth-based approach fails to place PCNA among its large list of 179 genes with lowest depth, although it correctly identifies PLK and CDC20 as significant genes.
5. Discussion
The analysis of high dimensional data is a challenging area of research. The traditional approach is to model the data in a probabilistic framework that is often just convenient for the statistician, and/or to replace the high dimensional open problems with lower dimensional ones. Our approach of using generalized spatial quantiles for summarization, estimation and inference is one possible avenue, which neither requires arbitrary probabilistic assumptions nor resorts to reducing the data to lower dimensional structures. We have aimed to build on earlier attempts at using geometric quantiles and related methods, and have tried to integrate several approaches towards a quantile-based analysis of multivariate data that have a commonality between them.
The concept of the generalized spatial quantile provides a common platform showing the connection between the projection quantile and the geometric quantile. The interpretation and applicability of the λ parameter are under study at the moment. The computation of the generalized spatial quantile by means of the iterative algorithm has been demonstrated to work well in examples. Further research is under way to understand the effect of different choices of λ, the effect of outliers on these quantiles, and their breakdown properties.
There are unique challenges in analyzing multimodal data in high dimensions, and in this paper we have not addressed the issue of multimodality or mixture distributions, other than in a small way in Section 4.1. Depending on the application at hand, one option is to use a classification or clustering step, followed by computing multivariate quantiles in each cluster separately. Projection quantiles may turn out to be particularly useful, since they do not require any condition linking cluster size with data dimensions.
Acknowledgments
The second author's research is partially supported by grants from the University of Minnesota, and from the National Science Foundation of the USA.
Appendix
For any , let us adopt the notation FXU for the (absolutely continuous) distribution function of XU. The following result is useful for proving Theorem 3.2.
Lemma A.1
Under the conditions of Theorem 3.2, for every .
Proof of Lemma A.1
For , let x̃ = x/∥x∥. Hence, in our adopted notation. Note that ∥x∥ > 0, and since the spatial median is zero, we have that Gx(∥x∥) > 1/2. Hence, for , we have . Thus, , and hence .
Thus we have
where, recall, the (absolutely continuous) distribution of Xx̃ is FXx̃.
Thus, the result is proved.
Proof of Theorem 3.2
We show that for any , , and for every , .
We start with the first identity. Note that for any ,
Use Lemma A.1 to establish that this is equal to x.
For the other identity, for any , note that . Thus we have ∥u∥ = 2FXU(∥Qproj(u)∥) – 1. Also note that Q̃X = 〈X, Qproj(u)/∥Qproj(u)∥〉 = ∥u∥–1〈X, u〉 = XU. Thus, , and hence 2GQproj(u)(∥Qproj(u)∥) – 1 = 2FXU(∥Qproj(u)∥) – 1 = ∥u∥.
Proof of Theorem 3.3
Note that if –u is the diametrically opposite vector of u, we have $X_{(-u)} = \{-\langle X, U\rangle\}(-U)$ and thus $X_{(-U)} = -\langle X, U\rangle = -X_U$.
We assume that . Note that cTX = 〈X, c〉 ~ FCX(·) by our notation. Along the line , the set carves out the interval (–q(–c), qc), and we have and
Thus .
References
- 1. Bose A, Chatterjee S. Generalized bootstrap for estimators of minimizers of convex functionals. J. Statist. Plann. Inference. 2003;117:225–239.
- 2. Brown BM. Statistical uses of the spatial median. J. R. Stat. Soc. Ser. B Stat. Methodol. 1983;45(1):25–30.
- 3. Brown BM, Hettmansperger TP. Affine invariant rank methods in the bivariate location model. J. R. Stat. Soc. Ser. B Stat. Methodol. 1987;49(3):301–310.
- 4. Brown BM, Hettmansperger TP. An affine invariant bivariate version of the sign test. J. R. Stat. Soc. Ser. B Stat. Methodol. 1989;51(1):117–125.
- 5. Chakraborty B. On affine equivariant multivariate quantiles. Ann. Inst. Statist. Math. 2001;53(2):380–403.
- 6. Chakraborty B, Chaudhuri P. On an adaptive transformation–retransformation estimate of multivariate location. J. R. Stat. Soc. Ser. B Stat. Methodol. 1998;60(1):145–157.
- 7. Chakraborty B, Chaudhuri P. On multivariate rank regression. In: L1-Statistical Procedures and Related Topics, Neuchâtel, 1997. IMS Lecture Notes Monogr. Ser. Vol. 31. Inst. Math. Statist.; Hayward, CA: 1997. pp. 399–414.
- 8. Chakraborty B, Chaudhuri P. On affine invariant sign and rank tests in one- and two-sample multivariate problems. In: Multivariate Analysis, Design of Experiments, and Survey Sampling. Statist. Textbooks Monogr. Vol. 159. Dekker, New York: 1999. pp. 499–522.
- 9. Chakraborty B, Chaudhuri P. A note on the robustness of multivariate medians. Statist. Probab. Lett. 1999;45(3):269–276.
- 10. Chakraborty B, Chaudhuri P. On a transformation and re-transformation technique for constructing an affine equivariant multivariate median. Proc. Amer. Math. Soc. 1996;124(8):2539–2547.
- 11. Chakraborty B, Chaudhuri P, Oja H. Operating transformation retransformation on spatial median and angle test. Statist. Sinica. 1998;8:767–784.
- 12. Chan TM. An optimal randomized algorithm for maximum Tukey depth. Proc. 5th ACM-SIAM Symposium on Discrete Algorithms. 2004:423–429.
- 13. Chaudhuri P. On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. 1996;91(434):862–872.
- 14. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
- 15. Haberman SJ. Concavity and estimation. Ann. Statist. 1989;17:1631–1661.
- 16. Haldane JBS. Note on the median of a multivariate distribution. Biometrika. 1948;35:414–415.
- 17. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. second ed. Springer; 2009.
- 18. Hettmansperger TP, Nyblom J, Oja H. Affine invariant multivariate one-sample sign tests. J. R. Stat. Soc. Ser. B Stat. Methodol. 1994;56(1):221–234.
- 19. Huber PJ. Projection pursuit. With discussion. Ann. Statist. 1985;13(2):435–525.
- 20. Koenker R, Bassett G Jr. Regression quantiles. Econometrica. 1978;46(1):33–50.
- 21. Li X, et al. Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC Bioinformatics. 2006;7:26. doi: 10.1186/1471-2105-7-26.
- 22. Liu RY. On a notion of data depth based on random simplices. Ann. Statist. 1990;18:405–414.
- 23. Liu RY, Parelius JM, Singh K. Multivariate analysis by data depth: descriptive statistics, graphics and inference (with discussion). Ann. Statist. 1999;27:783–858.
- 24. Liu RY, Singh K. A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc. 1993;88:252–260.
- 25. Mukhopadhyay N, Chatterjee SB. Causality and pathway search in microarray time series experiment. Bioinformatics. 2007;23:442–449. doi: 10.1093/bioinformatics/btl598.
- 26. Niemiro W. Asymptotics for M-estimators defined by convex minimization. Ann. Statist. 1992;20:1514–1533.
- 27. Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75(2):237–249.
- 28. Small CG. A survey of multidimensional medians. Int. Stat. Rev. 1990;58:263–277.
- 29. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B. Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273.
- 30. Tukey JW. Mathematics and picturing data. In: James RD, editor. Proceedings of the International Congress on Mathematics. Canadian Math. Congress. 1975;2:523–531.
- 31. Vardi Y, Zhang C-H. The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. 2000;97(4):1423–1426. doi: 10.1073/pnas.97.4.1423.
- 32. Whitfield ML, Sherlock G, Saldanha A, Murray J, Ball CA, Alexander K, Matese J, Perou CM, Hurt M, Brown P, Botstein D. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell. 2002;13:1977–2000. doi: 10.1091/mbc.02-02-0030.
- 33. Yeh A, Singh K. Balanced confidence regions based on Tukey's depth and the bootstrap. J. Roy. Statist. Soc. Ser. B. 1997;59(3):639–652.
- 34. Zuo Y. Multidimensional trimming based on projection depth. Ann. Statist. 2006;34(5):2211–2251.
- 35. Zuo Y, Cui H, Young D. Influence function and maximum bias of projection depth based estimators. Ann. Statist. 2004;32(1):189–218.
- 36. Zuo Y, Serfling R. General notions of statistical depth function. Ann. Statist. 2000;28(2):461–482.



