. Author manuscript; available in PMC: 2025 Aug 16.
Published before final editing as: J Nonparametr Stat. 2024 Jun 6:10.1080/10485252.2024.2360551. doi: 10.1080/10485252.2024.2360551

Sparse kernel sufficient dimension reduction

Bingyuan Liu 1, Lingzhou Xue 1
PMCID: PMC12356157  NIHMSID: NIHMS2002206  PMID: 40824847

Abstract

The sufficient dimension reduction (SDR) with sparsity has received much attention for analysing high-dimensional data. We study a nonparametric sparse kernel sufficient dimension reduction (KSDR) based on the reproducing kernel Hilbert space, which extends the methodology of the sparse SDR based on inverse moment methods. We establish the statistical consistency and efficient estimation of the sparse KSDR under the high-dimensional setting where the dimension diverges as the sample size increases. Computationally, we introduce a new nonconvex alternating directional method of multipliers (ADMM) to solve the challenging sparse SDR and propose the nonconvex linearised ADMM to solve the more challenging sparse KSDR. We study the computational guarantees of the proposed ADMMs and show an explicit iteration complexity bound to reach the stationary solution. We demonstrate the finite-sample properties in simulation studies and a real application.

Keywords: Sufficient dimension reduction, kernel dimension reduction, nonconvex optimization, alternating direction method of multipliers

1. Introduction

Consider the sufficient dimension reduction (SDR) of a high-dimensional random vector $x\in\mathbb{R}^{p}$ in the presence of a response $y$. We aim to find the low-dimensional central subspace $\mathcal{S}_{y|x}\subseteq\mathbb{R}^{p}$ such that $\dim(\mathcal{S}_{y|x})=d\ll p$ and $y\perp\!\!\!\perp x\mid P_{\mathcal{S}_{y|x}}x$, where the dimension $p$ may diverge as the sample size increases, and $P_{\mathcal{S}_{y|x}}$ is the projection of a vector onto the subspace $\mathcal{S}_{y|x}$. The inverse moment methods are extensively studied under the fixed-dimension setting, including the sliced inverse regression (SIR) (Li 1991), sliced average variance estimation (SAVE) (Cook and Weisberg 1991), and directional regression (DR) (Li and Wang 2007). Fan, Xue, and Yao (2017), Luo, Xue, Yao, and Yu (2022), and Yu, Yao, and Xue (2022) extended these inverse moment methods and introduced sufficient forecasting using factor models for high-dimensional predictors when both $n,p\to\infty$. Ying and Yu (2022), Zhang, Xue, and Li (2024), and Zhang, Li, and Xue (2024) studied the sufficient dimension reduction for Fréchet regression and distribution-on-distribution regression. Lin, Zhao, and Liu (2019) introduced the Lasso-SIR under the high-dimensional setting where $p/n\to 0$, and Lin, Zhao, and Liu (2018) introduced the diagonal thresholding SIR (DT-SIR), a new diagonal thresholding screening procedure, under the ultra-high-dimensional setting where $p/n\to\infty$. However, the restrictive linearity condition on the conditional expectation of $x$ given $P_{\mathcal{S}_{y|x}}x$ is required for inverse moment methods. To remove the linearity condition, the kernel sufficient dimension reduction (KSDR) was proposed by Fukumizu, Bach, and Jordan (2009). The nonconvex KSDR requires neither the linearity condition nor an elliptical distribution of $x$, but it poses a significant computational challenge. Wu, Miller, Chang, Sznaier, and Dy (2019) solved this problem using an iterative spectral method, but such a method is hard to extend when a sparsity assumption is imposed in high-dimensional settings. Shi, Huang, Feng, and Suykens (2019) studied nonconvex penalties in a kernel regression problem. Although neither the theoretical properties nor the optimisation algorithm can be applied in SDR settings, it shows the promise of nonconvex penalised KSDR in high-dimensional settings.

The sparse estimation of the central subspace has received considerable attention in the past decade. The main focus of the literature is on sparse SDR based on inverse moment methods, which is equivalent to solving for the generalised eigenvectors of the estimated kernel matrix (Li 2007). For example, the kernel matrix for SIR is the conditional covariance matrix of $x$ given $y$. In view of such equivalence, Li (2007) followed the spirit of the sparse principal component analysis (SPCA) (Zou, Hastie, and Tibshirani 2006; Zou and Xue 2018) and proposed sparse SDR using $\ell_1$ penalisation. Bondell and Li (2009) proposed a shrinkage inverse regression estimation and studied its asymptotic consistency and asymptotic normality. Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation method and established the oracle property when the dimension $p$ is fixed. Lin et al. (2019) established the consistency of the Lasso-SIR by assuming that the number of slices also diverges. Neykov, Lin, and Liu (2016) followed the convex semidefinite programming approach (d'Aspremont, Ghaoui, Jordan, and Lanckriet 2005; Amini and Wainwright 2008) to give a high-dimensional analysis of the sign consistency of DT-SIR in a single-index model. Tan, Wang, Zhang, Liu, and Cook (2018) proposed a sparse SIR estimator based on a convex formulation and established the Frobenius norm convergence of the projection subspace. Lin et al. (2018) established the consistency of DT-SIR by assuming a stability condition for $E(x\mid y)$. Tan, Shi, and Yu (2020) proved the optimality of the $\ell_0$-constrained sparse solution and provided a three-step adaptive estimation to approximate the $\ell_0$ solution with near-optimal properties. By contrast, sparse estimation for KSDR has not been well studied.

In this paper, we propose a nonconvex estimation scheme for sparse KSDR, motivated by the nonconvex estimation of sparse SDR (Li 2007). The proposed methods target the ideal $\ell_0$-constrained estimator and, after approximation, can be solved efficiently with statistical and computational guarantees. We follow d'Aspremont et al. (2005) to reformulate sparse SDR and sparse KSDR as $\ell_0$-constrained trace maximisation problems. It is worth pointing out that KSDR is also equivalent to solving for the eigenvectors of the nonconvex residual covariance estimator in an RKHS (Fukumizu et al. 2009). The residual covariance estimator in the RKHS plays a role similar to that of the kernel matrix in inverse moment methods. Thus, sparse SDR solves an $\ell_0$-constrained convex M-estimation problem, whereas sparse KSDR solves an $\ell_0$-constrained nonconvex M-estimation problem. We provide a high-dimensional analysis for the global solution of the $\ell_0$-constrained SDR and prove its asymptotic properties, including both variable selection consistency and the convergence rate. We demonstrate the power of this general theory by obtaining explicit convergence rates for the $\ell_0$-constrained SDR under the high-dimensional setting where the dimension $p$ diverges. Furthermore, we also provide a high-dimensional analysis of the global solution of the $\ell_0$-constrained KSDR and prove its asymptotic properties, such as consistency in variable selection and estimation, without assuming the linearity condition.

Computationally, the $\ell_0$-constrained trace maximisation is NP-hard. There has been some work showing success in adjusting the weights of coefficients adaptively, using convex penalties such as the adaptive lasso (Zou 2006) and adaptive ridge regression (Frommlet and Nuel 2016). The performance of adaptive approaches relies heavily on the accuracy of the initial estimate. Seeking a more robust approximation of the $\ell_0$ penalisation without an adaptive procedure, we use the folded concave penalisation (Fan and Li 2001; Fan, Xue, and Zou 2014) as an alternative to the $\ell_1$ penalisation in Li (2007), Lin et al. (2019), and Lin et al. (2018). We propose a novel nonconvex alternating direction method of multipliers (ADMM) to solve the folded concave penalised SDR and a nonconvex linearised ADMM to solve the folded concave penalised KSDR. In particular, sparse KSDR has not only a nonconvex loss function but also a folded concave penalty and a positive semidefinite constraint. We provide important convergence guarantees for the proposed nonconvex ADMMs for solving these two challenging nonconvex and nonsmooth problems. We also establish explicit iteration complexity bounds for the proposed nonconvex ADMM and its linearised variant to reach a stationary solution of sparse SDR and sparse KSDR, respectively. To the best of our knowledge, our work is the first in the literature to study the challenging nonconvex and nonsmooth optimisation problem for sparse KSDR.

The rest of this paper is organised as follows. We revisit the methods for sparse SDR in Section 2 and then present the proposed methods for sparse KSDR in Section 3. The proposed nonconvex ADMM and its linearised variant are presented in Section 4, and statistical and computational guarantees are presented in Section 5. Section 6 evaluates the numerical performance in several different simulation settings. Section 7 applies sparse KSDR in a real data analysis. Section 8 includes a few concluding remarks.

Before proceeding, we define some useful notation. Let $\|\theta\|_0=\mathrm{card}\{j:\theta_j\neq 0\}$, and let $|\mathcal{S}|$ be the cardinality of a set $\mathcal{S}$. For a matrix $A$, let $\|A\|_{\infty}$ be the maximum norm, $\|A\|_1$ the matrix $\ell_1$ norm, $\|A\|_2$ the spectral norm, $\|A\|_F$ the Frobenius norm, and $\|A\|_{tr}$ the trace norm, defined as $\|A\|_{tr}=\mathrm{Tr}\{(A^{T}A)^{1/2}\}$. Let $\|\cdot\|_{d}$ denote the norm in a $d$-dimensional Hilbert space. Estimating the central subspace is equivalent to finding a low-rank matrix $\Theta\in\mathbb{R}^{p\times d}$ of full column rank such that $y\perp\!\!\!\perp x\mid\Theta^{T}x$. We may also denote the central subspace by $\mathcal{S}_{y|x}=\mathrm{span}(\Theta)$, the column space of the matrix $\Theta$.

2. Preliminaries

This section first revisits the sparse SDR methods in Li (2007) and then follows d'Aspremont et al. (2005) to reformulate sparse SDR as a trace maximisation problem. Note that SDR aims to estimate the linear combination $\Theta$ such that $y\perp\!\!\!\perp x\mid\Theta^{T}x$, given independently and identically distributed data $(x_1,y_1),\ldots,(x_n,y_n)$. Li (2007) showed that SDR methods based on inverse moments, such as SIR, DR, SAVE, and their variants, essentially solve for the generalised eigenvectors of a certain kernel matrix. When the product of the inverse covariance matrix and the kernel matrix is a normal matrix, the problem is also equivalent to solving for an eigenvector of a certain kernel matrix. More specifically, at the population level, the kernel matrix of SIR is $M_{sir}=\Sigma_x^{-1}\mathrm{cov}\{E(z\mid y)\}$; the kernel matrix of DR is:

$$M_{dr}=2\Sigma_x^{-1}\Big(E\{E^{2}(zz^{T}\mid y)\}+E^{2}\{E(z\mid y)E(z^{T}\mid y)\}+E\{E(z^{T}\mid y)E(z\mid y)\}\,E\{E(z\mid y)E(z^{T}\mid y)\}-I_p\Big),$$

where $z=x-E(x)$ and $\Sigma_x$ is the covariance matrix of all predictors $x$; the kernel matrix of DT-SIR (Lin et al. 2018) is $M_{DT\text{-}sir}=\Sigma_{x_{DT}}^{-1}\mathrm{cov}\{E(x_{DT}-E(x_{DT})\mid y)\}$, where $x_{DT}$ is the subset of random variables in $x$ retained after the diagonal thresholding.

Given a finite sample of size $n$, we obtain an empirical version $\hat M$ of such kernel matrices by replacing the population covariance matrix $\Sigma_x$ with the sample covariance matrix $\hat\Sigma_x$ and all expectations $E$ with empirical means $E_n$. For example, $\hat M_{sir}=\hat\Sigma_x^{-1}\widehat{\mathrm{cov}}\{E_n(z\mid y)\}$. Li (2007) shows that these SDR methods are equivalent to solving for the leading eigenvector of the empirical version of such kernel matrices.
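
To make the slicing construction concrete, here is a minimal numerical sketch (in Python, which is not the implementation language used in the paper) of how $\hat M_{sir}$ could be formed from data. The number of slices, the equal-count slicing scheme, and the small ridge term added for a stable inverse are our own illustrative choices.

```python
import numpy as np

def sir_kernel_matrix(X, y, n_slices=10, ridge=1e-6):
    """Illustrative estimate of the SIR kernel matrix
    M_sir = Sigma_x^{-1} cov(E[z | y]), using equal-count slices of y.
    (A sketch for intuition only, not the paper's implementation.)"""
    n, p = X.shape
    z = X - X.mean(axis=0)                      # centred predictors
    Sigma_x = z.T @ z / n                        # sample covariance
    order = np.argsort(y)                        # slice y by its order statistics
    slices = np.array_split(order, n_slices)
    M_cov = np.zeros((p, p))
    for idx in slices:
        m_h = z[idx].mean(axis=0)                # slice mean, estimates E[z | y in slice]
        M_cov += (len(idx) / n) * np.outer(m_h, m_h)
    # cov of the slice means; ridge keeps the inverse numerically stable
    return np.linalg.solve(Sigma_x + ridge * np.eye(p), M_cov)

# usage: M_hat = sir_kernel_matrix(X, y); its leading eigenvector approximates theta_1
```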

After estimating the kernel matrix M^, we study the sparse estimation of the leading SDR direction, i.e. the largest eigenvector of M^. We use the well-known fact: solving the largest eigenvector θ1 is equivalent to trace optimisation (Wold, Esbensen, and Geladi 1987) as follows:

$$\max_{\theta}\ \mathrm{tr}(\theta^{T}\hat M\theta)\quad\text{s.t.}\ \|\theta\|_2^2\le 1.$$

To impose sparsity, we introduce an $\ell_0$ constraint into the trace maximisation to penalise the number of non-zero elements in the vector $\theta$:

$$\hat\theta_1=\arg\max_{\theta}\ \mathrm{tr}(\theta^{T}\hat M\theta)\quad\text{s.t.}\ \|\theta\|_2^2\le 1,\ \|\theta\|_0\le t.$$

Note that $\mathrm{span}(\theta)=\mathrm{span}(\theta\theta^{T})$. With $\Theta=\theta\theta^{T}$, it is equivalent to solving for the positive semidefinite estimate $\hat\Theta_1$ from

$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat M\Theta)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \mathbf{1}^{T}\|\Theta\|_0\mathbf{1}\le t^{2};\ \Theta\succeq 0;\ \mathrm{rank}(\Theta)=1,$$

where $\|\Theta\|_0$ is defined as the element-wise $\ell_0$ norm. When $\Theta\succeq 0$ and $\mathrm{rank}(\Theta)=1$, $\mathrm{tr}(\Theta)=1$ implies $\|\theta\|_2^2=1$, and $\mathbf{1}^{T}\|\Theta\|_0\mathbf{1}\le t^{2}$ is equivalent to $\|\theta\|_0\le t$. However, the rank-one constraint $\mathrm{rank}(\Theta)=1$ is hard to handle efficiently in practice. Therefore, we consider a relaxation that removes this constraint; instead, we solve for the dominant eigenvector of $\hat\Theta_1$ as $\hat\theta_1$, which is the SDR direction of interest. Generally, this relaxation tends to produce a solution very close to a rank-one matrix (d'Aspremont et al. 2005).

After removing the rank-one constraint, it is equivalent to solving for $\hat\Theta_1$ from the following $\ell_0$ penalised estimation problem for some sufficiently large $\gamma$:

$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat M\Theta)-\gamma\,\mathbf{1}^{T}\|\Theta\|_0\mathbf{1}\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0, \quad (1)$$

where, given $\hat\Theta_1$, we follow d'Aspremont et al. (2005) and truncate it to obtain its dominant (sparse) eigenvector $\hat\theta_1$, which is the leading sparse SDR direction of interest.

In the sequel, given $\hat\theta_1,\ldots,\hat\theta_k$, we solve for the next sparse SDR direction $\hat\theta_{k+1}$ in a sequential fashion. We denote $\hat M_1=\hat M$, and for $k\ge 1$, let $\hat M_{k+1}$ be the projection of $\hat M_k$ onto the space orthogonal to the previous SDR directions. To remove the effect of the previous eigenvectors, $\hat\theta_{k+1}$ is estimated as the largest (sparse) eigenvector of the matrix $\hat M_{k+1}$, which is obtained from $\hat M_k$ by Hotelling's deflation:

$$\hat M_{k+1}=\hat M_k-(\hat\theta_k^{T}\hat M_k\hat\theta_k)\,\hat\theta_k\hat\theta_k^{T}.$$
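
As an illustration, the deflation step above can be written in a few lines; this sketch assumes $\hat\theta_k$ is (re)normalised to unit length.

```python
import numpy as np

def hotelling_deflate(M_k, theta_k):
    """One step of Hotelling's deflation:
    M_{k+1} = M_k - (theta_k' M_k theta_k) theta_k theta_k'."""
    theta_k = theta_k / np.linalg.norm(theta_k)   # unit-norm direction
    return M_k - (theta_k @ M_k @ theta_k) * np.outer(theta_k, theta_k)
```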

Mackey (2009) studied the properties of Hotelling's deflation in this type of sparse eigenvector estimation problem. In particular, when the sparse eigenvector is estimated on the true support, this deflation eliminates the influence of a given eigenvector by setting the associated eigenvalue to zero, and it ensures $\hat\theta_1\perp\hat\theta_2\perp\cdots\perp\hat\theta_k$. For the sparse estimation of $\hat\theta_{k+1}$, we plug $\hat M_{k+1}$ into (1) and solve for $\theta_{k+1}$ from:

$$\max_{\theta\in\mathbb{R}^{p}}\ \mathrm{tr}(\theta^{T}\hat M_{k+1}\theta)\quad\text{s.t.}\ \|\theta\|_2^2\le 1,\ \|\theta\|_0\le t,$$

which is further equivalent to the following constrained problem following the same rank relaxation:

$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat M_{k+1}\Theta)-\gamma\,\mathbf{1}^{T}\|\Theta\|_0\mathbf{1}\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0. \quad (2)$$

Problem (2) is still nonconvex since the $\ell_0$ constraint is included. To address this computational challenge, one existing relaxation uses a folded concave penalty to approximate the $\ell_0$ constraint. Let $P_\lambda(|\Theta|)$ be a folded concave penalty function, such as the SCAD (Fan and Li 2001) or the MCP (Zhang 2010), where $|\Theta|$ is the matrix obtained by taking the absolute value of $\Theta$ element-wise.

Specifically, in this paper, we consider the element-wise SCAD penalty $P_\delta(t)$:

$$P_\delta(t)=\begin{cases}\dfrac{2at}{(a+1)\delta}, & t\le \delta/a,\\[4pt] 1-\dfrac{(\delta-t)^{2}}{(1-1/a^{2})\delta^{2}}, & \delta/a\le t\le\delta,\\[4pt] 1, & t\ge\delta,\end{cases}$$

where $a$ is a constant, usually chosen as 3.7 for the SCAD penalty. In Section 5, we discuss how this penalisation approximates the $\ell_0$ penalisation. We use the SCAD penalty in this paper for simplicity. It is worth mentioning that other nonconvex penalties and approximations, such as the MCP (Zhang 2010) and coefficient thresholding (Liu, Zhang, Xue, Song, and Kang 2024), can also be applied here to achieve the same theoretical properties and similar numerical performance.
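
The following small sketch evaluates the rescaled SCAD penalty exactly as reconstructed in the display above; the vectorised interface and function name are our own.

```python
import numpy as np

def scad_rescaled(t, delta, a=3.7):
    """Rescaled element-wise SCAD penalty P_delta(t) from Section 2, which
    tends to the l0 indicator 1{t != 0} as delta -> 0 (a sketch of the
    piecewise formula above; t may be a scalar or an array)."""
    t = np.atleast_1d(np.abs(np.asarray(t, dtype=float)))
    out = np.ones_like(t)                                    # value 1 when t > delta
    small = t <= delta / a
    mid = (t > delta / a) & (t <= delta)
    out[small] = 2 * a * t[small] / ((a + 1) * delta)
    out[mid] = 1 - (delta - t[mid]) ** 2 / ((1 - 1 / a ** 2) * delta ** 2)
    return out
```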

With this approximation, we solve Θ^1 from the following folded concave penalized estimation of sparse SDR:

$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat M\Theta)-\lambda P_\delta(|\Theta|)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0, \quad (3)$$

and then for k=1,2,,d1, sequentially solve Θ^k+1, given M^k+1, from

$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat M_{k+1}\Theta)-\lambda P_\delta(|\Theta|)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0.$$

Without loss of generality, this idea of estimating leading sparse eigenvectors can also be applied to the sparse principal component analysis (PCA) problem.

Theorem 4 will provide the computational guarantee for the above approximation to the $\ell_0$ penalisation, and Section 6 will present the numerical performance.

3. Methodology

The methodology and theory of sparse SDR depend on the linearity condition and the constant variance condition, which usually hold when the predictors follow an elliptical distribution; however, ellipticity can be violated in many cases. In this section, following the framework built by Fukumizu et al. (2009) and Li, Chun, and Zhao (2012), we extend the framework to achieve sparse KSDR, which requires neither the linearity nor the constant variance condition. The key idea is to construct a conditional covariance operator in a reproducing kernel Hilbert space (RKHS), which evaluates the amount of information in $y$ that cannot be explained by functions of linear combinations of $x$. We then derive the empirical estimator of this conditional covariance operator from finite samples and finally minimise the trace of this estimator subject to sparsity constraints.

First, we provide the necessary definitions in the RKHS to construct the conditional covariance operator. Let $k_x:\Omega_x\times\Omega_x\to\mathbb{R}$ and $k_y:\Omega_y\times\Omega_y\to\mathbb{R}$ be positive definite kernel functions, and let $\mathcal{H}_x$ and $\mathcal{H}_y$ be the RKHSs generated by $k_x$ and $k_y$, such that $\mathcal{H}_x$ satisfies the reproducing property:

$$\forall x\in\Omega_x,\ k_x(\cdot,x)\in\mathcal{H}_x;\qquad \forall x\in\Omega_x,\ \forall f\in\mathcal{H}_x,\ \langle f,k_x(\cdot,x)\rangle_{\mathcal{H}_x}=f(x).$$

And $\mathcal{H}_y$ satisfies the corresponding reproducing property. Then define $\Sigma_{xx}:\mathcal{H}_x\to\mathcal{H}_x$ to be the variance operator of $x$, which satisfies $\langle f,\Sigma_{xx}g\rangle_{\mathcal{H}_x}=\mathrm{cov}(f(x),g(x))$ for all $f,g\in\mathcal{H}_x$, and define $\Sigma_{yy}$ analogously. Define $\Sigma_{yx}:\mathcal{H}_x\to\mathcal{H}_y$ as the covariance operator satisfying $\langle g,\Sigma_{yx}f\rangle_{\mathcal{H}_y}=\mathrm{cov}(g(y),f(x))$ for all $g\in\mathcal{H}_y$ and $f\in\mathcal{H}_x$, and let $\Sigma_{xy}=\Sigma_{yx}^{*}$ be the adjoint operator of $\Sigma_{yx}$. Based on Baker (1973), there exist unique bounded operators $V_{yx}$ and $V_{xy}$ satisfying:

$$\Sigma_{yx}=\Sigma_{yy}^{1/2}V_{yx}\Sigma_{xx}^{1/2};\qquad \Sigma_{xy}=\Sigma_{xx}^{1/2}V_{xy}\Sigma_{yy}^{1/2}.$$

Then, the conditional covariance operator can be defined as:

$$\Sigma_{yy|x}=\Sigma_{yy}-\Sigma_{yy}^{1/2}V_{yx}V_{xy}\Sigma_{yy}^{1/2}.$$

For convenience, we sometimes write it as $\Sigma_{yy|x}=\Sigma_{yy}-\Sigma_{yx}\Sigma_{xx}^{\dagger}\Sigma_{xy}$, where $\Sigma_{xx}^{\dagger}$ is the pseudo-inverse of $\Sigma_{xx}$. Specifically, we take $\Sigma_{xx}^{\dagger}=(\Sigma_{xx}+\epsilon_s I)^{-1}$, the Tychonoff-regularised inverse of $\Sigma_{xx}$, for some sufficiently small constant $\epsilon_s$.

Also, define the sparse orthogonal space $S_d^p=\{\Theta\in M(p\times d;\mathbb{R}):\Theta^{T}\Theta=I_d;\ \|\Theta_{\cdot i}\|_0\le t \text{ for } i=1,\ldots,d\}$, representing the leading $d$ sparse SDR directions. When $d=1$, it represents the leading sparse SDR vector.

In what follows, we study the sparse estimation of $\Theta\in S_d^p$ such that $y\perp\!\!\!\perp x\mid\Theta^{T}x$. To simplify notation, we introduce $S=\Theta^{T}x$, and $k_s$, $\Sigma_{ss}$, $\Sigma_{ys}$ and $\Sigma_{sy}$ are defined in the same manner as $k_x$, $\Sigma_{xx}$, $\Sigma_{yx}$ and $\Sigma_{xy}$. Then we have:

$$\Sigma_{yy|s}=\Sigma_{yy}-\Sigma_{ys}\Sigma_{ss}^{\dagger}\Sigma_{sy}.$$

As shown in Theorem 4 of Fukumizu et al. (2009), if $\mathcal{H}_x$ and $\mathcal{H}_y$ are rich enough RKHSs, then $y\perp\!\!\!\perp x\mid\Theta^{T}x\Leftrightarrow\Sigma_{yy|s}=\Sigma_{yy|x}$, and for any $\Theta\in S_d^p$ we have $\Sigma_{yy|\Theta^{T}x}\succeq\Sigma_{yy|x}$. This conclusion also holds on our sparse orthogonal space, under the sparsity assumption on the SDR vectors.

This result indicates that such a conditional covariance operator is a good estimator for sufficient dimension reduction at the population level. However, the population-level operator is not available in practice. At the sample level, we need to derive the empirical version of such an operator in coordinate representation and minimise it. To start with, we have the coordinate representation of the kernel matrix as:

$$K_x=\begin{pmatrix}k_x(x_1,x_1)&\cdots&k_x(x_1,x_n)\\ \vdots&\ddots&\vdots\\ k_x(x_n,x_1)&\cdots&k_x(x_n,x_n)\end{pmatrix}.$$

Let $G_x=QK_xQ$ be the centred Gram matrix, where $Q=I_n-\mathbf{1}_n\mathbf{1}_n^{T}/n$. Similarly, we define $K_y$, $K_s$, $G_y$, $G_s$. Following Lemma 1 of Li et al. (2012), under the spanning systems

$$\hat{\mathcal{H}}_X=\mathrm{span}\{k_x(\cdot,x_i)-E_n k_x(\cdot,x): i=1,\ldots,n\},\qquad \hat{\mathcal{H}}_Y=\mathrm{span}\{k_y(\cdot,y_i)-E_n k_y(\cdot,y): i=1,\ldots,n\},$$

The variance and covariance operators can be written as:

$$\hat\Sigma_{xx}=G_x,\quad \hat\Sigma_{yx}=G_x,\quad \hat\Sigma_{xy}=G_y,\quad \hat\Sigma_{yy}=G_y,\quad \hat\Sigma_{ss}=G_s,\quad \hat\Sigma_{sy}=G_y,\quad \hat\Sigma_{ys}=G_s.$$

Plugging these coordinate representations into $\Sigma_{yy|s}$, we derive the coordinate representation of $\hat\Sigma_{yy|s}$ as:

$$\hat\Sigma_{yy|s}=G_y-G_sG_s^{\dagger}G_y,$$

where $G_s^{\dagger}=(G_s+\epsilon_s I)^{-1}$ is the Tychonoff-regularised inverse of $G_s$ for some sufficiently small constant $\epsilon_s$. We then minimise $\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})$ over $\Theta\in S_d^p$ to estimate the sparse SDR directions.
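
For intuition, the sample objective $\mathrm{tr}(\hat\Sigma_{yy|\theta^{T}x})$ can be evaluated directly from centred Gram matrices. The sketch below assumes a Gaussian kernel with a fixed bandwidth, which is an illustrative choice rather than the paper's kernel or tuning scheme.

```python
import numpy as np

def gaussian_gram(u, bandwidth=1.0):
    """Centred Gaussian Gram matrix G = Q K Q with Q = I - 11'/n."""
    u = np.atleast_2d(u)
    if u.shape[0] == 1:
        u = u.T                                   # treat 1-d input as a column
    sq = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

def ksdr_trace(theta, X, Gy, eps=1e-3, bandwidth=1.0):
    """tr(Sigma_hat_{yy | theta' x}) = tr(G_y - G_s (G_s + eps I)^{-1} G_y),
    the sample objective minimised over theta in S_1^p (illustrative sketch)."""
    n = X.shape[0]
    Gs = gaussian_gram(X @ theta, bandwidth)
    resid = Gy - Gs @ np.linalg.solve(Gs + eps * np.eye(n), Gy)
    return np.trace(resid)

# usage: Gy = gaussian_gram(y); value = ksdr_trace(theta, X, Gy)
```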

To deal with the orthogonality constraint in Sdp, we use a similar sequential procedure as in Section 2. To this end, we solve θ^1 from the following trace minimisation to minimise the sum of conditional variance:

$$\min_{\theta\in S_1^p}\ \mathrm{tr}(\hat\Sigma_{yy|\theta^{T}x}),$$

where $S_1^p=\{\theta\in\mathbb{R}^{p}:\|\theta\|_2=1,\ \|\theta\|_0\le t\}$. It is worth pointing out the equivalence between the above trace minimisation problem and the following regularised nonparametric regression problem:

$$\min_{\Theta\in S_1^p}\ \min_{f\in\mathcal{H}_{\mathcal{X}_{\mathcal{S}}}}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{a=1}^{\infty}\Big|\xi_a(Y_i)-\frac{1}{n}\sum_{j=1}^{n}\xi_a(Y_j)-\Big\{f(X_i)-\frac{1}{n}\sum_{j=1}^{n}f(X_j)\Big\}\Big|^{2}+\frac{\epsilon_n}{n}\|f\|^{2}_{\mathcal{H}_{\mathcal{X}_{\mathcal{S}}}}.$$

This equivalence provides insight into sparse KSDR, and it is also useful to establish the theoretical properties in Section 5.

In sparse KSDR, we assume y is the function of θ1Tx,,θdTx. In this case, θ1,,θd are the true sparse SDR vectors of our interest. Previously, we have introduced how to estimate θ1. Here, we introduce how to solve these non-leading SDR directions sequentially.

Given that we have already estimated the orthogonal sufficient directions $\hat\theta_1,\ldots,\hat\theta_k$ for $k\ge 1$, we sequentially solve for the next KSDR direction $\theta_{k+1}$ by minimising $\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta^{T}x})$. By the orthogonality of all sufficient directions, $\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta^{T}x})=\mathrm{tr}(\Sigma_{yy|(\hat\theta_1\hat\theta_1^{T}+\cdots+\hat\theta_k\hat\theta_k^{T}+\theta\theta^{T})x})$. Hence, we only need to solve

$$\hat\theta_{k+1}^{init}=\arg\min_{\theta\in S_1^p}\ \mathrm{tr}\big(\Sigma_{yy|(\hat\theta_1\hat\theta_1^{T}+\cdots+\hat\theta_k\hat\theta_k^{T}+\theta\theta^{T})x}\big).$$

Then we project $\hat\theta_{k+1}^{init}$ onto the space orthogonal to all previous KSDR vectors and denote the result as the final estimator $\hat\theta_{k+1}$.

Here, we justify why this projection procedure is valid. Define $\theta_p=\mathcal{P}_{(\hat\theta_1,\ldots,\hat\theta_k)^{\perp}}\theta$ as the projection of $\theta$ onto the direction orthogonal to all previous $k$ KSDR vectors, so that $\mathrm{span}\{\hat\theta_1,\ldots,\hat\theta_k,\theta\}=\mathrm{span}\{\hat\theta_1,\ldots,\hat\theta_k,\theta_p\}$. By the definition of the conditional covariance operator, we have:

$$\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta^{T}x})=\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta_p^{T}x}).$$

Since the component of $\theta$ that is not orthogonal to the previous KSDR directions has no effect on reducing the trace, we only need to consider the projection $\theta_p=\mathcal{P}_{(\hat\theta_1,\ldots,\hat\theta_k)^{\perp}}\theta$ for the sparse estimation of the next KSDR direction $\theta_{k+1}$. The projection procedure guarantees that $\hat\theta_1\perp\hat\theta_2\perp\cdots\perp\hat\theta_k$.

Finally, we discuss how to handle the $\ell_0$ constraint and the rank-one constraint. Recall that $P_\lambda(|\Theta|)$ is a folded concave penalty. In practice, we solve the following folded concave relaxation of sparse KSDR to approximate the $\ell_0$ constraint in $S_1^p$, and solve for $\hat\Theta_1$ from

$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ P_\delta(|\Theta|)\le t;\ \Theta\succeq 0,$$

which can be rewritten in terms of a folded concave penalised estimation problem:

$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})+\lambda P_\delta(|\Theta|)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0.$$

Given $\hat\Theta_1$, we follow Section 2 and truncate it to obtain the first KSDR direction $\hat\theta_1$. Next, in a sequential fashion, given $\hat\theta_1,\ldots,\hat\theta_k$ for $k\ge 1$, we solve for $\hat\Theta_{k+1}$ from

$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}\big(\hat\Sigma_{yy|(\hat\theta_1\hat\theta_1^{T}+\cdots+\hat\theta_k\hat\theta_k^{T}+\Theta)^{T}x}\big)+P_\lambda(|\Theta|)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0,$$

where we have used the fact that $\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta^{T}x})=\mathrm{tr}(\Sigma_{yy|\hat\theta_1^{T}x,\ldots,\hat\theta_k^{T}x,\theta_p^{T}x})$. Then we compute the projection $\Theta_{k+1}^{proj}=\mathcal{P}_{(\hat\theta_1,\ldots,\hat\theta_k)^{\perp}}\hat\Theta_{k+1}$ and solve for its dominant (sparse) eigenvector as the desired sparse KSDR direction $\hat\theta_{k+1}$.

4. Nonconvex optimisation

Section 4 first proposes an efficient nonconvex ADMM to solve the nonconvex optimisation of sparse SDR and then presents an efficient nonconvex linearised ADMM to solve the even more challenging nonconvex optimisation of sparse KSDR. Both nonconvex ADMMs enjoy the important computational guarantees, which will be presented in Section 5.

4.1. Nonconvex ADMM for sparse SDR

In this section, we propose a novel nonconvex ADMM to solve the challenging nonconvex and nonsmooth optimisation in the sparse SDR. The proposed nonconvex ADMM enjoys convergence guarantees to the '$\epsilon$-stationary solution' (see the definition in Theorem 5) with an explicit iteration complexity bound. The notion of convergence for nonconvex algorithms varies across algorithms; the $\epsilon$-stationary solution introduced in Jiang, Lin, Ma, and Zhang (2019) is an applicable one, characterising convergence in terms of the augmented Lagrangian function. More details about the $\epsilon$-stationary solution and the computational guarantees can be found in Theorem 5 of Section 5.1.

To begin with, we introduce a new variable $\Omega$ and a slack variable $\Phi$. By adding the equality constraint $\Theta-\Omega-\Phi=0$, (3) is equivalent to:

$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ -\mathrm{tr}(\hat M\Theta)+P_\lambda(|\Theta|)+\frac{\mu}{2}\|\Phi\|_F^2\quad\text{s.t.}\ \mathrm{tr}(\Omega)=1;\ \Omega\succeq 0;\ \Theta-\Omega-\Phi=0.$$

The slack variable $\Phi$ plays an essential role in the convergence guarantee of our proposed nonconvex ADMM. Accordingly, the augmented Lagrangian function is

$$\mathcal{L}_{\rho}(\Theta,\Omega,\Phi,\Lambda)=-\mathrm{tr}(\hat M\Theta)+P_\lambda(|\Theta|)+\frac{\mu}{2}\|\Phi\|_F^2+\mathbb{1}\{\mathrm{tr}(\Omega)=1,\ \Omega\succeq 0\}+\langle\Lambda,\Theta-\Omega-\Phi\rangle+\frac{\rho}{2}\|\Theta-\Omega-\Phi\|_F^2,$$

where $\Lambda$ is the Lagrange multiplier and $\rho>0$. Given the initial value $(\Omega^{0},\Phi^{0},\Lambda^{0})$, we solve $(\Theta^{q+1},\Omega^{q+1},\Phi^{q+1},\Lambda^{q+1})$ iteratively for $q=0,1,2,\ldots$ as follows:

$$\Theta^{q+1}=\arg\min_{\Theta}\ \mathcal{L}_{\rho}(\Theta,\Omega^{q},\Phi^{q},\Lambda^{q})+\frac{h}{2}\|\Theta-\Theta^{q}\|_F^2$$
$$\Omega^{q+1}=\arg\min_{\Omega}\ \mathcal{L}_{\rho}(\Theta^{q+1},\Omega,\Phi^{q},\Lambda^{q})$$
$$\Phi^{q+1}=\Phi^{q}-\gamma\nabla_{\Phi}\mathcal{L}_{\rho}(\Theta^{q+1},\Omega^{q+1},\Phi^{q},\Lambda^{q})$$
$$\Lambda^{q+1}=\Lambda^{q}-\rho(\Theta^{q+1}-\Omega^{q+1}),$$

where $h>0$ is chosen to guarantee the convexity of the $\Theta$ step, and $\gamma$ and $\rho$ are chosen to guarantee convergence; their choices will be discussed in Theorem 5. Note that the $\Phi$ step and the $\Lambda$ step are straightforward. Fortunately, we also have closed-form solutions in both the $\Theta$ step and the $\Omega$ step. In the $\Theta$ step, we know that

$$\min_{\Theta}\ \mathcal{L}_{\rho}(\Theta,\Omega^{q},\Phi^{q},\Lambda^{q})+\frac{h}{2}\|\Theta-\Theta^{q}\|_F^2=\min_{\Theta}\ \frac{\rho+h}{2}\Big\|\Theta-\frac{1}{\rho+h}\big(\rho\Omega^{q}+\rho\Phi^{q}+h\Theta^{q}-\Lambda^{q}+\hat M\big)\Big\|_F^2+P_\lambda(|\Theta|),$$

which can be analytically solved by using the univariate folded concave thresholding rule $S_\lambda(b,c)=\arg\min_{u}\ \frac{1}{2}(u-b)^{2}+cP_\lambda(|u|)$. More specifically, we have

$$\Theta^{q+1}=S_\lambda\Big(\frac{1}{\rho+h}\big(\rho\Omega^{q}+\rho\Phi^{q}+h\Theta^{q}-\Lambda^{q}+\hat M\big),\ \frac{\rho+h}{2}\Big).$$

In the Ω step, it is easy to see that

$$\min_{\Omega}\ \mathcal{L}_{\rho}(\Theta^{q+1},\Omega,\Phi^{q},\Lambda^{q})=\min_{\Omega\succeq 0;\ \mathrm{tr}(\Omega)=1}\ \Big\|\Omega+\Phi^{q}-\Theta^{q+1}-\frac{1}{\rho}\Lambda^{q}\Big\|_F^2,$$

which can be solved by the projection onto the simplex in the Euclidean space (see Algorithm 1 of Ma 2013). To solve for $\Omega^{q+1}$, we take the spectral decomposition of $W^{q}=\Theta^{q+1}-\Phi^{q}+\frac{1}{\rho}\Lambda^{q}$ as $W^{q}=U\mathrm{diag}(\sigma_1,\ldots,\sigma_p)U^{T}$ and then analytically solve the projection of $\sigma$ onto the simplex in the Euclidean space. Denoting the solution by $\hat\xi=(\hat\xi_1,\ldots,\hat\xi_p)$, it has the closed form:

$$\hat\xi=\arg\min_{\xi}\ \frac{1}{2}\|\xi-\sigma\|_2^2\quad\text{s.t.}\ \sum_{j=1}^{p}\xi_j=1,\ \xi_j\ge 0.$$

Then we obtain the closed-form solution $\Omega^{q+1}=U\mathrm{diag}(\hat\xi_1,\ldots,\hat\xi_p)U^{T}$. In the $\Phi$ step, the gradient is:

$$\nabla_{\Phi}\mathcal{L}_{\rho}(\Theta^{q+1},\Omega^{q+1},\Phi,\Lambda^{q})=\mu\Phi-\Lambda^{q}+\rho(\Phi+\Omega^{q+1}-\Theta^{q+1}).$$

Now, we summarise the proposed nonconvex ADMM in Algorithm 1.

Algorithm 1: Proposed nonconvex ADMM for solving the sparse SDR.
Require: $(\Omega^{0},\Phi^{0},\Lambda^{0})\in\mathbb{R}^{p\times p}\times\mathbb{R}^{p\times p}\times\mathbb{R}^{p\times p}$. For $q=0,1,\ldots$, do:
  $\Theta^{q+1}=S_\lambda\big(\frac{1}{\rho+h}(\rho\Omega^{q}+\rho\Phi^{q}+h\Theta^{q}-\Lambda^{q}+\hat M),\ \frac{\rho+h}{2}\big)$;
  $\Omega^{q+1}=U\mathrm{diag}(\hat\xi_1,\ldots,\hat\xi_p)U^{T}$, where $\Theta^{q+1}-\Phi^{q}+\frac{1}{\rho}\Lambda^{q}=U\mathrm{diag}(\sigma_1,\ldots,\sigma_p)U^{T}$;
  $\Phi^{q+1}=\Phi^{q}-\gamma\{\mu\Phi^{q}-\Lambda^{q}+\rho(\Phi^{q}+\Omega^{q+1}-\Theta^{q+1})\}$;
  $\Lambda^{q+1}=\Lambda^{q}-\rho(\Theta^{q+1}-\Omega^{q+1})$.
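
To make Algorithm 1 concrete, here is a self-contained numerical sketch in Python. It follows our reconstruction of the updates; the univariate folded concave thresholding is computed by a simple grid search rather than a closed-form rule, and the step sizes are illustrative defaults rather than the theoretically prescribed choices of Theorem 5.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """Standard SCAD penalty p_lam(|t|), applied element-wise."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))

def scad_threshold(B, lam, c, n_grid=201):
    """Element-wise argmin_u 0.5*(u-b)^2 + c*p_lam(|u|), solved by a grid
    search between 0 and b (an illustrative stand-in for the closed-form
    folded concave thresholding rule)."""
    b = B.ravel()
    grid = np.linspace(0.0, 1.0, n_grid)[None, :] * b[:, None]
    obj = 0.5 * (grid - b[:, None]) ** 2 + c * scad_penalty(grid, lam)
    u = grid[np.arange(b.size), obj.argmin(axis=1)]
    return u.reshape(B.shape)

def simplex_projection(sigma, s=1.0):
    """Euclidean projection of a vector onto {xi >= 0, sum(xi) = s}."""
    srt = np.sort(sigma)[::-1]
    css = np.cumsum(srt)
    rho_idx = np.nonzero(srt - (css - s) / np.arange(1, sigma.size + 1) > 0)[0][-1]
    tau = (css[rho_idx] - s) / (rho_idx + 1.0)
    return np.maximum(sigma - tau, 0.0)

def sparse_sdr_admm(M_hat, lam, rho=5.0, mu=1.0, h=1.0, gamma=0.05, n_iter=200):
    """Minimal sketch of Algorithm 1 (nonconvex ADMM for sparse SDR) under our
    reconstruction of the updates; not the authors' implementation."""
    p = M_hat.shape[0]
    Theta = np.eye(p) / p
    Omega, Phi, Lam = Theta.copy(), np.zeros((p, p)), np.zeros((p, p))
    for _ in range(n_iter):
        # Theta step: proximal folded concave thresholding;
        # c = 1/(rho+h) so the element-wise problem matches
        # ((rho+h)/2)(u-b)^2 + p_lam(|u|).
        B = (rho * Omega + rho * Phi + h * Theta - Lam + M_hat) / (rho + h)
        Theta = scad_threshold(B, lam, c=1.0 / (rho + h))
        # Omega step: eigen-decomposition plus projection onto the spectral simplex
        W = Theta - Phi + Lam / rho
        W = (W + W.T) / 2
        evals, U = np.linalg.eigh(W)
        Omega = U @ np.diag(simplex_projection(evals)) @ U.T
        # Phi step: one gradient step on the augmented Lagrangian
        Phi = Phi - gamma * (mu * Phi - Lam + rho * (Phi + Omega - Theta))
        # dual update
        Lam = Lam - rho * (Theta - Omega)
    evals, U = np.linalg.eigh((Theta + Theta.T) / 2)
    return U[:, -1]            # dominant (sparse) eigenvector as theta_hat_1
```

For example, the output of the earlier SIR sketch could be passed in as `M_hat`, and the returned vector approximates the leading sparse SDR direction $\hat\theta_1$.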

4.2. Nonconvex linearised ADMM for sparse KSDR

In this section, we provide a detailed derivation of the nonconvex linearised ADMM algorithm for sparse KSDR. We present the algorithm for solving the leading KSDR vector, and the algorithm for the following vectors is similar. To begin with, we introduce new variables $\Omega_1$ and $\Omega_2$ and slack variables $\Phi_1$ and $\Phi_2$. By adding the two equality constraints $\Theta-\Omega_1-\Phi_1=0$ and $\Theta-\Omega_2-\Phi_2=0$, (2.7) is equivalent to:

$$\min_{\Theta,\Omega_1,\Omega_2,\Phi_1,\Phi_2}\ \mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})+P_\lambda(|\Omega_1|)+\frac{\mu}{2}\big(\|\Phi_1\|_F^2+\|\Phi_2\|_F^2\big)\quad\text{s.t.}\ \Theta-\Omega_1-\Phi_1=0,\ \Theta-\Omega_2-\Phi_2=0,\ \mathrm{tr}(\Omega_2)=1,\ \Omega_2\succeq 0.$$

The slack variables $\Phi_1$ and $\Phi_2$ are essential for the convergence guarantee of our proposed algorithm. Accordingly, the augmented Lagrangian function is

$$\mathcal{L}_{\rho}(\Theta,\Omega_1,\Omega_2,\Phi_1,\Phi_2,\Lambda_1,\Lambda_2)=\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})+\mathbb{1}\{\mathrm{tr}(\Omega_2)=1,\ \Omega_2\succeq 0\}+\frac{\mu}{2}\big(\|\Phi_1\|_F^2+\|\Phi_2\|_F^2\big)+P_\lambda(|\Omega_1|)+\langle\Lambda_1,\Theta-\Omega_1-\Phi_1\rangle+\langle\Lambda_2,\Theta-\Omega_2-\Phi_2\rangle+\frac{\rho}{2}\|\Theta-\Omega_1-\Phi_1\|_F^2+\frac{\rho}{2}\|\Theta-\Omega_2-\Phi_2\|_F^2,$$

where $\Lambda_1$ and $\Lambda_2$ are the Lagrange multipliers and $\rho>0$.

It is important to point out that $\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})$ introduces additional computational challenges, since the $\Theta$ step does not lead to a closed-form solution. To this end, we introduce a linearised counterpart of this trace function:

$$\mathrm{tr}\big(\hat\Sigma_{yy|\Theta^{(q)T}x}\big)+\big\langle\Theta-\Theta^{q},\ \nabla_{\Theta}\mathrm{tr}\big(\hat\Sigma_{yy|\Theta^{(q)T}x}\big)\big\rangle.$$

After plugging in this linearised term, we denote the new Lagrangian function by $\mathcal{L}^{*}_{\rho}(\Theta,\Theta^{q},\Omega_1,\Omega_2,\Phi_1,\Phi_2,\Lambda_1,\Lambda_2)$. Given the initial solution $(\Omega_1^{0},\Omega_2^{0},\Phi_1^{0},\Phi_2^{0},\Lambda_1^{0},\Lambda_2^{0})$, we solve $(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^{q+1},\Phi_2^{q+1},\Lambda_1^{q+1},\Lambda_2^{q+1})$ iteratively for $q=0,1,2,\ldots$ as follows:

$$\Theta^{q+1}=\arg\min_{\Theta}\ \mathcal{L}^{*}_{\rho}(\Theta,\Theta^{q},\Omega_1^{q},\Omega_2^{q},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})+\frac{h}{2}\|\Theta-\Theta^{q}\|_F^2$$
$$\Omega_1^{q+1}=\arg\min_{\Omega_1}\ \mathcal{L}_{\rho}(\Theta^{q+1},\Omega_1,\Omega_2^{q},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})$$
$$\Omega_2^{q+1}=\arg\min_{\Omega_2}\ \mathcal{L}_{\rho}(\Theta^{q+1},\Omega_1^{q+1},\Omega_2,\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})$$
$$\Phi_1^{q+1}=\Phi_1^{q}-\gamma\nabla_{\Phi_1}\mathcal{L}_{\rho}(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})$$
$$\Phi_2^{q+1}=\Phi_2^{q}-\gamma\nabla_{\Phi_2}\mathcal{L}_{\rho}(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})$$
$$\Lambda_1^{q+1}=\Lambda_1^{q}-\rho(\Theta^{q+1}-\Omega_1^{q+1})$$
$$\Lambda_2^{q+1}=\Lambda_2^{q}-\rho(\Theta^{q+1}-\Omega_2^{q+1}),$$

where h>0 is chosen to guarantee the convexity of Θ step and γ, ρ are chosen to guarantee the algorithmic convergence (see Theorem 10). In what follows, we present the closed-form solution for each subproblem.

For the $\Theta$ step, with the linearisation, $\Theta^{q+1}$ can be obtained by setting the derivative to zero:

$$\nabla_{\Theta}\mathcal{L}^{*}_{\rho}(\Theta,\Theta^{q},\Omega_1^{q},\Omega_2^{q},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})+h(\Theta-\Theta^{q})=0.$$

Defining $g(\Theta)=G_s(G_s+\epsilon_n I)^{-1}$, we have $\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})=\mathrm{tr}\{(I_n-g(\Theta))G_y\}$, and the derivative can be obtained by the chain rule of matrix derivatives:

$$\nabla_{\Theta}\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x})=\big(\nabla_{\mathrm{vec}(\Theta)}\mathrm{vec}(g(\Theta))\big)^{T}\,\nabla_{\mathrm{vec}(g(\Theta))}\mathrm{tr}(\hat\Sigma_{yy|\Theta^{T}x}),$$

which has a complicated closed form. For space considerations, the detailed calculation of the closed-form derivative and its numerical approximation is deferred to the supplement. Solving the linearised subproblem then yields the minimiser:

$$q_1(\Theta,\Omega_1,\Omega_2,\Phi_1,\Phi_2,\Lambda_1,\Lambda_2)=\frac{\rho(\Omega_1+\Phi_1+\Omega_2+\Phi_2)-\Lambda_1-\Lambda_2-\nabla_{\Theta}\mathrm{tr}(\hat\Sigma_{yy|\Theta^{qT}x})+h\Theta^{q}}{2\rho+h}.$$
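
For illustration, the linearised $\Theta$ update $q_1(\cdot)$ is a single closed-form expression once the gradient of the trace term is available, whether from the closed form in the supplement or, as assumed in this sketch, from a numerical approximation supplied by the caller.

```python
import numpy as np

def linearised_theta_step(grad_tr, Theta_q, Om1, Om2, Ph1, Ph2, La1, La2, rho, h):
    """Closed-form minimiser q_1(.) of the linearised Theta subproblem above;
    grad_tr is the gradient of tr(Sigma_hat_{yy|Theta' x}) evaluated at Theta_q."""
    return (rho * (Om1 + Ph1 + Om2 + Ph2) - La1 - La2 - grad_tr + h * Theta_q) / (2 * rho + h)
```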

For the $\Omega_1$ step, it is equivalent to solving the subproblem

$$\min_{\Omega_1}\ P_\lambda(|\Omega_1|)+\langle\Lambda_1,\Theta-\Omega_1-\Phi_1\rangle+\frac{\rho}{2}\|\Theta-\Omega_1-\Phi_1\|_F^2=\min_{\Omega_1}\ \frac{\rho}{2}\Big\|\Omega_1-\Theta+\Phi_1-\frac{\Lambda_1}{\rho}\Big\|_F^2+P_\lambda(|\Omega_1|).$$

This can be efficiently solved by using the univariate penalised least squares solution:

$$\Omega_1^{q+1}=S_\lambda\Big(\Theta^{q+1}-\Phi_1^{q}+\frac{\Lambda_1^{q}}{\rho},\ \frac{\rho}{2}\Big),$$

where $S_\lambda$ has been defined in Section 4.1. For the $\Omega_2$ step, we solve the optimisation problem:

$$\Omega_2^{q+1}=\arg\min_{\mathrm{tr}(\Omega_2)=1,\ \Omega_2\succeq 0}\ -\langle\Lambda_2,\Omega_2\rangle+\frac{\rho}{2}\|\Theta-\Omega_2-\Phi_2\|_F^2=\arg\min_{\mathrm{tr}(\Omega_2)=1,\ \Omega_2\succeq 0}\ \frac{\rho}{2}\Big\|\Omega_2+\Phi_2-\Theta-\frac{\Lambda_2}{\rho}\Big\|_F^2.$$

This can be solved by the projection onto the simplex in the Euclidean space; since exactly the same procedure was presented in Section 4.1, we omit it here. The $\Phi_1$ and $\Phi_2$ steps are straightforward:

$$\Phi_1^{q+1}=\Phi_1^{q}-\gamma\{\mu\Phi_1^{q}-\Lambda_1^{q}+\rho(\Phi_1^{q}+\Omega_1^{q+1}-\Theta^{q+1})\}$$
$$\Phi_2^{q+1}=\Phi_2^{q}-\gamma\{\mu\Phi_2^{q}-\Lambda_2^{q}+\rho(\Phi_2^{q}+\Omega_2^{q+1}-\Theta^{q+1})\}.$$

Now we can summarise the proposed nonconvex linearised ADMM in Algorithm 2:

Algorithm 2: Nonconvex linearised ADMM for solving sparse KSDR.
Require: $(\Omega_1^{0},\Omega_2^{0},\Phi_1^{0},\Phi_2^{0},\Lambda_1^{0},\Lambda_2^{0})$, all in $\mathbb{R}^{p\times p}$. For $q=0,1,\ldots$, do:
  $\Theta^{q+1}=q_1(\Theta^{q},\Omega_1^{q},\Omega_2^{q},\Phi_1^{q},\Phi_2^{q},\Lambda_1^{q},\Lambda_2^{q})$;
  $\Omega_1^{q+1}=S_\lambda\big(\Theta^{q+1}-\Phi_1^{q}+\frac{\Lambda_1^{q}}{\rho},\ \frac{\rho}{2}\big)$;
  $\Omega_2^{q+1}=U\mathrm{diag}(\hat\xi_1,\ldots,\hat\xi_p)U^{T}$, where $\Theta^{q+1}-\Phi_2^{q}+\frac{1}{\rho}\Lambda_2^{q}=U\mathrm{diag}(\sigma_1,\ldots,\sigma_p)U^{T}$;
  $\Phi_1^{q+1}=\Phi_1^{q}-\gamma\{\mu\Phi_1^{q}-\Lambda_1^{q}+\rho(\Phi_1^{q}+\Omega_1^{q+1}-\Theta^{q+1})\}$;
  $\Phi_2^{q+1}=\Phi_2^{q}-\gamma\{\mu\Phi_2^{q}-\Lambda_2^{q}+\rho(\Phi_2^{q}+\Omega_2^{q+1}-\Theta^{q+1})\}$;
  $\Lambda_1^{q+1}=\Lambda_1^{q}-\rho(\Theta^{q+1}-\Omega_1^{q+1})$;
  $\Lambda_2^{q+1}=\Lambda_2^{q}-\rho(\Theta^{q+1}-\Omega_2^{q+1})$.

5. Theoretical properties

In this section, we establish theoretical guarantees for both the sparse SDR and sparse KSDR estimators. For both, we first study the statistical guarantees of the $\ell_0$-constrained estimator, showing consistency while allowing $p$ to diverge as $n\to\infty$. We then study how our folded concave penalised estimators approximate the $\ell_0$-constrained estimators and prove the computational guarantees of our proposed nonconvex ADMM and its linearised version.

5.1. Asymptotic properties of sparse SDR

In this subsection, we study the statistical properties of the $\ell_0$-constrained SDR, such as consistency and the convergence rate, as well as the computational guarantees of the folded concave penalised SDR. The $\ell_0$-constrained SDR

$$\hat\theta_1=\arg\max_{\theta}\ \mathrm{tr}(\theta^{T}\hat M\theta)\quad\text{s.t.}\ \|\theta\|_2^2\le 1,\ \|\theta\|_0\le t$$

achieves the desired theoretical properties under the high-dimensional setting, while the folded concave penalised SDR is computationally feasible and asymptotically converges to the $\ell_0$-constrained SDR. First, we define the true sufficient dimension vectors. For the selected method, $d$ is the smallest integer such that the top $d$ eigenvectors of $M$, denoted $\theta_i$ for $i=1,\ldots,d$, satisfy $y\perp\!\!\!\perp x\mid(\theta_1^{T}x,\ldots,\theta_d^{T}x)$. Then, by definition, $\theta_1,\ldots,\theta_d$ are the true sufficient vectors. Now we introduce assumptions (A1)–(A3).

(A1) $E(\xi^{T}x\mid\theta_1^{T}x,\ldots,\theta_d^{T}x)$ is a linear function of $\theta_1^{T}x,\ldots,\theta_d^{T}x$ for any $\xi\in\mathbb{R}^{p}$.

(A2) $\mathrm{cov}(x\mid\theta_1^{T}x,\ldots,\theta_d^{T}x)$ is degenerate (i.e. non-random).

(A3) The true sufficient dimension vectors $\theta_i$ satisfy $\|\theta_i\|_0\le t$ for $i=1,\ldots,d$.

Note that (A1), (A2), and (A3) are also called the linearity condition, the constant variance condition, and the sparsity condition, respectively. (A1) and (A2) hold when $x$ follows an elliptical distribution. (A3) is a standard assumption in the sparse SDR literature. See Li (1991), Li and Wang (2007), Li (2007), and Luo et al. (2022) for justifications of these conditions.

Let $\sin\angle(\theta_a,\theta_b)=\sqrt{1-(\theta_a^{T}\theta_b)^{2}}$. We first study the convergence rate of the first estimated sparse SDR direction $\hat\theta_1$. Theorem 1 derives this estimation bound based on the general estimation bound $\|\hat M-M\|_2=O_p(\tau_{n,p})$.

Theorem 1: Suppose (A1)–(A3) hold, $\|\hat M-M\|_2=O_p(\tau_{n,p})$, $\lambda_1>\lambda_2$, $\|\theta_1\|_0\le t$, and the spectral decomposition of $M$ exists. Then we have:

$$\sin\angle(\hat\theta_1,\theta_1)=O_p(\tau_{n,p}).$$

In practice, $\tau_{n,p}$ has been derived for each specific SDR method. Given the explicit rate $\tau_{n,p}$ for different methods, Corollary 1 provides the corresponding explicit convergence rates.

Corollary 1: (a) Suppose (A1)–(A3) hold, $\lambda_1>\lambda_2$, $\|\theta_1\|_0\le t$, the covariance matrix $\Sigma_x=I_{p\times p}$, and $p=o(n^{1/2})$ as $n\to\infty$. Then for sparse SIR and sparse DR, we have:

$$\sin\angle(\hat\theta_1,\theta_1)=O_p\big((p/n)^{1/2}\big).$$

(b) Let $v\in(0,1)$ correspond to the stability of the central curve $E(x\mid y)$. When $p=o(n)$, with a proper choice of the number of slices $H=(np)^{\frac{1}{2v+2}}$ and with (A3)–(A4) in Lin et al. (2018) satisfied, for sparse SIR we have:

$$\sin\angle(\hat\theta_1,\theta_1)\to 0\quad\text{as } n\to\infty.$$

Further, if $\Sigma_x=I_{p\times p}$, we have the convergence rate:

$$\sin\angle(\hat\theta_1,\theta_1)=O_p\big((p/n)^{\frac{1}{2v+2}}\big).$$

(c) Let $s=|\mathcal{S}|\le p$, where $\mathcal{S}=\{i:\theta_{ji}\neq 0\ \text{for some } 1\le j\le d\}$. Under the high-dimensional setting where $p\gg n$, assuming (A3)–(A6) and (S1) in Lin et al. (2018) and $\Sigma_x=I_{p\times p}$, for the sparse DT-SIR with a proper choice of $H=(np)^{\frac{1}{2v+2}}$, we have the convergence rate:

$$\sin\angle(\hat\theta_1,\theta_1)=O_p\big((s/n)^{\frac{1}{2v+2}}\big),$$

where the stability $v$ of the central curve $E(x\mid y)$ is defined in Definition 1 of Lin et al. (2019); we also provide the complete definition in the supplementary material.

It is worth mentioning that our framework is very flexible: by applying different SDR methods to estimate the matrix $M$, we can derive convergence rates under different settings. Let $J=\{j:\theta_{1j}\neq 0\}$ be the support set of $\theta_1$. We further prove the variable selection consistency in Theorem 2.

Theorem 2: Under the assumptions of Theorem 1, if additionally $\min_{j\in J}|\theta_{1j}|\ge C_0\tau_{n,p}$ for some constant $C_0>0$ and $\|\theta_1\|_0=t$, then $\mathrm{supp}(\hat\theta_1)=\mathrm{supp}(\theta_1)$ with probability tending to 1 as $n\to\infty$.

Next, recall that in Section 2, for $k\ge 2$, the same optimisation procedure further estimates $\hat\theta_k$ from $\hat M_k$. We study the estimation bound and consistency of $\hat\theta_k$ $(k\ge 2)$ in Theorem 3.

Theorem 3: Suppose that the conditions in Theorems 1 and 2 are satisfied, $\|\theta_k\|_0\le t$, $\lambda_k>\lambda_{k+1}$, and $\lambda_k=\mathcal{O}(1)$ for $1\le k\le d-1$. Then for $1\le k\le d-1$, we have:

$$\|\hat M_{k+1}-M_{k+1}\|_2=O_p(\tau_{n,p}),$$
$$\sin\angle(\hat\theta_{k+1},\theta_{k+1})=O_p(\tau_{n,p}),$$

where $\hat M_{k+1}$ and $M_{k+1}$ are defined in Section 2.

Now, we provide the computational guarantees of our proposed nonconvex ADMM. First, we show that the folded concave penalised estimation well approximates the $\ell_0$ penalisation. To this end, we use the rescaled folded concave penalty function $P_\delta$ defined in Section 2. It is easy to see that $\lim_{\delta\to 0}P_\delta(t)=1\{t\neq 0\}$ for any $t$; thus $P_\delta(t)$ converges to the $\ell_0$ penalisation. For a matrix $\Theta$, we define the penalisation $P_\delta(|\Theta|)=\sum_{i=1}^{p}\sum_{j=1}^{d}P_\delta(|\Theta_{ij}|)$, the sum of the univariate rescaled penalties over all entries. Here, we reveal the connection between the $\ell_0$ penalised programme (1) and the scaled folded concave penalised programme (3) in the following theorem.

Theorem 4: Suppose we choose a sequence $\delta_u$ such that $\lim_{u\to\infty}\delta_u=0^{+}$. For programme (3), we choose the penalisation function $P_\delta(t)$ as defined above. Let $\Theta_u$ be the global solution of (3) with $\delta=\delta_u$, and let $\Theta^{*}$ be a limit point of $\{\Theta_u\}$. Then $\Theta^{*}$ is also a global solution of (1).

Since the nonconvex problem (3) has widely varying curvature, it is nontrivial to prove algorithmic convergence. By taking advantage of the structured nonconvexity, the next theorem shows that our proposed nonconvex ADMM converges to an $\epsilon$-stationary solution with an explicit iteration complexity bound. In the following theorem, we use $\mathrm{vec}(\Theta)$ to denote the vectorisation of a matrix in $\mathbb{R}^{p\times d}$ into a vector in $\mathbb{R}^{pd}$.

Theorem 5: Let $L$ be the Lipschitz bound of the loss function, and suppose

$$\rho>\max\Big\{\frac{18}{3+\sqrt{61}}L,\ 6hL,\ \frac{2+2\sqrt{19}}{3}L\Big\},\qquad \gamma\in\Bigg(\frac{\sqrt{13\rho^{2}-12\rho L-72L^{2}}-2\rho}{9\rho^{2}-12\rho L-72L^{2}},\ \frac{\sqrt{13\rho^{2}-12\rho L-72L^{2}}+2\rho}{9\rho^{2}-12\rho L-72L^{2}}\Bigg).$$

For any $0<\epsilon<\min\{1/L,\ 1/(3h)\}$, within at most $\mathcal{O}(1/\epsilon^{4})$ iterations, our proposed algorithm obtains an $\epsilon$-stationary solution $(\hat\Theta,\hat\Omega,\hat\Lambda)$, as defined in Jiang et al. (2019); that is, for any matrix $\Theta\in\mathbb{R}^{p\times p}$ with finite Frobenius norm,

$$\big(\mathrm{vec}(\Theta-\hat\Theta)\big)^{T}\,\mathrm{vec}\big({-\hat M}+\partial P_\lambda(|\hat\Theta|)+\hat\Lambda\big)\ge-\epsilon,$$
$$\|\hat\Theta-\hat\Omega\|\le\epsilon,$$
$$\mathrm{tr}(\hat\Omega)=1,\qquad \hat\Omega\succeq 0.$$

Remark: When $\epsilon=0$, these conditions exactly recover the KKT conditions, so the solution is a stationary solution. We will show the numerical performance of such a near-stationary solution in the next section.

5.2. Asymptotic properties of sparse KSDR

In this section, we establish the consistency of the sparse kernel sufficient dimension reduction under milder assumptions. Based on Fukumizu et al. (2009), we impose the following assumptions (A4)–(A7).

(A4) For any bounded continuous function $g$ on $\mathcal{Y}$, the mapping $\Theta\mapsto E_{x}\big[\{E(g(y)\mid\Theta^{T}x)\}^{2}\big]$ is continuous on $\Theta\in S_k^p$ for $k=1,2,\ldots,d$.

(A5) For any $k=1,2,\ldots,d$, let $\Theta\in S_k^p$, and let $P_\Theta$ be the probability distribution of the random variable $\Theta^{T}x$ on $\mathcal{X}$. The Hilbert space $\mathcal{H}_{\mathcal{X}_{\mathcal{S}}}+\mathbb{R}$ is dense in $L^{2}(P_\Theta)$ for any $\Theta\in S_k^p$.

(A6) There is a measurable function $\phi:\mathcal{X}\to\mathbb{R}$ such that $E[\phi(x)^{2}]<\infty$ and the Lipschitz condition $\|k_s(\Theta^{T}x,\cdot)-k_s(\tilde\Theta^{T}x,\cdot)\|\le\phi(x)D(\Theta,\tilde\Theta)$ holds for all $\Theta,\tilde\Theta\in S_k^p$ and $x\in\mathcal{X}$ for any $k=1,2,\ldots,d$, where the kernel $k_s(x,\cdot)$ is defined in Section 3 and $D$ is a distance compatible with the topology of $S_k^p$. We also assume $\|\Sigma_{yy}\|_{tr}<\infty$ and $\|\Sigma_{yx}\|_{HS}<\infty$.

(A7) The true sparse SDR directions $\Theta=(\theta_1,\ldots,\theta_d)$ satisfy $\Theta\in S_d^p$.

Note that (A1) and (A2) are not required for the sparse KSDR, and we restate the sparsity assumption (A3) on our sparse orthogonal space as (A7). Assumptions (A4)–(A6) are mild regularity conditions in the RKHS, which are well justified in Fukumizu et al. (2009).

Theorem 6: Suppose assumptions (A4)–(A7) hold, the true variable set $\mathrm{supp}(\theta_1)$ is correlated with only a finite number of variables, the chosen kernel function is continuous and bounded, and the regularisation parameter $\epsilon_n$ satisfies:

$$\epsilon_n\to 0;\qquad \Big(\frac{n}{p^{t}}\Big)^{\frac{1}{t+2}}\epsilon_n\to\infty;\qquad p\asymp n^{\alpha}\ \text{for some }\alpha<1/t,\qquad \text{as } n\to\infty.$$

Then the functions $\mathrm{tr}(\hat\Sigma_{yy|\theta^{T}x})$ and $\mathrm{tr}(\Sigma_{yy|\theta^{T}x})$ are continuous on $S_1^p$, and we have convergence in probability:

$$\sup_{\theta\in S_1^p}\big|\mathrm{tr}(\hat\Sigma_{yy|\theta^{T}x})-\mathrm{tr}(\Sigma_{yy|\theta^{T}x})\big|\to 0.$$

Theorem 6 establishes the consistency of the first KDR vector. The growth rate assumption in the theorem says that the allowable growth rate of $p$ is negatively affected by the sparsity level $t$: when $t$ is smaller, $p$ can diverge faster as $n\to\infty$. Building on this uniform convergence, we establish the variable selection consistency in the following Theorem 7, allowing a diverging dimension $p$.

Theorem 7: Under all conditions of Theorem 6, for $p\asymp n^{\alpha}$ with $\alpha<1/t$ as $n\to\infty$, and for $k=1,\ldots,d$, we have:

$$\Pr\big(\mathrm{supp}(\hat\theta_k)=\mathrm{supp}(\theta_k)\big)\to 1.$$

Once variable selection consistency is achieved, we can further establish the estimation consistency of $\hat\theta$:

Theorem 8: Suppose all conditions of Theorem 7 hold and $p\asymp n^{\alpha}$ with $\alpha<1/t$ as $n\to\infty$. Denote:

$$\hat\theta_1=\arg\min_{\theta}\ \mathrm{tr}(\Sigma_{yy|\theta^{T}x})\quad\text{s.t.}\ \|\theta\|_2=1,\ \|\theta\|_0\le t,$$

and define the corresponding matrix $\tilde\Theta_1=\hat\theta_1\hat\theta_1^{T}$. When $\tilde\Theta_1$ is nonempty, for any arbitrary open set $U$ in $S_1^p$ with the true direction $\theta\theta^{T}\in U$, we have

$$\lim_{n\to\infty}\Pr(\tilde\Theta_1\in U)=1. \quad (4)$$

For the following directions, let $\theta_1,\ldots,\theta_k$ be the true first $k$ KDR vectors, and

$$\hat\theta_{k+1}=\arg\min_{\theta}\ \mathrm{tr}\Big(\Sigma_{yy|\big(\sum_{i=1}^{k}\hat\theta_i\hat\theta_i^{T}+\theta\theta^{T}\big)^{T}x}\Big)\quad\text{s.t.}\ \|\theta\|_2=1,\ \|\theta\|_0\le t.$$

Denote $\tilde\Theta_{k+1}=\sum_{i=1}^{k}\hat\theta_i\hat\theta_i^{T}+\hat\theta_{k+1}\hat\theta_{k+1}^{T}$ and the true space $\Theta_{k+1}=\sum_{i=1}^{k+1}\theta_i\theta_i^{T}$. When the $\hat\theta_i$ are all nonempty for $i=1,\ldots,k$, for any arbitrary open set $U$ in $S_{k+1}^p$ with $\Theta_{k+1}\in U$, we have

$$\lim_{n\to\infty}\Pr(\tilde\Theta_{k+1}\in U)=1. \quad (5)$$

Theorem 8 extends the KSDR consistency results of Fukumizu et al. (2009) to the diverging-$p$ case. Making use of the $\ell_0$-constrained parameter space, we obtain both variable selection and estimation consistency under the stated conditions.

Following the statistical properties, Theorem 9 shows the asymptotic equivalence of the folded concave constraint and the $\ell_0$ constraint, and Theorem 10 shows that our nonconvex linearised ADMM guarantees convergence to an $\epsilon$-stationary solution.

Similar to Theorem 4, for Theorem 9 we consider

$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\Sigma_{yy|\Theta^{T}x})+\gamma\,\mathbf{1}^{T}\|\Theta\|_0\mathbf{1}\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0,$$
$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \mathrm{tr}(\Sigma_{yy|\Theta^{T}x})+\gamma P_\delta(|\Theta|)\quad\text{s.t.}\ \mathrm{tr}(\Theta)=1;\ \Theta\succeq 0,$$

where PδΘ is defined in Section 2. Considering d=1, we can also solve sufficient directions iteratively, as introduced in Section 2.

Theorem 9: Suppose we choose a sequence $\delta_u$ such that $\lim_{u\to\infty}\delta_u=0^{+}$. For programme (4.12), we choose the penalisation function $P_\delta(t)$ as defined above. Let $\Theta_u$ be the global minimiser of (4.12) with $\delta=\delta_u$, and let $\Theta^{*}$ be a limit point of $\{\Theta_u\}$. Then $\Theta^{*}$ is also a global minimiser of (4.11).

Parallel to Theorem 5, we show the property of the $\epsilon$-stationary solution of our proposed nonconvex linearised ADMM in Theorem 10.

Theorem 10: Let $L$ be the Lipschitz constant of the loss function. Let

$$\rho>\max\Big\{\frac{18}{3+\sqrt{61}}L,\ 6hL,\ \frac{2+2\sqrt{19}}{3}L\Big\}$$

and

$$\gamma\in\Bigg(\frac{\sqrt{13\rho^{2}-12\rho L-72L^{2}}-2\rho}{9\rho^{2}-12\rho L-72L^{2}},\ \frac{\sqrt{13\rho^{2}-12\rho L-72L^{2}}+2\rho}{9\rho^{2}-12\rho L-72L^{2}}\Bigg).$$

For any $0<\epsilon<\min\{1/L,\ 1/(3h)\}$, within at most $\mathcal{O}(1/\epsilon^{4})$ iterations, our proposed algorithm obtains an $\epsilon$-stationary solution $(\hat\Theta,\hat\Omega_1,\hat\Omega_2,\hat\Lambda_1,\hat\Lambda_2)$, as defined in Jiang et al. (2019); that is, for any $\Theta\in\mathbb{R}^{p\times p}$ with finite Frobenius norm,

$$\big(\mathrm{vec}(\Theta-\hat\Theta)\big)^{T}\,\mathrm{vec}\big(\nabla_{\Theta}\mathrm{tr}(\hat\Sigma_{yy|\hat\Theta^{T}x})+\hat\Lambda_1+\hat\Lambda_2\big)\ge-\epsilon,$$
$$\big(\mathrm{vec}(\Omega_1-\hat\Omega_1)\big)^{T}\,\mathrm{vec}\big(\partial P_\lambda(|\hat\Omega_1|)-\hat\Lambda_1\big)\ge-\epsilon,$$
$$\|\hat\Theta-\hat\Omega_1\|\le\epsilon,\qquad \|\hat\Theta-\hat\Omega_2\|\le\epsilon,$$
$$\mathrm{tr}(\hat\Omega_2)=1,\qquad \hat\Omega_2\succeq 0.$$

6. Numerical properties

In this section, we evaluate the performance of the sparse SIR/DR/KSDR estimator in our semidefinite relaxation framework, the DT-SIR estimator proposed by Lin et al. (2019), the variable selection of penalised linear regression with SCAD penalisation and marginal correlation screening.

Here we compare the numerical performance of these methods in four data generation models as follows with sample size n=100.

  • Model 1: y=0.5x1+x20.5+0.5x1x2+1.52+1+.05x1x22+σ2e, where eN0,1, xN0,I10 and σ=0.2,0.4,0.8 for Model 1a–1c.

  • Model 2: $y=\frac{1}{2}(x_1^{2}+x_2^{2})+\sigma^{2}e$, where $e\sim N(0,1)$, $x\sim N(0,I_{20})$, and $\sigma=0.1,0.2,0.3$ for Models 2a–2c.

  • Model 3: $y=\frac{1}{2}(x_1+1)^{2}+\sigma^{2}e$, where $e\sim N(0,1)$, $x\sim N(0,I_{40})$, and $\sigma=0.2,0.4,0.8$ for Models 3a–3c.

  • Model 4: $y=\sin(2\pi x_1+1)+\sigma^{2}e$, where $\sigma=0.2,0.4,0.8$ for Models 4a–4c, $e\sim N(0,1)$, and $x$ is uniformly distributed on $\{x\in\mathbb{R}^{10}:|x_i|>0.7\ \text{for } i=1,2,\ldots,10\}$.

To compare the numerical performance of these methods, we replicate the data generation procedure 100 times and apply all methods to each of the 100 replications. We compute the estimated sparse SDR vectors $\hat\theta_k$ for $k=1,2,\ldots,d$ for each method each time. Then, we evaluate the performance of the different methods using the multiple $R^2$, the True Positive Rate (TPR), and the True Negative Rate (TNR). The multiple $R^2$ is defined as $R^2=\frac{1}{d}\sum_{k=1}^{d}\|P_{\mathrm{span}(\Theta)}\hat\theta_k\|_2^2$. The TPR is defined as the fraction of relevant variables correctly specified as non-zero in at least one of the SDR vectors, and the TNR is the fraction of irrelevant variables correctly specified as zero in all SDR vectors.
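
As an illustration of how these criteria can be computed, the following sketch evaluates the multiple $R^2$, TPR, and TNR from a matrix of true directions and a matrix of estimated directions; the normalisation by $\|\hat\theta_k\|_2^2$, which guards against estimates that are not exactly unit length, is our own choice.

```python
import numpy as np

def evaluate_sdr(Theta_true, Theta_hat, tol=1e-8):
    """Multiple R^2, TPR and TNR as defined above, for p x d matrices whose
    columns are the true and estimated SDR directions (illustrative sketch)."""
    Q, _ = np.linalg.qr(Theta_true)                 # orthonormal basis of span(Theta)
    d = Theta_hat.shape[1]
    proj = Q @ (Q.T @ Theta_hat)                    # P_span(Theta) theta_hat_k, column-wise
    r2 = np.sum(np.sum(proj ** 2, axis=0) / np.sum(Theta_hat ** 2, axis=0)) / d
    true_active = np.any(np.abs(Theta_true) > tol, axis=1)
    est_active = np.any(np.abs(Theta_hat) > tol, axis=1)
    tpr = np.mean(est_active[true_active])          # relevant variables recovered
    tnr = np.mean(~est_active[~true_active])        # irrelevant variables excluded
    return r2, tpr, tnr
```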

The simulation results are summarised in Tables 1–3. In Tables 1 and 2, we report the mean and standard deviation of these three criteria for sparse SIR/DR/KSDR, DT-SIR, and SCAD penalised regression. The marginal correlation method does not produce an $R^2$, so we report only its TPR and TNR in Table 3. The tuning procedure is as follows. For the proposed methods, we choose the penalisation parameter $\lambda$ via an oracle tuning procedure. Specifically, we construct two independent datasets with the same sample size $n=100$ for training and tuning, respectively. For each $\lambda$, we estimate the parameter $\Theta$ on the training dataset and evaluate the corresponding loss function on the tuning dataset. We choose the penalisation parameter that minimises the loss on the tuning dataset. In practice, one may use cross-validation to select the penalisation parameter.

Table 1.

Summary of R2, TP and TN for model 1 and 2.

SpKSDR DT-SIR SpSIR SpDR SCAD SpKSDR DT-SIR SpSIR SpDR SCAD
Model 1a Model 2a

R2 0.833 0.843 0.980 0.995 0.656 0.975 0.224 0.215 0.965 0.172
(0.303) (0.153) (0.098) (0.050) (0.063) (0.112) (0.250) (0.278) (0.146) (0.355)
TPR 0.917 0.834 0.965 0.916 1 0.975 0.570 0.220 0.965 0.060
(0.096) (0.223) (0.067) (0.013) (0) (0.224) (0.341) (0.278) (0) (0.192)
TNR 0.975 0.423 0.976 0.969 0.660 0.988 0.429 0.805 0.990 0.905
(0.069) (0.255) (0.066) (0.091) (0.191) (0.038) (0.270) (0.070) (0.042) (0.103)

Model 1b Model 2b

R2 0.917 0.834 0.965 0.916 0.654 0.950 0.235 0.180 0.929 0.183
(0.203) (0.158) (0.151) (0.278) (0.064) (0.154) (0.274) (0.261) (0.199) (0.363)
TPR 0.917 1 0.970 0.916 1 0.950 0.490 0.180 0.940 0.060
(0.231) (0) (0.119) (0.187) (0) (0.230) (0.362) (0.261) (0.191) (0.192)
TNR 0.975 0.423 0.976 0.969 0.626 0.981 0.464 0.795 0.949 0.895
(0.069) (0.255) (0.066) (0.091) (0.229) (0.205) (0.270) (0.065) (0.181) (0.136)

Model 1c Model 2c

R2 0.900 0.757 0.950 0.720 0.649 0.900 0.233 0.185 0.920 0.196
(0.203) (0.158) (0.151) (0.278) (0.073) (0.230) (0.266) (0.263) (0.197) (0.371)
TPR 0.917 1 0.955 0.720 1 0.950 0.525 0.195 0.930 0.060
(0.190) (0) (0.144) (0.278) (0) (0.230) (0.392) (0.274) (0.173) (0.192)
TNR 0.967 0.375 0.978 0.930 0.610 0.975 0.454 0.799 0.981 0.893
(0.080) (0.228) (0.060) (0.070) (0.236) (0.052) (0.281) (0.069) (0.048) (0.138)

Table 3.

Summary of TP and TN for marginal correlation method.

Model 1c Model 2c Model 3c Model 4c
TPR 0.980 0.425 1 0.550
(0.098) (0.329) (0) (0.500)
TNR 0.873 0.913 1 0.950
(0.012) (0.018) (0) (0.056)

Table 2.

Summary of R2, TP and TN for model 3 and 4.

SpKSDR DT-SIR SpSIR SpDR SCAD SpKSDR DT-SIR SpSIR SpDR SCAD
Model 3a Model 4a

R2 0.889 0.953 1 0.990 0.995 0.667 0.449 0.600 0.128 0.608
(0.333) (0.060) (0) (0.099) (0.017) (0.479) (0.337) (0.492) (0.333) (0.445)
TPR 0.889 1 1 1 1 0.667 0.900 0.600 0.130 0.700
(0.333) (0) (0) (0) (0) (0.479) (0.302) (0.462) (0.338) (0.461)
TNR 0.997 0.850 1 0.990 0.983 0.963 0.502 0.950 0.873 0.881
(0.009) (0.138) (0) (0.239) (0.039) (0.052) (0.297) (0.062) (0.132) (0.172)

Model 3b Model 4b

R2 0.850 0.913 1 0.940 0.996 0.567 0.381 0.360 0.100 0.438
(0.362) (0.099) (0) (0.238) (0.012) (0.504) (0.332) (0.482) (0.293) (0.451)
TPR 0.850 1 1 1 1 0.567 0.840 0.360 0.110 0.550
(0.365) (0) (0) (0) (0) (0.504) (0.368) (0.482) (0.314) (0.500)
TNR 0.996 0.840 1 0.940 0.985 0.952 0.542 0.920 0.889 0.873
(0.009) (0.138) (0) (0.239) (0.030) (0.055) (0.268) (0.060) (0.052) (0.127)

Model 3c Model 4c

R2 0.778 0.573 1 0.500 0.992 0.533 0.257 0.320 0.164 0.335
(0.440) (0.348) (0) (0.503) (0.021) (0.507) (0.285) (0.469) (0.367) (0.429)
TPR 0.778 0.900 1 0.500 1 0.533 0.730 0.320 0.210 0.450
(0.440) (0.302) (0) (0.503) 0 (0.507) (0.447) (0.469) (0.409) (0.500)
TNR 0.994 0.781 1 0.987 0.984 0.948 0.520 0.915 0.862 0.863
(0.011) (0.178) (0) (0.013) (0.035) (0.055) (0.287) (0.059) (0.156) (0.136)

For implementation, sparse SIR/DR/KSDR are implemented using our proposed nonconvex ADMM, DT-SIR is implemented using the R package 'LassoSIR', and SCAD penalised regression is implemented using the R package 'ncvreg'. For the marginal correlation method, we compute the correlation between $y$ and each variable $x_i$, rank the absolute values of the correlations, and select the first $d$ variables, where $d$ is the true dimension.

We also compare the $R^2$ performance of the kernel dimension reduction of Fukumizu et al. (2009) in Models 4(a, b, c) as a benchmark. The average $R^2$ is 0.642, 0.546, and 0.513 for Models 4(a), 4(b), and 4(c), respectively. These results are comparable with the sparse KSDR results; sparse KSDR performs slightly better because the penalisation eliminates some noise from irrelevant variables. Since kernel dimension reduction does not provide a sparse solution, and hence no meaningful TPR or TNR, we do not include it in the main tables.

We draw the following conclusions from the results. (1) Sparse SIR and DT-SIR are comparable in most cases, while sparse SIR outperforms DT-SIR in $R^2$ in general. In Models 1 and 3, where the assumptions for SIR hold, our sparse SIR is nearly perfect. DT-SIR also performs well in these two settings but sometimes includes many irrelevant variables, leading to unsatisfactory TNR and $R^2$. (2) Sparse DR is also comparable with SIR in general. Although it may not outperform SIR in Models 1 and 3, when the link function is symmetric in $x$ in Model 2, SIR fails to recover the true space, whereas sparse DR remains very stable. (3) Sparse KSDR performs best in Model 4, where the linearity conditions (A1) and (A2) do not hold and both SIR and DR lose consistency. Sparse KSDR also performs fairly well compared to SIR and DR in Models 1–3, where the linearity condition holds, regardless of whether the link function is symmetric in $x$. (4) Penalised regression fails in Models 2 and 4, where the relationship between $x$ and $y$ cannot be approximated by linear regression; moreover, regression cannot identify the linear combination $\theta$, which is also of interest. (5) Similarly, the marginal correlation method fails in Model 2, where the dependence between $x$ and $y$ is completely nonlinear and the population correlation is zero; in addition, correlation-based methods also cannot identify the linear combination $\theta$ of interest.

7. Application to microbiome data analysis

In this section, we apply our method to analyse the host transcriptomic data and 16S rRNA gene sequencing data from paired biopsies from IPAA patients with UC and familial adenomatous polyposis (Morgan et al. 2015). Morgan et al. (2015) extracted 5 principal components (PCs) of the microbiome 16S rRNA gene sequencing data and analysed the relationship between these PCs and the PCs of the host transcriptomic data. However, the result is hard to interpret since the PCs are linear combinations of all host transcripts. Thus, we apply our proposed sparse KSDR to explore more interpretable sparse linear combinations that are relevant to these microbiome PCs.

First, we select 20 representative genes of the host transcriptomic data from two groups. The first group consists of the genes that differ most between the two locations (PPI and Pouch) of a patient according to the Kolmogorov–Smirnov test. The second group consists of important genes pointed out in Morgan et al. (2015), which are relevant to the pathogenesis of inflammatory bowel disease (IBD). We treat the 9 principal components of the microbiome 16S rRNA gene sequencing data as $y$ in our model separately, and apply our sparse kernel dimension reduction estimation to the host transcriptomic data of PPI for each patient, with sample size $n=196$ and $p=20$. The selected genes and their coefficients are summarised in Table 4.

Table 4.

Selected genes and corresponding coefficients using KSDR.

PCs Gene 1 Gene 2 Gene 3 Gene 4
PC1 MMEL1(0.52) IL1RN(−0.50) TLR1(−0.30) IL10(−0.27)
PC2 TNF (−0.42) MUC6(0.37) TLR1(−0.36) NFKB1(−0.29)
PC3 MUC6(0.58) IL1RN(0.50) SOX14(0.45) FUT2(0.24)
PC4 MUC6(0.74) IL1RN7096 (0.36) TLR1(0.26) NFKB1(0.23)
PC5 TLR1(−0.49) MUC6(0.41) IFNG(−0.40) NFKB1(0.39)

In the results, the leading PCs (PC1–PC5) are mainly correlated with protein-related genes such as TLR1 and MMEL1. This reflects that these PCs may be mainly composed of an antibiotic-signature microbiome. Several genes that have been widely studied and are important to the pathogenesis of IBD are also selected. For example, IL10 is selected in PC1; MUC6 is selected in PC2, 3, 4, and 5; and IL1RN is selected in PC1, 3, and 4. Among them, IL10 signalling is well studied and defines a subgroup of patients with IBD (Begue et al. 2011). MUC6 may have a role in epithelial wound healing after mucosal injury in IBD, in addition to mucosal protection (Buisine et al. 2001). IL1RN has a well-established pathological role in IBD (Stokkers et al. 1998). Our results suggest that the effects of these genes may be actively related to the microbiome RNA environment of IBD patients.

8. Conclusion

Motivated by the desirable properties of sparse SDR, we propose a nonconvex estimation scheme for sparse KSDR. In terms of statistical properties, we prove the asymptotic consistency of both estimators. In terms of optimisation, we show that both can be solved efficiently by nonconvex ADMM algorithms with provable convergence guarantees. We also demonstrate the practical usefulness of the proposed methods in various simulation settings and a real data analysis.

Supplementary Material

Supp 1

Acknowledgments

The authors are grateful for the insightful comments and suggestions from the editor and reviewers.

Funding

The authors were partially supported by the National Institutes of Health (NIH) grant 1R01GM152812 and National Science Foundation (NSF) grants DMS-1811552, DMS-1953189, CCF-2007823, and DMS-2210775.

Footnotes

Supplemental data for this article can be accessed online at http://dx.doi.org/10.1080/10485252.2024.2360551.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  1. Amini AA, and Wainwright MJ (2008), ‘High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components’, in Information Theory, 2008. ISIT 2008. IEEE International Symposium on, IEEE, pp. 2454–2458. [Google Scholar]
  2. Baker CR (1973), ‘Joint Measures and Cross-Covariance Operators’, Transactions of the American Mathematical Society, 186, 273–289. [Google Scholar]
  3. Begue B, Verdier J, Rieux-Laucat F, Goulet O, Morali A, Canioni D, Hugot J-P, Daussy C, Verkarre V, Pigneur B, and Fischer A (2011), ‘Defective Il10 Signaling Defining a Subgroup of Patients with Inflammatory Bowel Disease’, Official Journal of the American College of Gastroenterology ACG, 106(8), 1544–1555. [DOI] [PubMed] [Google Scholar]
  4. Bondell HD, and Li L (2009), ‘Shrinkage Inverse Regression Estimation for Model-free Variable Selection’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(1), 287–299. [Google Scholar]
  5. Buisine M, Desreumaux P, Leteurtre E, Copin M, Colombel J, Porchet N, and Aubert J (2001), ‘Mucin Gene Expression in Intestinal Epithelial Cells in Crohn’s Disease’, Gut, 49(4), 544–551. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chen X, Zou C, and Cook RD (2010), ‘Coordinate-independent Sparse Sufficient Dimension Reduction and Variable Selection’, The Annals of Statistics, 38(6), 3696–3723. [Google Scholar]
  7. Cook RD, and Weisberg S (1991), ‘Comment’, Journal of the American Statistical Association, 86(414), 328–332. [Google Scholar]
  8. d’Aspremont A, Ghaoui LE, Jordan MI, and Lanckriet GR (2005), ‘A Direct Formulation for Sparse PCA Using Semidefinite Programming’, in Advances in Neural Information Processing Systems, pp. 41–48. [Google Scholar]
  9. Fan J, and Li R (2001), ‘Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties’, Journal of the American Statistical Association, 96(456), 1348–1360. [Google Scholar]
  10. Fan J, Xue L, and Yao J (2017), ‘Sufficient Forecasting Using Factor Models’, Journal of Econometrics, 201(2), 292–306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Fan J, Xue L, and Zou H (2014), ‘Strong Oracle Optimality of Folded Concave Penalized Estimation’, The Annals of Statistics, 42(3), 819–849. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Frommlet F, and Nuel G (2016), ‘An Adaptive Ridge Procedure for L 0 Regularization’, PloS One, 11(2), e0148620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Fukumizu K, Bach FR, and Jordan MI (2009), ‘Kernel Dimension Reduction in Regression’, The Annals of Statistics, 37(4), 1871–1905. [Google Scholar]
  14. Jiang B, Lin T, Ma S, and Zhang S (2019), ‘Structured Nonconvex and Nonsmooth Optimization: Algorithms and Iteration Complexity Analysis’, Computational Optimization and Applications, 72(1), 115–157. [Google Scholar]
  15. Li K-C (1991), ‘Sliced Inverse Regression for Dimension Reduction’, Journal of the American Statistical Association, 86(414), 316–327. [Google Scholar]
  16. Li L (2007), ‘Sparse Sufficient Dimension Reduction’, Biometrika, 94(3), 603–613. [Google Scholar]
  17. Li B, Chun H, and Zhao H (2012), ‘Sparse Estimation of Conditional Graphical Models with Application to Gene Networks’, Journal of the American Statistical Association, 107(497), 152–167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Li B, and Wang S (2007), ‘On Directional Regression for Dimension Reduction’, Journal of the American Statistical Association, 102(479), 997–1008. [Google Scholar]
  19. Lin Q, Zhao Z, and Liu JS (2018), ‘On Consistency and Sparsity for Sliced Inverse Regression in High Dimensions’, The Annals of Statistics, 46(2), 580–610. [Google Scholar]
  20. Lin Q, Zhao Z, and Liu JS (2019), ‘Sparse Sliced Inverse Regression Via Lasso’, Journal of the American Statistical Association, 114(528), 1726–1739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Liu B, Zhang Q, Xue L, Song PX-K, and Kang J (2024), ‘Robust High-dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis’, Journal of the American Statistical Association, 119, 715–729. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Luo W, Xue L, Yao J, and Yu X (2022), ‘Inverse Moment Methods for Sufficient Forecasting Using High-dimensional Predictors’, Biometrika, 109(2), 473–487. [Google Scholar]
  23. Ma S (2013), ‘Alternating Direction Method of Multipliers for Sparse Principal Component Analysis’, Journal of the Operations Research Society of China, 1(2), 253–274. [Google Scholar]
  24. Mackey LW (2009), ‘Deflation Methods for Sparse PCA’, in Advances in Neural Information Processing Systems, pp. 1017–1024. [Google Scholar]
  25. Morgan XC, Kabakchiev B, Waldron L, Tyler AD, Tickle TL, Milgrom R, Stempak JM, Gevers D, Xavier RJ, and Silverberg MS (2015), ‘Associations Between Host Gene Expression, the Mucosal Microbiome, and Clinical Outcome in the Pelvic Pouch of Patients with Inflammatory Bowel Disease’, Genome Biology, 16(1), 67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Neykov M, Lin Q, and Liu JS (2016), ‘Signed Support Recovery for Single Index Models in High-dimensions’, Annals of Mathematical Sciences and Applications, 1(2), 379–426. [Google Scholar]
  27. Shi L, Huang X, Feng Y, and Suykens J (2019), ‘Sparse Kernel Regression with Coefficient-based Lq-regularization’, Journal of Machine Learning Research, 20(116), 1–44. [Google Scholar]
  28. Stokkers P, Van Aken B, Basoski N, Reitsma P, Tytgat G, and Van Deventer S (1998), ‘Five Genetic Markers in the Interleukin 1 Family in Relation to Inflammatory Bowel Disease’, Gut, 43(1), 33–39. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Tan K, Shi L, and Yu Z (2020), ‘Sparse SIR: Optimal Rates and Adaptive Estimation’, The Annals of Statistics, 48(1), 64–85. 10.1214/18-AOS1791. [DOI] [Google Scholar]
  30. Tan KM, Wang Z, Zhang T, Liu H, and Cook RD (2018), ‘A Convex Formulation for High-dimensional Sparse Sliced Inverse Regression’, Biometrika, 105(4), 769–782. [Google Scholar]
  31. Wold S, Esbensen K, and Geladi P (1987), ‘Principal Component Analysis’, Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52. [Google Scholar]
  32. Wu C, Miller J, Chang Y, Sznaier M, and Dy J (2019), ‘Solving Interpretable Kernel Dimensionality Reduction’, in Advances in Neural Information Processing Systems, pp. 7915–7925. [Google Scholar]
  33. Ying C, and Yu Z (2022), ‘Fréchet Sufficient Dimension Reduction for Random Objects’, Biometrika, 109(4), 975–992. [Google Scholar]
  34. Yu X, Yao J, and Xue L (2022), ‘Nonparametric Estimation and Conformal Inference of the Sufficient Forecasting with a Diverging Number of Factors’, Journal of Business & Economic Statistics, 40(1), 342–354. [Google Scholar]
  35. Zhang C-H (2010), ‘Nearly Unbiased Variable Selection Under Minimax Concave Penalty’, The Annals of Statistics, 38(2), 894–942. [Google Scholar]
  36. Zhang Q, Li B, and Xue L (2024), ‘Nonlinear Sufficient Dimension Reduction for Distribution-on-distribution Regression’, Journal of Multivariate Analysis, 202, 105302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Zhang Q, Xue L, and Li B (2024), ‘Dimension Reduction for Fréchet Regression’, Journal of the American Statistical Association, in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Zou H (2006), ‘The Adaptive Lasso and Its Oracle Properties’, Journal of the American Statistical Association, 101(476), 1418–1429. [Google Scholar]
  39. Zou H, Hastie T, and Tibshirani R (2006), ‘Sparse Principal Component Analysis’, Journal of Computational and Graphical Statistics, 15(2), 265–286. [Google Scholar]
  40. Zou H, and Xue L (2018), ‘A Selective Overview of Sparse Principal Component Analysis’, Proceedings of the IEEE, 106(8), 1311–1320. [Google Scholar]
