Published in final edited form as: Sankhya Ser B. 2019 Feb 7;80(1 Suppl):179–223. doi: 10.1007/s13571-018-0183-0

A Blockwise Consistency Method for Parameter Estimation of Complex Models

Runmin Shi 1, Faming Liang 2,*, Qifan Song 3, Ye Luo 4, Malay Ghosh 5

Abstract

The drastic improvement in data collection and acquisition technologies has enabled scientists to collect a great amount of data. With the growing dataset size typically comes a growing complexity of data structures and of the models needed to account for them. How to estimate the parameters of complex models poses a great challenge to current statistical methods. This paper proposes a blockwise consistency approach as a potential solution to the problem, which works by iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the parameters in the other blocks. The blockwise consistency approach decomposes the high-dimensional parameter estimation problem into a series of lower-dimensional parameter estimation problems, which often have much simpler structures than the original problem and thus can be easily solved. Moreover, under the framework provided by the blockwise consistency approach, a variety of methods, such as Bayesian and frequentist methods, can be jointly used to achieve a consistent estimator for the original high-dimensional complex model. The blockwise consistency approach is illustrated using two high-dimensional problems, variable selection and multivariate regression. The results for both problems show that the blockwise consistency approach can provide drastic improvements over the existing methods. Extension of the blockwise consistency approach to many other complex models is straightforward.

Keywords: Coordinate Descent, Gaussian Graphical Model, Multivariate Regression, Precision Matrix, Variable Selection

1. Introduction

The drastic improvement in data collection and acquisition technologies during the past two decades has enabled scientists to collect a great amount of data, such as climate data and high-throughput biological assay data. With the growing dataset size typically comes a growing complexity of data structures, of the patterns in the data, and of the models needed to account for the patterns. Such complex models are often characterized as high-dimensional, hierarchical, or highly nonlinear. Among modern statistical methods, Markov chain Monte Carlo (MCMC) has proven to be a very powerful, and often the only feasible, computational tool for analyzing data with complex structures. However, the feasibility of MCMC is being challenged in the era of big data, since it typically requires a large number of iterations and a complete scan of the full dataset at each iteration. Frequentist methods, such as maximum likelihood estimation and regularization methods, can be fast, but the optimization problem involved is usually difficult to handle when the dimension of the parameter space is high and/or the model structure is complex. How to estimate the parameters of complex models with big data poses a great challenge to current statistical methods.

To tackle this problem, we propose a blockwise consistency (BwC) method, which is developed based on the coordinate descent method (Tseng, 2001; Tseng and Yun, 2009). The coordinate descent method has recently been adopted in statistics to solve the computational problems in estimating parameters for sparse linear and logistic regression with convex (Friedman et al., 2010) or nonconvex separable regularization terms (Breheny and Huang, 2011; Mazumder et al., 2011a), as well as for estimating sparse precision matrices (Friedman et al., 2008). The BwC method extends the applications of the coordinate descent method to more general complex statistical models by asymptotically maximizing the following objective function,

max_θ E_{θ*} log π(X|θ),   (1)

where π(·) denotes the likelihood function of the model, θ denotes a high-dimensional parameter vector, θ* denotes the true parameter vector, and E_{θ*} denotes expectation with respect to the true distribution π(x|θ*). Suppose θ has been partitioned into a few blocks θ = (θ^(1), …, θ^(k)). With the asymptotic objective function, BwC decomposes the high-dimensional parameter estimation problem into a series of low-dimensional parameter estimation problems: iteratively finding consistent estimates for the parameters of each block conditioned on the current estimates of the parameters of the other blocks. The low-dimensional problems often have much simpler structures than the original problem and thus can be easily solved. Moreover, under the framework provided by BwC, a variety of methods, such as Bayesian and frequentist methods, can be jointly used, depending on their availability and convenience, to find a consistent estimate for the original high-dimensional complex model.

This paper demonstrates the use of BwC for two types of complex models. The first is high-dimensional variable selection under the small-n-large-p scenario, i.e., where the sample size is much smaller than the number of variables. For this problem, the dimension of the parameter space can be very high, often ranging from a few thousand to a few million. Since the problem is ill-posed, regularization methods, e.g., Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), are often used. However, the performance of these methods can quickly deteriorate as the dimension increases. As shown by our numerical results, BwC can provide a drastic improvement in both parameter estimation and variable selection over the regularization methods. The second model is high-dimensional multivariate regression, where BwC is used to iteratively select relevant variables and estimate the precision matrix. BwC decomposes the complex problem into two sub-problems, variable selection and precision matrix estimation, each of which has a simple structure and a variety of algorithms available in the literature. The complex problem can then be solved with a combined use of the existing algorithms, which facilitates big data analysis.

The BwC method is also related to the imputation-regularized optimization (IRO) method proposed in Liang et al. (2018). The BwC method can be viewed as a degenerate version of the IRO method, i.e., one without missing data involved. However, extending the theory of IRO to BwC is not trivial. For IRO, the imputation step for missing data plays a major role: it leads to two interleaved Markov chains, and thus the convergence of the method can be studied based on Markov chain theory. For BwC, however, it is not appropriate to study the convergence based on Markov chain theory.

The remainder of this paper is organized as follows. Section 2 describes the BwC method and studies its convergence. Section 3 demonstrates the use of BwC for high-dimensional variable selection problems. Section 4 demonstrates the use of BwC for high-dimensional multivariate regression. Section 5 concludes the paper with a brief discussion.

2. The Blockwise Consistency Method

2.1. The Blockwise Coordinate Ascent Method

Let X_1, …, X_n denote a random sample drawn from a complex model with density function π(x|θ), where θ ∈ Θ_n denotes a high-dimensional parameter vector, and the subscript n indicates the dependence of the dimension of θ on the sample size n. Without loss of generality, we assume that θ has been partitioned into k blocks θ = (θ^(1), …, θ^(k)). Correspondingly, Θ_n is partitioned as Θ_n = Θ_n^(1) × Θ_n^(2) × ⋯ × Θ_n^(k). From Jensen's inequality, we have

E_{θ*} log π(X|θ) ≤ E_{θ*} log π(X|θ*).   (2)

Therefore, finding a consistent estimator of θ can be viewed as an optimization problem: maximizing E_{θ*} log π(X|θ) with respect to θ. Suppose that the ideal objective function E_{θ*} log π(X|θ) were evaluable; then the blockwise coordinate ascent algorithm (Tseng, 2001) could be applied, which works in a cyclic manner as follows:

Algorithm 2.1.

(Blockwise Coordinate Ascent)

1. For each iteration t, set the index s = t (mod k) + 1.
2. For the index s, find θ̃_t^(s) which maximizes the ideal objective function; that is, set

   θ̃_t^(s) = arg max_{θ^(s) ∈ Θ_n^(s)} E_{θ*} log π(X | θ̃_{t−1}^(1), …, θ̃_{t−1}^(s−1), θ^(s), θ̃_{t−1}^(s+1), …, θ̃_{t−1}^(k)).   (3)

   Let θ̃_t = (θ̃_t^(1), …, θ̃_t^(k)) denote the estimator of θ obtained at iteration t, where

   θ̃_t^(j) = θ̃_t^(s) if j = s, and θ̃_t^(j) = θ̃_{t−1}^(j) if j ≠ s.
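To make the cyclic rule concrete, the following toy sketch (ours, not part of the paper) applies blockwise coordinate ascent to a known concave quadratic objective, which stands in for the ideal objective E_{θ*} log π(X|θ) that is unavailable in practice; for a quadratic, each block update can be solved exactly.

import numpy as np

def blockwise_coordinate_ascent(A, b, blocks, n_sweeps=50):
    # Toy stand-in for Algorithm 2.1: cyclically maximize the concave quadratic
    # f(theta) = -0.5 * theta' A theta + b' theta, one block at a time,
    # holding the other blocks fixed (A must be positive definite).
    p = len(b)
    theta = np.zeros(p)
    for t in range(n_sweeps * len(blocks)):
        s = t % len(blocks)                      # cyclic rule, 0-based analogue of s = t (mod k) + 1
        idx = np.asarray(blocks[s])
        rest = np.setdiff1d(np.arange(p), idx)
        # Exact blockwise maximizer: solve A_ss theta_s = b_s - A_{s,rest} theta_rest.
        rhs = b[idx] - A[np.ix_(idx, rest)] @ theta[rest]
        theta[idx] = np.linalg.solve(A[np.ix_(idx, idx)], rhs)
    return theta

# Example with two blocks of a 4-dimensional problem:
A = np.array([[2.0, 0.5, 0.2, 0.0],
              [0.5, 2.0, 0.0, 0.2],
              [0.2, 0.0, 2.0, 0.5],
              [0.0, 0.2, 0.5, 2.0]])
b = np.ones(4)
theta_hat = blockwise_coordinate_ascent(A, b, blocks=[[0, 1], [2, 3]])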

Following Tseng (2001), the convergence of the method can be studied with the following conditions:

  • (A1) The level set Θ_0 = {θ : E_{θ*} log π(X|θ) ≥ E_{θ*} log π(X|θ_0)} is compact and E_{θ*} log π(X|θ) is continuous on Θ_0.

  • (A2) E_{θ*} log π(X|θ) has at most one maximum in θ^(s) for s = 1, 2, …, k.

For any function f : ℝ^d → ℝ ∪ {∞}, we denote by dom(f) the effective domain of f, i.e., dom(f) = {x ∈ ℝ^d : f(x) < ∞}. For any x ∈ dom(f) and any e ∈ ℝ^d, we denote the lower directional derivative of f at x in the direction e by

f′(x; e) = lim inf_{δ↓0} [f(x + δe) − f(x)]/δ.

We say that x is a stationary point of f if x ∈ dom(f) and f′(x; e) ≤ 0 for all e. We say that x is a coordinatewise maximum point of f if x ∈ dom(f) and f(x + (0, …, e_i, …, 0)) ≤ f(x) for all e_i ∈ ℝ^{d_i}, where d_i is the dimension of e_i. We say f is regular at x ∈ dom(f) if f′(x; e) ≤ 0 for all e = (e_1, …, e_d) such that f′(x; (0, …, e_i, …, 0)) ≤ 0 for all i = 1, 2, …, d.

Lemma 2.1, which is a restatement of part (c) of Theorem 4.1 of Tseng (2001), concerns the convergence of the blockwise coordinate ascent algorithm.

Lemma 2.1 (Theorem 4.1 of Tseng (2001)) If the condition (A1) holds, then the sequence {θ̃_t} is defined and bounded. Moreover, if the condition (A2) holds, then every cluster point θ̄ of {θ̃_t} is a coordinatewise maximum point of E_{θ*} log π(X|θ). In addition, if E_{θ*} log π(X|θ) is regular at θ̄, then θ̄ is a stationary point of E_{θ*} log π(X|θ).

2.2. The Blockwise Consistency Method

Since θ* is unknown, the ideal objective function E_{θ*} log π(X|θ) is not evaluable (or is evaluable only in the limit), and thus Algorithm 2.1 is not practically usable. To develop a practically usable algorithm for estimating θ, we let θ̂_t = (θ̂_t^(1), …, θ̂_t^(k)) denote the estimate of θ obtained at iteration t, let θ̂_t^(−s) = (θ̂_t^(1), …, θ̂_t^(s−1), θ̂_t^(s+1), …, θ̂_t^(k)) denote the subvector of θ̂_t with the s-th block omitted, and define

G_n(θ^(s) | θ̂_{t−1}^(−s)) = E_{θ*} log π(X | θ̂_{t−1}^(1), …, θ̂_{t−1}^(s−1), θ^(s), θ̂_{t−1}^(s+1), …, θ̂_{t−1}^(k)),
Ĝ_n(θ^(s) | θ̂_{t−1}^(−s)) = E_n log π(X | θ̂_{t−1}^(1), …, θ̂_{t−1}^(s−1), θ^(s), θ̂_{t−1}^(s+1), …, θ̂_{t−1}^(k)),   (4)

where E_n h(X) = (1/n) Σ_{i=1}^n h(X_i) denotes the empirical mean of the function h(X), and G_n(θ^(s) | θ̂_{t−1}^(−s)) depends on the sample size n through θ̂_{t−1}^(−s). To accommodate the high-dimensional case, for which a regularization term is needed to overcome the ill-posedness issue, we propose to obtain θ̂_t^(s) by setting

θ̂_t^(s) = arg max_{θ^(s) ∈ Θ_n^(s)} { Ĝ_n(θ^(s) | θ̂_{t−1}^(−s)) − P_{λ_{n,s,t}}(θ^(s)) },   (5)

where P_{λ_{n,s,t}}(·) denotes the penalty function imposed on θ^(s), and λ_{n,s,t} denotes the regularization parameter used for block s at iteration t. Note that λ_{n,s,t} can differ across blocks and iterations. Suppose that we are able to choose an appropriate penalty function P_{λ_{n,s,t}}(θ^(s)) such that θ̂_t^(s) forms a consistent estimator of θ_t^{*(s)}, which is defined by

θ_t^{*(s)} = arg max_{θ^(s) ∈ Θ_n^(s)} G_n(θ^(s) | θ̂_{t−1}^(−s)).   (6)

By cycling through all the blocks of θ = (θ^(1), …, θ^(k)), we have the following algorithm:

Algorithm 2.2.

(Blockwise Consistency)

1. For each iteration t, set the index s = t (mod k) + 1.
2. For the index s, find θ̂_t^(s) which forms a consistent estimate of θ_t^{*(s)}. Let θ̂_t = (θ̂_t^(1), …, θ̂_t^(k)), where

   θ̂_t^(j) = θ̂_t^(s) if j = s, and θ̂_t^(j) = θ̂_{t−1}^(j) if j ≠ s.
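The point of Algorithm 2.2 is that any consistent per-block procedure can be plugged in, which is what allows Bayesian and frequentist estimators to be mixed across blocks. A minimal skeleton of this plug-in structure is sketched below (names are ours; estimators[s] denotes any user-supplied consistent procedure for block s, e.g., a penalized MLE, a posterior mean, or a moment estimate):

def bwc_skeleton(data, blocks, estimators, theta0, n_sweeps=10):
    # Skeleton of Algorithm 2.2: each block is re-estimated in turn,
    # conditioned on the current values of the other blocks.
    theta = dict(theta0)
    for _ in range(n_sweeps):
        for s in blocks:                          # cyclic rule over the blocks
            others = {j: theta[j] for j in blocks if j != s}
            theta[s] = estimators[s](data, others)
    return theta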

As an extension of Algorithm 2.2, we note that θ̂_t^(s) need not be obtained by directly maximizing the penalized likelihood function given in (5). In fact, as shown in Theorem 2.3, θ̂_t^(s) can be obtained using any consistent estimation procedure as long as ‖θ̂_t^(s) − θ_t^{*(s)}‖ is small enough for each t (pointwise) and the log-likelihood function log π(x | θ^(s), θ̂_{t−1}^(−s)) is well behaved. In this case, it follows from Jensen's inequality that the estimator θ̂_t^(s) implicitly and asymptotically maximizes the penalized likelihood function given in (5), provided that the penalty function has been appropriately defined.

2.3. Convergence of the BwC Method

Let Θ_{n,T} denote the space of the path {θ̂_t : t = 1, 2, …, T}, which can be considered as an arbitrary subset of Θ_n with T elements (replicates are allowed). The proof of the convergence of {θ̂_t : t = 1, 2, …, T} takes three steps. First, we establish a uniform law of large numbers (ULLN) for Ĝ_n(θ^(s) | θ̂_{t−1}^(−s)) toward the mean function G_n(θ^(s) | θ̂_{t−1}^(−s)) over the space Θ_n^(s) × Θ_{n,T}^(−s). Second, we show that, with appropriate penalty functions, θ̂_t^(s) converges to θ_t^{*(s)} in probability uniformly along the path. Finally, we show that the path {θ̂_t : t = 1, 2, …, T} and the path {θ̃_t : t = 1, 2, …, T} converge to the same limit as n → ∞ and T → ∞.

To complete the first step of the proof, we assume the following conditions:

  • (B1) log π(x|θ) is a continuous function of θ for each x ∈ 𝒳 and a measurable function of x for each θ ∈ Θ_n.

  • (B2) [Conditions for Glivenko-Cantelli Theorem]
    1. There exists a function m_n(x) such that sup_{θ∈Θ_n} |log f_θ(x)| ≤ m_n(x) for all x ∈ 𝒳, where the subscript n indicates the dependence of the function on the sample size n; sup_{n≥1} E[m_n(X)] < ∞, and sup_{n≥1} E[m_n(X) 1(m_n(X) ≥ ζ)] → 0 as ζ → ∞.
    2. Define 𝓕_n = {log f(x | θ^(s), θ̂_{t−1}^(−s)) : θ^(s) ∈ Θ_n^(s), θ̂_{t−1}^(−s) ∈ Θ_{n,T}^(−s)} and 𝓖_{n,M} = {q·1(m_n(x) ≤ M) : q ∈ 𝓕_n}. Suppose that for every ϵ > 0 and M > 0, the metric entropy log N(ϵ, 𝓖_{n,M}, L_1(ℙ_n)) = o_p(n), where ℙ_n is the empirical measure of x, and N(ϵ, 𝓖_{n,M}, L_1(ℙ_n)) is the covering number with respect to the L_1(ℙ_n)-norm.

Theorem 2.1 If the conditions (B1) and (B2) hold, then for any T,

sup_{θ̂_{t−1}^(−s) ∈ Θ_{n,T}^(−s)} sup_{θ^(s) ∈ Θ_n^(s)} | Ĝ_n(θ^(s) | θ̂_{t−1}^(−s)) − G_n(θ^(s) | θ̂_{t−1}^(−s)) | →^p 0,  as n → ∞,   (7)

where →^p denotes convergence in probability.

The metric entropy condition in (B2) has often been used in the literature on high-dimensional statistics. Assume that all elements in n^{−1}𝓕_n are uniformly Lipschitz with respect to the l_1-norm. Then the metric entropy log N(ϵ, 𝓖_{n,M}, L_1(ℙ_n)) can be measured based on the parameter space Θ_n. Since the functions in 𝓖_{n,M} are all bounded, the corresponding parameter space can be contained in an l_1-ball by the continuity of log f(x|θ) in θ. Further, if we assume that the diameter of the l_1-ball grows at a rate of O(n^α) for some 0 ≤ α < 1/2, then log N(ϵ, 𝓖_{n,M}, L_1(ℙ_n)) = O(n^{2α} log p), which allows p to grow at a polynomial rate O(n^γ) for some constant 0 < γ < ∞. Note that the increasing diameter accounts for the conventional assumption that the size of the true model grows with the sample size n. Refer to Vershynin (2015) for more discussion of this issue. Similar conditions have been used in the literature; for example, Raskutti et al. (2011) studied minimax rates of estimation for high-dimensional linear regression over l_q-balls.

To establish the uniform convergence of {θ̂_t^(s)} toward {θ_t^{*(s)}}, we assume the following conditions:

  • (B3) For each t = 1, 2, …, T, G_n(θ^(s) | θ̂_{t−1}^(−s)) has a unique maximum at θ_t^{*(s)}; for any ϵ > 0, sup_{θ^(s) ∈ Θ_n^(s) \ B_t(ϵ)} G_n(θ^(s) | θ̂_{t−1}^(−s)) exists, where B_t(ϵ) = {θ^(s) ∈ Θ_n^(s) : ‖θ^(s) − θ_t^{*(s)}‖ < ϵ}. Let δ_t = G_n(θ_t^{*(s)} | θ̂_{t−1}^(−s)) − sup_{θ^(s) ∈ Θ_n^(s) \ B_t(ϵ)} G_n(θ^(s) | θ̂_{t−1}^(−s)), and δ = min_{t∈{1,2,…,T}} δ_t > 0.

  • (B4) The penalty function P_{λ_{n,s,t}}(θ^(s)) is non-negative, ensures the existence of θ̂_t^(s) for all n ∈ ℕ and t = 1, 2, …, T, and converges to 0 uniformly over the set {θ_t^{*(s)} : t = 1, 2, …, T} as n → ∞.

Theorem 2.2 If the conditions (B1)-(B4) hold, then the regularization estimator θ̂_t^(s) in (5) is uniformly consistent with respect to θ_t^{*(s)} over t = 1, 2, …, T, i.e., sup_{t∈{1,2,…,T}} ‖θ̂_t^(s) − θ_t^{*(s)}‖ →^p 0 as n → ∞.

On the choice of penalty functions, we have the following comments. Take high-dimensional regression as an example. Let p_s denote the dimension of θ^(s). If we allow p_s to grow with n at the rate p_s = O(n^γ) for some constant γ > 0, allow the number of nonzero regression coefficients of the true regression to grow with n at the rate O(n^α) for some constant 0 < α < 1/2, choose λ_{n,s,t} = O(√(log(p_s)/n)), and set P_{λ_{n,s,t}}(θ^(s)) = λ_{n,s,t} Σ_{i=1}^{p_s} c_{λ_{n,s,t}}(|θ^{(s,i)}|), where θ^{(s,i)} denotes the i-th element of θ^(s) and c_{λ_{n,s,t}}(·) is set in the form of the SCAD (Fan and Li, 2001) or MCP (Zhang, 2010) penalty, then the condition (B4) is satisfied. For both the SCAD and MCP penalties, c_{λ_{n,s,t}}(|θ^{(s,i)}|) = 0 if θ^{(s,i)} = 0 and is bounded by a constant otherwise. Note that, if Θ_n^(s) = ℝ^{p_s}, then the Lasso penalty does not satisfy (B4), as it is unbounded. This explains why the Lasso estimate is biased even as n → ∞. However, if Θ_n^(s) is restricted to a bounded space, then the Lasso penalty also satisfies (B4).
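For reference, the MCP penalty of Zhang (2010) can be written as below, where γ > 1 is its concavity parameter; it vanishes at zero and is bounded by γλ²/2 for any fixed λ, which is the boundedness used above. How this term is scaled inside P_{λ_{n,s,t}} follows the paper's notation and is not fixed by this sketch.

ρ_{λ,γ}(t) = λ ∫_0^{|t|} (1 − x/(γλ))_+ dx = λ|t| − t²/(2γ)  if |t| ≤ γλ,  and  γλ²/2  if |t| > γλ,

so that ρ_{λ,γ}(0) = 0 and 0 ≤ ρ_{λ,γ}(t) ≤ γλ²/2 for all t.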

As an alternative to regularization methods, one may first restrict the space Θ_n^(s) to some low-dimensional subspace through sure screening, and then find a consistent estimate of θ_t^{*(s)} in the subspace using a conventional statistical method, such as maximum likelihood, moment estimation, or even regularization. For example, the sure independence screening (SIS) method (Fan and Lv, 2008; Fan and Song, 2010) belongs to this class. It is interesting to point out that the sure screening-based methods can be viewed as a special subclass of regularization methods, for which the solutions in the low-dimensional subspace receive a zero penalty and those outside the subspace receive a penalty of ∞. It is easy to see that such a binary-type penalty function satisfies condition (B4).

Both the regularization and sure screening-based methods are constructive. In what follows, we give a proof for the use of general consistent estimation procedures in the BwC algorithm. Let θ̂_{t,g}^(s) denote the estimate of θ_t^{*(s)} produced by such a general consistent estimation procedure at iteration t, which might not explicitly maximize the objective function defined in (5). Theorem 2.3 shows that if θ̂_{t,g}^(s) is accurate enough for each t (pointwise) and the log-likelihood function log π(x|θ) is well behaved, then θ̂_{t,g}^(s) can be used in the BwC algorithm.

  • (B5) [Conditions for the general consistent estimate θ̂_{t,g}^(s)] Assume that log(T) = o(n); for each t = 1, 2, …, T, ‖θ̂_{t,g}^(s) − θ_t^{*(s)}‖ = O_p(1/√n) (pointwise) and the Hessian matrix ∂²G_n(θ^(s) | θ̂_{t−1,g}^(−s))/∂θ^(s)∂θ^(s)ᵀ is bounded in a neighborhood of θ_t^{*(s)}; and, letting Z_{t,i} = log π(x_i | θ̂_{t,g}^(s), θ̂_{t−1,g}^(−s)) − E[log π(x_i | θ̂_{t,g}^(s), θ̂_{t−1,g}^(−s))], E|Z_{t,i}|^m ≤ m! M̃_b^{m−2} ṽ_i/2 for every m ≥ 2 and some constants M̃_b > 0 and ṽ_i = O(1). That is, each Z_{t,i} is a sub-exponential random variable.

Theorem 2.3 Assume that the conditions (B1)-(B3) and (B5) hold. Then θ̂_{t,g}^(s) is uniformly consistent with respect to θ_t^{*(s)} over t = 1, 2, …, T, i.e., sup_{t∈{1,2,…,T}} ‖θ̂_{t,g}^(s) − θ_t^{*(s)}‖ →^p 0 as n → ∞.

Regarding the condition on T, we have the following comments. Since θ̂_t^(s) might not be close enough to θ_t^{*(s)}, as assumed in (B3) for the regularized estimator (5), we are not able to prove the uniform convergence of θ̂_t^(s) to θ_t^{*(s)} over all possible paths {θ̂_t : t = 1, 2, …}. However, we are able to prove that the uniform convergence holds over all possible paths {θ̂_t : t = 1, 2, …, T} with T not too large compared to e^n. This is enough for Theorems 2.1–2.4. To justify this, we may consider the case in which the dimension of θ grows with n at a rate of p = O(n^γ) for a constant γ > 0, say γ = 5. Then it is easy to see that, when n > 13, the ratio T/p has an order of

O(e^n/p) = O(e^{n − γ log(n)}) ≥ O(e^{0.1n}) ≥ O(p^{100}),

which implies that there is essentially no constraint on the setting of T.

Condition (B5) restricts the consistent estimates to those having a distance to the true parameter point of the order O_p(1/√n). Such a condition can be satisfied by some estimation procedures in the low-dimensional subspace, e.g., maximum likelihood, for which both the variance and the bias are often of the order O(1/n) (Firth, 1993) and therefore the root mean squared error is of the order O(1/√n). To make the result of Theorem 2.3 general enough to cover more estimation procedures, we can relax this order to ‖θ̂_t^(s) − θ_t^{*(s)}‖ = O_p(n^{−1/4}), provided that the order of T is restricted to log(T) = o(√n) and the order of the metric entropy to log N(ϵ, 𝓖_{n,M}, L_1(ℙ_n)) = o_p(√n). As mentioned previously, both the order of T and the order of the metric entropy are technical conditions, and restricting them in this way will not much limit the applications of the BwC algorithm. The proof of this relaxation is straightforward, following the proof of Theorem 2.3.

To examine the distance between the two paths {θ̃_t} and {θ̂_t}, we define the mapping

M_s(θ_t)^{(i)} = arg max_{θ^(i) ∈ Θ^(i)} E[log π(X | θ_t^(1), …, θ_t^(i−1), θ^(i), θ_t^(i+1), …, θ_t^(k))],  if i = s,
M_s(θ_t)^{(i)} = θ_t^(i),  if i ≠ s,

where s = t mod k + 1. For the mapping, we assume the following condition is satisfied:

  • (B6) The function Ms(θ) is differentiable. Let ρs(θ) be the largest singular value of ∂Ms(θ)/∂θ. There exists a ρ* < 1 such that ρs(θ) ≤ ρ* almost surely for all θ ∈ Θn and s = 1,2, …, k.

The condition (B6) is a contraction condition, which implies that for any θ and θ′ in Θn,

‖M_s(θ) − M_s(θ′)‖ ≤ ρ* ‖θ − θ′‖,   (8)

i.e., both θ and θ′ tend to move toward a fixed point under the mapping. This condition is usually satisfied for the expectation function E_{θ*} log π(X|θ), for which we always assume that the true parameter point θ* is unique. By Jensen's inequality, we know that θ* can serve as the fixed point of the mapping.

Theorem 2.4 concerns the distance between the paths {θ̃_t} and {θ̂_t} and shows that the two paths eventually converge to the same point in probability.

Theorem 2.4 Assume that the conditions (A1)-(A2), (B1)-(B4) and (B6) hold, or that the conditions (A1)-(A2), (B1)-(B3) and (B5)-(B6) hold. Then ‖θ̂_t − θ̃_t‖ converges to zero uniformly in probability, and the limit θ̂_∞ := lim_{t→∞} θ̂_t →^p θ̃_∞ := lim_{t→∞} θ̃_t.

Note that the proofs of the above theorems are independent of the number of blocks and the size of each block. Hence, the BwC algorithm is very general: it allows the number of parameters, the number of blocks, and the block sizes to increase with the sample size.

3. BwC for High Dimensional Variable Selection

Consider variable selection for a high dimensional linear regression

y=Xθ+ϵ, (9)

where y = (y1, ⋯, yn)TRn is the response vector, n is the sample size, X = (x1, ⋯, xp) ∈ Rn×p is the design matrix, p is the number of explanatory variables (also known as predictors), θRp is the vector of unknown regression coefficients, and ϵ = (ϵ1, ⋯,ϵn)T ~ Nn(0,σ2 In) is the Gaussian random error. Assume that p can be much greater than n and can also increase with n.

For this problem, the regularization approach has been extensively used in the literature, which is to find an estimator of θ by maximizing a penalized log-likelihood function

E_n[log π(X|θ)] − (1/n) Σ_{i=1}^p P_λ(|θ_i|),   (10)

where P_λ(·) is the penalty function and λ is a tunable parameter. It is known that many choices of the penalty function can lead to consistent solutions to this problem, such as the l_1-penalty used in Lasso (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD) penalty used in Fan and Li (2001), the minimax concave penalty (MCP) used in Zhang (2010), and the reciprocal l_1-penalty (also known as rLasso) used in Song and Liang (2015a). However, as is well known, the performance of these methods can deteriorate quickly as the number of predictors increases. To illustrate this issue, we consider the following example with four different values of p = 500, 1000, 2000 and 5000:

y_i = θ_0 + Σ_{j=1}^p x_{ij} θ_j + ϵ_i,   i = 1, 2, …, n,   (11)

where n = 100, and the ϵ_i's are iid normal random errors with mean 0 and variance 1. The true values of the θ_j's are θ_j = 1 for j = 1, 2, …, 10 and 0 otherwise. Let x_j = (x_{1j}, x_{2j}, …, x_{nj})′ denote the jth predictor for j = 1, 2, …, p; the predictors are given by

x_1 = z_1 + e,  x_2 = z_2 + e,  …,  x_p = z_p + e,   (12)

where e, z1, …, zp are iid normal random vectors drawn from N(0, In). Under this setting, xj’s are highly correlated with a mutual correlation coefficient of 0.5. For each value of p, 10 datasets are independently generated.
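A minimal sketch of this data-generating scheme (the intercept θ_0 is taken to be 0 here for illustration; the function and variable names are ours):

import numpy as np

def simulate_data(n=100, p=5000, n_true=10, seed=0):
    # Model (11)-(12): x_j = z_j + e gives Var(x_j) = 2 and Cov(x_j, x_k) = 1,
    # i.e., a mutual correlation coefficient of 0.5 between predictors.
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)                   # shared component
    Z = rng.standard_normal((n, p))              # independent components
    X = Z + e[:, None]
    theta = np.zeros(p)
    theta[:n_true] = 1.0                         # true nonzero coefficients
    y = X @ theta + rng.standard_normal(n)       # intercept theta_0 = 0 (assumption)
    return X, y, theta

X, y, theta_true = simulate_data()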

The regularization methods, including Lasso, SCAD and MCP, were applied to the datasets. These methods have been implemented in the R-package SIS (Fan et al., 2015), and were run under their default settings: the variables were first screened according to the iterative sure independence screening (ISIS) algorithm (Fan et al., 2009), and then selected from the remaining set by maximizing (10) with the respective penalties. The regularization parameters were determined according to the BIC criterion. To measure the performance of these methods in variable selection, we calculate the false and negative selection rates. Let s* denote the set of true variables, and let ŝ denote the set of selected variables. Define

fsr = |ŝ \ s*| / |ŝ|,   nsr = |s* \ ŝ| / |s*|,   (13)

where |·| denotes the set cardinality. The smaller the values of fsr and nsr, the better the performance of the method. The numerical results are summarized in Table 1. It is easy to see that the performance of all the methods deteriorates very quickly: all the values of fsr, nsr and the parameter estimation error increase with p. These methods have also been tried with the tuning parameters determined via cross-validation; the results are even worse than those reported in Table 1.

Table 1:

Performance of the regularization methods for the simulated data with different values of p: |ŝ|_avg denotes the average number of variables selected over the 10 datasets, fsr and nsr are calculated as in (13), ‖θ̂ − θ‖ = (Σ_{j=1}^p (θ̂_j − θ_j)²)^{1/2} measures the parameter estimation error, and the numbers in parentheses denote the standard deviations of the corresponding estimates.

                 Lasso           SCAD            MCP
p = 500
  |ŝ|_avg        20.4 (0.60)     16.6 (1.80)     15.7 (1.16)
  |ŝ∩s*|_avg     9.8 (0.20)      10 (0)          10 (0)
  fsr            0.51 (0.022)    0.31 (0.086)    0.32 (0.059)
  nsr            0.02 (0.020)    0 (0)           0 (0)
p = 1000
  |ŝ|_avg        21 (0)          19.6 (1.11)     18.2 (0.65)
  |ŝ∩s*|_avg     7.8 (0.85)      8.9 (0.67)      9 (0.73)
  fsr            0.63 (0.041)    0.52 (0.067)    0.49 (0.050)
  nsr            0.22 (0.085)    0.11 (0.067)    0.10 (0.073)
p = 2000
  |ŝ|_avg        21 (0)          20.9 (0.10)     20.4 (0.43)
  |ŝ∩s*|_avg     6.1 (0.92)      6.6 (0.85)      6.7 (0.86)
  fsr            0.71 (0.044)    0.68 (0.041)    0.66 (0.049)
  nsr            0.39 (0.092)    0.34 (0.085)    0.33 (0.086)
p = 5000
  |ŝ|_avg        21 (0)          19.9 (1.10)     20.3 (0.70)
  |ŝ∩s*|_avg     3.7 (0.79)      5.4 (0.93)      5.2 (0.95)
  fsr            0.82 (0.037)    0.69 (0.085)    0.73 (0.062)
  nsr            0.63 (0.079)    0.46 (0.093)    0.48 (0.095)

Note that for this example, all three methods selected about 21 variables, which is approximately equal to n/log(n), the upper bound on the model size set by ISIS. Without this upper bound, more variables would be selected. For example, we have also implemented Lasso using the R-package glmnet with the regularization parameter tuned via cross-validation. The average number of variables selected by Lasso for the datasets with p = 5000 is 72.7, with standard deviation 3.05.

For this example, the difficulty suffered by the regularization methods, whose performance deteriorates with the dimension, can be alleviated using the BwC method. The key difference between the BwC method and the three methods is that BwC decomposes the high-dimensional variable selection problem into a series of lower-dimensional variable selection problems and applies a different regularization function (or parameter), as prescribed in (5), to each of the lower-dimensional problems. In addition, the choice of the regularization function (or parameter) is adaptive and can change from iteration to iteration. Although this makes the regularization methods a little more complicated, in return it can significantly increase the accuracy of variable selection and parameter estimation. The blockwise coordinate ascent method could also be used to maximize the objective function (10), but it employs the same regularization parameter for all blocks.

3.1. An Illustrative Example

To illustrate the performance of BwC, we use the datasets with p = 5000 generated above. We first consider a naive version of BwC before giving a more sophisticated one. For this naive version of BwC, we slightly modified the datasets: we exchanged the positions of some variables such that the 10 true variables are positioned as {1, 2, 1001, 1002, 2001, 2002, 3001, 3002, 4001, 4002}. Then BwC was implemented as follows:

  1. Split the predictors into 5 blocks: {x1, …, x1000}, {x1001, …, x2000}, {x2001, …, x3000}, {x3001, …, x4000}, and {x4001, …, x5000}.

  2. Conduct variable selection using Lasso for each block independently, and combine the selected predictors to get an initial estimate of θ.

  3. Conduct blockwise conditional variable selection using MCP for 25 sweeps. Here a sweep refers to a cycle of updates over all blocks. (A schematic sketch of this loop is given below.)
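A minimal sketch of this naive loop, run on the simulated data above (for simplicity the true variables are left in their original positions rather than re-positioned across blocks); scikit-learn's LassoCV is used both for the initial per-block fits and as a stand-in for the MCP/BIC-tuned selection of the paper, which relies on the R package SIS, so this is an approximation rather than the exact procedure:

import numpy as np
from sklearn.linear_model import LassoCV

def naive_bwc(X, y, n_blocks=5, sweeps=25):
    n, p = X.shape
    blocks = np.array_split(np.arange(p), n_blocks)      # contiguous blocks of predictors
    theta = np.zeros(p)
    # Initialization: variable selection within each block independently.
    for b in blocks:
        theta[b] = LassoCV(cv=5).fit(X[:, b], y).coef_
    # Blockwise conditional selection: refit each block on the partial residual
    # that removes the current contribution of the other blocks.
    for _ in range(sweeps):
        for b in blocks:
            others = np.setdiff1d(np.arange(p), b)
            r = y - X[:, others] @ theta[others]
            theta[b] = LassoCV(cv=5).fit(X[:, b], r).coef_
    return theta

theta_hat = naive_bwc(X, y)
selected = set(np.flatnonzero(theta_hat != 0))
true_set = set(range(10))
fsr = len(selected - true_set) / max(len(selected), 1)   # as defined in (13)
nsr = len(true_set - selected) / len(true_set)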

In this implementation, we use Lasso to construct the initial estimate, because Lasso is less likely to miss true variables than the other methods, though it tends to select more. This reduces the risk of missing important variables in the conditioning set at the early stage of the algorithm and thus accelerates its convergence. The MCP and SCAD algorithms have also been tried in the initial step, with similar results. We use MCP as the block consistent estimator because MCP itself is consistent for high-dimensional variable selection. For MCP, the regularization parameter is determined according to the BIC criterion, and the other parameters are set to their default values in the R-package SIS. To save computational time, we set the number of iterations in the sure independence screening (SIS) step to 1. Multiple iterations will not hurt the performance of BwC except for the CPU time. Figure 1 shows the convergence path of BwC for one dataset. For this dataset, BwC converges to a model with 11 variables, including the 10 true variables and one false variable, and the estimation error of θ is about 0.5. BwC can converge very fast, usually within 5 sweeps. Other results are summarized in Table 2.

Figure 1: Convergence path of BwC for one simulated dataset with n = 100 and p = 5000, where fsr, nsr, and sqrt(sse) denote the false selection rate, negative selection rate, and parameter estimation error ‖θ̂ − θ‖, respectively.

Table 2:

Comparison of BwC, Lasso, SCAD and MCP for the simulated example with p = 5000. Refer to Table 1 for the notation.

              Lasso         SCAD          MCP           BwC
|ŝ|_avg       21 (0.0)      19.9 (1.10)   20.3 (0.70)   12.8 (0.36)
|ŝ∩s*|_avg    3.7 (0.79)    5.4 (0.93)    5.2 (0.95)    10 (0)
fsr           0.824         0.729         0.744         0.219
nsr           0.63          0.46          0.48          0

For comparison, the results of Lasso, SCAD and MCP are also included in the table. The comparison shows that BwC has made a drastic improvement over Lasso, SCAD and MCP in both variable selection and parameter estimation. On average, BwC selected only 12.8 variables for each dataset, without missing any true variables. Recall that the number of true variables is 10 and all the variables are highly correlated, with a mutual correlation coefficient of 0.5. SCAD and MCP selected about 20 variables for each dataset, missing about half of the true variables. Lasso performed even worse than SCAD and MCP: it missed over 60% of the true variables. In different runs of BwC, SCAD and Lasso were also used as the block consistent estimator. SCAD produced almost the same results as MCP, while Lasso did not. This is because Lasso is not consistent for variable selection unless the strong irrepresentable condition holds (Zhao and Yu, 2006; Zou, 2006), and such a condition is hard to satisfy in this example due to the high correlation between the predictors.

BwC works extremely well for this example. The major reason is that BwC decomposes the high-dimensional variable selection problem into a series of lower-dimensional variable selection problems and chooses the regularization parameter for each of the lower-dimensional problems in an adaptive way. Since the regularization method can work well for each of the lower-dimensional problems, its performance on the original high-dimensional problem is thus significantly improved. Note that we do not claim that the BwC algorithm is the best for high-dimensional variable selection, but rather that it can improve the performance of algorithms that are sensitive to the dimension. For algorithms that are less sensitive to the dimension, the improvement might be limited. In general, the l_0-penalty based algorithms, e.g., Liang et al. (2013) and Chen and Chen (2008), are less sensitive to the dimension. Given its strong oracle optimality (Fan et al., 2014), the SparseNet algorithm (Mazumder et al., 2011b) is very attractive for high-dimensional regression. The SparseNet algorithm has been implemented in the R package sparsenet. For this example, it selected on average (over the 10 datasets) 14.9 variables with a standard deviation of 2.51, and the corresponding fsr and nsr are 0.329 and 0, respectively. Given its excellent performance, we would naturally like to employ it in the BwC algorithm. However, in the package sparsenet, it is implemented with the regularization parameter determined through cross-validation, which prevents it from being used in the BwC algorithm. It is known that the cross-validation criterion is prediction-based and the models selected through it tend to be larger than the true one.

The other reason why BwC performs so well for this example is that it has been designed for an ideal scenario in which each block contains some significant variables. Here the significant variables can be understood as the true variables or the variables that are significantly correlated with the response variable. Since variable selection is to distinguish significant variables from the other variables, the existence of significant variables in each block can reasonably accelerate the convergence of BwC by avoiding the selection of too many noise variables from empty blocks. A block is called empty if it contains no significant variables. To achieve such an ideal scenario, we propose a balanced data partitioning scheme, as described in the next subsection. Along with the description, we also discuss how to choose the number of blocks.

3.2. BwC with a Balanced Data Partitioning Scheme

In the initial step of BwC, variable selection is done for each block independently, so for each block the model can be misspecified, with some true variables missing from the set of available variables. As shown by Song and Liang (2015b), for each block the variables found in this initial step include the true variables as well as surrogate variables for the true variables missing from the block. Hence, we can expect the total number of variables selected in the initial step to be large. Then, via conditional updating, some surrogate variables can be deleted. Therefore, the number of selected variables tends to decrease with the sweeps. To achieve the goal of balanced data partitioning, we propose to re-distribute the selected variables to different blocks in a balanced manner, i.e., each block is assigned about the same number of selected variables. For convenience, we call each re-distribution of selected variables a stage of the algorithm. To accommodate the decreasing trend in the number of selected variables, we let the number of blocks diminish with the stages. In summary, we have the following algorithm:

BwC for High-Dimensional Regression

1. Split the predictors into K = K_max blocks, conduct variable selection using Lasso for each block independently, and combine the selected variables to get an initial estimate of θ.
2. Conduct blockwise conditional variable selection using MCP for m sweeps.
3. Reduce the number of blocks K by s, which is called the block diminishing size, and re-distribute the variables selected in step 2 to the blocks in a balanced manner.
4. Go to step 2 unless K has become smaller than K_min, a pre-specified number. Usually we set K_min = 1 or 2.
5. Select the final model from the (K_max − K_min)/s + 1 models produced above according to a pre-specified criterion such as sparsity, prediction error, BIC or EBIC (Chen and Chen, 2008).

(Algorithm 3.1)

In practice, we often set K_max = 10 or larger, with 10 as the default value. Since the performance of the regularization methods can deteriorate with the dimension (see Table 1), a small initial block size is generally preferred. On the other hand, an extremely small initial block size can result in a large set of significant variables identified in the initial step, which, in turn, slows down the convergence of BwC. As a trade-off, we suggest setting the initial block size to 1000 or less. Since BwC can converge pretty fast for a given partition, we set m = 25 as the default number of sweeps, which is extremely conservative according to our experience. Also, we point out that the MCP algorithm, as implemented in the package SIS, assumes that significant variables can be found for each regression; we modified the code slightly so that it allows for the return of null models. For the block diminishing size s, we set 2 as the default value, but 1 and 3 are also often used. In distributing the selected variables to different blocks, we often have the unselected variables shuffled and redistributed as well, though this is not strictly necessary; the shuffling and redistribution step helps to maintain equal block sizes as the stages proceed. For BwC, we often output the sparsest model as the final model, which is set as the default criterion for model selection. Other criteria, such as prediction error, BIC and EBIC, can also be used for final model selection. In what follows, BwC is run under the default settings unless stated otherwise.
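A sketch of the balanced re-distribution used in step 3 of Algorithm 3.1 (an illustrative implementation of ours; shuffling the unselected variables keeps block sizes roughly equal across stages):

import numpy as np

def balanced_repartition(selected, unselected, n_blocks, seed=None):
    # Spread the currently selected variables evenly over n_blocks, then shuffle
    # the unselected variables and spread them over the same blocks so that the
    # block sizes remain roughly equal.
    rng = np.random.default_rng(seed)
    sel = rng.permutation(np.asarray(selected))
    unsel = rng.permutation(np.asarray(unselected))
    blocks = [list(b) for b in np.array_split(sel, n_blocks)]
    for i, extra in enumerate(np.array_split(unsel, n_blocks)):
        blocks[i].extend(extra)
    return [np.sort(np.asarray(b, dtype=int)) for b in blocks]

# Example: after a stage, diminish K from 10 to 8 and re-partition.
# blocks = balanced_repartition(selected_idx, unselected_idx, n_blocks=8)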

To illustrate the performance of BwC with the balanced data partitioning scheme, we considered a modified version of the example used in Section 3.1. Here we considered four settings of the true model, with true model size |s*| = 5, 10, 15 and 20, respectively. Under each setting, the positions of the true variables are uniformly distributed among the first 3500 of the p = 5000 variables. This mimics the situation in which empty blocks exist in the early stages of BwC. To accommodate the larger true model sizes, we increased the sample size n from 100 to 200. For each true model size, we generated 10 datasets independently. To measure the prediction performance of the selected models, we generated an additional 1000 observations as a test set for each of the 40 datasets.

BwC was applied to these datasets under the default settings. The results are summarized in Figure 2 and Table 3. Figure 2 shows the number of selected variables and the sum of squared prediction errors produced by BwC for 4 datasets, each with a different true model size, as a function of the number of blocks. For the datasets with |s*| = 5, 10, 15 and 20, the sparsest models were produced at K = 2, 4, 6 and 6, respectively. In terms of variable selection, the results are surprisingly good: the sparsest models produced by BwC are almost identical to the true models. It is interesting to see that the model sizes produced by BwC form a U-shaped curve in the number of blocks, where the sparsest model (usually the true model) is located at the bottom of the curve. For the dataset with |s*| = 5, although the size of the models produced by BwC tends to decrease with K, it still forms a U-shaped curve if the non-blocking case (i.e., K = 1) is included. From this U-shaped curve, the optimal block number can be automatically determined. For these datasets, it is easy to see that the curves of prediction error match well with the curves of model size.

Figure 2: The number of selected variables (dashed line) and the sum of squared prediction errors (solid line) as a function of the number of blocks.

Table 3:

Numerical results of BwC, Lasso, SCAD and MCP for the simulated data with the mutual correlation coefficient ρ = 0.5: The CPU time (in seconds) is measured for a single run on an Intel Core i7–4790@3.60GHz Quad-Core desktop. Refer to Table 1 for other notations.

                 Lasso          SCAD           MCP            BwC
|s*| = 5
  |ŝ|_avg        18.9 (3.07)    20.6 (4.37)    21.5 (0.77)    6.7 (0.30)
  |ŝ∩s*|_avg     5 (0)          5 (0)          5 (0)          5 (0)
  fsr            0.69 (0.032)   0.50 (0.137)   0.75 (0.010)   0.24 (0.036)
  nsr            0 (0)          0 (0)          0 (0)          0 (0)
  CPU time       5.41 (1.26)    18.94 (4.43)   20.53 (1.60)   30.79 (0.41)
|s*| = 10
  |ŝ|_avg        35.3 (1.70)    28.4 (4.04)    25.4 (1.27)    12.7 (0.56)
  |ŝ∩s*|_avg     10 (0)         10 (0)         10 (0)         9.9 (0.10)
  fsr            0.71 (0.023)   0.51 (0.111)   0.60 (0.022)   0.21 (0.038)
  nsr            0 (0)          0 (0)          0 (0)          0.01 (0.01)
  CPU time       10.14 (0.89)   23.28 (3.50)   26.66 (1.34)   31.43 (0.36)
|s*| = 15
  |ŝ|_avg        37 (0)         34.3 (2.20)    30.3 (0.75)    17.5 (0.62)
  |ŝ∩s*|_avg     12.5 (1.21)    14.5 (0.50)    15 (0)         15 (0)
  fsr            0.66 (0.032)   0.54 (0.062)   0.50 (0.012)   0.13 (0.028)
  nsr            0.17 (0.081)   0.03 (0.033)   0 (0)          0 (0)
  CPU time       16.97 (2.81)   24.46 (2.23)   32.78 (0.31)   31.98 (0.11)
|s*| = 20
  |ŝ|_avg        37 (0)         37 (0)         36.5 (0.34)    24 (0.71)
  |ŝ∩s*|_avg     9 (1.42)       11.6 (1.69)    12.7 (1.69)    19.9 (0.10)
  fsr            0.76 (0.038)   0.69 (0.046)   0.65 (0.050)   0.16 (0.026)
  nsr            0.55 (0.071)   0.42 (0.085)   0.37 (0.084)   0.01 (0.005)
  CPU time       13.49 (1.96)   17.37 (1.59)   24.45 (2.57)   32.64 (0.52)

Table 3 summarizes the results of BwC for the 40 datasets. For comparison, the regularization methods, Lasso, SCAD and MCP, were also applied to these datasets. As for the previous examples, these methods were run with the package SIS under their default settings except that the regularization parameters were determined according to the BIC criterion. The comparison indicates that BwC is superior to these regularization methods in both variable selection and parameter estimation for all the four cases. The regularization methods tend to select more false variables and miss more true variables as |s*| increases, while BwC can consistently identify almost all true variables.

Table 3 also reports the CPU time (in seconds) taken by each method for a single dataset. The comparison indicates that BwC does not cost much more CPU time than SCAD and MCP. Note that in the package SIS, all the methods, Lasso, SCAD and MCP, are implemented with a variable screening step. Under the default settings, the variable screening step is done with the ISIS algorithm, which may take quite a few iterations to converge for some datasets. For this reason, SCAD, MCP and BwC take comparable CPU time. Compared to SCAD and MCP, Lasso tends to select more variables at each iteration. As a consequence, it often takes fewer iterations to converge and thus costs less CPU time.

For the model with |s*| = 10, we have considered two sample sizes, n = 100 and 200, with the results reported in Table 2 and Table 3, respectively. A comparison of the two tables shows that the performance of all methods improves as the sample size increases. For BwC, the average model size |ŝ|_avg is closer to the true model size and the value of fsr is further reduced as the sample size increases. For the regularization methods, both fsr and nsr have been reduced, although the average model size is still very large. This comparison indicates that the performance of BwC is much less dependent on the sample size than that of the regularization methods.

For a thorough comparison of BwC with the regularization methods, we also consider the scenario in which the mutual correlation between different predictors is low. For this purpose, we generated datasets according to (11) and (12) with the variance of e tuned such that the mutual correlation coefficient equals ρ = 0.2. The results for this case are summarized in Table 4. The comparison indicates that the superiority of BwC over the regularization methods does not depend much on the strength of the correlation between predictors, but stems mainly from the improved performance of the regularization methods on low-dimensional problems.

Table 4:

Numerical results of BwC, Lasso, SCAD and MCP for the simulated data with the mutual correlation coefficient ρ = 0.2. Refer to Table 3 for the notation.

                 Lasso          SCAD           MCP            BwC
|s*| = 5
  |ŝ|_avg        22.6 (3.97)    11.5 (4.25)    17.8 (4.34)    5.6 (0.31)
  |ŝ∩s*|_avg     5 (0)          5 (0)          5 (0)          5 (0)
  fsr            0.71 (0.048)   0.19 (0.113)   0.43 (0.134)   0.09 (0.040)
  nsr            0 (0)          0 (0)          0 (0)          0 (0)
|s*| = 10
  |ŝ|_avg        37 (0)         12.8 (2.69)    28.7 (4.02)    11.4 (0.43)
  |ŝ∩s*|_avg     9.9 (0.1)      10 (0)         10 (0)         10 (0)
  fsr            0.73 (0.003)   0.08 (0.073)   0.52 (0.107)   0.11 (0.031)
  nsr            0.01 (0)       0.01 (0)       0 (0)          0 (0)
|s*| = 15
  |ŝ|_avg        37 (0)         30.4 (3.36)    32.6 (2.93)    17.2 (0.53)
  |ŝ∩s*|_avg     15 (0)         14.6 (0.40)    14.6 (0.40)    15 (0)
  fsr            0.59 (0)       0.43 (0.094)   0.49 (0.082)   0.12 (0.026)
  nsr            0 (0)          0.03 (0.027)   0.03 (0.027)   0 (0)
|s*| = 20
  |ŝ|_avg        37 (0)         33.6 (2.27)    37 (0)         23.8 (1.44)
  |ŝ∩s*|_avg     10.8 (1.44)    14.6 (1.92)    14.6 (1.81)    19.9 (0.10)
  fsr            0.71 (0.039)   0.51 (0.097)   0.62 (0.049)   0.14 (0.044)
  nsr            0.46 (0.072)   0.27 (0.096)   0.30 (0.091)   0.01 (0.005)

3.3. A Real Data Study: Biomarker Discovery for Anticancer Drug Sensitivity

Complex diseases such as cancer often show significant heterogeneity in response to treatments. Hence, individualized treatment based on the patient's prognostic or genomic data, rather than a "one treatment fits all" approach, could lead to significant improvement of patient care. The success of personalized medicine decisively depends on the accuracy of diagnosis and outcome prediction (Hamburg and Collins, 2010), and thus the discovery of precise biomarkers for disease is essential. Recent advances in high-throughput biotechnologies, such as microarray, sequencing technologies and mass spectrometry, have provided an unprecedented opportunity for biomarker discovery. Given the high dimensionality of the omics data, biomarker discovery is best cast as variable selection from a statistical point of view. In this study, we consider the problem of identifying the genes that predict anticancer drug sensitivity for different cell lines, toward the ultimate goal of selecting the right drugs for individual patients.

We considered a dataset extracted from the Cancer Cell Line Encyclopedia (CCLE) database. The dataset includes the sensitivity data of 474 cell lines to the drug Topotecan, as well as the expression data of 18,926 genes for each cell line. Topotecan is a chemotherapeutic agent that acts as a topoisomerase inhibitor, and it has been used to treat lung cancer, ovarian cancer, and other types of cancer. The drug sensitivity was measured using the area under the dose-response curve, which is termed the activity area in Barretina et al. (2012). For this dataset, BwC was run with the initial block number K = 20 and the block diminishing size s = 3. To explore this dataset thoroughly, BwC was run 25 times. Note that the models selected by BwC may differ across runs due to the stochastic nature of redistributing significant genes to different blocks. The average number of genes found by BwC is 9.88, with standard deviation 0.40. Table 5 lists the top 10 genes identified by BwC, in the order of their frequencies of appearance in the 25 final models. The appearance frequency provides a simple measure of the importance of each gene for predicting drug sensitivity, obtained by integrating over multiple models.

Table 5:

Top 10 genes selected by BwC for the drug Topotecan based on 25 independent runs.

No. Gene Frequency No. Gene Frequency
1 SLFN11 25 6 MFAP2 6
2 HSPB8 11 7 RFXAP 6
3 C14orf93 8 8 ANO10 5
4 ILF3 7 9 AGPAT5 5
5 C15orf57 6 10 RPL3 5

Our result is consistent with current biological knowledge. For example, the first gene, SLFN11, encodes a helicase that breaks apart annealed nucleic acid strands. Helicases are known to work hand in hand with topoisomerases in many major DNA events (Duguet, 1997). It was also found by Barretina et al. (2012) that SLFN11 is predictive of treatment response for Topotecan, as well as Irinotecan, another inhibitor of TOP-I. It is known that the gene HSPB8 interacts with the gene HSPB1, and a recent study (Li et al., 2016) shows that HSPB1 polymorphisms might be associated with radiation-induced damage risk in lung cancer patients treated with radiotherapy. The gene ILF3 is also related to lung cancer; it shows correlated mRNA and protein over-expression in lung cancer development and progression (Guo et al., 2008). The relationship of the other genes to drug sensitivity will be explored elsewhere.

For comparison, Lasso, SCAD and MCP were also applied to this dataset. Similar to the simulated examples, they tend to select larger models than BwC: Lasso selected 18 genes, MCP selected 21 genes, and SCAD selected 52 genes, all significantly larger than the average model size produced by BwC.

4. BwC for High Dimensional Multivariate Regression

Multivariate regression generalizes the standard regression model by regressing q > 1 responses on p predictors. Applications of this model often arise in biomedicine, psychometrics, and many other quantitative disciplines. Let X ∈ ℝ^{n×p} denote the design matrix, let Y ∈ ℝ^{n×q} denote the response matrix, let B ∈ ℝ^{p×q} denote the regression coefficient matrix, and let E ∈ ℝ^{n×q} denote the random error matrix. Then the model can be written as

Y=XB+E. (14)

For convenience, we let y_i and y_(i) denote the ith column and ith row of Y, respectively. Likewise, we let ϵ_i and ϵ_(i) denote the ith column and ith row of E, respectively. Assume that the ϵ_(i)'s follow a multivariate normal distribution N(0, Σ) and that ϵ_(1), …, ϵ_(n) are mutually independent. We are interested in jointly estimating the regression coefficient matrix B and the precision matrix Ω = Σ^{−1}, in particular when q and/or p are greater than n.

In the literature, there are several papers on variable selection for multivariate linear regression, see, e.g., Turlach et al. (2005) and Peng et al. (2010), under the assumption that the response variables are mutually independent. There are also several methods for precision matrix estimation, such as gLasso (Yuan and Lin, 2007; Friedman et al., 2008), nodewise regression (Meinshausen and Bühlmann, 2006), and ψ-learning (Liang et al., 2015). However, papers that address simultaneous variable selection and precision matrix estimation in the context of multivariate regression are few. A few exceptions are Rothman et al. (2010), Bhadra and Mallick (2013), Sofer et al. (2014), and Wang (2015). Rothman et al. (2010) proposed an iterative approach that alternately estimates the precision matrix and the regression coefficients under an l_1-penalty, but without any theoretical guarantee for the convergence of their algorithm. Bhadra and Mallick (2013) proposed a Bayesian approach to the problem, but again no convergence results were proved. Sofer et al. (2014) proved the convergence of the iterative approach of Rothman et al. (2010) under very restrictive conditions, such as pq/n → 0 and q²/n → 0, which require both p and q to be much smaller than n. Wang (2015) proposed a penalized conditional log-likelihood approach, in which the conditional log-likelihood function is constructed for each response conditional on the covariates and the other responses, and established estimation consistency and selection consistency with diverging dimensions of the covariates and the responses. In Wang (2015), the algorithm iterates only once, conditioned on the initial estimate of B. We believe that a BwC implementation of the algorithm, i.e., iterating the algorithm multiple times conditioned on the updated estimate of B, would improve its performance considerably. We note that Cai et al. (2013) considered estimation of the precision matrix with covariate effects adjusted, but not in the multivariate regression setting. They proposed a two-stage regularization approach to estimate the covariate coefficients and the precision matrix. Like the algorithm of Wang (2015), the two-stage approach can also be formulated under the framework of BwC, and its numerical performance is expected to improve with multiple iterations.

The BwC method can be naturally applied to this problem by treating B and Ω as two separate parameter blocks. It is easy to verify that the conditions (A1)–(A4) are satisfied for this problem. Therefore, BwC provides a theoretical guarantee for the convergence of the iterative approach used in Rothman et al. (2010), Sofer et al. (2014) and Wang (2015). More importantly, the theory of BwC allows both q and p to be greater than the sample size n. For problems in which p is extremely large, BwC can be run in a nested manner: conditioned on the estimate of Ω, BwC can again be used as a subroutine for the estimation of B. In what follows, we first give a brief review of the iterative approach used in Rothman et al. (2010) in order to distinguish it from BwC.

4.1. A Brief Review of the Existing Iterative Approach

Consider the multivariate regression model (14). With the regularization methods, a natural way is to minimize the objective function

Q(B, Ω) = −ln|Ω| + trace[ (1/n)(Y − XB)ᵀ(Y − XB)Ω ] + P_λ(B) + P_γ(Ω),   (15)

where P_λ(·) and P_γ(·) are the penalty functions for B and Ω, respectively, and λ and γ are regularization parameters. In practice, P_λ(·) can be set to a standard penalty function, such as the one used in Lasso, SCAD or MCP, and P_γ(·) can be set to the l_1-penalty used in the graphical Lasso (Friedman et al., 2008).
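A minimal numpy sketch of the objective (15) with l1 penalties on B and on the off-diagonal entries of Ω; whether the diagonal of Ω is penalized differs across implementations, and leaving it unpenalized here is an illustrative choice, not necessarily that of Rothman et al. (2010):

import numpy as np

def mrce_objective(Y, X, B, Omega, lam, gam):
    # Evaluate Q(B, Omega) in (15) with l1 penalties on B and on the
    # off-diagonal entries of Omega.
    n = Y.shape[0]
    R = Y - X @ B                                  # residual matrix
    sign, logdet = np.linalg.slogdet(Omega)        # Omega must be positive definite
    if sign <= 0:
        return np.inf
    fit = -logdet + np.trace(R.T @ R @ Omega) / n  # negative log-likelihood part
    pen_B = lam * np.abs(B).sum()
    pen_Omega = gam * (np.abs(Omega).sum() - np.abs(np.diag(Omega)).sum())
    return fit + pen_B + pen_Omega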

To minimize Q(B, Ω) simultaneously over B and Ω, Rothman et al. (2010) proposed an iterative approach, the so-called multivariate regression with covariance estimation (MRCE) algorithm, which is actually a block coordinate descent algorithm. Let Ω̂ and B̂ denote the current estimators of Ω and B, respectively. One iteration of the MRCE algorithm consists of two general steps:

Algorithm 4.1 MRCE Algorithm

(i) Estimate B by minimizing Q(B|Ω^) conditional on the current estimator of Ω, where

Q(B | Ω̂) = trace[ (1/n)(Y − XB)ᵀ(Y − XB)Ω̂ ] + P_λ(B).   (16)

(ii) Estimate Ω by minimizing Q(Ω|B^) conditional on the current estimator of B, where

Q(Ω | B̂) = −ln|Ω| + trace[ (1/n)(Y − XB̂)ᵀ(Y − XB̂)Ω ] + P_γ(Ω).   (17)

Step (ii) can be accomplished using the graphical Lasso algorithm. To accomplish step (i), Rothman et al. (2010) suggested setting a Lasso penalty for B, i.e., P_λ(B) = λ Σ_{i=1}^q Σ_{j=1}^p |β_{ji}|, and then applying the cyclical coordinate descent algorithm to solve (16) for B. Let S = XᵀX and H = XᵀYΩ̂, let B̂^(m) denote the estimate of B at the m-th iteration, and let B^RIDGE = (XᵀX + λI)^{−1}XᵀY denote the ridge-penalized least-squares estimator of B. The cyclical coordinate descent algorithm is described as follows:

Algorithm 4.2.

Cyclical coordinate descent algorithm for multivariate regression

1. Set B̂^(m) ← B̂^(m−1), visit all entries of B̂^(m) in some sequence, and for entry (r, c) update β̂_rc^(m) with the minimizer of the objective function along its coordinate direction, given by

   β̂_rc^(m) ← sign( β̂_rc^(m) + (h_rc − u_rc)/(s_rr ω_cc) ) ( | β̂_rc^(m) + (h_rc − u_rc)/(s_rr ω_cc) | − nλ/(s_rr ω_cc) )_+,

   where u_rc = Σ_{j=1}^p Σ_{k=1}^q β̂_jk^(m) s_rj ω_kc.

2. If Σ_{j,k} |β̂_jk^(m) − β̂_jk^(m−1)| < ϵ Σ_{j,k} |β̂_jk^RIDGE|, then stop; otherwise go to step 1, where ϵ is a pre-specified small number, say 10^{−4}.

It follows from the convergence theory of Tseng (2001) that this algorithm is guaranteed to converge to the global minimizer if the given precision matrix Ω̂ is non-negative definite. In this case, the trace term in the objective function (16) is convex and differentiable, and the penalty term decomposes into a sum of convex functions of the individual parameters. As analyzed in Rothman et al. (2010), the computational complexity of this algorithm is O(p²q²) per iteration, as it needs to cycle through pq parameters, and each calculation of u_rc costs O(pq) flops.

For high-dimensional problems, the iterative approach often performs less well. Firstly, the performance of the cyclical coordinate descent algorithm, which is a simple adaptation of the Lasso algorithm to multivariate regression, can deteriorate as the dimension increases. Secondly, the errors introduced in one step can adversely affect the performance of the approach in the other step. Also, the approach often takes a large number of iterations to converge, as noted by Rothman et al. (2010). To address these issues, Rothman et al. (2010) proposed the so-called approximate MRCE (AMRCE) algorithm, which is described as follows.

Approximate MRCE (AMRCE) Algorithm

1. Perform q separate Lasso regressions, each with the same optimal tuning parameter λ̂_0 selected via cross-validation. Let B̂_{λ̂_0}^{lasso} denote the solution.
2. Compute Ω̂ = Ω̂(B̂_{λ̂_0}^{lasso}) through the gLasso algorithm.
3. Compute B̂ = B̂(Ω̂) through the cyclical coordinate descent algorithm.

(Algorithm 4.3)

The AMRCE algorithm basically outputs the result of the first sweep of the iterative approach, except that the initial value of the parameter matrix B̂ is obtained by running q separate Lasso regressions, each with the same optimal tuning parameter λ̂_0 selected via cross-validation. In this paper, we implemented step 2 using the R-package gLasso (Friedman et al., 2015) and step 3 using the R-package MRCE (Rothman, 2015). The two-stage algorithm proposed by Sofer et al. (2014) is essentially the same as the AMRCE algorithm, where one stage estimates B conditional on the estimate of Ω, and the other stage estimates Ω conditional on the estimate of B. The major difference between the two algorithms lies in how the initial estimators are set and how the regularization parameters are tuned.

4.2. The BwC Approach

Consider the following transformation for the model (14) for a given precision matrix Ω:

$$\tilde Y = X\tilde B + \tilde E, \qquad (18)$$

where $\tilde Y = Y\Omega^{1/2}$, $\tilde B = B\Omega^{1/2}$ and $\tilde E = E\Omega^{1/2}$. For this transformed model, the elements of $\tilde E$ are i.i.d. Gaussian $N(0,1)$. If we partition the matrices $\tilde Y$, $\tilde B$ and $\tilde E$ column-wise, i.e., letting $\tilde Y = (\tilde y_1,\tilde y_2,\ldots,\tilde y_q)$, $\tilde B = (\tilde\beta_1,\ldots,\tilde\beta_q)$, and $\tilde E = (\tilde\epsilon_1,\tilde\epsilon_2,\ldots,\tilde\epsilon_q)$, then the model (18) can be expressed as $q$ independent regressions:

$$\tilde y_i = X\tilde\beta_i + \tilde\epsilon_i, \quad i=1,\ldots,q, \qquad (19)$$

where $\tilde\epsilon_i$ follows the multivariate Gaussian distribution $N_n(0,I)$. Based on this transformation, the BwC approach can be applied to iteratively estimate the coefficient matrix $\tilde B$ and the precision matrix $\Omega$. One sweep of the approach consists of the following steps:

Algorithm 4.4.

BwC for Transformed Multivariate Regression

1. Conditioned on the current estimator of $\Omega$, estimate $\tilde\beta_1,\ldots,\tilde\beta_q$ independently according to (19) using the BwC Algorithm 3.1.

2. Conditioned on the current estimator of $\tilde B$, estimate the adjacency structure of the precision matrix $\Omega$ from the residuals $Y - X\hat{\tilde B}\Sigma^{1/2}$ using the $\psi$-learning algorithm (Liang et al., 2015), where $\hat{\tilde B}$ denotes the current estimate of $\tilde B$.

3. Based on the adjacency structure obtained in step 2, recover the precision matrix $\Omega$ using the algorithm given in Hastie et al. (2009, p.634).

Note that Algorithm 4.4 is a nested BwC algorithm, where $\tilde B$ is itself estimated using the BwC algorithm. Given an estimate of $\tilde B$, one may obtain an estimate of $B$ via the transformation $\hat B = \hat{\tilde B}\hat\Sigma^{1/2}$. Note that the estimate $\hat B$ obtained in this way might not be very sparse (although it is still sparse in rows), as $\hat\Sigma^{1/2}$ can be a dense matrix. However, since a zero row of $\tilde B$ implies a zero row of $B$, $\hat{\tilde B}$ can be used for variable screening, i.e., removing those variables in $X$ which correspond to a zero row of $\hat{\tilde B}$. After variable screening, the remaining rows of $B$ can be easily estimated using Algorithm 4.2 based on the estimate of $\Omega$. In summary, the proposed approach consists of the following steps:
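
The transformation (18) and the back-transformation $\hat B = \hat{\tilde B}\hat\Sigma^{1/2}$ only require a symmetric square root of the working precision matrix. A minimal R sketch (the helper names are ours) is:

```r
# Symmetric square root of a positive definite matrix via eigendecomposition.
sqrtm_sym <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*% t(e$vectors)
}

# Whiten the responses as in (18): Y_tilde = Y %*% Omega^{1/2}, so that the
# q columns of Y_tilde can be regressed on X independently.
whiten_Y <- function(Y, Omega) Y %*% sqrtm_sym(Omega)

# Back-transform an estimate of B_tilde to the original scale:
# B_hat = B_tilde_hat %*% Sigma^{1/2}, with Sigma = solve(Omega).
back_transform_B <- function(B_tilde, Omega) B_tilde %*% sqrtm_sym(solve(Omega))
```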

Algorithm 4.5.

BwC for Multivariate Regression

1. (Initialization) Use the $\psi$-learning algorithm to estimate the adjacency structure of $\Omega$ from the response variables $Y$, and recover the working precision matrix $\hat\Omega$ from this structure using the algorithm given in Hastie et al. (2009, p.634).

2. ($\tilde B$- and $\Omega$-estimation) Conditioned on the working precision matrix $\hat\Omega$, estimate $\tilde B$ and $\Omega$ using Algorithm 4.4.

3. (Variable screening) Remove all the variables in $X$ which correspond to the zero rows of $\hat{\tilde B}$. Denote the reduced design matrix by $X_{\mathrm{red}}$.

4. ($B$-estimation) Conditioned on the working precision matrix $\hat\Omega$, estimate the remaining rows of $B$ using Algorithm 4.2 from the reduced model $Y = X_{\mathrm{red}}B_{\mathrm{red}} + E$.
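
Step 3 of Algorithm 4.4 and step 1 of Algorithm 4.5 recover $\Omega$ from a given adjacency structure. As a stand-in for the algorithm of Hastie et al. (2009, p.634), the sketch below computes the structure-constrained maximum likelihood estimate by calling glasso with a negligible penalty and its zero argument; this substitution, and the assumption that the glasso package is available, are ours, and this is not the exact procedure used in the paper.

```r
library(glasso)

# Recover a precision matrix from an estimated adjacency structure by maximum
# likelihood with the non-edges constrained to zero.
# S: q x q sample covariance matrix; adj: q x q 0/1 adjacency matrix.
precision_from_structure <- function(S, adj) {
  nonedges <- which(adj == 0 & row(adj) != col(adj), arr.ind = TRUE)
  fit <- glasso(S, rho = 1e-4, zero = nonedges, penalize.diagonal = FALSE)
  fit$wi   # estimated precision matrix with zeros on the non-edges
}
```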

Regarding this algorithm, we have a few remarks.

  1. Like the AMRCE algorithm, we find that Algorithm 4.4, used in the estimation of $\tilde B$ and $\Omega$, only needs to iterate for very few sweeps, say 1 or 2. A larger number of sweeps will not hurt the performance of the algorithm, but it is not necessary. In general, the algorithm produces very stable estimates (with most of the significantly non-zero elements of $\tilde B$ and $\Omega$ being identified) within very few sweeps, but it will not converge to a fixed solution. This is because the execution of Algorithm 4.4 involves a lot of randomness, such as the redistribution of selected variables, the shuffling of unselected variables, and the stochastic approximation estimation involved in the $\psi$-learning algorithm for recovering the adjacency structure of the precision matrix.

  2. Due to the sparsity of $\tilde B$, $X$ can be greatly reduced in the variable screening step. Since the initial estimate $\hat B_{\mathrm{red}}^{(0)} = \hat{\tilde B}_{\mathrm{red}}\hat\Sigma^{1/2}$ can be quite close to the true value of $B_{\mathrm{red}}$, the $B$-estimation step usually converges very fast.

  3. The BwC approach estimates $\tilde\beta_i$, $i = 1,\ldots,q$, independently. For each $i$, the data size it deals with is $n\times p$, and the computational complexity of the BwC subroutine is $O(k'np)$, where $k'$ denotes the number of sweeps performed in the subroutine. Recall that in each conditional step of the subroutine, the SIS-MCP algorithm is implemented for variable selection, whose computational complexity is $O(nm)$, where $m$ denotes the number of variables included in the subset data. Hence, the total computational complexity for the $q$ regressions is $O(k'nqp)$, which can be much better than $O(p^2q^2)$, the per-iteration computational complexity of the MRCE algorithm.

  4. The BwC approach adopts the $\psi$-learning algorithm (Liang et al., 2015) for estimating the adjacency structure of the precision matrix. The $\psi$-learning algorithm belongs to the class of sure screening algorithms: it first reduces the neighborhood of each variable via correlation screening, and then estimates the Gaussian graphical model in the reduced model space using a procedure that is essentially the same as the covariance selection method (Dempster, 1972). By nature, the covariance selection method is a maximum likelihood estimation method. Under the Markov property and adjacency faithfulness conditions, it can be shown that the $\psi$-learning algorithm is consistent. Therefore, the $\psi$-learning algorithm can be used in the BwC algorithm, and it implicitly and asymptotically maximizes the objective function specified in (5). The numerical results reported in Liang et al. (2015) indicate that the $\psi$-learning algorithm can produce more accurate networks than gLasso. In terms of computational complexity, the $\psi$-learning algorithm is also better than gLasso: it has a computational complexity of nearly $O(q^2)$, while gLasso has a computational complexity of $O(q^3)$. Even in its fast implementation (Witten et al., 2011; Mazumder and Hastie, 2012), which makes use of the block diagonal structure in the graphical Lasso solution, gLasso still has a computational complexity of $O(q^{2+\nu})$, where $0 < \nu \le 1$ may vary with the block diagonal structure used.

Finally, we would like to point out that a fundamental difference between the MRCE approach and the BwC approach is that MRCE works with an explicit objective function, as specified in (15), while BwC does not. BwC requires only an asymptotic objective function for each block, while the joint finite-sample objective function for all blocks might not exist in an explicit form. For the $\psi$-learning algorithm, which is developed based on the theory of graphical models, an explicit finite-sample objective function does not exist, though a consistent estimator can still be produced. From this perspective, the BwC approach is flexible: it provides a general framework for combining different methods to estimate the parameters of complex models.

4.3. A Simulated Example

Consider the multivariate regression (14) with $n = 200$, $q = 100$, and $p = 3000$, 5000 and 10000. The explanatory variables are generated by (11) and (12), except that the shared error term $e$ is divided by a random number generated from Uniform[1,10]. That is, some of the explanatory variables are still highly correlated. The full regression coefficient matrix $B$ is a $p\times q$ matrix, which contains 300,000, 500,000 and 1,000,000 elements for $p = 3000$, 5000 and 10000, respectively. To set the matrix $B$, we randomly selected 200 elements and set them to 1, with all other elements set to 0. The precision matrix $\Omega = (\omega_{ij})$ is given by

$$\omega_{ij} = \begin{cases} 1, & \text{if } i = j,\\ 0.5, & \text{if } |i-j| = 1,\\ 0.25, & \text{if } |i-j| = 2,\\ 0, & \text{otherwise}. \end{cases} \qquad (20)$$

Such a precision matrix has been used by Yuan and Lin (2007), Mazumder and Hastie (2012) and Liang et al. (2015) to illustrate their respective algorithms for high dimensional Gaussian graphical models. For each value of p, we generated 10 independent datasets, and the numerical results reported below are averaged over the 10 datasets.
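
For reference, the precision matrix (20) and the corresponding error and response matrices of model (14) can be generated as in the sketch below; the helper names are ours, the design matrix X is assumed to be generated separately by (11) and (12), and the MASS package is assumed for the multivariate normal draws.

```r
library(MASS)   # for mvrnorm

# Banded precision matrix of (20): 1 on the diagonal, 0.5 on the first
# off-diagonals, 0.25 on the second off-diagonals, 0 elsewhere.
make_Omega <- function(q) {
  Omega <- diag(q)
  Omega[abs(row(Omega) - col(Omega)) == 1] <- 0.5
  Omega[abs(row(Omega) - col(Omega)) == 2] <- 0.25
  Omega
}

# Simulate the error matrix E (n x q) with rows drawn from N(0, Omega^{-1}),
# and a sparse p x q coefficient matrix B with 200 randomly placed ones.
simulate_mvreg <- function(n, p, q, X) {
  Omega <- make_Omega(q)
  E <- mvrnorm(n, mu = rep(0, q), Sigma = solve(Omega))
  B <- matrix(0, p, q)
  B[sample(p * q, 200)] <- 1
  list(Y = X %*% B + E, B = B, Omega = Omega)
}
```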

The BwC approach was first applied to this example. In the $\tilde B$- and $\Omega$-estimation step, we set the initial block number $K = 10$, the number of sweeps $m = 25$, and the block diminishing size $s = 2$, and adopted the SIS-MCP algorithm for variable selection with the regularization parameter determined by the BIC criterion. The $\psi$-learning algorithm used in precision matrix estimation contains two free parameters, $\alpha_1$ and $\alpha_2$, which represent the significance levels of the screening tests for the correlation coefficients and the $\psi$-partial correlation coefficients, respectively. The correlation screening step produces conditioning sets for evaluating the $\psi$-partial correlation coefficients. As mentioned in Liang et al. (2015), the $\psi$-learning algorithm is quite robust to the choice of $\alpha_1$, and a large $\alpha_1$ is usually preferred: it produces large conditioning sets and thus reduces the risk of missing important connections in the resulting Gaussian graphical network. Here a connection refers to a nonzero off-diagonal element of the precision matrix. For this example, we set $\alpha_1 = 0.25$. For $\alpha_2$, it is ideal to decrease its value with the sweeps: a large value of $\alpha_2$ in the early sweeps helps to keep important connections in the network, as variable selection in these sweeps might still be premature, while a small value of $\alpha_2$ in the late sweeps helps to reduce the false selection of network connections. For simplicity, we set $\alpha_2 = 0.05$ in all sweeps. In the variable screening step, the number of variables in $X$ is greatly reduced. For the datasets with $p = 3000$, on average $X_{\mathrm{red}}$ contains only 201.6 variables with a standard deviation of 1.9; for $p = 5000$, on average $X_{\mathrm{red}}$ contains only 210.1 variables with a standard deviation of 2.3; and for $p = 10000$, on average $X_{\mathrm{red}}$ contains only 218.5 variables with a standard deviation of 2.8. After obtaining the estimate of $\Omega$, the cyclical coordinate descent algorithm was applied to estimate $B_{\mathrm{red}}$ based on the reduced set of variables. For this step, the regularization parameter was tuned according to the EBIC criterion. For comparison, the AMRCE algorithm was also applied to this example, with the regularization parameter tuned according to the BIC criterion.
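
The BIC/EBIC tuning mentioned above can be illustrated with a univariate Lasso path: the sketch below scores each candidate $\lambda$ by $n\log(\mathrm{RSS}/n) + df\,(\log n + 2\gamma\log p)$, one common form of the EBIC, with $\gamma = 0$ giving the ordinary BIC. glmnet is used here only as a convenient penalized solver; it is not the SIS-MCP implementation used in the paper.

```r
library(glmnet)

# Select lambda on a Lasso path by the (E)BIC criterion.
ebic_lasso <- function(x, y, gamma = 0) {
  n <- nrow(x); p <- ncol(x)
  fit <- glmnet(x, y)
  rss <- colSums((y - predict(fit, newx = x))^2)   # RSS for every lambda on the path
  ebic <- n * log(rss / n) + fit$df * (log(n) + 2 * gamma * log(p))
  best <- which.min(ebic)
  list(lambda = fit$lambda[best], beta = as.numeric(coef(fit)[-1, best]))
}
```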

Table 6 and Table 7 summarize the results for precision matrix estimation and variable selection, respectively. The comparison shows that BwC significantly outperforms AMRCE for this example. As indicated by Table 6, AMRCE mis-identified almost all true connections, while BwC correctly identified over 80% of them. For variable selection, BwC tends to have a lower false selection rate than AMRCE, although both are able to identify all true variables. It is remarkable that, in terms of the mean squared error (MSE) of the estimate of $B$, BwC works much better than AMRCE. To explore this issue further, we plot in Figure 3 the histograms of the non-zero elements of $\hat B$ obtained for three datasets. The histograms show that, for the true variables, the regression coefficient estimates produced by BwC are closer to their true value 1 than those produced by AMRCE. Also, the true variables and the falsely selected variables are more separable for the BwC estimates than for the AMRCE estimates. In fact, with a finely tuned threshold, the true variables can be accurately identified by the BwC approach.

Table 6:

Results comparison for precision matrix estimation: the results are averages over 10 independent datasets with the standard deviations given in parentheses, where |s^|avg denotes the number of connections selected by the method, and |s^s|avg denotes the number of true connections selected by the method. The true number of connections is 197.

p method |s^|avg |s^s|avg fsr nsr
3000 BwC 171.1 (2.28) 162.0 (1.75) 0.05 (0.008) 0.18 (0.009)
AMRCE 70.7 (5.92) 1.1 (0.33) 0.99 (0.004) 0.99 (0.003)
5000 BwC 169.8 (3.51) 161.3 (2.64) 0.05 (0.006) 0.18 (0.013)
AMRCE 68.35 (10.14) 0.65 (0.20) 0.99 (0.003) 0.99 (0.002)
10000 BwC 165.6 (1.57) 156.2 (1.54) 0.06 (0.006) 0.21 (0.008)
AMRCE 65.35 (13.58) 1.55 (0.52) 0.98 (0.006) 0.98 (0.005)
Table 7:

Results comparison for variable selection: the results are averages over 10 independent datasets with the standard deviations given in parentheses, where |s^|avg denotes the number of variables selected by the method, and |s^s|avg denotes the number of true variables selected by the method. The true number of variables is 200.

p method |s^|avg |s^s|avg fsr nsr MSE
3000 BwC 248.3 (3.23) 200 (0) 0.19 (0.010) 0 (0) 0.05 (0.003)
AMRCE 268.5 (5.03) 199.9 (0.10) 0.25 (0.014) 0.001 (0.001) 0.22 (0.009)
5000 BwC 251.8 (3.91) 200 (0) 0.20 (0.012) 0 (0) 0.05 (0.002)
AMRCE 265.9 (12.96) 199.4 (0.40) 0.23 (0.039) 0.003 (0.002) 0.25 (0.019)
10000 BwC 255.9 (3.96) 200 (0) 0.22 (0.012) 0 (0) 0.06 (0.003)
AMRCE 233.3 (7.20) 198.5 (0.50) 0.14 (0.021) 0.008 (0.003) 0.30 (0.011)
Figure 3:

Histograms of the non-zero elements of $\hat B$ obtained by BwC (left panel) and AMRCE (right panel) for three datasets with p = 3000 (upper panel), p = 5000 (middle panel), and p = 10000 (lower panel).

Finally, we would like to report that the BwC approach took somewhat more CPU time than AMRCE for this example. This is because BwC takes multiple sweeps with diminishing block sizes in the estimation of $\tilde B$ and $\Omega$, although each sweep is cheaper than AMRCE. On average, BwC took 26, 36 and 52 CPU minutes on an Intel Core i7-4790@3.60GHz Quad-Core desktop for each dataset with $p = 3000$, 5000 and 10000, respectively. On average, AMRCE took 5, 9 and 20 CPU minutes for these datasets, respectively, on the same computer.

4.4. A Real Data Study: eQTL Analysis

This dataset includes 60 unrelated individuals of Northern and Western European ancestry from Utah (CEU), whose genotypes are available from the International HapMap project and publicly downloadable via the hapmart interface (http://hapmart.hapmap.org). The genotype is coded as 0, 1 and 2 for homozygous rare, heterozygous and homozygous common alleles, respectively. Our study focuses on the SNPs found in the 5′ UTR (untranslated region) of mRNA (messenger RNA) with a minor allele frequency of 0.1 or more. The UTR may play an important role in the regulation of gene expression and has previously been investigated by Chen et al. (2008). The gene expression data of these individuals were analyzed by Stranger et al. (2007). There were four replicates for each individual. The raw data were background corrected, then quantile normalized across the replicates of each individual, and then median normalized across all individuals. The gene expression data are also publicly available from the Sanger Institute website (ftp://ftp.sanger.ac.uk/pub/genevar). Out of the 47,293 available probes, each corresponding to a different Illumina TargetID, we selected the 100 most variable probes, each corresponding to a different transcript. Further, we removed the "duplicated" SNPs: two SNPs are said to be "duplicated" if they have the same genotypes across all individuals. After these reductions, we obtained a dataset with $n = 60$, $p = 3005$ (SNPs), and $q = 100$ (transcripts). The same dataset has been analyzed by Bhadra and Mallick (2013).

To validate the Gaussian assumption for the response variables of model (14), the nonparanormal transformation proposed by Liu et al. (2009) was applied to the gene expression data. The nonparanormal transformation is a semiparametric transformation which converts a continuous variable to a Gaussian variable while leaving the structure of the underlying graphical model unchanged. Prior to the analysis, we centered both the response variables (gene expression data) and the predictors (SNPs). Each predictor was also normalized to have a standard deviation of 1. The BwC approach was applied to this example. In the $\tilde B$- and $\Omega$-estimation step, we set the initial block number $K = 10$, the number of sweeps $m = 25$, and the block diminishing size $s = 2$ for the sub-BwC run. For the $\psi$-learning algorithm, we set $\alpha_1 = 0.25$ and $\alpha_2 = 0.05$. The BwC approach was independently run 25 times, with the results summarized in Table 8 and Table 9.
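
The preprocessing described above can be sketched as follows, with the nonparanormal transformation taken from the huge package (huge.npn) as an assumed stand-in implementation; the object names expr (the n x q expression matrix) and snp (the n x p genotype matrix) are ours.

```r
library(huge)   # provides huge.npn, a nonparanormal transformation

# Transform the expression data to approximate Gaussianity, then center the
# responses and standardize the SNP predictors, as described in the text.
preprocess_eqtl <- function(expr, snp) {
  Y <- huge.npn(expr)                           # Gaussianize the responses
  Y <- scale(Y, center = TRUE, scale = FALSE)   # center responses
  X <- scale(snp, center = TRUE, scale = TRUE)  # center and scale predictors to sd 1
  list(Y = Y, X = X)
}
```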

Table 8:

Pairs of associated transcripts identified by BwC in all 25 runs.

Correlated Transcripts Related Gene
GI_14211892-S : GI_33356559-S CYorf15B : SMCY
GI_20302136-S : GI_7661757-S KIAA0125 : KIAA0125
hmm3574-S : Hs.406489-S hmm3574 : Hs.406489
Hs.449602-S : Hs.449584-S Hs.449602 : Hs.449584
Hs.512137-S : GI_41190507-S Hs.512137 : LOC388978
Hs.449602-S : Hs.449584-S Hs.449602 : Hs.449584
GI_27754767-A : GI_27754767-I FXYD2 : FXYD2
GI_40354211-S : Hs.185140-S PIP3-E : PIP3-E
Table 9:

SNPs and their associated multiple transcripts identified by BwC.

SNP Transcript
rs3784932 GI_24308084-S, GI_21389558-S
rs1044369 GI_37546969-S, hmm3587S
rs1944926 GI_21464138-A, GI_22027487-S, GI_41350202-S, hmm9615-S
rs2241649 hmm9615S, GI_7019408-S
rs12892419 GI_33356162-S, GI_13514808-S

Table 8 reports 8 pairs of associated transcripts that were identified by BwC in all 25 runs. Among them, there are two transcripts from the gene FXYD2, two transcripts from the gene KIAA0125, and two transcripts from the gene PIP3-E. Also, the associated gene pair CYorf15B and SMCY found in this study is known to be related to gender (Weickert et al., 2009). These results indicate the validity of the BwC approach. Due to the generally weak relations between SNPs and transcripts, the null model tends to be produced in the $B$-estimation step with the cyclical coordinate descent algorithm. Instead, we explored the estimator $\hat B = \hat{\tilde B}\hat\Sigma^{1/2}$ for this example. For this estimator, there were 256 non-zero elements identified by BwC in all 25 runs. Among the top 100 elements (in magnitude), we found 5 SNPs that are associated with more than one transcript. Table 9 shows these SNPs and their associated transcripts.

5. Conclusion

We have proposed the blockwise consistency (BwC) method as a potential solution to the problem of parameter estimation for complex models that are often encountered in big data analysis. The BwC method decomposes the high dimensional parameter estimation problem into a series of lower dimensional parameter estimation problems which often have much simpler structures than the original high-dimensional problem, and thus can be easily solved. Moreover, under the framework provided by BwC, a variety of methods, such as Bayesian and frequentist methods, can be jointly used to achieve a consistent estimator for the original high-dimensional complex model. The BwC approach has been illustrated using two examples, high dimensional variable selection and high-dimensional multivariate regression. Both examples show that the BwC method can provide a drastic improvement over the existing methods. Extension of the BwC method to other high-dimensional problems, such as variable selection for high-dimensional generalized linear models, is straightforward.

The BwC method works in a similar way to the block coordinate ascent algorithm. As mentioned previously, the fundamental difference between the two algorithms is that the block coordinate ascent algorithm works with an explicit finite-sample objective function for each block, while BwC does not. BwC requires only a consistent estimator for each block, while the joint finite-sample objective function for all blocks might not exist in an explicit form. This allows us to combine the best methods for each sub-problem to solve the original complex problem. In addition, for maximizing penalized log-likelihood functions, the block coordinate ascent algorithm allows only a common regularization parameter to be used for all parameters, while BwC allows different regularization parameters to be used for different blocks at different iterations. Such an adaptive way of choosing regularization parameters contributes substantially to the performance of BwC.

The BwC method can also be used under the Bayesian framework to find the maximum a posteriori (MAP) estimator for a complex model, e.g., a hierarchical model for which conjugacy holds for the parameters. We note that such models are traditionally treated with the Gibbs sampler under the Bayesian framework. However, the feasibility of the Gibbs sampler (Geman and Geman, 1984) is being challenged in the era of big data, as it typically requires a large number of iterations and multiple scans of the full dataset for each sweep. Compared to the Gibbs sampler, BwC can converge much faster, usually within tens of iterations. For many problems, if the Gibbs sampler can be implemented, so can BwC, but with substantially less work and more straightforward convergence properties. The uncertainty of the BwC estimator can be assessed using the bootstrap method (Efron and Tibshirani, 1993).

Acknowledgement

Liang’s research was supported in part by the grants DMS-1612924 and DMS/NIGMS R01-GM117597.

Appendix

Proof of Theorem 2.1 We follow the proof of Theorem 2.4.3 of van der Vaart and Wellner (1996). By the symmetrization Lemma 2.3.1 of van der Vaart and Wellner (1996), the measurability of the class $\mathcal{F}_n$, and Fubini's theorem,

$$E\sup_{\theta^{(s)}\in\Theta_n^{(s)},\,\hat\theta_{t-1}^{(s)}\in\Theta_{n,T}^{(s)}}\Big|\hat G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})-G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})\Big| \le 2E_xE_\epsilon\sup_{q\in\mathcal{F}_n}\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i q(x_i)\Big| \le 2E_xE_\epsilon\sup_{q\in\mathcal{G}_{n,M}}\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i q(x_i)\Big| + 2E^*\big[m_n(x)\,1(m_n(x)>M)\big],$$

where the $\epsilon_i$ are i.i.d. Rademacher random variables with $P(\epsilon_i=+1)=P(\epsilon_i=-1)=1/2$, and $E^*$ denotes the outer expectation.

By condition (B2)-(a), $2E^*[m_n(x)1(m_n(x)>M)]\to 0$ for sufficiently large $M$. To prove convergence in mean, it suffices to show that the first term converges to zero for fixed $M$. Fix $x_1,\ldots,x_n$, and let $\mathcal{H}$ be an $\epsilon$-net in $L_1(\mathbb{P}_n)$ over $\mathcal{G}_{n,M}$; then

$$E_\epsilon\sup_{q\in\mathcal{G}_{n,M}}\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i q(x_i)\Big| \le E_\epsilon\sup_{q\in\mathcal{H}}\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i q(x_i)\Big| + \epsilon,$$

where the cardinality of $\mathcal{H}$ can be chosen equal to $N(\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n))$. Bounding the $L_1$-norm on the right by the Orlicz norm $\psi_2$ and using the maximal inequality (Lemma 2.2.2 of van der Vaart and Wellner (1996)) together with Hoeffding's inequality, it can be shown that

$$E_\epsilon\sup_{q\in\mathcal{G}_{n,M}}\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i q(x_i)\Big| \le K\sqrt{1+\log N(\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n))}\,\sqrt{\frac{6}{n}}\,M + \epsilon \;\xrightarrow{P^*}\; \epsilon, \qquad (21)$$

where K is a constant, and P* denotes outer probability. It has been shown that the left side of (21) converges to zero in probability. Since it is bounded by M, its expectation with respect to x1, …, xn converges to zero by the dominated convergence theorem.

This concludes the proof that $\sup_{\theta^{(s)}\in\Theta_n^{(s)},\,\hat\theta_{t-1}^{(s)}\in\Theta_{n,T}^{(s)}}\big|\hat G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})-G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})\big|\to 0$ in mean. Further, by Markov's inequality, we conclude that (7) holds.

Proof of Theorem 2.2 Since both $\hat G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})$ and $G_n(\theta^{(s)}\mid\hat\theta_{t-1}^{(s)})$ are continuous in $\theta^{(s)}$, as implied by the continuity of $\log\pi(x\mid\theta)$ in $\theta$, the remaining part of the proof follows from Lemma .1.

Lemma .1 Consider a sequence of functions $Q_t(\theta,X^n)$ for $t=1,2,\ldots,T$. Suppose that the following conditions are satisfied: (C1) For each $t$, $Q_t(\theta,X^n)$ is continuous in $\theta$ and there exists a function $Q_t^*(\theta)$, which is continuous in $\theta$ and uniquely maximized at $\theta^{*(t)}$. (C2) For any $\epsilon>0$, $\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}Q_t^*(\theta)$ exists, where $B_t(\epsilon)=\{\theta:\|\theta-\theta^{*(t)}\|<\epsilon\}$; let $\delta_t=Q_t^*(\theta^{*(t)})-\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}Q_t^*(\theta)$ and $\delta=\min_{t\in\{1,2,\ldots,T\}}\delta_t>0$. (C3) $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in\Theta_n}|Q_t(\theta,X^n)-Q_t^*(\theta)|\stackrel{p}{\to}0$ as $n\to\infty$. (C4) The penalty function $P_{\lambda_n}(\theta)$ is non-negative and converges to 0 uniformly over the set $\{\theta^{*(t)}:t=1,2,\ldots,T\}$ as $n\to\infty$, where $\lambda_n$ is a regularization parameter whose value can depend on the sample size $n$. Let $\theta_n(t)=\arg\max_{\theta\in\Theta_n}\{Q_t(\theta,X^n)-P_{\lambda_n}(\theta)\}$. Then the uniform convergence holds, i.e., $\sup_{t\in\{1,2,\ldots,T\}}\|\theta_n(t)-\theta^{*(t)}\|\stackrel{p}{\to}0$.

PROOF: Consider two events: (i) $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}|Q_t(\theta,X^n)-Q_t^*(\theta)|<\delta/2$, and (ii) $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in B_t(\epsilon)}|Q_t(\theta,X^n)-Q_t^*(\theta)|<\delta/2$. From event (i), we can deduce that for any $t\in\{1,2,\ldots,T\}$ and any $\theta\in\Theta_n\setminus B_t(\epsilon)$, $Q_t(\theta,X^n)<Q_t^*(\theta)+\delta/2\le Q_t^*(\theta^{*(t)})-\delta_t+\delta/2\le Q_t^*(\theta^{*(t)})-\delta/2$. Therefore, $Q_t(\theta,X^n)-P_{\lambda_n}(\theta)<Q_t^*(\theta^{*(t)})-\delta/2-o(1)$ by condition (C4).

From event (ii), we can deduce that for any $t\in\{1,2,\ldots,T\}$ and any $\theta\in B_t(\epsilon)$, $Q_t(\theta,X^n)>Q_t^*(\theta)-\delta/2$ and, in particular, $Q_t(\theta^{*(t)},X^n)>Q_t^*(\theta^{*(t)})-\delta/2$. Therefore, $Q_t(\theta^{*(t)},X^n)-P_{\lambda_n}(\theta^{*(t)})>Q_t^*(\theta^{*(t)})-\delta/2-o(1)$ by condition (C4).

If both events hold simultaneously, then we must have $\theta_n(t)\in B_t(\epsilon)$ for all $t\in\{1,2,\ldots,T\}$ as $n\to\infty$. By condition (C3), the probability that both events hold tends to 1. Therefore,

$$P\big(\theta_n(t)\in B_t(\epsilon)\ \text{for all } t=1,2,\ldots,T\big)\to 1,$$

which concludes the lemma. □

Proof of Theorem 2.3 Applying a Taylor expansion to $G_n(\theta^{(s)}\mid\hat\theta_{t-1,g}^{(s)})$ at $\theta_t^{*(s)}$, we get $G_n(\theta^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})=O_p(1/n)$, following from condition (B5) and the part of condition (B3) stating that $G_n(\theta^{(s)}\mid\hat\theta_{t-1,g}^{(s)})$ is maximized at $\theta_t^{*(s)}$. Therefore,

$$n\big[\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big]=Z_{t,1}+\cdots+Z_{t,n}+n\big[G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big]=Z_{t,1}+\cdots+Z_{t,n}+\epsilon_n,$$

where ϵn = Op(1), and

$$P\Big(n\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|>nz\Big)\le P\big(|Z_{t,1}+\cdots+Z_{t,n}|>nz-|\epsilon_n|\big). \qquad (22)$$

By Bernstein’s inequality,

$$P\big(|Z_{t,1}+\cdots+Z_{t,n}|>nz-|\epsilon_n|\big)\le 2\exp\left\{-\frac{1}{2}\,\frac{(z-|\epsilon_n|/n)^2}{\tilde v+\tilde M_b(z-|\epsilon_n|/n)}\right\}, \qquad (23)$$

for $\tilde v\ge(\tilde v_1+\cdots+\tilde v_n)/n^2$ and $\tilde M_b=M_b/n$. Applying a Taylor expansion to the right-hand side of (23) at $z$ and combining with (22) leads to

$$P\Big(\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|>z\Big)\le K\exp\left\{-\frac{1}{2}\,\frac{z^2}{\tilde v+\tilde M_b z}\right\}, \qquad (24)$$

where $K=2+3M_bO_p(1/n)=2+3\tilde M_bO_p(1)$, since the derivative $\big|d[z^2/(\tilde v+\tilde M_bz)]/dz\big|\le 3/\tilde M_b$.

By applying Lemma 2.2.10 of van der Vaart and Wellner (1996) for the Orlicz norm $\psi_1$, we have

$$\Big\|\sup_{\theta_{t,g}^{(s)}\in\Theta_n^{(s)},\,t=1,2,\ldots,T}\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|\Big\|_{\psi_1}\le\epsilon+K'\Big(\tilde M_b\log\big(1+T\,N(\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n))\big)+\sqrt{\tilde v\,\log\big(1+T\,N(\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n))\big)}\Big), \qquad (25)$$

for a constant $K'$ and any $\epsilon>0$. Since $\tilde v=O(1/n)$, $\tilde M_b=O(1/n)$, $\log(T)=o(n)$, and $\log N(\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n))=o(n)$, we have

$$\Big\|\sup_{\theta_{t,g}^{(s)}\in\Theta_n^{(s)},\,t=1,2,\ldots,T}\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|\Big\|_{\psi_1}\xrightarrow{P}\epsilon.$$

Therefore,

$$\sup_{\theta_{t,g}^{(s)}\in\Theta_n^{(s)},\,t\in\{1,2,\ldots,T\}}\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|\stackrel{p}{\to}0. \qquad (26)$$

Note that, as implied by the proof of Lemma 2.2.10 of van der Vaart and Wellner (1996), (25) holds for a general constant $K$ in (24). Then, by condition (B3), we must have the uniform convergence that $\theta_{t,g}^{(s)}\in B_t(\epsilon)$ for all $t$ as $n\to\infty$, where $B_t(\epsilon)$ is as defined in (B3). This statement can be proved by contradiction as follows.

Assume $\theta_{t,g}^{(s)}\notin B_i(\epsilon)$ for some $i\in\{1,2,\ldots,T\}$. By the uniform convergence established in Theorem 2.1, $|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})|=o_p(1)$. Further, by condition (B3) and the assumption $\theta_{t,g}^{(s)}\notin B_i(\epsilon)$,

$$\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|\ge\big|G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_t^{*(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|-\big|\hat G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})-G_n(\theta_{t,g}^{(s)}\mid\hat\theta_{t-1,g}^{(s)})\big|\ge\delta-o_p(1),$$

which contradicts the uniform convergence established in (26). This concludes the proof.

Proof of Theorem 2.4 Define $d_t(n):=\|\hat\theta_t-\tilde\theta_t\|$, where $n$ indicates the implicit dependence of $\hat\theta_t$ on $n$. Then

$$d_t(n):=\|\hat\theta_t-\tilde\theta_t\|\le\|\hat\theta_t-M_s(\hat\theta_{t-1})\|+\|M_s(\hat\theta_{t-1})-\tilde\theta_t\|. \qquad (27)$$

For the first component of the inequality (27), we define

$$g_n:=\sup_{t,\,\hat\theta_{t-1}\in\Theta_n}\|\hat\theta_t-M_s(\hat\theta_{t-1})\|,$$

which converges to zero in probability as n → ∞, following from Theorem 2.2 and Theorem 2.3 for both types of consistent estimation procedures considered in the paper. For the second component of the inequality (27), we have

$$\|M_s(\hat\theta_{t-1})-\tilde\theta_t\|=\|M_s(\hat\theta_{t-1})-M_s(\tilde\theta_{t-1})\|\le\rho^*\|\hat\theta_{t-1}-\tilde\theta_{t-1}\|=\rho^*d_{t-1}(n),$$

following from condition (B6).

Combining this with the fact that $d_0=0$, i.e., the two paths $\{\hat\theta_t\}$ and $\{\tilde\theta_t\}$ start from the same point, we have

$$d_t(n)\le\sum_{l=0}^{t-1}g_n(\rho^*)^l\le\frac{g_n}{1-\rho^*}\stackrel{p}{\to}0, \qquad (28)$$

where the convergence is uniform over $t$ as $g_n\stackrel{p}{\to}0$. Moreover, since $\tilde\theta_t$ converges to a coordinatewise maximum point of $E_{\theta^*}\log\pi(X\mid\theta)$ under conditions (A1) and (A2), $\hat\theta_t$ converges to the same point in probability. That is, $\hat\theta:=\lim_{t\to\infty}\hat\theta_t\stackrel{p}{\to}\tilde\theta:=\lim_{t\to\infty}\tilde\theta_t$.

Contributor Information

Runmin Shi, Department of Statistics, University of Florida, Gainesville, FL 32611.

Faming Liang, Department of Statistics, Purdue University, West Lafayette, IN 47906.

Qifan Song, Department of Statistics, Purdue University, West Lafayette, IN 47907.

Ye Luo, Department of Economics, University of Florida, Gainesville, FL 32611.

Malay Ghosh, University of Florida, Gainesville, FL 32611.

References

  1. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin A, Kim S, Wilson C, Lehar J, Kryukov G, Sonkin D, Reddy A, Liu M, Murray L, Berger M, Monahan J, Morais P, Meltzer J, Korejwa A, Jane-Valbuena J, Mapa F, Thibault J, Bric-Furlong E, Raman P, Shipway A, Engels I, et al. (2012), "The Cancer cell line encyclopedia enables predictive modeling of anticancer drug sensitivity," Nature, 483, 603–607.
  2. Bhadra A and Mallick B (2013), "Joint High-Dimensional Bayesian Variable and Covariance Selection with an Application to eQTL Analysis," Biometrics, 69, 447–457.
  3. Breheny P and Huang J (2011), "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," Annals of Applied Statistics, 5, 232–252.
  4. Cai T, Li H, Liu W, and Xie J (2013), "Covariate-adjusted precision matrix estimation with an application in genetical genomics," Biometrika, 100, 139–156.
  5. Chen J and Chen Z (2008), "Extended Bayesian information criterion for model selection with large model space," Biometrika, 95, 759–771.
  6. Dempster A (1972), "Covariance Selection," Biometrics, 28, 157–175.
  7. Duguet M (1997), "When helicase and topoisomerase meet!" J. Cell Sci., 110, 1345–1350.
  8. Efron B and Tibshirani R (1993), An Introduction to the Bootstrap, Boca Raton, FL: Chapman & Hall/CRC.
  9. Fan J, Feng Y, Saldana DF, Samworth R, and Wu Y (2015), "Sure Independence Screening," CRAN R package.
  10. Fan J and Li R (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348–1360.
  11. Fan J and Lv J (2008), "Sure independence screening for ultrahigh dimensional feature space (with discussion)," Journal of the Royal Statistical Society, Series B, 70, 849–911.
  12. Fan J, Samworth R, and Wu Y (2009), "Ultrahigh dimensional feature selection: Beyond the linear model," Journal of Machine Learning Research, 10, 1829–1853.
  13. Fan J and Song R (2010), "Sure independence screening in generalized linear models with NP-dimensionality," Annals of Statistics, 38, 3567–3604.
  14. Fan J, Xue L, and Zou H (2014), "Strong oracle optimality of folded concave penalized estimation," Annals of Statistics, 42, 819–849.
  15. Firth D (1993), "Bias reduction of maximum likelihood estimates," Biometrika, 80, 27–38.
  16. Friedman J, Hastie T, and Tibshirani R (2008), "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 9, 432–441.
  17. — (2010), "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1–22.
  18. — (2015), "GLASSO: Graphical lasso - estimation of Gaussian graphical models," CRAN R package.
  19. Geman S and Geman D (1984), "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
  20. Guo N, Wan Y, Tosun K, Lin H, Msiska Z, Flynn D, Remick S, Vallyathan V, Dowlati A, Shi X, Castranova V, Beer D, and Qian Y (2008), "Confirmation of gene expression-based prediction of survival in non-small cell lung cancer," Clin. Cancer Res., 14, 8213–8220.
  21. Hamburg M and Collins F (2010), "The path to personalized medicine," New Engl. J. Med., 363, 301–304.
  22. Hastie T, Tibshirani R, and Friedman J (2009), The Elements of Statistical Learning, Springer.
  23. Li X, Xu S, Cheng Y, and Shu J (2016), "HSPB1 polymorphisms might be associated with radiation-induced damage risk in lung cancer patients treated with radiotherapy," Tumour Biol., to appear.
  24. Liang F, Jia B, Xue J, Li Q, and Luo Y (2018), "An imputation-regularized optimization algorithm for high-dimensional missing data problems and beyond," Journal of the Royal Statistical Society, Series B, in press.
  25. Liang F, Song Q, and Qiu P (2015), "An Equivalent Measure of Partial Correlation Coefficients for High Dimensional Gaussian Graphical Models," Journal of the American Statistical Association, 110, 1248–1265.
  26. Liang F, Song Q, and Yu K (2013), "Bayesian Subset Modeling for High Dimensional Generalized Linear Models," Journal of the American Statistical Association, 108, 589–606.
  27. Liu H, Lafferty J, and Wasserman L (2009), "The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs," Journal of Machine Learning Research, 10, 2295–2328.
  28. Mazumder R, Friedman J, and Hastie T (2011a), "SparseNet: Coordinate descent with nonconvex penalties," Journal of the American Statistical Association, 106, 1125–1138.
  29. — (2011b), "SparseNet: Coordinate descent with nonconvex penalties," Journal of the American Statistical Association, 106, 1125–1138.
  30. Mazumder R and Hastie T (2012), "The graphical Lasso: New insights and alternatives," Electronic Journal of Statistics, 6, 2125–2149.
  31. Meinshausen N and Bühlmann P (2006), "High-dimensional graphs and variable selection with the Lasso," Annals of Statistics, 34, 1436–1462.
  32. Peng J, Zhu J, Bergamaschi A, Han W, Noh D-Y, Pollack JR, and Wang P (2010), "Regularized Multivariate Regression for Identifying Master Predictors with Application to Integrative Genomics Study of Breast Cancer," Annals of Applied Statistics, 4, 53–77.
  33. Raskutti G, Wainwright M, and Yu B (2011), "Minimax rates of estimation for high-dimensional linear regression over ℓq-balls," IEEE Transactions on Information Theory, 57, 6976–6994.
  34. Rothman A (2015), "MRCE: Multivariate regression with covariance estimation," CRAN R package.
  35. Rothman A, Levina E, and Zhu J (2010), "Sparse multivariate regression with covariance estimation," Journal of Computational and Graphical Statistics, 19, 947–962.
  36. Sofer T, Dicker L, and Lin X (2014), "Variable selection for high dimensional multivariate outcomes," Statistica Sinica, 24, 1633–1654.
  37. Song Q and Liang F (2015a), "High Dimensional Variable Selection with Reciprocal L1-Regularization," Journal of the American Statistical Association, 110, 1607–1620.
  38. — (2015b), "A Split-and-Merge Bayesian Variable Selection Approach for Ultra-high dimensional Regression," Journal of the Royal Statistical Society, Series B, 77, 947–972.
  39. Tibshirani R (1996), "Regression shrinkage and selection via the LASSO," Journal of the Royal Statistical Society, Series B, 58, 267–288.
  40. Tseng P (2001), "Convergence of a block coordinate descent method for nondifferentiable minimization," Journal of Optimization Theory and Applications, 109, 475–494.
  41. Tseng P and Yun S (2009), "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, Series B, 117, 387–423.
  42. Turlach B, Venables W, and Wright S (2005), "Simultaneous variable selection," Technometrics, 47, 349–363.
  43. Vershynin R (2015), "Estimation in high dimensions: A geometric perspective," in Sampling Theory: A Renaissance, ed. Pfander G., Cham, pp. 3–66.
  44. Wang J (2015), "Joint estimation of sparse multivariate regression and conditional graphical models," Statistica Sinica, 25, 831–851.
  45. Weickert C, et al. (2009), "Transcriptome analysis of male-female differences in prefrontal cortical development," Molecular Psychiatry, 14, 558–561.
  46. Witten D, Friedman J, and Simon N (2011), "New insights and faster computations for the graphical Lasso," Journal of Computational and Graphical Statistics, 20, 892–900.
  47. Yuan M and Lin Y (2007), "Model selection and estimation in the Gaussian graphical model," Biometrika, 94, 19–35.
  48. Zhang C-H (2010), "Nearly unbiased variable selection under minimax concave penalty," Annals of Statistics, 38, 894–942.
  49. Zhao P and Yu B (2006), "On model selection consistency of Lasso," Journal of Machine Learning Research, 7, 2541–2563.
  50. Zou H (2006), "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418–1429.
