Briefings in Bioinformatics. 2014 Dec 5;16(5):873–883. doi: 10.1093/bib/bbu046

A selective review of robust variable selection with applications in bioinformatics

Cen Wu, Shuangge Ma
PMCID: PMC4570200  PMID: 25479793

Abstract

Vast amounts of data have been and are being generated in bioinformatics studies. In the analysis of such data, the standard modeling approaches can be challenged by heavy-tailed errors and outliers in the response variables, contamination in the predictors (caused by, for instance, technical problems in microarray gene expression studies), model mis-specification and other factors. Robust methods are needed to tackle these challenges. When there are a large number of predictors, variable selection can be as important as estimation. As a generic variable selection and regularization tool, penalization has been extensively adopted. In this article, we provide a selective review of robust penalized variable selection approaches especially designed for high-dimensional data from bioinformatics and biomedical studies. We discuss the robust loss functions, penalty functions and computational algorithms, and briefly examine the theoretical properties and implementation. Application examples of the robust penalization approaches in representative bioinformatics and biomedical studies are also presented.

Keywords: robust methods, variable selection, penalization, bioinformatics study

Introduction

The boom in high-throughput profiling technologies over the past decades has generated an unprecedented amount of high-dimensional data for bioinformatics and biomedical research. For instance, in a genome-wide association study where the goal is to identify genetic risk factors for complex diseases, single nucleotide polymorphisms (SNPs) are potential predictors for phenotypes, and a million or more of them can be profiled simultaneously. In a microarray gene expression study, >40 000 probes (features) profiled with a regular Affymetrix chip are potential covariates for disease classification based on a much smaller number of subjects.

The analysis of high-dimensional bioinformatics data faces many challenges. The heterogeneity problem arises because bioinformatics data are often aggregated from multiple sources, such as different subpopulations. For example, in an expression quantitative trait locus (eQTL) mapping study of laboratory rats [1], the rats under study were from different strains. A similar issue was also reported for the gene expression study of genetically engineered mutants from various strains of Bacillus subtilis [2]. Heterogeneous data often have heavy tails and outliers. In addition, most of the existing analysis methods assume specific (semi)parametric models. With low-dimensional biomedical data, it has been shown that model mis-specification can happen and lead to biased estimation and misleading conclusions. With high-dimensional data, conducting model diagnostics or determining model forms data-dependently can be extremely difficult. Such challenges demand the development of robust methods that can accommodate data contamination and are insensitive to model specification.

With bioinformatics data, regularized variable selection can be as critical as estimation. From a statistical perspective, when all predictors are used, regular classification rules may perform no better than random guessing under high-dimensional settings [3]. From a biological perspective, for example in gene profiling studies, it is expected that the majority of profiled genes are 'noise', and only a few are related to outcomes and phenotypes. In the statistical literature, there are a large number of methods for regularized variable selection and estimation. As a powerful tool for selecting the subset of important predictors associated with responses, penalization plays a central role in high-dimensional statistical modeling, especially for bioinformatics studies. Thus, relevant research has been actively pursued alongside the waves of breakthroughs in high-throughput profiling technologies.

Multiple robust penalized variable selection methods have been developed. The purpose of this article is to provide a selective review of such methods for high-dimensional data, with special attention to applications in bioinformatics and biomedical studies. This article distinguishes itself from the published ones in the following aspects. First, compared with studies such as [4], it emphasizes the robustness of penalization procedures. It may be the first to provide a detailed review of such procedures. Second, it attempts to set up a unified framework of 'robust loss function + penalty' that incorporates robust penalization for different types of responses under high- as well as low-dimensional settings, as the development of high-dimensional robust penalization approaches is still limited compared with that of their non-robust counterparts. This differs from [4], which emphasizes variable selection for models with continuous responses in high-dimensional feature space, and from [5], which focuses on regularized estimation for the accelerated failure time (AFT) model under both low- and high-dimensional settings.

A penalized selection and estimation procedure is characterized by its loss function, which defines lack-of-fit, and penalty function, which defines model complexity. In the following two sections, we discuss the loss function and penalty, respectively.

Robust modeling

Consider a disease outcome or phenotype, which can be a continuous biomarker, categorical disease status, or survival time. For most of this section, we start our discussion with a continuous outcome, as it has the simplest form and has been the most extensively investigated. Discussions on the other types of outcomes are provided following those on the continuous outcome.

For subject i (i = 1, …, n), denote x_i = (x_{i1}, …, x_{ip})^T as the p covariates, which can be gene expressions, SNPs, copy number variations and other omics measurements. The outcome y_i is associated with x_i through the following parametric or semi-parametric model

$$y_i = \phi\left(\beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i\right) = \phi\!\left(x_i^T\beta + \varepsilon_i\right). \quad (1)$$

Here φ(·) defines the model, β = (β_1, …, β_p)^T is the p×1 vector of regression coefficients and ε_i is the random error. Model (1) describes the most widely investigated relationship between a response and predictors in robust modeling. It can move beyond linearity, which will be discussed later on.

The widely adopted ordinary least squares (LS) or likelihood-based procedures are vulnerable when ε_i deviates from the normal distribution or there are outliers (or influential points) in the response. Similarly, when the influential points are in the predictors, which can be encountered in, for example, microarray gene expression studies, regular procedures are not appropriate. When the model is mis-specified, statistical accuracy in terms of model complexity and parameter estimation can be sacrificed. Possible data contamination and model mis-specification demand the development of robust modeling.

The key lies in constructing the robust loss function L(β; y, x), where y and x denote the vector and matrix composed of all responses and predictors, respectively. A robust loss differs from an ordinary one by being able to accommodate irregularity in data and model settings. For a continuous outcome, the simplest choice of φ(·) in Equation (1) is the identity function, and the loss function usually has the form L(y − x^Tβ). There are no strict rules for constructing a loss function. The basic requirement is consistency: under the asymptotic settings, minimizing the loss function leads to a consistent estimate of the unknown regression parameter. When consistency is available, statistical and computational efficiency is also of interest. Under contamination, the intuition is to down-weight the influence of observations 'far away' from the center. Under model mis-specification, the intuition is to build a loss function that can accommodate a class of models (as opposed to a single specific one). Below we survey some broadly adopted robust loss functions.

Check loss function and its variants

Perhaps the most extensively examined robust loss function is the check loss function in quantile regression (QR). Define ρ_τ(t) = {τ − I(t < 0)}t at a given quantile level 0 < τ < 1. With this choice, the loss function for the ith subject is as follows:

$$\rho_\tau\!\left(y_i - x_i^T\beta\right) = \begin{cases} \tau\left(y_i - x_i^T\beta\right) & \text{if } y_i - x_i^T\beta > 0 \\ -(1-\tau)\left(y_i - x_i^T\beta\right) & \text{otherwise.} \end{cases}$$

The overall loss function is defined as ∑_i ρ_τ(y_i − x_i^Tβ). By emphasizing the relative rank rather than the absolute magnitude, this loss function is able to 'tolerate' outliers and influential points to a much greater extent than the ordinary LS. The check loss function has been adopted in multiple studies [6–11]. The unique advantage of robust procedures built on QR lies in the ability to capture the heterogeneity of data through different quantiles. For example, in the presence of data heterogeneity, different sets of important genes or SNPs can be associated with the response at different quantiles.
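As a concrete illustration, below is a minimal Python/numpy sketch (ours, not taken from any of the reviewed papers) of evaluating the check loss for a candidate coefficient vector; the toy data and function names are hypothetical.

    import numpy as np

    def check_loss(residuals, tau):
        # rho_tau(u) = u * (tau - I(u < 0)): asymmetric absolute loss
        r = np.asarray(residuals, dtype=float)
        return np.sum(r * (tau - (r < 0)))

    # toy data with heavy-tailed errors
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    beta_true = np.array([1.0, 0.0, -0.5])
    y = X @ beta_true + rng.standard_t(df=2, size=50)

    # loss of a candidate coefficient vector at the 0.75 quantile
    print(check_loss(y - X @ beta_true, tau=0.75))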

Researchers have developed the multiple-QR approach [12, 13], where multiple conditional quantile functions are estimated simultaneously. This approach has been motivated by the unique advantage of QR over the traditional conditional mean regression: a more comprehensive picture of the response–predictor relationship is available through investigating multiple conditional quantile functions jointly. Denote β^(k) = (β_1^(k), …, β_p^(k))^T as the vector of regression coefficients of the τ_k conditional quantile function of y given x, for k = 1, …, K. Then, instead of estimating each β^(k) individually, the loss function for the simultaneous multiple QR (SMQR) [12], which is also termed the composite QR (CQR) [13], can be expressed as follows:

$$\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_k}\!\left(y_i - x_i^T\beta^{(k)}\right). \quad (2)$$

This formulation reduces to the check loss function when K = 1. Equal weights are assigned to different quantiles in (2). As pointed out in the literature [12], SMQR is preferable only when it is sensible to assume that the same subset of predictors is associated with the response across the multiple QR models, or when the study objective is to identify a common subset of predictors across those models.

As a more general formulation, the composite quasi-likelihood (CQL) [14] is proposed as follows:

$$\sum_{k=1}^{K}\sum_{i=1}^{n} w_k\,\rho_k\!\left(y_i - x_i^T\beta\right), \quad (3)$$

where ρ_1, …, ρ_K are convex loss functions, and w_1, …, w_K are the weights. This formulation is closely related to non-parametric statistics in that {ρ_1, …, ρ_K} can be viewed as basis functions used to approximate the log-likelihood function. Note that Equation (3) includes the CQR as a special case when the weights are equal. In addition, Bradic and others [14] investigated the case with a composite of L1 and L2 loss functions, in which Equation (3) reduces to the following:

$$w_1\sum_{i=1}^{n}\left|y_i - x_i^T\beta\right| + w_2\sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2. \quad (4)$$
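A minimal numpy sketch of the composite criteria in Equations (2) and (4) is given below; it simply evaluates the objectives (the optimization itself is discussed in the 'Computation' section), and the function names are ours.

    import numpy as np

    def check_loss(r, tau):
        r = np.asarray(r, dtype=float)
        return np.sum(r * (tau - (r < 0)))

    def composite_quantile_loss(y, X, betas, taus):
        # Equation (2): sum of check losses over K quantile levels;
        # betas is a (K, p) array with one coefficient vector per level
        return sum(check_loss(y - X @ b, t) for b, t in zip(betas, taus))

    def composite_l1_l2_loss(y, X, beta, w1, w2):
        # Equation (4): weighted combination of absolute and squared residuals
        r = y - X @ beta
        return w1 * np.sum(np.abs(r)) + w2 * np.sum(r ** 2)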

The quantile-related loss functions have been the most extensively used with continuous outcomes. With categorical data, when there are only a few levels, the notion of quantile and so QR may not be suitable. As discussed in the literature [15], smoothing techniques have been developed in QR to accommodate categorical responses. However, those are rarely adopted in bioinformatics and biomedical studies. Survival outcomes can also be analyzed under the QR framework. For example, Wang and others [16] considered a weighted quantile loss function for survival data.

LAD loss function and its extension

The least absolute deviation (LAD) regression is well known for its robustness to heavy-tailed errors or outliers in response. The LAD loss function is defined as follows:

$$\sum_{i=1}^{n}\left|y_i - x_i^T\beta\right|. \quad (5)$$

There are multiple examples of using the LAD loss function [17–19]. LAD regression provides the estimate for the conditional median function. It is the foundation for the development of QR and can be viewed as a special case of QR.

Despite the robustness of the LAD loss function, Lambert-Lacroix and Zwald [20] pointed out that this criterion suffers a loss of efficiency with normally distributed data. They proposed using the following Huber’s Criterion with a concomitant scale as the loss function:

$$L_M(\beta, s) = \sum_{i=1}^{n}\left\{ s + H_M\!\left(\frac{y_i - x_i^T\beta}{s}\right) s \right\},$$

where s  is the scale parameter, and for M > 0,

$$H_M(z) = \begin{cases} z^2 & \text{if } |z| \le M \\ 2M|z| - M^2 & \text{if } |z| > M. \end{cases}$$

H_M mixes absolute errors for relatively large residuals and squared errors for smaller ones, with M controlling the robustness: the criterion is closer to LAD for a smaller M and closer to LS for a larger M. M can be determined data-adaptively, for example using cross validation (CV), together with the tuning parameters; however, not much improvement has been observed in practice, and the researchers followed [21] and set M = 1.345. Compared with the LAD loss function, Huber's criterion with a concomitant scale yields estimates that retain the robustness of the LAD criterion while improving efficiency when outliers or heavy-tailed errors are absent.
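The following numpy sketch shows the Huber function H_M and the joint criterion in (β, s), assuming the concomitant-scale form displayed above (our reconstruction); it is an illustration rather than the authors' implementation.

    import numpy as np

    def huber(z, M=1.345):
        # squared for |z| <= M, linear beyond M
        z = np.asarray(z, dtype=float)
        return np.where(np.abs(z) <= M, z ** 2, 2 * M * np.abs(z) - M ** 2)

    def huber_concomitant_criterion(y, X, beta, s, M=1.345):
        # assumed form: sum_i { s + H_M((y_i - x_i'beta)/s) * s }, with s > 0
        r = (y - X @ beta) / s
        return np.sum(s + huber(r, M) * s)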

Rank-based loss function

In addition to the QR, rank-based regression is another common type of robust procedure. Jaeckel [22] was among the first to propose rank-based regression as a robust and non-parametric alternative to classic LS and likelihood-based approaches. The dispersion function proposed in [22] has been adopted in [23] as a loss function

$$\sum_{i=1}^{n}\xi\!\left[\frac{R(\varepsilon_i)}{n+1} - \frac{1}{2}\right]\varepsilon_i, \quad (6)$$

where ε_i = y_i − x_i^Tβ, ξ(·) is a non-decreasing weight function, and R(ε_i) is the rank of ε_i among {ε_1, …, ε_n}. Note that (6) reduces to the regular Wilcoxon statistic when ξ(·) is the identity function. In an independent work, Wang and Li [24] investigated the weighted Wilcoxon loss function

$$\frac{1}{n}\sum_{i<j} b_{ij}\left|\varepsilon_i - \varepsilon_j\right|, \quad (7)$$

where the b_ij's are positive, symmetric weights. They pointed out that when the b_ij's are constant, minimizing Equation (7) is equivalent to minimizing Equation (6) with the identity weight function.
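A small numpy sketch of the two rank-based criteria, Equations (6) and (7), is given below; identity weight/score functions are assumed, and the implementation is ours rather than the reviewed authors'.

    import numpy as np
    from scipy.stats import rankdata

    def wilcoxon_dispersion(y, X, beta):
        # Equation (6) with the identity weight function (regular Wilcoxon score)
        e = y - X @ beta
        score = rankdata(e) / (len(e) + 1) - 0.5
        return np.sum(score * e)

    def weighted_wilcoxon_loss(y, X, beta, b=None):
        # Equation (7): pairwise weighted absolute differences of residuals
        e = y - X @ beta
        n = len(e)
        if b is None:
            b = np.ones((n, n))   # constant weights correspond to (6)
        i, j = np.triu_indices(n, k=1)
        return np.sum(b[i, j] * np.abs(e[i] - e[j])) / n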

For censored survival data, Cai and others [25] considered robust modeling under the AFT model. They adopted a Gehan-type loss function whose quasi-gradient is the Gehan estimating function. The same loss function has also been examined in [26] for both univariate and multivariate survival data. Johnson [27] investigated a stratified Gehan-type loss function under a partially linear model. Different from the Wilcoxon-type loss function for continuous data [23, 24] and the Gehan-type loss function for survival data [25–27], Shi et al. [28] used the partial rank loss function to accommodate continuous, categorical, and survival responses within the maximum rank correlation framework. It is interesting to note that their formulation is also robust to model mis-specification.

Remarks

Case studies of the rank-based methods [23–27] are all in biomedical areas where the predictors are in general clinical variables with dimensionality much lower than the sample size. The partial rank-based approach [28] has been applied to high-throughput profiling data in lung cancer; however, the approach essentially conducts marginal analysis of one gene at a time. Part of the reason the above rank-based procedures are restricted to low-dimensional data lies in the non-smoothness of the rank loss functions and the computational intensity of the corresponding algorithms.

Robustness to influential points (heavy-tailed errors or outliers) in the response and/or predictors, or to model mis-specification, critically depends on the choice of loss function. Like the check loss function and its variants, the rank loss functions are robust to influential points in the response. Furthermore, Wang and Li [24] demonstrated that the Wilcoxon-type loss function leads to procedures robust to influential points in both the response and the predictors, whereas Shi et al. [28] showed that the partial rank-based approach is robust to model mis-specification as well as influential points in the response.

Other choices

The rank loss function [24] is not the only choice that leads to robustness to influential points in both predictors and response. Another one is the exponential squared loss (ESL) function [29]

$$\sum_{i=1}^{n}\exp\!\left\{-\frac{\left(y_i - x_i^T\beta\right)^2}{\gamma_n}\right\}, \quad (8)$$

where γ_n ∈ (0, +∞) is a tuning parameter controlling the degree of robustness and needs to be chosen using a data-driven procedure; the criterion is maximized with respect to β. For a large γ_n, the resulting estimator is similar to the ordinary LS, whereas a small γ_n lets observations with large residuals have little influence on the estimation of β.
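A one-function numpy sketch of the exponential squared criterion in Equation (8) follows; note that the criterion is maximized over β (equivalently, ∑_i{1 − exp(·)} is minimized). The function name is ours.

    import numpy as np

    def exponential_squared_criterion(y, X, beta, gamma_n):
        # Equation (8); large gamma_n behaves like least squares, while a small
        # gamma_n makes observations with large residuals contribute almost nothing
        r = y - X @ beta
        return np.sum(np.exp(-r ** 2 / gamma_n))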

When heterogeneity comes from the aggregation of multiple data sources corresponding to different groups or populations, the log-likelihood under the finite mixture regression (FMR) model can serve as the robust loss function. This FMR-type objective function has been adopted in the literature [2, 30]. To examine model mis-specification, Lu and others [31] used the log-likelihood of generalized linear models with uncensored outcomes and the log-partial likelihood of the Cox model with censored outcomes, respectively.

Remarks

In the above subsections, we have provided a list of loss functions that have been adopted in bioinformatics and biomedical studies. In the literature, there are other choices of robust loss functions. For example, loss functions have been developed to accommodate repeated measurements with measurement errors. However, such loss functions have been generally applied to low-dimensional data, and their performance with high-dimensional bioinformatics data has not been carefully examined.

The two most common robust loss functions have roots in QR and rank regression. The rank-based loss functions are flexible in the sense that they can accommodate continuous, categorical and survival responses under the same framework. Moreover, the influential points they can handle may be in both the predictors and the response. In comparison, investigation of the check loss function and its extensions has mainly focused on robustness to heavy-tailed errors or outliers in the response for continuous and survival outcomes.

None of the loss functions is universally better than the others. For instance, the LAD loss shows satisfactory performance under heavy-tailed distributions, especially the double exponential distribution, and the Huber loss is suitable for contaminated normal distributions. As of now, there is no universal rule for choosing the best loss function.

Penalized variable selection

In Table 1, we provide a partial list of studies that have conducted robust penalized estimation. Under the penalization framework, variable selection and estimation of β in (1) are achieved simultaneously through computing β^, the estimate of β, which minimizes the penalized loss function. The general framework is as follows

β^=argminβ{L(β;y,x)+penλ(β)} (9)

where pen_λ(·) is the penalty function depending on the tuning parameter λ ∈ (0, +∞). pen_λ(·) controls model complexity by forcing some components of β̂ to be exactly zero. For many penalties, pen_λ(·) = λ × pen(β). This shows how λ balances model complexity and goodness of fit. More predictors are included as λ → 0, which leads to models with better goodness of fit. However, prediction performance and interpretability can be sacrificed under more complex models. On the other hand, fewer predictors stay in the model as λ → +∞, and the model does not have any predictor when λ = +∞. With a properly tuned λ, satisfactory prediction performance and model interpretability can be achieved. A similar pattern holds when pen_λ(·) takes other forms.
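The 'robust loss + penalty' structure of Equation (9) can be written down directly; the sketch below uses the LAD loss and the LASSO penalty purely as placeholders (any of the losses and penalties reviewed here could be plugged in), and the function names are ours.

    import numpy as np

    def lad_loss(beta, y, X):
        return np.sum(np.abs(y - X @ beta))

    def lasso_penalty(beta, lam):
        return lam * np.sum(np.abs(beta))

    def penalized_objective(beta, y, X, lam, loss=lad_loss, penalty=lasso_penalty):
        # Equation (9): lack-of-fit plus model-complexity penalty;
        # a larger lam produces sparser minimizers
        return loss(beta, y, X) + penalty(beta, lam)

The estimate β̂ is then any minimizer of such an objective; how the minimization is carried out for non-smooth losses and non-convex penalties is the topic of the 'Computation' section below.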

Table 1.

Published articles that adopt penalized robust variable selection methods for bioinformatics/biomedical data (partial list)

Authors Loss function Penalty Real data analysis
Li and Zhu [6] Check loss LASSO Cardiomyopathy data
Wang et al. [9] Check loss LASSO/ALASSO/SCAD/MCP eQTL data
Peng and Wang [10] Check loss SCAD/MCP eQTL data
Fan et al. [11] Check loss LASSO/ALASSO cis-eQTL data
Zou and Yuan [12] Composite check loss F+–norm type Cardiomyopathy data
Bradic et al. [14] Composite quasi-likelihood Weighted L1 penalty cis-eQTL data
Wang et al. [16] Weighted check loss ALASSO Cancer clinical trial data
Johnson and Peng [23] Rank (dispersion function) SCAD Biomedical data
Wang and Li [24] Rank (weighted Wilcoxon) SCAD Biomedical data
Cai et al. [25] Rank (Gehan type) LASSO Biomedical data
Xu et al. [26] Rank (Gehan type) LASSO/ALASSO Biomedical data
Johnson [27] Rank (stratified Gehan type) LASSO Biomedical data
Shi et al. [28] Rank MCP High-throughput profiling data
Wang et al. [29] ESL ALASSO Biomedical data
Städler et al. [2] FMR likelihood LASSO Gene expression data
Lu et al. [31] Likelihood ALASSO Biomedical data
Gao and Huang [45] LAD Adaptive fused LASSO DNA copy number data

Penalization is not the only candidate for variable selection. Robust variable selection can also be achieved if the robust loss functions are integrated into the framework of, for example, boosting [32] or Bayesian variable selection [33]. Boosting can accommodate high-dimensional data, as it is relatively insensitive to the dimensionality of predictors. Nevertheless, only approximate solutions are provided. As penalization has been extensively investigated and has satisfactory performance in bioinformatics studies, especially when the dimensionality is high, in this article we focus on penalization and review penalties that have been adopted in bioinformatics studies below. Based on the penalized estimates, one can adopt estimation-, significance- and stability-based selection [34]. This aspect is omitted, as the existing robust studies have focused on estimation-based selection.

Penalty functions

LASSO The least absolute shrinkage and selection operator (LASSO) penalty [35] is defined as pen_λ(β) = λ∑_{j=1}^p |β_j|, that is, λ times the L1 norm of the regression coefficients. It has been adopted in multiple studies [2, 6, 25, 27], which have indicated its satisfactory selection performance in identifying a small number of representative predictors. Despite its desirable simplicity, the LASSO approach can be variable selection inconsistent under certain conditions [36], which motivates the development of alternative penalties.

Adaptive LASSO To improve the performance of LASSO, Zou [36] proposed the adaptive LASSO (ALASSO) penalty pen_λ(β) = λ∑_{j=1}^p |β_j^(0)|^{−γ}|β_j|, where β_j^(0) is a properly chosen initial estimate and the positive γ is usually taken as 1. Zou [36] showed that for linear models with fixed p (<n), with properly chosen weights, adaptive LASSO is variable selection consistent. Similar properties of adaptive LASSO given p > n have also been established in the literature. Applications of the adaptive LASSO penalty to high-dimensional data include [9] (C. Wu et al., under review) and several others. In practice, especially for high-dimensional models, it is not trivial to choose proper weights.

SCAD The smoothly clipped absolute deviation (SCAD) penalty [37] can overcome some limitations of LASSO and adaptive LASSO. The penalty for βj is defined as

$$\mathrm{pen}_\lambda(\beta_j) = \begin{cases} \lambda|\beta_j| & \text{if } |\beta_j| \le \lambda \\ \dfrac{2a\lambda|\beta_j| - \beta_j^2 - \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta_j| \le a\lambda \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |\beta_j| > a\lambda \end{cases}$$

for some a > 2, and the overall penalty is pen_λ(β) = ∑_{j=1}^p pen_λ(β_j). Estimation consistency with the SCAD penalty has been established for a variety of robust models under both low- and high-dimensional settings. Wang and others [9] adopted this penalty for microarray studies. It has also been applied to biomedical studies [23, 24].

Weighted L1 penalty The adaptive LASSO is essentially a weighted L1 penalty (WLP). Bradic and others [14] explicitly formulated a more general framework for the WLP as pen_λ(β) = ∑_{j=1}^p γ_λ(|β_j^(0)|)|β_j|, where the function γ_λ(·) and the initial estimate β_j^(0) are chosen to reduce the bias of L1 penalization. In particular, LASSO, adaptive LASSO and SCAD correspond to γ_λ(x) = λ, γ_λ(x) = λ|x|^a with a < 0, and γ_λ(x) = λ{I(x ≤ λ) + (aλ − x)_+ / [(a − 1)λ] I(x > λ)}, respectively. In addition, when γ_λ(·) = p′_λ(·) with p_λ being a non-convex penalty such as SCAD, the penalty can be viewed as a local linear approximation (LLA) to p_λ. The performance of WLP is crucially dependent on the choice of γ_λ(·). Multiple studies [11, 14] have investigated how to choose γ_λ(·) under different robust modeling frameworks and applied the procedures to bioinformatics studies.

MCP The minimax concave penalty (MCP) [38] is expressed as

$$\mathrm{pen}_\lambda(\beta_j) = \begin{cases} \lambda|\beta_j| - \dfrac{\beta_j^2}{2a} & \text{if } |\beta_j| \le a\lambda \\ \dfrac{a\lambda^2}{2} & \text{if } |\beta_j| > a\lambda \end{cases}$$

for β_j, and the overall penalty is pen_λ(β) = ∑_{j=1}^p pen_λ(β_j). The rationale behind MCP is similar to that of SCAD: both penalties try to avoid excessive penalization of large values of the β_j's. Researchers [9, 28] have adopted MCP in a high-dimensional eQTL study and a high-throughput profiling study of lung cancer, respectively. Satisfactory performance of MCP has been demonstrated.

Fused LASSO Different from the above penalty functions, the fused LASSO has been proposed to carry out variable selection while accommodating the spatial features of data [39, 40]. It is defined as follows:

$$\mathrm{pen}_\lambda(\beta) = \lambda_1\sum_{j=1}^{p}\left|\beta_j\right| + \lambda_2\sum_{j=2}^{p}\left|\beta_j - \beta_{j-1}\right|,$$

where the first penalty promotes sparsity in coefficients, and the second penalty promotes sparsity in their differences, that is, the local smoothness of the coefficient profile. Jiang and others [41] adopted an adaptive version of fused LASSO to simultaneously identify interquantile commonality and significant quantile coefficients.

Remarks As pointed out in [37], an ideal penalty function should lead to an estimator with unbiasedness, sparsity and continuity. The variable selection capability of all the above penalties stems from the singularity of pen_λ(·) at the origin. We note that some other penalty functions have also been used for robust variable selection in bioinformatics, for example the F+–norm-type penalty [12], although not as extensively as the aforementioned ones.

Wang and others [9] compared four penalty functions, namely LASSO, adaptive LASSO, SCAD and MCP, in high-dimensional penalized QR and showed that the latter three perform comparably to one another and better than LASSO. This phenomenon is generally applicable under other penalization frameworks as long as the tuning parameters, including the weights in adaptive LASSO, are properly chosen. Closer investigation of the rationale behind SCAD and MCP can be carried out by comparing their derivatives, as shown in Figure 1. As β starts increasing from 0, both penalties apply the same rate of penalization as LASSO. The rate for both penalties then diminishes to 0 as β grows further away from 0, though the transition differs between SCAD and MCP.
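The SCAD and MCP penalties and their derivatives compared in Figure 1 are easy to evaluate; a numpy sketch following the standard forms in [37] and [38] is given below and can be used to reproduce curves like those in Figure 1 with a = 3.7 and λ = 1.5.

    import numpy as np

    def scad_penalty(beta, lam=1.5, a=3.7):
        b = np.abs(beta)
        mid = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
        return np.where(b <= lam, lam * b,
                        np.where(b <= a * lam, mid, (a + 1) * lam ** 2 / 2))

    def scad_derivative(beta, lam=1.5, a=3.7):
        b = np.abs(beta)
        return lam * ((b <= lam) +
                      np.maximum(a * lam - b, 0) / ((a - 1) * lam) * (b > lam))

    def mcp_penalty(beta, lam=1.5, a=3.7):
        b = np.abs(beta)
        return np.where(b <= a * lam, lam * b - b ** 2 / (2 * a), a * lam ** 2 / 2)

    def mcp_derivative(beta, lam=1.5, a=3.7):
        b = np.abs(beta)
        return np.maximum(lam - b / a, 0)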

Figure 1. The LASSO, SCAD and MCP penalty functions (left panel) and the corresponding derivative functions (right panel), given a = 3.7 and λ = 1.5. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.

Computation

Computational algorithms under the penalization framework are critically dependent on the loss function and penalty. The popular gradient-based optimizations are applicable to the LS loss function and convex penalties, such as LASSO and adaptive LASSO. For non-convex penalties such as SCAD or MCP, approximations are needed to make the penalized loss functions ‘smoother’.

For robust procedures, if the robust loss function is convex in β, gradient-based optimization still applies. For instance, the rank loss function [23] is convex, which leads to a gradient-based algorithm. Below we provide a survey of representative algorithms for some commonly adopted robust loss functions and penalty functions.

Linear programming-based algorithms

A prominent feature of the check loss function and its variants is that they are not differentiable at the origin. The majority of algorithms developed for penalized QR are based on linear programming. Li and Zhu [6] cast LASSO-penalized QR as a linear programming problem

$$\min_{\beta}\ \sum_{i=1}^{n}\left(\tau\,\xi_i + (1-\tau)\,\zeta_i\right) \quad \text{subject to}\quad \sum_{j=1}^{p}\left|\beta_j\right| \le \lambda,\ \ \xi_i,\ \zeta_i \ge 0,\ \ -\zeta_i \le y_i - x_i^T\beta \le \xi_i,\ \ i=1,\ldots,n \quad (10)$$

and developed an efficient algorithm computing the entire solution path based on the Lagrangian primal function derived from Equation (10). They also obtained an effective dimension estimate for the model, which allows for selecting the tuning parameter conveniently.

Wu and Liu [7] developed a difference convex algorithm (DCA) to tackle the non-convex optimization issue in penalized QR. By decomposing the SCAD penalty function as the difference of two convex functions p_λ(θ) = p_{λ,1}(θ) − p_{λ,2}(θ), where

$$p_{\lambda,1}(\theta) = \lambda|\theta|, \qquad p_{\lambda,2}(\theta) = \begin{cases} 0 & \text{if } |\theta| \le \lambda \\ \dfrac{\left(|\theta| - \lambda\right)^2}{2(a-1)} & \text{if } \lambda < |\theta| \le a\lambda \\ \lambda|\theta| - \dfrac{(a+1)\lambda^2}{2} & \text{if } |\theta| > a\lambda, \end{cases}$$

DCA approximates p_{λ,2}(θ) by a linear function at each iteration, which turns the original non-convex optimization task into a convex one. DCA decreases the value of the objective function at each step and converges in a finite number of iterations. It has been shown that when p < n, DCA outperforms the local quadratic approximation (LQA) and majorize–minimization (MM) algorithms, especially in computation time.

In addition to LQA, MM and DCA, Wang and others [9] adopted the LLA to carry out penalized QR with non-convex penalties when p ≫ n. The optimization is again cast as a linear programming problem:

$$\begin{aligned} \min_{\beta,\,\varepsilon,\,\zeta}\ & \frac{1}{n}\sum_{i=1}^{n}\left(\tau\,\varepsilon_i^{+} + (1-\tau)\,\varepsilon_i^{-}\right) + \sum_{j=1}^{p} w_j^{(t-1)}\zeta_j \\ \text{subject to}\ & \varepsilon_i^{+} - \varepsilon_i^{-} = y_i - x_i^T\beta, \quad i=1,\ldots,n \\ & \varepsilon_i^{+} \ge 0,\ \varepsilon_i^{-} \ge 0, \quad i=1,\ldots,n \\ & \zeta_j \ge \beta_j,\ \zeta_j \ge -\beta_j, \quad j=1,\ldots,p \end{aligned} \quad (11)$$

which can be solved using existing linear programming software.

A series of approximations, such as LQA, MM, LLA and DCA, can be used to smooth non-convex penalty functions. Subsequently, the optimization of a penalized quantile loss function can generally be formulated as a linear programming problem. When the loss function is an extension of the quantile loss function, such as the penalized weighted composite regression, the idea still applies. We refer to published studies [14] for more details.
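As a concrete, self-contained illustration of the linear programming idea (in the spirit of formulations (10) and (11), but not the reviewed authors' code), the sketch below solves an L1-penalized QR with scipy's linprog; the decision variables are (β, ζ, e⁺, e⁻) with ζ_j ≥ |β_j|.

    import numpy as np
    from scipy.optimize import linprog

    def l1_quantile_regression(y, X, tau=0.5, lam=0.1):
        # minimize (1/n) * sum rho_tau(y - X beta) + lam * ||beta||_1
        n, p = X.shape
        c = np.concatenate([np.zeros(p), lam * np.ones(p),
                            tau / n * np.ones(n), (1 - tau) / n * np.ones(n)])
        # residual split: X beta + e_plus - e_minus = y
        A_eq = np.hstack([X, np.zeros((n, p)), np.eye(n), -np.eye(n)])
        # |beta_j| <= zeta_j, written as two linear inequalities
        A_ub = np.vstack([
            np.hstack([np.eye(p), -np.eye(p), np.zeros((p, 2 * n))]),
            np.hstack([-np.eye(p), -np.eye(p), np.zeros((p, 2 * n))]),
        ])
        bounds = [(None, None)] * p + [(0, None)] * (p + 2 * n)
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p), A_eq=A_eq, b_eq=y,
                      bounds=bounds, method="highs")
        return res.x[:p]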

Coordinate descent algorithms

In the coordinate descent (CD) algorithms, the objective function is optimized with respect to a single predictor at a time, and the minimization is cycled through all predictors iteratively until convergence. This strategy has been demonstrated to be extremely efficient for high-dimensional optimization problems. By adopting the CD algorithm, Peng and Wang [10] proposed the quantile-based iterative CD (QICD) approach, which reduces the optimization of penalized QR to a weighted median regression problem

$$\min_{\beta_j}\left\{(n+1)^{-1}\sum_{i=1}^{n+1} w_{ij}\left|u_{ij}\right|\right\}, \quad (12)$$

where at the rth sub-iteration of the kth iteration,

$$u_{ij} = \begin{cases} Y_i - \sum_{s<j} x_{is}\beta_s^{(k)(r+1)} - \sum_{s>j} x_{is}\beta_s^{(k)(r)} - x_{ij}\beta_j, & i=1,\ldots,n \\ \beta_j, & i=n+1 \end{cases}$$

and

$$w_{ij} = \begin{cases} n^{-1}\left|x_{ij}\left(\tau - I\!\left(u_{ij}x_{ij} < 0\right)\right)\right|, & i=1,\ldots,n \\ p_{\lambda}'\!\left(\left|\beta_j^{(k-1)}\right|+\right), & i=n+1 \end{cases}$$

As shown in simulation, under high-dimensional settings (n = 300, p = 1000 and 2000), although both the QICD and LLA algorithms have satisfactory performance in parameter estimation and model selection, the QICD algorithm substantially improves the speed of computation.
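Each coordinate update in QICD-type algorithms reduces to a univariate weighted-L1 problem, i.e. a weighted median; a minimal numpy sketch of such a solver (ours, not the authors') is given below.

    import numpy as np

    def weighted_median(z, v):
        # argmin over b of sum_i v_i * |z_i - b|: the smallest z whose
        # cumulative weight reaches half of the total weight
        order = np.argsort(z)
        z = np.asarray(z, dtype=float)[order]
        v = np.asarray(v, dtype=float)[order]
        cum = np.cumsum(v)
        k = np.searchsorted(cum, 0.5 * cum[-1])
        return z[k]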

In some high-dimensional bioinformatics studies [28] (C. Wu et al., under review), the CD algorithms have been adopted after smoothing the rank-based penalized loss functions. Variants of the CD algorithm are also popular. For example, Wang and others [29] adopted a block gradient CD algorithm, whereas Städler and others [2] developed a block CD generalized expectation–maximization algorithm for an FMR LASSO model.

Algorithms based on data augmentation

For the LAD–LASSO regression

$$\sum_{i=1}^{n}\left|y_i - x_i^T\beta\right| + n\sum_{j=1}^{p}\lambda_j\left|\beta_j\right|, \quad (13)$$

which allows for coefficient-specific tunings, Wang and others [17] proposed a data augmentation scheme

$$\sum_{i=1}^{n+p}\left|y_i^{*} - x_i^{*T}\beta\right|, \quad (14)$$

where (y_i^*, x_i^*) = (y_i, x_i) for i = 1, …, n, and (y_{n+j}^*, x_{n+j}^*) = (0, nλ_j e_j) for j = 1, …, p. Here e_j denotes a p-dimensional vector whose components are all 0 except that the jth component equals 1. Note that Equation (14) is a regular LAD criterion and can be optimized by standard unpenalized LAD programs. The same algorithm has also been adopted in studies investigating LAD–LASSO under high-dimensional settings [18, 19]. The most prominent feature of this kind of algorithm is its computational convenience: standard algorithms and software packages can be used after data augmentation.
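A numpy sketch of the augmentation of Equation (14) is shown below (a common λ_j = λ is assumed for simplicity); any unpenalized LAD routine applied to the augmented data, for instance the l1_quantile_regression sketch above with tau = 0.5 and lam = 0, then returns the LAD–LASSO estimate.

    import numpy as np

    def lad_lasso_augment(y, X, lam):
        # append p pseudo-observations (0, n * lam * e_j) to (y, X)
        n, p = X.shape
        X_aug = np.vstack([X, n * lam * np.eye(p)])
        y_aug = np.concatenate([y, np.zeros(p)])
        return y_aug, X_aug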

Convergence of the CD algorithm and its variants has been extensively investigated. In general, convergence can be established under the MM framework, as the majority of the aforementioned algorithms can be viewed as special cases of the MM algorithm. Interested readers may refer to relevant literature for more details.

Selection of tuning parameters

The robust penalization procedures generally consist of two components. First, for each fixed tuning parameter λ, estimate the regression coefficient β. Second, choose the tuning parameter(s) to achieve optimal performance. Popular criteria for choosing optimal tuning parameters include the Bayesian information criterion (BIC), generalized CV (GCV), CV and others. More sophisticated criteria may be needed when multiple tuning parameters are involved. For example, a hybrid approach has been developed [29]. In that study, the regularization parameters are first chosen by a BIC-type criterion; then the tuning parameter γ_n in Equation (8), which controls the robustness and efficiency of the penalized estimator, is determined by a data-driven procedure. Another special case is that sometimes it is more sensible to investigate results corresponding to a series of λ values [28] rather than those under a single fixed value.
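For concreteness, a simple K-fold CV loop over a grid of λ values is sketched below, scoring held-out data with the check loss; 'fit' stands for any penalized robust fitting routine with a (y, X, tau, lam) interface, for example the l1_quantile_regression sketch given earlier (an assumption of this illustration).

    import numpy as np

    def cv_select_lambda(y, X, lambdas, fit, tau=0.5, n_folds=5, seed=0):
        # K-fold cross-validation with the check loss as the validation criterion
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        scores = []
        for lam in lambdas:
            err = 0.0
            for test in folds:
                train = np.setdiff1d(np.arange(len(y)), test)
                beta = fit(y[train], X[train], tau=tau, lam=lam)
                r = y[test] - X[test] @ beta
                err += np.sum(r * (tau - (r < 0)))
            scores.append(err)
        return lambdas[int(np.argmin(scores))]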

Implementations

The availability of software is critical for effective application. However, our limited search suggests that, as of now, there is no portable software package specifically developed to implement the robust penalization methods. Some of the algorithms described above can be carried out using existing software, although the execution is not trivial. For example, the data augmentation-based LAD–LASSO estimation can be realized using the R function rq in the QUANTREG package under fixed tuning parameters [17]. The linear programming-based algorithms are generally solvable by calling functions from existing software packages, such as the lpSolve package in R [12] (code available from the authors on request). The study by Jiang and others [41] is the only one providing complete code, which is available at the corresponding author's web page. Lambert-Lacroix and Zwald [20] showed in their paper how to carry out convex programming in CVX, a Matlab-based program for disciplined convex programming, under fixed tuning parameters. The rest of the reviewed studies have not provided software for implementation.

Implementing all of the reviewed methods is prohibitive and beyond the scope of this review. Without available software, we are not able to conduct a numerical comparison. To gain some insight into the numerical performance of different methods, we compile Table 2, which summarizes the comparisons of robust penalized variable selection methods reported in a number of the reviewed papers. The settings are for continuous responses and linear models. Although the summary is partial, it still provides an informative picture of the pros and cons of these methods. It is observed that no method dominates all others under all scenarios. The quantile (or LAD) loss-based methods can accommodate high/ultra-high dimensionality more readily than others such as the rank-based ones. The comparison between the robust methods and the oracle, where the true model is known, is not realistic in practice and hence omitted.

Table 2.

Summary of comparisons of the robust penalization approaches against alternatives (partial list)

For each study, the proposed and alternative approaches (proposed / alternatives), the main simulation settings and the major conclusions are summarized.

Wu and Liu [7]
  Proposed/alternative: (1) Q(a)+DCA–SCAD / Q+LQA–SCAD; Q+MM–SCAD. (2, 3) Q+DCA–SCAD / Q+ALASSO; Q+LASSO.
  Main settings: (1) n = 60, d = 3(b) and p = 8, with normal random errors. (2) n = 100, d = 3, p = 8 and n = 100, d = 3, p = 110, with normal random errors. (3) n = 100, d = 3, p = 8, with heteroscedastic random errors.
  Major conclusions: (1) All approaches achieve similar prediction performance, but DCA+SCAD significantly outperforms the alternatives in variable selection and computation time. (2, 3) The approaches with LASSO have the worst performance, while the other two are comparable.

Wang et al. [9]
  Proposed/alternative: Q+LLA–SCAD; Q+LLA–MCP / LS+LASSO; LS+ALASSO; LS+SCAD; LS+MCP; Q+LASSO; Q+ALASSO.
  Main settings: n = 300, d = 5 and p = 400/600; τ = 0.3, 0.5 and 0.7 for the quantile-based approaches; predictors are generated with correlations, and the response is generated under heteroscedastic noises.
  Major conclusions: The proposed approaches outperform the LS counterparts and also have smaller estimation errors than the ALASSO counterpart when τ = 0.3 and 0.7, at the cost of selecting slightly more false positives.

Peng and Wang [10]
  Proposed/alternative: QICD–SCAD; QICD–MCP / Q+LLA–SCAD; Q+LLA–MCP.
  Main settings: n = 300, d = 5, p = 1000/2000; τ = 0.3, 0.5 and 0.7 for the quantile-based approaches; the response is generated from a linear model with heteroscedastic random errors, and predictors are generated with correlations.
  Major conclusions: The two sets of approaches perform similarly in terms of variable selection and prediction, but the proposed one is significantly faster.

Fan et al. [11]
  Proposed/alternative: Q+AR–LASSO / LS+LASSO; LS+SCAD; Q+R–LASSO.
  Main settings: n = 100, d = 7, p = 400 under light to heavier-tailed random errors; predictors are generated (1) with correlations and (2) independently.
  Major conclusions: In most cases, the proposed approach has smaller estimation errors than the alternatives. In general, it also selects fewer false positives while including slightly more false negatives.

Zou and Yuan [12]
  Proposed/alternative: EWCQR + matrix norm penalty (F+–norm type) / Q+LASSO.
  Main settings: (1) n = 300, d = 3, p = 8. (2) n = 300, p = 12, and three responses are generated given an (almost) common set of significant predictors with d = 2∼3, under different correlation structures and light to heavy-tailed random errors. (3) Similar to (2), but the three responses are generated separately given different sets of significant predictors.
  Major conclusions: In (1) and (2), the proposed approach selects fewer false positives and has smaller model errors, as a common (or almost common) set of significant variables is shared among the multiple quantile functions. In (3), when the true models do not share any significant predictors, the alternative excels by choosing far fewer false positives.

Gao and Huang [18]
  Proposed/alternative: LAD+LASSO / LS+LASSO.
  Main settings: p = 200 and n = 500, 200, 100; n = 200 and p = 1000, 2000 and 5000; d = 5 for all scenarios; responses are generated under correlated predictors and light to heavy-tailed noises.
  Major conclusions: The two approaches are comparable under standard normal errors; LAD+LASSO starts to work better as the tails of the random errors become heavier.

Bradic et al. [14]
  Proposed/alternative: (1) L1L2+WLP; (2) L1L2++WLP; (3) OWCQR+WLP; (4) OWCQR++WLP / (5) L1+WLP; (6) L2+WLP; (7) EWCQR+WLP.
  Main settings: n = 100, d = 4, p = 12 and 500; the responses are simulated under different correlation structures in the predictors and light to heavy-tailed random errors.
  Major conclusions: All the composite quasi-likelihood-based approaches show stable performance as the dimension increases from p << n to p >> n. In terms of variable selection, (3) and (4) outperform (1), (2), (5), (6) and (7), and (7) outperforms (1), (2), (5) and (6).

Wang et al. [29]
  Proposed/alternative: ESL+ALASSO / CQR+ALASSO; LAD+ALASSO.
  Main settings: n = 100∼800, d = 4 and p = 8; n = 1000, d = 4 and p = 100; the influential points are generated (1) only in the predictors, (2) only in the responses and (3) in both the predictors and the responses.
  Major conclusions: The proposed approach performs better in correctly identifying irrelevant features, but the other two yield smaller prediction errors. In general, no approach dominates the others in all scenarios.

Wang and Li [24]
  Proposed/alternative: Rank (weighted Wilcoxon based) + SCAD / LS+SCAD.
  Main settings: (1) n = 100, d = 3 and p = 8; correlated predictors and light to heavy-tailed random errors are used to generate responses. (2) Similar to (1), but the contaminations are in the predictors.
  Major conclusions: The proposed approach outperforms in both scenarios, choosing more correct models and yielding smaller estimation errors.

Lambert-Lacroix and Zwald [20]
  Proposed/alternative: Scaled Huber loss + ALASSO / LS+ALASSO; LAD+ALASSO.
  Main settings: (1) CV is used to select tunings; (2) BIC is used to select tunings; n = 100, d = 3 and p = 8; the noises are simulated as low/high correlated Gaussian noises and large/sensible outliers.
  Major conclusions: When CV is used to select the tuning parameters, the proposed approach is comparable to LS+ALASSO under normal errors and outperforms LAD+ALASSO. The advantage of the proposed approach is not significant under BIC.

Note. The simulations are on continuous responses and linear models.

(a) Q stands for the check loss function. (b) d is the number of true non-zero variables.

Statistical properties

An ideal penalized robust method should have consistency in two aspects. The first is variable selection consistency, as discussed in the above sections. The second is estimation consistency. An 'oracle' method should generate estimates having the same asymptotic distributions as if the true model were known in advance. When p is fixed and smaller than the sample size n, asymptotic properties of the relevant procedures have been established in multiple studies [7, 13, 16, 17, 20, 23–26, 29–31]. As high-throughput bioinformatics studies are under the 'large p, small n' paradigm, we are more interested in procedures with asymptotic consistency properties established for p > n. Such studies include [8, 18, 19]. It is even more desirable to establish asymptotic properties under ultrahigh-dimensional models with p ≫ n [9, 11]. Though this list is partial, it clearly indicates that it is far more challenging to establish asymptotic properties when p grows faster than n. Multiple novel techniques have been implemented in the literature. For example, to overcome the non-smoothness of both the quantile loss function and the penalty functions (SCAD and MCP), Wang and others [9] applied a sufficient local optimization condition for the convex differencing representation of the penalized quantile loss function, which is crucial for establishing asymptotic properties without stringent moment or distributional restrictions on the random errors in ultrahigh dimensionality. Fan and others [11] established the oracle properties for variable selection under penalized QR with WLP and the asymptotic normality of the corresponding penalized estimator, under mild assumptions on the model error distributions in ultrahigh dimensions. As we do not focus on the theoretical side of penalized robust procedures, we only offer a list of studies involving the development of asymptotic properties.

Examples

In this section, we describe five case studies. The first four are high-throughput bioinformatics studies, whereas the last one is from a cross-sectional study with low-dimensional covariates. The uniqueness of the low-dimensional data lies in the presence of influential points in both predictors and response. The responses in the data examples are continuous (DNA copy number data, after proper transformation, can be viewed as continuous). Robust penalization procedures have also been applied to bioinformatics studies with other types of responses, although less frequently.

The cardiomyopathy data

A microarray gene expression study of cardiomyopathy in transgenic mice has been conducted [42]. The G protein-coupled receptor, designated as Ro1, is a variant of the human kappa opioid receptor. Redfern and others [42] demonstrated the relationship between the expression of Ro1 in the hearts of adult mice and cardiomyopathy. Symptoms of lethal dilated cardiomyopathy are present when the receptor is overexpressed, and the mice recover when the expression is turned off. Li and Zhu [6] treated the Ro1 expression as the response and gene expressions measured by Affymetrix Mu6500 arrays as predictors. The goal is to identify gene expression changes associated with the variation in Ro1 expression and to search for diagnostic markers for cardiomyopathy. The 50%, 75% and 90% L1 QRs were fitted [6]. Compared with the LASSO approach [43], L1 QR identifies different sets of important genes.

The same data set has also been analyzed using SMQR [12]. Twelve genes are identified by fitting the 10%, 25%, 50%, 75% and 90% conditional quantiles simultaneously. In addition, the five individual L1 QR fits identify 38 genes, and LASSO mean regression identifies 17 genes. The genes selected by SMQR are a subset of the 38 and 17 genes.

The eQTL data of lab rats

The overall objective of this eQTL mapping study of laboratory rats was to examine gene regulation in the mammalian eye and to identify genetic variations relevant to human eye diseases [1]. There are 18,958 probes left after preprocessing of the >31,000 probe sets on 120 twelve-week-old male offspring of rats produced from an F1 animal intercross. One analysis goal is to search for genes whose expressions are associated with TRIM32 (probe ID 1389163_at), which is known to cause the Bardet–Biedl syndrome.

Wang and others [9] selected the top 300 probes according to their correlations with 1389163_at for further analysis. The Q–SCAD/MCP fitted at τ = 30%, 50% and 70% identifies different sets of probes, and the LS–SCAD/MCP selects models sparser than their quantile counterparts. This phenomenon indicates that probes can be strongly associated with 1389163_at merely at the lower or upper tail of the conditional distribution, and the LS approach can miss these heterogeneous signals. Peng and Wang [10] applied the QICD approach to the same data set and selected different sets of probes associated with 1389163_at at τ = 30%, 50% and 70%.

The penalized QRs in both studies have been developed for ultrahigh dimensions. Wang and others [9] analyzed the top 300 probes, whereas Peng and Wang [10] analyzed the top 3,000. This partly reflects the superior performance of the CD algorithms [10] over the LLA-based linear programming algorithms [9].

The cis-eQTL data

The cis-eQTL mapping data from the international HapMap project [44] have been analyzed in multiple studies [11, 14]. Bradic and others [14] analyzed data for gene CCT8 whose overexpression could be associated with the Down syndrome phenotypes. The data include 210 unrelated individuals and are divided into three populations: Asian, CEPH and Yoruba. A few novel SNPs identified by the robust procedure from these groups can account for the variation in CCT8 gene expression level.

To show the effectiveness of the robust adaptive LASSO procedure, Fan and others [11] analyzed data on gene CHRNA6, which is located in the cytogenetic location 8p11 on chromosome 8 and is related to the triggering of dopamine-releasing neurons by nicotine. The function of this gene has been pursued in quite a few nicotine addiction studies on people with Western European heritage. For 90 individuals with Western European ancestry, Fan and others [11] kept the top 100 SNPs most strongly associated with the expression of CHRNA6 out of all the SNPs located within 1 megabase upstream and downstream of the transcription start site of CHRNA6. The robust procedures (R-LASSO and AR-LASSO) identify sets of SNPs different from the regular approaches (LASSO and SCAD). In particular, among the SNPs identified by all four procedures, only one (rs10958726) has been reported previously, and only the robust procedures find this SNP to be important. The QQ plots of all the approaches under comparison show heavy right tails, which justifies the use of robust procedures.

The DNA copy number data

The DNA copy number data on chromosome 22q11 from 12 patients suffering from the 22q11 deletion syndrome, the Cat-Eye syndrome and a few other conditions have been analyzed by Gao and Huang [45]. The microarray data contain about 372 000 features for each patient. Gao and Huang [45] applied the LAD-aFL approach to each individual segment after partitioning the entire chromosome into a number of segments. LAD-aFL identifies all the previously reported blocks as well as the breakpoints for DNA block amplification and deletion. Furthermore, several new deleted regions for Patient 03-154 are detected by the proposed approach.

Gao and Huang [45] also applied LAD-aFL to a colorectal cancer data set. The LAD-aFL method identifies both weak and strong DNA alterations on chromosome 1 in samples X59, X524, X186 and X204. The researchers show the superior performance of the robust approach over its LS counterpart. For the four chromosomes examined [Chromosome 8 of GM03134, Chromosome 14 of GM01750, Chromosome 22 of GM13330 and Chromosome 23 of GM03563 from the Bacterial Artificial Chromosome (BAC) data], LAD-aFL only detects breakpoints for the first two chromosomes, which is consistent with the karyotyping approach, whereas the LS-FL only identifies breakpoints from Chromosomes 22 and 23.

The plasma beta-carotene level data

The plasma beta-carotene level data set, from a cross-sectional study consisting of 273 female patients who underwent elective surgical procedures during a 3-year period, has been analyzed in multiple studies [24, 29]. It has been suggested that the plasma concentration of beta-carotene may be a risk factor for cancer. The study goal is to identify whether environmental/clinical covariates, such as age, smoking status and dietary beta-carotene consumed, are associated with the plasma beta-carotene level. Wang and Li [24] and Wang and others [29] chose different sets of covariates for downstream analysis. It is shown that there are outliers in both the response and the covariates. In [24], the proposed approach outperforms its LS counterpart, with better prediction while identifying a model with fewer covariates. While the better performance of the robust over the non-robust procedure is expected, it is also shown in [29] that ESL–LASSO outperforms CQR–LASSO and LAD–LASSO in terms of correctly identifying unimportant variables. However, none of the three dominates the other two in simulations with influential points in the predictors, in the response, or in both.

Remarks

Given the special focus of this article, the examples reviewed above are from the fields of bioinformatics and biomedicine. It should be noted that the reviewed methods are also applicable to data generated in other fields. Existing examples include the stock market data [17, 20], the economic growth data [41] and others. Data contamination and model mis-specification are by no means specific to bioinformatics, and hence the reviewed methods can potentially be broadly applicable. We note that data from many other fields are often low dimensional. For such data, variable selection can still be applied but may be less critical.

Discussion

This article has provided a review of robust variable selection methods using penalization that have applications in bioinformatics studies. We have discussed more than ten robust loss functions and five penalty functions, as well as corresponding computational algorithms. For some of them, the statistical consistency properties have been established, mostly for the low-dimensional scenario and some for the high-dimensional scenario. Five representative data sets are also discussed as a demonstration of applications.

Our limited review suggests that the majority of existing robust methods have been developed for heavy-tailed errors and outliers in the response. Only two studies have investigated the scenario where influential points are observed in both the predictors and the response [24, 29]. Specifically, the study in [29] is among the first to fully characterize the robustness properties of a penalized procedure, such as the finite sample breakdown point and the influence function. Interestingly, the simulation study indicates that, with the adaptive LASSO penalty, the proposed method does not dominate CQR and LAD. Therefore, characterizing the robustness of penalized quantile/LAD regressions deserves further study. Only a few studies [28, 31] have examined the robustness of penalized procedures to model mis-specification. More attention should be paid to procedures robust to influential points in the predictors and/or model mis-specification. A major observation is that, unlike the traditional robust modeling approaches whose robustness is theoretically well characterized, for instance through the breakdown point of the robust estimator or the boundedness of the influence function, the properties of the robust penalized procedures are not fully understood. Another major observation is that no robust penalization approach can universally dominate all the others.

Multiple variable selection methods can potentially be coupled with the reviewed robust loss functions. Penalization can be preferred because of its satisfactory statistical and empirical properties. A large number of penalty functions have been proposed in the literature. Here we have only reviewed those that have been applied to robust procedures. It is conjectured that other penalties can also be used along with the reviewed robust losses. More effort will be needed to investigate their statistical and numerical properties. In the reviewed studies, it has been assumed that for the majority of observations there is no measurement error. In some early profiling studies, especially microarray studies, the measurement error problem has been acknowledged, but penalized robust variable selection methods that can accommodate measurement error remain to be studied. The reviewed studies have assumed linear covariate effects. There are also a few studies that accommodate nonlinear covariate effects, for example by adopting varying coefficient models. One example is that in [46], a partially linear varying coefficient model is integrated with penalized rank regression to accommodate non-linear gene–environment interactions. Another family of methods that can also be robust and accommodate high-dimensional data is dimension reduction [47]. With a significantly different framework, such methods are not discussed in this article.

Key Points.

  • For low-dimensional biomedical data, robust methods have been extensively developed. However, the development for high-dimensional data is limited.

  • Multiple robust loss functions have been reviewed. They can be robust to data contamination and model mis-specification.

  • Penalization provides a way of regularized estimation and variable selection with robust loss functions.

  • Data analyses suggest satisfactory performance of the penalized robust methods.

  • More development in methodology, theory and implementation is still needed.

Acknowledgements

The authors thank the editor and three reviewers for careful review and insightful comments, which have led to significant improvement of this manuscript.

Biographies

Cen Wu is a Postdoctoral Associate in the Department of Biostatistics at Yale University. He obtained his Ph.D. in statistics from Michigan State University in 2013.

Shuangge Ma is an Associate Professor in the Department of Biostatistics at Yale University.

Funding

This study was supported by awards from the National Institutes of Health (CA182984, CA152301, P50CA121974, P30CA16359).

References

1. Scheetz T, Kim K, Swiderski R, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 2006;103:14429–34.
2. Städler N, Bühlmann P, van de Geer S. L1-penalization for mixture regression models. Test 2010;19:209–56.
3. Fan J, Fan Y. High dimensional classification using features annealed independence rules. Ann Statist 2008;36:2605–73.
4. Fan JQ, Lv J. A selective review of variable selection in high dimensional feature space. Statist Sin 2010;20:104–48.
5. Chung M, Long Q, Johnson B. A tutorial on rank-based coefficient estimation for censored data in small- and large-scale problems. Stat Comput 2013;23:601–14.
6. Li Y, Zhu J. L1-norm quantile regression. J Comput Graph Statist 2008;17:163–85.
7. Wu Y, Liu Y. Variable selection in quantile regression. Statist Sin 2009;37:801–17.
8. Belloni A, Chernozhukov V. L1 penalized quantile regression in high dimensional sparse models. Ann Statist 2011;39:82–130.
9. Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultrahigh dimension. J Am Stat Assoc 2012;107:214–22.
10. Peng B, Wang L. An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. J Comput Graph Statist 2014. DOI:10.1080/10618600.2014.913516 (In Press).
11. Fan Y, Fan J, Barut E. Adaptive robust variable selection. Ann Statist 2014;42:324–51.
12. Zou H, Yuan M. Regularized simultaneous model selection in multiple quantiles regression. Comput Stat Data An 2008;52:5296–304.
13. Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. Ann Statist 2008;36:1108–26.
14. Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh-dimensional variable selection. J R Statist Soc Ser B 2011;73:325–49.
15. Koenker R. Quantile Regression. UK: Cambridge University Press; 2005.
16. Wang H, Zhou J, Li Y. Variable selection for censored quantile regression. Stat Sinica 2013;23:145–67.
17. Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-LASSO. J Bus Econ Stat 2007;25:347–55.
18. Gao X, Huang J. Asymptotic analysis of high-dimensional LAD regression with LASSO. Stat Sinica 2010;20:1485–506.
19. Wang L. The L1 penalized LAD estimator for high dimensional linear regression. J Multivar Anal 2013;120:135–51.
20. Lambert-Lacroix S, Zwald L. Robust regression through the Huber's criterion and adaptive LASSO penalty. Electron J Stat 2011;16:1015–53.
21. Huber P. Robust Statistics. New York, NY: Wiley, 1981.
22. Jaeckel LA. Estimating regression coefficients by minimizing the dispersion of residuals. Ann Math Stat 1972;43:1449–58.
23. Johnson B, Peng L. Rank-based variable selection. J Nonparametr Stat 2008;20:241–52.
24. Wang L, Li R. Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 2009;65:564–71.
25. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics 2009;65:394–404.
26. Xu J, Leng C, Ying Z. Rank-based variable selection with censored data. Stat Comput 2010;20:165–76.
27. Johnson B. Rank-based estimation in the L1-regularized partly linear model for censored outcomes with application to integrated analyses of clinical predictors and gene expression data. Biostatistics 2009;10:659–66.
28. Shi X, Liu J, Huang J, et al. A penalized robust method for identifying gene-environment interactions. Genet Epidemiol 2014;38:220–30.
29. Wang X, Jiang Y, Huang M, et al. Robust variable selection with exponential squared loss. J Am Stat Assoc 2013;108:632–43.
30. Khalili A, Chen J. Variable selection in finite mixture of regression models. J Am Stat Assoc 2007;102:1025–38.
31. Lu W, Goldberg Y, Fine J. On the robustness of the adaptive LASSO to model misspecification. Biometrika 2012;99:717–31.
32. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. New York, NY: Springer, 2009.
33. O'Hara RB, Sillanpää MJ. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal 2009;4:85–118.
34. Meinshausen N, Bühlmann P. Stability selection (with discussion). J R Stat Soc B 2010;72:417–73.
35. Tibshirani R. Regression shrinkage and selection via the LASSO. J R Stat Soc B 1996;58:267–88.
36. Zou H. The adaptive LASSO and its oracle property. J Am Stat Assoc 2006;101:1418–29.
37. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001;96:1348–60.
38. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Ann Statist 2010;38:894–942.
39. Tibshirani R, Saunders M, Rosset S, et al. Sparsity and smoothness via the fused lasso. J R Stat Soc B 2005;67:91–108.
40. Tibshirani R, Wang P. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 2008;9:18–29.
41. Jiang L, Bondell H, Wang H. Interquantile shrinkage and variable selection in quantile regression. Comp Stat Data Anal 2014;69:208–19.
42. Redfern C, Degtyarev M, Kwa A, et al. Conditional expression of a Gi-coupled receptor causes ventricular conduction delay and a lethal cardiomyopathy. Proc Natl Acad Sci USA 2000;97:4826–31.
43. Segal M, Kam D, Bruce C. Regression approaches for microarray data analysis. J Comput Biol 2003;10:961–80.
44. The International HapMap Consortium. A haplotype map of the human genome. Nature 2005;437:1299–320.
45. Gao XL, Huang J. A robust penalized method for the analysis of noisy DNA copy number data. BMC Genomics 2010;11:517.
46. Wu C, Shi X, Cui Y, et al. A penalized robust semiparametric approach for gene-environment interactions. 2014. (Under review).
47. Yao W, Wang Q. Robust variable selection through MAVE. Comput Statist Data Anal 2013;63:42–9.
