Abstract
In high-dimensional linear regression, the dimension of the variables often exceeds the sample size. In this situation, the traditional variance estimator based on ordinary least squares typically exhibits a large bias even under the sparsity assumption. One of the major reasons is the high spurious correlation between the unobserved realized noise and some of the predictors. To alleviate this problem, a refitted cross-validation (RCV) method has been proposed in the literature. However, for a complicated model and a finite sample, the probability that the model selected by RCV includes the true model can be low, which easily results in a large bias of the variance estimate. Thus, this study proposes a model selection method based on the ranks of the frequency of occurrence in six votes from a blocked 3×2 cross-validation. The proposed method has a considerably larger probability of including the true model in practice than the RCV method, and the variance estimate obtained from the model it selects shows a lower bias and a smaller variance. Furthermore, theoretical analysis establishes the asymptotic normality of the proposed variance estimator.
Keywords: High-dimensional linear regression, blocked 3×2 cross-validation, variance estimation, asymptotic normality property
1. Introduction
In traditional linear regression, variance estimation is an important statistical inference problem. The variance estimate is used to draw inferences about the regression coefficients and plays an important role in the construction of model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The variance is usually estimated from the residuals obtained by fitting the selected model [15,17]. However, in high-dimensional linear regression the dimension of the variables often exceeds the sample size, so traditional variance estimation techniques, such as ordinary least squares (OLS), are not directly applicable: they exhibit a large bias even under the sparsity assumption. Nevertheless, advances in variable selection for high-dimensional linear regression make this problem accessible. In particular, Fan and Guo [5] showed that one of the major reasons for the bias is the high spurious correlation between the unobserved realized noise and some of the predictors. This finding implies that variance estimation is always tied to variable selection, that is, it is a two-step process. First, a variable selection tool is used to select a model, and the corresponding regression coefficients are estimated by some estimation method on the basis of the selected variables. Second, the variance is estimated from the residuals of the fitted selected model.
The above description indicates that a key assumption for estimating the variance in high-dimensional linear regression is the sparsity condition, that is, the number of nonzero coefficients is smaller than the sample size. Variable selection can identify the subset of important predictors and improve model interpretability and predictability [5]. Numerous model selection and estimation techniques, such as the least absolute shrinkage and selection operator (Lasso) and the smoothly clipped absolute deviation (SCAD) penalty, have been proposed for high-dimensional problems [2,3,8,9,14,20–22,28,29]. In particular, Fan and Lv [6] showed that correlation ranking has a sure screening property in a Gaussian linear model with Gaussian covariates and proposed the sure independence screening (SIS) and iterative sure independence screening (ISIS) methods. Other relevant references can be found in [7,10,11].
However, as shown in the literature [5], the traditional OLS-based variance estimator computed after model selection by the SIS technique still exhibits a large bias even under the sparsity assumption. See the example shown in Figure 1.
Figure 1.
Densities of the variance estimates over 500 replications. The data are generated from the simplest normal linear regression model with n = 50. (a) OLS-based estimates for dimensions p = 10, 100, 1000, 5000 with a fixed number of selected variables; (b) estimates for different numbers of selected variables with p = 1000.
Figure 1(a) shows the following findings. First, the estimated variance is biased relative to the true variance. Second, the bias grows with the dimension p. Furthermore, when variables are selected by the SIS technique, the bias of the variance estimate also increases with the number of selected variables in the model (up to 10), as shown in Figure 1(b). Fan and Guo [5] showed that this bias stems from irrelevant variables with high spurious correlations. They proposed refitted cross-validation (RCV), introduced in detail in Section 2.2, which estimates the variance based on two-fold cross-validation and attenuates this influence [1,12,16,23,26]. However, good performance of RCV requires that both models selected on the two halves of the data cover the true model; otherwise, the RCV estimate is considerably biased, as shown in the example in Table 1 and Figure 2.
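To make the phenomenon in Figure 1 concrete, the following minimal sketch (not the authors' code) simulates the naive post-screening variance estimate under an assumed null model y = ε with σ² = 1; the screening rule, dimensions, and number of retained variables are illustrative choices.

```python
# Minimal sketch, assuming the null model y = eps with true sigma^2 = 1,
# so every predictor retained by screening is spurious.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_keep = 50, 1000, 5           # sample size, dimension, variables kept

def naive_sigma2(X, y, n_keep):
    """Screen by absolute marginal correlation, refit OLS on the survivors,
    and estimate the noise variance from the residual sum of squares."""
    corr = np.abs(X.T @ (y - y.mean())) / np.sqrt((X ** 2).sum(axis=0))
    keep = np.argsort(corr)[-n_keep:]            # top-ranked predictors
    Xs = X[:, keep]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return resid @ resid / (len(y) - n_keep)

estimates = []
for _ in range(200):                             # 200 replications
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                   # y = eps, sigma^2 = 1
    estimates.append(naive_sigma2(X, y, n_keep))

print(np.mean(estimates))   # typically well below 1: spurious correlations
                            # absorb noise and the variance is underestimated
```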
Table 1. The missing variables in the variable selection of RCV.

| RCV | The 12th replicated experiment | The 53rd replicated experiment |
|---|---|---|
| Fold 1 | | |
| Fold 2 | | no variables |
Figure 2.
Box plots of the RCV variance estimates with the variables selected from the 12th and the 53rd replications, which do not include all the relevant variables. The data are generated from the linear regression model with 10 nontrivial coefficients for i = 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, with n = 400 and p = 1000; the covariates are jointly normal. The simulated experiment runs 100 times. SIS is the model selection method.
In many cases, the model selected by the RCV method does not always cover the true model. For example, Table 1 shows that the two folds in two-fold cross-validation do not include all relevant variables in the 12th replicated experiment, and only one fold includes the relevant variables in the 53rd replicated experiment. Subsequently, the variables selected in the 12th and 53rd replications are used to fit the model, and RCV is performed to estimate the variance. Figure 2 shows the results. All the RCV variance estimations exhibit a large bias, whether for the 12th or 53rd replication.
Wang et al. [24] proposed a blocked 3×2 cross-validation (B3×2CV) method based on three structured replications of two-fold cross-validation and demonstrated its superiority in comparing algorithms and estimating generalization error. Related studies can be found in the literature [18,19,24,25,27]. Thus, a model selection method based on the ranks of the frequency of occurrence in six votes from B3×2CV is proposed in this study. The contributions of this study are as follows. First, B3×2CV is applied for the first time to variance estimation in high-dimensional linear regression and shows better performance than the commonly used RCV method. Second, although the RCV method proposed by Fan and Guo [5] corrects the bias of variance estimation in high-dimensional linear regression, its good performance requires that both models selected on the two data halves cover the true model; otherwise, the RCV estimate is considerably biased. In practice, the model selected by RCV often fails to cover the true model. The B3×2CV method alleviates this problem, that is, it has a considerably larger probability of including the true model than the RCV method. Third, a voting scheme based on the frequency of occurrence in six votes over the six combinations from B3×2CV is used to implement model selection. The variance estimate obtained from the model selected by B3×2CV also shows a lower bias and a smaller variance. Furthermore, the B3×2CV-based variance estimator retains the desirable theoretical properties of RCV while giving better experimental results.
The remainder of this paper is organized as follows. Section 2 introduces RCV and B3×2CV. In Section 3, we prove the asymptotic property of B3×2CV variance estimation. Simulation studies are conducted in Section 4 to show the superiority of the proposed variance estimation. Finally, we conclude this study and discuss future research works in Section 5.
2. Techniques for variance estimation in high-dimensional linear regression model
2.1. High-dimensional linear regression model
The following high-dimensional linear regression model is considered:
\[ Y = X\beta + \varepsilon, \tag{1} \]

where Y is an n-vector of responses, X is an n × p (p > n, or even p ≫ n) design matrix, β is a p-vector of parameters, and ε is an n-vector of independently and identically distributed random noise with mean zero and variance σ². The noise is assumed independent of the predictors in this study, and variance estimation refers to estimating the parameter σ².
Next, we will introduce the variance estimation based on RCV and B3×2CV.
2.2. RCV-based variance estimation
In this subsection, the RCV method of Fan and Guo [5] is introduced. It is based on two-fold cross-validation and uses part of the data to select the model (variables) and a different part to estimate the coefficient β and the variance σ². First, a data set with sample size n is randomly split into two approximately even data sets D₁ and D₂, with responses and design matrices (Y⁽¹⁾, X⁽¹⁾) and (Y⁽²⁾, X⁽²⁾). Second, a variable selection tool is applied to D₁, and M̂₁ denotes the set of selected variables. Third, OLS is performed on data set D₂ to estimate the regression coefficient and the variance with the selected variable set M̂₁, namely

\[ \hat{\sigma}_1^2 = \frac{Y^{(2)\top}\,\big(I_{n/2} - P^{(2)}_{\hat{M}_1}\big)\,Y^{(2)}}{n/2 - |\hat{M}_1|}, \]

where \( P^{(2)}_{\hat{M}_1} = X^{(2)}_{\hat{M}_1}\big(X^{(2)\top}_{\hat{M}_1} X^{(2)}_{\hat{M}_1}\big)^{-1} X^{(2)\top}_{\hat{M}_1} \). Here X⁽²⁾ indexed by M̂₁ denotes the submatrix containing the columns of X⁽²⁾ that are indexed by M̂₁, and |M̂₁| indicates the number of elements in M̂₁. Similarly, the variable set M̂₂ is selected on data set D₂, and β and σ² are estimated on data set D₁, resulting in

\[ \hat{\sigma}_2^2 = \frac{Y^{(1)\top}\,\big(I_{n/2} - P^{(1)}_{\hat{M}_2}\big)\,Y^{(1)}}{n/2 - |\hat{M}_2|}, \]

where \( P^{(1)}_{\hat{M}_2} \) is defined analogously to \( P^{(2)}_{\hat{M}_1} \). Finally, the variance can be estimated by averaging σ̂₁² and σ̂₂², namely

\[ \hat{\sigma}^2_{RCV} = \frac{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}{2}. \tag{2} \]
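The RCV procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation; the SIS-style screening rule and the screening size d are assumptions.

```python
# Minimal sketch of the RCV estimator in Eq. (2): select on one half,
# refit by OLS on the other half, and average the two residual variances.
import numpy as np

def screen(X, y, d):
    """Toy SIS-style screening: keep the d predictors with the largest
    absolute marginal correlation with y."""
    return np.argsort(np.abs(X.T @ (y - y.mean())))[-d:]

def ols_sigma2(X, y, support):
    """Residual variance of an OLS fit restricted to `support`."""
    Xs = X[:, support]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return resid @ resid / (len(y) - len(support))

def rcv_sigma2(X, y, d, rng):
    n = len(y)
    idx = rng.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2:]      # two approximately even halves
    m1 = screen(X[i1], y[i1], d)              # select on D1 ...
    m2 = screen(X[i2], y[i2], d)              # ... and on D2
    s1 = ols_sigma2(X[i2], y[i2], m1)         # refit each set on the other half
    s2 = ols_sigma2(X[i1], y[i1], m2)
    return 0.5 * (s1 + s2)                    # Eq. (2)
```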
2.3. B3×2CV-based variance estimation
Wang et al. [24] proposed a new technique, B3×2CV, based on three replications of two-fold cross-validation and demonstrated its superiority in algorithm comparison. We apply it to variance estimation in high-dimensional regression. We randomly split the sample into four approximately equally sized disjoint parts, denoted D_j, j = 1, 2, 3, 4. Pairwise combination then yields three groups with six combinations (half-samples): Group 1: (D₁ ∪ D₂, D₃ ∪ D₄); Group 2: (D₁ ∪ D₃, D₂ ∪ D₄); Group 3: (D₁ ∪ D₄, D₂ ∪ D₃).
We use σ̂ᵢ² (i = 1, 2, 3) to denote the variance estimate obtained on Group i, and σ̂ᵢₖ² (i = 1, 2, 3; k = 1, 2) to denote the estimate obtained from the kth data set of Group i. Let Dᵢ⁽¹⁾ denote the first data set (half-sample) of Group i and Dᵢ⁽²⁾ the second. The B3×2CV-based variance estimate is obtained by the following steps.
Step I. A variable selection tool, such as SIS, is applied to each of the six combinations Dᵢ⁽ᵏ⁾ (i = 1, 2, 3; k = 1, 2), and a fixed number of top-ranked variables is selected for each combination. The candidate variables are then sorted by their frequency of occurrence across the six votes of the six combinations, and the top-ranked variables form the selected variable set M̂.
Step II. OLS is applied to the two data sets of each Group i to estimate the coefficient and the variance with the variable set M̂, resulting in

\[ \hat{\sigma}_{ik}^2 = \frac{Y_i^{(k)\top}\,\big(I_{n/2} - P^{(i,k)}_{\hat{M}}\big)\,Y_i^{(k)}}{n/2 - |\hat{M}|}, \qquad k = 1, 2, \]

where

\[ P^{(i,k)}_{\hat{M}} = X^{(i,k)}_{\hat{M}}\big(X^{(i,k)\top}_{\hat{M}} X^{(i,k)}_{\hat{M}}\big)^{-1} X^{(i,k)\top}_{\hat{M}}, \]

and \( X^{(i,k)}_{\hat{M}} \) denotes the submatrix, with columns indexed by M̂, of the design matrix of Dᵢ⁽ᵏ⁾.

Step III. The variance is estimated by averaging all σ̂ᵢ₁² and σ̂ᵢ₂² over the three groups, namely

\[ \hat{\sigma}^2_{B3\times 2CV} = \frac{1}{6}\sum_{i=1}^{3}\big(\hat{\sigma}_{i1}^2 + \hat{\sigma}_{i2}^2\big). \tag{3} \]
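Steps I–III can be sketched as follows. This is a minimal illustration under the same assumptions as the RCV sketch above (SIS-style screening with an illustrative size d), not the authors' implementation.

```python
# Minimal sketch of the B3x2CV estimator in Eq. (3): split into four blocks,
# form the three two-fold partitions, vote over the six half-samples to pick
# one variable set, then average the six refitted residual variances.
import numpy as np
from collections import Counter

def screen(X, y, d):
    """Toy SIS-style screening: keep the d largest absolute marginal correlations."""
    return np.argsort(np.abs(X.T @ (y - y.mean())))[-d:]

def ols_sigma2(X, y, support):
    """Residual variance of an OLS fit restricted to `support`."""
    Xs = X[:, support]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return resid @ resid / (len(y) - len(support))

def b3x2cv_sigma2(X, y, d, rng):
    idx = rng.permutation(len(y))
    blocks = np.array_split(idx, 4)                      # D1, D2, D3, D4
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    halves = [np.concatenate([blocks[b] for b in half])
              for group in pairings for half in group]   # six combinations

    # Step I: vote over the six half-sample selections, keep the top-d variables.
    votes = Counter()
    for h in halves:
        votes.update(screen(X[h], y[h], d).tolist())
    selected = [v for v, _ in votes.most_common(d)]

    # Steps II-III: refit OLS with the voted set on every half and average, Eq. (3).
    return np.mean([ols_sigma2(X[h], y[h], selected) for h in halves])
```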
Notably, the variance is estimated with standardized variables. Compared with the model that includes a constant term (intercept), the variance estimate without the constant term, multiplied by a fixed factor, equals the variance estimate with the constant term. For example, when n = 400 in Experiment 1, the RCV variance estimate with the constant term is 1.360, a constant multiple of the estimate without the constant term, 1.346. The two estimates therefore differ only by a constant factor, which does not affect the conclusions of this study.
3. Asymptotic normality of B3×2CV
Asymptotic behavior is a basic property of an estimator: it gives the estimator's limiting distribution, on the basis of which statistical inference, such as a confidence interval or a hypothesis test, can be carried out. Thus, the asymptotic normality of the B3×2CV variance estimator is studied in this section.
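For example, once an asymptotic normality result of the form √n(σ̂² − σ²) →d N(0, V) is available, a standard plug-in construction (a generic textbook device, not specific to this paper) gives an approximate confidence interval:

```latex
% Approximate (1 - alpha) confidence interval for sigma^2 based on
% sqrt(n) (hat{sigma}^2 - sigma^2) -> N(0, V), with a plug-in estimate hat{V}:
\[
  \hat{\sigma}^2 \;\pm\; z_{1-\alpha/2}\,\sqrt{\hat{V}/n},
  \qquad z_{1-\alpha/2} \ \text{the standard normal quantile}.
\]
```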
For convenience, let λ_min(A) denote the smallest eigenvalue of a matrix A.
Assumption 1
The errors ε₁, …, ε_n are independently and identically distributed with mean zero, variance σ², and finite fourth moment, and are independent of the design matrix X.
Assumption 2
There is a constant c > 0 such that the smallest eigenvalue of the normalized Gram matrix of any candidate submodel is bounded below by c for all n.
Remark 3.1
Assumption 2 ensures that the selected variables are not highly correlated; see [5] for a detailed discussion.
Theorem 3.1
Under Assumptions 1 and 2, suppose that the probability that each selected model includes the true model tends to one and that the size of the selected model is of smaller order than n. If the variable selections on the six combinations are independent and identically distributed, then

\[ \sqrt{n}\,\big(\hat{\sigma}^2_{B3\times 2CV} - \sigma^2\big) \;\xrightarrow{\;d\;}\; N\big(0,\; E\varepsilon_1^4 - \sigma^4\big), \]

where '→d' stands for convergence in distribution.
Proof.
Define the selected variable sets on the six combinations as M̂₁, …, M̂₆, and the sequences of events A_j = {M₀ ⊂ M̂_j}, j = 1, …, 6, where M₀ represents the true model; by assumption, P(A_j) → 1 as n → ∞. Next, defining the event A = ∩_{j=1}^{6} A_j, we have

\[ P(A) \ge 1 - \sum_{j=1}^{6} P(A_j^c) \to 1. \]

That is, the voted variable set M̂ contains the true model with probability tending to one.

According to Theorem 2 in [5], on any Group i, i = 1, 2, 3, we have

\[ \hat{\sigma}_i^2 = \frac{1}{2}\big(\hat{\sigma}_{i1}^2 + \hat{\sigma}_{i2}^2\big) = \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2 + o_p\big(n^{-1/2}\big) \]

and

\[ \sqrt{n}\,\big(\hat{\sigma}_i^2 - \sigma^2\big) \;\xrightarrow{\;d\;}\; N\big(0,\; E\varepsilon_1^4 - \sigma^4\big). \]

Then

\[ \hat{\sigma}^2_{B3\times 2CV} = \frac{1}{3}\sum_{i=1}^{3}\hat{\sigma}_i^2 = \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2 + o_p\big(n^{-1/2}\big). \]

Combining this with the conclusion that (1/n) Σ_{t=1}^{n} ε_t² asymptotically follows a normal distribution, we have

\[ \sqrt{n}\,\big(\hat{\sigma}^2_{B3\times 2CV} - \sigma^2\big) \;\xrightarrow{\;d\;}\; N\big(0,\; E\varepsilon_1^4 - \sigma^4\big). \]
The proof of Theorem 3.1 is completed.
Remark 3.2
A comparison with Theorem 2 in [5] shows that the asymptotic distributions of the RCV and B3×2CV estimators are identical. However, with a finite sample size, the B3×2CV estimate is less biased and has a smaller variance than the RCV estimate, as shown in Figures 3 and 5.
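As a concrete special case (a standard calculation, not stated in the paper), if the errors are Gaussian then Eε₁⁴ = 3σ⁴, so the common limiting variance reduces to 2σ⁴:

```latex
% Gaussian errors: E eps^4 = 3 sigma^4, hence E eps^4 - sigma^4 = 2 sigma^4 and
\[
  \sqrt{n}\,\big(\hat{\sigma}^2_{B3\times 2CV} - \sigma^2\big)
  \;\xrightarrow{\;d\;}\; N\big(0,\; 2\sigma^4\big).
\]
```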
Figure 3.
Box plots for the RCV, R3×2CV, R5×2CV, B3×2CV, and B5×2CV variance estimates, where R3×2CV refers to three simple random replications of two-fold cross-validation. The data are generated from the linear regression model with 10 nontrivial coefficients for i = 1, 2, 3, 5, 7, 11, 13, 17, 19, 23. The sample size n and the number of covariates p are 400 and 1000, respectively, and the covariates are jointly normal. The simulated experiment runs 100 times. SIS is the model selection method.
Figure 5.
Variance estimates of (a) RCV and (b) B3×2CV as the sample size changes. The data are generated from the linear regression model with 10 nontrivial coefficients for i = 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, with p = 1000; the covariates are jointly normal. The simulated experiment runs 100 times. SIS is the model selection method.
4. Simulation experimental study
In this section, we compare the statistical performance of B3×2CV with that of RCV and R3×2CV on simulated and real data. In contrast to the three structured replications of RCV used by B3×2CV, R3×2CV refers to three simple random replications of RCV.
4.1. Experiment 1
The setup of Figure 2 in Section 1 is used to compare the variance estimates of RCV, R3×2CV, R5×2CV, B3×2CV, and B5×2CV in Figure 3. The only difference between R5×2CV (B5×2CV) and R3×2CV (B3×2CV) is the number of replications of two-fold cross-validation. The results for R5×2CV and B5×2CV illustrate that simply adding random replications of RCV cannot improve the performance of variance estimation. For the B3×2CV estimate, the median line of the box plot is closer to the true value line '1' than those of the RCV and R3×2CV estimates, which implies that the proposed B3×2CV technique corrects the bias of RCV. The interquartile range (IQR) and whisker range of the RCV box plot are wider than those of B3×2CV, which indicates that the distribution of the RCV estimate is more dispersed; that is, the RCV variance estimate fluctuates more than the B3×2CV estimate.
Compared with the RCV box plot, the locations of the median lines of the R3×2CV and R5×2CV box plots show no apparent change. That is, simple random replications of RCV cannot reduce the bias and perform nearly the same. A possible reason is that R3×2CV and R5×2CV do not improve RCV's probability that the selected model includes the true model. Furthermore, the R3×2CV box plot has a narrower IQR than the RCV box plot. Compared with B3×2CV, the fluctuation of the B5×2CV estimate is slightly smaller but not significantly different. Therefore, in view of computational complexity, B3×2CV is preferred for variance estimation.
Next, the bias, the standard error, and the probability that the selected model includes the true model are calculated for all the estimators under the same setup. Table 2 summarizes the results.
Table 2. Simulation results for Experiment 1: bias, standard error, and probability that the selected model includes the true model (P) for the RCV, R3×2CV, R5×2CV, B3×2CV, and B5×2CV procedures.
| RCV | R3×2CV | R5×2CV | B3×2CV | B5×2CV | |
|---|---|---|---|---|---|
| Bias | 0.346 | 0.299 | 0.301 | 0.127 | 0.128 |
| Standard error | 0.260 | 0.202 | 0.200 | 0.192 | 0.182 |
| Probability | 0.04 | 0.04 | 0.04 | 0.25 | 0.26 |
The simulation results in Table 2 show that the RCV estimate has the largest bias (0.346), followed by the R3×2CV estimate (0.299). Compared with R3×2CV, the bias of B3×2CV is reduced by 57.5%. Moreover, the standard error of B3×2CV is smaller than those of RCV and R3×2CV. The probability that the selected model includes the true model is only 0.04 for RCV, considerably lower than the 0.25 of B3×2CV. These results further demonstrate the superiority of the proposed method.
4.2. Experiment 2
The following simulation is conducted to test the sensitivity of B3×2CV to the true model size. The true model is gradually enlarged by increasing its number of relevant variables, all of whose coefficients are 1. True models of different sizes are obtained by adding five new relevant variables each time to the true model of Figure 2, as shown in Table 3. Figure 4(a) and (b) summarize the medians and variances of the variance estimates for the different model sizes s.
Table 3. True models for Experiment 2 ('…' represents all the relevant variables in the previous true model).
| True model size | The relevant variables |
|---|---|
| s = 15 | |
| s = 20 | |
| s = 25 | |
| s = 30 | |
| s = 35 |
Figure 4.
(a) Medians and (b) variances of the variance estimates for different model sizes s. In addition to the 10 nontrivial coefficients in Figure 2, 5 new relevant variables are added each time, and all the coefficients of the added relevant variables are 1. n = 400 and p = 1000. The simulated experiment runs 100 times. SIS is the model selection method.
Figure 4 shows that although the biases and variances of the RCV and B3×2CV estimates increase as the model size s grows from 15 to 35, the change for RCV is much larger than the minimal change for B3×2CV. That is, the B3×2CV estimate is insensitive to the true model size, whereas the RCV estimate is not.
We also calculate, for both methods, the probability that the selected model includes the true model at the different model sizes. A similar conclusion is obtained: the probability for B3×2CV is considerably larger than that for RCV at every model size (Table 4).
Table 4. Simulation results for Experiment 2: probability that the selected model includes the true model for RCV and B3×2CV.
| s = 15 | s = 20 | s = 25 | s = 30 | s = 35 | |
|---|---|---|---|---|---|
| RCV | 0.035 | 0.005 | 0.005 | 0 | 0 |
| B3×2CV | 0.27 | 0.14 | 0.07 | 0.05 | 0.03 |
4.3. Experiment 3
Figures 3 and 4 only show the experimental results for a sample size of 400. In this subsection, the setup of Experiment 1 is used to observe the asymptotic behavior of the RCV and B3×2CV variance estimates in the high-dimensional linear model by changing the sample size from 300 to 1,000.
Figure 5 shows that the variance estimates of RCV and B3×2CV both converge to the true value of 1 as the sample size increases. The bias of the variance estimates corresponding to Figure 5 is provided in Table 5; the bias is obtained by subtracting the true variance from the mean of the 100 variance estimates. The bias of the B3×2CV estimate is extremely small, especially for large sample sizes. Although the bias of the RCV estimate has a decreasing trend, its convergence is slow. Moreover, the variance of the B3×2CV estimator decreases rapidly with increasing sample size, whereas the variance of the RCV estimator shows no evident decreasing trend.
Table 5. Experiment 3: bias of the variance estimation for RCV and B3×2CV.
| n | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|
| RCV | 0.382 | 0.346 | 0.288 | 0.228 | 0.161 | 0.140 | 0.124 | 0.128 |
| B3×2CV | 0.120 | 0.127 | 0.077 | 0.073 | 0.041 | 0.039 | 0.000 | 0.018 |
4.4. Real data analysis
We now apply the proposed procedure to analyze the white wine quality data obtained from the UCI machine learning repository [4]. The data set consists of 4,898 samples and 12 variables (1 target variable and 11 predictors). Data preprocessing is performed as follows.
We take the 8th variable (density) as the response variable and the remaining 11 variables as predictors.
We apply transformations to the 4th and 8th variables.
We remove the abnormal points (the 2,782nd, 3,902nd, 1,654th, 1,664th, 3,620th, 3,624th, 1,527th, 4,481st, 4,108th, and 4,150th samples); 4,888 samples remain.
A Box–Cox transformation is applied to the data. The residual plot in Figure 6(a) indicates that the residuals are identically distributed and meet the model assumptions. In addition, the variance estimate obtained from stepwise regression is regarded as a benchmark for comparing the variance estimates of RCV, R3×2CV, and B3×2CV.
We generate 989 mutually independent random variables, each following the standard normal distribution, to meet the conditions of the high-dimensional setting (as sketched below). The generated variables are appended to the original white wine quality data, which have been standardized beforehand. The new data, with 4,888 samples and 1,000 predictors, are then used to forecast the density of white wine over the 1,000 predictors.
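The noise-augmentation step can be sketched as follows; this is a minimal illustration, and the function name and the assumption of an already standardized 4,888 × 11 predictor matrix are hypothetical.

```python
# Minimal sketch: append independent standard-normal noise columns to the
# standardized real predictors until the design matrix has 1,000 columns.
import numpy as np

def augment_with_noise(X_real, p_total=1000, seed=0):
    """X_real: standardized real predictor matrix (here 4,888 x 11).
    Returns an (n, p_total) matrix with p_total - p_real appended noise columns."""
    rng = np.random.default_rng(seed)
    n, p_real = X_real.shape
    noise = rng.standard_normal((n, p_total - p_real))   # 989 noise predictors
    return np.hstack([X_real, noise])

# Usage (illustrative): X_aug = augment_with_noise(X_real)   # shape (4888, 1000)
```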
Figure 6.
(a) Residual plot of the real data and (b) variance estimates for the real data under the various procedures.
We randomly select 400 samples from the 4,888 new samples each time and apply the RCV, R3×2CV, and B3×2CV procedures; this operation is repeated 100 times. Figure 6(b) shows the resulting variance estimates. The median lines of all procedures deviate from the benchmark, but the median line of B3×2CV has the smallest deviation, considerably less than those of RCV and R3×2CV. Moreover, the whisker range of B3×2CV is the narrowest among the three procedures. Although B3×2CV has the largest IQR, most of its values are under 0.06. These results show that the B3×2CV estimate is closest to the benchmark.
Table 6 shows that, as expected, the probability that the selected model includes the true model and the average number of true variables included are higher for B3×2CV than for RCV and R3×2CV.
Table 6. Results for the real data: probability that the selected model includes the true model and the average number of true variables included, for RCV, R3×2CV, and B3×2CV (numbers in brackets give the number of variables selected by stepwise regression).
| RCV | R3×2CV | B3×2CV | |
|---|---|---|---|
| Probability | 0.015 | 0.012 | 0.07 |
| Average number | 8.435(11) | 8.435 (11) | 9.16(11) |
5. Conclusion and discussion
In this study, we apply the B3×2CV technique to variance estimation in high-dimensional linear regression and establish the asymptotic theory of the B3×2CV variance estimator. The theoretical and empirical results show that the B3×2CV estimate is more stable than those of RCV and R3×2CV.
Furthermore, the data in this study are assumed to be independent and identically distributed. When the data have a structure, however, whether row randomization should be performed before analyzing the variance estimate needs to be considered; a similar technique is used for processing structured data in [13]. In addition to least squares, other techniques, such as ridge regression or other penalization methods, can be applied in the model-fitting stage of high-dimensional linear regression. In future work, we will use other variable selection tools, such as Lasso, SCAD, and the Dantzig selector, to select important variables and compare the performance of different variable selection tools.
Funding Statement
This work was supported by the Shanxi Applied Basic Research Program [201901D111034, 201801D211002], the National Natural Science Foundation of China [61806115] and the Open Research Fund of Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, ECNU [KLATASDS2007].
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
1. Breiman L. and Spector P., Submodel selection and evaluation in regression: The X-random case, Inter. Stat. Rev. 60 (1992), pp. 291–319. doi: 10.2307/1403680
2. Bunea F., Tsybakov A., and Wegkamp M., Sparsity oracle inequalities for the lasso, Electron. J. Stat. 1 (2007), pp. 169–194. doi: 10.1214/07-EJS008
3. Candes E. and Tao T., The Dantzig selector: Statistical estimation when p is much larger than n (with discussion), Ann. Stat. 35 (2007), pp. 2313–2351. doi: 10.1214/009053606000001523
4. Cortez P., Cerdeira A., Almeida F., Matos T., and Reis J., Modeling wine preferences by data mining from physicochemical properties, Decis. Support Syst. 47 (2009), pp. 547–553. doi: 10.1016/j.dss.2009.05.016
5. Fan J. and Guo S., Variance estimation using refitted cross-validation in ultrahigh dimensional regression, J. R. Statist. Soc. B 74 (2012), pp. 37–65. doi: 10.1111/j.1467-9868.2011.01005.x
6. Fan J. and Lv J., Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Statist. Soc. B 70 (2008), pp. 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
7. Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Statist. Sinica 20 (2010), pp. 101–148.
8. Fan J. and Lv J., Non-concave penalized likelihood with NP-dimensionality, IEEE Trans. Inform. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486
9. Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Stat. 32 (2004), pp. 928–961. doi: 10.1214/009053604000000256
10. Fan J., Samworth R., and Wu Y., Ultrahigh dimensional feature selection: Beyond the linear model, J. Mach. Learn. Res. 10 (2009), pp. 2013–2038.
11. Fan J. and Song R., Sure independence screening in generalized linear models with NP-dimensionality, Ann. Stat. 38 (2010), pp. 3567–3604. doi: 10.1214/10-AOS798
12. Hastie T., Tibshirani R., and Friedman J.H., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, 2009, pp. 241–249.
13. Héberger K. and Kollár-Hunek K., Comparison of validation variants by sum of ranking differences and ANOVA, J. Chemom. 33 (2019), pp. 1–14. doi: 10.1002/cem.3104
14. Kim Y., Choi H., and Oh H.-S., Smoothly clipped absolute deviation on high dimensions, J. Amer. Statist. Assoc. 103 (2008), pp. 1665–1673. doi: 10.1198/016214508000001066
15. Kleinbaum D.G., Kupper L.L., Muller K.E., and Nizam A., Applied Regression Analysis and Other Multivariable Methods, 3rd ed., Duxbury Press, Belmont, CA, 2003.
16. Kohavi R., A study of cross-validation and bootstrap for accuracy estimation and model selection, Inter. Joint Conf. Artificial Intel. 2 (1995), pp. 1137–1143.
17. Kutner M.H., Nachtsheim C.J., and Neter J., Applied Linear Regression Models, 4th ed., McGraw-Hill Education, New York, 2005.
18. Li J., Hu J., and Wang Y., Blocked 3×2 cross-validation estimator of prediction error: A simulated comparative study based on biological data, J. Biomath. 29 (2014), pp. 700–710.
19. Li J., Wang R., Wang W., and Li G., Automatic labeling of semantic roles on Chinese FrameNet, J. Softw. 30 (2010), pp. 597–611. doi: 10.3724/SP.J.1001.2010.03756
20. Lv J. and Fan Y., A unified approach to model selection and sparse recovery using regularized least squares, Ann. Stat. 37 (2009), pp. 3498–3528. doi: 10.1214/09-AOS683
21. Meier L., van de Geer S., and Bühlmann P., The group lasso for logistic regression, J. R. Statist. Soc. B 70 (2008), pp. 53–71. doi: 10.1111/j.1467-9868.2007.00627.x
22. Meinshausen N. and Yu B., Lasso-type recovery of sparse representations for high-dimensional data, Ann. Stat. 37 (2009), pp. 246–270. doi: 10.1214/07-AOS582
23. Shao J., Linear model selection by cross-validation, J. Amer. Statist. Assoc. 88 (1993), pp. 486–494. doi: 10.1080/01621459.1993.10476299
24. Wang Y., Wang R., Jia H., and Li J., Blocked 3×2 cross-validated t-test for comparing supervised classification learning algorithms, Neural Comput. 26 (2014), pp. 208–235. doi: 10.1162/NECO_a_00555
25. Wang R., Wang Y., Li J., Yang X., and Yang J., Block-regularized cross-validated estimator of the generalization error, Neural Comput. 29 (2017), pp. 519–554. doi: 10.1162/NECO_a_00923
26. Yang Y., Comparing learning methods for classification, Statist. Sinica 16 (2006), pp. 635–657.
27. Yang X., Wang Y., Wang R., and Li J., Variance of estimator of the prediction error based on blocked 3×2 cross-validation, Chinese J. Appl. Probab. Statist. 30 (2014), pp. 372–380.
28. Zhang C.H. and Huang J., The sparsity and bias of the lasso selection in high-dimensional linear regression, Ann. Stat. 36 (2008), pp. 1567–1594. doi: 10.1214/07-AOS520
29. Zhao P. and Yu B., On model selection consistency of lasso, J. Mach. Learn. Res. 7 (2006), pp. 2541–2563.






