Abstract
Omics-wide association analysis is an important tool in medicine and human health research. However, modern omics data sets often exhibit high dimensionality, responses with unknown distributions, features with unknown distributions, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis depends on accurate modeling of such data. Most existing association analysis methods rely on specific model assumptions and lack effective false discovery rate (FDR) control. To address these limitations, this paper first applies a single index model to omics data. The model is robust in that the relationship between the response variable and a linear combination of covariates may be connected by any unknown monotonic link function, and both the random error and the covariates may follow any unknown distribution. Based on this model, the paper then combines a rank-based approach with the symmetrized data aggregation approach to develop a novel and robust feature selection method that achieves fine-mapping of risk features while controlling the false positive rate of selection. Theoretical results support the proposed method, and analyses of simulated data show that the new method performs effectively and robustly across all scenarios. The new method is also applied to two real data sets and identifies some risk features unreported in existing findings.
Introduction
Advances in high-throughput omics technologies, such as metagenomics sequencing and DNA methylation or protein microarrays, have revolutionized research in medicine. One major use of such technologies is omics-wide association analysis, which identifies relevant omics features, such as microbial genes and DNA methylation variants, from a large pool of candidates by analyzing their association with a phenotype of interest, e.g., patient prognosis or response to medical treatment. The identified features can help elucidate disease mechanisms, subject to further validation studies, or be used to build more accurate prediction models for personalized medicine [1]. The increasing availability of massive human genomic data sets makes the dimensionality of omics features much larger than the sample size, which poses new challenges to statistical analysis [2–4]. Moreover, modern omics data sets are not only high-dimensional but also exhibit responses with unknown distributions, features with unknown distributions, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis relies on accurate high-dimensional modeling of such data. Such an involved modeling task can be based on the high-dimensional single index model (SIM). The single index model has been the subject of extensive investigation in both the statistics and biology literature over the last few decades. It generalizes the linear model to scenarios where the regression function can be any monotonic link function, including nonlinear functions. High-dimensional single index models have also attracted interest, with various authors studying variable selection, estimation, and inference using penalization schemes [5–14].
However, almost all of these methods do not provide a false discovery rate (FDR) controlled multiple testing procedure for simultaneously testing the significance of model coefficients, and thus cannot perform FDR-controlled feature selection. In particular, for feature selection in high-dimensional single index models, Rejchel et al. [32] proposed the cross-validation Ranklasso method, along with its modified versions: the thresholded Ranklasso and the weighted Ranklasso methods. However, the precision of feature selection for all these regularization-based methods depends on the tuning parameter in the penalty function and on the sample size. Given a specific sample size, the relationship between the false selection rate (e.g., FDR) and the tuning parameter value remains unknown. It is therefore challenging to use the tuning parameter to control the FDR of the Ranklasso-type selection results, and these methods cannot effectively conduct FDR-controlled feature selection. In addition, the few methods that do offer FDR-controlled multiple testing rely on p-values with the Benjamini-Hochberg (BH) correction [15] and may fail to control the FDR in the presence of complex and strong dependence among the features. Moreover, the high dimensionality of the features lowers the power of such approaches because of the large-scale adjustment burden. Hence, developing an effective and robust FDR-controlled feature selection method based on the high-dimensional single index model is highly desirable.
In the existing literature, FDR-controlled feature selection for high-dimensional models is achieved mainly via the following three approaches.
-
Knockoff filter-based approach.
Barber and Candes [16] first introduced the knockoff filter, a feature selection procedure that controls the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite samples regardless of the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and it does not require any knowledge of the noise level. Following the knockoff filter framework, Candes et al. [17] proposed the model-X knockoff FDR control method. However, this method requires both complete knowledge of the joint distribution of the design matrix and repeated derivation of the conditional distributions. The development of methods to construct exact or approximate knockoff features for a broader class of distributions is a promising area of active research [18, 19].
-
Symmetrized data aggregation (SDA) approach.
To simultaneously test the significance of the regression coefficients in the high-dimensional linear regression model, Du et al. [20] first proposed a data splitting-based method, SDA, to select features with FDR control. The key idea is to apply a sample-splitting strategy to construct a series of statistics with a marginal symmetry property and then to exploit this symmetry to approximate the number of false discoveries. The SDA approach consists of three procedures.
- a. The first procedure splits the sample into two parts, both of which are used to construct statistics assessing the evidence of each regression coefficient against the null.
- b. The second procedure aggregates the two statistics to form a new ranking statistic that is symmetric about zero under the null.
- c. The third procedure chooses a threshold along the ranking by exploiting the symmetry between the positive and negative null statistics to control the FDR.
-
Stability selection approach.
Stability selection [21] is a variable selection algorithm that relies on resampling. Its core principle is to repeatedly apply a variable selection method to resampled subsets of the data, defining a variable as stable if it is frequently selected. It can improve the performance of a base feature selection method such as the lasso. Another useful property of stability selection is that it provides an effective way to control the false discovery rate (FDR) in finite samples, provided its tuning parameters are set properly. Owing to its versatility and flexibility, stability selection has been successfully applied in many domains, such as gene expression analysis [22–25].
Additionally, FDR-controlled feature selection can also be implemented via many other statistical inference methods (e.g., regression-based modeling, two-sample testing, and statistical causal mediation analysis) [26–31]. However, directly applying these methods to omics data is usually underpowered and can sometimes render inappropriate results (the selection results may contain too many false discoveries).
In this paper, we first employ a single index model (SIM) to model omics data. The considered SIM is robust in two respects: the relationship between the response variable and a linear combination of covariates is connected by an unknown monotonic link function, and both the random error and the covariates may follow unknown distributions. Our plan is to combine the above FDR-controlled feature selection approaches with the rank-based approach [32] for the high-dimensional SIM to develop a novel and robust FDR-controlled feature selection method. More specifically, we construct a single index model based on the rank-based approach, in which the transformed response samples are dependent. This dependency violates the independence assumption on the response samples underlying the knockoff filter-based approach, so the knockoff filter is not applicable to our rank-based plan. Regarding the stability selection approach, the existing theory that informs its implementation provides relatively weak bounds on the FDR, leading to a reduced number of true positives [33–36]. Furthermore, stability selection requires users to specify two of three parameters: the target FDR, a selection threshold, and the expected number of selected features. Numerous studies have shown that stability selection is sensitive to these choices, complicating tuning for optimal performance [34, 36–39]. In summary, stability selection not only requires numerous parameter settings but is also excessively conservative, reducing false positives (FP) at the expense of true positives (TP). Hence, the stability selection approach is not an appropriate choice for our plan.
The SDA approach does not depend on p-values, offers higher power, and requires significantly fewer tuning parameters, which allows it to enjoy several useful theoretical properties. This motivates us to adopt the SDA approach within the context of the rank-based SIM. Consequently, we leverage both the rank-based approach [32] and the SDA approach to develop a robust feature selection method for the SIM, which we then apply to fine-mapping of omics data while controlling the false positive rate of selection. Notably, the proposed method does not depend on p-values. Additionally, we provide theoretical results that validate its effectiveness.
Finally, we design extensive simulation studies to compare the proposed method with competing methods. The simulation results demonstrate that the proposed method effectively controls the FDR across all scenarios. They also indicate that when the sample size is moderate and the errors stem from a heavy-tailed Cauchy distribution, or when the relationship is non-linear (with or without heavy-tailed Cauchy errors), the proposed method outperforms the competing methods in terms of power. In small-sample scenarios, the competing methods may either be underpowered or render inappropriate results with an FDR inflated above the nominal threshold. These results indicate that our method performs effectively and robustly across all scenarios. The proposed method is also applied to two real data sets and identifies some causal features unreported in existing findings.
Materials and methods
This section first reviews the rank-based single index model (SIM) for omics data and then provides parameter estimation methods for the SIM. Finally, a multiple testing procedure with FDR control is given for simultaneously testing the coefficients of the high-dimensional single index model.
Review of the rank-based single index model
Let X denote the observed n × p matrix of p omics features on n samples, and let Y denote the response vector, e.g., disease status or gene expression. Assuming that the means of all p features are zero, Rejchel et al. [32] focused on the single index model (1) without intercept
Y_i = g(X_i^T β, ε_i), i = 1, …, n | (1) |
where β = (β_1, …, β_p)^T denotes the coefficients associating the omics features with the response Y, and g is an unknown monotonic link function. No assumptions are made on the form of the monotonic link function g, the distribution of the error ε, or the distributions of the p features; hence model (1) is robust for modeling omics data. Their purpose is to perform feature selection identifying the support set S = {j : β_j ≠ 0} within the framework of model (1). To this end, Rejchel et al. [32] utilized a rank-based lasso approach to sparsely estimate S by solving the optimization problem
β̂ = arg min_{b ∈ R^p} [ (1/n) Σ_{i=1}^n (Ŷ_i − X_i^T b)^2 + λ‖b‖_1 ] | (2) |
where Ŷ_i = R_i/n − (n+1)/(2n), with R_i denoting the rank of Y_i among Y_1, …, Y_n (i.e., the actual response values Y_i are replaced by their centered ranks); ‖·‖_1 and ‖·‖_2 denote the l1 and l2 norms, respectively; and λ > 0 is a data-dependent tuning parameter of the l1-type lasso penalty.
Rejchel et al. [32] highlighted that the rank-based lasso method defined by the optimization problem (2) does not estimate the true regression coefficients β in model (1). However, under some assumptions, they defined a parameter β* via the corresponding optimization problem without the penalty
which is related to the true vector of regression coefficients β. Under certain standard assumptions, the support of β* coincides with the support of β. Furthermore, Rejchel et al. [32] demonstrated that the estimator β̂ is a consistent estimate of β*, and thus can be utilized to identify S. More details can be found in [32].
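As a concrete illustration, the rank-based lasso of (2) can be sketched in a few lines: the responses are replaced by their centered ranks, and an ordinary lasso solver is applied. This is a minimal sketch under our own illustrative choices (the tuning parameter value and scikit-learn's coordinate-descent `Lasso`), not the implementation of [32].

```python
# Minimal sketch of the rank-based lasso in (2): replace responses by their
# centered ranks, then run an ordinary lasso.  Solver and tuning parameter
# lam are illustrative choices, not those of [32].
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Lasso

def centered_ranks(y):
    """Map responses to centered ranks R_i/n - (n+1)/(2n); they sum to zero."""
    n = len(y)
    return rankdata(y) / n - (n + 1) / (2 * n)

def rank_lasso(X, y, lam):
    """Lasso on centered ranks (no intercept; features assumed centered)."""
    r = centered_ranks(y)
    return Lasso(alpha=lam, fit_intercept=False).fit(X, r).coef_

# Toy data from a single index model with a monotone link g(u) = exp(u):
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = 1.0
y = np.exp(X @ beta + 0.1 * rng.standard_normal(n))
b = rank_lasso(X, y, lam=0.05)   # the three true signals should be retained
```

Because the fit uses only the ranks of y, the same coefficients are obtained for any increasing transformation of the response, which is the source of the method's robustness.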
Omics-wide association analysis based on robust single index model
In real omics data analysis, the response variable Y often follows an unknown distribution, which may result in non-linear association relationships between Y and the linear combination of omics features X. To simultaneously account for unknown non-linear association relationships, an unknown response distribution, and unknown feature distributions, we employ model (1) to robustly model the omics data.
Based on the high-dimensional model (1) with a large number of features (p > n), we focus on the multiple testing problem (3) under the null hypotheses:
H_{0j}: β_j = 0, j = 1, …, p | (3) |
to identify the features associated with the response Y while controlling false positives. Because the previous section showed that the support of β* coincides with the support of β in model (1), the multiple testing problem (3) becomes
H_{0j}: β*_j = 0, j = 1, …, p | (4) |
The objective of this paper is to develop an FDR-controlled multiple testing procedure for the statistical inference problem (4) under the high-dimensional model (1) framework, in order to perform omics-wide association analysis.
Estimation methods for the parameter β*
For the statistical inference problem (4), this section introduces estimation methods for the true parameter β* under the high-dimensional model (1) with a large number of features (p > n) and the low-dimensional model (1) with a small number of features (p < n), respectively. For the high-dimensional scenario, we employ the rank-based lasso method in (2) to obtain the sparse parameter estimate β̂. Given the observed variables X and the transformed responses Ŷ, the optimization problem (2) can be solved using the R package glmnet.
For the low-dimensional scenario, we use the rank-based ordinary least squares (ROLS) method to estimate the parameters by solving the optimization problem defined in equation (5)
β̂ = arg min_b Σ_{i=1}^n (Ŷ_i − X_i^T b)^2 | (5) |
and obtain the parameter estimators
β̂ = (X^T X)^{-1} X^T Ŷ | (6) |
If there are strong correlations among the columns of X, causing the ROLS method to suffer from multicollinearity and become ineffective, we first use the variance inflation factor (VIF) to assess the presence of multicollinearity and then, based on the VIF values, eliminate the problematic features. Subsequently, we apply the ROLS method to the model built from the remaining features. In particular, if a continuous confounder causes the multicollinearity, a stratified approach can be considered: first categorize the continuous confounder into several, say four, categories; then fit four conditional rank-based regression models across the four confounder categories; and finally take the weighted average of the four parameter estimates from the conditional models as the final estimate. We believe this approach can remove most of the bias due to confounding. The properties of the estimators can be found in the "Theoretical properties" section below.
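In the low-dimensional case, the ROLS estimator of (5)-(6) amounts to ordinary least squares with the centered ranks of Y as pseudo-responses. The following sketch, with an illustrative cubic link and simulated data of our own choosing, shows the computation:

```python
# Hedged sketch of the rank-based OLS (ROLS) estimator: ordinary least
# squares with the centered ranks of Y as pseudo-responses.
import numpy as np
from scipy.stats import rankdata

def rols(X, y):
    n = len(y)
    r = rankdata(y) / n - (n + 1) / (2 * n)       # centered ranks of Y
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)  # (X^T X)^{-1} X^T r
    return coef

# Illustrative check: the monotone link g(u) = u^3 leaves the ranks of the
# response unchanged, so ROLS still points at the right feature.
rng = np.random.default_rng(1)
n = 300
X = rng.standard_normal((n, 2))
y = (X[:, 0] + 0.2 * rng.standard_normal(n)) ** 3
beta_hat = rols(X, y)   # beta_hat[0] clearly dominates beta_hat[1]
```

Note that `lstsq` is used instead of an explicit matrix inverse for numerical stability.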
Remark 1. In the following section, we use the SDA approach to develop an FDR-controlled feature selection method. The SDA approach requires that, under the null hypothesis, the distribution of the estimator obtained in the low-dimensional scenario be symmetric, or asymptotically symmetric, around zero. In the low-dimensional case, when the number of features is close to, or not sufficiently small compared with, the sample size, the thresholded Ranklasso method [32] should be used to obtain an estimator that is asymptotically symmetric around zero under the null. (Note that under the null hypothesis, the distributions of the regular lasso and the weighted Ranklasso estimators [32] are usually not asymptotically symmetric around zero.) However, in our simulation studies the number of features was relatively small across all scenarios, and when the thresholded Ranklasso method is used in low-dimensional cases its power is significantly lower than that of the ROLS method, because the thresholded Ranklasso performs feature selection again in the low-dimensional scenario and thereby reduces the number of true signals. Hence we adopt the ROLS method for parameter estimation in the low-dimensional scenario.
FDR controlled robust feature selection procedure
In this section, we apply the SDA approach [20] described in the Introduction to problem (4) for testing the parameter β*, and develop a corresponding FDR-controlled feature selection procedure. For the two independent sample parts produced in the first step of SDA, we employ the RLasso and ROLS methods to estimate the parameters, respectively. On the first part of the sample, we use the rank-based lasso to sparsely estimate the parameters and identify a smaller set of candidate omics features associated with the response; reducing the feature dimensionality alleviates the multiple testing burden in the third step of SDA. On the second part of the sample, we employ the rank-based ordinary least squares method to obtain a more precise estimate in the low-dimensional case. Finally, we use the estimates from the two sample parts to construct the FDR-controlled feature selection procedure.
More specifically, the proposed procedure is outlined as follows.
-
Step 1: Splitting samples.
Given the ratio γ, the sample is randomly split into two independent, disjoint parts with sample sizes n1 and n2, respectively, where n1 = ⌊γn⌋ and n2 = n − n1. Simulation studies in [20] verified that the setting γ = 1/2 is often the most powerful for the SDA approach, so we also use this ratio to split the data in the simulation studies and real data analyses.
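Step 1 can be sketched as follows, with the equal split γ = 1/2 as the default; the helper name is ours:

```python
# Sketch of Step 1: randomly split n sample indices into two disjoint parts
# of sizes n1 = floor(gamma * n) and n2 = n - n1.
import numpy as np

def split_sample(n, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n1 = int(gamma * n)
    return perm[:n1], perm[n1:]   # index sets for parts D1 and D2

idx1, idx2 = split_sample(10)
```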
-
Step 2: Selecting the candidate omics feature set using the first part sample .
The RLasso method is employed on the first part of the sample to obtain estimates of the parameters of the p features. The non-zero estimates are used to form the candidate feature set. Based on the first part of the sample, we use the candidate features to construct a low-dimensional single index model (7). The ROLS method is then applied to model (7) to obtain the estimates.
Step 3: Similarly to Step 2, first use the candidate features, based on the second part of the sample, to construct a low-dimensional single index model; then use this model to obtain the estimates.
-
Step 4: Constructing the test statistics under null for problem (4).
Under the null, we utilize the estimates from the two sample parts to construct the statistics T_{1,i} and T_{2,i}, with T_{2,i} = 0 for features outside the candidate set. The normalizing quantity is viewed as a scaling constant independent of the two estimates. From Theorem 1 (given in the Theory section below), the variance of the estimator is very complicated and difficult to compute directly. It is well known that bootstrap resampling can consistently estimate the variance of a variable with an unknown or complex distribution [40–42]; hence we employ bootstrap resampling to obtain the variance estimate.
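Because the variance in Theorem 1 is hard to compute analytically, the step above resorts to the bootstrap. A minimal sketch, our own illustrative code reusing the ROLS estimator, is:

```python
# Bootstrap estimate of the standard deviation of the ROLS coefficients:
# resample the rows with replacement, refit, and take the empirical sd.
import numpy as np
from scipy.stats import rankdata

def rols(X, y):
    n = len(y)
    r = rankdata(y) / n - (n + 1) / (2 * n)       # centered ranks
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)
    return coef

def bootstrap_sd(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        reps[b] = rols(X[idx], y[idx])
    return reps.std(axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 3))
y = np.exp(X[:, 0] + 0.3 * rng.standard_normal(150))
se = bootstrap_sd(X, y)
```

The number of bootstrap replicates (200 here) is an illustrative choice; `rankdata` handles the ties created by resampling via average ranks.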
-
Step 5: Aggregating the test statistics.
Aggregate the two statistics obtained in Step 4 to form a new ranking statistic W_j that is symmetric around zero under the null hypothesis.
Remark 2: Because the unthresholded estimator is asymptotically normal, the statistic T_{k,i}, k = 1, 2, is asymptotically normal with mean 0 under the null. This shows that W_j is symmetric around zero. Intuitively, large positive W_j values indicate strong evidence against the null hypothesis, while negative W_j values most likely correspond to null cases.
-
Step 6: Choosing the threshold.
Given the nominal level α, a threshold L is chosen by exploiting the symmetry between the positive and negative null statistics in order to control the false discovery rate (FDR) at level α, where |·| denotes the number of elements in a set. The rejected features are those with W_j ≥ L.
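Steps 5 and 6 can be sketched jointly. Following our reading of the SDA approach [20], we assume here that the aggregation is the product W_j = T_{1,j} · T_{2,j} and that the threshold is the smallest t at which the estimated false discovery proportion #{j : W_j ≤ −t} / max(#{j : W_j ≥ t}, 1) drops below α; both choices are illustrative, not a transcript of the authors' code.

```python
# Sketch of Steps 5-6: aggregate the two statistics by their product
# (symmetric about 0 under the null) and scan thresholds to control FDR.
import numpy as np

def sda_select(T1, T2, alpha=0.1):
    W = T1 * T2
    for t in np.sort(np.abs(W[W != 0])):            # candidate thresholds
        fdp_hat = np.sum(W <= -t) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:                         # estimated FDP <= alpha
            return np.flatnonzero(W >= t)            # selected features
    return np.array([], dtype=int)

# Three strong signals agree in sign across the two splits; nulls are small
# and carry random signs, so their products are symmetric about zero.
T1 = np.array([5.0, 4.0, 3.0, 0.1, -0.1, 0.2, -0.2])
T2 = np.array([5.0, 4.0, 3.0, -0.1, 0.1, 0.2, -0.2])
sel = sda_select(T1, T2, alpha=0.1)
```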
-
Step 7: Robustly selecting the discovery set.
The goal of this step is to further stabilize the selection result, which would otherwise vary substantially across different data splits because randomly splitting the data into two halves inflates the variances of the estimated coefficients. We therefore propose the following procedure to robustly select features.
Suppose we repeat the above six-step data-splitting procedure B times independently and record the set of selected features from each run. For the j-th feature, we define the empirical inclusion rate as:
Sort the features by their empirical inclusion rates in increasing order and select the top-ranked features, where the number selected equals the median size of the selected feature sets over the B runs. In general, a larger B yields more stable feature selection. In our simulation studies, the performance of the proposed procedure with B = 15 was very similar to that with larger values (B > 15); therefore, to save running time, we set B = 15 for both the simulations and the real data analyses.
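Step 7 can be sketched as follows; the helper is ours, and the tie-breaking in `argsort` is arbitrary:

```python
# Sketch of Step 7: empirical inclusion rates over B splits, keeping the m
# features with the highest rates, where m = median selected-set size.
import numpy as np

def stabilize(selected_sets, p):
    B = len(selected_sets)
    inc = np.zeros(p)                                # inclusion rates I_j
    for S in selected_sets:
        inc[list(S)] += 1.0 / B
    m = int(np.median([len(S) for S in selected_sets]))
    order = np.argsort(inc)                          # increasing, ties arbitrary
    return set(order[p - m:].tolist()) if m > 0 else set()

# Features 0 and 1 appear in every split; feature 2 appears only once, so it
# is dropped when the median selected-set size is 2.
stable = stabilize([{0, 1}, {0, 1, 2}, {0, 1}], p=5)
```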
To facilitate comprehension of the above procedure, a brief outline is provided below.
Randomly splitting the entire sample into two independent parts.
Using the first part of the sample, we employ Ranklasso to obtain the parameter estimates and use them to identify a smaller set of candidate omics features associated with the response; this is the selected candidate feature set. The goal of reducing the feature dimensionality is to achieve more accurate parameter estimation in a low-dimensional setting; the well-behaved estimators are then used to construct the test statistics.
Based on the first part of the sample, we apply the ROLS method to the low-dimensional single index model (built on the smaller candidate feature set) to obtain more precise estimates.
Based on the second part of the sample, we apply the ROLS method to the corresponding low-dimensional single index model to obtain more precise estimates.
The estimators obtained from the two independent sample parts are utilized to construct statistics assessing the evidence of each regression coefficient against the null.
Aggregating the two statistics to form a new ranking statistic that is symmetric about zero under the null.
Choosing a threshold along the ranking by exploiting the symmetry about zero property between positive and negative null statistics to control the FDR.
Stabilizing the selection results by multi-splitting procedure.
Remark 3. The proposed procedure is denoted SIM-FDR. As seen above, the FDR control of SIM-FDR does not rely on p-values; it requires only that the distribution of the aggregated statistic Wj be symmetric about zero under the null hypothesis. In fact, SIM-FDR does not require the distribution of Wj to be normal or asymptotically normal, only symmetric about zero. Thus, SIM-FDR may not depend heavily on the sample size. The simulation results in small-sample settings verify this point, indicating that SIM-FDR is more robust than its competitors across a wide range of scenarios, since asymptotic symmetry is much easier to achieve in practice than asymptotic normality.
Theoretical properties
This section discusses the finite-sample and asymptotic FDR control properties of the proposed SIM-FDR. Given the intuitive nature of Step 7 of SIM-FDR, we focus solely on proving the FDR control properties of the preceding six-step SDA procedure. Before presenting the theoretical results, we first provide the assumptions and definitions. The proofs of all theorems are provided in S1 File (Supporting information). Recall the definition of the true parameter β* given in the "Review" section above.
Required basic assumptions
Assumption 1. Assume that the samples (Xi, Yi), i = 1, …, n, are independent and identically distributed, with Xi denoting the predictor vector of the i-th sample; that the distribution of Xi is absolutely continuous; and that the noise variable is independent of Xi.
Assumption 2. We assume that for each , the conditional expectation exists and for a real number .
Assumption 3. We assume that the cumulative distribution function F of the response variable Yi is increasing and that g in model (1) is increasing with respect to its first argument.
Assumption 4. Let p0 = |S| denote the number of elements in the support set S. We suppose that the significant predictors are sub-gaussian with the coefficient , i.e. for each we have where denotes the sub-matrix of on the column indices from S. Moreover, the irrelevant predictors are univariate sub-gaussian, i.e. for each and , we have for positive numbers Finally, we denote
Remark 4. No other assumptions are made on the distribution of the noise variable ε. Assumption 2 is a standard condition in the single index model literature and can be found in [32]. Assumption 3 meets the needs of the rank-based approach. Assumption 4 imposes a regular sub-gaussian condition on the feature matrix X.
Considering model (1), Rejchel et al. [32] show that under Assumptions 1 and 2, β* is proportional to β with a positive proportionality constant, and that under Assumption 3 the signs of β coincide with the signs of β*. This result indicates that the support of β in model (1) coincides with that of the rank-based true coefficient β*. These conclusions imply that we can perform feature selection for model (1) using estimates of β*.
Definitions of the cone invertibility factor (CIF)
In our article, the validity of SIM-FDR relies on the consistency of parameter estimation for the Rank-lasso problem (2), which in turn supports the feature screening property of the selection based on (2). In the high-dimensional setting, Rejchel et al. [32] demonstrated that ensuring the consistency of the estimators produced by the Rank-lasso penalty requires a cone invertibility factor (CIF) condition on the feature matrix X.
Let δS and δSc be the restrictions of a vector δ to the indices in S and in its complement, respectively. Now we consider a cone
Define the population version of CIF to be
| (8) |
for a sharp formulation of convergence results for all lq norms with q ≥ 1.
FDR control
Before presenting the theorems on finite-sample and asymptotic FDR control of the proposed procedure, we first establish the asymptotic symmetry around zero of the test statistics Wj and the feature screening property of the candidate feature selection obtained by the RLasso method in Step 2 of the proposed procedure. These results are essential for demonstrating FDR control.
Asymptotic symmetry around zero of the statistics W.
We prove that the statistic Wj is asymptotically symmetric around 0 under the null hypothesis. Clearly, the distribution of Wj depends on that of the estimator obtained in the low-dimensional model (1) scenario.
Theorem 1. Suppose that Assumptions 1, 2, 3 and 4 are satisfied, with Xi denoting the i-th predictor vector, and that the covariance matrix of the features X is positive definite. Then the following conclusions hold for the OLS estimator of β* in the low-dimensional model scenario,
where the scaling term involves the j-th diagonal element of the matrix, and D is defined in Lemma 3 of the supplementary materials (Supporting information). Furthermore, under the null hypothesis, the statistic Wj is asymptotically symmetric around zero.
Sure screening property for the candidate feature selection result.
In this section, we prove the sure screening property of the candidate feature selection produced by the RLasso method in Step 2 of the proposed procedure in the high-dimensional scenario. Define the estimated feature index set as
Theorem 2. Consider problem (2) and let be a fixed sequence such that , and be arbitrary. Suppose that Assumptions 1, 2, 3 and 4 are satisfied. Moreover, suppose that
| (9) |
and
| (10) |
where the constants are universal and the quantity is the smallest eigenvalue of the correlation matrix of the true predictors. Under the beta-min condition, we have
Theorem 2 indicates that when the sample size is large, the estimated relevant feature index set contains the true relevant feature index set S; that is, the sure screening property holds.
Finite-sample FDR control.
Theorem 3. Suppose the proposed model (1) satisfies all the assumptions given in the above section, and assume the statistics are well-defined. For any nominal level, the FDR of SIM-FDR satisfies
where , and .
This theorem holds regardless of the unknown relationship between the features X and the response Y. The quantity measures the effect on the FDR of both the asymmetry of Wj and the dependence between Wj and W−j.
Asymptotic FDR control.
Following the proof of asymptotic FDR control in [20], we need six technical assumptions for the asymptotic FDR control of the proposed SIM-FDR method: the sure screening property of the candidate feature selection in Step 2 of SIM-FDR, moment conditions, feature matrix conditions, estimation accuracy of the estimators of β*, signal strength, and dependence among the statistics. Under these assumptions, it is straightforward to follow their proofs to show that our method controls the FDR asymptotically. In particular, the moment, feature matrix, signal strength, and dependence assumptions can be specified or designed; the "estimation accuracy" assumption follows from the estimation consistency provided by Theorem 1 (since an estimator with an asymptotic normality property must be consistent); and the "sure screening" assumption is ensured by Theorem 2. Thus asymptotic FDR control follows readily.
Simulation analysis results
To evaluate the feature selection performance of the proposed method (SIM-FDR), we consider sample sizes n = 250 and n = 100 for the moderate and small sample size scenarios, respectively, and set the number of omics features to p = 400. All simulation settings are replicated 100 times.
Competing methods
Five methods are considered for comparisons with SIM-FDR.
A marginal method that tests one omics feature at a time, followed by the Benjamini–Hochberg (BH) correction [15], denoted the "BH" method.
The original model-X knockoff FDR-controlled feature selection method [16], denoted the "MXKF" method. MXKF is based on a high-dimensional joint linear regression model and analyzes continuous response data. It can be implemented using the R package knockoffs.
The regular rank-lasso method defined in (2), with the tuning parameter selected by cross-validation, denoted the "Rlasso-cv" method.
-
Following the work of Rejchel et al. [32], the adaptive rank-lasso method is defined as the following formula (11),
(11) with and , and with weights defined by (12). If , then the j-th explanatory variable is removed from the list of predictors before running the adaptive rank-lasso (11). This method is denoted "Rlasso-adaptive".
Following the work of Rejchel et al. [32], the threshold rank-lasso is defined as follows: the tuning parameter for the rank-lasso is selected by cross-validation, and the threshold is chosen so that the number of selected predictors coincides with the number selected by the adaptive rank-lasso above. This method is denoted "Rlasso-threshold".
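To make the marginal competitor concrete, the BH step-up procedure applied to marginal p-values can be sketched as follows. This is a generic illustration, not the authors' code; the p-values would come from whatever marginal association test is used.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure: return the indices of
    features declared significant at nominal FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    if not passed.any():
        return np.array([], dtype=int)
    k = np.nonzero(passed)[0].max()   # largest rank whose p-value passes
    return np.sort(order[:k + 1])     # reject all hypotheses up to rank k
```

For example, `benjamini_hochberg([0.001, 0.02, 0.8, 0.04], q=0.1)` rejects the three small p-values.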
Generating omics features
We simulate the feature matrix X from the multivariate normal distribution with and . To evaluate the performance of SIM-FDR under dependency among the features, we set the covariance matrix to have the following structure:
for , where we set for a moderate correlation level among the omics features and for .
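The feature-generation step can be sketched as below. The exact matrix entries are not reproduced in the text, so the AR(1)-type form Sigma[j, k] = rho**|j − k| used here is an illustrative assumption.

```python
import numpy as np

def simulate_features(n, p, rho, seed=0):
    """Draw an n x p feature matrix from N(0, Sigma).
    Sigma[j, k] = rho ** |j - k| is an AR(1)-type correlation,
    used here as an illustrative choice of dependence structure."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

# Moderate-correlation setting with the simulation dimensions of the paper.
X = simulate_features(n=250, p=400, rho=0.5)
```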
Simulating the response
We first design the regression coefficients and then use them to generate the response. Let the location vector of nonzero values in β be
Then we set , which denotes the nonzero values vector of β, to be
After generating β, we employ the following six different types of models to simulate the outcome Y: the linear regression model simulates linear association between the response and the omics features, the single index model simulates nonlinear association, and the Cauchy distribution simulates heavy-tailed random errors.
-
Model 1
Linear regression model setting with normally distributed random error: , where the error term ε is independent of X and generated from the normal distribution with mean 0 and variance γ.
-
Model 2
Linear regression model setting with Cauchy-distributed random error: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 3
Single index model with normally distributed random error: , where the error term ε is independent of X and generated from the normal distribution with mean 0 and scale parameter γ.
-
Model 4
Single index model setting with Cauchy-distributed random error: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 5: Double-index model.
Let the location vector of nonzero values in the -dimensional be and the location vector of nonzero values in the -dimensional be . Then we set and to be and , respectively. The double-index model setting with Cauchy-distributed random error is considered: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 6: Multi-index model.
Let the location vector of nonzero values in the -dimensional be , the location vector of nonzero values in the -dimensional be , and the location vector of nonzero values in the -dimensional be . Then we set , and to be , and , respectively. The multi-index model setting with Cauchy-distributed random error is considered: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
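As a minimal sketch of the response-generation step, the code below simulates Y from a single index model with either normal or heavy-tailed Cauchy errors. The link function `np.exp` and the coefficient pattern passed in are illustrative assumptions, since the paper's exact links and nonzero coefficient values are not reproduced here.

```python
import numpy as np

def simulate_response(X, beta, link=np.exp, error="cauchy", gamma=1.0, seed=1):
    """Simulate Y = g(X beta) + eps for a monotone link g.
    error='normal' uses N(0, gamma) noise (variance gamma);
    error='cauchy' uses Cauchy noise with scale gamma, the
    heavy-tailed setting of models 2 and 4."""
    rng = np.random.default_rng(seed)
    eta = X @ beta
    if error == "cauchy":
        eps = gamma * rng.standard_cauchy(size=eta.shape[0])
    else:
        eps = rng.normal(0.0, np.sqrt(gamma), size=eta.shape[0])
    return link(eta) + eps
```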
To consider different strengths of association between the features and the response, we vary the signal-to-noise ratio (SNR), defined as for models 1 and 2; the scale or variance parameter for models 3, 4, 5 and 6 is set by the formula .
Methods settings and comparison measurements
The MXKF method requires knowledge of the complete conditional distribution of X, and no algorithm can generate model-X knockoffs efficiently for general distributions [18]. Therefore, we utilize the default design used previously [17] in this simulation. For the SIM-FDR method, the optimal λ used in the rank-based lasso is determined through 10-fold cross-validation. For the BH method, we test the association between the outcome and each omics feature marginally and apply the BH procedure to these marginal p-values to identify significant features. In addition, we set and B = 10 for SIM-FDR in the simulation analysis.
Given nominal FDR levels , based on 100 simulated data sets, we use empirical FDR and empirical Power, defined as
to measure the feature selection performance of different methods. In addition, the Matthews correlation coefficient (MCC) is employed to evaluate the results of feature selection. MCC measures the overall accuracy of selection for true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP), with a larger value indicating overall better selection [43]. The definition of empirical MCC is
with
where Ij = 0 indicates that the j-th omics feature is not associated with the response and indicates that the j-th omics feature is truly associated with the response, denotes the index set of omics features truly associated with the response, denotes the number of such features, and Si denotes the index set of omics features selected using the i-th data set.
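A minimal sketch of how these per-replicate quantities can be computed (our own helper, written directly from the definitions above):

```python
import numpy as np

def selection_metrics(selected, true_set, p):
    """Empirical FDR, power, and MCC for one replicate, computed
    from the selected index set, the true index set, and the total
    number of features p. Averaging over replicates gives the
    empirical FDR/power/MCC reported in the figures."""
    selected, true_set = set(selected), set(true_set)
    tp = len(selected & true_set)
    fp = len(selected - true_set)
    fn = len(true_set - selected)
    tn = p - tp - fp - fn
    fdr = fp / max(len(selected), 1)    # convention: FDR = 0 if nothing selected
    power = tp / max(len(true_set), 1)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return fdr, power, mcc
```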
Results for moderate sample size scenario (n = 250)
The FDR performance of all the methods is similar for models 1 and 2, both the FDR and power performance of all the methods are nearly the same for models 3 and 4, and likewise for models 5 and 6. Consequently, we present the analysis results grouped as models 1 and 2, models 3 and 4, and models 5 and 6.
Results for models 1 and 2.
-
FDR performance.
For models 1 and 2, as shown in Figs 1 and 2, the proposed SIM-FDR method effectively controls the actual FDRs at the specified levels for all simulation scenarios. The BH and Rlasso-cv methods exhibit significantly higher actual FDRs across all simulation scenarios and fail to control the FDR as expected. The MXKF method has significantly or slightly higher actual FDRs than the specified FDR levels for all simulation scenarios. The actual FDRs of both the Rlasso-adaptive and Rlasso-threshold methods are nearly zero.
-
Power performance.
For model 1, the MXKF method demonstrates nearly identical power to our SIM-FDR method, but at the expense of higher actual FDRs. For model 2, with Cauchy-distributed errors, SIM-FDR substantially outperforms the MXKF method, with power improvements approaching 0.36 in some scenarios, and significantly dominates the Rlasso-adaptive and Rlasso-threshold methods, with power improvements approaching 0.62 in some scenarios. The BH and Rlasso-cv methods have excessively high actual FDRs for these two models, rendering their power comparisons inconsequential.
Fig 1. Results for model 1 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 2. Results for model 2 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 3 and 4.
From Figs 3 and 4, the results for models 3 and 4 demonstrate the same performance for all the methods, so we present their results together. For models 3 and 4, the SIM-FDR, Rlasso-adaptive and Rlasso-threshold methods effectively control the actual FDRs at the specified levels across all scenarios. In contrast, the MXKF and Rlasso-cv methods exhibit much higher actual FDRs in almost all scenarios and do not achieve the expected level of FDR control. In addition, SIM-FDR demonstrates significantly better power than the MXKF, Rlasso-adaptive and Rlasso-threshold methods across all scenarios, with power improvements that are substantial and approach 0.7 in some scenarios.
Fig 3. Results for model 3 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 4. Results for model 4 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 5 and 6.
From Figs 5 and 6, the results for models 5 and 6 demonstrate that the SIM-FDR, Rlasso-adaptive and Rlasso-threshold methods effectively control the actual FDRs at the specified levels across all scenarios. The BH method controls the FDR only in some scenarios, while the MXKF and Rlasso-cv methods have much higher actual FDRs in all scenarios and cannot achieve the expected level of FDR control. In terms of power, SIM-FDR significantly outperforms the MXKF, BH, Rlasso-adaptive and Rlasso-threshold methods across all scenarios, with power improvements that are substantial and approach 0.62 in some simulation scenarios. Although the power of Rlasso-cv is the highest, its lack of effective FDR control renders its performance inconsequential.
Fig 5. Results for model 5 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 6. Results for model 6 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for small sample size scenario (n = 100)
The performance of all the methods is similar within models 1 and 2 and within models 3 and 4, so we present the results for these two pairs of models in turn.
Results for models 1 and 2.
As shown in Figs 7 and 8, the SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual FDRs at the specified levels for all simulation scenarios. However, the BH and MXKF methods exhibit significantly higher actual FDRs for most simulation scenarios and fail to control the actual FDRs at the desired levels. Regarding these two models, as illustrated in Figs 7 and 8, the MXKF method shows slightly higher power than SIM-FDR, albeit at the cost of higher actual FDRs. In terms of power performance, our method SIM-FDR consistently outperforms the Rlasso-adaptive and Rlasso-threshold methods across all simulation scenarios.
Fig 7. Results for model 1 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 8. Results for model 2 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 3 and 4.
From Figs 9 and 10, it can be observed that for models 3 and 4, the SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual false discovery rates (FDRs) to the specified levels in all simulation scenarios. The Benjamini-Hochberg (BH) method shows success in controlling FDR for certain scenarios, while the MXKF method exhibits significantly higher actual FDRs and lacks the ability to control FDRs in all simulation scenarios. In terms of power performance, SIM-FDR demonstrates superiority over the BH, MXKF, Rlasso-adaptive, and Rlasso-threshold methods across all simulation scenarios, with a substantial power improvement of approximately 0.30 in some scenarios.
Fig 9. Results for model 3 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 10. Results for model 4 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for model 5.
Since the statistical power of most methods is close to zero in scenarios with small sample sizes for model 6, we will only focus on the results of model 5. From Fig 11, the results show that BH, SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual false discovery rates (FDRs) to the specified levels across all scenarios in the small sample simulation scenario. However, the MXKF and Rlasso-cv methods have much higher actual FDRs for all scenarios and fail to achieve the expected level of FDR control. In terms of power performance, SIM-FDR outperforms MXKF, BH, Rlasso-adaptive, and Rlasso-threshold methods across all simulation scenarios.
Fig 11. Results for model 5 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
The simulation results evaluated via the Matthews correlation coefficient (MCC).
All MCC results are presented in Figs 12–22. Notably, Figs 12, 13, 14, 15, 21, and 22 demonstrate that our method, SIM-FDR, achieves the highest MCC values across 36 scenarios under the moderate sample size, indicating superior feature selection performance for moderate to large sample sizes. For the small sample size, SIM-FDR consistently outperforms competing methods in the 12 scenarios depicted in Figs 18 and 20, while in Figs 16, 17, and 19 it dominates or significantly outperforms the other methods in 10 out of 18 scenarios and falls behind the MXKF method only in the remaining 8 scenarios. In summary, our method outperforms the other methods in 58 of the total 66 scenarios and ranks second in the remaining 8. However, as shown in Figs 7, 8, and 10, MXKF exhibits substantially higher actual FDRs in these 8 scenarios, failing to control false discoveries and yielding excessive false positives. While MCC serves as a comprehensive evaluation metric, this paper primarily focuses on the FDR and power performance of all the methods. Consequently, MXKF's marginally better MCC in the remaining 8 scenarios may lack practical significance for real-world data analysis.
Fig 12. Results for model 1 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 13. Results for model 2 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 14. Results for model 3 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 15. Results for model 4 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 16. Results for model 1 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 17. Results for model 2 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 18. Results for model 3 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 19. Results for model 4 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 20. Results for model 5 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 21. Results for model 5 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 22. Results for model 6 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Conclusions of simulating analysis results.
Across almost all scenarios, the actual FDRs of SIM-FDR are far below those of the other methods, so its higher power does not come at the price of an inflated FDR. In summary, compared to the SIM-FDR method, the other existing methods are either underpowered or yield unreliable results with FDRs inflated above the nominal threshold. In particular, the proposed method achieves better feature selection performance for moderate to large sample sizes. These results indicate that the proposed SIM-FDR exhibits robust performance across all scenarios.
Real data analysis results
Ocean microbiome data
Integrative marine data collection efforts such as Tara Oceans [44] or the Simons CMAP provide the means to investigate ocean ecosystems on a global scale. This data set contains p = 35651 miTAG OTUs [45] observed on n = 136 samples. Using Tara's environmental and microbial survey of ocean surface water [45], we apply all the methods to identify miTAG OTUs (omics features) associated with environmental covariates. In particular, salinity is thought to be an important environmental factor in marine microbial ecosystems, so we aim to identify the miTAG OTUs most robustly associated with the response of interest, marine salinity.
Before applying the methods, we conducted a series of preprocessing steps to make the Tara data more amenable to the proposed method. First, following the work of Sunagawa et al. [46], we calculated the read sum of all 35651 miTAG OTUs (omics features) and removed low-abundance OTUs with a read sum of less than 10000 reads per sample. We further retained OTUs that appeared in at least 14 samples, resulting in a new OTU matrix of dimension n = 136 and p = 1015. Second, we normalized the OTU raw read counts into compositional data, with the entries of each row summing to one. Third, we log-transformed the compositional data and took the log-transformed data as the omics features. Following the simulated data analysis, we used the same settings for SIM-FDR but set B = 30 to stabilize its results. We varied the nominal FDR levels from 0 to 0.20 for the real data analysis. The collected raw Tara data set and the preprocessed Tara data set are available in S2 Dataset and S3 Dataset (Supporting information section), respectively.
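The three preprocessing steps can be sketched as follows. The exact abundance and prevalence filtering rule and the handling of zero counts before the log transform are not fully specified in the text, so the thresholds and the pseudocount below are assumptions.

```python
import numpy as np

def preprocess_counts(counts, min_read_sum=10000, min_prevalence=14, pseudo=1.0):
    """Sketch of the Tara preprocessing: (1) drop OTUs whose total
    read sum is below min_read_sum or that appear in fewer than
    min_prevalence samples, (2) normalize each sample (row) to
    relative abundances, (3) log-transform. A pseudocount is added
    before normalizing to avoid log(0) -- an assumption, since the
    paper does not state how zeros are handled."""
    counts = np.asarray(counts, dtype=float)
    keep = (counts.sum(axis=0) >= min_read_sum) & \
           ((counts > 0).sum(axis=0) >= min_prevalence)
    kept = counts[:, keep] + pseudo
    comp = kept / kept.sum(axis=1, keepdims=True)  # rows sum to one
    return np.log(comp)
```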
The results are presented in Table 1. The numbers of taxa identified by the BH and Rlasso-cv methods exceed those of the SIM-FDR, MXKF, Rlasso-threshold and Rlasso-adaptive methods, and MXKF did not identify any taxa at any FDR level. This may align with the simulation results from many scenarios; for example, models 1 and 2 in the small sample scenario (Figs 7 and 8) are comparable, given the sample size of n = 136 and the number of genes p = 1015. Figs 7 and 8 show that the BH and Rlasso-cv methods exhibit higher actual FDRs and fail to control the FDR at the given nominal levels. Therefore, the results in Table 1 may indicate that the BH and Rlasso-cv methods yield more false discoveries, while the SIM-FDR, Rlasso-threshold and Rlasso-adaptive methods may provide fewer but more precise taxa selection results.
Table 1. The number of selected taxa by all the methods under different nominal FDR levels.
| FDR level | 0 | 0.02 | 0.04 | 0.07 | 0.09 | 0.11 | 0.13 | 0.16 | 0.18 | 0.20 |
|---|---|---|---|---|---|---|---|---|---|---|
| BH | 0 | 371 | 444 | 507 | 558 | 588 | 620 | 648 | 671 | 687 |
| MXKF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SIM-FDR | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 6 |
| Rlasso-cv | 0 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 |
| Rlasso-threshold | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Rlasso-adaptive | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
However, we need to further validate that SIM-FDR yields more precise taxon selection than the BH and Rlasso-cv methods. To investigate the feature selection performance of our proposed method, we introduce simulated variables as inactive (non-associated) taxa. Specifically, 500 noise taxa Z1 are drawn from N(0,1) and another 500 noise taxa Z2 are drawn from t(3), all independently and randomly generated. We denote this dataset Tara-Noise; the goal is to identify the important taxa among all 2015 taxa (1015 real taxa and 1000 noise taxa). The noise taxa Z1 and Z2 can be viewed as false taxa that are not associated with the response. At FDR level 0.1, the numbers of noise taxa falsely selected by the methods are shown in Table 2. Both the BH and Rlasso-cv methods mistakenly selected a number of noise taxa, while our method SIM-FDR selected none. This indicates that these two methods indeed produce more false discoveries, whereas our method achieves higher precision in detecting findings.
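The Tara-Noise spike-in construction can be sketched as follows (our own illustration of the design described above):

```python
import numpy as np

def add_noise_features(X, n_noise=1000, df=3, seed=0):
    """Append known-null columns to the real feature matrix:
    half drawn from N(0,1) and half from a Student-t with df
    degrees of freedom, as in the Tara-Noise check. Any appended
    column that a method selects is a known false discovery."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z1 = rng.normal(size=(n, n_noise // 2))
    z2 = rng.standard_t(df, size=(n, n_noise - n_noise // 2))
    return np.hstack([X, z1, z2])
```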
Table 2. The number of selected noise taxa among the 1000 noise taxa using the Tara-Noise data, given the nominal FDR level (0.1).
| Method | the number of selected noise taxa |
|---|---|
| BH | 30 |
| MXKF | 0 |
| SIM-FDR | 0 |
| Rlasso-cv | 2 |
| Rlasso-threshold | 0 |
| Rlasso-adaptive | 0 |
At the nominal FDR level 0.20, the SIM-FDR method identified six taxa associated with the ocean salinity gradients. From Tables 3 and 4, OTU197, OTU741 and OTU2043 come from the class Alphaproteobacteria, OTU1473 comes from the class Deltaproteobacteria, and OTU1439 and OTU520 come from the class Gammaproteobacteria. For the Tara data, Bien et al. [47] proposed a tree-aggregated predictive model and also used their method to conduct taxa selection; however, their selection result differs substantially from that of SIM-FDR. Thus, our results may offer a new perspective on the Tara data.
Table 3. Taxonomic information (kingdom to family) of the six taxa selected by SIM-FDR at the FDR level 0.20.
| Taxa Name | Rank | Kingdom | Phylum | Class | Order | Family |
|---|---|---|---|---|---|---|
| OTU197 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | SAR11 clade | Surface |
| OTU741 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | SAR11 clade | Surface |
| OTU2043 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | Rickettsiales | S25-593 |
| OTU1473 | Life | Bacteria | Proteobacteria | Deltaproteobacteria | Desulfuromonadales | GR-WP33-58 |
| OTU1439 | Life | Bacteria | Proteobacteria | Gammaproteobacteria | KI89A clade | f__134 |
| OTU520 | Life | Bacteria | Proteobacteria | Gammaproteobacteria | E01-9C-26 marine group | f__69 |
Table 4. Genus and species information of the six taxa selected by SIM-FDR at the FDR level 0.20.
| Taxa Name | Genus | Species |
|---|---|---|
| OTU197 | g__118 | AY664083.1.1206 |
| OTU741 | g__409 | EU801445.1.1438 |
| OTU2043 | g__681 | JN166192.1.1464 |
| OTU1473 | g__608 | EF574438.1.1503 |
| OTU1439 | g__601 | FR683972.1.1501 |
| OTU520 | g__304 | JF747664.1.1516 |
Head and neck squamous cell carcinoma data
Head and neck squamous cell carcinoma (HNSCC) is a prevalent and prognostically challenging cancer globally [48]. Since the release of the TCGA-HNSC dataset in 2015, over 1,000 related articles have been published. The original data include a total of 18,409 gene expression values. A prescreening step using marginal Cox models selected the top 2,000 genes with the smallest p-values for downstream analysis. The preprocessed HNSCC dataset, which contains 2,000 gene expression values, the logarithm of survival time, and a censoring indicator, can be downloaded from TCGA Provisional using the R packages cgdsr or GEInter; it is also available in S4 Dataset (Supporting information section).
Here, our objective is to identify potential genes associated with the survival time of HNSCC patients. The results of this analysis help elucidate the molecular mechanisms underlying the occurrence and progression of HNSCC and hold significant implications for future treatment strategies. The preprocessed HNSCC dataset comprises 484 samples with a censoring ratio of approximately 58%, meaning that the true survival times of 58% of the samples are unobserved and 42% are observed. To render this dataset suitable for our method, we used only the samples with observed survival times, resulting in 204 samples for further analysis. We took the 2000 genes as feature variables and employed the original survival times, prior to logarithmic transformation, as the response variable. We then applied all the methods for gene selection on this dataset. Following the simulated data analysis, we used the same settings for SIM-FDR but set B = 30 to stabilize its results. We varied the nominal FDR levels from 0 to 0.20 for this real data analysis.
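The sample-filtering step can be sketched as below; the convention that the censoring indicator equals 1 for an observed event is an assumption, as the TCGA coding is not restated in the text.

```python
import numpy as np

def keep_observed(X, time, censor, observed_code=1):
    """Keep only samples whose survival time is observed, i.e.
    whose censoring indicator equals observed_code (assumed to be
    1 here; check the dataset's coding before use)."""
    mask = np.asarray(censor) == observed_code
    return np.asarray(X)[mask], np.asarray(time)[mask]
```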
The results are presented in Table 5. The number of genes identified by the BH and Rlasso-cv methods exceeds that identified by SIM-FDR, while the MXKF, Rlasso-threshold, and Rlasso-adaptive methods did not identify any genes at any FDR level. This may align with the simulation results from various scenarios, such as models 1 and 2 under moderate sample conditions (Figs 1 and 2), given the sample size of n = 204 and the gene count of p = 2000. Figs 1 and 2 illustrate that the BH and Rlasso-cv methods exhibit higher actual FDRs and fail to control the FDR at the specified nominal levels. Therefore, the results in Table 5 may indicate that the BH and Rlasso-cv methods yield a greater number of false discoveries, whereas the SIM-FDR method may provide fewer but more accurate gene selection results.
Table 5. The number of selected genes by all the methods under different nominal FDR levels.
| FDR level | 0 | 0.02 | 0.04 | 0.07 | 0.09 | 0.11 | 0.13 | 0.16 | 0.18 | 0.20 |
|---|---|---|---|---|---|---|---|---|---|---|
| BH | 0 | 265 | 308 | 360 | 399 | 427 | 463 | 500 | 540 | 592 |
| MXKF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SIM-FDR | 0 | 3 | 3 | 3 | 3 | 3 | 7 | 7 | 7 | 7 |
| Rlasso-cv | 0 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 |
| Rlasso-threshold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Rlasso-adaptive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Next, we introduced simulated feature variables as non-associated genes to further investigate the feature selection performance of all the methods. Specifically, 500 noise genes Z1 with 204 samples were drawn from the normal distribution N(−1,2) and another 500 noise genes Z2 with 204 samples were drawn from the normal distribution N(1,2), all independently and randomly generated. The genes Z1 and Z2 can be viewed as noise genes that are not associated with the survival time response. We denote this dataset HNSCC-Noise; the goal is to identify the important genes among all 3000 genes. We then applied all the methods to the HNSCC-Noise dataset. The numbers of discovered noise genes are listed in Table 6. The BH and Rlasso-cv methods mistakenly selected a number of noise genes, while our method SIM-FDR selected none. This indicates that these two methods indeed produce more false discoveries, whereas our method achieves higher precision in detecting findings.
Table 6. The number of selected noise genes among the 1000 noise genes using the HNSCC-Noise data, given the nominal FDR level ().
| Method | the number of selected noise genes |
|---|---|
| BH | 19 |
| MXKF | 0 |
| SIM-FDR | 0 |
| Rlasso-cv | 2 |
| Rlasso-threshold | 0 |
| Rlasso-adaptive | 0 |
Finally, at the nominal FDR level 0.2, Table 7 presents the seven genes detected by SIM-FDR: SNX14, BICRAL, KIR3DL1, GRAPL, SEMA4D, IL20RA, and POLDIP2. Notably, none of these genes has been reported in the existing literature. Our findings may provide new insights into the HNSCC data, as these genes could enhance our understanding of the carcinogenic mechanisms associated with the cell cycle in HNSCC and may serve as potential biomarkers for HNSCC survival and treatment.
Table 7. The names of selected genes by SIM-FDR at the FDR level 0.2.
| Genes’ Names |
|---|
| SNX14, BICRAL, KIR3DL1, GRAPL, SEMA4D, IL20RA, POLDIP2 |
Discussion and conclusion
In this paper, we first employ a more general single index model to fit omics data. This model is robust and can account for nonlinear associations with unknown distributional random errors and features. Next, based on this model, we develop an effective FDR control procedure for feature selection in high-dimensional single index models. We further apply this procedure to fine-map omics features while controlling the false discovery rate of selection. The results from simulated data indicate that when the linear or nonlinear model has heavy-tailed distributional random errors in moderate sample cases, the proposed SIM-FDR method significantly outperforms competing methods in power performance across nearly all scenarios while effectively controlling the FDR. In small sample cases, our method maintains actual FDRs at the nominal FDR levels for all scenarios, whereas nearly all competing methods fail to control the FDR. Compared to the SIM-FDR method, other methods may either be underpowered or yield inappropriate results, resulting in an inflated FDR relative to the nominal FDR threshold. Overall, these findings suggest that SIM-FDR demonstrates robust performance.
However, there are some aspects of our method that need to be discussed as follows.
Our approach primarily focuses on feature selection and does not involve estimating the link function or true coefficients of the original model for predicting the response. Once features are selected using our method, existing single index models for prediction can be utilized to fit the selected features, and the fitted model can subsequently be employed for prediction.
Estimating the unknown link function in high-dimensional single index models presents a significant challenge. Although our approach does not require estimating the link function, it assumes that the unknown link function is monotonic.
In the theoretical section, our method assumes that the predictors are absolutely continuous and sub-Gaussian. However, Rejchel et al. [32] have demonstrated through simulation studies that the rank-based approach can also be applied in scenarios with discrete-type predictors, which suggests that our method can be employed to analyze data with discrete-type predictors.
In addition, our method is designed for single-index models, not for double-index or multi-index models. After careful consideration, we believe that combining sufficient dimension reduction with the rank-based approach may address this issue, and we hope to extend our method to double-index and multi-index scenarios in future work.
Our method is also not suitable for binary or other discrete-type responses. We plan to explore extensions of SIM-FDR to such cases in future research.
Beyond these points, there has been considerable research interest in utilizing additional information (e.g., phylogenetic information) from microbiome data to enhance detection power while maintaining FDR control [49, 50]. In the future, it will be of interest to incorporate such information into the SIM-FDR framework to further improve the detection power of controlled feature selection.
Supporting information
Acknowledgments
We thank all the women and men who agreed to participate in this study.
Data Availability
All relevant data sets are in the Supporting information files of the paper.
Funding Statement
This work was supported by the National Natural Science Foundation of China (NSFC) (no. 11801571, no. 12171483 and no. 61773401).
References
- 1. Majewski IJ, Bernards R. Taming the dragon: genomic biomarkers to individualize the treatment of cancer. Nat Med. 2011;17(3):304–12. doi: 10.1038/nm.2311
- 2. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34(3). doi: 10.1214/009053606000000281
- 3. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–41. doi: 10.1093/biostatistics/kxm045
- 4. Li H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu Rev Stat Appl. 2015;2(1):73–94. doi: 10.1146/annurev-statistics-010814-020351
- 5. Foster JC, Taylor JMG, Nan B. Variable selection in monotone single-index models via the adaptive LASSO. Stat Med. 2013;32(22):3944–54. doi: 10.1002/sim.5834
- 6. Radchenko P. High dimensional single index models. Journal of Multivariate Analysis. 2015;139:266–82. doi: 10.1016/j.jmva.2015.02.007
- 7. Ganti R, Rao N, Willett RM. Learning single index models in high dimensions. arXiv preprint. 2015. https://arxiv.org/abs/1506.08910
- 8. Luo S, Ghosal S. Forward selection and estimation in high dimensional single index models. Statistical Methodology. 2016;33:172–9. doi: 10.1016/j.stamet.2016.09.002
- 9. Cheng L, Zeng P, Zhu Y. BS-SIM: an effective variable selection method for high-dimensional single index model. Electron J Statist. 2017;11(2). doi: 10.1214/17-ejs1329
- 10. Yang ZR, Balasubramanian K, Liu H. High-dimensional non-Gaussian single index models via thresholded score function estimation. In: International Conference on Machine Learning. 2017. p. 3851–60.
- 11. Dudeja R, Hsu D. Learning single-index models in Gaussian space. In: Conference on Learning Theory. 2018. p. 1887–930.
- 12. Hirshberg DA, Wager S. Debiased inference of average partial effects in single-index models. arXiv preprint. 2018. https://arxiv.org/abs/1811.02547
- 13. Pananjady A, Foster DP. Single-index models in the high signal regime. 2019. https://people.eecs.berkeley.edu/ashwinpm/SIMs.pdf
- 14. Eftekhari H, Banerjee M, Ritov Y. Inference in high-dimensional single-index models under symmetric designs. Journal of Machine Learning Research. 2021;22(27):1–63.
- 15. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x
- 16. Barber RF, Candès EJ. A knockoff filter for high-dimensional selective inference. Ann Statist. 2019;47(5). doi: 10.1214/18-aos1755
- 17. Candès E, Fan Y, Janson L, Lv J. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2018;80(3):551–77. doi: 10.1111/rssb.12265
- 18. Bates S, Candès E, Janson L, Wang W. Metropolized knockoff sampling. Journal of the American Statistical Association. 2020;116(535):1413–27. doi: 10.1080/01621459.2020.1729163
- 19. Romano Y, Sesia M, Candès E. Deep knockoffs. Journal of the American Statistical Association. 2019;115(532):1861–72. doi: 10.1080/01621459.2019.1660174
- 20. Du L, Guo X, Sun W, Zou C. False discovery rate control under general dependence by symmetrized data aggregation. Journal of the American Statistical Association. 2021;118(541):607–21. doi: 10.1080/01621459.2021.1945459
- 21. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2010;72(4):417–73. doi: 10.1111/j.1467-9868.2010.00740.x
- 22. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804. doi: 10.1038/nmeth.2016
- 23. Haury A-C, Mordelet F, Vera-Licona P, Vert J-P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst Biol. 2012;6:145. doi: 10.1186/1752-0509-6-145
- 24. Hu X, Hu Y, Wu F, Leung RWT, Qin J. Integration of single-cell multi-omics for gene regulatory network inference. Comput Struct Biotechnol J. 2020;18:1925–38. doi: 10.1016/j.csbj.2020.06.033
- 25. de Groot P, Nikolic T, Pellegrini S, Sordi V, Imangaliyev S, Rampanelli E, et al. Faecal microbiota transplantation halts progression of human new-onset type 1 diabetes in a randomised controlled trial. Gut. 2021;70(1):92–105. doi: 10.1136/gutjnl-2020-322630
- 26. Aitchison J. The statistical analysis of compositional data. Caldwell, New Jersey: Blackburn Press; 2003.
- 27. Shi P, Zhang A, Li H. Regression analysis for microbiome compositional data. Ann Appl Stat. 2016;10(2):1019–40. doi: 10.1214/16-aoas928
- 28. Cao Y, Lin W, Li H. Two-sample tests of high-dimensional means for compositional data. Biometrika. 2017;105(1):115–32. doi: 10.1093/biomet/asx060
- 29. Sohn MB, Li H. Compositional mediation analysis for microbiome studies. Ann Appl Stat. 2019;13(1). doi: 10.1214/18-aoas1210
- 30. Lu J, Shi P, Li H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics. 2019;75(1):235–44. doi: 10.1111/biom.12956
- 31. Zhang H, Chen J, Li Z. Testing for mediation effect with application to human microbiome data. Statistics in Biosciences. 2019:1–16.
- 32. Rejchel W, Bogdan ML. Rank-based Lasso – efficient methods for high-dimensional robust model selection. Journal of Machine Learning Research. 2020;21:1–47.
- 33. Alexander DH, Lange K. Stability selection for genome-wide association. Genet Epidemiol. 2011;35(7):722–8. doi: 10.1002/gepi.20623
- 34. Li S, Hsu L, Peng J, Wang P. Bootstrap inference for network construction with an application to a breast cancer microarray study. Ann Appl Stat. 2013;7(1):391–417. doi: 10.1214/12-AOAS589
- 35. Hofner B, Boccuto L, Göker M. Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinformatics. 2015;16:144. doi: 10.1186/s12859-015-0575-3
- 36. Wang F, Mukherjee S, Richardson S, Hill SM. High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking. Stat Comput. 2020;30(3):697–719. doi: 10.1007/s11222-019-09914-9
- 37. Haury A-C, Mordelet F, Vera-Licona P, Vert J-P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst Biol. 2012;6:145. doi: 10.1186/1752-0509-6-145
- 38. Werner T. Loss-guided stability selection. Advances in Data Analysis and Classification. 2023:1–26.
- 39. Zhou J, Sun J, Liu Y, Hu J, Ye J. Patient Risk Prediction Model via Top-k Stability Selection. In: Proceedings of the 2013 SIAM International Conference on Data Mining. 2013. p. 55–63. doi: 10.1137/1.9781611972832.7
- 40. Efron B. Bootstrap methods: another look at the jackknife. Ann Statist. 1979;7(1). doi: 10.1214/aos/1176344552
- 41. Kushary D. Bootstrap methods and their application. Technometrics. 2000;42(2):216–7. doi: 10.1080/00401706.2000.10486018
- 42. Johnson RW. An introduction to the bootstrap. Teaching Statistics. 2001;23(2):49–54. doi: 10.1111/1467-9639.00050
- 43. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7
- 44. Sunagawa S, Acinas SG, Bork P, Bowler C, et al. Tara Oceans: towards global ocean ecosystems biology. Nat Rev Microbiol. 2020;18(8):428–45. doi: 10.1038/s41579-020-0364-5
- 45. Logares R, Sunagawa S, Salazar G, Cornejo-Castillo FM, Ferrera I, Sarmento H, et al. Metagenomic 16S rDNA Illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities. Environ Microbiol. 2014;16(9):2659–71. doi: 10.1111/1462-2920.12250
- 46. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, et al. Ocean plankton. Structure and function of the global ocean microbiome. Science. 2015;348(6237):1261359. doi: 10.1126/science.1261359
- 47. Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep. 2021;11(1):14505. doi: 10.1038/s41598-021-93645-3
- 48. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517(7536):576–82. doi: 10.1038/nature14129
- 49. Xiao J, Cao H, Chen J. False discovery rate control incorporating phylogenetic tree increases detection power in microbiome-wide multiple testing. Bioinformatics. 2017;33(18):2873–81. doi: 10.1093/bioinformatics/btx311
- 50. Hu J, Koh H, He L, Liu M, Blaser MJ, Li H. A two-stage microbial association mapping framework with advanced FDR control. Microbiome. 2018;6(1):131. doi: 10.1186/s40168-018-0517-1