Abstract
Omics-wide association analysis is an important tool in medicine and human health research. However, modern omics data sets often exhibit high dimensionality, responses with unknown distributions, features with unknown distributions, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis depends on accurate modeling of such data. Most existing association analysis methods rely on specific model assumptions and lack effective false discovery rate (FDR) control. To address these limitations, this paper first applies a single index model to omics data. The model is robust in that the relationship between the response variable and a linear combination of covariates may be connected by any unknown monotonic link function, and both the random error and the covariates may follow any unknown distribution. Based on this model, the paper then combines a rank-based approach with the symmetrized data aggregation approach to develop a novel and robust feature selection method that achieves fine-mapping of risk features while controlling the false positive rate of selection. Theoretical results support the proposed method, and analyses of simulated data show that the new method performs effectively and robustly across all scenarios. The new method is also applied to two real data sets and identifies some risk features unreported in existing findings.
Introduction
Advances in high-throughput omics technologies, such as metagenomics sequencing and DNA methylation or protein microarrays, have revolutionized research in medicine. One major use of such technologies is omics-wide association analysis, which identifies relevant omics features, such as microbial genes and DNA methylation variants, from a large pool of candidates by analyzing their association with a phenotype of interest, e.g., patient prognosis or response to medical treatment. The identified features can help elucidate disease mechanisms, subject to further validation studies, or be used to build more accurate prediction models for personalized medicine [1]. The increasing availability of massive human genomic data sets makes the dimensionality of omics features much larger than the sample size, which poses new challenges to statistical analysis [2–4]. Moreover, modern omics data sets are not only high-dimensional but also exhibit responses with unknown distributions, features with unknown distributions, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis relies on accurate high-dimensional modeling of such data. Such an involved modeling task can be based on the high-dimensional single index model (SIM). The single index model has been the subject of extensive investigation in both the statistics and biology literature over the last few decades. It generalizes the linear model to scenarios where the regression function can be any monotonic link function, including nonlinear functions. High-dimensional single index models have also attracted interest, with various authors studying variable selection, estimation, and inference using penalization schemes [5–14].
However, almost all of these methods do not provide a false discovery rate (FDR) controlled multiple testing procedure for simultaneously testing the significance of model coefficients, and thus cannot perform FDR-controlled feature selection. In particular, for feature selection in high-dimensional single index models, Rejchel et al. [32] proposed the cross-validation Ranklasso method, along with its modified versions: the thresholded Ranklasso and the weighted Ranklasso methods. However, the precision of feature selection for all these regularization-based methods depends on the tuning parameter in the penalty function and on the sample size. Given a specific sample size, the relationship between the false selection rate (e.g., FDR) and the tuning parameter value remains unknown. It is therefore challenging to use the tuning parameter to control the FDR of the Ranklasso-type selection results, and these methods cannot effectively conduct FDR-controlled feature selection. In addition, the few methods that do offer FDR-controlled multiple testing rely on p-values with the Benjamini-Hochberg (BH) correction [15] and may fail to control the FDR in the presence of complex and strong dependence among the features. Moreover, the high dimensionality of the features lowers the power of such approaches because of the large-scale adjustment burden. Hence, developing an effective and robust FDR-controlled feature selection method based on the high-dimensional single index model is highly desirable.
In the existing literature, FDR-controlled feature selection for high-dimensional models is achieved mainly via the following three approaches.
-
Knockoff filter-based approach.
Barber and Candes [16] first introduced the knockoff filter, a feature selection procedure that controls the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite samples regardless of the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and it does not require any knowledge of the noise level. Following the knockoff filter framework, Candes et al. [17] proposed the model-X knockoff FDR control method. However, this method requires both complete knowledge of the joint distribution of the design matrix and repeated derivation of the conditional distributions. The development of methods to construct exact or approximate knockoff features for a broader class of distributions is a promising area of active research [18, 19].
-
Symmetrized data aggregation (SDA) approach.
To simultaneously test the significance of the regression coefficients in the high-dimensional linear regression model, Du et al. [20] first proposed a data splitting-based method, SDA, to select features with FDR control. The key idea is to apply a sample-splitting strategy to construct a series of statistics with a marginal symmetry property and then to exploit this symmetry to approximate the number of false discoveries. The SDA approach consists of three procedures.
- a. The first procedure splits the sample into two parts, both of which are used to construct statistics assessing the evidence of each regression coefficient against the null.
- b. The second procedure aggregates the two statistics to form a new ranking statistic that is symmetric about zero under the null.
- c. The third procedure chooses a threshold along the ranking by exploiting the symmetry between the positive and negative null statistics to control the FDR.
-
Stability selection approach.
Stability selection [21] is a variable selection algorithm that relies on resampling. Its core principle is to repeatedly apply a variable selection method to resampled subsets of the data, defining a variable as stable if it is frequently selected. It can improve the performance of a base feature selection method such as the lasso. Another useful property of stability selection is that it provides an effective way to control the false discovery rate (FDR) in finite samples, provided its tuning parameters are set properly. Owing to its versatility and flexibility, stability selection has been successfully applied in many domains, such as gene expression analysis [22–25].
Additionally, FDR-controlled feature selection can also be implemented via many other statistical inference methods (e.g., regression-based modeling, two-sample testing, and statistical causal mediation analysis) [26–31]. However, directly applying these methods to omics data is usually underpowered and can sometimes render inappropriate results (the selection results may contain too many false discoveries).
In this paper, we first employ a single index model (SIM) to model omics data. The considered SIM is robust in two respects: the relationship between the response variable and a linear combination of covariates is connected by an unknown monotonic link function, and both the random error and the covariates may follow unknown distributions. Our plan is to combine the above FDR-controlled feature selection approaches with the rank-based approach [32] for the high-dimensional SIM to develop a novel and robust FDR-controlled feature selection method. More specifically, we construct a single index model based on the rank-based approach, in which the transformed response samples are dependent. This dependency violates the independence assumption on the response samples underlying the knockoff filter-based approach, so the knockoff filter is not applicable to our rank-based plan. Regarding the stability selection approach, the existing theory that informs its implementation provides relatively weak bounds on the FDR, leading to a reduced number of true positives [33–36]. Furthermore, stability selection requires users to specify two of three parameters: the target FDR, a selection threshold, and the expected number of selected features. Numerous studies have shown that stability selection is sensitive to these choices, complicating tuning for optimal performance [34, 36–39]. In summary, stability selection not only requires numerous parameter settings but is also excessively conservative, reducing false positives (FP) at the expense of true positives (TP). Hence, the stability selection approach is not an appropriate choice for our plan.
The SDA approach does not depend on p-values, offers higher power, and requires significantly fewer tuning parameters, which allows it to enjoy several useful theoretical properties. This motivates us to adopt the SDA approach within the context of the rank-based SIM. Consequently, we leverage both the rank-based approach [32] and the SDA approach to develop a robust feature selection method for the SIM, which we then apply to fine-mapping of omics data while controlling the false positive rate of selection. Notably, the proposed method does not depend on p-values. Additionally, we provide theoretical results that validate its effectiveness.
Finally, we design extensive simulation studies to compare the proposed method with competing methods. The simulation results demonstrate that the proposed method effectively controls the FDR across all scenarios. They also indicate that when the sample size is moderate and the errors stem from a heavy-tailed Cauchy distribution, or when the relationship is non-linear (with or without heavy-tailed Cauchy errors), the proposed method outperforms the competing methods in terms of power. In small-sample scenarios, the competing methods may either be underpowered or render inappropriate results with an FDR inflated above the nominal threshold. These results indicate that our method performs effectively and robustly across all scenarios. The proposed method is also applied to two real data sets and identifies some causal features unreported in existing findings.
Materials and methods
This section first reviews the rank-based single index model (SIM) for omics data and then provides parameter estimation methods for the SIM. Finally, a multiple testing procedure with FDR control is given for simultaneously testing the coefficients of the high-dimensional single index model.
Review of the rank-based single index model
Let X denote the observed n × p matrix of p omics features on n samples, and let Y denote the response vector, e.g., disease status or gene expression. Assuming that the means of all p features are zero, Rejchel et al. [32] focused on the single index model (1) without intercept
Y_i = g(X_i^T β, ε_i), i = 1, …, n | (1) |
where β = (β_1, …, β_p)^T denotes the coefficients associating the omics features with the response Y, and g is an unknown monotonic link function. No assumptions are made on the form of the monotonic link function g, the distribution of the error ε, or the distributions of the p features; hence model (1) is robust for modeling omics data. Their purpose is to perform feature selection identifying the support set S = {j : β_j ≠ 0} within the framework of model (1). To this end, Rejchel et al. [32] utilized a rank-based lasso approach to sparsely estimate S by solving the optimization problem
β̂ = arg min_{b ∈ R^p} [ (1/n) Σ_{i=1}^n (Ŷ_i − X_i^T b)^2 + λ‖b‖_1 ] | (2) |
where Ŷ_i = R_i/n − (n+1)/(2n), with R_i denoting the rank of Y_i among Y_1, …, Y_n (i.e., the actual response values Y_i are replaced by their centered ranks); ‖·‖_1 and ‖·‖_2 denote the l1 and l2 norms, respectively; and λ > 0 is a data-dependent tuning parameter of the l1-type lasso penalty.
Rejchel et al. [32] highlighted that the rank-based lasso method defined by the optimization problem (2) does not estimate the true regression coefficients β in model (1). However, under some assumptions, they defined a parameter β* via the corresponding optimization problem without the penalty
which is related to the true vector of regression coefficients β. Under certain standard assumptions, the support of β* coincides with the support of β. Furthermore, Rejchel et al. [32] demonstrated that the estimator β̂ is a consistent estimate of β*, and thus can be utilized to identify S. More details can be found in [32].
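As a concrete illustration, the rank-based lasso of (2) can be sketched in a few lines: the responses are replaced by their centered ranks, and an ordinary lasso solver is applied. This is a minimal sketch under our own illustrative choices (the tuning parameter value and scikit-learn's coordinate-descent `Lasso`), not the implementation of [32].

```python
# Minimal sketch of the rank-based lasso in (2): replace responses by their
# centered ranks, then run an ordinary lasso.  Solver and tuning parameter
# lam are illustrative choices, not those of [32].
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Lasso

def centered_ranks(y):
    """Map responses to centered ranks R_i/n - (n+1)/(2n); they sum to zero."""
    n = len(y)
    return rankdata(y) / n - (n + 1) / (2 * n)

def rank_lasso(X, y, lam):
    """Lasso on centered ranks (no intercept; features assumed centered)."""
    r = centered_ranks(y)
    return Lasso(alpha=lam, fit_intercept=False).fit(X, r).coef_

# Toy data from a single index model with a monotone link g(u) = exp(u):
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = 1.0
y = np.exp(X @ beta + 0.1 * rng.standard_normal(n))
b = rank_lasso(X, y, lam=0.05)   # the three true signals should be retained
```

Because the fit uses only the ranks of y, the same coefficients are obtained for any increasing transformation of the response, which is the source of the method's robustness.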
Omics-wide association analysis based on robust single index model
In real omics data analysis, the response variable Y often follows an unknown distribution, which may result in non-linear association relationships between Y and the linear combination of omics features X. To simultaneously account for unknown non-linear association relationships, an unknown response distribution, and unknown feature distributions, we employ model (1) to robustly model the omics data.
Based on the high-dimensional model (1) with a large number of features (p > n), we focus on the multiple testing problem (3) under the null hypotheses:
H_{0j}: β_j = 0, j = 1, …, p | (3) |
to identify the features associated with the response Y while controlling false positives. Because the previous section showed that the support of β* coincides with the support of β in model (1), the multiple testing problem (3) becomes
H_{0j}: β*_j = 0, j = 1, …, p | (4) |
The objective of this paper is to develop an FDR-controlled multiple testing procedure for the statistical inference problem (4) under the high-dimensional model (1) framework, in order to perform omics-wide association analysis.
Estimation methods for the parameter β*
For the statistical inference problem (4), this section introduces estimation methods for the true parameter β* under the high-dimensional model (1) with a large number of features (p > n) and the low-dimensional model (1) with a small number of features (p < n), respectively. For the high-dimensional scenario, we employ the rank-based lasso method in (2) to obtain the sparse parameter estimate β̂. Given the observed variables X and the transformed responses Ŷ, the optimization problem (2) can be solved using the R package glmnet.
For the low-dimensional scenario, we use the rank-based ordinary least squares (ROLS) method to estimate the parameters by solving the optimization problem defined in equation (5)
β̂ = arg min_b Σ_{i=1}^n (Ŷ_i − X_i^T b)^2 | (5) |
and obtain the parameter estimators
β̂ = (X^T X)^{-1} X^T Ŷ | (6) |
If there are strong correlations among the columns of X, causing the ROLS method to suffer from multicollinearity and become ineffective, we first use the variance inflation factor (VIF) to assess the presence of multicollinearity and then, based on the VIF values, eliminate the problematic features. Subsequently, we apply the ROLS method to the model built from the remaining features. In particular, if a continuous confounder causes the multicollinearity, a stratified approach can be considered: first categorize the continuous confounder into several, say four, categories; then fit four conditional rank-based regression models across the four confounder categories; and finally take the weighted average of the four parameter estimates from the conditional models as the final estimate. We believe this approach can remove most of the bias due to confounding. The properties of the estimators can be found in the "Theoretical properties" section below.
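In the low-dimensional case, the ROLS estimator of (5)-(6) amounts to ordinary least squares with the centered ranks of Y as pseudo-responses. The following sketch, with an illustrative cubic link and simulated data of our own choosing, shows the computation:

```python
# Hedged sketch of the rank-based OLS (ROLS) estimator: ordinary least
# squares with the centered ranks of Y as pseudo-responses.
import numpy as np
from scipy.stats import rankdata

def rols(X, y):
    n = len(y)
    r = rankdata(y) / n - (n + 1) / (2 * n)       # centered ranks of Y
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)  # (X^T X)^{-1} X^T r
    return coef

# Illustrative check: the monotone link g(u) = u^3 leaves the ranks of the
# response unchanged, so ROLS still points at the right feature.
rng = np.random.default_rng(1)
n = 300
X = rng.standard_normal((n, 2))
y = (X[:, 0] + 0.2 * rng.standard_normal(n)) ** 3
beta_hat = rols(X, y)   # beta_hat[0] clearly dominates beta_hat[1]
```

Note that `lstsq` is used instead of an explicit matrix inverse for numerical stability.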
Remark 1. In the following section, we use the SDA approach to develop an FDR-controlled feature selection method. The SDA approach requires that, under the null hypothesis, the distribution of the estimator obtained in the low-dimensional scenario be symmetric, or asymptotically symmetric, around zero. In the low-dimensional case, when the number of features is close to, or not sufficiently small compared with, the sample size, the thresholded Ranklasso method [32] should be used to obtain an estimator that is asymptotically symmetric around zero under the null. (Note that under the null hypothesis, the distributions of the regular lasso and the weighted Ranklasso estimators [32] are usually not asymptotically symmetric around zero.) However, in our simulation studies the number of features was relatively small across all scenarios, and when the thresholded Ranklasso method is used in low-dimensional cases its power is significantly lower than that of the ROLS method, because the thresholded Ranklasso performs feature selection again in the low-dimensional scenario and thereby reduces the number of true signals. Hence we adopt the ROLS method for parameter estimation in the low-dimensional scenario.
FDR controlled robust feature selection procedure
In this section, we apply the SDA approach [20] described in the Introduction to problem (4) for testing the parameter β*, and develop a corresponding FDR-controlled feature selection procedure. For the two independent sample parts produced in the first step of SDA, we employ the RLasso and ROLS methods to estimate the parameters, respectively. On the first part of the sample, we use the rank-based lasso to sparsely estimate the parameters and identify a smaller set of candidate omics features associated with the response; reducing the feature dimensionality alleviates the multiple testing burden in the third step of SDA. On the second part of the sample, we employ the rank-based ordinary least squares method to obtain a more precise estimate in the low-dimensional case. Finally, we use the estimates from the two sample parts to construct the FDR-controlled feature selection procedure.
More specifically, the proposed procedure is outlined as follows.
-
Step 1: Splitting samples.
Given the ratio γ, the sample is randomly split into two independent, disjoint parts with sample sizes n1 and n2, respectively, where n1 = ⌊γn⌋ and n2 = n − n1. Simulation studies in [20] verified that the setting γ = 1/2 is often the most powerful for the SDA approach, so we also use this ratio to split the data in the simulation studies and real data analyses.
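Step 1 can be sketched as follows, with the equal split γ = 1/2 as the default; the helper name is ours:

```python
# Sketch of Step 1: randomly split n sample indices into two disjoint parts
# of sizes n1 = floor(gamma * n) and n2 = n - n1.
import numpy as np

def split_sample(n, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n1 = int(gamma * n)
    return perm[:n1], perm[n1:]   # index sets for parts D1 and D2

idx1, idx2 = split_sample(10)
```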
-
Step 2: Selecting the candidate omics feature set using the first part sample .
The RLasso method is employed on the first part of the sample to obtain estimates of the parameters of the p features. The non-zero estimates are used to form the candidate feature set. Based on the first part of the sample, we use the candidate features to construct a low-dimensional single index model (7). The ROLS method is then applied to model (7) to obtain the estimates.
Step 3: Similarly to Step 2, first use the candidate features, based on the second part of the sample, to construct a low-dimensional single index model; then use this model to obtain the estimates.
-
Step 4: Constructing the test statistics under null for problem (4).
Under the null, we utilize the estimates from the two sample parts to construct the statistics T_{1,i} and T_{2,i}, with T_{2,i} = 0 for features outside the candidate set. The normalizing quantity is viewed as a scaling constant independent of the two estimates. From Theorem 1 (given in the Theory section below), the variance of the estimator is very complicated and difficult to compute directly. It is well known that bootstrap resampling can consistently estimate the variance of a variable with an unknown or complex distribution [40–42]; hence we employ bootstrap resampling to obtain the variance estimate.
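Because the variance in Theorem 1 is hard to compute analytically, the step above resorts to the bootstrap. A minimal sketch, our own illustrative code reusing the ROLS estimator, is:

```python
# Bootstrap estimate of the standard deviation of the ROLS coefficients:
# resample the rows with replacement, refit, and take the empirical sd.
import numpy as np
from scipy.stats import rankdata

def rols(X, y):
    n = len(y)
    r = rankdata(y) / n - (n + 1) / (2 * n)       # centered ranks
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)
    return coef

def bootstrap_sd(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        reps[b] = rols(X[idx], y[idx])
    return reps.std(axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 3))
y = np.exp(X[:, 0] + 0.3 * rng.standard_normal(150))
se = bootstrap_sd(X, y)
```

The number of bootstrap replicates (200 here) is an illustrative choice; `rankdata` handles the ties created by resampling via average ranks.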
-
Step 5: Aggregating the test statistics.
Aggregate the two statistics obtained in Step 4 to form a new ranking statistic W_j that is symmetric around zero under the null hypothesis.
Remark 2: Because the unthresholded estimator is asymptotically normal, the statistic T_{k,i}, k = 1, 2, is asymptotically normal with mean 0 under the null. This shows that W_j is symmetric around zero. Intuitively, large positive W_j values indicate strong evidence against the null hypothesis, while negative W_j values most likely correspond to null cases.
-
Step 6: Choosing the threshold.
Given the nominal level α, a threshold L is chosen by exploiting the symmetry between the positive and negative null statistics in order to control the false discovery rate (FDR) at level α, where |·| denotes the number of elements in a set. The rejected features are those with W_j ≥ L.
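Steps 5 and 6 can be sketched jointly. Following our reading of the SDA approach [20], we assume here that the aggregation is the product W_j = T_{1,j} · T_{2,j} and that the threshold is the smallest t at which the estimated false discovery proportion #{j : W_j ≤ −t} / max(#{j : W_j ≥ t}, 1) drops below α; both choices are illustrative, not a transcript of the authors' code.

```python
# Sketch of Steps 5-6: aggregate the two statistics by their product
# (symmetric about 0 under the null) and scan thresholds to control FDR.
import numpy as np

def sda_select(T1, T2, alpha=0.1):
    W = T1 * T2
    for t in np.sort(np.abs(W[W != 0])):            # candidate thresholds
        fdp_hat = np.sum(W <= -t) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:                         # estimated FDP <= alpha
            return np.flatnonzero(W >= t)            # selected features
    return np.array([], dtype=int)

# Three strong signals agree in sign across the two splits; nulls are small
# and carry random signs, so their products are symmetric about zero.
T1 = np.array([5.0, 4.0, 3.0, 0.1, -0.1, 0.2, -0.2])
T2 = np.array([5.0, 4.0, 3.0, -0.1, 0.1, 0.2, -0.2])
sel = sda_select(T1, T2, alpha=0.1)
```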
-
Step 7: Robustly selecting the discovery set.
The goal of this step is to further stabilize the selection result, which would otherwise vary substantially across different data splits because randomly splitting the data into two halves inflates the variances of the estimated coefficients. We therefore propose the following procedure to robustly select features.
Suppose we repeat the above six-step data-splitting procedure B times independently and record the set of selected features from each run. For the j-th feature, we define the empirical inclusion rate as:
Sort the features by their empirical inclusion rates in increasing order and select the top-ranked features, where the number selected equals the median size of the selected feature sets over the B runs. In general, a larger B yields more stable feature selection. In our simulation studies, the performance of the proposed procedure with B = 15 was very similar to that with larger values (B > 15); therefore, to save running time, we set B = 15 for both the simulations and the real data analyses.
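Step 7 can be sketched as follows; the helper is ours, and the tie-breaking in `argsort` is arbitrary:

```python
# Sketch of Step 7: empirical inclusion rates over B splits, keeping the m
# features with the highest rates, where m = median selected-set size.
import numpy as np

def stabilize(selected_sets, p):
    B = len(selected_sets)
    inc = np.zeros(p)                                # inclusion rates I_j
    for S in selected_sets:
        inc[list(S)] += 1.0 / B
    m = int(np.median([len(S) for S in selected_sets]))
    order = np.argsort(inc)                          # increasing, ties arbitrary
    return set(order[p - m:].tolist()) if m > 0 else set()

# Features 0 and 1 appear in every split; feature 2 appears only once, so it
# is dropped when the median selected-set size is 2.
stable = stabilize([{0, 1}, {0, 1, 2}, {0, 1}], p=5)
```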
To facilitate comprehension of the above procedure, a brief outline is provided below.
Randomly splitting the entire sample into two independent parts.
Using the first part of the sample, we employ Ranklasso to obtain the parameter estimates and use them to identify a smaller set of candidate omics features associated with the response; this is the selected candidate feature set. The goal of reducing the feature dimensionality is to achieve more accurate parameter estimation in a low-dimensional setting; the well-behaved estimators are then used to construct the test statistics.
Based on the first part of the sample, we apply the ROLS method to the low-dimensional single index model (built on the smaller candidate feature set) to obtain more precise estimates.
Based on the second part of the sample, we apply the ROLS method to the corresponding low-dimensional single index model to obtain more precise estimates.
The estimators obtained from the two independent sample parts are utilized to construct statistics assessing the evidence of each regression coefficient against the null.
Aggregating the two statistics to form a new ranking statistic that is symmetric about zero under the null.
Choosing a threshold along the ranking by exploiting the symmetry about zero property between positive and negative null statistics to control the FDR.
Stabilizing the selection results by multi-splitting procedure.
Remark 3. The proposed procedure is denoted SIM-FDR. As seen above, the FDR control of SIM-FDR does not rely on p-values; it requires only that the distribution of the aggregated statistic Wj be symmetric about zero under the null hypothesis. In fact, SIM-FDR does not require the distribution of Wj to be normal or asymptotically normal, only symmetric about zero. Thus, SIM-FDR may not depend heavily on the sample size. The simulation results in small-sample settings verify this point, indicating that SIM-FDR is more robust than its competitors across a wide range of scenarios, since asymptotic symmetry is much easier to achieve in practice than asymptotic normality.
Theoretical properties
This section discusses the finite-sample and asymptotic FDR control properties of the proposed SIM-FDR. Given the intuitive nature of Step 7 of SIM-FDR, we focus solely on proving the FDR control properties of the preceding six-step SDA procedure. Before presenting the theoretical results, we first provide the assumptions and definitions. The proofs of all theorems are provided in S1 File (Supporting information). Recall the definition of the true parameter β* given in the "Review" section above.
Required basic assumptions
Assumption 1. Assume that the samples (Xi, Yi), i = 1, …, n, are independent and identically distributed, with Xi denoting the predictor vector of the i-th sample; that the distribution of Xi is absolutely continuous; and that the noise variable is independent of Xi.
Assumption 2. We assume that for each , the conditional expectation exists and for a real number .
Assumption 3. We assume that the cumulative distribution function F of the response variable Yi is increasing and that g in model (1) is increasing with respect to its first argument.
Assumption 4. Let p0 = |S| denote the number of elements in the support set S. We suppose that the significant predictors are sub-gaussian with the coefficient , i.e. for each we have where denotes the sub-matrix of on the column indices from S. Moreover, the irrelevant predictors are univariate sub-gaussian, i.e. for each and , we have for positive numbers Finally, we denote
Remark 4. No other assumptions are made on the distribution of the noise variable ε. Assumption 2 is a standard condition in the single index model literature and can be found in [32]. Assumption 3 meets the needs of the rank-based approach. Assumption 4 imposes a regular sub-gaussian condition on the feature matrix X.
Considering model (1), Rejchel et al. [32] show that under Assumptions 1 and 2, β* is proportional to β with a positive proportionality constant, and that under Assumption 3 the signs of β coincide with the signs of β*. This result indicates that the support of β in model (1) coincides with that of the rank-based true coefficient β*. These conclusions imply that we can perform feature selection for model (1) using estimates of β*.
Definitions of the cone invertibility factor (CIF)
In our article, the validity of SIM-FDR relies on the consistency of parameter estimation for the Rank-lasso problem (2), which in turn supports the feature screening property of the selection based on (2). In the high-dimensional setting, Rejchel et al. [32] demonstrated that ensuring the consistency of the estimators produced by the Rank-lasso penalty requires a cone invertibility factor (CIF) condition on the feature matrix X.
Let δS and δSc be the restrictions of a vector δ to the indices in S and in its complement, respectively. Now we consider a cone
Define the population version of CIF to be
| (8) |
for a sharp formulation of convergence results for all lq norms with q ≥ 1.
FDR control
Before presenting the theorems on finite-sample and asymptotic FDR control of the proposed procedure, we first establish the asymptotic symmetry around zero of the test statistics Wj and the feature screening property of the candidate feature selection obtained by the RLasso method in Step 2 of the proposed procedure. These results are essential for demonstrating FDR control.
Asymptotic symmetry around zero of the statistics W.
We prove that the statistic Wj is asymptotically symmetric around 0 under the null hypothesis. Clearly, the distribution of Wj depends on that of the estimator obtained in the low-dimensional model (1) scenario.
Theorem 1. Suppose that Assumptions 1, 2, 3 and 4 are satisfied, with Xi denoting the i-th predictor vector, and that the covariance matrix of the features X is positive definite. Then the following conclusions hold for the OLS estimator of β* in the low-dimensional model scenario,
where the scaling term involves the j-th diagonal element of the matrix, and D is defined in Lemma 3 of the supplementary materials (Supporting information). Furthermore, under the null hypothesis, the statistic Wj is asymptotically symmetric around zero.
Sure screening property for the candidate feature selection result.
In this section, we prove the sure screening property of the candidate feature selection produced by the RLasso method in Step 2 of the proposed procedure in the high-dimensional scenario. Define the estimated feature index set as
Theorem 2. Consider problem (2) and let be a fixed sequence such that , and be arbitrary. Suppose that Assumptions 1, 2, 3 and 4 are satisfied. Moreover, suppose that
| (9) |
and
| (10) |
where the constants are universal and the quantity is the smallest eigenvalue of the correlation matrix of the true predictors. Under the beta-min condition, we have
Theorem 2 indicates that when the sample size is large, the estimated relevant feature index set contains the true relevant feature index set S; that is, the sure screening property holds.
Finite-sample FDR control.
Theorem 3. Suppose the proposed model (1) satisfies all the assumptions given in the above section, and assume the statistics are well-defined. For any nominal level, the FDR of SIM-FDR satisfies
where , and .
This theorem holds regardless of the unknown relationship between the features X and the response Y. The quantity measures the effect on the FDR of both the asymmetry of Wj and the dependence between Wj and W−j.
Asymptotic FDR control.
Following the proof of asymptotic FDR control in [20], we need six technical assumptions for the asymptotic FDR control of the proposed SIM-FDR method: the sure screening property of the candidate feature selection in Step 2 of SIM-FDR, moment conditions, feature matrix conditions, estimation accuracy of the estimators of β*, signal strength, and dependence among the statistics. Under these assumptions, it is straightforward to follow their proofs to show that our method controls the FDR asymptotically. In particular, the moment, feature matrix, signal strength, and dependence assumptions can be specified or designed; the "estimation accuracy" assumption follows from the estimation consistency provided by Theorem 1 (since an estimator with an asymptotic normality property must be consistent); and the "sure screening" assumption is ensured by Theorem 2. Thus asymptotic FDR control follows readily.
Simulation analysis results
To evaluate the feature selection performance of the proposed method (SIM-FDR), we consider sample sizes n = 250 and n = 100 for the moderate and small sample size scenarios, respectively, and set the number of omics features to p = 400. All simulation settings are replicated 100 times.
Competing methods
Five methods are considered for comparisons with SIM-FDR.
A marginal method that tests one omics feature at a time, followed by the Benjamini–Hochberg (BH) correction [15], denoted the "BH" method.
The original model-X knockoff FDR-controlled feature selection method [16], denoted the "MXKF" method. MXKF is based on a high-dimensional joint linear regression model and analyzes continuous response data. It can be implemented using the R package knockoffs.
The regular rank-lasso method defined in (2), with the tuning parameter selected by cross-validation, denoted the "Rlasso-cv" method.
-
Following the work of Rejchel et al. [32], the adaptive rank-lasso method is defined as the following formula (11),
(11) with and , and with weights defined by (12). If , then the j-th explanatory variable is removed from the list of predictors before running the adaptive rank-lasso (11). This method is denoted "Rlasso-adaptive".
Following the work of Rejchel et al. [32], the threshold rank-lasso is defined as follows: the tuning parameter for the rank-lasso is selected by cross-validation, and the threshold is chosen so that the number of selected predictors coincides with the number selected by the adaptive rank-lasso above. This method is denoted "Rlasso-threshold".
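To make the marginal competitor concrete, the BH step-up procedure applied to marginal p-values can be sketched as follows. This is a generic illustration, not the authors' code; the p-values would come from whatever marginal association test is used.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure: return the indices of
    features declared significant at nominal FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    if not passed.any():
        return np.array([], dtype=int)
    k = np.nonzero(passed)[0].max()   # largest rank whose p-value passes
    return np.sort(order[:k + 1])     # reject all hypotheses up to rank k
```

For example, `benjamini_hochberg([0.001, 0.02, 0.8, 0.04], q=0.1)` rejects the three small p-values.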
Generating omics features
We simulate the feature matrix X from the multivariate normal distribution with and . To evaluate the performance of SIM-FDR under dependency among the features, we set the covariance matrix to have the following structure:
for , where we set for a moderate correlation level among the omics features and for .
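The feature-generation step can be sketched as below. The exact matrix entries are not reproduced in the text, so the AR(1)-type form Sigma[j, k] = rho**|j − k| used here is an illustrative assumption.

```python
import numpy as np

def simulate_features(n, p, rho, seed=0):
    """Draw an n x p feature matrix from N(0, Sigma).
    Sigma[j, k] = rho ** |j - k| is an AR(1)-type correlation,
    used here as an illustrative choice of dependence structure."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

# Moderate-correlation setting with the simulation dimensions of the paper.
X = simulate_features(n=250, p=400, rho=0.5)
```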
Simulating the response
We first design the regression coefficients and then use them to generate the response. Let the location vector of nonzero values in β be
Then we set , which denotes the nonzero values vector of β, to be
After generating β, we employ the following six different types of models to simulate the outcome Y: the linear regression model simulates linear association between the response and the omics features, the single index model simulates nonlinear association, and the Cauchy distribution simulates heavy-tailed random errors.
-
Model 1
Linear regression model setting with normally distributed random error: , where the error term ε is independent of X and generated from the normal distribution with mean 0 and variance γ.
-
Model 2
Linear regression model setting with Cauchy-distributed random error: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 3
Single index model with normally distributed random error: , where the error term ε is independent of X and generated from the normal distribution with mean 0 and scale parameter γ.
-
Model 4
Single index model setting with Cauchy-distributed random error: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 5: Double-index model.
Let the location vector of nonzero values in the -dimensional be and the location vector of nonzero values in the -dimensional be . Then we set and to be and , respectively. The double-index model setting with Cauchy-distributed random error is considered: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
-
Model 6: Multi-index model.
Let the location vector of nonzero values in the -dimensional be , the location vector of nonzero values in the -dimensional be , and the location vector of nonzero values in the -dimensional be . Then we set , and to be , and , respectively. The multi-index model setting with Cauchy-distributed random error is considered: , where the error term ε is independent of X and generated from the Cauchy distribution with location parameter 0 and scale parameter γ.
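As a minimal sketch of the response-generation step, the code below simulates Y from a single index model with either normal or heavy-tailed Cauchy errors. The link function `np.exp` and the coefficient pattern passed in are illustrative assumptions, since the paper's exact links and nonzero coefficient values are not reproduced here.

```python
import numpy as np

def simulate_response(X, beta, link=np.exp, error="cauchy", gamma=1.0, seed=1):
    """Simulate Y = g(X beta) + eps for a monotone link g.
    error='normal' uses N(0, gamma) noise (variance gamma);
    error='cauchy' uses Cauchy noise with scale gamma, the
    heavy-tailed setting of models 2 and 4."""
    rng = np.random.default_rng(seed)
    eta = X @ beta
    if error == "cauchy":
        eps = gamma * rng.standard_cauchy(size=eta.shape[0])
    else:
        eps = rng.normal(0.0, np.sqrt(gamma), size=eta.shape[0])
    return link(eta) + eps
```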
To consider different strengths of association between the features and the response, we vary the signal-to-noise ratio (SNR), defined as for models 1 and 2; the scale or variance parameter for models 3, 4, 5 and 6 is set by the formula .
Methods settings and comparison measurements
The MXKF method requires knowledge of the complete conditional distribution of X, and no algorithm can generate model-X knockoffs efficiently for general distributions [18]. Therefore, we utilize the default design used previously [17] in this simulation. For the SIM-FDR method, the optimal λ used in the rank-based lasso is determined through 10-fold cross-validation. For the BH method, we test the association between the outcome and each omics feature marginally and apply the BH procedure to these marginal p-values to identify significant features. In addition, we set and B = 10 for SIM-FDR in the simulation analysis.
Given nominal FDR levels , based on 100 simulated data sets, we use empirical FDR and empirical Power, defined as
to measure the feature selection performance of different methods. In addition, the Matthews correlation coefficient (MCC) is employed to evaluate the results of feature selection. MCC measures the overall accuracy of selection for true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP), with a larger value indicating overall better selection [43]. The definition of empirical MCC is
with
where Ij = 0 indicates that the j-th omics feature is not associated with the response and indicates that the j-th omics feature is truly associated with the response, denotes the index set of omics features truly associated with the response, denotes the number of such features, and Si denotes the index set of omics features selected using the i-th data set.
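A minimal sketch of how these per-replicate quantities can be computed (our own helper, written directly from the definitions above):

```python
import numpy as np

def selection_metrics(selected, true_set, p):
    """Empirical FDR, power, and MCC for one replicate, computed
    from the selected index set, the true index set, and the total
    number of features p. Averaging over replicates gives the
    empirical FDR/power/MCC reported in the figures."""
    selected, true_set = set(selected), set(true_set)
    tp = len(selected & true_set)
    fp = len(selected - true_set)
    fn = len(true_set - selected)
    tn = p - tp - fp - fn
    fdr = fp / max(len(selected), 1)    # convention: FDR = 0 if nothing selected
    power = tp / max(len(true_set), 1)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return fdr, power, mcc
```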
Results for moderate sample size scenario (n = 250)
The FDR performance of all the methods is similar for models 1 and 2, both the FDR and power performance of all the methods are nearly the same for models 3 and 4, and likewise for models 5 and 6. Consequently, we present the analysis results grouped as models 1 and 2, models 3 and 4, and models 5 and 6.
Results for models 1 and 2.
-
FDR performance.
For models 1 and 2, as shown in Figs 1 and 2, the proposed SIM-FDR method effectively controls the actual FDRs at the specified levels for all simulation scenarios. The BH and Rlasso-cv methods exhibit significantly higher actual FDRs across all simulation scenarios and fail to control the FDR as expected. The MXKF method has significantly or slightly higher actual FDRs than the specified FDR levels for all simulation scenarios. The actual FDRs of both the Rlasso-adaptive and Rlasso-threshold methods are nearly zero.
-
Power performance.
For model 1, the MXKF method demonstrates nearly identical power to our SIM-FDR method, but at the expense of higher actual FDRs. For model 2, with Cauchy-distributed errors, SIM-FDR substantially outperforms the MXKF method, with power improvements approaching 0.36 in some scenarios, and significantly dominates the Rlasso-adaptive and Rlasso-threshold methods, with power improvements approaching 0.62 in some scenarios. The BH and Rlasso-cv methods have excessively high actual FDRs for these two models, rendering their power comparisons inconsequential.
Fig 1. Results for model 1 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 2. Results for model 2 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 3 and 4.
From Figs 3 and 4, the results for models 3 and 4 demonstrate the same performance for all the methods, so we present their results together. For models 3 and 4, the SIM-FDR, Rlasso-adaptive and Rlasso-threshold methods effectively control the actual FDRs at the specified levels across all scenarios. In contrast, the MXKF and Rlasso-cv methods exhibit much higher actual FDRs in almost all scenarios and do not achieve the expected level of FDR control. In addition, SIM-FDR demonstrates significantly better power than the MXKF, Rlasso-adaptive and Rlasso-threshold methods across all scenarios, with power improvements that are substantial and approach 0.7 in some scenarios.
Fig 3. Results for model 3 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 4. Results for model 4 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 5 and 6.
From Figs 5 and 6, the results for models 5 and 6 demonstrate that the SIM-FDR, Rlasso-adaptive and Rlasso-threshold methods effectively control the actual FDRs at the specified levels across all scenarios. The BH method controls the FDR only in some scenarios, while the MXKF and Rlasso-cv methods have much higher actual FDRs in all scenarios and cannot achieve the expected level of FDR control. In terms of power, SIM-FDR significantly outperforms the MXKF, BH, Rlasso-adaptive and Rlasso-threshold methods across all scenarios, with power improvements that are substantial and approach 0.62 in some simulation scenarios. Although the power of Rlasso-cv is the highest, its lack of effective FDR control renders its performance inconsequential.
Fig 5. Results for model 5 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 6. Results for model 6 at moderate sample size case (n = 250).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for small sample size scenario (n = 100)
The performance of all the methods is similar within models 1 and 2 and within models 3 and 4, so we present the results for these two pairs of models in turn.
Results for models 1 and 2.
As shown in Figs 7 and 8, the SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual FDRs at the specified levels for all simulation scenarios. However, the BH and MXKF methods exhibit significantly higher actual FDRs for most simulation scenarios and fail to control the actual FDRs at the desired levels. Regarding these two models, as illustrated in Figs 7 and 8, the MXKF method shows slightly higher power than SIM-FDR, albeit at the cost of higher actual FDRs. In terms of power performance, our method SIM-FDR consistently outperforms the Rlasso-adaptive and Rlasso-threshold methods across all simulation scenarios.
Fig 7. Results for model 1 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 8. Results for model 2 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for models 3 and 4.
From Figs 9 and 10, it can be observed that for models 3 and 4, the SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual false discovery rates (FDRs) to the specified levels in all simulation scenarios. The Benjamini-Hochberg (BH) method shows success in controlling FDR for certain scenarios, while the MXKF method exhibits significantly higher actual FDRs and lacks the ability to control FDRs in all simulation scenarios. In terms of power performance, SIM-FDR demonstrates superiority over the BH, MXKF, Rlasso-adaptive, and Rlasso-threshold methods across all simulation scenarios, with a substantial power improvement of approximately 0.30 in some scenarios.
Fig 9. Results for model 3 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 10. Results for model 4 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Results for model 5.
Since the statistical power of most methods is close to zero in scenarios with small sample sizes for model 6, we will only focus on the results of model 5. From Fig 11, the results show that BH, SIM-FDR, Rlasso-adaptive, and Rlasso-threshold methods effectively control the actual false discovery rates (FDRs) to the specified levels across all scenarios in the small sample simulation scenario. However, the MXKF and Rlasso-cv methods have much higher actual FDRs for all scenarios and fail to achieve the expected level of FDR control. In terms of power performance, SIM-FDR outperforms MXKF, BH, Rlasso-adaptive, and Rlasso-threshold methods across all simulation scenarios.
Fig 11. Results for model 5 at small sample size case (n = 100).
FDRs given in left column and powers given in right column are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
The simulation results evaluated via the Matthews correlation coefficient (MCC).
All MCC results are presented in Figs 12–22. Notably, Figs 12, 13, 14, 15, 21, and 22 demonstrate that our method, SIM-FDR, achieves the highest MCC values across 36 scenarios under the moderate sample size, indicating superior feature selection performance for moderate to large sample sizes. For the small sample size, SIM-FDR consistently outperforms competing methods in the 12 scenarios depicted in Figs 18 and 20, while in Figs 16, 17, and 19 it dominates or significantly outperforms the other methods in 10 out of 18 scenarios and falls behind the MXKF method only in the remaining 8 scenarios. In summary, our method outperforms the other methods in 58 of the total 66 scenarios and ranks second in the remaining 8. However, as shown in Figs 7, 8, and 10, MXKF exhibits substantially higher actual FDRs in these 8 scenarios, failing to control false discoveries and yielding excessive false positives. While MCC serves as a comprehensive evaluation metric, this paper primarily focuses on the FDR and power performance of all the methods. Consequently, MXKF's marginally better MCC in the remaining 8 scenarios may lack practical significance for real-world data analysis.
Fig 12. Results for model 1 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 13. Results for model 2 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 14. Results for model 3 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 15. Results for model 4 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 16. Results for model 1 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 17. Results for model 2 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 18. Results for model 3 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 19. Results for model 4 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 20. Results for model 5 at small sample size case (n = 100).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 21. Results for model 5 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Fig 22. Results for model 6 at moderate sample size case (n = 250).
MCCs are averaged over 100 replications, and their standard deviations (sd) are given on the top of the histogram.
Conclusions of simulating analysis results.
Across almost all scenarios, the actual FDRs of SIM-FDR are far below those of the other methods, so its higher power does not come at the price of an inflated FDR. In summary, compared to the SIM-FDR method, the other existing methods are either underpowered or yield unreliable results with FDRs inflated above the nominal threshold. In particular, the proposed method achieves better feature selection performance for moderate to large sample sizes. These results indicate that the proposed SIM-FDR exhibits robust performance across all scenarios.
Real data analysis results
Ocean microbiome data
Integrative marine data collection efforts such as Tara Oceans [44] or the Simons CMAP provide the means to investigate ocean ecosystems on a global scale. This data set contains p = 35651 miTAG OTUs [45] observed on n = 136 samples. Using Tara's environmental and microbial survey of ocean surface water [45], we apply all the methods to identify miTAG OTUs (omics features) associated with environmental covariates. In particular, salinity is thought to be an important environmental factor in marine microbial ecosystems, so we aim to identify the miTAG OTUs most robustly associated with the response of interest, marine salinity.
Before applying the methods, we conducted a series of preprocessing steps to make the Tara data more amenable to the proposed method. First, following the work of Sunagawa et al. [46], we calculated the read sum of all 35651 miTAG OTUs (omics features) and removed low-abundance OTUs with a read sum of less than 10000 reads per sample. We further retained OTUs that appeared in at least 14 samples, resulting in a new OTU matrix of dimension n = 136 and p = 1015. Second, we normalized the OTU raw read counts into compositional data, with the entries of each row summing to one. Third, we log-transformed the compositional data and took the log-transformed data as the omics features. Following the simulated data analysis, we used the same settings for SIM-FDR but set B = 30 to stabilize its results. We varied the nominal FDR levels from 0 to 0.20 for the real data analysis. The collected raw Tara data set and the preprocessed Tara data set are available in S2 Dataset and S3 Dataset (Supporting information section), respectively.
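The three preprocessing steps can be sketched as follows. The exact abundance and prevalence filtering rule and the handling of zero counts before the log transform are not fully specified in the text, so the thresholds and the pseudocount below are assumptions.

```python
import numpy as np

def preprocess_counts(counts, min_read_sum=10000, min_prevalence=14, pseudo=1.0):
    """Sketch of the Tara preprocessing: (1) drop OTUs whose total
    read sum is below min_read_sum or that appear in fewer than
    min_prevalence samples, (2) normalize each sample (row) to
    relative abundances, (3) log-transform. A pseudocount is added
    before normalizing to avoid log(0) -- an assumption, since the
    paper does not state how zeros are handled."""
    counts = np.asarray(counts, dtype=float)
    keep = (counts.sum(axis=0) >= min_read_sum) & \
           ((counts > 0).sum(axis=0) >= min_prevalence)
    kept = counts[:, keep] + pseudo
    comp = kept / kept.sum(axis=1, keepdims=True)  # rows sum to one
    return np.log(comp)
```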
The results are presented in Table 1. The numbers of taxa identified by the BH and Rlasso-cv methods exceed those of the SIM-FDR, MXKF, Rlasso-threshold and Rlasso-adaptive methods, and MXKF did not identify any taxa at any FDR level. This may align with the simulation results from many scenarios; for example, models 1 and 2 in the small sample scenario (Figs 7 and 8) are comparable, given the sample size of n = 136 and the number of genes p = 1015. Figs 7 and 8 show that the BH and Rlasso-cv methods exhibit higher actual FDRs and fail to control the FDR at the given nominal levels. Therefore, the results in Table 1 may indicate that the BH and Rlasso-cv methods yield more false discoveries, while the SIM-FDR, Rlasso-threshold and Rlasso-adaptive methods may provide fewer but more precise taxa selection results.
Table 1. The number of selected taxa by all the methods under different nominal FDR levels.
| FDR level | 0 | 0.02 | 0.04 | 0.07 | 0.09 | 0.11 | 0.13 | 0.16 | 0.18 | 0.20 |
|---|---|---|---|---|---|---|---|---|---|---|
| BH | 0 | 371 | 444 | 507 | 558 | 588 | 620 | 648 | 671 | 687 |
| MXKF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SIM-FDR | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 6 |
| Rlasso-cv | 0 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 |
| Rlasso-threshold | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Rlasso-adaptive | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
However, we need to further validate that SIM-FDR yields more precise taxon selection than the BH and Rlasso-cv methods. To investigate the feature selection performance of our proposed method, we introduce simulated variables as inactive (non-associated) taxa. Specifically, 500 noise taxa Z1 are drawn from N(0,1) and another 500 noise taxa Z2 are drawn from t(3), all independently and randomly generated. We denote this dataset Tara-Noise; the goal is to identify the important taxa among all 2015 taxa (1015 real taxa and 1000 noise taxa). The noise taxa Z1 and Z2 can be viewed as false taxa that are not associated with the response. At FDR level 0.1, the numbers of noise taxa falsely selected by the methods are shown in Table 2. Both the BH and Rlasso-cv methods mistakenly selected a number of noise taxa, while our method SIM-FDR selected none. This indicates that these two methods indeed produce more false discoveries, whereas our method achieves higher precision in detecting findings.
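The Tara-Noise spike-in construction can be sketched as follows (our own illustration of the design described above):

```python
import numpy as np

def add_noise_features(X, n_noise=1000, df=3, seed=0):
    """Append known-null columns to the real feature matrix:
    half drawn from N(0,1) and half from a Student-t with df
    degrees of freedom, as in the Tara-Noise check. Any appended
    column that a method selects is a known false discovery."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z1 = rng.normal(size=(n, n_noise // 2))
    z2 = rng.standard_t(df, size=(n, n_noise - n_noise // 2))
    return np.hstack([X, z1, z2])
```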
Table 2. The number of selected noise taxa among the 1000 noise taxa using the Tara-Noise data, given the nominal FDR level (0.1).
| Method | the number of selected noise taxa |
|---|---|
| BH | 30 |
| MXKF | 0 |
| SIM-FDR | 0 |
| Rlasso-cv | 2 |
| Rlasso-threshold | 0 |
| Rlasso-adaptive | 0 |
At the nominal FDR level 0.20, the SIM-FDR method identified six taxa associated with the ocean salinity gradients. From Tables 3 and 4, OTU197, OTU741 and OTU2043 come from the class Alphaproteobacteria, OTU1473 comes from the class Deltaproteobacteria, and OTU1439 and OTU520 come from the class Gammaproteobacteria. For the Tara data, Bien et al. [47] proposed a tree-aggregated predictive model and also used their method to conduct taxa selection; however, their selection result differs substantially from that of SIM-FDR. Thus, our results may offer a new perspective on the Tara data.
Table 3. Taxonomic information (kingdom to family) of the six taxa selected by SIM-FDR at the FDR level 0.20.
| Taxa Name | Rank | Kingdom | Phylum | Class | Order | Family |
|---|---|---|---|---|---|---|
| OTU197 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | SAR11 clade | Surface |
| OTU741 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | SAR11 clade | Surface |
| OTU2043 | Life | Bacteria | Proteobacteria | Alphaproteobacteria | Rickettsiales | S25-593 |
| OTU1473 | Life | Bacteria | Proteobacteria | Deltaproteobacteria | Desulfuromonadales | GR-WP33-58 |
| OTU1439 | Life | Bacteria | Proteobacteria | Gammaproteobacteria | KI89A clade | f__134 |
| OTU520 | Life | Bacteria | Proteobacteria | Gammaproteobacteria | E01-9C-26 marine group | f__69 |
Table 4. Genus and species information of the six taxa selected by SIM-FDR at the FDR level 0.20.
| Taxa Name | Genus | Species |
|---|---|---|
| OTU197 | g__118 | AY664083.1.1206 |
| OTU741 | g__409 | EU801445.1.1438 |
| OTU2043 | g__681 | JN166192.1.1464 |
| OTU1473 | g__608 | EF574438.1.1503 |
| OTU1439 | g__601 | FR683972.1.1501 |
| OTU520 | g__304 | JF747664.1.1516 |
Head and neck squamous cell carcinoma data
Head and neck squamous cell carcinoma (HNSCC) is a prevalent and prognostically challenging cancer globally [48]. Since the release of the TCGA-HNSC dataset in 2015, over 1,000 related articles have been published. The original data include a total of 18,409 gene expression values. A prescreening step using marginal Cox models selected the top 2,000 genes with the smallest p-values for downstream analysis. The preprocessed HNSCC dataset, which contains 2,000 gene expression values, the logarithm of survival time, and a censoring indicator, can be downloaded from TCGA Provisional using the R packages cgdsr or GEInter; it is also available in S4 Dataset (Supporting information section).
Here, our objective is to identify potential genes associated with the survival time of HNSCC patients. The results of this analysis help elucidate the molecular mechanisms underlying the occurrence and progression of HNSCC and hold significant implications for future treatment strategies. The preprocessed HNSCC dataset comprises 484 samples with a censoring ratio of approximately 58%, meaning that the true survival times of 58% of the samples are unobserved and 42% are observed. To render this dataset suitable for our method, we used only the samples with observed survival times, resulting in 204 samples for further analysis. We took the 2000 genes as feature variables and employed the original survival times, prior to logarithmic transformation, as the response variable. We then applied all the methods for gene selection on this dataset. Following the simulated data analysis, we used the same settings for SIM-FDR but set B = 30 to stabilize its results. We varied the nominal FDR levels from 0 to 0.20 for this real data analysis.
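The sample-filtering step can be sketched as below; the convention that the censoring indicator equals 1 for an observed event is an assumption, as the TCGA coding is not restated in the text.

```python
import numpy as np

def keep_observed(X, time, censor, observed_code=1):
    """Keep only samples whose survival time is observed, i.e.
    whose censoring indicator equals observed_code (assumed to be
    1 here; check the dataset's coding before use)."""
    mask = np.asarray(censor) == observed_code
    return np.asarray(X)[mask], np.asarray(time)[mask]
```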
The results are presented in Table 5. The number of genes identified by the BH and Rlasso-cv methods exceeds that identified by SIM-FDR, while the MXKF, Rlasso-threshold, and Rlasso-adaptive methods did not identify any genes at any FDR level. This may align with the simulation results from various scenarios, such as models 1 and 2 under moderate sample conditions (Figs 1 and 2), given the sample size of n = 204 and the gene count of p = 2000. Figs 1 and 2 illustrate that the BH and Rlasso-cv methods exhibit higher actual FDRs and fail to control the FDR at the specified nominal levels. Therefore, the results in Table 5 may indicate that the BH and Rlasso-cv methods yield a greater number of false discoveries, whereas the SIM-FDR method may provide fewer but more accurate gene selection results.
Table 5. The number of selected genes by all the methods under different nominal FDR levels.
| FDR level | 0 | 0.02 | 0.04 | 0.07 | 0.09 | 0.11 | 0.13 | 0.16 | 0.18 | 0.20 |
|---|---|---|---|---|---|---|---|---|---|---|
| BH | 0 | 265 | 308 | 360 | 399 | 427 | 463 | 500 | 540 | 592 |
| MXKF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SIM-FDR | 0 | 3 | 3 | 3 | 3 | 3 | 7 | 7 | 7 | 7 |
| Rlasso-cv | 0 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 |
| Rlasso-threshold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Rlasso-adaptive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Next, we introduced simulated feature variables as non-associated genes to further investigate the feature selection performance of all the methods. Specifically, 500 noise genes Z1 with 204 samples were drawn from the normal distribution N(−1,2) and another 500 noise genes Z2 with 204 samples were drawn from the normal distribution N(1,2), all independently and randomly generated. The genes Z1 and Z2 can be viewed as noise genes that are not associated with the survival time response. We denote this dataset HNSCC-Noise; the goal is to identify the important genes among all 3000 genes. We then applied all the methods to the HNSCC-Noise dataset. The numbers of discovered noise genes are listed in Table 6. The BH and Rlasso-cv methods mistakenly selected a number of noise genes, while our method SIM-FDR selected none. This indicates that these two methods indeed produce more false discoveries, whereas our method achieves higher precision in detecting findings.
Table 6. The number of selected noise genes among the 1000 noise genes using the HNSCC-Noise data, given the nominal FDR level ().
| Method | the number of selected noise genes |
|---|---|
| BH | 19 |
| MXKF | 0 |
| SIM-FDR | 0 |
| Rlasso-cv | 2 |
| Rlasso-threshold | 0 |
| Rlasso-adaptive | 0 |
Finally, at the nominal FDR level 0.2, Table 7 presents the seven genes detected by SIM-FDR: SNX14, BICRAL, KIR3DL1, GRAPL, SEMA4D, IL20RA, and POLDIP2. Notably, none of these genes has been reported in the existing literature. Our findings may provide new insights into the HNSCC data, as these genes could enhance our understanding of the carcinogenic mechanisms associated with the cell cycle in HNSCC and may serve as potential biomarkers for HNSCC survival and treatment.
Table 7. The names of selected genes by SIM-FDR at the FDR level 0.2.
| Genes’ Names |
|---|
| SNX14, BICRAL, KIR3DL1, GRAPL, SEMA4D, IL20RA, POLDIP2 |
Discussion and conclusion
In this paper, we first employ a more general single index model to fit omics data. This model is robust and can account for nonlinear associations with unknown distributional random errors and features. Next, based on this model, we develop an effective FDR control procedure for feature selection in high-dimensional single index models. We further apply this procedure to fine-map omics features while controlling the false discovery rate of selection. The results from simulated data indicate that when the linear or nonlinear model has heavy-tailed distributional random errors in moderate sample cases, the proposed SIM-FDR method significantly outperforms competing methods in power performance across nearly all scenarios while effectively controlling the FDR. In small sample cases, our method maintains actual FDRs at the nominal FDR levels for all scenarios, whereas nearly all competing methods fail to control the FDR. Compared to the SIM-FDR method, other methods may either be underpowered or yield inappropriate results, resulting in an inflated FDR relative to the nominal FDR threshold. Overall, these findings suggest that SIM-FDR demonstrates robust performance.
However, there are some aspects of our method that need to be discussed as follows.
Our approach primarily focuses on feature selection and does not involve estimating the link function or true coefficients of the original model for predicting the response. Once features are selected using our method, existing single index models for prediction can be utilized to fit the selected features, and the fitted model can subsequently be employed for prediction.
Estimating the unknown link function in high-dimensional single index models presents a significant challenge. Although our approach does not require estimating the link function, it assumes that the unknown link function is monotonic.
In the theoretical section, our method assumes that the predictors are absolutely continuous and sub-Gaussian. However, Rejchel et al. [32] have demonstrated through simulation studies that the rank-based approach can also be applied in scenarios with discrete-type predictors, which suggests that our method can be employed to analyze data with discrete-type predictors.
In addition, our method is designed for single-index models, not for double-index or multi-index models. After careful consideration, we believe that combining sufficient dimension reduction with the rank-based approach may address this issue, and we hope to extend our method to double-index and multi-index scenarios in future work.
Our method is also not suitable for binary or other discrete-type responses. We plan to explore extensions of SIM-FDR to such cases in future research.
Beyond these points, there has been considerable research interest in utilizing additional information (e.g., phylogenetic information) from microbiome data to enhance detection power while maintaining FDR control [49, 50]. In the future, it will be of interest to incorporate such information into the SIM-FDR framework to further improve the detection power of controlled feature selection.
Supporting information
Acknowledgments
We thank all the women and men who agreed to participate in this study.
Data Availability
All relevant data sets are in the Supporting information files of the paper.
Funding Statement
This work was supported by the National Natural Science Foundation of China (NSFC) (no. 11801571, no. 12171483 and no. 61773401).
References
- 1. Majewski IJ, Bernards R. Taming the dragon: genomic biomarkers to individualize the treatment of cancer. Nat Med. 2011;17(3):304–12. doi: 10.1038/nm.2311
- 2. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34(3). doi: 10.1214/009053606000000281
- 3. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–41. doi: 10.1093/biostatistics/kxm045
- 4. Li H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu Rev Stat Appl. 2015;2(1):73–94. doi: 10.1146/annurev-statistics-010814-020351
- 5. Foster JC, Taylor JMG, Nan B. Variable selection in monotone single-index models via the adaptive LASSO. Stat Med. 2013;32(22):3944–54. doi: 10.1002/sim.5834
- 6. Radchenko P. High dimensional single index models. Journal of Multivariate Analysis. 2015;139:266–82. doi: 10.1016/j.jmva.2015.02.007
- 7. Ganti R, Rao N, Willett RM. Learning single index models in high dimensions. arXiv preprint. 2015. https://arxiv.org/abs/1506.08910
- 8. Luo S, Ghosal S. Forward selection and estimation in high dimensional single index models. Statistical Methodology. 2016;33:172–9. doi: 10.1016/j.stamet.2016.09.002
- 9. Cheng L, Zeng P, Zhu Y. BS-SIM: an effective variable selection method for high-dimensional single index model. Electron J Statist. 2017;11(2). doi: 10.1214/17-ejs1329
- 10. Yang ZR, Balasubramanian K, Liu H. High-dimensional non-Gaussian single index models via thresholded score function estimation. In: International Conference on Machine Learning. 2017. p. 3851–60.
- 11. Dudeja R, Hsu D. Learning single-index models in Gaussian space. In: Conference on Learning Theory. 2018. p. 1887–930.
- 12. Hirshberg DA, Wager S. Debiased inference of average partial effects in single-index models. arXiv preprint. 2018. https://arxiv.org/abs/1811.02547
- 13. Pananjady A, Foster DP. Single-index models in the high signal regime. 2019. https://people.eecs.berkeley.edu/ashwinpm/SIMs.pdf
- 14. Eftekhari H, Banerjee M, Ritov Y. Inference in high-dimensional single-index models under symmetric designs. Journal of Machine Learning Research. 2021;22(27):1–63.
- 15. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x
- 16. Barber RF, Candès EJ. A knockoff filter for high-dimensional selective inference. Ann Statist. 2019;47(5). doi: 10.1214/18-aos1755
- 17. Candès E, Fan Y, Janson L, Lv J. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2018;80(3):551–77. doi: 10.1111/rssb.12265
- 18. Bates S, Candès E, Janson L, Wang W. Metropolized knockoff sampling. Journal of the American Statistical Association. 2020;116(535):1413–27. doi: 10.1080/01621459.2020.1729163
- 19. Romano Y, Sesia M, Candès E. Deep knockoffs. Journal of the American Statistical Association. 2019;115(532):1861–72. doi: 10.1080/01621459.2019.1660174
- 20. Du L, Guo X, Sun W, Zou C. False discovery rate control under general dependence by symmetrized data aggregation. Journal of the American Statistical Association. 2021;118(541):607–21. doi: 10.1080/01621459.2021.1945459
- 21. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2010;72(4):417–73. doi: 10.1111/j.1467-9868.2010.00740.x
- 22. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804. doi: 10.1038/nmeth.2016
- 23. Haury A-C, Mordelet F, Vera-Licona P, Vert J-P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst Biol. 2012;6:145. doi: 10.1186/1752-0509-6-145
- 24. Hu X, Hu Y, Wu F, Leung RWT, Qin J. Integration of single-cell multi-omics for gene regulatory network inference. Comput Struct Biotechnol J. 2020;18:1925–38. doi: 10.1016/j.csbj.2020.06.033
- 25. de Groot P, Nikolic T, Pellegrini S, Sordi V, Imangaliyev S, Rampanelli E, et al. Faecal microbiota transplantation halts progression of human new-onset type 1 diabetes in a randomised controlled trial. Gut. 2021;70(1):92–105. doi: 10.1136/gutjnl-2020-322630
- 26. Aitchison J. The statistical analysis of compositional data. Caldwell, New Jersey: Blackburn Press; 2003.
- 27. Shi P, Zhang A, Li H. Regression analysis for microbiome compositional data. Ann Appl Stat. 2016;10(2):1019–40. doi: 10.1214/16-aoas928
- 28. Cao Y, Lin W, Li H. Two-sample tests of high-dimensional means for compositional data. Biometrika. 2017;105(1):115–32. doi: 10.1093/biomet/asx060
- 29. Sohn MB, Li H. Compositional mediation analysis for microbiome studies. Ann Appl Stat. 2019;13(1). doi: 10.1214/18-aoas1210
- 30. Lu J, Shi P, Li H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics. 2019;75(1):235–44. doi: 10.1111/biom.12956
- 31. Zhang H, Chen J, Li Z. Testing for mediation effect with application to human microbiome data. Statistics in Biosciences. 2019:1–16.
- 32. Rejchel W, Bogdan ML. Rank-based Lasso – efficient methods for high-dimensional robust model selection. Journal of Machine Learning Research. 2020;21:1–47.
- 33. Alexander DH, Lange K. Stability selection for genome-wide association. Genet Epidemiol. 2011;35(7):722–8. doi: 10.1002/gepi.20623
- 34. Li S, Hsu L, Peng J, Wang P. Bootstrap inference for network construction with an application to a breast cancer microarray study. Ann Appl Stat. 2013;7(1):391–417. doi: 10.1214/12-AOAS589
- 35. Hofner B, Boccuto L, Göker M. Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinformatics. 2015;16:144. doi: 10.1186/s12859-015-0575-3
- 36. Wang F, Mukherjee S, Richardson S, Hill SM. High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking. Stat Comput. 2020;30(3):697–719. doi: 10.1007/s11222-019-09914-9
- 37. Haury A-C, Mordelet F, Vera-Licona P, Vert J-P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst Biol. 2012;6:145. doi: 10.1186/1752-0509-6-145
- 38. Werner T. Loss-guided stability selection. Advances in Data Analysis and Classification. 2023:1–26.
- 39. Zhou J, Sun J, Liu Y, Hu J, Ye J. Patient Risk Prediction Model via Top-k Stability Selection. In: Proceedings of the 2013 SIAM International Conference on Data Mining. 2013. p. 55–63. doi: 10.1137/1.9781611972832.7
- 40. Efron B. Bootstrap methods: another look at the jackknife. Ann Statist. 1979;7(1). doi: 10.1214/aos/1176344552
- 41. Kushary D. Bootstrap methods and their application. Technometrics. 2000;42(2):216–7. doi: 10.1080/00401706.2000.10486018
- 42. Johnson RW. An introduction to the bootstrap. Teaching Statistics. 2001;23(2):49–54. doi: 10.1111/1467-9639.00050
- 43. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7
- 44. Sunagawa S, Acinas SG, Bork P, Bowler C, et al. Tara Oceans: towards global ocean ecosystems biology. Nat Rev Microbiol. 2020;18(8):428–45. doi: 10.1038/s41579-020-0364-5
- 45. Logares R, Sunagawa S, Salazar G, Cornejo-Castillo FM, Ferrera I, Sarmento H, et al. Metagenomic 16S rDNA Illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities. Environ Microbiol. 2014;16(9):2659–71. doi: 10.1111/1462-2920.12250
- 46. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, et al. Ocean plankton. Structure and function of the global ocean microbiome. Science. 2015;348(6237):1261359. doi: 10.1126/science.1261359
- 47. Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep. 2021;11(1):14505. doi: 10.1038/s41598-021-93645-3
- 48. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517(7536):576–82. doi: 10.1038/nature14129
- 49. Xiao J, Cao H, Chen J. False discovery rate control incorporating phylogenetic tree increases detection power in microbiome-wide multiple testing. Bioinformatics. 2017;33(18):2873–81. doi: 10.1093/bioinformatics/btx311
- 50. Hu J, Koh H, He L, Liu M, Blaser MJ, Li H. A two-stage microbial association mapping framework with advanced FDR control. Microbiome. 2018;6(1):131. doi: 10.1186/s40168-018-0517-1