[Preprint]. 2023 Feb 28:rs.3.rs-2609859. [Version 1] doi: 10.21203/rs.3.rs-2609859/v1

This work is licensed under a Creative Commons Attribution 4.0 International License, which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.

Fig. 2 | a. A synthetic dataset consisting of N = 50,000 samples × p = 1,000 features was generated. Some features are correlated with the outcome (informative features, light blue), while the others are not (uninformative features, grey). Forty thousand samples are held out for validation. Out of the remaining 10,000, 50 sets with sample sizes n ranging from 30 to 1,000 are drawn randomly. b. Three metrics are used to evaluate performance: sparsity (average number of selected features compared to the number of informative features), reliability (Jaccard Index, JI, comparing the true set of informative features to the selected feature set), and predictivity (mean squared error, MSE). c. The surrogate for the false discovery proportion (FDP+, red line) and the experimental false discovery rate (FDR, dotted line) are shown as a function of the frequency threshold. An example is shown for n = 150 samples and 25 informative features (all other conditions are shown in Fig. S1). The FDP+ estimate approaches the experimental FDR around the reliability threshold, θ. d-f. Sparsity (d), reliability (JI, e), and predictivity (MSE, f) performances of Stabl (red box plots) and the least absolute shrinkage and selection operator (Lasso, grey box plots) as a function of the number of samples (n, x-axis) for 10 (left panels), 25 (middle panels), or 50 (right panels) informative features. g-i. Sparsity (g), reliability (h), and predictivity (i) performances of models built using a data-driven reliability threshold θ (Stabl, red lines) or a fixed frequency threshold (i.e., SS) of 30% (light grey lines), 50% (Lasso, dark grey lines), or 80% (black lines). The feature set selected by Stabl remains closer in number (sparsity) and composition (reliability) to the true set of informative features, while achieving predictive performance superior or comparable to that of models built using a fixed threshold. j. The reliability threshold chosen by Stabl is shown as a function of the sample size (n, x-axis) for 10 (left panel), 25 (middle panel), or 50 (right panel) informative features. Benchmarking of Stabl against elastic net (EN) is shown in Fig. S6.
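The three metrics named in panel b can be illustrated with a minimal Python sketch. The function names are hypothetical and this is not the authors' code. Expressing sparsity as a ratio of selected to informative features is an assumption made here for illustration, since the caption only says the two counts are compared. Treating two empty feature sets as a perfect match is likewise an assumed convention.

```python
def jaccard_index(true_features, selected_features):
    """Reliability: |intersection| / |union| of the true and selected feature sets."""
    true_set, selected_set = set(true_features), set(selected_features)
    union = true_set | selected_set
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(true_set & selected_set) / len(union)


def sparsity_ratio(selected_features, n_informative):
    """Sparsity, expressed here (an assumption) as selected count / informative count."""
    return len(set(selected_features)) / n_informative


def mean_squared_error(y_true, y_pred):
    """Predictivity: mean of squared differences between outcomes and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: 4 informative features (indices 0-3), 3 selected, 2 of them correct.
informative = [0, 1, 2, 3]
selected = [2, 3, 4]
ji = jaccard_index(informative, selected)            # 2 shared / 5 in union
ratio = sparsity_ratio(selected, len(informative))   # 3 selected / 4 informative
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Under these definitions, a selection that matches the informative set exactly gives a Jaccard Index of 1 and a sparsity ratio of 1. Selecting extra uninformative features raises the ratio above 1 and lowers the Jaccard Index.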