2023 Apr 11;6:394. doi: 10.1038/s42003-023-04764-8

Fig. 2. Feature selection to account for cancer type and tumor purity imbalance.


a Boxplots of the genomics-based tumor purity across 20 cancer types from TCGA, with blue and red lines marking the bottom and top 20% of samples by purity; bar plots show the genomics-based tumor purity separated into low, medium, and high purity ranges. b The first step of the feature selection strategy: lasso feature selection, cross-validated with cancer types as folds, was performed separately on samples in the low-mid (purity 0.17–0.72) and mid-high (purity 0.38–0.97) ranges, and the two resulting feature sets were intersected, yielding 167 genes. c The second step of the feature selection strategy: starting from the features selected in the first step, lasso feature selection was performed iteratively across all purity ranges, again cross-validated with cancer types as folds, yielding the 158 gene features used in the final model. In the boxplots in (a), the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends to the largest value no further than 1.5 times the interquartile range above the upper hinge, and the lower whisker extends to the smallest value no further than 1.5 times the interquartile range below the lower hinge. Points beyond the ends of the whiskers are plotted individually.
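
The boxplot convention described for panel (a) can be illustrated with a minimal sketch. This is not the authors' plotting code: the function name `boxplot_stats`, the choice of linear-interpolation quartiles, and the purity values are all assumptions made for illustration only.

```python
from statistics import quantiles


def boxplot_stats(values, coef=1.5):
    """Hinges, whiskers, and outliers under the 1.5 x IQR rule from the caption."""
    data = sorted(values)
    # Assumption: quartiles use linear interpolation (the "inclusive" method);
    # the caption does not say which quartile definition the authors used.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - coef * iqr
    upper_fence = q3 + coef * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "lower_hinge": q1,
        "median": median,
        "upper_hinge": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        # Points beyond the whiskers are the ones plotted individually.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up purity values for a single hypothetical cancer type.
purities = [0.10, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75, 0.95]
stats = boxplot_stats(purities)
print(stats)
```

With these illustrative values, the two extreme samples fall outside the fences and would be drawn as individual points, while the whiskers stop at the nearest samples inside the fences.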