A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data

2022 Oct 12;11(10):1495. doi: 10.3390/biology11101495

Algorithm 1: Steps to implement the proposed algorithm

1. Load the data set into R and assign classes 1 and 0 to the two selected groups of cells to form a binary classification problem.
2. Shuffle the cells within each class to randomize the data points.
3. Remove genes with no variability in expression across all cells.
4. Split the data set into training (90%) and test (10%) sets for 10-fold cross-validation (CV).
   - (a) Fit ridge, lasso, elastic net, and drop lasso.
   - (b) Find the top genes from each method. The top genes are those whose coefficients are above a cutoff (the mean of the absolute values of the coefficients).
   - (c) Form a gene pool by taking the union of the top genes from the four models; for instance, Figure 3 and Figure 4 show the gene pools of data sets GSE123818 and GSE71585, respectively.
   - (d) Fit sparse group lasso (SGL) on the new gene pool, with the genes pre-grouped by hierarchical clustering.
   - (e) Save the SGL coefficients.
   - (f) Repeat steps (a) to (e) for each of the 10 folds.
5. Calculate the average coefficient of each gene across the 10 folds and sort the genes by this average.
6. Plot the sorted genes against their average coefficients and select the final set of genes using the elbow of the curve.
7. Cluster all the cells by applying K-means clustering to the final set of selected genes.
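
The bookkeeping in steps 4(b), 4(c), 5, and 6 can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the model fits themselves (ridge, lasso, elastic net, drop lasso, SGL) are omitted and replaced by made-up toy coefficient dictionaries, and all function names are hypothetical. Three assumptions go beyond what Algorithm 1 states: the cutoff is compared against the absolute value of each coefficient, a gene missing from a fold's gene pool contributes a coefficient of 0 to the average, and the elbow is located with a simple "farthest point from the chord" heuristic rather than by visual inspection.

```python
from statistics import mean


def top_genes(coefs):
    """Step 4(b): genes above the cutoff (mean absolute coefficient).

    Assumption: the comparison uses |coefficient|, so strongly negative
    coefficients are retained as well.
    """
    cutoff = mean(abs(c) for c in coefs.values())
    return {gene for gene, c in coefs.items() if abs(c) > cutoff}


def gene_pool(model_coefs):
    """Step 4(c): union of the top genes over all fitted models."""
    pool = set()
    for coefs in model_coefs.values():
        pool |= top_genes(coefs)
    return pool


def average_across_folds(fold_coefs):
    """Step 5: average coefficient per gene, sorted by decreasing magnitude.

    Assumption: a gene absent from a fold's pool counts as 0 in that fold.
    Ties are broken by gene name to keep the ordering deterministic.
    """
    genes = set().union(*fold_coefs)
    n_folds = len(fold_coefs)
    avg = {g: sum(f.get(g, 0.0) for f in fold_coefs) / n_folds for g in genes}
    return sorted(avg.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


def elbow_index(values):
    """Step 6 (hypothetical heuristic): index of the point farthest from
    the straight line joining the first and last points of the curve."""
    n = len(values)
    if n < 3:
        return max(n - 1, 0)
    first, last = values[0], values[-1]
    dists = [abs((last - first) * i - (n - 1) * (v - first))
             for i, v in enumerate(values)]
    return max(range(n), key=dists.__getitem__)


# Toy coefficients for one fold (illustrative values, not real data).
models = {
    "ridge": {"g1": 0.9, "g2": 0.1, "g3": -0.8, "g4": 0.0},
    "lasso": {"g1": 0.5, "g2": 0.0, "g3": 0.0, "g4": 0.3},
}
pool = gene_pool(models)

# Toy SGL coefficients saved from two folds (step 4(e)).
sgl_folds = [{"g1": 1.0, "g3": -0.5}, {"g1": 0.5, "g4": 0.25}]
ranked = average_across_folds(sgl_folds)

# Keep every gene up to and including the elbow point.
cut = elbow_index([abs(c) for _, c in ranked])
selected = [gene for gene, _ in ranked[:cut + 1]]

print(sorted(pool))
print(ranked)
print(selected)
```

In the procedure itself, the inputs to `gene_pool` would be the coefficients of the four fitted models in a given fold, and the input to `average_across_folds` would be the SGL coefficients saved from each of the 10 folds. The automatic elbow rule is only one possible stand-in for reading the elbow off the plot.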