|
Algorithm 1: Steps to implement the proposed algorithm |
-
1.
Load the data set into R and assign classes 1 and 0 to the two selected groups of cells to form a binary classification problem.
-
2.
Shuffle cells within each class to randomize the data points.
-
3.
Remove genes with no variability in expression across all cells.
-
4.
Split the data set into training (90%) and test (10%) sets for 10-fold cross-validation.
-
(a)
Fit ridge, lasso, elastic net, and drop lasso.
-
(b)
Find the top important genes from each method. The top genes are those whose coefficients are above a cutoff (the mean of the absolute values of the coefficients).
-
(c)
Form a gene pool by taking the union of the top important genes from the four models; for instance, Figure 3 and Figure 4 represent the gene pools of data sets GSE123818 and GSE71585, respectively.
-
(d)
Fit SGL with the new gene pool pre-grouped by hierarchical clustering.
-
(e)
Save the coefficients of SGL.
-
(f)
Repeat steps (a) to (e) for each of the 10 cross-validation folds.
-
5.
Calculate the average coefficient of each gene across the 10 folds and sort the genes.
-
6.
Plot the genes against their coefficients and select the final set of genes using an elbow curve.
-
7.
Cluster all the cells by applying K-means clustering on the top important genes.
|
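
The selection logic in steps 4(b), 4(c), and 5 can be illustrated with a minimal sketch. It is written in plain Python rather than R purely for illustration and does not reproduce the authors' code. The function names (`top_genes`, `gene_pool`, `average_across_folds`) and the toy coefficients are hypothetical. Two choices are assumptions not stated in the algorithm: a gene absent from a fold's pool is treated as having coefficient 0 in that fold, and genes are sorted by the absolute value of their average coefficient.

```python
from statistics import mean


def top_genes(coefs):
    """Step 4(b): keep genes whose |coefficient| exceeds the mean |coefficient|."""
    cutoff = mean(abs(c) for c in coefs.values())
    return {gene for gene, c in coefs.items() if abs(c) > cutoff}


def gene_pool(models):
    """Step 4(c): union of the top genes over all fitted models."""
    pool = set()
    for coefs in models.values():
        pool |= top_genes(coefs)
    return pool


def average_across_folds(fold_coefs):
    """Step 5: average each gene's coefficient over the folds and sort.

    Assumption: a gene missing from a fold contributes 0 for that fold.
    Assumption: sorting is by |average coefficient|, largest first.
    """
    genes = set()
    for coefs in fold_coefs:
        genes |= set(coefs)
    n_folds = len(fold_coefs)
    averages = {
        gene: sum(coefs.get(gene, 0.0) for coefs in fold_coefs) / n_folds
        for gene in genes
    }
    return sorted(averages.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical coefficients from two of the four models in one fold.
models = {
    "ridge": {"g1": 1.0, "g2": 0.25, "g3": -0.25},
    "lasso": {"g1": 0.0, "g2": 0.0, "g3": -0.75},
}
pool = gene_pool(models)

# Hypothetical SGL coefficients saved from two folds (step 4(e)).
fold_coefs = [{"g1": 0.5, "g3": -0.25}, {"g1": 1.0}]
ranked = average_across_folds(fold_coefs)
```

In this toy case the ridge cutoff is 0.5 and the lasso cutoff is 0.25, so each model contributes a single gene to the pool. The ranked list from the last call is what would be plotted in step 6 to locate the elbow.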