
This is a preprint.

It has not yet been peer reviewed by a journal.


bioRxiv [Preprint]. 2023 May 12:2023.05.09.540043. [Version 1] doi: 10.1101/2023.05.09.540043

Feature selection for preserving biological trajectories in single-cell data

Jolene S Ranek 1,2, Wayne Stallaert 3, Justin Milner 4, Natalie Stanley 2,5,*, Jeremy E Purvis 1,2,*
PMCID: PMC10197710  PMID: 37214963

Abstract

Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While examining cells along a computationally ordered pseudotime offers the potential to study how subtle changes in gene or protein expression impact cell fate decision-making, identifying the characteristic features that drive continuous biological processes remains difficult in unenriched and noisy single-cell data. Because all profiled sources of feature variation contribute to the cell-to-cell distances that define an inferred cellular trajectory, including confounding sources of biological variation (e.g. cell cycle or metabolic state) or noisy and irrelevant features (e.g. measurements with low signal-to-noise ratio) can mask the underlying trajectory of study and hinder inference. Here, we present DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of the cell cycle and cellular differentiation, we demonstrate that DELVE selects features that more accurately characterize cell populations and improve the recovery of cell type transitions. This feature selection framework provides an alternative approach for improving trajectory inference and uncovering co-variation amongst features along a biological trajectory. 
DELVE is implemented as an open-source Python package and is publicly available at: https://github.com/jranek/delve.

Introduction

High-throughput single-cell technologies, such as flow and mass cytometry [1, 2, 3], single-cell RNA sequencing [4, 5, 6, 7], and imaging-based profiling techniques [8, 9, 10, 11] have transformed our ability to study how cell populations respond and dynamically change during processes like development [12, 13, 14, 15] and immune response [16, 17, 18]. By profiling many features (e.g. proteins or genes) for many thousands of cells from a biological sample, these technologies provide high-dimensional snapshot measurements that can be used to gain fundamental insights into the molecular mechanisms that govern phenotypic changes.

Trajectory inference methods [19] have been developed to model dynamic biological processes from snapshot single-cell data. By assuming cells are asynchronously changing over time such that a profiled biological sample from a single experimental time point describes a range of the underlying dynamic process, computational trajectory inference approaches have leveraged minimum spanning tree approaches [20, 21, 22], curve-fitting [23, 24], graph-based techniques [25, 26, 27], probabilistic approaches [28, 29, 30], or optimal transport [31, 32] to order cells based on their similarities in feature expression. Once a trajectory model is fit, regression [33, 34, 35] can be performed along estimated pseudotime (e.g. distance through the inferred trajectory from a start cell) to identify specific cell state changes associated with differentiation or disease trajectories. Moreover, these inferred cellular trajectories have the potential to elucidate higher-order gene interactions [36], gene regulatory networks [37], predict cell fate probabilities [29], or find shared mechanisms of expression dynamics across disease conditions or species [38, 39].

While trajectory analysis has proven useful in the context of single-cell biology, the identification of characteristic genes or proteins that drive continuous biological processes relies on having inferred accurate cellular trajectories, which can be challenging, especially when trajectory inference is performed on the original full unenriched dataset. Single-cell data are noisy measurements that suffer from limitations in detection sensitivity, where dropout [40], low signal-to-noise, or sample degradation [41] can result in spurious signals that can overwhelm true biological differences. Furthermore, all profiled sources of feature variation contribute to the cell-to-cell distances that define the inferred cellular trajectory; thus, including confounding sources of biological variation (e.g. cell cycle, metabolic state) or irrelevant features (e.g. extracted imaging measurements that contain low signal-to-noise ratio) can distort or mask the intended trajectory of study [42, 43]. With the accumulation of large-scale single-cell data and multi-modal measurements [44], appropriate filtering of noisy, information-poor, or irrelevant features can serve as a crucial and necessary step for cell type identification, inference of dynamic phenotypes, and identification of putative driver features (e.g. genes, proteins).

Feature selection methods [45] are a class of supervised or unsupervised approaches that can remove redundant or information-poor features prior to performing trajectory inference, and therefore, they have great potential for improving the interpretation of downstream analysis, while easing the computational burden by reducing dataset dimension. In the supervised-learning regime, classification-based [46] or information-theoretic approaches [47, 48] have been used to evaluate features according to their discriminative power or association with cell types. Despite having great power to detect biologically-relevant features, these methods rely on expensive or laborious manual annotations (e.g. cell types), which are often unavailable [49], thus precluding their use. In the unsupervised-learning regime, computational approaches often aim to identify relevant features based on intrinsic properties of the complete dataset; however, these methods have some limitations with respect to retaining features that are useful for defining cellular trajectories. Namely, although unsupervised variance-based approaches [50, 51], which effectively sample features based on their overall variation across cells, have been extensively used to identify features that define cell types without the need for ground truth annotations, (1) they can be overwhelmed by noisy or irrelevant features that dominate data variance, and (2) they are insensitive to lineage-specific features (e.g. transcription factors) that have a small variance and gradual progression of expression. Alternatively, unsupervised similarity-based [52, 53, 25] or subspace-learning [54, 55] feature selection methods evaluate features according to their association with a cell-similarity graph defined by all features or the underlying structure of the data (e.g. pairwise similarities defined by uniform manifold approximation and projection (UMAP) [56], eigenvectors of the graph Laplacian matrix [57]). 
While these approaches have the potential to detect smoothly varying genes or proteins that define cellular transitions, they rely on the cell-similarity graph from the full dataset and can fail to identify relevant features when the number of noisy features outweighs the number of informative ones [58, 59].

To address these limitations, we developed DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of molecular features that robustly recapitulate cellular trajectories. In contrast to previous work [55, 52, 50, 54, 53, 25], DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding feature selection and trajectory inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes. Features are then ranked for selection according to their association with the underlying cell trajectory graph using data diffusion techniques. We demonstrate the power of our approach for improving inference of cellular trajectories through achieving an increased sensitivity to detect diverse and dynamically expressed features that better delineate cell types and cell type transitions from single-cell RNA sequencing and protein immunofluorescence imaging data. Overall, this feature selection framework provides an alternative approach for uncovering co-variation amongst features along a biological trajectory.

Results

Overview of the DELVE algorithm

We propose DELVE, an unsupervised feature selection framework for modeling dynamic cell state transitions using graph neighborhoods (Fig. 1). Our approach extends previous unsupervised similarity-based [52, 53, 25] or subspace-learning feature selection [55] methods by computing the dependence of each gene on the cellular trajectory graph structure using a two-step approach. Inspired by the molecular events that occur during differentiation, where the coordinated spatio-temporal expression of key regulatory genes governs lineage specification [60, 61, 62, 63], we reasoned that we can approximate cell state transitions by identifying groups of features that are temporally co-expressed or co-regulated along the underlying dynamic process.

Figure 1: Schematic overview of the DELVE pipeline.


Feature selection is performed in a two-step process. In step 1, DELVE clusters features according to their expression dynamics along local representative cellular neighborhoods defined by a weighted k-nearest neighbor affinity graph. Neighborhoods are sampled using a distribution-focused sketching algorithm that preserves cell-type frequencies and spectral properties of the original dataset [66]. A permutation test with a variance-based test statistic is used to determine whether a set of features is (1) dynamically changing (dynamic) or (2) exhibiting random patterns of variation (static). In step 2, dynamic modules are used to seed or initialize an approximate cell trajectory graph and the trajectory is refined by ranking and selecting features that best preserve the local structure using the Laplacian Score [52]. In this study, we compare DELVE to alternative unsupervised feature selection approaches on how well selected features preserve cell types and cell type transitions according to several metrics.

In step one, DELVE identifies groups of features that are temporally co-expressed by clustering features according to their average pairwise change in expression across prototypical cellular neighborhoods (Figure 1 step 1). As has been done previously [64, 65], we model cell states using a weighted k-nearest neighbor (k-NN) affinity graph, where nodes represent cells and edges describe the transcriptomic or proteomic similarity amongst cells according to all profiled features. Here, DELVE leverages a distribution-focused sketching method [66] to effectively sample cellular neighborhoods across all cell types. This sampling approach has three main advantages: (1) cellular neighborhoods are more reflective of the distribution of cell states, (2) redundant cell states are removed, and (3) fewer cellular neighborhoods are required to estimate feature dynamics resulting in increased scalability. Following clustering, each DELVE module contains a set of features with similar local changes in co-variation across prototypical cell states along the cellular trajectory. Feature-wise permutation testing is then used to assess the significance of dynamic expression variation across grouped features as compared to random assignment. By identifying and excluding modules of features that have static, random, or noisy patterns of expression variation, this approach effectively mitigates the effect of unwanted sources of variation confounding feature ranking and selection, and subsequent trajectory inference.

In step two, DELVE approximates the underlying cellular trajectory by constructing an affinity graph between cells, where cell similarity is now redefined according to a core set of dynamically expressed regulators. All profiled features are then ranked according to their association with the underlying cellular trajectory graph using graph signal processing techniques [67, 68] (Fig. 1 step 2). More concretely, a graph signal is any function that assigns a real value to every node of the graph. In this context, we consider all features as graph signals and rank them according to their total variation in expression along the cellular trajectory graph using the Laplacian Score [52]. Intuitively, DELVE retains features that are considered to be globally smooth, or have similar expression values amongst similar cells along the approximate cellular trajectory graph. In contrast, DELVE excludes features that have a high total variation in signal, or expression values that are rapidly oscillating amongst neighboring cells, as these features are likely noisy or not involved in the underlying dynamic process that was seeded. The output of DELVE is a ranked set of features that best preserve the local trajectory structure. For a more detailed description of the problem formulation, the mathematical foundations behind feature ranking, and the impact of nuisance features on trajectory inference, see DELVE in the Methods section.
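
To make the ranking in step two concrete, the following is a minimal sketch of the Laplacian Score [52] in its standard formulation, assuming a precomputed symmetric affinity matrix between cells. The function name, the dense-matrix representation, and the toy path graph are illustrative assumptions and are not the DELVE package API. Lower scores indicate features that vary smoothly along the graph.

```python
import numpy as np

def laplacian_score(X, W):
    """Laplacian Score of each feature (lower = smoother on the graph).

    X: (n_cells, n_features) expression matrix (hypothetical input).
    W: (n_cells, n_cells) symmetric, non-negative affinity matrix.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                 # node degrees
    L = np.diag(d) - W                # unnormalized graph Laplacian
    # Subtract the degree-weighted mean of every feature.
    Xc = X - (d @ X) / d.sum()
    num = np.sum(Xc * (L @ Xc), axis=0)          # f^T L f (total variation)
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)  # f^T D f (weighted variance)
    # Constant features have zero weighted variance; rank them last.
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), np.inf)

# Toy example: six cells connected as a path graph.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
smooth = np.arange(n, dtype=float)             # gradual change along the path
oscillating = np.array([0., 1., 0., 1., 0., 1.])  # flips between neighbors
scores = laplacian_score(np.column_stack([smooth, oscillating]), W)
ranking = np.argsort(scores)  # the smooth feature is ranked first
```

In this toy example both features have the same total variation along the graph, so the ranking is driven by the normalization with the degree-weighted variance, which is what distinguishes the Laplacian Score from raw total variation.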

DELVE outperforms existing feature selection methods on representing cellular trajectories in the presence of single-cell RNA sequencing noise

Although feature selection is a common preprocessing step in single-cell analysis [69] with the potential to reveal cell-type transitions that would have been masked in the original high-dimensional feature space [42], there has been no systematic evaluation of feature selection method performance on identifying biologically-relevant features for trajectory analysis in single-cell data, especially in the context of noisy data that contain biological or technical challenges (e.g. low total mRNA count, low signal-to-noise ratio, or dropout). In this study, we compared DELVE to eleven other feature selection approaches and evaluated methods on their ability to select features that represent cell types and cell type transitions using simulated data where the ground truth was known. In the sections below, we will describe an overview of the feature selection methods considered and outline the simulation design and evaluation criteria in more detail. We will then provide qualitative and quantitative assessments of how noise impacts feature selection method performance and subsequent inference of cellular trajectories.

Overview of feature selection methods

We performed a systematic evaluation of twelve feature selection methods for preserving cellular trajectories in noisy single-cell data. Feature selection methods were grouped into five general categories prior to evaluation: supervised, similarity, subspace-learning, variance, and baseline approaches. For more details on the feature selection methods implemented and hyperparameters, see Benchmarked feature selection methods and Supplementary Table 1.

Supervised approaches:

To illustrate the performance of ground-truth feature selection that could be obtained through supervised learning on expert annotated cell labels, we performed Random Forest classification. Random Forest classification [46] is a supervised ensemble learning algorithm that uses an ensemble of decision trees to partition the feature space such that all of the cells with the same cell type label are grouped together. Here, each decision or split of a tree was chosen by minimizing the Gini Impurity score [70]. This approach was included to provide context for unsupervised feature selection method performance.

Similarity approaches:

We considered four similarity-based approaches as unsupervised feature selection methods that rank features according to their association with a cell similarity graph defined by all profiled features (e.g. Laplacian Score, neighborhood variance, hotspot) or dynamically-expressed features (e.g. DELVE). First, the Laplacian Score (LS) [52] is an unsupervised locality-preserving feature selection method that ranks and selects features according to (1) the total variation in feature expression across neighboring cells using a cell similarity graph defined by all features and (2) a feature’s global variance. Next, neighborhood variance [25] is an unsupervised feature selection method that selects features with gradual changes in expression for building biological trajectories. Here, features are selected if their variance in expression across local cellular neighborhoods is smaller than their global variance. Hotspot [53] performs unsupervised feature selection through a local autocorrelation test statistic that measures the association of a gene’s expression with a cell similarity graph defined by all features. Lastly, DELVE (dynamic selection of locally covarying features) is an unsupervised feature selection method that ranks features according to their association with the underlying cellular trajectory graph. First, features are clustered into modules according to changes in expression across local representative cellular neighborhoods. Next, modules of features with dynamic expression patterns (denoted as dynamic seed) are used to construct an approximate cellular trajectory graph. Features are then ranked according to their association with the approximate cell trajectory graph using the Laplacian Score [52].
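
As an illustration of the neighborhood variance criterion described above, the sketch below selects a feature when its variance across local k-nearest neighbor neighborhoods is smaller than its global variance. The brute-force neighbor search, the normalization by (n·k − 1), and the function name are assumptions made for this example and may differ from the reference implementation [25].

```python
import numpy as np

def neighborhood_variance_selection(X, k=2):
    """Return (selected mask, local variance, global variance) per feature.

    X: (n_cells, n_features) matrix; neighbors are found with Euclidean
    distance over all features (brute force, for illustration only).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)               # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]         # (n, k) neighbor indices
    diffs = X[:, None, :] - X[nn]              # (n, k, n_features)
    local_var = (diffs ** 2).sum(axis=(0, 1)) / (n * k - 1)
    global_var = X.var(axis=0, ddof=1)
    return local_var < global_var, local_var, global_var

# Toy data: feature 0 changes gradually, feature 1 flips sign between neighbors.
n_cells = 50
gradual = np.linspace(0.0, 10.0, n_cells)
flipping = 0.05 * np.where(np.arange(n_cells) % 2 == 0, 1.0, -1.0)
selected, local_var, global_var = neighborhood_variance_selection(
    np.column_stack([gradual, flipping]), k=2
)
```

Here the gradually changing feature is retained, while the feature that alternates between neighboring cells is rejected because its local variance exceeds its global variance.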

Subspace learning approaches:

We considered two subspace-learning feature selection methods as unsupervised methods that rank features according to how well they preserve the overall cluster structure (e.g. MCFS) or manifold structure (e.g. SCMER) of the data. First, multi-cluster feature selection (MCFS) [55] is an unsupervised feature selection method that selects features that best preserve the multi-cluster structure of data by solving an L1 regularized least squares regression problem on the spectral embedding defined by all profiled features. The optimization is solved using the least angles regression algorithm [71]. Next, single-cell manifold-preserving feature selection (SCMER) [54] is an unsupervised feature selection method that selects a subset of features that best preserves the pairwise similarity matrix between cells defined in uniform manifold approximation and projection [56] based on all profiled features. To do so, it uses elastic net regression to find a sparse solution that minimizes the KL divergence between a pairwise similarity matrix between cells defined by all features and one defined using only the selected features.

Variance approaches:

We considered two variance-based feature selection approaches (e.g. highly variable genes [50], max variance) as unsupervised methods that use global expression variance as a metric for ranking feature importance. First, highly variable gene selection (HVG) [50] is an unsupervised feature selection method that selects features according to a normalized dispersion measure. Here, features are binned based on their average expression. Within each bin, genes are then z-score normalized to identify features that have a large variance, yet a similar mean expression. Next, max variance is an unsupervised feature selection method that ranks and selects features that have a large global variance in expression.

Baseline approaches:

We considered three baseline strategies (e.g. all, random, dynamic seed) that provide context for the overall performance of feature selection. First, all features illustrates the performance when feature selection is not performed and all features are included for analysis. Second, random features represents the performance when a random subset of features is sampled. Lastly, dynamic seed features indicates the performance from dynamically-expressed features identified in step 1 of the DELVE algorithm prior to feature ranking and selection.

Single-cell RNA sequencing simulation design

To validate our approach and benchmark feature selection methods on representing cellular trajectories, we simulated 90 single-cell RNA sequencing datasets with 1500 cells and 500 genes using Splatter. Splatter [72] simulates single-cell RNA sequencing data with various trajectory structures (e.g. linear, bifurcation, tree) using a gamma-Poisson hierarchical model. Importantly, this approach provides ground truth reference information (e.g. cell type annotations, differentially expressed genes per cell type and trajectory, and a latent vector that describes an individual cell’s progression through the trajectory) that we can use to robustly assess feature selection method performance, as well as quantitatively evaluate the limitations of feature selection strategies for trajectory analysis. Moreover, to comprehensively evaluate feature selection methods under common biological and technical challenges associated with single-cell RNA sequencing data, we added relevant sources of single-cell noise to the simulated data. First, we simulated low signal-to-noise ratio by enforcing a mean-variance relationship amongst genes; this ensures that lowly expressed genes are more variable than highly expressed genes. Next, we modified the total number of profiled mRNA transcripts, or library size. Library size has been shown previously to vary amongst cells within a single-cell experiment and can influence both the detection of differentially expressed genes [73] and the reproducibility of the inferred lower-dimensional embedding [74]. Lastly, we simulated the inefficient capture of mRNA molecules, or dropout, by undersampling gene expression from a binomial distribution; this increases the amount of sparsity present within the data. For more details on the Splatter simulation, see Splatter simulation. 
For each simulated trajectory, we performed feature selection according to all described feature selection strategies, and considered the top 100 ranked features for downstream analysis and evaluation.
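
The dropout step described above can be illustrated with a short sketch of binomial undersampling, in which each observed count is thinned by keeping every transcript independently with a fixed probability. The single global capture probability, the function name, and the toy count matrix are assumptions made for this example; the exact parameterization used in the simulations is described in Splatter simulation.

```python
import numpy as np

def simulate_dropout(counts, capture_prob, seed=0):
    """Thin a count matrix by binomial undersampling.

    counts: non-negative integer array (cells x genes).
    capture_prob: hypothetical per-transcript capture probability in [0, 1].
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    # Each of the `counts` transcripts is retained with probability capture_prob.
    return rng.binomial(counts, capture_prob)

counts = np.array([[5, 0, 3],
                   [10, 2, 0]])
thinned = simulate_dropout(counts, capture_prob=0.5)
sparsity_before = np.mean(counts == 0)
sparsity_after = np.mean(thinned == 0)  # never lower than before thinning
```

Because thinning can only remove transcripts, the fraction of zero entries can only stay the same or increase, which mirrors the increase in sparsity noted above.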

Qualitative assessment of feature selection method performance

Prior to evaluating feature selection method performance quantitatively, we began our analysis with a qualitative assessment of the importance of feature selection for representing cellular trajectories when the data contain irrelevant or noisy genes. First, we visually compared the cellular trajectories generated from a feature selection strategy with PHATE (potential of heat-diffusion for affinity-based transition embedding). PHATE [75] is a nonlinear dimensionality reduction method that has been shown to effectively learn and represent the geometry of complex continuous and branched biological trajectories. As an illustrative example, Fig. 2a shows the PHATE embeddings for simulated linear differentiation trajectories generated from four feature selection approaches (all, DELVE, Laplacian Score (LS), and random) when subjected to a decrease in the signal-to-noise ratio. Here, we simulated a reduction in the signal-to-noise ratio and stochastic gene expression by modifying the biological coefficient of variation (BCV) parameter within Splatter [72]. This scaling factor controls the mean-variance relationship between genes, where lowly expressed genes are more variable than highly expressed genes (See Splatter simulation). Under low noise conditions where the data contained a high signal-to-noise ratio, we observed that excluding irrelevant features with DELVE or the Laplacian Score (LS) produced a much smoother, denoised visualization of the linear trajectory, where cells were more tightly clustered according to cell type. This was compared to the more diffuse presentation of cell states obtained based on all genes. We then examined how noise influences the quality of selected features from a feature selection strategy. As the signal-to-noise ratio decreased (high, medium, low), we observed that the linear trajectory became increasingly harder to distinguish, whereby including both irrelevant and noisy genes often masked the underlying trajectory structure (Fig. 
2a all genes, medium to low signal-to-noise ratio). Furthermore, we found that unsupervised similarity-based or subspace learning feature selection methods that initially define a cell similarity graph according to all irrelevant, noisy, and informative genes often selected genes that produced noisier embeddings as the amount of noise increased (e.g. Fig. 2a LS: medium signal-to-noise ratio), as compared to DELVE (e.g. Fig. 2a DELVE medium signal-to-noise ratio). We reason that this is due to spurious similarities amongst cells, reduced clusterability, and increased diffusion times. These qualitative observations were consistent across different noise conditions (e.g. decreased signal-to-noise, decreased library size, increased dropout) and trajectory types (e.g. linear, bifurcation, tree) (See Supplementary Figs. 1–9). Although this is a qualitative comparison, the example illustrates how including irrelevant or noisy genes can define spurious similarities amongst cells, which can (1) influence a feature selection method's ability to identify biologically-relevant genes and (2) impact the overall quality of an inferred lower dimensional embedding following selection. Given that many trajectory inference methods use lower dimensional representations in order to infer a cell’s progression through a differentiation trajectory, it is crucial to remove information-poor features prior to performing trajectory inference in order to obtain high quality embeddings, clustering assignments, or cellular orderings that are reproducible for both qualitative interpretation and downstream trajectory analysis.

Figure 2: Comparison of feature selection methods on preserving linear trajectories when subjected to a reduction in the signal-to-noise ratio.


(a) Example PHATE [75] visualizations of simulated linear trajectories using four feature selection approaches (all features, DELVE, Laplacian Score (LS) [52], and random selection) when subjected to a reduction in the signal-to-noise ratio (high, medium, low). Here, we simulated a reduction in the signal-to-noise ratio and stochastic gene expression by modifying the biological coefficient of variation (bcv) parameter within Splatter [72] that controls the mean-variance relationship between genes, where lowly expressed genes are more variable than highly expressed genes (high: bcv=0.1, medium: bcv=0.25, low: bcv=0.5). d indicates the total number of genes (d=500) and p indicates the number of selected genes following feature selection (p=100). (b-d) Performance of twelve different feature selection methods: random forest [46], DELVE, dynamic seed features, LS [52], neighborhood variance [25], hotspot [53], multi-cluster feature selection (MCFS) [55], single-cell manifold preserving feature selection (SCMER) [54], max variance, highly variable gene selection (HVG) [50], all features, random features. Following feature selection, trajectory preservation was quantitatively assessed according to several metrics: (b) the precision of differentially expressed genes at k selected genes, (c) k-NN classification accuracy, and (d) pseudotime correlation to the ground truth cell progression across 10 random trials. Error bars/bands represent the standard deviation. * indicates the method with the highest median score. For further details across other trajectory types and noise conditions, see Supplementary Figs. 1–9.

Quantitative assessment of feature selection method performance

We next quantitatively examined how biological or technical challenges associated with single-cell RNA sequencing data may influence a feature selection method’s ability to detect the particular genes that define cell types or cell type transitions. To do so, we systematically benchmarked the 12 described feature selection strategies on their capacity to preserve trajectories according to three sets of quantitative comparisons. Method performance was assessed by evaluating if selected genes from an approach were (1) differentially expressed within a cell type or along a lineage, (2) could be used to classify cell types, and (3) could accurately estimate individual cell progression through the cellular trajectory. Fig. 2b–d shows feature selection method performance for simulated linear differentiation trajectories when subjected to the technical challenge of having a reduction in the signal-to-noise ratio.

First, we assessed the biological relevancy of selected genes, as well as the overall recovery of relevant genes as the signal-to-noise ratio decreased by computing a precision score. Precision@k is a metric that defines the proportion of selected genes (k) that are known to be differentially-expressed within a cell type or along a lineage (See Precision@k). Overall, we found that DELVE achieved the highest precision@k score between selected genes and the ground truth, validating that our approach was able to select genes that are differentially expressed and was the strongest in defining cell types and cell type transitions (See Fig. 2b). Importantly, DELVE’s ability to recover informative genes was robust to the number of genes selected (k) and to the amount of noise present in the data. In contrast, variance-based, similarity-based, or subspace-learning approaches exhibited comparatively worse recovery of cell type and lineage-specific differentially expressed genes.
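
The precision@k metric described above amounts to a simple proportion, sketched below under the assumption that a method returns genes in ranked order and that the ground truth is a set of differentially expressed gene names. The function name and the gene identifiers are hypothetical.

```python
def precision_at_k(ranked_genes, true_de_genes, k):
    """Fraction of the top-k selected genes that are truly differentially expressed."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = list(ranked_genes)[:k]
    truth = set(true_de_genes)
    hits = sum(1 for gene in top_k if gene in truth)
    return hits / len(top_k)

ranked = ["g1", "g2", "g3", "g4"]          # hypothetical ranked output of a method
ground_truth = {"g1", "g3", "g9"}          # hypothetical differentially expressed genes
score = precision_at_k(ranked, ground_truth, k=4)   # 2 of 4 selected genes are true hits
```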

Given that a key application of single-cell profiling technologies is the ability to identify cell types or cell states that are predictive of sample disease status, responsiveness to drug therapy, or are correlated with patient clinical outcomes [76, 77, 78, 79, 65], we then assessed whether selected genes from a feature selection strategy can correctly classify cells according to cell type along the underlying cellular trajectory; this is a crucial and necessary step of trajectory analysis. Therefore, we trained a k-nearest neighbor (k-NN) classifier on the selected feature set (see k-nearest neighbor classification) and compared the predictions to the ground truth cell type annotations by computing a cell type classification accuracy score. Across all simulated trajectories, we found that DELVE selected genes that often achieved the highest median k-NN classification accuracy score (high signal-to-noise ratio: 0.937, medium signal-to-noise ratio: 0.882, low signal-to-noise ratio: 0.734) and produced k-NN graphs that were more faithful to the underlying biology (See Fig. 2c). Moreover, we observed a few results that were consistent with the qualitative interpretations. First, removing irrelevant genes with DELVE, LS, or MCFS achieved higher k-NN classification accuracy scores (e.g. high signal-to-noise ratio; DELVE = 0.937, LS = 0.915, and MCFS = 0.955, respectively) than was achieved by retaining all genes (all = 0.900). Next, DELVE outperformed the Laplacian Score, suggesting that using a bottom-up framework and excluding nuisance features prior to performing ranking and selection is crucial for recovering cell-type specific genes that would have been missed if the cell similarity graph was initially defined based on all genes. Lastly, when comparing the percent change in performance as the amount of noise corruption increased (e.g. high signal-to-noise ratio to medium signal-to-noise ratio) for linear trajectories, we found that DELVE often achieved the highest average classification accuracy score (0.905) and lowest percent decrease in performance (−6.398%), indicating that DELVE was the most robust unsupervised feature selection method to noise corruption (See Supplementary Fig. 10a). In contrast, the existing unsupervised similarity-based or subspace learning feature selection methods that achieved high to moderate average k-NN classification accuracy scores (e.g. MCFS = 0.905, LS = 0.874) had larger decreases in performance (e.g. MCFS = −9.673%, LS = −8.390%) as the amount of noise increased. This further highlights the limitations of current feature selection methods in identifying cell type-specific genes from noisy single-cell omics data.
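
The evaluation above can be sketched as follows: a k-NN classifier is fit on the selected features, accuracy is computed against ground truth labels, and the percent change between two noise conditions summarizes robustness. This is a minimal sketch assuming Euclidean distance and majority voting; the helper names, the value of k, and the toy data are hypothetical, and the actual train/test protocol is described in the Methods.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels by majority vote among the k nearest training cells."""
    X_train = np.asarray(X_train, dtype=float)
    predictions = []
    for x in np.asarray(X_test, dtype=float):
        distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance
        nearest = np.argsort(distances, kind="stable")[:k]
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def percent_change(before, after):
    """Relative change in a score (in percent) from one condition to another."""
    return 100.0 * (after - before) / before


# Toy data: two well-separated hypothetical cell types in two selected features.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["A", "A", "A", "B", "B", "B"]
X_test = [[0.2, 0.2], [5.5, 5.5]]
y_test = ["A", "B"]

predicted = knn_predict(X_train, y_train, X_test, k=3)
acc = accuracy(y_test, predicted)
```

A negative value from `percent_change` indicates that a score dropped as noise increased; smaller magnitudes correspond to more robust methods.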

Lastly, when undergoing dynamic biological processes such as differentiation, cells exhibit a continuum of cell states with fate transitions marked by linear and nonlinear changes in gene expression [80, 81, 82]. Therefore, we evaluated how well feature selection methods could identify genes that define complex differentiation trajectories and correctly order cells along the cellular trajectory in the presence of noise. To infer cellular trajectories and to estimate cell progression, we used the diffusion pseudotime algorithm [83] on the selected gene set from each feature selection strategy, as this approach has been shown previously [19] to perform reasonably well for inference of simple or branched trajectory types (See Trajectory inference and analysis). Method performance was then assessed by computing the Spearman rank correlation between estimated pseudotime and the ground truth cell progression. We found that DELVE approaches more accurately inferred cellular trajectories and achieved the highest median pseudotime correlation to the ground truth measurements, as compared to alternative methods or all features (See Fig. 2d). Furthermore, similar to the percent change in classification performance, we found that DELVE was the most robust unsupervised feature selection method in estimating cell progression, as it often achieved the highest average pseudotime correlation (0.645) and lowest percent decrease in performance (−22.761%) as the amount of noise increased (See Supplementary Fig. 10b high to medium signal-to-noise ratio). In contrast, the alternative methods incorrectly estimated cellular progression and achieved lower average pseudotime correlation scores (e.g. MCFS = 0.602, LS = 0.526) and higher decreases in performance as the signal-to-noise ratio decreased (MCFS = −38.884%, LS = −40.208%).
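
To make the pseudotime metric concrete, the sketch below computes the Spearman rank correlation as the Pearson correlation of average ranks. The helper names and toy values are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would typically be used instead.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_correlation(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy example: estimated pseudotime swaps the order of the two middle cells.
estimated_pseudotime = [0.10, 0.40, 0.35, 0.90]
ground_truth_progression = [1, 2, 3, 4]
rho = spearman_correlation(estimated_pseudotime, ground_truth_progression)
```

In this toy example one adjacent pair of cells is mis-ordered, which gives a correlation of 0.8; a perfectly ordered trajectory gives 1.0 regardless of the pseudotime scale.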

We performed this same systematic evaluation across a range of trajectory types (e.g. linear, bifurcation, tree) and biological or technical challenges associated with single-cell data (See Supplementary Figs. 1–12). Fig. 3 displays the overall ranked method performance of feature selection methods on preserving cellular trajectories when subjected to different sources of single-cell noise (pink: decreased signal-to-noise ratio, green: decreased library size, and blue: increased dropout). Ranked aggregate scores were computed by averaging results across all datasets within a condition; therefore, this metric quantifies how well a feature selection strategy can recover genes that define cell types or cell type transitions underlying a cellular trajectory when subjected to that biological or technical challenge (See Aggregate scores). Across all conditions, we found that DELVE often achieved an increased recovery of differentially expressed genes, higher cell type classification accuracy, higher correlation of estimated cell progression, and lower percent change in performance in noisy data. While feature selection method performance varied across biological or technical challenges, we found that the Laplacian score (LS) and multi-cluster feature selection (MCFS) performed reasonably well under low amounts of noise corruption and were often the second and third ranked unsupervised methods. Altogether, this simulation study demonstrates that DELVE more accurately recapitulates cellular dynamics and can be used to effectively interrogate cell identity and lineage-specific gene expression dynamics from noisy single-cell data.
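
A rough sketch of this kind of ranked aggregation is shown below: scores for each method are averaged across datasets within a condition, and methods are then ranked by the mean. The method names and scores are hypothetical placeholders, and the exact aggregation used in the study is the one specified in the Aggregate scores section.

```python
def rank_methods(scores_by_method):
    """Average each method's scores across datasets, then rank (1 = best)."""
    means = {
        method: sum(scores) / len(scores)
        for method, scores in scores_by_method.items()
    }
    ordered = sorted(means, key=means.get, reverse=True)
    ranks = {method: position + 1 for position, method in enumerate(ordered)}
    return ranks, means


# Hypothetical scores for one metric across two datasets in one condition.
toy_scores = {
    "method_a": [0.90, 0.80],
    "method_b": [0.60, 0.70],
    "method_c": [0.75, 0.85],
}
ranks, means = rank_methods(toy_scores)
```

Averaging before ranking means that a method must perform consistently across datasets to rank highly, rather than excelling on a single trajectory type.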

Figure 3: DELVE outperforms existing feature selection methods on representing trajectories in the presence of single-cell RNA sequencing noise.

Feature selection methods were ranked by averaging their overall performance across datasets from different trajectory types (e.g. linear, bifurcation, tree) when subjected to noise corruption (e.g. decreased signal-to-noise ratio, decreased library size, and increased dropout). Several metrics were used to quantify trajectory preservation, including precision of dynamically-expressed genes with 50 selected genes (p@50), precision at 100 selected genes (p@100), precision at 150 selected genes (p@150), k-NN classification accuracy of cell type labels (acc), and pseudotime correlation (pst). Here, higher-ranked methods are indicated by a longer, lighter bar, and the star illustrates our approach (DELVE) as well as the performance from dynamic seed features of step 1 of the algorithm. DELVE often achieves the highest precision of lineage-specific differentially expressed genes, highest classification accuracy, and highest pseudotime correlation across noise conditions and trajectory types. Of note, random forest was included as a baseline representation to illustrate feature selection method performance when trained on ground truth cell type annotations; however, it was not ranked, as this study is focused on unsupervised feature selection method performance on trajectory preservation.

Revealing molecular trajectories of proliferation and cell cycle arrest

Recent advances in spatial single-cell profiling technologies [8, 9, 84, 10, 11, 85, 86, 87, 88] have enabled the simultaneous measurement of transcriptomic or proteomic signatures of cells, while also retaining additional imaging or array-derived features that describe the spatial positioning or morphological properties of cells. These spatial single-cell modalities have provided fundamental insights into mammalian organogenesis [88, 89] and complex immune responses linked to disease progression [90, 91]. By leveraging imaging data to define cell-to-cell similarity, DELVE can identify smoothly varying spatial features that are strongly associated with cellular progression, such as changes in cell morphology or protein localization.

To demonstrate this, we applied DELVE to an integrated live cell imaging and protein iterative indirect immunofluorescence imaging (4i) dataset consisting of 2759 human retinal pigmented epithelial cells (RPE) undergoing the cell cycle (See RPE analysis). In a recent study [92], we performed time-lapse imaging on an asynchronous population of non-transformed RPE cells expressing a PCNA-mTurquoise2 reporter to record the cell cycle phase (G0/G1, S, G2, M) and age (time since last mitosis) of each cell. We then fixed the cells and profiled them with 4i to obtain measurements of 48 core cell cycle effectors. The resultant dataset consisted of 241 imaging-derived features describing the expression and localization of different protein markers (e.g. nucleus, cytoplasm, perinuclear region – denoted as ring), as well as morphological measurements from the images (e.g. size and shape of the nucleus). Given that time-lapse imaging was performed prior to cell fixation, this dataset provides the unique opportunity to rigorously evaluate feature selection methods on a real biological system (cell cycle) with technical challenges (e.g. many features with low signal-to-noise ratio, autofluorescence, sample degradation).

We first tested whether DELVE can identify a set of dynamically-expressed cell cycle-specific features to construct an approximate cellular trajectory graph for feature selection. Overall, we found that DELVE successfully identified dynamically-expressed seed features (p=13) that are known to be associated with cell cycle proliferation (e.g. increase in DNA content and area of the nucleus) and captured key mechanisms previously shown to drive cell cycle progression (Fig. 4a right), including molecular events that regulate the G1/S and G2/M transitions. For example, the G1/S transition is governed by the phosphorylation of RB by cyclin:CDK complexes (e.g. cyclinA/CDK2 and cyclinE/CDK2), which control the expression of E2F transcription factors that regulate S phase genes [93]. We also observed an increase in expression of Skp2, which reduces p27-mediated inhibition of E2F1 target genes [94, 95]. In addition, our approach identified S phase events that are known to be associated with DNA replication, including an accumulation of PCNA foci at sites of active replication [96] and a DNA damage marker, pH2AX, which becomes phosphorylated in response to double-stranded DNA breaks in areas of stalled replication [97, 98]. Lastly, we observed an increase in expression of cyclin B localized to different regions of the cell, which is a primary regulator of G2/M transition alongside CDK1 [99, 100]. Of note, phosphorylation of RB also controls cell cycle re-entry and is an important biomarker that is often used for distinguishing proliferating from arrested cells [101, 102]. Furthermore, by ordering the average pairwise change in expression of features across ground truth phase annotations, we observed that DELVE dynamically-expressed seed features exhibited non-random patterns of expression variation that gradually increased throughout the canonical phases of the cell cycle (Fig. 4a), and were amongst the top ranked features that were biologically predictive of cell cycle phase and age measurements using a random forest classification and regression framework, respectively (See Random forest, Figure 4a right, Supplementary Fig. 13). Collectively, these results illustrate that the dynamic feature module identified by DELVE represents a minimum cell cycle feature set (Fig. 4b dynamic seed) that precisely distinguishes individual cells according to their cell cycle progression status and can be used to construct an approximate cellular trajectory for ranking feature importance.

Figure 4: DELVE recovered signatures of proliferation and arrest in noisy protein immunofluorescence imaging data.

(a) DELVE identified one dynamic module consisting of 13 seed features that represented a minimum cell cycle. (a left) UMAP visualization of image-derived features where each point indicates a dynamic or static feature identified by the model. (a middle) The average pairwise change in expression of features within DELVE modules ordered across ground truth cell cycle phase annotations. (a right top) Simplified signaling schematic of the cell cycle highlighting the role of DELVE dynamic seed features within cell cycle progression. (a right bottom) Heatmap of the standardized average expression of dynamic seed features across cell cycle phases. (b) Feature selection was performed to select the top (p=30) ranked features from the original (d=241) feature set according to a feature selection strategy. PHATE visualizations illustrating the overall quality of low-dimensional retinal pigmented epithelial cell cycle trajectories following feature selection. Cells were labeled according to ground truth cell cycle phase annotations from time-lapse imaging. Each panel represents a different feature selection strategy. (c) PHATE visualizations following DELVE feature selection, where cells were labeled according to cell cycle trajectory (top) or ground truth age measurements (bottom). 
(d) Performance of twelve feature selection methods on representing the cell cycle according to several metrics, including classification accuracy between predicted and ground truth phase annotations using a support vector machine classifier on selected features, normalized mutual information (NMI) between predicted and ground truth phase labels to indicate clustering performance, precision of phase-specific features determined by a random forest classifier trained on cell cycle phase annotations, mean-squared error between predicted and ground truth molecular age measurements using a support vector machine regression framework on selected features, and the correlation between estimated pseudotime to the ground truth molecular age measurements following trajectory inference on selected features. Error bands represent the standard deviation. * indicates the method with the highest median score. DELVE achieved the highest classification accuracy, highest p@k score, and high NMI clustering score indicating robust prediction of cell cycle phase. Furthermore, DELVE achieved the lowest mean squared error and highest correlation between arrest and proliferation estimated pseudotime and ground truth age measurements indicating robust prediction of cell cycle transitions.

We then comprehensively evaluated feature selection methods on their ability to retain imaging-derived features that define cell cycle phases and resolve proliferation and arrest cell cycle trajectories. We reasoned that cells in similar stages of the cell cycle (as defined by the cell cycle reporter) should have similar cell cycle signatures (4i features) and should be located near one another in a low dimensional projection. Fig. 4b shows the PHATE embeddings from each feature selection strategy. Using the DELVE feature set, we obtained a continuous PHATE trajectory structure that successfully captured the smooth progression of cells through the canonical phases of the cell cycle, where cells were tightly grouped together according to ground truth cell cycle phase annotations (Fig. 4b). Moreover, we observed that the two DELVE approaches (i.e. DELVE and dynamic seed), in addition to hotspot and HVG selection, produced qualitatively similar denoised lower-dimensional visualizations comparable to the supervised random forest approach that was trained on ground truth cell cycle phase annotations. In contrast, similarity-based approaches such as Laplacian score and neighborhood variance, which define a cell similarity graph according to all features, showed more diffuse presentations of cell states. Variance-based (max variance) or subspace-learning approaches (SCMER, MCFS) produced qualitatively similar embeddings to that produced using all features.

To quantitatively assess if selected features from a feature selection strategy were biologically predictive of cell cycle phases, we performed three complementary analyses. We first focused on the task of cell state classification, where our goal was to learn the ground truth cell cycle phase annotations from the selected feature set. To do so, we trained a support vector machine (SVM) classifier and compared the accuracy of predictions to their ground truth phase annotations (See Support Vector Machine). We performed nested 10-fold cross validation to obtain a distribution of predictions for each method. Overall, we found that DELVE achieved the highest median classification accuracy (DELVE = 0.960) obtaining a similar performance to the random forest classifier trained on cell cycle phase annotations (random forest = 0.957), and outperforming existing unsupervised approaches (e.g. hotspot = 0.935, max variance = 0.902, HVG = 0.889, MCFS = 0.870, SCMER = 0.797, LS = 0.770), as well as all features (0.946), suggesting that selected features with DELVE were more biologically predictive of cell cycle phases (Fig. 4d). We next aimed to assess how well a feature selection method could identify and rank cell cycle phase-specific features according to their representative power. To test this, we trained a random forest classifier on the ground truth phase annotations using nested 10-fold cross validation (See Random forest). We then compared the average ranked feature importance scores from the random forest to the selected features from a feature selection strategy using the precision@k metric. Strikingly, we found that DELVE achieved the highest median precision@k score (DELVE p@30 = 0.800) and appropriately ranked features according to their discriminative power of cell cycle phases despite being a completely unsupervised approach (Fig. 4d). 
This was followed by hotspot (p@30 = 0.633) and highly variable gene selection (HVG p@30 = 0.500). In contrast, the Laplacian Score and max variance obtained low precision scores (p@30 = 0.367 and 0.333 respectively), whereas neighborhood variance and subspace-learning feature selection methods MCFS and SCMER were unable to identify cell cycle phase-specific features from noisy 4i data and exhibited precision scores near random (p@30 = 0.267, 0.267, and 0.233, respectively). Lastly, we assessed if selected image-derived features could be used for downstream analysis tasks like unsupervised cell population discovery. To do so, we clustered cells using the KMeans++ algorithm [103] on the selected feature set and compared the predicted labels to the ground truth annotations using a normalized mutual information (NMI) score over 25 random initializations (See Unsupervised clustering). We found that hotspot, DELVE, and dynamic seed features were better able to cluster cells according to cell cycle phases and achieved considerably higher median NMI scores (0.615, 0.599, 0.543, respectively), as compared to retaining all features (0.155) (Fig. 4d). Moreover, we found that clustering performance was similar to that of the random forest trained on cell cycle phase annotations (0.626). In contrast, variance-based approaches achieved moderate NMI clustering scores (HVG: 0.421, max variance: 0.361) and alternative similarity-based and subspace learning approaches obtained low median NMI scores (~ 0.2) and were unable to cluster cells into biologically-cohesive cell populations. Of note, many trajectory inference methods use clusters when fitting trajectory models [24, 104, 105, 23]; thus, accurate cell-to-cluster assignments following feature selection are crucial for both cell type annotation and discovery, as well as for accurate downstream trajectory analysis interpretation. 
Collectively, these results highlight that feature selection with DELVE identifies imaging-derived features from noisy protein immunofluorescence imaging data that are more biologically predictive of cell cycle phases.
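
For reference, the NMI score used in the clustering comparison can be sketched as the mutual information between two labelings divided by a normalizing entropy term. The sketch below assumes normalization by the arithmetic mean of the two label entropies, which is one common convention and may differ from the one used in the study; the function name and toy labels are hypothetical.

```python
import numpy as np


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings, normalized by the mean of their entropies."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    _, true_idx = np.unique(labels_true, return_inverse=True)
    _, pred_idx = np.unique(labels_pred, return_inverse=True)

    # Contingency table of co-occurring labels, converted to joint probabilities.
    contingency = np.zeros((true_idx.max() + 1, pred_idx.max() + 1))
    np.add.at(contingency, (true_idx, pred_idx), 1)
    joint = contingency / n
    p_true = joint.sum(axis=1)
    p_pred = joint.sum(axis=0)

    nonzero = joint > 0
    expected = np.outer(p_true, p_pred)
    mutual_info = np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero]))
    entropy_true = -np.sum(p_true * np.log(p_true))
    entropy_pred = -np.sum(p_pred * np.log(p_pred))
    normalizer = 0.5 * (entropy_true + entropy_pred)
    # Two single-cluster labelings carry no entropy; treat them as identical.
    return float(mutual_info / normalizer) if normalizer > 0 else 1.0


# Cluster identifiers are arbitrary, so a relabeled perfect clustering scores 1.
perfect = normalized_mutual_information(["G1", "G1", "S", "S"], [1, 1, 0, 0])
unrelated = normalized_mutual_information(["G1", "G1", "S", "S"], [0, 1, 0, 1])
```

Because NMI is invariant to how clusters are numbered, it is suited to comparing unsupervised cluster assignments against phase annotations without first matching cluster identifiers to phases.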

We then focused on the much harder task of predicting an individual cell’s progression through the cell cycle. A central challenge in trajectory inference is the destructive nature of single-cell technologies, where only a static snapshot of cell states is profiled. To move toward a quantitative evaluation of cell cycle trajectory reconstruction following feature selection, we leveraged the ground truth age measurements determined from time-lapse imaging of the RPE-PCNA reporter cell line. We first evaluated whether selected features could be used to accurately predict cell cycle age by training a support vector machine (SVM) regression framework using nested ten fold cross validation (See Support Vector Machine). Method performance was subsequently assessed by computing the mean squared error (MSE) between the predictions and the ground truth age measurements. Overall, we found that DELVE achieved the lowest median MSE (3.261), outperforming both supervised (random forest = 3.296) and unsupervised approaches (e.g. second best performer hotspot = 3.654) suggesting that selected features more accurately estimate the time following mitosis (Fig. 4c). Crucially, this highlights DELVE’s ability to learn new biologically-relevant features that might be missed when performing a supervised or unsupervised approach. Lastly, we assessed whether selected imaging features could be used to accurately infer proliferation and arrest cell cycle trajectories using common trajectory inference approaches (Fig. 4d). Briefly, we constructed predicted cell cycle trajectories using the diffusion pseudotime algorithm [83] under each feature selection strategy (See Trajectory inference and analysis). Cells were separated into proliferation or arrest lineages according to their average expression of pRB, and cellular progression was estimated using ten random root cells that had the youngest age. 
Feature selection method performance on trajectory inference was then quantitatively assessed by computing the Spearman rank correlation between estimated pseudotime and the ground truth age measurements. We found that DELVE achieved the highest median correlation of estimated pseudotime to the ground truth age measurements (proliferation: 0.656, arrest: 0.405) as compared to alternative methods (second best performer hotspot; proliferation: 0.632, arrest: 0.333) or all features (proliferation: 0.330, arrest: 0.135), indicating that our approach was better able to resolve both proliferation and cell cycle arrest trajectories where other approaches failed (Fig. 4d). Of note, DELVE was robust to the choice of hyperparameters and obtained reproducible results across a range of hyperparameter choices (See Supplementary Fig. 14).

As a secondary validation, we applied DELVE to 9 pancreatic adenocarcinoma (PDAC) cell lines (e.g. BxPC3, CFPAC, MiaPaCa, HPAC, Pa01C, Pa02C, PANC1, UM53) profiled with 4i (See PDAC analysis) and performed a similar evaluation of cell cycle phase and phase transition preservation (See Supplementary Figs. 16–24). Across all cell lines and metrics, we found that DELVE approaches and hotspot considerably outperformed alternative methods on recovering the cell cycle from noisy 4i data and often achieved the highest classification accuracy scores, clustering scores, and the highest correlation of cellular progression along proliferative and arrested cell cycle trajectories (See Supplementary Fig. 15). Notably, DELVE was particularly useful in resolving cell cycle trajectories from the PDAC cell lines that had numerous imaging measurements with low signal-to-noise ratio (e.g. CFPAC, MiaPaCa, PANC1, and UM53), whereas the alternative strategies were unable to resolve cell cycle phases and achieved scores near random (See Supplementary Figs. 17, 19, 23, 24).

Identifying molecular drivers of CD8+ T cell effector and memory formation

To demonstrate the utility of our approach in a complex differentiation setting consisting of heterogeneous cell subtypes and shared and distinct molecular pathways, we applied DELVE to a single-cell RNA sequencing time series dataset consisting of 29,893 mouse splenic CD8+ T cells responding to acute viral infection [106]. Here, CD8+ T cells were profiled over 12 time points following infection with the Armstrong strain of lymphocytic choriomeningitis virus (LCMV): Naive, d3-, d4-, d5-, d6-, d7-, d10-, d14-, d21-, d32-, d60-, and d90- post-infection (See CD8 T cell differentiation analysis). During an immune response to acute viral infection, naive CD8+ T cells undergo a rapid activation and proliferation phase, giving rise to effector cells that can serve in a cytotoxic role to mediate immediate host defense, followed by a contraction phase giving rise to self-renewing memory cells that provide long-lasting protection and are maintained by antigen-dependent homeostatic proliferation [107, 108, 109]. Despite numerous studies detailing the molecular mechanisms of CD8+ T cell effector and memory fate specification, the molecular mechanisms driving activation, fate commitment, or T cell dysfunction remain unclear due to the complex intra- and inter-temporal heterogeneity of the CD8+ T cell response during infection. Therefore, we applied DELVE to the CD8+ T cell dataset to resolve the differentiation trajectory and investigate transcriptional changes that are involved in effector and memory formation during acute viral infection with LCMV.

Following unsupervised seed selection, we found that DELVE successfully identified three gene modules constituting core regulatory complexes involved in CD8+ T cell viral response and had dynamic expression patterns that varied across experimental time following viral infection (Fig. 5a–c). Namely, dynamic module 0 contained genes involved in early activation and interferon response (e.g. Ly6a, Bst2, Ifi27l2a) [110, 111], and proliferation (e.g. Cenpa, Cenpf, Ccnb2, Ube2c, Top2a, Tubb4b, Birc5, Cks2, Cks1b, Nusap1, Hmgb2, Rrm2, H2afx, Pclaf, Stmn1) [112]. Dynamic module 1 contained genes involved in effector formation, including interferon-γ cytotoxic molecules, such as perforin/granzyme pathway (e.g. Gzma, Gzmk), integrins (e.g. Itgb1), killer cell lectin-like receptor family (e.g. Klrg1, Klrd1, Klrk1, Klrc1), chemokine receptors (e.g. Cxcr3, Cxcr6, Ccr2), and a canonical transcription factor involved in terminal effector formation (e.g. Id2) [113, 114, 115, 116]. Lastly, dynamic module 2 contained genes involved in long-term memory formation (e.g. Sell, Bcl2, Il7r, Ltb) [117, 118, 119, 120]. To quantitatively examine if genes within a dynamic module were meaningfully associated with one another, or had experimental evidence of co-regulation, we constructed gene association networks using experimentally-derived association scores from the STRING database [121]. Here, a permutation test was performed to assess the statistical significance of the observed experimental association amongst genes within a DELVE module as compared to random gene assignment (See Protein-protein interaction networks). Notably, across all three dynamic modules, DELVE identified groups of genes that had statistically significant experimental evidence of co-regulation (p-value = 0.001), where DELVE networks had a larger average degree of experimentally-derived edges than the null distribution (Fig. 5b: dynamic modules). Degree centrality is a simple measurement of the number of edges (e.g. experimentally derived associations between genes) connected to a node (e.g. gene); therefore, in this context, networks with a high average degree may contain complexes of genes that are essential for regulating a biological process. In contrast, genes identified by DELVE that exhibited random or noisy patterns of expression variation (static module) had little to no evidence of co-regulation (p-value = 1.0) and achieved a much lower average degree than networks defined by random gene assignment (Fig. 5b).
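
The permutation test described above can be sketched as follows: the average degree of the network induced by a module is compared against the average degree of networks induced by random gene sets of the same size. This is a minimal sketch under stated assumptions: the toy network, the gene names, and the (count + 1) / (R + 1) form of the one-sided p-value are illustrative choices, not details confirmed by the text.

```python
import random


def average_degree(genes, edges):
    """Average number of within-set edges per gene (2 * edges / nodes)."""
    genes = set(genes)
    n_edges = sum(1 for u, v in edges if u in genes and v in genes)
    return 2.0 * n_edges / len(genes)


def permutation_p_value(module, all_genes, edges, n_permutations=1000, seed=0):
    """One-sided test: is the module more densely connected than random sets?"""
    rng = random.Random(seed)
    observed = average_degree(module, edges)
    pool = sorted(all_genes)
    at_least_as_large = 0
    for _ in range(n_permutations):
        random_set = rng.sample(pool, len(module))
        if average_degree(random_set, edges) >= observed:
            at_least_as_large += 1
    # Assumed convention: add one to numerator and denominator to avoid p = 0.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical network: 20 genes, with the first 5 fully connected to each other.
all_genes = [f"g{i}" for i in range(20)]
module = all_genes[:5]
edges = [(module[i], module[j]) for i in range(5) for j in range(i + 1, 5)]

observed_degree, p_value = permutation_p_value(module, all_genes, edges)
```

In this toy network every gene in the module is connected to the other four, so the observed average degree is 4.0, and random gene sets of the same size almost never reach it.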

Figure 5: DELVE identified molecular drivers of CD8+ T cell effector and memory formation.

(a) DELVE identified three dynamic modules representing cell cycle and early activation (dynamic 0), effector formation and cytokine signaling (dynamic 1), and long-term memory formation (dynamic 2) during CD8+ T cell differentiation response to viral infection with lymphocytic choriomeningitis virus (LCMV). UMAP visualization of (d=500) genes where each point indicates a dynamic or static gene identified by the model. (b) A permutation test was performed using experimentally-derived association scores from the STRING protein-protein interaction database [121] to assess whether genes within DELVE dynamic modules had experimental evidence of co-regulation as compared to random assignment. (b top) STRING association networks, where nodes represent genes from a DELVE module and edges represent experimental evidence of association. (b middle) Average pairwise change in expression amongst genes within a module ordered by time following infection. (b bottom) Histograms showing the distribution of the average degree of experimentally-derived edges of gene networks from R=1000 random permutations. The dotted line indicates the observed average degree from genes within a DELVE module. p-values were computed using a one-sided permutation test. (c) Heatmap visualization of the standardized average expression of dynamically-expressed genes identified by DELVE ordered across time following infection. (d) Trajectory inference was performed along the memory lineage using the diffusion pseudotime algorithm [83], where cell similarity was determined by DELVE selected genes or highly variable genes (HVG). UMAP visualization of memory T cell scores according to the average expression of known memory markers (Bcl-2, Sell, Il7r). Line plot indicates the onset of expression and cellular commitment to the memory lineage following infection. 
(e) Genes from the full dataset were regressed along estimated pseudotime using a generalized additive model to determine lineage-specific significant genes. The venn diagram illustrates the quantification and overlap of memory lineage-specific genes across feature selection strategies. The barplots show the top ten gene ontology terms associated with the temporally-expressed gene lists specific to each feature selection strategy.

Next, we examined if DELVE could be used to improve the identification of genes associated with long-term CD8+ T cell memory formation following trajectory inference. To do so, we first prioritized cells along the memory lineage by computing a memory T cell score according to the average expression of three known memory markers (Bcl2, Sell, and Il7r) (See CD8 T cell differentiation analysis, Fig. 5d). We then reconstructed the memory CD8+ T cell differentiation trajectory from middle-late stage cellular commitment using the diffusion pseudotime algorithm [83] on the top 250 ranked genes following DELVE feature selection or highly variable gene selection. Therefore, in this context, cell ordering was reflective of the differences in cell state along the memory lineage according to (1) dynamically-expressed genes that had experimental evidence of co-regulation or (2) variance-based selection. Lastly, we performed a regression analysis for each gene (d=500) in the original dataset along estimated pseudotime using a negative binomial generalized additive model (GAM). Genes were considered to be differentially expressed along the memory lineage if they had a q-value<0.05 following Benjamini-Hochberg false discovery rate correction [122] (See Trajectory inference and analysis). Overall, we found that ordering cells according to similarities in selected gene expression using DELVE was more reflective of long-term memory formation and achieved an increased recovery of memory lineage-specific genes, as directly compared to the standard approach of highly variable gene selection (Fig. 5e). To determine the biological relevance of these memory lineage-specific genes, we performed gene set enrichment analysis on the temporally-expressed genes specific to each feature selection strategy using EnrichR [123]. 
Here, DELVE obtained higher combined scores and identified more terms involved in immune regulation and memory CD8+ T cell formation, including negative regulation of cell cycle phase transition, negative regulation of T cell mediated cytotoxicity, lymphocyte mediated immunity, and negative regulation of T cell mediated immunity (Fig. 5e).
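
The false discovery rate step in the regression analysis above can be illustrated with a short sketch of the Benjamini-Hochberg procedure, which converts per-gene p-values into q-values that are then thresholded at 0.05. The function name and the toy p-values are hypothetical; fitting the generalized additive models that produce the p-values is not shown.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q


# Hypothetical p-values for four genes regressed along pseudotime.
toy_p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(toy_p_values)
significant = q_values < 0.05
```

In this toy example only the first gene passes the q-value < 0.05 threshold, even though three of the raw p-values are below 0.05, which shows how the correction guards against false positives when many genes are tested.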

Discussion

Computational trajectory inference methods have transformed our ability to study the continuum of cellular states associated with dynamic phenotypes; however, current approaches for reconstructing cellular trajectories can be hindered by biological or technical noise inherent to single-cell data [42, 43]. To mitigate the effect of unwanted sources of variation confounding trajectory inference, we designed a bottom-up unsupervised feature selection method that ranks and selects features that best approximate cell state transitions from dynamic feature modules that constitute core regulatory complexes. The key innovation of this work is the ability to parse temporally co-expressed features from noisy information-poor features prior to performing feature selection; in doing so, DELVE constructs cell similarity graphs that are more reflective of cell state progression for ranking feature importance.

In this study, we benchmarked twelve feature selection methods [46, 52, 25, 53, 55, 54, 50] on their ability to identify biologically relevant features for trajectory analysis from simulated single-cell RNA sequencing data where the ground truth was known. We found that DELVE achieved the highest recovery of differentially expressed genes within a cell type or along a cellular lineage, highest cell type classification accuracy, and most accurately estimated individual cell progression across a variety of trajectory topologies and biological or technical challenges. Furthermore, through a series of qualitative and quantitative comparisons, we illustrated how noise (e.g. stochasticity, sparsity, low library size) and information-poor features create spurious similarities amongst cells and considerably impact the performance of existing unsupervised similarity-based or subspace learning-based feature selection methods on identifying biologically-relevant features.

Next, we applied DELVE to a variety of biological contexts and demonstrated improved recovery of cellular trajectories over existing unsupervised feature selection strategies. In the context of studying the cell cycle from protein imaging-derived features [92], we illustrated how DELVE identified molecular features that were strongly associated with cell cycle progression and were more biologically predictive of cell cycle phase and age, as compared to the alternative unsupervised feature selection methods. Importantly, DELVE often achieved similar or better performance to the supervised Random Forest classification approach without the need for training on ground truth cell cycle labels. Lastly, in the context of studying heterogeneous CD8+ T cell response to viral infection from single-cell RNA sequencing data [106], we showed how DELVE identified gene complexes that had experimental evidence of co-regulation and were strongly associated with CD8+ T cell differentiation. Furthermore, we showed how performing feature selection with DELVE prior to performing trajectory inference improved the identification and resolution of gene programs associated with long-term memory formation that would have been missed by the standard unsupervised feature selection approach.

This study highlights how DELVE can be used to improve inference of cellular trajectories in the context of noisy single-cell omics data; however, it is important to note that feature selection can greatly bias the interpretation of the underlying cellular trajectory [42], thus careful consideration should be made when performing feature selection for trajectory analysis. Furthermore, we provided an unsupervised framework for ranking features according to their association with temporally co-expressed genes, although we note that DELVE can be improved by using a set of previously established regulators (See Step 2: feature ranking). Future work could focus on extending this framework for applications such as (1) deconvolving cellular trajectories using biological system-specific seed graphs or (2) studying complex biological systems such as organoid models or spatial microenvironments.

Methods

DELVE

DELVE identifies a subset of dynamically-changing features that preserve the local structure of the underlying cellular trajectory. In this section, we will (1) describe computational methods for the identification and ranking of features that have non-random patterns of dynamic variation, (2) explain DELVE’s relation to previous work, and (3) provide context for the mathematical foundations behind discarding information-poor features prior to performing trajectory inference.

Problem formulation

Let $X = \{x_i\}_{i=1}^{n}$ denote a single-cell dataset, where $x_i \in \mathbb{R}^d$ represents the vector of d features (e.g. genes or proteins) measured in cell i. We assume that the data have an inherent trajectory structure, or biologically-meaningful ordering, that can be directly inferred from a limited subset of p features, where $p \ll d$. Therefore, our goal is to identify this limited set of p features from the original high-dimensional feature set that best approximates the transitions of cells through each stage of the underlying dynamic process.

Step 1: dynamic seed selection

Graph construction:

Our approach DELVE extends previous similarity-based [52, 25, 53] or subspace-learning [55] feature selection methods by computing the dependence of each gene on the underlying cellular trajectory. In step 1, DELVE models cell states using a weighted k-nearest neighbor affinity graph of cells (k=10), where nodes represent cells and edges describe the transcriptomic or proteomic similarity amongst cells according to the d profiled features encoded in X. More specifically, let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a between-cell affinity graph, where $\mathcal{V}$ represents the cells and the edges, $\mathcal{E}$, are weighted according to a Gaussian kernel as,

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert x_{v_i} - x_{v_j} \rVert^2}{2\sigma_i^2}\right), & \text{if } v_j \in \mathcal{N}_i \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Here, W is an n×n between-cell similarity matrix, where cells $v_i$ and $v_j$ are connected with an edge with edge weight $w_{ij}$ if the cell $v_j$ is within the set of $v_i$’s neighbors, denoted $\mathcal{N}_i$. Moreover, $\sigma_i$, specific to a particular cell i, represents the Gaussian kernel bandwidth parameter that controls the decay of cell similarity edge weights. We chose the bandwidth parameter as the distance to the 3rd nearest neighbor, as this has been shown previously in refs. [53] and [124] to provide reasonable decay in similarity weights.
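
The graph construction in Eq. 1 can be sketched as follows. This is a minimal dense NumPy illustration rather than the DELVE implementation: the function name `knn_gaussian_affinity`, the brute-force neighbor search, and the random input matrix are assumptions made for the example.

```python
import numpy as np


def knn_gaussian_affinity(X, k=10, bandwidth_neighbor=3):
    """Dense n x n affinity matrix following Eq. 1 (illustrative sketch)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between cells.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(sq_dist, np.inf)  # a cell is not its own neighbor
    # Indices of the k nearest neighbors of each cell, nearest first.
    neighbors = np.argsort(sq_dist, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    neighbor_sq_dist = sq_dist[rows, neighbors]
    # Per-cell squared bandwidth: squared distance to the 3rd nearest neighbor.
    sigma_sq = neighbor_sq_dist[:, bandwidth_neighbor - 1]
    W = np.zeros((n, n))
    W[rows, neighbors] = np.exp(-neighbor_sq_dist / (2.0 * sigma_sq[:, None]))
    return W, neighbors


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # hypothetical data: 50 cells x 5 features
W, neighbors = knn_gaussian_affinity(X, k=10)
```

Because the bandwidth is cell-specific and neighbor sets are not mutual, the resulting W is generally not symmetric. A sparse matrix and an approximate neighbor search would be preferable for datasets with many cells.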

Identification of feature modules:

To identify groups of features with similar co-expression variation, DELVE clusters features according to changes in expression across prototypical cell neighborhoods. First, cellular neighborhoods are defined according to the average expression of each set of k nearest neighbors $\mathcal{N}_i$ as $Z = \{z_i \in \mathbb{R}^d\}_{i=1}^{n}$, where each $z_i = \frac{1}{k}\sum_{j \in \mathcal{N}_i} x_j$ represents the center of the k nearest neighbors for cell i across all measured features. Next, DELVE leverages Kernel Herding sketching [66] to effectively sample m representative cell neighborhoods, or rows, from the per-cell neighbor averaged feature matrix, Z, as $\tilde{Z} = \{\tilde{z}_i \in \mathbb{R}^d\}_{i=1}^{m}$. This sampling approach ensures that cellular neighborhoods are more reflective of the original distribution of cell states, while removing redundant cell states to aid in the scalability of estimating expression dynamics. DELVE then computes the average pairwise change in expression of features across representative cellular neighborhoods, Δ, as,

$$\Delta = \frac{1}{m-1}\sum_{i=1}^{m}\left(\tilde{Z} - j_m \tilde{z}_i^{T}\right), \tag{2}$$

where $j_m$ is a column vector of ones with length m, such that $j_m \in \mathbb{R}^m$. Lastly, features are clustered according to the transpose of their average pairwise change in expression across the representative cellular neighborhoods, $\Delta^T$, using the KMeans++ algorithm [103]. In this context, each DELVE module contains a set of features with similar local changes in co-variation across cell states along the cellular trajectory.
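
The neighborhood averaging and Eq. 2 can be sketched as below. Two simplifications are assumed for illustration only: a uniform random subsample stands in for Kernel Herding sketching, and the final KMeans++ clustering of $\Delta^T$ is omitted. The function names are hypothetical.

```python
import numpy as np


def neighborhood_means(X, neighbors):
    """Z: average expression over each cell's k nearest neighbors (n x d)."""
    return X[neighbors].mean(axis=1)


def average_pairwise_change(Z_sketch):
    """Delta from Eq. 2, an m x d matrix.

    Summing (Z~ - j_m z~_i^T) over i equals m * Z~ minus the column sums of Z~
    broadcast to every row, which avoids an explicit loop over neighborhoods.
    """
    m = Z_sketch.shape[0]
    return (m * Z_sketch - Z_sketch.sum(axis=0, keepdims=True)) / (m - 1)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))  # hypothetical data: 40 cells x 6 features
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
neighbors = np.argsort(dist, axis=1)[:, :10]

Z = neighborhood_means(X, neighbors)
# Assumption: a random subsample replaces Kernel Herding in this sketch.
sketch_idx = rng.choice(Z.shape[0], size=20, replace=False)
Z_sketch = Z[sketch_idx]
delta = average_pairwise_change(Z_sketch)
```

Each column of `delta` describes how one feature changes across the sampled neighborhoods; clustering the columns (the rows of the transpose) would then group features with similar dynamics.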

Dynamic expression variation permutation testing:

To assess the significance of dynamic expression variation across grouped features within a DELVE module, we perform a permutation test as follows. Let $\bar{S}_c^2(P_c)$ denote the average sample variance of the average pairwise change in expression across m cell neighborhoods for the set of p features (denoted as $P_c$) within a DELVE cluster c as,

$$\bar{S}_c^2(P_c) = \frac{1}{|P_c|}\sum_{p=1}^{|P_c|}\frac{\sum_{i=1}^{m}\left(\Delta_{i,p} - \bar{\Delta}_p\right)^2}{m-1}. \tag{3}$$

Moreover, let $R_q$ denote a set of randomly selected features sampled without replacement from the full feature space d, such that $|P_c| = |R_q|$, and $\tilde{S}_c^2(R_q)$ denote the average sample variance of randomly selected feature sets averaged across t random permutations as,

$$\tilde{S}_c^2(R_q) = \frac{1}{t}\sum_{q=1}^{t}\bar{S}_c^2(R_q). \tag{4}$$

Here, DELVE considers a module of features as being dynamically-expressed if the average sample variance of the change in expression of the set of features within a DELVE cluster (or specifically feature set $P_c$) is greater than that of random assignment, $R_q$, across randomly permuted trials as,

$$\bar{S}_c^2(P_c) > \tilde{S}_c^2(R_q). \tag{5}$$

In doing so, this approach is able to identify and exclude modules of features that have static, random, or noisy patterns of expression variation, while retaining dynamically expressed features for ranking feature importance. Of note, given that KMeans++ clustering is used to initially assign features to a group, feature-to-cluster assignments can vary due to algorithm stochasticity. Therefore, to reduce the variability and find a core set of features that are consistently dynamically-expressed, this process is repeated across ten random clustering initializations and the set of dynamically-expressed features is defined as the intersection across runs.
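
The test in Eqs. 3-5 can be sketched as follows, assuming a precomputed Δ matrix (neighborhoods × features) and a candidate module given as column indices. The function names, the number of permutations, and the synthetic Δ are hypothetical choices for the example.

```python
import numpy as np


def mean_sample_variance(delta, feature_idx):
    """Eq. 3: per-feature sample variance across neighborhoods, averaged over the set."""
    return delta[:, feature_idx].var(axis=0, ddof=1).mean()


def is_dynamic_module(delta, module_idx, n_permutations=1000, seed=0):
    """Eqs. 4-5: compare a module with same-sized random feature sets."""
    rng = np.random.default_rng(seed)
    d = delta.shape[1]
    observed = mean_sample_variance(delta, module_idx)
    null = np.mean([
        mean_sample_variance(
            delta, rng.choice(d, size=len(module_idx), replace=False)
        )
        for _ in range(n_permutations)
    ])
    return observed > null, observed, null


rng = np.random.default_rng(1)
delta = rng.normal(scale=0.1, size=(100, 30))          # mostly static features
delta[:, :5] = rng.normal(scale=2.0, size=(100, 5))    # hypothetical dynamic module
dynamic, obs, null = is_dynamic_module(delta, np.arange(5), n_permutations=200)
static, _, _ = is_dynamic_module(delta, np.arange(20, 25), n_permutations=200)
```

In this toy setting the first module is retained and the second is excluded. Repeating the clustering and test over several initializations and intersecting the retained features would follow the procedure described above.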

Step 2: feature ranking

Following dynamic seed selection, in step two, DELVE ranks features according to their association with the underlying cellular trajectory graph. First, DELVE approximates the underlying cellular trajectory by constructing a between-cell affinity graph, where the nodes represent the cells and edges are now re-weighted according to a Gaussian kernel between all cells based on the core subset of dynamically expressed regulators from step 1, such that $\tilde{X} = \{\tilde{x}_i \in \mathbb{R}^p\}$ where $p \ll d$, as

$$\tilde{w}_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert \tilde{x}_{v_i} - \tilde{x}_{v_j} \rVert^2}{2\sigma_i^2}\right), & \text{if } v_j \in \mathcal{N}_i \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$

Here, $\tilde{W}$ is an n×n between-cell similarity matrix, where cells $v_i$ and $v_j$ are connected with an edge with edge weight $\tilde{w}_{ij}$ if the cell $v_j$ is within the set of $v_i$’s neighbors, denoted as $\mathcal{N}_i$. Moreover, as previously mentioned, $\sigma_i$ represents the Gaussian kernel bandwidth parameter for a particular cell i, defined as the distance to the 3rd nearest neighbor.

Features are then ranked according to their association with the underlying cellular trajectory graph using graph signal processing techniques [52, 67, 68]. A graph signal f is any function that has a real defined value on all of the nodes, such that $f \in \mathbb{R}^n$ and $f_i$ gives the signal at the ith node. Intuitively, we can consider all features as graph signals and rank them according to their variation in expression along the approximate cell trajectory graph to see if they should be included or excluded from downstream analysis. Let L denote the unnormalized graph Laplacian, with $L = D - \tilde{W}$, where D is a diagonal degree matrix with each element as $d_{ii} = \sum_j \tilde{w}_{ij}$. The local variation in the expression of feature signal f can then be defined as the weighted sum of differences in signals around a particular cell i as,

$$(Lf)(i) = \sum_j \tilde{w}_{ij}\left(f(i) - f(j)\right). \tag{7}$$

This metric effectively measures the similarity in expression of a particular node’s graph signal, denoted by the feature vector, f, around its k nearest neighbors. By summing the local variation in expression across all neighbors along the cellular trajectory, we can define the total variation in expression of feature graph signal f as,

$$f^{T} L f = \sum_{ij} \tilde{w}_{ij}\left(f(i) - f(j)\right)^2. \tag{8}$$

Otherwise known as the Laplacian quadratic form, in this context, the total variation represents the global smoothness of the particular graph signal encoded in f (e.g. expression of a particular gene or protein) along the approximate cellular trajectory graph. Intuitively, DELVE ultimately retains features that have a low total variation in expression, or have similar expression values amongst similar cells along the approximate cellular trajectory graph. In contrast, DELVE excludes features that have a high total variation in expression, or those which have expression values that are rapidly oscillating amongst neighboring cells, as these features are likely noisy or not involved in the underlying dynamic process that was initially seeded.

In this work, we ranked features according to their association with the cell-to-cell affinity graph defined by dynamically expressed features from DELVE dynamic modules using the Laplacian score [52]. This measure takes into account both the total variation in expression, as well as the overall global variance. For each of the original d measured features, or graph signals encoded in f with $f \in \mathbb{R}^n$, the Laplacian score $L_f$ is computed as,

$$L_f = \frac{\tilde{f}^{T} L \tilde{f}}{\tilde{f}^{T} D \tilde{f}}. \tag{9}$$

Here, L represents the unnormalized graph Laplacian, such that $L = D - \tilde{W}$, D is a diagonal degree matrix with the ith element of the diagonal $d_{ii}$ as $d_{ii} = \sum_j \tilde{w}_{ij}$, $\tilde{f}$ represents the mean centered expression of feature f as $\tilde{f} = f - \frac{f^{T} D \mathbf{1}}{\mathbf{1}^{T} D \mathbf{1}}\mathbf{1}$, and $\mathbf{1} = [1, \dots, 1]^T$. By sorting features in ascending order according to their Laplacian score, DELVE effectively ranks features that best preserve the local trajectory structure (e.g. an ideal numerator has a small local variation in expression along neighboring cells), as well as best preserve cell types (e.g. an ideal denominator has large variance in expression for discriminative power).
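
The ranking in Eq. 9 can be sketched with dense NumPy arrays as below. The symmetrization of the affinity matrix is an assumption of this sketch, as are the function name `laplacian_scores` and the toy chain graph; the example is not the DELVE implementation.

```python
import numpy as np


def laplacian_scores(X, W):
    """Laplacian score (Eq. 9) of every column of X on the graph with affinity W."""
    W = 0.5 * (W + W.T)        # assumption: symmetrize the kNN affinity matrix
    degree = W.sum(axis=1)     # diagonal entries of D
    L = np.diag(degree) - W    # unnormalized graph Laplacian
    # Degree-weighted mean centering: f~ = f - (f^T D 1 / 1^T D 1) 1.
    centered = X - (degree @ X) / degree.sum()
    numerator = np.einsum("ij,ij->j", centered, L @ centered)                  # f~^T L f~
    denominator = np.einsum("ij,ij->j", centered, degree[:, None] * centered)  # f~^T D f~
    return numerator / denominator


# Toy trajectory: 200 cells on a chain, each connected to its adjacent cells.
n = 200
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = W[idx + 1, idx] = 1.0

rng = np.random.default_rng(0)
smooth = np.linspace(0.0, 1.0, n)   # changes gradually along the chain
noisy = rng.normal(size=n)          # unrelated to the chain ordering
scores = laplacian_scores(np.column_stack([smooth, noisy]), W)
ranking = np.argsort(scores)        # ascending: best-ranked feature first
```

The gradually changing feature receives a score close to zero and is ranked first, whereas the noisy feature has a score close to one.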

Benchmarked feature selection methods

In this section, we describe the twelve feature selection methods evaluated for representing biological trajectories. For more details on implementation and hyperparameters, see Supplementary Table 1.

Random forest

To quantitatively compare feature selection approaches on preserving biologically relevant genes or proteins, we aimed to implement an approach that would leverage ground truth cell type labels to determine feature importance. Random forest classification [46] is a supervised ensemble learning algorithm that uses an ensemble of decision trees to partition the feature space such that all of the samples (cells) with the same class (cell type labels) are grouped together. Each decision or split of a tree was chosen by minimizing the Gini impurity score as,

$$G(m) = \sum_{i=1}^{C} p_{mi}\left(1 - p_{mi}\right). \tag{10}$$

Here, $p_{mi}$ is the proportion of cells that belong to class i for a feature node m, and C is the total number of classes (e.g. cell types). We performed random forest classification using the sklearn v0.23.2 package in python. Nested 10-fold cross-validation was performed using stratified random sampling to assign cells to either a training or test set. The number of trees was tuned over a grid search within each fold prior to training the model. Feature importance scores were subsequently determined by the average Gini importance across folds.
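
For concreteness, the Gini impurity in Eq. 10 can be computed for the labels at a single node as follows. The function name and the example phase labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Eq. 10: sum over classes of p_i * (1 - p_i) for the labels at one node."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


pure = gini_impurity(["G1", "G1", "G1", "G1"])   # all cells share one class
mixed = gini_impurity(["G1", "G1", "S", "S"])    # two equally frequent classes
```

A pure node has impurity 0, and an evenly split two-class node has impurity 0.5, so splits that separate classes reduce the score.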

Max variance

Max variance is an unsupervised feature selection approach that uses sample variance as a criterion for retaining discriminative features, where $\hat{S}_f^2$ represents the sample variance for feature $f \in \mathbb{R}^n$ as,

$$\hat{S}_f^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(f_i - \bar{f}\right)^2, \tag{11}$$

where $f_i$ indicates the expression value of feature f in cell i. We performed max variance feature selection by sorting features in descending order according to their variance score and selecting the top p maximally varying features.
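
A minimal sketch of this selection rule is shown below; the function name and the synthetic data with known per-feature scales are assumptions for the example.

```python
import numpy as np


def max_variance_selection(X, p):
    """Return the column indices of the p features with the largest sample variance."""
    variances = X.var(axis=0, ddof=1)        # Eq. 11 for every feature
    return np.argsort(variances)[::-1][:p]   # descending order, top p


rng = np.random.default_rng(0)
# Hypothetical data: 500 cells x 4 features with standard deviations 0.1, 5, 1, 3.
X = rng.normal(size=(500, 4)) * np.array([0.1, 5.0, 1.0, 3.0])
selected = max_variance_selection(X, p=2)
```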

Neighborhood variance

Neighborhood variance [25] is an unsupervised feature selection approach that uses a local neighborhood variance metric to select gradually-changing features for building biological trajectories. Namely, the neighborhood variance metric $\tilde{S}_f^2$ quantifies how much feature f varies across neighboring cells as,

$$\tilde{S}_f^2 = \frac{1}{n k_c - 1}\sum_{i=1}^{n}\sum_{j=1}^{k_c}\left(f_i - f_{\mathcal{N}(i,j)}\right)^2. \tag{12}$$

Here, $f_i$ represents the expression value of feature f for cell i, $\mathcal{N}(i,j)$ indicates the jth nearest neighbor of cell i, and $k_c$ is the minimum number of k-nearest neighbors required to form a fully connected graph. Features were subsequently selected if they had a smaller neighborhood variance $\tilde{S}_f^2$ than global variance $\hat{S}_f^2$,

$$\frac{\hat{S}_f^2}{\tilde{S}_f^2} > 1. \tag{13}$$

Highly variable genes

Highly variable gene selection [50] is an unsupervised feature selection approach that selects features according to a normalized dispersion measure. First, features are binned based on their average expression. Within each bin, genes are then z-score normalized to identify features that have a large variance, yet a similar mean expression. We selected the top p features using the highly variable genes function in Scanpy v1.8.1 (flavor = Seurat, bins = 20, n_top_genes = p).

Laplacian score

Laplacian score (LS) [52] is a locality-preserving unsupervised feature selection method that ranks features according to (1) how consistent a feature’s expression is across neighboring cells in a between-cell similarity graph defined by all profiled features and (2) the feature’s global variance. First, a weighted k-nearest neighbor affinity graph of cells (k=10) is constructed according to pairwise Euclidean distances between cells based on all features, X. More specifically, let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ represents the cells and the edges, $\mathcal{E}$, are weighted using a Gaussian kernel. Edge weights between cells i and j are defined as,

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert x_{v_i} - x_{v_j} \rVert^2}{2\sigma_i^2}\right), & \text{if } v_j \in \mathcal{N}_i \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

Here, W is an n×n between-cell similarity matrix, where cells $v_i$ and $v_j$ are connected with an edge with edge weight $w_{ij}$ if the cell $v_j$ is within the set of $v_i$’s neighbors, $\mathcal{N}_i$. Moreover, as previously described, $\sigma_i$ represents the bandwidth parameter for cell i defined as the distance to the 3rd nearest neighbor. For each feature f, where $f \in \mathbb{R}^n$ represents the value of the feature across all n cells, we compute the Laplacian score, $L_f$, as,

$$L_f = \frac{\tilde{f}^{T} L \tilde{f}}{\tilde{f}^{T} D \tilde{f}}. \tag{15}$$

Here, L represents the unnormalized graph Laplacian, with $L = D - W$, D is a diagonal degree matrix with the ith element of the diagonal $d_{ii}$ as $d_{ii} = \sum_j w_{ij}$, $\tilde{f}$ represents the mean centered expression of feature f as $\tilde{f} = f - \frac{f^{T} D \mathbf{1}}{\mathbf{1}^{T} D \mathbf{1}}\mathbf{1}$, and $\mathbf{1} = [1, \dots, 1]^T$. We performed feature selection by sorting features in ascending order according to their Laplacian score and selecting the top p features.

MCFS

Multi-cluster feature selection (MCFS) [55] is an unsupervised feature selection method that selects for features that best preserve the multi-cluster structure of data by solving an L1 regularized least squares regression problem on the spectral embedding. Similar to the Laplacian score, first a k-nearest neighbor affinity graph of cells (k=10) is computed to encode the similarities in feature expression between cells i and j using a Gaussian kernel as,

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert x_{v_i} - x_{v_j} \rVert^2}{2\sigma_i^2}\right), & \text{if } v_j \in \mathcal{N}_i \\ 0, & \text{otherwise.} \end{cases} \tag{16}$$

Similar to previous formulations above, W is an n×n between-cell affinity matrix, where a pair of cells $v_i$ and $v_j$ are connected with an edge with weight $w_{ij}$ if cell $v_j$ is within the set of $v_i$’s neighbors, $\mathcal{N}_i$. Further, $\sigma_i$ represents the kernel bandwidth parameter chosen to be the distance to the third nearest neighbor from cell i. Next, to represent the intrinsic dimensionality of the data, the spectral embedding [125] is computed through eigendecomposition of the unnormalized graph Laplacian L, where $L = D - W$, as,

$$Ly = \lambda D y. \tag{17}$$

Here, $Y = \{y_k\}_{k=1}^{K}$ are the eigenvectors corresponding to the K smallest eigenvalues, W is a symmetric affinity matrix encoding cell similarity weights, and D represents a diagonal degree matrix with each element as $d_{ii} = \sum_j w_{ij}$. Given that eigenvectors of the graph Laplacian represent frequency harmonics [68] and low frequency eigenvectors are considered to capture the informative structure of the data, MCFS computes the importance of each feature along each intrinsic dimension $y_k$ by finding a relevant subset of features that minimizes the error under an L1 norm penalty as,

$$\min_{a_k} \lVert y_k - X^{T} a_k \rVert^2 \quad \text{s.t.} \quad \lVert a_k \rVert_1 \leq \gamma. \tag{18}$$

Here, the non-zero coefficients, $a_k$, indicate the most relevant features for distinguishing clusters from the embedding space, $y_k$, and γ controls the sparsity and ensures the least relevant coefficients are shrunk to zero. The optimization is solved using the least angle regression algorithm [71], where for every feature, the MCFS score is defined as,

$$\mathrm{MCFS}_j = \max_k \lvert a_{k,j} \rvert. \tag{19}$$

Here, j and k index feature and eigenvector, respectively. We performed multi-cluster feature selection with the number of eigenvectors K chosen to be the number of ground truth cell types present within the data, as this is the traditional convention in spectral clustering [57] and the number of nonzero coefficients was set to the number of selected features, p.

SCMER

Single-cell manifold-preserving feature selection (SCMER) [54] selects a subset of p features that represent the embedding structure of the data by learning a sparse weight vector w by formulating an elastic net regression problem that minimizes the KL divergence between a cell similarity matrix defined by all features and one defined by a reduced subset of features. More specifically, let P denote a between-cell pairwise similarity matrix defined in UMAP [56] computed with the full data matrix $X \in \mathbb{R}^{n \times d}$ and Q denote a between-cell pairwise similarity matrix defined in UMAP computed with the dataset following feature selection $Y \in \mathbb{R}^{n \times p}$, where $Y = Xw$ and $p \ll d$. Here, elastic net regression is used to find a sparse and robust solution of w that minimizes the KL divergence as,

$$KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}}. \tag{20}$$

Features with non-zero weights in w are considered useful for preserving the embedding structure and selected for downstream analysis. We performed SCMER feature selection using the scmer v.0.1.0a3 package in python by constructing a k-nearest neighbor graph (k=10) according to pairwise Euclidean distances of cells based on their first 50 principal components and using the default regression penalty weight parameters (lasso=3.87e-4,ridge=0).
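
To make the objective in Eq. 20 concrete, the sketch below compares the similarity matrix from all features with those from two candidate feature subsets. It is only an illustration of the KL term: simple Gaussian similarities are assumed in place of UMAP similarities, no elastic net optimization is performed, and all function names and data are hypothetical rather than part of the scmer package.

```python
import numpy as np


def pairwise_similarity(X):
    """Gaussian similarities between all pairs of cells, normalized to sum to 1."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-sq_dist)
    np.fill_diagonal(S, 0.0)  # ignore self-similarity
    return S / S.sum()


def kl_divergence(P, Q, eps=1e-12):
    """Eq. 20: sum_ij p_ij * log(p_ij / q_ij), skipping entries with p_ij = 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
# Two features separate the two groups of cells; three features are low-level noise.
informative = rng.normal(scale=0.3, size=(60, 2)) + 3.0 * labels[:, None]
noise = rng.normal(scale=0.1, size=(60, 3))
X = np.hstack([informative, noise])

P = pairwise_similarity(X)
kl_informative = kl_divergence(P, pairwise_similarity(X[:, :2]))
kl_noise = kl_divergence(P, pairwise_similarity(X[:, 2:]))
```

The subset containing the two informative features yields the smaller divergence, which is the behavior the sparse weight vector is optimized to achieve.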

Hotspot

Hotspot [53] is an unsupervised gene module identification approach that performs feature selection through a test statistic that measures the association of a gene’s expression with the between-cell similarity graph defined based on the full feature matrix, X. More specifically, first, a k-nearest neighbor cell affinity graph (k=10) is defined based on pairwise Euclidean distances between all pairs of cells using a Gaussian kernel as,

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert x_{v_i} - x_{v_j} \rVert^2}{\sigma_i^2}\right), & \text{if } v_j \in \mathcal{N}_i \\ 0, & \text{otherwise.} \end{cases} \tag{21}$$

Here, cells $v_i$ and $v_j$ are connected with an edge with edge weight $w_{ij}$ if the cell $v_j$ is within the set of $v_i$’s neighbors, such that $\sum_j w_{ij} = 1$ for each cell, and $\sigma_i$ represents the bandwidth parameter for cell i defined as the distance to the $\frac{k}{3}$ neighbor. For a given feature $f \in \mathbb{R}^n$, representing expression across all n cells where $f_i$ is the mean-centered and standardized expression of feature f in cell i according to a null distribution model of gene expression, the local autocorrelation test statistic representing the dependence of each gene on the graph structure is defined as,

$$H_f = \sum_i \sum_{j \neq i} w_{ij} f_i f_j. \tag{22}$$

Hotspot was implemented using the hotspot v1.1.1 package in python, where we selected the top p features by sorting features in ascending order according to their significance with respect to a null model defined by a negative binomial distribution with the mean dependent on library size.
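
The statistic in Eq. 22 can be illustrated with a small dense example. The sketch below is not the hotspot package: a plain z-score is assumed in place of standardization under the negative binomial null model, and the ring graph and function names are hypothetical.

```python
import numpy as np


def local_autocorrelation(W, f):
    """Eq. 22: H_f = sum_i sum_{j != i} w_ij * f_i * f_j for a standardized signal f."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)  # exclude the j == i terms
    return float(f @ W @ f)


def standardize(f):
    """Assumption: z-score standardization stands in for the null-model standardization."""
    return (f - f.mean()) / f.std()


# Ring graph of 100 cells; each row of W sums to 1 over its two neighbors.
n = 100
idx = np.arange(n)
W = np.zeros((n, n))
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

rng = np.random.default_rng(0)
smooth = standardize(np.sin(2 * np.pi * idx / n))  # varies smoothly along the graph
shuffled = rng.permutation(smooth)                 # same values, graph structure destroyed
h_smooth = local_autocorrelation(W, smooth)
h_shuffled = local_autocorrelation(W, shuffled)
```

The smooth signal gives a large positive statistic, while the shuffled signal gives a value near zero, which is the contrast the significance test is designed to detect.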

All features

To consider a baseline representation without feature selection, we evaluated performance using all features from each dataset following quality control preprocessing.

Random features

As a second baseline strategy, we simply selected a subset of random features without replacement. Results were computed across twenty random initializations for each dataset.

DELVE

DELVE was run as previously described. Here, we constructed a weighted k-nearest neighbor affinity graph of cells (k=10), and 1000 neighborhoods were sketched to identify dynamic seed feature clusters (c=3 for the simulated dataset, c=5 for the RPE cell cycle dataset, c=5 for the CD8 T cell differentiation dataset, and c=10 for PDAC cell cycle datasets). Results were computed across twenty random initializations for each dataset.

Datasets

We evaluated feature selection methods based on how well retained features could adequately recover biological trajectories under various noise conditions, biological contexts, and single-cell technologies.

Splatter simulation

Splatter [72] is a single-cell RNA sequencing simulation software that generates count data using a gamma-Poisson hierarchical model with modifications to alter the mean-variance relationship amongst genes, the library size, or sparsity. We used splatter to simulate a total of 90 ground truth datasets with different trajectory structures (e.g. linear, bifurcation, and tree topologies). First, we estimated simulation parameters by fitting the model to a real single-cell RNA sequencing dataset consisting of human pluripotent stem cells differentiating into mesoderm progenitors [126]. We then used the estimated parameters (mean_rate = 0.0173, mean_shape = 0.54, lib_loc = 12.6, lib_scale = 0.423, out_prob = 0.000342, out_fac_loc = 0.1, out_fac_scale = 0.4, bcv = 0.1, bcv_df = 90.2, dropout = None) to simulate a diverse set of ground truth reference trajectory datasets with the splatter paths function (python wrapper scprep SplatSimulate v1.1.0 of splatter v1.18.2). Here, a reference trajectory structure (e.g. bifurcation) was used to simulate linear and nonlinear changes in the mean expression of genes along each step of the specified differentiation path. We simulated ten differentiation datasets (1500 cells, 500 genes, 6 clusters) for each trajectory type (linear, bifurcation, tree) by modifying (1) the probability of a cell belonging to a cluster by randomly sampling from a Dirichlet distribution with six categories and a uniform concentration of one and (2) the path skew by randomly sampling from a beta distribution (α=10, β=10). The output of each simulation is a ground truth reference consisting of cell-to-cluster membership, differentially expressed genes per cluster or path, as well as a latent step vector that indicates the progression of each cell within a cluster. Lastly, we modified the step vector to be monotonically increasing across clusters within the specified differentiation path to obtain a reference pseudotime measurement.

To estimate how well feature selection methods can identify genes that represent cell populations and are differentially expressed along a differentiation path in noisy single-cell RNA sequencing data, we added relevant sources of biological and technical noise to the reference datasets.

  1. Biological Coefficient of Variation (BCV): To simulate the effect of stochastic gene expression, we modified the biological coefficient of variation parameter within splatter (BCV = 0.1, 0.25, 0.5). This scaling factor controls the mean-variance relationship between genes, where lowly expressed genes are more variable than highly expressed genes, following a γ distribution.

  2. Library size: The total number of profiled mRNA transcripts per cell, or library size, can vary between cells within a single-cell RNA sequencing experiment and can influence the detection of differentially expressed genes [73], as well as impact the reproducibility of the lower-dimensional representation of the data [74]. To simulate the effect of differences in sequencing depth, we proportionally adjusted the gene means for each cell by modifying the location parameter (lib_loc = 12, 11, 10) of the log-normal distribution within splatter that estimates the library size scaling factors.

  3. Technical dropout: Single-cell RNA sequencing data contain a large proportion of zeros, where only a small fraction of total transcripts are detected due to capture inefficiency and amplification noise [127]. To simulate the inefficient capture of mRNA molecules and account for the trend that lowly expressed genes are more likely to be affected by dropout, we undersampled mRNA counts by sampling from a binomial distribution with the scale parameter or dropout rate proportional to the mean expression of each gene as previously described in ref. [128] as,
    $$r_i = \exp\left(-\lambda \mu_i^2\right). \tag{23}$$
    Here, $\mu_i$ represents the log mean expression of gene i, and λ is a hyperparameter that controls the magnitude of dropout (λ=0, 0.05, 0.1).

In our subsequent feature selection method analyses, we selected the top p=100 features under each feature selection approach.

RPE analysis

The retinal pigmented epithelial (RPE) dataset [92] is an iterative indirect immunofluorescence imaging (4i) dataset consisting of RPE cells undergoing the cell cycle. Here, time-lapse imaging was performed on an asynchronous population of non-transformed human retinal pigmented epithelial cells expressing a PCNA-mTurquoise2 reporter in order to record the cell cycle phase (G0/G1, S, G2, M) and molecular age (time since last mitosis) of each cell. Following time-lapse imaging, cells were fixed and 48 core cell cycle effectors were profiled using 4i [8]. For preprocessing, we min-max normalized the data and performed batch effect correction on the replicates using ComBat [129]. Lastly, to refine phase annotations and distinguish G0 from G1 cells, we selected cycling G1 cells according to the bimodal distribution of pRB/RB expression as described in ref. [92]. Of note, cells were excluded if they did not have ground truth phase or age annotations. The resultant dataset consisted of 2759 cells × 241 features describing the expression and localization of different protein markers. For our subsequent analysis, we selected the top p=30 features for each feature selection approach.

PDAC analysis

The pancreatic ductal adenocarcinoma (PDAC) dataset is an iterative indirect immunofluorescence imaging dataset consisting of 9 human PDAC cell lines: BxPC3, CFPAC, HPAC, MiaPaCa, Pa01C, Pa02C, Pa16C, PANC1, UM53. For each cell line (e.g. BxPC3) under control conditions, we min-max normalized the data. Cell cycle phases (G0, G1, S, G2, M) were annotated a priori based on manual gating cells according to the abundance of core cell cycle markers. Phospho-RB (pRB) was used to distinguish proliferative cells (G1/S/G2/M, high pRB) from arrested cells (G0, low pRB). DNA content, E2F1, cyclin A (cycA), and phospho-p21 (p-p21) were used to distinguish G1 (DNA content = 2C, low cycA), S (DNA content = 2–4C, high E2F1), G2 (DNA content = 4C, high cycA), and M (DNA content = 4C, high p-p21). For our subsequent analysis, we selected the top p=30 features for each feature selection approach.

CD8 T cell differentiation analysis

The CD8 T cell differentiation dataset [106] is a single-cell RNA sequencing dataset consisting of mouse splenic CD8+ T cells profiled over 12 time points (d = day) following infection with the Armstrong strain of the lymphocytic choriomeningitis virus: Naive, d3-, d4-, d5-, d6-, d7-, d10-, d14-, d21-, d32-, d60-, d90- post-infection. Spleen single-cell RNA sequencing data were accessed from the Gene Expression Omnibus using the accession code GSE131847 and concatenated into a single matrix. The dataset was subsequently quality control filtered according to the distribution of molecular counts. To remove dead or dying cells, we filtered cells that had more than twenty percent of their total reads mapped to mitochondrial transcripts. Genes that were observed in fewer than three cells or had fewer than 400 counts were also removed. Following cell and gene filtering, the data were transcripts-per-million normalized, log+1 transformed, and variance filtered using highly variable gene selection, such that the resulting dataset consisted of 29893 cells × 500 genes (See Highly variable genes). Lastly, to obtain lineage labels for trajectory analysis, cells were scored and gated according to their average expression of known memory markers (Bcl2, Sell, Il7r) using the score_genes function in Scanpy v1.8.1. When evaluating feature selection methods, we selected the top p=250 features for each feature selection approach.

Evaluation

Classification and regression

k-nearest neighbor classification

To quantitatively compare feature selection methods on retaining features that are representative of cell types, we aimed to implement an approach that would assess the quality of the graph structure. k-nearest neighbors classification is a supervised learning algorithm that classifies data based on the labels of the k most similar cells according to their gene or protein expression, where the output of this algorithm is a set of labels for every cell. We performed k-nearest neighbors classification to predict cell type labels from the simulated single-cell RNA sequencing datasets as follows. First, 3-fold cross-validation was performed using stratified random sampling to assign cells to either a training or a test set. Stratified random sampling was chosen to mitigate the effect of cell type class imbalance. Within each fold, feature selection was then performed on the training data to identify the top p=250 relevant features according to a feature selection strategy. Next, a k-nearest neighbor classifier (k=3) was fit on the feature selected training data to predict the cell type labels of the feature selected test query points. Here, labels are predicted as the mode of the cell type labels from the closest training data points according to Euclidean distance. Classification performance was subsequently assessed according to the median classification accuracy with respect to the ground truth cell type labels across folds.
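
The prediction rule described above (the mode of the labels of the k=3 closest training cells under Euclidean distance) can be sketched as follows. The function name, the synthetic two-population data, and the single train/test split are assumptions for illustration; cross-validation and per-fold feature selection are omitted.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test label as the mode of its k nearest training labels."""
    predictions = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
        nearest = np.argsort(dist)[:k]
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


rng = np.random.default_rng(0)
# Hypothetical data: two well-separated cell populations measured on 4 features.
X_train = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
                     rng.normal(6.0, 1.0, size=(30, 4))])
y_train = np.array(["typeA"] * 30 + ["typeB"] * 30)
X_test = np.vstack([rng.normal(0.0, 1.0, size=(10, 4)),
                    rng.normal(6.0, 1.0, size=(10, 4))])
y_test = np.array(["typeA"] * 10 + ["typeB"] * 10)

y_pred = knn_predict(X_train, y_train, X_test, k=3)
accuracy = float(np.mean(y_pred == y_test))
```

In a benchmarking setting, the columns of `X_train` and `X_test` would first be restricted to the features selected on the training fold only, so that the test cells do not influence feature selection.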

Support Vector Machine

The Support Vector Machine (SVM) [130] is a supervised learning algorithm that constructs hyperplanes in the high-dimensional feature space to separate classes. We implemented SVM classification or regression using the sklearn v0.23.4 package in python. SVM classification was used to predict cell cycle phase labels for both RPE and PDAC 4i datasets, whereas regression was used to predict age measurements from time-lapse imaging for the RPE dataset. Here, nested 10-fold cross-validation was performed using stratified random sampling to assign cells to either a training set or a test set. Within each fold, feature selection was performed to identify the p most relevant features according to a feature selection strategy. SVM hyperparameters were then tuned over a grid search and phase labels were subsequently predicted from the test data according to those p features. Classification performance was assessed according to the median classification accuracy with respect to the ground truth cell cycle phase labels across folds. Regression performance was assessed according to the average mean squared error with respect to ground truth age measurements across folds.

Precision@k

To evaluate the biological relevance of selected features from each method, we computed precision@k (p@k) as the proportion of top k selected features that were considered to be biologically relevant according to a ground truth reference as,

$$p@k = \frac{|F_{s,k} \cap F_r|}{|F_{s,k}|}, \tag{24}$$

where $F_{s,k}$ indicates the set of selected features at threshold $k$, where $F_{s,k} \subseteq F_s$, and $F_r$ indicates the set of reference features. Reference features were defined as either (1) the ground truth differentially expressed features within a cluster or along a differentiation path from the single-cell RNA sequencing simulation study (see Splatter simulation) or (2) the features determined to be useful for classifying cells according to cell cycle phase using a random forest classifier trained on ground truth phase annotations from time-lapse imaging for the protein immunofluorescence imaging datasets (see Random forest, RPE analysis, PDAC analysis).
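As a small worked example of Eq. 24, precision@k is the size of the overlap between the top k selected features and the reference set, divided by the number of top k features. The function name and the feature names below are hypothetical placeholders.

```python
def precision_at_k(ranked_features, reference_features, k):
    """Fraction of the top-k ranked features that appear in the reference set."""
    top_k = set(ranked_features[:k])
    if not top_k:
        raise ValueError("k must select at least one feature")
    return len(top_k & set(reference_features)) / len(top_k)

# Hypothetical ranking from a feature selection method (best first).
ranked = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"]
# Hypothetical reference set of biologically relevant features.
reference = {"feat_a", "feat_c", "feat_d", "feat_z"}

score = precision_at_k(ranked, reference, k=4)  # 3 of the top 4 are in the reference
```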

Unsupervised clustering

To evaluate feature selection method performance on retaining features that are informative for identifying canonical cell types, we performed unsupervised clustering on the data defined by the top p ranked features from a feature selection strategy. More specifically, for each feature selection approach, clustering was performed on the selected data using the KMeans++ algorithm [103] with the number of centroids set to the number of ground truth cell cycle phase labels for the protein immunofluorescence imaging datasets (RPE: c=4, PDAC: c=5).

To assess the accuracy of clustering assignments, we quantified a normalized mutual information (NMI) score between the predicted cluster labels and the ground truth cell type labels. Normalized mutual information [131] is a clustering quality metric that measures the amount of shared information between two cell-to-cluster partitions ($u$ and $v$, such that the $i$th entry $u_i$ gives the cluster assignment of cell $i$) as,

$$NMI = \frac{2 I(u;v)}{H(u) + H(v)}, \tag{25}$$

where $I(u;v)$ measures the mutual information between ground truth cell type labels $u$ and cluster labels $v$, and $H(u)$ or $H(v)$ indicates the Shannon entropy or the amount of uncertainty for a given set of labels. Here, a score of 1 indicates that clustering on the selected features perfectly recovers the ground truth cell type labels. KMeans++ clustering was implemented using the KMeans function in sklearn v0.23.4.
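For illustration, the NMI score of Eq. 25 can be computed directly from label vectors. The sketch below assumes the arithmetic-mean normalization, that is, twice the mutual information divided by the sum of the two entropies, and returns 1 when both partitions contain a single cluster (a convention chosen here, not taken from this work). Function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    _, counts = np.unique(np.asarray(labels).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(u, v):
    """Mutual information between two label vectors of equal length."""
    u_vals, u_idx = np.unique(np.asarray(u).ravel(), return_inverse=True)
    v_vals, v_idx = np.unique(np.asarray(v).ravel(), return_inverse=True)
    joint = np.zeros((len(u_vals), len(v_vals)))
    np.add.at(joint, (u_idx, v_idx), 1.0)  # contingency table of co-assignments
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log(joint[nonzero] / outer[nonzero])).sum())

def nmi(u, v):
    """2 * I(u; v) / (H(u) + H(v)); defined as 1 when both entropies are zero."""
    denom = entropy(u) + entropy(v)
    return 1.0 if denom == 0 else 2.0 * mutual_information(u, v) / denom

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition, different cluster names
unrelated = [0, 1, 0, 1]    # carries no information about the truth
nmi_same = nmi(truth, relabeled)
nmi_indep = nmi(truth, unrelated)
```

Because the score depends only on the partition, renaming clusters leaves it unchanged, which is why it suits comparisons between cluster labels and ground truth annotations.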

Protein-protein interaction networks

In this work, we aimed to test whether features within DELVE dynamic clusters had experimental evidence of co-regulation as compared to random assignment. The STRING (search tool for the retrieval of interacting genes/proteins) database [121] is a relational database that computes protein association scores according to information derived from several evidence channels, including computational predictions (e.g. neighborhood, fusion, co-occurrence), co-expression, experimental assays, pathway databases, and literature text mining. To assess the significance of protein interactions amongst features within a DELVE cluster, we performed a permutation test with a test statistic derived from STRING association scores using experimental evidence as follows.

Let $\mathcal{G}_p = (\mathcal{N}_p, \mathcal{E}_p)$ denote a graph of $p$ proteins from a DELVE cluster comprising the nodes $\mathcal{N}_p$, and $\mathcal{E}_p$ denote the set of edges, where edge weights encode the association scores of experimentally-derived protein-protein interaction evidence from the STRING database. Moreover, let $\mathcal{G}_r = (\mathcal{N}_r, \mathcal{E}_r)$ denote a graph of $r$ proteins randomly sampled without replacement from the full feature space $d$ such that $r = p$, comprising the nodes $\mathcal{N}_r$, and $\mathcal{E}_r$ denote the set of edges encoding the experimentally-derived association scores between those $r$ proteins from the STRING database. We compute the permutation p-value as described previously in ref. [132] as,

$$\text{p-value} = \frac{N + 1}{R + 1}. \tag{26}$$

Here, $N$ indicates the number of times that $T_r \geq T_{obs}$ out of $R$ random permutations ($R = 1000$), where $T_r$ is the average degree of a STRING association network from randomly permuted features, $T_r = \frac{|\mathcal{E}_r|}{|\mathcal{N}_r|}$, and $T_{obs}$ is the average degree of a STRING association network from the features identified within a DELVE cluster, $T_{obs} = \frac{|\mathcal{E}_p|}{|\mathcal{N}_p|}$. Of note, networks with higher degree are more connected, and thus show greater experimental evidence of protein-protein interactions. Experimental evidence-based association scores were obtained from the STRING database using the stringdb v0.1.5 package in python and networks were generated using networkx v2.5.1.
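The permutation test can be sketched as below, using a synthetic association matrix in place of STRING scores. Several choices here are assumptions made for illustration: an edge is counted whenever a pairwise score is nonzero, the average degree is taken as the number of edges divided by the number of nodes, and all names are hypothetical.

```python
import numpy as np

def average_degree(adjacency, nodes):
    """Edges per node in the subgraph induced by `nodes` (edge = nonzero score)."""
    sub = adjacency[np.ix_(nodes, nodes)]
    n_edges = np.count_nonzero(np.triu(sub, k=1))  # count each undirected edge once
    return n_edges / len(nodes)

def permutation_p_value(adjacency, cluster_nodes, n_permutations=1000, seed=0):
    """(N + 1) / (R + 1), where N counts random sets with T_r >= T_obs."""
    rng = np.random.default_rng(seed)
    cluster_nodes = np.asarray(cluster_nodes)
    n_total = adjacency.shape[0]
    t_obs = average_degree(adjacency, cluster_nodes)
    n_extreme = 0
    for _ in range(n_permutations):
        # Sample as many features as the cluster holds, without replacement.
        sampled = rng.choice(n_total, size=len(cluster_nodes), replace=False)
        if average_degree(adjacency, sampled) >= t_obs:
            n_extreme += 1
    return (n_extreme + 1) / (n_permutations + 1)

# Synthetic example: 40 features, of which the first 5 are fully interconnected.
scores = np.zeros((40, 40))
scores[:5, :5] = 0.9
np.fill_diagonal(scores, 0.0)
cluster = np.arange(5)
p_value = permutation_p_value(scores, cluster)
```

Adding one to both numerator and denominator keeps the estimate away from zero, so the smallest attainable value with R = 1000 is 1/1001.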

Trajectory inference and analysis

To evaluate how well feature selection methods could identify features that recapitulate the underlying cellular trajectory and can be used for trajectory analysis, we computed three metrics to assess trajectory preservation at different stages of inference: accuracy of the inferred trajectory graph, correlation of estimated pseudotime to the ground truth cell progression measurements, and the significance of dynamic features identified following regression analysis.

To obtain predicted trajectories, we performed trajectory inference using the diffusion pseudotime algorithm [83] based on 20 diffusion map components generated from a k-nearest neighbor graph (k=10), where edge weights were determined by pairwise Euclidean distances between cells according to selected feature expression. Inference was performed for the following lineages – simulated trajectories: all cells, arrested trajectory: cells with G0 phase annotation, proliferative trajectory: cells with G1, S, G2, or M phase annotation, CD8 T cell memory lineage: cells following day 7 of infection with a memory score. Moreover, for each feature selection approach, we estimated pseudotime using ten random root cells according to a priori biological knowledge. Root cells were chosen as either (1) cells with the smallest ground truth pseudotime annotation for the simulated datasets, (2) cells with the youngest molecular age for 4i cell cycle datasets, or (3) cells from the day 7 population along the memory lineage for the CD8 T cell differentiation dataset. Feature selection trajectory performance was subsequently assessed as follows.

  1. Trajectory graph similarity: Partition-based graph abstraction (PAGA) [104] performs trajectory inference by constructing a coarse-grained trajectory graph of the data. First, cell populations are determined either through unsupervised clustering, graph partitioning, or a priori experimental annotations. Next, a statistical measure of edge connectivity is computed between cell populations to estimate the confidence of a cell population transition. To assess if feature selection methods retain features that represent coarse cell type transitions, we compared predicted PAGA trajectory graphs to ground truth cell cycle reference trajectories curated from the literature [92]. First, PAGA connectivity was estimated between ground truth cell cycle phase groups using the k-nearest neighbor graph (k=10) based on pairwise Euclidean distances between cells according to selected feature expression. We then computed the Jaccard distance between predicted and reference trajectories as,
    $$d_j(W_p, W_r) = 1 - \frac{|W_p \cap W_r|}{|W_p \cup W_r|}. \tag{27}$$
    $W_p$ indicates the predicted cell type transition adjacency matrix, where each entry $W_{p,ij}$ represents the connectivity strength between cell populations $i$ and $j$ from PAGA, and $W_r$ indicates the reference trajectory adjacency matrix with entries encoding ground truth cell type transitions curated from the literature. Here, a lower Jaccard distance indicates that predicted trajectories better capture known cellular transitions.
  2. Pseudotime correlation: To evaluate if feature selection methods retain features that accurately represent a cell’s progression through a biological trajectory, we computed the Spearman rank correlation coefficient between estimated pseudotime following feature selection and ground truth cell progression annotations (e.g. the ground truth pseudotime labels generated from simulations, time-lapse imaging molecular age measurements).

  3. Regression analysis: To identify genes associated with the inferred CD8+ T cell differentiation trajectory following feature selection, we performed regression analysis for each gene (d=500) along estimated pseudotime using a negative binomial generalized additive model (GAM). Genes were considered to be differentially expressed along the memory lineage if they had a q-value<0.05 following Benjamini-Hochberg false discovery rate correction [122].

  4. Gene Ontology: To identify the biological relevance of the differentially expressed genes along inferred CD8+ T cell differentiation trajectories specific to each feature selection strategy, we performed gene set enrichment analysis on the set difference of significant genes from either highly variable gene selection or DELVE feature selection using Enrichr [123]. Here, we considered the mouse gene sets from GO Biological Process 2021.
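Two of the metrics in the list above, the Jaccard distance between trajectory graphs and the Spearman rank correlation of pseudotime, can be illustrated with a short NumPy sketch. It assumes that the predicted connectivity matrix is binarized at a threshold before edges are compared (the threshold value is a hypothetical choice), and the matrices and pseudotime values are synthetic.

```python
import numpy as np

def jaccard_distance(w_pred, w_ref, threshold=0.0):
    """1 - |intersection| / |union| over the edges present in each matrix."""
    pred_edges = np.asarray(w_pred) > threshold  # assumption: binarize connectivities
    ref_edges = np.asarray(w_ref) > 0
    union = np.logical_or(pred_edges, ref_edges).sum()
    if union == 0:
        return 0.0
    return float(1.0 - np.logical_and(pred_edges, ref_edges).sum() / union)

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Synthetic reference: a four-population cycle 0 -> 1 -> 2 -> 3 -> 0.
w_ref = np.zeros((4, 4))
w_ref[0, 1] = w_ref[1, 2] = w_ref[2, 3] = w_ref[3, 0] = 1.0
# Synthetic prediction: three correct transitions plus one spurious edge.
w_pred = np.zeros((4, 4))
w_pred[0, 1], w_pred[1, 2], w_pred[2, 3], w_pred[0, 2] = 0.9, 0.8, 0.7, 0.3

distance = jaccard_distance(w_pred, w_ref)  # 3 shared edges out of 5 in the union
rho = spearman([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 4.0])
```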

Diffusion pseudotime was implemented using the dpt function in Scanpy v1.8.1, PAGA was implemented using the paga function in Scanpy v1.8.1, GAM regression was implemented using the statsmodels v0.12.2 package in python, and gene set enrichment analysis was performed using the enrichr function in the gseapy v1.0.4 package in python.
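For readers unfamiliar with the Benjamini-Hochberg correction applied to the per-gene p-values in the regression analysis step, a minimal NumPy sketch is given here. It is a generic implementation of the procedure rather than the code used in this work, and the p-values are invented for the example.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

toy_p = [0.01, 0.04, 0.03, 0.005]    # invented p-values for four genes
q_values = benjamini_hochberg(toy_p)
significant = q_values < 0.05        # genes called differentially expressed
```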

PHATE visualizations

To qualitatively compare lower dimensional representations from each feature selection strategy, we performed nonlinear dimensionality reduction using PHATE (potential of heat-diffusion for affinity-based transition embedding) [75] as this approach performs reasonably well for representing complex continuous biological trajectories. PHATE was implemented using the phate v1.0.7 package in python. Here, we used the same set of hyperparameters across all feature selection strategies (knn=30, t=10, decay=40).

Aggregate scores

To rank feature selection methods on preserving biological trajectories in the presence of single-cell noise, we computed rank aggregate scores by taking the mean of scaled method scores across simulated single-cell RNA sequencing datasets from a trajectory type and noise condition (e.g. linear trajectory, dropout noise). More specifically, we first defined an overall method score per dataset as the median of each metric. Method scores were subsequently min-max scaled to ensure datasets were equally weighted prior to computing the average.
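One possible reading of this aggregation is sketched below. It assumes that every metric is oriented so that higher is better, that the per-dataset method score is the median across metrics, and that min-max scaling is applied across methods within each dataset. These are interpretations made for illustration, and the scores are invented.

```python
import numpy as np

def aggregate_scores(metric_scores):
    """Aggregate an array of shape (methods, datasets, metrics) to one score per method."""
    scores = np.asarray(metric_scores, dtype=float)
    per_dataset = np.median(scores, axis=2)         # (methods, datasets)
    lo = per_dataset.min(axis=0, keepdims=True)     # worst method per dataset
    hi = per_dataset.max(axis=0, keepdims=True)     # best method per dataset
    span = np.where(hi > lo, hi - lo, 1.0)          # avoid division by zero
    scaled = (per_dataset - lo) / span              # min-max scale within dataset
    return scaled.mean(axis=1)                      # average across datasets

# Invented scores for 3 methods, 2 datasets, and 3 metrics.
toy_scores = [
    [[0.9, 0.8, 0.7], [0.6, 0.5, 0.7]],
    [[0.5, 0.4, 0.6], [0.3, 0.2, 0.4]],
    [[0.7, 0.6, 0.8], [0.45, 0.5, 0.4]],
]
aggregate = aggregate_scores(toy_scores)
ranking = np.argsort(aggregate)[::-1]  # method indices, best first
```

Scaling within each dataset before averaging prevents a dataset with a wide score range from dominating the aggregate.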

Supplementary Material

Supplement 1

Acknowledgements

We would like to thank Tarek Zikry, Garrett Sessions, and Alec Plotkin for their insightful discussions related to this work.

Funding

This work was supported by the National Institutes of Health, F31-HL156433 (JSR), 5T32-GM067553 (JSR), R01-GM138834 (JEP), NSF CAREER Award 1845796 (JEP), and NSF Award 2242980 (JEP).

Footnotes

Data and code availability

The raw publicly available single-cell datasets used in this study are available in the Zenodo repository https://doi.org/10.5281/zenodo.4525425 for the RPE cell cycle dataset [133], the Zenodo repository https://doi.org/10.5281/zenodo.7860332 for the PDAC cell cycle datasets [134], and the Gene Expression Omnibus (GEO) under the accession code GSE131847 for the CD8 T cell differentiation dataset [135]. The preprocessed datasets are available in the Zenodo repository https://doi.org/10.5281/zenodo.7883604 [136]. DELVE is implemented as an open-source python package and is publicly available at https://github.com/jranek/delve. Source code for benchmarking feature selection methods, including all functions for preprocessing, feature selection, evaluation, and plotting, is publicly available at https://github.com/jranek/delve_benchmark.

References

  • [1].Spitzer Matthew H and Nolan Garry P. Mass cytometry: Single cells, many features. Cell, 165(4):780–791, May 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Baumgarth N and Roederer M. A practical approach to multicolor flow cytometry for immunophenotyping. J. Immunol. Methods, 243(1–2):77–97, September 2000. [DOI] [PubMed] [Google Scholar]
  • [3].Bandura Dmitry R, Baranov Vladimir I, Ornatsky Olga I, Antonov Alexei, Kinach Robert, Lou Xudong, Pavlov Serguei, Vorobiev Sergey, Dick John E, and Tanner Scott D. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Anal. Chem., 81(16):6813–6822, August 2009. [DOI] [PubMed] [Google Scholar]
  • [4].Zheng Grace X Y, Terry Jessica M, Belgrader Phillip, Ryvkin Paul, Bent Zachary W, Wilson Ryan, Ziraldo Solongo B, Wheeler Tobias D, McDermott Geoff P, Zhu Junjie, Gregory Mark T, Shuga Joe, Montesclaros Luz, Underwood Jason G, Masquelier Donald A, Nishimura Stefanie Y, Schnall-Levin Michael, Wyatt Paul W, Hindson Christopher M, Bharadwaj Rajiv, Wong Alexander, Ness Kevin D, Beppu Lan W, Deeg H Joachim, McFarland Christopher, Loeb Keith R, Valente William J, Ericson Nolan G, Stevens Emily A, Radich Jerald P, Mikkelsen Tarjei S, Hindson Benjamin J, and Bielas Jason H. Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8:14049, January 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Macosko Evan Z, Basu Anindita, Satija Rahul, Nemesh James, Shekhar Karthik, Goldman Melissa, Tirosh Itay, Bialas Allison R, Kamitaki Nolan, Martersteck Emily M, Trombetta John J, Weitz David A, Sanes Joshua R, Shalek Alex K, Regev Aviv, and McCarroll Steven A. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, May 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Picelli Simone, Faridani Omid R, Björklund Asa K, Winberg Gösta, Sagasser Sven, and Sandberg Rickard. Full-length RNA-seq from single cells using smart-seq2. Nat. Protoc., 9(1):171–181, January 2014. [DOI] [PubMed] [Google Scholar]
  • [7].Zilionis Rapolas, Nainys Juozas, Veres Adrian, Savova Virginia, Zemmour David, Klein Allon M, and Mazutis Linas. Single-cell barcoding and sequencing using droplet microfluidics. Nat. Protoc., 12(1):44–73, January 2017. [DOI] [PubMed] [Google Scholar]
  • [8].Gut Gabriele, Herrmann Markus D, and Pelkmans Lucas. Multiplexed protein maps link subcellular organization to cellular states. Science, 361(6401), August 2018. [DOI] [PubMed] [Google Scholar]
  • [9].Keren Leeat, Bosse Marc, Thompson Steve, Risom Tyler, Vijayaragavan Kausalia, McCaffrey Erin, Marquez Diana, Angoshtari Roshan, Greenwald Noah F, Fienberg Harris, Wang Jennifer, Kambham Neeraja, Kirkwood David, Nolan Garry, Montine Thomas J, Galli Stephen J, West Robert, Bendall Sean C, and Angelo Michael. MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Sci Adv, 5(10):eaax5851, October 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Giesen Charlotte, Wang Hao A O, Schapiro Denis, Zivanovic Nevena, Jacobs Andrea, Hattendorf Bodo, Schüffler Peter J, Grolimund Daniel, Buhmann Joachim M, Brandt Simone, Varga Zsuzsanna, Wild Peter J, Günther Detlef, and Bodenmiller Bernd. Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nat. Methods, 11(4):417–422, April 2014. [DOI] [PubMed] [Google Scholar]
  • [11].Goltsev Yury, Samusik Nikolay, Kennedy-Darling Julia, Bhate Salil, Hale Matthew, Vazquez Gustavo, Black Sarah, and Nolan Garry P. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell, 174(4):968–981.e15, August 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Cao Junyue, Spielmann Malte, Qiu Xiaojie, Huang Xingfan, Ibrahim Daniel M, Hill Andrew J, Zhang Fan, Mundlos Stefan, Christiansen Lena, Steemers Frank J, Trapnell Cole, and Shendure Jay. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496–502, February 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Farrell Jeffrey A, Wang Yiqun, Riesenfeld Samantha J, Shekhar Karthik, Regev Aviv, and Schier Alexander F. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science, 360(6392), June 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Fawkner-Corbett David, Antanaviciute Agne, Parikh Kaushal, Jagielowicz Marta, Gerós Ana Sousa, Gupta Tarun, Ashley Neil, Khamis Doran, Fowler Darren, Morrissey Edward, Cunningham Chris, Johnson Paul R V, Koohy Hashem, and Simmons Alison. Spatiotemporal analysis of human intestinal development at single-cell resolution. Cell, 184(3):810–826.e23, February 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Holloway Emily M, Czerwinski Michael, Tsai Yu-Hwai, Wu Joshua H, Wu Angeline, Childs Charlie J, Walton Katherine D, Sweet Caden W, Yu Qianhui, Glass Ian, Treutlein Barbara, Camp J Gray, and Spence Jason R. Mapping development of the human intestinal niche at Single-Cell resolution. Cell Stem Cell, 28(3):568–580.e4, March 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Kaufmann Max, Evans Hayley, Schaupp Anna-Lena, Engler Jan Broder, Kaur Gurman, Willing Anne, Kursawe Nina, Schubert Charlotte, Attfield Kathrine E, Fugger Lars, and Friese Manuel A. Identifying CNS-colonizing T cells as potential therapeutic targets to prevent progression of multiple sclerosis. Med (N Y), 2(3):296–312.e8, March 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Risom Tyler, Glass David R, Averbukh Inna, Liu Candace C, Baranski Alex, Kagel Adam, McCaffrey Erin F, Greenwald Noah F, Rivero-Gutiérrez Belén, Strand Siri H, Varma Sushama, Kong Alex, Keren Leeat, Srivastava Sucheta, Zhu Chunfang, Khair Zumana, Veis Deborah J, Deschryver Katherine, Vennam Sujay, Maley Carlo, Hwang E Shelley, Marks Jeffrey R, Bendall Sean C, Colditz Graham A, West Robert B, and Angelo Michael. Transition to invasive breast cancer is associated with progressive changes in the structure and composition of tumor stroma. Cell, 185(2):299–310.e18, January 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Arunachalam Prabhu S, Wimmers Florian, Mok Chris Ka Pun, Perera Ranawaka A P M, Scott Madeleine, Hagan Thomas, Sigal Natalia, Feng Yupeng, Bristow Laurel, Tsang Owen Tak-Yin, Wagh Dhananjay, Coller John, Pellegrini Kathryn L, Kazmin Dmitri, Alaaeddine Ghina, Leung Wai Shing, Chan Jacky Man Chun, Chik Thomas Shiu Hong, Choi Chris Yau Chung, Huerta Christopher, McCullough Michele Paine, Lv Huibin, Anderson Evan, Edupuganti Srilatha, Upadhyay Amit A, Bosinger Steve E, Maecker Holden Terry, Khatri Purvesh, Rouphael Nadine, Peiris Malik, and Pulendran Bali. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science, 369(6508):1210–1220, September 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Saelens Wouter, Cannoodt Robrecht, Todorov Helena, and Saeys Yvan. A comparison of single-cell trajectory inference methods. Nat. Biotechnol., 37(5):547–554, May 2019. [DOI] [PubMed] [Google Scholar]
  • [20].Trapnell Cole, Cacchiarelli Davide, Grimsby Jonna, Pokharel Prapti, Li Shuqiang, Morse Michael, Lennon Niall J, Livak Kenneth J, Mikkelsen Tarjei S, and Rinn John L. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol., 32(4):381–386, April 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Ji Zhicheng and Ji Hongkai. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res., 44(13):e117, July 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Shin Jaehoon, Berg Daniel A, Zhu Yunhua, Shin Joseph Y, Song Juan, Bonaguidi Michael A, Enikolopov Grigori, Nauen David W, Christian Kimberly M, Ming Guo-Li, and Song Hongjun. Single-Cell RNA-Seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell, 17(3):360–372, September 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Cannoodt Robrecht, Saelens Wouter, Sichien Dorine, Tavernier Simon, and Saeys Yvan. SCORPIUS improves trajectory inference and identifies novel modules in dendritic cell development. bioRxiv, October 2016. [Google Scholar]
  • [24].Street Kelly, Risso Davide, Fletcher Russell B, Das Diya, Ngai John, Yosef Nir, Purdom Elizabeth, and Dudoit Sandrine. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19(1):477, June 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Welch Joshua D, Hartemink Alexander J, and Prins Jan F. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol., 17(1):106, May 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Wolf F Alexander, Hamey Fiona K, Plass Mireya, Solana Jordi, Dahlin Joakim S, Göttgens Berthold, Rajewsky Nikolaus, Simon Lukas, and Theis Fabian J. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol., 20(1):59, March 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Cao Junyue, Spielmann Malte, Qiu Xiaojie, Huang Xingfan, Ibrahim Daniel M, Hill Andrew J, Zhang Fan, Mundlos Stefan, Christiansen Lena, Steemers Frank J, Trapnell Cole, and Shendure Jay. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496–502, February 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Haghverdi Laleh, Büttner Maren, Wolf F Alexander, Buettner Florian, and Theis Fabian J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods, 13(10):845–848, October 2016. [DOI] [PubMed] [Google Scholar]
  • [29].Setty Manu, Kiseliovas Vaidotas, Levine Jacob, Gayoso Adam, Mazutis Linas, and Pe’er Dana. Characterization of cell fate probabilities in single-cell data with palantir. Nat. Biotechnol., 37(4):451–460, April 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Stassen Shobana V, Yip Gwinky G K, Wong Kenneth K Y, Ho Joshua W K, and Tsia Kevin K. Generalized and scalable trajectory inference in single-cell omics data with VIA. Nat. Commun., 12(1):5528, September 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [31].Schiebinger Geoffrey, Shu Jian, Tabaka Marcin, Cleary Brian, Subramanian Vidya, Solomon Aryeh, Gould Joshua, Liu Siyan, Lin Stacie, Berube Peter, Lee Lia, Chen Jenny, Brumbaugh Justin, Rigollet Philippe, Hochedlinger Konrad, Jaenisch Rudolf, Regev Aviv, and Lander Eric S. Optimal-Transport analysis of Single-Cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943.e22, February 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Tong Alexander, Huang Jessie, Wolf Guy, van Dijk David, and Krishnaswamy Smita. TrajectoryNet: A dynamic optimal transport network for modeling cellular dynamics. Proc Mach Learn Res, 119:9526–9536, July 2020. [PMC free article] [PubMed] [Google Scholar]
  • [33].Van den Berge Koen, de Bézieux Hector Roux, Street Kelly, Saelens Wouter, Cannoodt Robrecht, Saeys Yvan, Dudoit Sandrine, and Clement Lieven. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun., 11(1):1201, March 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Hou Wenpin, Ji Zhicheng, Chen Zeyu, Wherry E John, Hicks Stephanie C, and Ji Hongkai. A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples. bioRxiv, page 2021.07.10.451910, July 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Campbell Kieran R and Yau Christopher. Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nat. Commun., 9(1):2442, June 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Ghazanfar Shila, Lin Yingxin, Su Xianbin, Lin David Ming, Patrick Ellis, Han Ze-Guang, Marioni John C, and Yang Jean Yee Hwa. Investigating higher-order interactions in single-cell data with scHOT. Nat. Methods, 17(8):799–806, August 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Deshpande Atul, Chu Li-Fang, Stewart Ron, and Gitter Anthony. Network inference with granger causality ensembles on single-cell transcriptomics. Cell Rep., 38(6):110333, February 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [38].Alpert Ayelet, Moore Lindsay S, Dubovik Tania, and Shen-Orr Shai S. Alignment of single-cell trajectories to compare cellular expression dynamics. Nat. Methods, 15(4):267–270, April 2018. [DOI] [PubMed] [Google Scholar]
  • [39].Sugihara Reiichi, Kato Yuki, Mori Tomoya, and Kawahara Yukio. Alignment of single-cell trajectory trees with CAPITAL. Nat. Commun., 13(1):5972, October 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Hicks Stephanie C, Townes F William, Teng Mingxiang, and Irizarry Rafael A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 19(4):562–578, October 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [41].Hickey John W, Neumann Elizabeth K, Radtke Andrea J, Camarillo Jeannie M, Beuschel Rebecca T, Albanese Alexandre, McDonough Elizabeth, Hatler Julia, Wiblin Anne E, Fisher Jeremy, Croteau Josh, Small Eliza C, Sood Anup, Caprioli Richard M, Angelo R Michael, Nolan Garry P, Chung Kwanghun, Hewitt Stephen M, Germain Ronald N, Spraggins Jeffrey M, Lundberg Emma, Snyder Michael P, Kelleher Neil L, and Saka Sinem K. Spatial mapping of protein composition and tissue organization: a primer for multiplexed antibody-based imaging. Nat. Methods, 19(3):284–295, March 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Charrout Mohammed, Reinders Marcel J T, and Mahfouz Ahmed. Untangling biological factors influencing trajectory inference from single cell data. NAR Genom Bioinform, 2(3):lqaa053, September 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [43].Tritschler Sophie, Büttner Maren, Fischer David S, Lange Marius, Bergen Volker, Lickert Heiko, and Theis Fabian J. Concepts and limitations for learning developmental trajectories from single cell genomics. Development, 146(12), June 2019. [DOI] [PubMed] [Google Scholar]
  • [44].Zhu Chenxu, Preissl Sebastian, and Ren Bing. Single-cell multimodal omics: the power of many. Nat. Methods, 17(1):11–14, January 2020. [DOI] [PubMed] [Google Scholar]
  • [45].Yang Pengyi, Huang Hao, and Liu Chunlei. Feature selection revisited in the single-cell era. Genome Biol., 22(1):321, December 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Breiman Leo. Random forests. Mach. Learn., 45(1):5–32, October 2001. [Google Scholar]
  • [47].Peng Hanchuan, Long Fuhui, and Ding Chris. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27(8):1226–1238, August 2005. [DOI] [PubMed] [Google Scholar]
  • [48].Estévez Pablo A, Tesmer Michel, Perez Claudio A, and Zurada Jacek M. Normalized mutual information feature selection. IEEE Trans. Neural Netw., 20(2):189–201, February 2009. [DOI] [PubMed] [Google Scholar]
  • [49].Liechti Thomas, Weber Lukas M, Ashhurst Thomas M, Stanley Natalie, Prlic Martin, Van Gassen Sofie, and Mair Florian. An updated guide for the perplexed: cytometry in the high-dimensional era. Nat. Immunol., 22(10):1190–1197, October 2021. [DOI] [PubMed] [Google Scholar]
  • [50].Satija Rahul, Farrell Jeffrey A, Gennert David, Schier Alexander F, and Regev Aviv. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol., 33(5):495–502, May 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Stuart Tim, Butler Andrew, Hoffman Paul, Hafemeister Christoph, Papalexi Efthymia, Mauck William M 3rd, Hao Yuhan, Stoeckius Marlon, Smibert Peter, and Satija Rahul. Comprehensive integration of Single-Cell data. Cell, 177(7):1888–1902.e21, June 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].He Xiaofei, Cai Deng, and Niyogi Partha. Laplacian score for feature selection. In Advances in Neural Information Processing Systems, volume 18. MIT Press, 2005. [Google Scholar]
  • [53].DeTomaso David and Yosef Nir. Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell Syst, 12(5):446–456.e9, May 2021. [DOI] [PubMed] [Google Scholar]
  • [54].Liang Shaoheng, Mohanty Vakul, Dou Jinzhuang, Miao Qi, Huang Yuefan, Müftüoğlu Muharrem, Ding Li, Peng Weiyi, and Chen Ken. Single-cell manifold-preserving feature selection for detecting rare cell populations. Nature Computational Science, 1(5):374–384, May 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [55].Cai Deng, Zhang Chiyuan, and He Xiaofei. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ‘10, pages 333–342, New York, NY, USA, July 2010. Association for Computing Machinery. [Google Scholar]
  • [56].McInnes Leland, Healy John, and Melville James. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv, February 2018. [Google Scholar]
  • [57].Ng Andrew Y, Jordan Michael I, and Weiss Yair. On spectral clustering: analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 849–856, Cambridge, MA, USA, January 2001. MIT Press. [Google Scholar]
  • [58].Lindenbaum Ofir, Shaham Uri, Svirsky Jonathan, Peterfreund Erez, and Kluger Yuval. Differentiable unsupervised feature selection based on a gated laplacian. arXiv, July 2020. [Google Scholar]
  • [59].Shaham Uri, Lindenbaum Ofir, Svirsky Jonathan, and Kluger Yuval. Deep unsupervised feature selection by discarding nuisance and correlated features. Neural Netw., 152:34–43, August 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [60].Arnold Sebastian J and Robertson Elizabeth J. Making a commitment: cell lineage allocation and axis patterning in the early mouse embryo. Nat. Rev. Mol. Cell Biol., 10(2):91–103, February 2009. [DOI] [PubMed] [Google Scholar]
  • [61].Perrimon Norbert, Pitsouli Chrysoula, and Shilo Ben-Zion. Signaling mechanisms controlling cell fate and embryonic patterning. Cold Spring Harb. Perspect. Biol., 4(8):a005975, August 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [62].Britton George, Heemskerk Idse, Hodge Rachel, Qutub Amina A, and Warmflash Aryeh. A novel self-organizing embryonic stem cell system reveals signaling logic underlying the patterning of human ectoderm. Development, 146(20), October 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [63].Rompolas Panteleimon, Mesa Kailin R, Kawaguchi Kyogo, Park Sangbum, Gonzalez David, Brown Samara, Boucher Jonathan, Klein Allon M, and Greco Valentina. Spatiotemporal coordination of stem cell commitment during epidermal homeostasis. Science, 352(6292):1471–1474, June 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [64].Levine Jacob H, Simonds Erin F, Bendall Sean C, Davis Kara L, Amir El-Ad D, Tadmor Michelle D, Litvin Oren, Fienberg Harris G, Jager Astraea, Zunder Eli R, Finck Rachel, Gedman Amanda L, Radtke Ina, Downing James R, Pe’er Dana, and Nolan Garry P. Data-Driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184–197, July 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [65].Dann Emma, Henderson Neil C, Teichmann Sarah A, Morgan Michael D, and Marioni John C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol., September 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [66].Baskaran Vishal Athreya, Ranek Jolene, Shan Siyuan, Stanley Natalie, and Oliva Junier B. Distribution-based sketching of single-cell samples. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, number Article 26 in BCB ’22, pages 1–10, New York, NY, USA, August 2022. Association for Computing Machinery. [Google Scholar]
  • [67].Dong Xiaowen, Thanou Dorina, Toni Laura, Bronstein Michael, and Frossard Pascal. Graph signal processing for machine learning: A review and new perspectives. IEEE Signal Process. Mag., 37(6):117–127, November 2020. [Google Scholar]
  • [68].Shuman David I, Narang Sunil K, Frossard Pascal, Ortega Antonio, and Vandergheynst Pierre. The emerging field of signal processing on graphs: Extending High-Dimensional data analysis to networks and other irregular domains. arXiv, October 2012. [Google Scholar]
  • [69].Luecken Malte D and Theis Fabian J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol., 15(6):e8746, June 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [70].Ishwaran Hemant. The effect of splitting on random forests. Mach. Learn., 99(1):75–118, April 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [71].Efron Bradley, Hastie Trevor, Johnstone Iain, and Tibshirani Robert. Least angle regression. Ann. Stat., 32(2):407–499, April 2004. [Google Scholar]
  • [72].Zappia Luke, Phipson Belinda, and Oshlack Alicia. Splatter: simulation of single-cell RNA sequencing data. Genome Biol., 18(1):174, September 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [73].Hafemeister Christoph and Satija Rahul. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol., 20(1):296, December 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [74].Svensson Valentine, da Veiga Beltrame Eduardo, and Pachter Lior. Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq. bioRxiv, page 762773, September 2019. [Google Scholar]
  • [75].Moon Kevin R, van Dijk David, Wang Zheng, Gigante Scott, Burkhardt Daniel B, Chen William S, Yim Kristina, van den Elzen Antonia, Hirn Matthew J, Coifman Ronald R, Ivanova Natalia B, Wolf Guy, and Krishnaswamy Smita. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol., 37(12):1482–1492, December 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [76].Shan Siyuan, Baskaran Vishal Athreya, Yi Haidong, Ranek Jolene, Stanley Natalie, and Oliva Junier B. Transparent single-cell set classification with kernel mean embeddings. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, number Article 25 in BCB ’22, pages 1–10, New York, NY, USA, August 2022. Association for Computing Machinery. [Google Scholar]
  • [77].Ranek Jolene S, Stanley Natalie, and Purvis Jeremy E. Integrating temporal single-cell gene expression modalities for trajectory inference and disease prediction. Genome Biol., 23(1):1–32, September 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [78].Bruggner Robert V, Bodenmiller Bernd, Dill David L, Tibshirani Robert J, and Nolan Garry P. Automated identification of stratifying signatures in cellular subpopulations. Proc. Natl. Acad. Sci. U. S. A., 111(26):E2770–7, July 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [79].Burkhardt Daniel B, Stanley Jay S 3rd, Tong Alexander, Perdigoto Ana Luisa, Gigante Scott A, Herold Kevan C, Wolf Guy, Giraldez Antonio J, van Dijk David, and Krishnaswamy Smita. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol., 39(5):619–629, May 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [80].Torregrosa Gabriel and Garcia-Ojalvo Jordi. Mechanistic models of cell-fate transitions from single-cell data. Current Opinion in Systems Biology, 26:79–86, June 2021. [Google Scholar]
  • [81].Zhou Peijie, Wang Shuxiong, Li Tiejun, and Nie Qing. Dissecting transition cells from single-cell transcriptome data through multiscale stochastic dynamics. Nat. Commun., 12(1):5609, September 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [82].Casey Michael J, Stumpf Patrick S, and MacArthur Ben D. Theory of cell fate. Wiley Interdiscip. Rev. Syst. Biol. Med., 12(2):e1471, March 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [83].Haghverdi Laleh, Büttner Maren, Wolf F Alexander, Buettner Florian, and Theis Fabian J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods, 13(10):845–848, October 2016. [DOI] [PubMed] [Google Scholar]
  • [84].Angelo Michael, Bendall Sean C, Finck Rachel, Hale Matthew B, Hitzman Chuck, Borowsky Alexander D, Levenson Richard M, Lowe John B, Liu Scot D, Zhao Shuchun, Natkunam Yasodha, and Nolan Garry P. Multiplexed ion beam imaging of human breast tumors. Nat. Med., 20(4):436–442, April 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [85].Stickels Robert R, Murray Evan, Kumar Pawan, Li Jilong, Marshall Jamie L, Di Bella Daniela J, Arlotta Paola, Macosko Evan Z, and Chen Fei. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol., 39(3):313–319, March 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [86].Liu Yang, Yang Mingyu, Deng Yanxiang, Su Graham, Enninful Archibald, Guo Cindy C, Tebaldi Toma, Zhang Di, Kim Dongjoo, Bai Zhiliang, Norris Eileen, Pan Alisia, Li Jiatong, Xiao Yang, Halene Stephanie, and Fan Rong. High-Spatial-Resolution Multi-Omics sequencing via deterministic barcoding in tissue. Cell, 183(6):1665–1681.e18, December 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [87].Xia Chenglong, Fan Jean, Emanuel George, Hao Junjie, and Zhuang Xiaowei. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proc. Natl. Acad. Sci. U. S. A., 116(39):19490–19499, September 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [88].Chen Ao, Liao Sha, Cheng Mengnan, Ma Kailong, Wu Liang, Lai Yiwei, Qiu Xiaojie, Yang Jin, Xu Jiangshan, Hao Shijie, Wang Xin, Lu Huifang, Chen Xi, Liu Xing, Huang Xin, Li Zhao, Hong Yan, Jiang Yujia, Peng Jian, Liu Shuai, Shen Mengzhe, Liu Chuanyu, Li Quanshui, Yuan Yue, Wei Xiaoyu, Zheng Huiwen, Feng Weimin, Wang Zhifeng, Liu Yang, Wang Zhaohui, Yang Yunzhi, Xiang Haitao, Han Lei, Qin Baoming, Guo Pengcheng, Lai Guangyao, Muñoz-Cánoves Pura, Maxwell Patrick H, Thiery Jean Paul, Wu Qing-Feng, Zhao Fuxiang, Chen Bichao, Li Mei, Dai Xi, Wang Shuai, Kuang Haoyan, Hui Junhou, Wang Liqun, Fei Ji-Feng, Wang Ou, Wei Xiaofeng, Lu Haorong, Wang Bo, Liu Shiping, Gu Ying, Ni Ming, Zhang Wenwei, Mu Feng, Yin Ye, Yang Huanming, Lisby Michael, Cornall Richard J, Mulder Jan, Uhlén Mathias, Esteban Miguel A, Li Yuxiang, Liu Longqi, Xu Xun, and Wang Jian. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell, 185(10):1777–1792.e21, May 2022. [DOI] [PubMed] [Google Scholar]
  • [89].Lohoff T, Ghazanfar S, Missarova A, Koulena N, Pierson N, Griffiths J A, Bardot E S, Eng C-H L, Tyser R C V, Argelaguet R, Guibentif C, Srinivas S, Briscoe J, Simons B D, Hadjantonakis A-K, Göttgens B, Reik W, Nichols J, Cai L, and Marioni J C. Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nat. Biotechnol., 40(1):74–85, January 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [90].Risom Tyler, Glass David R, Averbukh Inna, Liu Candace C, Baranski Alex, Kagel Adam, McCaffrey Erin F, Greenwald Noah F, Rivero-Gutiérrez Belén, Strand Siri H, Varma Sushama, Kong Alex, Keren Leeat, Srivastava Sucheta, Zhu Chunfang, Khair Zumana, Veis Deborah J, Deschryver Katherine, Vennam Sujay, Maley Carlo, Hwang E Shelley, Marks Jeffrey R, Bendall Sean C, Colditz Graham A, West Robert B, and Angelo Michael. Transition to invasive breast cancer is associated with progressive changes in the structure and composition of tumor stroma. Cell, 185(2):299–310.e18, January 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [91].McCaffrey Erin F, Donato Michele, Keren Leeat, Chen Zhenghao, Delmastro Alea, Fitzpatrick Megan B, Gupta Sanjana, Greenwald Noah F, Baranski Alex, Graf William, Kumar Rashmi, Bosse Marc, Fullaway Christine Camacho, Ramdial Pratista K, Forgó Erna, Jojic Vladimir, Van Valen David, Mehra Smriti, Khader Shabaana A, Bendall Sean C, van de Rijn Matt, Kalman Daniel, Kaushal Deepak, Hunter Robert L, Banaei Niaz, Steyn Adrie J C, Khatri Purvesh, and Angelo Michael. The immunoregulatory landscape of human tuberculosis granulomas. Nat. Immunol., 23(2):318–329, February 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [92].Stallaert Wayne, Kedziora Katarzyna M, Taylor Colin D, Zikry Tarek M, Ranek Jolene S, Sobon Holly K, Taylor Sovanny R, Young Catherine L, Cook Jeanette G, and Purvis Jeremy E. The structure of the human cell cycle. Cell Syst, 13(1):103, January 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [93].Hume Samuel, Dianov Grigory L, and Ramadan Kristijan. A unified model for the G1/S cell cycle transition. Nucleic Acids Res., 48(22):12483–12501, December 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [94].Carrano A C, Eytan E, Hershko A, and Pagano M. SKP2 is required for ubiquitin-mediated degradation of the CDK inhibitor p27. Nat. Cell Biol., 1(4):193–199, August 1999. [DOI] [PubMed] [Google Scholar]
  • [95].Zhang L and Wang C. F-box protein Skp2: a novel transcriptional target of E2F. Oncogene, 25(18):2615–2627, April 2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [96].Essers Jeroen, Theil Arjan F, Baldeyron Céline, van Cappellen Wiggert A, Houtsmuller Adriaan B, Kanaar Roland, and Vermeulen Wim. Nuclear dynamics of PCNA in DNA replication and repair. Mol. Cell. Biol., 25(21):9350–9359, November 2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [97].Khurana Simran and Oberdoerffer Philipp. Replication stress: A lifetime of epigenetic change. Genes, 6(3):858–877, September 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [98].Sirbu Bianca M, Couch Frank B, Feigerle Jordan T, Bhaskara Srividya, Hiebert Scott W, and Cortez David. Analysis of protein dynamics at active, stalled, and collapsed replication forks. Genes Dev., 25(12):1320–1327, June 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [99].Lindqvist Arne, Rodríguez-Bravo Verónica, and Medema René H. The decision to enter mitosis: feedback and redundancy in the mitotic entry network. J. Cell Biol., 185(2):193–202, April 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [100].Gavet Olivier and Pines Jonathon. Activation of cyclin B1-Cdk1 synchronizes events in the nucleus and the cytoplasm at mitosis. J. Cell Biol., 189(2):247–259, April 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [101].Moser Justin, Miller Iain, Carter Dylan, and Spencer Sabrina L. Control of the restriction point by Rb and p21. Proc. Natl. Acad. Sci. U. S. A., 115(35):E8219–E8227, August 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [102].Weinberg R A. The retinoblastoma protein and cell cycle control. Cell, 81(3):323–330, May 1995. [DOI] [PubMed] [Google Scholar]
  • [103].Arthur David and Vassilvitskii Sergei. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, USA, January 2007. Society for Industrial and Applied Mathematics. [Google Scholar]
  • [104].Wolf F Alexander, Hamey Fiona K, Plass Mireya, Solana Jordi, Dahlin Joakim S, Göttgens Berthold, Rajewsky Nikolaus, Simon Lukas, and Theis Fabian J. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol., 20(1):59, March 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [105].Qiu Xiaojie, Mao Qi, Tang Ying, Wang Li, Chawla Raghav, Pliner Hannah A, and Trapnell Cole. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods, 14(10):979–982, October 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [106].Kurd Nadia S, He Zhaoren, Louis Tiani L, Milner J Justin, Omilusik Kyla D, Jin Wenhao, Tsai Matthew S, Widjaja Christella E, Kanbar Jad N, Olvera Jocelyn G, Tysl Tiffani, Quezada Lauren K, Boland Brigid S, Huang Wendy J, Murre Cornelis, Goldrath Ananda W, Yeo Gene W, and Chang John T. Early precursors and molecular determinants of tissue-resident memory CD8+ T lymphocytes revealed by single-cell RNA sequencing. Sci Immunol, 5(47), May 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [107].Wherry E John and Ahmed Rafi. Memory CD8 T-cell differentiation during viral infection. J. Virol., 78(11):5535–5545, June 2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [108].Kaech Susan M, Wherry E John, and Ahmed Rafi. Effector and memory T-cell differentiation: implications for vaccine development. Nat. Rev. Immunol., 2(4):251–262, April 2002. [DOI] [PubMed] [Google Scholar]
  • [109].Wherry E John, Teichgräber Volker, Becker Todd C, Masopust David, Kaech Susan M, Antia Rustom, von Andrian Ulrich H, and Ahmed Rafi. Lineage relationship and protective immunity of memory CD8 T cell subsets. Nat. Immunol., 4(3):225–234, March 2003. [DOI] [PubMed] [Google Scholar]
  • [110].Blasius Amanda L, Giurisato Emanuele, Cella Marina, Schreiber Robert D, Shaw Andrey S, and Colonna Marco. Bone marrow stromal cell antigen 2 is a specific marker of type I IFN-producing cells in the naive mouse, but a promiscuous cell surface antigen following IFN stimulation. J. Immunol., 177(5):3260–3265, September 2006. [DOI] [PubMed] [Google Scholar]
  • [111].Jergovic Mladen, Coplen Christopher P, Uhrlaub Jennifer L, Besselsen David G, Cheng Shu, Smithey Megan J, and Nikolich-Žugich Janko. Infection-induced type I interferons critically modulate the homeostasis and function of CD8+ naïve T cells. Nat. Commun., 12(1):5303, September 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [112].Tirosh Itay, Izar Benjamin, Prakadan Sanjay M, Wadsworth Marc H 2nd, Treacy Daniel, Trombetta John J, Rotem Asaf, Rodman Christopher, Lian Christine, Murphy George, Fallahi-Sichani Mohammad, Dutton-Regester Ken, Lin Jia-Ren, Cohen Ofir, Shah Parin, Lu Diana, Genshaft Alex S, Hughes Travis K, Ziegler Carly G K, Kazer Samuel W, Gaillard Aleth, Kolb Kellie E, Villani Alexandra-Chloé, Johannessen Cory M, Andreev Aleksandr Y, Van Allen Eliezer M, Bertagnolli Monica, Sorger Peter K, Sullivan Ryan J, Flaherty Keith T, Frederick Dennie T, Jané-Valbuena Judit, Yoon Charles H, Rozenblatt-Rosen Orit, Shalek Alex K, Regev Aviv, and Garraway Levi A. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282):189–196, April 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [113].Chattopadhyay Pratip K, Betts Michael R, Price David A, Gostick Emma, Horton Helen, Roederer Mario, and De Rosa Stephen C. The cytolytic enzymes granyzme A, granzyme B, and perforin: expression patterns, cell distribution, and their relationship to cell maturity and bright CD57 expression. J. Leukoc. Biol., 85(1):88–97, January 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [114].Omilusik Kyla D, Nadjsombati Marija S, Shaw Laura A, Yu Bingfei, Milner J Justin, and Goldrath Ananda W. Sustained Id2 regulation of E proteins is required for terminal differentiation of effector CD8+ T cells. J. Exp. Med., 215(3):773–783, March 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [115].Milner J Justin, Nguyen Hongtuyet, Omilusik Kyla, Reina-Campos Miguel, Tsai Matthew, Toma Clara, Delpoux Arnaud, Boland Brigid S, Hedrick Stephen M, Chang John T, and Goldrath Ananda W. Delineation of a molecularly distinct terminally differentiated memory CD8 T cell population. Proc. Natl. Acad. Sci. U. S. A., 117(41):25667–25678, October 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [116].Kaech Susan M and Cui Weiguo. Transcriptional control of effector and memory CD8+ T cell differentiation. Nat. Rev. Immunol., 12(11):749–761, November 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [117].Harty John T and Badovinac Vladimir P. Shaping and reshaping CD8+ T-cell memory. Nat. Rev. Immunol., 8(2):107–119, February 2008. [DOI] [PubMed] [Google Scholar]
  • [118].Grayson J M, Zajac A J, Altman J D, and Ahmed R. Cutting edge: increased expression of Bcl-2 in antigen-specific memory CD8+ T cells. J. Immunol., 164(8):3950–3954, April 2000. [DOI] [PubMed] [Google Scholar]
  • [119].Kaech Susan M, Tan Joyce T, Wherry E John, Konieczny Bogumila T, Surh Charles D, and Ahmed Rafi. Selective expression of the interleukin 7 receptor identifies effector CD8 T cells that give rise to long-lived memory cells. Nat. Immunol., 4(12):1191–1198, December 2003. [DOI] [PubMed] [Google Scholar]
  • [120].Upadhyay Vaibhav and Fu Yang-Xin. Lymphotoxin signalling in immune homeostasis and the control of microorganisms. Nat. Rev. Immunol., 13(4):270–279, April 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [121].Szklarczyk Damian, Gable Annika L, Nastou Katerina C, Lyon David, Kirsch Rebecca, Pyysalo Sampo, Doncheva Nadezhda T, Legeay Marc, Fang Tao, Bork Peer, Jensen Lars J, and von Mering Christian. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res., 49(D1):D605–D612, January 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [122].Benjamini Yoav and Hochberg Yosef. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B, 57(1):289–300, January 1995. [Google Scholar]
  • [123].Kuleshov Maxim V, Jones Matthew R, Rouillard Andrew D, Fernandez Nicolas F, Duan Qiaonan, Wang Zichen, Koplev Simon, Jenkins Sherry L, Jagodnik Kathleen M, Lachmann Alexander, McDermott Michael G, Monteiro Caroline D, Gundersen Gregory W, and Ma’ayan Avi. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res., 44(W1):W90–7, July 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [124].van der Maaten Laurens and Hinton Geoffrey. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(86):2579–2605, 2008. [Google Scholar]
  • [125].Belkin Mikhail and Niyogi Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 585–591, Cambridge, MA, USA, January 2001. MIT Press. [Google Scholar]
  • [126].Loh Kyle M, Chen Angela, Koh Pang Wei, Deng Tianda Z, Sinha Rahul, Tsai Jonathan M, Barkal Amira A, Shen Kimberle Y, Jain Rajan, Morganti Rachel M, Shyh-Chang Ng, Fernhoff Nathaniel B, George Benson M, Wernig Gerlinde, Salomon Rachel E A, Chen Zhenghao, Vogel Hannes, Epstein Jonathan A, Kundaje Anshul, Talbot William S, Beachy Philip A, Ang Lay Teng, and Weissman Irving L. Mapping the pairwise choices leading from pluripotency to human bone, heart, and other mesoderm cell types. Cell, 166(2):451–467, July 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [127].Kharchenko Peter V, Silberstein Lev, and Scadden David T. Bayesian approach to single-cell differential expression analysis. Nat. Methods, 11(7):740–742, July 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [128].Pierson Emma and Yau Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol., 16:241, November 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [129].Johnson W Evan, Li Cheng, and Rabinovic Ariel. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127, January 2007. [DOI] [PubMed] [Google Scholar]
  • [130].Cortes Corinna and Vapnik Vladimir. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995. [Google Scholar]
  • [131].van der Hoef Hanneke and Warrens Matthijs J. Understanding information theoretic measures for comparing clusterings. Behaviormetrika, 46(2):353–370, October 2019. [Google Scholar]
  • [132].Phipson Belinda and Smyth Gordon K. Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol., 9:Article39, October 2010. [DOI] [PubMed] [Google Scholar]
  • [133].Stallaert Wayne, Kedziora Katarzyna M, Taylor Colin D, Zikry Tarek M, Ranek Jolene S, Sobon Holly K, Taylor Sovanny R, Young Catherine L, Cook Jeanette G, and Purvis Jeremy E. The structure of the human cell cycle. Datasets. Zenodo Repository. 10.5281/zenodo.4525425 (2022). [DOI] [Google Scholar]
  • [134].Stallaert Wayne, Papke Bjoern, Der Channing, and Purvis Jeremy E. Cell cycle heterogeneity in pancreatic ductal adenocarcinoma. Datasets. Zenodo Repository. 10.5281/zenodo.7860332 (2023). [DOI] [Google Scholar]
  • [135].Kurd Nadia S, He Zhaoren, Louis Tiani L, Milner J Justin, Omilusik Kyla D, Jin Wenhao, Tsai Matthew S, Widjaja Christella E, Kanbar Jad N, Olvera Jocelyn G, Tysl Tiffani, Quezada Lauren K, Boland Brigid S, Huang Wendy J, Murre Cornelis, Goldrath Ananda W, Yeo Gene W, and Chang John T. Early precursors and molecular determinants of tissue-resident memory CD8+ T lymphocytes revealed by single-cell RNA sequencing. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131847 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [136].Ranek Jolene, Stallaert Wayne, Milner Justin, Stanley Natalie, and Purvis Jeremy. Feature selection for preserving biological trajectories in single-cell data. Datasets. Zenodo Repository. 10.5281/zenodo.7883604 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints