F1000Res. 2023 Dec 15;12:261. Originally published 2023 Mar 10. [Version 2] doi: 10.12688/f1000research.130623.2

Thinking process templates for constructing data stories with SCDNEY

Yue Cao 1,2,3,4, Andy Tran 1,2,3,4,#, Hani Kim 2,4,5,#, Nick Robertson 1,2,3,4, Yingxin Lin 1,2,3,4, Marni Torkel 1,2,3,4, Pengyi Yang 1,2,4,5, Ellis Patrick 1,2,3,4, Shila Ghazanfar 2,3,4,a, Jean Yang 1,2,3,4,b
PMCID: PMC10905113  PMID: 38434622

Version Changes

Revised. Amendments from Version 1

We have improved our manuscript by highlighting examples where critical thinking is needed for diagnosis, clarifying the Thinking Process Template, and improving the writing throughout. We have updated Figure 1 and Figure 6.

Abstract

Background

Globally, scientists can now generate vast amounts of high-throughput biomedical data that carry critical information for important clinical and public health applications. This data revolution in biology is creating a plethora of new single-cell datasets. Concurrently, there have been significant methodological advances in single-cell research. Integrating these two resources to create tailor-made, efficient, and purpose-specific data analysis approaches can help accelerate scientific discovery.

Methods

We developed a series of living workshops for building data stories, using Single-cell data integrative analysis (scdney). scdney is a wrapper package containing a collection of single-cell analysis R packages that covers data integration, cell type annotation, higher-order testing and more.

Results

Here, we illustrate two specific workshops. The first workshop examines how to characterise the identity and/or state of cells and the relationship between them, known as phenotyping. The second workshop focuses on extracting higher-order features from cells to predict disease progression.

Conclusions

Through these workshops, we not only showcase current solutions, but also highlight critical thinking points. In particular, we highlight the Thinking Process Template that provides a structured framework for the decision-making process behind such single-cell analyses. Furthermore, our workshop will incorporate dynamic contributions from the community in a collaborative learning approach, thus the term ‘living’.

Keywords: single-cell analysis, data analysis, data story, thinking process template, living workshop

Introduction

Recent advancements in biotechnology have empowered scientists to generate unprecedented amounts of data at the cellular level that carry critical information for important clinical and public health applications ( Goodwin, McPherson, and McCombie, 2016; Stark, Grzelak, and Hadfield, 2019). These data provide a unique opportunity for us to inspect individual cells through the lens of genomics, transcriptomics, proteomics and so on, providing insight into different aspects of a cell and representing a data revolution in biomedical science. To extract scientific discoveries from these data, over one thousand analytical methods have been developed ( Zappia and Theis, 2021) to exploit diverse kinds of data and answer a broad range of questions. These analytical methods can be used as ‘black box’ tools to analyse data without knowledge of the methodological details, which makes it difficult to judge how robust and rigorous a given data analysis is. To make the most of the single-cell data revolution in omics science, it is important for researchers to first navigate and determine the optimal analytical tools for each question while being aware of their hidden pitfalls and assumptions.

Analysing omics data often involves complex workflows including data cleaning, processing, and downstream analysis. A critical component of a successful analysis is the thinking process, in which the analyst considers the steps in the workflow and makes informed decisions that are appropriate for the research questions at hand. For example, the workflow for single-cell analysis often involves multiple interdependent steps such as data filtering and normalisation, feature selection, clustering, and dimensionality reduction, alongside further downstream analytical steps. Each of these steps can require analysts to make context-specific decisions, such as deciding thresholds (e.g., filtering or feature selection), selecting parameters (e.g., normalisation or clustering) or selecting an algorithm (e.g., dimensionality reduction). As these analytical choices are dependent on earlier steps, they can have cascading impacts on the downstream analysis, and eventually, the conclusions that are drawn ( Krzak et al., 2019; Raimundo, Vallot, and Vert, 2020). Workflows can help users identify a set of seemingly disjoint methods and combine them into a unified and coherent process. Thus, it is crucial that users are guided through the thinking process in order to make the most appropriate decisions at each step given their specific context.

There is a difference between offering a tutorial or workflow and offering a thinking process. Computational methods are often accompanied by a tutorial that demonstrates how to apply the method to perform a specific task on an example dataset. These tutorials can be straightforward to follow and understand, helping users run the method on their own data. Workflows describe a sequence of analytical methods for processing and analysing certain types of data ( Breckels et al., 2016; Lun, McCarthy, and Marioni, 2016; Borcherding, Bormann, and Kraus, 2020), and can help users identify a set of seemingly disparate methods and combine them into a cohesive whole. However, simply copying an existing tutorial or workflow carries the risk of treating the methods as a ‘black box’, potentially leading to false discoveries. We believe that it is important to not only instruct analysts on how to apply a method or workflow, but also to guide them to critically assess their results at each stage. Indeed, efforts are underway to make more transparent what happens ‘behind the paper’, such as the Springer Nature protocols and methods community ( https://protocolsmethods.springernature.com/channels/behind-the-paper), with discussions surrounding experimental and analytical choices throughout a project. Critical thinking and assessment of results at each stage enable analysts to identify where problems arise and guide them to customise their analysis for their specific context. Thus, there is a pressing need to build on existing tutorials and workflows in a way that incorporates such critical thinking.

To this end, we present a Thinking Process Template to formalise the thought process an analyst should undertake to ensure a robust analysis that is tailored to their data. We use Thinking Process Template as a general term for the data analysis procedure from data input to final analysis outcome; it involves critical thinking and goes beyond both the simple application of tools and the products of data analysis. Here, we demonstrate this through scdney, a collection of analytical packages and living workshop materials that can be updated based on feedback and suggestions from users. In this paper, we demonstrate two examples of our Thinking Process Template: inferring and assessing a cell lineage trajectory, and performing patient disease classification. We envision that our Thinking Process Template and scdney’s living workshops will complement existing resources and serve as a model for future tutorials, encouraging transparent and robust research practices in the bioinformatics and biomedical data science community.

Methods

Selection of data stories to illustrate scdney

Here we showcase two data stories to illustrate scdney. These two data stories and the accompanying workshops were not derived from previous studies, nor have they been published elsewhere. The first data story describes the use of scdney for cell-level analysis through inferring and assessing the developmental trajectory of individual cells. The second data story details the use of scdney for patient-level analysis by extracting and summarising information obtained from each cell. The code for both data stories is hosted on GitHub as reproducible Rmarkdown files, as reported in the Software availability section. The underlying data are reported in the Data availability section.

Workshop for data story 1

A summary of the case study is provided below, with detailed information including R code hosted on our GitHub ( Lin, Kim and Chen, 2023).

To predict the gene pairs associated with the developmental course of the differentiation of mouse hippocampal cells, we downloaded the publicly available data (from GEO with accession number GSE104323) profiling eight cell types from neural lineages of the mouse hippocampus harvested from two post-natal timepoints (day 0 and 5) ( La Manno et al., 2018). For speed, we removed the Nbl1, Nbl2, Granule and CA cell types from the dataset and reduced the dataset from 18,213 to 12,935 cells. To evaluate the accuracy of the original cell type labels, we applied scReClassify ( Kim et al., 2019) from the scdney package. scReClassify generates cell-type-specific probabilities for each cell, where a probability of 1 denotes the highest accuracy in classification and 0 denotes lowest accuracy. Using the maximum probability assigned to each cell, we re-labelled the cell-type annotations of cells that have inconsistent labels and have a maximum probability greater than 0.9. Then, we used the re-labelled cell-type annotations to perform marker gene analysis using Cepo, a method to determine cell-type-specific differentially stable genes ( Kim et al., 2021).
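To make the re-labelling rule above concrete, the following is a minimal sketch in base R. It assumes prob is a cells-by-cell-types matrix of classification probabilities, as produced by a tool such as scReClassify, and original is the vector of original annotations; both object names are hypothetical.

relabel_cells <- function(prob, original, threshold = 0.9) {
  idx <- max.col(prob, ties.method = "first")        # column index of each cell's best class
  predicted <- colnames(prob)[idx]
  max_prob <- prob[cbind(seq_len(nrow(prob)), idx)]  # each cell's maximum probability
  # Re-label only cells whose predicted label disagrees with the original
  # annotation and whose maximum probability exceeds the threshold
  relabel <- predicted != original & max_prob > threshold
  new_labels <- original
  new_labels[relabel] <- predicted[relabel]
  new_labels
}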

To build the trajectories, we applied two commonly used trajectory inference tools, Slingshot ( Street et al., 2018) and destiny ( Angerer et al., 2016). Finally, to predict gene-pairs that change over the trajectory course, we used our previously developed package scHOT ( Ghazanfar et al., 2020), which is available on Bioconductor. scHOT enables detection of changes in higher-order interactions in single-cell gene expression data.
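As a rough sketch of this step, the two tools can be run as follows. We assume sce is a SingleCellExperiment with log-normalised counts, a "UMAP" reduced dimension and a "celltype" column in its colData; the exact pre-processing will differ for your data.

library(SingleCellExperiment)
library(slingshot)
library(destiny)

# Slingshot: infer lineages from cluster labels in a reduced-dimension space
sce <- slingshot(sce, clusterLabels = "celltype", reducedDim = "UMAP")
pt_slingshot <- slingPseudotime(sce)   # one pseudotime column per lineage

# destiny: diffusion map (cells as rows), then diffusion pseudotime
dm <- DiffusionMap(t(as.matrix(logcounts(sce))))
dpt <- DPT(dm)                         # diffusion pseudotime object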

Workshop for data story 2

A summary of the case study is provided below, with detailed information including R code hosted on our GitHub ( Cao and Tran, 2023).

We predict patient disease outcome using COVID-19 datasets and the packages scFeatures ( Cao et al., 2022) and ClassifyR ( Strbenac et al., 2015). To build the prediction model distinguishing mild and severe outcomes, we used the publicly available Schulte-Schrepping data ( Schulte-Schrepping et al., 2020). We randomly sampled 20 mild and 20 severe patient samples so that the workshop can be run in a reasonable amount of time. Then, we applied scFeatures from the scdney package to generate patient representations from the single-cell data. scFeatures generates interpretable molecular representations from various feature types. By doing so, we were able to represent each patient with more information than a matrix of gene expression values. At the same time, it also transformed the scRNA-seq data into a samples-by-features matrix, which is a standard form for machine learning models. We generated a total of 13 matrices, one for each feature type across the feature categories of (i) cell type proportions, (ii) cell type specific gene expressions, (iii) cell type specific pathway expressions, (iv) cell type specific cell-cell interaction (CCI) scores and (v) overall aggregated gene expressions. The details of the feature types can be found in the scFeatures publication ( Cao et al., 2022).
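A sketch of this feature-generation step is below. The argument and feature-type names follow our reading of the scFeatures vignette and should be treated as assumptions (see ?scFeatures for the definitive interface); sce$sample_id and sce$celltype are assumed column names.

library(scFeatures)

feature_list <- scFeatures(
  data = as.matrix(logcounts(sce)),  # genes x cells expression matrix
  sample = sce$sample_id,            # patient/sample label of each cell (assumed column)
  celltype = sce$celltype,           # cell type annotation of each cell (assumed column)
  feature_types = c("proportion_raw", "gene_mean_celltype")
)
# Each element of feature_list is a samples-by-features matrix,
# ready for standard machine learning models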

To build a patient outcome classification model from the patient representations, we used our previously developed package ClassifyR ( Strbenac et al., 2015), which is available on Bioconductor ( https://bioconductor.org/packages/ClassifyR/). ClassifyR provides cross-validated classification, with implementations of a range of commonly used classifiers and evaluation metrics. For this case study, we ran a support vector machine (SVM) on each of the feature types using a repeated five-fold cross-validation framework with 20 repeats. Accuracy was measured using the ‘balanced accuracy’ metric implemented in ClassifyR.
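The cross-validation scheme can be mirrored in a few lines of base R with e1071's SVM, shown below as an illustration of the procedure rather than of ClassifyR's internals; X (samples by features) and y (a factor of mild/severe outcomes) are assumed inputs.

library(e1071)

balanced_accuracy <- function(truth, pred) {
  # mean per-class recall; for two classes this is (sensitivity + specificity) / 2
  mean(sapply(levels(truth), function(cl) mean(pred[truth == cl] == cl)))
}

set.seed(1)
n_folds <- 5
n_repeats <- 20
scores <- replicate(n_repeats, {
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(X)))
  mean(sapply(seq_len(n_folds), function(k) {
    fit <- svm(X[folds != k, ], y[folds != k])
    balanced_accuracy(y[folds == k], predict(fit, X[folds == k, ]))
  }))
})
mean(scores)   # average cross-validated balanced accuracy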

To assess the generalisability of the constructed model, we used the Schulte-Schrepping data as training data and another publicly available COVID-19 scRNA-seq dataset, the Wilk data ( Wilk et al., 2020), as independent test data. First, we processed this dataset in the same way using scFeatures to generate the patient representations. Given that different datasets yield slightly different sets of features, for example, due to differences in the genes recorded, we subset the features derived from the Schulte-Schrepping and Wilk datasets to their common features. We then rebuilt the model on the Schulte-Schrepping dataset using the same cross-validation framework as above. The best of the 100 models (i.e., from the 20 repeats of five-fold cross-validation) was identified based on balanced accuracy and evaluated on the Wilk dataset.
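The feature-alignment step amounts to intersecting column names before refitting; a minimal sketch, where the matrix names are assumptions:

common <- intersect(colnames(X_schulte), colnames(X_wilk))
X_train <- X_schulte[, common]   # Schulte-Schrepping representation, common features only
X_test <- X_wilk[, common]       # Wilk representation, same features in the same order
# Refit with the cross-validation loop above on X_train, pick the model with
# the best balanced accuracy, then evaluate it once on X_test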

Results

Thinking Process Template

Typically, in scientific research papers involving cellular data technologies, there are three key components: (1) Data, (2) Narratives, and (3) Visuals ( Figure 1). Through narratives, we explain the data; through visuals, we illuminate the data; through narratives and visuals, we engage. At the intersection of the three components sits the product: the data story. However, hidden behind these components are the critical thinking questions, such as those concerning evaluation and parameter choices, that underpin the decision-making process.

Figure 1. Critical thinking questions about the Data drive our decisions in the Narrative and illuminate them through the Visuals.


Here, we present a Thinking Process Template to uncover the thinking process behind the construction of data stories, guided by analytical decisions. We demonstrate this in two distinct data analytical scenarios, presented as scientific questions. First, what are the cell types present in our developmental single-cell dataset, and what are the correlated gene pairs in each trajectory? Second, what features are important for disease outcome classification? In both cases we illuminate the underlying thinking strategy taken by analysts and data scientists in extracting biological insight from the data, drawing on the vast compendium of prior knowledge to reveal novel scientific findings.

Scdney - Single cell data integrative analysis

As a vehicle to demonstrate the Thinking Process Template, we present scdney ( Figure 2), a series of foundational methods for single cell data analysis, including

  • a data integration approach for scRNA-seq data that enables tailored prior knowledge ( Lin et al., 2019);

  • a novel cell type classification method based on cell hierarchy ( Lin et al., 2020);

  • a novel method for identifying differentially stable genes, that is, genes that are stably expressed in one cell type relative to other cell types ( Kim et al., 2021);

  • a multi-modal workflow for analysing CITE-seq data ( Kim et al., 2020);

  • an analytical approach to test for higher-order changes in gene behaviour within human tissue ( Ghazanfar et al., 2020). By higher-order changes, we refer to higher order interactions such as variation and coexpression that are beyond changes in mean expression; and

  • a feature extraction method that creates multi-view feature representations at the patient level from single-cell data ( Cao et al., 2022).

Figure 2. scdney workflow.


(A) Collection of Data - The data stories start with data. (B) Collection of methods - The methods in scdney are used for the computational analysis of the data. (C) Critical Thinking - Through critical thinking, we derive the final data story.

Building upon the collection of vignettes, the Thinking Process Template examines the various critical thinking questions that analysts need to consider, which drive the decision for the next step in the analysis workflow. Next, using the scdney workflow, we illustrate the process of generating two data stories. The scdney workflow starts with data ( Figure 2A), the series of methods are used for the analysis of the data ( Figure 2B), and through critical thinking ( Figure 2C), we derive the final data story.

Narrative for data story 1 - to identify key gene-pairs associated with the developmental course

In the first data story, the aim was to identify key gene-pairs associated with the developmental course of the differentiation of mouse hippocampal cells, enabling us to find gene sets that distinguish hippocampal development in mice using scRNA-seq data ( La Manno et al., 2018) ( Figure 3). Box 1 lists some questions and our thought process during the development of the story.

Figure 3. Thinking process template for analysing a single-cell RNA-seq data with a lineage trajectory.


The thinking process begins from the processed data with cell type annotations and proceeds to constructing a trajectory and extracting biological insight through identification of correlated gene pairs. The orange diamonds highlight potential questions that help us quality check the data analysis, and the orange hexagonal shapes denote the specific computational tasks that are required to answer the questions above.

Box 1. Critical questions to consider for identification of key gene sets in the developmental course.

Question: Which tools should I use and what format does the data need to be in?

Thinking process: Several tools have been developed to construct trajectories from single-cell data. Different tools may require different types of input data; therefore, it is important to understand the tools and your data before selecting one. Another key aspect of trajectory reconstruction is judging which cell populations to include in the trajectory analyses: cell types or populations not involved in the differentiation system of interest should be excluded from the trajectory inference.

Question: Which trajectory method should I use?

Thinking process: Depending on the complexity of the trajectory, the choice of tool can have a large impact on the accuracy of the resulting trajectory. A large body of work has evaluated current single-cell trajectory inference methods ( Saelens et al., 2019), providing guidelines and a framework to test which trajectory tool and settings are most appropriate for your data. Again, this requires you to have a good understanding of the expected underlying biology in your data, such as the topology and the number of branches of the expected trajectory.

Question: Are the cell type labels accurate?

Thinking process: Evaluating the quality of the cell type labels is important, as the quality of this may directly impact downstream analyses such as determining cell-type markers. By quantifying the proportion of cells accurately labelled in the dataset, we are not only able to assess the quality of the overall dataset, but also to re-classify any mislabelled cells.

Question: Is the trajectory stable?

Thinking process: This can be achieved in many ways, such as testing the reproducibility of the trajectory when different tools are used or when permuting the features (gene sets or cells) in the data. A consistent trajectory across various permutations provides stronger support for the final trajectory.
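As a minimal sketch of such a stability check, assuming per-cell pseudotime vectors from two tools (e.g., pt_slingshot and pt_destiny from the Methods), a placeholder run_trajectory() wrapper around your chosen tool, and pt_full, the pseudotime from the full data:

# Agreement between two tools: rank correlation of their pseudotimes
cor(pt_slingshot, pt_destiny, method = "spearman", use = "pairwise.complete.obs")

# Stability under gene subsampling: rebuild on 80% of genes, compare to full data
stability <- replicate(10, {
  genes <- sample(rownames(sce), round(0.8 * nrow(sce)))
  cor(run_trajectory(sce[genes, ]), pt_full, method = "spearman")
})
summary(stability)   # consistently high correlations support the trajectory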

Question: Is the trajectory sensible?

Thinking process: Inspecting whether a trajectory is sensible is critical. We should inspect various features of the trajectory such as its direction (which includes evaluating the root of the trajectory), the number of branches, and the number of terminal nodes (e.g., terminal populations) in the data. Whilst these evaluations require an in-depth understanding of your biological system through literature search, there are computational tools that help guide this. For example, CytoTRACE can be used to predict the root cell (i.e., the most undifferentiated cell) in a cell population.

Question: How reliable are the top regulated gene-pairs?

Thinking process: This question essentially asks whether the extracted gene-pairs are expected for the current biological system. This often requires prior knowledge of experimentally validated ground truths, which can be employed to evaluate the validity of our results. The presence of one or more biological truths increases the confidence that the current framework is appropriate.

Question: How accurate are the identified top gene-pairs?

Thinking process: It is important to bear in mind that the presence of known biological truths in our results does not necessarily mean that the other predicted gene pairs are also biological truths. There are many ways we can validate the accuracy of the predicted gene-pairs, either experimentally or computationally. Computationally, one way is to assess the reproducibility of our framework on a new dataset derived from the same biological system. When such independent datasets are not available, a simple train-test split can be performed on the data to test the reproducibility of the findings, as in the sketch below.
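A minimal sketch of that split in base R; sce is the assumed single-cell object and the downstream analysis is left as a placeholder:

set.seed(1)
train_idx <- sample(ncol(sce), round(0.7 * ncol(sce)))
sce_train <- sce[, train_idx]   # 70% of cells for discovery
sce_test <- sce[, -train_idx]   # held-out 30% for checking reproducibility
# Re-run the trajectory and scHOT analysis on each half independently and
# compare the overlap of the top-ranked gene-pairs between the two halves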

The dataset we use contains eight cell types from neural lineages of the mouse hippocampus harvested from two post-natal timepoints (day 0 and 5) ( La Manno et al., 2018). Whilst the main goal in the original study was to demonstrate the RNA velocity fields that describe the fate decisions governing mouse hippocampal development, our data story aims to uncover novel gene-pairs associated with these neural lineages using scHOT ( Ghazanfar et al., 2020).

We start by asking whether the cell type annotations in the original data are accurate. Here our expectation is that most of the labels are accurate, and using scReClassify ( Kim et al., 2019) we demonstrate that approximately 88.4% of cells show an original classification accuracy over 0.9 ( Figure 4A). Among these cells, only 1.5% (177 cells) were re-classified, suggesting that a small proportion of cells may have been mislabelled. These findings were confirmed through marker analysis using Cepo ( Figure 4B), and the cells with high confidence scores were re-labelled for subsequent analyses. Once we had worked through the further quality control questions shown in Box 1 and ensured the quality of the cell type annotations, we could then use these labels to perform marker gene analysis and to construct the lineage trajectories ( Figure 4C).

Figure 4. Assessment of cell type labels and re-annotation of sub-optimal labels with scReClassify.


(A) Shows the proportion of cells in each confidence level, defined by scReClassify, for each cell type group. (B) The distribution of gene expression of top five marker genes in Immature Granule 2 cells as per the original labels (bottom panel) and re-classified labels (top panel). (C) UMAP of mouse brain cells coloured by cell type and faceted by cells that maintain their original labels (left) and those that have been re-classified (right).

After performing quality control of the original annotations, we can then ask questions relating to trajectory reconstruction. In the trajectory building stage, we ask questions (see Box 1) to ensure the stability and robustness of the trajectories by testing the concordance of the pseudotimes between various trajectory reconstruction tools ( Cao et al., 2019; Street et al., 2018). In our Thinking Process Template, we indicate various points at which one can use prior knowledge (indicated by the glasses icon) to guide the analysis. For example, we can use prior knowledge to ask whether the reconstructed trajectories show the branching expected from the underlying biology of the differentiation and whether key gene-pairs that are known to be co-regulated are identified by scHOT ( Ghazanfar et al., 2020). We can also ask which genes are differentially expressed across pseudotime using tradeSeq ( Van den Berge et al., 2020) and perform functional annotation of these gene sets through clusterProfiler ( Yu et al., 2012). Together, these analyses demonstrate that the final trajectories are in line with our expectations and provide more confidence in the new biological insights extracted from them. The story includes other downstream analyses of the data, such as cell-cell communication using CellChat ( Jin et al., 2021) and RNA velocity analysis using scVelo ( Bergen et al., 2020), which users can perform to further explore their data.

Narrative for data story 2 - develop a PBMC biomarker model to predict COVID-19 patient outcomes

In our second data story, we aim to predict COVID-19 patient outcomes (mild or severe) from scRNA-seq data of peripheral blood mononuclear cells ( Schulte-Schrepping et al., 2020) ( Figure 5). Box 2 lists some questions and our thought process during the development of the story. The story begins with the question of what model and input format we will use to build a prediction model (see Box 2). Here, we decided to use classical machine learning instead of deep learning, given the small sample size of 20 mild and 20 severe patients. We utilise scFeatures, a package that generates interpretable multiscale features from scRNA-seq data, such as cell-type proportions, pathway expression, ligand-receptor interactions and more. These features can then be used as input to an interpretable classification model. Once we had asked the quality control questions shown in Box 2 and ensured the quality of the generated features, we used these features to build models to predict mild or severe outcomes.

Figure 5. Thinking process template for analysing a single-cell RNA-seq data for disease outcome classification.


The thinking process begins from processed data with cell type annotations and branches into two questions, each with a different focus. The top part focuses on using the disease classification model to extract biological insights into the disease, such as what features are important towards disease classification. The bottom part focuses on examining the model properties, such as whether the model is generalisable.

Box 2. Critical questions to consider for the prediction of patient outcomes.

Question: What model should I use and what data structure is required by the model?

Thinking process: There exist a number of advanced deep learning tools that can obtain various biological insights from the count matrix ( Bao et al., 2022). However, the small sample sizes typical of patient-level single-cell data may not be ideal for training a deep learning model. We might instead consider classical machine learning classifiers such as random forests. These methods require input in a samples-by-features format, so we can consider manually extracting features such as cell type proportions.

Question: Is the data preprocessed appropriately?

Thinking process: The quality of the data itself has a direct impact on the quality of the extracted features, and subsequently the quality of the model. Therefore, it is important to perform “quality control” both on the original count matrix and on any of the extracted features derived from the count matrix.

Question: Should I downweight any samples?

Thinking process: Class imbalance can have a negative effect on the model, as the model would be biased towards the over-represented class. One potential strategy to alleviate this is to downweight the over-represented class, as in the sketch below.
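For instance, with a classifier that accepts class weights (e1071's svm() exposes a class.weights argument), inverse-frequency weights can be computed as in this sketch; X and y are assumed inputs.

counts <- table(y)
# inverse-frequency weights: n_samples / (n_classes * class_count)
w <- setNames(as.numeric(sum(counts) / (length(counts) * counts)), names(counts))
fit <- e1071::svm(X, y, class.weights = w)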

Questions: Do the generated features make sense? Are they biologically sensible?

Thinking process: This is really asking whether the extracted features are as expected. This often requires finding a handful of the top differentially expressed genes through DE analysis and checking whether they are mentioned in the literature.

Question: Does the overall graphical representation of the features look sensible?

Thinking process: In this question, we are looking at the overall distribution of the generated features. For example, if we examine the heatmap or volcano plot, are we seeing what we expect to see? Also, see below for examples of quality control checks.

Question: Are there any missing values or outliers in the generated features?

Thinking process: We should inspect the generated features to ensure they are not saturated with missing values. Features where many values are missing may not be informative for downstream analysis and should be removed prior to model building.

Question: Are the generated features heavily correlated?

Thinking process: Having many heavily correlated features can negatively affect a model by introducing noise and instability.
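One way to screen for this uses the caret package's findCorrelation() helper on the feature-feature correlation matrix; a sketch, assuming X is the samples-by-features matrix:

library(caret)

cm <- cor(X)                                    # feature-feature correlations
drop_idx <- findCorrelation(cm, cutoff = 0.9)   # columns flagged as redundant
X_filtered <- if (length(drop_idx)) X[, -drop_idx] else X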

Question: There are a lot of generated features; how do I make sense of them?

Thinking process: Given the number of features in a single-cell matrix (typically around 20,000 genes for scRNA-seq data), one may end up with many derived features. One strategy is to perform an association study, where we examine the association of the features with the outcome. We could also conduct a literature search or consult with biologists to determine whether the top features are biologically meaningful.

Questions: How good is my prediction? How does it compare to the current state-of-the-art?

Thinking process: The expected accuracy of a prediction can vary depending on the specific task at hand. For example, an accuracy of 0.6 may be what the current state-of-the-art is for a difficult disease classification task, whereas for a clear cell type classification task, an accuracy of 0.9 may be the baseline.

Questions: Is the result different using different metrics? Different models?

Thinking process: It may be necessary to try a number of machine learning models and a number of evaluation criteria to assess model performance. For example, when class sizes are imbalanced, balanced accuracy and the F1 score are better measures of model performance than overall accuracy, as illustrated below.
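The metrics themselves are simple to compute from a confusion table; a sketch for the binary case, treating "severe" as the positive class (truth and pred are assumed factors):

tp <- sum(pred == "severe" & truth == "severe")
tn <- sum(pred == "mild" & truth == "mild")
fp <- sum(pred == "severe" & truth == "mild")
fn <- sum(pred == "mild" & truth == "severe")

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)
balanced_acc <- (sensitivity + specificity) / 2                 # robust to imbalance
f1 <- 2 * precision * sensitivity / (precision + sensitivity)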

Questions: Is my model overfitting to the data? Do I need further testing?

Thinking process: One needs to be careful with model overfitting. A model may have very high accuracy on the dataset it is built from, but perform poorly on an unseen dataset. To assess overfitting, we can test the performance of the model on an unseen dataset to assess its generalisability.

Question: Are the top features stable across the models?

Thinking process: After we obtain the model, we may wish to inspect the top features selected by the model. The repeated cross-validation framework is often used when building machine learning models as it provides a better assessment of model predictability than a simple train-test split. We therefore need to check whether the top features are similar across all models from the cross-validation framework, as in the sketch below.
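A sketch of such a check, assuming rank_list is a list with one named rank vector per cross-validated model (rank 1 = most important feature):

# Features that were ever in a model's top 10
top10 <- unique(unlist(lapply(rank_list, function(r) names(sort(r))[1:10])))

# Ranks of those features across every model (features x models)
rank_mat <- sapply(rank_list, function(r) r[top10])
rownames(rank_mat) <- top10

# A stable feature keeps a low (good) rank in every model;
# a large worst-case rank flags instability
summary(apply(rank_mat, 1, max))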

When building machine learning models, it is crucial that we interrogate model performance across a variety of models and metrics. We therefore chose to use ClassifyR, as it provides a user-friendly implementation of a number of common machine learning models and evaluation metrics. We created a model for each feature type, resulting in a total of 13 models, and compared the utility of these feature types for patient classification. We found that a support vector machine classifier consistently achieves a cross-validation accuracy over 0.7 ( Figure 6A), demonstrating the usefulness of these features for classifying disease outcomes.

Figure 6. Assessment of disease outcome classification accuracy using scFeatures' generated features.


(A) Shows the balanced accuracy of each feature type for classifying the mild and severe patients in the Schulte-Schrepping dataset. Models were run using five-fold cross-validation with 20 repeats. For each feature type, the best model from the cross-validation was then selected and used to classify the mild and severe patients in the Wilk dataset, as shown in (B). (C) Rankings of each feature in the feature type “gene mean celltype” across all cross-validated models.

Once the final models are obtained, we ask questions about their robustness. One approach is to assess the generalisability of the model on an independent dataset. We examined the performance of the 13 models on a different COVID-19 dataset, obtained from the Wilk study, that also contains mild and severe patients ( Wilk et al., 2020). We found that, while the 13 models achieved similar balanced accuracies of between 0.75 and 0.88 on the Schulte-Schrepping dataset, their performance varied greatly on the Wilk dataset, ranging from 0.49 to 0.78 ( Figure 6A, B). It is noteworthy that models built from feature types with high accuracy, such as “gene proportion cell type”, do not necessarily maintain good performance on the Wilk dataset. On the other hand, the feature type “CCI” achieved an accuracy of over 0.75 in both datasets, indicating potential for further examination. Finally, to extract biological insights from the fitted models, we guide users to interpret the fitted models to identify important features and reflect on whether the features make sense. Here, it is important not to select the top features based on a single model, but to ask about the stability of these top features. To illustrate this idea, we examined all features that appeared at least once among the top 10 features in the cross-validated models. Figure 6C highlights that while the majority of the features were consistently ranked as top features across all models, a proportion of features were ranked in the hundreds or thousands in some models. These two scenarios illustrate the importance of critical thinking to avoid heading down the wrong decision path.

Discussion

Here, we have presented a Thinking Process Template that not only guides users in how to perform a single-cell data analysis, but also encourages critical thinking, ensuring that each part of the workflow successfully performs its desired task. We demonstrated this through the use of scdney, a collection of analytical packages that can perform a wide range of single-cell data processing and analyses. In the previous section, we demonstrated the importance of the process with two examples: identification of key gene pairs that distinguish hippocampal development in mouse cells, and generation of features from human cells for disease outcome prediction. We envisage the Thinking Process Template as a valuable framework for critical thinking in single-cell data analysis.

Bioinformatics analysis workflows involve many steps, each often requiring decisions to be made, dependent on the earlier choices. The most appropriate decisions will differ between datasets and analyses. Therefore, performing a robust analysis requires significant training and experience. However, our Thinking Process Template conveys this training as critical thinking questions that less-experienced users can easily follow for their specific context. The template can be adapted to a wide range of analyses, complementing the existing learning resources, to lower the barrier to entry for performing reproducible bioinformatics analysis. Furthermore, the template enables an asynchronous learning approach ( Bishop and Verleger, 2013), where the users can learn at their own pace and on their own time without the constraints of traditional workshop schedules. This is particularly useful for bioinformatics analysis, where the decisions and steps can vary depending on the specific datasets and analyses and need to be thoroughly thought about prior to drawing conclusions.

In the last decade, partly in response to the replicability crisis ( Guttinger and Love, 2019), there has been an increased emphasis on open and transparent science and an increased culture among bioinformaticians of sharing data and code so that key findings can be reproduced. However, sharing code alone does not address all aspects of the replicability of scientific conclusions and further, does not explicitly contribute towards the sharing of analytical strategies. In our Thinking Process Template, we believe acknowledging the critical thinking steps ensures a better understanding of the stability and robustness of analytical decisions made in an analysis, making it possible to assess if the same conclusions would be drawn if different decisions were made. Further, sharing the key critical thinking steps of a project, in addition to the code, will improve replicability of results by making it clear where, when, and why analyses can differ when the same code is applied to different data. This will enhance reproducibility of studies performed by different researchers and institutes, and by promoting open examination of the practices, may help to promote replicability in the broader research field.

The thinking process of data analysis is dynamic, constantly evolving and specific to the dataset and the research questions. In practice, when addressing similar research questions, the data analysis workflow that works well on one dataset may not be universal to all other datasets. The thinking process proposed in this paper could serve as useful tips and tricks to address these problems. The output from the thinking process can potentially stimulate a new thinking process, which may further inspire the scientists to ask different questions about the data. The complex thinking process involved in publication is starting to be acknowledged on collaborative learning platforms, such as the one established by F1000. These platforms enable authors to describe the behind-the-scenes stories leading to their publications, as well as for others to contribute analytical suggestions and ideas in a dynamic way. It is known that groups of people with cognitive diversity are often able to solve problems more effectively than a group of cognitively similar people ( Reynolds and Lewis, 2017). Sharing ideas therefore supports the development of effective bioinformatics analysis. By offering an approach for researchers to share and discuss the methods and decisions involved in their analysis, the Thinking Process Template also promotes a deeper level of transparency in bioinformatics analysis. This includes not only the sharing of positive results, but also the sharing of negative or null results. In many cases, null results can be just as important to science, as they provide valuable information about what does not work and can help the broader community to avoid repeating failed experiments or approaches. However, the current scientific field leans more towards the reporting of positive results only. We see the Thinking Process Template to be a tool that can support the sharing of both positive and negative results by providing a structured framework for documenting the decisions and findings in various steps of the analysis. The document can later be shared with the community to increase the transparency of the work.

A distinct and complementary component to the Thinking Process Template relates to the ease with which researchers can reproduce open data analyses on their local computer systems. Robustness of computational tools is an enduring issue in various analytically-driven fields, and challenges with reproducing data analytics are often due to differences in software versions and the large variety of operating systems. To address these issues, the R programming community has developed tools such as BiocManager and renv to help with the installation and documentation of R package dependencies. The use of containers such as Docker allows for the creation of fully reproducible software and analytical environments that can be easily shared and run on different operating systems. In the case of scdney, we have taken steps to improve the robustness of the tool. The scdney wrapper package ( https://github.com/SydneyBioX/scdney) and its individual packages are incorporated into controlled repositories such as GitHub and Bioconductor. In addition, scdney is provided as a Docker container which contains all the necessary dependencies for installation, making it easy for researchers to install and use scdney on their local systems.
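As an illustration, a project environment can be pinned with renv's standard entry points; the calls below are the package's documented workflow, while installation of scdney itself is described on its GitHub page.

install.packages("renv")
renv::init()       # create a project-local package library
renv::snapshot()   # record exact package versions in renv.lock
# A collaborator then recreates the same environment with renv::restore(),
# or runs the analysis inside the scdney Docker container described at
# https://github.com/SydneyBioX/scdney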

Conclusion

In conclusion, the advancement of computational methodologies for integrative analysis of single-cell omics data is transforming molecular biology at an unprecedented scale and speed. Here we introduced the Thinking Process Template, which structures analytical decision making, together with scdney, a wrapper package with a collection of packages presented in the context of several data stories. By establishing scdney as a collection of living workshops, we highlight current solutions for generating novel biological insights. By emphasising the Thinking Process Template and the critical thinking process behind our workshops, we aim to empower users to more effectively and confidently use scdney to gain insights from their single-cell data. Finally, we discussed various key aspects such as reproducibility, replicability, and usability of computational tools. We hope scdney serves as a foundation for the future development and application of computational methods for integrative analysis of, and biological discovery from, single-cell omics data.

Author contribution

JY and SG conceived, designed and funded the study. HK completed the analysis and design of data story 1 with feedback from YL and PY. YC and AT completed the analysis and design of data story 2 with guidance from JY and SG. The implementation and construction of the R package for the case study were done jointly by YC and AT. NR tested all R packages; MT and YL developed the graphics with feedback from JY, SG and EP. The development of the Thinking Process Template was done jointly by all authors, and all authors wrote, reviewed and approved the manuscript.

Acknowledgments

The authors thank all their colleagues, particularly at The University of Sydney, Sydney Precision Data Science and the Judith and David Coffey Life Lab in the Charles Perkins Centre, for their support and intellectual engagement. Special thanks to Daniel Kim, Mohammad Javad Davoudabadi and Lijia Yu for their contributions to our weekly discussions, and to Dario Strbenac for providing ClassifyR support, which enabled the writing of data story 2.

Funding Statement

This work was supported by the AIR@innoHK programme of the Innovation and Technology Commission of Hong Kong to all authors; Research Training Program Tuition Fee Offset and Stipend Scholarship to AT; Research Training Program Tuition Fee Offset and University of Sydney Postgraduate Award Stipend Scholarship to YC; Australian Research Council Discovery Early Career Researcher Awards (DE220100964, DE200100944) funded by the Australian Government to SG and EP; and a National Health and Medical Research Council (NHMRC) Investigator Grant (1173469) to PY. The funding source had no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 2 approved]

Data availability

Underlying data

NCBI Gene Expression Omnibus (GEO): Transcriptome analysis of single cells from the developing mouse dentate gyrus. Accession number GSE104323, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323.

European Genome-phenome Archive (EGA): ScRNA-seq of PBMC and whole blood samples reveals a dysregulated myeloid cell compartment in severe COVID-19. Accession number EGAS00001004571, https://ega-archive.org/studies/EGAS00001004571.

Software availability

Source code available from: https://github.com/SydneyBioX/scdneyDiseasePrediction/tree/v1.0.0.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7582777 ( Cao and Tran, 2023).

License: MIT.

Source code available from: https://github.com/SydneyBioX/scdneyAdvancedPhenotyping/tree/v1.0.0.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7582775 ( Lin, Kim and Chen, 2023).

License: MIT.

References

  1. Angerer P, Haghverdi L, Büttner M, et al.: destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics. 2016;32(8):1241–1243. doi: 10.1093/bioinformatics/btv715
  2. Bao S, Li K, Yan C, et al.: Deep learning-based advances and applications for single-cell RNA-sequencing data analysis. Brief. Bioinform. 2022;23(1). doi: 10.1093/bib/bbab473
  3. Bergen V, Lange M, Peidli S, et al.: Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 2020;38(12):1408–1414. doi: 10.1038/s41587-020-0591-3
  4. Bishop J, Verleger MA: The flipped classroom: a survey of the research. 2013 ASEE Annual Conference & Exposition. 2013.
  5. Borcherding N, Bormann NL, Kraus G: scRepertoire: an R-based toolkit for single-cell immune receptor analysis. F1000Res. 2020;9:47. doi: 10.12688/f1000research.22139.1
  6. Breckels LM, Mulvey CM, Lilley KS, et al.: A Bioconductor workflow for processing and analysing spatial proteomics data. F1000Res. 2016;5:2926. doi: 10.12688/f1000research.10411.1
  7. Cao Y, Lin Y, Patrick E, et al.: scFeatures: multi-view representations of single-cell and spatial data for disease outcome prediction. Bioinformatics. 2022;38:4745–4753. doi: 10.1093/bioinformatics/btac590
  8. Cao J, Spielmann M, Qiu X, et al.: The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502. doi: 10.1038/s41586-019-0969-x
  9. Cao Y, Tran A: SydneyBioX/scdneyDiseasePrediction: v1.0.0 (v1.0.0). Zenodo. 2023. doi: 10.5281/zenodo.7582777
  10. Ghazanfar S, Lin Y, Su X, et al.: Investigating higher-order interactions in single-cell data with scHOT. Nat. Methods. 2020;17(8):799–806. doi: 10.1038/s41592-020-0885-x
  11. Goodwin S, McPherson JD, McCombie WR: Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 2016;17(6):333–351. doi: 10.1038/nrg.2016.49
  12. Guttinger S, Love AC: Characterizing scientific failure. EMBO Rep. 2019;20(9):e48765. doi: 10.15252/embr.201948765
  13. Jin S, Guerrero-Juarez CF, Zhang L, et al.: Inference and analysis of cell-cell communication using CellChat. Nat. Commun. 2021;12(1):1088. doi: 10.1038/s41467-021-21246-9
  14. Kim HJ, Lin Y, Geddes TA, et al.: CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics. 2020;36(14):4137–4143. doi: 10.1093/bioinformatics/btaa282
  15. Kim HJ, Wang K, Chen C, et al.: Uncovering cell identity through differential stability with Cepo. Nat. Comput. Sci. 2021;1(12):784–790. doi: 10.1038/s43588-021-00172-2
  16. Kim T, Lo K, Geddes TA, et al.: scReClassify: post hoc cell type classification of single-cell RNA-seq data. BMC Genomics. 2019;20(Suppl 9):913. doi: 10.1186/s12864-019-6305-x
  17. Krzak M, Raykov Y, Boukouvalas A, et al.: Benchmark and parameter sensitivity analysis of single-cell RNA sequencing clustering methods. Front. Genet. 2019;10:1253. doi: 10.3389/fgene.2019.01253
  18. La Manno G, Soldatov R, Zeisel A, et al.: RNA velocity of single cells. Nature. 2018;560(7719):494–498. doi: 10.1038/s41586-018-0414-6
  19. Lin Y, Cao Y, Kim HJ, et al.: scClassify: sample size estimation and multiscale classification of cells using single and multiple reference. Mol. Syst. Biol. 2020;16(6):e9389. doi: 10.15252/msb.20199389
  20. Lin Y, Ghazanfar S, Wang KYX, et al.: scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl. Acad. Sci. U. S. A. 2019;116(20):9775–9784. doi: 10.1073/pnas.1820006116
  21. Lin Y, Kim HJ, Chen C: SydneyBioX/scdneyAdvancedPhenotyping: v1.0.0 (v1.0.0). Zenodo. 2023. doi: 10.5281/zenodo.7582775
  22. Lun ATL, McCarthy DJ, Marioni JC: A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122. doi: 10.12688/f1000research.9501.2
  23. Raimundo F, Vallot C, Vert J-P: Tuning parameters of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2020;21(1):212. doi: 10.1186/s13059-020-02128-7
  24. Reynolds A, Lewis D: Teams solve problems faster when they’re more cognitively diverse. Harv. Bus. Rev. 2017;30:1–8.
  25. Saelens W, Cannoodt R, Todorov H, et al.: A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019;37(5):547–554. doi: 10.1038/s41587-019-0071-9
  26. Schulte-Schrepping J, Reusch N, Paclik D, et al.: Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell. 2020;182(6):1419–1440.e23. doi: 10.1016/j.cell.2020.08.001
  27. Stark R, Grzelak M, Hadfield J: RNA sequencing: the teenage years. Nat. Rev. Genet. 2019;20(11):631–656. doi: 10.1038/s41576-019-0150-2
  28. Strbenac D, Mann GJ, Ormerod JT, et al.: ClassifyR: an R package for performance assessment of classification with applications to transcriptomics. Bioinformatics. 2015;31(11):1851–1853. doi: 10.1093/bioinformatics/btv066
  29. Street K, Risso D, Fletcher RB, et al.: Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19(1):477. doi: 10.1186/s12864-018-4772-0
  30. Van den Berge K, Roux de Bézieux H, Street K, et al.: Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. 2020;11(1):1201. doi: 10.1038/s41467-020-14766-3
  31. Wilk AJ, Rustagi A, Zhao NQ, et al.: A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 2020;26(7):1070–1076. doi: 10.1038/s41591-020-0944-y
  32. Yu G, Wang L-G, Han Y, et al.: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–287. doi: 10.1089/omi.2011.0118
  33. Zappia L, Theis FJ: Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 2021;22(1):301. doi: 10.1186/s13059-021-02519-4
F1000Res. 2024 Feb 15. doi: 10.5256/f1000research.159911.r230731

Reviewer response for version 2

Kelly Street 1

I appreciate the authors' thoughtful responses to my previous comments and I find the updated article clear and convincing. I have no further major comments and I am honored to have been included in this work!

Following up on a couple (very) minor points:

7.1. It doesn't look like this change was made in the updated version.

7.2. The updated version still seems weird to me (I don't think it makes sense to "identify" pieces into a whole). I would rephrase with something like "Workflows can help users identify a set of seemingly disjoint methods and combine them into a unified and coherent process."

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Single-cell transcriptomics, computational biology software development

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2023 Jun 15. doi: 10.5256/f1000research.143393.r173874

Reviewer response for version 1

Kelly Street 1

The authors present a Thinking Process Template for the analysis of complex datasets. This template encourages researchers to think critically about each step in their analysis and to be more transparent in the reporting of their results through the use of data stories that acknowledge the subjective decisions that had to be made along the way.

The thinking process template is demonstrated through two example analyses of single-cell RNAseq data with methods from the scdney package. These examples are highly relevant, as a typical analysis in this field generally involves multiple interconnected steps with the choices made at each step having (potentially major) downstream consequences. These example data stories show what additional questions the researchers asked at each step of the analysis in order to establish confidence in their results to that point. These questions and the associated supplemental analyses are generally not included in the final products of most research projects (papers, software, etc.), but are of critical importance to the validity of the results.

I strongly agree with the authors' call for greater transparency in method selection and the educational advantages of training analysts to think critically about their pipelines. While I found some of the concepts presented in this manuscript confusing, I think the overall message is both timely and critically important.

  1. The biggest source of confusion for me was the notion of the Thinking Process Template. The authors claim that this template will "formalize the thought process an analyst should undertake to ensure robust analysis that is tailored to their data," which is quite a lofty goal. This template is referenced throughout and demonstrated by the two example analyses, but I was never clear on its definition or practical use (Figures 3 and 5 and Boxes 1 and 2 all seem more useful and generalizable than Figures 1 and 2). As such, I failed to see the connection between the different Results subsections. I think the paper could be improved by providing a more concise definition of the Thinking Process Template. Alternatively, I think removing/de-emphasizing these sections and focusing on advocating for the use of data stories (as demonstrated), would make for a more effective message, as I'm not sure what is gained from the additional terminology.

  2. On a related note, the title mentions "Thinking process templates" (plural, not capitalized), whereas the text itself often refers to a Thinking Process Template (singular, capitalized). Given my confusion related to point (1), I found myself wondering if the "templates" actually referred to the example analyses, or if it was indeed supposed to be something more general.

  3. It took me a little while to find the workshop materials mentioned, particularly for the second data story, which is not listed on the scdney website (at least, not the one I found: https://sydneybiox.github.io/scdney/ ). Since these materials are central to the results, it would be helpful to include direct links to them in the text, if possible, or otherwise in the Data/Software Availability sections (I'm referring to https://sydneybiox.github.io/scdneyAdvancedPhenotyping/articles/advanced_phenotyping.html and https://sydneybiox.github.io/scdneyDiseasePrediction/articles/disease_outcome_classification_schulte.html , and apologize if these are not the intended definitive versions of the data stories).

    Minor comments

  4. The full data stories make use of several methods that are not mentioned or cited, but perhaps could be. I noted Monocle 3, tradeSeq, velocyto, clusterProfiler, CellChat, and CYTOTRACE for the first data story and Seurat for the second.

  5. Figure 4B could use more explanation. The top five marker genes seem to do a much better job of identifying the original set of "ImmGranule2" cells than the re-classified set. Why do these markers become less informative in the re-classified data? Wouldn't that indicate that the original labels were more biologically meaningful?

  6. I think the phrase "through visuals, we enlighten the data" only works if you use the archaic definition of "enlighten". "Illuminate" might be a better choice.

  7. The following sentences/phrases could stand to be re-written for clarity:
    1. Introduction: "...how 'robust' a data analysis should be conducted"
    2. Introduction: "...help users identify a set of seemingly disparate methods into a cohesive whole"
    3. Methods: "not from previous studies nor have them been published somewhere else"
    4. Figure 1: "Critical thinking questions to what we have to make from the Data, drive our decision with the Narrative and enlighten with the Visuals."
    5. Results: "...creates multi-view feature representation on patient level..."
    6. Results: "The scdney workflow start(s) with data..."
    7. Results: "...gene sets that distinguish hippocampal development in mice from scRNA-seq data..." (sounds like you are contrasting "hippocampal development" and "scRNA-seq")
    8. Conclusion: "...scdney, a collection of wrapper packages..." (scdney is the wrapper package)
  8. The bulleted list of methods in the scdney package should be consistently formatted (i.e. should all start with a lower case "a" or "an").

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Single-cell transcriptomics, computational biology software development

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

F1000Res. 2023 Dec 7.
Yue Cao 1

Response:

We thank the reviewer for supporting our call for transparency in method selection. Below we provide a point-by-point response.

  1. The biggest source of confusion for me was the notion of the Thinking Process Template. The authors claim that this template will "formalize the thought process an analyst should undertake to ensure robust analysis that is tailored to their data," which is quite a lofty goal. This template is referenced throughout and demonstrated by the two example analyses, but I was never clear on its definition or practical use (Figures 3 and 5 and Boxes 1 and 2 all seem more useful and generalizable than Figures 1 and 2). As such, I failed to see the connection between the different Results subsections. I think the paper could be improved by providing a more concise definition of the Thinking Process Template. Alternatively, I think removing/de-emphasizing these sections and focusing on advocating for the use of data stories (as demonstrated), would make for a more effective message, as I'm not sure what is gained from the additional terminology.

    Response:

    We appreciate the reviewer’s perspective and would like to provide some clarification on the Thinking Process Template. The Thinking Process Template is a general term that refers to the data analysis procedure, from data input to final analysis outcome, that involves critical thinking processes and goes beyond the direct application of tools. It is distinct from data stories, which are the products of data analysis efforts. We have now clarified the terminology in the Introduction section:

    “We use Thinking Process Template as a general term to refer to the data analysis procedure from data input to final analysis outcome that involves critical thinking processes and goes beyond the simple application of tools or the products of data analysis.”

  2. On a related note, the title mentions "Thinking process templates" (plural, not capitalized), whereas the text itself often refers to a Thinking Process Template (singular, capitalized). Given my confusion related to point (1), I found myself wondering if the "templates" actually referred to the example analyses, or if it was indeed supposed to be something more general.

    Response:

    We refer to the “Thinking Process Template” as the general term that encapsulates the critical thinking process rather than specific templates. In single-cell research, the analyses that researchers perform are diverse, and no single Thinking Process Template is applicable to all tasks. Therefore, we used Figures 3 and 5 and Boxes 1 and 2 to illustrate two examples of Thinking Process Templates in the context of two data stories: 1) the identification of gene pairs and 2) patient outcome prediction.

  3. It took me a little while to find the workshop materials mentioned, particularly for the second data story, which is not listed on the scdney website (at least, not the one I found: https://sydneybiox.github.io/scdney/ ). Since these materials are central to the results, it would be helpful to include direct links to them in the text, if possible, or otherwise in the Data/Software Availability sections (I'm referring to https://sydneybiox.github.io/scdneyAdvancedPhenotyping/articles/advanced_phenotyping.html and https://sydneybiox.github.io/scdneyDiseasePrediction/articles/disease_outcome_classification_schulte.html , and apologize if these are not the intended definitive versions of the data stories).

    Response:

    We thank the reviewer for this suggestion and have updated https://sydneybiox.github.io/scdney/workshops.html to include the second data story.

Minor comments 

  1. The full data stories make use of several methods that are not mentioned or cited, but perhaps could be. I noted Monocle 3, tradeSeq, velocyto, clusterProfiler, CellChat, and CYTOTRACE for the first data story and Seurat for the second.

    Response:

    We have added the relevant packages and references (an illustrative R sketch of these downstream steps appears after this list of minor comments):

    “as well as asking which genes are differentially expressed across a pseudotime using tradeSeq (Van den Berge et al. 2020) and performing functional annotation of these gene sets through clusterProfiler (Yu et al. 2012). Together, these analyses demonstrate that the final trajectories are in line with our expectations and provide more confidence in the new biological insights extracted from these trajectories. The story includes other downstream analysis of the data such as cell-cell communication using CellChat (Jin et al. 2021) and RNA velocity analysis using scVelo (Bergen et al. 2020), which the users can perform to further explore their data.”

  2. Figure 4B could use more explanation. The top five marker genes seem to do a much better job of identifying the original set of "ImmGranule2" cells than the re-classified set. Why do these markers become less informative in the re-classified data? Wouldn't that indicate that the original labels were more biologically meaningful?

    Response:

    We apologise for the confusion and have corrected the mislabelling in the legend. The visualisations show that the cell-type-specific markers have a better distribution in the newly annotated cell type group (blue, re-classified) than in the original grouping (red, original). The higher expression of the top marker genes suggests that the new groupings better represent the cell-type profile of ImmGranule2.

  3. I think the phrase "through visuals, we enlighten the data" only works if you use the archaic definition of "enlighten". "Illuminate" might be a better choice.

    Response:

    We have now replaced the word “enlighten” with “illuminate”, including the occurrence in Figure 1.

  4. The following sentences/phrases could stand to be re-written for clarity:
    1. Introduction: "...how 'robust' a data analysis should be conducted"
      Response:
      We have rephrased this as: “Hence, this can create challenges when it comes to deciding how rigorous a data analysis is.”
    2. Introduction: "...help users identify a set of seemingly disparate methods into a cohesive whole"
      Response:
      We have rephrased this as: “Workflows can help users identify a set of seemingly disjoint methods into a unified and coherent process.”
    3. Methods: "not from previous studies nor have them been published somewhere else"
      Response:
      We have rephrased this as: “These two data stories and the accompanying workshops were not derived from previous studies nor have they been published elsewhere.”
    4. Figure 1: "Critical thinking questions to what we have to make from the Data, drive our decision with the Narrative and enlighten with the Visuals."
      Response:
      We have rephrased this as: “Critical thinking questions from the data drive our decision with the Narrative and illuminate with the Visuals.”
    5. Results: "...creates multi-view feature representation on patient level..."
      Response:
      This is now corrected.
    6. Results: "The scdney workflow start(s) with data..."
      Response:
      This is now corrected.
    7. Results: "...gene sets that distinguish hippocampal development in mice from scRNA-seq data..." (sounds like you are contrasting "hippocampal development" and "scRNA-seq")
      Response:
      We have changed the word “from” to “using” to avoid this confusion.
    8. Conclusion: "...scdney, a collection of wrapper packages..." (scdney is the wrapper package)
      Response:
      We have corrected this sentence.
  5. The bulleted list of methods in the scdney package should be consistently formatted (i.e. should all start with a lower case "a" or "an").

    Response:

    This is now fixed. 
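
An illustrative R sketch of the pseudotime differential expression and functional annotation steps mentioned in minor comment 1 is given below. The objects `sce` (a SingleCellExperiment), `pseudotime_mat` and `cell_weights` (cells-by-lineages matrices, e.g. from slingshot) are placeholders, not objects from the data story itself.

    # Illustrative sketch only; object names are placeholders.
    library(SingleCellExperiment)
    library(tradeSeq)
    library(clusterProfiler)
    library(org.Mm.eg.db)

    # Fit a smooth expression model per gene along pseudotime (tradeSeq)
    gam_fit <- fitGAM(counts = counts(sce),
                      pseudotime = pseudotime_mat,
                      cellWeights = cell_weights,
                      nknots = 6)

    # Test which genes change along pseudotime
    assoc <- associationTest(gam_fit)
    top_genes <- rownames(assoc)[order(assoc$pvalue)][1:100]

    # Functional annotation of the pseudotime-associated genes (clusterProfiler)
    ego <- enrichGO(gene = top_genes,
                    OrgDb = org.Mm.eg.db,
                    keyType = "SYMBOL",
                    ont = "BP")
    head(as.data.frame(ego))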

F1000Res. 2023 May 23. doi: 10.5256/f1000research.143393.r170725

Reviewer response for version 1

Jun Li 1

This article described an important effort to improve the teaching of data science skills through "thinking process templates". In the past, static knowledge was taught in didactic lectures, whereas data analysis workflow was taught in step-by-step tutorials. What remains lacking is to treat the analysis of each dataset as a unique journey, containing decision points that require context-specific diagnosis and judgement. This article focused on single-cell data analysis in the scdney system, which is a collection of related R packages performing data integration, cell type annotation, statistical modeling such as differential expression analysis and, importantly, "constructing data stories".

Such an effort is much needed, as tutorials for popular analysis workflows have typically been presented as a single path through the data, without highlighting key forking points, measures of confidence, and the reasons for selecting certain algorithms for normalization, projection, or visualization. The article stated this problem very well, and sought to design the thinking template to uncover "hidden pitfalls and assumptions".

It also drew attention to the importance of the "interdependence" of data analysis steps (such as imputation-then-normalization versus normalization-then-imputation), as the choices "have cascading impacts on the downstream analysis".

The main part of the article described two use cases, using different datasets for different end goals. They illustrate how the thinking process needs to be customized for the situation, even though the basic code and the statistical principles are similar. The single most important value of such an exercise is to advocate for a new way of combining scholarship with pedagogy, in which the analysts strive to provide "a structured framework for documenting the decisions and findings in various steps of the analysis". As such, the emphasis is to go beyond the practice of recording the workflow, by being responsible for sharing the reasons, alternative decisions, and inevitable compromises as one brings a story out of a data object.

In terms of improvement, my main suggestion is to highlight one or two examples where the standard workflow would proceed unwittingly into the wrong decision (such as using the default choice of normalization method or embedding parameters), but a more circumspect, thinking-along-the-way approach would have produced relevant diagnostics and turned the analysis onto another path. Similarly, it would be useful to provide an example where a decision depends on earlier choices, or when the same dataset needs different treatments for different goals.

Very minor issues:

  1. A typo in page 4 "reocrded".

  2. Please concisely explain "differential stable genes" and "higher-order changes", in page 5.

  3. In the paragraph before "Conclusions", in the sentence "To address these issues,…" there is probably a misplaced comma.

  4. In Conclusion, the sentence "Together with scdney,…" needs fixing.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Single-cell data analysis, data science education, data processing and modeling

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2023 Dec 7.
Yue Cao 1

We thank the reviewer for supporting our teaching philosophy on critical thinking. We have now rewritten the section on the second data story to highlight examples where critical thinking is needed for diagnosis.

The modified section is copied below:

“Once the final models are obtained, we ask questions about the robustness of the models. One approach involves assessing the generalisability of the model on an independent dataset. We examined the performance of the 13 models on a different COVID-19 dataset obtained from the Wilk study that also contains mild and severe patients (Wilk et al., 2020). We found that while the 13 models have similar balanced accuracy, between 0.75 and 0.88, on the Schulte-Schrepping dataset, their performance varied greatly on the Wilk dataset, ranging from 0.49 to 0.78 (Figure 6A, B). It is noteworthy that models built from feature types such as “gene proportion cell type” that have high accuracy do not necessarily maintain good performance on the Wilk dataset. On the other hand, the feature type “CCI” achieved an accuracy of over 0.75 in both datasets, indicating potential for further examination. Finally, to extract biological insights from the fitted models, we guide users to interpret the fitted models to identify important features and reflect on whether the features make sense. Here, it is important not to select the top features based on a single model, but to ask about the stability of these top features. To illustrate this idea, we examined all features that appeared at least once among the top 10 features in the cross-validated models. Figure 6C highlights that while the majority of the features were consistently ranked as top features across all models, a proportion of features were ranked in the hundreds or thousands in some models. These two scenarios illustrate the importance of critical thinking to avoid heading down a wrong decision path.”
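
The stability check described in this passage can be sketched in a few lines of R. In the minimal sketch below, `ranked_features` is a placeholder for a list containing one importance-ranked vector of feature names per cross-validated model; it is illustrative and not code from the data story itself.

    # Illustrative sketch: `ranked_features` is assumed to be a list with one
    # character vector of feature names per cross-validated model, ordered by
    # importance (placeholder, not an object from the data story).
    top10 <- lapply(ranked_features, head, 10)
    candidates <- unique(unlist(top10))

    # Rank of each candidate feature in every model (NA if not ranked at all)
    rank_mat <- sapply(ranked_features, function(r) match(candidates, r))
    rownames(rank_mat) <- candidates

    # Stable features keep small ranks across models; unstable ones can fall
    # into the hundreds or thousands in some models
    stability <- data.frame(
      feature     = candidates,
      median_rank = apply(rank_mat, 1, median, na.rm = TRUE),
      worst_rank  = apply(rank_mat, 1, max, na.rm = TRUE)
    )
    stability[order(stability$median_rank), ]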

Below we provide a point-by-point response to the minor issues.

Very minor issues:

  1. A typo in page 4 "reocrded".

    Response:

    This is now fixed.  

  2. Please concisely explain "differential stable genes" and "higher-order changes", in page 5.

    Response:

    We have added a concise description to each of the terms: “a novel method for identifying differential stable genes, that is, genes that are stably expressed in one cell type relative to other cell types” and “By higher-order changes, we refer to higher-order interactions such as variation and coexpression that are beyond changes in mean expression”.

  3. In the paragraph before "Conclusions", in the sentence "To address these issues,…" there is probably a misplaced comma.

    Response:

    We have rephrased this sentence as: “To address these issues, the R programming community has developed tools such as BiocManager and renv to help with the installation and documentation of R package dependencies.” A minimal sketch of these two tools appears after this list.

  4. In Conclusion, the sentence "Together with scdney,…" needs fixing.

    Response:

    We have rephrased the sentence as: “Here we introduce the design thinking process template that structures analytical decision making, together with scdney, a wrapper package with a collection of packages presented in the context of several data stories.”
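
A minimal sketch of the two dependency tools named in the response to minor issue 3 is shown below; the package passed to BiocManager::install is an illustrative example, not a stated dependency of scdney.

    # Illustrative sketch of BiocManager and renv for dependency management.
    install.packages(c("BiocManager", "renv"))

    # BiocManager resolves Bioconductor (and CRAN) package dependencies
    BiocManager::install("SingleCellExperiment")  # example package only

    # renv documents the exact package versions used in an analysis
    renv::init()      # create a project-local library
    renv::snapshot()  # record package versions in renv.lock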

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Cao Y, Tran A: SydneyBioX/scdneyDiseasePrediction: v1.0.0 (v1.0.0). Zenodo. 2023. https://doi.org/10.5281/zenodo.7582777
    2. Lin Y, Kim HJ, Chen C: SydneyBioX/scdneyAdvancedPhenotyping: v1.0.0 (v1.0.0). Zenodo. 2023. https://doi.org/10.5281/zenodo.7582775

    Data Availability Statement

    Underlying data

    NCBI Gene Expression Omnibus (GEO): Transcriptome analysis of single cells from the developing mouse dentate gyrus. Accession number GSE104323, https://identifiers.org/geo:GSE104323.

    European Genome-phenome Archive (EGA): ScRNA-seq of PBMC and whole blood samples reveals a dysregulated myeloid cell compartment in severe COVID-19. Accession number EGAS00001004571, https://ega-archive.org/studies/EGAS00001004571.

