F1000Research. 2025 Jun 2;13:1256. Originally published 2024 Oct 21. [Version 2] doi: 10.12688/f1000research.154675.2

SCUBA implements a storage format-agnostic API for single-cell data access in R

William M Showers 1,2, Jairav Desai 1, Krysta L Engel 1,2, Clayton Smith 1,2, Craig T Jordan 2, Austin E Gillen 2,3,a
PMCID: PMC12351237  PMID: 40822437

Version Changes

Revised. Amendments from Version 1

The new version of the manuscript highlights the addition of the `fetch_data` generic. This was added instead of using the FetchData generic from SeuratObject and extending it to SingleCellExperiment and anndata objects. A website for SCUBA was added via pkgdown (https://amc-heme.github.io/SCUBA/) with improved documentation. A user guide vignette was added to the pkgdown website, and the GitHub README was updated with additional installation instructions. Modifications were made to the introduction section in response to reviewer feedback. Figures 1, 2, 4, 5, and 6 were modified. We corrected an error in the caption for figure 4, where the caption for figure 4A was intended to refer to figure 4C, and vice versa.

Abstract

While robust tools exist for the analysis of single-cell datasets in both Python and R, interoperability is limited, and analysis tools generally only accept one object class. Considerable programming expertise is required to integrate tools across package ecosystems into a comprehensive analysis, due to their differing languages and internal data structures. This complicates validation of results and leads to inconsistent visualizations between analysis suites. Conversion between object formats is the most common solution, but this is difficult and error-prone due to the rapid pace of development of the analysis suites and their underlying data structures. To address this, we created SCUBA (Single-Cell Unified Backend API), an R package that implements a unified data access API for all common R and Python single-cell object formats. SCUBA extends the data access approach from the widely used Seurat package to SingleCellExperiment and anndata objects. SCUBA also implements new data-specific access functions for all supported object types. Performance scales well across all SCUBA-supported formats. In addition to performance, SCUBA offers several advantages over object conversion for the visualization and further analysis of pre-processed single-cell data. First, SCUBA extracts only data required for the operation at hand, leaving the original object unmodified. This process is simpler, less error-prone, and less memory intensive than object conversion, which operates on the entire dataset. Second, code written with SCUBA can use any supported object class as input, with simple and consistent syntax across object formats. This allows a single analysis script or package (like our interactive single-cell browser, scExploreR) to work seamlessly with multiple object types, reducing the complexity of the code and improving both readability and reproducibility. Adoption of SCUBA will ultimately improve collaboration and reproducible research in single-cell analysis by lowering the barriers between package ecosystems.

Keywords: single-cell sequencing, multimodal, software tools, R package, Python, visualization

Introduction

The rapidly evolving landscape of single-cell sequencing methods has led to the production of increasingly large and diverse single-cell datasets, greatly improving our knowledge of both inter- and intra-patient heterogeneity in a wide range of diseases and normal tissues. 1, 2 While there are many excellent tools available for analyzing single-cell datasets in both the Python and R ecosystems, interoperability is hindered by the use of incompatible object classes (Figure 1A). Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis. Single-cell analysis requires programming experience for full customization of analysis and visualization, and different object formats make it even more difficult for biologists to analyze data. The implementation of popular object formats in both Python and R requires users to be fluent in both programming languages, and it is difficult and time consuming to learn both without formal education in data structures and syntax in each language. The widespread use of multiple object formats also makes it difficult to validate results produced by one analysis suite with another, and visualizations produced with different analysis suites are not consistent. Additionally, the practical analysis of objects with large numbers of cells requires a way to interact with on-disk matrices rather than loading all data in memory. On-disk matrix implementations are analysis suite-specific, which introduces additional barriers to effective analysis. For example, anndata objects are natively stored in the memory-efficient HDF5 format, but anndata objects are not compatible with Bioconductor’s single-cell tools or Seurat. If a user converts an anndata object to Seurat or SingleCellExperiment format, they must use a different on-disk matrix implementation specific to that format, which further restricts the analyses that can be performed. If single-cell object formats were interoperable, it would be easy for researchers to analyze data from any single-cell dataset, regardless of the object format used when the dataset was generated, but unfortunately this is not the case.

Figure 1. SCUBA addresses challenges posed by multiple object formats in single-cell sequencing data.

A) Raw single-cell sequencing data is stored in defined object classes, and processed downstream by packages that only accept one object class. This creates “walled garden” analysis suites of incompatible packages that complicate single-cell analysis. When a specific downstream package is desired, the user will need to convert between object formats prior to use. This is possible, but the process is difficult and error-prone. B) SCUBA returns feature expression data, metadata, and reduction coordinates from Seurat, SingleCellExperiment, and anndata objects in a consistent output format. An overview of each object structure is shown, with rectangles indicating data matrices stored in each object. Dimensions of the matrices are labeled with “cells” or “genes” (features), and matrices placed adjacent to one another indicate requirements that matrices have the same number of values in the dimension indicated (i.e. for Seurat objects, the “reduction coordinates” and “gene expression” matrices must have the same number of cells, but may have varying numbers of genes or, in the case of reduction coordinates, dimensions). Next to the description of each matrix, object-specific code to retrieve the matrix is given. If additional modalities are supported by an object type, the structure of matrices specific to sequencing modalities is shown, along with code to retrieve data on alternate modalities. The output format for SCUBA is shown at the bottom of the panel. The output is a single R data.frame with values for each variable requested for each cell. The S3 methods added by SCUBA to yield the output format are shown in blue. The methods are based on the existing FetchData method from the MIT-licensed, open-source Seurat package. The fetch_data method for Seurat objects is a wrapper for FetchData from Seurat, and the fetch_data methods for SingleCellExperiment and anndata objects apply the workflow from the Seurat FetchData method to these object classes.

Currently, the most effective solution is to convert between object classes. All major single-cell analysis packages implement conversion functions, and third-party packages such as sceasy, 3 zellkonverter, 4 and scDIOR 5 are specifically designed for these conversions. Additionally, SeuratWrappers 6 implements conversion functions that allow several otherwise incompatible single-cell analysis packages to be used on Seurat objects. However, inconsistencies in approaches to object structure across implementations often result in data loss upon conversion, which is difficult to overcome. Additionally, the rapid development of packages implementing object formats means that conversion functions are difficult to maintain and frequently break due to changes in these packages. This is especially true when converting to and from the anndata format, since this format is implemented in the Python programming language, and the Seurat and SingleCellExperiment objects are implemented in the R programming language. Even if conversion is successfully achieved without loss of data quality, it has recently been demonstrated that results from Seurat differ from those of Scanpy, 7 despite the fact that the two packages implement ostensibly identical processing steps. Addressing interoperability and consistency issues between analysis suites is crucial to ensuring the fidelity and reproducibility of single-cell analysis results, making consistent visualizations across suites essential.

Rather than conversion between objects, we propose a more sustainable approach to the visualization and further analysis of pre-processed single-cell data by implementing a unified API for all common single-cell object formats. Here, we present Single-Cell Unified Backend API (SCUBA), an R package based on the data accession function in the widely used Seurat 8 package that returns data from Seurat, SingleCellExperiment, and anndata objects in a common format for downstream visualization and analysis (Figure 1B). SCUBA also implements new data-specific access functions for all supported object types. Data is returned in a single R data.frame, with requested variables as columns, and cells as rows. The functions in this package allow users to plot data in a consistent manner from these object types in R, without requiring conversion. SCUBA can also be used in functional programming applications as the basis for single-cell plotting packages, or in the development of Shiny apps. For objects with very large numbers of cells, it is now possible to choose the object class based on on-disk storage performance and produce visually consistent plots without having to downsample the object. Packages and scripts created with SCUBA are flexible with regard to input type, greatly improving the consistency of results between objects and increasing accessibility of these analyses for non-programmers.

Methods

Implementation

SCUBA provides a unified framework for data access by leveraging R’s S3 object-oriented programming. 9 The workflow for data access in SCUBA is based on Seurat’s FetchData method. We implemented a new generic, fetch_data, which executes S3 methods based on the input object class. For Seurat objects, the fetch_data method is a wrapper for FetchData from Seurat. For SingleCellExperiment and anndata objects, SCUBA implements novel methods that replicate the behavior of Seurat FetchData in these objects. The Seurat method was chosen as a basis due to its ease of use, and its implementation in Seurat plotting functions, which are widely used.
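The class-based dispatch described above can be illustrated with a small, language-neutral sketch. The example below uses Python's `functools.singledispatch` as an analogue of an R S3 generic; it is not SCUBA's code. The classes `ToySeurat` and `ToyAnnData` and the function `fetch_metadata_sketch` are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical stand-ins for two object classes; real objects are far richer.
@dataclass
class ToySeurat:
    cells: list
    meta_data: dict  # variable name -> list of per-cell values


@dataclass
class ToyAnnData:
    obs_names: list
    obs: dict  # variable name -> list of per-cell values


@singledispatch
def fetch_metadata_sketch(obj, variables):
    """Generic entry point: fail clearly for classes without a method."""
    raise TypeError(f"unsupported object class: {type(obj).__name__}")


@fetch_metadata_sketch.register
def _(obj: ToySeurat, variables):
    # One row per cell, one column per requested variable.
    return {cell: {v: obj.meta_data[v][i] for v in variables}
            for i, cell in enumerate(obj.cells)}


@fetch_metadata_sketch.register
def _(obj: ToyAnnData, variables):
    # Same output shape, different internal accessors.
    return {cell: {v: obj.obs[v][i] for v in variables}
            for i, cell in enumerate(obj.obs_names)}


seurat_like = ToySeurat(["cell1", "cell2"], {"cell_type": ["B", "T"]})
anndata_like = ToyAnnData(["cell1", "cell2"], {"cell_type": ["B", "T"]})
# The call is identical for both classes, and so is the result.
assert fetch_metadata_sketch(seurat_like, ["cell_type"]) == \
    fetch_metadata_sketch(anndata_like, ["cell_type"])
```

The point of the design is that calling code never branches on object class: each class contributes one method, and the generic selects it.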

Access to anndata objects is accomplished using reticulate 10 and performing as many operations in Python as possible before returning data to R. The workflow from the existing method for Seurat objects is largely unchanged upon re-implementation for these object formats. We used code from the Seurat package under the terms of the package’s MIT license.

In addition to replicating the behavior of Seurat’s FetchData in SingleCellExperiment and anndata objects, SCUBA includes S3 generics and methods specific to the retrieval of metadata and reduction coordinates from each object format. These methods offer improvements in performance relative to retrieving the same data via fetch_data for large objects.

Operation

SCUBA can be installed as an R package via GitHub using the devtools 11 R package, and can be used on all common operating systems. To ensure compatibility across operating systems, SCUBA is maintained using Continuous Integration (CI), with GitHub Actions and the testthat 12 R package. The GitHub Actions workflow performs 100+ tests on recent Linux R and Python versions whenever a pull request is created. Tests are additionally run on macOS and Windows for releases. The dataset used for testing is a downsampled version of the acute myeloid leukemia reference dataset 13 from Triana et al. 2021. 14

If using SCUBA with Seurat or SingleCellExperiment objects, no further installation is necessary beyond the R dependencies. For anndata objects, the reticulate 10 R package and a Python installation are required. The following Python packages must be manually installed: pandas, 15 numpy, 16 scipy, 17 and anndata. 18 We recommend installing these packages in a conda 19 environment and loading the environment in R with reticulate::use_condaenv(), but this is not required. Detailed installation instructions are available on the SCUBA GitHub page.
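Before pointing reticulate at a Python environment, it can help to confirm that the four packages listed above are importable from it. The following is a minimal sketch using only the Python standard library; the helper name `missing_modules` is hypothetical, and the check covers presence only, not package versions.

```python
from importlib.util import find_spec

# Python packages the text lists as required for anndata support.
REQUIRED = ("pandas", "numpy", "scipy", "anndata")


def missing_modules(names=REQUIRED):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All required Python packages were found.")
```

Running the script with the interpreter of the intended environment reports which packages, if any, still need to be installed.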

Use cases

The features of SCUBA fall broadly into three categories: data access, data visualization, and data exploration. The functions provided in these categories can be used independently or in a stepwise pipeline. Generally speaking, SCUBA works best for objects that have been filtered and clustered, though SCUBA can work on objects in any state as long as the data being requested exists. Here we highlight independent use cases using a downsampled version of the acute myeloid leukemia reference dataset 13 generated by Triana et al. 14 Additional vignettes are provided on the SCUBA GitHub page.

FetchData Methods for SingleCellExperiment and anndata Objects

Example usage of SCUBA’s fetch_data methods is given in Figure 2. The existing Seurat method (first column) is compared to the methods added by SCUBA. There are only minor variations in input syntax across the three supported object types, and the required parameters are few, making the methods easy to use. All data requested is specified using the vars parameter. The methods infer whether the data requested is metadata, reduction coordinates, or feature expression by parsing the character vector passed to this parameter. To retrieve feature expression or reduction coordinates, the user adds a “key” with an underscore giving the name of the reduction, or the modality to pull feature expression data from. If using an object with only one modality, the modality key is not needed, and the key is also not needed to retrieve metadata. Minor variations in the key exist between object types, due to object-specific conventions for naming modalities (which are called “assays” in Seurat objects, and “Experiments” in SingleCellExperiment objects). Variations in the layer parameter are based on variations in conventions for naming layers (in SingleCellExperiment objects, “assays”, and in Seurat v4 and earlier, “slots”). The consistency in parameters between the three object types, and the presence of only minor differences in inputs to each parameter, facilitates the writing of scripts for any object type.

Figure 2. The methods added by SCUBA simplify the retrieval of data from supported object classes.

The existing Seurat method (first column) is compared to the methods added by SCUBA for SingleCellExperiment and anndata objects (second and third columns). The methods use consistent syntax across object classes and involve the use of only a few parameters. Pseudocode is used in the examples. object represents a single-cell object. features represents one or more features, from any modality in the object. metadata represents one or more metadata variables, for example, cell type classifications. reduction_dims represents a set of dimensions in a reduction included with the object, with the number of the dimension separated from the reduction with an underscore. For example, to fetch the first and second dimensions of the UMAP projection, reduction_dims would be c("UMAP_1", "UMAP_2").

The output of fetch_data is identical across the three object classes. The output is an R data.frame with values for each requested feature in vars per cell. Columns represent each feature, and rows represent cells.
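The way a single character vector passed to vars can mix metadata, reduction coordinates, and feature expression is easier to see in code. The sketch below is an illustrative simplification written in Python, not SCUBA's parsing logic. The function `classify_var` and the modality keys `rna` and `protein` are hypothetical; real key names depend on the object.

```python
def classify_var(var, metadata, reductions, modalities):
    """Classify one requested variable as (kind, source, name).

    metadata, reductions, and modalities are the names available in the
    object: metadata columns, reduction keys, and modality keys.
    """
    # Metadata variables are requested by name, with no key.
    if var in metadata:
        return ("metadata", None, var)
    key, sep, rest = var.partition("_")
    # "<reduction>_<n>" requests dimension n of that reduction.
    if sep and key in reductions and rest.isdigit():
        return ("reduction", key, int(rest))
    # "<modality>_<feature>" requests a feature from a specific modality.
    if sep and key in modalities:
        return ("feature", key, rest)
    # Anything else is treated as a feature in the default modality.
    return ("feature", "default", var)


requested = ["cell_type", "UMAP_1", "protein_CD34", "GAPDH"]
for var in requested:
    print(var, "->", classify_var(var, {"cell_type"}, {"UMAP"}, {"rna", "protein"}))
```

In this simplified scheme, a name containing an underscore whose prefix is not a known key falls through to the default modality and is kept whole.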

Metadata, reduction-specific accession methods

SCUBA also includes S3 generics and methods specific to the retrieval of metadata and reduction coordinates, which are faster than retrieving the same data via fetch_data for R object types. An overview of the fetch_metadata and fetch_reduction functions is given in Figure 3A. As with fetch_data, the output of fetch_metadata and fetch_reduction is an R data.frame with data for the requested metadata variables or reduction coordinates, respectively, as columns, and rows for each cell. Figure 3B compares the usage of fetch_reduction and fetch_metadata between supported object types. We implement these methods in anndata objects for consistency in syntax, but their performance is roughly equivalent to fetch_data. The functions are easy to use, and the inputs to each function do not vary based on object type. To set defaults for the reduction and cells parameters of fetch_reduction, SCUBA provides several utility methods. default_reduction will search for UMAP, t-SNE, and PCA reductions, and will return them in that order if they exist. get_all_cells will return the IDs of all cells in the object.

Figure 3. SCUBA methods specific to the retrieval of metadata and reduction coordinates.

A) Overview of outputs of fetch_metadata, for metadata variables, and fetch_reduction, for reduction coordinates. The output is an R data.frame with the metadata or reduction coordinates as columns, and the cells as rows. B) Comparison of the usage of fetch_metadata and fetch_reduction across each object type. For fetch_metadata, the metadata variable or variables to retrieve (which are represented as metadata in this pseudocode example) are specified via a character vector input to vars. For fetch_reduction, the dimensions to return from the reduction coordinate matrix are passed to dims, and the reduction to pull from is specified via reduction. The cells parameter allows the user to specify which cells to fetch reduction coordinates for. For ease of use and flexibility, there is no difference in inputs between object types; only the object itself varies.
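The search order described for default_reduction (UMAP, then t-SNE, then PCA) can be sketched as a simple priority lookup. The Python example below shows one reading of that description, in which the first reduction found in the preference order is returned. The function `pick_default_reduction`, the lowercase key names, and the case-insensitive matching are assumptions made for illustration, not SCUBA's implementation.

```python
def pick_default_reduction(available, preference=("umap", "tsne", "pca")):
    """Return the first preferred reduction present in `available`.

    Matching ignores case, and the name is returned as stored in the object.
    """
    lookup = {name.lower(): name for name in available}
    for candidate in preference:
        if candidate in lookup:
            return lookup[candidate]
    raise ValueError("no UMAP, t-SNE, or PCA reduction found")


# UMAP wins when present; otherwise the search falls back down the list.
print(pick_default_reduction(["PCA", "UMAP"]))
print(pick_default_reduction(["pca", "tsne"]))
```

Raising an error when nothing matches forces the caller to name a reduction explicitly rather than receive an arbitrary one.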

Figure 4A-B compares the performance of fetch_metadata and fetch_reduction with the performance of fetch_data to pull one metadata variable, and the first and second dimensions of UMAP coordinates, respectively. Run time was tested for each function on random subsets of varying numbers of cells, with five subsets created for each size. The fetch_metadata and fetch_reduction methods were more performant than fetch_data in Seurat and SingleCellExperiment objects for all subsets tested. In anndata objects, the runtime of these functions was comparable to that of fetch_data. Performance testing for the fetch_data methods added by SCUBA was also performed (Figure 4C). Performance of the method for anndata objects exceeds the performance of the existing Seurat method in most cases, and the performance of the SingleCellExperiment method exceeds the performance of the Seurat method for the largest subset tested (500k cells).

Figure 4. Performance testing of SCUBA functions and methods.

Five random subsets of the indicated numbers of cells were created from the Human Brain Cell Atlas object downloaded from CELLxGENE. 24 The subsets were saved in the following object formats: Seurat, via saveRDS(), SingleCellExperiment, via HDF5Array::saveHDF5SummarizedExperiment(), and anndata, via write_h5ad. For all tests, the indicated operations were run on each of the five subsets, and the run time was measured using Sys.time. A) Performance of the fetch_data methods developed for SingleCellExperiment and anndata objects, compared to the existing FetchData method for Seurat objects. The series “Seurat”, “SingleCellExperiment”, and “anndata” indicate the performance of fetch_data methods for Seurat, SingleCellExperiment, and anndata objects, respectively. A single feature was pulled via fetch_data for each of the random subsets for the indicated object and number of cells. B) Comparison of fetch_data methods vs. fetch_metadata for the retrieval of data for a single metadata variable. In most cases, using fetch_metadata to pull metadata was more performant than using fetch_data. C) Comparison of fetch_data methods vs. fetch_reduction for the retrieval of data for a pair of reduction coordinates. fetch_reduction was more performant than fetch_data for the retrieval of reduction coordinates in most cases.
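The benchmarking design in the caption (five random subsets per size, one timed operation per subset) can be sketched as a small harness. The Python example below is a simplified illustration and not the authors' benchmarking script: it samples cell IDs in memory rather than saving each subset as a separate object, and `time_on_random_subsets` and its parameters are hypothetical names.

```python
import random
import time


def time_on_random_subsets(cell_ids, sizes, operation, n_subsets=5, seed=0):
    """Time `operation` on `n_subsets` random subsets for each subset size.

    Returns a dict mapping each size to a list of run times in seconds.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    results = {}
    for size in sizes:
        timings = []
        for _ in range(n_subsets):
            subset = rng.sample(cell_ids, size)  # sampling without replacement
            start = time.perf_counter()
            operation(subset)
            timings.append(time.perf_counter() - start)
        results[size] = timings
    return results


# Toy operation standing in for a data access call on the subset.
timings = time_on_random_subsets(list(range(10_000)), [100, 1_000], sorted)
for size, runs in timings.items():
    print(size, "cells:", [f"{t:.6f}s" for t in runs])
```

Keeping all five timings per size, rather than only a mean, preserves the run-to-run spread.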

Example scripts created with SCUBA

Figure 5 gives an example usage of SCUBA methods to create plots with consistent visuals across object types. Figure 5A shows the scripts to create a density plot showing expression by cell type from each of the three supported object types, showing regions of the script that vary between object types, and regions that are conserved. Figure 5B shows the output of the example script. The script demonstrates the ease with which expression data can be visualized from each object format, and the ease with which plot visuals can be harmonized across object formats.

Figure 5. SCUBA enables flexible plotting scripts harmonized across object types.

A) Example script for visualizing expression of a gene by cluster in a density plot. The three boxes for fetch_data indicate slight variations in the script for each object type. All downstream code is the same across object formats. B) Output of the plotting script in (A). Output does not vary by object type.

Any plot visualizing a combination of expression data, metadata, and reduction coordinates can be created by generating a table from fetch_data, fetch_metadata, or fetch_reduction, and passing the output table to downstream plotting code. Plotting is performed via ggplot2 20 in this example, but any other plotting package that accepts a data.frame or a tibble as input may be used. If desired, it is also possible to convert to a pandas 15 dataframe via reticulate, 10 and perform plotting operations in Python. The flexibility of SCUBA’s data access methods facilitates the creation of a broad variety of plots from single-cell data.

Figure 6 shows an example script that simplifies the printing of unique values of a metadata variable represented in an object, which is a commonly used basic operation in analysis. With SCUBA, this operation can simply be performed by calling fetch_metadata on the object and piping the results to unique(). The language used is the same for all supported object classes, which negates the need to memorize and use the most efficient function calls for each respective object type.

Figure 6. SCUBA simplifies common object exploration operations.

This figure compares the usage of SCUBA with the most efficient equivalent operations for viewing the unique values of a metadata variable represented in an object. The operation with SCUBA is shown in the first column, and the most efficient equivalents are shown in the second column. SCUBA simplifies this operation, allowing for the development of scripts that are generalized for multiple object types.

Conclusions

SCUBA addresses issues with interoperability between single-cell object formats by providing a flexible backend that returns data in a consistent format, via a consistent interface. The consistent output format of SCUBA facilitates downstream use in functional programming applications (plotting scripts, packages, etc.) and allows for consistent visualizations across object types. Packages and scripts using SCUBA will not require object conversion prior to use, conferring several advantages for end users. Users will not have to risk data loss upon object conversion, and analysis will be more straightforward without conversion, requiring less programming experience. Packages made with SCUBA will also allow users to choose object classes based on storage and performance characteristics that are best for the specific dataset, rather than being constrained to a class based on downstream packages. SCUBA does not allow users to use any analysis package with any object format, however. The aforementioned benefits only apply to packages and scripts created using SCUBA. SCUBA also only performs data access operations, and is not for object assembly, clustering, or filtering. Because of this, SCUBA is not a replacement for analysis packages such as Seurat and Scanpy. Instead, SCUBA allows users to visualize objects that have been prepared with these analysis packages in the same manner, regardless of object class.

Support for MuData 21, 22 will be added in the future, as this Python object class is especially useful for storing data from multimodal single-cell sequencing experiments. SCUBA is particularly well suited for interactive use, such as in Shiny apps, where multiple object formats may be used as inputs. We developed a single-cell browser, scExploreR, 23 that allows users to create consistent Seurat-style visualizations from Seurat, SingleCellExperiment, or anndata objects. SCUBA can also be used to create a plotting package that produces visuals from any supported object class for reports and Shiny apps, and a QC package reporting the results of preprocessing steps such as filtering, clustering, and batch correction could also be created using SCUBA. The flexibility of SCUBA is envisioned to facilitate analysis and visualization of preprocessed data, unifying disparate object-based package ecosystems.

Ethics and consent

Ethical approval and consent were not required.

Acknowledgements

The authors would like to acknowledge Monica Ransom, Sarah E. Staggs, Stephanie R. Gipson, Abbigayl Burtis, and Devin Burke for their thoughtful comments and suggestions during the development of this package and the writing of this manuscript.

Funding Statement

This work received support from US VA IK2BX004952-01A1 to AEG and US NIH R35CA242376 to CTJ.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 3 approved]

Data and software availability

SCUBA uses two third-party datasets for performance benchmarking, testing, and demonstration in the manuscript. The datasets are described below.

Figshare: Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor. https://doi.org/10.6084/m9.figshare.13398065.v4. 13

This project contains the following underlying data:

  • 200AB_projected.rds. (Seurat object with 15281 cells, showing the expression of 197 surface markers and 462 mRNAs in bone marrow from a young healthy donor).

The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

CELLxGENE: Human Brain Cell Atlas v1.0. https://cellxgene.cziscience.com/collections/283d65eb-dd53-496d-adb7-7570c7caa443.

This project contains the following underlying data:

  • cc9bfb86-96ed-4ecd-bcc9-464120fc8628.rds. (Seurat object with 800k non-neuronal cells used for performance benchmarking in the manuscript. The file is accessed by selecting “All non-neuronal cells” and then the .rds radio button).

The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

The Velten et al. dataset 13 was processed to yield a format suitable for testing and demonstration of SCUBA, downsampled, and stored in the inst/extdata/ and data/ directories of the SCUBA repo. Scripts used in these operations and performance benchmarking are available at the manuscript GitHub repo: https://github.com/amc-heme/SCUBA_Manuscript. Working examples of code shown in figures 2, 3, 5, and 6 are also stored in this repo.

Software, up-to-date source code, and tutorials are available from: https://github.com/amc-heme/scuba

Archived source code at time of publication: https://zenodo.org/doi/10.5281/zenodo.13776167

License: MIT

References

  • 1. Schäfer PSL, Dimitrov D, Villablanca EJ, et al.: Integrating single-cell multi-omics and prior biological knowledge for a functional characterization of the immune system. Nat. Immunol. 2024;25:405–417. 10.1038/s41590-024-01768-2
  • 2. Zeng AGX, et al.: A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia. Nat. Med. 2022;28:1212–1223. 10.1038/s41591-022-01819-x
  • 3. Kiselev V, Huang N: sceasy. 2022.
  • 4. Zappia L, Lun A, Kamm J, et al.: zellkonverter. 2025.
  • 5. Feng H, Lin L, Chen J: scDIOR: single cell RNA-seq data IO software. BMC Bioinformatics. 2022;23:16. 10.1186/s12859-021-04528-3
  • 6. Butler A, et al.: SeuratWrappers. New York Genome Center: Satija Lab; 2024.
  • 7. Wolf FA, Angerer P, Theis FJ: SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. 10.1186/s13059-017-1382-0
  • 8. Hao Y, et al.: Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024;42:293–304. 10.1038/s41587-023-01767-y
  • 9. Wickham H: S3. Advanced R. Chapman and Hall/CRC; 2019. 10.1201/9781351201315-16
  • 10. Ushey K, Allaire J, Tang Y: reticulate: Interface to ‘Python’. 2023.
  • 11. Wickham H, Hester J, Chang W, et al.: devtools: Tools to Make Developing R Packages Easier. 2022.
  • 12. Wickham H: testthat: Get Started with Testing. The R Journal. 2011;3:5. 10.32614/RJ-2011-002
  • 13. Velten L, Triana S, Haas S, et al.: Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor. [Dataset]. Figshare. 2021. 10.6084/m9.figshare.13398065.v4
  • 14. Triana S, et al.: Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat. Immunol. 2021;22:1577–1589. 10.1038/s41590-021-01059-0
  • 15. The pandas development team: Pandas. 2023. 10.5281/ZENODO.3509134
  • 16. Harris CR, et al.: Array programming with NumPy. Nature. 2020;585:357–362. 10.1038/s41586-020-2649-2
  • 17. Virtanen P, et al.: Author Correction: SciPy 1.0: fundamental algorithms for scientific computing in Python (Nature Methods, (2020), 10.1038/s41592-019-0686-2). Nat. Methods. 2020;17:352–352. 10.1038/s41592-020-0772-5
  • 18. Virshup I, Rybakov S, Theis FJ, et al.: anndata: Annotated data. 2021. 2021.12.16.473007. 10.1101/2021.12.16.473007
  • 19. Conda contributors: Conda: A system-level, binary package and environment manager running on all major operating systems and platforms. 2024.
  • 20. Wickham H: Ggplot2: Elegant Graphics for Data Analysis. Switzerland: Springer; 2016. 10.1007/978-3-319-24277-4
  • 21. Bredikhin D, Kats I, Stegle O: MUON: multimodal omics analysis framework. Genome Biol. 2022;23:42. 10.1186/s13059-021-02577-8
  • 22. Virshup I, et al.: The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 2023;41:604–606. 10.1038/s41587-023-01733-8
  • 23. Showers W, Desai J, Gipson S, et al.: scExploreR: a Flexible Shiny App for Democratized Analysis of Multimodal single-cell RNA-seq Data. 2024.
  • 24. Siletti K, et al.: Human Brain Cell Atlas v1.0. [Dataset]. CELLxGENE. 2023.
F1000Res. 2025 Aug 19. doi: 10.5256/f1000research.182803.r396950

Reviewer response for version 2

Benedikt Obermayer 1

This manuscript presents an R package for unified data access to single-cell genomics objects from different commonly used formats (Seurat, SingleCellExperiment, and anndata). In version 2, most issues raised by previous reviewers were satisfactorily addressed, and the paper has reached a sufficiently sound stage.

I have one more major issue and a few small corrections that could be incorporated.

Major issue:

I agree with Damian Panas and Marcin Tabaka that this package does not really present a significant advancement or novel solution that comprehensively addresses the problem of interoperability. People who manage to set up reticulate and conda environments for use within their R installation would be expected to be able to get the necessary data out of objects in different formats. People with less expertise, for whom a visualization or data exploration tool that accepts different input formats might be most useful, will probably not be able to set this up in a reasonable time frame. In that case, I'd prefer a web-based or perhaps Docker-based explorer that accepts R as well as Python objects as input.

Minor issues:

- I don't really understand the advantage of SCUBA over the anndata R package for reading h5ad files and accessing their contents, apart from a unified syntax. (However, anndata-specific data structures such as the ad$raw slot don't seem to be accessible using SCUBA.) SCUBA appears to be somewhat faster than anndata in my hands, but that is probably of little concern to most users. This could be explained better.

- I noticed two typos (Suerat instead of Seurat in Fig. 1 caption, and SueratWrapper instead of SeuratWrapper on p4 bottom).

- When installing the package, my existing conda env with anndata did not have a sufficiently recent version (one including anndata.abc, which I think was introduced in 0.10). The requirements don't specify this.

- The plot functions still use the deprecated FetchData method. Why is the density plot not part of the package if it's used in the User Guide?
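
The version requirement raised in the third minor issue could be made explicit with a check along the following lines. This is a minimal sketch in base R: `meets_minimum_version` is a hypothetical helper (not part of SCUBA), the 0.10 minimum is the reviewer's own estimate rather than a confirmed requirement, and obtaining the installed version string from the Python environment is not shown.

```r
# Hypothetical helper (not part of SCUBA): compare an installed version
# string against a minimum. The default minimum of 0.10.0 is an assumption
# taken from the reviewer's estimate, not a documented requirement.
meets_minimum_version <- function(installed, minimum = "0.10.0") {
  # package_version() parses dotted version strings and compares them
  # component by component, so "0.9.2" sorts below "0.10.0"
  package_version(installed) >= package_version(minimum)
}

too_old <- meets_minimum_version("0.9.2")
recent_enough <- meets_minimum_version("0.10.7")
```

Comparing parsed versions rather than raw strings matters here, because a plain string comparison would rank "0.9" above "0.10".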

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Bioinformatics, single-cell genomics, computational biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2025 Aug 13. doi: 10.5256/f1000research.182803.r389307

Reviewer response for version 2

Marcin Tabaka 1, Damian Panas 2,3

The revised article comprises significant improvements over the initial submission. The comprehensive documentation, created using the pkgdown framework, is an essential addition. The documentation is now clearly structured, detailed, and supplemented by numerous examples. The authors have reformatted the manuscript to improve clarity and expanded the benchmarking section to provide a more informative performance comparison. Another significant addition is the new fetch_data() function, which resolves one of the most important issues raised in the initial submission. While a few of the minor problems raised earlier remain, the most critical concerns have been addressed and resolved. Overall, the current version of SCUBA is a robust, high-quality tool that demonstrates thoughtful improvements in its usability.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

No

Is the rationale for developing the new software tool clearly explained?

Partly

Is the description of the software tool technically sound?

Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Single-cell Genomics, Bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2025 Jan 29. doi: 10.5256/f1000research.169727.r350233

Reviewer response for version 1

Marcin Tabaka 3, Damian Panas 1,2

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate access to single-cell data stored in the various formats commonly used by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), and Anndata (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. Using FetchData methods, it extracts from the storage object a table with the requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective, and is organized as follows: 1) Introduction; 2) Methods, including the subsections Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section presents examples of SCUBA usage: 1) application of the FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for tasks similar to those in (1), including the functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format and for SingleCellExperiment and Anndata objects in h5 formats; 4) presentation of example scripts for single-cell data exploration and visualization. While single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in its current form does not appear to present a significant advancement or novel solution that addresses this problem comprehensively.
Rather, it represents a set of functions for accessing the data needed for visualization from various single-cell data objects, and it should be part of the authors' scExploreR package rather than a standalone package.

Other comments:

1) The authors justify the need for SCUBA by the fact that conversion between different data storage formats results in data loss or is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion, why conversion is error-prone, or how their approach overcomes these limitations. To me, SCUBA has exactly the same problems. When the third-party single-cell software and storage formats mentioned are updated, the authors’ SCUBA code will require updates in the same way.

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

3) One of the most common operations in exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted from the performance testing in Figure 4. Why? Can SCUBA efficiently extract, for example, gene expression values for a specified gene from huge matrices (500k cells) in SeuratObjects, SingleCellExperiments, and Anndata objects? The authors show only retrieval of cell coordinates and cell metadata, which are stored in smaller data structures.

4) It is not clear whether the authors also developed a new, faster version of FetchData for SeuratObject.

5) The speed tests in Figure 4 are confusing. It’s not clear when the authors use the native “FetchData” functions from Seurat/Scanpy and when they use those from SCUBA. For example, in Figure 4, is fetch_metadata slower than “FetchData” from Scanpy for anndata?

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

7) The authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. Its name is the only one that follows the Pascal case naming convention; all other functions follow snake case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding a fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

No

Is the rationale for developing the new software tool clearly explained?

Partly

Is the description of the software tool technically sound?

Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Single-cell Genomics, Bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

F1000Res. 2025 May 27.
William Showers 1

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate access to single-cell data stored in the various formats commonly used by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), and Anndata (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. Using FetchData methods, it extracts from the storage object a table with the requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective, and is organized as follows: 1) Introduction; 2) Methods, including the subsections Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section presents examples of SCUBA usage: 1) application of the FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for tasks similar to those in (1), including the functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format and for SingleCellExperiment and Anndata objects in h5 formats; 4) presentation of example scripts for single-cell data exploration and visualization. While single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in its current form does not appear to present a significant advancement or novel solution that addresses this problem comprehensively.
Rather, it represents a set of functions for accessing the data needed for visualization from various single-cell data objects, and it should be part of the authors' scExploreR package rather than a standalone package.

Thank you for this assessment. The observation that SCUBA is a set of functions for accessing data from single-cell objects is correct, but we disagree that SCUBA should not be a standalone package. The functions provided by SCUBA are useful both in fetching data for interactive report generation (the usage context for scExploreR), and for static report generation by bioinformaticians. The consistent usage of SCUBA across object formats facilitates, for example, re-running a specific visualization script on data from a group that uses a different object class. SCUBA is a tool for developers to work seamlessly across different object classes, and we envision it as the basis of an analysis suite of static visualization packages, as well as the basis of scExploreR. We improved our documentation with examples of how users would integrate SCUBA into their analysis scripts.

Other comments:

1) The authors justify the need for SCUBA by the fact that conversion between different data storage formats results in data loss or is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion, why conversion is error-prone, or how their approach overcomes these limitations. To me, SCUBA has exactly the same problems. When the third-party single-cell software and storage formats mentioned are updated, the authors’ SCUBA code will require updates in the same way.

While it is true that SCUBA is vulnerable to changes in underlying object class implementation and must be maintained, we see this vulnerability as being less severe than with conversion packages. Conversion methods such as those mentioned aim to convert the entirety of each object class to another class, while SCUBA functions each extract small subsets of data. While still requiring active maintenance to track upstream changes, we believe this atomic approach is substantially less vulnerable to breaking changes that affect the entire package. 

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

Thank you for the suggestion. We don’t feel this is in scope, as the API for data access in CELLxGENE and other visualization software is not exposed to end users in the way that Seurat’s FetchData is.

3) One of the most common operations in exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted from the performance testing in Figure 4. Why? Can SCUBA efficiently extract, for example, gene expression values for a specified gene from huge matrices (500k cells) in SeuratObjects, SingleCellExperiments, and Anndata objects? The authors show only retrieval of cell coordinates and cell metadata, which are stored in smaller data structures.

Thank you for pointing this out. The function mentioned for retrieval of gene expression data was benchmarked in figure 4A. The captions for figure 4C and figure 4A were switched, which incorrectly suggested that figure 4A was benchmarking fetch_reduction. We regret the confusion caused by this error and have corrected it in the revised version. Figure 4A has also been changed to more clearly reflect that the performance of the original Seurat FetchData is being compared to the methods added in SCUBA.

4) It is not clear whether the authors also developed a new, faster version of FetchData for SeuratObject.

We did not. Figure 4A has been updated to make this clear. In addition, we created a snake case fetch_data generic in response to comment 8, below. `fetch_data.Seurat` is simply a wrapper for Seurat’s FetchData, while `fetch_data.SingleCellExperiment` and `fetch_data.AnnDataR6` are added by SCUBA. This has been made clear in the manuscript and the updated documentation. 
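
The dispatch pattern described in this response can be illustrated with a toy S3 generic. This is a sketch only: the classes `toy_seurat` and `toy_sce`, their internal layout, and the method bodies are hypothetical stand-ins chosen so the example runs in base R, and they do not reproduce SCUBA's implementation or the real object classes.

```r
# Toy S3 generic illustrating one function name with per-class methods.
# "toy_seurat" and "toy_sce" are hypothetical classes, not real containers.
fetch_data <- function(object, vars, ...) {
  UseMethod("fetch_data")
}

# Method for a toy container that stores a cells-by-features data.frame
fetch_data.toy_seurat <- function(object, vars, ...) {
  object$data[, vars, drop = FALSE]
}

# Method for a toy container that stores a features-by-cells matrix
fetch_data.toy_sce <- function(object, vars, ...) {
  as.data.frame(t(object$mat[vars, , drop = FALSE]))
}

cells <- c("cell1", "cell2", "cell3")
obj_a <- structure(
  list(data = data.frame(GeneA = c(1, 0, 2), GeneB = c(0, 3, 1),
                         row.names = cells)),
  class = "toy_seurat"
)
obj_b <- structure(
  list(mat = matrix(c(1, 0, 0, 3, 2, 1), nrow = 2,
                    dimnames = list(c("GeneA", "GeneB"), cells))),
  class = "toy_sce"
)

# The calling code is identical regardless of the underlying class
res_a <- fetch_data(obj_a, vars = "GeneA")
res_b <- fetch_data(obj_b, vars = "GeneA")
```

Both calls return a one-column table with one row per cell, which is the property that lets downstream code ignore the storage format.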

5) The speed tests in Figure 4 are confusing. It’s not clear when the authors use the native “FetchData” functions from Seurat/Scanpy and when they use those from SCUBA. For example, in Figure 4, is fetch_metadata slower than “FetchData” from Scanpy for anndata?

As mentioned above, we feel that the new generic fetch_data makes this clear. Figure 4B and figure 4C compare the `fetch_data` method in SCUBA to `fetch_metadata` and `fetch_reduction`, respectively. 

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

Thanks for this observation. We agree in the context of interactive analysis in an R session, but disagree in the context of interactive web apps that plot data from single-cell objects with many cells. Regardless, while this is a valid point, we feel the main benefit of SCUBA is the consistency in syntax across object formats.

7) The authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

While SCUBA does not fully overcome this problem, we see SCUBA as being a dependency of packages that would do so. We envision an analysis suite with packages for analyses such as differential gene expression and gene set enrichment analysis, as well as static and interactive visualizations. Packages built on SCUBA access functions will be flexible by default to input object class, and should be easier for end users to use compared to running a function from a conversion package, and then running an existing function from an analysis package. We also expect packages based on SCUBA to be easier to develop, since the access functions are easier to use than running class-specific access code, and have consistent syntax.

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. Its name is the only one that follows the Pascal case naming convention; all other functions follow snake case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding a fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

We appreciate these observations. In addition to overhauling the function documentation for clarity, we added a pkgdown site at https://amc-heme.github.io/SCUBA/. In the site, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets.

As suggested, we created a new generic, fetch_data, and moved the Pascal case FetchData methods added by SCUBA to this generic. The fetch_data method for Seurat is a wrapper for Seurat’s FetchData method, and the fetch_data methods for SingleCellExperiment and anndata are added by SCUBA. In addition to improving clarity, the fetch_data generic removes the need for a dependency on Seurat.

F1000Res. 2025 Jan 15. doi: 10.5256/f1000research.169727.r334007

Reviewer response for version 1

Kristian Ullrich 1

In this article the authors provide a valuable R package that implements a unified data access API for single-cell object formats like `anndata` objects (Python) and `Seurat/SingleCellExperiment` objects (R). The SCUBA API (implemented in R) keeps the original single-cell objects unmodified and uses three data access methods, namely `FetchData`, `fetch_metadata` and `fetch_reduction`. To more easily compare pre-processed data stored in common single-cell data formats, like `AnnData`, `SingleCellExperiment` (SCE) and `Seurat`, the authors fetch metadata from cells and genes from the original object and create a `data.frame` R object with the requested features. Using the `data.frame` object, downstream plotting functions are provided to create plots in the `ggplot2` R package grammar.

Code review

Given the dependency issues related to the base R version and Python version, I would encourage the authors to submit their package to either the CRAN or Bioconductor repository so that R package community standards are tested in a continuous integration setup. The authors should try to alter their code to pass at least the R-CMD-check without warnings and errors.

In the "Use cases" section, the authors claim that "Additional vignettes are provided on the SCUBA GitHub page", which is not true at the time of writing this review. Please add the mentioned vignettes to the GitHub page.

The provided R functions should contain runnable examples. Building a vignette would help present the basic functions of the R package to the community and provide more background information, if needed. I would suggest that the authors include a GitHub workflow to create the R package documentation and vignettes and convert them into a website, e.g. with the `pkgdown` package or a similar workflow.

Software

  • I was able to install the software and run the examples given on the main github page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments section, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository, and the corresponding example code block should also be removed from the github page.

  • The Rmarkdown files on the github page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets; however, the code base seems to be fine.

Major comments

  • If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting more gene features than the local resources can handle.

  • Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies that a high programming effort is needed to do so, which in my opinion is not true, given existing conversion tools like `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.

  • In your R DESCRIPTION file you list the `reticulate` and `anndata` R packages under Suggests; however, for your API to be fully functional these packages need to be imported. Please change accordingly.
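
The memory concern in the first major comment can be made concrete with a back-of-the-envelope estimate. The sketch below assumes every value in the returned table is stored as an 8-byte double and ignores per-column overhead; `dense_size_gb` is a hypothetical helper, and the cell and feature counts are illustrative only.

```r
# Rough size of a dense numeric table: each value occupies 8 bytes as a
# double, whether or not it is zero (overhead for names etc. is ignored).
dense_size_gb <- function(n_cells, n_features) {
  n_cells * n_features * 8 / 1e9
}

# Illustrative numbers: 500,000 cells and 1,000 requested features
estimate <- dense_size_gb(500000, 1000)
```

Under these assumptions the dense result occupies about 4 GB, whereas a sparse representation would store only the non-zero entries.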

Minor comments

  • Please change the DESCRIPTION and add the corresponding author/creators in the github repository.

  • Please alter the examples on your initial github page so that they work as expected. E.g. the `FetchData` function for Seurat objects is not pre-loaded; one needs to first load Seurat with `library(Seurat)`.

  • Please remove the following example part from your github pages, since the functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed and no longer work.

  • Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.

  • Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...", in my opinion this sentence is formulated too harshly, as it is possible e.g. to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.

  • Please provide a working example showing how a default user without admin privileges can set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), and how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation; however, showing a working code snippet would be beneficial for scientists not fluent in both programming languages.

  • Please remove "for all three object types" from Figure 5A "Common plotting script for all three object types", since the plotting function applies on the constructed intermediate `data.frame` object.

  • Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%` so that one would not need to pre-load the corresponding libraries.
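
The native pipe mentioned in the last comment is available in base R from version 4.1.0 onward and needs no attached package. A minimal sketch with a made-up table (the column names and values are hypothetical):

```r
# Small illustrative table; column names and values are hypothetical
plot_data <- data.frame(
  cluster = c("A", "A", "B", "B"),
  expression = c(1.5, 0.0, 2.0, 3.0)
)

# Native pipe (R >= 4.1.0): the left-hand side is passed as the first
# argument of the call on the right-hand side
n_expressing <- plot_data |>
  subset(expression > 0) |>
  nrow()
```

The same chain written with `%>%` would first require loading `magrittr` or a package that re-exports it.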

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Comparative Genomics, Bioinformatics, R Programming

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

  • 1. Comparison of visualization tools for single-cell RNAseq data. NAR Genom Bioinform. 2020;2(3):lqaa052. 10.1093/nargab/lqaa052
  • 2. scDIOR: single cell RNA-seq data IO software. BMC Bioinformatics. 2022;23(1):16. 10.1186/s12859-021-04528-3
F1000Res. 2025 May 23.
William Showers 1

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this article the authors provide a valuable R package that implements a unified data access API for single-cell object formats like `anndata` objects (Python) and `Seurat/SingleCellExperiment` objects (R). The SCUBA API (implemented in R) keeps the original single-cell objects unmodified and uses three data access methods, namely `FetchData`, `fetch_metadata` and `fetch_reduction`. To more easily compare pre-processed data stored in common single-cell data formats, like `AnnData`, `SingleCellExperiment` (SCE) and `Seurat`, the authors fetch metadata from cells and genes from the original object and create a `data.frame` R object with the requested features. Using the `data.frame` object, downstream plotting functions are provided to create plots in the `ggplot2` R package grammar.

Code review

Given the dependency issues related to the base R version and Python version, I would encourage the authors to submit their package to either the CRAN or Bioconductor repository so that R package community standards are tested in a continuous integration setup. The authors should try to alter their code to pass at least the R-CMD-check without warnings and errors.

Thank you for this valuable feedback. While we do not intend to submit this package to CRAN or Bioconductor, we recognize the value of continuous integration and have ensured that the code now passes the R-CMD-check without warnings or errors.

In the "Use cases" section, the authors claim that "Additional vignettes are provided on the SCUBA GitHub page", which is not true at the time of writing this review. Please add the mentioned vignettes to the GitHub page.

We have added a vignette with examples of all major SCUBA functions (see the comment below).

The provided R functions should contain runnable examples. Building a vignette would help present the basic functions of the R package to the community and provide more background information, if needed. I would suggest that the authors include a GitHub workflow to create the R package documentation and vignettes and convert them into a website, e.g. with the `pkgdown` package or a similar workflow.

We appreciate this feedback, and have improved our documentation to address these important observations. Specifically, we have added a pkgdown site at https://amc-heme.github.io/SCUBA/. Here, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets. The vignette shows usage examples for all three object types currently supported by SCUBA. In addition to adding the vignette, we updated the function documentation to improve clarity and provide usage examples.

Software

I was able to install the software and run the examples given on the main github page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments section, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository, and the corresponding example code block should also be removed from the github page.

Thanks for letting us know about this. We have removed reduction_names() and assay_names() from the documentation to avoid this confusion. The features_in_assay() function was added back to the package, and documentation on its usage is now on our pkgdown site, in the “User Guide” article.

The Rmarkdown files on the github page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets, however the code base seems to be fine.

We have double-checked the code in amc-heme/SCUBA_Manuscript internally for consistent outputs. Please let us know if you notice further issues with the code in the manuscript repo. 

Major comments

If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting more gene features than the local resources can handle.

Thank you for this observation. We don’t feel that it is necessary to prevent users from doing this, but we have added a warning to users (shown below) in this scenario. The warning appears when 1000 or more genes are requested.
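
A threshold check of this kind might look like the following sketch. `warn_if_many_features` is a hypothetical helper written for illustration; the 1000-feature threshold comes from the response above, the message text is abbreviated, and SCUBA's actual code may differ.

```r
# Hypothetical helper illustrating a feature-count warning; not SCUBA's
# actual code. The default threshold of 1000 follows the response above.
warn_if_many_features <- function(features, threshold = 1000) {
  n <- length(features)
  if (n >= threshold) {
    warning(
      "A very large number of features was requested (", n, " features). ",
      "Data is returned in a dense format, so the memory usage of the ",
      "output may be very large.",
      call. = FALSE
    )
  }
  # Return the count invisibly so the check can be used inline
  invisible(n)
}

# A short query passes silently
n_small <- warn_if_many_features(paste0("gene", 1:10))
```

Issuing a warning rather than an error leaves the decision with the user, which matches the stated preference not to block large queries outright.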

"A very large number of features was requested (<> features). fetch_data is not intended to be used with feature queries of this length. Data is returned in a dense format, so the memory usage of the output may be very large. Also, this query may take a while to complete."
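A check of this kind could be sketched in base R as follows. This is a minimal illustration, not SCUBA's actual implementation: the helper names `warn_if_many_features()` and `dense_size_gb()` are hypothetical, the warning text is abbreviated, and the size estimate assumes 8 bytes per value (double precision) with no overhead.

```r
# Hypothetical helper (not part of SCUBA): warn when a feature query is
# large enough that a dense data.frame may use a lot of memory.
warn_if_many_features <- function(features, threshold = 1000) {
  n <- length(features)
  if (n >= threshold) {
    warning(
      "A very large number of features was requested (", n, " features). ",
      "Data is returned in a dense format, so the memory usage of the ",
      "output may be very large.",
      call. = FALSE
    )
  }
  invisible(n)
}

# Rough size of a dense numeric table in gigabytes (10^9 bytes),
# assuming 8 bytes per value.
dense_size_gb <- function(n_cells, n_features) {
  n_cells * n_features * 8 / 1e9
}

# 800k cells (the size of the benchmarking dataset) x 1000 features
dense_size_gb(800000, 1000)
```

Under these assumptions, 1000 features across 800k cells come to about 6.4 GB as a dense table, which is the scenario the warning is meant to flag.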

Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies a high programmatic effort, which in my opinion is not the case given existing conversion tools like `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.

Thank you for pointing this out. The sentence was changed as follows in the updated manuscript:

"All major single-cell analysis packages implement conversion functions, and third-party packages such as sceasy, zellkonverter, and scDIOR are specifically designed for these conversions. Additionally, SeuratWrappers implements conversion functions that allow several otherwise-incompatible single-cell analysis packages to be used on Seurat objects."

Citations for the corresponding literature have also been added to the revised manuscript. 

In your R DESCRIPTION file you list the `reticulate` and `anndata` R packages under Suggests; however, for your API to be fully functional these packages need to be imported. Please change accordingly.

We appreciate this observation. While we prefer not to require users to install anndata and reticulate if they are not using anndata objects, we have added a conditional statement in all exported AnnDataR6 methods to check if these packages are installed, and throw an error directing users to install these packages if they are not. 
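The pattern described here, checking for suggested packages at call time, is commonly written with `requireNamespace()`. The sketch below is an assumption about how such a check might look; the helper name `check_suggested()` and the error wording are hypothetical and are not taken from the SCUBA source.

```r
# Hypothetical helper (not SCUBA's code): stop with an informative error
# if packages listed under Suggests are not installed.
check_suggested <- function(pkgs = c("anndata", "reticulate")) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!installed]
  if (length(absent) > 0) {
    stop(
      "The following package(s) are required for anndata objects but are ",
      "not installed: ", paste(absent, collapse = ", "),
      ". Please install them and try again.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A method for anndata objects would call check_suggested() first, e.g.:
# my_method.AnnDataR6 <- function(object, ...) { check_suggested(); ... }
```

This keeps `anndata` and `reticulate` optional for users who only work with Seurat or SingleCellExperiment objects, at the cost of deferring the failure from install time to the first call.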

Minor comments

Please change the DESCRIPTION and add the corresponding author/creators in the github repository.

This change was made.

Please alter the examples from your initial github page so that they work as expected. E.g. the `FetchData` function for Seurat objects is not pre-loaded; one needs to first import Seurat with `library(Seurat)`.

To keep users from having to import Seurat to use our FetchData methods, we moved them to a new generic defined within our package (fetch_data).
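The general idea of a package-owned generic with one method per object class can be illustrated with a toy S3 example. Everything below is hypothetical: `fetch_demo()` and the `toy_matrix` and `toy_list` classes are stand-ins chosen for illustration, and the real `fetch_data` generic and its methods may be implemented differently.

```r
# Toy generic defined locally, so no other package has to be attached.
fetch_demo <- function(object, vars, ...) {
  UseMethod("fetch_demo")
}

# Method for a toy container that stores features as matrix rows
fetch_demo.toy_matrix <- function(object, vars, ...) {
  as.data.frame(t(object$counts[vars, , drop = FALSE]))
}

# Method for a toy container that stores features as list elements
fetch_demo.toy_list <- function(object, vars, ...) {
  as.data.frame(object$features[vars])
}

# Fallback for unsupported classes
fetch_demo.default <- function(object, vars, ...) {
  stop("Unsupported object class: ", class(object)[1], call. = FALSE)
}

counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("GENE1", "GENE2"), c("c1", "c2", "c3"))
)
obj_a <- structure(list(counts = counts), class = "toy_matrix")
obj_b <- structure(
  list(features = list(GENE1 = c(1L, 3L, 5L), GENE2 = c(2L, 4L, 6L))),
  class = "toy_list"
)

# The same call works on both containers and returns a data.frame
fetch_demo(obj_a, "GENE1")
fetch_demo(obj_b, "GENE1")
```

Because the generic lives in the package itself, calling code does not need `library(Seurat)` for dispatch to work, which is the concern raised in the comment above.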

Please remove the following example part from your github pages, since the functions have been removed and are not working anymore `reduction_names()`, `assay_names()` and `features_in_assay()`.

All three functions have been removed from the README on our GitHub page. We added `features_in_assay()` back to SCUBA, and have added documentation for the function in our user guide vignette.

Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.

This has been corrected. All fetch_* functions and all functions that take an object as a parameter now mention all supported object types. The description text for the `object` parameter is now also consistent across all functions. 

Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...", in my opinion this sentence is formulated too harshly, as it is possible e.g. to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.

We have modified this sentence to the following: “Single-cell analysis requires programming experience for full customization of analysis and visualization, and different object formats make it even more difficult for biologists to analyze data”.

Please provide a working example, for a default user without admin privileges, of how to set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), and of how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation; however, showing a working code snippet would be beneficial for scientists not fluent in both programming languages.

Thanks for this suggestion. We have added this to the README, under the section “Additional Installation for anndata Objects”.

Please remove "for all three object types" from Figure 5A "Common plotting script for all three object types", since the plotting function applies on the constructed intermediate `data.frame` object.

This has been changed. The sentence now reads “Example script for visualizing expression of a gene by cluster in a density plot.”

Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%` so that one would not need to pre-load the corresponding libraries.

This change has been made. Please see figure 6 in the revised submission.
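For readers unfamiliar with the native pipe, a small base R illustration follows. It is not the code from figure 6; the data frame and column names are made up for the example. The `|>` operator requires R 4.1.0 or later and needs no additional packages.

```r
# Made-up expression table: one row per cell
expr <- data.frame(
  cluster = c("A", "A", "B", "B"),
  GENE1 = c(1, 3, 0, 2)
)

# Keep cells with non-zero expression, then average by cluster.
# x |> f(y) is evaluated as f(x, y), so no library() call is needed.
mean_by_cluster <- expr |>
  subset(GENE1 > 0) |>
  with(tapply(GENE1, cluster, mean))

mean_by_cluster
```

The same chain written with `%>%` would require attaching `magrittr` or a package that re-exports it.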

F1000Res. 2024 Dec 17. doi: 10.5256/f1000research.169727.r346283

Reviewer response for version 1

Huamei Li 1

Showers et al. developed the SCUBA R package as a unified API for single-cell data analysis. In single-cell and spatial transcriptomics studies, data format conversion between Seurat, SingleCellExperiment, and AnnData often presents challenges, leading to obstacles in downstream analyses. While several tools have been developed to address data format conversion issues, SCUBA provides a unified and user-friendly interface for handling multiple formats and simplifies single-cell data processing workflows, making it a promising tool for practical applications. However, several concerns regarding SCUBA require further clarification:

1) The README file available at https://github.com/amc-heme/SCUBA/tree/main does not include examples demonstrating how SCUBA reads Seurat, SingleCellExperiment, and AnnData formats from local files. Additionally, it is unclear whether SCUBA addresses compatibility issues arising from different versions of these formats, such as Seurat 4.0+ versus Seurat 5.0+. Can SCUBA facilitate seamless data format conversion and overcome challenges posed by version discrepancies?

2) While SCUBA provides application examples for single-cell datasets, it remains unclear whether the tool can be effectively extended to spatial transcriptomics datasets generated by different platforms. Clarification on its applicability to spatial datasets would be beneficial.

3) To enhance visualization capabilities, it is recommended that SCUBA include optional smoothing methods for displaying the expression distribution of feature genes. This would allow for better identification of regions with concentrated expression of specific features.

4) The SCUBA documentation could be further improved by providing more detailed explanations of the functions, their usage, and intended purposes. Additionally, including a comprehensive analysis example that integrates both single-cell and spatial transcriptomics data would help illustrate the full workflow and practical utility of SCUBA.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Bioinformatics, Immunogenetics, Single-cell and spatial technologies

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2025 May 23.
William Showers 1

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

Showers et al. developed the SCUBA R package as a unified API for single-cell data analysis. In single-cell and spatial transcriptomics studies, data format conversion between Seurat, SingleCellExperiment, and AnnData often presents challenges, leading to obstacles in downstream analyses. While several tools have been developed to address data format conversion issues, SCUBA provides a unified and user-friendly interface for handling multiple formats and simplifies single-cell data processing workflows, making it a promising tool for practical applications. However, several concerns regarding SCUBA require further clarification:

1) The README file available at https://github.com/amc-heme/SCUBA/tree/main does not include examples demonstrating how SCUBA reads Seurat, SingleCellExperiment, and AnnData formats from local files. Additionally, it is unclear whether SCUBA addresses compatibility issues arising from different versions of these formats, such as Seurat 4.0+ versus Seurat 5.0+. Can SCUBA facilitate seamless data format conversion and overcome challenges posed by version discrepancies?

Thank you for this observation. We have added new documentation via a pkgdown website (https://amc-heme.github.io/SCUBA/). The “User Guide” article contains a note on how each object class is loaded. SCUBA is seamlessly compatible with Seurat v4 and v5 objects. We have added a note on this to the README. 

2) While SCUBA provides application examples for single-cell datasets, it remains unclear whether the tool can be effectively extended to spatial transcriptomics datasets generated by different platforms. Clarification on its applicability to spatial datasets would be beneficial.

SCUBA partially supports spatial datasets. Aspects of spatial datasets that can be expressed as a counts matrix can be loaded into SCUBA, but we do not yet support the loading of image files. We have added a note on this to the README.

3) To enhance visualization capabilities, it is recommended that SCUBA include optional smoothing methods for displaying the expression distribution of feature genes. This would allow for better identification of regions with concentrated expression of specific features.

Thank you for the suggestion. While we agree that this is a valuable approach, we feel this is out of scope for the current tool, and welcome users to apply their own smoothing methods in plotting scripts based on SCUBA.

4) The SCUBA documentation could be further improved by providing more detailed explanations of the functions, their usage, and intended purposes. Additionally, including a comprehensive analysis example that integrates both single-cell and spatial transcriptomics data would help illustrate the full workflow and practical utility of SCUBA.

Thank you for the suggestion. We have added the suggested documentation to the README and the user guide vignette on the pkgdown website. We have also updated function documentation in SCUBA v.1.1.0.

Associated Data

    Data Citations

    1. Velten L, Triana S, Haas S, et al. : Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor.[Dataset]. Figshare. 2021. 10.6084/m9.figshare.13398065.v4 [DOI]
    2. Siletti K, et al. : Human Brain Cell Atlas v1.0.[Dataset]. CELLxGENE. 2023. Reference Source

    Data Availability Statement

    SCUBA uses two third-party datasets for performance benchmarking, testing, and demonstration in the manuscript. The datasets are described below.

    Figshare: Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor. https://doi.org/10.6084/m9.figshare.13398065.v4. 13

    This project contains the following underlying data:

    • 200AB_projected.rds. (Seurat object with 15281 cells, showing the expression of 197 surface markers and 462 mRNAs in bone marrow from a young healthy donor).

    The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

    CELLxGENE: Human Brain Cell Atlas v1.0. https://cellxgene.cziscience.com/collections/283d65eb-dd53-496d-adb7-7570c7caa443.

    This project contains the following underlying data:

    • cc9bfb86-96ed-4ecd-bcc9-464120fc8628.rds. (Seurat object with 800k non-neuronal cells used for performance benchmarking in the manuscript. The file is accessed by selecting “All non-neuronal cells” and then the .rds radio button).

    The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

    The Velten et al. dataset 13 was processed to yield a format suitable for testing and demonstration of SCUBA, downsampled, and stored in the inst/extdata/ and data/ directories of the SCUBA repo. Scripts used in these operations and performance benchmarking are available at the manuscript GitHub repo: https://github.com/amc-heme/SCUBA_Manuscript. Working examples of code shown in figures 2, 3, 5, and 6 are also stored in this repo.

    Software, up to date source code, and tutorials are available from: https://github.com/amc-heme/scuba

    Archived source code at time of publication: https://zenodo.org/doi/10.5281/zenodo.13776167

    License: MIT


    Articles from F1000Research are provided here courtesy of F1000 Research Ltd