ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Irzam Sarfraz; Muhammad Asif; Joshua D Campbell

doi:10.1093/bioinformatics/btab179

Bioinformatics. 2021 Mar 14;37(18):3058–3060. doi: 10.1093/bioinformatics/btab179

Editor: Anthony Mathelier

PMCID: PMC9940906 PMID: 33715007

Abstract

Motivation

R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to subset the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance.

Results

To overcome these challenges, we developed an R package called ExperimentSubset, which implements data container classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets.

Availability and implementation

The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

High-throughput sequencing experiments can be used to generate data on thousands of genomic features across many samples (Reuter et al., 2015). These assays have enabled computational researchers to easily perform analyses on this high-dimensional data and gain insights into many different biological processes (Haque et al., 2017). An important consideration is the selection of appropriate data structures that support the storage and manipulation of the high-throughput data throughout the analysis workflow. SummarizedExperiment (Tierny et al., 2008) is an R package that serves as an all-in-one structure for manipulation of gene expression data by supporting the storage and retrieval of multiple data assays as well as rowData (i.e. gene metadata) and colData (i.e. sample metadata). The SingleCellExperiment class (Lun and Risso, 2020) has additional features that allow for the storage of dimensionality reduction results and alternate experiments such as spike-in transcripts (Jiang et al., 2011) that have different dimensions from the original assays. When an experiment comprises observations from different data modalities with varying dimensions, MultiAssayExperiment (Ramos et al., 2017) offers integration of these observations.

However, some tasks in an analysis workflow often involve subsetting of the original data assay. For example, excluding poor-quality samples or selecting the most variable genes requires subsetting of the columns or rows of the assay, respectively (R Foundation for Statistical Computing, 2008; Yip et al., 2018). To perform this task, a new object can be created to store the desired subset of data. However, this will cause some of the data in the original parent object to be copied into the new subset, which will result in unnecessary utilization of memory. Furthermore, there will be no systematic provenance between the original object and the newly created subset, and therefore the users are fully responsible for keeping track of which subsets belong to which parent objects. Given the complexity and redundancy that can emerge from creating a collection of related objects from one another, there is a considerable need for a class that supports both storage and manipulation of subsets in a manner that maintains data provenance and limits storage redundancy.

The ExperimentSubset package overcomes this problem by offering data structures that act as drop-in replacements for common experiment classes. The ExperimentSubset classes allow for efficient subsetting, storage and retrieval of each subset’s data while keeping the features and methods of the parent experiment classes intact. Therefore, we propose the use of a single ExperimentSubset object throughout standard analysis workflows for genomic data, as all parent and subset assays can be manipulated from within the same object.

2 Materials and methods

The ExperimentSubset package contains several S4 classes written in R (R Foundation for Statistical Computing, 2008), each of which inherits from an ‘Experiment’ class while adding support for manipulation of subsets of data (Fig. 1a). More specifically, the ExperimentSubset package contains five classes called SubsetSummarizedExperiment, SubsetRangedSummarizedExperiment, SubsetSingleCellExperiment, SubsetTreeSummarizedExperiment and SubsetSpatialExperiment, which extend the SummarizedExperiment, RangedSummarizedExperiment, SingleCellExperiment, TreeSummarizedExperiment (Huang et al., 2021) and SpatialExperiment (Righelli et al., 2021) classes, respectively. These Subset classes can be used as drop-in replacements in functions that utilize the parent classes. For convenience, we include a constructor that creates an object of the appropriate ExperimentSubset class based on the class of the input parent object.

An internal class called AssaySubset is used to store the information for each subset. This class includes slots for storing indices that indicate which rows and columns from the parent object are to be used in the subset as well as a slot for the subset’s own ‘Experiment’ object. The primary memory-saving feature of this approach is that subsets are initialized by using indices to point to the rows and columns of the parent assay without copying any data from the parent assay. The ‘Experiment’ object in the AssaySubset inherits the class of the parent object and can be used to store subset-specific data such as assays, colData and rowData. For example, a user may want to create a subset of raw data based on a set of filtering criteria and then normalize only that subset of the data. In this case, the normalized data will be stored in the ‘Experiment’ object within the AssaySubset slot rather than in the main assay slot of the parent object.

Importantly, subsets can serve as parent objects to other subsets. Both assays from the parent objects and assays from the subset objects can be the parent of another subset, which ultimately enables a hierarchical structure of subsets (Fig. 1a). Finally, as the ExperimentSubset object inherits from the Experiment class of the input object, all methods available for that class can be applied to the data of the initial parent object. Several methods such as rowData and colData have also been extended to allow for extraction of these features directly from the subset’s Experiment data.
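The index-based design described above can be illustrated with a small base R sketch. It is not the package's implementation: the toy list and the functions addSubset, getData and lineage are hypothetical names introduced here, and the real AssaySubset class additionally holds its own ‘Experiment’ object. The sketch also assumes that the indices of a nested subset are expressed relative to its immediate parent.

```r
# Toy parent "assay": 6 genes x 5 cells, with dimnames so indices are traceable.
set.seed(1)
counts <- matrix(rpois(30, lambda = 5), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("cell", 1:5)))

# Hypothetical container: a list of root assays plus a list of subset records.
toy <- list(assays = list(counts = counts), subsets = list())

# Register a subset by storing only indices and the parent's name (no data copied).
addSubset <- function(obj, name, rows, cols, parent) {
  obj$subsets[[name]] <- list(rows = rows, cols = cols, parent = parent)
  obj
}

# Resolve a name to a matrix: root assays are returned directly; subsets are
# materialized on demand by indexing into their (recursively resolved) parent.
getData <- function(obj, name) {
  if (name %in% names(obj$assays)) return(obj$assays[[name]])
  s <- obj$subsets[[name]]
  if (is.null(s)) stop("unknown assay or subset: ", name)
  getData(obj, s$parent)[s$rows, s$cols, drop = FALSE]
}

# Walk the parent pointers to report provenance, from the subset back to the root.
lineage <- function(obj, name) {
  path <- name
  while (!is.null(obj$subsets[[name]])) {
    name <- obj$subsets[[name]]$parent
    path <- c(path, name)
  }
  path
}

# A subset of cells, then a subset of genes whose parent is the first subset.
toy <- addSubset(toy, "filteredCells", rows = 1:6, cols = c(1, 3, 5), parent = "counts")
toy <- addSubset(toy, "topGenes", rows = c(2, 4), cols = 1:3, parent = "filteredCells")

topGenes <- getData(toy, "topGenes")
print(dim(topGenes))
print(lineage(toy, "topGenes"))
```

Because each record keeps the name of its parent, the chain from any subset back to the original assay can be recovered from the object itself, which is the provenance property the text describes.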

The primary methods (Fig. 1b) available with the ExperimentSubset class for the manipulation of subsets include:

A constructor that facilitates the creation of an ExperimentSubset object from a count matrix or directly from any experiment object inherited from the SummarizedExperiment class, and supports the creation of a subset directly from within this constructor.
A method to create additional subsets by indicating the rows and columns to include in the subset and a pointer to the parent assay.
A method that accepts a matrix-like object and stores it as an additional assay within a subset.
Methods to store and retrieve rowData, colData, altExps, metadata and reducedDims for a subset.
Support for common methods inherited from SingleCellExperiment and SummarizedExperiment to allow storage and retrieval of assays, subsets and subset assays for easier usage.
Subset support for methods of different ‘Experiment’ classes.
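How an inherited class can act as a drop-in replacement while extending an accessor with a subset argument can be sketched with plain S4 code. Everything below is a hypothetical toy (ToyExperiment, ToySubsetExperiment, toyColData and countSamples are names invented for this illustration) and it does not reproduce the package's classes or generics; in particular, the toy generic declares subsetName itself, which is an assumption made to keep the example short.

```r
library(methods)

# Hypothetical stand-in for a parent 'Experiment' class.
setClass("ToyExperiment", representation(colData = "data.frame"))

# Hypothetical subset-aware class: inherits the parent and adds per-subset colData.
setClass("ToySubsetExperiment", contains = "ToyExperiment",
         representation(subsetColData = "list"))

# Toy accessor; dispatch happens on x only.
setGeneric("toyColData",
           function(x, subsetName = NULL) standardGeneric("toyColData"),
           signature = "x")

# Parent method: returns the object's column metadata and ignores subsetName.
setMethod("toyColData", "ToyExperiment",
          function(x, subsetName = NULL) x@colData)

# Extended method: with subsetName, return that subset's metadata;
# without it, defer to the inherited parent method.
setMethod("toyColData", "ToySubsetExperiment",
          function(x, subsetName = NULL) {
            if (is.null(subsetName)) return(callNextMethod())
            x@subsetColData[[subsetName]]
          })

# A function written against the parent class only.
countSamples <- function(experiment) nrow(toyColData(experiment))

parent <- new("ToyExperiment", colData = data.frame(sample = c("a", "b", "c")))
es <- new("ToySubsetExperiment", parent,
          subsetColData = list(kept = data.frame(sample = c("a", "c"),
                                                 cluster = c(1, 2))))

print(countSamples(parent))                  # works on the parent class
print(countSamples(es))                      # and unchanged on the subclass
print(toyColData(es, subsetName = "kept"))   # subset-specific metadata
```

Code that only knows about the parent class keeps working on the subclass, while subset-aware callers opt in by naming a subset.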

3 Results

To illustrate the benefits of the ExperimentSubset classes, we apply an analysis workflow to a single-cell RNA-seq dataset containing over 4000 cells (PBMC4K) (Hansen et al., 2020):

library(ExperimentSubset)
library(TENxPBMCData)
library(scater)
library(scran)

tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")

A call to the constructor converts the PBMC4K SingleCellExperiment object into an ExperimentSubset object:

es <- ExperimentSubset(tenx_pbmc4k)

We create a new subset by excluding the cells with fewer than 1500 counts. We then normalize this subset and create another subset by selecting the genes with the highest variability:

# compute QC metrics, store them in colData and filter cells
# (the result is named qcMetrics so that it does not mask the perCellQCMetrics function)
qcMetrics <- perCellQCMetrics(assay(es, "counts"))
colData(es) <- cbind(colData(es), qcMetrics)
filteredCellsIndices <- which(colData(es)$sum > 1500)

# create a subset containing the filtered cells
es <- createSubset(es, "filteredCells", cols = filteredCellsIndices, parentAssay = "counts")

# normalize the subset and store the result in the subset's own assay slot
assay(es, "filteredCells", subsetAssayName = "filteredCellsNormalized") <- normalizeCounts(assay(es, "filteredCells"))

# create a new subset by including only the top 1000 highly variable genes
topHVG1000 <- getTopHVGs(modelGeneVar(assay(es, "filteredCellsNormalized")), n = 1000)
es <- createSubset(es, "hvg1000", rows = topHVG1000, parentAssay = "filteredCellsNormalized")

For the last analysis, we apply PCA to the highly variable genes subset, identify clusters from the computed principal components, and store all of the results back into the same ExperimentSubset object:

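# run PCA on the highly variable genes subset and store it with that subset
reducedDim(es, type = "PCA", subsetName = "hvg1000") <- calculatePCA(assay(es, "hvg1000"))

# cluster cells on the principal components and add the labels to the subset's colData
clusterPC <- kmeans(reducedDim(es, "PCA", subsetName = "hvg1000"), 5)$cluster
colData(es, subsetName = "hvg1000") <- cbind(colData(es, subsetName = "hvg1000"), clusterPC)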

The relationships between different subsets can be viewed using the subsetSummary function:

subsetSummary(es)

As evident from this workflow, we source from and store back all these computations into the same object, regardless of the computation type or the dimensions of the subset. When examining memory usage, the ExperimentSubset object reduced the overall footprint by 51% compared to the approach of using a separate SingleCellExperiment object for each subset, which requires more objects (three, in contrast to the single object in our approach) and results in a higher memory footprint (Fig. 1c; Supplementary Data). Similar results were observed when running the same workflow on the pbmc8k and pbmc33k datasets (Hansen et al., 2020).
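The source of the saving can be illustrated in base R by comparing the size of a copied submatrix with the size of the indices that define it. This is a minimal sketch on simulated counts, not a reproduction of the measurement above: the matrix dimensions, the median-based filter and the pointer list are assumptions made for illustration, and a real workflow still has to store subset-specific assays such as normalized values.

```r
set.seed(42)

# Toy dense count matrix: 2000 genes x 1000 cells (far smaller than PBMC4K).
counts <- matrix(rpois(2000 * 1000, lambda = 2), nrow = 2000)

# Keep the cells whose total count exceeds the median (an arbitrary toy filter).
totals <- colSums(counts)
keptCells <- which(totals > median(totals))

# Approach 1: materialize the subset as a second matrix.
copied <- counts[, keptCells, drop = FALSE]

# Approach 2: record only the column indices that define the subset.
pointer <- list(cols = keptCells, parent = "counts")

sizeCopy <- as.numeric(utils::object.size(copied))
sizePointer <- as.numeric(utils::object.size(pointer))
cat("copied subset:", sizeCopy, "bytes\n")
cat("index pointer:", sizePointer, "bytes\n")

# The pointer is enough to rebuild the same data from the parent when needed.
rebuilt <- counts[, pointer$cols, drop = FALSE]
```

The copied matrix grows with the number of genes times the number of retained cells, whereas the pointer grows only with the number of retained cells.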

4 Discussion

Numerous pipelines and toolkits that facilitate the analysis of genomics data must utilize a data structure to allow access to the data for manipulation and support the consequent storage of the transformed data back into the container for further computations. In many cases, the transformed data does not match the dimensions of the input data, and therefore the data structures cannot store results back in the original object due to the design restrictions imposed by the corresponding classes of these objects. The creation of multiple objects to store subsets of data will result in code complexity, data redundancy and a larger memory footprint. While the MultiAssayExperiment package allows storage of assays with varying dimensions, it is more oriented toward integration of assays from different sources than toward subsets from a single experiment. Therefore, we propose the use of the ExperimentSubset object in complex workflows, as it offers all major features from widely used ‘Experiment’ classes along with efficient storage and retrieval of subsets.

Availability

The ExperimentSubset class is available as an R package on Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Funding

This work was funded by the National Library of Medicine (NLM) [R01LM013154-01 to J.D.C.] and the Informatics Technology for Cancer Research (ITCR) [1U01 CA220413-01 to J.D.C.].

Conflict of Interest: none declared.

Contributor Information

Irzam Sarfraz, Department of Computer Science, National Textile University, Faisalabad 37610, Pakistan.

Muhammad Asif, Department of Computer Science, National Textile University, Faisalabad 37610, Pakistan.

Joshua D Campbell, Department of Medicine, Boston University School of Medicine, Boston, MA 02118, USA.

References

Hansen K.D. et al. (2020) TENxPBMCData: PBMC data from 10X Genomics. R package version 1.8.0. https://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html.
Haque A. et al. (2017) A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med., 9, 1–12.
Huang R. et al. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research, 9, 1246.
Jiang L. et al. (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Res., 21, 1543–1551.
Lun A., Risso D. (2020) SingleCellExperiment: S4 Classes for Single Cell Data. http://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html (31 August 2020, date last accessed).
R Foundation for Statistical Computing. (2008) R: A language and environment for statistical computing. https://www.r-project.org/ (31 August 2020, date last accessed).
Ramos M. et al. (2017) Software for the integration of multiomics experiments in bioconductor. Cancer Res., 77, e39–e42.
Reuter J.A. et al. (2015) High-throughput sequencing technologies. Mol. Cell, 58, 586–597.
Righelli D. et al. (2021) SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R using Bioconductor. https://www.biorxiv.org/content/10.1101/2021.01.27.428431v1.
Tierny J. et al. (2008) SummarizedExperiment. Vis. Comput., 24, 155–172.
Yip S.H. et al. (2018) Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief. Bioinform., 20, 1583–1589.
