Author manuscript; available in PMC: 2021 Apr 14.
Published in final edited form as: Methods Mol Biol. 2019;1989:245–265. doi: 10.1007/978-1-4939-9454-0_16

Data-Driven Flow Cytometry Analysis

Sherrie Wang 1, Ryan R Brinkman 1
PMCID: PMC8043852  NIHMSID: NIHMS1687143  PMID: 31077110

Abstract

The emergence of flow and mass cytometry technologies capable of generating 40-dimensional data has spurred research into automated methodologies that address bottlenecks across the entire analysis process from quality checking, data transformation, and cell population identification, to biomarker identification and visualizations. We review these approaches in the context of the stepwise progression through the different steps, including normalization, automated gating, outlier detection, and graphical presentation of results.

Keywords: Flow cytometry, Data analysis, Bioinformatics

1. R/Bioconductor

More than 50 approaches to automate flow cytometry (FCM) data analysis are available (Table 1). The overwhelming majority have been developed and released as freely available, open-source tools using the R programming language [1]. These tools have been developed for high-throughput workflows and are generally not amenable to manual interaction with individual files through a graphical user interface during the analysis process. However, they can be integrated into commercial tools familiar to users, facilitating adoption. For example, the flowWorkspace package can export automated gating results in a format readable by FlowJo (FlowJo Inc., Ashland OR). Many of the approaches have been released through the Bioconductor repository, which enforces strict requirements on cross-platform compatibility and functional documentation. Algorithms for data analysis are provided as packages that generally address a single step in the analysis pipeline, with interoperability enforced through Bioconductor. This allows users to substitute new approaches to the same challenge as the field advances, an advantage over monolithic tools that attempt to solve one or even several problems in isolation.

Table 1.

Bioinformatic tools for high-throughput data analysis

| Package name | Use | References | Technical notes |
| --- | --- | --- | --- |
| Preprocessing | | | |
| FCSTrans | Manipulate file formats | PMC3932304 | R package for FCS to .txt conversion |
| fdaNorm | Adjust data to account for batch effects | PMC2648208 | R/Bioconductor software to adjust data to account for batch effects like laser drift |
| gaussNorm | Adjust data to account for batch effects | PMC3648208 | R/Bioconductor software to adjust data to account for batch effects like laser drift |
| flow | Variance stabilization | http://www.bioconductor.org/ | R/Bioconductor package that removes mean variance correlations from cell populations |
| flowCore | Read/write and process (transform, compensate) flow data. The basic flow infrastructure | PMC2684747 | R/Bioconductor core infrastructure for representing cell populations and parent/child relationships among them |
| flowBeads | Automated analysis of bead data | http://www.bioconductor.org/ | R/Bioconductor package that provides gating and normalization specific to bead data |
| flowBin | Combining multitube flow cytometry data by binning | http://www.bioconductor.org/ | R/Bioconductor package that combines flow cytometry data multiplexed into tubes by common markers |
| CATALYST | Pipeline for preprocessing of mass cytometry data | http://www.bioconductor.org/ | R/Bioconductor package that includes normalization, single-cell deconvolution, and compensation for mass cytometry data |
| MetaCyto | Pipeline for analyzing cytometry data | http://www.bioconductor.org/ | R/Bioconductor package that provides preprocessing, automated gating, and meta-analysis of cytometry data |
| flowQB | Quality control of cytometer sensitivity | http://www.bioconductor.org/ | Automatically calculates detector efficiency (Q), optical background (B), and intrinsic CV of the beads |
| flowStats | Advanced statistical methods and functions, specialized and general gating algorithms | http://www.bioconductor.org/ | R/Bioconductor software that collects several algorithms together for normalization and gating |
| flowUtils | Import gates, transformation and compensation | http://www.bioconductor.org/ | R/Bioconductor package to support the Gating-ML specification to exchange gate coordinates between software |
| flowTrans | Estimate parameters for data transformation | PMC3243046 | R/Bioconductor infrastructure to optimize parameter choice for different transformations |
| flowWorkspace | Import manually gated data from FlowJo workspaces, represent manual and automated gating hierarchies efficiently | PMC3992339 | R/Bioconductor core infrastructure that makes manually gated data accessible to Bioconductor's computational flow tools by importing pre-processed and gated data from FlowJo |
| ncdfFlow | Advanced method for large dataset processing | PMC3992339 | R/Bioconductor package that overcomes memory limitations when working with large datasets by storing FCS data in netCDF files on disk |
| plateCore | Analyze multiple plates | PMC2777006 | R/Bioconductor package that enables automated negative control-based gating and plate-based analysis |
| flowAI | Identify outlier events | http://www.bioconductor.org/ | R package that removes spurious events based on time vs. fluorescence |
| flowClean | Identify outlier events | http://www.bioconductor.org/ | R package that removes spurious events based on time vs. fluorescence |
| flowQ | Identify outlier samples (e.g., wells drying out, reagent issues) | PMC2768034 | R/Bioconductor package that provides infrastructure to generate an interactive HTML quality report |
| QUALIFIER | Identify outlier samples | PMC3499158 | R/Bioconductor software that uses manual gates to perform an extensive series of statistical quality assessment checks on gated cell subpopulations |
| Automated gating | | | |
| Unsupervised | | | |
| ACCENSE | Unsupervised cell population identification | PMC3890841 | R/Matlab software for dimensionality reduction with density-based partitioning |
| FLOCK | Unsupervised cell population identification | PMC3084630 | Stand-alone software for clustering using an adaptive multi-dimensional mesh to estimate local density followed by hierarchical merging of adjacent regions based on density differentials |
| FlowSOM | Unsupervised cell population identification | PMID: 25573116 | R/Bioconductor software that uses self-organizing maps |
| flowClust | Unsupervised cell population identification | PMC2701419 | R/Bioconductor software for clustering using a t-mixture model with Box-Cox transformation with support for Bayesian priors |
| flowFP | Unsupervised cell population identification | PMC2777013 | R/Bioconductor software for fingerprint generation via multivariate probability distribution |
| flowMeans | Unsupervised cell population identification | PMC21182178 | R/Bioconductor software for k-means clustering and merging using the F-R statistics |
| flowMerge | Unsupervised cell population identification | PMC2798116 | R/Bioconductor software that combines flowClust and entropy-based or Mahalanobis distance-based cluster merging |
| flowPeaks | Unsupervised cell population identification | PMC3400953 | R software for unsupervised clustering using k-means and mixture model |
| flowType | Unsupervised cell population identification | PMC3998128 | R/Bioconductor software for combinatorial gating of high-dimensional populations and correlative analysis against clinical outcomes |
| SPADE | Unsupervised cell population identification | PMC3196363 | Matlab/standalone/R/Bioconductor tool for density-based sampling, k-means clustering, and minimum spanning trees |
| SWIFT | Unsupervised gating for rare cell populations | PMID: 24677621 | Iterative weighted sampling procedure with splitting and merging to retain discrimination of extremely small subpopulations |
| X-shift | Unsupervised cell population identification | PMC4896314 | Uses fast KNN estimation of cell event density and automatically arranges populations by marker-based classification systems |
| flowMatch | Cell population matching | PMC3471348 | R/Bioconductor software to match clusters across samples for producing robust meta-clusters |
| flowMap-FR | Cell population matching | PMC5014134 | R/Bioconductor software to match cell population clusters across samples using the F-R statistics |
| NetFCM | Semiautomated web-based method for flow cytometry data analysis | PMID: 25044796 | Semiautomatic gating strategy that uses clustering and principal component analysis (PCA) together with other statistical methods to mimic manual gating approaches |
| Supervised | | | |
| flowDensity | Supervised cell population identification | PMID: 25378466 | R/Bioconductor software for supervised gating to match manual analysis for clinical trials and diagnosis |
| OpenCyto | General framework to construct reproducible automated gating pipelines and simplify data processing | http://opencyto.org/ | R/Bioconductor infrastructure for hierarchical automated gating that maintains relationships among cell populations |
| X-cyt | Supervised cell population identification | PMC3839720 | R script that partitions each sample with initialization by user template, then optimizes the parameters via expectation-maximization |
| SamSPECTRAL | Supervised cell population identification | PMC2923634 | R/Bioconductor software for efficient spectral clustering using density-based down-sampling |
| Data analysis | | | |
| Citrus | Identify most important cell populations correlated with outcome of interest | PMID: 2497804 | Regularized supervised learning algorithms to identify stratifying clusters that are the best predictors of a known experimental endpoint of interest |
| COMPASS | Identify combinatorial subsets of polyfunctional T cells | http://rglab.github.io/COMPASS/ | R software that is a multivariate extension of MIMOSA and jointly models all combinatorial polyfunctional cell subsets |
| MIMOSA | Identify responders and nonresponders to stimulation in intracellular cytokine staining assay data | PMC3862207 | R/Bioconductor software to detect antigen-specific changes in marginal or specific cell subsets |
| Biomarker discovery | | | |
| flowType | Unsupervised cell population identification | PMC3998128 | R/Bioconductor software for combinatorial gating of high-dimensional populations and correlative analysis against clinical outcomes |
| RchyOptimyx | Identify most important cell populations correlated with outcome of interest | PMC3988128 | R/Bioconductor software that optimizes cellular hierarchies to preserve correlation with external variables and summarizes large data sets in simple plots |
| Visualization and post-processing | | | |
| flowViz | Visualization (e.g., histograms, dot plots, density plots, gating hierarchies and layouts) | PMC2768483 | R/Bioconductor software that employs trellis graphics and can be adapted to provide useful visualizations |
| flowPlots | Graphical displays with statistical tests for gated ICS flow cytometry data | http://www.bioconductor.org | R/Bioconductor software that provides analysis plots and a data class for gated flow cytometry data |

Core infrastructure widely used by other packages is provided by the flowCore R/Bioconductor package [2], which implements a computationally efficient data structure for reading and saving flow cytometry (FCM) data and provides systematic FCS file parsing. This in turn encourages the development of new algorithms and the use of combinations of tools in complex workflows [2]. It also includes a range of methods for data processing, including compensation, transformation, and gating.

When hundreds of FCS data files are generated by high-throughput instruments, processing can become a challenge because the memory limit can be reached when reading in all the data. ncdfFlow is a Bioconductor package designed to overcome memory limitations when working with large datasets by accessing FCS data on disk in a way that circumvents having to read the entire file into memory [3].

2. Data Acquisition and Quality Assessment

Technical issues resulting from instrumental or procedural variations during acquisition can bias the statistics of the obtained cell subpopulations and can impact the quality of the cytometry data and the subsequent analysis results. Clogs can result in abrupt changes in fluorescence when the data are analyzed in the time domain. Other issues such as unstable data acquisition can result in a shift in the means of the populations analyzed, which can pose challenges for gating. These data should be identified or potentially removed by the user, either automatically or manually, before being passed to the downstream analysis. Data quality assessment aims to detect and, as appropriate, flag for review or remove such abnormalities in measurements, which likely do not have any underlying biological cause and are thus likely caused by technical errors.

QUAliFiER is a tool that can be used for quality assessment on manually gated data [4]. The typical workflow for QUAliFiER is: importing data, extracting cell population statistics, defining quality assessment tasks, performing outlier calling, and generating a quality assessment report. QUAliFiER uses grouping and conditioning variables defined in the associated study metadata to apply filters to the population statistics and carry out outlier detection.

The flowQ package is designed for quality assessment of pregated data. It uses a generic framework that accommodates the different quality assessment criteria of different experimental setups [5]. flowAI and flowClean are the two currently available algorithms that remove outlier events within a sample. flowAI detects and removes outlier events based on analysis of the flow rate, signal stability, and dynamic range of FCM instruments. flowClean analyzes frequency changes within a sample during acquisition for outlier detection. Running flowAI and flowClean on data from Flow Repository (repository ID: FR-FCM-ZZGS) shows that flowAI removes the spurious events in substantially less time but at the cost of removing a large number of cell events, whereas flowClean removes fewer outlier events but with a relatively longer runtime (Fig. 1). The runtime for flowClean is 5 min for a 56 Mb file, while flowAI requires 15 s. Users often need to clean thousands of files at a time, so the computational cost per file should be kept to a minimum to ensure efficient use of resources and time overall.
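The general idea of screening events by time versus fluorescence can be sketched as follows. This is a minimal illustration in Python rather than R, and it is not the flowAI or flowClean algorithm: the function name `flag_unstable_bins`, the fixed number of time bins, and the median-absolute-deviation cutoff are all assumptions chosen for the example.

```python
from statistics import median

def flag_unstable_bins(times, values, n_bins=10, n_mads=3.0):
    """Return indices of time bins whose median signal is an outlier.

    Hypothetical helper for illustration only; it is not the flowAI or
    flowClean algorithm.
    """
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / n_bins or 1.0  # avoid zero width if all times are equal
    bins = [[] for _ in range(n_bins)]
    for t, v in zip(times, values):
        idx = min(int((t - t_min) / width), n_bins - 1)
        bins[idx].append(v)
    # Median signal per non-empty time bin.
    bin_medians = {i: median(b) for i, b in enumerate(bins) if b}
    centre = median(bin_medians.values())
    mad = median(abs(m - centre) for m in bin_medians.values())
    scale = mad if mad > 0 else 1e-9  # guard against a zero MAD
    return [i for i, m in bin_medians.items() if abs(m - centre) / scale > n_mads]

# Synthetic acquisition: stable signal near 100, with a drop between t=400 and t=499.
times = list(range(1000))
values = [(40 if 400 <= t < 500 else 100) + (t % 7) - 3 for t in times]
flagged = flag_unstable_bins(times, values)
print(flagged)  # [4]
```

Events falling in a flagged bin would then be reviewed or removed before gating. A real tool would also have to consider flow rate and dynamic range, which this sketch ignores.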

Fig. 1.

flowAI and flowClean running on default parameters. The black dots indicate the cells being removed. The right column shows data after outlier removal

flowQB [6] is another R/Bioconductor package that provides quality control, but at the instrument level. It automatically calculates a flow cytometer's detection efficiency (Q) and background illumination (B) to estimate the precision of measurements at different signal levels. flowQB provides a means of comparing different instruments and different channels on an instrument.

3. Data Transformation and Normalization

Data needs to be properly compensated, transformed, and normalized to ensure the accuracy of any subsequent gating analysis. Compensation is necessary to correctly account for the contribution of each fluorochrome to each channel in conditions of spectral overlap. The flowCore and flowUtils packages have compensation and transformation functions for data preprocessing. The often-used transformation methods that handle negative values and display normally distributed cell types are logicle, hyperlog, and arcsinh (inverse hyperbolic sine) [7]. On occasion, users require normalization of the data to eliminate technical variance that makes matching populations across samples difficult. gaussNorm and fdaNorm are two methods developed for this purpose and are integrated in the flowStats Bioconductor package [8]. Both methods aim to normalize single-channel fluorescence data by finding and matching a common feature across samples. The common feature is assumed to be well aligned under ideal conditions and can thus be used as a reference to determine and remove technical variation by minimizing differences between the features across samples. A new fdaNorm algorithm [9] focuses on local normalization of specific cell subsets exhibiting variability and improves peak alignment and performance.
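To make compensation and transformation concrete, the sketch below works through a two-channel example in Python rather than R. It assumes the common convention that the observed signal equals the true signal multiplied by a spillover matrix (rows are fluorochromes, columns are detectors), and it uses the inverse hyperbolic sine (arcsinh) transform with a cofactor. The function names, the spillover values, and the cofactor of 150 are illustrative assumptions, not values taken from flowCore or flowUtils.

```python
import math

def compensate_2ch(observed, spill):
    """Recover true signals from observed ones for two channels.

    Assumes observed = true x spill, so true = observed x inverse(spill).
    Hypothetical helper; real data would use the instrument's spillover matrix.
    """
    (a, b), (c, d) = spill
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    x, y = observed
    return (x * inv[0][0] + y * inv[1][0], x * inv[0][1] + y * inv[1][1])

def arcsinh_transform(x, cofactor=150.0):
    """Inverse hyperbolic sine transform; defined for negative values too."""
    return math.asinh(x / cofactor)

# Fluorochrome 1 spills 10% into detector 2; fluorochrome 2 spills 20% into detector 1.
spill = ((1.0, 0.1), (0.2, 1.0))
observed = (1100.0, 600.0)  # true signals (1000, 500) after spillover
true_1, true_2 = compensate_2ch(observed, spill)
print(round(true_1, 6), round(true_2, 6))  # 1000.0 500.0
print(arcsinh_transform(0.0))              # 0.0
```

The transform is approximately linear near zero and approximately logarithmic for large values, which is why it can display compensated data that contain negative values.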

DeepCyTOF [10], an algorithm for semiautomated gating of cell samples, uses deep learning with domain adaptation. It can also be used to overcome strong batch effects by calibrating target samples to a reference sample.

4. Automated Gating

The most time consuming and subjective component of FCM data analysis is cell population identification and most of the efforts of the computational biology community have been focused on developing algorithms for automated gating [11, 12]. FCM data has unique requirements for computational efficiency, robustness with respect to different antigen/marker expression patterns, ability to determine true population number, and the ability to detect and handle outliers [13]. To date over 20 tools have been developed for automated gating (Table 1), and these can generally be divided into unsupervised and supervised approaches.

4.1. Unsupervised Gating

Clustering methods detect cell populations that display similar biomarker expression in high-dimensional space and group them together. These algorithms generally do not require user input, but often allow the specification of some parameters, such as the expected number of cell populations, to tune the results to a desired outcome. Unsupervised methods allow the discovery of unknown cell populations in high-dimensional data from large datasets, which is not possible with manual analysis due to the exponentially increasing number of possible combinations of markers that must be investigated. Many approaches have extended advancements from other data types, including k-means, random forests, self-organizing maps, and spectral clustering. Some tools have been developed for specific use cases. For example, SWIFT is designed to gate on a large number of small clusters, and thereby can effectively separate rare populations [14], but has poorer performance on larger cell populations [15].
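As an illustration of the simplest of these ideas, the following Python sketch runs plain k-means (Lloyd's algorithm) on a handful of two-marker events. It is not the implementation used by any package in Table 1; the function names, the toy data, and the choice of starting centres are assumptions made for the example. Note that the number of populations is fixed by the number of starting centres, which is the kind of user-specified parameter mentioned above.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two events."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of events."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def nearest(p, centres):
    """Index of the centre closest to event p."""
    return min(range(len(centres)), key=lambda k: sq_dist(p, centres[k]))

def kmeans(points, centres, n_iter=20):
    """Plain k-means: assign events to the nearest centre, then update centres."""
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for p in points:
            clusters[nearest(p, centres)].append(p)
        # Keep the old centre if a cluster ends up empty.
        centres = [mean_point(c) if c else centres[k] for k, c in enumerate(clusters)]
    return [nearest(p, centres) for p in points], centres

# Toy events measured on two (transformed) markers: a dim and a bright population.
events = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
labels, centres = kmeans(events, centres=[events[0], events[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```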

4.2. Supervised Gating

Supervised approaches are beneficial when users have some prior experimental expectations, for example, if a user wants to replicate an existing manual process to robustly target cell populations of interest in a specific way. This is useful for validating novel biomarkers discovered through high-dimensional cell discovery approaches. flowDensity [16] and OpenCyto [17] facilitate high reproducibility by automating the manual gating process. While flowDensity can process data in an unsupervised manner, the approach is designed to use customized one-dimensional density thresholds for each cell population to mimic an expert's hierarchical gating order. However, unlike manual gating, where the placement of gate boundaries is inherently subjective, thresholds are adjusted in a data-dependent manner for each sample (Fig. 2) [16].
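A data-dependent one-dimensional threshold can be sketched as follows. This Python example simply places the cutoff at the lowest point of a histogram between its two tallest peaks; it is an assumed simplification for illustration and not the flowDensity algorithm. The function name `valley_threshold`, the bin count, and the synthetic data are hypothetical.

```python
def valley_threshold(values, n_bins=10):
    """Place a threshold at the histogram valley between the two tallest peaks.

    Illustrative only; returns None if fewer than two peaks are found.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return None
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    # A peak is a non-empty bin at least as tall as both of its neighbours.
    peaks = [i for i in range(n_bins)
             if counts[i] > 0
             and (i == 0 or counts[i] >= counts[i - 1])
             and (i == n_bins - 1 or counts[i] >= counts[i + 1])]
    if len(peaks) < 2:
        return None
    left, right = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    valley = min(range(left, right + 1), key=lambda i: counts[i])
    return lo + (valley + 0.5) * width  # centre of the valley bin

# Synthetic marker intensities: a negative peak near 2.5 and a positive peak near 7.5.
events_per_value = {0.5: 1, 1.5: 6, 2.5: 9, 3.5: 5, 4.5: 1, 6.5: 3, 7.5: 7, 8.5: 4}
values = [0.0, 10.0] + [v for v, n in events_per_value.items() for _ in range(n)]
threshold = valley_threshold(values)
print(threshold)  # 5.5
```

Because the cutoff is recomputed from each sample's own distribution, it shifts with the data instead of staying at a fixed position, which is the property the paragraph above contrasts with manual gate placement.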

Fig. 2.

flowDensity applied to International Mouse Phenotyping Consortium (IMPC) data. Colors moving from blue to red indicate increasing density

OpenCyto is a framework that applies hierarchical gating to cell populations. Users define a template specifying the hierarchical relationships of the cell subpopulations and the markers used. Most importantly, OpenCyto provides a larger framework for an automated gating pipeline that covers all of its analysis components, including preprocessing, cell population identification, population matching, and correlation with outcome variables [17]. It uses a plug-in framework that allows users to incorporate gating algorithms into the pipeline.

4.3. Performance

Due to the lack of an unbiased gold standard, automated methods are often evaluated on their ability to best mimic reliable manually gated cell populations, the current standard of practice. The performance of each method can be evaluated through measures of accuracy, precision, sensitivity, runtime, quality, and stability. High precision implies a low presence of false positives, and high sensitivity implies few false negatives. Sensitivity and precision can be summarized in the F-measure, which is the harmonic mean of precision and recall (sensitivity). Fast runtime is important when performing interactive, exploratory analysis of large data sets. Runtime depends on the algorithm's subsampling, the number of processor cores, and hardware specifications. Runtimes within a couple of minutes per FCS file scale to studies with hundreds of samples running on a single processing core. Alternatively, users can configure processing to run in a multicore setting where each CPU core takes up the task of running one FCS file through automated gating, as generally each file is considered independently. As such, several FCS files can run in parallel, which greatly reduces the overall computing time. Several software packages for multicore processing are available through CRAN (e.g., doMC, doParallel).
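The precision, recall, and F-measure of an automated gate against a manual gate can be computed directly from the sets of event indices each gate contains. The Python sketch below is a minimal illustration; the function name and the toy event indices are assumptions.

```python
def f_measure(manual, automated):
    """Return (precision, recall, F) for an automated gate versus a manual gate.

    Both arguments are collections of event indices assigned to the population.
    """
    manual, automated = set(manual), set(automated)
    true_pos = len(manual & automated)
    if true_pos == 0:
        return 0.0, 0.0, 0.0
    precision = true_pos / len(automated)  # penalizes false positives
    recall = true_pos / len(manual)        # penalizes false negatives
    return precision, recall, 2 * precision * recall / (precision + recall)

manual_gate = range(0, 10)      # events 0..9 selected by the expert
automated_gate = range(2, 14)   # events 2..13 selected by the algorithm
precision, recall, f = f_measure(manual_gate, automated_gate)
print(round(precision, 3), round(recall, 3), round(f, 3))  # 0.667 0.8 0.727
```

Here 8 events are shared, so precision is 8/12, recall is 8/10, and their harmonic mean is 16/22.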

Comprehensive community-based evaluations through the FlowCAP series of challenges [15, 18, 19] and other studies [20] have shown that automated approaches meet or exceed the performance of manual analysis for even simple datasets. To provide guidance in applying supervised methods of automated analysis to cytometry data for researchers and bioinformaticians, FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods) has organized a series of challenges to facilitate a community-based evaluation of various FCM tools. FlowCAP I and II studied the ability of algorithms to identify cell populations and classify samples, using manual gates as the evaluation baseline. The challenges identified flowMeans as the best-performing method, taking into account both speed and accuracy. One significant finding from the studies was that combining results from different cell population identification methods produces a higher F-measure than any individual method for any dataset. As such, an aggregate ensemble approach was found to be superior to any method used alone. However, a knowledge gap was identified in the ability of algorithms to robustly identify rare cell populations and deal with technical variables.

FlowCAP-3 was designed to explore the performance of algorithms addressing this area. It focused on the reproducibility of FCM tools, aiming to identify methods that could reproduce centralized manual gates with minimum bias and low variability and identification of rare populations [15]. The FlowCAP-3 study identified flowDensity and openCyto as the two co-best-performing supervised algorithms. It also recognized the difficulties in matching results from different centers even when given standardized reagents and analysis.

Weber et al. [20] reported a performance-comparison study of several clustering algorithms on their ability to detect all major immune cell populations and to detect a single rare cell population, with a focus on CyTOF datasets. The study used the F1 measure for the evaluation. The F1 measure (harmonic mean of precision and recall) is a modified version of the F-measure for the evaluation of unsupervised algorithms. Like the F score, the F1 score ranges between 0 and 1, with 1 indicating a perfect imitation of the reference gated populations. However, unlike the F score, which uses an average weighted by size and so assigns more importance to large cell populations, the F1 score is calculated from unweighted averages in order to account for large and small populations equally. Using the F or F1 score is a better way to evaluate an algorithm's performance than simply comparing population percentages, as population percentages do not give information on accuracy. Two populations can have exactly the same population percentages yet not overlap with each other (i.e., low accuracy). For data sets containing multiple cell populations of interest, FlowSOM and flowMeans demonstrated good performance on high-dimensional immunological data sets. FlowSOM had the fastest runtime, usually several orders of magnitude faster than its counterparts. For single rare cell population detection, X-shift had the best F1 measure, followed by flowMeans, with FlowSOM also among the top five. Overall, FlowSOM was shown to have the best performance taking into account speed and accuracy and was identified as the best unsupervised clustering method for automated detection of cell populations in high-dimensional cytometry [20].
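The practical difference between size-weighted and unweighted averaging can be seen with a small worked example in Python. The per-population scores and sizes below are invented for illustration and do not come from any of the cited studies.

```python
def weighted_mean_f(scores, sizes):
    """Average per-population F scores weighted by population size."""
    return sum(f * n for f, n in zip(scores, sizes)) / sum(sizes)

def unweighted_mean_f(scores):
    """Average per-population F scores, treating every population equally."""
    return sum(scores) / len(scores)

# Hypothetical per-population results: large, medium, and rare populations.
scores = [0.95, 0.80, 0.40]
sizes = [9000, 900, 100]

weighted = weighted_mean_f(scores, sizes)
unweighted = unweighted_mean_f(scores)
print(round(weighted, 3))    # 0.931
print(round(unweighted, 3))  # 0.717
```

The poorly recovered rare population barely moves the weighted average but pulls the unweighted average down substantially, which is why the unweighted form is more informative when rare populations matter.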

5. Probability Binning

As an alternative to gating, probability binning is a method that analyzes FCM data in bins that contain nearly equal numbers of events. This method is extended into multiple dimensions to generate multivariate probability functions of FCM data, as incorporated in the R/Bioconductor package flowFP [21]. These specific multivariate probability functions are referred to as “fingerprints.” Statistical tests can then be performed directly on the elements of a fingerprint to separate statistically informative subregions (or bins) that correlate most significantly with an experimental question. These bins can subsequently be used as gates, similarly to those produced by automated methods.
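A one-dimensional version of probability binning can be sketched in Python as follows. Bin edges are taken from the quantiles of a control sample so that each bin holds an equal share of control events, and a test sample is then counted into the same bins. This is an assumed simplification for illustration, not the flowFP implementation, which works in multiple dimensions; the function names and synthetic samples are hypothetical.

```python
from bisect import bisect_right

def equal_frequency_edges(control, n_bins):
    """Interior bin edges chosen so each bin holds about the same number of control events."""
    ordered = sorted(control)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

def bin_counts(sample, edges):
    """Count events per bin; this vector plays the role of a simple fingerprint."""
    counts = [0] * (len(edges) + 1)
    for v in sample:
        counts[bisect_right(edges, v)] += 1
    return counts

def chi_square(observed, expected):
    """Pearson chi-square statistic comparing two count vectors."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

control = [float(v) for v in range(100)]    # reference sample
test = [float(v) for v in range(50, 150)]   # shifted sample

edges = equal_frequency_edges(control, n_bins=4)
control_fp = bin_counts(control, edges)
test_fp = bin_counts(test, edges)
# Scale the control counts to the size of the test sample to get expected counts.
expected = [c * len(test) / len(control) for c in control_fp]
statistic = chi_square(test_fp, expected)
print(control_fp, test_fp, statistic)  # [25, 25, 25, 25] [0, 0, 25, 75] 150.0
```

Bins whose counts differ most from expectation are the candidates for the statistically informative subregions described above.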

6. Biomarker Identification

After identifying cell populations in an individual sample, users might be interested in studying the differences in cell populations across groups of samples with biological variation. For application in discovery, the goal is to identify and describe cell populations that are correlated with an external variable. The current best-performing approach to biomarker discovery incorporates flowType used in combination with flowDensity, an approach taken by the only two pipelines that showed significant performance in FlowCAP-IV [15, 18, 19]. The goal in FlowCAP-IV was to predict the outcome associated with FCS data files, in this case, time of progression to AIDS among HIV+ cohorts. The results can be generalized to any outcome across datasets. flowType identifies all cell types in a sample by combining all possible partitions of cells, identified either manually or using automated gating [22]. It has been particularly successful in a type of exploratory analysis where the correlation of different cell populations with an external outcome is investigated.
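The combinatorial idea of combining all partitions can be sketched in Python. The example assumes each marker has already been split into positive and negative events, and it enumerates every phenotype in which a marker is required to be positive, required to be negative, or ignored, giving 3^n phenotypes for n markers. This is an illustration of the enumeration only and not the flowType implementation; the function name, marker names, and cells are hypothetical.

```python
from itertools import product

def phenotype_counts(cells, markers):
    """Count cells in every phenotype built from +, - or ignored (*) markers.

    cells: list of dicts mapping marker name -> True if the cell is positive.
    """
    counts = {}
    for states in product("+-*", repeat=len(markers)):
        # Build a name such as "CD3+CD4-"; ignored markers are left out.
        name = "".join(m + s for m, s in zip(markers, states) if s != "*") or "all cells"
        counts[name] = sum(
            all(s == "*" or cell[m] == (s == "+") for m, s in zip(markers, states))
            for cell in cells
        )
    return counts

markers = ["CD3", "CD4"]
cells = [
    {"CD3": True, "CD4": True},
    {"CD3": True, "CD4": False},
    {"CD3": False, "CD4": False},
    {"CD3": True, "CD4": True},
]
counts = phenotype_counts(cells, markers)
print(len(counts))         # 9
print(counts["CD3+CD4+"])  # 2
print(counts["CD3+"])      # 3
```

The resulting counts, expressed as proportions per sample, are the features that can then be tested for correlation with an external outcome.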

FloReMi, available through GitHub (https://github.com/SofieVG/FloReMi), builds upon flowDensity and flowType to extract all features for each sample and then selects the features that correlate most with survival time with minimal redundancy. As such, it is able to identify subpopulations of cells that can be used to predict the phenotype of a patient. The components of the FloReMi pipeline include preprocessing, identifying and selecting informative cell subsets, and predicting survival time, all of which work independently of each other. It is thus possible to evaluate each component individually and substitute them with outside algorithms for similar tasks [23].

7. Visualization

Visualization is an essential part of data analysis in that it allows for communicating, exploring, and discovering possibly significant outliers and cell populations. Visualization helps researchers study high-dimensional data in a way that provides biological insights. It is often desirable to reduce the dimensionality of complex data sets so that users can compare cell populations identified in high-dimensional space on a two-dimensional computer screen.

SPADE colors and connects similar immunophenotypes together in the form of a spanning tree, enabling visualization of single-cell data and inference of relationships between different cell types [24]. Biologists can then annotate the cell clusters by drawing inferences from cell marker expression. An example of a SPADE output is shown in Fig. 3.

Fig. 3.

SPADE tree color-coded based on the median expression intensities of cell markers within each node, where each node represents a cluster; each of the six trees represents the distribution of the different marker expression across cell clusters; red represents a high marker expression while blue a low expression; dataset is from the flowDensity package
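The tree-building step behind such a plot can be illustrated with a minimum spanning tree over cluster medians. The Python sketch below uses Prim's algorithm with Euclidean distance; it covers only the tree construction and not SPADE's down-sampling or clustering, and the cluster names and median values are invented for the example.

```python
import math

def minimum_spanning_tree(nodes):
    """Prim's algorithm: connect clusters by their closest median expression profiles.

    nodes: dict mapping cluster name -> tuple of median marker intensities.
    Returns the tree as a list of (from, to) edges in the order they were added.
    """
    names = list(nodes)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the shortest edge that links the tree to a cluster outside it.
        a, b = min(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda e: math.dist(nodes[e[0]], nodes[e[1]]),
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges

# Hypothetical clusters described by median intensities of two markers.
clusters = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (1.0, 3.0), "D": (5.0, 3.0)}
tree = minimum_spanning_tree(clusters)
print(tree)  # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

Each node of the resulting tree can then be colored by the median expression of one marker, as in the figure.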

t-SNE reduces dimensionality by arranging cell populations in ways that can preserve the spatial relationships of the cell populations in high-dimensional space [25], that is, cells that are close together in the high-dimensional space are in proximity in the 2D or 3D scatter map, thus enabling the visualization of different subpopulations (Fig. 4). Users can color code the t-SNE map based on gated results from either manual, supervised, or unsupervised methods. For detailed tutorials on t-SNE, users can go to https://github.com/Irrationone/graspods_scrnaseq_clustering_tutorial.

Fig. 4.

Running t-SNE on two-dimensional space, where each data point represents one cell. Data is color-coded based on clustering. Data used is from the flowDensity package

Both t-SNE and SPADE create a 2D visualization (t-SNE can also produce 3D) of higher-order relationships, and both are suited for cell population identification [24, 25]. When dealing with massive data, some algorithms implement subsampling to reduce computational time, often, however, with a trade-off in cell population identification accuracy, especially for rare populations [20]. t-SNE performs random down-sampling and SPADE density-dependent down-sampling. t-SNE has no clustering, and manual interpretation of distinct clusters for cell population identification is required. In contrast, SPADE has a built-in clustering algorithm, projects the clusters onto a spanning tree, and allows for additional inference of cellular hierarchy. However, manual interpretation is still required.

Citrus is a visualization tool that can aid in understanding high-dimensional data and can help in the study of the correlation of cellular responses with external experimental endpoints. The algorithm splits data into many subdivisions and uses statistical tests to find differences among sample groups (control group and patient group) [26]. Citrus uses both predictive and correlative models to detect associations between clusters and experimental endpoints. The predictive model studies the properties of each cluster that are most predictive of the experimental endpoint. Citrus creates a series of predictive models with varying numbers of properties and then cross-validates each model for its predictive accuracy (Fig. 5). Users can then select the model with the best predictive accuracy based on the model error rate and further examine the cluster properties identified by Citrus and their phenotypes.

Fig. 5.

Model error rate generated after cross-validation of each predictive model. cv.min is the model with the minimum error rate. cv.1se is the simplest model whose error rate is within 1 standard error of the minimum. cv.fdr.constrained is the model that contains the largest number of features while maintaining the minimum false discovery rate. Figure reproduced from the Citrus simulated data, which contains 10 healthy patients sample groups and 10 diseased patients sample groups
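The selection of the cv.min and cv.1se models can be illustrated with the usual one-standard-error rule from cross-validation. The Python sketch below is not Citrus code; the function name and the error values are invented, and cv.fdr.constrained is not covered.

```python
def select_models(models):
    """Pick the minimum-error model and the simplest model within one standard error.

    models: list of (n_features, mean_cv_error, standard_error) tuples.
    Returns (cv_min, cv_1se).
    """
    cv_min = min(models, key=lambda m: m[1])
    limit = cv_min[1] + cv_min[2]  # minimum error plus one standard error
    within = [m for m in models if m[1] <= limit]
    cv_1se = min(within, key=lambda m: m[0])  # fewest features among acceptable models
    return cv_min, cv_1se

# Hypothetical cross-validation results for models of increasing complexity.
models = [
    (1, 0.40, 0.05),
    (3, 0.22, 0.04),
    (5, 0.14, 0.04),
    (9, 0.12, 0.03),
    (15, 0.13, 0.03),
]
cv_min, cv_1se = select_models(models)
print(cv_min[0], cv_1se[0])  # 9 5
```

In this example the 9-feature model has the lowest error, but the 5-feature model is statistically indistinguishable from it and is simpler to interpret.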

RchyOptimyx is another visualization method that assigns importance to each cell type in biomarker identification studies by associating cell abundance with clinical outcomes, such as disease status or patient survival, and simplifies the identified phenotypes [27]. RchyOptimyx provides dimensionality reduction for the description of a set of cell populations based on marker importance (Fig. 6).

Fig. 6.

A typical RchyOptimyx pipeline involves using flowType to extract all cell proportions (a) and then returning a list of significantly different phenotypes and their scores. The phenotypes and their scores can be generated by any manual or automated algorithm. (b) shows a complete hierarchy describing all possible gating strategies for gating that cell population. (c) shows the relationship between the number of markers in each phenotype and the significance of the correlation with the clinical outcome. Data in (b) and (c) are reproduced from RchyOptimyx package data HIVData tube 2

Users should not confuse the hierarchies generated by RchyOptimyx with those generated by SPADE. SPADE and RchyOptimyx differ from each other in terms of algorithms and applications [28]. RchyOptimyx can group cell populations that have common parents and also exhibit functional similarities. In this way, it can automatically annotate cell populations identified by other methods (including SPADE) and list a single-cell hierarchy based on marker importance. Its functional strength lies in the optimization of gating strategies. In contrast, SPADE uses the distance between mean/median fluorescence intensities to connect clusters together and then requires manual annotation of results based on expert knowledge. It is mostly used to give an overview of different immune populations and the expression of surface markers or intracellular signaling molecules [29, 30].

8. Metadata Annotation

Properly understanding the results of even simple studies requires that many components of data generation and analysis be adequately described. The importance of metadata annotation only increases with the complexity of studies, which creates a need for standardized reporting of experimental variables. The MIFlowCyt standard was developed with the aim of promoting effective curation, integration, analysis, interpretation, and sharing of cytometry data, and has been adopted by publishers [31]. One important component of data annotation is the semantic labeling of cell populations. flowCL semantically labels cell populations based on their surface marker profiles, and in doing so provides a means of standardized annotation for identified cell types. After identifying immunophenotypes through either manual or automated methods, users might want to determine the cell type from each cell marker’s relative abundance. flowCL does this by querying the Cell Ontology (CL), which describes cell markers for hematopoietic cell lines [32]. CL represents biological cell types with a structured description of each cell type. This is important in data integration when trying to match gating results from different sources. However, manually exploiting the surface marker information in CL requires drawing inferences along multiple axes, which can pose great challenges for application to large datasets. flowCL addresses these challenges in ways that provide unambiguous labeling and optimize CL for application in automated cell labeling.
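
The general idea of matching a marker profile against structured cell type descriptions can be illustrated with a toy example in base R. This sketch does not call flowCL or query CL: the lookup table `toy_ontology`, its labels, and the helpers `parse_phenotype` and `label_phenotype` are hypothetical, and the simple pattern used here does not handle marker names that contain hyphens.

```r
# Split a phenotype string such as "CD3+CD4+CD8-" into one token per marker.
parse_phenotype <- function(p) {
  regmatches(p, gregexpr("[A-Za-z0-9]+[+-]", p))[[1]]
}

# Toy stand-in for an ontology: each label lists the markers it requires.
toy_ontology <- list(
  "T cell"      = c("CD3+"),
  "CD4+ T cell" = c("CD3+", "CD4+", "CD8-"),
  "CD8+ T cell" = c("CD3+", "CD4-", "CD8+"),
  "B cell"      = c("CD3-", "CD19+")
)

# Return the most specific label whose required markers are all observed.
label_phenotype <- function(p, ontology = toy_ontology) {
  observed <- parse_phenotype(p)
  hits <- Filter(function(required) all(required %in% observed), ontology)
  if (length(hits) == 0) return(NA_character_)
  names(hits)[which.max(lengths(hits))]
}

label_phenotype("CD3+CD4+CD8-")
label_phenotype("CD3+")
label_phenotype("CD56+")
```

Here the first call matches both the general and the specific T cell entries and keeps the more specific one, the second call matches only the general entry, and the third has no match in the toy table. A real ontology query has to resolve synonyms and partial matches, which is the kind of inference the text describes as difficult to do by hand.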

9. Learning

Using the computational algorithms described in this review requires that users be familiar with the R language. Most of the packages mentioned are available from the Bioconductor website (https://www.bioconductor.org), a repository of open-source bioinformatics tools.

9.1. Install R

R is open-source software that is freely available for all major operating systems. Users can download R from its official website, https://www.r-project.org, or follow the installation instructions on the Bioconductor website at https://bioconductor.org/install. It is highly recommended that users install RStudio (https://www.rstudio.com), the most widely used open-source integrated development environment (IDE) for R. An IDE brings together the tools needed for software development, such as an editor and a console, in a single application. RStudio has a simple user interface and is well suited to complex data handling and script writing.
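
Once R is installed, Bioconductor packages can be installed from within R. The sketch below assumes a current Bioconductor release, in which installation is handled by the BiocManager package. The helper names `missing_packages` and `install_bioc` are hypothetical, and the package names in the final comment are only examples. The installation call is left commented out because it requires internet access.

```r
# Return the names of the requested packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install any missing Bioconductor packages through BiocManager.
install_bioc <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) == 0) {
    message("All requested packages are already installed.")
    return(invisible(character(0)))
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
  invisible(to_install)
}

# Packages that ship with R are always present, so nothing is reported here.
missing_packages(c("stats", "utils"))

# Uncomment to install (needs internet access):
# install_bioc(c("flowCore", "flowWorkspace"))
```

Checking for missing packages first keeps the script safe to re-run, since packages that are already present are not reinstalled.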

9.2. First Step in Learning R

To start, users should learn the basics of R, i.e., its data structures and functions. Many free resources for learning R are available online, including courses [33, 34], tutorials, and forums. Users can get support from online forums such as Stack Overflow by searching for, or asking, specific problem-based questions. Many universities also host R user groups, which often hold workshops.
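
As a first orientation, the short sketch below shows the data structures and function syntax that appear most often in analysis scripts. All object names and values are made up for illustration.

```r
# A named numeric vector: one value per sample.
cd4_percent <- c(sample_a = 41.2, sample_b = 38.7, sample_c = 45.1)

# A list can hold elements of different types and lengths.
experiment <- list(
  panel   = c("CD3", "CD4", "CD8"),
  n_files = length(cd4_percent)
)

# A data frame is a table with one row per sample and typed columns.
samples <- data.frame(
  id    = names(cd4_percent),
  cd4   = unname(cd4_percent),
  group = c("healthy", "diseased", "healthy")
)

# A function: the mean CD4 percentage within one group.
group_mean <- function(df, g) {
  mean(df$cd4[df$group == g])
}

group_mean(samples, "healthy")
str(samples)
```

The call to `str()` prints the type of every column and is a convenient way to inspect an unfamiliar object.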

9.3. Learn R for Flow Cytometry

Some recommended materials for learning how to conduct high-throughput computational data analysis using R are listed in Table 2. For package-specific usage, users should refer to the package vignette, which is distributed with the package through Bioconductor. Vignettes contain thorough step-by-step instructions and code showing how to use the software. Reproducing the examples in the vignette is often the first step in learning a new algorithm. Each package comes with example data sets, so users can reproduce the results in the examples by simply copying the code into RStudio or R.
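
Vignettes and example data sets of installed packages can also be listed from within R. The sketch below uses the `grid` and `datasets` packages, which ship with R, so that it runs without any Bioconductor package installed. The helper `list_vignettes` is hypothetical, and the commented lines show how the same calls would look for flowCore, assuming it is installed.

```r
# List the vignettes that an installed package provides.
list_vignettes <- function(pkg) {
  v <- vignette(package = pkg)
  v$results[, c("Item", "Title"), drop = FALSE]
}

list_vignettes("grid")

# List the example data sets that an installed package provides.
example_data <- data(package = "datasets")$results[, c("Item", "Title"), drop = FALSE]
head(example_data)

# For a Bioconductor package (assuming it is installed):
# list_vignettes("flowCore")
# browseVignettes("flowCore")   # opens the vignette index in a web browser
# data(package = "flowCore")    # lists the example data sets
```

The number of vignettes listed depends on the local installation, so the output may differ between machines.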

Table 2.

Online resources for learning R for bioinformatics

Material References
FCM Data Analysis Using R Workshop https://bioinformatics.ca/workshops/2013/flow-cytometry-data-analysis-using-r-2013
Basic FCM Workshop https://bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/BasicFlowWorkshop.pdf
Statistics and R for High-Throughput experiments in life science https://www.edx.org/course/statistics-r-harvardx-ph525-1x-0
Introduction to R and Bioconductor https://github.com/Bioconductor/BiocIntro/tree/Bioc2017
Bioconductor workflow for Single-cell RNA-seq Data Analysis: Dimensionality Reduction, Clustering, and Pseudotime Ordering https://github.com/fperraudeau/bioc2017singlecell
Conference BioC2017: Where Software and Biology Connect http://bioconductor.org/help/course-materials/2017/BioC2017/

FCM bioinformatics is a rapidly evolving field. To keep up with the latest tool development and news, users are encouraged to follow the literature, including the journals Cytometry A and Bioinformatics, where many tools to date have been published, and Bioconductor, which separately tracks packages in this domain by publishing and maintaining a broad range of analytical and graphical methods. Each published Bioconductor package has a maintainer, who is the first responder for helping solve package-related issues such as those related to installation, compatibility, and software updates.

Acknowledgments

This work was supported by GenomeCanada (252FLO Brinkman), NSERC, GenomeBC, and NIH (1 R01 GM118417-01A1).

References

  • 1. Kvistborg P, Gouttefangeas C, Aghaeepour N et al. (2015) Thinking outside the gate: single-cell assessments in multiple dimensions. Immunity 42(4):591–592. 10.1016/j.immuni.2015.04.006
  • 2. Hahne F, LeMeur N, Brinkman RR et al. (2009) flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinformatics 10:106. 10.1186/1471-2105-10-106
  • 3. Robinson JP, Rajwa B, Patsekin V et al. (2012) Computational analysis of high-throughput flow cytometry data. Expert Opin Drug Discov 7(8):679–693. 10.1517/17460441.2012.693475
  • 4. Finak G, Jiang W, Pardo J et al. (2012) QUAliFiER: an automated pipeline for quality assessment of gated flow cytometry data. BMC Bioinformatics 13:252. 10.1186/1471-2105-13-252
  • 5. Gentleman R, Hahne F, Kettman J et al. (2017) flowQ: quality control for flow cytometry. R package version 1.38
  • 6. Spidlen J, El Khettabi F, Moore W et al. (2017) flowQB: automated quadratic characterization of flow cytometry instrument sensitivity: Q, B and CV instrinsic calculations. R package version 2.6.0. https://www.bioconductor.org/packages/release/bioc/html/flowQB.html
  • 7. O’Neill K, Aghaeepour N, Spidlen J et al. (2013) Flow cytometry bioinformatics. PLoS Comput Biol 9(12):e1003365. 10.1371/journal.pcbi.1003365
  • 8. Hahne F, Khodabakhshi AH, Bashashati A et al. (2010) Per-channel basis normalization methods for flow cytometry data. Cytometry A 77(2):121–131. 10.1002/cyto.a.20823
  • 9. Finak G, Jiang W, Krouse K et al. (2014) High-throughput flow cytometry data normalization for clinical trials. Cytometry A 85(3):277–286. 10.1002/cyto.a.22433
  • 10. Li H, Shaham U, Stanton KP et al. (2017) Gating mass cytometry data by deep learning. Bioinformatics 33(21):3423–3430. 10.1093/bioinformatics/btx448
  • 11. Maecker HT, Rinfret A, D’Souza P et al. (2005) Standardization of cytokine flow cytometry assays. BMC Immunol 6:13. 10.1186/1471-2172-6-13
  • 12. McNeil LK, Price L, Britten CM et al. (2013) A harmonized approach to intracellular cytokine staining gating: results from an international multiconsortia proficiency panel conducted by the cancer immunotherapy consortium (CIC/CRI). Cytometry A 83(8):728–738. 10.1002/cyto.a.22319
  • 13. Verschoor CP, Lelic A, Bramson JL et al. (2015) An introduction to automated Flow cytometry gating tools and their implementation. Front Immunol 6:380. 10.3389/fimmu.2015.00380
  • 14. Rebhahn JA, Roumanes DR, Qi Y et al. (2016) Competitive SWIFT cluster templates enhance detection of aging changes. Cytometry A 89(1):59–70. 10.1002/cyto.a.22740
  • 15. Aghaeepour N, Finak G, Flow CAPC et al. (2013) Critical assessment of automated flow cytometry data analysis techniques. Nat Methods 10(3):228–238. 10.1038/nmeth.2365
  • 16. Malek M, Taghiyar MJ, Chong L et al. (2015) flowDensity: reproducing manual gating of flow cytometry data by automated density-based cell population identification. Bioinformatics 31(4):606–607. 10.1093/bioinformatics/btu677
  • 17. Finak G, Frelinger J, Jiang W et al. (2014) OpenCyto: an open source infrastructure for scalable, robust, reproducible, and automated, end-to-end flow cytometry data analysis. PLoS Comput Biol 10(8):e1003806. 10.1371/journal.pcbi.1003806
  • 18. Aghaeepour N, Chattopadhyay P, Chikina M et al. (2016) A benchmark for evaluation of algorithms for identification of cellular correlates of clinical outcomes. Cytometry A 89(1):16–21. 10.1002/cyto.a.22732
  • 19. Finak G, Langweiler M, Jaimes M et al. (2016) Standardizing flow cytometry Immunophenotyping analysis from the human Immunophenotyping consortium. Sci Rep 6:20686. 10.1038/srep20686
  • 20. Weber LM, Robinson MD (2016) Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry A 89(12):1084–1096. 10.1002/cyto.a.23030
  • 21. Rogers WT, Holyst HA (2009) FlowFP: a bioconductor package for fingerprinting flow cytometric data. Adv Bioinforma 2009:193947. 10.1155/2009/193947
  • 22. Aghaeepour N, O’Neill K, Jalali A (2014) flowType: phenotyping Flow cytometry assays. R package version 2.14.0
  • 23. Van Gassen S, Vens C, Dhaene T et al. (2016) FloReMi: flow density survival regression using minimal feature redundancy. Cytometry A 89(1):22–29. 10.1002/cyto.a.22734
  • 24. Anchang B, Hart TD, Bendall SC et al. (2016) Visualization and cellular hierarchy inference of single-cell data using SPADE. Nat Protoc 11(7):1264–1279. 10.1038/nprot.2016.066
  • 25. van der Maaten LJP, Hinton GE (2008) Visualizing high-dimensional data using t-SNE. J Mach Learn Res 9:2579–2605
  • 26. Bruggner RV, Bodenmiller B, Dill DL et al. (2014) Automated identification of stratifying signatures in cellular subpopulations. Proc Natl Acad Sci U S A 111(26):E2770–E2777. 10.1073/pnas.1408792111
  • 27. Aghaeepour N, Jalali A, O’Neill K et al. (2012) RchyOptimyx: cellular hierarchy optimization for flow cytometry. Cytometry A 81(12):1022–1030. 10.1002/cyto.a.22209
  • 28. O’Neill K, Jalali A, Aghaeepour N et al. (2014) Enhanced flowType/RchyOptimyx: a BioConductor pipeline for discovery in high-dimensional cytometry data. Bioinformatics 30(9):1329–1330. 10.1093/bioinformatics/btt770
  • 29. Bendall SC, Simonds EF, Qiu P et al. (2011) Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332(6030):687–696. 10.1126/science.1198704
  • 30. Qiu P, Simonds EF, Bendall SC et al. (2011) Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nat Biotechnol 29(10):886–891. 10.1038/nbt.1991
  • 31. Lee JA, Spidlen J, Boyce K et al. (2008) MIFlowCyt: the minimum information about a flow cytometry experiment. Cytometry A 73(10):926–930. 10.1002/cyto.a.20623
  • 32. Courtot M, Meskas J, Diehl AD et al. (2015) flowCL: ontology-based cell population labelling in flow cytometry. Bioinformatics 31(8):1337–1339. 10.1093/bioinformatics/btu807
  • 33. R Programming. https://www.coursera.org/learn/r-programming. Accessed 19 Dec 2017
  • 34. Statistics and R. https://www.edx.org/course/statistics-r-harvardx-ph525-1x-0. Accessed 19 Dec 2017
