eLife. 2019 Jan 17;8:e38173. doi: 10.7554/eLife.38173

CaImAn an open source tool for scalable calcium imaging data analysis

Andrea Giovannucci 1, Johannes Friedrich 1,2,3, Pat Gunn 1, Jérémie Kalfon 4, Brandon L Brown 5, Sue Ann Koay 6, Jiannis Taxidis 7, Farzaneh Najafi 8, Jeffrey L Gauthier 6, Pengcheng Zhou 2,3, Baljit S Khakh 5,9, David W Tank 6, Dmitri B Chklovskii 1, Eftychios A Pnevmatikakis 1
Editors: David Kleinfeld10, Andrew J King11
PMCID: PMC6342523  PMID: 30652683

Abstract

Advances in fluorescence microscopy enable monitoring larger brain areas in vivo with finer time resolution. The resulting data rates require reproducible analysis pipelines that are reliable, fully automated, and scalable to datasets generated over the course of months. We present CaImAn, an open-source library for calcium imaging data analysis. CaImAn provides automatic and scalable methods to address problems common to pre-processing, including motion correction, neural activity identification, and registration across different sessions of data collection. It does this while requiring minimal user intervention, with good scalability on computers ranging from laptops to high-performance computing clusters. CaImAn is suitable for two-photon and one-photon imaging, and also enables real-time analysis on streaming data. To benchmark the performance of CaImAn we collected and combined a corpus of manual annotations from multiple labelers on nine mouse two-photon datasets. We demonstrate that CaImAn achieves near-human performance in detecting locations of active neurons.

Research organism: Mouse, Zebrafish

eLife digest

The human brain contains billions of cells called neurons that rapidly carry information from one part of the brain to another. Progress in medical research and healthcare is hindered by the difficulty in understanding precisely which neurons are active at any given time. New brain imaging techniques and genetic tools allow researchers to track the activity of thousands of neurons in living animals over many months. However, these experiments produce large volumes of data that researchers currently have to analyze manually, which can take a long time and generate irreproducible results.

There is a need to develop new computational tools to analyze such data. The new tools should be able to operate on standard computers rather than just specialist equipment as this would limit the use of the solutions to particularly well-funded research teams. Ideally, the tools should also be able to operate in real-time as several experimental and therapeutic scenarios, like the control of robotic limbs, require this. To address this need, Giovannucci et al. developed a new software package called CaImAn to analyze brain images on a large scale.

Firstly, the team developed algorithms that are suitable to analyze large sets of data on laptops and other standard computing equipment. These algorithms were then adapted to operate online in real-time. To test how well the new software performs against manual analysis by human researchers, Giovannucci et al. asked several trained human annotators to identify active neurons that were round or donut-shaped in several sets of imaging data from mouse brains. Each set of data was independently analyzed by three or four researchers who then discussed any neurons they disagreed on to generate a ‘consensus annotation’. Giovannucci et al. then used CaImAn to analyze the same sets of data and compared the results to the consensus annotations. This demonstrated that CaImAn is nearly as good as human researchers at identifying active neurons in brain images.

CaImAn provides a quicker method to analyze large sets of brain imaging data and is currently used by over a hundred laboratories across the world. The software is open source, meaning that it is freely-available and that users are encouraged to customize it and collaborate with other users to develop it further.

Introduction

Understanding the function of neural circuits is contingent on the ability to accurately record and modulate the activity of large neural populations. Optical methods based on the fluorescence activity of genetically encoded calcium binding indicators (Chen et al., 2013) have become a standard tool for this task, due to their ability to monitor in vivo targeted neural populations from many different brain areas over extended periods of time (weeks or months). Advances in microscopy techniques facilitate imaging larger brain areas with finer time resolution, producing an ever-increasing amount of data. A typical resonant scanning two-photon microscope produces data at a rate greater than 50 GB/hour (calculation performed on a 512 × 512 Field of View imaged at 30 Hz producing an unsigned 16-bit integer for each measurement), a number that can be significantly higher (up to more than 1 TB/hour) with other custom recording technologies (Sofroniew et al., 2016; Ahrens et al., 2013; Flusberg et al., 2008; Cai et al., 2016; Prevedel et al., 2014; Grosenick et al., 2017; Bouchard et al., 2015).
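As a sanity check on this figure, the quoted rate follows directly from the frame size, frame rate, and sample depth; the snippet below simply reproduces the arithmetic for the 512 × 512, 30 Hz, 16-bit configuration mentioned above.

```python
# Back-of-the-envelope data rate for a resonant two-photon microscope
# (512 x 512 FOV, 30 Hz, 16-bit samples, as assumed in the text).
pixels_per_frame = 512 * 512
frames_per_second = 30
bytes_per_sample = 2  # unsigned 16-bit integer

bytes_per_hour = pixels_per_frame * frames_per_second * bytes_per_sample * 3600
print(f"{bytes_per_hour / 1e9:.1f} GB/hour")  # ~56.6 GB/hour, i.e. >50 GB/hour
```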

This increasing availability and volume of calcium imaging data calls for automated analysis methods and reproducible pipelines to extract the relevant information from the recorded movies, that is, the locations of neurons in the imaged Field of View (FOV) and their activity in terms of raw fluorescence and/or neural activity (spikes). The typical steps in these processing pipelines are the following (Figure 1a): (i) motion correction, where the FOV at each data frame (image or volume) is registered against a template to correct for motion artifacts due to the finite scanning rate and existing brain motion, (ii) source extraction, where the different active and possibly overlapping sources are extracted and their signals are demixed from each other and from the background neuropil signals (Figure 1b), and (iii) activity deconvolution, where the neural activity of each identified source is deconvolved from the dynamics of the calcium indicator.

Figure 1. Processing pipeline of CaImAn for calcium imaging data.

Figure 1.

(a) The typical pre-processing steps include (i) correction for motion artifacts, (ii) extraction of the spatial footprints and fluorescence traces of the imaged components, and (iii) deconvolution of the neural activity from the fluorescence traces. (b) Time average of 2000 frames from a two-photon microscopy dataset (left) and magnified illustration of three overlapping neurons (right), as detected by the CNMF algorithm. (c) Denoised temporal components of the three neurons in (b) as extracted by CNMF and matched by color (in relative fluorescence change, ΔF/F). (d) Intuitive depiction of CNMF. The algorithm represents the movie as the sum of spatially localized rank-one spatio-temporal components capturing neurons and processes, plus additional non-sparse low-rank terms for the background fluorescence and neuropil activity. (e) Flow-chart of the CaImAn batch processing pipeline. From left to right: Motion correction and generation of a memory efficient data format. Initial estimate of somatic locations in parallel over FOV patches using CNMF. Refinement and merging of extracted components via seeded CNMF. Removal of low quality components. Final domain dependent processing stages. (f) Flow-chart of the CaImAn online algorithm. After a brief mini-batch initialization phase, each frame is processed in a streaming fashion as it becomes available. From left to right: Correction for motion artifacts. Estimation of activity from existing neurons, identification and incorporation of new neurons. The spatial footprints of inferred neurons are also updated periodically (dashed lines).

Related work

Source extraction

Some source extraction methods attempt the detection of neurons in static images using supervised or unsupervised learning methods. Examples of unsupervised methods on summary images include graph-cut approaches applied to the correlation image (Kaifosh et al., 2014; Spaen et al., 2017), and dictionary learning (Pachitariu et al., 2013). Supervised learning methods based on boosting (Valmianski et al., 2010), or, more recently, deep neural networks have also been applied to the problem of neuron detection (Apthorpe et al., 2016; Klibisz et al., 2017). While these methods can be efficient in detecting the locations of neurons, they cannot infer the underlying activity nor do they readily offer ways to deal with the spatial overlap of different components.

To extract temporal traces jointly with the spatial footprints of the components one can use methods that directly represent the full spatio-temporal data using matrix factorization approaches, for example independent component analysis (ICA) (Mukamel et al., 2009), constrained nonnegative matrix factorization (CNMF) (Pnevmatikakis et al., 2016) (and its adaptation to one-photon data (Zhou et al., 2018)), clustering based approaches (Pachitariu et al., 2017), dictionary learning (Petersen et al., 2017), or active contour models (Reynolds et al., 2017). Such spatio-temporal methods are unsupervised, and focus on detecting active neurons by considering the spatio-temporal activity of a component as a contiguous set of pixels within the FOV that are correlated in time. While such methods offer a direct decomposition of the data into a set of sources with activity traces in an unsupervised way, in principle they require processing of the full dataset, and thus quickly become intractable as data size grows. Possible approaches to deal with the data size include distributed processing in High Performance Computing (HPC) clusters (Freeman et al., 2014), spatio-temporal decimation (Friedrich et al., 2017a), and dimensionality reduction (Pachitariu et al., 2017). Recently, Giovannucci et al. (2017) prototyped an online algorithm (OnACID) by adapting matrix factorization setups (Pnevmatikakis et al., 2016; Mairal et al., 2010) to operate on streaming calcium imaging data and thus natively deal with large data rates. For a full review see Pnevmatikakis (2018).

Deconvolution

For the problem of predicting spikes from fluorescence traces, both supervised and unsupervised methods have been explored. Supervised methods rely on the use of labeled data to train or fit biophysical or neural network models (Theis et al., 2016), although semi-supervised methods that jointly learn a generative model for fluorescence traces have also been proposed (Speiser et al., 2017). Unsupervised methods can be either deterministic, such as sparse non-negative deconvolution (Vogelstein et al., 2010; Pnevmatikakis et al., 2016), which gives a single estimate of the deconvolved neural activity, or probabilistic, which aim to also characterize the uncertainty around these estimates (e.g., Pnevmatikakis et al., 2013; Deneux et al., 2016). A recent community benchmarking effort (Berens et al., 2017) characterizes the similarities and differences of the various available methods.

CaImAn

Here we present CaImAn, an open source pipeline for the analysis of both two-photon and one-photon calcium imaging data. CaImAn includes algorithms for both offline analysis (CaImAn batch) where all the data is processed at once at the end of each experiment, and online analysis on streaming data (CaImAn online). Moreover, CaImAn requires very moderate computing infrastructure (e.g., a personal laptop or workstation), thus providing automated, efficient, and reproducible large-scale analysis on commodity hardware.

Contributions

Our contributions can be roughly grouped in three different directions:

Methods: CaImAn batch improves on the scalability of the source extraction problem by employing a MapReduce framework for parallel processing and memory mapping which allows the analysis of datasets larger than would fit in RAM on most computer systems. It also improves on the qualitative performance by introducing automated routines for component evaluation and classification, better handling of neuropil contamination, and better initialization methods. While these benefits are here presented in the context of the widely used CNMF algorithm of Pnevmatikakis et al. (2016), they are in principle applicable to any matrix factorization approach.

CaImAn online improves and extends the OnACID prototype algorithm (Giovannucci et al., 2017) by introducing, among other advances, new initialization methods and a convolutional neural network (CNN) based approach for detecting new neurons on streaming data. Our analysis on in vivo two-photon and light-sheet imaging datasets shows that CaImAn online approaches human-level performance and enables novel types of closed-loop experiments. Apart from these significant algorithmic improvements, CaImAn includes several useful analysis tools, such as a MapReduce and memory-mapping compatible implementation of the CNMF-E algorithm for one-photon microendoscopic data (Zhou et al., 2018), a novel efficient algorithm for registration of components across multiple days, and routines for segmentation of structural (static) channel information which can be used for component seeding.

Software: CaImAn is a complete open source software suite implemented primarily in Python, and is already widely used by, and has received contributions from, its community. It contains efficient implementations of the standard analysis pipeline steps (motion correction - source extraction - deconvolution - registration across different sessions), as well as numerous other features. Much of the functionality is also available in a separate MATLAB implementation.

Data: We benchmark the performance of CaImAn against a previously unreleased corpus of manually annotated data. The corpus consists of 9 mouse in vivo two-photon datasets. Each dataset is manually annotated by 3–4 independent labelers who were instructed to select active neurons in a principled and consistent way. In a subsequent stage, the annotations were combined to create a ‘consensus’ annotation, which is used to benchmark CaImAn, to train supervised learning based classifiers, and to quantify the limits of human performance. The manual annotations are released to the community, providing a valuable tool for benchmarking and training purposes.

Paper organization

The paper is organized as follows: We first give a brief presentation of the analysis methods and features provided by CaImAn. In the Results section we benchmark CaImAn batch and CaImAn online against a corpus of manually annotated data. We apply CaImAn online to a zebrafish whole brain lightsheet imaging recording, and demonstrate how such large datasets can be processed efficiently in real time. We also present applications of CaImAn batch to one-photon data, as well as examples of component registration across multiple days. We conclude by discussing the utility of our tools, the relationship between CaImAn batch and CaImAn online and outline future directions. Detailed descriptions of the introduced methods are presented in Materials and methods.

Methods

Before presenting the new analysis features introduced with this work, we overview the analysis pipeline that CaImAn uses and builds upon.

Overview of analysis pipeline

The standard analysis pipeline for calcium imaging data used in CaImAn is depicted in Figure 1a. The data is first processed to remove motion artifacts. Subsequently the active components (neurons and background) are extracted as individual pairs of a spatial footprint that describes the shape of each component projected to the imaged FOV, and a temporal trace that captures its fluorescence activity (Figure 1b–d). Finally, the neural activity of each fluorescence trace is deconvolved from the dynamics of the calcium indicator. These operations can be challenging because of the limited axial resolution of two-photon microscopy (or the much larger integration volume in one-photon imaging), which results in spatially overlapping fluorescence from different sources and from neuropil activity. Before presenting the new features of CaImAn in more detail, we briefly review how it incorporates existing tools in the pipeline.

Motion correction

CaImAn uses the NoRMCorre algorithm (Pnevmatikakis and Giovannucci, 2017) that corrects non-rigid motion artifacts by estimating motion vectors with subpixel resolution over a set of overlapping patches within the FOV. These estimates are used to infer a smooth motion field within the FOV for each frame. For two-photon imaging data this approach is directly applicable, whereas for one-photon micro-endoscopic data the motion is estimated on high pass spatially filtered data, a necessary operation to remove the smooth background signal and create enhanced spatial landmarks. The inferred motion fields are then applied to the original data frames.
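As an illustration of the underlying registration operation, the sketch below performs rigid, subpixel registration of a single frame against a template using FFT-based phase cross-correlation from scikit-image. NoRMCorre goes further by estimating such shifts over overlapping patches and interpolating a smooth motion field, so this is only the rigid building block, not the algorithm CaImAn actually uses; the upsampling factor and template choice are arbitrary assumptions.

```python
import numpy as np
from skimage.registration import phase_cross_correlation
from scipy.ndimage import shift as nd_shift

def register_frame(frame, template, upsample_factor=10):
    """Rigid, subpixel registration of one frame to a template via
    phase cross-correlation. Illustrative building block only; NoRMCorre
    estimates shifts over overlapping patches and infers a smooth,
    non-rigid motion field."""
    drift, _, _ = phase_cross_correlation(template, frame,
                                          upsample_factor=upsample_factor)
    return nd_shift(frame, drift), drift  # shifted frame and estimated (dy, dx)

# A template could be, for example, the median of a short chunk of frames:
# template = np.median(movie[:200], axis=0)
# corrected, (dy, dx) = register_frame(movie[500], template)
```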

Source extraction

Source extraction is performed using the constrained non-negative matrix factorization (CNMF) framework of Pnevmatikakis et al. (2016), which can extract components with overlapping spatial footprints (Figure 1b). After motion correction the spatio-temporal activity of each source can be expressed as a rank one matrix given by the outer product of two components: a component in space that describes the spatial footprint (location and shape) of each source, and a component in time that describes the activity trace of the source (Figure 1c). The data can be described by the sum of all the resulting rank one matrices together with an appropriate term for the background and neuropil signal and a noise term (Figure 1d). For two-photon data the neuropil signal can be modeled as a low rank matrix (Pnevmatikakis et al., 2016). For microendoscopic data the larger integration volume leads to more complex background contamination (Zhou et al., 2018); therefore, a more descriptive model is required (see Materials and methods (Mathematical model of the CNMF framework) for a mathematical description). CaImAn batch embeds these approaches into a general algorithmic framework that enables scalable automated processing with improved results versus the original CNMF and other popular algorithms, in terms of quality and processing speed.

Deconvolution

Neural activity deconvolution is performed using sparse non-negative deconvolution (Vogelstein et al., 2010; Pnevmatikakis et al., 2016) and implemented using the near-online OASIS algorithm (Friedrich et al., 2017b). The algorithm is competitive with the state of the art according to recent benchmarking studies (Berens et al., 2017). Prior to deconvolution, the traces are detrended to remove non-stationary effects, for example photo-bleaching.
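To make the deconvolution model concrete, the sketch below spells out the sparse non-negative formulation for an AR(1) indicator (c[t] = γ·c[t−1] + s[t], with s[t] ≥ 0) and solves it with plain projected gradient descent. This only illustrates the objective that OASIS optimizes far more efficiently; it is not the implementation used by CaImAn, and the decay constant, penalty, and iteration count are arbitrary choices.

```python
import numpy as np

def deconvolve_ar1(y, gamma=0.95, lam=1.0, n_iter=2000, step=None):
    """Minimal sketch of sparse non-negative deconvolution for an AR(1)
    calcium model. Minimizes 0.5*||c - y||^2 + lam*sum(s) with s >= 0,
    where c = A @ s and A is the AR(1) impulse-response matrix.
    Solved by projected gradient descent; illustrative only."""
    T = len(y)
    # A[t, k] = gamma**(t-k) for t >= k: spikes convolved with the indicator kernel.
    A = np.tril(gamma ** (np.subtract.outer(np.arange(T), np.arange(T))))
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    s = np.zeros(T)
    for _ in range(n_iter):
        grad = A.T @ (A @ s - y) + lam          # gradient of the objective w.r.t. s
        s = np.maximum(s - step * grad, 0.0)    # gradient step + projection onto s >= 0
    return A @ s, s                             # denoised trace and deconvolved activity

# Toy example: a synthetic trace with two transients plus noise.
rng = np.random.default_rng(0)
T, gamma = 300, 0.95
c_true = np.zeros(T)
for t in range(1, T):
    c_true[t] = gamma * c_true[t - 1] + (1.0 if t in (50, 180) else 0.0)
y = c_true + 0.1 * rng.standard_normal(T)
c_hat, s_hat = deconvolve_ar1(y, gamma=gamma, lam=0.5)
```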

Online processing

The three processing steps described above can be implemented in an online fashion using the OnACID algorithm (Giovannucci et al., 2017). The method extends the online dictionary learning framework presented in Mairal et al. (2010) for source extraction, by introducing spatial constraints, adding the capability of finding new components as they appear, and also incorporating the steps of motion correction and deconvolution (Figure 1f). CaImAn extends and improves the OnACID prototype algorithm by introducing a number of algorithmic features and a CNN based component detection approach, leading to a major performance improvement.

We now present the new methods introduced by CaImAn. More details are given in Materials and methods and pseudocode descriptions of the main routines are given in the Appendix.

Batch processing of large scale datasets on standalone machines

The batch processing pipeline mentioned above represents a computational bottleneck. For instance, a naive first step might be to load the full dataset in memory; this approach is not scalable as datasets typically exceed available RAM (and extra memory is required by any analysis pipeline). To limit memory usage, as well as computation time, CaImAn batch relies on a MapReduce approach (Dean and Ghemawat, 2008). Unlike previous work (Freeman et al., 2014), CaImAn batch assumes minimal computational infrastructure (down to a standard laptop computer), is not tied to a particular parallel computation framework, and is compatible with HPC scheduling systems like SLURM (Yoo et al., 2003).

Naive implementations of motion correction algorithms either need to load the full dataset in memory or are constrained to process one frame at a time, preventing parallelization. Motion correction is parallelized in CaImAn batch without significant memory overhead by processing temporal chunks of movie data on different CPUs. First, each chunk is registered with its own template and a new template is formed from the registered data of each chunk. CaImAn batch then broadcasts to each CPU a meta-template, obtained as the median across all templates, which is used to align all the frames in each chunk. Each process writes in parallel to the target file containing the motion-corrected data, which is stored as a memory mapped array. This allows arithmetic operations to be performed against data stored on the hard drive with minimal memory use, and data slices to be indexed and accessed without loading the full file in memory. More details are given in Materials and methods (Memory mapping).
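The following sketch illustrates the memory-mapping idea with numpy's memmap facility; the file name, array layout, dtype, and sizes are placeholders for illustration rather than CaImAn's actual on-disk format.

```python
import numpy as np

# Hypothetical layout: motion-corrected movie stored as a single memory-mapped
# array of shape (n_frames, dim_y, dim_x). Names and dtype are illustrative.
n_frames, dim_y, dim_x = 1000, 512, 512
mmap = np.memmap('motion_corrected.mmap', dtype=np.float32, mode='w+',
                 shape=(n_frames, dim_y, dim_x))

# Each worker can write its own temporal chunk without loading the full movie.
chunk = np.zeros((100, dim_y, dim_x), dtype=np.float32)  # stand-in for registered frames
mmap[0:100] = chunk
mmap.flush()

# Later, arbitrary slices (e.g. a spatial patch over all frames) can be read
# without bringing the whole file into RAM.
patch = np.asarray(mmap[:, 0:64, 0:64])
```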

Similarly, the source extraction problem, especially in the case of detecting cell bodies, is inherently local, with a neuron typically appearing within a small radius around its center of mass (Figure 2a). Exploiting this locality, CaImAn batch splits the FOV into a set of spatially overlapping patches, which enables the parallelization of the CNMF (or any other) algorithm to extract the corresponding set of local spatial and temporal components. The user specifies the size of the patch, the amount of overlap between neighboring patches, and the initialization parameters for each patch (number of components and rank of the background for CNMF, average size of each neuron, stopping criteria for CNMF-E). Subsequently the patches are processed in parallel by the CNMF/CNMF-E algorithm to extract the components and neuropil signals from each patch.
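A minimal sketch of how the FOV could be tiled into overlapping patches is shown below; the patch size and overlap values are arbitrary, and the helper is illustrative rather than CaImAn's actual slicing code.

```python
def patch_slices(dims, patch_size=48, overlap=8):
    """Yield (row_slice, col_slice) pairs covering an FOV of shape `dims`
    with square patches of side `patch_size` that overlap by `overlap`
    pixels. Illustrative helper only."""
    step = patch_size - overlap
    rows, cols = dims
    for r in range(0, rows, step):
        for c in range(0, cols, step):
            yield (slice(r, min(r + patch_size, rows)),
                   slice(c, min(c + patch_size, cols)))

# Example: each patch of a 512 x 512 FOV could then be sent to a separate
# worker, which runs CNMF/CNMF-E on the sub-tensor movie[:, rs, cs].
patches = list(patch_slices((512, 512), patch_size=48, overlap=8))
```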

Figure 2. Parallelized processing and component quality assessment for CaImAn batch.

Figure 2.

(a) Illustration of the parallelization approach used by CaImAn batch for source extraction. The data movie is partitioned into overlapping sub-tensors, each of which is processed in an embarrassingly parallel fashion using CNMF, either on local cores or across several machines in an HPC cluster. The results are then combined. (b) Refinement after combining the results can also be parallelized both in space and in time. Temporal traces of spatially non-overlapping components can be updated in parallel (top) and the contribution of the spatial footprints for each pixel can be computed in parallel (bottom). Parallelization in combination with memory mapping enables large scale processing with moderate computing infrastructure. (c) Quality assessment in space: The spatial footprint of each real component is correlated with the data averaged over time, after removal of all other activity. (d) Quality assessment in time: A high SNR is typically maintained over the course of a calcium transient. (e) CNN based assessment. Top: A 4-layer CNN based classifier is used to classify the spatial footprint of each component as a neuron or not, see Materials and methods (Classification through CNNs) for a description. Bottom: Positive and negative examples for the CNN classifier, during the training (left) and evaluation (right) phases. The CNN classifier can accurately classify shapes and generalizes across datasets from different brain areas.

Apart from harnessing memory and computational benefits due to parallelization, processing in patches intrinsically equalizes dynamic range and enables CaImAn batch to detect neurons across the whole FOV, a feature absent in the original CNMF, where areas with high absolute fluorescence variation tend to be favored. This results in better source extraction performance. After all the patches have been processed, the results are embedded within the FOV (Figure 2a), and the overlapping regions between neighboring patches are processed so that components corresponding to the same neuron are merged. The process is summarized in algorithmic format in Algorithm 1 and more details are given in Materials and methods (Combining results from different patches).

Initialization methods

Due to the non-convex nature of the objective function for matrix factorization, the choice of the initialization method can severely impact the final results. CaImAn batch provides an extension of the GreedyROI method used in Pnevmatikakis et al. (2016), which detects neurons based on localized spatiotemporal activity. CaImAn batch can also be seeded with binary masks that are obtained from different sources, for example through manual annotation or segmentation of a structural channel (SeededInitialization, Algorithm 3). More details are given in Materials and methods (Initialization strategies).

Automated component evaluation and classification

A common limitation of matrix factorization algorithms is that the number of components that the algorithm seeks during its initialization must be pre-determined by the user. For example, Pnevmatikakis et al. (2016) suggest detecting a large number of components which are then ordered according to their size and activity pattern, with the user deciding on a cut-off threshold. When processing large datasets in patches the target number of components is passed on to every patch implicitly assuming a uniform density of (active) neurons within the entire FOV. This assumption does not hold in the general case and can produce many spurious components. CaImAn introduces tests, based on unsupervised and supervised learning, to assess the quality of the detected components and eliminate possible false positives. These tests are based on the observation that active components are bound to have a distinct localized spatio-temporal signature within the FOV. In CaImAn batch, these tests are initially applied after the processing of each patch is completed, and additionally as a post-processing step after the results from the patches have been merged and refined, whereas in CaImAn online they are used to screen new candidate components. We briefly present these tests below and refer to Materials and methods (Details of quality assessment tests) for more details:

Spatial footprint consistency: To test whether a detected component is spurious, we correlate the spatial footprint of this component with the average frame of the data, taken over the intervals when the component, with no other overlapping component, was active (Figure 2c). The component is rejected if the correlation coefficient is below a certain threshold θsp (e.g., θsp<0.5).

Trace SNR: For each component we compute the peak SNR of its temporal trace averaged over the duration of a typical transient (Figure 2d). The component is rejected if the computed SNR is below a certain threshold θSNR (e.g., θSNR=2).

CNN based classification: We also trained a 4-layer convolutional neural network (CNN) to classify spatial footprints into true or false components (Figure 2e), where a true component here corresponds to a spatial footprint that resembles the soma of a neuron. The classifier, which we call batch classifier, was trained on a small corpus of manually annotated datasets (full description given in section Benchmarking against consensus annotation) and exhibited similar high classification performance on test samples from different datasets.
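As an illustration of the unsupervised tests above, the sketch below implements a simplified version of the trace SNR criterion. The noise estimator, window length, and default values are assumptions made for illustration and differ in detail from CaImAn's implementation.

```python
import numpy as np

def trace_passes_snr_test(trace, frame_rate, decay_time=0.4, thr_snr=2.0):
    """Simplified trace-SNR test: z-score the trace against a robust noise
    estimate and require the peak, averaged over the duration of a typical
    transient, to exceed `thr_snr`. Illustrative only."""
    # Robust noise estimate from the median absolute deviation of the
    # first-order differences (insensitive to slow drifts and transients).
    noise = np.median(np.abs(np.diff(trace))) / 0.6745 / np.sqrt(2)
    z = (trace - np.median(trace)) / max(noise, 1e-12)
    # Average the z-scored trace over a window matching a typical transient.
    win = max(int(round(decay_time * frame_rate)), 1)
    smoothed = np.convolve(z, np.ones(win) / win, mode='same')
    return smoothed.max() >= thr_snr
```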

While CaImAn uses the CNMF algorithm, the tests described above can be applied to results obtained from any source extraction algorithm, highlighting the modularity of our tools.

Online analysis with CaImAn online

CaImAn supports online analysis on streaming data, building on the core of the prototype algorithm of Giovannucci et al. (2017), and extending it in terms of qualitative performance and computational efficiency:

Initialization: Apart from initializing CaImAn online with CaImAn batch on a small time interval, CaImAn online can also be initialized in a bare form over an even smaller time interval, where only the background components are estimated and all the components are determined during the online analysis. This process, named BareInitialization, can be achieved by running the CNMF algorithm (Pnevmatikakis et al., 2016) over the small interval to estimate the background components and possibly a small number of components. The SeededInitialization of Algorithm 3 can also be used.

Deconvolution: Instead of a separate step after demixing as in Giovannucci et al., 2017, deconvolution here can be performed simultaneously with the demixing online, leading to more stable traces especially in cases of low-SNR, as also observed in Pnevmatikakis et al. (2016). Online deconvolution can also be performed for models that assume second order calcium dynamics, bringing the full power of Friedrich et al., 2017b to processing of streaming data.

Epochs: CaImAn online supports multiple passes over the data, a process that can detect early activity of neurons that were not picked up during the initial pass, as well as smooth the activity of components that were detected at late stages during the first epoch.

New component detection using a CNN: To search for new components in a streaming setup, OnACID keeps a buffer of the residual frames, computed by subtracting the activity of already found components and background signals. Candidate components are determined by looking for points of maximum energy in this residual signal, after some smoothing and dynamic range equalization. For each such point identified, a candidate shape and trace are constructed using a rank-1 NMF in a local neighborhood around this point (a minimal sketch of this construction is given after this list of features). In its original formulation (Giovannucci et al., 2017), the shape of the component was evaluated using the space correlation test described above. Here, we use a CNN classifier approach that tests candidate components by examining their spatial footprint as obtained by the average of the residual buffer across time. This online classifier (different from the batch classifier for quality assessment described above) is trained to be strict, minimizing the number of false positive components that enter the online processing pipeline. It can test multiple components in parallel, and it achieves better performance with no hyper-parameter tuning compared to the previous approach. More details on the architecture and training procedure are given in Materials and methods (Classification through CNNs). The identification of candidate components is further improved by performing spatial high pass filtering on the average residual buffer to enhance its contrast. The new process for detecting neurons is described in Algorithms 4 and 5. See Videos 1 and 2 for a detailed graphical description of the new component detection step.

Distributed update of spatial footprints: A time limiting step in OnACID (Giovannucci et al., 2017) is the periodic update of all spatial footprints at given frames. This constraint is lifted with CaImAn online, which distributes the update of spatial footprints among all frames, ensuring a similar processing speed for each frame. See Materials and methods (Distributed shape update) for more details.
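As noted in the new-component detection step above, each candidate is constructed via a rank-1 NMF on a small neighborhood of the residual buffer. The following sketch shows that construction using simple alternating non-negative least-squares updates; the buffer shape, iteration count, and update scheme are illustrative assumptions, not CaImAn's actual implementation.

```python
import numpy as np

def rank1_nmf_patch(residual_patch, n_iter=30):
    """Construct a candidate spatial footprint and temporal trace from a
    residual buffer restricted to a small neighborhood around a point of
    maximum energy. `residual_patch` has shape (buffer_len, h, w).
    Rank-1 NMF via alternating non-negative least squares; illustrative only."""
    T, h, w = residual_patch.shape
    Y = np.maximum(residual_patch.reshape(T, h * w), 0)  # enforce non-negativity
    a = Y.mean(axis=0) + 1e-12        # initial spatial footprint (h*w,)
    c = Y.mean(axis=1) + 1e-12        # initial temporal trace (T,)
    for _ in range(n_iter):
        c = np.maximum((Y @ a) / max(a @ a, 1e-12), 0)    # least-squares update of the trace
        a = np.maximum((Y.T @ c) / max(c @ c, 1e-12), 0)  # least-squares update of the footprint
    return a.reshape(h, w), c
```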

Component registration across multiple sessions

CaImAn provides a method to register components from the same FOV across different sessions. The method uses an intersection over union metric to calculate the distance between different cells in different sessions and solves a linear assignment problem to perform the registration in a fully automated way (RegisterPair, Algorithm 7). To register the components between more than two sessions (RegisterMulti, Algorithm 8), we order the sessions chronologically and register the components of the current session against the union of components of all the past sessions aligned to the current FOV. This allows for the tracking of components across multiple sessions without the need for pairwise registration between each pair of sessions. More details as well as discussion of other methods (Sheintuch et al., 2017) are given in Materials and methods (Component registration).
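The pairwise step can be illustrated with a short sketch that combines an intersection-over-union distance with SciPy's linear assignment solver. The distance threshold and the binary-mask representation are simplifying assumptions made for illustration; the actual RegisterPair procedure differs in its details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def register_pair(masks_a, masks_b, max_dist=0.5):
    """Register two sessions' components via an intersection-over-union
    distance and the Hungarian algorithm. `masks_a` and `masks_b` are lists
    of boolean footprints on the same (aligned) FOV. Illustrative sketch."""
    D = np.ones((len(masks_a), len(masks_b)))
    for i, ma in enumerate(masks_a):
        for j, mb in enumerate(masks_b):
            union = np.logical_or(ma, mb).sum()
            if union > 0:
                iou = np.logical_and(ma, mb).sum() / union
                D[i, j] = 1.0 - iou          # distance = 1 - IoU
    rows, cols = linear_sum_assignment(D)     # optimal one-to-one assignment
    # Keep only assignments whose footprints overlap sufficiently.
    return [(i, j) for i, j in zip(rows, cols) if D[i, j] <= max_dist]
```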

Benchmarking against manual annotations

To quantitatively evaluate CaImAn we benchmarked its results against manual annotations.

Creating consensus labels through manual annotation

We collected manual annotations from multiple independent labelers who were instructed to find round or donut-shaped active neurons (since the proteins expressing the calcium indicator are confined outside the cell nuclei, neurons appear as ring shapes with a dark disk in the center) in nine two-photon in vivo mouse brain datasets. To distinguish between active and inactive neurons, the annotators were given the max-correlation image for each dataset; the value of the correlation image at each pixel represents the average correlation (across time) between that pixel and its neighbors (Smith and Häusser, 2010). This summarization can enhance active neurons and suppress the neuropil in two-photon datasets (Figure 3—figure supplement 1a); see Materials and methods (Collection of manual annotations) for more information. In addition, the annotators were given a temporally decimated, background-subtracted movie of each dataset. The datasets were collected in various labs and from various brain areas (hippocampus, visual cortex, parietal cortex) using several GCaMP variants. A summary of the features of all the annotated datasets is given in Table 2.
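For illustration, a correlation image of the kind described above can be computed as follows; the neighborhood handling (edges wrap around) and normalization are simplifications relative to the implementation actually used to generate the annotation images.

```python
import numpy as np

def correlation_image(movie):
    """For each pixel, the average temporal correlation with its 8 immediate
    neighbors. `movie` has shape (n_frames, dim_y, dim_x). Illustrative
    sketch; edges wrap around for simplicity."""
    # Z-score each pixel's trace so that averaged products become correlations.
    m = movie - movie.mean(axis=0)
    m /= (m.std(axis=0) + 1e-12)
    T = m.shape[0]
    corr_sum = np.zeros(m.shape[1:])
    n_neigh = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(m, dy, axis=1), dx, axis=2)
            corr_sum += (m * shifted).sum(axis=0) / T  # correlation with this neighbor
            n_neigh += 1
    return corr_sum / n_neigh
```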

To address human variability in manual annotation each dataset was labeled by 3 or 4 independent labelers, and the final consensus annotation dataset was created by having the different labelers reach a consensus over their disagreements (Figure 3a). The consensus annotation was taken as ‘ground truth’ for the purpose of benchmarking CaImAn and each individual labeler (Figure 3b). More details are given in Materials and methods (Collection of manual annotations). We believe that the current database, which is publicly available at https://users.flatironinstitute.org/~neuro/caiman_paper, presents an improvement over the existing Neurofinder database (http://neurofinder.codeneuro.org/) in several aspects:

Figure 3. Consensus annotation generation.

(a) Top: Individual manual annotations on the dataset K53 (only part of the FOV is shown) for labelers L1 (left), L2 (middle), L3 (right). Contour plots are plotted against the max-correlation image of the dataset. Bottom: Disagreements between L1 and L2 (left), and consensus labels (right). In this example, consensus considerably reduced the number of initially selected neurons. (b) Matches (top) and mismatches (bottom) between each individual labeler and the consensus annotation. Red contours on the mismatches panels denote false negative contours, that is, components in the consensus not selected by the corresponding labeler, whereas yellow contours indicate false positive contours. Performance of each labeler is given in terms of precision/recall and F1 score and indicates an unexpected level of variability between individual labelers.

Figure 3.

Figure 3—figure supplement 1. Construction of components obtained from consensus annotation.

Figure 3—figure supplement 1.

(a) The correlation image can efficiently display active neurons. Comparison of the median across time (top) and max-correlation (bottom) image for annotated datasets J115 (left), K53 (middle) and YST (right). In all cases, the correlation image aids in manual annotation by providing an efficient way to remove neuropil contamination and visualize the footprints of active neurons. (b) Contour plots of manual annotations (left) vs spatial footprints obtained after running SeededInitialization (right), for dataset J115, overlaid against the mean image. Manual annotations are restricted to be of ellipsoid shape whereas pre-processing with SeededInitialization allows the spatial footprints to adapt to the footprint of each neuron in the FOV. (c) Thresholding of spatial footprints selects the most prominent part of each neuron for comparison against ground truth. Left: Four examples of non-thresholded components overlaid with their corresponding contours. Right: Same as left, but including all neurons within a small region. Finding an optimal threshold to generate consistent binary masks can be challenging.

Consistency: The datasets are annotated using exactly the same procedure (see Materials and methods), and in all datasets the goal is to detect only active cells. In contrast, the annotation of the various Neurofinder datasets is performed either manually or automatically by segmenting an image of a static (structural) indicator. Even though structural indicators could be used for ground truth extraction, the segmentation of such images is not a straightforward problem in the case of dense expression, and the stochastic expression of indicators can lead to mismatches between functional and structural indicators.

Uncertainty quantification: By employing more than one human labeler we discovered a surprising level of disagreement between different annotators (see Table 1, Figure 3b for details). This result indicates that individual annotations can be unreliable for benchmarking purposes and that unreproducible scientific results might ensue. The combination of the various annotations leads to more reliable set of labels and also quantifies the limits of human performance.

Table 1. Results of each labeler, CaImAn batch and CaImAn online algorithms against consensus annotation.

Results are given in the form: F1 score, number of active neurons found, (precision, recall). Empty entries (marked X) correspond to datasets not manually annotated by the specific labeler. The number of frames for each dataset, as well as the number of neurons that each labeler and algorithm found, are also given. In italics, the datasets used to train the CNN classifiers.

Name      | # of frames | L1                     | L2                     | L3                     | L4                     | CaImAn batch           | CaImAn online
N.01.01   | 1825        | 0.80 241 (0.95, 0.69)  | 0.89 287 (0.96, 0.83)  | 0.78 386 (0.73, 0.84)  | 0.75 289 (0.80, 0.70)  | 0.76 317 (0.76, 0.77)  | 0.75 298 (0.81, 0.70)
N.03.00.t | 2250        | X                      | 0.90 188 (0.88, 0.92)  | 0.85 215 (0.78, 0.93)  | 0.78 206 (0.73, 0.83)  | 0.78 154 (0.76, 0.80)  | 0.74 150 (0.79, 0.70)
N.00.00   | 2936        | X                      | 0.92 425 (0.93, 0.91)  | 0.83 402 (0.86, 0.80)  | 0.87 358 (0.96, 0.80)  | 0.72 366 (0.79, 0.67)  | 0.69 259 (0.87, 0.58)
YST       | 3000        | 0.78 431 (0.76, 0.81)  | 0.90 465 (0.85, 0.97)  | 0.82 505 (0.75, 0.92)  | 0.79 285 (0.96, 0.67)  | 0.77 332 (0.85, 0.70)  | 0.77 330 (0.84, 0.70)
N.04.00.t | 3000        | X                      | 0.69 471 (0.54, 0.97)  | 0.75 411 (0.61, 0.97)  | 0.87 326 (0.78, 0.98)  | 0.69 218 (0.69, 0.70)  | 0.70 260 (0.68, 0.72)
N.02.00   | 8000        | 0.89 430 (0.86, 0.93)  | 0.87 382 (0.88, 0.85)  | 0.84 332 (0.92, 0.77)  | 0.82 278 (1.00, 0.70)  | 0.78 351 (0.78, 0.78)  | 0.78 334 (0.85, 0.73)
J123      | 41000       | X                      | 0.83 241 (0.73, 0.96)  | 0.90 181 (0.91, 0.90)  | 0.91 177 (0.92, 0.89)  | 0.73 157 (0.88, 0.63)  | 0.82 172 (0.85, 0.80)
J115      | 90000       | 0.85 708 (0.96, 0.76)  | 0.93 869 (0.94, 0.91)  | 0.94 880 (0.95, 0.93)  | 0.83 635 (1.00, 0.71)  | 0.78 738 (0.87, 0.71)  | 0.79 1091 (0.71, 0.89)
K53       | 116043      | 0.89 795 (0.96, 0.83)  | 0.92 928 (0.92, 0.92)  | 0.93 875 (0.95, 0.91)  | 0.83 664 (1.00, 0.72)  | 0.76 809 (0.80, 0.72)  | 0.81 1025 (0.77, 0.87)
mean ± std|             | 0.84±0.05 (0.90±0.08, 0.80±0.08) | 0.87±0.07 (0.85±0.13, 0.92±0.05) | 0.85±0.06 (0.83±0.11, 0.88±0.06) | 0.83±0.09 (0.91±0.10, 0.78±0.10) | 0.754±0.03 (0.80±0.06, 0.72±0.05) | 0.762±0.05 (0.82±0.06, 0.73±0.10)
Comparing CaImAn against manual annotations

To compare CaImAn against the consensus annotation, the manual annotations were used as binary masks to construct the consensus spatial and temporal components, using the SeededInitialization procedure (Algorithm 3) of CaImAn batch. This step is necessary to adapt the manual annotations to the shapes of the actual spatial footprints of each neuron in the FOV (Figure 3—figure supplement 1), as manual annotations primarily produced elliptical shapes. The set of spatial footprints obtained from CaImAn is registered against the set of consensus spatial footprints (derived as described above) using our component registration algorithm RegisterPair (Algorithm 7). Performance is then quantified using a precision/recall framework similar to other studies (Apthorpe et al., 2016; Giovannucci et al., 2017).
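For reference, once matched components have been counted by the registration step, the precision/recall framework amounts to the standard formulas sketched below; the numbers in the usage comment are made up for illustration only.

```python
def precision_recall_f1(n_matched, n_inferred, n_consensus):
    """Precision, recall, and F1 score for component detection, given the
    number of matched components (from registration), the number of
    components inferred by the algorithm, and the number of consensus
    components."""
    precision = n_matched / n_inferred
    recall = n_matched / n_consensus
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 300 matched components out of 350 inferred and
# 400 consensus components:
# precision_recall_f1(300, 350, 400)  # -> (0.857, 0.75, 0.80)
```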

Software

CaImAn is developed by and for the community. Python open source code for the above-described methods is available at https://github.com/flatironinstitute/CaImAn (Giovannucci et al., 2018; copy archived at https://github.com/elifesciences-publications/CaImAn). The repository contains documentation, several demos, and Jupyter notebook tutorials, as well as visualization tools, and a message/discussion board. The code, which is compatible with Python 3, uses several open-source libraries, such as OpenCV (Bradski, 2000), scikit-learn (Pedregosa et al., 2011), and scikit-image (van der Walt et al., 2014). Most routines are also available in MATLAB at https://github.com/flatironinstitute/CaImAn-MATLAB (Pnevmatikakis et al., 2018; copy archived at https://github.com/elifesciences-publications/CaImAn-MATLAB). We provide tips for efficient data analysis at https://github.com/flatironinstitute/CaImAn/wiki/CaImAn-Tips. All the annotated datasets together with the individual and consensus annotation are available at https://users.flatironinstitute.org/~neuro/caiman_paper. All the material is also available from the Zenodo repository at https://zenodo.org/record/1659149/export/hx#.XC_Rms9Ki9t

Results

Manual annotations show a high degree of variability

We compared the performance of each human annotator against a consensus annotation. The performance was quantified with a precision/recall framework and the results of each individual labeler against the consensus annotation for each dataset are given in Table 1. The range of human performance in terms of F1 score was 0.69–0.94. All annotators performed similarly on average (0.84 ± 0.05, 0.87 ± 0.07, 0.85 ± 0.06, 0.83 ± 0.08). We also ensured that the performance of labelers was stable across time (i.e. their learning curve plateaued, data not shown). As shown in Table 1 (see also Figure 4b) the F1 score was never 1, and in most cases it was less than or equal to 0.9, demonstrating significant variability between annotators. Figure 3 (bottom) shows an example of matches and mismatches between individual labelers and the consensus annotation for dataset K53, where the level of agreement was relatively high. The high degree of variability between human responses indicates the challenging nature of the source extraction problem and raises reproducibility concerns in studies relying heavily on manual ROI selection.

Figure 4. Evaluation of CaImAn performance against manually annotated data.

(a) Comparison of CaImAn batch (top) and CaImAn online (bottom) when benchmarked against consensus annotation for dataset K53. For a portion of the FOV, correlation image overlaid with matches (left panels, red: consensus, yellow: CaImAn) and mismatches (right panels, red: false negatives, yellow: false positives). (b) Performance of CaImAn batch and CaImAn online vs average human performance (blue). For each algorithm the results with both the same parameters for each dataset and with the optimized per dataset parameters are shown. CaImAn batch and CaImAn online reach near-human accuracy for neuron detection. Complete results with precision and recall for each dataset are given in Table 1. (c–e) Performance of CaImAn batch increases with peak SNR. (c) Example of scatter plot between SNRs of matched traces between CaImAn batch and consensus annotation for dataset K53. False negative/positive pairs are plotted in green along the x- and y-axes respectively, perturbed as a point cloud to illustrate the density. Most false positive/negative predictions occur at low SNR values. Shaded areas represent thresholds above which components are considered for matching (blue for CaImAn batch and red for consensus selected components). (d) F1 score and upper/lower bounds of CaImAn batch for all datasets as a function of various peak SNR thresholds. Performance of CaImAn batch increases significantly for neurons with high peak SNR traces (see text for definition of metrics and the bounds). (e) Precision and recall of CaImAn batch as a function of peak SNR for all datasets. The same trend is observed for both precision and recall.

Figure 4.

Figure 4—figure supplement 1. Performance of CaImAn online over different choices of parameters.

Figure 4—figure supplement 1.

Performance of CaImAn online over different choices of three parameters: θ = (trace SNR threshold, CNN threshold, number of candidate components). F1 scores (top), precision (middle) and recall (bottom) are shown for all labeled datasets for four different cases: a low threshold/large number setting (red) where θ = (1.2, 0.65, 10), a high threshold/low number setting (blue) where θ = (2, 0.6, 5), the setting that maximizes performance averaged over all datasets (green), θ = (1.2, 0.65, 10), and the maximum F1 score (and corresponding precision/recall) for each dataset (magenta). Lower threshold settings are more desirable for shorter datasets (N.03.00.t, N.04.00.t, N.00.00, N.01.01, YST) because they achieve high recall rates without a big penalty in precision. On the contrary, higher threshold settings are more desirable for longer datasets (J115, J123, K53) because they achieve high precision without a big penalty in recall.
Figure 4—figure supplement 2. CaImAn batch outperforms the Suite2p algorithm in all datasets when benchmarked against the consensus annotation.

Figure 4—figure supplement 2.

(a) Contour plots of selected components against consensus annotation (CA) for CaImAn batch (left) and Suite2p with the use of a classifier (middle), and direct comparison between the algorithms (right) for the test dataset N.00.00. CaImAn batch identifies better components with a weak footprint in the summary correlation image. (b) Performance metrics F1 score (top), precision (middle) and recall (bottom), for CaImAn batch and for Suite2p (with and without the use of the classifier), for the eight test datasets. CaImAn batch consistently outperforms Suite2p, which can have significant variations between precision and recall. See Materials and methods (Comparison with Suite2p) for more details on the comparison.

This process may have generated slightly biased results in favor of each individual annotator, as the consensus annotation is always a subset of the union of the individual annotations. We also used an alternative cross-validation approach, where the labels of each annotator were compared with the combined results of the remaining annotators. The combination was constructed using a majority vote when a dataset was labeled by 4 annotators, or an intersection of selections when a dataset was labeled by 3. The results (see Table 3 in Materials and methods) indicate an even higher level of disagreement between the annotators, with a lower average F1 score of 0.82 ± 0.06 (mean ± STD) and a range of values 0.68–0.90. More details are given in Materials and methods (Cross-Validation analysis of manual annotations).

CaImAn batch and CaImAn online detect neurons with near-human accuracy

We first benchmarked CaImAn batch and CaImAn online against the consensus annotation for the task of identifying neuron locations and their spatial footprints, using the same precision/recall framework (Table 1). Figure 4a shows an example dataset (K53) along with neuron-wise matches and mismatches between CaImAn batch vs consensus annotation (top) and CaImAn online vs consensus annotation (bottom).

The results indicate a similar performance between CaImAn batch and CaImAn online: CaImAn batch achieved F1 scores in the range 0.69–0.78 with average performance 0.75 ± 0.03 (mean ± STD), whereas CaImAn online achieved F1 scores in the range 0.70–0.82 with average performance 0.76 ± 0.05. While the two algorithms performed similarly on average, CaImAn online tends to perform better for longer datasets (e.g., datasets J115, J123, K53, which all have more than 40000 frames; see also Table 2 for characteristics of the various datasets). CaImAn batch operates on the entire dataset at once, representing each spatial footprint with a vector that is constant in time. In contrast, CaImAn online operates at a local level, looking at a short window over time to detect new components, while adaptively changing their spatial footprint based on new data. This enables CaImAn online to adapt to slow non-stationarities that can appear in long experiments.

Table 2. Properties of manually annotated datasets.

For each dataset the duration, imaging rate and calcium indicator are given, as well as the number of active neurons selected after consensus between the manual annotators.

Name  Brain area  Lab  Rate (Hz)  Size (T × X × Y)  Indicator  # Labelers  # Neurons (CA)
NF.01.01 Visual Cortex Hausser 7 1825 × 512 × 512 GCaMP6s 4 333
NF.03.00.t Hippocampus Losonczy 7 2250 × 498 × 467 GCaMP6f 3 178
NF.00.00 Cortex Svoboda 7 2936 × 512 × 512 GCaMP6s 3 425
YST Visual Cortex Yuste 10 3000 × 200 × 256 GCaMP3 4 405
NF.04.00.t Cortex Harvey 7 3000 × 512 × 512 GCaMP6s 3 257
NF.02.00 Cortex Svoboda 30 8000 × 512 × 512 GCaMP6s 4 394
J123 Hippocampus Tank 30 41000 × 458 × 477 GCaMP5 3 183
J115 Hippocampus Tank 30 90000 × 463 × 472 GCaMP5 4 891
K53 Parietal Cortex Tank 30 116043 × 512 × 512 GCaMP6f 4 920

CaImAn approaches but is in most cases below the accuracy levels of human annotators (Figure 4b). We attribute this to two primary factors: First, CNMF detects active components regardless of their shape, and can detect non-somatic structures with significant transients. While non-somatic components can be filtered out to some extent using the CNN classifier, their existence degrades performance compared to the manual annotations that consist only of neurons. Second, to demonstrate the generality and ease of use of our tools, the results presented here are obtained by running CaImAn batch and CaImAn online with exactly the same parameters for each dataset (see Materials and methods (Implementation details)): fine-tuning to each individual dataset can significantly increase performance (Figure 4b).

To test the latter point we measured the performance of CaImAn online on the nine datasets as a function of three parameters: (i) the trace SNR threshold for testing the traces of candidate components, (ii) the CNN threshold for testing the shapes of candidate components, and (iii) the number of candidate components to be tested at each frame (more details can be found in Materials and methods (Implementation details for CaImAn online)). By choosing the parameter combination that maximizes the F1 score for each dataset, the performance generally increases across the datasets, with F1 scores in the range 0.72–0.85 and average performance 0.78±0.05 (see Figure 4 (orange) and Figure 4—figure supplement 1 (magenta)). This analysis also shows that, in general, a strategy of testing a large number of components per timestep but with stricter criteria achieves better results than testing fewer components with looser criteria (at the expense of increased computational cost). The results also indicate different strategies for parameter choice depending on the length of a dataset: lower threshold values and/or a larger number of candidate components (Figure 4—figure supplement 1 (red)) lead to better performance for shorter datasets, but can decrease precision and overall performance for longer datasets. The opposite holds for higher threshold values and/or a smaller number of candidate components (Figure 4—figure supplement 1 (blue)), where CaImAn online for shorter datasets can suffer from lower recall values, whereas for longer datasets CaImAn online can add neurons over a longer period of time while maintaining high precision values and thus achieve better performance. A similar grid search was also performed for the CaImAn batch algorithm, where four parameters of the component evaluation step (space correlation, trace SNR, min/max CNN thresholds) were optimized individually to filter out false positives. This procedure led to F1 scores in the range 0.71–0.81 and average performance 0.774±0.034 (Figure 4 (red)).

We also compared the performance of CaImAn against Suite2p (Pachitariu et al., 2017), another popular calcium imaging data analysis package. Using a small grid search around some default parameters of Suite2p, we extracted the set of parameters that worked best on the eight datasets where the algorithm converged (Suite2p did not converge on dataset J123). CaImAn outperformed Suite2p in all datasets, with the latter obtaining F1 scores in the range 0.41–0.75 and average performance 0.55±0.12. More details about the comparison are shown in Figure 4—figure supplement 2 and Materials and methods (Comparison with Suite2p).

Neurons with higher SNR transients are detected more accurately

For the parameters that yielded on average the best results (see Table 1), both CaImAn batch and CaImAn online exhibited higher precision than recall (0.8±0.06 vs 0.72±0.05 for CaImAn batch, and 0.82±0.06 vs 0.73±0.1 for CaImAn online, respectively). This can be partially explained by the component evaluation steps at the end of patch processing for CaImAn batch (Figure 1e) and at the end of each frame for CaImAn online, which aim to filter out false positive components, thus increasing precision while leaving recall intact (or in fact lowering it in cases where true positive components are filtered out). To better understand this behavior, we analyzed the CaImAn batch performance as a function of the SNR of the inferred and consensus traces (Figure 4c–e). The SNR measure of a trace corresponds to the peak SNR averaged over the length of a typical trace (see Materials and methods (Detecting fluorescence traces with high SNR)). An example is shown in Figure 4c, where the scatter plot of SNR between matched consensus and inferred traces is shown (false negative/positive components are shown along the x- and y-axis, respectively). To evaluate the performance we computed a precision metric as the fraction of inferred components above a certain SNR threshold that are matched with a consensus component (Figure 4c, shaded blue). Similarly, we computed a recall metric as the fraction of consensus components above an SNR threshold that are detected by CaImAn batch (Figure 4c, shaded red), and an F1 score as the harmonic mean of the two (Figure 4d). The results indicate that the performance improves significantly as a function of the SNR for all datasets considered, increasing on average from 0.73 when all neurons are considered to 0.92 when only neurons with trace SNR of at least 9 are considered (Figure 4d). This increase in the F1 score reflects an increase in both the precision and the recall as a function of the SNR (Figure 4e). Note that these precision and recall metrics are computed on different sets of neurons, and therefore, strictly speaking, one cannot combine them to form an F1 score. However, they can be bounded from above by evaluating them on the set of matched and non-matched components where at least one trace is above the threshold (union of blue and pink zones in Figure 4c), or from below by considering only matched and non-matched components where both consensus and inferred traces have SNR above the threshold (intersection of blue and pink zones in Figure 4c). In practice these bounds were very tight for all but one dataset (Figure 4d). More details can be found in Materials and methods (Performance quantification as a function of SNR). A similar trend is also observed for CaImAn online (data not shown).

CaImAn reproduces the consensus traces with high fidelity

Testing the quality of the inferred traces is more challenging due to the unavailability of ground truth data in the context of large scale in vivo recordings. As mentioned above, we defined as ‘ground truth’ the traces obtained by running the CNMF algorithm seeded with the binary masks obtained from the consensus annotation procedure. After spatial alignment with the results of CaImAn, the matched traces were compared, both for CaImAn batch and for CaImAn online. Figure 5a shows an example of five of these traces for the dataset K53, showing very similar behavior of the traces in these three different cases.

Figure 5. Evaluation of CaImAn extracted traces against traces derived from consensus annotation.

Figure 5.

(a) Examples of shapes (left) and traces (right) are shown for five matched components extracted from dataset K53 for consensus annotation (CA, black), CaImAn batch (yellow) and CaImAn online (red) algorithms. The dashed gray portion of the traces is also shown magnified (bottom-right). Consensus spatial footprints and traces were obtained by seeding CaImAn with the consensus binary masks. The traces extracted from both versions of CaImAn match closely the consensus traces. (b) Slope graph for the average correlation coefficient for matches between consensus and CaImAn batch, and between consensus and CaImAn online. Batch processing produces traces that match more closely the traces extracted from the consensus labels. (c) Empirical cumulative distribution functions of correlation coefficients aggregated over all the tested datasets. Both distributions exhibit a sharp derivative close to 1 (last bin), with the batch approach giving better results.

To quantify the similarity we computed the correlation coefficients of the traces (consensus vs CaImAn batch, and consensus vs CaImAn online) for all nine datasets (Figure 5b–c). Results indicated that for all but one dataset (Figure 5b) CaImAn batch reproduced the traces with higher fidelity; in all cases the mean correlation coefficient was higher than 0.9, and the empirical histogram of correlation coefficients peaked at the maximum bin 0.99–1 (Figure 5c). The results indicate that the batch approach extracts traces closer to the consensus traces. This can be attributed to a number of reasons: by processing all the time points simultaneously, the batch approach can smooth the trace estimation over the entire time interval, as opposed to the online approach where at each timestep only the information up to that point is considered. Moreover, CaImAn online might not detect a neuron until it becomes strongly active. This neuron’s activity before detection is unknown and is assigned a default value of zero, resulting in a lower correlation coefficient. While this can be ameliorated to a great extent with additional passes over the data, the results illustrate the trade-offs inherent between online and batch algorithms.
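In practice, this comparison reduces to computing, for every matched pair of components, the Pearson correlation between the consensus-seeded trace and the trace recovered by CaImAn. A minimal sketch of this computation is given below, assuming the matched traces have already been paired and stored as rows of two arrays (variable names are illustrative, not part of the CaImAn API):

```python
import numpy as np

def trace_correlations(consensus_traces, inferred_traces):
    """Pearson correlation for each matched pair of traces.

    Both inputs are assumed to have shape (n_matched, T), with row i of
    one array matched to row i of the other.
    """
    corrs = []
    for c_cons, c_inf in zip(consensus_traces, inferred_traces):
        # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry
        corrs.append(np.corrcoef(c_cons, c_inf)[0, 1])
    return np.array(corrs)

# Per-dataset summaries as in Figure 5b-c (hypothetical arrays):
# corrs = trace_correlations(C_consensus, C_caiman)
# print(corrs.mean(), (corrs > 0.9).mean())
```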

Online analysis of a whole brain zebrafish dataset

We tested CaImAn online with a 380 GB whole brain dataset of larval zebrafish (Danio rerio) acquired with a light-sheet microscope (Kawashima et al., 2016). The imaged transgenic fish (Tg(elavl3:H2B-GCaMP6f)jf7) expressed the genetically encoded calcium indicator GCaMP6f in almost all neuronal nuclei. Data from 45 planes (FOV 820 × 410 μm2, spaced at 5.5 μm intervals along the dorso-ventral axis) was collected at 1 Hz for 30 min (for details about preparation, equipment and experiment refer to Kawashima et al. (2016)). With the goal of simulating real-time analysis of the data, we ran all 45 planes in parallel on a computing cluster with nine nodes (each node equipped with 24 CPUs and 128–256 GB RAM, running Linux CentOS). Data was not stored locally on each machine but accessed directly from a network drive.

The algorithm was initialized by running CaImAn batch on the first 200 frames, looking for 500 components. The small number of frames (1885) and the large FOV size (2048×1188 pixels) for this dataset motivated this choice of an increased number of components during initialization. In Figure 6 we report the results of the analysis for plane 11 of 45, for which CaImAn online found 1524 neurons after processing 1685 frames. Since no ground truth was available for this dataset, it was only possible to evaluate the performance of the algorithm by visual inspection. CaImAn online identified all the neurons with a clear footprint in the underlying correlation image (higher SNR, Figure 6a) and missed a small number of the fainter ones (low SNR). By visual inspection of the components the authors could find very few false positives. Given that the parameters were not tuned and that the classifier was not trained on zebrafish neurons, we hypothesize that the algorithm is biased towards a high precision result. Spatial components displayed the expected morphological features of neurons (Figure 6b–c). Considering all the planes (Figure 6e and Figure 6—figure supplement 1), CaImAn online was able to identify a total of 66,108 neurons in a single pass over the data. See Video 1 for a summary across all planes. The analysis was performed in 21 min, with the first three minutes spent on initialization and the remaining 18 on processing the data in streaming mode (in parallel for each plane). This demonstrates the ability of CaImAn online to process large amounts of data in real-time (see also Figure 8 for a discussion of computational performance).

Figure 6. Online analysis of a 30 min long whole brain recording of the zebrafish brain.

(a) Correlation image overlaid with the spatial components (in red) found by the algorithm (portion of plane 11 out of 45 planes in total). (b) Correlation image (left) and mean image (right) for the dashed region in panel (a), with the contours of the neurons marked in (a) superimposed. (c) Spatial (left) and temporal (right) components associated with the ten example neurons marked in panel (a). (d) Temporal traces for all the neurons found in the FOV in (a); the initialization on the first 200 frames contained 500 neurons (present since time 0). (e) Number of neurons found per plane (see also Figure 6—figure supplement 1 for a summary of the results from all planes).


Figure 6—figure supplement 1. Spatial and temporal components for all planes.


Profile of spatial (left) and temporal (right) components found in each plane of the whole brain zebrafish recording. (Left) Components are extracted with CaImAn online and then max-thresholded. (Right) See Results section for a complete discussion.
Figure 6—video 1. Results of CaImAn online initialized by CaImAn batch on a whole brain zebrafish dataset.
Download video file (10.8MB, mp4)
DOI: 10.7554/eLife.38173.015
Each panel shows the active neurons in a given plane (top-to-bottom) without any background activity. See the text for more details.

Video 1. Depiction of CaImAn online on a small patch of in vivo cortex data.

Download video file (29.1MB, mp4)
DOI: 10.7554/eLife.38173.016

Top left: Raw data. Bottom left: Footprints of identified components. Top right: Mean residual buffer and proposed regions for new components (in white squares). Enclosings of accepted regions are shown in magenta. Several regions are proposed multiple times before getting accepted. This is due to the strict behavior of the classifier to ensure a low number of false positives. Bottom right: Reconstructed activity.

Video 2. Depiction of CaImAn online on a single plane of mesoscope data courtesy of E. Froudarakis, J. Reimers and A. Tolias (Baylor College of Medicine).

Download video file (86.9MB, mp4)
DOI: 10.7554/eLife.38173.017

Top left: Raw data. Top right: Inferred activity (without neuropil). Bottom left: Mean residual buffer and accepted regions for new components (magenta squares). Bottom right: Reconstructed activity.

Analyzing 1p microendoscopic data using CaImAn

We tested the CNMF-E implementation of CaImAn batch on in vivo microendoscopic data from mouse dorsal striatum, with neurons expressing GCaMP6f. 6000 frames were acquired at 30 frames per second while the mouse was freely moving in an open field arena (for further details refer to Zhou et al., 2018). In Figure 7 we report the results of the analysis using CaImAn batch with patches and compare them to the results of the MATLAB implementation of Zhou et al. (2018). Both implementations detect similar components (Figure 7a), with an F1-score of 0.89. 573 neurons were found in common by both implementations, with 106 and 31 additional components detected only by Zhou et al. (2018) and CaImAn batch, respectively. The median correlation between the temporal traces of neurons detected by both implementations was 0.86. Similar results were also obtained by running CaImAn batch without patches. Ten example temporal traces are plotted in Figure 7b.

Figure 7. Analyzing microendoscopic 1p data with the CNMF-E algorithm using CaImAn batch.


(a) Contour plots of all neurons detected by the CNMF-E implementation of Zhou et al. (2018) (white) and by CaImAn batch (red) using patches. Colors match the example traces shown in (b), which illustrate the temporal components of 10 example neurons detected by both implementations. CaImAn batch reproduces with reasonable fidelity the results of Zhou et al. (2018).

Computational performance of CaImAn

We examined the performance of CaImAn in terms of processing time for the various analyzed datasets presented above (Figure 8). The processing time discussed here excludes motion correction, which is highly efficient and primarily depends on the level of FOV discretization for non-rigid motion correction (Pnevmatikakis and Giovannucci, 2017). For CaImAn batch, each dataset was analyzed using three different computing architectures: (i) a single laptop (MacBook Pro) with 8 CPUs (Intel Core i7) and 16 GB of RAM (blue in Figure 8a), (ii) a linux-based workstation (CentOS) with 24 CPUs (Intel Xeon CPU E5-263 v3 at 3.40 GHz) and 128 GB of RAM (magenta), and (iii) a linux-based HPC cluster (CentOS) where 112 CPUs (Intel Xeon Gold 6148 at 2.40 GHz, four nodes, 28 CPUs each) were allocated to the processing task (yellow). Figure 8a shows the processing time of CaImAn batch as a function of dataset size for the four longest datasets, whose size exceeded 8 GB, on a log-log plot.

Figure 8. Time performance of CaImAn batch and CaImAn online for four of the analyzed datasets (small, medium, large and very large).


(a) Log-log plot of total processing time as a function of data size for CaImAn batch on two-photon datasets using different processing infrastructures: (i) a desktop with three allocated CPUs (green), (ii) a desktop with 24 CPUs allocated (orange), and (iii) an HPC where 112 CPUs are allocated (blue). The results indicate a near linear scaling of the processing time with the size of the dataset, with additional dependence on the number of found neurons (size of each point). Large datasets (>100 GB) can be seamlessly processed with moderately sized desktops or laptops, but access to an HPC enables processing with speeds faster than the acquisition time (considered 30 Hz for a 512×512 FOV here). However, for smaller datasets the advantages of adopting a cluster vanish, because of the inherent overhead. The results of CaImAn online using the laptop, with the ‘strict’ parameter setting (Figure 4—figure supplement 1), are also plotted in red, indicating near real-time processing speed. (b) Breakdown of processing time for CaImAn batch (excluding motion correction). Processing with CNMF in patches and refinement takes most of the time for CaImAn batch. (c) Computational gains for CaImAn batch due to parallelization for three datasets with different sizes. The parallelization gains are computed by using the same 24 CPU workstation and utilizing a different number of CPUs for each run. The different parts of the algorithm exhibit the same qualitative characteristics (data not shown). (d) Cost analysis of the CNMF-E implementation for processing a 6000 frame long 1p dataset. Processing in patches in parallel induces a time/memory tradeoff and can lead to speed gains (patch size in legend). (e) Computational cost per frame for analyzing dataset J123 with CaImAn online. Tracking existing activity and detecting new neurons are the most expensive steps, whereas updating spatial footprints can be efficiently distributed among all frames. (f) Processing speed of CaImAn online for all annotated datasets. Overall speed depends on the number of detected neurons and the size of the FOV (Np stands for number of pixels). Spatial downsampling can speed up processing. (g) Cost of online tracking of neural activity for the whole brain zebrafish dataset (maximum time over all planes per volume). Tracking can be done in real-time using parallel processing.

Results show that, as expected, employing more processing power results in faster processing. CaImAn batch on a HPC cluster processes data faster than acquisition time (Figure 8a) even for very large datasets. Processing of an hour long dataset was feasible within 3 hr on a single laptop, even though the size of the dataset is several times the available RAM. Here, acquisition time is computed based on the assumption of imaging a FOV discretized over a 512×512 grid at a 30 Hz rate (a typical two-photon imaging setup with resonant scanning microscopes). Dataset size is computed by representing each measurement using single precision arithmetic, which is the minimum precision required for standard algebraic processing. These assumptions lead to a data rate of 105 GB/hr. In general performance scales linearly with the number of frames (and hence, the size of the dataset), but we also observe a dependency on the number of components, which during the solution refinement step can be quadratic. This is expected from the properties of the matrix factorization approach as also noted by past studies (Pnevmatikakis et al., 2016). The majority of the time (Figure 8b) required for CaImAn batch processing is taken by CNMF algorithmic processing either during the initialization in patches (orange bar) or during merging and refining the results of the individual patches (green bar).

To study the effects of parallelization we ran CaImAn batch several times on the same hardware (a linux-based workstation with 24 CPUs), limiting the runs to different numbers of CPUs each time (Figure 8c). In all cases we saw significant performance gains from parallel processing, with the gains being similar for all stages of processing (patch processing, refinement, and quality testing; data not shown). We saw the most effective scaling with our 50 GB dataset (J123). For the largest dataset (J115, 100 GB), the speedup reaches a plateau due to limited available RAM, suggesting that more RAM can lead to better scaling. For small datasets (5 GB) the speedup factor is limited by increased communication overhead (indicative of weak scaling in the language of high performance computing).

The cost of processing 1p data in CaImAn batch using the CNMF-E algorithm (Zhou et al., 2018) is shown in Figure 8d for our workstation-class hardware. Splitting into patches and processing in parallel can lead to computational gains at the expense of increased memory usage. This is because CNMF-E introduces a background term that has the size of the dataset and needs to be loaded and updated in memory in two copies. This leads to processing times that are slower compared to the standard processing of 2p datasets, and to higher memory requirements. However (Figure 8d), memory usage can be controlled, enabling scalable inference at the expense of slower processing speeds.

Figure 8a also shows the performance of CaImAn online (red markers). Because of the low memory requirements of the streaming algorithm, this performance depends only mildly on the computing infrastructure, allowing for near real-time processing speeds on a standard laptop (Figure 8a). As discussed in Giovannucci et al., 2017, the processing time of CaImAn online depends primarily on (i) the computational cost of tracking the temporal activity of discovered neurons, (ii) the cost of detecting and incorporating new neurons, and (iii) the cost of periodic updates of spatial footprints. Figure 8e shows the cost of each of these steps for each frame, for one epoch of processing of the dataset J123. Distributing the spatial footprint updates more uniformly among all frames removes the computational bottleneck appearing in Giovannucci et al., 2017, where all the footprints were updated periodically at the same frame. The cost of detecting and incorporating new components remains approximately constant across time, and depends on the number of candidate components at each timestep. In this example five candidate components were used per frame, resulting in a relatively low cost (7 ms per frame). A higher number of candidate components can lead to higher recall in shorter datasets, but at a computational cost. This step can benefit from the use of a GPU for running the online CNN on the footprints of the candidate components. Finally, as noted in Giovannucci et al., 2017, the cost of tracking components can be kept low, and increases mildly over time as more components are found by the algorithm. (The analysis here excludes the cost of motion correction, because the files were motion corrected beforehand to ensure that the manual annotations and the algorithms were operating on the same FOV. This cost depends on whether rigid or pw-rigid motion correction is used: rigid motion correction takes on average 3–5 ms per frame for a 512×512 pixel FOV, whereas pw-rigid motion correction with a 128×128 pixel patch size is typically 3–4 times slower.)

Figure 8f shows the overall processing speed (in frames per second) of CaImAn online for the nine annotated datasets. Apart from the number of neurons, the processing speed also depends on the size of the imaged FOV and the use of spatial downsampling. Datasets with a smaller FOV (e.g., YST) or datasets where spatial downsampling is used can achieve higher processing speeds for the same number of neurons (blue dots in Figure 8f) compared to datasets where no spatial downsampling is used (orange dots in Figure 8f). In most cases, spatial downsampling can be used to increase processing speed without significantly affecting the quality of the results, an observation consistent with previous studies (Friedrich et al., 2017a).

In Figure 8g the cost per frame is plotted for the analysis of the whole brain zebrafish recording. The lower imaging rate (1 Hz) allows for tracking of neural activity with computational costs significantly lower than the 1 s interval between consecutive volumes (Figure 8e), even in the presence of a large number of components (typically more than 1000 per plane, Figure 6) and the significantly larger FOV (2048×1188 pixels).

CaImAn successfully tracks neurons across multiple days

Figure 9 shows an example of tracking neurons across six different sessions, corresponding to six different days of in vivo mouse cortex data, using our multi-day registration algorithm RegisterMulti (see Materials and methods, Algorithm 8). 453, 393, 375, 378, 376, and 373 active components were found in the six sessions, respectively. Our tracking method detected a total of 686 distinct active components. Of these, 172, 108, 70, 92, 82, and 162 appeared in exactly 1, 2, 3, 4, 5, and all six sessions, respectively. Contour plots of the 162 components that appeared in all sessions are shown in Figure 9a, and parts of the FOV are highlighted in Figure 9d, showing that components can be tracked in the presence of non-rigid deformations of the FOV between the different sessions.

Figure 9. Components registered across six different sessions (days).

(a) Contour plots of neurons that were detected to be active in all six imaging sessions overlaid on the correlation image of the sixth imaging session. Each color corresponds to a different session. (b) Stability of multiday registration method. Comparisons of forward and backward registrations in terms of F1 scores for all possible subsets of sessions. The comparisons agree to a very high level, indicating the stability of the proposed approach. (c) Comparison (in terms of F1 score) of pair-wise alignments using readouts from the union vs direct alignment. The comparison is performed for both the forward and the backwards alignment. For all pairs of sessions the alignment using the proposed method gives very similar results compared to direct pairwise alignment. (d) Magnified version of the tracked neurons corresponding to the squares marked in panel (a). Neurons in different parts of the FOV exhibit different shift patterns over the course of multiple days, but can nevertheless be tracked accurately by the proposed multiday registration method.


Figure 9—figure supplement 1. Tracking neurons across days; step-by-step description of multi session registration.


(a) Correlation image overlaid with contour plots of the neurons identified on day 1 (top, 453 neurons), day 2 (middle, 393 neurons) and day 3 (bottom, 375 neurons). (b) Result of the pairwise registration between sessions 1 and 2. The union of distinct active components consists of the components that were active in (i) both sessions (yellow - only the components of session 2 are displayed), (ii) only in session 2 (green), and (iii) only in session 1 (red), aligned to the FOV of session 2. (c) At the next step the union of sessions 1 and 2 is registered with the results of session 3 to produce the union of all distinct components aligned to the FOV of session 3. (d) Registration of non-consecutive sessions (session 1 vs session 3) without pairwise registration. Keeping track of the session(s) in which each component was active enables efficient and stable comparisons.

To test the stability of RegisterMulti for each subset of sessions, we repeated the same procedure running backwards in time starting from day 6 and ending at day 1, a process that also generated a total of 686 distinct active components. We identified the components present in at least a given subset of sessions when using the forward pass, and separately when using the backwards pass, and compared them against each other (Figure 9b) for all possible subsets. Results indicate a very high level of agreement between the two approaches with many of the disagreements arising near the boundaries (data not shown). Disagreements near the boundaries can arise because the forward pass aligns the union with the FOV of the last session, whereas the backwards pass with the FOV of the first session, potentially leading to loss of information near the boundaries.

A step by step demonstration of the tracking algorithm for the first three sessions is shown in Figure 9—figure supplement 1. Our approach allows for the comparison of two non-consecutive sessions through the union of components without the need of a direct pairwise registration (Figure 9—figure supplement 1f), where it is shown that registering sessions 1 and 3 directly and through the union leads to nearly identical results. Figure 9c compares the registrations for all pairs of sessions using the forward (red) or the backward (blue) approach, with the direct pairwise registrations. Again, the results indicate a very high level of agreement, indicating the stability and effectiveness of the proposed approach.

A different approach for multiple day registration was recently proposed by Sheintuch et al. (2017) (CellReg). While a direct comparison of the two methods is not feasible in the absence of ground truth, we tested our method on the same publicly available datasets from the Allen Brain Observatory visual coding database (http://observatory.brain-map.org/visualcoding). Similarly to Sheintuch et al. (2017), the same experiment performed over the course of different days produced very different populations of active neurons. To measure the performance of RegisterPair for pairwise registration, we computed the transitivity index proposed in Sheintuch et al. (2017). The transitivity property requires that if cell 'a' from session 1 matches with cell 'b' from session 2, and cell 'b' from session 2 matches with cell 'c' from session 3, then cell 'a' from session 1 should match with cell 'c' from session 3 when sessions 1 and 3 are registered directly. For all ten tested datasets the transitivity index was very high, with values ranging from 0.976 to 1 (0.992±0.006, data not shown). A discussion of the similarities and differences of the two methods is given in Materials and methods.

Discussion

Reproducible and scalable analysis for the 99%

Significant advances in the reporting fidelity of fluorescent indicators, and the ability to simultaneously record and modulate neurons granted by progress in optical technology, have made calcium imaging one of the two most prominent experimental methods in systems neuroscience alongside electrophysiology recordings. Increasing adoption has led to an unprecedented wealth of imaging data which poses significant analysis challenges. CaImAn is designed to provide the experimentalist with a complete suite of tools for analyzing this data in a formal, scalable, and reproducible way. The goal of this paper is to present the features of CaImAn and examine its performance in detail. CaImAn embeds existing methods for preprocessing calcium imaging data into a MapReduce framework and augments them with supervised learning algorithms and validation metrics. It builds on the CNMF algorithm of Pnevmatikakis et al. (2016) for source extraction and deconvolution, extending it along the lines of (i) reproducibility and performance improvement, by automating quality assessment through the use of unsupervised and supervised learning algorithms for component detection and classification, and (ii) scalability, by enabling fast large scale processing with standard computing infrastructure (e.g., a commodity laptop or workstation). Scalability is achieved by either using a MapReduce batch approach, which employs parallel processing of spatially overlapping, memory mapped, data patches; or by integrating the online processing framework of Giovannucci et al., 2017 within our pipeline. Apart from computational gains both approaches also result in improved performance. Towards our goal of providing a single package for dealing with standard problems arising in analysis of imaging data, CaImAn also includes an implementation of the CNMF-E algorithm of Zhou et al. (2018) for the analysis of microendoscopic data, as well as a novel method for registering analysis results across multiple days.

Towards surpassing human neuron detection performance

To evaluate the performance of CaImAn batch and CaImAn online, we used a number of distinct labelers to generate a corpus of nine annotated two-photon imaging datasets. The results indicated a surprising level of disagreement between individual labelers, highlighting both the difficulty of the problem and the non-reproducibility of the laborious task of human annotation. CaImAn reached near-human performance with respect to this consensus annotation, using the same parameters for all the datasets without dataset-dependent parameter tweaking. Such tweaking can include setting the SNR threshold based on the noise level of the recording, setting the complexity of the neuropil signal based on the level of background activity, or specialized treatment around the boundaries of the FOV to compensate for possible imaging artifacts, and, as shown, can significantly improve the results on individual datasets. As demonstrated in our results, the optimal parameter setting for CaImAn online can also depend on the length of the experiment, with stricter parameters being more suitable for longer datasets. We plan to investigate parameter schemes that increase in strictness over the course of an experiment.

 CaImAn has higher precision than recall when run on most datasets. While more balanced results can be achieved by appropriately relaxing the relevant quality evaluation thresholds, we prefer to maintain a higher precision as we believe that the inclusion of false positive traces can be more detrimental in any downstream analysis compared to the exclusion of, typically weak, true positive traces. This is true especially in experiments with low task dimensionality where a good signal from few neurons can be sufficient for the desired hypothesis testing.

Apart from being used as a benchmarking tool, the set of manual annotations can also be used as labeled data for supervised learning algorithms. CaImAn uses two CNN based classifiers trained on (a subset of) this data, one for post processing component classification in CaImAn batch, and the other for detecting new neurons in residual images in the CaImAn online. The deployment of these classifiers resulted in significant gains in terms of performance, and we expect further advances in the future. The annotations are made freely available to the community for benchmarking and training purposes.

CaImAn batch vs CaImAn online

Our results suggest similar performance between CaImAn batch and CaImAn online when evaluated on the basis of processing speed and quality of results, with CaImAn online outperforming CaImAn batch on longer datasets in terms of neuron detection, possibly due to its inherent ability to adapt to non-stationarities arising during the course of a long experiment. By contrast, CaImAn batch extracts better traces compared to CaImAn online with respect to the traces derived from the consensus annotations. While multiple passes over the data with CaImAn online can mitigate these shortcomings, this still depends on a good initialization with CaImAn batch, as the analysis of the whole brain zebrafish dataset indicates. In offline setups, CaImAn online could also benefit from the post processing component evaluation tools used in batch mode, for example by using the batch classifier to detect false positive components at the end of the experiment.

CaImAn online differs from CaImAn batch in that the former has lower memory requirements and can support novel types of closed-loop all-optical experiments (Packer et al., 2015; Carrillo-Reid et al., 2017). As discussed in Giovannucci et al., 2017, typical all-optical closed-loop experiments require the pre-determination of the ROIs that are monitored/modulated. By contrast, CaImAn online allows identification and modulation of new neurons on the fly, greatly expanding the space of possible experiments. Even though our simulated online processing setup is not integrated with the hardware of an optical experimental setup, our results indicate that CaImAn online performed close to real-time in most cases. Real-time processing can potentially be achieved by using parallel computational streams for the three steps of frame processing (motion correction and tracking, detecting new neurons, updating shapes), since these steps can largely be run asynchronously and independently. This suggests that large scale closed-loop experiments with single cell resolution are feasible by combining existing all-optical technology with our proposed analysis method.

Future directions

While CaImAn uses a highly scalable processing pipeline for two-photon datasets, the processing of one-photon microendoscopic imaging data is less scalable due to the more complex background model that needs to be retained in memory during processing. Adapting CaImAn online to the one-photon data processing algorithm of Zhou et al. (2018) is a promising way to scale up efficient processing in this case. The continuing development and quality improvement of neural activity indicators have enabled direct imaging of neural processes (axons/dendrites), imaging of synaptic activity (Xie et al., 2016), and direct imaging of voltage activity in in vivo conditions (Piatkevich et al., 2018). While the approach presented here is tuned for somatic imaging through the use of various assumptions (spatially localized activity, CNN classifiers trained on images of somatic activity), the technology of CaImAn is largely transferable to these domains as well. We will pursue these extensions in future work.

Materials and methods

Memory mapping

CaImAn batch uses memory mapping for efficient parallel data access. With memory mapped arrays, arithmetic operations can be performed on data residing on the hard drive without explicitly loading it into RAM, and slices of data can be indexed and accessed without loading the full file in memory, enabling out-of-core processing (Toledo, 1999). On modern computers tensors are stored in linear format, no matter the number of array dimensions. Therefore, one has to decide which elements of an array are contiguous in memory: in row-major order, consecutive elements of a row are next to each other in memory (the last index varies fastest), whereas in column-major order consecutive elements of a column are contiguous (the first index varies fastest). Such decisions significantly affect the speed at which data is read or written on spinning disks (and to a lesser degree on solid state drives): in column-major order reading a full column is fast because memory is read in a single sequential block, whereas reading a row is inefficient since only one element can be read at a time and all the data needs to be accessed.

In the context of calcium imaging datasets, CaImAn batch represents the dataset as a matrix Y, where each row corresponds to a different imaged pixel, and each column to a different frame. As a result, a column-major order mmap file enables fast access of individual frames at a given time, whereas a row-major order file enables fast access of an individual pixel across all times. To facilitate processing in patches, CaImAn batch stores the data in row-major order. In practice, this is opposite to the order in which the data arrives, one frame at a time. In order to reduce memory usage and speed up computation, CaImAn batch employs a MapReduce approach, where either multiple files or multiple chunks of a single big file composing the original dataset are processed and saved in mmap format in parallel. This operation comprises two phases: first the chunks/files are saved (possibly after motion correction, if required) in multiple row-major mmap files, and then the chunks are simultaneously combined into a single large row-major mmap file.
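As an illustration of this layout (a sketch using NumPy, not the exact CaImAn implementation), a row-major memory-mapped array of shape pixels × time can be written one frame (column) at a time, while a patch of pixels can later be read over all frames as a contiguous block:

```python
import numpy as np

d, T = 256 * 256, 1000                       # pixels, frames (hypothetical sizes)

# Create a row-major (C-order) memory-mapped array of shape (pixels, time)
Y = np.memmap('data_rowmajor.mmap', dtype=np.float32, mode='w+',
              shape=(d, T), order='C')

# Data arrives one frame (one column) at a time ...
for t in range(T):
    frame = np.random.rand(d).astype(np.float32)   # stand-in for an acquired frame
    Y[:, t] = frame                                 # scattered writes in this layout

# ... but patch processing reads a block of pixels over all frames,
# which is a contiguous (and therefore fast) read in row-major order.
patch = np.asarray(Y[1000:2000, :])                 # shape (1000, T), loaded into RAM
```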

Mathematical model of the CNMF framework

The CNMF framework (Figure 1d) for calcium imaging data representation can be expressed in mathematical terms as (Pnevmatikakis et al., 2016)

Y=AC+B+E. (1)

Here, Y ∈ ℝ^(d×T) denotes the observed data written in matrix form, where d is the total number of observed pixels/voxels, and T is the total number of observed timesteps (frames). A ∈ ℝ^(d×N) denotes the matrix of the N spatial footprints, A = [a_1, a_2, …, a_N], with a_i ∈ ℝ^(d×1) being the spatial footprint of component i. C ∈ ℝ^(N×T) denotes the matrix of temporal components, C = [c_1, c_2, …, c_N]ᵀ, with c_i ∈ ℝ^(T×1) being the temporal trace of component i. B is the background/neuropil activity matrix. For two-photon data it is modeled as a low rank matrix B = bf, where b ∈ ℝ^(d×n_b) and f ∈ ℝ^(n_b×T) correspond to the matrices of spatial and temporal background components, and n_b is the number of background components. For the case of micro-endoscopic data the integration volume is much larger and the low rank model is inadequate. For this case we use the CNMF-E algorithm of Zhou et al. (2018), where the background is modeled as

B=W(Y-AC), (2)

where W ∈ ℝ^(d×d) is an appropriate weight matrix, whose (i,j) entry models the influence of the neuropil signal at pixel j on the neuropil signal at pixel i.
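The factorization in Equations 1–2 can be made concrete with a small synthetic example (all dimensions and components below are arbitrary and purely illustrative):

```python
import numpy as np

d, T, N, nb = 4096, 1000, 30, 2          # pixels, frames, neurons, background components
rng = np.random.default_rng(0)

A = rng.random((d, N)) * (rng.random((d, N)) > 0.98)   # sparse, non-negative spatial footprints
C = np.maximum(rng.standard_normal((N, T)), 0)          # non-negative temporal traces
b = rng.random((d, nb))                                 # spatial background components
f = rng.random((nb, T))                                 # temporal background components
E = 0.1 * rng.standard_normal((d, T))                   # noise

# Two-photon model: Y = A C + b f + E (Equation 1 with the low rank background B = b f)
Y = A @ C + b @ f + E

# For one-photon data, CNMF-E instead models the background as B = W (Y - A C)
# (Equation 2), with W a d x d matrix of local neuropil weights.
```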

Combining results from different patches

To combine results from different patches we first need to account for the overlap at the boundaries. Neurons lying close to the boundary between neighboring patches can appear multiple times and must be merged. To this end, we optimized the merging approach used in Pnevmatikakis et al. (2016): groups of components with spatially overlapping footprints whose temporal traces are correlated above a threshold are replaced with a single component that tries to explain as much as possible of the variance already explained by the ‘local’ components (as opposed to the variance of the data, as in Pnevmatikakis et al. (2016)). If A_old, C_old are the matrices of components to be merged, then the merged component a_m, c_m is given by the solution of the rank-1 NMF problem:

min_{a_m ≥ 0, c_m ≥ 0} ‖ A_old C_old − a_m c_mᵀ ‖. (3)

Prior to merging, the value of each component at each pixel is normalized by the number of patches that overlap in this pixel, to avoid counting the activity of each pixel multiple times.
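The merge in Equation 3 amounts to a rank-1 non-negative factorization of the activity already explained by the overlapping components. A minimal sketch using scikit-learn is shown below (an illustration of the idea, not the CaImAn internals):

```python
import numpy as np
from sklearn.decomposition import NMF

def merge_components(A_old, C_old):
    """Replace a group of overlapping components with a single merged one.

    A_old: (d, k) non-negative spatial footprints of the k components to merge
    C_old: (k, T) corresponding non-negative temporal traces
    Returns the merged footprint a_m (d,) and trace c_m (T,) solving the
    rank-1 NMF problem of Equation 3.
    """
    X = A_old @ C_old                        # 'local' activity to be re-explained
    model = NMF(n_components=1, init='nndsvda', max_iter=500)
    a_m = model.fit_transform(X)[:, 0]       # merged spatial footprint
    c_m = model.components_[0]               # merged temporal trace
    return a_m, c_m
```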

We follow a similar procedure for the background/neuropil signals from the different patches. When working with two-photon data, the spatial background/neuropil components for each patch can either be updated while keeping their spatial extent intact, to retain a local neuropil structure, or be merged when they are sufficiently correlated in time, as described above, to promote a more global structure. For the case of one-photon data, CNMF-E estimates the background using a local autoregressive process (see Equation 2) (Zhou et al., 2018), a setup that cannot be immediately propagated when combining the different patches. To combine backgrounds from the different patches, we first approximate the background B_i of each patch i with a low rank matrix, using a non-negative matrix factorization of rank g_b, to obtain global spatial and temporal background components:

[b_i, f_i] = NNMF(B_i, g_b). (4)

The resulting components are embedded into a large matrix B ∈ ℝ^(d×T) that retains a low rank structure. After the components and backgrounds from all the patches have been combined, they are further refined by running CNMF iterations that update the spatial footprints, temporal traces, and neuropil activity. CaImAn batch implements these steps in parallel (as also described in Pnevmatikakis et al. (2016)): temporal traces whose corresponding spatial footprints do not overlap can be updated in parallel. Similarly, the rows of the matrix of spatial footprints A can also be updated in parallel (Figure 2b). The process is summarized in algorithmic form in Algorithms 1–2. When working with one-photon data, instead of producing a low-rank approximation of B that would underfit the background, we increase the patch overlap and run the full pipeline on each patch. In the final phase, when neurons overlap we retain only the variant with the highest quality rather than merging them.

Initialization strategies

Source extraction using matrix factorization requires solving a bi-convex problem where initialization plays a critical role. The CNMF/CNMF-E algorithms use initialization methods that exploit the locality of the spatial footprints to efficiently identify the locations of candidate components (Pnevmatikakis et al., 2016; Zhou et al., 2018). CaImAn incorporates these methods, extending them by using the temporal locality of the calcium transient events. The available initialization methods for CaImAn batch include:

GreedyROI: This approach, introduced in Pnevmatikakis et al. (2016), first spatially smooths the data with a Gaussian kernel of size comparable to the average neuron radius, and then initializes candidate components around the locations where the maximum variance (of the smoothed data) is explained. This initialization strategy is fast but requires manual specification of the number of components by the user (a minimal sketch of the procedure is given after this list).

RollingGreedyROI: This approach, introduced in this paper, operates like GreedyROI by spatially smoothing the data and looking for points of maximum variance. Instead of working across all the data, RollingGreedyROI looks for points of maximum variance within a rolling window of fixed duration, for example 3 s, and initializes components by performing a rank-one NMF on a local spatial neighborhood. By focusing on smaller rolling windows, RollingGreedyROI can better isolate single transient events, and as a result better detect neurons with sparse activity. RollingGreedyROI is the default choice for processing of two-photon data.

GreedyCorr: This approach, introduced in Zhou et al. (2018), initializes candidate components around locations that correspond to the local maxima of an image formed by the pointwise product between the correlation image and the peak signal-to-noise ratio image. A threshold for acceptance of candidate neurons is used, making it unnecessary to pre-specify the neuron count. This comes at the expense of a higher computational cost. GreedyCorr is the default choice for processing of one-photon data.

SparseNMF: Sparse NMF approaches, when run in small patches, can be effective for quickly uncovering spatial structure in the imaging data, especially for neural processes (axons/dendrites) whose shape cannot be easily parametrized and/or localized.

SeededInitialization: Often the locations of components are known, either from manual annotation or from labeled data obtained in a different way, such as data from a static structural channel recorded concurrently with the functional indicator. CaImAn can be seeded with binary (or real valued) masks for the spatial footprints. Apart from A, these masks can be used to initialize all the other relevant matrices, C and B, as well. This is performed by (i) first estimating the temporal background components f using only data from parts of the FOV not covered by any masks, (ii) then estimating the spatial background components b, and (iii) finally estimating A and C (with A restricted to be non-zero only at the locations of the binary masks), using a simple NMF approach. Details are given in Algorithm 3.
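The sketch below illustrates the GreedyROI procedure referenced above: Gaussian smoothing, seeding at the location of maximum variance, and a crude local rank-one refinement with residual subtraction. Parameter names and the exact update rules are illustrative and do not correspond to the CaImAn API:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def greedy_roi_sketch(Y, n_components=10, gSig=4, patch_radius=8):
    """Y: movie of shape (T, height, width). Returns lists of local footprints and traces."""
    T, h, w = Y.shape
    R = Y.astype(np.float32)                         # residual movie
    footprints, traces = [], []
    for _ in range(n_components):
        # Smooth each frame with a Gaussian of roughly one neuron radius
        S = np.array([gaussian_filter(frame, gSig) for frame in R])
        # Seed at the pixel where the smoothed data has maximum variance
        i, j = np.unravel_index(np.argmax(S.var(axis=0)), (h, w))
        i0, i1 = max(i - patch_radius, 0), min(i + patch_radius + 1, h)
        j0, j1 = max(j - patch_radius, 0), min(j + patch_radius + 1, w)
        patch = R[:, i0:i1, j0:j1].reshape(T, -1)
        c = patch.mean(axis=1)                       # crude temporal trace of the seed
        a = np.maximum(patch.T @ c / (c @ c), 0)     # non-negative local spatial weights
        footprints.append(((i0, i1, j0, j1), a))
        traces.append(c)
        # Subtract the explained activity from the residual before the next seed
        R[:, i0:i1, j0:j1] -= np.outer(c, a).reshape(T, i1 - i0, j1 - j0)
    return footprints, traces
```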

Details of quality assessment tests

Here we present the unsupervised and supervised quality assessment tests in more detail (Figure 2).

Matching spatial footprints to the raw data

Let a_i, c_i denote the spatial footprint and temporal trace of component i, and let A_\i, C_\i denote the matrices A, C with component i removed. Similarly, let Y_i = Y − A_\i C_\i − B denote the dataset after the background and the contribution of all components except i have been removed. If component i is real, then Y_i and a_i c_iᵀ will look similar during the time intervals when component i is active. As a first test, CaImAn finds the first N_p local peaks of c_i (e.g., N_p = 5), constructs intervals around these peaks (e.g., 50 ms in the past and 300 ms in the future, to cover the main part of a possible calcium transient around that point), and then averages Y_i across time over the union of these intervals to obtain a spatial image ⟨Y_i⟩ (Figure 2c). The Pearson's correlation over space between ⟨Y_i⟩ and a_i (both restricted to a small neighborhood around the centroid of a_i) is then computed, and component i is rejected if the correlation coefficient is below a threshold value θ_sp (e.g., θ_sp = 0.5). Note that a similar test is used in the online approach of Giovannucci et al., 2017 to screen possible new components.
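A condensed sketch of this space-correlation test follows; the peak selection and interval construction are simplified relative to the actual CaImAn implementation, and all names are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

def space_correlation_test(Y_i, a_i, c_i, n_peaks=5, pre=2, post=10, theta_sp=0.5):
    """Accept component i if its footprint matches the data around its transients.

    Y_i : (d, T) data with the background and all other components removed
    a_i : (d,) spatial footprint of component i
    c_i : (T,) temporal trace of component i
    pre/post : frames before/after each peak included in the averaging interval
    """
    T = len(c_i)
    peaks = np.argsort(c_i)[-n_peaks:]               # crude stand-in for the first local peaks
    active = np.zeros(T, dtype=bool)
    for t in peaks:
        active[max(t - pre, 0):min(t + post, T)] = True
    mean_image = Y_i[:, active].mean(axis=1)         # <Y_i>: average over the active intervals
    support = a_i > 0                                # restrict to the footprint neighborhood
    r, _ = pearsonr(mean_image[support], a_i[support])
    return r >= theta_sp, r
```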

Detecting fluorescence traces with high SNR

For a candidate component to correspond to an active neuron, its trace must exhibit dynamics reminiscent of the calcium indicator's transient. A criterion for this can be obtained by requiring the average SNR of trace c_i over the course of a transient to be above a certain threshold θ_SNR, for example θ_SNR = 2 (Figure 2d). The average SNR can be seen as a measure of how unlikely it is for the transients of c_i (after appropriate z-scoring) to have been the result of a white noise process.

To compute the SNR of a trace, let R = Y − AC − B be the residual spatiotemporal signal. We can obtain the residual signal for each component i, r_i, by projecting R onto the spatial footprint a_i:

r_i = Rᵀ a_i / ‖a_i‖² (5)

Then the trace 𝐜i+𝐫i corresponds to the non-denoised trace of component i. To calculate its SNR we first compute a type of z-score:

z_i = (c_i + r_i − Baseline(c_i + r_i)) / Noise(c_i + r_i). (6)

The Baseline() function determines the baseline of the trace, which can vary in the case of long datasets exhibiting baseline trends, for example due to bleaching. The function Noise() estimates the noise level of the trace. Since calcium transients around the baseline can only be positive, we estimate the noise level by restricting our attention to the time points t_n where c_i + r_i is below the baseline value, that is t_n = {t : c_i(t) + r_i(t) ≤ Baseline(c_i + r_i)}, and compute the noise level as the scale parameter of a half-normal distribution (Figure 2b):

Noise(c_i + r_i) = std([c_i + r_i](t_n)) / √(1 − 2/π). (7)

We then determine how likely it is that the positive excursions of z_i can be attributed to noise alone. We compute the probabilities p_i(t) = Φ(−z_i(t)), where Φ(·) denotes the cumulative distribution function of the standard normal distribution, and compute the most unlikely excursion over a window of N_s timesteps that corresponds to the length of a typical transient, for example N_s = 0.4 s × F, where 0.4 s could correspond to the typical length of a GCaMP6f transient, and F is the imaging rate:

p_min^i = min_t ( ∏_{j=0}^{N_s−1} p_i(t+j) )^{1/N_s}. (8)

The (averaged peak) SNR of component i can then be defined as

SNR_i = Φ⁻¹(1 − p_min^i) = −Φ⁻¹(p_min^i), (9)

where Φ⁻¹ is the quantile (probit) function of the standard normal distribution, and a component is accepted if SNR_i ≥ θ_SNR. Note that for numerical stability we compute p_min^i in the logarithmic domain and check the condition p_min^i ≤ Φ(−θ_SNR).

We can also use a similar test for the significance of the time traces in the spike domain after performing deconvolution. In this case, traces are considered to spike if their maximum height due to a spike transient exceeds a threshold. If we assume that the shape of each calcium transient has been normalized to have maximum amplitude 1, then this corresponds to testing max_t s_i(t) ≥ θ_SNR σ_i, where s_i represents the deconvolved activity trace of component i, θ_SNR is again an appropriate SNR threshold, for example θ_SNR = 2, and σ_i is the noise level of trace i.
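A condensed sketch of the trace-SNR computation in Equations 6–9 is given below. A simple percentile is used as a stand-in for the Baseline() function (CaImAn also supports a running baseline for long recordings), and all parameter names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def trace_snr(y, frame_rate=30.0, transient_sec=0.4, baseline_pct=20, theta_snr=2.0):
    """Average peak-SNR of a non-denoised trace y = c_i + r_i (Equations 6-9)."""
    baseline = np.percentile(y, baseline_pct)                  # stand-in for Baseline()
    below = y[y <= baseline]                                   # sub-baseline samples
    sigma = np.std(below - baseline) / np.sqrt(1 - 2 / np.pi)  # half-normal scale (Eq. 7)
    z = (y - baseline) / sigma                                 # z-scored trace (Eq. 6)
    logp = norm.logcdf(-z)                                     # log p_i(t), kept in log domain
    Ns = max(int(round(transient_sec * frame_rate)), 1)        # typical transient length in frames
    # geometric mean of p over a sliding window of Ns samples (Eq. 8), computed in log domain
    log_pmin = np.convolve(logp, np.ones(Ns) / Ns, mode='valid').min()
    snr = -norm.ppf(np.exp(log_pmin))                          # Eq. 9; may overflow for very clean traces
    accepted = log_pmin <= norm.logcdf(-theta_snr)             # numerically stable acceptance test
    return snr, accepted
```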

Classification through convolutional neural networks (CNNs)

The tests described above are unsupervised but require fine-tuning of two threshold parameters (θsp,θSNR) that might be dataset dependent and might be sensitive to strong non-stationarities. As a third test we trained a 4-layer CNN to classify the spatial footprints into true or false components, where a true component here corresponds to a spatial footprint that resembles a neuron soma (See Figure 2e and section Classification through convolutional networks for details). A simple threshold θCNN can be used to tune the classifier (e.g., θCNN=0.5).

Collection of manual annotations and consensus

We collected manual annotations from four independent labelers who were instructed to find round or donut shaped neurons of similar size using the ImageJ Cell Magic Wand tool (Walker, 2014). We focused on manually annotating only cells that were active within each dataset and for that reason the labelers were provided with two summary statistics: (i) A movie obtained by removing a running 20th percentile (as a crude background approximation) and downsampling in time by a factor of 10, and (ii) the max-correlation image. The correlation image (CI) at every pixel is equal to the average temporal correlation coefficient between that pixel and its neighbors (Smith and Häusser, 2010) (eight neighbors were used for our analysis). The max-correlation image is obtained by computing the CI for each batch of 33 s (1000 frames for a 30 Hz acquisition rate), and then taking the maximum over all these images (Figure 3—figure supplement 1a). Neurons that are inactive during the course of the dataset will be suppressed both from the baseline removed video (since their activity will always be around their baseline), and from the max-correlation image since the variation around this baseline will mostly be due to noise leading to practically uncorrelated neighboring pixels (Figure 3—figure supplement 1a). Nine different mouse in vivo datasets were used from various brain areas and labs. A description is given in Table 2. To create the final consensus, the labelers were asked to jointly resolve the inconsistencies between their annotations. To this end, every ROI selected by at least one but not all labelers was re-considered by a group of at least two labelers that decided whether it corresponds to an active neuron or not.
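The correlation image and max-correlation image summaries can be sketched as follows (eight neighbors, computed via array shifts; the batch length and normalization details are illustrative):

```python
import numpy as np

def correlation_image(Y):
    """Average correlation of each pixel with its 8 neighbors. Y: (T, h, w)."""
    Yc = Y - Y.mean(axis=0)
    Yc = Yc / (Y.std(axis=0) + 1e-9)                 # z-score each pixel's trace
    T, h, w = Y.shape
    ci = np.zeros((h, w))
    counts = np.zeros((h, w))
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    for di, dj in shifts:
        a = Yc[:, max(di, 0):h + min(di, 0), max(dj, 0):w + min(dj, 0)]
        b = Yc[:, max(-di, 0):h + min(-di, 0), max(-dj, 0):w + min(-dj, 0)]
        corr = (a * b).mean(axis=0)                  # correlation with the shifted neighbor
        ci[max(di, 0):h + min(di, 0), max(dj, 0):w + min(dj, 0)] += corr
        counts[max(di, 0):h + min(di, 0), max(dj, 0):w + min(dj, 0)] += 1
    return ci / counts

def max_correlation_image(Y, batch=1000):
    """Maximum over correlation images computed on consecutive batches of frames."""
    return np.max([correlation_image(Y[t:t + batch])
                   for t in range(0, Y.shape[0], batch)], axis=0)
```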

The annotation procedure provides a binary mask for each selected component. On the other hand, the output of CaImAn for each component is a non-negatively valued vector over the FOV (a real-valued mask). The two sets of masks differ not only in their variable type but also in their general shape: manual annotation through the Cell Magic Wand tool tends to produce circular shapes, whereas the output of CaImAn tries to accurately estimate the shape of each active component (Figure 3—figure supplement 1b). To construct the consensus components that can be directly used for comparison, the binary masks from the manual annotations were used to seed the CNMF algorithm (Algorithm 3). This produced a set of real valued components with spatial footprints restricted to the areas provided by the annotations, and a corresponding set of temporal components that can be used to evaluate the performance of CaImAn (Figure 4). Registration was performed using the RegisterPair algorithm (Algorithm 7), and a match was counted as a true positive when the (modified) Jaccard distance (Equation 12) was below 0.7. Details of the registration procedure are given below (see Component registration).

Cross-Validation analysis of manual annotations

As mentioned in the Results section, comparing each manual annotation with the consensus annotation can create slightly biased results in favor of individual annotators, since the consensus annotation is chosen from the union of the individual annotations. To correct for this we performed a cross-validation analysis where the annotations of each labeler were compared against an automatically generated combination of the annotations of the remaining labelers. To create the combined annotations we first used the RegisterMulti procedure to construct the union of each subset of N−1 labelers (where N is the total number of labelers for each dataset). When N=4, the combined annotation consisted of the components that were selected by at least two labelers. When N=3, a stricter intersection approach was used: the combined annotation consisted of the components that were selected by both remaining labelers. The procedure was repeated for all subsets of labelers and all datasets. The results are shown in Table 3. While individual scores for specific annotators and datasets vary significantly compared to using the consensus annotation as ground truth (Table 1), the decrease in average performance was modest, indicating a low level of bias.

Table 3. Cross-validated results of each labeler, where each labeler’s performance is compared against the annotations of the rest of the labelers using a majority vote.

Results are given in the form F1 score (precision, recall); entries marked X correspond to datasets not manually annotated by the specific labeler. The results indicate decreased performance compared to using the consensus annotation.

Name | L1 | L2 | L3 | L4 | Mean
N.01.01 | 0.75 (0.73, 0.77) | 0.70 (0.58, 0.88) | 0.86 (0.81, 0.90) | 0.84 (0.92, 0.77) | 0.79 (0.76, 0.83)
N.03.00.t | X | 0.75 (0.69, 0.82) | 0.79 (0.67, 0.97) | 0.85 (0.76, 0.97) | 0.80 (0.71, 0.92)
N.00.00 | X | 0.87 (0.84, 0.90) | 0.82 (0.75, 0.91) | 0.72 (0.71, 0.97) | 0.83 (0.76, 0.93)
YST | 0.70 (0.93, 0.56) | 0.79 (0.70, 0.90) | 0.81 (0.76, 0.86) | 0.77 (0.75, 0.78) | 0.77 (0.78, 0.78)
N.04.00.t | X | 0.79 (0.76, 0.83) | 0.72 (0.60, 0.89) | 0.68 (0.53, 0.96) | 0.73 (0.63, 0.89)
N.02.00 | 0.84 (0.97, 0.75) | 0.88 (0.89, 0.87) | 0.86 (0.79, 0.94) | 0.81 (0.70, 0.95) | 0.85 (0.83, 0.88)
J123 | X | 0.90 (0.86, 0.93) | 0.89 (0.84, 0.93) | 0.77 (0.63, 0.96) | 0.87 (0.88, 0.88)
J115 | 0.85 (0.98, 0.76) | 0.87 (0.80, 0.97) | 0.88 (0.80, 0.97) | 0.87 (0.93, 0.82) | 0.85 (0.78, 0.94)
K53 | 0.86 (0.98, 0.77) | 0.90 (0.85, 0.96) | 0.88 (0.80, 0.96) | 0.89 (0.90, 0.88) | 0.88 (0.88, 0.89)
mean ± std | 0.80±0.06 (0.92±0.09, 0.72±0.08) | 0.83±0.07 (0.77±0.10, 0.90±0.05) | 0.83±0.05 (0.76±0.07, 0.92±0.04) | 0.81±0.06 (0.76±0.13, 0.90±0.08) | 0.82±0.06 (0.77±0.12, 0.89±0.06)

Classification through convolutional neural networks (CNNs)

CaImAn uses two CNN classifiers: one for post processing component screening in CaImAn batch, and a different one for screening candidate components in CaImAn online. In both cases a four layer CNN was used, with the architecture described in Figure 2e. The first two convolutional layers consist of 32 3×3 filters each, whereas each of the latter two layers consists of 64 3×3 filters, all followed by rectified linear units (ReLU). After every two convolutional layers a 2×2 max-pooling layer is included. A two layer dense network with 512 hidden units is used to compute the predictions (Figure 2e).
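The architecture in Figure 2e corresponds to a small Keras model along the following lines; the input patch size and compilation settings shown here are illustrative and not necessarily those shipped with CaImAn:

```python
from tensorflow.keras import layers, models

def make_component_classifier(patch_size=50):
    """4 convolutional layers (2x32 and 2x64 3x3 filters), max-pooling every two layers,
    and a dense head with 512 hidden units, as sketched in Figure 2e."""
    model = models.Sequential([
        layers.Input(shape=(patch_size, patch_size, 1)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dense(2, activation='softmax'),       # accepted / rejected component
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```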

CaImAn batch classifier for post processing classification

The purpose of the batch classifier is to classify the components detected by CaImAn batch into neuron somas or other shapes, by examining their spatial footprints. Only three annotated datasets (NF.03.00.t, NF.04.00.t, NF.02.00) were used to train the batch classifier. The set of estimated footprints from running CaImAn batch initialized with the consensus annotation was matched to the set of consensus footprints. Footprints matched to consensus components were considered positive examples, whereas the remaining components were labeled as negatives. The two sets were enriched using data augmentation (rotations, reflections, contrast manipulation etc.) through the Keras library (keras.io) and the CNN was trained on 60% of the data, leaving 20% for validation and 20% for testing. The CNN classifier reached an accuracy of 97% on test data; this generalized to the rest of the datasets (Figure 2e) without any parameter change.

Online classifier for new component detection

The purpose of the CaImAn online classifier is to detect new components based on their spatial footprints by looking at the mean across time of the residual buffer. To construct the labeled data for the online classifier, CaImAn batch was run on the first five annotated datasets, seeded with the masks obtained through the manual annotations. Subsequently, the activity of random subsets of found components and the background was removed from contiguous frames of the raw datasets to construct residual buffers, which were averaged across time. From the resulting images, patches were extracted corresponding to positive examples (patches around a neuron that was active during the buffer) and negative examples (patches around other positions within the FOV). A neuron was considered active if its trace attained an average peak-SNR value of 4 or higher during the buffer interval. Similarly to the batch classifier, the two sets were augmented and split into training, validation and testing sets. The resulting classifier reached a 98% accuracy on the testing set, and also generalized well when applied to different datasets.

Differences between the two classifiers

Although both classifiers examine the spatial footprints of candidate components, their required performance characteristics are different which led us to train them separately. Firstly, the two classifiers are trained on separate data: The batch classifier is trained on spatial footprints extracted from CaImAn batch, whereas the online classifier is trained on residual signals that are generated as CaImAn online operates. The batch classifier examines each component as a post-processing step to determine whether its shape corresponds to a neural cell body. As such, false positive and false negative examples are treated equally and possible mis-classifications do not directly affect the traces of the other components. By contrast, the online classifier operates as part of the online processing pipeline. In this case, a new component that is not detected in a residual buffer is likely to be detected later should it become more active. On the other hand, a component that is falsely detected and incorporated in the online processing pipeline will continue to affect the future buffer residuals and the detection of future components. As such the online algorithm is more sensitive to false positives than false negatives. To ensure a small number of false positive examples under testing conditions, only components with average peak-SNR value at least four were considered as positive examples during training of the online classifier.

Distributed update of spatial footprints

To efficiently distribute the cost of updating shapes across all frames we derived a simple algorithm that (i) ensures that every spatial footprint is updated at least once every Tu steps, where Tu is a user defined parameter, for example Tu=200, and (ii) ensures that no spatial footprint is updated during a step in which new components were added. The latter property is used to compensate for the additional computational cost that comes with introducing new components. Whenever a new component is added, the algorithm collects the components with overlapping spatial footprints and makes sure they are updated at the next frame. This property ensures that the footprints of all affected components adapt quickly whenever a new neighbor is introduced. The procedure is described in algorithmic form in Algorithm 6.
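A toy sketch of such a schedule is shown below. This is assumed logic for illustration, not Algorithm 6 verbatim: a per-component counter tracks the frames since the last update, the oldest footprint is refreshed at every frame where nothing new was added, and neighbors of newly added components are flagged for the following frame:

```python
import numpy as np

def schedule_shape_updates(T, n_comp, Tu=200, new_components=None, overlaps=None):
    """Return, for each frame, the indices of the footprints to update.

    new_components: dict mapping frame index -> list of newly added component indices
    overlaps: dict mapping component index -> list of spatially overlapping components
    """
    new_components = new_components or {}
    overlaps = overlaps or {}
    age = np.zeros(n_comp)                   # frames since each footprint was last updated
    forced = set()                           # neighbors of newly added components
    plan = []
    for t in range(T):
        age += 1
        if t in new_components:              # (ii) no shape updates when components are added ...
            for j in new_components[t]:
                forced.update(overlaps.get(j, []))   # ... but flag their neighbors
            plan.append([])
            continue
        due = set(np.where(age >= Tu)[0]) | forced   # (i) everyone updated at least every Tu steps
        due.add(int(np.argmax(age)))                 # spread the load: refresh the oldest footprint
        forced = set()
        for j in due:
            age[j] = 0
        plan.append(sorted(due))
    return plan
```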

Component registration

Fluorescence microscopy methods enable imaging the same brain region across different sessions that can span multiple days or weeks. While the microscope can revisit the same location in the brain with reasonably high precision, the FOV might not precisely match across sessions due to misalignments or deformations of the brain medium. CaImAn provides routines for FOV alignment and component registration across multiple sessions/days. Let a_1^1, a_2^1, …, a_{N_1}^1 and a_1^2, a_2^2, …, a_{N_2}^2 be the sets of spatial components from sessions 1 and 2, respectively, where N_1 and N_2 denote the total number of components in each session. We first compute the FOV displacement by aligning summary images from the two sessions (e.g., the mean or correlation image), using a non-rigid registration method, for example NoRMCorre (Pnevmatikakis and Giovannucci, 2017). We apply the estimated displacement field to the components of A^1 to align them with the FOV of session 2. To perform the registration, we construct a pairwise distance matrix D ∈ ℝ^(N_1×N_2) with D(i,j) = d(a_i^1, a_j^2), where d(·,·) denotes a distance metric between two components. The chosen distance corresponds to the Jaccard distance between the binarized versions of the components. A real valued component a is converted into its binary version m(a) by setting to one only the values of a that are above the maximum value of a times a threshold θ_b, for example θ_b = 0.2:

m(\mathbf{a})(x) = \begin{cases} 1, & \mathbf{a}(x) \geq \theta_b \max(\mathbf{a}) \\ 0, & \text{otherwise}. \end{cases} \qquad (10)

To compute the distance between two binary masks m_1, m_2, we use the Jaccard index (intersection over union), which is defined as

J(m_1, m_2) = \frac{|m_1 \cap m_2|}{|m_1 \cup m_2|}, \qquad (11)

and use it to define the distance metric as 

d(\mathbf{a}_i^1, \mathbf{a}_j^2) = \begin{cases} 1 - J\big(m(\mathbf{a}_i^1), m(\mathbf{a}_j^2)\big), & 1 - J\big(m(\mathbf{a}_i^1), m(\mathbf{a}_j^2)\big) \leq \theta_d \\ 0, & \big(m(\mathbf{a}_i^1) \subseteq m(\mathbf{a}_j^2)\big) \ \text{OR} \ \big(m(\mathbf{a}_j^2) \subseteq m(\mathbf{a}_i^1)\big) \\ \infty, & \text{otherwise}, \end{cases} \qquad (12)

where θ_d is a distance threshold, for example θ_d = 0.5, above which two components are considered non-matching and their distance is set to infinity. This prevents false matches between components that overlap only marginally.

Once the distance matrix D has been computed, an optimal matching between the components of the two sessions is obtained using the Hungarian algorithm to solve the linear assignment problem. Since infinite distances are allowed, components from either session can remain unmatched, which prevents false assignments and enables the registration of sessions with different numbers of neurons. Moreover, the use of infinite distances speeds up the Hungarian algorithm, as it significantly restricts the space of possible pairings. This process of registering components across two sessions (RegisterPair) is summarized in Algorithm 7.
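A minimal sketch of this pairwise registration is shown below; it assumes the footprints have already been aligned to a common FOV, uses a large finite constant in place of the infinite distances, and is an illustration rather than the CaImAn implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative sketch of RegisterPair: binarize footprints, build the Jaccard
# distance matrix with cutoff theta_d, and match components with the Hungarian
# algorithm. A1, A2 are (pixels x components) arrays aligned to the same FOV.

def binarize(A, theta_b=0.2):
    return A >= theta_b * A.max(axis=0, keepdims=True)

def register_pair(A1, A2, theta_b=0.2, theta_d=0.5, big=1e9):
    M1, M2 = binarize(A1, theta_b), binarize(A2, theta_b)
    N1, N2 = M1.shape[1], M2.shape[1]
    D = np.full((N1, N2), big)                   # 'big' stands in for infinity
    for i in range(N1):
        for j in range(N2):
            inter = np.logical_and(M1[:, i], M2[:, j]).sum()
            union = np.logical_or(M1[:, i], M2[:, j]).sum()
            dist = 1.0 - inter / union if union else 1.0
            if dist <= theta_d:
                D[i, j] = dist                   # keep only plausible pairings
    rows, cols = linear_sum_assignment(D)        # Hungarian algorithm
    matches = [(i, j) for i, j in zip(rows, cols) if D[i, j] < big]
    unmatched_1 = sorted(set(range(N1)) - {i for i, _ in matches})
    unmatched_2 = sorted(set(range(N2)) - {j for _, j in matches})
    return matches, unmatched_1, unmatched_2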

To register components across multiple sessions, we first order the sessions chronologically and register session 1 against session 2. From this registration we construct the union of the distinct components between the two sessions by keeping the matched components from session 2 as well as the non-matched components from both sessions, aligned to the FOV of session 2. We then register this union of components against the components of session 3 and repeat the procedure until all sessions have been registered. This process of multi-session registration (RegisterMulti) is summarized in Algorithm 8. At the end of the process the algorithm produces a list of matches between the components of each session and the union of all active distinct components, allowing for efficient tracking of components across multiple days (Figure 9) and for the comparison of non-consecutive sessions through the union without the need for direct pairwise registration (Figure 9—figure supplement 1). An alternative approach to the problem of multiple session registration (CellReg) was presented recently by Sheintuch et al. (2017), where the authors register neurons across multiple days by first constructing a similar union set of all the components, which is then refined using a clustering procedure. RegisterMulti differs from the CellReg method of Sheintuch et al. (2017) along the following dimensions:

  • RegisterMulti uses a simple intersection-over-union metric to estimate the distance between two neighboring neurons after the FOV alignment. Cells whose distance is above a given threshold are considered distinct by default and are not tested for matching. This parameter has an intuitive interpretation and can be set a priori for each dataset. By contrast, CellReg uses a probabilistic framework based on the joint probability distribution between the distance of two cells and the correlation of their shapes. This choice makes specific parametric assumptions about the distributions of centroid distances between the same and different cells, as well as their shape correlations. The model must be re-evaluated for every different set of sessions to be registered and can require considerable data to learn the appropriate distance metric.

  • RegisterMulti uses the Hungarian algorithm to register two different sets of components, solving the linear assignment problem optimally under the assumed distance function. In contrast, CellReg uses a greedy method for initializing the assignment of cells to the union superset and relies on a subsequent clustering step to refine these estimates, adding extra computational burden to the registration procedure.

Implementation details for CaImAn batch

Each dataset was processed using the same set of parameters, apart from the expected size of neurons, the size of patches, and the expected number of neurons per patch (all estimated by inspecting the correlation image). For the dataset N.01.01, where optical modulation was induced, the threshold for merging neurons was slightly higher, because the stimulation caused clustered synchronous activity. For shorter datasets, rigid motion correction was sufficient; for the longer datasets K53 and J115 we applied non-rigid motion correction. Parameters for the automatic selection of components were optimized using a grid search approach.

The global default parameters for all datasets were obtained by performing a grid search on the nine datasets over the following values: trace peak-SNR threshold on the set {1.75, 2, 2.25, 2.5}, spatial correlation threshold on the set {0.75, 0.8, 0.85}, lower threshold on the CNN classifier (reject if the prediction is below this value) on the set {0.05, 0.1, 0.15}, and upper threshold on the classifier (accept if the prediction is above this value) on the set {0.9, 0.95, 0.99, 1}. The best overall parameters (used for the results reported in Table 1) were obtained for the choice (2, 0.85, 0.1, 0.99). For all datasets the background neuropil activity was modeled as a rank two matrix, and calcium dynamics were modeled as a first order autoregressive process. The remaining parameters were chosen so that all the datasets could be run on a machine with less than 128 GB of RAM.
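Schematically, this search scores every combination of the four thresholds against the consensus annotations and keeps the combination with the highest average F1 score. The sketch below only illustrates that bookkeeping; the parameter names and the evaluate() callback (standing for "run CaImAn batch with these settings and compute the F1 score against the consensus") are placeholders, not the CaImAn API.

from itertools import product

# Hypothetical sketch of the grid search described above (parameter names are
# illustrative). evaluate(dataset, params) is assumed to return the F1 score of a
# CaImAn batch run against the consensus annotation of that dataset.

grid = {
    'peak_snr_thr':  [1.75, 2, 2.25, 2.5],    # trace peak-SNR threshold
    'rval_thr':      [0.75, 0.8, 0.85],       # spatial correlation threshold
    'cnn_lowest':    [0.05, 0.1, 0.15],       # reject if CNN prediction is below
    'cnn_upper':     [0.9, 0.95, 0.99, 1],    # accept if CNN prediction is above
}

def grid_search(datasets, evaluate):
    best_params, best_f1 = None, -1.0
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        f1 = sum(evaluate(d, params) for d in datasets) / len(datasets)
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1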

Implementation details for CaImAn online

Datasets were processed for two epochs with the exception of the longer datasets J115, K53 where only one pass of the data was performed to limit computational cost. For all datasets the background neuropil activity was modeled as a rank two matrix, and calcium dynamics were modeled as a first order autoregressive process. For each dataset, CaImAn online was initialized on the first 200 frames, using the BareInitialization on the entire FOV with only two neurons, so in practice all the neurons were detected during the online mode. We did not post-process the results (which could have further enhanced performance) in order to demonstrate performance levels with fully online practices. As in batch processing, the expected size of neurons was chosen separately for each dataset after inspecting the correlation image. Several datasets (N.03.00.t, N.02.00, J123, J115, K53) were spatially decimated by a factor of 2 to enhance processing speed, a step that did not lead to changes in detection performance.

To select global parameters for all datasets we performed a grid search on all nine datasets by varying the following parameters: the peak-SNR threshold for accepting a candidate component on the set {0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2}, the online CNN classifier threshold for accepting candidate components on the set {0.5, 0.55, 0.6, 0.65, 0.7, 0.75}, and the number of candidate components per frame on the set {5, 7, 10, 14}. The best overall parameters (reported in Table 1) were obtained for the choice (1.2, 0.65, 10). This parameter choice was also the best when restricting the search to the six shorter datasets. For the three longer datasets, the best parameter choice was (2, 0.6, 5), corresponding to a stricter set of parameters with fewer candidate components per frame (Figure 4—figure supplement 1).

For the analysis of the whole brain zebrafish dataset, CaImAn online was run for one epoch with the same parameters as above, the only differences being the number of neurons during initialization (600 vs 2) and the value of the threshold for the online CNN classifier (0.75 vs 0.5). The former choice was motivated by the goal of retrieving, with a single pass, neurons from a preparation with a denser level of activity over a larger FOV in this short dataset (1885 frames). To this end, the number of candidate neurons at each timestep was set to 10 (per plane). The threshold choice was motivated by the fact that the classifier was trained on mouse data only, so a higher threshold helps reduce potential false positive components. Rigid motion correction was applied online to each plane.

Comparison with Suite2p

To compare CaImAn with Suite2p we used the MATLAB version of the Suite2p package (Pachitariu et al., 2017). To select parameters for Suite2p we used a small grid search around the default values for the variables nSVDforROI, NavgFramesSVD, and sig. The classifier used by Suite2p was not re-trained for each dataset but was used with its default values. For each case (with and without the classifier), the values that give the best F1 score on average are reported in Figure 4—figure supplement 2. The dataset J123 was excluded from the comparison since, due to its low SNR, Suite2p did not converge and kept adding a large number of neurons in each iteration. Use of the classifier improved the results on average, from an F1 score of 0.51±0.12 without the classifier to 0.55±0.12 with it, although the classifier improved the F1 score on only four of the eight tested datasets. As with CaImAn, it is possible that dataset-specific parameter choices can lead to improved results.

Performance quantification as a function of SNR

To quantify performance as a function of SNR we construct the consensus traces by running CaImAn batch on the datasets seeded with the ‘consensus’ binary masks obtained from the manual annotators. Afterwards, the average peak-SNR of a trace c with corresponding residual signal r (Equation 5) is obtained as

\mathrm{SNR}(\mathbf{z}) = -\Phi^{-1}(p_{\min}), \qquad (13)

where Φ^{-1}(·) denotes the probit function (the quantile function of the standard Gaussian distribution), z is the z-scored version of c + r (Equation 6), and p_min is given by Equation 8.
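For concreteness, Equation 13 maps a tail probability to a peak-SNR value through the standard normal quantile function; the short sketch below is purely illustrative and assumes p_min has already been computed as in Equation 8.

from scipy.stats import norm

# Equation 13: peak-SNR from the tail probability p_min of the z-scored trace.
def peak_snr(p_min):
    return -norm.ppf(p_min)       # probit (standard normal quantile) function

print(peak_snr(0.00135))          # a tail probability of ~0.00135 maps to a peak-SNR of ~3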

Let c_1^gt, c_2^gt, …, c_N^gt be the consensus traces and c_1^cm, c_2^cm, …, c_N^cm be their corresponding CaImAn inferred traces. Here we assume that false positive and false negative components are matched with trivial components that have 0 SNR. Let also SNR_gt^i = SNR(c_i^gt) and SNR_cm^i = SNR(c_i^cm), respectively. After computing the SNR for both consensus and inferred traces, the performance of the algorithm can be quantified in multiple ways as a function of an SNR threshold θ_SNR:

Precision: Precision at level θ_SNR can be computed as the fraction of detected components with SNR_cm > θ_SNR that are matched with consensus components. It quantifies the certainty that a component detected with a given SNR or above corresponds to a true component:

\mathrm{PREC}(\theta_{SNR}) = \frac{|\{i : (\mathrm{SNR}_{cm}^i > \theta_{SNR}) \ \& \ (\mathrm{SNR}_{gt}^i > 0)\}|}{|\{i : \mathrm{SNR}_{cm}^i > \theta_{SNR}\}|}

Recall: Recall at level θ_SNR can be computed as the fraction of consensus components with SNR_gt > θ_SNR that are detected by the algorithm. It quantifies the certainty that a consensus component with a given SNR or above is detected:

\mathrm{RECALL}(\theta_{SNR}) = \frac{|\{i : (\mathrm{SNR}_{gt}^i > \theta_{SNR}) \ \& \ (\mathrm{SNR}_{cm}^i > 0)\}|}{|\{i : \mathrm{SNR}_{gt}^i > \theta_{SNR}\}|}

F1 Score: An overall F1 score at level θ_SNR can be obtained by computing the harmonic mean of precision and recall:

\mathrm{F1}(\theta_{SNR}) = \frac{2\,\mathrm{PREC}(\theta_{SNR}) \times \mathrm{RECALL}(\theta_{SNR})}{\mathrm{PREC}(\theta_{SNR}) + \mathrm{RECALL}(\theta_{SNR})}
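The sketch below shows how these curves can be computed from per-component peak-SNR values; it is a toy example with synthetic numbers, not part of CaImAn, and it assumes (as stated above) that unmatched components are entered with an SNR of 0.

import numpy as np

# Illustrative computation of the SNR-dependent precision/recall/F1 curves.
# snr_gt, snr_cm: per-component peak-SNR of consensus and inferred traces,
# with unmatched (false negative / false positive) components entered as 0.

def prec_recall_f1(snr_gt, snr_cm, thr):
    detected  = snr_cm > thr
    consensus = snr_gt > thr
    prec   = (detected & (snr_gt > 0)).sum() / max(detected.sum(), 1)
    recall = (consensus & (snr_cm > 0)).sum() / max(consensus.sum(), 1)
    f1 = 2 * prec * recall / (prec + recall) if (prec + recall) > 0 else 0.0
    return prec, recall, f1

# toy data: sweep the threshold to trace out curves like those in Figure 4d
rng = np.random.default_rng(0)
snr_gt = np.abs(rng.normal(5, 2, 200))
snr_cm = np.where(rng.random(200) < 0.9, snr_gt + rng.normal(0, 0.5, 200), 0.0)
curves = [prec_recall_f1(snr_gt, snr_cm, t) for t in np.linspace(0, 10, 21)]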

The cautious reader will observe that the precision and recall quantities described above are not computed on the same set of components. This can be remedied by recomputing the quantities in one of two ways:

AND framework: Here we consider a match only if both traces have SNR above the given threshold:

\mathrm{PREC}_{\mathrm{AND}}(\theta_{SNR}) = \frac{|\{i : (\mathrm{SNR}_{cm}^i > \theta_{SNR}) \ \& \ (\mathrm{SNR}_{gt}^i > \theta_{SNR})\}|}{|\{i : \mathrm{SNR}_{cm}^i > \theta_{SNR}\}|}
\mathrm{RECALL}_{\mathrm{AND}}(\theta_{SNR}) = \frac{|\{i : (\mathrm{SNR}_{gt}^i > \theta_{SNR}) \ \& \ (\mathrm{SNR}_{cm}^i > \theta_{SNR})\}|}{|\{i : \mathrm{SNR}_{gt}^i > \theta_{SNR}\}|}

OR framework: Here we consider a match if either trace has SNR above the given threshold and its match has SNR above 0:

\mathrm{PREC}_{\mathrm{OR}}(\theta_{SNR}) = \frac{|\{i : (\max(\mathrm{SNR}_{gt}^i, \mathrm{SNR}_{cm}^i) > \theta_{SNR}) \ \& \ (\min(\mathrm{SNR}_{gt}^i, \mathrm{SNR}_{cm}^i) > 0)\}|}{|\{i : \mathrm{SNR}_{cm}^i > 0\}|}
\mathrm{RECALL}_{\mathrm{OR}}(\theta_{SNR}) = \frac{|\{i : (\max(\mathrm{SNR}_{gt}^i, \mathrm{SNR}_{cm}^i) > \theta_{SNR}) \ \& \ (\min(\mathrm{SNR}_{gt}^i, \mathrm{SNR}_{cm}^i) > 0)\}|}{|\{i : \mathrm{SNR}_{gt}^i > 0\}|}

It is easy to show that

\mathrm{PREC}_{\mathrm{AND}}(\theta_{SNR}) \leq \mathrm{PREC}(\theta_{SNR}) \leq \mathrm{PREC}_{\mathrm{OR}}(\theta_{SNR})
\mathrm{RECALL}_{\mathrm{AND}}(\theta_{SNR}) \leq \mathrm{RECALL}(\theta_{SNR}) \leq \mathrm{RECALL}_{\mathrm{OR}}(\theta_{SNR})
\mathrm{F1}_{\mathrm{AND}}(\theta_{SNR}) \leq \mathrm{F1}(\theta_{SNR}) \leq \mathrm{F1}_{\mathrm{OR}}(\theta_{SNR}),

with equality holding for θSNR=0. As demonstrated in Figure 4d, these bounds are tight.

Additional features of CaImAn

 CaImAn contains a number of additional features that are not presented in the results section for reasons of brevity. These include:

Volumetric data processing

Apart from planar 2D data, CaImAn batch is also applicable to 3D volumetric data collected via dense raster scanning methods or from direct volume imaging methods such as light field microscopy (Prevedel et al., 2014; Grosenick et al., 2017).

Segmentation of structural indicator data

Structural indicators expressed in the nucleus and functional indicators expressed in the cytoplasm can facilitate source extraction and help identify silent or specific subpopulations of neurons (e.g., inhibitory). CaImAn provides a simple adaptive thresholding filtering method for segmenting summary images of the structural channel (e.g., mean image). The obtained results can be used for seeding source extraction from the functional channel in CaImAn batch or CaImAn online as already discussed.
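As a rough illustration of what such a segmentation step can look like (an assumption-laden sketch built on scikit-image, not the CaImAn routine), one can adaptively threshold the mean image of the structural channel and keep connected regions of plausible cell-body size:

import numpy as np
from skimage.filters import threshold_local
from skimage.measure import label, regionprops

# Illustrative segmentation of a structural-channel summary image with adaptive
# (local) thresholding; the retained regions can then seed SeededInitialization.

def segment_structural(mean_img, block_size=31, min_area=30):
    thr = threshold_local(mean_img, block_size=block_size, method='gaussian')
    mask = mean_img > thr                 # local threshold handles uneven illumination
    labels = label(mask)
    keep = [r.label for r in regionprops(labels) if r.area >= min_area]
    return np.isin(labels, keep)          # union of retained regions (binary seed mask)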

Duplicate detection

The annotations obtained through the consensus process were screened for possible duplicate selections. To detect duplicate components we define the spatial overlap matrix O as

O_{ij} = \begin{cases} 0, & i = j \\ \dfrac{|m(\mathbf{a}_i) \cap m(\mathbf{a}_j)|}{|m(\mathbf{a}_j)|}, & i \neq j, \end{cases} \qquad (14)

which gives the fraction of component i that overlaps with component j, where m(·) is the thresholding function defined in Equation 10. Any entry of O that is above a threshold θ_o (e.g., θ_o = 0.7 used here) indicates a pair of duplicate components. To decide which of the two components should be removed, we use the predictions of the CaImAn batch CNN classifier, removing the component with the lower score.
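A minimal sketch of this screening step is given below; it assumes the binarized footprints and the per-component classifier scores are available as plain arrays and is an illustration, not the CaImAn implementation.

import numpy as np

# Illustrative duplicate screening: compute the pairwise overlap matrix of the
# binarized footprints (Equation 14) and, for each pair above theta_o, drop the
# component with the lower CNN classifier score. M: (pixels x components) booleans.

def remove_duplicates(M, cnn_scores, theta_o=0.7):
    K = M.shape[1]
    area = np.maximum(M.sum(axis=0).astype(float), 1.0)
    inter = M.T.astype(float) @ M.astype(float)     # pairwise intersection sizes
    O = inter / area[None, :]                       # O[i, j] = |m_i ∩ m_j| / |m_j|
    np.fill_diagonal(O, 0.0)
    keep = np.ones(K, dtype=bool)
    for i, j in zip(*np.where(O > theta_o)):
        if keep[i] and keep[j]:
            keep[i if cnn_scores[i] < cnn_scores[j] else j] = False
    return keep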

Extraction of ΔF/F

The fluorescence trace 𝐟i of component i can be written as

\mathbf{f}_i = \lVert \mathbf{a}_i \rVert_2 \, (\mathbf{c}_i + \mathbf{r}_i). \qquad (15)

The fluorescence due to the component’s transients overlaps with a background fluorescence due to the baseline fluorescence of the component and to neuropil activity, which can be expressed as

\mathbf{f}_{0,i} = \mathrm{Baseline}\left(\mathbf{f}_i + B^\top \mathbf{a}_i\right), \qquad (16)

where Baseline : ℝ^T → ℝ^T is a baseline extraction function, and B is the estimated background signal. Examples of the baseline extraction function are a percentile function (e.g., the 10th percentile) or, for longer traces, a running percentile function, for example the 10th percentile over a window of a hundred seconds. (Computing the exact running percentile can be computationally intensive; to reduce the complexity we compute the running percentile with a stride of W, where W is equal to or smaller than the length of the window, and then linearly interpolate the values.) To determine the optimal percentile level, an empirical histogram of the trace (or parts of it in the case of long traces) is computed using a diffusion kernel density estimator (Botev et al., 2010), and the mode of this density is used to define the baseline and its corresponding percentile level. The ΔF/F activity of component i can then be written as

\mathbf{c}_i^{\Delta F/F} = \frac{\mathbf{f}_i - \mathrm{Baseline}(\mathbf{f}_i)}{\mathbf{f}_{0,i}} \qquad (17)

The approach we propose here is conceptually similar to practical approaches where the ΔF/F is computed by averaging over the spatial extent of an ROI (Jia et al., 2011), with some differences: (i) instead of averaging with a binary mask we use a weighted average with the shape of each component, (ii) signal due to overlapping components is removed from the calculation of the background fluorescence, and (iii) the traces have been extracted through the CNMF process prior to the ΔF/F extraction. Note that the same approach can also be applied to the trace ‖a_i‖_2 c_i that does not include the residual traces for each component. In practice it can be beneficial to extract ΔF/F traces prior to deconvolution, since the ΔF/F transformation can alleviate the effects of drifting baselines, for example due to bleaching. For the non-deconvolved traces f_i some temporal smoothing can also be applied to obtain smoother ΔF/F traces.
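A minimal sketch of the strided running-percentile baseline is shown below; it uses a fixed percentile level (rather than the mode-based choice described above) and hypothetical function names, and it is not the CaImAn implementation.

import numpy as np

# Illustrative strided running percentile: evaluate the percentile every `stride`
# samples over a sliding window and linearly interpolate in between (Equations 16-17).

def running_percentile(trace, window, stride, q=10):
    T = len(trace)
    centers = np.arange(0, T, stride)
    vals = [np.percentile(trace[max(0, c - window // 2):min(T, c + window // 2)], q)
            for c in centers]
    return np.interp(np.arange(T), centers, vals)

def dff(f, f_background, window=3000, stride=300, q=10):
    # f: trace of one component; f_background: background signal projected on its footprint
    baseline = running_percentile(f, window, stride, q)
    f0 = running_percentile(f + f_background, window, stride, q)
    return (f - baseline) / f0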

Algorithmic details

In the following we present in pseudocode form several of the routines introduced and used by CaImAn. Note that the pseudocode descriptions do not aim to present a complete picture and may refer to other work for some of the steps.

Algorithm 1: ProcessInPatches
Require: Input data matrix Y, patch size, overlap size, initialization method, rest of parameters.
1: Y(1),…,Y(Np) = CONSTRUCTPATCHES(Y, ps, os)  Break data into memory mapped patches.
2: for i = 1,…,Np do  Process each patch
3:   [A(i),C(i),b(i),f(i)] = CNMF(Y(i), options)  Run CNMF on each patch
4: end for
5: [A,C] = MERGECOMPONENTS({A(i),C(i)}, i = 1,…,Np)  Merge components
6: [b,f] = MERGEBACKGROUNDS({b(i),f(i)}, i = 1,…,Np)  Merge background components
7: M ← (A > 0)  Find masks of spatial footprints.
8: repeat  Optionally keep updating A, C, b, f using HALS (Cichocki et al., 2007).
9:   [b,f] ← NNMF(Y − AC, nb)
10:  C ← argmin_{C ≥ 0} ‖Y − bf − AC‖
11:  A ← argmin_{A ≥ 0, supp(A) ⊆ M} ‖Y − bf − AC‖
12: until convergence
13: return A, C, b, f
Algorithm 2 CaImAn batch
Require: Input data matrix Y, rest of parameters.
1: Y ← NoRMCorre(Y, params)  Motion correction (Pnevmatikakis and Giovannucci, 2017)
2: A, C, b, f = PROCESSINPATCHES(Y, params)  Run CNMF in patches (Algorithm 1)
3: J ← ESTIMATEQUALITY(Y, A, C, b, f, params)  Get indices of accepted components
4: A ← A[:,J], C ← C[J,:]  Disregard rejected components
5: [b,f] ← NNMF(Y − AC, nb)
6: C ← argmin_{C ≥ 0} ‖Y − bf − AC‖
7: A ← argmin_{A ≥ 0, supp(A) ⊆ M} ‖Y − bf − AC‖, with M = (A > 0)  Refit (optional)
8: return A,C,𝐛,𝐟
Algorithm 3: SeededInitialization
Require: Input data matrix Y, matrix of binary masks M, number of background components nb.
1: p ← find(M1 == 0)  Find the pixels not covered by any component.
2: [∼, f] ← NNMF(Y[p,:], nb)  Run NMF on these pixels just to get temporal backgrounds f
3: b ← argmin_{b ≥ 0} ‖Y − bf‖  Obtain spatial background b.
4: C ← max((M^T M)^{-1} M^T (Y − bf), 0)  Initialize temporal traces.
5: A ← argmin_{A ≥ 0, supp(A) ⊆ M} ‖Y − bf − AC‖  Initialize spatial footprints constrained within the masks.
6: repeat  Optionally keep updating A, C, b, f using HALS
7:   [b,f] ← NNMF(Y − AC, nb)
8:   C ← argmin_{C ≥ 0} ‖Y − bf − AC‖
9:   A ← argmin_{A ≥ 0, supp(A) ⊆ M} ‖Y − bf − AC‖
10: until Convergence
11: return A,C,𝐛,𝐟
Algorithm 4 CaImAn online (See Giovannucci et al., 2017 for explanation of routines)
Require: Data matrix Y, initial estimates A,𝐛,C,𝐟,S, current number of components K, current timestep t, rest of parameters.
1: W ← Y[:, 1:t] C^T / t
2: M ← C C^T / t  Initialize sufficient statistics (Giovannucci et al., 2017)
3: 𝒢 = DETERMINEGROUPS([A,b], K)  (Giovannucci et al., 2017, Algorithm S1-S2)
4: Rbuf ← (Y − [A,b][C;f])[:, t−lb+1 : t]  Initialize residual buffer
5: t0 ← t
6: for i = 1,…,Nepochs do
7:   while there is more data do
8:     t ← t + 1
9:     yt ← MOTIONCORRECT(yt, b f_{t−1})  (Pnevmatikakis and Giovannucci, 2017)
10:    [ct; ft] ← UPDATETRACES([A,b], [c_{t−1}; f_{t−1}], yt, 𝒢)  (Giovannucci et al., 2017, Algorithm S3)
11:    C, S ← OASIS(C, γ, smin, λ)  (Friedrich et al., 2017b)
12:    Anew, Cnew ← FINDNEWCOMPONENTS(Rbuf, Ncomp)  Algorithm 5
13:    [A,b], [C,f], K, 𝒢, Rbuf, W, M ← INTEGRATENEWCOMPONENTS(
14:        [A,b], [C,f], K, 𝒢, Anew, Cnew, Rbuf, yt, W, M)  (Giovannucci et al., 2017, Algorithm S4)
15:    Rbuf ← [Rbuf[:, 2:lb], yt − A ct − b ft]  Update residual buffer
16:    W, M ← UPDATESUFFSTATISTICS(W, M, yt, [ct; ft])
17:    Iu ← SHAPEUPDATEINDICES(A, Inew)  Indices of components to get updated, Algorithm 6
18:    [A,b] ← UPDATESHAPES(W, M, [A,b], Iu)  (Giovannucci et al., 2017, Algorithm S5)
19:   end while
20:   t ← t0  Reset the time counter for the next epoch
21: end for
22: return A,b,C,f,S
Algorithm 5: FindNewComponents
Require: Residual buffer Rbuf, number of new candidate components Ncomp, neuron radius r.
1: E ← Σ_t max(Rbuf(t), 0)^2
2: E ← HIGHPASSFILTER(E)  Spatial high pass filtering for contrast enhancement.
3: P = FINDLOCALPEAKS(E, Ncomp, r)  Find local maxima at least 2 apart.
4: Atest ← [ ]
5: for p ∈ P do
6:   Np = {(x,y) : |x − px| ≤ r, |y − py| ≤ r}  Define a neighborhood around p
7:   Atest ← [Atest, MEAN(Rbuf[Np, :])]
8: end for
9: Iaccept ← ONLINECNNCLASSIFIER(Atest)  Find indices of accepted components
10: Anew, Cnew ← [ ]
11: for i ∈ Iaccept do
12:   [a, c] ← NNMF(Rbuf[Np_i, :], 1)
13:   Anew ← [Anew, a]
14:   Cnew ← [Cnew; c]
15: end for
16: return Anew, Cnew
Algorithm 6: ShapeUpdateIndeces
Require: Set of spatial footprints A, indeces of newly added components J, update vector 𝐪, update period Tu, current step in online mode t.
1: if t = 0 then  Initialize vector at the beginning of online mode.
2:   q ← 2^([1, 2, …, |A|] / |A|)  Values logarithmically spaced between 1 and 2.
3: end if
4: q ← q × 0.5^(1/Tu)
5: if J = ∅ then
6:   Iu ← {i : qi ≤ 1}  Indices of components to get updated.
7:   q(Iu) ← q(Iu) + 1
8: else  Do not update shapes if new components are added.
9:   Io ← ∅
10:  for j ∈ J do
11:    Io ← Io ∪ {i : A[:,i]^T A[:,j] > 0}  Find overlapping components.
12:  end for
13:  q(Io) ← 0  Make sure these components get updated at the next step.
14:  Iu ← ∅
15: end if
16: return Indices of components to get updated Iu, update counter vector q.
Algorithm 7: RegisterPair
Require: Spatial footprint matrices A1,A2, field of view templates I1,I2, thresholds for binarization θb and matching θm.
1: S=COMPUTEMOTIONFIELD(I1,I2) Compute motion field between the templates.
2: A1 ← APPLYMOTIONFIELD(A1, S)  Align A1 to the template I2
3: [M1,M2]=BINARIZE([A1,A2],θb) Turn components into binary masks.
4: D=COMPUTEDISTANCEMATRIX(M1,M2,θD) Compute distance matrix.
5: P1,P2,L1,L2=HUNGARIAN(D) Match using the Hungarian algorithm.
6: return Matched components P1,P2, non-matched components L1,L2 and aligned components from first session A1.
Algorithm 8: RegisterMulti
Require: List of Spatial footprint matrices A1,A2,,AN, field of view templates I1,I2,,IN, thresholds for binarization θb and matching θm.
1: for i = 1,…,N do
2:   Ki = SIZE(Ai, 2)  Number of components in each session.
3: end for
4: Au ← A1  Initialize the union matrix Au.
5: m[1] = [1, 2, …, K1]  Initialize matchings list
6: Ktot ← K1  Total # of distinct components so far.
7: for i = 2,…,N do
8:   Pu, Pi, Lu, Li, Au = REGISTERPAIR(Au, Ai, I_{i−1}, Ii, θb, θm)  Register Au to session i.
9:   Au[:, Pu] ← Ai[:, Pi]  Keep the matched components from session i.
10:  Au ← [Au, Ai[:, Li]]  Include the non-matched components from session i.
11:  m[i][Pi] = Pu  m[i][j] = k if component j from session i is mapped to component k in the union.
12:  m[i][Li] = [Ktot+1, Ktot+2, …, Ktot+|Li|]  Include newly added components.
13:  Ktot ← Ktot + |Li|  Update total number of distinct components.
14: end for
15: return Union of all distinct components Au, and list of matchings m.

Acknowledgements

We thank B Cohen, L Myers, N Roumelioti, and S Villani for providing us with manual annotations. We thank V Staneva and B Deverett for contributing to the early stages of CaImAn, M Schachter for his insight and contributions, and L Paninski for numerous useful discussions. We thank N Carriero, I Fisk, and D Simon from the Flatiron Institute (Simons Foundation) for useful discussions and suggestions to optimize High Performance Computing code. We thank T Kawashima and M Ahrens for sharing the whole brain zebrafish dataset. Last but not least, we thank the active community of users for their great help in terms of code/method contributions, bug reporting, code testing and suggestions that have led to the growth of CaImAn into a widely used open source package. A partial list of contributors (in the form of GitHub usernames) can be found at https://github.com/flatironinstitute/CaImAn/graphs/contributors (Python) and https://github.com/flatironinstitute/CaImAn-MATLAB/graphs/contributors (MATLAB). The authors acknowledge support from the following funding sources: AG, EAP, JF, PG (Simons Foundation, internal funding); JG, SAK, DWT (NIH NRSA F32NS077840-01, 5U01NS090541, 1U19NS104648 and Simons Foundation SCGB); PZ (NIH NIBIB R01EB022913, NSF NeuroNex DBI-1707398, Gatsby Foundation); JT (NIH R01-MH101198); FN (MURI, Simons Collaboration on the Global Brain and Pew Foundation).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Andrea Giovannucci, Email: agiovann@email.unc.edu.

Eftychios A Pnevmatikakis, Email: epnevmatikakis@flatironinstitute.org.

David Kleinfeld, University of California, San Diego, United States.

Andrew J King, University of Oxford, United Kingdom.

Funding Information

This paper was supported by the following grants:

  • National Institutes of Health F32NS077840-01 to Jeffrey L Gauthier.

  • Simons Foundation FI-CCB to Andrea Giovannucci, Johannes Friedrich, Pat Gunn, Jeremie Kalfon, Dmitri B Chklovskii, Eftychios A Pnevmatikakis.

  • Simons Foundation SCGB to Sue Ann Koay, Farzaneh Najafi, Jeffrey L Gauthier, David W Tank.

  • National Institutes of Health 5U01NS090541 to Sue Ann Koay, Jeffrey L Gauthier, David W Tank.

  • National Institutes of Health 1U19NS104648 to Sue Ann Koay, Jeffrey L Gauthier, David W Tank.

  • National Institutes of Health NIBIB R01EB022913 to Pengcheng Zhou.

  • National Science Foundation NeuroNex DBI-1707398 to Pengcheng Zhou.

  • Gatsby Charitable Foundation to Pengcheng Zhou.

  • National Institutes of Health R01-MH101198 to Jiannis Taxidis.

  • Pew Charitable Trusts to Farzaneh Najafi.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing—review and editing.

Conceptualization, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—review and editing.

Software, Validation, Writing—review and editing.

Software, Visualization.

Software.

Data curation, Formal analysis, Validation, Writing—review and editing.

Data curation, Software.

Software, Validation.

Conceptualization, Data curation.

Software, Validation, Methodology, Writing—review and editing.

Resources, Funding acquisition.

Resources, Data curation.

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Project administration, Writing—review and editing.

Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Visualization, Methodology, Writing—original draft.

Additional files

Transparent reporting form
DOI: 10.7554/eLife.38173.031

Data availability

All input data used to generate most figures, along with the necessary scripts, is available via Zenodo (https://zenodo.org/record/1659149#.XC_Wcs9Ki9s). The original (pre-non-rigid-motion correction) NF datasets listed in Table 2 are publicly available via https://github.com/CodeNeuro/neurofinder. They were originally shared by the Hausser, Losonczy, Svoboda, and Harvey labs.

The following previously published datasets were used:

Andrea Giovannucci, Johannes Friedrich, Pat Gunn, Brandon L Brown, Sue Ann Koay, Jiannis Taxidis, Farzaneh Najafi, Jeffrey L Gauthier, Pengcheng Zhou, Baljit S Khakh, David W Tank, Dmitri B Chklovskii, Eftychios A Pnevmatikakis. 2018. Datasets Generated: Data from CaImAn, an open source tool for scalable Calcium Imaging data Analysis. Zenodo.

References

  1. Ahrens MB, Orger MB, Robson DN, Li JM, Keller PJ. Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nature Methods. 2013;10:413–420. doi: 10.1038/nmeth.2434. [DOI] [PubMed] [Google Scholar]
  2. Apthorpe N, Riordan A, Aguilar R, Homann J, Gu Y, Tank D, Seung HS. Advances in Neural Information Processing Systems. MIT Press; 2016. Automatic Neuron Detection in Calcium Imaging Data Using Convolutional Networks; pp. 3270–3278. [Google Scholar]
  3. Berens P, Freeman J, Deneux T, Chenkov N, McColgan T, Speiser A, Macke JH, Turaga S, Mineault P, Rupprecht P, Gerhard S, Friedrich RW, Friedrich J, Paninski L, Pachitariu M, Harris KD, Bolte B, Machado TA, Ringach D, Reimer J. Community-based benchmarking improves spike inference from two-photon calcium imaging data. bioRxiv. 2017 doi: 10.1101/177956. [DOI] [PMC free article] [PubMed]
  4. Botev ZI, Grotowski JF, Kroese DP. Kernel density estimation via diffusion. The Annals of Statistics. 2010;38:2916–2957. doi: 10.1214/10-AOS799. [DOI] [Google Scholar]
  5. Bouchard MB, Voleti V, Mendes CS, Lacefield C, Grueber WB, Mann RS, Bruno RM, Hillman EM. Swept confocally-aligned planar excitation (SCAPE) microscopy for high speed volumetric imaging of behaving organisms. Nature Photonics. 2015;9:113–119. doi: 10.1038/nphoton.2014.323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bradski G. The OpenCV library. Dr Dobb’s Journal: Software Tools for the Professional Programmer. 2000;25:120–123. [Google Scholar]
  7. Cai DJ, Aharoni D, Shuman T, Shobe J, Biane J, Song W, Wei B, Veshkini M, La-Vu M, Lou J, Flores SE, Kim I, Sano Y, Zhou M, Baumgaertel K, Lavi A, Kamata M, Tuszynski M, Mayford M, Golshani P, Silva AJ. A shared neural ensemble links distinct contextual memories encoded close in time. Nature. 2016;534:115–118. doi: 10.1038/nature17955. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Carrillo-Reid L, Yang W, Kang Miller JE, Peterka DS, Yuste R. Imaging and optically manipulating neuronal ensembles. Annual Review of Biophysics. 2017;46:271–293. doi: 10.1146/annurev-biophys-070816-033647. [DOI] [PubMed] [Google Scholar]
  9. Chen TW, Wardill TJ, Sun Y, Pulver SR, Renninger SL, Baohan A, Schreiter ER, Kerr RA, Orger MB, Jayaraman V, Looger LL, Svoboda K, Kim DS. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature. 2013;499:295–300. doi: 10.1038/nature12354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Cichocki A, Zdunek R, Si A. Lecture Notes in Computer Science. Springer; 2007. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization; pp. 169–176. [Google Scholar]
  11. Dean J, Ghemawat S. MapReduce. Communications of the ACM. 2008;51:107–113. doi: 10.1145/1327452.1327492. [DOI] [Google Scholar]
  12. Deneux T, Kaszas A, Szalay G, Katona G, Lakner T, Grinvald A, Rózsa B, Vanzetta I. Accurate spike estimation from noisy calcium signals for ultrafast three-dimensional imaging of large neuronal populations in vivo. Nature Communications. 2016;7:12190. doi: 10.1038/ncomms12190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Flusberg BA, Nimmerjahn A, Cocker ED, Mukamel EA, Barretto RP, Ko TH, Burns LD, Jung JC, Schnitzer MJ. High-speed, miniaturized fluorescence microscopy in freely moving mice. Nature Methods. 2008;5:935–938. doi: 10.1038/nmeth.1256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Freeman J, Vladimirov N, Kawashima T, Mu Y, Sofroniew NJ, Bennett DV, Rosen J, Yang CT, Looger LL, Ahrens MB. Mapping brain activity at scale with cluster computing. Nature Methods. 2014;11:941–950. doi: 10.1038/nmeth.3041. [DOI] [PubMed] [Google Scholar]
  15. Friedrich J, Yang W, Soudry D, Mu Y, Ahrens MB, Yuste R, Peterka DS, Paninski L. Multi-scale approaches for high-speed imaging and analysis of large neural populations. PLOS Computational Biology. 2017a;13:e1005685. doi: 10.1371/journal.pcbi.1005685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Friedrich J, Zhou P, Paninski L. Fast online deconvolution of calcium imaging data. PLOS Computational Biology. 2017b;13:e1005423. doi: 10.1371/journal.pcbi.1005423. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Giovannucci A, Friedrich J, Kaufman J, Churchland A, Chklovskii D, Paninski L, Pnevmatikakis EA. OnACID: online analysis of calcium imaging data in real time. Biorxiv. 2017 doi: 10.1101/193383. [DOI]
  18. Giovannucci A, Pnevmatikakis EA, Friedrich J, Gunn P, Kalfon J, Brown B. CaImAn. c156373GitHub. 2018 https://github.com/flatironinstitute/CaImAn
  19. Grosenick LM, Broxton M, Kim CK, Liston C, Poole B, Yang S, Andalman AS, Scharff E, Cohen N, Yizhar O, Ramakrishnan C, Ganguli S, Suppes P, Levoy M, Deisseroth K. Identification of cellular-activity dynamics across large tissue volumes in the mammalian brain. bioRxiv. 2017 doi: 10.1101/132688. [DOI]
  20. Jia H, Rochefort NL, Chen X, Konnerth A. In vivo two-photon imaging of sensory-evoked dendritic calcium signals in cortical neurons. Nature Protocols. 2011;6:28–35. doi: 10.1038/nprot.2010.169. [DOI] [PubMed] [Google Scholar]
  21. Kaifosh P, Zaremba JD, Danielson NB, Losonczy A. SIMA: python software for analysis of dynamic fluorescence imaging data. Frontiers in Neuroinformatics. 2014;8:80. doi: 10.3389/fninf.2014.00080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Kawashima T, Zwart MF, Yang CT, Mensh BD, Ahrens MB. The serotonergic system tracks the outcomes of actions to mediate Short-Term motor learning. Cell. 2016;167:933–946. doi: 10.1016/j.cell.2016.09.055. [DOI] [PubMed] [Google Scholar]
  23. Klibisz A, Rose D, Eicholtz M, Blundon J, Zakharenko S. Lecture Notes in Computer Science. Springer; 2017. Fast, Simple Calcium Imaging Segmentation with Fully Convolutional Networks; pp. 285–293. [Google Scholar]
  24. Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research. 2010;11:19–60. [Google Scholar]
  25. Mukamel EA, Nimmerjahn A, Schnitzer MJ. Automated analysis of cellular signals from large-scale calcium imaging data. Neuron. 2009;63:747–760. doi: 10.1016/j.neuron.2009.08.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Pachitariu M, Packer AM, Pettit N, Dalgleish H, Hausser M, Sahani M. Advances in Neural Information Processing Systems. MIT Press; 2013. Extracting regions of interest from biological images with convolutional sparse block coding; pp. 1745–1753. [Google Scholar]
  27. Pachitariu M, Stringer C, Dipoppa M, Schröder S RLF, Dalgleish H, Carandini M, Harris KD. Suite2p: beyond 10,000 neurons with standard two-photon microscopy. BioRxiv. 2017 doi: 10.1101/061507. [DOI]
  28. Packer AM, Russell LE, Dalgleish HW, Häusser M. Simultaneous all-optical manipulation and recording of neural circuit activity with cellular resolution in vivo. Nature Methods. 2015;12:140–146. doi: 10.1038/nmeth.3217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M. Scikit-learn: machine learning in python. Journal of Machine Learning Research. 2011;12:2825–2830. [Google Scholar]
  30. Petersen A, Simon N, Witten D. SCALPEL: extracting neurons from calcium imaging data. arXiv. 2017 doi: 10.1214/18-AOAS1159. https://arxiv.org/abs/1703.06946 [DOI] [PMC free article] [PubMed]
  31. Piatkevich KD, Jung EE, Straub C, Linghu C, Park D, Suk HJ, Hochbaum DR, Goodwin D, Pnevmatikakis E, Pak N, Kawashima T, Yang CT, Rhoades JL, Shemesh O, Asano S, Yoon YG, Freifeld L, Saulnier JL, Riegler C, Engert F, Hughes T, Drobizhev M, Szabo B, Ahrens MB, Flavell SW, Sabatini BL, Boyden ES. A robotic multidimensional directed evolution approach applied to fluorescent voltage reporters. Nature Chemical Biology. 2018;14:352–360. doi: 10.1038/s41589-018-0004-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Pnevmatikakis EA, Merel J, Pakman A, Paninski L. Bayesian spike inference from calcium imaging data. Signals, Systems and Computers, 2013 Asilomar Conference on IEEE; 2013. pp. 349–353. [Google Scholar]
  33. Pnevmatikakis EA, Soudry D, Gao Y, Machado TA, Merel J, Pfau D, Reardon T, Mu Y, Lacefield C, Yang W, Ahrens M, Bruno R, Jessell TM, Peterka DS, Yuste R, Paninski L. Simultaneous denoising, Deconvolution, and demixing of calcium imaging data. Neuron. 2016;89:285–299. doi: 10.1016/j.neuron.2015.11.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Pnevmatikakis EA, Giovannucci A. NoRMCorre: an online algorithm for piecewise rigid motion correction of calcium imaging data. Journal of Neuroscience Methods. 2017;291:83–94. doi: 10.1016/j.jneumeth.2017.07.031. [DOI] [PubMed] [Google Scholar]
  35. Pnevmatikakis EA. Analysis pipelines for calcium imaging data. Current Opinion in Neurobiology. 2018;55:15–21. doi: 10.1016/j.conb.2018.11.004. [DOI] [PubMed] [Google Scholar]
  36. Pnevmatikakis EA, Giovannucci A, Kalfon J, Najafi F, Taxidis J. CaImAn-MATLAB. 52af659GitHub. 2018 https://github.com/elifesciences-publications/CaImAn-MATLAB
  37. Prevedel R, Yoon YG, Hoffmann M, Pak N, Wetzstein G, Kato S, Schrödel T, Raskar R, Zimmer M, Boyden ES, Vaziri A. Simultaneous whole-animal 3D imaging of neuronal activity using light-field microscopy. Nature Methods. 2014;11:727–730. doi: 10.1038/nmeth.2964. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Reynolds S, Abrahamsson T, Schuck R, Sjöström PJ, Schultz SR, Dragotti PL. ABLE: an Activity-Based level set segmentation algorithm for Two-Photon calcium imaging data. Eneuro. 2017;4:ENEURO.0012-17.2017. doi: 10.1523/ENEURO.0012-17.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Sheintuch L, Rubin A, Brande-Eilat N, Geva N, Sadeh N, Pinchasof O, Ziv Y. Tracking the same neurons across multiple days in Ca2+Imaging Data. Cell Reports. 2017;21:1102–1115. doi: 10.1016/j.celrep.2017.10.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Smith SL, Häusser M. Parallel processing of visual space by neighboring neurons in mouse visual cortex. Nature Neuroscience. 2010;13:1144–1149. doi: 10.1038/nn.2620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Sofroniew NJ, Flickinger D, King J, Svoboda K. A large field of view two-photon mesoscope with subcellular resolution for in vivo imaging. eLife. 2016;5:e14472. doi: 10.7554/eLife.14472. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Spaen Q, Hochbaum DS, Asín-Achá R. HNCcorr: a novel combinatorial approach for cell identification in calcium-imaging movies. arXiv. 2017 doi: 10.1523/ENEURO.0304-18.2019. https://arxiv.org/abs/1703.01999 [DOI] [PMC free article] [PubMed]
  43. Speiser A, Yan J, Archer EW, Buesing L, Turaga SC, Macke JH. Advances in Neural Information Processing Systems. MIT Press; 2017. Fast amortized inference of neural activity from calcium imaging data with variational autoencoders; pp. 4027–4037. [Google Scholar]
  44. Theis L, Berens P, Froudarakis E, Reimer J, Román Rosón M, Baden T, Euler T, Tolias AS, Bethge M. Benchmarking spike rate inference in population calcium imaging. Neuron. 2016;90:471–482. doi: 10.1016/j.neuron.2016.04.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Toledo S. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society; 1999. A survey of out-of-core algorithms in numerical linear algebra; pp. 161–179. [Google Scholar]
  46. Valmianski I, Shih AY, Driscoll JD, Matthews DW, Freund Y, Kleinfeld D. Automatic identification of fluorescently labeled brain cells for rapid functional imaging. Journal of Neurophysiology. 2010;104:1803–1811. doi: 10.1152/jn.00484.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Yu T, scikit-image contributors scikit-image: image processing in python. PeerJ. 2014;2:e453. doi: 10.7717/peerj.453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Vogelstein JT, Packer AM, Machado TA, Sippy T, Babadi B, Yuste R, Paninski L. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology. 2010;104:3691–3704. doi: 10.1152/jn.01073.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Walker T. Cell Magic Wand Tool 2014
  50. Xie Y, Chan AW, McGirr A, Xue S, Xiao D, Zeng H, Murphy TH. Resolution of High-Frequency mesoscale intracortical maps using the genetically encoded glutamate sensor iGluSnFR. Journal of Neuroscience. 2016;36:1261–1272. doi: 10.1523/JNEUROSCI.2744-15.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Yoo AB, Jette MA, Grondona M. Lecture Notes in Computer Science. Springer; 2003. Slurm: Simple linux utility for resource management; pp. 44–60. [Google Scholar]
  52. Zhou P, Resendez SL, Rodriguez-Romaguera J, Jimenez JC, Neufeld SQ, Giovannucci A, Friedrich J, Pnevmatikakis EA, Stuber GD, Hen R, Kheirbek MA, Sabatini BL, Kass RE, Paninski L. Efficient and accurate extraction of in vivo calcium signals from microendoscopic video data. eLife. 2018;7:e28728. doi: 10.7554/eLife.28728. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: David Kleinfeld1

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "CaImAn: An open source tool for scalable Calcium Imaging data Analysis" for consideration by eLife.

Your manuscript has been thoroughly reviewed by three outstanding reviewers and your software extensively tested by one of the groups. Your technical contribution, while admittedly introducing no novel approaches per se, is seen as a timely and carefully constructed software package that will help a large community of neuroscientists. However, before we can pass final judgment, we ask you to address all of the comments of the reviewers; this extensive request is based on the pointed nature of the comments and the importance of having your software accepted by the largest possible fraction of the imaging community. Please pay particular attention to queries on implementation.

A few key points to attend to include:

It is essential to judge the results with CaImAn against existing pipelines, e.g. the very popular Suite2P (Pachitariu et al.,) There is no expectation that CaImAn will shine in every dimension of use. Yet the authors simply must supply comparative benchmarks.

The trial data sets need to be completely described and, of course, made publicly and fully available if and when the manuscript is accepted.

It should be stated up front that there is no "ground truth" in the sense of simultaneous electrical and calcium measurements, only a comparison with a consensus view (note "tyranny of the majority" in one interpretation). Reviewer three makes a clear suggestion for cross-validation that should be followed.

The authors note that fine-tuning of parameters to individual datasets 'can significantly increase performance'. This must be detailed in light of the experience gained with nine different datasets.

Please augment the analysis of Figure 4 to show how precision and recall separately change with signal-to-noise ratio.

Reviewer #1:

Giovannucci et al., present a new open-source software package for analyzing calcium imaging data collected by two-photon and 1-photon microscopes. While the key algorithms used in the pipeline have been developed over many years and largely overlap with previously published work, new features, e.g. memory mapping, CNN-based classifiers, cross-day registration, have been added to improve the performance and further extend the functionalities of the software. The authors tested the package using various in vivo imaging datasets and compared the results with 'consensus ground truth'. The software is written in Python and is open-source, well-organized, and requires minimal user intervention.

The main advance of the manuscript is to combine a plethora of powerful algorithms that have been developed and improved over almost a decade into a single software package that allows batch processing as well as real-time processing of calcium imaging data. Most of the algorithms have been optimized for speed and performance to allow their use in real-time experiments such as closed-loop and all-optical experiments. Having a software package for calcium imaging data analysis that is freely available, well-documented and makes use of the most recent algorithms for the multiple steps involved in calcium imaging processing is extremely valuable for the field, as it will improve the quality and the efficiency of calcium data analysis throughout the neuroscience community, and also open up many new experimental avenues. I therefore strongly support publication.

1) My main concern is the lack of rigorous comparison of the performance of the package with existing solutions. The manuscript only contains internal comparisons but does not show comparisons to existing pipelines (e.g. Suite2P by Pachitariu et al.,) or earlier versions described in other manuscripts (e.g. CNMF in Pnevmatikakis et al., 2016), or state-of-the-art databases such as Neurofinder. This greatly increases the difficulty of judging the overall performance of the software package in its current form. Although the effort the authors went through to generate an improved ground truth data set is laudable, this does not allow one to judge the performance of the package with respect to existing packages/algorithms. For example, a test of how the package performs on the Neurofinder data and how existing packages (like e.g. Suite2P) perform on their newly generated ground truth data would help one to assess the performance of the ROI detection, which is one of the core features of the pipeline. While some of the comparisons have been performed in the respective publications, an overview of the performance of the pipeline with respect to its modules in their current form would be extremely informative.

2) Following on from the previous point, in the Discussion, the authors claim that 'apart from computational gains both approaches [MapReduce batch and OnACID online] also result in improved performance'. Is this with respect to CNMF-based approaches or other existing methods, for example, Suite2pl? If the latter, then the caiman software was never directly compared to any other approaches commonly used (discussed in the Related work section) to validate if it outperforms them in terms of speed or accuracy (only shown similar number of cells detected to Zhou et al., (2018)). Some of the datasets used for ground truth testing could be easily analyzed using different algorithms to provide the comparison and demonstrate the advantages of Caiman. Pachitariu et al., (2018) showed Suite2p detects more cells (in particular, cells with low baseline firing rate) than CNMF-based method does. It would be helpful to know in which way Caiman outperforms Suite2P.

3) The authors claim that CaImAn Online outperforms Batch in terms of neuron detection for longer recordings, while CaImAn Batch is better suited for shorter recordings (Table 1 and also in the Discussion section). What is this claim based on, or what is the performance measure considered? Assuming longer recordings as > 3000 frames and by inspecting Table 2, there is no clear distinction in terms of F1 scores or precision/recall scores (Table 1) based on the file length. Additionally, it is currently hard to inspect this claim as the information to be compared is stored in two separate tables.

4) The manuscript states that fine-tuning of parameters to individual datasets 'can significantly increase performance' (subsection “CAIMAN Batch and CAIMAN online detect neurons with near-human accuracy”) but no evidence is provided for this rather strong statement. It would be interesting to see the results of one of the videos analyzed in this section when the parameters were fine-tuned to show if such adjustment could lead to more reliable results.

5) In Figure 4A, why are the consensus ground truth components different in the Caiman batch panels and the caiman online panels? Were they constructed using the same SEEDINITIALIZATION procedure?

a) In the same figure, there are clearly some components that do not look like cells, in both Caiman batch/online detected components, and even in the consensus ROIs. The authors should add two panels to this figure, one showing the raw FOV image (preferably just the average image of the video, without further processing) with the manual selected ROI contours marked on it – which should look similar to the last panel in Figure 3A; and another showing the 'consensus ground truth' components given by the SEEDINITIALIZATION on top of the average image.

b) A worry is that the initialization procedure might add error to the ground truth and bias it towards the same direction as caiman does, and therefore masking error from benchmarking. For example, if one cell is mistakenly split into several components in both consensus data and caiman data, and both methods detected them, the performance metrics will be artifactually higher. It is important to directly compare the manually-picked ROIs in Figure 3 with the caiman-detected ROIs in Figure 4A (yellow), as the authors claim in the abstract that the software 'achieves near-human performance in detecting locations of active neurons'. It would be good to quantify how well the spatial footprints of the manually drawn ROIs and the caiman detected ROIs overlap. At least please plot the number of ROIs detected by humans against that by the software.

6. Figure 4B shows that the F1 score of Caiman online is higher than that of Caiman batch. And also, in text, the recall value of Caiman online is higher than Caiman batch. In the Materials and methods section, the authors state Caiman online uses a more strict CNN classifier to avoid false positives – I would expect this to result in a higher precision and lower recall, compared to the results given by Caiman batch – but it turns out to be the opposite. Is it a result of comparing the results with different consensus ground truth data?

7) The authors analysed how the F1 score of Caiman Batch depends on SNR, but do not directly address the question of why precision is much higher than recall. It would be helpful if the authors would should show how precision and recall change with SNR separately in addition to the current Figure 4D.

8) Motion correction algorithms and CNN classifier based neuronal filtering are discussed for the online purposes while the computational performance section does not touch on their speed. Running of the demo files in the GitHub link provided by the authors shows that they both introduce additional delays. The CNN classifier (evaluate_components_CNN function) run every few hundred frames is especially computationally expensive (> a few hundred ms on my device). In Figure 8C, is the CNN classifier used? If not, how much time delay will that induce? The computational cost of motion correction and CNN classifier should be reported to verify that.

9) In Figure 8C, does the 'processing time' in the left panel include the time for 'updating shapes'? Judging from the right panel, if the 'updating shapes' is enabled, it could take larger than 15 seconds to complete processing one frame, which means the software could not function in real-time. Subsection “Computational performance of CAIMAN” say that the shape update functionality can be distributed over several frames or done in parallel to allow for real-time processing. Was this already implemented into the online software? Also, how does the shape update functionality impact the activity traces, and does it bring any substantial advantages? This is not shown (Giovannucci et al., (2017) also does not quantify the differences) but can be useful to know for online implementation of the algorithm into experimental setups. It would be helpful if the authors would quantify how much improvement the 'updating shapes' step actually brings to the fidelity of calcium traces. If the improvement is minimal, then the user might skip this step if speed is the priority.

10) Figure 3, Figure 7, Figure 9, Appendix 0—figure 10, Appendix 0—figure 11: how were these FOV images generated? Are they the average frames of the motion-corrected videos or are the intensities of pixels within ROIs enhanced somehow? Please clarify. Please show the average image of the videos without enhancement.

Reviewer #2:

Let me start this review by acknowledging that CaImAn is unequivocally the state of the art toolbox for calcium imaging, and is an impressive piece of work.

However, this manuscript a bit confusing. It reads like the combination of an advertisement and documentation, without ever referencing (quantitatively) the gains/performance of the tools in caiman relative to any other tools in the literature. In other words, it is a description and benchmarking of a given tool.

Such a document is very useful, but also a bit confusing because that is not what peer-reviewed manuscripts typically look like. eLife is an interesting journal/venue, so perhaps they are interested in publishing something like this, I won't comment on its suitability for publishing in its current form. I will, however, describe what I believe to be the most important novel contributions that are valuable to the field that are in this manuscript, with some suggested modifications to clarify/improve on some points.

1) Nine new manually labeled datasets, with four "expert" labelers. In spike sorting, the most important paper ever (imho) was the one that generate "ground truth" data. this manuscript does not do that; one could image a "ground truth" channel but this manuscript does basically provide an upper bound on accuracy for any algorithm on these calcium imaging datasets, as defined by the consensus. This is potentially incredibly valuable. however, the details of the datasets are merely listed in a table in the Methods section, but not quantitative description / evaluation of them. As a resource, showing some images of sample traces, sample frames, summary statistics so that potential users could evaluate and compare their data with these data would be very valuable and interesting. comparing them, how hard is each for the experts, and what about the machines, is the hardness correlated, etc.? And this would be a highlight of this manuscript.

2) The parallelization of the standard method in the field, as in Figure 8, is a new result, and interesting but the metrics are not "standard" in the parallel programming literature. I'd recommend a few relevant notions of scaling, including strong scaling and weak scaling (https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling), as well as "scale up" and "scale out" (https://en.wikipedia.org/wiki/Scalability).

I think strong, weak, scale up, and scale out would be the appropriate quantities to plot for Figure 8, as well as a reference of "optimal scaling", which is available, e.g., from Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law). Perhaps users also want the quantities that are plotted but, in terms of demonstrating high quality parallelization procedures, the scaling plots are more standard and informative.

3) Registration of components across days. It seems one pub previously addressed this, but caiman has a new strategy, employing the (basically standard) approach to aligning points, the Hungarian algorithm. note that Hungarian scales with n^3 naively, though faster implementations are available for sparse data, etc. MATLAB actually has much better implementations than Python last I checked. In any case, I believe this approach is better than the previously publish one just based on first principles, but no evidence is reported, just words. I also believe it is faster when n is small, but when n is big, I suspect it would be slower. A quantitative comparison of accuracy and speed would be desirable. Also, Hungarian typically requires the *same* number of points, but there is no guarantee that this will be the case. I did not quite understand the comment about infinite distances.

4) Caiman batch and online have a number of minor features/additions/tweaks/bug-fixes relative to previous implementations. however, the resulting improvements in performance is not documented. perhaps that is unnecessary. but a clear list of novel contributions would be very instructive. Getting a tool from "kind of working in our lab" to "actually working in lots of other labs" is really hard, and really important. But the level of effort that went in to doing that is obscured in the text. perhaps this is the most important point of the manuscript.

5 It is very impressive that all the results were obtained using the same parameters. however, the caveat to that is that all the results were based on using the parameters jointly chosen on *these datasets*, meaning that one would expect the performance on essentially any new dataset to be worse. Held out datasets can provide evidence that performance does not drop too much. With more information on each datasets (see point 1), and similar knowledge of held-out data, one could use these results to predict how well these particular parameters would work on a new dataset that a different lab might generate. Note that this is the same problem as the very popular benchmarking problems in machine vision that are very popular, which doesn't stop those papers from getting published in top journals.

6) In the end, it remains unclear precisely what you recommend to do and when. Specifically, the manuscript mentions many functions/options/setting/parameters and several pre-processing stages. I also understand that the jupyter notebooks provide some examples. I think some concrete guidance, given that you've now analyzed nine different datasets spanning labs, sensors, etc., you have a wealth of knowledge about the heterogeneity of these datasets that would be valuable for the rest of the community, but is not quite conveyed in the manuscript.

Reviewer #3:

Giovannucci et al., present a software package to process calcium imaging data. As calcium imaging is becoming a dominant methodology to record neural activity, a standardized and open-source analytical platform is of importance. CaImAn presents one such effort. The manuscript is well written, and the software seems state-of-the-art and properly validated. I believe eLife Tools and Resources is a perfect outlet for such a commendable effort and enthusiastically support the publication of the study. Below are some comments that I believe the authors can address relatively easily prior to publication.

Subsection “Batch processing of large scale datasets on standalone machines” (parallelization). First, it should be mentioned that the difficulty of parallelization arises from this particular motion correction algorithm in which correction of a frame depends on the previous frames. If each frame is processed independently, like in some other motion correction methods, it would be trivial to parallelize. Second, is there any mechanism to ensure that there is no jump between the temporal chunks? In other words, is the beginning of chunk X adjusted to match the end of chunk X-1? This seems critical for subsequent analysis.

Subsection “CAIMAN BATCH and CAIMAN ONLINE detect neurons with near-human accuracy” (ground truth). The authors mention that the human performance is overestimated because the ground truth is based on the agreement among the scorers. This effect can be eliminated, and the true human performance can be estimated by cross-validation. For example, in a dataset annotated by four annotators, one can test the performance of one annotator against a putative ground truth defined by the other three annotators. This can be repeated for all four annotators and the average will give a better estimate of human performance than the current method.

Subsection “Computational performance of CAIMAN”. The details of the systems are not described in the Materials and methods section. A MacBook Pro, for example, has 8 logical CPU cores and 4 physical CPU cores (in 1 CPU). Please specify the CPU model names.

Subsection “Computational performance of CAIMAN” (employing more processing power results in faster processing). From the figure, a 112-CPU cluster processes the data about three times as fast as an 8-CPU laptop. The performance gain does not seem to be proportional to the number of CPUs available, even considering the overhead. Please discuss the overhead of parallelization and the bottleneck when processing on the cluster.

Subsection “Computational performance of CAIMAN” (the performance scales linearly). Please clarify whether this (linearity) is a visual observation of the results, or is this based on a complexity analysis of the factorization algorithm?

Subsection “Computational performance of CAIMAN” (enabling real-time processing). Please clarify how often such an update of a footprint is required in order to support real-time processing without parallelization.

Reviewing editor:

Please consider adding a reference under "Related Work" to the use of supervised learning ("Adaboost") to segment active neurons, i.e. Automatic identification of fluorescently labeled brain cells for rapid functional imaging. I. Valmianski, A. Y. Shih, J. D. Driscoll, D. M. Matthews, Y. Freund and D. Kleinfeld, Journal of Neurophysiology (2010) 104:1803-1811.

eLife. 2019 Jan 17;8:e38173. doi: 10.7554/eLife.38173.028

Author response


Your manuscript has been thoroughly reviewed by three outstanding reviewers and your software extensively tested by one of the groups. Your technical contribution, while admittedly introducing no novel approaches per se, is seen as a timely and carefully constructed software package that will help a large community of neuroscientists. However, before we can pass final judgment, we ask you to address all of the comments of the reviewers; this extensive request is based on the pointed nature of the comments and the importance of having your software accepted by the largest possible fraction of the imaging community. Please pay particular attention to queries on implementation.

We would like to thank the reviewers and the editor for their thorough and thoughtful reviews. CaImAn is a big project and the reviews touched upon almost all of its aspects, leading to a large revision of our paper. We believe that our current submission addresses all the issues raised by the reviewers. In summary, the revised version of our paper includes (among many other improvements):

A link to a (password protected) website that contains all the datasets, their characteristics and the manual annotations.

A comparison of CaImAn with the popular package Suite2p.

A more systematic exploration of the parameter space for both CaImAn batch and CaImAn online, to better distinguish between “average” and “best” cases, and insight into parameter choice.

A systematic study on the computational speed for both CaImAn batch and CaImAn online, and the effect of parallelization.

Concurrently with this submission we also released a new version of CaImAn that presents a simplified way to run the various algorithms and pass parameters.

Please find our detailed responses below. Since our paper is long, we copied excerpts from the paper (text, figures etc.) that directly address the reviewers’ concerns.

A few key points to attend to include:

It is essential to judge the results of CaImAn against existing pipelines, e.g. the very popular Suite2P (Pachitariu et al.). There is no expectation that CaImAn will shine in every dimension of use. Yet the authors simply must supply comparative benchmarks.

With our revised submission we provide a comparison between Suite2p and CaImAn. The results are described in the text and appear in a new Figure 4—figure supplement 2.

To compare against Suite2p we used a grid search for some parameters around the defaults provided by its developers and report the combination that gave the best results on average. The software and scripts were cloned from the master branch of the GitHub repository (https://github.com/cortex-lab/Suite2P) on August 27th 2018. We modified the make_db_example.m and master_file_example.m scripts to accommodate the grid parameter search. One file (J123) was excluded from the comparison because Suite2p would run a very large number of iterations and add tens of thousands of components (J123 has very low SNR; it is possible that some ad hoc parameter needed to be set in Suite2p). Suite2p includes a supervised-learning-based classifier that is used to evaluate components; to keep the comparison fully automated this classifier was not retrained for each dataset, and results were reported both with the default general Suite2p classifier and without any classifier. In both conditions we selected the set of parameters providing the highest average F1 score over all the datasets. In all cases CaImAn outperformed Suite2p with varying degrees of difference. The mean +/- standard deviation of the F1 score obtained with the default classifier was 0.59 +/- 0.12 (optimal parameters {1500, 2000, 0.25}), whereas without the classifier we obtained an F1 score of 0.55 +/- 0.12 ({500, 2000, 0.25}). It is expected that separate training on each dataset would yield better results (as is also the case with CaImAn). As we stated in our initial cover letter, comparisons like this are not definitive, and Suite2p in the hands of a more experienced user might yield better results than those reported here. We in fact contacted the main developer of Suite2p prior to trying out the comparison on our own but received no response. As such we chose to present these results in the supplement and avoid overselling them.
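For reference, the F1 score used throughout these comparisons (our own summary of the standard definitions, not text from the paper) is the harmonic mean of precision and recall, computed from matched (true positive, TP), spurious (false positive, FP), and missed (false negative, FN) components:

```latex
\mathrm{precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{recall} = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}.
```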

The trial data sets need to be completely described and, of course, made publicly and fully available if and when the manuscript is accepted.

We have deposited the raw data and manual and consensus annotations to Zenodo.

It should be stated up front that there is no "ground truth" in the sense of simultaneous electrical and calcium measurements, only a comparison with a consensus view (note "tyranny of the majority" in one interpretation). Reviewer three makes a clear suggestion for cross-validation that should be followed.

We agree. In the revised manuscript we changed our phrasing to use the term “consensus annotation” as opposed to “consensus ground truth” to make clear your point that there is no well-defined ground truth. We also followed the cross-validation approach suggested by reviewer #3 to better understand the variability of human annotators. While the “ground truth” (and thus individual scores) could vary significantly in this case, on average we observed only mild differences. See also our detailed response to the relevant comment from reviewer #3.

The authors note that fine-tuning of parameters to individual datasets 'can significantly increase performance'. This must be detailed in light of the experience gained with nine different datasets.

In the revised version of the paper we made a more systematic search of the parameter space for both CaImAn batch and CaImAn online. We performed a small grid search on some threshold parameters for the quality evaluation tests. For both algorithms we found the set of parameters that gives the best results on average for all datasets, as well as the parameters that individually maximize the performance on each dataset. The set of parameters yielding the highest average F1 score was selected as the default and reported in Figure 4B (batch and online) and Table 1. We also checked which choice of parameters maximized the performance for each dataset (batch max and online max in Figure 4B). The average F1 scores are 0.77 +/- 0.03 for batch max, 0.78 +/- 0.04 for online max, 0.75 +/- 0.03 for batch, and 0.76 +/- 0.05 for online. In both cases the average F1 score can be raised by about 0.02.

We offer more details in the Results section and the Materials and methods section (Implementation details of CaImAn batch/online).

For CaImAn batch: “The global default parameters for all datasets were obtained by performing a grid search on the 9 datasets over the following values: trace peak SNR threshold {1.75, 2, 2.25, 2.5}, spatial correlation threshold {0.75, 0.8, 0.85}, lower threshold on the CNN classifier (reject if prediction is below; {0.05, 0.1, 0.15}) and upper threshold on the classifier (accept if prediction is above; {0.9, 0.95, 0.99, 1}). The best overall parameters (used for the results reported in Table 1) were given for the choice (2, 0.85, 0.1, 0.99).”

For CaImAn online: “To select global parameters for all datasets we performed a grid search on all 9 datasets by varying the following parameters: the peak SNR threshold for accepting a candidate component on the set {0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2}, the online CNN classifier threshold for accepting candidate components on the set {0.5, 0.55, 0.6, 0.65, 0.7, 0.75}, and the number of candidate components per frame on the set {5, 7, 10, 14}. The best overall parameters (reported in Table 1) were given for the choice (1.2, 0.65, 10).”
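As a rough illustration of how such a grid search proceeds (a minimal sketch; the `grid_search` helper and the `evaluate` callable are hypothetical placeholders, not the actual CaImAn API):

```python
from itertools import product

# Hypothetical search grids, matching the values quoted above.
SNR_GRID = [0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]   # peak SNR threshold
CNN_GRID = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75]      # online CNN classifier threshold
CAND_GRID = [5, 7, 10, 14]                         # candidate components per frame

def grid_search(datasets, evaluate):
    """`evaluate(dataset, snr_thr, cnn_thr, n_candidates)` is a placeholder that
    should run CaImAn online on one dataset and return the F1 score against its
    consensus annotation. Returns the parameter combination with the highest F1
    score averaged over all datasets, together with that score."""
    best_score, best_params = -1.0, None
    for snr_thr, cnn_thr, n_cand in product(SNR_GRID, CNN_GRID, CAND_GRID):
        scores = [evaluate(d, snr_thr, cnn_thr, n_cand) for d in datasets]
        mean_f1 = sum(scores) / len(scores)
        if mean_f1 > best_score:
            best_score, best_params = mean_f1, (snr_thr, cnn_thr, n_cand)
    return best_params, best_score
```

The per-dataset “online max” values in Figure 4B correspond to keeping, for each dataset separately, the combination with the highest individual score rather than the highest average.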

Please augment the analysis of Figure 4 to show how precision and recall separately change with signal-to-noise ratio.

We have followed this suggestion and also show how precision and recall change as a function of the SNR for each file in Figure 4E. The same trend is followed, i.e., performance is higher for neurons with higher SNR traces.

Reviewer #1:

Giovannucci et al., present a new open-source software package for analyzing calcium imaging data collected by two-photon and 1-photon microscopes. While the key algorithms used in the pipeline have been developed over many years and largely overlap with previously published work, new features, e.g. memory mapping, CNN-based classifiers, cross-day registration, have been added to improve the performance and further extend the functionalities of the software. The authors tested the package using various in vivo imaging datasets and compared the results with 'consensus ground truth'. The software is written in Python and is open-source, well-organized, and requires minimal user intervention.

The main advance of the manuscript is to combine a plethora of powerful algorithms that have been developed and improved over almost a decade into a single software package that allows batch processing as well as real-time processing of calcium imaging data. Most of the algorithms have been optimized for speed and performance to allow their use in real-time experiments such as closed-loop and all-optical experiments. Having a software package for calcium imaging data analysis that is freely available, well-documented and makes use of the most recent algorithms for the multiple steps involved in calcium imaging processing is extremely valuable for the field, as it will improve the quality and the efficiency of calcium data analysis throughout the neuroscience community, and also open up many new experimental avenues. I therefore strongly support publication.

1) My main concern is the lack of rigorous comparison of the performance of the package with existing solutions. The manuscript only contains internal comparisons but does not show comparisons to existing pipelines (e.g. Suite2P by Pachitariu et al.,) or earlier versions described in other manuscripts (e.g. CNMF in Pnevmatikakis et al., 2016), or state-of-the-art databases such as Neurofinder. This greatly increases the difficulty of judging the overall performance of the software package in its current form. Although the effort the authors went through to generate an improved ground truth data set is laudable, this does not allow one to judge the performance of the package with respect to existing packages/algorithms. For example, a test of how the package performs on the Neurofinder data and how existing packages (like e.g. Suite2P) perform on their newly generated ground truth data would help one to assess the performance of the ROI detection, which is one of the core features of the pipeline. While some of the comparisons have been performed in the respective publications, an overview of the performance of the pipeline with respect to its modules in their current form would be extremely informative.

We added a detailed comparison of the performance of CaImAn batch to Suite2P. Please refer to our response to the editor comments for details. Besides that, we also analyzed all the Neurofinder test files and generated results for submission; unfortunately, the Neurofinder website appears to be down at the moment (September/October 2018), so submission is currently impossible. We have nevertheless added the scripts and results of the performed benchmark to the folder shared with the reviewers. The original CNMF algorithm (Pnevmatikakis et al., 2016) can be considered a subset of the CaImAn batch algorithm: CNMF essentially corresponds to applying CaImAn batch without patches and without automated component quality testing. Instead, CNMF sorted the various components and required the user to select a cut-off threshold. As such, we chose not to do a formal comparison. For instance, it is not possible to run J123, J115 and K53 on a 128 GB RAM workstation because of the memory load (data not shown).

2) Following on from the previous point, in the Discussion, the authors claim that 'apart from computational gains both approaches [MapReduce batch and OnACID online] also result in improved performance'. Is this with respect to CNMF-based approaches or other existing methods, for example, Suite2p? If the latter, then the caiman software was never directly compared to any other approaches commonly used (discussed in the Related work section) to validate whether it outperforms them in terms of speed or accuracy (it is only shown that a similar number of cells is detected as in Zhou et al., 2018). Some of the datasets used for ground truth testing could easily be analyzed using different algorithms to provide the comparison and demonstrate the advantages of Caiman. Pachitariu et al., (2018) showed Suite2p detects more cells (in particular, cells with a low baseline firing rate) than CNMF-based methods do. It would be helpful to know in which way Caiman outperforms Suite2P.

The introduction of patch processing with map-reduce, component quality testing, as well as a modified initialization algorithm (RollingGreedyROI vs GreedyROI in CNMF) has improved the results of CaImAn over CNMF in terms of scalability and performance, but also with respect to automation. As we also noted above, the original CNMF algorithm was not fully automated and required the user to order the components and select a cut-off point. As such, we believe a comparison of the two methods would not be particularly instructive. We also do not want to comment on qualitative differences between Suite2p and CaImAn. At its core, Suite2p uses a matrix factorization model very similar to the original CNMF algorithm, with the main difference being the background/neuropil model. Other differences arise in the implementation details as well as the post-processing methods, and are hard to compare qualitatively. Our results directly contradict the statements of Pachitariu et al., (2018), as we demonstrate a qualitatively much better performance of CaImAn than what is reported there.

Just to clarify, for microendoscopic 1p data, CaImAn does not present an alternative algorithm to CNMF-E (Zhou et al., 2018). It merely ports the algorithm to Python and endows it with the map-reduce and component quality testing capabilities. As such, we expect the results to be similar (but not identical).

3) The authors claim that CaImAn Online outperforms Batch in terms of neuron detection for longer recordings, while CaImAn Batch is better suited for shorter recordings (Table 1 and also in the Discussion section). What is this claim based on, or what is the performance measure considered? Assuming longer recordings as > 3000 frames and by inspecting Table 2, there is no clear distinction in terms of F1 scores or precision/recall scores (Table 1) based on the file length. Additionally, it is currently hard to inspect this claim as the information to be compared is stored in two separate tables.

The reviewer is right that our statement was vague and not well supported. Here by longer recordings we mean recordings with more than 20,000 frames or, better stated, experiments long enough that the spatial footprints could change during the course of the experiment due to various non-stationarities. In our analysis we observed that in the three longest datasets (J115, J123, and K53) the difference between CaImAn batch and online was the highest, with online achieving better results as quantified by our precision/recall framework. CaImAn online looks at a local window of the data (the current frame plus the residual buffer) to update the activity of each neuron and identify new ones. On the other hand, the batch algorithm looks at all the data at once and tries to express the spatio-temporal activity of each neuron as a simple rank-one matrix. We believe that this important difference endows the online algorithm with more robustness to the various non-stationarities that can arise in long experimental sessions. To ease the presentation, we have re-ordered the datasets in all the Figures and Tables in increasing order of their number of frames and included the number of frames in Table 1. We have also rephrased our wording and included additional discussion to better express our views:

“[…] While the two algorithms performed similarly on average, CaImAn online tends to perform better for longer datasets (e.g., datasets J115, J123, K53, which all have more than 40000 frames). CaImAn batch operates on the dataset at once, representing each neuron with a spatial footprint that is constant in time. In contrast, CaImAn online operates at a local level, looking at a short window over time to detect new components while adaptively changing their shape based on the new data. This enables CaImAn online to adapt to slow non-stationarities that can appear in long experiments.”

4) The manuscript states that fine-tuning of parameters to individual datasets 'can significantly increase performance' (subsection “CAIMAN Batch and CAIMAN online detect neurons with near-human accuracy”) but no evidence is provided for this rather strong statement. It would be interesting to see the results of one of the videos analyzed in this section when the parameters were fine-tuned to show if such adjustment could lead to more reliable results.

We thank the reviewer for this suggestion. To address this comment quantitatively, we repeated the analysis on the nine labeled datasets using a small grid search over several parameters. The results show an average improvement of 0.02 in terms of F1 score and also highlight strategies for parameter choice. We present these results in Figure 4B and note in the text (see also the relevant answer to the editor for parameter details):

“[…] In general, by choosing a parameter combination that maximizes the value for each dataset, the performance increases across the datasets, with F1 scores in the range 0.72-0.85 and average performance 0.78 +/- 0.05 (see Figure 4B (orange) and Figure 4—figure supplement 1 (magenta)). This analysis also shows that in general a strategy of testing a large number of components per timestep, but with stricter criteria, achieves better results than testing fewer components with looser criteria (at the expense of increased computational cost). The results also indicate different strategies for parameter choice depending on the length of a dataset: lower threshold values and/or a larger number of candidate components (Figure 4—figure supplement 1 (red)) lead to better values for shorter datasets, but can decrease precision and overall performance for longer datasets. The opposite also holds for higher threshold values and/or a smaller number of candidate components (Figure 4—figure supplement 1 (blue)), where CaImAn online for shorter datasets can suffer from lower recall values, whereas in longer datasets CaImAn online can add neurons over a longer period of time while maintaining high precision values and thus achieve better performance.”

5) In Figure 4A, why are the consensus ground truth components different in the Caiman batch panels and the caiman online panels? Were they constructed using the same SEEDEDINITIALIZATION procedure?

We thank the reviewer for the careful inspection and for pointing out this issue. We apologize for this oversight; it should not have happened in the first place. The mismatch stemmed from two different thresholds used to binarize the ground truth. It is now fixed in Figure 4A.

a) In the same figure, there are clearly some components that do not look like cells, in both Caiman batch/online detected components, and even in the consensus ROIs. The authors should add two panels to this figure, one showing the raw FOV image (preferably just the average image of the video, without further processing) with the manually selected ROI contours marked on it – which should look similar to the last panel in Figure 3A; and another showing the 'consensus ground truth' components given by the SEEDEDINITIALIZATION on top of the average image.

We added Figure 3—figure supplement 1 comparing the original masks and the ones after seeded initialization for the dataset. Panel b shows that SEEDEDINITIALIZATION allows the original elliptical annotations to adapt to the actual footprint of each component as shown in the FOV. Figure 3—figure supplement 1C also demonstrates some examples of the thresholding process. The main goal of this process is to obtain the most prominent (visible) parts of the neuron, a step that facilitates the comparison process.

We have also modified the text to make this point more clear:

“[…]To compare CaImAn against the consensus annotation, the manual annotations were used as binary masks to construct the consensus spatial and temporal components, using the SeededInitialization procedure (Algorithm 3) of CaImAn Batch. This step is necessary in order to adapt the manual annotations to the shapes of the actual spatial footprints of each neuron in the FOV (Figure 3—figure supplement 1), since manual annotations had in general an elliptical shape.”

b) A worry is that the initialization procedure might add error to the ground truth and bias it towards the same direction as caiman does, and therefore masking error from benchmarking. For example, if one cell is mistakenly split into several components in both consensus data and caiman data, and both methods detected them, the performance metrics will be artifactually higher. It is important to directly compare the manually-picked ROIs in Figure 3 with the caiman-detected ROIs in Figure 4A (yellow), as the authors claim in the abstract that the software 'achieves near-human performance in detecting locations of active neurons'. It would be good to quantify how well the spatial footprints of the manually drawn ROIs and the caiman detected ROIs overlap. At least please plot the number of ROIs detected by humans against that by the software.

The SEEDEDINITIALIZATION procedure cannot split (or merge) any selection. Essentially what it does is a simple non-negative matrix factorization where the spatial footprint of each component is constrained to be a subset of the input annotation, together with additional components for the background. As noted in our previous response, this is necessary to ensure a comparison among similar ROIs. While this can introduce a bias in terms of the extracted traces, we believe that this bias is significantly smaller than that of non-model-based approaches; for example, simply computing the average over a selected ROI would introduce contamination from neuropil or from components with overlapping spatial footprints. We have updated Table 1 to also include the number of neurons selected by each labeler and by our algorithms.

6) Figure 4B shows that the F1 score of Caiman online is higher than that of Caiman batch. Also, in the text, the recall value of Caiman online is higher than that of Caiman batch. In the Materials and methods section, the authors state Caiman online uses a stricter CNN classifier to avoid false positives – I would expect this to result in a higher precision and lower recall compared to the results given by Caiman batch – but it turns out to be the opposite. Is it a result of comparing the results with different consensus ground truth data?

There is an important difference between the two classifiers. The CNN classifier for the CaImAn batch algorithm is applied only once to every component that has been selected by the CNMF algorithm, to test its quality. If a component does not pass the test, it is excluded from the final list of active components. However, the CNN classifier for the CaImAn online algorithm is applied to every candidate component considered at each frame. If a component does not get selected at some point by the online CNN classifier, it can be considered again for inclusion in the future once more signal has been integrated. As a result, it can be included at a later point during the experiment. This explains our choice to train the online CNN classifier to be stricter. This salient point is explained in subsection “Differences between the two classifiers”.

“Although both classifiers examine the spatial footprints of candidate components, their required performance characteristics are different, which led us to train them separately. First of all, the two classifiers are trained on separate data: the batch classifier is trained on spatial footprints extracted from CaImAn batch, whereas the online classifier is trained on residual signals that are generated as CaImAn online operates. The batch classifier examines each component as a post-processing step to determine whether its shape corresponds to a neural cell body. As such, false positive and false negative examples are treated equally, and possible misclassifications do not directly affect the traces of the other components. By contrast, the online classifier operates as part of the online processing pipeline. In this case, a new component that is not detected in a residual buffer is likely to be detected later should it become more active. On the other hand, a component that is falsely detected and incorporated in the online processing pipeline will continue to affect the future buffer residuals and the detection of future components. As such, the online algorithm is more sensitive to false positives than false negatives. To ensure a small number of false positive examples under testing conditions, only components with average peak-SNR value of at least 4 were considered as positive examples during training of the online classifier.”

7) The authors analysed how the F1 score of Caiman Batch depends on SNR, but do not directly address the question of why precision is much higher than recall. It would be helpful if the authors would show how precision and recall change with SNR separately, in addition to the current Figure 4D.

We updated Figure 4D to show how precision and recall change as a function of the SNR, per the reviewer’s instructions (see also our response to the editor). The main reason for the discrepancy between precision and recall is the existence of the quality assessment step in CaImAn batch and the similar tests that are used during CaImAn online. These tests aim to filter out false positive components, a step that increases precision, but they cannot recover missed components, which is what would be needed to increase recall. In fact, if a true positive component is filtered out, the recall will actually fall. Appropriately changing the thresholds of the various tests can lead to increased recall at the price of reduced precision. However, we believe that having a higher precision is desirable. As we note in the Discussion section:

“[…] The performance of CaImAn (especially in its batch version) indicates a considerably higher precision than recall in most datasets. While more balanced results can be achieved by appropriately relaxing the relevant quality evaluation thresholds, we prefer to maintain a higher precision as we believe that the inclusion of false positive traces can be more detrimental in any downstream analysis compared to the exclusion of, typically weak, true positive traces. This statement is true especially in experiments with low task dimensionality, where a good signal from a few neurons can be sufficient for the required hypothesis testing.”

8) Motion correction algorithms and CNN-classifier-based neuronal filtering are discussed for online purposes, while the computational performance section does not touch on their speed. Running the demo files in the GitHub link provided by the authors shows that they both introduce additional delays. The CNN classifier (evaluate_components_CNN function), run every few hundred frames, is especially computationally expensive (> a few hundred ms on my device). In Figure 8C, is the CNN classifier used? If not, how much time delay will that induce? The computational cost of motion correction and the CNN classifier should be reported to verify that.

These are both valid observations. Motion correction adds a computational cost to the online pipeline, especially if non-rigid motion correction is required. In our case, motion correction was done beforehand to ensure that the FOV where CaImAn online operates is perfectly aligned to the FOV that the annotators used. As we note in the text:

“[…] The analysis here excludes the cost of motion correction, because the files were motion corrected beforehand to ensure that the manual annotations and the algorithms were operating on the same FOV. This cost depends on whether rigid or pw-rigid motion correction is being used. Rigid motion correction takes on average 3-5 ms per frame for a 512 x 512 pixel FOV, whereas pw-rigid motion correction with a patch size of 128 x 128 pixels is typically 3-4 times slower.”

The online CNN classifier is run at every step to evaluate the spatial footprints of the candidate components. This also adds a significant computational overhead (>=10 ms), which we found to depend critically on the computing infrastructure. For example, utilizing a GPU, which is not the default mode of installing CaImAn, can speed up this process significantly. Note that the cost of the CNN classifier arises mostly from calling the underlying neural network at every frame, and depends less on the number of components that need to be checked at each point. We discuss these issues further in the paper and have modified Figure 8 to include a breakdown of the computational cost per frame for one dataset (J123).

“The cost of detecting and incorporating new components remains approximately constant across time and depends on the number of candidate components at each timestep. In this example 5 candidate components were used per frame, resulting in a relatively low cost (~7 ms per frame). As discussed earlier, a higher number of candidate components can lead to higher recall in shorter datasets, but at a computational cost. This step can benefit from the use of a GPU for running the online CNN on the footprints of the candidate components. Finally, as also noted in [Giovannucci et al., 2017], the cost of tracking components can be kept low, and shows a mild increase over time as more components are added by the algorithm.”

The cost of running the CNN classifier for the zebrafish example is not substantially higher and is significantly smaller than the inter-volume time of 1 s. An architecture where the CNN classifier is run on a parallel stream could help significantly there, although we have not implemented such an architecture yet.

9) In Figure 8C, does the 'processing time' in the left panel include the time for 'updating shapes'? Judging from the right panel, if 'updating shapes' is enabled, it could take longer than 15 seconds to complete processing one frame, which means the software could not function in real-time. Subsection “Computational performance of CAIMAN” says that the shape update functionality can be distributed over several frames or done in parallel to allow for real-time processing. Was this already implemented in the online software? Also, how does the shape update functionality impact the activity traces, and does it bring any substantial advantages? This is not shown (Giovannucci et al., (2017) also does not quantify the differences) but can be useful to know for online implementation of the algorithm in experimental setups. It would be helpful if the authors would quantify how much improvement the 'updating shapes' step actually brings to the fidelity of calcium traces. If the improvement is minimal, then the user might skip this step if speed is the priority.

Periodically updating the shapes is an important step in the online pipeline: as the experiment proceeds and a neuron keeps firing spikes, the algorithm can accumulate more information and produce refined estimates of the neuron’s spatial footprint. This is important both for estimating its future activity more accurately and for distinguishing it from neighboring neurons. Moreover, the spatial footprint of a neuron can slowly vary with time, especially in long experiments. The online algorithm can natively adapt to changes like this, unlike batch approaches. We have modified our code so that the shape update is now distributed among all the frames and happens only in frames where no new neurons are added, in order to distribute the cost more evenly. We developed a simple algorithm that ensures (i) that every neuron gets updated every N frames (where N is a user-defined parameter, default value 200) and (ii) that if a neuron is added, the spatial footprints of all neighboring neurons are updated to adapt to the presence of their new neighbor. We describe this process more analytically in the Materials and methods section (see: distributed shape update). We modified Figure 8 accordingly to show how the computational cost per frame is allocated to each step of the online algorithm.

“As discussed in Giovannucci et al., (2017), processing time of CaImAn online depends primarily on i) the computational cost of tracking the temporal activity of discovered neurons, ii) the cost of detecting and incorporating new neurons, and iii) the cost of periodic updates of spatial footprints. Figure 8e shows the cost of each of these steps for each frame, for one epoch of processing of the dataset J123. Distributing the spatial footprint update more uniformly among all frames removes the computational bottleneck appearing in Giovannucci et al., (2017), where all the footprints were updated periodically at the same frame.”

10) Figure 3, Figure 7, Figure 9, Appendix 0—figure 10, Appendix 0—figure 11: how were these FOV images generated? Are they the average frames of the motion-corrected videos or are the intensities of pixels within ROIs enhanced somehow? Please clarify. Please show the average image of the videos without enhancement.

The figure that we generally choose as a background is the so-called correlation image (or for longer datasets the max-correlation image). The value of the correlation image at each pixel corresponds to the average of the correlation coefficients between the trace of a pixel and its neighbors (in an already motion corrected movie). We included this definition in the Materials and methods section but now also refer to it in footnote 3 in the text:

“[…] The value of the correlation image for each pixel represents the average correlation (across time) between the pixel and its neighbors. This summarization can enhance active neurons and suppress neuropil for two-photon datasets (Figure 10A). See Materials and methods section (Collection of manual annotations)”.

We choose this image because it is invariant with respect to expression levels and it has the property of attaining high values in pixels that are part of active neurons and lower values in pixels that are not. Since our methods focus on the detection of active neurons, their performance can be better assessed visually by plotting contours against the correlation image as opposed to, e.g., the mean. Note that the use of the correlation image is very common in practice and it has also been used by source extraction algorithms (as we detail in the Related work). To make this point clearer we included the new Figure 3—figure supplement 1A, which shows the median (equivalent to the mean) and correlation images for three datasets and demonstrates the utility of the correlation image. Median and correlation images overlaid with all the manual labels are shown on the supporting website that contains all our data.
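To make the definition concrete, here is a minimal sketch of how such a correlation image can be computed (our own simplified illustration using four-neighbor correlations, not CaImAn's actual implementation, which also offers a max-correlation variant for longer datasets):

```python
import numpy as np

def correlation_image(movie):
    """movie: (T, H, W) array, assumed already motion corrected. Returns an
    (H, W) image where each pixel is the average temporal correlation between
    that pixel's trace and the traces of its four immediate neighbors."""
    m = movie - movie.mean(axis=0)              # remove the temporal mean
    m = m / (movie.std(axis=0) + 1e-12)         # z-score each pixel trace
    _, H, W = m.shape
    corr_sum = np.zeros((H, W))
    n_neighbors = np.zeros((H, W))
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = np.roll(m, shift=(dy, dx), axis=(1, 2))
        c = (m * shifted).mean(axis=0)          # Pearson r with that neighbor
        valid = np.ones((H, W), dtype=bool)     # drop wrapped-around borders
        if dy == -1: valid[-1, :] = False
        if dy == 1:  valid[0, :] = False
        if dx == -1: valid[:, -1] = False
        if dx == 1:  valid[:, 0] = False
        corr_sum[valid] += c[valid]
        n_neighbors[valid] += 1
    return corr_sum / n_neighbors
```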

Reviewer #2:

Let me start this review by acknowledging that CaImAn is unequivocally the state of the art toolbox for calcium imaging, and is an impressive piece of work.

However, this manuscript is a bit confusing. It reads like the combination of an advertisement and documentation, without ever referencing (quantitatively) the gains/performance of the tools in caiman relative to any other tools in the literature. In other words, it is a description and benchmarking of a given tool.

Such a document is very useful, but also a bit confusing because that is not what peer-reviewed manuscripts typically look like. eLife is an interesting journal/venue, so perhaps they are interested in publishing something like this; I won't comment on its suitability for publishing in its current form. I will, however, describe what I believe to be the most important novel contributions in this manuscript that are valuable to the field, with some suggested modifications to clarify/improve on some points.

We thank the reviewer for the useful comments. We added a formal comparison with a state-of-the-art package for calcium imaging data (Suite2P) and demonstrate that CaImAn is competitive when benchmarked against consensus annotations (see answer to editor for details).

1) Nine new manually labeled datasets, with four "expert" labelers. In spike sorting, the most important paper ever (imho) was the one that generated "ground truth" data. This manuscript does not do that (one could image a "ground truth" channel), but it does basically provide an upper bound on accuracy for any algorithm on these calcium imaging datasets, as defined by the consensus. This is potentially incredibly valuable. However, the details of the datasets are merely listed in a table in the methods, with no quantitative description/evaluation of them. As a resource, showing some images of sample traces, sample frames, and summary statistics so that potential users could evaluate and compare their data with these data would be very valuable and interesting. Comparing them: how hard is each for the experts, and what about the machines; is the hardness correlated, etc.? This would be a highlight of the manuscript.

We have created a website that can be used to download the raw data and the manual and consensus annotations. For each dataset, the website also depicts the labels generated by each labeler and the consensus overlaid on the correlation image, as well as example spatial and temporal components extracted by CaImAn. The site can be found at the url https://users.flatironinstitute.org/~neuro/caiman_paper/

To log in, use the username “reviewers” and the password “island”. We kindly ask the reviewers not to share these credentials; we will make the site publicly accessible should the paper be accepted.

2) The parallelization of the standard method in the field, as in Figure 8, is a new result, and interesting but the metrics are not "standard" in the parallel programming literature. I'd recommend a few relevant notions of scaling, including strong scaling and weak scaling (https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling), as well as "scale up" and "scale out" (https://en.wikipedia.org/wiki/Scalability).

I think strong, weak, scale up, and scale out would be the appropriate quantities to plot for Figure 8, as well as a reference of "optimal scaling", which is available, e.g., from Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law). Perhaps users also want the quantities that are plotted but, in terms of demonstrating high quality parallelization procedures, the scaling plots are more standard and informative.

We thank the reviewer for this useful suggestion. We have included a new panel in Figure 8 (panel c) that shows the scaling of CaImAn batch when processing a dataset on the same machine but using different numbers of CPUs. Even though it would be desirable to extend this figure all the way up to 112 CPUs (or more), we found that using a computing cluster led to variable results due to variable speeds when reading the files over network drives, thus confounding our conclusions. While the language of strong vs weak scaling can help demonstrate the properties of the algorithm, we believe that it may be too technical for a neuroscience audience and refrained from using it to describe our results. Instead we note:

“[…] To study the effects of parallelization we ran CaImAn batch on the same computing architecture (24 CPUs) utilizing a different number of CPUs at a time (Figure 8c). In all cases significant speedup factors can be gained by utilizing parallel processing, with the gains being similar for all stages of processing (patch processing, refinement, and quality testing; data not shown). The results show better scaling for medium-sized datasets (J123, ~50GB). For the largest dataset (J115, ~100GB), the speedup gains saturate due to limited RAM, whereas for small datasets (~5GB) the speedup factor can be limited by the increased fraction of communication cost overhead (an indication of weak scaling in the language of high-performance computing).”

3) Registration of components across days. It seems one publication previously addressed this, but caiman has a new strategy, employing the (basically standard) approach to aligning points, the Hungarian algorithm. Note that the Hungarian algorithm scales with n^3 naively, though faster implementations are available for sparse data, etc. MATLAB actually has much better implementations than Python last I checked. In any case, I believe this approach is better than the previously published one just based on first principles, but no evidence is reported, just words. I also believe it is faster when n is small, but when n is big, I suspect it would be slower. A quantitative comparison of accuracy and speed would be desirable. Also, the Hungarian algorithm typically requires the *same* number of points, but there is no guarantee that this will be the case. I did not quite understand the comment about infinite distances.

Our response to this comment can be separated into three parts:

A) It is true that the Hungarian algorithm is not the most computationally efficient solution for the linear assignment problem. However, for the population sizes that we encounter, the cost of solving the linear assignment problem with the Hungarian algorithm is only a small fraction of the total cost. Moreover, the cubic cost refers to the general case of a dense unstructured affinity matrix. In our case, the matrix is sparse since most pairs of neurons are far from each other and are not considered for registration. This is the reason why we assign infinite distances (i.e., zero affinity) in this case. The sparse matrix leads to faster matrix-vector operations and eventually a faster solution.

B) The infinite distances also allow neurons to remain unmatched (i.e., to have infinite distance from all the neurons in the other session) and enable registering sessions with unequal numbers of neurons. Every neuron whose assigned match in the other session has infinite distance is considered unmatched (a minimal illustration of points A and B is sketched after part C below).

C) Comparison with the method of Sheintuch et al. is not easy because of the absence of ground truth information. The metric proposed in Sheintuch et al. is tailored to their approach, since it uses the confidence in the assignment that comes from their probabilistic formulation. To better compare the two approaches, we applied our method to the same publicly available Allen Brain datasets and computed the transitivity index. For all the datasets considered, our transitivity index was very high (>0.99). A similar analysis already appeared in our initial submission in Figure 8C, where we compared the approach of registering components through the union vs direct registration.
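To illustrate points A and B, the sketch below casts pairwise registration as a linear assignment problem in which a prohibitively large cost stands in for the "infinite distances" (our own simplified illustration; the function name, thresholds, and the Jaccard-based cost are placeholders, not the actual RegisterPair implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def register_pair(masks_1, masks_2, max_dist=10.0, max_cost=0.7):
    """masks_1, masks_2: lists of binary footprints (H, W arrays) from two
    sessions, assumed already aligned to a common FOV. Pairs whose centroids
    are farther apart than `max_dist` pixels receive a prohibitive cost, so
    they can never be matched."""
    def centroid(mask):
        ys, xs = np.nonzero(mask)
        return np.array([ys.mean(), xs.mean()])

    def jaccard_distance(a, b):
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        return 1.0 - inter / union if union > 0 else 1.0

    LARGE = 1e6  # finite stand-in for an "infinite" distance
    cost = np.full((len(masks_1), len(masks_2)), LARGE)
    for i, a in enumerate(masks_1):
        for j, b in enumerate(masks_2):
            if np.linalg.norm(centroid(a) - centroid(b)) <= max_dist:
                cost[i, j] = jaccard_distance(a, b)

    rows, cols = linear_sum_assignment(cost)   # Hungarian-type solver
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
    unmatched_1 = set(range(len(masks_1))) - {i for i, _ in matches}
    unmatched_2 = set(range(len(masks_2))) - {j for _, j in matches}
    return matches, unmatched_1, unmatched_2
```

Components whose best assignment carries the prohibitive cost are reported as unmatched, which is how sessions with unequal numbers of neurons can still be registered.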

We modified the paper to include the additional analysis and clarifications.

"A different approach for multiple day registration was recently proposed by Sheintuch et al., (2017) (CellReg). While a direct comparison of the two methods is not feasible in the absence of ground truth, we tested our method against the same publicly available datasets from the Allen Brain Observatory visual coding database. (http://observatory.brain-map.org/visualcoding). Similarly, to Sheintuch et al., (2017) the same experiment performed over the course of different days produced very different populations of active neurons. To measure performance of RegisterPair for pairwise registration, we computed the transitivity index proposed in Sheintuch et al., (2017). The transitivity property requires that if cell "a" from session 1 matches with cell "b" from session 2, and cell "b" from session 2 matches with cell "c" from session 3, then cell "a" from session 1 should match with cell "c" from session 3 when sessions 1 and 3 are registered directly. For all ten tested datasets the transitivity index was very high, with values ranging from 0.976 to 1 (0.992 +/- 0.006, data not shown). A discussion between the similarities and differences of the two methods is given in Methods section and Materials and methods section.”

4) Caiman batch and online have a number of minor features/additions/tweaks/bug-fixes relative to previous implementations. However, the resulting improvements in performance are not documented. Perhaps that is unnecessary, but a clear list of novel contributions would be very instructive. Getting a tool from "kind of working in our lab" to "actually working in lots of other labs" is really hard, and really important. But the level of effort that went into doing that is obscured in the text. Perhaps this is the most important point of the manuscript.

We have tried to make a list of the contributions summarized in this work in the Introduction (see: Contributions). We elaborate more on them in the subsequent Methods section and Materials and methods section. These contributions refer more to algorithmic developments as well as the release of the labeled datasets. With respect to software contributions (e.g., bug fixes) we believe that providing a list would be rather daunting and without clear benefits. Our public code repository is a better source for that information and we invite interested parties to follow (or even participate in) the development of CaImAn there.

5) It is very impressive that all the results were obtained using the same parameters. However, the caveat is that all the results were based on parameters jointly chosen on *these datasets*, meaning that one would expect the performance on essentially any new dataset to be worse. Held-out datasets can provide evidence that performance does not drop too much. With more information on each dataset (see point 1), and similar knowledge of held-out data, one could use these results to predict how well these particular parameters would work on a new dataset that a different lab might generate. Note that this is the same problem faced by the very popular benchmarking challenges in machine vision, which doesn't stop those papers from getting published in top journals.

The supervised learning tools that we use in CaImAn pertain to the two CNN classifiers. Of these, the batch classifier was trained only on the first three datasets, whereas the online classifier was trained on the first five datasets, thus leaving a significant amount of held-out data for testing. This is stated in subsection “Classification through CNNs”. For the rest of the pipeline CaImAn mostly uses tools from unsupervised learning (e.g., matrix factorization, tests of correlation coefficients and SNR levels, etc.). As such, it is not trained on a specific set of data to define parameters that could be subject to overfitting. Thus, we believe that the parameters we pick here for the diverse set of nine datasets offer a sufficient amount of generalization to other datasets.

6) In the end, it remains unclear precisely what you recommend to do and when. Specifically, the manuscript mentions many functions/options/settings/parameters and several pre-processing stages. I also understand that the jupyter notebooks provide some examples. I think some concrete guidance would be welcome: given that you've now analyzed nine different datasets spanning labs, sensors, etc., you have a wealth of knowledge about the heterogeneity of these datasets that would be valuable for the rest of the community but is not quite conveyed in the manuscript.

This is a helpful suggestion. We have created an entry on the wiki of our GitHub repo with several tips on using CaImAn. https://github.com/flatironinstitute/CaImAn/wiki/CaImAn-Tips

We plan to continuously update this entry as our code and experience evolves, which explains our decision to not include this discussion in the main part of the paper. We link to this page from the paper when talking about our software. Please also see our response to reviewer #1, point 4 on parameter choice strategies for CaimAn online.

Reviewer #3:

Giovannucci et al., present a software package to process calcium imaging data. As calcium imaging is becoming a dominant methodology to record neural activity, a standardized and open-source analytical platform is of importance. CaImAn presents one such effort. The manuscript is well written, and the software seems state-of-the-art and properly validated. I believe eLife Tools and Resources is a perfect outlet for such a commendable effort and enthusiastically support the publication of the study. Below are some comments that I believe the authors can address relatively easily prior to publication.

Subsection “Batch processing of large scale datasets on standalone machines” (parallelization). First, it should be mentioned that the difficulty of parallelization arises from this particular motion correction algorithm in which correction of a frame depends on the previous frames. If each frame is processed independently, like in some other motion correction methods, it would be trivial to parallelize. Second, is there any mechanism to ensure that there is no jump between the temporal chunks? In other words, is the beginning of chunk X adjusted to match the end of chunk X-1? This seems critical for subsequent analysis.

Jumps between consecutive chunks are avoided by ensuring that eventually all the chunks are registered with the same template. At the beginning each chunk gets its own template, but then the templates of all the chunks get a template of their own (a template of templates, so to speak), which is used to register each individual frame. We have made this clearer in the document (subsection “Batch processing of large scale datasets on standalone machines”):

“[…] Naive implementations of motion correction algorithms either need to load the full dataset in memory or are constrained to process one frame at a time, thereby preventing parallelization. Motion correction is parallelized in CaImAn batch without significant memory overhead by processing temporal chunks of movie data on different CPUs. First, each chunk is registered with its own template and a new template is formed from the registered data of each chunk. CaImAn batch then broadcasts to each CPU a meta-template, obtained as the median across all templates, which is used to align all the frames in each chunk. Each process writes in parallel to the target file containing motion-corrected data, which is stored as a memory mapped array.”
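A bare-bones sketch of this template-of-templates scheme is given below (our own simplification, assuming rigid integer-pixel shifts and in-memory arrays; the actual CaImAn batch implementation additionally uses memory mapping and supports piecewise-rigid registration):

```python
import numpy as np
from multiprocessing import Pool

def rigid_register(frame, template):
    """Estimate the integer (dy, dx) shift between frame and template by FFT
    cross-correlation, then undo it."""
    cc = np.abs(np.fft.ifft2(np.fft.fft2(frame) * np.conj(np.fft.fft2(template))))
    dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
    if dy > frame.shape[0] // 2:
        dy -= frame.shape[0]          # convert wrap-around index to signed shift
    if dx > frame.shape[1] // 2:
        dx -= frame.shape[1]
    return np.roll(frame, shift=(-dy, -dx), axis=(0, 1))

def register_chunk(args):
    """Register every frame of a temporal chunk to the shared meta-template."""
    chunk, template = args
    return np.array([rigid_register(frame, template) for frame in chunk])

def motion_correct_parallel(movie, n_chunks=8, n_workers=8):
    chunks = np.array_split(movie, n_chunks)            # split along time
    # Pass 1: each chunk gets its own template (here simply its mean frame).
    templates = [chunk.mean(axis=0) for chunk in chunks]
    # Template of templates: the median across per-chunk templates.
    meta_template = np.median(np.stack(templates), axis=0)
    # Pass 2: every chunk is registered to the same meta-template in parallel.
    with Pool(n_workers) as pool:
        corrected = pool.map(register_chunk, [(c, meta_template) for c in chunks])
    return np.concatenate(corrected)
```

Because every chunk is ultimately aligned to the same meta-template, the start of chunk X lines up with the end of chunk X-1, which addresses the reviewer's concern about jumps between temporal chunks.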

Subsection “CAIMAN BATCH and CAIMAN ONLINE detect neurons with near-human accuracy” (ground truth). The authors mention that the human performance is overestimated because the ground truth is based on the agreement among the scorers. This effect can be eliminated, and the true human performance can be estimated by cross-validation. For example, in a dataset annotated by four annotators, one can test the performance of one annotator against a putative ground truth defined by the other three annotators. This can be repeated for all four annotators and the average will give a better estimate of human performance than the current method.

Thanks for this suggestion. We followed this approach by comparing the results of each annotator to the combined results of the other annotators. We report these results in Table 3 (see: subsection “Cross-Validation analysis of manual annotations”). Perhaps surprisingly, the overall results were only modestly decreased, although the scores of individual annotators varied more, resulting in more cases where CaImAn achieved a higher F1 score than individual annotators. While this comparison is unbiased (in the sense that it does not favor the manual annotations), it resulted in a set of 3 (or 4) different “ground truth” labels for each dataset, and we felt that a direct comparison with the results of CaImAn was not warranted. As such, we do not present this analysis in the main Results section.

“[…] As mentioned in the Results section, comparing each manual annotation with the consensus can create slightly biased results in favor of individual labelers, since the consensus is chosen from the union of individual annotations. To correct for this, we performed a cross-validation analysis where the annotations of each labeler were compared against an automatically generated combination of the rest of the labelers. To create the combined annotations, we first used the RegisterMulti procedure to construct the union of each subset of N-1 labelers (where N is the total number of labelers for each dataset). When N=4, the combined annotation consisted of the components that were selected by at least two labelers. When N=3, a stricter intersection approach was used, i.e., the combined annotation consisted of the components that were selected by both remaining labelers. The procedure was repeated for all subsets of labelers and all datasets. The results are shown in Table 3. While individual scores for specific annotators and datasets vary significantly compared to using the consensus annotation as ground truth (Table 1), the decrease in average performance was modest, indicating a low level of bias.”
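The leave-one-out procedure just described can be summarized with a short sketch (our own simplification, treating each annotation as a set of component ids already registered onto a common index; in the paper this registration is handled by RegisterMulti and matching is done on spatial footprints rather than ids):

```python
from collections import Counter

def combined_annotation(annotations):
    """Combine the remaining N-1 annotations (sets of registered component ids):
    keep a component if at least two annotators selected it when three remain,
    or if both selected it when only two remain."""
    counts = Counter()
    for labels in annotations:
        counts.update(labels)
    need = 2 if len(annotations) >= 3 else len(annotations)
    return {c for c, n in counts.items() if n >= need}

def f1_score(selected, reference):
    tp = len(selected & reference)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cross_validate(annotations):
    """annotations: dict mapping annotator name -> set of component ids."""
    return {
        name: f1_score(labels, combined_annotation(
            [v for k, v in annotations.items() if k != name]))
        for name, labels in annotations.items()
    }
```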

Subsection “Computational performance of CAIMAN”. The details of the systems are not described in the Materials and methods section. A MacBook Pro, for instance, has 8 logical CPU cores but only 4 physical cores (in a single CPU). Please specify the CPU model names.

We have included this information in the revision:

“[…] each dataset was analyzed using three different computing architectures: (i) a single laptop (MacBook Pro) with 8 CPUs (Intel Core i7) and 16GB of RAM (blue in Figure 8a), (ii) a Linux-based workstation (CentOS) with 24 CPUs (Intel Xeon CPU E5-2643 v3 at 3.40GHz) and 128GB of RAM (magenta), and (iii) a Linux-based HPC cluster (CentOS) where 112 CPUs (Intel Xeon Gold 6148 at 2.40GHz, 4 nodes, 28 CPUs each) were allocated for the processing task (yellow).”

Subsection “Computational performance of CAIMAN” (employing more processing power results in faster processing). From the figure, a 112 CPU cluster processes the data about 3 times as fast as an 8 CPU laptop. The performance gain does not seem to be proportional to the number of CPUs available, even considering the overhead. Please discuss the overhead of parallelization and the bottleneck when processing on the cluster.

This is a valid observation. This phenomenon is mainly due to the fact that the data was processed on two different machines. The 112 CPU cluster was reading the data over network drives, a process that was both slower and somewhat variable. To examine the parallelization properties (see also our response to Reviewer #2, point #2), we more rigorously tested the processing speed using the same machine with different numbers of cores:

“[…] To study the effects of parallelization we ran CaImAn batch on the same computing architecture (24 CPUs) utilizing a different number of CPUs at a time (Figure 8c). In all cases significant speedup factors can be gained by utilizing parallel processing, with the gains being similar for all stages of processing (patch processing, refinement, and quality testing, data not shown). The results show better scaling for medium-sized datasets (J123, ~50GB). For the largest dataset (J115, ~100GB), the speedup gains saturate due to limited RAM, whereas for small datasets (~5GB) the speedup factor can be limited by the increased fraction of communication cost overhead (an indication of weak scaling in the language of high performance computing).”
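A benchmark of this kind can be scripted by repeating the pipeline with different worker-pool sizes. The sketch below is illustrative only: run_batch_pipeline is a hypothetical wrapper around the batch analysis steps, and the cluster-setup calls reflect our understanding of the CaImAn demo scripts.

    import time
    import caiman as cm

    for n in (1, 3, 6, 12, 24):            # pool sizes to test on the 24-CPU workstation
        # start a local worker pool with n processes
        c, dview, n_procs = cm.cluster.setup_cluster(backend='local', n_processes=n,
                                                     single_thread=False)
        t0 = time.time()
        run_batch_pipeline('J123.mmap', dview=dview, n_processes=n_procs)  # hypothetical wrapper
        print(f'{n} CPUs: {time.time() - t0:.0f} s')
        cm.stop_server(dview=dview)

The speedup factor for a given pool size is then the single-process time divided by the n-process time, which corresponds to the gains discussed above.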

Subsection “Computational performance of CAIMAN” (the performance scales linearly). Please clarify whether this (linearity) is a visual observation of the results, or whether it is based on a complexity analysis of the factorization algorithm.

The linear scaling with respect to the number of frames is based on the complexity of the algorithm. A rank-K factorization of an M×N matrix typically has complexity O(MNK), and all the other analysis steps also scale linearly.
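To make the reasoning concrete, here is a back-of-the-envelope version of the argument, written for generic alternating (NMF-style) updates rather than the exact CNMF update rules:

    % data: d pixels x T frames, factorized with K components
    \[
    Y \in \mathbb{R}^{d \times T}, \qquad Y \approx A C, \qquad
    A \in \mathbb{R}^{d \times K}, \; C \in \mathbb{R}^{K \times T}.
    \]
    % each alternating update touches products such as
    \[
    Y C^{\top}, \; A^{\top} Y \;\Rightarrow\; O(dTK) \ \text{operations}, \qquad
    C C^{\top}, \; A^{\top} A \;\Rightarrow\; O\!\left(K^{2}(T+d)\right) \ \text{operations}.
    \]
    % with the number of pixels d and the number of components K fixed,
    % every term is O(T), i.e. linear in the number of frames.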

Subsection “Computational performance of CAIMAN” (enabling real-time processing). Please clarify how often such an update of a footprint is required in order to support real-time processing without parallelization.

(Please see also our response to point #9 from reviewer #1, who raised a similar question.) We have modified our code so that the shape update is distributed across frames and, to spread the cost more evenly, happens only in frames where no new neurons are added. We developed a simple algorithm that ensures that (i) every neuron is updated at least once every N frames (where N is a user defined parameter, default value 200) and (ii) if a neuron is added, then the spatial footprints of all neighboring neurons are updated to adapt to the presence of their new neighbor. We describe this process more analytically in the Materials and methods section (see: distributed shape update). Our results indicate that updating all shapes at least once every 500 frames (in a distributed fashion) leads to results similar to the original case where all the shapes were updated at once at specific frames.

“[…] To efficiently distribute the cost of updating shapes across all frames we derived a simple algorithm that (i) ensures that every spatial footprint gets updated at least once every T_u steps, where T_u is a user defined parameter, e.g., T_u=200, and (ii) no spatial component gets updated during a step in which new components are added. The latter property compensates for the additional computational cost that comes with introducing new components. Moreover, whenever a new component gets added, the algorithm collects the components with overlapping spatial footprints and makes sure they get updated at the next frame. This property ensures that the footprints of all affected components adapt quickly whenever a new neighbor is introduced. The procedure is described in algorithmic form in Algorithm 6.”
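A minimal sketch of such a scheduler is shown below for illustration. It is a simplified stand-in for Algorithm 6, not the actual implementation: the class and method names are ours, and the once-every-T_u guarantee holds only approximately when many components are added in quick succession.

    from collections import deque

    class ShapeUpdateScheduler:
        def __init__(self, n_components, T_u=200):
            self.T_u = T_u
            self.queue = deque(range(n_components))   # components awaiting their next update
            self.priority = deque()                   # neighbors of newly added components

        def add_component(self, new_idx, overlapping_neighbors):
            # a new component enters the model: schedule it, and flag its spatial
            # neighbors so that their footprints adapt at the next frame
            self.queue.append(new_idx)
            self.priority.extend(overlapping_neighbors)

        def components_to_update(self, new_component_added):
            if new_component_added:
                return []                             # no shape updates on frames that add components
            if self.priority:                         # neighbors of a new component go first
                due = list(self.priority)
                self.priority.clear()
                return due
            # otherwise update roughly K/T_u components per frame so that every
            # footprint is refreshed within about T_u frames
            n_due = max(1, -(-len(self.queue) // self.T_u))
            due = [self.queue.popleft() for _ in range(min(n_due, len(self.queue)))]
            self.queue.extend(due)                    # re-enqueue them for their next pass
            return due

At each frame the online algorithm would query components_to_update(...) and run the spatial footprint update only for the returned indices.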

Reviewing editor:

Please consider adding a reference under "Related Work" to the use of supervised learning ("Adaboost") to segment active neurons, i.e. Automatic identification of fluorescently labeled brain cells for rapid functional imaging. I. Valmianski, A. Y. Shih, J. D. Driscoll, D. M. Matthews, Y. Freund and D. Kleinfeld, Journal of Neurophysiology (2010) 104:1803-1811.

Thank you for this suggestion, we have included this reference in our discussion of related methods.

Associated Data

    Data Citations

    1. Andrea Giovannucci, Johannes Friedrich, Pat Gunn, Brandon L Brown, Sue Ann Koay, Jiannis Taxidis, Farzaneh Najafi, Jeffrey L Gauthier, Pengcheng Zhou, Baljit S Khakh, David W Tank, Dmitri B Chklovskii, Eftychios A Pnevmatikakis. 2018. Datasets Generated: Data from CaImAn, an open source tool for scalable Calcium Imaging data Analysis. Zenodo.

    Supplementary Materials

    Transparent reporting form
    DOI: 10.7554/eLife.38173.031

    Data Availability Statement

    All input data used to generate most figures, along with the necessary scripts, are available via Zenodo (https://zenodo.org/record/1659149#.XC_Wcs9Ki9s). The original NF datasets listed in Table 2 (prior to non-rigid motion correction) are publicly available via https://github.com/CodeNeuro/neurofinder; they were originally shared by the Hausser, Losonczy, Svoboda, and Harvey labs.
