Abstract
We develop a data harmonization approach for C. elegans volumetric microscopy data, still or video, consisting of a standardized format, data pre-processing techniques, and a set of human-in-the-loop machine learning based analysis software tools. We unify a diverse collection of 118 whole-brain neural activity imaging datasets from 5 labs, storing these and accompanying tools in an online repository called WormID (wormid.org). We use this repository to train three existing automated cell identification algorithms to, for the first time, enable accuracy in neural identification that generalizes across labs, approaching human performance in some cases. We mine this repository to identify factors that influence the developmental positioning of neurons. To facilitate communal use of this repository, we created open-source software, code, web-based tools, and tutorials to explore and curate datasets for contribution to the scientific community. This repository provides a growing resource for experimentalists, theorists, and toolmakers to (a) study neuroanatomical organization and neural activity across diverse experimental paradigms, (b) develop and benchmark algorithms for automated neuron detection, segmentation, cell identification, tracking, and activity extraction, and (c) inform models of neurobiological development and function.
Introduction
Whole-brain imaging experiments with single-neuron resolution (herein shortened to simply “whole-brain imaging”) have undergone explosive growth since first demonstrated in the nematode C. elegans, a millimeter-sized worm, and the zebrafish D. rerio in 2013.1,2 Since then, these methods have been widely adopted and advanced in the worm3,4,5,6,7, zebrafish8,9,10,11,12,13, and larval14 and adult15,16 fly communities. Moreover, there have been significant efforts and advances in neuron-resolution imaging of multiple and/or large brain regions in mammals, rapidly approaching whole-brain imaging, especially in mice.17,18,19,20,21
In C. elegans, whole-brain imaging datasets have enabled characterization of neural dynamics3,6, functional connectivity26,22,23,24, and the roles of individual neurons during behavior7. These studies leverage the property of eutely in this organism: each cell has a unique and stereotyped identity, consistent across every animal, that allows for data from individual neurons to be pooled and compared across multiple trials and animals. However, analyses of these experiments are bottlenecked by the need to determine the unique identities of each neuron in 3D volumetric recordings. Manual cell identification from fluorescent microscopy imagery is a notoriously difficult skill, requiring substantial expertise and labor. This task is particularly difficult for neurons labeled with nuclear localized fluorophores, which is typical for whole-brain recordings. We recently developed NeuroPAL6, the first method where the unique identity of every single neuron can be distinguished by an invariant fluorescent color barcode in living animals at all developmental stages of both sexes.25 NeuroPAL has greatly simplified the task of cell identification and has thus seen rapid adoption, with at least 6 labs6,7,26,22,24 publishing whole-brain imaging datasets using these animals, and many more labs incorporating the system into their experimental protocols since its release in 2021.
Despite this innovation, neural identification remains a challenging task that requires expertise and many hours of manual work. In the past few years, researchers have proposed various algorithmic auto-identification approaches to attack this problem.28,29,26,30,31,32 However, none of them have achieved widespread adoption, due at least in part to their incompatibility with different microscopy data formats and low performance on data acquired from different labs. Automatic approaches to the complementary problem of tracking neurons across video frames have achieved some generalized performance across various datasets33,34, but so far there have not been efforts to perform similar training and benchmarking for automatic cell identification. In order to build automatic approaches that are robust, accurate, and generalizable, there is a critical need for a standardized format and compatible tools trained and benchmarked on a consolidated corpus of data that reflects the heterogeneity of microscopy equipment, experimental conditions, and protocols across labs.
To address this need, we take a data harmonization approach: a process of combining datasets from different sources and homogenizing them to produce a substantially larger data corpus that, in our case, minimizes non-biological inconsistencies across individual datasets while increasing the overall biological diversity of training and benchmarking data. Harmonization includes: i) aggregating the data, ii) converting it to a standardized format, iii) normalizing it, iv) handling duplicate and missing data, and v) pre-processing data to register it to a common space and coordinate system. Data harmonization is standard in many data science fields but has seen slower adoption in the life sciences35. Similar efforts to standardize data formats and build large corpuses of data have been essential in the development and benchmarking of many modern machine learning algorithms.36,37,38
We introduce WormID (wormid.org). This resource consists of: i) data harmonization tools including a standardized file format for both raw and processed data alongside related metadata that extends the existing Neurodata Without Borders (NWB) format, ii) pre-processing to align the color and coordinate space of new datasets, and iii) open-source software to analyze whole-brain activity images. We also provide tutorials and documentation that enable researchers to easily incorporate these tools into their data pipelines. Finally, we provide a large online corpus of harmonized C. elegans whole-brain activity imaging and structural data that can be used for large-scale experimental analysis, neurobiological modeling, and algorithmic development. This corpus is stored in a popular community archive called the Distributed Archives for Neurophysiology Data Integration (DANDI), which serves as a repository for experimental neuroscience data from a variety of model organisms.39
By aggregating a diversity of datasets from multiple labs into a large data corpus, we achieve a substantial boost in the performance of three existing neural auto-identification algorithms, arguably moving into the regime of practical utility for the broader community of users. Furthermore, we mine this corpus to investigate the relationship between neural lineage, synaptic connectivity, and somatic positioning of C. elegans neurons to better understand the factors that drive the positioning of neurons in the adult worm.
This corpus and set of tools should be of wide utility to C. elegans researchers. We hope it will serve as a seed for continued community aggregation of brain imaging datasets and further the development and improvement of community data-analysis tools applicable across many model organisms.
Results
A standardized format for whole-brain C. elegans recordings enables data aggregation and algorithm interoperability
Current state-of-the-art whole-brain recordings of C. elegans typically consist of a combination of structural images, which often use the NeuroPAL multi-channel fluorescent system to determine neuron identities (Fig. 1a–b), and time-series images of neural activity acquired using genetically encoded activity sensors (e.g., GCaMP6s40) (Fig. 1c). This imaging is performed either on immobilized worms (often constrained within a microfluidic chip to maximize image quality1,3,41) or on freely-moving worms.4,5 To aid interpretation, herein we visualize whole-brain structural NeuroPAL images via i) an unrolled ‘butterfly’ plot of neuron positions that projects the 3D worm structure into a 2D plane (Fig. 1a), ii) a 2D projection plot of the NeuroPAL color space (Fig. 1b), and iii) 2D dorsal-ventral and lateral projection plots of the neurons (Fig. 1b). These visualizations facilitate quick comparisons of neuron color and position across samples and fine-tuning of their global alignment.
Figure 1:
NWB file contents
(a) Illustration of a NeuroPAL worm. The head is highlighted by a red box. A butterfly plot visualizes the full 2D representation of the worm brain by projecting neurons onto the surface of the cylindrical body and then unrolling the cylinder. Neuron centers are colored using their composite NeuroPAL expression. (b) Visualizations of the raw NeuroPAL structural image and 2D projections of its RG, RB, GB color subspaces and XZ and XY projections of its neuron positions. Neuron centers are colored using their composite NeuroPAL expression. (c) Example activity traces for five neurons contained in the NWB file and of the raw neural-activity (GCaMP6s) images.
All associated raw data and metadata are stored in the standardized NWB42 file format with an additional extension that we developed, ndx-multichannel-volume (ndx = neurodata extension), which provides support for multi-channel volumetric recordings and C. elegans-specific metadata (Fig. 2a). This extension is available in the NWB Extensions Catalog and is now the official NWB standard for data sharing of C. elegans whole-brain neural-activity imaging. NWB data is hierarchically organized, with basic metadata stored at the file’s root level, raw data stored in the ‘acquisition module’, and various processed experimental data stored in ‘processing modules’ (Fig. 2b). Individual NWB files contain a single experimental run for a single animal. These NWB files are then stored in and accessed from the DANDI archive, where they receive a unique persistent digital object identifier (DOI) in accordance with the International Organization for Standardization (ISO).
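To make this layout concrete, below is a minimal sketch of writing such a file with pynwb. The extension-specific classes from ndx-multichannel-volume (e.g., CElegansSubject, MultiChannelVolume) are referenced only in comments, since their exact signatures should be taken from the extension documentation:

```python
# Minimal sketch, assuming pynwb is installed; identifiers are placeholders.
from datetime import datetime, timezone
from pynwb import NWBFile, NWBHDF5IO

nwbfile = NWBFile(
    session_description="NeuroPAL structural volume of an immobilized adult",
    identifier="worm-001",  # hypothetical identifier; one file per animal per run
    session_start_time=datetime(2023, 1, 1, tzinfo=timezone.utc),
    lab="Example Lab",
)

# Raw volumes belong in the acquisition module, e.g.:
#   nwbfile.add_acquisition(multichannel_volume)  # object from ndx-multichannel-volume
# Processed results (segmentation, activity traces) go in processing modules:
module = nwbfile.create_processing_module("NeuroPAL", "processed structural data")

with NWBHDF5IO("worm-001.nwb", mode="w") as io:
    io.write(nwbfile)
```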
Figure 2:
NWB schema and two software programs with NWB I/O support
(a) Names and content for objects used in C. elegans optophysiology NWB files. (b) File organization hierarchy of NWB files for C. elegans optophysiology. Modules are structured like folders within the root file in an HDF5-based hierarchy. (c-d) NeuroPAL ID software (c) and eats-worm software (d) GUIs with NWB I/O support for visualization and annotation of NeuroPAL structural images, neural segmentation and automated ID, and time-series of neural-activity with stimulus-presentation in immobilized worms.
We incorporated NWB ndx-multichannel-volume read and write functionality into two software tools. These independent implementations both offer user-friendly GUIs for analyzing C. elegans NeuroPAL structural images and neural activity in immobilized worms (NeuroPAL software https://github.com/Yemini-Lab/NeuroPAL_ID, Fig. 2c, and eats-worm software https://github.com/focolab/eats-worm, Fig. 2d). This functionality can be straightforwardly incorporated into other data-analysis pipelines and software.
In Table 1 we present a summary of the data we aggregated and harmonized into a corpus: 118 worms, comprising 10 animals from the original NeuroPAL work and 108 worms from six datasets acquired by five different labs, each with segmented neurons and human-labeled identities. This corpus can be mined for biological insights, training and benchmarking of machine-vision approaches, and neurobiological studies of structural and neural-activity time-series data. Each of these datasets is stored on DANDI and ranges from a few hundred megabytes to several terabytes (see Methods for dataset references). DANDI supports streaming from the cloud and allows users to selectively load data objects and data chunks, substantially reducing the local data storage and RAM requirements necessary to work with this data on a personal computer.
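For example, a single file can be streamed from the archive without a full download, as in this sketch (the dandiset ID is the original NeuroPAL set listed in Methods; the asset path is a hypothetical placeholder):

```python
# Sketch of DANDI streaming, assuming the dandi, remfile, h5py, and pynwb packages.
import h5py
import remfile
from dandi.dandiapi import DandiAPIClient
from pynwb import NWBHDF5IO

with DandiAPIClient() as client:
    dandiset = client.get_dandiset("000715")                       # original NeuroPAL data
    asset = dandiset.get_asset_by_path("sub-01/sub-01_ophys.nwb")  # hypothetical path
    url = asset.get_content_url(follow_redirects=1, strip_query=True)

# remfile fetches byte ranges lazily, so only the chunks you touch are transferred.
with h5py.File(remfile.File(url), "r") as f:
    with NWBHDF5IO(file=f, load_namespaces=True) as io:
        nwbfile = io.read()
        print(list(nwbfile.acquisition))  # names of raw data objects
```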
Table 1:
Summary of aggregated dataset characteristics
NP dataset comes from the original NeuroPAL paper6, obtained as part of a collaboration between the labs of Oliver Hobert (Columbia University) and Aravinthan D.T. Samuel (Harvard University).
| Dataset # | # of Worms in the Dataset | Lab Code | NeuroPAL, GCaMP, or Both | # of Segmented Neurons (Avg) | # of ID Labels (Avg) |
|---|---|---|---|---|---|
| NP | 10 | NP | NeuroPAL | 189-196 (193) | 186-193 (190) |
| 1 | 21 | EY | Both | 166-188 (177) | 164-184 (175) |
| 2 | 9 | HL | NeuroPAL | 113-125 (119) | 58-69 (64) |
| 3 | 9 | KK | Both | 149-163 (154) | 149-163 (154) |
| 4 | 38 | SF | Both | 29-96 (70) | 29-96 (70) |
| 5 | 21 | SK1 | Both | 78-139 (111) | 30-82 (48) |
| 6 | 10 | SK2 | NeuroPAL | 166-180 (173) | 38-63 (49) |
| Summary | 118 | 5 labs | — | 29-196 (126) | 29-193 (99) |
An updated atlas of the C. elegans hermaphrodite head
Our multi-lab data corpus allows data scientists to train and benchmark algorithms for automated neuron identification using datasets that reflect real-world diversity. In this section, we focus on the statistical atlas approach presented in Varol et al. 2020.28 This approach was the first to take advantage of the color information provided by NeuroPAL and was presented alongside the original NeuroPAL work.6 The algorithm frames neuron identification as a bipartite graph matching problem, minimizing the total assignment cost with the well-known Hungarian algorithm.43 Cost is calculated by comparing neuron position and color in the animal sample with the mean and covariance of neuron position and color in a reference statistical atlas (see Methods). The original atlas presented in the paper was trained on 10 worms from the original NeuroPAL work. We retrain this atlas using the full multi-lab corpus presented in this work, increasing the training set by over 10-fold. The statistical atlas generated by this approach serves the additional purpose of characterizing the mean and covariance of neuron position and color across the whole corpus of data.
In Fig. 3 we present visualizations of the statistical atlas of neuron colors and positions trained on 104 of the 118 worms in our consolidated NWB/DANDI dataset, as well as on the smaller dataset of 10 worms used in the original NeuroPAL paper (employing the Statistical Atlas algorithm in Yemini et al. 2021 and Varol et al. 20206,28). 14 worms were omitted from the atlas due to large nonlinear deformities or obvious artifacts. With 104 worms, this represents, to the best of our knowledge, the most broadly trained statistical atlas of C. elegans neuron positions and NeuroPAL coloring available. By leveraging the diversity of the multi-lab corpus, this atlas captures variability between individual worms, strains, and lab-specific experimental conditions. It can be used as a basis for automatic labeling algorithms and biological investigations of neuron positions and brain organization. This statistical atlas further complements detailed electron-microscopy (EM) based anatomical atlases with cellular structural detail and provides nearly 100 more animals in its corpus than the approximately 10 EM ones available.44,45,46 Although our corpus lacks the synaptic connectivity found in the EM datasets, it provides the complementary functional activity that is not available from EM imaging.
Figure 3:
Multi-lab atlas of C. elegans neurons and their positional variability
(a) Butterfly plot showing the mean locations of neurons in the atlas colored by ganglion. (b) 2D color plots showing the distribution of neuron colors in the atlas. (b-c) Ellipses represent covariance (1 SD) and are centered at the mean for each neuron. (c) 2D projections of neuron positions and colors in the aligned atlas space. XZ projection (top) and XY projection (bottom). Ellipses are colored by the mean color, per color channel, for each neuron in the atlas.
WormID.org supplies links to the software, visualization tools, and datasets discussed in this paper, along with related tools to work with whole-brain structural and activity images and to convert datasets to NWB, plus tutorials and instructions for using these tools. Our aim is that this data standard, data corpus, and atlas of cell positions will be a continually evolving resource for the C. elegans neuroscience community, and eventually other model organisms.
Analysis of biological factors in neuron positions
We statistically analyzed the spatial positions of C. elegans neuronal somas across individuals, strains, and lab conditions based on the mean and covariances in the statistical atlas. We focused on relative pairwise displacements rather than absolute positions because the absolute position of cells is dependent on positioning and deformation of the animal’s body during recording and thus requires global alignment of all animals. Aligning multiple animals into identical positions is an imperfect task. In contrast, measuring pairwise cell positions does not require alignment and is relatively robust against animal positioning and deformation.
Before analyzing statistical properties of the neuron positions, we assessed the percentage of neurons that were labeled by humans in each dataset. We found that neurons in the ventral ganglion and retrovesicular ganglion were less commonly labeled than neurons in other ganglia. As shown here and previously in Yemini et al. 20216, neurons in the ventral and retrovesicular ganglia exhibit high relative positional variability, which may explain why fewer of them were confidently labeled by researchers (Fig. 4a). For this reason, we explored several factors hypothesized to contribute to the organization and variability of relative cell positions: i) gangliar boundaries (e.g., basal lamina and abutting tissue), which may restrict cell movement within the coelom, ii) synaptic connectivity, which may impose energetic costs dependent on neuronal proximity, and iii) developmental-time and cell-lineage effects, whereby recently divided cells (i.e., sister cells) remain close together and more distant relatives (e.g., mother and grandmother cells) end up further apart.
Figure 4:
Analyses of neuron positions, distances, and positional variability
(a) Left: percentage of datasets containing each labeled neuron, organized anterior-posterior within each ganglion. Middle: heatmap of the standard deviation (SD) of pairwise positional distances between each pair of neurons across datasets. Right: averaged sums of heatmap rows. Neurons with higher mean positional variability have less stereotyped positions within the worm body. (b) Pairwise positional variability by ganglia for the 10 closest neighbors of each neuron, separating neuron pairs in the same ganglion from pairs in different ganglia. Anterior pharynx: effect size 95% CI [−1.79, −1.38] μm, p = 2.5 × 10^−21, N_same = 53, N_diff = 36; Dorsal: effect size 95% CI [−1.16, −0.72] μm, p = 7.1 × 10^−6, N_same = 13, N_diff = 27; Lateral: effect size 95% CI [−2.80, −2.38] μm, p = 5.3 × 10^−31, N_same = 319, N_diff = 159; Retrovesicular: effect size 95% CI [−1.97, −1.19] μm, p = 9.3 × 10^−5, N_same = 89, N_diff = 43; Anterior: p = 0.161, N_same = 176, N_diff = 50; Ventral: p = 0.252, N_same = 52, N_diff = 37. (c) Relationship between pairwise neuron synaptic weights and their mean positional distance for chemical and electrical synapses. Chemical synapses: Kendall's τ = −0.036, p = 0.021, Pearson R = −0.098, p = 6.4 × 10^−6, N = 2119; Electrical synapses: Kendall's τ = 0.009, p = 0.80, Pearson R = 0.031, p = 0.51, N = 444. (d) Relationship between cell birth times and the mean and SEM of their nuclear positional distance in adulthood for sister cells. Mean: Kendall's τ = −0.144, p = 0.052, Pearson R = −0.161, p = 0.129; SEM: Kendall's τ = 0.014, p = 0.845, Pearson R = 0.074, p = 0.487. Most sisters are within 15 μm of each other in adulthood. More sisters that divide embryonically remain close together (<8 μm) than sisters that divide >16 hours later at postembryonic larval stages.
To test the first hypothesis, that gangliar boundaries regulate positional organization and variability, we measured the positional variability of neurons that are spatially close, comparing pairs within the same ganglion to nearby pairs that span two different ganglia (see Methods). We found that neurons in the anterior pharyngeal bulb and neurons in the dorsal, lateral, and retrovesicular ganglia all exhibit significantly lower variability for pairs within the same ganglion than for pairs in different ganglia. Conversely, for neurons in the anterior and ventral ganglia, we observed no significant difference between pairs in the same ganglion and pairs in different ganglia (Fig. 4b). We used an independent-samples t-test to compare pairs within the same ganglion with those in different ganglia. Known anatomical features of the worm support this hypothesis: the pharynx is a muscular epithelial tube47 that rigidly encases neurons; the remaining ganglia are separated by basal lamina that loosely restricts their boundaries44; finally, the anterior and ventral ganglia (and comparatively smaller retrovesicular ganglion) are completely bounded, whereas all other ganglia are open at least at one end, and White et al. 1986 noted that tight cellular packing in these regions led to “slop”, “uncertainty”, and, in live animals, even “flipping” of the cells contained therein from side to side. This finding suggests that neural identification algorithms could be improved by a hierarchical approach, such as first predicting ganglion membership, then predicting neuron identities within each ganglion.
Next, we explored the relationship between somatic distance and synaptic connectivity. Overall, there was a very weak but statistically significant correlation between nuclear distance and synaptic weight for chemical synapses, and no detectable correlation for electrical synapses. However, we found that nearby neurons (mean distance < 40 μm) exhibit a wide range of chemical synaptic weights, ranging anywhere from 0 to 70 synapses (with a median synaptic count of 3), whereas distant neurons (mean distance > 40 μm) have a maximum synaptic count of ~25 synapses (with a median count of 2) (Fig. 4c). This distance cutoff was chosen by observing a distinct elbow at 40 μm in a 2D kernel density estimate plot of the scatter data (Supplement 1). Our data suggest that neurons that are strongly wired together tend to be close to each other, although somatic proximity alone is not sufficient to imply strong connectivity. Recent findings in C. elegans have substantiated Peters' rule: neurons with larger colocalized axodendritic regions are more likely to form connections.48 Our findings lend further support to this principle and suggest that close somatic or nuclear proximity also plays a role in determining neural connectivity.
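For reference, the correlation statistics reported here follow standard scipy calls, sketched below with placeholder arrays standing in for the per-pair distances and connectome weights:

```python
# Sketch only: dist and weight below are synthetic placeholders, one entry
# per connected neuron pair (real inputs come from the atlas and connectome).
import numpy as np
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(0)
dist = rng.uniform(5.0, 80.0, size=200)          # mean somatic distance (μm)
weight = rng.poisson(3, size=200).astype(float)  # chemical synapse count

tau, p_tau = kendalltau(dist, weight)  # rank correlation, robust to outliers
r, p_r = pearsonr(dist, weight)        # linear correlation
print(f"Kendall tau={tau:.3f} (p={p_tau:.3g}), Pearson R={r:.3f} (p={p_r:.3g})")
```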
Lastly, we explored the hypothesis that cell lineage is a determinant of adult cell positioning. Embryonic C. elegans are confined to a fixed volume within an eggshell approximately 50 μm in length and 30 μm in diameter.49 After hatching, they grow over 4x in length from birth (~250 μm) to adulthood (over 1 mm), with an exponential expansion in their volume.50,51 Sister cells are cells whose lineage differs only at the very last division. We hypothesized that animal growth should lead to both larger distances and higher variability between older sister cells that divided in the embryo, versus younger sister cells born much later at postembryonic larval stages of development. Surprisingly, we found no statistically significant correlation between the time of cell division and nuclear distance or between time of cell division and distance variability measured by SEM (Fig. 4d). In fact, most sisters remained within 15 μm of each other (~3 nuclei apart) at adulthood, regardless of when they were born. Strikingly, a substantial cohort of embryonic sisters ended up closer together at adulthood (< 8 μm) than those dividing at larval stages that occur more than 16 hours later (Fig. 4d). Our data rule out exponential postembryonic growth spurts as a major determinant of divergence and variability in neuron positions.
Neuron identification performance increases for all laboratories and all tested algorithms when trained on a harmonized multi-lab corpus
Our previously published statistical atlas algorithm (“StatAtlas”) for automated neuron identification was trained on a homogeneous dataset of 10 NeuroPAL worms.6,28 Formerly, this 10-worm training set achieved an average accuracy of 86% overall in head neurons, ranging from 50% for the ventral ganglion to 100% for the anterior pharyngeal ganglion. These accuracies facilitate neural identification, but in practice they require substantial verification and manual correction, necessitating significant time expenditure. Moreover, the algorithm fails to generalize to datasets produced by other labs (Fig. 5b). We tested the performance of our previously published algorithm on each of the six aggregated datasets. Initial performance on these datasets ranged from ~21% to ~65% with an average of 41% (Supplement 2). This substantial decrease in accuracy on datasets from different labs exposes the limitation of using single-lab training sets to produce tools intended for use by different labs with different instrumentation, experimental methods, and data acquisition pipelines.
Figure 5:
Improvements in neural identification accuracy
(a) Examples of raw and color-corrected (histogram matched) images from each lab and dataset. (b) Top ranked test accuracy for training set of original 10 reference worms with no color correction (orange), 10 reference worms with color correction (green), and the multi-lab corpus with color correction (yellow) for coherent point drift (CPD, left), the statistical atlas model (StatAtlas, middle), and the conditional random field model (CRF_ID, right). Algorithmic performance was evaluated using paired t-tests (where N=94 for each training set) to compare the performance of different atlases. Significance is reported using a Bonferroni correction with the convention of * for p<0.05, ** for p<0.01, and *** for p<0.001. (c) Same as b but using top 5 rank. Summary statistics and p values can be found in supplementary table 1.
To assess the performance benefits of using a large, harmonized corpus to train commonly used automated neural-identification methods, we tested two additional popular algorithms: coherent point drift (“CPD”)52 and CRF_ID26. Coherent point drift is an untrained and unsupervised algorithm that: 1) globally aligns a sample point cloud of neurons to a reference atlas, then 2) locally matches points from the sample to their nearest neighbors in the atlas, and finally 3) identifies sample neurons (points) by their corresponding matches in the atlas. CRF_ID is a newer graph-based approach that identifies neurons using a combination of statistics from their individual features (e.g., absolute position and color) and pairwise relationships (e.g., displacement and angle relative to each other). There are currently no published benchmarks on the neural identification problem using CPD. Formerly, CRF_ID demonstrated a high accuracy of 83% when originally trained and tested solely on the HL dataset. Similar to StatAtlas, when testing the generalizability of the CPD and CRF_ID base models on the full WormID corpus, we observed poor performance, with average overall accuracies of 39% and 59%, respectively.
After inspecting recordings from multiple labs, we hypothesized that differences in color space may have negatively impacted algorithmic performance. Potential sources of color space variability include differences in microscope hardware, software and image settings, and configuration of the optical path. Anecdotally, in addition to these known sources of variability, researchers also typically adjust exposure, contrast, and other channel display parameters to make the composite rendered colors appear more like the images in the NeuroPAL reference manual.53 In aggregate, this suggested that harmonizing the color space may aid automatic algorithms.
We developed an approach to match the color histogram of a sample image to a reference histogram representing ideal coloring (see Methods). Histogram-matching the original small training set improved the accuracy of all three tested algorithms by an average of 8%, 9%, and 7% for CPD, statistical atlas, and CRF_ID respectively. It also qualitatively made composite color renderings better match the NeuroPAL reference manual, aiding users in annotating and correcting algorithmic predictions (Fig. 5a,b,c).
Given this success on the original small training set, we used the histogram-matched images to train a new atlas for the StatAtlas and CRF_ID algorithms on the full corpus of data. Test accuracy is reported using 5-fold cross-validation, where each worm is tested against an atlas that was not trained on that worm. For CPD, we updated the algorithm to select the best template out of the full corpus (see Methods for further details). This led to significant improvement in accuracy across algorithms (Fig. 5b,c), with average improvements of 17%, 22%, and 18% that further raised average predictive accuracy from 22% to 39%, from 41% to 62%, and from 55% to 74% for CPD, StatAtlas, and CRF_ID, respectively. This is equivalent to a ~1.3x, ~1.6x, and ~1.7x reduction in error rate. Accuracy reached as high as 95% for several individual datasets for both StatAtlas and CRF_ID. Furthermore, when considering the top 5 neural identity assignments (rather than just the top 1), the multi-lab models showed average accuracies of 65%, 86%, and 89% for CPD, StatAtlas, and CRF_ID respectively, with some datasets reaching 100% accuracy for both StatAtlas and CRF_ID (Fig. 5c, Supplement 3). In addition, we see similar improvements in accuracy across most datasets for StatAtlas and CRF_ID when training on all but one dataset and then testing on the left-out dataset (Supplement 6). This indicates that most of the benefit from retraining comes from achieving a better representation of the full diversity across datasets, rather than capturing the specific nuances of any one dataset. This generalizability will enable labs to use these retrained algorithms out of the box rather than needing to fine-tune on their own data.
Differences in accuracy between datasets may have been caused by a variety of factors, including poor initial alignment, optical quality, non-neuronal artifacts in the images, and nonlinear deformations of the worm body. Additionally, datasets with fewer annotated neurons had better automatic labeling accuracy, presumably because experimenters only labeled the easiest neurons to identify and left the hardest ones unannotated (Supplement 4,5).
Discussion
Aggregation and harmonization of data from a variety of different sources is necessary to build a corpus for analytical methods and machine learning tools that generalize across the diversity of real-world data. In this work, we present a data harmonization pipeline for analyzing whole-brain structural and activity imaging in C. elegans. This pipeline includes data aggregation, conversion to a standardized file format, software for analyzing these standardized datasets, pre-processing approaches to align images and color spaces, and spatial registration of sample neuron point clouds to a common atlas.
We used this corpus to study potential biological factors that organize cell position in C. elegans. Specifically, we find that: i) restrictions in bounding tissue and gangliar space likely contribute to variability in neuron positions, ii) neurons within ~40 μm of each other show higher synaptic connectivity, and iii) sister neurons that divide in the embryo can be found closer together at adulthood than ones dividing at larval stages more than 16 hours later. The positive relationship between synaptic connectivity and neuron somatic proximity thus augments the previously observed correlation of synaptic connectivity to axodendritic adjacency, termed Peters' rule. Moreover, the close distances and low positional variability we measured for embryonically born sister neurons rule out exponential organismal growth as a major cause driving neurons apart from each other during the establishment of the adult Bauplan.
We then used the corpus to train a machine learning tool to automate the intensive task of labeling cells in these datasets. This substantially boosted generalized performance across datasets from contributing labs for each tested algorithm, despite the variability in data from these different groups. Accuracy of auto-identification now approaches human performance for certain datasets using the StatAtlas or CRF_ID algorithms. In the future, our corpus can be used to incorporate neuronal shape- and size-based descriptors as well as dynamical time-series features to further improve neural identification algorithms.
The WormID.org tools and resources are readily applicable to new whole-brain structural and activity imaging datasets, and these new datasets can be easily added to the existing corpus. These tools streamline public data sharing to facilitate both open science and to satisfy data-sharing mandates. We hope this resource will continue to grow in size and breadth to enable the development and benchmarking of new machine learning tools and algorithms. Our analyses of cell features based on the full corpus of data can immediately be used to inform better feature selection and algorithms that continue to improve automated approaches for neuron-subtype identification in volumetric images. Additionally, the large corpus and trained statistical atlas can serve as a descriptive resource of the underlying neurophysiology of C. elegans. Moreover, our resources can be incorporated into computational neurobiology courses, such as the Neuromatch Academy (neuromatch.io)54 to train the next global generation of neuroscientists on real-world datasets. As the community continues to develop new tools, this corpus will allow these new tools to be benchmarked for generalizable performance, spurring innovation.
As the community continues to scale up the generation of neural data and increasingly relies on machine learning analysis to tame this “big data”, there is an ever-growing need to unify disparate datasets to produce verifiably robust, accurate, and generalizable analytical approaches. Harmonization efforts such as ours can significantly reduce the activation energy necessary for collaboration, data sharing, and the development of unified community-wide tools across labs. While some of the resources we created are specific to C. elegans, the framework and much of our toolkit can be applied to other model organism imaging communities.
Methods
Standardized file format - Neurodata Without Borders
NWB is an HDF5-based format built specifically for neurophysiology data and has emerged as the de facto standard for storing neurophysiology datasets with associated metadata for reuse and sharing. NWB provides object types for data and metadata including acquisition parameters, segmentation of 3D image regions, fluorescent time series (e.g., for neural activity), experimental design information, multichannel electrophysiology time series data, 3D images, stimulus events during an experiment, and behavioral data.42
The base NWB schema supports two-dimensional structural and time-series multi-channel images but did not originally support the type of five-dimensional (multi-channel, volumetric, time-series) data that is used in C. elegans whole-brain activity imaging or other metadata associated with these types of experiments. To solve this problem, we developed ‘ndx-multichannel-volume’ as a novel extension to the existing Neurodata Without Borders (NWB) standardized file format. More information and resources about NWB can be found at nwb.org.
Our extension adds new objects, built on existing ones in the schema, to provide and improve support for multi-channel, volumetric, time-series images and their associated metadata, as well as volumetric segmentation data and metadata fields specific to C. elegans, such as cultivation temperature and growth stage. This extension and the datasets presented in this work represent the first applications of the NWB data format to C. elegans and have now been incorporated as the standard for this model organism. The extension is flexible, open-source, and can be continuously updated to incorporate new types of data for future experiments.
Storage on DANDI
Data and associated metadata were uploaded to the DANDI archive [RRID:SCR_017571] using the Python command line tool (https://doi.org/10.5281/zenodo.3692138). The data were first converted into the NWB format (https://doi.org/10.1101/2021.03.13.435173) and organized into a BIDS-like (https://doi.org/10.1038/sdata.2016.44) structure.
All datasets can be streamed or downloaded from the DANDI archive, available on WormID.org as well as these individual URLs:55,56,57,58,59,60,61
Original NeuroPAL: https://doi.org/10.48324/dandi.000715/0.240614.1942
EY: https://doi.org/10.48324/dandi.000472/0.240625.0454
HL: https://doi.org/10.48324/dandi.000714/0.240611.1954
KK: https://doi.org/10.48324/dandi.000692/0.240402.2118
SF: https://doi.org/10.48324/dandi.000776/0.240625.0015
Software systems with embedded NWB I/O
We present two software examples with user-friendly GUIs to interface with NWB datasets and run standard analysis pipelines for cell segmentation, identification, tracking, and extraction of time-series neural-activity traces annotated with any experimental stimuli presented. First, we present the NeuroPAL_ID software (Fig. 2c) for visualization and annotation of volumetric NeuroPAL images and whole-brain activity, including neuronal segmentation and identification, neural tracking, activity-trace extraction, and stimulus-presentation data. This software is pre-compiled for use on macOS and Windows. The software is open-source, available from https://github.com/Yemini-Lab/NeuroPAL_ID/releases, and is written in MATLAB and Python. It has been updated to include functionality described in this paper to enable histogram matching, color-corrected image visualization, and automated neural identification using the new statistical atlas. This software is written and managed by the Yemini Lab; further information can be found at https://www.yeminilab.com/neuropal.
Second, we present the eats-worm software for visualization, segmentation, and activity extraction of neural-activity time series from immobilized worms. Eats-worm similarly allows for manual verification and curation of the automatic segmentation and tracking algorithms. The tracking algorithm was optimized for tracking neurons across frames in immobilized worms, but there are currently efforts to extend this functionality to work for freely-moving worms as well. Eats-worm is written in Python and is built as a plugin to Napari, a popular 3D visualization tool. This software is written and managed by the Kato Lab; further information can be found at https://github.com/focolab/eats-worm.
Both software programs have embedded functionality to read and write NWB files. NWB I/O functionality enables a user to quickly run similar analyses on all of the datasets presented in this work without the need to develop specific pipelines to read in data from each dataset. Furthermore, this functionality can be easily embedded into MATLAB or Python-based analysis software.
Data acquisition
NeuroPAL structural volumes and neural-activity time-series volumes were acquired using the protocols outlined in Yemini et al. 2021.6 After collection of these images, neurons were segmented and annotated according to the guidelines in the NeuroPAL manual.53 Specific immobilization methods, microscope setup, and experimental protocols differ slightly between datasets. All datasets were acquired using spinning disk confocal microscopes, with XY resolution varying from 0.1604 to 0.54 μm/pixel and Z resolution varying from 0.54 to 1.5 μm/pixel. XY resolution was the same for NeuroPAL structural images and neural-activity images (using GCaMP6s) for all datasets, but Z resolution differed between the two image types, varying from 0.54 to 3 μm/pixel. Z resolution is generally lower for neural-activity images due to limitations in optical sectioning with confocal microscopes. Lower Z resolution also reduces the number of frames needed to record a full volume for a single time point, aiding imaging at higher temporal resolution. Most images were taken with the worm immobilized in a microfluidic chip, with the exception of the KK dataset (where worms were semi-restricted in a microfluidic device) and the SF dataset (where worms were freely moving). The NWB files and the DANDI datasets that hold them contain metadata for the specific setup and conditions in each dataset. For published datasets, additional information can be found in the associated publications.6,7
After acquisition of NeuroPAL structural volumes and whole-brain activity time series, images were segmented using various automatic segmentation algorithms, ranging from classical computer vision approaches (e.g., template matching64) to deep neural network approaches. These segmentations were then manually verified. Ground truth annotations were done using a combination of existing automatic identification algorithms followed by manual corrections. Each neuron identity label was either explicitly annotated by experts or manually verified after algorithmic identification. Note that varying levels of completeness in labeling are due to the difficulty of this manual annotation task. For several datasets with lower image quality, even experts could only confidently label 30-50% of segmented neurons in the volume. For neural-activity time series, neuron centers were first tracked across images using various algorithms and then manually verified by experts.62,63 Fluorescence activity was then extracted from these tracked ROIs to obtain time series of neural-activity traces. Neurons in the NeuroPAL structural volume were then matched to the ROIs in the neural-activity time series to get labeled activity traces.
Datasets from various labs were converted to the NWB standardized file format using the ndx-multichannel-volume extension presented in this work. These files were then uploaded to the DANDI archive where they are now publicly accessible for data streaming, download, or online visualization.
| Dataset | Microscope | Length of recording | Sample rate | Resolution (μm/pixel) | Strain | Setup |
|---|---|---|---|---|---|---|
| NP_og | Zeiss LSM880 spinning disk confocal | ~4 min | ~4 Hz | 0.208 x 0.208 x 1.02 | OH16230 | Microfluidic chip |
| SF | Andor spinning disk confocal w/ Nikon ECLIPSE Ti microscope, 40x water immersion | ~15 min | 1.7 Hz | 0.54 x 0.54 x 0.54 | Various | Freely moving |
| SK1 | Leica DMi8 inverted spinning disk confocal, 40x WI, 1.1 NA | ~25 min | 1.04 Hz | 0.1604 x 0.1604 x 1 (3 for calc images) OR 0.3208 x 0.3208 x 0.75 (2.5 for calc images) | FC121, FC128, OH16230 | Microfluidic chip |
| SK2 | Leica DMi8 inverted spinning disk confocal, 40x WI, 1.1 NA | ~15 min | 3.3 Hz | 0.3208 x 0.3208 x 0.75 (1.5 for calc images) | OH16230 | Microfluidic chip |
| KK | Nikon Eclipse Ti-U inverted spinning disk confocal, 40x 1.3 NA | ~15-20 min | 1.67 Hz | 0.32 x 0.32 x 1.5 for both images | KDK92 | Semi-restricted in microfluidic device |
| HL | Perkin Elmer spinning disk confocal 1.3 NA, 40x oil OR Bruker Opterra II swept field confocal 0.75 NA, 40x air | NA | NA | 0.33 x 0.33 x 1 | OH15495 | Microfluidic device |
| EY | Spinning disk confocal | ~4 min | ~4 Hz | 0.27 x 0.27 x 1.5 | OH16230 | |
Butterfly plot
To produce the butterfly plot, we first manually found three orthogonal basis vectors to align neuron point clouds to a new cartesian coordinate space. To do so, we used a human-guided affine transformation to roughly align these basis vectors to the anterior-posterior, dorsal-ventral, and left-right axes. The xyz coordinates of each neuron were projected into this new cartesian coordinate space and then converted to cylindrical coordinates, with x retained along the anterior-posterior axis: r = √(y² + z²) and θ = atan2(z, y). We plotted the new x and θ coordinates on a 2D plane to get the butterfly plots shown in (Fig. 1a, Fig. 5a). This projection is akin to flattening the positions of the neurons along the circumference of a cylinder of the worm body and then unrolling that cylinder into a flattened plane.
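A minimal sketch of this projection follows (the axis convention, x along the anterior-posterior axis with θ measured in the y-z plane, is an assumption for illustration):

```python
import numpy as np

def butterfly_coords(xyz: np.ndarray) -> np.ndarray:
    """Unroll aligned neuron positions (N x 3) onto a 2D plane.

    Assumes x runs anterior-posterior and (y, z) span the body cross-section;
    keeps x and replaces (y, z) with the angular coordinate theta.
    """
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    theta = np.arctan2(z, y)            # angle around the body axis
    return np.column_stack([x, theta])  # plot x vs. theta for the butterfly plot
```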
Histogram matching
We modified the established approach of histogram matching to apply to 3-D volumetric, multi-channel data.64 We created a reference histogram using the 10 worms from the original NeuroPAL work. This data is stored as uint16, so there are 65,536 possible values for each pixel. For each channel, we created a histogram counting the number of pixels within the bin edges, assigning each color value its own bin, and then averaged the values in each of these bins across the 10 images. Practically, these histograms were very similar across these 10 datasets, so the averaged histogram looked similar to each of the individual histograms.
To color match a new animal sample, we calculated a histogram for each channel. The number of bins for each channel histogram was equal to the maximum intensity value present in that channel in the image. Practically, this means that there are a different number of histogram bins for each channel in each image because images were collected at different bit depths and with varying levels of saturation.
We then calculated a cumulative density at each color value for both the sample and the reference. We created a lookup table M associating each gray-count value x in the sample with the color value in the reference having the closest cumulative density. We then created a new matched image by transforming each pixel into the new color space using this lookup table, as sketched below.
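The per-channel matching step can be sketched as follows (a simplified numpy implementation of the lookup-table construction described above; bit-depth bookkeeping is elided):

```python
import numpy as np

def match_channel(sample: np.ndarray, ref_hist: np.ndarray) -> np.ndarray:
    """Match one uint16 image channel to an averaged reference histogram."""
    s_hist = np.bincount(sample.ravel(), minlength=ref_hist.size).astype(float)
    s_cdf = np.cumsum(s_hist) / s_hist.sum()      # sample cumulative density
    r_cdf = np.cumsum(ref_hist) / ref_hist.sum()  # reference cumulative density
    # Lookup table M: each sample gray value maps to the reference value
    # with the nearest cumulative density.
    lut = np.searchsorted(r_cdf, s_cdf).clip(0, ref_hist.size - 1)
    return lut[sample].astype(sample.dtype)
```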
Color extraction
To extract the color values for the neurons in each image, we first calculated the mean and standard deviation of the pixel gray counts in each channel and converted each pixel value into its Z-score. We then took a sample of a 3x3x1 grid of pixel values around each segmented neuron center in each channel and used the median values of this grid as the RGB values for that neuron center. Color values were extracted post-histogram matching when training or testing with histogram-matched images. For non-histogram-matched images, there were no additional color pre-processing steps beyond Z-scoring.
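A sketch of this extraction rule (the Z x Y x X x channel array layout is an assumption; boundary neurons would need padding in a full implementation):

```python
import numpy as np

def neuron_color(volume_z: np.ndarray, center: tuple[int, int, int]) -> np.ndarray:
    """Median color of a 3x3x1 patch around a neuron center.

    volume_z: Z-scored image, shape (Z, Y, X, channels); center: (z, y, x).
    """
    z, y, x = center
    patch = volume_z[z, y - 1:y + 2, x - 1:x + 2, :]  # 3x3 in-plane, one z-slice
    return np.median(patch.reshape(-1, patch.shape[-1]), axis=0)
```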
Positional variability analysis
We calculated pairwise positional variability by measuring the Euclidean distance between every pair of canonical head neurons across each structural volume when both neurons in that pair had a ground truth label. We then took the average and standard deviation of these distances for each neuron pair to find mean nuclear distance and pairwise positional variability, respectively. For these analyses we ignored pairs that are not present in at least 5 datasets. We used pairwise positional variability instead of absolute positional variability because absolute position is extremely sensitive to point-cloud realignment, which would make it hard to disentangle natural positional variability from alignment errors; furthermore, we are interested not only in how individual cells vary but in how cells vary relative to each other. To get the mean positional variability for a given neuron, we averaged the mean pairwise distance for all pairs containing that neuron.
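Schematically, this computation looks like the sketch below (assuming each worm's neuron positions are ordered by a shared canonical label list, with NaN rows where a label is missing; the minimum-5-datasets filter is elided):

```python
import numpy as np
from scipy.spatial.distance import pdist

def pairwise_stats(positions: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    """positions[w]: (N x 3) xyz coordinates of the same N canonical neurons
    in worm w. Returns mean distance and positional variability per pair."""
    dists = np.stack([pdist(p) for p in positions])  # worms x pairs; NaN propagates
    mean_dist = np.nanmean(dists, axis=0)            # mean nuclear distance
    variability = np.nanstd(dists, axis=0)           # pairwise positional variability
    return mean_dist, variability
```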
Intra- vs inter-ganglion measures: for every neuron in the atlas, we found its n closest atlas neighbors and measured pairs only for those n closest neighbors. We then separated these pairings based on whether the two neurons in a pair lie within the same ganglion or in two different ganglia. Note that pairs in different ganglia appear twice: e.g., if one neuron in the pair is in the anterior ganglion and the other is in the lateral ganglion, the pair is counted in the analysis for both the anterior ganglion and the lateral ganglion (Fig. 4b). Pairs within the same ganglion are counted once. We compared this approach for n = 1-20 (Supplement 7). For all numbers of neighbors, positional variability is higher for neighbors in different ganglia than for neighbors in the same ganglion. The pattern stabilizes around n = 7 and holds steady through n = 20. We therefore selected n = 10 for our analysis.
Synaptic connection: synaptic weights between neuron pairs are derived from the whole-brain connectome of the adult hermaphrodite in Cook et al. 2019.45
Lineal distance: the cell lineage tree and associated birth times were taken from Sulston et al. 1982.63 The last shared parent cell between two neurons is the most recent shared parent node in the lineal tree. We used the birth time of the last shared parent cell between two neurons as the lineal distance and explored the relationship between this lineal distance and mean pairwise nuclear distance (Fig. 4d). In this analysis, we focus only on sister cells: terminal cells that only divided from each other at the very last stage of their lineal tree.
Coherent Point Drift (CPD)
Coherent point drift has been a common algorithm for registering two similar point clouds since its introduction in Myronenko and Song 2009.52 CPD allows for both rigid and non-rigid point-set registration. CPD models one point set as a set of Gaussian mixture model (GMM) centroids that are fit to the second point set by maximizing the likelihood. The GMM centroids are constrained to move coherently, preserving the structure of the point clouds. In the rigid case, the algorithm learns an affine transformation of the GMM centroid locations, while in the non-rigid case, the algorithm learns a displacement function on the original centroid positions with a regularization term to enforce smoothness. The objective function is optimized using an iterative expectation-maximization (EM) approach and yields both the aligned point set and an NxM correspondence probability matrix that represents the likelihood that each point n in set 1 corresponds to each point m in set 2.
In this paper, we use the specific implementation of CPD described in Yu et al. 2021.34 First, rigid CPD is used to roughly align a test worm point cloud to a template point cloud. Then, non-rigid CPD is used to model non-linear deformations between the semi-aligned test and template. Neuron assignments are then determined by creating a matrix of pairwise Euclidean distances, in the aligned space, between every neuron's position and color in the test and every neuron in the template. We then use the Hungarian algorithm on this distance matrix to find the optimal label assignments. To get 2nd-ranked assignments, we assigned an infinite cost to each label assignment from the first pass and reran the Hungarian algorithm; we repeated this for the 3rd-5th ranked assignments.
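A simplified sketch of this pipeline using the pycpd package (positions only, with pycpd's default parameters; the full implementation also folds color channels into the distance matrix):

```python
import numpy as np
from pycpd import DeformableRegistration, RigidRegistration
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def cpd_assign(test_pts: np.ndarray, template_pts: np.ndarray) -> np.ndarray:
    """Register test neuron positions to a template, then assign labels."""
    # Stage 1: rigid CPD for a rough global alignment.
    aligned, _ = RigidRegistration(X=template_pts, Y=test_pts).register()
    # Stage 2: non-rigid CPD to absorb remaining body deformations.
    aligned, _ = DeformableRegistration(X=template_pts, Y=aligned).register()
    # Stage 3: Hungarian algorithm on pairwise distances in the aligned space.
    cost = cdist(aligned, template_pts)
    _, cols = linear_sum_assignment(cost)
    return cols  # template neuron index (label) assigned to each test neuron
```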
Accuracy was calculated by counting the number of neurons whose algorithmic assignment was the same as the ground truth label and then dividing by the total number of neurons that have a ground truth label. Note that neurons without a ground truth label were not included in the accuracy metric but are still part of the cost matrix and received neuron assignments. Since there is no ground truth for these neurons we did not determine the accuracy of their label assignments.
The difference between using the ‘original (10) worms’ versus the ‘multi-lab corpus’ for CPD is which set supplies the candidate templates. For the ‘original’ group, we compare every test set to each of the original 10 NeuroPAL worms and report the accuracy for the template with the highest average probability of correspondence after the rigid alignment step. Similarly, for the ‘multi-lab corpus’, each test worm is compared to every possible template worm in the whole multi-lab corpus, and accuracy is reported in the same way. The accuracy of CPD is highly sensitive to a good rough initial alignment and to similarity between the template and the test point cloud. The template with the highest average probability of correspondence is not necessarily the template that yields the highest accuracy, but it is the template for which the algorithm is most confident that it has found the ‘correct’ correspondence.
Statistical Atlas training and inference
Statistical atlases used for testing performance were trained using the algorithm described in Varol et al. 2020.28 This algorithm uses a training set of neuron point clouds with both XYZ and RGB values and takes a block-coordinate descent approach where it iteratively learns affine transformation parameters to align the neuron point clouds, then updates the means and covariances of the positions and colors of each neuron until reaching convergence. This process generates mean and covariance parameters for each neuron as well as an aligned coordinate space for all the worms in the training set. The trained atlas consists of a list of neuron names alongside their associated means and covariances in the aligned position and color space.
We trained three atlases: the original atlas trained on just the original 10 NeuroPAL worms from Yemini et al. 20216, the color corrected atlas trained on these same 10 worms after histogram matching, and the multi-lab + color-corrected atlas which is trained on the full corpus of histogram-matched data.
For the original atlas, we tested every dataset in the full corpus without histogram matching. For the color-corrected atlas, we similarly tested every dataset on the full corpus of data with histogram matching. For the atlas trained on the full corpus, we use K-fold cross-validation. The corpus was split into five equally-sized groups. For each group, an atlas was trained on all datasets in the other four groups and performance was reported for the out-of-training set group. The 10 worms used to train the original and color-corrected atlas were included in the training for each of these five groups. These 10 worms were not used to report testing accuracy for any of the atlases (Fig. 5). The fully trained atlas presented in Fig. 3 was trained using the 10 original worms and the full corpus of data presented in this work, without splitting it into groups. This full atlas was embedded into the autoID functionality of the NeuroPAL ID software shown in Fig. 2.
Neuron point clouds used for testing were pre-aligned by learning an affine transformation from each sample dataset to the aligned coordinates of the atlas, based on a subset of the ground-truth-labeled neurons in the sample. Briefly, assuming N neurons in the test sample and M neurons in the atlas, we calculated an NxM cost matrix using the Mahalanobis distance between each neuron center in the sample and each neuron distribution in the atlas: C_ij = √((x_i − μ_j)ᵀ Σ_j⁻¹ (x_i − μ_j)), where x_i represents the XYZRGB values of neuron i, while μ_j and Σ_j represent the XYZRGB mean and covariance, respectively, for neuron j in the atlas.
We then treated this cost matrix as a linear sum assignment problem. Label assignments (for neural identification) were calculated using the Hungarian algorithm.43 2nd-5th order ranked assignments and accuracy are calculated in the same way as described for CPD.
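A sketch of this cost construction and top-1 assignment (the 2nd-5th ranked assignments repeat the matching with previously chosen pairs set to infinite cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def atlas_assign(sample: np.ndarray, mu: np.ndarray, cov: np.ndarray):
    """sample: N x 6 XYZRGB neuron values; mu: M x 6 atlas means;
    cov: M x 6 x 6 atlas covariances. Returns the optimal assignment."""
    n, m = sample.shape[0], mu.shape[0]
    cost = np.empty((n, m))
    for j in range(m):
        diff = sample - mu[j]  # residuals of every sample neuron to atlas neuron j
        inv = np.linalg.inv(cov[j])
        cost[:, j] = np.sqrt(np.einsum("ni,ij,nj->n", diff, inv, diff))
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return rows, cols  # sample neuron rows[k] gets atlas identity cols[k]
```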
CRF_ID training and inference
CRF_ID atlases and inference are conducted using the algorithm described in Chaudhary et al. 2021.26 This approach follows a probabilistic graphical model framework based on conditional random fields. The graph is defined by node features corresponding to unary measures for each neuron center, such as position and color, and edge features corresponding to pairwise measures for each pair of neurons, such as distance, relative angle, or the probability that one neuron is anterior to the other. After features are selected, a data-driven atlas is trained on a corpus of data to determine the average values of each measured feature; then, for a test worm, node and edge potentials are calculated by comparing each feature in the test worm to the atlas, and the best global assignment of labels is inferred by maximizing an energy function using an approximate inference method. For the analysis in this work, we used the color information solely to define the node potentials, and the pairwise angle relationships solely to define the edge potentials. Optimizing the weights of the node and edge features may result in higher prediction accuracy.
We trained three atlases: the original atlas trained on just the original 10 NeuroPAL worms from Yemini et al. 20216, the color corrected atlas trained on these same 10 worms after histogram matching, and the multi-lab + color-corrected atlas which is trained on the full corpus of histogram-matched data. This training approach follows the same K-fold cross validation approach used for the Statistical Atlas method.
We used the roughly pre-aligned point clouds from the Statistical Atlas algorithm as input to the CRF_ID algorithm, to eliminate possible differences in the initial alignment step, which can dramatically change accuracy.
In practice, there are nearly always fewer detected neuron centers in a given image than total cells in the atlas. CRF_ID handles this by modeling a hidden variable h ∈ {0,1}^N, where N is the number of neurons in the atlas; h indicates which atlas cells are missing from the image. Based on the number of cells in the test image, P cells are uniformly selected across different regions of the head and removed from the atlas. This process is repeated ~1000 times to sample multiple possible combinations of h. The top 1-5 predicted assignments are generated by compiling a list of the most frequent labels for each cell in the test image across all runs. Accuracy is reported in the same way as for the CPD and Statistical Atlas methods.
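A schematic of this repeated-subsampling vote aggregation, where run_crf_inference is a hypothetical stand-in for a single CRF_ID inference run and region-stratified sampling is simplified to uniform sampling:

```python
import numpy as np
from collections import Counter

def ranked_labels_by_vote(atlas_labels, n_test_cells, p_missing,
                          run_crf_inference, n_runs=1000, rng=None):
    """Aggregate per-cell label votes across runs with random atlas subsets."""
    rng = rng or np.random.default_rng(0)
    votes = [Counter() for _ in range(n_test_cells)]
    for _ in range(n_runs):
        # Sample one realization of h: drop p_missing atlas cells.
        # (Simplified to uniform sampling; the text stratifies by head region.)
        keep = rng.choice(len(atlas_labels), len(atlas_labels) - p_missing,
                          replace=False)
        subset = [atlas_labels[k] for k in keep]
        labels = run_crf_inference(subset)  # one predicted label per test cell
        for cell, label in enumerate(labels):
            votes[cell][label] += 1
    # Top 1-5 predictions = the five most frequent labels per cell.
    return [[lab for lab, _ in votes[cell].most_common(5)]
            for cell in range(n_test_cells)]
```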
Optimizing the aforementioned energy function with an approximate inference method produces a marginal distribution over label assignments for each cell. The top 1-5 predicted labels for each cell were generated by sorting these marginal probabilities in descending order; the label with the highest marginal probability was taken as the top-1 prediction.
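In code, this ranking reduces to a per-cell argsort of the marginals; a sketch with randomly generated stand-in marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in marginals: e.g., 150 detected cells over 195 candidate atlas labels.
marginals = rng.dirichlet(np.ones(195), size=150)

top5 = np.argsort(marginals, axis=1)[:, ::-1][:, :5]
# top5[c] lists label indices for cell c in descending marginal probability;
# top5[c, 0] is the top-1 prediction.
```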
Supplementary Material
Acknowledgements
We thank Erdem Varol and Amin Nejatbakhsh for providing code and troubleshooting support to train the statistical atlas, and for labeling images of dense neurons used in figure panels that show atlases of neuron positions. We thank Adam Atanas and Steve Flavell for contributing their datasets. Research funding was provided by the National Institute of General Medical Sciences of the National Institutes of Health (#R35GM124735, SK), the Weill Institute for Neurosciences, and the Weill Neurohub. Development of eats-worm is supported by a napari Plugin Foundation grant from the Chan Zuckerberg Initiative. EY was funded by the Esther A. & Joseph Klingenstein Fund, the Simons Foundation, and the Hypothesis Fund.
References
1. Schrödel T., Prevedel R., Aumayr K. et al. Brain-wide 3D imaging of neuronal activity in Caenorhabditis elegans with sculpted light. Nat Methods 10, 1013–1020 (2013). 10.1038/nmeth.2637
2. Ahrens M., Orger M., Robson D. et al. Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nat Methods 10, 413–420 (2013). 10.1038/nmeth.2434
3. Kato S., et al. Global Brain Dynamics Embed the Motor Command Sequence of Caenorhabditis elegans. Cell 163(3), 656–669 (2015). 10.1016/j.cell.2015.09.034
4. Venkatachalam V., et al. Pan-neuronal imaging in roaming Caenorhabditis elegans. PNAS 113(8), E1082–E1088 (2015). 10.1073/pnas.1507109113
5. Nguyen J.P., et al. Whole-brain calcium imaging with cellular resolution in freely behaving Caenorhabditis elegans. PNAS 113(8), E1074–E1081 (2015). 10.1073/pnas.1507110112
6. Yemini E., et al. NeuroPAL: A multicolor atlas for whole-brain neuronal identification in C. elegans. Cell 184(1), 272–288 (2021). 10.1016/j.cell.2020.12.012
7. Atanas A., et al. Brain-wide representations of behavior spanning multiple timescales and states in C. elegans. Cell 186(19), 4134–4151 (2023). 10.1016/j.cell.2023.07.035
8. Portugues R., Feierstein C., Engert F., Orger M.B. Whole-brain activity maps reveal stereotyped, distributed networks for visuomotor behavior. Neuron 81(6), 1328–1343 (2014). 10.1016/j.neuron.2014.01.019
9. Prevedel R., Yoon Y.G., Hoffmann M. et al. Simultaneous whole-animal 3D imaging of neuronal activity using light-field microscopy. Nat Methods 11, 727–730 (2014). 10.1038/nmeth.2964
10. Royer L., Lemon W., Chhetri R. et al. Adaptive light-sheet microscopy for long-term, high-resolution imaging in living organisms. Nat Biotechnol 34, 1267–1278 (2016). 10.1038/nbt.3708
11. Kim D., Kim J., Marques J. et al. Pan-neuronal calcium imaging with cellular resolution in freely swimming zebrafish. Nat Methods 14, 1107–1114 (2017). 10.1038/nmeth.4429
12. Vladimirov N., Wang C., Höckendorf B. et al. Brain-wide circuit interrogation at the cellular level guided by online analysis of neuronal function. Nat Methods 15, 1117–1125 (2018). 10.1038/s41592-018-0221-x
13. Chen X., et al. Brain-wide organization of neuronal activity and convergent sensorimotor transformations in larval zebrafish. Neuron 100(4), 876–890 (2018). 10.1016/j.neuron.2018.09.042
14. Lemon W., Pulver S., Höckendorf B. et al. Whole-central nervous system functional imaging in larval Drosophila. Nat Commun 6, 7924 (2015). 10.1038/ncomms8924
15. Mann K., Gallen C.L., Clandinin T.R. Whole-brain calcium imaging reveals an intrinsic functional network in Drosophila. Current Biology 27(15), 2389–2396 (2017). 10.1016/j.cub.2017.06.076
16. Aimon S., et al. Fast near-whole-brain imaging in adult Drosophila during responses to stimuli and behavior. PLoS Biology 17(2) (2019). 10.1371/journal.pbio.2006732
17. Stirman J., Smith I., Kudenov M. et al. Wide field-of-view, multi-region, two-photon imaging of neuronal activity in the mammalian brain. Nat Biotechnol 34, 857–862 (2016). 10.1038/nbt.3594
18. Skocek O., Nöbauer T., Weilguny L. et al. High-speed volumetric imaging of neuronal activity in freely moving rodents. Nat Methods 15, 429–432 (2018). 10.1038/s41592-018-0008-0
19. Klioutchnikov A., Wallace D.J., Frosz M.H. et al. Three-photon head-mounted microscope for imaging deep cortical layers in freely moving rats. Nat Methods 17, 509–513 (2020). 10.1038/s41592-020-0817-9
20. Zong W., et al. Large-scale two-photon calcium imaging in freely moving mice. Cell 185(7) (2022). 10.1016/j.cell.2022.02.017
21. Manley J., et al. Simultaneous, cortex-wide dynamics of up to 1 million neurons reveal unbounded scaling of dimensionality with neuron number. Neuron (2024). 10.1016/j.neuron.2024.02.011
22. Uzel K., Kato S., Zimmer M. A set of hub neurons and non-local connectivity features support global brain dynamics in C. elegans. Current Biology 32, 3443–3459 (2022). 10.1016/j.cub.2022.06.039
23. Flavell S., Gordus A. Dynamic functional connectivity in the static connectome of Caenorhabditis elegans. Current Opinion in Neurobiology 73 (2022). 10.1016/j.conb.2021.12.002
24. Randi F., Sharma A.K., Dvali S. et al. Neural signal propagation atlas of Caenorhabditis elegans. Nature 623, 406–414 (2023). 10.1038/s41586-023-06683-4
25. Tekieli T., et al. Visualizing the organization and differentiation of the male-specific nervous system of C. elegans. Development 148(18) (2021). 10.1242/dev.199687
26. Chaudhary S., Lee S.A., Li Y., Patel D.S., Lu H. Graphical-model framework for automated annotation of cell identities in dense cellular images. eLife 10 (2021). 10.7554/eLife.60321
27. Cecere Z.T., et al. State-dependent network interactions differentially gate sensory input at the motor and command neuron level in Caenorhabditis elegans. Preprint at bioRxiv 10.1101/2021.04.09.439242v3
28. Varol E., et al. Statistical Atlas of C. elegans Neurons. Medical Image Computing and Computer Assisted Intervention (MICCAI 2020) 12265 (2020).
29. Nejatbakhsh A., Varol E., Yemini E., Hobert O., Paninski L. Probabilistic joint segmentation and labeling of C. elegans neurons. Medical Image Computing and Computer Assisted Intervention (MICCAI 2020) 12265 (2020).
30. Skuhersky M., Wu T., Yemini E. et al. Toward a more accurate 3D atlas of C. elegans neurons. BMC Bioinformatics 23, 195 (2022). 10.1186/s12859-022-04738-3
31. Toyoshima Y., Wu S., Kanamori M. et al. Neuron ID dataset facilitates neuronal annotation for whole-brain activity imaging of C. elegans. BMC Biol 18, 30 (2020). 10.1186/s12915-020-0745-2
32. Bubnis G., Ban S., DiFranco M.D., Kato S. A probabilistic atlas for cell identification. Preprint at https://arxiv.org/abs/1903.09227 (2019).
33. Wu Y., Wu S., Wang X., Lang C., Zhang Q., Wen Q., Xu T. Rapid detection and recognition of whole brain activity in a freely behaving Caenorhabditis elegans. PLoS Computational Biology (2022).
34. Yu X., et al. Fast deep neural correspondence for tracking and identifying neurons in C. elegans using semi-synthetic training. eLife 10 (2021). 10.7554/eLife.66410
35. Cheng C., et al. A general primer for data harmonization. Scientific Data 11(1) (2024). 10.1038/s41597-024-02956-3
36. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29(6), 141–142 (2012).
37. Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). 10.1109/CVPR.2009.5206848
38. Lhoest Q., et al. Datasets: A Community Library for Natural Language Processing. Preprint at https://arxiv.org/abs/2109.02846 (2021).
39. Halchenko Y., et al. dandi/dandi-cli: 0.61.2. Zenodo (2024). 10.5281/zenodo.3692138
40. Chen T.W., Wardill T., Sun Y. et al. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature 499, 295–300 (2013). 10.1038/nature12354
41. Chronis N., Zimmer M., Bargmann C.I. Microfluidics for in vivo imaging of neuronal and behavioral activity in Caenorhabditis elegans. Nature Methods 4(9), 727–731 (2007). 10.1038/nmeth1075
42. Rübel O., et al. The Neurodata Without Borders ecosystem for neurophysiological data science. eLife 11 (2022). 10.7554/eLife.78362
43. Kuhn H.W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 83–97 (1955). 10.1002/nav.3800020109
44. White J.G., Southgate E., Thomson J.N., Brenner S. The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B: Biological Sciences 314(1165) (1986). 10.1098/rstb.1986.0056
45. Cook S.J., Jarrell T.A., Brittin C.A. et al. Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature 571, 63–71 (2019). 10.1038/s41586-019-1352-7
46. Witvliet D., Mulcahy B., Mitchell J.K. et al. Connectomes across development reveal principles of brain maturation. Nature 596, 257–261 (2021). 10.1038/s41586-021-03778-8
47. Mango S.E. The C. elegans pharynx: a model for organogenesis. WormBook (2007). 10.1895/wormbook.1.129.1, http://www.wormbook.org
48. Cook S.J., Kalinski C.A., Hobert O. Neuronal contact predicts connectivity in the C. elegans brain. Current Biology 33(11), 2315–2320 (2023). 10.1016/j.cub.2023.04.071
49. Riddle D.L., et al. C. elegans II. 2nd edition. Cold Spring Harbor Laboratory Press (1997).
50. Byerly L., Cassada R.C., Russell R.L. The life cycle of the nematode Caenorhabditis elegans: I. Wild-type growth and reproduction. Developmental Biology 51(1), 22–33 (1976). 10.1016/0012-1606(76)90119-6
51. Stojanovski K., Großhans H., Towbin B.D. Coupling of growth rate and developmental tempo reduces body size heterogeneity in C. elegans. Nat Commun 13, 3132 (2022). 10.1038/s41467-022-29720-8
52. Myronenko A., Song X. Point-Set Registration: Coherent Point Drift. Preprint at arXiv (2009). 10.48550/arXiv.0905.2635
53. Yemini E. NeuroPAL annotations manual. https://www.hobertlab.org/wp-content/uploads/2019/06/NeuroPAL-Reference-Manual-v1_small.pdf
54. Neuromatch (2024). Neuromatch.io
55. Yemini E., Venkatachalam V., Lin A., Varol E., Nejatbakhsh A., Sprague D., Samuel A., Paninski L., Hobert O. NeuroPAL: Atlas of C. elegans neuron locations and colors in NeuroPAL worm (Version 0.240614.1942) [Data set]. DANDI archive (2023). 10.48324/dandi.000715/0.240614.1942
56. Yemini E., Venkatachalam V., Lin A., Varol E., Nejatbakhsh A., Sprague D., Samuel A., Paninski L., Hobert O. NeuroPAL microfluidic chip images and GCaMP activity (Version 0.240625.0452) [Data set]. DANDI archive (2023). 10.48324/dandi.000541/0.240625.0452
57. Chaudhary S., Sprague D., Lee S.A., Li Y., Patel D.S., Lu H. Segmented and labeled NeuroPAL structural images (Version 0.240611.1954) [Data set]. DANDI archive (2023). 10.48324/dandi.000714/0.240611.1954
58. Suzuki R., Wen C., Sprague D., Onami S., Kimura K.D. Whole-brain spontaneous GCaMP activity with NeuroPAL cell ID information of semi-restricted worms (Version 0.240402.2118) [Data set]. DANDI archive (2023). 10.48324/dandi.000692/0.240402.2118
59. Atanas A., et al. Brain-wide representations of behavior spanning multiple timescales and states in C. elegans (Version 0.240625.0022) [Data set]. DANDI archive (2023). 10.48324/dandi.000776/0.240625.0022
60. Dunn R., Sprague D., Kato S. C. elegans whole-brain NeuroPAL and immobilized calcium imaging (Version 0.240625.0439) [Data set]. DANDI archive (2023). 10.48324/dandi.000565/0.240625.0439
61. Sprague D., Borchardt J., Dunn R., Bubnis G., Kato S. NeuroPAL volumetric images (Version 0.240625.0454) [Data set]. DANDI archive (2023). 10.48324/dandi.000472/0.240625.0454
62. Emmons S.W., Yemini E., Zimmer M. Methods for analyzing neuronal structure and activity in Caenorhabditis elegans. Genetics 218(4) (2021). 10.1093/genetics/iyab072
63. Wen C., et al. 3DeeCellTracker, a deep learning-based pipeline for segmenting and tracking cells in 3D time lapse images. eLife 10 (2021). 10.7554/eLife.59187
64. Gonzalez R.C., Fittes B.A. Gray-level transformations for interactive image enhancement. Second Conference on Remotely Manned Systems, 17–19 (1975).
65. Sulston J.E., Schierenberg E., White J.G., Thomson J.N. The embryonic cell lineage of the nematode Caenorhabditis elegans. Developmental Biology 100(1), 64–119 (1983). 10.1016/0012-1606(83)90201-4