Abstract
Recent studies in mice have shown that orofacial behaviors drive a large fraction of neural activity across the brain. To understand the nature and function of these signals, we need better computational models to characterize the behaviors and relate them to neural activity. Here we developed Facemap, a framework consisting of a keypoint tracker and a deep neural network encoder for predicting neural activity. Our algorithm for tracking mouse orofacial behaviors was more accurate than existing pose estimation tools, while the processing speed was several times faster, making it a powerful tool for real-time experimental interventions. The Facemap tracker was easy to adapt to data from new labs, requiring as few as 10 annotated frames for near-optimal performance. We used the keypoints as inputs to a deep neural network which predicts the activity of ~50,000 simultaneously-recorded neurons and, in visual cortex, we doubled the amount of explained variance compared to previous methods. Using this model, we found that the neuronal activity clusters that were well predicted from behavior were more spatially spread out across cortex. We also found that the deep behavioral features from the model had stereotypical, sequential dynamics that were not reversible in time. In summary, Facemap provides a stepping stone toward understanding the function of the brain-wide neural signals and their relation to behavior.
Subject terms: Neural encoding, Neural circuits, Machine learning
Facemap is a data analysis framework for tracking keypoints on mouse faces and relating them to large-scale neural activity. Both of these steps use state-of-the-art convolutional neural networks to achieve high precision and fast processing speeds.
Main
Neurons across the brain are constantly active, even in the absence of external sensory stimuli or a behavioral task1,2. This ongoing, spontaneous neural activity is driven by the spontaneous behaviors of the animal, such as running, head movements and whisking in mice3–9, tail movements in zebrafish10 and body movements in flies11–13. In mice, different neurons were best explained by different combinations of orofacial behaviors, such as whisking, sniffing and grooming, showing that multidimensional representations of behavior exist across the brain14–17. These multidimensional behavioral representations persist during presentations of sensory stimuli14 and decision-making tasks18–20.
Despite the widespread presence of behavioral signals across the brain, their role and function remains poorly understood. To make progress in understanding these neural signals, it is important to develop better computational models. This requires progress in the following two areas: (1) better quantification of orofacial behavior and (2) better models of the influence of behavior on neural activity.
To quantify behavior, previous studies took advantage of the stability of the head-fixed experimental setup to compute low-dimensional features of the raw behavior movies, either using principal components (PCs) of the movies14,17,20, or using autoencoders fit to the movies21,22. Although movie PCs are easy to compute, the resulting features are hard to interpret. Another common approach for quantifying orofacial movements is whisker tracking, which can provide specific and interpretable information about whisker motion23–26. However, previous approaches for whisker tracking required trimming the other whiskers and/or whisker painting, which may alter mouse behavior, and they also required a high-speed overhead camera, which may be unavailable in many experimental setups. An alternative approach is markerless pose estimation or keypoint tracking. Several algorithms exist for general keypoint tracking in animals27–31, but none of these tools have specialized methods for tracking orofacial movements.
Similarly, better models are needed to account for the influence of behavior on neural activity. Previous studies used simple approaches like reduced-rank regression (RRR) or ridge regression14,20. These models are linear and do not take into account temporal dynamics. Therefore, they are unlikely to capture the full influence of time-varying, multidimensional behavior on neural activity.
To address these shortcomings, we developed two new algorithms as follows: a keypoint tracker and a single neuron prediction model, both of which we make available in Facemap. Both algorithms are powered by deep neural networks. To track orofacial behaviors, we developed a pose estimation tool that tracks 13 distinct keypoints on the mouse face from variable camera views. Our pose estimation tool is more accurate than the best existing method (DeepLabCut), and it is also twice as fast, thus providing a viable option for online behavioral tracking. On new data, the Facemap tracker requires only ten new labeled frames for near-optimal performance. We also developed a multilayer neural network that is optimized to predict neural dynamics from orofacial behaviors. Compared to previous methods, this approach can predict almost twice as much neural variance for neurons in visual cortex. Furthermore, the model learns deep behavioral features that have highly-structured state dynamics, which we inferred using a hidden Markov model (HMM). Hence, Facemap can be used to obtain insights into both the structure and influence of orofacial behaviors on neural activity, thus providing a stepping stone toward understanding the function of the brain-wide behavioral signals.
Results
Fast and accurate tracker for mouse orofacial movements
We start by describing a neural network model for keypoint tracking on the mouse face, the Facemap tracker. As a first step, we chose several well-defined keypoints that could track various orofacial movements (Fig. 1a). To capture whisking, we tracked three whiskers that are visible from most camera views, labeling the points at the base of the whiskers. To capture sniffing, we tracked four nose-related keypoints (bottom, top, tip and right-bottom, when in view). To capture mouth movements, when the mouth was in view, we tracked two mouth keypoints (mouth and lower lip). To capture eye movements, such as blinking, we tracked the four corners of the eye (bottom, top, front and back). We did not track the pupil, because it is completely dilated and untrackable in darkness, and also because it is easier to track with simpler methods14.
Our goal was to build a model that would generalize well to new data. To achieve this goal, we collected a dataset of short mouse face videos from many different mice with the camera setup at several different angles. From this dataset of 16 mice and 53 video recordings at different views, we manually annotated 2,500 frames (Extended Data Fig. 1). We used 2,400 frames for training the network and set aside a test set consisting of 100 frames from multiple views of a new mouse.
Unlike more general approaches, like DeepLabCut and SLEAP27,31, we only require our tracker to perform well on specific keypoints from the mouse face. Thus, we hypothesized that a minimal ‘U-Net’-style neural network32 would be sufficient for the task while providing faster tracking compared to the existing, bigger models (Supplementary Table 1). Similar to DeepLabCut, which in turn is based on DeeperCut33, the Facemap tracker takes as input an image and outputs a set of downsampled probability heatmaps and location refinement maps to predict the x and y coordinates for each keypoint (Fig. 1b). The likelihood values of the model prediction were used to filter the traces and remove outliers (Methods27). The Facemap tracker was implemented from scratch using the neural network software PyTorch34, which is a popular and easy-to-use alternative to the TensorFlow framework35 used by DeepLabCut and SLEAP.
The keypoint error percentiles shown on an example test frame demonstrate the accuracy of the tracking (Fig. 1c and Supplementary Video 1). To get an upper bound on the tracking performance, we manually labeled test frames twice at different orientations and compared the two sets of labels. We found that the tracker achieved human-level performance (Fig. 1d). We compared our model with current state-of-the-art tools for keypoint tracking, DeepLabCut and SLEAP27,31,36,37. The Facemap tracker was more accurate than the other well-performing network, DeepLabCut with the ResNet50 backbone, both in average error (3.9 versus 4.4 pixels) and for individual keypoints of the face (Fig. 1e and Extended Data Fig. 2a). Facemap also outperformed DeepLabCut with the Mobilenet backbone, SLEAP default and SLEAP’s larger network (32 channels), which had average errors of 5.6, 5.0 and 5.7 pixels, respectively (Fig. 1e and Extended Data Fig. 2a).
To compare the speed of the networks for the purpose of online tracking, we computed the processing speed using a batch size of 1 (Fig. 1e). All the networks can achieve higher speeds with larger batch sizes, but only a batch size of 1 can be used for online processing of keypoints for closed-loop experiments. The smaller size of the Facemap tracker network provided a much faster processing speed of 327 Hz on a V100 GPU compared to DeepLabCut’s ResNet50 network (150 Hz), DeepLabCut’s Mobilenet network (211 Hz), SLEAP’s default network (80 Hz) and SLEAP’s larger network (c = 32; 72 Hz). Across different GPU types, Facemap consistently demonstrated the fastest processing speed (Supplementary Table 2). We also benchmarked the processing speed of the Facemap tracker at larger batch sizes and found that it was as fast or faster than all other networks across GPUs except for the Tesla T4 GPU, where DeepLabCut Mobilenet was the fastest (Extended Data Fig. 2b). Therefore, Facemap is the fastest orofacial tracker with state-of-the-art performance, which enables its use in closed-loop experiments with high frame rates.
The keypoints tracked by Facemap captured recognizable orofacial behaviors, such as blinking (Fig. 1f), whisking (Fig. 1g) and sniffing (Fig. 1h), in addition to other orofacial behaviors. In the neural recordings, the camera view in Fig. 1c was used, so mouth keypoints were not included in the analyses as they were not visible. Therefore, for the rest of this study, we use the eye, whisker and nose keypoints to characterize the aspects of behavior and neural activity. To start, we investigated the timescales of the orofacial keypoints. To do this, we built an autoregressive model to predict the position of each keypoint in the future (prediction shown in Fig. 1i). The variance explained by the model on test data decayed as a function of time into the future (Fig. 1j). The predictability of the nose keypoints decayed fastest (~1 s), followed by the whiskers (~3 s) and eye keypoints (~10 s; Fig. 1k). This was surprising because whisking was the fastest behavior observed in the videos (~10 Hz). However, these fast movements were pseudo-random (Fig. 1g) and hard to predict, so they did not contribute strongly to the predictability of the whisker keypoints.
Fine-tuning the Facemap tracker on new videos
We built the Facemap tracker to perform well on a variety of camera angles and across different mice. While Facemap generalized well on data from similar mice and camera configurations, the tracker had variable performance on videos from other labs (Fig. 2a). We investigated whether a fine-tuning strategy might improve the performance of the tracker further on new data. We annotated a small set of video frames contributed by other labs to fine-tune the neural network individually for each lab. The fine-tuned network showed a dramatic drop in error after training with just one frame. Training with around ten frames led to near-optimal performance (Fig. 2b and Supplementary Video 2). This fine-tuning procedure also worked for face videos from freely-moving mice from another lab: with around ten frames, successful tracking was achieved (Extended Data Fig. 3 and Supplementary Video 3)38.
Given the quick improvement in the network performance after fine-tuning, we reasoned this step is necessary for adapting Facemap’s tracker to new data. Therefore, we implemented a ‘human-in-the-loop’ workflow to allow users to easily fine-tune Facemap for their own datasets in our graphical user interface (GUI) for the keypoints (Fig. 2c). In the first step, the existing Facemap network is used to generate keypoint predictions. Next, the user refines the predicted keypoints to generate new training labels for the network. Then, the network is re-trained with the new labels to create a fine-tuned network. The fine-tuned network is then applied to new frames and the user can decide whether or not to repeat the retraining process depending on the performance of the fine-tuned network. Once a well-performing network is obtained, the fine-tuned network is saved in the GUI for future use. This fine-tuning step was also used for our experiments in the next section, where we combined keypoint tracking and large-scale neural recordings.
A deep network model of neural activity
To determine how neural activity depends on orofacial behaviors, we designed a neural network encoding model that can extract deep features from the keypoint data or directly from the PCs of the videos. Similar to deep encoding models in sensory neuroscience, the model has a first linear step for applying spatiotemporal filters to the time-varying keypoints (Fig. 3a,b), followed by several fully-connected layers that process these features further into more abstract representations that can better predict the neural activity. This deep network model was trained end-to-end to predict the activity of the top 128 PCs of neural data from either visual or sensorimotor cortices at the temporal resolution of the imaging data (300 ms bins; Fig. 3c). The neural activity was split into training and test data in blocks of around 10 min and 3.5 min, respectively, and the variance explained was computed on the test data periods. We normalized the variance explained by an approximate upper bound, estimated using peer prediction, similar to ref. 14. Among multiple variations of the neural network architecture, the model we chose (Fig. 3a) had the best performance while using the fewest number of layers (Extended Data Fig. 4).
We found that neurons across visual and sensorimotor cortices were well explained by behavior, with an average normalized variance explained of 43.2% and 48.8%, respectively, from the keypoint-based deep prediction model (Fig. 3d,e). This was computed from the raw variance explained of 4.1% and 5.3% normalized by the explainable variances of 9.4% and 11.1% for the visual and sensorimotor areas, respectively, in bins of ~300 ms.
We compared the deep prediction model to a linear prediction method and found improvements of 136% and 71.5% in explained variance in visual and sensorimotor areas, respectively (Fig. 3f). Next, we tested the deep prediction model using as input the PCs of the videos, rather than the keypoints. The deep prediction model again outperformed the linear method, with improvements of 102% and 46.2% in visual and sensorimotor areas, respectively. The deep model based on movie PCs outperformed the deep model based on keypoints in visual areas (50.4% versus 43.2%) and in sensorimotor areas (54.8% versus 48.8%) (Fig. 3f). This may be due to the much larger number of inputs (500 movie PCs versus 22 keypoint coordinates). Users thus have two options as follows: either use the more interpretable prediction model based on a small number of keypoints or the slightly better performing but less interpretable model based on movie PCs. We also asked how the neural prediction varies with the number of neurons and timepoints available for the model fitting (Fig. 3g,h). We found that explained variance saturated at around 10,000 neurons, but did not saturate with the number of timepoints even for our longest recordings of ~2 h. Thus, large-scale and long recordings are necessary to fit good encoding models.
Next, we investigated how the model explained variance changed as a function of neural PC number (Fig. 3i). The largest neural PC generally accounts for the overall arousal state of the mouse, while the higher neural PCs may account for finer and more specific behaviors such as whisking, sniffing and eye movements14. In sensorimotor areas, we found that the improvement in variance explained by the neural network model was exclusive to the higher PCs, while the first PC was explained nearly as well (15.1% versus 13.5% for nonlinear versus linear keypoint-based models). In contrast, the first PC of the visual areas benefited substantially from nonlinear prediction (8.1% versus 4.1%). Furthermore, the top PC corresponded to a smaller fraction of the total explained variance in visual compared to sensorimotor areas (ratio = 0.22 and 0.37 of explained variance at 1 versus 64 neural PCs in Fig. 3i). Overall, these differences suggest that the behavior-related neural activity in visual areas is higher-dimensional and more nonlinear as a function of behavior compared to sensorimotor areas. The differences cannot be explained by visual inputs, because the recordings were performed in complete darkness with measured lux values of 0.00 in the visible spectrum (for comparison, we obtained 7.8 lux with monitors on and 84.4 lux with the microscope doors open; see also Extended Data Fig. 5).
The nonlinear, deep network model predicted the fine structure of neural activity, capturing small events across small groups of neurons better than the linear model (Fig. 4a and Extended Data Fig. 6). We investigated the spatial distribution of these subgroups of neurons by clustering the neurons with k-means into 100 clusters, a number that was sufficient to achieve a high correlation of each neuron with its cluster center (Extended Data Fig. 7a and Fig. 4b,c). Some clusters were spatially spread throughout the recording area, while others were more localized (Fig. 4c and Extended Data Fig. 7b,c). We defined a spatial locality index for each cluster (Methods). In general, the clusters best predicted by the behavior had the lowest locality index (Fig. 4d,e). Thus, behaviorally-correlated clusters are more spatially distributed across cortex, consistent with the hypothesis that many of these behavioral signals are broadcast across the brain.
State dynamics extracted from deep behavioral features
The last hidden layer in the deep network model, the ‘deep behavioral features’, contains a representation of behavior that is not directly available in the raw keypoints. To understand the nature of these representations, we characterized their dynamical properties using HMMs. Various types of HMMs have been previously fit to raw behavioral data, often from freely-moving animals21,39–42. Here we chose to use discrete HMMs, which can model the data as a succession of discrete states43,44. Transitions between states are probabilistic with probabilities defined by the transition matrix. In addition, each state is assigned a fixed ‘emission’ pattern of activations across all features. The transition matrix and emission patterns are parameters that are fit to each session individually.
We start by visualizing the HMMs that were fit to an individual session using 50 states, which were sufficient to reach a high log-likelihood on test trials (Extended Data Fig. 8a). The 256 deep behavioral features from one session were first sorted using one-dimensional t-SNE, such that features with similar activation patterns are near each other in the plot (Fig. 5a)45. The most probable states can then be inferred (Fig. 5b), and their emission patterns can be used to reconstruct the original data matrix (Fig. 5c). The reconstruction assures us that the HMM captures a majority of the data variance. We also visualized instances of the same state and observed that they were consistent and in some cases human-interpretable (Supplementary Video 4). The HMM states were also separately sorted using Rastermap46, so that forward transitions—from a lower to a higher state in the sorting—are maximized (Fig. 5d). Due to this sorting, state dynamics appear to be arranged in ordered, increasing sequences (Fig. 5b). This asymmetry in state transitions was not apparent at the level of the keypoints themselves (Fig. 5e and Extended Data Fig. 8b–d); despite being sorted with the same Rastermap algorithm, states inferred directly from keypoints had relatively symmetric transition probabilities. To further validate the quality of the HMM, we used it to generate new synthetic data (Fig. 5f,g). Samples from the model had the same overall appearance as the original data. Thus, the transition probabilities captured in the HMM can generate the same kinds of behavioral sequences as are present in the data itself.
Next, we quantified some of the HMM properties directly. The duration of a state in the model is given by the self-transition probability (which was left out of the visualizations in Fig. 5d,e). Self-transition probabilities p near 1 imply a long-lasting state, with an exponential distribution of state durations. The mean of this distribution is defined as the ‘state lifetime’, and can be easily computed as −log(1 − p). The distribution of state lifetimes was broad (Fig. 5h), with lifetimes ranging from 0.2 to 10 s. The model fit to the deep behavioral features had longer lifetimes than the model fit directly to the keypoints, and both had much longer lifetimes than a control model fit to temporally-shuffled data. For the rest of our analyses, we will ignore self-transition probabilities and will focus on the transition probabilities between states. Operationally, we set self-transitions to 0 in the transition matrix and normalize the outgoing transitions to a sum of 1 (as in Fig. 5d,e).
Another property of the HMM is the sparsity of transitions between states. It is apparent in Fig. 5d that the transition matrix is quite sparse, with most values near-zero and a few large values. In other words, HMM states tend to transition to only a few other potential states. To quantify this property, we computed for each state the probability to transition to its n-nearest states, where near states are defined as the ones with the highest probability of transition. As n increases, the summed transition probability approaches its maximum of 1 quickly for small n and much more slowly after. This shows that the HMM has a sparse structure, dominated by a few large transitions. In contrast, models inferred from the keypoints had more dense transitions (approached 1 more slowly). Both types of models had sparser transitions than the control model that was fit to appropriately-shuffled data (Methods).
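As a rough illustration of this analysis, the summed transition probability to the n-nearest states can be computed from a fitted transition matrix as sketched below (a minimal sketch with illustrative names; the exact analysis code may differ):

```python
import numpy as np

def nearest_transition_curve(A):
    """Average cumulative probability of transitioning to the n most likely next states.

    A : (n_states, n_states) HMM transition matrix with rows summing to 1.
    Self-transitions are set to 0 and the outgoing probabilities renormalized,
    as described in the text.
    """
    A = A.copy()
    np.fill_diagonal(A, 0.0)                   # ignore self-transitions
    A /= A.sum(axis=1, keepdims=True)          # renormalize outgoing transitions to 1
    A_sorted = np.sort(A, axis=1)[:, ::-1]     # largest transitions first, per state
    return np.cumsum(A_sorted, axis=1).mean(axis=0)
```

A sparse transition matrix produces a curve that approaches 1 within a few states, whereas a dense (near-uniform) matrix approaches 1 only after many states.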
To quantify the asymmetry of the HMM transitions, we performed a series of analyses directly on the transition matrix (Fig. 5j). For each pair of states with high transition probability, we asked how likely other transitions are. We analyzed reverse transitions (Fig. 5k), two-step forward transitions (Fig. 5l) and two-step backward transitions (Fig. 5m). We found that these types of transitions were generally more likely than chance. However, reverse transitions were less likely in the deep feature HMM compared to the keypoint HMM, corresponding to the more asymmetrical nature of the former model (Fig. 5d versus Fig. 5e). While two-step forward transitions were matched between the two models, the two-step backward transitions were at baseline levels for the deep feature HMM, but not for the keypoint HMM. The net effect of the asymmetry in state transitions was that the deep feature HMM produced longer, uninterrupted forward sequences of states. We quantified this property from the inferred states, measuring the length of all increasing state sequences (Fig. 5n). The distribution of forward sequence lengths was much more long-tailed for the deep feature HMM, compared to controls and to the keypoint HMM (Fig. 5n). Combined with the already longer state durations (Fig. 5h), this shows that the deep behavioral features have longer, uninterrupted runs of stereotypical dynamics. This may imply that the HMM states inferred from deep behavioral features correspond to more abstract aspects of behavior, which may be ignoring some specific low-level properties of the keypoints such as the phases of the whisking, sniffing or running cycles. Furthermore, the asymmetrical transitions may correspond to a much longer cycle of behavior dictated by transitions between passive to active states and back. More work will be needed to fully make this link, perhaps using more sophisticated HMMs such as switching linear dynamical systems21.
Relation between deep behavioral dynamics and neural dynamics
To directly compare the behavioral HMM to the neural data, we visualized the activity of the neural populations tuned to different HMM states. We define a ‘trial’ as uninterrupted timepoints of the same state, and the response of a neuron on that trial as its average activity over those timepoints. Across states, we observed a range of approximately 50–300 trials per state (Extended Data Fig. 9j). We then used training trials to select the neurons with the highest activity on each state. For many of the states, we obtained neural populations highly selective to that state (see Fig. 5o for a subset of states and Extended Data Fig. 9i for all 50 states). We observed populations with either brief or long-lasting activity, which mirrored the diversity of behavioral state durations. We note that the existence of these neural populations does not follow directly from the fitting procedure of the deep features; while the deep features were indeed trained to predict neural activity, we allowed arbitrary weight combinations of these features to predict single neurons as opposed to relating single neurons to discretized behavioral states as we do in this section. Other aspects of these neural populations could be investigated further, for example, by engaging these neural populations in a behavioral task and comparing their activity with the deep behavioral features they represent. However, that is beyond the scope of the present study.
We have so far used HMMs to study changes in dynamical properties which are a consequence of the deterministic transformation from keypoints to deep behavioral features. We can also compare the dynamical properties of deep behavioral features to those of the neural data itself. To do this, we first clustered the recordings of ~50,000 neurons into 256 clusters (chosen to match the number of deep behavioral features) using k-means and fit the HMMs to the mean cluster activities. The neural HMM had relatively shorter state durations and more dense state transitions (Extended Data Fig. 9a–d), but similarly asymmetric transition probabilities to the deep behavioral HMM (Extended Data Fig. 9e–h). Visualizing the neural data and the inferred neural HMM states (Extended Data Fig. 8e–g), we can see that neural activity contains some shorter states with faster transitions compared to the deep behavioral features (Fig. 5a,b). We conclude that the deep behavioral features especially capture longer-duration states in the neural data and may be missing information about the shorter-duration states.
Discussion
Here we described Facemap, a framework that relates orofacial tracking to neural activity using new modeling tools. The framework is composed of two parts as follows: (1) an orofacial keypoint tracker for extracting eye, whisker, nose and mouth movements, and (2) a neural network encoding model that extracts the spatiotemporal features of behavior that are most related to the neural activity. We have shown that the orofacial tracker is highly accurate while being substantially faster than other keypoint tracking approaches, and we showed that it can be easily trained on new orofacial videos from experimental setups other than our own. These keypoints capture the important aspects of the behavior with many fewer variables (22) than the number of pixels in a frame (~100,000). Despite this dramatic dimensionality reduction, the keypoints contain enough information about the behavior to predict neural activity accurately.
We used the new Facemap framework to make a few initial observations. We found that the eye keypoints had predictable dynamics on much longer timescales (10 s) compared to the dynamics of the nose keypoints (1 s), while the whisker dynamics were somewhere in-between. Across both visual and sensorimotor areas, clusters that were spread out over the brain were the ones best predicted from behavior. We also found that visual cortex has higher-dimensional and more nonlinear representations of behavior compared to sensorimotor cortex, a surprising result that merits further investigation.
We also found that the deep behavioral features extracted by the network model contained a much more orderly representation of behavior compared to the raw keypoints. Using an HMM, we found that the deep behavioral features were organized into relatively longer-lasting states, from less than a second to several seconds, which transitioned into other states in a predictable manner, forming sequences of states that repeated many times over the course of a session. These asymmetric state sequences were not found in the raw keypoints and had substantially longer durations in the deep behavioral features compared to the neural activity. The differences arise because the deep features represent specifically those aspects of the behavior that best predict the neural data. There are important aspects of behavior that may not be relevant for this prediction, such as the phases of the whisking, sniffing or running cycles, and there are aspects of the neural activity that may not be predictable at all. When those are factored out, an orderly representation emerges in the deep behavioral features.
These initial analyses are just the start of using Facemap to extract insights about neural activity patterns and the structure of behavior itself. We developed the method alongside a user-friendly GUI so that others can easily adapt it to their own data, and use it flexibly in their own studies. To track fast orofacial movements such as whisker movements, we note that reasonable resolution of the face will likely be required (at least 200 pixels) and a frame rate of at least 50 Hz. Many labs already have video cameras capturing the face of the mouse with sufficient resolution and frame rate and could, therefore, perform orofacial tracking during such experiments47,48. Furthermore, with head-mounted cameras38, orofacial tracking can be incorporated into freely-moving behavioral contexts, to enable observation of the fine movements that rodents make as they explore their environment or engage in social interactions39,49–53. We believe Facemap is one of the important steps toward unlocking the fundamental mystery of brain-wide neural activity—what is its function and where is it coming from—and we look forward to seeing it used to make progress on these questions.
Methods
All experimental procedures were conducted according to IACUC and received ethical approval from the IACUC board at HHMI Janelia Research Campus. The Facemap code library is implemented in Python 3 (ref. 54), using pytorch, numpy, scipy, numba, tqdm, opencv and pandas34,55–59. The GUI additionally uses PyQt and pyqtgraph60,61. The figures were made using matplotlib and jupyter-notebook62,63.
Data acquisition
Animals
We performed 16 recordings in 12 mice bred to express GCaMP6s in excitatory neurons: TetO-GCaMP6s x Emx1-IRES-Cre mice (available as RRID:IMSR_JAX:024742 and RRID:IMSR_JAX:005628). These mice were male and female and ranged from 2 to 12 months of age. Mice were housed in reverse light cycle and were pair-housed with their siblings before and after surgery. Due to the stability of the cranial window surgery, we often use the same mice for multiple experiments in the lab; five of the seven visual area mice were used in a previous study64, and the other two visual mice and all of the sensorimotor mice were trained on behavioral tasks after the recordings.
Surgical procedures
Surgeries were performed in adult mice (P35–P125) following procedures outlined in ref. 64. In brief, mice were anesthetized with isoflurane while a craniotomy was performed. Marcaine (no more than 8 mg kg−1) was injected subcutaneously beneath the incision area, and warmed fluids with 5% dextrose and buprenorphine (0.1 mg kg−1, systemic analgesic) were administered subcutaneously, along with dexamethasone (2 mg kg−1) administered intramuscularly. For the visual cortical windows, measurements were taken to determine the bregma–lambda distance and the location of a 4 mm circular window over V1 cortex, as far lateral and caudal as possible without compromising the stability of the implant. A 4 + 5 mm double window was placed into the craniotomy so that the 4 mm window replaced the previously removed bone piece and the 5 mm window lay over the edge of the bone. The sensorimotor window was also a double window, placed as medially and frontally as possible. The outer window was 7 mm × 4.5 mm and the inner window was around 1 mm smaller in all dimensions. After surgery, ketoprofen (5 mg kg−1) was administered subcutaneously and the animal was allowed to recover on heat. The mice were monitored for pain or distress, and ketoprofen (5 mg kg−1) was administered for 2 days following surgery.
Videography
The camera setup was similar to the setup in ref. 14. A Thorlabs M850L3 (850 nm) infrared LED was pointed at the face of the mouse to enable infrared video acquisition in darkness. The videos were acquired at 50 Hz using FLIR cameras with a zoom lens and an infrared filter (850 nm and 50 nm cutoff). The camera acquisition software was BIAS (https://github.com/janelia-idf/bias). The wavelength of 850 nm was chosen to avoid the 970 nm wavelength of the two-photon laser while remaining outside the visual detection range of the mice65,66.
The entire setup was enclosed in a large black box to prevent light from the room from entering the microscope light path and from entering the mouse’s eye. We turned off the infrared LEDs and then estimated the amount of visible non-infrared light entering the mouse’s eye during recording by using an FLIR Extech LT300 Light Meter. We positioned the Light Meter where the mouse’s head is during recording. We found that when the enclosure was closed, as in our experimental conditions, the illuminance measurement was 0.00 lux. When we kept the enclosure closed but turned on the monitors to show visual stimuli (as in ref. 64), the illuminance measurement was 7.80 lux. We captured the face of the mouse with our camera in these two settings, with the infrared filter removed from the camera (Extended Data Fig. 5). For comparison, the illuminance of the enclosure area when it was open, coming from overhead lighting in the room, was much greater at 84.4 lux.
Imaging acquisition
We used a custom-built two-photon mesoscope67 to record neural activity, and ScanImage68 for data acquisition. We used a custom online Z-correction module (now in ScanImage) to correct for z and x-y drift online during the recording. As described in ref. 64, we used an upgrade of the mesoscope that allowed us to approximately double the number of recorded neurons using temporal multiplexing69.
The mice were free to run on an air-floating ball. Mice were acclimatized to running on the ball for several sessions before imaging. On the first day of recording, the field of view was selected such that large numbers of neurons could be observed, with clear calcium transients.
Processing of calcium imaging data
Calcium imaging data were processed using the Suite2p toolbox70 (available at www.github.com/MouseLand/suite2p), which relies on the packages numpy, scipy, numba, scanimage-tiff-reader, paramiko and scikit-learn57,71–74. Suite2p performs motion correction, ROI detection, cell classification, neuropil correction and spike deconvolution as described elsewhere14. For non-negative deconvolution, we used a timescale of decay of 1.25 s (refs. 75,76). We obtained 50,614 ± 13,919 (s.d., n = 10 recordings) neurons in the visual area recordings, and 33,686 ± 4,465 neurons (n = 6 recordings) in the sensorimotor area recordings.
Facemap tracker network
Model architecture
The Facemap tracker network is a U-Net-style convolutional neural network consisting of downsampling and upsampling blocks with skip connections implemented in pytorch34. The model’s input is a grayscale 256 × 256 pixels image, which is passed through a set of convolutional filters of different sizes, as shown in Fig. 1b. The network has two sets of outputs as follows: (1) heatmaps represent the probability of a keypoint in the pixel region and (2) location refinement maps represent the x and y offsets between the keypoint position in full-sized image and the downsampled map, similar to refs. 27,33. The downsampled (64 × 64 pixels) heatmaps and location refinement maps are used to obtain the x and y coordinates of keypoints, and example traces are shown in Fig. 1f.
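A minimal sketch of how keypoint coordinates and likelihoods could be decoded from these two outputs, assuming the 4× downsampling implied by the 256 → 64 pixel map sizes (the decoding in the released Facemap code may differ in details):

```python
import torch

def decode_keypoints(heatmaps, locrefs, downsample=4):
    """Recover keypoint coordinates from the network outputs.

    heatmaps : (K, 64, 64) probability of each keypoint at each downsampled pixel.
    locrefs  : (K, 2, 64, 64) x and y offsets from the downsampled grid to the
               keypoint position in the full-sized (256 x 256) image.
    Returns (K, 2) keypoint coordinates and (K,) likelihoods (heatmap peaks).
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    likelihood, idx = flat.max(dim=1)                      # peak of each heatmap
    yy = torch.div(idx, W, rounding_mode="floor")          # row of the peak
    xx = idx % W                                           # column of the peak
    offsets = locrefs[torch.arange(K), :, yy, xx]          # (K, 2) refinement at the peak
    x = xx.float() * downsample + offsets[:, 0]
    y = yy.float() * downsample + offsets[:, 1]
    return torch.stack([x, y], dim=1), likelihood
```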
The tracker predicted 15 distinct keypoints in total for tracking mouse orofacial movements from different views (Fig. 1a and Extended Data Fig. 1). The keypoints were used to track various movements of the eye (4), nose (5), whiskers (3), mouth (2) and an additional keypoint for the paw. The forepaw occasionally entered the view, such as during grooming, but we found this keypoint difficult to track and use in further analyses, so we did not consider it further. We also labeled a fifth nose keypoint (nose bridge, not shown in Fig. 1a and Extended Data Fig. 1), but found that it was difficult to identify across different camera angles and, therefore, excluded it from the analyses in the article. The videos taken during neural recordings were from the view in Fig. 1c. In this view, the mouth keypoints were not visible, so those keypoints were not used in the model for neural prediction. Thus, we used four eye keypoints, four nose keypoints and three whisker keypoints for neural prediction.
Training
The Facemap tracker was trained on 2,400 images recorded from multiple mice and different camera views (Extended Data Fig. 1). Training images of size 256 × 256 pixels were labeled with all the keypoints, except when a bodypart was not visible in the frame, in which case no label was added. The model was trained for 36 epochs with the Adam optimizer using a batch size of 8 and a weight decay of zero77. We used a custom learning rate (LR) scheduler with a fixed LR of 0.0004 for the first 30 epochs, followed by a reduced LR for the next three epochs and a further reduced LR for the final three epochs. Each image was normalized such that 0.0 represented the first percentile of pixel intensities and 1.0 represented the 99th percentile. Image augmentations performed during training were random crop, resize after padding to maintain aspect ratio, horizontal flip and contrast augmentation.
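For instance, the percentile normalization described above could be implemented as follows (an illustrative sketch, not the exact Facemap code):

```python
import numpy as np

def normalize_percentile(img):
    """Map the 1st percentile of pixel intensities to 0.0 and the 99th to 1.0."""
    p1, p99 = np.percentile(img, [1, 99])
    return (img.astype(np.float32) - p1) / (p99 - p1 + 1e-10)
```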
Performance evaluation
The accuracy of the tracker was evaluated using the average pixel error for 100 test frames of size 256 × 256 pixels from a new mouse and different camera views. First, the Euclidean distance in pixels between the ground-truth labels and the predicted keypoints was computed. Next, the average error was computed as the average of the Euclidean distances across all frames (Extended Data Fig. 2a) and all keypoints (Fig. 1e).
The processing speed of the tracker was calculated to evaluate its utility for offline and online analyses. Therefore, the processing speed calculation accounted for the timing of various steps as follows: (1) image preprocessing, (2) forward pass through the model and (3) postprocessing steps. All processing speeds are reported for a sample image of size 256 × 256 pixels passed through the network for 1,024 repetitions and a total of ten runs using various batch sizes on different GPUs (Supplementary Table 2).
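The snippet below sketches the core of such a benchmark for a batch size of 1; it times only the forward pass, whereas the reported speeds also include the pre- and postprocessing steps (names and the warm-up strategy are illustrative):

```python
import time
import torch

@torch.no_grad()
def benchmark_fps(model, image_size=256, batch_size=1, n_reps=1024, device="cuda"):
    """Approximate frames per second for repeated forward passes."""
    model = model.to(device).eval()
    x = torch.zeros(batch_size, 1, image_size, image_size, device=device)
    model(x)                                  # warm-up pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_reps):
        model(x)
    torch.cuda.synchronize()
    return n_reps * batch_size / (time.time() - start)
```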
Filtering keypoint traces for neural prediction
Occasionally, keypoints are occluded, such as during grooming. Therefore, like DeepLabCut, we found the timepoints when the tracker network confidence was low, and replaced those timepoints in the keypoint traces with a median-filtered value. The network confidence, or likelihood, is defined as the value of the peak of the heatmap output. The likelihood traces for each keypoint were baselined in time with a Gaussian filter with an s.d. of 4 s; the threshold was then defined as −8 times the s.d. of the baselined likelihood, and any values below this threshold were considered outliers. This identified on average 0.19% of timepoints across all keypoint traces as outliers.
After excluding outliers based on likelihood, we also directly identified outlier timepoints using the keypoint traces, by detecting large movements or deviations from baseline. If a keypoint moved more than 25 pixels from the previous timepoint to the current timepoint, then the current timepoint was considered an outlier. Also if the keypoint trace on the current timepoint exceeded its median-filtered (window = 1 s) value by more than 25 pixels, then the current timepoint was considered an outlier. This identified on average an additional 0.066% timepoints across all keypoint traces as outliers.
To obtain values for the outlier timepoints, we median-filtered the keypoint traces with a window of 300 ms, excluding the outlier timepoints. Linear interpolation from the median-filtered traces was then used to fill in the values at the outlier timepoints.
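The three filtering steps above could be combined roughly as in the sketch below (a simplified approximation of the described procedure; thresholds and window sizes follow the text, but the released implementation may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

def filter_keypoint_trace(x, likelihood, fps=50.0, jump_thresh=25.0):
    """Replace low-confidence or jumpy timepoints in a single keypoint trace.

    x          : (T,) keypoint coordinate trace (pixels).
    likelihood : (T,) network confidence (peak of the heatmap).
    """
    # 1. likelihood-based outliers: baseline the likelihood with a 4-s Gaussian
    #    filter and flag values more than 8 s.d. below that baseline
    baseline = gaussian_filter1d(likelihood, sigma=4 * fps)
    resid = likelihood - baseline
    outliers = resid < -8 * resid.std()

    # 2. trace-based outliers: jumps of >25 pixels between consecutive frames,
    #    or deviations of >25 pixels from a 1-s median-filtered trace
    jumps = np.abs(np.diff(x, prepend=x[0])) > jump_thresh
    med_1s = median_filter(x, size=int(fps))
    outliers |= jumps | (np.abs(x - med_1s) > jump_thresh)

    # 3. fill outliers by interpolating a 300-ms median-filtered trace computed
    #    from the non-outlier timepoints only
    good = ~outliers
    med_300ms = median_filter(x[good], size=max(1, int(0.3 * fps)))
    x_filtered = x.copy()
    x_filtered[outliers] = np.interp(np.nonzero(outliers)[0],
                                     np.nonzero(good)[0], med_300ms)
    return x_filtered
```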
Pose estimation model comparisons
We compared the performance of the Facemap tracker to other state-of-the-art tools used for pose estimation, including SLEAP31 and DeepLabCut27,36. The models were trained on the same training set used for Facemap. In addition, the same protocol for speed benchmarking was used to obtain the processing speed of the other models.
DeepLabCut models training
DeepLabCut’s models used for comparison included two different architectures as follows: ResNet50 (default model) and Mobilenet_v2_0.35 (fastest model). Augmentations used during training were scaling, rotation and contrast augmentation, similar to training of the Facemap tracker. A hyperparameter search was performed to find optimal training parameters for each model using different batch sizes (1, 2, 4 and 8) and LRs (0.0001, 0.001 and 0.01). Models with the lowest average test error for each architecture were compared to Facemap in Fig. 1e and Extended Data Fig. 2.
Processing speeds for DeepLabCut’s models were obtained using a similar approach as Facemap tracker. We timed DeepLabCut’s getposeNP function for 1,024 repetitions for a total of ten runs for different batch sizes and GPUs. The getposeNP function timing included a forward pass through the network and postprocessing steps to obtain keypoints locations from the heatmaps and location refinement maps.
SLEAP models training
The default U-Net backbone was used for SLEAP’s models, which included the following two different values of initial number of filters: (1) c = 16 (default) and (2) c = 32 to vary the network size and potentially improve accuracy. A hyperparameter search over different LRs (0.0001, 0.001 and 0.01), batch sizes (1, 2, 4 and 8) and number of epochs (100 and 150) was performed to find the best model for each U-Net configuration. Furthermore, early stopping by stopping training on plateau was used for half of the models to prevent overfitting. Default augmentation settings were used for most models and mirroring (horizontal flip) was added to some models to match the training of the other networks used for comparison. Similar to DeepLabCut, the best models were selected based on the lowest average test error for the default and c = 32 models and used in Fig. 1e and Extended Data Fig. 2.
The processing speed for SLEAP’s models was calculated by timing their predict_on_batch function. The U-Net models with different numbers of initial filters were run for 1,024 repetitions for a total of ten runs using different batch sizes of our sample image input.
Facemap tracker refinement
We developed a method for fine-tuning the Facemap tracker for new data that differed from our training data. The Facemap tracker’s base model is defined as the network trained on our dataset (Fig. 1). We extracted frames from videos contributed by five other labs to use as training data for fine-tuning the base model specifically to each lab’s video. We executed the following steps for each lab’s video. First, the base model was used to generate predictions for 50 random frames. Keypoints on the 50 training frames were refined to correct keypoints with large deviations from their defined bodyparts or to remove keypoints that were not in view. The percentage of keypoints refined across the 50 frames was 99.51% for lab 1, 100% for lab 2, 99.79% for lab 3, 100% for lab 4 and 98.32% for lab 5. Therefore, most of the keypoints across all frames were refined for fine-tuning the model and benchmarking the fine-tuned model.
Next, the base model was fine-tuned with varying numbers of training frames ranging from 1 to 50. The network was trained for 36 epochs with an initial LR of 0.0001 with annealing as described earlier and a weight decay of 0.001. Additionally, we trained a model from scratch, that is a network initialized with random weights, using ten training frames for comparison. To compute the errors for the base model, the fine-tuned model and the scratch model, we used 50 test frames and labeled them from scratch to use as a test set. We then computed the average error in pixels from the test set labels to the model predictions (Fig. 2b). The models trained from scratch with ten frames had an average error of 3.76 ± 0.39 pixels across labs, compared to 2.43 ± 0.24 pixels for the base model fine-tuned with ten frames. Predictions from the base, scratch and fine-tuned models for a random section of the video are shown in Supplementary Video 2 for each lab. The workflow used for the analysis was integrated into the GUI so users can easily fine-tune the Facemap tracker with video recordings that differ from our training data (Fig. 2c).
Autoregressive model for prediction of keypoints
We built an autoregressive model to determine how far into the future we could predict each keypoint, as a measure of its timescale. The keypoint traces were split into ten segments in time. The first 75% of each segment was assigned to the training set, and then, after a 2.6-s gap that was excluded, the remaining part of the segment was assigned to the test set. Linear regression was performed with exponential decay basis functions, with decay timescales from 40 ms to 5 s. All keypoint traces were passed through the basis functions and then combined linearly to predict each keypoint at each future timepoint. We fit the regression model on training timepoints separately for each prediction delay, from 20 ms to 10 s in intervals of 20 ms and from 10 s to 40 s in intervals of 500 ms. Then we estimated performance on test timepoints at each prediction delay using variance explained. We estimated the timescale of the keypoint trace as the future timepoint at which the variance explained was half the variance explained at a time delay of 20 ms.
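A simplified sketch of this autoregressive analysis for a single keypoint coordinate, using a handful of exponential-decay timescales within the stated 40 ms–5 s range and a single contiguous train/test split (the full analysis uses ten segments and all keypoints jointly):

```python
import numpy as np

def exp_basis(X, fps=50.0, taus=(0.04, 0.1, 0.3, 1.0, 5.0)):
    """Causally filter keypoint traces X (T, K) with exponential-decay kernels.

    The decay timescales here are illustrative points within the 40 ms - 5 s range.
    """
    feats = []
    for tau in taus:
        kernel = np.exp(-np.arange(int(5 * tau * fps)) / (tau * fps))
        kernel /= kernel.sum()
        for k in range(X.shape[1]):
            # full convolution truncated to T samples gives a causal filter
            feats.append(np.convolve(X[:, k], kernel)[: len(X)])
    return np.stack(feats, axis=1)                  # (T, K * len(taus))

def future_variance_explained(X, y, delay, train, test, lam=1e-4):
    """Ridge-regress keypoint y at time t + delay (in timesteps) from features at time t."""
    F = exp_basis(X)
    Ft, yt = F[train][:-delay], y[train][delay:]
    w = np.linalg.solve(Ft.T @ Ft + lam * np.eye(F.shape[1]), Ft.T @ yt)
    Fv, yv = F[test][:-delay], y[test][delay:]
    resid = yv - Fv @ w
    return 1 - resid.var() / yv.var()
```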
Behavior to neural prediction
The activity of each neuron was z-scored: the mean was subtracted and the result was divided by the s.d. To predict the neural activity from behavior, we reduced the dimensionality of the z-scored activity using the singular value decomposition (SVD), keeping 128 components and obtaining U, S and V matrices of size (neurons by 128; 128 singular values; and timepoints by 128, respectively). We then predicted the neural PCs, which we defined here as the product of V and S, calling this Y = VS. After obtaining a prediction Ŷ of the neural PCs, we projected the prediction back into the neural activity space using U, so that the predicted neural activity was defined as UŶ⊤. If fewer than 200 neurons were predicted, then we directly predicted the neurons rather than using the PCs. When predicting more neurons, we found that predicting the neural PCs performed as well as or better than direct neural prediction.
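A minimal sketch of this dimensionality reduction and of projecting a PC prediction back into neuron space (for recordings of ~50,000 neurons, a truncated or randomized SVD would be used in practice; names are illustrative):

```python
import numpy as np

def neural_pcs(spks, n_comp=128):
    """Z-score activity (neurons x time) and reduce to the top PCs.

    Returns U (neurons x n_comp) and Y = VS (time x n_comp), so that a
    prediction Y_hat of the neural PCs maps back to neuron space as U @ Y_hat.T.
    """
    z = (spks - spks.mean(axis=1, keepdims=True)) / spks.std(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(z, full_matrices=False)
    U = U[:, :n_comp]
    Y = Vt[:n_comp].T * S[:n_comp]          # neural PCs, "VS"
    return U, Y

# predicted neural activity (neurons x time) from predicted PCs Y_hat (time x n_comp):
# spks_hat = U @ Y_hat.T
```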
The neural activity was split into ten segments in time. The first 75% of each segment was assigned to the training set, and then, after a 3-s gap that was excluded, the remaining part of the segment was assigned to the test set. The training and test sets consisted of continuous segments to avoid contamination of the test set by the training set due to the autocorrelation timescale of behavior, with average lengths of 10 and 3.5 min, respectively.
We quantified the performance of a neural prediction model using the variance explained. The single-neuron variance explained for the test-period activity trace y_i(t) of neuron i, given its prediction ŷ_i(t), is defined as

$$\mathrm{VE}_i = 1 - \frac{\sum_t \left(y_i(t) - \hat{y}_i(t)\right)^2}{\sum_t \left(y_i(t) - \bar{y}_i\right)^2} \qquad (1)$$

where ȳ_i is the mean of y_i(t) over the test timepoints.
Peer prediction analysis
Neurons have independent noise that models cannot explain. Therefore, an upper bound for the variance that a model can explain is lower than the total variance of neural activity. To estimate the amount of this explainable variance in the neural recordings, we used the ‘peer prediction’ method14,78,79. Peer prediction predicts each neuron from the other simultaneously-recorded cells (the neuron’s ‘peers’). The amount of variance that the peer prediction model explains is an estimate of the repeatable shared variance across neurons; we term this the explainable variance.
To compute peer prediction, we split the population into two spatially segregated populations, dividing the field of view into nonoverlapping strips of width 200 μm and assigning the neurons in the even strips to one group, and the neurons in the odd strips to the other group, regardless of the neuron’s depth. Next, we computed the top 128 PCs of each population and predicted one population’s PCs from the other population’s PCs using RRR fit to training data with λ = 1 × 10−1 and rank = 127. The variance explained by this model on test data (Eq. (1)) is termed the explainable variance for each neuron. The average explainable variance was 9.4% in the visual recordings and 11.1% in the sensorimotor recordings at the recording frame rate of 3 Hz.
Prediction performance quantification
We computed the variance explained for a given behavioral prediction model for each neuron on test data (Eq. (1)). The average single neuron variance explained, in 300 ms bins, by the deep network model using keypoints was 4.1% in the visual areas and 5.3% in the sensorimotor areas, and using movie PCs was 4.8% and 5.3%, respectively. We then normalized the variance explained by the upper bound on its variance explained, the explainable variance, as computed from peer prediction. We quantified the normalized variance explained on a per-neuron basis in Fig. 3d,e, taking the variance explained for each neuron and dividing it by its explainable variance, and visualizing only the neurons with an explainable variance greater than 1 × 10−3. For population-level variance explained quantification, the normalized variance explained was defined as the mean variance explained across all neurons divided by the mean explainable variance across all neurons (Fig. 3f–i and Extended Data Fig. 4).
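These quantities can be computed per neuron as in the sketch below, assuming the definition of variance explained in Eq. (1) (illustrative code, not the analysis scripts used for the figures):

```python
import numpy as np

def variance_explained(y, y_hat):
    """Per-neuron variance explained on test data (Eq. (1)); y, y_hat: (neurons, T)."""
    resid = ((y - y_hat) ** 2).sum(axis=1)
    total = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return 1 - resid / total

def normalized_ve(ve, explainable_ve, per_neuron=True, min_ev=1e-3):
    """Normalize variance explained by the peer-prediction explainable variance."""
    if per_neuron:                                  # per-neuron ratio (Fig. 3d,e)
        keep = explainable_ve > min_ev
        return ve[keep] / explainable_ve[keep]
    return ve.mean() / explainable_ve.mean()        # population-level quantification
```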
We also computed the cumulative variance explained across neural PCs, shown in Fig. 3i, defined as

$$\mathrm{VE}_{\leq n} = \frac{\sum_{j=1}^{n} \left[\mathrm{var}(y_j) - \mathrm{var}(y_j - \hat{y}_j)\right]}{\sum_{j} \mathrm{var}(y_j)},$$

where y_j is the j-th neural PC on test timepoints and the denominator sums over all 128 PCs. This quantity allows the estimation of the dimensionality of the behavioral prediction.
Linear neural prediction using PCs of videos or keypoints
The mouse videos were reduced in dimensionality using SVD in blocks as described in ref. 14. The movie PCs were computed from the raw movie frames, and the top 500 PCs were used. Because the neural activity was recorded at a lower frame rate, the behavioral PCs were smoothed with a Gaussian filter of width 100 ms and then resampled at the neural timescale. We subtracted each behavioral PC by its mean and divided all PCs by the s.d. of the top behavioral PC.
A linear model called reduced rank regression (RRR) was used to predict neural PCs (Y) from the behavioral PCs or the corrected keypoint traces (X). RRR is a form of regularized linear regression, with the prediction weights matrix restricted to a specific rank80, reducing the number of parameters and making it more robust to overfitting. The RRR model is defined as

$$\hat{Y} = X A B^{\top},$$

where A has size (number of predictors × rank) and B has size (number of neural PCs × rank), so that the weight matrix AB⊤ has rank at most the chosen rank. Like in ridge regression, the identity matrix times a λ constant can be added to the input covariance X⊤X for regularization. We set λ = 1 × 10−6. Training data were then used to fit the A and B coefficients in closed form; a rank of 128 was used for predicting from the movie PCs and a rank of 21 was used for predicting from the keypoints.
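A closed-form RRR fit can be sketched as below: solve the ridge-regularized least-squares problem, then restrict the rank of the prediction via an SVD of the fitted values (a standard construction, shown here as an illustration rather than the exact implementation used):

```python
import numpy as np

def fit_rrr(X, Y, rank=128, lam=1e-6):
    """Reduced-rank regression: Y ~ X @ A @ B.T with A, B of width `rank`."""
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    W = np.linalg.solve(XtX, X.T @ Y)          # full-rank (ridge) solution
    U, S, Vt = np.linalg.svd(X @ W, full_matrices=False)
    B = Vt[:rank].T                            # (outputs x rank) output directions
    A = W @ B                                  # (inputs x rank)
    return A, B

# prediction on held-out data:
# Y_hat = X_test @ A @ B.T
```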
Neural prediction using a deep network
A multilayer network model was fit to predict neural activity from the movie PCs or the corrected keypoint traces using pytorch34 (Fig. 3a). The deep network model consisted of a core module and a readout module. The core module consisted of a fully-connected layer with the same dimensionality as the number of keypoints, a one-dimensional convolutional layer with ten filters (temporal convolution), a ReLU nonlinearity, two fully-connected layers with ReLU nonlinearities, the first with dimensionality of 50 and the second with dimensionality of 256. The 256-dimensional output of the core module is termed the ‘deep behavioral features’ of the model. The readout module of the network was one fully-connected layer, with a dimensionality of size 128 when predicting the neural PCs, or size equal to the number of neurons when predicting single neuron activity (when the number of neurons predicted was less than 200). The deep behavioral features, before entering the readout module, were subsampled at the timepoints coincident with the neural activity frames, because the videos were recorded at 50 Hz, while the neural activity was recorded at 3 Hz.
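A sketch of this architecture in pytorch is shown below. The temporal filter length and other unspecified details are assumptions, and the first fully-connected layer is written as a 1 × 1 convolution so that it applies to every timepoint:

```python
import torch
import torch.nn as nn

class FacemapEncoder(nn.Module):
    """Sketch of the core + readout model described above (kernel size and other
    unspecified details are assumptions, not the published hyperparameters)."""
    def __init__(self, n_kp=22, n_filt=10, kernel=51, n_deep=256, n_out=128):
        super().__init__()
        self.core = nn.Sequential(
            # linear mixing of keypoints, same dimensionality as the input
            nn.Conv1d(n_kp, n_kp, kernel_size=1),
            # temporal convolution with 10 filters, then ReLU
            nn.Conv1d(n_kp, n_filt, kernel_size=kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.deep = nn.Sequential(
            nn.Linear(n_filt, 50), nn.ReLU(),
            nn.Linear(50, n_deep), nn.ReLU(),   # "deep behavioral features"
        )
        self.readout = nn.Linear(n_deep, n_out) # neural PCs (or single neurons)

    def forward(self, x, neural_t=None):
        """x: (batch, time, n_kp) keypoints sampled at the video frame rate."""
        feats = self.core(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, n_filt)
        deep = self.deep(feats)
        if neural_t is not None:          # subsample at neural-frame timepoints
            deep = deep[:, neural_t]
        return self.readout(deep), deep
```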
The deep network model was fit on the training data using the optimizer AdamW with LR of 1 × 10−1, weight decay of 1 × 10−4 and 300 epochs81, and the LR was annealed by a factor of 10 at both epochs 200 and 250. When fewer than 2,000 neurons were fit, the LR and weight decay were reduced by a factor of 10 to reduce overfitting. When fewer than 1 h of training timepoints were used, the LR and weight decay were reduced by a factor of 2, and the number of epochs was reduced by 100 to reduce overfitting. Each training batch consisted of a single training segment, with an average of length 10 min, and there were ten batches per recording. The model was then applied to the test segments to compute variance explained.
We varied various parameters of the network to approximately determine the best network architecture for neural prediction (Extended Data Fig. 4). We varied the number of units in the last layer of the core module, the ‘deep behavioral features’, from 1 to 1,024 (Extended Data Fig. 4a), and the number of convolution filters (Extended Data Fig. 4e). We varied the number of fully-connected layers with ReLU nonlinearities in the core module, each with dimensionality of 50 other than the last layer which was fixed at 256 dimensions (Extended Data Fig. 4b). We also varied the number of fully-connected layers in the readout module, with each layer having 128 dimensions and a ReLU nonlinearity, other than the last layer which had no output nonlinearity (Extended Data Fig. 4c). Next, from the original architecture described above, we removed components, such as the first fully-connected layer and some of the ReLU nonlinearities (Extended Data Fig. 4d).
Scaling of performance with neurons and timepoints
In Fig. 3g–i, we quantified the prediction performance as a function of the number of neurons and timepoints. For this analysis, we predicted using either a fraction of the neurons or a fraction of the training timepoints, while always keeping the test timepoints fixed. The variance explained was computed for each neuron, averaged across all neurons in the subset and then normalized by the explainable variance averaged over the neurons in the subset.
Neural activity clustering and sorting
We identified groups of coactive neurons using scaled k-means clustering70. Compared to regular k-means, scaled k-means fits an additional scaling variable λi for each neuron i such that

$$\mathbf{f}_i \approx \lambda_i \, \boldsymbol{\mu}_{\sigma_i},$$

where f_i is the activity vector of neuron i, σi is the cluster assigned to neuron i and μj is the activity of cluster j. Like regular k-means, this model is optimized by iteratively assigning each neuron to the cluster that best explains its activity, and then re-estimating cluster means. We ran scaled k-means clustering with 100 clusters on z-scored neural activity. Example clusters are shown in Fig. 4c and Extended Data Fig. 6. The activity of the neurons in each cluster was averaged to obtain a cluster activity trace (Fig. 4b). To obtain the cluster prediction from the deep behavioral model, we averaged the prediction of each neuron in the cluster (shown in gray in Fig. 4b), and then correlated this prediction with the cluster activity trace to obtain an r value for each cluster.
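A simplified sketch of the scaled k-means idea is shown below; the suite2p implementation differs in its initialization and update details:

```python
import numpy as np

def scaled_kmeans(F, n_clusters=100, n_iter=50, seed=0):
    """Scaled k-means sketch: each neuron i gets a scale lambda_i so that f_i ~ lambda_i * mu_j.

    F : (neurons, T) z-scored activity. Returns cluster labels and cluster means.
    """
    rng = np.random.default_rng(seed)
    mu = F[rng.choice(len(F), n_clusters, replace=False)]    # initialize from neurons
    for _ in range(n_iter):
        mu_norm = mu / (np.linalg.norm(mu, axis=1, keepdims=True) + 1e-10)
        proj = F @ mu_norm.T            # best lambda_i per cluster is this projection
        labels = np.argmax(proj, axis=1)  # assign to the cluster that best explains f_i
        for j in range(n_clusters):       # (assumes positive scalings)
            members = F[labels == j]
            if len(members):
                mu[j] = members.mean(axis=0)   # re-estimate cluster means
    return labels, mu
```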
To quantify how spread out each cluster is in the recording field of view, we computed a locality index for each cluster. We defined the locality index as the Kullback–Leibler (KL) divergence between the cluster’s discretized spatial distribution in the recording field of view and the discretized spatial distribution of all neurons, using a discretization of 200 μm. We then correlated each cluster’s locality index with the correlation between the cluster’s activity and its prediction (Fig. 4d,e).
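One way to compute such a locality index is sketched below; the exact binning of the field of view and the direction of the KL divergence are our own assumptions.

```python
import numpy as np

def locality_index(xy_cluster, xy_all, bin_um=200.0, eps=1e-10):
    """KL divergence between a cluster's discretized spatial distribution and that of all neurons.
    xy_cluster, xy_all: (n, 2) arrays of neuron positions in microns within the field of view."""
    # common 200 um spatial bins covering the full field of view
    edges = [np.arange(xy_all[:, d].min(), xy_all[:, d].max() + bin_um, bin_um) for d in range(2)]
    p, _, _ = np.histogram2d(xy_cluster[:, 0], xy_cluster[:, 1], bins=edges)
    q, _, _ = np.histogram2d(xy_all[:, 0], xy_all[:, 1], bins=edges)
    p = p.ravel() / p.sum() + eps                  # eps avoids log(0) in empty bins
    q = q.ravel() / q.sum() + eps
    return np.sum(p * np.log(p / q))               # KL(cluster || all neurons)
```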
Fitting a discrete HMM
We fit a hidden Markov model (HMM) to the deep behavioral features zt, where t indexes timepoints of the features after downsampling by a factor of 10, from 50 Hz to 5 Hz43. We also fit the same models to the keypoint data. All fitting procedures were the same, except for the choice of the variance term, which depends on the number of features (30 for the 11 keypoints from the Facemap tracker and 256 for the deep behavioral features), as described below. The HMM dynamics over hidden states ht are given by

$$\mathrm{Prob}(h_1 = i) = b_i, \qquad \mathrm{Prob}(h_{t+1} = i \mid h_t = j) = A_{ji},$$

where bi represents the probability of starting the Markov chain in state i, while Aji represents the probability of transitioning from state j to state i. In all experiments, we chose the number of states to be 50, and we saw similar results with fewer (10) or more (200) states. Because our goal is to understand the pattern of dynamics of the deep behavioral features, we did not attempt to infer the ‘optimal’ number of states, and we do not believe the data lend themselves easily to such an estimation.
In addition to the state dynamics, an HMM has an ‘observation’ or ‘emission’ model, which specifies the probability of observing a data sample zt given each possible state ht:

$$\mathrm{Prob}(z_t \mid h_t = i) = \mathcal{N}\!\left(z_t;\; C_i,\; \sigma^2 I\right),$$

where Ci and σ are the mean and s.d. of the Gaussian observation model, respectively. This completes the model specification. We optimized this model in pytorch using an improved, nonstandard optimization scheme, which routinely found better optima than alternative methods such as expectation maximization.
Our optimization scheme consists of (1) optimizing the model log-likelihood directly as a function of its parameters using automatic differentiation in pytorch and (2) using initializations and reparametrizations of the HMM parameters that improve stability.
The log-likelihood of the HMM can be computed from the forward pass of the ‘forward–backward’ algorithm. Following the convention of ref. 44, we define α(ht) = Prob(z1, z2, …, zt, ht). We can then compute α with the recursion

$$\alpha(h_{t+1}) = \mathrm{Prob}(z_{t+1} \mid h_{t+1}) \sum_{h_t} \alpha(h_t)\, \mathrm{Prob}(h_{t+1} \mid h_t). \qquad (2)$$

The full log-likelihood of the data can then be computed from α(hT), where T is the last timepoint, by observing that

$$\mathrm{Prob}(z_1, z_2, \ldots, z_T) = \sum_{h_T} \alpha(h_T).$$
Because the dependence of αt+1 on αt can be written in closed form, it is differentiable. After taking the logarithm and replacing the probabilities with the model equations, Eq. (2) becomes

$$\log \alpha_{t+1}(i) = -\frac{\left\| z_{t+1} - C_i \right\|^2}{2\sigma^2} - n \log \sigma + C + \log \sum_j \exp\!\big( \log \alpha_t(j) + \log A_{ji} \big), \qquad (3)$$

where αt(i) = α(ht = i), C is a constant and n is the number of dimensions of the data. This formulation allows us to use the automatic differentiation from pytorch to optimize the HMM directly, without first inferring states as in the expectation–maximization method. Additionally, we note that we used the ‘logsumexp’ function from pytorch to compute the final term of Eq. (3), which has the advantage of being numerically stable to exponentiation.
We re-parametrized the transition matrix A with a ‘log-transition’ matrix Q by

$$A_{ji} = \frac{\exp(Q_{ji})}{\sum_{i'} \exp(Q_{ji'})}.$$

This has the advantage of removing both the positivity constraint on Aji and the constraint that the rows of A sum to 1. We initialized the log-transition matrix with Qii = 3 and Qij = 0 when i ≠ j, and we initialized the parameters Ci of the observation model with random samples from the data. We chose to freeze σ to a fixed value for each dataset, because of the dependence of the log-likelihood on the number of observation dimensions n in Eq. (3). Because n is quite different between the keypoints and the deep behavioral features, the relative contribution of the observation term to the likelihood would differ if σ were set or learned to be the same in the two cases, potentially biasing the model to rely more or less on the internal hidden states ht. Instead, we fixed σ² to be proportional to the summed variance of zt, setting it to 1 for the deep behavioral features and to 30/256 for the keypoints model. This ensures an approximately equal weighting of the observation term in the likelihood. We note that the properties of the fitted HMM were not substantially different when σ² was set to the same value for the keypoints and the deep behavioral features, but the quality of the samples simulated from the HMM degraded if σ² was too low.
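A minimal pytorch sketch of this direct likelihood optimization is given below, combining the log-space forward recursion of Eq. (3), the softmax reparametrization of A and a fixed σ². The function and variable names, the uniform initial-state probabilities and the choice of Adam as the optimizer are our own assumptions.

```python
import torch

def hmm_neg_log_likelihood(z, Q, log_b, C, sigma2):
    """Negative log-likelihood of a Gaussian HMM via the log-space forward recursion (Eq. 3).
    z: (T, n) observations; Q: (K, K) log-transition matrix; log_b: (K,) log initial-state probabilities;
    C: (K, n) state means; sigma2: fixed observation variance."""
    T, n = z.shape
    log_A = torch.log_softmax(Q, dim=1)                    # rows of A are positive and sum to 1
    # Gaussian log-emission probabilities for every timepoint and state: shape (T, K)
    log_emit = -((z[:, None, :] - C[None]) ** 2).sum(-1) / (2 * sigma2) \
               - 0.5 * n * torch.log(torch.tensor(2 * torch.pi * sigma2))
    log_alpha = log_b + log_emit[0]                        # log alpha(h_1) = log b + log Prob(z_1 | h_1)
    for t in range(1, T):
        # logsumexp over the previous state is stable to exponentiation
        log_alpha = log_emit[t] + torch.logsumexp(log_alpha[:, None] + log_A, dim=0)
    return -torch.logsumexp(log_alpha, dim=0)              # -log Prob(z_1, ..., z_T)

# example usage on random data (50 states; 256-dimensional features, as for the deep behavioral features)
K, n, T = 50, 256, 500
z = torch.randn(T, n)
Q = (torch.zeros(K, K) + 3 * torch.eye(K)).requires_grad_()   # Q_ii = 3, Q_ij = 0 initialization
C = z[torch.randint(T, (K,))].clone().requires_grad_()        # state means initialized from data samples
log_b = torch.log(torch.full((K,), 1.0 / K))                  # uniform initial-state probabilities (assumed)
optimizer = torch.optim.Adam([Q, C], lr=0.01)                 # optimizer choice assumed
for step in range(100):
    optimizer.zero_grad()
    loss = hmm_neg_log_likelihood(z, Q, log_b, C, sigma2=1.0)
    loss.backward()
    optimizer.step()
```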
Properties of the discrete HMM
The inferred states were determined with the Viterbi algorithm, which finds the most likely sequence of hidden states given the data. We simulated states by drawing the initial state from the categorical distribution with parameters bi, and then running the forward dynamics, drawing each subsequent state from the conditional distribution Prob(ht+1 = i∣ht = j) = Aji.
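Simulating a state sequence in this way can be sketched as follows (numpy; b and A denote the fitted initial-state and transition probabilities, and the function name is our own):

```python
import numpy as np

def simulate_states(b, A, T, rng=None):
    """Sample a state sequence: initial state from b, then forward dynamics from the rows of A."""
    rng = np.random.default_rng() if rng is None else rng
    states = np.zeros(T, dtype=int)
    states[0] = rng.choice(len(b), p=b)
    for t in range(1, T):
        states[t] = rng.choice(len(b), p=A[states[t - 1]])   # Prob(h_{t+1} = i | h_t = j) = A_ji
    return states
```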
State lifetimes were defined as 1/(1 − Aii), which corresponds to the mean duration of staying in state i. To compute transition sparsity and other metrics, we set the self-transitions Aii = 0 and renormalized the rows. Formally, we defined a transition matrix B with Bji = Aji/(1 − Ajj) when j ≠ i and Bii = 0. This is the matrix shown in Fig. 5d,e and used for the analyses in Fig. 5i–n. The states were sorted using the Rastermap algorithm on the matrix B46. Specifically, this involves maximizing the similarity of the reordered transition matrix to the matrix given by Fji = −log((i − j)²) when j < i and 0 otherwise. Thus, the sorting attempts to place the highest transition probabilities close to the diagonal, and specifically above the diagonal, because transitions below the diagonal do not contribute to the objective. For more details, see the rastermap repository at github.com/MouseLand/rastermap.
The transition sparsity was computed by sorting the rows of the matrix B in descending order and computing a cumulative sum over each row. ‘Near’ states were defined as the five states i with the highest probability Aji for a given j. Reverse transitions were computed for each state based on its near states. Similarly, we computed the two-state forward and backward transitions. Forward sequences were computed from the most likely inferred states by counting the number of increasing sequences of each length. Note that this depends on the initial Rastermap sorting of the states to define a meaningful order.
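A short numpy sketch of these summary quantities, under our reading of the definitions above (the function and variable names are our own):

```python
import numpy as np

def transition_metrics(A, n_near=5):
    """State lifetimes, the off-diagonal transition matrix B and the row-wise cumulative transition sparsity."""
    lifetimes = 1.0 / (1.0 - np.diag(A))                        # mean duration of staying in each state
    B = A.copy()
    np.fill_diagonal(B, 0.0)
    B = B / B.sum(axis=1, keepdims=True)                        # renormalize rows after removing self-transitions
    sparsity = np.cumsum(np.sort(B, axis=1)[:, ::-1], axis=1)   # cumulative sums of rows sorted in descending order
    near = np.argsort(A, axis=1)[:, ::-1][:, :n_near]           # the n_near states i with highest A_ji for each j
    return lifetimes, B, sparsity, near
```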
Statistics and reproducibility
No statistical method was used to predetermine sample size, but our sample sizes are similar to those reported in previous publications14,16,17. We performed Wilcoxon signed-rank tests, which do not require the data to be normal. No data were excluded from the analyses. There were no experimental groups so there was no randomization necessary. Data collection and analysis were not performed blind to the conditions of the experiments.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41593-023-01490-6.
Acknowledgements
This research and all authors were funded by the Howard Hughes Medical Institute at the Janelia Research Campus. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank the Vivarium staff at Janelia for animal husbandry support, and we thank several others at Janelia: S. Lindo and S. DiLisio for surgery support, M. Michaelos for help with recordings, J. Arnold for designing head bars and coverslips, D. Flickinger for microscopy support and T. Goulet for engineering support. We thank A. Robie, M. Kabra and K. Branson at Janelia for advice on keypoint tracking. We also thank the contributors of the mouse face videos from other labs: A. Hoffmann in the Helmchen lab at UZH, F. Cazettes in the Mainen lab at the Champalimaud, B. Price in the Gavornik lab at BU, A. Laffere in the Lak lab at Oxford and M. Campbell in the Uchida lab at Harvard, and J. Poort at Cambridge.
Author contributions
A.S., C.S. and M.P. designed the study and wrote the manuscript, with input from all authors. A.S., C.S., M.P. and R.T. performed data analysis. L.Z., M.P. and W.L. performed data collection.
Peer review
Peer review information
Nature Neuroscience thanks Eric Yttri and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
The data generated in this study is available on figshare: 10.25378/janelia.23712957.
Code availability
Facemap was used to perform all analyses in the paper; the code and GUI are available at https://www.github.com/mouseland/facemap, which includes video and code-based guides for using the software. Scripts for running the analyses in the paper are available at https://github.com/MouseLand/facemap/tree/main/paper.
Competing interests
The authors have no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Marius Pachitariu, Carsen Stringer.
Contributor Information
Atika Syeda, Email: syedaa@janelia.hhmi.org.
Carsen Stringer, Email: stringerc@janelia.hhmi.org.
Extended data is available for this paper at 10.1038/s41593-023-01490-6.
Supplementary information
The online version contains supplementary material available at 10.1038/s41593-023-01490-6.
References
- 1.Ringach DL. Spontaneous and driven cortical activity: implications for computation. Curr. Opin. Neurobiol. 2009;19:439–444. doi: 10.1016/j.conb.2009.07.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Avitan L, Stringer C. Not so spontaneous: multi-dimensional representations of behaviors and context in sensory areas. Neuron. 2022;110:3064–3075. doi: 10.1016/j.neuron.2022.06.019. [DOI] [PubMed] [Google Scholar]
- 3.Niell CM, Stryker MP. Modulation of visual responses by behavioral state in mouse visual cortex. Neuron. 2010;65:472–479. doi: 10.1016/j.neuron.2010.01.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Erisken S, et al. Effects of locomotion extend throughout the mouse early visual system. Curr. Biol. 2014;24:2899–2907. doi: 10.1016/j.cub.2014.10.045. [DOI] [PubMed] [Google Scholar]
- 5.Williamson RS, Hancock KE, Shinn-Cunningham BG, Polley DB. Locomotion and task demands differentially modulate thalamic audiovisual processing during active search. Curr. Biol. 2015;25:1885–1891. doi: 10.1016/j.cub.2015.05.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Vinck M, Batista-Brito R, Knoblich U, Cardin JA. Arousal and locomotion make distinct contributions to cortical activity patterns and visual encoding. Neuron. 2015;86:740–754. doi: 10.1016/j.neuron.2015.03.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Clancy KB, Orsolic I, Mrsic-Flogel TD. Locomotion-dependent remapping of distributed cortical networks. Nat. Neurosci. 2019;22:778–786. doi: 10.1038/s41593-019-0357-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Guitchounts G, Masís J, Wolff SBE, Cox D. Encoding of 3D head orienting movements in the primary visual cortex. Neuron. 2020;108:512–525. doi: 10.1016/j.neuron.2020.07.014. [DOI] [PubMed] [Google Scholar]
- 9.Bouvier G, Senzai Y, Scanziani M. Head movements control the activity of primary visual cortex in a luminance-dependent manner. Neuron. 2020;108:500–511. doi: 10.1016/j.neuron.2020.07.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Pietri T, et al. The emergence of the spatial structure of tectal spontaneous activity is independent of visual inputs. Cell Rep. 2017;19:939–948. doi: 10.1016/j.celrep.2017.04.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Fujiwara T, Cruz T, Bohnslav JP, Chiappe ME. A faithful internal representation of walking movements in the Drosophila visual system. Nat. Neurosci. 2017;20:72–81. doi: 10.1038/nn.4435. [DOI] [PubMed] [Google Scholar]
- 12.Strother JA, et al. Behavioral state modulates the on visual motion pathway of drosophila. Proc. Natl Acad. Sci. USA. 2018;115:E102–E111. doi: 10.1073/pnas.1703090115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zolin A, et al. Context-dependent representations of movement in Drosophila dopaminergic reinforcement pathways. Nat. Neurosci. 2021;24:1555–1566. doi: 10.1038/s41593-021-00929-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Stringer C, et al. Spontaneous behaviors drive multidimensional, brainwide activity. Science. 2019;364:eaav7893. doi: 10.1126/science.aav7893. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Gründemann J, et al. Amygdala ensembles encode behavioral states. Science. 2019;364:eaav8736. doi: 10.1126/science.aav8736. [DOI] [PubMed] [Google Scholar]
- 16.Lanore F, Cayco-Gajic NA, Gurnani H, Coyle D, Silver RA. Cerebellar granule cell axons support high-dimensional representations. Nat. Neurosci. 2021;24:1142–1150. doi: 10.1038/s41593-021-00873-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Benisty, H., et al. Rapid fluctuations in functional connectivity of cortical networks encode spontaneous behavior. Preprint at bioRxiv10.1101/2021.08.15.456390 (2021). [DOI] [PMC free article] [PubMed]
- 18.Salkoff DB, Zagha E, McCarthy E, McCormick DA. Movement and performance explain widespread cortical activity in a visual detection task. Cereb. Cortex. 2019;30:421–437. doi: 10.1093/cercor/bhz206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Engelhard B, et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature. 2019;570:509–513. doi: 10.1038/s41586-019-1261-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Musall S, Kaufman MT, Juavinett AL, Gluf S, Churchland AK. Single-trial neural dynamics are dominated by richly varied movements. Nat. Neurosci. 2019;22:1677–1686. doi: 10.1038/s41593-019-0502-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Batty, E. et al. BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos. In Advances in Neural Information Processing Systems 32 (2019).
- 22.Whiteway MR, et al. Partitioning variability in animal behavioral videos using semi-supervised variational autoencoders. PLoS Comput. Biol. 2021;17:e1009439. doi: 10.1371/journal.pcbi.1009439. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Voigts J, Sakmann B, Celikel T. Unsupervised whisker tracking in unrestrained behaving animals. J. Neurophysiol. 2008;100:504–515. doi: 10.1152/jn.00012.2008. [DOI] [PubMed] [Google Scholar]
- 24.Clack NG, et al. Automated tracking of whiskers in videos of head fixed rodents. PLoS Comput. Biol. 2012;8:e1002591. doi: 10.1371/journal.pcbi.1002591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Petersen RS, Colins Rodriguez A, Evans MH, Campagner D, Loft MSE. A system for tracking whisker kinematics and whisker shape in three dimensions. PLoS Comput. Biol. 2020;16:e1007402. doi: 10.1371/journal.pcbi.1007402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Staab, M. et al. What moves when mice move a single whisker to touch? Individuality and stereotypy in behavior. Preprint at bioRxiv10.1101/2022.10.03.510596 (2022).
- 27.Mathis A, et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 2018;21:1281–1289. doi: 10.1038/s41593-018-0209-y. [DOI] [PubMed] [Google Scholar]
- 28.Graving JM, et al. Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife. 2019;8:e47994. doi: 10.7554/eLife.47994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Kane GA, Lopes Gonçalo, Saunders JL, Mathis A, Mathis MW. Real-time, low-latency closed-loop feedback using markerless posture tracking. eLife. 2020;9:e61909. doi: 10.7554/eLife.61909. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Bala PC, et al. Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nat. Commun. 2020;11:4560. doi: 10.1038/s41467-020-18441-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Pereira TD, et al. Sleap: a deep learning system for multi-animal pose tracking. Nat. Methods. 2022;19:486–495. doi: 10.1038/s41592-022-01426-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. Preprint at arXiv10.48550/arXiv.1505.04597 (2015).
- 33.Leibe, B., Matas, J., Sebe, N. & Welling, M. (eds.). Deepercut: a deeper, stronger, and faster multi-person pose estimation model. Proceedings of European Conference on Computer Vision pp. 34–50 (Springer, Cham, 2016).
- 34.Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (2019).
- 35.Abadi, M. et. al. Tensorflow: large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv10.48550/arXiv.1603.04467 (2016).
- 36.Nath, T., Mathis, A., Chen, A.C., Patel, A., Bethge, M. & Mathis, M.W. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat. Protoc.14, 2152–2176 (2019). [DOI] [PubMed]
- 37.Mathis, A., Biasi, T., Schneider, S., Yuksekgonul, M., Rogers, B., Bethge, M. & Mathis, M.W. Pretraining boosts out-of-domain robustness for pose estimation. Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) pp. 1859–1868 (IEEE, 2021).
- 38.Meyer AF, Poort J, O’Keefe J, Sahani M, Linden JF. A head-mounted camera system integrates detailed behavioral monitoring with multichannel electrophysiology in freely moving mice. Neuron. 2018;100:46–60. doi: 10.1016/j.neuron.2018.09.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron88, 1121–1135 (2015). [DOI] [PMC free article] [PubMed]
- 40.Buchanan, E. K. et al. Quantifying the behavioral dynamics of C. elegans with autoregressive hidden Markov models. Workshop on Worm’s Neural Information Processing at the 31st Conference on Neural Information Processing Systems (NIPS, 2017).
- 41.Calhoun AJ, Pillow JW, Murthy M. Unsupervised identification of the internal states that shape natural behavior. Nat. Neurosci. 2019;22:2040–2049. doi: 10.1038/s41593-019-0533-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Zhang, L., Dunn, T., Marshall, J., Olveczky, B. & Linderman, S. Animal pose estimation from video data with a hierarchical von Mises-Fisher-Gaussian model. Proceedings of 24th International Conference on Artificial Intelligence and Statistics Vol. 130, pp. 2800–2808 (PMLR, 2021).
- 43.Baum LE, Petrie T. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 1966;37:1554–1563. [Google Scholar]
- 44.Bishop, C. M. Pattern Recognition and Machine Learning Vol. 4 (Springer, 2006).
- 45.Van der Maaten L, Hinton G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–2605. [Google Scholar]
- 46.Stringer, C. et al. Rastermap: a discovery method for neural population recordings. Preprint at bioRxiv10.1101/2023.07.25.550571 (2023). [DOI] [PubMed]
- 47.Dolensek N, Gehrlach DA, Klein AS, Gogolla N. Facial expressions of emotion states and their neuronal correlates in mice. Science. 2020;368:89–94. doi: 10.1126/science.aaz9468. [DOI] [PubMed] [Google Scholar]
- 48.Aguillon-Rodriguez V, et al. Standardized and reproducible measurement of decision-making in mice. eLife. 2021;10:e63711. doi: 10.7554/eLife.63711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Robie AA, Seagraves KM, Egnor SER, Branson K. Machine vision methods for analyzing social interactions. J. Exp. Biol. 2017;220:25–34. doi: 10.1242/jeb.142281. [DOI] [PubMed] [Google Scholar]
- 50.Segalin C, et al. The mouse action recognition system (MARS) software pipeline for automated analysis of social behaviors in mice. eLife. 2021;10:e63720. doi: 10.7554/eLife.63720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Lauer J, et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods. 2022;19:496–504. doi: 10.1038/s41592-022-01443-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Hsu AI, Yttri EA. B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat. Commun. 2021;12:5188. doi: 10.1038/s41467-021-25420-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Marshall JD, et al. Continuous whole-body 3D kinematic recordings across the rodent behavioral repertoire. Neuron. 2021;109:420–437. doi: 10.1016/j.neuron.2020.11.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Van Rossum, G. Python Reference Manual (CWI, 1995).
- 55.Harris CR, et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Virtanen P, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC pp. 1–6 (ACM, 2015).
- 58.Da Costa-Luis, C. et al. tqdm: a fast, extensible progress meter for Python and CLI. Zenodo. zenodo.org/records/6412640 (2022).
- 59.Bradski G. The OpenCV library. DDJ. 2000;120:122–125. [Google Scholar]
- 60.PyQT. Pyqt reference guide. www.bibsonomy.org/bibtex/2803cea968d0f5243a205d35520563074/maxirichter (2012).
- 61.Campagnola, L. et al. pyqtgraph/pyqtgraph. GitHub. github.com/pyqtgraph/pyqtgraph (2023).
- 62.Hunter JD. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007;9:90–95. [Google Scholar]
- 63.Kluyver, T. et al. Jupyter Notebooks-a publishing format for reproducible computational workflows. Elpub10.3233/978-1-61499-649-1-87 (2016).
- 64.Stringer C, Michaelos M, Tsyboulski D, Lindo SE, Pachitariu M. High-precision coding in visual cortex. Cell. 2021;184:2767–2778. doi: 10.1016/j.cell.2021.03.042. [DOI] [PubMed] [Google Scholar]
- 65.Joesch M, Meister M. A neuronal circuit for colour vision based on rod–cone opponency. Nature. 2016;532:236–239. doi: 10.1038/nature17158. [DOI] [PubMed] [Google Scholar]
- 66.Nikbakht N, Diamond ME. Conserved visual capacity of rats under red light. eLife. 2021;10:e66429. doi: 10.7554/eLife.66429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Sofroniew NJ, Flickinger D, King J, Svoboda K. A large field of view two-photon mesoscope with subcellular resolution for in vivo imaging. eLife. 2016;5:e14472. doi: 10.7554/eLife.14472. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Pologruto TA, Sabatini BL, Svoboda K. Scanimage: flexible software for operating laser scanning microscopes. Biomed. Eng. Online. 2003;2:13. doi: 10.1186/1475-925X-2-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Tsyboulski, D. et al. Remote focusing system for simultaneous dual-plane mesoscopic multiphoton imaging. Preprint at bioRxiv10.1101/503052 (2018).
- 70.Pachitariu, M. et al. Suite2p: beyond 10,000 neurons with standard two-photon microscopy. Preprint at bioRxiv10.1101/061507 (2017).
- 71.Van der Walt S, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 2011;13:22–30. [Google Scholar]
- 72.Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods17, 261–272 (2020). [DOI] [PMC free article] [PubMed]
- 73.Clack, N. ScanImageTiffReaderPython. GitLab. gitlab.com/vidriotech/scanimagetiffreader-python (2023).
- 74.Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res.12, 2825–2830 (2011).
- 75.Friedrich J, Zhou P, Paninski L. Fast online deconvolution of calcium imaging data. PLoS Comput. Biol. 2017;13:e1005423. doi: 10.1371/journal.pcbi.1005423. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Pachitariu M, Stringer C, Harris KD. Robustness of spike deconvolution for neuronal calcium imaging. J. Neurosci. 2018;38:7976–7985. doi: 10.1523/JNEUROSCI.3339-17.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv10.48550/arXiv.1412.6980 (2014).
- 78.Harris KD, Csicsvari J, Hirase H, Dragoi G, Buzsáki G. Organization of cell assemblies in the hippocampus. Nature. 2003;424:552. doi: 10.1038/nature01834. [DOI] [PubMed] [Google Scholar]
- 79.Pillow JW, et al. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature. 2008;454:995–999. doi: 10.1038/nature07140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Izenman AJ. Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 1975;5:248–264. [Google Scholar]
- 81.Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. Preprint at arXiv10.48550/arXiv.1711.05101 (2017).