eLife. 2021 Sep 2;10:e63377. doi: 10.7554/eLife.63377

DeepEthogram, a machine learning pipeline for supervised behavior classification from raw pixels

James P Bohnslav 1, Nivanthika K Wimalasena 1,2, Kelsey J Clausing 3,4, Yu Y Dai 3,4, David A Yarmolinsky 1,2, Tomás Cruz 5, Adam D Kashlan 1,2, M Eugenia Chiappe 5, Lauren L Orefice 3,4, Clifford J Woolf 1,2, Christopher D Harvey 1,
Editors: Mackenzie W Mathis6, Timothy E Behrens7
PMCID: PMC8455138  PMID: 34473051

Abstract

Videos of animal behavior are used to quantify researcher-defined behaviors of interest to study neural function, gene mutations, and pharmacological therapies. Behaviors of interest are often scored manually, which is time-consuming, limited to few behaviors, and variable across researchers. We created DeepEthogram: software that uses supervised machine learning to convert raw video pixels into an ethogram, the behaviors of interest present in each video frame. DeepEthogram is designed to be general-purpose and applicable across species, behaviors, and video-recording hardware. It uses convolutional neural networks to compute motion, extract features from motion and images, and classify features into behaviors. Behaviors are classified with above 90% accuracy on single frames in videos of mice and flies, matching expert-level human performance. DeepEthogram accurately predicts rare behaviors, requires little training data, and generalizes across subjects. A graphical interface allows beginning-to-end analysis without end-user programming. DeepEthogram’s rapid, automatic, and reproducible labeling of researcher-defined behaviors of interest may accelerate and enhance supervised behavior analysis. Code is available at: https://github.com/jbohnslav/deepethogram.

Research organism: D. melanogaster, Mouse

Introduction

The analysis of animal behavior is a common approach in a wide range of biomedical research fields, including basic neuroscience research (Krakauer et al., 2017), translational analysis of disease models, and development of therapeutics. For example, researchers study behavioral patterns of animals to investigate the effect of a gene mutation, understand the efficacy of potential pharmacological therapies, or uncover the neural underpinnings of behavior. In some cases, behavioral tests allow quantification of behavior through tracking an animal’s location in space, such as in the three-chamber assay, open-field arena, Morris water maze, and elevated plus maze (EPM) (Pennington, 2019). Increasingly, researchers are finding that important details of behavior involve subtle actions that are hard to quantify, such as changes in the prevalence of grooming in models of anxiety (Peça et al., 2011), licking a limb in models of pain (Browne, 2017), and manipulation of food objects for fine sensorimotor control (Neubarth, 2020; Sauerbrei et al., 2020). In these cases, researchers often closely observe videos of animals and then develop a list of behaviors they want to measure. To quantify these observations, the most commonly used approach, to our knowledge, is for researchers to manually watch videos with a stopwatch to count the time each behavior of interest is exhibited (Figure 1A). This approach takes immense amounts of researcher time, often equal to or greater than the duration of the video per individual subject. Also, because this approach requires manual viewing, often only one or a small number of behaviors are studied at a time. In addition, researchers often do not label the video frames when specific behaviors occur, precluding subsequent analysis and review of behavior bouts, such as bout durations and the transition probability between behaviors. Furthermore, scoring of behaviors can vary greatly between researchers especially as new researchers are trained (Segalin, 2020) and can be subject to bias. Therefore, it would be a significant advance if a researcher could define a list of behaviors of interest, such as face grooming, tail grooming, limb licking, locomoting, rearing, and so on, and then use automated software to identify when and how frequently each of the behaviors occurred in a video.

Figure 1. DeepEthogram overview.

(A) Workflows for supervised behavior labeling. Left: a common traditional approach based on manual labeling. Middle: workflow with DeepEthogram. Right: Schematic of expected scaling of user time for each workflow. (B) Ethogram schematic. Top: example images from Mouse-Ventral1 dataset. Bottom: ethogram with human labels. Dark colors indicate which behavior is present. Example shown is from Mouse-Ventral1 dataset. Images have been cropped, brightened, and converted to grayscale for clarity. (C) DeepEthogram-fast model schematic. Example images are from the Fly dataset. Left: a sequence of 11 frames is converted into 10 optic flows. Middle: the center frame and the stack of 10 optic flows are converted into 512-dimensional representations via deep convolutional neural networks (CNNs). Right: these features are converted into probabilities of each behavior via the sequence model.

Figure 1—figure supplement 1. Optic flow.

(A) Example images from the Fly dataset on two consecutive frames. (B) Optic flow estimated with TinyMotionNet. Note that the image size is half the original due to the TinyMotionNet architecture. Displacements in the x dimension (left) and y dimension (right) between the frames in (A). (C) The reconstruction of frame 1 estimated by sampling frame 2 according to the optic flow calculation. The image was resized with bilinear interpolation before resampling. (D) Absolute error between frame 1 and its reconstruction from optic flow. (E) Visualization of optic flow using arrow lengths to indicate the direction and magnitude of flow. (F) Visualization of optic flow using coloring according to the inset color scale. Left displacements are mapped to cyan, right displacements to red, and so on. Saturation indicates the magnitude of displacement.

Researchers are increasingly turning to computational approaches to quantify and analyze animal behavior (Datta et al., 2019; Anderson and Perona, 2014; Gomez-Marin et al., 2014; Brown and de Bivort, 2017; Egnor and Branson, 2016). The task of automatically classifying an animal’s actions into user-defined behaviors falls in the category of supervised machine learning. In computer vision, this task is called ‘action detection,’ ‘temporal action localization,’ ‘action recognition,’ or ‘action segmentation.’ This task is distinct from other emerging behavioral analysis methods based on unsupervised learning, in which machine learning models discover behavioral modules from the data, irrespective of researcher labels. Although unsupervised methods, such as Motion Sequencing (Wiltschko, 2015; Wiltschko et al., 2020), MotionMapper (Berman et al., 2014), BehaveNet (Batty, 2019), B-SOiD (Hsu and Yttri, 2019), and others (Datta et al., 2019), can discover behavioral modules not obvious to the researcher, their outputs are not designed to perfectly match up to behaviors of interest in cases in which researchers have strong prior knowledge about the specific behaviors relevant to their experiments.

Pioneering work, including JAABA (Kabra et al., 2013), SimBA (Nilsson, 2020), MARS (Segalin, 2020), Live Mouse Tracker (de Chaumont et al., 2019), and others (Segalin, 2020; Dankert et al., 2009; Sturman et al., 2020), has made important progress toward the goal of supervised classification of behaviors. These methods track specific features of an animal’s body and use the time series of these features to classify whether a behavior is present at a given timepoint. In computer vision, this is known as ‘skeleton-based action detection.’ In JAABA, ellipses are fit to the outline of an animal’s body, and these ellipses are used to classify behaviors. SimBA classifies behaviors based on the positions of ‘keypoints’ on the animal’s body, such as limb joints. MARS takes a similar approach with a focus on social behaviors (Segalin, 2020). These approaches have become easier with recent pose estimation methods, including DeepLabCut (Mathis, 2018; Nath, 2019; Lauer, 2021) and similar algorithms (Pereira, 2018a; Graving et al., 2019). Thus, these approaches utilize a pipeline with two major steps. First, researchers reduce a video to a set of user-defined features of interest (e.g., limb positions) using pose estimation software. Second, these pose estimates are used as inputs to classifiers that identify the behaviors of interest. This approach has the advantage that it provides information beyond whether a behavior of interest is present or absent at each timepoint. Because the parts of the animal’s body that contribute to the behavior are tracked, detailed analyses of movement and how these movements contribute to behaviors of interest can be performed.

Here, we took a different approach based on models that classify behaviors directly from the raw pixel values of videos. Drawing from extensive work in this area in computer vision (He et al., 2015; Piergiovanni and Ryoo, 2018; Zhu et al., 2017; Simonyan and Zisserman, 2014), this approach has the potential to simplify the pipeline for classifying behaviors. It requires only one type of human annotation – labels for behaviors of interest – instead of labels for both pose keypoints and behaviors. In addition, this approach requires only a single model for behavior classification instead of models for pose estimation and behavior classification. Classification from raw pixels is in principle generally applicable to any dataset that has video frames and training data for the model in the form of frame-by-frame binary behavior labels. Some recent work has performed behavior classification from pixels but only focused on motor deficits (Ryait, 2019) or one species and setup (van Dam et al., 2020). Other recent work uses image and motion features, similar to the approaches we developed here, except with a focus on classifying the timepoint at which a behavior starts, instead of classifying every frame into one or more behaviors (Kwak et al., 2019).

Our method, called DeepEthogram, is a modular pipeline for automatically classifying each frame of a video into a set of user-defined behaviors. It uses a supervised deep-learning model that, with minimal user-based training data, takes a video with T frames and a user-defined set of K behaviors and generates a binary [T, K] matrix (Figure 1A; Piergiovanni and Ryoo, 2018; Zhu et al., 2017). This matrix indicates whether each behavior is present or absent on each frame, which we term an ‘ethogram’: the set of behaviors that are present at a given timepoint (Figure 1B). We use convolutional neural networks (CNNs), specifically Hidden Two-Stream Networks (Zhu et al., 2017) and Temporal Gaussian Mixture (TGM) networks (Piergiovanni and Ryoo, 2018), to detect actions in videos, and we pretrained the networks on large open-source datasets (Deng, 2008; Carreira et al., 2019). Previous work has introduced the methods we use here (He et al., 2015; Piergiovanni and Ryoo, 2018; Zhu et al., 2017; Simonyan and Zisserman, 2014), and we have adapted and extended these methods for application to biomedical research of animal behavior. We validated our approach’s performance on nine datasets from two species, with each dataset posing distinct challenges for behavior classification. DeepEthogram automatically classifies behaviors with high performance, often reaching levels obtained by expert human labelers. High performance is achieved with only a few minutes of positive example data and even when the behaviors occur at different locations in the behavioral arena and at distinct orientations of the animal. Importantly, specialized video recording hardware is not required, and the entire pipeline requires no programming by the end-user because we developed a graphical user interface (GUI) for annotating videos, training models, and generating predictions.

Results

Modeling approach

Our goal was to take a set of video frames as input and predict the probability that each behavior of interest occurs on a given frame. This task of automated behavior labeling presents several challenges that framed our solution. First, in many cases, the behavior of interest occurs in a relatively small number of video frames, and the accuracy must be judged based on correct identification of these low-frequency events. For example, if a behavior of interest is present in 5% of frames, an algorithm could guess that the behavior is ‘not present’ on every frame and still achieve 95% overall accuracy. Critically, however, it would achieve 0% accuracy on the frames that matter, and an algorithm does not know a priori which frames matter. Second, ideally a method should perform well after being trained on only small amounts of user-labeled video frames, including across different animals, and thus require little manual input. Third, a method should be able to identify the same behavior regardless of the position and orientation of the animal when the behavior occurs. Fourth, methods should require relatively low computational resources in case researchers do not have access to large compute clusters or top-level GPUs.
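As a concrete illustration of the first challenge, the short sketch below shows how a trivial classifier that never predicts a rare behavior can still score roughly 95% overall accuracy while being useless on the frames that matter. It uses scikit-learn only to make the point numerically; it is not part of the DeepEthogram codebase.

```python
# Minimal numeric sketch of the class-imbalance pitfall: high overall accuracy,
# zero usefulness on the rare behavior. scikit-learn is assumed to be installed.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.05).astype(int)   # behavior present on ~5% of frames
always_absent = np.zeros_like(labels)              # classifier that always predicts "not present"

print(accuracy_score(labels, always_absent))                    # ~0.95 overall accuracy
print(f1_score(labels, always_absent, zero_division=0))         # 0.0 on the behavior of interest
```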

We modeled our approach after temporal action localization methods used in computer vision aimed to solve related problems (Zeng, 2019; Xie et al., 2019; Chao, 2018; El-Nouby and Taylor, 2018). The overall architecture of our solution includes (1) estimating motion (optic flow) from a small snippet of video frames, (2) compressing a snippet of optic flow and individual still images into a lower dimensional set of features, and (3) using a sequence of the compressed features to estimate the probability of each behavior at each frame in a video (Figure 1C). We implemented this architecture using large, deep CNNs. First, one CNN is used to generate optic flow from a set of images. We incorporate optic flow because some behaviors are only obvious by looking at the animal’s movements between frames, such as distinguishing standing still and walking. We call this CNN the flow generator (Figure 1C, Figure 1—figure supplement 1). We then use the optic flow output of the flow generator as input to a second CNN to compress the large number of optic flow snippets across all the pixels into a small set of features called flow features (Figure 1C). Separately, we use a distinct CNN, which takes single video frames as input, to compress the large number of raw pixels into a small set of spatial features, which contain information about the values of pixels relative to one another spatially but lack temporal information (Figure 1C). We include single frames separately because some behaviors are obvious from a single still image, such as identifying licking just by seeing an extended tongue. Together, we call these latter two CNNs feature extractors because they compress the large number of raw pixels into a small set of features called a feature vector (Figure 1C). Each of these feature extractors is trained to produce a probability for each behavior on each frame based only on their input (optic flow or single frames). We then fuse the outputs of the two feature extractors by averaging (Materials and methods). To produce the final probabilities that each behavior was present on a given frame – a step called inference – we use a sequence model, which has a large temporal receptive field and thus utilizes long timescale information (Figure 1C). We use this sequence model because our CNNs only ‘look at’ either 1 frame (spatial) or about 11 frames (optic flow), but when labeling videos, humans know that sometimes the information present seconds ago can be useful for estimating the behavior of the current frame. The final output of DeepEthogram is a T,K matrix, in which each element is the probability of behavior k occurring at time t. We threshold these probabilities to get a binary prediction for each behavior at each timepoint, with the possibility that multiple behaviors can occur simultaneously (Figure 1B).
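The PyTorch-style sketch below summarizes this inference pipeline. It is a simplified illustration, not DeepEthogram's actual API: the module names are placeholders, and fusion is shown as a simple average of the two 512-dimensional feature vectors, a simplification of the fusion step described in Materials and methods.

```python
# Simplified sketch of DeepEthogram inference (placeholder modules, not the real API).
import torch

def predict_ethogram(frames, flow_generator, flow_extractor, spatial_extractor,
                     sequence_model, thresholds, window=11):
    """frames: [T, 3, H, W] video tensor; thresholds: [K] per-behavior cutoffs."""
    T = frames.shape[0]
    fused = []
    with torch.no_grad():
        for t in range(T):
            # 11-frame snippet centered on frame t (clamped at the video edges)
            start = min(max(t - window // 2, 0), T - window)
            snippet = frames[start:start + window].unsqueeze(0)   # [1, 11, 3, H, W]
            flows = flow_generator(snippet)                       # [1, 10, 2, H, W] optic flow
            flow_feat = flow_extractor(flows)                     # [1, 512] flow features
            spatial_feat = spatial_extractor(frames[t:t + 1])     # [1, 512] spatial features
            fused.append((flow_feat + spatial_feat) / 2)          # fuse by averaging (simplified)
        features = torch.stack(fused, dim=1)                      # [1, T, 512] feature sequence
        probs = torch.sigmoid(sequence_model(features))[0]        # [T, K] per-frame probabilities
    # threshold each behavior independently; multiple behaviors may co-occur on a frame
    return (probs >= thresholds).int()
```

The per-frame loop is only to make the data flow explicit; an efficient implementation would batch snippets and reuse computation shared between overlapping windows.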

For the flow generator, we use the MotionNet (Zhu et al., 2017) architecture to generate 10 optic flow frames from 11 images. For the feature extractors, we use the ResNet family of models (He et al., 2015; Hara et al., 2018) to extract both flow features and spatial features. Finally, we use TGM (Piergiovanni and Ryoo, 2018) models as the sequence model to perform the ultimate classification. Each of these models has many variants with a large range in the number of parameters and the associated computational demands. We therefore created three versions of DeepEthogram that use variants of these models, with the aim of trading off accuracy and speed: DeepEthogram-fast, DeepEthogram-medium, and DeepEthogram-slow. DeepEthogram-fast uses TinyMotionNet (Zhu et al., 2017) for the flow generator and ResNet18 (He et al., 2015) for the feature extractors. It has the fewest parameters, the fastest training of the flow generator and feature extractor models, the fastest inference time, and the smallest requirement for computational resources. As a tradeoff for this speed, DeepEthogram-fast tends to have slightly worse performance than the other versions (see below). In contrast, DeepEthogram-slow uses a novel architecture TinyMotionNet3D for its flow generator and 3D-ResNet34 (He et al., 2015; Simonyan and Zisserman, 2014; Hara et al., 2018) for its feature extractors. It has the most parameters, the slowest training and inference times, and the highest computational demands, but it has the capacity to produce the best performance. DeepEthogram-medium is intermediate and uses MotionNet (Zhu et al., 2017) and ResNet50 (He et al., 2015) for its flow generator and feature extractors. All versions of DeepEthogram use the same sequence model. All variants of the flow generators and feature extractors are pretrained on the Kinetics700 video dataset (Carreira et al., 2019), so that model parameters do not have to be learned from scratch (Materials and methods). TGM networks represent the state of the art on various action detection benchmarks as of 2019 (Piergiovanni and Ryoo, 2018). However, recent work based on multiple temporal resolutions (Feichtenhofer et al., 2019; Kahatapitiya and Ryoo, 2021), graph convolutional networks (Zeng, 2019), and transformer architectures (Nawhal and Mori, 2021) has exceeded this performance. We carefully chose DeepEthogram’s components based on their performance, parameter count, and hardware requirements. DeepEthogram as a whole and its component parts are not aimed to be the state of the art on standard computer vision temporal action localization datasets and instead are focused on practical application to biomedical research of animal behavior.
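For reference, the component choices named above can be restated as a small configuration table; the dictionary below is illustrative only and is not the package's configuration format.

```python
# The three DeepEthogram presets described above, restated as a dictionary.
DEEPETHOGRAM_PRESETS = {
    # preset: (flow generator, feature extractor)
    "fast":   ("TinyMotionNet",   "ResNet18"),     # fewest parameters, fastest, slightly lower accuracy
    "medium": ("MotionNet",       "ResNet50"),     # intermediate speed and accuracy
    "slow":   ("TinyMotionNet3D", "3D-ResNet34"),  # most parameters, slowest, highest capacity
}
# All presets share the same TGM sequence model and Kinetics700-pretrained weights.
```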

In practice, the first step in running DeepEthogram is to train the flow generator on a set of videos, which occurs without user input (Figure 1A). In parallel, a user must label each frame in a set of training videos for the presence of each behavior of interest. These labels are then used to train independently the spatial feature extractor and the flow feature extractor to produce separate estimates of the probability of each behavior. The extracted feature vectors for each frame are then saved and used to train the sequence models to produce the final predicted probability of each behavior at each frame. We chose to train the models in series, rather than all at once from end-to-end, due to a combination of concerns about backpropagating error across diverse models, overfitting with extremely large models, and computational capacity (Materials and methods).

Diverse datasets to test DeepEthogram

To test the performance of our model, we used nine different neuroscience research datasets that span two species and present distinct challenges for computer vision approaches. Please see the examples in Figure 2, Figure 2—figure supplements 1–6, and Videos 1–9 that demonstrate the behaviors of interest and provide an intuition for the ease or difficulty of identifying and distinguishing particular behaviors.

Figure 2. Datasets and behaviors of interest.

(A) Left: raw example images from the Mouse-Ventral1 dataset for each of the behaviors of interest. Right: time spent on each behavior, based on human labels. Note that the times may add up to more than 100% across behaviors because multiple behaviors can occur on the same frame. Background is defined as when no other behaviors occur. (B–I) Similar to (A), except for the other datasets.

Figure 2—figure supplement 1. Example images from the datasets, part 1.

(A) Examples from the Mouse-Ventral1 dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F. (B) Similar to (A), except for the Mouse-Ventral2 dataset.
Figure 2—figure supplement 2. Example images from the datasets, part 2.

(A) Examples from the Mouse-Openfield dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F. (B) Similar to (A), except for the Fly dataset.
Figure 2—figure supplement 3. Example images from the datasets, part 3.

Examples from the Mouse-Homecage dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F.
Figure 2—figure supplement 4. Example images from the datasets, part 4.

Examples from the Mouse-Social dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F.
Figure 2—figure supplement 5. Example images from the datasets, part 5.

Examples from the Sturman-EPM dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F. All data from Sturman et al., 2020.
Figure 2—figure supplement 6. Example images from the datasets, part 6.

(A) Examples from the Sturman-FST dataset. Each row is three consecutive frames of the indicated behavior. Right columns: optic flow computed by TinyMotionNet and visualized as in Figure 1—figure supplement 1F. (B) Similar to (A), except for the Sturman-OFT dataset. All data from Sturman et al., 2020.

Video 1. DeepEthogram example from the Mouse-Ventral1 dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 2. DeepEthogram example from the Mouse-Ventral2 dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 3. DeepEthogram example from the Mouse-Openfield dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 4. DeepEthogram example from the Mouse-Homecage dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 5. DeepEthogram example from the Mouse-Social dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 6. DeepEthogram example from the Sturman-EPM dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 7. DeepEthogram example from the Sturman-FST dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 8. DeepEthogram example from the Sturman-OFT dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

Video 9. DeepEthogram example from the Fly dataset.

Video is from the test set. Top: raw image. Title indicates frame number in video. Tick legends indicate pixels. Middle: human labels. Black box indicates the current frame. Bottom: DeepEthogram predictions from a trained DeepEthogram-medium model.

We collected five datasets of mice in various behavioral arenas. The ‘Mouse-Ventral1’ and ‘Mouse-Ventral2’ datasets are bottom-up videos of a mouse in an open field and a small chamber, respectively (Figure 2A,B, Figure 2—figure supplement 1A, B, Videos 1–2). The ‘Mouse-Openfield’ dataset includes commonly used top-down videos of a mouse in an open arena (Figure 2C, Figure 2—figure supplement 2A, Video 3). The ‘Mouse-Homecage’ dataset consists of top-down videos of a mouse in its home cage with bedding, a hut, and two objects (Figure 2D, Figure 2—figure supplement 3, Video 4). The ‘Mouse-Social’ dataset consists of top-down videos of two mice interacting in an open arena (Figure 2E, Figure 2—figure supplement 4, Video 5). We also tested three datasets from published work by Sturman et al., 2020 that consist of mice in common behavior assays: the EPM, forced swim test (FST), and open field test (OFT) (Figure 2F–H, Figure 2—figure supplements 5–6, Videos 6–8). Finally, we tested a different species in the ‘Fly’ dataset, which includes side-view videos of a Drosophila melanogaster and aims to identify a coordinated walking pattern (Fujiwara et al., 2017; Figure 2I, Figure 2—figure supplement 2B, Video 9).

Collectively, these datasets include distinct view angles, a variety of illumination levels, and different resolutions and video qualities. They also pose various challenges for computer vision, including the subject occupying a small fraction of pixels (Mouse-Ventral1, Mouse-Openfield, Mouse-Homecage, Sturman-EPM, Sturman-OFT), complex backgrounds with non-uniform patterns (bedding and objects in Mouse-Homecage) or irrelevant motion (moving water in Sturman-FST), objects that occlude the subject (Mouse-Homecage), poor contrast of body parts (Mouse-Openfield, Sturman-EPM, Sturman-OFT), little motion from frame-to-frame (Fly, due to high video rate), and few training examples (Sturman-EPM, only five videos and only three that contain all behaviors). Furthermore, in most datasets, some behaviors of interest are rare and occur in only a few percent of the total video frames.

In each dataset, we labeled a behavior as present regardless of the location where it occurred and the orientation of the subject when it occurred. We did not note position or direction information, and we did not spatially crop the video frames or align the animal before training our model. In all datasets, we labeled the frames on which none of the behaviors of interest were present as ‘background,’ following the convention in computer vision. Each video in a dataset was recorded using a different individual mouse or fly, and thus training and testing the model across videos measured generalization across individual subjects. The video datasets and researcher annotations are available at the project website: https://github.com/jbohnslav/deepethogram (copy archived at swh:1:rev:ffd7e6bd91f52c7d1dbb166d1fe8793a26c4cb01), Bohnslav, 2021.

DeepEthogram achieves high performance approaching expert-level human performance

We split each dataset into three subsets: training, validation, and test (Materials and methods). The training set was used to update model parameters, such as the weights of the CNNs. The validation set was used to set appropriate hyperparameters, such as the thresholds used to turn the probabilities of each behavior into binary predictions about whether each behavior was present. The test set was used to report performance on new data not used in training the model. We generated five random splits of the data into training, validation, and test sets and averaged our results across these five splits, unless noted otherwise (Materials and methods). We computed three complementary metrics of model performance using the test set. First, we computed the accuracy, which is the fraction of elements of the [T, K] ethogram that were predicted correctly. We note that in theory accuracy could be high even if the model did not perform well on each behavior. For example, in the Mouse-Ventral2 dataset, some behaviors were incredibly rare, occurring in only ~2% of frames (Figure 2B). Thus, the model could in theory achieve ~98% accuracy simply by guessing that the behavior was absent on all frames. Therefore, we also computed the F1 score, a metric ranging from 0 (bad) to 1 (perfect) that takes into account the rates of false positives and false negatives. The F1 score is the harmonic mean of the precision and recall of the model. Precision is the fraction of frames labeled by the model as a given behavior that are actually that behavior (true positives/(true positives + false positives)). Recall is the fraction of frames actually having a given behavior that are correctly labeled as that behavior by the model (true positives/(true positives + false negatives)). We report the F1 score in the main figures and show precision and recall performance in the figure supplements. Because the accuracy and F1 score depend on our choice of a threshold to turn the probability of a given behavior on a given frame into a binary prediction about the presence of that behavior, we also computed the area under the receiver operating characteristic curve (AUROC), which summarizes performance as a function of the threshold.
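These metrics can be computed per behavior directly from the labeled and predicted ethograms. The sketch below does so with scikit-learn, shown only to make the metric definitions concrete; it is an assumed dependency and not the paper's evaluation code.

```python
# Sketch of per-behavior evaluation from a ground-truth and predicted [T, K] ethogram.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate_ethogram(y_true, y_prob, thresholds):
    """y_true: binary [T, K]; y_prob: predicted probabilities [T, K]; thresholds: [K]."""
    y_pred = (y_prob >= thresholds).astype(int)
    overall_accuracy = accuracy_score(y_true.ravel(), y_pred.ravel())
    per_behavior = []
    for k in range(y_true.shape[1]):
        # AUROC is undefined if a behavior never (or always) occurs in the test set
        auroc = (roc_auc_score(y_true[:, k], y_prob[:, k])
                 if len(np.unique(y_true[:, k])) > 1 else np.nan)
        per_behavior.append({
            "f1": f1_score(y_true[:, k], y_pred[:, k], zero_division=0),
            "precision": precision_score(y_true[:, k], y_pred[:, k], zero_division=0),
            "recall": recall_score(y_true[:, k], y_pred[:, k], zero_division=0),
            "auroc": auroc,  # threshold-independent summary
        })
    return overall_accuracy, per_behavior
```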

We first considered the entire ethogram, including all behaviors. DeepEthogram performed with greater than 85% accuracy on the test data for all datasets (Figure 3A). Overall metrics were calculated for each element of the ethogram. The model achieved high overall F1 scores, with high precision and recall (Figure 3B, Figure 3—figure supplement 1A, Figure 3—figure supplement 2A). Similarly, high overall performance was observed with the AUROC measures (Figure 3—figure supplement 3A). These results indicate that the model was able to capture the overall patterns of behavior in videos.

Figure 3. DeepEthogram performance.

All results are from the test sets only. (A) Overall accuracy for each model size and dataset. Error bars indicate mean ± SEM across five random splits of the data (three for Sturman-EPM). (B) Similar to (A), except for overall F1 score. (C) F1 score for DeepEthogram-medium for individual behaviors on the Mouse-Ventral1 dataset. Gray bars indicate shuffle (Materials and methods). *p≤0.05, **p≤0.01, ***p≤0.001, repeated measures ANOVA with a post-hoc Tukey’s honestly significant difference test. (D) Similar to (C), but for Mouse-Ventral2. Model and shuffle were compared with paired t-tests with Bonferroni correction. (E) Similar to (C), but for Mouse-Openfield. (F) Similar to (D), but for Mouse-Homecage. (G) Similar to (D), but for Mouse-Social. (H) Similar to (C), but for Sturman-EPM. (I) Similar to (C), but for Sturman-FST. (J) Similar to (C), but for Sturman-OFT. (K) Similar to (D), but for Fly dataset. (L) F1 score on individual behaviors (circles) for DeepEthogram-medium vs. human performance. Circles indicate the average performance across splits for behaviors in datasets with multiple human labels. Gray line: unity. Model vs. human performance: p=0.067, paired t-test. (M) Model F1 vs. the percent of frames in the training set with the given behavior. Each circle is one behavior for one split of the data. (N) Model accuracy on frames for which two human labelers agreed or disagreed. Paired t-tests with Bonferroni correction. (O) Similar to (N), but for F1. (P) Ethogram examples. Dark color indicates the behavior is present. Top: human labels. Bottom: DeepEthogram-medium predictions. The accuracy and F1 score for each behavior, and the overall accuracy and F1 scores are shown. Examples were chosen to be similar to the model’s average by behavior.

Figure 3—figure supplement 1. DeepEthogram performance, precision.

All results are from the test sets only. (A) Overall precision for each model size and dataset. Error bars indicate mean ± SEM across five random splits of the data (three for Sturman-EPM). (B) Precision for DeepEthogram-medium for individual behaviors on the Mouse-Ventral1 dataset. *p≤0.05, **p≤0.01, ***p≤0.001, repeated measures ANOVA with post-hoc Tukey’s honestly significant difference test. (C) Similar to (B), but for Mouse-Ventral2. Paired t-tests with Bonferroni correction. (D) Similar to (B), but for Mouse-Openfield. (E) Similar to (D), but for Mouse-Homecage. (F) Similar to (C), but for Mouse-Social. (G) Similar to (B), but for Sturman-EPM. (H) Similar to (B), but for Sturman-FST. (I) Similar to (B), but for Sturman-OFT. (J) Similar to (C), but for Fly dataset. (K) Precision on individual behaviors for DeepEthogram-medium vs. human performance. Circles are average performance across data splits for individual behaviors for all datasets with multiple human labels. Model performance vs. human performance: p=0.529, paired t-test. (L) Model precision vs. the percent of frames in the training set with the given behavior. Each point is for one behavior for one split of the data. (M) Model precision on frames for which two human labelers agreed or disagreed. Asterisks indicate p<0.05, paired t-test with Bonferroni correction.
Figure 3—figure supplement 2. DeepEthogram performance, recall.

All results are from the test sets only. (A) Overall recall for each model size and dataset. Error bars indicate mean ± SEM across five random splits of the data (three for Sturman-EPM). (B) Recall for DeepEthogram-medium for individual behaviors on the Mouse-Ventral1 dataset. *p≤0.05, **p≤0.01, ***p≤0.001, repeated measures ANOVA with post-hoc Tukey’s honestly significant difference test. (C) Similar to (B), but for Mouse-Ventral2. Paired t-tests with Bonferroni correction. (D) Similar to (B), but for Mouse-Openfield. (E) Similar to (D), but for Mouse-Homecage. (F) Similar to (C), but for Mouse-Social. (G) Similar to (B), but for Sturman-EPM. (H) Similar to (B), but for Sturman-FST. (I) Similar to (B), but for Sturman-OFT. (J) Similar to (C), but for Fly dataset. (K) Recall on individual behaviors for DeepEthogram-medium vs. human performance. Circles are average performance across data splits for individual behaviors for all datasets with multiple human labels. Model performance vs. human performance: p<0.035, paired t-test. (L) Model recall vs. the percent of frames in the training set with the given behavior. Each point is for one behavior for one split of the data. (M) Model recall on frames for which two human labelers agreed or disagreed. Asterisks indicate p<0.05, paired t-test with Bonferroni correction.
Figure 3—figure supplement 3. DeepEthogram performance, area under the receiver operating characteristic curve (AUROC).

All results are from the test sets only. (A) Overall AUROC for each model size and dataset. Error bars indicate mean ± SEM across five random splits of the data (three for Sturman-EPM). (B) AUROC for DeepEthogram-medium for individual behaviors on the Mouse-Ventral1 dataset. *p≤0.05, **p≤0.01, ***p≤0.001, paired t-test with Bonferroni correction. (C) Similar to (B), but for Mouse-Ventral2. (D) Similar to (B), but for Mouse-Openfield. (E) Similar to (B), but for Mouse-Homecage. (F) Similar to (B), but for Mouse-Social. (G) Similar to (B), but for Sturman-EPM. (H) Similar to (B), but for Sturman-FST. (I) Similar to (B), but for Sturman-OFT. (J) Similar to (B), but for Fly dataset. (K) Model AUROC vs. the percent of frames in the training set with the given behavior. Each point is for one behavior for one split of the data.
Figure 3—figure supplement 4. Ethogram examples for the Mouse-Ventral1 dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 5. Ethogram examples for the Mouse-Ventral2 dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 6. Ethogram examples for the Mouse-Openfield dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 7. Ethogram examples for the Mouse-Homecage dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 8. Ethogram examples for the Mouse-Social dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 9. Ethogram examples for the Sturman-EPM dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 10. Ethogram examples for the Sturman-FST dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 11. Ethogram examples for the Sturman-OFT dataset.

(A) An example ethogram with above-average performance, showing the human labels, estimated probabilities for each behavior from DeepEthogram-medium, and the thresholded and postprocessed predictions, for data from the test set. The accuracy and F1 score for each behavior are shown, along with the overall accuracy and overall F1 score. (B, C) Similar to (A), except for approximately average performance and below-average performance.
Figure 3—figure supplement 12. DeepEthogram exhibits position and heading invariance.

Nine randomly selected examples of the ‘face groom’ behavior from the Mouse-Openfield dataset. All examples were identified as ‘face groom’ by DeepEthogram-medium. The examples include different videos and different mice.

We also analyzed the model’s performance for each individual behavior. The model achieved F1 scores of 0.7 or higher for many behaviors, even reaching F1 scores above 0.9 in some cases (Figure 3C–K). DeepEthogram’s performance significantly exceeded chance levels of performance on nearly all behaviors across datasets (Figure 3C–K). Given that F1 scores may not be intuitive to understand in terms of their values, we examined individual snippets of videos with a range of F1 scores and found that F1 scores similar to the means for our datasets were consistent with overall accurate predictions (Figure 3P, Figure 3—figure supplements 411). We note that the F1 score is a demanding metric, and even occasional differences on single frames or a small number of frames can substantially decrease the F1 score. Relatedly, the model achieved high precision, recall, and AUROC values for individual behaviors (Figure 3—figure supplement 1B–J, Figure 3—figure supplement 2B–J, Figure 3—figure supplement 3B–J). The performance of the model depended on the frequency with which a behavior occurred (c.f. Figure 2 right panels and Figure 3C–J). Strikingly, however, performance was relatively high even for behaviors that occurred rarely, that is, in less than 10% of video frames (Figure 3M, Figure 3—figure supplement 1L, Figure 3—figure supplement 2L, and Figure 3—figure supplement 3K). The performance tended to be highest for DeepEthogram-slow and worst for DeepEthogram-fast, but the differences between model versions were generally small and varied across behaviors (Figure 3A,B, Figure 3—figure supplement 1A, Figure 3—figure supplement 2A, Figure 3—figure supplement 3A). The high-performance values are, in our opinion, impressive given that they were calculated based on single-frame predictions for each behavior, and thus performance will be reduced if the model misses the onset or offset of a behavior bout by even a single frame. These high values suggest that the model not only correctly predicted which behaviors happened and when but also had the resolution to correctly predict the onset and offset of bouts.

To better understand the performance of DeepEthogram, we benchmarked the model by comparing its performance to the degree of agreement between expert human labelers. Multiple researchers with extensive experience in monitoring and analyzing mouse behavior videos independently labeled the same set of videos for the Mouse-Ventral1, Mouse-Ventral2, Mouse-Openfield, Mouse-Social, and Mouse-Homecage datasets, allowing us to measure the consistency across human experts. Also, Sturman et al. released the labels of each of three expert human labelers (Sturman et al., 2020). The Fly dataset has more than 3 million frames and thus was too large to label multiple times. Human-human performance was calculated by defining one labeler as the ‘ground truth’ and the other labelers as ‘predictions’ and then computing the same performance metrics as for DeepEthogram. In this way, ‘human accuracy’ is the same as the percentage of scores on which two humans agreed. Strikingly, the overall accuracy, F1 scores, precision, and recall for DeepEthogram approached those of expert human labelers (Figure 3A, B, C, E, H, I, J and L, Figure 3—figure supplement 1A,B,D,G,H,I,K, Figure 3—figure supplement 2). In many cases, DeepEthogram’s performance was statistically indistinguishable from human-level performance, and in the cases in which humans performed better, the difference in performance was generally small. Notably, the behaviors for which DeepEthogram had the lowest performance tended to be the behaviors for which humans had less agreement (lower human-human F1 score) (Figure 3L, Figure 3—figure supplement 1K, Figure 3—figure supplement 2K). Relatedly, DeepEthogram performed best on the frames in which the human labelers agreed and performed worse on the frames in which humans disagreed (Figure 3N and O, Figure 3—figure supplement 1M, Figure 3—figure supplement 2M). Thus, there is a strong correlation between DeepEthogram and human performance, and the values for DeepEthogram’s performance approach those of expert human labelers.

The behavior with the worst model performance was ‘defecate’ from the Mouse-Openfield dataset (Figures 2C and 3E). Notably, defecation was incredibly rare, occurring in only 0.1% of frames. Furthermore, the act of defecation was not actually visible from the videos. Rather, human labelers marked the ‘defecate’ behavior when new fecal matter appeared, which involved knowledge of the foreground and background, tracking objects, and inferring unseen behavior. This type of behavior is expected to be challenging for DeepEthogram because the model is based on images and local motion and thus will fail when the behavior cannot be directly observed visually.

The model was able to accurately predict the presence of a behavior even when that behavior happened in different locations in the environment and with different orientations of the animal (Figure 3—figure supplement 12). For example, the model predicted face grooming accurately both when the mouse was in the top-left quadrant of the chamber and facing north and when the mouse was in the bottom-right quadrant facing west. This result is particularly important for many analyses of behavior that are concerned with the behavior itself, rather than where that behavior happens.

One striking feature was DeepEthogram’s high performance even on rare behaviors. From our preliminary work building up to the model presented here, we found that simpler models performed well on behaviors that occurred frequently and performed poorly on the infrequent behaviors. Given that, in many datasets, the behaviors of interest are infrequent (Figure 2), we placed a major emphasis on performance in cases with large class imbalances, meaning when some behaviors only occurred in a small fraction of frames. In brief, we accounted for class imbalances in the initialization of the model parameters (Materials and methods). We also changed the cost function to weight errors on rare classes more heavily than errors on common classes. We used a form of regularization specific to transfer learning to reduce overfitting. Finally, we tuned the threshold for converting the model’s probability of a given behavior into a classification of whether that behavior was present. Without these added features, the model simply learned to ignore rare classes. We consider these steps toward identifying rare behaviors to be of major significance for effective application in common experimental datasets.
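As one concrete illustration of two of these steps, the PyTorch sketch below weights errors on rare behaviors more heavily via a positive-class weight in the loss and initializes the classifier bias from the class base rates. The specific weighting scheme and initialization shown here are assumptions for illustration; the paper's exact formulation is described in Materials and methods.

```python
# Sketch of imbalance-aware loss weighting and bias initialization (illustrative only).
import torch
import torch.nn as nn

def make_imbalance_aware_head(features_dim, class_frequencies):
    """class_frequencies: length-K sequence with the fraction of frames containing each behavior."""
    p = torch.clamp(torch.as_tensor(class_frequencies, dtype=torch.float32), 1e-4, 1 - 1e-4)

    # weight positive examples by (1 - p) / p so rare behaviors contribute more to the loss
    pos_weight = (1 - p) / p
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    # initialize the classifier bias to log(p / (1 - p)) so that, before training,
    # the predicted probability of each behavior matches its base rate
    head = nn.Linear(features_dim, len(p))
    with torch.no_grad():
        head.bias.copy_(torch.log(p / (1 - p)))
    return head, criterion
```

The per-behavior decision thresholds mentioned above are then tuned separately on the validation set rather than fixed at 0.5.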

DeepEthogram accurately predicts behavior bout statistics

Because DeepEthogram produces predictions on individual frames, it allows for subsequent analyses of behavior bouts, such as the number of bouts, the duration of bouts, and the transition probability from one behavior to another. These statistics of bouts are often not available if researchers only record the overall time spent on a behavior with a stopwatch, rather than providing frame-by-frame labels. We found a strong correspondence for the statistics of behavior bouts between the predictions of DeepEthogram and those from human labels. We first focused on results at the level of individual videos for the Mouse-Ventral1 dataset, comparing the model predictions and human labels for the percent of time spent on each behavior, the number of bouts per behavior, and the mean bout duration (Figure 4A–C). Note that the model was trained on the labels from Human 1. For the time spent on each behavior, the model predictions and human labels were statistically indistinguishable (one-way ANOVA, p>0.05; Figure 4A). For the number of bouts and bout duration, the model was statistically indistinguishable from the labels of Human 1, on which it was trained. Some differences were present between the model predictions and the other human labels not used for training (Figure 4B,C). However, the magnitude of these differences was within the range of differences between the multiple human labelers (Figure 4B,C).
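Because the output is a frame-by-frame binary matrix, these bout statistics can be computed with a few lines of array code. The sketch below is one plausible post-hoc analysis, not the paper's analysis code: it extracts time spent, bout counts, mean bout duration, and a behavior-to-behavior transition matrix.

```python
# Sketch of bout statistics and transition probabilities from a binary [T, K] ethogram.
import numpy as np

def bout_statistics(ethogram):
    """ethogram: binary array [T, K]. Returns per-behavior time spent, bout count, mean bout length."""
    stats = []
    for k in range(ethogram.shape[1]):
        x = ethogram[:, k].astype(int)
        onsets = np.flatnonzero(np.diff(np.concatenate(([0], x))) == 1)    # 0 -> 1 transitions
        offsets = np.flatnonzero(np.diff(np.concatenate((x, [0]))) == -1)  # 1 -> 0 transitions
        durations = offsets - onsets + 1
        stats.append({
            "time_spent": x.mean(),                                        # fraction of frames
            "n_bouts": len(onsets),
            "mean_bout_frames": durations.mean() if len(onsets) else 0.0,
        })
    return stats

def transition_matrix(behavior_sequence, n_behaviors):
    """behavior_sequence: [T] integer behavior id per frame (one simplification of the ethogram)."""
    counts = np.zeros((n_behaviors, n_behaviors))
    for a, b in zip(behavior_sequence[:-1], behavior_sequence[1:]):
        if a != b:                                   # count only transitions between behaviors
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```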

Figure 4. DeepEthogram performance on bout statistics.

All results from DeepEthogram-medium, test set only. (A–C) Comparison of model predictions and human labels on individual videos from the Mouse-Ventral1 dataset. Each point is one behavior from one video. Colors indicate video ID. Error bars: mean ± SEM (n = 18 videos). Asterisks indicate p<0.05, one-way ANOVA with Tukey’s multiple comparison test. No asterisk indicates p>0.05. (D–F) Comparison of model predictions and human labels on all behaviors for all datasets. Each circle is one behavior from one dataset, averaged across splits of the data. Gray line: unity.

To summarize the performance of DeepEthogram on bout statistics for each behavior in all datasets, we averaged the time spent, number of bouts, and bout duration for each behavior across the five random splits of the data into train, validation, and test sets. This average provides a quantity similar to an average across multiple videos, and an average across multiple videos is likely how some end-users will report their results. The values from the model were highly similar to those from the labels on which it was trained (Human 1 labels) for the time spent per behavior, the number of bouts, and the mean bout duration (Figure 4D–F). Together, these results show that DeepEthogram accurately predicts bout statistics that might be of interest to biologist end-users.

DeepEthogram approaches expert-level human performance for bout statistics and transitions

Next, we benchmarked DeepEthogram’s performance on bout statistics by comparing its performance to the level of agreement between expert human labelers. We started by looking at the time spent on each behavior in single videos for the Mouse-Ventral1 and Sturman-OFT datasets. We compared the labels from Human 1 to the model predictions and to the labels from Humans 2 and 3 (Figure 5A,B). In general, there was strong agreement between the model and Human 1 and among human labelers (Figure 5A,B, left and middle). To directly compare model performance to human-human agreement, we plotted the absolute difference between the model and Human 1 versus the absolute difference between Human 1 and Humans 2 and 3 (Figure 5A,B, right). Model agreement was significantly worse than human-human agreement when considering individual videos. However, the magnitude of this difference was small, implying that discrepancies in behavior labels introduced by the model were only marginally larger than the variability between multiple human labelers.

Figure 5. Comparison of model performance to human performance on bout statistics.

All model data are from DeepEthogram-medium, test set data. r values indicate Pearson’s correlation coefficient. (A) Performance on Mouse-Ventral1 dataset for time spent. Each circle is one behavior from one video. Left: Human 1 vs. model. Middle: Human 1 vs. Humans 2 and 3. Both Humans 2 and 3 are shown on the y-axis. Right: absolute error between Human 1 and model vs. absolute error between Human 1 and each of Humans 2 and 3. Model difference vs. human difference: p<0.001, paired t-test. (B) Similar to (A), but for Sturman-OFT dataset. Right: model difference vs. human difference: p<0.001, paired t-test. (C–E) Performance on all datasets with multiple human labelers (Mouse-Ventral1, Mouse-Openfield, Sturman-OFT, Sturman-EPM, Sturman-FST). Each point is one behavior from one dataset, averaged across data splits. Performance for Humans 2 and 3 were averaged. Similar to Figure 4D–F, but only for datasets with multiple labelers. Left: Human 1 vs. model. Middle: Human 1 vs. Humans 2 and 3. Right: absolute error between Human 1 and model vs. absolute error between Human 1 and each of Humans 2 and 3. p>0.05, paired t-test with Bonferroni correction, in (C–E) right panels. (F–H) Example transition matrices for Mouse-Ventral1 dataset. For humans and models, transition matrices were computed for each data split and averaged across splits.

Figure 5—figure supplement 1. Performance of keypoint-based behavior classification on the Mouse-Openfield dataset.

(A) Left: keypoints identified, labeled, and predicted using DeepLabCut. Right: example keypoint sequence predicted by DeepLabCut from a held-out video. (B) Example images from held-out videos showing good DeepLabCut performance. (C) Histograms of behavioral features derived from keypoints for each behavior. (D) Accuracy on the test set. Error bars: mean ± SEM, n = 5 data splits. *p≤0.05, **p≤0.01, ***p≤0.001, repeated measures ANOVA with post-hoc Tukey’s honestly significant difference test. Human vs. shuffle results not shown for clarity. (E) Similar to (D), but for F1. (F) Accuracy of keypoint-based behavioral classification vs. DeepEthogram. Each point is one behavior from one model type (colored as in D) and one data split. (G) Similar to (F), but for F1. (H) Human vs. model time spent exhibiting each behavior. Each point is one behavior from one model type, averaged across data splits. (I) Similar to (H), but for average bout number. (J) Similar to (H), but for average frames per bout.
Figure 5—figure supplement 2. Comparison with unsupervised methods.

(A) B-SOiD pipeline. (B) B-SOiD behavioral space. Shown is a random sample of points that B-SOiD labeled confidently (57% of total data). Left: colors are B-SOiD cluster assignments. Right: colors (0–5) indicate human-labeled behaviors. Note the overall lack of clustering of human-identified colors. (C) B-SOiD classifier confusion matrix. X-axis: label predicted by the B-SOiD classifier (random forest). Note the good performance; the classifier successfully recaptures the HDBSCAN clustering, indicating that B-SOiD is performing as expected. (D) Comparison between B-SOiD cluster assignments and human labels. Left: each element is the proportion of B-SOiD clusters that correspond to the given human label. Rows sum to one. Right: each element is the proportion of human labels corresponding to the given B-SOiD cluster. Columns sum to one. Red outlines indicate the B-SOiD cluster with maximum correspondence to the human label. Note the overall lack of a consistent structure between human-identified behaviors and B-SOiD clusters. One exception: 74% of cluster six corresponds to the ‘rearing’ behavior. (E) Performance comparison between the unsupervised pipeline and DeepEthogram-fast. *p≤0.05, **p≤0.01, ***p≤0.001, repeated measures ANOVA with post-hoc Tukey’s honestly significant difference test. Human vs. shuffle results not shown for clarity.

To summarize our benchmarking of model performance on bout statistics for each behavior in all datasets with multiple human labelers, we again averaged the time spent, number of bouts, and bout duration for each behavior across the five random splits of the data to obtain a quantity similar to an average across multiple videos (Figure 5C–E). For time spent per behavior, number of bouts, and mean bout length, the human-model differences were similar in magnitude to, and not significantly different from, the differences between humans (Figure 5C–E, right column). The transition probabilities between behaviors were also broadly similar between Human 1, Human 2, and the model (Figure 5F–H). Furthermore, model-human differences and human-human differences were significantly correlated (Figure 5C–E, right column), again showing that DeepEthogram models are more reliable in situations in which multiple human labelers agree (see Figure 3N–O, Figure 3—figure supplement 1M, Figure 3—figure supplement 2M).

Therefore, the results from Figure 5A,B indicate that the model predictions are noisier than human-human agreement on the level of individual videos. However, when averaged across multiple videos (Figure 5C–E), this noise averages out and results in similar levels of variability for the model and multiple human labelers. Given that DeepEthogram performed slightly worse on F1 scores relative to expert humans but performed similarly to humans on bout statistics, it is possible that for rare behaviors DeepEthogram misses a small number of bouts, which would minimally affect bout statistics but could decrease the overall F1 score.

Together, our results from Figures 3–5 and Figure 3—figure supplements 1–3 indicate that DeepEthogram's predictions closely match the labels defined by expert human researchers. Further, these model predictions allow easy post-hoc analysis of additional statistics of behaviors, which may be challenging to obtain with traditional manual methods.

Comparison to existing methods based on keypoint tracking

While DeepEthogram operates directly on the raw pixel values in the videos, other methods exist that first track body keypoints and then perform behavior classification based on these keypoints (Segalin, 2020; Kabra et al., 2013; Nilsson, 2020; Sturman et al., 2020). One such approach that is appealing due to its simplicity and clarity was developed by Sturman et al. and was shown to be superior to commercially available alternatives (Sturman et al., 2020). In their approach, DeepLabCut (Mathis, 2018) is used to estimate keypoints and then a multilayer perceptron architecture is used to classify features of these keypoints into behaviors. We compared the performance of DeepEthogram and this alternate approach using our custom implementation of the Sturman et al. methods (Figure 5—figure supplement 1). We focused our comparison on the Mouse-Openfield dataset, which is representative of videos used in a wide range of biological studies. We used DeepLabCut (Mathis, 2018) to identify the position of the four paws, the base of the tail, the tip of the tail, and the nose. These keypoints could be used to distinguish behaviors. For example, the distance between the nose and the base of the tail was highest when the mouse was locomoting (Figure 5—figure supplement 1C). However, the accuracy and F1 scores for DeepEthogram generally exceeded those identified from classifiers based on features of these keypoints (Figure 5—figure supplement 1D–G). For bout statistics, the two methods performed similarly well (Figure 5—figure supplement 1F–J). Thus, for at least one type of video and dataset, DeepEthogram outperformed an established approach.

There are several reasons why DeepEthogram might have done better on accuracy and F1 score. First, the videos tested were relatively low resolution, which restricted the number of keypoints on the mouse’s body that could be labeled. High-resolution videos with more keypoints may improve the keypoint-based classification approach. Second, our videos were recorded with a top-down view, which means that the paw positions were often occluded by the mouse’s body. A bottom-up or side view could allow for better identification of keypoints and may result in improved performance for the keypoint-based methods.

An alternative approach to DeepEthogram and other supervised classification pipelines could be to use an unsupervised behavior classification followed by human labeling of behavior clusters. In this approach, an unsupervised algorithm identifies behavior clusters without user input, and then the researcher identifies the cluster that most resembles their behavior of interest (e.g., ‘cluster 3 looks like face grooming’). The advantage of this approach is that it involves less researcher time due to the lack of supervised labeling. However, this approach is not designed to identify predefined behaviors of interest and thus, in principle, might not be well suited for the goal of supervised classification. We tested one such approach starting with the Mouse-Openfield dataset and DeepLabCut-generated keypoints (Figure 5—figure supplement 1A, C). We used B-SoID (Hsu and Yttri, 2019), an unsupervised classification pipeline for animal behavior, which identified 11 behavior clusters for this dataset (Figure 5—figure supplement 2A, B). These clusters were separable in a low-dimensional behavior space (Figure 5—figure supplement 2B), and B-SoID’s fast approximation algorithm showed good performance (Figure 5—figure supplement 2C). For every frame in our dataset, we had human labels, DeepEthogram predictions, and B-SoID cluster assignments. By looking at the joint distributions of B-SoID clusters and human labels, there appeared to be little correspondence (Figure 5—figure supplement 2D). To assign human labels to B-SoID clusters, for each researcher-defined behavior, we picked the B-SoID cluster that had the highest overlap with the behavior of interest (red boxes, Figure 5—figure supplement 2D, right). We then evaluated these ‘predictions’ compared to DeepEthogram. For most behaviors, DeepEthogram performed better than this alternative pipeline (Figure 5—figure supplement 2E).

We note that the unsupervised clustering with post-hoc assignment of human labels is not the use for which B-SoID (Hsu and Yttri, 2019) and other unsupervised algorithms (Wiltschko, 2015; Berman et al., 2014) were designed. Unsupervised approaches are designed to discover repeated behavior motifs directly from data, without humans predefining the behaviors of interest (Datta et al., 2019; Egnor and Branson, 2016), and B-SoID succeeded in this goal. However, if one’s goal is the automatic labeling of human-defined behaviors, our results show that DeepEthogram or other supervised machine learning approaches are better choices.

DeepEthogram requires little training data to achieve high performance

We evaluated how much data a user must label to train a reliable model. We selected 1, 2, 4, 8, 12, or 16 random videos for training and used the remaining videos for evaluation. We only required that each training set had at least one frame of each behavior. We trained the feature extractors, extracted the features, and trained the sequence models for each split of the data into training, validation, and test sets. We repeated this process five times for each number of videos, resulting in 30 trained models per dataset. Given the large number of dataset variants for this analysis, to reduce overall computation time, we used DeepEthogram-fast and focused on only the Mouse-Ventral1, Mouse-Ventral2, and Fly datasets. Also, we trained the flow generator only once and kept it fixed for all experiments. For all but the rarest behaviors, the models performed at high levels even with only one labeled video in the training set (Figure 6A–C). For all the behaviors studied across datasets, the performance measured as accuracy or F1 score approached seemingly asymptotic levels after training on approximately 12 videos. Therefore, a training set of this size or less is likely sufficient for many cases.

Figure 6. DeepEthogram performance as a function of training set size.


(A) Accuracy (top) and F1 score (bottom) for DeepEthogram-fast as a function of the number of videos in the training set for Mouse-Ventral1, shown for each behavior separately. The mean is shown across five random selections of training videos. (B, C) Similar to (A), except for the Mouse-Ventral2 dataset and Fly dataset. (D) Accuracy of DeepEthogram-fast as a function of the number of frames with the behavior of interest in the training set. Each point is one behavior for one random split of the data, across datasets. The black line shows the running average. For reference, 10⁴ frames is ~5 min of behavior at 30 frames per second. (E) Similar to (D), except for F1 score. (F) Accuracy for the predictions of DeepEthogram-fast using the feature extractors only or using the sequence model. Each point is one behavior from one split of the data, across datasets, for the splits used in (D, E). (G) Similar to (F), except for F1 score.

We also analyzed the model’s performance as a function of the number of frames of a given behavior present in the training set. For each random split, dataset, and behavior, we had a wide range of the number of frames containing a behavior of interest. Combining all these splits, datasets, and behaviors together, we found that the model performed with more than 90% accuracy when trained with only 80 example frames of a given behavior and over 95% accuracy with only 100 positive example frames (Figure 6D). Furthermore, DeepEthogram achieved an F1 score of 0.7 with only 9000 positive example frames, which corresponds to about 5 min of example behavior at 30 frames per second (Figure 6E, see Figure 3P for an example of ~0.7 F1 score). The total number of frames required to reach this number of positive example frames depends on how frequent the behavior is. If the behavior happens 50% of the time, then 18,000 total frames are required to reach 9000 positive example frames. Instead, if the behavior occurs 10% of the time, then 90,000 total frames are required. In addition, when the sequence model was used instead of using the predictions directly from the feature extractors, model performance was higher (Figure 6F,G) and required less training data (data not shown), emphasizing the importance of using long timescale information in the prediction of behaviors. Therefore, DeepEthogram models require little training to achieve high performance. As expected, as more training data are added, the performance of the model improves, but this rather light dependency on the amount of training data makes DeepEthogram amenable for even small-scale projects.

DeepEthogram allows rapid inference time

A key aspect of the functionality of the software is the speed with which the models can be trained and predictions about behaviors made on new videos. Although the versions of DeepEthogram vary in speed, they are all fast enough to allow functionality in typical experimental pipelines. On modern computer hardware, the flow generator and feature extractors can be trained in approximately 24 hr. In many cases, these models only need to be trained once. Afterwards, performing inference to make predictions about the behaviors present on each frame can be performed at ~150 frames per second for videos at 256 × 256 resolution for DeepEthogram-fast, at 80 frames per second for DeepEthogram-medium, and 13 frames per second for DeepEthogram-slow (Table 1). Thus, for a standard 30 min video collected at 60 frames per second, inference could be finished in 12 min for DeepEthogram-fast or 2 hr for DeepEthogram-slow. Importantly, the training of the models and the inference involve zero user time because they do not require manual input or observation from the user. Furthermore, this speed is rapid enough to get results quickly after experiments to allow fast analysis and experimental iteration. However, the inference time is not fast enough for online or closed-loop experiments.

Table 1. Inference speed.

Inference speed in frames per second (FPS). DEG_f, DEG_m, DEG_s: DeepEthogram-fast, -medium, and -slow.

| Dataset | Resolution | Titan RTX (DEG_f / DEG_m / DEG_s) | GeForce 1080 Ti (DEG_f / DEG_m / DEG_s) |
| Mouse-Ventral1 | 256 × 256 | 235 / 128 / 34 | 152 / 76 / 13 |
| Mouse-Ventral2 | 256 × 256 | 249 / 132 / 34 | 157 / 79 / 13 |
| Mouse-Openfield | 256 × 256 | 211 / 117 / 33 | 141 / 80 / 13 |
| Mouse-Homecage | 352 × 224 | 204 / 102 / 28 | 132 / 70 / 11 |
| Mouse-Social | 224 × 224 | 324 / 155 / 44 | 204 / 106 / 17 |
| Sturman-EPM | 256 × 256 | 240 / 123 / 34 | 157 / 83 / 13 |
| Sturman-FST | 224 × 448 | 157 / 75 / 21 | 106 / 51 / 9 |
| Sturman-OFT | 256 × 256 | 250 / 125 / 34 | 159 / 84 / 13 |
| Flies | 128 × 192 | 623 / 294 / 89 | 378 / 189 / 33 |

A GUI for beginning-to-end management of experiments

We developed a GUI for labeling videos, training models, and running inference (Figure 7). Our GUI is similar in behavior to those for BORIS (Friard et al., 2016) and JAABA (Kabra et al., 2013). To train DeepEthogram models, the user first defines which behaviors of interest they would like to detect in their videos. Next, the user imports a few videos into DeepEthogram, which automatically calculates video statistics and organizes them into a consistent file structure. Then the user clicks a button to train the flow generator model, which occurs without user time. While this model is training, the user can go through a set of videos frame-by-frame and label the presence or absence of all behaviors in these videos. Labeling is performed with simple keyboard or mouse clicks at the onset and offset of a given behavior while scrolling through a video in a viewing window. After a small number of videos have been labeled and the flow generator is trained, the user then clicks a button to train the feature extractors, which occurs without user input and saves the extracted features to disk. Finally, the sequence model can be trained automatically on these saved features by clicking another button. All these training steps could in many cases be performed once per project. With these trained models, the user can import new videos and click the predict button, which estimates the probability of each behavior on each frame. This GUI therefore presents a single interface for labeling videos, training models, and generating predictions on new videos. Importantly, this interface requires no programming by the end-user.

Figure 7. Graphical user interface.


(A) Example DeepEthogram window with training steps highlighted. (B) Example DeepEthogram window with inference steps highlighted.

The GUI also includes an option for users to manually check and edit the predictions output by the model. The user can load into the GUI a video and predictions made by the model. By scrolling through the video, the user can see the predicted behaviors for each frame and update the labels of the behavior manually. This allows users to validate the accuracy of the model and to fix errors should they occur. This process is expected to be fast because the large majority of frames are expected to be labeled correctly, based on our accuracy results, so the user can focus on the small number of frames associated with rare behaviors or behaviors that are challenging to detect automatically. Importantly, these new labels can then be used to retrain the models to obtain better performance on future experimental videos. Documentation for the GUI will be included on the project’s website.

Discussion

We developed a method for automatically classifying each frame of a video into a set of user-defined behaviors. Our open-source software, called DeepEthogram, provides the code and user interface necessary to label videos and train DeepEthogram models. We show that modern computer vision methods for action detection based on pretrained deep neural networks can be readily applied to animal behavior datasets. DeepEthogram performed well on multiple datasets and generalized across videos and animals, even for identifying rare behaviors. Importantly, by design, CNNs ignore absolute spatial location and thus are able to identify behaviors even when animals are in different locations and orientations within a behavioral arena (Figure 3—figure supplements 1 and 2). We anticipate this software package will save researchers great amounts of time, will lead to more reproducible results by eliminating inter-researcher variability, and will enable experiments that may otherwise not be possible by increasing the number of experiments a lab can reasonably perform or the number of behaviors that can be investigated. DeepEthogram joins a growing community of open-source computer vision applications for biomedical research (Datta et al., 2019; Anderson and Perona, 2014; Egnor and Branson, 2016).

The models presented here performed well for all datasets tested. In general, we expect the models will perform well in cases in which there is a high degree of agreement between separate human labelers, as our results in Figures 3–5 indicate. As we have shown, the models do better with more training data. We anticipate that a common use of DeepEthogram will be to make automated predictions for each video frame followed by rapid and easy user-based checking and editing of the labels in the GUI for the small number of frames that may be inaccurately labeled. We note that these revised labels can then be used as additional training data to continually update the models and thus improve the performance on subsequent videos.

One of our goals for DeepEthogram was to make it general-purpose and applicable to all videos with behavior labels. DeepEthogram operates directly on the raw video pixels, which is advantageous because preprocessing is not required and the researcher does not need to make decisions about which features of the animal to track. Skeleton-based action recognition models, in which keypoints are used to predict behaviors, require a consistent skeleton as their input. A crucial step in skeleton-based action recognition is feature engineering, which means turning the x and y coordinates of keypoints (such as paws or joints) into features suitable for classification (such as the angle of specific joints). With different skeletons (such as mice, flies, or humans) or numbers of animals (one or more), these features must be carefully redesigned. Using raw pixel values as inputs to DeepEthogram allows for a general-purpose pipeline that can be applied to videos of all types, without the need to tailor preprocessing steps depending on the behavior of interest, species, number of animals, video angles, resolution, and maze geometries. However, DeepEthogram models are not expected to generalize across videos that differ substantially in any of these parameters. For example, models that detect grooming in top-down videos are unlikely to identify grooming in side-view videos.

Because DeepEthogram is a general-purpose pipeline, it will not perform as well as pipelines that are engineered for a specific task, arena, or species. For example, MARS was exquisitely engineered for social interactions between a black and white mouse (Segalin, 2020) and thus is expected to outperform DeepEthogram on videos of this type. Moreover, because DeepEthogram operates on raw pixels, it is possible that our models may perform more poorly on zoomed-out videos in which the animal is only a few pixels. Also, if the recording conditions change greatly, such as moving the camera or altering the arena background, it is likely that DeepEthogram will have to be retrained.

An alternate approach is to use innovative methods for estimating pose, including DeepLabCut (Mathis, 2018; Nath, 2019; Lauer, 2021), LEAP (Pereira, 2018b), and others (Graving et al., 2019), followed by frame-by-frame classification of behaviors based on pose in a supervised (Segalin, 2020; Nilsson, 2020; Sturman et al., 2020) or unsupervised (Hsu and Yttri, 2019) way. Using pose for classification could make behavior classifiers faster to train, less susceptible to overfitting, and less demanding of computational resources. Using pose as an intermediate feature could allow the user to more easily assess model performance. Depending on the design, such skeleton-based action recognition could aid multi-animal experiments by more easily predicting behaviors separately for each animal, as JAABA does (Kabra et al., 2013). While we demonstrated that DeepEthogram can accurately identify social interactions, it does not have the ability to track the identities of multiple mice and identify behaviors separately for each mouse. Furthermore, tracking keypoints on the animal gives valuable, human-understandable information for further analysis, such as time spent near the walls of an arena, distance traveled, and measures of velocity (Pennington, 2019; Sturman et al., 2020). In addition, specific aspects of an animal’s movements, such as limb positions and angles derived from keypoint tracking, can be directly related to each behavior of interest, providing an additional layer of interpretation and analysis of the behavior. DeepEthogram does not track parts of the animal’s body or position and velocity information, and instead it focuses only on the classification of human-defined behaviors. Finally, because DeepEthogram uses 11 frames at a time for inputs, as well as relatively large models, it is not easily applicable to real-time applications, such as the triggering of optogenetic stimulation based on ongoing behaviors.

DeepEthogram may prove to be especially useful when a large number of videos or behaviors need to be analyzed in a given project. These cases could include drug discovery projects or projects in which multiple genotypes need to be compared. Additionally, DeepEthogram could be used for standardized behavioral assays, such as those run frequently in a behavioral core facility or across many projects with standardized conditions. Importantly, whereas user time scales linearly with the number of videos for manual labeling of behaviors, user time for DeepEthogram is limited to only the labeling of initial videos for training the models and then can involve essentially no time on the user’s end for all subsequent movies. In our hands, it took approximately 1–3 hr for an expert researcher to label five behaviors in a 10 min movie from the Mouse-Openfield dataset. This large amount of time was necessary for researchers to scroll back and forth through a movie to mark behaviors that are challenging to identify by eye. If only approximately 10 human-labeled movies are needed for training the model, then only approximately 10–30 hr of user time would be required. Subsequently, tens of movies could be analyzed, across projects with similar recording conditions, without additional user time. DeepEthogram does require a fair amount of computer time (see Inference time above, Materials and methods); however, we believe that trading increasingly cheap and available computer time for valuable researcher effort is worthwhile. Notably, the use of DeepEthogram should make results more reproducible across studies and reduce variability imposed by inter-human labeling differences. Furthermore, in neuroscience experiments, DeepEthogram could aid identification of the starts and stops of behaviors to relate to neural activity measurements or manipulations.

Future extensions could continue to improve the accuracy and utility of DeepEthogram. First, DeepEthogram could be easily combined with an algorithm to track an animal’s location in an environment (Pennington, 2019), thus allowing the identification of behaviors of interest and where those behaviors occur. Also, it would be interesting to use DeepEthogram’s optic flow snippets as inputs to unsupervised behavior pipelines, where they could help to uncover latent structure in animal behavior (Wiltschko, 2015; Berman et al., 2014; Batty, 2019). In addition, while the use of CNNs for classification is standard practice in machine learning, recent works in temporal action detection use widely different sequence modeling approaches and loss functions (Piergiovanni and Ryoo, 2018; Zeng, 2019; Monfort, 2020). Testing these different approaches in the DeepEthogram pipeline could further improve performance. Importantly, DeepEthogram was designed in a modular way to allow easy incorporation of new approaches as they become available. While inference is already fast, further development could improve inference speed by using low-precision weights, model quantization, or pruning. Furthermore, although our model is currently designed for temporal action localization, DeepEthogram could be extended by incorporating models for spatiotemporal action localization, in which there can be multiple actors (i.e., animals) performing different behaviors on each frame.

Materials and methods

DeepEthogram pipeline

Along with this publication, we are releasing open-source Python code for labeling videos, training all DeepEthogram models, and performing inference on new videos. The code, associated documentation, and files for the GUI can be found at https://github.com/jbohnslav/deepethogram.

Implementation

We implemented DeepEthogram in the Python programming language (version 3.7 or later; Rossum et al., 2010). We used PyTorch (Paszke, 2018; version 1.4.0 or greater) for all deep-learning models. We used PyTorch Lightning for training (Falcon, 2019). We used OpenCV (Bradski, 2008) for image reading and writing and Kornia (Riba et al., 2019) for GPU-based image augmentations. We used scikit-learn (Pedregosa, 2021) for evaluation metrics, along with custom Python code. The CNN diagram in Figure 1 was generated using PlotNeuralNet (Iqbal, 2018). Other figures were generated in Matplotlib (Caswell, 2021). For training, we used one of the following Nvidia GPUs: GeForce 1080 Ti, Titan RTX, Quadro RTX 6000, or Quadro RTX 8000. Inference speed was evaluated on a computer running Ubuntu 18.04 with an AMD Ryzen Threadripper 2950X CPU, an Nvidia Titan RTX, an Nvidia GeForce 1080 Ti, a Samsung 970 Evo solid-state drive, and 128 GB of DDR4 memory.

Datasets

All experimental procedures were approved by the Institutional Animal Care and Use Committees at Boston Children’s Hospital (protocol numbers 17-06-3494R and 19-01-3809R) or Massachusetts General Hospital (protocol number 2018N000219) and were performed in compliance with the Guide for the Care and Use of Laboratory Animals.

For human-human comparison, we relabeled all videos for Mouse-Ventral1, Mouse-Ventral2, Mouse-Openfield, Mouse-Social, and Mouse-Homecage using the DeepEthogram GUI. Previous labels were not accessible during relabeling. Criteria for relabeling were written in detail by the original experimenters, and example labeled videos were viewed extensively before relabeling. Mouse-Ventral1 was labeled three times and the other datasets were labeled twice.

Videos and human annotations are available at the project website: https://github.com/jbohnslav/deepethogram.

Kinetics700

To pretrain our models for transfer to neuroscience datasets, we used the Kinetics700 dataset (Carreira et al., 2019). The training split of this dataset consisted of 538,523 videos and 141,677,361 frames. We first resized each video so that the short side was 256 pixels. During training, we randomly cropped 224 × 224 pixel images, and during validation, we used the center crop.
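For illustration, a minimal sketch of this resize-and-crop scheme using torchvision transforms (the released DeepEthogram code may implement it differently):

```python
import torchvision.transforms as T

# Resize so the short side is 256 pixels, then crop to 224 x 224
train_transform = T.Compose([
    T.Resize(256),        # short side -> 256, aspect ratio preserved
    T.RandomCrop(224),    # random 224 x 224 crop during training
    T.ToTensor(),
])

val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),    # deterministic center crop during validation
    T.ToTensor(),
])
```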

Mouse-Ventral1

Recordings of voluntary behavior were acquired for 14 adult male C57BL/6J mice on the PalmReader device (Roberson et al., submitted). In brief, images were collected with infrared illumination and frustrated total internal reflectance (FTIR) illumination on alternate frames. The FTIR channel highlighted the parts of the mouse’s body that were in contact with the floor. We stacked these channels into an RGB frame: red corresponded to the FTIR image, green corresponded to the infrared image, and blue was the pixel-wise mean of the two. In particular, images were captured as a ventral view of mice placed within an opaque 18 cm long × 18 cm wide × 15 cm high chamber with a 5-mm-thick borosilicate glass floor using a Basler acA2000-50gmNIR GigE near-infrared camera at 25 frames per second. Animals were illuminated from below using nonvisible 850 nm near-infrared LED strips. All mice were habituated to investigator handling in short (~5 min) sessions and then habituated to the recording chamber in two sessions lasting 2 hr on separate days. On recording days, mice were habituated in a mock recording chamber for 45 min and then moved by an investigator to the recording chamber for 30 min. Each mouse was recorded in two of these sessions spaced 72 hr apart. The last 10 min of each recording was manually scored on a frame-by-frame basis for defined actions using a custom interface implemented in MATLAB. The 28 approximately 10 min videos totaled 419,846 frames (and labels) in the dataset. Data were recorded at 1000 × 1000 pixels and down-sampled to 250 × 250 pixels. We resized to 256 × 256 pixels using bilinear interpolation during training and inference.
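For illustration, a minimal NumPy sketch of this channel stacking (array names are hypothetical; the actual acquisition code was device-specific):

```python
import numpy as np

def stack_palmreader_channels(ftir_frame, ir_frame):
    """ftir_frame and ir_frame: single-channel uint8 images from alternate frames (hypothetical names)."""
    blue = ((ftir_frame.astype(np.uint16) + ir_frame.astype(np.uint16)) // 2).astype(np.uint8)
    # red = FTIR (floor-contact) image, green = infrared image, blue = pixel-wise mean of the two
    return np.stack([ftir_frame, ir_frame, blue], axis=-1)
```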

Mouse-Ventral2

Recordings of voluntary behavior were acquired for 16 adult male and female C57BL/6J mice. These data were collected on the iBob device. Briefly, the animals were enclosed in a device containing an opaque six-chambered plastic enclosure atop a glass floor. The box was dark and illuminated with only infrared light. Animals were habituated for 1 hr in the device before being removed to clean the enclosure. They were then habituated for another 30 min and recorded for 30 min. Recorded mice were either wild type or contained a genetic mutation predisposing them to dermatitis. Thus, scratching and licking behavior were scored. Up to six mice were imaged from below simultaneously and subsequently cropped to a resolution of 270 × 240 pixels. Images were resized to 256 × 256 pixels during training and inference. Data were collected at 30 frames per second. There were 16 approximately 30 min videos for a total of 863,232 frames.

Mouse-Openfield

Videos for the Mouse-Openfield dataset were obtained from published studies (Orefice, 2019; Orefice, 2016) and unpublished work (Clausing et al., unpublished). Video recordings of voluntary behavior were acquired for 20 adult male mice.

All mice were exposed to a novel empty arena (40 cm × 40 cm × 40 cm) with opaque plexiglass walls. Animals were allowed to explore the arena for 10 min, under dim lighting. Videos were recorded via an overhead-mounted camera at either 30 or 60 frames per second. Videos were acquired with 2–4 mice simultaneously in separate arenas and cropped with a custom Python script such that each video contained the behavioral arena for a single animal. Prior to analysis, some videos were brightened in FIJI (Schindelin, 2012), using empirically determined display range cutoffs that maximized the contrast between the mouse’s body and the walls of the arena. Twenty of the 10 min recordings were manually scored on a frame-by-frame basis for defined actions in the DeepEthogram interface. All data were labeled by an experimenter. The 20 approximately 10 min videos totaled 537,534 frames (and labels).

Mouse-Homecage

Videos for the mouse home cage behavior dataset were obtained from unpublished studies (Clausing et al., unpublished). Video recordings of voluntary behavior were acquired for 12 adult male mice. All animals were group-housed in cages containing four total mice. On the day of testing, all mice except for the experimental mouse were temporarily removed from the home cage for home cage behavior testing. For these sessions, experimental mice remained alone in their home cages, which measured 28 cm × 16.5 cm × 12.5 cm and contained bedding and nesting material. For each session, two visually distinct novel wooden objects and one novel plastic igloo were placed into the experimental mouse's home cage. Animals were allowed to interact with the igloo and objects for 10 min, under dim lighting. Videos were recorded via an overhead-mounted camera at 60 frames per second. Videos were acquired of two mice simultaneously in separate home cages. Following recordings, videos were cropped using a custom Python script such that each video contained the home cage for a single animal. Prior to analysis, all videos were brightened in FIJI (Schindelin, 2012), using empirically determined display range cutoffs that maximized the contrast between each mouse's body and the bedding and walls of the home cage. Twelve of the 10 min recordings were manually scored on a frame-by-frame basis for defined actions in the DeepEthogram interface. Data were labeled by two experimenters. The 12 approximately 10 min videos totaled 438,544 frames (and labels).

Mouse-Social

Videos for the mouse reciprocal social interaction test dataset were obtained from unpublished studies (Clausing et al., unpublished; Dai et al., unpublished). Video recordings of voluntary behavior were acquired for 12 adult male mice. All mice were first habituated to a novel empty arena (40 cm × 40 cm × 40 cm) with opaque plexiglass walls for 10 min per day for two consecutive days prior to testing. For each test session, two sex-, weight-, and age-matched mice were placed into the same arena. Animals were allowed to explore the arena for 10 min under dim lighting. Videos were recorded via an overhead-mounted camera at 60 frames per second. Videos were acquired with 2–4 pairs of mice simultaneously in separate arenas. Following recordings, videos were cropped using a custom Python script, such that each video only contained the behavioral arena for two interacting animals. Prior to analysis, all videos were brightened in FIJI (Schindelin, 2012) using empirically determined display range cutoffs that maximized the contrast between each mouse's body and the walls of the arena. Twelve of the 10 min recordings were manually scored on a frame-by-frame basis for defined actions in the DeepEthogram interface. Data were labeled by two experimenters. The 12 approximately 10 min videos totaled 438,544 frames (and labels).

Sturman datasets

All Sturman datasets are from Sturman et al., 2020. For more details, please read their paper. Videos were downloaded from an online repository (https://zenodo.org/record/3608658#.YFt8-f4pCEA). Labels were downloaded from GitHub (https://github.com/ETHZ-INS/DLCAnalyzer; von Ziegler, 2021). We arbitrarily chose 'Jin' as the labeler for model training; the other labelers were used for human-human evaluation (Figures 4 and 5).

Sturman-EPM

This dataset consists of five videos. Only three contain at least one example of every behavior. Therefore, we could only perform three random train-validation-test splits for this dataset (as our approach requires at least one example in each set). Images were resized to 256 × 256 during training and inference. Images were flipped up-down and left-right, each with a probability of 0.5.

Sturman-FST

This dataset consists of 10 recordings. Each recording has two videos, one top-down and one side view. To make this multiview dataset suitable for DeepEthogram, we closely cropped the mice in each view, resized each to 224 × 224, and concatenated them horizontally so that the final resolution was 448 × 224. We did not perform flipping augmentation.

Sturman-OFT

This dataset consists of 20 videos. Images were resized to 256 × 256 for training and inference. Images were flipped up-down and left-right with probability 0.5 during training.

Fly

Wild-type DL adult male flies (D. melanogaster), 2–4 days post-eclosion, were reared on a standard fly medium and kept on a 12 hr light-dark cycle at 25°C. Flies were cold-anesthetized and placed in a fly sarcophagus. We glued the fly head to its thorax and finally to a tungsten wire at an angle of around 60° (UV-cured glue, Bondic). The wire was placed in a micromanipulator used to position the fly on top of an air-suspended ball. Side-view images of the fly were collected at 200 Hz with a Basler A602f camera. Videos were down-sampled to 100 Hz. There were 19 approximately 30 min videos for a total of 3,419,943 labeled frames. Images were acquired at 168 × 100 pixels and up-sampled to 192 × 128 pixels during training and inference. Images were acquired in grayscale but converted to RGB (cv2.cvtColor, cv2.COLOR_GRAY2RGB) so that input channels were compatible with pretrained networks and other datasets.
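A minimal sketch of this frame preparation with OpenCV (the function and variable names other than the cv2 calls are illustrative):

```python
import cv2

def prepare_fly_frame(gray_frame):
    # Convert single-channel grayscale to 3-channel RGB so pretrained networks accept it
    rgb = cv2.cvtColor(gray_frame, cv2.COLOR_GRAY2RGB)
    # Up-sample from 168 x 100 to 192 x 128; cv2.resize takes (width, height)
    return cv2.resize(rgb, (192, 128), interpolation=cv2.INTER_LINEAR)
```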

Models

Overall setup

Problem statement

Our input was a set of images with dimensions $(T, C, H, W)$, and our goal was to output the probability of each behavior on each frame, a matrix with dimensions $(T, K)$. T is the number of frames in a video. C is the number of input channels; for typical color images, this is 3 for the red, green, and blue (RGB) channels. H and W are the height and width of the images in pixels. K is the number of user-defined behaviors we aimed to estimate from our data.

Training protocol

We used the ADAM optimizer (Kingma and Ba, 2017) with an initial learning rate of 1e-4 for all models. When validation performance saturated for 5000 training steps, we decreased the learning rate by a factor of 1/√10 on the Kinetics700 dataset, or by a factor of 0.1 for neuroscience datasets (for speed), down to a minimum learning rate of 5e-7. For Kinetics700, we used the provided train and validation split. For neuroscience datasets, we randomly picked 60% of videos for training, 20% for validation, and 20% for test (with the exception of the subsampling experiments for Figure 5, wherein we only used training and validation sets to reduce overall training time). Our only restriction on random splitting was ensuring that at least one frame of each class was included in each split. We were limited to five random splits of the data for most experiments due to the computational and time demands of retraining models. We saved both the final model weights and the best model weights; for inference, we loaded the best weights. 'Best' was assessed by the minimum validation loss for flow generator models or by the mean F1 across all non-background classes for feature extractor and sequence models.
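A minimal PyTorch sketch of this optimizer and learning-rate schedule (the placeholder model, the patience value, and the monitored metric are illustrative; the released code may differ in how it counts training steps):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Reduce the LR by 10x when the monitored validation metric stops improving, down to 5e-7
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5, min_lr=5e-7
)

for step in range(20):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()   # dummy forward/backward pass
    loss.backward()
    optimizer.step()
    val_f1 = 0.5                             # placeholder for the real validation F1
    scheduler.step(val_f1)                   # scheduler monitors validation performance
```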

Stopping criterion

For Kinetics700 models, we stopped when the learning rate dropped below 5e-7. This required about 800,000 training steps. For neuroscience dataset flow generators, we stopped training at 10,000 training steps. For neuroscience dataset feature extractors, we stopped when the learning rate dropped below 5e-7, when 20,000 training steps were complete, or 24 hr elapsed, whichever came first. For sequence models, we stopped when the learning rate dropped below 5e-7, or when 100,000 training steps were complete, whichever came first.

End-to-end training

We could, in theory, train the entire DeepEthogram pipeline end-to-end. However, we chose to train the flow generator, feature extractor, and sequence models sequentially. Backpropagating the classification loss into the flow generator (Zhu et al., 2017) would greatly increase the number of parameters fit to the classification objective and thus the risk of overfitting. Furthermore, we designed the sequence models to have a large temporal receptive window and therefore train them on long sequences (see below). Very long sequences of raw video frames require large amounts of VRAM and exceed our computational limits. For illustration, to train on sequences of 180 timepoints, each consisting of 11 RGB frames (33 channels), our input tensor would have shape [N × 33 × 180 × 256 × 256]. At a batch size of 16, this corresponds to approximately 24 GB of VRAM just for the data, before any neural activations or gradients, which is impractical. Therefore, we first extract features to disk and subsequently train sequence models.
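As a quick check of that figure, the arithmetic in Python (float32 data only):

```python
batch, channels, timesteps, height, width = 16, 33, 180, 256, 256
bytes_per_float32 = 4
gigabytes = batch * channels * timesteps * height * width * bytes_per_float32 / 1e9
print(f"{gigabytes:.1f} GB")  # ~24.9 GB for the raw input tensor alone
```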

Augmentations

To improve the robustness and generalization of our models, we augmented the input images with random perturbations for all datasets during training. We used Kornia (Riba et al., 2019) for GPU-based image augmentation to improve training speed. We perturbed the image brightness and contrast, rotated each image by up to 10°, and flipped horizontally and vertically (depending on the dataset). The input to the flow generator model is a set of 11 frames; the same augmentations were performed on each image in this stack. On Mouse-Ventral1 and Mouse-Ventral2, we also flipped images vertically with a probability of 0.5. We calculated the mean and standard deviation of the RGB input channels and standardized the input channel-wise.
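A minimal sketch of such a GPU-based augmentation stack with Kornia (the jitter magnitudes and normalization statistics are illustrative placeholders, not the values used for any dataset; one clip of 11 frames is stacked along the batch dimension so that same_on_batch applies identical parameters to every frame):

```python
import torch
import torch.nn as nn
import kornia.augmentation as K

augment = nn.Sequential(
    K.ColorJitter(brightness=0.2, contrast=0.2, same_on_batch=True),
    K.RandomRotation(degrees=10.0, same_on_batch=True),
    K.RandomHorizontalFlip(p=0.5, same_on_batch=True),
    K.RandomVerticalFlip(p=0.5, same_on_batch=True),
    K.Normalize(mean=torch.tensor([0.5, 0.5, 0.5]), std=torch.tensor([0.25, 0.25, 0.25])),
)

clip = torch.rand(11, 3, 256, 256)  # the 11 frames input to the flow generator
augmented = augment(clip)
```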

Pretraining + transfer learning

All flow generators and feature extractors were first trained to classify videos in the Kinetics700 dataset (see below). These weights were used to initialize models on neuroscience datasets. Sequence models were trained from scratch.

Flow generators

For optic flow extraction, a commonly used algorithm is TV-L1 (Carreira and Zisserman, 2017). However, common implementations of this algorithm (Bradski, 2008) require compilation of C++, which would introduce many dependencies and make installation more difficult. Furthermore, recent work (Zhu et al., 2017) has shown that even simple neural-network-based optic flow estimators outperform TV-L1 for action detection. Therefore, we used CNN-based optic flow estimators. In addition, we found that saving optic flow as JPEG images, as is common, significantly degraded performance. Therefore, we computed optic flows from a stack of RGB images at runtime for both training and inference. This method is known as Hidden Two-Stream Networks (Zhu et al., 2017).

Architectures

For summary, see Table 2.

Table 2. Model summary.
| Model name | Flow generator (parameters) | Feature extractor (parameters) | Sequence model (parameters) | # frames input to flow generator | # frames input to RGB feature extractor | Total parameters |
| DeepEthogram-fast | TinyMotionNet (1.9M) | ResNet18 × 2 (22.4M) | TGM (250K) | 11 | 1 | ~24.5M |
| DeepEthogram-medium | MotionNet (45.8M) | ResNet50 × 2 (49.2M) | TGM (250K) | 11 | 1 | ~95.2M |
| DeepEthogram-slow | TinyMotionNet3D (0.4M) | ResNet3D-34 × 2 (127M) | TGM (250K) | 11 | 11 | ~127.6M |
TinyMotionNet

For every timepoint, we extracted features based on one RGB image and up to 10 optic flow frames. Furthermore, for large datasets like Kinetics700 (Carreira et al., 2019), it was time-consuming and required a large amount of disk space to extract and save optic flow frames. Therefore, we implemented TinyMotionNet (Zhu et al., 2017) to extract 10 optic flow frames from 11 RGB images ‘on the fly,’ as we extracted features. TinyMotionNet is a small and fast optic flow model with 1.9 million parameters. Similar to a U-Net (Ronneberger et al., 2015), it consists of a downward branch of convolutional layers of decreasing resolution and increasing depth. It is followed by an upward branch of increasing resolution. Units from the downward branch are concatenated to the upward branch. During training, estimated optic flows were output at 0.5, 0.25, and 0.125 of the original resolution.

MotionNet

MotionNet is similar to TinyMotionNet except with more parameters and more feature maps per layer. During training, estimated optic flows were output at 0.5, 0.25, and 0.125 of the original resolution. See the original paper (Zhu et al., 2017) for more details.

TinyMotionNet3D

This novel architecture is based on TinyMotionNet (Zhu et al., 2017), except we replaced all 2D convolutions with 3D convolutions. We maintained the height and width of the kernels. On the encoder and decoder branches, we used a temporal kernel size of 3, meaning that each filter spanned three images. On the last layer of the encoder and the iconv layers that connect the encoder and decoder branches, we used a temporal kernel size of 2, meaning the kernels spanned two consecutive images. We aimed to have the model learn the displacement between two consecutive images (i.e., the optic flow). Due to the large memory requirements of 3D convolutional layers, we used 16, 32, and 64 filter maps per layer. For this architecture, we noticed large estimated flows in texture-less regions in neuroscience datasets after training on Kinetics. Therefore, we added an L1 sparsity penalty on the flows themselves (see 'Loss functions,' below).

Modifications

For the above models, we deviated from the original paper in a few ways. First, each time the flows were up-sampled by a factor of 2, we multiplied the values of the neural activations by 2: if the flow resolution increases from 0.25 to 0.5 of the original, a flow value of 1 corresponds to four and two pixels in the original image, respectively, so we multiplied the up-sampled activations by 2 to compensate for this distortion. Second, when used in combination with the CNN feature extractors (see below), we did not compress the flow values to discrete values between 0 and 255 (Zhu et al., 2017); in fact, we saw performance increases when keeping the continuous float32 values. Third, we did not backpropagate the classifier loss function into the flow generators, as the neuroscience datasets likely did not have enough training examples to make this a sensible strategy. Finally, for MotionNet, we only output flows at three resolutions (rather than five) for consistency.

Loss functions

In brief, we train flow generators to minimize reconstruction errors and minimize high-frequency components (to encourage smooth flow outputs).

MotionNet loss

For full details, see original paper (Zhu et al., 2017). For clarity, we reproduce the loss functions here. We estimate the current frame given the next frame and an estimated optic flow as follows:

$$\hat{I}_0(i,j) = I_1\left(i + V_x(i,j),\; j + V_y(i,j)\right)$$

where $I_0$ and $I_1$ are the current and next image, $(i, j)$ are the indices of the given pixel in rows and columns, and $V_x(i,j)$ and $V_y(i,j)$ are the estimated x and y displacements between $I_0$ and $I_1$, which means $V$ is the optic flow. We use Spatial Transformer Networks (Jaderberg et al., 2015) to perform this sampling operation in a differentiable manner (PyTorch function torch.nn.functional.grid_sample).
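A minimal sketch of this warping operation (assuming flows in pixel units with channel 0 = x and channel 1 = y displacement; not the exact codebase implementation):

```python
import torch
import torch.nn.functional as F

def warp(next_frame, flow):
    """Reconstruct the current frame by sampling the next frame at flow-displaced locations.
    next_frame: (N, C, H, W); flow: (N, 2, H, W) in pixels, channel 0 = x, channel 1 = y."""
    n, _, h, w = next_frame.shape
    ys = torch.arange(h, device=flow.device).view(1, h, 1).expand(1, h, w).float()
    xs = torch.arange(w, device=flow.device).view(1, 1, w).expand(1, h, w).float()
    base = torch.cat([xs, ys], dim=0)                 # (2, H, W) absolute pixel coordinates
    coords = base.unsqueeze(0) + flow                 # sampling positions for each output pixel
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y)
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)      # (N, H, W, 2)
    return F.grid_sample(next_frame, grid, align_corners=True)

reconstructed = warp(torch.rand(2, 3, 64, 64), torch.zeros(2, 2, 64, 64))  # zero flow -> identity
```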

The image loss is the error between the reconstructed $\hat{I}_0$ and the original $I_0$:

$$L_{\text{pixel}} = \frac{1}{N} \sum_{i,j}^{N} \rho\left(I_0 - \hat{I}_0\right)$$

where ρ is the generalized Charbonnier penalty $\rho(x) = \left(x^2 + \epsilon^2\right)^{\alpha}$, which reduces the influence of outliers compared to a simple L1 loss. Following Zhu et al., 2017, we use α = 0.4 and ε = 1e-7.

The structural similarity (SSIM; Wang et al., 2004) loss encourages the reconstructed $\hat{I}_0$ and the original $I_0$ to be perceptually similar:

$$L_{\text{SSIM}} = \frac{1}{N} \sum \left(1 - \text{SSIM}\left(I_0, \hat{I}_0\right)\right)$$

The smoothness loss encourages smooth flow estimates by penalizing the x and y gradients of the optic flow:

$$L_{\text{smooth}} = \frac{1}{N} \sum \left[\rho\!\left(\frac{\partial V_x}{\partial x}\right) + \rho\!\left(\frac{\partial V_x}{\partial y}\right) + \rho\!\left(\frac{\partial V_y}{\partial x}\right) + \rho\!\left(\frac{\partial V_y}{\partial y}\right)\right]$$

We set the Charbonnier α = 0.3 for the smoothness loss.

For the TinyMotionNet3D architecture only, we added a flow sparsity loss that penalizes unnecessary flows:

$$L_{\text{sparsity}} = \frac{1}{N} \sum \left\lvert V \right\rvert$$

Regularization

With millions of parameters and far fewer data points, it is likely that our models will overfit to the training data. Transfer learning (see above) ameliorates this problem somewhat, as does using dropout (see below). However, increasing dropout to very high levels reduces the representational space of the feature vector. To reduce overfitting, we used L2-SP regularization (Li et al., 2018). A common form of regularization is weight decay, in which the sum of squared weights is penalized. However, this simple term could cause the model to 'forget' its initial knowledge from transfer learning. Therefore, L2-SP regularization uses the initial weights from transfer learning as the target:

$$L_{\text{regularization}}(w) = \frac{\alpha}{2} \left\lVert w_S - w_S^0 \right\rVert_2^2 + \frac{\beta}{2} \left\lVert w_{\bar{S}} \right\rVert_2^2$$

For details, see the L2-SP paper (Li et al., 2018). $w$ denotes all trainable parameters of the network, excluding biases and batch normalization parameters. α is a hyperparameter governing how much to decay weights towards their initial values from transfer learning: $w_S$ are the current values of weights that have pretrained counterparts, and $w_S^0$ are their values from pretraining. β is a hyperparameter decaying new weights $w_{\bar{S}}$ (such as the final linear readout layers in feature extractors) towards zero. For flow generator models, α = 1e-5; there are no new weights, so β is unused.
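A minimal sketch of this penalty (the rule used here to skip biases and batch-norm parameters, and the toy usage, are simplifications for illustration, not the released implementation):

```python
import torch
import torch.nn as nn

def l2sp_penalty(model, pretrained_state, alpha=1e-5, beta=1e-3):
    """L2-SP: decay weights with pretrained counterparts toward those values, new weights toward zero."""
    shared, novel = 0.0, 0.0
    for name, param in model.named_parameters():
        if param.ndim <= 1:        # crude stand-in for skipping biases and batch-norm parameters
            continue
        if name in pretrained_state:
            target = pretrained_state[name].to(param.device)
            shared = shared + ((param - target) ** 2).sum()
        else:
            novel = novel + (param ** 2).sum()
    return alpha / 2 * shared + beta / 2 * novel

# toy usage: pretend only the first layer came from transfer learning
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.Conv2d(8, 2, 1))
pretrained = {k: v.clone() for k, v in net.named_parameters() if k.startswith("0.")}
penalty = l2sp_penalty(net, pretrained)
```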

The final loss is the weighted sum of the previous components:

$$L = \lambda_0 L_{\text{pixel}} + \lambda_1 L_{\text{SSIM}} + \lambda_2 L_{\text{smooth}} + \lambda_3 L_{\text{sparsity}} + L_{\text{regularization}}$$

Following Zhu et al., 2017, we set $\lambda_0 = 1$ and $\lambda_1 = 1$. During training, the flow generator outputs flows at multiple resolutions. From largest to smallest, we set $\lambda_2$ to 0.01, 0.02, and 0.04. For TinyMotionNet3D, we set $\lambda_3$ to 0.05 and reduced $\lambda_2$ by a factor of 0.25.

Feature extractors

The goal of the feature extractor was to model the probability that each behavior was present in the given frame of the video (or optic flow stack). We used two-stream CNNs (Zhu et al., 2017; Simonyan and Zisserman, 2014) to classify inputs from both RGB frames and optic flow frames. These CNNs reduced an input tensor of shape $(N, C, H, W)$ to $(N, 512)$ features. Our final fully connected layer estimated probabilities for each behavior, with output shape $(N, K)$. Here, N is the batch size. We trained these CNNs on our labels and then used the penultimate $(N, 512)$ spatial features or flow features as inputs to our sequence models (below).

Architectures

For summary, see Table 2. We used the ResNet family of models (He et al., 2015; Hara et al., 2018) for our feature extractors, one for the spatial stream and one for the flow stream. For DeepEthogram-fast, we used a ResNet18 with ~11 million parameters. For DeepEthogram-medium, we used a ResNet50 with ~23 million parameters. We added dropout (Hinton et al., 2012) layers before the final fully connected layer. For DeepEthogram-medium, we added an extra fully connected layer of shape (2048, 512) after the global average pooling layer to reduce the file size of stored features. For DeepEthogram-slow, we used a 3D ResNet34 (Hara et al., 2018) with ~63 million parameters. For DeepEthogram-fast and DeepEthogram-medium, these models were pretrained on ImageNet (Deng, 2008) with three input channels (RGB). For the flow stream, we stacked 10 optic flow frames (each with x and y components), for 20 input channels. To leverage ImageNet weights with this new number of channels, we used the mean weight across all three RGB channels and replicated it 20 times (Wang et al., 2015). This was only performed when adapting ImageNet weights to Kinetics700 models, to resolve the discrepancy in the number of input channels; the user never needs to perform this averaging.
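A minimal sketch of this weight-adaptation trick, using torchvision's ResNet18 as a stand-in for the flow stream (the legacy pretrained=True flag is used for brevity; the 20-channel input corresponds to 10 stacked x/y flow fields):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

flow_stream = resnet18(pretrained=True)                 # ImageNet-pretrained; conv1 expects 3 channels
rgb_weights = flow_stream.conv1.weight.data             # shape (64, 3, 7, 7)
flow_weights = rgb_weights.mean(dim=1, keepdim=True).repeat(1, 20, 1, 1)  # (64, 20, 7, 7)

flow_stream.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)
flow_stream.conv1.weight.data.copy_(flow_weights)       # mean RGB weight replicated across 20 channels
```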

Loss functions

Our problem is a multi-label classification task: each timepoint can have multiple positive labels. For example, if a mouse is licking its forepaw and scratching itself with its hindlimb, both 'lick' and 'scratch' should be positive. Therefore, we used a binary focal loss (Lin et al., 2018; Marks, 2020). The focal loss is the binary cross-entropy loss weighted by the predicted probability, which de-emphasizes the loss for already well-classified examples and encourages the model to 'focus' on misclassified examples. Combined with up-weighting rare, positive examples, the data loss function is

$$L_{\text{data}} = -\sum_{t,k} \left[ w_k \left(1 - p_k(x_t)\right)^{\gamma} y_{t,k} \log p_k(x_t) + p_k(x_t)^{\gamma} \left(1 - y_{t,k}\right) \log\left(1 - p_k(x_t)\right) \right]$$

where $y_{t,k}$ is the ground-truth label, equal to 1 if class k occurred at time t and 0 otherwise, and $p_k(x_t)$ is our model output for class k at time t. Note that for the feature extractor we only considered one timepoint at a time, so t = 0. γ is the focal loss term (Lin et al., 2018); if γ = 0, this equation is simply the weighted binary cross-entropy loss. The larger the γ, the more the model down-weights correctly classified but insufficiently confident predictions. See the focal loss paper for more details (Lin et al., 2018). We chose γ = 1 for all feature extractor and sequence models for all datasets; see the 'Hyperparameter optimization' section. We also used label smoothing (Müller et al., 2019), so that the target was 0.05 if $y_{t,k} = 0$ and 0.95 if $y_{t,k} = 1$. $w_k$ is a weight given to positive examples; note that there is no corresponding weight in the second term, where the ground truth is 0. Intuitively, if a behavior was very rare, we wanted to penalize the model more for an error on positive examples because there were so few of them. We calculated the weight as follows:

$$w_k = \left( \frac{\sum_{i=1}^{N} \left(1 - y_{i,k}\right)}{\sum_{i=1}^{N} y_{i,k}} \right)^{\beta}$$

The numerator is the total number of negative examples in our training set, and the denominator is the total number of positive examples. β is a hyperparameter that we tuned manually. If β = 1, positive examples were weighted fully by their rarity in the training set; if β = 0, all training examples were weighted equally. By illustration, if only 1% of our training set had a positive example for a given behavior, with β = 1 our weight was ~100, and with β = 0, $w_k = 1$. We empirically found that with rare behaviors β = 1 drastically increased the levels of false positives, while with β = 0 many false negatives occurred. For all datasets, we set β = 0.25. This $w_k$ corresponds to the pos_weight argument in torch.nn.BCEWithLogitsLoss.
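A minimal sketch of this weighted focal loss and the β-tempered positive weights (a simplified illustration; the released implementation may differ in details such as reduction and numerical stabilization):

```python
import torch

def class_pos_weights(labels, beta=0.25):
    """labels: (N, K) binary matrix. Per-class weight = (negatives / positives) ** beta."""
    pos = labels.sum(dim=0).clamp(min=1)
    neg = (1 - labels).sum(dim=0)
    return (neg / pos) ** beta

def weighted_focal_bce(logits, targets, pos_weight, gamma=1.0, smoothing=0.05):
    """Binary focal loss with label smoothing, applied per class and per frame."""
    targets = targets * (1 - 2 * smoothing) + smoothing      # 0 -> 0.05, 1 -> 0.95
    p = torch.sigmoid(logits)
    pos_term = pos_weight * (1 - p) ** gamma * targets * torch.log(p + 1e-8)
    neg_term = p ** gamma * (1 - targets) * torch.log(1 - p + 1e-8)
    return -(pos_term + neg_term).sum(dim=1).mean()

labels = (torch.rand(1000, 5) > 0.9).float()                 # toy multi-label data, ~10% positive
loss = weighted_focal_bce(torch.randn(1000, 5), labels, class_pos_weights(labels))
```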

We used L2-SP regularization as above. For feature extractors, we used α = 1e-5 and β = 1e-3. See the 'Hyperparameter optimization' section.

The final loss term is the sum of the data term and the regularization term:

$$L = L_{\text{data}} + L_{\text{regularization}}$$

Bias initialization

To combat the effects of class imbalance, we set the bias parameters on the final layer to approximate the class imbalance (https://www.tensorflow.org/tutorials/structured_data/imbalanced_data). For example, if we had 99 negative examples and 1 positive example, we wanted to set our initial biases such that the model guessed ‘positive’ around 1% of the time. Therefore, we initialized the bias term as the log ratio of positive examples to negative examples:

$$b_k = \log_e\left( \frac{\sum_{i=1}^{N} y_{i,k}}{\sum_{i=1}^{N} \left(1 - y_{i,k}\right)} \right)$$
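A minimal sketch of this initialization for the final classification layer (layer sizes and labels are toy placeholders):

```python
import torch
import torch.nn as nn

labels = (torch.rand(1000, 5) > 0.99).float()    # toy labels: ~1% positive per class
classifier = nn.Linear(512, 5)                   # final readout layer (illustrative sizes)

with torch.no_grad():
    pos = labels.sum(dim=0).clamp(min=1)
    neg = (1 - labels).sum(dim=0).clamp(min=1)
    classifier.bias.copy_(torch.log(pos / neg))  # sigmoid(bias) ~ empirical positive rate
```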

Fusion

There are many ways to fuse the outputs of the spatial and motion streams in two-stream CNNs (Feichtenhofer et al., 2016). For simplicity, we used late, average fusion: we averaged the K-dimensional output vectors of the two CNNs before the sigmoid function:

$$p_K(x_t) = \sigma\!\left( \frac{f_{\text{spatial}}(x_t) + f_{\text{motion}}(x_t)}{2} \right)$$

Inference time

To improve inference speed, we used a custom video inference pipeline with sequential video reading, batched model predictions, and multiprocessed data loading. Inference speed is strongly related to input resolution and GPU hardware. We report timing on both a Titan RTX graphics card and a GeForce 1080 Ti graphics card.

Sequence models

Architecture

For summary, see Table 2. The goal of the sequence model was to have a wide temporal receptive field for classifying timepoints into behaviors. For human labelers, it is much easier to classify the behavior at time t by watching a short clip centered at t rather than viewing the static image. Therefore, we used a sequence model that takes as input a sequence of spatial features and flow features output by the feature extractors. Our criteria were to find a model that had a large temporal receptive field as context can be useful for classifying frames. However, we also wanted a model that had relatively few parameters as this model was trained from scratch on small neuroscience datasets. Therefore, we chose TGM (Piergiovanni and Ryoo, 2018) models, which are designed for temporal action detection. Unless otherwise noted, we used the following hyperparameters:

  • Filter length: L = 15

  • Number of input layers: C = 1

  • Number of output layers: C_out = 8

  • Number of TGM layers: 3

  • Input dropout: 0.5

  • Dropout of output features: 0.5

  • Input dimensionality (concatenation of flow and spatial): D = 1024

  • Number of filters: 8

  • Sequence length: 180

  • Soft attention, not 1D convolution

  • We do not use super-events

  • For more details, see Piergiovanni and Ryoo, 2018.

Modifications

TGM models use two sets of features to make the final prediction: the $(T, D)$ input features (in our case, the spatial and flow features from the feature extractors) and the $(T, D)$ learned features output by the TGM layers. The original TGM model performed 'early fusion' by concatenating these two feature sets into shape $(T, 2D)$ before the 1D convolution layer. We found that in low-data regimes the model ignored the learned features and therefore reduced to a simple 1D convolution. Therefore, we performed 'late fusion': we used separate 1D convolutions on the input features and on the learned features and averaged the outputs of these two layers (both $(T, K)$ activations before the sigmoid function). Second, in the original TGM paper, the penultimate layer was a standard 1D convolutional layer with 512 output channels. We found that this dramatically increased the number of parameters without significantly improving performance. Therefore, we reduced the number of output channels to 128. The total number of parameters was ~264,000.
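A minimal sketch of this late-fusion readout (the dimensionalities and the assumed (N, D, T) tensor layout are illustrative; the actual TGM implementation differs in detail):

```python
import torch
import torch.nn as nn

class LateFusionReadout(nn.Module):
    """Separate 1D convolutions over the input features and the TGM-learned features,
    with their per-frame logits averaged before the sigmoid."""
    def __init__(self, d=1024, num_classes=5):
        super().__init__()
        self.from_inputs = nn.Conv1d(d, num_classes, kernel_size=1)
        self.from_learned = nn.Conv1d(d, num_classes, kernel_size=1)

    def forward(self, input_features, learned_features):
        logits = (self.from_inputs(input_features) + self.from_learned(learned_features)) / 2
        return logits                                  # (N, K, T); apply sigmoid for probabilities

head = LateFusionReadout()
out = head(torch.randn(2, 1024, 180), torch.randn(2, 1024, 180))  # sequence length 180
```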

Loss function

The data loss term is the weighted, binary focal loss as for the feature extractor above. For the regularization loss, we used simple L2 regularization because we do not pretrain the sequence models.

$$L_{\text{regularization}}(w) = \frac{\alpha}{2} \left\lVert w \right\rVert_2^2$$

We used α=0.01 for all datasets.

Keypoint-based classification

We compared pixel-based (DeepEthogram) and skeleton-based behavioral classification on the Mouse-Openfield dataset. Our goal was to replicate Sturman et al., 2020 as closely as possible. We chose this dataset because videos with this resolution of mice in an open field arena are a common form of behavioral measurement in biology. We first used DeepLabCut (Mathis, 2018) to label keypoints on the mouse and train pose estimation models. Due to the low resolution of the videos (200–300 pixels on each side), we could only reliably estimate seven keypoints: nose, left and right forepaw (if visible, or shoulder area), left and right hindpaw (if visible, or hip area), the base of the tail, and the tip of the tail. See Figure 5—figure supplement 1A for details. We labeled 1800 images and trained models using the DeepLabCut Colab notebook (ResNet50). Example performance on held-out data for unlabeled frames can be seen in Figure 5—figure supplement 1B. We used linear interpolation for keypoints with confidence below 0.9.
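A minimal sketch of this confidence-based interpolation (a hedged illustration, not the exact analysis code):

```python
import numpy as np
import pandas as pd

def interpolate_low_confidence(xy, likelihood, threshold=0.9):
    """Linearly interpolate one keypoint's (x, y) trajectory wherever the DeepLabCut
    likelihood falls below the threshold. xy: (T, 2) array; likelihood: (T,) array."""
    xy = xy.astype(float).copy()
    xy[likelihood < threshold] = np.nan
    filled = pd.DataFrame(xy).interpolate(method="linear", limit_direction="both")
    return filled.to_numpy()

trajectory = interpolate_low_confidence(np.random.rand(100, 2), np.random.rand(100))
```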

Using these seven keypoints for all videos, we computed a number of pose and behavioral features (Python, NumPy). As a check, we plotted the distribution of these features for each human-labeled behavior (Figure 5—figure supplement 1C). These features contained signals that could reliably discriminate behaviors. For example, the distance between the nose and the tailbase is larger during locomotion than during face grooming (Figure 5—figure supplement 1C, left).

We attempted to replicate Sturman et al., 2020 as closely as possible. However, due to technical considerations, using the exact codebase was not possible. Our dataset is multilabel, meaning that two behaviors can be present, and be labeled, on a single frame. Therefore, we could not use cross-entropy loss. Our videos are lower resolution, and therefore we used seven keypoints instead of 10. We normalized pixel coordinates by the width and height of the arena. We computed the centroid as the mean of all paw locations. Due to the difference in the keypoints we selected, we had to perform our own behavioral feature expansion (see ‘Time-resolved skeleton representation,’ Sturman et al., 2020 supplementary methods). We used the following features:

  • x and y coordinates of all keypoints in the arena

  • x and y coordinates after aligning relative to the body axis, such that the nose was to the right and the tailbase to the left

  • Angles between

    • tail and body axis

    • each paw and the body axis

  • Distances between

    • nose and tailbase

    • tail base and tip

    • left forepaw and left hindpaw, right forepaw and right hindpaw, averaged

    • forepaw and nose

    • left and right forepaw

    • left and right hindpaw

  • The area of the body (polygon enclosed by the paws, nose, and tailbase)

This resulted in 44 behavioral features for each frame. Following Sturman et al., we used frames t − 15 through t + 15 as input to our classifier, for a total of 1364 features (44 × 31). We also used the same model architecture as Sturman et al.: a multilayer perceptron with 1364 neurons in the input layer, two hidden layers with 256 and 128 neurons, respectively, and ReLU activations. For simplicity, we used Dropout (Srivastava et al., 2014) with probability 0.35 between each layer.
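A sketch of this keypoint classifier is shown below, assuming a per-frame feature matrix of shape (frames, 44); the windowing helper, the number of output classes, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

def window_features(feats: torch.Tensor, half_window: int = 15) -> torch.Tensor:
    """Stack features from frames t-15 ... t+15 into one vector per frame,
    padding the edges by repeating the first/last frame."""
    padded = torch.cat([feats[:1].repeat(half_window, 1),
                        feats,
                        feats[-1:].repeat(half_window, 1)], dim=0)
    n_frames = feats.shape[0]
    windows = [padded[i:i + n_frames] for i in range(2 * half_window + 1)]
    return torch.cat(windows, dim=1)  # (frames, 44 * 31) = (frames, 1364)

n_classes = 5  # illustrative: one output per labeled behavior
mlp = nn.Sequential(
    nn.Linear(1364, 256), nn.ReLU(), nn.Dropout(0.35),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.35),
    nn.Linear(128, n_classes),  # sigmoid applied inside the multi-label loss
)
```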

To make the comparison as fair as possible, we implemented the training tricks from DeepEthogram sequence models when training these keypoint-based models. These include the loss function (binary focal loss with up-weighting of rare behaviors), the stopping criterion (100 epochs or when the learning rate drops below 5e-7), learning rate scheduling based on validation F1 saturation, L2 regularization, thresholds optimized based on F1, postprocessing based on bout length statistics, and inference using the best weights from training (as opposed to the final weights).

Unsupervised classification

To compare DeepEthogram to unsupervised classification, we used B-SoID (Hsu and Yttri, 2019; version 2.0, downloaded January 18, 2021). We used the same DeepLabCut outputs as for the supervised classifiers. We used the Streamlit GUI for feature computation, UMAP embedding, model training, and classification. Our Mouse-Openfield dataset contained a mixture of 30 and 60 frames-per-second videos. The B-SoID app assumes a constant framerate; therefore, we down-sampled the poses from 60 Hz to 30 Hz, performed all embedding and classification, and then up-sampled the classified behaviors back to 60 Hz using nearest-neighbor up-sampling (PyTorch, see Figure 5—figure supplement 2A). B-SoID identified 11 clusters in UMAP space (Figure 5—figure supplement 2B, left). To compare unsupervised classification with post-hoc label assignment to DeepEthogram, we first computed a simple lookup table that mapped human annotations to B-SoID clusters by counting the frames on which they co-occurred (Figure 5—figure supplement 2D). For each human label, we picked the B-SoID cluster with the maximum number of co-occurring frames; this defines a mapping between B-SoID clusters and human labels. We used this mapping to ‘predict’ human labels on the test set (Figure 5—figure supplement 2E). We compared B-SoID to the DeepEthogram-fast model for a fair comparison, as B-SoID inference is relatively fast.
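A minimal sketch of this co-occurrence-based mapping, assuming integer-coded human labels and B-SoID cluster assignments aligned frame by frame (the function and variable names are ours):

```python
import numpy as np

def map_labels_to_clusters(human_labels: np.ndarray, cluster_ids: np.ndarray,
                           n_behaviors: int, n_clusters: int) -> dict:
    """For each human behavior label, find the B-SoID cluster with which it
    co-occurs on the most training frames."""
    cooccurrence = np.zeros((n_behaviors, n_clusters), dtype=int)
    for label, cluster in zip(human_labels, cluster_ids):
        cooccurrence[label, cluster] += 1
    return {behavior: int(np.argmax(cooccurrence[behavior]))
            for behavior in range(n_behaviors)}
```

Inverting this lookup (cluster to behavior) then allows cluster assignments on held-out frames to be read out as ‘predicted’ human labels.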

Hyperparameter optimization

There are many hyperparameters in DeepEthogram models that can dramatically affect performance. We optimized hyperparameters using Ray Tune (Liaw, 2018), a software package for distributed, asynchronous model selection and training. Our target for hyperparameter optimization was the F1 score averaged over classes on the validation set, ignoring the background class. We used random search to select hyperparameters, and the Asynchronous Successive Halving algorithm (Li, 2020) to terminate poor runs. We did not perform hyperparameter optimization on flow generator models. For feature extractors, we optimized the following hyperparameters: learning rate, α and β from the regularization loss, γ from the focal loss, β from the positive example weighting, whether or not to add a batch normalization layer after the final fully connected layer (Kocaman et al., 2020), dropout probability, and label smoothing. For sequence models, we optimized learning rate, regularization α, γ from the focal loss, β from the positive example weighting, whether or not to add a batch normalization layer after the final fully connected layer (Kocaman et al., 2020), input dropout probability, output dropout probability, filter length, number of layers, whether or not to use soft attention (Piergiovanni and Ryoo, 2018), whether or not to add a nonlinear classification layer, and number of features in the nonlinear classification layer.
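A minimal sketch of this search setup is shown below, assuming the classic `tune.run` API with an `ASHAScheduler`; the trainable function and search-space ranges are placeholders, not the exact values we searched.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_and_validate(config):
    # placeholder: build and train a feature extractor or sequence model with
    # the sampled hyperparameters, then report validation F1 after each epoch
    for epoch in range(10):
        val_f1 = 0.0  # replace with the macro F1 on the validation set
        tune.report(val_f1=val_f1)

search_space = {  # illustrative ranges only
    "learning_rate": tune.loguniform(1e-5, 1e-2),
    "regularization_alpha": tune.loguniform(1e-5, 1e-1),
    "focal_gamma": tune.uniform(0.0, 2.0),
    "dropout": tune.uniform(0.0, 0.9),
}

analysis = tune.run(
    train_and_validate,
    config=search_space,
    num_samples=100,                                       # random-search draws
    scheduler=ASHAScheduler(metric="val_f1", mode="max"),  # terminate poor trials early
    resources_per_trial={"cpu": 4, "gpu": 1},
)
print(analysis.get_best_config(metric="val_f1", mode="max"))
```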

The absolute best performance could have been obtained by picking the best hyperparameters for each dataset, model size (DeepEthogram-f, DeepEthogram-m, or DeepEthogram-s), and split of the data. However, this would overstate performance for subsequent users who do not have the computational resources to perform such an optimization themselves. Therefore, we manually selected hyperparameters that performed well on average across all datasets, models, and splits, and used the same parameters for all models. We will release the Ray Tune integration code required for users to optimize their own models, should they choose.

We performed this optimization on the O2 High Performance Compute Cluster, supported by the Research Computing Group, at Harvard Medical School. See http://rc.hms.harvard.edu for more information. Specifically, we used a cluster consisting of 8 RTX6000 GPUs.

Postprocessing

The output of the feature extractor and sequence model is the probability of behavior k occurring on frame t: $p_{t,k} = f(x_t)_k$. To convert these probabilities into binary predictions, we thresholded the probabilities:

\hat{y}_{t,k} = p_{t,k} > \tau_k

We picked the threshold $\tau_k$ for each behavior k that maximized the F1 score (below). Thresholds were picked independently on the training and validation sets; for test data, we used the thresholds selected on the validation set.

We found that these predictions overestimated the overall number of bouts. In particular, very short bouts were over-represented in model predictions. For each behavior k, we removed both ‘positive’ and ‘negative’ bouts (binary sequences of 1s and 0s, respectively) shorter than the first percentile of the bout length distribution in the training set.

Finally, we computed the ‘background’ class as the logical not of the other predictions.
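The sketch below illustrates this postprocessing chain (thresholding, short-bout removal, and the background class) on NumPy arrays. The sequential flipping of short runs is a simplification of the percentile-based rule described above, and all names and the placement of the background column are illustrative.

```python
import numpy as np

def remove_short_bouts(binary: np.ndarray, min_length: int) -> np.ndarray:
    """Flip any run of identical values (a positive or negative bout)
    shorter than min_length frames."""
    out = binary.copy()
    boundaries = np.flatnonzero(np.diff(out)) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(out)]))
    for start, end in zip(starts, ends):
        if end - start < min_length:
            out[start:end] = 1 - out[start]
    return out

def postprocess(probabilities: np.ndarray, thresholds: np.ndarray,
                min_bout_lengths: np.ndarray) -> np.ndarray:
    """probabilities: (frames, classes); thresholds / min_bout_lengths: (classes,)."""
    predictions = (probabilities > thresholds[None, :]).astype(int)
    for k in range(predictions.shape[1]):
        predictions[:, k] = remove_short_bouts(predictions[:, k], int(min_bout_lengths[k]))
    # the 'background' class is the logical NOT of any other behavior being present
    background = (predictions.sum(axis=1) == 0).astype(int)
    return np.column_stack([background, predictions])
```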

Evaluation and metrics

We used the following metrics: overall accuracy, F1 score, and the AUROC by class. Accuracy was defined as

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. We reported overall accuracy, not accuracy for each class.

F1 score was defined as

F1 = \frac{1}{K}\sum_{k=1}^{K} \frac{2 \cdot \mathrm{precision}_k \cdot \mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k}

where $\mathrm{precision}_k = \frac{TP}{TP + FP}$ and $\mathrm{recall}_k = \frac{TP}{TP + FN}$ for each class k. The above was implemented by sklearn.metrics.f1_score with argument average='macro'.

AUROC was computed by taking the AUROC for each class and averaging the result. This was implemented by sklearn.metrics.roc_auc_score with argument average='macro'.
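These metrics can be computed with scikit-learn as sketched below, assuming binary label and prediction arrays of shape (frames, classes) and per-class probabilities of the same shape.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(labels: np.ndarray, predictions: np.ndarray,
             probabilities: np.ndarray) -> dict:
    """labels / predictions: binary arrays of shape (frames, classes);
    probabilities: per-class probabilities of the same shape."""
    return {
        "accuracy": accuracy_score(labels.ravel(), predictions.ravel()),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "auroc_macro": roc_auc_score(labels, probabilities, average="macro"),
    }
```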

Shuffle

To compare model performance to random chance, we performed a shuffling procedure. For each model and random split, we randomly circularly permuted each video’s labels 100 times. This means that the distribution of labels was kept the same, but the correspondence between predictions and labels was broken. For each of these 100 repetitions, we computed all metrics and then averaged across repeats; this results in one chance value per split of the data (gray bars, Figure 3C–K, Figure 3—figure supplement 1B–J, Figure 3—figure supplement 2B–J, Figure 3—figure supplement 3B–J).
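A sketch of the circular-permutation chance estimate for a single video is shown below; in practice we permuted each video's labels separately before pooling across videos, and the function signature is illustrative.

```python
import numpy as np

def shuffled_chance(labels: np.ndarray, predictions: np.ndarray,
                    metric_fn, n_repeats: int = 100, seed: int = 0) -> float:
    """Circularly permute the labels in time, preserving their distribution
    but breaking the correspondence with predictions; average the metric."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        shift = int(rng.integers(1, len(labels)))
        scores.append(metric_fn(np.roll(labels, shift, axis=0), predictions))
    return float(np.mean(scores))
```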

Statistics

We randomly assigned input videos to train, validation, and test splits (see above). We then trained flow generator and feature extractor models, performed inference, and trained sequence models. We repeated this process five times for all datasets, except Sturman-EPM, which we split three times because only three videos had at least one behavior. When evaluating DeepEthogram performance, this results in N = 5 samples. For each split of the data, the videos in each subset were different, the fully connected layers in the feature extractor were randomly initialized with different weights, and the sequence model was randomly initialized with different weights. When comparing the means of multiple groups (e.g., shuffle, DeepEthogram, and human performance for a single behavior), we used a one-way repeated measures ANOVA, with the splits as subjects. If this was significant, we performed a post-hoc Tukey’s honestly significant difference test to compare means pairwise. For cases in which only two groups were compared (e.g., model and shuffle without human performance), we performed paired t-tests with Bonferroni correction.
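These tests can be run with statsmodels and SciPy as sketched below, assuming a long-format table with one F1 value per split and group (column names are illustrative); note that pairwise_tukeyhsd is an ordinary Tukey HSD used here as the post-hoc test.

```python
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_groups(df: pd.DataFrame, alpha: float = 0.05) -> None:
    """df: one row per (split, group) with columns 'split', 'group', 'f1'."""
    anova = AnovaRM(df, depvar="f1", subject="split", within=["group"]).fit()
    print(anova)
    if anova.anova_table["Pr > F"].iloc[0] < alpha:
        # post-hoc pairwise comparisons between groups
        print(pairwise_tukeyhsd(df["f1"], df["group"]))

def compare_two(model_scores, shuffle_scores, n_comparisons: int = 1):
    """Paired t-test with Bonferroni correction across n_comparisons tests."""
    t, p = ttest_rel(model_scores, shuffle_scores)
    return t, min(1.0, p * n_comparisons)
```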

Acknowledgements

We thank Woolf lab members Rachel Moon for data acquisition and scoring and Victor Fattori for scoring, and David Roberson and Lee Barrett for designing and constructing the PalmReader and iBob mouse viewing platforms. We thank the Harvey lab for helpful discussions and feedback on the manuscript. We thank Sturman et al. 2020 for making their videos and human labels publicly available. This work was supported by NIH grants R01 MH107620 (CDH), R01 NS089521 (CDH), R01 NS108410 (CDH), DP1 MH125776 (CDH), F31 NS108450 (JPB), R35 NS105076 (CJW), R01 AT011447 (CJW), R00 NS101057 (LLO), K99 DE028360 (DAY), European Research Council grant ERC-Stg-759782 (EC), an NSF GRFP (NKW), FCT fellowship PD/BD/105947/2014 (TC), a Harvard Medical School Dean’s Innovation Award (CDH), and a Harvard Medical School Goldenson Research Award (CDH).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Christopher D Harvey, Email: harvey@hms.harvard.edu.

Mackenzie W Mathis, EPFL, Switzerland.

Timothy E Behrens, University of Oxford, United Kingdom.

Funding Information

This paper was supported by the following grants:

  • National Institutes of Health R01MH107620 to Christopher D Harvey.

  • National Institutes of Health R01NS089521 to Christopher D Harvey.

  • National Institutes of Health R01NS108410 to Christopher D Harvey.

  • National Institutes of Health F31NS108450 to James P Bohnslav.

  • National Institutes of Health R35NS105076 to Clifford J Woolf.

  • National Institutes of Health R01AT011447 to Clifford J Woolf.

  • National Institutes of Health R00NS101057 to Lauren L Orefice.

  • National Institutes of Health K99DE028360 to David A Yarmolinsky.

  • European Research Council ERC-Stg-759782 to M Eugenia Chiappe.

  • National Science Foundation GRFP to Nivanthika K Wimalasena.

  • Ministry of Education PD/BD/105947/2014 to Tomás Cruz.

  • Harvard Medical School Dean's Innovation Award to Christopher D Harvey.

  • Harvard Medical School Goldenson Research Award to Christopher D Harvey.

  • National Institutes of Health DP1 MH125776 to Christopher D Harvey.

Additional information

Competing interests

None.

Author contributions

Conceptualization, Developed the software and analyzed the data, Performed video labeling, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing.

Conceptualization, Funding acquisition, Investigation, Performed experiments, Performed video labeling, Writing – review and editing.

Investigation, Performed experiments, Performed video labeling, Writing – review and editing.

Investigation, Performed experiments, Performed video labeling, Writing – review and editing.

Funding acquisition, Investigation, Performed experiments, Performed video labeling, Writing – review and editing.

Investigation, Performed experiments, Performed video labeling, Writing – review and editing.

Investigation, Performed video labeling.

Funding acquisition, Supervised experiments, Supervision, Writing – review and editing.

Funding acquisition, Supervised experiments, Supervision, Writing – review and editing.

Conceptualization, Funding acquisition, Supervised experiments, Supervision, Writing – review and editing.

Conceptualization, Funding acquisition, Methodology, Software, Supervised the software development and data analysis, Supervision, Writing – original draft, Writing – review and editing.

Ethics

All experimental procedures were approved by the Institutional Animal Care and Use Committees at Boston Children's Hospital (protocol numbers 17-06-3494R and 19-01-3809R) or Massachusetts General Hospital (protocol number 2018N000219) and were performed in compliance with the Guide for the Care and Use of Laboratory Animals.

Additional files

Transparent reporting form

Data availability

Code is posted publicly on Github and linked in the paper. Video datasets and human annotations are publicly available and linked in the paper.

The following previously published datasets were used:

von Ziegler L, Sturman O, Bohacek J. 2020. Videos for deeplabcut, noldus ethovision X14 and TSE multi conditioning systems comparisons. Zenodo.

References

  1. Anderson DJ, Perona P. Toward a Science of Computational Ethology. Neuron. 2014;84:18–31. doi: 10.1016/j.neuron.2014.09.005. [DOI] [PubMed] [Google Scholar]
  2. Batty E. BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos. OpenReview; 2019. [Google Scholar]
  3. Berman GJ, Choi DM, Bialek W, Shaevitz JW. Mapping the stereotyped behaviour of freely moving fruit flies. Journal of the Royal Society, Interface. 2014;11:20140672. doi: 10.1098/rsif.2014.0672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bohnslav J. DeepEthogram. Software Heritage; 2021. swh:1:rev:ffd7e6bd91f52c7d1dbb166d1fe8793a26c4cb01. https://archive.softwareheritage.org/swh:1:rev:ffd7e6bd91f52c7d1dbb166d1fe8793a26c4cb01
  5. Bradski G. Open Source Computer Vision Library. OpenCV; 2008. [Google Scholar]
  6. Brown AE, de Bivort B. Ethology as a Physical Science. bioRxiv. 2017 doi: 10.1101/220855. [DOI]
  7. Browne LE. Time-Resolved Fast Mammalian Behavior Reveals the Complexity of Protective Pain Responses. Cell Reports. 2017;20:89–98. doi: 10.1016/j.celrep.2017.06.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. IEEE Conference on Computer Vision and Pattern Recognition; 2017. [Google Scholar]
  9. Carreira J, Noland E, Hillier C, Zisserman A. A Short Note on the Kinetics-700 Human Action Dataset. arXiv. 2019 https://arxiv.org/abs/1907.06987
  10. Caswell TA. Matplotlib/matplotlib: REL v3.4.2. Zenodo; 2021. doi: 10.5281/zenodo.592536. [DOI]
  11. Chao YW. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. arXiv. 2018 https://arxiv.org/abs/1804.07667
  12. Dankert H, Wang L, Hoopfer ED, Anderson DJ, Perona P. Automated monitoring and analysis of social behavior in Drosophila. Nature Methods. 2009;6:297–303. doi: 10.1038/nmeth.1310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Datta SR, Anderson DJ, Branson K, Perona P, Leifer A. Computational Neuroethology: A Call to Action. Neuron. 2019;104:11–24. doi: 10.1016/j.neuron.2019.09.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. de Chaumont F, Ey E, Torquet N, Lagache T, Dallongeville S, Imbert A, Legou T, Le Sourd A-M, Faure P, Bourgeron T, Olivo-Marin J-C. Real-time analysis of the behaviour of groups of mice via a depth-sensing camera and machine learning. Nature Biomedical Engineering. 2019;3:930–942. doi: 10.1038/s41551-019-0396-1. [DOI] [PubMed] [Google Scholar]
  15. Deng J. ImageNet: A large-scale hierarchical image database. IEEE Conference; 2008. [DOI] [Google Scholar]
  16. Egnor SER, Branson K. Computational analysis of behavior. Annual Review of Neuroscience. 2016;39:217–236. doi: 10.1146/annurev-neuro-070815-013845. [DOI] [PubMed] [Google Scholar]
  17. El-Nouby A, Taylor GW. Real-Time End-to-End Action Detection with Two-Stream Networks. arXiv. 2018 https://arxiv.org/abs/1802.08362
  18. Falcon W. PyTorch Lightning. Version 0.3. GitHub; 2019. https://www.pytorchlightning.ai/
  19. Feichtenhofer C, Pinz A, Zisserman A. Convolutional Two-Stream Network Fusion for Video Action Recognition. arXiv. 2016 https://arxiv.org/abs/1604.06573
  20. Feichtenhofer C, Fan H, Malik J, He K. SlowFast Networks for Video Recognition . arXiv. 2019 https://arxiv.org/abs/1812.03982
  21. Friard O, Gamba M, Fitzjohn R. BORIS : a free, versatile open‐source event‐logging software for video/audio coding and live observations. Methods in Ecology and Evolution. 2016;7:1325–1330. doi: 10.1111/2041-210X.12584. [DOI] [Google Scholar]
  22. Fujiwara T, Cruz TL, Bohnslav JP, Chiappe ME. A faithful internal representation of walking movements in the Drosophila visual system. Nature Neuroscience. 2017;20:72–81. doi: 10.1038/nn.4435. [DOI] [PubMed] [Google Scholar]
  23. Gomez-Marin A, Paton JJ, Kampff AR, Costa RM, Mainen ZF. Big behavioral data: Psychology, ethology and the foundations of neuroscience. Nature Neuroscience. 2014;17:1455–1462. doi: 10.1038/nn.3812. [DOI] [PubMed] [Google Scholar]
  24. Graving JM, Chae D, Naik H, Li L, Koger B, Costelloe BR, Couzin ID. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife. 2019;8:e47994. doi: 10.7554/eLife.47994. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? 2018. doi: 10.1109/CVPR.2018.00685. [DOI] [Google Scholar]
  26. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. arXiv. 2015 https://arxiv.org/abs/1512.03385
  27. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR. Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. arXiv. 2012 https://arxiv.org/abs/1207.0580
  28. Hsu A, Yttri EA. B-SOID: An Open Source Unsupervised Algorithm for Discovery of Spontaneous Behaviors. bioRxiv. 2019 doi: 10.1101/770271. [DOI] [PMC free article] [PubMed]
  29. Iqbal H. HarisIqbal88/PlotNeuralNet. v1.0.0. Zenodo; 2018. doi: 10.5281/zenodo.2526396. [DOI]
  30. Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial Transformer Networks . arXiv. 2015 https://arxiv.org/abs/1506.02025
  31. Kabra M, Robie AA, Rivera-Alba M, Branson S, Branson K. JAABA: Interactive machine learning for automatic annotation of animal behavior. Nature Methods. 2013;10:64–67. doi: 10.1038/nmeth.2281. [DOI] [PubMed] [Google Scholar]
  32. Kahatapitiya K, Ryoo MS. Coarse-Fine Networks for Temporal Activity Detection in Videos . arXiv. 2021 https://arxiv.org/abs/2103.01302
  33. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv. 2017 https://arxiv.org/abs/1412.6980
  34. Kocaman V, Shir OM, Bäck T. Improving Model Accuracy for Imbalanced Image Classification Tasks by Adding a Final Batch Normalization Layer. arXiv. 2020 https://arxiv.org/abs/2011.06319
  35. Krakauer JW, Ghazanfar AA, Gomez-Marin A, MacIver MA, Poeppel D. Neuroscience needs behavior: Correcting a reductionist bias. Neuron. 2017;93:480–490. doi: 10.1016/j.neuron.2016.12.041. [DOI] [PubMed] [Google Scholar]
  36. Kwak IS, Kriegman D, Branson K. Detecting the Starting Frame of Actions in Video. arXiv. 2019 https://arxiv.org/abs/1906.03340
  37. Lauer J. Multi-Animal Pose Estimation and Tracking with Deeplabcut. bioRxiv. 2021 doi: 10.1101/2021.04.30.442096v1. [DOI] [PMC free article] [PubMed]
  38. Li X, Grandvalet Y, Davoine F. Explicit Inductive Bias for Transfer Learning with Convolutional Networks. arXiv. 2018 https://arxiv.org/abs/1802.01483
  39. Li L. A System for Massively Parallel Hyperparameter Tuning. arXiv. 2020 https://arxiv.org/abs/1810.05934
  40. Liaw R. Tune: A research platform for distributed model selection and training. arXiv. 2018 https://arxiv.org/abs/1807.05118
  41. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection . arXiv. 2018 doi: 10.1109/TPAMI.2018.2858826. https://arxiv.org/abs/1708.02002 [DOI] [PubMed]
  42. von Ziegler L. DLCAnalyzer. Commit 7f12ca8. GitHub; 2021. https://github.com/ETHZ-INS/DLCAnalyzer
  43. Marks M. SIPEC: The Deep-Learning Swiss Knife for Behavioral Data Analysis. bioRxiv. 2020 doi: 10.1101/2020.10.26.355115. [DOI]
  44. Mathis A. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience. 2018;21:1281–1289. doi: 10.1038/s41593-018-0209-y. [DOI] [PubMed] [Google Scholar]
  45. Monfort M. Multi-Moments in Time. arXiv. 2020 https://arxiv.org/pdf/1911.00232.pdf
  46. Müller R, Kornblith S, Hinton G. When Does Label Smoothing Help? arXiv. 2019 https://arxiv.org/abs/1906.02629
  47. Nath T. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nature Protocols. 2019;14:2152–2176. doi: 10.1038/s41596-019-0176-0. [DOI] [PubMed] [Google Scholar]
  48. Nawhal M, Mori G. Activity Graph Transformer for Temporal Action Localization. arXiv. 2021 https://arxiv.org/abs/2101.08540
  49. Neubarth NL. Meissner corpuscles and their spatially intermingled afferents underlie gentle touch perception. Science. 2020;368:eabb2751. doi: 10.1126/science.abb2751. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Nilsson SR. Simple Behavioral Analysis (SIMBA) – an Open Source Toolkit for Computer Classification of Complex Social Behaviors in Experimental Animals. bioRxiv. 2020 doi: 10.1101/2020.04.19.049452. [DOI]
  51. Orefice LL. Peripheral Mechanosensory Neuron Dysfunction Underlies Tactile and Behavioral Deficits in Mouse Models of ASDs. Cell. 2016;166:299–313. doi: 10.1016/j.cell.2016.05.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Orefice LL. Targeting Peripheral Somatosensory Neurons to Improve Tactile-Related Phenotypes in ASD Models. Cell. 2019;178:867–886. doi: 10.1016/j.cell.2019.07.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Paszke A. Pytorch: An Imperative Style, High-Performance Deep Learning Library. arXiv. 2018 https://arxiv.org/abs/1912.01703
  54. Peça J, Feliciano C, Ting JT, Wang W, Wells MF, Venkatraman TN, Lascola CD, Fu Z, Feng G. Shank3 mutant mice display autistic-like behaviours and striatal dysfunction. Nature. 2011;472:437–442. doi: 10.1038/nature09965. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Pedregosa F. Scikit-learn: Machine learning in Python. 2021. https://scikit-learn.org/stable/
  56. Pennington ZT. ezTrack: An open-source video analysis pipeline for the investigation of animal behavior. Scientific Reports. 2019;9:19979. doi: 10.1038/s41598-019-56408-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Pereira T. Leap Estimates Animal Pose. LEAP; 2018a. [Google Scholar]
  58. Pereira TD. Fast Animal Pose Estimation Using Deep Neural Networks. Nature. 2018b;16:117–125. doi: 10.1101/331181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Piergiovanni AJ, Ryoo MS. Temporal Gaussian Mixture Layer for Videos. arXiv. 2018 https://arxiv.org/abs/1803.06316
  60. Riba E, Mishkin D, Ponsa D, Rublee E, Bradski G. Kornia: An Open Source Differentiable Computer Vision Library for PyTorch. arXiv. 2019 https://arxiv.org/abs/1910.02190
  61. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation . arXiv. 2015 https://arxiv.org/abs/1505.04597
  62. van Rossum G, Drake FL. The Python language reference. Python Software Foundation. 2010 https://docs.python.org/3/reference/
  63. Ryait H. Data-driven analyses of motor impairments in animal models of neurological disorders. PLOS Biology. 2019;17:e3000516. doi: 10.1371/journal.pbio.3000516. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Sauerbrei BA, Guo J-Z, Cohen JD, Mischiati M, Guo W, Kabra M, Verma N, Mensh B, Branson K, Hantman AW. Cortical pattern generation during dexterous movement is input-driven. Nature. 2020;577:386–391. doi: 10.1038/s41586-019-1869-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Schindelin J. Fiji: an open-source platform for biological-image analysis. Nature Methods. 2012;9:676–682. doi: 10.1038/nmeth.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Segalin C. The Mouse Action Recognition System (MARS): A Software Pipeline for Automated Analysis of Social Behaviors in Mice. bioRxiv. 2020 doi: 10.1101/2020.07.26.222299. [DOI] [PMC free article] [PubMed]
  67. Simonyan K, Zisserman A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv. 2014 https://arxiv.org/abs/1406.2199
  68. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research. 2014;15:1929–1958. [Google Scholar]
  69. Sturman O, von Ziegler L, Schläppi C, Akyol F, Privitera M, Slominski D, Grimm C, Thieren L, Zerbi V, Grewe B, Bohacek J. Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacology. 2020;45:1942–1952. doi: 10.1038/s41386-020-0776-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. van Dam EA, Noldus L, van Gerven MAJ. Deep learning improves automated rodent behavior recognition within a specific experimental setup. Journal of Neuroscience Methods. 2020;332:S0165-0270(19)30393-0. doi: 10.1016/j.jneumeth.2019.108536. [DOI] [PubMed] [Google Scholar]
  71. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. 2004;13:600–612. doi: 10.1109/tip.2003.819861. [DOI] [PubMed] [Google Scholar]
  72. Wang L, Xiong Y, Wang Z, Qiao Y. Towards Good Practices for Very Deep Two-Stream ConvNets . arXiv. 2015 https://arxiv.org/abs/1507.02159
  73. Wiltschko AB. Mapping Sub-Second Structure in Mouse Behavior. Neuron. 2015;88:1121–1135. doi: 10.1016/j.neuron.2015.11.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Wiltschko AB, Tsukahara T, Zeine A, Anyoha R, Gillis WF, Markowitz JE, Peterson RE, Katon J, Johnson MJ, Datta SR. Revealing the structure of pharmacobehavioral space through motion sequencing. Nature Neuroscience. 2020;23:1433–1443. doi: 10.1038/s41593-020-00706-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Xie T, Yang X, Zhang T, Xu C, Patras I. Exploring Feature Representation and Training Strategies in Temporal Action Localization. arXiv. 2019 https://arxiv.org/abs/1905.10608
  76. Zeng R. Graph Convolutional Networks for Temporal Action Localization. arXiv. 2019 doi: 10.1109/TPAMI.2021.3090167. https://arxiv.org/abs/1909.03252 [DOI] [PubMed]
  77. Zhu Y, Lan Z, Newsam S, Hauptmann AG. Hidden Two-Stream Convolutional Networks for Action Recognition. arXiv. 2017 https://arxiv.org/abs/1704.00389

Decision letter

Editor: Mackenzie W Mathis1
Reviewed by: Mackenzie W Mathis2, Johannes Bohacek

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

DeepEthogram introduces a new tool to the neuroscience and behavior community that allows researcher-defined actions to be identified automatically, directly from video. The authors comprehensively benchmark the tool and provide data demonstrating its high utility in many common laboratory scenarios.

Decision letter after peer review:

Thank you for submitting your article "DeepEthogram: a machine learning pipeline for supervised behavior classification from raw pixels" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Mackenzie Mathis as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Johannes Bohacek (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

Bohnslav et al. present a new toolkit and GUI for using video input to extract behavioral states (ethograms) using a set of established deep neural networks. They show their pipeline works on a range of laboratory datasets, and provide metrics comparing network performance to humans. However, the reviewers all agreed there are several key revisions needed in order to support the main claims of the paper. These revolve around benchmarking, datasets, and a more careful handling of related work, the limitations of such software, and clarifying the methods. We have collectively decided to send the individual reviews from each reviewer, and ask that you address those (and perhaps combine where you see fit), but we urge you to focus on the following points for your revision.

Datasets:

The reviewers each expressed concern over the simplicity of the datasets and the potentially limited scope of DeepEthogram in relation. For example, the authors claim these are difficult datasets, but in fact we feel they are not representative of the laboratory videos often collected: they have very static backgrounds, and no animals have cables or other occluders. We would urge the authors to use other datasets, even those publicly available, to more thoroughly benchmark performance on a broader collection of behaviors.

Benchmarking:

While DeepEthogram could be an important addition to the growing toolbox of deep learning tools for behavior, we felt that there are sufficient other options available that the authors should directly compare performance. While we do appreciate the comparison to the "gold standard" of human-labeled data, the real challenge with such datasets is that even humans tend not to agree on a semantic label. Here, the authors only use two humans for ground-truth annotation, but there is a concern about outliers. Typically, 3 humans are used to overcome a bit of this limitation. Therefore, we suggest carefully benchmarking against humans (i.e., increase the number of ground truth annotations), and please see the individual reviewer comments with specific questions related to other published/available code bases where you can directly compare your pipeline's performance.

Methods, Relation to other packages, and Limitations:

The reviewers raised several points where the methods are unclear, or where it was not clear how an analysis was performed. In particular, we ask you to check reviewer #3's comments carefully regarding methods. Moreover, we think a more nuanced discussion about when to do some "pre-processing" (like pose estimation) versus going straight to an ethogram, and vice versa, would be beneficial. In particular, it's worth noting that oftentimes having an intermediate bottleneck such as keypoints allows the user to more easily assess network performance (keypoints have a defined ground truth vs. semantic action labels).

In total, the reviews are certainly enthusiastic about this work, and we do hope you find these suggestions helpful. We look forward to reading your revision.

Reviewer #1:

Bohnslav et al. present a new tool to quantify behavioral actions directly from video. I think this is a nice addition to the growing body of work using video to analyze behavior. The paper is well written, clear for a general audience, and takes nice innovations in computer vision into the life sciences and presents a usable tool for the community. I have a few critical points that I believe need to be addressed before publication, mostly revolving around benchmarking, but overall I am enthusiastic about this work being in eLife.

In the following sections I highlight areas I believe can be improved upon.

In relation to prior work: The authors should more explicitly state their contribution, and the field's contributions, to action recognition. The introduction mostly highlights limitations of unsupervised methods to perform behavioral analysis (which, to note, produces the same outputs as this paper, i.e. an ethogram) and keypoint estimation alone, which of course is tackling a different problem. What I would like to see is a more careful consideration of the state of the field in computer vision for action recognition, and a clear definition of what the contribution is in this paper; the cover letter alludes to them developing novel computer vision aspects of the package, but from the code base, etc., it seems they utilize (albeit nicely!) pre-existing works from ~3 years ago, begging the question of whether this is truly state-of-the-art performance. Moreover, and this does hurt novelty a bit, this is not the first report in the life sciences of such a pipeline, so this should be clearly stated. I don't think it's required to compare this tool to every other tool available, but I do think discussing this in the introduction is of importance (but again, I am still enthusiastic for this being in eLife).

"Our model operates directly on the raw pixel values of videos, and thus it is generally applicable to any case with video data and binary behavior labels and further does not require pre-specification of the body features of interest, such as keypoints on limbs or fitting the body with ellipses." – please include references to the many other papers that do this as well. For example, please see:

Data-driven analyses of motor impairments in animal models of neurological disorders https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000516

LSTM Self-Supervision for Detailed Behavior Analysis https://openaccess.thecvf.com/content_cvpr_2017/html/Brattoli_LSTM_Self-Supervision_for_CVPR_2017_paper.html

Facial expressions of emotion states and their neuronal correlates in mice https://science.sciencemag.org/content/368/6486/89/tab-figures-data (not deep learning, but similar workflow; also extract features as the authors here do, and gets good performance using old CV techniques)

Deep learning improves automated rodent behavior recognition within a specific experimental setup https://www.sciencedirect.com/science/article/pii/S0165027019303930

I think Figure 1A is a bit misleading, it's not clear anymore that manual annotation is the only or most common other alternative pipeline (discussed below in benchmarking)- many tools for automated analysis now exist, and tools like JAABA and MotionMapper have been around for 5+ years; I would rather like to see a comparison workflow to "unsupervised methods," and/or keypoint estimation + classification with supervised or unsupervised means.

Lastly, they do not discuss key papers in life science for automated animal ethogram building, such as Live Mouse Tracker (https://livemousetracker.org/), BORIS and related Behatrix. Not only should these important papers be discussed, they should likely be benchmarked if the authors want to claim SOTA (see below).

Datasets: the authors claim they picked challenging datasets ("Diverse and challenging datasets to test DeepEthogram"), but I don't believe this is the case and they should tone down this statement. In fact, the datasets presented are rather easy to solve (the camera is orthogonal to the animal, i.e. top or bottom, or the animal's position is fixed, and the background is homogeneous, which is rarely the case even for laboratory experiments). I would urge them to use another more challenging dataset, and/or discuss the limitations of this work. For example, a mouse in a standard home cage with bedding, nests, huts, etc would pose more challenges, or they could report their performance on the Kinetics-700 dataset, which they pretrain on anyhow.

Benchmarking: The authors don't directly compare their work to that of other tools available in the field. Is their approach better (higher performance) than:

(1) unsupervised learning methods

(2) pose estimation plus classifiers or unsupervised clustering (as done in LEAP, DeepLabCut, B-SOiD, SIMBA, and the ETH DLC-Analyzer)

(3) tools that automate ethogram building, such as JAABA, BORIS/Behatrix.

Therefore, more results should be presented in relation to key works, and/or a more clear introduction on this topic should be presented.

– For example, they claim it's hard to match the resulting clusters from unsupervised learning to their "label:" i.e., "their outputs can be challenging to match up to behaviors of interest in cases in which researchers have strong prior knowledge about the specific behaviors relevant to their experiments". But this is not really a fair statement; one can simply look at the clusters and post-hoc assign a label, which has been nicely done in MotionMapper, for example.

– In pose estimation, one gets an animal-centric lower dimensional representation, which can be mapped onto behavioral states (ethograms), or used for kinematic analysis if desired. However, there is a minimal number of key points needed to make a representation that can still be used for ethogram building. Is the raw-pixel input truly better than this for all behaviors? For example, on the simple datasets with black backgrounds presented in this work, the background pixels are useless, and don't hinder the analysis. However, if the background dynamically changed (the camera is moving, or the background changes (lighting, bedding etc)), then classification from raw pixels becomes much harder than classification from extracted keypoints. Therefore, I think the authors should do the following: (1) discuss this limitation clearly in the paper, and (2) if they want to claim their method has universally higher performance, they need to show this on both simple and more challenging data.

Moreover, the authors discuss 4 limitations of other approaches, but do not address them in their work, i.e.:

– "First, the user must specify which features are key to the behavior (e.g. body position or limb position), but many behaviors are whole-body activities that could best be classified by full body data." – can they show an example where this is true? It seems from their data each action could be easily defined by kinematic actions of specific body parts a priori.

– "Second, errors that occur in tracking these features in a video will result in poor input data to the classification of behaviors, potentially decreasing the accuracy of labeling." – but is poor video quality not an issue for your classification method? The apple-to-apple comparison here is having corrupted video data as "bad" inputs – of course any method will suffer with bad data input.

– "Third, users might have to perform a pre-processing step between their raw videos and the input to these algorithms, increasing pipeline complexity and researcher time." – can they elaborate here? What preprocessing is needed for pose estimation, that is not needed for this, for example? (Both require manual labor, and given the time estimates, DEG takes longer to label than key point estimation due to the human needing to be able to look at video clips (see their own discussion)).

– "Fourth, the selection of features often needs to be tailored to specific video angles, behaviors (e.g. social behaviors vs. individual mice), species, and maze environments, making the analysis pipelines often specialized to specific experiments." – this is absolutely true, but also a limitation to the authors work, where the classifiers are tailored, the video should be a fixed perspective, background static, etc. So again I don't see this as a major limitation that makes pose estimation a truly invalid option.

Benchmarking and Evaluation:

– "We evaluated how many video frames a user must label to train a reliable model. We selected 1, 2, 4, 8, 12, or 16 random videos for training and used the remaining videos for evaluation. We only required that each training set had at least one frame of each behavior. We trained the feature extractors, extracted the features, and trained the sequence models for each split of the data." – it is not clear how many FRAMES are used here; please state in # of frames in Figure 5 and in the text (not just video #'s).

Related: "Combining all these models together, we found that the model performed with more than 90% accuracy when trained with only 80 example frames" This again is a bit misleading, as the user wants to know the total # of frames needed for your data, i.e. in this case this means that a human needs to annotate at least 80-100 frames per behavior, which for 5 states is ~500 frames; this should be made more explicit.

– "We note that here we used DEG-fast due to the large numbers of splits of the data, and we anticipate that the more complex DEG-medium and DEG-slow models might even require less training data." – this would go against common assumptions in deep learning; the deeper the models, the more prone to overfitting you are with less data. Please revise, or show the data that this statement is true.

– "Human-human performance was calculated by defining one labeler as the "ground truth" and the other labeler as "predictions", and then computing the same performance metrics as for DEG. " – this is a rather unconventional way to measure ground truth performance of humans. Shouldn't the humans be directly compared for % agreement and % disagreement on the behavioral state? (i.e., add a plot to the row that starts with G in figure 3).

To note, this is a limitation of such approaches, compared to pose-estimation, as humans can disagree on what a "behavior" is, whereas key points have a true GT, so I think it's a really important point that the authors address this head on (thanks!), and could be expanded in the discussion. Notably, MARS puts a lot of effort into measuring human performance, and perhaps this could be discussed in the context of this work as well.

Reviewer #2:

It was a pleasure reviewing the methodological manuscript describing DeepEthogram, a software package developed for supervised behavioral classification. The software is intended to allow users to automate classification/quantification of complex animal behaviors using a set of supervised deep learning algorithms. The manuscript combines a few state-of-the-art neural networks into a pipeline to solve the problem of behavior classification in a supervised way. The pipeline uses well-established CNNs to extract spatial features from each still frame of the videos that best predict the user-provided behavior labels. In parallel, optical flow for each frame is estimated through another CNN, providing information about the "instantaneous" movement for each pixel. The optical flow "image" is then passed to another feature extractor that has the same architecture as the spatial feature extractor, and meaningful patterns of pixel-wise movements are extracted. Finally, the spatial feature stream and the optical flow feature stream are combined and fed into a temporal Gaussian mixture CNN, which can pool together information across long periods of time, mimicking human classifiers who can use previous frames to inform the classification of behavior in the current frame. The resulting pipeline provides a supervised classification algorithm that can operate directly on raw videos, while placing relatively small computational demands on the hardware.

While I think something like DeepEthogram is needed in the field, I think the authors could do substantially more to validate that DeepEthogram is the ticket. In particular, I find the range of datasets validated in the manuscript poorly representative of the range of behavioral tracking circumstances that researchers routinely face. First, in all exemplar datasets, the animals are recorded in a completely empty environment. The animals are not interacting with any objects as they might in routine behavioral tests; there are no cables attached to them (which is routine for optogenetic studies, physiological recording studies, etc); they are alone (Can DeepEthogram classify social behaviors? the github page lists this as a typical use case); there isn't even cage bedding.

The authors also tout the time saving benefits of using deep ethogram. However, with their best performing implementation (DEG slow), with a state of the art computer, with a small video (256 x 256 pixels, width by height), the software runs at 15 frames per second (nearly 1/2 the speed of the raw video). My intuition is that this is on the slow side, given that many behaviors can be scored by human observers in near real time if the observer is using anything but a stopwatch. It would be nice to see benchmarks on larger videos that more accurately reflect the range of acquisition frames. If it is necessary for users to dramatically downsample videos, this should be made clear.

Specific comments:

– It would be nice to see if DeepEthogram is capable of accurately scoring a behavior across a range of backgrounds. For example, if the model is trained on a sideview recording of an animal grooming in its cage, can it accurately score an animal in an open field doing the same from an overhead view, or a side view? If the authors provided guidance on such issues to the reader this would be helpful.

– The authors should highlight that human scoring greatly outperforms DEG on a range of behaviors when comparing the individual F1 scores in Figure 3. Why aren't there any statistics for these comparisons?

– Some of the F1 scores for individual behaviors look very low (~0.5). It would be nice to know what chance performance is in these situations and if the software is performing above chance.

– I find it hard to understand the size of the data sets used in the analyses. For instance, what is 'one split of the data', referenced in Figure 3? Moreover, the authors state "We selected 1, 2, 4, 8, 12, or 16 random videos for training and used the remaining videos for evaluation" I have no idea what this means. What is the length and fps of the video?

– Are overall F1 scores in Figure 3 computed as the mean of the individual scores on each component F1 score, or the combination of all behaviors (such that it weights high frequency behaviors)? It's also difficult to understand what the individual points in Figure 4 (a-c) correspond to.

– The use of the names Mouse-1, Mouse-2 etc for experiments are confusing because it can appear that these experiments are only looking at single mice. I would change the nomenclature to highlight that these reflect experiments with multiple mice.

– It is not clear why the image has to be averaged across RGB channels and then replicated 20 times for the spatial stream. The author mentioned "To leverage ImageNet weights with this new number of channels", and I assume this means the input to the spatial stream has to have same shape (number of weights) as the input to the flow stream. However why this is the case is not clear, especially considering two feature extractor networks are independently trained for spatial and flow streams. Lastly this might raise the question of whether there will be valuable information in the RGB channels separately that will be lost from the averaging operation (for example, certain part of an animal's body has different color than others but is equal-luminous).

– It is not intuitive why simple average pooling is sufficient for fusing the spatial and flow streams. It can be speculated that classification of certain behaviors will benefit much more from optical flow features while other behaviors benefit from still image features. I'm curious to see whether an additional layer at the fusing stage that has behavior-specific weights could improve performance.

– Since computational demand is one of the major concerns in this article, I'm wondering whether exploiting the sparse nature of the input images would further improve the performance of the algorithm. Oftentimes the animal of interest only occupies a small number of pixels in the raw images, and some simple thresholding of the images, or even user-defined masking of the images, together with use of sparse data backends and operations, should in theory significantly reduce the computational demands for both the spatial and flow feature extractor networks.

Reviewer #3:

The paper by Bohnslav et al., presents a software tool that integrates a supervised machine learning algorithm for detecting and quantifying behavior directly from raw video input. The manuscript is well-written, the results are clear. Strengths and weaknesses of the approach are discussed and the work is appropriately placed in the bigger context of ongoing research in the field. The algorithms demonstrate high performance and reach human-level accuracy for behavior recognition. The classifiers are embedded in an excellent user-friendly interface that eliminates the need of any programming skills on the end of the user. Labeled datasets can even be imported. We suggest additional analyses to strengthen the manuscript.

1) Although the presented metrics for accuracy and F1 are state of the art, it would be useful to also report absolute numbers for some of the scored behaviors for each trial, because most behavioral neuroscience studies actually report behavior in absolute numbers and/or durations of individual behaviors (rears, face grooms, etc.). Correlation of human and DEG data should also be presented on this level. This will speak to many readers more directly than the accuracy and F1 statistics. For this, we would like to see a leave-one-out cross-validation or a k-fold cross-validation (ensuring that each trial ends up exactly once in a cross-validation set) that enables a final per-trial readout. This can be done with only one of the DEG types (e.g. "fast"). The current randomization approach of 60/20/20% (train/validate/test) with an n of 3 repeats is insufficient, since it a) allows per-trial data for at most 60% of all files and b) is susceptible to artefacts due to random splits (i.e., one abnormal trial can be over- or under-represented in the cross-validation sets).

2) In line with comment 1) we propose to update Figure 4, which at the moment uses summed up data from multiple trials. We would rather like to see each trial represented by a single data-point in this figure (#bouts/#frames by behavior). As alternative to individual scatterplots, correlation-matrix-heatmaps could be used to compare different raters.

3) Direct benchmarking against existing datasets is necessary. With many algorithms being published these days, it is important to pick additional (published) datasets and test how well the classifiers perform on those videos. Their software package already allows import of labeled datasets, some are available online. For example, how well can DeepEthogram score…

a. grooming in comparison to Hsu and Yttri (REF #17) or van den Boom et al., (2017, J Neurosci Methods).

b. rearing in comparison to Sturman et al., (REF #21).

c. social interactions compared to (Segalin et al., (REF #7) or Nilsson et al., (REF #19)).

4) In the discussion on page 19 the authors state: "Subsequently, tens to hundreds to thousands of movies could be analyzed, across projects and labs, without additional user-time, which would normally cost additionally hundreds to thousands of hours of time from researchers." This sentence suggests that a network trained on, e.g., the open field test in one lab can be transferred across labs. This key issue of "model transferability" should be tested. E.g., the authors could use the classifier from mouse#3 and test it on another available top-view recording dataset recorded in a different lab with a different open-field setup (datasets are available online, e.g. REF #21).

5) Figure 5D/E: The trendline is questionable; we would advise fitting a sigmoid trendline, not an arbitrarily high-order polynomial. Linear trend lines (such as shown in Figure 4) should include R² values on the plot or in the legend.

6) In the discussion, the authors do a very good job highlighting the limitations and advantages of their approach. The following limitations should however be expanded:

a. pose-estimation-based approaches (e.g. DLC) are going to be able to track multiple animals at the same time (thus allowing e.g. better read-outs of social interaction). It seems this feature cannot be incorporated in DeepEthogram.

b. Having only 2 human raters is a weakness that should briefly be addressed. Triplicates are useful for assessing outlier values, this could be mentioned in light of the fact that the F1 score of DeepEthogram occasionally outperforms the human raters (e.g. Figure 3C,E).

c. Traditional tracking measures such as time in zone, distance moved and velocity cannot be extracted with this approach. These parameters are still very informative and require a separate analysis with different tools (creating additional work).

d. The authors are correct that the additional time required for behavior analysis (due to the computationally demanding algorithms) is irrelevant for most labs. However, they should add (1) that the current system will not be able to perform behavior recognition in real time (thus preventing the use of closed-loop systems, which packages such as DLC have made possible) and (2) that the speed they discuss on page 16 is based on an advanced computer system (GPU, RAM) and will not be possible with a standard lab computer (or provide an estimate how long training would require if it is possible).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "DeepEthogram, a machine learning pipeline for supervised behavior classification from raw pixels" for further consideration by eLife. Your revised article has been evaluated by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Johannes Bohacek (Reviewer #3).

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

The reviewers all felt the manuscript was improved, and they thank the authors for the additional datasets and analysis. We would just like to see two items addressed before the publication is fully accepted.

(1) Both reviewer #1 and #2 note the new data is great, but it lacks human ground truth. Both for comparison and for releasing the data for others to benchmark on, please include these data. We also understand that obtaining ground truth from 3 persons is a large time commitment, but even if there is only one person, this data should be included for all datasets shown in Figure 3.

(2) Please include links for the raw videos used in this work; it is essential for others to benchmark and use to validate the algorithm presented here (see Reviewer 3: "raw videos used in this work (except the ones added during the revision) are – it appears – not accessible online").

Lastly, reviewer 3 notes that perhaps, still, some use cases are best suited for DeepEthogram, while others are better suited to pose estimation plus other tools, but this of course cannot be exhaustively demonstrated here; at your discretion you might want to address this in the discussion, but we leave that up to your judgement.

Reviewer #1:

I thank the authors for the revisions and clarifications, and I think the manuscript is much improved. Plus, the new datasets and comparisons to B-SOiD and DLC-Analyzer (Sturman) are good additions.

One note is that it is not clear which datasets have ground-truth data; namely, in the Results, 5 datasets used for testing are introduced:

"Mouse-Ventral1"

"Mouse-Ventral2"

"Mouse-Openfield"

"Mouse-Homecage"

"Mouse-Social"

plus three datasets from published work by Sturman et al.,

and "fly"

Then it states that all datasets were labeled; yet, Figure 3 has no ground truth for "Mouse-Ventral2", "Mouse-Homecage", "Mouse-Social", or "Fly" -- please correct and include the ground truth. I do see that it says that only a subset of each of the 2 datasets in Figure 3 are labeled with 3 humans, but minimally then the rest (1 human?) should be included in Figure 3 (and be made open source for future benchmarking).

It appears from the discussion this was done (i.e., at least 1 human, as this is of course required for the supervised algorithm too):

"In our hands, it took approximately 1-3 hours for an expert researcher to label five behaviors in a ten-minute movie from the Mouse-Openfield dataset" and it appears that labeling is defined in the methods.

Reviewer #2:

The authors did a great job addressing our comments, especially with the additional validation work. My only concern is that some of the newly included datasets don't have human-labeled performance for comparison, hence making it hard to judge the actual performance of DeepEthogram. While I understand it is very time-consuming to obtain human labels, I think it will greatly improve the impact of the work if the model comparison can be benchmarked against ground truth. In particular, it would be great to see the comparison to human labels for the "Mouse-Social" and "Mouse-Homecage" datasets, which presumably represent a large proportion of use cases for DeepEthogram. Otherwise, I think it looks good and I would support publication of this manuscript.

Reviewer #3:

The authors present a software solution (DeepEthogram) that performs supervised machine-learning analysis of behavior directly from raw videos files. DeepEthogram comes with a graphical user interface and performs behavior identification and quantification with high accuracy, requires modest amounts of pre-labeled training data, and demands manageable computational resources. It promises to be a versatile addition to the ever-growing compendium of open-source behavior analysis platforms and presents an interesting alternative to pose-estimation-based approaches for supervised behavior classification, under certain conditions.

The authors have generated a large amount of additional data and showcase the power of their approach in a wide variety of datasets, including their own data as well as published datasets. DeepEthogram is clearly a powerful tool, and the authors do an excellent job describing the advantages and disadvantages of their system and provide a nuanced comparison of point-tracking analyses vs. analyses based on raw videos (pixel data). Also, their responses to the reviewers' comments are very detailed, thoughtful, and clear. The only major issue is that the raw videos used in this work (except the ones added during the revision) are – it appears – not accessible online. This problem must be solved; the videos are essential for reproducibility.

A minor caveat is that, in order to compare DeepEthogram to existing supervised and unsupervised approaches, the authors have slightly skewed the odds in their favor by picking conditions that benefit their own algorithm. In the comparison with point-tracking data, they use a low-resolution top-view recording to label the paws of mice (which are obstructed most of the time from this angle). In the comparison with unsupervised clustering, they use the unsupervised approach for an application that it isn't really designed for (performed in response to reviewers' requests). But the authors directly address these points in the text, and the comparisons are still valid and interesting and address the reviewers' concerns.

eLife. 2021 Sep 2;10:e63377. doi: 10.7554/eLife.63377.sa2

Author response


Summary:

Bohnslav et al., present a new toolkit and GUI for using video input to extract behavioral states (ethograms) using a set of established deep neural networks. They show their pipeline works on a range of laboratory datasets, and provide metrics comparing network performance to humans. However, the reviewers all agreed there are several key revisions needed in order to support the main claims of the paper. These revolve around benchmarking, datasets, and a more careful handling of related work, limitations of such a software package, and clarifying methods. We have collectively decided to send the individual reviews from each reviewer, and ask that you address those (and perhaps combine where you see fit), but we urge you to focus on the following points for your revision.

We thank the reviewers for their feedback and constructive suggestions. We have worked hard to incorporate all the suggestions of each reviewer and feel that the manuscript and software are much improved as a result.

Datasets:

The reviewers each expressed concern over the simplicity of the datasets and the potentially limited scope of DeepEthogram in relation. For example, the authors claim these are difficult datasets, but in fact we feel they are not representative of the laboratory videos often collected: they have very static backgrounds, and no animals have cables or other occluders. We would urge the authors to use other datasets, even those publicly available, to more thoroughly benchmark performance in a broader collection of behaviors.

We thank the reviewers for the suggestion to add more datasets that cover a wider range of behavior settings. We have now added five datasets, including three publicly available datasets and two new datasets collected by us that specifically address these concerns. The new datasets we collected include a mouse in a homecage, which contains a complex background and occluders. The second dataset we added is a social interaction dataset that includes two mice and thus complex and dynamic settings. The three publicly available datasets we added feature commonly used behavioral paradigms: the open field test, the forced swim test, and the elevated plus maze. We thank Sturman et al., for making their videos and labels publicly available.

We now have nine datasets in total that span two species, multiple view angles (dorsal, ventral, side), individual and social settings, complex and static backgrounds, and multiple types of occluders (objects and other mice). We feel these datasets cover a wide range of common lab experiments and typical issues for behavior analysis. It is of course not possible to cover all types of videos, but we have made a sincere and substantial effort to demonstrate the efficacy of our software in a variety of settings.

We think the datasets that we include are representative of the laboratory videos often collected. To our knowledge, datasets for open field behavior are some of the most commonly collected in laboratory settings when one considers behavioral core facilities, biotech/pharmaceutical companies, as well as the fields of mouse disease models, mouse genetic mutations, and behavioral pharmacology. The same is true for the elevated plus maze, which is now included in the revision. A Pubmed search for “open field test” or “elevated plus maze” reveals more than 2500 papers for each in the past five years. We also note that all the datasets we include were not collected only for the purpose of testing our method; rather, they were collected for specific neuroscience research questions, indicating that they are at least reflective of the methods used in some fields, including the large fields of pain research, anxiety research, and autism research. The new additions of the datasets for the forced swim test, elevated plus maze, social interaction, and homecage behavior extend our tests of DeepEthogram to other commonly acquired videos. We agree that these videos do not cover all the possible types of videos that can be collected or that are common in a lab setting, but we are confident these videos cover a wide range of very commonly used behavioral tests and thus will be informative regarding the possible utility of DeepEthogram.

Benchmarking:

While DeepEthogram could be an important tool to the growing toolbox of deep learning tools for behavior, we felt that there are sufficiently other options available that the authors should directly compare performance. While we do appreciate that comparing to the "gold standard" of human-labeled data, the real challenge with such datasets is even humans tend not to agree on a semantic label. Here, the authors only use two humans for ground-truth annotation, but there is a concern of an outlier. Typically, 3 humans are used to overcome a bit of this limitation. Therefore, we suggest carefully benchmarking against humans (i.e., increase the number of ground truth annotations), and please see the individual reviewer comments with specific questions related to other published/available code bases where you can directly compare your pipelines performance.

We thank the reviewers for the recommendation to add more benchmarking. We have extended our benchmarking analysis in two important ways. First, we have added a third human labeler. The results are qualitatively similar to before with the third human labeler added. We have also included three datasets from Sturman et al., each of which has three labelers. Second, we have added a comparison to other recent approaches in the field, including the use of keypoint tracking followed by supervised classification into behaviors and the use of unsupervised behavior analysis followed by post-hoc labeling of machine-generated clusters. We find that DeepEthogram performs better than these alternate methods. We think the addition of this new benchmarking combined with the new datasets will provide the reader with an accurate measure of DeepEthogram’s performance.

Methods, Relation to other packages, and Limitations:

The reviewers raised several points where the methods, or how an analysis was performed, are unclear. In particular, we ask you to check reviewer #3's comments carefully regarding methods. Moreover, we think a more nuanced discussion about when to do some "pre-processing" (like pose estimation) would be beneficial vs. going straight to an ethogram, and vice versa. In particular, it's worth noting that oftentimes having an intermediate bottleneck such as keypoints allows the user to more easily assess network performance (keypoints are a defined ground truth vs. semantic action labels).

We thank the reviewers for these suggestions. We have clarified the methods and analysis as suggested and detailed below. We have also extended the discussion of our method relative to others, including keypoint approaches, so that the advantages and disadvantages of our approach are clearer. These changes are described in detail below in response to individual reviewer comments.

In total, the reviews are certainly enthusiastic about this work, and do hope you find these suggestions helpful. We look forward to reading your revision.

We appreciate the time that the reviewers spent to provide detailed comments and constructive suggestions. The feedback has been valuable in helping us to improve the paper and the software.

Reviewer #1:

Bohnslav et al., present a new tool to quantify behavior actions directly from video. I think this is a nice addition to the growing body of work using video to analyze behavior. The paper is well written, clear for a general audience, and takes nice innovations in computer vision into the life sciences and presents a usable tool for the community. I have a few critical points that I believe need to be addressed before publication, mostly revolving around benchmarking, but overall I am enthusiastic about this work being in eLife.

We thank the reviewer for this positive feedback and for their recommendations that have helped us to improve our work.

In the following sections I highlight areas I believe can be improved upon.

In relation to prior work: The authors should more explicitly state their contribution, and the field's contributions, to action recognition. The introduction mostly highlights limitations of unsupervised methods to perform behavioral analysis (which, to note, produces the same outputs as this paper, i.e. an ethogram) and keypoint estimation alone, which of course is tackling a different problem. What I would like to see is a more careful consideration of the state of the field in computer vision for action recognition, and a clear definition of what the contribution is in this paper. The cover letter alludes to the authors developing novel computer vision aspects of the package, but from the code base, etc., it seems they utilize (albeit nicely!) pre-existing works from ~3 years ago, begging the question of whether this is truly state-of-the-art performance. Moreover, and this does hurt novelty a bit, this is not the first report in the life sciences of such a pipeline, so this should be clearly stated. I don't think it's required to compare this tool to every other tool available, but I do think discussing this in the introduction is of importance (but again, I am still enthusiastic for this being in eLife).

We have revised the introduction so that it now focuses on how DeepEthogram differs in design to previous methods used to classify animal behaviors. We try to provide the reader with an understanding of the differences in design and uses between methods that utilize unsupervised behavior classification, keypoints, and classification from raw pixel values. We have removed the text that focused on potential limitations of keypoint-based methods. We have also added sentences that highlight that the approach we take is built on existing methods that have addressed action detection in different settings, making it clear that our software is not built from scratch and is rather an extension and novel use case of earlier, pioneering work in a different field. We now directly state that our contribution is to extend and apply this previous work in the new setting of animal behavior and life sciences research.

"Our model operates directly on the raw pixel values of videos, and thus it is generally applicable to any case with video data and binary behavior labels and further does not require pre-specification of the body features of interest, such as keypoints on limbs or fitting the body with ellipses." -- please include references to the many other papers that do this as well. For example, please see:

Data-driven analyses of motor impairments in animal models of neurological disorders https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000516

LSTM Self-Supervision for Detailed Behavior Analysis https://openaccess.thecvf.com/content_cvpr_2017/html/Brattoli_LSTM_Self-Supervision_for_CVPR_2017_paper.html

Facial expressions of emotion states and their neuronal correlates in mice https://science.sciencemag.org/content/368/6486/89/tab-figures-data (not deep learning, but similar workflow; also extract features as the authors here do, and gets good performance using old CV techniques)

Deep learning improves automated rodent behavior recognition within a specific experimental setup https://www.sciencedirect.com/science/article/pii/S0165027019303930

We have added new citations on this topic.

I think Figure 1A is a bit misleading; it's not clear anymore that manual annotation is the only or most common alternative pipeline (discussed below in benchmarking) – many tools for automated analysis now exist, and tools like JAABA and MotionMapper have been around for 5+ years. I would rather like to see a comparison workflow to "unsupervised methods," and/or keypoint estimation + classification with supervised or unsupervised means.

We have now added direct comparisons to the methods mentioned by the reviewer. In Supplementary Figures 20-21, we have direct comparisons to methods for keypoint estimation + classification and for unsupervised classification followed by post-hoc labeling. We show that DeepEthogram performs better than these approaches, at least for the dataset and behaviors we tested. In the discussion, we now provide an extended section on cases in which we expect DeepEthogram to perform better than other methods and cases in which we anticipate other methods, such as keypoint estimation followed by classification, may outperform DeepEthogram. We hope these new comparisons and new text help readers understand the differences between methods as well as the situations in which they should choose one versus the other.

To our knowledge, manual annotation is still very common, and we do not think automated analysis tools, such as JAABA and MotionMapper, are more commonly used than manual annotation. We have many colleagues in our department, departments at neighboring institutions, and behavior core facilities that solve the problem of behavior classification using manual annotation. In fact, we have had a hard time finding any colleagues who use JAABA and MotionMapper as their main methods. We asked our colleagues why they do not use these automated approaches, and the common response has been that they have been tried but that they do not work sufficiently well or they are challenging to implement. Thus, to our knowledge, unlike DeepLabCut for keypoint estimation, these methods have not caught on for behavior classification. There are of course labs that use the existing automated approaches for this problem, but our informal survey of colleagues and the literature indicates that manual annotation is still very common.

For these reasons, we think that mentioning the idea of manual annotation is valid, and following the reviewer’s suggestion, we now focus our introduction on describing the approaches taken by recent automated pipelines, with an emphasis on how our pipeline differs in design.

Lastly, they do not discuss key papers in life science for automated animal ethogram building, such as Live Mouse Tracker (https://livemousetracker.org/), BORIS and related Behatrix. Not only should these important papers be discussed, they should likely be benchmarked if the authors want to claim SOTA (see below).

We thank the reviewer for these references, and we have added citations to these papers. However, we feel that each of these references solves a different problem than the one DeepEthogram is meant to address. Live Mouse Tracker is an interesting approach. However, it uses specialized hardware (depth sensors and RFID monitoring), which means their approach cannot be applied using typical video recording hardware, and is specific to mouse social behaviors. DeepEthogram is meant to use typical lab hardware and to be general-purpose across species and behaviors. BORIS appears to be an excellent open-source GUI for labeling behaviors in videos. However, it does not appear to include automatic labeling or machine learning. Behatrix is software for generating transition matrices, calculating statistics from behavioral sequences, and generating plots. When combined with BORIS, it is software for data analysis of human-labeled behavioral videos, but it does not appear to perform automatic labeling of videos. We have therefore cited these papers, but we do not see an easy way to benchmark our work against these approaches given that they seem to address different goals. Instead, we have added benchmarking of our work relative to a keypoint estimation + classification approach and in comparison to an unsupervised clustering with post-hoc labeling approach.

Datasets: the authors claim they picked challenging datasets ("Diverse and challenging datasets to test DeepEthogram"), but I don't believe this is the case and they should tone down this statement. In fact, the datasets presented are rather easy to solve (the camera is orthogonal to the animal, i.e. top or bottom, or the animal's position is fixed, and the background is homogeneous, which is rarely the case even for laboratory experiments). I would urge them to use another, more challenging dataset, and/or discuss the limitations of this work. For example, a mouse in a standard home cage with bedding, nests, huts, etc. would pose more challenges, or they could report their performance on the Kinetics700 dataset, which they pretrain on anyhow.

We have removed our statement about the datasets being challenging in general. We have now clarified the specific challenges that the datasets present to machine learning. In particular, we emphasize the major imbalance in the frequency of classes. Many of the classes of interest are only present in ~1-5% of the frames. In fact, this was the hardest problem for us to solve.

Following the reviewer’s suggestions, we have added new datasets that present the challenges mentioned, including complex backgrounds, occluders, and social interactions. This includes a mouse in the homecage with bedding, objects, and a hut, as suggested. We find that DeepEthogram performs well on these more challenging datasets.

Benchmarking: The authors don't directly compare their work to that of other tools available in the field. Is their approach better (higher performance) than:

(1) unsupervised learning methods

It is difficult to compare supervised learning methods to unsupervised ones. The objectives of the two approaches are different. In unsupervised methods, the goal is to identify behavior dimensions or clusters based on statistical regularities, whereas in supervised methods the goal is to label researcher-defined behaviors. The dimensions or clusters identified in unsupervised approaches are not designed to line up to researcher-defined behaviors of interest. In some cases, they may line up nicely, but in other cases it may be more challenging to identify a researcher’s behavior of interest in the output of an unsupervised algorithm.

Nevertheless, it is possible to try this approach and compare its performance to DeepEthogram (Supplementary Figure 21). We used DeepLabCut to identify keypoints followed by B-SOiD, an unsupervised method, to identify behavior clusters. We then labeled the B-SOiD clusters based on similarity to our pre-defined behaviors of interest. We verified that the DeepLabCut tracking worked well. Also, B-SOiD found statistically meaningful clusters that divided the DeepLabCut-based features into distinct parts of a low dimensional behavior space. However, in many cases, the correspondence of machine-generated clusters to researcher-defined behaviors of interest was poor.

We found that DeepEthogram’s performance was higher than that of the pipeline using unsupervised methods. For example, “rearing” was the behavior that the B-SOiD approach predicted most accurately, at 76% accuracy. In comparison, our worst model, DeepEthogram-fast, scored this behavior with 94% accuracy. The difference in performance was even greater for the other behaviors. Thus, a workflow in which one first generates unsupervised clusters and then treats them as labels for researcher-defined behaviors did not work as well as DeepEthogram.
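For readers unfamiliar with this style of comparison, the following is a minimal, hypothetical sketch of the general idea of post-hoc assignment of unsupervised clusters to researcher-defined behaviors by majority overlap. It is not the analysis code used here, and the arrays and sizes are placeholders; in practice the cluster-to-behavior mapping would be fit on a labeled subset and evaluated on held-out videos.

```python
# Hypothetical sketch: assign each unsupervised cluster (e.g. from B-SOiD) to the
# researcher-defined behavior it overlaps most, then score the resulting per-frame
# predictions against human labels. Not the analysis code used in the paper.
import numpy as np
from sklearn.metrics import accuracy_score

def map_clusters_to_behaviors(cluster_ids, human_labels):
    """For each cluster ID, pick the human-labeled behavior with the largest overlap."""
    mapping = {}
    for c in np.unique(cluster_ids):
        behaviors, counts = np.unique(human_labels[cluster_ids == c], return_counts=True)
        mapping[c] = behaviors[np.argmax(counts)]
    return mapping

# Placeholder per-frame data: 12 clusters, 5 behaviors, 10,000 frames
rng = np.random.default_rng(0)
cluster_ids = rng.integers(0, 12, size=10_000)
human_labels = rng.integers(0, 5, size=10_000)

mapping = map_clusters_to_behaviors(cluster_ids, human_labels)
predictions = np.array([mapping[c] for c in cluster_ids])
print("post-hoc accuracy:", accuracy_score(human_labels, predictions))
```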

(2) pose estimation plus classifiers or unsupervised clustering (as done in LEAP, DeepLabCut, B-SOiD, SIMBA, and the ETH DLC-Analyzer)

We benchmarked DeepEthogram relative to pose estimation plus classification using the classification architecture from Sturman et al., (ETH DLC-Analyzer) (Supplementary Figure 20). Note that we re-implemented the Sturman et al. approach for two reasons. First, our datasets can have more than one behavior per timepoint, which necessitates different activation and loss functions. Second, we wanted to ensure the same train/validation/test splits were used for both DeepEthogram and the ETH DLC-Analyzer classification architecture (the multilayer perceptron).
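To illustrate the first point, below is a minimal sketch of the kind of change a multi-label setting requires: per-behavior sigmoid outputs with binary cross-entropy loss, rather than a softmax over mutually exclusive classes. This is a hypothetical example, not the re-implementation itself; layer sizes, feature dimensions, and names are illustrative.

```python
# Hypothetical sketch: a keypoint-feature MLP classifier adapted for multi-label
# behavior classification (sigmoid + binary cross-entropy instead of softmax +
# categorical cross-entropy), so multiple behaviors can be active on one frame.
# Layer sizes and feature dimensions are illustrative, not the re-implementation.
import torch
import torch.nn as nn

class MultiLabelMLP(nn.Module):
    def __init__(self, n_features: int, n_behaviors: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_behaviors),  # raw logits, one per behavior
        )

    def forward(self, x):
        return self.net(x)

model = MultiLabelMLP(n_features=16, n_behaviors=5)   # e.g. x,y for 8 keypoints
criterion = nn.BCEWithLogitsLoss()                    # independent binary loss per behavior

features = torch.randn(32, 16)                        # a batch of keypoint features
targets = torch.randint(0, 2, (32, 5)).float()        # multi-hot labels (several can be 1)
loss = criterion(model(features), targets)
loss.backward()

# At inference: probabilities = torch.sigmoid(model(features)), thresholded per behavior.
```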

We found that DeepEthogram had higher accuracy and F1 scores than the ETH DLC-Analyzer on the dataset we tested. The two methods performed similarly on bout statistics, including the number of bouts and bout duration.

As mentioned above, we tested unsupervised clustering using a B-SOiD-based pipeline and found that DeepEthogram performed better.

We did not test other pipelines that are based on keypoint estimation plus classifiers, such as JAABA and SIMBA. The reason is that each pipeline takes a substantial effort to get set up and to verify that it is working properly. In a reasonable amount of time, it is therefore not feasible to test many different pipelines for benchmarking purposes. We therefore chose to focus on two pipelines with different approaches, namely the ETH DLC-Analyzer for keypoints + classification and the B-SOiD-based unsupervised approach.

(3) tools that automate ethogram building, such as JAABA, BORIS/Behatrix.

BORIS and Behatrix are excellent tools for annotated videos and creating statistics of behaviors. However, they do not provide automated labeling pipelines, to our knowledge. We therefore did not see how to benchmark our work relative to these approaches.

As mentioned above, it takes a great deal of time and effort to set up and validate each new pipeline. Given that JAABA is based on the idea of keypoint estimation followed by classification, we chose not to implement it and instead focused only on the ETH DLC-Analyzer for this approach. It would be ideal to rapidly benchmark DeepEthogram relative to many approaches, but we were unable to find a way to do so in a reasonable amount of time, especially because we need to ensure that each existing algorithm is operating properly.

Therefore, more results should be presented in relation to key works, and/or a more clear introduction on this topic should be presented.

– For example, they claim it's hard to match the resulting clusters from unsupervised learning to their "label"; i.e., "their outputs can be challenging to match up to behaviors of interest in cases in which researchers have strong prior knowledge about the specific behaviors relevant to their experiments". But this is not really a fair statement; one can simply look at the clusters and post-hoc assign a label, which has been nicely done in MotionMapper, for example.

We have now supported this claim by trying out an unsupervised approach. We used the B-SOiD algorithm on DeepLabCut keypoints (Supplementary Figure 21). B-SOiD identified well separated behavior clusters. However, we found it challenging to line up these clusters to the researcher-defined behaviors of interest. In our Mouse-Openfield dataset, we found one cluster that looked like “rearing”, but it was not possible to find clusters that looked like the other behaviors. It is of course possible that in some cases the clusters and behaviors will line up nicely, but that was not apparent in our results. More generally, the goals of supervised and unsupervised clustering are not the same, so in principle there is no reason that post-hoc assignment of labels will work well. It is true that in theory this might work reasonably okay in some cases, but our results in Supplementary Figure 21 show that this approach does not work as well as DeepEthogram.

– In pose estimation, one gets an animal-centric lower-dimensional representation, which can be mapped onto behavioral states (ethograms), or used for kinematic analysis if desired. However, there is a minimal number of keypoints needed to make a representation that can still be used for ethogram building. Is the raw-pixel input truly better than this for all behaviors? For example, on the simple datasets with black backgrounds presented in this work, the background pixels are useless, and don't hinder the analysis. However, if the background dynamically changed (camera is moving, or background changes (lighting, bedding, etc.)), then the classification task from raw pixels becomes much harder than the task of classifying from extracted keypoints. Therefore, I think the authors should do the following: (1) discuss this limitation clearly in the paper, and (2) if they want to claim their method has universally higher performance, they need to show this on both simple and more challenging data.

We have now added a paragraph in the discussion that mentions cases in which DeepEthogram is expected to perform poorly. This includes cases with, for example, moving cameras and changing lighting. We are not aware that these scenarios are common in laboratory settings, but nevertheless we mention conditions that will help DeepEthogram’s performance, including fixed camera angles, fixed lighting, and fixed backgrounds. In addition, we have added two datasets that have more complex or changing backgrounds. In the Mouse-Homecage dataset, there are bedding, objects, and a hut, which provide occluders and a complex background. In the Sturman-FST, the background is dynamic due to water movement as the animal swims. We also now include a social interaction dataset, which adds a different use case. In all these cases, DeepEthogram performs well.

We have expanded our discussion of the settings in which DeepEthogram may excel and those in which other methods, like keypoint-based methods, might be advantageous. In addition, we now mention that DeepEthogram is intended as a general framework that can be used without extensive fine tuning and customization for each experiment. As with all software focused on general purpose uses, we anticipate that DeepEthogram will perform worse than solutions that are custom for a problem at hand, such as custom pipelines for moving cameras or specific pose estimations. We now note this tradeoff between general purpose and customized solutions and discuss where DeepEthogram falls on this spectrum.

Moreover, the authors discuss 4 limitations of other approaches, but do not address them in their work, i.e.:

– "First, the user must specify which features are key to the behavior (e.g. body position or limb position), but many behaviors are whole-body activities that could best be classified by full body data." – can they show an example where this is true? It seems from their data each action could be easily defined by kinematic actions of specific body parts a priori.

We have removed this sentence. Instead, in the introduction, we now focus on the design differences and general goals of the different approaches with less emphasis on potential limitations. In the discussion, we present a balanced discussion of cases in which DeepEthogram might be a good method to choose and cases in which other methods, such as those with keypoint estimation, might be better.

– "Second, errors that occur in tracking these features in a video will result in poor input data to the classification of behaviors, potentially decreasing the accuracy of labeling." – but is poor video quality not an issue for your classification method? The apple-to-apple comparison here is having corrupted video data as "bad" inputs – of course any method will suffer with bad data input.

We have removed this sentence.

– "Third, users might have to perform a pre-processing step between their raw videos and the input to these algorithms, increasing pipeline complexity and researcher time." – can they elaborate here? What preprocessing is needed for pose estimation, that is not needed for this, for example? (Both require manual labor, and given the time estimates, DEG takes longer to label than key point estimation due to the human needing to be able to look at video clips (see their own discussion)).

We have re-phrased this point in the introduction. The main point we wish to make is that a pipeline for behavior classification based on keypoints + classification requires two steps: keypoint estimation followed by behavior classification. This requires two stages of manual annotation, one for labeling the keypoints and one for labeling the behaviors for classification. If one wants to get frame-by-frame predictions for this type of pipeline, then one must have frame-by-frame labeling of training videos. Instead, DeepEthogram only requires the latter step. As the reviewer notes, the frame-by-frame behavior labeling is indeed the more time-consuming step, but it is required for both types of pipelines.

– "Fourth, the selection of features often needs to be tailored to specific video angles, behaviors (e.g. social behaviors vs. individual mice), species, and maze environments, making the analysis pipelines often specialized to specific experiments." – this is absolutely true, but also a limitation to the authors work, where the classifiers are tailored, the video should be a fixed perspective, background static, etc. So again I don't see this as a major limitation that makes pose estimation a truly invalid option.

We have removed this sentence from the introduction and added this point to the discussion where we talk about the advantages and disadvantages of DeepEthogram. We disagree that DeepEthogram requires tailored classifiers. We have used the same model architecture for all nine datasets that involve different backgrounds, species, behaviors, and camera angles. It is true that DeepEthogram might not work in some cases, such as with moving cameras, and we now highlight the cases in which we expect DeepEthogram will fail. In general, though, we have used the same model throughout our study for a range of datasets.

Benchmarking and Evaluation:

– "We evaluated how many video frames a user must label to train a reliable model. We selected 1, 2, 4, 8, 12, or 16 random videos for training and used the remaining videos for evaluation. We only required that each training set had at least one frame of each behavior. We trained the feature extractors, extracted the features, and trained the sequence models for each split of the data." – it is not clear how many FRAMES are used here; please state in # of frames in Figure 5 and in the text (not just video #'s).

Related: "Combining all these models together, we found that the model performed with more than 90% accuracy when trained with only 80 example frames" This again is a bit misleading, as the user wants to know the total # of frames needed for your data, i.e. in this case this means that a human needs to annotate at least 80-100 frames per behavior, which for 5 states is ~500 frames; this should be made more explicit.

We now report the number of frames in Figure 6, as suggested by the reviewer.

We chose to report the number of example frames because this is what is most relevant for the model, but we realize this is not the most relevant for the user. The number of actual frames (not just examples) varies depending on how frequent the behavior is. If 100 example frames are needed for a behavior, that could mean the user has to label 200 frames if the behavior occurs 50% of the time or 10,000 frames if the behavior occurs 1% of the time. We have now added sentences in the text to help the reader with this conversion. We do not think it is helpful to report the number of actual frames in the figure because then it is impossible for the user to extrapolate to their own use cases. Instead, if they know the number of example frames required and how frequent their behavior of interest is, they can calculate for themselves how many frames will need to be labeled. The new sentences we added help clarify this point and will help the reader calculate for their own cases.
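As a rough illustration of this conversion (a hypothetical sketch using the example numbers above, not measured values):

```python
# Hypothetical sketch of the conversion described above: the total number of frames
# a user must label scales inversely with how frequent the behavior is.
def frames_to_label(required_positive_examples: int, behavior_frequency: float) -> int:
    """Approximate total frames to label so that the expected number of positive
    example frames reaches `required_positive_examples`."""
    return round(required_positive_examples / behavior_frequency)

print(frames_to_label(100, 0.50))  # behavior on 50% of frames -> ~200 frames to label
print(frames_to_label(100, 0.01))  # behavior on 1% of frames  -> ~10,000 frames to label
```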

– "We note that here we used DEG-fast due to the large numbers of splits of the data, and we anticipate that the more complex DEG-medium and DEG-slow models might even require less training data." – this would go against common assumptions in deep learning; the deeper the models, the more prone to overfitting you are with less data. Please revise, or show the data that this statement is true.

We have removed this sentence.

– "Human-human performance was calculated by defining one labeler as the "ground truth" and the other labeler as "predictions", and then computing the same performance metrics as for DEG. " – this is a rather unconventional way to measure ground truth performance of humans. Shouldn't the humans be directly compared for % agreement and % disagreement on the behavioral state? (i.e., add a plot to the row that starts with G in figure 3).

We agree that this description was confusing. We have now clarified in the text that the percent agreement between human labelers is identical to the accuracy measure we report.
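A minimal sketch of this equivalence, using hypothetical arrays (one labeler's binary labels for a single behavior treated as "ground truth" and the other's as "predictions"):

```python
# Hypothetical sketch: with one labeler treated as ground truth and the other as
# predictions, accuracy is exactly the fraction of frames on which they agree.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
labeler_1 = rng.integers(0, 2, size=1_000)   # placeholder binary labels for one behavior
labeler_2 = rng.integers(0, 2, size=1_000)

percent_agreement = np.mean(labeler_1 == labeler_2)
accuracy = accuracy_score(labeler_1, labeler_2)   # identical value to percent agreement
human_human_f1 = f1_score(labeler_1, labeler_2)   # same metric as reported for the model
```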

To note, this is a limitation of such approaches, compared to pose-estimation, as humans can disagree on what a "behavior" is, whereas key points have a true GT, so I think it's a really important point that the authors address this head on (thanks!), and could be expanded in the discussion. Notably, MARS puts a lot of effort into measuring human performance, and perhaps this could be discussed in the context of this work as well.

We appreciate that the reviewer acknowledged the hard work of labeling many videos with multiple human labelers. We have now extended this effort to include a third human labeler for the Mouse-Ventral1 dataset. We have also added three publicly available datasets that each have three human labelers.

Reviewer #2:

It was a pleasure reviewing the methodological manuscript describing DeepEthogram, a software developed for supervised behavioral classification. The software is intended to allow users to automate classification/quantification of complex animal behaviors using a set of supervised deep learning algorithms. The manuscript combines a few state-of-the-art neural networks into a pipeline to solve the problem of behavior classification in a supervised way. The pipeline uses a well-established CNN to extract spatial features from each still frame of the videos that best predict the user-provided behavior labels. In parallel, optical flow for each frame is estimated through another CNN, providing information about the "instantaneous" movement for each pixel. The optical flow "image" is then passed to another feature extractor that has the same architecture as the spatial feature extractor, and meaningful patterns of pixel-wise movements are extracted. Finally, the spatial feature stream and the optical flow feature stream are combined and fed into a temporal Gaussian mixture CNN, which can pool together information across long periods of time, mimicking human classifiers who can use previous frames to inform classification of behavior in the current frame. The resulting pipeline provides a supervised classification algorithm that can directly operate on raw videos, while maintaining relatively small computational demands on the hardware.

Thank you for the positive feedback.

While I think something like DeepEthogram is needed in the field, I think the authors could do substantially more to validate that DeepEthogram is the ticket. In particular, I find the range of datasets validated in the manuscript poorly representative of the range of behavioral tracking circumstances that researchers routinely face. First, in all exemplar datasets, the animals are recorded in a completely empty environment. The animals are not interacting with any objects as they might in routine behavioral tests; there are no cables attached to them (which is routine for optogenetic studies, physiological recording studies, etc); they are alone (Can DeepEthogram classify social behaviors? the github page lists this as a typical use case); there isn't even cage bedding.

We thank the reviewer for this suggestion. We have now added five new datasets that address many of these concerns. We have added a dataset in which the mouse is in its homecage. In these videos, there is bedding, and the mouse interacts with multiple objects and is occluded by these objects and a hut. Second, we added a dataset for the elevated plus maze, in which the mouse interacts with the maze, exhibiting stretches and dips over the side of the maze. Third, we added a social interaction dataset with two mice that specifically tests the model’s performance on social behaviors. Fourth, we added a dataset for the forced swim test in which the mouse is swimming in a small pool. These datasets therefore have more complex backgrounds (e.g. bedding, objects, multiple mice), dynamic backgrounds (moving water in the forced swim test), and social interactions. Unfortunately, we did not have access to a dataset with cables, such as for optogenetics or physiology recordings, but we note that the Mouse-Homecage dataset includes occluders. We now have nine datasets in total, and DeepEthogram performed well across all these datasets.

We feel that the datasets we used are representative of videos that are common in many fields. For example, the open field assay and elevated plus maze are common assays that are typical in behavior core facilities, biotech/pharma companies, and many academic research labs. A Pubmed search for “open field test” or “elevated plus maze” each returns over 2500 papers in the past five years. These types of videos are not necessarily common in the fields of neurophysiology or systems neuroscience, but they are commonly used for phenotyping mutant mice, tests of mouse models of disease, and treatments of disease. However, we agree entirely that we have not tested all common types of behavior videos, including some mentioned by the reviewer. Unfortunately, each dataset takes a long time to test due to extensive labeling by humans for “ground-truth” data, and we did not have access to all types of datasets. In the revision, we made a sincere and substantial effort to extend the datasets tested. In addition, we have added more discussion of the cases in which we expect DeepEthogram to work well and the cases in which we expect it will be worse than other methods. We hope this discussion points readers to the best method for their specific use case.

The authors also tout the time-saving benefits of using DeepEthogram. However, with their best performing implementation (DEG slow), with a state-of-the-art computer, with a small video (256 x 256 pixels, width by height), the software runs at 15 frames per second (nearly 1/2 the speed of the raw video). My intuition is that this is on the slow side, given that many behaviors can be scored by human observers in near real time if the observer is using anything but a stopwatch. It would be nice to see benchmarks on larger videos that more accurately reflect the range of acquisition frame sizes. If it is necessary for users to dramatically downsample videos, this should be made clear.

We have now clarified in the text that the time savings are in terms of person-hours instead of actual time. A significant advantage is that DeepEthogram can run in the background on a computer with little to no human time (after it is trained), whereas manual labeling of each video continues to require human time with each video. In the case of the DeepEthogram-slow model, it is true that humans might be able to do it faster, but this would take human time, whereas the model can run while the researchers do other things. We think this is a critical difference and have highlighted this difference more clearly. Furthermore, we have made engineering improvements that have sped up inference time by nearly 300%.

We have added a table that measures inference time across the range of image sizes we used in this study. We note that all comparable methods in temporal action localization down-sample videos in resolution. For example, video classification networks commonly use 224 x 224 pixels as input. DeepEthogram works similarly for longer acquisitions (more frames per video). We had access to videos collected for specific neuroscience research questions and were restricted to these acquisition times, but the software will work similarly with all acquisition times.

Specific comments:

– It would be nice to see if DeepEthogram is capable of accurately scoring a behavior across a range of backgrounds. For example, if the model is trained on a sideview recording of an animal grooming in its cage, can it accurately score an animal in an open field doing the same from an overhead view, or a side view? If the authors provided guidance on such issues to the reader this would be helpful.

We have tested DeepEthogram in a range of different camera angles and backgrounds across the nine datasets we now include. Our results show that the model can work across this diversity of set ups. However, an important point is that it is unlikely that the model trained on a side view, for example, will work for videos recorded from above. Instead, the user would be required to label the top-view and side-view videos separately and train different models for these cases. We do not think this is a major limitation because often experimenters decide on a camera location and use that for an entire set of experiments. However, this is an important point for the user to understand. We have therefore added a sentence to the discussion describing this point, along with extended points about the cases in which DeepEthogram is expected to excel and when it might not work well.

– The authors should highlight that human scoring greatly outperforms DEG on a range of behaviors when comparing the individual F1 scores in Figure 3. Why aren't there any statistics for these comparisons?

Thank you for this recommendation. We have added statistics for comparing the F1 scores in Figure 3. These analyses reveal that human scoring outperforms DeepEthogram on some behaviors, but the performance of DeepEthogram and humans is statistically indistinguishable on the majority of behaviors tested. We have revised the main text to make it clear that human labelers do better than DeepEthogram in some cases.

We have also compared DeepEthogram to human performance on other measures of behavior, in particular the statistics of behavior bouts (duration and frequency). We expect that these metrics may be more commonly used by the end-user. In the case of bout statistics, DeepEthogram’s performance is worse than human performance at the level of single videos (Figure 5A-B). However, when averaged across videos, the performance of the model and humans is statistically indistinguishable (Figure 5C-E). The reason appears to be that the model’s predictions are more variable on a single-video level, but after averaging across videos, the noise is averaged out to reveal a similar mean to what is obtained by human labelers. These results can also be seen in Figure 4A-C, in which the differences between the bout statistics for the model and the labels on which it was trained (Human 1) are similar in magnitude to the differences between human labelers.

We thank the reviewer again for this suggestion. The statistics along with portions of Figure 4-5 are new in response to this suggestion.

– Some of the F1 scores for individual behaviors look very low (~0.5). It would be nice to know what chance performance is in these situations and if the software is performing above chance.

We thank the reviewer for the suggestion to add chance level performance. We have added chance-level F1 scores by performing a shuffle of the actual human labels relative to the video frames. For nearly all behaviors, DeepEthogram’s performance is significantly higher than chance level performance by a substantial margin. We think the addition of the chance level performance helps to interpret the F1 scores, which are not always intuitive to understand in terms of how good or bad they are. We also note that the F1 score is a demanding metric. For example, if the model correctly predicts the presence of a grooming bout, but misses the onset and offset times by several frames, the F1 score will be substantially reduced. For this reason, we also report the bout statistics in Figures 4-5, which are closer to the metrics that end users might want to report and are easier to interpret.
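For concreteness, here is a minimal, hypothetical sketch of one way a shuffle-based chance level can be computed; the exact procedure used in the paper (number of shuffles, per-behavior handling) may differ, and the arrays below are placeholders.

```python
# Hypothetical sketch: estimate a chance-level F1 by permuting the human labels
# relative to the video frames (the predictions stay frame-aligned). Placeholder
# data; the paper's exact shuffling procedure may differ.
import numpy as np
from sklearn.metrics import f1_score

def chance_f1(human_labels, predictions, n_shuffles=100, seed=0):
    rng = np.random.default_rng(seed)
    scores = [f1_score(rng.permutation(human_labels), predictions)
              for _ in range(n_shuffles)]
    return float(np.mean(scores))

rng = np.random.default_rng(1)
human = rng.integers(0, 2, size=5_000)   # placeholder per-frame labels for one behavior
preds = rng.integers(0, 2, size=5_000)   # placeholder model predictions
print("chance-level F1:", chance_f1(human, preds))
```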

– I find it hard to understand the size of the data sets used in the analyses. For instance, what is 'one split of the data', referenced in Figure 3? Moreover, the authors state "We selected 1, 2, 4, 8, 12, or 16 random videos for training and used the remaining videos for evaluation" I have no idea what this means. What is the length and fps of the video?

We thank the reviewer for this comment and agree that it was not clear in the original submission. We have updated the text and figure legends to clarify this point. A split of the data refers to a random splitting of the data into training, validation, and testing sets. We now describe this in the procedures for the model set up in the main text and clarify the meaning of “splits” when it is used later in the Results section. A video refers to a single recorded behavior video, and videos can differ in their duration and frame number. However, the video is a commonly used division of the data for experimental recordings. We therefore include two sets of plots in Figure 6. One is based on the number of videos, and the other is based on the number of frames with positive examples of the behavior. The former might be more intuitive to some readers because it is a common division of the data, but it might be challenging to understand given that videos can differ in duration and frame rate. Instead, reporting the number of positive examples provides an exact metric that can be applied to any case once one calculates the frame rate and frequency of their behavior of interest. We have now included text that helps readers with this conversion.

– Are overall F1 scores in Figure 3 computed as the mean of the individual scores on each component F1 score, or the combination of all behaviors (such that it weights high frequency behaviors)? It's also difficult to understand what the individual points in Figure 4 (a-c) correspond to.

We have clarified the meaning of these values in text and figure legends. The overall F1 score (Figure 3A-B) reports the “combination” of all behaviors. We report each behavior individually below (Figure 3C-K). In the old Figure 4A-C, each dot represented a single behavior (e.g. “face grooming”). We have clarified the meaning of each marker in each figure in the legends.

– The use of the names Mouse-1, Mouse-2 etc for experiments are confusing because it can appear that these experiments are only looking at single mice. I would change the nomenclature to highlight that these reflect experiments with multiple mice.

We thank the reviewer for this suggestion. We have now updated the naming of each dataset to provide a more descriptive title.

– It is not clear why the image has to be averaged across RGB channels and then replicated 20 times for the spatial stream. The author mentioned "To leverage ImageNet weights with this new number of channels", and I assume this means the input to the spatial stream has to have the same shape (number of weights) as the input to the flow stream. However, why this is the case is not clear, especially considering two feature extractor networks are independently trained for spatial and flow streams. Lastly, this might raise the question of whether there will be valuable information in the RGB channels separately that will be lost from the averaging operation (for example, a certain part of an animal's body has a different color than others but is equiluminant).

This averaging is only performed once, when loading ImageNet weights into our models. This occurs before training begins. The user never needs to do this because they will use our models that are pretrained on Kinetics700, rather than ImageNet.

For clarification, the ImageNet weights have 3 channels: red, green, and blue. Our flow classifier model has 20 channels: δ-X and δ-Y each for 10 frames. Therefore, we average across the red, green, and blue channels in the ImageNet weights and replicate this value 20 times. However, these weights are then changed by the training procedure, so there is no information lost due to averaging. We have updated the text to clarify this.
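A minimal sketch of this one-time conversion follows (hypothetical tensor names; this is not the package's code, and the optional rescaling shown at the end is an assumption, not a step stated in the paper):

```python
# Hypothetical sketch of the one-time first-layer weight conversion described above:
# average the 3-channel (RGB) ImageNet kernel and replicate it across the 20 input
# channels of the optic-flow stream (x- and y-flow for 10 frames). Not the package's code.
import torch

rgb_weight = torch.randn(64, 3, 7, 7)           # stand-in for a conv1 ImageNet weight tensor
gray = rgb_weight.mean(dim=1, keepdim=True)     # (64, 1, 7, 7): average over R, G, B
flow_weight = gray.repeat(1, 20, 1, 1)          # (64, 20, 7, 7): one copy per flow channel

# Optional (an assumption, not stated in the paper): rescale so the summed input
# magnitude roughly matches the original 3-channel kernel.
flow_weight = flow_weight * (3.0 / 20.0)
```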

– It is not intuitive why simple average pooling is sufficient for fusing the spatial and flow streams. It can be speculated that classification of certain behaviors will benefit much more from optical flow features, while other behaviors benefit from still-image features. I'm curious to see whether an additional layer at the fusing stage that has behavior-specific weights could improve performance.

We thank the reviewer for this interesting suggestion. In theory, this weighting of image features and flow features could happen implicitly in the fully connected layer prior to average pooling. Specifically, the biases on the units in the fully connected layer are per-behavior parameters for both the spatial and the flow stream that will affect which one is weighted more highly in the average fusion. We have also experimented with concatenation pooling, in which we fuse the two feature vectors together before one fully connected layer. However, this adds parameters and did not improve performance in our experiments. We have not experimented with more elaborate fusion schemes.
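As a toy illustration of average fusion versus a concatenation alternative (hypothetical shapes and names, not the package's code):

```python
# Hypothetical sketch: average fusion of per-behavior logits from the spatial and
# optic-flow streams, plus a concatenation-fusion alternative that adds parameters.
# Shapes and names are illustrative; this is not the package's code.
import torch
import torch.nn as nn

n_behaviors = 5
spatial_logits = torch.randn(32, n_behaviors)   # (batch, behaviors) from the image stream
flow_logits = torch.randn(32, n_behaviors)      # (batch, behaviors) from the flow stream

# Average fusion: the per-behavior biases of each stream's fully connected layer
# already allow one stream to dominate for a given behavior.
fused = 0.5 * (spatial_logits + flow_logits)
probabilities = torch.sigmoid(fused)

# Concatenation fusion: extra learned weights per behavior, at the cost of parameters.
concat_head = nn.Linear(2 * n_behaviors, n_behaviors)
fused_concat = concat_head(torch.cat([spatial_logits, flow_logits], dim=1))
```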

– Since computational demand is one of the major concerns in this article, I'm wondering whether exploiting the sparse nature of the input images would further improve the performance of the algorithm. Oftentimes the animal of interest only occupies a small number of pixels in the raw images, and some simple thresholding of the images, or even user-defined masking of the images, together with use of sparse data backends and operations, should in theory significantly reduce the computational demands for both the spatial and flow feature extractor networks.

This is an interesting and insightful comment. For most of our datasets, the relevant information is indeed sparse, and the models could be made faster and perhaps more accurate if we could focus on the animals themselves. The difficulty is in developing a method that automatically focuses on the relevant pixels across datasets. For some datasets, thresholding would suffice, but for others more elaborate strategies would be required. Because we wanted DeepEthogram to be a general-purpose solution without requiring coding by the end user, we did not adopt any of these strategies. To make a general-purpose solution to focus on the relevant pixels, we could make a second version of DeepEthogram that implements models for spatiotemporal action detection, in which the user specifies one or more bounding boxes, each with their own behavior label. However, this would be a significant extension that we did not perform here.

Reviewer #3:

The paper by Bohnslav et al., presents a software tool that integrates a supervised machine learning algorithm for detecting and quantifying behavior directly from raw video input. The manuscript is well-written, the results are clear. Strengths and weaknesses of the approach are discussed and the work is appropriately placed in the bigger context of ongoing research in the field. The algorithms demonstrate high performance and reach human-level accuracy for behavior recognition. The classifiers are embedded in an excellent user-friendly interface that eliminates the need of any programming skills on the end of the user. Labeled datasets can even be imported. We suggest additional analyses to strengthen the manuscript.

We thank the reviewer for this positive feedback and for their constructive suggestions.

Concerns:

1) Although the presented metrics for accuracy and F1 are state of the art, it would be useful to also report absolute numbers for some of the scored behaviors for each trial, because most behavioral neuroscience studies actually report behavior in absolute numbers and/or duration of individual behaviors (rears, face grooms, etc.). Correlation of human and DEG data should also be presented on this level. This will speak to many readers more directly than the accuracy and F1 statistics. For this, we would like to see a leave-one-out cross-validation or a k-fold cross-validation (ensuring that each trial ends up exactly once in a cross-validation set) that enables a final per-trial readout. This can be done with only one of the DEG types (e.g., "fast"). The current randomization approach of 60/20/20% (train/validate/test) with an n of 3 repeats is insufficient, since it (a) allows per-trial data for at most 60% of all files and (b) is susceptible to artefacts due to random splits (i.e., one abnormal trial can be over- or under-represented in the cross-validation sets).

We thank the reviewer for these suggestions. We have added multiple new analyses and plots that focus on the types of data mentioned. In the new Figure 4A-C, we show measures of bout statistics, with dots reporting values for individual videos in the test set. In addition, in Figure 5A-B and Figure 5F-H, we show measures of bout statistics benchmarked against human-human agreement. We agree with the reviewer that these bout statistics are likely more interpretable to most readers and likely closer to the metrics that end users will report and care about in their own studies. The remainder of Figures 4 and 5 provides summaries of these bout statistics for each behavior in each dataset. In these summaries, each dot represents a single behavior, with values averaged across train/validation/test splits of the data, corresponding to a measure similar to an average across videos. We used these summaries to report the large number of comparisons in a compact manner.

We did not employ leave-one-out cross-validation because we base our model's thresholds and learning rate changes on the validation set. We therefore require three portions of the data. K-fold cross-validation is more complex when the data must be split into three groups, and we also require that each of the train, validation, and test sets contains at least one example of each behavior, which is not guaranteed with K-fold cross-validation. However, to address your concerns, we repeated the random splitting 5 times for all datasets (except Sturman-EPM, which only has 3 videos with all behaviors). We found empirically that 68% of all videos are represented at least once in the test set. While this does add noise to our estimates of performance, it is not expected to add bias.
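
To make the splitting procedure concrete, the following is a minimal sketch of a random 60/20/20 split that is resampled until every behavior appears in each partition. The function and its arguments are illustrative, not DeepEthogram's actual code.

    import numpy as np

    def split_videos(labels_per_video, fractions=(0.6, 0.2, 0.2), seed=0, max_tries=1000):
        """labels_per_video: list of sets, the behaviors present in each video.
        Returns (train, val, test) index arrays in which every behavior occurs
        at least once per partition; resamples the split until this holds."""
        rng = np.random.default_rng(seed)
        n = len(labels_per_video)
        all_behaviors = set().union(*labels_per_video)
        n_train, n_val = int(fractions[0] * n), int(fractions[1] * n)
        for _ in range(max_tries):
            order = rng.permutation(n)
            train = order[:n_train]
            val = order[n_train:n_train + n_val]
            test = order[n_train + n_val:]
            if all(all_behaviors <= set().union(*(labels_per_video[i] for i in part))
                   for part in (train, val, test)):
                return train, val, test
        raise RuntimeError("no valid split found; too few videos contain every behavior")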

2) In line with comment 1), we propose to update Figure 4, which at the moment uses summed-up data from multiple trials. We would rather like to see each trial represented by a single data point in this figure (#bouts/#frames by behavior). As an alternative to individual scatterplots, correlation-matrix heatmaps could be used to compare different raters.

In response to this suggestion, we have completely revised and extended the old Figure 4, dividing it into two separate main figures (Figures 4 and 5). In Figure 4, we now report bout statistics first on an individual video basis (Figure 4A-C) and then averaged across train/validation/test splits of the data for each behavior as a summary (Figure 4D-F). In the new Figure 5, we benchmark the performance of the model on bout statistics by comparison to human performance for the datasets for which we have multiple human labelers. In the new Figure 5A-B, we examined this benchmarking on the basis of single videos. Then, in Figure 5C-E, we report the averages across data splits for each behavior as a summary across all datasets and all behaviors. This analysis led to an interesting finding. On an individual video basis, humans are more accurate than DeepEthogram (Figure 5A-B). However, on averages for each behavior, the accuracy of the model and humans is statistically indistinguishable (Figure 5C-E). This occurs because the model predictions are noisier than human labels on single videos, but when averaging across multiple videos, this noise averages out, resulting in similar values for the model and humans.
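
For readers interested in how per-video bout statistics can be derived from frame-wise labels or predictions, the sketch below computes bout counts and durations from a binary ethogram trace for a single behavior. The function name and the assumed frame rate are illustrative.

    import numpy as np

    def bout_statistics(binary_trace: np.ndarray, fps: float = 30.0) -> dict:
        """Bout statistics from a 1D binary vector (1 = behavior present on that frame)."""
        padded = np.concatenate([[0], binary_trace.astype(int), [0]])
        diffs = np.diff(padded)
        starts = np.where(diffs == 1)[0]    # frame indices where bouts begin
        ends = np.where(diffs == -1)[0]     # frame indices just after bouts end
        durations = (ends - starts) / fps   # bout durations in seconds
        return {
            "n_bouts": len(starts),
            "mean_bout_duration_s": float(durations.mean()) if len(durations) else 0.0,
            "time_in_behavior_s": float(binary_trace.sum()) / fps,
        }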

3) Direct benchmarking against existing datasets is necessary. With many algorithms being published these days, it is important to pick additional (published) datasets and test how well the classifiers perform on those videos. Their software package already allows import of labeled datasets, some are available online. For example, how well can DeepEthogram score…

a. grooming in comparison to Hsu and Yttri (REF #17) or van den Boom et al., (2017, J Neurosci Methods)

b. rearing in comparison to Sturman et al., (REF #21)

c. social interactions in comparison to Segalin et al., (REF #7) or Nilsson et al., (REF #19)

We thank the reviewer for this recommendation. In response, we have done two things. First, we have tested DeepEthogram on publicly available datasets, in particular from Sturman et al. We greatly appreciate that Sturman et al., made their datasets and labels publicly available. We plan to do the same upon publication of this paper. This allowed us to confirm that DeepEthogram performs well on other datasets collected by different groups.

Second, we have benchmarked our method against two other approaches. First, we tried the method of Hsu and Yttri (B-SOiD) on the Mouse-Openfield dataset (Supplementary Figure 21). Specifically, we performed DeepLabCut to identify keypoints and then used B-SOiD on these keypoints to separate behaviors into clusters. The method worked well and created separable clusters in a low dimensional behavior space. We then tried to line up these clusters with researcher-defined behaviors of interest. However, it was challenging to find a good correspondence. As a result, DeepEthogram substantially outperformed this B-SOiD-based method. We think this is not surprising given that B-SOiD is designed for unsupervised behavior classification, whereas the problem we are trying to solve is supervised classification. While it is possible that clusters emerging from unsupervised methods may line up nicely with researcher-defined behaviors of interest, this does not have to happen. It therefore makes sense to utilize a supervised method, like DeepEthogram, for the problem of identifying pre-defined behaviors of interest. We now discuss this point and present these results in depth in the text.
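
For context, "lining up" unsupervised clusters with researcher-defined behaviors can be done, for example, by assigning each cluster to the behavior most common among its frames. The sketch below shows this majority-vote mapping under the assumption that cluster and behavior IDs are integer NumPy arrays; it is one reasonable approach, not necessarily the exact procedure we used.

    import numpy as np

    def map_clusters_to_behaviors(cluster_ids, behavior_ids, n_clusters, n_behaviors):
        """Assign each unsupervised cluster the supervised behavior label that is most
        frequent among its frames, then return per-frame predicted behaviors."""
        mapping = np.zeros(n_clusters, dtype=int)
        for c in range(n_clusters):
            frames = behavior_ids[cluster_ids == c]
            if len(frames):
                mapping[c] = np.bincount(frames, minlength=n_behaviors).argmax()
        return mapping[cluster_ids]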

We also compared the performance of DeepEthogram with the ETH DLC-Analyzer method from Sturman et al. Specifically, we used DeepLabCut to track body parts in our Mouse-Openfield dataset and then used the classification architecture from Sturman et al. (a multilayer perceptron) to classify behaviors. Note that we had to re-implement the code for the Sturman et al. approach for two reasons. First, our datasets can have more than one behavior per timepoint, which necessitates different activation and loss functions. Second, we wanted to ensure that the same train/validation/test splits were used for both DeepEthogram and the ETH DLC-Analyzer classification architecture. We also included several training tricks in this architecture based on lessons learned while developing DeepEthogram, aiming to make this the fairest comparison possible. We found that DeepEthogram had significantly better performance in terms of accuracy and F1 on some behaviors in this dataset. The accuracy of the bout statistics was similar between the two modeling approaches. We conclude that DeepEthogram can perform better than other methods, at least in some cases and on some metrics, and hope that these new benchmarking comparisons help lend credibility to the software.
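
The key change required for multi-label classification is independent sigmoid outputs with a binary cross-entropy loss, rather than a softmax over mutually exclusive classes. A minimal sketch is below; the layer sizes and feature counts are illustrative and do not reproduce Sturman et al.'s exact architecture or our re-implementation.

    import torch
    import torch.nn as nn

    n_keypoint_features, n_behaviors = 24, 5          # illustrative sizes

    mlp = nn.Sequential(
        nn.Linear(n_keypoint_features, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, n_behaviors),                  # one logit per behavior, no softmax
    )
    criterion = nn.BCEWithLogitsLoss()                # behaviors are not mutually exclusive

    keypoints = torch.randn(32, n_keypoint_features)          # batch of pose features
    targets = torch.randint(0, 2, (32, n_behaviors)).float()  # multi-hot labels
    loss = criterion(mlp(keypoints), targets)
    loss.backward()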

We did not perform further benchmarking against other approaches. Such benchmarking takes substantial effort because each additional method must be set up and validated. We agree with the reviewer that it would be valuable to have more extensive comparisons of methods, but given the timeline and the multiple other directions requested by the reviewers (e.g., new datasets), it was not feasible to try more methods. In addition, not all the datasets mentioned were available when we checked. For example, last we checked, the dataset from Segalin et al. is not publicly available, and the authors told us they would release the data when their paper is published. Also, to our knowledge, the method of Segalin et al. is specific to social interactions between a white and a black mouse, whereas the social interaction dataset available to us has two black mice.

4) In the discussion on page 19 the authors state: "Subsequently, tens to hundreds to thousands of movies could be analyzed, across projects and labs, without additional user-time, which would normally cost additionally hundreds to thousands of hours of time from researchers." This sentence suggests that a network trained on, e.g., the open field test in one lab can be transferred across labs. This key issue of "model transferability" should be tested. E.g., the authors could use the classifier from mouse#3 and test it on another available top-view recording dataset recorded in a different lab with a different open-field setup (datasets are available online, e.g., REF #21).

The original intent of this sentence was that videos with similar recording conditions could be analyzed without retraining, such as in behavior core facilities with standardized setups. We ran a simple test of generalization by training DeepEthogram on our Mouse-Openfield dataset and testing it on the Sturman-OFT dataset. We found that the model did not generalize well and speculate that this could be due to differences in resolution, background, or contrast. We have therefore removed this claim and added a note that DeepEthogram likely needs to be retrained if major aspects of the recordings change.

5) Figure 5D/E: Trendline is questionable, we would advise to fit a sigmoid trendline, not an arbitrarily high order polynomial. Linear trend lines (such as shown in Figure 4) should include R2 values on the plot or in the legend.

We apologize for the lack of clarity in our descriptions. The lines mentioned are not trendlines. The line in the old Figure 5D-E is simply a moving average that is shown to aid visualization. Also, the lines in the scatterplots of the old Figure 4 are not linear trend lines; rather, they are unity lines to help the reader see if the points fall above or below unity. We now explain the meaning of these lines in the figure legends. We have also added correlation coefficients to the relevant plots.

6) In the discussion, the authors do a very good job highlighting the limitations and advantages of their approach. The following limitations should however be expanded:

a. pose-estimation-based approaches (e.g. DLC) are going to be able to track multiple animals at the same time (thus allowing e.g. better read-outs of social interaction). It seems this feature cannot be incorporated in DeepEthogram.

We thank the reviewer for noting our attempts to provide a balanced assessment of DeepEthogram, including its limitations. We have now expanded the discussion to even more clearly mention the cases in which DeepEthogram might not work well. We specifically mention that DeepEthogram cannot track multiple animals in its current implementation. We hope the new discussion material will help readers to identify when and when not to use DeepEthogram and when other methods might be preferable.

b. Having only 2 human raters is a weakness that should briefly be addressed. Triplicates are useful for assessing outlier values; this could be mentioned in light of the fact that the F1 score of DeepEthogram occasionally exceeds that of the human raters (e.g. Figure 3C,E).

We have added a third labeler for our Mouse-Ventral1 dataset. We have also included results from Sturman et al., each of which have 3 human labelers. Unfortunately, we were not able to add a third labeler for all datasets. For those in which we included a third labeler, the results are similar to using only two labelers. We think this is because all of our labelers are experts who spent a lot of time carefully labeling videos.

c. Traditional tracking measures such as time in zone, distance moved and velocity cannot be extracted with this approach. These parameters are still very informative and require a separate analysis with different tools (creating additional work).

We agree these measures are informative and not currently output from DeepEthogram. We have added a sentence in the discussion noting that DeepEthogram does not provide tracking details.

d. The authors are correct that the additional time required for behavior analysis (due to the computationally demanding algorithms) is irrelevant for most labs. However, they should add (1) that the current system will not be able to perform behavior recognition in real time (thus preventing the use of closed-loop systems, which packages such as DLC have made possible) and (2) that the speed they discuss on page 16 is based on an advanced computer system (GPU, RAM) and will not be possible with a standard lab computer (or provide an estimate how long training would require if it is possible).

We have added sentences in the discussion and results stating that DeepEthogram is not suitable for closed-loop experiments and that the speed of inference depends on the computer system and might be substantially slower on typical desktop computers. We have also updated the inference evaluation by using a 1080Ti GPU, which is well within the budget of most labs. Furthermore, during the revision process, we significantly improved inference speed, so that DeepEthogram-fast runs inference at more than 150 frames per second on a 1080Ti GPU.
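
For readers who want to estimate inference throughput on their own hardware, a rough benchmark might look like the following sketch. The model, input size, and batch size are placeholders rather than DeepEthogram's actual inference pipeline, and measured speeds will vary with GPU, video resolution, and disk I/O.

    import time
    import torch

    def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 256, 256), n_iters=200) -> float:
        """Rough inference throughput in frames per second for a given model and input size."""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = model.to(device).eval()
        x = torch.randn(*input_shape, device=device)
        with torch.no_grad():
            for _ in range(10):              # warm-up iterations
                model(x)
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.time()
            for _ in range(n_iters):
                model(x)
            if device == "cuda":
                torch.cuda.synchronize()
        return n_iters * input_shape[0] / (time.time() - start)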

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The reviewers all felt the manuscript was improved, and thank the authors for the additional datasets and analysis. We would just like to see two items before the publication is accepted fully.

(1) Both reviewers #1 and #2 note that the new data are great but lack human ground truth. Both for comparison and for releasing the data for others to benchmark on, please include these data. We also understand that obtaining ground truth from 3 persons is a large time commitment, but even if there is only one person, this data should be included for all datasets shown in Figure 3.

We have performed additional labeling to obtain human-level performance for benchmarking. For eight datasets, we now have labels from multiple researchers, which allows calculation of human-level performance. We have updated all the relevant figures, e.g., Figure 3. The results are consistent with those previously reported, in which DeepEthogram's performance is generally similar to human-level performance. The only dataset for which we did not add another set of human labels is the Fly dataset. This dataset has more than 3 million video frames, and the initial labeling took months of work. Note that this dataset still has one set of human labels, which we use to train and test the model, but we are unable to plot human-level performance with only one set of labels. We feel that comparing DeepEthogram's performance to human-level performance for 8 datasets is sufficient and goes beyond, or is at least comparable to, the number of datasets typically reported in papers of this type.

(2) Please include links for the raw videos used in this work; it is essential for others to benchmark and use to validate the algorithm presented here (see Reviewer 3: "raw videos used in this work (except the ones added during the revision) are – it appears – not accessible online").

We have posted a link to the raw videos and the human labels on the project website (https://github.com/jbohnslav/deepethogram). We have noted in the paper’s main text and methods that the videos and annotations are publicly available at the project website.

Lastly, reviewer 3 notes that perhaps, still, some use-cases are best suited for DeepEthogram, while others are better served by pose estimation plus other tools, but this of course cannot be exhaustively demonstrated here; at your discretion you might want to address this in the discussion, but we leave that up to your judgement.

We agree with Reviewer 3 on this point. After reviewing the manuscript text, we feel that we have already included a balanced and extensive discussion of the use-cases best suited for DeepEthogram versus pose-estimation methods. Because these points are already addressed extensively in both the Results and Discussion sections, for brevity we have not added further explanation.

Reviewer #1:

I thank the authors for the revisions and clarifications, and I think the manuscript is much improved. Plus, the new datasets and comparisons to B-SOiD and DLC-Analyzer (Sturman) are good additions.

Thank you for the positive feedback and the constructive comments. Your input has greatly improved the manuscript.

One note is that it is not clear which datasets have ground-truth data; namely, in the results, 5 datasets used for testing are introduced:

"Mouse-Ventral1"

"Mouse-Ventral2"

"Mouse-Openfield"

"Mouse-Homecage"

"Mouse-Social"

plus three datasets from published work by Sturman et al.,

and "fly"

Then it states that all datasets were labeled; yet Figure 3 has no ground truth for "Mouse-Ventral2", "Mouse-Homecage", "Mouse-Social", or "Fly" -- please correct and include the ground truth. I do see that it says that only a subset of each of the 2 datasets in Figure 3 are labeled with 3 humans, but minimally then the rest (1 human?) should be included in Figure 3 (and be made open source for future benchmarking).

It appears from the discussion this was done (i.e., at least 1 human, as this is of course required for the supervised algorithm too):

"In our hands, it took approximately 1-3 hours for an expert researcher to label five behaviors in a ten-minute movie from the Mouse-Openfield dataset" and it appears that labeling is defined in the methods.

We apologize for the confusion on this point. All datasets had at least one set of human labels because this is required to train and test the model. To obtain a benchmark of human-level performance, multiple human labels are required so that the labels of one researcher can be compared to the labels of a second researcher. We have now performed additional annotation. Now 8 datasets have human-level performance reported (see updated Figure 3). The results show that the model achieves human-level performance in many cases.

The only dataset for which we do not have multiple human labelers is the Fly dataset. The Fly dataset has more than 3 million timepoints, and it took a graduate student over a month of full-time work to obtain the first set of labels. Given the size of this dataset, we did not add another set of labels.

We feel that having 8 datasets with human-level performance exceeds, or at least matches, what is typical for benchmarking of methods similar to DeepEthogram. We feel this number of datasets is sufficient to give the reader a good understanding of the model’s performance.

We have now made all the datasets and human labels available on the project website and hope they will be of use to other researchers.

Reviewer #2:

The authors did a great job addressing our comments, especially with the additional validation work. My only concern is that some of the newly included datasets don't have human-labeled performance for comparison, making it hard to judge the actual performance of DeepEthogram. While I understand it is very time-consuming to obtain human labels, I think it will greatly improve the impact of the work if the model comparison can be benchmarked against ground truth. In particular, it would be great to see the comparison to human labels for the "Mouse-Social" and "Mouse-Homecage" datasets, which presumably represent a large proportion of use cases for DeepEthogram. Otherwise I think it looks good and I would support publication of this manuscript.

Thank you for the positive feedback and the valuable suggestions that helped us to extend and improve the paper. We have now added more human labels to obtain human-level performance for 8 datasets, including the Mouse-Social and Mouse-Homecage datasets. We have made the datasets and human labels publicly available at the project website, which we hope will assist other researchers.

Reviewer #3:

The authors present a software solution (DeepEthogram) that performs supervised machine-learning analysis of behavior directly from raw videos files. DeepEthogram comes with a graphical user interface and performs behavior identification and quantification with high accuracy, requires modest amounts of pre-labeled training data, and demands manageable computational resources. It promises to be a versatile addition to the ever-growing compendium of open-source behavior analysis platforms and presents an interesting alternative to pose-estimation-based approaches for supervised behavior classification, under certain conditions.

The authors have generated a large amount of additional data and showcase the power of their approach in a wide variety of datasets, including their own data as well as published datasets. DeepEthogram is clearly a powerful tool, and the authors do an excellent job describing the advantages and disadvantages of their system and provide a nuanced comparison of point-tracking analyses vs. analyses based on raw videos (pixel data). Also, their responses to the reviewers' comments are very detailed, thoughtful, and clear. The only major issue is that the raw videos used in this work (except the ones added during the revision) are – it appears – not accessible online. This problem must be solved; the videos are essential for reproducibility.

We thank the reviewer for the positive comments and constructive feedback that has greatly helped the paper. We have now made all the videos and human annotations available on the project website: https://github.com/jbohnslav/deepethogram

A minor caveat is that in order to compare DeepEthogram to existing supervised and unsupervised approaches, the authors have slightly skewed the odds in their favor by picking conditions that benefit their own algorithm. In the comparison with point-tracking data, they use a low-resolution top-view recording to label the paws of mice (which are obstructed most of the time from this angle). In the comparison with unsupervised clustering, they use the unsupervised approach for an application that it isn't really designed for (performed in response to reviewers' requests). But the authors directly address these points in the text, and the comparisons are still valid and interesting and address the reviewers' concerns.

We agree with these points. In particular, we agree that the comparison of supervised and unsupervised methods is challenging and not an apples-to-apples comparison, but we included this comparison because other reviewers felt strongly that it was a necessary analysis for benchmarking the model's performance. In the text, including the head-to-head comparisons, we highlight the caveats mentioned by the reviewer and have tried to provide a thorough and balanced assessment of cases in which DeepEthogram might work best and cases in which other approaches may be preferable or provide additional information. Our goal is for the reader to understand when and when not to use DeepEthogram and what type of information one can obtain from the software's output. We hope our extensive discussion of these points will help the reader and the community.

Associated Data


    Data Citations

    1. von Ziegler L, Sturman O, Bohacek J. 2020. Videos for deeplabcut, noldus ethovision X14 and TSE multi conditioning systems comparisons. Zenodo. [DOI]

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    Code is posted publicly on Github and linked in the paper. Video datasets and human annotations are publicly available and linked in the paper.

    The following previously published datasets were used:

    von Ziegler L, Sturman O, Bohacek J. 2020. Videos for deeplabcut, noldus ethovision X14 and TSE multi conditioning systems comparisons. Zenodo.

