AVATAR: AI Vision Analysis for Three-dimensional Action in Real-time

Dae-Gun Kim; Kwanhoo Shin; Anna Shin; Yong-Cheol Jeong; Changbum Ko; Junesu Lee; Seahyung Park; Daesoo Kim

2026 Mar 3;35(2):109–123. doi: 10.5607/en25044

PMCID: PMC13106957 PMID: 41771612

Abstract

Artificial intelligence (AI) provides new opportunities for high-resolution behavioral analysis and automated, human-free experiments. Here we present AVATAR (AI Vision Analysis for Three-dimensional Action in Real-time), an AI-driven system that reconstructs 3D mouse motions by detecting key body parts from synchronized multi-view videos and converting them into action skeletons. AVATAR achieves near-human accuracy in pose estimation, enables robust extraction of kinematic and postural features, and supports scalable analysis of model animal behaviors. Using the features represented by the 3D action skeleton, an LSTM-based model reliably classifies the behaviors of freely moving mice across various experimental paradigms, and low-latency processing (100 ms) enables real-time closed-loop optogenetic stimulation. As a demonstration of generalizability, we applied the AVATAR framework to a bottom-view predatory hunting paradigm. AVATARnet accurately detected mouse poses and extracted dynamic behavioral features of the mouse. Using AVATARnet-driven dynamic features, an XGBoost-based classifier automated action segmentation annotation during complex predatory chasing behavior. Together, AVATAR provides 3D pose estimation, dynamic quantification, classification, and closed-loop manipulation in real time.

Keywords: Mice, Behavior, Artificial intelligence, Machine learning, Neural networks, computer, Imaging, Three-dimensional

INTRODUCTION

In neuroscience, researchers have studied model animal behavior as a means to elucidate the neural mechanisms that generate and control it [1, 2]. Dissecting the neural substrates of specific behaviors requires precise behavioral characterization as well as neural modulation or neural activity recording in freely moving animals [3-5]. Accordingly, defining behavioral states with high temporal and spatial precision is a fundamental component of circuit-level analysis [1, 6].

Early approaches to behavioral quantification relied on the manual identification of behavioral modules based on repeated action patterns [7, 8]. Conventional animal behavior experiments record animal motion with a single top-view camera, which suffers from occlusion of body parts. To obtain more precise measurements of specific body parts, researchers also applied colored markers or attached external fiducials to the animal’s body [9, 10]. Recent advances in artificial intelligence (AI), computer vision, and machine learning have transformed behavioral neuroscience by enabling automated detection, quantification, and classification of complex behaviors. These have included the use of convolutional neural networks (CNNs) on markerless video recordings from a single camera [11-13] and the reconstruction of 3D skeletons using multi-camera systems [14-18]. You Only Look Once (YOLO), a high-speed object detection algorithm, allows efficient real-time detection of animals and environmental features [19, 20]. DeepLabCut (DLC) introduced markerless, semi-automated pose tracking with dramatically increased accuracy, greatly expanding opportunities for quantitative behavioral analysis [11, 14]. Variational Animal Motion Embedding (VAME), one of the first deep generative approaches for behavioral segmentation, demonstrated how deep learning can automatically identify latent behavioral motifs [21]. More recently, Social LEAP Estimates Animal Poses (SLEAP) extended pose estimation to multi-animal settings with high accuracy and scalability [22]. These technological innovations have enabled unprecedented resolution in the study of fine-grained behavioral patterns. Several studies have also sought to move beyond pose estimation toward real-time or near-real-time behavioral analysis, including approaches that integrate low-latency inference, multi-camera 3D reconstruction, and behavioral classification.
DeepLabCut-Live demonstrated real-time pose estimation enabling closed-loop feedback in constrained experimental settings [23]. Multi-view frameworks such as DANNCE [16, 24] and Anipose [17] have achieved high-accuracy 3D pose reconstruction using synchronized camera arrays, enabling detailed analysis of animal kinematics. Other approaches, such as B-SOiD [25] and related pose-based behavioral embedding methods, have focused on downstream behavioral segmentation or classification based on pose dynamics, often relying on post hoc analysis. Despite these advances, integrating accurate 3D reconstruction, real-time processing, and flexible behavioral analysis within a single experimental pipeline remains challenging. Existing methods often require specialized hardware configurations, extensive computation, or task-specific behavioral definitions. While these approaches have significantly advanced the precision and scope of behavioral quantification, achieving a mechanistic understanding of rapidly evolving behavior requires more than accurate pose tracking alone. It demands analytical pipelines that can extract meaningful behavioral states in real time and, critically, systems that can integrate behavioral detection with simultaneous neural manipulation. As behavioral detection methods achieve increasingly fine spatial and temporal resolution, the volume and complexity of the resulting data expand dramatically. This escalation imposes substantial computational demands that hinder real-time processing. Consequently, such methods are difficult to integrate with closed-loop experiments or rapid neural manipulations. To address these challenges, we established AVATAR (AI Vision Analysis for Three-dimensional Action in Real-time), which combines multi-view 3D pose reconstruction, low-latency processing, and flexible behavioral analysis, enabling real-time closed-loop behavioral detection coupled with simultaneous neural manipulation.

MATERIALS AND METHODS

Animals

For AVATARnet system development and validation, a total of 50 adult male C57BL/6J mice (Jackson Laboratories stock #000664; 8~16 weeks of age) were used. All wild-type mice contributed to system development, including data acquisition for pose estimation, network training, calibration, and performance validation. Among these, 22 mice were additionally used to assess the applicability of the AVATAR system across various behavioral paradigms, including the head-wired condition (n=3), social interaction (n=6), novel object interaction (n=3), the tube-licking test (n=3), an ICR mouse model (n=3), and bottom-view predatory hunting (n=4). For the dystonia mouse model, three adult male B6.D2-Cacna1a^tg/J mice (Jackson Laboratories stock #000544; 8~16 weeks of age) were used. All mice were group-housed at five to six mice per cage under a 12 h light/dark cycle with ad libitum access to food and water. All procedures were conducted according to the Korea Advanced Institute of Science and Technology (KAIST) Guidelines for the Care and Use of Laboratory Animals and were approved by the Institutional Animal Care and Use Committee (IACUC).

Recording behavior in the AVATAR studio

A transparent chamber made of 5-mm-thick acrylic panels (200 mm×200 mm×300 mm) was placed in the center of the AVATAR studio. For video recording, five high-speed cameras (FLIR® Systems, Inc., BFS-U3-23S3C-C) were installed at five different viewpoints: front, rear, left, right, and bottom. The specific positions of the cameras were adjusted according to the focal length of the camera (30 cm). The operation of the five cameras was synchronized to a central clock system and controlled by a host card (IOI Technology, U3X4-PCIE4XE101, 4-port USB 3.0 to PCIe x4 Gen 2 host card). An opaque panel was used to cover the top to block external visual stimuli. The size, design, and components of the AVATAR studio are detailed further in Supplementary Fig. 1.

For behavioral recordings of wild-type mice, we recorded one wild-type mouse at a time in the AVATAR studio. For the disease mouse model, we recorded one dystonia model mouse. For the separated social model, we recorded two wild-type mice simultaneously. All animals designated for behavioral recordings were habituated in the AVATAR studio for 10 minutes. Thereafter, all five cameras recorded for 5 minutes.

Images were captured by each camera and transferred to the host card via a USB 3.1 data cable. The host card was connected to the main computer through a PCI Express slot, and the image data were integrated into a multi-view collage image on the central processing unit (Intel i9-9900K). Receiving the image data from the five cameras and combining them into one collage image took about 30 milliseconds (~30 fps).

AVATARnet

To detect body parts from the multi-viewpoint image data obtained using the AVATAR studio, we developed a CNN algorithm called “AVATARnet” by modifying a previously reported deep-learning algorithm, “YOLOv4”, which exhibited good performance in accuracy and speed [20]. AVATARnet is based on a YOLOv4-style convolutional architecture with a CSPDarknet-53 backbone. The network was configured with an increased input resolution of 608×608 pixels and incorporated skip connections and batch normalization. These changes improved detection performance for multiple small-sized objects, including forelimbs and hindlimbs. We also added several training methods to enable accurate detection of body parts of various sizes. For general regularization, we applied dropout and data augmentation to the labeled training set.

Training AVATARnet with mouse image data sets

We prepared multi-viewpoint images of 3,200×2,000 pixels for nine body parts (nose, head, torso, anus, tail-tip, forelimbs and hindlimbs) of C57BL/6J mice without any user-defined preprocessing, and trained AVATARnet using transfer learning. During training and inference, images were processed using the standard YOLOv4 input pipeline, which applies letterbox resizing to isotropically scale images while preserving the original aspect ratio, followed by padding to a fixed 608×608 resolution before feature extraction by the CSPDarknet-53 backbone used in YOLOv4. No aspect-ratio warping or asymmetric resizing is performed within the CSPDarknet-53 backbone, and all convolutional layers preserve the spatial layout established by the letterbox preprocessing. To enhance detection ability, we prepared additional image data taken under different conditions, including hunting (200), social interaction (200), non-social object interaction (200), with a head-mounted optic fiber (100), and with a head-mounted endoscope (100). Training was performed using the basic data set (1,000 images for each body part class) obtained from solitary exploration in the AVATAR chamber, and the loss was calculated as the sum-squared error between the predictions and the ground truth. The loss function was composed of the classification loss, the localization loss, and the confidence loss.
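The original equations for these three terms are not reproduced in this text. As a hedged reference only, the sum-squared-error loss of the original YOLO formulation has the form below; whether AVATARnet uses exactly these terms and weights is an assumption. Here S² is the number of grid cells, B the number of boxes per cell, the indicator marks whether box j of cell i is responsible for an object, hats denote predictions, and the weights λ_coord and λ_noobj are symbols introduced for illustration.

```latex
\begin{aligned}
\mathcal{L}_{\text{loc}} &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2
       + \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
\mathcal{L}_{\text{conf}} &= \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
\mathcal{L}_{\text{cls}} &= \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2 \\
\mathcal{L} &= \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{loc}} + \mathcal{L}_{\text{conf}}
\end{aligned}
```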

The average loss value fell below 0.5 after more than 8,000 iterations. Training took around 12 hours: 14,000 iterations were run on a GPU computer (NVIDIA RTX 2080 Ti) processing the 3,200×2,000-pixel training image data set.
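The letterbox preprocessing described in this section can be illustrated with a minimal sketch of the geometry alone. The function names are hypothetical and not part of AVATARnet or YOLOv4; only the 3,200×2,000 source size and the 608×608 target come from the text.

```python
def letterbox_params(src_w, src_h, dst=608):
    """Return (scale, new_w, new_h, pad_x, pad_y) for an isotropic letterbox resize."""
    # A single scale factor for both axes preserves the aspect ratio.
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # The remaining space is split evenly between the two sides of each axis.
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def to_source_coords(x, y, scale, pad_x, pad_y):
    """Map a point from the padded network input back to source-image pixels."""
    return (x - pad_x) / scale, (y - pad_y) / scale


scale, new_w, new_h, pad_x, pad_y = letterbox_params(3200, 2000)
print(scale, new_w, new_h, pad_x, pad_y)
```

Under these assumptions a 3,200×2,000 image is scaled by 0.19 to 608×380 and padded with 114 rows above and below, so detections in network coordinates can be mapped back to collage pixels by undoing the padding and the scale.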

Evaluation of AVATARnet with the mouse image data set

We used mean Average Precision (mAP), Intersection over Union (IoU), and mean square error (MSE) to evaluate our network. IoU measures the overlap between the ground-truth box and a predicted box; our network showed more than 75% IoU after training. mAP, the default precision metric in the PascalVOC object detection competition, evaluates the degree of overlap between predicted and actual body part areas; by this measure, AVATARnet detected the nine body parts with 90% accuracy. We also evaluated the MSE distance for each class: after training, the network showed an MSE distance of 7~15 pixels (1.4~4.5 mm) per class.
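For reference, Intersection over Union for two axis-aligned boxes can be computed as in the following sketch. The corner-based box format `(x_min, y_min, x_max, y_max)` is an assumption made for illustration; the detector itself reports boxes as center, width, and height.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
```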

Automatic calculation of the 3D positions of body parts

To calculate the 3D position (x-y-z) of body parts from the 2D (x-y) coordinate data obtained from the five cameras (Fig. 1C~E), we used a computer-vision 3D calibration and reconstruction algorithm [26]. The 3D coordinates of a target point are computed as the intersection of the straight lines that pass from each camera center through the target point on the corresponding image plane (positioned according to the focal length). Using the parameters of each camera (intrinsic, extrinsic, and lens distortion), this algorithm computes the 3D coordinates of the selected body parts’ endpoint locations from multiple 2D images.
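The ray-intersection idea can be sketched as a least-squares problem: with noisy detections the rays rarely meet exactly, so one common choice is the point that minimizes the summed squared distance to all rays. This is an illustrative stand-in, not the calibration and reconstruction algorithm of [26]; it assumes each ray is already expressed in world coordinates (camera center plus direction), and all function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a @ p = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; no unique closest point")
    solution = []
    for k in range(3):
        m = [row[:] for row in a]
        for i in range(3):
            m[i][k] = b[i]
        solution.append(det3(m) / d)
    return solution


def closest_point_to_rays(origins, directions):
    """Point minimizing the summed squared distance to rays (origin + t * direction)."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        norm = sum(v * v for v in d) ** 0.5
        u = [v / norm for v in d]
        for i in range(3):
            for j in range(3):
                # (I - u u^T) projects onto the plane perpendicular to the ray.
                m = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += m
                b[i] += m * c[j]
    return solve3(a, b)


# Synthetic check: three cameras looking at the same target point.
target = (1.0, 2.0, 3.0)
cams = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
rays = [tuple(t - c for t, c in zip(target, cam)) for cam in cams]
estimate = closest_point_to_rays(cams, rays)
print(estimate)
```

At least two non-parallel rays are needed, which matches the requirement that each body part be visible from at least two cameras.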

Reconstruction of an AVATAR mouse using an action skeleton

To make a virtual subject, the 3D coordinates of all body parts (one point for each body part) were connected with eight lines to form an “action skeleton”, which represents the vectorial location of specific body parts relative to their attachments to the body trunk in 3D space: vector-1, from head to nose; vector-2, from torso to head; vector-3, from torso to anus; vector-4, from anus to tail-tip; vector-5, from torso to left forelimb; vector-6, from torso to right forelimb; vector-7, from anus to left hind limb; vector-8, from anus to right hind limb. Unlike a real skeleton, the action skeleton is flexible in length as it represents the distance between the two connected objects. Action skeletons were generated at 20 frames per second.
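The eight vectors listed above can be built from the nine 3D keypoints as in this minimal sketch. The keypoint names and the dictionary layout are assumptions for illustration; only the start and end body part of each vector follow the text.

```python
# (start, end) body parts for vector-1 to vector-8, in the order given in the text.
ACTION_SKELETON = [
    ("head", "nose"),
    ("torso", "head"),
    ("torso", "anus"),
    ("anus", "tail_tip"),
    ("torso", "left_forelimb"),
    ("torso", "right_forelimb"),
    ("anus", "left_hindlimb"),
    ("anus", "right_hindlimb"),
]


def action_skeleton(keypoints):
    """Return the eight 3D vectors (end minus start) for one frame."""
    vectors = []
    for start, end in ACTION_SKELETON:
        s, e = keypoints[start], keypoints[end]
        vectors.append(tuple(b - a for a, b in zip(s, e)))
    return vectors


def vector_lengths(vectors):
    """Euclidean length of each vector; lengths vary from frame to frame."""
    return [sum(c * c for c in v) ** 0.5 for v in vectors]


# Hypothetical coordinates in millimetres for a single frame.
frame = {
    "nose": (40.0, 0.0, 10.0), "head": (30.0, 0.0, 12.0), "torso": (0.0, 0.0, 10.0),
    "anus": (-30.0, 0.0, 8.0), "tail_tip": (-90.0, 0.0, 5.0),
    "left_forelimb": (15.0, 10.0, 0.0), "right_forelimb": (15.0, -10.0, 0.0),
    "left_hindlimb": (-25.0, 12.0, 0.0), "right_hindlimb": (-25.0, -12.0, 0.0),
}
skeleton = action_skeleton(frame)
print(len(skeleton), skeleton[0])
```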

Real-time closed-loop optogenetic stimulation

For closed-loop experiments, we unilaterally injected 0.5 μl of AAV2/9-Ef1α-dflox-hChR2(H134R)-mCherry or control gene constructs (Addgene, USA) into the VTA (AP, -3.1 mm; ML, 0.4 mm; DV, -4.5 mm, from bregma) of dopamine transporter::Cre (DAT-Cre) mice, with a fiber-optic cannula (200 μm diameter; Doric Lenses, Canada) implanted over the VTA.

To test the utility of the AVATAR system for real-time applications, we reinforced a specific behavioral sequence through real-time feedback from the AVATAR system via a TTL interface. We injected adeno-associated virus (AAV) harboring the channelrhodopsin-2 (ChR2) gene and implanted an optic fiber to target dopamine neurons of the ventral tegmental area (VTA) for real-time automatic stimulation. Phylogenetic tree analysis illustrated that the total action patterns can reach around 20,000 clusters, a number that changes if the user adjusts the similarity threshold. Among those, we isolated the action sequence corresponding to “rearing” and coupled it to the delivery of laser pulses (20 Hz, 5 ms, 3 mW, 1 s).
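The stimulation parameters imply a simple pulse schedule, sketched below. The function names are hypothetical, and the sketch only works out the timing arithmetic; it does not drive any hardware.

```python
def pulse_onsets_ms(freq_hz=20, train_s=1.0):
    """Onset time in milliseconds of every pulse in one stimulation train."""
    period_ms = 1000.0 / freq_hz
    n_pulses = int(round(freq_hz * train_s))
    return [i * period_ms for i in range(n_pulses)]


def duty_cycle(freq_hz=20, width_ms=5.0):
    """Fraction of the train during which the laser is on."""
    return width_ms * freq_hz / 1000.0


onsets = pulse_onsets_ms()
print(len(onsets), onsets[:3], duty_cycle())
```

With 20 Hz, 5 ms pulses over 1 s, each detected rearing sequence would trigger 20 pulses spaced 50 ms apart, with the laser on for 10% of the train.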

Motion unit analysis

A motion unit was defined as the set of action skeleton structures at a given time. To isolate non-redundant motion units, we eliminated redundant sets based on the confidence interval, x±1.96×SE (2.5 mm). We displayed all non-redundant motion units according to their level of frequency in a posture ring that ran counterclockwise from the 9 o’clock position. Along the time series, connecting one motion unit at the first posture ring to the next motion unit at the next posture ring yielded a behavior cylinder that shows the dynamic changes of the motion sequences. Analysis of the behavior cylinder revealed that there are dominant motion sequences that can be annotated as specific actions, such as walking, running, and rearing. These results suggest that the AVATAR system can be useful for unsupervised characterization and quantification of motion and action sequences.
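One way to picture the removal of redundant motion units is a greedy tolerance filter, sketched below. This is an assumption made for illustration, not the published procedure: a pose counts as redundant when every coordinate lies within the tolerance (2.5 mm here, taken from the text) of a pose that has already been kept. The function name and the flat-tuple pose format are hypothetical.

```python
def unique_motion_units(poses, tol_mm=2.5):
    """Greedy de-duplication of poses; returns (kept poses, occurrence counts)."""
    kept, counts = [], []
    for pose in poses:
        for i, ref in enumerate(kept):
            # Redundant if all coordinates match a kept pose within the tolerance.
            if all(abs(a - b) <= tol_mm for a, b in zip(pose, ref)):
                counts[i] += 1
                break
        else:
            kept.append(pose)
            counts.append(1)
    return kept, counts


# Toy example with one 3D point per pose (real poses have many more coordinates).
poses = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (10.0, 0.0, 0.0), (0.5, 2.0, -1.0)]
units, freq = unique_motion_units(poses)
print(units, freq)
```

The occurrence counts give the frequency by which motion units could be ordered on a posture ring.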

Pose estimation and detection model validation

Model performance was evaluated by comparing the AVATARnet pose estimation model with DeepLabCut (DLC), which defaults to a ResNet-50 backbone. Both models were trained and validated on the same dataset under identical conditions. Evaluation was performed using two complementary metrics: mean L2 error (the Euclidean distance between predicted and ground-truth keypoint positions) and the Percentage of Correct Keypoints (PCK) at thresholds of 0.1, 0.2, and 0.5 in the normalized coordinate space. These metrics were used to assess the trade-off between detection accuracy and computational efficiency for the pose estimation task.
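The two metrics can be computed as in the following sketch, which assumes keypoints are already expressed in normalized coordinates and that predictions and ground truth are listed in the same order. The function names are illustrative.

```python
import math


def l2_errors(pred, truth):
    """Euclidean distance between each predicted and ground-truth keypoint."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]


def mean_l2(pred, truth):
    """Mean L2 error over all keypoints."""
    errors = l2_errors(pred, truth)
    return sum(errors) / len(errors)


def pck(pred, truth, threshold):
    """Fraction of keypoints whose error falls below the threshold."""
    errors = l2_errors(pred, truth)
    return sum(err < threshold for err in errors) / len(errors)


truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
pred = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.05)]
print(mean_l2(pred, truth), [pck(pred, truth, t) for t in (0.1, 0.2, 0.5)])
```

Computed this way, PCK cannot decrease as the threshold grows, because every keypoint counted at a tighter threshold is also counted at a looser one.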

Predatory hunting behavior

The test chamber was enclosed with a 25 cm×25 cm×40 cm opaque lid and a transparent floor. A 1-cm-diameter hole at the center of the top lid allowed additional experimental devices to be connected to the mouse if needed. The transparent floor was elevated 50 cm above the ground, enabling unobstructed bottom-view recordings in a hunting experiment. A camera was mounted directly beneath the transparent floor, recording at a resolution of 1920×1080 pixels. For cricket habituation, mice were single-housed for 3 days before the assay and exposed to crickets overnight for 2 nights, with three crickets supplied to each mouse every night. Mice were food-deprived for 12 hours before the hunting recording. Each mouse was placed in the test chamber for 10 minutes to habituate to the experimental conditions. Following chamber habituation, a cricket was released into the chamber and the hunting session started. The hunting session ended when the mouse killed the cricket or 20 minutes after the cricket was introduced.

Cricket detection during predatory hunting

For cricket detection, a standard YOLOv4 object detection model (size s) was trained to identify and track crickets throughout the hunting sequence. The model output consisted of bounding box coordinates (x, y, width, height), a detection confidence score, and frame-to-frame track IDs.

Pose estimation and detection model validation during predatory hunting

Visual field classification

The cricket angle was calculated as the arctangent of the vector from the nose to the cricket position relative to the vector from the body center to the nose. Each frame was categorized into one of four visual zones corresponding to the mouse binocular and monocular fields.

∙Binocular zone: |cricket angle|≤20°, corresponding to the frontal field where both eyes converge and depth perception is maximal

∙Left monocular zone: 20°<cricket angle≤140°, the peripheral left visual field

∙Right monocular zone: -140°≤cricket angle<-20°, the peripheral right visual field

∙Out of sight: |cricket angle|>140°, regions behind the mouse where the cricket is not visible
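The zone assignment above can be sketched with a signed angle between the heading vector (body center to nose) and the target vector (nose to cricket). The sign convention is an assumption: positive angles are treated as the left side, which depends on the handedness of the image coordinates in a bottom-view recording. Function and label names are hypothetical.

```python
import math


def cricket_angle_deg(body_center, nose, cricket):
    """Signed angle in degrees from the heading vector to the nose-to-cricket vector."""
    hx, hy = nose[0] - body_center[0], nose[1] - body_center[1]
    tx, ty = cricket[0] - nose[0], cricket[1] - nose[1]
    cross = hx * ty - hy * tx
    dot = hx * tx + hy * ty
    # atan2 of (cross, dot) yields a signed angle in the range [-180, 180].
    return math.degrees(math.atan2(cross, dot))


def visual_zone(angle_deg):
    """Map a cricket angle to one of the four zones defined in the text."""
    if abs(angle_deg) <= 20:
        return "binocular"
    if abs(angle_deg) > 140:
        return "out_of_sight"
    return "left_monocular" if angle_deg > 0 else "right_monocular"


body, nose = (0.0, 0.0), (1.0, 0.0)
for cricket in [(2.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 0.0)]:
    angle = cricket_angle_deg(body, nose, cricket)
    print(cricket, round(angle, 1), visual_zone(angle))
```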

Action segmentation model training

An XGBoost classifier was trained to segment behavioral states from the AVATARnet-driven engineered feature set. The model effectively handled the high-dimensional feature space and the class imbalance inherent in the dataset, while maintaining computational efficiency on large-scale datasets. The training process included cross-validation, hyperparameter tuning, and class-weight adjustments to optimize classification performance across chasing, attacking, and consuming states.

Model performance was evaluated using the F1-score, the harmonic mean of precision and recall, which provides a balanced measure of the two:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. F1 scores range from 0 to 1, with higher values indicating better classification performance.
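For a multi-class problem such as the chasing, attacking, and consuming states, the F1-score is typically computed per class in a one-vs-rest manner. The sketch below assumes frame-wise labels and uses the identity F1 = 2TP / (2TP + FP + FN); the function name and the toy labels are illustrative.

```python
def f1_per_class(y_true, y_pred, classes):
    """One-vs-rest F1-score for each class from frame-wise labels."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Harmonic mean of precision and recall, written in terms of counts.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


y_true = ["chasing", "chasing", "attacking", "consuming"]
y_pred = ["chasing", "attacking", "attacking", "consuming"]
scores = f1_per_class(y_true, y_pred, ["chasing", "attacking", "consuming"])
print(scores)
```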

RESULTS

To obtain image data from multiple angles, we designed a chamber, the “AVATAR studio” (Fig. 1A; Supplementary Figs. 1, 2; and Methods), consisting of five Full-HD CMOS cameras (1200×1200 pixels, 20 frames per second) that record a moving subject in a transparent box from different viewpoints (1 bottom+4 side views) and an LED lighting device that prevents motion blur. The bottom panel is transparent so that the limbs under the body can be visualized. We confirmed that key body areas (the nose, head-top, tail-tip, and four paws) can be monitored by at least two of the five cameras at any given time, which enables their coordinates to be calculated. To avoid the time cost of sequentially searching the images generated by the five cameras, the image data from the five cameras were concatenated into a single collage (Supplementary Fig. 2). An object detection algorithm [22, 23] was modified to search for the major body parts of mice. The network, called “AVATARnet”, detects nine body parts in a collage and infers the center point of each object within 30 ms (Fig. 1B). Based on the YOLOv4 network, AVATARnet uses a modified network input size (608×608 pixels) and number of parameters (56.9 M) for detecting small body parts.

We trained AVATARnet with labeled data (C57BL/6J, ICR/J mouse) from over 1,000 images captured by the AVATAR studio under various conditions: solitary, with a conspecific, or with a non-social object. The trained AVATARnet successfully detected the 2D position (x-y) of body parts in any video data obtained from the five cameras at a level comparable to that of human labeling (center point difference=SD<4.04±3.71 mm) (Fig. 1C; Supplementary Figs. 3, 4; Methods; and Supplementary Video 1). We devised a posture set algorithm to correct detection errors and wrong paw assignments (left, right) during body object detection. To benchmark our model against DeepLabCut (DLC), we systematically compared detection accuracy using mean L2 error and PCK at thresholds of 0.1, 0.2, and 0.5 (Supplementary Fig. 5). For each keypoint, we computed the Euclidean distance between predicted and ground-truth positions and calculated the proportion of keypoints whose error fell below each normalized threshold. Under our five-camera multiview recording system, AVATARnet exhibited substantially higher accuracy, with an L2 error of 0.107 compared to 0.424 for DLC (Supplementary Fig. 5A). PCK scores were also higher at the 0.1 and 0.2 thresholds and equal at 0.5 (PCK@0.1: 0.765 vs 0.502; PCK@0.2: 0.995 vs 0.598; PCK@0.5: 0.983 vs 0.983) (Supplementary Fig. 5B), while AVATARnet simultaneously achieved a 3.42-fold faster processing speed than DLC. These results demonstrate that the AVATAR approach provides markedly improved pose estimation accuracy within the AVATAR studio environment.

Next, we calculated the 3D coordinates of body parts from the 2D information obtained by AVATARnet. We used a triangulation and bundle-adjustment algorithm [26] (Fig. 1D) to calculate the 3D position of particular body parts using the 2D video data obtained from at least two cameras, camera-intrinsic parameters (focal length, principal point, skew coefficient) and camera-extrinsic parameters (the location and orientation of the camera). We reconstructed virtual motion by connecting the 3D coordinates of nine body parts with eight vector sets, “action skeletons”, that ran along the skeletal structure of a mouse (Fig. 1E; Supplementary Figs. 6, 7; and Methods). The virtually-reconstructed “AVATAR mice” are action skeletons representing whole-body motions in 3D virtual space (Supplementary Video 2). Finally, through the AVATAR system, 3D coordinates of each body-part over every frame (20 fps) can be obtained and analyzed in various ways (Supplementary Fig. 8).

We tested the motion capture ability of AVATAR in various states. AVATAR successfully detects dystonia in tottering mice (a1A^tot/tot) [27], characterized by twisted body postures, and can record freely moving mice with a head-mounted optic wire (Supplementary Fig. 8A, B). It can capture the individual motion of two mice housed together, shows no interference in the presence of human-made objects, and also detects the key joints of ICR mice (Supplementary Fig. 8C~F; Supplementary Video 3).

AVATAR revealed that while a mouse could yield more than 2,000 unique poses (i.e., non-redundant pose sets) during 5 minutes of exploration in the chamber, the motion complexity obtained from 50 C57BL/6J mice showed a higher number during the same time (53,397 unique motion units, total frames=450,000; confidence interval, ±5 mm) (Supplementary Figs. 9, 10 and Methods). All behaviors of the freely moving mice are quantified with 3D skeletal coordinates, which can be converted into a time series (Fig. 2A).

Fig. 2. 3D motion quantification of AVATAR system. (A) Representative behavior sequences. Top, schematic figure of behavior from a real mouse. Middle, action skeletons from a vertical view according to each mouse’s behavior. Bottom, limb combination (joint angle variation) in time-series. Dashed lines indicate the behaviors and the action skeletons at the matched points. (B) Left, horizontal view of stacked poses set by fixing the lumbar to the x-z plane from the origin (anus: 0, 0, 0; torso center: x, 0, z). Right, vertical view of the stacked poses. (C) Visualization of the stacked pose sequences for a specific time-bin (red arrow, walking; yellow arrow, sniffing; green arrow, rearing; blue arrow, immobility; purple arrow, grooming). (D) Schematic of the AVATAR LSTM classifier model. (E) Training accuracy and loss of AVATAR LSTM. (F, G) Comparison of human annotation (F) and AVATAR labelling (G) of exploration behavior in AVATAR studio (red, walking; yellow, sniffing; green, rearing; blue, immobility; purple, grooming). (H) Cumulative graph of the number of error frames over entire frames. (I) A phylogenetic tree classifies poses into five predefined behaviors through semi-supervised learning. The data set may be divided into more groups according to the classification resolution.

Skeletal coordinates can be parameterized according to various criteria, and time periods can be classified according to specific rules. In this paper, we classified the behavior of AVATAR mice over time using an ethogram. Since the values of the skeletal coordinates vary across episodes of behavior depending on the orientation of the mouse, the spine skeleton of the mouse was fixed to the x-axis to align the orientation across all episodes (Fig. 2B). Through human labeling, the exploratory behavior of mice in the AVATAR studio was classified into five major ethogram categories (walking, sniffing, rearing, immobility, grooming) (training set: 18,000 frames) (Fig. 2C). We designed an LSTM neural network classifier [28] to classify behavioral time series data derived from the AVATAR system (Fig. 2D). The LSTM classifier training accuracy and training loss converged to 89.91% and 0.27, respectively, at 100 epochs (Fig. 2E). LSTM classification performed with an error rate of 11.97% when compared to human-labelled data (Fig. 2F~H). Additionally, the training dataset can undergo further unsupervised clustering to produce finer classifications (Fig. 2I).
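The orientation alignment can be sketched as a translation followed by a rotation about the vertical axis, following the Fig. 2 description in which the anus sits at the origin and the torso center has a zero y-coordinate. The sketch assumes z is the vertical axis; the function and keypoint names are hypothetical.

```python
import math


def align_pose(keypoints, origin="anus", axis_point="torso"):
    """Translate the origin keypoint to (0, 0, 0) and rotate about z so that
    the axis keypoint lies in the x-z plane (y = 0)."""
    ox, oy, oz = keypoints[origin]
    tx, ty, _ = keypoints[axis_point]
    yaw = math.atan2(ty - oy, tx - ox)          # heading of the spine in the x-y plane
    c, s = math.cos(-yaw), math.sin(-yaw)       # rotate by -yaw to cancel the heading
    aligned = {}
    for name, (x, y, z) in keypoints.items():
        dx, dy, dz = x - ox, y - oy, z - oz
        aligned[name] = (c * dx - s * dy, s * dx + c * dy, dz)
    return aligned


pose = {"anus": (1.0, 1.0, 0.0), "torso": (1.0, 3.0, 1.0), "nose": (0.0, 3.0, 2.0)}
aligned = align_pose(pose)
print(aligned)
```

After alignment, poses from episodes with different headings share one reference frame, so their coordinates can be stacked or fed to a classifier without orientation acting as a nuisance variable.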

To test the efficiency of AVATAR in real-time analysis and feedback, we set up a system for closed-loop photostimulation in response to TTL signals generated by the AVATAR system during specific motion sequences, especially rearing sequences.

To this end, we unilaterally injected 0.5 µl of AAV2/9-Ef1α-dflox-hChR2(H134R)-mCherry or control gene constructs into the VTA of dopamine transporter::Cre (DAT-Cre) mice, with a fiber-optic cannula implanted over the VTA (Fig. 3A). AVATAR was capable of driving photostimulation within 100 milliseconds of the mouse displaying the selected motion sequence (Fig. 3B). In this paper, the rearing pose sequence (10 frames) was selected as the trigger for photostimulation (Fig. 3C, D). During photostimulation, mice showed a different pose sequence during rearing. A total of 109 rearing episodes were observed in 6 trials of 3 mice (Off: 40, On: 69), each extracted as a time-bin of 100 frames centered on the moment the nose was at maximum height. Closed-loop photostimulation decreased both the average rearing height and the peak nose height (Fig. 3D~F). Stereotyped up-and-down movement along the y-z plane was observed in control conditions, whereas the trajectory was more complex during photostimulation (Fig. 3G, H; Supplementary Video 4).
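The trigger logic can be pictured with a sliding window over per-frame behavior labels, as in the minimal sketch below. Only the 10-frame window comes from the text; the class name, the refractory period, and the rule that all ten frames must be labeled as rearing are assumptions for illustration, and no TTL hardware call is shown.

```python
from collections import deque


class RearingTrigger:
    """Fire once when the last `window` frames were all labeled 'rearing'."""

    def __init__(self, window=10, refractory=20):
        self.recent = deque(maxlen=window)
        self.refractory = refractory   # frames to wait after a trigger (hypothetical)
        self.cooldown = 0

    def update(self, label):
        """Feed one frame label; return True when stimulation should start."""
        self.recent.append(label)
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(lab == "rearing" for lab in self.recent):
            self.cooldown = self.refractory
            return True
        return False


trigger = RearingTrigger()
labels = ["walking"] * 5 + ["rearing"] * 15
fired_at = [i for i, lab in enumerate(labels) if trigger.update(lab)]
print(fired_at)
```

In a real setup, a `True` return value would be the point at which a TTL pulse is sent to the laser driver; the refractory period keeps a single long rearing bout from retriggering during an ongoing train.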

Fig. 3. Real-time closed-loop optogenetic test in AVATAR system. (A) Activation of dopaminergic neurons in the VTA. (B) Schematic figure of AVATAR real-time closed-loop brain photostimulation. A total detection and stimulation time of 100 ms is required. (C) Schematic of reinforcing the rearing process. When the mouse rearing pose sequence is detected, photostimulation is triggered to enhance the rearing behavior. (D) Nose height graph for each condition (n=3). Blue boxes represent times when rearing sequences were detected and light stimulation was given. (E) Box plot of average nose height during rearing (yellow dot=photostimulation off, red dot=on). (F) Change in nose height in the extracted rearing sequence (time bin=100 frames, 2.5 seconds). (G) The trajectory of the nose in the extracted rearing sequence during no photostimulation (y-z plane). (H) The trajectory of the nose in the extracted rearing sequence during photostimulation (y-z plane).

Body orientation during all rearing episodes was aligned and fixed on the x-axis, which allows a stacked display of all poses during rearing episodes (Fig. 4A). With a set of stacked poses, changes in the mice's nose movement were recorded on the x-z plane, and histograms of nose coordinates were drawn for each axis (Fig. 4B). The results show alterations in the coordinate distribution of the nose during photostimulation. By observing individual rearing trials in the simulator, we confirmed that mice typically stopped before raising their head and immediately descending, whereas during photostimulation, mice lifted their head while moving (Fig. 4C). Without fixing the pose to the x-axis, we visualized the entire pose sequence history in the AVATAR chamber. Rearing was mainly observed in the corners of the chamber, whereas during photostimulation, rearing was more scattered across the entire chamber (Fig. 4D). We compared the rearing sequence between control and photostimulated conditions using machine learning tools. Clustering with t-SNE [29] showed that rearing in photostimulated mice has different features from normal rearing (Fig. 4E). A Gaussian SVM classification learner was able to distinguish between non-photostimulated and photostimulated rearing with an accuracy of over 97% (Fig. 4F). These results suggest that closed-loop stimulation with AVATAR induces an altered rearing sequence.

Fig. 4. Comparison between closed-loop optogenetic stimulation on and off during rearing behavior. (A) Visualization of extracted rearing sequence poses when stacked and aligned on the x-axis. (B) Histogram of the nose position projected onto the x-z plane during rearing (yellow dot=photostimulation off, red dot=on). (C) Example of stacked poses during rearing (not fixed to the x axis). (D) History of all poses within the AVATAR chamber. (E) t-SNE clustering of rearing with no photostimulation and with photostimulation (yellow dot=photostimulation off, red dot=on). (F) Machine learning classifier of rearing data. Gaussian SVM classifications with 97% accuracy.

To further assess the generalizability of the AVATAR pipeline to another experimental setting, we analyzed data obtained from a bottom-view mouse predatory hunting paradigm (Fig. 5A).

Fig. 5. AVATARnet model performance for mouse pose estimation and cricket detection during predatory cricket hunting. (A) Schematic figure of the bottom-view hunting recording chamber. The floor of the chamber was transparent to record hunting behavior from the bottom. (B) Representative image from the mouse pose estimation validation dataset analyzed by AVATARnet. The blue bounding box indicates the detected mouse with its corresponding confidence score. Colored dots represent the keypoints automatically detected by the AVATARnet model. (C) F1-confidence curve for the AVATARnet mouse pose estimation. The blue curve illustrates the trade-off between detection confidence and F1-score, with optimal performance (F1=0.81) achieved at a confidence threshold of 0.649. (D) Representative image from the cricket detection validation dataset analyzed by the YOLO model. The red bounding box indicates the detected cricket with its corresponding confidence score. (E) F1-confidence curve for the YOLO cricket detection model. The blue curve illustrates the trade-off between detection confidence and F1-score, with optimal performance (F1=0.98) achieved at a confidence threshold of 0.178. (F) Pose estimation accuracy comparison using L2 error between our model and DLC. Each bar represents the mean L2 error (in pixels) between predicted and ground-truth keypoints. L2 error of our model is 0.42. L2 error of DLC is 0.37. (G) Pose estimation performance comparison using Percent of Correct Keypoints (PCK) metrics between our model and DLC. PCK@0.1 of our model is 0.066 and DLC is 0.054. PCK@0.2 of our model is 0.217 and DLC is 0.314. PCK@0.5 of our model is 0.983 and DLC is 0.983.

We collected 194,167 frames of cricket hunting video at 30 fps and 1920×1080 pixel resolution from 4 different mice. We implemented AVATARnet to analyze mouse predatory hunting behavior and used a standard small-size YOLO-v4 model for cricket detection. We extracted 653 frames and annotated nine keypoints (nose, right and left forepaw, body center, right and left hindpaw, tail base, tail end, and tail tip). Of these, 106 frames were used for validation and 547 frames for additional training of the AVATAR model. Mouse pose estimation achieved an F1-score of 0.81 at a confidence threshold of 0.649 (Fig. 5B, C), while cricket detection showed more robust performance, with an F1-score of 0.98 at a confidence threshold of 0.178 (Fig. 5D, E). To benchmark our model against DeepLabCut (DLC), we applied the same evaluation procedure used for the AVATAR studio dataset (Fig. 5F, G). AVATARnet exhibited marginally lower accuracy, yielding an L2 error of 0.42 compared to 0.37 for DLC (Fig. 5F). PCK scores were similar at the extreme thresholds (PCK@0.1: 0.066 vs 0.054; PCK@0.5: 0.983 vs 0.983), though AVATARnet was at a slight disadvantage at intermediate precision (PCK@0.2: 0.217 vs 0.314) (Fig. 5G). Despite this modest accuracy trade-off, AVATARnet achieved substantially faster processing under identical experimental conditions. When evaluated on identical datasets at full HD resolution using 5,748 MiB of VRAM, our model completed analysis and video generation in approximately 30.3 milliseconds per frame, compared to 160.9 milliseconds per frame for DLC (Table 1). These results demonstrate that the AVATARnet pipeline can be reliably applied to a bottom-view mouse predatory hunting paradigm, achieving competitive pose estimation performance under an experimental setting distinct from the AVATAR studio.
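The two accuracy metrics used in this benchmark can be illustrated with a minimal sketch. It assumes that the L2 error is the mean Euclidean distance in pixels between predicted and ground-truth keypoints, and that PCK@α counts a keypoint as correct when its error is at most α times a reference length. The text does not state which reference length was used, so `ref_length` and the toy coordinates below are hypothetical and serve only to show the calculation.

```python
import math

def l2_errors(pred, truth):
    """Euclidean distance (pixels) between matching predicted and true keypoints."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]

def mean_l2(pred, truth):
    """Mean L2 error over all keypoints."""
    errs = l2_errors(pred, truth)
    return sum(errs) / len(errs)

def pck(pred, truth, ref_length, alpha):
    """Fraction of keypoints whose error is at most alpha * ref_length."""
    errs = l2_errors(pred, truth)
    return sum(e <= alpha * ref_length for e in errs) / len(errs)

# Toy data (hypothetical, not taken from the study).
truth = [(100.0, 100.0), (120.0, 100.0), (140.0, 100.0), (160.0, 100.0)]
pred = [(103.0, 100.0), (120.0, 100.0), (140.0, 108.0), (160.0, 120.0)]
ref_length = 50.0  # assumed normalisation length, e.g. a bounding-box size

print("mean L2 error:", mean_l2(pred, truth))
for alpha in (0.1, 0.2, 0.5):
    print(f"PCK@{alpha}:", pck(pred, truth, ref_length, alpha))
```

Because PCK depends on the chosen reference length, values from different studies are comparable only when the same normalisation is used.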

Table 1. Conceptual comparison between AVATAR and DeepLabCut

Category	AVATAR	DeepLabCut
System	Real-time 3D mouse behavior analysis pipeline	High-precision, markerless pose estimation method
Input data structure	Multi-view image streams acquired from five synchronized cameras	Single-view 2D image stream acquired from a single camera
Output data structure	Behavior-oriented 3D action skeletons and behavioral state labels	Frame-wise precise 2D keypoint coordinates
Inference speed	Low-latency inference enables real-time analysis	Typically higher-latency processing optimized for post-hoc accuracy
Computational efficiency	Optimized for speed with moderate computational requirements	Optimized for high-precision inference with greater computational demands

Summary of the distinct design objectives, input/output data structures, and computational characteristics of the AVATAR pipeline and DeepLabCut.

AVATAR is designed as a real-time 3D mouse behavior analysis pipeline that integrates pose estimation with downstream behavioral analysis, whereas DeepLabCut is a high-precision markerless pose estimation method optimized for frame-wise keypoint detection. The comparison highlights complementary design emphases rather than direct performance superiority, with computational characteristics described in the context of identical experimental conditions used in this study.

Building on previous findings that visual cues are critical for mouse predation [30], we next examined how AVATARnet-derived behavioral features relate to visual context during hunting, specifically how the position of the prey in the visual field correlates with behavioral phases such as chasing, attacking, and consuming. We defined two angles based on detected keypoint positions: the head angle (θ₁), formed between the body and the nose, and the cricket angle (θ₂), formed between the nose and the cricket. The mouse visual field includes a binocular region (-20° to +20° from the nasal midline) and two monocular regions (-20° to -140° and +20° to +140°) [31].
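The angle definitions and visual field boundaries above can be sketched as follows. The exact vectors are not specified in the text, so this sketch makes two labeled assumptions: the head angle is taken as the signed angle between the tail-base to body-center axis and the body-center to nose vector, and the cricket angle as the signed angle between the body-center to nose vector and the nose to cricket vector. The function names, the sign convention, and the mapping of sign to left or right are hypothetical and would depend on the camera orientation.

```python
import math

def sub(a, b):
    """Vector from point b to point a."""
    return (a[0] - b[0], a[1] - b[1])

def signed_angle_deg(v_from, v_to):
    """Signed angle from v_from to v_to in degrees, within [-180, 180]."""
    cross = v_from[0] * v_to[1] - v_from[1] * v_to[0]
    dot = v_from[0] * v_to[0] + v_from[1] * v_to[1]
    return math.degrees(math.atan2(cross, dot))

def head_angle(tail_base, body_center, nose):
    """Assumed definition: body axis versus body-center-to-nose vector."""
    return signed_angle_deg(sub(body_center, tail_base), sub(nose, body_center))

def cricket_angle(body_center, nose, cricket):
    """Assumed definition: heading (body center to nose) versus nose-to-cricket vector."""
    return signed_angle_deg(sub(nose, body_center), sub(cricket, nose))

def visual_field(angle_deg, binocular_half_width=20.0, monocular_limit=140.0):
    """Classify a cricket angle using the +/-20 and +/-140 degree boundaries."""
    a = abs(angle_deg)
    if a <= binocular_half_width:
        return "binocular"
    if a <= monocular_limit:
        # Which sign corresponds to the left or right eye is an assumption.
        return "monocular_positive" if angle_deg > 0 else "monocular_negative"
    return "out_of_sight"

# Toy keypoints (hypothetical): mouse facing along +x.
body_center, nose = (0.0, 0.0), (10.0, 0.0)
for cricket in [(30.0, 0.0), (10.0, 20.0), (10.0, -20.0), (-20.0, 1.0)]:
    theta2 = cricket_angle(body_center, nose, cricket)
    print(cricket, round(theta2, 1), visual_field(theta2))
```

Applying `visual_field` to every frame and counting the labels gives the kind of per-field frame percentages reported for the dataset.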

We first quantified cricket positions within the visual field of mice across the entire dataset. Among all video frames, crickets appeared within the binocular field in 26.8% of frames, within the monocular field in 56.5% (left 29.4%, right 27.1%), and were out of sight in 16.7% (Fig. 6A). The out-of-sight condition occurred when the mouse was grooming or rearing, or when the cricket was positioned behind the mouse. We further analyzed the distribution of each behavioral phase across the monocular and binocular visual fields (Fig. 6B~D). Chasing behavior consisted of two components: rotational movements that aligned the head toward the prey, and approach. The rotation typically began in the monocular field and transitioned into the binocular field. In contrast, during approach, the mouse moved closer while keeping the prey within the binocular field to initiate an attack. During chasing, the cricket was within the binocular visual field in 52.8% of frames and within the monocular field in 43.5% (Fig. 6B). In most binocular field cases, the mouse was engaged in forward approach. In contrast, monocular field cases generally corresponded to the onset of chasing, when the mouse performed rotational alignment, or to situations in which the cricket rapidly escaped toward the peripheral field. Attack behavior was typically initiated when the mouse approached the cricket closely following a chasing bout. Although attacks were expected to occur primarily when the prey was within the binocular visual field, only 55.5% of attack events were observed under binocular conditions, with 31.4% occurring under monocular conditions (Fig. 6C). The unexpectedly frequent occurrence of attacks under monocular conditions can be attributed to several factors. When the prey was very close to the mouse, even small positional changes resulted in large variations in the viewing angle. In addition, the cricket attempted to escape when the mouse initiated an attack, which could shift the prey into the monocular field.
Attacks also occurred when the cricket engaged in defensive behavior at the edge of the chamber, prompting the mouse to attack from a lateral position. Consuming behavior represents the final phase of the hunting sequence, occurring after the mouse has killed or immobilized the cricket through a bite or attack. During this phase, the mouse grasped the prey with its forelimbs and consumed it. Because the cricket no longer attempted to escape, consuming was, as expected, predominantly (89.1%) observed when the prey was within the binocular visual field (Fig. 6D). Together, these results demonstrate that the visual field position of the prey dynamically shapes each phase of mouse hunting behavior. Chasing, attacking, and consuming behaviors each exhibit distinct movement patterns depending on the relative position of the prey, suggesting that visual information plays a key role in sequential transitions within the hunting sequence.

Fig. 6 — Distribution of cricket positions across the mouse visual field during different phases of hunting and automated classification of mouse hunting behavioral states. (A) Distribution of prey position in the mouse visual field across all dataset. (B) Distribution of prey position in the mouse visual field during chasing. (C) Distribution of prey position in the mouse visual field during attacking. (D) Distribution of prey position in the mouse visual field during consuming. (E) Representative image of chasing state. (F) Representative image of attacking state. (G) Representative image of consuming state. (H) Representative diagram comparing the true behavioral state and the classified behavioral state from our model. (I) Accuracy of auto classification for chasing state. (J) Accuracy of auto classification for attacking state. (K) Accuracy of auto classification for consuming state. (L) Confusion matrix illustrating model performance across all behavioral states, showing classification accuracy and misclassification rates for chasing, attacking, consuming, and others.

We designed an XGBoost-based classifier [32] to automate action segmentation on manually annotated hunting sequences using dynamic behavioral feature information derived from AVATARnet (Fig. 6E~H). This automated classification was performed as a post hoc analysis to evaluate the behavioral relevance of AVATARnet-derived features. We annotated each frame across 23 hunting trials from four mice into three hunting states: chasing, attacking, and consuming (Fig. 6E~G). We extracted a comprehensive feature set including kinematic variables (distance, speed, acceleration, and jerk for both mouse and cricket, relative distance), spatial variables (head angle, cricket angle, and cricket location within visual field of the mouse), temporal variables (spectrogram representation of coordinate trajectories), and sliding-window averages. These features revealed distinct patterns across hunting phases. We validated the segmentation accuracy of AVATAR using precision, recall, and F1-score for each hunting state (Fig. 6I~L). For the chasing state, our model achieved a mean precision of 0.59±0.03, recall of 0.72±0.04 and F1-score of 0.66±0.03 (Fig. 6I). For the attacking state, our classifier exhibited a mean precision of 0.28±0.04, recall of 0.42±0.05, and F1-score of 0.34±0.04 (Fig. 6J). The reduced performance for attacking classification may be attributed to its brief duration and rarity. Because attacks consist of short, high-velocity movements embedded within rapid state transitions, they provide limited temporal information for consistent feature extraction. For the consuming state, the classifier achieved a mean precision of 0.86±0.03, recall of 0.71±0.06, and F1-score of 0.77±0.04 (Fig. 6K). The high precision indicates that consuming events were reliably distinguished from other behavioral states, likely due to the prolonged duration and distinctive posture patterns associated with this phase. 
The overall classification performance across all states yielded a total F1-score of 0.69±0.05, indicating balanced and reliable segmentation accuracy of the automated classifier. To further assess how much the extracted features contribute to characterizing hunting states, we compared classification of behavioral states from raw AVATARnet coordinates alone with classification from coordinates supplemented with the extracted features. Adding the features improved the F1-score of the XGBoost classifier from 0.30 to 0.68, indicating that the AVATARnet-derived features robustly capture the actual hunting states of mice in the bottom-view predatory hunting paradigm.
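As an illustration of the kinematic part of the feature set described above, the following minimal sketch derives speed, acceleration, jerk, mouse-to-cricket distance, and a sliding-window average from 2D trajectories. It assumes simple finite differences at the reported 30 fps and a trailing moving average. The window size, function names, and toy trajectories are hypothetical, and the spectrogram and angular features are not included.

```python
import math

FPS = 30.0  # frame rate reported for the hunting videos

def diff(values, fps):
    """First-order finite difference per second; one fewer sample than the input."""
    return [(b - a) * fps for a, b in zip(values, values[1:])]

def speeds(track, fps):
    """Frame-to-frame speed (pixels per second) along a 2D trajectory."""
    return [math.dist(p, q) * fps for p, q in zip(track, track[1:])]

def sliding_mean(values, window):
    """Trailing moving average; early samples use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def kinematic_features(mouse, cricket, fps=FPS, window=3):
    """Per-frame kinematic features from mouse and cricket positions."""
    speed = speeds(mouse, fps)
    accel = diff(speed, fps)
    jerk = diff(accel, fps)
    return {
        "speed": speed,
        "acceleration": accel,
        "jerk": jerk,
        "relative_distance": [math.dist(m, c) for m, c in zip(mouse, cricket)],
        "speed_smoothed": sliding_mean(speed, window),
    }

# Toy trajectories (hypothetical): the mouse accelerates toward a static cricket.
mouse = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0), (10.0, 0.0)]
cricket = [(20.0, 0.0)] * 5
feats = kinematic_features(mouse, cricket)
for name, values in feats.items():
    print(name, values)
```

Each differentiation shortens the series by one frame, so the feature columns would need to be aligned (for example by padding) before being passed to a classifier.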

Together, these results demonstrate that the AVATAR pipeline supports reliable pose estimation and downstream behavioral analysis across distinct experimental configurations. In the bottom-view predatory hunting paradigm, AVATARnet enabled robust mouse pose estimation and feature extraction, and the extracted features supported automated post hoc segmentation of hunting-related behavioral states. Although classification performance varied across behavioral states—particularly for brief and rare attack events—these analyses confirm that AVATARnet-derived representations capture behaviorally meaningful information beyond pose tracking alone. Importantly, this performance was achieved with substantially reduced inference latency compared to DeepLabCut under identical computational conditions, supporting the feasibility of low-latency behavioral analysis in diverse experimental settings.
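The per-state precision, recall, and F1-score values reported for the hunting classifier follow the standard definitions, which can be sketched from frame-wise labels as below. This is a generic illustration rather than the authors' evaluation code. The label sequences are hypothetical toy data, and the averaging across trials used to obtain the reported mean±error values is not reproduced here.

```python
def per_class_scores(true_labels, pred_labels, classes):
    """Frame-wise precision, recall and F1 for each behavioral state."""
    pairs = list(zip(true_labels, pred_labels))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # correctly labeled frames
        fp = sum(t != c and p == c for t, p in pairs)  # frames wrongly assigned to c
        fn = sum(t == c and p != c for t, p in pairs)  # frames of c that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy frame labels (hypothetical, not taken from the study).
true_labels = ["chasing", "chasing", "chasing", "attacking",
               "consuming", "consuming", "others", "others"]
pred_labels = ["chasing", "chasing", "attacking", "attacking",
               "consuming", "others", "others", "chasing"]

scores = per_class_scores(true_labels, pred_labels,
                          ["chasing", "attacking", "consuming"])
for state, s in scores.items():
    print(state, {k: round(v, 2) for k, v in s.items()})
```

In the toy data a single chasing frame mislabeled as attacking halves the attacking precision, which shows why a brief, rare state is sensitive to small numbers of misclassified frames.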

DISCUSSION

Machine learning has emerged as a crucial tool for the precise and unbiased assessment of animal behavior [1-5, 33]. By mimicking the brain-like motion detection postulated in the biological motion theory and utilizing available video processing technologies, the AVATAR system performs real-time motion tracking of nine body parts from freely-moving mice with millisecond resolution (Fig. 1A, B) by translating a large amount of multi-viewpoint video data and reconstructing it into an “AVATAR mouse”.

Existing approaches for markerless behavioral analysis address complementary aspects of the problem. These complementary approaches differ in system scope, data representation, and computational emphasis (Table 1). Methods such as DeepLabCut [11, 14] and DeepLabCut-Live [23] provide accurate pose tracking and, in constrained settings, enable low-latency feedback, but are primarily limited to 2D representations and often incur substantial computational costs for real-time use. Multi-view 3D frameworks, including DANNCE [24] and Anipose [17], achieve high-accuracy reconstruction of animal kinematics using specialized multi-camera setups, but are typically applied in offline or batch-processing workflows. In contrast, AVATAR was designed as an integrated experimental pipeline that supports multi-view 3D reconstruction while prioritizing real-time processing, data efficiency, and flexibility across experimental configurations without restrictive hardware requirements, enabling closed-loop behavioral detection. However, AVATAR does not aim to replace reconstruction-focused frameworks when precise joint-level kinematics or biomechanical analysis is required, and such methods may be more suitable for studies emphasizing fine-grained posture modeling. Instead, AVATAR prioritizes real-time behavioral detection and experimental flexibility, trading explicit skeletal interpretability for low-latency closed-loop integration.

Compared to existing tools, particularly DeepLabCut, AVATAR achieves comparable accuracy for the evaluated behaviors while providing substantially improved processing speed: five times faster for 2D tracking and three times faster with better accuracy for 3D reconstruction. The virtual “AVATAR mouse” representation also requires far less data storage (0.35 megabytes vs. 1,000 megabytes per minute, or roughly 500 megabytes vs. 1,440 gigabytes per mouse per 24 hours) (Fig. 1C~E), facilitating 3D motion sequencing, clustering, and action phylogeny analysis (Fig. 2I). The real-time processing capability is critical for closed-loop manipulations, where behavioral detection must trigger interventions with minimal latency.
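The storage and speed figures can be checked with a short calculation. The sketch below scales the per-minute storage values to 24 hours and converts the per-frame processing times reported for the bottom-view hunting benchmark (30.3 ms for AVATARnet, 160.9 ms for DLC) into a speed ratio and frames per second. The variable names are illustrative, 1 gigabyte is taken as 1,000 megabytes, and the nature of the larger storage baseline is not restated here, so it is simply called `baseline`.

```python
MINUTES_PER_DAY = 24 * 60

# Storage per minute, as given in the text (megabytes).
skeleton_mb_per_min = 0.35
baseline_mb_per_min = 1000.0

skeleton_mb_per_day = skeleton_mb_per_min * MINUTES_PER_DAY      # about 504 MB
baseline_gb_per_day = baseline_mb_per_min * MINUTES_PER_DAY / 1000.0  # 1,440 GB

# Per-frame processing times from the hunting benchmark (milliseconds).
avatarnet_ms = 30.3
dlc_ms = 160.9

speedup = dlc_ms / avatarnet_ms        # ratio of per-frame times
avatarnet_fps = 1000.0 / avatarnet_ms  # frames processed per second
dlc_fps = 1000.0 / dlc_ms

print(f"skeleton storage per day: {skeleton_mb_per_day:.0f} MB")
print(f"baseline storage per day: {baseline_gb_per_day:.0f} GB")
print(f"speed ratio: {speedup:.1f}x")
print(f"throughput: {avatarnet_fps:.1f} fps vs {dlc_fps:.1f} fps")
```

Under these numbers the AVATARnet throughput is slightly above the 30 fps recording rate, whereas the DLC figure is well below it, which is consistent with the emphasis on real-time use.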

AVATAR's versatility is demonstrated through three distinct applications. First, closed-loop optogenetics experiments demonstrated that the system can detect specific behavioral sequences (such as rearing) and trigger photostimulation within 100 milliseconds, enabling the precise manipulation of brain circuits during freely moving behavior. The ability to detect altered rearing sequences through machine learning classification demonstrates AVATAR's utility for quantifying subtle behavioral changes induced by neural manipulation. Second, AVATAR successfully detects dystonia in tottering mice (Supplementary Video 3), demonstrating potential for high-throughput screening of motor disorders. This can be expanded to large-scale pre-clinical screening of drugs for neurological disorders, assessed through changes in 3D body motion analysis [27]. Third, application to cricket hunting behavior demonstrates hardware flexibility: accurate 3D reconstruction from single-camera setups enables analysis of rapid, spatially complex predatory behaviors in naturalistic experimental scenarios [30]. From simple rearing to complex hunting sequences, and from single-camera to multi-camera configurations, AVATAR demonstrates adaptability across diverse real experimental requirements.

Despite these demonstrated capabilities, several technical constraints remain. The most pressing challenge arises in multi-animal tracking scenarios. Although the system can track multiple animals simultaneously, frequent identity switches disrupt longitudinal analysis of social interactions. During close-contact behaviors, this problem intensifies as overlapping body parts create occlusions that even five cameras cannot fully resolve, thereby restricting fine-grained analysis of social behaviors such as aggression, mating, or parental care.

Beyond multi-animal scenarios, AVATAR imposes specific experimental requirements. Controlled lighting and background contrast remain necessary, and training new configurations requires approximately 500 labeled instances to achieve DeepLabCut-level accuracy. In addition, reliance on GPU resources for real-time processing may limit accessibility in some experimental settings. The closed-loop stimulation demonstrated here relied on detecting rearing, a behavior characterized by relatively long bout durations and unambiguous postural features. Extending closed-loop approaches to behaviors with shorter timescales and more variable kinematic profiles, such as predatory strikes, remains a technically challenging task.

Future development will both address these constraints and expand AVATAR’s capabilities into new domains. Among technical improvements, the priority is reducing identity switches to enable robust social behavior analysis. This capability would unlock detailed studies of aggression, mating, and parental care, which are currently limited by tracking instability. Parallel efforts in transfer learning promise to reduce training data requirements across species and experimental configurations, lowering barriers to adoption.

AVATAR provides real-time 3D behavioral tracking with validated accuracy and processing efficiency, combining flexibility in experimental configuration—from single-camera to multi-camera setups—with applicability to diverse behaviors ranging from simple rearing to complex predatory sequences. The system facilitates standardized, reproducible quantification of animal behavior while enabling closed-loop manipulations previously limited to simple analog signals. By reconstructing freely-moving animals into data-efficient virtual representations, AVATAR bridges the gap between high-dimensional behavioral data and practical experimental implementation. As technical capabilities expand to address current limitations in multi-animal tracking and environmental constraints, AVATAR will increasingly enable discovery-driven approaches to understanding brain-behavior relationships across species and experimental paradigms, in both research and preclinical settings.

Supplemental Materials

en-35-2-109-supple.zip (104.1 MB, zip)

ACKNOWLEDGEMENTS

This study was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (RS-2025-00521226) and the Bio&Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (RS-2023-00266872).

Footnotes

AUTHOR CONTRIBUTIONS

Dae-Gun Kim: Conceptualization, Methodology, Formal analysis, Investigation, Validation, Resources, Writing-Original Draft, Visualization, Project Administration. Anna Shin: Methodology (AVATAR, real-time closed-loop stimulation), Investigation (real-time closed-loop stimulation), Resources (real-time closed-loop stimulation), Validation (real-time closed-loop stimulation), Visualization (real-time closed-loop stimulation). Kwanhoo Shin: Conceptualization (bottom-view predatory hunting paradigm), Investigation (bottom-view predatory hunting paradigm), Visualization (bottom-view predatory hunting paradigm). Daesoo Kim: Conceptualization, Supervision, Project administration, Funding acquisition.

References

1. Datta SR, Anderson DJ, Branson K, Perona P, Leifer A (2019) Computational neuroethology: a call to action. Neuron 104:11-24. doi:10.1016/j.neuron.2019.09.038
2. Krakauer JW, Ghazanfar AA, Gomez-Marin A, MacIver MA, Poeppel D (2017) Neuroscience needs behavior: correcting a reductionist bias. Neuron 93:480-490. doi:10.1016/j.neuron.2016.12.041
3. Vogt N (2018) Correlating behavior and neural activity at high resolution. Nat Methods 15:479. doi:10.1038/s41592-018-0057-4
4. Kriegeskorte N, Douglas PK (2018) Cognitive computational neuroscience. Nat Neurosci 21:1148-1160. doi:10.1038/s41593-018-0210-5
5. Brown AE, De Bivort B (2018) Ethology as a physical science. Nat Phys 14:653-657. doi:10.1038/s41567-018-0093-0
6. Wiltschko AB, Johnson MJ, Iurilli G, Peterson RE, Katon JM, Pashkovski SL, Abraira VE, Adams RP, Datta SR (2015) Mapping sub-second structure in mouse behavior. Neuron 88:1121-1135. doi:10.1016/j.neuron.2015.11.031
7. Tinbergen N (1963) On aims and methods of ethology. Z für Tierpsychol 20:410-433. doi:10.1111/j.1439-0310.1963.tb01161.x
8. Altmann J (1974) Observational study of behavior: sampling methods. Behaviour 49:227-267. doi:10.1163/156853974X00534
9. Roy S, Bryant JL, Cao Y, Heck DH (2011) High-precision, three-dimensional tracking of mouse whisker movements with optical motion capture technology. Front Behav Neurosci 5:27. doi:10.3389/fnbeh.2011.00027
10. Sarikcioglu L, Demirel BM, Utuk A (2009) Walking track analysis: an assessment method for functional recovery after sciatic nerve injury in the rat. Folia Morphol (Warsz) 68:1-7.
11. Mathis A, Mamidanna P, Cury KM, Abe T, Murthy VN, Mathis MW, Bethge M (2018) DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci 21:1281-1289. doi:10.1038/s41593-018-0209-y
12. Pereira TD, Aldarondo DE, Willmore L, Kislin M, Wang SS, Murthy M, Shaevitz JW (2019) Fast animal pose estimation using deep neural networks. Nat Methods 16:117-125. doi:10.1038/s41592-018-0234-5
13. Graving JM, Chae D, Naik H, Li L, Koger B, Costelloe BR, Couzin ID (2019) DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. Elife 8:e47994. doi:10.7554/eLife.47994.sa2
14. Nath T, Mathis A, Chen AC, Patel A, Bethge M, Mathis MW (2019) Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat Protoc 14:2152-2176. doi:10.1038/s41596-019-0176-0
15. Günel S, Rhodin H, Morales D, Campagnolo J, Ramdya P, Fua P (2019) DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. Elife 8:e48571. doi:10.7554/eLife.48571.033
16. Dunn TW, Marshall JD, Severson KS, Aldarondo DE, Hildebrand DGC, Chettih SN, Wang WL, Gellis AJ, Carlson DE, Aronov D, Freiwald WA, Wang F, Ölveczky BP (2021) Geometric deep learning enables 3D kinematic profiling across species and environments. Nat Methods 18:564-573. doi:10.1038/s41592-021-01106-6
17. Karashchuk P, Rupp KL, Dickinson ES, Walling-Bell S, Sanders E, Azim E, Brunton BW, Tuthill JC (2021) Anipose: a toolkit for robust markerless 3D pose estimation. Cell Rep 36:109730. doi:10.1016/j.celrep.2021.109730
18. Bolaños LA, Xiao D, Ford NL, LeDue JM, Gupta PK, Doebeli C, Hu H, Rhodin H, Murphy TH (2021) A three-dimensional virtual mouse generates synthetic training data for behavioral analysis. Nat Methods 18:378-381. doi:10.1038/s41592-021-01103-9
19. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. arXiv. doi:10.48550/arXiv.1506.02640; doi:10.1109/CVPR.2016.91
20. Bochkovskiy A, Wang CY, Liao HYM (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv. doi:10.48550/arXiv.2004.10934
21. Luxem K, Mocellin P, Fuhrmann F, Kürsch J, Miller SR, Palop JJ, Remy S, Bauer P (2022) Identifying behavioral structure from deep variational embeddings of animal motion. Commun Biol 5:1267. doi:10.1038/s42003-022-04080-7
22. Pereira TD, Tabris N, Matsliah A, Turner DM, Li J, Ravindranath S, Papadoyannis ES, Normand E, Deutsch DS, Wang ZY, McKenzie-Smith GC, Mitelut CC, Castro MD, D'Uva J, Kislin M, Sanes DH, Kocher SD, Wang SS, Falkner AL, Shaevitz JW, Murthy M (2022) SLEAP: a deep learning system for multi-animal pose tracking. Nat Methods 19:486-495. doi:10.1038/s41592-022-01426-1
23. Kane GA, Lopes G, Saunders JL, Mathis A, Mathis MW (2020) Real-time, low-latency closed-loop feedback using markerless posture tracking. Elife 9:e61909. doi:10.7554/eLife.61909.sa2
24. Marshall JD, Aldarondo DE, Dunn TW, Wang WL, Berman GJ, Ölveczky BP (2021) Continuous whole-body 3D kinematic recordings across the rodent behavioral repertoire. Neuron 109:420-437.e8. doi:10.1016/j.neuron.2020.11.016
25. Hsu AI, Yttri EA (2021) B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat Commun 12:5188. doi:10.1038/s41467-021-25420-x
26. Hartley R, Zisserman A (2003) Multiple view geometry in computer vision. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511811685
27. Kim JE, Chae S, Kim S, Jung YJ, Kang MG, Heo W, Kim D (2021) Cerebellar 5HT-2A receptor mediates stress-induced onset of dystonia. Sci Adv 7:eabb5735. doi:10.1126/sciadv.abb5735
28. Greff K, Srivastava RK, Koutnik J, Steunebrink BR, Schmidhuber J (2017) LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 28:2222-2232. doi:10.1109/TNNLS.2016.2582924
29. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579-2605.
30. Hoy JL, Yavorska I, Wehr M, Niell CM (2016) Vision drives accurate approach behavior during prey capture in laboratory mice. Curr Biol 26:3046-3052. doi:10.1016/j.cub.2016.09.009
31. Pinke D, Issa JB, Dara GA, Dobos G, Dombeck DA (2023) Full field-of-view virtual reality goggles for mice. Neuron 111:3941-3952.e6. doi:10.1016/j.neuron.2023.11.019
32. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (Krishnapuram B, ed), pp 785-794. Association for Computing Machinery, New York, NY. doi:10.1145/2939672.2939785
33. Hausmann SB, Vargas AM, Mathis A, Mathis MW (2021) Measuring and modeling the motor system with machine learning. Curr Opin Neurobiol 70:11-23. doi:10.1016/j.conb.2021.04.004
