Abstract
The advent of deep learning methodologies for animal behavior analysis has revolutionized neuroethology studies. However, the analysis of social behaviors, characterized by dynamic interactions among multiple individuals, remains a major challenge. In this study, we present “YORU” (your optimal recognition utility), a behavior detection approach leveraging a deep learning object detection algorithm. Unlike conventional approaches, YORU directly identifies behaviors as “behavior objects” based on the animal’s shape, enabling robust and accurate detection. YORU successfully classified several types of social behaviors in species ranging from vertebrates to insects. Furthermore, YORU enables real-time behavior analysis and closed-loop feedback; using this capability, we delivered photostimulation feedback to specific individuals during social behaviors in real time, even when multiple individuals were close together. This system overcomes the challenges posed by conventional pose estimation methods and presents an alternative approach for behavioral analysis.
YORU is a behavior detection tool with deep learning object detection to enable real-time social behavior analysis.
INTRODUCTION
Social behaviors such as courtship, aggression, and group formation are important for improving survival rate and reproductive efficiency in a wide range of animals, including invertebrates and vertebrates (1–3). To accelerate our understanding of the neural basis of these behaviors, it is necessary to capture information on the location and type of social interactions that occur between individuals (4). Recent advances in machine learning have led to the establishment of various tools for animal behavior detection, represented by markerless body part tracking (pose estimation) and unsupervised or supervised behavior classifications (5–13). These emerging tools enable real-time behavior analysis, allowing researchers to manipulate neural activity precisely when the animal exhibits the behavior of interest (14–16). Such a closed-loop approach promises to clarify the causal relationship between neural activity and behavior, and indeed has yielded substantial success in uncovering the neural basis of single-individual behavior (7, 17, 18). While these tools have been adapted to analyze social interactions involving multiple individuals, detecting such interactions remains a major challenge (4, 6, 7). This difficulty arises from the complexity of defining social behaviors based on the coordinates of each individual’s body parts (6, 7, 19). In particular, the accurate detection of animal behaviors involving complex actions with minimal movement (e.g., a mouse grooming its body with irregular or subtle repetitive movements) or occluded body parts poses considerable challenges. These issues intensify as the number of individuals observed increases, making it difficult to detect social interactions (e.g., aggression, courtship, and parental behavior). Therefore, novel methods are required to enable the detection of social interaction and to perform real-time behavior analysis for neural intervention.
Social interactions are often accompanied by individuals adopting distinctive postures to interact with others, such as wing extension by courting male fruit flies and mounting on females by male mice attempting copulation (20–22). In this study, we propose the detection of social behaviors of multiple individuals based on their appearances by defining each as a “behavior object.” To this end, we focused on “YOLOv5,” a high-speed object detection algorithm based on a convolutional neural network (23, 24). This algorithm achieves fast object recognition by framing detection as a single regression problem, using a unified architecture that processes the entire image in one forward pass (23, 24). The methodology is robust to variations in object orientation, number, size, and background noise (25, 26). This object detection approach is therefore a promising way to detect social behaviors of various animals without relying on body-part coordinates, making it well suited for behavior analysis and for closed-loop systems designed to investigate social interactions.
Here, we established a behavior detection system called “YORU” (your optimal recognition utility), based on an object detection algorithm (24), as follows. First, to verify the proof of concept, we tested whether YORU could be used to detect social behaviors of various animals ranging from vertebrates to insects. As an application of our behavior analysis, we compared the detection readouts with neural activity imaging in mice to interpret large-scale brain activity. Second, we evaluated the inference speed of the detection and feedback latency in YORU’s real-time analysis. Last, as a practical example of YORU’s closed-loop system, we tested the modulation of Drosophila courtship behavior by applying optogenetic neuronal manipulation under the YORU system. In the presence of multiple flies, we successfully manipulated neural activity in an individual-selective manner, optogenetically stimulating only the fly exhibiting the behavior of interest.
RESULTS
YORU is a framework for animal behavior recognition
Here, we introduce YORU, an animal behavior detection system with a graphical user interface (GUI). In the YORU system, animal behaviors, whether performed by single animals or by multiple animals interacting with each other, are classified as “behavior objects” based on their shapes with an object detection algorithm (24) (Fig. 1A). This object detection–based approach identifies animal behavior from a single frame, which distinguishes it from other behavior classification tools that rely on tracking individual body parts over time (10–13). Therefore, animal behaviors can be detected more rapidly than with previous approaches, making YORU suitable for real-time analysis. YORU is open-source Python software composed of four packages: “Training,” “Evaluation,” “Video Analysis,” and “Real-time Process” (Fig. 1B and fig. S1, A to E). The YORU system is designed to work both offline (video analysis) and online (real-time analysis) to detect animal behavior. During video analysis, YORU allows users to quantify animal behavior according to user-defined behavior classes based on animal shapes. For online use, YORU analyzes animal behavior in real time and can output trigger signals to control external devices, such as light-emitting diodes (LEDs) for optogenetic control.
Fig. 1. YORU detects animal behaviors as a behavior object.
(A) Illustrations of behavior objects. YORU can adapt to single-animal (left) and multi-animal (right) behaviors, including social behaviors. Animal behaviors (top) can be classified as behavior objects (bottom). (B) Diagram of YORU. YORU includes four packages: Training, Evaluation, Video Analysis, and Real-time Process.
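To illustrate how per-frame “behavior object” detection can be applied offline, the following is a minimal sketch (not YORU’s actual Video Analysis code) that runs a trained YOLOv5 model on every frame of a video and writes the detections to a CSV file; the model and video paths are placeholders.

```python
# Sketch of offline video analysis: run a trained YOLOv5 model on every frame
# of a video and save the detected "behavior objects" to a CSV file.
# "behavior_model.pt" and "courtship_video.mp4" are placeholder paths.
import cv2
import pandas as pd
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="behavior_model.pt")
cap = cv2.VideoCapture("courtship_video.mp4")

rows, frame_idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV frames are BGR
    det = model(rgb).pandas().xyxy[0]              # xmin, ymin, xmax, ymax, confidence, class, name
    det["frame"] = frame_idx
    rows.append(det)
    frame_idx += 1
cap.release()

pd.concat(rows).to_csv("behavior_objects.csv", index=False)
```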
We set three constraints on the workflow of YORU: ease of use for experimenters, low system latency, and high customizability. To make YORU user-friendly, we designed the system to allow quantification of animal behaviors without requiring programming knowledge. To achieve low system latency, special efforts were focused on behavior detection and feedback when operating as a closed-loop system. Here, rapid behavior detection is achieved by the object detection approach, which generates region proposals and classifies subjects in a single pass, resulting in faster detection (23, 24). Rapid feedback, such as optogenetic manipulation in response to animal behaviors, is achieved by a custom system based on multiprocessing (27): image acquisition, object recognition, and hardware manipulation [Arduino, data acquisition systems (DAQs), etc.] are processed simultaneously rather than serially. To achieve high customizability, that is, easy adaptation of YORU to various experimental setups, YORU implements a trigger output for hardware manipulation. The experimenter can customize the threshold for behavior detection in the YORU software with no or minimal programming. YORU also supports communication with external hardware (e.g., DAQ systems and microcontrollers) and other software [e.g., Bonsai (28)] via serial communication, enabling synchronized and interactive operation with existing applications. These features enable experimenters to easily implement high-performance, closed-loop neural feedback systems. We evaluated YORU on benchmark datasets used by other behavioral analysis tools: Mouse-Ventral1, Mouse-Ventral2 (8), and CalMS21 (29). Behaviors such as investigating, mounting, and grooming in mice were successfully detected, demonstrating YORU’s potential for accurate behavior detection (see Supplementary Text; fig. S2, A to D; and table S1 for details).
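As a rough illustration of this multiprocess design (a simplified sketch, not YORU’s implementation), the example below runs frame acquisition, YOLOv5 inference, and hardware triggering as three concurrent processes connected by queues; the camera index, serial port, and target class name are placeholders.

```python
# Sketch of a multiprocess closed-loop pipeline (simplified; not YORU's code):
# capture, inference, and hardware triggering run concurrently via queues.
import multiprocessing as mp
import queue

def capture_frames(frame_q):
    import cv2
    cap = cv2.VideoCapture(0)                        # placeholder camera index
    while True:
        ok, frame = cap.read()
        if ok:
            try:
                frame_q.put_nowait(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            except queue.Full:                       # drop frames if inference lags
                pass

def run_inference(frame_q, trigger_q):
    import torch
    model = torch.hub.load("ultralytics/yolov5", "yolov5s")   # or a custom model
    while True:
        names = model(frame_q.get()).pandas().xyxy[0]["name"].tolist()
        trigger_q.put("wing_extension" in names)     # placeholder target class

def drive_hardware(trigger_q):
    import serial                                    # pyserial, e.g., an Arduino
    board = serial.Serial("COM3", 115200)            # placeholder serial port
    while True:
        board.write(b"1" if trigger_q.get() else b"0")

if __name__ == "__main__":
    frame_q, trigger_q = mp.Queue(maxsize=2), mp.Queue()
    workers = [mp.Process(target=capture_frames, args=(frame_q,), daemon=True),
               mp.Process(target=run_inference, args=(frame_q, trigger_q), daemon=True),
               mp.Process(target=drive_hardware, args=(trigger_q,), daemon=True)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```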
Detection of social behaviors using object detection algorithm
In previous studies, object detection approaches have successfully classified some social behaviors of Drosophila, such as offensive behaviors and copulation (30, 31). In this study, we extended this idea and analyzed a variety of animal behaviors with YORU. The following social behaviors were tested to validate performance: (i) Fruit flies, wing extension behavior of a male toward a female during courtship (20) (Fig. 2, A to C, and movie S1); (ii) ants, mouth-to-mouth food transfer behavior among workers (trophallaxis) (32, 33) (Fig. 2, D to F, and movie S2); and (iii) zebrafish, orientation behavior toward another individual behind a partition (34, 35) (Fig. 2, G to I, and movie S3). We extracted 2000 images from multiple videos and manually labeled their behavior objects with the following definitions:
Fig. 2. Detection of animal behaviors by YORU.
(A to C) Fly – wing extension dataset. Scale bars, 500 μm. (D to F) Ant – trophallaxis dataset. Scale bars, 2 mm. (G to I) Zebrafish – orientation dataset. Scale bars, 20 mm. (J and K) Fly – group courtship datasets. Scale bar, 10 mm. (L and M) Ant – group trophallaxis datasets. Scale bar, 10 mm. [(B), (E), and (H)] Ethogram of the fly wing extension (B), ant trophallaxis (E), and zebrafish orientation (H) behaviors. Top panels show the example view of each behavior. The circles show detections by manual analysis (pink), YORU analysis (blue), or analysis using a previous tracking method (green). The horizontal axes in the ethogram represent the observation period. The colored area of the ethogram shows the occurrence of the behavior detected with YORU analysis (blue), manual analysis using BORIS (pink), and the previous tracking method (green). Scale bars, 500 μm (B), 2 mm (E), and 20 mm (H). [(C), (F), and (I)] The AP values of Fly – wing extension (C), Ant – trophallaxis (F), and Zebrafish – orientation (I) models. The values at IoU = 50% (AP@50) (left) and IoU = 75% (AP@75) (right) are shown. The horizontal and vertical axes show the number of images used for training and the AP values, respectively. Each colored line shows the AP values of different pretrained models [also in (K) and (M)]. [(K) and (M)] The AP values for Fly – group courtship (K) and Ant – group trophallaxis (M) models.
1) Fruit flies: “wing_extension,” a fly extending one of its wings (Fig. 2A). When the flies were not labeled as “wing_extension,” they were labeled as “fly.”
2) Ants: “trophallaxis,” heads of two ants engaging in a food exchange, with their mouth parts in contact with each other; “no,” the situation of no food exchange although the heads of the two ants were close together (Fig. 2D). No label was given when neither of these two behavior types was detected.
3) Zebrafish: “orientation,” two zebrafish exhibiting orientation behavior as defined in a previous study (34, 35); “no_orientation,” a zebrafish exhibiting no orientation behavior (Fig. 2G).
We created models and validated their detection accuracies using multiple videos that were not used for the model creation process. The accuracy scores for the model’s detection of fruit flies, ants, and zebrafish behaviors were 93.3, 98.3, and 90.5%, respectively, when compared to human manual annotations (Fig. 2, B, E, and H, movies S1 to S3, and table S2). The corresponding F1 scores, which provide a balanced measure of detection performance (see Materials and Methods for details), were 81.1, 95.9, and 87.9%, respectively. To benchmark these scores, we analyzed the behaviors of fruit flies and ants using pose estimation with SLEAP (7), a widely used real-time pose-estimation framework, along with unsupervised and supervised behavior classification methods [Keypoint-MoSeq (11) and A-SOiD (12), respectively] (figs. S3, A to F; S4, A to D; and S5, A and B; and tables S3 to S6). In the SLEAP analysis, some body parts were not accurately detected, and individual animals or their body orientations were occasionally misidentified (fig. S3, C to F). In the Keypoint-MoSeq analysis, which used SLEAP tracking data, the resulting behavior clusters did not adequately represent the actual behaviors of the target animals (fig. S4, A to D, and tables S3 and S4). In the A-SOiD analysis, the accuracy scores of fruit fly and ant behaviors were 69.7 and 95.1%, respectively, when compared to human manual annotations (fig. S5, A and B, and tables S5 and S6; the corresponding F1 scores were 53.9 and 87.6%). Under the same evaluation protocol, YORU outperformed A-SOiD, achieving higher accuracy and F1 scores compared to human annotations (fig. S5, A and B, and tables S5 and S6). We also compared zebrafish orientation analyses between the previous method based on body part tracking by Fish Tracker (34, 36) and human annotations, yielding an accuracy of 81.2% and an F1 score of 78.7% (table S2 and Fig. 2H). In summary, YORU can potentially detect wing extension in flies, trophallaxis in ants, and orientation in zebrafish more rapidly and with higher accuracy than previous analysis methods. More broadly, these results suggest that YORU can detect a wide range of social behaviors with accuracy comparable to human annotations.
In general, two major factors to consider for the practical application of deep learning–based analyses are the amount of training data (the number of labels) and the selection of base network (5, 37). To find optimal conditions in our dataset, we evaluated the accuracy of models with different numbers of training images (200, 500, 1000, 1500, and 2000) and YOLOv5 networks (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x; ordered by model size and complexity). Two metrics were used in this evaluation: “precision,” the ratio of correctly predicted detections to all predicted detections; and “recall,” the ratio of correctly predicted detections to all ground truths. All models detected the target social behavior with precision greater than 85% and recall greater than 90%, even when only 200 images were used to create models (fig. S6, A to C, and table S7). The precision and recall scores increased with the number of training images, finally exceeding 90 and 95%, respectively, when 1000 or more images were used for training (fig. S6, A to C, and table S7). We then evaluated typical object detection model indices: average precisions (APs) and intersection over union (IoU) (Fig. 2, C, F, and I, and fig. S7, A to C) (38). These two indices can be used to compare the accuracy of the model between conditions; the IoU shows the difference in location information between the ground truth and detected bounding boxes, while the AP reflects the accuracy of the behavior classes, which includes precision and recall values (38). For each behavior, neither the number of images used for training nor the YOLOv5 models tested affected IoU values (fig. S7, A to C). On the other hand, AP values increased with the number of training images, with the same number of images giving similar AP values across YOLOv5-based models (Fig. 2, C, F, and I). These evaluations suggest the primary factor influencing accuracy is the amount of training data, whereas the choice of base network has relatively little impact, at least in these cases. Accordingly, we recommend aligning the YOLOv5 network size with task complexity: Use smaller models (YOLOv5n/s) for real-time applications and consider larger models (YOLOv5l/x) for tasks requiring fine-grained recognition and prioritizing accuracy over speed (see Supplementary Text).
Next, we increased the number of individuals to test the performance of YOLOv5 models in multi-individual conditions. In the first condition, a group of flies (four males and four females) was analyzed to detect three objects: wing_extension, “copulation,” and other behaviors labeled as fly (Fig. 2, J and K, and movie S4). In the second condition, a group of ants (six ants) was used to detect two objects: trophallaxis and “no trophallaxis” (Fig. 2, L and M, and movie S5). In the group of flies, precision and recall scores were over 95% even when using only 200 images for training; the scores increased with the number of training images (fig. S8A and table S7). In the group of ants, precision and recall scores exceeded 90% with 500 or more training images (fig. S8B and table S7). These results suggest that YORU’s detection of social behaviors is also applicable to multiple individuals (more than two individuals) with high accuracy (Fig. 2, J to M; fig. S8, A to D; and table S7), highlighting its usefulness for analyzing various types of social behaviors. The detection accuracy does not change substantially with different YOLOv5 pretrained models and instead depends strongly on the number of training images. To explore the limits of the number of objects YORU can detect, we analyzed videos of arenas containing between 5 and 60 flies (fig. S9, A and B). The results showed that even with 60 flies present, the number of detected individuals remained within the margin of error (fig. S9B). Furthermore, increasing the number of behavioral types to three (wing_extension, copulation, and fly) had only a minimal effect on accuracy (fig. S9C and table S8). Notably, these analyses were performed using a model trained only on videos containing eight flies (“Flies – group courtship” dataset). These findings suggest that YORU is capable of effectively analyzing large numbers of subjects, even beyond the conditions it was originally trained on.
Last, we compared the accuracy of YORU with that of conventional image classification algorithms (DenseNet121, ResNet50, and VGG16) and alternative object detection frameworks (Faster R-CNN and SSD) (fig. S10, A to F, and table S9). While image classification algorithms lack the ability to identify which individual exhibited a specific behavior and where it occurred within the arena, they detected social behaviors with accuracy comparable to that of YORU (fig. S10A and table S9). While other object detection algorithms are capable of analyzing social behaviors, YORU demonstrated more stable accuracy across varying numbers of training images, particularly at the AP@50 threshold (i.e., the AP value at IoU = 50%) (fig. S10, B to F). However, at the more stringent AP@75 threshold, some alternative algorithms achieved higher performance (fig. S10, B to F). These findings indicate that both image classification and object detection approaches are viable for the analysis of animal social behavior. However, among the object detection methods, YOLO-based algorithms offer distinct advantages, including lower training costs and reduced inference latency (39, 40). Moreover, their capacity to provide spatial information about behavioral events renders YOLO-based models particularly well suited for integration into YORU.
The relationship between behavioral readouts and neural activity interpretation
One of the main questions in neurophysiology is how to interpret which sensations and behaviors explain observed neural activity. One promising methodology to address this question is to combine time series analyses of multiple behavioral types via video analysis with neural activity measurements. To test the performance of YORU for such tasks, we recorded dorsal cortex–wide neural activity with wide-field calcium imaging from mice running in a virtual reality (VR) system, which provides visual feedback coupled to their locomotion (41) (Fig. 3A). A mouse on a treadmill in VR typically exhibits multiple behaviors, such as running (“Running”), whisker movement (“Whisker-On”), eye blinking (“Blinking”), and grooming (“Grooming-On”) (Fig. 3B and movie S6). First, to validate the behavior classification performance of YORU, we estimated the time series of eight behavior classes from video analysis: “Running,” “Stop,” “Whisker-On,” “Whisker-Off,” “Eye-Open,” “Eye-Closed,” “Grooming-On,” and “Grooming-Off.” Precision and recall scores for the model’s detection of these behaviors were 91.8 and 92.7%, respectively, indicating that the model detects mouse behaviors as accurately as manual human annotations (table S10). For all classes detected by this model, the average IoUs and AP@50 were above 0.60 and 0.55, respectively (fig. S11 and table S10). Previous studies have reported that rodents actively move their whiskers to seek and identify objects or avoid obstacles in front of them during locomotion (42–45). In line with these reports, the time series of active whisker movement (the Whisker-On label) was positively correlated with Running periods, whereas Whisker-Off was negatively correlated with Running (Fig. 3, B and C). These results further confirmed that YORU’s approach to behavior labeling can efficiently detect typical behaviors of mice.
Fig. 3. YORU uncovers the relationship between behavioral readouts and neural activity interpretation.
(A) The setup for virtual reality (VR) and cortex-wide imaging in mice. (B) Time series of locomotion speed (red, calculated from rotary encoder signals of VR) and YORU readout (black). Representative images are shown at the bottom. (C) Correlation matrix between each behavior. Correlation indices are derived from the YORU readout time series data. (D) Top view of the Allen Common Coordinate Framework atlas of the dorsal cortex. The rough divisions shown include auditory areas (yellow), association areas (magenta), somatosensory areas (cyan), and motor areas (blue). Black asterisks show the somatosensory areas representing forelimb and hindlimb information. (E) Pseudo-colormap of Spearman’s correlation coefficient (YORU readout versus neural activity of each pixel). Whisking (with Stop) is defined as the intersection of Stop and Whisker-On.
Next, we investigated which brain regions in the cortex correlate with YORU’s readout. The Running epoch was highly correlated with the neural activity of medial motor areas, somatosensory areas representing forelimb and hindlimb information, visual areas, and the posterior association region (retrosplenial cortex) (Fig. 3D). Whisking during the Stop period and Blinking behavior were also coupled to distinct macroscopic activity patterns of widespread regions in somatosensory and visual areas (Fig. 3, D and E). As expected, grooming behavior correlated specifically with neural activity of forelimb somatosensory and motor areas (Fig. 3, D and E). The spatial patterns of the correlation maps derived from YORU readouts are highly similar to those derived from the ground-truth labels for all analyzed behaviors (fig. S12). This strong correspondence supports our conclusion that YORU provides behaviorally relevant readouts that reliably correlate with underlying neural activity patterns, comparable to those obtained using traditional ground-truth methods. These results demonstrate YORU’s applicability and potential for precise and quantitative interpretation of neural activity across various animal behaviors.
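As a simplified sketch of this type of analysis (the exact pipeline used in this study may differ), the pixel-wise Spearman correlation between a binary behavior readout and a wide-field calcium imaging movie can be computed as follows; array shapes, variable names, and the random stand-in data are illustrative.

```python
# Sketch: pixel-wise Spearman correlation between a binary behavior readout
# and wide-field calcium imaging frames (shapes and arrays are illustrative).
import numpy as np
from scipy.stats import spearmanr

n_frames, h, w = 3000, 128, 128
dff = np.random.rand(n_frames, h, w)        # stand-in for the dF/F movie
running = np.random.rand(n_frames) > 0.7    # stand-in for the YORU "Running" readout
# e.g., "Whisking (with Stop)" would be the logical AND of the Stop and Whisker-On readouts

corr_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        corr_map[i, j], _ = spearmanr(running, dff[:, i, j])
```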
Inference speed and system latency of YORU
A closed-loop system that relies on live feedback of animal behaviors requires a low-latency solution for behavior detection and feedback outputs (18). To assess YORU’s potential for application in closed-loop systems for social behavior analysis, we measured the total time required from frame acquisition to behavior estimation using a simple light detection task (Fig. 4A). Possible major factors that could affect the speed of YORU’s behavior detection are (i) the network structure, (ii) image size, and (iii) computing hardware, especially the graphics processing unit (GPU).
Fig. 4. Validation of YORU’s operation speed.
(A) “LED lighting” dataset. Two state classes were defined: ON and OFF, indicating that the LED light was turned on and off, respectively. (B) Schematic of system latency measurements. The camera captures the LED light, YORU detects a frame, and the trigger-controller DAQ outputs the TTL voltage based on detection by YORU. The recording DAQ logged the TTL pulse from the trigger-controller DAQ and the LED voltage. (C) The system latency of LED lighting – Small (left) and LED lighting – Large (right) models. The system latency of each model was calculated using camera images with resolutions of 640 × 480 pixels and 1280 × 1024 pixels, respectively. For 640 × 480, n = 176 (YOLOv5n), 200 (YOLOv5s), 172 (YOLOv5m), 200 (YOLOv5l), and 168 (YOLOv5x) trials. For 1280 × 1024, n = 172 (YOLOv5n), 200 (YOLOv5s), 200 (YOLOv5m), 183 (YOLOv5l), and 171 (YOLOv5x) trials. (D) The system latency at different camera frame rates. The LED lighting – Small model was used. n = 155 (30 fps), 158 (60 fps), 176 (100 fps), 150 (130 fps), 157 (160 fps), and 159 (200 fps) trials. [(C) and (D)] Violin plots represent the probability density of individual data points within the range of possible values.
To demonstrate their impact on analyzing each frame, we measured the single-frame inference latency with (i) five YOLOv5 architectures (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x), (ii) two input image sizes (640 × 480 or 1280 × 1024 pixels, images were resized before feeding to the neural network), and (iii) a variety of NVIDIA GPUs in Windows PCs (table S11). Inference speed was calculated by comparing the time before (t1) and after (t2) the frame was analyzed by a model (fig. S13A). We found that the smallest network (YOLOv5n) was the fastest, while the largest network (YOLOv5x) had the greatest inference latency (fig. S13B and table S11). In addition, the inference speed was faster with a smaller image size and more powerful NVIDIA GPUs (table S11). For example, with NVIDIA RTX 4080 GPU, we achieved inference speed as low as ~5.0 ms per frame (~200 fps) using YOLOv5s networks (fig. S13B). These results suggest that the network architecture, input image size, and GPU all influence YORU’s inference speed.
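A minimal sketch of how such per-frame inference latency can be timed is shown below (the measurement code used in this study may differ); the dummy frame stands in for a camera image.

```python
# Sketch: timing single-frame inference latency (t2 - t1) for a YOLOv5 model.
import time
import numpy as np
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5n")   # smallest YOLOv5 network
frame = np.zeros((480, 640, 3), dtype=np.uint8)           # dummy 640 x 480 frame

latencies_ms = []
for _ in range(100):
    t1 = time.perf_counter()
    _ = model(frame)
    t2 = time.perf_counter()
    latencies_ms.append((t2 - t1) * 1000.0)
print(f"median inference latency: {np.median(latencies_ms):.1f} ms")
```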
The design of closed-loop systems adapted to neuroscience experiments requires the simultaneous control of several processes, such as camera capture and hardware manipulation (28). In addition, several other factors, such as camera type and frame rate, trigger destination type, and PC memory, can affect system latency. Therefore, we tested the performance of YORU’s closed-loop system from end to end, covering the entire process from the camera capturing an image of the LED to the PC detecting whether the LED was lit and sending a trigger signal to a DAQ system upon detecting the “ON” state (Fig. 4, A and B). The delay between the LED turning on and the trigger signal output, both detected by measuring their voltages with the recording DAQ, was as low as 30 ms per event (Fig. 4C). This suggests that the end-to-end system latency of the YORU system is around 30 ms in this setup, which is sufficiently low in most cases to provide real-time feedback in response to animal behavior. Next, we assessed the factors that affected the end-to-end system latency, such as networks, input image size, camera frame rate, and system hardware. The effect of the network differences was almost negligible except for YOLOv5l and YOLOv5x, which showed larger latencies than the others (Fig. 4C). On the other hand, the input image size strongly affected the latency; the average latency with the smaller image (640 × 480 pixels) was ~30 ms, while that with the larger image (1280 × 1024 pixels) was ~75 ms (Fig. 4C). The frame rate measurements showed that even when camera frames were acquired as fast as the model’s inference speed (~200 fps), the system delay did not increase substantially, owing to YORU’s multiprocessing design (Fig. 4D). In a multiprocessing system, the size and processing speed of the random-access memory (RAM) affect the processing speed of the system. In YORU’s real-time process, the size of the RAM (between 16 and 32 GB) had little effect on system latency, suggesting that 16-GB RAM is sufficient to operate YORU’s real-time process (fig. S14A). The effect on system latency due to hardware differences (cameras and trigger devices) was small (fig. S14, B and C). These results suggest that inference speed and input image size are the primary factors affecting end-to-end system latency.
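The latency estimate from the two recorded voltage traces can be sketched as follows (illustrative only; file names, sampling rate, and thresholds are placeholders).

```python
# Sketch: estimate end-to-end latency from two voltage traces recorded by the
# same DAQ (LED voltage and trigger TTL); file names and rates are placeholders.
import numpy as np

fs = 10000.0                                 # sampling rate in Hz (illustrative)
led = np.load("led_trace.npy")
ttl = np.load("ttl_trace.npy")

def rising_edges(trace, threshold=2.5):
    above = trace > threshold
    return np.flatnonzero(~above[:-1] & above[1:]) + 1

led_on, ttl_on = rising_edges(led), rising_edges(ttl)
# pair each LED onset with the first trigger edge that follows it
latencies_ms = [(ttl_on[ttl_on > t][0] - t) / fs * 1000.0
                for t in led_on if (ttl_on > t).any()]
print(f"mean end-to-end latency: {np.mean(latencies_ms):.1f} ms")
```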
Furthermore, we benchmarked YORU against SLEAP (7). To assess the closed-loop response speed, we measured onset latency under identical experimental conditions (fig. S15, A to C). YORU exhibited a mean latency of 31.3 ± 8.0 ms (mean ± SD, n = 250 trials), which was approximately 30% lower than SLEAP’s latency of 46.5 ± 10.0 ms (mean ± SD, n = 250 trials) (fig. S15C). These results demonstrate that YORU offers faster processing speeds for real-time analysis.
YORU application for real-time optogenetic system
Next, to assess the practical applicability of the YORU closed-loop system, we applied it to event-triggered optogenetic manipulation. During courtship, the male fruit fly extends his wing to serenade the female with a unique sound known as the courtship song. Upon hearing it, female flies gradually increase their receptivity to copulation (46). We hypothesized that if wing extension is inhibited whenever a male attempts to extend his wing, copulation rates would be reduced. Using a split-GAL4 strain that specifically drives gene expression in pIP10 neurons, which are descending neurons regulating courtship song production (47, 48), we expressed the green light–gated anion channel GtACR1 (49) in male pIP10 neurons (Fig. 5A). We then paired individual males of this genotype with a wild-type female in a chamber and allowed YORU to detect single-wing extension by a fly. In this system, when YORU detects an object labeled as wing_extension, it delivers green photostimulation to the entire chamber (Fig. 5, B and C, and movie S7). As a control for photostimulation, we used an event-triggered light with a 1-s delay that illuminated the entire chamber (Fig. 5C, “delayed” group). Male flies expressing green fluorescent protein (GFP) in pIP10 neurons were also used as a genetic control group. We then analyzed the wing extension ratio during courtship and the copulation rate during the 30-min observation period (Fig. 5, D and E). In the experimental group, males decreased the amount of wing extension during courtship (Fig. 5D), validating the optogenetic inhibition of pIP10-induced behavior. In line with this, the cumulative copulation rate was significantly lower in the experimental group than in the control groups (Fig. 5E). These results confirm the importance of male pIP10 neurons for inducing wing extension behavior, consistent with previous reports (47, 48, 50), which in turn contributes to male copulation success. They also validate YORU’s performance in driving real-time manipulation of neural activity in response to the detection of a specific behavior.
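The trigger logic of this closed-loop experiment can be sketched roughly as below (not YORU’s actual code); the DAQ channel name (“Dev1/port0/line0”), model path, and confidence threshold are placeholders, and the digital line is assumed to gate the green LED.

```python
# Sketch: drive a DAQ digital line (gating the green LED) whenever a
# "wing_extension" object is detected in the current frame. The DAQ channel
# name, model path, and threshold are placeholders.
import nidaqmx
import numpy as np
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="wing_extension.pt")

def update_trigger(frame, task, conf_threshold=0.5):
    det = model(frame).pandas().xyxy[0]
    hit = ((det["name"] == "wing_extension") & (det["confidence"] > conf_threshold)).any()
    task.write(bool(hit))                    # line high -> photostimulation on

with nidaqmx.Task() as task:
    task.do_channels.add_do_chan("Dev1/port0/line0")
    dummy_frame = np.zeros((480, 640, 3), dtype=np.uint8)
    update_trigger(dummy_frame, task)        # called for every acquired frame in practice
```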
Fig. 5. Neural manipulation in response to male wing extension using YORU.
(A) Microscopy image of pIP10 split-GAL4 expression pattern in a representative male brain and ventral nerve cord. Scale bar, 100 μm. Signals of the GFP marker (green) and counter-labeling with the nc82 antibody (magenta) are shown. (B) Schematic of the YORU’s closed-loop conditions. YORU analyzes camera frames. Then, if YORU detects wing_extension, it sends signals to the trigger controller (DAQ) and operates the LED light. (C) Schematic of light conditions. As the experimental photostimulation condition, YORU introduces green photostimulation to the entire chamber when it detects a fly showing wing extension (event). As a photostimulation control, we used an event-triggered light with a 1-s delay that illuminates the entire chamber (delayed). pIP10 specific split-GAL4>UAS-GtACR1 (pIP10 neurons > GtACR1) males were used for these two groups. In addition, pIP10 split-GAL4>20XUAS-IVS-mCD8::GFP (pIP10 neurons > GFP) males were used as a genetic control. The following color code was used: experimental group (blue), photostimulation control group (pink), and genetic control group (dark blue) [also in (D) and (E)]. Sample size indicates the number of males tested. (D) Ratio of time spent displaying wing extension before copulation. The aligned rank transform one-way analysis of variance (ART one-way ANOVA) test corrected with the Benjamini-Hochberg method was used for statistical analysis. Boxplots display the medians (horizontal white line in each box) with 25th and 75th percentiles and whiskers denote 1.5× the interquartile range. Each point indicates individual data. (E) Cumulative copulation rate of pIP10 split-GAL4>GtACR1 males. Pairwise comparisons using log-rank test corrected with the Benjamini-Hochberg method were used for statistical analysis. [(D) and (E)] Not significant (n.s.), P > 0.05; *P < 0.05.
YORU application for individual-focused photostimulation
We applied YORU to individual-selective neural manipulation in response to social behavior between multiple individuals. We included a projector in YORU’s closed-loop system to control the light pattern for optogenetic stimulation (Fig. 6, A to C, and movie S8). When YORU detects the target behavior, it sends location information to focus the projector light. We used the fly courtship assay to test the usability of the system by suppressing female hearing only when the male sings a courtship song, as indicated by his wing extension. We hypothesized that disrupting hearing in females during male courtship song production would suppress female mating receptivity, leading to a reduced copulation rate. Using the JO15-2-Gal4 strain that selectively labels auditory sensory neurons, we expressed GtACR1 (49) in female auditory sensory neurons (i.e., JO-A and JO-B neurons) (Fig. 6D and fig. S16). We then paired each of these females with a wild-type male in a chamber and used the online capabilities of YORU. In this system, when YORU detects an object labeled as wing_extension, the YORU-operated projector illuminates the object labeled as fly, which typically is the female courted by the male (Fig. 6, B, C, and E, and movie S8). The individual-focused photostimulation operated by YORU successfully directed illumination to the thorax of the target fly during 89.5% of the total stimulation period (table S12). As a control for individual-focused photostimulation, we used pattern light (3-s on, 4-s off) illuminating the entire chamber (Fig. 6E). In the experimental group (JO15-2>GtACR1 female with event-triggered light condition), the copulation rate was significantly lower than in the control groups (Fig. 6, E and F). This result confirms the importance of auditory sensory neurons for females to detect the male’s courtship song to enhance copulation receptivity. Again, YORU was able to optogenetically manipulate neural activity using individual-focused illumination, even when multiple individuals were moving in a chamber at the same time. These experiments validated the usefulness of the YORU system, which can control various devices, such as a projector, a DAQ, or an Arduino, through its trigger output. These proof-of-principle experiments demonstrate the usefulness of YORU for the online detection of social behaviors and for individual-focused manipulation of neural activity through optogenetics.
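The projector-side drawing step can be sketched as follows (illustrative only, using OpenCV; it assumes that camera and projector pixel coordinates are already aligned, and the example bounding box is arbitrary).

```python
# Sketch: build the projector image for individual-focused photostimulation by
# drawing a filled green circle over the target fly's bounding box, then show
# it full screen on the projector display.
import numpy as np
import cv2

def make_stimulus(frame_shape, target_box):
    h, w = frame_shape[:2]
    canvas = np.zeros((h, w, 3), dtype=np.uint8)          # dark background
    x1, y1, x2, y2 = target_box
    center = (int((x1 + x2) / 2), int((y1 + y2) / 2))
    radius = int(max(x2 - x1, y2 - y1) / 2)
    cv2.circle(canvas, center, radius, (0, 255, 0), -1)   # filled green circle (BGR)
    return canvas

# assumes camera and projector pixel coordinates are already aligned
cv2.namedWindow("projector", cv2.WINDOW_NORMAL)
cv2.setWindowProperty("projector", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
cv2.imshow("projector", make_stimulus((480, 640), (300, 200, 340, 230)))
cv2.waitKey(0)
cv2.destroyAllWindows()
```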
Fig. 6. Individual-specific neural manipulation by YORU.
(A) Experimental setup of individual-focused photostimulation. An infrared (IR) camera is used for observing flies. IR LED light is used for IR camera recording. An IR pass filter allows only IR light to be captured by the IR camera, preventing visible light noise. The projector is used for introducing individual-focused photostimulation. (B) Schematic of the closed-loop system for the individual-focused photostimulation assay. YORU analyzes camera frames. If YORU detects wing_extension, it draws a green circle on the fly bounding box and sends the image to the projector. The projector introduces green-circled photostimulation to the fly. (C) A representative situation during individual-focused photostimulation. White arrowheads show the fly displaying wing extension. (D) JO15-2-GAL4 expression in the female antenna. GFP markers driven by JO15-2-GAL4 (JO15-2-GAL4>20XUAS-IVS-mCD8::GFP) are detected in Johnston’s organ. White arrowheads show JO neurons. D and V indicate the dorsal and ventral sides, respectively. Scale bars, 100 μm. (E) Schematic of light conditions. In the experimental photostimulation condition, when YORU detects a fly showing wing extension, YORU introduces individual-focused photostimulation to the other fly (Event-triggered). In the photostimulation control, we used pattern light (3-s on, 4-s off) to illuminate the entire chamber irrespective of the wing extension event (Pattern). (F) Cumulative copulation rate of JO15-2>GtACR1 females. We used JO15-2-GAL4>UAS-GtACR1 (JO15-2>GtACR1) females for the experimental group. As a genetic control, we used +>UAS-GtACR1 (+>GtACR1) females. The following color code was used: experimental group (blue), photostimulation control group (pink), and genetic control group (dark blue). Pairwise comparisons using the log-rank test corrected with the Benjamini-Hochberg method were used for statistical analysis. Sample size indicates the number of females tested. Not significant (n.s.), P > 0.05; ***P < 0.001.
DISCUSSION
Here, we presented YORU, an animal behavior detection system using an object detection algorithm. YORU allowed the detection of social behaviors, as well as single-animal behaviors. Furthermore, by introducing real-time analysis, YORU can operate a closed-loop system with low latency and high customizability. We also demonstrated the practical applicability of real-time neuronal manipulation using fly courtship behavior. In particular, we created an individual-focused illumination system to manipulate the neural activity of selected individuals in response to a specific behavior. YORU’s closed-loop system is thus a powerful approach for social behavior research. On the user side, YORU can be used entirely through its GUI without any programming. Just as pose estimation analysis tools such as DeepLabCut (5) have revolutionized neuroscience research, YORU will meet the needs of many biologists and stimulate the generation of novel, testable hypotheses.
The classification of specific behaviors is essential in biology. In pose estimation–based approaches, behavior is typically defined by the positions of body parts. Both supervised and unsupervised classification methods using body part coordinate data can be used to identify known behaviors and to discover previously unidentified behavioral patterns by analyzing time series coordinate information (5, 10, 11, 13). However, achieving high-accuracy classification of known behaviors in real time remains a challenge. For example, pose estimation approaches that recognize behaviors such as wing extension in fruit flies require accurate detection of the wings and body axes, as well as the definition of behaviors based on their relative angles. If any body part is tracked inaccurately, the behavior may not be identified correctly. Furthermore, tracking the behavior of multiple individuals requires distinguishing which body parts belong to which animal. Although various algorithms have been proposed to address these challenges, real-time behavior tracking of multiple individuals remains difficult (6, 7, 19). In our attempt to classify behaviors using a body-part tracking approach on the “Fly – wing extension” (two flies) and the “Ant – trophallaxis” (two ants) datasets, we observed several limitations, including frequent failures in body-part detection, misidentification of individuals, and incorrect orientation assignments (fig. S3, A to F). A likely cause for these failures is that the Fly – wing extension dataset, originally recorded to validate YORU for capturing wing motion, exhibited limited image quality. Consequently, certain body parts, including wings, were difficult to resolve, reducing the accuracy of SLEAP analysis (fig. S3, C and E). Such limitations in body-part tracking likely represent a primary factor contributing to the reduced behavior detection accuracy observed in the Keypoint-MoSeq and A-SOiD analyses. In contrast, the object detection algorithm used in YORU (YOLOv5) provides a robust alternative that maintains reliable performance despite variations in object orientation, number, size, and background noise (24, 39, 40). As such, it effectively compensates for the limitations of pose estimation–based behavior classification, even under conditions involving multiple animals (Fig. 2, J to M, and fig. S9, C) (6, 7). Notably, YORU detects animal behavior in a single step from each frame, enabling rapid inference suitable for real-time processing, even at camera resolutions sufficient for animal behavior quantification. However, a key limitation of YORU lies in its inability to detect behaviors that cannot be reliably distinguished from a single video frame (see the “Limitations of this study” section for details). In such cases, pose estimation approaches often outperform object detection methods due to their ability to incorporate temporal dynamics across multiple frames. Thus, object detection approaches (such as YOLO) and pose estimation approaches are fundamentally distinct yet complementary, compensating for each other’s shortcomings rather than competing. Researchers can select the most suitable tool or combine both approaches to meet their specific needs, thereby accelerating the advancement of behavior analysis across disciplines.
In this study, we demonstrated that YORU is applicable to detecting typical behavior patterns in head-fixed mice (Fig. 3). While image classification approaches (e.g., ResNet50 and DenseNet121) can also be effective for head-fixed animals, a key advantage of YORU is its ability to provide both behavioral classification and spatial localization. Spatial information is valuable for enabling closed-loop interventions that require a low false-positive rate. For instance, position detection remains essential in head-fixed experiments where limb or finger movements must be monitored (e.g., tasks involving grasping or lever pulling). Moreover, when a specific behavior is detected in a location that is either anatomically implausible or spatially inappropriate, the spatial information can serve as a complementary metric independent of classification confidence. This enables the exclusion of false-positive detections from subsequent analysis or closed-loop interventions.
In behavioral neuroscience, optogenetics serves as a powerful approach for cell type–specific and spatiotemporally precise control of neural activity, especially for investigating the causal relationship between the activity of neural circuits and behavior (14, 51). With state-of-the-art genetic tools, it is now possible to express light-gated ion channels in specific neurons to perform neural activity intervention (14, 49). In addition, the development of machine learning techniques has made it possible to create closed-loop experimental setups and manipulate neural activity during behavior, allowing us to explore the causal relationship between neural circuits and behavior in more detail (28, 52, 53). In research on the neural bases of social behaviors, optogenetic manipulation of only specific individuals, even in the presence of multiple individuals, can be a powerful approach (54). However, the difficulty of capturing the behavior of multiple individuals simultaneously in real time has hampered online behavior analysis in such settings. The YORU system allows us to create a closed-loop system that can analyze animal behavior and operate photostimulation for optogenetics in real time. Furthermore, the YORU system allows us to conduct individual-focused optogenetics experiments with widely available equipment, such as projectors, cameras, and personal computers.
Various analysis tools, driven by advances in deep learning, have contributed to biology research. Using YORU, biologists can use high-performance object detection algorithms to perform behavioral quantification analyses and intervention via a GUI, both offline and online. The object detection paradigm has the potential not only for animal behavior quantification, but also for capturing various biological phenomena. In particular, it has been incorporated as a tracking method for humans and fish, and has been applied to the behavioral classification of animals and plants (e.g., Drosophila mating behaviors and the stomatal opening and closing of Arabidopsis) (30, 31, 55, 56). The YORU system can contribute to making the use of object detection algorithms more widespread in biology and to substantially reduce the labor of biologists.
Limitations of this study
Although YORU enabled the detection of various social behaviors and hardware operations in response to animal behaviors, it has several limitations:
1) Because YORU’s backend detects behaviors by per-frame appearances, it is challenging to identify behaviors that require temporal context (30), such as foraging or mating attempts.
2) YORU currently does not support individual identification (ID tracking). To address limitations (1) and (2), postanalysis of object detection outputs (for example, temporal smoothing of per-frame labels, as sketched at the end of this section) or complementary tools such as Keypoint-MoSeq (11) or DeepEthogram (8) is necessary.
3) Hardware operation presents another limitation. Although the system delay in our condition was ~30 ms, additional delays in trigger processing due to external hardware should be considered. In individual-focused photostimulation experiments, we observed an additional latency on the projector side before the patterned image signal was transmitted and displayed. This latency may cause fast-moving individuals to exit the illuminated area before the stimulus is delivered. Possible solutions include incorporating predictive algorithms or using a low-latency projector (57, 58).
Overcoming these limitations would further expand the options for studying dynamic animal social behaviors.
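As a simple example of the postanalysis mentioned in limitation (2), per-frame detection labels can be smoothed with a sliding majority vote so that single-frame flickers are not counted as behavior bouts (a sketch, not part of YORU).

```python
# Sketch: smooth noisy per-frame behavior labels with a sliding majority vote
# so that single-frame detection flickers do not count as behavior bouts.
import numpy as np

def smooth_labels(per_frame, window=5):
    """per_frame: 1D array of 0/1 detections; window: odd number of frames."""
    half = window // 2
    padded = np.pad(per_frame, half, mode="edge")
    return np.array([int(padded[i:i + window].sum() > half)
                     for i in range(len(per_frame))])

raw = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0])
print(smooth_labels(raw))   # isolated flickers are removed
```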
MATERIALS AND METHODS
Development of YORU system
YORU is written in Python. The GUI of YORU was developed using the DearPyGui library, which enables fast, interactive GUI development and real-time plotting of acquired data. To support various image acquisition devices, such as webcams and high-performance machine vision cameras, we used the OpenCV API for image acquisition. All processing and data streams were handled in a multiprocessing manner with the Python library “multiprocessing.” To label animal behaviors, YORU uses the open-source annotation software “LabelImg” (https://github.com/HumanSignal/labelImg). To detect “behavior objects,” YORU uses the YOLOv5 package (https://github.com/ultralytics/yolov5). The code for YORU is available at https://github.com/Kamikouchi-lab/YORU as open-source software.
Datasets
To evaluate the performance of YORU, we prepared different datasets collected under various conditions. The datasets include Fly – wing extension, Ant – trophallaxis, Fly – group courtship, “Ant – group trophallaxis,” “Zebrafish – orientation,” “LED lighting – Small,” and “LED lighting – Large.” The Zebrafish – orientation dataset was created using zebrafish videos from a previous study (35), while the other datasets were generated from videos obtained in this study. Each dataset consisted of images (frames manually extracted from videos) and “behavioral object” labels. Each image was then manually labeled using LabelImg. Table S13 shows the detailed conditions of each dataset.
Fly – Wing extension
This dataset includes a pair of wild-type fruit flies consisting of a male and a female. Fruit flies (Drosophila melanogaster, Canton-S strain) were raised on standard yeast-based media on a 12-hour light/12-hour dark (12 h L/D) cycle. Both sexes of flies were collected within 8 hours after eclosion to ensure their virgin status. They were maintained at 25°C under a 12 h L/D cycle and transferred to new tubes every 2 to 4 days, except on the day of the experiment. Male flies were kept singly in a plastic tube (1.5 ml, Eppendorf) containing ∼200-μl fly food, while females were kept in groups of 10 to 30. Experiments were conducted using males and females 4 to 8 days posteclosion, with each individual used only once. The video recordings were performed between Zeitgeber time (ZT) = 1 to 11 at 25°C and 40 to 60% relative humidity.
Courtship behavior was monitored in a round courtship chamber with a sloped wall (20 mm top diameter, 12 mm bottom diameter, 4 mm height, and 6 mm radius fillet) made of transparent polylactic acid filament by 3D printer (Sermoon D1, Creality 3D Technology Co. Ltd.). The chamber was enclosed with a slide glass and a white acrylic plate as a lid and bottom, respectively. The chamber was illuminated from the bottom by an infrared LED light (ISL-150 × 150-II94-BT, 940 nm, CCS INC.) to enable recordings in dark conditions. Male and female flies were gently introduced into chambers by aspiration without anesthesia. Videos were captured from above, with a monochrome complementary metal-oxide semiconductor (CMOS) camera (DMK33UX273, The Imaging Source Asia Co. Ltd.) equipped with a 25–mm–focal length lens (TC2514-3MP, KenkoTokina Corporation) and a light-absorbing and infrared transmitting filter (IR-82, FUJIFILM), at a resolution of 640 × 480 pixels and 30 fps for 30 min for each pair using IC Capture (The Imaging Source Asia Co. Ltd.). In this dataset, we labeled two behavior object classes, fly and wing_extension, without a female or male identification. The Wing_extension was labeled on the fly when it extended one of its wings. The Fly indicates a fly not showing wing extension.
Fly – Group courtship
This dataset includes a group of wild-type fruit flies consisting of four males and four females. We prepared these flies under the same conditions described in the Fly – wing extension dataset. Behaviors were monitored in a fly bowl chamber with a sloped wall (60 mm in diameter, 3.5 mm in depth) (59), illuminated by a visible LED light to facilitate recordings. Four male and four female flies were gently introduced into the chamber by aspiration without anesthesia. Videos were captured from the top as described in the Fly – wing extension dataset. In this dataset, we labeled three behavior object classes: fly, wing_extension, and copulation without a female or male identification. The definitions of wing_extension and fly were based on those in the Fly – wing extension dataset. The Copulation was defined by genital coupling between a male and a female.
Ant – Trophallaxis
This dataset includes two worker ants. Workers of Camponotus japonicus were collected in October 2023 at the Higashiyama Campus of Nagoya University (35°09′15.3″ N, 136°58′15.8″ E). Ant species were identified with the Encyclopedia of Japanese Ant (60). After collection, they were kept individually in Falcon tubes with moistened paper for 24 hours at 22°C with 40 to 60% relative humidity under dark conditions. The video recordings were conducted at 22°C and 40 to 60% relative humidity.
Trophallaxis behavior was monitored in a custom-made chamber (25 mm width, 15 mm length, and 5 mm depth) made of transparent polylactic acid filament by 3D printer (Sermoon D1, Creality 3D Technology Co. Ltd.). The chamber was enclosed with a glass ceiling. Two worker ants were used for a single video recording: one individual was fed 1 M sucrose solution as much as desired immediately before observation, while the other was not fed anything. These two workers were transferred to a custom-made chamber. Videos were captured from the top, with a color CMOS camera (DFK33UP1300, The Imaging Source Asia Co. Ltd.) equipped with a 50–mm–focal length lens (MVL50M23, Thorlabs Inc.), at a resolution of 1280 × 960 pixels and 30 fps for 30 min for each pair using IC Capture (The Imaging Source Asia Co. Ltd.). In this dataset, we labeled two behavior object classes: trophallaxis and no. The trophallaxis class was defined as a situation when the heads of two individuals were in proximity and the palps were in contact with each other. The no class was defined when the heads of the two individuals were in proximity, but the palps were not in contact with each other.
Ant – Group trophallaxis
This dataset includes a group of ants consisting of six worker ants. Among six workers, three individuals were fed 1 M sucrose solution as much as desired immediately before observation, while the other three were not fed anything. These workers were then transferred to the polystyrene chamber (87 mm width, 57 mm length, and 19 mm depth) with a glass ceiling, and video recording was started. Videos were recorded from the top with a monochrome CMOS camera (FLIR GS3-U3-15S5; Edmund Optics) equipped with a zoom lens (M0814-MP2, CBC Optics Co. Ltd.). Video recordings and behavior object definitions were performed in the same manner as for the Ant – trophallaxis dataset.
Zebrafish – Orientation
The animal experiments in this study were approved by the Institutional Animal Experiment Committee and were conducted in accordance with the Regulations on Animal Experiments at the institute. This dataset includes a pair of wild-type zebrafish. Adult zebrafish, aged 4 to 6 months, with the Oregon AB genetic background were used without sex identification. Fish were maintained in a 14/10 h L/D cycle at 28.5°C. The experiments were conducted as described in previous studies (34, 35). Fish were isolated in individual tanks the day before the experiments, with white paper placed between the tanks to prevent visual contact. The experiments were conducted at ZT = 0 to 14.
Orientation behavior was monitored in two custom-made acrylic tanks (90 mm length, 180 mm width, and 60 mm height). These tanks were separated by a divider made from a polymer dispersed liquid crystal film (Sunice Film) attached to 2-mm-thick acrylic sheets. Fish were placed individually into water tanks with a depth of 57 mm for 20 min. Following this acclimation period, the fish were recorded for 5 min at 30 fps with the opaque divider condition (invisible condition). The divider was then made transparent to allow the fish to see each other, and the recording continued for an additional 5 min (visible condition). Videos were captured from below, with a monochrome CMOS camera (FLIR FL3-U3-13E4, Edmund Optics) at a resolution of 1280 × 1024 pixels and 30 fps. The divider condition (opaque or transparent) was controlled with a DAQ interface (USB-6008; National Instruments Co.) and custom-made software written in LabVIEW (National Instruments Co.). In this dataset, we labeled two behavior object classes: “orientation” and “no_orientation.” The orientation class was defined as a situation when two zebrafish showed orientation behavior as defined in a previous study (34). The no_orientation class was defined when two zebrafish showed no orientation behavior.
LED – ON or OFF
This dataset includes a blue LED (OSB5YU3Z74A, OptoSupply). We created LED lighting – Small and LED lighting – Large datasets; LED lighting – Small consists of 640 × 480 pixel videos, and LED lighting – Large consists of 1280 × 1024 pixel videos. Videos were captured with a color CMOS camera (DFK33UP1300, The Imaging Source Asia Co. Ltd.) equipped with a 50–mm–focal length lens (MVL50M23, Thorlabs Inc.). In each dataset, we created models based on each YOLOv5 model (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x). In these models, there were two classes, “ON” and “OFF,” indicating that LED turned on or off, respectively (Fig. 4A).
Segment LED – Straight or Polyline
This dataset includes a 7-segment LED (A-551SRD-NW-A B/W, PARA LIGHT ELECTRONICS CO. LTD.) (fig. S15A). The segment LED alternates between the straight and polyline patterns every 3 s (fig. S15A). Videos were captured with a color CMOS camera (DFK33UP1300, The Imaging Source Asia Co. Ltd.) equipped with a 25–mm–focal length lens (TC2514-3MP, KenkoTokina Corporation). In our dataset, we created a YORU model based on YOLOv5s network and a SLEAP model. In YORU models, there were two classes, Straight and Polyline, indicating the segment LED illumination patterns (fig. S15A). We extracted 300 images manually and labeled them (table S13). In SLEAP models, we extracted 200 images randomly from video clips and manually labeled the “keypoints” in each image (fig. S15A and table S13).
Model creation for YORU-based analysis
We created YORU models using the Training package. YORU randomly splits the datasets containing the images and labels into two datasets: 80% for training and 20% for validation. The models were trained with the training and validation datasets for 300 epochs. If the training loss was sufficiently low, training was stopped before 300 epochs. For comparison with human manual annotations, we created YORU models based on a YOLOv5s pretrained model, in which 2000 images were used. Human manual annotations were conducted using BORIS (61). For model evaluation using the Evaluation package, we extracted the specified number of images from each dataset and created each model based on YOLOv5 pretrained models. For the comparison using other datasets, we extracted the specified number of images from each dataset and created YORU models based on a YOLOv5s pretrained model (table S13).
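For illustration, the random 80/20 split of paired images and labels can be sketched in Python as follows (the file extension, directory layout, and random seed are our assumptions; YORU's Training package may implement this differently).

```python
import random
import shutil
from pathlib import Path

def split_dataset(image_dir, label_dir, out_dir, train_frac=0.8, seed=0):
    """Randomly split paired image/label files into training (80%) and validation (20%) sets.

    The ".png" extension and output layout are assumptions for this sketch.
    """
    images = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_frac)
    subsets = {"train": images[:n_train], "val": images[n_train:]}
    for subset, files in subsets.items():
        for img in files:
            label = Path(label_dir) / (img.stem + ".txt")  # YOLO-format label file
            for src, kind in ((img, "images"), (label, "labels")):
                dst = Path(out_dir) / kind / subset
                dst.mkdir(parents=True, exist_ok=True)
                shutil.copy(src, dst / src.name)

# Example: split_dataset("raw/images", "raw/labels", "dataset")
```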
Model creation for behavior classification with pose-estimation analysis
For two behavioral datasets (Fly – wing extension and Ant – trophallaxis dataset), we first trained pose-estimation models using SLEAP. The fly dataset comprised 211 manually labeled frames, while the ant dataset comprised 300 (table S13). To maximize model robustness, labels were drawn from both videos used for downstream behavioral classification and from independent hold-out videos. We then applied two complementary classification approaches. First, we performed unsupervised clustering of the tracked keypoints using both test and training videos from each dataset with Keypoint-MoSeq (https://github.com/dattalab/keypoint-moseq) (11) and evaluated the resulting clusters of test videos by comparing them against our human-annotated behavioral classes. Second, we carried out supervised classification with A-SOiD (https://github.com/YttriLab/A-SOID) (12) under a cross-validation scheme: The classifier was trained on a subset of the human annotations and its performance was assessed on the remaining annotations.
Model creation for other deep learning algorithm analysis
We created each model using custom Python code. The models were trained with the training and validation datasets (2000 images and labels in total) for 300 epochs. If the training loss was sufficiently low, training was stopped before 300 epochs. For comparison with other object detection algorithms, we analyzed test images that were not used for model creation and compared the results with ground-truth labels.
Comparison with YORU and human manual annotations in animal videos
To obtain human manual annotation data, we analyzed the behaviors of flies (10 videos, ~40 min in total), ants (three videos, ~90 min in total), and zebrafish (two videos, ~10 min in total), following the behavior definitions (table S13). Four parameters (accuracy, precision, recall, and F1 score) were used to compare the performance of YORU with that of human manual annotation. To obtain values for these parameters, each annotation was first classified as follows:
1) True positive (TP): Correct annotation.
2) False positive (FP): Incorrect annotation, such as annotating a nonexistent object or a misplaced annotation.
3) False negative (FN): Undetected ground-truth object.
4) True negative (TN): Correct no-annotation.
Subsequently, accuracy, precision, recall, and F1 score were calculated as follows: accuracy = (TP + TN)/(TP + TN + FP + FN), precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 score = 2 × precision × recall/(precision + recall).
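For reference, these four metrics can be computed from the annotation counts as in the following Python sketch (the function name and the zero-division guards are ours).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score from annotation counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "F1": f1}

# Example with hypothetical counts: classification_metrics(tp=90, fp=5, fn=10, tn=95)
```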
Evaluation of YORU models
We evaluated the models using YORU's Evaluation package. Several hundred images, which were not used in model creation, were manually labeled with ground-truth bounding boxes using LabelImg. YORU then predicted bounding boxes on the same images using the models. By comparing the ground-truth and predicted boxes, we calculated the precision and recall of the models. In addition, intersection over union (IoU) and average precision (AP), two typical object detection metrics, were used (38). IoU quantifies how closely a predicted box matches its ground-truth box, based on the Jaccard index, which evaluates the overlap between two bounding boxes; AP summarizes the detection accuracy of each behavior class by integrating precision over recall (38). Because AP depends on the IoU threshold, we calculated it at two thresholds: “AP@50” with a threshold of IoU = 50% and “AP@75” with a threshold of IoU = 75%. The AP ranges from 0 to 1, with a value of 1 indicating perfect consistency with the ground-truth labels. As in the PASCAL Visual Object Classes Challenge (62), model evaluation in YORU computes AP with the all-point interpolation method, approximating the area under the precision-recall curve by a Riemann sum, to compare the models.
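The IoU criterion can be illustrated with a short Python sketch (the corner-coordinate box format and the function name are our assumptions; YORU's Evaluation package may organize this differently).

```python
def iou(box_a, box_b):
    """Intersection over union (Jaccard index) of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction counts toward AP@50 if iou(predicted_box, ground_truth_box) >= 0.5,
# and toward AP@75 if the overlap reaches 0.75.
```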
Evaluation of YORU performance to detect multiple individuals
Wild-type flies (D. melanogaster, Canton-S strain) were reared on standard yeast-based media under a 12 h L/D cycle in single-sex groups for 4 to 8 days after eclosion. For female group behavior assays, 1 or 2 days before the experiment, female flies were divided into groups of 5, 10, 20, 30, 40, 50, or 60 individuals for video recordings. For group courtship assays, virgin male and female flies were maintained separately in groups of ~10. Videos were recorded as described for the Flies – group courtship dataset. We used the YOLOv5s-based model trained on the Flies – group courtship dataset to analyze the number of individuals detected in each video frame.
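A minimal Python sketch of the per-frame counting step is shown below (the weight path, class name, and confidence threshold are assumptions; it uses the public YOLOv5 PyTorch Hub interface rather than YORU's internal code).

```python
import torch

# Load a trained YOLOv5s model through the public PyTorch Hub interface
# (the weight file name and class name below are assumptions for this sketch).
model = torch.hub.load("ultralytics/yolov5", "custom", path="group_courtship_yolov5s.pt")

def count_individuals(frame, target_class="fly", conf_min=0.5):
    """Count detected individuals of one class in a single video frame."""
    det = model(frame).pandas().xyxy[0]   # detections as a pandas DataFrame
    det = det[(det["name"] == target_class) & (det["confidence"] >= conf_min)]
    return len(det)

# Example: count_individuals("frame_000123.png")
```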
Mice – Running in VR and recording neural activity
All mouse procedures followed institutional and national guidelines and were approved by the Animal Care and Use Committee of Nagoya University. All efforts were made to reduce the number of animals used and to minimize their suffering and pain. The “Mouse – treadmill” dataset consists of three videos of single-mouse behaviors in the treadmill environment. The videos were collected as described in a previous study (41). C57BL/6J mice were purchased from Nihon SLC. These animals were maintained in a temperature-controlled room (24°C) under a 12 h L/D cycle with ad libitum access to food and water. Because of the body size required for adaptation to head-fixed VR, mice aged 11 to 28 weeks were used in the VR experiments. The VR environment used in this study was based on that of the previous study (41). In brief, mice were head-fixed and free to run on a treadmill. We recorded mouse behavior with a monochrome CMOS camera (DMX33UX174, The Imaging Source) at 30 fps and 640 × 480 pixels resolution.
For the Mouse – treadmill dataset, we labeled 2171 frames with eight behavior object classes: Running, Stop, Whisker-On, Whisker-Off, Eye-Open, Eye-Closed, Grooming-On, and Grooming-Off (Fig. 3B). Eye-Open and Eye-Closed indicate whether the mouse's eye was open or closed. Grooming-On was defined as the situation where the forelegs of the mouse were touching its head, and Grooming-Off as all situations that did not satisfy this criterion. Running was defined as the situation where the treadmill roller was rotating, and Stop as the situation where the roller was stationary. Whisker-On was defined as the situation where the whiskers were oriented in the anterior direction, and Whisker-Off as all situations that did not satisfy this criterion. Labels were randomly split into two datasets: a model-creation dataset (1975 images) and a test dataset (196 images). Using the model-creation dataset, we generated a model based on the YOLOv5s pretrained model.
Surgery for the head plate implantation and virus injection were performed as previously described (41). In brief, AAV-PHP.eB-hSyn-jGCaMP7f was injected at 100 μl into the retro-orbital sinus of mice using a 30-gauge needle. Head plate implantation was performed 14 days after the virus injection. Skin and membrane tissue on the skull were carefully removed, and the surface was covered with clear dental cement (204610402CL, Sun Medical). A custom-made metal head plate was implanted onto the skull for stable head fixation. During surgery, the eyes were covered with ofloxacin ointment (0.3%) to prevent dry eye and unexpected injuries, and body temperature was maintained with a heating pad. All procedures of surgery were performed under deep anesthesia with a mixture of medetomidine hydrochloride (0.75 mg/kg; Nihon Zenyaku), midazolam (4 mg/kg; Sandoz), and butorphanol tartrate (5 mg/kg; Meiji Seika). After surgery, mice were injected with atipamezole hydrochloride solution (0.75 mg/kg; Meiji Seika) for rapid recovery from the effect of medetomidine hydrochloride.
The recording of cortex-wide neural activity was performed as previously described (41). In brief, we used a tandem lens design (a pair of Plan Apo 1×, WD = 61.5 mm, Leica). To filter out calcium-independent artifacts (63), alternating blue (M470L4, Thorlabs Inc.) and violet (M405L3, Thorlabs Inc.) excitation LEDs were used as light sources, combined with band-pass filters (86-352, Edmund Optics; FBH400-40, Thorlabs Inc., respectively). Fluorescent emission was passed through a dichroic mirror (FF495-Di03) and filters (FEL0500 and FESH0650, Thorlabs Inc.) to a scientific CMOS (sCMOS) camera (ORCA-Fusion, Hamamatsu Photonics). The illumination timing of the excitation light was controlled by a global exposure timing signal from the camera, processed with an FPGA-based logic circuit (Analog Discovery 2, Digilent) equipped with a binary-counter IC (TC4520BP, Toshiba).
Speed benchmarking – Inference speed
We measured the inference speed of the YOLOv5 detection using custom Python code based on YORU's detection function in the Real-time Process package. The list of computers used for this analysis is shown in table S14. In the custom Python code, the times immediately before and after the YOLO detection step were logged with the “perf_counter()” function of the “time” module, and the inference speed of one frame detection was calculated as the difference between these two time points. For the analyses of model size and frame size dependency, we used the 640 × 480 pixel videos for the LED lighting – Small models and the 1280 × 1024 pixel videos for the LED lighting – Large models. Fifty thousand frames (60 fps, ~14 min, ~420 events) were used to calculate the inference speed.
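The timing procedure corresponds to the following Python sketch (the weight file name and variable names are our assumptions; the model is loaded through the public YOLOv5 PyTorch Hub interface rather than YORU's internal code).

```python
import time
import torch

# Load an LED model through the public YOLOv5 PyTorch Hub interface
# (the weight file name is an assumption for this sketch).
model = torch.hub.load("ultralytics/yolov5", "custom", path="led_lighting_small_yolov5s.pt")

def timed_inference(frame):
    """Run one YOLO detection and return the detections with the per-frame inference time."""
    t_start = time.perf_counter()   # time immediately before the detection step
    results = model(frame)          # YOLO detection
    t_end = time.perf_counter()     # time immediately after the detection step
    return results, t_end - t_start
```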
Speed benchmarking – Real-time system
To estimate the latency of YORU's Real-time Process package, we measured the end-to-end latency of the LED light detection task by running YORU on a Windows desktop PC (CPU, Core i7-13700KF 16-core; GPU, NVIDIA RTX 4080; RAM, 32 GB or 16 GB DDR5). The CMOS camera (DFK 33UP1300, The Imaging Source Asia Co. Ltd. or ELP-USBFHD08S-MFV, Autocastle) equipped with a 50–mm–focal length lens (MVL50M23, Thorlabs Inc.) captured an LED and streamed frames. The frames were then processed to detect whether the LED state was on or off using the LED lighting – Small or LED lighting – Large model. When YORU detected the ON state, it sent a signal to a trigger controller, which then emitted a transistor-transistor logic (TTL) voltage pulse. As the trigger controller, a DAQ (USB-6008, National Instruments Co.) or a microcontroller (Arduino Uno, Arduino CC) was used. A recording DAQ (USB-6212, National Instruments Co.) logged the TTL voltage from the trigger controller (Fig. 4B). The voltage data were then processed with a 50-Hz Butterworth low-pass filter (implemented with the “scipy” package) to remove high-frequency noise. The delay between the onset of the LED voltage and that of the trigger TTL was used to estimate the full-system latency, which includes overhead from hardware communication and other software layers.
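The latency estimation from the logged voltage traces can be sketched as follows (the filter order, voltage threshold, and onset-detection rule are our assumptions and may differ from the analysis code actually used).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_latency(led_v, trigger_v, fs, cutoff=50.0, threshold=2.5):
    """Estimate full-system latency from logged LED and trigger TTL voltage traces.

    led_v, trigger_v: 1-D voltage arrays sampled at fs (Hz); threshold is in volts.
    """
    b, a = butter(4, cutoff, btype="low", fs=fs)   # 50-Hz Butterworth low-pass filter
    led_f = filtfilt(b, a, led_v)
    trig_f = filtfilt(b, a, trigger_v)
    led_on = np.argmax(led_f > threshold)          # first sample where the LED turns on
    trig_on = np.argmax(trig_f > threshold)        # first sample of the trigger TTL
    return (trig_on - led_on) / fs                 # latency in seconds
```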
Speed benchmarking – Comparison with SLEAP
To compare the onset latency between YORU's Real-time Process package and SLEAP's real-time analysis, we measured the onset latency of an LED light detection task under identical experimental conditions (CPU, Core i7-13700KF 16-core; GPU, NVIDIA RTX 4080; RAM, 32 GB DDR5). A color CMOS camera (DFK33UP1300, The Imaging Source Asia Co. Ltd.) equipped with a 25–mm–focal length lens (TC2514-3MP, KenkoTokina Corporation) captured a seven-segment LED (A-551SRD-NW-A B/W, PARA LIGHT ELECTRONICS CO. LTD.) and streamed frames. The frames were then processed to detect whether the segment LED displayed the straight or polyline pattern using the YORU or SLEAP model. When the model detected the polyline state, it sent a signal to a trigger controller, which then emitted a TTL voltage pulse. As the trigger controller, a microcontroller (Arduino Uno, Arduino CC) was used. To interface with SLEAP, we developed custom code using the Arduino Firmata protocol. A recording DAQ (USB-6008, National Instruments Co.) logged the TTL voltage from the trigger controller (fig. S15B). The delay between the TTL signal sent by the microcontroller (AE-ATMEGA-UNO-R3; AKIZUKI DENSHI TSUSHO CO. LTD.) driving the segment LED and the corresponding TTL output trigger was used to estimate the onset latency, which includes overhead introduced by hardware communication and additional software processing layers.
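One way to emit such a trigger through the Firmata protocol is with the pyFirmata package, as in the sketch below (the serial port, pin number, and pulse width are assumptions; the custom code used in the study may differ).

```python
import time
from pyfirmata import Arduino

board = Arduino("COM3")   # serial port is an assumption
TRIGGER_PIN = 13          # digital pin wired to the recording DAQ (assumption)

def send_ttl_pulse(width_s=0.01):
    """Emit a brief TTL-high pulse on the trigger pin through the Firmata protocol."""
    board.digital[TRIGGER_PIN].write(1)
    time.sleep(width_s)
    board.digital[TRIGGER_PIN].write(0)

# In the detection loop: if the model reports the polyline class, call send_ttl_pulse().
```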
Optogenetic assays – Fly preparation
D. melanogaster were raised as described above. Canton-S was used as a wild-type strain. UAS-GtACR1.d.EYFP (attP2) (49) (RRID: BDSC_32194) and split-GAL4 strain that labels pIP10 neurons specifically (w; VT040556-p65.AD; VT040347-GAL4.DBD) (47) (RRID: BDSC_87691) were obtained from the Bloomington Drosophila Stock Center. JO15-2-GAL4 (64) was a gift from D. F. Eberl (University of Iowa).
For optogenetic assays, pIP10 neuron–specific split-GAL4>GtACR1 male or JO15-2>GtACR1 female flies were paired with wild-type adult females or males, respectively, as mating partners. Flies used for the behavior assay were collected within 8 hours after eclosion to ensure their virgin status. Wild-type male, transgenic male, and female flies were kept singly in a plastic tube (1.5 ml, Eppendorf) containing ∼200 μl of fly food. Wild-type females were kept in groups of 10 to 30. They were transferred to new tubes every 2 to 3 days, but not on the day of the experiment. Males and females, 5 to 8 days after eclosion, were used for experiments and were used only once. All experiments were performed between ZT = 1 and 11 at 25°C and 40 to 60% relative humidity.
Optogenetic assays – Dissection and immunolabeling
Dissections of the male genitalia and female head were performed as described previously with minor modifications (17). In brief, male genitalia were dissected in phosphate-buffered saline (PBS; Takara Bio Inc., no. T900; pH 7.4 at 25°C), kept in 50% VECTASHIELD mounting medium (Vector Laboratories, no. H-1000; RRID: AB_2336789) in deionized water for ∼5 min, and mounted on glass slides (Matsunami Glass IND. LTD., Osaka, Japan) using VECTASHIELD mounting medium.
Immunolabeling of the brains and ventral nerve cord was performed as described previously with minor modifications (65). In brief, brains were dissected in PBS (pH 7.4 at 25°C), fixed with 4% paraformaldehyde for 60 to 90 min at 4°C, and subjected to antibody labeling. Brains were kept in 50% glycerol in PBS for ∼1 hour, 80% glycerol in deionized water for ∼30 min, and then mounted. Rabbit polyclonal anti-GFP (Invitrogen, no. A11122; RRID: AB_221569; 1:1000 dilution) was used for detecting the mCD8::GFP. Mouse anti-Bruchpilot nc82 (Developmental Studies Hybridoma Bank, no. nc82, RRID: AB_2314866; 1:20 dilution) was used to visualize neuropils in the brain. Secondary antibodies used in this study were as follows: Alexa Fluor 488–conjugated anti-rabbit IgG (Invitrogen, no. A11034; RRID: AB_2576217; 1:300 dilution) and Alexa Fluor 647–conjugated anti-mouse IgG (Invitrogen, no. A21236; RRID: AB_2535805; 1:300 dilution).
Optogenetic assays – Confocal microscopy and image processing
Serial optical sections were obtained at 0.84-μm intervals with a resolution of 512 × 512 pixels using an FV1200 laser-scanning confocal microscope (Olympus) equipped with a silicone oil immersion 30× lens (UPLSAPO30XSIR, numerical aperture = 1.05; Olympus). Images of neurons were registered to the Drosophila brain template (66) by using groupwise registration (67) with the Computational Morphometry Toolkit registration software. Brain registration, image size, contrast, and brightness were adjusted using Fiji software (version 2.14.0; RRID: SCR_002285).
Optogenetic assays – Retinal feeding
Transgenic male and female flies were maintained under a dark condition for 4 to 6 days after eclosion and then transferred to a plastic tube (1.5 ml, Eppendorf) containing ~200 μl of fly food. Plastic tubes containing males were divided into experimental and control groups. For the experimental group, 2 μl of all-trans-retinal (R2500, Sigma-Aldrich), 25 mg/ml dissolved in 99.5% ethanol (14033-80, KANTO KAGAKU), was placed on the food surface. Male and female flies were kept on the food for 1 and 2 days, respectively, before being used for the assays.
Optogenetic assays – A closed-loop system for event-triggered photostimulation
For the event-triggered optogenetic assay (Fig. 5), we used the YORU system on a desktop Windows PC (the same machine used in the speed benchmarking experiments). A green LED (M530L4, Thorlabs Inc.) was used as the light source. The cameras used to record the fly behavior and the photostimulation were, respectively, a monochrome CMOS camera (DMK33UX273, The Imaging Source Asia Co. Ltd.) equipped with a light-absorbing, infrared-transmitting filter (IR-82, FUJIFILM) and a color CMOS camera (DFK 33UP1300, The Imaging Source Asia Co. Ltd.). The monochrome and color CMOS cameras were equipped with a 25–mm–focal length lens (TC2514-3MP, KenkoTokina Corporation) and a zoom lens (MLM3X-MP, Computar), respectively. Fly behaviors were recorded at a resolution of 640 × 480 pixels and a frame rate of 100 fps for ~40 min for each fly pair using YORU. Photostimulation was recorded at a resolution of 1280 × 1080 pixels and 30 fps. The chamber was illuminated from the bottom by an infrared LED light (ISL-150 × 150-II94-BT, 940 nm, CCS INC.) to enable recordings in dark conditions. We used a round courtship chamber with a sloped wall (19 mm top diameter, 12 mm bottom diameter, 4 mm height, and 6 mm radius fillet) made of transparent polylactic acid filament and produced with a 3D printer (Sermoon D1, Creality 3D Technology Co. Ltd.). The chamber was enclosed with a slide glass as a lid and a white acrylic plate as the bottom.
For analyzing behaviors, a male and a female fly were introduced into the chamber by gentle aspiration without anesthesia. YORU captured the video stream and detected two behavior object classes. When YORU detected a wing_extension, a green light [light intensity (530 nm): 2.0 mW/cm2] illuminated the entire chamber. As a control for the light condition, we used event-triggered light with a 1-s delay to illuminate the entire chamber (Fig. 5C). The light intensity of photostimulation was calibrated with an optical power meter (PM100D, Thorlabs Inc.).
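The trigger logic, including the 1-s delayed control, can be summarized in the following sketch (the callback structure and names are ours and do not reflect YORU's internal API).

```python
import threading

def on_detection(detected_classes, light_on, delay_s=0.0):
    """Trigger photostimulation when a wing_extension object is detected.

    detected_classes: class names reported for the current frame
    light_on        : callable that switches the green light on (hardware-specific)
    delay_s         : 0 for the closed-loop condition, 1.0 for the delayed-light control
    """
    if "wing_extension" in detected_classes:
        if delay_s > 0:
            threading.Timer(delay_s, light_on).start()  # delayed control illumination
        else:
            light_on()

# Example: on_detection(["fly", "wing_extension"], light_on=lambda: print("LED ON"), delay_s=1.0)
```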
Optogenetic assays – A closed-loop system for individual-focused photostimulation
For the individual-focused optogenetic assay (Fig. 6), we used the YORU system on a desktop Windows PC (the same as for the speed benchmarking experiments). A projector (XGIMI Technology Co.) was used as the light source to stimulate specific individuals. The camera setup, chambers, and background illumination were configured in the same manner as in the event-triggered photostimulation experiments.
Before the experiments, the positions of the camera and projector planes were calibrated using a circle-grid pattern (five circles in height × eight circles in width). The “findHomography” function of the OpenCV package provided the homography matrix that maps positions from the camera plane to the projector plane.
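A minimal sketch of this calibration, together with the mapping applied to bounding-box centers in the next paragraph, is shown below (the point coordinates are dummy values; in practice they come from the detected circle-grid centers, and cv2.perspectiveTransform is our choice for applying the matrix).

```python
import cv2
import numpy as np

# Matched circle-grid centers in the camera and projector planes
# (dummy coordinates; in practice they come from the detected calibration pattern).
camera_pts = np.array([[100, 120], [620, 118], [105, 460], [615, 455]], dtype=np.float32)
projector_pts = np.array([[80, 90], [1200, 85], [85, 700], [1195, 690]], dtype=np.float32)

H, _ = cv2.findHomography(camera_pts, projector_pts)

def camera_to_projector(xy):
    """Map a point (e.g., a bounding-box center) from camera to projector coordinates."""
    src = np.array([[xy]], dtype=np.float32)      # shape (1, 1, 2), as OpenCV expects
    dst = cv2.perspectiveTransform(src, H)
    return dst[0, 0]

# Example: camera_to_projector((320, 240))
```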
For analyzing behaviors, a male and a female fly were introduced into the chamber by gentle aspiration without anesthesia and analyzed by YORU. When YORU detected a wing_extension, a circle-shaped illumination (~6.5 mm diameter with 1.14 mW/cm2 light intensity at 530 nm) was delivered to the individual that was detected as a fly. In this experimental setup, the primary factor affecting system latency is the processing speed of the projector (~0.1 s) rather than the latency of the model (~0.005 s). We therefore chose an illuminated area of ~6.5 mm diameter so that the fast-moving fly remained reliably illuminated despite the projector's illumination delay. The center coordinates of the predicted bounding boxes were transformed using the homography matrix. For the control condition, the projector displayed a patterned light consisting of alternating 3 s of green and 4 s of black, covering the entire chamber. This green/black time ratio was determined on the basis of the courtship duration of wild-type Drosophila calculated from courtship assay videos. The illumination pattern of the projector was controlled using a plugin (YORU-projector-plugin; https://github.com/Kamikouchi-lab/YORU-projector-plugin). Note that adjusting projector settings, such as light intensity or the size of the region of interest, requires minor modifications to the code. The light intensity of photostimulation was calibrated with an optical power meter (PM100D, Thorlabs Inc.).
Quantification and statistical analysis
Statistical analyses were conducted using Jupyter (Python version 3.9.12) and RStudio (R version 4.3.2). Aligned rank transform one-way analysis of variance (ART one-way ANOVA) was performed to compare the wing extension ratio between conditions in the event-triggered optogenetic assay. All statistical analyses were performed after verifying the equality of variance (Bartlett's test for three-group comparisons; F tests for two-group comparisons) and the normality of the values (Shapiro-Wilk test). Kaplan-Meier curves were generated using R, and pairwise log-rank tests were performed to compare females' cumulative copulation rates between conditions in the individual-focused optogenetic assay. After the ART one-way ANOVA or pairwise log-rank tests, P values were adjusted using the Benjamini-Hochberg method in the post hoc tests. For the ART one-way ANOVA, the ARTool package (version 0.11.1) was used (https://github.com/mjskay/ARTool/) (68, 69). For Kaplan-Meier curves and pairwise log-rank tests, the survival package (version 3.5.7) was used (https://github.com/therneau/survival). Statistical significance was set at P < 0.05. Boxplots were drawn using the R package ggplot2 (https://ggplot2.tidyverse.org/). Boxplots represent the median and interquartile range (the distance between the first and third quartiles), and whiskers denote 1.5× the interquartile range. The statistical methods and results used in this study are summarized in table S15.
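For reference, the Benjamini-Hochberg adjustment can also be reproduced in Python with the statsmodels package, as in the sketch below (hypothetical P values; the study itself used R as described above).

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.20]   # hypothetical post hoc P values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(dict(zip(p_values, p_adjusted)))
```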
AI-assisted technologies
During the development of YORU software and associated experimental systems, OpenAI’s ChatGPT (versions GPT-4.1, GPT-4.5, GPT-4o, o1, o3-mini, GPT-5, and GPT-5.2) was used to assist with coding, debugging, and implementation in Python. The tool was also used to support data management, data analysis, and data visualization for this study using Python and R. Prompts used when working with the tool included: “check the current code for errors and issues given the following structure of the data (e.g., a pandas DataFrame): [details of the code, data structure, and any error messages]. Give alternatives, explanations, and considerations,” “explain the following error message and how to resolve it: [details of the code and the error message],” “provide an alternative way to implement or calculate the following: [details of the desired outcome and the code used],” and “format the following code for readability and consistency (e.g., apply standard style conventions): [details of the code].” The tool was used to identify and explain errors, suggest corrections and alternative solutions, and provide recommendations to improve code readability. All outputs were reviewed and validated by the researchers.
OpenAI’s ChatGPT (versions GPT-4.1, GPT-4.5, GPT-4o, o1, o3-mini, GPT-5, and GPT-5.2) and Microsoft Copilot (https://copilot.microsoft.com/) were also used to improve the clarity and language of the manuscript. The authors provided ChatGPT with the prompt: “Suggest improvements to my writing for clarity and flow, while staying very close to my original style.” This prompt was applied to selected paragraphs. All scientific content, data analysis, and conclusions remain entirely the work of the authors. The AI-assisted proofreading was limited to linguistic refinements and did not alter the scientific content or conclusions.
Acknowledgments
We thank M. Yamanouchi for the software logo design; M. Hibi and S. Hosaka for collecting animal behavior videos; R. Nishimura (the Research Equipment Development Group, Technical Center of Nagoya University) for constructing the chambers used in behavioral experiments and for developing the VR environment for mice; Y. Suzuki for the experimental setup; M. P. Su, K. Noma, T. Nozaki, and H. Ando for discussions; K. Ito, D. F. Eberl, and the Bloomington Drosophila Stock Center (Indiana University, Bloomington, IN, USA) for fly stocks; and the Developmental Studies Hybridoma Bank for antibodies. The authors acknowledge the use of AI-assisted technologies; details are provided in the Materials and Methods.
Funding:
This work was supported by the MEXT KAKENHI Grant-in-Aid for Transformative Research Areas (A) “iPlasticity” JP23H04228 (A.K.), MEXT KAKENHI Grant-in-Aid for Transformative Research Areas (A) “Materia-Mind” JP24H02200 (A.K.), MEXT KAKENHI Grant-in-Aid for Transformative Research Areas (A) “Dynamic Brain” JP25H02496 (A.K.), Grant-in-Aid for Transformative Research Areas (A) “Hierarchical Bio-Navigation” JP22H05650 (R.T.), Grant-in-Aid for Transformative Research Areas (A) “Hierarchical Bio-Navigation” JP24H01433 (R.T.), Grant-in-Aid for Transformative Research Areas (B) JP21H05168 (F.O.), Grants-in-Aid for Scientific Research (C) JP23K05846 (R.T.), Grants-in-Aid for Scientific Research (C) JP23K05845 (T.S.), Grant-in-Aid for Early-Career Scientists JP20K16464 (R.F.T.), Grant-in-Aid for Early-Career Scientists JP21K15137 (R.T.), Grant-in-Aid for JSPS Fellows JP24KJ1290 (H.M.Y.), CREST from the Japan Science and Technology Agency JPMJCR1851 (F.O.), and JST FOREST JPMJFR2147 (A.K.).
Author contributions:
Conceptualization: H.M.Y., R.F.T., R.T., and A.K. Methodology: H.M.Y., R.F.T., N.C., K.H., F.O., and R.T. Software: H.M.Y., R.F.T., N.C., and K.H. Validation: H.M.Y., R.F.T., R.T., and A.K. Resources: T.S. and F.O. Investigation: H.M.Y., R.F.T., T.S., and R.T. Formal analysis: H.M.Y., R.F.T., and R.T. Visualization: H.M.Y. and R.F.T. Data curation: H.M.Y., R.F.T., R.T., and A.K. Supervision: H.M.Y., R.T., and A.K. Project administration: R.T. and A.K. Funding acquisition: H.M.Y., R.F.T., T.S., F.O., R.T., and A.K. Writing–original draft: H.M.Y., R.F.T., R.T., and A.K. Writing–review and editing: H.M.Y., R.F.T., F.O., R.T., and A.K.
Competing interests:
The authors declare that they have no competing interests.
Data and materials availability:
YORU software code has been deposited on GitHub and is publicly available at DOI: 10.5281/zenodo.17827855 and GitHub: https://github.com/Kamikouchi-lab/YORU under the AGPL-3.0 license. The YORU plugin code for projector-based individual-focused illumination in Fig. 6 is publicly available at DOI: 10.5281/zenodo.15544362 and GitHub: https://github.com/Kamikouchi-lab/YORU-projector-plugin under the MIT license. All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. All data and original code necessary to reproduce the figure panels and statistical analyses are available at DOI: 10.5281/zenodo.17241905 and GitHub: https://github.com/Kamikouchi-lab/YORU_data_analysis. Each dataset is available at DOI: 10.5281/zenodo.15605860. The pGP-AAV-syn-jGCaMP7f-WPRE plasmid was obtained from Addgene (plasmid no. 104488) and is governed by a material transfer agreement from the provider, Howard Hughes Medical Institute (https://janelia.org/project-team/genie/genie-reagents).
Supplementary Materials
The PDF file includes:
Supplementary Text
Figs. S1 to S16
Tables S2 to S6, S8, S9, S11, S12
Legends for tables S1, S7, S10, S13 to S15
Legends for movies S1 to S8
References
Other Supplementary Material for this manuscript includes the following:
Tables S1, S7, S10, S13 to S15
Movies S1 to S8
REFERENCES
1. Silk M. J., Finn K. R., Porter M. A., Pinter-Wollman N., Can multilayer networks advance animal behavior research? Trends Ecol. Evol. 33, 376–378 (2018).
2. Han J., Siegford J., de los Campos G., Tempelman R. J., Gondro C., Steibel J. P., Analysis of social interactions in group-housed animals using dyadic linear models. Appl. Anim. Behav. Sci. 256, 105747 (2022).
3. Batista G., Levine J. D., Silva A., Editorial: The neuroethology of social behavior. Front. Neural Circuits 16, 897273 (2022).
4. Bordes J., Miranda L., Müller-Myhsok B., Schmidt M. V., Advancing social behavioral neuroscience by integrating ethology and comparative psychology methods through machine learning. Neurosci. Biobehav. Rev. 151, 105243 (2023).
5. Mathis A., Mamidanna P., Cury K. M., Abe T., Murthy V. N., Mathis M. W., Bethge M., DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
6. Lauer J., Zhou M., Ye S., Menegas W., Schneider S., Nath T., Rahman M. M., Di Santo V., Soberanes D., Feng G., Murthy V. N., Lauder G., Dulac C., Mathis M. W., Mathis A., Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 19, 496–504 (2022).
7. Pereira T. D., Tabris N., Matsliah A., Turner D. M., Li J., Ravindranath S., Papadoyannis E. S., Normand E., Deutsch D. S., Wang Z. Y., McKenzie-Smith G. C., Mitelut C. C., Castro M. D., D’Uva J., Kislin M., Sanes D. H., Kocher S. D., Wang S. S. H., Falkner A. L., Shaevitz J. W., Murthy M., SLEAP: A deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495 (2022).
8. Bohnslav J. P., Wimalasena N. K., Clausing K. J., Dai Y. Y., Yarmolinsky D. A., Cruz T., Kashlan A. D., Chiappe M. E., Orefice L. L., Woolf C. J., Harvey C. D., DeepEthogram, a machine learning pipeline for supervised behavior classification from raw pixels. eLife 10, e63377 (2021).
9. Hu Y., Ferrario C. R., Maitland A. D., Ionides R. B., Ghimire A., Watson B., Iwasaki K., White H., Xi Y., Zhou J., Ye B., LabGym: Quantification of user-defined animal behaviors using learning-based holistic assessment. Cell Rep. Methods 3, 100415 (2023).
10. Hsu A. I., Yttri E. A., B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat. Commun. 12, 5188 (2021).
11. Weinreb C., Pearl J. E., Lin S., Osman M. A. M., Zhang L., Annapragada S., Conlin E., Hoffmann R., Makowska S., Gillis W. F., Jay M., Ye S., Mathis A., Mathis M. W., Pereira T., Linderman S. W., Datta S. R., Keypoint-MoSeq: Parsing behavior by linking point tracking to pose dynamics. Nat. Methods 21, 1329–1339 (2024).
12. Tillmann J. F., Hsu A. I., Schwarz M. K., Yttri E. A., A-SOiD, an active-learning platform for expert-guided, data-efficient discovery of behavior. Nat. Methods 21, 703–711 (2024).
13. Goodwin N. L., Choong J. J., Hwang S., Pitts K., Bloom L., Islam A., Zhang Y. Y., Szelenyi E. R., Tong X., Newman E. L., Miczek K., Wright H. R., McLaughlin R. J., Norville Z. C., Eshel N., Heshmati M., Nilsson S. R. O., Golden S. A., Simple Behavioral Analysis (SimBA) as a platform for explainable machine learning in behavioral neuroscience. Nat. Neurosci. 27, 1411–1424 (2024).
14. Bernstein J. G., Garrity P. A., Boyden E. S., Optogenetics and thermogenetics: Technologies for controlling the activity of targeted cells within intact neural circuits. Curr. Opin. Neurobiol. 22, 61–71 (2012).
15. Poth K. M., Texakalidis P., Boulis N. M., Chemogenetics: Beyond lesions and electrodes. Neurosurgery 89, 185–195 (2021).
16. Fenno L., Yizhar O., Deisseroth K., The development and application of optogenetics. Annu. Rev. Neurosci. 34, 389–412 (2011).
17. Yamanouchi H. M., Tanaka R., Kamikouchi A., Piezo-mediated mechanosensation contributes to stabilizing copulation posture and reproductive success in Drosophila males. iScience 26, 106617 (2023).
18. Kane G. A., Lopes G., Saunders J. L., Mathis A., Mathis M. W., Real-time, low-latency closed-loop feedback using markerless posture tracking. eLife 9, e61909 (2020).
19. Biderman D., Whiteway M. R., Hurwitz C., Greenspan N., Lee R. S., Vishnubhotla A., Warren R., Pedraja F., Noone D., Schartner M. M., Huntenburg J. M., Khanal A., Meijer G. T., Noel J.-P., Pan-Vazquez A., Socha K. Z., Urai A. E., Cunningham J. P., Sawtell N. B., Paninski L., Lightning pose: Improved animal pose estimation via semi-supervised learning, Bayesian ensembling and cloud-native open-source tools. Nat. Methods 21, 1316–1328 (2024).
20. Villella A., Hall J. C., Neurogenetics of courtship and mating in Drosophila. Adv. Genet. 62, 67–184 (2008).
21. Burns-Cusato M., Scordalakes E. M., Rissman E. F., Of mice and missing data: What we know (and need to learn) about male sexual behavior. Physiol. Behav. 83, 217–232 (2004).
22. McGill T. E., Sexual behavior in three inbred strains of mice. Behaviour 19, 341–350 (1962).
23. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. CVPR, doi: 10.1145/3243394.3243692 (2016).
24. G. Jocher, YOLOv5 by Ultralytics, version 7.0 (2020); 10.5281/zenodo.3908559.
25. Du J., Understanding of object detection based on CNN family and YOLO. J. Phys. Conf. Ser. 1004, 012029 (2018).
26. Chan A. H. H., Putra P., Schupp H., Köchling J., Straßheim J., Renner B., Schroeder J., Pearse W. D., Nakagawa S., Burke T., Griesser M., Meltzer A., Lubrano S., Kano F., YOLO-Behaviour: A simple, flexible framework to automatically quantify animal behaviours from videos. Methods Ecol. Evol. 16, 760–774 (2025).
27. Aziz Z. A., Abdulqader D. N., Sallow A. B., Omer H. K., Python parallel processing and multiprocessing: A review. Acad. J. Nawroz Univ. 10, 345–354 (2021).
28. Lopes G., Bonacchi N., Frazão J., Neto J. P., Atallah B. V., Soares S., Moreira L., Matias S., Itskov P. M., Correia P. A., Medina R. E., Calcaterra L., Dreosti E., Paton J. J., Kampff A. R., Bonsai: An event-based framework for processing and controlling data streams. Front. Neuroinform. 9, 7 (2015).
29. J. J. Sun, T. Karigo, D. Chakraborty, S. P. Mohanty, B. Wild, Q. Sun, C. Chen, D. J. Anderson, P. Perona, Y. Yue, A. Kennedy, The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions. arXiv:2104.02710 [Preprint] (2021); 10.48550/arXiv.2104.02710.
30. K. Amino, T. Matsuo, Automated Behavior Analysis Using a YOLO-Based Object Detection System (2022), vol. 181.
31. H. M. Yamanouchi, R. Tanaka, A. Kamikouchi, Event-triggered feedback system using YOLO for optogenetic manipulation of neural activity. 2023 IEEE Int. Conf. Pervasive Comput. Commun. Workshop Affil. Events PerCom Workshop BiRD 2023, 312–315 (2023).
32. Negroni M. A., LeBoeuf A. C., Metabolic division of labor in social insects. Curr. Opin. Insect Sci. 59, 101085 (2023).
33. B. Piqueret, P. d’Ettorre, “Communication in ant societies” in The Cambridge Handbook of Animal Cognition, A. B. Kaufman, J. C. Kaufman, J. Call, Eds. (Cambridge University Press, Cambridge, 2021; https://www.cambridge.org/core/books/cambridge-handbook-of-animal-cognition/communication-in-ant-societies/87925A559AC5D1D9C641F75FB1A992E7), Cambridge Handbooks in Psychology, pp. 36–55.
34. Stednitz S. J., McDermott E. M., Ncube D., Tallafuss A., Eisen J. S., Washbourne P., Forebrain control of behaviorally driven social orienting in zebrafish. Curr. Biol. 28, 2445–2451.e3 (2018).
35. Hosaka S., Hosokawa M., Hibi M., Shimizu T., The zebrafish cerebellar neural circuits are involved in orienting behavior. eNeuro 11, ENEURO.0141-24.2024 (2024).
36. Abril-de-Abreu R., Cruz J., Oliveira R. F., Social eavesdropping in zebrafish: Tuning of attention to social interactions. Sci. Rep. 5, 12678 (2015).
37. Mathis M. W., Mathis A., Deep learning tools for the measurement of animal behavior in neuroscience. Curr. Opin. Neurobiol. 60, 1–11 (2020).
38. Padilla R., Passos W. L., Dias T. L. B., Netto S. L., Da Silva E. A. B., A comparative analysis of object detection metrics with a companion open-source toolkit. Electronics 10, 279 (2021).
39. T. Mahendrakar, A. Ekblad, N. Fischer, R. T. White, M. Wilde, B. Kish, I. Silver, “Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets” in 2022 IEEE Aerospace Conference (AERO) (2022; http://arxiv.org/abs/2301.09056), pp. 1–12.
40. R. Khanam, M. Hussain, What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv:2407.20892 [Preprint] (2024); 10.48550/arXiv.2407.20892.
41. R. F. Takeuchi, A. Y. Sato, K. N. Ito, H. Yokoyama, R. Miyata, R. Ueda, K. Kitajima, R. Kamaguchi, T. Suzuki, K. Isobe, N. Honda, F. Osakada, Posteromedial cortical networks encode visuomotor prediction errors. bioRxiv [Preprint] (2024); 10.1101/2022.08.16.504075.
42. Petersen C. C. H., Sensorimotor processing in the rodent barrel cortex. Nat. Rev. Neurosci. 20, 533–546 (2019).
43. Warren J. A., Zhou S., Xu Y., Moeser M. J., MacMillan D. R., Council O., Kirchherr J., Sung J. M., Roan N. R., Adimora A. A., Joseph S., Kuruc J. D., Gay C. L., Margolis D. M., Archin N., Brumme Z. L., Swanstrom R., Goonetilleke N., The HIV-1 latent reservoir is largely sensitive to circulating T cells. eLife 9, e57246 (2020).
44. S. B. Vincent, The Function of the Vibrissae in the Behavior of the White Rat (Holt, 1912).
45. Sofroniew N. J., Cohen J. D., Lee A. K., Svoboda K., Natural whisker-guided behavior by head-fixed mice in tactile virtual reality. J. Neurosci. 34, 9537–9550 (2014).
46. Ishimoto H., Kamikouchi A., Molecular and neural mechanisms regulating sexual motivation of virgin female Drosophila. Cell. Mol. Life Sci. 78, 4805–4819 (2021).
47. Ding Y., Lillvis J. L., Cande J., Berman G. J., Arthur B. J., Long X., Xu M., Dickson B. J., Stern D. L., Neural evolution of context-dependent fly song. Curr. Biol. 29, 1089–1099.e7 (2019).
48. von Philipsborn A. C., Liu T., Yu J. Y., Masser C., Bidaye S. S., Dickson B. J., Neuronal control of Drosophila courtship song. Neuron 69, 509–522 (2011).
49. Mohammad F., Stewart J. C., Ott S., Chlebikova K., Chua J. Y., Koh T. W., Ho J., Claridge-Chang A., Optogenetic inhibition of behavior with anion channelrhodopsins. Nat. Methods 14, 271–274 (2017).
50. Lillvis J. L., Wang K., Shiozaki H. M., Xu M., Stern D. L., Dickson B. J., Nested neural circuits generate distinct acoustic signals during Drosophila courtship. Curr. Biol. 34, 808–824.e6 (2024).
51. Krakauer J. W., Ghazanfar A. A., Gomez-Marin A., MacIver M. A., Poeppel D., Neuroscience needs behavior: Correcting a reductionist bias. Neuron 93, 480–490 (2017).
52. Lopes G., Monteiro P., New open-source tools: Using Bonsai for behavioral tracking and closed-loop experiments. Front. Behav. Neurosci. 15, 647640 (2021).
53. Buccino A. P., Lepperød M. E., Dragly S.-A., Häfliger P., Fyhn M., Hafting T., Open source modules for tracking animal behavior and closed-loop stimulation based on Open Ephys and Bonsai. J. Neural Eng. 15, 055002 (2018).
54. Wu M.-C., Chu L.-A., Hsiao P.-Y., Lin Y.-Y., Chi C.-C., Liu T.-H., Fu C.-C., Chiang A.-S., Optogenetic control of selective neural activity in multiple freely moving Drosophila adults. Proc. Natl. Acad. Sci. U.S.A. 111, 5367–5372 (2014).
55. Casado-García A., del-Canto A., Sanz-Saez A., Pérez-López U., Bilbao-Kareaga A., Fritschi F. B., Miranda-Apodaca J., Muñoz-Rueda A., Sillero-Martínez A., Yoldi-Achalandabaso A., Lacuesta M., Heras J., LabelStoma: A tool for stomata detection based on the YOLO algorithm. Comput. Electron. Agric. 178, 105751 (2020).
56. Walter T., Couzin I. D., TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields. eLife 10, e64000 (2021).
57. Y. Watanabe, G. Narita, S. Tatsuno, T. Yuasa, K. Sumino, M. Ishikawa, “High-speed 8-bit image projector at 1,000 fps with 3 ms delay” (2015), vol. 3, pp. 1421–1422.
58. S. Kagami, K. Hashimoto, “Interactive Stickies: Low-latency projection mapping for dynamic interaction with projected images on a movable surface” in ACM SIGGRAPH 2020 Emerging Technologies (Association for Computing Machinery, New York, NY, USA, 2020; https://dl.acm.org/doi/10.1145/3388534.3407291), SIGGRAPH ‘20, pp. 1–2.
59. Simon J. C., Dickinson M. H., A new chamber for studying the behavior of Drosophila. PLOS ONE 5, e8793 (2010).
60. M. Terayama, S. Kubota, K. Eguchi, Encyclopedia of Japanese Ants (https://antcat.org/references/142683).
61. Friard O., Gamba M., BORIS: A free, versatile open-source event-logging software for video/audio coding and live observations. Methods Ecol. Evol. 7, 1325–1330 (2016).
62. Everingham M., Eslami S. M. A., Van Gool L., Williams C. K. I., Winn J., Zisserman A., The Pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 111, 98–136 (2015).
63. Allen W. E., Kauvar I. V., Chen M. Z., Richman E. B., Yang S. J., Chan K., Gradinaru V., Deverman B. E., Luo L., Deisseroth K., Global representations of goal-directed behavior in distinct cell types of mouse neocortex. Neuron 94, 891–907.e6 (2017).
64. Dolezal D. M., Joiner M. A., Eberl D. F., Two distinct functions of Lim1 in the Drosophila antenna. MicroPublication Biol. 2024, doi: 10.17912/micropub.biology.001229 (2024).
65. Yamada D., Ishimoto H., Li X., Kohashi T., Ishikawa Y., Kamikouchi A., GABAergic local interneurons shape female fruit fly response to mating songs. J. Neurosci. 38, 4329–4347 (2018).
66. Bogovic J. A., Otsuna H., Heinrich L., Ito M., Jeter J., Meissner G., Nern A., Colonell J., Malkesman O., Ito K., Saalfeld S., An unbiased template of the Drosophila brain and ventral nerve cord. PLOS ONE 15, e0236495 (2020).
67. Avants B., Gee J. C., Geodesic estimation for large deformation anatomical shape averaging and interpolation. NeuroImage 23, S139–S150 (2004).
68. J. O. Wobbrock, L. Findlater, D. Gergle, J. J. Higgins, “The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures” in Proc. SIGCHI Conference on Human Factors in Computing Systems (CHI 2011), pp. 143–146 (2011).
69. M. Kay, L. A. Elkin, J. J. Higgins, J. O. Wobbrock, ARTool: Aligned rank transform for nonparametric factorial ANOVAs, R package version 0.11.1 (2021); https://github.com/mjskay/ARTool; 10.5281/zenodo.594511.
70. Venkataramanan A. K., Facktor M., Gupta P., Bovik A. C., Assessing the impact of image quality on object-detection algorithms. Electron. Imaging 34, 334-1–334-6 (2022).
71. M. F. Tariq, M. A. Javed, Small Object Detection with YOLO: A performance analysis across model versions and hardware. arXiv:2504.09900 [Preprint] (2025); 10.48550/arXiv.2504.09900.