Author manuscript; available in PMC: 2025 Aug 16.
Published in final edited form as: AI Neurosci. 2025 Feb 21;1(1):54–59. doi: 10.1089/ains.2024.0008

DeepLabCut to Automate Behavioral Analysis of Parkinsonism

Nabeel Rangoonwala 1, Khoi Le 1, Vaibhavi Peshattiwar 1, Caroline Swain 1, Dipesh Pokharel 1, Tatiana White 1, Thyagarajan Subramanian 1,*, Kala Venkiteswaran 1
PMCID: PMC12356275  NIHMSID: NIHMS2101874  PMID: 40822935

Abstract

Background:

Behavioral assessment of parkinsonism often relies on human rater evaluation. However, human biases and rater variability necessitate larger sample sizes to maintain validity, leading to extensive video analysis that consumes researchers’ time. Recent artificial intelligence (AI) and machine learning (ML) advancements enable efficient data analysis, offering unbiased decision-making and consistency across scenarios, bridging inter-rater differences. While not fully automating these tasks, AI/ML boosts productivity when properly trained with diverse data. This study aims to show that AI/ML can assist in the analysis of rat parkinsonian behavioral studies, reducing dependence on manual labor while maintaining accuracy.

Methods:

DeepLabCut (DLC), an animal pose estimation software, was used to analyze motor behavior in video recordings of parkinsonian Sprague Dawley rats while they performed the stepping test (n = 24). The stepping test involves observing the animal’s locomotor function and motor coordination while it is guided across a flat surface. The number of adjusting steps was counted over the 1-meter distance. Twenty-eight videos (n = 24 + 4 training videos) were fed into DLC, which selected 20 frames per video using a k-nearest neighbors algorithm; these frames were then labeled to train the model. This one-time training process took 3 h. The output, which contains the tracked coordinates of the forepaw being tested, was fed into an R script to plot Δy between consecutive frames. Positive peaks were each counted as one step, and large negative peaks were counted as a reset or side switch. The counts for each video were then compared with those of an independent manual rater.

Results:

There was good absolute agreement between the two scoring methods, using the two-way random effects model, kappa = 0.9, p < 0.0001. Manual scoring takes 10–15 min per video, whereas DLC-assisted scoring took 3–4 min per video. These results show that DLC-assisted scoring can be on par with manual scoring. In addition, this demonstrates a feasible avenue for integrating AI/ML into parkinsonian behavioral studies to reduce the analysis workload and, eventually, fully automate such tasks.

Keywords: artificial intelligence, DeepLabCut, Parkinson’s, animal behavior, machine learning

Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder that primarily affects movement and is accompanied by the gradual loss of dopamine-producing neurons in the brain. PD is characterized by resting tremor, muscle rigidity, bradykinesia (slowness of movement), and postural instability. Although the exact cause of PD remains unknown, it is believed to involve a combination of genetic and environmental factors. Beyond motor symptoms, PD can also cause non-motor issues such as sleep disturbances, mood disorders, and cognitive impairment.

To find the cause and a potential cure for PD, several preclinical animal models are used. The laboratory rat is particularly suitable, as rat models of PD exhibit bradykinesia in many behaviors and have been used for many decades to test the preclinical efficacy and safety of many PD treatments in current use. Rat models of PD are also heavily used in experimental therapeutic studies to discover new PD therapies. Behavioral assessments play a pivotal role in this process as they allow us to see how the rats are doing and how potential treatments are working. Traditionally, such assessments have been conducted manually by human raters, which is not only time-consuming but also subject to inter-rater variability and biases. Large sample sizes are often required to ensure statistical validity, further exacerbating the workload for researchers. Therefore, there is a pressing need for objective and efficient methods to analyze parkinsonian behavior in preclinical studies.

Artificial intelligence (AI) and machine learning (ML) are growing technologies that are improving different fields and industries. DeepLabCut (DLC) has emerged as a powerful tool for tracking animal movements in video recordings with high precision and accuracy. DLC is an open-source, markerless, pose estimation software that utilizes deep learning algorithms.1,2 DLC can identify and track specific body parts of animals, such as limbs, the tail, or the head, in a variety of behavioral paradigms and outputs the marked coordinates over time. This technology offers several advantages over traditional manual scoring methods, including unbiased decision-making, consistency across scenarios, and considerable time savings. DeepLabCut has already shown itself to be better than some commercial solutions in accuracy and time savings with tests such as the elevated plus maze test and the open field test.3

Methods

In this study, DLC was employed to analyze motor behavior in parkinsonian Sprague Dawley rats performing the stepping test, a commonly used paradigm to assess bradykinesia using locomotor function and motor coordination.4 In this test, the rat is held so that both hind limbs and one forepaw are raised just off the surface of a wide table. The animal is then moved laterally across the surface of the table for 90 cm in a way such that it must bear weight on its remaining forepaw. The number of adjusting steps made by the weight-bearing forepaw is counted. Each rat goes through six total trials, three trials on each forelimb. The video was taken from a slightly elevated angle from the side, so the rat was guided toward the camera for each pass. Twenty-eight different videos were used in this process, with each video containing three trials for a total of 84 trials.

In addition to the stepping test, we used the cylinder test, scoring rears rather than paw touches. The rats were placed in a glass cylinder and recorded from a side view so that the full height of the cylinder was visible. The nose was tracked in each video to count the number of times the rat reared in a 5-min period. A total of 50 trials were used for this preliminary study.

Video recordings of these experiments were taken and put into DLC. Twenty frames were extracted from each video using k-nearest neighbors. The body part of interest was marked in each of the frames. The training dataset was created with a 50-layer residual network (ResNet-50)5 architecture, and the model was trained for 200,000 iterations. All 20 frames from each of the 28 videos, for a total of 560 frames, were manually labeled. Training the model is a one-time endeavor that takes about 3–4 h, including labeling the frames and training the network on them. Next, the full videos were fed in to be analyzed. This process takes about 1–2 min per video and outputs a CSV file and a labeled video for each input video. The CSV files were then input into a custom R script to count the adjusting steps for the stepping test and the number of rears for the cylinder test.

To count the adjusting steps for the stepping test from the CSV file, the difference in the y-coordinate was calculated between adjacent frames (Δy). A large positive jump or consecutive smaller positive jumps in the Δy were counted as a step, while a large negative jump or consecutive smaller negative jumps were counted as a reset or side switch. Once the step counts were all retrieved, the DLC-assisted counts were compared to the counts of a person manually watching each video and counting the steps. The accuracy and the intraclass correlation coefficient (two-way random model for multiple raters) were used to compare the absolute agreement between the DLC-assisted method and the manual scoring method.

To find the number of rears for the cylinder test from the CSV file, two thresholds were created in R, one higher up for the rearing threshold and one lower down for the reset threshold. Since a rear requires the rat to come all the way down, a rear was counted as the rat’s nose crossing above the rearing threshold followed by the rat’s nose crossing below the reset threshold anytime later. Every time this cycle is completed, a rear is counted.

To compare the DLC-assisted method with the manual method, percent accuracy and the intraclass correlation coefficient (ICC) were calculated between the two sets. Percent accuracy was calculated using the following formula:

Percent Accuracy = 100 − 100 × (|DLC Method Value − Manual Method Value| / Manual Method Value)

The percent accuracy was calculated for each individual trial and then averaged for each behavioral test. The ICC value was found using an R script, and the ICC value for fixed raters was used.
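As a worked illustration, the per-trial formula and averaging step can be sketched in Python (the authors used R; the absolute value on the deviation is an assumption consistent with averaging per-trial accuracies):

```python
def percent_accuracy(dlc_counts, manual_counts):
    """Average per-trial percent accuracy:
    100 - 100 * |DLC - Manual| / Manual, averaged over trials."""
    per_trial = [
        100 - 100 * abs(d - m) / m
        for d, m in zip(dlc_counts, manual_counts)
    ]
    return sum(per_trial) / len(per_trial)
```

For example, DLC counts of 9 and 11 against manual counts of 10 and 10 each give 90% per trial, so the averaged accuracy is 90%.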

Results

Figure 1 shows the means and error bars for each scoring method on the stepping test. The means of the DLC-assisted method and the manual scoring were 14.357 steps and 14.595 steps, respectively; the corresponding standard deviations were 2.467 and 2.508. Figure 2 shows the difference between the adjusting steps counted by each method. The blue line is the per-trial difference, while the red line is the average difference over the 84 trials. The average difference is −0.238 steps. The DLC-assisted analysis demonstrated a 98.37% accuracy rate compared with manual scoring by human raters. Moreover, there was good absolute agreement between the DLC-assisted method and manual scoring, as evidenced by the high intraclass correlation coefficient (kappa = 0.9, p < 0.0001). Importantly, DLC-assisted scoring reduced the time required for analysis from 10–15 min per video with manual scoring to 3–4 min per video.

Fig. 1.


Average steps for each scoring method for all 84 trials. DLC-Assisted mean is 14.357 steps, and Manual mean is 14.595 steps.

Fig. 2.


Difference between the two scoring methods. Average difference is shown in red and has a value of −0.238.

Figure 3 shows the means and error bars for each scoring method on the cylinder test. The means of the DLC-assisted method and the manual scoring were 15.62 rears and 15.78 rears, respectively. Figure 4 shows the difference between the rears counted by each method. The blue line is the per-trial difference, while the red line is the average difference over the 50 trials. The average difference is −0.16 rears. The DLC-assisted analysis demonstrated an 87% accuracy rate compared with manual scoring by human raters. Moreover, there was good absolute agreement between the DLC-assisted method and manual scoring, as evidenced by the high intraclass correlation coefficient (kappa = 0.98, p < 0.0001). Importantly, DLC-assisted scoring reduced the time required for analysis from 25–30 min per video with manual scoring to 7–8 min per video.

Fig. 3.


Average rears for each scoring method for all 50 trials. DLC-assisted mean is 15.62 rears, and Manual mean is 15.78 rears.

Fig. 4.


Difference between the two scoring methods. Average difference is shown in red and has a value of −0.16.

Discussion

The results from Figures 1–4 highlight the high reliability and efficiency of DLC-assisted scoring methods in both the stepping and cylinder tests, compared with traditional manual scoring. In the stepping test (Figs. 1 and 2), the mean number of steps scored by the DLC-assisted method was 14.357, while manual scoring produced a mean of 14.595 steps. The difference between the two methods was not statistically significant (p = 0.536), indicating that both methods perform similarly in measuring stepping behavior. The average difference in steps was −0.238, showing minimal deviation across the 84 trials. The intraclass correlation coefficient (kappa = 0.9, p < 0.0001) further supports the high agreement between the methods, demonstrating that DLC-assisted scoring is a reliable tool for step counting. In addition, DLC-assisted scoring achieved an accuracy rate of 98.37% compared with manual scoring, while also significantly reducing the time required for analysis, from 10–15 min to just 3–4 min per video.

Similarly, in the cylinder test (Figs. 3 and 4), the DLC-assisted method yielded a mean of 15.62 rears, closely matching the manual method’s mean of 15.78 rears. The average difference between the methods was −0.16 rears across 50 trials, which is minimal. Furthermore, the intraclass correlation coefficient for this test was even higher (kappa = 0.98, p < 0.0001), highlighting a near-perfect agreement between the two approaches. The DLC-assisted method also achieved an accuracy rate of 87% compared with manual scoring. Most notably, the time-saving benefits of the DLC-assisted method were evident, reducing analysis time from 25–30 min per video to just 7–8 min.

Taken together, the results from both the stepping and cylinder tests underscore the efficiency and reliability of DLC-assisted scoring. In both tests, the DLC-assisted method closely mirrors manual scoring in accuracy, with minimal differences in the number of steps or rears counted. The high intraclass correlation coefficients across both tests further emphasize the strong agreement between the two methods. Importantly, the DLC-assisted method not only maintains a high level of accuracy but also offers a substantial reduction in the time required for scoring, making it a valuable tool for large-scale studies where time efficiency is crucial. These findings suggest that DLC-assisted scoring is a practical and effective alternative to manual scoring, allowing researchers to streamline their behavioral analyses without compromising data quality.

One of the limitations of the DLC method is its reduced effectiveness in rare or unusual circumstances that deviate from typical experimental conditions. In such cases, the method may struggle to maintain accuracy, as it is optimized for more standard setups. Another challenge arises when DLC attempts to distinguish between body parts with similar appearances, which can lead to misidentification and compromise the accuracy of the analysis. This issue becomes particularly pronounced in scenarios where the subject’s movements are complex, or the body parts are positioned closely together. Furthermore, a technical limitation is that DLC only accepts video files in the .mp4 format, and even minor differences, such as capitalizing the file extension to .MP4, can cause complications. While renaming the extension to lowercase may seem like a simple fix, it has led to degraded video quality and tracking errors in some cases, forcing researchers to discard those videos. This highlights the sensitivity of the software to formatting and video quality, which can be a potential source of data loss if not carefully managed.

Conclusion

The integration of DLC-assisted analysis offers a promising approach to overcome the limitations of manual scoring in parkinsonian behavioral studies. By providing objective and efficient data analysis, DLC facilitates accurate assessment while reducing the workload for researchers. DLC not only enhances the reliability of preclinical research findings but also opens avenues for exploring novel therapeutic interventions for PD. Moving forward, continued advancements in AI/ML technologies hold the potential to revolutionize the field of behavioral neuroscience and accelerate the pace of discovery in neurodegenerative disorders. In addition, other motor behaviors that are currently assessed by manual video rating, such as drug-induced dyskinesias, sleep disorders, seizures, tardive dyskinesia, dystonia, chorea, and tics, could be amenable to adaptation of our techniques to automate these tasks. Motor, sensory, cognitive, and behavioral assessments in animal models of spinal cord injury, stroke, traumatic brain injury, cognitive disorders, neuropathy, pain syndromes, and substance use disorders may likewise be amenable to adoption of our techniques, reducing the cost, time, and effort of such studies and potentially accelerating the discovery of new or more effective treatments for these diseases, as predicted in earlier reviews.6

Funding Information

Funding for this work was provided in part by NINDS R01 NS104565, NIDDK R01 DK124098, DoD NETP 13204752, and the Anne M. and Phillip H. Glatfelter, III Family Foundation.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

1. Mathis A, Mamidanna P, Cury KM, et al. DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci 2018;21(9):1281–1289; doi: 10.1038/s41593-018-0209-y
2. Nath T, Mathis A, Chen AC, et al. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat Protoc 2019;14(7):2152–2176; doi: 10.1038/s41596-019-0176-0
3. Sturman O, von Ziegler L, Schläppi C, et al. Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacol 2020;45:1942–1952; doi: 10.1038/s41386-020-0776-y
4. Anselmi L, Bove C, Coleman FH, et al. Ingestion of subthreshold doses of environmental toxins induces ascending Parkinsonism in the rat. NPJ Parkinsons Dis 2018;4:30; doi: 10.1038/s41531-018-0066-0
5. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016; pp. 770–778; doi: 10.1109/CVPR.2016.90
6. Liljequist D, Elfving B, Skavberg Roaldsen K. Intraclass correlation—A discussion and demonstration of basic features. PLoS One 2019;14(7):e0219854; doi: 10.1371/journal.pone.0219854
