Abstract
Quantitative measurement of animal behavior is a cornerstone of neuroscience, genetics, and ethology. While modern computer vision has democratized automated analysis, the field has coalesced around pose estimation as the standard intermediate representation. This reliance imposes a significant bottleneck: researchers must often train custom pose models using large, labor-intensive datasets. Furthermore, the assumption that denser anatomical tracking yields better classification remains largely unverified. Here, we benchmark intermediate representations for supervised mouse behavior classification to determine the optimal trade-off between annotation cost and model performance. We systematically evaluate the sensitivity of classification to keypoint density, the impact of temporal feature engineering, and the viability of segmentation-derived shape descriptors as a low-cost alternative. We find that classifier performance is remarkably robust to keypoint variation; increasing keypoint density yields negligible gains, particularly when behavior training sets are sufficiently large. In contrast, augmenting models with temporal features (specifically FFT-based signal processing) consistently drives performance improvements. Crucially, we demonstrate that whole-body segmentation achieves performance parity with explicit pose estimation across most behaviors. These findings challenge the "more is better" intuition in pose tracking and suggest a paradigm shift: efficient pipelines should prioritize behavioral dataset volume and temporal dynamics over complex anatomical keypoints.
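To make the notion of FFT-based temporal features concrete, the following is a minimal sketch of one way such features could be computed from a single per-frame signal (for example, one keypoint coordinate or one shape descriptor). It is an illustration under stated assumptions, not the pipeline evaluated in this work: the function name `fft_window_features`, the window length, the Hann taper, the reflect padding, and the pooling into four frequency bands are all hypothetical choices.

```python
import numpy as np


def fft_window_features(signal, window=32, n_bins=4):
    """Per-frame spectral band powers of a 1-D per-frame signal.

    Hypothetical helper: window length, taper, and band pooling are
    illustrative assumptions, not parameters taken from the paper.
    Assumes len(signal) >= window.
    """
    signal = np.asarray(signal, dtype=float)
    half = window // 2
    # Reflect-pad so that every frame has a full window around it.
    padded = np.pad(signal, (half, window - half - 1), mode="reflect")
    # One row per frame, shape (n_frames, window).
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    # Remove each window's mean so the constant offset does not dominate.
    windows = windows - windows.mean(axis=1, keepdims=True)
    # Taper the window edges, then take the power spectrum.
    power = np.abs(np.fft.rfft(windows * np.hanning(window), axis=1)) ** 2
    # Drop the DC bin and pool the remaining bins into n_bins bands.
    bands = np.array_split(power[:, 1:], n_bins, axis=1)
    return np.stack([band.mean(axis=1) for band in bands], axis=1)


# Usage: a slow and a fast oscillation concentrate power in different bands.
frames = np.arange(200)
slow = fft_window_features(np.sin(2 * np.pi * frames / 32))
fast = fft_window_features(np.sin(2 * np.pi * 6 * frames / 32))
```

In a setup like this, the resulting band powers would be concatenated with the per-frame pose or segmentation features before being passed to the behavior classifier, giving the classifier access to how each signal changes over time rather than only its instantaneous value.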
