DeepLabCut
DeepLabCut (Mathis et al., 2018a; Mathis et al., 2018b) uses ResNet (He et al., 2016), a popular deep learning architecture, pre-trained on ImageNet (Russakovsky et al., 2015), a massive object-recognition dataset. Through a process called transfer learning, a DeepLabCut model learns keypoint positions from as few as 200 labeled frames. Because the pre-trained ResNet supplies a strong backbone of image filters, the model is robust and flexible with respect to which body parts (or objects) users want to label. To detect keypoint positions, DeepLabCut replaces the classification layer of the ResNet with deconvolutional layers that produce spatial probability densities; the model learns to assign high probabilities to regions containing the user-labeled keypoints. DeepLabCut can provide very accurate pose estimates but can require extensive training time.
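The training target described above can be made concrete with a toy sketch: build a Gaussian confidence map centered on a labeled keypoint (the kind of spatial probability density the deconvolutional head is trained to regress) and recover the keypoint as the argmax of the map. This is a minimal illustration with hypothetical function names, not DeepLabCut's actual implementation:

```python
import numpy as np

def confidence_map(height, width, keypoint_xy, sigma=2.0):
    """Build a 2-D Gaussian score map centered on a labeled keypoint.

    Keypoint-detection networks of this kind are trained to regress maps
    like this: values near 1 at the keypoint, decaying toward 0 elsewhere.
    """
    xs = np.arange(width)
    ys = np.arange(height)
    xx, yy = np.meshgrid(xs, ys)       # (height, width) coordinate grids
    kx, ky = keypoint_xy
    d2 = (xx - kx) ** 2 + (yy - ky) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def decode_keypoint(score_map):
    """Recover an (x, y) estimate as the argmax of the predicted map."""
    y, x = np.unravel_index(np.argmax(score_map), score_map.shape)
    return int(x), int(y)

# Round-trip: the decoded location matches the labeled one.
target = confidence_map(64, 64, keypoint_xy=(40, 12))
print(decode_keypoint(target))  # -> (40, 12)
```

At inference time, the network predicts such maps for unseen frames, and the argmax (often with sub-pixel refinement) gives the estimated keypoint location.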
SLEAP
SLEAP (Pereira et al., 2022) is based on an earlier method called LEAP (Pereira et al., 2019), which performed pose estimation on single animals. SLEAP uses simpler CNN architectures with repeated convolutional and pooling layers. This makes the model more lightweight than DeepLabCut's ResNet architecture and hence faster to train, with comparable accuracy. Like DeepLabCut, the model uses a stack of upsampling (deconvolutional) layers to estimate confidence maps during training and inference. Unlike DeepLabCut, SLEAP does not rely solely on transfer learning from general-purpose network models (though this option is also provided for flexible experimentation); instead, it uses customizable neural network architectures that can be tuned to the needs of the dataset. SLEAP can produce highly accurate pose estimates starting at about 100 labeled training frames and is quick to train on a GPU (<1 hour).
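The repeated convolution-and-pooling encoder followed by upsampling layers can be sketched, shapes only, in a few lines of numpy. This is a toy illustration of the resolution flow with hypothetical helper names; a real SLEAP model interleaves learned convolutional filters at every stage:

```python
import numpy as np

def max_pool2x(x):
    """2x2 max pooling: halves each spatial dimension."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbor upsampling: doubles each spatial dimension."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(64, 64)            # one image channel
enc = max_pool2x(max_pool2x(x))       # encoder: 64x64 -> 16x16
dec = upsample2x(upsample2x(enc))     # decoder: 16x16 -> 64x64
print(enc.shape, dec.shape)           # -> (16, 16) (64, 64)
```

The decoder restores the input resolution so that the final confidence maps can be read out per pixel, which is what makes argmax-style keypoint decoding possible.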
DeepPoseKit
DeepPoseKit (Graving et al., 2019a; Graving et al., 2019b) uses a CNN architecture called stacked DenseNet, an efficient variant of the stacked hourglass (Newell et al., 2016). It applies multiple down- and upsampling steps with densely connected hourglass networks to produce confidence maps over the input image. The model uses only about 5% of the parameters of DeepLabCut, providing speed improvements over both DeepLabCut and LEAP.
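A rough sense of where such parameter savings come from: a convolutional layer holds in_channels x out_channels x k^2 weights (plus biases), so narrow layers that reuse (concatenate) earlier feature maps, as in densely connected blocks, need far fewer weights than a uniformly wide stack. The channel widths below are hypothetical, chosen only to show the arithmetic, not taken from either model:

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Parameters of one conv layer: weights (in*out*k*k) plus biases (out)."""
    return in_ch * out_ch * kernel * kernel + out_ch

# Hypothetical widths: a uniformly wide plain stack vs. a narrow stack whose
# layers concatenate all earlier features (dense connectivity, growth rate 32).
wide = sum(conv_params(256, 256) for _ in range(4))
growth = 32
narrow = sum(conv_params(64 + i * growth, growth) for i in range(4))
print(wide, narrow)  # the narrow, feature-reusing stack needs far fewer weights
```

With these example widths the densely connected stack uses well under a tenth of the wide stack's parameters, which is the kind of saving that makes the architecture faster to train and run.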
B-KinD
B-KinD (Sun et al., 2021a; Sun et al., 2021b) discovers keypoints without human supervision. B-KinD has the potential to transform how pose estimation is done, since manual keypoint labeling is one of the most time-consuming steps of a pose estimation workflow. However, the approach struggles when occlusions occur in the video recordings, e.g., recordings of animals tethered to brain recording systems.