Table 1. Quantitative comparison with other state-of-the-art 2D and 3D animal and human pose estimation methods.
Protocol 1 (absolute 3D MPJPE, mm). Columns give the fraction of the training set used.

| Method | 5% | 10% | 50% | 100% |
|---|---|---|---|---|
| **2D pose estimation methods (+ post hoc triangulation)** | | | | |
| DLC† (Mathis et al., 2018) | 11.0973 | 11.0512 | 9.8934 | 8.9060 |
| SimpleBaseline† (Xiao, Wu, & Wei, 2018) | 18.0990 | 14.6191 | 7.3636 | 5.9555 |
| SimpleBaseline | 18.5675 | 16.5800 | 8.3573 | 6.6957 |
| DLC + soft argmax | 11.0323 | 9.2244 | 6.3545 | 6.4739 |
| DLC + 2D variant of our temporal constraint* | 8.5432 | 9.1236 | 5.9526 | 6.0390 |
| **3D monocular pose estimation methods** | | | | |
| Temporal Convolution* (Pavllo et al., 2019) | - | - | - | 17.6337 |
| **3D multi-view pose estimation methods** | | | | |
| Learnable Triangulation† (Iskakov et al., 2019) | 18.7795 | 15.6614 | 8.9729 | 6.3177 |
| DANNCE (Dunn et al., 2021) | 12.8754 | 10.9085 | 4.9912 | 4.3614 |
| Ours (temporal baseline)* | 12.4940 | 7.1162 | 4.8347 | 4.3749 |
| Ours (temporal + extra)* | 8.1706 | 6.6927 | 5.0461 | 4.1409 |
† marks methods that use ground-truth 2D bounding boxes during inference.
* marks methods that use temporal information during training.
For the monocular approach, the reported metric is computed separately for each camera view and then averaged across all views.
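To make the reported metric concrete, the sketch below shows one way MPJPE could be computed, including the per-view averaging described in the footnote for the monocular case. This is an illustration of the metric only, not the authors' evaluation code; the function names and array shapes are assumptions.

```python
import numpy as np

def mpjpe_mm(pred, gt):
    """Mean per-joint position error in mm.

    pred, gt: arrays of shape (num_frames, num_joints, 3), coordinates in mm.
    Returns the Euclidean joint error averaged over joints and frames.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def monocular_mpjpe_mm(preds_per_view, gt):
    """Illustrative per-view evaluation for a monocular method.

    preds_per_view: hypothetical list of (num_frames, num_joints, 3) arrays,
    one prediction per camera view; the per-view MPJPEs are averaged.
    """
    return float(np.mean([mpjpe_mm(p, gt) for p in preds_per_view]))
```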