Full-BAPose architecture for whole-body multi-person pose estimation. The input color image is fed through the HRNet backbone for initial feature extraction. Feature sizes are given with the two spatial dimensions first and the channel dimension last, e.g., (128 × 128 × 32) denotes a 128 × 128 feature map with 32 channels. The D-WASP module combines the HRNet features, and a decoder using adaptive convolutions generates the detection bounding boxes and the keypoints for the hands, head, feet, and entire body, i.e., 133 keypoints and 4 bounding boxes for each person instance.
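
The size notation and per-person output counts in the caption can be illustrated with a minimal sketch. The names `FeatureSize`, `KEYPOINT_GROUPS`, and `BOXES_PER_PERSON` are hypothetical and do not come from the Full-BAPose code. The split of the 133 keypoints into groups is an assumption that follows the COCO-WholeBody annotation convention (17 body, 6 feet, 68 face, 42 hands); the caption itself only states the total.

```python
from typing import NamedTuple, Tuple


class FeatureSize(NamedTuple):
    """Feature size in the caption's order: two spatial dims, then channels."""
    spatial_a: int
    spatial_b: int
    channels: int

    def channels_first(self) -> Tuple[int, int, int]:
        # Reorder to the channels-first layout used by some frameworks.
        return (self.channels, self.spatial_a, self.spatial_b)

    def num_elements(self) -> int:
        # Total number of values in one feature map of this size.
        return self.spatial_a * self.spatial_b * self.channels


# ASSUMPTION: COCO-WholeBody-style grouping; only the total (133) is
# stated in the caption.
KEYPOINT_GROUPS = {"body": 17, "feet": 6, "head": 68, "hands": 42}

# Stated in the caption: 4 bounding boxes per person instance.
BOXES_PER_PERSON = 4

if __name__ == "__main__":
    size = FeatureSize(128, 128, 32)
    print("caption order:", tuple(size))
    print("channels-first:", size.channels_first())
    print("keypoints per person:", sum(KEYPOINT_GROUPS.values()))
```

Keeping the caption's channel-last order in one small type makes it harder to confuse with channels-first tensor shapes when reading sizes off the figure.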