2022 Oct 3;18(1):117–125. doi: 10.1007/s11548-022-02761-6

Fig. 2.

The baseline generates a heatmap, $\hat{H}_t$, for each detection using a pose estimation network. Our model provides additional information by incorporating a heatmap prior from $t-\delta$. The image features at $t$ are concatenated with $\hat{H}_{t-\delta}$ and passed through our attention mechanism to produce a weighted heatmap prior. $\hat{H}_t$ and the weighted prior are then concatenated and passed through the fusing module, which uses context from both heatmaps to produce the final articulated hand pose. The initial and final heatmaps shown are real outputs from the network, while the heatmap prior (during training) shows the ground truth at $t-\delta$.
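The data flow in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the caption does not specify the layers, so the per-pixel sigmoid gate standing in for the attention mechanism, the per-pixel weighted sum standing in for the fusing module, and the names `conv1x1`, `attend_prior`, `fuse`, `attn_weights`, and `fuse_weights` are all assumptions made for illustration. Heatmaps and feature channels are plain nested lists of equal height and width.

```python
import math


def conv1x1(channels, weights, bias=0.0):
    """Per-pixel weighted sum over channels (a 1x1 convolution, one output channel)."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [sum(w * ch[y][x] for w, ch in zip(weights, channels)) + bias
         for x in range(width)]
        for y in range(height)
    ]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def attend_prior(features, prior, attn_weights):
    """Weight the heatmap prior from t-delta with a per-pixel gate.

    Assumption: the gate is a sigmoid over a 1x1 convolution of the
    concatenated [image features at t, heatmap prior]. The source only
    says that an attention mechanism is applied to this concatenation.
    """
    stacked = features + [prior]  # channel-wise concatenation
    logits = conv1x1(stacked, attn_weights)
    return [
        [sigmoid(logit) * p for logit, p in zip(logit_row, prior_row)]
        for logit_row, prior_row in zip(logits, prior)
    ]


def fuse(current, weighted_prior, fuse_weights):
    """Combine the current heatmap and the weighted prior.

    Assumption: the fusing module is reduced here to a 1x1 convolution
    over the two concatenated heatmaps.
    """
    return conv1x1([current, weighted_prior], fuse_weights)


# Toy 2x2 example for a single joint (all values are made up).
features_t = [[[0.1, 0.9], [0.2, 0.0]], [[0.0, 0.5], [0.3, 0.1]]]
heatmap_t = [[0.0, 0.8], [0.2, 0.0]]      # baseline output at t
prior_t_delta = [[0.0, 1.0], [0.0, 0.0]]  # heatmap prior from t-delta

weighted_prior = attend_prior(features_t, prior_t_delta, [1.0, 1.0, 1.0])
final_heatmap = fuse(heatmap_t, weighted_prior, [0.5, 0.5])
```

In a trained network the weights would be learned and the operations would run over batched tensors for every joint; the sketch only shows the order of concatenation, weighting, and fusion described above.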