Table 3: Ablation study of the proposed state representation, dynamic contextual belief (DCB).
The full state (DCB-full) consists of 1 history map, 1 saliency map, 54 stuff maps, 79 context maps, and 1 target map. In each ablation, we mask out one component by setting its map(s) to zero. See the supplementary material for full results.
| | Sequence Score ↑ | Scanpath Ratio ↑ | Prob. Mismatch ↓ |
|---|---|---|---|
| DCB-full | 0.422 | 0.803 | 1.029 |
| w/o history map | 0.419 | 0.800 | 1.042 |
| w/o saliency map | 0.419 | 0.795 | 1.029 |
| w/o stuff maps | 0.407 | 0.777 | 1.248 |
| w/o thing maps | 0.331 | 0.487 | 3.152 |
| w/o target map | 0.338 | 0.519 | 2.926 |
| DCB | 0.422 | 0.826 | 0.987 |
| CFI | 0.402 | 0.619 | 1.797 |
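
The masking used in the ablation can be illustrated with a minimal sketch. It assumes the state is stored as a channel-first NumPy array; the channel order, the group names, and the helper `mask_group` are hypothetical and chosen only for illustration. Only the per-group map counts come from the caption.

```python
import numpy as np

# Map counts per group are taken from the caption; the ORDER of the groups
# along the channel axis is an assumption made for this sketch.
GROUP_SIZES = [
    ("history", 1),
    ("saliency", 1),
    ("stuff", 54),
    ("context", 79),
    ("target", 1),
]


def build_slices(group_sizes):
    """Return a {group name: channel slice} dict and the total channel count."""
    slices, start = {}, 0
    for name, size in group_sizes:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


GROUP_SLICES, NUM_CHANNELS = build_slices(GROUP_SIZES)


def mask_group(state, group):
    """Return a copy of `state` (channels, height, width) with one group zeroed."""
    if state.shape[0] != NUM_CHANNELS:
        raise ValueError(f"expected {NUM_CHANNELS} channels, got {state.shape[0]}")
    masked = state.copy()             # leave the full state untouched
    masked[GROUP_SLICES[group]] = 0.0  # zero every map in the chosen group
    return masked


# Usage: a dummy state with an arbitrary 4x5 spatial size.
state = np.ones((NUM_CHANNELS, 4, 5))
masked = mask_group(state, "stuff")
```

Returning a copy keeps the full state available, so each ablation variant can be derived from the same input.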