Author manuscript; available in PMC: 2021 Jun 22.

Published in final edited form as: Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2020 Aug 5;2020:190–199. doi: 10.1109/cvpr42600.2020.00027

Table 3: Ablation study of the proposed state representation, dynamic contextual belief (DCB).

The full state (DCB-full) consists of 1 history map, 1 saliency map, 54 stuff maps, 79 context maps, and 1 target map. In each ablation, we mask out one component by setting its map(s) to zero. See the supplementary material for full results.

	Sequence Score ↑	Scanpath Ratio ↑	Prob. Mismatch ↓
DCB-full	0.422	0.803	1.029
w/o history map	0.419	0.800	1.042
w/o saliency map	0.419	0.795	1.029
w/o stuff maps	0.407	0.777	1.248
w/o thing maps	0.331	0.487	3.152
w/o target map	0.338	0.519	2.926
DCB	0.422	0.826	0.987
CFI	0.402	0.619	1.797
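
The masking procedure described in the caption can be illustrated with a minimal sketch. It assumes the state is stored as a channel-first array whose channels are grouped in the order listed in the caption. Only the group sizes come from the caption: the ordering, the 20×32 spatial size, and the names `GROUP_SIZES`, `channel_slices`, and `mask_group` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

# Group sizes are taken from the caption; the channel ordering is an assumption.
GROUP_SIZES = {"history": 1, "saliency": 1, "stuff": 54, "context": 79, "target": 1}


def channel_slices(group_sizes):
    """Map each group name to the slice of channels it occupies."""
    slices, start = {}, 0
    for name, size in group_sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices


def mask_group(state, group, group_sizes=GROUP_SIZES):
    """Return a copy of `state` (C, H, W) with one group's maps set to zero."""
    masked = state.copy()  # leave the full state untouched
    masked[channel_slices(group_sizes)[group]] = 0.0
    return masked


# Placeholder state: all ones, with an arbitrary (assumed) spatial size.
state = np.ones((sum(GROUP_SIZES.values()), 20, 32), dtype=np.float32)
ablated = mask_group(state, "stuff")
```

Zeroing a group rather than removing its channels keeps the input shape fixed, so the same model can be evaluated on every ablated variant without architectural changes.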