2021 Nov 30;21(23):7982. doi: 10.3390/s21237982

Table 6.

Captions generated by the ablated models. Colored words mark improvements over the previous caption.

Image	Captions
	Baseline: A couple of women standing next to each other.
	+self-att(Dec): A couple of women standing next to each other.
	+self-att(Enc+Dec): Two women are holding wine glasses in a room.
	Our PW: Two women standing next to each other holding wine glasses.
	Our CW: Two women drinking wine in a room.

	Baseline: A group of people walking down a street.
	+self-att(Dec): A group of people standing in the street.
	+self-att(Enc+Dec): A group of people standing with an umbrella.
	Our PW: A group of people standing in the street with an umbrella.
	Our CW: A group of people standing under an umbrella.

	Baseline: A close up of a horse in a field.
	+self-att(Dec): A horse standing in a field.
	+self-att(Enc+Dec): A horse in the grass in a field.
	Our PW: A white horse standing in the grass in a field.
	Our CW: A white horse grazing in a field of grass.

	Baseline: A group of people on skis in the snow.
	+self-att(Dec): A man riding skis in the snow.
	+self-att(Enc+Dec): A group of people skiing down a snow covered slope.
	Our PW: A group of people riding skis down a snow covered slope.
	Our CW: Two men are skiing down a snow covered slope.