2021 Nov 30;21(23):7982. doi: 10.3390/s21237982

Table 6.

Captions generated by the ablated models, where the colored words mark improvements over the previous caption.

Image: sensors-21-07982-i001.jpg
Baseline: A couple of women standing next to each other.
+self-att(Dec): A couple of women standing next to each other.
+self-att(Enc+Dec): Two women are holding wine glasses in a room.
Our PW: Two women standing next to each other holding wine glasses.
Our CW: Two women drinking wine in a room.

Image: sensors-21-07982-i002.jpg
Baseline: A group of people walking down a street.
+self-att(Dec): A group of people standing in the street.
+self-att(Enc+Dec): A group of people standing with an umbrella.
Our PW: A group of people standing in the street with an umbrella.
Our CW: A group of people standing under an umbrella.

Image: sensors-21-07982-i003.jpg
Baseline: A close up of a horse in a field.
+self-att(Dec): A horse standing in a field.
+self-att(Enc+Dec): A horse in the grass in a field.
Our PW: A white horse standing in the grass in a field.
Our CW: A white horse grazing in a field of grass.

Image: sensors-21-07982-i004.jpg
Baseline: A group of people on skis in the snow.
+self-att(Dec): A man riding skis in the snow.
+self-att(Enc+Dec): A group of people skiing down a snow covered slope.
Our PW: A group of people riding skis down a snow covered slope.
Our CW: Two men are skiing down a snow covered slope.