|
Algorithm 1 Step 2: MLLM-Based Caption Expansion and Merging |
-
Require:
Original test set with query texts and images.
-
Require:
Error pairs from Step 1, each consisting of a ground-truth pair and a mismatched pair.
-
Require:
Multimodal LLM (for example, Qwen-VL).
-
Require:
Prompt template described above.
-
Ensure:
Refined test set with expanded captions.
-
1:
for each index from 1 to $M$ do
-
2:
for each error pair in the current set do
-
3:
Extract the ground-truth pair and the mismatched pair from the current error pair.
-
4:
Instantiate the prompt by replacing <QUERY_TEXT> with the query text and <GALLERY_TEXT> with the gallery text.
-
5:
Query the MLLM with the two images of these pairs and the instantiated prompt to obtain the raw output string.
-
6:
Parse the raw output using the marker tokens <EXP1_START> and <EXP1_END> to obtain the expansion for caption 1 (img1).
-
7:
Apply text cleaning (CleanText) to the parsed expansion to obtain the cleaned expansion.
-
8:
if the cleaned expansion is non-empty then
-
9:
Append the cleaned expansion to the current refined caption of the corresponding query:
-
10:
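$\text{caption}_{\text{refined}} \leftarrow \text{caption}_{\text{refined}} \oplus \text{expansion}_{\text{clean}}$, where $\oplus$ denotes string concatenation.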
-
11:
end if
-
12:
end for
-
13:
end for
-
14:
Construct the refined test set by replacing each original caption with its final expanded version.
|
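The loop body of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `ErrorPair` fields, the `mllm` callable (a stand-in for a model such as Qwen-VL), the whitespace-only `clean_text`, the single-space separator used when appending, and the choice to build the prompt from the original (not yet expanded) caption are all assumptions made for illustration. Only the placeholder names `<QUERY_TEXT>`, `<GALLERY_TEXT>` and the marker tokens `<EXP1_START>`, `<EXP1_END>` come from the algorithm itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Marker tokens named in Algorithm 1; DOTALL lets the expansion span several lines.
EXP1_PATTERN = re.compile(r"<EXP1_START>(.*?)<EXP1_END>", re.DOTALL)


@dataclass
class ErrorPair:
    """Hypothetical container for one error pair from Step 1."""
    query_id: str        # key of the query caption in the test set
    query_image: str     # image of the ground-truth pair (img1)
    gallery_text: str    # caption of the mismatched pair
    gallery_image: str   # image of the mismatched pair (img2)


def instantiate_prompt(template: str, query_text: str, gallery_text: str) -> str:
    """Fill the two placeholders of the prompt template."""
    return (template.replace("<QUERY_TEXT>", query_text)
                    .replace("<GALLERY_TEXT>", gallery_text))


def parse_expansion(raw_output: str) -> str:
    """Return the text between the EXP1 markers, or '' if they are missing."""
    match = EXP1_PATTERN.search(raw_output)
    return match.group(1) if match else ""


def clean_text(text: str) -> str:
    """Assumed cleaning step: collapse all whitespace runs and trim the ends."""
    return " ".join(text.split())


def expand_captions(
    captions: Dict[str, str],
    error_pairs: Iterable[ErrorPair],
    template: str,
    mllm: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Append cleaned MLLM expansions to the captions of the affected queries."""
    refined = dict(captions)  # copy so the original test set stays untouched
    for pair in error_pairs:
        prompt = instantiate_prompt(template, captions[pair.query_id], pair.gallery_text)
        raw = mllm(pair.query_image, pair.gallery_image, prompt)
        expansion = clean_text(parse_expansion(raw))
        if expansion:  # skip empty or unparseable outputs
            refined[pair.query_id] = f"{refined[pair.query_id]} {expansion}"
    return refined
```

Passing the model as a plain callable keeps the sketch independent of any particular MLLM API. A query that appears in several error pairs accumulates one expansion per pair, which mirrors the append step of the algorithm.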