. 2026 Mar 4;26(5):1599. doi: 10.3390/s26051599
Algorithm 1 Step 2: MLLM-Based Caption Expansion and Merging
  • Require: 

    Original test set $D$ with query texts $\{q_j\}$ and images.

  • Require: 

    Error pairs $\{E^{(m)}\}_{m=1}^{M}$ from Step 1, where each $E^{(m)} = \{((q_k^{1}, I_k^{1}), (q_k^{2}, I_k^{2}))\}_{k=1}^{N_i^{(m)}}$.

  • Require: 

    Multimodal LLM $\mathcal{M}$ (for example, Qwen-VL).

  • Require: 

    Prompt template described above.

  • Ensure: 

    Refined test set $D$ with expanded captions $\{q_j^{\text{new}}\}$.

  •   1:

    for $m = 1$ to $M$ do

  •   2:

        for $k = 1$ to $N_i^{(m)}$ do

  •   3:

           Extract the ground-truth pair $(q_k^{1}, I_k^{1})$ and the mismatched pair $(q_k^{2}, I_k^{2})$ from $E^{(m)}$.

  •   4:

           Instantiate the prompt $P_k^{(m)}$ by replacing <QUERY_TEXT> with $q_k^{1}$ and <GALLERY_TEXT> with $q_k^{2}$.

  •   5:

           Query the MLLM with images $I_k^{1}$ and $I_k^{2}$ and prompt $P_k^{(m)}$ to obtain the raw output string $\text{raw\_out}_k^{(m)}$.

  •   6:

           Parse $\text{raw\_out}_k^{(m)}$ using the marker tokens <EXP1_START> and <EXP1_END> to obtain the expansion for caption 1 (img1), denoted as $\Delta q_k^{1}$.

  •   7:

           Apply text cleaning to obtain $\widehat{\Delta q}_k^{1} = \mathrm{CleanText}(\Delta q_k^{1})$.

  •   8:

           if $\widehat{\Delta q}_k^{1}$ is non-empty then

  •   9:

                Append the cleaned expansion to the current refined caption of the corresponding query:

  • 10:

                       $q_{k,\text{new}}^{1} \leftarrow q_{k,\text{new}}^{1} + \widehat{\Delta q}_k^{1}$.

  • 11:

           end if

  • 12:

        end for

  • 13:

    end for

  • 14:

    Construct the refined test set $D$ by replacing each original caption $q_j$ with its final expanded version $q_j^{\text{new}}$.
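
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `expand_captions`, `parse_expansion`, and `clean_text` are hypothetical, the MLLM is abstracted as any callable that takes a list of image paths and a prompt string, the cleaning step is assumed to be simple whitespace normalization, and the "+" in the merge step is assumed to mean concatenation with a single space. The prompt template is passed in by the caller, since only its two placeholders are given here.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (query id, caption text, image path).
Sample = Tuple[str, str, str]
# One error pair: (ground-truth sample, mismatched sample).
ErrorPair = Tuple[Sample, Sample]

# Marker tokens named in step 6 of the algorithm.
EXP1 = re.compile(r"<EXP1_START>(.*?)<EXP1_END>", re.DOTALL)


def parse_expansion(raw_out: str) -> str:
    """Return the text between the EXP1 markers, or '' if they are absent."""
    match = EXP1.search(raw_out)
    return match.group(1) if match else ""


def clean_text(text: str) -> str:
    """Assumed CleanText: collapse whitespace runs and trim both ends."""
    return " ".join(text.split())


def expand_captions(
    captions: Dict[str, str],
    error_sets: List[List[ErrorPair]],
    mllm: Callable[[List[str], str], str],
    template: str,
) -> Dict[str, str]:
    """Append cleaned MLLM expansions to the captions of mis-retrieved queries."""
    refined = dict(captions)  # start from the original captions
    for error_pairs in error_sets:  # m = 1..M
        for (qid, q1, img1), (_, q2, img2) in error_pairs:  # k = 1..N
            # Step 4: fill the two placeholders in the prompt template.
            prompt = template.replace("<QUERY_TEXT>", q1)
            prompt = prompt.replace("<GALLERY_TEXT>", q2)
            # Step 5: query the model with both images and the prompt.
            raw_out = mllm([img1, img2], prompt)
            # Steps 6-7: extract and clean the expansion for caption 1.
            delta = clean_text(parse_expansion(raw_out))
            # Steps 8-10: merge only non-empty expansions.
            if delta:
                refined[qid] = refined[qid] + " " + delta
    return refined
```

Because expansions accumulate in `refined`, a query that appears in several error pairs receives one appended expansion per pair, which mirrors the repeated update in step 10. Outputs that lack the marker tokens leave the caption unchanged.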