Revisiting Text-Based Person Retrieval: Mitigating Annotation-Induced Mismatches with Multimodal Large Language Models

2026 Mar 4;26(5):1599. doi: 10.3390/s26051599

Algorithm 1 Step 2: MLLM-Based Caption Expansion and Merging

Require: Original test set $D$ with query texts $\{q_{j}\}$ and images.
Require: Error pairs $\{E^{(m)}\}_{m=1}^{M}$ from Step 1, where each $E^{(m)} = \{((q_{k}^{1}, I_{k}^{1}), (q_{k}^{2}, I_{k}^{2}))\}_{k=1}^{N_{i}^{(m)}}$.
Require: Multimodal LLM $M$ (for example, Qwen-VL).
Require: Prompt template described above.
Ensure: Refined test set $D'$ with expanded captions $\{q_{j}^{\mathrm{new}}\}$.
1: for $m = 1$ to $M$ do
2:   for $k = 1$ to $N_{i}^{(m)}$ do
3:     Extract the ground-truth pair $(q_{k}^{1}, I_{k}^{1})$ and the mismatched pair $(q_{k}^{2}, I_{k}^{2})$ from $E^{(m)}$.
4:     Instantiate the prompt $P_{k}^{(m)}$ by replacing <QUERY_TEXT> with $q_{k}^{1}$ and <GALLERY_TEXT> with $q_{k}^{2}$.
5:     Query the MLLM with images $I_{k}^{1}$ and $I_{k}^{2}$ and prompt $P_{k}^{(m)}$ to obtain the raw output string $\mathrm{raw\_out}_{k}^{(m)}$.
6:     Parse $\mathrm{raw\_out}_{k}^{(m)}$ using the marker tokens <EXP1_START> and <EXP1_END> to obtain the expansion for caption 1 (img1), denoted $\Delta q_{k}^{1}$.
7:     Apply text cleaning to obtain $\widehat{\Delta q_{k}^{1}} = \mathrm{CleanText}(\Delta q_{k}^{1})$.
8:     if $\widehat{\Delta q_{k}^{1}}$ is non-empty then
9:       Append the cleaned expansion to the current refined caption of the corresponding query:
10:      $q_{k,\mathrm{new}}^{1} \leftarrow q_{k,\mathrm{new}}^{1} + \widehat{\Delta q_{k}^{1}}$.
11:    end if
12:   end for
13: end for
14: Construct the refined test set $D'$ by replacing each original caption $q_{j}$ with its final expanded version $q_{j}^{\mathrm{new}}$.
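
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The `ErrorPair` container and its `query_id` field, the `query_mllm` callable (a stand-in for a model such as Qwen-VL), the whitespace-only behavior of `clean_text`, and the use of a single space for the "+" in step 10 are all hypothetical choices made for the example. The sketch also assumes that each refined caption is initialized to its original caption.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Marker tokens named in step 6 of Algorithm 1.
EXP1_PATTERN = re.compile(r"<EXP1_START>(.*?)<EXP1_END>", re.DOTALL)


@dataclass
class ErrorPair:
    """One element of E^(m). All field names are hypothetical."""
    query_id: str       # assumed key linking the pair to its query caption
    query_text: str     # q_k^1
    query_image: str    # I_k^1 (path or handle)
    gallery_text: str   # q_k^2
    gallery_image: str  # I_k^2 (path or handle)


def instantiate_prompt(template: str, pair: ErrorPair) -> str:
    """Step 4: fill the two placeholders of the prompt template."""
    return (template
            .replace("<QUERY_TEXT>", pair.query_text)
            .replace("<GALLERY_TEXT>", pair.gallery_text))


def parse_expansion(raw_out: str) -> str:
    """Step 6: return the text between the markers, or '' if they are absent."""
    match = EXP1_PATTERN.search(raw_out)
    return match.group(1) if match else ""


def clean_text(text: str) -> str:
    """Step 7 stand-in: collapse whitespace only (an assumption, since the
    exact CleanText rules are not given here)."""
    return " ".join(text.split())


def expand_captions(
    captions: Dict[str, str],
    error_rounds: List[List[ErrorPair]],
    query_mllm: Callable[[str, str, str], str],
    template: str,
) -> Dict[str, str]:
    """Steps 1-14: return refined captions; the input dict is not modified."""
    refined = dict(captions)  # assumed initialization: q_new = q
    for pairs in error_rounds:            # for m = 1..M
        for pair in pairs:                # for k = 1..N^(m)
            prompt = instantiate_prompt(template, pair)
            raw_out = query_mllm(pair.query_image, pair.gallery_image, prompt)
            expansion = clean_text(parse_expansion(raw_out))
            if expansion:                 # step 8: skip empty expansions
                refined[pair.query_id] = f"{refined[pair.query_id]} {expansion}"
    return refined
```

Passing the model as a callable keeps the sketch independent of any particular MLLM API. Because expansions accumulate in `refined`, a query that appears in several error pairs receives every non-empty expansion in order.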