Algorithm 1 Prompt-level multimodal deception detection (zero-shot GPT-5).

Require: Dataset of videos with ground-truth labels
Require: Ablation flags: UseVideo, UseTranscript, UseEmotion
 1: Initialize metrics containers
 2: for each sample do
 3:     I ← ∅; T ← ∅; E ← ∅        ▹ per-sample frames, transcript, emotion string
 4:     if UseVideo then
 5:         I ← extract 16 uniformly spaced frames
 6:     end if
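Step 5's uniform sampling reduces to a pure index computation; a minimal sketch (the decoding backend that turns indices into frames, e.g. OpenCV, is assumed and not specified by the algorithm):

```python
def uniform_frame_indices(total_frames: int, k: int = 16) -> list[int]:
    """Return k frame indices spread uniformly across [0, total_frames - 1]."""
    if total_frames <= 0 or k <= 0:
        return []
    if k == 1:
        return [0]
    step = (total_frames - 1) / (k - 1)  # spacing so the first and last frames are included
    return [round(i * step) for i in range(k)]
```

For a 300-frame clip this selects index 0 through 299 inclusive; for clips shorter than k frames, indices simply repeat.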
 7:     if UseTranscript then
 8:         Extract audio; obtain ASR transcript T (Whisper-1)
 9:     end if
10:     if UseEmotion then
11:         Extract audio; compute emotion label e and confidence c (SpeechBrain wav2vec2)
12:         E ← "Detected emotion: e (c)"
13:     end if
14:     Build user prompt: include E (if any), T (if any), and an instruction to return strictly JSON
15:     Attach frames I (if any) as images to the same message
16:     System message: safety + research framing
17:
Query GPT-5 with deterministic decoding
-
18:
Parse first valid JSON object: {label, confidence, reasoning}
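Step 18's "first valid JSON object" can be recovered with `json.JSONDecoder.raw_decode`, scanning from each `{`; a minimal sketch assuming the model may wrap the object in prose or code fences:

```python
import json

def first_json_object(text: str):
    """Return the first parseable JSON object embedded in text, else None."""
    dec = json.JSONDecoder()
    i = text.find("{")
    while i != -1:
        try:
            obj, _ = dec.raw_decode(text, i)  # parses a value starting at index i
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        i = text.find("{", i + 1)
    return None
```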
19:     label                              ▹ final class: lie or truth
20:     confidence                         ▹ used only for analysis/threshold sweeps
21:     Store and compare with ground truth y
22: end for
23: Compute Accuracy, Precision/Recall/F1 per class, Macro-F1, MCC, Cohen's κ
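All step-23 metrics derive from the confusion counts. A minimal sketch for the binary case, making the formulas explicit (scikit-learn provides the same quantities; the `positive="lie"` convention is an assumption):

```python
import math

def binary_metrics(y_true, y_pred, positive="lie"):
    """Accuracy, macro-F1, MCC, and Cohen's kappa from two label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    n = len(y_true)
    acc = (tp + tn) / n

    def f1(tp_, fp_, fn_):
        prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    macro_f1 = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2  # negative class swaps fp/fn roles
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (acc - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": acc, "macro_f1": macro_f1, "mcc": mcc, "kappa": kappa}
```

A degenerate predictor that always outputs one class scores 0 on both MCC and κ even at 50% accuracy, which is why the algorithm reports them alongside accuracy.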