[Preprint]. 2025 Aug 26:2025.08.22.25334232. [Version 1] doi: 10.1101/2025.08.22.25334232

Table 2.

Clinical performance of agent versus baseline

| Study | AI Agent Base-LLM Model(s)* | Primary Outcome | Agent vs Baseline* | Baseline comparison | External validation | Synthetic Data |
|---|---|---|---|---|---|---|
| Chen et al. | GPT-4 | Accuracy | +14.3% | Base-LLM | No | Yes |
| Goodell et al. | GPT-4o, Llama-3 | Accuracy | +59.1% | Base-LLM | Yes | Yes |
| Omar et al. | GPT-4o, Claude, Gemini-1.5-Pro | Accuracy | +13.8% | Base-LLM | No | No |
| Gorenshtein 2025a | GPT-4o-mini | Accuracy | +3.5% | Single LLM (Sonar-pro) | No | Yes |
| Gorenshtein 2025b | Gemini-1.5-Pro-002 | Accuracy | +29.6% | Base-LLM | Yes (physician review) | No |
| Gorenshtein 2025c | GPT-4o | Quality score | −25.5% | Physicians | Yes (comparison to physicians) | No |
| Ferber et al. | GPT-4 | Accuracy | +56.9% | Base-LLM | Yes (physician review) | Mixed |
| Qu et al. | GPT-4o, CRISPR-Llama3-8B (fine-tuned Llama3-8B) | Design success | +49% | Base-LLM | Yes (in vitro validation) | Mixed |
| Pickard et al. | GPT-4 | Enrichment similarity score | +61.7% | Base-LLM | Yes | No |
| Woo et al. | GPT-4, Claude-3, Llama-3-70B, GPT-3.5, Mixtral-8×7B | Accuracy | +36% | Base-LLM | Yes (physician review) | Yes |
| Swanson et al. | GPT-4o | Binding improvement | 2/92 improved | Baseline wild-type nanobodies (no agent design) | Yes (in vitro validation) | Mixed |
| Mejia et al. | GPT-4o, GPT-4o-mini, Qwen-2.5-32B, DeepSeek-R1 | Custom score | N/A | Base-LLM | No | No |
| Wang 2025a | GPT-4 | ROUGE-L | +7.1% | Base-LLM | No | No |
| Yang 2025 | Llama-3 | None | N/A | None | No | Yes |
| Altermatt et al. | GPT-4o | Accuracy | +4.1% | Base-LLM + chain of thought | No | Yes |
| Xu et al. | GPT-4-Turbo, GPT-3.5-Turbo, Mixtral-8×22B, Llama-3 8B, Llama-3 70B, Qwen 2.5-max, Claude-3-Opus | Accuracy | N/A | None | Yes | Yes |
| Wang 2025b | GPT-4o | PTV D95 | +4.75% | ECHO autoplanner | Yes | Yes |
| Ke et al. | GPT-4-Turbo | Accuracy | +76% (Base-LLM), 28% (Physician) | Base-LLM, Physicians | Yes (comparison to physicians) | Yes |
| Yang 2024 | GPT-3.5-Turbo | F1 score | +53% | Base-LLM | Yes | Yes |
| Low et al. | ChatRWD | Accuracy | +48% | ChatGPT-4 | No | Yes |
* Unless described otherwise, the baseline comparison is to the Base-LLM, using the best-performing agentic LLM backbone (e.g., ChatGPT-4 Agentic compared with ChatGPT-4). Differences are reported in percentage points.
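
The footnote's convention can be illustrated with a minimal sketch. It assumes that "best-performing backbone" means the backbone whose agentic configuration scored highest, and that the reported value is the plain difference between the agent and baseline percentages. The function name, backbone names, and scores below are hypothetical and are not taken from any study in the table.

```python
def agent_vs_baseline(scores):
    """Return (best_backbone, difference in percentage points).

    `scores` maps a backbone name to (agent_score_pct, baseline_score_pct).
    This layout is an assumption made for illustration.
    """
    # Assumed reading of the footnote: pick the backbone whose
    # agentic configuration achieved the highest score.
    best = max(scores, key=lambda name: scores[name][0])
    agent_pct, base_pct = scores[best]
    # Percentage-point difference: subtract the two percentages directly.
    # This is not a relative (percent) change.
    return best, round(agent_pct - base_pct, 1)


# Hypothetical accuracies (%), invented for illustration only.
example = {
    "backbone-A": (80.0, 65.5),
    "backbone-B": (72.0, 70.0),
}
best_backbone, delta_pp = agent_vs_baseline(example)
print(f"{best_backbone}: {delta_pp:+.1f} percentage points")
```

Under this reading, a baseline of 65.5% and an agent score of 80.0% give +14.5 percentage points, whereas the relative improvement would be about 22%. The two figures should not be conflated when comparing rows.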