Table 2. Clinical performance of AI agents versus baseline
| Study | AI Agent Base-LLM Model(s)* | Primary Outcome | Agent vs Baseline* | Baseline comparison | External validation | Synthetic Data |
|---|---|---|---|---|---|---|
| Chen et al. | GPT-4 | Accuracy | +14.3% | Base-LLM | No | Yes |
| Goodell et al. | GPT-4o, Llama-3 | Accuracy | +59.1% | Base-LLM | Yes | Yes |
| Omar et al. | GPT-4o, Claude, Gemini-1.5-Pro | Accuracy | +13.8% | Base-LLM | No | No |
| Gorenshtein 2025a | GPT-4o-mini | Accuracy | +3.5% | Single LLM (Sonar-pro) | No | Yes |
| Gorenshtein 2025b | Gemini-1.5-Pro-002 | Accuracy | +29.6% | Base-LLM | Yes-Physician review | No |
| Gorenshtein 2025c | GPT-4o | Quality score | −25.5% | Physicians | Yes-Comparison to Physician | No |
| Ferber et al. | GPT-4 | Accuracy | +56.9% | Base-LLM | Yes-Physician review | Mixed |
| Qu et al. | GPT-4o, CRISPR-Llama3-8B (fine-tuned Llama-3-8B) | Design success | +49% | Base-LLM | Yes- in-vitro validation | Mixed |
| Pickard et al. | GPT-4 | Enrichment similarity score | +61.7% | Base-LLM | Yes | No |
| Woo et al. | GPT-4, Claude-3, Llama-3-70B, GPT-3.5, Mixtral-8×7B | Accuracy | +36% | Base-LLM | Yes-Physician review | Yes |
| Swanson et al. | GPT-4o | Binding improvement | 2/92 improved | Baseline wild-type nanobodies (no agent design) | Yes- in-vitro validation | Mixed |
| Mejia et al. | GPT-4o, GPT-4o-mini, Qwen-2.5-32B, DeepSeek-R1 | Custom score | N/A | Base-LLM | No | No |
| Wang 2025a | GPT-4 | ROUGE-L | +7.1% | Base-LLM | No | No |
| Yang 2025 | Llama-3 | None | N/A | None | No | Yes |
| Altermatt et al. | GPT-4o | Accuracy | +4.1% | Base-LLM+chain of thoughts | No | Yes |
| Xu et al. | GPT-4-Turbo, GPT-3.5-Turbo, Mixtral-8×22B, Llama-3 8B, Llama-3 70B, Qwen 2.5-max, Claude-3-Opus | Accuracy | N/A | None | Yes | Yes |
| Wang 2025b | GPT-4o | PTV D95 | +4.75% | ECHO autoplanner | Yes | Yes |
| Ke et al. | GPT-4-Turbo | Accuracy | +76% (Base-LLM), +28% (Physicians) | Base-LLM, Physicians | Yes-Comparison to Physician | Yes |
| Yang 2024 | GPT-3.5-Turbo | F1 score | +53% | Base-LLM | Yes | Yes |
| Low et al. | ChatRWD | Accuracy | +48% | ChatGPT-4 | No | Yes |
*Unless stated otherwise, the baseline comparison is to the base LLM, conducted on the best-performing agentic LLM backbone (e.g., agentic ChatGPT-4 compared with ChatGPT-4). Values are percentage-point differences.