Author manuscript; available in PMC: 2026 Feb 20.

Published in final edited form as: Proc Conf Empir Methods Nat Lang Process. 2025 Nov;2025:27337–27362. doi: 10.18653/v1/2025.emnlp-main.1390

Table 2:

Average improvement over randomly sampled FDA drugs, grouped by the LLM-estimated relevance of the retrieved BioAssays. Each value is the increase in docking score, in kcal/mol. Avg. and Med. denote the average and the median, respectively.

Model	High (39%) Avg.	High (39%) Med.	Medium (42%) Avg.	Medium (42%) Med.	Low (7%) Avg.	Low (7%) Med.	No (12%) Avg.	No (12%) Med.	Overall Avg.	Overall Med.

TargetDiff	0.838	0.802	0.701	0.777	0.669	0.696	0.771	1.052	0.761	0.796
Gemma-3-27B	0.196	0.170	-0.050	-0.145	0.197	0.293	-0.390	-0.535	0.023	0.034
GPT 4o	0.331	0.228	0.116	0.118	0.396	0.302	-0.289	-0.122	0.171	0.130
DeepSeekV3	0.379	0.311	0.079	0.070	0.429	0.300	-0.067	-0.098	0.203	0.159
Assay2Mol (Gemma-3-27B)	1.277	1.124	1.037	1.121	0.535	0.770	0.554	0.606	1.037	1.069
Assay2Mol (GPT 4o)	1.061	1.046	0.732	0.741	0.223	0.517	0.269	0.151	0.769	0.777
Assay2Mol (DeepSeekV3)	1.042	0.634	0.842	0.921	0.599	0.579	0.267	0.273	0.834	0.849
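
As a reading aid, the sketch below checks how the Overall Avg. column relates to the per-group averages. It assumes that the percentages in the column headers are each relevance group's share of the evaluated cases, which the table itself does not state. Under that assumption, the overall average should be close to the share-weighted mean of the group averages. The names `SHARES`, `GROUP_AVG`, `REPORTED_OVERALL`, and `weighted_overall` are illustrative and are not taken from the paper. Because the shares are rounded to whole percentages, agreement is only approximate, and the same check does not apply to the median columns, since medians cannot be combined across groups this way.

```python
# Group shares read from the Table 2 column headers.
# Assumption: these are the fractions of cases in each relevance group.
SHARES = {"High": 0.39, "Medium": 0.42, "Low": 0.07, "No": 0.12}

# Per-group average improvements (kcal/mol), copied from Table 2.
GROUP_AVG = {
    "TargetDiff": {"High": 0.838, "Medium": 0.701, "Low": 0.669, "No": 0.771},
    "Gemma-3-27B": {"High": 0.196, "Medium": -0.050, "Low": 0.197, "No": -0.390},
    "GPT 4o": {"High": 0.331, "Medium": 0.116, "Low": 0.396, "No": -0.289},
    "DeepSeekV3": {"High": 0.379, "Medium": 0.079, "Low": 0.429, "No": -0.067},
    "Assay2Mol (Gemma-3-27B)": {"High": 1.277, "Medium": 1.037, "Low": 0.535, "No": 0.554},
    "Assay2Mol (GPT 4o)": {"High": 1.061, "Medium": 0.732, "Low": 0.223, "No": 0.269},
    "Assay2Mol (DeepSeekV3)": {"High": 1.042, "Medium": 0.842, "Low": 0.599, "No": 0.267},
}

# Overall Avg. column, copied from Table 2.
REPORTED_OVERALL = {
    "TargetDiff": 0.761,
    "Gemma-3-27B": 0.023,
    "GPT 4o": 0.171,
    "DeepSeekV3": 0.203,
    "Assay2Mol (Gemma-3-27B)": 1.037,
    "Assay2Mol (GPT 4o)": 0.769,
    "Assay2Mol (DeepSeekV3)": 0.834,
}


def weighted_overall(group_avg, shares=SHARES):
    """Share-weighted mean of the per-group averages."""
    total_share = sum(shares.values())
    return sum(shares[g] * group_avg[g] for g in shares) / total_share


if __name__ == "__main__":
    for model, avgs in GROUP_AVG.items():
        estimate = weighted_overall(avgs)
        gap = estimate - REPORTED_OVERALL[model]
        print(f"{model:26s} weighted={estimate:.3f} "
              f"reported={REPORTED_OVERALL[model]:.3f} gap={gap:+.4f}")
```

If the assumption holds, every gap should be on the order of rounding error; a larger gap would suggest that the percentages mean something other than group shares.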