Author manuscript; available in PMC: 2026 Feb 20.
Published in final edited form as: Proc Conf Empir Methods Nat Lang Process. 2025 Nov;2025:27337–27362. doi: 10.18653/v1/2025.emnlp-main.1390

Table 4:

Manual review of LLM BioAssay relevance assessment by a single expert computational chemist (author S.S.E.). Each target has 10 retrieved BioAssays, except 2RHY and 2PQW, which have 6 each because only 6 BioAssays remained after filtering.

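| Relevance | Target | GPT 4o errors | GPT 4o accuracy (%) | DeepSeek-V3 errors | DeepSeek-V3 accuracy (%) | GPT 4o accuracy – DeepSeek-V3 accuracy (%) |
|---|---|---|---|---|---|---|
| high | 5W2G | 1 | 90 | 1 | 90 | 0 |
| high | 3G51 | 4 | 60 | 10 | 0 | 60 |
| high | 1COY | 3 | 70 | 9 | 10 | 60 |
| high | 2JJG | 5 | 50 | 9 | 10 | 40 |
| high | 2RHY | 1 | 83 | 3 | 50 | 33 |
| high | 2PQW | 0 | 100 | 2 | 67 | 33 |
| high | 4G3D | 5 | 50 | 5 | 50 | 0 |
| medium | 4AAW | 4 | 60 | 6 | 40 | 20 |
| medium | 4YHJ | 0 | 100 | 1 | 90 | 10 |
| medium | 14GS | 1 | 90 | 9 | 10 | 80 |
| medium | 4RN0 | 1 | 90 | 4 | 60 | 30 |
| medium | 1FMC | 0 | 100 | 10 | 0 | 100 |
| medium | 3DAF | 0 | 100 | 10 | 0 | 100 |
| medium | 1A2G | 0 | 100 | 10 | 0 | 100 |
| medium | 3DZH | 0 | 100 | 8 | 20 | 80 |
| medium | 5BUR | 0 | 100 | 10 | 0 | 100 |
| low | 1R1H | 0 | 100 | 7 | 30 | 70 |
| low | 5B08 | 0 | 100 | 3 | 70 | 30 |
| low | 5I0B | 1 | 90 | 4 | 60 | 30 |
| low | 3KC1 | 0 | 100 | 4 | 60 | 40 |
| low | 1D7J | 1 | 90 | 4 | 60 | 30 |
| no | 2Z3H | 0 | 100 | 0 | 100 | 0 |
| no | 2V3R | 0 | 100 | 0 | 100 | 0 |
| no | 3B6H | 8 | 20 | 10 | 0 | 20 |
| no | 4P6P | 0 | 100 | 0 | 100 | 0 |
| total | 25 | 35 | 86 | 139 | 43 | 43 |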
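
The accuracy columns in Table 4 can be recomputed from the error counts and the number of BioAssays per target. The following is a minimal sketch, not the authors' code. The error counts are copied from the table. The names `ERRORS`, `n_assays`, and `accuracy_pct` are hypothetical. The sketch assumes that the total accuracy is pooled over all assays, which the source does not state explicitly.

```python
# Errors per target copied from Table 4: (GPT 4o errors, DeepSeek-V3 errors).
ERRORS = {
    # high relevance
    "5W2G": (1, 1), "3G51": (4, 10), "1COY": (3, 9), "2JJG": (5, 9),
    "2RHY": (1, 3), "2PQW": (0, 2), "4G3D": (5, 5),
    # medium relevance
    "4AAW": (4, 6), "4YHJ": (0, 1), "14GS": (1, 9), "4RN0": (1, 4),
    "1FMC": (0, 10), "3DAF": (0, 10), "1A2G": (0, 10), "3DZH": (0, 8),
    "5BUR": (0, 10),
    # low relevance
    "1R1H": (0, 7), "5B08": (0, 3), "5I0B": (1, 4), "3KC1": (0, 4),
    "1D7J": (1, 4),
    # no relevance
    "2Z3H": (0, 0), "2V3R": (0, 0), "3B6H": (8, 10), "4P6P": (0, 0),
}

# Per the caption: 10 BioAssays per target, except two targets with 6 each.
DEFAULT_ASSAYS = 10
REDUCED_ASSAYS = {"2RHY": 6, "2PQW": 6}


def n_assays(target: str) -> int:
    """Number of BioAssays reviewed for a target."""
    return REDUCED_ASSAYS.get(target, DEFAULT_ASSAYS)


def accuracy_pct(errors: int, n: int) -> float:
    """Share of correctly assessed BioAssays, in percent."""
    return 100.0 * (n - errors) / n


# Totals, pooled over all assays (assumption, see the paragraph above).
total_assays = sum(n_assays(t) for t in ERRORS)
gpt_errors = sum(e[0] for e in ERRORS.values())
deepseek_errors = sum(e[1] for e in ERRORS.values())
gpt_total = accuracy_pct(gpt_errors, total_assays)
deepseek_total = accuracy_pct(deepseek_errors, total_assays)

print(f"assays reviewed: {total_assays}")
print(f"GPT 4o: {gpt_errors} errors, {gpt_total:.0f}% accuracy")
print(f"DeepSeek-V3: {deepseek_errors} errors, {deepseek_total:.0f}% accuracy")
```

Under this pooling assumption, the sketch reproduces the total row: 35 and 139 errors over 242 assays, which round to 86% and 43%. For the two six-assay targets, a single error changes accuracy by about 17 points rather than 10, which explains the values 83, 67, and 50 in those rows.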