2024 Dec 9;16(6):2514–2572. doi: 10.1039/d4sc03921a

Table 2. Decoder-only scientific LLMs. The release date column displays the date of the first publication for each paper. When available, the publication date of the last updated version is displayed between parentheses.

| LLM | Model size (a) | Training data | Architecture | Application | Release date |
|---|---|---|---|---|---|
| Tx-LLM (ref. 271) | (b) | TDC datasets | PaLM-2 | Property prediction and retrosynthesis | 2024.06 |
| BioMedLM (ref. 272) | 2.7B | PubMed abstracts and full articles | GPT | QA | 2024.03 |
| LlasMol (ref. 273) | ∼7B | SMolInstruct | Galactica, LLaMA, Mistral | Property prediction, molecule captioning, molecule generation, retrosynthesis, name conversion | 2024.02 (2024.08) |
| BioMistral (ref. 274) | 7B | PubMed Central (PMC) | Mistral | QA | 2024.02 (2024.08) |
| BiMediX (ref. 275) | 8 × 7B | 1.3M Arabic–English instructions (BiMed) | Mixtral | QA | 2024.02 |
| EpilepsyLLM (ref. 276) | 7B | Data from the Japan Epilepsy Association, Epilepsy Information Center, and Tenkan Net | LLaMA | QA | 2024.01 |
| CheXagent (ref. 277) | 7B | 28 publicly available datasets, including PMC, MIMIC, Wikipedia, PadChest, and BIMCV-COVID-19 | Mistral | QA, image understanding | 2024.01 |
| ChemSpaceAL (ref. 278) | (b) | ChEMBL 33, GuacaMol v1, MOSES, and BindingDB 08-2023 | GPT | Molecule generation | 2023.09 (2024.02) |
| BioMedGPT-LM (ref. 279) | 7B and 10B | 5.5M biomedical papers from S2ORC | LLaMA2 | QA | 2023.08 |
| Darwin (ref. 280) | 7B | SciQ and Web of Science | LLaMA | QA, property prediction, NER, and molecule generation | 2023.08 |
| cMolGPT (ref. 46) | (b) | MOSES | GPT | Molecule generation | 2023.05 |
| PMC-LLaMA (ref. 281) | 7B and 13B | MedC-k and MedC-I | LLaMA | QA | 2023.04 (2024.04) |
| GPTChem (ref. 142) | 175B | Curation of multiple classification and regression benchmarks | GPT-3 | Property prediction and inverse design | 2023.02 (2024.02) |
| Galactica (ref. 123) | 125M, 1.3B, 6.7B, 30B, 120B | The Galactica corpus, a curation with 62B scientific documents | Decoder-only | QA, NER, document summarization, property prediction | 2022.11 |
| BioGPT (ref. 282) | 355M | 15M titles and abstracts from PubMed | GPT-2 | QA, NER, and document classification | 2022.09 (2023.04) |
| SMILES-to-properties-transformer (ref. 265) | 6.5M | Synthetic data generated with the thermodynamic model COSMO-RS | GPT-3 | Property prediction | 2022.06 (2022.09) |
| ChemGPT (ref. 283) | ∼1B | 10M molecules from PubChem | GPT-neo | Molecule generation | 2022.05 (2023.11) |
| Regression transformer (ref. 139) | ∼27M | ChEMBL, MoleculeNet, USPTO, etc. | XLNet | Property prediction, molecule tuning, molecule generation | 2022.02 (2023.04) |
| MolGPT (ref. 284) | 6M | MOSES and GuacaMol | GPT | Molecule generation | 2021.10 |
| Adilov2021 (ref. 285) | 13.4M | 5M SMILES from ChemBERTa's PubChem-10M | GPT-2 | Property prediction and molecule generation | 2021.09 |
(a) “Model size” is reported as the number of parameters. “PubMed” refers to the PubMed abstracts dataset, while PMC (PubMed Central) refers to the full-text corpus dataset.

(b) The total number of parameters was not reported.