Table 2. Decoder-only scientific LLMs. The release date column shows the date of each paper's first publication; when available, the publication date of the most recently updated version follows in parentheses.
| LLM | Model size^a | Training data | Architecture | Application | Release date |
|---|---|---|---|---|---|
| Tx-LLM271 | —^b | TDC datasets | PaLM-2 | Property prediction and retrosynthesis | 2024.06 |
| BioMedLM272 | 2.7B | PubMed abstracts and full articles | GPT | QA | 2024.03 |
| LlaSMol273 | ∼7B | SMolInstruct | Galactica, LLaMA, Mistral | Property prediction, molecule captioning, molecule generation, retrosynthesis, name conversion | 2024.02 (2024.08) |
| BioMistral274 | 7B | PubMed Central (PMC) | Mistral | QA | 2024.02 (2024.08) |
| BiMediX275 | 8× 7B | 1.3M Arabic–English instructions (BiMed) | Mixtral | QA | 2024.02 |
| EpilepsyLLM276 | 7B | Data from the Japan Epilepsy Association, Epilepsy Information Center, and Tenkan Net | LLaMA | QA | 2024.01 |
| CheXagent277 | 7B | 28 publicly available datasets, including PMC, MIMIC, Wikipedia, PadChest, and BIMCV-COVID-19 | Mistral | QA, image understanding | 2024.01 |
| ChemSpaceAL278 | —^b | ChEMBL 33, GuacaMol v1, MOSES, and BindingDB 08-2023 | GPT | Molecule generation | 2023.09 (2024.02) |
| BioMedGPT-LM279 | 7B and 10B | 5.5M biomedical papers from S2ORC | LLaMA2 | QA | 2023.08 |
| Darwin280 | 7B | SciQ and Web of Science | LLaMA | QA, property prediction, NER, and molecule generation | 2023.08 |
| cMolGPT46 | —^b | MOSES | GPT | Molecule generation | 2023.05 |
| PMC-LLaMA281 | 7B and 13B | MedC-k and MedC-I | LLaMA | QA | 2023.04 (2024.04) |
| GPTChem142 | 175B | Curation of multiple classification and regression benchmarks | GPT-3 | Property prediction and inverse design | 2023.02 (2024.02) |
| Galactica123 | 125M, 1.3B, 6.7B, 30B, 120B | The Galactica corpus, a curation of 62B scientific documents | Decoder-only | QA, NER, document summarization, property prediction | 2022.11 |
| BioGPT282 | 355M | 15M titles and abstracts from PubMed | GPT-2 | QA, NER, and document classification | 2022.09 (2023.04) |
| SMILES-to-properties-transformer265 | 6.5M | Synthetic data generated with the thermodynamic model COSMO-RS | GPT-3 | Property prediction | 2022.06 (2022.09) |
| ChemGPT283 | ∼1B | 10M molecules from PubChem | GPT-Neo | Molecule generation | 2022.05 (2023.11) |
| Regression transformer139 | ∼27M | ChEMBL, MoleculeNet, USPTO, etc. | XLNet | Property prediction, molecule tuning, molecule generation | 2022.02 (2023.04) |
| MolGPT284 | 6M | MOSES and GuacaMol | GPT | Molecule generation | 2021.10 |
| Adilov2021 (ref. 285) | 13.4M | 5M SMILES from ChemBERTa's PubChem-10M | GPT-2 | Property prediction and molecule generation | 2021.09 |
^a “Model size” is reported as the number of parameters. “PubMed” refers to the PubMed abstracts dataset, while PMC (PubMed Central) refers to the full-text corpus dataset.
^b The total number of parameters was not reported.
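Several of the models above (e.g. MolGPT, ChemGPT, cMolGPT) perform molecule generation by sampling SMILES strings one token at a time with a decoder-only transformer. As a minimal illustration of that autoregressive sampling loop only, the sketch below uses a toy character-level bigram model in place of a transformer; the corpus, the `^`/`$` boundary tokens, and all function names are illustrative assumptions, not any listed model's actual code.

```python
import random
from collections import defaultdict

BOS, EOS = "^", "$"  # illustrative begin/end-of-sequence tokens

def train_bigram(smiles_corpus):
    """Count character-to-character transitions over BOS + SMILES + EOS."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_corpus:
        seq = BOS + s + EOS
        for prev_ch, next_ch in zip(seq, seq[1:]):
            counts[prev_ch][next_ch] += 1
    return counts

def sample(counts, max_len=40, seed=0):
    """Autoregressive decoding: draw each character conditioned on the
    previous one, stopping at EOS (a transformer would condition on the
    whole prefix instead of only the last character)."""
    rng = random.Random(seed)
    out, cur = [], BOS
    for _ in range(max_len):
        candidates = list(counts[cur])
        weights = [counts[cur][c] for c in candidates]
        cur = rng.choices(candidates, weights=weights, k=1)[0]
        if cur == EOS:
            break
        out.append(cur)
    return "".join(out)

# Tiny illustrative corpus: ethanol, ethylamine, propane, benzene.
corpus = ["CCO", "CCN", "CCC", "c1ccccc1"]
model = train_bigram(corpus)
print(sample(model, seed=1))
```

Because a bigram model conditions on only one previous character, its samples are not guaranteed to be valid molecules; the transformer decoders in the table learn much longer-range constraints (ring closures, branch balancing) from millions of training molecules.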