Table 2. Decoder-only scientific LLMs. The release date column shows the date of each paper's first publication; when available, the publication date of the most recently updated version follows in parentheses.
| LLM | Model size^a | Training data | Architecture | Application | Release date |
|---|---|---|---|---|---|
| Tx-LLM271 | —^b | TDC datasets | PaLM-2 | Property prediction and retrosynthesis | 2024.06 |
| BioMedLM272 | 2.7B | PubMed abstracts and full articles | GPT | QA | 2024.03 |
| LlaSMol273 | ∼7B | SMolInstruct | Galactica, LLaMA, Mistral | Property prediction, molecule captioning, molecule generation, retrosynthesis, name conversion | 2024.02 (2024.08) |
| BioMistral274 | 7B | PubMed Central (PMC) | Mistral | QA | 2024.02 (2024.08) |
| BiMediX275 | 8× 7B | 1.3M Arabic–English instructions (BiMed) | Mixtral | QA | 2024.02 |
| EpilepsyLLM276 | 7B | Data from the Japan Epilepsy Association, Epilepsy Information Center, and Tenkan Net | LLaMA | QA | 2024.01 |
| CheXagent277 | 7B | 28 publicly available datasets, including PMC, MIMIC, Wikipedia, PadChest, and BIMCV-COVID-19 | Mistral | QA, image understanding | 2024.01 |
| ChemSpaceAL278 | —^b | ChEMBL 33, GuacaMol v1, MOSES, and BindingDB 08-2023 | GPT | Molecule generation | 2023.09 (2024.02) |
| BioMedGPT-LM279 | 7B and 10B | 5.5M biomedical papers from S2ORC | LLaMA2 | QA | 2023.08 |
| Darwin280 | 7B | SciQ and Web of Science | LLaMA | QA, property prediction, NER, and molecule generation | 2023.08 |
| cMolGPT46 | —^b | MOSES | GPT | Molecule generation | 2023.05 |
| PMC-LLaMA281 | 7B and 13B | MedC-k and MedC-I | LLaMA | QA | 2023.04 (2024.04) |
| GPTChem142 | 175B | Curation of multiple classification and regression benchmarks | GPT-3 | Property prediction and inverse design | 2023.02 (2024.02) |
| Galactica123 | 125M, 1.3B, 6.7B, 30B, 120B | The Galactica corpus, a curation of 62B scientific documents | Decoder-only | QA, NER, document summarization, property prediction | 2022.11 |
| BioGPT282 | 355M | 15M titles and abstracts from PubMed | GPT-2 | QA, NER, and document classification | 2022.09 (2023.04) |
| SMILES-to-properties-transformer265 | 6.5M | Synthetic data generated with the thermodynamic model COSMO-RS | GPT-3 | Property prediction | 2022.06 (2022.09) |
| ChemGPT283 | ∼1B | 10M molecules from PubChem | GPT-Neo | Molecule generation | 2022.05 (2023.11) |
| Regression transformer139 | ∼27M | ChEMBL, MoleculeNet, USPTO, etc. | XLNet | Property prediction, molecule tuning, molecule generation | 2022.02 (2023.04) |
| MolGPT284 | 6M | MOSES and GuacaMol | GPT | Molecule generation | 2021.10 |
| Adilov2021 (ref. 285) | 13.4M | 5M SMILES from ChemBERTa's PubChem-10M | GPT-2 | Property prediction and molecule generation | 2021.09 |
^a “Model size” is reported as the number of parameters. “PubMed” refers to the PubMed abstracts dataset, while PMC (PubMed Central) refers to the full-text corpus dataset.
^b The total number of parameters was not reported.
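Several of the models above (e.g. MolGPT, ChemGPT, cMolGPT) perform molecule generation by sampling SMILES strings one token at a time with a decoder-only transformer. As a minimal illustration of that autoregressive sampling loop only, the sketch below uses a toy character-level bigram model in place of a transformer; the corpus, the `^`/`$` boundary tokens, and all function names are illustrative assumptions, not any listed model's actual code.

```python
import random
from collections import defaultdict

BOS, EOS = "^", "$"  # illustrative begin/end-of-sequence tokens

def train_bigram(smiles_corpus):
    """Count character-to-character transitions over BOS + SMILES + EOS."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_corpus:
        seq = BOS + s + EOS
        for prev_ch, next_ch in zip(seq, seq[1:]):
            counts[prev_ch][next_ch] += 1
    return counts

def sample(counts, max_len=40, seed=0):
    """Autoregressive decoding: draw each character conditioned on the
    previous one, stopping at EOS (a transformer would condition on the
    whole prefix instead of only the last character)."""
    rng = random.Random(seed)
    out, cur = [], BOS
    for _ in range(max_len):
        candidates = list(counts[cur])
        weights = [counts[cur][c] for c in candidates]
        cur = rng.choices(candidates, weights=weights, k=1)[0]
        if cur == EOS:
            break
        out.append(cur)
    return "".join(out)

# Tiny illustrative corpus: ethanol, ethylamine, propane, benzene.
corpus = ["CCO", "CCN", "CCC", "c1ccccc1"]
model = train_bigram(corpus)
print(sample(model, seed=1))
```

Because a bigram model conditions on only one previous character, its samples are not guaranteed to be valid molecules; the transformer decoders in the table learn much longer-range constraints (ring closures, branch balancing) from millions of training molecules.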