
Table 1: A summary of existing gLMs.

An overview of various gLMs is provided, highlighting their pretraining datasets, tasks, architectures, tokenization methods, and unique features. The models are listed in the order of their public release dates. Abbreviations used include SSM for State Space Model, CNN for Convolutional Neural Network, BPE for Byte-Pair Encoding, CLM for Causal Language Modeling, and MLM for Masked Language Modeling.

| Model Name | Pretraining data sources | Task | Architecture | Tokenization | Notes |
| --- | --- | --- | --- | --- | --- |
| BigBird [53] | Human | MLM | Transformer | BPE | |
| DNABERT [54] | Human | MLM | Transformer | overlapping k-mer | |
| GeneBERT [55] | Human | MLM | Transformer | overlapping k-mer | Trained to also predict chromatin accessibility from ATAC-seq data. |
| Epigenomic BERT [56] | Human | MLM | Transformer | non-overlapping k-mer | DNA sequences are paired with associated epigenetic state information (IDEAS) [57] during training. |
| LookingGlass [58] | Bacteria + archaea | CLM | Recurrent Neural Network | nucleotide-level | Metagenomic sequences from diverse environments, rather than assembled genomes, are used for training. |
| LOGO [59] | Human | MLM | CNN + Transformer | overlapping k-mer | |
| ViBE [60] | Virus | MLM | Transformer | overlapping k-mer | |
| GPN [13] | Arabidopsis thaliana + 7 related Brassicales genomes | MLM | CNN | nucleotide-level | |
| FloraBERT [61] | Several hundred plants + selected maize genomes | MLM | Transformer | BPE | Only 1 kb promoter sequences are used in training. |
| INHERIT [62] | Bacteria + bacteriophage | MLM | Transformer | overlapping k-mer | |
| GenSLMs [63] | Prokaryotic gene sequences + SARS-CoV-2 genomes | CLM | Transformer | non-overlapping k-mer | Pretrained on prokaryotic genes and fine-tuned on SARS-CoV-2 genomes. |
| NT [16] | Human + 1000 Genomes Project + multi-species | MLM | Transformer | non-overlapping k-mer | |
| SpliceBERT [64] | Human + 71 vertebrate genomes | MLM | Transformer | nucleotide-level | Only RNA transcripts are used in training. |
| SpeciesLM Fungi [65] | 1500 fungal genomes | MLM | Transformer | overlapping k-mer | Only 5′ and 3′ UTR regions are used in training, yielding separate 5′ and 3′ species LMs. |
| GENA-LM [66] | Human + multi-species | MLM | Transformer | BPE | |
| DNABERT-2 [48] | Human + multi-species | MLM | Transformer | BPE | |
| HyenaDNA [35] | Human | CLM | SSM | nucleotide-level | |
| GROVER [67] | Human | MLM | Transformer | BPE | |
| DNAGPT [68] | Human + multi-species | CLM | Transformer | non-overlapping k-mer | |
| GPN-MSA [17] | Human + Multiple Sequence Alignment (MSA) with 100 vertebrate genomes | MLM | Transformer | nucleotide-level | |
| UTR-LM [69] | Human + 4 vertebrate genomes | MLM | Transformer | nucleotide-level | Only 5′ UTR regions are used in training. Also trained to predict mRNA minimum free energy and secondary structures calculated by ViennaRNA [70]. |
| hgT5 [71] | Human | T5 [72] | Transformer | Unigram model [73] | |
| AgroNT [14] | 48 plant genomes focusing on edible plant species | MLM | Transformer | non-overlapping k-mer | |
| MegaDNA [36] | ∼100k bacteriophage genomes | CLM | Transformer | nucleotide-level | |
| regLM [30] | Human + yeast | CLM | SSM | nucleotide-level | Human enhancer and yeast promoter sequences are used to fine-tune/pretrain separate HyenaDNA [35] models. |
| EVO [31] | Bacteria + archaea + virus + plasmid | CLM | SSM + Transformer | nucleotide-level | |
| Caduceus [24] | Human | MLM | SSM | nucleotide-level | |
| ChatNT [74] | Genomic sequences + English instructions | CLM | Transformer | overlapping k-mer | Combines the pretrained gLM NT [16] and the English LM Vicuna [75]. Trained to perform all supervised genomics prediction tasks as text-to-text tasks. |
| LucaOne [76] | Genomic and protein sequences from 169,861 species | MLM | Transformer | nucleotide- and amino acid-level | Mixed pretraining with DNA, RNA, and protein sequences. Also trained to predict 8 types of selected annotations. |
| PlantCaduceus [15] | 16 angiosperm genomes | MLM | SSM | nucleotide-level | |
| CD-GPT [77] | Genomic and protein sequences of 14 organisms | CLM | Transformer | BPE | Mixed pretraining with DNA, RNA, and protein sequences, followed by targeted DNA-protein and mRNA-protein paired pretraining. |
| SpeciesLM Metazoa [19] | 494 metazoan genomes | MLM | Transformer | overlapping k-mer | Trained only on the 2 kb regions upstream of start codons. |
| gLM2 [78] | Metagenomes and genomes from IMG [79] and MGnify [80] | MLM | Transformer | BPE for nucleotides, amino acid-level for proteins | Pretraining with a mixed-modality dataset comprising interleaved protein-coding (amino acid) and intergenic (nucleotide) sequences. |
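To make the tokenization column concrete, the following is a minimal sketch (not taken from any of the listed models' code) of how a DNA sequence is split under the nucleotide-level, overlapping k-mer, and non-overlapping k-mer schemes. The helper names, the choice of k = 6, and the example sequence are hypothetical and for illustration only; BPE and Unigram tokenizers are summarized in comments because their token boundaries depend on a learned vocabulary rather than a fixed rule.

```python
def nucleotide_level(seq: str) -> list[str]:
    """Each base is its own token (e.g., HyenaDNA, GPN, EVO)."""
    return list(seq)


def overlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Sliding window with stride 1 (e.g., DNABERT); adjacent tokens share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Fixed-width chunks with stride k (e.g., NT, AgroNT); a shorter final chunk may remain."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


seq = "ATGCGTACGTTAG"  # hypothetical example sequence
print(nucleotide_level(seq))          # ['A', 'T', 'G', 'C', 'G', 'T', ...]
print(overlapping_kmers(seq, 6))      # ['ATGCGT', 'TGCGTA', 'GCGTAC', ...]
print(non_overlapping_kmers(seq, 6))  # ['ATGCGT', 'ACGTTA', 'G']

# BPE (e.g., DNABERT-2, GENA-LM, GROVER) and Unigram (hgT5) tokenizers instead learn
# variable-length subwords from corpus statistics, so token boundaries follow the
# trained vocabulary rather than a fixed k.
#
# Task column, schematically, over any of the token streams above:
#   MLM - mask a random subset of tokens and predict the original tokens at those positions.
#   CLM - predict each next token from the tokens to its left, in order.
```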