Table 1: A summary of existing gLMs.
An overview of existing gLMs is provided, highlighting their pretraining data sources, pretraining tasks, architectures, tokenization methods, and notable features. Models are listed in order of their public release dates. Abbreviations: SSM, State Space Model; CNN, Convolutional Neural Network; BPE, Byte-Pair Encoding; CLM, Causal Language Modeling; MLM, Masked Language Modeling. Illustrative sketches of the tokenization schemes and pretraining objectives follow the table.
Model Name | Pretraining data sources | Task | Architecture | Tokenization | Notes |
---|---|---|---|---|---|
BigBird [53] | Human | MLM | Transformer | BPE | |
DNABERT [54] | Human | MLM | Transformer | overlapping k-mer | |
GeneBERT [55] | Human | MLM | Transformer | overlapping k-mer | Also trained to predict chromatin accessibility from ATAC-seq data. |
Epigenomic BERT [56] | Human | MLM | Transformer | non-overlapping k-mer | DNA sequences are paired with associated epigenetic state information (IDEAS) [57] during training. |
LookingGlass [58] | Bacteria + archaea | CLM | Recurrent Neural Network | nucleotide-level | Metagenomic sequences from diverse environments, rather than assembled genomes, are used in training. |
LOGO [59] | Human | MLM | CNN + Transformer | overlapping k-mer | |
ViBE [60] | Virus | MLM | Transformer | overlapping k-mer | |
GPN [13] | Arabidopsis thaliana + 7 related Brassicales genomes | MLM | CNN | nucleotide-level | |
FloraBERT [61] | Several hundred plants + selected maize genomes | MLM | Transformer | BPE | Only 1 kb promoter sequences are used in training. |
INHERIT [62] | Bacteria + bacteriophage | MLM | Transformer | overlapping k-mer | |
GenSLMs [63] | Prokaryotic gene sequences + SARS-CoV-2 genomes | CLM | Transformer | non-overlapping k-mer | Pretrained on prokaryotic genes and fine-tuned on SARS-CoV-2 genomes. |
NT [16] | Human + 1000 Genomes Project + multi-species | MLM | Transformer | non-overlapping k-mer | |
SpliceBERT [64] | Human + 71 vertebrate genomes | MLM | Transformer | nucleotide-level | Only RNA transcripts are used in training. |
SpeciesLM Fungi [65] | 1500 fungal genomes | MLM | Transformer | overlapping k-mer | Only 5′ and 3′ UTR regions are used in training, yielding separate 5′ and 3′ species LMs. |
GENA-LM [66] | Human + multi-species | MLM | Transformer | BPE | |
DNABERT-2 [48] | Human + multi-species | MLM | Transformer | BPE | |
HyenaDNA [35] | Human | CLM | SSM | nucleotide-level | |
GROVER [67] | Human | MLM | Transformer | BPE | |
DNAGPT [68] | Human + multi-species | CLM | Transformer | non-overlapping k-mer | |
GPN-MSA [17] | Human + Multiple Sequence Alignment (MSA) with 100 vertebrate genomes | MLM | Transformer | nucleotide-level | |
UTR-LM [69] | Human + 4 vertebrate genomes | MLM | Transformer | nucleotide-level | Only 5′ UTR regions are used in training. Also trained to predict mRNA minimum free energy and secondary structures calculated by ViennaRNA [70]. |
hgT5 [71] | Human | Span corruption (T5) [72] | Transformer | Unigram model [73] | |
AgroNT [14] | 48 plant genomes focusing on edible plant species | MLM | Transformer | non-overlapping k-mer | |
MegaDNA [36] | ∼100k bacteriophage genomes | CLM | Transformer | nucleotide-level | |
regLM [30] | Human + yeast | CLM | SSM | nucleotide-level | Separate HyenaDNA [35] models are pretrained/fine-tuned on human enhancer and yeast promoter sequences. |
EVO [31] | Bacteria + archaea + virus + plasmid | CLM | SSM + Transformer | nucleotide-level | |
Caduceus [24] | Human | MLM | SSM | nucleotide-level | |
ChatNT [74] | Genomic sequences + English instructions | CLM | Transformer | overlapping k-mer | Combines the pretrained gLM NT [16] and the English LM Vicuna [75]. Trained to perform all supervised genomics prediction tasks as text-to-text tasks. |
LucaOne [76] | Genomic and protein sequences from 169,861 species | MLM | Transformer | nucleotide- and amino acid-level | Mixed pretraining with DNA, RNA, and protein sequences. Also trained to predict 8 types of selected annotations. |
PlantCaduceus [15] | 16 Angiosperm genomes | MLM | SSM | nucleotide-level | |
CD-GPT [77] | Genomic and protein sequences of 14 organisms | CLM | Transformer | BPE | Mixed pretraining with DNA, RNA, and protein sequences, followed by targeted DNA-protein and mRNA-protein paired pretraining. |
SpeciesLM Metazoa [19] | 494 metazoan genomes | MLM | Transformer | overlapping k-mer | Only the 2 kb regions upstream of start codons are used in training. |
gLM2 [78] | Metagenomes and genomes from IMG [79] and MGnify [80] | MLM | Transformer | BPE for nucleotides, amino acid-level for proteins | Pretrained on a mixed-modality dataset comprising interleaved protein-coding (amino acid) and intergenic (nucleotide) sequences. |
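To make the Tokenization column concrete, the Python sketch below illustrates the three sequence-intrinsic schemes listed above: overlapping k-mers (stride 1, as in DNABERT [54]), non-overlapping k-mers (stride k, as in NT [16]), and nucleotide-level tokens (as in HyenaDNA [35]). BPE is omitted because its vocabulary is learned from a corpus rather than fixed; the function names and the choice of k = 4 are illustrative only, not drawn from any cited model.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Overlapping k-mer tokenization: slide a k-wide window with stride 1,
    so adjacent tokens share k - 1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq: str, k: int) -> list[str]:
    """Non-overlapping k-mer tokenization: split the sequence into disjoint
    k-base chunks (stride k), shortening the input by a factor of ~k."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def nucleotide_level(seq: str) -> list[str]:
    """Nucleotide-level tokenization: one token per base, giving a tiny
    vocabulary ({A, C, G, T} plus special tokens) at single-base resolution."""
    return list(seq)


seq = "ACGTACGTA"
print(overlapping_kmers(seq, k=4))      # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT', 'CGTA']
print(non_overlapping_kmers(seq, k=4))  # ['ACGT', 'ACGT'] (trailing 'A' is dropped)
print(nucleotide_level(seq))            # ['A', 'C', 'G', 'T', 'A', 'C', 'G', 'T', 'A']
```

The trade-off the table encodes is visible here: overlapping k-mers retain every window at the cost of longer inputs and information leakage between neighboring tokens, non-overlapping k-mers compress the input roughly k-fold, and nucleotide-level tokens preserve single-base resolution at the cost of sequence length.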
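The Task column likewise reduces almost entirely to two self-supervised objectives. The sketch below is a minimal illustration of how a tokenized sequence becomes training examples under each objective, not any listed model's actual implementation; the helper names and the mask_prob value are illustrative.

```python
import random

MASK = "[MASK]"


def mlm_examples(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Masked language modeling (MLM): replace random positions with a mask
    token; the model predicts the originals from bidirectional context."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token to recover
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    return corrupted, targets


def clm_examples(tokens: list[str]):
    """Causal language modeling (CLM): one (prefix -> next token) pair per
    position, so prediction conditions only on tokens to the left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = list("ACGTACGT")  # nucleotide-level tokens for brevity
print(mlm_examples(tokens, mask_prob=0.3))
print(clm_examples(tokens))  # [(['A'], 'C'), (['A', 'C'], 'G'), ...]
```

This distinction is why CLM-trained gLMs such as EVO [31] and MegaDNA [36] can generate sequences autoregressively, whereas MLM-trained gLMs are typically used as bidirectional encoders for downstream prediction tasks.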