| BigBird [53] | Human | MLM | Transformer | BPE | |
| DNABERT [54] | Human | MLM | Transformer | overlapping k-mer | |
| GeneBERT [55] | Human | MLM | Transformer | overlapping k-mer | Also trained to predict chromatin accessibility (ATAC-seq) data. |
| Epigenomic BERT [56] | Human | MLM | Transformer | non-overlapping k-mer | DNA sequences are paired with associated epigenetic state information (IDEAS) [57] during training. |
| LookingGlass [58] | Bacteria + archaea | CLM | Recurrent Neural Network | nucleotide-level | Trained on metagenomic sequences from diverse environments rather than assembled genomes. |
| LOGO [59] | Human | MLM | CNN + Transformer | overlapping k-mer | |
| ViBE [60] | Virus | MLM | Transformer | overlapping k-mer | |
| GPN [13] | Arabidopsis thaliana + 7 related Brassicales genomes | MLM | CNN | nucleotide-level | |
| FloraBERT [61] | Several hundred plants + selected maize genomes | MLM | Transformer | BPE | Only 1 kb promoter sequences are used in training. |
| INHERIT [62] | Bacteria + bacteriophage | MLM | Transformer | overlapping k-mer | |
| GenSLMs [63] | Prokaryotic gene sequences + SARS-CoV-2 genomes | CLM | Transformer | non-overlapping k-mer | Pretrained on prokaryotic genes and fine-tuned on SARS-CoV-2 genomes. |
| NT [16] | Human + 1000 Genomes Project + multi-species | MLM | Transformer | non-overlapping k-mer | |
| SpliceBERT [64] | Human + 71 vertebrate genomes | MLM | Transformer | nucleotide-level | Only RNA transcripts are used in training. |
| SpeciesLM Fungi [65] | 1500 fungal genomes | MLM | Transformer | overlapping k-mer | Only 5′ and 3′ UTR regions are used in training (the 5′ species LM and the 3′ species LM, respectively). |
| GENA-LM [66] | Human + multi-species | MLM | Transformer | BPE | |
| DNABERT-2 [48] | Human + multi-species | MLM | Transformer | BPE | |
| HyenaDNA [35] | Human | CLM | SSM | nucleotide-level | |
| GROVER [67] | Human | MLM | Transformer | BPE | |
| DNAGPT [68] | Human + multi-species | CLM | Transformer | non-overlapping k-mer | |
| GPN-MSA [17] | Human + Multiple Sequence Alignment (MSA) with 100 vertebrate genomes | MLM | Transformer | nucleotide-level | |
| UTR-LM [69] | Human + 4 vertebrate genomes | MLM | Transformer | nucleotide-level | Only 5′ UTR regions are used in training. Also trained to predict mRNA minimum free energy and secondary structures calculated by ViennaRNA [70]. |
| hgT5 [71] | Human | T5 [72] | Transformer | Unigram model [73] | |
| AgroNT [14] | 48 plant genomes focusing on edible plant species | MLM | Transformer | non-overlapping k-mer | |
| MegaDNA [36] | ∼100k bacteriophage genomes | CLM | Transformer | nucleotide-level | |
| regLM [30] | Human + yeast | CLM | SSM | nucleotide-level | Human enhancer and yeast promoter sequences are used to fine-tune/pretrain separate HyenaDNA [35] models. |
| EVO [31] | Bacteria + archaea + virus + plasmid | CLM | SSM + Transformer | nucleotide-level | |
| Caduceus [24] | Human | MLM | SSM | nucleotide-level | |
| ChatNT [74] | Genomic sequences + English instructions | CLM | Transformer | overlapping k-mer | Combines the pretrained gLM NT [16] and the English LM Vicuna [75]. Trained to perform all supervised genomics prediction tasks as text-to-text tasks. |
| LucaOne [76] | Genomic and protein sequences from 169,861 species | MLM | Transformer | nucleotide- and amino acid-level | Mixed pretraining with DNA, RNA, and protein sequences. Also trained to predict 8 types of selected annotations. |
| PlantCaduceus [15] | 16 angiosperm genomes | MLM | SSM | nucleotide-level | |
| CD-GPT [77] | Genomic and protein sequences of 14 organisms | CLM | Transformer | BPE | Mixed pretraining with DNA, RNA, and protein sequences, followed by targeted DNA-Protein and mRNA-Protein paired pretraining. |
| SpeciesLM Metazoa [19] | 494 metazoan genomes | MLM | Transformer | overlapping k-mer | Trained only on the 2 kb upstream of start codons. |
| gLM2 [78] | Metagenomes and genomes from IMG [79] and MGnify [80] | MLM | Transformer | BPE for nucleotides, amino acid-level for proteins | Pretraining with a mixed-modality dataset, comprising interleaved protein-coding (amino acid) and intergenic (nucleotide) sequences. |
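
The tokenization column above distinguishes nucleotide-level, overlapping k-mer, and non-overlapping k-mer schemes. As a rough illustration of how these differ, the sketch below tokenizes one short sequence three ways. The function names, the example sequence, and the choice of k = 3 are hypothetical and not taken from any listed model; keeping a shorter trailing chunk in the non-overlapping case is also an assumption, since individual models handle remainders differently. BPE and Unigram tokenizers learn their vocabularies from data and are not shown.

```python
# Illustrative sketch only: names and k = 3 are assumptions, not any model's actual tokenizer.

def nucleotide_tokens(seq: str) -> list[str]:
    """Nucleotide-level: every base is its own token."""
    return list(seq)

def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Overlapping k-mers: slide a window of width k with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def non_overlapping_kmers(seq: str, k: int) -> list[str]:
    """Non-overlapping k-mers: stride k; a shorter final chunk is kept (assumption)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

seq = "ATGCGTAC"
print(nucleotide_tokens(seq))         # 8 tokens, one per base
print(overlapping_kmers(seq, 3))      # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(non_overlapping_kmers(seq, 3))  # ['ATG', 'CGT', 'AC']
```

For a sequence of length n, the overlapping scheme yields n - k + 1 tokens, close to the nucleotide-level count, while the non-overlapping scheme shortens the token sequence by roughly a factor of k.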