Table 1.
Implementation details for SeqVec (Heinzinger et al., 2019), ProtBert (Elnaggar et al., 2021), ProtT5 (Elnaggar et al., 2021), ESM-1b (Rives et al., 2021), UniRep (Alley et al., 2019) and BB (Bepler and Berger, 2019)
| | SeqVec | ProtBert | ProtT5 | ESM-1b | UniRep | BB |
---|---|---|---|---|---|---|
Parameters | 93M | 420M | 3B | 650M | 18.2M | 90M* |
Dataset | UniRef50 | BFD | BFD | UniRef50 | UniRef50 | Pfam |
Sequences | 33M | 2.1B | 2.1B | 27M | 27M | 21M |
Embed time (s) | 0.03 | 0.06 | 0.1 | 0.09 | 2.1 | 0.1 |
Attention heads | 0 | 16 | 32 | 20 | 0 | 0 |
Bits per float | 32 | 32 | 16 | 32 | 32 | 32 |
Size (GB) | 0.35 | 1.6 | 3.6 | 7.3 | 0.06 | 0.12 |
Notes: Estimates are marked by *. Differences in the number of proteins (Sequences) for the same set (Dataset) originate from versioning. The embedding time (in seconds) was averaged over 10 000 proteins taken from the PDB (Berman et al., 2000), using the embedding models from bio-embeddings (Dallago et al., 2021).
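
The averaging behind the "Embed time (s)" row can be sketched as a small timing loop. This is only an illustration and not the benchmark code used for the table: `mean_embed_time` and `toy_embed` are hypothetical names, the three sequences are made up, and `toy_embed` is a stand-in that would be replaced by a real embedder (for example one loaded through bio-embeddings) and a PDB-derived sequence set.

```python
import time
from typing import Callable, Iterable


def mean_embed_time(embed: Callable[[str], object], sequences: Iterable[str]) -> float:
    """Return the mean wall-clock time in seconds that `embed` needs per sequence."""
    total = 0.0
    count = 0
    for seq in sequences:
        start = time.perf_counter()
        embed(seq)  # the result is discarded; only the elapsed time matters here
        total += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no sequences given")
    return total / count


def toy_embed(seq: str) -> list:
    # Hypothetical stand-in for a real protein language model:
    # maps each residue to a single number instead of an embedding vector.
    return [float(ord(residue)) for residue in seq]


# Made-up example sequences (assumption); the table used 10 000 PDB proteins.
sequences = ["MKTAYIAKQR", "GSHMLEDPVA", "MVLSPADKTN"]
avg = mean_embed_time(toy_embed, sequences)
print(f"mean embedding time: {avg:.6f} s")
```

Timing each call separately and dividing by the number of sequences keeps data loading out of the measurement. A real benchmark would also need to control for hardware, batching and sequence length, none of which this sketch addresses.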