2024 Dec 5;15:10627. doi: 10.1038/s41467-024-54812-y

Fig. 4. Tokenization schemes for RNA language models.

a Representation of nucleotides as tokens for single, paired, or triplet nucleotides. Tokens are encoded in 1-nucleotide steps, so paired and triplet tokens overlap. Beginning and end tokens are also included in the token library.

b Perplexity of RNA language models trained on 23S rRNA sequences, with the nanoGPT model modified to use an overall rotary positional embedding (RoPE), or with RoPE applied to each attention layer. Training with paired-nt tokenization and overall RoPE was conducted for 100,000 iterations, whereas the other models were trained for 1 M iterations; a batch size of 18 was used for all models. A perplexity of 4 corresponds to random choice (i.e. 4 nucleotides to choose from), and a perplexity of 1 indicates perfect certainty in nucleotide choice. Because the paired and triplet encodings advance in 1-nucleotide steps, the perplexity of a random model should be 4 regardless of the tokenization scheme.

c Perplexity of an RNA LM pretrained on 231 RNA sequence families in GARNET (Supplementary Data 1). The perplexity of an RNA LM finetuned on hyperthermophilic RNAs, starting from the pretrained general model, is 1.33. For detailed model parameters and training data statistics in panels (b) and (c), refer to Supplementary Data 3.

d Alignment of 23S rRNA sequences generated using the more general pretrained 231-RNA LM, showing the 3’ end of the generated sequences (n = 100).

e Alignment of 16S rRNA sequences generated using the more general pretrained 231-RNA LM, showing the 3’ end of the generated sequences (n = 100). Sequence generation in panels (d) and (e) was seeded with 100 nucleotides of E. coli 23S rRNA or 16S rRNA, respectively, with a temperature of 0.2. The bottom row is the E. coli sequence, and E. coli nucleotide numbering is also shown. White space shows regions where insertions and deletions are present in the sequences.
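The overlapping tokenization in panel a and the perplexity baseline in panel b can be illustrated with a minimal Python sketch. It is not the authors' code. The token names `<s>` and `</s>`, the function names, and the vocabulary layout are assumptions made for illustration. The sketch assumes that perplexity is computed as the exponential of the mean negative log-probability assigned to the true next token, which is the usual definition. It also shows one plausible reading of why a random model scores 4 for every scheme: with 1-nucleotide steps, consecutive tokens share all but one nucleotide, so only 4 next tokens are consistent with the previous one.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGU"
BEGIN, END = "<s>", "</s>"  # hypothetical names for the begin/end tokens


def build_vocab(k):
    """Map the begin/end tokens and every k-nucleotide string to an integer id."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    return {tok: i for i, tok in enumerate([BEGIN, END] + kmers)}


def tokenize(seq, k):
    """Slide a window of width k along seq in 1-nucleotide steps."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [BEGIN] + kmers + [END]


def next_candidates(prev_kmer):
    """Tokens that can follow prev_kmer: they share its last k-1 nucleotides."""
    return [prev_kmer[1:] + n for n in NUCLEOTIDES]


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the true next tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)


if __name__ == "__main__":
    print(tokenize("GGUUA", 1))
    print(tokenize("GGUUA", 2))
    print(tokenize("GGUUA", 3))
    print(len(build_vocab(3)))    # 4**3 triplets plus 2 special tokens
    print(next_candidates("GU"))  # only 4 triplet/pair tokens can follow
    # Uniform guessing over the 4 consistent candidates gives perplexity ~4.
    print(perplexity([0.25] * 10))
```

In this sketch, a model that assigns probability 1 to every true next token gets a perplexity of 1, matching the "perfect certainty" end of the scale described in panel b.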