Journal of Cheminformatics
. 2026 Jan 2;18:13. doi: 10.1186/s13321-025-01143-9

Optimizing SMILES token sequences via trie-based refinement and transition graph filtering

Sridhar Radhakrishnan 1,, Krish Mody 2, Arvind Venkatesh 3, Ananth Venkatesh 4
PMCID: PMC12866345  PMID: 41484928

Abstract

Tokenization plays a critical role in preparing SMILES strings for molecular foundation models. Poor token units can fragment chemically meaningful substructures, inflate sequence length, and hinder model learning and interpretability. Existing approaches such as SMILES Pair Encoding (SPE) and Atom Pair Encoding (APE) compress token sequences but often ignore domain-specific chemistry or fail to generalize to larger or more diverse molecules. We propose a domain-aware method for SMILES compression that combines frequency-guided substring mining using a prefix trie with an optional entropy-based refinement step using a token transition graph (TTG). On a corpus of 100,000 PubChem molecules, the Trie+TTG method reduces token sequences by more than 50% compared to APE while preserving chemically coherent substructures. The method generalizes effectively to large, out-of-distribution molecules, achieving compression rates of up to 90% with minimal sensitivity to molecule size. To assess downstream utility, we evaluate latent-space structure using unsupervised clustering and perform QSAR regression on ESOL. Trie+TTG produces more separable molecular representations and stronger predictive performance than Trie-only and APE. In addition, on peptide corpora, our method substantially outperforms SPE and the PeptideCLM tokenizer in compression and entropy metrics. These results show that combining trie-based mining with TTG refinement yields compact, stable, and chemically meaningful tokenizations suitable for modern molecular representation learning.

Scientific contributions: We present a trie-based framework that compresses SMILES sequences into shorter, chemically coherent units while guaranteeing lossless reconstruction. By incorporating a token transition graph for entropy-guided refinement, our method selects contextually stable merges that improve both compression efficiency and generalization. Unlike prior approaches such as APE and SPE, our tokenizer combines frequency and context awareness, yielding more compact, interpretable, and transferable molecular representations.

Keywords: SMILES tokenization, Trie-based compression, Token transition graph, Chemically aware representation, Molecular language modeling

Introduction

The Simplified Molecular Input Line Entry System (SMILES) [1] is a widely used linear notation for molecular graphs, encoding atoms, bonds, branches, rings, and stereochemistry in compact and human-readable strings. Due to its simplicity and compatibility with string-based processing, SMILES has become a foundational input format in cheminformatics, particularly for machine learning and deep learning pipelines.

Large language models (LLMs) have recently transformed molecular modeling by adapting transformer-based architectures, originally developed for natural language, to chemical sequences. Models such as MolBERT, ChemBERTa, and ChemGPT have used SMILES as input for tasks that include prediction of molecular properties, generation of de novo molecules, and retrosynthetic analysis [2, 3]. However, the performance of these models is closely tied to the quality of SMILES tokenization, which directly impacts sequence length, syntactic consistency, and semantic granularity.

Current tokenization strategies for SMILES face several challenges:

  • Excessive sequence length: Character-level tokenization results in long input sequences. Since transformer models scale quadratically with sequence length [4], this leads to increased computational cost and weaker modeling of long-range dependencies (e.g., distant ring closures).

  • Loss of chemical semantics: Generic subword tokenizers such as BPE [5] and SentencePiece [6] can split chemically meaningful units (for example, “Cl” into “C” + “l”), reducing interpretability and disrupting molecular structure encoding.

  • Static vocabularies: Domain-specific tokenizers like APE [3] and SPE [7] rely on frequency-based merge rules that produce fixed vocabularies, limiting adaptability to new chemical domains.

  • Lack of contextual awareness: Existing methods do not consider token adjacency or transition patterns, which can lead to merges that are frequent but chemically implausible.

Empirical studies have shown that tokenization quality significantly affects downstream model performance. For example, previous work has reported 2–5% differences in ROC-AUC on classification benchmarks such as HIV, Tox21, ClinTox, and ESOL [3, 8]. These differences arise purely from tokenization choices, highlighting the need for approaches that are both compressive and chemically faithful.

To address the limitations of existing SMILES tokenization approaches, which often produce fragmented or semantically unstable sequences, we propose a two-stage token compression framework that balances sequence compactness with chemical coherence.

In the first stage, we apply a trie-based compression algorithm that identifies frequent and chemically meaningful substrings from tokenized SMILES sequences and replaces them with synthetic tokens. This significantly reduces both the sequence length and the vocabulary size while preserving complete molecular information. In the second stage, we introduce the Token Transition Graph (TTG), a directed graph constructed from empirical token transition frequencies, which evaluates candidate merges based on contextual entropy. By filtering through the TTG, we retain only those substrings that exhibit high semantic stability across their usage contexts.

Our key contributions are as follows.

  • We introduce a trie-based SMILES compression method that preserves chemically coherent substructures while producing compact and interpretable token sequences.

  • We develop the Token Transition Graph (TTG), a context-sensitive refinement layer that uses empirical token adjacency and entropy filtering to eliminate unstable or semantically inconsistent merges.

  • We show that the method generalizes to large, out-of-distribution molecules, achieving compression rates of around 90% with minimal deviation from in-distribution behavior. Out-of-distribution molecules are drawn from ChEMBL and include structurally diverse compounds not used to construct the trie or TTG.

  • We benchmark against widely used methods (APE, SPE), observing consistent improvements in compression ratio, token stability, and structural coherence across both in-domain and out-of-distribution datasets. On peptide corpora, our approach substantially outperforms the PeptideCLM tokenizer.

  • We evaluate downstream usefulness through unsupervised clustering and lightweight supervised tasks, demonstrating that the refined token sequences yield more separable and task-relevant molecular representations than frequency-based or pairwise-merging baselines.

Although tokenization is fixed prior to model training, in molecular modeling its design strongly influences sequence structure, representation quality, and downstream performance. Our framework therefore treats token compression as an important modeling choice, rather than incidental preprocessing, by explicitly aligning the tokenization process with chemical structure and contextual stability.

This combination produces a more stable and coherent token vocabulary, reducing syntactic noise and improving the consistency of SMILES representations for modeling tasks.

The remainder of this paper is organized as follows. "Related work" section reviews prior approaches to SMILES tokenization and contextual compression. "Overview of compression framework" section introduces our overall compression framework and its integration into molecular string processing pipelines. "Problem statement and baseline analysis" section formalizes the token compression problem and presents a baseline analysis of existing methods. "Trie-based substring compression" section details our trie-based substring mining approach, which identifies frequent, chemically meaningful token sequences. "TTG-guided refinement of trie-based compression" section describes the Token Transition Graph (TTG) and shows how entropy-guided filtering refines the trie vocabulary to ensure contextual stability. "Experimental evaluation" section reports our experimental results, including compression performance, ablations, and downstream analyses. Finally, "Conclusion" section summarizes the key findings and outlines directions for future work.

Related work

SMILES tokenization approaches

SMILES (Simplified Molecular Input Line Entry System) strings have long served as the primary linear representation of molecules for machine learning applications. Traditional tokenization of SMILES relies on character-level splitting or rule-based schemes based on regular expressions, such as those used in RDKit. These approaches tokenize at the level of individual atoms, bonds, and symbols, but often result in excessively long sequences and do not capture recurring structural motifs [1].

To address this, researchers have adopted subword tokenization strategies from natural language processing. Byte Pair Encoding (BPE) [5] and SentencePiece [6] are widely used methods that segment text into frequent substrings through iterative merge operations. Although effective in NLP, these approaches are domain-agnostic and can split chemically significant groups (such as halogen atoms or ring closures) into syntactically legal but semantically incorrect pieces, for example splitting ‘Cl’ into ‘C’ and ‘l’.

We have summarized various tokenization methods in Table 1. Domain-specific tokenizers such as Atom Pair Encoding (APE) [9] and SMILES Pair Encoding (SPE) [3] have been proposed to preserve chemically valid token boundaries. APE starts with tokens at the atom level and merges adjacent tokens based on statistical frequency across the corpus, guided by chemical grammar rules [3]. SPE follows a similar idea, but incorporates frequent pairings of SMILES tokens, including brackets and ring notations [7]. These methods have shown improved performance in downstream tasks, such as molecular property prediction and classification, with evidence of better AUC scores on datasets such as HIV, BBBP, and Tox21.

Table 1.

Comparison of tokenization methods applied to SMILES strings

Feature BPE [5] SentencePiece [6] APE [9] SPE [3] Our trie-based compression
Designed for Natural language Natural language Molecular strings SMILES strings SMILES (chemical tokens)
Token unit Subword/character pairs Learned subwords or characters Atom-level tokens Frequent SMILES substrings Chemical tokens and frequent token sequences
Respects chemistry? No No Yes Partially Yes (via chemical base tokens)
Splits valid tokens? Yes (e.g., Cl → C + l) Yes No Sometimes No (base tokens preserved)
Merges based on Most frequent character pairs Statistical segmentation model Atom-pair co-occurrence Frequent SMILES substrings Frequent chemically meaningful sequences (trie-based)
Semantic integrity Often broken Often broken Preserved Mostly preserved Well preserved
Vocabulary structure Flat; lacks domain semantics Flat; lacks domain semantics Flat; co-occurrence driven Flat; data-driven Trie-based; reflects recurring chemical substructures
Compression benefit Moderate Moderate Moderate Moderate–High High (highest among methods tested)
Applicability to SMILES Weak Weak Strong Strong Strong

Recent work has also extended tokenization to handle alternative linear notations like SELFIES, which guarantee syntactic validity by design. Tokenizers such as APE-SELFIES aim to preserve semantic motifs while leveraging the syntactic guarantees of SELFIES [10]. However, even these approaches often use static vocabularies that do not generalize well across molecular libraries of different complexity.

Vocabulary compression and substructure mining

Reducing vocabulary size while maintaining expressiveness is a key goal in both language modeling and cheminformatics. In molecular representations, this amounts to identifying frequent substructures, such as functional groups, rings, or pharmacophores, and encoding them as single tokens.

Approaches such as fragment-based SMILES encoding, molecular fingerprint-based tokenization, and hierarchical vocabulary design have emerged to reduce sequence redundancy and improve generalization [11]. For example, some methods extract BRICS or RECAP fragments, which are known pharmacologically relevant substructures, and treat them as atomic units during model training. This form of substructure mining not only compresses sequences, but also improves the interpretability of learned embeddings. However, these methods have static vocabularies that are not tailored to specific datasets or corpora, and they are therefore not optimal for certain applications where this context is important.

Other methods adopt statistical or frequency-based strategies to identify frequent substrings or motifs. GraphBPE [12] applies BPE-like merging at the graph level, allowing the identification of frequently recurring atom bond fragments directly from molecular graphs. These methods highlight the importance of compressing input sequences without losing chemically relevant patterns.

Tries and prefix trees in NLP

Tries, or prefix trees, are hierarchical data structures commonly used in natural language processing for tasks such as auto-completion, dictionary lookup, and sub-word segmentation [13, 14]. Trie data structures are particularly effective for capturing and storing variable-length substrings while maintaining efficient prefix-based search.

In NLP, trie structures have been applied to construct subword vocabularies, such as in the construction of BPE vocabularies, where each merge operation corresponds to a path extension in a trie. In unsupervised segmentation tasks, tries can be used to encode frequent token sequences discovered in corpora, providing a data-efficient way to compress inputs by replacing common substrings with synthetic tokens [13]. Despite their utility in NLP, tries have not been widely adopted in molecular tokenization frameworks, where the prefix structure and statistical frequency could be exploited to identify common SMILES substrings in a scalable and interpretable manner.

Graph-based methods in molecular modeling

Beyond string-based representations, many modern approaches model molecules directly as graphs, where atoms are nodes, and bonds are edges. Graph Neural Networks (GNNs), such as Message Passing Neural Networks (MPNNs) and Graph Attention Networks (GATs), have become dominant in tasks such as molecular property prediction, molecular docking, and synthesis planning [15, 16].

Graph-based tokenization has also emerged, aiming to bridge the gap between sequence models and structural representations. For example, GraphBPE [12] adapts byte pair encoding to molecular graphs, extracting frequently recurring atom-bond subgraphs to use as compressed tokens. These tokens are then used as input to transformer models, improving both accuracy and computational efficiency (Table 1).

Despite these advances, most tokenization schemes, even graph-based ones, focus on frequency rather than transition dynamics. The role of the token transition probability, which encodes the contextual likelihood of token adjacency, remains underexplored. Our work addresses this gap by combining substring mining with probabilistic modeling of token transitions, capturing both structure and context in the tokenization process.

Overview of compression framework

Our token compression framework transforms tokenized SMILES strings into shorter and more chemically coherent sequences suitable for molecular language models. It consists of a two-stage pipeline: (1) a trie-based substring mining stage that identifies frequently occurring token patterns, and (2) an entropy-guided refinement stage using a Token Transition Graph (TTG) that retains only contextually stable substrings as synthetic tokens (a substring is contextually stable if its transitions have low entropy, meaning it tends to appear in similar environments across the corpus).

The system operates on a dataset of SMILES strings and proceeds as follows:

  1. Canonicalization: All SMILES strings are converted to a canonical form to ensure consistency across enumeration variants and to standardize the input representation. Throughout this paper, when we refer to SMILES, we mean its canonical form.

  2. SMILES Tokenization: Each molecule is tokenized using an atom-aware scheme that yields chemically interpretable units, including atoms (e.g., “C”, “O”), bonds (e.g., “=”), branches (“(”, “)”), and ring closures (“1”, “2”, and so on).

  3. Trie Construction and Substring Mining: A prefix trie is built over the tokenized corpus to enumerate all substrings and their frequencies. Substrings exceeding a minimum frequency threshold are retained as initial candidates for compression.

  4. Token Transition Graph (TTG) Scoring: To assess the contextual stability of each candidate substring (that is, any token sequence extracted from the trie that occurs at least δ times in the corpus), a TTG is constructed that records weighted token-to-token transitions based on empirical co-occurrence. Candidates are evaluated using average transition entropy, and only those with low-entropy, stable transitions are selected for inclusion in the compressed vocabulary.

  5. Token Replacement and Vocabulary Update: Each validated substring is assigned a synthetic token and substituted throughout the corpus. The vocabulary is updated accordingly, and tokenized SMILES strings are rewritten using the compressed representation.

  6. Model Input Preparation: The resulting compressed sequences serve as inputs to molecular language models for tasks such as property prediction, molecule generation, and reaction or retrosynthesis modeling.

The modular pipeline shown in Fig. 1 ensures that compression is both frequency-sensitive and context-sensitive, producing token sequences that are not only shorter, but also chemically coherent. By separating substring discovery (through the trie) from contextual validation (via the TTG), the framework achieves efficient compression without sacrificing generalization.
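To make the TTG refinement step concrete, the following Python sketch builds a transition graph from a tokenized corpus and filters candidate substrings by mean transition entropy. The function names and the mean-entropy acceptance criterion are illustrative assumptions, not the authors' implementation; the paper's exact scoring rule may differ.

```python
import math
from collections import Counter, defaultdict

def build_ttg(corpus):
    """Token Transition Graph: empirical next-token counts for each token."""
    ttg = defaultdict(Counter)
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            ttg[a][b] += 1
    return ttg

def transition_entropy(ttg, token):
    """Shannon entropy (bits) of a token's outgoing transition distribution."""
    counts = ttg[token]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_stable(ttg, candidate, max_entropy):
    """Illustrative criterion: keep a candidate substring when the mean entropy
    of its internal transition sources falls below a threshold."""
    ents = [transition_entropy(ttg, tok) for tok in candidate[:-1]]
    return sum(ents) / len(ents) <= max_entropy
```

A token that is always followed by the same successor has entropy 0 and is maximally stable; a token with many equally likely successors has high entropy and would cause its candidate substrings to be filtered out.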

Fig. 1.


Overview of the two-stage compression pipeline. Raw SMILES strings are first canonicalized and tokenized. A trie-based substring mining step then identifies frequent candidates, which may optionally be refined using the Token Transition Graph (TTG) to retain only contextually stable substrings before substitution with synthetic tokens and use as compressed inputs to downstream models

The framework is compatible with both pretraining and fine-tuning pipelines for chemical language models and can be extended to alternative molecular representations such as SELFIES. Its modular architecture supports future integration of additional compression strategies or domain-specific filters.

Problem statement and baseline analysis

Large-scale chemical foundation models increasingly rely on corpora composed of tokenized SMILES strings. These tokenized sequences are often lengthy and contain recurring substructures that reflect common molecular motifs. Compressing such sequences without loss of semantic fidelity could reduce the memory and computational overhead for transformer-based models while also improving convergence properties. The principal challenge lies in identifying these high-frequency token substrings in a way that preserves the chemical structure and meaning.

Let C = {T1, T2, …, Tn} be a corpus of tokenized SMILES strings, where each Ti is a sequence of tokens drawn from a vocabulary Σ. Let K denote the maximum substring length considered during the compression step. The goal is to develop a framework that, given a corpus C, a maximum length K, and a frequency threshold δ, identifies token substrings of length k ∈ [3, K) that occur with frequency at least δ and replaces them with synthetic tokens. The compressed sequences should be fully reversible and preserve the original molecular semantics.

A straightforward baseline for this problem is a brute-force algorithm that, for each string in the corpus, extracts all substrings of fixed length k and counts their occurrences using a map-based data structure. The algorithm iterates over every window of size k within a token sequence and updates a frequency map that tracks substring counts. This frequency map is then used to identify substrings that exceed the frequency threshold and are candidates for substitution.

Algorithm 1. BruteForceCountSubstrings
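A minimal Python sketch of this brute-force baseline; `brute_force_count_substrings` and `frequent_substrings` are hypothetical names standing in for Algorithm 1 and its thresholding step.

```python
from collections import Counter

def brute_force_count_substrings(corpus, k):
    """Slide a window of size k over every token sequence and count
    each substring in a hash-map-backed Counter (Algorithm 1 sketch)."""
    counts = Counter()
    for tokens in corpus:
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1  # tuples are hashable map keys
    return counts

def frequent_substrings(counts, delta):
    """Keep only substrings meeting the frequency threshold delta."""
    return {s: c for s, c in counts.items() if c >= delta}
```

Running this once per value of k in [3, K) reproduces the redundant work the text describes: every window is re-extracted and re-hashed for each length.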

The time complexity of this approach depends on the implementation of the underlying map. Extracting a substring takes O(k) time, and the loop executes M - k + 1 times for a string of length M. If a hash map is used, the average-case complexity of lookup and insertion is constant, giving an expected overall complexity of O((M - k + 1) · k). In the worst case, hash collisions degrade performance to linear time per operation. Tree-based maps, by contrast, incur O(log d) time per operation, where d is the number of distinct substrings seen. Although this increases predictability and allows sorted traversal, it comes at the cost of additional overhead per insertion and query. Both hash maps and tree maps require O(d · k) space to store all observed substrings and their frequencies.

Although the brute-force approach is easy to implement and performs well on modestly sized datasets, it becomes inefficient as the number of substrings grows or when multiple lengths of substrings must be considered. In such cases, map-based counting leads to redundant computation and high memory usage. Furthermore, when the corpus contains millions of tokenized sequences and the range of k spans many values, the cumulative cost of substring extraction and frequency counting becomes a bottleneck.

To evaluate and compare map implementations, we observe that hash maps (e.g., dict in Python or unordered_map in C++) provide fast average-case insertion and lookup but do not preserve key order and can be sensitive to the input distribution. In contrast, tree maps (e.g., TreeMap in Java or map in C++) maintain keys in sorted order and offer logarithmic time guarantees, though they tend to be slower in practice due to tree traversal overhead. A summary of these trade-offs is presented in Table 2.

Table 2.

Comparison between hash map and tree map performance in substring counting

Feature Hash map Tree map
Insert / Lookup (Average) O(1) O(log d)
Insert / Lookup (Worst) O(n) O(log d)
Maintains key order No Yes
Sensitive to input distribution Yes No
Space usage O(d · k) O(d · k)

In summary, while the brute-force method serves as a useful reference point, its limitations in scalability and memory efficiency motivate the need for a more structured approach. In particular, the repeated and prefix-aligned nature of substrings within tokenized SMILES suggests that prefix-based data structures such as tries could offer substantial performance gains by eliminating redundant computation and enabling efficient lookup and aggregation across multiple substring lengths.

Trie-based substring compression

To enable scalable token compression across large SMILES corpora, we introduce a trie-based method for identifying and replacing frequently occurring token subsequences. A trie (prefix tree) compactly encodes all substrings of a corpus by sharing common prefixes, making it well suited for discovering reusable patterns of varying lengths while preserving chemical token boundaries. This allows the construction of shorter, more coherent token sequences without loss of information, ensuring full recoverability of the original SMILES.

Overview of this section. The trie-based compression framework proceeds through four stages:

  1. Token Stream Generation (Token Stream Generation section): Raw or canonicalized SMILES strings are converted into chemically meaningful tokens such as atoms, bonds, branches, and ring indices, forming the alphabet used throughout the pipeline.

  2. Trie Construction (Trie Construction section): A token-level prefix trie is built over the corpus, indexing all substrings up to a maximum length K and recording their frequency counts. This structure compactly captures recurring token patterns.

  3. Substring Enumeration  (Substring enumeration section): The trie is traversed to extract high-frequency substrings, which are then ranked by length and frequency. These represent reusable chemical substructures that can be encoded as single units.

  4. Substring Replacement and Vocabulary Update (Substring Replacement and Vocabulary Update Section): Substrings exceeding a frequency threshold are assigned synthetic tokens (e.g., <R1>), and each SMILES string is rewritten using a replacement trie. The vocabulary is expanded accordingly, producing a fully reversible compressed representation.

Together, these stages form a lossless, chemically grounded compression method that identifies recurrent structural motifs and replaces them with interpretable synthetic tokens.

In Table 3, we show examples from a synthetic corpus for illustrative purposes, listing each molecule's SMILES, its tokenized sequence, and its trie-compressed form. The substrings replaced by synthetic tokens, together with their frequencies in the synthetic corpus, are shown in Table 4.

Table 3.

Trie-based compression of tokenized SMILES strings from selected molecules

Molecule SMILES Tokenized sequence Trie compression Tokens (before / after)
Acetic acid CC(=O)O [C, C, (, =, O, ), O] [<R1>, O] 7 / 2
Vinyl bromide chloride ClC=CBr [Cl, C, =, C, Br] [<R2>, C, Br] 5 / 3
Benzene C1=CC=CC=C1 [C, 1, =, C, C, =, C, C, =, C, 1] [<R4>] 11 / 1
1-Chloropropane CCCl [C, C, Cl] [<R5>] 3 / 1
Glycine C(C(=O)O)N [C, (, C, (, =, O, ), O, ), N] [<R1>, ), N] 10 / 3
Chloroacetic acid ClCC(=O)O [Cl, C, C, (, =, O, ), O] [<R5>, <R1>, O] 8 / 3
Cyclohexane C1CCCCC1 [C, 1, C, C, C, C, C, 1] [<R6>] 8 / 1

Tokenized sequences follow chemically informed token boundaries. The <Rn> tokens are synthetic representations of frequent substrings derived from a simulated corpus of approximately 100,000 molecules

Table 4.

Synthetic tokens and the substrings they replace, with their observed frequencies in a simulated 100,000-molecule corpus

Synthetic token Replaced substring Corpus frequency
<R1> [C, (, =, O, )] 7,200
<R2> [Cl, C, =] 3,900
<R3> [Cl, C, =, C, Br] 3,400
<R4> [C, 1, =, C, C, =, C, C, =, C, 1] 3,800
<R5> [C, C, Cl] 6,400
<R6> [C, 1, C, C, C, C, C, 1] 5,600

These frequencies guide the construction of the compressed vocabulary in the trie-based framework

Token stream generation

Conventional trie-based substring mining methods assume a character-level alphabet, which is appropriate for natural languages or genomic sequences. However, this assumption breaks down in the case of SMILES, where tokens are not single characters but chemically meaningful units. These include multicharacter atoms (e.g., “Cl”, “Br”), bracketed species such as “[C@H]” or “[Cl-]”, bond types, stereochemical markers, and ring closure digits. A naive character-level trie fails to respect these boundaries and would result in inconsistent or chemically invalid substring merges.

To construct a token-level trie, we first tokenize the SMILES corpus using a chemically informed regular expression pattern. This tokenization identifies bracketed atoms, multiletter elements, and symbols such as bond indicators and ring numbers. The resulting vocabulary forms a domain-specific alphabet, Σ, over which all token sequences are defined. Thus, each SMILES string is transformed into a sequence of tokens Ti = [σ1, σ2, …, σm], where each σj ∈ Σ.

Tokenization proceeds by scanning each SMILES string and applying the following rules in order of precedence: (1) bracketed atoms are matched as atomic units, (2) common two-letter atom symbols such as “Cl” and “Br” are recognized, (3) single-letter atoms are extracted, (4) ring digits are matched as standalone tokens, and (5) structural and stereochemical characters such as “=”, “#”, “+”, “-”, “(”, “)”, “/”, “\”, and “@” are treated as individual tokens. This ensures that all valid SMILES tokens are uniquely and consistently represented across the corpus.

Because our compression operates exclusively on these chemically valid SMILES tokens, every compressed sequence can be losslessly decompressed to a syntactically valid SMILES, preserving the original molecular representation.
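The precedence rules above can be realized as a single alternation-ordered regular expression. The pattern below is an illustrative assumption, not the authors' exact pattern; it covers bracketed atoms, the two-letter halogens, common single-letter atoms, ring digits, and the structural symbols named in the text.

```python
import re

# Alternation order encodes precedence: bracketed atoms first,
# then two-letter elements, so "Cl" is never split into "C" + "l".
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"       # bracketed atoms, e.g. [C@H], [Cl-]
    r"|Cl|Br"            # common two-letter atom symbols
    r"|[BCNOPSFI]"       # single-letter organic-subset atoms
    r"|[bcnops]"         # aromatic lowercase atoms
    r"|%\d{2}|\d"        # ring-closure labels
    r"|[=#+\-()/\\@.])"  # bonds, branches, stereo, disconnection
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Lossless check: the tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

The reconstruction assertion makes the losslessness claim operational: any character the pattern cannot account for raises an error instead of being silently dropped.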

Algorithm 2. ExtractSMILESAlphabet

To illustrate the trie-based compression process, Table 3 presents several representative molecules, their chemically tokenized SMILES sequences, and the compressed outputs resulting from frequent substring substitution. The tokenized sequences are processed into a prefix trie, where frequently occurring token subsequences are identified and assigned synthetic symbols such as <R1> and <R5>. Figure 2 shows a partial view of the resulting trie, with node frequencies scaled down for clarity.

Fig. 2.


A partial token-level trie structure for frequent SMILES substrings found in selected molecules. Node labels represent frequency counts, scaled down by approximately a factor of 100 to improve visual clarity (e.g., a frequency of 7,200 in the corpus is shown as 72). Synthetic tokens such as <R1> and <R5> are inserted when substrings exceed the frequency threshold in the corpus-wide token analysis

Trie construction

Once the SMILES strings have been tokenized using a chemically meaningful vocabulary, we construct a trie over these token sequences to enable efficient tracking of substrings. Unlike character-level tries commonly used in natural language processing or bioinformatics, this trie operates over complete SMILES tokens. As a result, semantically meaningful units such as Cl, Br, or [C@H] are preserved throughout traversal and indexing.

Each node in the trie contains a fixed-size array of child pointers, with one entry for each token in the alphabet Σ. A counter is associated with each node to record the number of times a given token prefix has appeared in the training corpus. The tokens are assigned to the positions of the array via an integer-based indexing function, allowing O(1) time access during both insertion and traversal.

Figure 2 illustrates a partial token-level trie constructed from a chemically tokenized corpus. Note that the counts in the trie are scaled down for illustration; actual corpus frequencies are shown in Table 4. The figure shows frequently occurring substrings such as [C, (, =, O, )] (extracted from acetic acid, glycine, and chloroacetic acid) and [C, C, Cl] (from 1-chloropropane and chloroacetic acid). Each node maintains a count of how often its token sequence appears as a prefix. Substrings above the threshold are replaced with synthetic tokens such as <R1> and <R5> in the compressed vocabulary.

To construct the trie, we iterate through each token sequence and insert all contiguous substrings of length up to K. That is, for each starting position i in a token sequence T, we insert the substrings T[i:i+1],T[i:i+2],,T[i:i+K] into the trie. Each insertion follows a path along existing nodes or creates new ones as needed, incrementing node-level counters to reflect the observed frequency. Algorithm 3 outlines this insertion process.

Algorithm 3. BuildIndexedTokenTrie
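A minimal Python sketch of the insertion process in Algorithm 3. For brevity it uses a dictionary of children rather than the fixed-size indexed array described above; the per-node counting behavior is the same.

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # token -> TrieNode (stands in for the indexed array)
        self.count = 0      # frequency of the token prefix ending at this node

def build_token_trie(corpus, K):
    """Insert every contiguous substring of length <= K (Algorithm 3 sketch)."""
    root = TrieNode()
    for tokens in corpus:
        for i in range(len(tokens)):
            node = root
            for tok in tokens[i:i + K]:  # extend the path up to K tokens
                node = node.children.setdefault(tok, TrieNode())
                node.count += 1
            # inserting T[i:i+K] increments every shorter prefix on the way down,
            # so all substring lengths 1..K are counted in one pass
    return root
```

Because each window insertion touches at most K nodes, the construction cost matches the O(N · K) bound stated later in the section.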

Substring enumeration

With the trie constructed, we can now efficiently enumerate frequent substrings of a desired length k. Each path of length k from the root corresponds to a token substring that occurs in the training corpus. The count associated with the terminal node of the path represents the frequency of that substring.

To collect all such substrings, we perform a depth-first traversal of the trie, branching along non-null children and accumulating token paths. Algorithm 4 shows the recursive enumeration routine, which stores substrings of exactly length k along with their observed frequency.

Algorithm 4. CollectSubstringsOfLengthK

In addition to enumeration, the trie supports direct frequency querying of any specific token substring using a simple traversal, shown in Algorithm 5. This facilitates downstream frequency filtering operations, where we discard low-frequency substrings that fall below a user-defined threshold δ.

Algorithm 5. QueryIndexedTokenTrie
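Both routines can be sketched over a minimal nested-dict trie (an illustrative simplification of the indexed-array nodes described earlier; the "#" key holds the prefix count):

```python
def build_trie(sequences, K):
    """Minimal nested-dict token trie; the "#" key stores the prefix count."""
    root = {}
    for tokens in sequences:
        for i in range(len(tokens)):
            node = root
            for token in tokens[i:i + K]:
                node = node.setdefault(token, {"#": 0})
                node["#"] += 1
    return root

def collect_substrings_of_length_k(node, k, prefix=(), out=None):
    """Depth-first traversal collecting (substring, frequency) pairs
    for substrings of exactly length k (Algorithm 4 sketch)."""
    if out is None:
        out = {}
    if k == 0:
        out[prefix] = node["#"]
        return out
    for token, child in node.items():
        if token != "#":
            collect_substrings_of_length_k(child, k - 1, prefix + (token,), out)
    return out

def query(root, substring):
    """Frequency of one specific substring by direct traversal (Algorithm 5 sketch)."""
    node = root
    for token in substring:
        if token not in node:
            return 0
        node = node[token]
    return node["#"]

corpus = [["C", "C", "(", "=", "O", ")", "O"], ["C", "C", "C", "Cl"]]
root = build_trie(corpus, K=3)
freqs = collect_substrings_of_length_k(root, 2)
```

A missing path simply returns frequency zero, which is what the downstream filtering step expects for unseen substrings.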

The total time complexity of construction is O(N·K), where N is the total number of tokens in the corpus. Enumeration of all length-k substrings takes O(d_k·k) time, where d_k is the number of such substrings. Queries on specific substrings are executed in O(k) time.

The enumeration phase produces a frequency-ranked list of substrings suitable for compression. The next section introduces a context-sensitive mechanism for refining these candidates based on co-occurrence and transition likelihood.

Substring replacement and vocabulary update

One practical application of analyzing frequent substrings is compression: reducing the overall space required to represent tokenized SMILES strings. This is achieved by identifying high-frequency substrings (those occurring more than a frequency threshold δ) and replacing them with newly defined synthetic tokens not present in the original alphabet Σ. This dictionary-based compression encodes repeated patterns using shorter symbols, improving efficiency in both storage and downstream processing.

Motivation

Let T_1, T_2, …, T_n be tokenized SMILES strings and let K be the maximum substring length considered. Substrings of length k ∈ [3, K) that occur with frequency ≥ δ are selected as compression candidates. These substrings are replaced with compact tokens (e.g., <R1>, <R2>), producing a compressed corpus where frequent chemical fragments are encoded more efficiently.

The overall process involves the following.

  • Collecting substrings from the trie

  • Filtering substrings based on frequency threshold δ

  • Sorting substrings by decreasing frequency

  • Assigning replacement tokens to the top substrings

  • Rewriting each SMILES string using a compression pass
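Steps two through four can be sketched in Python; `filter_and_sort_substrings` and the toy frequency table below are illustrative, not the paper's implementation (ties are broken here by preferring longer substrings, one reasonable choice that the text does not specify):

```python
def filter_and_sort_substrings(freqs, delta):
    """Keep substrings occurring at least `delta` times, most frequent first
    (Algorithm 6 sketch); ties broken by preferring longer substrings."""
    kept = [(sub, f) for sub, f in freqs.items() if f >= delta]
    kept.sort(key=lambda item: (-item[1], -len(item[0])))
    return kept

def assign_replacement_tokens(ranked):
    """Map each retained substring to a synthetic token <R1>, <R2>, ..."""
    return {sub: f"<R{i}>" for i, (sub, _f) in enumerate(ranked, start=1)}

# Hypothetical substring frequencies, for illustration only.
freqs = {("C", "(", "=", "O", ")"): 40, ("C", "C", "Cl"): 12, ("N", "="): 2}
ranked = filter_and_sort_substrings(freqs, delta=3)
mapping = assign_replacement_tokens(ranked)
```

The resulting `mapping` is exactly the substring-to-token dictionary that the vocabulary-extension step and the later decompression step rely on.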

Filtering and sorting frequent substrings

To identify meaningful and reusable token sequences, we filter the substrings collected from the indexed trie based on a minimum frequency threshold. This ensures that only those substrings that occur with sufficient regularity in the tokenized corpus are retained for further consideration. Once filtered, the substrings are sorted in descending order of frequency to prioritize highly prevalent patterns for replacement with synthetic tokens. The following algorithm outlines this frequency-based pruning and prioritization process.

Algorithm 6. FilterAndSortSubstrings

Vocabulary extension with synthetic tokens

Each high-frequency substring is mapped to a new replacement token of the form <R1>, <R2>, and so on, where these tokens are disjoint from the original vocabulary Σ. These synthetic tokens are inserted into the updated vocabulary Σ′ = Σ ∪ {<R1>, <R2>, …}. The mapping from substrings to replacement tokens is maintained in a dictionary to support reconstruction and interpretability.

Compression pass using replacement trie

To efficiently rewrite the tokenized strings, we construct a replacement trie using the selected substrings and their assigned tokens. Each path from root to leaf in this trie represents a high-frequency substring, with the leaf node holding the corresponding replacement token.

This structure enables longest-prefix matching at each position in a SMILES string, allowing greedy and non-overlapping replacement during compression.

Algorithm 7. CompressUsingReplacementTrie

Handling overlapping substrings

During replacement, care must be taken to avoid overlapping substitutions. We adopt a greedy left-to-right strategy that always selects the longest matching substring at any position. Once a match is found, all tokens in that match are skipped in the next iteration. This ensures non-overlapping compression and simplifies downstream parsing.
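The replacement-trie construction and the greedy, non-overlapping pass just described can be sketched as follows; the substring-to-token mapping is hypothetical:

```python
def build_replacement_trie(mapping):
    """Nested-dict trie whose "$" leaf marker stores the replacement token."""
    root = {}
    for substring, rtoken in mapping.items():
        node = root
        for token in substring:
            node = node.setdefault(token, {})
        node["$"] = rtoken
    return root

def compress(tokens, root):
    """Greedy left-to-right longest-prefix replacement (Algorithm 7 sketch)."""
    out, i = [], 0
    while i < len(tokens):
        node, best_len, best_tok = root, 0, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$" in node:                      # record the longest match so far
                best_len, best_tok = j - i + 1, node["$"]
        if best_tok is not None:
            out.append(best_tok)
            i += best_len                        # skip all matched tokens
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical mapping for illustration.
mapping = {("C", "(", "=", "O", ")"): "<R1>", ("C", "C", "Cl"): "<R2>"}
trie = build_replacement_trie(mapping)
```

Because `i` jumps past every matched token, no position is consumed by two replacements, which is the non-overlapping guarantee described above.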

Compression complexity and outcome

Let M denote the number of tokens in a tokenized SMILES string, and let B represent the number of substrings selected for replacement. Constructing the replacement trie requires inserting each selected substring into the trie, with each insertion taking time proportional to the substring length. If k is the average length of the substrings, the total time complexity for building the replacement trie is O(B·k). The same bound applies to the space complexity, as each node in the trie stores pointers corresponding to tokens in the substrings and potentially a replacement token at the leaves.

During the compression pass, the algorithm scans the token sequence from left to right, performing longest-prefix matches against the replacement trie. In the worst case, each token may initiate a match attempt that descends to depth K, where K is the maximum length of the substring considered. Therefore, the overall time complexity for this step is O(M·K). This is efficient in practice, due to early pruning in the trie and the greedy matching strategy that avoids overlapping replacements.

After this replacement phase, SMILES strings are compressed by encoding frequent substrings as single synthetic tokens such as <R1>, <R2>, and so on. This reduces the number of tokens in each sequence, which directly improves the efficiency of downstream tasks such as string comparison, indexing, and neural model training. The compressed vocabulary, denoted Σ′, includes both the original tokens and the newly introduced synthetic symbols. A mapping between each synthetic token and its corresponding original substring is maintained to allow decompression and interpretability. This compact representation serves as a refined input for subsequent modeling steps, including transition graph construction and foundation model integration.

TTG-guided refinement of trie-based compression

Overview of the Token Transition Graph (TTG)

The Token Transition Graph (TTG) is a directed, weighted graph that captures the statistical co-occurrence of token transitions within a corpus of tokenized SMILES strings. Each node in the TTG corresponds to a token from the vocabulary, and each directed edge between two tokens represents an observed transition, annotated with its frequency. This graph encodes sequential dependencies within SMILES sequences and allows us to quantify the contextual stability of substrings.

To build the TTG, each SMILES string is padded with special START and END tokens. Then, for every adjacent token pair in the sequence, an edge is inserted or updated in the graph.

Formally, given a corpus C = {T_1, T_2, …, T_n}, where each T_i = [t_1^(i), t_2^(i), …, t_(m_i)^(i)] is a tokenized SMILES string, we construct a directed graph G = (V, E) such that

V = Σ ∪ {START, END}

and for each transition t_j^(i) → t_(j+1)^(i), we increment the weight w(t_j^(i), t_(j+1)^(i)) by 1. In addition, we insert transitions START → t_1^(i) and t_(m_i)^(i) → END for each sequence in the corpus.
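The construction reduces to counting adjacent token pairs; an illustrative sketch on a toy corpus:

```python
from collections import Counter

def build_ttg(corpus):
    """Directed TTG as a Counter mapping (src, dst) -> transition weight,
    with START/END padding as described above."""
    weights = Counter()
    for tokens in corpus:
        padded = ["START"] + list(tokens) + ["END"]
        for a, b in zip(padded, padded[1:]):
            weights[(a, b)] += 1
    return weights

corpus = [["C", "C", "O"], ["C", "O"]]
ttg = build_ttg(corpus)
```

Each edge weight is simply the number of times the corresponding adjacent pair occurs across all padded sequences.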

A representative portion of the TTG is shown in Fig. 3, demonstrating the co-occurrence patterns and filtered transitions that later drive the compression and generalization behaviors reported in the experiments.

Fig. 3.

A subset of the Token Transition Graph (TTG) illustrating statistically stable transitions retained after entropy-based pruning. Nodes represent high-entropy tokens; directed edges represent conditional transition probabilities P(R_j | R_i) estimated from the corpus. Only stable, high-confidence transitions are shown

Entropy-guided transition scoring

The TTG provides a mechanism to assess the contextual predictability of token transitions. For each token x, we define the outgoing transition probability as

P(y | x) = w(x → y) / Σ_z w(x → z)

and the transition entropy of token x as

H(x) = −Σ_y P(y | x) · log P(y | x)

Low entropy indicates that a token is followed by a narrow set of next tokens with high confidence, suggesting contextual consistency. Such transitions are desirable for compression. High-entropy transitions, by contrast, occur in variable or ambiguous contexts and are likely to degrade semantic precision when merged.
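These quantities are cheap to compute from the TTG's edge weights; a sketch with toy weights, using base-2 logarithms so that entropy is expressed in bits:

```python
import math
from collections import Counter

def transition_probs(weights, x):
    """P(y | x): outgoing edge weights of x normalized to probabilities."""
    out = {y: w for (a, y), w in weights.items() if a == x}
    total = sum(out.values())
    return {y: w / total for y, w in out.items()}

def transition_entropy(weights, x):
    """H(x) = -sum_y P(y|x) log2 P(y|x); low values mean x has predictable successors."""
    return -sum(p * math.log2(p) for p in transition_probs(weights, x).values())

# Toy TTG edge weights: "C" is followed by "O" three times and by "C" once.
weights = Counter({("C", "O"): 3, ("C", "C"): 1, ("O", "END"): 4})
```

With these weights, "O" has a single successor and therefore zero entropy, while "C" splits 3:1 between two successors and has entropy just above 0.8 bits.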

Two-stage compression pipeline: trie followed by TTG filtering

We propose a two-stage compression pipeline that combines the high recall of a frequency-based trie with the semantic precision of entropy-aware filtering via the TTG.

In the first stage, a trie is constructed over the tokenized SMILES corpus, indexing all substrings and their frequencies. A frequency threshold f_min is applied to retain only those substrings that occur frequently enough to be worth considering for compression. This matches the filtering step of the trie algorithm described previously (with δ = f_min).

In the second stage, a TTG is built from the same corpus. Each substring retained from the trie is analyzed using the TTG to evaluate its contextual stability. For a given substring, we compute the average entropy across its transitions. Only substrings with an average transition entropy below a threshold H_max are retained in the final vocabulary. Note that the role of TTG-based filtering within the overall two-stage pipeline is summarized in Fig. 1.

Algorithm 8. TTG-Guided Refinement of Trie-Based Compression
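A condensed sketch of the second-stage filter (an illustration, not the paper's implementation): each frequency-filtered candidate is kept only when its average transition entropy stays below H_max. The per-token entropies and frequencies below are hypothetical toy values, and averaging per-token entropies is one simple way to instantiate the "average entropy across transitions" criterion.

```python
def ttg_filter(candidates, token_entropy, h_max):
    """Keep substrings whose average transition entropy is below h_max.
    `token_entropy` maps each token to its TTG entropy H(x)."""
    kept = {}
    for substring, freq in candidates.items():
        avg_h = sum(token_entropy[t] for t in substring) / len(substring)
        if avg_h < h_max:
            kept[substring] = freq
    return kept

# Toy inputs: trie-stage frequencies and per-token entropies (hypothetical values).
candidates = {("C", "(", "=", "O", ")"): 40, ("C", "N", "C"): 15}
token_entropy = {"C": 2.0, "(": 1.0, "=": 0.5, "O": 1.5, ")": 2.5, "N": 8.0}
stable = ttg_filter(candidates, token_entropy, h_max=3.5)
```

Here the carboxyl-like fragment survives (average entropy 1.5), while the second candidate is discarded because its high-entropy middle token pushes the average above the threshold.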

This refinement ensures that the selected substrings are not only frequent but also semantically coherent and chemically meaningful. Algorithm 8 summarizes the complete two-stage refinement procedure, combining frequency-based substring extraction from the trie with entropy-guided filtering using the Token Transition Graph. An example of this is presented in Fig. 4.

Fig. 4.

A partial expansion of the token transition graph, starting from a base token [14C], one of the synthetic tokens generated after training the trie compressor. The edge with weight 1 from this token to its child indicates that the token ( always follows the base token. However, the children of this token are more distributed, and the edges represent the probability of encountering each of these children immediately after (. The probabilities do not sum to one since we have excluded the remaining possible follow-up tokens. In theory, we could extend this tree to arbitrary depth by enumerating all transition probabilities of the last layer and continuing this process

Benefits of TTG-guided refinement

This two-stage approach combines the best aspects of frequency-based and context-aware compression. The trie enables efficient discovery of reusable substrings that appear frequently across the corpus. The TTG then acts as a semantic filter, pruning candidates that arise in inconsistent or noisy contexts. As a result, the final vocabulary contains synthetic tokens that are both compressive and robust, improving the efficiency of the training of the downstream model while preserving chemical fidelity.

We emphasize that syntactic similarity in SMILES is not expected to imply chemical similarity. SMILES encodes molecules through arbitrary traversal orders, parentheses, and ring indices, none of which reflect chemical structure directly. Our TTG-guided tokenization therefore does not attempt to preserve syntactic neighborhoods. Instead, its purpose is to avoid anti-chemical behavior, i.e., situations where chemically similar molecules are pushed unnaturally far apart by syntactic artifacts. Our pipeline focuses on stability, entropy, and context-consistent substring selection, rather than enforcing syntactic similarity as a proxy for chemical similarity.

Distinction from standalone methods

Purely frequency-driven approaches (e.g., trie or BPE) often overmerge tokens that appear frequently but lack semantic cohesion. Conversely, entropy-based filters without a trie may fail to discover common substructures due to context variability. Our combined approach avoids both of these pitfalls. The trie captures a wide range of compressible units, and the TTG refines this set by enforcing consistency constraints.

Thus by guiding compression with an entropy-informed context analysis, the TTG supports the creation of a cleaner, more interpretable vocabulary that aligns with chemical regularity. This vocabulary is better suited for generalization across molecular datasets and improves the token efficiency of foundation models trained on SMILES strings.

Experimental evaluation

We evaluated our trie-based token compression framework through a series of experiments designed to assess both its compression efficiency and its ability to generalize across molecular datasets. These experiments compare our approach with state-of-the-art molecular tokenizers, including Atom Pair Encoding (APE) and SMILES Pair Encoding (SPE), using standard benchmarks and metrics. Specifically, we analyze performance on in-distribution data (PubChem) as well as out-of-distribution data (ChEMBL), measuring mean token count, variance, fertility, and normalized entropy. We further assess how the integration of the Token Transition Graph (TTG) refines the vocabulary produced by the trie and impacts downstream compression behavior. The following subsections present detailed comparisons across methods, discuss the benefits of TTG-guided refinement, and quantify generalization performance across diverse chemical corpora.

Training corpus evaluation: comparison with state of the art

ChEMBL corpus evaluation

We present the performance of two distinct algorithms that use the trie-based structure for token compression. The first is a pure trie-based method that scans for repeated token sequences; the second enhances this approach by incorporating token transition graph statistics (see "TTG-Guided Refinement of Trie-Based Compression" section) to form a hybrid TTG–Trie tokenizer.

We compared these algorithms with two state-of-the-art chemical tokenization schemes: Atom Pair Encoding (APE) [10] and SMILES Pair Encoding (SPE) [17]. All models were trained on a random subset of 100,000 molecules from the ChEMBL dataset [18]. A separate portion of this dataset was retained as a “test” set, which is entirely distinct from the training dataset. The purpose of this separation is to prevent overfitting of any of the models tested to a specific series of molecules—in the extreme case, a tokenizer could produce a single token for each molecule, which would achieve perfect performance on the training set but have no ability to generalize to out-of-distribution data.

Even accounting for this phenomenon by using a testing dataset, tokenizers with large vocabularies will naturally require fewer tokens, on average, to represent each molecule, since more substitutions are possible. We therefore report the number of tokens produced by each method along with normalized entropy. We calculate the Shannon entropy of the token distribution produced by each tokenizer, considering the probability that each token appears in the tokenized version of each molecule across the entire test dataset. Higher entropy indicates that a given tokenizer is more efficient at compressing molecules in the dataset. Table 2 summarizes these results, along with the mean token count, the variance in token count, and the training time for each method. The normalized entropy values reported in Tables 5, 6, 7 are calculated using the formulation in Eq. 3. This normalization accounts for vocabulary size, enabling a direct comparison of entropy between tokenizers with different vocabularies.

Table 5.

Comparison of token-stream statistics across tokenizers

Tokenizer   Mean tokens per molecule   Token count variance   Token entropy (bits)   Vocabulary size
SPE         15.37                      97.32                  0.3522                 367
APE         9.24                       27.58                  0.5201                 8006
Trie        5.65                       17.20                  0.6974                 163
Trie+TTG    4.47                       9.18                   0.7839                 163

Mean Tokens per Molecule is the average token length of SMILES in the PubChem 100K sample. Variance measures the variability of token lengths. Entropy is the Shannon entropy of the token distribution (in bits), reflecting uniformity of token usage. Vocabulary Size is the number of unique tokens produced by each tokenizer

Table 6.

Metric comparison for SPE, APE, Trie, and Trie+TTG on the PubChem 100K dataset

Tokenizer   Fertility (tokens/atom)   Compression ratio   Mean tokens per molecule   Token count variance   Normalized entropy
SPE         0.2788                    3.59                12.38                      42.39                  0.888672
APE         0.1576                    6.35                7.00                       13.28                  0.911246
Trie        0.1014                    9.86                4.50                       5.49                   0.867195
Trie+TTG    0.0764                    13.09               3.39                       2.85                   0.924042

Fertility is the average number of tokens produced per heavy atom. Compression Ratio is the factor by which token count is reduced relative to character-level baseline SMILES. Mean Tokens per Molecule and Variance quantify token-stream length. Normalized Entropy measures the evenness of token usage, scaled to [0,1]

Table 7.

Metric comparison for SPE, Trie+TTG, and PeptideCLM’s tokenizer for 10⁶ peptides in [23]

Tokenizer    Mean (tok/mol)   Variance (tok/mol)   Entropy (bits)   Compression
SPE          24.02099         142.05463            5.17593          1.27304
PeptideCLM   25.07946         122.36169            5.34500          1.21930
Trie+TTG     9.85576          27.46811             8.47536          3.10272

We note that the integrated Trie with Token Transition Graph approach is significantly better across all metrics considered when trained and tested on different sections of the ChEMBL dataset. It represents molecules with the fewest tokens, exhibits the lowest variance in the number of tokens required per molecule, has the greatest entropy, and the smallest vocabulary. The previous state of the art, APE, uses a vocabulary almost fifty times larger and performs significantly worse.

Figure 5 shows the distribution of token counts required to represent each molecule in the ChEMBL test dataset. All distributions are unimodal and right-skewed, but APE and SPE have much greater variance. Almost no molecules require more than ten tokens under either the Trie or Trie+TTG method, whereas the average token count for SPE exceeds ten and a significant fraction of molecules require more than ten tokens under APE. The Trie and Trie+TTG methods also represent a significant portion of the dataset with fewer than five tokens, whereas only a small fraction of molecules are represented with fewer than five tokens by the SPE and APE tokenizers.

Fig. 5.

Distribution of token counts per molecule for each tokenization method on ChEMBL

PubChem corpus evaluation

Next we present data for the PubChem dataset [19]. We also use a random slice of 100,000 molecules, but train and test on the same dataset. This provides a comparison of each model’s best performance, at the risk of overfitting. We also measure “fertility”, or the ratio of tokens to characters in the dataset. The “compression ratio” is the inverse of fertility, and serves to determine the quality of a tokenizer. Both these metrics are closely related to the mean token count.

Fertility = Total Number of Tokens / Total Number of Characters    (1)
Compression = Total Number of Characters / Total Number of Tokens = 1 / Fertility    (2)
Normalized Entropy = −(1 / log|V|) · Σ_{x ∈ V} p(x) · log p(x)    (3)

In Eq. 3, V is the set of all tokens used to represent the test dataset, and p(x) for x ∈ V is the number of appearances of x in the tokenized representations of the test molecules divided by the total number of tokens used to represent all molecules in the dataset (the sum of the number of tokens used to represent each molecule).
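Equations 1-3 can be computed directly from a tokenized corpus; the SMILES strings and compressed token sequences below are toy values for illustration only:

```python
import math
from collections import Counter

def fertility(token_seqs, smiles_strings):
    """Eq. (1): total tokens divided by total characters."""
    return sum(len(t) for t in token_seqs) / sum(len(s) for s in smiles_strings)

def compression(token_seqs, smiles_strings):
    """Eq. (2): the inverse of fertility."""
    return 1.0 / fertility(token_seqs, smiles_strings)

def normalized_entropy(token_seqs):
    """Eq. (3): Shannon entropy of the token distribution scaled by log|V|.
    The ratio is independent of the logarithm base; natural log is used here."""
    counts = Counter(tok for seq in token_seqs for tok in seq)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a single-token vocabulary carries no information
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))

# Toy corpus: two SMILES and hypothetical compressed token sequences for them.
smiles = ["CC(=O)O", "CCCCl"]
tokens = [["<R1>", "O"], ["C", "<R2>"]]
```

In this toy case all four tokens appear exactly once, so the normalized entropy is 1.0, the maximally uniform value.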

We interpret the reduction in mean tokens as indicating that the trie-based method offers a more efficient encoding scheme than traditional methods. Multiplying the mean tokens by the number of strings (100,000) gives the total number of tokens needed to encode the dataset, highlighting the significant savings provided by trie-based compression.

Figure 6 shows a nearly linear correlation between the size of the data set and the total token savings for the trie tokenizer. The result implies a consistent reduction in sequence length regardless of the scale of the dataset. Trie also exhibits higher normalized entropy and lower variance, suggesting its ability to extract reusable substructures from molecular corpora. Considering information from the token transition graph further improves the trie algorithm on all metrics considered and significantly reduces variance in token lengths.

Fig. 6.

Total number of tokens saved by using trie method over APE in the PubChem dataset, relative to size of the test dataset

There is substantial evidence in the literature, primarily in the NLP domains, indicating that fewer total tokens and lower fertility correlate with improved model training and inference efficiency. Prior work [20, 21] links these properties with reduced compute costs and better LLM convergence.

These results reaffirm the importance of compression-aware metrics such as fertility and normalized entropy. Lower fertility correlates with shorter token sequences and reduced computational overhead during model training, while higher normalized entropy indicates more balanced and semantically expressive vocabularies. Together, these metrics highlight the advantages of the trie-based tokenizer in both efficiency and generalization performance.

Peptide corpus evaluation

The last corpus on which we evaluate our scheme is the PeptideCLM pretraining corpus, which contains approximately 11 M peptides and 12 M small molecules; from it, training and testing slices of 10⁵ and 10⁶ peptides, respectively, were chosen [22, 23]. The two slices are disjoint, sharing no molecules.

We then evaluated the TTG-Trie tokenizer compared to SPE and the PeptideCLM tokenizer, which is based on SPE [22]. In comparing the compression ratio, mean (tok/mol), variance (tok/mol), and entropy (bits) of the tokenizers on the corpus, we aim to show the efficiency of the token transition graph and trie tokenizer on long sequences.

Notably, while the PeptideCLM tokenizer performs comparably to SPE, the Trie+TTG method consistently outperforms both, using significantly fewer tokens per molecule and doing so more consistently. We interpret this to mean that our method has substantially higher long-sequence efficiency than the current standards. Integration into future long-sequence chemical language models is thus an interesting direction for future research.

Parameter ablation (min-frequency and substring length)

To evaluate the robustness of our tokenizer with respect to design hyperparameters, we performed an ablation study on a subset of 100,000 canonicalized molecules from PubChem. We varied the maximum substring length K ∈ {6, 8, 10, 12}, the minimum frequency threshold freq_threshold ∈ {2, 3, 4}, and the entropy retention threshold entropy_threshold ∈ {3.0, 3.5, 4.0}. For each configuration, we computed four diagnostic quantities: (i) compression ratio, (ii) normalized token entropy, (iii) vocabulary size, and (iv) TTG edge count.

We observe smooth and predictable trade-offs across the grid. Increasing K expands the candidate vocabulary (from 55,000 to 170,000 tokens) while slightly lowering the compression ratio, reflecting the increased presence of longer substrings. Raising the entropy threshold filters out low-information substrings, resulting in higher normalized entropy and slightly reduced compression.

A notable outcome is that the TTG edge count remains constant (1798 edges) across all 36 configurations, indicating that the global co-occurrence structure captured by the TTG is stable and largely insensitive to local substring parameters. This suggests that the TTG imposes a strong regularization effect on the vocabulary.

In general, the region K ∈ [6, 8] with entropy_threshold = 3.5 provides a good balance between compression efficiency, entropy, and vocabulary size, matching the hyperparameters used in our main experiments. These results confirm that the tokenizer is robust to reasonable changes in its design parameters.

Evaluating generalization

A critical requirement for molecular tokenization in large-scale language modeling is robust generalization, which is the ability to compress molecular sequences that differ significantly from the training distribution. Tokenizers that perform well on data related to the training corpus may be of no use for out-of-distribution data.

This challenge is especially relevant for trie-based tokenizers, which are data-dependent: they construct hierarchical units from frequent patterns in the training data. Consequently, such methods risk underperformance when token sequences from a different domain are not well represented in the learned trie.

To evaluate generalization, we tested all four tokenization schemes (Trie, Trie+TTG, APE, and SPE) on a large out-of-distribution dataset: 1.3 million molecules randomly sampled from the ChEMBL database. Each tokenizer was trained on the same 100,000 PubChem molecules and frozen during evaluation.

Algorithm 9. EvaluateTokenizerGeneralization

The generalization results (in Table 8) reveal several important trends. Among the four methods, the Trie+TTG tokenizer demonstrates the strongest generalization capability, showing the lowest fertility, the lowest token count variance, and the highest compression ratio. These gains suggest that TTG’s entropy-guided, frequency-aware merges isolate semantically meaningful and reusable chemical fragments that remain robust even when the distribution of molecular scaffolds shifts substantially.

Table 8.

Comparison of ability to generalize for SPE, Trie, and APE tokenizers trained on PubChem

Tokenizer   Fertility   Compression   Mean    Variance   Normalized entropy
SPE         0.305       3.28          17.42   184.27     0.820
APE         0.276       3.62          15.77   544.81     0.609
Trie        0.134       7.46          7.70    66.03      0.696
Trie+TTG    0.117       8.55          6.71    55.92      0.679

The trie-based tokenizer performs similarly to the TTG framework. Although its fertility and mean token count are slightly higher than those of Trie+TTG, its significant increase in compression ratio over APE and SPE suggests that its hierarchical merges effectively capture semantically meaningful and reusable chemical fragments that remain robust across chemically distinct datasets.

In contrast, APE shows clear signs of degradation when applied to the ChEMBL corpus. Many of the merges learned from PubChem fail to reoccur, resulting in increased token fragmentation, reduced compression, and a notable drop in normalized entropy, suggesting limited transferability. The extremely high variance in APE mean token length also suggests it has poor performance for out-of-distribution data, as there may be a select few molecules that are different enough from the training dataset to not be well represented in this framework.

SPE performs the worst overall: its merges appear too specialized to the training distribution, leading to high fertility, inflated sequence lengths, and significant variance on unseen data. This reflects a breakdown in abstraction and an inability to generalize beyond the training set. SPE does retain relatively high entropy, but this advantage is outweighed by the severe degradation in compression ratio, which makes SPE over two and a half times less efficient at representing molecules in out-of-distribution data.

In general, the TTG-guided tokenizer strikes the most effective balance between compression efficiency and generalization. Its entropy-aware, frequency-filtered merges yield tokens that are both compact and chemically coherent, enabling broad reuse across diverse molecular corpora. Unlike APE and SPE, which rely mainly on frequency heuristics or raw entropy maximization, and unlike the classic counted-trie that ignores context stability, TTG learns a stable, transferable vocabulary ideally suited for large-scale molecular language modeling and pretraining across heterogeneous chemical datasets.

Computational performance

To evaluate the practical scalability of our approach, we benchmarked both the vocabulary construction time for 100K molecules and the tokenization time for 1 M molecules for each tokenizer. All methods were implemented in Python 3.11.2 and executed in a single-threaded environment on a MacBook Air equipped with an M1 CPU (3.2 GHz) and 8 GB RAM.

Table 9 reports the total runtime required to build the vocabulary on the PubChem training set (100K molecules), along with the total inference time for the out-of-distribution ChEMBL corpus (1 M molecules). Both our Trie and Trie+TTG filtering methods require significantly less training time than either SPE or APE, and they remain tractable at scale while generalizing well. Tokenization with our methods is also efficient: both process the full 1 M molecules in under 15 s, at least an order of magnitude faster than the current state of the art.

Table 9.

Runtime comparison for vocabulary construction on 100K molecules and generalization performance measured as tokenization time on 1 M out-of-distribution molecules from ChEMBL

Tokenizer                Training time on 100K molecules (s)   Tokenization time on 1 M molecules (s)   Encode time (s/mol)   Decode time (s/mol)
SPE                      9.80                                  156.47                                   –                     –
APE                      1.94                                  160.36                                   –                     –
Trie-based compression   0.85                                  12.62                                    2.137661×10⁻⁵         2.478609×10⁻¹
Trie + TTG filtering     1.23                                  14.31                                    2.423143×10⁻⁵         1.406596

Training time refers to the one-time cost of building the tokenizer vocabulary (trie construction, TTG scoring, substring filtering). Encode Time (s/mol) and Decode Time (s/mol) are separately measured per-molecule runtimes for applying the trained tokenizer to new SMILES strings. These encode/decode values are not derived from the training time; instead, they reflect the actual end-to-end time needed to compress or reconstruct an individual molecule using the learned replacement trie

Adding TTG filtering to the trie method incurs a modest runtime cost for vocabulary construction and tokenization, as expected, because the TTG requires computing transition statistics and filtering candidate substrings. Importantly, this additional cost affects only the training of the tokenizer, not the per-molecule tokenization itself, since TTG filtering is performed once during vocabulary creation and not during inference. Tokenization speed therefore remains effectively unchanged between Trie-only and Trie+TTG.

Unlike APE and SPE, which are not reversible due to many-to-one merges without stored inverse mappings, our Trie and Trie+TTG tokenizers maintain an explicit and bijective mapping between substrings and synthetic tokens. This ensures exact reconstruction of the original SMILES and allows us to measure both encoding and decoding costs.

For our approaches, we report encoding and decoding time per molecule. Encoding is extremely fast (< 3×10⁻⁵ s/mol) for both Trie and Trie+TTG, since compression operates through a longest-prefix match on the replacement trie. Decoding is necessarily slower because it reconstructs the original SMILES by recursively expanding synthetic tokens back into their corresponding substrings. This expansion cost scales with the depth and number of replacement rules, making decoding inherently more expensive than encoding. Even so, decoding remains practical, taking < 1.5 s per molecule for Trie+TTG and significantly less in the Trie-only setting. Since decoding is rarely needed during downstream model usage (chemical language models typically consume the compressed token sequences, not reconstructed SMILES strings), the higher decoding time does not affect practical performance in training or inference.
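Decompression itself is straightforward given the stored bijective mapping; a minimal sketch, using a hypothetical nested mapping in which one synthetic token expands into another to illustrate the recursion:

```python
def decompress(tokens, mapping):
    """Recursively expand synthetic tokens using the stored token -> substring map."""
    out = []
    for token in tokens:
        if token in mapping:
            # A replacement may itself contain synthetic tokens; expand recursively.
            out.extend(decompress(mapping[token], mapping))
        else:
            out.append(token)
    return out

# Hypothetical mapping: <R2> nests <R1>, illustrating recursive expansion.
mapping = {"<R1>": ["C", "(", "=", "O", ")"], "<R2>": ["C", "<R1>"]}
restored = decompress(["<R2>", "O"], mapping)
```

Because the mapping is bijective and every expansion terminates in base-vocabulary tokens, the original token sequence (and hence the original SMILES) is recovered exactly.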

Ablation: trie-only vs. trie+TTG for downstream separability

This experiment also functions as a downstream-style classification proxy by evaluating how effectively each tokenizer organizes molecular space through unsupervised clustering. Unlike earlier diagnostic constructions that relied on averaged transition vectors and produced a collapsed geometry, the present approach yields a more informative assessment of latent-space structure. By examining cluster cohesion and separability, this experiment provides a task-relevant evaluation of neighborhood behavior and directly quantifies the impact of TTG refinement beyond the frequency-based trie baseline.

To isolate the contribution of the TTG refinement step, we compared three tokenizers (Trie-only, Trie+TTG, and APE) under an unsupervised clustering protocol that approximates downstream separability. This experiment evaluates how effectively each tokenizer organizes molecular space based solely on co-occurrence statistics of the resulting token sequences.

Each tokenizer was applied to the same set of 100,000 canonicalized PubChem SMILES strings. For every molecule, we constructed a TF–IDF feature vector over its token sequence, capturing frequency-weighted fragment usage, and reduced the resulting vectors to 50 dimensions using PCA. K-means clustering (k=20) was then applied in the reduced space, and cluster quality was quantified using the silhouette score, which measures intra-cluster cohesion relative to inter-cluster separation.
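The pipeline above can be sketched with scikit-learn. Here toy token sequences and a 2-dimensional reduction stand in for the 100,000-molecule corpus, 50 PCA components, and k=20 used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Pre-tokenized molecules: each entry is a space-joined token sequence
# (illustrative tokens, not output of the actual tokenizer).
token_seqs = [
    "<PH> C <COOH>", "<PH> C C <COOH>", "<PH> <PH> C",
    "N C C O", "N C C C O", "N C O",
]

# TF-IDF over whole tokens; token_pattern keeps multi-character tokens intact.
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(token_seqs).toarray()

X_red = PCA(n_components=2).fit_transform(X)  # 50 components in the paper
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_red)
score = silhouette_score(X_red, labels)  # higher = tighter, better-separated clusters
print(round(score, 3))
```

The only moving part across tokenizers is how `token_seqs` is produced, so differences in the silhouette score can be attributed to the tokenization alone.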

The Trie-only tokenizer already exceeds APE, indicating that frequency-based substring discovery alone provides more coherent fragment structure than byte-pair-style merges. Adding TTG refinement yields a further and substantial improvement, producing the highest silhouette score among all methods. This suggests that TTG’s entropy-based filtering removes context-unstable substrings and retains transitions that occur in more consistent chemical environments.
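A minimal sketch of this kind of entropy-guided filtering follows, under the simplifying assumption (our illustration, not the paper's exact formulation) that a candidate token is retained when the Shannon entropy of its successor-token distribution falls below a threshold:

```python
import math
from collections import Counter, defaultdict

def transition_entropy(sequences):
    """Shannon entropy (bits) of each token's successor distribution in a TTG."""
    succ = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            succ[a][b] += 1  # edge a -> b in the token transition graph
    entropies = {}
    for tok, counts in succ.items():
        total = sum(counts.values())
        entropies[tok] = -sum((c / total) * math.log2(c / total)
                              for c in counts.values())
    return entropies

def filter_candidates(candidates, sequences, threshold=1.0):
    """Keep candidates whose outgoing-transition entropy is below the threshold."""
    H = transition_entropy(sequences)
    return {t for t in candidates if H.get(t, 0.0) <= threshold}

# Toy corpus: <A> is always followed by C (stable context, entropy 0),
# while <B> is followed by three different tokens (unstable, entropy log2(3)).
seqs = [
    ["<A>", "C", "O"], ["<A>", "C", "N"],
    ["<B>", "O"], ["<B>", "N"], ["<B>", "C"],
]
kept = filter_candidates({"<A>", "<B>"}, seqs, threshold=1.0)
print(kept)  # {'<A>'}
```

The threshold plays the role of the TTG entropy threshold reported in the experiments (e.g., 2.5 or 3.5); lower thresholds keep only the most context-stable substrings.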

Overall, the Trie+TTG tokenizer creates a clearer and more separable latent space than either baseline. This shows that TTG refinement adds meaningful structure beyond the trie alone, resulting in representations that are better suited for downstream use. The results of these comparisons are shown in Table 10.

Table 10.

Silhouette scores for unsupervised clustering using TF–IDF representations derived from each tokenizer

Tokenizer Silhouette score
Trie-only 0.472
Trie+TTG 0.587
APE 0.250

All Trie-based models use K=10, a minimum frequency threshold of 3, and a TTG entropy threshold of 3.5. TF–IDF vectors were reduced to 50 dimensions using PCA prior to k-means clustering (k=20)

QSAR evaluation on ESOL

To assess whether the TTG refinement improves downstream predictive utility, we conducted a regression experiment on the ESOL aqueous solubility dataset. We trained both a Trie-only tokenizer and two TTG-refined tokenizers on the ESOL training split and constructed TF–IDF feature vectors over the resulting token sequences. A Random Forest Regressor (500 estimators) was used for all models to isolate the effect of tokenization alone, with no task-specific architectural bias. The results are summarized in Table 11.

Table 11.

QSAR performance on ESOL using TF–IDF features derived from different tokenizers

Tokenizer RMSE MAE R2
Trie-only (K=8, freq=4) 1.638 1.243 0.442
TTG (K=8, freq=3, entropy=2.5) 1.440 1.059 0.569
TTG (K=10, freq=3, entropy=3.5) 1.831 1.393 0.302

The TTG refinement with moderate substring length (K=8) yields the best predictive performance, indicating improved chemical signal retention

Three tokenizer configurations were evaluated: (i) a Trie-only model with K=8 and minimum frequency 4, (ii) a TTG-refined model with K=8, frequency threshold 3, and entropy threshold 2.5, and (iii) a second TTG model with K=10, frequency 3, and entropy threshold 3.5. Performance was measured using RMSE, MAE, and R2 on the ESOL test split.

The TTG-refined tokenizer with K=8 achieves the strongest performance (R2=0.569), outperforming the Trie-only baseline (R2=0.442). This shows that TTG filtering improves the discriminative quality of the token features by removing context-unstable substrings. Increasing K to 10, however, leads to lower performance, suggesting that excessively long synthetic tokens reduce the granularity needed for fine-grained property prediction. Overall, these results demonstrate that TTG refinement enhances downstream QSAR accuracy when applied with appropriate parameter settings.
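The QSAR protocol can be sketched with scikit-learn as follows; the token sequences, solubility targets, and train/test split are toy stand-ins for the ESOL data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy tokenized molecules and log-solubility targets standing in for ESOL.
train_seqs = ["<PH> C O", "<PH> C C O", "C C O", "C O", "<PH> <PH> C", "C C C O"]
y_train = [-2.1, -2.4, -0.3, -0.1, -3.5, -0.6]
test_seqs = ["<PH> C O", "C C O"]
y_test = [-2.0, -0.4]

# TF-IDF features over whole tokens, fit on the training split only.
vec = TfidfVectorizer(token_pattern=r"\S+")
X_train = vec.fit_transform(train_seqs)
X_test = vec.transform(test_seqs)

# 500 trees as in the paper; the regressor is held fixed so that only the
# tokenization (i.e., how the sequences were produced) varies across runs.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
pred = rf.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
print(round(rmse, 3), round(mae, 3))
```

Swapping in sequences produced by Trie-only versus Trie+TTG tokenizers, while holding the vectorizer and regressor fixed, reproduces the comparison reported in Table 11.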

Conclusion

This work introduces two complementary algorithms, trie-based substring compression and TTG-guided refinement, for constructing compact and chemically coherent SMILES tokenizations. Unlike frequency-only methods such as APE and SPE, the proposed framework integrates both global substring statistics and local contextual stability, producing reversible vocabularies that capture reusable chemical fragments.

Across PubChem, ChEMBL, and large peptide corpora, Trie+TTG achieves substantially shorter token sequences, higher normalized entropy, and greater robustness to out-of-distribution molecules. Unsupervised clustering and QSAR regression on ESOL further show that the refined token sequences support more coherent latent spaces and improved downstream predictive performance. These results demonstrate that chemically aware token compression can meaningfully influence the quality and utility of molecular representations.

Although our study does not train full-scale chemical language models, the observed gains in compression, entropy, and latent-space separability indicate that Trie+TTG is a strong candidate for future CLMs, including long-sequence peptide models. Future work will integrate this tokenizer into SMILES and SELFIES-based foundation models and explore hybrid learning-guided vocabulary compression.

Future work

Although our trie-based token compression approach, augmented by the Token Transition Graph (TTG), provides a chemically aware and efficient mechanism for SMILES tokenization, it is not without limitations. In the following, we discuss the current weaknesses and outline potential avenues for future improvement. A central limitation is that these tokenizers remain fundamentally frequency-driven and may therefore fail to capture rare but chemically important substructures that appear infrequently, or not at all, in the dataset.

The trie and trie-with-TTG approaches presented in this paper are optimized to tokenize SMILES strings, but can also be extended to SELFIES strings [24]. As the SELFIES format has become increasingly important for large language models and related applications, it will be imperative to adapt the algorithms described in this paper to work with SELFIES as well as SMILES representations and to optimize their performance accordingly. SELFIES has the important property that any sequence of symbols from its alphabet decodes to a semantically and chemically valid molecule, which might aid compression.

The trie method is also limited in the size of substrings that can be considered for computational cost reasons; this, combined with the static frequency thresholds used to determine whether a certain substring should be considered a distinct token, reduces the method’s ability to identify more complex tokens and generalize to other datasets. We anticipate improving the trie-based methods by incorporating hybrid learning-based compression, in which a secondary algorithm is used to identify task-relevant patterns in addition to frequent substrings for tokenization. Another area for improvement would involve directly incorporating data from the token transition graph when generating the trie instead of only using the token transition graph as a semantic filter. This could help with dynamic trie pruning and construction, in which the token transition graph can be used to select low-entropy paths that maximize compression gain, allowing longer substrings to be considered and more semantically meaningful tokens to be identified.

Author contributions

Contributing authors: Krish Mody (kmody@andrew.cmu.edu), Arvind Venkatesh (contact@rvind.dev), Ananth Venkatesh (ananthv@mit.edu). Corresponding author: Sridhar Radhakrishnan (sridhar@ou.edu).

Funding

University of Oklahoma.

Data availability

ChEMBL [18]: https://www.ebi.ac.uk/chembl/, PubChem [19]: https://pubchem.ncbi.nlm.nih.gov/.

Code availability

GitHub repository: The Python implementations used in this study are available at https://github.com/BlastCoder/SMILES-Tokenization. The repository is released under the MIT License, permitting reuse, modification, and distribution with attribution.

Declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36
  • 2. Schwaller P, Laino T, Gaudin T, Bolgar P, Hunter CA, Bekas C, Lee AA (2019) Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent Sci 5(9):1572–1583. 10.1021/acscentsci.9b00576
  • 3. Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885
  • 4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) 30
  • 5. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, 1:1715–1725
  • 6. Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 66–71
  • 7. Honda S, Shi Z, Ueda H (2020) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inf Model 60(1):118–126
  • 8. Fender I, Gut JA, Lemmin T (2025) Beyond performance: how design choices shape chemical language models. bioRxiv. 10.1101/2025.05.23.655735
  • 9. Zheng S, Rao J, Zhang Z, Xu J, Yang Y (2023) APE: an atom pair encoding method for learning molecular representations. J Chem Inf Model 63(4):1063–1074. 10.1021/acs.jcim.2c01350
  • 10. Leon M, Perezhohin Y, Peres F, Popovič A (2024) Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. Sci Rep 14(1):12345–12354
  • 11. Liu T, Wang J, Wang Y, Liu Z, Tang J, Huang Y (2023) FragmentBERT: a fragment-based pretrained language model for molecular property prediction. Brief Bioinform 24(2):bbad052. 10.1093/bib/bbad052
  • 12. Shen Y, Póczos B (2024) GraphBPE: molecular graphs meet byte-pair encoding. arXiv preprint arXiv:2407.19039
  • 13. Nagy T, Bohnet B, Fraser A (2020) Efficient subword segmentation with trie structures. In: Proceedings of the 28th International Conference on Computational Linguistics (COLING), Barcelona, Spain, 5997–6008
  • 14. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge
  • 15. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, 1263–1272
  • 16. Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=ryGs6iA5Km
  • 17. Li X, Fourches D (2020) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inf Model. 10.26434/chemrxiv.12339368.v1
  • 18. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):1100–1107. 10.1093/nar/gkr777
  • 19. Kim S (2021) Exploring chemical information in PubChem. Curr Protoc 1(9):e217. 10.1002/cpz1.217
  • 20. Goldman O, Caciularu A, Eyal M, Cao K, Szpektor I, Tsarfaty R (2024) Unpacking tokenization: evaluating text compression and its correlation with model performance. In: Findings of ACL 2024, 2274–2286. 10.18653/v1/2024.findings-acl.134
  • 21. Wadell A, Bhutani A, Viswanathan V (2025) Tokenization for molecular foundation models. arXiv:2409.15370 [cs.LG]
  • 22. Feller AL, Wilke CO (2025) Peptide-aware chemical language model successfully predicts membrane diffusion of cyclic peptides. J Chem Inf Model 65(2):571–579. 10.1021/acs.jcim.4c01441
  • 23. Feller A (2025) Pretraining data for PeptideCLM (UPDATED). Zenodo. 10.5281/zenodo.15042141
  • 24. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn Sci Technol 1(4):045024. 10.1088/2632-2153/aba947
