Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv logoLink to bioRxiv
[Preprint]. 2026 Apr 17:2026.01.06.697994. Originally published 2026 Jan 7. [Version 3] doi: 10.64898/2026.01.06.697994

Scaling SMILES-based chemical language models for therapeutic peptide engineering

Aaron L Feller †,, Maxim Secor , Sebastian Swanson , Claus O Wilke , Kristine Deibler
PMCID: PMC12803269  PMID: 41542510

Abstract

Therapeutic peptides occupy a unique middle ground in drug discovery, offering the high specificity of protein interactions with the chemical diversity of small molecules, yet they currently fall in a computational blind spot. Existing foundation models cannot handle them effectively: protein models are restricted to natural amino acids, while chemical models struggle to process large, polymer-like sequences. This disconnect has forced the field to rely on static chemical descriptors that fail to capture subtle chemical details or on complex multi-embedding pipelines that are custom tailored to specific datasets. To bridge this gap, we present PeptideCLM-2, a suite of chemical language models trained on over 100 million molecules to natively represent complex peptide chemistry. This modeling approach both simplifies the application of machine learning to therapeutic peptides and results in improved performance over alternative approaches for predicting development endpoints including membrane diffusion, tumor homing, and half life.

Graphical Abstract

graphic file with name nihpp-2026.01.06.697994v3-f0001.jpg

Introduction

Peptides are a rapidly evolving modality in the therapeutic space.16 They occupy a unique chemical niche situated between small molecules and proteins. Like small molecules, they possess immense chemical diversity.711 Yet, they retain the modularity of biological polymers, allowing for accessible and efficient synthesis.1214 While modern synthetic methods enable the creation of a vast diversity of modified peptides, the computational tools required to rationally select among these options have not kept pace. Attempts to adapt existing frameworks have faced significant limitations. Protein language models (pLMs) are restricted to fixed amino-acid alphabets, rendering them unable to encode noncanonical or chemically modified residues.15 Conversely, chemical language models (CLMs) are typically trained on small molecules, lacking the contextual range to interpret peptide-specific motifs.16,17

Large-scale protein corpora such as UniProt18 have fueled the rise of self-supervised pLMs. Architectures like the ESM family1921 and ProtTrans22 have proven that transformers can learn structural and functional constraints directly from sequence data. These models capture deep dependencies predictive of structure,2325 variant effects,26,27 and molecular function.28,29 In parallel, CLMs utilizing symbolic representations like SMILES or SELFIES30,31 have successfully encoded chemical syntax via masked or autoregressive objectives,3234 enabling prediction of various chemical properties.35,36 Extensions such as ChemBERTa-237 and ProLLaMA38 have further suggested that augmenting self-supervision with physicochemical regression introduces a valuable inductive bias, improving generalization. Despite these advances, however, deep learning has shown minimal, if any, improvement over molecular fingerprints for peptide property prediction.39,40

In this work, we introduce PeptideCLM-2, a suite of nine SMILES-based transformer encoders41 designed to unify therapeutic peptide modeling. Building upon our previous architecture, PeptideCLM,42 we train models across a parameter range of 32 to 337 million using three distinct objectives: masked language modeling (MLM), multi-task regression (MTR)43 to RDKit-derived descriptors,44 and a dual objective combining both. This systematic design provides a framework to rigorously decouple the effects of model scale and inductive bias on representation learning. Because therapeutic peptides often adopt transient rather than static conformations, we employed a string-based architecture to capture topological connectivity without the bias of a single, rigid 3D structure.

We demonstrate that these models outperform molecular fingerprints and specialized architectures, while relying only on simple string-based inputs. Our evaluation also reveals a critical scaling transition: while descriptor-guided pretraining provides a necessary scaffold for smaller models, larger architectures trained solely on MLM spontaneously recover these physicochemical relationships, a capability evident in both the embedding manifold and downstream evaluations. Together, these results establish PeptideCLM-2 as both an open, scalable resource for peptide engineering and a framework for understanding the interaction between model capacity and pretraining for the field of computational peptide chemistry.

Results

Transformer architecture and multi-objective pretraining for diverse peptide chemistries

The PeptideCLM-2 suite represents a unified framework for embedding therapeutic peptides. Unlike standard protein language models, which are restricted to a fixed alphabet of 20 canonical amino acids, PeptideCLM-2 is designed to explore the full spectrum of chemical space. By combining scalable transformer architectures with interpretable tokenization strategies, the framework provides a bridge between symbolic chemical representation and chemical function.

To rigorously evaluate chemical deep learning for peptides and the emergence of chemical intuition, we designed a controlled grid of nine models. We varied model capacity across three orders of magnitude—32M, 114M, and 337M parameters—and trained each scale using three distinct pretraining objectives: masked-language modeling (MLM), multi-task regression (MTR) to physicochemical descriptors, and a combined dual objective. This systematic design allows us to disentangle the effects of parameter scale from the learning paradigm used.

Our architecture processes inputs as raw SMILES strings. This allows for the native encoding of canonical residues, noncanonical modifications, cyclic scaffolds, and complex conjugations like lipidation or PEGylation (Fig. 1A). The backbone of the framework is a BERT-style transformer encoder (Fig. 1B). To ensure training stability and effective handling of long-range chemical dependencies within large macrocycles, we incorporated modern architectural features including rotary positional embeddings (RoPE),45 SwiGLU activation functions,46 and pre-layer normalization.47 Model depth and hidden dimension scale with parameter count, maintaining a consistent per-head dimensionality of 64.

Figure 1: PeptideCLM-2 workflow: enabling encoding of chemical diversity for predictive modeling.

Figure 1:

(A) The model can input SMILES strings from peptides with various modifications, allowing it to encode cyclic peptides, noncanonical amino acids, and synthetic modifications. (B) Models were trained with both span masking (predicting grey [MASK] tokens) and regression to physicochemical descriptors (e.g., LogP, TPSA) from a pooled sequence embedding. (C) Examples of prediction targets for downstream tasks, including membrane permeability, cellular interactions, and aggregation propensity.

We investigated three distinct learning paradigms to train the architecture. The masked language modeling (MLM) objective utilized span masking48 to force the model to reconstruct missing chemical fragments, effectively learning the syntax of molecular graphs. The multi-task regression (MTR) objective explicitly grounded the embeddings with regression to 99 RDKit-derived physicochemical descriptors (Tab. S1) from a mean-pooled embedding. Finally, a hybrid objective optimized both losses simultaneously, testing whether syntax and semantics could be learned in tandem. For downstream evaluation, the pretrained encoder is coupled with a lightweight, 2-hidden layer feed-forward regression head employing GeLU activation, allowing the model to adapt its learned representations to specific biological tasks such as membrane permeability, cellular interactions, or aggregation (Fig. 1C).

Composite pretraining corpus and sequence compression via k-mer tokenization

To construct a representation space that spans the full continuum from small molecules to biological polymers, we curated a composite pretraining corpus combining three distinct datasets. We aggregated lipids from the LIPID MAPS Structure Database (n = 50,450),49 small drug-like molecules from PubChem (n = 108,583,157),50 and diverse peptide sequences from ESMAtlas (n = 9,634,945).20 As visualized in Figure 2A, these sources inhabit unique, complementary regions of the chemical manifold, ensuring the model encounters a heterogeneous distribution of molecular syntax.51 Due to the flexibility of SMILES encoding, we are able to train on small molecules to learn chemical features, which can directly translate to therapeutic peptides that contain chemistries not seen in the natural amino acid space.

Figure 2: Data Diversity and Representational Efficiency.

Figure 2:

(A) A 2D t-SNE projection of Morgan Fingerprints from the pretraining corpus, demonstrating that the composite dataset (LMSD, PubChem, and ESMAtlas) bridges distinct chemical subspaces. (B) Comparative length distributions of tokenized molecules. The k-mer tokenization strategy yields a substantial compression factor relative to atom-level encoding, reducing sequence length by 64% for peptides, thereby enabling the efficient processing of long biological chains.

A central bottleneck in applying chemical language models to biopolymers is the computational cost of self-attention, which scales quadratically with sequence length 𝒪n2. Because direct character encoding of peptides generates exceptionally long SMILES strings, standard models become prohibitively expensive to train. To resolve this, we developed a specialized k-mer tokenizer (Tab. S2 & S3) that compresses these symbolic representations by mapping recurring sub-structural motifs to single tokens. This strategy significantly reduces the effective sequence length, and thus the quadratic compute burden, while maintaining full compatibility with standard SMILES syntax.

Quantitatively, this strategy reduces sequence lengths by 38% for small molecules and 64% for natural peptides relative to standard atom-level frameworks like DeepChem52 (Fig. 2B). In order to conduct a fair evaluation, we held the data composition and tokenization schema constant across all experiments (Tab. S4). We similarly standardized finetuning workflows using nested cross-validation (Tab. S5), ensuring that observed differences in downstream tasks, such as permeability or aggregation, are attributable solely to the emergence of learned chemical intuition rather than artifacts from pretraining.

We found that this compression comes at no cost to accuracy (Tab. S6); benchmarking confirms that models trained with either tokenizer achieve equivalent performance on membrane permeability prediction. Thus, the k-mer approach successfully resolves the trade-off between computational tractability and semantic fidelity, allowing the model to attend to long-range dependencies in complex backbones without the prohibitive cost of character-level encoding.

Parameter scaling drives the spontaneous emergence of physicochemical organization

We evaluated PeptideCLM-2 using the CycPeptMPDB dataset,53 a standard benchmark for cyclic peptide permeability with robust evaluation splits.42 To ensure a rigorous comparison, we focused our analysis on nine distinct model configurations selected via ablation studies, covering three distinct parameter sizes.

We first investigated whether the models had learned a chemically meaningful latent space. By projecting the embeddings of the 337M MLM model into two dimensions, we observed that the model organizes the chemical manifold according to fundamental physical properties. Without explicit instruction, MLM training organized embeddings by molecular weight and aromaticity (Fig. 3A, B). We also notice this across a larger list of physicochemical descriptors including charge, logP, and available hydrogen bonding sites (Fig. S1). This unsupervised organization shows that the embedded structure also has some organization with measured PAMPA permeability (Fig. 3C), confirming that the model has encoded many of the structural determinants of membrane diffusion.

Figure 3: Emergence of chemical structure and predictive capability.

Figure 3:

(A-C) A t-SNE projection of embeddings from the 337M model. The model spontaneously organizes the chemical manifold by (A) molecular weight and (B) aromaticity, which correlates with (C) measured permeability. (D) Transfer learning performance (linear probing) remains low across all scales, indicating permeability requires non-linear features. (E) Full finetuning reveals a scaling transition: while small models rely on explicit supervision (MTR) to perform well, large self-supervised models (MLM) spontaneously recover this performance, matching the supervised baseline.

To disentangle the quality of these raw representations from the model’s capacity to adapt, we compared layer-wise transfer learning against full-model finetuning. We found that linear probes trained on frozen features performed poorly (R2 < 0.3) regardless of model scale (Fig. 3D), suggesting that the model embeddings are complex, non-linear properties that cannot be extracted via simple linear combinations of features.

We further analyzed where this information is stored by probing each layer of the network. This revealed a significant difference in stability between model sizes. In smaller models, the final hidden layer—typically used for downstream tasks—often underperformed compared to intermediate layers (Fig. S2), suggesting a forgetting of physicochemical details at the output bottleneck. In contrast, larger models exhibited high stability, maintaining rich representations from the middle layers all the way to the final output.

The most significant finding, however, emerged during full-model finetuning (Fig. 3E). At the smallest scale (32M), inductive bias is dominant: models pretrained with explicit physicochemical regression (MTR) significantly outperformed those trained only on language modeling (R2 ≈ 0.38 vs. 0.13). This indicates that small models improve when pretraining is grounded in physical reality.

As capacity increased to 337M parameters, this dependency vanished. The purely self-supervised MLM models show equivalent predictions to the supervised MTR models, with both objectives converging at an R2 ≈ 0.58. This is nearly double the performance of traditional molecular fingerprints (R2 ≈ 0.3). Here we demonstrate a fundamental scaling law: while small models improve prediction when explicitly taught physical properties of molecules, sufficiently large transformers are able to derive these priors from the syntax of chemical language alone.

Predictive generalization to complex biological phenotypes and non-canonical chemistries

Having established that PeptideCLM-2 captures physicochemical constraints, we next asked whether this intuition translates to complex biological function. We evaluated the model on three diverse classification benchmarks—tumor homing, cell penetration, and antimicrobial activity (Tab. S7)—using Matthews Correlation Coefficient (MCC) to rigorously account for class imbalance. The datasets used for these tasks all require modeling of noncanonical chemistry.

Visual inspection of the pretrained embedding space reveals that the model has already learned to distinguish bioactive species. Embedding projections from PeptideCLM-2 Hybrid 337M parameter model show distinct separation between positive and negative classes across all three benchmarks (Fig. 4A, C, E). This latent structure provides a robust foundation for finetuning, where PeptideCLM-2 models were able to outperform specialized architectures, often relying on generation of chemical features or complicated embedding methods.

Figure 4: Predictions on three benchmark datasets for biological interaction.

Figure 4:

Left panels: t-SNE projections (A, C, E) of embeddings from the 337M Hybrid model. Even before finetuning, the model’s latent space exhibits strong linear separability between active (red) and inactive (blue) peptides across all three tasks. Right panels: Performance comparison (MCC) between Transfer Learning (linear probe) and End-to-End Finetuning (B, D, F). Grey dashed lines indicate the performance of prior state-of-the-art methods (THPep, Random Forest+PaDEL, AmpHGT). Red dashed lines indicate a standard Random Forest baseline using RDKit topological fingerprints.

Tumor Homing:

Tumor-Homing Peptides (THPs) are notoriously difficult to classify because they rely on subtle recognition motifs (e.g., RGD, NGR) to target tumor vasculature.54 The method published with this dataset, THPep, achieved an MCC of ≈ 0.71 by relying on engineered features like Pseudo Amino Acid Composition (PseAAC).55 PeptideCLM-2 surpassed this baseline (MCC 0.732) using only raw SMILES strings (Fig. 4B), demonstrating that the model effectively learns motif-driven features without explicit feature engineering.

Cell Penetration:

We next evaluated the model on CellPPD-Mod,56 a dataset explicitly containing chemically modified peptides. Standard sequence models fail here, forcing previous baselines to rely on extracted 2D/3D chemical descriptors (e.g., PaDEL) for the side-chain modifications. The use of PeptideCLM-2 allows for encoding of the entire molecule in one pass, and achieved an MCC of 0.875, outperforming the descriptor-based baselines (≈ 0.85) while operating directly on the chemical syntax (Fig. 4D).

Antimicrobial Activity:

Finally, we tested the model on a rigorous benchmark by He et al.,57 designed to challenge models by training on natural peptide sequences and making predictions on peptides containing noncanonical amino acids. The specialized baseline, AmpHGT, employs a complex Heterogeneous Graph Transformer to explicitly model atoms and fragments, achieving an MCC of 0.797. PeptideCLM-2 surpassed this graph-based architecture with an MCC of 0.813 (Fig. 4F). This result shows that a SMILES-based transformer can capture the intrinsic chemistry of noncanonical residues as effectively as explicit molecular graphs.

Capacity-dependent prediction of non-linear biophysical properties evaluated via peptide half-life and aggregation propensity

Peptide stability dictates therapeutic viability through two primary mechanisms: resistance to enzymatic degradation in circulation and physical resistance to aggregation during storage. We evaluated PeptideCLM-2 on both fronts to determine if learned chemical syntax captures these complex biophysical properties.

To assess enzymatic stability, we utilized the recently curated PepMSND dataset,58 a benchmark of peptide half-life across 635 diverse samples in varying blood environments. The published modeling approach for this dataset was a multimodal ensemble combining 0D physicochemical descriptors, 1D sequences, 2D molecular graphs, and 3D structural conformations. They include a Kalmogorov-Arnold Network (KAN)59 to process molecular descriptors, which, in ablation studies, had the greatest impact to predictive capability.

Visualizing the pretrained PeptideCLM-2 embeddings for the PepMSND dataset reveals that the model innately organizes the latent space by stability class without requiring explicit spatial or physical inputs (Fig. 5A). There are several clusters containing a single class (stable vs unstable), and upon finetuning, our single string-based architecture accurately captures the determinants of degradation. PeptideCLM-2 outcompetes the multimodal baseline without the KAN network (Fig. 5B), demonstrating that large-scale representation learning can successfully replace complex, multi-level feature engineering for blood stability prediction. Additionally, combining PeptideCLM-2 with the same KAN network used with PepMSND, applying their same method of combining output features into a joint regression model, improves predictive capability above the PepMSND+KAN baseline.

Figure 5: Scaling model size in predicting peptide stability.

Figure 5:

(A) t-SNE projection of PeptideCLM-2 MLM model embeddings for all 10 clusters from PepMSND, colored by either passing or failing stability checkpoint. (B) Boxplot of PeptideCLM-2 predictions on binary stability for 10 folds from PepMSND splits, colored by model type. KAN represents a Kolmogorov-Arnold Network trained on molecular descriptors, as used in the PepMSND model. Grey dashed line is the baseline model score from the original publication. t-SNE visualizations of the peptide fibrillation data show significant overlap between non-aggregating (C) and aggregating (D) peptides, illustrating the difficulty of linearly separating these classes based on structure alone. (E) Density plots of model predictions. Traditional molecular fingerprints (Morgan FP) fail to distinguish the two populations, performing near random chance (AUROC 0.579). In contrast, PeptideCLM-2 exhibits increasing accuracy with model size, reaching an AUROC of 0.823 at 337M parameters. Curves are scaled independently for visibility.

Beyond enzymatic degradation, physical aggregation represents a critical failure mode driven by condition-dependent interactions between lipids, excipients, and secondary structures. To evaluate this, we utilized a large proprietary dataset of Thioflavin T (ThT) fluorescence assays. This benchmark encompasses engineered macrocycles and endogenous peptides containing diverse protraction moieties measured across varying pH conditions. We appended the pH value to the embedding prior to the regression head, allowing the network to learn environment-dependent stability.

The inherent difficulty of predicting fibrillation is visible in the latent space. Pretrained embeddings show extensive overlap between aggregating and non-aggregating peptides (Fig. 5C, D), indicating that simple structural clustering cannot separate the classes. Consequently, a Random Forest trained on Morgan Fingerprints fails completely, yielding an area under the receiver operating characteristic curve (AUROC) of 0.579.

In contrast, PeptideCLM-2 demonstrates a dramatic recovery of predictive power that scales directly with model capacity (Fig. 5E). While the 32M parameter model provided a baseline discrimination (AUROC 0.694), scaling to the 114M and 337M architectures improved performance to 0.751 and 0.823, respectively. This distinct scaling trajectory confirms that large chemical language models can capture the subtle, non-linear biophysical drivers of aggregation that remain invisible to static chemical fingerprints.

Summary of predictive performance

Our comprehensive evaluation of PeptideCLM-2 across six distinct datasets demonstrated its superiority when compared to existing computational baselines. Notably, all of these datasets feature peptides incorporating noncanonical amino acids and complex chemical modifications that traditional protein language models cannot process. As summarized in Table 1, the 337M parameter model achieves state-of-the-art performance on each task, consistently outperforming specialized deep learning architectures and standard chemical descriptors.

Table 1:

Performance comparison of PeptideCLM-2 (337M) against benchmark baselines. AUROC: area under the receiver operating characteristic curve; MCC: Matthews correlation coefficient. RF denotes a Random Forest classifier trained on the specified descriptors.

Task (Dataset) Metric Alternative Model Score PeptideCLM-2

Membrane Permeability AUROC PeptideCLM42 0.781 0.830
Tumor Homing MCC THPep55 0.710 0.732
Cell Penetration MCC RF + PaDEL56 0.850 0.875
Antimicrobial Activity MCC AmpHGT57 0.797 0.813
Blood Stability MCC PepMSND Multimodal58 0.537 0.609
Fibrillation Propensity AUROC RF + Morgan FP 0.579 0.823

Discussion

The development of therapeutic peptides has long been hindered by a representational dilemma. While machine learning has revolutionized protein engineering, peptides remain stranded between two paradigms: protein language models cannot handle noncanonical chemistry, while standard molecular fingerprints lack the contextual depth to model long biopolymers. Our initial foray into this space, PeptideCLM,42 demonstrated the feasibility of SMILES-based modeling for cyclic peptides. In this work, we expand this concept into a comprehensive framework, PeptideCLM-2, incorporating significant advances in data scale, architectural depth, and rigorous benchmarking that demonstrates that these transformer-based encoders consistently outperform classical fingerprint embeddings.

Our analysis identifies two critical drivers for effective peptide modeling. First, we observe a parameter scaling effect where larger models move beyond simple pattern recognition to spontaneously learn fundamental chemical rules from SMILES notation alone. Second, we introduce a novel k-mer tokenizer to process these complex molecules efficiently. This innovation allows PeptideCLM-2 to analyze intricate structures that were previously too computationally demanding, resolving the conflict between deep chemical accuracy and processing speed.

Large-scale transformers spontaneously derive physical rules from chemical syntax alone

Our results demonstrate that the optimal pretraining strategy is strictly a function of model capacity. At the 32M parameter scale, the model lacks the representational capacity to infer thermodynamic laws from raw syntax alone. In this regime, explicit physicochemical supervision provides a decisive advantage, acting as a scaffold that guides the model toward a structured embedding space where permeability can be accurately predicted.

However, this advantage vanishes at scale. The convergence of self-supervised and descriptor-guided models at 337M parameters confirms that sufficiently large transformers outperform chemical features when trained only on token co-occurrence. This finding offers a mechanistic explanation for prior observations in the field, such as the reported superiority of descriptor-guided training in ChemBERTa-2.37 We posit that ChemBERTa-2, being a relatively lightweight architecture, operated on the lower end of this scaling curve, where inductive bias is still beneficial, while MoLFormer,34 trained purely on SMILES, operated on the upper end of this curve. PeptideCLM-2 demonstrates that with sufficient depth, the model transcends a dependence on supervised learning, developing learned representations from sequence alone that are functionally equivalent.

String-based architectures resolve geometric bias and enable functional prediction for diverse chemistries

Our reliance on SMILES strings is a deliberate design choice to address the limitations of geometric deep learning in this domain. Unlike globular proteins, which fold into stable tertiary structures, therapeutic peptides are often intrinsically disordered or exist as dynamic ensembles. Current 3D methods force these labile molecules into a single static conformation, introducing a geometric bias that misrepresents their behavior in solution. By operating on SMILES, PeptideCLM-2 captures precise chemical connectivity while implicitly allowing the model to learn representations that account for conformational flexibility. This is supported by our results in membrane permeability, where the model successfully inferred 3D-dependent constraints directly from the 1D syntax without requiring explicit structural input.

Notably, PeptideCLM-2 generalizes beyond simple physicochemical descriptors to predict complex biological phenotypes. This capability stems from the fundamental advantage of SMILES-based architectures: the ability to represent a residue not as a fixed label, but as a composite of its constituent chemical parts. While protein language models are structurally limited to the 20 canonical amino acids, our k-mer tokenization strategy allows PeptideCLM-2 to efficiently process the chemical diversity that defines modern peptide therapeutics, natively encoding cyclizations, stereochemical inversions, and diverse chemical modifications. Additionally, this tokenization method allows the model to learn from small molecule chemistry during pretraining and apply that knowledge on downstream tasks.

This structural flexibility not only enabled simple encoding, but also resulted in improved performance over custom architectures and molecular fingerprints on benchmarks such as AMP and CellPPD-Mod. The model also has demonstrated chemical generalization when successfully predicting the antimicrobial activity of peptides containing noncanonical residues that were held out during training. The finding that embeddings pretrained on chemical syntax (SMILES reconstruction) transfer effectively to functional tasks implies that the chemical features necessary to generalize across molecules are present in the grammar of chemical language.

Limitations and future directions

While PeptideCLM-2 establishes a new standard for discriminative modeling, several frontiers remain. First, peptide aggregation is inherently dynamic. While our static embeddings predict propensity effectively, capturing true fibrillation kinetics may require training on molecular dynamics trajectories to explicitly model conformational flexibility. Second, our reliance on SMILES strings prioritizes topological connectivity over explicit 3D geometry. While the model successfully infers 3D constraints—as evidenced by its permeability predictions—integrating geometric deep learning could further refine performance on highly structure-dependent targets.

As high-throughput screening60 and advanced synthesis61 continue to generate large-scale datasets, extending this architecture to the billion-parameter scale promises even greater predictive fidelity. Coupling these predictive oracles with generative frameworks like diffusion models62 brings us closer to de novo design of noncanonical peptides with precise, multi-parametric profiles.

Conclusion

We present PeptideCLM-2 as an open, scalable resource designed to accelerate the transition from empirical screening to rational engineering in peptide drug discovery. By successfully resolving the representational trade-off between semantic depth and computational tractability, our framework empowers the community to design the next generation of stable, potent, and chemically diverse therapeutics. To foster reproducibility and catalyze further innovation, we release all model weights, tokenizers, and training datasets to the public.

Methods

Dataset curation and preprocessing

To construct a chemically diverse pretraining corpus bridging the gap between small molecules and proteins, we curated data from three primary sources: PubChem,50 ESMAtlas,20 and the LIPID MAPS Structure Database (LMSD).49

Small molecules:

We retrieved the complete compound set from PubChem and applied a rigorous filtering cascade to remove non-drug-like entities. Entries were excluded if they were shorter than 20 characters or contained silicon chains. We further removed salts (leading/trailing Br and Cl) and disconnected components (e.g., solvent molecules indicated by ‘.’ within brackets). Polymeric artifacts, specifically repeating silicon oxide motifs (e.g., [Si](=O)[Si](=O)), were identified and discarded. After splitting remaining disconnected components into independent entries and deduplicating the dataset, the final small-molecule corpus contained 108,583,157 unique SMILES strings.

Peptides:

Peptide sequences were sourced from ESMAtlas. To ensure high-quality structural priors, we filtered for sequences with high predicted confidence (pTM > 0.7, pLDDT > 0.7) and lengths ≤ 100 amino acids. To reduce redundancy, sequences were clustered at 30% identity using MMseqs2 (--min-seq-id 0.3 -c 0.8 --cov-mode 1), retaining the centroid with the highest product of pTM and pLDDT as the cluster representative. Amino acid sequences were converted to SMILES strings using p2smi.63

Lipids:

All available lipid structures were obtained directly from LMSD.

Data Balancing:

To prevent model collapse into the dominant small-molecule modality, we employed a balanced sampling strategy during training. Each epoch consisted of an upsampled lipid set (250K), the full peptide set (~10M), and a random downsampled subset of small molecules (10M), ensuring the model encountered a heterogeneous distribution of chemical syntax. All molecules were canonicalized to standard SMILES format using RDKit; entries failing conversion were discarded.

K-mer SMILES tokenization

Standard atom-level tokenization often results in excessively long sequences for peptide biopolymers, increasing the computational cost of self-attention 𝒪n2. To mitigate this, we developed a custom k-mer tokenizer based on the concepts from SMILES Pair Encoding64 that compresses peptide SMILES while preserving chemical semantics.

We first constructed a pre-tokenizer that segments SMILES into atom-level primitives, preserving multi-character elements (e.g., ”Br”, ”Cl”) and stereochemical/charge brackets (e.g., [C@@H], [N+]). We then mined the most frequent contiguous k-mers (up to length 6) from a reference corpus of 200,000 PubChem small molecules and 200,000 SmProt peptides.65 This candidate list was filtered based on chemical validity and frequency (Tab. S2), yielding a compact vocabulary of 405 tokens (160 single-atom, 245 k-mer). This vocabulary reduces sequence length by approximately 60% compared to character-level encoding, enabling efficient training on longer peptide chains.

Model architecture and pretraining

Architecture

PeptideCLM-2 models are trained as BERT-style transformer encoders.41 We instantiated three model scales:

Small (32M):

6 layers, 384 hidden dimension, 6 heads.

Base (114M):

12 layers, 768 hidden dimension, 12 heads.

Large (337M):

24 layers, 1024 hidden dimension, 16 heads.

All models utilize Rotary Positional Embeddings (RoPE) to better capture relative distances in chemical space, along with SwiGLU activation functions and pre-layer normalization for training stability. Full hyperparameters can be found in Table S1.

Masking distributions

To implement span masking, we determined span lengths using a Gaussian distribution (μ=3.5,σ=1.0) with a minimum length of one token. The total number of tokens to be masked was calculated per sequence based on a specified masking percentage. For each span, a random starting position was selected, with adjustments made to prevent boundary overruns or overlaps with previously masked regions; selected positions were then replaced with the mask token ID until the target masking budget was satisfied.

Training objectives

We employed a hybrid objective function combining self-supervision with property regression:

=λMLMMLM+λMTRMTR
Masked language modeling (MLM):

We applied a 25% masking rate using a span-masking strategy to encourage the reconstruction of chemical fragments rather than trivial atoms. Span lengths were sampled from a Gaussian distribution (μ=3.5,σ=1), with each masked position replaced by the [MASK] token.

Multi-task regression (MTR):

In parallel, a regression head predicted 99 physicochemical descriptors (computed via RDKit) from the mean-pooled sequence embedding. The MTR head consists of two fully connected layers with SiLU activation.46 Descriptors were normalized to zero mean and unit variance prior to training.

For the combined objective, we set λMLM=0.6 and λMTR=0.4. To enforce structural invariance, we applied dynamic randomization of SMILES strings during training,66 ensuring the model sees different valid string representations of the same molecule.

Optimization

Models were trained using the AdamW optimizer (β1 = 0.9,β2 = 0.98, weight decay 0.01). Pretraining was performed on 8× NVIDIA H100 GPUs with a global batch size of 512 for 3 pseudo-epochs. Each pseudo-epoch consisted of a sample of ~20.25M molecules, comprising 10M small molecules, the full ESMAtlas, and a 5× upsampling of the LMSD. The learning rate followed a linear warmup for the first 5,000 steps to a peak of 3 × 10−4, followed by cosine annealing to 10% of the peak rate.

Downstream evaluation protocols

Finetuning strategy

For all downstream tasks, we replaced the pretraining heads with a task-specific feed-forward network consisting of two hidden layers with GeLU activation and dropout (p = 0.1). Optimization was performed using AdamW with a learning rate of 1 × 10−5 and a batch size of 16 unless otherwise noted.

Train/test splitting was strictly replicated from the benchmark literature for each dataset to ensure fair comparison. For aggregation prediction, we performed stratified 5-fold cross-validation with random splits.

Models were trained until validation loss plateaued (patience of 5 epochs, checking validation every 0.5 epoch). To improve robustness, final predictions are reported as the ensemble mean of models trained on the inner cross-validation folds.

Classification Baselines

As a primary baseline in classification tasks, we computed RDKit topological fingerprints (2048-bit, path length 1–7, 2 bits/hash)44 and trained Random Forest classifiers using scikit-learn.67 These models utilized 100 estimators (n_estimators=100) with the standard Gini impurity criterion. Tree depth was unconstrained, allowing nodes to expand until all leaves were pure. For specific tasks, we compared against published specialized architectures including AmpHGT57 and THPep.55

Layer-wise transfer learning, CycPeptMPDB

To assess the intrinsic quality of learned representations without weight updates, we froze the pretrained encoder and extracted embeddings from each layer. These embeddings were mean-pooled over the sequence length and passed to a LassoCV regressor (5-fold cross-validation). We allowed the model to automatically select the regularization parameter (α) over a path of 100 values with a maximum of 10,000 iterations.

Supplementary Material

Supplement 1
media-1.pdf (6.5MB, pdf)

Supporting Information Available

Supplementary tables and figures are included in this publication.

Acknowledgement

We would like to express our thanks to Rory Donovan-Maiye and Deepa Korani for their work at Novo Nordisk on large BERT-based modeling of SMILES strings that preceded this project. Finetuning on open classification datasets was performed using compute resources provided by NVIDIA.

This work was supported in part by NIH grants 1R01 AI148419 and 1R56 AI179799. C.O.W. was also supported by the Blumberg Centennial Professorship in Molecular Evolution.

Footnotes

Conflicts of Interest

M.S., S.S., and K.D. are employees of Novo Nordisk. A.L.F. initiated this work during a research internship at Novo Nordisk. This study utilizes an internal aggregation dataset generated by Novo Nordisk. C.O.W. declares no competing financial interests.

Data and Software Availability

All code for pretraining, finetuning, and data processing for this project has been made available at https://github.com/AaronFeller/PeptideCLM-2. Model code and weights for all models are released on huggingface at https://huggingface.co/collections/aaronfeller/peptideclm-2. Pretraining data has been released on Zenodo (doi:10.5281/zenodo.17993164) and on Huggingface datasets in the PeptideCLM-2 collection for ease of use. For advancement of modeling therapeutic peptides with noncanonical chemistry, the datasets used for downstream tasks including for membrane permeability, tumor homing, cell penetration, antimicrobial activity, and blood stability have been organized and can be found on the project github.

References

  • (1).Muttenthaler M.; King G. F.; Adams D. J.; Alewood P. F. Trends in peptide drug discovery. Nat. Rev. Drug Disc. 2021, 20, 309–325. [Google Scholar]
  • (2).Cooper B. M.; Iegre J.; O’Donovan D. H.; Halvarsson M. Ö.; Spring D. R. Peptides as a platform for targeted therapeutics for cancer: peptide–drug conjugates (PDCs). Chem. Soc. Rev. 2021, 50, 1480–1494. [DOI] [PubMed] [Google Scholar]
  • (3).Wang L.; Wang N.; Zhang W.; Cheng X.; Yan Z.; Shao G.; Wang X.; Wang R.; Fu C. Therapeutic peptides: current applications and future directions. Signal Transduct. 2022, 7, 48. [Google Scholar]
  • (4).Fetse J.; Kandel S.; Mamani U.-F.; Cheng K. Recent advances in the development of therapeutic peptides. Trends Pharmacol. Sci. 2023, 44, 425–441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (5).Barman P.; Joshi S.; Sharma S.; Preet S.; Sharma S.; Saini A. Strategic approaches to improvise peptide drugs as next generation therapeutics. Int. J. Pept. Res. Ther. 2023, 29, 61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (6).Sharma K.; Sharma K. K.; Sharma A.; Jain R. Peptide-based drug discovery: Current status and recent advances. Drug Discov. Today 2023, 28, 103464. [DOI] [PubMed] [Google Scholar]
  • (7).Hickey J. L.; Sindhikara D.; Zultanski S. L.; Schultz D. M. Beyond 20 in the 21st century: prospects and challenges of non-canonical amino acids in peptide drug discovery. ACS Med. Chem. Lett. 2023, 14, 557–565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (8).Zhang H.; Chen S. Cyclic peptide drugs approved in the last two decades (2001–2021). RSC Chem. Biol. 2022, 3, 18–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (9).Ji X.; Nielsen A. L.; Heinis C. Cyclic Peptides for Drug Development. Angew. Chem. Int. Ed. 2023, 202308251, e202308251. [Google Scholar]
  • (10).Lamers C. Overcoming the shortcomings of peptide-based therapeutics. Future Drug Discov. 2022, 4, FDD75. [Google Scholar]
  • (11).Openy J.; Vega-Ces S.; Amrahova G.; Mestdach E.; Chi C.; Kissel B.; ‘t Hart, P. Backbone Alterations in Cyclic Peptides Influence Both Membrane Permeability and Biological Activity. J. Med. Chem. 2025, 68, 24108–24126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (12).Castro T. G.; Melle-Franco M.; Sousa C. E.; Cavaco-Paulo A.; Marcos J. C. Non-canonical amino acids as building blocks for peptidomimetics: Structure, function, and applications. Biomolecules 2023, 13, 981. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (13).Terasaka N.; Iwane Y.; Geiermann A.-S.; Goto Y.; Suga H. Recent developments of engineered translational machineries for the incorporation of non-canonical amino acids into polypeptides. International Journal of Molecular Sciences 2015, 16, 6513–6531. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (14).Guo Z.; Diao T. Late-Stage Serine Modification Enables Noncanonical Peptide Synthesis. Journal of the American Chemical Society 2025, 147, 33127–33135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (15).Du Z.; Caragea D.; Guo X.; Li Y. PepBERT: Lightweight language models for bioactive peptide representation. bioRxiv 2025, 10.1101/2025.04.08.647838. [DOI] [Google Scholar]
  • (16).Wang L.; Pulugurta R.; Vure P.; Zhang Y.; Pal A.; Chatterjee P. PepDoRA: A unified peptide language model via weight-decomposed low-rank adaptation. arXiv 2024, 2410.20667. [Google Scholar]
  • (17).Fernández-Díaz R.; Ochoa R.; Hoang T. L.; Lopez V.; Shields D. How to build machine learning models able to extrapolate from standard to modified peptides. J. Cheminform. 2025, [Google Scholar]
  • (18).Consortium T. U. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D617. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (19).Rives A.; Meier J.; Sercu T.; Goyal S.; Lin Z.; Liu J.; Guo D.; Ott M.; Zitnick C. L.; Ma J.; others Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 2021, 118, e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (20).Lin Z.; Akin H.; Rao R.; Hie B.; Zhu Z.; Lu W.; Smetanin N.; Verkuil R.; Kabeli O.; Shmueli Y.; others Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. [DOI] [PubMed] [Google Scholar]
  • (21).Hayes T.; Rao R.; Akin H.; Sofroniew N. J.; Oktay D.; Lin Z.; Verkuil R.; Tran V. Q.; Deaton J.; Wiggert M.; others Simulating 500 million years of evolution with a language model. Science 2025, 387, 850–858. [DOI] [PubMed] [Google Scholar]
  • (22).Elnaggar A.; Heinzinger M.; Dallago C.; Rehawi G.; Wang Y.; Jones L.; Gibbs T.; Feher T.; Angerer C.; Steinegger M.; others ProtTrans: towards cracking the language of life’s code through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar]
  • (23).Alanazi W.; Meng D.; Pollastri G. Porter 6: protein secondary structure prediction by leveraging pre-trained language models (PLMs). Int. J. Mol. Sci. 2024, 26, 130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (24).Zhang Z.; Wayment-Steele H. K.; Brixi G.; Wang H.; Kern D.; Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc. Natl. Acad. Sci. 2024, 121, e2406285121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (25).Heinzinger M.; Littmann M.; Sillitoe I.; Bordin N.; Orengo C.; Rost B. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom. Bioinform. 2022, 4, lqac043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (26).Sun Y.; Shen Y. Structure-informed protein language models are robust predictors for variant effects. Hum. Genet. 2025, 144, 209–225. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (27).Gurev S.; Youssef N.; Jain N.; Marks D. Variant effect prediction with reliability estimation across priority viruses. bioRxiv 2025, 10.1101/2025.08.04.668549. [DOI] [Google Scholar]
  • (28).Liu D.; Young F.; Lamb K. D.; Claudio Quiros A.; Pancheva A.; Miller C. J.; Macdonald C.; Robertson D. L.; Yuan K. PLM-interact: extending protein language models to predict protein-protein interactions. Nat. Commun. 2025, 16, 9012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (29).Lu A. X.; Zhang H.; Ghassemi M.; Moses A. Self-supervised contrastive learning of protein representations by mutual information maximization. BioRxiv 2020, 10.1101/2020.09.04.283929. [DOI] [Google Scholar]
  • (30).Weininger D. SMILES, a chemical language and information system. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. [Google Scholar]
  • (31).Krenn M.; Ai Q.; Barthel S.; Carson N.; Frei A.; Frey N. C.; Friederich P.; Gaudin T.; Gayle A. A.; Jablonka K. M.; others SELFIES and the future of molecular string representations. Patterns 2022, 3, 100588. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (32).Wang S.; Guo Y.; Wang Y.; Sun H.; Huang J. SMILES-BERT: Large-Scale Unsupervised Pre-training for Molecular Property Prediction. Proc. ACM Conf. Bioinform. Comput. Biol. 2019, 429–436. [Google Scholar]
  • (33).Chithrananda S.; Grand G.; Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv 2020, 2010.09885. [Google Scholar]
  • (34).Ross J.; Belgodere B.; Chenthamarakshan V.; Padhi I.; Mroueh Y.; Das P. Large-scale chemical language representations capture molecular structure and properties. Nat. Mach. Intell. 2022, 4, 1256–1264. [Google Scholar]
  • (35).Park J.-H.; Park H.; Kim Y.; Lim W.; Lee S. Moleco: Molecular Contrastive Learning with Chemical Language Models for Molecular Property Prediction. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track 2024, 408–420. [Google Scholar]
  • (36).Park J.-H.; Kim Y.; Lee M.; Park H.; Lee S. Moltres: Improving chemical language representation learning for molecular property prediction. arXiv preprint arXiv:2408.01426 2024, [Google Scholar]
  • (37).Ahmad W.; Simon E.; Chithrananda S.; Grand G.; Ramsundar B. ChemBERTa-2: Towards chemical foundation models. arXiv 2022, 2209.01712. [Google Scholar]
  • (38).Lv L.; Lin Z.; Li H.; Liu Y.; Cui J.; Chen C. Y.-C.; Yuan L.; Tian Y. ProLLaMA: A protein large language model for multi-task protein language processing. IEEE Trans. Artif. Intell. 2025, [Google Scholar]
  • (39).Praski M.; Adamczyk J.; Czech W. Benchmarking pretrained molecular embedding models for molecular representation learning. arXiv 2025, 2508.06199. [Google Scholar]
  • (40).Adamczyk J.; Ludynia P.; Czech W. Molecular Fingerprints Are Strong Models for Peptide Function Prediction. arXiv 2025, 2501.17901. [Google Scholar]
  • (41).Devlin J.; Chang M.-W.; Lee K.; Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. NAACL 2019, 4171–4186. [Google Scholar]
  • (42).Feller A. L.; Wilke C. O. Peptide-aware chemical language model successfully predicts membrane diffusion of cyclic peptides. J. Chem. Inf. Model. 2025, 65, 571–579. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (43).Borchani H.; Varando G.; Bielza C.; Larranaga P. A survey on multi-output regression. WIREs Data Min. Knowl. Discov. 2015, 5, 216–233. [Google Scholar]
  • (44).Landrum G.; others RDKit: Open-source cheminformatics. 2013; http://www.rdkit.org, Online.
  • (45).Su J.; Lu Y.; Pan S.; Murtadha A.; Wen B.; Roformer Y. L. Enhanced transformer with rotary position embedding., 2021. DOI: 10.1016/j.neucom 2023, [DOI] [Google Scholar]
  • (46).Elfwing S.; Uchibe E.; Doya K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [DOI] [PubMed] [Google Scholar]
  • (47).Xiong R.; Yang Y.; He D.; Zheng K.; Zheng S.; Xing C.; Zhang H.; Lan Y.; Wang L.; Liu T. On layer normalization in the transformer architecture. Proc. Int. Conf. Mach. Learn. (ICML) 2020, 10524–10533. [Google Scholar]
  • (48).Joshi M.; Chen D.; Liu Y.; Weld D. S.; Zettlemoyer L.; Levy O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar]
  • (49).Sud M.; Fahy E.; Cotter D.; Brown A.; Dennis E. A.; Glass C. K.; Merrill A. H. Jr; Murphy R. C.; Raetz C. R.; Russell D. W.; others LMSD: LIPID MAPS structure database. Nucleic Acids Res. 2007, 35, D527–D532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (50).Kim S.; Chen J.; Cheng T.; Gindulyte A.; He J.; He S.; Li Q.; Shoemaker B. A.; Thiessen P. A.; Yu B.; others PubChem 2025 update. Nucleic Acids Res. 2025, 53, D1516–D1525. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (51).Morgan H. L. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113. [Google Scholar]
  • (52).Ramsundar B.; Eastman P.; Walters P.; Pande V.; Leswing K.; Wu Z. Deep Learning for the Life Sciences; O’Reilly, 2019. [Google Scholar]
  • (53).Li J.; Yanagisawa K.; Sugita M.; Fujie T.; Ohue M.; Akiyama Y. CycPeptMPDB: a comprehensive database of membrane permeability of cyclic peptides. J. Chem. Inf. Model. 2023, 63, 2240–2250. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (54).Scodeller P.; Asciutto E. K. Targeting tumors using peptides. Molecules 2020, 25, 808. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (55).Shoombuatong W.; Schaduangrat N.; Pratiwi R.; Nantasenamat C. THPep: A machine learning-based approach for predicting tumor homing peptides. Comput. Biol. Chem. 2019, 80, 441–451. [DOI] [PubMed] [Google Scholar]
  • (56).Kumar V.; Agrawal P.; Kumar R.; Bhalla S.; Usmani S. S.; Varshney G. C.; Raghava G. P. Prediction of cell-penetrating potential of modified peptides containing natural and chemically modified residues. Front. Microbiol. 2018, 9, 725. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (57).He Y.; Song X.; Wan H.; Zhao X. AmpHGT: expanding prediction of antimicrobial activity in peptides containing non-canonical amino acids using multi-view constrained heterogeneous graph transformer. BMC Biol. 2025, 23, 184. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (58).Hu H.; Zhang C.; Xu Z.; Guo J.; Su A.; Li C.; Duan H. PepMSND: integrating multi-level feature engineering and comprehensive databases to enhance in vitro/in vivo peptide blood stability prediction. Digital Discovery 2025, 4, 2478–2490. [Google Scholar]
  • (59).Liu Z.; Wang Y.; Vaidya S.; Ruehle F.; Halverson J.; Soljačić M.; Hou T. Y.; Tegmark M. Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756 2024, [Google Scholar]
  • (60).Quartararo A. J.; Gates Z. P.; Somsen B. A.; Hartrampf N.; Ye X.; Shimada A.; Kajihara Y.; Ottmann C.; Pentelute B. L. Ultra-large chemical libraries for the discovery of high-affinity peptide binders. Nat. Commun. 2020, 11, 3183. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (61).Niquille D. L.; Hansen D. A.; Mori T.; Fercher D.; Kries H.; Hilvert D. Nonribosomal biosynthesis of backbone-modified peptides. Nat. Chem. 2018, 10, 282–287. [DOI] [PubMed] [Google Scholar]
  • (62).Shin J.-E.; Riesselman A. J.; Kollasch A. W.; McMahon C.; Simon E.; Sander C.; Manglik A.; Kruse A. C.; Marks D. S. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 2021, 12, 2403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (63).Feller A. L.; Wilke C. O. p2smi: A toolkit enabling smiles generation and property analysis for noncanonical and cyclized peptides. Journal of Open Source Software 2025, 10, 8319. [Google Scholar]
  • (64).Li X.; Fourches D. SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 2021, 61, 1560–1569. [DOI] [PubMed] [Google Scholar]
  • (65).Li Y.; Zhou H.; Chen X.; Zheng Y.; Kang Q.; Hao D.; Zhang L.; Song T.; Luo H.; Hao Y.; others SmProt: a reliable repository with comprehensive annotation of small proteins identified from ribosome profiling. Genomics Proteomics Bioinformatics 2021, 19, 602–610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (66).Arús-Pous J.; Johansson S. V.; Prykhodko O.; Bjerrum E. J.; Tyrchan C.; Reymond J.-L.; Chen H.; Engkvist O. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 2019, 11, 71. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (67).Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; others Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1
media-1.pdf (6.5MB, pdf)

Data Availability Statement

All code for pretraining, finetuning, and data processing for this project has been made available at https://github.com/AaronFeller/PeptideCLM-2. Model code and weights for all models are released on huggingface at https://huggingface.co/collections/aaronfeller/peptideclm-2. Pretraining data has been released on Zenodo (doi:10.5281/zenodo.17993164) and on Huggingface datasets in the PeptideCLM-2 collection for ease of use. For advancement of modeling therapeutic peptides with noncanonical chemistry, the datasets used for downstream tasks including for membrane permeability, tumor homing, cell penetration, antimicrobial activity, and blood stability have been organized and can be found on the project github.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints

RESOURCES