[Preprint]. 2025 Oct 6:2025.10.03.680398. [Version 1] doi: 10.1101/2025.10.03.680398

SLAE: Strictly Local All-atom Environment for Protein Representation

Yilin Chen 1, Cizhang Zhao 2, Po-Ssu Huang 3, Tianyu Lu 4, Hannah K Wayment-Steele 5
PMCID: PMC12632552  PMID: 41278779

Abstract

Building physically grounded protein representations is central to computational biology, yet most existing approaches rely on sequence-pretrained language models or backbone-only graphs that overlook side-chain geometry and chemical detail. We present SLAE, a unified all-atom framework for learning protein representations from each residue’s local atomic neighborhood using only atom types and interatomic geometries. To encourage expressive feature extraction, we introduce a novel multi-task autoencoder objective that combines coordinate reconstruction, sequence recovery, and energy regression. SLAE reconstructs all-atom structures with high fidelity from latent residue environments and achieves state-of-the-art performance across diverse downstream tasks via transfer learning. SLAE’s latent space is chemically informative and environmentally sensitive, enabling quantitative assessment of structural qualities and smooth interpolation between conformations at all-atom resolution.

1. Introduction

Proteins are the fundamental machinery of life, carrying out processes from catalysis and signaling to structural organization. Their remarkable functional diversity arises not only from their amino acid sequences but from the intricate three-dimensional structures into which those sequences fold.

Within protein structures, the backbone and side chain atoms act as an intricately coupled system that establishes local atomic environments through hydrophobic packing, hydrogen-bonding networks, and electrostatic interactions. These residue-level environments mediate conformational preferences and side chain dynamics, linking the global fold to the specific interactions that underlie protein function. Representing these interactions in a concise, learnable form is therefore essential for generalizable and physically grounded models of protein structure and function.

Current representations, whether from protein language models (PLMs) or sequence-structure joint embeddings, cannot isolate physical interactions from evolutionary information, and often resort to backbone-only structural input to reduce computational demands. As a result, the field remains limited by the absence of a general-purpose pretraining framework that extracts, compresses, and transfers knowledge of all-atom structure across proteins and downstream applications.

We propose SLAE (Strictly Local All-atom Environment autoencoder), a framework for protein representation learning that models a protein as a set of residue-centric chemical environments. To promote generalizability and a physically grounded view, SLAE enforces an informational bottleneck by restricting the encoder to strictly local atom graphs and pairing it with an asymmetric decoder that must recover the full structure. When this reconstruction task is solved, the resulting tokenization of structure emerges jointly from the representation and the model, emphasizing physically meaningful interactions rather than heuristic features. Fully connected local atom graphs capture interactions between a residue and its neighboring atoms and are computationally tractable during pretraining. We show these local representations are sufficient to reconstruct all-atom Cartesian coordinates with high fidelity.

We design an all-atom autoencoder architecture that separates local and global reasoning across the encoding and decoding stages. An SE(3)-equivariant graph encoder maps each local environment to a rotation/translation-invariant residue token. A Transformer decoder with self-attention then aggregates these tokens to model long-range couplings and reconstruct coherent global geometry. This residue-level bottleneck forces the encoder to distill packing signals such as covalent bonds, hydrogen-bond motifs, and steric/electrostatic cues that the global decoder requires to reconstruct long-range geometry, facilitating transfer across tasks. We introduce a physics-augmented pretraining objective that couples two self-supervised tasks, (i) all-atom coordinate reconstruction and (ii) sequence recovery, with a supervised task, (iii) regression of Rosetta-derived inter-residue energies. These complementary signals act as a multi-view regularizer, aligning the latent space with atomistic structure, biochemical identity, and energetics, yielding embeddings that vary smoothly with conformation and are interpretable along axes of side-chain chemistry, solvent exposure, and secondary structure.

SLAE supports multiscale readouts: atom and residue embeddings for fine-grained local characterization, and pooled protein-level features for global structure. This flexibility allows downstream task heads to focus on single residues, interfaces, or entire folds using a single pretrained representation. We demonstrate that pretraining directly on all-atom protein structures yields features that transfer effectively. Across benchmarks on multiple resolution scale tasks including fold classification, protein–protein binding affinity, single-point mutation stability, and NMR chemical shifts, SLAE achieves state-of-the-art or on-par performance.

Main contributions:

With the SLAE framework, we (i) propose a residue-centered, local atom-graph protein representation, and show it is sufficient for high-fidelity all-atom reconstruction; (ii) propose an energy regression task to guide reconstruction pretraining; (iii) design local encoding and global decoding stages in an all-atom autoencoder to encourage compact and transferable residue embeddings; (iv) achieve state-of-the-art results on diverse downstream tasks with transfer learning; and (v) show that the above design yields an interpretable latent space.

2. Related Work

Protein Representation Pretraining

Protein representation learning has followed two main tracks. Sequence pretraining with protein language models (PLMs) on massive corpora captures evolutionary constraints but lacks explicit structure information (Meier et al., 2021; Lin et al., 2023). In parallel, graph denoising objectives noise sequence or structural features and train graph models to recover them (Zaidi et al., 2022; Jamasb et al., 2024), capturing global context while abstracting away side-chain geometry. Neither paradigm learns atomistic features as the primary signal. SLAE departs by pretraining directly on all-atom coordinate reconstruction and showing that features learned from atomistic geometry are sufficient for high-fidelity coordinate reconstruction and downstream transfer.

Sequence-structure co-embedding approaches pair PLM embeddings with structural features to inject geometry into sequence representations, improving downstream performance without learning at all-atom resolution. Representative methods include SaProt (Su et al., 2023b), FoldToken (Gao et al., 2024), ProSST (Li et al., 2024), and ESM3 (Hayes et al., 2024). Most hybrid models augment sequence tokens with backbone-level descriptors, and the learned tokens remain sequence-anchored. SLAE instead learns structure- and energetics-anchored residue tokens, reducing sequence-only bias while increasing the resolution of the structure representation.

All-atom Protein Representation

All-atom protein generative models, which simultaneously generate backbone and side-chain coordinates, also carry an all-atom representation of protein structure. Protpardelle (Chu et al., 2024) can be cast as a continuous normalizing flow to generate deterministic latent encodings of all-atom protein structures. A joint embedding space of sequence and all-atom structure was proposed in CHEAP (Lu et al., 2024), in which the embeddings reconstruct all-atom protein structures and recover sequence. However, interpolation between two conformations of the same protein sequence is not possible, as identical sequences would map to the same CHEAP embedding. Representations can also be derived from protein structure prediction models such as AlphaFold3 (Abramson et al., 2024), but the information is distributed across layers and in both single and pairwise representations.

Geometric GNNs for Atomistic Systems

Representing atomistic systems as geometric graphs is natural. While encoders for proteins have been proposed using point cloud voxelization, graph convolution, and hierarchical pooling (Hermosilla et al., 2021; Anand et al., 2022; Wang et al., 2023), they incur a considerable computational burden, making them impractical for large-scale pretraining with previously proposed denoising objectives. Equivariant GNNs such as DimeNet (Gasteiger et al., 2022), NequIP (Batzner et al., 2022), and MACE (Batatia et al., 2023) excel at small-molecule property prediction and interatomic potentials. For scalability, many adopt low-order interactions with truncated neighborhoods, closely related to Atomic Cluster Expansion (ACE) formulations (Drautz, 2019). Works extending atomistic modeling to proteins are emerging (Pengmei et al., 2024; Bojan et al., 2025), but existing approaches typically pretrain on small-molecule datasets, reuse features from pretrained potential models, or are trained in a task-specific manner. There remains a gap in methods amenable to large-scale, all-atom pretraining on proteins. SLAE addresses this by modeling two-body local interactions over cutoff graphs and pretraining a physics-informed autoencoder that yields a general, task-agnostic latent space at protein scale: thousands of atoms per system compared to tens of atoms.

3. The SLAE Framework

We introduce the SLAE autoencoder and its end-to-end pretraining objectives (Fig. 1A). SLAE solves a deliberately difficult two-part problem: the geometric graph encoder projects interatomic interactions within each atom’s local neighborhood into compact residue tokens, while the decoder learns a global prior over how these local environments compose into coherent macromolecular structures. This residue-level bottleneck over all-atom inputs makes large-scale pretraining tractable and learns meaningful embeddings.

Figure 1: Overview of the SLAE framework.


A. Pretraining A graph encoder maps local atomic neighborhoods to residue embeddings. Examples of atom connectivity shown as input to the encoder, with different colors for each residue. The transformer decoder connects pooled local features at residue level into the full-atom protein structure. The decoder also regresses to inter-residue energy score terms. B. Transfer learning The pretrained embeddings are fed to lightweight heads for diverse downstream tasks. C. Latent geometry Linear interpolations on latent space decode to physically coherent structures that follow changes on the underlying chemical-environment manifold.

3.1. Structure Representation

Given a protein structure, we construct a directed graph G=(V,E), where:

Nodes

Each node $v_i \in V$ represents heavy atom $a_i$. The node feature is a one-hot encoding of the atom’s chemical type.

Edges

For each pair of atoms $a_i, a_j$ with $\|a_j - a_i\|_2 \le 8$ Å, we define a directed edge $e_{ji} \in E$ with features $h^{(e)}_{ji}$ given by the concatenation of (i) the scalar interatomic distance $\|a_j - a_i\|_2$ expanded in Bessel radial basis functions $\phi_r(a_i, a_j)$ and (ii) the unit interatomic direction projected onto spherical harmonics $Y_{\ell}^{m}\!\left(\phi_a(a_i, a_j)\right)$.

Design Motivation

This representation is minimal yet physically complete: it encodes interatomic distances and orientations without relying on torsion angles, amino acids, or residue indices. As such, it enables generalization to arbitrary biomolecular complexes, which we leave for future work. Bond connectivity and hydrogen patterns are learned implicitly through the autoencoder objective detailed in Section 3.4.
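As a concrete illustration, the following minimal Python/NumPy sketch builds this directed graph and its raw edge geometry from heavy-atom coordinates and element symbols. The element vocabulary, the build_local_graph name, and the return layout are illustrative rather than taken from the released implementation; the Bessel and spherical-harmonic expansions of Appendix A.2 would then be applied to the returned distances and directions.

import numpy as np

# Illustrative heavy-atom element vocabulary; the encoder one-hot encodes atom chemical types.
ELEMENTS = ["C", "N", "O", "S", "SE"]

def build_local_graph(coords, elements, cutoff=8.0):
    """coords: (N, 3) heavy-atom coordinates; elements: length-N element symbols.
    Returns node features, directed edges within the cutoff, and raw edge geometry."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]

    # Node features: one-hot encoding of the atom's chemical type.
    node_feats = np.zeros((n, len(ELEMENTS)))
    for i, el in enumerate(elements):
        node_feats[i, ELEMENTS.index(el)] = 1.0

    # Pairwise displacements and distances: diff[i, j] = a_j - a_i.
    diff = coords[None, :, :] - coords[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Directed edge (j -> i) whenever ||a_j - a_i|| <= cutoff and i != j.
    center, nbr = np.where((dist <= cutoff) & (dist > 0.0))

    edge_dist = dist[center, nbr]                       # scalar distances, expanded in Bessel RBFs
    edge_dir = diff[center, nbr] / edge_dist[:, None]   # unit directions, projected onto spherical harmonics
    return node_feats, (center, nbr), edge_dist, edge_dir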

3.2. Encoder

The encoder maps each atom’s local chemical environment into residue-level latent embeddings $z_1, \dots, z_n$, with $z_i \in \mathbb{R}^{128}$.

Equivariant Neighborhood Embedding

We employ an SE(3)-equivariant neural network, inspired by Musaelian et al. (2023), that operates on each heavy atom and its neighbors through learned edge embeddings. Each layer $L$ maintains coupled latent spaces: a scalar space $x_{ij}^{L}$ (invariant) and a tensor space $V_{ij}^{L}$ (equivariant). An equivariant tensor product incorporates interactions between the current equivariant state of the center–neighbor pair $(i,j)$ and all other neighbors $k \in N(i)$: $V_{ij}^{L} = V_{ij}^{L-1} \otimes \sum_{k \in N(i)} w_{ik}^{L}\,\phi(\hat{r}_{ik})$, where $\phi(\hat{r}_{ik})$ is a geometric embedding of the neighbor direction and $w_{ik}^{L}$ are learned weights derived from scalar features of edges $(i,k)$. This can be viewed as a weighted projection of the atomic density around atom $i$, enabling equivariant interactions between the pair $(i,j)$ and the environment of $i$.

Following the tensor product, scalar outputs are reintroduced into the scalar latent space with $x_{ij}^{L} = \mathrm{MLP}_{\mathrm{latent}}^{L}\!\left(x_{ij}^{L-1} \,\|\, V_{ij}^{L}\right) u(r_{ij})$, where $u(r_{ij})$ is a smooth cutoff envelope. This step completes the coupling of scalar and equivariant latent spaces: scalars distilled from tensor products inject directional information back into $x_{ij}^{L}$, allowing the invariant channel to carry geometric cues that were previously only available to the equivariant representation.

Residue Environment Pooling

After the final layer, we obtain scalar pair features $x_{ij}^{L}$. We first pool to atoms by mean-aggregating incoming edges, and then pool atom embeddings to residues: $s_i = \frac{1}{|N(i)|}\sum_{j \in N(i)} x_{ij}^{L}$, $z_r = \frac{1}{|A(r)|}\sum_{i \in A(r)} s_i$. This yields compact residue-level representations while retaining strictly local chemical information.
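A minimal sketch of this two-stage pooling, assuming the final-layer scalar edge features are stored as an array indexed by their receiving (center) atom and that an atom-to-residue assignment is available (array names are illustrative):

import numpy as np

def pool_edges_to_residues(edge_feats, edge_center, atom_residue, n_atoms, n_res):
    """edge_feats: (E, d) final-layer scalar features x_ij.
    edge_center: (E,) index of the center atom i that each edge points into.
    atom_residue: (n_atoms,) residue index of each atom."""
    d = edge_feats.shape[1]

    # Mean over incoming edges -> per-atom embedding s_i.
    atom_sum = np.zeros((n_atoms, d))
    atom_cnt = np.zeros(n_atoms)
    np.add.at(atom_sum, edge_center, edge_feats)
    np.add.at(atom_cnt, edge_center, 1.0)
    s = atom_sum / np.maximum(atom_cnt, 1.0)[:, None]

    # Mean over atoms in each residue -> residue embedding z_r (here d = 128).
    res_sum = np.zeros((n_res, d))
    res_cnt = np.zeros(n_res)
    np.add.at(res_sum, atom_residue, s)
    np.add.at(res_cnt, atom_residue, 1.0)
    return res_sum / np.maximum(res_cnt, 1.0)[:, None]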

Design Motivation

The encoder updates edge embeddings dynamically by incorporating information from neighboring edges. This paradigm, originally developed for interatomic potentials on small-molecule graphs, extends naturally to large protein graphs. This allows SLAE to capture strictly local but physically meaningful chemical environments. Pooling representations to the residue level serves as an efficient and natural information bottleneck for protein structure.

3.3. Decoder

Having distilled each residue’s local chemistry and geometry into embeddings $z \in \mathbb{R}^{128}$, the decoder assembles these local descriptors into a single, coherent macromolecule that respects long-range couplings.

Architecture

We first project each latent embedding to a model dimension of 1024. On top of these expanded embeddings, we employ a Transformer architecture with global self-attention and Rotary Positional Embeddings (RoPE) (Su et al., 2023a) to capture long-range residue interactions with a stack of multi-head self-attention layers.

The Transformer outputs are passed into three parallel MLP heads for structure reconstruction, sequence recovery, and energy prediction:

  1. Reconstructs the 3D coordinates of up to 37 heavy atoms (backbone and side chain) per residue, $\hat{x} \in \mathbb{R}^{n \times 37 \times 3}$.

  2. Recovers the amino acid identity at each residue position, $\hat{s} \in \mathbb{R}^{n \times 20}$.

  3. Approximates inter-residue physical interactions using Rosetta scores, including hydrogen bonding, electrostatics, and solvation energies, $\hat{r} \in \mathbb{R}^{n \times n \times 3}$.

Design Motivation

The decoder is designed to complement the encoder’s strictly local representation by modeling global dependencies across residues. Global self-attention allows residue embeddings to exchange information across the entire protein, enabling the reconstruction of coherent backbone and side-chain geometries. The addition of energy prediction task guides the decoder toward physically meaningful structures, ensuring that the latent space encodes not only geometric detail but also the energetic constraints that govern protein stability and interactions.
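To make the head layout concrete, the following PyTorch sketch wires the three readouts described above onto the Transformer outputs. It is a simplified illustration rather than the released implementation: the pairwise energy featurization is reduced to a concatenation of per-residue projections (the actual head uses pairwise product/difference features and per-term clamping; see Appendix A.3), and the hidden width d_pair is a placeholder.

import torch
import torch.nn as nn

class SLAEHeads(nn.Module):
    """Parallel readout heads over decoder outputs H of shape (n_res, d_model)."""
    def __init__(self, d_model=1024, d_pair=128):
        super().__init__()
        self.coord_head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 37 * 3))
        self.seq_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                      nn.Linear(d_model, 20))
        self.pair_proj = nn.Linear(d_model, d_pair)
        self.energy_head = nn.Sequential(nn.Linear(2 * d_pair, d_pair), nn.GELU(),
                                         nn.Linear(d_pair, 3))   # hbond, solvation, electrostatics

    def forward(self, H):
        n = H.shape[0]
        coords = self.coord_head(H).view(n, 37, 3)            # atom-37 coordinates per residue
        seq_logits = self.seq_head(H)                          # 20-way amino acid logits
        p = self.pair_proj(H)
        pair = torch.cat([p[:, None, :].expand(n, n, -1),      # lift per-residue features to pairs
                          p[None, :, :].expand(n, n, -1)], dim=-1)
        energies = self.energy_head(pair)                      # (n, n, 3) inter-residue energy terms
        return coords, seq_logits, energies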

3.4. Pretraining

We pretrain SLAE end-to-end on full atomic structures with three complementary objectives:

  1. All-atom Structure Recovery To obtain the predicted structure $\hat{x}$, we mask the atom-37 template coordinates while providing the ground-truth residue identities, training the decoder to recover the ground-truth coordinates. We supervise this reconstruction with a combination of an all-atom local distance difference test loss (SmoothLDDT) (Jumper et al., 2021) and frame-aligned point error (FAPE) (Anishchenko et al., 2024): $\mathcal{L}_{\text{struct}} = \alpha\,\mathrm{LDDT}(x, \hat{x}) + \beta\,\mathrm{FAPE}(x, \hat{x})$, where $x$ and $\hat{x}$ denote the ground-truth and predicted all-atom coordinates.

  2. Sequence Recovery We additionally recover the residue sequence from the latent space: $\mathcal{L}_{\text{seq}} = \mathrm{CrossEntropy}(s, \hat{s})$, where $s$ is the ground-truth amino-acid identity and $\hat{s}$ are the predicted logits over 20 amino acid classes.

  3. Energy Prediction To inject physically grounded supervision, we predict inter-residue energies approximated by Rosetta scores, including hydrogen bonding, electrostatics, and solvation: $\mathcal{L}_{\text{energy}} = \|r - \hat{r}\|_2^2$, where $r$ and $\hat{r}$ are the ground-truth and predicted energy terms.

The combined loss integrates all three components:

$\mathcal{L} = w_{\text{coord}}\left(\alpha\,\mathrm{LDDT} + \beta\,\mathrm{FAPE}\right) + w_{\text{seq}}\,\mathrm{CrossEntropy} + w_{\text{energy}}\,\mathrm{MSE}$ (1)

with weights $w_{\text{coord}}, w_{\text{seq}}, w_{\text{energy}} \ge 0$ as tunable hyperparameters (Appendix B.1).
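A minimal sketch of assembling this objective, assuming smooth_lddt and fape implement the losses of Appendix B.1 and that predictions and targets are packed into dictionaries (names and shapes are illustrative):

import torch.nn.functional as F

def slae_pretraining_loss(pred, target, smooth_lddt, fape,
                          w_coord=1.0, w_seq=1.0, w_energy=1.0, alpha=10.0, beta=1.0):
    """pred/target: dicts with 'coords' (n, 37, 3), 'seq' (logits / integer labels),
    and 'energy' (n, n, 3) entries."""
    l_struct = alpha * smooth_lddt(pred["coords"], target["coords"]) \
             + beta * fape(pred["coords"], target["coords"])
    l_seq = F.cross_entropy(pred["seq"], target["seq"])         # 20-class sequence recovery
    l_energy = F.mse_loss(pred["energy"], target["energy"])     # Rosetta-derived energy terms
    return w_coord * l_struct + w_seq * l_seq + w_energy * l_energy

The default weights mirror the pretraining setting in Appendix B.2 (w_coord = w_seq = w_energy = 1, α = 10, β = 1).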

Implicit Latent Space Regularization

By jointly optimizing geometry, identity, and energetics, SLAE’s pretraining objective provides complementary constraints on the latent space: (i) Geometry losses depend smoothly on atomic coordinates, promoting continuous and physically plausible reconstructions. (ii) Sequence recovery encourages embeddings to encode amino acid identity, preserving biochemical interpretability and avoiding collapse. (iii) Energy prediction provides a physics-based signal, guiding embeddings toward inter-residue interactions such as hydrogen bonding, solvation, and electrostatics. These losses shape a latent manifold that maps cleanly onto valid, physically coherent protein conformations. The result is a structurally consistent, chemically informative, and energetically grounded representation without relying on explicit regularizers.

3.5. Results and ablations

We pretrain SLAE on a sequence-augmented CATH (Ingraham et al., 2019)-derived dataset (Lu et al., 2025b) (Appendix C). On the held-out test set with no family overlap, the autoencoder achieves 99.9% sequence recovery and an all-atom RMSD of 1.1 Å for structures shorter than 128 residues and 1.9 Å across all lengths up to 512 residues.

We study the effect of model and pretraining design choices on pretraining performance (Table 6). For encoder locality, we swept cutoff radii and found that an 8 Å neighborhood yields the best results (Appendix E). For discretization, we compare end-to-end VQ (van den Oord et al., 2018) and LFQ (Yu et al., 2023) against post-hoc kNN codebooks built on frozen encoder embeddings. End-to-end quantization trades off sequence and structure accuracy, whereas reconstruction from post-hoc kNN-codebook-quantized embeddings approaches continuous resolution as the codebook grows. Ablation experiments (Table 6, Appendix E) further highlight the importance of both the FAPE loss and Rosetta-derived energy supervision, confirming the effectiveness of our multitask pretraining framework. These results validate the design choices and permit downstream evaluation on a faithful representation of protein structures.

4. Downstream Tasks

We next demonstrate that SLAE embeddings pretrained on all-atom reconstruction and energetics objectives transfer effectively to diverse downstream tasks (Figure 1B). Across all four benchmarks spanning complementary biological scales, SLAE achieves performance better than or on par with state-of-the-art methods, underscoring the generality and flexibility of the SLAE framework.

Fold Classification

Protein fold classification is a cornerstone of structural biology, linking structure to evolutionary relationships and functional annotation. Using the SCOPe 1.75 dataset (Fox et al., 2014) and following Hou et al. (2018), we evaluate generalization under three test sets: Family, Superfamily, and Fold. An MLP is trained on pooled residue embeddings. SLAE achieves on-par or superior accuracy compared to prior state-of-the-art models across all splits (Table 2), demonstrating that global fold information can be recovered even from strictly local all-atom embeddings.

Table 2:

Fold classification accuracy (%) on SCOPe 1.75 under three test splits

Method Fold (%) Superfamily (%) Family (%)
GVP-GNN(Jing et al., 2021) 16.0 22.5 83.8
IEConv(Hermosilla et al., 2021) 45.0 69.7 98.9
GearNet-Edge-IEConv (Zhang et al., 2023) 48.3 70.3 99.5
ProNet-SCHull (Wang et al., 2024) 56.1 74.6 99.4
SLAE-finetuned 55.1 77.1 99.1

Protein-Protein Binding Affinity Prediction

Protein-protein interactions underlie nearly all cellular processes, and accurate prediction of binding affinity is critical for understanding signaling pathways, complex assembly, and therapeutic design. We evaluate SLAE on the PPB-Affinity dataset (Liu et al., 2024), a recently curated large-scale benchmark that aggregates 12,062 experimental binding ΔΔG values from multiple sources and aligns them with high-quality structural complexes.

Complex structures are embedded chain-wise and interface-wise with the SLAE encoder, and pooled residue embeddings are passed into an MLP for regression. In 5-fold cross-validation, SLAE achieves lower RMSE and higher Pearson correlation than PLM-based baselines (Table 3). Despite being pretrained only on single-chain data, SLAE generalizes seamlessly to multi-chain contexts, thanks to its atomistic representation that does not rely on residue or chain indices.

Table 3:

Protein-protein binding affinity prediction on the PPB-Affinity dataset

Method RMSE (kcal/mol) Pearson Correlation
PPB-Affinity Baseline (Liu et al., 2024) 2.08 0.70
PPLM-Affinity (Liu et al., 2025) 1.89 0.76
SLAE-finetuned (w/o. interface) 2.01 0.73
SLAE-finetuned (with interface) 1.86 0.77

Single-Point Mutation Thermostability Prediction

Protein stability is fundamental to function, and predicting the impact of point mutations on thermostability (ΔΔG) is a central challenge for protein engineering, drug resistance modeling, and disease variant interpretation. We benchmark SLAE on the Megascale mutation dataset (Tsuboyama et al., 2023), filtered according to the ThermoMPNN protocol (Dieckhaus et al., 2024) to 272,712 mutations across 298 proteins.

Pairs of wild-type and mutant structures are embedded with residue-level differences extracted at the mutation site. An MLP head predicts ΔΔG. SLAE achieves 0.68 RMSE and 0.76 Pearson correlation (Table 4) on the test set, outperforming prior methods. Ablation experiments show that removing mutation-site differencing degrades performance, highlighting the importance of local residue environment modeling for physical property prediction in the SLAE framework.

Table 4:

Single-point mutation thermostability prediction on the Megascale dataset test split

Method RMSE (kcal/mol) Pearson Correlation
Rosetta (Pancotti et al., 2022) 5.18 0.53
RaSP (Blaabjerg et al., 2023) 1.08 0.71
ThermoMPNN (Dieckhaus et al., 2024) 0.71 0.75
SLAE-finetuned (w/o. mutated site) 0.73 0.70
SLAE-finetuned (with mutated site) 0.68 0.76

Chemical Shift Prediction

NMR chemical shifts are among the most direct experimental probes of local atomic environments; among them, backbone nitrogen shifts are notoriously difficult to predict accurately due to their large variance and contributions from ring currents, electrostatics, and subtle side-chain conformations. We benchmark on a stringently filtered BMRB (Hoch et al., 2023) set, which contains 2,532 training and 594 validation chemical shift records and their corresponding AlphaFold2-predicted structures. We adopt the PLM-CS framework as the baseline model architecture, which trains a lightweight predictor on top of pretrained representations (Zhu et al., 2025).

We report validation set performance of finetuned SLAE along with PLM-CS results using multiple protein residue embeddings, including ESM2, AlphaFold2, ProSST, and SLAE (see Footnote 1). Finetuned SLAE achieves the lowest RMSE and highest correlation, substantially outperforming the retrained PLM-CS baselines (Table 5). This demonstrates that SLAE embeddings capture fine-grained atomistic features essential for NMR observables.

Table 5:

Backbone nitrogen chemical shift prediction on BMRB

Method RMSE(ppm) Pearson Correlation
PLMCS-AF2 2.94 0.82
PLMCS-ESM2 2.74 0.84
PLMCS-ProSST 2.53 0.87
PLMCS-SLAE 2.53 0.87
SLAE-finetuned 1.88 0.93

5. Interpreting the Latent Space

SLAE’s downstream performance stems from a structured, interpretable latent space. We show that residue embeddings are organized along biochemically meaningful axes, are sensitive to local environment changes, and admit linear paths that decode to geometrically coherent structures (Figure 1C).

5.1. Embedding Variability Reflects Chemical Environment Change

To probe what SLAE embeddings capture at the residue level, we analyze how they organize across local chemical environments. Dimensionality reduction of kNN centroids from CATH (Section 3.5, Appendix E) shows that residue latents cluster by side chain chemistry and broader structural context. The latent space also stratifies along gradients of solvent accessibility and separates by secondary structure, with helices, sheets, and coils occupying distinct submanifolds (Figure 3, App. Figs. 6 and 7). This indicates that the SLAE representation is sensitive to both chemical identity and structural environment.

Figure 3: SLAE embedding comparison between native and decoy structures.


(Native in yellow; decoys colored by TM-score to their native; warmer = more native-like.) Left: protein-level PCA, where each point is a protein. Right: residue-level PCA for 1TUD and its decoys, with decoy residues colored by their parent decoy’s TM-score. In both panels, SLAE embeddings organize along gradients of nativeness, revealing coherent neighborhoods that align with structural quality.

We then quantify this sensitivity using the mdCATH dataset (Mirarchi et al., 2024). Across 5,398 proteins, per-residue latent displacement between conformers correlates with physical measures of environment variability: changes in contact maps and solvent exposure explain over half of the variance in embedding similarity (R2 = 0.55, ρ ≈ 0.74; Appendix E). Thus, SLAE embeddings consistently track how residues respond to burial, packing, and secondary-structure transitions.

5.2. Discriminative power over Native-Decoy Residue environments

We show that SLAE residue latents, which capture local environments, contain signal that distinguishes native structures from decoys zero-shot and provide a practical embedding space for evaluating backbone–sequence co-design.

On the Rosetta decoy dataset (Park et al., 2016), containing 133 native protein structures with thousands of decoys each, the native–decoy cosine margin is 0.136 averaged across residues. We further fit a leave-protein-out logistic regression, training on all proteins except one and testing on the held-out protein’s residues, and report AUROC = 0.659 (Appendix E), indicating a moderate, generalizable linear signal at the residue level.

Motivated by this discriminative signal, we use the SLAE embedding space to quantify the distributional coverage of generative models, extending prior metrics (Lu et al., 2025a) to all-atom resolution and residue granularity. As a proof of concept, we compute per-residue-type Fréchet Protein Distance (FPD) between SLAE embeddings of the generated structures and the native CATH distribution for models such as Chroma (Ingraham et al., 2023), Protpardelle-1c (Lu et al., 2025b), and La-Proteina (Geffner et al., 2025). The FPD metrics reveal subtle differences in the coverage of local amino acid environments by different generative models (Appendix E.3, App. Fig. 8). For example, biased sampling is evident in La-Proteina samples for serine, threonine, and valine relative to Protpardelle-1c and Chroma. Using SLAE embeddings provides a more sensitive view of the coverage of all-atom local environments, which are ignored in backbone-based metrics and which may be averaged out at the global protein fold level as in previous assessments of generative model coverage of protein structures.

5.3. Smooth Latent Interpolation Captures Conformational Transitions

Latent space smoothness is relevant for evaluating whether a representation supports continuous sampling of protein conformations. Unlike variational autoencoders, which encourage smoothness via KL regularization to a simple prior, the SLAE autoencoder relies solely on physics-augmented pretraining objectives. We examine the smoothness of the SLAE latent space by linear interpolation between two conformational states $Z^{(A)}$ and $Z^{(B)}$. For each residue $i$ and interpolation scale $t \in [0,1]$, the interpolated residue embedding is given by $z_i(t) = (1-t)\,z_i^{(A)} + t\,z_i^{(B)}$. The interpolated set $Z(t) = \{z_1(t), \dots, z_n(t)\}$ is then decoded into an all-atom structure with the pretrained SLAE decoder (Figure 1C).
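Concretely, the interpolation amounts to a per-residue linear blend followed by a single decoder pass. A sketch, assuming a pretrained model exposing encode/decode calls (the interface names are hypothetical):

import numpy as np

def interpolate_latents(z_a, z_b, n_steps=50):
    """z_a, z_b: (n_res, 128) SLAE residue embeddings of conformations A and B.
    Returns the latent sets along the linear path and the interpolation fractions."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts], ts

# Usage sketch with a hypothetical encode/decode interface:
# z_a = slae.encode(structure_state_a)               # e.g. AdK closed state (1AKE)
# z_b = slae.encode(structure_state_b)               # e.g. AdK open state (4AKE)
# latents, ts = interpolate_latents(z_a, z_b)
# intermediates = [slae.decode(z) for z in latents]  # all-atom structures along the path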

For two proteins with known conformational changes, adenylate kinase (AdK) and KaiB, we linearly interpolate between the SLAE embeddings of the two experimentally determined states (AdK: 1AKE, 4AKE; KaiB: 2QKE, 5JYT). We sample intermediate structures from 50 evenly spaced values of $t$ and align their backbone coordinates to frames in MD simulations of the transitions (Seyler et al., 2015; Zhang et al., 2024). For AdK, the interpolated structures closely track the MD intermediates, as evidenced by smooth trajectories with low RMSD (Figure 4), and they agree better than interpolations from the generative model (App. Fig. 10). Notably, these interpolations are unguided by any energy function or model likelihood; they arise solely from linear paths in the SLAE latent space anchored in pretraining with physics-based tasks. KaiB shows higher RMSD between steps 20 and 30 (Figure 4). Closer examination of the interpolated structures (App. Fig. 9) reveals disagreement in the C-terminus, which is known to unfold during the transition (Wayment-Steele et al., 2023). This degradation is expected, as SLAE is pretrained on folded structures and thus treats unfolded segments as out-of-distribution, where local environment cues under-constrain reconstruction.

Figure 4: Latent space interpolation between two conformations.


A. Structures sampled by linear interpolation (purple) overlaid with MD simulation frames (grey). B. Alignment RMSD to MD simulation trajectories.

Within the folded structure regime, SLAE’s latent space is sufficiently regular that simple linear paths often decode to geometrically coherent intermediates aligned with MD trajectories. These results support the view that SLAE embeddings approximate a continuous, chemically grounded manifold of protein structures. The latent space reflects local environmental variation while accommodating large-scale transitions, making it useful for downstream analysis and generative applications.

6. Conclusion

We introduced SLAE, a framework tailored to learning general-purpose representations of proteins at all-atom resolution. SLAE applies a strictly local graph neural network over atomic environments, using computationally simple layers to perform expressive geometric reasoning on atom-type and interatomic distance features. Pretraining is driven by a novel objective that combines full atomic coordinate reconstruction with energy score regression, yielding embeddings that are structurally faithful, chemically grounded, and energetically informed.

Figure 2: SLAE latent organization.


UMAP visualization of kNN centroids shows clustering by solvent accessibility (left) and secondary structure (right).

Table 1: Reconstruction performance of SLAE and ablations.

We report sequence recovery accuracy (%) and reconstruction RMSD (Å) on test structures. All further experiments use the highlighted best SLAE model.

Graph Radius (Å) Discretization Method Codebook Size Training Obj. Seq. Acc. (%) RMSD < 128 res (Å) RMSD < 512 res (Å)
8 LFQ 32768 all 75.2 2.50 3.74
8 kNN 4096 all 97.5 2.96 4.03
8 kNN 32768 all 99.4 1.60 2.31
8 w/o. FAPE 97.2 3.89 5.22
8 w/o. Energy 98.0 3.26 5.17
6 all 99.9 1.24 2.55
8 all 99.9 1.12 1.92

Acknowledgments

The computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. Y.C. and T.L. are supported by the Stanford Graduate Fellowship. C.Z. and H.K.W.-S. acknowledge financial support from the University of Wisconsin-Madison Office of the Vice Chancellor for Research, with funding from the Wisconsin Alumni Research Foundation. This project is supported by NIH (R01GM147893 to P.-S.H.), the Merck Research Laboratories (MRL) Scientific Engagement and Emerging Discovery Science (SEEDS) Program, and Stanford Medicine Catalyst. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

A. Model

A.1. Autoencoder pseudocode

The end-to-end SLAE autoencoder can be summarized as follows:

Algorithm 1.

SLAE Autoencoder. $h^{(e)}$: edge features, $\mathcal{E}$: SE(3)-equivariant update, $\mathcal{P}$: pooling, $\mathcal{D}_{\mathrm{Tr}}$: Transformer decoder. Outputs: $\hat{x}$ (coordinates), $\hat{s}$ (sequence), $\hat{r}$ (energies).

1: Input: heavy-atom coordinates $\{a_i\}_{i=1}^{N}$
2: Build $G = (V, E)$ with cutoff $r_c$
3: Init $h^{(e)} \leftarrow [\phi_r, \phi_a]$
4: for $L = 1$ to 2 do
5:   $x_{ij}^{L}, V_{ij}^{L} \leftarrow \mathcal{E}\!\left(x_{ij}^{L-1}, V_{ij}^{L-1}\right)$
6: end for
7: $z_r \leftarrow \mathcal{P}\!\left(x_{ij}^{L}\right)$
8: $(\hat{x}, \hat{s}, \hat{r}) \leftarrow \mathcal{D}_{\mathrm{Tr}}(z_r)$

A.2. Encoder Architecture

Notation

Let $a_i \in \mathbb{R}^{3}$ be the coordinate of atom $i$, $\mathbf{r}_{ij} = a_j - a_i$, $r_{ij} = \|\mathbf{r}_{ij}\|$, $\hat{r}_{ij} = \mathbf{r}_{ij}/r_{ij}$. The neighbor set of $i$ is $N(i) = \{j : r_{ij} \le r_c\}$. Each directed edge $(i,j)$ maintains invariant scalars $x_{ij}^{L} \in \mathbb{R}^{d_{\mathrm{sc}}}$ and equivariant tensors $V_{ij}^{L}$.

Two-body initialization

Edge features are initialized with radial and angular bases:

$x_{ij}^{0} = u(r_{ij})\, \mathrm{MLP}_{\mathrm{2b}}\!\left(\mathrm{onehot}(Z_i)\,\|\,\mathrm{onehot}(Z_j)\,\|\,\phi_r(r_{ij})\right)$, (2)
$V_{ij}^{0} = \omega_{ij}\,\phi_a(\hat{r}_{ij}), \qquad \omega_{ij} = \mathrm{MLP}_{\omega}\!\left(x_{ij}^{0}\right)$, (3)

where $\phi_r$ are Bessel radial basis functions, $\phi_a$ are angular embeddings (e.g., spherical harmonics), and $u(r_{ij})$ is a smooth cutoff envelope.
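A sketch of this radial featurization, assuming the standard sine-Bessel basis and a DimeNet-style polynomial cutoff envelope (the exact envelope used in SLAE may differ):

import numpy as np

def bessel_basis(r, r_cut=8.0, n_basis=8):
    """phi_r(r): sine-Bessel radial basis evaluated at distance r, shape (n_basis,)."""
    n = np.arange(1, n_basis + 1)
    return np.sqrt(2.0 / r_cut) * np.sin(n * np.pi * r / r_cut) / r

def cutoff_envelope(r, r_cut=8.0, p=6):
    """u(r): smooth polynomial envelope that goes to 0 with zero slope at r_cut."""
    x = np.clip(r / r_cut, 0.0, 1.0)
    return (1.0
            - (p + 1) * (p + 2) / 2.0 * x ** p
            + p * (p + 2) * x ** (p + 1)
            - p * (p + 1) / 2.0 * x ** (p + 2))

# Example for a 3.1 Å edge: phi_r feeds MLP_2b and u(r) scales its output in Eq. (2).
r = 3.1
phi = bessel_basis(r)
u = cutoff_envelope(r)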

Tensor product update

At layer L, equivariant features of edge (i,j) interact with the embedded environment of atom i:

$V_{ij}^{L} = V_{ij}^{L-1} \otimes \sum_{k \in N(i)} w_{ik}^{L}\,\phi(\hat{r}_{ik})$, (4)

where $\phi(\hat{r}_{ik})$ encodes neighbor geometry and $w_{ik}^{L} = \mathrm{MLP}_{\mathrm{embed}}^{L}\!\left(x_{ik}^{L-1}\right)$ are learned weights. This corresponds to a weighted projection of the atomic density around atom $i$.

Latent scalar update.

Scalar channels are updated with tensor product scalars:

$x_{ij}^{L} = \mathrm{MLP}_{\mathrm{latent}}^{L}\!\left(x_{ij}^{L-1}\,\|\,V_{ij}^{L}\right) u(r_{ij})$, (5)

injecting geometric information from $V_{ij}^{L}$ back into $x_{ij}^{L}$.

Hierarchical pooling

Final edge scalars are aggregated:

$s_i = \frac{1}{|N(i)|}\sum_{j \in N(i)} x_{ij}^{L}$, (6)
$z_r = \frac{1}{|A(r)|}\sum_{i \in A(r)} s_i, \qquad z_r \in \mathbb{R}^{128}$, (7)

producing residue-level embeddings $z_r$.

A.3. Decoder architecture

Transformer backbone

We employ a standard pre-norm Transformer encoder with Rotary Positional Embeddings (RoPE), $L_{\mathrm{Tr}} = 8$ layers, $h = 16$ heads, and model width $d_{\mathrm{model}} = 1024$. Each layer consists of:

  • Multi-head self-attention with RoPE (pre-norm): $\mathrm{MHA}_{\mathrm{RoPE}}(\mathrm{LayerNorm}(\cdot))$.

  • Residual connection.

  • Feed-forward network with hidden dimension $d_{\mathrm{ff}}$ and SwiGLU, applied as $\mathrm{FFN}(\mathrm{LayerNorm}(\cdot))$.

  • No dropout.

Formally:

$H^{(\ell)\prime} = \mathrm{MHA}_{\mathrm{RoPE}}\!\left(\mathrm{LayerNorm}\!\left(H^{(\ell-1)}\right)\right) + H^{(\ell-1)}$, (8)
$H^{(\ell)} = \mathrm{FFN}\!\left(\mathrm{LayerNorm}\!\left(H^{(\ell)\prime}\right)\right) + H^{(\ell)\prime}$, (9)

for $\ell = 1, \dots, L_{\mathrm{Tr}}$, with $H^{(0)} = [z_1, \dots, z_n]$.

Prediction heads

From final hidden states $H \in \mathbb{R}^{n \times d_{\mathrm{model}}}$ with $d_{\mathrm{model}} = 1024$, we apply three parallel heads:

  1. 3D coordinates (linear head) LayerNorm + Linear maps per-residue embeddings to all Atom37 coordinates:
    $\hat{x} = \mathrm{Unflatten}\!\left(\mathrm{Linear}(\mathrm{LN}(H)),\, 37 \times 3\right) \in \mathbb{R}^{n \times 37 \times 3}$.
    (Atoms 1–4 are N, Cα, C, and O; atoms 5–37 are side chain. Masking is applied via the Atom37 mask.)
  2. Sequence logits on valid tokens An MLP head operates only on valid tokens (mask-compacted), then is re-padded for the loss:
    $\hat{s} = \mathrm{MLP}_{\mathrm{seq}}\!\left(H_{\mathrm{valid}}\right) \in \mathbb{R}^{n_{\mathrm{valid}} \times 20}$.
  3. Pairwise energies A pairwise feature head first down-projects $H$, lifts to 2D by pairwise product/difference, applies a small MLP, then per-type linear heads with magnitude clamped at $10^3$:
    $\hat{r} = \left[\hat{r}_{\mathrm{hbond}}, \hat{r}_{\mathrm{sol}}, \hat{r}_{\mathrm{elec}}\right] \in \mathbb{R}^{n \times n \times 3}$.

A.4. Task-specific heads

Trainable decoder backbone.

We expose a lightweight wrapper over the DecoderBackbone to enable fine-tuning the last N Transformer blocks while freezing the rest. Taking the single-site mutation stability task as an example, we document the layout of downstream task-specific fine-tuning here.

Contrastive and site-aware head

A Siamese head takes two or more structure embeddings (e.g., wild-type and mutant), runs them through the shared DecoderBackbone, and regresses a scalar target (e.g., ΔΔG). Beyond global contrastive pooling, it can extract site-specific residue representations, enabling residue-level tasks.

Backbone embeddings.

Given masked inputs $\left(X^{\mathrm{wt}}, M^{\mathrm{wt}}\right)$ and $\left(X^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$,

$H^{\mathrm{wt}} = \mathrm{DecoderBackbone}\!\left(X^{\mathrm{wt}}, M^{\mathrm{wt}}\right), \qquad H^{\mathrm{mut}} = \mathrm{DecoderBackbone}\!\left(X^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$.

Mask-aware pooling and site features.

Let the mean-pooling operator be

$\mathrm{Pool}(H, M) = \dfrac{\sum_{t=1}^{L} H_{:,t,:}\, M_{:,t}}{\sum_{t=1}^{L} M_{:,t} + \varepsilon} \in \mathbb{R}^{B \times 1024}$.

We form global embeddings $z^{\mathrm{wt}} = \mathrm{Pool}\!\left(H^{\mathrm{wt}}, M^{\mathrm{wt}}\right)$ and $z^{\mathrm{mut}} = \mathrm{Pool}\!\left(H^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$. Given mutation indices $\iota \in \{1, \dots, L\}^{B}$, we also extract site embeddings

$s^{\mathrm{wt}} = H^{\mathrm{wt}}[\mathrm{range}(B), \iota], \qquad s^{\mathrm{mut}} = H^{\mathrm{mut}}[\mathrm{range}(B), \iota] \in \mathbb{R}^{B \times 1024}$.

Contrastive feature and MLP regressor.

We concatenate global and site representations together with their difference:

$u = \left[z^{\mathrm{wt}},\, z^{\mathrm{mut}},\, s^{\mathrm{wt}},\, s^{\mathrm{mut}},\, s^{\mathrm{mut}} - s^{\mathrm{wt}}\right] \in \mathbb{R}^{B \times (5 \cdot 1024)}$.

A small MLP head predicts a scalar per pair:

$\hat{y} = \mathrm{MLP}(u) = \mathrm{Linear}\!\left(\mathrm{GELU}\!\left(\mathrm{LayerNorm}\!\left(\mathrm{Linear}\!\left(\mathrm{GELU}(u)\right)\right)\right)\right) \in \mathbb{R}^{B \times 1}$.
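A PyTorch sketch of this site-aware head, assuming batch-first backbone states of width 1024 and integer mutation indices; the layer order follows the composition written above, and hidden sizes are placeholders:

import torch
import torch.nn as nn

def masked_mean_pool(H, M, eps=1e-8):
    """H: (B, L, 1024) backbone states, M: (B, L) validity mask -> (B, 1024)."""
    return (H * M.unsqueeze(-1)).sum(dim=1) / (M.sum(dim=1, keepdim=True) + eps)

class DDGHead(nn.Module):
    def __init__(self, d=1024, d_hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.GELU(), nn.Linear(5 * d, d_hidden), nn.LayerNorm(d_hidden),
            nn.GELU(), nn.Linear(d_hidden, 1),
        )

    def forward(self, H_wt, M_wt, H_mut, M_mut, site_idx):
        B = H_wt.shape[0]
        z_wt, z_mut = masked_mean_pool(H_wt, M_wt), masked_mean_pool(H_mut, M_mut)
        s_wt = H_wt[torch.arange(B), site_idx]       # wild-type state at the mutated site
        s_mut = H_mut[torch.arange(B), site_idx]     # mutant state at the mutated site
        u = torch.cat([z_wt, z_mut, s_wt, s_mut, s_mut - s_wt], dim=-1)
        return self.mlp(u).squeeze(-1)               # predicted ΔΔG per wild-type/mutant pair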

General usage

The same interface supports other pairwise or single-input tasks by (i) choosing one or multiple passes through DecoderBackbone, (ii) selecting global vs. site-wise features, and (iii) swapping the final MLP for the appropriate output dimensionality/loss. For atom-level tasks, the DecoderBackbone can be reinitialized with the smaller attention window.

B. Training

B.1. Losses

All-atom FAPE (Frame-Aligned Point Error)

All-atom FAPE is computed by aligning the predicted and reference structures on every triplet of bonded atoms $(i,j,k)$ (with the exception of symmetric side chain atoms) and then measuring per-atom positional deviations between the aligned structures. For each frame $f(i,j,k)$ (with $j$ as the origin), define an orthonormal basis for predicted/true coordinates via a deterministic map $\Phi : \mathbb{R}^{3} \times \mathbb{R}^{3} \times \mathbb{R}^{3} \to SO(3)$:

$U_f^{\mathrm{pred}} = \Phi\!\left(\hat{x}_i, \hat{x}_j, \hat{x}_k\right), \qquad U_f^{\mathrm{true}} = \Phi\!\left(x_i, x_j, x_k\right)$,

where Φ constructs column vectors from the two edge directions at j,

$v_{ji} = x_i - x_j, \qquad v_{jk} = x_k - x_j$,

then

$e_0 = v_{jk} \times v_{ji}, \qquad e_2 = v_{ji} - v_{jk}, \qquad e_1 = e_2 \times e_0$,

and column-normalizes $e_0, e_1, e_2$ to obtain a right-handed $3 \times 3$ matrix.

For any atom $a$ in the same protein as $f$, rotate origin-subtracted positions into the local frames:

$r_{f,a}^{\mathrm{pred}} = U_f^{\mathrm{pred}}\!\left(\hat{x}_a - \hat{x}_j\right), \qquad r_{f,a}^{\mathrm{true}} = U_f^{\mathrm{true}}\!\left(x_a - x_j\right)$.

Define $d_{f,a} = \left\|r_{f,a}^{\mathrm{pred}} - r_{f,a}^{\mathrm{true}}\right\|_2$, clamped at $c = 10$ as $\tilde{d}_{f,a} = \min\!\left(d_{f,a}, c\right)$, and apply a Huber penalty with $\delta = 1.0$:

$\rho_\delta(\tilde{d}) = \begin{cases} \frac{1}{2}\tilde{d}^{2}, & \tilde{d} \le \delta, \\ \delta\tilde{d} - \frac{1}{2}\delta^{2}, & \tilde{d} > \delta. \end{cases}$

We average first over frames and then over atoms, yielding an atom-weighted mean:

$\mathcal{L}_{\mathrm{FAPE}} = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{|A_b|}\sum_{a \in A_b} \frac{1}{|F_b|}\sum_{f \in F_b} \rho_\delta\!\left(\tilde{d}_{f,a}\right)$.
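A compact NumPy sketch of this loss for a single structure; frames are given as index triplets of bonded atoms, the in-plane axis follows the reconstruction above, and symmetric side-chain handling and batching are omitted:

import numpy as np

def build_frame(xi, xj, xk):
    """Phi: right-handed orthonormal frame from three bonded atoms, origin at xj."""
    v_ji, v_jk = xi - xj, xk - xj
    e0 = np.cross(v_jk, v_ji)
    e2 = v_ji - v_jk
    e1 = np.cross(e2, e0)
    U = np.stack([e0, e1, e2], axis=1)
    return U / np.linalg.norm(U, axis=0, keepdims=True)    # column-normalize

def fape_loss(x_pred, x_true, frames, clamp=10.0, delta=1.0):
    """x_pred, x_true: (N, 3) all-atom coordinates; frames: list of (i, j, k) index triplets."""
    per_frame = []
    for (i, j, k) in frames:
        Up = build_frame(x_pred[i], x_pred[j], x_pred[k])
        Ut = build_frame(x_true[i], x_true[j], x_true[k])
        rp = (x_pred - x_pred[j]) @ Up.T                    # rotate origin-subtracted positions
        rt = (x_true - x_true[j]) @ Ut.T                    # into the local frames
        d = np.minimum(np.linalg.norm(rp - rt, axis=-1), clamp)
        huber = np.where(d <= delta, 0.5 * d ** 2, delta * d - 0.5 * delta ** 2)
        per_frame.append(huber)
    # Average over frames first, then over atoms.
    return float(np.stack(per_frame, axis=0).mean(axis=0).mean())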

All-atom smooth LDDT

We use a differentiable, all-atom version of LDDT that compares pairwise distances within a cutoff. Let $P = \{((i,a),(j,b))\}$ be all heavy-atom pairs with $\|x_{i,a} - x_{j,b}\| \le R_{\max}$ and not in the same residue. Define ground-truth and predicted distances $d_{ia,jb} = \|x_{i,a} - x_{j,b}\|$ and $\hat{d}_{ia,jb} = \|\hat{x}_{i,a} - \hat{x}_{j,b}\|$, and the absolute error $\Delta_{ia,jb} = |\hat{d}_{ia,jb} - d_{ia,jb}|$. Using standard lDDT thresholds $\tau \in \{0.5, 1.0, 2.0, 4.0\}$ Å with smooth indicators $s_\tau(\Delta) = \sigma(\alpha(\tau - \Delta))$ (sigmoid; $\alpha$ controls sharpness),

$\mathrm{sLDDT}_i = \frac{1}{|P_i|}\sum_{((i,a),(j,b)) \in P_i} \frac{1}{|T|}\sum_{\tau \in T} s_\tau\!\left(\Delta_{ia,jb}\right), \qquad \mathcal{L}_{\mathrm{sLDDT}} = 1 - \frac{1}{N_{\mathrm{res}}}\sum_i \mathrm{sLDDT}_i$.
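A NumPy sketch of this term for a single structure; the cutoff value r_max below is a placeholder, and alpha is the sharpness hyperparameter of the sigmoid indicators:

import numpy as np

def smooth_lddt_loss(x_pred, x_true, atom_res, r_max=15.0, alpha=2.0,
                     thresholds=(0.5, 1.0, 2.0, 4.0)):
    """x_pred, x_true: (N, 3) heavy-atom coordinates; atom_res: (N,) residue index per atom."""
    d_true = np.linalg.norm(x_true[:, None] - x_true[None, :], axis=-1)
    d_pred = np.linalg.norm(x_pred[:, None] - x_pred[None, :], axis=-1)

    # Pairs within the cutoff, excluding atoms from the same residue.
    mask = (d_true <= r_max) & (atom_res[:, None] != atom_res[None, :])

    delta = np.abs(d_pred - d_true)
    score = np.zeros_like(delta)
    for tau in thresholds:                       # smooth indicator sigma(alpha * (tau - delta))
        score += 1.0 / (1.0 + np.exp(-alpha * (tau - delta)))
    score /= len(thresholds)

    # Per-residue sLDDT over that residue's retained pairs, then average over residues.
    lddt_res = []
    for r in np.unique(atom_res):
        m = mask & (atom_res[:, None] == r)
        if m.any():
            lddt_res.append(score[m].mean())
    return 1.0 - float(np.mean(lddt_res))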

Mean-squared error (MSE)

Used for continuous target regression:

$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|\Omega|}\sum_{u \in \Omega} \left\|\hat{y}_u - y_u\right\|_2^2$.

Huber loss.

Used for continuous target regression with $\delta = 1.35$:

$\mathcal{L}_{\mathrm{Huber}} = \frac{1}{|\Omega|}\sum_{u \in \Omega} \begin{cases} \frac{1}{2}\left(\hat{y}_u - y_u\right)^{2}, & \left|\hat{y}_u - y_u\right| \le \delta, \\ \delta\left|\hat{y}_u - y_u\right| - \frac{1}{2}\delta^{2}, & \text{otherwise.} \end{cases}$

B.2. Training specifics

The autoencoder is trained on a single NVIDIA A100 or H100 GPU using batch size 16. For pretraining we use $w_{\text{coord}} = w_{\text{seq}} = w_{\text{energy}} = 1$ and $\alpha = 10$, $\beta = 1$ for the loss $\mathcal{L} = w_{\text{coord}}(\alpha\,\mathrm{LDDT} + \beta\,\mathrm{FAPE}) + w_{\text{seq}}\,\mathrm{CrossEntropy} + w_{\text{energy}}\,\mathrm{MSE}$. We train for 30 epochs with early stopping when the validation loss has not decreased for 5 epochs. The learning rate schedule is linear warmup for 1,000 steps followed by cosine decay. Optimization uses AdamW with maximum learning rate $1 \times 10^{-4}$ and standard $\beta_1 = 0.9$, $\beta_2 = 0.999$ (weight decay as in AdamW defaults). Unless noted otherwise, downstream task-specific fine-tuning uses the same batch size and maximum learning rate $1 \times 10^{-5}$.

C. Datasets

Pretraining Structure

We train SLAE on a sequence-augmented CATH set (Lu et al., 2025b) built by redesigning each domain with 32 ProteinMPNN sequences and predicting structures with ESMFold; we retain only high-confidence, self-consistent structure models (pLDDT ≥ 80, scRMSD ≤ 2.0 Å), yielding 337,936 structures, with 271 test structures from held-out CATH domains. We evaluate the SLAE latent space on protein conformational ensembles sampled from the mdCATH dataset of molecular dynamics (MD) simulations (Mirarchi et al., 2024). We subsample 32 frames per protein across MD trajectory ensembles for each of the 5,398 structures.

Pretraining Rosetta Score

We use PyRosetta to compute residue pairwise energy scores for all pretraining structures under its default full-atom energy terms. For each pair of residues we compute (1) fa_sol: Lazaridis-Karplus solvation energy (2) fa_elec: Coulombic electrostatic potential with a distance-dependent dielectric (3) hbond: Sum of all hydrogen bonding terms for backbone and sidechain.

Fold Classification

We obtain the dataset from Hermosilla et al. (2021), which consolidated 16,712 proteins with 1195 different folds from the SCOPe 1.75 database (Fox et al., 2014). Three test sets are used: (1) Family, which allows proteins from the same family to appear in both training and test; (2) Superfamily, which excludes proteins sharing family membership with the training set; and (3) Fold, which further excludes proteins from the same superfamily as those in training. All structures are obtained from the SCOPe 1.75 archive.

Stability

We obtain the dataset curated by Dieckhaus et al. (2024) from Tsuboyama et al. (2023), composed of 272,712 single point mutations and their experimental ΔΔG values. The proteins were clustered using MMseqs2 with a sequence identity cutoff of 25% to yield 239 training, 31 validation, and 29 test proteins. For wild-type sequences we predict structures with AlphaFold2. For all mutated structures we model the mutation with PyRosetta and relax within an 8 Å radius to obtain training structures.

Binding Affinity

We use the PPB-Affinity dataset (Liu et al., 2024), which integrates experimental protein-protein binding affinity data from several source databases: SKEMPI v2.0, SAbDab, PDB-bind v2020, Affinity Benchmark v5.5, and ATLAS. This dataset contains 12,062 unique binding complexes spanning 3,032 unique PDB codes and point mutations. We use the structures curated in the dataset and define interface residues as those within 5 Å of atoms of neighboring chains. For all mutations we mutate the side chain with PyRosetta and relax within an 8 Å radius to obtain training structures.

NMR Chemical Shift

We retrieve the BMRB (Hoch et al., 2023), totaling 17,028 entries as of 2025-07-02. The entries were filtered and processed based on NMR experiment type, backbone chemical shift coverage, sequence consistency, basic experimental-condition bounds, and routine re-referencing requirements. After filtering out entries without any nitrogen chemical shifts, 3,623 entries were retained and split into 2,532 training and 594 validation entries at a 50% pairwise sequence-identity threshold. AlphaFold2 was used to generate all structures used in training.

MD Simulation

For adenylate kinase (AdK), we use conformational ensembles generated using the Framework Rigidity Optimized Dynamics Algorithm (FRODA), yielding 200 trajectories (Seyler et al., 2015). For KaiB, we use the temperature-dependent fold-switching simulation from Zhang et al. (2024), subsampling every 10 frames out of the 4 successful fold-switching trajectories from the fold-switched state to ground state.

Rosetta Decoy

To assess the distribution of local residue-environment embeddings between native and decoy structures, we use the structure dataset from Park et al. (2016), in which each of the 133 native structures is accompanied by a large number (≥ 1000 cluster centers) of alternative conformations (decoys).

D. Metrics

Structure comparison

We report RMSD after optimal Kabsch rigid alignment for Cα, backbone, and all-atom coordinates. Given reference $X \in \mathbb{R}^{n \times 3}$ and prediction $\hat{X}$, align $\hat{X}$ to $X$ and then compute

$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{j=1}^{n} \left\|\hat{x}_j^{\mathrm{align}} - x_j\right\|_2^2}$.
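A NumPy sketch of the Kabsch alignment and RMSD:

import numpy as np

def kabsch_rmsd(X_ref, X_pred):
    """X_ref, X_pred: (n, 3). Optimally superposes X_pred onto X_ref and returns the RMSD."""
    mu_ref, mu_pred = X_ref.mean(axis=0), X_pred.mean(axis=0)
    A, B = X_ref - mu_ref, X_pred - mu_pred

    # Optimal rotation via SVD of the covariance matrix (Kabsch algorithm).
    U, _, Vt = np.linalg.svd(B.T @ A)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    B_aligned = B @ R.T
    return float(np.sqrt(np.mean(np.sum((B_aligned - A) ** 2, axis=-1))))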

Numeric regression

Given targets $\{y_i\}_{i=1}^{N}$ and predictions $\{\hat{y}_i\}_{i=1}^{N}$, we report

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}, \qquad r = \frac{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}}$.

Distribution comparison

We compute Fréchet Protein Distance (FPD) following Lu et al. (2025a). Given $N$ data points from a reference distribution $p_{\mathrm{data}}(x)$, here the sequence-augmented CATH dataset, and $M$ samples from a generative model $p_{\mathrm{sample}}(x)$, we compute per-residue SLAE embeddings $\{z_{\mathrm{data}}^{(i)}\}_{i=1}^{N}$ and $\{z_{\mathrm{sample}}^{(j)}\}_{j=1}^{M}$ and then compute

$\mathrm{FPD} = \left\|\mu_{\mathrm{data}} - \mu_{\mathrm{sample}}\right\|_2^2 + \mathrm{Tr}\!\left(\Sigma_{\mathrm{data}} + \Sigma_{\mathrm{sample}} - 2\left(\Sigma_{\mathrm{data}}\Sigma_{\mathrm{sample}}\right)^{\frac{1}{2}}\right)$ (10)

where $\mu_{\mathrm{data}}$ and $\mu_{\mathrm{sample}}$ are the means of the reference embeddings and the sample embeddings, respectively, and $\Sigma_{\mathrm{data}}$ and $\Sigma_{\mathrm{sample}}$ are the corresponding covariance matrices. We compute FPD using a smaller subset of 2,000 samples, as SHAPES showed that this is sufficient for an accurate FPD estimate (Lu et al., 2025a).
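A NumPy/SciPy sketch of Eq. (10) over two sets of per-residue embeddings:

import numpy as np
from scipy.linalg import sqrtm

def frechet_protein_distance(z_data, z_sample):
    """z_data: (N, d), z_sample: (M, d) per-residue SLAE embeddings."""
    mu_d, mu_s = z_data.mean(axis=0), z_sample.mean(axis=0)
    cov_d = np.cov(z_data, rowvar=False)
    cov_s = np.cov(z_sample, rowvar=False)

    covmean = sqrtm(cov_d @ cov_s)
    if np.iscomplexobj(covmean):                 # numerical noise can yield tiny imaginary parts
        covmean = covmean.real

    diff = mu_d - mu_s
    return float(diff @ diff + np.trace(cov_d + cov_s - 2.0 * covmean))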

E. Additional Experiments and Results

E.1. Pretraining

We report in Table 6 additional results on the pretraining performance of the SLAE autoencoder. We note that encoders with a 10 Å graph-radius cutoff are infeasible to train on a single GPU due to the number of edges.

Table 6:

Complete results of SLAE autoencoder ablation experiments.

Graph Radius(Å) Discretization Method Codebook Size Training Obj. Seq. Acc. (%) RMSD < 128 (Å) RMSD < 512 (Å)
8 LFQ 16384 all 69.5 4.12 5.79
8 LFQ 32768 all 75.2 2.50 3.74
8 VQ 16384 all 65.7 5.02 5.88
8 VQ 32768 all 70.4 4.30 6.02
8 kNN 4096 all 97.5 2.96 4.03
8 kNN 16384 all 98.6 1.71 2.57
8 kNN 32768 all 99.4 1.60 2.31
8 w/o. FAPE 97.2 3.89 5.22
8 w/o. Energy 98.0 3.26 5.17
4 all 99.8 2.57 3.86
6 all 99.9 1.24 2.55
8 all 99.9 1.12 1.92

E.2. Latent space characterization

E.2.1. kNN clustering

We examine the CATH kNN-quantized latent space, a k-means codebook of k = 16,384 centroids. We assign each centroid the majority amino-acid identity among its members; the commitment loss is the L2 distance from an embedding to its assigned centroid. The commitment loss histogram is tightly concentrated around 3–5 L2 units (Figure 5), which is modest relative to the embedding norm (15 ± 4), indicating that quantization preserves most of the geometric signal.

Figure 5:


Commitment loss distribution during post-hoc quantization

We observe clear residue type mixing in the clusters. Although many centroids are quite pure (median majority fraction 0.96), the distribution is broad (mean 0.89±0.15; entropy mean 0.52), with a substantial tail of mixed-composition clusters (10th-percentile majority 0.67). Along with the modest commitment error, this suggests that the observed mixing reflects genuinely overlapping local chemistries. Consistently, residue-conditioned intra-cluster distances show that some types form diffuse, mixed neighborhoods (A, G, S, C with ratios ≥ 1), while others are tighter and more type-specific (W, Y, R with ratios ≤ 1). These observations suggest that the kNN partitioning of residue embedding space yields chemically meaningful clusters but does not enforce one-residue exclusivity and captures real cross-type similarity in local environments.

E.2.2. Residue embedding visualization

We project the 16,384-entry codebook (centroid) embeddings into three dimensions using UMAP and analyze how local chemical environments are organized in this latent space (Figs. 6 and 7). Each CATH residue is assigned to its nearest codebook entry, and for every centroid we aggregate properties across its assigned residues. We compute the mean SASA and the majority secondary-structure label. This yields a coarse-grained landscape in which centroids arrange along solvent-exposure gradients and segregate by secondary-structure preferences.

Table 7:

Residue-wise clustering statistics: number of centroids that each residue type dominates, mean intra-cluster distance (± standard deviation), and ratio relative to the global mean.

Residue # Centroids Mean intra-cluster distance mean ± std Ratio to global distance
A 1601 19.95 ± 6.50 [1.20]
C 215 17.67 ± 5.72 [1.06]
D 962 13.43 ± 7.47 [0.81]
E 1076 11.58 ± 3.97 [0.70]
F 641 11.44 ± 3.53 [0.69]
G 1192 18.38 ± 6.21 [1.11]
H 387 11.56 ± 3.53 [0.70]
I 899 14.33 ± 4.60 [0.86]
K 947 11.48 ± 3.76 [0.69]
L 1565 14.26 ± 4.63 [0.86]
M 272 13.69 ± 4.33 [0.82]
N 729 13.28 ± 4.16 [0.80]
P 737 13.83 ± 4.49 [0.83]
Q 492 11.48 ± 3.89 [0.69]
R 720 9.95 ± 3.14 [0.60]
S 1032 17.31 ± 5.58 [1.04]
T 920 15.81 ± 4.97 [0.95]
V 1253 15.96 ± 5.13 [0.96]
W 202 8.86 ± 2.76 [0.53]
Y 542 10.49 ± 3.16 [0.63]
Figure 6:


3D UMAP projection of CATH residue embeddings colored by solvent accessibility and secondary structure

E.2.3. Structure ensemble analysis

Subsampled mdCATH

For each residue, we measure how much its embedding changes across the ensemble by averaging pairwise differences between frames. For a given residue and set of frames, we compute two physical descriptors: (i) contact-map change: we form a binary contact row per frame (contact if residues are within a chosen distance threshold) and measure, on average, what fraction of those contacts differ between frames; (ii) solvent-exposure change: we compute solvent-accessible surface area (SASA), convert it to residue-type-normalized relative SASA, and take the average absolute change between frames. We then fit a simple linear model that predicts per-residue embedding change from the two descriptors. We aggregate performance on held-out residues and report (i) the proportion of variance explained and (ii) the Spearman rank correlation between observed and predicted embedding change.

Rosetta Decoys

For each native protein we have a residue-embedding matrix and a set of its decoy matrices, aligned by residue index. We apply row-wise L2 normalization so that inner products equal cosine similarity. For a given protein, we compute the mean residue-wise cosine similarity between each decoy and its native, then take the average over decoys. The native-decoy cosine margin is defined as the difference between the native’s self-similarity (equal to 1.0 after normalization) and this mean decoy similarity.

Figure 7:


3D UMAP projection of CATH residue embeddings colored by amino acid type

To test linear separability at the residue level and generalization to unseen proteins, we train a logistic-regression classifier on residue embeddings with leave-protein-out grouped cross-validation: each residue embedding is a sample (label 0=native, 1=decoy) and carries its protein ID for grouped CV. We split with GroupKFold so all residues from a held-out protein appear only in the test set, and train an L2-regularized LogisticRegression. On each test fold we report AUROC; metrics are aggregated as mean ± sd across folds.
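A scikit-learn sketch of this grouped cross-validation; the embedding matrix, labels, and protein IDs are assumed precomputed, and the number of splits is a placeholder:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def grouped_native_decoy_auroc(X, y, protein_ids, n_splits=5):
    """X: (n_residues, 128) embeddings; y: 0 = native, 1 = decoy; protein_ids: group labels."""
    aurocs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=protein_ids):
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))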

E.3. Per-residue generative model assessment

We compare distribution coverage of all-atom chemical environments sampled by generative models, stratified by residue type. For each residue type, we extracted the SLAE embeddings of 2,000 random examples from the sequence-augmented CATH dataset and from a collection of 20,000 unconditional samples of all-atom protein structures from La-Proteina, Protpardelle-1c, and Chroma.

E.4. Latent space interpolation

In Figure 9 A and B we show 20 out of 50 interpolated structures for AdK and KaiB. In addition, we compare linearly interpolated AdK structures from the SLAE latent space to those from the all-atom generative model Protpardelle-1c (Figure 10) and show that SLAE interpolation is better matched to simulated intermediate structures.

Figure 8:


SLAE embeddings to assess residue environment coverage. PCA of SLAE per-residue embeddings of de novo structure samples (light blue) compared to the reference CATH distribution (purple) stratified by amino acid type given in the title. The two modes in each amino acid type correspond to residues belonging to a beta sheet or alpha helix.

Figure 9:


Structures decoded from SLAE latent interpolation. A. AdK B. KaiB C. Step 23 KaiB intermediate structure with under-characterized C-terminus showing disordered backbone collapsing onto itself.

Figure 10:


Comparison of SLAE and generative model (Protpardelle-1c) latent interpolation. A. Three representative steps from interpolation fraction 0.3 to 0.7. Top: Protpardelle-1c linear interpolation (blue) and best MD frame matches (grey). Bottom: SLAE linear interpolation (purple) and best MD frame matches (grey). B. RMSD of interpolation trajectories to their closest-match MD frames.

Footnotes

1. Fair comparison with open-sourced methods is not possible due to non-overlapping dataset splits (some entries from the PLM-CS datasets do not pass the filter standard). We therefore re-trained a PLM-CS baseline on our splits and evaluate all embeddings under an identical protocol.

Contributor Information

Yilin Chen, Stanford University, Department of Bioengineering.

Cizhang Zhao, University of Wisconsin–Madison, Department of Biochemistry.

Po-Ssu Huang, Stanford University, Department of Bioengineering.

Tianyu Lu, Stanford University, Department of Bioengineering.

Hannah K. Wayment-Steele, University of Wisconsin–Madison, Department of Biochemistry.
