[Preprint]. 2025 Oct 6:2025.10.03.680398. [Version 1] doi: 10.1101/2025.10.03.680398

SLAE: Strictly Local All-atom Environment for Protein Representation

Yilin Chen 1, Cizhang Zhao 2, Po-Ssu Huang 3, Tianyu Lu 4, Hannah K Wayment-Steele 5
PMCID: PMC12632552  PMID: 41278779

Abstract

Building physically grounded protein representations is central to computational biology, yet most existing approaches rely on sequence-pretrained language models or backbone-only graphs that overlook side-chain geometry and chemical detail. We present SLAE, a unified all-atom framework for learning protein representations from each residue’s local atomic neighborhood using only atom types and interatomic geometries. To encourage expressive feature extraction, we introduce a novel multi-task autoencoder objective that combines coordinate reconstruction, sequence recovery, and energy regression. SLAE reconstructs all-atom structures with high fidelity from latent residue environments and achieves state-of-the-art performance across diverse downstream tasks via transfer learning. SLAE’s latent space is chemically informative and environmentally sensitive, enabling quantitative assessment of structural qualities and smooth interpolation between conformations at all-atom resolution.

1. Introduction

Proteins are the fundamental machinery of life, carrying out processes from catalysis and signaling to structural organization. Their remarkable functional diversity arises not only from their amino acid sequences but from the intricate three-dimensional structures into which those sequences fold.

Within protein structures, the backbone and side chain atoms act as an intricately coupled system that establishes local atomic environments through hydrophobic packing, hydrogen-bonding networks, and electrostatic interactions. These residue-level environments mediate conformational preferences and side chain dynamics, linking the global fold to the specific interactions that underlie protein function. Representing these interactions in a concise, learnable form is therefore essential for generalizable and physically grounded models of protein structure and function.

Current representations, whether from protein language models (PLMs) or sequence-structure joint embeddings, cannot isolate physical interactions from evolutionary information, and often resort to backbone-only structural input to reduce computational demands. As a result, the field remains limited by the absence of a general-purpose pretraining framework that extracts, compresses, and transfers knowledge of all-atom structure across proteins and downstream applications.

We propose SLAE (Strictly Local All-atom Environment autoencoder), a framework for protein representation learning that models a protein as a set of residue-centric chemical environments. To promote generalizability and a physically grounded view, SLAE enforces an informational bottleneck by restricting the encoder to strictly local atom graphs and pairing it with an asymmetric decoder that must recover the full structure. When this reconstruction task is solved, the resulting tokenization of structure emerges jointly from the representation and the model, emphasizing physically meaningful interactions rather than heuristic features. Fully connected local atom graphs capture interactions between a residue and its neighboring atoms and are computationally tractable during pretraining. We show these local representations are sufficient to reconstruct all-atom Cartesian coordinates with high fidelity.

We design an all-atom autoencoder architecture that separates local and global reasoning across the encoding and decoding stages. An SE(3)-equivariant graph encoder maps each local environment to a rotation/translation-invariant residue token. A Transformer decoder with self-attention then aggregates these tokens to model long-range couplings and reconstruct coherent global geometry. This residue-level bottleneck forces the encoder to distill packing signals such as covalent bonds, hydrogen-bond motifs, and steric/electrostatic cues that the global decoder requires to reconstruct long-range geometry, facilitating transfer across tasks. We introduce a physics-augmented pretraining objective that couples two self-supervised tasks, (i) all-atom coordinate reconstruction and (ii) sequence recovery, with a supervised task, (iii) regression of Rosetta-derived inter-residue energies. These complementary signals act as a multi-view regularizer, aligning the latent space with atomistic structure, biochemical identity, and energetics, yielding embeddings that vary smoothly with conformation and are interpretable along axes of side-chain chemistry, solvent exposure, and secondary structure.

SLAE supports multiscale readouts: atom and residue embeddings for fine-grained local characterization, and pooled protein-level features for global structure. This flexibility allows downstream task heads to focus on single residues, interfaces, or entire folds using a single pretrained representation. We demonstrate that pretraining directly on all-atom protein structures yields features that transfer effectively. Across benchmarks on multiple resolution scale tasks including fold classification, protein–protein binding affinity, single-point mutation stability, and NMR chemical shifts, SLAE achieves state-of-the-art or on-par performance.

Main contributions:

With the SLAE framework, we (i) propose a residue-centered, local atom-graph protein representation, and show it is sufficient for high-fidelity all-atom reconstruction; (ii) propose an energy regression task to guide reconstruction pretraining; (iii) design local encoding and global decoding stages in an all-atom autoencoder to encourage compact and transferable residue embeddings; (iv) achieve state-of-the-art results on diverse downstream tasks with transfer learning; and (v) show that the above design yields an interpretable latent space.

2. Related Work

Protein Representation Pretraining

Protein representation learning has followed two main tracks. Sequence pretraining with protein language models (PLMs) on massive corpora captures evolutionary constraints but lacks explicit structure information (Meier et al., 2021; Lin et al., 2023). In parallel, graph denoising objectives noise sequence or structural features and train graph models to recover them (Zaidi et al., 2022; Jamasb et al., 2024), capturing global context while abstracting away side-chain geometry. Neither paradigm learns atomistic features as the primary signal. SLAE departs by pretraining directly on all-atom coordinate reconstruction and showing that features learned from atomistic geometry are sufficient for high-fidelity coordinate reconstruction and downstream transfer.

Sequence-structure co-embedding approaches pair PLM embeddings with structural features to inject geometry into sequence representations, improving downstream performance without learning at all-atom resolution. Representative methods include SaProt (Su et al., 2023b), FoldToken (Gao et al., 2024), ProSST (Li et al., 2024), and ESM3 (Hayes et al., 2024). Most hybrid models augment sequence tokens with backbone-level descriptors, and the learned tokens remain sequence-anchored. SLAE instead learns structure- and energetics-anchored residue tokens, reducing sequence-only bias while increasing the resolution of the structure representation.

All-atom Protein Representation

All-atom protein generative models, which simultaneously generate backbone and side-chain coordinates, also carry an all-atom representation of protein structure. Protpardelle (Chu et al., 2024) can be cast as a continuous normalizing flow to generate deterministic latent encodings of all-atom protein structures. A joint embedding space of sequence and all-atom structure was proposed in CHEAP (Lu et al., 2024), in which the embeddings reconstruct all-atom protein structures and recover sequence. However, interpolation between two conformations of the same protein sequence is not possible, as identical sequences would map to the same CHEAP embedding. Representations can also be derived from protein structure prediction models such as AlphaFold3 (Abramson et al., 2024), but the information is distributed across layers and in both single and pairwise representations.

Geometric GNNs for Atomistic Systems

Representing atomistic systems as geometric graphs is natural. While encoders for proteins have been proposed using point cloud voxelization, graph convolution, and hierarchical pooling (Hermosilla et al., 2021; Anand et al., 2022; Wang et al., 2023), they incur a considerable computational burden, making them impractical for large-scale pretraining with previously proposed denoising objectives. Equivariant GNNs such as DimeNet (Gasteiger et al., 2022), NequIP (Batzner et al., 2022), and MACE (Batatia et al., 2023) excel at small-molecule property prediction and interatomic potentials. For scalability, many adopt low-order interactions with truncated neighborhoods, closely related to Atomic Cluster Expansion (ACE) formulations (Drautz, 2019). Works extending atomistic modeling to proteins are emerging (Pengmei et al., 2024; Bojan et al., 2025), but existing approaches typically pretrain on small-molecule datasets, reuse features from pretrained potential models, or are trained in a task-specific manner. There remains a gap in methods amenable to large-scale, all-atom pretraining on proteins. SLAE addresses this by modeling two-body local interactions over cutoff graphs and pretraining a physics-informed autoencoder that yields a general, task-agnostic latent space at protein scale: thousands of atoms per system compared to tens of atoms.

3. The SLAE Framework

We introduce the SLAE autoencoder and its end-to-end pretraining objectives (Fig. 1A). SLAE solves a deliberately difficult two-part problem: the geometric graph encoder projects interatomic interactions within each atom’s local neighborhood into compact residue tokens, while the decoder learns a global prior over how these local environments compose into coherent macromolecular structures. This residue-level bottleneck over all-atom inputs makes large-scale pretraining tractable and learns meaningful embeddings.

Figure 1: Overview of the SLAE framework.


A. Pretraining A graph encoder maps local atomic neighborhoods to residue embeddings. Examples of atom connectivity shown as input to the encoder, with different colors for each residue. The transformer decoder connects pooled local features at residue level into the full-atom protein structure. The decoder also regresses to inter-residue energy score terms. B. Transfer learning The pretrained embeddings are fed to lightweight heads for diverse downstream tasks. C. Latent geometry Linear interpolations on latent space decode to physically coherent structures that follow changes on the underlying chemical-environment manifold.

3.1. Structure Representation

Given a protein structure, we construct a directed graph G=(V,E), where:

Nodes

Each node $v_i \in V$ represents heavy atom $a_i$. The node feature is a one-hot encoding of the atom’s chemical type.

Edges

For each pair of atoms $a_i, a_j$ with $\|a_j - a_i\|_2 \le 8$ Å, we define a directed edge $e_{ji} \in E$ with features $h^{(e)}_{ji}$ given by the concatenation of (i) the scalar interatomic distance $\|a_j - a_i\|_2$ expanded in Bessel radial basis functions $\phi_r(a_i, a_j)$ and (ii) the unit interatomic direction projected onto spherical harmonics $Y_{\ell}^{m}\!\left(\phi_a(a_i, a_j)\right)$.

Design Motivation

This representation is minimal yet physically complete: it encodes interatomic distances and orientations without relying on torsion angles, amino acids, or residue indices. As such, it enables generalization to arbitrary biomolecular complexes, which we leave for future work. Bond connectivity and hydrogen patterns are learned implicitly through the autoencoder objective detailed in Section 3.4.
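As a concrete illustration, the following minimal Python/NumPy sketch builds this directed graph and its raw edge geometry from heavy-atom coordinates and element symbols. The element vocabulary, the build_local_graph name, and the return layout are illustrative rather than taken from the released implementation; the Bessel and spherical-harmonic expansions of Appendix A.2 would then be applied to the returned distances and directions.

import numpy as np

# Illustrative heavy-atom element vocabulary; the encoder one-hot encodes atom chemical types.
ELEMENTS = ["C", "N", "O", "S", "SE"]

def build_local_graph(coords, elements, cutoff=8.0):
    """coords: (N, 3) heavy-atom coordinates; elements: length-N element symbols.
    Returns node features, directed edges within the cutoff, and raw edge geometry."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]

    # Node features: one-hot encoding of the atom's chemical type.
    node_feats = np.zeros((n, len(ELEMENTS)))
    for i, el in enumerate(elements):
        node_feats[i, ELEMENTS.index(el)] = 1.0

    # Pairwise displacements and distances: diff[i, j] = a_j - a_i.
    diff = coords[None, :, :] - coords[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Directed edge (j -> i) whenever ||a_j - a_i|| <= cutoff and i != j.
    center, nbr = np.where((dist <= cutoff) & (dist > 0.0))

    edge_dist = dist[center, nbr]                       # scalar distances, expanded in Bessel RBFs
    edge_dir = diff[center, nbr] / edge_dist[:, None]   # unit directions, projected onto spherical harmonics
    return node_feats, (center, nbr), edge_dist, edge_dir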

3.2. Encoder

The encoder maps each atom’s local chemical environment into residue-level latent embeddings $z_1, \dots, z_n$, with $z_i \in \mathbb{R}^{128}$.

Equivariant Neighborhood Embedding

We employ an SE(3)-equivariant neural network, inspired by Musaelian et al. (2023), that operates on each heavy atom and its neighbors through learned edge embeddings. Each layer $L$ maintains coupled latent spaces: a scalar space $x_{ij}^{L}$ (invariant) and a tensor space $V_{ij}^{L}$ (equivariant). An equivariant tensor product incorporates interactions between the current equivariant state of the center–neighbor pair $(i,j)$ and all other neighbors $k \in N(i)$: $V_{ij}^{L} = V_{ij}^{L-1} \otimes \sum_{k \in N(i)} w_{ik}^{L}\,\phi(\hat{r}_{ik})$, where $\phi(\hat{r}_{ik})$ is a geometric embedding of the neighbor direction and $w_{ik}^{L}$ are learned weights derived from scalar features of edges $(i,k)$. This can be viewed as a weighted projection of the atomic density around atom $i$, enabling equivariant interactions between the pair $(i,j)$ and the environment of $i$.

Following the tensor product, scalar outputs are reintroduced into the scalar latent space with $x_{ij}^{L} = \mathrm{MLP}_{\mathrm{latent}}^{L}\!\left(x_{ij}^{L-1} \,\|\, V_{ij}^{L}\right) u(r_{ij})$, where $u(r_{ij})$ is a smooth cutoff envelope. This step completes the coupling of scalar and equivariant latent spaces: scalars distilled from tensor products inject directional information back into $x_{ij}^{L}$, allowing the invariant channel to carry geometric cues that were previously only available to the equivariant representation.

Residue Environment Pooling

After the final layer, we obtain scalar pair features $x_{ij}^{L}$. We first pool to atoms by mean-aggregating incoming edges, and then pool atom embeddings to residues: $s_i = \frac{1}{|N(i)|}\sum_{j \in N(i)} x_{ij}^{L}$, $z_r = \frac{1}{|A(r)|}\sum_{i \in A(r)} s_i$. This yields compact residue-level representations while retaining strictly local chemical information.
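A minimal sketch of this two-stage pooling, assuming the final-layer scalar edge features are stored as an array indexed by their receiving (center) atom and that an atom-to-residue assignment is available (array names are illustrative):

import numpy as np

def pool_edges_to_residues(edge_feats, edge_center, atom_residue, n_atoms, n_res):
    """edge_feats: (E, d) final-layer scalar features x_ij.
    edge_center: (E,) index of the center atom i that each edge points into.
    atom_residue: (n_atoms,) residue index of each atom."""
    d = edge_feats.shape[1]

    # Mean over incoming edges -> per-atom embedding s_i.
    atom_sum = np.zeros((n_atoms, d))
    atom_cnt = np.zeros(n_atoms)
    np.add.at(atom_sum, edge_center, edge_feats)
    np.add.at(atom_cnt, edge_center, 1.0)
    s = atom_sum / np.maximum(atom_cnt, 1.0)[:, None]

    # Mean over atoms in each residue -> residue embedding z_r (here d = 128).
    res_sum = np.zeros((n_res, d))
    res_cnt = np.zeros(n_res)
    np.add.at(res_sum, atom_residue, s)
    np.add.at(res_cnt, atom_residue, 1.0)
    return res_sum / np.maximum(res_cnt, 1.0)[:, None]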

Design Motivation

The encoder updates edge embeddings dynamically by incorporating information from neighboring edges. This paradigm, originally developed for interatomic potentials on small-molecule graphs, extends naturally to large protein graphs. This allows SLAE to capture strictly local but physically meaningful chemical environments. Pooling representations to the residue level serves as an efficient and natural information bottleneck for protein structure.

3.3. Decoder

Having distilled each residue’s local chemistry and geometry into embeddings $z \in \mathbb{R}^{128}$, the decoder assembles these local descriptors into a single, coherent macromolecule that respects long-range couplings.

Architecture

We first project each latent embedding to a model dimension of 1024. On top of these expanded embeddings, we employ a Transformer architecture with global self-attention and Rotary Positional Embeddings (RoPE) (Su et al., 2023a) to capture long-range residue interactions with a stack of multi-head self-attention layers.

The Transformer outputs are passed into three parallel MLP heads for structure reconstruction, sequence recovery, and energy prediction:

  1. Reconstructs the 3D coordinates of up to 37 heavy atoms (backbone and side chain) per residue, $\hat{x} \in \mathbb{R}^{n \times 37 \times 3}$.

  2. Recovers the amino acid identity at each residue position, $\hat{s} \in \mathbb{R}^{n \times 20}$.

  3. Approximates inter-residue physical interactions using Rosetta scores, including hydrogen bonding, electrostatics, and solvation energies, $\hat{r} \in \mathbb{R}^{n \times n \times 3}$.

Design Motivation

The decoder is designed to complement the encoder’s strictly local representation by modeling global dependencies across residues. Global self-attention allows residue embeddings to exchange information across the entire protein, enabling the reconstruction of coherent backbone and side-chain geometries. The addition of energy prediction task guides the decoder toward physically meaningful structures, ensuring that the latent space encodes not only geometric detail but also the energetic constraints that govern protein stability and interactions.
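To make the head layout concrete, the following PyTorch sketch wires the three readouts described above onto the Transformer outputs. It is a simplified illustration rather than the released implementation: the pairwise energy featurization is reduced to a concatenation of per-residue projections (the actual head uses pairwise product/difference features and per-term clamping; see Appendix A.3), and the hidden width d_pair is a placeholder.

import torch
import torch.nn as nn

class SLAEHeads(nn.Module):
    """Parallel readout heads over decoder outputs H of shape (n_res, d_model)."""
    def __init__(self, d_model=1024, d_pair=128):
        super().__init__()
        self.coord_head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 37 * 3))
        self.seq_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                      nn.Linear(d_model, 20))
        self.pair_proj = nn.Linear(d_model, d_pair)
        self.energy_head = nn.Sequential(nn.Linear(2 * d_pair, d_pair), nn.GELU(),
                                         nn.Linear(d_pair, 3))   # hbond, solvation, electrostatics

    def forward(self, H):
        n = H.shape[0]
        coords = self.coord_head(H).view(n, 37, 3)            # atom-37 coordinates per residue
        seq_logits = self.seq_head(H)                          # 20-way amino acid logits
        p = self.pair_proj(H)
        pair = torch.cat([p[:, None, :].expand(n, n, -1),      # lift per-residue features to pairs
                          p[None, :, :].expand(n, n, -1)], dim=-1)
        energies = self.energy_head(pair)                      # (n, n, 3) inter-residue energy terms
        return coords, seq_logits, energies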

3.4. Pretraining

We pretrain SLAE end-to-end on full atomic structures with three complementary objectives:

  1. All-atom Structure Recovery To obtain the predicted structure $\hat{x}$, we mask the atom-37 template coordinates while providing the ground-truth residue identities, training the decoder to recover the ground-truth coordinates. We supervise this reconstruction with a combination of an all-atom local distance difference test loss (SmoothLDDT) (Jumper et al., 2021) and frame-aligned point error (FAPE) (Anishchenko et al., 2024): $\mathcal{L}_{\text{struct}} = \alpha\,\mathrm{LDDT}(x, \hat{x}) + \beta\,\mathrm{FAPE}(x, \hat{x})$, where $x$ and $\hat{x}$ denote the ground-truth and predicted all-atom coordinates.

  2. Sequence Recovery We additionally recover the residue sequence from the latent space: $\mathcal{L}_{\text{seq}} = \mathrm{CrossEntropy}(s, \hat{s})$, where $s$ is the ground-truth amino-acid identity and $\hat{s}$ are the predicted logits over 20 amino acid classes.

  3. Energy Prediction To inject physically grounded supervision, we predict inter-residue energies approximated by Rosetta scores, including hydrogen bonding, electrostatics, and solvation: $\mathcal{L}_{\text{energy}} = \|r - \hat{r}\|_2^2$, where $r$ and $\hat{r}$ are the ground-truth and predicted energy terms.

The combined loss integrates all three components:

$\mathcal{L} = w_{\text{coord}}\left(\alpha\,\mathrm{LDDT} + \beta\,\mathrm{FAPE}\right) + w_{\text{seq}}\,\mathrm{CrossEntropy} + w_{\text{energy}}\,\mathrm{MSE}$ (1)

with weights $w_{\text{coord}}, w_{\text{seq}}, w_{\text{energy}} \ge 0$ as tunable hyperparameters (Appendix B.1).
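A minimal sketch of assembling this objective, assuming smooth_lddt and fape implement the losses of Appendix B.1 and that predictions and targets are packed into dictionaries (names and shapes are illustrative):

import torch.nn.functional as F

def slae_pretraining_loss(pred, target, smooth_lddt, fape,
                          w_coord=1.0, w_seq=1.0, w_energy=1.0, alpha=10.0, beta=1.0):
    """pred/target: dicts with 'coords' (n, 37, 3), 'seq' (logits / integer labels),
    and 'energy' (n, n, 3) entries."""
    l_struct = alpha * smooth_lddt(pred["coords"], target["coords"]) \
             + beta * fape(pred["coords"], target["coords"])
    l_seq = F.cross_entropy(pred["seq"], target["seq"])         # 20-class sequence recovery
    l_energy = F.mse_loss(pred["energy"], target["energy"])     # Rosetta-derived energy terms
    return w_coord * l_struct + w_seq * l_seq + w_energy * l_energy

The default weights mirror the pretraining setting in Appendix B.2 (w_coord = w_seq = w_energy = 1, α = 10, β = 1).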

Implicit Latent Space Regularization

By jointly optimizing geometry, identity, and energetics, SLAE’s pretraining objective provides complementary constraints on the latent space: (i) Geometry losses depend smoothly on atomic coordinates, promoting continuous and physically plausible reconstructions. (ii) Sequence recovery encourages embeddings to encode amino acid identity, preserving biochemical interpretability and avoiding collapse. (iii) Energy prediction provides a physics-based signal, guiding embeddings toward inter-residue interactions such as hydrogen bonding, solvation, and electrostatics. These losses shape a latent manifold that maps cleanly onto valid, physically coherent protein conformations. The result is a structurally consistent, chemically informative, and energetically grounded representation without relying on explicit regularizers.

3.5. Results and ablations

We pretrain SLAE on a sequence-augmented CATH (Ingraham et al., 2019)-derived dataset (Lu et al., 2025b) (Appendix C). On the held-out test set with no family overlap, the autoencoder achieves 99.9% sequence recovery and an all-atom RMSD of 1.1 Å for structures shorter than 128 residues and 1.9 Å across all lengths up to 512 residues.

We study the effect of model and pretraining design choices on pretraining performance (Table 6). For encoder locality, we swept cutoff radii and found that an 8 Å neighborhood yields the best results (Appendix E). For discretization, we compare end-to-end VQ (van den Oord et al., 2018) and LFQ (Yu et al., 2023) against post-hoc kNN codebooks built on frozen encoder embeddings. End-to-end quantization trades off sequence and structure accuracy, whereas reconstruction from post-hoc kNN-codebook-quantized embeddings approaches continuous resolution as the codebook grows. Ablation experiments (Table 6, Appendix E) further highlight the importance of both the FAPE loss and Rosetta-derived energy supervision, confirming the effectiveness of our multitask pretraining framework. These results validate the design choices and permit downstream evaluation on a faithful representation of protein structures.

4. Downstream Tasks

We next demonstrate that SLAE embeddings pretrained on all-atom reconstruction and energetics objectives transfer effectively to diverse downstream tasks (Figure 1B). Across all four benchmarks spanning complementary biological scales, SLAE achieves performance better than or on par with state-of-the-art methods, underscoring the generality and flexibility of the SLAE framework.

Fold Classification

Protein fold classification is a cornerstone of structural biology, linking structure to evolutionary relationships and functional annotation. Using the SCOPe 1.75 dataset (Fox et al., 2014) and following Hou et al. (2018), we evaluate generalization under three test sets: Family, Superfamily, and Fold. An MLP is trained on pooled residue embeddings. SLAE achieves on-par or superior accuracy compared to prior state-of-the-art models across all splits (Table 2), demonstrating that global fold information can be recovered even from strictly local all-atom embeddings.

Table 2:

Fold classification accuracy (%) on SCOPe 1.75 under three test splits

Method Fold (%) Superfamily (%) Family (%)
GVP-GNN(Jing et al., 2021) 16.0 22.5 83.8
IEConv(Hermosilla et al., 2021) 45.0 69.7 98.9
GearNet-Edge-IEConv (Zhang et al., 2023) 48.3 70.3 99.5
ProNet-SCHull (Wang et al., 2024) 56.1 74.6 99.4
SLAE-finetuned 55.1 77.1 99.1

Protein-Protein Binding Affinity Prediction

Protein-protein interactions underlie nearly all cellular processes, and accurate prediction of binding affinity is critical for understanding signaling pathways, complex assembly, and therapeutic design. We evaluate SLAE on the PPB-Affinity dataset (Liu et al., 2024), a recently curated large-scale benchmark that aggregates 12,062 experimental binding ΔΔG values from multiple sources and aligns them with high-quality structural complexes.

Complex structures are embedded chain-wise and interface-wise with the SLAE encoder, and pooled residue embeddings are passed into an MLP for regression. In 5-fold cross-validation, SLAE achieves lower RMSE and higher Pearson correlation than PLM-based baselines (Table 3). Despite being pretrained only on single-chain data, SLAE generalizes seamlessly to multi-chain contexts, thanks to its atomistic representation that does not rely on residue or chain indices.

Table 3:

Protein-protein binding affinity prediction on the PPB-Affinity dataset

Method RMSE (kcal/mol) Pearson Correlation
PPB-Affinity Baseline (Liu et al., 2024) 2.08 0.70
PPLM-Affinity (Liu et al., 2025) 1.89 0.76
SLAE-finetuned (w/o. interface) 2.01 0.73
SLAE-finetuned (with interface) 1.86 0.77

Single-Point Mutation Thermostability Prediction

Protein stability is fundamental to function, and predicting the impact of point mutations on thermostability (ΔΔG) is a central challenge for protein engineering, drug resistance modeling, and disease variant interpretation. We benchmark SLAE on the Megascale mutation dataset (Tsuboyama et al., 2023), filtered according to the ThermoMPNN protocol (Dieckhaus et al., 2024) to 272,712 mutations across 298 proteins.

Pairs of wild-type and mutant structures are embedded with residue-level differences extracted at the mutation site. An MLP head predicts ΔΔG. SLAE achieves 0.68 RMSE and 0.76 Pearson correlation (Table 4) on the test set, outperforming prior methods. Ablation experiments show that removing mutation-site differencing degrades performance, highlighting the importance of local residue environment modeling for physical property prediction in the SLAE framework.

Table 4:

Single-point mutation thermostability prediction on the Megascale dataset test split

Method RMSE (kcal/mol) Pearson Correlation
Rosetta (Pancotti et al., 2022) 5.18 0.53
RaSP (Blaabjerg et al., 2023) 1.08 0.71
ThermoMPNN (Dieckhaus et al., 2024) 0.71 0.75
SLAE-finetuned (w/o. mutated site) 0.73 0.70
SLAE-finetuned (with mutated site) 0.68 0.76

Chemical Shift Prediction

NMR chemical shifts are among the most direct experimental probes of local atomic environments; among them, backbone nitrogen shifts are notoriously difficult to predict accurately due to their large variance and contributions from ring currents, electrostatics, and subtle side-chain conformations. We benchmark on a stringently filtered BMRB (Hoch et al., 2023) set, which contains 2,532 training and 594 validation chemical shift records and their corresponding AlphaFold2-predicted structures. We adopt the PLM-CS framework as the baseline model architecture, which trains a lightweight predictor on top of pretrained representations (Zhu et al., 2025).

We report validation set performance of finetuned SLAE along with PLM-CS results using multiple protein residue embeddings, including ESM2, AlphaFold2, ProSST, and SLAE (see Footnote 1). Finetuned SLAE achieves the lowest RMSE and highest correlation, substantially outperforming the retrained PLM-CS baselines (Table 5). This demonstrates that SLAE embeddings capture fine-grained atomistic features essential for NMR observables.

Table 5:

Backbone nitrogen chemical shift prediction on BMRB

Method RMSE(ppm) Pearson Correlation
PLMCS-AF2 2.94 0.82
PLMCS-ESM2 2.74 0.84
PLMCS-ProSST 2.53 0.87
PLMCS-SLAE 2.53 0.87
SLAE-finetuned 1.88 0.93

5. Interpreting the Latent Space

SLAE’s downstream performance stems from a structured, interpretable latent space. We show that residue embeddings are organized along biochemically meaningful axes, are sensitive to local environment changes, and admit linear paths that decode to geometrically coherent structures (Figure 1C).

5.1. Embedding Variability Reflects Chemical Environment Change

To probe what SLAE embeddings capture at the residue level, we analyze how they organize across local chemical environments. Dimensionality reduction of kNN centroids from CATH (Section 3.5, Appendix E) shows that residue latents cluster by side chain chemistry and broader structural context. The latent space also stratifies along gradients of solvent accessibility and separates by secondary structure, with helices, sheets, and coils occupying distinct submanifolds (Figure 3, App. Figs. 6 and 7). This indicates that the SLAE representation is sensitive to both chemical identity and structural environment.

Figure 3: SLAE embedding comparison between native and decoy structures.


(Native in yellow; decoys colored by TM-score to their native; warmer = more native-like.) Left: protein-level PCA, where each point is a protein. Right: residue-level PCA for 1TUD and its decoys, with decoy residues colored by their parent decoy’s TM-score. In both panels, SLAE embeddings organize along gradients of nativeness, revealing coherent neighborhoods that align with structural quality.

We then quantify this sensitivity using the mdCATH dataset (Mirarchi et al., 2024). Across 5,398 proteins, per-residue latent displacement between conformers correlates with physical measures of environment variability: changes in contact maps and solvent exposure explain over half of the variance in embedding similarity (R2 = 0.55, ρ ≈ 0.74; Appendix E). Thus, SLAE embeddings consistently track how residues respond to burial, packing, and secondary-structure transitions.

5.2. Discriminative power over Native-Decoy Residue environments

We show that SLAE residue latents, which capture local environments, contain signal that distinguishes native structures from decoys zero-shot and provide a practical embedding space for evaluating backbone–sequence co-design.

On the Rosetta decoy dataset (Park et al., 2016), containing 133 native protein structures with thousands of decoys each, the native–decoy cosine margin is 0.136 averaged across residues. We further fit a leave-protein-out logistic regression, training on all proteins except one and testing on the held-out protein’s residues, and report AUROC = 0.659 (Appendix E), indicating a moderate, generalizable linear signal at the residue level.

Motivated by this discriminative signal, we use the SLAE embedding space to quantify the distributional coverage of generative models, extending prior metrics (Lu et al., 2025a) to all-atom resolution and residue granularity. As a proof of concept, we compute per-residue-type Fréchet Protein Distance (FPD) between SLAE embeddings of the generated structures and the native CATH distribution for models such as Chroma (Ingraham et al., 2023), Protpardelle-1c (Lu et al., 2025b), and La-Proteina (Geffner et al., 2025). The FPD metrics reveal subtle differences in the coverage of local amino acid environments by different generative models (Appendix E.3, App. Fig. 8). For example, biased sampling is evident in La-Proteina samples for serine, threonine, and valine relative to Protpardelle-1c and Chroma. Using SLAE embeddings provides a more sensitive view of the coverage of all-atom local environments, which are ignored in backbone-based metrics and which may be averaged out at the global protein fold level as in previous assessments of generative model coverage of protein structures.

5.3. Smooth Latent Interpolation Captures Conformational Transitions

Latent space smoothness is relevant for evaluating whether a representation supports continuous sampling of protein conformations. Unlike variational autoencoders, which encourage smoothness via KL regularization to a simple prior, the SLAE autoencoder relies solely on physics-augmented pretraining objectives. We examine the smoothness of the SLAE latent space by linear interpolation between two conformational states $Z^{(A)}$ and $Z^{(B)}$. For each residue $i$ and interpolation scale $t \in [0,1]$, the interpolated residue embedding is given by $z_i(t) = (1-t)\,z_i^{(A)} + t\,z_i^{(B)}$. The interpolated set $Z(t) = \{z_1(t), \dots, z_n(t)\}$ is then decoded into an all-atom structure with the pretrained SLAE decoder (Figure 1C).
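Concretely, the interpolation amounts to a per-residue linear blend followed by a single decoder pass. A sketch, assuming a pretrained model exposing encode/decode calls (the interface names are hypothetical):

import numpy as np

def interpolate_latents(z_a, z_b, n_steps=50):
    """z_a, z_b: (n_res, 128) SLAE residue embeddings of conformations A and B.
    Returns the latent sets along the linear path and the interpolation fractions."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts], ts

# Usage sketch with a hypothetical encode/decode interface:
# z_a = slae.encode(structure_state_a)               # e.g. AdK closed state (1AKE)
# z_b = slae.encode(structure_state_b)               # e.g. AdK open state (4AKE)
# latents, ts = interpolate_latents(z_a, z_b)
# intermediates = [slae.decode(z) for z in latents]  # all-atom structures along the path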

For two proteins with known conformational changes, adenylate kinase (AdK) and KaiB, we linearly interpolate between the SLAE embeddings of the two experimentally determined states (AdK: 1AKE, 4AKE; KaiB: 2QKE, 5JYT). We sample intermediate structures from 50 evenly spaced values of $t$ and align their backbone coordinates to frames in MD simulations of the transitions (Seyler et al., 2015; Zhang et al., 2024). For AdK, the interpolated structures closely track the MD intermediates, as evidenced by smooth trajectories with low RMSD (Figure 4), and they agree better than interpolations from the generative model (App. Fig. 10). Notably, these interpolations are unguided by any energy function or model likelihood; they arise solely from linear paths in the SLAE latent space anchored in pretraining with physics-based tasks. KaiB shows higher RMSD between steps 20 and 30 (Figure 4). Closer examination of the interpolated structures (App. Fig. 9) reveals disagreement in the C-terminus, which is known to unfold during the transition (Wayment-Steele et al., 2023). This degradation is expected, as SLAE is pretrained on folded structures and thus treats unfolded segments as out-of-distribution, where local environment cues under-constrain reconstruction.

Figure 4: Latent space interpolation between two conformations.


A. Structures sampled by linear interpolation (purple) overlaid with MD simulation frames (grey). B. Alignment RMSD to MD simulation trajectories.

Within the folded structure regime, SLAE’s latent space is sufficiently regular that simple linear paths often decode to geometrically coherent intermediates aligned with MD trajectories. These results support the view that SLAE embeddings approximate a continuous, chemically grounded manifold of protein structures. The latent space reflects local environmental variation while accommodating large-scale transitions, making it useful for downstream analysis and generative applications.

6. Conclusion

We introduced SLAE, a framework tailored to learning general-purpose representations of proteins at all-atom resolution. SLAE applies a strictly local graph neural network over atomic environments, using computationally simple layers to perform expressive geometric reasoning on atom-type and interatomic distance features. Pretraining is driven by a novel objective that combines full atomic coordinate reconstruction with energy score regression, yielding embeddings that are structurally faithful, chemically grounded, and energetically informed.

Figure 2: SLAE latent organization.


UMAP visualization of kNN centroids shows clustering by solvent accessibility (left) and secondary structure (right).

Table 1: Reconstruction performance of SLAE and ablations.

We report sequence recovery accuracy (%) and reconstruction RMSD (Å) on test structures. All further experiments use the highlighted best SLAE model.

Graph Radius (Å) Discretization Method Codebook Size Training Obj. Seq. Acc. (%) RMSD < 128 res (Å) RMSD < 512 res (Å)
8 LFQ 32768 all 75.2 2.50 3.74
8 kNN 4096 all 97.5 2.96 4.03
8 kNN 32768 all 99.4 1.60 2.31
8 w/o. FAPE 97.2 3.89 5.22
8 w/o. Energy 98.0 3.26 5.17
6 all 99.9 1.24 2.55
8 all 99.9 1.12 1.92

Acknowledgments

The computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. Y.C. and T.L. are supported by the Stanford Graduate Fellowship. C.Z. and H.K.W.-S. acknowledge financial support from the University of Wisconsin-Madison Office of the Vice Chancellor for Research, with funding from the Wisconsin Alumni Research Foundation. This project is supported by NIH (R01GM147893 to P.-S.H.), the Merck Research Laboratories (MRL) Scientific Engagement and Emerging Discovery Science (SEEDS) Program, and Stanford Medicine Catalyst. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

A. Model

A.1. Autoencoder pseudocode

The end-to-end SLAE autoencoder can be summarized as follows:

Algorithm 1.

SLAE Autoencoder. $h^{(e)}$: edge features, $\mathcal{E}$: SE(3)-equivariant update, $\mathcal{P}$: pooling, $\mathcal{D}_{\mathrm{Tr}}$: Transformer decoder. Outputs: $\hat{x}$ (coordinates), $\hat{s}$ (sequence), $\hat{r}$ (energies).

1: Input: heavy-atom coordinates $\{a_i\}_{i=1}^{N}$
2: Build $G = (V, E)$ with cutoff $r_c$
3: Init $h^{(e)} \leftarrow [\phi_r, \phi_a]$
4: for $L = 1$ to 2 do
5:   $x_{ij}^{L}, V_{ij}^{L} \leftarrow \mathcal{E}\!\left(x_{ij}^{L-1}, V_{ij}^{L-1}\right)$
6: end for
7: $z_r \leftarrow \mathcal{P}\!\left(x_{ij}^{L}\right)$
8: $(\hat{x}, \hat{s}, \hat{r}) \leftarrow \mathcal{D}_{\mathrm{Tr}}(z_r)$

A.2. Encoder Architecture

Notation

Let $a_i \in \mathbb{R}^{3}$ be the coordinate of atom $i$, $\mathbf{r}_{ij} = a_j - a_i$, $r_{ij} = \|\mathbf{r}_{ij}\|$, $\hat{r}_{ij} = \mathbf{r}_{ij}/r_{ij}$. The neighbor set of $i$ is $N(i) = \{j : r_{ij} \le r_c\}$. Each directed edge $(i,j)$ maintains invariant scalars $x_{ij}^{L} \in \mathbb{R}^{d_{\mathrm{sc}}}$ and equivariant tensors $V_{ij}^{L}$.

Two-body initialization

Edge features are initialized with radial and angular bases:

$x_{ij}^{0} = u(r_{ij})\, \mathrm{MLP}_{\mathrm{2b}}\!\left(\mathrm{onehot}(Z_i)\,\|\,\mathrm{onehot}(Z_j)\,\|\,\phi_r(r_{ij})\right)$, (2)
$V_{ij}^{0} = \omega_{ij}\,\phi_a(\hat{r}_{ij}), \qquad \omega_{ij} = \mathrm{MLP}_{\omega}\!\left(x_{ij}^{0}\right)$, (3)

where $\phi_r$ are Bessel radial basis functions, $\phi_a$ are angular embeddings (e.g., spherical harmonics), and $u(r_{ij})$ is a smooth cutoff envelope.
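A sketch of this radial featurization, assuming the standard sine-Bessel basis and a DimeNet-style polynomial cutoff envelope (the exact envelope used in SLAE may differ):

import numpy as np

def bessel_basis(r, r_cut=8.0, n_basis=8):
    """phi_r(r): sine-Bessel radial basis evaluated at distance r, shape (n_basis,)."""
    n = np.arange(1, n_basis + 1)
    return np.sqrt(2.0 / r_cut) * np.sin(n * np.pi * r / r_cut) / r

def cutoff_envelope(r, r_cut=8.0, p=6):
    """u(r): smooth polynomial envelope that goes to 0 with zero slope at r_cut."""
    x = np.clip(r / r_cut, 0.0, 1.0)
    return (1.0
            - (p + 1) * (p + 2) / 2.0 * x ** p
            + p * (p + 2) * x ** (p + 1)
            - p * (p + 1) / 2.0 * x ** (p + 2))

# Example for a 3.1 Å edge: phi_r feeds MLP_2b and u(r) scales its output in Eq. (2).
r = 3.1
phi = bessel_basis(r)
u = cutoff_envelope(r)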

Tensor product update

At layer L, equivariant features of edge (i,j) interact with the embedded environment of atom i:

$V_{ij}^{L} = V_{ij}^{L-1} \otimes \sum_{k \in N(i)} w_{ik}^{L}\,\phi(\hat{r}_{ik})$, (4)

where $\phi(\hat{r}_{ik})$ encodes neighbor geometry and $w_{ik}^{L} = \mathrm{MLP}_{\mathrm{embed}}^{L}\!\left(x_{ik}^{L-1}\right)$ are learned weights. This corresponds to a weighted projection of the atomic density around atom $i$.

Latent scalar update.

Scalar channels are updated with tensor product scalars:

$x_{ij}^{L} = \mathrm{MLP}_{\mathrm{latent}}^{L}\!\left(x_{ij}^{L-1}\,\|\,V_{ij}^{L}\right) u(r_{ij})$, (5)

injecting geometric information from $V_{ij}^{L}$ back into $x_{ij}^{L}$.

Hierarchical pooling

Final edge scalars are aggregated:

$s_i = \frac{1}{|N(i)|}\sum_{j \in N(i)} x_{ij}^{L}$, (6)
$z_r = \frac{1}{|A(r)|}\sum_{i \in A(r)} s_i, \qquad z_r \in \mathbb{R}^{128}$, (7)

producing residue-level embeddings $z_r$.

A.3. Decoder architecture

Transformer backbone

We employ a standard pre-norm Transformer encoder with Rotary Positional Embeddings (RoPE), $L_{\mathrm{Tr}} = 8$ layers, $h = 16$ heads, and model width $d_{\mathrm{model}} = 1024$. Each layer consists of:

  • Multi-head self-attention with RoPE (pre-norm): $\mathrm{MHA}_{\mathrm{RoPE}}(\mathrm{LayerNorm}(\cdot))$.

  • Residual connection.

  • Feed-forward network with hidden dimension $d_{\mathrm{ff}}$ and SwiGLU, applied as $\mathrm{FFN}(\mathrm{LayerNorm}(\cdot))$.

  • No dropout.

Formally:

$H^{(\ell)\prime} = \mathrm{MHA}_{\mathrm{RoPE}}\!\left(\mathrm{LayerNorm}\!\left(H^{(\ell-1)}\right)\right) + H^{(\ell-1)}$, (8)
$H^{(\ell)} = \mathrm{FFN}\!\left(\mathrm{LayerNorm}\!\left(H^{(\ell)\prime}\right)\right) + H^{(\ell)\prime}$, (9)

for $\ell = 1, \dots, L_{\mathrm{Tr}}$, with $H^{(0)} = [z_1, \dots, z_n]$.

Prediction heads

From final hidden states $H \in \mathbb{R}^{n \times d_{\mathrm{model}}}$ with $d_{\mathrm{model}} = 1024$, we apply three parallel heads:

  1. 3D coordinates (linear head) LayerNorm + Linear maps per-residue embeddings to all Atom37 coordinates:
    $\hat{x} = \mathrm{Unflatten}\!\left(\mathrm{Linear}(\mathrm{LN}(H)),\, 37 \times 3\right) \in \mathbb{R}^{n \times 37 \times 3}$.
    (Atoms 1–4 are N, Cα, C, and O; atoms 5–37 are side chain. Masking is applied via the Atom37 mask.)
  2. Sequence logits on valid tokens An MLP head operates only on valid tokens (mask-compacted), then is re-padded for the loss:
    $\hat{s} = \mathrm{MLP}_{\mathrm{seq}}\!\left(H_{\mathrm{valid}}\right) \in \mathbb{R}^{n_{\mathrm{valid}} \times 20}$.
  3. Pairwise energies A pairwise feature head first down-projects $H$, lifts to 2D by pairwise product/difference, applies a small MLP, then per-type linear heads with magnitude clamped at $10^3$:
    $\hat{r} = \left[\hat{r}_{\mathrm{hbond}}, \hat{r}_{\mathrm{sol}}, \hat{r}_{\mathrm{elec}}\right] \in \mathbb{R}^{n \times n \times 3}$.

A.4. Task-specific heads

Trainable decoder backbone.

We expose a lightweight wrapper over the DecoderBackbone to enable fine-tuning the last N Transformer blocks while freezing the rest. Taking the single-site mutation stability task as an example, we document the layout of downstream task-specific fine-tuning here.

Contrastive and site-aware head

A Siamese head takes two or more structure embeddings (e.g., wild-type and mutant), runs them through the shared DecoderBackbone, and regresses a scalar target (e.g., ΔΔG). Beyond global contrastive pooling, it can extract site-specific residue representations, enabling residue-level tasks.

Backbone embeddings.

Given masked inputs $\left(X^{\mathrm{wt}}, M^{\mathrm{wt}}\right)$ and $\left(X^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$,

$H^{\mathrm{wt}} = \mathrm{DecoderBackbone}\!\left(X^{\mathrm{wt}}, M^{\mathrm{wt}}\right), \qquad H^{\mathrm{mut}} = \mathrm{DecoderBackbone}\!\left(X^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$.

Mask-aware pooling and site features.

Let the mean-pooling operator be

$\mathrm{Pool}(H, M) = \dfrac{\sum_{t=1}^{L} H_{:,t,:}\, M_{:,t}}{\sum_{t=1}^{L} M_{:,t} + \varepsilon} \in \mathbb{R}^{B \times 1024}$.

We form global embeddings $z^{\mathrm{wt}} = \mathrm{Pool}\!\left(H^{\mathrm{wt}}, M^{\mathrm{wt}}\right)$ and $z^{\mathrm{mut}} = \mathrm{Pool}\!\left(H^{\mathrm{mut}}, M^{\mathrm{mut}}\right)$. Given mutation indices $\iota \in \{1, \dots, L\}^{B}$, we also extract site embeddings

$s^{\mathrm{wt}} = H^{\mathrm{wt}}[\mathrm{range}(B), \iota], \qquad s^{\mathrm{mut}} = H^{\mathrm{mut}}[\mathrm{range}(B), \iota] \in \mathbb{R}^{B \times 1024}$.

Contrastive feature and MLP regressor.

We concatenate global and site representations together with their difference:

$u = \left[z^{\mathrm{wt}},\, z^{\mathrm{mut}},\, s^{\mathrm{wt}},\, s^{\mathrm{mut}},\, s^{\mathrm{mut}} - s^{\mathrm{wt}}\right] \in \mathbb{R}^{B \times (5 \cdot 1024)}$.

A small MLP head predicts a scalar per pair:

$\hat{y} = \mathrm{MLP}(u) = \mathrm{Linear}\!\left(\mathrm{GELU}\!\left(\mathrm{LayerNorm}\!\left(\mathrm{Linear}\!\left(\mathrm{GELU}(u)\right)\right)\right)\right) \in \mathbb{R}^{B \times 1}$.
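A PyTorch sketch of this site-aware head, assuming batch-first backbone states of width 1024 and integer mutation indices; the layer order follows the composition written above, and hidden sizes are placeholders:

import torch
import torch.nn as nn

def masked_mean_pool(H, M, eps=1e-8):
    """H: (B, L, 1024) backbone states, M: (B, L) validity mask -> (B, 1024)."""
    return (H * M.unsqueeze(-1)).sum(dim=1) / (M.sum(dim=1, keepdim=True) + eps)

class DDGHead(nn.Module):
    def __init__(self, d=1024, d_hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.GELU(), nn.Linear(5 * d, d_hidden), nn.LayerNorm(d_hidden),
            nn.GELU(), nn.Linear(d_hidden, 1),
        )

    def forward(self, H_wt, M_wt, H_mut, M_mut, site_idx):
        B = H_wt.shape[0]
        z_wt, z_mut = masked_mean_pool(H_wt, M_wt), masked_mean_pool(H_mut, M_mut)
        s_wt = H_wt[torch.arange(B), site_idx]       # wild-type state at the mutated site
        s_mut = H_mut[torch.arange(B), site_idx]     # mutant state at the mutated site
        u = torch.cat([z_wt, z_mut, s_wt, s_mut, s_mut - s_wt], dim=-1)
        return self.mlp(u).squeeze(-1)               # predicted ΔΔG per wild-type/mutant pair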

General usage

The same interface supports other pairwise or single-input tasks by (i) choosing one or multiple passes through DecoderBackbone, (ii) selecting global vs. site-wise features, and (iii) swapping the final MLP for the appropriate output dimensionality/loss. For atom-level tasks, the DecoderBackbone can be reinitialized with the smaller attention window.

B. Training

B.1. Losses

All-atom FAPE (Frame-Aligned Point Error)

All-atom FAPE is computed by aligning the predicted and reference structures on every triplet of bonded atoms $(i,j,k)$ (with the exception of symmetric side chain atoms) and then measuring per-atom positional deviations between the aligned structures. For each frame $f(i,j,k)$ (with $j$ as the origin), define an orthonormal basis for predicted/true coordinates via a deterministic map $\Phi : \mathbb{R}^{3} \times \mathbb{R}^{3} \times \mathbb{R}^{3} \to SO(3)$:

$U_f^{\mathrm{pred}} = \Phi\!\left(\hat{x}_i, \hat{x}_j, \hat{x}_k\right), \qquad U_f^{\mathrm{true}} = \Phi\!\left(x_i, x_j, x_k\right)$,

where Φ constructs column vectors from the two edge directions at j,

$v_{ji} = x_i - x_j, \qquad v_{jk} = x_k - x_j$,

then

$e_0 = v_{jk} \times v_{ji}, \qquad e_2 = v_{ji} - v_{jk}, \qquad e_1 = e_2 \times e_0$,

and column-normalizes $e_0, e_1, e_2$ to obtain a right-handed $3 \times 3$ matrix.

For any atom $a$ in the same protein as $f$, rotate origin-subtracted positions into the local frames:

$r_{f,a}^{\mathrm{pred}} = U_f^{\mathrm{pred}}\!\left(\hat{x}_a - \hat{x}_j\right), \qquad r_{f,a}^{\mathrm{true}} = U_f^{\mathrm{true}}\!\left(x_a - x_j\right)$.

Define $d_{f,a} = \left\|r_{f,a}^{\mathrm{pred}} - r_{f,a}^{\mathrm{true}}\right\|_2$, clamped at $c = 10$ as $\tilde{d}_{f,a} = \min\!\left(d_{f,a}, c\right)$, and apply a Huber penalty with $\delta = 1.0$:

$\rho_\delta(\tilde{d}) = \begin{cases} \frac{1}{2}\tilde{d}^{2}, & \tilde{d} \le \delta, \\ \delta\tilde{d} - \frac{1}{2}\delta^{2}, & \tilde{d} > \delta. \end{cases}$

We average first over frames and then over atoms, yielding an atom-weighted mean:

$\mathcal{L}_{\mathrm{FAPE}} = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{|A_b|}\sum_{a \in A_b} \frac{1}{|F_b|}\sum_{f \in F_b} \rho_\delta\!\left(\tilde{d}_{f,a}\right)$.
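A compact NumPy sketch of this loss for a single structure; frames are given as index triplets of bonded atoms, the in-plane axis follows the reconstruction above, and symmetric side-chain handling and batching are omitted:

import numpy as np

def build_frame(xi, xj, xk):
    """Phi: right-handed orthonormal frame from three bonded atoms, origin at xj."""
    v_ji, v_jk = xi - xj, xk - xj
    e0 = np.cross(v_jk, v_ji)
    e2 = v_ji - v_jk
    e1 = np.cross(e2, e0)
    U = np.stack([e0, e1, e2], axis=1)
    return U / np.linalg.norm(U, axis=0, keepdims=True)    # column-normalize

def fape_loss(x_pred, x_true, frames, clamp=10.0, delta=1.0):
    """x_pred, x_true: (N, 3) all-atom coordinates; frames: list of (i, j, k) index triplets."""
    per_frame = []
    for (i, j, k) in frames:
        Up = build_frame(x_pred[i], x_pred[j], x_pred[k])
        Ut = build_frame(x_true[i], x_true[j], x_true[k])
        rp = (x_pred - x_pred[j]) @ Up.T                    # rotate origin-subtracted positions
        rt = (x_true - x_true[j]) @ Ut.T                    # into the local frames
        d = np.minimum(np.linalg.norm(rp - rt, axis=-1), clamp)
        huber = np.where(d <= delta, 0.5 * d ** 2, delta * d - 0.5 * delta ** 2)
        per_frame.append(huber)
    # Average over frames first, then over atoms.
    return float(np.stack(per_frame, axis=0).mean(axis=0).mean())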

All-atom smooth LDDT

We use a differentiable, all-atom version of LDDT that compares pairwise distances within a cutoff. Let $P = \{((i,a),(j,b))\}$ be all heavy-atom pairs with $\|x_{i,a} - x_{j,b}\| \le R_{\max}$ and not in the same residue. Define ground-truth and predicted distances $d_{ia,jb} = \|x_{i,a} - x_{j,b}\|$ and $\hat{d}_{ia,jb} = \|\hat{x}_{i,a} - \hat{x}_{j,b}\|$, and the absolute error $\Delta_{ia,jb} = |\hat{d}_{ia,jb} - d_{ia,jb}|$. Using standard lDDT thresholds $\tau \in \{0.5, 1.0, 2.0, 4.0\}$ Å with smooth indicators $s_\tau(\Delta) = \sigma(\alpha(\tau - \Delta))$ (sigmoid; $\alpha$ controls sharpness),

$\mathrm{sLDDT}_i = \frac{1}{|P_i|}\sum_{((i,a),(j,b)) \in P_i} \frac{1}{|T|}\sum_{\tau \in T} s_\tau\!\left(\Delta_{ia,jb}\right), \qquad \mathcal{L}_{\mathrm{sLDDT}} = 1 - \frac{1}{N_{\mathrm{res}}}\sum_i \mathrm{sLDDT}_i$.
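A NumPy sketch of this term for a single structure; the cutoff value r_max below is a placeholder, and alpha is the sharpness hyperparameter of the sigmoid indicators:

import numpy as np

def smooth_lddt_loss(x_pred, x_true, atom_res, r_max=15.0, alpha=2.0,
                     thresholds=(0.5, 1.0, 2.0, 4.0)):
    """x_pred, x_true: (N, 3) heavy-atom coordinates; atom_res: (N,) residue index per atom."""
    d_true = np.linalg.norm(x_true[:, None] - x_true[None, :], axis=-1)
    d_pred = np.linalg.norm(x_pred[:, None] - x_pred[None, :], axis=-1)

    # Pairs within the cutoff, excluding atoms from the same residue.
    mask = (d_true <= r_max) & (atom_res[:, None] != atom_res[None, :])

    delta = np.abs(d_pred - d_true)
    score = np.zeros_like(delta)
    for tau in thresholds:                       # smooth indicator sigma(alpha * (tau - delta))
        score += 1.0 / (1.0 + np.exp(-alpha * (tau - delta)))
    score /= len(thresholds)

    # Per-residue sLDDT over that residue's retained pairs, then average over residues.
    lddt_res = []
    for r in np.unique(atom_res):
        m = mask & (atom_res[:, None] == r)
        if m.any():
            lddt_res.append(score[m].mean())
    return 1.0 - float(np.mean(lddt_res))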

Mean-squared error (MSE)

Used for continuous target regression:

$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|\Omega|}\sum_{u \in \Omega} \left\|\hat{y}_u - y_u\right\|_2^2$.

Huber loss.

Used for continuous target regression with $\delta = 1.35$:

$\mathcal{L}_{\mathrm{Huber}} = \frac{1}{|\Omega|}\sum_{u \in \Omega} \begin{cases} \frac{1}{2}\left(\hat{y}_u - y_u\right)^{2}, & \left|\hat{y}_u - y_u\right| \le \delta, \\ \delta\left|\hat{y}_u - y_u\right| - \frac{1}{2}\delta^{2}, & \text{otherwise.} \end{cases}$

B.2. Training specifics

The autoencoder is trained on a single NVIDIA A100 or H100 GPU using batch size 16. For pretraining we use $w_{\text{coord}} = w_{\text{seq}} = w_{\text{energy}} = 1$ and $\alpha = 10$, $\beta = 1$ for the loss $\mathcal{L} = w_{\text{coord}}(\alpha\,\mathrm{LDDT} + \beta\,\mathrm{FAPE}) + w_{\text{seq}}\,\mathrm{CrossEntropy} + w_{\text{energy}}\,\mathrm{MSE}$. We train for 30 epochs with early stopping when the validation loss has not decreased for 5 epochs. The learning rate schedule is linear warmup for 1,000 steps followed by cosine decay. Optimization uses AdamW with maximum learning rate $1 \times 10^{-4}$ and standard $\beta_1 = 0.9$, $\beta_2 = 0.999$ (weight decay as in AdamW defaults). Unless noted otherwise, downstream task-specific fine-tuning uses the same batch size and maximum learning rate $1 \times 10^{-5}$.

C. Datasets

Pretraining Structure

We train SLAE on a sequence-augmented CATH set (Lu et al., 2025b) built by redesigning each domain with 32 ProteinMPNN sequences and predicting structures with ESMFold; we retain only high-confidence, self-consistent structure models (pLDDT ≥ 80, scRMSD ≤ 2.0 Å), yielding 337,936 structures, with 271 test structures from held-out CATH domains. We evaluate the SLAE latent space on protein conformational ensembles sampled from the mdCATH dataset of molecular dynamics (MD) simulations (Mirarchi et al., 2024). We subsample 32 frames per protein across MD trajectory ensembles for each of the 5,398 structures.

Pretraining Rosetta Score

We use PyRosetta to compute residue pairwise energy scores for all pretraining structures under its default full-atom energy terms. For each pair of residues we compute (1) fa_sol: Lazaridis-Karplus solvation energy (2) fa_elec: Coulombic electrostatic potential with a distance-dependent dielectric (3) hbond: Sum of all hydrogen bonding terms for backbone and sidechain.

Fold Classification

We obtain the dataset from Hermosilla et al. (2021), which consolidated 16,712 proteins with 1195 different folds from the SCOPe 1.75 database (Fox et al., 2014). Three test sets are used: (1) Family, which allows proteins from the same family to appear in both training and test; (2) Superfamily, which excludes proteins sharing family membership with the training set; and (3) Fold, which further excludes proteins from the same superfamily as those in training. All structures are obtained from the SCOPe 1.75 archive.

Stability

We obtain the dataset curated by Dieckhaus et al. (2024) from Tsuboyama et al. (2023), composed of 272,712 single point mutations and their experimental ΔΔG values. The proteins were clustered using MMseqs2 with a sequence identity cutoff of 25% to yield 239 training, 31 validation, and 29 test proteins. For wild-type sequences we predict structures with AlphaFold2. For all mutated structures we model the mutation with PyRosetta and relax within an 8 Å radius to obtain training structures.

Binding Affinity

We use the PPB-Affinity dataset (Liu et al., 2024), which integrates experimental protein-protein binding affinity data from several source databases: SKEMPI v2.0, SAbDab, PDB-bind v2020, Affinity Benchmark v5.5, and ATLAS. This dataset contains 12,062 unique binding complexes spanning 3,032 unique PDB codes and point mutations. We use the structures curated in the dataset and define interface residues as those within 5 Å of atoms of neighboring chains. For all mutations we mutate the side chain with PyRosetta and relax within an 8 Å radius to obtain training structures.

NMR Chemical Shift

We retrieve the BMRB (Hoch et al., 2023), totaling 17,028 entries as of 2025-07-02. The entries were filtered and processed based on NMR experiment type, backbone chemical shift coverage, sequence consistency, basic experimental-condition bounds, and routine re-referencing requirements. After filtering out entries without any nitrogen chemical shifts, 3,623 entries were retained and split into 2,532 training and 594 validation entries at a 50% pairwise sequence-identity threshold. AlphaFold2 was used to generate all structures used in training.

MD Simulation

For adenylate kinase (AdK), we use conformational ensembles generated using the Framework Rigidity Optimized Dynamics Algorithm (FRODA), yielding 200 trajectories (Seyler et al., 2015). For KaiB, we use the temperature-dependent fold-switching simulation from Zhang et al. (2024), subsampling every 10 frames out of the 4 successful fold-switching trajectories from the fold-switched state to ground state.

Rosetta Decoy

To assess the distribution of local residue-environment embeddings between native and decoy structures, we use the structure dataset from Park et al. (2016), in which each of the 133 native structures is accompanied by a large number (≥ 1000 cluster centers) of alternative conformations (decoys).

D. Metrics

Structure comparison

We report RMSD after optimal Kabsch rigid alignment for Cα, backbone, and all-atom coordinates. Given reference $X \in \mathbb{R}^{n \times 3}$ and prediction $\hat{X}$, align $\hat{X}$ to $X$ and then compute

$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{j=1}^{n} \left\|\hat{x}_j^{\mathrm{align}} - x_j\right\|_2^2}$.
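A NumPy sketch of the Kabsch alignment and RMSD:

import numpy as np

def kabsch_rmsd(X_ref, X_pred):
    """X_ref, X_pred: (n, 3). Optimally superposes X_pred onto X_ref and returns the RMSD."""
    mu_ref, mu_pred = X_ref.mean(axis=0), X_pred.mean(axis=0)
    A, B = X_ref - mu_ref, X_pred - mu_pred

    # Optimal rotation via SVD of the covariance matrix (Kabsch algorithm).
    U, _, Vt = np.linalg.svd(B.T @ A)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    B_aligned = B @ R.T
    return float(np.sqrt(np.mean(np.sum((B_aligned - A) ** 2, axis=-1))))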

Numeric regression

Given targets $\{y_i\}_{i=1}^{N}$ and predictions $\{\hat{y}_i\}_{i=1}^{N}$, we report

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}, \qquad r = \frac{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}}$.

Distribution comparison

We compute Fréchet Protein Distance (FPD) following Lu et al. (2025a). Given $N$ data points from a reference distribution $p_{\mathrm{data}}(x)$, here the sequence-augmented CATH dataset, and $M$ samples from a generative model $p_{\mathrm{sample}}(x)$, we compute per-residue SLAE embeddings $\{z_{\mathrm{data}}^{(i)}\}_{i=1}^{N}$ and $\{z_{\mathrm{sample}}^{(j)}\}_{j=1}^{M}$ and then compute

$\mathrm{FPD} = \left\|\mu_{\mathrm{data}} - \mu_{\mathrm{sample}}\right\|_2^2 + \mathrm{Tr}\!\left(\Sigma_{\mathrm{data}} + \Sigma_{\mathrm{sample}} - 2\left(\Sigma_{\mathrm{data}}\Sigma_{\mathrm{sample}}\right)^{\frac{1}{2}}\right)$ (10)

where $\mu_{\mathrm{data}}$ and $\mu_{\mathrm{sample}}$ are the means of the reference embeddings and the sample embeddings, respectively, and $\Sigma_{\mathrm{data}}$ and $\Sigma_{\mathrm{sample}}$ are the corresponding covariance matrices. We compute FPD using a smaller subset of 2,000 samples, as SHAPES showed that this is sufficient for an accurate FPD estimate (Lu et al., 2025a).
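A NumPy/SciPy sketch of Eq. (10) over two sets of per-residue embeddings:

import numpy as np
from scipy.linalg import sqrtm

def frechet_protein_distance(z_data, z_sample):
    """z_data: (N, d), z_sample: (M, d) per-residue SLAE embeddings."""
    mu_d, mu_s = z_data.mean(axis=0), z_sample.mean(axis=0)
    cov_d = np.cov(z_data, rowvar=False)
    cov_s = np.cov(z_sample, rowvar=False)

    covmean = sqrtm(cov_d @ cov_s)
    if np.iscomplexobj(covmean):                 # numerical noise can yield tiny imaginary parts
        covmean = covmean.real

    diff = mu_d - mu_s
    return float(diff @ diff + np.trace(cov_d + cov_s - 2.0 * covmean))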

E. Additional Experiments and Results

E.1. Pretraining

We report in Table 6 additional results on the pretraining performance of the SLAE autoencoder. We note that encoders with a 10 Å graph-radius cutoff are infeasible to train on a single GPU due to the number of edges.

Table 6:

Complete results of SLAE autoencoder ablation experiments.

Graph Radius(Å) Discretization Method Codebook Size Training Obj. Seq. Acc. (%) RMSD < 128 (Å) RMSD < 512 (Å)
8 LFQ 16384 all 69.5 4.12 5.79
8 LFQ 32768 all 75.2 2.50 3.74
8 VQ 16384 all 65.7 5.02 5.88
8 VQ 32768 all 70.4 4.30 6.02
8 kNN 4096 all 97.5 2.96 4.03
8 kNN 16384 all 98.6 1.71 2.57
8 kNN 32768 all 99.4 1.60 2.31
8 w/o. FAPE 97.2 3.89 5.22
8 w/o. Energy 98.0 3.26 5.17
4 all 99.8 2.57 3.86
6 all 99.9 1.24 2.55
8 all 99.9 1.12 1.92

E.2. Latent space characterization

E.2.1. kNN clustering

We examine the CATH kNN-quantized latent space, a k-means codebook of k = 16,384 centroids. We assign each centroid the majority amino-acid identity among its members; the commitment loss is the L2 distance from an embedding to its assigned centroid. The commitment loss histogram is tightly concentrated around 3–5 L2 units (Figure 5), which is modest relative to the embedding norm (15 ± 4), indicating that quantization preserves most of the geometric signal.

Figure 5:


Commitment loss distribution during post-hoc quantization

We observe clear residue type mixing in the clusters. Although many centroids are quite pure (median majority fraction 0.96), the distribution is broad (mean 0.89±0.15; entropy mean 0.52), with a substantial tail of mixed-composition clusters (10th-percentile majority 0.67). Along with the modest commitment error, this suggests that the observed mixing reflects genuinely overlapping local chemistries. Consistently, residue-conditioned intra-cluster distances show that some types form diffuse, mixed neighborhoods (A, G, S, C with ratios ≥ 1), while others are tighter and more type-specific (W, Y, R with ratios ≤ 1). These observations suggest that the kNN partitioning of residue embedding space yields chemically meaningful clusters but does not enforce one-residue exclusivity and captures real cross-type similarity in local environments.

E.2.2. Residue embedding visualization

We project the 16,384-entry codebook (centroid) embeddings into three dimensions using UMAP and analyze how local chemical environments are organized in this latent space (Figs. 6 and 7). Each CATH residue is assigned to its nearest codebook entry, and for every centroid we aggregate properties across its assigned residues. We compute the mean SASA and the majority secondary-structure label. This yields a coarse-grained landscape in which centroids arrange along solvent-exposure gradients and segregate by secondary-structure preferences.

Table 7:

Residue-wise clustering statistics: number of centroids that each residue type dominates, mean intra-cluster distance (± standard deviation), and ratio relative to the global mean.

Residue # Centroids Mean intra-cluster distance mean ± std Ratio to global distance
A 1601 19.95 ± 6.50 [1.20]
C 215 17.67 ± 5.72 [1.06]
D 962 13.43 ± 7.47 [0.81]
E 1076 11.58 ± 3.97 [0.70]
F 641 11.44 ± 3.53 [0.69]
G 1192 18.38 ± 6.21 [1.11]
H 387 11.56 ± 3.53 [0.70]
I 899 14.33 ± 4.60 [0.86]
K 947 11.48 ± 3.76 [0.69]
L 1565 14.26 ± 4.63 [0.86]
M 272 13.69 ± 4.33 [0.82]
N 729 13.28 ± 4.16 [0.80]
P 737 13.83 ± 4.49 [0.83]
Q 492 11.48 ± 3.89 [0.69]
R 720 9.95 ± 3.14 [0.60]
S 1032 17.31 ± 5.58 [1.04]
T 920 15.81 ± 4.97 [0.95]
V 1253 15.96 ± 5.13 [0.96]
W 202 8.86 ± 2.76 [0.53]
Y 542 10.49 ± 3.16 [0.63]
Figure 6:


3D UMAP projection of CATH residue embeddings colored by solvent accessibility and secondary structure

E.2.3. Structure ensemble analysis

Subsampled mdCATH

For each residue, we measure how much its embedding changes across the ensemble by averaging pairwise differences between frames. For a given residue and set of frames, we compute two physical descriptors: (i) contact-map change: we form a binary contact row per frame (contact if residues are within a chosen distance threshold) and measure, on average, what fraction of those contacts differ between frames; (ii) solvent-exposure change: we compute solvent-accessible surface area (SASA), convert it to residue-type-normalized relative SASA, and take the average absolute change between frames. We then fit a simple linear model that predicts per-residue embedding change from the two descriptors. We aggregate performance on held-out residues and report (i) the proportion of variance explained and (ii) the Spearman rank correlation between observed and predicted embedding change.

Rosetta Decoys

For each native protein we have a residue-embedding matrix and a set of its decoy matrices, aligned by residue index. We apply row-wise L2 normalization so that inner products equal cosine similarity. For a given protein, we compute the mean residue-wise cosine similarity between each decoy and its native, then take the average over decoys. The native-decoy cosine margin is defined as the difference between the native’s self-similarity (equal to 1.0 after normalization) and this mean decoy similarity.

Figure 7:


3D UMAP projection of CATH residue embeddings colored by amino acid type

To test linear separability at the residue level and generalization to unseen proteins, we train a logistic-regression classifier on residue embeddings with leave-protein-out grouped cross-validation: each residue embedding is a sample (label 0=native, 1=decoy) and carries its protein ID for grouped CV. We split with GroupKFold so all residues from a held-out protein appear only in the test set, and train an L2-regularized LogisticRegression. On each test fold we report AUROC; metrics are aggregated as mean ± sd across folds.
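A scikit-learn sketch of this grouped cross-validation; the embedding matrix, labels, and protein IDs are assumed precomputed, and the number of splits is a placeholder:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def grouped_native_decoy_auroc(X, y, protein_ids, n_splits=5):
    """X: (n_residues, 128) embeddings; y: 0 = native, 1 = decoy; protein_ids: group labels."""
    aurocs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=protein_ids):
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))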

E.3. Per-residue generative model assessment

We compare distribution coverage of all-atom chemical environments sampled by generative models, stratified by residue type. For each residue type, we extracted the SLAE embeddings of 2,000 random examples from the sequence-augmented CATH dataset and from a collection of 20,000 unconditional samples of all-atom protein structures from La-Proteina, Protpardelle-1c, and Chroma.

E.4. Latent space interpolation

In Figure 9 A and B we show 20 out of 50 interpolated structures for AdK and KaiB. In addition, we compare linearly interpolated AdK structures from the SLAE latent space to those from the all-atom generative model Protpardelle-1c (Figure 10) and show that SLAE interpolation is better matched to simulated intermediate structures.

Figure 8:


SLAE embeddings to assess residue environment coverage. PCA of SLAE per-residue embeddings of de novo structure samples (light blue) compared to the reference CATH distribution (purple) stratified by amino acid type given in the title. The two modes in each amino acid type correspond to residues belonging to a beta sheet or alpha helix.

Figure 9:


Structures decoded from SLAE latent interpolation. A. AdK B. KaiB C. Step 23 KaiB intermediate structure with under-characterized C-terminus showing disordered backbone collapsing onto itself.

Figure 10:


Comparison of SLAE and generative model (Protpardelle-1c) latent interpolation. A. Three representative steps from interpolation fraction 0.3 to 0.7. Top: Protpardelle-1c linear interpolation (blue) and best MD frame matches (grey). Bottom: SLAE linear interpolation (purple) and best MD frame matches (grey). B. RMSD of interpolation trajectories to their closest-match MD frames.

Footnotes

1. Fair comparison with open-sourced methods is not possible due to non-overlapping dataset splits (some entries from the PLM-CS datasets do not pass the filter standard). We therefore re-trained a PLM-CS baseline on our splits and evaluate all embeddings under an identical protocol.

Contributor Information

Yilin Chen, Stanford University, Department of Bioengineering.

Cizhang Zhao, University of Wisconsin–Madison, Department of Biochemistry.

Po-Ssu Huang, Stanford University, Department of Bioengineering.

Tianyu Lu, Stanford University, Department of Bioengineering.

Hannah K. Wayment-Steele, University of Wisconsin–Madison, Department of Biochemistry.
