Generating functional and multistate proteins with a multimodal diffusion transformer

Bowen Jing; Anna Sappington; Mihir Bafna; Ravi Shah; Adrina Tang; Rohith Krishna; Adam Klivans; Daniel J Diaz; Bonnie Berger

doi:10.1101/2025.09.03.672144

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Sep 4:2025.09.03.672144. [Version 1] doi: 10.1101/2025.09.03.672144

Generating functional and multistate proteins with a multimodal diffusion transformer

Bowen Jing ^1,^*, Anna Sappington ^1,^2,^*, Mihir Bafna ¹, Ravi Shah ³, Adrina Tang ¹, Rohith Krishna ⁴, Adam Klivans ³, Daniel J Diaz ³, Bonnie Berger ¹

PMCID: PMC12424740 PMID: 40950151

Abstract

Generating proteins with the full diversity and complexity of functions found in nature is a grand challenge in protein design. Here, we present ProDiT, a multimodal diffusion model that unifies sequence and structure modeling paradigms to enable the design of functional proteins at scale. Trained on sequences, 3D structures, and annotations for 214M proteins across the evolutionary landscape, ProDiT generates diverse, novel proteins that preserve known active and binding site motifs and can be successfully conditioned on a wide range of molecular functions, spanning 465 Gene Ontology terms. We introduce a diffusion sampling protocol to design proteins with multiple functional states, and demonstrate this protocol by scaffolding enzymatic active sites from carbonic anhydrase and lysozyme to be allosterically deactivated by a calcium effector. Our results showcase ProDiT’s unique capacity to satisfy design specifications inaccessible to existing generative models, thereby expanding the protein design toolkit.

1. Introduction

Generative models trained on datasets of protein sequences or backbone structures have enabled significant advances in de novo protein design [11, 51]. Recently, structure diffusion models [49, 22, 26, 2] have been developed to generate protein binders [49, 26] or scaffold functional motifs [49, 2] while fine-tuned protein language models [30, 14, 4, 31, 18, 8] have been prompted to generate functional sequences with low homology to existing ones [30, 36, 18]. However, these existing paradigms focus only on structure or sequence in isolation and are therefore fundamentally incapable of capturing the complete functional landscape of natural proteins. Structure generative models require hand-crafted specification of function in terms of a binding partner or functional motif [49], depend on slow and specialized architectures adopted from protein structure prediction [6], and are limited to the relatively scarce structural training data in the Protein Data Bank (PDB). On the other hand, sequence generative models cannot natively reason about sequence-structure-function relationships, have shown lower generation quality compared to structure generative models [4], and have relied on fine-tuning [30] or complex prompting and filtering [18] to generate functional proteins. Neither family of approaches can target protein dynamics, a key feature of natural proteins, which often adopt multiple states to modulate their functions (e.g., motor proteins, allosterically regulated enzymes, signaling proteins). We hypothesize that a generative model, efficiently trained at scale with a broad understanding of protein functional space via both sequence and structure, would unlock many crucial protein design capabilities.

Here, we present ProDiT (Protein Diffusion Transformer), a novel framework that unifies structural and sequence generative modeling into a multimodal diffusion generative process, to enable these design capabilities. ProDiT leverages direct diffusion-based modeling of 3D structural coordinates alongside amino acid tokens using a fast, scalable transformer architecture, eschewing the specialized architectures typically used for modeling protein structure [6]. Our model design allows scaling multimodal training across large protein datasets, expanding beyond the PDB to the 214M available structures in AlphaFoldDB [44] and utilizing the full set of functional annotations in UniProtKB. Compared to previous attempts at multimodal protein generative modeling such as ProteinGenerator [29] and ESM3 [18], ProDiT models sequence and structure with diffusion processes in their respective state spaces, instead of tokenizing structure into a discrete vocabulary (ESM3) or forcing sequence generation into a continuous diffusion framework (ProteinGenerator). These prior approaches exhibit significantly degraded generation quality compared to models that specialize in generating structure, such as RFDiffusion [49]. In contrast, our model matches or often exceeds the unconditional structure generation quality of RFDiffusion across all sequence lengths, while also significantly exceeding the sequence generation quality of dedicated protein language models (PLMs) like EvoDiff [4].

We explore ProDiT’s broad understanding of the protein sequence and structural landscape by conditioning generation on functional descriptors in the form of molecular function Gene Ontology (GO) terms. Although prior works have included similar annotation inputs in their training [30, 18], they do not report detailed evaluations of function-conditioned generation. We conduct comprehensive benchmarking of ProDiT across 915 diverse GO terms representing highly specific descriptions of protein function and find generations predicted as functional for 463 of these terms, including terms represented in less than 0.01% of training examples in UniProtKB. Many of these generations have only moderate structural similarity to known functional proteins (TM-score < 0.7), showing an ability to generalize across the structural landscape. We also build upon techniques from image generative modeling for prompt adherence to further boost GO term adherence, a family of techniques not applicable to tokenized models such as ESM3. Structural alignments of our generations against known functional proteins reveal atomic-level recovery of active site residues, despite low global sequence similarity. Our designs additionally recover binding residues to enzymatic cofactors, e.g., NAD(P)+ for aldehyde dehydrogenase and CoA for malonyltransferase, demonstrating a multifaceted understanding of protein function.

Leveraging ProDiT’s multimodal capabilities, we also develop a principled protocol for the design of dynamic, multistate proteins, derived from a novel interpretation of the sequence-structure-function paradigm based on probabilistic graphical models. When distinct motifs are provided for the desired output protein states, our protocol generates two separate structural scaffolds coupled to a common sequence in a single generative rollout. As such, our approach represents a significant advance over prior protocols requiring brute-force sampling of plausible alternative conformations, previously thought necessary for such a protein design task [35, 17]. We demonstrate our approach by scaffolding active sites from carbonic anhydrase and lysozyme to be allosterically modulated by calcium binding, an important step towards the design of custom and controllable enzymes.

2. Results

2.1. Multimodal Generation of Sequence and Structure

We sought to develop a general framework that combines methods from protein language models and structure diffusion models to enable multimodal generation. To do so, ProDiT iteratively reverses two diffusion processes: one consisting of Gaussian noise added to the Cα coordinates of centered protein structures and the other consisting of random masking of the amino acid sequence (Fig. 1A; Methods). At training time, the noise levels for sequence and structure are sampled independently; as a result, the model is able to denoise these modalities asynchronously, allowing for multiple generation workflows: sequence generation, structure generation (optionally followed by inverse folding), or sequence and structure co-generation. Additionally, ProDiT can accept as input structural or functional constraints in the form of motifs or molecular function Gene Ontology (GO) terms, respectively, for conditional generation. The model is trained and sampled with current best practices from the discrete and continuous diffusion literature (Methods).

Figure 1: — **(A)** ProDiT is trained to denoise a joint diffusion process over sequence and structure. The two inputs may be at different noise levels, allowing for flexible generation workflows. Optionally, ProDiT can also condition on a GO term or structural motif. **(B)** Comparison of ProDiT with other models by number of successful generations (light color) and unique FoldSeek clusters (dark color) from 500 samples (100 each of length 100, 200, 300, 400, 500). **(C)** Structure generation quality across protein lengths, measured in terms of self-consistency TM-score (scTM) between the generated structure and ProteinMPNN-designed sequence. ProDiT maintains structure quality better than other methods at longer sequence lengths. **(D)** Sequence generation quality across protein lengths, measured in terms of ESMFold predicted Local Distance Difference Test (pLDDT). ProDiT obtains much higher sequence quality than other methods at all lengths. **(E)** Selected samples from ProDiT of length 500 under each generation protocol, labeled with the corresponding pLDDT or scTM metrics. For structure- and co-generation, the ESMFold refolded structure is superimposed in grey. RFDiff: RFDiffusion; ProtGen: ProteinGenerator.

We first assess the performance of ProDiT in unconditional generation of sequence and structure according to standard metrics (Fig. 1B–D). For each method and modality, we draw 100 samples each of lengths 100, 200, 300, 400, and 500 amino acids and compute the total number of successful generations. For generated sequences, success is defined by predicted Local Distance Difference Test (pLDDT) > 70 after folding with ESMFold [28]. For generated structures, we produce a corresponding sequence with ProteinMPNN and refold with ESMFold; success is defined by self-consistency RMSD (scRMSD) < 2Å. We assess diversity by clustering generated structures using FoldSeek [43] and report the number of unique clusters. For all generation modalities, we visualize proteins of length 500 amino acids sampled from ProDiT and observe diverse secondary structure elements, varied topologies, and globular, symmetric, and compact geometries (Fig. 1E).

ProDiT generates significantly more successful sequences than ESM3 and EvoDiff across all sequence lengths, surpasses the structure generation success rate of ESM3 [18], and matches the performance of RFDiffusion, a well-known structure generative model [49]. The difference in structure generation quality is most substantial at longer sequence lengths (> 300 amino acids) (Fig. 1C), where ESM3 protein generations become highly unstructured and RFDiffusion quality deteriorates more rapidly than ProDiT, potentially due to the larger training crops (512 vs 256) enabled by our simpler architecture. We observe similar trends when computing generation quality in terms of mean pLDDT or self-consistency TM-score (scTM) for sequence and structure, respectively (Fig. S1,S2). Strikingly, ProDiT accomplishes all this with a 5x smaller model than ESM3 (321M vs 1.7B parameters) and without using any equivariant modules or 2D tracks in its architecture.

For co-generation, ProDiT produces self-consistent sequence and structure pairs when simultaneously denoising over both modalities (Fig. 1B, right), using a success cutoff of scTM > 0.9 after refolding with ESMFold. A previously reported approach for co-generation, ProteinGenerator [29], adopts the architecture of RFDiffusion while embedding protein sequence into continuous space in order to apply continuous diffusion. However, ProteinGenerator is largely unable to generate self-consistent designs at lengths > 200 amino acids, whereas ProDiT’s multimodal diffusion can successfully generate self-consistent designs at 500 amino acids (Fig. S3). To further investigate ProDiT’s ability to generate self-consistent proteins, we used ProDiT to generate a backbone and then to inverse fold a corresponding sequence, demonstrating its ability to reason over both modalities successively (Fig. S5). This capability is crucial for function-conditioned generation, as functional properties are determined by both primary sequence and backbone structure. In contrast, inverse folding models that predict sequences solely based on backbone structures (like ProteinMPNN) lack an intrinsic understanding of whether those sequences will possess a desired function.

2.2. Steering Design with Molecular Function

We hypothesized that a model with broad understanding of sequence and structure across the evolutionary landscape could be steered towards generating diverse proteins with specific functions. To demonstrate this capability, we assessed the responsiveness of ProDiT to functional conditioning with molecular function Gene Ontology (GO) terms. To systematically screen functional designs in silico, we employ the DeepFRI function prediction network [16] to predict a probability for each GO term given an input structure. In total, 915 unique molecular function GO terms are present in both the input vocabulary of ProDiT and the output vocabulary of DeepFRI (Tab. S1); each of these terms is associated with a median of 0.08% of proteins in UniProtKB (0.03%–0.26% IQR), providing a highly specific description of protein function. For each of these GO terms, we generate 100 samples of length 300 amino acids from ProDiT (first generating the structure, then the sequence) and re-fold the sequence with ESMFold (Methods). A generation is counted as successful if the corresponding GO term is assigned by DeepFRI with > 50% confidence for the refolded structure.

2.2.1. Systematic Assessment of Function Conditioning

In aggregate, 463 GO terms have at least one successful ProDiT design, with success rates being higher for the most prevalent GO terms and lower for less prevalent ones (Fig. 2A,B), suggesting ProDiT function-conditioned generation is data limited. We assess the diversity of the generations by computing the TM-scores between successful designs with the same GO term and novelty via the TM-score to the most similar structure in AlphaFoldDB with the same GO term (Fig. 2C). We observe slightly improved diversity for more common GO terms versus less common terms (0.60 vs 0.66 TM-score, respectively), suggesting more robust generalization for frequently observed terms, although the novelty is similar (0.81 vs 0.78 TM-score, respectively).

Figure 2: — **(A)** For each of 915 molecular function GO terms sorted by occurrence in UniProtKB, we count the number of successful generations (out of 100), where success is defined as DeepFRI probability > 50% for the target term. Selected terms are labeled with the number of successes. **(B)** Proportion of GO terms with different numbers of successful designs, stratified by frequency of GO term occurrence. In total, 465 terms have at least one success. **(C)** Distribution of TM-score diversity (between successful designs with the same GO term conditioning) and novelty (between designs and the most similar AlphaFoldDB entry with the same GO term) for common and rare GO terms, using a frequency cutoff of 500k. Novelty is computed only for a subset of 169 GO terms. **(D)** Impact of classifier-free guidance (CFG) for two selected GO terms with moderate success rate. For each guidance strength setting, 100 samples are generated and the success rate, mean self-consistency TM-score (scTM), diversity, and novelty are computed. Main evaluations use a guidance strength of 3. **(E)** For selected GO terms, the most novel generation is identified and superimposed onto its closest match with the same GO term from AFDB (grey).

To improve responsiveness to functional class conditioning, we explored classifier free guidance (CFG), a technique commonly used to enhance class adherence in image generative models [19]. We implement CFG with ProDiT and evaluate its ability to improve functional class conditional generation. The amount of guidance is controlled by the guidance strength parameter with unity corresponding to out-of-the-box conditional generation. We find that for GO terms with low success rates, applying CFG with guidance strengths greater than one can improve the success rate and the sequence-structure self consistency (scTM), though with some tradeoff in diversity and novelty (Fig. 2D). We expect that other established techniques from the image diffusion literature can further improve generation for less prevalent GO terms.

Prior work in function-conditioned protein generation has often cited high structural similarity and low sequence identity to known functional proteins as a sign of generalization [30, 36]; however, this does not assess generalization across structures and instead reveals memorization of structures via co-evolutionary constraints. Although the median successful ProDiT generation also has a high TM-score with known functional proteins, identifying the most novel design for each GO term reveals an average TM-score of 0.64 (across 169 assessed terms) to the closest neighbor in AlphaFoldDB with the same term, preserving the overall fold class but with significant local structural changes (Tab. S1). We identify and visualize these proteins for several terms (Fig. 2E) and observe alignment of substructures, indicating a shared functional site or motif but with deviation in the scaffolds. The structures are nonetheless confidently predicted by DeepFRI to possess the target GO terms, suggesting that ProDiT has learned to generate key structural elements and/or sequence motifs that underlie the functional activity without memorizing all aspects of the global structure and corresponding protein family.

2.2.2. Recovery of Active Sites at Atomic Fidelity

To further verify the fidelity of successful function-conditioned designs, we investigated whether ProDiT generations recover known catalytic residues and binding motifs for molecular functions with available annotations. To do so, we developed a structural alignment pipeline to screen all GO-conditioned generated structures for the presence of labeled active site residues from UniProtKB entries with the corresponding GO term (Methods). Briefly, we performed a global structural alignment between successful designs and AlphaFoldDB structures with ≥2 labeled active site residues, and filtered for designs with 100% residue identity and <1Å RMSD for those residues. Although this pipeline is limited by the under-annotation of native active site residues in UniProtKB (across GO terms, the median percentage of structures that are sufficiently annotated is 0.23%), we found successful hits for 45 GO terms (Fig. S10, Tab. S2). In Fig. 3, we highlight two exemplar case studies that demonstrate ProDiT’s high-fidelity generation of enzyme active sites. We superimpose the generated structures against relevant PDB structures (aligned on the active site) to highlight the positioning and orientation of key residues relative to co-crystallized cofactors and substrates.

Figure 3: — **(A)** We superimpose a protein generated from ProDiT conditioned on aldehyde dehydrogenase [NAD(P)+] activity (GO:0004030) with PDB 1BPW, a betaine aldehyde dehydrogenase (grey), and its co-crystallized NAD+ cofactor (cyan). **(B)** View of the 10Å neighborhood of the active site in the reference protein, 1BPW. The active site residues (E263, C297) as well as key residues in the NAD+ binding motif (W165, N166, P168, K189) are highlighted. **(C)** View of the aligned 10Å neighborhood of the active site in the generated design. The active site and NAD+ binding site residues are preserved and correctly oriented. **(D)** Sequence identity versus distance from active site for 1BPW and the generated design. **(E)** We similarly superimpose a protein generation conditioned on malonyltransferase activity (GO:0016420) with PDB 2G2Z (grey), an *E. coli* malonyltransferase. The co-crystallized malonic semialdehyde and CoA ligands are also shown (cyan). **(F)** View of the 10Å neighborhood of the active site in the reference protein, 2G2Z. The active site residues (S92, H201) as well as key residues in the malonyl-CoA binding motif (R117, R190, N162, Q11, Q166, Q63, H201, H91, L194, M121, M132, S92, V168, V196, V280) are highlighted. **(G)** View of the aligned 10Å neighborhood of the active site in the generated design. The active site and malonyl-CoA binding site residues are preserved and correctly oriented. **(H)** Sequence identity versus distance from active site for 2G2Z and the generated design.

We first present a generated protein conditioned on GO:0004030, which corresponds to aldehyde dehydrogenase [NAD(P)+] activity. Aldehyde dehydrogenases catalyze the [NAD(P)+]-dependent oxidation of aldehydes to carboxylic acids and play key roles in detoxification, biosynthetic processes, antioxidant defense, and cellular regulation [3]. The generated design is aligned with the PDB structure of 1BPW—a betaine aldehyde dehydrogenase with NAD+ co-crystallized in the active site (Fig. 3A). The active site residues are generated with correct orientation, achieving an all-atom RMSD of 0.41Å (Fig. 3B,C). Furthermore, the generated protein also conserves the unique NAD+ binding motif that distinguishes aldehyde dehydrogenases from closely related alcohol dehydrogenases (e.g. Trp165, Asn166, Pro168, and Lys189 in the Rossmann fold) [24]. Strikingly, the sequence identity between the generated protein and 1BPW decreases with distance from the active site (Fig. 3D), indicating low overall homology (45% sequence identity).

We next showcase a generated protein conditioned on GO:0016420, which corresponds to malonyltransferase activity. These enzymes play a central role in the fatty acid biosynthesis pathway by catalyzing the transfer of a malonyl group from malonyl-CoA to the acyl carrier protein. We aligned the generated design to the holo-crystal structure of a malonyltransferase from E. coli (PDB ID: 2G2Z), which includes malonic semialdehyde (a substrate analogue) and CoA co-crystallized in the active site (Fig. 3E). The design not only recapitulates the conserved catalytic residues (Ser92 and His201; 0.23Å RMSD) positioned to coordinate the substrate, but also preserves the topology of the malonyl-CoA binding pocket (Fig. 3F,G). Notably, residues critical for recognition of the malonyl carboxylate (Arg117), formation of the oxyanion hole (Gln11) [32] and binding of the CoA cofactor are all conserved. Nevertheless, the generated protein has only 41% sequence identity with the reference, with identical residues concentrated near the active site (Fig. 3H).

These case studies highlight ProDiT’s ability to co-generate protein sequences and structures that preserve key catalytic residues and cofactor binding sites for diverse predicted enzymatic functions. We expect that for native functions not covered by existing GO term annotations, the overall algorithm of labeling homologous sequences and tuning the model on this additional “GO-term” label should also lead to the co-generation of novel sequences and structures that preserve essential functional residues.

2.3. Design of Multistate Allostery

2.3.1. Coupled Structure Diffusion

A significant challenge in protein design is the specification of complex, multistate functions, including allosteric regulation of protein activity [50, 29, 35, 17]. Previous efforts to design allosteric systems involve fusing domains together [35] or first exhaustively sampling alternative states using physics-based methods [17] and then searching for sequences that satisfy both states (i.e. both conformers). We hypothesized that a natively multimodal method that jointly samples sequence and structure could be well-suited for the direct generation of multistate proteins—protein sequences that can adopt multiple stable conformations. However, the scarce data for protein structural ensembles, and thus limited examples available for training, would appear to preclude any attempt to directly condition the model on properties of the target ensemble. We circumvent this hurdle by formulating coupled structure diffusion, a principled technique for multistate design that requires models trained only on single state sequence-structure pairs. This technique, derived from a probabilistic graphical model interpretation of the sequence-structure-function paradigm (Figure 4A), involves two structure denoising trajectories that are constrained to a single sequence denoising trajectory. We iteratively denoise the two structures using model predictions conditioned on two distinct functional motifs, while their shared sequence trajectory is constrained by the sequence of both motifs (Fig. 4B). Our protocol can be used with ProDiT out-of-the-box and requires no additional training on structural ensembles.

Figure 4: — **(A)** The sequence-structure-function paradigm can be viewed as a directed graphical model in which structures are drawn from conditional densities P(struct | seq) and functions (here represented as motifs) are similarly drawn from structure. Multistate proteins correspond to sampling multiple conformers from the same sequence. Hence, designing a multistate protein amounts to conditioning on the two terminal variables (motifs) and sampling the parent variables (one sequence and two structures). This conditional probability can be written as the product of conditional probabilities that only involve at most a single structure. **(B)** The product density is heuristically sampled by coupling two structure and one sequence denoising trajectories, with two evaluations of the denoising model conditioned on the two different motifs. These evaluations correspond to the two product terms in the conditional probability. A third model evaluation corresponding to P(seq) is not shown.

2.3.2. Scaffolding Enzymatic Motifs

We explored the application of coupled structure diffusion for the scaffolding of active site motifs extracted from two enzymes, lysozyme and carbonic anhydrase, to be allosterically regulated by a calcium effector (Fig. 5). Both active sites are formed by contacts between noncontiguous segments in sequence space, which we reasoned could be disrupted by introducing an allosteric motif that induces a conformational change upon calcium binding. For each enzyme, we iteratively denoise a structure with the active site motif (taken from PDB structures 1DPX and 6LUX, respectively) and a second structure with a 12-residue EF-hand calcium binding motif from calmodulin (1PRW) randomly inserted in the sequence. Both of these structure denoising trajectories share the same sequence trajectory. To evaluate whether we obtain an active and inactive conformation, we screened the designed sequences in silico by predicting the protein structure with Chai-1 [42] both with and without a calcium ion. When co-folding with calcium, a substantial fraction of generated structures shifted away (in terms of RMSD) from the active site motif and towards the EF-hand motif (Fig. 5A,B,I,J). We then identified sequence designs with low active site RMSD in the unbound state, low EF-hand RMSD in the bound state, and self-consistent Chai-1 predictions for both states. This subset allowed us to interpret and analyze the designed allosteric with greater confidence. Two examples of such designs, one for lysozyme and one for carbonic anhydrase, are shown in Fig. 5

Figure 5: — **(A)** For 320 designed scaffolds of the lysozyme motif, we refold the scaffold in both states with Chai-1 (inducing the bound state by co-folding with a calcium ion) and plot the resulting Cα RMSDs for the four catalytic residues. The selected design is circled. **(B)** Corresponding Cα RMSDs for the EF hand calcium binding motif. **(C)** Superimposed Chai-1 structure predictions (five total) of the unbound state. Note that the EF hand motif is not confidently predicted and adopts two conformations in the ensemble of predictions. **(D)** Superimposed Chai-1 structure predictions (five total) of the bound state. **(E,F)** The surface mesh from the native enzyme is superimposed on the native and designed motifs, highlighting the catalytic residues Glu35 and Asp52. Note the large displacement of pocket-forming residue Trp108 (full motif shown in Fig. S11,S12). We report the mean and standard deviation of the displacement of the CH2 carbon. **(G,H)** The corresponding structures of the EF-hand calcium binding sites in the unbound and bound states, with the mean and standard deviation of the Cα RMSD calculated. **(I–P)** Similar analyses for the designed carbonic anhydrase active site scaffolds. **(M,N)** Active state geometry in the bound and unbound states, showing the carbon dioxide substrate, zinc cofactor, and nucleophilic hydroxide. Note the change in orientation of the catalytic residue Thr199 (full motif shown in Fig S13,S14). We report the mean and standard deviation of the displacement of the hydroxyl oxygen. Native motif colored in green in all subfigures.

Lysozyme motif

Lysozymes are enzymes that hydrolyze glycosidic bonds in peptidoglycans (EC 3.2.1.17) and are notable for being the first enzymes with a resolved structure [9]. The catalytic mechanism involves the formation of a covalent intermediate with Asp52 and uses Glu35 as a general acid to facilitate hydrolysis of a glycosidic bond [46]. This mechanism is representative of canonical glycosidase hydrolase chemistry, with the substrate specificity dictated by the surrounding microenvironment [46, 39]. For lysozymes, the Trp108 residue has been experimentally demonstrated to be critical for substrate recognition [23]. Furthermore, it is well established that carbohydrate C–H bonds interact preferentially with aromatic residues with tryptophan having a 9-fold enrichment in carbohydrate-binding modules (CBMs) [21]. In our selected lysozyme scaffold (Fig. 5A–H), Trp108 is scaffolded at its native position in the unbound state but is displaced by ~15.3 Å in the calcium-bound state due to the rearrangement of several loops (Fig. 5E,F). This structural rearrangement indicates significant allosteric disruption of the substrate binding cleft, likely preventing the recognition and hydrolysis of the peptidoglycan substrate.

Carbonic anhydrase motif

Carbonic anhydrases catalyze the interconversion of carbon dioxide and bicarbonate (EC 4.2.1.1). This function is essential for acid-base homeostasis and is widely studied for its catalytic efficiency and potential utility in carbon capture technologies [10]. The catalytic mechanism involves a critically positioned threonine residue (Thr199) which accepts a hydrogen bond from a zinc-coordinated hydroxide ion positioned for nucleophilic attack of the substrate carbon dioxide [41, 47]. In our selected carbonic anhydrase scaffold (Fig. 5I–P), the catalytic residues are scaffolded with subatomic-level accuracy in the unbound state (0.15 Å RMSD), but calcium binding shifts the threonine hydroxyl group by ~1.04 Å in the direction of the activated hydroxide ion. This would competitively displace the Zn²⁺ coordination site occupied by the activated hydroxide nucleophile, preventing the hydration of carbon dioxide (Fig. 5M,N).

We qualify that the successful scaffolding of residues selected for the active site motif does not necessarily guarantee catalytic activity. In particular, the sensitivity of catalysis to errors in active site motif placement is unknown, and potentially important residues in the second contact shell or distal from the active site were not included in the motif-scaffolding conditioning. As such, future experimental studies will be required to verify the enzymatic activity and allosteric regulation of ProDiT’s designs. Nevertheless, our results are the first demonstration of highly controlled protein generation for allosteric disruption of enzymatic active sites, laying the groundwork for future methodological developments and fine-tuning on experimental data.

3. Discussion

Here, we have presented ProDiT, a multimodal protein generative model that learns a function-aware joint distribution of sequence and structure. In silico metrics of unconditional generation quality indicate that ProDiT matches the performance of specialized, more expensive structural models and substantially outperforms multimodal approaches that discretize structure (ESM3 [18]) or map sequences to a continuous representation (ProteinGenerator [29]). Our comprehensive benchmarking of 915 GO terms provides, for the first time, predicted functional generations across a wide range of molecular functions, often demonstrating highly accurate recovery of known active site residues in structural alignments. Enabled by ProDiT’s multimodal capabilities, we have developed a novel protocol for multistate generation, marking an important step towards the design of custom and controllable enzymes.

As researchers work to assemble a more complete picture of protein design with diverse biological functions, we anticipate ProDiT will serve as a blueprint for further developments in multimodal and multistate protein generative models. In function-conditioned design, we foresee ProDiT generations serving as new starting points for directed evolution, allowing for increased exploration of protein sequence and structure spaces. A natural extension for multistate design would be to incorporate further conditioning information relevant for protein function, such as intramolecular interactions [49] and atomic motifs [2] relevant in de novo enzyme design. Coupled with our unique multistate generation capabilities, such extensions could lead to particularly exciting applications in multistep enzymes, molecular motors, or signaling proteins. Ultimately, ProDiT highlights the promise of multimodal generative models to unify and advance the protein design toolkit towards better emulating the complexity of natural proteins.

A. Methods

A.1. Multimodal Diffusion

ProDiT models protein sequence as strings in the discrete amino acid vocabulary $s \in [1,20]^{L}$ and $C α$ protein structure as an element of Euclidean space $x \in R^{3 L}$ . The model learns to generate the data distribution by reversing noising processes defined directly on these spaces (i.e., without tokenization or embedding); as such, its training and inference formulations invoke both continuous and discrete diffusion, described below. Our training loss is a weighted combination of the structure and sequence losses, $ℒ = ℒ_{struct} + 3 ℒ_{seq}$ .

A.1.1. Continuous diffusion.

In diffusion modeling of continuous data $x \in R^{d}$ , the data distribution $p_{data}$ is corrupted under a forward diffusion process $d x = f (x, t) d t + g (t) d w$ where w is Brownian motion. This generates a time-evolving density $p_{t} (x)$ , $t \in [0, T]$ with initial conditions $p_{0} = p_{data}$ and $p_{T}$ close to Gaussian. The diffusion process can be simulated in reverse via $d x = [- f (x, t) + g^{2} (t) \nabla \log p_{t} (x)] d t + g (t) d \bar{w}$ with use of the so-called score $\nabla \log p_{t} (x)$ [20, 40]. We approximate $\nabla \log p_{t} (x)$ with a neural network $s_{θ} (x)$ by minimizing the MSE denoising score matching objective $E_{x_{0}, x_{t}} [‖ s_{θ} (x, t) - \nabla_{x_{t}} \log p_{t ∣ 0} (x_{t} ∣ x_{0}) ‖]$ for all t. To apply diffusion modeling to protein structure, we take $n = 3 L$ and x to be the zero-mean coordinates of the $C α$ atoms. We adapt, with minor notational modifications, the parameterization of Karras et al [25]: $f (x, t) = 0$ , $g (t) = \sqrt{2 t}$ such that $p_{t ∣ 0} (x_{t} ∣ x_{0}) = 𝒩 (x_{t}; μ = x_{0}, Σ = t^{2} I_{n}$ ). We then define time rescaling

t^{'} = \frac{t^{1 / p} - t_{m i n}^{1 / p}}{t_{m a x}^{1 / p} - t_{m i n}^{1 / p}}

(1)

with $t_{min} = 0.05 Å$ , $t_{max} = 160 Å$ , $p = 7$ and train for times $t^{'} \in [0, 1]$ . That is, the neural network is trained to remove Gaussian noise ranging from $σ = 0.05 Å$ to $σ = 160 Å$ . The neural network score model is parameterized with post-conditioning, i.e.,

s_{θ} (x, t) = \frac{1}{t^{2}} [\frac{t \cdot t_{data}}{\sqrt{t_{data}^{2} + t^{2}}} f_{θ} (x, t) - \frac{t^{2}}{t_{data}^{2} + t^{2}} x]

(2)

where $t_{data} = 15 Å$ and $f_{θ} (x, t)$ is the neural network. We observe that in the limit of $t \to 0$ we have $s_{θ} (x, t) \approx f_{θ} (x, t) / t$ and in the limit of $t \to \infty$ we have $s_{θ} (x, t) \approx (t_{data} f_{θ} (x, t) - x) / t^{2}$ . Hence, the network interpolates between predicting the score and predicting the clean data.

Our final structure training loss is then

ℒ_{struct} = E_{t^{'} ~ U [0, 1], x_{0}, x_{t}} [\frac{t^{2} (t^{2} + t_{data}^{2})}{t_{data}^{2}} {‖s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log p_{t ∣ 0} (x_{t} ∣ x_{0})‖}^{2}]

(3)

At inference time, we sample $x_{t_{max}} ~ 𝒩 (0, t_{max}^{2} I)$ and integrate from $t^{'} = 1$ to $t^{'} = 0$ with the Euler-Maruyama scheme with uniform steps in $t^{'}$ . We apply low temperature annealing and modify the reverse diffusion as

d x = [\frac{g^{2} (t)}{2} (1 + β (t)) s_{θ} (x, t)] d t + g (t) β (t) γ (t) d \bar{w}

(4)

where $γ (t) = 0.5$ and $β (t) = 1 / {(1 + t / t_{data})}^{ν}$ . Here $β (t)$ controls the stochasticity via mixing level of Langevin dynamics and $γ (t)$ controls the temperature of the dynamics. We use $ν = 0.5$ for unconditional generation and $ν = 1$ for conditional generation. As in prior work, we find these low temperature dynamics essential for generating high quality structures [22,15].

A.1.2. Discrete diffusion.

In diffusion modeling of discrete data [5], we represent an element of the vocabulary with its one-hot vector and transport the data distribution $p_{0} = p_{data}$ towards a prior with probability vector $π \in Δ^{K}$ , where $Δ^{K}$ is the K-simplex. In particular, we define a corruption schedule $α_{t}$ , $t \in [0,1]$ such that $α_{0} = 1$ , $α_{1} = 0$ monotonically decreasing and a noising process $p_{t ∣ s} (x_{t} ∣ x_{s}) = C a t (x_{t}; (α_{t} / α_{s}) x_{s} + (1 - α_{t} / α_{s}) π)$ for $t > s$ . This generates noisy marginals $p_{t} (x_{t}) = \sum_{x_{0}} C a t (x_{t}; α_{t} x_{0} + (1 - α_{t}) π) p (x_{0})$ . The noising process can be reversed by iteratively sampling

p_{s ∣ t} (x_{s} ∣ x_{t}) = \sum_{x_{0}} p (x_{s} ∣ x_{t}, x_{0}) p (x_{0} ∣ x_{t})

(5)

where $p (x_{s} ∣ x_{t}, x_{0})$ is analytically computed from Bayes’ rule and we approximate $p (x_{0} ∣ x_{t}) \approx C a t (x_{0}; {\hat{x}}_{0} (x_{t}; θ))$ via a neural network ${\hat{x}}_{0} (x_{t}; θ) \in Δ^{K}$ . The network is trained by minimizing the cross entropy $E_{x_{0}, x_{t}} [- \log ⟨{\hat{x}}_{0} (x_{t}; θ), x_{0}⟩]$ for all t. To apply discrete diffusion to protein sequences, we noise all positions independently and adopt the parameterization of MDLM [37] and DPLM [48]. In particular, we let $K = 21$ and define the K^th state to represent a mask state m. We then assign $π = m$ . With this parameterization, the reverse process becomes

p_{s ∣ t} (x_{s} ∣ x_{t}) = C a t (x_{s}; (1 - δ_{x_{t}, m}) x_{t} + δ_{x_{t}, m} [\frac{1 - α_{s}}{1 - α_{t}} m + \frac{α_{s} - α_{t}}{1 - α_{t}} {\hat{x}}_{0} (x_{t}; θ)])

(6)

Note that the sampling is independent across positions but the neural network ${\hat{x}}_{0} (x_{t}; θ)$ takes the entire sequence as input (for brevity we omit this distinction in notation).

Our final sequence training loss is then

ℒ_{seq} = E_{t \sim U [0, 1], x_{0}, x_{t}} [- λ (t) δ_{x_{t}, m} log 〈 {\hat{x}}_{0} (x_{t}; θ), x_{0}) 〉]

(7)

where $λ (t) = 1 - t$ as in DPLM. Notably, only the logits of positions where $x_{t} = m$ are supervised.

For inference, we initially set $x_{1} = m$ and iteratively sample $x_{t}$ with steps uniform in t. The sampling strategy depends on the task. For unconditional sequence generation, our sampling deviates from ancestral sampling of the reverse process $p_{s ∣ t}$ and instead follows the approach introduced in DPLM [48], which can be cast as self-planning under a Path Planning framework [34]. Specifically, a single step from t to s with $t > s$ and L being the sequence length proceeds as follows:

z_{ℓ, i} \sim G u m b e l (1,1), s h a p e (z) = [L, 20]

(8)

ϕ_{ℓ, i} \leftarrow \log s o f t m a x_{i} ((\log i t s {({\hat{x}}_{0} (x_{t}; θ))}_{ℓ, i} + z_{ℓ, i}) / γ (t))

(9)

y_{ℓ}, u_{ℓ} \leftarrow a r g \underset{i}{m a x} ϕ_{ℓ, i}, \underset{i}{m a x} ϕ_{ℓ, i}

(10)

S_{t o p k} \leftarrow {t o p k}_{ℓ} (u_{ℓ}, k = L - r o u n d (s L)), S_{t o p k} \subseteq [1 \dots L]

(11)

{(x_{s})}_{ℓ} = \{\begin{array}{l} y_{ℓ} & ℓ \in S_{t o p k} \land {(x_{t})}_{ℓ} = m \\ {(x_{t})}_{ℓ} & ℓ \in S_{t o p k} \land {(x_{t})}_{ℓ} \neq m \\ m & ℓ \notin S_{t o p k} \end{array}

(12)

Here, $γ (t)$ is a temperature and we set $γ (t) = 0.5$ for sequence generation. In co-generation, we follow a similar procedure except $γ (t) = 0.1 + 0.4 t$ and instead set

y_{ℓ}, u_{ℓ} \leftarrow {(x_{t})}_{ℓ}, ϕ_{ℓ, {(x_{t})}_{ℓ}}

(13)

for $ℓ$ such that ${(x_{t})}_{ℓ} \neq m$ ; that is, the sampled token and log probability are replaced with the current token and its log probability under the model prediction. Finally, in inverse folding, we choose a random unmasking order and sample each position with temperature 0.1, as done in ProteinMPNN [12].

A.2. Model Architecture and Training

ProDiT is a transformer architecture [45] based closely on the design of diffusion transformers for images [33]. In particular, it consists of 30 transformer blocks, each with a self-attention block and feed-forward layer. We adopt adaLN-Zero [33] to inject conditioning information, in particular a sinusoidal embedding of the structural diffusion time. We follow modern transformer best practices, including pre-norm, QK-norm, and GeLU activations. Unconventionally, we adopt a dropout rate of 20%, which we found essential for sequence generation performance.

The model accepts four types of inputs: structure, sequence, motifs, and GO terms. The structure is directly embedded with a linear layer without any geometric or equivariant transformations, following emerging best practice [1, 15]. The sequence is embedded with a learned embedding. The motifs are centered and also embedded with a linear layer, with motif positions additionally marked with a learned embedding in both the model input and adaLN-Zero conditioning. Finally, the GO terms are input via a learned embedding vector for each term. ProDiT then outputs via two linear heads a structural output for continuous structural diffusion and classifier logits for discrete sequence diffusion.

We train two models: one with model dimensionality 768 (without function conditioning) and one with model dimensionality 1024 (with function conditioning). These models have 321M and 576M parameters, respectively. Both models are trained with crop size 512 and batch size 250k–500k tokens across 16–32 NVIDIA H200 GPUs for approximately 500k steps. Model parameters are tracked with an exponential moving average with decay coefficient 0.999. 50% of the time, we sample motif conditioning following the motif sampling algorithm in Genie2 [27]. When GO terms are available, 85% of the time we sample a term at random in proportion to its inverse frequency across in UniProtKB, whereas 15% of the time we drop the GO term. Additionally, we completely drop the sequence or structure each 5% of the time.

A.3. Data

Our training set is drawn from 214M proteins in UniProtKB which have associated structures in AlphaFoldDB. Unlike prior work on protein language models, we cluster proteins for training by structural similarity rather than sequence similarity; we found this to improve generation diversity. Specifically, we use the FoldSeek clusterings of AlphaFoldDB computed by Barrio-Hernandez et al [7] as hierarchical clusters; we first sample a random FoldSeek cluster, then a random MMseqs cluster within the FoldSeek cluster, and finally a random protein within the MMseqs cluster. At training time, we filter to sample proteins with pLDDT > 80; this results in 1.24M FoldSeek clusters, 18.9M MMseqs clusters, and 128M sequences.

To obtain function training data, we compile metadata from all UniProtKB entries and associate each entry with the set of all molecular function GO terms listed in the entry and any implied parent molecular function terms not already present (via “is_a” relationship types listed in http://purl.obolibrary.org/obo/go/go.obo). 65% of entries contain at least one non-root term, with those entries having 9.6 terms on average. We then compute frequencies of each term in our training dataset order to sample them according to inverse frequency at training time. The resulting set of 8220 terms comprises the input vocabulary of ProDiT.

A.4. Design and Evaluation

We use the 321M-parameter model for unconditional generation protocols, as it obtains slightly better diversity, and use the 576M-parameter model for conditional generation.

A.4.1. Unconditional generation.

For sequence generation, we sample a noisy structure at the highest noise level and fix it while iteratively unmasking the sequence. For structure generation, we initialize and fix the sequence to be the all-mask state while denoising the structure. We then design 8 sequences using ProteinMPNN, refold each of them with ESMFold, and keep the sequence with the lowest self-consistency RMSD. For co-generation, we denoise both modalities with a linear schedule and report self-consistency metrics (scRMSD, scTM) using the sequence designed by ProDiT. In all modalities, we sample generations of length L using L model forward passes. To compute the number of successful clusters, we pool all successful generations from different lengths and run FoldSeek clustering via:

foldseek easy-cluster <dir> <out> <tmp> --alignment-type 1 --cov-mode 0 --min-seq-id 0 --tmscore-threshold 0.5

We use the ESMFold structures for sequence generation and the raw ProDiT outputs for structure generation and co-generation.

To benchmark against prior work, we download the code and weights for ESM3 1.4B, RFDiffusion, EvoDiff 640M, ProteinGenerator from their public websites. For ESM3, because default sampling settings are not provided, we experimented with sampling hyperparameters and found the following settings to yield the best results: sequence is sampled with linear schedule, L steps, random order, and temperature 1; and structure is sampled with linear schedule, L steps, random order, and temperature 0.8. For RFDiffusion and ProteinGenerator, we use the default sampling scripts but set the number of diffusion steps to be 50 for ProteinGenerator (default 25). We use the Genie2 evaluation pipeline [27] to run ProteinMPNN on generated structures and assess all structure generation methods.

A.4.2. Function conditioning.

The DeepFRI function prediction network outputs predictions for 942 distinct molecular function GO terms, 915 of these are in the input vocabulary of ProDiT (most of the excluded terms have been marked obselete). For each of these terms, we generate 100 proteins of length 300 by first generating a structure in 300 steps, then a corresponding sequence (using inverse folding sampling) in 300 additional steps. All sequences are refolded with ESMFold. Because some GO terms may correspond to proteins with disordered regions or flexible interdomain orientations, we do not filter based on scTM or scRMSD. The generation is marked as successful if DeepFRI predicts a probability ≥ 50% for the desired GO term from the refolded structure. We run DeepFRI with the command:

python predict.py --pdb_dir <dir> --ont mf --output_fn_prefix <out>

To compute the diversity, we gather all successful generations with the same GO term and compute the TM-score between generations i,j via TMalign. The diversity of generation i is then the mean of $T M_{i, j}, i \neq j$ . Note that $T M_{i, j} = T M_{j, i}$ as all generations have the same length. We display histograms of diversity TM-scores of successful generations pooled across all terms (Fig. 2C), i.e., terms with more successes have higher weight.

To compute the novelty of the generations, we first compile a FoldSeek database for selected GO terms by gathering all AlphaFoldDB structures with the target GO term and running:

foldseek createdb <pdb_dir> <out_db>

We query this database via

foldseek easy-search <query_pdbs> <go_db> <out> <tmp> --alignment-type 1 --format-output query,target,alntmscore,qtmscore

We then take the match with the highest qtmscore found by foldseek, i.e., the TM-score normalized by the length of the query (300) and assign this as the novelty TM-score of the generated protein. This ensures that high TM-scores correspond to a match for the entire generated protein rather than a short AlphaFoldDB protein matching part of the generated structure. Because of the relatively high runtime of this protocol, we only compute novelty results for 169 GO terms randomly selected from those with 10 or more success and 2M or fewer occurrences. We display histograms of novelty TM-scores of successful generations pooled across all terms (Fig. 2C), i.e., terms with more successes have higher weight.

To implement classifier-free guidance for function conditioning, we follow prior literature [19, 38] and run denoising sampling with linear combinations of logits (in discrete spaces) or scores (in continuous spaces). Let $ϕ (s, x, t_{seq}, t_{struct} ∣ \emptyset)$ and $ϕ (s, x, t_{seq}, t_{struct} ∣ c)$ represent the output of ProDiT when denoising sequence s and structure x unconditionally and conditioned on functional class c, respectively. Then for guidance strength $λ$ , we denoise with

(1 - λ) ϕ (s, x, t_{seq}, t_{struct} ∣ \emptyset) + λ ϕ (s, x, t_{seq}, t_{struct} ∣ c)

(14)

A.4.3. Structural alignment pipeline.

To further verify that function-conditioned generations are biologically meaningful, we design a pipeline that compares the active sites of generated structures with experimental structures from AlphaFoldDB (AFDB) of the same GO term.

For each GO term, we filter UniProtKB for entries with AFDB structures and at least 2 labeled active site residues in the entry’s UniProtKB feature table (sufficiently annotated). Each generated structure is then structurally aligned to all sufficiently annotated AFDB structures with the same GO term using gemmi.calculate_superposition with selection on protein backbone atoms via gemmi.SupSelect.MainChain. For each labeled active site residue in the AFDB structure’s UniProtKB feature table, we identify the nearest residue in the generated structure with (i) the same amino acid type (ii) minimized Euclidean distance between any atom in the generated and AFDB residues. If a one-to-one correspondence was found for all annotated active site residues, we computed the all atom RMSD of the active sites. To compute the RMSD, atom positions were directly paired based on name correspondence, and the final RMSD was reported as the root mean square of the interatomic distances across the full set of matched residue pairs. Alignments with this RMSD below 1.0 Å were considered successful confirmations of active site conservation. Due to long runtimes, this process was applied to 337 GO terms.

For each successful generated structure with confirmed active site matches, we further analyzed the structure by comparing them to their five closest AFDB structures (ranked by lowest active site RMSD). For each of these structure pairs, we computed several metrics, including the global backbone RMSD, the RMSD of the active site, and the RMSDs of expanded regions around the active site using spherical expansions from 1 Å to 18 Å. All RMSD values were computed in PyMOL after realigning the structures on the selected active site or expanded region. For each level of expansion, we also match the residues in the generated structure to the closest reference residue (post-alignment) and compute the fraction of matched residue identities.

A.4.4. Multistate design.

The objective of our multistate design protocol is to sample a single sequence x which folds into two distinct structures, x₁, x₂, supporting functional motifs m₁, m₂, respectively. Using the conditional independencies implied by the directed graphical model, we can rewrite this target conditional distribution as:

p (s, x_{1}, x_{2} ∣ m_{1}, m_{2}) \propto p (s, x_{1}, x_{2}, m_{1}, m_{2})

(15)

= p (s) p (x_{1} ∣ s) p (x_{2} ∣ s) p (m_{1} ∣ s, x_{1}) p (m_{2} ∣ s, x_{2})

(16)

= p (s) \frac{p (s, x_{1})}{p (s)} \frac{p (s, x_{2})}{p (s)} \frac{p (s, x_{1} ∣ m_{1}) p (m_{1})}{p (s, x_{1})} \frac{p (s, x_{2} ∣ m_{2}) p (m_{1})}{p (s, x_{2})}

(17)

\propto \frac{p (s, x_{1}, ∣ m_{1}) p (s, x_{2}, ∣ m_{2})}{p (s)}

(18)

Each of these terms corresponds to our learned generative model under some form of conditioning: $p (s, x_{i} ∣ m_{i})$ is the distribution sampled by motif scaffolding $m_{i}$ , and $p (s)$ is the distribution sampled by unconditional sequence generation. The product (and ratio) of these densities cannot be directly sampled by diffusion models; however, heuristic procedures for sampling similar densities involving the linear combinations of logits (in discrete spaces) and scores (in continuous spaces) are widely employed in the generative modeling literature [13], using the property that logarithms of products (and ratios) of densities are sums (and differences) of log densities.

Hence, we suggest a heuristic sampling algorithm for sampling the target density as follows: let $ϕ (s, x_{i}, t_{seq}, t_{struct} ∣ m_{i})$ and $ψ (s, x_{i}, t_{seq}, t_{struct}, ∣ m_{i})$ represent the logits and score predictions, respectively, of neural networks provided noisy sequence s, and noisy structure $x_{i}$ , diffusion times $t_{seq}$ , $t_{struct}$ , and motif $m_{i}$ . Then we denoise the sequence s with logits

ϕ (s, x_{1}, t_{seq}, t_{struct} ∣ m_{1}) + ϕ (s, x_{2}, t_{seq}, t_{struct}, ∣ m_{2}) - ϕ (s, t_{seq})

(19)

and denoise each noisy structure $x_{i}$ with scores $ψ (s, x_{i}, t_{seq}, t_{struct} ∣ m_{i})$ . We use these expressions as logits and scores of a black-box model output and proceed with the same discrete and continuous sampling algorithms as described previously.

To explore the application this pipeline to allosteric regulation of enzymes, we posit that a multistate enzyme could be designed via two explicit structural states: one to scaffold the desired active site motif, and another to scaffold the binding motif of a desired effector. (Although there is a risk that the model would produce a degenerate design of a single or very similar structures scaffolding both motifs, we find that this does not occur in a typical sample.) We select calcium as the effector and the EF hand motif in positions 20–31 in PDB 1PRW as the effector binding motif.

For the carbonic anhydrase active site motif, we select the all residues from PDB 6LUX with Cα co-ordinate within 7 Å if the position of the zinc cofactor. For the lysozyme active site motif, we first compute the midpoint of a line connecting the Cα atoms of catalytic residues Glu35 and Asp52 in PDB 1DPX, and include all residues with Cα coordinate within 8 Å of this midpoint. We sample 320 designs for each active site motif, using 200 (carbonic anhydrase) or 500 (lysozyme) steps of sequence-structure co-generation and motif residue indices and sequence lengths from the parent PDB. For carbonic anhydrase, we place the EF hand motif at positions 125–136 (monotonic indexing), whereas for lysozyme we randomly place the EF hand at a random non-overlapping position in each design.

To filter the designs in silico, we sought to identify changes that could plausibly be sufficient to disable or reduce the activity of the enzyme based on the current understanding of its catalytic mechanism. Thus, we filtered the carbonic anhydrase scaffolds based on the displacement of core catalytic residues identified in previous work. For lysozyme, we screened using the RMSD for all motif residues due to the putative importance of non-catalytic residues in substrate recognition.

Supplementary Material

NIHPP2025.09.03.672144v1-supplement-1.pdf^{(6.8MB, pdf)}

Acknowledgments

We thank Samuel Sledzieski, Alexander Shida, Arvind Pillai, Jason Yim, Woody Ahern, and Soojung Yang for helpful discussions. This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award 1R35GM141861 (to B.B.), the NSF AI Institute for Foundations of Machine Learning (IFML), the UT-Austin Center for Generative AI, and a gift from Param Hansa Philanthropies. B.J. was supported by a Department of Energy Computational Science Graduate Fellowship under Award Number DESC0022158. A.S. was supported by a Hertz Foundation Fellowship. R.K. was supported by the Advanced Research Projects Agency for Health APECx Program and a gift from Microsoft. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility using NERSC awards ASCR-ERCAP0027302, ASCR-ERCAP0030607, ASCR-ERCAP0032958, and ASCR-ERCAP0027818.

Footnotes

Competing Interests

D.J.D. owns Intelligent Proteins LLC where he consults biotechnology companies on AI protein engineering. D.J.D. is a co-founder of Metabologic AI, which focuses on developing commercial enzymes with AI. Other authors declare no competing interests.

References

[1].Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A. J., Bambrick J., et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630(8016):493–500, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
[2].Ahern W., Yim J., Tischer D., Salike S., Woodbury S. M., Kim D., Kalvet I., Kipnis Y., Coventry B., Altae-Tran H. R., et al. Atom level enzyme active site scaffolding using rfdiffusion2. bioRxiv, pp. 2025–04, 2025. [Google Scholar]
[3].Ahmed Laskar A. and Younus H. Aldehyde toxicity and metabolism: the role of aldehyde dehydrogenases in detoxification, drug resistance and carcinogenesis. Drug metabolism reviews, 51(1):42–64, 2019. [DOI] [PubMed] [Google Scholar]
[4].Alamdari S., Thakkar N., van den Berg R., Tenenholtz N., Strome B., Moses A., Lu A. X., Fusi N., Amini A. P., and Yang K. K. Protein generation with evolutionary diffusion: sequence is all you need. BioRxiv, pp. 2023–09, 2023. [Google Scholar]
[5].Austin J., Johnson D. D., Ho J., Tarlow D., and Van Den Berg R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021. [Google Scholar]
[6].Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G. R., Wang J., Cong Q., Kinch L. N., Schaeffer R. D., et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Barrio-Hernandez I., Yeo J., Jänes J., Mirdita M., Gilchrist C. L., Wein T., Varadi M., Velankar S., Beltrao P., and Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature, 622(7983):637–645, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Bhatnagar A., Jain S., Beazer J., Curran S. C., Hoffnagle A. M., Ching K., Martyn M., Nayfach S., Ruffolo J. A., and Madani A. Scaling unlocks broader generation and deeper functional understanding of proteins. bioRxiv, pp. 2025–04, 2025. [Google Scholar]
[9].Blake C., Koenig D., Mair G., North A., Phillips D., and Sarma V. Structure of hen egg-white lysozyme: a three-dimensional fourier synthesis at 2 å resolution. Nature, 206(4986):757–761, 1965. [DOI] [PubMed] [Google Scholar]
[10].Boone C. D., Gill S., Habibzadegan A., and McKenna R. Carbonic anhydrase: an efficient enzyme with possible global implications. International Journal of Chemical Engineering, 2013 (1):813931, 2013. [Google Scholar]
[11].Chu A. E., Lu T., and Huang P.-S. Sparks of function by de novo protein design. Nature biotechnology, 42(2):203–215, 2024. [Google Scholar]
[12].Dauparas J., Anishchenko I., Bennett N., Bai H., Ragotte R. J., Milles L. F., Wicky B. I., Courbet A., de Haas R. J., Bethel N., et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Du Y., Durkan C., Strudel R., Tenenbaum J. B., Dieleman S., Fergus R., Sohl-Dickstein J., Doucet A., and Grathwohl W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pp. 8489–8510. PMLR, 2023. [Google Scholar]
[14].Ferruz N., Schmidt S., and Höcker B. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022. [Google Scholar]
[15].Geffner T., Didi K., Zhang Z., Reidenbach D., Cao Z., Yim J., Geiger M., Dallago C., Kucukbenli E., Vahdat A., and Kreis K. Proteina: Scaling flow-based protein structure generative models. In The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]
[16].Gligorijević V., Renfrew P. D., Kosciolek T., Leman J. K., Berenberg D., Vatanen T., Chandler C., Taylor B. C., Fisk I. M., Vlamakis H., et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):3168, 2021. [Google Scholar]
[17].Guo A. B., Akpinaroglu D., Stephens C. A., Grabe M., Smith C. A., Kelly M. J., and Kortemme T. Deep learning–guided design of dynamic proteins. Science, 388(6749):eadr7094, 2025. [DOI] [PubMed] [Google Scholar]
[18].Hayes T., Rao R., Akin H., Sofroniew N. J., Oktay D., Lin Z., Verkuil R., Tran V. Q., Deaton J., Wiggert M., et al. Simulating 500 million years of evolution with a language model. Science, pp. eads0018, 2025. [Google Scholar]
[19].Ho J. and Salimans T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. [Google Scholar]
[20].Ho J., Jain A., and Abbeel P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. [Google Scholar]
[21].Hudson K. L., Bartlett G. J., Diehl R. C., Agirre J., Gallagher T., Kiessling L. L., and Woolfson D. N. Carbohydrate–aromatic interactions in proteins. Journal of the American Chemical Society, 137(48):15152–15160, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[22].Ingraham J. B., Baranov M., Costello Z., Barber K. W., Wang W., Ismail A., Frappier V., Lord D. M., Ng-Thow-Hing C., Van Vlack E. R., et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
[23].Inoue M., Yamada H., Yasukochi T., Kuroki R., Miki T., Horiuchi T., and Imoto T. Multiple role of hydrophobicity of tryptophan-108 in chicken lysozyme: structural stability, saccharide binding ability, and abnormal pka of glutamic acid-35. Biochemistry, 31(24):5545–5553, 1992. [DOI] [PubMed] [Google Scholar]
[24].Johansson K., Ramaswamy S., Eklund H., El-Ahmad M., Hjelmqvist L., and Jörnvall H. Structure of betaine aldehyde dehydrogenase at 2.1 å resolution. Protein Science, 7(10):2106–2117, 1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
[25].Karras T., Aittala M., Aila T., and Laine S. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022. [Google Scholar]
[26].Krishna R., Wang J., Ahern W., Sturmfels P., Venkatesh P., Kalvet I., Lee G. R., Morey-Burrows F. S., Anishchenko I., Humphreys I. R., et al. Generalized biomolecular modeling and design with rosettafold all-atom. Science, 384(6693):eadl2528, 2024. [DOI] [PubMed] [Google Scholar]
[27].Lin Y., Lee M., Zhang Z., and AlQuraishi M. Out of many, one: Designing and scaffolding proteins at the scale of the structural universe with genie 2. arXiv preprint arXiv:2405.15489, 2024. [Google Scholar]
[28].Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. [DOI] [PubMed] [Google Scholar]
[29].Lisanza S. L., Gershon J. M., Tipps S. W., Sims J. N., Arnoldt L., Hendel S. J., Simma M. K., Liu G., Yase M., Wu H., et al. Multistate and functional protein design using rosettafold sequence space diffusion. Nature biotechnology, pp. 1–11, 2024. [Google Scholar]
[30].Madani A., Krause B., Greene E. R., Subramanian S., Mohr B. P., Holton J. M., Olmos J. L. Jr, Xiong C., Sun Z. Z., Socher R., et al. Large language models generate functional protein sequences across diverse families. Nature biotechnology, 41(8):1099–1106, 2023. [Google Scholar]
[31].Nijkamp E., Ruffolo J. A., Weinstein E. N., Naik N., and Madani A. Progen2: exploring the boundaries of protein language models. Cell systems, 14(11):968–978, 2023. [DOI] [PubMed] [Google Scholar]
[32].Oefner C., Schulz H., D’Arcy A., and Dale G. E. Mapping the active site of escherichia coli malonyl-coa–acyl carrier protein transacylase (fabd) by protein crystallography. Biological Crystallography, 62(6):613–618, 2006. [DOI] [PubMed] [Google Scholar]
[33].Peebles W. and Xie S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023. [Google Scholar]
[34].Peng F. Z., Bezemek Z., Patel S., Rector-Brooks J., Yao S., Tong A., and Chatterjee P. Path planning for masked diffusion model sampling. arXiv preprint arXiv:2502.03540, 2025. [Google Scholar]
[35].Pillai A., Idris A., Philomin A., Weidle C., Skotheim R., Leung P. J., Broerman A., Demakis C., Borst A. J., Praetorius F., et al. De novo design of allosterically switchable protein assemblies. Nature, 632(8026):911–920, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
[36].Ruffolo J. A., Nayfach S., Gallagher J., Bhatnagar A., Beazer J., Hussain R., Russ J., Yip J., Hill E., Pacesa M., et al. Design of highly functional genome editors by modeling the universe of crispr-cas sequences. BioRxiv, pp. 2024–04, 2024. [Google Scholar]
[37].Sahoo S., Arriola M., Schiff Y., Gokaslan A., Marroquin E., Chiu J., Rush A., and Kuleshov V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024. [Google Scholar]
[38].Schiff Y., Sahoo S. S., Phung H., Wang G., Boshar S., Dalla-torre H., de Almeida B. P., Rush A. M., PIERROT T., and Kuleshov V. Simple guidance mechanisms for discrete diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]
[39].Shanmugam N. S. and Yin Y. Cazyme3d: a database of 3d structures for carbohydrate-active enzymes. Journal of Molecular Biology, pp. 169001, 2025. [DOI] [PubMed] [Google Scholar]
[40].Song Y., Sohl-Dickstein J., Kingma D. P., Kumar A., Ermon S., and Poole B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. [Google Scholar]
[41].Stams T., Nair S. K., Okuyama T., Waheed A., Sly W. S., and Christianson D. W. Crystal structure of the secretory form of membrane-associated human carbonic anhydrase iv at 2.8-å resolution. Proceedings of the National Academy of Sciences, 93(24):13589–13594, 1996. [Google Scholar]
[42].team C. D., Boitreaud J., Dent J., McPartlon M., Meier J., Reis V., Rogozhonikov A., and Wu K. Chai-1: Decoding the molecular interactions of life. BioRxiv, pp. 2024–10, 2024. [Google Scholar]
[43].Van Kempen M., Kim S. S., Tumescheit C., Mirdita M., Lee J., Gilchrist C. L., Söding J., and Steinegger M. Fast and accurate protein structure search with foldseek. Nature biotechnology, 42(2):243–246, 2024. [Google Scholar]
[44].Varadi M., Bertoni D., Magana P., Paramval U., Pidruchna I., Radhakrishnan M., Tsenkov M., Nair S., Mirdita M., Yeo J., et al. Alphafold protein structure database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic acids research, 52(D1):D368–D375, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
[45].Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., and Polosukhin I. Attention is all you need. Advances in neural information processing systems, 30, 2017. [Google Scholar]
[46].Vocadlo D. J., Davies G. J., Laine R., and Withers S. G. Catalysis by hen egg-white lysozyme proceeds via a covalent intermediate. Nature, 412(6849):835–838, 2001. [DOI] [PubMed] [Google Scholar]
[47].Wang J., Lisanza S., Juergens D., Tischer D., Watson J. L., Castro K. M., Ragotte R., Saragovi A., Milles L. F., Baek M., et al. Scaffolding protein functional sites using deep learning. Science, 377(6604):387–394, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
[48].Wang X., Zheng Z., Ye F., Xue D., Huang S., and Gu Q. Diffusion language models are versatile protein learners. In International Conference on Machine Learning, pp. 52309–52333. PMLR, 2024. [Google Scholar]
[49].Watson J. L., Juergens D., Bennett N. R., Trippe B. L., Yim J., Eisenach H. E., Ahern W., Borst A. J., Ragotte R. J., Milles L. F., et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
[50].Wei K. Y., Moschidi D., Bick M. J., Nerli S., McShan A. C., Carter L. P., Huang P.-S., Fletcher D. A., Sgourakis N. G., Boyken S. E., et al. Computational design of closely related proteins that adopt two well-defined but structurally divergent folds. Proceedings of the National Academy of Sciences, 117(13):7208–7215, 2020. [Google Scholar]
[51].Winnifrith A., Outeiral C., and Hie B. Generative artificial intelligence for de novo protein design. Current Opinion in Structural Biology, 86:102794–102794, 2024. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHPP2025.09.03.672144v1-supplement-1.pdf^{(6.8MB, pdf)}

[R1] [1].Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A. J., Bambrick J., et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630(8016):493–500, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] [2].Ahern W., Yim J., Tischer D., Salike S., Woodbury S. M., Kim D., Kalvet I., Kipnis Y., Coventry B., Altae-Tran H. R., et al. Atom level enzyme active site scaffolding using rfdiffusion2. bioRxiv, pp. 2025–04, 2025. [Google Scholar]

[R3] [3].Ahmed Laskar A. and Younus H. Aldehyde toxicity and metabolism: the role of aldehyde dehydrogenases in detoxification, drug resistance and carcinogenesis. Drug metabolism reviews, 51(1):42–64, 2019. [DOI] [PubMed] [Google Scholar]

[R4] [4].Alamdari S., Thakkar N., van den Berg R., Tenenholtz N., Strome B., Moses A., Lu A. X., Fusi N., Amini A. P., and Yang K. K. Protein generation with evolutionary diffusion: sequence is all you need. BioRxiv, pp. 2023–09, 2023. [Google Scholar]

[R5] [5].Austin J., Johnson D. D., Ho J., Tarlow D., and Van Den Berg R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021. [Google Scholar]

[R6] [6].Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G. R., Wang J., Cong Q., Kinch L. N., Schaeffer R. D., et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Barrio-Hernandez I., Yeo J., Jänes J., Mirdita M., Gilchrist C. L., Wein T., Varadi M., Velankar S., Beltrao P., and Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature, 622(7983):637–645, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Bhatnagar A., Jain S., Beazer J., Curran S. C., Hoffnagle A. M., Ching K., Martyn M., Nayfach S., Ruffolo J. A., and Madani A. Scaling unlocks broader generation and deeper functional understanding of proteins. bioRxiv, pp. 2025–04, 2025. [Google Scholar]

[R9] [9].Blake C., Koenig D., Mair G., North A., Phillips D., and Sarma V. Structure of hen egg-white lysozyme: a three-dimensional fourier synthesis at 2 å resolution. Nature, 206(4986):757–761, 1965. [DOI] [PubMed] [Google Scholar]

[R10] [10].Boone C. D., Gill S., Habibzadegan A., and McKenna R. Carbonic anhydrase: an efficient enzyme with possible global implications. International Journal of Chemical Engineering, 2013 (1):813931, 2013. [Google Scholar]

[R11] [11].Chu A. E., Lu T., and Huang P.-S. Sparks of function by de novo protein design. Nature biotechnology, 42(2):203–215, 2024. [Google Scholar]

[R12] [12].Dauparas J., Anishchenko I., Bennett N., Bai H., Ragotte R. J., Milles L. F., Wicky B. I., Courbet A., de Haas R. J., Bethel N., et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Du Y., Durkan C., Strudel R., Tenenbaum J. B., Dieleman S., Fergus R., Sohl-Dickstein J., Doucet A., and Grathwohl W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pp. 8489–8510. PMLR, 2023. [Google Scholar]

[R14] [14].Ferruz N., Schmidt S., and Höcker B. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022. [Google Scholar]

[R15] [15].Geffner T., Didi K., Zhang Z., Reidenbach D., Cao Z., Yim J., Geiger M., Dallago C., Kucukbenli E., Vahdat A., and Kreis K. Proteina: Scaling flow-based protein structure generative models. In The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]

[R16] [16].Gligorijević V., Renfrew P. D., Kosciolek T., Leman J. K., Berenberg D., Vatanen T., Chandler C., Taylor B. C., Fisk I. M., Vlamakis H., et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):3168, 2021. [Google Scholar]

[R17] [17].Guo A. B., Akpinaroglu D., Stephens C. A., Grabe M., Smith C. A., Kelly M. J., and Kortemme T. Deep learning–guided design of dynamic proteins. Science, 388(6749):eadr7094, 2025. [DOI] [PubMed] [Google Scholar]

[R18] [18].Hayes T., Rao R., Akin H., Sofroniew N. J., Oktay D., Lin Z., Verkuil R., Tran V. Q., Deaton J., Wiggert M., et al. Simulating 500 million years of evolution with a language model. Science, pp. eads0018, 2025. [Google Scholar]

[R19] [19].Ho J. and Salimans T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. [Google Scholar]

[R20] [20].Ho J., Jain A., and Abbeel P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. [Google Scholar]

[R21] [21].Hudson K. L., Bartlett G. J., Diehl R. C., Agirre J., Gallagher T., Kiessling L. L., and Woolfson D. N. Carbohydrate–aromatic interactions in proteins. Journal of the American Chemical Society, 137(48):15152–15160, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Ingraham J. B., Baranov M., Costello Z., Barber K. W., Wang W., Ismail A., Frappier V., Lord D. M., Ng-Thow-Hing C., Van Vlack E. R., et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] [23].Inoue M., Yamada H., Yasukochi T., Kuroki R., Miki T., Horiuchi T., and Imoto T. Multiple role of hydrophobicity of tryptophan-108 in chicken lysozyme: structural stability, saccharide binding ability, and abnormal pka of glutamic acid-35. Biochemistry, 31(24):5545–5553, 1992. [DOI] [PubMed] [Google Scholar]

[R24] [24].Johansson K., Ramaswamy S., Eklund H., El-Ahmad M., Hjelmqvist L., and Jörnvall H. Structure of betaine aldehyde dehydrogenase at 2.1 å resolution. Protein Science, 7(10):2106–2117, 1998. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] [25].Karras T., Aittala M., Aila T., and Laine S. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022. [Google Scholar]

[R26] [26].Krishna R., Wang J., Ahern W., Sturmfels P., Venkatesh P., Kalvet I., Lee G. R., Morey-Burrows F. S., Anishchenko I., Humphreys I. R., et al. Generalized biomolecular modeling and design with rosettafold all-atom. Science, 384(6693):eadl2528, 2024. [DOI] [PubMed] [Google Scholar]

[R27] [27].Lin Y., Lee M., Zhang Z., and AlQuraishi M. Out of many, one: Designing and scaffolding proteins at the scale of the structural universe with genie 2. arXiv preprint arXiv:2405.15489, 2024. [Google Scholar]

[R28] [28].Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. [DOI] [PubMed] [Google Scholar]

[R29] [29].Lisanza S. L., Gershon J. M., Tipps S. W., Sims J. N., Arnoldt L., Hendel S. J., Simma M. K., Liu G., Yase M., Wu H., et al. Multistate and functional protein design using rosettafold sequence space diffusion. Nature biotechnology, pp. 1–11, 2024. [Google Scholar]

[R30] [30].Madani A., Krause B., Greene E. R., Subramanian S., Mohr B. P., Holton J. M., Olmos J. L. Jr, Xiong C., Sun Z. Z., Socher R., et al. Large language models generate functional protein sequences across diverse families. Nature biotechnology, 41(8):1099–1106, 2023. [Google Scholar]

[R31] [31].Nijkamp E., Ruffolo J. A., Weinstein E. N., Naik N., and Madani A. Progen2: exploring the boundaries of protein language models. Cell systems, 14(11):968–978, 2023. [DOI] [PubMed] [Google Scholar]

[R32] [32].Oefner C., Schulz H., D’Arcy A., and Dale G. E. Mapping the active site of escherichia coli malonyl-coa–acyl carrier protein transacylase (fabd) by protein crystallography. Biological Crystallography, 62(6):613–618, 2006. [DOI] [PubMed] [Google Scholar]

[R33] [33].Peebles W. and Xie S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023. [Google Scholar]

[R34] [34].Peng F. Z., Bezemek Z., Patel S., Rector-Brooks J., Yao S., Tong A., and Chatterjee P. Path planning for masked diffusion model sampling. arXiv preprint arXiv:2502.03540, 2025. [Google Scholar]

[R35] [35].Pillai A., Idris A., Philomin A., Weidle C., Skotheim R., Leung P. J., Broerman A., Demakis C., Borst A. J., Praetorius F., et al. De novo design of allosterically switchable protein assemblies. Nature, 632(8026):911–920, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] [36].Ruffolo J. A., Nayfach S., Gallagher J., Bhatnagar A., Beazer J., Hussain R., Russ J., Yip J., Hill E., Pacesa M., et al. Design of highly functional genome editors by modeling the universe of crispr-cas sequences. BioRxiv, pp. 2024–04, 2024. [Google Scholar]

[R37] [37].Sahoo S., Arriola M., Schiff Y., Gokaslan A., Marroquin E., Chiu J., Rush A., and Kuleshov V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024. [Google Scholar]

[R38] [38].Schiff Y., Sahoo S. S., Phung H., Wang G., Boshar S., Dalla-torre H., de Almeida B. P., Rush A. M., PIERROT T., and Kuleshov V. Simple guidance mechanisms for discrete diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]

[R39] [39].Shanmugam N. S. and Yin Y. Cazyme3d: a database of 3d structures for carbohydrate-active enzymes. Journal of Molecular Biology, pp. 169001, 2025. [DOI] [PubMed] [Google Scholar]

[R40] [40].Song Y., Sohl-Dickstein J., Kingma D. P., Kumar A., Ermon S., and Poole B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. [Google Scholar]

[R41] [41].Stams T., Nair S. K., Okuyama T., Waheed A., Sly W. S., and Christianson D. W. Crystal structure of the secretory form of membrane-associated human carbonic anhydrase iv at 2.8-å resolution. Proceedings of the National Academy of Sciences, 93(24):13589–13594, 1996. [Google Scholar]

[R42] [42].team C. D., Boitreaud J., Dent J., McPartlon M., Meier J., Reis V., Rogozhonikov A., and Wu K. Chai-1: Decoding the molecular interactions of life. BioRxiv, pp. 2024–10, 2024. [Google Scholar]

[R43] [43].Van Kempen M., Kim S. S., Tumescheit C., Mirdita M., Lee J., Gilchrist C. L., Söding J., and Steinegger M. Fast and accurate protein structure search with foldseek. Nature biotechnology, 42(2):243–246, 2024. [Google Scholar]

[R44] [44].Varadi M., Bertoni D., Magana P., Paramval U., Pidruchna I., Radhakrishnan M., Tsenkov M., Nair S., Mirdita M., Yeo J., et al. Alphafold protein structure database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic acids research, 52(D1):D368–D375, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] [45].Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., and Polosukhin I. Attention is all you need. Advances in neural information processing systems, 30, 2017. [Google Scholar]

[R46] [46].Vocadlo D. J., Davies G. J., Laine R., and Withers S. G. Catalysis by hen egg-white lysozyme proceeds via a covalent intermediate. Nature, 412(6849):835–838, 2001. [DOI] [PubMed] [Google Scholar]

[R47] [47].Wang J., Lisanza S., Juergens D., Tischer D., Watson J. L., Castro K. M., Ragotte R., Saragovi A., Milles L. F., Baek M., et al. Scaffolding protein functional sites using deep learning. Science, 377(6604):387–394, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] [48].Wang X., Zheng Z., Ye F., Xue D., Huang S., and Gu Q. Diffusion language models are versatile protein learners. In International Conference on Machine Learning, pp. 52309–52333. PMLR, 2024. [Google Scholar]

[R49] [49].Watson J. L., Juergens D., Bennett N. R., Trippe B. L., Yim J., Eisenach H. E., Ahern W., Borst A. J., Ragotte R. J., Milles L. F., et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] [50].Wei K. Y., Moschidi D., Bick M. J., Nerli S., McShan A. C., Carter L. P., Huang P.-S., Fletcher D. A., Sgourakis N. G., Boyken S. E., et al. Computational design of closely related proteins that adopt two well-defined but structurally divergent folds. Proceedings of the National Academy of Sciences, 117(13):7208–7215, 2020. [Google Scholar]

[R51] [51].Winnifrith A., Outeiral C., and Hie B. Generative artificial intelligence for de novo protein design. Current Opinion in Structural Biology, 86:102794–102794, 2024. [DOI] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

Generating functional and multistate proteins with a multimodal diffusion transformer

Bowen Jing

Anna Sappington

Mihir Bafna

Ravi Shah

Adrina Tang

Rohith Krishna

Adam Klivans

Daniel J Diaz

Bonnie Berger

Abstract

1. Introduction

2. Results

2.1. Multimodal Generation of Sequence and Structure

Figure 1: Overview of ProDiT and unconditional generation.

2.2. Steering Design with Molecular Function

2.2.1. Systematic Assessment of Function Conditioning

Figure 2: In silico validation of molecular function conditioning.

2.2.2. Recovery of Active Sites at Atomic Fidelity

Figure 3: Validation of functional conditioning via structural alignment.

2.3. Design of Multistate Allostery

2.3.1. Coupled Structure Diffusion

Figure 4: Coupled structure diffusion framework for generating multistate proteins.

2.3.2. Scaffolding Enzymatic Motifs

Figure 5: In silico validation of conformation switching designs.

Lysozyme motif

Carbonic anhydrase motif

3. Discussion

A. Methods

A.1. Multimodal Diffusion

A.1.1. Continuous diffusion.

A.1.2. Discrete diffusion.

A.2. Model Architecture and Training

A.3. Data

A.4. Design and Evaluation

A.4.1. Unconditional generation.

A.4.2. Function conditioning.

A.4.3. Structural alignment pipeline.

A.4.4. Multistate design.

Supplementary Material

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases