DiffPROTACs is a deep learning-based generator for proteolysis targeting chimeras

Fenglei Li; Qiaoyu Hu; Yongqi Zhou; Hao Yang; Fang Bai

doi:10.1093/bib/bbae358

2024 Aug 5;25(5):bbae358. doi: 10.1093/bib/bbae358

PMCID: PMC11299039 PMID: 39101502

Abstract

PROteolysis TArgeting Chimeras (PROTACs) have recently emerged as a promising technology. However, the rational design of PROTACs, especially the linker component, remains challenging due to the absence of structure–activity relationships and experimental data. Leveraging the structural characteristics of PROTACs, fragment-based drug design (FBDD) provides a feasible approach for PROTAC research. Concurrently, artificial intelligence–generated content has attracted considerable attention, with diffusion models and Transformers emerging as indispensable tools in this field. In response, we present a new diffusion model, DiffPROTACs, harnessing the power of Transformers to learn and generate new PROTAC linkers based on given ligands. To introduce the essential inductive biases required for molecular generation, we propose the O(3) equivariant graph Transformer module, which augments Transformers with graph neural networks (GNNs), using Transformers to update nodes and GNNs to update the coordinates of PROTAC atoms. DiffPROTACs competes effectively with existing models and achieves comparable performance on two traditional FBDD datasets, ZINC and GEOM. To account for the differences in molecular characteristics between PROTACs and traditional small molecules, we fine-tuned the model on our self-built PROTACs dataset, achieving a 93.86% validity rate for generated PROTACs. Additionally, we provide a generated PROTAC database for further research, which can be accessed at https://bailab.siais.shanghaitech.edu.cn/service/DiffPROTACs-generated.tgz. The corresponding code is available at https://github.com/Fenglei104/DiffPROTACs and the server is at https://bailab.siais.shanghaitech.edu.cn/services/diffprotacs.

Keywords: PROTACs, linker generation, de-novo drug design, deep learning, PROTAC database

Introduction

The technology of PROteolysis TArgeting Chimeras (PROTACs) has gained popularity since its first proposal and demonstration by Crews in 2001 [1]. PROTACs are molecules consisting of three components: a ligand for a protein of interest (POI), a linker, and a ligand for recruiting an E3 ubiquitin ligase. By bringing the ubiquitination machinery closer to the POI, a PROTAC promotes the formation of a ternary complex (POI-PROTAC-E3) and drives the transfer of ubiquitin from the E2 enzyme to an exposed lysine on the surface of the target protein. This leads to polyubiquitination and degradation of the POI into small peptide fragments or amino acids after recognition by the 26S proteasome. After the process is complete, the PROTAC is recycled to reach another POI, which reveals the catalytic nature of PROTACs.

Compared to traditional drug molecules, PROTACs have a number of superior properties. First, PROTACs are capable of modulating undruggable targets without well-defined binding pockets, albeit with comparatively modest or even weak affinities [2, 3, 4]. Second, the catalytic properties of PROTACs allow them to function when their concentration in the cellular environment is low, mitigating potential adverse effects associated with high drug concentrations. Moreover, PROTACs can distinguish highly conserved homologous proteins that have different conformations outside the catalytic core, as the ubiquitin transfer step depends on the relative position of the exposed lysine and ubiquitin [5]. This relative position depends on the conformation of the ternary complex, which is significantly affected by the linker of the PROTAC. The diversity of conformations makes the task of designing a universally applicable linker almost insurmountable and thus presents a major challenge for PROTAC design [6].

Since PROTACs were proposed, there has been a great effort to move them from academia to industry. The first crystal structure of the ternary complex for BRD4-MZ1-VHL (PDB code: 5T35) was released in 2017 [7]. In 2020, clinical testing of two PROTAC molecules (ARV-110 and ARV-471) provided the first proof-of-concept for the modality against two well-established cancer targets: the androgen receptor and the estrogen receptor. By the end of 2021, approximately 15 PROTACs had successively entered clinical trials [8, 9]. However, owing to the lack of experimental crystal structures and the vagueness of the structure–activity relationship, the discovery of PROTACs, especially the linker, still depends mainly on the expertise of chemists and on experimental validation technologies such as western blot and cell-based assays.

In addition to the methods based on human design, several computational methods have emerged in recent years to complement them. Molecular dynamics (MD) and docking are two common approaches in traditional computer-aided drug design. MD is an approach for exploring the dynamic behavior of molecules in a defined space. Several MD methods [10, 11, 12] have been developed to simulate ternary structures and gain insight into the mechanisms of PROTACs, facilitating the rational design of novel PROTACs. MD methods have shown some predictive power, but the process is always time-consuming. Our group is also very interested in elucidating the reasons for the large changes in degradation efficacy caused by small differences in molecules and has performed studies in different molecular systems, e.g. BTK-PROTACs-CRBN and BCR-ABL-PROTACs-CRBN, using all-atom MD simulation strategies [13, 14]. Mai et al. [15] employed coarse-grained MD and alchemical free energy calculation methods to explore PROTAC cooperativity, at the cost of precision. For docking or scoring function-based methods, which typically predict the binding of a ligand and a target to form a stable complex, there are currently no established standard protocols for constructing reliable ternary complexes. Nonetheless, various efforts have been made to address this challenge [16, 17, 18]. PRosettaC [19], a typical protocol based on a scoring function, integrates global docking, local docking, conformational sampling, and clustering to model the 3D structure of the POI-PROTAC-E3 complex. The most promising PROTAC and its associated structure are selected according to the ranking provided by the scoring function. Although these methods prove effective in certain cases, the pursuit of a universally applicable modeling approach is an extensive endeavor that requires a high level of commitment.

In recent years, advances in artificial intelligence technology and the accumulation of PROTAC data, particularly with the release of PROTAC-DB [20], have ushered in a surge in PROTAC research leveraging deep learning methodologies. DeepPROTACs [21] is a deep learning-based model for predicting the degradation of PROTACs and provides a way to design or screen PROTACs. Zhang et al. [22] and Nori et al. [23] employed reinforcement learning to generate PROTACs with desired properties and obtained good results in their study cases. However, it is important to note that these models operate primarily on 2D representations of PROTACs (only the atom element types and bond types are predicted), while the functionality of PROTACs is predominantly contingent upon their 3D structures. This matters especially for PROTACs, because the 3D structure determines the stability of the ternary complex, which is the precondition for initiating the degradation process.

All of these methods have contributed valuable tools for PROTAC design and offer promising research directions. Nevertheless, there is still a substantial journey ahead in the pursuit of rational PROTAC design.

Most of the PROTACs reported to date have been developed with established and potent small molecule ligands that bind to known targets. These ligands are usually selected based on the availability of cocrystal structures that can be used to define a suitable initial vector for linker incorporation. Consequently, FBDD methods, which start with fragments (small molecular compounds) and interconnect them to form a ligand, offer an alternative route for PROTAC research, particularly in the domain of PROTAC generation. The two ligands of a PROTAC can be viewed as individual fragments, with the linker serving to connect these two ligands and thereby form a complete PROTAC molecule. There have been several deep learning–based FBDD methods for linker generation, such as DeLinker [24], 3DLinker [25], and DiffLinker [26]. The success of these methods in FBDD indicates the potential applicability of similar approaches in PROTAC research, and the insights gained from FBDD can be transferred to inform and guide PROTAC generation.

In addition, artificial intelligence–generated content has garnered significant attention and has become an important topic. Diffusion models [27], which have emerged as an innovative generative framework, use a noising process and a denoising process for training and for generating new content. The former gradually adds noise to the data step by step (i.e. diffusion), while the latter attempts to gradually recover the data (i.e. denoising). Usually, diffusion models employ a neural network to predict the added noise and remove it for denoising. During generation, random normal noise is denoised step by step to generate new data. In addition to the vision [28], audio [29], and natural language [30] fields, diffusion models are also used in molecular and material generation [31]. GeoDiff [32] and EDM [33] applied diffusion models to generate molecules while respecting the equivariance of molecules. Based on EDM, DiffLinker [26] and DiffSBDD [34] further added fragments or target protein pockets as conditions to improve fragment-based drug design and structure-based drug design. Currently, most diffusion models for molecule generation use graph neural networks (GNNs) to learn and predict noise, as molecules are naturally represented as graphs. In this representation, the atoms and bonds of molecules correspond to the nodes and edges within the graph. Recently, Transformers [35], which play a vital role in recently reported large language models and generative tasks, have also been employed to analyze graph-structured data, yielding promising results [36]. The application of Transformer models to the field of molecular generation could be a way to advance this field. However, Transformers were originally developed for sequential data and lack inductive biases associated with graphs, such as rotation equivariance, that are required to deal with 3D structures. Equiformer [37] addresses this issue by incorporating equivariant graph attention and other operations, achieving satisfactory results. Nevertheless, this adaptation fundamentally alters the basic structure of Transformers, so some of the conventional techniques and tricks used in Transformers may no longer be applicable. Our goal is to leverage the strengths of both Transformers and GNNs while minimizing the changes to the traditional Transformer architecture.

Therefore, we propose a diffusion model, DiffPROTACs, to generate new PROTACs and the O(3) equivariant graph Transformer (OEGT) module to learn and predict the noise in the model. OEGT uses Transformers to extract node and edge features from a molecular graph, and then employs GNNs to update the graph’s coordinates. The GNN module ensures that features retain the same transformations after operations such as molecule rotation or reflection, namely O(3) equivariance. In contrast, Transformers operate on features that inherently do not contain spatial information. We trained and tested DiffPROTACs on two traditional FBDD datasets, ZINC and GEOM, and found that DiffPROTACs competes closely with existing models. Furthermore, the incorporation of Transformers into the models extends the potential for large models, opening the door to leveraging the latest advancements and techniques related to Transformers to enhance model performance.

Interestingly, we observed significant differences between the distribution of PROTACs and that of the traditional small molecule datasets. To address this, we fine-tuned our model using a PROTACs dataset, resulting in a validity score of 93.86% for the generated PROTACs. As a culmination of our work, we present a database of generated PROTACs in this paper to facilitate further research in this area.

Materials and methods

Data

In our experiments, we used three different datasets: ZINC, GEOM, and PROTACs. ZINC and GEOM are derived from DiffLinker. ZINC contains 438 610 training samples, 400 validation samples, and 400 test samples, while GEOM contains 282 602 training samples, 1250 validation samples, and 1290 test samples. It is important to note that the ZINC and GEOM datasets were computationally generated by decomposing molecules from the ZINC20 [38] and GEOM [39] databases, respectively. Each ZINC sample consists of a single linker and two fragments, while GEOM samples typically feature at least three fragments. Also, it is worth noting that ZINC lacks the element P, while the molecules in the other two datasets contain this element.

Weng et al. published the dataset PROTAC-DB 2.0 [40] recently, which contains basic information of 3270 PROTACs. In our data collection efforts, we gathered the Simplified molecular-input line-entry system (SMILES) representations for the E3 ligands, the warheads (ligands that bind to targets), and the linkers of each PROTAC from the corresponding pages on the PROTAC-DB website. This process resulted in a final dataset of 365 warheads, 82 E3 ligands, 1501 linkers, and 3270 PROTACs.

We analyzed the frequently occurring functional groups of linkers in the PROTAC-DB (3257 PROTACs with linkers out of 3270 PROTACs). The detailed findings are presented in Table 1. Notably, a single linker may encompass multiple functional groups, potentially resulting in some overlap in the data.

Table 1.

Functional group occurrence in PROTAC-DB

Functional group of linker	Occurrence in PROTAC-DB (%)
Amide	36.3
PEG	35.7
Alkyl^a	7.0
Alkyne	6.7
Triazole	11.8
Benzene	8.0
Piperazine	6.2
Piperidine	4.1

^aLinkers containing only carbon atoms, excluding other elements.

In addition, we conducted an extensive analysis of several physicochemical properties of the linkers in the PROTAC-DB, including molecular weight, AlogP, number of rings, number of rotatable bonds, and the number of hydrogen bond acceptors and donors, as shown in Fig. 1. Our findings indicate that the molecular weights of the linkers predominantly center around 200 Daltons. Furthermore, most linkers lack rings, and the number of rotatable bonds is typically between 7 and 10. Besides, we have calculated the distance between the anchors that connect the linker, which is particularly relevant to protein–protein interactions, as shown in Fig. 1. The anchor distance for most linkers is around 7–10 Å.

Figure 1. Distributions of the physicochemical and drug-like properties of the linkers in PROTAC-DB. From left to right and top to bottom, the properties include anchor distance, molecular weight, AlogP, number of rotatable bonds, number of rings, and the number of hydrogen bond acceptors and donors, respectively.

Our results indicate that existing PROTAC molecule linkers often contain amides or polyethylene glycol (PEG), with more than one-third of these molecules incorporating each of such functional groups. Amides offer several advantages, including good biocompatibility, as peptide bonds naturally occur in organisms, and ease of synthesis. PEGs are relatively flexible and have good solubility as well as chemical stability. Furthermore, amides contain both hydrogen bond donors and acceptors, whereas the ether oxygen in PEG can only act as a hydrogen bond acceptor. Consequently, PROTAC molecules tend to have more hydrogen bond acceptors. These factors should be considered when designing linkers.

We hope this analysis provides researchers with a better understanding of the parameters to consider in the design of PROTAC linkers.

However, it is important to note that these PROTACs exist only in a 2D format, lacking experimental 3D structures. To address this limitation, we generated the 3D structures of the PROTACs computationally. For simplicity and to strike a balance between time and precision, we employed a random structure generation approach using LigPrep in Schrödinger [41]. It is worth acknowledging that this method does not consider protein–protein or protein–ligand interactions and relies only on local minima, potentially introducing some bias.

Another critical challenge we encountered was how to divide the PROTACs into appropriate ligands and linkers. Since PROTAC-DB provides the division, the problem is akin to sub-graph matching. To solve it, we employed the subgraph isomorphism [42] module in NetworkX [43]. The linkers and PROTACs were initially converted into graph objects and then iteratively matched with each other. This iterative process resulted in mapping identities. However, due to various issues such as patterns not found in PROTACs, cases with no reasonable structures, or PROTACs without linkers, we obtained a final set of 2813 samples for future analysis. The samples were divided into training, validation and test sets at the ratio of 2013:400:400 randomly.
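The matching step can be illustrated with a small, self-contained sketch. It is not the NetworkX matcher used in this work: it is a plain backtracking search, written here only to show what "mapping identities" means. The function names, the element-only atom labels, and the choice to ignore bond orders are all assumptions made for the illustration.

```python
def build_adjacency(n_atoms, bonds):
    """Adjacency sets for an undirected molecular graph given as index pairs."""
    adj = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def match_linker(linker_atoms, linker_bonds, protac_atoms, protac_bonds):
    """Return one mapping {linker atom index: PROTAC atom index}, or None.

    Atoms are compared by element symbol only (an assumption of this sketch);
    every linker bond must also exist between the mapped PROTAC atoms.
    """
    l_adj = build_adjacency(len(linker_atoms), linker_bonds)
    p_adj = build_adjacency(len(protac_atoms), protac_bonds)
    mapping = {}   # linker index -> PROTAC index
    used = set()   # PROTAC atoms already assigned

    def extend(i):
        if i == len(linker_atoms):
            return True
        for cand in range(len(protac_atoms)):
            if cand in used or protac_atoms[cand] != linker_atoms[i]:
                continue
            # Each already-mapped linker neighbour must be bonded to cand.
            if all(mapping[n] in p_adj[cand] for n in l_adj[i] if n in mapping):
                mapping[i] = cand
                used.add(cand)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(cand)
        return False

    return dict(mapping) if extend(0) else None


# Toy example: a C-O-C linker embedded in an N-C-O-C-N chain.
protac_atoms = ["N", "C", "O", "C", "N"]
protac_bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(match_linker(["C", "O", "C"], [(0, 1), (1, 2)], protac_atoms, protac_bonds))
```

Once such a mapping is known, the PROTAC atoms outside the mapped set form the two ligands, and the mapped atoms form the linker. A `None` result corresponds to the "pattern not found" cases that were excluded from the final dataset.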

We conducted a statistical analysis of the atom number distributions within the ZINC, GEOM, and PROTACs datasets. The result is shown in Fig. 3. In general, the ZINC and GEOM datasets have more similarities, while the PROTACs dataset differs significantly. One of the main differences is the average total number of atoms in PROTACs, which is higher than that of ZINC and GEOM. This difference is primarily attributed to the larger number of fragments in PROTACs.

Figure 3. Distributions of the ZINC, GEOM and PROTACs datasets. The figure displays the distributions of total atom number, fragment atom number and linker atom number of the three datasets from left to right.

As for the distribution of the number of linker atoms, PROTACs and GEOM datasets show greater variability. Moreover, the modes of the two datasets are larger than those observed in ZINC. These different distributions across the three datasets emphasize the importance of considering their differences during the learning and generation processes.

Diffusion models

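Diffusion models are a generative framework built from a diffusion (noising) process and a denoising process. The diffusion process gradually adds noise over $T$ steps to a data point $x = [r, h]$, with the time step $t$ running from 0 to $T$, where $M$ is the number of linker atoms, $r_i$ is the coordinates of linker atom $i$ ($i = 1, \dots, M$), and $h_i$ represents the atom-type features of the $i$-th linker atom. At each time step we obtain a noised data point $z_t$, $t = 1, 2, \dots, T$. The size of $z_t$ is identical to that of the input $x$. At time step $T$, we obtain approximately normally distributed noise, $z_T \sim \mathcal{N}(0, I)$. Mathematically, the process can be represented as

$$q(z_t \mid x) = \mathcal{N}(z_t;\ \alpha_t x,\ \sigma_t^2 I) \tag{1}$$

where $q$ is a probability density, $\mathcal{N}$ is a normal distribution, and $x$ is the state of $z_t$ at $t = 0$, i.e. $z_0 = x$. By the reparameterization trick, we get

$$z_t = \alpha_t x + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

According to DDPM [44] and EDM, $\alpha_t = \sqrt{1 - \sigma_t^2}$, with $\alpha_t, \sigma_t \in [0, 1]$; $\sigma_t$ is obtained from the noise schedule in EDM. The process is restricted to a Markov process, which can be written, for $s = t - 1$, as

$$q(z_t \mid z_s) = \mathcal{N}(z_t;\ \alpha_{t|s} z_s,\ \sigma_{t|s}^2 I), \quad \alpha_{t|s} = \frac{\alpha_t}{\alpha_s}, \quad \sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2 \tag{3}$$

The denoising process is the reverse process, i.e. it obtains the noise in each step and removes it. The process can be derived as

$$q(z_s \mid z_t, x) = \mathcal{N}(z_s;\ \mu_{t \to s}(z_t, x),\ \sigma_{t \to s}^2 I), \quad \mu_{t \to s} = \frac{1}{\alpha_{t|s}} z_t - \frac{\sigma_{t|s}^2}{\alpha_{t|s} \sigma_t} \epsilon \tag{4}$$

where $\epsilon$ is the noise added from $x$ to $z_t$, i.e. $x = (z_t - \sigma_t \epsilon) / \alpha_t$, by the reparameterization trick for equation (2), and $\sigma_{t \to s} = \sigma_{t|s} \sigma_s / \sigma_t$.

We use the simplified objective function from DDPM, $\mathcal{L} = \lVert \epsilon - \hat{\epsilon}_t \rVert^2$, to learn $\hat{\epsilon}_t$, and finally obtain the estimated denoising process,

$$p(z_s \mid z_t) = \mathcal{N}(z_s;\ \hat{\mu}_{t \to s}(z_t),\ \sigma_{t \to s}^2 I) \tag{5}$$

where

$$\hat{\mu}_{t \to s}(z_t) = \frac{1}{\alpha_{t|s}} z_t - \frac{\sigma_{t|s}^2}{\alpha_{t|s} \sigma_t} \hat{\epsilon}_t, \quad \hat{\epsilon}_t = \varphi(z_t, t)$$

and $\varphi$ is a nonlinear function; here we use the OEGT.

Considering FBDD generation for PROTACs, the condition $u$ is introduced as the ‘fragments’ or ‘ligands’, which contains the coordinates and features of each ligand atom in the same representation as $x$. Thus, the denoising process and the loss function become

$$p(z_s \mid z_t, u) = \mathcal{N}(z_s;\ \hat{\mu}_{t \to s}(z_t, u),\ \sigma_{t \to s}^2 I), \quad \hat{\epsilon}_t = \varphi(z_t, u, t) \tag{6}$$

$$\mathcal{L} = \lVert \epsilon - \varphi(z_t, u, t) \rVert^2 \tag{7}$$

and the initial noise $z_T \sim \mathcal{N}(0, I)$ becomes $z_T \sim \mathcal{N}(f(u), I)$. Here we set $f(u)$ as the center of the condition and move it to zero.

In conclusion, for training, given a training sample $(x, u)$, we first move the center of $u$ to zero and sample the time step $t$ and the noise $\epsilon \sim \mathcal{N}(0, I)$. After obtaining the noised sample $z_t$ (equation (2)) at time step $t$, with the context $u$, we employ OEGT to learn the noise (equation (7)). After training, we get the learned neural network $\varphi$, which can estimate the noise $\hat{\epsilon}_t$. Then for generation, given the context $u$ and the linker size, we first sample a random linker $z_T \sim \mathcal{N}(0, I)$ and move the center of $u$ to zero. Starting from time step $T$ and sample $z_T$, we can ‘denoise’ it iteratively over $T$ steps, i.e. obtain $z_{t-1}$ for each $t = T, \dots, 1$ (equation (6)), and finally get a generated sample $z_0$.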
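The training and generation steps summarized above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: it noises coordinates only (atom features are omitted), it assumes the EDM-style parameterization z_t = alpha_t * x + sigma_t * eps with alpha_t^2 + sigma_t^2 = 1, and the linear `schedule` is a made-up stand-in for the EDM noise schedule. All function names are hypothetical.

```python
import math
import random


def schedule(t, T):
    """Toy variance-preserving schedule (an assumption, not EDM's)."""
    alpha = 1.0 - 0.9 * t / T            # alpha_0 = 1 (clean data), alpha_T = 0.1
    return alpha, math.sqrt(1.0 - alpha * alpha)


def center_on_context(linker, context):
    """Shift all coordinates so the context centre of mass is at the origin."""
    n = len(context)
    com = [sum(p[k] for p in context) / n for k in range(3)]

    def shift(points):
        return [[p[k] - com[k] for k in range(3)] for p in points]

    return shift(linker), shift(context)


def noise_sample(x, t, T, rng):
    """Forward noising as in equation (2): z_t = alpha_t * x + sigma_t * eps."""
    alpha, sigma = schedule(t, T)
    eps = [[rng.gauss(0.0, 1.0) for _ in p] for p in x]
    z = [[alpha * v + sigma * e for v, e in zip(p, q)] for p, q in zip(x, eps)]
    return z, eps


def noise_loss(eps, eps_hat):
    """Squared error between the true and the predicted noise."""
    return sum((a - b) ** 2 for p, q in zip(eps, eps_hat) for a, b in zip(p, q))


def denoise_step(z_t, eps_hat, t, T, rng):
    """One reverse step t -> t - 1 given a noise estimate eps_hat."""
    a_t, s_t = schedule(t, T)
    a_s, s_s = schedule(t - 1, T)
    a_ts = a_t / a_s                               # alpha_{t|s}
    var_ts = s_t ** 2 - a_ts ** 2 * s_s ** 2       # sigma_{t|s}^2
    std = math.sqrt(var_ts) * s_s / s_t            # std of the reverse step
    return [[v / a_ts - var_ts / (a_ts * s_t) * e + std * rng.gauss(0.0, 1.0)
             for v, e in zip(p, q)] for p, q in zip(z_t, eps_hat)]


rng = random.Random(0)
linker, context = center_on_context(
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],            # linker coordinates
    [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]],            # ligand (context) coordinates
)
z1, eps = noise_sample(linker, 1, 10, rng)
# With a perfect noise estimate, the last reverse step returns the clean linker.
restored = denoise_step(z1, eps, 1, 10, rng)
print(max(abs(a - b) for p, q in zip(restored, linker) for a, b in zip(p, q)))
```

In an actual model, `eps_hat` would come from the trained network rather than from the true noise, and `denoise_step` would be called for every step from T down to 1.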

O(3) equivariant graph Transformer

If a function $f: X \to Y$ satisfies $f(T_g(x)) = S_g(f(x))$ for all $g \in G$, where $T_g$ and $S_g$ are two representations of the group element $g$ in group $G$, the function $f$ is equivariant to $G$. For simplicity, the same representation is used on both sides, and the group $O(3)$ is used for equivariance and for all layers of the neural network, which means that $f$ is equivariant to a rotation or reflection $R$, i.e. $f(Rx) = R f(x)$. Igashov et al. [26] have proved that, for a Markovian denoising process, if $p(z_T \mid u)$ in the initial state is $O(3)$-equivariant, and the model in each denoising step is $O(3)$-equivariant, then $p(z_0 \mid u)$ is $O(3)$-equivariant, which means $p(R z_0 \mid R u) = p(z_0 \mid u)$.

We use OEGT to learn the noise, as shown in Fig. 2b. The OEGT, represented as $\varphi$ in equation (7), is trained to estimate the noise $\epsilon$. Since the linker $z_t$ and the condition $u$ have identical representations for atom coordinates $r$ and atom features $h$, we combine them and use OEGT to process them at the same time. However, only the linker atom coordinates and linker atom features are updated; those of the ligands remain unchanged.

Figure 2. Overview of DiffPROTACs and OEGT. (a) The framework of DiffPROTACs, which is generally a diffusion model, containing the diffusion and denoising process. In the two processes, the molecules are noised and denoised step by step. (b) The architecture of OEGT, which combines Transformers and graph neural networks (GNNs) for noise learning, uses the former to update node features and the latter to update molecular coordinates.

$$h^{l+1} = \varphi_h(h^l, D) \tag{8}$$

$$r^{l+1} = \varphi_r(r^l, h^{l+1}, D) \tag{9}$$

where $l$ is the layer index, $\varphi_h$ is Graphormer [36], a Transformer model for graphs, $h$ is the concatenation of the atom features $h_i$, $D$ is the distance matrix for all atoms in a graph, and $r_i$ is the coordinates of atom $i$. $N$ is the total number of atoms, including ligands and linker. $h_i$ represents the one-hot encoding of atom types, and $nf$ is the number of features per atom. The ZINC dataset comprises 8 elements, corresponding to the atom types ‘C, O, N, F, S, Cl, Br, I’. However, in the GEOM dataset, the one-hot encoding expands to include P, resulting in a length of 9.

To be more concrete, equation (8) is Graphormer but slightly different in the attention part, which is:

(10)

where $W_Q$, $W_K$, and $W_V$ are parameters and $\sqrt{d}$ is the scaling factor.

In Transformers, Q, K, and V represent query, key, and value, respectively. A stands for attention score or attention weight. The attention mechanism intuitively involves matching a query with each key and then using an operation (softmax) to enable the query to identify the most matched value. Here in the context of self-attention, Q, K, and V are functions of the feature itself.
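To make the roles of Q, K, V, and A concrete, the following is a minimal sketch of standard scaled dot-product self-attention in plain Python. It deliberately shows only the textbook form: it does not include the modification made to Graphormer's attention in equation (10), nor any distance-dependent term. The helper names and the toy weight matrices are assumptions for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(h, w_q, w_k, w_v):
    """Return (attention weights A, updated features A @ V) for node features h."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d = len(q[0])                                   # feature width used for scaling
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q]
    attn = [softmax(row) for row in scores]         # each row sums to 1
    return attn, matmul(attn, v)


# Two atoms with one-hot features and identity projections (toy values).
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, updated = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
print(weights)
print(updated)
```

Because every query attends to every key, the atoms behave like nodes of a fully connected graph, which is why a Transformer layer can be read as a form of message passing over node features.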

Transformers can be considered as a type of GNN. However, traditional Transformers lack O(3) equivariance due to the dot product attention mechanism. In contrast, the message passing in GNNs inherently maintains O(3) equivariance. Therefore, in our approach, we leverage the dot product attention in Transformers and enhance it with message passing, as equation (9), to ensure O(3) equivariance.

Equation (9) is a message passing neural network [45], a typical GNN architecture. The concrete expression is

$$r_i^{l+1} = r_i^l + \sum_{j \neq i} (r_i^l - r_j^l)\, \varphi_m(h_i^{l+1}, h_j^{l+1}, d_{ij}) \tag{11}$$

where $d_{ij}$ is the distance between $r_i$ and $r_j$, and $\varphi_m$ represents the aggregation operation, for which a multilayer perceptron is used in this context. $\varphi_r$ in equation (9) denotes the update operation, for which a summation (equation (11)) is employed. It is essential to emphasize that these operations are applied only to the linker update, while the condition component, both its atom features and its atom coordinates, remains unchanged.

Under rotation and reflection of a molecule, the distance of each atom pair is invariant. When a rotation or reflection $R$ is applied to the coordinates ($r \to R r$), the output of equation (9) undergoes the same transformation ($r^{l+1} \to R r^{l+1}$), which preserves the O(3) equivariance. Equation (8) does not depend on the coordinates and, thanks to the separation of the features and the coordinates, $\varphi_h$ can be any function that learns the new atom features from the old features and the distance matrix. The detailed parameters are given in Supplementary Table S2.
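The equivariance argument can be checked numerically with a small sketch. The update below moves each atom along the difference vectors to all other atoms, weighted by a function of the pairwise distance only. The weight function is a hypothetical stand-in for the learned multilayer perceptron, atom features are left out, and all atoms are updated (in the model only linker atoms are), so this illustrates the principle rather than the trained layer.

```python
import math


def pair_weight(d):
    """Hypothetical stand-in for the learned MLP; depends only on the distance."""
    return math.tanh(d) / (d + 1.0)


def update_coords(coords):
    """Move each atom by a distance-weighted sum of difference vectors."""
    new = []
    for i, ri in enumerate(coords):
        delta = [0.0, 0.0, 0.0]
        for j, rj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(ri, rj)]
            w = pair_weight(math.sqrt(sum(c * c for c in diff)))
            delta = [s + w * c for s, c in zip(delta, diff)]
        new.append([a + s for a, s in zip(ri, delta)])
    return new


def transform(matrix, coords):
    """Apply a 3x3 matrix to every coordinate."""
    return [[sum(m * c for m, c in zip(row, p)) for row in matrix] for p in coords]


theta = 0.7
c, s = math.cos(theta), math.sin(theta)
# Rotation about z combined with a reflection through the xy-plane (det = -1).
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, -1.0]]
atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.0, 1.2, -0.4], [-0.3, 0.8, 2.0]]

first_transform = update_coords(transform(R, atoms))
first_update = transform(R, update_coords(atoms))
gap = max(abs(a - b) for p, q in zip(first_transform, first_update)
          for a, b in zip(p, q))
print(gap)  # difference between the two orders of operations
```

The two orders of operations agree up to floating-point error because the weights depend only on distances, which are unchanged by R, while the difference vectors transform linearly with R.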

Metrics

Similar to many molecular generation tasks, our evaluation metrics include validity, uniqueness, and recovery. To assess the results, we employ a process where we sample 100 conformations for each input ligand pair in the test set and subsequently calculate the following metrics:

Validity

This metric assesses the reasonableness of the generated molecule, specifically whether it resides within the chemical space. In our work, we utilize OpenBabel [46] to compute the bonds within the generated molecules, while RDKit [47] is employed to assess compliance with valency rules. Additionally, validity encompasses the absence of dissociative atoms and the presence of the specified fragments. Given that the fragments serve as conditions and remain unaltered during learning, our assessment focuses on detecting any detached atoms.
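The detached-atom part of the validity check amounts to testing whether the molecular graph is connected. A minimal sketch, assuming the bonds have already been perceived (in this work that is done with OpenBabel) and are given as pairs of atom indices, could look as follows. The function name is hypothetical and valency rules are not checked here.

```python
from collections import deque


def has_detached_atoms(n_atoms, bonds):
    """Return True if the bond graph is not one connected component."""
    if n_atoms == 0:
        return False
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    seen = {0}
    queue = deque([0])
    while queue:                      # breadth-first search from atom 0
        current = queue.popleft()
        for neighbour in adj[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) != n_atoms


print(has_detached_atoms(4, [(0, 1), (1, 2), (2, 3)]))  # connected chain
print(has_detached_atoms(4, [(0, 1), (2, 3)]))          # two separate pieces
```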

Uniqueness

To determine uniqueness, we consider the ratio of unrepeated generated molecules.

Since defining whether two molecules are ‘repeated’ or ‘the same’ in 3D space can be challenging, we use the canonical SMILES of each molecule. Canonical SMILES, generated by a specific software package, ensure uniqueness in 2D space. The uniqueness for input $i$ is

$$\text{Uniqueness}_i = \frac{N_{\text{unique}}^{(i)}}{N_{\text{valid}}^{(i)}}$$

where $N_{\text{unique}}^{(i)}$ and $N_{\text{valid}}^{(i)}$ are the number of unique SMILES and the number of valid SMILES of the generated molecules for input $i$.
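As a worked illustration of this ratio, the sketch below computes uniqueness from lists of SMILES strings. It assumes the strings are already canonical and already filtered for validity. Averaging the per-input ratios over all inputs is an assumption of this sketch, since the aggregation across inputs is not spelled out here, and the function names are hypothetical.

```python
def uniqueness_per_input(valid_smiles):
    """Unique valid SMILES divided by all valid SMILES for one input."""
    if not valid_smiles:
        return 0.0
    return len(set(valid_smiles)) / len(valid_smiles)


def mean_uniqueness(samples_by_input):
    """Average of the per-input ratios (aggregation is an assumption)."""
    ratios = [uniqueness_per_input(s) for s in samples_by_input if s]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Four valid samples for one ligand pair, one of them repeated.
print(uniqueness_per_input(["CCO", "CCO", "CCN", "CCC"]))
print(mean_uniqueness([["CCO", "CCO"], ["CCN", "CCC"]]))
```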

Recovery

The recovery metric evaluates the ratio of matched molecules, indicating whether the generated samples contain molecules identical to the original ones. Similar to the uniqueness assessment, canonical SMILES are employed to facilitate this comparison, as they define the identity of molecules in 2D space. The equation of recovery is

$$\text{Recovery} = \frac{N_{\text{recovered}}}{N}$$

where $N_{\text{recovered}}$ is the number of inputs for which the generated samples contain a SMILES identical to that of the original molecule, and $N$ is the number of inputs.
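The recovery ratio can be sketched in the same style. The function below assumes canonical SMILES strings, one reference molecule per input, and a list of generated SMILES per input. The name and signature are hypothetical.

```python
def recovery(reference_smiles, generated_by_input):
    """Fraction of inputs whose reference SMILES appears among the samples."""
    if not reference_smiles:
        return 0.0
    hits = sum(
        1
        for reference, generated in zip(reference_smiles, generated_by_input)
        if reference in set(generated)
    )
    return hits / len(reference_smiles)


# The first input is recovered, the second is not.
print(recovery(["CCO", "CCN"], [["CCO", "CCC"], ["CCC"]]))
```

Under this definition, a recovery of 73.25% on a test set of 400 inputs would correspond to 293 recovered inputs (293 / 400 = 0.7325).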

Results

Framework of DiffPROTACs

As shown in Fig. 2a, the training and generation of DiffPROTACs rely on the diffusion and denoising processes, respectively. In the diffusion process, noise is added stepwise to the molecular sample data, specifically to the linker part. After $T$ steps, the linker is approximately normally distributed. The noise introduced during this process serves as the learning target of the network module, OEGT. Notably, the fragment or ligand part serves as contextual information and remains unchanged throughout the process. The denoising process gradually restores the linker part by removing the learned noise.

To facilitate noise learning, we introduce the OEGT module (Fig. 2b), which integrates both Transformer and GNN components in one block. This module divides the update of the molecules into two different processes: the update of the node features and the update of the coordinates. Namely, the Transformer encoder (Graphormer [36]) is used to extract the node features and the GNN updates the coordinates of the nodes. Several OEGT blocks are stacked for learning.

The center of the mass (CoM [32]) of the fragments is first moved to the origin. CoM and the message passing of GNN for the coordinates in OEGT guarantee that the whole process is equivariant to O(3) group, i.e. rotation and reflection.

DiffPROTACs was then trained and tested on the ZINC and GEOM datasets and further fine-tuned on the PROTACs dataset.

Results of DiffPROTACs on ZINC and GEOM

DiffPROTACs was trained and tested on ZINC and GEOM to evaluate its ability to generate traditional small molecules, and the results are shown in Table 2. ZINC and GEOM [26] are two molecular fragment datasets, which are decomposed computationally from the ZINC20 [38] and GEOM [39] databases, respectively. Each ZINC sample comprises two fragments and one linker, whereas each GEOM sample consists of a minimum of three fragments.

Table 2.

Performance metrics of different methods on ZINC and GEOM datasets

Dataset	Method	Validity (%)	Uniqueness (%)	Recovery (%)
ZINC	DeLinker	99.15	40.86	33.80
	3DLinker	98.39	56.69	28.00
	DiffLinker	97.69	30.77	74.00
	DiffPROTACs	97.60	28.23	73.25
GEOM	DiffLinker	95.96	32.95	86.12
	DiffPROTACs	96.84	36.81	84.34

We found that DeLinker [24] and 3DLinker [25] were not directly compatible with our ZINC dataset (they failed to produce results for some inputs). Therefore, we used their original datasets, which are also subsets of ZINC20, for testing and obtained the results in Table 2. The validity of all four methods is at a comparable level, with DeLinker and 3DLinker slightly outperforming the other two. However, it is important to note that DeLinker can only generate 2D structures of molecules, and 3DLinker requires additional anchor (exit vector) information for molecule reconstruction. While DeLinker and 3DLinker exhibit greater uniqueness, DiffLinker and DiffPROTACs surpass them in recovery. This indicates that the latter two methods excel at recovering most molecules in our test set from the original fragment information.

For the study on the dataset GEOM, DeLinker and 3DLinker can only link two fragments with one generated linker. Therefore, only DiffLinker and DiffPROTACs are tested on GEOM. Of these, DiffPROTACs exhibits superior performance in terms of validity and uniqueness, albeit with a slightly lower recovery rate than DiffLinker. These results suggest that DiffPROTACs is indeed a strong contender in the current landscape of methods.

Results of DiffPROTACs on PROTACs

We created the PROTACs dataset using PROTAC-DB 2.0 [40] for testing purposes; however, the size of the dataset is relatively small compared to other datasets. As a result, we leveraged models pretrained on other datasets to overcome this limitation. Notably, the ZINC dataset lacks the phosphorus (P) element, whereas GEOM includes it. Observations indicate that the linker distribution in the GEOM dataset closely resembles that in the PROTAC dataset (Fig. 3), thereby suggesting that models pretrained on GEOM are more suitable for PROTAC generation. Also, it is worth noting that 3DLinker has an atom limit of 48 for fragments, while most PROTACs have ligands with atom numbers that exceed this limit. Therefore, we only tested on DiffLinker and DiffPROTACs for the generation task. The results are shown in Table 3.

Table 3.

Performance metrics of different methods on PROTACs dataset

Method	Validity (%)	Uniqueness (%)	Recovery (%)
DiffLinker (GEOM)	53.55	48.54	3.5
DiffPROTACs (GEOM)	34.32	32.45	4.25
DiffPROTACs-fine-tuning	93.86	68.7	43.75

The probability of generating failed molecules increases with the number of PROTAC atoms, as is evident from the increasing proportion of invalid molecules in the distribution (see Supplementary Fig. S1). The two models perform effectively on the ZINC and GEOM datasets but exhibit limitations on PROTACs, especially when the total atom number exceeds 50 (Supplementary Fig. S1). This could be because almost all ZINC and GEOM data fall below this threshold (Fig. 3). Moreover, PROTACs exhibit distinct characteristics compared to the traditional small molecules in the ZINC and GEOM datasets. We believe that this divergence in characteristics and distributions among the datasets is one major reason. We then fine-tuned our model on the PROTACs dataset, resulting in significant performance improvements, as demonstrated in Fig. 4.

Figure 4. Distribution of the generated results from DiffLinker and DiffPROTACs-finetuning. The top three panels display the distribution of molecules generated by DiffLinker, while the bottom three display that of DiffPROTACs-finetuning. Each panel presents the relationship between the number of atoms and the number of generated molecules, including the total count and the non-valid molecules. The three columns, from left to right, represent the distributions of PROTAC atom number, fragment atom number, and linker atom number, respectively.

We conducted a statistical analysis of the generated PROTACs, a unique class of molecules that often do not conform to the rule of five. We compared these results with the original training samples, and Fig. 5 shows that the two distributions overlap closely. The properties of the molecules generated by DiffPROTACs are therefore highly similar to those of real PROTACs (ground truth), which highlights the potential of DiffPROTACs.

Figure 5. Distributions of rule of five of the test data and generated PROTAC data. The figure shows the distributions of molecular weight, AlogP, hydrogen bond acceptors, hydrogen bond donors and rotatable bonds for true PROTACs in training samples and generated PROTACs for test input ligands. The distributions exhibit a high degree of overlap, highlighting the potential of generated PROTACs by DiffPROTACs to closely resemble true PROTACs.

Case study: BRD4-PROTAC-VHL (PDB codes: 8BEB, 8BDT)

Krieger et al. explored the implications of different VHL binders and exit vectors (anchors) for BRD4 degraders [48]. They deposited two recently released structures, 8BEB and 8BDT, in the PDB. It is important to note that the training dataset comprises simulated data, whereas the case studies involve experimental data, and that even the 2D structures of these PROTACs are absent from our training data. The two structures share the same VHL ligand and target warhead but differ in their linker patterns. We set the linker length for both and obtained the generation results shown in Fig. 6. Each generated structure is displayed in Supplementary Figs S2 and S3. The spatial structures of the generated linkers closely resemble the crystal structures (Fig. 6a and c). DiffPROTACs successfully recovered the PROTACs (with the same molecular formula) in the two test structures (PDB codes: 8BEB and 8BDT) and reproduced their conformations in the complex structures with the E3 ligases and targets, with RMSD values of 0.25 Å and 0.53 Å for 8BEB and 8BDT, respectively (Fig. 6b and d). To assess the conformational plausibility of the generated linkers, we computed potential energy landscapes for the linkers of the PROTACs in 8BDT and 8BEB by systematic conformational search simulations (Supplementary Fig. S4). The generated and experimentally determined conformations of these linkers were then projected onto the energy landscapes. Both lie in relatively low-energy states with similar values, indicating that DiffPROTACs is capable of generating native-like conformations of PROTAC linkers.

Figure 6. Case studies: PROTACs and linkers generated by DiffPROTACs for BRD4-PROTAC-VHL. (a) Generated linkers of PROTACs for the protein target and E3 in PDB 8BDT. (b) The alignment of the generated conformation and the experimentally determined conformation of the linker in PDB 8BDT, with an RMSD of 0.53 Å. (c) Generated linkers of PROTACs for the protein target and E3 in PDB 8BEB. (d) The alignment of the generated conformation and the experimentally determined conformation of the linker in PDB 8BEB, with an RMSD of 0.25 Å.
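
As a reminder of what the reported RMSD values measure, the sketch below computes the root-mean-square deviation between two sets of linker atom coordinates. It assumes that atoms are listed in matching order and that both conformations are already expressed in the same frame, which is plausible here because the ligand coordinates are fixed inputs, but the text does not state how the RMSD was computed. The coordinates are invented for illustration.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(generated: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root-mean-square deviation (same units as the input, e.g. angstroms).

    No superposition is performed: both coordinate sets are assumed to be in
    the same frame, with atom i of one set matching atom i of the other.
    """
    if len(generated) != len(reference) or not generated:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(generated, reference))
    return math.sqrt(squared / len(generated))


# Toy three-atom linker; two atoms are displaced by 0.3 angstroms each.
reference_linker = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
generated_linker = [(0.0, 0.0, 0.3), (1.5, 0.3, 0.0), (3.0, 0.0, 0.0)]
toy_rmsd = rmsd(generated_linker, reference_linker)
print(f"RMSD = {toy_rmsd:.3f}")
```

If the two conformations were not in a common frame, an optimal superposition (for example with the Kabsch algorithm) would have to be applied before this step.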

Moreover, for 8BDT, when provided with the crystal structures of the ligand and warhead, DiffPROTACs identified the exit vector observed in 8BDT rather than an alternative exit vector that is mentioned by Krieger et al. but has no deposited PDB structure, presumably because of its relatively weak binding affinity to VHL. These results highlight that DiffPROTACs can effectively predict the approximate structure and linker composition when given accurate structural information for the ligands of PROTACs.

Generated database of DiffPROTACs

DiffPROTACs requires knowledge of the linker size, which is often unknown in practical scenarios. To address this challenge, we generated a range of linkers with varying sizes. Subsequently, we created an extensive dataset comprising the entire set of PROTACs (training, validation, and test) with linker lengths ranging from 5 to 28. This resulted in a dataset containing 2 601 818 PROTACs, organized by PROTAC-DB ID, with non-valid entries removed.
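
The size sweep described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the pipeline used to build the database: `sample_linker` is a hypothetical callable standing in for the generator (it is not part of the DiffPROTACs code base), it is assumed to return `None` for a non-valid result, and one sample per size is drawn purely for brevity.

```python
from typing import Callable, Dict, Iterable, List, Optional


def sweep_linker_sizes(
    protac_ids: Iterable[str],
    sample_linker: Callable[[str, int], Optional[str]],
    min_size: int = 5,
    max_size: int = 28,
) -> Dict[str, List[str]]:
    """Generate one candidate per linker size and keep only valid ones."""
    database: Dict[str, List[str]] = {}
    for protac_id in protac_ids:
        kept: List[str] = []
        for size in range(min_size, max_size + 1):  # sizes 5..28 inclusive
            candidate = sample_linker(protac_id, size)
            if candidate is not None:  # drop non-valid entries
                kept.append(candidate)
        database[protac_id] = kept  # results organized by PROTAC identifier
    return database


def toy_sampler(protac_id: str, size: int) -> Optional[str]:
    """Hypothetical stand-in for the generator: fails on sizes divisible by 7."""
    return None if size % 7 == 0 else "C" * size


toy_database = sweep_linker_sizes(["demo-1", "demo-2"], toy_sampler)
print({pid: len(linkers) for pid, linkers in toy_database.items()})
```

With the toy sampler, 24 sizes are tried per identifier and the four sizes divisible by 7 are discarded, leaving 20 entries each.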

We computed the physicochemical or drug-like properties of the generated linkers, including molecular weight, AlogP, number of rings, number of rotatable bonds, and the number of hydrogen bond acceptors and donors, as shown in Fig. 7. In addition, we measured the distance between the anchor points for linker evaluation.

Figure 7. Distributions of the physicochemical or drug-like properties of the generated linkers. From left to right and top to bottom, the properties include anchor distance, molecular weight, AlogP, number of rotatable bonds, number of rings, and the number of hydrogen bond acceptors and donors, respectively.

We evaluated the diversity of the linkers generated by DiffPROTACs by calculating their pairwise Tanimoto similarity, as shown in Fig. 8. Specifically, deduplication of the 2 601 818 generated linkers yielded 1 724 424 unique linkers. From these unique linkers, we randomly sampled 10 000 linkers three times and calculated the Tanimoto coefficients of their atom-pair molecular fingerprints using the fingerprint similarity panel in Maestro (Schrödinger Inc.).

Figure 8. Heatmaps of fingerprint similarities (Tanimoto similarity) of linkers generated by DiffPROTACs. After deduplication, all generated linkers were randomly sampled three times, with 10 000 samples taken each time, to obtain the results.

As illustrated in Fig. 8, the extensive dark blue regions indicate low similarity between the linkers, demonstrating that DiffPROTACs can generate highly diverse linkers.
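
The diversity analysis rests on the Tanimoto coefficient, which for two fingerprints with on-bit sets A and B equals |A ∩ B| / |A ∪ B|. The sketch below walks through the same three steps (deduplicate, sample, compare) on toy data. It is not the Maestro workflow used in the study: the fingerprints are hand-written sets of integers rather than atom-pair fingerprints, and the names `tanimoto` and `similarity_matrix` are illustrative.

```python
import random
from typing import Dict, FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical (assumption)
    return len(a & b) / union


def similarity_matrix(
    names: Sequence[str], fingerprints: Dict[str, FrozenSet[int]]
) -> List[List[float]]:
    """Pairwise Tanimoto similarities, suitable for plotting as a heatmap."""
    return [[tanimoto(fingerprints[p], fingerprints[q]) for q in names] for p in names]


# Toy fingerprints keyed by linker identifier (invented values).
toy_fingerprints = {
    "L1": frozenset({1, 2, 3, 4}),
    "L2": frozenset({3, 4, 5, 6}),
    "L3": frozenset({7, 8}),
}
generated = ["L1", "L2", "L2", "L3", "L1"]

unique = sorted(set(generated))              # deduplication
subset = random.Random(0).sample(unique, 3)  # seeded random subsample
matrix = similarity_matrix(subset, toy_fingerprints)
print(subset, matrix)
```

Low off-diagonal values in such a matrix correspond to the dark regions of a similarity heatmap and indicate dissimilar linkers.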

Through this analysis, we aim to create a general outline for this database, providing insights for future researchers to better utilize it.

This dataset, enriched with patterns learned from DiffPROTACs, is a valuable addition to the existing PROTAC dataset, which is limited due to the resource-intensive and time-consuming nature of experimental work. It can also be used as a screening library to facilitate and advance research in this field. The generated database can be downloaded at https://bailab.siais.shanghaitech.edu.cn/service/DiffPROTACs-generated.tgz.

Discussion

We present a novel diffusion model called DiffPROTACs, which leverages Transformers and GNNs to learn noise and generate new PROTAC linkers based on provided ligands. To incorporate the inductive bias of O(3) equivariance into molecular generation, we introduce the OEGT module, which combines Transformers with GNNs. In this architecture, Transformers update nodes, and GNNs update the coordinates of PROTAC atoms. DiffPROTACs competes effectively with existing models in the field of FBDD, demonstrating comparable performance in generating traditional small molecules after learning from the ZINC and GEOM datasets. To account for the differences between the PROTACs dataset and existing datasets, we fine-tuned the model on PROTACs data and achieved a validity rate of 93.86% for the generated PROTACs. We also provide a database of generated PROTACs for further research and investigation.
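
To illustrate what O(3) equivariance of a coordinate update means in practice, the sketch below applies a generic EGNN-style update, x_i' = x_i + Σ_j (x_i − x_j)·w(d_ij), and checks numerically that transforming the input by an orthogonal matrix Q and then updating gives the same result as updating and then transforming. This is not the OEGT module: the weight function `toy_weight` is a fixed, hypothetical stand-in for what would be a learned function in a trained model, and all coordinates are invented.

```python
import math
from typing import Callable, List

Vec = List[float]


def coordinate_update(coords: List[Vec], weight: Callable[[float], float]) -> List[Vec]:
    """EGNN-style update: x_i' = x_i + sum_{j != i} (x_i - x_j) * weight(d_ij)."""
    updated = []
    for i, xi in enumerate(coords):
        shift = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            w = weight(math.dist(xi, xj))  # depends only on the distance
            for k in range(3):
                shift[k] += (xi[k] - xj[k]) * w
        updated.append([xi[k] + shift[k] for k in range(3)])
    return updated


def transform(matrix: List[Vec], coords: List[Vec]) -> List[Vec]:
    """Apply a 3x3 matrix to every coordinate."""
    return [[sum(matrix[r][c] * x[c] for c in range(3)) for r in range(3)] for x in coords]


def toy_weight(distance: float) -> float:
    """Hypothetical fixed weight; a real model would learn this function."""
    return 0.1 / (1.0 + distance * distance)


theta = 0.7
# Rotation about z combined with a reflection of z: orthogonal, determinant -1.
Q = [
    [math.cos(theta), -math.sin(theta), 0.0],
    [math.sin(theta), math.cos(theta), 0.0],
    [0.0, 0.0, -1.0],
]
atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.2, 0.4], [0.2, 2.0, -0.8]]

update_after_transform = coordinate_update(transform(Q, atoms), toy_weight)
transform_after_update = transform(Q, coordinate_update(atoms, toy_weight))
max_gap = max(
    abs(a - b)
    for p, q in zip(update_after_transform, transform_after_update)
    for a, b in zip(p, q)
)
print(f"largest coordinate difference: {max_gap:.2e}")
```

The two results agree up to floating-point rounding because the weights depend only on interatomic distances, which orthogonal transformations preserve, while the displacement vectors transform linearly with Q.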

DiffPROTACs uses only the Euclidean distance as the edge feature in the Transformer component of OEGT, unlike the original Graphormer, which includes additional features such as shortest path and centrality encoding for node features. In addition, Graphormer-GD [49] adds resistance distance to its model. These extra features serve as priors for the neural network and can potentially improve performance. Despite not having these additional priors, DiffPROTACs demonstrates competitive performance, running neck and neck with other methods. This highlights the substantial potential of DiffPROTACs.

The patterns of the PROTACs differ from those in the current FBDD datasets. This divergence should be considered in future research. Utilizing more appropriate datasets as a pre-training set could potentially lead to better results.

The case study highlights the importance of the structure of the ligand. The precise positioning of the ligands in binding to the target and the E3 ligase appears to be a critical factor. Therefore, the development and application of an appropriate docking method to accurately predict these interactions is warranted.

In conclusion, our DiffPROTACs model introduces an innovative approach that combines Transformer and GNN, offering a valuable tool for the generation of PROTACs. This model has the potential to advance research in the field and may contribute to the acceleration of the discovery and development of PROTACs.

Key Points

DiffPROTACs employs the OEGT module, integrating GNN and Transformer architectures to ensure rotational equivariance within the model.
DiffPROTACs introduces a novel diffusion model for generating PROTACs, capable of generating unique linkers based on the spatial structure of the warhead and ligand.
DiffPROTACs is utilized to construct a comprehensive database for PROTAC research, serving as a screening library to facilitate and enhance research efforts in this domain.
DiffPROTACs demonstrates comparable performance to the current state-of-the-art model on FBDD data, achieving a remarkable 93.86% validity in PROTAC generation.

Supplementary Material

PROTACs-SI-final-v2_bbae358

protacs-si-final-v2_bbae358.docx (1.3 MB, docx)

Acknowledgements

We thank Ilia Igashov (author of DiffLinker) and Yinan Huang (author of 3DLinker) for their invaluable assistance in addressing technical issues. We thank Shenghua Gao and Zibo Zhao for insightful discussions.

Contributor Information

Fenglei Li, Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China.

Qiaoyu Hu, Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, 3663 Zhongshan North Road, Putuo District, Shanghai 200062, China.

Yongqi Zhou, Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; School of Life Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China.

Hao Yang, Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; School of Life Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China.

Fang Bai, Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; School of Life Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China; Shanghai Clinical Research and Trial Center, 1599 Keyuan Road, Pudong New Area, Shanghai, 201210, China.

Supplementary data

Supplementary data is available at Briefings in Bioinformatics online. The supporting information file comprises four sections: (1) Ratio distributions of DiffLinker and DiffPROTACs-finetuning results, (2) Case Studies: BRD4-PROTAC-VHL, (3) Discussion on the anchors of PROTACs, and (4) Main settings of DiffPROTACs. The titles of the figures provided are as follows: Figure S1: Ratio distributions of DiffLinker and DiffPROTACs-finetuning results. Figure S2: Generated PROTAC samples of 8BDT. Figure S3: Generated PROTAC samples of 8BEB. Figure S4: Potential energy landscape of the molecular conformations of studied linkers. Figure S5: Potential anchor for PROTAC with PDB code 8BDT. Figure S6: Generated linkers for PROTAC45. Figure S7: Generated PROTAC samples for BRD4(BD2) and VHL based on the simulated ternary complex structure of PROTAC45. The titles of the tables included are: Table S1: Parameters used for generating the potential energy landscape for the conformations of linkers of PROTAC in the structures of 8BEB and 8BDT. Table S2: Main settings of DiffPROTACs.

Funding

This work was supported by Shanghai Science and Technology Development Funds (grant IDs: 22ZR1441400 and 20QA1406400), the National Key R&D Program of China (grant IDs: 2022YFC3400501 and 2022YFC3400500), the National Natural Science Foundation of China (grant ID: 82003654), start-up package from ShanghaiTech University, and Shanghai Frontiers Science Center for Biomacromolecules and Precision Medicine at ShanghaiTech University.

Conflict of interest: None declared.

Author contributions

F.L. constructed and validated the model. F.L. drafted the manuscript. Q.H. validated the model together with F.L. Y.Z. and H.Y. performed analysis for the case studies. F.B. designed the whole project and revised the manuscript. All authors read and approved the final manuscript.

Data availability

The data, source code of DiffPROTACs and the trained models are available at https://github.com/Fenglei104/DiffPROTACs.

References

1. Sakamoto KM, Kim KB, Kumagai A. et al. Protacs: chimeric molecules that target proteins to the Skp1–Cullin–F box complex for ubiquitination and degradation. Proc Natl Acad Sci 2001;98:8554–9. 10.1073/pnas.141230798.
2. Hammoudeh DI, Follis AV, Prochownik EV. et al. Multiple independent binding sites for small-molecule inhibitors on the oncoprotein c-Myc. J Am Chem Soc 2009;131:7390–401. 10.1021/ja900616b.
3. An S, Fu L. Small-molecule PROTACs: an emerging and promising approach for the development of targeted therapy drugs. EBioMedicine 2018;36:553–62. 10.1016/j.ebiom.2018.09.005.
4. Burslem GM, Crews CM. Proteolysis-targeting chimeras as therapeutics and tools for biological discovery. Cell 2020;181:102–14. 10.1016/j.cell.2019.11.031.
5. Nowak RP, DeAngelo SL, Buckley D. et al. Plasticity in binding confers selectivity in ligand-induced protein degradation. Nat Chem Biol 2018;14:706–14. 10.1038/s41589-018-0055-y.
6. Troup RI, Fallan C, Baud MGJ. Current strategies for the design of PROTAC linkers: a critical review. Explor Target Antitumor Ther 2020;1:273–312. 10.37349/etat.2020.00018.
7. Gadd MS, Testa A, Lucas X. et al. Structural basis of PROTAC cooperative recognition for selective protein degradation. Nat Chem Biol 2017;13:514–21. 10.1038/nchembio.2329.
8. Békés M, Langley DR, Crews CM. PROTAC targeted protein degraders: the past is prologue. Nat Rev Drug Discov 2022;21:181–200. 10.1038/s41573-021-00371-6.
9. Mullard A. Targeted protein degraders crowd into the clinic. Nat Rev Drug Discov 2021;20:247–50. 10.1038/d41573-021-00052-4.
10. Li W, Zhang J, Guo L. et al. Importance of three-body problems and protein–protein interactions in proteolysis-targeting chimera modeling: insights from molecular dynamics simulations. J Chem Inf Model 2022;62:523–32. 10.1021/acs.jcim.1c01150.
11. Liao J, Nie X, Unarta IC. et al. In silico modeling and scoring of PROTAC-mediated ternary complex poses. J Med Chem 2022;65:6116–32. 10.1021/acs.jmedchem.1c02155.
12. Dixon T, MacPherson D, Mostofian B. et al. Predicting the structural basis of targeted protein degradation by integrating molecular dynamics simulations with structural mass spectrometry. Nat Commun 2022;13:5884. 10.1038/s41467-022-33575-4.
13. Guo WH, Qi X, Yu X. et al. Enhancing intracellular accumulation and target engagement of PROTACs with reversible covalent chemistry. Nat Commun 2020;11:1–16. 10.1038/s41467-020-17997-6.
14. Liu H, Mi Q, Ding X. et al. Discovery and characterization of novel potent BCR-ABL degraders by conjugating allosteric inhibitor. Eur J Med Chem 2022;244:114810. 10.1016/j.ejmech.2022.114810.
15. Mai H, Zimmer MH, Miller TF. et al. Exploring PROTAC cooperativity with coarse-grained alchemical methods. J Phys Chem B 2023;127:446–55. 10.1021/acs.jpcb.2c05795.
16. Bai N, Miller SA, Andrianov GV. et al. Rationalizing PROTAC-mediated ternary complex formation using Rosetta. J Chem Inf Model 2021;61:1368–82. 10.1021/acs.jcim.0c01451.
17. Weng G, Li D, Kang Y. et al. Integrative modeling of PROTAC-mediated ternary complexes. J Med Chem 2021;64:16271–81. 10.1021/acs.jmedchem.1c01576.
18. Drummond ML, Henry A, Li H. et al. Improved accuracy for modeling PROTAC-mediated ternary complex formation and targeted protein degradation via new in silico methodologies. J Chem Inf Model 2020;60:5234–54. 10.1021/acs.jcim.0c00897.
19. Zaidman D, Prilusky J, London N. PRosettaC: Rosetta based modeling of PROTAC mediated ternary complexes. J Chem Inf Model 2020;60:4894–903. 10.1021/acs.jcim.0c00589.
20. Weng G, Shen C, Cao D. et al. PROTAC-DB: an online database of PROTACs. Nucleic Acids Res 2021;49:D1381–7. 10.1093/nar/gkaa807.
21. Li F, Hu Q, Zhang X. et al. DeepPROTACs is a deep learning-based targeted degradation predictor for PROTACs. Nat Commun 2022;13:7133. 10.1038/s41467-022-34807-3.
22. Zheng S, Tan Y, Wang Z. et al. Accelerated rational PROTAC design via deep learning and molecular simulations. Nat Mach Intell 2022;4:739–48. 10.1038/s42256-022-00527-y.
23. Nori D, Coley CW, Mercado. De novo PROTAC design using graph-based deep generative models. arXiv preprint. arXiv: 2211.02660. 2022. http://arxiv.org/abs/2211.02660.
24. Imrie F, Bradley AR, Schaar M. et al. Deep generative models for 3D linker design. J Chem Inf Model 2020;60:1983–95. 10.1021/acs.jcim.9b01120.
25. Huang Y, Peng X, Ma J. et al. 3DLinker: an E(3) equivariant variational autoencoder for molecular linker design. arXiv preprint. arXiv: 2205.07309. 2022. http://arxiv.org/abs/2205.07309.
26. Igashov I, Stärk H, Vignac C. et al. Equivariant 3D-conditional diffusion models for molecular linker design. arXiv preprint. arXiv: 2210.05274. 2022. http://arxiv.org/abs/2210.05274.
27. Sohl-Dickstein J, Weiss EA, Maheswaranathan N. et al. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint. arXiv: 1503.03585. 2015. http://arxiv.org/abs/1503.03585.
28. Croitoru F-A, Hondru V, Ionescu RT. et al. Diffusion models in vision: a survey. IEEE Trans Pattern Anal Mach Intell 2023;45:10850–69. 10.1109/TPAMI.2023.3261988.
29. Zhang C, Zhang C, Zheng S. et al. A survey on audio diffusion models: text to speech synthesis and enhancement in generative AI. arXiv preprint. arXiv: 2303.13336. 2023. http://arxiv.org/abs/2303.13336.
30. Zhu Y, Zhao Y. Diffusion models in NLP: a survey. arXiv preprint. arXiv: 2303.07576. 2023. http://arxiv.org/abs/2303.07576.
31. Zhang M, Qamar M, Kang T. et al. A survey on graph diffusion models: generative AI in science for molecule, protein and material. arXiv preprint. arXiv: 2304.01565. 2023. http://arxiv.org/abs/2304.01565.
32. Xu M, Yu L, Song Y. et al. GeoDiff: a geometric diffusion model for molecular conformation generation. arXiv preprint. arXiv: 2203.02923. 2022. http://arxiv.org/abs/2203.02923.
33. Hoogeboom E, Satorras VG, Vignac C. et al. Equivariant diffusion for molecule generation in 3D. arXiv preprint. arXiv: 2203.17003. 2022. http://arxiv.org/abs/2203.17003.
34. Schneuing A, Du Y, Harris C. et al. Structure-based drug design with equivariant diffusion models. arXiv preprint. arXiv: 2210.13695. 2023. http://arxiv.org/abs/2210.13695.
35. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. arXiv preprint. arXiv: 1706.03762. 2017. http://arxiv.org/abs/1706.03762.
36. Ying C, Cai T, Luo S. et al. Do transformers really perform badly for graph representation? arXiv preprint. arXiv: 2106.05234. 2021. https://arxiv.org/abs/2106.05234.
37. Liao YL, Smidt T. Equiformer: equivariant graph attention transformer for 3D atomistic graphs. arXiv preprint. arXiv: 2206.11990. 2023. http://arxiv.org/abs/2206.11990.
38. Irwin JJ, Tang KG, Young J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model 2020;60:6065–73. 10.1021/acs.jcim.0c00675.
39. Axelrod S, Gómez-Bombarelli R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 2022;9:1–14. 10.1038/s41597-022-01288-4.
40. Weng G, Cai X, Cao D. et al. PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Res 2023;51:D1367–72. 10.1093/nar/gkac946.
41. Schrödinger. Schrödinger, Inc. https://newsite.schrodinger.com/. 2024.
42. Cordella LP, Foggia P, Sansone C. et al. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Mach Intell 2004;26:1367–72. 10.1109/TPAMI.2004.75.
43. Hagberg AA, Schult DA, Swart PJ. Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy2008). USA: SciPy Organizers, 2008;11–5.
44. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. arXiv preprint. arXiv: 2006.11239. 2020. http://arxiv.org/abs/2006.11239.
45. Gilmer J, Schoenholz SS, Riley PF. et al. Neural message passing for quantum chemistry. arXiv preprint. arXiv: 1704.01212. 2017. http://arxiv.org/abs/1704.01212.
46. O’Boyle NM, Banck M, James CA. et al. Open Babel: an open chemical toolbox. J Chem 2011;3:33. 10.1186/1758-2946-3-33.
47. Landrum G, Tosco P, Kelley B. et al. rdkit/rdkit: 2022_09_4 (Q3 2022) Release. Zenodo. 2023. 10.5281/zenodo.7541264.
48. Krieger J, Sorrell FJ, Wegener AA. et al. Systematic potency and property assessment of VHL ligands and implications on PROTAC design. ChemMedChem 2023;18:e202200615. 10.1002/cmdc.202200615.
49. Zhang B, Luo S, Wang L. et al. Rethinking the expressive power of GNNs via graph biconnectivity. arXiv preprint. arXiv: 2301.09505. 2023. http://arxiv.org/abs/2301.09505.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

PROTACs-SI-final-v2_bbae358

protacs-si-final-v2_bbae358.docx^{(1.3MB, docx)}

Data Availability Statement

The data, source code of DiffPROTACs and the trained models are available at https://github.com/Fenglei104/DiffPROTACs.

[ref1] 1. Sakamoto KM, Kim KB, Kumagai A. et al. Protacs: chimeric molecules that target proteins to the Skp1–Cullin–F box complex for ubiquitination and degradation. Proc Natl Acad Sci 2001;98:8554–9. 10.1073/pnas.141230798. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] 2. Hammoudeh DI, Follis AV, Prochownik EV. et al. Multiple independent binding sites for small-molecule inhibitors on the oncoprotein c-Myc. J Am Chem Soc 2009;131:7390–401. 10.1021/ja900616b. [DOI] [PubMed] [Google Scholar]

[ref3] 3. An S, Fu L. Small-molecule PROTACs: an emerging and promising approach for the development of targeted therapy drugs. EBioMedicine 2018;36:553–62. 10.1016/j.ebiom.2018.09.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4. Burslem GM, Crews CM. Proteolysis-targeting chimeras as therapeutics and tools for biological discovery. Cell 2020;181:102–14. 10.1016/j.cell.2019.11.031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. Nowak RP, DeAngelo SL, Buckley D. et al. Plasticity in binding confers selectivity in ligand-induced protein degradation. Nat Chem Biol 2018;14:706–14. 10.1038/s41589-018-0055-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] 6. Troup RI, Fallan C, Baud MGJ. Current strategies for the design of PROTAC linkers: a critical review. Explor Target Antitumor Ther 2020;1:273–312. 10.37349/etat.2020.00018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7. Gadd MS, Testa A, Lucas X. et al. Structural basis of PROTAC cooperative recognition for selective protein degradation. Nat Chem Biol 2017;13:514–21. 10.1038/nchembio.2329. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref8] 8. Békés M, Langley DR, Crews CM. PROTAC targeted protein degraders: the past is prologue. Nat Rev Drug Discov 2022;21:181–200. 10.1038/s41573-021-00371-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] 9. Mullard A. Targeted protein degraders crowd into the clinic. Nat Rev Drug Discov 2021;20:247–50. 10.1038/d41573-021-00052-4. [DOI] [PubMed] [Google Scholar]

[ref10] 10. Li W, Zhang J, Guo L. et al. Importance of three-body problems and protein–protein interactions in proteolysis-targeting chimera modeling: insights from molecular dynamics simulations. J Chem Inf Model 2022;62:523–32. 10.1021/acs.jcim.1c01150. [DOI] [PubMed] [Google Scholar]

[ref11] 11. Liao J, Nie X, Unarta IC. et al. In silico modeling and scoring of PROTAC-mediated ternary complex poses. J Med Chem 2022;65:6116–32. 10.1021/acs.jmedchem.1c02155. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12. Dixon T, MacPherson D, Mostofian B. et al. Predicting the structural basis of targeted protein degradation by integrating molecular dynamics simulations with structural mass spectrometry. Nat Commun 2022;13:5884. 10.1038/s41467-022-33575-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] 13. Guo WH, Qi X, Yu X. et al. Enhancing intracellular accumulation and target engagement of PROTACs with reversible covalent chemistry. Nat Commun 2020;11:1–16. 10.1038/s41467-020-17997-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Liu H, Mi Q, Ding X. et al. Discovery and characterization of novel potent BCR-ABL degraders by conjugating allosteric inhibitor. Eur J Med Chem 2022;244:114810. 10.1016/j.ejmech.2022.114810. [DOI] [PubMed] [Google Scholar]

[ref15] 15. Mai H, Zimmer MH, Miller TF. et al. Exploring PROTAC cooperativity with coarse-grained alchemical methods. J Phys Chem B 2023;127:446–55. 10.1021/acs.jpcb.2c05795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref16] 16. Bai N, Miller SA, Andrianov GV. et al. Rationalizing PROTAC-mediated ternary complex formation using Rosetta. J Chem Inf Model 2021;61:1368–82. 10.1021/acs.jcim.0c01451. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. Weng G, Li D, Kang Y. et al. Integrative Modeling of PROTAC-mediated ternary complexes. J Med Chem 2021;64:16271–81. 10.1021/acs.jmedchem.1c01576. [DOI] [PubMed] [Google Scholar]

[ref18] 18. Drummond ML, Henry A, Li H. et al. Improved accuracy for modeling PROTAC-mediated ternary complex formation and targeted protein degradation via new in silico methodologies. J Chem Inf Model 2020;60:5234–54. 10.1021/acs.jcim.0c00897. [DOI] [PubMed] [Google Scholar]

[ref19] 19. Zaidman D, Prilusky J, London N. PRosettaC: Rosetta based modeling of PROTAC mediated ternary complexes. J Chem Inf Model 2020;60:4894–903. 10.1021/acs.jcim.0c00589. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20. Weng G, Shen C, Cao D. et al. PROTAC-DB: an online database of PROTACs. Nucleic Acids Res 2021;49:D1381–7. 10.1093/nar/gkaa807. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] 21. Li F, Hu Q, Zhang X. et al. DeepPROTACs is a deep learning-based targeted degradation predictor for PROTACs. Nat Commun 2022;13:7133. 10.1038/s41467-022-34807-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] 22. Zheng S, Tan Y, Wang Z. et al. Accelerated rational PROTAC design via deep learning and molecular simulations. Nat Mach Intell 2022;4:739–48. 10.1038/s42256-022-00527-y. [DOI] [Google Scholar]

[ref23] 23. Nori D, Coley CW, Mercado. De novo PROTAC design using graph-based deep generative models. arXiv preprint. arXiv: 2211.02660. 2022. http://arxiv.org/abs/2211.02660.

[ref24] 24. Imrie F, Bradley AR, Schaar M. et al. Deep generative models for 3D linker design. J Chem Inf Model 2020;60:1983–95. 10.1021/acs.jcim.9b01120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] 25. Huang Y, Peng X, Ma J. et al. 3DLinker: an E(3) equivariant variational autoencoder for molecular linker design. arXiv preprint. arXiv: 2205.07309. 2022. http://arxiv.org/abs/2205.07309.

[ref26] 26. Igashov I, Stärk H, Vignac C. et al. Equivariant 3D-conditional diffusion models for molecular linker design. arXiv preprint. arXiv: 2210.05274. 2022. http://arxiv.org/abs/2210.05274.

[ref27] 27. Sohl-Dickstein J, Weiss EA, Maheswaranathan N. et al. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint. arXiv: 1503.03585. 2015. http://arxiv.org/abs/1503.03585.

[ref28] 28. Croitoru F-A, Hondru V, Ionescu RT. et al. Diffusion models in vision: a survey. IEEE Trans Pattern Anal Mach Intell 2023;45:10850–69. 10.1109/TPAMI.2023.3261988. [DOI] [PubMed] [Google Scholar]

[ref29] 29. Zhang C, Zhang C, Zheng S. et al. A survey on audio diffusion models: text to speech synthesis and enhancement in generative AI. arXiv preprint. arXiv: 2303.13336. 2023. http://arxiv.org/abs/2303.13336.

[ref30] 30. Zhu Y, Zhao Y. Diffusion models in NLP: a survey. arXiv preprint. arXiv: 2303.07576. 2023. http://arxiv.org/abs/2303.07576.

[ref31] 31. Zhang M, Qamar M, Kang T. et al. A survey on graph diffusion models: generative AI in science for molecule, Protein and Material. arXiv preprint. arXiv: 2304.01565. 2023. http://arxiv.org/abs/2304.01565. [Google Scholar]

[ref32] 32. Xu M, Yu L, Song Y. et al. GeoDiff: a geometric diffusion model for molecular conformation generation. arXiv preprint. arXiv: 2203.02923. 2022. http://arxiv.org/abs/2203.02923.

[ref33] 33. Hoogeboom E, Satorras VG, Vignac C. et al. Equivariant diffusion for molecule generation in 3D. arXiv preprint. arXiv:2203.17003. 2022. http://arxiv.org/abs/2203.17003.

[ref34] 34. Schneuing A, Du Y, Harris C. et al. Structure-based drug design with equivariant diffusion models. arXiv preprint. arXiv: 2210.13695. 2023. http://arxiv.org/abs/2210.13695.

[ref35] 35. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. arXiv preprint. arXiv: 1706.03762. 2017. http://arxiv.org/abs/1706.03762.

[ref36] 36. Ying C, Cai T, Luo S. et al. Do transformers really perform badly for graph representation? arXiv preprint. arXiv: 2106.05234. 2021. https://arxiv.org/abs/2106.05234.

[ref37] 37. Liao YL, Smidt T. Equiformer: equivariant graph attention transformer for 3D atomistic graphs. arXiv preprint. arXiv: 2206.11990. 2023. http://arxiv.org/abs/2206.11990.

[ref38] 38. Irwin JJ, Tang KG, Young J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model 2020;60:6065–73. 10.1021/acs.jcim.0c00675.

[ref39] 39. Axelrod S, Gómez-Bombarelli R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 2022;9:1–14. 10.1038/s41597-022-01288-4.

[ref40] 40. Weng G, Cai X, Cao D. et al. PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Res 2023;51:D1367–72. 10.1093/nar/gkac946.

[ref41] 41. Schrödinger. Schrödinger, Inc. https://newsite.schrodinger.com/. 2024.

[ref42] 42. Cordella LP, Foggia P, Sansone C. et al. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Mach Intell 2004;26:1367–72. 10.1109/TPAMI.2004.75.

[ref43] 43. Hagberg AA, Schult DA, Swart PJ. Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy2008). USA: SciPy Organizers, 2008;11–5.

[ref44] 44. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. arXiv preprint. arXiv: 2006.11239. 2020. http://arxiv.org/abs/2006.11239.

[ref45] 45. Gilmer J, Schoenholz SS, Riley PF. et al. Neural message passing for quantum chemistry. arXiv preprint. arXiv: 1704.01212. 2017. http://arxiv.org/abs/1704.01212.

[ref46] 46. O’Boyle NM, Banck M, James CA. et al. Open Babel: an open chemical toolbox. J Cheminform 2011;3:33. 10.1186/1758-2946-3-33.

[ref47] 47. Landrum G, Tosco P, Kelley B. et al. rdkit/rdkit: 2022_09_4 (Q3 2022) Release. Zenodo. 2023. 10.5281/zenodo.7541264.

[ref48] 48. Krieger J, Sorrell FJ, Wegener AA. et al. Systematic potency and property assessment of VHL ligands and implications on PROTAC design. ChemMedChem 2023;18:e202200615. 10.1002/cmdc.202200615.

[ref49] 49. Zhang B, Luo S, Wang L. et al. Rethinking the expressive power of GNNs via graph biconnectivity. arXiv preprint. arXiv: 2301.09505. 2023. http://arxiv.org/abs/2301.09505.
