Briefings in Bioinformatics
2024 Jul 15;25(4):bbae338. doi: 10.1093/bib/bbae338

A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation

Xiangru Tang 1,#, Howard Dai 2,#, Elizabeth Knight 3,#, Fang Wu 4, Yunyang Li 5, Tianxiao Li 6, Mark Gerstein 7,8,9,10,11
PMCID: PMC11247410  PMID: 39007594

Abstract

Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.

Keywords: generative model, drug design, molecule generation, protein generation

Introduction

During the drug design process, ligands must be created, selected, and tested for their interactions and chemical effects conditioned on specific targets [1]. These ligands range from small molecules with tens of atoms to large proteins such as monoclonal antibodies [2, 3]. While methods exist to optimize the selection and testing of probable molecules, traditional discovery methods across all fields are computationally expensive [4]. Recent artificial intelligence (AI) models have demonstrated competitive performance [5–7] in improving various tasks in the drug design process. Methods such as machine learning (ML)-driven quantitative structure-activity relationship approaches [8, 9] have significantly improved virtual screening (VS) in molecule design [10, 11], while ML-assisted directed evolution techniques for protein engineering [12, 13] have proven to be reliable and widely used tools. However, an emerging and even more powerful task for ML is the generation of entirely new biological compounds in de novo drug design [14–16].

In contrast to applications in VS and directed evolution, which seek to expedite and optimize tasks within an existing framework, de novo drug design focuses on generating entirely new biological entities not found in nature [14]. While other ML-driven methods search within existing chemical libraries for drug-like candidates, thereby facing inherent constraints, de novo design circumvents this limitation by exploring unknown chemical space and generating drug-like candidates from scratch [17–19].

In this paper, we explore the impacts and developments of ML-driven de novo drug design in two primary areas of research: molecule and protein generation. Within protein generation, we additionally explore antibody and peptide generation, given their high research activity and relevance. Although the types of pharmaceuticals and the associated chemical nuances differ across fields, the overarching goal of exploring chemical space through de novo design remains constant. Both fields are rapidly growing industries with traditionally high research and development costs [20–22], and current improvements are driven by active developments in ML-based de novo methods.

Molecule design specifically refers to the development of novel molecular compounds, often with the aim of small-molecule drug design. The generated molecules must satisfy a complex and often abstract array of chemical constraints that determine both their validity and “drug-likeness” [23, 24]. This, combined with the vast space of potential drug-like compounds (up to an estimated $10^{60}$ [25]), renders traditional small-molecule drug design time-consuming and expensive. Using traditional methods, preclinical trials can cost hundreds of millions of dollars [26] and take between 3 and 6 years [4]. In recent years, AI-driven methods have gained traction in drug design. AI-focused biotechnology companies have initiated over 150 small-molecule drugs in the discovery phase and 15 in clinical trials, with the usage of this AI-fueled process expanding by almost 40% each year [27].

An equally promising field, protein design, refers to the artificial generation or modification of proteins (protein engineering) for various biological uses. Native proteins have adapted and evolved over millions of years, so the rapid progression of human society in recent years poses challenges that naturally occurring proteins are not equipped to meet [28]. Protein design has an even more versatile range of applications, finding utility in immune signaling, targeted therapeutics, and various other fields. When executed efficiently, protein design has the potential to transform synthetic biology [21, 29, 30]. As with molecule generation, generated proteins must adhere to abstract biological constraints, yet the inherently more complex structure of a protein presents a more nuanced generative objective, requiring a more direct application of chemical knowledge in the process [31–33]. Traditional methods such as directed evolution are confined to specific evolutionary trajectories for existing proteins; the de novo generation of proteins would add an entirely new dimension for researchers to explore [30, 34, 35].

Structurally, we aim to provide an organized introduction to the two fields mentioned above. We begin with a technical overview of relevant deep learning architectures employed in both small molecule and protein design. We then explore their applications in molecule and protein design, dividing our analysis into a variety of subfields highlighted in Fig. 1. For each subfield, we provide (1) a general background/task definition, (2) common datasets used for training and testing, (3) common evaluation metrics, (4) an overview of past and current ML approaches, and (5) a comparative analysis of the performance of state-of-the-art (SOTA) models. A detailed overview of this structure is shown in Fig. 2. Finally, we integrate concepts within each subfield into a broad analysis of de novo drug design as a whole, providing a comprehensive summary of the field in terms of current trends, top-performing models, and future directions. Our overall objective is to provide a systematic overview of ML in drug design, capturing recent advancements in this rapidly evolving area of research.

Figure 1.

Figure 1

An overview of the topics covered in this survey; in particular, we explore the intersection between generative AI model architectures and real-world applications, organized into two main categories: small molecule and protein generation tasks; note that diffusion and flow-based models are often paired with GNNs for processing 2D/3D-based input, while VAEs and GANs are typically used for 1D input.

Figure 2.

Figure 2

A structured layout for all terms and papers covered in our survey, including datasets, models, and metrics for each task; sections contained in the main text are highlighted in blue, and sections expanded upon in the appendix are highlighted in purple.

Related surveys

Several survey papers delve into specific aspects of generative AI in drug design, with some focusing on molecule generation [17, 36, 37], protein generation [28, 29], or antibody generation [38–41]. Other survey papers are organized based on model architecture rather than application, with recent papers by Zhang et al. [42] and Guo et al. [43] reviewing diffusion models in particular. While each of the above surveys provides an in-depth analysis of a specific application or type of model, the level of specialization may limit their scope. Our approach is a macro-level analysis of small molecule and protein generation, tailored for those seeking a high-level introduction to the emerging field of generative AI in chemical innovation. This broad perspective enables us to highlight relationships across fields, such as parallel shifts in methods of input representation, the common emergence of architectures like equivariant graph neural networks (EGNNs), and similar challenges faced in both molecule and protein design.

Preliminary: generative AI models

Generative AI combines statistical modeling, iterative training, and random sampling to generate new data samples resembling the input data. Historically, prominent approaches include generative adversarial networks (GANs) [44], variational autoencoders (VAEs) [45], and flow-based models [46]. More recently, diffusion models [47] have emerged as promising alternatives. In our survey, we begin by providing a concise mathematical and computational overview of these architectures.

Variational autoencoders

A VAE [45] is a type of generative model that extends upon the typical encoder–decoder framework by representing each latent attribute using a distribution rather than a single value. This approach creates a more dynamic representation of the underlying properties of the training data and enables the sampling of new data points from scratch.

Formally, we can express the encoder as

$$ q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \sigma_\phi^2(x)\big) $$

Intuitively, each input $x$ will be mapped to some mean $\mu_\phi(x)$ and variance $\sigma_\phi^2(x)$, which describe a corresponding normal distribution. We express the decoder as $p_\theta(x \mid z)$, where $z$ is a randomly sampled point from the latent distribution $q_\phi(z \mid x)$ and is mapped into a reconstructed point $\hat{x}$ in the decoding process.

VAE loss is computed using two balancing ideas: reconstruction loss and Kullback–Leibler (KL) divergence loss. Reconstruction loss measures the difference between the ground truth and the reconstructed decoder output, often expressed using cross-entropy loss:

$$ \mathcal{L}_{\text{recon}} = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] $$

KL divergence measures the difference between two probability distributions [48]. For VAEs, KL divergence is computed between the encoded distribution and the standard normal distribution. This can be seen as “regularization,” as it encourages the encoder to map elements to a more central region with overlapping distributions, thus improving continuity across the latent space. Formally, the KL loss can be expressed as follows, with $k$ indexing the dimensions of the latent space and $d$ representing the dimensionality of $z$:

$$ \mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k=1}^{d} \Big(1 + \log \sigma_k^2(x) - \mu_k^2(x) - \sigma_k^2(x)\Big) $$

Here, $\mu_k(x)$ and $\sigma_k^2(x)$ represent the mean and variance of the $k$th component of the latent space, respectively, for datapoint $x$. Then, the overall loss function can be expressed as follows, where the weight $\beta$ can be adjusted to balance the reconstruction loss and KL loss:

$$ \mathcal{L} = \mathcal{L}_{\text{recon}} + \beta\, \mathcal{L}_{\text{KL}} $$
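The two loss terms can be sketched concretely in plain Python (a toy version with hypothetical list inputs and a binary cross-entropy reconstruction term; real implementations operate on minibatched tensors in a framework such as PyTorch):

```python
import math

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Toy VAE loss: binary cross-entropy reconstruction plus the KL
    divergence between the encoded Gaussian N(mu, sigma^2) and N(0, 1)."""
    # Reconstruction loss (binary cross-entropy, summed over dimensions)
    recon = -sum(xi * math.log(xh) + (1 - xi) * math.log(1 - xh)
                 for xi, xh in zip(x, x_hat))
    # KL(N(mu, sigma^2) || N(0, 1)) = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + beta * kl

# A posterior that already matches N(0, 1) (mu = 0, log_var = 0)
# contributes zero KL, leaving only the reconstruction term.
loss = vae_loss(x=[1.0, 0.0], x_hat=[0.9, 0.1], mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

Raising `beta` above 1 tightens the regularization toward the prior at the cost of reconstruction fidelity, the trade-off described above.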

Generative adversarial networks

GANs [44] utilize “competing” neural networks for mutual improvement. The two neural networks, the generator and the discriminator, compete in a zero-sum game. The generator ($G$) creates instances (e.g. chemical structures of potential drugs) from random noise ($z$) sampled from a prior distribution $p_z(z)$ to mimic the training samples, while the discriminator ($D$) aims to distinguish between the synthetic data and the training samples.

The learning process involves the optimization of the following loss function:

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] $$

Here, $\log D(x)$ represents the log-probability the discriminator assigns to a real sample, while $\log(1 - D(G(z)))$ represents the log-probability the discriminator assigns to rejecting a generated sample. This function returns a higher value when the discriminator accurately categorizes samples; thus, the discriminator aims to maximize this function, while the generator aims to minimize it.
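The value function can be estimated by Monte Carlo over discriminator outputs; the toy sketch below (hypothetical probabilities standing in for real networks) shows that accurate discrimination raises $V$ while a fooled discriminator drives it down:

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of the GAN value function
    V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    given discriminator outputs on real and generated samples."""
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# An accurate discriminator (real near 1, fake near 0) keeps V close to
# its maximum of 0; a fooled discriminator makes V strongly negative.
good_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
fooled_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.9, 0.95])
```

During training, the generator's updates push the situation from `good_d` toward `fooled_d`, while the discriminator's updates push back.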

Flow-based models

Flow-based generative models [46] generate data according to a target distribution $p_x$ by applying a chain of transformations to a simple latent distribution, often Gaussian, denoted $p_z$. This transformation applies an invertible function $f_\theta$, such that

$$ x = f_\theta(z), \qquad z \sim p_z(z) $$

where the trained model learns parameters $\theta$. Since $f_\theta$ is invertible and thus the learned map is bijective, $z$ has the same dimensionality as $x$. Often, $f_\theta$ is a composite function $f_\theta = f_1 \circ f_2 \circ \cdots \circ f_K$; this allows more complex probability distributions to be modeled. Because each component function is invertible, the posterior can be easily computed: the log-likelihood of a single point $x$ can be written in terms of its latent variable $z = f_\theta^{-1}(x)$:

$$ \log p_x(x) = \log p_z\big(f_\theta^{-1}(x)\big) + \log \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right| $$

This function is used to train the parameters $\theta$ to maximize the probability of observing the data. Various models build upon this premise to represent complex data distributions and capture relationships within sequential data.
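The change-of-variables formula is easy to verify for a one-dimensional affine flow (a minimal hypothetical sketch; real flows stack many learned invertible layers):

```python
import math

def affine_flow_logp(x, scale, shift):
    """Exact log-likelihood of x under a one-step affine flow
    x = scale * z + shift with z ~ N(0, 1):
    log p(x) = log p_z(f^{-1}(x)) + log |d f^{-1} / dx|."""
    z = (x - shift) / scale                          # invert the flow
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)) # standard-normal log-density
    log_det = -math.log(abs(scale))                  # log |det Jacobian| of the inverse
    return log_pz + log_det

# With scale=1, shift=0 the flow is the identity, so log p(0) is just
# the standard-normal log-density at 0.
lp = affine_flow_logp(0.0, scale=1.0, shift=0.0)
```

Shifting both the data and the flow by the same amount leaves the likelihood unchanged, which is exactly the invariance the formula encodes.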

Diffusion models

Diffusion models [47] perform a fixed learning procedure, gradually adding Gaussian noise to data over a series of time steps. We define two stages of the model: the noise-adding (forward) and the noise-removing (reverse) process.

In the forward process, each step can be represented as a Markov chain transition in which $x_t$ is composed of $x_{t-1}$ plus a small amount of Gaussian noise. We represent this mathematically as follows:

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big) $$

Here, $x_t$ is the data at time step $t$, and $\{\beta_t\}$ denotes the noise schedule. The retained signal $\sqrt{1 - \beta_t}\,x_{t-1}$ shrinks at every step of the forward process, so that after many steps we have $x_T \approx \mathcal{N}(0, \mathbf{I})$.
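A scalar simulation of this forward chain (a toy sketch with a hypothetical constant schedule; real models noise entire molecular point clouds) shows the signal being destroyed:

```python
import math
import random

def forward_diffusion(x0, betas, seed=0):
    """Simulate the forward (noising) Markov chain
    q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    on a scalar, returning the final noised value x_T."""
    rng = random.Random(seed)
    x = x0
    for beta in betas:
        # Shrink the signal and inject fresh Gaussian noise at each step.
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0, 1)
    return x

# After 200 steps the original value 5.0 is scaled by
# sqrt(0.95)^200 ~ 0.006, so x_T is approximately a standard-normal sample.
x_T = forward_diffusion(x0=5.0, betas=[0.05] * 200)
```

The fixed seed makes the chain reproducible, which is convenient for testing; in practice a fresh noise sample is drawn for every training example.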

In the reverse process, we aim to reconstruct the data from the noise. In this process, a denoising function is learned, often modeled by a neural network:

$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) $$

where $\mu_\theta$ is the denoising function parameterized by $\theta$.

To train a diffusion model, we approximate the added noise at each step; the loss function minimizes the difference between the true noise and the model-predicted noise:

$$ \mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(1, T),\, x_0,\, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big] $$

Here, $t \sim \mathcal{U}(1, T)$ means that the time step $t$ is drawn uniformly at random from the set $\{1, \ldots, T\}$, and $\epsilon_\theta(x_t, t)$ represents the noise predicted by the model parameterized by $\theta$. $T$ represents the last time step of the model. Once the neural network has been trained, we can sample from the noise distribution and iterate through the reverse process to generate new data.
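Training therefore reduces to a regression on the injected noise. The sketch below (a toy scalar version with hypothetical noise values, standing in for the expectation in the loss above) makes the objective concrete:

```python
def diffusion_loss(noise_pairs):
    """Simplified DDPM training objective: mean squared error between the
    true injected noise eps and the model prediction eps_theta(x_t, t),
    averaged over sampled (t, x_0, eps) triples."""
    return sum((eps, eps_hat) and (eps - eps_hat) ** 2
               for eps, eps_hat in noise_pairs) / len(noise_pairs)

# A model that recovers the injected noise exactly achieves zero loss;
# a model that always predicts zero is penalized by the noise variance.
perfect = diffusion_loss([(0.3, 0.3), (-1.2, -1.2)])
rough = diffusion_loss([(0.3, 0.0), (-1.2, 0.0)])
```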

While the mathematical framework for the diffusion process is generally based on continuous data, adaptations made by Austin et al. [49] allow for smoother implementation with discrete data forms like molecular graphs.

Other models

While we cover the main generative methods used in these fields, a variety of other models also appear in specific applications and tasks in our paper, such as transformers, energy-based models, BERT, and more [50–56]. While not generative models on their own, graph neural networks (GNNs) [57] are often paired with the above generative methods to capture the graph-like structure of molecules. A wide variety of GNN variations exist, including EGNNs [58], message-passing neural networks (MPNNs) [59], graph convolutional networks [60], and graph isomorphism networks [61], along with the closely related convolutional neural networks (CNNs) [62–64]. We discuss GNNs and EGNNs in more detail in the appendix on page 23.

Applications

Molecule

Task background

Molecule generation focuses on the creation of novel molecular compounds for drug design. These generated molecules are intended to be (1) valid, (2) stable, and (3) unique, with an overall goal of pharmaceutical applicability. “Pharmaceutical applicability” broadly refers to a molecule’s binding affinity to various biological targets. While the first three criteria may seem trivial, simply generating valid and stable molecules poses a variety of challenges. Thus, the field of target-agnostic molecule generation is focused on generating valid sets of molecules without consideration for any biological target. Target-aware molecule generation (or ligand generation) focuses on the generation of molecules for specific protein structures and therefore emphasizes the pharmaceutical component. Finally, 3D conformation generation involves generating various 3D conformations given 2D connectivity graphs.

For training and testing, molecule inputs can be formatted in a variety of ways, depending on the available information or desired output. Molecules can be expressed in 1D format through the simplified molecular-input line-entry system (SMILES), in 2D using connectivity graphs to represent bonds, or in 3D using point cloud embeddings on graph nodes [65].
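To make the 1D representation concrete, the snippet below sketches a minimal SMILES tokenizer (hypothetical and highly simplified; production pipelines typically rely on RDKit or a learned vocabulary) that splits a string into the atom, bond, and ring/branch tokens a sequence model would consume:

```python
import re

# Illustrative token pattern covering a small subset of SMILES syntax:
# two-letter halogens, common organic atoms, aromatic atoms, bond symbols,
# branches/brackets, charges, and ring-closure digits.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[cnos]|[=#()\[\]@+\-]|[0-9]")

def tokenize_smiles(smiles):
    """Split a SMILES string into coarse tokens for a sequence model."""
    return TOKEN.findall(smiles)

acetic_acid = tokenize_smiles("CC(=O)O")  # ['C', 'C', '(', '=', 'O', ')', 'O']
benzene = tokenize_smiles("c1ccccc1")     # aromatic carbons plus ring-closure digits
```

The brittleness of this kind of string handling, where chemically similar molecules can have very different token sequences, is one motivation for the 2D graph and 3D point-cloud representations discussed later.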

Target-agnostic molecule design

Overview 

While the task of target-agnostic molecule design may seem simplistically open-ended, there is a vast array of chemical properties and rules that generated molecules must align with to be considered “valid” and “stable.” The determination of “validity” includes a complex combination of considerations, such as electromagnetic forces, energy levels, and geometric constraints, and a well-defined “formula” does not yet exist for predicting the feasibility of molecular compounds. This, combined with the vast space of potential drug-like compounds (up to an estimated $10^{60}$), makes brute-force experimentation quite time-consuming [25]. Deep learning can assist in learning abstract features for existing valid compounds and efficiently generate new molecules with a higher likelihood of validity.

Task 

The target-agnostic molecule design task is as follows: given no input, generate a set of novel, valid, and stable molecules.

Datasets 

To learn these abstract constraints, models must learn from large sets of existing valid, stable molecules. The following datasets are most commonly used for this task:

  • QM9 [66]—Quantum Machines 9 contains small stable molecules pulled from the larger chemical universe database GDB-17

  • GEOM-Drug [67]—Geometric Ensemble of Molecules contains more complex, drug-like molecules, often used to test scalability beyond the simpler molecules of QM9

Metrics 

The most general task within this field is unconditional molecule generation, where models aim to generate a new set of valid, stable molecules with no input. All of these metrics can be evaluated using either QM9 or GEOM-Drug as testing sets.

  • Atom Stability—The percentage of atoms with the correct valency

  • Molecule Stability—The percentage of molecules whose atoms are all stable

  • Validity—The percentage of stable molecules that are considered valid, often evaluated by RDKit

  • Uniqueness—The percentage of valid molecules that are unique (not duplicates)

  • Novelty—The percentage of molecules not contained within the training dataset

  • QED [23]—Quantitative Estimate of Drug-Likeness, a formulaic combination of a variety of molecular properties that collectively estimate how likely a molecule is to be used for pharmaceutical purposes

Note that the novelty metric is sometimes omitted, as argued by Vignac et al. [68], who contend that QM9 is an exhaustive set of all molecules with up to nine heavy atoms following a predefined set of constraints. Therefore, any “novel” molecule would have to break one of these constraints, making novelty a poor indicator of performance. While QED is a well-established metric and may see expanded usage in the future, many current models have focused solely on generating valid molecules and do not report performance on QED.
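The uniqueness and novelty metrics above reduce to simple set computations. A minimal sketch (with hypothetical SMILES strings, assuming validity and stability filtering has already happened upstream):

```python
def generation_metrics(generated, training_set):
    """Toy versions of the uniqueness and novelty metrics, operating on
    canonical SMILES strings. Uniqueness is the fraction of generated
    molecules that are distinct; novelty is the fraction of unique
    molecules absent from the training set."""
    unique = set(generated)
    uniqueness = len(unique) / len(generated)
    novelty = sum(1 for s in unique if s not in set(training_set)) / len(unique)
    return uniqueness, novelty

# Four generated molecules, one duplicate, two seen during training.
uniq, nov = generation_metrics(
    generated=["CCO", "CCO", "CCN", "CCC"],
    training_set=["CCO", "CCC"])
```

In real benchmarks the strings would first be canonicalized (e.g. with RDKit) so that different spellings of the same molecule are not counted as distinct.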

Models are often also evaluated on conditional molecule generation, aiming to generate molecules that exhibit desired chemical properties. For evaluation, a property classifier network $\phi_c$ is trained on half of the QM9 dataset, while the generative model is trained on the other half. $\phi_c$ is then evaluated on the model’s generated molecules, and the mean absolute error (MAE) between the target property value and the classifier-predicted value is calculated. Below are the six molecular properties considered:

  • $\alpha$—Polarizability, or the tendency of a molecule to acquire an electric dipole moment when subjected to an external electric field, measured in cubic Bohr radii ($\mathrm{Bohr}^3$)

  • $\varepsilon_{\mathrm{HOMO}}$—Highest Occupied Molecular Orbital energy, measured in millielectronvolts (meV)

  • $\varepsilon_{\mathrm{LUMO}}$—Lowest Unoccupied Molecular Orbital energy, measured in millielectronvolts (meV)

  • $\Delta\varepsilon$—Difference between $\varepsilon_{\mathrm{HOMO}}$ and $\varepsilon_{\mathrm{LUMO}}$, measured in millielectronvolts (meV)

  • $\mu$—Dipole moment, measured in debyes (D)

  • $C_v$—Molar heat capacity at 298.15 K, measured in calories per Kelvin per mole (cal/(K·mol))
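The evaluation loop then amounts to a mean absolute error between the conditioning targets and the classifier's predictions on the generated molecules. A toy sketch (with hypothetical polarizability values; the classifier itself is assumed trained separately):

```python
def conditional_mae(targets, predicted):
    """MAE between the property values a model was conditioned on and the
    values a pretrained property classifier assigns to its outputs."""
    return sum(abs(t - p) for t, p in zip(targets, predicted)) / len(targets)

# e.g. target vs classifier-estimated polarizability (Bohr^3) for 3 molecules
mae = conditional_mae([60.0, 75.0, 80.0], [62.0, 74.0, 83.0])
```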

Models 

Approaches to the molecular generation task have seen significant shifts over the past few years, transitioning from 1D SMILES strings to 2D connectivity graphs, then to 3D geometric structures, and finally to incorporating both 2D and 3D information.

Early methods like Character VAE (CVAE) [77], Grammar VAE (GVAE) [78], and Syntax-Directed VAE (SD-VAE) [79] apply VAEs to 1D SMILES string representations of molecule graphs. While 1D SMILES strings can be deterministically mapped to molecular graphs, SMILES falls short in quality of representation—two graphs with similar chemical structures may end up with very different SMILES strings, making it harder for models to learn these similarities and patterns [80].

Junction Tree VAE (JTVAE) [80] was the first model to address this issue by generating 2D graph structures directly. JTVAE generates a tree-structured scaffold and then assembles this scaffold into a complete molecule using a graph message-passing network. This approach allows JTVAE to iteratively expand its molecule and check for validity at each step, resulting in considerable performance improvement over previous SMILES-based methods.

2D graph methods like JTVAE still fall short due to the lack of 3D input; because binding and interaction with other molecules/proteins rely heavily on 3D conformations, models that do not consider 3D information cannot properly represent and optimize properties like binding affinity. Thus, more recent developments include models that incorporate 3D information. Earlier 3D-based methods like E-NF [71] and G-SchNet [69] approached the molecular generation problem with flow-based or autoregressive methods (in particular, G-SchNet uses the SchNet architecture developed by Schütt et al. [54]). More recently, a wave of diffusion-based models operating on 3D point clouds has taken advantage of E(3) equivariance and demonstrated superior performance.

EDM [76] provided an initial baseline for the application of diffusion, applying a standard diffusion process to an equivariant GNN with atoms represented as nodes with variables for both scalar features and 3D coordinates. While autoregressive models require an arbitrary ordering of the atoms, diffusion-based methods like EDM are not sequential and do not need such ordering, reducing a dimension of complexity and thus improving efficiency.

Many subsequent models compared themselves with EDM as a baseline on the molecule generation task, seeking to improve upon its performance by adding additional considerations and adjustments. GCDM [72] implements a crossover between geometric deep learning and diffusion, using a geometry-complete perceptron network to introduce attention-based geometric message-passing. While both EDM and GCDM have demonstrated substantial performance improvements, both models still struggle with large-molecule scalability and diversity in the generated molecules. MDM [73] addressed the scalability issue by pointing out the lack of consideration for interatomic relations in EDM and GCDM. MDM separately defines graph edges for covalent bonds and for Van der Waals forces (determined by a physical distance threshold) to allow for thorough consideration of interatomic forces and local constraints. In addition, MDM addressed the diversity issue by introducing an additional distribution-controlling noise variable in each diffusion step. While previous diffusion models operated directly in the complex atomic feature space, GeoLDM [70] applies VAEs to map molecule structures to a lower-dimensional latent space for its diffusion. This latent space has a smoother distribution and lower dimensionality, leading to higher efficiency and scalability for large molecules. In addition, conditional generation is improved, as specified chemical properties are more clearly defined within latent spaces than they are in raw format.

While previous models learned exclusively from either 2D or 3D representations, a new wave of models recognizes the need for both: a molecule’s 2D connectivity structure is necessary to determine bond types and gather information about chemical properties and synthesis, while the 3D conformation is crucial for its interaction and binding affinity with other molecules. By jointly learning and generating both representations, models can maximize the amount of chemically relevant information and produce higher-quality molecular samples. The Joint 2D and 3D Diffusion Model (JODO) [74] uses a geometric graph representation to capture both 3D spatial information and connectivity information, applying score SDEs to this joint representation while proposing a diffusion graph transformer to parameterize the data prediction model and avoid the loss of correlation after noise is independently added to each separate channel. MiDi [75] uses a similar graph representation but instead applies a DDPM. It proposes a “relaxed” EGNN, which improves upon the classical EGNN architecture by exploiting the observation that translational invariance is not needed in the zero center-of-mass subspace. A full overview of the developments described in this section can be seen in Fig. 3.

Figure 3.

Figure 3

An overview of the progress in target-agnostic molecule design over time; shortcomings of previous models are shown in the corresponding pink boxes, with subsequent models solving these shortcomings through novel design choices [70, 73–80].

As shown in Table 1, diffusion-based methods demonstrate significant improvements over previous methods, all achieving over 98.5% in atom stability. However, some models fall behind when extended to the larger GEOM-Drugs dataset, as shown in Table 2, where MiDi distinguishes itself for its capability to generate more stable complex molecules, albeit at the expense of validity. Table 3 illustrates that MDM and GCDM excel at conditional generation tasks, with the former achieving the best performance in four of the six tasks and the latter achieving the best results on the remaining two. Overall, current models demonstrate high performance on the QM9 dataset, but there is room for improvement when dealing with the more complex molecules found in the GEOM-Drugs dataset.

Table 1.

An overview of relevant molecular generation models; all benchmarking metrics are self-reported unless otherwise noted; all metrics are evaluated with the QM9 dataset; for models with multiple variations, the highest performing version was selected; [**] represents the current SOTA; as MiDi and MDM use slightly different evaluation conditions, their results are not fully comparable

| Model | Type of model | Dataset | Atom Stb. (%, ↑) | Mol Stb. (%, ↑) | Valid (%, ↑) | Val/Uniq. (%, ↑) |
|---|---|---|---|---|---|---|
| G-SchNet [69] | SchNet | QM9 | 95.7 [70] | 68.1 [70] | 85.5 [70] | 80.3 [70] |
| E-NF [71] | EGNN, Flow | QM9 | 85 [70] | 4.9 [70] | 40.2 [70] | 39.4 [70] |
| EDM [76] | EGNN, Diffusion | QM9, GEOM-Drugs | 98.7 | 82.0 | 91.9 | 90.7 |
| GCDM [72] | EGNN, Diffusion | QM9 | 98.7 | 85.7 | 94.8 | 93.3 |
| MDM [73] | EGNN, VAE, Diffusion | QM9, GEOM-Drugs | 99.2 [74] | 89.6 [74] | 98.6 | 94.6 |
| JODO [74] | EGNN, Diffusion | QM9, GEOM-Drugs | 99.2 | 93.4 | 99.0 | 96.0 |
| MiDi** [75] | EGNN, Diffusion | QM9, GEOM-Drugs | 99.8 | 97.5 | 97.9 | 97.6 |
| GeoLDM** [70] | VAE, Diffusion | QM9, GEOM-Drugs | 98.9 | 89.4 | 93.8 | 92.7 |
Table 2.

Molecular generation models evaluated on the larger GEOM-Drugs dataset; all metrics are self-reported; a “/” indicates a value not reported; as MiDi uses slightly different evaluation conditions, its results are not fully comparable

| Model | Atom Stb. (%, ↑) | Mol Stb. (%, ↑) | Valid (%, ↑) | Val/Uniq. (%, ↑) |
|---|---|---|---|---|
| EDM [76] | 81.30 | / | / | / |
| MDM [73] | / | 62.20 | 99.50 | 99.00 |
| MiDi [75] | 99.80 | 91.60 | 77.80 | 77.80 |
| GeoLDM [70] | 84.40 | / | 99.30 | / |
Table 3.

Molecular generation models evaluated on the conditional molecule generation task; all metrics are self-reported; all metrics are evaluated with the QM9 dataset

| Task (↓) | $\alpha$ | $\Delta\varepsilon$ | $\varepsilon_{\mathrm{HOMO}}$ | $\varepsilon_{\mathrm{LUMO}}$ | $\mu$ | $C_v$ |
|---|---|---|---|---|---|---|
| Units | Bohr³ | meV | meV | meV | D | cal/(K·mol) |
| EDM [76] | 2.76 | 655 | 356 | 584 | 1.111 | 1.101 |
| GCDM [72] | 1.97 | 602 | 344 | 479 | 0.844 | 0.689 |
| MDM [73] | 1.591 | 44 | 19 | 40 | 1.177 | 1.647 |
| GeoLDM [70] | 2.37 | 587 | 340 | 522 | 1.108 | 1.025 |

Target-aware molecule design

Overview 

Contrasting target-agnostic molecule design, target-aware design involves generating molecules based on specific biological targets. Within target-aware design, two primary approaches exist: ligand-based drug design (LBDD) and structure-based drug design (SBDD). LBDD models often utilize the amino acid sequences of target proteins, leveraging the characteristics and features of known ligands to build new molecules with similar properties. By contrast, SBDD models use the 3D structure of the target protein to design a corresponding molecular structure. LBDD models are most useful when the 3D structure is not experimentally available, but are limited in novelty because they only learn from existing bindings [37]. When the 3D structure of the target protein is available, SBDD models are generally preferred, as they consider crucial 3D information.

Task 

Given input target information, typically in the form of protein amino acid sequences in LBDD and protein 3D structure in SBDD, these approaches generate molecules that exhibit high binding affinity and potential interactions with this target.

Datasets 

The following datasets are used for target-aware molecule design. CrossDocked2020 [81] is currently the most heavily used, as the cross-docking technique allows for the generation of combinatorially large quantities of data by “mixing and matching” similar ligand–protein pairs (22.5M compared with 40K in Binding MOAD [82]).

  • CrossDocked2020 [81]—Contains ligand–protein complexes generated by cross-docking within clusters of similar binding sites called “pockets”

  • ZINC20 [83]—Fully enumerated dataset of possible ligands

  • Binding MOAD [82]—Binding Mother of All Databases, a subset of PDB [84] containing experimentally determined protein–ligand pairings

Metrics 

Target-aware molecule design utilizes the following metrics. Beyond affinity/applicability metrics, diversity is also considered, as a diverse array of potential options for a given target provides more flexibility in the drug development process.

  • Vina Score [85]—Scoring function supported by the Vina platform that returns a weighted sum of atomic interactions and is useful for docking

  • Vina Energy [85]—Energy prediction by the Vina platform that is used to measure binding affinity

  • High Affinity Percentage—Percentage of molecules with lower Vina energy than the reference (ground-truth) molecule when binding to the target protein

  • QED [23]—Quantitative Estimate of Drug-Likeness, which is also used in target-agnostic generation (see page 5)

  • SAscore [86]—Synthetic Accessibility Score, a formulaic combination of molecular properties that determine how easy a molecule is to create in a real lab setting

  • Diversity—The diversity of generated molecules for each specific binding site, measured by Tanimoto similarities [87] between pairs of Morgan fingerprints
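As a rough sketch of how the last two metrics are computed, the snippet below uses toy fingerprint bit sets and made-up energy values; real evaluations derive Morgan fingerprints with a cheminformatics toolkit such as RDKit and binding energies from the Vina platform:

```python
from itertools import combinations

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def diversity(fingerprints: list) -> float:
    """Average pairwise dissimilarity (1 - Tanimoto) over generated molecules."""
    pairs = list(combinations(fingerprints, 2))
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)

def high_affinity_pct(generated_energies, reference_energy) -> float:
    """Percentage of generated molecules with lower (better) Vina energy
    than the reference (ground-truth) ligand."""
    hits = sum(1 for e in generated_energies if e < reference_energy)
    return 100.0 * hits / len(generated_energies)

# Toy fingerprints (stand-ins for Morgan fingerprint bits) and toy energies
fps = [{1, 2, 3}, {2, 3, 4}, {1, 4, 5}]
print(round(diversity(fps), 3))                          # 0.7
print(high_affinity_pct([-7.2, -6.0, -8.1, -5.5], -6.5)) # 50.0
```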

Models 

LBDD models incorporate transformer architectures to generalize properties of learned ligands. For example, DrugGPT [88] is a recent autoregressive model that uses transformers trained on numerous protein–ligand pairs. The ligand SMILES string and protein amino acid sequence are tokenized for training, and the model produces viable SMILES ligand outputs. Generally, with improving protein structure prediction methods (see page 9) and increasing access to structural information as a whole, SBDD methods have become more prevalent than LBDD methods; thus, we explore SBDD methods below in more detail.

LiGAN [89] introduces the idea of a 3D target-aware molecule output, fitting molecules into grid formats for learning with a CNN and training the model under a VAE framework. Pocket2Mol [90] places more emphasis on the specific “pockets” on the target protein to which the molecules bind, using an EGNN and geometric vector MLP layers to train on graph-structured input. Luo et al. [91] directly model the probability of atoms occurring at a certain position in the binding site, taking advantage of invariance through the SchNet [54] architecture.

Recent SBDD models have also popularized the use of diffusion models.

TargetDiff [92] performs diffusion on an EGNN (with many similarities to EDM [76]) to learn the conditional distribution. To optimize the binding affinity, Guan et al. [92] note that the flexibility of atom types should be low, which is reflected by the entropy of the atom embedding. Schneuing et al. [93] propose DiffSBDD, which includes two sub-models: DiffSBDD-cond and DiffSBDD-inpaint. DiffSBDD-cond is a conditional DDPM that learns a conditional distribution in a similar way as TargetDiff. In our benchmarking, we focus on the higher performing model, DiffSBDD-inpaint, which applies the inpainting approach (traditionally applied to filling parts of images) by masking and replacing segments of the ligand–protein complex.

As shown in Table 4, DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high-affinity percentage. Interestingly, diffusion-based methods seem to be outperformed by the MLP used in Pocket2Mol on drug-likeness metrics such as QED and SA. However, Guan et al. [92] note that adjustments to TargetDiff, such as switching to fragment-based generation or predicting atom bonds, could improve performance on QED and SA.

Table 4.

An overview of relevant target-aware molecular generation models; all benchmarking metrics are self-reported unless otherwise noted; [**] represents the current SOTA; as each paper uses slightly different benchmarking methods, their results may not be fully comparable; all metrics are evaluated with the CrossDocked2020 dataset

Model | Type of model | Dataset | Vina (↓) | Affinity (%, ↑) | QED (↑) | SA (↑) | Diversity (↑)
Luo et al. [91] | SchNet | CrossDocked2020 | -6.344 | 29.09 | 0.525 | 0.657 | 0.720
LiGAN [89] | CNN, VAE | CrossDocked2020 | -6.144 [91] | 21.1 [92] | 0.39 [92] | 0.59 [92] | 0.66 [92]
Pocket2Mol** [90] | EGNN, MLP | CrossDocked2020 | -5.14 [92] | 48.4 [92] | 0.56 [92] | 0.74 [92] | 0.69 [92]
TargetDiff** [92] | EGNN, Diffusion | CrossDocked2020 | -6.3 | 58.1 | 0.48 | 0.58 | 0.72
DiffSBDD** [93] | EGNN, Diffusion | CrossDocked, MOAD | -7.333 | — | 0.467 | 0.554 | 0.758

Protein

Task background

Proteins are large biomolecules that contain one or more long chains of amino acids. Each amino acid is a molecular compound that contains both an amino group (-NH2) and a carboxyl group (-COOH) [94]. While there are over 500 naturally occurring amino acids, only 22 are proteinogenic (protein-building) and thus relevant to the protein generation task [95]. In addition to 3D structural representations, proteins can therefore be represented through their amino acid sequences, assigning each amino acid a letter label and representing each long chain as a string of labels. This sequential representation mirrors the sequence structure of human language, allowing natural language models to be applied in ways that would not be possible for the previously discussed molecule generation task. Within protein generation, several generative subtasks can be defined. Representation learning involves creating meaningful embeddings for protein inputs, which improves the data space for other models to train on. Structure prediction involves generating a protein structure from its corresponding amino acid sequence, historically a challenging task due to the vast conformational space. Sequence generation describes the inverse: creating a protein sequence for a given structure. Finally, backbone design refers to creating protein structures from scratch, which forms the core of the de novo design task.

We also briefly discuss antibody generation due to its high relevance within the protein generation field. Antibodies are Y-shaped proteins used by the immune system to identify and bind to foreign molecules known as antigens, such as those found on the surfaces of bacteria and viruses. While many protein design models use multiple sequence alignment (MSA) to map evolutionary relationships between related sequences, MSAs are not always available for antibody sequences, and antibody-specific models cannot rely on this input. Additionally, binding regions are specifically defined for antibodies, contained within six complementarity-determining regions (CDRs). The CDR-H3 region is particularly diverse and complex, leading to a specialized task for the reconstruction of this region, known as CDR-H3 generation. We discuss antibody-specific methods within each corresponding subtask, and we include a more detailed discussion of antibody generation in the appendix on page 26.

Finally, we provide an additional section on peptide generation due to its high relevance and more specialized applications. Advances in drug delivery and synthesis technology have broadened the therapeutic potential of peptides, and recent innovations in peptide drug discovery have significantly improved treatments for type 2 diabetes with the creation of glucagon-like peptide-1 receptor agonists (e.g. liraglutide and semaglutide). Protein generation, while related, differs from peptide generation in length and complexity. Peptides are shorter (often no more than 50 amino acids) and have greater flexibility; thus, distinct computational models are needed to capture this differentiation.

Protein representation learning

Overview 

Protein representation learning involves learning embeddings to convert raw protein data into latent space representations, thereby extracting meaningful features and chemical attributes. Specifically, given a protein x = (x_1, …, x_N), where each x_i represents an amino acid (sequence-based) or atom coordinate (structure-based), the goal is to learn an embedding e = (e_1, …, e_N), where each e_i is a d-dimensional token representation for amino acid x_i. Although representation learning is not a generative task on its own, these embeddings create “richer” data spaces for other generative models to train on. Hence, we briefly discuss them here. For a more in-depth analysis of protein representation learning models, please refer to the appendix on page 23.
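As a minimal illustration of this notation, the sketch below maps a sequence x = (x_1, …, x_N) to per-residue embeddings e = (e_1, …, e_N). The vectors here are random stand-ins purely for illustration; real models such as ESM-2 learn these representations from data.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def make_embedding_table(dim=8, seed=0):
    """Toy lookup table mapping each amino acid to a d-dimensional vector.
    Random here for illustration; learned by a model in practice."""
    rng = random.Random(seed)
    return {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}

def embed(sequence, table):
    """Map a protein sequence to its per-residue embeddings e_1..e_N."""
    return [table[aa] for aa in sequence]

table = make_embedding_table()
embeddings = embed("MKTAY", table)
print(len(embeddings), len(embeddings[0]))  # 5 8
```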

Structure prediction

Overview 

Generating 3D structures of proteins from their amino acid sequences is a challenging and important task in drug design. Historically, techniques like protein threading [96] and homology modeling [97] have been used to predict structures; however, these methods have fallen short due to limited computational power and the difficulty of searching the vast conformational space. Newer computational approaches use various deep learning architectures to extract information from amino acid sequences and generate accurate 3D structures. Current models have achieved impressive accuracy in structure prediction, but there is room for improvement in terms of speed and scale.

Task 

Given a protein amino acid sequence, generate a set of 3D point coordinates for each amino acid residue, aiming to replicate a target ground-truth structure as closely as possible.

Datasets 

Unlike many of the other fields mentioned previously, the field of protein structure prediction benefits from a widely standardized benchmarking task through the Critical Assessment of Protein Structure Prediction (CASP). CASP conducts biennial testing on models using solved protein structures that have not been released to PDB.

  • PDB [84]—Protein Data Bank, a central archive for all experimentally determined protein structures, widely used in almost all protein structure-related tasks

  • CASP14 [98]—14th Critical Assessment of Protein Structure Prediction, a set of unreleased PDB structures used to create a standardized blind testing environment

  • CAMEO [99]—Continuous Automated Model Evaluation, a complement to CASP that conducts weekly blind tests using around 20 pre-released PDB targets (to provide more continuous feedback, as CASP is biennial)

Metrics 

Models are evaluated by comparing each protein’s ground-truth structure with the generated structure. Three different approaches are taken to evaluate structural similarity:

  • RMSD—Root-Mean-Square Deviation directly compares ground-truth positions with generated positions for each amino acid. If d_i denotes the distance between the ground-truth and generated position of amino acid i, with N total amino acids, we have

    RMSD = sqrt( (1/N) Σ_{i=1}^{N} d_i² )
  • GDT-TS [100]—Global Distance Test-Total Score finds the optimal superposition between two structures, searching for the highest number of corresponding residues that are within a distance threshold of each other. The GDT-TS aims to represent global fold similarity.

  • TM-score [101]—Template Modeling score, a similarity scoring formula that adjusts the GDT metric by normalizing for protein sequence length to avoid dependency on protein size and by evaluating all residues (not just those within a proposed cutoff) for a more cohesive score. The TM-score aims to represent both global fold and local structural similarities.

  • LDDT [102]—Local Distance Difference Test, a superposition-free metric based on local distances between atoms. For each atom pair, its local distance is “preserved” if the generated local distance is within a given threshold of the ground-truth distance, and the proportion of preserved distances is calculated. The LDDT protects against artificially unfavorable scores when considering flexible proteins with multiple domains.
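To make these definitions concrete, the sketch below computes RMSD, GDT-TS, and the TM-score from a list of per-residue distances d_i. It assumes the two structures are already optimally superposed and aligned; real GDT-TS and TM-score implementations additionally search over superpositions.

```python
import math

def rmsd(dists):
    """Root-mean-square deviation over per-residue distances d_i."""
    return math.sqrt(sum(d * d for d in dists) / len(dists))

def gdt_ts(dists):
    """GDT-TS: average fraction of residues within the standard
    1/2/4/8 Angstrom thresholds, scaled to 0-100."""
    fractions = [sum(d <= t for d in dists) / len(dists) for t in (1, 2, 4, 8)]
    return 100 * sum(fractions) / len(fractions)

def tm_score(dists, target_length):
    """TM-score with the standard length-dependent scale d0,
    normalized by the target protein's length."""
    d0 = 1.24 * (target_length - 15) ** (1 / 3) - 1.8
    return sum(1 / (1 + (d / d0) ** 2) for d in dists) / target_length

d = [0.5, 1.5, 3.0, 9.0]
print(round(rmsd(d), 2))  # 4.81
print(gdt_ts(d))          # 56.25
```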

Models 

AlphaFold2 [103] is a landmark model that uses deep learning techniques to compete with experimental methods. AlphaFold2 integrates numerous layers of transformers in an end-to-end approach. The transformers incorporate information from the MSA and pair representations to explore the folding space, potential orientations of amino acids, and overall structure based on pairwise distances. The MSA aligns multiple related protein sequences to create a 2D representation that informs the transformer architecture. Additionally, AlphaFold2 employs invariant point attention (IPA) for spatial attention, while the transformer captures interactions along the chain structure. Notably, AlphaFold2 introduces novel constraints from experimental data, which record probable distances between residues, preferred orientations between residues, and likely dihedral angles for the covalent bonds in the backbone.

Proposed in 2020, trRosetta [107], i.e. transform-restrained Rosetta, is another model that uses a deep residual network with attention mechanisms. Upon inputting MSA information, trRosetta predicts distances and orientations for residue pairs, which are then utilized to construct the 3D structure using a Rosetta protocol. Despite their advancements, both trRosetta and AlphaFold2 face several challenges, including their reliance on the MSA representation, limitations to natural proteins, and high computational requirements. Recently, RoseTTAFold [106], which replaced trRosetta [107], demonstrated performance comparable with AlphaFold2 based on CASP14 test data. Importantly, RoseTTAFold can generate samples within 10 min, which is around 100 times faster than AlphaFold2. RoseTTAFold employs a three-track neural network that simultaneously learns from 1D sequence-level, 2D distance map-level, and 3D backbone coordinate-level information with attention mechanisms integrated throughout. RoseTTAFold exhibits robust performance in predicting protein complexes, whereas AlphaFold2 excels primarily for single protein structure prediction.

Building on these techniques, ESMFold [105] is a language-model-based method that predicts protein structure by leveraging ESM-2 representations. The output embeddings from ESM-2 are passed to a series of self-attending “folding blocks,” and a structure module featuring an SE(3) transformer architecture generates the final structure predictions. Because ESMFold uses ESM-2 representations instead of MSA representations, it offers faster processing with competitive scores on CAMEO and CASP14. EigenFold [104] applies diffusion models to generate protein structures, representing the protein as a system of harmonic oscillators so that the structure can be projected onto the eigenmodes of that system during the forward process. In the reverse process, the rough global structure is sampled before local details are refined. As a score-based model, EigenFold is less computationally intensive but underperforms other models in accuracy and range. The protein structure prediction models described above are summarized in Table 5.

Table 5.

An overview of relevant protein structure prediction models; all metrics are self-reported unless otherwise noted; scores are provided as mean performance on single-structure tasks; lDDT represents the lDDT metric computed specifically using coordinates for the alpha carbon of each residue; [**] denotes the current SOTA

Model | Type of model | Dataset | CAMEO RMSD (Å, ↓) | CAMEO TM-score (↑) | CAMEO GDT-TS (↑) | CAMEO lDDT (↑) | CASP14 TM-score (↑)
AlphaFold2** [103] | Transformer | CASP14 | 3.30 [104] | 0.87 [104] | 0.86 [104] | 0.90 [104] | 0.38 [105]
RoseTTAFold [106] | Transformer | CAMEO, CASP14 | 5.72 [104] | 0.77 [104] | 0.71 [104] | 0.79 [104] | 0.37 [105]
ESMFold [105] | Transformer | CAMEO, CASP14 | 3.99 [104] | 0.85 [104] | 0.83 [104] | 0.87 [104] | 0.68
EigenFold [104] | Diffusion | CAMEO | 7.37 | 0.75 | 0.71 | 0.78 | —

Antibody structure prediction 

A class of models has also been developed specifically catering to antibody structure prediction. As discussed previously, MSA alignment cannot be used for antibody input, rendering general models like AlphaFold highly inefficient and slow in the context of antibody prediction. IgFold [108] uses sequence embeddings from AntiBERTy [109] and IPA to predict antibody structures, achieving SOTA generation speed with comparable accuracy with other models in the field. tFold-Ab [110] performs comparably with IgFold, generating full-atom structures more efficiently by reducing reliance on external tools like Rosetta energy functions. For a more in-depth analysis of the datasets, task definition, metrics, and performance, refer to the appendix on page 27.

Sequence generation

Overview 

Sequence generation, also known as inverse folding or fixed-backbone design, entails the inverse task of structure prediction. Generating amino acid sequences that can fold into target structures is crucial for designing proteins with desired structural and functional properties. As with molecules and protein structures, the space of valid sequences is vast [111]. In addition, the process of protein folding is naturally complex and difficult to predict.

To address these challenges, a variety of deep learning methods have been applied to represent the distribution of sequences with respect to structural information. While we briefly discuss some preliminary methods that are structure-agnostic, we place the highest focus on models that target specific protein structures.

Task 

Given a fixed protein backbone structure, generate a corresponding amino acid sequence that will fold into the given structure.

Datasets 

Models in this field utilize the following datasets. Models primarily use CATH for training, with some augmenting their training data using UniRef and UniParc. CATH and TS500 are most frequently used for evaluation. To produce a standardized benchmark, Yu et al. [112] created a set of 14 known de novo protein structures that do not exist in CATH, avoiding data contamination.

  • PDB [84]—Protein Data Bank, a comprehensive protein structure dataset (see page 9)

  • UniRef [113]—A clustered version of the Unified Protein KnowledgeBase (UniProtKB), part of the central resource UniProt, which is a curated and labeled set of protein sequences and their functions

  • UniParc [113]—A larger dataset of protein sequences, part of the central resource Uniprot, which includes UniProtKB and adds proteins from a variety of other sources

  • CATH [114]—A classification of protein domains (subsequences that can fold independently) into a hierarchical scheme with four levels: class (C), architecture (A), topology (T), and homologous superfamily (H). Using their classification, the authors also provide a diverse set of proteins that have minimal overlap and sequence similarity.

  • TS500 [115]—A subset of 500 proteins from PDB [116] filtered by sequence identity using the PISCES server. Li et al. [115] also created a smaller subset, TS50, with additional filters to control for sequence length and fraction of surface residues.

Metrics 

While many models perform their own testing, we use results from the independent benchmark created by Yu et al. [112] for fair comparison. We list the evaluated metrics below. Note that while Yu et al. did not evaluate perplexity, we include it here due to its frequent use in sequence design evaluations.

  • AAR—Amino Acid Recovery, the proportion of matching amino acids between the generated and native sequences

  • Diversity—The average difference between pairs of generated sequences, measured using Clustalw2 [117]

  • RMSD—Root-Mean-Square Deviation, a structural comparison between two structures (see page 9). In the context of sequence generation, the proposed sequences are folded into structures before comparison with the native backbone structures.

  • Nonpolar Loss—A metric measuring the rationality of polar amino acid types within the folded structure, where higher presence of nonpolar amino acids on the surface results in higher loss

  • PPL—Perplexity, an exponentiation of cross-entropy loss, representing the inverse likelihood of a native sequence appearing in the predicted sequence distribution. For a sequence of N amino acids x_1, …, x_N, we can express perplexity as

    PPL = exp( -(1/N) Σ_{i=1}^{N} log p(x_i) )

    Perplexity is calculated individually for each protein in the test set and averaged to produce a final PPL value.
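As a concrete illustration, AAR and PPL can be computed from a native/generated sequence pair and per-residue native probabilities, as sketched below; the inputs are toy values, whereas real evaluations use model-predicted distributions:

```python
import math

def aar(native: str, generated: str) -> float:
    """Amino acid recovery: fraction of positions where the generated
    sequence matches the native sequence."""
    assert len(native) == len(generated)
    return sum(a == b for a, b in zip(native, generated)) / len(native)

def perplexity(native_probs):
    """PPL = exp(-(1/N) * sum_i log p(x_i)), where p(x_i) is the model's
    probability for the native residue at position i."""
    n = len(native_probs)
    return math.exp(-sum(math.log(p) for p in native_probs) / n)

print(aar("MKTAYIA", "MKSAYLA"))          # 5 of 7 positions match
print(round(perplexity([0.05] * 10), 2))  # 20.0: uniform over 20 residue types
```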

Models 

A preliminary class of models generates novel protein sequences without considering a fixed backbone target, aiming to capture the unconditional distribution of amino acid sequence space. ProteinVAE [118] utilizes ProtBERT [119] to reduce raw input sequences into latent representations, employing an encoder–decoder framework with position-wise multi-head self-attention to capture long-range dependencies in these sequences. ProT-VAE [120] uses a different pre-trained language model ProtT5NV (https://www.nvidia.com/en-us/gpu-cloud/bionemo/). It includes an inner family-specific encoder–decoder layer to learn parameters relevant to specific protein families. Conversely, ProteinGAN [121] uses a GAN architecture to generate protein sequences. The model’s efficacy is exemplified through the example of malate dehydrogenase, demonstrating its potential to generate fully functional enzyme proteins. While these approaches demonstrate relative success in generating valid and diverse sets of protein sequences, models that operate entirely in sequence space cannot consider crucial structural information. This limitation restricts their ability to capture the full range of constraints and dependencies between amino acid residues.

The primary class of models in this field receives fixed backbone targets as input, generating corresponding amino acid sequences. ProteinSolver [122] draws connections between generating backbone structures and solving Sudoku puzzles, arguing that both are forms of constraint satisfaction problems where positional information imposes constraints on the labels that can be assigned to each “node.” After finding a GNN architecture that can effectively solve Sudoku puzzles, Strokach et al. [122] apply a similar architecture to the task of protein sequence design. In this design, node attributes encode amino acid types and edge attributes encode relative distances between pairs of residues. PiFold [123] extends this approach by introducing more comprehensive feature representations, including explicit distance, angle, and direction information in its node and edge features. Anand et al. [124] design a 3D CNN that directly learns conditional distributions for each residue given previous amino acid types, relative distances for heavy atoms, and torsional angles for side chains. Using these learned distributions, the 3D CNN autoregressively generates potential sequences. ABACUS-R [125] incorporates a pre-trained transformer to infer a residue’s amino acid types from nearby residues. To generate valid sequences, the model iteratively updates subsets of residues based on their environments, gradually constructing self-consistent sequences. ProRefiner [126] improves upon this design by introducing entropy scores for each prediction. While ABACUS-R uses every residue in the neighborhood for refinement, ProRefiner masks out high-entropy (low-confidence) predictions. By filtering out low-quality predictions, ProRefiner mitigates error accumulation from incorrect predictions.

To better model input protein structures, GPD [130] uses the Graphormer [131] architecture, which is a modified transformer for graph-structured data. GPD also uses Gaussian noise and random masks to improve diversity and recovery. GVP-GNN [127] uses a simple yet novel geometric representation for all nodes in the system. Rather than individually encoding vector features (like relative node orientations) for each node, these features are directly represented as geometric vectors that transform accordingly alongside global transformations. This approach defines a global geometric orientation rather than independent features of each node. ESM-IF1 [128] extends upon the representations in GVP-GNN by attaching a generic transformer block and training on an expanded dataset. To generate additional training examples, ESM-IF1 uses MSA Transformer [132], a representation learning model, to rank the sequences in the UniRef50 dataset by predicted LDDT scores. The top 12 million of these sequences are assigned corresponding structures through predictions made by AlphaFold2, producing a collection of sequence-structure pairs much larger than that of experimentally determined pairs. While previous methods require a fixed decoding order, ProteinMPNN [129] implements an order-agnostic autoregressive approach, which allows for a flexible choice of decoding order based on each specific task.
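The order-agnostic decoding idea behind ProteinMPNN can be sketched as follows; `predict_residue` is a hypothetical stand-in for the trained network, which in practice conditions on the backbone structure as well as the partially decoded sequence:

```python
import random

def order_agnostic_decode(n, predict_residue, seed=0):
    """Fill n residue positions in a random order, each conditioned on
    the partially decoded sequence (None marks undecoded positions)."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    sequence = [None] * n
    for pos in order:
        sequence[pos] = predict_residue(pos, sequence)
    return "".join(sequence)

# Hypothetical predictor that ignores context, for illustration only
print(order_agnostic_decode(8, lambda pos, partial: "ACDE"[pos % 4]))  # ACDEACDE
```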

We report the benchmark results measured by Yu et al. in Table 6. ProteinMPNN generates the most accurate sequences, leading all methods in sequence recovery, RMSD, and nonpolar loss. GPD remains the most time-efficient method, generating sequences around three times faster than ProteinMPNN. Performance on diversity varies, but this can often be artificially controlled by adjusting a noise hyperparameter during testing to increase variation. Note that ProRefiner is not listed; ProRefiner primarily acts as an add-on module for existing methods and reports 2–7 percentage points of improvement on AAR when used to refine sequences for GVP-GNN, ESM-IF1, and ProteinMPNN.

Table 6.

An overview of relevant sequence design methods; Div. and Non. refer to diversity and nonpolar loss, respectively; results are reported by Yu et al. [112]; for GVP-GNN, we report self-evaluated AAR on a CATH test split; [**] denotes the current SOTA

Model | Type of model | Dataset | AAR (%, ↑) | Div. (↑) | RMSD (Å, ↓) | Non. (↓) | Time (s, ↓)
ProteinSolver [122] | GNN | UniParc | 24.6 | 0.186 | 5.354 | 1.389 | 180
3D CNN [124] | CNN | CATH | 44.5 | 0.272 | 1.62 | 1.027 | 536544
ABACUS-R [125] | Transformer | CATH | 45.7 | 0.124 | 1.482 | 0.968 | 233280
PiFold [123] | GNN | CATH, TS50/TS500 | 42.8 | 0.141 | 1.592 | 1.464 | 221
GVP-GNN [127] | GNN | CATH, TS50 | 44.9* [127] | — | — | — | —
ESM-IF1** [128] | Transformer | CATH, UniRef+ | 47.7 | 0.184 | 1.265 | 1.201 | 1980
ProteinMPNN** [129] | MPNN | CATH | 48.7 | 0.168 | 1.019 | 1.061 | 112
GPD [130] | Transformer | CATH | 46.2 | 0.219 | 1.758 | 1.333 | 35

In general, sequence generation remains a challenging field, as current SOTA models recover fewer than half of the target amino acid residues.

Backbone design

Overview 

Like molecule generation, generating novel proteins from scratch can directly expand the library of available proteins capable of performing highly complex and versatile functions. While other areas such as structure prediction and sequence generation contribute to the overall drug design process, backbone design lies at the core of de novo design, where new protein structures can be created entirely from scratch.

As seen in molecule design, protein design contains a similar distinction between structure and sequence. Some models generate 1D amino acid sequences, while others directly generate 3D structures, with some co-designing both 1D sequences and 3D structures.

Datasets 

Models in this field utilize the following datasets:

  • PDB [84]—Protein Data Bank, a comprehensive protein structure dataset (see page 9)

  • AlphaFoldDB [133]—AlphaFold Database, an expanded protein structural dataset created by using AlphaFold2 to predict structures for corresponding sequences in the UniRef dataset

  • SCOP [134]—Structural Classification of Proteins, a classification of proteins by homology and structural similarity. SCOP has been updated several times to include additional categorizations and features, with many recent models using the extended SCOPe [135] database.

  • CATH [114]—A classification of protein domains into a hierarchical scheme with four levels, also used in sequence generation (see page 10)

Tasks 

The backbone design task involves designing a protein backbone structure either from scratch or based on existing context. This involves generating coordinates for the four backbone atoms of each amino acid (the nitrogen, alpha carbon, carbonyl carbon, and oxygen atoms). External tools like Rosetta (https://rosettacommons.org/software) can be used for side-chain packing, generating the remaining atoms.

  • Context-Free Generation—Given no input, the goal is to generate a diverse set of protein structures. This task is evaluated using the self-consistency TM (scTM) score.

  • Context-Given Generation—This is an inpainting task for proteins. Given a motif (a set of existing amino acid residues for a native protein), the goal is to accurately fill in the missing residues according to the native protein, which is evaluated using a variety of similarity metrics like AAR, PPL, and RMSD.

Metrics 

Generated backbones should be highly designable. Designability is generally determined by the ability of a structure to be realized by a corresponding amino acid sequence. While lab testing is optimal, it is not always feasible, so the folding process is simulated by other generative models. Thus, Trippe et al. [136] proposed the scTM approach, which includes the following steps: a proposed structure is fed into a sequence prediction model (typically ProteinMPNN [129]) to produce a corresponding amino acid sequence, which is then fed into a structure prediction model (typically AlphaFold2 [103]) to produce a sample structure. The TM-score (see page 9) between the generated structure and this sample structure is then calculated.

  • scTM [136]—Self-consistency TM-score, an approach proposed by Trippe et al. to simulate the folding process, described in detail above. Scores of scTM > 0.5 are typically considered designable, so the percentage of generated structures with scTM > 0.5 is often reported.

  • scRMSD—Self-consistency RMSD, identical to the scTM but using RMSD instead of the TM-score for evaluation. A cutoff of scRMSD < 2 Å is typically used.

  • AAR—Amino acid recovery, a comparison between the ground-truth and the generated amino acid sequences. The AAR is also measured in antibody representation learning (see page 26).

  • RMSD—Root-mean-square deviation, which measures distances between the ground-truth and the generated residue coordinates. The RMSD is also used in protein structure prediction (see page 9).
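The scTM evaluation loop can be sketched as below; `design_sequence`, `predict_structure`, and `tm_score` are hypothetical stand-ins for models such as ProteinMPNN and AlphaFold2 plus a TM-score routine, and 0.5 is the commonly used designability cutoff:

```python
def sc_tm(backbone, design_sequence, predict_structure, tm_score):
    """Self-consistency TM-score: design a sequence for the generated
    backbone, refold it, and compare the result to the original.
    All three callables are hypothetical stand-ins for real models."""
    sequence = design_sequence(backbone)
    refolded = predict_structure(sequence)
    return tm_score(backbone, refolded)

def designable_pct(sc_tm_scores, threshold=0.5):
    """Percentage of generated backbones meeting the commonly used
    scTM > 0.5 designability cutoff."""
    hits = sum(1 for s in sc_tm_scores if s > threshold)
    return 100.0 * hits / len(sc_tm_scores)

print(designable_pct([0.32, 0.55, 0.71, 0.48, 0.90]))  # 60.0
```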

Models 

ProtDiff [136] represents each residue with 3D Cartesian coordinates and uses a particle filtering diffusion approach. However, 3D Cartesian point clouds do not mirror the folding process to create protein structures—FoldingDiff [138] instead uses an angular representation for the protein structure, which more closely mirrors the rotational energy-optimizing protein folding process. FoldingDiff treats the protein backbone structure as a sequence of six angles representing orientations of consecutive residues. It denoises from a random, unfolded state to a folded structure using a DDPM and a BERT architecture. LatentDiff [137] initially uses an equivariant protein autoencoder with GNNs to embed proteins into a latent space. Subsequently, it uses an equivariant diffusion model to learn the latent distribution. This process is analogous to GeoLDM [70] for molecule design. Notably, LatentDiff’s sampling on its latent space is ten times faster than sampling on raw protein space.

The above models have shown relatively high performance in generating shorter proteins (up to 128 residues in length) but struggle with larger and more complex proteins [140]. To address longer protein structures, other methods use frame-based construction. This representation was first demonstrated in AlphaFold2 [103] for structure prediction, known as IPA. In this paradigm, each residue is represented by an orientation-preserving rigid-body transformation (a reference frame), which can be consistently defined regardless of global orientation, allowing for a more generalized representation than a series of 3D point clouds. Genie [140] performs discrete-time diffusion using a cloud of frames, each determined by a translation and a rotation, to generate backbone structures. During each diffusion step, Genie computes the Frenet–Serret frames and uses paired residue representations and IPA for noise prediction. FrameDiff [139] also parameterizes the backbone structures on the frame manifold, using a score-based generative model. This approach establishes a diffusion process on SE(3), the manifold of frames, invariant to translations and rotations. The neural network then predicts the denoised frame and torsion angles using IPA and a transformer model. Finally, RFDiffusion [141] combines the powerful structure prediction methods of RoseTTAFold with diffusion models to generate protein structures. RFDiffusion fine-tunes the RoseTTAFold weights, taking a masked input sequence and random noise coordinates to iteratively generate the backbone structure. RFDiffusion also “self-conditions” on the predicted final structure, leading to improved performance. As a large pre-trained model with significantly more parameters, RFDiffusion outperforms the other frame-based models.
GPDL [145] uses a technique similar to RFDiffusion’s, with ESMFold instead of RoseTTAFold as its base structure prediction model. It additionally incorporates the ESM2 language model to supply ESMFold with evolutionary information extracted from the input sequence. Owing to ESMFold’s efficiency in structure prediction, GPDL generates backbone structures 10–20 times faster than RFDiffusion.
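The rigid-frame representation shared by these models can be made concrete with a small sketch: building one residue's frame from its backbone N, CA, and C atoms via Gram–Schmidt. This is an illustrative simplification of the AlphaFold2-style construction, not any paper's exact code, and the coordinates below are arbitrary:

```python
import numpy as np

def residue_frame(n, ca, c):
    """Build an orientation-preserving rigid frame (R, t) for one residue
    from its backbone N, CA, C coordinates via Gram-Schmidt. Returns a
    rotation R (3x3) and translation t (3,); a local point x maps to
    the global frame as R @ x + t."""
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(e1, v2) * e1   # remove the component of v2 along e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)           # right-handed third axis
    R = np.stack([e1, e2, e3], axis=1)
    return R, ca

# arbitrary example coordinates (angstroms), roughly backbone-like
n = np.array([1.46, 0.0, 0.0])
ca = np.array([0.0, 0.0, 0.0])
c = np.array([-0.53, 1.42, 0.0])
R, t = residue_frame(n, ca, c)
# R is a proper rotation: orthonormal with determinant +1
assert np.allclose(R @ R.T, np.eye(3), atol=1e-8)
assert abs(np.linalg.det(R) - 1.0) < 1e-8
```

A frame-based diffusion model then noises and denoises the collection of `(R, t)` pairs (one per residue) rather than raw atom coordinates, which is what makes the representation invariant to global rotations and translations.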

Another class of models aims to co-design both protein sequence and structure simultaneously. GeoPro [142] uses an EGNN to encode and predict 3D protein structures, designing a separate decoder to decode protein sequences from these learned representations. Protpardelle [144] creates a “superposition” over the possible sidechain states and collapses them during each iterative update step in the reverse diffusion process. The backbone is updated in each iterative step, while the sidechains are chosen probabilistically by another network to update. ProtSeed [143] uses a trigonometry-aware encoder that computes constraints and interactions from the context features and uses an equivariant decoder to translate proteins into their desired state, updating the sequence and structure in a one-shot manner. Anand et al. [124] use IPA as mentioned above, performing diffusion in frame space to efficiently generate protein sequences and structures.

For a comparison of performance between various models, see Table 7. For an overview of developments described in this section, see Fig. 4. Note that as seen in molecule generation, we observe a progression from 1D-based (amino acid) models to 3D structure-based models to 1D/3D co-design; in addition, the field of protein design faces analogous questions of complexity scaling and latent space regularization.

Table 7.

An overview of relevant backbone design methods; “AFDB” refers to AlphaFoldDB; “Design.” refers to designability as defined by Lin et al. [140], where generated proteins with sufficiently low scRMSD and sufficiently high pLDDT (a local-confidence metric used in AlphaFold2 [103]) are considered designable; all metrics are evaluated on the PDB dataset, while [*] denotes results tested only on β-lactamase metalloproteins extracted from PDB; [**] denotes the current SOTA

scTM and Design. are Context-Free metrics; PPL, AAR, and RMSD are Context-Given.

| Model | Type of model | Dataset | scTM (%, ↑) | Design. (%, ↑) | PPL (↓) | AAR (%, ↑) | RMSD (Å, ↓) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LatentDiff [137] | EGNN, Diffusion | PDB, AFDB | 31.6 |  |  |  |  |
| FoldingDiff [138] | Diffusion | CATH | 14.2 [137] |  |  |  |  |
| FrameDiff [139] | Diffusion | PDB | 84 | 48.3 [140] |  |  |  |
| Genie [140] | Diffusion | SCOPe, AFDB | 81.5 | 79.0 |  |  |  |
| RFDiffusion** [141] | Diffusion | PDB |  | 95.1 [140] |  |  |  |
| ProtDiff [136] | EGNN, Diffusion | PDB | 11.8 [137] |  |  | 12.47* [142] | 8.01* [142] |
| GeoPro [142] | EGNN | PDB |  |  |  | 43.41* | 2.98* |
| ProtSeed [143] | MLP | CATH |  |  | 5.6 | 43.8 |  |
| Protpardelle [144] | Diffusion | CATH | 85 |  |  |  |  |
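As a reference for the scTM metric above, the TM-score underlying it has a closed form once two equal-length CA traces are superimposed. The sketch below assumes the optimal superposition is already given; real scTM evaluations use TM-align to search over superpositions, which is omitted here:

```python
import numpy as np

def tm_score(coords_a, coords_b):
    """TM-score between two already-superimposed, equal-length CA traces.
    coords_a, coords_b: (L, 3) arrays of CA coordinates in angstroms."""
    L = len(coords_a)
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8   # length-dependent distance scale
    d = np.linalg.norm(coords_a - coords_b, axis=1)  # per-residue deviation
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))

# identical structures score exactly 1.0; any deviation lowers the score
xyz = np.random.default_rng(1).normal(size=(100, 3))
assert tm_score(xyz, xyz) == 1.0
assert 0.0 < tm_score(xyz, xyz + 1.0) < 1.0
```

In the self-consistency (scTM) pipeline, a generated backbone is given a sequence (e.g. by an inverse-folding model), that sequence is re-folded by a structure predictor, and this score is computed between the generated and re-predicted structures.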
Figure 4.

An overview of the progress in protein generation over time; shortcomings of previous models are shown in the corresponding pink boxes, with subsequent models solving these shortcomings through novel design choices; for consistency, only methods that generate proteins from scratch (without fixed backbone or sequence input) are depicted [118, 120, 121, 127, 136–143].

Antibody CDR-H3 Generation 

As mentioned previously, antibody generation primarily focuses on generating a particular region known as the CDR-H3 region. As in protein generation, models for CDR-H3 generation have transitioned from sequence-based methods, such as the LSTM used by Akbar et al. [146], to the sequence–structure co-design pioneered by RefineGNN [147] through iterative refinement. Notably, some models extend beyond the CDR-H3 generation task, aiming to tackle multiple parts of the antibody pipeline at once: dyMEAN [148] is an end-to-end method incorporating structure prediction, docking, and CDR-H3 generation into a single model. For a more in-depth analysis of datasets, task definition, metrics, and performance, refer to the appendix on page 28.

Peptide design

Overview 

While the models developed for protein generation are powerful and broadly applicable, peptides warrant dedicated methods: peptide structure is inherently intricate and context dependent, and the array of downstream applications is highly diverse [149]. This section briefly explores four applications of AI in peptide generation, focusing on four SOTA models: MMCD (Multi-Modal Contrastive Diffusion Model), PepGB (Peptide–protein interaction via Graph neural network for Binary prediction), PepHarmony, and AdaNovo.

Peptide generation 

In peptide generation, like protein backbone design, models aim to generate novel peptides from scratch. MMCD [150] is a diffusion-based model for therapeutic peptide generation that co-designs peptide sequences and structures (backbone coordinates). It employs a transformer encoder for sequences and an EGNN for structures, along with contrastive learning strategies to align sequence and structure embeddings and differentiate therapeutic and non-therapeutic peptide embeddings. MMCD outperforms baselines in both sequence and structure generation tasks, as demonstrated by testing on datasets of antimicrobial peptides and anticancer peptides.
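The cross-modal alignment idea can be sketched with a one-directional InfoNCE contrastive loss over paired sequence and structure embeddings. This is a generic illustration of contrastive alignment, not MMCD's implementation; the embedding dimension and temperature `tau` are arbitrary:

```python
import numpy as np

def info_nce(seq_emb, struct_emb, tau=0.1):
    """One-directional InfoNCE loss: row i of seq_emb should be most
    similar to row i of struct_emb (its paired modality) among all rows.
    seq_emb, struct_emb: (N, d) arrays of per-peptide embeddings."""
    s = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    t = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    logits = s @ t.T / tau                        # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # matched pairs on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# perfectly aligned modalities give a lower loss than mismatched pairings
aligned = info_nce(z, z)
shuffled = info_nce(z, z[::-1])
assert aligned < shuffled
```

Minimizing such a loss pulls the two embeddings of the same peptide together and pushes apart embeddings of different peptides; MMCD applies the same principle both across modalities and between therapeutic and non-therapeutic peptides.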

Peptide–protein interaction 

For peptide–protein interactions, models aim to predict whether a proposed peptide–protein pair physically binds. PepGB [151] is a GNN-based model for peptide drug discovery that predicts peptide–protein interactions using graph attention networks. PepGB was trained on a binary interaction benchmark dataset of protein–peptide and protein–protein interactions; a mutation dataset of peptide analogs targeting MDM2 was used for validation, and a large-scale peptide sequence dataset from UniProt was used for pre-training. PepGB consistently outperforms baselines in predicting peptide–protein interactions for novel peptides and proteins, with increases of at least 9%, 9%, and 27% in AUC-precision scores and 19%, 6%, and 4% in AUC-recall scores under the novel-protein, novel-peptide, and novel-pair settings, respectively.
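As a generic illustration of the graph-attention message passing that models like PepGB build on, here is a single-head, NumPy-only attention layer; this is a sketch of the standard GAT update, not PepGB's architecture, and the weights and graph below are random:

```python
import numpy as np

def gat_layer(h, adj, W, a):
    """Single-head graph-attention layer (simplified: no output nonlinearity).
    h: (N, d) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W: (d, d_out) projection; a: (2*d_out,) attention vector."""
    z = h @ W
    N = z.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            x = a @ np.concatenate([z[i], z[j]])  # logit e_ij = a^T [z_i || z_j]
            e[i, j] = x if x > 0 else 0.2 * x     # LeakyReLU(slope=0.2)
    e = np.where(adj > 0, e, -1e9)                # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over neighbors
    return alpha @ z                              # attention-weighted aggregation

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
# path graph on 5 nodes, with self-loops
adj = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
out = gat_layer(h, adj, rng.normal(size=(4, 4)), rng.normal(size=(8,)))
assert out.shape == (5, 4)
```

In an interaction-prediction setting, such layers update residue-level node features over a peptide-protein graph, after which pooled embeddings feed a binary classifier.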

Peptide representation learning 

As with protein representation learning, models in peptide representation learning aim to convert raw peptide sequences into latent representations that capture valuable information. PepHarmony [152] is a multi-view contrastive learning model that integrates both sequence and structural information for enhanced peptide representation learning. It employs a sequence encoder (ESM) and a structure encoder (GearNet), which are trained together using contrastive or generative learning. PepHarmony utilizes data from conventional datasets like AlphaFoldDB and PDB while also employing a cell-penetrating peptide dataset (compiled from a variety of existing datasets), a solubility dataset (PROSOS-II [153]), and an affinity dataset (DrugBank [154]). Zhang et al. report that PepHarmony demonstrates superior performance in downstream tasks such as cell-penetrating peptide prediction, peptide solubility prediction, peptide–protein affinity prediction, and self-contact map prediction. When compared with general protein representation learning methods like ESM2 and GearNet, PepHarmony outperforms baseline and fine-tuned versions of these models in most evaluation metrics, including accuracy, F1 score, area under the receiver operating characteristic curve, and correlation coefficients.

Peptide sequencing 

Mass spectrometry plays a crucial role in analyzing the protein composition of physical samples, but various forms of noise make it challenging to extract information from the resulting spectra. In peptide sequencing, models address this challenge by predicting amino acid sequences directly from mass spectrometry data. AdaNovo [155] is a SOTA model for de novo peptide sequencing, composed of a mass spectrum encoder (MS Encoder) and two peptide decoders inspired by the transformer architecture. AdaNovo significantly improves upon previous models such as DeepNovo [156], PointNovo [157], and Casanovo [158] in peptide-level and amino acid-level precision across various species. For example, on a human dataset, AdaNovo achieves a peptide-level precision of 0.373 and an amino acid-level precision of 0.618, outperforming DeepNovo (0.293 and 0.610), PointNovo (0.351 and 0.606), and Casanovo (0.343 and 0.585). AdaNovo’s success is attributed to its use of conditional mutual information and adaptive training strategies, which improve its ability to identify post-translational modifications and handle the noise typical of mass spectrometry.
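The core signal exploited by de novo sequencing models is that residue identities appear as mass differences between consecutive prefix-ion peaks. The toy below reads an idealized, noise-free ladder with an abbreviated table of standard monoisotopic residue masses; handling real spectra (noise, missing peaks, modifications) is exactly what learned models such as AdaNovo address:

```python
# Toy de novo sequencing from an idealized, noise-free prefix-mass ladder.
RESIDUE_MASS = {  # standard monoisotopic residue masses (Da), subset only
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "D": 115.02694, "E": 129.04259,
}

def sequence_from_ladder(prefix_masses, tol=0.01):
    """Read residues off consecutive differences in a sorted prefix-mass
    ladder. Returns None on any gap no single residue explains, which is
    where real spectra force models to reason under uncertainty."""
    seq = []
    for lo, hi in zip(prefix_masses, prefix_masses[1:]):
        delta = hi - lo
        match = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) < tol]
        if not match:
            return None
        seq.append(match[0])
    return "".join(seq)

# build an ideal ladder for the peptide "GAVLE" and recover it
peptide = "GAVLE"
ladder = [0.0]
for aa in peptide:
    ladder.append(ladder[-1] + RESIDUE_MASS[aa])
assert sequence_from_ladder(ladder) == "GAVLE"
```

Real spectra rarely yield such a clean ladder, which is why transformer-based decoders conditioned on the full spectrum outperform rule-based readout.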

Current trends

The drug design process, marked by a history of high complexity and cost, is poised for a transformative shift fueled by generative AI. AI-based methods are driving faster development and reducing costs, resulting in more effective and accessible pharmaceuticals for the public. Within the realm of generative AI, a notable shift has occurred; the emergence of GNNs and graph-based methods has fueled the transition from sequence-based approaches to structure-based approaches, ultimately leading to the integration of both sequence and structure in generation tasks.

Within the field of molecular generation, we are witnessing the recent dominance of graph-based diffusion models. These models take advantage of E(3) equivariance to achieve SOTA performance, with leaders like GeoLDM and MiDi excelling in target-agnostic design, and TargetDiff, Pocket2Mol, and DiffSBDD excelling in target-aware design. Finally, Torsional Diffusion outperforms all counterparts in molecular conformation generation. Additionally, we observe a shift from sequence to structural approaches in target-aware molecule design, where SBDD approaches demonstrate clear advantages over LBDD approaches, which operate with amino acid sequences.

Within protein generation, a shift from sequence to structure is evident, exemplified by the emergence of structure-based representation learning models such as GearNet. These models complement established sequence-based representation models such as ESM-1b and UniRep, reflecting the importance of 3D structural information in the protein generation process. AlphaFold2 remains the clear SOTA model for structure prediction. As in molecule generation, a wave of diffusion models is now tackling the protein scaffolding task, with RFDiffusion emerging as the top-performing model.

Challenges and future directions

While the prospects for generative AI in drug design are promising, several issues must be addressed before we can embrace ML-driven de novo drug design. The main areas for improvement include increasing performance in existing tasks, defining more applicable tasks within molecule, protein, and antibody generation, and exploring entirely new areas of research.

Current generative models still struggle with a variety of tasks and benchmarks. Within molecule generation, we face the following challenges:

  • Complexity—Models generate high frequencies of valid and stable molecules when trained on the simple QM9 dataset but struggle when trained on the more complex GEOM-Drugs dataset.

  • Applicability—More applicable tasks like protein-molecule binding are especially challenging, and current models still struggle with generating molecules with high binding affinity for targets.

  • Explainability—All methods discussed are fairly black-box and abstract; existing models do not reveal aspects like “important” atoms or structures, and explainable AI in molecule generation is undeveloped as a whole.

Within protein generation, we encounter the following challenges:

  • Benchmarking—While most models in molecule generation use standardized benchmarking procedures, generative tasks in protein design lack a standard evaluative procedure, with variance between each model’s metrics and testing conditions, making it hard to objectively evaluate the quality of designed proteins.

  • Performance—As the tasks in protein generation are generally more complex than those in molecule generation, SOTA models still struggle in several key areas such as fold classification, gene ontology prediction, and antibody CDR-H3 generation, leaving room for future improvement.

While our paper focuses on generative models and applications, it is important to note that many current tasks are evaluated with predictive models, such as the affinity optimization task in antibody generation or the conditional generation task for molecules. In these cases, classifier networks are used to predict binding affinity or molecular properties, and improvements to these classification methods would naturally lead to more precise alignment with real-world biological applications.

Conclusion

This survey has provided an overview of the current landscape of generative AI in de novo drug design, focusing on molecule and protein generation. It has discussed important advancements in these fields, detailing the key datasets, model architectures, and evaluation metrics used. It has also highlighted key challenges and future directions, including improved benchmarking methods, greater explainability, and closer alignment with real-world tasks to increase applicability. Overall, generative AI has shown great promise in drug design, and continued research can lead to exciting advancements in the future.

Key Points

  • Our survey examines the advancements and applications of Generative AI within de novo drug design, particularly focusing on the generation of novel small molecules and proteins.

  • We explore the intricacies of generating biologically plausible and pharmaceutically promising compounds from scratch, providing a comprehensive yet approachable digest of formal task definitions, datasets, benchmarks, and model types in each field.

  • The work captures the progression of AI model architectures in drug design, highlighting the emergence of EGNNs and diffusion models as key drivers in recent work.

  • We highlight remaining challenges in applicability, performance, and scalability, delineating future research trajectories.

  • Through our organized repository, we aim to facilitate further collaboration in the rapidly evolving intersection of computational biology and AI.

Supplementary Material

Survey__bib_submission___appendix_(1)_bbae338

Contributor Information

Xiangru Tang, Department of Computer Science, Yale University, New Haven, CT 06520, United States.

Howard Dai, Department of Computer Science, Yale University, New Haven, CT 06520, United States.

Elizabeth Knight, School of Medicine, Yale University, New Haven, CT 06520, United States.

Fang Wu, Computer Science Department, Stanford University, CA 94305, United States.

Yunyang Li, Department of Computer Science, Yale University, New Haven, CT 06520, United States.

Tianxiao Li, Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States.

Mark Gerstein, Department of Computer Science, Yale University, New Haven, CT 06520, United States; Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States; Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States; Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States; Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States.

Funding

Xiangru Tang and Mark Gerstein are supported by Schmidt Futures.

References

  • 1. Drews J. Drug discovery: a historical perspective. Science 2000;287:1960–4. 10.1126/science.287.5460.1960. [DOI] [PubMed] [Google Scholar]
  • 2. Mandal S, Moudgil M, Mandal SK. Rational drug design. Eur J Pharmacol 2009;625:90–100. 10.1016/j.ejphar.2009.06.065. [DOI] [PubMed] [Google Scholar]
  • 3. Colwell LJ. Statistical and machine learning approaches to predicting protein–ligand interactions. Curr Opin Struct Biol 2018;49:123–8. 10.1016/j.sbi.2018.01.006. [DOI] [PubMed] [Google Scholar]
  • 4. Horvath C. Comparison of preclinical development programs for small molecules (drugs/pharmaceuticals) and large molecules (biologics/biopharmaceuticals): studies, timing, materials, and costs. Pharmaceutical Sciences Encyclopedia: Drug Discovery, Development, and Manufacturing 2010:1–35. 10.1002/9780470571224.pse166. [DOI] [Google Scholar]
  • 5. Sliwoski G, Kothiwale S, Meiler J. et al.. Computational methods in drug discovery. Pharmacol Rev 2014;66:334–95. 10.1124/pr.112.007336. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Schneider P, Walters WP, Plowright AT. et al.. Rethinking drug design in the artificial intelligence era. Nat Rev Drug Discov 2020;19:353–64. 10.1038/s41573-019-0050-3. [DOI] [PubMed] [Google Scholar]
  • 7. Jing Y, Bian Y, Ziheng H. et al.. Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. AAPS J 2018;20:1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Polishchuk P. Interpretation of quantitative structure–activity relationship models: past, present, and future. J Chem Inf Model 2017;57:2618–39. 10.1021/acs.jcim.7b00274. [DOI] [PubMed] [Google Scholar]
  • 9. Isarankura-Na-Ayudhya C, Naenna T, Nantasenamat C. et al.. A practical overview of quantitative structure-activity relationship. EXCLI 2009;8:74–88. [Google Scholar]
  • 10. Li Z, Wang S, Chin WS. et al.. High-throughput screening of bimetallic catalysts enabled by machine learning. J Mater Chem A 2017;5:24131–8. 10.1039/C7TA01812F. [DOI] [Google Scholar]
  • 11. Li H, Sze K-H, Gang L. et al. Machine-learning scoring functions for structure-based virtual screening. Wiley Interdiscip Rev Comput Mol Sci 2021;11:e1478. 10.1002/wcms.1478. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Yang KK, Zachary W, Arnold FH. Machine-learning-guided directed evolution for protein engineering. Nat Methods 2019;16:687–94. 10.1038/s41592-019-0496-6. [DOI] [PubMed] [Google Scholar]
  • 13. Wu Z, Kan SBJ, Lewis RD. et al.. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc Natl Acad Sci 2019;116:8852–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Hartenfeller M, Schneider G. De novo drug design. Chemoinformatics and computational chemical biology 2011:299–323. 10.1007/978-1-60761-839-3_12. [DOI] [PubMed] [Google Scholar]
  • 15. Mouchlis VD, Afantitis A, Serra A. et al.. Advances in de novo drug design: from conventional to machine learning methods. Int J Mol Sci 2021;22:1676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Lima AN, Philot EA, Trossini GHG. et al.. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discovery 2016;11:225–39. 10.1517/17460441.2016.1146250. [DOI] [PubMed] [Google Scholar]
  • 17. Wang M, Wang Z, Sun H. et al.. Deep learning approaches for de novo drug design: an overview. Curr Opin Struct Biol 2022;72:135–44. 10.1016/j.sbi.2021.10.001. [DOI] [PubMed] [Google Scholar]
  • 18. Kutchukian PS, Shakhnovich EI. De novo design: balancing novelty and confined chemical space. Expert Opin Drug Discovery 2010;5:789–812. 10.1517/17460441.2010.497534. [DOI] [PubMed] [Google Scholar]
  • 19. Liu X, IJzerman AP, van Westen GJP. Computational approaches for de novo drug design: past, present, and future. Artificial Neural Networks 2020:139–65. 10.1007/978-1-0716-0826-5_6. [DOI] [PubMed] [Google Scholar]
  • 20. DiMasi JA, Grabowski HG, Hansen RW. The cost of drug development. N Engl J Med 2015;372:1972–2. 10.1056/NEJMc1504317. [DOI] [PubMed] [Google Scholar]
  • 21. Lippow SM, Tidor B. Progress in computational protein design. Curr Opin Biotechnol 2007;18:305–11. 10.1016/j.copbio.2007.04.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Zhou W, Wang Y, Aiping L. et al.. Systems pharmacology in small molecular drug discovery. Int J Mol Sci 2016;17:246. 10.3390/ijms17020246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Richard Bickerton G, Paolini GV, Besnard J. et al.. Quantifying the chemical beauty of drugs. Nat Chem 2012;4:90–8. 10.1038/nchem.1243. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Ursu O, Rayan A, Goldblum A. et al.. Understanding drug-likeness. Wiley Interdiscip Rev Comput Mol Sci 2011;1:760–81. 10.1002/wcms.52. [DOI] [Google Scholar]
  • 25. Polishchuk PG, Madzhidov TI, Varnek A. Estimation of the size of drug-like chemical space based on gdb-17 data. J Comput Aided Mol Des 2013;27:675–9. 10.1007/s10822-013-9672-4. [DOI] [PubMed] [Google Scholar]
  • 26. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of r&d costs. J Health Econ 2016;47:20–33. 10.1016/j.jhealeco.2016.01.012. [DOI] [PubMed] [Google Scholar]
  • 27. Jayatunga MKP, Xie W, Ruder L. et al.. Ai in small-molecule drug discovery: a coming wave. Nat Rev Drug Discov 2022;21:175–6. 10.1038/d41573-022-00025-1. [DOI] [PubMed] [Google Scholar]
  • 28. Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022;23. 10.1093/bib/bbac102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Gao W, Mahajan SP, Sulam J. et al.. Deep learning in protein structural modeling and design. Patterns 2020;1:100142. 10.1016/j.patter.2020.100142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Huang P-S, Boyken SE, Baker D. The coming of age of de novo protein design. Nature 2016;537:320–7. 10.1038/nature19946. [DOI] [PubMed] [Google Scholar]
  • 31. Zhang N, Bi Z, Liang X. et al.. Ontoprotein: protein pretraining with gene ontology embedding arXiv preprint arXiv:2201.11147. 2022.
  • 32. Zhou H-Y, Yunxiang F, Zhang Z. et al.. Protein representation learning via knowledge enhanced primary structure modeling bioRxiv. 2023:2023–01.
  • 33. Ma C, Zhao H, Lin Z. et al.. Retrieved sequence augmentation for protein representation learning bioRxiv. 2023:2023–02.
  • 34. Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 2009;10:866–76. 10.1038/nrm2805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Dahiyat BI, Mayo SL. De novo protein design: fully automated sequence selection. Science 1997;278:82–7. 10.1126/science.278.5335.82. [DOI] [PubMed] [Google Scholar]
  • 36. Zhang Z, Yan J, Liu Q. et al.. A systematic survey in geometric deep learning for structure-based drug design arXiv preprint arXiv:2306.11768. 2023.
  • 37. Thomas M, Bender A, Graaf. Integrating structure-based approaches in generative molecular design. Curr Opin Struct Biol 2023;79:102559. 10.1016/j.sbi.2023.102559. [DOI] [PubMed] [Google Scholar]
  • 38. Akbar R, Bashour H, Rawat P. et al.. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs 2022;14:2008790.Taylor & Francis. 10.1080/19420862.2021.2008790. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Hummer AM, Abanades B, Deane CM. Advances in computational structure-based antibody design. Curr Opin Struct Biol 2022;74:102379. 10.1016/j.sbi.2022.102379. [DOI] [PubMed] [Google Scholar]
  • 40. Chungyoun M, Gray JJ. Ai models for protein design are driving antibody engineering. Current opinion Biomed Eng 2023;28:100473. 10.1016/j.cobme.2023.100473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Kim J, McFee M, Fang Q. et al.. Computational and artificial intelligence-based methods for antibody development. Trends Pharmacol Sci 2023;44:175–89. 10.1016/j.tips.2022.12.005. [DOI] [PubMed] [Google Scholar]
  • 42. Zhang M, Qamar M, Kang T. et al.. A survey on graph diffusion models: generative ai in science for molecule, protein and material arXiv preprint arXiv:2304.01565. 2023.
  • 43. Guo Z, Liu J, Wang Y. et al.. Diffusion models in bioinformatics: a new wave of deep learning revolution in action arXiv preprint arXiv:2302.10907. 2023.
  • 44. Goodfellow I, Pouget-Abadie J, Mirza M. et al.. Generative adversarial nets. Advances in neural information processing systems 2014;27. [Google Scholar]
  • 45. Kingma DP, Welling M. Auto-encoding variational bayes arXiv preprint arXiv:1312.6114. 2013.
  • 46. Rezende D, Mohamed S. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 2015, 1530–8. [Google Scholar]
  • 47. Yang L, Zhang Z, Song Y. et al.. Diffusion models: a comprehensive survey of methods and applications arXiv preprint arXiv:2209.00796. 2022.
  • 48. van Erven T, Harremoës P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans Inf Theory 2014;60:3797–820. 10.1109/TIT.2014.2320500. [DOI] [Google Scholar]
  • 49. Austin J, Johnson DD, Ho J. et al.. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 2021;34:17981–93. [Google Scholar]
  • 50. Vaswani A, Shazeer N, Parmar N. et al.. Attention is all you need. Advances in neural information processing systems 2017;30. [Google Scholar]
  • 51. Popescu M-C, Balas VE, Perescu-Popescu L. et al.. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems 2009;8:579–88. [Google Scholar]
  • 52. LeCun Y, Chopra S, Hadsell R. et al.. A tutorial on energy-based learning. Predicting structured data 2006;1. 10.7551/mitpress/7443.003.0014. [DOI] [Google Scholar]
  • 53. Ngiam J, Chen Z, Koh PW. et al.. Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), 2011, 1105–12.
  • 54. Schütt K, Kindermans P-J, Felix HES. et al.. Schnet: a continuous-filter convolutional neural network for modeling quantum interactions. Advances in neural information processing systems 2017;30. [Google Scholar]
  • 55. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80. 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
  • 56. Devlin J, Chang M-W, Lee K. et al.. Bert: pre-training of deep bidirectional transformers for language understanding arXiv preprint arXiv:1810.04805. 2018.
  • 57. Scarselli F, Marco Gori A, Tsoi C. et al.. The graph neural network model. IEEE Trans Neural Netw 2008;20:61–80. 10.1109/TNN.2008.2005605. [DOI] [PubMed] [Google Scholar]
  • 58. Satorras VG, Hoogeboom E, Welling M. E(n) equivariant graph neural networks. In International conference on machine learning. PMLR, 2021, 9323–32. [Google Scholar]
  • 59. Gilmer J, Schoenholz SS, Riley PF. et al.. Neural message passing for quantum chemistry. In International conference on machine learning. PMLR, 2017, 1263–72. [Google Scholar]
  • 60. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks arXiv preprint arXiv:1609.02907. 2016.
  • 61. Xu K, Hu W, Leskovec J. et al.. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826. 2018.
  • 62. LeCun Y, Boser B, Denker JS. et al.. Backpropagation applied to handwritten zip code recognition. Neural Comput 1989;1:541–51. 10.1162/neco.1989.1.4.541. [DOI] [Google Scholar]
  • 63. Jiuxiang G, Wang Z, Kuen J. et al.. Recent advances in convolutional neural networks. Pattern Recognit 2018;77:354–77. 10.1016/j.patcog.2017.10.013. [DOI] [Google Scholar]
  • 64. O’Shea K, Nash R. An introduction to convolutional neural networks arXiv preprint arXiv:1511.08458. 2015.
  • 65. Tang X, Tran A, Tan J. et al.. Mollm: a unified language model for integrating biomedical text with 2d and 3d molecular representations bioRxiv. 2023–11:2023. [DOI] [PMC free article] [PubMed]
  • 66. Ramakrishnan R, Dral PO, Rupp M. et al.. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 2014;1:1–7. 10.1038/sdata.2014.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Axelrod S, Gomez-Bombarelli R. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data 2022;9:185. 10.1038/s41597-022-01288-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Vignac C, Frossard P. Top-n: Equivariant set and graph generation without exchangeability arXiv preprint arXiv:2110.02096. 2021.
  • 69. Gebauer N, Gastegger M, Schütt KT. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. Advances in neural information processing systems 2019;32. [Google Scholar]
  • 70. Xu M, Powers A, Dror R. et al.. Geometric latent diffusion models for 3d molecule generation arXiv preprint arXiv:2305.01140. 2023.
  • 71. Satorras VG, Hoogeboom E, Fuchs F. et al.. E(n) equivariant normalizing flows. Advances in Neural Information Processing Systems, Vol. 34, 2021, 4181–92. [Google Scholar]
  • 72. Morehead A, Cheng J. Geometry-complete diffusion for 3d molecule generation arXiv preprint arXiv:2302.04313. 2023. [DOI] [PMC free article] [PubMed]
  • 73. Huang L, Zhang H, Xu T. et al.. Mdm: molecular diffusion model for 3d molecule generation arXiv preprint arXiv:2209.05710. 2022.
  • 74. Huang H, Sun L, Bowen D. et al.. Learning joint 2d & 3d diffusion models for complete molecule generation arXiv preprint arXiv:2305.12347. 2023. [DOI] [PubMed]
  • 75. Vignac C, Osman N, Toni L. et al.. Midi: mixed graph and 3d denoising diffusion for molecule generation arXiv preprint arXiv:2302.09048. 2023.
  • 76. Hoogeboom E, Satorras VG, Vignac C. et al.. Equivariant diffusion for molecule generation in 3d. In International Conference on Machine Learning. PMLR, 2022, 8867–87. [Google Scholar]
  • 77. Gómez-Bombarelli R, Wei JN, Duvenaud D. et al.. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci. 2018;4:268–76. 10.1021/acscentsci.7b00572. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Kusner MJ, Paige B, Hernández-Lobato JM. Grammar variational autoencoder. In International conference on machine learning. PMLR, 2017, 1945–54. [Google Scholar]
  • 79. Dai H, Tian Y, Dai B. et al.. Syntax-directed variational autoencoder for structured data arXiv preprint arXiv:1802.08786. 2018.
  • 80. Jin W, Barzilay R, Jaakkola T. Junction tree variational autoencoder for molecular graph generation. In International conference on machine learning. PMLR, 2018, 2323–32. [Google Scholar]
  • 81. Francoeur PG, Masuda T, Sunseri J. et al.. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J Chem Inf Model 2020;60:4200–15. PMID: 32865404. 10.1021/acs.jcim.0c00411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Liegi H, Benson ML, Smith RD. et al.. Binding moad (mother of all databases). Proteins 2005;60:333–40. 10.1002/prot.20512. [DOI] [PubMed] [Google Scholar]
  • 83. Irwin JJ, Tang KG, Young J. et al.. Zinc20—a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model 2020;60:6065–73. PMID: 33118813. 10.1021/acs.jcim.0c00675. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Berman HM, Westbrook J, Feng Z. et al.. The protein data bank. Nucleic Acids Res 2000;28:235–42. 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Trott O, Olson AJ. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 2010;31:455–61. 10.1002/jcc.21334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Chem 2009;1:1–11. 10.1186/1758-2946-1-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Tanimoto TT. An elementary mathematical theory of classification and prediction. 1958. [Google Scholar]
  • 88. Li Y, Gao C, Song X. et al. DrugGPT: a GPT-based strategy for designing potential ligands targeting specific proteins. bioRxiv 2023.
  • 89. Masuda T, Ragoza M, Koes DR. Generating 3D molecular structures conditional on a receptor binding site with deep generative models. arXiv preprint arXiv:2010.14442. 2020.
  • 90. Peng X, Luo S, Guan J. et al. Pocket2Mol: efficient molecular sampling based on 3D protein pockets. In International Conference on Machine Learning. PMLR, 2022, 17644–55.
  • 91. Luo S, Guan J, Ma J. et al. A 3D generative model for structure-based drug design. Advances in Neural Information Processing Systems 2021;34:6229–39.
  • 92. Guan J, Qian WW, Peng X. et al. 3D equivariant diffusion for target-aware molecule generation and affinity prediction. arXiv preprint arXiv:2303.03543. 2023.
  • 93. Schneuing A, Du Y, Harris C. et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695. 2022.
  • 94. Lopez MJ, Mohiuddin SS. Biochemistry, Essential Amino Acids. StatPearls, 2020.
  • 95. Flissi A, Ricart E, Campart C. et al. Norine: update of the nonribosomal peptide resource. Nucleic Acids Res 2020;48:D465–9. 10.1093/nar/gkz1000.
  • 96. Lemer CM-R, Rooman MJ, Wodak SJ. Protein structure prediction by threading methods: evaluation of current techniques. Proteins 1995;23:337–55. 10.1002/prot.340230308.
  • 97. Krieger E, Nabuurs SB, Vriend G. Homology modeling. In Structural Bioinformatics, Vol. 44, 2003, 509–23. 10.1002/0471721204.ch25.
  • 98. Kryshtafovych A, Schwede T, Topf M. et al. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins 2021;89:1607–17. 10.1002/prot.26237.
  • 99. Haas J, Barbato A, Behringer D. et al. Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 2018;86:387–98. 10.1002/prot.25431.
  • 100. Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res 2003;31:3370–4. 10.1093/nar/gkg571.
  • 101. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57:702–10.
  • 102. Mariani V, Biasini M, Barbato A. et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29:2722–8.
  • 103. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
  • 104. Jing B, Erives E, Pao-Huang P. et al. EigenFold: generative protein structure prediction with diffusion models. arXiv preprint arXiv:2304.02198. 2023.
  • 105. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574.
  • 106. Baek M, DiMaio F, Anishchenko I. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6.
  • 107. Du Z, Su H, Wang W. et al. The trRosetta server for fast and accurate protein structure prediction. Nat Protoc 2021;16:5634–51.
  • 108. Ruffolo JA, Chu L-S, Mahajan SP. et al. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun 2023;14:2389. 10.1038/s41467-023-38063-x.
  • 109. Ruffolo JA, Gray JJ, Sulam J. Deciphering antibody affinity maturation with language models and weakly supervised learning. arXiv preprint arXiv:2112.07782. 2021.
  • 110. Wu J, Wu F, Jiang B. et al. tFold-Ab: fast and accurate antibody structure prediction without sequence homologs. bioRxiv 2022.
  • 111. Dryden DTF, Thomson AR, White JH. How much of protein sequence space has been explored by life on Earth? J R Soc Interface 2008;5:953–6. 10.1098/rsif.2008.0085.
  • 112. Yu J, Mu J, Wei T. et al. Multi-indicator comparative evaluation for deep learning-based protein sequence design methods. Bioinformatics 2024;40:btae037. 10.1093/bioinformatics/btae037.
  • 113. Apweiler R, Bairoch A, Wu CH. et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32:D115–19. 10.1093/nar/gkh131.
  • 114. Sillitoe I, Lewis TE, Cuff A. et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 2015;43:D376–81. 10.1093/nar/gku947.
  • 115. Li Z, Yang Y, Faraggi E. et al. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins 2014;82:2565–73. 10.1002/prot.24620.
  • 116. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics 2003;19:1589–91. 10.1093/bioinformatics/btg224.
  • 117. Larkin MA, Blackshields G, Brown NP. et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007;23:2947–8. 10.1093/bioinformatics/btm404.
  • 118. Lyu S, Sowlati-Hashjin S, Garton M. ProteinVAE: variational autoencoder for translational protein design. bioRxiv 2023.
  • 119. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2021;44:7112–27.
  • 120. Sevgen E, Moller J, Lange A. et al. ProT-VAE: protein transformer variational autoencoder for functional protein design. bioRxiv 2023.
  • 121. Repecka D, Jauniskis V, Karpus L. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 2021;3:324–33. 10.1038/s42256-021-00310-5.
  • 122. Strokach A, Becerra D, Corbi-Verge C. et al. Fast and flexible protein design using deep graph neural networks. Cell Syst 2020;11:402–411.e4. 10.1016/j.cels.2020.08.016.
  • 123. Gao Z, Tan C, Chacón P. et al. PiFold: toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643. 2022.
  • 124. Anand N, Eguchi R, Mathews II. et al. Protein sequence design with a learned potential. Nat Commun 2022;13:746. 10.1038/s41467-022-28313-9.
  • 125. Liu Y, Zhang L, Wang W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci 2022;2:451–62. 10.1038/s43588-022-00273-6.
  • 126. Zhou X, Chen G, Ye J. et al. ProRefiner: an entropy-based refining strategy for inverse protein folding with global graph attention. Nat Commun 2023;14:7434. 10.1038/s41467-023-43166-6.
  • 127. Jing B, Eismann S, Suriana P. et al. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations 2021. https://openreview.net/forum?id=1YLJDvSx6J4.
  • 128. Hsu C, Verkuil R, Liu J. et al. Learning inverse folding from millions of predicted structures. In International Conference on Machine Learning. PMLR, 2022, 8946–70.
  • 129. Dauparas J, Anishchenko I, Bennett N. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378:49–56. 10.1126/science.add2187.
  • 130. Mu J, Li Z, Zhang B. et al. Graphormer supervised de novo protein design method and function validation. Brief Bioinform 2024;25:bbae135. 10.1093/bib/bbae135.
  • 131. Ying C, Cai T, Luo S. et al. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems 2021;34:28877–88.
  • 132. Rao RM, Liu J, Verkuil R. et al. MSA Transformer. In International Conference on Machine Learning. PMLR, 2021, 8844–56.
  • 133. Varadi M, Anyango S, Deshpande M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50:D439–44. 10.1093/nar/gkab1061.
  • 134. Murzin AG, Brenner SE, Hubbard T. et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–40. 10.1016/S0022-2836(05)80134-2.
  • 135. Chandonia J-M, Guan L, Lin S. et al. SCOPe: improvements to the structural classification of proteins–extended database to facilitate variant interpretation and machine learning. Nucleic Acids Res 2022;50:D553–9. 10.1093/nar/gkab1054.
  • 136. Trippe BL, Yim J, Tischer D. et al. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119. 2022.
  • 137. Fu C, Yan K, Wang L. et al. A latent diffusion model for protein structure generation. arXiv preprint arXiv:2305.04120. 2023.
  • 138. Wu KE, Yang KK, van den Berg R. et al. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611. 2022.
  • 139. Yim J, Trippe BL, De Bortoli V. et al. SE(3) diffusion model with application to protein backbone generation. In International Conference on Machine Learning. PMLR, 2023;202:40001–39.
  • 140. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. arXiv preprint arXiv:2301.12485. 2023.
  • 141. Watson JL, Juergens D, Bennett NR. et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620:1089–100. 10.1038/s41586-023-06415-8.
  • 142. Song Z, Zhao Y, Song Y. et al. Joint design of protein sequence and structure based on motifs. arXiv preprint arXiv:2310.02546. 2023.
  • 143. Shi C, Wang C, Lu J. et al. Protein sequence and structure co-design with equivariant translation. arXiv preprint arXiv:2210.08761. 2022.
  • 144. Chu AE, Cheng L, El Nesr G. et al. An all-atom protein generative model. bioRxiv 2023.
  • 145. Zhang B, Liu K, Zheng Z. et al. Protein language model supervised precise and efficient protein backbone design method. bioRxiv 2023.
  • 146. Akbar R, Robert PA, Weber CR. et al. In silico proof of principle of machine learning-based antibody design at unconstrained scale. MAbs 2022;14:2031482. 10.1080/19420862.2022.2031482.
  • 147. Jin W, Wohlwend J, Barzilay R. et al. Iterative refinement graph neural network for antibody sequence-structure co-design. arXiv preprint arXiv:2110.04624. 2021.
  • 148. Kong X, Huang W, Liu Y. End-to-end full-atom antibody design. arXiv preprint arXiv:2302.00203. 2023.
  • 149. Muttenthaler M, King GF, Adams DJ. et al. Trends in peptide drug discovery. Nat Rev Drug Discov 2021;20:309–25. 10.1038/s41573-020-00135-8.
  • 150. Wang Y, Liu X, Huang F. et al. A multi-modal contrastive diffusion model for therapeutic peptide generation. AAAI 2024;38:3–11. 10.1609/aaai.v38i1.27749.
  • 151. Lei Y, Xu W, Fang M. et al. PepGB: facilitating peptide drug discovery via graph neural networks. arXiv preprint arXiv:2401.14665. 2024.
  • 152. Zhang R, Wu H, Liu C. et al. PepHarmony: a multi-view contrastive learning framework for integrated sequence and structure-based peptide encoding. arXiv preprint arXiv:2401.11360. 2024.
  • 153. Smialowski P, Doose G, Torkler P. et al. PROSO II–a new method for protein solubility prediction. FEBS J 2012;279:2192–200. 10.1111/j.1742-4658.2012.08603.x.
  • 154. Wishart DS, Knox C, Guo AC. et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 2008;36:D901–6.
  • 155. Xia J, Chen S, Zhou J. et al. AdaNovo: adaptive de novo peptide sequencing with conditional mutual information, 2024.
  • 156. Tran NH, Zhang X, Xin L. et al. De novo peptide sequencing by deep learning. Proc Natl Acad Sci 2017;114:8247–52. 10.1073/pnas.1705691114.
  • 157. Qiao R, Tran NH, Xin L. et al. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nat Mach Intell 2021;3:420–5. 10.1038/s42256-021-00304-3.
  • 158. Yilmaz M, Fondrie W, Bittremieux W. et al. De novo mass spectrometry peptide sequencing with a transformer model. In International Conference on Machine Learning. PMLR, 2022, 25514–22.

Associated Data


Supplementary Materials

Survey__bib_submission___appendix_(1)_bbae338
