Abstract
Molecular representation learning (MRL) is a foundation for applying computational methods to drug discovery, enabling the transformation of molecular structures and properties into numerical vectors. These vectors serve as input for machine learning models and facilitate the prediction and analysis of molecular attributes, functions, and reactions. The advent of foundation models has introduced both new opportunities and challenges to MRL. These models have improved generalizability and transferability in data-scarce settings. Through pretraining and fine-tuning, foundation models can be adapted to various domains. Their robust encoding and generative abilities also allow the transformation of molecular data into more expressive forms. This paper provides a detailed review of current mainstream molecular descriptors and datasets, focusing primarily on the representation of small molecules while excluding larger molecules such as proteins and peptides. It classifies foundation models into two primary categories based on the form of input: unimodal-based and multimodal-based models. For each category, representative models are identified and their advantages and disadvantages evaluated. Moreover, we systematically summarize four core pretraining strategies for MRL foundation models, analyzing their task designs, applicable scenarios, and impacts on downstream performance. In addition, the application of molecular representation foundation models in drug discovery and development is discussed, together with the current status of model interpretability. The paper concludes with insights into the future directions of MRL foundation models.
Keywords: molecular representation learning, foundation models, drug discovery, machine learning
Introduction
With the rapid advancement of artificial intelligence, deep learning applications in real-world scenarios have attracted growing attention across diverse scientific domains. In particular, in chemistry and medicine, these applications encompass molecular property prediction, drug–drug interaction (DDI) analysis, and molecular generation (MG) [1–4]. Fundamental to these advances are effective molecular representation learning (MRL) methods, which enable the capture of molecular characteristics at different levels [5]. MRL leverages deep learning models to transform molecular structures and properties into numerical vectors, primarily through sequence-based and graph-based representation frameworks.
Historically, molecular fingerprints were initially used to encode molecules as binary vectors for model input [6], but public databases offered limited characteristics suitable for this representation. The Simplified Molecular Input Line Entry System (SMILES) addressed some of these limitations, facilitating the use of sequence-based neural architectures (e.g. Transformers and Recurrent Neural Networks (RNNs)) for molecular prediction tasks [7, 8]. Subsequently, researchers began representing molecules as topological graphs, with atoms as nodes and bonds as edges [9, 10]. More recently, 3D molecular geometry has gained traction in molecular representation, as it better captures spatial structure–function relationships and reveals unique energy states [11, 12].
Foundation models, defined as machine learning architectures characterized by massive parameter scales, enable adaptation to complex tasks [13–15]. Their development has progressed through four distinct stages: shallow neural network models, deep learning models, large-scale pretrained models, and ultra-large-scale models, with model size increasing roughly 10-fold per year (from millions to billions of parameters) [16, 17], as illustrated in Fig. 1. Early computational resource constraints favored smaller models such as support vector machines (SVMs) [18], but advances in computational capacity and data scalability since 2009 have broadened neural network applications in molecular research [19, 20]. Since then, deep learning models have become ubiquitous in molecular representation tasks.
Figure 1.
The development process of MRL foundation models.
With increasing data volumes and computing resources, the scale of pretrained models has continued to grow, demonstrating robust generalization and accuracy [21–24]. The advent of large language models (LLMs) has prompted increasing numbers of scholars to explore their applicability in molecular science. Combined with the availability of billion-level datasets in MRL [25], LLMs are gaining traction in molecular modeling.
However, a systematic review of this field remains notably lacking. In this article, we investigate recent foundation models of MRL. Since foundation models in the molecular field are still in their early stages, most have significantly fewer parameters compared with LLMs. Thus, in the molecular field, we define models with millions of parameters as foundation models or large-scale models and summarize them to facilitate quick understanding for researchers new to the field and to aid in practical applications. Our contributions can be summarized in the following aspects:
(1) Categorization by inputs: We classify molecular representations on the basis of data-driven types. Using these categories, we outline learning strategies, application tasks, and representative foundation models associated with specific domain knowledge.
(2) Abundant additional resources: We have compiled a comprehensive collection of resources, including links to code repositories and benchmark datasets, to support further research and application.
(3) Guidance for choosing MRL foundation models: We present a comprehensive overview of MRL foundation models employed in each prevalent molecular task and systematically formulate a structured set of guidelines to assist in the judicious selection of appropriate models.
(4) Future outlook: We discuss the limitations of current models and highlight promising research directions that could lead to breakthroughs in the field.
Molecular descriptors and datasets
A typical foundation model framework involves pretraining the model using unlabeled large-scale datasets to learn generalizable representations, followed by fine-tuning on labeled downstream datasets [26]. Molecular descriptors, which vary in the type of information they provide, are crucial for this process. As depicted in Fig. 2, the utility of these descriptors depends on the specific downstream tasks. For example, 2D topological graphs generally reflect molecular size and degree of branching, which correlate with several drug properties, such as toxicity [27, 28]. In contrast, 3D geometric descriptors offer spatial information on atoms, conformational correlations, and surface properties, which are essential to determine quantum mechanical properties [29].
Figure 2.
An overview of different inputs of MRL foundation models.
The success of pretraining foundation models is highly dependent on the underlying datasets, which are intimately linked to molecular representation techniques [30]. Numerous datasets are available, usually containing information on the biochemical properties and biological activities of molecules, such as PubChem, ZINC, and GEOM [31–33]. We detail these datasets in Table 1, outlining their application tasks, data sizes, and other relevant information.
Table 1.
Representative molecular datasets of foundation models
PP, property prediction; MG, molecular generation; RP, reaction prediction; DDI, drug–drug interaction; DSP, drug synergy prediction. “–” indicates not applicable. “∼” means “about”. The first four datasets are pretraining datasets, so no application tasks are listed for them.
We now briefly review the commonly used molecular descriptors and associated datasets.
Molecular fingerprint: This method converts molecular structures or properties into numerical or vector representations, widely used in computational chemistry and drug design for tasks such as virtual screening, similarity searches, and structure–activity relationship analysis. Morgan fingerprints, for example, identify the presence or absence of specific structures within molecules, generating binary bit-vector representations. This is done by considering the local environment of each atom, representing it based on a set of atomic invariants, and iteratively updating these features between adjacent atoms using a hash function. Such fingerprints are valued for their flexibility, compactness, and strong descriptive power [34, 35]. Other types of molecular fingerprints include dictionary-based, circular, topological, pharmacophore, and protein–ligand interaction fingerprints [36].
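The iterative hashing idea behind Morgan-style fingerprints can be illustrated with a minimal sketch. This is a deliberately simplified toy, not the algorithm implemented in RDKit or any published toolkit: the atomic invariants are reduced to element symbol and degree, bond orders are ignored, and the function names (`circular_fingerprint`, `tanimoto`) and the 64-bit length are assumptions made for illustration.

```python
import hashlib


def _stable_hash(value):
    """Deterministic 32-bit hash (the built-in hash() is salted for strings)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint (illustrative assumption, not RDKit).

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs; bond orders are ignored here
    Returns a list of n_bits values, each 0 or 1.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: atomic invariants (here only element symbol and degree).
    ids = [_stable_hash(f"{sym}|{len(neighbors[i])}") for i, sym in enumerate(atoms)]
    bits = [0] * n_bits
    for ident in ids:
        bits[ident % n_bits] = 1

    # Each further iteration hashes an atom's identifier with its neighbors'.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            env = sorted(ids[j] for j in neighbors[i])
            new_ids.append(_stable_hash(f"{ids[i]}|{env}"))
        ids = new_ids
        for ident in ids:
            bits[ident % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


# Heavy-atom graphs of ethanol (C-C-O) and propane (C-C-C).
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = circular_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
similarity = tanimoto(ethanol, propane)
```

Because each bit depends only on an atom's environment, the result does not depend on the order in which atoms are listed, which is the property that makes such fingerprints usable for similarity search.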
1D sequences: SMILES defines the string representation of a molecule by performing a depth-first preorder traversal of a spanning tree of the molecular graph. The resulting string corresponds to the flattened spanning tree of the molecular graph. SMILES-based learning, noted for its compactness, has been extensively applied in molecular property prediction [37, 38]. SMILES strings explicitly represent meaningful substructures such as branches, cyclic structures, and chiral information. Because SMILES is easy to process computationally, many datasets now use it as the molecular descriptor, including well-known repositories such as PubChem and ZINC. However, the SMILES syntax is complex and highly restrictive, with most arbitrary sequences not corresponding to well-defined molecules. Consequently, a new representation known as SELFIES was introduced by Krenn et al. [39] in 2020. This method aims to eliminate the grammar errors and violations of chemical principles that are prevalent in SMILES representations. SELFIES addresses these issues by associating each symbol with a specific structural or referential element, avoiding common problems such as unbalanced parentheses or unmatched ring identifiers. SELFIES strings can be converted from SMILES using various tools, enriching the available datasets.
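The two failure modes named above, unbalanced parentheses and unmatched ring identifiers, can be detected with a short scan. The following is a minimal sketch under stated assumptions: the function name `smiles_syntax_issues` is hypothetical, two-digit ring closures (`%nn`) are not handled, and passing this check does not imply chemical validity (valence rules are not examined).

```python
def smiles_syntax_issues(smiles):
    """Return a list of simple syntax problems found in a SMILES string.

    Checks only branch parentheses, bracket atoms, and single-digit ring
    closures. Digits inside bracket atoms (e.g. isotopes in "[13CH4]")
    are ignored because they are not ring-closure labels.
    """
    issues = []
    depth = 0              # current nesting depth of branches
    open_rings = set()     # ring-closure digits seen an odd number of times
    in_bracket = False     # True while scanning inside "[...]"

    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                issues.append("unmatched ')'")
                depth = 0
        elif ch.isdigit():
            # A ring digit opens a bond on first sight and closes it on the next.
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)

    if in_bracket:
        issues.append("unclosed '['")
    if depth > 0:
        issues.append("unclosed '('")
    if open_rings:
        issues.append(f"unclosed ring bond(s): {sorted(open_rings)}")
    return issues


benzene_ok = smiles_syntax_issues("c1ccccc1")
broken_ring = smiles_syntax_issues("C1CCCC")
```

SELFIES avoids the need for such checks by construction, since every symbol sequence decodes to some molecule.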
2D Topology Graph: This method models atoms and bonds as nodes and edges, respectively, with each node and edge carrying feature vectors representing atomic type, chirality, bond type, and direction. Given that 2D molecular graph data can be derived from SMILES using the RDKit tool, possessing a SMILES dataset is effectively the same as having a 2D graph dataset [40]. Consequently, common SMILES datasets, such as those found in PubChem, ZINC, and others, can also serve as 2D graph molecular datasets. However, the inherent variability in topological structures often makes graph data more complex than image and text data, challenging the direct training of foundation models with molecular graphs.
3D Geometry: 3D geometry describes the arrangement of a molecule in space, as well as the relative positions and directions between atoms, reflecting the spatial shape and stereochemistry of molecules [41]. Moreover, the energy state of molecules can be evaluated by calculating their potential energy, their interactions described by calculating their molecular orbitals, and their dynamics analyzed by simulating their motion [42]. Structurally, stable molecules can be represented as a series of 3D coordinates [43]. However, 3D geometric data obtained through experimental measurements is typically costly, resulting in its scarcity in downstream tasks. To address this issue, larger 3D datasets such as Molecule3D have been introduced since September 2021, and models like GeoSSL are being pretrained on these datasets [44, 45].
Beyond the traditional molecular descriptors mentioned above, mathematically abstract molecular representations are also valuable for uncovering structural patterns and the chemical nature of molecules. These approaches quantify molecular features using tools from topology, graph theory, and differential geometry.
At the topological level, molecular structures can be modeled as topological spaces defined by atomic connectivity. Topological descriptors (e.g. the Wiener index $W$) quantify structural complexity and help predict stability and reactive site distribution [46]. The Wiener index can be formulated as
$$W = \sum_{i<j} d_{ij} \tag{1}$$
where $d_{ij}$ denotes the topological distance between atoms $i$ and $j$, defined as the number of edges in the shortest bonding path between them.
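As a worked illustration, the Wiener index can be computed by summing breadth-first-search distances over all unordered atom pairs. The sketch below assumes a connected, hydrogen-suppressed molecular graph given as an adjacency dictionary; the function names are chosen for illustration only.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of bonds from `source` to every atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adjacency):
    """Sum of topological distances over all unordered atom pairs.

    Assumes the graph is connected (a disconnected graph raises KeyError).
    """
    nodes = sorted(adjacency)
    total = 0
    for idx, i in enumerate(nodes):
        dist = shortest_path_lengths(adjacency, i)
        for j in nodes[idx + 1:]:
            total += dist[j]
    return total


# Carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

w_linear = wiener_index(n_butane)     # 1+1+1 + 2+2 + 3 = 10
w_branched = wiener_index(isobutane)  # 3*1 + 3*2 = 9
```

The branched isomer has the smaller index, which matches the usual reading of the Wiener index as a measure of molecular compactness.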
Graph theory provides computable models of molecular topology. Node centrality metrics (e.g. degree, betweenness, and closeness) quantify atomic importance, where high-degree atoms often indicate reactive centers. Degree centrality $C_D(i)$ reflects local connectivity and is defined as:
$$C_D(i) = \sum_{j} A_{ij} \tag{2}$$
where $A$ is the adjacency matrix of the molecular graph. $A_{ij} = 1$ indicates that atoms $i$ and $j$ are directly bonded; otherwise, $A_{ij} = 0$.
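Degree centrality amounts to a row sum of the adjacency matrix. A minimal sketch follows, assuming an unnormalized definition and a hydrogen-suppressed graph; the helper names are illustrative.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


def degree_centrality(matrix):
    """Unnormalized degree centrality: the row sums of the adjacency matrix."""
    return [sum(row) for row in matrix]


# Carbon skeleton of isobutane: atom 0 is the central carbon.
A = adjacency_matrix(4, [(0, 1), (0, 2), (0, 3)])
degrees = degree_centrality(A)
```

Here the central carbon has degree 3 and each methyl carbon has degree 1.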
Betweenness centrality $C_B(v)$ measures an atom’s mediation in molecular information transfer, indicating its influence on long-range interactions. Atoms with high betweenness centrality often serve as key intermediates. It is defined as:
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{3}$$
where $\sigma_{st}$ denotes the total number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ represents the number of those shortest paths that pass through node $v$.
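Betweenness centrality can be computed directly from shortest-path counts. The sketch below is a straightforward cubic-time version written for clarity rather than speed, and it assumes that each unordered pair of endpoints is counted once (a common convention for undirected graphs); the function names are illustrative.

```python
from collections import deque


def bfs_counts(adjacency, source):
    """Return (distance, number of shortest paths) from `source` to each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                # Every shortest path to u extends to a shortest path to v.
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness_centrality(adjacency):
    """Fraction of shortest s-t paths through each node, summed over pairs s < t."""
    nodes = sorted(adjacency)
    info = {s: bfs_counts(adjacency, s) for s in nodes}
    scores = {v: 0.0 for v in nodes}
    for idx, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[idx + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            for v in nodes:
                if v == s or v == t:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    scores[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return scores


n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cyclobutane = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

chain_scores = betweenness_centrality(n_butane)
ring_scores = betweenness_centrality(cyclobutane)
```

In the chain, only the two inner carbons mediate shortest paths; in the four-membered ring, every atom carries half of the two equally short routes between its opposite neighbors.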
Differential geometry methods capture molecular surface shape through curvature analysis, enabling characterization of geometric compatibility in drug–target binding [47]. Gaussian curvature $K$ and mean curvature $H$ describe local surface features (e.g. protrusions and cavities) relevant to binding complementarity, defined as:
$$K = k_1 k_2 \tag{4}$$
$$H = \frac{k_1 + k_2}{2} \tag{5}$$
where $k_1$ and $k_2$ are the two principal curvatures at any point on the molecular surface (with $H > 0$ in convex regions, $H < 0$ in concave regions, and $H = 0$ on flat regions).
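A short numerical illustration of the two curvature measures follows. It assumes the sign convention in which principal curvatures are positive where the surface bulges outward; the classification labels and the tolerance `tol` are choices made for this sketch rather than definitions from the literature cited above.

```python
def gaussian_curvature(k1, k2):
    """Product of the two principal curvatures."""
    return k1 * k2


def mean_curvature(k1, k2):
    """Average of the two principal curvatures."""
    return 0.5 * (k1 + k2)


def classify_surface_point(k1, k2, tol=1e-9):
    """Coarse shape label from the signs of K and H (illustrative convention)."""
    K = gaussian_curvature(k1, k2)
    H = mean_curvature(k1, k2)
    if abs(K) <= tol and abs(H) <= tol:
        return "flat"
    if K > tol:
        # Both principal curvatures share a sign: bowl-like in or out.
        return "convex (protrusion)" if H > 0 else "concave (cavity)"
    if K < -tol:
        return "saddle"
    return "cylinder-like"  # one principal curvature is (near) zero


# A sphere of radius 2 has k1 = k2 = 1/2 everywhere on its surface.
K_sphere = gaussian_curvature(0.5, 0.5)
H_sphere = mean_curvature(0.5, 0.5)
```

For the sphere of radius 2 this gives K = 0.25 and H = 0.5, that is, 1/r² and 1/r.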
Such mathematically abstract representations help reveal the intrinsic relationships between molecular structure and properties, and provide more generalizable feature inputs for MRL models.
In addition to traditional molecular descriptors, recent research has explored the integration of data from various modalities with molecular representations, including IUPAC names, knowledge graphs (KGs), images, and biochemical texts. The IUPAC name, governed by rules from the International Union of Pure and Applied Chemistry, ensures that each chemical substance is uniquely named to accurately reflect its structure and composition, such as the position and type of functional groups. Molecular KGs amalgamate chemical knowledge with graph-structured data, linking chemical elements, molecular structures, and chemical reactions into a comprehensive network [48]. Furthermore, molecular images serve as another form of molecular descriptor, focusing on visual representation [49].
In addition, there is growing interest in the textual aspects of molecular data. Biomedical texts offer rich, flexible external information about molecular entities derived from wet lab experiments [50].
Pretraining strategies
Pretraining strategies are core determinants of the performance of MRL foundation models. By designing rational pretraining tasks, these strategies enable models to learn generalizable molecular features from large-scale unlabeled molecular data, laying a solid foundation for downstream task fine-tuning. Currently, mainstream pretraining strategies for MRL foundation models can be categorized into four paradigms, with distinct task designs, applicable scenarios, and impacts on downstream performance, as elaborated below.
Masked Language Modeling
Masked language modeling (MLM) serves as the cornerstone pretraining task for sequence-based MRL foundation models [51, 52]. The core logic of MLM involves randomly masking a subset of tokens in molecular sequences, followed by training the model to predict the masked tokens. This task forces the model to learn local dependency relationships and global sequence patterns between tokens, which is well suited to capturing the syntactic characteristics of molecular sequences.
For instance, ChemBERTa conducts MLM pretraining on 77 million SMILES sequences from PubChem [51]. By learning the correlations between atomic tokens and functional group tokens during pretraining, the model can accurately encode molecular structural features. Compared with traditional task-specific models, ChemBERTa achieves a 5%–10% improvement in AUC-ROC on molecular property prediction tasks. However, MLM has inherent limitations: it over-relies on the syntactic correctness of sequences and fails to effectively capture the spatial and topological features of molecules. Thus, it is more suitable for unimodal models with sequence inputs.
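The masking step of MLM can be sketched in a few lines. This is an illustrative toy and not the tokenizer or masking scheme of ChemBERTa: the regular expression covers only common SMILES tokens, the 15% masking rate is borrowed from the usual BERT default as an assumption, and BERT's 80/10/10 replacement rule is omitted.

```python
import random
import re

# Bracket atoms first, then two-letter halogens, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()@.:~*$]")


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Replace a random subset of tokens with mask_token.

    Returns (masked_tokens, labels), where labels maps each masked position
    to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)
masked, labels = mask_tokens(tokens)
```

During pretraining, the model receives `masked` and is penalized only at the positions stored in `labels`, which is what pushes it to infer a token from its sequence context.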
Contrastive learning
Contrastive learning (CL) has emerged as a dominant pretraining strategy for multimodal MRL models. It constructs positive–negative sample pairs to align features across different modalities or across different views of the same modality. In unimodal scenarios, CL generates augmented views by perturbing molecular graphs and contrasts each original graph with its perturbed counterpart (a positive pair) against the other molecules in the batch (negatives), enhancing robustness to variations in molecular topological features. In multimodal scenarios, CL aligns features from different modalities, enabling cross-modal information fusion.
GraphMVP exemplifies the effectiveness of CL in multimodal pretraining. By contrasting topological features of 2D molecular graphs with spatial features of 3D geometry, the model simultaneously captures molecular connectivity and spatiality. In the energy prediction task on the QM9 dataset, GraphMVP reduces the RMSE by 15% compared with unimodal models [53]. The key advantage of CL lies in its ability to learn inter-modal correlations without labeled data. However, its performance highly depends on the quality of positive sample construction, which remains a critical challenge in practical applications.
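The contrastive objective can be made concrete with a generic InfoNCE loss. The sketch below is a one-directional, pure-Python version for illustration and is not the exact objective used by GraphMVP; the temperature value and the toy 2D/3D embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row i of view_a and row i of view_b form the positive pair; every other
    row of view_b acts as a negative for row i of view_a.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        top = max(logits)
        # Numerically stable log-sum-exp of the logits.
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings of two molecules from a 2D-graph encoder and a 3D encoder.
emb_2d = [[1.0, 0.0], [0.0, 1.0]]
emb_3d = [[0.9, 0.1], [0.1, 0.9]]

loss_aligned = info_nce(emb_2d, emb_3d)
loss_shuffled = info_nce(emb_2d, emb_3d[::-1])
```

The loss is small when each molecule's two views are closest to each other and large when the pairing is shuffled, which is why the quality of positive pair construction matters so much in practice.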
Reconstruction-based pretraining
Reconstruction-based pretraining (RBP) enables models to learn global molecular structural features by reconstructing original molecular data from corrupted inputs. For graph-based models, reconstruction tasks include “node feature reconstruction” and “graph structure reconstruction” [54]. For 3D geometry-based models, reconstruction tasks involve “coordinate reconstruction” and “energy reconstruction” [55].
Molecular Graph Masked Autoencoder (MGMAE) demonstrates the superiority of RBP in graph models. By masking >50% of nodes and edges in molecular graphs and training the model to reconstruct complete molecular graphs, MGMAE forces the model to learn global topological patterns of molecules. On the BBBP dataset for molecular property prediction, MGMAE achieves an AUC-ROC of 94.2%, outperforming peer models [56]. The primary advantage of RBP is its ability to capture global molecular features, but it requires high model complexity and incurs relatively high training costs.
Multimodal alignment pretraining
Multimodal alignment pretraining (MAP) is specifically designed for multimodal input models, aiming to align and fuse features from different modalities through cross-modal tasks [50, 57]. For example, KV-PLM adopts a “SMILES to text” matching task to align molecular structure and functional information [50].
The key advantage of MAP is its ability to fuse structural information (SMILES, graphs) and semantic information (text), providing more comprehensive features for downstream tasks. However, this strategy requires large-scale cross-modal labeled data, which poses significant challenges in data acquisition and annotation.
Computing models
This section reviews prominent foundation models utilizing diverse molecular representation techniques (Table 2), where the “Molecular Descriptor” denotes pretraining inputs. Models are classified into Unimodal-based and Multimodal-based categories. Molecular fingerprints are rarely used due to information loss, dimensional complexity, and rule dependency [89, 90]. Instead, structure-based representation learning methods are preferred for preserving molecular information and enhancing model performance.
Table 2.
Summary of representative molecular representation foundation models (MRFMs) from recent years
| Model | Year | Molecular descriptor | Backbone architecture | Pretraining dataset | Parameters | Downstream tasks | Link |
|---|---|---|---|---|---|---|---|
| ChemBERTa-2 [51] | 2022 | Sequence (SMILES) | Transformer | PubChem (77 M) | 5 M–46 M | PP | – |
| MOLFORMER [52] | 2022 | Sequence (SMILES) | BERT | PubChem + ZINC15 (∼1111 M) | ∼21 M | PP | https://github.com/IBM/molformer |
| MOLGEN [58] | 2024 | Sequence (SELFIES) | BART [59] | ZINC15 (100 M) | 8B | MG | https://github.com/zjunlp/MolGen |
| LlaSMol [60] | 2024 | Sequence (SMILES/SELFIES) | Mistral [61] | SMolInstruct (3 M) | 7B | PP, MG, RP | https://github.com/osu-nlp-group/llm4chem |
| SynerGPT [62] | 2024 | Sequence (SMILES) | Transformer | ChemicalX DrugCombDB (0.6 M) | 18 M/22.8 M | DSP | https://github.com/KyleBenzle/SynerGPT |
| CancerGPT [63] | 2024 | Sequence (SMILES) | GPT | DrugComb Portal (0.7 M) | 124 M | DSP | – |
| GROVER [5] | 2020 | Graph | GNN + Transformer | ZINC15 + ChEMBL (10 M) | 107.7 M | PP | https://github.com/tencent-ailab/grover |
| GraphCL [64] | 2020 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/Shen-Lab/GraphCL |
| Hu et al. [65] | 2020 | Graph | 5-layer GIN | ChEMBL (456 K) + ZINC (2 M) | ∼2 M | PP | http://snap.stanford.edu/gnn-pretrain |
| JOAO [66] | 2021 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/Shen-Lab/GraphCL_Automated |
| AD-GCL [67] | 2021 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/susheels/adgcl |
| GraphLoG [68] | 2021 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/DeepGraphLearning/GraphLoG |
| MPG [69] | 2021 | Graph | MolGNet [70] | ZINC + ChEMBL (11 M) | 53 M | PP, DDI | https://github.com/pyli0628/MPG |
| MGSSL [71] | 2021 | Graph | 5-layer GIN | ZINC15 (250 K) | ∼2 M | PP | https://github.com/zaixizhang/MGSSL |
| Graphormer [72] | 2021 | Graph | Transformer | PCQM4M-LSC (3.8 M) | 47.1 M | PP | https://github.com/microsoft/Graphormer |
| LP-Info [73] | 2022 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/Shen-Lab/GraphCL_Automated |
| SimGRACE [74] | 2022 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/junxia97/SimGRACE |
| GraphMAE [54] | 2022 | Graph | 5-layer GIN | ZINC15 (2 M) | ∼2 M | PP | https://github.com/THUDM/GraphMAE |
| MGMAE [56] | 2022 | Graph | 5-layer GIN | ZINC15 (2 M) + ChEMBL (456 K) | ∼2 M | PP | – |
| KPGT [75] | 2022 | Graph | Transformer | ChEMBL (2 M) | 100 M | PP | https://github.com/lihan97/KPGT |
| MOLE-BERT [76] | 2023 | Graph | 5-layer GIN | ZINC15 (20 M) | ∼2 M | PP | https://github.com/junxia97/Mole-BERT |
| 3D PGT [77] | 2023 | 3D Geometry | GPS [78] | PubChemQC (3.74 M) | 42.6 M | PP | https://github.com/LARS-research/3D-PGT |
| Uni-Mol [55] | 2023 | 3D Geometry | Transformer | ZINC/ChEMBL + PDB (209 M) | ∼47.61 M | PP, MG | https://github.com/dptech-corp/Uni-Mol |
| ImageMol [49] | 2022 | Images | ResNet18 [79] | PubChem (10 M) | ∼11 M | PP | https://github.com/ChengF-Lab/ImageMol |
| MM-Deacon [80] | 2021 | Sequence + IUPAC | Transformer | PubChem | 10 M | PP, DDI | – |
| PanGu Drug Model [81] | 2022 | Sequence + Graph | Transformer | ZINC20 + DrugSpaceX + UniChem (∼1.7B) | 104 M | – | http://pangu-drug.com/ |
| DVMP [82] | 2023 | Sequence + Graph | GNN + Transformer | PubChem (10 M) | 104.1 M | PP, RP | https://github.com/microsoft/DVMP |
| Transformer-M [83] | 2023 | Graph + 3D Geometry | Transformer | PCQM4Mv2 (3.4 M) | 47.1 M | PP | https://github.com/lsj2408/Transformer-M |
| GraphMVP [53] | 2022 | Graph + 3D Geometry | 5-layer GIN + SchNet | GEOM (50 K) | ∼2 M | PP | https://github.com/chao1224/GraphMVP |
| KV-PLM [50] | 2022 | Sequence + Text | Transformer | PubChem (150 M) | ∼110 M | PP | https://github.com/thunlp/KV-PLM |
| MolT5 [57] | 2022 | Sequence + Text | Transformer | ZINC-15 (100 M) | 60 M / 770 M | MG | https://github.com/blender-nlp/MolT5 |
| MoleculeSTM [84] | 2023 | Sequence + Text | GNN + Transformer + BERT | PubChemSTM (281 K) | ∼400 M | PP, MG | https://github.com/chao1224/MoleculeSTM/tree/main |
| MolReGPT [85] | 2024 | Sequence + Text | GPT | ChEBI-20 (33 K) | ∼1T | MG | https://github.com/phenixace/MolReGPT |
| Tx-LLM [86] | 2024 | Sequence + Text | PaLM-2 [87] | TDC (∼726 K) | ∼400 M | PP, RP | – |
| Y-Mol [88] | 2024 | Sequence + Text + KG | Llama2 | PubMed (∼33 M) | 7B | PP, DDI, MG | https://anonymous.4open.science/r/Y-Mol |
PP, property prediction; MG, molecular generation; RP, retrosynthesis prediction; RC, retrosynthesis condition prediction; DDI, drug–drug interaction; DSP, drug synergy prediction. The symbol “–” indicates that a particular entry is not applicable, and “∼” denotes “approximately”.
Unimodal-based model
Sequence-based model
SMILES-based models encode chirality via tokenization and pretraining design. Chiral centers marked by “@” and “@@” are treated as distinct tokens, allowing the Transformer’s self-attention to learn their contextual associations. For example, the “[C@@H]” token forms strong attention weights with nearby C, N, and O atoms, capturing the local chiral environment.
In addition, SELFIES-based representations explicitly encode chirality using semantically clear tokens such as “[C@]” and “[C@@]”, reducing syntactic complexity [91]. Their fragment-level tokenization strengthens the linkage between chiral information and the molecular backbone, enabling foundation models to learn chirality–property relationships more efficiently via positional encoding and dependency modeling.
Transformer models surpass early RNN-based approaches due to their superior ability to capture chemical structure information. RNNs suffer from vanishing gradients, which limits their capacity to model the long-range dependencies and nonlocal interactions critical for molecular properties. In contrast, Transformers employ self-attention to encode global token relationships, enabling effective representation of functional groups and their spatial correlations. This capability supports more accurate learning of structure–activity relationships.
Foundation models such as GPT, BERT, and T5, which are derived from Transformer architectures, have reshaped the trajectory of MRL. In chemical informatics, researchers have leveraged their capabilities, exemplified by ChemBERTa, which integrates chemical sequence representation [92]. These models have excelled in tasks like retrosynthesis and molecular property prediction, leading White to proclaim that “the future of chemistry is language” [93].
A significant challenge with SMILES-based models is the generation of invalid molecular strings. To mitigate this, models must incorporate additional rules, such as SMILES syntax and atomic ordering, which complicate training [94]. Alternatively, SELFIES sequences, as adopted by MolGen, offer a solution to this problem [58].
Topological graph-based model
Molecular graph representation is increasingly recognized for its ability to better capture the structural and functional characteristics of molecules compared with sequence inputs [95]. Topology-based graph models leverage message passing and graph structural encoding to exploit molecular topological features. Common choices include graph neural networks (GNNs) [96], graph convolutional networks (GCNs) [97], graph attention networks (GATs) [98], and graph isomorphism networks (GINs) [99]. These models stack multiple graph convolution layers and integrate atomic features (node features) with bond features (edge features). Through “aggregation and update” operations, they learn the local topological environment of each node. For example, when updating node representations, GIN aggregates the features of all neighboring nodes and uses learnable parameters to adjust their contribution weights, enabling the model to capture branching complexity, ring structures, and other topological characteristics.
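The “aggregation and update” step can be sketched for a single GIN-style layer. This is a simplified illustration: the original GIN applies a multilayer perceptron after aggregation, whereas the sketch substitutes a single linear map followed by ReLU, and the one-hot atom features and identity weight matrix are assumptions chosen so the arithmetic is easy to follow.

```python
def gin_layer(features, adjacency, weight, epsilon=0.0):
    """One GIN-style update for every node.

    h_v' = ReLU(W @ ((1 + epsilon) * h_v + sum of h_u over neighbors u))

    features:  list of per-node feature vectors
    adjacency: adjacency[v] is the list of neighbor indices of node v
    weight:    list of rows of the matrix W (stand-in for GIN's MLP)
    """
    updated = []
    for v, h_v in enumerate(features):
        # Aggregation: scaled self feature plus the sum of neighbor features.
        aggregated = [(1.0 + epsilon) * x for x in h_v]
        for u in adjacency[v]:
            aggregated = [a + b for a, b in zip(aggregated, features[u])]
        # Update: linear map followed by ReLU.
        out = [max(0.0, sum(w * a for w, a in zip(row, aggregated))) for row in weight]
        updated.append(out)
    return updated


# Ethanol heavy atoms C-C-O with one-hot features [is_carbon, is_oxygen].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjacency = [[1], [0, 2], [1]]
identity = [[1.0, 0.0], [0.0, 1.0]]

hidden = gin_layer(features, adjacency, identity)
```

After one layer, each atom's vector counts the carbons and oxygens in its one-bond neighborhood (itself included), so the two carbons, which started with identical features, are now distinguishable.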
However, traditional GNNs are limited by message passing that only covers immediate neighboring nodes, making it difficult to capture nonlocal chemical interactions. In contrast, Transformer-based graph models overcome this limitation through global self-attention, enabling them to encode potential relationships between any pair of atoms.
As with sequence models, Transformer-based graph models such as Graphormer overcome the nonlocal interaction limitation of traditional GNNs while still capturing topological features [72]. Through this architectural design, Graphormer efficiently leverages molecular topological features and achieves superior performance in molecular property prediction tasks compared with conventional GNNs.
3D geometry-based model
Most MRL methods represent molecules as sequential tokens or topology graphs, limiting their ability to leverage 3D geometry essential for 3D-related tasks. Suboptimal 3D structure parameters can degrade model performance compared to sequential/topological methods, and may also cause robustness and prediction issues. Overcoming this in foundation models remains a major challenge [100].
3D GNN-based models extract local spatial features of atoms through convolutional operations. However, their fixed receptive fields limit the ability to adapt to dynamic conformational changes in molecules. In contrast, Transformer-based 3D models employ distance-aware self-attention, which dynamically adjusts attention weights between atoms to capture key spatial interactions in real time as conformations evolve.
Taking Uni-Mol as an example, the model first converts the 3D coordinates of atoms into a relative Euclidean distance matrix. Through the self-attention modules of a pre-LayerNorm Transformer, distance information is incorporated into attention weight computation: atoms that are closer in space receive higher attention weights [101], while long-range interactions are effectively captured via distance thresholding and attention weight adjustment [55, 102, 103].
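The effect of injecting distances into attention can be shown with a minimal sketch. It is not Uni-Mol's actual pair representation or kernel: here the Euclidean distance is simply subtracted from the attention logits as a bias, and the `length_scale` parameter and function names are assumptions introduced for illustration.

```python
import math


def pairwise_distances(coords):
    """Euclidean distance matrix from a list of 3D coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(scores, coords, length_scale=1.0):
    """Attention weights with a distance penalty added to the logits.

    scores: raw query-key logits, scores[i][j] for atom i attending to atom j
    coords: 3D coordinates of the atoms
    Closer atom pairs receive a smaller penalty and hence a larger weight.
    """
    dists = pairwise_distances(coords)
    weights = []
    for i, row in enumerate(scores):
        biased = [s - dists[i][j] / length_scale for j, s in enumerate(row)]
        weights.append(softmax(biased))
    return weights


# Three atoms on a line at x = 0, 1, 3 with uninformative content scores.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
scores = [[0.0] * 3 for _ in range(3)]
attention = distance_biased_attention(scores, coords)
```

With identical content scores, the first atom attends more strongly to its neighbor at distance 1 than to the atom at distance 3. Because only interatomic distances enter the computation, the weights are unchanged under rotation or translation of the molecule.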
In addition, Uni-Mol introduces an SE(3)-equivariant coordinate head, which preserves invariance to spatial rotations and translations during training, ensuring robust generalization of spatial shape representations [77].
3D PGT employs a GPS (general, powerful, scalable graph Transformer) [104] architecture that integrates 3D coordinates with molecular graph topology. By using spatial convolution layers, it extracts local atomic spatial distribution features and fuses them with graph attention mechanisms, enabling joint encoding of molecular spatial geometry and chemical environment.
Image-based model
Advances in computer vision have led to growing interest in image-based MRL. Although both molecular images and topological graphs serve as 2D representations, they differ in processing paradigms: GNNs model graph connectivity and chemical topology, whereas CNNs extract local visual patterns from images. Recent progress in unsupervised visual representation learning [105, 106] suggests strong potential for image-based pretraining in drug discovery. For example, ImageMol [49] introduces a chemistry-aware unsupervised framework that achieves high accuracy across multiple drug discovery tasks, demonstrating the utility of molecular images as an effective representation modality.
Traditional unimodal molecular descriptors, such as fingerprints, SMILES strings, and 2D topological graphs, show strengths in specific tasks but inherently suffer from limited information capacity. As MRL applications expand toward more complex scenarios, information from a single modality is no longer sufficient to support a comprehensive understanding of molecular characteristics. Consequently, multimodal molecular descriptors have emerged. The core advantage of multimodal inputs lies in their ability to integrate complementary information from different modalities, thereby enabling more holistic and informative molecular representations.
Multimodal-based model
In recent years, the advancement of foundation models has notably improved the understanding and generation capabilities of multimodal data. These models use extensive multimodal datasets to learn associations between different modalities, thereby enhancing their ability to interpret and generate such data. When trained on diverse datasets, the representations developed by foundation models are highly transferable, offering broad applicability across numerous downstream tasks. This universality has substantial research value, especially in the field of drug discovery, where it may potentially lead to significant breakthroughs. In the following, we will introduce four distinct types of multimodal integration.
Sequence+Graph integration
While SMILES sequences alone may not adequately capture the topological structure of molecules, relying solely on molecular graphs can lead to issues such as over-smoothing in graph models. Employing SMILES and molecular graphs simultaneously leverages the strengths of each, yielding a more comprehensive representation of molecular structure. However, current approaches such as DVMP rely on a dual-tower architecture that lacks fine-grained interaction between SMILES and graph data, a limitation addressed by the CL strategy of MolCLR [82, 107].
Graph+3D geometry integration
While graphs primarily emphasize topological information, 3D geometry focuses on energy-related features. Molecular graphs effectively capture the composition and chemical structure of molecules, where atoms and bonds are represented as nodes and edges [108]. In contrast, 3D geometry displays the spatial arrangement of molecules, showing relative positions and directions among atoms. It can reveal the molecular spatial shape and stereochemistry [41]. Together, they can facilitate a deeper analysis of reaction rates and pathways. GraphMVP exemplifies this approach by enriching 2D topological pretraining with 3D geometric information, which provides robust supplementary information on molecular energy and spatial structure data [53].
Due to the limitations inherent in 3D geometric data, it is often necessary to combine 3D information with 2D data for model input. A notable approach in this area is GeomGCL, which uses a geometric CL strategy on both 2D and 3D views [109]. It improves 2D MRL by incorporating additional 3D geometric information. This dual-view strategy not only addresses the integration challenge but also improves the overall predictive performance of the models.
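To make the dual-view idea concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss between 2D-view and 3D-view embeddings of the same molecules. It is not the exact GeomGCL objective, which may differ; the function names, the toy embeddings, and the temperature value are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(view_2d, view_3d, temperature=0.1):
    """Mean InfoNCE loss; row i of both views is assumed to describe molecule i."""
    a = [normalize(v) for v in view_2d]
    b = [normalize(v) for v in view_3d]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarity of the 2D anchor to every 3D embedding in the batch.
        logits = [dot(anchor, other) / temperature for other in b]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # The matching 3D view (index i) is the positive; all others are negatives.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings (illustrative values only).
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each molecule's 2D embedding is closest to its own 3D embedding and large when the pairing is scrambled, which is the signal a contrastive pretraining stage would minimize.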
Despite these advancements, the scarcity of comprehensive molecular datasets that include 2D and 3D information remains a significant challenge, limiting the development of foundation models in this field.
Text+Other representation integration
In recent years, a new trend in molecular representation has focused on jointly modeling molecular SMILES sequences with literature texts to obtain cross-modal representations. The combination of SMILES structural information and textual context provides stronger prior knowledge for tasks such as molecular property prediction and domain knowledge extraction, and is especially useful for fine-tuning LLMs on chemical tasks.
KV-PLM initiated the integration of biochemical text with SMILES sequences, where the model first converts SMILES sequences into 2D molecular graphs via RDKit, then embeds the 2D molecular graph features (e.g. atomic type and bond type) into the text encoding process through a cross-modal attention module, thereby facilitating cross-modal learning between molecular structure and biochemical text [50]. MolT5 further developed this concept with a pretrained model handling large volumes of unlabeled text and molecular strings [57].
Subsequent models like MoleculeSTM use CL to align molecules and text for zero-shot retrieval and editing [84]; MolReGPT bridges molecular and natural languages via retrieval-augmented prompting [85].
Despite these advancements, this approach inherits limitations of SMILES and NLP. Textual data often focus on either internal molecular structures or external biomedical contexts, restricting machine reading versatility and impacting knowledge acquisition and pretrained model performance.
Other multimodal methods
Beyond conventional descriptors, molecular information can also be represented through images, KGs, and other modalities [48–50]. These developments have motivated multimodal representation learning, which integrates heterogeneous data to provide a more comprehensive molecular understanding. For example, KCL unifies structural features with knowledge associations to uncover latent inter-element relationships [48]. Many recent frameworks adopt separate encoding branches for different data types, such as DVMP for structural strings and topological graphs, MM-Deacon for chemical naming semantics, and CLOOME for integrating molecular and cellular imaging signals [80, 82, 110]. The resulting embeddings are jointly optimized to establish cross-modal correspondence.
However, these models require large amounts of single-modal data and complex cross-modal alignment, which increases pretraining cost and limits scalability.
Applications
This section explores five prevalent application tasks of MRFMs. Each task is pivotal in leveraging the potential of these models within various domains, as depicted in Fig. 3.
Figure 3.
Applications of MRL foundation model.
We detail representative works for each application in Table 2, providing a comprehensive overview of the methods and their outcomes. In Table 2, “Downstream tasks” include those specified in the original studies; however, the models may also be applicable to other tasks not listed.
Application 1: molecular property prediction
Molecular property prediction is vital in fields like drug development [111, 112], involving the prediction of molecule physical and chemical properties from structural data. The accuracy of these predictions hinges on learning effective molecular representations, making property prediction a common performance benchmark for MRL foundation models.
AUC-ROC and RMSE/MAE are the main evaluation metrics used for classification and regression tasks, respectively. As shown in Table 3, foundation models integrating Transformer- or GIN-based architectures excel in molecular property prediction. Notably, Chemprop, a pivotal model based on directed message passing neural networks (D-MPNNs), has significantly advanced molecular property prediction by introducing directed message passing mechanisms [113]. This innovation addresses the limitation of traditional MPNNs in ignoring bond directionality, enabling more accurate capture of asymmetric electronic effects and steric hindrance. Chemprop achieved state-of-the-art performance and served as a key inspiration for subsequent advanced models like CD-MVGNN [114] and GSL-MPP [115]. With the emergence of Transformer architectures, their combination with GNNs has been applied to molecular property prediction tasks and has shown remarkable performance. DVMP, with its dual-branch Transformer and GNN architecture, shows strong performance in molecular property prediction, especially on the HIV and SIDER datasets [82].
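As a reference for the metrics named above, here is a minimal, dependency-free sketch of ROC-AUC (via its pairwise-ranking interpretation), RMSE, and MAE. The function names and toy inputs are illustrative assumptions; benchmark studies typically rely on an established library implementation instead.

```python
import math

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error for regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy classification example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

The pairwise form is quadratic in the number of samples, so it is meant for clarity rather than for large benchmark sets.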
Table 3.
Comparison of performance (ROC-AUC %) on molecular property classification tasks
| Model | BBBP | BACE | ClinTox | Tox21 | ToxCast | SIDER | HIV | MUV |
|---|---|---|---|---|---|---|---|---|
| ChemBERTa-2 | 64.3 | – | 73.3 | 72.8 | – | – | 62.2 | – |
| MOLFORMER | 93.7 | 88.21 | 94.8 | 84.7 | – | 69 | 82.2 | – |
| GROVER | 94 | 89.4 | 94.4 | 83.1 | 73.7 | 65.8 | – | – |
| GraphCL | 69.68 | 75.38 | 75.99 | 73.87 | 62.4 | 60.53 | 78.47 | 69.8 |
| Hu et al. | 68.7 | 84.5 | 72.6 | 78.1 | 65.7 | 62.7 | 79.9 | 81.3 |
| JOAO | 71.39 | 75.49 | 80.97 | 74.27 | 63.16 | 60.49 | 77.51 | 73.67 |
| AD-GCL | 69.54 | 77.27 | 80.77 | 72.92 | – | 63.19 | – | – |
| GraphLoG | 72.5 | 83.5 | 76.7 | 75.7 | 63.5 | 61.2 | 77.8 | 76 |
| MPG | 92.2 | 92 | 96.3 | 83.7 | 74.8 | 66.1 | – | – |
| MGSSL | 69.7 | 79.1 | 80.7 | 76.5 | 64.1 | 61.8 | 78.8 | 78.7 |
| Graphormer | – | – | – | – | – | – | 80.51 | – |
| LP-Info | 71.68 | 81.15 | 76.73 | 74.45 | 62.39 | 60.8 | 77.03 | 72.03 |
| SimGRACE | 71.3 | 75 | 75.6 | 75.6 | 63.4 | 60.6 | 75.2 | 76.9 |
| GraphMAE | 72 | 83.1 | 82.3 | 75.5 | 64.1 | 60.3 | 77.2 | 76.3 |
| MGMAE | 94.2 | 92.7 | 96.7 | 86 | 75.3 | 66.4 | – | – |
| KPGT | 90.8 | 85.5 | 94.6 | 84.8 | 74.6 | 64.9 | – | – |
| MOLE-BERT | 71.9 | 80.8 | 78.9 | 76.8 | 64.3 | 62.8 | 78.2 | 78.6 |
| 3D PGT | 72.1 | 80.9 | 79.4 | 73.8 | 69.2 | 60.6 | 78.1 | 69.4 |
| Uni-Mol | 72.9 | 85.7 | 91.9 | 79.6 | 69.6 | 65.9 | 80.8 | 82.1 |
| MM-Deacon | 78.5 | – | 99.5 | – | – | 69.3 | 80.1 | – |
| DVMP | 77.8 | 89.4 | 95.6 | 79.1 | – | 69.8 | 81.4 | – |
| GraphMVP | 72.4 | 81.2 | 77.5 | 74.4 | 63.1 | 63.9 | 77 | 75 |
| KV-PLM | 74.61 | – | – | 72.71 | – | 61.51 | 74 | – |
| MoleculeSTM | 69.98 | 80.77 | 92.53 | 76.91 | 65.05 | 60.96 | 76.93 | 73.4 |
| TxT-LLM | – | – | 86.3 | 88.2 | 79.2 | – | 73.2 | – |
The best performing values are highlighted in bold, and the second best ones are marked with an underline; “–” indicates no data available. All performance metrics in this table are directly cited from the original studies of the corresponding models, and the evaluation protocols are consistent with the original studies to ensure the reliability of performance comparison.
However, model effectiveness depends heavily on the dataset. Mole-BERT encounters negative transfer due to a small and unbalanced atomic vocabulary [76]. While techniques like VQ-VAE can alleviate this, challenges may resurface in tasks such as protein prediction, indicating the need for alternative strategies.
Application 2: molecular generation
In drug discovery, identifying target molecules with specific properties, which has historically relied on domain expertise, poses a significant challenge. MG automates this process via four steps: (i) Data conversion: molecular structures are transformed into computational formats (e.g. SMILES, graphs, and vectors). (ii) MG: deep learning-based generative models sample or create novel molecules from the molecular space. (iii) Molecular evaluation: molecules are screened according to predicted properties, including physical, chemical, biological, toxicity, and synthesizability aspects. (iv) Molecule optimization: generated molecules are refined, or model parameters adjusted, to improve metrics such as QED, SA, and pIC50 [116–118].
For evaluating MG, two primary criteria are often used: (i) validity: the proportion of chemically viable molecules out of all generated molecules [119]; (ii) novelty: the percentage of valid generated molecules that do not appear in the training dataset.
Additional metrics include: (i) reconstruction accuracy: measures how frequently the model can reconstruct a specific molecule from its latent embedding [120]; (ii) Fréchet ChemNet Distance (FCD): assesses the similarity between sampled and training molecules [121]; (iii) for tasks with attribute constraints, the proportion of generated molecules that match the target attribute.
In tasks like molecular conformation generation, coverage score (COV) and matching score (MAT) are commonly used metrics to assess performance [122].
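The validity and novelty criteria defined above can be sketched as follows. Real evaluations check validity with a cheminformatics toolkit and compare canonicalized structures; here `is_valid` is a caller-supplied predicate, and `toy_is_valid` is a deliberately crude stand-in (an assumption for illustration, not a chemical validity check).

```python
def validity_and_novelty(generated, training_set, is_valid):
    """Return (validity, novelty) as fractions in [0, 1]."""
    if not generated:
        raise ValueError("no generated molecules to evaluate")
    valid = [m for m in generated if is_valid(m)]
    validity = len(valid) / len(generated)
    # Novelty is measured only over valid molecules, as in the definition above.
    novel = [m for m in valid if m not in training_set]
    novelty = len(novel) / len(valid) if valid else 0.0
    return validity, novelty

def toy_is_valid(smiles):
    """Toy stand-in (assumption): only checks that parentheses are balanced."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0

generated = ["CCO", "CC(C)O", "CC(C", "c1ccccc1"]
training = {"CCO"}
validity, novelty = validity_and_novelty(generated, training, toy_is_valid)
```

In this toy run three of four strings pass the check and two of those three are absent from the training set. String comparison only detects exact matches, so a practical pipeline would canonicalize molecules before testing membership.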
As shown in Table 4, foundation models like MolT5 and Uni-Mol have proven effective in MG. MolT5, pretrained on natural language and SMILES, excels at generating molecules from text descriptions, outperforming baselines in validity. Uni-Mol generates 3D molecular conformations, achieving superior COV and MAT scores compared to other models [55, 57].
Table 4.
Comparison of performance on molecule generation tasks
| Model | Dataset | Validity (%) ↑ | Novelty (%) ↑ | COV (%) ↑ | MAT (Å) ↓ | FCD ↓ |
|---|---|---|---|---|---|---|
| LlaSMol (using SELFIES) | SMolInstruct | 99.9 | – | – | – | – |
| MOLGEN | Synthetic molecules | 100 | 100 | – | – | 0.15 |
| | Natural product molecules | 100 | 99.87 | – | – | 65.19 |
| Uni-Mol | QM9 | – | – | 97.95 | 0.1831 | – |
| | Drugs | – | – | 91.91 | 0.7863 | – |
| MolT5 | ChEBI-20 | 90.5 | – | – | – | 1.2 |
| MolReGPT (10-shot) | ChEBI-20 | 89.9 | – | – | – | 0.41 |
| Y-mol | Random samples (200 000) | 100 | 68 | – | – | – |
Validity: chemical validity; Novelty: the percentage of valid generated molecules that do not appear in the training dataset; COV, coverage; MAT, matching score (average atomic distance); FCD, Fréchet ChemNet Distance. “↑”: the larger the value, the better the performance; “↓”: the smaller the value, the better the performance; “–” indicates no data available. All performance metrics in this table are directly cited from the original studies of the corresponding models, and the evaluation protocols are consistent with the original studies to ensure the reliability of performance comparison.
Application 3: drug–drug interactions
DDI tasks play a pivotal role in the drug development process, aiding drug developers in screening for safer and more effective drugs. They also assist clinicians in making informed decisions and arranging appropriate treatment plans, thereby enhancing drug safety, reducing healthcare costs, and minimizing medical disputes [123].
In the realm of DDI prediction, several metrics are commonly employed to assess the performance of models: (i) accuracy (ACC): measures the proportion of correct predictions made by the model. (ii) ROC–AUC (AUC): evaluates the model’s ability to discriminate between interacting and noninteracting drug pairs. (iii) PR-AUC (Area Under Precision-Recall Curve): focuses on the precision and recall performance of the model, particularly useful in datasets with class imbalances. (iv) F1 Score: Balances precision and recall, providing a measure of the model’s accuracy in identifying true interactions.
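A minimal sketch of these metrics for a binary interacting/noninteracting setting is given below. The function names and toy labels are assumptions for illustration; PR-AUC is approximated here by average precision, one common step-wise estimate of the area under the precision-recall curve.

```python
def classification_metrics(labels, predictions):
    """Accuracy, precision, recall, and F1 from hard 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true interaction."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return ap / total_pos

# Toy drug-pair labels (1 = interacting) and hard predictions.
metrics = classification_metrics([1, 1, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 1, 1])
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Accuracy reaches 0.75 in the toy example even though one of three true interactions is missed, which illustrates why F1 and PR-AUC are preferred on imbalanced DDI datasets.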
As shown in Table 5, MPG, a specialized foundation model for DDI prediction [69], conducts unsupervised pretraining on large molecular datasets to obtain general molecular representations. Then, it refines these representations via supervised fine-tuning on a smaller labeled dataset to learn specific molecular pair interactions. Finally, MPG uses a multitask learning strategy to predict interaction types and degrees simultaneously, enhancing DDI prediction comprehensiveness and accuracy.
Table 5.
Comparison of performance of AUC-ROC(%), PR-AUC(%), and F1(%) on DDI tasks
| Model | Dataset | AUC-ROC (%) | PR-AUC (%) | F1 (%) |
|---|---|---|---|---|
| MPG | BIOSNAP | 96.6 | 96 | 90.5 |
| MM-Deacon | Zhang’s dataset [124] | 95 | 91.8 | 82.14 |
| Y-mol | Ryu’s dataset [125] | 65.23 | – | – |
| | Deng’s dataset [126] | 62.19 | – | – |
The higher the value of AUC-ROC, PR-AUC, F1, the better the performance is; “–” indicates no data available. All performance metrics in this table are directly cited from the original studies of the corresponding models, and the evaluation protocols are consistent with the original studies to ensure the reliability of performance comparison.
Application 4: retrosynthesis prediction
Molecular representation plays a crucial role in retrosynthesis, which involves devising viable synthetic routes for target molecules [127]. This process benefits significantly from various molecular representation techniques that help chemists identify key substructures, select appropriate bond-breaking points, and predict potential synthons and precursors. Additionally, these techniques assist in validating reaction outcomes and products.
Each molecular representation method provides unique insights that are critical for retrosynthesis: (i) Molecular mass and elemental composition: Helps in determining the basic framework of the target molecule. (ii) Functional groups and stereochemistry: Crucial for understanding reactivity and orientation in molecular interactions. (iii) Crystal structure and intermolecular interactions: Aid in predicting how molecules will interact under different conditions.
By integrating this information, chemists can design more rational retrosynthetic pathways, optimize reaction conditions, and enhance both the efficiency and selectivity of the synthesis process. As shown in Table 6, DVMP is a foundational model that has demonstrated robust performance in retrosynthesis tasks. After undergoing pretraining with a vast dataset, it effectively supports the complex decision-making required in planning and executing synthetic routes [82].
Table 6.
Comparison of performance (top-k Accuracy %) tested on USPTO-50K of the retrosynthesis task
| Model | Top-1 (%) | Top-3 (%) | Top-5 (%) | Top-10 (%) | Top-20 (%) | Top-50 (%) |
|---|---|---|---|---|---|---|
| DVMP (reaction types unknown) | 54.2 | 70.5 | 77.2 | 84.9 | 90 | 92.7 |
| DVMP (reaction types given as prior) | 66.5 | 81.2 | 86.6 | 90.5 | 92.8 | 93.5 |
| TxT-LLM | 23.9 | – | – | – | – | – |
The best performing values are highlighted in bold, and the second best ones are marked with an underline; “–” indicates no data available. All performance metrics in this table are directly cited from the original studies of the corresponding models, and the evaluation protocols are consistent with the original studies to ensure the reliability of performance comparison.
The most prevalent metric in retrosynthesis analysis is top-k accuracy (Top-k ACC), which measures the proportion of target molecules for which the correct retrosynthesis route appears among the first k predictions.
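Top-k accuracy can be computed as in the sketch below. It assumes each model output is a ranked list of candidate reactant sets and that predictions and references have already been canonicalized so that exact string comparison is meaningful; the function name and the example strings are hypothetical.

```python
def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of targets whose reference appears among the first k ranked candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(references):
        raise ValueError("one ranked candidate list is required per reference")
    hits = sum(1 for candidates, reference in zip(ranked_predictions, references)
               if reference in candidates[:k])
    return hits / len(references)

# Hypothetical ranked reactant-set predictions for three target molecules.
predictions = [
    ["CC(=O)O.OCC", "CC(=O)Cl.OCC", "CC=O.OCC"],
    ["c1ccccc1Br.OB(O)c1ccccc1", "c1ccccc1I.OB(O)c1ccccc1", "c1ccccc1.c1ccccc1"],
    ["CCN.CC(=O)O", "CCN.CC(=O)Cl", "CCNC(C)=O"],
]
references = ["CC(=O)O.OCC", "c1ccccc1I.OB(O)c1ccccc1", "CCBr.NC(C)=O"]

top1 = top_k_accuracy(predictions, references, k=1)
top3 = top_k_accuracy(predictions, references, k=3)
```

Because the candidate window only grows with k, the metric is non-decreasing in k, which matches the monotone rows of Table 6.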
Application 5: drug synergy prediction
Drug synergy prediction assesses whether combined drug effects are synergistic or antagonistic, crucial for cancer, antimicrobial therapy, and complex disease management. MRL transforms drug structures into vectors and constructs joint embeddings for prediction.
Foundation models have been applied to this task, mainly through transfer learning with pretrained models or fine-tuning LLMs. These approaches overcome data scarcity, performing well even with limited datasets.
Common evaluation metrics include ROC-AUC, PR-AUC, accuracy (ACC), and the F1 score. ROC-AUC evaluates model discrimination, PR-AUC is useful for imbalanced datasets, ACC measures prediction correctness, and the F1 score balances precision and recall.
As shown in Table 7, CancerGPT uses LLMs to predict synergy in rare tissues and is reported to outperform other models in PR-AUC and ROC-AUC on multiple datasets [63]. SynerGPT, a GPT-based model, applies in-context learning to personalized prediction, achieving ROC-AUC scores of 74.0 (zero-shot) and 77.7 (few-shot) on novel drug combinations and surpassing baselines [62].
Table 7.
Comparison of performance of AUC-ROC (%) and PR-AUC (%) on drug synergy prediction tasks
| Model | Dataset | AUC-ROC (%) | PR-AUC (%) |
|---|---|---|---|
| CancerGPT | DrugComb portal | – | – |
| SynerGPT (zero-shot) | DrugCombDB-unknown drug | 74 | 57.3 |
| | DrugCombDB-unknown cell line | 83.5 | 72.1 |
| SynerGPT (few-shot) | DrugCombDB-unknown drug | 77.7 | 61.5 |
| | DrugCombDB-unknown cell line | 83.8 | 72.8 |
The best performing values are highlighted in bold; “–” indicates no data available. CancerGPT has not released specific AUC-ROC and PR-AUC data publicly. All performance metrics in this table are directly cited from the original studies of the corresponding models, and the evaluation protocols are consistent with the original studies to ensure the reliability of performance comparison.
How to choose the appropriate foundation model
When applying molecular foundation models to downstream tasks, researchers should consider the following strategic selection guidelines.
First, effective model selection in MRL requires systematic consideration of task objectives, data characteristics, and architectural constraints. Molecular property prediction prioritizes accurate estimation of physicochemical or biological properties, whereas MG emphasizes creating novel yet chemically valid structures. Retrosynthesis focuses on decomposing targets into feasible precursors, while DDI and synergy prediction require modeling combinatorial pharmacological effects.
Second, data properties further guide model choices. SMILES- or graph-based representations are well suited for MG, where Transformer architectures or graph variational autoencoders (GVAE) [128] are commonly adopted. Retrosynthetic prediction requires precise mapping between reactants and products, making multimodal architectures advantageous. Interaction and synergy prediction benefit from capturing latent relational knowledge, where KG-enhanced models are preferred. For data-limited scenarios, transfer learning and fine-tuning of pretrained models (e.g. ChemBERTa-2 [51]) can improve performance.
Third, model selection should align with established paradigms. Predictive tasks, including property prediction and interaction modeling, can be handled by conventional GNNs, but large pretrained frameworks such as ChemBERTa-2 [51] or Uni-Mol [55] offer greater accuracy and efficiency. Generative tasks typically rely on GPT-style autoregressive models [129], exemplified by MOLGEN [58]. Retrosynthesis lies at the interface of prediction and generation, requiring task-specific architectural trade-offs.
Finally, when interpretability is required, Transformer-based architectures should be prioritized due to their explainability through attention mechanisms. Attention matrix analysis can be incorporated to facilitate interpretability experiments when needed. For applications with training time constraints, simple architectures are recommended, because hybrid architectures typically require more training time, as shown in Table 8.
Table 8.
Comparison of time complexity of MRFMs
| Architecture | Model | Time complexity |
|---|---|---|
| GIN-based | GraphCL | |
| | Hu et al. | |
| | JOAO | |
| | AD-GCL | |
| | GraphLoG | |
| | MGSSL | |
| | LP-Info | |
| | SimGRACE | |
| | GraphMAE | |
| | MGMAE | |
| | MOLE-BERT | |
| Transformer-based | ChemBERTa-2 | |
| | SynerGPT | |
| | KPGT | |
| | Uni-Mol | |
| | MM-Deacon | |
| | PanGu drug model | |
| | Transformer-M | |
| | KV-PLM | |
| | Graphormer | |
| | MolT5 | |
| GPT-based | CancerGPT | |
| | MolReGPT | |
| Llama-based | Y-mol | |
| BERT-based | MOLFORMER | |
| BART-based | MOLGEN | |
| Mistral-based | LlaSMol | |
| GPS-based | 3D PGT | |
| ResNet18-based | ImageMol | |
| PaLM-2-based | TxT-LLM | |
| MolGNet-based | MPG | |
| GNN + Transformer-based | GROVER | |
| | DVMP | |
| GIN + SchNet-based | GraphMVP | |
| GNN + Transformer + BERT-based | MoleculeSTM | |
The time complexity of GNN/GIN-type models is expressed in terms of the number of nodes and the number of edges in the graph structure. The computational complexity of Transformer-based models is expressed in terms of the sequence length and the embedding dimension. The number of layers is a further parameter of these formulas. The time complexity of LlaSMol additionally involves the sliding window size, and the complexity formulas of ImageMol and GraphMVP involve the number of atoms.
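For orientation, commonly cited textbook estimates for these architecture families are given below. They are general-purpose approximations rather than the specific entries of Table 8, and the symbols are assumptions introduced here: $|V|$ nodes, $|E|$ edges, $n$ sequence length, $d$ embedding dimension, $L$ layers, and $w$ sliding window size.

```latex
\begin{align*}
T_{\text{message-passing GNN}} &= O\!\left(L\,\bigl(|E|\,d + |V|\,d^{2}\bigr)\right) \\
T_{\text{Transformer (full attention)}} &= O\!\left(L\,\bigl(n^{2} d + n\,d^{2}\bigr)\right) \\
T_{\text{Transformer (sliding window)}} &= O\!\left(L\,\bigl(n\,w\,d + n\,d^{2}\bigr)\right)
\end{align*}
```

The quadratic $n^{2}$ term of full self-attention is what makes long SMILES strings or large atom sets expensive, whereas message passing scales with the number of bonds, which grows roughly linearly with molecule size.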
Interpretability
Interpretability is essential for identifying and mitigating model biases, ensuring fairness, and enhancing user trust in model predictions [130]. However, the complexity of foundation models—characterized by massive parameter scales and opaque internal mechanisms—limits transparency and controllability [131], underscoring the need for effective interpretability techniques [132].
Existing interpretability methods can be broadly grouped into three categories:
(1) Feature attribution methods quantify contributions of input features to model outputs, including gradient-based, surrogate-based, and perturbation-based approaches. Yet, the high dimensionality of modern molecular representations substantially increases computational cost and complexity [133].
(2) Instance-based methods explain predictions by analyzing specific samples through anchors [134], counterfactuals [135], or contrastive reasoning [136]. Their adoption remains limited due to the intensive computation required for generating valid counterfactuals or contrasts [133].
(3) Graph-convolution-based methods leverage message passing and attention mechanisms to assign importance weights to molecular graph components. Attention maps have become the most widely used interpretability strategy in foundation model research, with approaches such as GNNExplainer [137] and DVMP [82]. While these methods capture complex relational structures, they incur substantial computation and memory overhead on large or deep networks, and may suffer from feature over-smoothing in deeper layers.
Conclusion and future outlooks
MRL has greatly enhanced the efficiency and quality of molecular design, discovery, and optimization [138]. In this work, we first reviewed commonly used molecular descriptors and datasets, and categorized foundation models based on their input representations. For each category, representative MRL models were analyzed with respect to their architectural characteristics and performance trade-offs.
We further synthesized four mainstream pretraining paradigms for MRL foundation models, highlighting their modality-specific strengths and applicability to diverse downstream tasks. Additionally, we examined major applications of MRL models in drug discovery and assessed the progress and remaining issues in interpretability.
Despite the considerable advances achieved by MRL models, several challenges and limitations persist that warrant further investigation.
Integrating multimodal data
The choice of molecular representation fundamentally shapes model input structure and determines the type of chemical information encoded, with each paradigm exhibiting distinct strengths and limitations. Sequence-based 1D representations are easy to implement but neglect spatial configuration [139]. Graph-based 2D methods capture atomic connectivity yet omit conformational details [140]. While 3D geometric models provide full spatial information, they must handle issues such as conformational variability and rotational invariance [81]. Additionally, emerging modalities such as molecular videos [141] and audio [142] suggest opportunities for richer multimodal characterization.
If these multimodal data are fused, it may be possible to represent molecules in a more comprehensive way for downstream tasks. To fully unlock the potential of multimodal fusion, future research should advance along three practical directions.
First, incorporating molecular dynamics (MD) trajectories as a novel dynamic modality can significantly enhance representational capacity. Unlike static 3D structures, MD trajectories capture continuous conformational transitions (e.g. ligand binding and unbinding events). Leveraging spatiotemporal attention mechanisms—even hybrid architectures combining 3D CNNs with Transformers—enables extraction of time-resolved features such as bond angle fluctuations and conformational transition kinetics. This helps address a key limitation of static structural models, which often overlook dynamic binding processes, thereby improving predictions of kinetic parameters such as k_on and k_off.
Second, cross-modal data augmentation provides a scalable solution for the scarcity of experimentally derived 3D structures. Abundant 2D graphs or SMILES representations can be used to pretrain generative models that propose physically plausible 3D conformations. These conformations can then be refined and filtered using chemical priors—such as bond length and valence constraints encoded in KGs—before being utilized to augment training datasets for 3D MRL models. Recent advances in MRL demonstrate that CL is effective for cross-modal representation alignment. For example, the GraphMVP framework integrates contrastive objectives with reconstruction tasks to jointly pretrain 2D and 3D molecular encoders [53].
Third, KG-guided multimodal alignment introduces chemically meaningful constraints during pretraining. By embedding chemistry-aware relational rules as alignment guidance, fused multimodal representations can remain consistent with fundamental chemical principles. This strategy is expected to reduce chemically invalid outputs and improve the reliability of generative modeling. These methodologies typically involve three fundamental components: (i) extracting entities and relationships from heterogeneous sources, (ii) performing cross-modal alignment and feature fusion, and (iii) constructing KGs and embedding them into a vector space for downstream tasks. Demonstrated systems such as AliMe MKG [143] validate the effectiveness of this framework, achieving performance gains in recommendation by integrating textual, visual, and user-generated content.
Technologies for addressing data scarcity
The advancement of foundation models critically depends on access to extensive pretraining datasets. The limited availability of molecular data constitutes a major obstacle in developing robust MRFMs. To overcome these challenges, several technological approaches have shown effectiveness [144–146].
Semi-supervised learning offers a viable solution for 3D MRL in scenarios with scarce labeled data. A dual-task training paradigm can be adopted: large amounts of unlabeled 3D conformations (e.g. derived from MD simulations) are used for self-supervised pretraining—such as conformation denoising by perturbing atomic coordinates and reconstructing the original structure—while a smaller subset of experimentally validated 3D structures with annotated properties is used for supervised fine-tuning. Notably, this hybrid learning strategy has achieved >90% of the performance of fully supervised models using only 10% labeled data in 3D property prediction tasks on QM9 [147].
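The conformation-denoising pretext task mentioned above can be sketched as follows: atomic coordinates are perturbed with Gaussian noise, and a model is trained to recover the original structure. The encoder itself is omitted; the coordinates, the noise scale `sigma`, and the function names are illustrative assumptions.

```python
import random

def perturb_coordinates(coords, sigma, rng):
    """Return a copy of coords with Gaussian noise (std sigma) added to every axis."""
    return [[x + rng.gauss(0.0, sigma) for x in atom] for atom in coords]

def reconstruction_loss(predicted, original):
    """Mean squared error over all atomic coordinates."""
    squared = [(p - o) ** 2
               for p_atom, o_atom in zip(predicted, original)
               for p, o in zip(p_atom, o_atom)]
    return sum(squared) / len(squared)

# Hypothetical three-atom conformation in angstroms (illustrative values only).
conformation = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
rng = random.Random(0)
noisy = perturb_coordinates(conformation, sigma=0.1, rng=rng)

# An untrained "identity" model returns its noisy input unchanged, so its loss
# equals the mean squared noise; a trained encoder should push the loss toward 0.
identity_loss = reconstruction_loss(noisy, conformation)
perfect_loss = reconstruction_loss(conformation, conformation)
```

Because the targets are the unperturbed coordinates themselves, this objective needs no property labels, which is what allows the large unlabeled pool of conformations to be used.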
Cross-modal data augmentation can leverage the abundance of 2D molecular representations to alleviate 3D data sparsity. Methods such as 3D InfoMax enable the extraction of latent 3D structural information from 2D molecular graphs and the generation of multiple plausible 3D conformers for each input molecule. These generated conformations can then be screened and refined using chemistry-aware constraints, ensuring structural validity while substantially expanding the scale of 3D training datasets without requiring additional experiments.
Beyond these methodologies, analogous strategies have been developed, including meta-learning and knowledge distillation techniques. Meta-learning enables models to rapidly adapt to new tasks leveraging prior experience [148]. Knowledge distillation facilitates the transfer of knowledge from larger models to compact architectures without significant performance degradation [149].
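As an illustration of knowledge distillation, the sketch below implements one common formulation: the Kullback-Leibler divergence between temperature-softened teacher and student output distributions, scaled by the squared temperature. The logits and the temperature are illustrative assumptions, and practical setups usually combine this term with a standard supervised loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy three-class logits: a student that copies the teacher versus one that disagrees.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A higher temperature flattens the teacher distribution, exposing the relative ranking of non-target classes to the compact student model.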
Interpretability
Despite substantial performance improvements in molecular foundation models, interpretability remains a key challenge [150]. While prior work has emphasized predictive accuracy, the growing complexity of MRL models now necessitates stronger interpretability to ensure reliable decision-making.
Current interpretability research in MRL primarily relies on attention-based visualization of key graph nodes, which is insufficient for fully explaining multimodal learning behavior. Future directions include assessing decision consistency across modalities to identify potential biases—e.g. discrepancies in feature focus between SMILES-based and 3D models. Existing tools, such as DODRIO and Align-Anything, support attention visualization within Transformers and across modalities, enabling more comprehensive interpretability analyses [151, 152].
Incorporating chemical KGs provides another promising avenue. By integrating structured domain knowledge into model architectures and subsequent interpretability evaluation, knowledge-guided analysis can better elucidate chemically meaningful reasoning processes.
Efficiency of training
Foundation models demonstrate high accuracy and generalization in MRL and downstream tasks. However, their large parameter count demands processing vast data during training and inference, which consumes significant storage. Thus, efficient data management and transmission, such as distributed parallel training techniques, are essential [153, 154].
Data parallelism, a common approach, evenly distributes data across GPUs or nodes, computes gradients on each device, aggregates them on one GPU, and broadcasts the results [155, 156]. Future advances in parallel computing will likely boost the training efficiency of foundation models, enabling faster iteration and the development of more complex models while lowering the current cost barriers.
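The data-parallel update described above can be simulated on a single machine, as in the sketch below. It uses a one-parameter linear model as a stand-in for a foundation model and plain lists as stand-ins for devices; all names are hypothetical, and real systems delegate these steps to a distributed training framework.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, data, num_workers, lr):
    """One synchronous update: shard, compute local gradients, aggregate, broadcast."""
    if len(data) % num_workers:
        raise ValueError("this sketch assumes equally sized shards")
    size = len(data) // num_workers
    shards = [data[i * size:(i + 1) * size] for i in range(num_workers)]
    local_grads = [gradient(w, shard) for shard in shards]  # computed per device
    avg_grad = sum(local_grads) / num_workers               # aggregated on one device
    new_w = w - lr * avg_grad
    return [new_w] * num_workers                            # broadcast to all replicas

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
replicas = data_parallel_step(0.0, data, num_workers=2, lr=0.01)
single_device = 0.0 - 0.01 * gradient(0.0, data)
```

With equally sized shards, averaging the local gradients reproduces the full-batch gradient, so every replica ends the step with the same parameters a single device would have computed.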
Robustness and generalization
Robustness and generalization are essential for reliable deployment of MRL models. Robustness ensures stable performance under perturbations or distribution shifts, while generalization enables effective prediction on unseen molecular domains.
Enhancement strategies operate at both the data and model levels. Data augmentation improves robustness by expanding distributional diversity, such as using multiple SMILES representations, generating 3D conformers via molecular simulations, or applying perturbations to atomic and bond features. Multimodal integration with contrastive alignment further strengthens cross-domain transferability [157].
At the algorithmic level, meta-learning approaches improve rapid adaptation to limited data [158]. Sparse attention mechanisms in Transformer architectures reduce sensitivity to irrelevant long-range interactions [159]. Additionally, probabilistic techniques such as Monte Carlo Dropout [160], which averages predictions over multiple stochastic forward passes, help mitigate noise and improve predictive reliability.
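The idea behind Monte Carlo Dropout can be sketched as follows: dropout stays active at inference time, the model is run several times, and the spread of the outputs serves as an uncertainty estimate. The example assumes a toy linear model with dropout applied to its inputs; the weights, dropout rate, and number of passes are hypothetical values chosen for illustration.

```python
import random
import statistics


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def predict_once(features, weights, bias, rate, rng):
    """One stochastic forward pass of a linear model with dropout on the inputs."""
    dropped = dropout(features, rate, rng)
    return sum(w * x for w, x in zip(weights, dropped)) + bias


def mc_dropout_predict(features, weights, bias, rate=0.2, passes=200, seed=0):
    """Mean prediction and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [predict_once(features, weights, bias, rate, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical molecular descriptor vector and model parameters.
features = [0.5, 1.0, -0.3, 0.8]
weights = [0.4, -0.2, 0.7, 0.1]
mean_pred, std_pred = mc_dropout_predict(features, weights, bias=0.05)
```

A large standard deviation relative to the mean suggests that the prediction for that molecule is unreliable and may warrant lower weight or experimental follow-up.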
Key Points
Provide the first systematic review of molecular representation learning (MRL) foundation models, with a focus on small-molecule representation techniques.
Categorize MRL foundation models into unimodal and multimodal architectures and analyze representative models in each category.
Summarize four mainstream pretraining strategies, highlighting their applicability and influence on downstream molecular property prediction and generation tasks.
Discuss challenges in model interpretability and provide practical guidelines for selecting appropriate MRL foundation models in real-world drug discovery scenarios.
Propose future research directions, including multimodal fusion, data scarcity solutions, efficiency improvements, and robustness enhancement.
Acknowledgements
The authors thank anonymous reviewers for their valuable suggestions.
Contributor Information
Bosheng Song, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Jiayi Zhang, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Ying Liu, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Yuansheng Liu, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Jing Jiang, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Sisi Yuan, School of Chinese Medicine, Hong Kong Baptist University, 15 Baptist University Road, Kowloon Tong, Kowloon, Hong Kong SAR 999077, China.
Xia Zhen, National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, No. 109 Deya Road, Kaifu District, 410086 Changsha, China.
Yiping Liu, College of Computer Science and Electronic Engineering, Hunan University, 116 Lushan South Road, Yuelu District, 410086 Changsha, China.
Author contributions
B.S. and J.Z. conceptualized the framework of the review, authored the initial draft, and fine-tuned various details. Y.L. and Y.L. contributed significantly by gathering critical data and offering expert insights that were crucial for revising the manuscript. S.Y., J.J., X.Z., and Y.L. oversaw the project’s strategic planning and played a key role in securing the financial support necessary for our research. All authors have read and approved the final manuscript for publication.
Conflict of interest: None declared.
Funding
This work was supported by the National Natural Science Foundation of China (grant nos 62272151, 62202153, 62522110, and 62472152) and Hunan Provincial Natural Science Foundation of China (grant no. 2024JJ4015).
Data availability
All the code and data tables used for the figures in the manuscript are available on GitHub: https://github.com/Z-dot-max/MRL_Foundation_Review/.
References
- 1. Feinberg EN, Sur D, Wu Z. et al. PotentialNet for molecular property prediction. ACS Central Sci 2018;4:1520–30. 10.1021/acscentsci.8b00507
- 2. Cheng F, Zhao Z. Machine learning-based prediction of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. J Am Med Inform Assoc 2014;21:e278–86. 10.1136/amiajnl-2013-002512 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Lim J, Ryu S, Kim JW. et al. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J Cheminform 2018;10:1–9. 10.1186/s13321-018-0286-7
- 4. Lee M, Min K. MGCVAE: multi-objective inverse design via molecular graph conditional variational autoencoder. J Chem Inf Model 2022;62:2943–50. 10.1021/acs.jcim.2c00487 [DOI] [PubMed] [Google Scholar]
- 5. Atz K, Grisoni F, Schneider G. Geometric deep learning on molecular representations. Nat Mach Intell 2021;3:1023–32. 10.1038/s42256-021-00418-8 [DOI] [Google Scholar]
- 6. Cereto-Massagué A, Ojeda MJ, Valls C. et al. Molecular fingerprint similarity search in virtual screening. Methods 2015;71:58–63. 10.1016/j.ymeth.2014.08.005 [DOI] [PubMed] [Google Scholar]
- 7. Wang J, Hsieh C-Y, Wang M. et al. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat Mach Intell 2021;3:914–22. 10.1038/s42256-021-00403-1 [DOI] [Google Scholar]
- 8. Grisoni F, Moret M, Lingwood R. et al. Bidirectional molecule generation with recurrent neural networks. J Chem Inf Model 2020;60:1175–83. 10.1021/acs.jcim.9b00943 [DOI] [PubMed] [Google Scholar]
- 9. Kearnes S, McCloskey K, Berndl M. et al. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des 2016;30:595–608. 10.1007/s10822-016-9938-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Wu Z, Wang J, Du H. et al. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nat Commun 2023;14:2585. 10.1038/s41467-023-38192-3
- 11. Schütt K, Kindermans P-J, Felix HES. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. Adv Neural Inf Process Syst 2017;30:109–18. [Google Scholar]
- 12. Shin W-H, Zhu X, Bures MG. et al. Three-dimensional compound comparison methods and their application in drug discovery. Molecules 2015;20:12841–62. 10.3390/molecules200712841 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Wornow M, Xu Y, Thapa R. et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med 2023;6:135. 10.1038/s41746-023-00879-8
- 14. Yuan Y. On the power of foundation models. In: International Conference on Machine Learning, pp. 40519–30. PMLR, 2023.
- 15. Moor M, Banerjee O, Abad ZSH. et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–65. 10.1038/s41586-023-05881-4 [DOI] [PubMed] [Google Scholar]
- 16. Schneider J, Meske C, Kuss P. Foundation models. Bus Inf Syst Eng 2024;66:221–31. 10.1007/s12599-024-00851-0
- 17. Zhou C, Li Q, Li C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Int J Mach Learn Cybern 2024;1–65.
- 18. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97. 10.1007/BF00994018 [DOI] [Google Scholar]
- 19. Shen L, Wu J, Yang W. Multiscale quantum mechanics/molecular mechanics simulations with neural networks. J Chem Theory Comput 2016;12:4934–46. 10.1021/acs.jctc.6b00663
- 20. Qiu J, Li L, Sun J. et al. Large AI models in health informatics: applications, challenges, and the future. IEEE J Biomed Health Inform 2023;27:6074–87. 10.1109/JBHI.2023.3316750 [DOI] [PubMed] [Google Scholar]
- 21. Chang Y, Wang X, Wang J. et al. A survey on evaluation of large language models. ACM Trans Intell Syst Technol 2024;15:1–45. 10.1145/3641289 [DOI] [Google Scholar]
- 22. Zhao WX, Zhou K, Li J. et al. A survey of large language models. arXiv preprint arXiv:2303.18223. 2023.
- 23. Zhou Y, Chia MA, Wagner SK. et al. A foundation model for generalizable disease detection from retinal images. Nature 2023;622:156–63. 10.1038/s41586-023-06555-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Touvron H, Lavril T, Izacard G. et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
- 25. Huang L, Zhang H, Xu T. et al. MDM: molecular diffusion model for 3D molecule generation. Proceedings of the AAAI Conference on Artificial Intelligence 2023;37:5105–12. 10.1609/aaai.v37i4.25639
- 26. Xia J, Zhu YQ, Du YQ. et al. A systematic survey of chemical pre-trained models. In: Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 2023.
- 27. Zhu J, Xia Y, Wu L. et al. Unified 2D and 3D pre-training of molecular representations. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2022.
- 28. Huber W, Carey VJ, Long L. et al. Graphs in molecular biology. BMC Bioinform 2007;8:S8–8. 10.1186/1471-2105-8-S6-S8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Zhang L, Han J, Wang H. et al. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys Rev Lett 2018;120:143001.
- 30. Jung KH. Uncover this tech term: foundation model. Korean J Radiol 2023;24:1038–41. 10.3348/kjr.2023.0790 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Wang Y, Xiao J, Suzek TO. et al. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 2009;37:W623–33. 10.1093/nar/gkp456 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Irwin JJ, Sterling T, Mysinger MM. et al. Zinc: a free tool to discover chemistry for biology. J Chem Inf Model 2012;52:1757–68. 10.1021/ci3001277 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Axelrod S, Gómez-Bombarelli R. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 2022;9:185. 10.1038/s41597-022-01288-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Morgan HL. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. J Chem Doc 1965;5:107–13. 10.1021/c160017a018 [DOI] [Google Scholar]
- 35. Glem RC, Bender A, Arnby CH. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to adme. IDrugs 2006;9:199–204. [PubMed] [Google Scholar]
- 36. Duvenaud DK, Maclaurin D, Aguilera-Iparraguirre J. et al. Convolutional networks on graphs for learning molecular fingerprints. In: Proceedings of Advances in Neural Information Processing Systems, Vol. 28, pp. 2224–32, 2015. [Google Scholar]
- 37. Weininger D. Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. 10.1021/ci00057a005 [DOI] [Google Scholar]
- 38. Deng J, Yang Z, Wang H. et al. A systematic study of key elements underlying molecular property prediction. Nat Commun 2023;14:6395.
- 39. Krenn M, Häse F, Nigam A. et al. SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint arXiv:1905.13741. 2019.
- 40. Tu ZK, Coley CW. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. J Chem Inf Model 2022;62:3503–13. [DOI] [PubMed] [Google Scholar]
- 41. Danel T, Spurek P, Tabor J. et al. Spatial graph convolutional networks. In: Yang H, Pasupa K, Leung AC-S. et al. (eds.), Neural Information Processing, pp. 668–75. Cham: Springer International Publishing, 2020. 10.1007/978-3-030-63823-8_76 [DOI] [Google Scholar]
- 42. Jiao R, Han J, Huang W. et al. Energy-motivated equivariant pretraining for 3D molecular graphs. AAAI Conference on Artificial Intelligence 2022;37:8096–104. 10.1609/aaai.v37i7.25978 [DOI] [Google Scholar]
- 43. Moon K, Im H-J, Kwon S. 3D graph contrastive learning for molecular property prediction. Bioinformatics 2023;39:btad371. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Xu Z, Luo YZ, Zhang X. et al. Molecule3D: a benchmark for predicting 3D geometries from molecular graphs. arXiv preprint arXiv:2110.01717. 2021.
- 45. Liu SC, Guo HY, Tang J. Molecular geometry pretraining with se (3)-invariant denoising distance matching. arXiv preprint arXiv:2206.13602. 2022.
- 46. Amigó JM, Gálvez J, Villar VM. A review on molecular topology: applying graph theory to drug discovery and design. Naturwissenschaften 2009;96:749–61. 10.1007/s00114-009-0536-7 [DOI] [PubMed] [Google Scholar]
- 47. Edelsbrunner H, Koehl P. The geometry of biomolecular solvation. In: Goodman JE, O'Rourke J (eds.), Combinatorial and Computational Geometry, Vol. 52, pp. 243–75. New York: Cambridge University Press, 2005. [Google Scholar]
- 48. Fang Y, Zhang Q, Zhang N. et al. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nat Mach Intell 2023;5:542–53. 10.1038/s42256-023-00654-0 [DOI] [Google Scholar]
- 49. Zeng X, Xiang H, Yu L. et al. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell 2022;4:1004–16. 10.1038/s42256-022-00557-6
- 50. Zeng Z, Yao Y, Liu Z. et al. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nat Commun 2022;13:862. 10.1038/s41467-022-28494-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Ahmad W, Simon E, Chithrananda S. et al. ChemBERTa-2: towards chemical foundation models. arXiv preprint arXiv:2209.01712. 2022.
- 52. Ross J, Belgodere BM, Chenthamarakshan V. et al. Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell 2022;4:1256–64. 10.1038/s42256-022-00580-7
- 53. Liu SC, Wang HC, Liu WY. et al. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728. 2021.
- 54. Hou Z, Liu X, Cen Y. et al. Graphmae: Self-supervised masked graph autoencoders. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, Association for Computing Machinery (ACM) SIGKDD, 2022.
- 55. Zhou G, Gao Z, Ding Q. et al. Uni-Mol: A universal 3D molecular representation learning framework. In: Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, ICLR, 2023.
- 56. Feng J, Wang Z, Li Y. et al. MGMAE: Molecular representation learning by reconstructing heterogeneous graphs with a high mask ratio. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, ACM, 2022.
- 57. Edwards C, Lai T, Ros K. et al. Translation between molecules and natural language. In: Goldberg Y, Kozareva Z, Zhang Y (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 375–413. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022. [Google Scholar]
- 58. Fang Y, Zhang N, Chen Z. et al. Domain-agnostic molecular generation with chemical feedback. In: Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, ICLR, 2023.
- 59. Lewis M, Liu YH. et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, ACL, pp. 7871–80, 2020.
- 60. Yu BT, Baker FN, Chen ZQ. et al. Llasmol: advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391. 2024.
- 61. Jiang AQ, Sablayrolles A, Mensch A. et al. Mistral 7B. arXiv preprint arXiv:2310.06825. 2023.
- 62. Edwards C, Naik A, Khot T. et al. Synergpt: In-context learning for personalized drug synergy prediction and drug design. arXiv preprint arXiv:2307.11694. 2023.
- 63. Li T, Shetty S, Kamath A. et al. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. NPJ Digit Med 2024;7:40. 10.1038/s41746-024-01024-9
- 64. Liang H, Du X, Zhu B. et al. Graph contrastive learning with implicit augmentations. Neural Netw 2023;163:156–64. 10.1016/j.neunet.2023.04.001
- 65. Hu WH, Liu BW, Gomes J. et al. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265. 2019. Available at: https://arxiv.org/abs/1905.12265 (Accessed: 2026-01-01).
- 66. You Y, Chen T, Yang S. et al. Graph contrastive learning automated. In: Proceedings of the International Conference on Machine Learning, Virtual Event, IMLS, 2021.
- 67. Suresh S, Pan L, Hao C. Adversarial graph augmentation to improve graph contrastive learning. In: Ranzato MA, Beygelzimer A, Dauphin Y. et al. (eds.), Advances in Neural Information Processing Systems, Vol. 34, pp. 12150–62. Red Hook, NY, USA: Curran Associates, Inc., 2021. [Google Scholar]
- 68. Xu MH, Wang H, Ni BB. et al. Self-supervised graph-level representation learning with local and global structure. In: Proceedings of Machine Learning Research, 2021, Vol. 139, pp. 11548–58. PMLR, Virtual Event, 2021. [Google Scholar]
- 69. Li PY, Wang J, Qiao YX. et al. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform 2021;22:bbab109. [DOI] [PubMed] [Google Scholar]
- 70. Li PY, Wang J, Qiao YX. et al. Learn molecular representations from large-scale unlabeled molecules for drug discovery. arXiv preprint arXiv:2012.11175. 2020.
- 71. Zhang Z, Liu Q, Wang H. et al. Motif-based graph self-supervised learning for molecular property prediction. In: Ranzato MA. et al. (eds.), Advances in Neural Information Processing Systems 34, pp. 10812–24. Red Hook, NY, USA: Curran Associates, Inc., 2021. [Google Scholar]
- 72. Ying C, Cai T, Luo S. et al. Do transformers really perform badly for graph representation? In: Ranzato MA. et al. (eds.), Advances in Neural Information Processing Systems 34, pp. 22259–71. Red Hook, NY, USA: Curran Associates, Inc., 2021. [Google Scholar]
- 73. You Y, Chen T, Wang Z. et al. Bringing your own view: Graph contrastive learning without prefabricated data augmentations. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, ACM, 2022. [DOI] [PMC free article] [PubMed]
- 74. Xia J, Wu L, Chen J. et al. Simgrace: A simple framework for graph contrastive learning without data augmentation. In: Proceedings of the ACM Web Conference 2022, Virtual Event, ACM, 2022.
- 75. Li H, Zhao D, Zeng J. Kpgt: Knowledge-guided pre-training of graph transformer for molecular property prediction. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, ACM SIGKDD, 2022.
- 76. Xia J, Zhao C, Hu B. et al. Mole-bert: Rethinking pre-training graph neural networks for molecules. In: International Conference on Learning Representations, Kigali, Rwanda, ICLR, 2023.
- 77. Xu W, Zhao H, Tu W. et al. Automated 3D pre-training for molecular property prediction. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, ACM SIGKDD, 2023.
- 78. Masters D, Dean J, Klaser K. et al. GPS++: An optimised hybrid MPNN/transformer for molecular property prediction. arXiv preprint arXiv:2212.02229. 2022.
- 79. He K, Zhang X, Ren S. et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, IEEE, pp. 770–8, 2016.
- 80. Guo Z, Sharma PK, Martinez A. et al. Multilingual molecular representation learning via contrastive pre-training. In: Annual Meeting of the Association for Computational Linguistics, Virtual Event, ACL, 2021.
- 81. Lin X, Chi X, Xiong Z. et al. Pangu drug model: learn a molecule like a human. Sci China Life Sci 2022;66:879–82. [DOI] [PubMed] [Google Scholar]
- 82. Zhu J, Xia Y, Wu L. et al. Dual-view molecular pre-training. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, ACM SIGKDD, 2023.
- 83. Luo SJ, Chen TL, Xu YX. et al. One transformer can understand both 2D & 3D molecular data. arXiv preprint arXiv:2210.01765. 2022.
- 84. Liu S, Nie W, Wang C. et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat Mach Intell 2023;5:1447–57. 10.1038/s42256-023-00759-6 [DOI] [Google Scholar]
- 85. Li J, Liu Y, Fan W. et al. Empowering molecule discovery for molecule-caption translation with large language models: A ChatGPT perspective. In: IEEE transactions on knowledge and data engineering, IEEE, Piscataway, NJ, USA, 2024.
- 86. Chaves JMZ, Wang E, Tu T. et al. Tx-LLM: a large language model for therapeutics. 2024.
- 87. Anil R, Dai AM, Firat O. et al. Palm 2 technical report. 2023.
- 88. Ma TF, Lin X, Li TL. et al. Y-Mol: a multiscale biomedical knowledge-guided large language model for drug development. arXiv preprint arXiv:2410.11550. 2024.
- 89. Cai H, Zhang H, Zhao D. et al. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022;23:bbac408. 10.1093/bib/bbac408
- 90. In Lam HY, Pincket R, Han H. et al. Application of variational graph encoders as an effective generalist algorithm in computer-aided drug design. Nat Mach Intell 2023;5:754–64. 10.1038/s42256-023-00683-9 [DOI] [Google Scholar]
- 91. Krenn M, Ai Q, Barthel S. et al. Selfies and the future of molecular string representations. Patterns 2022;3:100588. 10.1016/j.patter.2022.100588 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92. Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885. 2020.
- 93. White AD. The future of chemistry is language. Nat Rev Chem 2023;7:457–8. 10.1038/s41570-023-00502-0 [DOI] [PubMed] [Google Scholar]
- 94. Hirohara M, Saito Y, Koda Y. et al. Convolutional neural network based on smiles representation of compounds for detecting chemical motif. BMC Bioinform 2018;19:526. 10.1186/s12859-018-2523-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95. Han S, Fu H, Wu Y. et al. HimGNN: a novel hierarchical molecular graph representation learning framework for property prediction. Brief Bioinform 2023;24:bbad305. 10.1093/bib/bbad305
- 96. Zhou J, Cui GQ, Hu SD. et al. Graph neural networks: a review of methods and applications. AI Open 2020;1:57–81. Elsevier. [Google Scholar]
- 97. Zhang S, Tong H, Xu J. et al. Graph convolutional networks: a comprehensive review. Comput Soc Netw 2019;6. 10.1186/s40649-019-0069-y
- 98. Veličković P, Cucurull G, Casanova A. et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.
- 99. Chen Z, Villar S, Chen L. On the equivalence between graph isomorphism testing and function approximation with GNNs. In: Wallach HM, Larochelle H, Beygelzimer A. et al. (eds.), Advances in Neural Information Processing Systems, Vol. 32. Red Hook, NY, USA: Curran Associates, Inc., 2019. [Google Scholar]
- 100. Zaidi S, Schaarschmidt M, Martens J. et al. Pre-training via denoising for molecular property prediction. In: International Conference on Learning Representations, Kigali, Rwanda, ICLR, 2023.
- 101. Danielsson P-E. Euclidean distance mapping. Computer Graphics and image processing 1980;14:227–48. 10.1016/0146-664X(80)90054-4 [DOI] [Google Scholar]
- 102. Xiong R, Yang Y, He D. et al. On layer normalization in the transformer architecture. In: Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, Hadley, MA, USA: PMLR. Daumé H III, Singh A (eds.), pp. 10524–33, 2020.
- 103. Dokmanić I, Parhizkar R, Ranieri J. et al. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Process Mag 2015;32:12–30. 10.1109/MSP.2015.2398954 [DOI] [Google Scholar]
- 104. Salihoglu S, Widom J. GPS: A graph processing system. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, Baltimore, MD, USA, IEEE Computer Society, pp. 1–12, 2013.
- 105. Chen T, Kornblith S, Norouzi M. et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, pp. 1597–607. PMLR, 2020. [Google Scholar]
- 106. He K, Fan H, Wu Y. et al. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, IEEE/CVF, pp. 9726–35, 2020.
- 107. Wang Y, Wang J, Cao Z. et al. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 2022;4:279–87. 10.1038/s42256-022-00447-x [DOI] [Google Scholar]
- 108. Yang K, Swanson K, Jin W. et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model 2019;59:3370–88. 10.1021/acs.jcim.9b00237 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109. Li S, Zhou J, Xu T. et al. Geomgcl: Geometric graph contrastive learning for molecular property prediction. In: Proceedings of the Thirty-Six AAAI Conference on Artificial Intelligence, Virtual Event, AAAI (Association for the Advancement of Artificial Intelligence) Vol. 36, pp. 4541–9, 2022. 10.1609/aaai.v36i4.20377 [DOI]
- 110. Sanchez-Fernandez A, Rumetshofer E, Hochreiter S. et al. Contrastive learning of image-and structure-based representations in drug discovery. In: Proceedings of the ICLR 2022 Workshop on Machine Learning for Drug Discovery. 2022.
- 111. Hansen K, Biegler F, Ramakrishnan R. et al. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. J Phys Chem Lett 2015;6:2326–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112. Li Z, Jiang M, Wang S. et al. Deep learning methods for molecular representation and property prediction. Drug Discov Today 2022;27:103373. 10.1016/j.drudis.2022.103373 [DOI] [PubMed] [Google Scholar]
- 113. Heid E, Greenman KP, Chung Y. et al. Chemprop: a machine learning package for chemical property prediction. J Chem Inf Model 2023;64:9–17. 10.1021/acs.jcim.3c01250 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114. Ma H, Yatao Bian Y, Rong WH. et al. Cross-dependent graph neural networks for molecular property prediction. Bioinformatics 2022;38:2003–9. 10.1093/bioinformatics/btac039 [DOI] [PubMed] [Google Scholar]
- 115. Zhao B, Xu W, Guan J. et al. Molecular property prediction based on graph structure learning. Bioinformatics 2024;40:btae304. 10.1093/bioinformatics/btae304
- 116. Sun M, Xing J, Meng H. et al. Molsearch: Search-based multi-objective molecular generation and property optimization. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, ACM SIGKDD (Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining), 2022. [DOI] [PMC free article] [PubMed]
- 117. Mahmood O, Mansimov E, Bonneau R. et al. Masked graph modeling for molecule generation. Nat Commun 2021;12:3156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118. Patrick Walters W, Barzilay R. Applications of deep learning in molecule generation and molecular property prediction. Acc Chem Res 2020;54:263–70. 10.1021/acs.accounts.0c00699 [DOI] [PubMed] [Google Scholar]
- 119. Jin W, Barzilay R, Jaakkola T. Junction tree variational autoencoder for molecular graph generation. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, Hadley, MA, USA: PMLR. Dy J, Krause A (eds.), pp. 2323–32, 2018.
- 120. Jin W, Barzilay R, Jaakkola T. Hierarchical generation of molecular graphs using structural motifs. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Event, PMLR (Proceedings of Machine Learning Research), 2020.
- 121. Maziarz K, Jackson-Flux H, Cameron P. et al. Learning to extend molecular scaffolds with structural motifs. arXiv preprint arXiv:2103.03864. 2021.
- 122. Xu M, Wang W, Luo S. et al. An end-to-end framework for molecular conformation generation via bilevel programming. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, PMLR (Proceedings of Machine Learning Research), 2021.
- 123. Mei S, Zhang K. A machine learning framework for predicting drug–drug interactions. Sci Rep 2021;11:17619. 10.1038/s41598-021-97193-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124. Zhang W, Chen Y, Liu F. et al. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinform 2017;18:1–12. 10.1186/s12859-016-1415-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125. Ryu JY, Kim HU, Lee SY. Deep learning improves prediction of drug–drug and drug–food interactions. Proc Natl Acad Sci U S A 2018;115:E4304–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126. Deng Y, Xu X, Qiu Y. et al. A multimodal deep learning framework for predicting drug–drug interaction events. Bioinformatics 2020;36:4316–22. 10.1093/bioinformatics/btaa501
- 127. Wang Y, Pang C, Wang Y. et al. Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks. Nat Commun 2023;14:6155. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128. Behrouzi T, Hatzinakos D. Graph variational auto-encoder for deriving EEG-based graph embedding. Pattern Recognit 2022;121:108202. 10.1016/j.patcog.2021.108202 [DOI] [Google Scholar]
- 129. Radford A, Narasimhan K, Salimans T. et al. Improving Language Understanding by Generative Pre-Training. San Francisco, CA, USA, 2018. [Google Scholar]
- 130. Guha R. On the interpretation and interpretability of quantitative structure–activity relationship models. J Comput Aided Mol Des 2008;22:857–71. 10.1007/s10822-008-9240-5 [DOI] [PubMed] [Google Scholar]
- 131. Shoaib MR, Emara HM, Zhao J. A survey on the applications of frontier AI, foundation models, and large language models to intelligent transportation systems. In: Proceedings of the 2023 International Conference on Computer and Applications (ICCA 2023), Dubai, UAE, IEEE (Institute of Electrical and Electronics Engineers), pp. 1–7, 2023.
- 132. Wiggins WF, Tejani AS. On the opportunities and risks of foundation models for natural language processing in radiology. Radiology: Artificial Intelligence 2022;4:e220119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133. Jiménez-Luna J, Grisoni F, Schneider G. Drug discovery with explainable artificial intelligence. Nat Mach Intell 2020;2:573–84. 10.1038/s42256-020-00236-4 [DOI] [Google Scholar]
- 134. Ribeiro MT, Singh S, Guestrin C. Anchors: High-precision model-agnostic explanations. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 32, 2018. [Google Scholar]
- 135. Wachter S, Mittelstadt B, Russell C. Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harv JL & Tech 2017;31:841. [Google Scholar]
- 136. Dhurandhar A, Chen P-Y, Luss R. et al. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In: Advances in Neural Information Processing Systems, Vol. 31, 2018.
- 137. Ying Z, Bourgeois D, You J. et al. GNNExplainer: generating explanations for graph neural networks. Adv Neural Inf Process Syst 2019;32:9240–51.
- 138. Muhammed MT, Aki-Yalcin E. Pharmacophore modeling in drug discovery: methodology and current status. J Turk Chem Soc Sect Chem 2021;8:749–62. 10.18596/jotcsa.927426
- 139. Wu C-K, Zhang X-C, Yang Z-J. et al. Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules. Brief Bioinform 2021;22:bbab327. Oxford University Press.
- 140. Li C, Wang J, Niu Z. et al. A spatial-temporal gated attention module for molecular property prediction based on molecular geometry. Brief Bioinform 2021;22. Oxford Academic.
- 141. Xiang H, Zeng L, Hou L. et al. A molecular video-derived foundation model for scientific drug discovery. Nat Commun 2024;15:9696. 10.1038/s41467-024-53742-z
- 142. Song Q, Sun B, Li S. Multimodal sparse transformer network for audio-visual speech recognition. IEEE Transactions on Neural Networks and Learning Systems 2023;34:10028–38. 10.1109/TNNLS.2022.3163771
- 143. Xu G, Chen H, Li F-L. et al. AliMe MKG: A multi-modal knowledge graph for live-streaming e-commerce. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM 2021), Virtual Event, ACM SIGIR (Association for Computing Machinery Special Interest Group on Information Retrieval), pp. 4808–12, 2021.
- 144. Schwarzer M, Rajkumar N, Noukhovitch M. Pretraining representations for data-efficient reinforcement learning. In: Larochelle H, Ranzato M, Hadsell R. et al. (eds.), Advances in Neural Information Processing Systems, Vol. 34. Red Hook, NY, USA: Curran Associates, Inc., 2021.
- 145. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng 2010;22:1345–59. 10.1109/TKDE.2009.191
- 146. Li H, Zhang R, Min Y. et al. A knowledge-guided pre-training framework for improving molecular representation learning. Nat Commun 2023;14:7568.
- 147. Zhou G, Wang Z, Feng Y. et al. S-MolSearch: 3D semi-supervised contrastive learning for bioactive molecule search. Advances in Neural Information Processing Systems 2024;37:74715–37.
- 148. Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research. Hadley, MA, USA: PMLR. Precup D, Teh YW (eds.), pp. 1126–35, 2017.
- 149. Ke G, Wang B, Wang X. et al. Rethinking multi-view representation learning via distilled disentangling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, IEEE/CVF (Institute of Electrical and Electronics Engineers/Computer Vision Foundation), pp. 26774–83, 2024.
- 150. Li X, Xiong H, Li X. et al. Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond. Knowl Inf Syst 2021;64:3197–234.
- 151. Li J, Chen X, Hovy E. et al. Visualizing and understanding neural models in NLP. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 681–91, 2016.
- 152. Ji J, Zhou J, Lou H. et al. Align anything: training all-modality models to follow instructions with language feedback. 2024. arXiv preprint arXiv:2412.15838.
- 153. Zeng Z, Liu C, Tang Z. et al. ACCTFM: an effective intra-layer model parallelization strategy for training large-scale transformer-based models. IEEE Trans Parallel Distrib Syst 2022;33:4326–38. 10.1109/TPDS.2022.3187815
- 154. Zhang M, Hu Z, Li M. Duet: a compiler-runtime subgraph scheduling approach for tensor programs on a coupled CPU-GPU architecture. In: Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021), Virtual Event, IEEE (Institute of Electrical and Electronics Engineers), 2021, pp. 151–61.
- 155. Yao L, Ge Z. Big data quality prediction in the process industry: a distributed parallel modeling framework. J Process Control 2018;68:1–13. 10.1016/j.jprocont.2018.04.004
- 156. Zhao B, Zhou H, Li G. et al. ZenLDA: large-scale topic model training on distributed data-parallel platform. Big Data Min Anal 2018;1:57–74.
- 157. Yuxun Q, Tang Y, Zhang C. et al. Dual-space contrastive learning for open-world semi-supervised classification. IEEE Transactions on Neural Networks and Learning Systems 2025;1–14.
- 158. Collins L, Mokhtari A, Shakkottai S. Task-robust model-agnostic meta-learning. Advances in Neural Information Processing Systems 2020;33:18860–71.
- 159. Zhang X-C, Wu C-K, Yang Z-J. et al. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform 2021;22:bbab152.
- 160. Milanés-Hermosilla D, Trujillo Codorniú R, López-Baracaldo R. et al. Monte Carlo dropout for uncertainty estimation and motor imagery classification. Sensors 2021;21:7241. MDPI.
Data Availability Statement
All code and data tables used for the figures in the manuscript are available on GitHub: https://github.com/Z-dot-max/MRL_Foundation_Review/.
















