Multimodal Cross-Attention Molecular Property Prediction for Text, Sequence, Graph, and Geometry

Shihao Sun; Peng Wang; Yunjiangcan He; Jiao Yang; Songjiang Li

doi:10.1021/acsomega.5c07964

2025 Nov 11;10(46):56225–56239. doi: 10.1021/acsomega.5c07964

PMCID: PMC12658644 PMID: 41322622

Abstract

The use of single-modal molecular representations limits the accuracy of standard Quantitative Structure–Property Relationship (QSPR) models, which are essential for speeding up drug discovery and material design. We address this by introducing the multimodal cross-attention molecular property prediction (MCMPP) model, which integrates SMILES, ECFP fingerprints, molecular graphs, and 3D molecular conformations through a cross-attention mechanism after being independently processed by Transformer-Encoder, BiLSTM, GCN, and reduced Unimol+. Tests on four data sets (Delaney, Lipophilicity, SAMPL, and BACE) demonstrate how MCMPP improves prediction accuracy by using complementary effects across modalities. According to experimental data, MCMPP works better than other fusion procedures, obtaining the greatest Pearson correlation coefficient and demonstrating its effectiveness as a material design and drug discovery tool.

1. Introduction

Predicting molecular properties, such as electrical conductivity, biological activity, and solubility, is essential in materials science and drug design. Traditionally, this has depended on experimental trial-and-error techniques, which are expensive, time-consuming, and complicated. To make predictions more quickly and effectively, computational approaches have become increasingly popular. While conventional methods such as quantitative structure–activity relationship (QSAR) modeling, molecular docking, and molecular dynamics simulations provide theoretical interpretability, their efficacy is limited by their computational complexity and reliance on experimental data. As a result, deep learning (DL), which is especially good at modeling nonlinear interactions in high-dimensional chemical spaces, has become popular. DL has shown remarkable efficacy in improving the accuracy and efficiency of molecular property prediction, outperforming conventional machine learning techniques and driving progress in chemistry, physics, biology, and materials research.

Applying deep learning to the prediction of molecular properties requires a molecular representation. SMILES, Extended-Connectivity Fingerprints (ECFP), molecular graphs, and three-dimensional conformations are some of the ways in which molecular information may be represented. Incorporating these representations into machine learning (ML) and deep learning (DL) models has significantly improved quantitative structure–activity relationship (QSAR) predictions. SMILES sequences express molecular structures as character strings, and recurrent neural networks (RNNs) trained on them capture both syntactic structure and the distribution of chemical space. More recently, SMILES representations have also been encoded using large language models (LLMs); KFLM2, for example, combines in-depth domain-specific information with LLMs to predict molecular properties accurately. To capture molecular substructures, ECFP creates binary bit strings, and methods such as BiGRU improve the extraction of substructural features. Graph convolutional networks (GCNs) such as TrimNet and message-passing neural networks (MPNNs) parse molecular graphs, in which atoms are nodes and bonds are edges, to extract structural information for prediction. Although early research relied on 2D graphs or 1D SMILES, 3D conformations are essential for precise predictions of quantum chemical properties. Models such as Uni-Mol and SphereNet, which learn from 3D molecular structures, show that integrating 3D data enhances the representation of geometric features. Beyond these frequently used structural representations, other chemical modalities have also been investigated for molecular property prediction. ImageMol predicts molecular characteristics and drug-target interactions using image-based representations. MMFRL improves embedding initialization through relational learning, while ACML advances drug discovery through spectrum-based multimodal learning; both have a wide range of applications in materials science and drug discovery. These representations provide complementary information, and SMILES, ECFP, molecular graphs, and 3D molecular conformations can all be generated using tools such as RDKit. Future studies should concentrate on fusion techniques to maximize the predictive capacity of deep learning models.

Mainstream deep learning models are often restricted to single-modal input, which limits their capacity to make use of complementary interactions across modalities. To tackle this issue, researchers have created ensemble, multitask, and mixed molecular representation learning techniques. However, these methods may make the model less interpretable and put more strain on neural network fusion processes. As a result, multimodal collaborative reasoning, which incorporates complementary information and interactions across modalities, is becoming more important than single-modal feature extraction. Multimodal technology is an important area of artificial intelligence: it makes it possible to collect pertinent information from diverse sources and to analyze intermodal dependencies through unified representations, deep reasoning, and scenario-based applications. Multimodal learning has produced important advances in the prediction of molecular properties. One multimodal fusion framework, for instance, combines molecular graph data, chemical fingerprints, and SMILES sequences. A multigranularity fusion model combines atomic-level and motif-level graph information. Furthermore, the SGGRL model combines geometric characteristics, graph structure, and SMILES into a joint representation. These frameworks have shown the advantages of multimodal cooperation by significantly improving prediction accuracy and flexibility across a range of chemical tasks.

Unimodal learning concentrates on getting the most out of a single data modality (such as text, images, or molecular graphs); its disadvantage is that it cannot represent cross-modal synergy and complementary information. Multimodal learning, on the other hand, addresses the problem of creating efficient fusion mechanisms for diverse modalities, which exhibit intricate nonlinear interactions such as redundancy, complementarity, and collaboration. Balancing information interaction across modalities is a major difficulty in multimodal fusion, and it calls for a method that measures each modality’s contribution under given constraints.

Fusion methods now in use fall into several categories: feature-level, decision-level, hybrid-level, and model-level fusion. Although feature-level fusion links feature vectors directly, it may increase complexity and obscure the relationships between modalities. Recent methods, such as the use of Low-rank Matrix Factorization (LMF) in multimodal compound property prediction, address the resulting dimensionality and computational difficulties. The types of modalities involved should determine which fusion strategy is used. For example, transcription factor binding site prediction has been improved by cross-attention methods such as the MultiTF model. Cross-attention is also widely used in molecular property prediction; two examples are the GIT-Mol model for molecular research and the DLF-MFF method, which combines several molecular representations to increase prediction accuracy and task flexibility.

Current research in molecular property prediction continues to face several key challenges. First, single-modal data characterization is insufficient for capturing the full spectrum of molecular information. Second, existing multimodal approaches often lack systematic strategies for selecting appropriate modalities. Finally, many fusion methods struggle to efficiently integrate and fully leverage the heterogeneous information across different modalities.

To address these challenges, this work proposes MCMPP, a deep learning model for molecular property prediction. Its main novelty is a cross-attention mechanism that maps data from several modalities into a single representation, which improves the model’s capacity to incorporate various modalities and to scale to additional ones. MCMPP is applicable to several types of molecular property prediction tasks. The main contributions of this work are as follows:

We propose a multimodal cross-attention molecular property prediction (MCMPP) model that uses Transformer-Encoder, BiLSTM, GCN, and reduced Unimol+ models to process four commonly used and complementary molecular representations in bioinformatics: SMILES, ECFP fingerprints, molecular graphs, and 3D molecular conformations. The model leverages the strengths of these deep learning architectures to integrate diverse molecular information, thereby enhancing both the accuracy and reliability of property predictions.
MCMPP integrates the four molecular representations (SMILES, ECFP fingerprints, molecular graphs, and 3D molecular conformations) through the Cross-Attention Mechanism (CAM). Comparison tests against current fusion techniques show that this approach incorporates complementary data more successfully in multimodal settings, greatly enhancing the model’s performance.
To assess the performance of the MCMPP model, we compared its Root-Mean-Square Error (RMSE) with that of many advanced models on downstream molecular property regression tasks. On many benchmark data sets, the experimental results show that MCMPP outperforms current approaches by a substantial margin. This result demonstrates the model’s efficacy on this task and offers compelling support for the viability of deep learning-based multimodal modeling.

2. Data Sets and Methodology

2.1. Data Sets

Data selection is an important part of the molecular property prediction task. Five benchmark regression data sets were used in this study: three physicochemical property data sets (Delaney, Lipophilicity, SAMPL) and two biophysical property data sets (BACE and the refined set of PDBbind) from MoleculeNet. All data sets were randomly divided into training, validation, and test sets in an 8:1:1 ratio. The training set is used for learning and updating model parameters, the validation set is used for hyperparameter tuning and model selection, and the test set contains independent samples that are not seen during training or validation, which helps to assess the model’s generalization performance. Table S1 of the Supporting Information contains comprehensive attribute information for every data set.

a. Delaney. The Delaney data set contains 1128 organic small molecule samples, all with experimentally determined values. The data originate from the solubility research of Delaney et al. and were standardized by the DeepChem team before being added to the MoleculeNet benchmark collection.
b. Lipophilicity. The Lipophilicity data set comprises 4200 organic small molecule samples from the ChEMBL database. The samples are labeled with the logarithmic distribution coefficients of molecules in the oil–water two-phase system (logD values), which indicate lipophilicity under physiological conditions (pH 7.4). The logD values were measured using the shake flask method, chromatography, and potentiometric titration. The collection covers many molecular structures, including drug molecules, natural products, and synthetic chemicals.
c. SAMPL. The SAMPL collection contains 642 organic small molecule samples, all obtained by experimental determination. The data set concentrates on the hydration free energy of molecules during their transition from the gas phase to the aqueous phase, with values expressed in kcal/mol.
d. BACE. The BACE (β-secretase 1 inhibitor) data set includes in vitro experimental data for 1513 compounds. It quantifies the half-maximal inhibitory concentration (IC₅₀, expressed in nM/μM) of compounds against β-secretase (BACE1). Every sample also carries a binary activity label (0/1). Only the regression task of the BACE data set is considered in this study.
e. PDBbind. The PDBbind collection provides crystal structures and binding affinities of protein–ligand complexes. It is mostly used to examine the interaction between proteins and small molecule ligands and is extensively used in drug design, molecular docking, and computational chemistry. The data set comes in two versions, the general set and the refined set; the latter is selected through a series of screening criteria. This study uses the refined set of PDBbind v2020. After data cleaning, the data set contains 5168 biomolecular complex binding constants.
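The random 8:1:1 partition described in Section 2.1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_indices`, the fixed seed, and the choice to assign the rounding remainder to the test set are all assumptions made for the example.

```python
import random

def split_indices(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition sample indices into train/validation/test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = int(ratios[0] * n_samples)
    n_valid = int(ratios[1] * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder, so no sample is dropped
    return train, valid, test

# Example with the Delaney data set size quoted in the text (1128 samples).
train_idx, valid_idx, test_idx = split_indices(1128)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because the three index lists are disjoint slices of one shuffled list, no test sample can leak into training or validation.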

2.2. Multimodal Feature Representation

For property prediction in Quantitative Structure–Property Relationship (QSPR) modeling, choosing an appropriate molecular representation is essential. Common representations include SMILES, which encodes atom and bond information in strings; Extended-Connectivity Fingerprints (ECFP), which capture local chemical environments and topological structures; molecular graphs, which represent atoms and bonds in graph form; and 3D molecular conformations, which describe spatial structures and atomic interactions.

Each representation offers distinct, complementary insights: SMILES delivers a serialized description, ECFP concentrates on local chemical properties, molecular graphs capture topological connectivity, and 3D molecular conformations record spatial interactions. To enhance property prediction, this work fuses these four representations (SMILES, ECFP, molecular graphs, and 3D molecular conformations) to create a more complete molecular characterization system, as illustrated in the accompanying figure.

Molecular representation of a molecule. (A) SMILES vectors. (B) ECFP fingerprints. (C) Molecular graph. (D) 3D molecular conformation.

2.2.1. SMILES Vectors

SMILES is a textual format that encodes molecular structures using ASCII strings, representing atoms, bonds, connectivity, and stereochemistry. It efficiently records crucial molecular data.

This work employs a preprocessing pipeline to prepare SMILES for Transformer input. To preserve important structural information, a regex-based tokenizer first splits each SMILES string into chemically relevant tokens (for example, “C(O)O” yields [“C”, “(O)”, “O”]; see the accompanying figure). Each token is then assigned its own index in a single token dictionary. This dictionary is used to convert every SMILES string into a fixed-length integer sequence, with shorter sequences zero-padded (length statistics are given in Figure S1 and Table S2 of the Supporting Information). Finally, these integer sequences are passed through an embedding layer to produce dense vector representations for Transformer input.
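The tokenize, index, and pad steps of this pipeline can be sketched as below. The regular expression is an assumption (a commonly used style of SMILES pattern, not the one from the paper), and unlike the grouped “(O)” token in the example above it emits parentheses as separate tokens. The names `tokenize`, `build_vocab`, and `encode` are hypothetical.

```python
import re

# Assumed token pattern (not from the paper): bracket atoms, two-letter halogens,
# organic-subset atoms, aromatic atoms, ring closures, bonds, and branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-\+\\/\(\)\.:~@\*\$])"
)

def tokenize(smiles):
    """Split a SMILES string into tokens, refusing strings with unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

def build_vocab(smiles_list):
    """Assign each distinct token an integer index; index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length, zero-padded integer sequence."""
    ids = [vocab[token] for token in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["CCO", "C(O)O"])
print(encode("CCO", vocab, 5))
```

The resulting integer sequences are what an embedding layer would consume; the embedding itself is left out because it depends on the deep learning framework in use.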

2.2.2. ECFP Fingerprints

In cheminformatics, ECFP is a popular method for expressing molecular topology. It decomposes a molecule into substructures centered on heavy atoms and then iteratively extends them along bonds to form larger patterns, assigning each pattern a distinct identifier. In this work, ECFP fingerprints were generated from SMILES using RDKit, with a radius of 2 (ECFP4) and a bit length of 1024. The result is a fixed-length binary vector in which each bit denotes whether a substructure is present. The identifiers are compressed with a hash function, which makes the fingerprints suitable for property prediction and similarity search. Every data set except PDBbind uses 1024-bit fingerprints; see Table S2 of the Supporting Information for the specific settings.

2.2.3. Molecular Graph

Molecular graphs, the core representation in cheminformatics and computational chemistry, use graph theory to abstract a molecular structure into a topological network of vertices (atoms) and edges (chemical bonds). The graph is formalized as G = (V, E) with vertex set V and edge set E. Each atom in V is represented by a multidimensional feature vector that encodes element type (44 dimensions), valence (11 dimensions, representing formal charge and hybridization state), hydrogen count (11 dimensions), bond environment (11 dimensions, identifying bond order types), and aromaticity (a 1-dimensional Boolean value), for 44 + 11 + 11 + 11 + 1 = 78 dimensions in total. The edge set E captures atom connectivity and bond information, such as bond type, spatial effects, and simplified bond lengths. Because it faithfully represents structural features such as rings, branches, and functional groups, this structure provides rich input for graph convolutional networks (GCNs) in molecular property prediction.

2.2.4. 3D Molecular Conformation

The spatial arrangement of atoms inside a molecule is described by its 3D molecular conformation, which is determined by variables such as bond lengths, bond angles, and dihedral angles. This structure represents a stable energy state of the molecule and directly affects properties such as polarity, hydrophobicity, and biological activity. Even minor conformational changes may cause significant variations in molecular behavior, particularly in drug-target interactions. To support precise drug design, recent developments model 3D conformations by combining deep learning with physical chemistry. Energy-stable 3D molecular conformations can be generated efficiently from SMILES using programs such as RDKit.

To guarantee structural completeness, molecular objects were first created from SMILES strings using the MolFromSmiles function, and hydrogen atoms were then added using AddHs. Initial three-dimensional conformations were generated with the Experimental-Torsion Knowledge Distance Geometry (ETKDG) method, which combines distance geometry with experimental torsion-angle information to create plausible spatial arrangements. Finally, stable low-energy conformations were obtained by energy minimization with the MMFF94 molecular force field.

2.3. Multimodal Prediction Model

To achieve accurate molecular property prediction, this work proposes a multimodal cross-attention molecular property prediction (MCMPP) model that integrates heterogeneous molecular information. Its hierarchical design comprises four functional components (see the accompanying figure). The feature representation layer (panel A) takes in four forms of molecular representation to capture multifaceted information about molecules: the SMILES symbolic representation, the ECFP topology, the two-dimensional molecular graph, and the three-dimensional molecular conformation. The feature extraction layer (panel B) uses an encoder suited to each kind of information: a Transformer-Encoder extracts the SMILES feature H_S; a combination of TCN, BiLSTM, and a multihead attention mechanism extracts the ECFP feature H_e; GCN layers followed by global pooling produce the molecular graph feature H_g; and reduced Unimol+ produces the three-dimensional conformation feature H_3D. The feature fusion layer (panel C) uses two different approaches. The first uses CAM to model the cross-modal interaction of the four feature vectors and passes the result to the prediction layer. The second predicts from each of the four features through a separate inference layer, concatenates the outputs, and computes modality weights with one of five methods: LASSO, Elastic Net, Random Forest (RF), Gradient Boosting (GB), and Stochastic Gradient Descent (SGD); a fully connected layer then uses the weighted sum to produce the predicted values. The prediction layer (panel D) applies an MLP to perform a nonlinear mapping on the cross-attention output H_fused and produce the predicted molecular property values. Table S3 of the Supporting Information lists the specific hyperparameters of the feature extraction layer.

Overall structure of the multimodal cross-attention molecular property prediction (MCMPP) model. (A) Feature representation layer. (B) Feature extraction layer. (C) Feature fusion layer. (D) Prediction layer.

2.3.1. Transformer-Encoder

As shown in panel B.a, the Transformer-Encoder layer includes a multimodal feature extraction module called MCMPP_Transformer, which is made up of two sequentially applied sublayers: the Multi-Head Self-Attention Sublayer and the Position-wise Feedforward Sublayer. Features are extracted from the molecular representation X through token embedding, position encoding, and encoder processing. The process is as follows. First, the molecular sequence input X is mapped to the embedding vector P by the Token Embedding layer. The sinusoidal position embedding PE is given by:

PE (pos, 2 i) = \sin (\frac{pos}{10000^{2 i / d}})

PE (pos, 2 i + 1) = \cos (\frac{pos}{10000^{2 i / d}})

Where pos is the position, i ranges over $[0, \frac{d}{2})$, and d is the input size. For any fixed offset k, PE_{pos+k} can be expressed as a linear function of PE_{pos}.
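As an illustration of the two position-embedding formulas, the sketch below fills a table in which even columns hold the sine terms and odd columns hold the cosine terms. The function name `sinusoidal_positions` and the requirement that d be even are assumptions of this example, not details taken from the paper.

```python
import math

def sinusoidal_positions(length, d_model):
    """Return a length x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(length):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # PE(pos, 2i)
            row[2 * i + 1] = math.cos(angle)  # PE(pos, 2i + 1)
        table.append(row)
    return table

pe = sinusoidal_positions(length=4, d_model=8)
print(pe[1][:2])  # sin(1) and cos(1) for position 1, frequency index 0
```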

Second, the encoder takes in the matrix $H \in R^{l \times d}$ , where d is the input dimension and l is the length of the molecular sequence. Three learnable parameter matrices are used to perform a linear transformation on this matrix:

Q = H W_{q}, K = H W_{k}, V = H W_{v}

The learnable matrices W_q, W_k, and W_v each have dimensions $R^{d \times d_{k}}$, where d_k is a hyperparameter that sets the subspace dimension. The scaled dot-product attention is then calculated as follows.

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

Then, to enhance the feature expression ability, multihead attention is adopted:

Q^{(h)} = H W_{q}^{(h)}, K^{(h)} = H W_{k}^{(h)}, V^{(h)} = H W_{v}^{(h)}

{head}^{(h)} = Attention (Q^{(h)}, K^{(h)}, V^{(h)})

The outputs of the n heads are then concatenated and linearly fused.

H = [{head}^{(1)} ∥ \cdot\cdot\cdot ∥ {head}^{(n)}] W_{O}

Where H is the output matrix of the Transformer, n is the number of heads, h is the index of the head, and each head has its own independent parameters W_q^{(h)}, W_k^{(h)}, W_v^{(h)}. The term [head⁽¹⁾ ∥···∥head⁽ⁿ⁾] denotes concatenation along the last dimension; its output size is $R^{l \times d}$, where d is the output dimension, and W_O is a learnable matrix of size $R^{d \times d}$.

2.3.2. TCN-BiLSTM-Attention Mechanism

The TCN-BiLSTM-attention mechanism (MCMPP_BiLSTM) was designed to extract molecular ECFP features efficiently. The Temporal Convolutional Network (TCN) uses residual connections and one-dimensional causal convolutions to preserve the sequential properties of the input and speed up convergence, while its dilated convolutions extract broader features, reducing noise and emphasizing important patterns. This makes it easier for the BiLSTM to identify significant dependencies in the data. The attention mechanism then assigns weights to the components of the BiLSTM output, highlighting pertinent information and improving prediction accuracy.

The TCN is made up of stacked residual units, each of which has two convolutional blocks with dilated causal convolutions, weight normalization, ReLU activation, and Dropout. The BiLSTM contains 256 forward and 256 backward units, as shown in panel B.b. Each LSTM unit maintains a cell state c and a hidden state h, and the forget gate f, input gate i, and output gate o govern the flow of information through the sequence.

At the time-step t (t ≤ L, where L represents the length of the input sequence), the states and gates are calculated as follows:

f_{t} = σ (w_{f} \cdot [h_{t - 1}, x_{t}] + b_{f})

i_{t} = σ (w_{i} \cdot [h_{t - 1}, x_{t}] + b_{i})

{\hat{c}}_{t} = \tanh (w_{c} \cdot [h_{t - 1}, x_{t}] + b_{c})

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ {\hat{c}}_{t}

o_{t} = σ (w_{o} \cdot [h_{t - 1}, x_{t}] + b_{o})

h_{t}^{forward} = o_{t} ⊙ \tanh (c_{t})

At time-step t, the weights w_* and biases b_* are applied as above to the previous cell state c_{t–1}, the previous hidden state h_{t–1}, and the input x_t (the feature embedding of the corresponding element of the input sequence). The concatenation of vectors is denoted by [,], and σ(·) is the sigmoid activation function. By concatenating the hidden states of every LSTM cell at every time-step, the BiLSTM layer’s output is represented as an L × 256 matrix. The backward LSTM performs symmetric computations, producing h_t^{backward}. The final bidirectional representation is formed by concatenating forward and backward hidden states:

H = [h_{1}^{forward} \oplus h_{1}^{backward}, . . ., h_{L}^{forward} \oplus h_{L}^{backward}]
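The gate equations and the bidirectional concatenation can be traced in a few lines of plain Python. This is a sketch under stated assumptions: the initial hidden and cell states are taken to be zero, each direction gets its own parameter dictionary, and the names `lstm_step`, `run_lstm`, and `bilstm` are hypothetical. A real model would use a framework implementation with 256 units per direction.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(w, b, v):
    """Return w . v + b for a weight matrix w (list of rows) and bias vector b."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations in the text."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["w_f"], p["b_f"], z)]
    i_t = [sigmoid(a) for a in affine(p["w_i"], p["b_i"], z)]
    c_hat = [math.tanh(a) for a in affine(p["w_c"], p["b_c"], z)]
    o_t = [sigmoid(a) for a in affine(p["w_o"], p["b_o"], z)]
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def run_lstm(xs, p, hidden):
    """Run one direction over the sequence, starting from zero states (assumption)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs

def bilstm(xs, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_lstm(xs, p_fwd, hidden)
    backward = run_lstm(xs[::-1], p_bwd, hidden)[::-1]  # re-align to input order
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Each row of the `bilstm` output has twice the hidden size, because it holds the forward and backward states side by side.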

Multihead attention consists of 10 attention heads, each of which performs the following scaled dot-product attention:

M_{i}^{Q} = H \cdot W_{i}^{Q}, M_{i}^{K} = H \cdot W_{i}^{K}, M_{i}^{V} = H \cdot W_{i}^{V}

H_{i} = softmax (M_{i}^{Q} \cdot {(M_{i}^{K})}^{T} / \sqrt{d_{i}}) \cdot M_{i}^{V}

Finally, output the attention matrix of the concatenated heads:

H = [H_{1} ∥ H_{2} ∥ \cdot\cdot\cdot ∥ H_{10}]

Where H is the output matrix of the feature extraction module; H_i is the attention matrix of the i-th attention head; M_i^Q, M_i^K, and M_i^V are the query, key, and value matrices, respectively; and d_i is the scale factor.

2.3.3. GCN

The molecular graph represents a molecule as nodes (atoms) and edges (bonds): it encodes atomic properties in a feature matrix and captures connectivity in an adjacency matrix, and each atom is represented by a 78-feature vector of atomic properties. To address the shortcomings of conventional molecular graph neural networks, this research proposes an enhanced method.

First, the base model, dubbed MCMPP_GCN, applies two GCN layers followed by a global pooling layer to extract features from the 2D molecular graph (panel B.c). The GCN takes the adjacency matrix A and the node feature matrix H as input, and the graph convolution at layer l is expressed as

H^{(l + 1)} = σ ({\tilde{D}}^{- 1 / 2} \tilde{A} {\tilde{D}}^{- 1 / 2} H^{(l)} W^{(l)})

Where Ã = A + I_n is the adjacency matrix of an undirected graph with added self-connections, D̃ is the diagonal node degree matrix with D̃_ii = ∑_j Ã_ij, H^(l) is the output of the l-th layer, and W^(l) is the learnable parameter matrix of that layer.
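A single graph convolution of this form can be checked by hand on a three-atom chain. The sketch below uses plain Python lists; ReLU stands in for the generic activation σ, and the names `normalized_adjacency` and `gcn_layer` as well as the toy features and weights are assumptions chosen for the example.

```python
import math

def normalized_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # row sums of A + I
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, h, w):
    """One propagation step: ReLU(normalized adjacency . H . W)."""
    z = matmul(matmul(normalized_adjacency(adj), h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy chain graph 0-1-2 with 2 features per node and 1 output channel.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0], [-1.0]]
print(gcn_layer(adj, features, weights))
```

The symmetric normalization keeps high-degree atoms from dominating the aggregated features, which is why the middle atom's contributions are scaled by its larger degree.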

Second, a residual-enhanced GCN design is suggested in order to address the gradient degradation problem that is common in deep GCN:

H^{(l + 1)} = GELU (BN (GCN (H^{(l)}) + W_{res} H^{(l)}))

Here the residual connection increases the training convergence speed of the 2-layer GCN, GELU is the activation function, BN is batch normalization, and W_res is the weight matrix of the residual mapping.

2.3.4. Reduced Unimol+

In this work, 3D molecular conformational features are extracted using the Unimol+ framework, which uses a dual-track Transformer with atomic and pair representation modules. Since only feature extraction is required, we removed the original property prediction module and refer to the remaining portion as MCMPP_Unimol+. It has a 4-layer architecture with a 256-dimensional hidden layer, and no fixed early stop is used.

The accompanying figure illustrates the two steps of MCMPP_Unimol+ operation: RDKit first creates initial 3D conformations from SMILES, and a dual-track Transformer made up of L modules, each of which has a feedforward network (FFN), then refines these conformations over R iterations. The iterative update works as follows: at iteration t, each module predicts a coordinate offset ΔX_t(Δx, Δy, Δz), and the atomic coordinates X_t(x, y, z) are updated through a residual connection:

X_{t + 1} = X_{t} + Δ X_{t}

After R refinement rounds, the resulting coordinates serve as an estimate of the DFT equilibrium conformation and are used for predicting molecular properties.

Overall flowchart of reduced MCMPP_Unimol+.

2.4. Feature Fusion

The variety of modalities that can be integrated is expected to increase as multimodal learning research progresses. This growth in the number of modalities highlights the drawbacks of relying only on late-fusion techniques. These approaches usually combine modalities only at the decision stage, after each one has been analyzed separately, often using methods such as majority voting or weighted averaging. This strategy has two major drawbacks: first, it has trouble capturing the intricate dynamic interactions between modalities; second, weaker modalities may harm the overall performance of the model. To overcome these problems, we introduce the Cross-Attention Mechanism (CAM), which is intended to model the complementarities and correlations between modal features earlier and more accurately.

2.4.1. Cross-Attention Mechanism Fusion

The Cross-Attention Mechanism (CAM), the central mechanism of the MCMPP model, addresses the inadequate integration of different representations in classic multimodal fusion models. By supporting dynamic feature alignment, multimodal fusion, and interpretability modeling, the cross-attention mechanism in MCMPP plays an essential role in molecular property prediction. It offers significant benefits in terms of lower processing costs and improved generalization across samples. It also shows great promise for aiding research into chemical mechanisms, offering a dependable and effective computational tool for material design. In particular, as the accompanying figure shows, the cross-attention mechanism aligns heterogeneous modal information, including textual modalities such as SMILES strings. This is expressed mathematically as multihead attention:

MultiHead (Q_{T_{t}}, K, V) = Concat ({head}_{1}, . . ., {head}_{h}) W_{O}

{head}_{i} = CrossAttn (Q_{T_{t}} W_{Q i}, K W_{K i}, V W_{V i})

where Q_{T_t} is the query from the target text modality, K is the key, V is the value, and W denotes the weight matrices, which are trained to project the inputs into the appropriate space. The index i denotes the source modality, one of the four categories of molecular representation information (SMILES, ECFP, molecular graph, and three-dimensional molecular conformation). The MCMPP model aligns the T_t modality with each source modality by computing a separate set of cross-attention weights for each one. This procedure ensures that the fine details of each feature are faithfully conveyed in the corresponding text representation. The attention mechanism is given by:

CrossAttn (Q_{T_{t}}, K, V) = softmax (\frac{Q_{T_{t}} K^{T}}{\sqrt{d_{k}}}) V

Where d _k is the dimension of the query and the key.
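The cross-attention formula differs from self-attention only in where the inputs come from: the query rows belong to the target modality, while the key and value rows belong to a source modality, so the two sequences may have different lengths. The sketch below illustrates this with plain Python lists. It omits the learned projections W_Qi, W_Ki, and W_Vi (treated as identity here, which is an assumption), and the function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention(query, key, value):
    """softmax(Q K^T / sqrt(d_k)) V, with Q from the target modality and
    K, V from a source modality; projections are omitted in this sketch."""
    d_k = len(key[0])
    scores = matmul(query, transpose(key))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, value), weights

# Two target tokens attend over three source elements (toy numbers).
target_query = [[1.0, 0.0], [0.0, 1.0]]
source_key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source_value = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused, attn = cross_attention(target_query, source_key, source_value)
print(len(fused), len(fused[0]))  # one fused row per target token
```

The output has one row per query element, so the target modality keeps its own length while absorbing information from the source.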

Detailed flowcharts of the two fusion methods. The upper panel shows the flowchart of the cross-attention mechanism, and the lower panel shows the flowchart of decision-level fusion.

By leveraging this cross-attention framework, MCMPP can effectively identify and encode complex relationships and dependencies among modalities.

2.4.2. Decision-Level Fusion

This research uses decision-level fusion, also known as late fusion, to allow for a fair comparison. Each modality is first modeled separately, and the results are then aggregated to obtain the final prediction, as shown in the decision-level fusion flowchart. Following the principles of multimodal multifeature deep learning (MMFDL), five methods are used for decision-level fusion: Lasso, Elastic Net, Random Forest (RF), Gradient Boosting (GB), and Stochastic Gradient Descent (SGD). The resulting models are designated MCMPP_LASSO, MCMPP_Elastic, MCMPP_RF, MCMPP_GB, and MCMPP_SGD, respectively. The weight of each modality is determined as follows:

O = Concat (O_{Transformer}, O_{BiLSTM}, O_{GCN}, O_{Unimol +})

W_{1}, W_{2}, W_{3}, W_{4} = Weight (O)

where O_Transformer, O_BiLSTM, O_GCN, and O_Unimol+ are the outputs of the four feature extraction models, and O is the tensor that concatenates these molecular features. W₁, W₂, W₃, and W₄ are the weights of the four modality models, and Weight(·) is the method used to calculate their importance.

During the testing phase, the final predicted value is obtained by multiplying each modality weight by the corresponding modality output computed on the test data and summing the results:

output = W_{1} O_{Transformer}^{'} + W_{2} O_{BiLSTM}^{'} + W_{3} O_{GCN}^{'} + W_{4} O_{Unimol +}^{'}

where O_Transformer′, O_BiLSTM′, O_GCN′, and O_Unimol+′ are the outputs of the four feature extraction models in the test phase.
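
The weighting and aggregation steps can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the paper: the modality outputs are synthetic, the noise levels are hypothetical, and ordinary least squares stands in for Weight(·) in place of Lasso, Elastic Net, RF, GB, or SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 200, 50

def fake_modal_outputs(y, noise_levels, rng):
    # One column per modality: the target plus modality-specific noise.
    return np.column_stack(
        [y + rng.normal(scale=s, size=y.shape) for s in noise_levels]
    )

# Hypothetical noise levels; the fourth modality is the most accurate.
noise = [0.8, 1.0, 0.9, 0.3]
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

# O = Concat(O_Transformer, O_BiLSTM, O_GCN, O_Unimol+)
O_train = fake_modal_outputs(y_train, noise, rng)
# Outputs of the same four models on the test set (the primed quantities).
O_test = fake_modal_outputs(y_test, noise, rng)

# W1..W4 = Weight(O); least squares is an assumed stand-in for the fusion model.
W, *_ = np.linalg.lstsq(O_train, y_train, rcond=None)

# output = W1*O'_Transformer + W2*O'_BiLSTM + W3*O'_GCN + W4*O'_Unimol+
output = O_test @ W
print(np.round(W, 3))
```

In this toy setting the least noisy modality receives the largest weight. A poorly fitted weighting model could instead favor weaker modalities, which is the failure mode that late fusion is exposed to.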

2.5. Prediction

A multilayer perceptron (MLP) serves as the primary regression model in the prediction stage. The MLP captures linear relationships through its fully connected layers and models nonlinear patterns through activation functions, which gives it both computational efficiency and good baseline performance in cross-modal regression problems. The MLP receives the fused feature vector H_fused from the upstream modules and uses it for forward propagation. The process is defined as follows:

y_{pred} = W_{2} \cdot ReLU (W_{1} H_{fused} + b_{1}) + b_{2}

where W_k is a learnable weight matrix, b_k is a bias vector, and ReLU is the activation function.

The output layer of the MLP uses the identity transformation, f(x) = x, to satisfy the continuous-value requirement of the regression task. With this design the output maps directly to the true scale of the target variable, which avoids prediction bias caused by the saturation intervals or restricted value ranges of activation functions.
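
A minimal NumPy sketch of this prediction head is given below. The layer sizes and the random weights are hypothetical; only the functional form, a ReLU hidden layer followed by an identity output, follows the equation above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_head(h_fused, W1, b1, W2, b2):
    # Batched (row-vector) form of y_pred = W2 . ReLU(W1 H_fused + b1) + b2.
    hidden = relu(h_fused @ W1.T + b1)
    # Identity output f(x) = x: no squashing, so predictions are unbounded.
    return hidden @ W2.T + b2

# Hypothetical dimensions and untrained weights.
rng = np.random.default_rng(0)
d_in, d_hidden, batch = 32, 16, 8
W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(1, d_hidden))
b2 = np.zeros(1)

H_fused = rng.normal(size=(batch, d_in))
y_pred = mlp_head(H_fused, W1, b1, W2, b2)
print(y_pred.shape)
```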

3. Results and Discussion

To assess the proposed MCMPP model from a variety of angles, this section presents extensive experimental results obtained in the chosen training environment. Table S3 of the Supporting Information lists all of the experimental parameters used in this section. Tables S4 and S5 of the Supporting Information describe the virtual environment setup and the hardware used in the tests.

3.1. Performance Analysis

Earlier research has shown that four molecular features obtained from different models (SMILES encoding vectors, ECFP fingerprints, molecular graphs, and three-dimensional molecular conformations) are complementary and theoretically connected. A thorough understanding of this complementarity is expected to significantly enhance the model’s ability to capture molecular information. To verify this hypothesis, we feed the output of each single-modal model, after feature extraction, into a fully connected layer. The high-dimensional hidden-layer vectors produced by each feature encoder are then reduced in dimension with the Uniform Manifold Approximation and Projection (UMAP) technique. The UMAP figure below displays the visualization results for the SAMPL and Delaney data sets.

UMAP dimension reduction diagrams of the fully connected layer vectors of the four modalities of the training set based on the Delaney and SAMPL data sets.

Areas of overlap and nonoverlap are highlighted by the UMAP visualization analysis, which shows different distribution patterns for the four feature classes in the projected space. While the nonoverlapping portions unmistakably show the distinct information contained inside each feature, the overlapping regions represent shared information across features. In addition to providing important insights for creating a successful plan for integrating multimodal molecular characteristics, this feature distribution pattern experimentally validates the theoretical complementarity of the four features and the viability of their integration.

This work built single-modal and multimodal deep learning models in parallel to evaluate the efficacy of multimodal learning in molecular property prediction. The multimodal model integrates four molecular representations as inputs: SMILES encoding vectors, ECFP fingerprints, molecular graph structures, and three-dimensional molecular conformations. This method uses the complementarity of multisource features to extract deeper molecular information. Pearson correlation coefficients were computed on the four data sets to assess the model’s effectiveness (see Table 1).

Table 1. Pearson Coefficients of Different Methods on Four Data Sets

data set	Delaney	lipophilicity	SAMPL	BACE
MCMPP_Transformer	0.943	0.675	0.941	0.716
MCMPP_BiLSTM	0.813	0.726	0.830	0.812
MCMPP_GCN	0.936	0.671	0.766	0.608
MCMPP_Unimol+	0.960	0.891	0.968	0.826
MCMPP_LASSO	0.966	0.902	0.958	0.790
MCMPP_Elastic	0.960	0.880	0.979	0.813
MCMPP_RF	0.963	0.882	0.962	0.806
MCMPP_GB	0.965	0.885	0.956	0.821
MCMPP_SGD	0.964	0.888	0.955	0.804
MCMPP_CAM	0.971	0.913	0.978	0.841

Larger values indicate better learning performance of the model.

The values of the best-performing model are indicated in bold.

The results show that the multimodal models (MCMPP_LASSO, MCMPP_Elastic, MCMPP_RF, MCMPP_GB, MCMPP_SGD, and MCMPP_CAM) generally outperform the single-modal models across the four data sets. The main exception is MCMPP_Unimol+, which did well on the BACE and lipophilicity data sets. The other single-modal models often fared worse, particularly on the lipophilicity multiattribute data set. Within the multimodal fusion framework, MCMPP_CAM stood out: apart from a slightly lower Pearson correlation coefficient on the SAMPL data set (0.978 versus 0.979 for MCMPP_Elastic), it achieved the best prediction performance on all other tasks.

We derive an important conclusion from these experimental results: multimodal learning performs noticeably better than single-modal learning and allows for more precise molecular property prediction by including complementary feature information.

To thoroughly assess the effectiveness of multimodal fusion models in molecular property prediction, this study used the Pearson correlation coefficient (Pearson), mean absolute error (MAE), and root-mean-square error (RMSE) as performance metrics across four independent data sets (Delaney, SAMPL, Lipophilicity, and BACE). The quantitative results (see Tables 2 and 3) show that the RMSE and MAE values of most multimodal fusion models (including MCMPP_LASSO, MCMPP_Elastic, MCMPP_RF, MCMPP_GB, and MCMPP_SGD) are lower than those of most single-modal benchmark models, which indicates that multimodal feature fusion improves prediction accuracy. However, in some cases (such as the BACE data set), several late fusion models showed unusual fluctuations, and their error values even exceeded those of the best single-modal model (MCMPP_Unimol+).
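
For reference, the three metrics can be computed as in the following sketch. The toy arrays are invented for illustration and are not taken from any of the data sets.

```python
import numpy as np

def pearson(y_true, y_pred):
    # Covariance of the centered vectors divided by the product of their norms.
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: every prediction is off by a constant 0.5.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])

print(pearson(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

The Pearson coefficient is unchanged by a constant shift or a positive rescaling of the predictions, whereas RMSE and MAE are not. A model can therefore rank well on Pearson while still showing a large error, so the three metrics are best read together.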

Table 2. RMSE Performance of Different Methods on Four Data Sets

data set	Delaney	lipophilicity	SAMPL	BACE
MCMPP_Transformer	0.641	0.916	1.368	1.058
MCMPP_BiLSTM	1.321	0.815	1.698	0.777
MCMPP_GCN	0.826	0.893	2.350	1.042
MCMPP_Unimol+	0.638	0.734	1.319	0.725
MCMPP_LASSO	0.638	0.617	1.277	0.959
MCMPP_Elastic	0.815	0.986	1.193	1.569
MCMPP_RF	0.789	0.815	1.257	0.948
MCMPP_GB	0.748	0.755	1.301	0.794
MCMPP_SGD	0.634	0.726	1.318	0.904
MCMPP_CAM	0.566	0.611	1.032	0.696

Lower values indicate better learning performance of the model.

The values of the best-performing model are indicated in bold.

Table 3. MAE Performance of Different Methods on Four Data Sets

data set	Delaney	lipophilicity	SAMPL	BACE
MCMPP_Transformer	0.480	0.689	1.017	0.826
MCMPP_BiLSTM	0.953	0.603	1.271	0.509
MCMPP_GCN	0.677	0.710	1.791	0.866
MCMPP_Unimol+	0.449	0.508	0.882	0.517
MCMPP_LASSO	0.443	0.464	0.875	0.650
MCMPP_Elastic	0.552	0.703	0.851	1.232
MCMPP_RF	0.491	0.648	0.913	0.652
MCMPP_GB	0.482	0.564	0.878	0.547
MCMPP_SGD	0.439	0.544	0.863	0.580
MCMPP_CAM	0.403	0.458	0.724	0.451

Lower values indicate better learning performance of the model.

The values of the best-performing model are indicated in bold.

An attribution analysis of the weight distribution graph indicates that this anomaly stems from an intrinsic limitation of decision-level fusion, namely the way weights are assigned to suboptimal modalities. For instance, on the BACE data set, MCMPP_Elastic gives excessive weights to low-performing modalities (such as MCMPP_Transformer and MCMPP_GCN) and undervalues the contribution of high-performing modalities (such as MCMPP_Unimol+), as shown in the weight distribution figure. In contrast, the feature-level fusion model proposed in this study, MCMPP_CAM, had the lowest RMSE and MAE values on every task. By learning features jointly, this model effectively reduces the weight distribution biases present in late fusion. Consequently, MCMPP_CAM offered the best-performing strategy among those tested for fusing multimodal molecular representations and showed remarkable stability across varied data contexts.

Weight distribution of each modal input of the decision-level fusion method on different data sets.

3.2. Model Reliability Analysis

This research uses repeated experiments to evaluate the stability and reliability of model performance. Single-modal and multimodal fusion models are compared over 15 independent trials, and their stability is assessed from the distribution of the Pearson correlation coefficient. The results show that the standard deviation of the Pearson correlation coefficient is much lower for the multimodal fusion models than for the single-modal models across the molecular data sets (see the Pearson coefficient box plot). The Pearson coefficients of the single-modal models are also widely dispersed, suggesting that their prediction performance depends heavily on the characteristics of the data set and carries a high level of uncertainty. The multimodal fusion techniques, by contrast, are markedly more robust. Notably, MCMPP_CAM achieves the best Pearson coefficient on the majority of data sets, which shows its adaptability across a range of settings. On the SAMPL data set, MCMPP_Elastic performs very well because it is better suited to the intrinsic properties of that data set. Overall, by combining complementary features, the multimodal fusion techniques reduce the impact of random factors and produce a more concentrated distribution of correlation coefficients. From a statistical standpoint, these findings support the greater stability and reliability of the multimodal fusion models in molecular property prediction.
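
The stability comparison can be summarized with a few descriptive statistics per model, as in this sketch. The scores are randomly generated placeholders, not results from the paper, and the two model names are hypothetical.

```python
import numpy as np

def summarize_runs(scores):
    # Descriptive statistics over repeated runs with different seeds.
    s = np.asarray(scores, dtype=float)
    return {
        "mean": float(s.mean()),
        "std": float(s.std(ddof=1)),   # sample standard deviation
        "min": float(s.min()),
        "median": float(np.median(s)),
        "max": float(s.max()),
    }

# Placeholder Pearson coefficients for 15 runs of two hypothetical models.
rng = np.random.default_rng(42)
n_runs = 15
runs = {
    "stable_model": rng.normal(loc=0.95, scale=0.005, size=n_runs),
    "volatile_model": rng.normal(loc=0.80, scale=0.04, size=n_runs),
}

summary = {name: summarize_runs(vals) for name, vals in runs.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

A smaller standard deviation and a narrower range between the minimum and maximum correspond to the more concentrated boxes described in the text.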

Pearson coefficient box plot. Pearson coefficients of different data sets using four single-modal learning methods and six fusion methods.

The comparative analysis in Figure S7 of the Supporting Information shows that single-modal models often perform poorly on prediction tasks across data sets. Their mean absolute error (MAE) and root-mean-square error (RMSE) are often higher, and their Pearson correlation coefficients lower, than those of the multimodal fusion methods. Notably, on the SAMPL data set, although the MCMPP_Elastic model achieved the highest Pearson coefficient, its results fluctuated over a relatively wide range, indicating a certain degree of instability. Within the multimodal fusion framework, the MCMPP_CAM model therefore demonstrates the best overall performance. With the exception of the SAMPL data set, MCMPP_CAM achieves the best Pearson coefficient, RMSE, and MAE on all data sets.

Over the 15 rounds of repeated trials, the multimodal models, and the MCMPP_CAM method in particular, proved more reliable and more stable overall. This approach adapts better to the varied features of different data sets and generalizes better.

3.3. Model Utilize Characteristic Distribution Analysis

To examine the relationship between input features, this research uses the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction approach to show the spatial distribution of SMILES encoding vectors and ECFP fingerprints. Graph data and 3D molecular conformation data are not suited to this dimensionality reduction because their node and edge formats carry information such as bond lengths and bond angles. The poor performance of single-modal models may be explained by the observed discrepancy in distribution between the test and training sets. As seen in panel A of the UMAP figure, the SMILES encoding vectors of the test and training sets overlap notably on three data sets, the exception being BACE. This is consistent with the error results, in which the Transformer model did not reach acceptable RMSE and MAE values on the BACE data set. The ECFP fingerprints in panel B show a low degree of overlap between the training and test sets of the SAMPL data set. As a result, the RMSE of the MCMPP_BiLSTM model is higher than that of the MCMPP_Transformer model on this data set. These findings suggest that the predictive performance of the model is closely related to the extent of overlap between the training and test sets in the feature representation space. In particular, insufficient overlap between the test and training sets often leads to low prediction accuracy.

UMAP visualization of the SMILES and ECFP hidden vectors from the training and test sets of the four downstream data sets. (A) UMAP plots of SMILES encoding vectors in the training and test sets. (B) UMAP plots of ECFP fingerprints in the training and test sets.

Combining the findings from the figures above, we deduce that the choice of molecular representation affects the overlap between the training and test sets. One representation may yield a comparatively high level of overlap between the training and test sets, whereas another yields little to none. This implies that the prediction model can make better use of the training data, and predict more accurately, when the training and test sets overlap more in the molecular representation space. Furthermore, because different molecular representations capture complementary aspects of chemical space, no single method can adequately describe the complex properties of molecules. By combining these complementary representations, the MCMPP model overcomes this limitation, building a more complete and representative picture of the chemical structure and ultimately improving the accuracy and reliability of molecular property prediction.
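
The overlap between training and test sets is assessed visually with UMAP in this work. As a complementary and purely illustrative proxy, which is an assumption of this sketch and not a method from the paper, one can compute for each test fingerprint its highest Tanimoto similarity to any training fingerprint. The fingerprints below are random bit vectors, not real ECFP data.

```python
import numpy as np

def tanimoto_matrix(a, b):
    # a: (n, bits), b: (m, bits) binary fingerprints stored as 0/1 arrays.
    a = a.astype(np.int64)
    b = b.astype(np.int64)
    inter = a @ b.T
    union = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :] - inter
    # Guard against empty fingerprints (union == 0).
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

def nearest_train_similarity(test_fp, train_fp):
    # For each test molecule, similarity to its closest training molecule.
    return tanimoto_matrix(test_fp, train_fp).max(axis=1)

rng = np.random.default_rng(0)
n_bits = 256
train = (rng.random((100, n_bits)) < 0.1).astype(np.uint8)

# "Near" test set: training fingerprints with 5 random bits flipped.
test_near = train[:20].copy()
for row in test_near:
    idx = rng.choice(n_bits, size=5, replace=False)
    row[idx] = 1 - row[idx]

# "Far" test set: independent random fingerprints.
test_far = (rng.random((20, n_bits)) < 0.1).astype(np.uint8)

near = nearest_train_similarity(test_near, train)
far = nearest_train_similarity(test_far, train)
print(round(float(near.mean()), 3), round(float(far.mean()), 3))
```

A test set whose nearest-neighbor similarities are low lies far from the training data in fingerprint space, which is the situation the text associates with reduced prediction accuracy.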

3.4. Generalization Proficiency Testing

To gauge its capacity for generalization, this work first examined the best performance of the proposed multimodal fusion model, MCMPP, on single-molecule data sets (Delaney, Lipophilicity, SAMPL, and BACE). We then extended the analysis to more complex scenarios involving interactions between proteins and small molecules. For this application we used the PDBbind v2020 data set, which comprises binding constants for 23,496 biomolecular complexes of several types, including protein–ligand, nucleic acid–ligand, protein–nucleic acid, and protein–protein complexes. A selection of 5,316 high-quality protein–small molecule complexes (PDBbind v2020 refined), each with a determined binding constant, was chosen for the experiment. Invalid molecular structures were removed during conversion to molecular representations with RDKit. The four molecular representations (SMILES, ECFP, molecular graphs, and 3D molecular conformations) were then generated from the PDB data, and 4,936 of these complexes were retained for further experimental analysis.

Table 4 displays the results. The cross-attention multimodal approach MCMPP_CAM performs best overall in predicting protein–ligand binding affinity. In particular, MCMPP_CAM obtained the highest Pearson correlation coefficient and the lowest mean absolute error (MAE) and root-mean-square error (RMSE) on the PDBbind v2020 refined data set. The five late fusion approaches examined in this study are generally better than the three single-modal baseline models (MCMPP_Transformer, MCMPP_BiLSTM, and MCMPP_GCN), but they still fall short of the strongest single-modal model, MCMPP_Unimol+. This result highlights the advantage of MCMPP_CAM in efficiently combining information from multiple sources, which lets it capture a more complete and reliable description of molecular complexes. As a result, this fusion method greatly improves the model’s capacity for generalization.

Table 4. RMSE, MAE, and Pearson of Different Methods on the PDBbind v2020 Refined Data Set

metric	RMSE	MAE	Pearson
MCMPP_Transformer	1.769	1.399	0.586
MCMPP_BiLSTM	1.364	1.057	0.732
MCMPP_GCN	1.988	1.579	0.636
MCMPP_Unimol+	1.236	0.918	0.814
MCMPP_LASSO	1.415	1.114	0.706
MCMPP_Elastic	1.277	0.978	0.728
MCMPP_RF	1.306	1.006	0.732
MCMPP_GB	1.297	0.998	0.751
MCMPP_SGD	1.295	0.996	0.762
MCMPP_CAM	1.156	0.864	0.820

The values of the best-performing model are indicated in bold.

To further evaluate the performance stability of the MCMPP framework on this benchmark data set, this research carried out 15 rounds of repeated tests with different random seeds. As the box plot of the repeated runs shows, the Pearson correlation coefficients of the single-modal models fluctuate notably. In contrast, the cross-attention-based multimodal property prediction model (MCMPP_CAM) is much more stable. Combined with the RMSE and MAE results in Table 4, this shows that MCMPP_CAM achieves high prediction accuracy and remarkable stability on this data set, improving the prediction of protein–ligand binding affinity. It should be noted that no data-set-specific changes were made to the chemical representations when the model was evaluated on the PDBbind data set. The main goal of this investigation is to verify the basic functionality of the proposed multimodal prediction model. The results indicate that the framework has strong potential to deliver reliable performance even without tuning for a particular data set.

Pearson coefficients for generalized proficiency testing. The left graph displays a box plot illustrating the repeatedly calculated Pearson correlation coefficients, while the right graph presents a histogram that records the minimum, median, and maximum values of the Pearson coefficients.

Furthermore, the data sets were split by scaffold to assess the generalization capacity of the MCMPP_CAM model under different splitting strategies. The model was evaluated on five data sets using both random and scaffold splits, as shown in Table 5. Although scaffold splitting generally gave somewhat worse results than random splitting, the model maintained consistent results across data sets, further demonstrating the generalization capacity of MCMPP_CAM.
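
A scaffold split keeps all molecules that share a scaffold on the same side of the split. The sketch below shows one common greedy variant under stated assumptions: the scaffold strings are taken as precomputed (for example, Bemis-Murcko scaffolds from a cheminformatics toolkit), the 80/20 ratio is hypothetical, and the text does not specify the exact procedure used in this work.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8):
    # Group molecule indices by their scaffold string.
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A whole group goes to one side, so no scaffold is shared.
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test

# Hypothetical precomputed scaffolds for ten molecules.
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccncc1",
             "C1CCCCC1", "c1ccccc1", "c1ccoc1", "C1CCCCC1", "c1ccsc1"]
train_idx, test_idx = scaffold_split(scaffolds, frac_train=0.8)
print(sorted(train_idx), sorted(test_idx))
```

Because the test set contains only scaffolds never seen in training, this split is usually harder than a random split, which matches the generally lower scores reported for it.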

Table 5. Performance of MCMPP_CAM on 5 Data Sets with Random Split and Scaffold Split

data set	split type	RMSE	MAE	Pearson
Delaney	random	0.566	0.403	0.971
	scaffold	0.647	0.495	0.956
lipophilicity	random	0.611	0.458	0.913
	scaffold	0.728	0.604	0.885
SAMPL	random	1.032	0.724	0.978
	scaffold	0.811	0.675	0.949
BACE	random	0.696	0.451	0.841
	scaffold	0.814	0.607	0.811
PDBbind v2020 refined	random	1.156	0.864	0.820
	scaffold	1.335	1.102	0.793

The values of the best-performing model are indicated in bold.

3.5. Comparison with the Baseline Model

In the field of molecular property prediction, representative state-of-the-art models were carefully chosen as baseline references in order to thoroughly and impartially assess the performance of the model suggested in this work. Molecular topological structures have been effectively modeled in recent years by high-precision multimodal models including ACML, MMFRL, MMFDL, KFLM2, MGFF, SGGLR, and DLF-MFF as well as single-modal models like D-MPNN, MolCLR, GEM, Uni-Mol and GraphMVP. These multimodal and single-modal methods show wide-ranging utility in the prediction of molecular properties.

To reduce the influence of random effects and to make both the assessment of generalization and the experimental outcomes more reliable, this research uses a repeated experimental design. Each model was trained and evaluated three separate times with different random seeds, and the final result was obtained by averaging the evaluation metrics over the runs. This statistical averaging reduces the randomness of individual experimental results and gives a more accurate picture of the model’s actual performance, providing a sound basis for the subsequent comparison of model performance.

Because multiple modalities provide richer and complementary information, MCMPP performs better than single-modal prediction models such as D-MPNN and ImageMol, as seen in Figure . However, the feature extraction module of the multimodal framework still has a drawback: despite the use of lightweight modules and parameters, the total number of parameters remains large, which leads to long training times. MCMPP performs somewhat better than other multimodal models such as MMFDL and KFLM2, which can be attributed to the four molecular representations and their matching feature extraction modules. Nevertheless, it performs somewhat worse than SGGLR on the SAMPL data set. Two factors may explain this: first, SGGLR benefits from a more efficient weight distribution of modal information in the late fusion stage; second, most multimodal models perform worse on this data set than on the others. Taken together, these findings suggest that while MCMPP performs well in general, it may be less competitive than certain late fusion models on particular data sets.

Table 6. RMSE of 10 Different Models on Different Data Sets

data set	Delaney	lipophilicity	SAMPL	BACE
MGFF	0.576	0.547	1.027
MMFDL	0.620	0.725	1.103	0.762
SGGLR	0.628		0.847
DLF-MFF	2.162	1.107	1.831
KFLM2	0.668	0.821	1.079
D-MPNN	1.050	0.683	2.082
MolCLR	1.271	0.691	2.594
GEM	0.798	0.660	1.877
Uni-Mol	0.788	0.603	1.620
GraphMVP	1.029	0.681
ACML	0.840		2.340
MMFRL	0.730	0.543	1.456
MCMPP	0.566	0.531	1.021	0.673

Lower values indicate better learning performance of the model.

The values of the best-performing model are indicated in bold.

MCMPP denotes the best performance of our MCMPP_CAM model.

As seen in Table 6, the multimodal cross-attention molecular property prediction model (MCMPP) proposed in this work is highly competitive with recently developed high-precision molecular property prediction models. Although it performs somewhat worse than the best existing model on the SAMPL data set, its overall prediction accuracy across the data sets exceeds that of the competing models.

4. Conclusions

This paper presents the multimodal cross-attention molecular property prediction (MCMPP) model for predicting the molecular properties of materials. The framework integrates four distinct molecular representations, each with its own encoder: a Transformer for SMILES sequence features, a BiLSTM for ECFP fingerprint features, a GCN for molecular graph features, and a reduced Unimol+ for three-dimensional molecular conformation information. A key innovation of the MCMPP model is the cross-attention mechanism, which integrates all four modalities for downstream prediction tasks and significantly improves prediction accuracy.

Extensive experimental results demonstrate that MCMPP outperforms traditional baseline models on the majority of challenging molecular property benchmarks, confirming the effectiveness of the fusion strategy. The results further show that MCMPP effectively combines features from several modalities, leading to more stable and accurate predictions. As a consequence, it is a useful technique for identifying important molecular components in chemistry, which is crucial for improving certain chemical engineering processes. The successful application of the multimodal integration method in chemical engineering highlights the framework’s usefulness and provides a reference for the development of future property prediction tools.

Future research will concentrate primarily on chemical space clustering. Cluster analysis will be applied to large molecular data sets, and the descriptor information obtained from clustering will be incorporated into the model framework. This will facilitate the optimization of chemical space grouping and feature selection, improving the model’s predictive power in complex chemical environments.

Supplementary Material

ao5c07964_si_001.pdf (576.7 KB, PDF)

Acknowledgments

This work was supported by the Jilin Provincial Science and Technology Innovation Center of Network Database Application Software (Grant No. YDZJ202302CXJD027) and Jilin Provincial Science and Technology Development Planned Project (Grant No. YDZJ202401621ZYTS). The authors thank the anonymous reviewers for their valuable comments.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c07964.

Experimental details and materials and the experimental environment (Tables S1–S5; Figures S1–S3) (PDF)

S.S. made the main contribution to the manuscript. S.S. and P.W. prepared the data set, developed the methods, built the model, and performed the computational analysis. S.L. contributed at every step: project management, funding acquisition, supervision, manuscript review and editing. All authors contributed to the writing of the first draft of the manuscript.

The authors declare no competing financial interest.

References

Li J., Luo D., Wen T., Liu Q., Mo Z.. Representative feature selection of molecular descriptors in QSAR modeling. J. Mol. Struct. 2021;1244:131249. doi: 10.1016/j.molstruc.2021.131249. [DOI] [Google Scholar]
Kwon S., Bae H., Jo J., Yoon S.. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinf. 2019;20:521. doi: 10.1186/s12859-019-3135-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J., Fu A., Zhang L.. Progress in molecular docking. Quantitative Biology. 2019;7:83–89. doi: 10.1007/s40484-019-0172-y. [DOI] [Google Scholar]
Salo-Ahen O. M., Alanko I., Bhadane R., Bonvin A. M., Honorato R. V., Hossain S., Juffer A. H., Kabedev A., Lahtela-Kakkonen M., Larsen A. S.. et al. Molecular dynamics simulations in drug discovery and pharmaceutical development. Processes. 2021;9:71. doi: 10.3390/pr9010071. [DOI] [Google Scholar]
Goodfellow, I. ; Bengio, Y. ; Courville, A. ; Bengio, Y. . Deep learning; MIT Press, 2016; Vol. 1. [Google Scholar]
LeCun Y., Bengio Y., Hinton G.. Deep learning. nature. 2015;521:436–444. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
Kearnes S., McCloskey K., Berndl M., Pande V., Riley P.. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design. 2016;30:595–608. doi: 10.1007/s10822-016-9938-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Albrecht T., Slabaugh G., Alonso E., Al-Arif S. M. R.. Deep learning for single-molecule science. Nanotechnology. 2017;28:423001. doi: 10.1088/1361-6528/aa8334. [DOI] [PubMed] [Google Scholar]
Ge M., Su F., Zhao Z., Su D.. Deep learning analysis on microscopic imaging in materials science. Materials Today Nano. 2020;11:100087. doi: 10.1016/j.mtnano.2020.100087. [DOI] [Google Scholar]
Agrawal, A. ; Gopalakrishnan, K. ; Choudhary, A. . Handbook on Big Data and Machine Learning in the Physical Sciences: Volume 1. Big Data Methods in Experimental Materials Discovery; World Scientific Publishing, 2020; pp 205–230. [Google Scholar]
Erdmann, M. ; Glombitza, J. ; Kasieczka, G. ; Klemradt, U. . Deep Learning for Physics Research; World Scientific Publishing, 2021. [Google Scholar]
Deng J., Yang Z., Wang H., Ojima I., Samaras D., Wang F.. A systematic study of key elements underlying molecular property prediction. Nat. Commun. 2023;14:6395. doi: 10.1038/s41467-023-41948-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wigh D. S., Goodman J. M., Lapkin A. A.. A review of molecular representation in the age of machine learning. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2022;12:e1603. doi: 10.1002/wcms.1603. [DOI] [Google Scholar]
Ma J., Sheridan R. P., Liaw A., Dahl G. E., Svetnik V. Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model. 2015;55:263–274. doi: 10.1021/ci500747n.
Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences. 1988;28:31–36. doi: 10.1021/ci00057a005.
Segler M. H., Kogej T., Tyrchan C., Waller M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science. 2018;4:120–131. doi: 10.1021/acscentsci.7b00512.
Xie L., Jin Y., Xu L., Chang S., Xu X. Fusing Domain Knowledge with a Fine-Tuned Large Language Model for Enhanced Molecular Property Prediction. J. Chem. Theory Comput. 2025;21:6743–6758. doi: 10.1021/acs.jctc.5c00605.
Lu X., Xie L., Xu L., Mao R., Chang S., Xu X. Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction. arXiv. 2023. doi: 10.48550/arXiv.2312.17495.
Li P., Li Y., Hsieh C.-Y., Zhang S., Liu X., Liu H., Song S., Yao X. TrimNet: learning molecular representation from triplet messages for biomedicine. Briefings Bioinf. 2021;22:bbaa266. doi: 10.1093/bib/bbaa266.
Gilmer J., Schoenholz S. S., Riley P. F., Vinyals O., Dahl G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017; pp 1263–1272.
Liyaqat T., Ahmad T., Saxena C. Advancements in Molecular Property Prediction: A Survey of Single and Multimodal Approaches. arXiv. 2024. doi: 10.48550/arXiv.2408.09461.
Lu S., Gao Z., He D., Zhang L., Ke G. Data-driven quantum chemical property prediction leveraging 3D conformations with Uni-Mol+. Nat. Commun. 2024;15:7104. doi: 10.1038/s41467-024-51321-w.
Zhou G., Gao Z., Ding Q., Zheng H., Xu H., Wei Z., Zhang L., Ke G. Uni-Mol: A universal 3D molecular representation learning framework. ChemRxiv. 2023. doi: 10.26434/chemrxiv-2022-jjm0j.
Liu Y., Wang L., Liu M., Zhang X., Oztekin B., Ji S. Spherical message passing for 3D graph networks. arXiv. 2021. doi: 10.48550/arXiv.2102.05013.
Zeng X., Xiang H., Yu L., Wang J., Li K., Nussinov R., Cheng F. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nature Machine Intelligence. 2022;4:1004–1016. doi: 10.1038/s42256-022-00557-6.
Zhou Z., Li Y., Hong P., Xu H. Multimodal fusion with relational learning for molecular property prediction. Commun. Chem. 2025;8:200. doi: 10.1038/s42004-025-01586-z.
Wang Y., Li Y., Liu L., Hong P., Xu H. Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal Learning. J. Chem. Inf. Model. 2025;65:6547–6557. doi: 10.1021/acs.jcim.5c00430.
RDKit Contributors. RDKit: Open-source cheminformatics, 2023.
Nan S., Li Z., Jin S., Du W., Shen W. Machine Learning-Based Multi-Modal and Multi-Granularity Feature Fusion Framework for Accurate Prediction of Molecular Properties. Ind. Eng. Chem. Res. 2025;64:3045–3056. doi: 10.1021/acs.iecr.4c03293.
Gomes J., Ramsundar B., Feinberg E. N., Pande V. S. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv. 2017. doi: 10.48550/arXiv.1703.10603.
Xie L., Xu L., Kong R., Chang S., Xu X. Improvement of prediction performance with conjoint molecular fingerprint in deep learning. Frontiers in Pharmacology. 2020;11:606668. doi: 10.3389/fphar.2020.606668.
Xie L., Xu L., Chang S., Xu X., Meng L. Multitask deep networks with grid featurization achieve improved scoring performance for protein–ligand binding. Chemical Biology & Drug Design. 2020;96:973–983. doi: 10.1111/cbdd.13648.
Wenzel J., Matter H., Schmidt F. Predictive multitask deep neural network models for ADME-Tox properties: learning from large data sets. J. Chem. Inf. Model. 2019;59:1253–1268. doi: 10.1021/acs.jcim.8b00785.
Ramsundar B., Liu B., Wu Z., Verras A., Tudor M., Sheridan R. P., Pande V. Is multitask deep learning practical for pharma? J. Chem. Inf. Model. 2017;57:2068–2076. doi: 10.1021/acs.jcim.7b00146.
Yin Z., Song W., Li B., Wang F., Xie L., Xu X. Neural networks prediction of the protein-ligand binding affinity with circular fingerprints. Technology and Health Care. 2023;31:487–495. doi: 10.3233/THC-236042.
Yuan W., Chen G., Chen C. Y.-C. FusionDTA: attention-based feature polymerizer and knowledge distillation for drug-target binding affinity prediction. Briefings Bioinf. 2022;23:bbab506. doi: 10.1093/bib/bbab506.
Tanoori B., Jahromi M. Z., Mansoori E. G. Drug-target continuous binding affinity prediction using multiple sources of information. Expert Systems with Applications. 2021;186:115810. doi: 10.1016/j.eswa.2021.115810.
Lu X., Xie L., Xu L., Mao R., Xu X., Chang S. Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph. Computational and Structural Biotechnology Journal. 2024;23:1666–1679. doi: 10.1016/j.csbj.2024.04.030.
Wang Z., Jiang T., Wang J., Xuan Q. Multi-modal representation learning for molecular property prediction: sequence, graph, geometry. arXiv. 2024. doi: 10.48550/arXiv.2401.03369.
Wei Y., Zhang Q., Liu L. Predicting transcription factor binding sites by a multi-modal representation learning method based on cross-attention network. Applied Soft Computing. 2024;166:112134. doi: 10.1016/j.asoc.2024.112134.
Liu P., Ren Y., Tao J., Ren Z. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in Biology and Medicine. 2024;171:108073. doi: 10.1016/j.compbiomed.2024.108073.
Ma M., Lei X. A deep learning framework for predicting molecular property based on multi-type features fusion. Computers in Biology and Medicine. 2024;169:107911. doi: 10.1016/j.compbiomed.2023.107911.
Wu Z., Ramsundar B., Feinberg E. N., Gomes J., Geniesse C., Pappu A. S., Leswing K., Pande V. MoleculeNet: a benchmark for molecular machine learning. Chemical Science. 2018;9:513–530. doi: 10.1039/C7SC02664A.
Li C., Wang J., Niu Z., Yao J., Zeng X. A spatial-temporal gated attention module for molecular property prediction based on molecular geometry. Briefings Bioinf. 2021;22:bbab078. doi: 10.1093/bib/bbab078.
Walters P. We need better benchmarks for machine learning in drug discovery. In Practical Cheminformatics, 2023.
Krenn M., Ai Q., Barthel S., Carson N., Frei A., Frey N. C., Friederich P., Gaudin T., Gayle A. A., Jablonka K. M., et al. SELFIES and the future of molecular string representations. Patterns. 2022;3:100588. doi: 10.1016/j.patter.2022.100588.
Bjerrum E., Rastemo T., Irwin R., Kannas C., Genheden S. PySMILESUtils–Enabling deep learning with the SMILES chemical language. ChemRxiv. 2021. doi: 10.26434/chemrxiv-2021-kzhbs.
Rogers D., Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010;50:742–754. doi: 10.1021/ci100050t.
Xia J., Zhu Y., Du Y., Li S. Z. A systematic survey of chemical pre-trained models. arXiv. 2022. doi: 10.48550/arXiv.2210.16484.
Huang H., Sun L., Du B., Lv W. Learning joint 2-D and 3-D graph diffusion models for complete molecule generation. In IEEE Transactions on Neural Networks and Learning Systems, 2024.
Guan J., Qian W. W., Ma W.-Y., Ma J., Peng J. Energy-inspired molecular conformation optimization. In International Conference on Learning Representations, 2021.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
Zou H., Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x.
Breiman L. Random forests. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324.
Friedman J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203451.
Ruder S. An overview of gradient descent optimization algorithms. arXiv. 2016. doi: 10.48550/arXiv.1609.04747.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017; Vol. 30.
Yan H., Deng B., Li X., Qiu X. TENER: adapting transformer encoder for named entity recognition. arXiv. 2019. doi: 10.48550/arXiv.1911.04474.
Hu X., Zhou X., Liu H., Song H., Wang S., Zhang H. Enhanced predictive modeling of hot rolling work roll wear using TCN-LSTM-Attention. International Journal of Advanced Manufacturing Technology. 2024;131:1335–1346. doi: 10.1007/s00170-024-13105-w.
Kilinc H. C., Apak S., Ozkan F., Ergin M. E., Yurtsever A. Multimodal Fusion of optimized GRU–LSTM with self-attention layer for Hydrological Time Series forecasting. Water Resources Management. 2024;38:6045–6062. doi: 10.1007/s11269-024-03943-4.
Zhu Y.-H., Liu Z., Liu Y., Ji Z., Yu D.-J. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein–DNA binding site prediction. Briefings Bioinf. 2024;25:bbae040. doi: 10.1093/bib/bbae040.
Nguyen T., Le H., Quinn T. P., Nguyen T., Le T. D., Venkatesh S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37:1140–1147. doi: 10.1093/bioinformatics/btaa921.
Gasteiger J., Becker F., Günnemann S. GemNet: Universal directional graph neural networks for molecules. Adv. Neural Inf. Process. Syst. 2021;34:6790–6802.
Wei X., Zhang T., Li Y., Zhang Y., Wu F. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp 10941–10950.
Zhang W., Yin Z., Sheng Z., Li Y., Ouyang W., Li X., Tao Y., Yang Z., Cui B. Graph attention multi-layer perceptron. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022; pp 4560–4570.
Wang Y., Wang J., Cao Z., Barati Farimani A. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence. 2022;4:279–287. doi: 10.1038/s42256-022-00447-x.
Fang X., Liu L., Lei J., He D., Zhang S., Zhou J., Wang F., Wu H., Wang H. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence. 2022;4:127–134. doi: 10.1038/s42256-021-00438-4.
Liu S., Wang H., Liu W., Lasenby J., Guo H., Tang J. Pre-training molecular graph representation with 3D geometry. arXiv. 2021. doi: 10.48550/arXiv.2110.07728.

Associated Data

Supplementary Materials

ao5c07964_si_001.pdf (576.7KB, pdf)
