Abstract

A molecule is a complex of heterogeneous components, and the spatial arrangements of these components determine the overall molecular properties and characteristics. With the advent of deep learning in computational chemistry, several studies have focused on how to predict molecular properties based on molecular configurations. A message-passing neural network provides an effective framework for capturing molecular geometric features by treating a molecule as a graph. However, most of these studies assumed that all heterogeneous molecular features, such as atomic charge, bond length, or other geometric features, always contribute equivalently to the target prediction, regardless of the task type. In this study, we propose a dual-branched neural network for molecular property prediction based on both the message-passing framework and standard multilayer perceptron neural networks. Our model learns heterogeneous molecular features with different scales, which are trained flexibly according to each prediction target. In addition, we introduce a discrete branch to learn single-atom features without local aggregation, apart from message-passing steps. We verify that this novel structure can improve the model performance. The proposed model outperforms other recent models with sparser representations. Our experimental results indicate that, in the chemical property prediction tasks, the diverse chemical nature of targets should be carefully considered for both model performance and generalizability. Finally, we provide an intuitive analysis of the relationship between the experimental results and the chemical meaning of the target.
Introduction
To design de novo compounds, all possible sets should be explored throughout a combinatorial space called chemical compound space (CCS). Quantum mechanics (QM) functions as a guide that narrows down this search space based on the first principles of molecular dynamics. Density functional theory (DFT) is the standard method for analyzing a molecular electronic structure from its numerical wave functions in QM to predict molecular behaviors.
In the past decade, machine learning (ML)-based methods have been developed and have played a key role in various QM tasks. For example, support vector machine (SVM),1 Gaussian processes,2 kernel ridge regression models,3,4 and other ML-based studies5 have been developed and applied to predict molecular properties or model molecular dynamics. Recently, with the advent of large-scale quantum-mechanical databases, deep learning (DL)-based models have become crucial in quantum machine learning (QML) studies.6 Deep neural network (DNN)-based methods have made it possible to model hidden complex relationships between molecular structures and their properties.
Early DNN-based QM studies explored various tasks such as DFT prediction,7 a simple molecule dynamics simulation,8 atomization energy prediction,9 or learning electronic properties.10 These models considered a single atom as a neuron of an input layer, with the frame of a one-body representation of a given system. Given a molecular geometry, a single atom can be represented with its atomic number and position, and multiple atoms can also be represented in the same way with their interactions, using their mutual distances or angles. However, standard MLP-based models cannot handle various sizes of chemicals.2,11 This limitation triggers a harmful effect on the model generalizability.
Graph neural networks (GNNs) can address these limitations by describing a molecule as a graph. GNNs can train both node and edge representations, which are naturally equivalent to atoms and atom–atom relationships present in a molecule. This graph-based molecular representation has become widely adopted in solving various molecule learning tasks. Several GNN-based models12−16 adopted a localized convolution kernel to learn each component of a molecule. These models are called graph convolutional networks (GCNs). The spatial embedding-based GCN model1 trains each local attribute iteratively and then aggregates these features to represent molecular properties.
The message-passing neural network (MPNN) framework19 has taken these approaches one step further. On the basis of a graph convolution, MPNNs learn node features in two types of phases: message-passing and update phases. In the message-passing phase, the hidden representations of nodes, neighboring nodes, and their edges are aggregated to represent the locality of a given node, called a message. Subsequently, the hidden features of these nodes are updated with the message and other features in the following update phase. The final step in a network is called a readout phase; all of the features are combined to yield the predictions. In the MPNN framework, both the message aggregation and the readout phase use a sum operator to gather local attributes and all atomic attributes, respectively. All MPNN models can be considered as a type of the spatial GCNs because in these models each atom feature is updated with its localized convolutional filters.
Since the late 2010s, several message-passing-based DNN models have been developed, such as SchNet,12,20 GDML,13 PhysNet,15 MegNet,21 and DimeNet.22 SchNet and the subsequent model architectures suggested a block structure, which consists of a set of specific message-passing functions. They also adopted a continuous-filter convolution to learn intramolecular interactions that take the form of a continuous distribution. These models used atom types and atom–atom distances as inputs of the model. Some studies used more sophisticated functions to represent angles in a rotational equivariant approach.23,24 Many of the MPNN-based models exhibited competitive performances in various QM research areas, even in the molecular property prediction on a nonequilibrium state.
However, most of the previous works did not consider the natural heterogeneities among chemical properties. Although subtle, the target properties originate from different chemical natures. For example, among the 12 targets from QM9, nine describe the types of given molecular energy, while the other three do not. A conventional DL optimization can select parameters that describe all of the targets well. However, it cannot determine which feature type is more important for predicting each target. In this case, the optimal parameters in the given model architecture may vary according to target types. Almost all previous models have no choice but to optimize parameters under the assumption that all types of features, such as node features or edge features, contribute equally to predicting all targets. This may limit the model generalizability and transferability because it ignores the diversity of chemical factors determining various properties.
In addition, GNNs have difficulty differentiating nodes, especially with deep architectures; this is known as the oversmoothing problem.17,25 This problem emerges because iterative aggregations of neighboring features are equivalent to a repeated mixing of local features, eventually making all node representations too similar and reducing the model performance. Therefore, stacking more layers in a network cannot be the solution in the GNN (MPNN) case without careful consideration.
To overcome all these limitations, we propose a novel GNN for molecular property prediction, namely, a dual-branched Legendre message-passing neural network (DL-MPNN) with simple and powerful scalability. We used two types of features: an atom type and a position in three-dimensional space. We calculated the distance between two atoms and the angle between three atoms from atomic coordinates. In detail, we applied modified Legendre rational functions as radial basis functions of atom–atom distances. For a representation of angles between atoms, orthogonal Legendre polynomials were used. Finally, we applied trainable scalars to each type of feature for model flexibility in multitask QM learning. Our approach exhibited superior performance to other recent models in quantum-mechanical benchmark tasks. In addition, we validate the effectiveness of the proposed method via ablation studies.
To the best of our knowledge, this study is the first attempt to automatically explore the relative importance of features for each target type without increasing the computational cost. Introducing trainable scalars has several advantages over using an additional layer. First, the overall computational cost is not significantly increased, because it only adds scalar multiplications. Second, it makes the model flexible for heterogeneous types of features in multitarget tasks.
In addition, we introduce a novel MLP-based block called a single-body block. These blocks are stacked to form a discrete branch in the model and do not communicate with the message-passing branch, as they train only atom embeddings. The outcome of this MLP-based pathway is aggregated with that of the MPNN at the final stage. Accordingly, we obtain an atom feature that is not overmixed by incoming messages during the training.
In summary, our contributions are as follows: (1) We propose a novel dual-branched network for molecular property prediction, which comprises a message-passing for locally aggregated features and a fully connected pathway for discrete atoms. (2) We introduce trainable scalar-valued parameters to enable the model to enhance more important feature signals according to each label. (3) Our experimental results are better than those of most previous works in the public benchmarks with sparser representations. In addition, we verify that various quantum-chemical factors contribute differently according to target molecular properties.
Experimental Section
Our neural network is dual-branched, comprising two different types of networks. One is the MPNN, a type of spatial-based GCNs, and the other is the MLP-based network. The details are described in the following sections.
Preliminaries
We briefly introduce the notations and the data used in this study. We describe a molecule as a set of points of different types, V = {Z, P}, where each point is vi = {zi, pi}. Z = {zi} is a set of scalar-valued charges for each atom type. P = {pi}, where pi = (pix, piy, piz), is a set of xyz coordinates of atom points in the Euclidean space. We denote vj,j∈N(i) as a set of neighboring atoms of vi.
Data Preprocessing
We used two types of information on three neighboring atoms: the atom types zi and the coordinates pi of each atom in the molecules. We calculated distances dij and angles αijk from the atom coordinates pi, pj, and pk. Because coordinates describe the molecular structure only in relation to one another, the position of a single atom is meaningless on its own. We initialized random trainable atom type embeddings X from Z and trained them. All dij and αijk values from the molecules were calculated before the model training. More details are described in Figure 1.
Figure 1.

Data preprocessing. An atom type Z and atomic coordinate p were used in the model. We created trainable embeddings X from Z and calculated distances dij and angles αijk from the coordinates pi, pj, and pk.
Input Embedding: Distance and Angle Representations
Distance Representation: Radial Basis Function
We used radial basis functions to expand a scalar-valued edge distance to a vector. In particular, we adopted sequences of Legendre rational polynomials, one of the continuous and bounded orthogonal polynomials. The Legendre rational function Rn(x) and Legendre rational polynomial Pn(x) are defined as
| 1 |
| 2 |
where n denotes the degree. The nth-order Legendre rational polynomials are generated recursively as follows.
| 3 |
We selected the first n polynomials R1(x), R2(x), ..., Rn(x), such that a scalar-valued distance dij is embedded as an n-dimensional vector dij. The functions of degree 1, 2, ..., 12 are plotted on the top right side of Figure 2. We modified the equations to set the maximum value to 1.0, so that the distribution of the bases is similar to that of the initial atom embedding distribution.
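For orientation, the standard textbook form of the Legendre rational functions and their three-term recursion is sketched below. This is an assumption about the starting point of the definitions above, not the modified (rescaled) form used in this study, which may differ. The symbol \mathcal{L}_n is introduced here only for illustration and denotes the ordinary Legendre polynomial of degree n.

```latex
% Standard (unmodified) Legendre rational functions on x >= 0.
% \mathcal{L}_n is illustrative notation for the ordinary Legendre polynomial.
R_n(x) = \frac{\sqrt{2}}{x+1}\,\mathcal{L}_n\!\left(\frac{x-1}{x+1}\right),
\qquad x \in [0, \infty)

% Three-term recursion inherited from the Legendre polynomials (n >= 1):
R_{n+1}(x) = \frac{2n+1}{n+1}\,\frac{x-1}{x+1}\,R_n(x) - \frac{n}{n+1}\,R_{n-1}(x),
\qquad
R_0(x) = \frac{\sqrt{2}}{x+1},\quad
R_1(x) = \frac{\sqrt{2}}{x+1}\,\frac{x-1}{x+1}
```

In this standard form, |R_n(x)| is bounded by \sqrt{2}, with the bound attained at x = 0, so dividing by \sqrt{2} is one simple way to obtain a maximum value of 1.0.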
Figure 2.
Basis functions for data preprocessing. For any pair of two atoms located closer than a given cutoff, we created an edge representation between two atoms regardless of molecular bond information. A scalar-valued distance is expanded as an n-dimensional vector by radial basis functions. For radial basis functions, we used Legendre rational polynomials. Angle representations are depicted by any two edges sharing one atom. A cosine-valued angle is represented as the degree = 1, 2, ..., mth Legendre polynomials (top right).
Recall that we did not use any molecular bond information. Following previous studies,12,15,20,22−24,27 we set a single scalar cutoff c and assumed that any atom pair vi and vj located closer than the cutoff can interact with each other, and vice versa, before training. In other words, we build edge representations between any two atoms close to each other. The cutoff value is analogous to the kernel height or width in the conventional convolutional layer because it determines the size of the receptive field of the local feature map.
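The cutoff-based edge construction and the radial expansion can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names `build_edges` and `legendre_rational_basis`, the rescaling (dropping the \sqrt{2} prefactor of the standard Legendre rational functions so that the maximum absolute value is 1.0), and the cutoff passed by the caller are all hypothetical choices.

```python
import numpy as np
from scipy.special import eval_legendre


def build_edges(positions, cutoff):
    """Return directed pairs (i, j), i != j, closer than `cutoff`, and their distances."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs inside the cutoff and drop self-pairs on the diagonal.
    mask = (dist < cutoff) & ~np.eye(len(positions), dtype=bool)
    i, j = np.nonzero(mask)
    return i, j, dist[i, j]


def legendre_rational_basis(d, n=12):
    """Expand distances d (shape [E]) into n radial features (shape [E, n])."""
    d = np.asarray(d, dtype=float)
    y = (d - 1.0) / (d + 1.0)          # maps [0, inf) onto [-1, 1)
    degrees = np.arange(1, n + 1)      # degrees 1..n
    # Assumed rescaling: the sqrt(2) prefactor is omitted so that |value| <= 1.
    return eval_legendre(degrees[None, :], y[:, None]) / (d[:, None] + 1.0)
```

No bond table is consulted: an edge exists purely because two atoms are closer than the cutoff, which is what makes the cutoff play the role of a receptive-field size.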
Angle Representation: Cosine Basis Function
Similarly, we adopted another sequence of orthogonal polynomials to represent a scalar-valued angle αijk as vectors: Legendre polynomials of the first kind. Legendre polynomials of the first kind are the orthogonal polynomials that exhibit several useful properties. They are the solutions to the Legendre differential equation.37 They are commonly used to describe various physical systems and dynamics. Their formula is expressed as
| 4 |
Polynomials of different degrees are orthogonal to each other, such that
| 5 |
We calculated angles αijk between any pair of edges sharing one node. We selected the first m polynomials Q1(x), Q2(x), ..., Qm(x), such that a scalar-valued angle αijk is embedded as an m-dimensional vector αijk. The scheme is presented on the left side of Figure 2. We calculated the cosine value of each angle and embedded it with Legendre polynomials.
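The angle embedding described above can be sketched with SciPy's `eval_legendre`. This is an illustrative sketch only: the helper names `cos_angle` and `legendre_angle_basis` are hypothetical, and placing the shared atom at index j is an assumption about the indexing convention.

```python
import numpy as np
from scipy.special import eval_legendre


def cos_angle(p_i, p_j, p_k):
    """Cosine of the angle at the shared atom j between edges j->i and j->k."""
    a = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    b = np.asarray(p_k, dtype=float) - np.asarray(p_j, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def legendre_angle_basis(cos_values, m=12):
    """Embed cosine values (shape [A]) as m Legendre features (shape [A, m])."""
    # Clip to guard against round-off pushing a cosine slightly outside [-1, 1].
    c = np.clip(np.asarray(cos_values, dtype=float), -1.0, 1.0)
    degrees = np.arange(1, m + 1)      # degrees 1..m
    return eval_legendre(degrees[None, :], c[:, None])
```

Because the Legendre polynomials are bounded by 1 in absolute value on [−1, 1], every feature produced this way stays in that range without further normalization.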
Comparison of Two Different Types of Legendre Polynomials
We briefly compare the two different Legendre polynomials. First, the sequences of Legendre rational functions {Pk(x)}k=1,···,n are adopted as the encoding method for distances, which have different distributions over the distances. The standard deviations of the distributions of the distance encodings are higher at shorter distance values. Therefore, they can approximate the interatomic potentials, such that the atom pair exhibits stronger interatomic relationships if they are located closely with each other. In other words, the sequences of the Legendre rational functions {Pk(x)} can make shorter distance values exhibit richer representations.
Next, all Legendre polynomials of the first kind {Qk(x)}k=1,···,m are defined within the range of [−1, 1]. These functions are symmetric (even functions) or antisymmetric (odd functions) at the zero point. Because the cosine function is an even function type and ranges from −1 to 1, {Qk(x)} can cover the cosine angle of 0–2π and is symmetric at the π value. In the three-dimensional Euclidean spaces, the angle of θ (0 ≤ θ ≤ π) between two nonparallel vectors on a plane is equivalent to its supplementary angle of (π – θ). Therefore, Legendre polynomials of the first kind {Qk(x)} can represent the cosine of angles properly.
Model Architecture
The overall model architecture is described in Figure 3. There are two discrete pathways in the network, which include the MLP-based pathway with single body blocks and MPNN-based pathway comprising output blocks and interaction blocks. First, an atom type Z is embedded by the embedding block. The distances are calculated from atomic coordinates p, and if a distance dij is less than the predefined cutoff value, then dij is represented as an edge attribute of the molecule graph. The scalar dij is expanded to a vector dij in the radial basis function {Pk(x)}k=1,···,n. The angles αijk are calculated from the edges. Then αijk is encoded to an m-dimensional vector before training via Legendre polynomials of the first kind {Qk(x)}k=1,···,m. In this study, we set m = 12.
Figure 3.

Model architecture. The network comprises two separate branches, MLP-based (red arrow) and MPNN-based (blue arrow) pathways. Atom types Z are embedded as trainable matrices. Distances dij and angles αijk are calculated from atomic coordinates. We calculated the distance embeddings from dij and the angle embeddings from αijk. All blocks except the embedding block are stacked multiple times (not sharing weights). In this model, we stacked six, seven, and six blocks for the single body blocks, output blocks, and interaction blocks, respectively. For simplicity, skip connections are not shown.
In MPNN, the message passing and readout steps of our model can be summarized as
| 6 |
| 7 |
where m, h, e, and α denote the message of atom vi and its neighbors, single-atom representation of atom vi, interaction between two neighboring atoms (vi, vj), and angle information between three atoms (vi, vj, vk), respectively. t is the time step or layer order in the model, and ŷ represents the predicted target value. Both fmt and fu represent the graph convolution, and fr is the sum operation.
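As a reading aid, a generic message-passing template consistent with the symbols described above can be written as follows. This is a hedged sketch in the spirit of the MPNN framework,19 extended with an angle argument; the exact arguments and grouping used in this study's equations may differ.

```latex
% Message phase: sum over neighbors j of atom i; the angle term collects
% the triplets (i, j, k) that share the edge (i, j).
m_i^{t+1} = \sum_{j \in N(i)} f_m^{t}\!\left(h_i^{t},\, h_j^{t},\, e_{ij},\, \{\alpha_{ijk}\}_{k}\right)

% Update phase: combine the previous atom state with the incoming message.
h_i^{t+1} = f_u\!\left(h_i^{t},\, m_i^{t+1}\right)

% Readout phase: f_r is a sum over all atoms after the final step T.
\hat{y} = f_r\!\left(\{h_i^{T}\}\right) = \sum_{i} h_i^{T}
```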
In the proposed model, there are four types of blocks: the embedding block, output block, single body block, and interaction block. All blocks except the embedding block are sequentially stacked multiple times. Detailed explanations are presented in the next section.
Block Details
We adopted four types of blocks in the model: the embedding block E, the output blocks Ot, the single body blocks St, and the interaction blocks It, where t = 0,..., (T – 1) is the time step of the multiple sequential blocks. Atom type embeddings Z enter the embedding block E at the first step, and the embedded X moves to the other three blocks at time step t = 0. Distance embeddings dij are used in the blocks E, Ot, and It, but not in the single body blocks St. Angle embeddings αijk are solely used in It. All blocks are sequentially stacked with time steps t = 0,..., (T – 1), including some skip connections. The number of time steps T may differ among block types. Detailed structures are shown in Figure 4.
Figure 4.
Block structures. (a) Embedding block (top left). (b) Interaction block (top right). (c) Single body block (bottom left). (d) Output block (bottom right). We denote the dense layer with the activation as σ(Wx + b). We also denote the trainable scalars as γ. All blue boxed items, including dense layers and features, are γ-trainable. Subscripts are omitted for simplicity. (e) Weighted aggregation formula of two final outputs from the single block and output block pathways. cS and cP denote the trainable scalar-valued coefficients.
Embedding Block
In the embedding block E, the categorical atom types Z are represented as float-valued trainable features. Zi, Zj∈N(i), and dij from each atom pair are used to make atom embeddings Xi.
Output Block
In the output block Ot, the output of It is trained with Rn, except the first block, which takes an embedded input X from E. ⊕ and ⊗ denote the direct sum and direct product, respectively. All blue-boxed objects are γ-trainable. These blocks train the two-body representation.
Single Body Block
Message-passing functions mix all representations of a molecule. We assumed that this mixing is not always beneficial for predicting molecular properties. Therefore, we introduced a path separated from the message-passing pathway. In the single body block St, the output X from E enters and is trained through multiple MLP layers. The last layer is γ-trainable. Neither edge nor angle representations are used. These blocks handle per-atom representations.
Interaction Block
In the interaction block It, each output from the Ot and Rn, Qm are applied. The blue boxes and the operators are also used in these blocks. In addition, these blocks train the three-body interaction.
Trainable Scalars: γ
In molecular property prediction tasks, various factors determine multiple properties with different contribution weights for each target. To tackle this issue, we avoided adding extra layers or attention mechanisms, which could aggravate the oversmoothing problem and make the model cost-intensive.
Instead, we propose a simpler solution: we introduced trainable scalars γ to each layer. This makes the contributions of heterogeneous factors flexible, according to each target. In other words, the importance of heterogeneous features such as atom type and atom–atom distances can vary across targets. Therefore, our model can focus on the more important features of each target. Mathematical formulas are described below.
| 8 |
We introduced γ to some of the layers in the output blocks, single body blocks, and the interaction blocks. All γ values are trained independently of each other (subscripts were omitted for simplicity). We initialized γ values from the exponential function of a random normal distribution (μ = 0.0, σ = 0.1), namely, γ = exp(ε) with ε ~ N(μ = 0.0, σ = 0.1). This was because we observed that the slightly right-skewed distribution performed better than the normal distribution with μ = 1.0, without skewness. This distribution is presented in Figure 5.
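The initialization of γ can be sketched in a few lines of NumPy. The sketch below is illustrative only: `init_gamma` and `gamma_scaled_dense` are hypothetical names, and applying γ as a multiplier on the activated dense output, γ·σ(Wx + b), with tanh as a placeholder activation, is an assumption about how the scalar enters a layer.

```python
import numpy as np


def init_gamma(shape, sigma=0.1, seed=0):
    """Draw gamma = exp(eps), eps ~ N(0, sigma): positive and slightly right-skewed."""
    rng = np.random.default_rng(seed)
    return np.exp(rng.normal(loc=0.0, scale=sigma, size=shape))


def gamma_scaled_dense(x, W, b, gamma, activation=np.tanh):
    """Assumed form of a gamma-trainable layer: gamma * activation(x @ W + b)."""
    return gamma * activation(x @ W + b)
```

Because γ is a single scalar per layer, the extra cost is one multiplication per output element, and exp(·) guarantees that every initial value is positive with a median of 1.0.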
Figure 5.

Gaussian distribution and the initialization of the γ distribution. Both distributions attain their maximum values at x = 1.0
Related Works
Starting from DTNN,26 several GNNs adopt the perspective of a molecule as a graph. Under this perspective, a molecular system is a combination of atoms and many-body interatomic interactions that correspond to graph nodes and edges. MPNN19 introduced the message concept, which is an aggregation of attaching edges and corresponding neighbor nodes. SchNet12 is a multiblock network based on the message-passing framework. The gradients flow atomwise, and the convolution process of atom–atom interaction features is in the interaction block. Subsequently, PhysNet,15 MegNet,21 and MGCN27 extended the previous works and improved the performances based on the message-passing multiblock framework.
More recent works introduced angle information to describe the geometry of a molecule. With angle information, we can inspect up to three-body interactions. Spherical harmonics were used to represent an original angle to be rotationally equivariant.28,29 Several studies14,23,30,31 introduced a Clebsch-Gordan decomposition to represent an angle as a linear combination of irreducible representations. Theoretically, it can be expanded to arbitrary n-body networks,14,23,31 but most of the networks do not explore more than three-body interactions because of limited computational resources. In summary, most of the recent GNNs on molecular learning used an atom type (one-body), a distance between an atom pair (two-body), and an angle between three atoms (three-body). As mentioned above, we also used the same input features in accordance with the previous GNNs, for a fair evaluation.
Training and Evaluation
Data Set
We used QM938 and molecular dynamics (MD) simulation data sets13,39 for the experiment. QM9 is the most popular quantum-mechanics database, created in 2014; it comprises 134 000 small molecules made up of five elements: carbon, oxygen, hydrogen, nitrogen, and fluorine. Each molecule has 12 numerical properties, such that the benchmark consists of 12 regression tasks.
The MD simulation data set is created for the energy prediction task from molecular conformational geometries. The simulation data are given as trajectories. The subtasks are divided according to each molecule type. The energy values are given as a scalar (kcal/mol), and forces are given as three-dimensional vectors (in xyz format) of each atom. The energies can be predicted solely on the molecular geometry or by using the additional forces. We used the most recent subdata sets39 of which the properties were created from a CCSD(T) calculation, which is a more reliable method than the conventional DFT method.
Implementation
The TensorFlow package (version 2.2) was used to implement the proposed framework. To calculate Legendre polynomials, we utilized the SciPy package (version 1.5.2).
For the QM9 data set, we trained the model for at least 300 epochs for each target. We terminated the training when the validation mean absolute error (MAE) did not decrease for 30 epochs. Therefore, the overall number of training epochs differs slightly across labels. We randomly split the data set into training, validation, and test sets comprising 80%, 10%, and 10% of the whole data set, respectively, in accordance with the guideline.40 The initial learning rate and decay rate were set to 10^−3 and 0.96 per 20 epochs, respectively. Adam41 was used as the optimizer, and the batch size was set to 24.
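The learning-rate schedule and stopping rule described above can be sketched in plain Python. This is a sketch under assumptions rather than the training code of this study: the function names are hypothetical, the decay is taken to be continuous (a staircase schedule that changes only every 20 epochs is equally compatible with the description), and the two stopping conditions are combined as "at least 300 epochs have elapsed and the last 30 epochs brought no improvement".

```python
def learning_rate(epoch, initial_lr=1e-3, decay_rate=0.96, decay_epochs=20):
    """Exponential decay: multiply by `decay_rate` once per `decay_epochs` epochs."""
    return initial_lr * decay_rate ** (epoch / decay_epochs)


def should_stop(val_maes, min_epochs=300, patience=30):
    """Stop once `min_epochs` have run and the last `patience` epochs did not improve."""
    n = len(val_maes)
    if n < min_epochs or n <= patience:
        return False
    best_before = min(val_maes[:-patience])   # best MAE before the patience window
    return min(val_maes[-patience:]) >= best_before
```

With these values, the learning rate after 300 epochs is 10^−3 × 0.96^15, roughly half of its initial value.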
For the MD simulation data set, most of the training configurations were the same as those of the QM9 experiment. However, we modified the loss function because the model was trained using both energies and forces, which differs from QM9. We obtained the predicted forces from the gradients of the predicted energy with respect to the atom positions, following previous works.15,22 We also followed the original guideline39 for data splitting. Therefore, we used 1000 samples for training in all subtasks; for testing, we used 1000 samples for ethanol and 500 samples for each of the other molecules. We set the decay rate to 0.9, which makes the learning rate decrease faster than in the QM9 training.
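For illustration, one common way to combine energies and forces in a single objective is sketched below. It is an assumption modeled on the form used in related work,22 not necessarily the exact loss of this study; the weighting factor ρ and the hat notation for reference values are introduced here only for the sketch.

```latex
% Predicted force on atom i: negative gradient of the predicted energy E_\theta
% with respect to the atom position p_i.
F_i = -\frac{\partial E_\theta}{\partial p_i}

% Combined loss for a molecule with N atoms; \hat{E} and \hat{F} are reference
% values and \rho is a hypothetical force-weighting factor.
\mathcal{L} = \left| E_\theta - \hat{E} \right|
  + \frac{\rho}{3N} \sum_{i=1}^{N} \sum_{a \in \{x, y, z\}}
    \left| F_{i a} - \hat{F}_{i a} \right|
```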
Evaluation
First, we evaluated the model performance with MAE, the standard metric of QM9.40 We compared the performance of our model with the models of previous studies: SchNet,12 Cormorant,23 LieConv,42 DimeNet,22 and DimeNet++.32 We also analyzed our proposed methods in terms of the effect of γ values and of the single body block for each target. To assess the latter, we evaluated the performance of the model without single body blocks (not dual-branched architecture).
In particular, we observed that the single body block contributed to the model performance differently with each target type. We inferred that these differences can be explained by the different nature of the properties. We discussed the relationship between some of the results and the chemical interpretations in the next section.
We compared the change in the ratio of the γ values from the single body block to those from the output blocks over epochs throughout the training. We extracted the γ values and calculated the ratio of ∑(γ values from the single block)/∑(γ values from the output block) every 30 epochs, until 200 epochs. After this point, the ratios changed negligibly, so we did not depict the ratios after that time point. Note that we did not compare the γ values directly among different layers or targets. The magnitudes of these values are determined by several complex sources: the original feature scale, layer weights, layer biases, activation functions, and uncontrollable random noises. Therefore, the γ values themselves cannot be evaluated directly.
Results and Discussion
Model Performance
The Comparative Results
We compared the performance of our model with that of other MPNN-based models that also used atom types and locations as inputs.12,21−23,32 The MAEs for QM9 and MD simulations are described in Table 1 and Table 2, respectively. For QM9, our model matched or outperformed the previous models in six of the 12 targets. For the MD simulation, our model exhibited the best performance among other models in four of the five targets.
Table 1. Mean Absolute Error on QM9 Compared with Previous Works
| target | unit | SchNet | Cormorant | LieConv | DimeNet | DimeNet++ | DL-MPNN | MP only |
|---|---|---|---|---|---|---|---|---|
| μ | D | 0.033 | 0.038 | 0.032 | 0.0286 | 0.0297 | 0.0256 | 0.0238 |
| α | bohr^3 | 0.235 | 0.085 | 0.084 | 0.0469 | 0.0435 | 0.0444 | 0.0457 |
| ϵHOMO | eV | 0.041 | 0.034 | 0.030 | 0.0278 | 0.0246 | 0.0223 | 0.0238 |
| ϵLUMO | eV | 0.034 | 0.038 | 0.025 | 0.0197 | 0.0195 | 0.0169 | 0.0163 |
| Δϵ | eV | 0.063 | 0.061 | 0.049 | 0.0348 | 0.0326 | 0.0391 | 0.0403 |
| ⟨R2⟩ | bohr^2 | 0.073 | 0.961 | 0.800 | 0.331 | 0.331 | 0.414 | 0.385 |
| zpve | meV | 1.7 | 2.027 | 2.280 | 1.29 | 1.2 | 1.2 | 1.2 |
| U0 | eV | 0.014 | 0.022 | 0.019 | 0.00802 | 0.0063 | 0.0074 | 0.0084 |
| U | eV | 0.019 | 0.021 | 0.019 | 0.00789 | 0.0063 | 0.0074 | 0.0085 |
| H | eV | 0.014 | 0.021 | 0.024 | 0.00811 | 0.0065 | 0.0076 | 0.0092 |
| G | eV | 0.014 | 0.020 | 0.022 | 0.00898 | 0.0076 | 0.0076 | 0.0083 |
| Cv | cal/(mol K) | 0.033 | 0.026 | 0.038 | 0.0249 | 0.0249 | 0.0234 | 0.0235 |
DL-MPNN and MP only denote our dual-branched model and the model without MLP-pathway, respectively.
Table 2. Mean Absolute Error on MD Simulation Compared with Previous Works.
| target | train method | sGDML | SchNet | DimeNet | DL-MPNN |
|---|---|---|---|---|---|
| aspirin | forces | 0.68 | 1.35 | 0.499 | 0.590 |
| benzene | forces | 0.06 | 0.31 | 0.187 | 0.053 |
| ethanol | forces | 0.33 | 0.39 | 0.230 | 0.10 |
| malonaldehyde | forces | 0.41 | 0.66 | 0.383 | 0.225 |
| toluene | forces | 0.14 | 0.57 | 0.216 | 0.200 |
In general, our model exhibited better performances on the targets in QM9, which are more related to molecular interactions such as dipole moment (μ), molecular orbitals (ϵHOMO and ϵLUMO), Gibbs free energy (G), and others. These properties are relevant in molecular reactions to external effects. However, predictions for other targets such as the electronic spatial extent (⟨R2⟩) and internal energies (U0, U, H) were not superior to those of other models.
We found that this may be related to the distribution of each target value. We observed that, when the mean of the target value is closer to zero and the standard deviation of the target value is smaller, the prediction performances are better. For example, the 95% confidence intervals (CIs) of the distribution of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) are −6.5 ± 1.2 and 0.3 ± 1.3, respectively. In the case of the internal energies U, U0, and H, the 95% CIs are −76.5 ± 10.4, −76.1 ± 10.3, and −77.0 ± 10.5, respectively. This issue, triggered by the diverse target distributions of QM9, has already been reported in the literature.24
The Effect of Basis Functions
We briefly discuss the effect of Legendre polynomials as radial basis functions in our model. The results are shown in Table 3. We compared our radial basis functions with the Bessel basis functions used in DimeNet22 and DimeNet++32 on QM9. We analyzed six of the 12 targets in QM9: μ, HOMO, LUMO, gap, ⟨R2⟩, and U. To summarize the comparisons, only μ showed a clearly improved validation performance (ratio 0.77, i.e., ∼23% lower MAE), and the other five targets showed slightly higher MAE values (1–11%).
Table 3. Validation MAE of Our Model Using Legendre Polynomials as Radial Basis Functions, Relative to the Same Model with Bessel Basis Functions (Used in DimeNet and DimeNet++22)
| target | (MAE using Legendre polynomials)/(MAE using Bessel basis functions) |
|---|---|
| μ | 0.77 |
| ϵHOMO | 1.02 |
| ϵLUMO | 1.03 |
| Δϵ | 1.01 |
| ⟨R2⟩ | 1.11 |
| U | 1.03 |
If the value is less than 1.0, it means that the Legendre polynomials as radial basis functions performed better.
This inconsistency over targets could not be clearly explained. Overall, we concluded that the Legendre basis in our model performs comparably to the Bessel basis functions. Note that our cutoff value is smaller than those of previous works, while the model performance is maintained. This will be discussed in the next section.
The Effect of Trainable Scalars
The Role of the γ in the Model Performance
We trained our model with and without trainable γ values for all 12 labels in QM9 data sets. We found that the validation performances were significantly improved (over 10%) in the case of μ, LUMO, and ⟨R2⟩ with γ values. Overall, the validation MAE values with γ values were lower than those without γ values in nine target cases in QM9. With U0 and U, the MAEs without training γ values are 3% lower than those with training γ. The results are shown in Figure 6.
Figure 6.

Ratio of validation MAE without training γ over MAE with training γ. If the ratio is less than 1.0, the performance with training γ is superior to that without training γ. Nine of 12 targets showed a lower validation MAE with training γ coefficients.
Adaptive Dependency on Pathways by Targets
We analyzed the ratio of the sum of the trained coefficients from the MLP pathway (the last single body block) to that from the MPNN pathway (the last output block) for each target. We plotted the change of this ratio over the training steps in Figure 7. If the ratio increases during training, the MLP pathway becomes more critical than the MPNN pathway for the target prediction, and vice versa. Note that the weights of the MLP pathway are trained only on the atom types Z.
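The ratio described above can be computed from logged coefficient values along the following lines. This is a minimal sketch: the array layout (one row per logged training step), the use of plain sums, and the function name are assumptions for illustration, and the numbers are made up rather than taken from our experiments.

```python
import numpy as np

def pathway_ratio(gamma_mlp, gamma_mpnn):
    """Ratio of summed trained coefficients: MLP pathway over MPNN pathway.

    Both inputs have shape (n_steps, n_coefficients); hypothetical logs.
    """
    gamma_mlp = np.asarray(gamma_mlp, dtype=float)
    gamma_mpnn = np.asarray(gamma_mpnn, dtype=float)
    return gamma_mlp.sum(axis=1) / gamma_mpnn.sum(axis=1)

# Made-up values for three logged training steps, not results from the paper.
gamma_mlp = np.array([[1.0, 1.0], [0.5, 0.3], [0.1, 0.1]])
gamma_mpnn = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
ratio = pathway_ratio(gamma_mlp, gamma_mpnn)
```

A ratio that decays toward zero over the steps, as in this toy input, corresponds to a model that relies increasingly on the MPNN pathway.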
Figure 7.
Ratio of the sum of the trained coefficients from the last single body block to that from the last output block. If the ratio goes to zero, it can be interpreted that the model depends more on the features from atom–atom relationships than on those from individual atom information. The ratios for U, U0, and H, which are closely related through the internal energy term, showed patterns similar to each other.
We found that the single body blocks were less important when training the model on μ. During training on the target μ, the trainable scalars in St became close to zero, which was not the case for the other targets. This exceptional result can be explained by the construction of μ, which is the sum of the dipole moments of all atom pairs in a molecule. We repeated the experiment with μ three times and observed this pattern every time (results not shown).
The formulation of the dipole moment is given by33
$$\boldsymbol{\mu} = \sum_{i} q_i \mathbf{R}_i \qquad (9)$$
where q and R are charges and position vectors, respectively. A dipole moment occurs between two polarized atoms, especially in a covalent bond or an ionic bond.34,35 By definition, dipole moments are represented by the charge separation between atom pairs (dipoles) in a molecule. The atom-only representations did not play a key role in predicting μ in our experiments, which is consistent with this formulation. The overall performances also support these observations (Table 1). In the case of μ, the performance was better when the model used only the MPNN pathway (MAE = 0.0238) than when the full model was used (MAE = 0.0256), whereas the full model outperformed the MPNN-pathway-only model for most of the other targets. This also supports the view that the atom-only representations from the MLP pathway do not help in predicting μ of the given molecules.
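To make the dependence on atom–atom relationships concrete, the following sketch assumes the standard point-charge form of the dipole moment, μ = Σ q_i R_i, and shows that for a neutral set of charges the result is unchanged when all atoms are translated together. The charges and coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np

def dipole_moment(charges, positions):
    """Dipole vector of point charges: mu = sum_i q_i * R_i."""
    charges = np.asarray(charges, dtype=float)
    positions = np.asarray(positions, dtype=float)
    return (charges[:, None] * positions).sum(axis=0)

# Hypothetical two-atom example: partial charges +0.4 and -0.4, 1.0 angstrom apart.
q = np.array([0.4, -0.4])
R = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mu = dipole_moment(q, R)

# For a neutral system the result does not depend on the origin,
# so only the relative positions of the charged atoms matter.
mu_shifted = dipole_moment(q, R + np.array([5.0, -2.0, 3.0]))
```

Because the total charge is zero, the magnitude of `mu` here equals the charge times the separation between the two atoms, which is information that a per-atom (atom-type-only) representation cannot carry.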
In addition, the plots in Figure 7 for U0, U, and H are similar to each other. U0 and U are the internal energies at 0 and 298.15 K, respectively, and H is the enthalpy at 298.15 K, which includes the internal energy term. These targets are therefore closely related to each other by construction. From these experimental results, we concluded that the model is trained differently depending on the target and that the pattern of the trained results is consistent with the underlying chemistry.
Density of a Molecule Representation
As mentioned before, the cutoff value c determines the node connections in a molecular graph. With a high cutoff value, the molecular graph representation is dense; with a low cutoff value, it is sparse.
Dense graphs are generally more capable of capturing the connectivity information in a graph than sparse graphs. However, dense structures carry a higher risk of gathering excessive features even when little of that information is necessary. Moreover, in an MPNN, the messages of every node from its neighboring edges are mixed repeatedly throughout the network. If a graph is excessively dense, most messages become indistinguishable, because every atom refers to all of its neighbors in every step. This may increase the risk of the oversmoothing problem,25 which is prone to occur in deep GCNs.
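The mixing effect described above can be illustrated with a toy experiment. In this sketch, plain mean aggregation with self-loops stands in for a message-passing step; it is a simplification chosen for illustration and not the update rule of our model. The graphs, feature values, and function name are hypothetical.

```python
import numpy as np

def feature_spread_after_mixing(adjacency, features, n_steps):
    """Apply mean aggregation (with self-loops) n_steps times and
    return the mean per-feature standard deviation across nodes."""
    a = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))
    a = a / a.sum(axis=1, keepdims=True)  # each node averages over itself and its neighbors
    h = np.asarray(features, dtype=float)
    for _ in range(n_steps):
        h = a @ h
    return h.std(axis=0).mean()

n_nodes = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n_nodes, 4))  # random node features

# Dense graph: every atom is connected to every other atom.
dense = np.ones((n_nodes, n_nodes)) - np.eye(n_nodes)
# Sparse graph: a simple chain of atoms.
sparse = np.diag(np.ones(n_nodes - 1), 1) + np.diag(np.ones(n_nodes - 1), -1)

spread_dense = feature_spread_after_mixing(dense, x, n_steps=3)
spread_sparse = feature_spread_after_mixing(sparse, x, n_steps=3)
```

On the fully connected graph, every node receives the same average after a single step, so the spread across nodes collapses to numerical zero, whereas the chain still retains distinguishable node features after three steps.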
It is challenging to determine the most appropriate cutoff value for a graph data set; however, we found a clue in molecular conformations. We observed that the average distance between atom pairs in QM9 molecules is 3.27 Å. The atom pairs within 4.0 Å, which is the cutoff value used in this work, account for 72% of all the pairs in the QM9 data set. The previous models SchNet,12 MegNet,21 and DimeNet22/DimeNet++32 adopted 10.0, 5.0, and 5.0 Å as their cutoff values, respectively. With cutoff values of 10.0 and 5.0 Å, 99.99% and 89% of the atom pairs, respectively, are represented in the molecular graphs.
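The coverage statistic used above, i.e., the fraction of atom pairs that fall within a cutoff, can be computed for a single molecule as in the following sketch. The coordinates are hypothetical, and the data-set-level figures quoted in the text would additionally require pooling the pairs of all molecules, which is not shown here.

```python
import numpy as np

def fraction_of_pairs_within(positions, cutoff):
    """Fraction of unordered atom pairs whose distance is below the cutoff."""
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(positions), k=1)  # count each pair once
    return float(np.mean(dist[i, j] < cutoff))

# Hypothetical coordinates (angstrom): four atoms on a line, 1.5 angstrom apart.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0], [4.5, 0.0, 0.0]])
frac_4 = fraction_of_pairs_within(coords, cutoff=4.0)  # 5 of the 6 pairs
frac_5 = fraction_of_pairs_within(coords, cutoff=5.0)  # all 6 pairs
```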
Note that the angle representation is defined by two different edges sharing one atom. This indicates that, for the angle representation, the required number of computations grows with the square of the number of edges. If the number of nodes of a graph is |VG|, then the number of representations, including all nodes, edges, and angles of the graph, has an upper bound determined by |VG|, even if the message directions are not considered. We compared graph density values for different cutoff values. If the edges are within a cutoff value of 4.0 Å, then the number of edge representations is reduced to about half ((72%)² = 51.8%) of the overall possible number of edges. If the cutoff value is 5.0 Å, then the proportion of edge representations increases to about 80% ((89%)² = 79.2%).
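One way to make this count concrete is sketched below. The assumptions are made only for this illustration: an undirected, fully connected graph with N = |VG| nodes, and one angle for every unordered pair of edges sharing an atom. The exact expression intended in the text may differ.

```latex
\begin{align}
n_{\text{nodes}} &= N, \qquad
n_{\text{edges}} \le \binom{N}{2} = \frac{N(N-1)}{2}, \qquad
n_{\text{angles}} \le N\binom{N-1}{2} = \frac{N(N-1)(N-2)}{2}, \\
n_{\text{total}} &\le N + \frac{N(N-1)}{2} + \frac{N(N-1)(N-2)}{2}.
\end{align}
```

Under the further simplifying assumption that a fraction p of all atom pairs lies within the cutoff and that the two edges of a pair are retained independently, the retained proportion of edge pairs scales as p², which gives 0.72² ≈ 0.518 for a cutoff of 4.0 Å and 0.89² ≈ 0.792 for a cutoff of 5.0 Å.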
We demonstrate that a cutoff value of 4.0 Å is sufficient to describe a molecular graph in QM9. As mentioned earlier, the molecules in QM9 comprise five elements, namely, C, H, O, N, and F. The lengths of the covalent bonds between any two of these elements are always less than 2.0 Å.36 Because the total length of any two successive covalent bonds is therefore less than 4.0 Å, the model can capture any two-hop neighboring atoms simultaneously with a cutoff value of 4.0 Å. This property makes it possible to identify all the angles between two covalent bonds. A detailed description is presented in Figure 8. From these observations, we argue that 4.0 Å is the best choice of cutoff value in terms of both efficient training and chemical nature. The same holds for the MD simulation data set, because all the atoms in this data set are carbon, hydrogen, and oxygen (C, H, and O). Finally, we argue that 4.0 Å is also an adequate value for organic molecules in the real world, because most molecular bond lengths are shorter than 2.0 Å.36
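The two-hop argument follows from the triangle inequality. Writing R_i, R_j, and R_k for the positions of three atoms in which (v_i, v_j) and (v_j, v_k) are covalent bonds, each shorter than 2.0 Å (notation introduced here for illustration):

```latex
\[
\lVert \mathbf{R}_i - \mathbf{R}_k \rVert
\le \lVert \mathbf{R}_i - \mathbf{R}_j \rVert + \lVert \mathbf{R}_j - \mathbf{R}_k \rVert
< 2.0\,\text{\AA} + 2.0\,\text{\AA} = 4.0\,\text{\AA}.
\]
```

Hence v_k always lies within the 4.0 Å cutoff of v_i, regardless of the bond angle at v_j.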
Figure 8.

Example of bond and angle representations. All covalent bonds between two atoms among C, H, O, N, and F are shorter than 2.0 Å, so any two-hop neighbor vk reached from an atom vi via covalent bonds is always located within the cutoff of 4.0 Å. Therefore, an atom vi can always capture two successive covalent bonds (vi, vj) and (vj, vk).
Conclusion
In this study, we developed a novel dual-branched network that consists of a message-passing-based pathway for atom–atom interactions and fully connected layers for single-atom representations. We represented a molecule as a graph in three-dimensional space with atom types and atom positions. We embedded a scalar-valued atom–atom distance as a vector using Legendre rational functions. Similarly, we embedded a scalar-valued angle as a vector using orthogonal Legendre polynomials. Both families of functions are complete orthogonal series and have low computational costs owing to their recurrence relations. Our model exhibited remarkable performances on two quantum-chemical data sets. In addition, we proposed trainable scalar values so that the model can attend to the more significant features according to the nature of each target during training. By analyzing the trained scalars, we also showed that they can explain the chemical nature of the target, which gives the model a degree of interpretability. Furthermore, we adopted a smaller cutoff value than those used in previous MPNN models and showed that this value saves computational resources without a loss of performance. We also argued that this cutoff value is sufficiently long to identify the local structure of a molecule. Although we showed both the model performance and a hint of interpretability, our model was applied only to the restricted field of small molecules. In future work, we will enhance the model scalability toward broader applications, including biomolecules, drugs, and crystals, and toward other molecular property predictions. We will also conduct further analysis of the predictability of the model according to target distributions.
Acknowledgments
This work was supported by the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2021 (2021M3A7B4911115), National Research Foundation of Korea, grant funded by the Korean Government (MSIT) (2018R1A2B3001628, 2019R1G1A1003253), K-BIO KIURI Center program (2020M3H1A1073304), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. 2021-0-01343, Artificial Intelligence Graduate School Program in Seoul National University).
We used two public data sets, namely, the QM9 and MD simulation data sets. Specifically, we retrieved the QM9 data set from the open resources of DimeNet22 and DimeNet++.32 For the MD simulation, we used the public database from a web source for quantum-machine research.39 We used numpy43 1.19.1, scipy44 1.4.1, and tensorflow-gpu45 2.2.0 for data preprocessing and model training. We utilized the codes for loading and reading numpy-formatted data from DimeNet22 and DimeNet++.32 Our code for the model object is available at https://github.com/fromjade/dlmpnn.git.
The authors declare no competing financial interest.
References
- Gao T.; Sun S.-L.; Shi L.-L.; Li H.; Li H.-Z.; Su Z.-M.; Lu Y.-H. An accurate density functional theory calculation for electronic excitation energies: The least-squares support vector machine. J. Chem. Phys. 2009, 130, 184104. 10.1063/1.3126773.
- Bartók A. P.; Payne M. C.; Kondor R.; Csányi G. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 2010, 104, 136403. 10.1103/PhysRevLett.104.136403.
- Rupp M.; Tkatchenko A.; Müller K.-R.; Von Lilienfeld O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012, 108, 058301. 10.1103/PhysRevLett.108.058301.
- Snyder J. C.; Rupp M.; Hansen K.; Müller K.-R.; Burke K. Finding density functionals with machine learning. Phys. Rev. Lett. 2012, 108, 253002. 10.1103/PhysRevLett.108.253002.
- Mills M. J.; Popelier P. L. Intramolecular polarisable multipolar electrostatics from the machine learning method Kriging. Comput. Theor. Chem. 2011, 975, 42–51. 10.1016/j.comptc.2011.04.004.
- Cova T. F.; Pais A. A. Deep learning for deep chemistry: optimizing the prediction of chemical patterns. Front. Chem. 2019, 7, 809. 10.3389/fchem.2019.00809.
- Balabin R. M.; Lomakina E. I. Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies. J. Chem. Phys. 2009, 131, 074104. 10.1063/1.3206326.
- Houlding S.; Liem S.; Popelier P. A polarizable high-rank quantum topological electrostatic potential developed using neural networks: Molecular dynamics simulations on the hydrogen fluoride dimer. Int. J. Quantum Chem. 2007, 107, 2817–2827. 10.1002/qua.21507.
- Hansen K.; Montavon G.; Biegler F.; Fazli S.; Rupp M.; Scheffler M.; Von Lilienfeld O. A.; Tkatchenko A.; Muller K.-R. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput. 2013, 9, 3404–3419. 10.1021/ct400195d.
- Montavon G.; Rupp M.; Gobre V.; Vazquez-Mayagoitia A.; Hansen K.; Tkatchenko A.; Müller K.-R.; Von Lilienfeld O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003. 10.1088/1367-2630/15/9/095003.
- Behler J.; Parrinello M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 2007, 98, 146401. 10.1103/PhysRevLett.98.146401.
- Schütt K. T.; Kindermans P.-J.; Sauceda H. E.; Chmiela S.; Tkatchenko A.; Müller K.-R. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems 2017, 992–1002.
- Chmiela S.; Tkatchenko A.; Sauceda H. E.; Poltavsky I.; Schütt K. T.; Müller K.-R. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv. 2017, 3, e1603015. 10.1126/sciadv.1603015.
- Kondor R. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. arXiv preprint 1803.01588, 2018.
- Unke O. T.; Meuwly M. PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theory Comput. 2019, 15, 3678–3693. 10.1021/acs.jctc.9b00181.
- Wang X.; Li Z.; Jiang M.; Wang S.; Zhang S.; Wei Z. Molecule property prediction based on spatial graph embedding. J. Chem. Inf. Model. 2019, 59, 3817–3828. 10.1021/acs.jcim.9b00410.
- Zhou J.; Cui G.; Hu S.; Zhang Z.; Yang C.; Liu Z.; Wang L.; Li C.; Sun M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. 10.1016/j.aiopen.2021.01.001.
- Wu Z.; Pan S.; Chen F.; Long G.; Zhang C.; Yu P. S. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 2021, 32, 4. 10.1109/TNNLS.2020.2978386.
- Gilmer J.; Schoenholz S. S.; Riley P. F.; Vinyals O.; Dahl G. E. Neural message passing for quantum chemistry. International Conference on Machine Learning 2017, 1263–1272.
- Schütt K. T.; Sauceda H. E.; Kindermans P.-J.; Tkatchenko A.; Müller K.-R. SchNet-A deep learning architecture for molecules and materials. J. Chem. Phys. 2018, 148, 241722. 10.1063/1.5019779.
- Chen C.; Ye W.; Zuo Y.; Zheng C.; Ong S. P. Graph networks as a universal machine learning framework for molecules and crystals. Chem. Mater. 2019, 31, 3564–3572. 10.1021/acs.chemmater.9b01294.
- Klicpera J.; Groß J.; Günnemann S. Directional message passing for molecular graphs. International Conference on Learning Representations 2020.
- Anderson B.; Hy T.-S.; Kondor R. Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems; MIT Press, 2019.
- Miller B. K.; Geiger M.; Smidt T. E.; Noé F. Relevance of rotationally equivariant convolutions for predicting molecular properties. arXiv preprint 2008.08461 2020.
- Li Q.; Han Z.; Wu X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence. AAAI 2018; pp 3538–3545.
- Schütt K. T.; Arbabzadah F.; Chmiela S.; Müller K. R.; Tkatchenko A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8, 1–8. 10.1038/ncomms13890.
- Lu C.; Liu Q.; Wang C.; Huang Z.; Lin P.; He L. Molecular property prediction: A multilevel quantum interactions modeling perspective. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33, 1052–1060. 10.1609/aaai.v33i01.33011052.
- Poulenard A.; Rakotosaona M.-J.; Ponty Y.; Ovsjanikov M. Effective rotation-invariant point cnn with spherical harmonics kernels. 2019 International Conference on 3D Vision (3DV). 2019, 47–56. 10.1109/3DV.2019.00015.
- Smidt T. E. Euclidean symmetry and equivariance in machine learning. Trends Chem. 2021, 3, 82. 10.1016/j.trechm.2020.10.006.
- Kondor R.; Lin Z.; Trivedi S. Clebsch-gordan nets: a fully fourier space spherical convolutional neural network. Advances in Neural Information Processing Systems; MIT Press, 2018.
- Fuchs F. B.; Worrall D. E.; Fischer V.; Welling M. SE (3)-transformers: 3D roto-translation equivariant attention networks. Advances in Neural Information Processing Systems; MIT Press, 2020.
- Klicpera J.; Giri S.; Margraf J. T.; Günnemann S. Fast and Uncertainty-Aware Directional Message Passing for Non-Equilibrium Molecules. NeurIPS-W; 2020.
- Oxtoby D. W.; Gillis H. P.; Butler L. J. Principles of modern chemistry; Cengage Learning, 2015.
- Brau C. A. Modern problems in classical electrodynamics; Oxford University Press, 2004.
- Daley R. Organic Chemistry, Part 3 of 3; Lulu.com, 2005.
- Allen F. H.; Kennard O.; Watson D. G.; Brammer L.; Orpen A. G.; Taylor R. Tables of bond lengths determined by X-ray and neutron diffraction. Part 1. Bond lengths in organic compounds. J. Chem. Soc., Perkin Transactions 2 1987, S1–S19. 10.1039/p298700000s1.
- Anli F.; Gungor S. Some useful properties of Legendre polynomials and its applications to neutron transport equation in slab geometry. Appl. Math. Model. 2007, 31, 727–733. 10.1016/j.apm.2005.12.005.
- Ramakrishnan R.; Dral P. O.; Rupp M.; Von Lilienfeld O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. data 2014, 1, 1–7. 10.1038/sdata.2014.22.
- Chmiela S.; Sauceda H. E.; Müller K.-R.; Tkatchenko A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 2018, 9, 1–10. 10.1038/s41467-018-06169-2.
- Wu Z.; Ramsundar B.; Feinberg E. N.; Gomes J.; Geniesse C.; Pappu A. S.; Leswing K.; Pande V. S. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9, 513. 10.1039/C7SC02664A.
- Kingma D. P.; Ba J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
- Finzi M.; Stanton S.; Izmailov P.; Wilson A. G. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. International Conference on Machine Learning, 2020; pp 3165–3176.
- Harris C. R.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. 10.1038/s41586-020-2649-2.
- Jones E.; Oliphant T.; Peterson P.; et al. SciPy: Open source scientific tools for Python. 2001; http://www.scipy.org/.
- Abadi M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015; https://www.tensorflow.org/, Software available from tensorflow.org.




