Abstract

The similarity of local atomic environments is an important concept in many machine learning techniques, which find applications in computational chemistry and material science. Here, we present and discuss a connection between the information entropy and the similarity matrix of a molecule. The resulting entropy can be used as a measure of the complexity of a molecule. Exemplarily, we introduce and evaluate two specific choices for defining the similarity: one is based on a SMILES representation of local substructures, and the other is based on the SOAP kernel. By tuning the sensitivity of the latter, we can achieve good agreement between the respective entropies. Finally, we consider the entropy of two molecules in a mixture. The gain of entropy due to the mixing can be used as a similarity measure of the molecules. We compare this measure to the average and best-match kernel. The results indicate a connection between the different approaches and demonstrate the usefulness and broad applicability of the similarity-based entropy approach.
1. Introduction
The concept of similarity is directly linked to the complexity of an object: decomposing the object into different, distinguishable units, the number of those units provides a measure of its complexity.1 In this sense, the complexity, or information content, of a molecule can be defined, and over the past decades, different measures have been proposed, e.g.2−6 Unfortunately, the different complexity measures are often hardly comparable. In this article, we establish a connection between the information entropy (in the sense of Shannon7,8) and the similarity matrix constructed from the local atomic environments of a molecule. The resulting expression is analogous to the von-Neumann entropy9 and can be used as a general framework for quantifying molecular complexity.
Similarity also plays an important role in many machine learning techniques, like kernel-ridge regression (KRR) or Gaussian process regression (GPR).10 In combination with descriptors of local atomic environments,11,12 those methods have been very successful in different areas of computational chemistry and material science.13 The atomic environments are usually defined in terms of all atoms within a certain distance of a reference atom. Suitable descriptors are then found, for example, from an expansion of the density of atoms in the environment into radial basis functions and spherical harmonics, i.e. using a smooth overlap of atomic positions (SOAP).14 The kernel function entering the KRR or GPR is calculated from such descriptors and can be interpreted as a measure of the similarity of two local atomic environments. Comparing all pairs of environments in this way leads to the similarity matrix of the specific molecule. On the other hand, for learning and predicting the global properties of molecules, one can introduce the similarity between different molecules.15 The latter can also be constructed from the similarity matrix of the local atomic environments.16
In order to calculate the molecular information entropy, we present two specific choices for defining a similarity function of local atomic environments: the first is based on a graph representation of the molecule, choosing substructures around a reference atom and comparing the resulting SMILES strings.17,18 The second approach utilizes the aforementioned SOAP similarity kernel and thus uses the positions and atomic numbers of the atoms. Both approaches are suitable for automated computational studies, and we present results for a selection of molecules from the QM9 data set.19,20
Finally, we investigate the information entropy for pairs of molecules, which leads to the mixing entropy. The latter is the maximal gain of information entropy upon mixing two molecules.6 Based on this observation, we propose and construct a new similarity measure of molecules and compare it to previously studied kernels.16 Our results demonstrate the usefulness and broad applicability of the similarity-based entropy approach.
2. Methods
2.1. Information Entropy
Typically, information entropy is considered in contexts involving some kind of “experiment” or “process”, where each time one of the events A1, A2, ..., An occurs at random.21 Knowing the probabilities p1, p2, ..., pn for those events, one can characterize the amount of uncertainty about the outcomes by introducing the Shannon entropy7,8 according to

H = −∑i pi log pi    (1)
The logarithm can be taken with respect to any base, but it is usually assumed to be base two. Moreover, we take pi log pi = 0 if pi = 0. One readily sees that the entropy vanishes if the probability for one event is one and the others are zero accordingly. This describes an experiment with no uncertainty because the outcome would always be the same. On the other hand, one has maximum uncertainty and thus maximal entropy if all events have the same probability pi = 1/n. In this case, H = log n.
In the context of molecules and graphs,1,22 a different point of view might be more suitable. If we decompose the molecule (or a graph) into n different parts (e.g., its atoms or vertices) and assign each of them to one of the equivalence classes (e.g., atom types) A1, A2, ..., Am (m ≤ n), then we can construct a finite scheme by associating the probability pi = ni/n with the respective class. The number of parts found for each class is denoted by ni, i.e., ∑i=1..m ni = n. The entropy given by eq 1 can then be viewed as a measure of the complexity of the object. If all parts belong to the same class (n1 = n), the complexity is zero. Conversely, if every part belongs to a different class (ni = 1 and m = n), the complexity is maximal for that particular system.
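As a minimal illustration of this finite-scheme construction, the entropy of eq 1 can be computed directly from the class sizes ni. The following sketch uses numpy; the function name is our own, not taken from the molent package:

```python
import numpy as np

def class_entropy(counts, base=2.0):
    """Shannon entropy (eq 1) of a decomposition into equivalence classes,
    using the probabilities p_i = n_i / n obtained from the class sizes n_i."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                      # convention: p log p = 0 for p = 0
    return float(-np.sum(p * np.log(p)) / np.log(base))
```

For example, partitioning ethanol (CH3CH2OH) by element into {C: 2, O: 1, H: 6} gives `class_entropy([2, 1, 6])` ≈ 1.224 bits, while a single class, `class_entropy([9])`, gives zero.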
2.2. Information Entropy from Similarity
To obtain a connection between the information entropy of a molecule and the similarities of its atoms, we start from a similarity function as follows,

S(k, l) = 1 if atoms k and l are equivalent, and 0 otherwise    (2)
If two atoms k and l in the molecule are chemically, or otherwise, equivalent, the function yields 1; otherwise it gives 0. Applying the similarity function to all pairs of atoms yields the similarity matrix of the molecule. This matrix is symmetric and positive semidefinite, and its trace equals the number of atoms in the molecule.
By permuting rows and columns of a similarity matrix of size n × n arising from eq 2, it can always be written as a direct sum of matrices of ones, i.e., S ≃ 1n1 ⊕ 1n2 ⊕ ⋯ ⊕ 1nm. The similarity matrix thus becomes block-diagonal with m blocks of size ni × ni, respectively. Consequently, it has m nonzero eigenvalues, namely {n1, n2, ..., nm}, and n − m eigenvalues which are zero (see the footnote below). This suggests that we can obtain a finite scheme directly from the similarity matrix, since the nonzero eigenvalues of S divided by n yield the required probabilities pi = ni/n. Moreover, we can directly calculate the associated entropy
H(S) = −∑i (ni/n) log(ni/n) = −Tr[(S/n) log(S/n)]    (3)
Here, Tr denotes the trace of the matrix, and log is the matrix logarithm. The last expression is analogous to the von-Neumann entropy,9 which is used not only in quantum mechanics but also in the context of complex networks.23 This analogy suggests that we can generalize the similarity function in eq 2 to any positive semidefinite and symmetric function with values in the range 0 ≤ S ≤ 1. The expression for the entropy remains unchanged.
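The eigenvalue route of eq 3 translates directly into code. A short numpy sketch (an illustrative implementation, not the molent package itself):

```python
import numpy as np

def similarity_entropy(S, base=2.0):
    """Entropy of eq 3: H = -Tr[(S/n) log(S/n)], evaluated from the
    eigenvalues of the symmetric, positive semidefinite similarity matrix S."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    lam = np.linalg.eigvalsh(S / n)   # eigenvalues of S/n sum to Tr(S)/n
    lam = lam[lam > 1e-12]            # zero eigenvalues contribute nothing
    return float(-np.sum(lam * np.log(lam)) / np.log(base))
```

For a binary matrix with one block of two equivalent atoms and one distinct atom, the nonzero eigenvalues of S/3 are 2/3 and 1/3, so the entropy is −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.918 bits.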
We can also introduce the linear entropy by expanding the logarithm to the lowest order:
Hlin = 1 − Tr[(S/n)²] = 1 − (1/n²) ∑k,l Skl²    (4)
It is given in terms of the average of the squared elements of the similarity matrix. If the matrix contains only zeros and ones, this average is equal to the average of the elements themselves. While approximate, this expression can be easier to calculate, especially for large similarity matrices.
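The linear entropy of eq 4 needs no diagonalization at all, which is the computational advantage mentioned above. A sketch under the same illustrative naming as before:

```python
import numpy as np

def linear_entropy(S):
    """Linear entropy of eq 4: H_lin = 1 - Tr[(S/n)^2], i.e. one minus the
    average of the squared elements of the (symmetric) similarity matrix."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    # For symmetric S, Tr[(S/n)^2] = (1/n^2) * sum of squared elements
    return float(1.0 - np.sum(S * S) / n**2)
```

For an all-ones matrix (all atoms equivalent) the linear entropy is zero, as is the full entropy of eq 3.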
Of course, the main question is how to find a suitable similarity function. From a chemical point of view, one might have different options for specifying chemical equivalence3,5 or one might use graph-theoretic concepts.4,22 In the following, two approaches are presented and discussed, which are also motivated by applicability in computational settings.
2.3. Substructure-SMILES Similarity
As a first approach to find the similarity of atomic environments, we use a graph representation of the molecule under consideration. For each atom of the molecule, we select a subgraph that contains all atoms connected to the reference atom by at most N bonds. For example, N = 1 would entail the atom itself and its bonded neighbors. Each subgraph is then converted to a (canonical) SMILES string17,18 with the reference atom as the starting point. In practice, we use rdkit24 to generate the environments and the SMILES strings. The similarity function is then simply defined via the comparison of the respective SMILES strings of the subgraphs, i.e.,
SSMILES(k, l) = 1 if the substructure-SMILES strings of atoms k and l coincide, and 0 otherwise    (5)
Instead of exact string comparison, one can also utilize any other string or SMILES similarity method.25
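The construction can be sketched with rdkit's substructure utilities (FindAtomEnvironmentOfRadiusN and PathToSubmol). The helper names below are our own, the exact rooted SMILES strings depend on the rdkit version, and edge cases (e.g., isolated atoms) are not handled:

```python
from rdkit import Chem

def substructure_smiles(mol, atom_idx, n_bonds):
    """Canonical SMILES of the subgraph within n_bonds of atom atom_idx,
    rooted at that atom (a sketch of the construction behind eq 5)."""
    if n_bonds == 0:
        # N = 0: only the element symbol of the reference atom
        return mol.GetAtomWithIdx(atom_idx).GetSymbol()
    bonds = Chem.FindAtomEnvironmentOfRadiusN(mol, n_bonds, atom_idx)
    amap = {}
    submol = Chem.PathToSubmol(mol, bonds, atomMap=amap)
    return Chem.MolToSmiles(submol, rootedAtAtom=amap[atom_idx], canonical=True)

def smiles_similarity_matrix(mol, n_bonds):
    """Binary similarity matrix from exact comparison of substructure SMILES."""
    smiles = [substructure_smiles(mol, i, n_bonds)
              for i in range(mol.GetNumAtoms())]
    return [[1 if a == b else 0 for b in smiles] for a in smiles]
```

For ethanol with explicit hydrogens (`Chem.AddHs(Chem.MolFromSmiles("CCO"))`), all hydrogens fall into one class at N = 0, while at N = 1 the hydroxyl hydrogen separates from the C-bonded hydrogens, as in Figure 1.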
Figure 1 illustrates the atomic environments of one of the carbon atoms in ethanol for different values of N. For N = 0, the SMILES strings only contain the element symbol of the respective atom; therefore, all hydrogen atoms and, likewise, all carbon atoms are found to be identical (lower right block in the similarity matrix). Already for N = 1, this is no longer the case: the environments take neighboring atoms into account, and the hydrogen atom bound to the oxygen is distinguished from the other hydrogens. The similarity matrices show increasing differentiation between the atoms until convergence is reached, for N ≥ 2 in this case.
Figure 1.

Top row: Atomic environments for one of the carbon atoms in ethanol with N = 0, 1, and 2 (from left to right). The substructure-SMILES strings are given below the structures. Bottom row: Similarity matrices obtained from eq 5 for N = 0, 1, 2.
2.4. SOAP Similarity
Another strategy for defining local atomic environments is the SOAP approach.14,16 It is based on a representation of the local density of atoms in the vicinity of a central atom. The corresponding environment is defined by a cutoff radius. Expanding the density in terms of radial basis functions gn(r) and spherical harmonics Ylm(r̂) leads to

ρZ(r) = ∑nlm cZnlm gn(r) Ylm(r̂)    (6)
Here, r denotes the length and r̂ the direction of the position vector r, and Z indicates the atomic species. The expansion coefficients cZnlm yield the rotationally invariant partial power spectrum,

pZ1Z2nn′l = π √(8/(2l + 1)) ∑m (cZ1nlm)* cZ2n′lm    (7)
The elements of the power spectrum can be collected into a unit-length vector p̂, which is then used to define the similarity function for two atomic environments,

SSOAP(k, l) = δZkZl (p̂k · p̂l)ζ    (8)
It can be shown that SSOAP is a positive definite function, which is obviously symmetric with respect to k and l. The integer exponent ζ ≥ 1 can be used to increase the sensitivity of the similarity function.14 In the definition above, we have additionally enforced the dissimilarity of environments centered around different species.
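Assuming the power-spectrum vectors have already been computed (in this work via dscribe), the kernel reduces to a few lines. The sketch below is an illustrative implementation of the similarity as described, not the paper's code; the function and argument names are our own:

```python
import numpy as np

def soap_similarity(p_k, p_l, species_k, species_l, zeta=2):
    """SOAP-style similarity: dot product of unit-length power-spectrum
    vectors raised to the sensitivity exponent zeta; environments centered
    on different species are declared dissimilar."""
    if species_k != species_l:
        return 0.0
    pk = np.array(p_k, dtype=float)   # copy, then normalize to unit length
    pl = np.array(p_l, dtype=float)
    pk /= np.linalg.norm(pk)
    pl /= np.linalg.norm(pl)
    return float(np.dot(pk, pl) ** zeta)
```

Since the normalized dot product lies in [0, 1] for nonnegative spectra, raising it to a larger ζ pushes every value below one toward zero, which is the sensitivity-tuning effect exploited in Section 3.1.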
3. Results and Discussion
3.1. Molecular Information Entropies
First, we consider the molecular information entropies calculated from the substructure-SMILES similarity approach. To this end, we collected 13 small molecules for which the information entropies are known based on symmetry and chemical intuition.26,27 Figure 2 shows a comparison of the entropies obtained from eq 3 for different sizes N of the atomic environment. Consistent with the observation in Section 2.3, the entropy increases with increasing N for some molecules until it converges to the expected value. In those cases, larger environments are necessary to achieve sufficient differentiation between all of the environments (as shown in Figure 1 for CCO).
Figure 2.
Molecular information entropies HSMILES calculated from the substructure-SMILES similarity approach for the molecules denoted on the upper axis. The values are obtained for different sizes of the atomic environments (N = 0, 1, 2, 3) and compared to values from literature (topmost row).26,27
In the next step, we use the ground state geometries of the 13 molecules mentioned above from the QM9 data set19,20 and calculate the SOAP descriptors for each atom using dscribe28,29 (with rcut = 6 Å, nmax = 10, and lmax = 6). The similarity matrix and entropy are then calculated via eqs 8 and 3, respectively, for a given sensitivity exponent ζ. Figure 3 shows a comparison of the corresponding entropies with the entropies found from the substructure-SMILES similarity approach (with N = 3). As the sensitivity increases for larger ζ, the entropies also increase, since the similarity function further distinguishes the atomic environments. Apparently, the entropies do not converge to the SMILES entropies in all cases. The reason is that the functional form of the similarity function in eq 8 suppresses all entries which are <1 with increasing ζ, which in the extreme case leads to all environments being dissimilar and yields the maximum entropy.
Figure 3.

Comparison of entropies obtained from substructure-SMILES (HSMILES) and the SOAP approach (HSOAP), respectively. The red circles indicate values from literature.26,27 The different plots show SOAP entropies with an increasing sensitivity exponent ζ. Perfect matching is indicated by the straight lines.
To further characterize the similarities arising from the two approaches, we consider the substructure-SMILES similarity as reference and compute the Kullback–Leibler (KL) divergence30 for the SOAP similarity of the same molecule,
D(SSMILES ∥ SSOAP) = Tr[(SSMILES/n) (log(SSMILES/n) − log(SSOAP/n))]    (9)
The KL divergence is a positive function and becomes zero if the two similarity matrices are identical. As an illustration, we take the first 184 molecules from the QM9 data set (10 of which are excluded due to a SMILES mismatch between rdkit and QM9, leaving 174) and compute the average KL divergence for different sensitivity exponents. From Figure 4, one can see that the KL divergence initially decreases with increasing sensitivity, in accordance with our previous observations. Then, at ζ ≈ 64, it shows a minimum, implying that the respective SOAP similarities on average match the SMILES similarities best. For larger exponents, the KL divergence increases slowly as more and more entries in SSOAP vanish. Figure 4 also shows the comparison of the entropies obtained from the SOAP similarities for ζ = 64 and the SMILES-based similarities. We observe good but not perfect agreement, which can again be attributed to the sweeping effect of the sensitivity exponent. It should be noted that the binary nature of the SMILES similarity is an extreme case that requires an atypically high sensitivity.
Figure 4.
Left: Average Kullback–Leibler divergence ⟨D⟩ of the SMILES and SOAP similarity matrices of 174 molecules. Right: Comparison of entropies obtained from the substructure-SMILES (HSMILES) and the SOAP approach (HSOAP) near the minimum of ⟨D⟩ for 174 molecules. Perfect matching is indicated by a straight line.
3.2. Mixing Entropy and Molecular Similarity
One important question concerns the complexity or information entropy of mixtures of molecules, as this directly applies to chemical reactions.26,27,31 We take two molecules, I and II, and suppose that we have a similarity function S which can be applied to any pair of atomic environments. The resulting similarity matrix SI+II has a 2 × 2 block structure, with the diagonal blocks (SI and SII) referring to the similarity between parts of each individual molecule, and the off-diagonal blocks (SI,II = STII,I) referring to the similarity between environments belonging to different molecules.
If the two molecules are identical, all four blocks are equal to SI, and the nonvanishing eigenvalues of SI+II become those of 2SI. Since the total number of parts is 2nI, the entropy of the two molecules is the same as that of the individual molecule. In the other case, if the two molecules do not share any equivalent atomic environments, the off-diagonal blocks are filled with zeros, and the similarity matrix can be written as a direct sum SI+II = SI ⊕ SII. In this case, the entropy becomes a weighted average of the individual entropies plus a term which can be called the entropy of mixing,26,27
H(SI+II) = [nI H(SI) + nII H(SII)]/(nI + nII) + Hmix(nI, nII)    (10)

with

Hmix(nI, nII) = −[nI/(nI + nII)] log[nI/(nI + nII)] − [nII/(nI + nII)] log[nII/(nI + nII)]    (11)
For nI = nII, the mixing entropy is equal to log 2 (or one bit per atom). For any pair of molecules, we can define the gain of entropy due to mixing via
ΔH = H(SI+II) − [nI H(SI) + nII H(SII)]/(nI + nII)    (12)
In general, this quantity takes values between 0 and Hmix, which itself is at most 1 bit.
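The two limiting cases (identical molecules, and molecules sharing no environments) can be verified in a few lines of numpy. This is an illustrative sketch with our own function names, assuming the block similarity matrices are available:

```python
import numpy as np

def entropy(S, base=2.0):
    """Entropy of eq 3 from the eigenvalues of S/n."""
    S = np.asarray(S, dtype=float)
    lam = np.linalg.eigvalsh(S / S.shape[0])
    lam = lam[lam > 1e-12]
    return float(-np.sum(lam * np.log(lam)) / np.log(base))

def mixing_entropy(n1, n2):
    """H_mix of eq 11: Shannon entropy of the two size fractions."""
    x1, x2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return float(-x1 * np.log2(x1) - x2 * np.log2(x2))

def entropy_gain(S1, S2, S12):
    """Gain of entropy due to mixing (eq 12), built from the 2 x 2
    block similarity matrix of the pair of molecules."""
    n1, n2 = S1.shape[0], S2.shape[0]
    S = np.block([[S1, S12], [S12.T, S2]])
    return entropy(S) - (n1 * entropy(S1) + n2 * entropy(S2)) / (n1 + n2)
```

With a zero off-diagonal block, the gain equals the mixing entropy; with identical blocks throughout, it vanishes, exactly as stated above.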
To illustrate the concept of mixing entropy, we take all pairs of the first 184 molecules from the QM9 data set and calculate the combined similarity matrices SI+II and the corresponding entropies H(SI+II). In Figure 5, the latter are compared to the mixing entropies Hmix(nI, nII) for the molecules, using the substructure-SMILES approach to calculate the similarities (with N = 3). For the data points on the diagonal, the molecules do not share equivalent atomic environments, and thus ΔH = Hmix. However, there are a number of molecular pairs for which the entropy gain is smaller than the mixing entropy.
Figure 5.
Comparison of the gain of entropy due to mixing of two molecules ΔHSMILES and the respective mixing entropy Hmix. The similarities are calculated using the substructure-SMILES approach (with N = 3). Perfect matching is indicated by the straight line.
The results for the mixing entropies suggest that one can use the ratio of the gain of entropy due to mixing, ΔH, and the mixing entropy Hmix(nI, nII) as a measure for the similarity of two molecules. This ratio is zero for two identical molecules and one for molecules not sharing atomic environments. Other approaches to compare two molecules based on similarities of local atomic environments include the average structural kernel and the best-match structural kernel.16 The former is calculated by averaging the elements of the off-diagonal block SI,II, i.e.,

K̅(p) = [1/(nI nII)] ∑i,j (SI,II)ijp    (13)
The best-match kernel can be formulated in terms of a rectangular assignment problem,16,32

K̂(p) = (1/nI) maxX ∑i,j Xij (SI,II)ijp    (14)
The matrix X contains only zeros and ones, and its elements are additionally subject to ∑jXij = 1 ∀ i and ∑iXij ≤ 1 ∀ j. If nI > nII, the two molecules have to be interchanged. Here, we have generalized the two kernels by allowing arbitrary powers p of the entries in S, while in ref (16) one sums the elements of the similarity matrix (p = 1) in both kernels.

The reason for introducing p is given by the expression of the linear entropy in eq 4. There, only squared elements of the similarity matrix appear, which suggests that taking the sum over the squared elements in K̅(p) or K̂(p), i.e., p = 2, is more natural when comparing to the entropy-based similarity, as also shown below.
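Both kernels can be sketched compactly; the assignment problem of the best-match kernel is solved here with scipy's linear_sum_assignment, which handles rectangular matrices. The function names are our own, illustrative choices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def average_kernel(S12, p=2):
    """Average structural kernel: mean of the p-th powers of the
    cross-similarity block between the two molecules."""
    return float(np.mean(np.asarray(S12, dtype=float) ** p))

def best_match_kernel(S12, p=2):
    """Best-match structural kernel: optimal assignment of the rows of the
    cross block to its columns, assuming n_I <= n_II after transposing."""
    C = np.asarray(S12, dtype=float) ** p
    if C.shape[0] > C.shape[1]:
        C = C.T                       # interchange the molecules if n_I > n_II
    rows, cols = linear_sum_assignment(C, maximize=True)
    return float(C[rows, cols].sum() / C.shape[0])
```

For a binary cross block, p = 1 and p = 2 coincide, which matches the observation for the SMILES-based similarities in Figure 6.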
Taking again the pairs of molecules from before, we compute the different molecular similarities for the substructure-SMILES and the SOAP approaches and compare them in Figure 6. One sees that for the SMILES approach, the results for p = 1 and p = 2 are identical, which is expected since Sij = Sij² in that case. For the SOAP approach, on the other hand, one sees a clear dependence on p. The best-match kernel with p = 2 yields similarities that agree on average with the entropy-based similarities, whereas the p = 1 kernel gives rise to a systematic nonlinear deviation. With the average kernels, one obtains similarities that are rather different from the entropy-based measure. In either case, it appears that the kernels with p = 2 have a correspondence with the entropy-based similarities which is close to linear. The results show that the considered molecular similarity measures are related despite their different starting points.
Figure 6.
Comparison of two molecular similarity kernels, K̅(p) and K̂(p), with the measure obtained via the gain of entropy due to mixing, 1 − ΔH/Hmix. Top row: similarities calculated from the substructure-SMILES approach with environment size N = 3. Bottom row: similarities calculated from the SOAP approach with sensitivity exponent ζ = 16. Perfect matching is indicated by the straight lines.
4. Conclusions
In summary, we have presented and discussed a connection between the information entropy of a molecule and the similarity matrix of its local atomic environments. The entropy can be obtained from an expression, given by eq 3, which is akin to the von-Neumann entropy. This relation provides a convenient framework for calculating the information entropy and also for analyzing its properties.
Two approaches for obtaining the similarity of the atomic environments were given. The substructure-SMILES method is based on a graph representation of the molecule. The similarity is defined in terms of a comparison of the SMILES strings for substructures. For a set of molecules, the entropies were shown to correspond to the known and expected values if the sizes of the substructures were sufficiently large. The second approach was based on SOAP descriptors, which were obtained from positions and atomic numbers of atoms in a molecule. The sensitivity of the resulting similarities can be tuned by an integer exponent ζ, and we found a good agreement between SOAP- and SMILES-based entropies for larger values of ζ. It should be noted that any value <1 in the similarity matrix is suppressed with increasing ζ, so that it is not expected that the values converge to the SMILES-based calculations. But the results show that both approaches can be used to estimate the information entropy of molecules. The choice of similarity function and the tuning of its hyperparameters (like ζ) should be adapted to the given application.
Finally, we investigated the entropy of a pair of molecules. For identical molecules, this entropy corresponds to that of the individual molecule. If the molecules do not share any similar atomic environments, the ensemble entropy becomes equal to a weighted sum of the individual entropies plus the mixing entropy, which only depends on the number of atoms in each molecule. In general, the entropy of the pair takes values between these two extremes. This motivated defining the similarity of two molecules in terms of the entropy gain due to mixing. We compared this measure to two other similarity kernels (the average structural kernel and the best-match structural kernel) for 184 molecules and found, on average, very good agreement with a modified best-match kernel. Thus, the entropy-based molecular similarity provides an alternative measure for comparing molecules with a strong anchoring in the framework of molecular information entropies.
Acknowledgments
Inspiration and support is kindly acknowledged from the project “Olfactorial Perceptronics” (No. 9B396) funded by the Volkswagen Foundation and SPP 2363 “Molecular Machine Learning” (No. GR4482/6) of the German Research Foundation. Stefanie Gräfe is acknowledged for her input and helpful comments on the manuscript.
Data Availability Statement
The QM9 data set is available at 10.6084/m9.figshare.c.978904.v5. The python code to calculate similarities and entropies can be found at https://github.com/CoMeT4MatSci/molent.
The author declares no competing financial interest.
Footnotes
Each matrix of ones 1ni has one nonzero eigenvalue equal to ni and ni – 1 eigenvalues which are zero.
References
- Rashevsky N. Life, information theory, and topology. Bulletin of Mathematical Biophysics 1955, 17, 229–235. 10.1007/BF02477860. [DOI] [Google Scholar]
- Bonchev D.; Trinajstić N. Information theory, distance matrix, and molecular branching. J. Chem. Phys. 1977, 67, 4517–4533. 10.1063/1.434593. [DOI] [Google Scholar]
- Bertz S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 1981, 103, 3599–3601. 10.1021/ja00402a071. [DOI] [Google Scholar]
- Bonchev D.; Trinajstić N. Chemical information theory: Structural aspects. Int. J. Quantum Chem. 1982, 22, 463–480. 10.1002/qua.560220845. [DOI] [Google Scholar]
- Böttcher T. An Additive Definition of Molecular Complexity. J. Chem. Inf. Model. 2016, 56, 462–470. 10.1021/acs.jcim.5b00723. [DOI] [PubMed] [Google Scholar]
- Sabirov D. S.; Shepelevich I. S. Information Entropy in Chemistry: An Overview. Entropy 2021, 23, 1240. 10.3390/e23101240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shannon C. E. A mathematical theory of communication. Bell System Technical Journal 1948, 27, 379–423. 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
- Shannon C. E. A mathematical theory of communication. Bell System Technical Journal 1948, 27, 623–656. 10.1002/j.1538-7305.1948.tb00917.x. [DOI] [Google Scholar]
- von Neumann J. Mathematische Grundlagen der Quantenmechanik; Springer: Berlin, 1932. [Google Scholar]
- Rasmussen C. E.; Williams C. K. I.; Gaussian Processes for Machine Learning; The MIT Press, 2006. [Google Scholar]
- Behler J.; Parrinello M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98, 146401 10.1103/PhysRevLett.98.146401. [DOI] [PubMed] [Google Scholar]
- Bartók A. P.; Payne M. C.; Kondor R.; Csányi G. Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons. Phys. Rev. Lett. 2010, 104, 136403 10.1103/PhysRevLett.104.136403. [DOI] [PubMed] [Google Scholar]
- Deringer V. L.; Bartók A. P.; Bernstein N.; Wilkins D. M.; Ceriotti M.; Csányi G. Gaussian Process Regression for Materials and Molecules. Chem. Rev. 2021, 121, 10073–10141. 10.1021/acs.chemrev.1c00022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bartók A. P.; Kondor R.; Csányi G. On representing chemical environments. Phys. Rev. B 2013, 87, 184115 10.1103/PhysRevB.87.184115. [DOI] [Google Scholar]
- Nikolova N.; Jaworska J. Approaches to Measure Chemical Similarity – a Review. QSAR & Combinatorial Science 2003, 22, 1006–1026. 10.1002/qsar.200330831. [DOI] [Google Scholar]
- De S.; Bartók A. P.; Csányi G.; Ceriotti M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 2016, 18, 13754–13769. 10.1039/C6CP00415F. [DOI] [PubMed] [Google Scholar]
- Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. 10.1021/ci00057a005. [DOI] [Google Scholar]
- Weininger D.; Weininger A.; Weininger J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97–101. 10.1021/ci00062a008. [DOI] [Google Scholar]
- Ruddigkeit L.; van Deursen R.; Blum L. C.; Reymond J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875. 10.1021/ci300415d. [DOI] [PubMed] [Google Scholar]
- Ramakrishnan R.; Dral P. O.; Rupp M.; von Lilienfeld O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022 10.1038/sdata.2014.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khinchin A.; Mathematical Foundations of Information Theory; Dover Books on Mathematics: Dover Publications, 1957. [Google Scholar]
- Mowshowitz A. Entropy and the complexity of graphs: I. An index of the relative complexity of a graph. Bulletin of Mathematical Biophysics 1968, 30, 175–204. 10.1007/BF02476948. [DOI] [PubMed] [Google Scholar]
- De Domenico M.; Biamonte J. Spectral Entropies as Information-Theoretic Tools for Complex Network Comparison. Phys. Rev. X 2016, 6, 041062 10.1103/PhysRevX.6.041062. [DOI] [Google Scholar]
- RDKit: Open-source cheminformatics. https://www.rdkit.org.
- Öztürk H.; Ozkirimli E.; Özgür A. A comparative study of SMILES-based compound similarity functions for drug-target interaction prediction. BMC Bioinformatics 2016, 17, 128. 10.1186/s12859-016-0977-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sabirov D. S. Information entropy changes in chemical reactions. Computational and Theoretical Chemistry 2018, 1123, 169–179. 10.1016/j.comptc.2017.11.022. [DOI] [Google Scholar]
- Sabirov D. S. Information entropy of mixing molecules and its application to molecular ensembles and chemical reactions. Comput. Theor. Chem. 2020, 1187, 112933 10.1016/j.comptc.2020.112933. [DOI] [Google Scholar]
- Himanen L.; Jäger M. O. J.; Morooka E. V.; Federici Canova F.; Ranawat Y. S.; Gao D. Z.; Rinke P.; Foster A. S. DScribe: Library of descriptors for machine learning in materials science. Comput. Phys. Commun. 2020, 247, 106949 10.1016/j.cpc.2019.106949. [DOI] [Google Scholar]
- Laakso J.; Himanen L.; Homm H.; Morooka E. V.; Jäger M. O.; Todorović M.; Rinke P. Updates to the DScribe library: New descriptors and derivatives. J. Chem. Phys. 2023, 158, 234802 10.1063/5.0151031. [DOI] [PubMed] [Google Scholar]
- Kullback S.; Leibler R. A. On information and sufficiency. Annals of Mathematical Statistics 1951, 22, 79–86. 10.1214/aoms/1177729694. [DOI] [Google Scholar]
- Karreman G. Topological information content and chemical reactions. Bulletin of Mathematical Biophysics 1955, 17, 279–285. 10.1007/BF02477754. [DOI] [Google Scholar]
- Crouse D. F. On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems 2016, 52, 1679–1696. 10.1109/TAES.2016.140952. [DOI] [Google Scholar]