Abstract
A novel framework for inverse quantitative structure–activity relationships (inverse QSAR) has recently been proposed and developed using both artificial neural networks and mixed integer linear programming. However, the classes of chemical graphs treated by the framework are limited. To deal with an arbitrary graph in the framework, we introduce a new model, called a two-layered model, and develop a corresponding method. In this model, each chemical graph is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with bounded height, the interior is the connected subgraph obtained by ignoring the exterior, and the feature vector consists of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Our method is more flexible than the existing method in the sense that any type of graph can be inferred. We compared the proposed method with an existing method using several data sets obtained from the PubChem database. The new method could infer more general chemical graphs with up to 50 non-hydrogen atoms. The proposed inverse QSAR method can thus be applied to the inference of more general chemical graphs than before.
Keywords: QSAR, molecular design, artificial neural network, mixed integer linear programming, enumeration of graphs, cheminformatics, materials informatics
1. Introduction
Computer-aided design of chemical structures is one of the key topics in chemoinformatics. In particular, extensive studies have been done on inverse quantitative structure–activity relationships (inverse QSAR), which seek chemical structures having desired chemical activities under some constraints. In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors in chemoinformatics and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for inverse QSAR [1,2,3]. In many such methods, inference or enumeration of graph structures from a given set of descriptors is a crucial subtask. Although various methods have been developed for that purpose [4,5,6,7], enumeration still remains a challenging task because the number of possible chemical graphs is huge; for example, the number of chemical graphs with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ [8]. Furthermore, even inference is a challenging task because it is NP-hard (computationally difficult) except for some simple cases [9]. Due to this inherent difficulty, most existing methods for inverse QSAR do not guarantee optimal or exact solutions.
On the other hand, the design of novel graph structures has recently become a hot topic in artificial neural network (ANN) studies, and thus extensive studies have been done for inverse QSAR using ANNs, especially with graph convolutional networks [10]. For example, variational autoencoders [11], recurrent neural networks [12,13], grammar variational autoencoders [14], generative adversarial networks [15], and invertible flow models [16,17] have been applied. Note that QSAR using three-dimensional structures of chemical compounds (3D-QSAR) has also been studied [18]. Particularly, comparative molecular field analysis (CoMFA) has been extensively studied and applied to various molecular design problems [19,20]. In CoMFA, electrostatic potential interaction energies across superimposed molecular structures are used as descriptors and then regression is performed by using the partial least squares (PLS) fitting. Recently, deep neural networks have been applied to 3D-QSAR by combining potential interaction energies with convolutional neural networks [21]. However, in order to apply 3D-QSAR, we need to calculate accurate three-dimensional structures of chemical compounds, which is not a straightforward task.
A novel framework for inferring chemical graphs has recently been developed [22,23] based on ANNs and mixed integer linear programming (MILP), as illustrated in Figure 1. It constructs a prediction function in the first phase and infers a chemical graph in the second phase. The first phase of the framework consists of three stages. In Stage 1, we choose a chemical property and a class of graphs, where a property function a is defined so that is the value of in , and collect a data set of chemical graphs in such that is available. In Stage 2, we introduce a feature function for a positive integer K. In Stage 3, we construct a prediction function with an ANN that, given a vector , returns a value so that serves as a predicted value to for each . Given a target chemical value , the second phase infers chemical graphs with in the next two stages. In Stage 4, we formulate an MILP that simulates the construction of from G and the computation process in the ANN so that given a target value, , and solve the MILP to infer a chemical graph and a feature vector such that and . In Stage 5, we generate other chemical graphs such that based on the output chemical graph .
Figure 1.
An illustration of a framework for inferring a set of chemical graphs .
MILP formulations required in Stage 4 have been designed for chemical compounds with cycle index 0 (i.e., acyclic) [23,24], cycle index 1 [25], and cycle index 2 [26]. In particular, Azam et al. [24] introduced a restricted class of acyclic graphs that is characterized by an integer, called a “branch-parameter”, such that the restricted class still covers most of the acyclic chemical compounds in the database.
Recently, Akutsu and Nagamochi [27] extended the idea to define a restricted class of cyclic graphs, called “ρ-lean cyclic graphs”, that covers most of the cyclic chemical compounds in the database. Based on this, they also defined a set of rules for specifying several topological substructures of a target chemical graph in a flexible way in Stage 4 before we solve an MILP. The method has been implemented by Zhu et al. [28], and computational results showed that chemical graphs with up to around 50 non-hydrogen atoms can be inferred. Although the method can infer the class of ρ-lean cyclic graphs and specify topological structures of the cyclic part, we still need to introduce a new model to deal with an arbitrary graph and to include a prescribed structure in the acyclic part of a target chemical graph.
In this paper, we introduce a new model, called a two-layered model, for representing the feature of a chemical graph in order to deal with an arbitrary graph in the framework. In the two-layered model, a chemical graph G with a parameter ρ is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with height at most ρ, and the interior is the connected subgraph obtained by ignoring the exterior. We define a feature vector of a chemical graph G to be the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Figure 2 illustrates an example of a chemical graph G. For a branch-parameter ρ = 2, the interior of the chemical graph G in Figure 2 is obtained by removing the set of vertices with degree 1 ρ = 2 times, i.e., first remove the set of vertices of degree 1 in G, and then remove the set of vertices of degree 1 in the resulting graph, where the removed vertices become the exterior-vertices of G and there are eight rooted trees in the exterior of G.
Figure 2.
An illustration of a chemical graph G, where for , the exterior-vertices are and the interior-vertices are .
We also introduce a new set of rules for specifying topological substructures of a target chemical graph G to be inferred so that a prescribed structure can be included in both the acyclic and cyclic parts of G. The set of rules contains (i) a seed graph as an abstract form of a target chemical graph G; (ii) a set of chemical rooted trees as candidates for trees in the exterior of G; and (iii) lower and upper bounds on the number of components in a target chemical graph such as chemical elements, double/triple bonds and the interior-vertices in G. Figure 3a,b illustrates examples of a seed graph and a set of chemical rooted trees, respectively. Given a seed graph , the interior of a target chemical graph G is constructed from the seed graph by replacing some edges uv with paths between the end-vertices u and v, and by attaching new paths to some vertices v. For example, the chemical graph G in Figure 2 is constructed from the seed graph in Figure 3a as follows. First replace five edges and in with new paths , , , and , respectively, to obtain the subgraph of G that consists of vertices depicted with squares. Next, attach to this graph three new paths, , , and , to obtain the interior of G in Figure 2. Finally, the chemical graph G in Figure 2 is obtained by attaching eight trees selected from the set and assigning chemical elements and bond-multiplicities in the interior. The frequency of chemical elements and the graph size are controlled with lower and upper bounds on the components in a target chemical graph G. See Section 2.2 for more details on the specification.
Figure 3.
(a) An illustration of a seed graph where the vertices in are depicted with gray squares, the edges in are depicted with dotted lines, the edges in are depicted with dashed lines, the edges in are depicted with gray bold lines, and the edges in are depicted with black solid lines. (b) A set of 11 chemical rooted trees , where the root of each tree is depicted with a black circle.
We implemented the two-layered model, and the results of computational experiments suggest that the proposed method can infer chemical graphs with up to around 50 non-hydrogen atoms.
The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 introduces a method of specifying topological substructures of target chemical graphs in Stage 4. Section 3 reports the results of computational experiments conducted for chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophilicity, and solubility. Section 4 makes some concluding remarks. An MILP formulation used in Stage 4 and a review of the dynamic programming algorithm for generating isomers in Stage 5 can be found in the Supplementary Materials. The proposed method/system is available on GitHub at https://github.com/ku-dml/mol-infer.
2. Materials and Methods
This section presents the mathematical details of the method we developed. Readers not interested in these details can skip this section.
2.1. Preliminary
This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.
Let $\mathbb{R}$, $\mathbb{Z}$, and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let $[a, b]$ denote the set of integers i with $a \le i \le b$.
Graphs. Given a graph G, let $V(G)$ and $E(G)$ denote the sets of vertices and edges, respectively. For a subset $V' \subseteq V(G)$ (resp., $E' \subseteq E(G)$) of a graph G, let $G - V'$ (resp., $G - E'$) denote the graph obtained from G by removing the vertices in $V'$ (resp., the edges in $E'$), where we remove all edges incident to a vertex in $V'$ in $G - V'$. The rank of a graph G is defined to be the minimum size $|F|$ of an edge subset $F \subseteq E(G)$ such that $G - F$ contains no cycle. A path with two end-vertices u and v is called a u,v-path. An edge $e = u_1 u_2$ in a connected graph G is called a bridge if the graph $G - e$ obtained from G by removing edge e is not connected, i.e., $G - e$ consists of two connected graphs $G_i$ containing vertex $u_i$, $i = 1, 2$. For a cyclic graph G, an edge e is called a core-edge if it is in a cycle of G or is a bridge $e = u_1 u_2$ such that each of the connected graphs $G_i$, $i = 1, 2$, of $G - e$ contains a cycle. A vertex incident to a core-edge is called a core-vertex of G.
A vertex designated in a graph G is called a root. In this paper, we designate at most two vertices as roots. We call a graph G rooted (resp., bi-rooted) if it has exactly one root (resp., exactly two roots), and we call G unrooted if it has no root.
For a graph G, possibly with roots, a leaf-vertex is defined to be a non-root vertex with degree 1. We call the edge incident to a leaf-vertex v a leaf-edge, and denote by $V_{\mathrm{leaf}}(G)$ and $E_{\mathrm{leaf}}(G)$ the sets of leaf-vertices and leaf-edges in G, respectively. For a graph or a rooted graph G, we define graphs $G_i$, $i \in \mathbb{Z}_+$, obtained from G by removing the set of leaf-vertices i times so that

$$G_0 := G; \qquad G_{i+1} := G_i - V_{\mathrm{leaf}}(G_i),$$

where we call a vertex $v \in V_{\mathrm{leaf}}(G_k)$ a leaf k-branch and we say that a vertex $v \in V_{\mathrm{leaf}}(G_k)$ has height $\mathrm{ht}(v) = k$ in G. The height $\mathrm{ht}(T)$ of a rooted tree T is defined to be the maximum of $\mathrm{ht}(v)$ of a vertex $v \in V(T)$. For an integer k, we call a rooted tree T k-lean if T has at most one leaf k-branch. For an unrooted cyclic graph G, we regard the set of non-core-edges in G as inducing a collection of trees each of which is rooted at a core-vertex, where we call G k-lean if each of the rooted trees in this collection is k-lean. Nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24].
Two-layered Model. Let G be an unrooted graph. For an integer ρ, which we call a branch-parameter, a two-layered model of G is a partition of G into an “interior” and an “exterior” in the following way. We call a vertex $v \in V(G)$ (resp., an edge $e \in E(G)$) of G an exterior-vertex (resp., exterior-edge) if $\mathrm{ht}(v) < \rho$ (resp., e is incident to an exterior-vertex), denote the sets of exterior-vertices and exterior-edges by $V^{\mathrm{ex}}(G)$ and $E^{\mathrm{ex}}(G)$, respectively, and define $V^{\mathrm{int}}(G) := V(G) \setminus V^{\mathrm{ex}}(G)$ and $E^{\mathrm{int}}(G) := E(G) \setminus E^{\mathrm{ex}}(G)$. We call a vertex in $V^{\mathrm{int}}(G)$ (resp., an edge in $E^{\mathrm{int}}(G)$) an interior-vertex (resp., interior-edge). The set $E^{\mathrm{ex}}(G)$ of exterior-edges forms a collection of connected graphs each of which is regarded as a rooted tree T rooted at the vertex $v \in V(T)$ with the maximum $\mathrm{ht}(v)$, where we call T a ρ-fringe-tree (or a fringe-tree). Let $\mathcal{T}^{\mathrm{ex}}(G)$ denote the set of fringe-trees in G. The interior $G^{\mathrm{int}}$ of G is defined to be the subgraph $(V^{\mathrm{int}}(G), E^{\mathrm{int}}(G))$ of G. Note that every core-vertex (resp., core-edge) in G is an interior-vertex (resp., interior-edge) of G. Figure 2 illustrates an example of a graph G together with its exterior-vertices and interior-vertices for a branch-parameter ρ = 2.
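The partition into interior and exterior can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `leaf_removal_heights` and `two_layered_partition` and the adjacency-dictionary input format are assumptions made here. It follows the description given in the Introduction, i.e., the set of degree-1 vertices is removed ρ times and the removed vertices become the exterior-vertices.

```python
def leaf_removal_heights(adj, rounds):
    """Peel degree-1 vertices `rounds` times and record the round of removal.

    adj maps each vertex to the set of its neighbours (simple undirected graph).
    A vertex removed in round i (counting from 0) is given height i.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    height = {}
    for i in range(rounds):
        # All current leaves are removed simultaneously in one round.
        leaves = [v for v in alive if degree[v] == 1]
        alive.difference_update(leaves)
        for v in leaves:
            height[v] = i
            for w in adj[v]:
                if w in alive:
                    degree[w] -= 1
    return height


def two_layered_partition(adj, rho):
    """Return (interior vertices, exterior vertices, heights) for parameter rho."""
    height = leaf_removal_heights(adj, rho)
    exterior = set(height)
    interior = set(adj) - exterior
    return interior, exterior, height


# Hypothetical example: a triangle a-b-c with a chain c-d-e-f attached.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
interior, exterior, height = two_layered_partition(example, rho=2)
```

In this toy graph, vertex f is removed in the first round and vertex e in the second, so the chain vertex d stays in the interior even though it lies on no cycle; the fringe-tree rooted at d has height 2.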
2.1.1. Modeling of Chemical Compounds
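To represent a chemical compound, we assume that each chemical element has a unique valence and we use a hydrogen-suppressed model, because hydrogen atoms can be added at the final stage under the assumption. In the hydrogen-suppressed model, a chemical compound C is represented by a tuple $C = (H, \alpha, \beta)$ of a simple, connected undirected graph H and functions $\alpha: V(H) \to \Lambda$ and $\beta: E(H) \to [1, 3]$, where $\Lambda$ is a set of non-hydrogen chemical elements such as C (carbon), O (oxygen), N (nitrogen), and so on. The set of atoms and the set of bonds in the compound C are represented by the vertex set $V(H)$ and the edge set $E(H)$, respectively. The chemical element assigned to a vertex $v \in V(H)$ is represented by $\alpha(v)$ and the bond-multiplicity between two adjacent vertices $u, v \in V(H)$ is represented by $\beta(e)$ of the edge $e = uv \in E(H)$. We say that two tuples $(H_i, \alpha_i, \beta_i)$, $i = 1, 2$, are isomorphic if they admit an isomorphism $\phi$, i.e., a bijection $\phi: V(H_1) \to V(H_2)$ such that $uv \in E(H_1),\ \alpha_1(u) = a,\ \alpha_1(v) = b,\ \beta_1(uv) = m$ ↔ $\phi(u)\phi(v) \in E(H_2),\ \alpha_2(\phi(u)) = a,\ \alpha_2(\phi(v)) = b,\ \beta_2(\phi(u)\phi(v)) = m$. When $H_i$ is rooted at a vertex $r_i$, $i = 1, 2$, the tuples $(H_i, \alpha_i, \beta_i)$ are rooted-isomorphic (r-isomorphic) if they admit an isomorphism $\phi$ such that $\phi(r_1) = r_2$. Figure 2 contains a pair of r-isomorphic chemical rooted trees.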
Associated with the two functions and in a tuple , we introduce the following functions: , , , and .
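For notational convenience, we use a function $\beta_G$ such that $\beta_G(u)$ means the sum of bond-multiplicities of edges incident to a vertex u, i.e.,

$$\beta_G(u) := \sum_{uv \in E(H)} \beta(uv) \quad \text{for each vertex } u \in V(H).$$

A chemical graph G is defined to be a tuple $G = (H, \alpha, \beta)$ such that the valence condition at each vertex is satisfied, i.e.,

$$\beta_G(u) \le \mathrm{val}(\alpha(u)) \quad \text{for each vertex } u \in V(H),$$

where $\mathrm{val}(\mathtt{a})$ denotes the valence of a chemical element $\mathtt{a}$ and we define the hydro-degree of a vertex v to be $\deg_{\mathrm{hyd}}(v) := \mathrm{val}(\alpha(v)) - \beta_G(v)$.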
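As a small illustration of the valence condition in the hydrogen-suppressed model, the following Python sketch sums the bond-multiplicities at each vertex and reports the remaining valence. It assumes that the hydro-degree of a vertex is its valence minus the sum of bond-multiplicities of its incident edges, which is the natural reading when hydrogen atoms are added at the final stage. The valence table, the function names, and the dictionary-based input format are assumptions made for this example only.

```python
# Assumed valences for a small illustrative set of elements.
VALENCE = {"C": 4, "N": 3, "O": 2}


def bond_sum(bonds, u):
    """Sum of bond-multiplicities over the edges incident to vertex u."""
    return sum(m for (x, y), m in bonds.items() if u in (x, y))


def hydro_degrees(elements, bonds, valence=VALENCE):
    """Return {vertex: valence minus bond sum}; raise if the valence is exceeded.

    elements: {vertex: element symbol}
    bonds:    {(u, v): bond-multiplicity in {1, 2, 3}}
    """
    result = {}
    for v, elem in elements.items():
        used = bond_sum(bonds, v)
        if used > valence[elem]:
            raise ValueError(f"valence condition violated at vertex {v!r}")
        result[v] = valence[elem] - used
    return result


# Hydrogen-suppressed acetic acid: C1-C2, C2=O3, C2-O4.
elements = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
hyd = hydro_degrees(elements, bonds)
```

Here the methyl carbon keeps three free valences, the hydroxyl oxygen keeps one, and the other two atoms are saturated, which matches the four hydrogen atoms of acetic acid.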
Figure 2 illustrates an example of a chemical graph .
To represent a feature of an edge such that , and in a chemical graph , we use a tuple , which we call the adjacency-configuration of the edge e. We introduce a total order < over the elements in to distinguish with and notationally. For a tuple , let denote the tuple .
To represent a feature of a vertex with that has d atoms in its neighbor in a chemical graph , we use a pair , which we call the chemical symbol of the vertex v. We treat as a single symbol , and define to be the set of all chemical symbols .
To represent a feature of an edge such that , and in a chemical graph , we use a tuple , which we call the edge-configuration of the edge e. We introduce a total order < over the elements in to distinguish with and notationally. For a tuple , let denote the tuple .
To represent a feature of the exterior of a chemical graph , a -fringe-tree in is called a fringe-configuration in the exterior.
2.1.2. Introducing Descriptors of Feature Vectors
This section introduces descriptors to define our feature vectors. Let be a chemical property for which we will construct a prediction function from a feature vector of a chemical graph to a predicted value for the chemical property of G.
We first choose a set of non-hydrogen chemical elements and then collect a data set of chemical compounds C whose chemical elements belong to , where we regard as a set of chemical graphs that represent the chemical compounds C in . To define the interior/exterior of chemical graphs , we next choose a branch-parameter , where we recommend .
Let (resp., ) denote the set of chemical elements used in the set of interior-vertices (resp., exterior-vertices) over all chemical graphs , and denote the set of edge-configurations used in the set of interior-edges over all chemical graphs . Let denote the set of chemical rooted trees r-isomorphic to a -fringe-tree over all chemical graphs .
We define an integer encoding of a finite set A of elements to be a bijection , where we denote by the set of integers. Introduce an integer coding of each of the sets , , and . Let (resp., ) denote the coded integer of an element (resp., ), denote the coded integer of an element in and denote an element in .
For each chemical element , let and denote the mass and valence of , respectively. In our model, we use integers , .
We define the feature vector of a chemical graph to be a vector that consists of the following non-negative integer descriptors , , where .
: the number of vertices in G.
: the number of interior-vertices in G.
: the average of mass over all non-hydrogen atoms in G, i.e., .
, : the number of interior-vertices of degree d in G.
, : the number of interior-vertices of interior-degree in the interior of G.
, : the number of vertices in G of hydro-degree .
, , : the number of interior-edges with bond multiplicity m in G, i.e., .
, , : the frequency of chemical element in the set of interior-vertices in G.
, , : the frequency of chemical element in the set of exterior-vertices in G.
, , : the frequency of edge-configuration in the set of interior-edges in G.
, , : the frequency of fringe-configuration in the set of -fringe-trees in G.
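A few of the descriptors listed above can be computed directly once the interior-vertices are known. The sketch below is a simplified illustration under assumptions made here (the function name `simple_descriptors`, the dictionary keys, and the input format are hypothetical); it covers only the vertex counts, the bond-multiplicity counts over interior-edges, and the element frequencies in the interior and exterior, and it treats an edge as an interior-edge when both of its end-vertices are interior-vertices.

```python
from collections import Counter


def simple_descriptors(elements, bonds, interior):
    """Compute a handful of count descriptors for a hydrogen-suppressed graph.

    elements: {vertex: element symbol}
    bonds:    {(u, v): bond-multiplicity}
    interior: iterable of interior-vertices
    """
    interior = set(interior)
    # An edge incident to an exterior-vertex is an exterior-edge.
    interior_edges = {
        e: m for e, m in bonds.items() if e[0] in interior and e[1] in interior
    }
    return {
        "n": len(elements),
        "n_int": len(interior),
        "bond_int": Counter(interior_edges.values()),
        "elem_int": Counter(elements[v] for v in interior),
        "elem_ex": Counter(
            elem for v, elem in elements.items() if v not in interior
        ),
    }


# Hypothetical example: a three-membered carbon ring 1-2-3 with a chain 3-O4-N5.
elements = {1: "C", 2: "C", 3: "C", 4: "O", 5: "N"}
bonds = {(1, 2): 1, (2, 3): 2, (1, 3): 1, (3, 4): 1, (4, 5): 1}
desc = simple_descriptors(elements, bonds, interior={1, 2, 3})
```

The remaining descriptors (degree distributions, edge-configurations, and fringe-configurations) follow the same counting pattern but need the integer encodings of the corresponding sets.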
2.2. Specifying Target Chemical Graphs
Given a prediction function and a target value , we call a chemical graph such that for the feature vector a target chemical graph. This section presents a set of rules for specifying topological substructure of a target chemical graph in a flexible way in Stage 4.
We first describe how to reduce a chemical graph into an abstract form based on which our specification rules will be defined. To illustrate the reduction process, we use the chemical graph in Figure 2.
- R1. Removal of all ρ-fringe-trees: The interior of G is obtained by removing the non-root vertices of each ρ-fringe-tree. Figure 4 illustrates the interior of the chemical graph G with ρ = 2 in Figure 2.
- R2. Removal of some leaf paths: We call a u,v-path Q in the interior a leaf path if vertex v is a leaf-vertex of the interior and the degree of each internal vertex of Q in the interior is 2, where we regard Q as rooted at vertex u. A connected subgraph S of the interior of G is called a cyclical-base if S is obtained from the interior by removing, for a subset X of interior-vertices, the vertices other than u of a leaf path rooted at each vertex u in X, where no two of these leaf paths share a vertex. Figure 5a illustrates a cyclical-base of the interior for a set of leaf paths in Figure 4.
- R3. Contraction of some pure paths: A path in S is called pure if each internal vertex of the path is of degree 2. Choose a set of several pure paths in S so that no two paths share vertices except for their end-vertices. A graph is called a contraction of a graph S (with respect to the chosen set of pure paths) if it is obtained from S by replacing each chosen pure u,v-path with a single edge uv, where the contraction may contain multiple edges between the same pair of adjacent vertices. Figure 5b illustrates a contraction obtained from the graph S in Figure 5a by contracting each pure path in the chosen set into a new edge.
Figure 4.
The interior of chemical graph G with in Figure 2.
Figure 5.
(a) A cyclical-base of the interior in Figure 4; (b) A contraction of S for a pure path set in (a), where a new edge obtained by contracting a pure path is depicted with a thick line.
We will define a set of rules so that a chemical graph can be obtained from a graph (called a seed graph in the next section) by applying processes R3 to R1 in a reverse way. We specify topological substructures of a target chemical graph with a tuple called a target specification defined under the set of the following rules.
Seed Graphs
A seed graph is defined to be a graph (possibly with multiple edges) such that the edge set consists of four sets , , , and , where each of them can be empty. A seed graph plays a role of the most abstract form in R3. Figure 3a illustrates an example of a seed graph, where , , , , and .
A subdivision S of a seed graph is a graph constructed from the seed graph according to the following rules:
- Each edge $e = uv \in E_{(\ge 2)}$ is replaced with a u,v-path of length at least 2;
- Each edge $e = uv \in E_{(\ge 1)}$ is replaced with a u,v-path of length at least 1 (equivalently, e is directly used or replaced with a u,v-path of length at least 2);
- Each edge $e \in E_{(0/1)}$ is either used or discarded; and
- Each edge $e \in E_{(=1)}$ is always used directly.

We allow a possible elimination of edges in $E_{(0/1)}$ as an optional rule in constructing a target chemical graph from a seed graph, even though such an operation has not been included in the process R3. A subdivision S plays the role of a cyclical-base in R2. A target chemical graph will contain S as a subgraph of the interior of G.
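The four subdivision rules can be sketched as follows. This is a hypothetical illustration rather than the authors' code: the edge-kind labels `"ge2"`, `"ge1"`, `"opt"`, and `"fixed"`, the function name `subdivide`, and the way path lengths and optional edges are supplied are all assumptions made here. In the framework itself these choices are made by the MILP in Stage 4; the sketch simply applies choices given by the caller.

```python
import itertools


def subdivide(seed_edges, path_length, use_optional):
    """Build the edge list of one subdivision of a seed graph.

    seed_edges:   list of (u, v, kind), kind in {"ge2", "ge1", "opt", "fixed"}
    path_length:  {edge index: chosen path length} for "ge2"/"ge1" edges
                  (defaults to the minimum allowed length)
    use_optional: set of indices of "opt" edges that are kept
    """
    minimum = {"ge2": 2, "ge1": 1}
    fresh = itertools.count()  # supplies labels for newly created vertices
    edges = []
    for i, (u, v, kind) in enumerate(seed_edges):
        if kind in minimum:
            length = path_length.get(i, minimum[kind])
            if length < minimum[kind]:
                raise ValueError(f"edge {i}: length {length} is too short")
            inner = [("new", next(fresh)) for _ in range(length - 1)]
            chain = [u, *inner, v]
            edges.extend(zip(chain, chain[1:]))
        elif kind == "opt":
            if i in use_optional:
                edges.append((u, v))
        elif kind == "fixed":
            edges.append((u, v))
        else:
            raise ValueError(f"unknown edge kind {kind!r}")
    return edges


seed = [
    ("a", "b", "ge2"),
    ("b", "c", "ge1"),
    ("c", "a", "opt"),
    ("c", "d", "fixed"),
]
# Replace edge 0 by a path of length 3, use edge 1 directly, discard edge 2.
S = subdivide(seed, path_length={0: 3}, use_optional=set())
```

With these choices the subdivision has five edges: three on the new path between a and b, plus the edges bc and cd.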
Interior-Specification
A graph that serves as the interior of a target chemical graph G will be constructed as follows. First, construct a subdivision S of a seed graph by replacing each edge e = uv with a pure u,v-path. Next, construct a supergraph of S by attaching a leaf path at each vertex of the seed graph or at an internal vertex of a pure u,v-path for some edge e = uv, where a leaf path may be trivial (i.e., we do not attach any new edges to v). We introduce the following rules for specifying the size of the interior, the length of a pure path, the length of a leaf path, the number of leaf paths, and the bond-multiplicity of each interior-edge, where we call the set of prescribed constants an interior-specification:
- Lower and upper bounds on the number of interior-vertices of a target chemical graph G.
- For each edge ,
  - a lower bound and an upper bound on the length of a pure -path . (For a notational convenience, set , , and , , . )
  - a lower bound and an upper bound on the number of leaf paths attached to at internal vertices v of a pure -path .
  - a lower bound and an upper bound on the maximum length of a leaf path attached at an internal vertex of a pure -path .
- For each vertex ,
  - a lower bound and an upper bound on the number of leaf paths attached to v, where .
  - a lower bound and an upper bound on the length of a leaf path attached to v.
- For each edge , a lower bound and an upper bound on the number of edges with bond-multiplicity in -path , where we regard , as single edge e.
We call a graph that satisfies an interior-specification a -extension of , where the bond-multiplicity of each edge has been determined.
Table 1 shows an example of an interior-specification to the seed graph in Figure 3.
Table 1.
Example 1 of an interior-specification .
Figure 6 illustrates an example of an -extension of seed graph in Figure 3 under the interior-specification in Table 1.
Figure 6.
An illustration of a graph that is obtained from the seed graph in Figure 3 under the interior-specification in Table 1, where the vertices newly introduced by pure paths and leaf paths are depicted with white squares and circles, respectively.
Chemical-Specification
Consider a graph that serves as the interior of a target chemical graph G, where the bond-multiplicity of each edge of this graph has been determined. Finally, we introduce a set of rules for constructing a target chemical graph G from this interior by choosing a chemical element and assigning a ρ-fringe-tree to each interior-vertex. We introduce the following rules for specifying the size of G, a set of chemical rooted trees that are allowed to be used as ρ-fringe-trees, and lower and upper bounds on the frequency of a chemical element, a chemical symbol, and an edge-configuration, where we call the set of prescribed constants a chemical specification:
- Lower and upper bounds on the number of vertices in G, where .
- Subsets and of chemical rooted trees with height at most , where we require that every ρ-fringe-tree rooted at a vertex (resp., at an internal vertex v not in ) in G belongs to (resp., ). Let and denote the set of chemical elements assigned to non-root vertices over all chemical rooted trees in .
- A subset , where we require that every chemical element assigned to an interior-vertex v in G belongs to . Let and (resp., and ) denote the number of vertices (resp., interior-vertices and exterior-vertices) v such that in G.
- A set of chemical symbols and a set of edge-configurations with , where we require that the edge-configuration of an interior-edge e in G belongs to . We do not distinguish and .
- Define to be the set of adjacency-configurations such that . Let denote the number of interior-edges e such that in G.
- Subsets , , we require that every chemical element assigned to a vertex in the seed graph belongs to .
- Lower and upper bound functions and on the number of interior-vertices v such that in G.
- Lower and upper bound functions on the number of interior-vertices v such that in G.
- Lower and upper bound functions on the number of interior-edges e such that in G.
- Lower and upper bound functions on the number of interior-edges e such that in G.
We call a chemical graph G that satisfies a chemical specification a -extension of , and denote by the set of all -extensions of .
Table 2 shows an example of a chemical-specification to the seed graph in Figure 3.
Table 2.
Example 2 of a chemical-specification .
Figure 2 illustrates an example of a -extension of obtained from the -extension in Figure 6 under the chemical-specification in Table 2.
Our specification of topological substructures is similar to that proposed by Akutsu and Nagamochi [27], wherein a target chemical graph is restricted to ρ-lean cyclic graphs and prescribed substructures cannot be specified in the acyclic part. In our new method, a chemical graph with any structure can be handled and substructures in the acyclic part can be fixed.
2.3. Examples of Specification
We here present some cases where a target specification can be chosen based on a set of given chemical graphs with a similar structure so that becomes a subset of . In such a case, every target chemical graph in possesses a common structure over the given set .
Figure 7 illustrates a set of four flavonoids and a seed graph for so that for a choice of an interior-specification and a chemical-specification . Let . In the seed graph , we set , , and and predetermine the chemical element for each vertex and the bond-multiplicity for each edge as in Figure 7e, i.e., for and for . Figure 7f illustrates a set of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in , we set , , , , , and . For edges , we set and , where a pure path may be introduced in a target chemical graph. We see that every given chemical graph belongs to by setting the other specification in and adequately.
Figure 7.
Illustration of a set of four flavonoids, a seed graph , and a set of chemical rooted trees for : (a) fisetin ; (b) luteolin ; (c) aurone ; (d) chalcone ; (e) ; (f) .
Figure 8 illustrates a set of three dibenzodiazepine atypical antipsychotics, and a seed graph for so that for a choice of an interior-specification and a chemical-specification . Let . In the seed graph , we set and and predetermine the chemical element for each vertex and the bond-multiplicity for each edge as in Figure 8d. Figure 8e illustrates a set of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in , we set , , , , , , and , where a leaf path may be introduced in a target chemical graph. For edge , we set and . We see that every given chemical graph belongs to by setting the other specification in and adequately.
Figure 8.
Illustration of a set of three dibenzodiazepine atypical antipsychotics, a seed graph and a set of chemical rooted trees for : (a) clozapine ; (b) quetiapine ; (c) olanzapine ; (d) ; (e) .
3. Results
We implemented our method of Stages 1 to 5 for inferring chemical graphs under a given target specification and conducted experiments to evaluate the computational efficiency. We executed the experiments on a PC with a 3.0 GHz Core i7-9700 processor and 16 GB of DDR4 RAM. We used ChemDoodle version 10.2.0 for constructing 2D drawings of chemical graphs.
To conduct experiments for Stages 1 to 5, we selected six chemical properties: octanol/water partition coefficient (KOW), boiling point (BP), melting point (MP), flash point in closed cup (FP), lipophilicity (LP), and solubility (SL). The data for KOW, BP, MP, and FP are provided by HSDB from PubChem [29], the data for LP by figshare [30], and the data for SL by MoleculeNet [31].
Results on Phase 1.
We implemented Stages 1, 2, and 3 in Phase 1 as follows.
Stage 1. We set the graph class to be the set of all chemical graphs with any graph structure, and set the branch-parameter ρ to be 2. For each property π ∈ {KOW, BP, MP, FP, LP, SL}, we first select a set of chemical elements and then collect a data set of chemical graphs over the set of chemical elements. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms, contain a charged element, or contain an element whose valence is different from our setting of the valence function.
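The elimination rule used to build each data set can be written as a simple predicate. The sketch below is illustrative only: the record format (one tuple of element symbol, formal charge, and observed valence per non-hydrogen atom), the valence table, and the function name `keep_compound` are assumptions introduced here and are not taken from the published system.

```python
# Assumed valence setting for the element set {C, O, N}.
ASSUMED_VALENCE = {"C": 4, "O": 2, "N": 3}


def keep_compound(atoms, valence=ASSUMED_VALENCE):
    """Decide whether a compound passes the Stage 1 filter.

    atoms: list of (element, formal_charge, observed_valence), hydrogens excluded.
    A compound is dropped if it has at most three carbon atoms, contains a
    charged atom, or contains an atom whose valence differs from the setting
    (which also covers elements outside the selected set).
    """
    carbons = sum(1 for elem, _, _ in atoms if elem == "C")
    if carbons <= 3:
        return False
    for elem, charge, observed in atoms:
        if charge != 0:
            return False
        if valence.get(elem) != observed:
            return False
    return True


butane = [("C", 0, 4)] * 4
propane = [("C", 0, 4)] * 3
charged = [("C", 0, 4)] * 4 + [("N", 1, 4)]
pentavalent = [("C", 0, 4)] * 4 + [("N", 0, 5)]
kept = [keep_compound(c) for c in (butane, propane, charged, pentavalent)]
```

Only the first of the four toy records passes: the second has too few carbon atoms, the third contains a charged nitrogen, and the fourth contains a nitrogen whose valence differs from the assumed setting.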
Table 3 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:
Elements: the set of selected chemical elements (hydrogen atoms are added at the final stage);
Data set size: the size of the data set over the selected elements for the property;
Edge-configurations: the number of different edge-configurations of interior-edges over the compounds in the data set;
Fringe-trees: the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in the data set;
Vertices [min, max]: the minimum and maximum numbers of vertices (non-hydrogen atoms) over the compounds G in the data set; and
Property value [min, max]: the minimum and maximum values of the property over the compounds G in the data set.
Table 3.
Data sets for Stage 1 in Phase 1.
| Property | Elements | Data set size | Edge-configurations | Fringe-trees | Vertices [min, max] | Property value [min, max] |
|---|---|---|---|---|---|---|
| KOW | C,O,N | 644 | 24 | 109 | [4, 58] | [−7.53, 13.45] |
| KOW | C,O,N,S,Cl | 837 | 31 | 142 | [4, 73] | [−7.53, 13.45] |
| BP | C,O,N | 358 | 21 | 91 | [4, 30] | [−11.70, 470.0] |
| BP | C,O,N,S,Cl | 425 | 23 | 114 | [4, 30] | [−11.70, 470.0] |
| MP | C,O,N | 448 | 22 | 94 | [4, 122] | [−185.3, 300.0] |
| MP | C,O,N,S,Cl | 548 | 26 | 118 | [4, 122] | [−185.3, 300.0] |
| FP | C,O,N | 348 | 20 | 85 | [4, 66] | [−82.99, 300.0] |
| FP | C,O,N,S,Cl | 399 | 24 | 107 | [4, 66] | [−82.99, 300.0] |
| LP | C,O,N | 592 | 27 | 71 | [6, 60] | [−3.62, 6.84] |
| LP | C,O,N,S,Cl | 779 | 32 | 78 | [6, 74] | [−3.62, 6.84] |
| SL | C,O,N | 640 | 25 | 111 | [4, 55] | [−9.33, 1.11] |
| SL | C,O,N,S,Cl | 847 | 31 | 144 | [4, 55] | [−11.60, 1.11] |
Stage 2. We used the new feature function that consists of the descriptors such as fringe-configuration defined in Section 2.1 and let denote the feature function.
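Stage 3. Let $\eta$ be a prediction function to a property function $a$ with a feature function $f$ for a data set D of chemical graphs. We define the coefficient of determination $R^2(\eta, D)$ of a prediction function $\eta$ over a data set D to be

$$R^2(\eta, D) := 1 - \frac{\sum_{G \in D} \bigl(a(G) - \eta(f(G))\bigr)^2}{\sum_{G \in D} \bigl(a(G) - \tilde{a}\bigr)^2}, \qquad \tilde{a} := \frac{1}{|D|} \sum_{G \in D} a(G).$$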
To conduct an experiment in Stage 3, we first constructed ten architectures with one or two hidden layers. For each pair of a property π ∈ {KOW, BP, MP, FP, LP, SL} and an architecture, we constructed five prediction functions in order to evaluate the performance with cross-validation as follows. Partition the data set randomly into five subsets and, for each subset, construct an ANN and its prediction function from the remaining four subsets using the feature function. We used scikit-learn version 0.23.2 with Python 3.8.5, MLPRegressor, and the ReLU activation function to construct each ANN. We evaluated each resulting prediction function with the coefficient of determination on the held-out subset, which serves as the test set. For each property, let t-R² denote the average of the coefficient of determination over the five test sets in the cross-validation for an architecture.
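The evaluation protocol can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the experimental code: it uses the usual definition of the coefficient of determination, the function names `r_squared`, `five_fold_scores`, and `fit_line` are hypothetical, and a one-dimensional least-squares line stands in for the ANN (the experiments in the text used scikit-learn's MLPRegressor instead) so that the example runs with the standard library alone.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS) / (total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def five_fold_scores(xs, ys, fit, k=5, seed=0):
    """Train on k-1 folds, score R^2 on the held-out fold, for each fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(xs[i]) for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], preds))
    return scores


def fit_line(xs, ys):
    """Stand-in learner: ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


xs = list(range(20))
ys = [2 * x + 1 for x in xs]
scores = five_fold_scores(xs, ys, fit_line)
t_r2 = sum(scores) / len(scores)  # average test score over the five folds
```

Replacing `fit_line` with a function that trains an ANN on feature vectors and returns its prediction function gives the averaged test score reported per architecture.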
Table 4 shows the results on Stages 2 and 3, where we denote the following.
Table 4.
Results of Stages 2 and 3 in Phase 1.
| Property | Elements | L-Time | t-R² (Best) | t-R² | Arch. |
|---|---|---|---|---|---|
| KOW | C,O,N | 0.7 | 0.959 | 0.983 | (156,10,10,1) |
| KOW | C,O,N,S,Cl | 0.7 | 0.947 | 0.968 | (199,20,10,1) |
| BP | C,O,N | 3.5 | 0.858 | 0.923 | (135,30,20,1) |
| BP | C,O,N,S,Cl | 3.3 | 0.821 | 0.899 | (163,10,1) |
| MP | C,O,N | 3.8 | 0.784 | 0.893 | (139,40,1) |
| MP | C,O,N,S,Cl | 4.1 | 0.796 | 0.880 | (170,10,10,1) |
| FP | C,O,N | 1.1 | 0.750 | 0.874 | (128,40,1) |
| FP | C,O,N,S,Cl | 1.8 | 0.707 | 0.853 | (157,10,10,1) |
| LP | C,O,N | 0.5 | 0.868 | 0.908 | (121,30,1) |
| LP | C,O,N,S,Cl | 0.7 | 0.861 | 0.892 | (137,20,10,1) |
| SL | C,O,N | 0.7 | 0.870 | 0.913 | (159,30,1) |
| SL | C,O,N,S,Cl | 0.9 | 0.870 | 0.903 | (201,30,20,1) |
- Elements (second column): the set of selected chemical elements (hydrogen atoms are added at the final stage);
- L-time: the average time (s) to construct an ANN over all ANNs;
- t-R² (best): the best value of t-R² over all architectures;
- t-R²: the maximum of the coefficient of determination over all test sets; and
- Arch.: the architecture that attains t-R² (best). An architecture (K, p, 1) (resp., (K, p1, p2, 1)) consists of an input layer with K nodes, a hidden layer with p nodes (resp., two hidden layers with p1 and p2 nodes, respectively), and an output layer with a single node, where K is equal to the number of descriptors in the feature vector.
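As a small illustration of how an architecture tuple in Table 4 can be read, the following sketch lists the weight-matrix shapes and counts the trainable parameters. It assumes fully connected layers with one bias per non-input node, which is the usual convention for a multilayer perceptron; the function names are hypothetical.

```python
def layer_shapes(arch):
    """Weight-matrix shapes (inputs, outputs) between consecutive layers."""
    return list(zip(arch[:-1], arch[1:]))


def parameter_count(arch):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes(arch))


arch = (156, 10, 10, 1)  # first row of Table 4: K = 156, two hidden layers
print(layer_shapes(arch))     # [(156, 10), (10, 10), (10, 1)]
print(parameter_count(arch))  # 1691
```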
From Table 4, we see that Stage 3 was largely successful: most t-R² values are around 0.85 to 0.95 across the six chemical properties.
An Additional Experiment in Stage 3. We conducted an additional experiment to compare our new feature function with the feature function based on edge-configuration in the previous method [27], which was designed within the same framework. Note that the previous feature vector can be defined only for a cyclic graph G, whereas our feature vector is defined for an arbitrary graph G. For each property in {KOW, BP, MP, FP, LP, SL}, we set the set of chemical elements to be {C, O, N, S, Cl} and then collected a data set of chemical cyclic graphs from the data set of all chemical graphs over that set of chemical elements in the previous experiment. For each of the two feature functions, we constructed five prediction functions with the same set of ten architectures and the data set of chemical cyclic graphs, in the same manner as in the previous experiment.
Table 5 shows the results of this experiment, where the table also includes the result of prediction functions by in the set of all chemical graphs. In the table, we denote the following:
- , : the size of data set of cyclic graphs (resp., of all chemical graphs) for property ;
- t-R² (ave.): the average of over all for and ; and
- t-R² (best): the average of over all .
Table 5.
Results of prediction functions by and in data set of cyclic graphs and in data set of all graphs.
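| Property | Size (cyclic) | Previous, cyclic: t-R² (ave.) | Previous, cyclic: t-R² (Best) | New, cyclic: t-R² (ave.) | New, cyclic: t-R² (Best) | Size (all) | New, all: t-R² (ave.) | New, all: t-R² (Best) |
|---|---|---|---|---|---|---|---|---|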
| KOW | 580 | 0.952 | 0.959 | 0.950 | 0.954 | 837 | 0.944 | 0.947 |
| BP | 224 | 0.688 | 0.718 | 0.680 | 0.693 | 425 | 0.809 | 0.821 |
| MP | 348 | 0.668 | 0.694 | 0.712 | 0.736 | 548 | 0.776 | 0.796 |
| FP | 218 | 0.435 | 0.476 | 0.574 | 0.623 | 399 | 0.688 | 0.707 |
| LP | 776 | 0.832 | 0.842 | 0.853 | 0.861 | 779 | 0.854 | 0.861 |
| SL | 638 | 0.851 | 0.863 | 0.853 | 0.861 | 847 | 0.860 | 0.870 |
From Table 5, we see that the R² score of the prediction function by the new feature function on chemical cyclic graphs (resp., on all chemical graphs) is improved over that by the previous feature function for properties MP and FP (resp., BP, MP, and FP). Recall that our new feature function can be defined for arbitrary graphs, so we can select a larger data set for the learning stage than is possible with the previous feature function. This advantage is observed in the experiment. We conjecture that the better prediction function for BP (resp., FP) is obtained with the new feature function because the data set becomes considerably larger, from 224 to 425 (resp., from 218 to 399).
Results on Phase 2.
We prepared the following instances (a–d) for conducting experiments of Stages 4 and 5 in Phase 2.
(a) : The instance used in Section 2.2 to explain the target specification.
(b) , : An instance for inferring chemical graphs with rank at most 2. In the four instances , , the following specifications in are common.
- Set , set to be the set of all possible symbols in , and set to be the set of all possible edge-configurations. Set , .
- The lower bounds , , , , , , , , , are all set to be 0.
- The upper bounds , , , , , , , , , are all set to be an upper bound on .
- For each property , let denote the set of 2-fringe-trees in the compounds in , and select a subset with , . For each instance , set , .
Instance is given by the rank-1 seed graph in Figure 9a and Instances , are given by the rank-2 seed graph , in Figure 9b–d.
(i) For instance , select as a seed graph the monocyclic graph in Figure 9a, where , and . Set and . We include a linear constraint as part of the side constraint.
(ii) For instance , select as a seed graph the graph in Figure 9b, where , , and . Set and . We include a linear constraint .
(iii) For instance , select as a seed graph the graph in Figure 9c, where , , and . Set and . We include linear constraints and .
(iv) For instance , select as a seed graph the graph in Figure 9d, where , and . Set and . We include linear constraints , and .
Figure 9.
An illustration of seed graphs: (a) A monocyclic graph ; (b) A rank-2 cyclic graph with two vertex-disjoint cycles; (c) A rank-2 cyclic graph with two edge-disjoint cycles sharing a vertex; (d) A rank-2 cyclic graph with three cycles.
We define instances in (c) and (d) in order to find chemical graphs whose structure is intermediate between two given chemical cyclic graphs and . Let and denote the sets of chemical elements and chemical symbols of the interior-vertices in , denote the sets of edge-configurations of the interior-edges in , and denote the set of 2-fringe-trees in . Analogously define sets , , , and in .
(c) : An instance that aims to infer a chemical graph such that the core of is equal to the core of and the frequency of each edge-configuration in the non-core of is equal to that of . We use chemical compounds CID 24822711 and CID 59170444 in Figure 10a,b for and , respectively.
Set a seed graph to be the core of .
Set , and set to be the set of all possible chemical symbols in .
Set and , .
Set , ,
and .
Set lower bounds , , , , , , , and to be 0.
Set upper bounds , , , , , , , and to be .
Set to be the number of core-edges in with and to be the number of interior-edges in and with edge-configuration .
Let denote the set of chemical rooted trees r-isomorphic p-fringe-trees in .
Set , .
(d) : An instance that aims to infer a chemical monocyclic graph such that the frequency vector of edge-configurations in is a vector obtained by merging those of and . We use chemical monocyclic compounds CID 10076784 and CID 44340250 in Figure 10c,d for and , respectively. Set a seed graph to be the monocyclic seed graph with , and in Figure 9a.
Set , and .
Set , ,
and .
Set lower bounds , , , , , , , and to be 0.
Set upper bounds , , , , , , , and to be .
For each edge-configuration , let (resp., ) denote the number of interior-edges with in (resp., ), and set
, ,
and
.
Set , .
Figure 10.
An illustration of chemical compounds for instances and : (a) : CID 24822711; (b) : CID 59170444; (c) : CID 10076784; (d) : CID 44340250.
In Stage 5, before we formulate an MILP for inferring a target chemical graph for each instance I, we reduce the input layer of an ANN constructed in Stage 3 so that the input layer consists only of input nodes that correspond to the descriptors actually used in the specification of the instance I; that is, we remove any input nodes in that represent the frequency of edge-configurations in and chemical rooted trees not contained in the specification of I. For example, there are chemical rooted trees in the set of 2-fringe-trees in the data set with KOW in Table 3, and an ANN constructed in Stage 3 contains 109 input nodes that correspond to the descriptors for the fringe-configuration. However, the set of input nodes for the fringe-configuration is reduced to a set of input nodes when we formulate an MILP for solving instance , which reduces the number of integer variables.
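The effect of this reduction can be illustrated with a toy first layer. The sketch below assumes that a descriptor excluded by the specification always takes the value 0 in a target graph, so deleting its input node (its column of weights) leaves every hidden-node output unchanged. The network, the descriptor names, and the helper names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def first_layer(weights, biases, x):
    """ReLU outputs of the first hidden layer; weights[j][i] links input i to node j."""
    return relu([sum(w * xi for w, xi in zip(row, x)) + b
                 for row, b in zip(weights, biases)])


def reduce_inputs(weights, names, used):
    """Drop the input columns whose descriptor name is not in `used`."""
    keep = [i for i, name in enumerate(names) if name in used]
    reduced = [[row[i] for i in keep] for row in weights]
    return reduced, [names[i] for i in keep]


# Hypothetical toy network: 4 descriptors, 2 hidden nodes.
names = ["tree_A", "tree_B", "tree_C", "tree_D"]
weights = [[0.5, -1.0, 2.0, 0.25],
           [1.5, 0.75, -0.5, 1.0]]
biases = [0.1, -0.2]

used = {"tree_A", "tree_C"}        # descriptors allowed by the instance
small_w, small_names = reduce_inputs(weights, names, used)

x_full = [3.0, 0.0, 1.0, 0.0]      # excluded descriptors are fixed to 0
x_small = [3.0, 1.0]
print(first_layer(weights, biases, x_full))
print(first_layer(small_w, biases, x_small))
```

Both calls print the same hidden-layer outputs, while the reduced layer needs variables for only two input nodes instead of four.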
Table 6 shows the features of the seven test instances, where we denote the following:
- : the set of non-hydrogen chemical elements for inferring a target graph;
- : the number of different edge-configurations of interior-edges for inferring a target graph;
- : the number of different chemical rooted trees in the set ; and
- , : the lower and upper bounds on and for inferring a target graph .
Table 6.
Features of test instances.
| Instance | Elements | Edge-configurations | Chemical rooted trees | [lower, upper] | [lower, upper] |
|---|---|---|---|---|---|
| | C,O,N | 10 | 11 | [30,50] | [20,28] |
| | C,O,N | 28 | 40 | [38,38] | [6,6] |
| | C,O,N | 28 | 35 | [50,50] | [30,30] |
| | C,O,N | 28 | 30 | [50,50] | [30,30] |
| | C,O,N | 28 | 25 | [50,50] | [30,30] |
| | C,O,N | 8 | 12 | [46,46] | [24,24] |
| | C,O,N | 7 | 8 | [40,45] | [18,18] |
Stage 4. To solve an MILP in Stage 4, we used CPLEX version 12.10. Tables 7–12 show the results on Stages 4 and 5, where we denote the following:
- : the minimum and maximum values of in over compounds G in in Table 3;
- : (resp., ) denotes the minimum (resp., maximum) target value y with such that the MILP instance for the target value becomes feasible (i.e., admits a target chemical graph ). To determine the minimum and maximum target values and , we solved a large number of MILP instances. Note that the MILP instance may become infeasible for some value y within the range ;
- : a target value in for a property ;
- #v: the number of variables in the MILP in Stage 4;
- #c: the number of constraints in the MILP in Stage 4;
- IP-time: the time (s) to solve the MILP in Stage 4;
- n: the number of non-hydrogen atoms in the chemical graph inferred in Stage 4; and
- : the number of interior-vertices in the chemical graph inferred in Stage 4.
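One plausible way to determine the feasible target range by solving many MILP instances is a scan over candidate target values. The sketch below is only an illustration of that idea: `toy_oracle` is a hypothetical stand-in for "the MILP with target value y admits a solution", and the integer grid is an assumption, not the procedure reported in the paper.

```python
def feasible_range(candidates, is_feasible):
    """Smallest and largest candidate target values accepted by the oracle."""
    accepted = [y for y in candidates if is_feasible(y)]
    if not accepted:
        return None
    return min(accepted), max(accepted)


def toy_oracle(y):
    """Hypothetical stand-in for one MILP feasibility check."""
    # Values inside the range may still be infeasible, as noted above.
    return 352 <= y <= 470 and y % 7 != 0


grid = range(-11, 471)                   # integer grid of candidate targets
print(feasible_range(grid, toy_oracle))  # (352, 470)
```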
Table 7.
Results of Stage 4 for KOW.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−7.53, 13.45] | [−7.0, 13.4] | 3.2 | 7663 | 9162 | 3.9 | 35 | 24 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 9894 | 6626 | 17.5 | 38 | 7 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 11,514 | 8934 | 14.0 | 50 | 30 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 11,318 | 8926 | 24.6 | 50 | 30 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 11,122 | 8918 | 22.0 | 50 | 30 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 7867 | 8630 | 2.1 | 49 | 32 |
| | [−7.53, 13.45] | [−7.5, 13.4] | 3.0 | 5395 | 6899 | 5.2 | 45 | 23 |
Table 8.
Results of Stage 4 for BP.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−11.70, 470.0] | [352, 470] | 411 | 7583 | 8982 | 2.7 | 42 | 25 |
| | [−11.70, 470.0] | [−11, 470] | 229 | 9816 | 6449 | 2.7 | 38 | 7 |
| | [−11.70, 470.0] | [−11, 470] | 229 | 11,436 | 8757 | 9.1 | 50 | 30 |
| | [−11.70, 470.0] | [−11, 470] | 229 | 11,240 | 8749 | 11.0 | 50 | 30 |
| | [−11.70, 470.0] | [−11, 470] | 229 | 11,044 | 8741 | 24.0 | 50 | 30 |
| | [−11.70, 470.0] | [170, 470] | 320 | 7575 | 8450 | 25.9 | 49 | 33 |
| | [−11.70, 470.0] | [151, 470] | 310 | 5315 | 6719 | 4.4 | 43 | 23 |
Table 9.
Results of Stage 4 for MP.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−185.3, 300.0] | [55, 300] | 177.5 | 7602 | 9023 | 16.1 | 41 | 24 |
| | [−185.3, 300.0] | [−180, 300] | 60 | 9833 | 6487 | 2.3 | 38 | 9 |
| | [−185.3, 300.0] | [−185, 300] | 57.4 | 11,453 | 8795 | 44.7 | 50 | 30 |
| | [−185.3, 300.0] | [−185, 300] | 57.4 | 11,257 | 8787 | 10.5 | 50 | 30 |
| | [−185.3, 300.0] | [−185, 300] | 57.4 | 11,061 | 8779 | 93.9 | 50 | 30 |
| | [−185.3, 300.0] | [253, 300] | 260.0 | 7580 | 6172 | 24.0 | 41 | 33 |
| | [−185.3, 300.0] | [−75, 299] | 58 | 5110 | 4050 | 104.6 | 45 | 23 |
Table 10.
Results of Stage 4 for FP.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−82.99, 300.0] | [98, 300] | 199 | 7459 | 8696 | 1.6 | 35 | 22 |
| | [−82.99, 300.0] | [−82, 300] | 109 | 9694 | 6166 | 1.4 | 38 | 8 |
| | [−82.99, 300.0] | [−82, 300] | 109 | 11,314 | 8474 | 8.7 | 50 | 30 |
| | [−82.99, 300.0] | [−82, 300] | 109 | 11,118 | 8466 | 25.8 | 50 | 30 |
| | [−82.99, 300.0] | [−82, 300] | 109 | 10,922 | 8458 | 8.5 | 50 | 30 |
| | [−82.99, 300.0] | [250, 300] | 275 | 7667 | 8170 | 60.9 | 47 | 34 |
| | [−82.99, 300.0] | [54, 300] | 177 | 5193 | 6436 | 2.0 | 45 | 23 |
Table 11.
Results of Stage 4 for LP.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 7597 | 9008 | 1.9 | 39 | 23 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 9836 | 6481 | 2.9 | 38 | 8 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 11,456 | 8789 | 21.1 | 50 | 30 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 11,260 | 8781 | 20.4 | 50 | 30 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 11,064 | 8773 | 24.2 | 50 | 30 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 7801 | 8476 | 1.1 | 47 | 32 |
| | [−3.6, 6.84] | [−3.6, 6.8] | 1.6 | 5335 | 6754 | 4.3 | 45 | 23 |
Table 12.
Results of Stage 4 for SL.
| Instance | Observed range | Feasible target range | Target value | #v | #c | IP-Time (s) | n | Interior-vertices |
|---|---|---|---|---|---|---|---|---|
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 7674 | 9186 | 2.4 | 41 | 23 |
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 9906 | 6650 | 22.3 | 38 | 12 |
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 11,526 | 8958 | 15.2 | 50 | 30 |
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 11,330 | 8950 | 16.2 | 50 | 30 |
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 11,134 | 8942 | 122.7 | 50 | 30 |
| | [−9.33, 1.11] | [−9.3, −2.0] | −5.6 | 7874 | 8648 | 1.2 | 54 | 33 |
| | [−9.33, 1.11] | [−9.3, −3.0] | −6.1 | 5402 | 6917 | 8.1 | 43 | 23 |
Figure 11a illustrates the chemical graph inferred from instance with of KOW in Table 7.
Figure 11.
(a) inferred from with of KOW; (b) inferred from with of LP.
Figure 11b illustrates the chemical graph inferred from instance with of LP in Table 11.
The topological specification of instances , and is more restricted than that of the other instances, and consequently the feasible target range of , and is narrower than the original range for some properties. The running time for solving an MILP instance with is 8.5 to 122 (s), which is much shorter than the running time of 61 to 12058 (s) needed to solve a similar set of MILP instances with in the experimental result for the previous method [28].
Stage 5. We computed chemical isomers of each target chemical graph inferred in Stage 4. When the number of all chemical isomers exceeds 100, we ran the algorithm to generate only up to 100 of them. The algorithm can evaluate a lower bound on the total number of chemical isomers without generating all of them.
Tables 13 and 14 show the computational results of the experiment, where we denote the following:
- DP-time: the running time (s) to execute the dynamic programming algorithm in Stage 5 to compute a lower bound on the number of all chemical isomers of and generate all (or up to 100) chemical isomers ;
- G-LB: a lower bound on the number of all chemical isomers of ; and
- #G: the number of all (or up to 100) chemical isomers of generated in Stage 5.
Table 13.
Results of Stage 5 for KOW, LP, and BP.
| Instance | KOW DP-Time | KOW G-LB | KOW #G | LP DP-Time | LP G-LB | LP #G | BP DP-Time | BP G-LB | BP #G |
|---|---|---|---|---|---|---|---|---|---|
| | 0.031 | 16 | 16 | 0.164 | 128 | 100 | 0.164 | | 100 |
| | 0.149 | | 100 | 0.148 | | 100 | 0.162 | | 100 |
| | 44.1 | | 100 | 118 | 900 | 100 | 171 | 6 | 6 |
| | 27.2 | 20 | 20 | 80.2 | 6 | 6 | 28.6 | 7 | 7 |
| | 0.166 | 6000 | 100 | 73 | 12 | 12 | 142 | 5 | 5 |
| | 0.166 | 6000 | 100 | 0.168 | 288 | 100 | 0.168 | | 100 |
| | 22.3 | | 100 | 1.44 | | 100 | 1.7 | | 100 |
Table 14.
Results of Stage 5 for FP, MP, and SL.
| Instance | FP DP-Time | FP G-LB | FP #G | MP DP-Time | MP G-LB | MP #G | SL DP-Time | SL G-LB | SL #G |
|---|---|---|---|---|---|---|---|---|---|
| | 0.057 | 32 | 32 | 0.165 | 256 | 100 | 0.165 | 1024 | 100 |
| | 0.164 | | 100 | 0.166 | | 100 | 0.163 | | 100 |
| | 28.8 | 720 | 100 | 8.26 | | 100 | 1.07 | | 100 |
| | 72.2 | 27 | 27 | 51.9 | 1 | 1 | 46.5 | 1680 | 100 |
| | 40.3 | 20 | 20 | 125 | | 100 | 7.01 | | 100 |
| | 0.169 | | 100 | 0.173 | 6048 | 100 | 0.168 | 120 | 100 |
| | 0.057 | 32 | 32 | 0.17 | | 100 | 0.165 | 1024 | 100 |
From Tables 13 and 14, we observe that the running time for generating up to 100 target chemical graphs in Stage 5 is not considerably longer than that of Stage 4.
4. Discussions and Conclusions
The framework of designing chemical graphs using ANNs and MILP has been proposed [23] as a basis of a total system of the QSAR and the inverse of QSAR, where the inverse of a prediction function produced by an ANN is solved by an MILP. The merit of the framework is that the inverse problem can be treated exactly as a mathematical problem, and an MILP instance with a moderate size can be efficiently solved with a fast MILP solver. On the other hand, the main technical concern in applying the framework is in defining a feature vector of a chemical graph in terms of graph theoretical descriptors so that the computation of a feature vector can be simulated with a set of linear constraints in an MILP. So far, the framework has been applied to the design of new methods of inferring several restricted classes of chemical graphs such as the graphs with rank at most 2 and the -lean cyclic graphs [26,28].
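To indicate how the computation of an ANN can be expressed with linear constraints, a standard big-M encoding of a single ReLU node is sketched below. This is a textbook formulation given for illustration only; the formulation actually used in the framework may differ. The symbols are assumptions: $x$ is the weighted input sum of the node, $y$ its output, $z$ a binary variable, and $M$ a constant with $|x| \le M$.

```latex
y \ge x, \qquad y \ge 0, \qquad
y \le x + M(1 - z), \qquad y \le M z, \qquad z \in \{0, 1\}.
```

With $z = 1$ the constraints force $y = x \ge 0$, and with $z = 0$ they force $y = 0$ and $x \le 0$, so every feasible solution satisfies $y = \max\{0, x\}$.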
Herein, we examine some technical issues in the previous method before we observe some new features of our method in this paper.
In the feature vector of the previous models [26,28], the only subgraph structure used as a descriptor is a pair of adjacent vertices, called an adjacency-configuration or edge-configuration, which is far more limited than the variety of subgraphs used in more sophisticated feature vectors such as fingerprints. However, including the occurrence of a particular subgraph, even one with only a few vertices, as part of a feature vector may require the MILP to realize a subgraph-isomorphism mechanism that simulates the computation of that occurrence, which can easily make the resulting MILP very complicated and hard to solve. Furthermore, the feature vector can be defined only for cyclic graphs, so any acyclic graphs must be eliminated from the original data set before a prediction function is constructed. This may reduce a data set to an unnecessarily small size or reduce the chances of capturing important information on QSAR over all types of graphs.
A branch-parameter was originally introduced as a new measure to the “agglomeration degree” of trees [24] and then used to define restricted classes of acyclic and cyclic graphs [24,27]. In fact, such a restriction on the structure of target chemical graphs was rather necessary to reduce the size of an MILP formulation that simulates a selection process of a target chemical graph from a supergraph (called a scheme graph), where the number of variables and constraints required to infer a chemical graph with non-hydrogen atoms is when some other parameters such as are regarded as constants.
Although nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24], the way of specifying the topological structure of a target chemical graph in the previous method [26,28] was based on the core and the non-core of a chemical graph, and we could not include a fixed substructure in the non-core of a target chemical graph.
Compared with the previous models, the two-layered model proposed in this paper is rather simple: a chemical graph is regarded as a combination of the interior and the exterior. The new model can deal with chemical compounds with any graph structure and can include a prescribed structure in both the acyclic and cyclic parts of a target chemical graph, as long as the requirement on target chemical graphs is described under the set of specification rules introduced in this paper. This considerably improves the practical applicability of the framework.
The feature vector of our two-layered model can be defined for arbitrary graphs. In the new feature vector, the exterior of a chemical graph is encoded into fringe-configurations, i.e., the occurrence of each chemical rooted tree with height at most , where the set of such chemical rooted trees can be regarded as playing a role similar to that of some types of functional groups. In our method, the occurrence of each such chemical rooted tree is included among the descriptors of a feature vector, so the descriptors for the exterior of a chemical graph may have an effect analogous to that of a fingerprint.
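The frequency-based part of such a feature vector can be sketched as a simple counting step. The example below is a minimal illustration that covers only two groups of descriptors, the edge-configurations of interior-edges and the fringe-trees of the exterior; the string encodings, the vocabularies, and the function name are hypothetical and are not the encodings used in the paper.

```python
from collections import Counter


def feature_vector(interior_edges, fringe_trees, edge_vocab, tree_vocab):
    """Concatenate edge-configuration counts and fringe-tree counts."""
    edge_counts = Counter(interior_edges)
    tree_counts = Counter(fringe_trees)
    return ([edge_counts[e] for e in edge_vocab] +
            [tree_counts[t] for t in tree_vocab])


# Hypothetical string encodings of descriptors.
edge_vocab = ["C-C", "C=C", "C-N", "C-O"]
tree_vocab = ["H", "CH3", "OH", "NH2"]

interior_edges = ["C-C", "C-C", "C=C", "C-O"]
fringe_trees = ["CH3", "OH", "H", "H", "CH3"]

print(feature_vector(interior_edges, fringe_trees, edge_vocab, tree_vocab))
# [2, 1, 0, 1, 2, 2, 1, 0]
```

Restricting `tree_vocab` to a smaller candidate set corresponds to keeping only the descriptors for the chemical rooted trees permitted in a target graph.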
Our specification of target chemical graphs can specify a candidate set of chemical rooted trees that are allowed to be used as chemical rooted trees in the exterior of a target chemical graph. This allows us to control the chemical property of target chemical graphs in a more meaningful way, since the chemical properties of some rooted trees in are known as functional groups and some kinds of rooted trees can be prohibited in a target chemical graph, if necessary, just by excluding them from the candidate set . Although the number of different kinds of such chemical trees in a data set from PubChem is approximately up to 300 for in many cases and the number of input nodes in an ANN becomes over , we derived an MILP formulation for inferring a chemical graph with non-hydrogen atoms and a candidate set of chemical rooted trees by using variables and constraints when the number of interior-vertices is constant, where can be quite small compared with .
We have implemented the proposed method for inferring chemical compounds with a prescribed topological substructure setting . The results of computational experiments using chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophilicity, and solubility suggest that the proposed system can infer chemical graphs with 50 non-hydrogen atoms.
For a larger branch-parameter, say , we obtain a greater variety of chemical rooted trees, which provides new descriptors in a feature vector and new candidates for fringe-trees in the exterior of a target chemical graph, whereas the number of different chemical rooted trees in may increase rapidly.
It is left as future work to use other learning methods such as decision trees, random forests, graph convolution, and ensemble methods in Stages 3 and 4 in the framework.
Acknowledgments
This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ANN | artificial neural network |
| MILP | mixed integer linear programming |
Supplementary Materials
The following are available online at https://www.mdpi.com/1422-0067/22/6/2847/s1.
Author Contributions
Conceptualization, H.N. and T.A.; methodology, H.N.; software, N.A.A., J.Z., Y.S., K.H., and L.Z.; validation, N.A.A., J.Z., and H.N.; formal analysis, H.N.; data resources, L.Z., K.H., H.N., and T.A.; writing—original draft preparation, H.N.; writing—review and editing, N.A.A. and T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.
Data Availability Statement
Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Miyao T., Kaneko H., Funatsu K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 2016;56:286–299. doi: 10.1021/acs.jcim.5b00628.
- 2. Ikebata H., Hongo K., Isomura T., Maezono R., Yoshida R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017;31:379–391. doi: 10.1007/s10822-016-0008-z.
- 3. Rupakheti C., Virshup A., Yang W., Beratan D.N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 2015;55:529–537. doi: 10.1021/ci500749q.
- 4. Fujiwara H., Wang J., Zhao L., Nagamochi H., Akutsu T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 2008;48:1345–1357. doi: 10.1021/ci700385a.
- 5. Kerber A., Laue R., Grüner T., Meringer M. MOLGEN 4.0. MATCH Commun. Math. Comput. Chem. 1998;37:205–208.
- 6. Li J., Nagamochi H., Akutsu T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016;15:633–646. doi: 10.1109/TCBB.2016.2628888.
- 7. Reymond J.L. The chemical space project. Acc. Chem. Res. 2015;48:722–730. doi: 10.1021/ar500432k.
- 8. Bohacek R.S., McMartin C., Guida W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996;16:3–50. doi: 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6.
- 9. Akutsu T., Fukagawa D., Jansson J., Sadakane K. Inferring a graph from path frequency. Discrete Appl. Math. 2012;160:1416–1428. doi: 10.1016/j.dam.2012.02.002.
- 10. Kipf T.N., Welling M. Semi-supervised classification with graph convolutional networks. arXiv. 2016;1609.02907.
- 11. Gómez-Bombarelli R., Wei J.N., Duvenaud D., Hernández-Lobato J.M., Sánchez-Lengeling B., Sheberla D., Aguilera-Iparraguirre J., Hirzel T.D., Adams R.P., Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018;4:268–276. doi: 10.1021/acscentsci.7b00572.
- 12. Segler M.H.S., Kogej T., Tyrchan C., Waller M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2017;4:120–131. doi: 10.1021/acscentsci.7b00512.
- 13. Yang X., Zhang J., Yoshizoe K., Terayama K., Tsuda K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017;18:972–976. doi: 10.1080/14686996.2017.1401424.
- 14. Kusner M.J., Paige B., Hernández-Lobato J.M. Grammar variational autoencoder; Proceedings of the 34th International Conference on Machine Learning; Sydney, NSW, Australia. 6–11 August 2017; pp. 1945–1954.
- 15. De Cao N., Kipf T. MolGAN: An implicit generative model for small molecular graphs. arXiv. 2018;1805.11973.
- 16. Madhawa K., Ishiguro K., Nakago K., Abe M. GraphNVP: An invertible flow model for generating molecular graphs. arXiv. 2019;1905.11600.
- 17. Shi C., Xu M., Zhu Z., Zhang W., Zhang M., Tang J. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv. 2020;2001.09382.
- 18. Cherkasov A., Muratov E.M.N., Fourches D., Varnek A., Baskin I.I., Cronin M., Dearden J., Gramatica P., Martin Y.C., Todeschini R., et al. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014;57:4977–5010. doi: 10.1021/jm4004285.
- 19. Cramer R.D., III, Patterson D.E., Bunce J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988;110:5959–5967. doi: 10.1021/ja00226a005.
- 20. Cramer R.D. Template CoMFA generates single 3D-QSAR models that, for twelve of twelve biological targets, predict all ChEMBL-tabulated affinities. PLoS ONE. 2015;10:e0129307. doi: 10.1371/journal.pone.0129307.
- 21. Moriwaki H., Tian Y.-S., Kawashita N., Takagi T. Three-dimensional classification structure–activity relationship analysis using convolutional neural network. Chem. Pharm. Bull. 2019;67:426–432. doi: 10.1248/cpb.c18-00757.
- 22. Azam N.A., Chiewvanichakorn R., Zhang F., Shurbevski A., Nagamochi H., Akutsu T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming; Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies; Valletta, Malta. 24–26 February 2020; pp. 101–108.
- 23. Zhang F., Zhu J., Chiewvanichakorn R., Shurbevski A., Nagamochi H., Akutsu T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees; Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems; Kitakyushu, Japan. 22–25 September 2020; pp. 433–444.
- 24. Azam N.A., Zhu J., Sun Y., Shi Y., Shurbevski A., Zhao L., Nagamochi H., Akutsu T. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. 2020, submitted. doi: 10.1186/s13015-021-00197-2.
- 25. Ito R., Azam N.A., Wang C., Shurbevski A., Nagamochi H., Akutsu T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming; Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP 2020); Las Vegas, NV, USA. 27–30 July 2020.
- 26. Zhu J., Wang C., Shurbevski A., Nagamochi H., Akutsu T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms. 2020;13:124. doi: 10.3390/a13050124.
- 27. Akutsu T., Nagamochi H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv. 2020;2010.09203. doi: 10.1109/TCBB.2021.3112598.
- 28. Zhu J., Azam N.A., Zhang F., Shurbevski A., Haraguchi K., Zhao L., Nagamochi H., Akutsu T. A novel method for inferring of chemical compounds with prescribed topological substructures based on integer programming. 2020, submitted. doi: 10.1109/TCBB.2021.3112598.
- 29. PubChem. Available online: https://pubchem.ncbi.nlm.nih.gov/ (accessed on 13 May 2020).
- 30. Figshare. Available online: https://figshare.com/articles/dataset/Lipophilicity_Dataset_-_logD7_4_of_1_130_Compounds/5596750/1 (accessed on 13 May 2020).
- 31. A Benchmark for Molecular Machine Learning. Available online: http://moleculenet.ai/datasets-1 (accessed on 13 May 2020).