International Journal of Molecular Sciences. 2021 Mar 11;22(6):2847. doi: 10.3390/ijms22062847

An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming

Yu Shi 1, Jianshen Zhu 1, Naveed Ahmed Azam 1, Kazuya Haraguchi 1, Liang Zhao 2, Hiroshi Nagamochi 1,*, Tatsuya Akutsu 3
Editor: Hanoch Senderowitz
PMCID: PMC8002091  PMID: 33799613

Abstract

A novel framework for inverse quantitative structure–activity relationships (inverse QSAR) has recently been proposed and developed using both artificial neural networks and mixed integer linear programming. However, the classes of chemical graphs treated by the framework are limited. In order to deal with an arbitrary graph in the framework, we introduce a new model, called a two-layered model, and develop a corresponding method. In this model, each chemical graph is regarded as consisting of two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with bounded height, the interior is the connected subgraph obtained by ignoring the exterior, and the feature vector consists of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Our method is more flexible than the existing method in the sense that any type of graph can be inferred. We compared the proposed method with an existing method using several data sets obtained from the PubChem database. The new method could infer more general chemical graphs with up to 50 non-hydrogen atoms. The proposed inverse QSAR method can be applied to the inference of more general chemical graphs than before.

Keywords: QSAR, molecular design, artificial neural network, mixed integer linear programming, enumeration of graphs, cheminformatics, materials informatics

1. Introduction

Computer-aided design of chemical structures is one of the key topics in chemoinformatics. In particular, extensive studies have been done on inverse quantitative structure–activity relationships (inverse QSAR), which seek chemical structures having desired chemical activities under some constraints. In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors in chemoinformatics and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for inverse QSAR [1,2,3]. In many such methods, inference or enumeration of graph structures from a given set of descriptors is a crucial subtask. Although various methods have been developed for that purpose [4,5,6,7], enumeration still remains a challenging task because the number of possible chemical graphs is huge; for example, the number of chemical graphs with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ [8]. Furthermore, even inference is a challenging task because it is NP-hard (computationally difficult) except for some simple cases [9]. Due to this inherent difficulty, most existing methods for inverse QSAR do not guarantee optimal or exact solutions.

On the other hand, the design of novel graph structures has recently become a hot topic in artificial neural network (ANN) studies, and thus extensive studies have been done for inverse QSAR using ANNs, especially with graph convolutional networks [10]. For example, variational autoencoders [11], recurrent neural networks [12,13], grammar variational autoencoders [14], generative adversarial networks [15], and invertible flow models [16,17] have been applied. Note that QSAR using three-dimensional structures of chemical compounds (3D-QSAR) has also been studied [18]. Particularly, comparative molecular field analysis (CoMFA) has been extensively studied and applied to various molecular design problems [19,20]. In CoMFA, electrostatic potential interaction energies across superimposed molecular structures are used as descriptors and then regression is performed by using the partial least squares (PLS) fitting. Recently, deep neural networks have been applied to 3D-QSAR by combining potential interaction energies with convolutional neural networks [21]. However, in order to apply 3D-QSAR, we need to calculate accurate three-dimensional structures of chemical compounds, which is not a straightforward task.

A novel framework for inferring chemical graphs has recently been developed [22,23] based on ANNs and mixed integer linear programming (MILP), as illustrated in Figure 1. It constructs a prediction function in the first phase and infers a chemical graph in the second phase. The first phase of the framework consists of three stages. In Stage 1, we choose a chemical property $\pi$ and a class $\mathcal{G}$ of graphs, where a property function $a$ is defined so that $a(G)$ is the value of $\pi$ in $G\in\mathcal{G}$, and collect a data set $D_\pi$ of chemical graphs in $\mathcal{G}$ such that $a(G)$ is available. In Stage 2, we introduce a feature function $f:\mathcal{G}\to\mathbb{R}^K$ for a positive integer $K$. In Stage 3, we construct a prediction function $\eta_{\mathcal{N}}$ with an ANN $\mathcal{N}$ that, given a vector $x\in\mathbb{R}^K$, returns a value $y=\eta_{\mathcal{N}}(x)\in\mathbb{R}$ so that $\eta_{\mathcal{N}}(f(G))$ serves as a predicted value of $a(G)$ for each $G\in D_\pi$. Given a target chemical value $y^*$, the second phase infers chemical graphs $G^*$ with $\eta_{\mathcal{N}}(f(G^*))=y^*$ in the next two stages. In Stage 4, we formulate an MILP that simulates the construction of $f(G)$ from $G$ and the computation process in the ANN; given a target value $y^*$, we solve the MILP to infer a chemical graph $G^{\dagger}$ and a feature vector $x^*$ such that $f(G^{\dagger})=x^*$ and $\eta_{\mathcal{N}}(x^*)=y^*$. In Stage 5, we generate other chemical graphs $G^*$ such that $\eta_{\mathcal{N}}(f(G^*))=y^*$ based on the output chemical graph $G^{\dagger}$.

Figure 1.


An illustration of a framework for inferring a set of chemical graphs G*.

MILP formulations required in Stage 4 have been designed for chemical compounds with cycle index 0 (i.e., acyclic) [23,24], cycle index 1 [25], and cycle index 2 [26]. In particular, Azam et al. [24] introduced a restricted class of acyclic graphs that is characterized by an integer $\rho$, called a “branch-parameter”, such that the restricted class still covers most of the acyclic chemical compounds in the database.

Recently, Akutsu and Nagamochi [27] extended the idea to define a restricted class of cyclic graphs, called “$\rho$-lean cyclic graphs”, that covers most of the cyclic chemical compounds in the database. Based on this, they also defined a set of rules for specifying several topological substructures of a target chemical graph in a flexible way in Stage 4 before an MILP is solved. The method has been implemented by Zhu et al. [28], and computational results showed that chemical graphs with up to around 50 non-hydrogen atoms can be inferred. Although the method can infer the class of $\rho$-lean cyclic graphs and specify topological structures of the cyclic part, we still need to introduce a new model to deal with an arbitrary graph and to include a prescribed structure in the acyclic part of a target chemical graph.

In this paper, we introduce a new model, called a two-layered model, for representing the feature of a chemical graph in order to deal with an arbitrary graph in the framework. In the two-layered model, a chemical graph $G$ with a parameter $\rho\geq 1$ is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with height at most $\rho$ and the interior is the connected subgraph obtained by ignoring the exterior. We define a feature vector $f(G)$ of a chemical graph $G$ to be the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Figure 2 illustrates an example of a chemical graph $G$. For a branch-parameter $\rho=2$, the interior of the chemical graph $G$ in Figure 2 is obtained by removing the set of vertices with degree 1 $\rho=2$ times, i.e., first remove the set $V_1=\{w_1,w_2,\ldots,w_{14}\}$ of vertices of degree 1 in $G$, and then remove the set $V_2=\{w_{15},w_{16},\ldots,w_{19}\}$ of vertices of degree 1 in $G-V_1$, where the removed vertices become the exterior-vertices of $G$ and there are eight rooted trees $T_1,T_2,\ldots,T_8$ in the exterior of $G$.

Figure 2.


An illustration of a chemical graph $G$, where for $\rho=2$, the exterior-vertices are $w_1,w_2,\ldots,w_{19}$ and the interior-vertices are $u_1,u_2,\ldots,u_{28}$.

We also introduce a new set of rules for specifying topological substructures of a target chemical graph $G$ to be inferred so that a prescribed structure can be included in both the acyclic and cyclic parts of $G$. The set of rules contains (i) a seed graph $G_{\mathrm{C}}$ as an abstract form of a target chemical graph $G$; (ii) a set $\mathcal{F}$ of chemical rooted trees as candidates for trees in the exterior of $G$; and (iii) lower and upper bounds on the number of components in a target chemical graph, such as chemical elements, double/triple bonds, and the interior-vertices in $G$. Figure 3a,b illustrates examples of a seed graph $G_{\mathrm{C}}$ and a set $\mathcal{F}$ of chemical rooted trees, respectively. Given a seed graph $G_{\mathrm{C}}$, the interior of a target chemical graph $G$ is constructed from $G_{\mathrm{C}}$ by replacing some edges $a=uv$ with paths $P_a$ between the end-vertices $u$ and $v$, and by attaching new paths $Q_v$ to some vertices $v$. For example, the chemical graph $G$ in Figure 2 is constructed from the seed graph $G_{\mathrm{C}}$ in Figure 3a as follows. First replace five edges $a_1=u_1u_2$, $a_2=u_1u_3$, $a_3=u_4u_7$, $a_4=u_{10}u_{11}$, and $a_5=u_{11}u_{12}$ in $G_{\mathrm{C}}$ with new paths $P_{a_1}=(u_1,u_{13},u_2)$, $P_{a_2}=(u_1,u_{14},u_3)$, $P_{a_3}=(u_4,u_{15},u_{16},u_7)$, $P_{a_4}=(u_{10},u_{17},u_{18},u_{19},u_{11})$, and $P_{a_5}=(u_{11},u_{20},u_{21},u_{22},u_{12})$, respectively, to obtain the subgraph $G_1$ of $G$ that consists of vertices depicted with squares. Next, attach to this graph $G_1$ three new paths, $Q_{u_5}=(u_5,u_{24})$, $Q_{u_{18}}=(u_{18},u_{25},u_{26},u_{27})$, and $Q_{u_{22}}=(u_{22},u_{28})$, to obtain the interior of $G$ in Figure 2. Finally, the chemical graph $G$ in Figure 2 is obtained by attaching eight trees $T_1,T_2,\ldots,T_8$ selected from the set $\mathcal{F}$ and assigning chemical elements and bond-multiplicities in the interior. The frequency of chemical elements and the graph size are controlled with lower and upper bounds on the components in a target chemical graph $G$. See Section 2.2 for more details on the specification.

Figure 3.


(a) An illustration of a seed graph $G_{\mathrm{C}}$ where the vertices in $V_{\mathrm{C}}$ are depicted with gray squares, the edges in $E_{(\geq 2)}$ are depicted with dotted lines, the edges in $E_{(\geq 1)}$ are depicted with dashed lines, the edges in $E_{(0/1)}$ are depicted with gray bold lines, and the edges in $E_{(=1)}$ are depicted with black solid lines. (b) A set $\mathcal{F}=\{\psi_1,\psi_2,\ldots,\psi_{11}\}\subseteq\mathcal{F}(D_\pi)$ of 11 chemical rooted trees $\psi_i$, $i\in[1,11]$, where the root of each tree is depicted with a black circle.

We implemented the two-layered model, and the results of computational experiments suggest that the proposed method can infer chemical graphs with up to around 50 non-hydrogen atoms.

The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 introduces a method of specifying topological substructures of target chemical graphs in Stage 4. Section 3 reports the results of some computational experiments conducted for chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophilicity, and solubility. Section 4 makes some concluding remarks. An MILP formulation used in Stage 4 and a review of the dynamic programming algorithm for generating isomers in Stage 5 can be found in the Supplementary Materials. The proposed method/system is available at GitHub https://github.com/ku-dml/mol-infer.

2. Materials and Methods

This section presents mathematical details of our developed method. Readers not interested in mathematical details can skip this section.

2.1. Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.

Let $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers $a$ and $b$, let $[a,b]$ denote the set of integers $i$ with $a\leq i\leq b$.

Graphs. Given a graph $G$, let $V(G)$ and $E(G)$ denote the sets of vertices and edges, respectively. For a subset $V'\subseteq V(G)$ (resp., $E'\subseteq E(G)$) of a graph $G$, let $G-V'$ (resp., $G-E'$) denote the graph obtained from $G$ by removing the vertices in $V'$ (resp., the edges in $E'$), where we remove all edges incident to a vertex in $V'$ in $G-V'$. The rank $r(G)$ of a graph $G$ is defined to be the minimum $|F|$ of an edge subset $F\subseteq E(G)$ such that $G-F$ contains no cycle. A path with two end-vertices $u$ and $v$ is called a $u,v$-path. An edge $e=u_1u_2$ in a connected graph $G$ is called a bridge if the graph $G-e$ obtained from $G$ by removing edge $e$ is not connected, i.e., $G-e$ consists of two connected graphs $G_i$ containing vertex $u_i$, $i=1,2$. For a cyclic graph $G$, an edge $e$ is called a core-edge if it is in a cycle of $G$ or is a bridge $e=u_1u_2$ such that each of the connected graphs $G_i$, $i=1,2$, of $G-e$ contains a cycle. A vertex incident to a core-edge is called a core-vertex of $G$.
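The rank defined above equals the number of edges minus the number of vertices plus the number of connected components, which is a standard graph-theoretic identity. The following minimal Python sketch computes it with a union-find structure; the function name `graph_rank` and the input format (a vertex list and an edge list) are illustrative assumptions, not part of the original method.

```python
def graph_rank(vertices, edges):
    """Rank r(G) = |E| - |V| + (number of connected components).

    `vertices` is an iterable of hashable labels and `edges` a list of
    (u, v) pairs; parallel edges are allowed. Both names are hypothetical.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Find the component representative, compressing the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return len(edges) - len(parent) + components


# A triangle needs one edge removed to become acyclic; a path needs none.
triangle_rank = graph_rank([1, 2, 3], [(1, 2), (2, 3), (3, 1)])
path_rank = graph_rank([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
```

Under this identity a tree has rank 0 and each additional independent cycle increases the rank by one.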

A vertex designated in a graph $G$ is called a root. In this paper, we designate at most two vertices as roots, and denote by $\mathrm{Rt}(G)$ the set of roots of $G$. We call a graph $G$ rooted (resp., bi-rooted) if $|\mathrm{Rt}(G)|=1$ (resp., $|\mathrm{Rt}(G)|=2$), where we call $G$ unrooted if $\mathrm{Rt}(G)=\emptyset$.

For a graph $G$, possibly with roots, a leaf-vertex is defined to be a non-root vertex $v\in V(G)\setminus\mathrm{Rt}(G)$ with degree 1. We call the edge $uv$ incident to a leaf-vertex $v$ a leaf-edge, and denote by $V_{\mathrm{leaf}}(G)$ and $E_{\mathrm{leaf}}(G)$ the sets of leaf-vertices and leaf-edges in $G$, respectively. For a graph or a rooted graph $G$, we define graphs $G_i$, $i\in\mathbb{Z}_+$, obtained from $G$ by removing the set of leaf-vertices $i$ times so that

$$G_0:=G;\quad G_{i+1}:=G_i-V_{\mathrm{leaf}}(G_i),$$

where we call a vertex $v\in V_{\mathrm{leaf}}(G_k)$ a leaf $k$-branch and we say that a vertex $v\in V_{\mathrm{leaf}}(G_k)$ has height $\mathrm{ht}(v)=k$ in $G$. The height $\mathrm{ht}(T)$ of a rooted tree $T$ is defined to be the maximum of $\mathrm{ht}(v)$ of a vertex $v\in V(T)$. For an integer $k\geq 0$, we call a rooted tree $T$ $k$-lean if $T$ has at most one leaf $k$-branch. For an unrooted cyclic graph $G$, we regard the set of non-core-edges in $G$ as inducing a collection $\mathcal{T}$ of trees each of which is rooted at a core-vertex, where we call $G$ $k$-lean if each of the rooted trees in $\mathcal{T}$ is $k$-lean. Nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24].

Two-layered Model. Let $G$ be an unrooted graph. For an integer $\rho\geq 0$, which we call a branch-parameter, a two-layered model of $G$ is a partition of $G$ into an “interior” and an “exterior” in the following way. We call a vertex $v\in V(G)$ (resp., an edge $e\in E(G)$) of $G$ an exterior-vertex (resp., exterior-edge) if $\mathrm{ht}(v)<\rho$ (resp., $e$ is incident to an exterior-vertex), denote the sets of exterior-vertices and exterior-edges by $V^{\mathrm{ex}}(G)$ and $E^{\mathrm{ex}}(G)$, respectively, and set $V^{\mathrm{int}}(G)=V(G)\setminus V^{\mathrm{ex}}(G)$ and $E^{\mathrm{int}}(G)=E(G)\setminus E^{\mathrm{ex}}(G)$. We call a vertex in $V^{\mathrm{int}}(G)$ (resp., an edge in $E^{\mathrm{int}}(G)$) an interior-vertex (resp., interior-edge). The set $E^{\mathrm{ex}}(G)$ of exterior-edges forms a collection of connected graphs each of which is regarded as a rooted tree $T$ rooted at the vertex $v\in V(T)$ with the maximum $\mathrm{ht}(v)$, where we call $T$ a $\rho$-fringe-tree (or a fringe-tree). Let $\mathcal{T}^{\mathrm{ex}}(G)$ denote the set of fringe-trees in $G$. The interior of $G$ is defined to be the subgraph $(V^{\mathrm{int}}(G),E^{\mathrm{int}}(G))$ of $G$. Note that every core-vertex (resp., core-edge) in $G$ is an interior-vertex (resp., interior-edge) of $G$. Figure 2 illustrates an example of a graph $G$ such that $V^{\mathrm{int}}=\{u_1,u_2,\ldots,u_{28}\}$, $V^{\mathrm{ex}}=\{w_1,w_2,\ldots,w_{19}\}$ and $\mathcal{T}^{\mathrm{ex}}(G)=\{T_1,T_2,\ldots,T_8\}$ for a branch-parameter $\rho=2$.
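The partition into interior and exterior can be sketched by peeling leaf-vertices $\rho$ times, recording the round in which each vertex is removed as its height. The Python sketch below assumes an unrooted simple graph given as an adjacency dictionary; the names `leaf_heights` and `two_layered` and the small example graph are hypothetical and serve only as an illustration.

```python
import math


def leaf_heights(adj, rounds):
    """Peel degree-1 vertices for at most `rounds` rounds.

    adj maps each vertex to the set of its neighbours. A vertex removed in
    round k gets height k; vertices never removed keep height infinity.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    height = {v: math.inf for v in adj}
    for k in range(rounds):
        leaves = [v for v in alive if degree[v] == 1]
        if not leaves:
            break
        for v in leaves:
            height[v] = k
        alive.difference_update(leaves)
        # Update the degrees of surviving neighbours only after the whole
        # round, so that all leaves of G_k are removed simultaneously.
        for v in leaves:
            for w in adj[v]:
                if w in alive:
                    degree[w] -= 1
    return height


def two_layered(adj, rho):
    """Return (interior vertices, exterior vertices, interior edges)."""
    height = leaf_heights(adj, rho)
    exterior = {v for v, h in height.items() if h < rho}
    interior = set(adj) - exterior
    interior_edges = {
        frozenset((u, w)) for u in interior for w in adj[u] if w in interior
    }
    return interior, exterior, interior_edges


# A triangle a-b-c with a pendant chain c-d-e-f.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
interior, exterior, interior_edges = two_layered(adj, rho=2)
```

With $\rho=2$, vertex `f` has height 0 and `e` has height 1, so both are exterior-vertices, while `d` stays in the interior together with the triangle.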

2.1.1. Modeling of Chemical Compounds

To represent a chemical compound, we assume that each chemical element $a$ has a unique valence $\mathrm{val}(a)\in[1,4]$ and we use a hydrogen-suppressed model, because hydrogen atoms can be added at the final stage under the assumption. In the hydrogen-suppressed model, a chemical compound $C$ is represented by a tuple $G=(H,\alpha,\beta)$ of a simple, connected undirected graph $H$ and functions $\alpha:V(H)\to\Lambda$ and $\beta:E(H)\to[1,3]$, where $\Lambda$ is a set of non-hydrogen chemical elements such as C (carbon), O (oxygen), N (nitrogen), and so on. The set of atoms and the set of bonds in the compound $C$ are represented by the vertex set $V(H)$ and the edge set $E(H)$, respectively. The chemical element assigned to a vertex $v\in V(H)$ is represented by $\alpha(v)$ and the bond-multiplicity between two adjacent vertices $u,v\in V(H)$ is represented by $\beta(e)$ of the edge $e=uv\in E(H)$. We say that two tuples $(H_i,\alpha_i,\beta_i)$, $i=1,2$, are isomorphic if they admit an isomorphism $\phi$, i.e., a bijection $\phi:V(H_1)\to V(H_2)$ such that $uv\in E(H_1)$, $\alpha_1(u)=a$, $\alpha_1(v)=b$, $\beta_1(uv)=m$ $\Leftrightarrow$ $\phi(u)\phi(v)\in E(H_2)$, $\alpha_2(\phi(u))=a$, $\alpha_2(\phi(v))=b$, $\beta_2(\phi(u)\phi(v))=m$. When $H_i$ is rooted at a vertex $r_i$, $i=1,2$, the tuples $(H_i,\alpha_i,\beta_i)$, $i=1,2$, are rooted-isomorphic (r-isomorphic) if they admit an isomorphism $\phi$ such that $\phi(r_1)=r_2$. Chemical rooted trees $T_1$ and $T_5$ in Figure 2 are r-isomorphic.

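Associated with the two functions $\alpha$ and $\beta$ in a tuple $G=(H,\alpha,\beta)$, we introduce the following functions: $\beta_G:V(H)\to[0,12]$, $\mathrm{ac}:E(H)\to\Lambda\times\Lambda\times[1,3]$, $\mathrm{cs}:V(H)\to\Lambda\times[1,4]$, and $\mathrm{ec}:E(H)\to(\Lambda\times[1,4])\times(\Lambda\times[1,4])\times[1,3]$.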

For notational convenience, we use the function $\beta_G$ such that $\beta_G(u)$ means the sum of bond-multiplicities of edges incident to a vertex $u$, i.e.,

$$\beta_G(u)\triangleq\sum_{uv\in E(H)}\beta(uv)\quad\text{for each vertex } u\in V(H).$$

A chemical graph $G$ is defined to be a tuple $(H,\alpha,\beta)$ such that the valence condition at each vertex $v\in V(H)$ is satisfied, i.e.,

$$\beta_G(v)\leq\mathrm{val}(\alpha(v)),$$

where we define the hydro-degree $\mathrm{deg}_{\mathrm{hyd}}(v)$ of a vertex $v$ to be $\mathrm{val}(\alpha(v))-\beta_G(v)$.
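As a small illustration of the valence condition, the following Python sketch computes $\beta_G$ and the hydro-degree of every vertex and rejects a tuple that violates the condition. The valence table (C: 4, N: 3, O: 2), the function name `hydro_degrees`, and the dictionary-based encoding of $\alpha$ and $\beta$ are assumptions made for this example.

```python
# Assumed valences for a few elements; the model requires a unique valence
# in [1, 4] for every element of the chosen set.
VALENCE = {"C": 4, "N": 3, "O": 2}


def hydro_degrees(alpha, bonds, valence=VALENCE):
    """Return the hydro-degree val(alpha(v)) - beta_G(v) of each vertex.

    alpha maps vertex -> element symbol; bonds maps (u, v) -> multiplicity
    in [1, 3]. Raises ValueError if the valence condition fails somewhere.
    """
    beta_G = {v: 0 for v in alpha}
    for (u, v), m in bonds.items():
        beta_G[u] += m
        beta_G[v] += m
    hyd = {}
    for v, element in alpha.items():
        slack = valence[element] - beta_G[v]
        if slack < 0:
            raise ValueError(f"valence condition violated at vertex {v}")
        hyd[v] = slack
    return hyd


# Hydrogen-suppressed acetic acid: C1-C2, C2=O3, C2-O4.
alpha = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
hyd = hydro_degrees(alpha, bonds)
```

Here the hydro-degrees recover the three hydrogens of the methyl carbon and the single hydrogen of the hydroxyl oxygen.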

Figure 2 illustrates an example of a chemical graph G=(H,α,β).

To represent a feature of an edge $e=uv\in E(H)$ such that $\alpha(u)=a$, $\alpha(v)=b$ and $\beta(e)=m$ in a chemical graph $G=(H,\alpha,\beta)$, we use a tuple $(a,b,m)\in\Lambda\times\Lambda\times[1,3]$, which we call the adjacency-configuration $\mathrm{ac}(e)$ of the edge $e$. We introduce a total order $<$ over the elements in $\Lambda$ to distinguish between $(a,b,m)$ and $(b,a,m)$ ($a\neq b$) notationally. For a tuple $\nu=(a,b,m)$, let $\overline{\nu}$ denote the tuple $(b,a,m)$.

To represent a feature of a vertex $v\in V(H)$ with $\alpha(v)=a$ that has $d$ neighboring atoms in a chemical graph $G=(H,\alpha,\beta)$, we use a pair $(a,d)\in\Lambda\times[1,4]$, which we call the chemical symbol $\mathrm{cs}(v)$ of the vertex $v$. We treat $(a,d)$ as a single symbol $ad$, and define $\Lambda_{\mathrm{dg}}$ to be the set of all chemical symbols $\mu=ad\in\Lambda\times[1,4]$.

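To represent a feature of an edge $e=uv\in E(H)$ such that $\mathrm{cs}(u)=\mu$, $\mathrm{cs}(v)=\xi$ and $\beta(e)=m$ in a chemical graph $G=(H,\alpha,\beta)$, we use a tuple $(\mu,\xi,m)\in\Lambda_{\mathrm{dg}}\times\Lambda_{\mathrm{dg}}\times[1,3]$, which we call the edge-configuration $\mathrm{ec}(e)$ of the edge $e$. We introduce a total order $<$ over the elements in $\Lambda_{\mathrm{dg}}$ to distinguish between $(\mu,\xi,m)$ and $(\xi,\mu,m)$ ($\mu\neq\xi$) notationally. For a tuple $\gamma=(\mu,\xi,m)$, let $\overline{\gamma}$ denote the tuple $(\xi,\mu,m)$.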
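The three kinds of labels introduced above can be computed directly from $\alpha$ and $\beta$. The sketch below is illustrative only: the function name `configurations` is hypothetical, and Python's built-in ordering of strings and tuples is used as a stand-in for the total orders on $\Lambda$ and $\Lambda_{\mathrm{dg}}$, which the text leaves unspecified.

```python
def configurations(alpha, bonds):
    """Compute cs(v) for every vertex and ac(e), ec(e) for every edge.

    alpha maps vertex -> element symbol; bonds maps (u, v) -> multiplicity.
    Each edge label is stored with its smaller component first, using
    Python's default ordering as an assumed total order.
    """
    degree = {v: 0 for v in alpha}
    for u, v in bonds:
        degree[u] += 1
        degree[v] += 1
    # Chemical symbol: (element, number of neighbouring non-hydrogen atoms).
    cs = {v: (alpha[v], degree[v]) for v in alpha}
    ac, ec = {}, {}
    for (u, v), m in bonds.items():
        a, b = sorted((alpha[u], alpha[v]))
        mu, xi = sorted((cs[u], cs[v]))
        ac[(u, v)] = (a, b, m)      # adjacency-configuration
        ec[(u, v)] = (mu, xi, m)    # edge-configuration
    return cs, ac, ec


# Hydrogen-suppressed acetic acid: C1-C2, C2=O3, C2-O4.
alpha = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
cs, ac, ec = configurations(alpha, bonds)
```

For the carbonyl bond this gives the adjacency-configuration (C, O, 2) and the edge-configuration (C3, O1, 2), showing how the edge-configuration refines the adjacency-configuration by the degrees of the end-vertices.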

To represent a feature of the exterior of a chemical graph $G=(H,\alpha,\beta)$, a $\rho$-fringe-tree in $\mathcal{T}^{\mathrm{ex}}(G)$ is called a fringe-configuration in the exterior.

2.1.2. Introducing Descriptors of Feature Vectors

This section introduces descriptors to define our feature vectors. Let $\pi$ be a chemical property for which we will construct a prediction function $\eta_{\mathcal{N}}$ from a feature vector $f(G)$ of a chemical graph to a predicted value $y\in\mathbb{R}$ for the chemical property of $G$.

We first choose a set $\Lambda$ of non-hydrogen chemical elements and then collect a data set $D_\pi$ of chemical compounds $C$ whose chemical elements belong to $\Lambda\cup\{\mathrm{H}\}$, where we regard $D_\pi$ as a set of chemical graphs that represent the chemical compounds $C$ in $D_\pi$. To define the interior/exterior of chemical graphs $G\in D_\pi$, we next choose a branch-parameter $\rho$, where we recommend $\rho=2$.

Let $\Lambda^{\mathrm{int}}(D_\pi)$ (resp., $\Lambda^{\mathrm{ex}}(D_\pi)$) denote the set of chemical elements used in the set of interior-vertices (resp., exterior-vertices) over all chemical graphs $G\in D_\pi$, and $\Gamma^{\mathrm{int}}(D_\pi)$ denote the set of edge-configurations used in the set of interior-edges over all chemical graphs $G\in D_\pi$. Let $\mathcal{F}(D_\pi)$ denote the set of chemical rooted trees $\psi$ r-isomorphic to a $\rho$-fringe-tree $T\in\mathcal{T}^{\mathrm{ex}}(G)$ over all chemical graphs $G\in D_\pi$.

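We define an integer encoding of a finite set $A$ of elements to be a bijection $\sigma:A\to[1,|A|]$, where we denote by $[A]$ the set $[1,|A|]$ of integers. We introduce an integer encoding of each of the sets $\Lambda^{\mathrm{int}}(D_\pi)$, $\Lambda^{\mathrm{ex}}(D_\pi)$, $\Gamma^{\mathrm{int}}(D_\pi)$ and $\mathcal{F}(D_\pi)$. Let $[a]^{\mathrm{int}}$ (resp., $[a]^{\mathrm{ex}}$) denote the coded integer of an element $a\in\Lambda^{\mathrm{int}}(D_\pi)$ (resp., $a\in\Lambda^{\mathrm{ex}}(D_\pi)$), $[\gamma]$ denote the coded integer of an element $\gamma$ in $\Gamma^{\mathrm{int}}(D_\pi)$, and $[\psi]$ denote the coded integer of an element $\psi$ in $\mathcal{F}(D_\pi)$.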

For each chemical element $a\in\Lambda$, let $\mathrm{mass}(a)$ and $\mathrm{val}(a)$ denote the mass and valence of $a$, respectively. In our model, we use integers $\mathrm{mass}^*(a)=\lfloor 10\cdot\mathrm{mass}(a)\rfloor$, $a\in\Lambda$.

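We define the feature vector $f(G)$ of a chemical graph $G=(H,\alpha,\beta)\in D_\pi$ to be a vector that consists of the following non-negative integer descriptors $\mathrm{dcp}_i(G)$, $i\in[1,K]$, where $K=17+|\Lambda^{\mathrm{int}}(D_\pi)|+|\Lambda^{\mathrm{ex}}(D_\pi)|+|\Gamma^{\mathrm{int}}(D_\pi)|+|\mathcal{F}(D_\pi)|$.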

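  1. $\mathrm{dcp}_1(G)$: the number $n(G)=|V(G)|$ of vertices in $G$.

  2. $\mathrm{dcp}_2(G)$: the number $|V^{\mathrm{int}}(G)|$ of interior-vertices in $G$.

  3. $\mathrm{dcp}_3(G)$: the average $\overline{\mathrm{ms}}(G)$ of $\mathrm{mass}^*$ over all non-hydrogen atoms in $G$, i.e., $\overline{\mathrm{ms}}(G)\triangleq\sum_{v\in V(G)}\mathrm{mass}^*(\alpha(v))/n(G)$.

  4. $\mathrm{dcp}_i(G)$, $i=3+d$, $d\in[1,4]$: the number $\mathrm{dg}_d(G)$ of interior-vertices of degree $d$ in $G$.

  5. $\mathrm{dcp}_i(G)$, $i=7+d$, $d\in[1,4]$: the number $\mathrm{dg}_d^{\mathrm{int}}(G)$ of interior-vertices of interior-degree $\mathrm{deg}_{(V^{\mathrm{int}},E^{\mathrm{int}})}(v)=d$ in the interior $(V^{\mathrm{int}},E^{\mathrm{int}})$ of $G$.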

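  6. $\mathrm{dcp}_i(G)$, $i=12+d$, $d\in[0,3]$: the number $\mathrm{hydg}_d(G)$ of vertices in $G$ of hydro-degree $\mathrm{deg}_{\mathrm{hyd}}(v)=d$.

  7. $\mathrm{dcp}_i(G)$, $i=14+m$, $m\in[2,3]$: the number $\mathrm{bd}_m^{\mathrm{int}}(G)$ of interior-edges with bond multiplicity $m$ in $G$, i.e., $\mathrm{bd}_m^{\mathrm{int}}(G)\triangleq|\{e\in E^{\mathrm{int}}\mid\beta(e)=m\}|$.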

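  8. $\mathrm{dcp}_i(G)$, $i=17+[a]^{\mathrm{int}}$, $a\in\Lambda^{\mathrm{int}}(D_\pi)$: the frequency $\mathrm{na}_a^{\mathrm{int}}(G)$ of chemical element $a$ in the set of interior-vertices in $G$.

  9. $\mathrm{dcp}_i(G)$, $i=17+|\Lambda^{\mathrm{int}}(D_\pi)|+[a]^{\mathrm{ex}}$, $a\in\Lambda^{\mathrm{ex}}(D_\pi)$: the frequency $\mathrm{na}_a^{\mathrm{ex}}(G)$ of chemical element $a$ in the set of exterior-vertices in $G$.

  10. $\mathrm{dcp}_i(G)$, $i=17+|\Lambda^{\mathrm{int}}(D_\pi)|+|\Lambda^{\mathrm{ex}}(D_\pi)|+[\gamma]$, $\gamma\in\Gamma^{\mathrm{int}}(D_\pi)$: the frequency $\mathrm{ec}_\gamma(G)$ of edge-configuration $\gamma$ in the set of interior-edges $e\in E^{\mathrm{int}}$ in $G$.

  11. $\mathrm{dcp}_i(G)$, $i=17+|\Lambda^{\mathrm{int}}(D_\pi)|+|\Lambda^{\mathrm{ex}}(D_\pi)|+|\Gamma^{\mathrm{int}}(D_\pi)|+[\psi]$, $\psi\in\mathcal{F}(D_\pi)$: the frequency $\mathrm{fc}_\psi(G)$ of fringe-configuration $\psi$ in the set of $\rho$-fringe-trees in $G$.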
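To make some of the descriptors above concrete, the following Python sketch computes several of them for a small chemical graph whose interior is supplied by the caller. It is an illustration under stated assumptions rather than the feature function used in the experiments: the atomic masses and valences, the function name `descriptors`, and the choice to return named entries instead of an indexed vector are all hypothetical, and the edge-configuration and fringe-configuration frequencies are omitted.

```python
import math
from collections import Counter

# Assumed standard atomic masses and valences for a few elements.
MASS = {"C": 12.011, "N": 14.007, "O": 15.999}
VALENCE = {"C": 4, "N": 3, "O": 2}


def descriptors(alpha, bonds, interior):
    """Compute a subset of the descriptors as a dictionary.

    alpha: vertex -> element; bonds: (u, v) -> multiplicity;
    interior: set of interior-vertices (assumed to be precomputed).
    """
    n = len(alpha)
    degree, bond_sum = Counter(), Counter()
    for (u, v), m in bonds.items():
        degree[u] += 1
        degree[v] += 1
        bond_sum[u] += m
        bond_sum[v] += m
    # mass*(a) = floor(10 * mass(a)), an integer per element.
    mass_star = {a: math.floor(10 * MASS[a]) for a in set(alpha.values())}
    interior_edges = [e for e in bonds if e[0] in interior and e[1] in interior]
    return {
        "n": n,
        "n_int": len(interior),
        "ms_avg": sum(mass_star[a] for a in alpha.values()) / n,
        "dg": {d: sum(1 for v in interior if degree[v] == d) for d in range(1, 5)},
        "hydg": {
            d: sum(1 for v in alpha if VALENCE[alpha[v]] - bond_sum[v] == d)
            for d in range(0, 4)
        },
        "bd_int": {m: sum(1 for e in interior_edges if bonds[e] == m) for m in (2, 3)},
        "na_int": Counter(alpha[v] for v in interior),
        "na_ex": Counter(alpha[v] for v in alpha if v not in interior),
    }


# Hydrogen-suppressed acetic acid; for rho = 1 its interior is the single
# carboxyl carbon (vertex 2), since the other three vertices are leaves.
alpha = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
x = descriptors(alpha, bonds, interior={2})
```

In a full implementation the named entries would be placed at the indices given in the list above, using the integer encodings of the element, edge-configuration, and fringe-configuration sets.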

2.2. Specifying Target Chemical Graphs

Given a prediction function $\eta_{\mathcal{N}}$ and a target value $y^*\in\mathbb{R}$, we call a chemical graph $G^*$ such that $\eta_{\mathcal{N}}(x^*)=y^*$ for the feature vector $x^*=f(G^*)$ a target chemical graph. This section presents a set of rules for specifying topological substructures of a target chemical graph in a flexible way in Stage 4.

We first describe how to reduce a chemical graph G=(H,α,β) into an abstract form based on which our specification rules will be defined. To illustrate the reduction process, we use the chemical graph G=(H,α,β) in Figure 2.

  • R1

    Removal of all $\rho$-fringe-trees: The interior $H^{\mathrm{int}}=(V^{\mathrm{int}}(H),E^{\mathrm{int}}(H))$ of $G$ is obtained by removing the non-root vertices of each $\rho$-fringe-tree $T\in\mathcal{T}^{\mathrm{ex}}(G)$. Figure 4 illustrates the interior $H^{\mathrm{int}}$ of chemical graph $G$ with $\rho=2$ in Figure 2.

  • R2

    Removal of some leaf paths: We call a $u,v$-path $Q$ in $H^{\mathrm{int}}$ a leaf path if vertex $v$ is a leaf-vertex of $H^{\mathrm{int}}$ and the degree of each internal vertex of $Q$ in $H^{\mathrm{int}}$ is 2, where we regard $Q$ as rooted at vertex $u$. A connected subgraph $S$ of the interior $H^{\mathrm{int}}$ of $G$ is called a cyclical-base if $S$ is obtained from $H^{\mathrm{int}}$ by removing the vertices in $V(Q_u)\setminus\{u\}$, $u\in X$, for a subset $X$ of interior-vertices and a set $\{Q_u\mid u\in X\}$ of leaf $u,v$-paths $Q_u$ such that no two paths $Q_u$ and $Q_{u'}$ share a vertex. Figure 5a illustrates a cyclical-base $S=H^{\mathrm{int}}-\bigcup_{u\in X}(V(Q_u)\setminus\{u\})$ of the interior $H^{\mathrm{int}}$ for a set $\{Q_{u_5}=(u_5,u_{24}),Q_{u_{18}}=(u_{18},u_{25},u_{26},u_{27}),Q_{u_{22}}=(u_{22},u_{28})\}$ of leaf paths in Figure 4.

  • R3

    Contraction of some pure paths: A path in $S$ is called pure if each internal vertex of the path is of degree 2. Choose a set $\mathcal{P}$ of several pure paths in $S$ so that no two paths share vertices except for their end-vertices. A graph $S'$ is called a contraction of a graph $S$ (with respect to $\mathcal{P}$) if $S'$ is obtained from $S$ by replacing each pure $u,v$-path with a single edge $a=uv$, where $S'$ may contain multiple edges between the same pair of adjacent vertices. Figure 5b illustrates a contraction $S'$ obtained from the chemical graph $S$ by contracting each $u,v$-path $P_a\in\mathcal{P}$ into a new edge $a=uv$, where $a_1=u_1u_2$, $a_2=u_1u_3$, $a_3=u_4u_7$, $a_4=u_{10}u_{11}$, and $a_5=u_{11}u_{12}$, and $\mathcal{P}=\{P_{a_1}=(u_1,u_{13},u_2),P_{a_2}=(u_1,u_{14},u_3),P_{a_3}=(u_4,u_{15},u_{16},u_7),P_{a_4}=(u_{10},u_{17},u_{18},u_{19},u_{11}),P_{a_5}=(u_{11},u_{20},u_{21},u_{22},u_{12})\}$ is the set of pure paths in Figure 5a.
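The contraction step R3 can be sketched as follows. The Python code below assumes that $S$ is a simple graph given as an edge list and that the chosen pure paths are given as vertex sequences; the function name `contract_pure_paths` is hypothetical, and the sketch checks only the degree-2 condition on internal vertices, not that distinct paths are internally disjoint.

```python
from collections import Counter


def contract_pure_paths(edges, paths):
    """Replace each pure path (a vertex sequence) with a single edge.

    edges: list of (u, v) pairs of a simple graph S.
    paths: list of vertex sequences (u, ..., v) whose internal vertices
           must all have degree 2 in S.
    Returns the edge list of the contraction, possibly with multiple edges.
    """
    remaining = Counter(frozenset(e) for e in edges)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    new_edges = []
    for path in paths:
        if any(degree[w] != 2 for w in path[1:-1]):
            raise ValueError(f"path {path} is not pure")
        for u, v in zip(path, path[1:]):
            key = frozenset((u, v))
            if remaining[key] == 0:
                raise ValueError(f"edge {u}-{v} is not available in S")
            remaining[key] -= 1
        new_edges.append((path[0], path[-1]))
    kept = [tuple(sorted(e)) for e, count in remaining.items() for _ in range(count)]
    return kept + new_edges


# A 4-cycle 1-13-2-3-1: contracting the pure path (1, 13, 2) gives a triangle.
cycle = [(1, 13), (13, 2), (2, 3), (3, 1)]
triangle = contract_pure_paths(cycle, [(1, 13, 2)])

# If S already has the edge 1-2, contraction creates a pair of parallel edges.
doubled = contract_pure_paths([(1, 2), (1, 13), (13, 2)], [(1, 13, 2)])
```

The second call illustrates why a contraction, and hence a seed graph, is allowed to have multiple edges.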

Figure 4.


The interior $H^{\mathrm{int}}$ of chemical graph $G$ with $\rho=2$ in Figure 2.

Figure 5.


(a) A cyclical-base $S=H^{\mathrm{int}}-\bigcup_{u\in\{u_5,u_{18},u_{22}\}}(V(Q_u)\setminus\{u\})$ of the interior $H^{\mathrm{int}}$ in Figure 4; (b) A contraction $S'$ of $S$ for a pure path set $\mathcal{P}=\{P_{a_1},P_{a_2},\ldots,P_{a_5}\}$ in (a), where a new edge obtained by contracting a pure path is depicted with a thick line.

We will define a set of rules so that a chemical graph can be obtained from a graph (called a seed graph in the next section) by applying processes R3 to R1 in a reverse way. We specify topological substructures of a target chemical graph with a tuple $(G_{\mathrm{C}},\sigma_{\mathrm{int}},\sigma_{\mathrm{ce}})$ called a target specification defined under the set of the following rules.

Seed Graphs

A seed graph $G_{\mathrm{C}}=(V_{\mathrm{C}},E_{\mathrm{C}})$ is defined to be a graph (possibly with multiple edges) such that the edge set $E_{\mathrm{C}}$ consists of four sets $E_{(\geq 2)}$, $E_{(\geq 1)}$, $E_{(0/1)}$, and $E_{(=1)}$, where each of them can be empty. A seed graph plays the role of the most abstract form $S'$ in R3. Figure 3a illustrates an example of a seed graph, where $V_{\mathrm{C}}=\{u_1,u_2,\ldots,u_{12}\}$, $E_{(\geq 2)}=\{a_1,a_2,\ldots,a_5\}$, $E_{(\geq 1)}=\{a_6\}$, $E_{(0/1)}=\{a_7\}$, and $E_{(=1)}=\{a_8,a_9,\ldots,a_{17}\}$.

A subdivision $S$ of $G_{\mathrm{C}}$ is a graph constructed from a seed graph $G_{\mathrm{C}}$ according to the following rules:

  • Each edge $e=uv\in E_{(\geq 2)}$ is replaced with a $u,v$-path $P_e$ of length at least 2;

  • Each edge $e=uv\in E_{(\geq 1)}$ is replaced with a $u,v$-path $P_e$ of length at least 1 (equivalently $e$ is directly used or replaced with a $u,v$-path $P_e$ of length at least 2);

  • Each edge $e\in E_{(0/1)}$ is either used or discarded; and

  • Each edge $e\in E_{(=1)}$ is always used directly.
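The four rules above can be illustrated with a short Python sketch that builds one subdivision from a seed graph once a path length has been chosen for every seed edge. The kind labels `"ge2"`, `"ge1"`, `"0/1"`, `"eq1"`, the function name `subdivide`, and the naming of new internal vertices as `(edge name, index)` pairs are assumptions made for this example.

```python
# Admissible path lengths for each kind of seed edge (length 0 = discarded).
RULES = {
    "ge2": lambda k: k >= 2,       # edges in E_(>=2)
    "ge1": lambda k: k >= 1,       # edges in E_(>=1)
    "0/1": lambda k: k in (0, 1),  # edges in E_(0/1)
    "eq1": lambda k: k == 1,       # edges in E_(=1)
}


def subdivide(seed_edges, lengths):
    """Build the edge list of a subdivision of a seed graph.

    seed_edges: edge name -> (u, v, kind); lengths: edge name -> chosen
    length of the path that replaces the edge.
    """
    result = []
    for name, (u, v, kind) in seed_edges.items():
        k = lengths[name]
        if not RULES[kind](k):
            raise ValueError(f"length {k} is not allowed for {name} ({kind})")
        if k == 0:
            continue  # a discarded edge of E_(0/1)
        inner = [(name, i) for i in range(1, k)]  # new internal vertices
        path = [u, *inner, v]
        result.extend(zip(path, path[1:]))
    return result


seed_edges = {
    "a1": ("u1", "u2", "ge2"),
    "a6": ("u2", "u3", "ge1"),
    "a7": ("u1", "u3", "0/1"),
    "a8": ("u3", "u4", "eq1"),
}
subdivision = subdivide(seed_edges, {"a1": 2, "a6": 1, "a7": 0, "a8": 1})
```

In this example the edge `a1` becomes a path with one new internal vertex, `a6` and `a8` are used directly, and `a7` is discarded.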

We allow a possible elimination of edges in $E_{(0/1)}$ as an optional rule in constructing a target chemical graph from a seed graph, even though such an operation has not been included in the process R3. A subdivision $S$ plays the role of a cyclical-base in R2. A target chemical graph $G=(H,\alpha,\beta)$ will contain $S$ as a subgraph of the interior $H^{\mathrm{int}}$ of $G$.

Interior-Specification

A graph $H^*$ that serves as the interior $H^{\mathrm{int}}$ of a target chemical graph $G$ will be constructed as follows. First, construct a subdivision $S$ of a seed graph $G_{\mathrm{C}}$ by replacing each edge $e=uu'\in E_{(\geq 2)}\cup E_{(\geq 1)}$ with a pure $u,u'$-path $P_e$. Next, construct a supergraph $H^*$ of $S$ by attaching a leaf path $Q_v$ at each vertex $v\in V_{\mathrm{C}}$ or at an internal vertex $v\in V(P_e)\setminus\{u,u'\}$ of each pure $u,u'$-path $P_e$ for some edge $e=uu'\in E_{(\geq 2)}\cup E_{(\geq 1)}$, where possibly $Q_v=(v)$, $E(Q_v)=\emptyset$ (i.e., we do not attach any new edges to $v$). We introduce the following rules for specifying the size of $H^*$, the length $|E(P_e)|$ of a pure path $P_e$, the length $|E(Q_v)|$ of a leaf path $Q_v$, the number of leaf paths $Q_v$, and a bond-multiplicity of each interior-edge, where we call the set of prescribed constants an interior-specification $\sigma_{\mathrm{int}}$:

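  • Lower and upper bounds $n^{\mathrm{int}}_{\mathrm{LB}},n^{\mathrm{int}}_{\mathrm{UB}}\in\mathbb{Z}_+$ on the number of interior-vertices of a target chemical graph $G$.

  • For each edge $e=uu'\in E_{(\geq 2)}\cup E_{(\geq 1)}$,
    • a lower bound $\ell_{\mathrm{LB}}(e)$ and an upper bound $\ell_{\mathrm{UB}}(e)$ on the length $|E(P_e)|$ of a pure $u,u'$-path $P_e$. (For notational convenience, set $\ell_{\mathrm{LB}}(e):=0$, $\ell_{\mathrm{UB}}(e):=1$, $e\in E_{(0/1)}$ and $\ell_{\mathrm{LB}}(e):=1$, $\ell_{\mathrm{UB}}(e):=1$, $e\in E_{(=1)}$.)
    • a lower bound $\mathrm{bl}_{\mathrm{LB}}(e)$ and an upper bound $\mathrm{bl}_{\mathrm{UB}}(e)$ on the number of leaf paths $Q_v$ attached at internal vertices $v$ of a pure $u,u'$-path $P_e$.
    • a lower bound $\mathrm{ch}_{\mathrm{LB}}(e)$ and an upper bound $\mathrm{ch}_{\mathrm{UB}}(e)$ on the maximum length $|E(Q_v)|$ of a leaf path $Q_v$ attached at an internal vertex $v\in V(P_e)\setminus\{u,u'\}$ of a pure $u,u'$-path $P_e$.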
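  • For each vertex $v\in V_{\mathrm{C}}$,
    • a lower bound $\mathrm{bl}_{\mathrm{LB}}(v)$ and an upper bound $\mathrm{bl}_{\mathrm{UB}}(v)$ on the number of leaf paths $Q_v$ attached to $v$, where $0\leq\mathrm{bl}_{\mathrm{LB}}(v)\leq\mathrm{bl}_{\mathrm{UB}}(v)\leq 1$.
    • a lower bound $\mathrm{ch}_{\mathrm{LB}}(v)$ and an upper bound $\mathrm{ch}_{\mathrm{UB}}(v)$ on the length $|E(Q_v)|$ of a leaf path $Q_v$ attached to $v$.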
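  • For each edge $e=uu'\in E_{\mathrm{C}}$, a lower bound $\mathrm{bd}_{m,\mathrm{LB}}(e)$ and an upper bound $\mathrm{bd}_{m,\mathrm{UB}}(e)$ on the number of edges with bond-multiplicity $m\in[2,3]$ in the $u,u'$-path $P_e$, where we regard $P_e$, $e\in E_{(0/1)}\cup E_{(=1)}$, as the single edge $e$.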

We call a graph $H^*$ that satisfies an interior-specification $\sigma_{\mathrm{int}}$ a $\sigma_{\mathrm{int}}$-extension of $G_{\mathrm{C}}$, where the bond-multiplicity of each edge has been determined.

Table 1 shows an example of an interior-specification $\sigma_{\mathrm{int}}$ to the seed graph $G_{\mathrm{C}}$ in Figure 3.

Table 1.

Example 1 of an interior-specification $\sigma_{\mathrm{int}}$.

(Table 1 appears as an image, ijms-22-02847-i001.jpg, in the original article.)

Figure 6 illustrates an example of a $\sigma_{\mathrm{int}}$-extension $H^*$ of seed graph $G_{\mathrm{C}}$ in Figure 3 under the interior-specification $\sigma_{\mathrm{int}}$ in Table 1.

Figure 6.


An illustration of a graph $H^*$ that is obtained from the seed graph $G_{\mathrm{C}}$ in Figure 3 under the interior-specification $\sigma_{\mathrm{int}}$ in Table 1, where the vertices newly introduced by pure paths $P_{a_i}$ and leaf paths $Q_{v_i}$ are depicted with white squares and circles, respectively.

Chemical-Specification

Let $H^*$ be a graph that serves as the interior $H^{\mathrm{int}}$ of a target chemical graph $G$, where the bond-multiplicity of each edge in $H^*$ has been determined. Finally, we introduce a set of rules for constructing a target chemical graph $G$ from $H^*$ by choosing a chemical element $a\in\Lambda$ and assigning a $\rho$-fringe-tree $\psi$ to each interior-vertex $v\in V^{\mathrm{int}}$. We introduce the following rules for specifying the size of $G$, a set of chemical rooted trees that are allowed to be used as $\rho$-fringe-trees, and lower and upper bounds on the frequency of a chemical element, a chemical symbol, and an edge-configuration, where we call the set of prescribed constants a chemical specification $\sigma_{\mathrm{ce}}$:

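  • Lower and upper bounds $n_{\mathrm{LB}},n^*\in\mathbb{Z}_+$ on the number of vertices in $G$, where $n^{\mathrm{int}}_{\mathrm{LB}}\leq n_{\mathrm{LB}}\leq n^*$.

  • Subsets $\mathcal{F}(v)\subseteq\mathcal{F}(D_\pi)$, $v\in V_{\mathrm{C}}$, and $\mathcal{F}_E\subseteq\mathcal{F}(D_\pi)$ of chemical rooted trees with height at most $\rho$, where we require that every $\rho$-fringe-tree $T_v$ rooted at a vertex $v\in V_{\mathrm{C}}$ (resp., at an internal vertex $v$ not in $V_{\mathrm{C}}$) in $G$ belongs to $\mathcal{F}(v)$ (resp., $\mathcal{F}_E$). Let $\mathcal{F}^*:=\mathcal{F}_E\cup\bigcup_{v\in V_{\mathrm{C}}}\mathcal{F}(v)$ and let $\Lambda^{\mathrm{ex}}$ denote the set of chemical elements assigned to non-root vertices over all chemical rooted trees in $\mathcal{F}^*$.

  • A subset $\Lambda^{\mathrm{int}}\subseteq\Lambda^{\mathrm{int}}(D_\pi)$, where we require that every chemical element $\alpha(v)$ assigned to an interior-vertex $v$ in $G$ belongs to $\Lambda^{\mathrm{int}}$. Let $\Lambda:=\Lambda^{\mathrm{int}}\cup\Lambda^{\mathrm{ex}}$ and let $\mathrm{na}_a(G)$ (resp., $\mathrm{na}_a^{\mathrm{int}}(G)$ and $\mathrm{na}_a^{\mathrm{ex}}(G)$) denote the number of vertices (resp., interior-vertices and exterior-vertices) $v$ such that $\alpha(v)=a$ in $G$.

  • A set $\Lambda^{\mathrm{int}}_{\mathrm{dg}}\subseteq\Lambda\times[1,4]$ of chemical symbols and a set $\Gamma^{\mathrm{int}}\subseteq\Gamma^{\mathrm{int}}(D_\pi)$ of edge-configurations $(\mu,\xi,m)$ with $\mu\leq\xi$, where we require that the edge-configuration $\mathrm{ec}(e)$ of an interior-edge $e$ in $G$ belongs to $\Gamma^{\mathrm{int}}$. We do not distinguish $(\mu,\xi,m)$ and $(\xi,\mu,m)$.

  • Define $\Gamma^{\mathrm{int}}_{\mathrm{ac}}$ to be the set of adjacency-configurations such that $\Gamma^{\mathrm{int}}_{\mathrm{ac}}:=\{(a,b,m)\mid(ad,bd',m)\in\Gamma^{\mathrm{int}}\}$. Let $\mathrm{ac}_\nu^{\mathrm{int}}(G)$, $\nu\in\Gamma^{\mathrm{int}}_{\mathrm{ac}}$, denote the number of interior-edges $e$ such that $\mathrm{ac}(e)=\nu$ in $G$.

  • Subsets $\Lambda^*(v)\subseteq\{a\in\Lambda^{\mathrm{int}}\mid\mathrm{val}(a)\geq 2\}$, $v\in V_{\mathrm{C}}$, where we require that every chemical element $\alpha(v)$ assigned to a vertex $v\in V_{\mathrm{C}}$ in the seed graph belongs to $\Lambda^*(v)$.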

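  • Lower and upper bound functions $\mathrm{na}_{\mathrm{LB}},\mathrm{na}_{\mathrm{UB}}:\Lambda\to[1,n^*]$ and $\mathrm{na}^{\mathrm{int}}_{\mathrm{LB}},\mathrm{na}^{\mathrm{int}}_{\mathrm{UB}}:\Lambda^{\mathrm{int}}\to[1,n^*]$ on the number of vertices and of interior-vertices $v$, respectively, such that $\alpha(v)=a$ in $G$.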

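  • Lower and upper bound functions $\mathrm{ns}^{\mathrm{int}}_{\mathrm{LB}},\mathrm{ns}^{\mathrm{int}}_{\mathrm{UB}}:\Lambda^{\mathrm{int}}_{\mathrm{dg}}\to[1,n^*]$ on the number of interior-vertices $v$ such that $\mathrm{cs}(v)=\mu$ in $G$.

  • Lower and upper bound functions $\mathrm{ac}^{\mathrm{int}}_{\mathrm{LB}},\mathrm{ac}^{\mathrm{int}}_{\mathrm{UB}}:\Gamma^{\mathrm{int}}_{\mathrm{ac}}\to\mathbb{Z}_+$ on the number of interior-edges $e$ such that $\mathrm{ac}(e)=\nu$ in $G$.

  • Lower and upper bound functions $\mathrm{ec}^{\mathrm{int}}_{\mathrm{LB}},\mathrm{ec}^{\mathrm{int}}_{\mathrm{UB}}:\Gamma^{\mathrm{int}}\to\mathbb{Z}_+$ on the number of interior-edges $e$ such that $\mathrm{ec}(e)=\gamma$ in $G$.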

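We call a chemical graph $G$ that satisfies a chemical specification $\sigma_{\mathrm{ce}}$ a $(\sigma_{\mathrm{int}},\sigma_{\mathrm{ce}})$-extension of $G_{\mathrm{C}}$, and denote by $\mathcal{G}(G_{\mathrm{C}},\sigma_{\mathrm{int}},\sigma_{\mathrm{ce}})$ the set of all $(\sigma_{\mathrm{int}},\sigma_{\mathrm{ce}})$-extensions of $G_{\mathrm{C}}$.

Table 2 shows an example of a chemical-specification $\sigma_{\mathrm{ce}}$ to the seed graph $G_{\mathrm{C}}$ in Figure 3.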

Table 2.

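Example 2 of a chemical-specification $\sigma_{\mathrm{ce}}$.

(Table 2 appears as an image, ijms-22-02847-i002.jpg, in the original article.)

Figure 2 illustrates an example of a $(\sigma_{\mathrm{int}},\sigma_{\mathrm{ce}})$-extension of $G_{\mathrm{C}}$ obtained from the $\sigma_{\mathrm{int}}$-extension $H^*$ in Figure 6 under the chemical-specification $\sigma_{\mathrm{ce}}$ in Table 2.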

Our specification of topological substructures is similar to that proposed by Akutsu and Nagamochi [27], wherein a target chemical graph is restricted to ρ-lean cyclic graphs and prescribed substructures cannot be specified in the acyclic part. In our new method, a chemical graph with any structure can be handled and substructures in the acyclic part can be fixed.

2.3. Examples of Specification

We here present some cases where a target specification (GC,σint,σce) can be chosen based on a set G* of given chemical graphs with a similar structure so that G* becomes a subset of G(GC,σint,σce). In such a case, every target chemical graph in G(GC,σint,σce) possesses a common structure over the given set G*.

Figure 7 illustrates a set G* of four flavonoids and a seed graph GC for ρ=2 so that G* ⊆ G(GC,σint,σce) for a choice of an interior-specification σint and a chemical-specification σce. Let Λ:={C,O}. In the seed graph GC=(VC,EC), we set E(≥1):={a1,a2}, E(0/1):={a3}, and E(=1):=EC\{a1,a2,a3} and predetermine the chemical element α(u) for each vertex u ∈ VC and the bond-multiplicity β(e) for each edge e ∈ E(=1) as in Figure 7e, i.e., Λ*(u):={a} for a=α(u) and bdm,LB(e):=1 for m=β(e). Figure 7f illustrates a set F* of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in GC, we set chUB(u):=0, u ∈ VC, F(ui):={ψ3}, i ∈ [1,3], F(u4):={ψ1,ψ3}, F(u5):={ψ4}, F(u6):={ψ2}, and F(u):={ψ1}, u ∈ VC\{u1,u2,…,u6}. For edges ai ∈ E(≥1), i=1,2, we set ℓUB(ai):=2, chUB(ai):=0 and FE:={ψ1,ψ2}, where a pure path Pai may be introduced in a target chemical graph. We see that every given chemical graph Gi ∈ G* belongs to G(GC,σint,σce) by setting the other specifications in σint and σce adequately.

Figure 7.


Illustration of a set G*={G1,G2,G3,G4} of four flavonoids, a seed graph GC, and a set F*={ψ1,ψ2,ψ3,ψ4} of chemical rooted trees for ρ=2: (a) fisetin G1; (b) luteolin G2; (c) aurone G3; (d) chalcone G4; (e) GC=(VC,EC); (f) F* = FE ∪ ⋃v∈VC F(v).

Figure 8 illustrates a set G* of three dibenzodiazepine atypical antipsychotics, and a seed graph GC for ρ=2 so that G* ⊆ G(GC,σint,σce) for a choice of an interior-specification σint and a chemical-specification σce. Let Λ:={C,O,N,S,Cl}. In the seed graph GC=(VC,EC), we set E(≥2):={a1} and E(=1):=EC\{a1} and predetermine the chemical element α(u) for each vertex u ∈ VC and the bond-multiplicity β(e) for each edge e ∈ E(=1) as in Figure 8d. Figure 8e illustrates a set F* of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in GC, we set chUB(u):=0, u ∈ VC\{u2}, chLB(u2):=0, chUB(u2):=4, F(u1):={ψ3,ψ7}, F(u2):={ψ1,ψ6}, F(ui):={ψ3}, i ∈ [3,5], and F(u):={ψ1}, u ∈ VC\{u1,u2,…,u6}, where a leaf path Qu2 may be introduced in a target chemical graph. For edge a1 ∈ E(≥2), we set ℓUB(a1):=3, chUB(a1):=0 and FE:={ψ1,ψ2,ψ4,ψ8}. We see that every given chemical graph Gi ∈ G* belongs to G(GC,σint,σce) by setting the other specifications in σint and σce adequately.

Figure 8.


Illustration of a set G*={G1,G2,G3} of three dibenzodiazepine atypical antipsychotics, a seed graph GC and a set F*={ψ1,ψ2,…,ψ8} of chemical rooted trees for ρ=2: (a) clozapine G1; (b) quetiapine G2; (c) olanzapine G3; (d) GC=(VC,EC); (e) F* = FE ∪ ⋃v∈VC F(v).

3. Results

We implemented our method of Stages 1 to 5 for inferring chemical graphs under a given target specification and conducted experiments to evaluate the computational efficiency. We executed the experiments on a PC with a 3.0 GHz Core i7-9700 processor and 16 GB of DDR4 RAM. We used ChemDoodle version 10.2.0 for constructing 2D drawings of chemical graphs.

To conduct experiments for Stages 1 to 5, we selected six chemical properties π: octanol/water partition coefficient (KOW), boiling point (BP), melting point (MP), flash point (closed cup) (FP), lipophilicity (LP), and solubility (SL). The data for KOW, BP, MP, and FP are provided by HSDB from PubChem [29], the data for LP by figshare [30], and the data for SL by MoleculeNet [31].

Results on Phase 1.

We implemented Stages 1, 2, and 3 in Phase 1 as follows.

Stage 1. We set a graph class G to be the set of all chemical graphs with any graph structure, and set a branch-parameter ρ to be 2. For each property π ∈ {KOW, BP, MP, FP, LP, SL}, we first select a set Λ of chemical elements and then collect a data set Dπ on chemical graphs over the set Λ of chemical elements. To construct the data set Dπ, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as N+ or an element a ∈ Λ whose valence is different from our setting of valence function val.

Table 3 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:

  • Λ: the set of selected chemical elements (hydrogen atoms are added at the final stage);

  • |Dπ|: the size of data set Dπ over Λ for property π;

  • |Γint(Dπ)|: the number of different edge-configurations of interior-edges over the compounds in Dπ;

  • |F(Dπ)|: the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in Dπ;

  • [n_,n¯]: the minimum and maximum values of n(G) over the compounds G in Dπ; and

  • [a_,a¯]: the minimum and maximum values of a(G) in π over compounds G in Dπ.

Table 3.

Data sets for Stage 1 in Phase 1.

π Λ |Dπ| |Γint(Dπ)| |F(Dπ)| [n_,n¯] [a_,a¯]
KOW C,O,N 644 24 109 [4, 58] [−7.53, 13.45]
KOW C,O,N,S,Cl 837 31 142 [4, 73] [−7.53, 13.45]
BP C,O,N 358 21 91 [4, 30] [−11.70, 470.0]
BP C,O,N,S,Cl 425 23 114 [4, 30] [−11.70, 470.0]
MP C,O,N 448 22 94 [4, 122] [−185.3, 300.0]
MP C,O,N,S,Cl 548 26 118 [4, 122] [−185.3, 300.0]
FP C,O,N 348 20 85 [4, 66] [−82.99, 300.0]
FP C,O,N,S,Cl 399 24 107 [4, 66] [−82.99, 300.0]
LP C,O,N 592 27 71 [6, 60] [−3.62, 6.84]
LP C,O,N,S,Cl 779 32 78 [6, 74] [−3.62, 6.84]
SL C,O,N 640 25 111 [4, 55] [−9.33, 1.11]
SL C,O,N,S,Cl 847 31 144 [4, 55] [−11.60, 1.11]

Stage 2. We used the new feature function that consists of the descriptors such as fringe-configuration defined in Section 2.1 and let ffc denote the feature function.

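Stage 3. Let η: ℝ^K → ℝ be a prediction function to a property function a: D → ℝ with a feature function f: D → ℝ^K for a data set D of chemical graphs. We define the coefficient of determination R2(f,η,D) of a prediction function η over a data set D to be

\[ R^2(f,\eta,D) \triangleq 1 - \frac{\sum_{G\in D}\bigl(a(G)-\eta(f(G))\bigr)^2}{\sum_{G\in D}\bigl(a(G)-\tilde{a}\bigr)^2} \quad \text{for } \tilde{a} = \frac{1}{|D|}\sum_{G\in D} a(G). \]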
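
The definition above can be transcribed directly into code. The following pure-Python sketch takes the observed values a(G) and the predicted values η(f(G)) as two lists; the function name and the list-based interface are assumptions made for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    `observed` holds a(G) and `predicted` holds eta(f(G)) for the same
    graphs G in the same order. The result is undefined (division by
    zero) when all observed values are identical.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(observed) / len(observed)
    ss_res = sum((a - p) ** 2 for a, p in zip(observed, predicted))
    ss_tot = sum((a - mean) ** 2 for a in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, observed)        # exact predictions
baseline = r_squared(observed, [2.5] * 4)      # always predict the mean
```

A perfect predictor scores 1 and a predictor that always returns the mean scores 0, so the values in the tables below can be read as the fraction of variance explained on the test set.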

To conduct an experiment in Stage 3, we first constructed ten architectures Aj, j ∈ [1,10] with one or two hidden layers. For each pair (π,Aj) of a property π ∈ {KOW, BP, MP, FP, LP, SL} and an architecture Aj, j ∈ [1,10], we constructed five prediction functions in order to evaluate the performance with cross-validation as follows. Partition data set Dπ into five subsets Dπ(i), i ∈ [1,5] randomly and for each set Dπ\Dπ(i) construct an ANN N(j,i) and its prediction function ηN(j,i) using the feature function ffc. We used scikit-learn version 0.23.2 with Python 3.8.5, MLPRegressor and ReLU activation function to construct each ANN N(j,i). We evaluated the resulting prediction function ηN(j,i) with the coefficient R2(ffc,ηN(j,i),Dπ(i)) of determination for the test set Dπ(i). For each property π, let t-Rcv2(j) denote the average of R2(ffc,ηN(j,i),Dπ(i)) over all i ∈ [1,5] in the cross-validation to an architecture Aj.
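
The cross-validation protocol can be sketched as follows. To keep the example free of third-party dependencies, a one-variable least-squares line stands in for the ANN trained with MLPRegressor; this learner, the toy data, and all function names are assumptions made for illustration and do not reproduce the experiments reported here.

```python
import random

def r_squared(observed, predicted):
    """Coefficient of determination on a test set."""
    mean = sum(observed) / len(observed)
    ss_res = sum((a - p) ** 2 for a, p in zip(observed, predicted))
    ss_tot = sum((a - mean) ** 2 for a in observed)
    return 1.0 - ss_res / ss_tot

def five_fold_indices(n, seed=0, k=5):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Stand-in learner (NOT an ANN): least-squares line y = s*x + c."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def cross_validated_r2(xs, ys, fit, k=5, seed=0):
    """Train on D minus D(i), test on D(i), and average the k test scores."""
    scores = []
    for test in five_fold_indices(len(xs), seed, k):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(r_squared([ys[i] for i in test],
                                [model(xs[i]) for i in test]))
    return sum(scores) / k, scores

# Toy data with an exactly linear relationship.
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
mean_score, fold_scores = cross_validated_r2(xs, ys, fit_line)
```

The value `mean_score` plays the role of t-Rcv2(j) for one architecture; repeating the call for each of the ten architectures and taking the maximum would give the "best" column.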

Table 4 shows the results on Stages 2 and 3, where we denote the following.

Table 4.

Results of Stages 2 and 3 in Phase 1.

π Λ L-Time t-Rcv2 (Best) t-Rmax2 Arch.
KOW C,O,N 0.7 0.959 0.983 (156,10,10,1)
KOW C,O,N,S,Cl 0.7 0.947 0.968 (199,20,10,1)
BP C,O,N 3.5 0.858 0.923 (135,30,20,1)
BP C,O,N,S,Cl 3.3 0.821 0.899 (163,10,1)
MP C,O,N 3.8 0.784 0.893 (139,40,1)
MP C,O,N,S,Cl 4.1 0.796 0.880 (170,10,10,1)
FP C,O,N 1.1 0.750 0.874 (128,40,1)
FP C,O,N,S,Cl 1.8 0.707 0.853 (157,10,10,1)
LP C,O,N 0.5 0.868 0.908 (121,30,1)
LP C,O,N,S,Cl 0.7 0.861 0.892 (137,20,10,1)
SL C,O,N 0.7 0.870 0.913 (159,30,1)
SL C,O,N,S,Cl 0.9 0.870 0.903 (201,30,20,1)
  • -

    Λ: the set of selected chemical elements (hydrogen atoms are added at the final stage);

  • -

    L-time: the average time (s) to construct an ANN over all 10×5=50 ANNs;

  • -

    t-Rcv2 (best): the best value of t-Rcv2(j) over all architectures Aj, j ∈ [1,10];

  • -

    t-Rmax2: the maximum of R2(ffc,ηN(j,i),Dπ(i)) over all j ∈ [1,10], i ∈ [1,5]; and

  • -

    Arch.: the architecture Aj, j ∈ [1,10] that attains t-Rmax2. An architecture (K,p,1) (resp., (K,p1,p2,1)) consists of an input layer with K nodes, a hidden layer with p nodes (resp., two hidden layers with p1 and p2 nodes, respectively), and an output layer with a single node, where K is equal to the number of descriptors in the feature vector.
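
To make the architecture notation concrete, the sketch below builds a fully connected network from a tuple such as (156,10,10,1), evaluates it with ReLU on the hidden layers and an identity output (the convention of scikit-learn's MLPRegressor), and counts its weights and biases. The random weights and the function names are assumptions made for illustration; a trained network would supply its own parameters.

```python
import random

def init_network(arch, seed=0):
    """Random weights and biases for layer sizes arch = (K, p, ..., 1)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(arch, arch[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    """ReLU on every hidden layer, identity on the single output node."""
    for depth, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]

def parameter_count(arch):
    """Number of weights plus biases in the network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(arch, arch[1:]))

arch = (156, 10, 10, 1)          # two hidden layers with 10 nodes each
net = init_network(arch)
y = forward(net, [0.0] * arch[0])
n_params = parameter_count(arch)
```

Because K grows with the number of descriptors, most parameters sit in the first layer, which is also the layer that the input-node reduction described later acts on.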

From Table 4, we see that the execution of Stage 3 was successful: the values of t-Rmax2 range from 0.853 to 0.983 over the six chemical properties.

An Additional Experiment in Stage 3. We conducted an additional experiment to compare our new feature function ffc with the feature function fec based on edge-configuration in the previous method [27] designed with the same framework. Note that the previous feature vector fec(G) can be defined only for a cyclic graph G, whereas our feature vector ffc(G) is defined for an arbitrary graph G. For each property π ∈ {KOW, BP, MP, FP, LP, SL}, we set a set Λ of chemical elements to be {C,O,N,S,Cl} and then collect a data set D˜π of chemical cyclic graphs from the data set Dπ of all chemical graphs over the set Λ of chemical elements in the previous experiment. For each of the feature functions fec and ffc, we constructed five prediction functions with the same set of ten architectures Aj, j ∈ [1,10] and the data set D˜π of chemical cyclic graphs in the same manner as in the previous experiment.

Table 5 shows the results of this experiment, where the table also includes the result of prediction functions by ffc in the set Dπ of all chemical graphs. In the table, we denote the following:

  • -

    |D˜π|, |Dπ|: the size of data set D˜π of cyclic graphs (resp., Dπ of all chemical graphs) for property π;

  • -

    t-Rcv2 (ave.): the average of R2(f,ηN(j,i),D(i)) over all j ∈ [1,10], i ∈ [1,5] for f=fec,ffc and D=D˜π,Dπ; and

  • -

    t-Rcv2 (best): the maximum over j ∈ [1,10] of the average of R2(ffc,ηN(j,i),Dπ(i)) over all i ∈ [1,5].

Table 5.

Results of prediction functions by fec and ffc in data set D˜π of cyclic graphs and ffc in data set Dπ of all graphs.

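| π | |D˜π| | f=fec, D=D˜π: t-Rcv2 (ave.) | f=fec, D=D˜π: t-Rcv2 (best) | f=ffc, D=D˜π: t-Rcv2 (ave.) | f=ffc, D=D˜π: t-Rcv2 (best) | |Dπ| | f=ffc, D=Dπ: t-Rcv2 (ave.) | f=ffc, D=Dπ: t-Rcv2 (best) |
|---|---|---|---|---|---|---|---|---|
| KOW | 580 | 0.952 | 0.959 | 0.950 | 0.954 | 837 | 0.944 | 0.947 |
| BP | 224 | 0.688 | 0.718 | 0.680 | 0.693 | 425 | 0.809 | 0.821 |
| MP | 348 | 0.668 | 0.694 | 0.712 | 0.736 | 548 | 0.776 | 0.796 |
| FP | 218 | 0.435 | 0.476 | 0.574 | 0.623 | 399 | 0.688 | 0.707 |
| LP | 776 | 0.832 | 0.842 | 0.853 | 0.861 | 779 | 0.854 | 0.861 |
| SL | 638 | 0.851 | 0.863 | 0.853 | 0.861 | 847 | 0.860 | 0.870 |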

From Table 5, we see that the R2 score of the prediction function by ffc in chemical cyclic graphs (resp., in all chemical graphs) improves on that by fec for properties MP and FP (resp., BP, MP, and FP). Recall that our new feature function ffc can be defined for arbitrary graphs, so we can select a larger data set in the learning stage than is possible with fec. This advantage is observed in the experiment. We conjecture that the better prediction function for BP (resp., FP) is obtained by using ffc because the size of the data set becomes considerably larger, from |D˜π|=224 to |Dπ|=425 (resp., from |D˜π|=218 to |Dπ|=399).

Results on Phase 2.

We prepared the following instances (a–d) for conducting experiments of Stages 4 and 5 in Phase 2.

  • (a)

    Ia=(GC,σint,σce): The instance used in Section 2.2 to explain the target specification.

  • (b)
    Ib,i=(GCi,σinti,σcei), i=1,2,3,4: An instance for inferring chemical graphs with rank at most 2. In the four instances Ib,i, i=1,2,3,4, the following specifications in (σint,σce) are common.
    • Set Λ:={C,N,O}, set Λdgint to be the set of all possible symbols in Λ×[1,4], and set Γint to be the set of all possible edge-configurations. Set Λ*(v):=Λ, v ∈ VC.
    • The lower bounds ℓLB, blLB, chLB, bd2,LB, bd3,LB, naLB, naLBint, nsLBint, acLBint, ecLBint are all set to be 0.
    • The upper bounds ℓUB, blUB, chUB, bd2,UB, bd3,UB, naUB, naUBint, nsUBint, acUBint, ecUBint are all set to be an upper bound n* on n(G*).
    • For each property π, let F(Dπ) denote the set of 2-fringe-trees in the compounds in Dπ, and select a subset Fπi ⊆ F(Dπ) with |Fπi| = 45 − 5i, i ∈ [1,5]. For each instance Ib,i, set FE:=F(v):=Fπi, v ∈ VC.

    Instance Ib,1 is given by the rank-1 seed graph GC1 in Figure 9a and Instances Ib,i, i=2,3,4 are given by the rank-2 seed graph GCi, i=2,3,4 in Figure 9b–d.

    • (i)

      For instance Ib,1, select as a seed graph the monocyclic graph GC1=(VC,EC=E(≥2) ∪ E(≥1)) in Figure 9a, where VC={u1,u2}, E(≥2)={a1} and E(≥1)={a2}. Set nLBint:=0, nUBint:=12 and nLB:=n*:=38. We include a linear constraint ℓ(a1) ≤ ℓ(a2) as part of the side constraint.

    • (ii)

      For instance Ib,2, select as a seed graph the graph GC2=(VC,EC=E(≥2) ∪ E(≥1) ∪ E(=1)) in Figure 9b, where VC={u1,u2,u3,u4}, E(≥2)={a1,a2}, E(≥1)={a3} and E(=1)={a4,a5}. Set nLBint:=nUBint:=30 and nLB:=n*:=50. We include a linear constraint ℓ(a1) ≤ ℓ(a2).

    • (iii)

      For instance Ib,3, select as a seed graph the graph GC3=(VC,EC=E(≥2) ∪ E(≥1) ∪ E(=1)) in Figure 9c, where VC={u1,u2,u3,u4}, E(≥2)={a1}, E(≥1)={a2,a3} and E(=1)={a4,a5}. Set nLBint:=nUBint:=30 and nLB:=n*:=50. We include linear constraints ℓ(a1) ≤ ℓ(a2)+ℓ(a3) and ℓ(a2) ≤ ℓ(a3).

    • (iv)

      For instance Ib,4, select as a seed graph the graph GC4=(VC,EC=E(≥2) ∪ E(≥1) ∪ E(=1)) in Figure 9d, where VC={u1,u2,u3,u4}, E(≥1)={a1,a2,a3} and E(=1)={a4,a5}. Set nLBint:=nUBint:=30 and nLB:=n*:=50. We include linear constraints ℓ(a2) ≤ ℓ(a1)+1, ℓ(a2) ≤ ℓ(a3)+1 and ℓ(a1) ≤ ℓ(a3).

Figure 9.


An illustration of seed graphs: (a) A monocyclic graph GC1; (b) A rank-2 cyclic graph GC2 with two vertex-disjoint cycles; (c) A rank-2 cyclic graph GC3 with two edge-disjoint cycles sharing a vertex; (d) A rank-2 cyclic graph GC4 with three cycles.

We define instances in (c) and (d) in order to find chemical graphs that have an intermediate structure of two given chemical cyclic graphs GA=(HA=(VA,EA),αA,βA) and GB=(HB=(VB,EB),αB,βB). Let ΛAint and Λdg,Aint denote the sets of chemical elements and chemical symbols of the interior-vertices in GA, ΓAint denote the set of edge-configurations of the interior-edges in GA, and FA denote the set of 2-fringe-trees in GA. Analogously define sets ΛBint, Λdg,Bint, ΓBint, and FB in GB.

  • (c)

    Ic=(GC,σint,σce): An instance aimed to infer a chemical graph G such that the core of G is equal to the core of GA and the frequency of each edge-configuration in the non-core of G is equal to that of GB. We use chemical compounds CID 24822711 and CID 59170444 in Figure 10a,b for GA and GB, respectively.

    Set a seed graph GC=(VC,EC=E(=1)) to be the core of GA.

    Set Λ:={C,N,O}, and set Λdgint to be the set of all possible chemical symbols in Λ×[1,4].

    Set Γint:=ΓAint ∪ ΓBint and Λ*(v):={αA(v)}, v ∈ VC.

    Set nLBint:=min{nint(GA),nint(GB)}, nUBint:=max{nint(GA),nint(GB)},

    nLB:=min{n(GA),n(GB)} − 10 and n*:=max{n(GA),n(GB)} + 5.

    Set lower bounds ℓLB, blLB, chLB, bd2,LB, bd3,LB, naLB, naLBint, nsLBint and acLBint to be 0.

    Set upper bounds ℓUB, blUB, chUB, bd2,UB, bd3,UB, naUB, naUBint, nsUBint and acUBint to be n*.

    Set ecLBint(γ) to be the number of core-edges in GA with edge-configuration γ ∈ Γint and ecUBint(γ) to be the number of interior-edges in GA and GB with edge-configuration γ.

    Let FB(p), p ∈ [1,2] denote the set of chemical rooted trees r-isomorphic to p-fringe-trees in GB.

    Set FE:=F(v):=FB(1) ∪ FB(2), v ∈ VC.

  • (d)

    Id=(GC1,σint,σce): An instance aimed to infer a chemical monocyclic graph G such that the frequency vector of edge-configurations in G is a vector obtained by merging those of GA and GB. We use chemical monocyclic compounds CID 10076784 and CID 44340250 in Figure 10c,d for GA and GB, respectively. Set a seed graph to be the monocyclic seed graph GC1=(VC,EC=E(≥2) ∪ E(≥1)) with VC={u1,u2}, E(≥2)={a1} and E(≥1)={a2} in Figure 9a.

    Set Λ:={C,N,O}, Λdgint:=Λdg,Aint ∪ Λdg,Bint and Γint:=ΓAint ∪ ΓBint.

    Set nLBint:=min{nint(GA),nint(GB)}, nUBint:=max{nint(GA),nint(GB)},

    nLB:=min{n(GA),n(GB)} and n*:=max{n(GA),n(GB)}.

    Set lower bounds ℓLB, blLB, chLB, bd2,LB, bd3,LB, naLB, naLBint, nsLBint and acLBint to be 0.

    Set upper bounds ℓUB, blUB, chUB, bd2,UB, bd3,UB, naUB, naUBint, nsUBint and acUBint to be n*.

    For each edge-configuration γ ∈ Γint, let xA*(γ) (resp., xB*(γ)) denote the number of interior-edges with edge-configuration γ in GA (resp., GB), and set

    xmin*(γ):=min{xA*(γ),xB*(γ)}, xmax*(γ):=max{xA*(γ),xB*(γ)},

    ecLBint(γ):=(3/4)xmin*(γ)+(1/4)xmax*(γ) and

    ecUBint(γ):=(1/4)xmin*(γ)+(3/4)xmax*(γ).

    Set FE:=F(v):=FA ∪ FB, v ∈ VC.
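
The merging rule for instance (d) can be illustrated with a short computation. The sketch below takes the edge-configuration counts of GA and GB as dictionaries and returns the pair (ecLBint(γ), ecUBint(γ)) for every γ. Since the bounds must be integers, the lower bound is rounded down and the upper bound is rounded up; this rounding, the dictionary encoding, and the labels "g1" and "g2" are assumptions of the sketch rather than details stated in the text.

```python
def merged_bounds(x_a, x_b):
    """Lower/upper bounds on the count of each edge-configuration.

    x_a[gamma] and x_b[gamma] are the numbers of interior-edges with
    edge-configuration gamma in G_A and G_B (missing keys count as 0).
    """
    bounds = {}
    for gamma in set(x_a) | set(x_b):
        x_min, x_max = sorted((x_a.get(gamma, 0), x_b.get(gamma, 0)))
        # (3/4) x_min + (1/4) x_max, rounded down (assumed rounding).
        lower = (3 * x_min + x_max) // 4
        # (1/4) x_min + (3/4) x_max, rounded up (assumed rounding).
        upper = -(-(x_min + 3 * x_max) // 4)
        bounds[gamma] = (lower, upper)
    return bounds

# Hypothetical counts for two edge-configurations.
counts_a = {"g1": 4, "g2": 0}
counts_b = {"g1": 8, "g2": 3}
bounds = merged_bounds(counts_a, counts_b)
```

For "g1" the admissible counts are pulled into the middle half of the interval between 4 and 8, which is what forces an inferred graph to be intermediate between GA and GB rather than a copy of either.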

Figure 10.


An illustration of chemical compounds for instances Ic and Id: (a) GA: CID 24822711; (b) GB: CID 59170444; (c) GA: CID 10076784; (d) GB: CID 44340250.

In Stage 4, before we formulate an MILP for inferring a target chemical graph G for each instance I, we reduce the input layer of an ANN N constructed in Stage 3 so that the input layer consists of input nodes that correspond to the descriptors actually used in the specification (GC,σint,σce) of the instance I, i.e., we remove any input nodes in N that represent the frequency of edge-configurations in Γint(Dπ) and chemical rooted trees ψ ∈ F(Dπ) not contained in the specification (GC,σint,σce) of I. For example, there are |F(Dπ)|=109 chemical rooted trees in the set of 2-fringe-trees in the data set Dπ with π= KOW in Table 3, and an ANN N constructed in Stage 3 contains 109 input nodes that correspond to the descriptors for the fringe-configuration. However, the set of input nodes for the fringe-configuration is reduced to a set of |F*|=40 input nodes when we formulate an MILP for solving instance Ib,1, which reduces the number of integer variables.
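
A minimal sketch of this reduction is given below. It assumes that each descriptor is an unnormalized frequency, so that a descriptor excluded by the specification always takes the value 0 and contributes nothing to any hidden node; under that assumption, deleting the corresponding columns of the first-layer weight matrix leaves the network output unchanged. The descriptor labels, the weights, and the function names are hypothetical.

```python
def reduce_input_layer(first_layer_weights, descriptor_names, kept):
    """Drop input nodes whose descriptors are not in `kept`.

    first_layer_weights[h][k] is the weight from input node k to hidden
    node h; descriptor_names[k] labels input node k.
    """
    keep_idx = [k for k, name in enumerate(descriptor_names) if name in kept]
    reduced = [[row[k] for k in keep_idx] for row in first_layer_weights]
    return reduced, [descriptor_names[k] for k in keep_idx]

def pre_activation(weights, x):
    """Weighted sums entering the first hidden layer (biases omitted)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Hypothetical network with three fringe-configuration descriptors.
names = ["fc:psi1", "fc:psi2", "fc:psi3"]
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.25, -0.75]]
kept = {"fc:psi1", "fc:psi3"}            # psi2 is excluded by the specification

reduced_weights, reduced_names = reduce_input_layer(weights, names, kept)

# Any graph allowed by the specification has frequency 0 for psi2.
full_input = [3.0, 0.0, 1.0]
reduced_input = [3.0, 1.0]
same = pre_activation(weights, full_input) == pre_activation(reduced_weights, reduced_input)
```

Each removed input node also removes the integer variable that would have represented its descriptor in the MILP, which is the saving mentioned above.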

Table 6 shows the features of the seven test instances, where we denote the following:

  • -

    Λ: the set of non-hydrogen chemical elements for inferring a target graph;

  • -

    |Γint|: the number of different edge-configurations of interior-edges for inferring a target graph;

  • -

    |F*|: the number of different chemical rooted trees in the set F* = FE ∪ ⋃v∈VC F(v); and

  • -

    [nLBint,nUBint], [nLB,n*]: the lower and upper bounds on nint(G) and n(G) for inferring a target graph G.

Table 6.

Features of test instances.

Instance Λ |Γint| |F*| [nLB,n*] [nLBint,nUBint]
Ia C,O,N 10 11 [30,50] [20,28]
Ib,1 C,O,N 28 40 [38,38] [6,6]
Ib,2 C,O,N 28 35 [50,50] [30,30]
Ib,3 C,O,N 28 30 [50,50] [30,30]
Ib,4 C,O,N 28 25 [50,50] [30,30]
Ic C,O,N 8 12 [46,46] [24,24]
Id C,O,N 7 8 [40,45] [18,18]

Stage 4. To solve an MILP in Stage 4, we used CPLEX version 12.10. Tables 7–12 show the results of Stage 4, where we denote the following:

  • -

    [a_,a¯]: the minimum and maximum values of a(G) in π over compounds G in Dπ in Table 3;

  • -

    [y_,y¯]: y_ (resp., y¯) denotes the minimum (resp., maximum) target value y with a_ ≤ y ≤ a¯ such that the MILP instance for the target value y*=y becomes feasible (i.e., admits a target chemical graph G). To determine the minimum and maximum target values y_ and y¯, we solved many MILP instances. Note that the MILP instance may become infeasible for some value y within the range [y_,y¯];

  • -

    y*: a target value in [y_,y¯] for a property π;

  • -

    #v: the number of variables in the MILP in Stage 4;

  • -

    #c: the number of constraints in the MILP in Stage 4;

  • -

    IP-time: the time (sec.) to solve the MILP in Stage 4;

  • -

    n: the number n(G) of non-hydrogen atoms in the chemical graph G inferred in Stage 4; and

  • -

    nint: the number nint(G) of interior-vertices in the chemical graph G inferred in Stage 4.
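
The text does not describe how the candidate target values were chosen when determining [y_,y¯]. One simple possibility, sketched below, is to scan a grid of values between a_ and a¯ and record the smallest and largest feasible ones. The callback `is_feasible` is a hypothetical stand-in for building and solving the MILP with y*=y, and the toy oracle only imitates a feasible set with a gap. A full scan is used instead of bisection because, as noted above, the feasible values need not form an interval.

```python
def grid(lo, hi, steps):
    """Evenly spaced candidate target values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]

def feasible_range(candidates, is_feasible):
    """Smallest and largest feasible candidate, or None if none is feasible.

    `is_feasible(y)` is assumed to solve one MILP instance for target y.
    """
    feasible = [y for y in candidates if is_feasible(y)]
    if not feasible:
        return None
    return min(feasible), max(feasible)

def toy_oracle(y):
    """Stand-in for an MILP solve: feasible on [2, 5] and on [7, 9]."""
    return 2.0 <= y <= 5.0 or 7.0 <= y <= 9.0

result = feasible_range(grid(0.0, 10.0, 20), toy_oracle)
```

In the toy example the value 6.0 lies inside the reported range but is infeasible, mirroring the remark in the definition of [y_,y¯].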

Table 7.

Results of Stage 4 for KOW.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−7.53, 13.45] [−7.0, 13.4] 3.2 7663 9162 3.9 35 24
Ib,1 [−7.53, 13.45] [−7.5, 13.4] 3.0 9894 6626 17.5 38 7
Ib,2 [−7.53, 13.45] [−7.5, 13.4] 3.0 11,514 8934 14.0 50 30
Ib,3 [−7.53, 13.45] [−7.5, 13.4] 3.0 11,318 8926 24.6 50 30
Ib,4 [−7.53, 13.45] [−7.5, 13.4] 3.0 11,122 8918 22.0 50 30
Ic [−7.53, 13.45] [−7.5, 13.4] 3.0 7867 8630 2.1 49 32
Id [−7.53, 13.45] [−7.5, 13.4] 3.0 5395 6899 5.2 45 23

Table 8.

Results of Stage 4 for BP.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−11.70, 470.0] [352, 470] 411 7583 8982 2.7 42 25
Ib,1 [−11.70, 470.0] [−11, 470] 229 9816 6449 2.7 38 7
Ib,2 [−11.70, 470.0] [−11, 470] 229 11,436 8757 9.1 50 30
Ib,3 [−11.70, 470.0] [−11, 470] 229 11,240 8749 11.0 50 30
Ib,4 [−11.70, 470.0] [−11, 470] 229 11,044 8741 24.0 50 30
Ic [−11.70, 470.0] [170, 470] 320 7575 8450 25.9 49 33
Id [−11.70, 470.0] [151, 470] 310 5315 6719 4.4 43 23

Table 9.

Results of Stage 4 for MP.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−185.3, 300.0] [55, 300] 177.5 7602 9023 16.1 41 24
Ib,1 [−185.3, 300.0] [−180, 300] 60 9833 6487 2.3 38 9
Ib,2 [−185.3, 300.0] [−185, 300] 57.4 11,453 8795 44.7 50 30
Ib,3 [−185.3, 300.0] [−185, 300] 57.4 11,257 8787 10.5 50 30
Ib,4 [−185.3, 300.0] [−185, 300] 57.4 11,061 8779 93.9 50 30
Ic [−185.3, 300.0] [253, 300] 260.0 7580 6172 24.0 41 33
Id [−185.3, 300.0] [−75, 299] 58 5110 4050 104.6 45 23

Table 10.

Results of Stage 4 for FP.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−82.99, 300.0] [98, 300] 199 7459 8696 1.6 35 22
Ib,1 [−82.99, 300.0] [−82, 300] 109 9694 6166 1.4 38 8
Ib,2 [−82.99, 300.0] [−82, 300] 109 11,314 8474 8.7 50 30
Ib,3 [−82.99, 300.0] [−82, 300] 109 11,118 8466 25.8 50 30
Ib,4 [−82.99, 300.0] [−82, 300] 109 10,922 8458 8.5 50 30
Ic [−82.99, 300.0] [250, 300] 275 7667 8170 60.9 47 34
Id [−82.99, 300.0] [54, 300] 177 5193 6436 2.0 45 23

Table 11.

Results of Stage 4 for LP.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−3.6, 6.84] [−3.6, 6.8] 1.6 7597 9008 1.9 39 23
Ib,1 [−3.6, 6.84] [−3.6, 6.8] 1.6 9836 6481 2.9 38 8
Ib,2 [−3.6, 6.84] [−3.6, 6.8] 1.6 11,456 8789 21.1 50 30
Ib,3 [−3.6, 6.84] [−3.6, 6.8] 1.6 11,260 8781 20.4 50 30
Ib,4 [−3.6, 6.84] [−3.6, 6.8] 1.6 11,064 8773 24.2 50 30
Ic [−3.6, 6.84] [−3.6, 6.8] 1.6 7801 8476 1.1 47 32
Id [−3.6, 6.84] [−3.6, 6.8] 1.6 5335 6754 4.3 45 23

Table 12.

Results of Stage 4 for SL.

Instance [a_,a¯] [y_,y¯] y* #v #c IP-Time n nint
Ia [−9.33, 1.11] [−9.3, −2.0] −5.6 7674 9186 2.4 41 23
Ib,1 [−9.33, 1.11] [−9.3, −2.0] −5.6 9906 6650 22.3 38 12
Ib,2 [−9.33, 1.11] [−9.3, −2.0] −5.6 11,526 8958 15.2 50 30
Ib,3 [−9.33, 1.11] [−9.3, −2.0] −5.6 11,330 8950 16.2 50 30
Ib,4 [−9.33, 1.11] [−9.3, −2.0] −5.6 11,134 8942 122.7 50 30
Ic [−9.33, 1.11] [−9.3, −2.0] −5.6 7874 8648 1.2 54 33
Id [−9.33, 1.11] [−9.3, −3.0] −6.1 5402 6917 8.1 43 23

Figure 11a illustrates the chemical graph G inferred from instance Ic with y*=3.0 of KOW in Table 7.

Figure 11.


(a) G inferred from Ic with y*=3.0 of KOW; (b) G inferred from Id with y*=1.6 of LP.

Figure 11b illustrates the chemical graph G inferred from instance Id with y*=1.6 of LP in Table 11.

The topological specification of instances Ia, Ic and Id is more restricted than that of the other instances, and thereby the feasible target range [y_,y¯] of Ia, Ic and Id is narrower than the original range [a_,a¯] for some properties π. We see that the running time for solving an MILP instance with n=50 is 8.5 to 122.7 s, which is much smaller than the running time of 61 to 12,058 s to solve a similar set of MILP instances with n=50 in the experimental result for the previous method [28].

Stage 5. We computed chemical isomers G* of each target chemical graph G inferred in Stage 4. When the number of all chemical isomers of G exceeds 100, we generate only 100 of them. The algorithm can evaluate a lower bound on the total number of chemical isomers of G without generating all of them.

Tables 13 and 14 show the computational results of the experiment, where we denote the following:

  • -

    DP-time: the running time (s) to execute the dynamic programming algorithm in Stage 5 to compute a lower bound on the number of all chemical isomers G* of G and generate all (or up to 100) chemical isomers G*;

  • -

    G-LB: a lower bound on the number of all chemical isomers G* of G; and

  • -

    #G: the number of all (or up to 100) chemical isomers G* of G generated in Stage 5.

Table 13.

Results of Stage 5 for KOW, LP, and BP.

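| Instance | KOW: DP-time | KOW: G-LB | KOW: #G | LP: DP-time | LP: G-LB | LP: #G | BP: DP-time | BP: G-LB | BP: #G |
|---|---|---|---|---|---|---|---|---|---|
| Ia | 0.031 | 16 | 16 | 0.164 | 128 | 100 | 0.164 | 1.4×10^5 | 100 |
| Ib,1 | 0.149 | 2.8×10^5 | 100 | 0.148 | 2.0×10^10 | 100 | 0.162 | 4.4×10^5 | 100 |
| Ib,2 | 44.1 | 3.9×10^10 | 100 | 118 | 900 | 100 | 171 | 6 | 6 |
| Ib,3 | 27.2 | 20 | 20 | 80.2 | 6 | 6 | 28.6 | 7 | 7 |
| Ib,4 | 0.166 | 6000 | 100 | 73 | 12 | 12 | 142 | 5 | 5 |
| Ic | 0.166 | 6000 | 100 | 0.168 | 288 | 100 | 0.168 | 4.0×10^5 | 100 |
| Id | 22.3 | 8.3×10^10 | 100 | 1.44 | 3.2×10^8 | 100 | 1.7 | 9.7×10^9 | 100 |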

Table 14.

Results of Stage 5 for FP, MP, and SL.

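| Instance | FP: DP-time | FP: G-LB | FP: #G | MP: DP-time | MP: G-LB | MP: #G | SL: DP-time | SL: G-LB | SL: #G |
|---|---|---|---|---|---|---|---|---|---|
| Ia | 0.057 | 32 | 32 | 0.165 | 256 | 100 | 0.165 | 1024 | 100 |
| Ib,1 | 0.164 | 3.1×10^6 | 100 | 0.166 | 1.4×10^6 | 100 | 0.163 | 4.5×10^5 | 100 |
| Ib,2 | 28.8 | 720 | 100 | 8.26 | 2.4×10^10 | 100 | 1.07 | 5.6×10^9 | 100 |
| Ib,3 | 72.2 | 27 | 27 | 51.9 | 1 | 1 | 46.5 | 1680 | 100 |
| Ib,4 | 40.3 | 20 | 20 | 125 | 6.1×10^7 | 100 | 7.01 | 1.1×10^8 | 100 |
| Ic | 0.169 | 1.1×10^5 | 100 | 0.173 | 6048 | 100 | 0.168 | 120 | 100 |
| Id | 0.057 | 32 | 32 | 0.17 | 4.2×10^8 | 100 | 0.165 | 1024 | 100 |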

From Tables 13 and 14, we observe that the running time for generating up to 100 target chemical graphs in Stage 5 is not considerably larger than that in Stage 4.

4. Discussions and Conclusions

The framework of designing chemical graphs using ANNs and MILP has been proposed [23] as a basis of a total system of the QSAR and the inverse of QSAR, where the inverse of a prediction function produced by an ANN is solved by an MILP. The merit of the framework is that the inverse problem can be treated exactly as a mathematical problem, and an MILP instance with a moderate size can be efficiently solved with a fast MILP solver. On the other hand, the main technical concern in applying the framework is in defining a feature vector of a chemical graph in terms of graph theoretical descriptors so that the computation of a feature vector can be simulated with a set of linear constraints in an MILP. So far, the framework has been applied to the design of new methods of inferring several restricted classes of chemical graphs such as the graphs with rank at most 2 and the ρ-lean cyclic graphs [26,28].

Herein, we examine some technical issues in the previous method before we observe some new features of our method in this paper.

In the feature vector of the previous models [26,28], the structure of subgraphs used as descriptors is only a pair of adjacent vertices, called adjacency-configuration or edge-configuration, which is significantly limited from a variety of subgraphs used in a more sophisticated construction of a feature vector such as the fingerprint. However, including the occurrence of a certain subgraph with only a few vertices as part of a feature vector may require realizing a mechanism of the subgraph isomorphism in an MILP that simulates the computation of such an occurrence and can easily make the resulting MILP very complicated and hard to solve. Furthermore, the feature vector can be defined only for cyclic graphs and we need to eliminate any acyclic graphs from the original data set before we construct a prediction function. This may reduce a data set to an unnecessarily small size or reduce the chances of capturing important information on QSAR over all types of graphs.

A branch-parameter ρ was originally introduced as a new measure to the “agglomeration degree” of trees [24] and then used to define restricted classes of acyclic and cyclic graphs [24,27]. In fact, such a restriction on the structure of target chemical graphs was rather necessary to reduce the size of an MILP formulation that simulates a selection process of a target chemical graph from a supergraph (called a scheme graph), where the number of variables and constraints required to infer a chemical graph with n* non-hydrogen atoms is O(n*) when some other parameters such as ρ are regarded as constants.

Although nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24], the way of specifying the topological structure of a target chemical graph in the previous method [26,28] was based on the core and the non-core of a chemical graph, and we could not include a fixed substructure in the non-core of a target chemical graph.

Compared with the previous models, the two-layered model proposed in this paper is rather simple, where a chemical graph is regarded as a combination of the interior and the exterior. The new model can deal with chemical compounds with any graph structure and include a prescribed structure in both of the acyclic and cyclic parts of a target chemical graph as long as the requirement on target chemical graphs is described under the set of specification rules introduced in this paper. This considerably improves the availability of the framework in a practical application.

The feature vector of our two-layered model can be defined for arbitrary graphs. In the new feature vector, the exterior of a chemical graph is encoded into fringe-configurations, i.e., the occurrence of each chemical rooted tree with height at most ρ, where we may regard the set of such chemical rooted trees as playing a role similar to that of some types of functional groups. In our method, we include the occurrence of each such chemical rooted tree as part of the descriptors of a feature vector, and the descriptors of our feature vector on the exterior of a chemical graph may have an effect analogous to that of a fingerprint.

Our specification of target chemical graphs can specify a candidate set F of chemical rooted trees that are allowed to be used as chemical rooted trees in the exterior of a target chemical graph. This allows us to control the chemical property of target chemical graphs in a more meaningful way, since chemical properties of some rooted trees in F are known as functional groups and some kinds of rooted trees can be prohibited in a target chemical graph, if necessary, just by excluding them from the candidate set F. Although the number |F(Dπ)| of different kinds of such chemical trees in a data set Dπ from PubChem is approximately up to 300 for ρ=2 in many cases and the number of input nodes in an ANN N becomes over |F(Dπ)|, we derived an MILP formulation for inferring a chemical graph with n* non-hydrogen atoms and a candidate set F of chemical rooted trees by using O(n*+|F|) variables and O(n*|F|) constraints when the number of interior-vertices is constant, where |F| can be quite small compared with |F(Dπ)|.

We have implemented the proposed method for inferring chemical compounds with a prescribed topological substructure, setting ρ=2. The results of computational experiments using some chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophilicity, and solubility suggest that the proposed system can infer chemical graphs with 50 non-hydrogen atoms.

For a larger branch-parameter, say ρ=3,4, we obtain a greater variety of chemical rooted trees, which provides new descriptors in a feature vector and new candidates for fringe-trees in the exterior of a target chemical graph, whereas the number of different chemical rooted trees in F(Dπ) may increase rapidly.

It is left as future work to use other learning methods such as decision tree, random forest, graph convolution, and an ensemble method in Stages 3 and 4 in the framework.

Acknowledgments

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Abbreviations

ANN artificial neural network
MILP mixed integer linear programming

Supplementary Materials

The following are available online at https://www.mdpi.com/1422-0067/22/6/2847/s1.

Author Contributions

Conceptualization, H.N. and T.A.; methodology, H.N.; software, N.A.A., J.Z., Y.S., K.H., and L.Z.; validation, N.A.A., J.Z., and H.N.; formal analysis, H.N.; data resources, L.Z., K.H., H.N., and T.A.; writing—original draft preparation, H.N.; writing—review and editing, N.A.A. and T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Data Availability Statement

Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Miyao T., Kaneko H., Funatsu K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 2016;56:286–299. doi: 10.1021/acs.jcim.5b00628.
  • 2. Ikebata H., Hongo K., Isomura T., Maezono R., Yoshida R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017;31:379–391. doi: 10.1007/s10822-016-0008-z.
  • 3. Rupakheti C., Virshup A., Yang W., Beratan D.N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 2015;55:529–537. doi: 10.1021/ci500749q.
  • 4. Fujiwara H., Wang J., Zhao L., Nagamochi H., Akutsu T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 2008;48:1345–1357. doi: 10.1021/ci700385a.
  • 5. Kerber A., Laue R., Grüner T., Meringer M. MOLGEN 4.0. MATCH Commun. Math. Comput. Chem. 1998;37:205–208.
  • 6. Li J., Nagamochi H., Akutsu T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016;15:633–646. doi: 10.1109/TCBB.2016.2628888.
  • 7. Reymond J.L. The chemical space project. Acc. Chem. Res. 2015;48:722–730. doi: 10.1021/ar500432k.
  • 8. Bohacek R.S., McMartin C., Guida W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996;16:3–50. doi: 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6.
  • 9. Akutsu T., Fukagawa D., Jansson J., Sadakane K. Inferring a graph from path frequency. Discrete Appl. Math. 2012;160:1416–1428. doi: 10.1016/j.dam.2012.02.002.
  • 10. Kipf T.N., Welling M. Semi-supervised classification with graph convolutional networks. arXiv. 2016. arXiv:1609.02907.
  • 11. Gómez-Bombarelli R., Wei J.N., Duvenaud D., Hernández-Lobato J.M., Sánchez-Lengeling B., Sheberla D., Aguilera-Iparraguirre J., Hirzel T.D., Adams R.P., Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018;4:268–276. doi: 10.1021/acscentsci.7b00572.
  • 12. Segler M.H.S., Kogej T., Tyrchan C., Waller M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2017;4:120–131. doi: 10.1021/acscentsci.7b00512.
  • 13. Yang X., Zhang J., Yoshizoe K., Terayama K., Tsuda K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017;18:972–976. doi: 10.1080/14686996.2017.1401424.
  • 14. Kusner M.J., Paige B., Hernández-Lobato J.M. Grammar variational autoencoder; Proceedings of the 34th International Conference on Machine Learning; Sydney, NSW, Australia. 6–11 August 2017; pp. 1945–1954.
  • 15. De Cao N., Kipf T. MolGAN: An implicit generative model for small molecular graphs. arXiv. 2018. arXiv:1805.11973.
  • 16. Madhawa K., Ishiguro K., Nakago K., Abe M. GraphNVP: An invertible flow model for generating molecular graphs. arXiv. 2019. arXiv:1905.11600.
  • 17. Shi C., Xu M., Zhu Z., Zhang W., Zhang M., Tang J. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv. 2020. arXiv:2001.09382.
  • 18. Cherkasov A., Muratov E.M.N., Fourches D., Varnek A., Baskin I.I., Cronin M., Dearden J., Gramatica P., Martin Y.C., Todeschini R., et al. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014;57:4977–5010. doi: 10.1021/jm4004285.
  • 19. Cramer R.D., III, Patterson D.E., Bunce J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988;110:5959–5967. doi: 10.1021/ja00226a005.
  • 20. Cramer R.D. Template CoMFA generates single 3D-QSAR models that, for twelve of twelve biological targets, predict all ChEMBL-tabulated affinities. PLoS ONE. 2015;10:e0129307. doi: 10.1371/journal.pone.0129307.
  • 21. Moriwaki H., Tian Y.-S., Kawashita N., Takagi T. Three-dimensional classification structure–activity relationship analysis using convolutional neural network. Chem. Pharm. Bull. 2019;67:426–432. doi: 10.1248/cpb.c18-00757.
  • 22. Azam N.A., Chiewvanichakorn R., Zhang F., Shurbevski A., Nagamochi H., Akutsu T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming; Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies; Valletta, Malta. 24–26 February 2020; pp. 101–108.
  • 23. Zhang F., Zhu J., Chiewvanichakorn R., Shurbevski A., Nagamochi H., Akutsu T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees; Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems; Kitakyushu, Japan. 22–25 September 2020; pp. 433–444.
  • 24. Azam N.A., Zhu J., Sun Y., Shi Y., Shurbevski A., Zhao L., Nagamochi H., Akutsu T. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. 2020. submitted. doi: 10.1186/s13015-021-00197-2.
  • 25. Ito R., Azam N.A., Wang C., Shurbevski A., Nagamochi H., Akutsu T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming; Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP 2020); Las Vegas, NV, USA. 27–30 July 2020.
  • 26. Zhu J., Wang C., Shurbevski A., Nagamochi H., Akutsu T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms. 2020;13:124. doi: 10.3390/a13050124.
  • 27. Akutsu T., Nagamochi H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv. 2020. arXiv:2010.09203. doi: 10.1109/TCBB.2021.3112598.
  • 28. Zhu J., Azam N.A., Zhang F., Shurbevski A., Haraguchi K., Zhao L., Nagamochi H., Akutsu T. A novel method for inferring of chemical compounds with prescribed topological substructures based on integer programming. 2020. submitted. doi: 10.1109/TCBB.2021.3112598.
  • 29. PubChem. Available online: https://pubchem.ncbi.nlm.nih.gov/ (accessed on 13 May 2020).
  • 30. Figshare. Available online: https://figshare.com/articles/dataset/Lipophilicity_Dataset_-_logD7_4_of_1_130_Compounds/5596750/1 (accessed on 13 May 2020).
  • 31. A Benchmark for Molecular Machine Learning. Available online: http://moleculenet.ai/datasets-1 (accessed on 13 May 2020).

Articles from International Journal of Molecular Sciences are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
