Algorithms for Molecular Biology. 2021 Aug 14;16:18. doi: 10.1186/s13015-021-00197-2

A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming

Naveed Ahmed Azam 1, Jianshen Zhu 1, Yanming Sun 1, Yu Shi 1, Aleksandar Shurbevski 1, Liang Zhao 2, Hiroshi Nagamochi 1, Tatsuya Akutsu 3
PMCID: PMC8364129  PMID: 34391471

Abstract

Analysis of chemical graphs is becoming a major research topic in computational molecular biology due to its potential applications to drug design. One of the major approaches in such a study is inverse quantitative structure activity/property relationship (inverse QSAR/QSPR) analysis, which is to infer chemical structures from given chemical activities/properties. Recently, a novel two-phase framework has been proposed for inverse QSAR/QSPR, where in the first phase an artificial neural network (ANN) is used to construct a prediction function. In the second phase, a mixed integer linear program (MILP) formulated on the trained ANN and a graph search algorithm are used to infer desired chemical structures. The framework has been applied to the case of chemical compounds with cycle index up to 2 so far. The computational results conducted on instances with n non-hydrogen atoms show that a feature vector can be inferred by solving an MILP for up to n=40, whereas graphs can be enumerated for up to n=15. When applied to the case of chemical acyclic graphs, the maximum computable diameter of a chemical structure was up to 8. In this paper, we introduce a new characterization of graph structure, called “branch-height” based on which a new MILP formulation and a new graph search algorithm are designed for chemical acyclic graphs. The results of computational experiments using such chemical properties as octanol/water partition coefficient, boiling point and heat of combustion suggest that the proposed method can infer chemical acyclic graphs with around n=50 and diameter 30.

Keywords: QSAR/QSPR, Molecular design, Artificial neural network, Mixed integer linear programming, Enumeration of graphs

Background

In computational molecular biology, various types of data have been utilized, which include sequences, gene expression patterns, and protein structures. Graph structured data have also been extensively utilized, which include metabolic pathways, protein-protein interaction networks, gene regulatory networks, and chemical graphs. Much attention has recently been paid to the analysis of chemical graphs due to its potential applications to computer-aided drug design. One of the major approaches to computer-aided drug design is quantitative structure activity/property relationship (QSAR/QSPR) analysis, the purpose of which is to derive quantitative relationships between chemical structures and their activities/properties. Furthermore, inverse QSAR/QSPR has been extensively studied [1, 2], the purpose of which is to infer chemical structures from given chemical activities/properties. Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical structure maximizing (or minimizing) an objective function under various constraints.

In both QSAR/QSPR and inverse QSAR/QSPR, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1, 3, 4]. Inference or enumeration of graph structures from a given feature vector is a crucial subtask in many of such methods. Various methods have been developed for this enumeration problem [5–8] and the computational complexity of the inference problem has been analyzed [9, 10]. On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ [11].

As a new approach, artificial neural network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR. For example, variational autoencoders [12], recurrent neural networks [13, 14], and grammar variational autoencoders [15] have been applied. In these approaches, new chemical graphs are generated by solving a kind of inverse problem on neural networks that are trained using known chemical compound/activity pairs. However, the optimality of the solution is not necessarily guaranteed in these approaches. To guarantee optimality mathematically, a novel approach using mixed integer linear programming (MILP) has been proposed [16] for ANNs.

Recently, a new framework has been proposed [17–19] by combining two previous approaches: efficient enumeration of tree-like graphs [5], and MILP-based formulation of the inverse problem on ANNs [16]. This combined framework for inverse QSAR/QSPR mainly consists of two phases. The first phase solves (I) Prediction Problem, where a feature vector f(G) of a chemical graph G is introduced and a prediction function $\psi_N$ on a chemical property π is constructed with an ANN N using a data set of chemical compounds G and their values a(G) of π. The second phase solves (II) Inverse Problem, where (II-a) given a target value y of the chemical property π, a feature vector x is inferred from the trained ANN N so that $\psi_N(x)$ is close to y and (II-b) then a set of chemical structures G such that f(G) = x is enumerated by a graph search algorithm. In (II-a) of the above-mentioned previous methods [17–19], an MILP is formulated for acyclic chemical compounds. Afterwards, Ito et al. [20] and Zhu et al. [21] designed methods of inferring chemical graphs with cycle index 1 and 2, respectively, by formulating a new MILP and using an efficient algorithm for enumerating chemical graphs with cycle index 1 [22] and cycle index 2 [23, 24]. The computational results on instances with n non-hydrogen atoms show that a feature vector x can be inferred for up to around n = 40, whereas graphs G can be enumerated for up to around n = 15.

In this paper, we present a new characterization of graph structure, called “branch-height.” Based on this, we can treat a class of acyclic chemical graphs with a structure that is topologically restricted but frequently appears in a chemical database, formulate a new MILP formulation that can handle acyclic graphs with a large diameter, and design a new graph search algorithm that generates acyclic chemical graphs with up to around 50 vertices. The results of computational experiments using such chemical properties as octanol/water partition coefficient, boiling point and heat of combustion suggest that the proposed method is much more useful than the previous method.

The paper is organized as follows. "Preliminary" section introduces some notions on graphs, a modeling of chemical compounds and a choice of descriptors. "A method for inferring chemical graphs" section reviews the framework for inferring chemical compounds based on ANNs and MILPs. "MILPs for chemical acyclic graphs with bounded branch-height" section introduces a new method of modeling acyclic chemical graphs and proposes a new MILP formulation that represents an acyclic chemical graph G with n vertices, where our MILP requires only O(n) variables and constraints when the branch-parameter k and the k-branch height in G (graph topological parameters newly introduced in this paper) is constant. "A new graph search algorithm" section describes the idea of our new dynamic programming type of algorithm that enumerates a given number of acyclic chemical graphs for a given feature vector. "Experimental results" section reports the results on some computational experiments conducted for chemical properties such as octanol/water partition coefficient, boiling point and heat of combustion. "Concluding remarks" section makes some concluding remarks. Appendix A provides the statistical distribution of structural features of acyclic chemical graphs in a chemical graph database. Appendices B and C describe the idea of our MILP formulation and the details of all variables and constraints in the MILP formulation, respectively. Appendix D presents descriptions of our new graph search algorithm.

Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds and our choice of descriptors.

Let $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let $[a, b]$ denote the set of integers i with $a \le i \le b$.

Graphs

A graph stands for a simple undirected graph, where an edge joining two vertices u and v is denoted by uv (= vu). The sets of vertices and edges of a graph H are denoted by V(H) and E(H), respectively. Let H = (V, E) be a graph with a set V of vertices and a set E of edges. For a vertex $v \in V$, the set of neighbors of v in H is denoted by $N_H(v)$, and the degree $\deg_H(v)$ of v is defined to be $|N_H(v)|$. The length of a path is defined to be the number of edges in the path, and we denote by $\ell(P)$ the length of a path P. The distance $\mathrm{dist}_H(u, v)$ between two vertices $u, v \in V$ is defined to be the minimum length of a path connecting u and v in H. The diameter dia(H) of H is defined to be the maximum distance between two vertices in H; i.e., $\mathrm{dia}(H) \triangleq \max_{u, v \in V} \mathrm{dist}_H(u, v)$.

Centers of trees For a tree T with an even (resp., odd) diameter d, the center is defined to be the vertex v (resp., the adjacent vertex pair $\{v, v'\}$) situated in the middle of one of the longest paths, i.e., a path of length d. The center of each tree is uniquely determined.
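The definition of the center can be checked on small examples with the following minimal sketch. It is not taken from the paper: it uses the standard leaf-stripping procedure, and the function name `tree_center` and the edge-list input format are assumptions made for illustration. The input is assumed to be a tree with at least one edge.

```python
from collections import defaultdict

def tree_center(edges):
    """Return (center, diameter) of a tree given as a list of edges.

    The center is a sorted list holding one vertex (even diameter) or two
    adjacent vertices (odd diameter). It is found by repeatedly removing
    all current leaves until at most two vertices remain.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(adj)
    leaves = [v for v in remaining if len(adj[v]) == 1]
    rounds = 0
    while len(remaining) > 2:
        next_leaves = []
        for leaf in leaves:
            remaining.discard(leaf)
            (parent,) = adj[leaf]        # a leaf has exactly one neighbor left
            adj[parent].discard(leaf)
            if len(adj[parent]) == 1:    # parent becomes a leaf next round
                next_leaves.append(parent)
        leaves = next_leaves
        rounds += 1
    center = sorted(remaining)
    # Each round shortens every longest path by two edges.
    diameter = 2 * rounds + (len(center) - 1)
    return center, diameter
```

For a path on five vertices the sketch returns a single center vertex and diameter 4; for a path on four vertices it returns an adjacent pair and diameter 3.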

Rooted trees A rooted tree is defined to be a tree where a vertex (or a pair of adjacent vertices) is designated as the root. Let T be a rooted tree, where for two adjacent vertices u and v, vertex u is called the parent of v if u is closer to the root than v is. The height height(v) of a vertex v in T is defined to be the maximum length of a path from v to a leaf u in the descendants of v, where height(v)=0 for each leaf v in T. Figure 1a and b illustrate examples of trees rooted at the center.

Fig. 1. An illustration of rooted trees and a 2-branch-tree: (a) a tree $H_1$ with odd diameter 11; (b) a tree $H_2$ with even diameter 10; (c) the 2-branch-tree of $H_2$

Degree-bounded trees For positive integers a, b and c with $b \ge 2$, let T(a, b, c) denote the rooted tree such that the number of children of the root is a, the number of children of each non-root internal vertex is b and the distance from the root to each leaf is c. We see that the number of vertices in T(a, b, c) is $a(b^c - 1)/(b - 1) + 1$, and the number of non-leaf vertices in T(a, b, c) is $a(b^{c-1} - 1)/(b - 1) + 1$. In the rooted tree T(a, b, c), we denote the vertices by $v_1, v_2, \ldots, v_n$ in a breadth-first-search order, and denote the edge between a vertex $v_i$ with $i \in [2, n]$ and its parent by $e_i$, where $n = a(b^c - 1)/(b - 1) + 1$ and each vertex $v_i$ with $i \in [1, a(b^{c-1} - 1)/(b - 1) + 1]$ is a non-leaf vertex. For each vertex $v_i$ in T(a, b, c), let Cld(i) denote the set of indices j such that $v_j$ is a child of $v_i$, and prt(i) denote the index j such that $v_j$ is the parent of $v_i$ when $i \in [2, n]$. Let $P_{prc}(a, b, c)$ be a set of ordered index pairs (i, j) of vertices $v_i$ and $v_j$ in T(a, b, c). We call $P_{prc}(a, b, c)$ proper if the following conditions hold:

  1. For each pair of vertices $v_i$ and $v_j$ in T(a, b, c) such that $v_i$ is the parent of $v_j$, there is a sequence $(i_1, i_2), (i_2, i_3), \ldots, (i_{k-1}, i_k)$ of index pairs in $P_{prc}(a, b, c)$ such that $i_1 = i$ and $i_k = j$; and

  2. Each subtree H = (V, E) of T(a, b, c) with $v_1 \in V$ is isomorphic to a subtree $H' = (V', E')$ by a graph isomorphism $\psi: V \to V'$ with $\psi(v_1) = v_1$ so that if $v_j \in V'$ for a pair $(i, j) \in P_{prc}(a, b, c)$ then $v_i \in V'$.

Note that a proper set $P_{prc}(a, b, c)$ is not necessarily unique.
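As a sanity check on the vertex counts of T(a, b, c), the following minimal sketch builds the tree with breadth-first indices and exposes the maps prt and Cld. The function names `build_T` and `count_vertices` and the dictionary representation are assumptions made for illustration, not part of the paper.

```python
def build_T(a, b, c):
    """Build T(a, b, c) with vertices numbered 1..n in breadth-first order.

    Returns (prt, cld): prt[i] is the parent index of v_i for i >= 2,
    and cld[i] is the list of child indices of v_i (empty for leaves).
    """
    prt, cld = {}, {1: []}
    level = [1]
    n = 1
    for depth in range(c):
        fanout = a if depth == 0 else b   # the root has a children, others b
        nxt = []
        for i in level:
            for _ in range(fanout):
                n += 1
                prt[n] = i
                cld[i].append(n)
                cld[n] = []
                nxt.append(n)
        level = nxt
    return prt, cld

def count_vertices(a, b, c):
    """Closed form a(b^c - 1)/(b - 1) + 1 for the number of vertices."""
    return a * (b**c - 1) // (b - 1) + 1
```

For example, `build_T(3, 2, 2)` yields 10 vertices, of which 4 are non-leaf vertices, in agreement with the two closed-form expressions.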

Branch-height in trees In this paper, we introduce the “branch-height” of a tree as a new measure of the “agglomeration degree” of trees. We specify a non-negative integer k, called a branch-parameter, to define branch-height. First we regard T as a rooted tree by choosing the center of T as the root. Figure 1a, b illustrate examples of rooted trees. We introduce the following terminology on a rooted tree T.

  • A leaf k-branch: A non-root vertex v in T such that height(v)=k.

  • A non-leaf k-branch: A non-root vertex v in T such that v has at least two children u with $height(u) \ge k$. We call a leaf or a non-leaf k-branch a k-branch. Figure 2a–c illustrate the k-branches of the rooted tree $H_2$ in Fig. 1b for k = 1, 2 and 3, respectively.

  • A k-branch-path: A path P in T that joins two vertices u and $u'$ such that each of u and $u'$ is the root or a k-branch and P does not contain the root or a k-branch as an internal vertex.

  • The k-branch-subtree of T: The subtree of T that consists of the edges in all k-branch-paths of T. We call a vertex (resp., an edge) in T a k-internal vertex (resp., a k-internal edge) if it is contained in the k-branch-subtree of T and a k-external vertex (resp., a k-external edge) otherwise. Let $V^{in}$ and $V^{ex}$ (resp., $E^{in}$ and $E^{ex}$) denote the sets of k-internal and k-external vertices (resp., edges) in T.

  • The k-branch-tree of T: The rooted tree obtained from the k-branch-subtree of T by replacing each k-branch-path with a single edge. Figure 1c illustrates the 2-branch-tree of the rooted tree H2 in Fig. 1b. Notice that by our definitions, leaf k-branches and non-leaf k-branches are leaves and branching points in the k-branch-tree.

  • A k-fringe-tree: One of the connected components that consists of the edges not in the k-branch-subtree. Each k-fringe-tree T contains exactly one vertex v in the k-branch-subtree, where T is regarded as a tree rooted at v. Note that the height of any k-fringe-tree is at most k. Figure 2a–c illustrate the k-fringe-trees of the rooted tree H2 in Fig. 1b for k=1,2 and 3, respectively.

  • The k-branch-leaf number $bl_k(T)$: The number of leaf k-branches in T. For the trees $H_i$, i = 1, 2 in Fig. 1a, b, it holds that $bl_0(H_1) = bl_0(H_2) = 8$, $bl_1(H_1) = bl_1(H_2) = 5$, $bl_2(H_1) = bl_2(H_2) = 3$ and $bl_3(H_1) = bl_3(H_2) = 2$.

  • The k-branch height $bh_k(T)$ of T: The maximum number of k-branches along a path from the root to a leaf of T; i.e., $bh_k(T)$ is the height of the k-branch-tree of T (the maximum length of a path from the root to a leaf in the k-branch-tree). For the trees $H_i$, i = 1, 2 in Fig. 1a, b, it holds that $bh_0(H_1) = bh_0(H_2) = 3$, $bh_1(H_1) = bh_1(H_2) = 3$, $bh_2(H_1) = bh_2(H_2) = 2$ and $bh_3(H_1) = bh_3(H_2) = 1$.
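The terminology above can be made concrete with a minimal sketch that computes the k-branches, the k-branch-leaf number and the k-branch height of a rooted tree. It is an illustration rather than the authors' implementation. It assumes the tree is given as a hypothetical dictionary `children` mapping each vertex to its list of children, that the root has already been chosen as the center, and that a non-leaf k-branch is a non-root vertex with at least two children of height at least k.

```python
def heights(children, root):
    """Map each vertex to its height (leaves have height 0)."""
    h = {}
    def visit(v):
        h[v] = max((visit(u) + 1 for u in children.get(v, [])), default=0)
        return h[v]
    visit(root)
    return h

def k_branches(children, root, k):
    """Return (leaf k-branches, non-leaf k-branches) as two sets."""
    h = heights(children, root)
    leaf = {v for v in h if v != root and h[v] == k}
    nonleaf = {v for v in h if v != root
               and sum(h[u] >= k for u in children.get(v, [])) >= 2}
    return leaf, nonleaf

def branch_leaf_number(children, root, k):
    """bl_k: the number of leaf k-branches."""
    return len(k_branches(children, root, k)[0])

def branch_height(children, root, k):
    """bh_k: the maximum number of k-branches on a root-to-leaf path."""
    leaf, nonleaf = k_branches(children, root, k)
    kb = leaf | nonleaf
    def best(v):
        below = max((best(u) for u in children.get(v, [])), default=0)
        return below + (1 if v in kb else 0)
    return best(root)
```

On the small hand-made tree `{0: [1, 2], 1: [3, 5], 3: [4], 5: [6], 2: [7], 7: [8]}` rooted at its center 0 (this tree is not one of the trees in Fig. 1), the sketch gives a 1-branch-leaf number of 3 and a 1-branch height of 2, while for k = 2 it gives 2 and 1, respectively.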

Fig. 2. An illustration of the k-branches (depicted by gray circles), the k-branch-subtree (depicted by solid lines) and k-fringe-trees (depicted by dashed lines) of $H_2$: (a) k = 1; (b) k = 2; (c) k = 3

Even though this paper deals exclusively with acyclic graphs, we formally introduce the k-branch height for chemical cyclic graphs (chemical graphs that contain at least one cycle). The core of a chemical cyclic graph G is defined to be the induced subgraph $G'$ of G that consists of the vertices in a cycle or in a path joining two cycles. A vertex in the core (resp., not in the core) is called a core vertex (resp., a non-core vertex). The edges not in the core of a chemical cyclic graph G form a collection of trees, each of which we call a non-core tree. Each non-core tree contains exactly one core vertex and is regarded as a tree rooted at the core vertex. The k-branch height of a chemical cyclic graph G is defined to be the maximum of the k-branch heights over all non-core trees. We observe that most chemical graphs G with at most 50 non-hydrogen atoms satisfy $bh_2(G) \le 2$. See Appendix A for a summary of the statistical feature distribution of chemical graphs registered in the chemical database PubChem [25].

For convenient reference, we summarize the graph-related notation used throughout this paper in Table 1.

Table 1. Graph-theoretic notation

Symbol  Designation
General graph notation
 $H = (V, E)$  A graph H with a vertex set V and edge set E
 $V(H)$  The vertex set of a graph H
 $E(H)$  The edge set of a graph H
 $N_H(v)$  The set of neighbors of a vertex v in a graph H
 $\deg_H(v)$  The degree $|N_H(v)|$ of a vertex v in a graph H
 $\mathrm{dist}_H(u, v)$  The distance between two vertices u and v in a graph H
 $\mathrm{dia}(H)$  The diameter of a graph H
 $\ell(P)$  The length of a path P
Branch-height in a tree T
 $V^{in}$  The set of k-internal vertices for a fixed branch-parameter k
 $V^{ex}$  The set of k-external vertices for a fixed branch-parameter k
 $E^{in}$  The set of k-internal edges for a fixed branch-parameter k
 $E^{ex}$  The set of k-external edges for a fixed branch-parameter k
 $bl_k(T)$  The k-branch-leaf number of T
 $bh_k(T)$  The k-branch height of T

Modeling of chemical compounds

We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. Let Λ be a set of labels each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that Λ does not contain H (hydrogen). Let mass(a) and val(a) denote the mass and valence of a chemical element $a \in \Lambda$, respectively. In our model, we use the integers $\mathrm{mass}^*(a) = \lfloor 10 \cdot \mathrm{mass}(a) \rfloor$, $a \in \Lambda$, and assume that each chemical element $a \in \Lambda$ has a unique valence $\mathrm{val}(a) \in [1, 4]$.

We introduce a total order < over the elements in Λ according to their mass values; i.e., we write a < b for chemical elements $a, b \in \Lambda$ with mass(a) < mass(b). A pair of two atoms a and b, $a, b \in \Lambda$, joined with a bond-multiplicity $m \in [1, 3]$, where m = 1, 2, 3 corresponds to a single, double, and triple bond, respectively, is denoted by a tuple γ = (a, b, m), called the adjacency-configuration of the atom pair. Choose a set $\Gamma_{<}$ of tuples $\gamma = (a, b, m) \in \Lambda \times \Lambda \times [1, 3]$ such that a < b. For a tuple $\gamma = (a, b, m) \in \Lambda \times \Lambda \times [1, 3]$, let $\overline{\gamma}$ denote the tuple (b, a, m). Set $\Gamma_{>} = \{\overline{\gamma} \mid \gamma \in \Gamma_{<}\}$, $\Gamma_{=} = \{(a, a, m) \mid a \in \Lambda, m \in [1, 3]\}$, and $\Gamma = \Gamma_{<} \cup \Gamma_{=}$.

We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage.

Let (H, α, β) be a tuple of a graph H = (V, E), a function $\alpha: V \to \Lambda$ and a function $\beta: E \to [1, 3]$, where α(v) = a and β(e) = m mean that a chemical element a is assigned to a vertex v and a bond-multiplicity m is assigned to an edge e, respectively. For notational convenience, we denote the sum of bond-multiplicities of edges incident to a vertex $u \in V$ by

  • $\beta(u) \triangleq \sum_{uv \in E} \beta(uv)$.

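A tuple G = (H, α, β) is called a chemical graph over Λ and $\Gamma_{<} \cup \Gamma_{=}$ if the following hold:

  • (i) H is connected;

  • (ii) $(\alpha(u), \alpha(v), \beta(uv)) \in \Gamma_{<} \cup \Gamma_{=}$ for each edge $uv \in E$; and

  • (iii) $\beta(u) \le \mathrm{val}(\alpha(u))$ for each vertex $u \in V$.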

A chemical graph G=(H,α,β) is called a “chemical acyclic graph” if the graph H is an acyclic graph. Similarly for other types of graphs for H.
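Conditions (i) to (iii) can be checked mechanically. The following minimal sketch does so for a labeled multigraph. The function name `is_chemical_graph` and the dictionary-based input format are assumptions made for illustration. Since an edge uv is undirected, the sketch accepts an edge if either orientation of its tuple belongs to the given set of adjacency-configurations, which is also an assumption about how condition (ii) is meant to be read.

```python
from collections import defaultdict, deque

def is_chemical_graph(alpha, beta, val, gamma):
    """Check conditions (i)-(iii) for a tuple (H, alpha, beta).

    alpha: dict vertex -> element label
    beta:  dict (u, v) -> bond-multiplicity in {1, 2, 3}
    val:   dict element label -> valence
    gamma: set of allowed adjacency-configurations (a, b, m)
    """
    adj = defaultdict(list)
    bond_sum = defaultdict(int)
    for (u, v), m in beta.items():
        adj[u].append(v)
        adj[v].append(u)
        bond_sum[u] += m
        bond_sum[v] += m
        a, b = alpha[u], alpha[v]
        # (ii) the adjacency-configuration must be allowed
        if (a, b, m) not in gamma and (b, a, m) not in gamma:
            return False
    # (i) connectivity, by breadth-first search from an arbitrary vertex
    start = next(iter(alpha))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    if len(seen) != len(alpha):
        return False
    # (iii) the bond-multiplicity sum must not exceed the valence
    return all(bond_sum[v] <= val[alpha[v]] for v in alpha)
```

For instance, the hydrogen-suppressed skeleton of acetic acid (two carbons, one double-bonded oxygen, one single-bonded oxygen) passes all three conditions, whereas turning the single C–O bond into a double bond violates condition (iii) at the carbon.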

We define the bond-configuration of an edge $e = uv \in E$ in a chemical graph G to be a tuple $(\deg_H(u), \deg_H(v), \beta(e))$ such that $\deg_H(u) \le \deg_H(v)$ for the end-vertices u and v of e. Let Bc denote the set of bond-configurations $\mu = (d_1, d_2, m) \in [1, 4] \times [1, 4] \times [1, 3]$ such that $\max\{d_1, d_2\} + m \le 5$. We regard $(d_1, d_2, m)$ and $(d_2, d_1, m)$ as the same bond-configuration.

In summary, we give the notation on modeling chemical compounds used throughout this paper in Table 2.

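Table 2. Notation adopted for modeling chemical compounds

Symbol  Designation
$\Lambda$  A set of labels representing chemical elements
$\mathrm{mass}(a)$  Atomic mass of chemical element $a \in \Lambda$
$\mathrm{val}(a)$  Valence of chemical element $a \in \Lambda$
$\mathrm{mass}^*(a)$  $\lfloor 10 \cdot \mathrm{mass}(a) \rfloor$, $a \in \Lambda$
$a < b$  A total order over labels in the set Λ, indicating mass(a) < mass(b)
$\gamma = (a, b, m)$  Adjacency-configuration for an atom pair, $a, b \in \Lambda$, $m \in [1, 3]$
$\overline{\gamma}$  For an adjacency-configuration γ = (a, b, m), $\overline{\gamma} = (b, a, m)$
$\Gamma_{<}$  Set of adjacency-configurations $\gamma = (a, b, m) \in \Lambda \times \Lambda \times [1, 3]$ with a < b
$\Gamma_{>}$  Set of adjacency-configurations $\Gamma_{>} = \{\overline{\gamma} \mid \gamma \in \Gamma_{<}\}$
$\Gamma_{=}$  Set of adjacency-configurations $\Gamma_{=} = \{(a, a, m) \mid a \in \Lambda, m \in [1, 3]\}$
$\Gamma$  $\Gamma = \Gamma_{<} \cup \Gamma_{=}$
$\alpha$  An assignment of atom labels in Λ to graph vertices
$\beta$  An assignment of integers in [1, 3] to graph edges, overloaded as $\beta(u) = \sum_{uv \in E(H)} \beta(uv)$ for vertices $u \in V(H)$ in a graph H
$\mathrm{Bc}$  Set of bond-configurations $\mu \in [1, 4] \times [1, 4] \times [1, 3]$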

Descriptors

In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our design of an algorithm for constructing graphs. Given a chemical acyclic graph G = (H, α, β), we define a feature vector f(G) that consists of the following 11 kinds of descriptors. We choose an integer $k \in [1, 4]$ as a branch-parameter.

General chemical graph descriptors

  • n(G): the number |V| of vertices.

  • $\overline{\mathrm{dia}}(G) \triangleq \mathrm{dia}(H)/n(G)$: the diameter of H divided by n(G) = |V|.

  • $\overline{\mathrm{ms}}(G) \triangleq \sum_{v \in V} \mathrm{mass}^*(\alpha(v))/n(G)$: the average mass of atoms in G.

  • $n_H(G)$: the number of hydrogen atoms to be added to G.

Descriptors for vertices of certain degree

  • $dg_i^t(G) \triangleq |\{v \in V^t \mid \deg_H(v) = i\}|$, $i \in [1, 4]$, $t \in \{in, ex\}$: the number of k-internal/k-external vertices of degree i in H, where the bond-multiplicity of edges incident to a vertex v is ignored in the degree of v.

Descriptors for branch-leaf number and branch-height

  • $bl_k(G)$: the k-branch-leaf number of G.

  • $bh_k(G)$: the k-branch height of G.

Descriptors for vertex labels

  • $ce_a^t(G) \triangleq |\{v \in V^t \mid \alpha(v) = a\}|$, $a \in \Lambda$, $t \in \{in, ex\}$: the number of k-internal/k-external vertices with chemical element $a \in \Lambda$.

Descriptors for the number of bonds

  • $bd_m^t(G) \triangleq |\{e \in E^t \mid \beta(e) = m\}|$, m = 2, 3, $t \in \{in, ex\}$: the number of k-internal/k-external edges with bond-multiplicity m.

Descriptors for adjacency-configurations

  • $ac_\gamma^t(G)$, $\gamma \in \Gamma$, $t \in \{in, ex\}$: the number of k-internal/k-external edges e = uv with adjacency-configuration γ = (a, b, m) (i.e., α(u) = a, α(v) = b and β(e) = m) in G.

Descriptors for bond-configurations

  • $bc_\mu^t(G)$, $\mu \in \mathrm{Bc}$, $t \in \{in, ex\}$: the number of k-internal/k-external edges e = uv with bond-configuration $\mu = (d, d', m)$ (i.e., $\deg_H(u) = d$, $\deg_H(v) = d'$ and β(e) = m) in G.

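Note that

$$n_H(G) \triangleq \sum_{a \in \Lambda,\, t \in \{in, ex\}} \mathrm{val}(a)\, ce_a^t(G) - \sum_{\gamma = (a, b, m) \in \Gamma,\, t \in \{in, ex\}} 2m \cdot ac_\gamma^t(G) = \sum_{a \in \Lambda,\, t \in \{in, ex\}} \mathrm{val}(a)\, ce_a^t(G) - 2\Bigl(n(G) - 1 + \sum_{m \in [2, 3],\, t \in \{in, ex\}} (m - 1) \cdot bd_m^t(G)\Bigr).$$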
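The second form of this identity says that the hydrogen count of an acyclic chemical graph equals the total valence minus twice the total bond-multiplicity, where the n(G) − 1 edges contribute one each and every double or triple bond contributes m − 1 more. A minimal sketch, assuming the same hypothetical dictionary format (`alpha` for vertex labels, `beta` for bond-multiplicities, `val` for valences) and an acyclic input:

```python
def hydrogens(alpha, beta, val):
    """Number of hydrogen atoms to add to an acyclic chemical graph.

    Implements sum of valences - 2 * (n - 1 + sum over edges of (m - 1)).
    """
    n = len(alpha)
    valence_sum = sum(val[a] for a in alpha.values())
    extra_bonds = sum(m - 1 for m in beta.values())
    return valence_sum - 2 * (n - 1 + extra_bonds)
```

For the skeleton C–C–O of ethanol this gives 6, and for the skeleton of acetic acid (one C=O double bond) it gives 4, matching the formulas C2H6O and C2H4O2.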

The number K of descriptors in our feature vector x = f(G) is K = 2|Λ| + 2|Γ| + 50. Note that the above K descriptors are not independent in the sense that some descriptors depend on the combination of other descriptors. For example, the descriptor $bd_i^{in}(G)$ can be determined by $\sum_{\gamma = (a, b, m) \in \Gamma:\, m = i} ac_\gamma^{in}(G)$.

A method for inferring chemical graphs

Framework for the Inverse QSAR/QSPR

We review the framework that solves the inverse QSAR/QSPR by using MILPs [20, 21], which is illustrated in Fig. 3. For a specified chemical property π such as boiling point, we denote by a(G) the observed value of the property π for a chemical compound G. As the first phase, we solve (I) Prediction Problem with the following three steps.

Fig. 3. (a)–(c) An illustration of Phase 1: (a) Stage 1 for preparing a data set $D_\pi$ for a graph class $\mathcal{G}$ and a specified chemical property π; (b) Stage 2 for introducing a feature function f with descriptors; (c) Stage 3 for constructing a prediction function $\psi_N$ with an ANN N. (d)–(e) An illustration of Phase 2: (d) Stage 4 for formulating an MILP $M(x, y, g; C_1, C_2)$ and finding a feasible solution (x, g) of the MILP for a target value y so that $\psi_N(x) = y$ (possibly detecting that no target graph G exists); (e) Stage 5 for enumerating graphs $G \in \mathcal{G}$ such that f(G) = x

Phase 1.

Stage 1: Let DB be a set of chemical graphs. For a specified chemical property π, choose a class $\mathcal{G}$ of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set $D_\pi = \{G_i \mid i = 1, 2, \ldots, m\} \subseteq \mathcal{G} \cap DB$ such that the value $a(G_i)$ of each chemical graph $G_i$, $i = 1, 2, \ldots, m$, is available. Set reals $\underline{a}, \overline{a} \in \mathbb{R}$ so that $\underline{a} \le a(G_i) \le \overline{a}$, $i = 1, 2, \ldots, m$.

Stage 2: Introduce a feature function $f: \mathcal{G} \to \mathbb{R}^K$ for a positive integer K. We call f(G) the feature vector of $G \in \mathcal{G}$, and call each entry of a vector f(G) a descriptor of G.

Stage 3: Construct a prediction function $\psi_N$ with an ANN N that, given a vector in $\mathbb{R}^K$, returns a real number in the range $[\underline{a}, \overline{a}]$ so that $\psi_N(f(G))$ takes a value nearly equal to a(G) for many chemical graphs in DB. See Fig. 3a–c for an illustration of Stages 1, 2, and 3 in Phase 1.

In this paper, we use the range-based method to define an applicability domain (AD) [26] for our inverse QSAR/QSPR. Set $\underline{x_j}$ and $\overline{x_j}$ to be the minimum and maximum values of the j-th descriptor $x_j$ in $f(G_i)$, respectively, over all graphs $G_i$, $i = 1, 2, \ldots, m$, where we possibly normalize some descriptors such as $ce_a^{in}(G)$, which is normalized as $ce_a^{in}(G)/n(G)$. Define our AD D to be the set of vectors $x \in \mathbb{R}^K$ such that $\underline{x_j} \le x_j \le \overline{x_j}$ for the variable $x_j$ of each j-th descriptor, $j = 1, 2, \ldots, K$.
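The range-based AD amounts to a per-descriptor bounding box over the training feature vectors. A minimal sketch follows; the function names `fit_domain` and `in_domain` are hypothetical, and the optional normalization of descriptors is assumed to have been applied beforehand.

```python
def fit_domain(vectors):
    """Return per-descriptor lower and upper bounds over training vectors."""
    lower = [min(col) for col in zip(*vectors)]
    upper = [max(col) for col in zip(*vectors)]
    return lower, upper

def in_domain(x, lower, upper):
    """True if every descriptor of x lies within its observed range."""
    return all(lo <= xj <= hi for xj, lo, hi in zip(x, lower, upper))
```

In the MILP these bounds appear as linear constraints on the descriptor variables, so membership in D never has to be tested after the fact; the sketch only illustrates what the constraints express.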

In the second phase, we try to find a vector $x \in \mathbb{R}^K$ from a target value y of the chemical property π such that $\psi_N(x) = y$. Based on the method due to Akutsu and Nagamochi [16], Chiewvanichakorn et al. [18] showed that this problem can be formulated as an MILP. By including a set of linear constraints such that $x \in D$ into their MILP, we obtain the next result.

Theorem 1

([20, 21]) Let N be an ANN with a piecewise-linear activation function for an input vector $x \in \mathbb{R}^K$, let $n_A$ denote the number of nodes in the architecture and let $n_B$ denote the total number of break-points over all activation functions. Then there is an MILP $M(x, y; C_1)$ that consists of variable vectors $x \in D\ (\subseteq \mathbb{R}^K)$, $y \in \mathbb{R}$, and an auxiliary variable vector $z \in \mathbb{R}^p$ for some integer $p = O(n_A + n_B)$ and a set $C_1$ of $O(n_A + n_B)$ constraints on these variables such that: $\psi_N(x) = y$ if and only if there is a vector (x, y) feasible to $M(x, y; C_1)$.

See Appendix “Upper and lower bounds on descriptors” for the set of constraints to define our AD D in the MILP M(x,y;C1) in Theorem 1.
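To give an impression of why a piecewise-linear activation function can be captured by linear constraints with binary variables, consider a single node with the ReLU activation, which has one break-point. The block below is the common textbook big-M encoding, shown only as an illustration; it is not necessarily the formulation used in [16], and the symbols s, w_i, b, L, U and z are assumptions introduced here. Let $s = \sum_i w_i x_i + b$ be the pre-activation value with known bounds $L \le s \le U$, $L < 0 < U$, and let $y = \max\{0, s\}$ be the node output.

```latex
\begin{align*}
  y &\ge 0,            & y &\ge s, \\
  y &\le U z,          & y &\le s - L(1 - z), \\
  z &\in \{0, 1\}.
\end{align*}
```

If z = 1, the constraints force y = s and hence s ≥ 0; if z = 0, they force y = 0 and hence s ≤ 0. Each break-point thus costs one binary variable and a constant number of constraints, which is in line with the $O(n_A + n_B)$ bounds in Theorem 1.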

A vector $x \in \mathbb{R}^K$ is called admissible if there is a chemical graph $G \in \mathcal{G}$ such that f(G) = x [17]. Let $\mathcal{A}$ denote the set of admissible vectors $x \in \mathbb{R}^K$. To ensure that a vector x inferred from a given target value y becomes admissible, we introduce a new vector variable $g \in \mathbb{R}^q$ for an integer q. For the class $\mathcal{G}$ of chemical acyclic graphs, Azam et al. [17] introduced a set $C_2$ of new constraints on x and g so that a feasible solution (x, g) of the new MILP for a target value y delivers

  • a vector x with $\psi_N(x) = y$, and

  • a vector g that represents a chemical acyclic graph $G \in \mathcal{G}$.

Afterwards, for the classes of chemical graphs with cycle index 1 and 2, Ito et al. [20] and Zhu et al. [21], respectively, presented such a set $C_2$ of constraints so that a vector g in a feasible solution (x, g) of a new MILP can represent a chemical graph G in the class $\mathcal{G}$.

As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treating the following inference problems.

(II-a) Inference of Vectors

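Input: A real y with $\underline{a} \le y \le \overline{a}$.

Output: Vectors $x \in \mathcal{A} \cap D$ and $g \in \mathbb{R}^q$ such that $\psi_N(x) = y$ and g forms a chemical graph $G \in \mathcal{G}$ with f(G) = x.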

(II-b) Inference of Graphs

Input: A vector $x \in \mathcal{A} \cap D$.

Output: All graphs $G \in \mathcal{G}$ such that f(G) = x.

The second phase consists of the next two steps.

Phase 2.

Stage 4:  Formulate Problem (II-a) as the above MILP $M(x, y, g; C_1, C_2)$ based on $\mathcal{G}$ and N. Find a feasible solution (x, g) of the MILP such that

  • $x \in \mathcal{A} \cap D$ and $\psi_N(x) = y$.

The second requirement may be replaced with the inequalities $(1 - \varepsilon) y \le \psi_N(x) \le (1 + \varepsilon) y$ for a tolerance ε > 0.

Stage 5:  To solve Problem (II-b), enumerate all (or a specified number of) graphs $G \in \mathcal{G}$ such that f(G) = x for the inferred vector x. See Fig. 3d, e for an illustration of Stages 4 and 5 in Phase 2.

In practical applications, there would be many criteria that a target chemical compound needs to satisfy rather than a single chemical property π, such as stability and synthesizability. The above five steps in the framework are rather schematic in the sense that it would be necessary to adjust several settings in each stage in order to find a collection of chemical graphs that meet many of those criteria after a repeated application of the framework. For example, we can include in an MILP formulation in Stage 4 additional conditions such as lower and upper bounds on the frequency of adjacency-configurations and extra requirements on substructures of a target chemical graph as long as these conditions can be expressed as linear constraints with integer/real variables. Also an efficient algorithm in Stage 5 can quickly offer a large number of isomers of the same feature vectors, to which we can apply a further screening to choose promising candidates for chemical graphs.

Our target graph class

In this paper, we choose a branch-parameter $k \ge 1$ and define a class $\mathcal{G}$ of chemical acyclic graphs G such that

  • The maximum degree in G is at most 4;

  • The k-branch height bhk(G) is bounded for a specified branch-parameter k; and

  • The size of each k-fringe-tree in G is bounded.

The reason why we restrict ourselves to the graphs in G is that this class G covers a large part of the acyclic chemical compounds registered in the chemical database PubChem. See Appendix A for a summary of the statistical features of the chemical graphs in PubChem in terms of k-branch height and the size of 2-fringe-trees. According to this, over 55% (resp., 99%) of acyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem have the maximum degree 3 (resp., 4); and nearly 87% (resp., 99%) of acyclic chemical compounds with up to 50 non-hydrogen atoms in PubChem have the 2-branch height at most 1 (resp., 2). This implies that k=2 is sufficient to cover most of chemical acyclic graphs. For k=2, over 92% of 2-fringe-trees of chemical compounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint:

$$n(T) \le 2 \deg_T(r) + 2 \quad \text{for each 2-fringe-tree } T \text{ with the root } r. \tag{1}$$

We formulate an MILP in Stage 4 that, given a target value y, infers a vector $x \in \mathbb{Z}_+^K$ with $\psi_N(x) = y$ and a chemical acyclic graph $G = (H, \alpha, \beta) \in \mathcal{G}$ with f(G) = x. We here specify some of the features of a graph $G \in \mathcal{G}$, such as the number of non-hydrogen atoms, in order to control the graph structure of target graphs to be inferred and to simplify MILP formulations. In this paper, we specify the following features of a graph $G \in \mathcal{G}$: a set Λ of chemical elements, a set $\Gamma_{<}$ of adjacency-configurations, the maximum degree, the number of non-hydrogen atoms, the diameter, the k-branch height and the k-branch-leaf number for a branch-parameter k.

More formally, given specified integers n, $d_{max}$, dia, k, bh, $bl \in \mathbb{Z}$ in addition to Λ and Γ, let $\mathcal{H}(n, d_{max}, dia, k, bh, bl)$ denote the set of acyclic graphs H such that

  • The maximum degree of a vertex in H is at most 3 when dmax=3 (or equal to 4 when dmax=4),

  • The number n(H) of vertices in H is n,

  • The diameter dia(H) of H is dia,

  • The k-branch height bhk(H) is bh,

  • The k-branch-leaf number blk(H) is bl and

  • (1) holds.

To design Stage 4 for our class $\mathcal{G}$, we formulate an MILP $M(x, g; C_2)$ that infers a chemical graph $G = (H, \alpha, \beta) \in \mathcal{G}$ with $H \in \mathcal{H}(n, d_{max}, dia, k, bh, bl)$ for a given specification $(\Lambda, \Gamma, n, d_{max}, dia, k, bh, bl)$. The details will be given in "MILPs for chemical acyclic graphs with bounded branch-height" section and Appendix C.

Design of Stage 5, i.e., generating chemical graphs G that satisfy f(G) = x for a given feature vector $x \in \mathbb{Z}_+^K$, is still challenging for a relatively large instance with size $n(G) \ge 20$. Algorithms have been proposed for generating chemical graphs G in Stage 5 for the classes of graphs with cycle index 0 to 2 [5, 22–24]. All of these are designed based on the branch-and-bound method and can generate a target chemical graph with size $n(G) \le 20$. To break this barrier, we newly employ the dynamic programming method for designing an algorithm in Stage 5 in order to generate a target chemical graph G with size n(G) = 50. For this, we further restrict the structure of acyclic graphs G so that the number $bl_2(G)$ of leaf 2-branches is at most 3. Among all acyclic chemical compounds with up to 50 non-hydrogen atoms in the chemical database PubChem, the ratio of acyclic chemical compounds G with $bl_2(G) \le 2$ (resp., $bl_2(G) \le 3$) is 78% (resp., 95%). See "A new graph search algorithm" section and Appendix D for the details on the new algorithm in Stage 5.

To conclude the description of the target graph class to be inferred by the inverse QSAR/QSPR framework developed in this paper, we summarize the global parameters in Table 3.

Table 3. Fixed parameters of target graphs

Symbol  Designation
$\Lambda$  A set of atom labels
$\Gamma$  A set of adjacency-configurations
$n$  Number of vertices
$d_{max}$  Maximum vertex degree: at most 3 when $d_{max} = 3$, and exactly 4 when $d_{max} = 4$
$dia$  Graph diameter
$k$  Branch-parameter
$bh$  k-branch height
$bl$  k-branch-leaf number

MILPs for chemical acyclic graphs with bounded branch-height

In this section, we describe the idea behind formulating an MILP M(x,g;C2) to infer a chemical acyclic graph G in the class 𝒢 for a given specification (Λ,Γ,n,dmax,dia,k,bh,bl) defined in the previous section. Table 3 summarizes the parameters that we assume to be fixed for a target graph.

Scheme graphs

Our new idea of constructing an acyclic graph H is as follows. See a rooted tree TB=T(dmax,dmax-1,bh) in Fig. 4a.

  • From the tree TB, we first choose a subtree T′ including the root u1. We use T′ as the k-branch-tree of H.

  • Next, we choose some edges in the tree T′ and replace each of the edges e = uiuj with a path Pe between vertices ui and uj. Let T″ denote the resulting tree. We use T″ as the k-branch-subtree of H.

  • Finally, we append to the tree T″ rooted trees with height at most k as the k-fringe-trees of H. The resulting tree is a required rooted tree H.

Fig. 4.

An illustration of scheme graph SG(dmax,k,bh,t) with dmax=3, k=2, bh=2, and t=5, where the vertices in TB (resp., in Pt) are depicted with black (resp., gray) circles: (a) A base-tree TB and a link-path Pt are joined with directed edges between them; (b) A tree Ss rooted at a vertex us = us,1 ∈ VB; (c) A tree Tt rooted at a vertex vt = vt,1 ∈ VP

In our MILP, we prepare a binary variable for each of the vertices and edges in TB so that a subtree T of TB can be selected as one of the combinations of these binary values.

To represent a replacement of an edge e with a path Pe in our MILP, we introduce a path Pt = (v1,1, v2,1, …, vt,1) of a sufficiently large length t−1, and a set F of directed edges between the vertices in TB and Pt, as shown in Fig. 4a. We also introduce a binary variable for each of the vertices and edges in Pt and F in our MILP. When an edge e = uiuj is replaced with a path Pe, we select an edge from ui to a vertex vh,1 in Pt and an edge from a vertex vh+p,1 in Pt to uj so that the edges (ui, vh,1) and (vh+p,1, uj) and the subpath (vh,1, vh+1,1, …, vh+p,1) of Pt form a path Pe. Such a path Pe can be selected as one of the combinations of these binary values. To append rooted trees to the resulting tree, we prepare a rooted tree with a sufficiently large size at each vertex in TB and Pt and introduce a binary variable for each of the vertices and edges in these rooted trees in our MILP. A rooted subtree of each such rooted tree can then be selected as a k-fringe-tree by one of the combinations of these binary values.

We call the graph that consists of all the above graphs TB, Pt and the edge set F and the set of rooted trees at the vertices in TB and Pt a scheme graph SG(dmax,k,bh,t).
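The edge-replacement mechanism can be sketched in a few lines. The following minimal Python illustration is not the MILP itself: it only lists the vertices of a path Pe built from a start index h and an offset p on the link-path. The function names `replace_edge_with_path` and `allocate_paths` and the string labels for vertices are hypothetical, and giving replaced edges consecutive, disjoint segments of the link-path is an assumption made here for simplicity.

```python
def replace_edge_with_path(ui, uj, h, p, t):
    """Vertex sequence of P_e = (ui, v_h, ..., v_{h+p}, uj).

    The link-path has vertices v_1..v_t; the subpath v_h..v_{h+p}
    supplies the p + 1 internal vertices of P_e.
    """
    if h < 1 or p < 0 or h + p > t:
        raise ValueError("subpath must lie inside the link-path v_1..v_t")
    return [ui] + [f"v{q}" for q in range(h, h + p + 1)] + [uj]


def allocate_paths(requests, t):
    """Assign consecutive, disjoint link-path segments to replaced edges.

    requests: list of (ui, uj, internal), where internal >= 1 is the number
    of link-path vertices wanted on the path that replaces edge ui-uj.
    (Consecutive allocation is an illustrative assumption.)
    """
    paths, h = {}, 1
    for ui, uj, internal in requests:
        paths[(ui, uj)] = replace_edge_with_path(ui, uj, h, internal - 1, t)
        h += internal
    return paths


# Example: two base-tree edges replaced using a link-path with t = 5 vertices.
example = allocate_paths([("u1", "u2", 2), ("u1", "u3", 3)], t=5)
```

If the requested segments need more than t link-path vertices, the helper raises an error, which mirrors why t has to be chosen sufficiently large.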

Figure 5a illustrates an acyclic graph H with n(H)=37, dia(H)=17, bh2(H)=2 and bl2(H)=3, where the maximum degree of a vertex is 3. Figure 5b illustrates the 2-branch-tree of the acyclic graph H in Fig. 5a. Figure 5c illustrates a subgraph H′ of the scheme graph SG(dmax,k,bh,t=n-bl-1) such that H′ is isomorphic to the acyclic graph H in Fig. 5a.

Fig. 5.

An illustration of selecting a subgraph H′ from the scheme graph SG(dmax,k,bh,t=n-bl-1): (a) An acyclic graph H ∈ ℋ(n,dmax,dia,k,bh,bl) with n=37, dmax=3, dia(H)=17, k=2, bh=2 and bl=3, where the labels of some vertices indicate the corresponding vertices in the scheme graph SG(dmax,k,bh,t); (b) The k-branch-tree of H for k=2; (c) An acyclic graph H′ selected from SG(dmax,k,bh,t) as a graph that is isomorphic to H in (a)

In this paper, we obtain the following result.

Theorem 2

Let Λ be a set of chemical elements, Γ be a set of adjacency-configurations, where |Λ| ≤ |Γ|, and K = 2|Λ| + 2|Γ| + 50. Given non-negative integers n ≥ 3, dmax ∈ {3, 4}, dia ≥ 3, k ≥ 1, bh ≥ 1 and bl ≥ 2, there is an MILP M(x,g;C2) that consists of variable vectors x ∈ R^K and g ∈ R^q for an integer $q = O(|\Gamma| \cdot [(d_{\max}-1)^{\mathrm{bh}+k} + n \cdot (d_{\max}-1)^{\max\{\mathrm{bh},k\}}])$ and a set C2 of constraints on x and g with size $O(|\Gamma| + (d_{\max}-1)^{\mathrm{bh}+k} + n \cdot (d_{\max}-1)^{\max\{\mathrm{bh},k\}})$ such that: (x,g) is feasible to M(x,g;C2) if and only if g forms a chemical acyclic graph G = (H,α,β) such that H ∈ ℋ(n,dmax,dia,k,bh,bl) and f(G) = x.

Note that our MILP requires only O(n) variables and constraints when the branch-parameter k, the k-branch height and |Γ| are constant.
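A short calculation may help to see why the bound becomes linear in n. It relies only on the assumption dmax ∈ {3, 4} from Theorem 2, so that dmax − 1 ≤ 3, and it reads the terms in the theorem as powers of dmax − 1, which is an interpretation of the stated bound rather than a quotation of it.

```latex
\begin{aligned}
q &= O\bigl(|\Gamma| \cdot [(d_{\max}-1)^{\mathrm{bh}+k} + n \cdot (d_{\max}-1)^{\max\{\mathrm{bh},k\}}]\bigr) \\
  &\subseteq O\bigl(|\Gamma| \cdot [3^{\mathrm{bh}+k} + n \cdot 3^{\max\{\mathrm{bh},k\}}]\bigr) \\
  &= O(n) \quad \text{when } |\Gamma|,\ \mathrm{bh} \text{ and } k \text{ are constants.}
\end{aligned}
```

The same substitution applies to the number of constraints, since every term other than n is then bounded by a constant.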

See Appendices B and C for the details of the MILP formulation and the set of all variables and constraints in the MILP formulation, respectively.

A new graph search algorithm

Previous methods of inferring chemical graphs [17–19] use a graph search algorithm based on the branch-and-bound algorithm proposed by Fujiwara et al. [5], where an enormous number of chemical graphs are constructed by repeatedly appending and removing a vertex one by one until a target chemical graph is constructed. Their algorithm cannot generate even one acyclic chemical graph when n(G) is larger than around 20.

This section introduces a new dynamic programming method for designing an algorithm in Stage 5. We consider the following aspects:

  1. Treat acyclic graphs with a certain limited structure that frequently appears among chemical compounds registered in the chemical database; and

  2. Instead of manipulating acyclic graphs directly, first compute the frequency vectors f(G′) (sub-vectors of the feature vectors f(G′), see Appendix D) of subtrees G′ of all target acyclic graphs and then construct a limited number of target graphs G from the process of computing the vectors.

For aspect 1, we choose a branch-parameter k=2 and treat acyclic graphs G that have a small 2-branch-leaf number such as bl2(G) ∈ [2,3] and satisfy the size constraint (1) on 2-fringe-trees. Figure 6a and b illustrate chemical acyclic graphs G with bl2(G)=2 and bl2(G)=3, respectively.

Fig. 6.

An illustration of chemical acyclic graphs G with diameter dia and bl2(G) = 2, 3: (a) A chemical acyclic graph G with two leaf 2-branches v1 and v2; (b) A chemical acyclic graph G with three leaf 2-branches v1, v2 and v3

We design a method for aspect 2 based on the mechanism of dynamic programming in the following way. Define a frequency vector f(T) of each chemical rooted tree T to be a vector that consists of the frequency of each chemical element a ∈ Λ, each adjacency-configuration γ ∈ Γ, each bond-configuration μ ∈ Bc, and each degree dgi ∈ Dg in T. We are given a vector x that is the frequency vector f(G) of a chemical acyclic graph G to be inferred.

We first construct a set FT of chemical rooted trees with height at most k=2 and compute the frequency vector f(T) of each chemical rooted tree T ∈ FT to obtain the set W(FT) of frequency vectors f(T), T ∈ FT. Note that a large number of chemical rooted trees T ∈ FT map to the same frequency vector w, so the size |W(FT)| is considerably smaller than the size |FT|.

We next combine two chemical rooted trees Ta, Tb ∈ FT to construct a chemical tree Ta,b by joining their roots ra and rb with an edge e = rarb of a bond-multiplicity m, as illustrated in Fig. 6a. In fact, we compute only the frequency vector f(Ta,b) of such a tree Ta,b without directly treating the graph structures of Ta, Tb and Ta,b. For this, we add two frequency vectors wa, wb ∈ W(FT) together with an additional term from the bond-multiplicity m to obtain the frequency vector wa,b (= f(Ta,b)) of such a tree Ta,b. Given such a vector wa,b, we can actually construct a chemical tree Ta,b with f(Ta,b) = wa,b by choosing trees Ta, Tb ∈ FT and combining them with an edge of bond-multiplicity m.

Our algorithm for generating a chemical acyclic graph G with bl2(G)=2 continues to compute a set W(p) of frequency vectors of chemical trees that can be obtained by combining p trees in FT for each p = 2, 3, …, (dia-5)/2. Finally, we find a vector pair (w1, w2) with w1 ∈ W((dia-5)/2) and w2 ∈ W((dia-5)/2) such that the vector obtained from w1, w2 and a bond-multiplicity m is equal to the given vector x; i.e., a chemical acyclic graph G with f(G) = x is obtained by joining chemical trees T1 and T2 with wi = f(Ti), i = 1, 2, with an edge of bond-multiplicity m.
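The vector-level combination described above can be illustrated with a toy sketch. It assumes a drastically simplified frequency vector (number of C atoms, number of O atoms, number of single bonds, number of double bonds) and a hypothetical table `BOND_TERMS` that gives the additional term for each bond-multiplicity; the descriptors used in the paper also include adjacency-configurations and degrees, which are omitted here. The function names are likewise illustrative.

```python
from itertools import product

# Hypothetical additional terms: joining two roots with bond-multiplicity m
# adds one single bond (m = 1) or one double bond (m = 2).
BOND_TERMS = {1: (0, 0, 1, 0), 2: (0, 0, 0, 1)}


def add_vectors(wa, wb, term):
    """Component-wise sum wa + wb + term."""
    return tuple(a + b + c for a, b, c in zip(wa, wb, term))


def combine(W_left, W_right, bond_terms):
    """Vectors of trees obtained by joining one tree from each set with an edge.

    Returns a dict mapping each resulting vector to one witness (wa, wb, m),
    so that a tree realizing the vector could be rebuilt later.
    """
    out = {}
    for wa, wb in product(W_left, W_right):
        for m, term in bond_terms.items():
            out.setdefault(add_vectors(wa, wb, term), (wa, wb, m))
    return out


def find_feasible_pairs(W1, W2, bond_terms, x):
    """All (w1, w2, m) whose combination equals the target vector x."""
    return [(w1, w2, m)
            for w1, w2 in product(W1, W2)
            for m, term in bond_terms.items()
            if add_vectors(w1, w2, term) == x]


# Toy vector set W(FT): a single C, a single O, and a C-C tree.
W_FT = [(1, 0, 0, 0), (0, 1, 0, 0), (2, 0, 1, 0)]
W2 = combine(W_FT, W_FT, BOND_TERMS)          # vectors of two-tree combinations
x = (2, 1, 1, 1)                              # target: 2 C, 1 O, 1 single, 1 double
pairs = find_feasible_pairs(W_FT, W_FT, BOND_TERMS, x)
```

Only vectors are stored and compared; graphs are reconstructed from the recorded witnesses once a feasible pair is found, which is the point of working with vector sets instead of enumerating trees.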

With a slight modification, the algorithm can generate a chemical acyclic graph G with bl2(G)=3.

Appendix D presents the details of our new algorithms for generating acyclic graphs G with bl2(G) ∈ [2,3].

Experimental results

We implemented our method of Stages 1 to 5 for inferring chemical acyclic graphs and conducted experiments to evaluate the computational efficiency for three chemical properties π: octanol/water partition coefficient (Kow), boiling point (Bp) and heat of combustion (Hc). We executed the experiments on a PC with two Intel Xeon E5-1660 v3 CPUs @ 3.00 GHz and 32 GB of RAM, running Ubuntu 14.04.6 LTS. We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 was used for constructing the drawings.

Results on Phase 1. We implemented Stages 1, 2, and 3, in Phase 1 as follows.

Stage 1. We set a graph class 𝒢 to be the set of all chemical acyclic graphs, and set a branch-parameter k to be 2. For each property π ∈ {Kow, Bp, Hc}, we first selected a set Λ of chemical elements and then collected a data set Dπ on chemical acyclic graphs over the set Λ of chemical elements provided by the Hazardous Substances Data Bank (HSDB) of PubChem. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as N+ or an element a ∈ Λ whose valence is different from our setting of valence function val.

Table 4 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:

  • π: one of the chemical properties Kow, Bp and Hc;

  • Λ: the set of selected chemical elements (hydrogen atoms are added at the final stage);

  • |Dπ|: the size of data set Dπ over Λ for property π;

  • |Γ|: the number of different adjacency-configurations over the compounds in Dπ;

  • [n_,n¯]: the minimum and maximum number n(G) of non-hydrogen atoms over the compounds G in Dπ;

  • [bl_,bl¯]: the minimum and maximum numbers bl2(G) of leaf 2-branches over the compounds G in Dπ;

  • [bh_,bh¯]: the minimum and maximum values of the 2-branch height bh2(G) over the compounds G in Dπ; and

  • [a_,a¯]: the minimum and maximum values of a(G) for π over compounds G in Dπ.

Table 4.

Results of Stage 1 in Phase 1

π Λ |Dπ| |Γ| [n_,n¯] [bl_,bl¯] [bh_,bh¯] [a_,a¯]
Kow C,O,N 216 10 [4, 28] [0, 2] [0, 4] [−4.2, 8.23]
Bp C,O,N 172 10 [4, 26] [0, 1] [0, 3] [−11.7, 404.84]
Hc C,O,N 128 6 [4, 26] [0, 1] [0, 2] [1346.4, 13304.5]

Stage 2. We used a feature function f that consists of the descriptors defined in “Descriptors” section.

Stage 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs N, where the tool and activation function are set to be MLPRegressor and ReLU, respectively. We tested several different architectures of ANNs for each chemical property. To evaluate the performance of the resulting prediction function ψN with cross-validation, we randomly partition a given data set Dπ into five subsets Dπ(i), i ∈ [1,5], where Dπ \ Dπ(i) is used as a training set and Dπ(i) is used as a test set in five trials i ∈ [1,5]. For a set {y1, y2, …, yN} of observed values and a set {ψ1, ψ2, …, ψN} of predicted values, we define the coefficient of determination to be $R^2 \triangleq 1 - \frac{\sum_{j\in[1,N]}(y_j-\psi_j)^2}{\sum_{j\in[1,N]}(y_j-\overline{y})^2}$, where $\overline{y} = \frac{1}{N}\sum_{j\in[1,N]} y_j$. Table 5 shows the results on Stages 2 and 3, where

  • K: the number of descriptors for the chemical compounds in data set Dπ for property π;

  • Activation: the choice of activation function;

  • Architecture: (a, b, 1) consists of an input layer with a nodes, a hidden layer with b nodes and an output layer with a single node, where a is equal to the number K of descriptors;

  • L-time: the average time (in seconds) to construct ANNs for each trial;

  • test R2 (ave.): the average of coefficient of determination over the five tests; and

  • test R2 (best): the largest value of coefficient of determination over the five test sets.

From Table 5, we see that the execution of Stage 3 was successful, where the average of test R2 is over 0.9 for all three chemical properties.
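For reference, the evaluation protocol of Stage 3 can be sketched without any machine-learning library. The snippet below computes the coefficient of determination exactly as defined above and produces a random five-way partition of sample indices; the function names, the fixed seed, and the round-robin way of cutting the shuffled indices into folds are assumptions made for illustration, not details taken from the experiments.

```python
import random


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


def five_fold_indices(n_samples, seed=0):
    """Randomly partition range(n_samples) into five nearly equal subsets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]


# Example: the Kow data set has 216 compounds.
folds = five_fold_indices(216)
for i, test_fold in enumerate(folds):
    held_out = set(test_fold)
    train = [j for j in range(216) if j not in held_out]
    # An ANN would be trained on `train` and scored on `test_fold` with r2_score.
```

A prediction that always returns the mean of the observed values scores 0 under this definition, and a perfect prediction scores 1.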

Table 5.

Results of Stages 2 and 3 in Phase 1

π K Activation Architecture L-Time test R2 (ave.) test R2 (best)
Kow 76 ReLU (76, 10, 1) 2.12 0.901 0.951
Bp 76 ReLU (76, 10, 1) 26.07 0.935 0.965
Hc 68 ReLU (68, 10, 1) 234.06 0.924 0.988

For each chemical property π, we selected the ANN N that attained the best test R2 score among the five ANNs to formulate an MILP M(x,y,z;C1) which will be used in Phase 2.

Results on Phase 2. We implemented Stages 4 and 5 in Phase 2 as follows.

Stage 4. In this step, we solve the MILP M(x,y,g;C1,C2) formulated based on the ANN N obtained in Phase 1. To solve an MILP in Stage 4, we use CPLEX version 12.10. In our experiment, we choose a target value y ∈ [a_,a¯] and fix or bound some descriptors in our feature vector as follows:

  • Set the 2-leaf-branch number bl to be each of 2 and 3;

  • Fix the instance size n=n(G) to be each integer in {26,32,38,44,50};

  • Set the diameter dia=dia(G) to be one of the integers in {⌈(2/5)n⌉, ⌈(3/5)n⌉};

  • Set the maximum degree dmax:=3 for dia=⌈(2/5)n⌉ and dmax:=4 for dia=⌈(3/5)n⌉;

  • For each instance size n, test a target value yπ for each chemical property π ∈ {Kow, Bp, Hc}.

Based on the above setting, we generated six instances for each instance size n. We set ε=0.02 in Stage 4.
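The instance grid described above can be enumerated as follows. This is a sketch under one stated assumption: the diameters (2/5)n and (3/5)n are rounded up to the next integer, which is inferred from the diameters reported in Tables 6 to 9 (for example, n = 26 gives 11 and 16). The names `ceil_frac` and `build_instances` are hypothetical.

```python
from itertools import product

SIZES = [26, 32, 38, 44, 50]
PROPERTIES = ["Kow", "Bp", "Hc"]


def ceil_frac(num, den, n):
    """Integer ceiling of (num / den) * n without floating point."""
    return (num * n + den - 1) // den


def build_instances(bl):
    """One instance per (n, property, diameter setting) for a fixed bl."""
    instances = []
    for n, prop in product(SIZES, PROPERTIES):
        # dmax = 3 goes with dia = ceil(2n/5); dmax = 4 with dia = ceil(3n/5).
        for num, dmax in ((2, 3), (3, 4)):
            instances.append({
                "bl": bl,
                "n": n,
                "property": prop,
                "dmax": dmax,
                "dia": ceil_frac(num, 5, n),
            })
    return instances


insts = build_instances(bl=2)
```

Each size n yields 3 properties times 2 diameter settings, i.e. six instances, matching the count stated above.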

Tables 6, 7 (resp., Tables 8, 9) show the results on Stage 4 for bl=2 (resp., bl=3), where we denote the following:

  • yπ: a target value in [a_,a¯] for a property π;

  • n: a specified number of vertices in [n_,n¯];

  • dia: a specified diameter in {⌈(2/5)n⌉, ⌈(3/5)n⌉};

  • IP-time: the time (in seconds) taken to solve an MILP instance and find vectors x and g.

We observe that most of the MILP instances with bl=2, n ≤ 50 and dia ≤ 30 (resp., bl=3, n ≤ 50 and dia ≤ 30) are solved within one minute (resp., in a few minutes). The previously most efficient MILP formulation for inferring chemical acyclic graphs due to Zhang et al. [19] could solve instances with a relatively small diameter of dia=9 for the case of dmax=4 and n=20 and dia=8 for the case of dmax=3 and n=50. Our new MILP formulation on chemical acyclic graphs with bounded 2-branch height considerably improved the tractable size of chemical acyclic graphs in Stage 4 for the inference problem (II-a).

Table 6.

Results of Stages 4 and 5 for bl=2, dmax=3 and dia=⌈(2/5)n⌉

π y n dia IP-time #FP G-LB #G G-time
Kow 4 26 11 3.95 11,780 2.4×10^6 100 0.91
Kow 5 32 13 4.81 216 2.7×10^4 100 10.64
Kow 7 38 16 7.27 19,931 4.2×10^7 100 48.29
Kow 8 44 18 9.33 241,956 1.2×10^13 100 119.01
Kow 9 50 20 21.57 58,365 1.7×10^10 100 110.38
Bp 440 26 11 2.09 22,342 3.6×10^7 100 2.9
Bp 550 32 13 3.94 748 5.9×10^6 100 3.77
Bp 660 38 16 6.4 39,228 7.3×10^8 100 151.25
Bp 770 44 18 7.21 138,076 3.0×10^12 100 182.66
Bp 880 50 20 9.49 106,394 3.0×10^10 100 217.18
Hc 13000 26 11 2.94 12 2.0×10^1 12 0.04
Hc 16500 32 13 7.67 2722 1.2×10^7 100 0.31
Hc 20000 38 16 10.5 1830 9.7×10^5 100 1.06
Hc 23000 44 18 13.62 12,336 4.7×10^8 100 142.02
Hc 25000 50 20 15.1 136,702 5.3×10^14 100 22.26

Table 7.

Results of Stages 4 and 5 for bl=2, dmax=4 and dia=⌈(3/5)n⌉

π y n dia IP-time #FP G-LB #G G-time
Kow 4 26 16 16.21 4198 3.5×10^5 100 1.18
Kow 5 32 20 24.74 1650 5.3×10^6 100 0.69
Kow 7 38 23 38.88 154,408 9.5×10^9 100 67.31
Kow 8 44 27 38.73 1,122,126 8.5×10^13 100 660.37
Kow 9 50 30 31.59 690,814 1.1×10^15 100 238.02
Bp 440 26 16 12.44 8156 2.6×10^6 100 2.74
Bp 550 32 20 23.22 38,600 4.4×10^8 100 12.72
Bp 660 38 23 20.62 52,406 1.1×10^9 100 197.89
Bp 770 44 27 50.55 23,638 6.8×10^8 100 244.56
Bp 880 50 30 48.37 40,382 2.2×10^11 100 884.99
Hc 13000 26 16 23.26 249 2.7×10^3 100 0.06
Hc 16500 32 20 44.2 448 6.9×10^4 100 0.63
Hc 20000 38 23 96.02 3330 6.1×10^6 100 15.16
Hc 23000 44 27 82.34 43,686 1.5×10^10 100 152.96
Hc 25000 50 30 83.81 311,166 1.3×10^13 100 287.95

Table 8.

Results of Stages 4 and 5 for bl=3, dmax=3 and dia=⌈(2/5)n⌉

π y n dia IP-time #FP G-LB #G G-time
Kow 4 26 11 3.1 511 3.6×10^3 100 14.31
Kow 5 32 13 4.72 3510 6.8×10^6 100 851.21
Kow 7 38 16 5.82 11,648 1.2×10^8 100 612.86
Kow 8 44 18 9.69 17,239 2.2×10^8 100 703.92
Kow 9 50 20 22.53 60,792 3.9×10^12 100 762.17
Bp 440 26 11 3.01 66 9.0×10^2 66 902.77
Bp 550 32 13 4.29 308 1.0×10^7 100 2238.62
Bp 660 38 16 5.86 303 1.8×10^7 100 3061.11
Bp 770 44 18 14.39 19,952 4.7×10^10 100 678.26
Bp 880 50 20 10.39 17,993 7.1×10^12 100 4151.07
Hc 13000 26 11 3.05 340 1.5×10^4 100 1.57
Hc 16500 32 13 5.81 600 3.1×10^8 100 921.55
Hc 20000 38 16 15.67 18,502 6.2×10^8 100 1212.54
Hc 23000 44 18 21.15 5064 6.9×10^9 100 1279.95
Hc 25000 50 20 31.90 41,291 2.4×10^12 100 668.5

Table 9.

Results of Stages 4 and 5 for bl=3, dmax=4 and dia=⌈(3/5)n⌉

π y n dia IP-time #FP G-LB #G G-time
Kow 4 26 16 9.94 100 2.5×10^4 100 6.73
Kow 5 32 20 16.58 348 1.4×10^8 100 3400.74
Kow 7 38 23 33.71 17,557 1.2×10^11 100 2652.38
Kow 8 44 27 34.28 0 0 1 >2 hours
Kow 9 50 30 68.74 80,411 6.4×10^15 100 6423.85
Bp 440 26 16 14.16 150 1.8×10^5 100 29.72
Bp 550 32 20 18.94 305 1.4×10^7 100 2641.9
Bp 660 38 23 21.15 1155 2.0×10^9 100 4521.66
Bp 770 44 27 25.6 1620 4.3×10^8 100 175.2
Bp 880 50 30 63.22 0 0 1 >2 hours
Hc 13000 26 16 31.87 12 2.7×10^4 12 0.66
Hc 16500 32 20 41.03 392 3.4×10^8 100 2480.34
Hc 20000 38 23 48.48 630 1.4×10^5 100 105.59
Hc 23000 44 27 143.75 341 7.8×10^8 100 5269.1
Hc 25000 50 30 315.91 10,195 3.8×10^9 100 5697.08

Figure 7a–c illustrate some chemical acyclic graphs G with bl2(G)=2 obtained in Stage 4 by solving an MILP. Recall that these chemical graphs obey the AD D defined in Appendix A.

Fig. 7.

An illustration of chemical acyclic graphs G with n(G)=50, bl2(G)=2 and dmax=4 obtained in Stage 4 by solving an MILP: (a) yKow=9, dia(G)=(2/5)n=20; (b) yBp=880, dia(G)=n/2=25; (c) yHc=25000, dia(G)=(3/5)n=30

Figure 8a–c illustrate some chemical acyclic graphs G with bl2(G)=3 obtained in Stage 4 by solving an MILP.

Fig. 8.

An illustration of chemical acyclic graphs G with n(G)=50, bl2(G)=3 and dmax=4 obtained in Stage 4 by solving an MILP: (a) yKow=9, dia(G)=(2/5)n=20; (b) yBp=880, dia(G)=n/2=25; (c) yHc=25000, dia(G)=(3/5)n=30

Stage 5. In this stage, we execute our new graph search algorithms for generating target graphs G ∈ 𝒢(x) with bl2(G) ∈ {2,3} for a given feature vector x obtained in Stage 4.

We introduce a time limit of 10 minutes for each iteration h in Step 2 and an execution of Steps 1 and 3 for bl=2 (resp., each iteration h in Steps 2 and 3 and δ1 in Step 4 and an execution of Steps 1 and 5 for bl=3). In the last step, we choose at most 100 feasible vector pairs and generate a target graph from each of these feasible vector pairs. We also impose an upper bound UB on the size |W| of a vector set W that we maintain during an execution of the algorithm. We executed the algorithm for each of the three bounds UB = 10^6, 10^7, 10^8 until a feasible vector pair is found or the running time exceeds a global time limit of two hours.

When no feasible vector pair is found by the graph search algorithms, we output the target graph G constructed from the vector g in Stage 4.

Tables 6, 7 (resp., Tables 8, 9) show the results of Stage 5 for bl=2 (resp., bl=3), where we denote the following:

  • #FP: the number of feasible vector pairs obtained by an execution of the graph search algorithm for a given feature vector x;

  • G-LB: a lower bound on the number of all target graphs G ∈ 𝒢(x) for a given feature vector x;

  • #G: the number of all (or up to 100) chemical acyclic graphs G such that f(G)=x (where at least one such graph G has been found from the vector g in Stage 4);

  • G-time: the running time (sec.) to execute Stage 5 for a given feature vector x, where “> 2 hours” means that the running time exceeds two hours.

Previously, an instance of chemical acyclic graphs with size n up to 16 was solved in Stage 5 by Azam et al. [17]. For the classes of chemical graphs with cycle index 1 and 2, the maximum size of instances solved in Stage 5 by Ito et al. [17] and Zhu et al. [21] was around 18 and 15, respectively. Our new algorithm based on dynamic programming solves instances with n=50. In our experiments, we also computed a lower bound G-LB on the number of target graphs. We observe that there are over 10^10 or 10^14 target graphs in some cases. Recall that these lower bounds are computed without actually generating each target graph one by one. When a lower bound is enormously large, this suggests that we may need to impose more constraints on the structure of graphs or the range of descriptors to narrow the family of target graphs to be inferred.

An additional experiment. We also conducted an additional experiment to demonstrate that our MILP-based method offers flexible control over the conditions for inferring chemical graphs. In Stage 3, we constructed an ANN Nπ for each of the three chemical properties π ∈ {Kow, Bp, Hc}, and formulated the inverse problem of each ANN Nπ as an MILP Mπ. Since the set of descriptors is common to all three properties Kow, Bp and Hc, it is possible to infer a chemical acyclic graph G that satisfies a target value yπ for each of the three properties at the same time (if one exists). We specify the size of the graph so that n=50, bl=2, dia=25 and dmax=4, and set target values yKow=4.0, yBp=400.0 and yHc=13000.0 in an MILP that consists of the three MILPs MKow, MHc and MBp. The MILP was solved in 18930 seconds and we obtained the chemical acyclic graph G illustrated in Fig. 9. We continued to execute Stage 5 for this instance to generate more target graphs G. Table 10 shows that 100 target graphs are generated by our new dynamic programming algorithm.

Fig. 9.

An illustration of a chemical acyclic graph G inferred for three chemical properties Kow, Bp and Hc simultaneously, where yKow=4.0, yBp=400.0 and yHc=13000.0, n=50, bl=2, dia=25, and dmax=4

Table 10.

Results of Stages 4 and 5 for bl=2, dmax=4, n=50 and dia=25

π y n dia IP-time #FP G-LB #G G-time
Kow 4 50 25 18930.46 117,548 2.4×10^11 100 423.53
Bp 400
Hc 13000

Concluding remarks

In this paper, we introduced a new measure, branch-height of a tree, and showed that many chemical compounds in the chemical database have a simple structure where the number of 2-branches is small. Based on this, we proposed a new method of applying the framework for inverse QSAR/QSPR [17–19] to the case of acyclic chemical graphs where Azam et al. [17] inferred chemical graphs with around 20 non-hydrogen atoms and Zhang et al. [19] solved an MILP of inferring a feature vector for an instance with diameter 9. In our method, we formulated a new MILP in Stage 4 specialized for acyclic chemical graphs with a small branch number and designed a new graph search algorithm in Stage 5 that computes frequency vectors of graphs in a dynamic programming scheme.

We implemented our new method and conducted some experiments on chemical properties such as octanol/water partition coefficient, boiling point and heat of combustion.

The resulting method improved the performance so that chemical graphs with around 50 non-hydrogen atoms and diameter around 30 can be inferred. Since there are many acyclic chemical compounds having large diameters, this is a significant improvement.

It is left as a future work to design MILPs and graph search algorithms based on the new idea of the paper for classes of graphs with a higher rank. Recently, a method for inferring a chemical cyclic graph with any rank has been designed by Akutsu and Nagamochi [27] based on the ideas in this paper. The method is also designed so that a target chemical graph to be inferred can be specified in a more flexible way, where we can include a prescribed substructure of graphs such as a benzene ring into a target chemical graph while imposing constraints on a global topological structure of a target graph at the same time.

Acknowledgements

Not applicable.

Abbreviations

ANN

Artificial neural network

MILP

Mixed integer linear programming

Appendix A: Statistical features of molecular structures

We observe the following features of the graph-theoretical structure of chemical graphs registered in the chemical database PubChem. Let DB(n) denote the set of chemical graphs with at most n non-hydrogen atoms that are registered in the chemical database PubChem (a copy downloaded on March 21, 2019). The cycle index (or rank) of a chemical graph G=(H=(V,E),α,β) is defined to be |E|-(|V|-1) (i.e., the minimum number of edges to be removed to make the graph H acyclic). We call a chemical graph a rank-r chemical graph if the rank of the graph is r. The core of a chemical cyclic graph G is defined to be the induced subgraph G′ of G such that G′ consists of vertices in a cycle or vertices in a path joining two cycles. A vertex in the core (resp., not in the core) is called a core vertex (resp., a non-core vertex). The edges not in the core of a chemical cyclic graph G form a collection of trees T, each of which we call a non-core tree. Each non-core tree contains exactly one core vertex and is regarded as a tree rooted at the core vertex. The k-branch height of a chemical cyclic graph G is defined to be the maximum of k-branch heights over all non-core trees.
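The two graph notions used here, the rank and the core, can be computed directly from an edge list. The sketch below assumes a connected simple graph on vertices 1..n and obtains the core by repeatedly deleting vertices of degree at most 1, which leaves exactly the vertices on cycles or on paths joining cycles; the function names are illustrative.

```python
def cycle_index(num_vertices, edges):
    """Rank |E| - (|V| - 1) of a connected graph."""
    return len(edges) - (num_vertices - 1)


def core_vertices(num_vertices, edges):
    """Vertices that remain after repeatedly removing degree-<=1 vertices."""
    adj = {v: set() for v in range(1, num_vertices + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    stack = [v for v in adj if len(adj[v]) <= 1]
    removed = set()
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in adj[v]:
            adj[w].discard(v)
            if w not in removed and len(adj[w]) <= 1:
                stack.append(w)
        adj[v].clear()
    return set(adj) - removed


# A triangle 1-2-3 with a pendant path 3-4-5: rank 1, core {1, 2, 3}.
cyclic_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
# A path 1-2-3: rank 0 (acyclic), empty core.
tree_edges = [(1, 2), (2, 3)]
```

In the cyclic example, the edges 3-4 and 4-5 form a single non-core tree rooted at the core vertex 3.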

Let ρr (%) denote the ratio of the number of chemical graphs with rank at most r ∈ [0,4] to the number of all chemical graphs in PubChem. See Table 11.

Table 11.

The percentage ρr of the number of chemical compounds with rank at most r ∈ [0,4] over all chemical compounds in PubChem

ρ0 ρ1 ρ2 ρ3 ρ4
2.9% 16.3% 44.5% 68.8% 84.7%

Let ρ0(d) (%) denote the ratio of the number of chemical graphs in DB(100) such that the maximum degree is at most d ∈ [3,4] to the number of all chemical graphs in DB(100). Let ρr(d) (%), r ∈ [1,4], denote the ratio of the number of rank-r chemical graphs in DB(100) such that the maximum degree of a non-core vertex is at most d ∈ [3,4] to the number of all rank-r chemical graphs in DB(100). See Table 12.

Table 12.

The percentage ρr(d) of the number of chemical compounds with rank r ∈ [0,4] such that the maximum degree of a non-core vertex is at most d ∈ [3,4] over all rank-r chemical compounds in DB(100)

ρ0(3) ρ0(4) ρ1(3) ρ1(4) ρ2(3) ρ2(4) ρ3(3) ρ3(4) ρ4(3) ρ4(4)
55.55% 99.85% 68.30% 99.97% 84.46% 99.99% 87.11% 99.99% 87.75% 99.99%

Let ρr(k,h) (%), r ∈ [0,4], k=2, h ∈ [1,2], denote the ratio of the number of rank-r chemical graphs in DB(50) such that the k-branch height is at most h to the number of all rank-r chemical graphs in DB(50). See Table 13. We see that most chemical graphs G with at most 50 non-hydrogen atoms satisfy bh2(G) ≤ 2.

Table 13.

The percentage ρr(k,h) (%) of the number of rank-r chemical graphs in DB(50) such that the k-branch height is at most h to the number of all rank-r chemical graphs in DB(50)

ρ0(2,1) ρ0(2,2) ρ1(2,1) ρ1(2,2) ρ2(2,1) ρ3(2,1) ρ4(2,1)
87.23% 99.46% 88.13% 98.76% 96.39% 99.17% 99.43%

We show the distribution of 2-branch height over alkanes CnH2n+2. Let Aln(n) denote the set of all alkanes with n carbon atoms, where |Aln(25)|=36,797,588. Let ρAln(2,h) (%), h ∈ [1,4], denote the ratio of the number of alkanes in Aln(25) such that the 2-branch height is at most h to the number of alkanes in Aln(25). See Table 14.

Table 14.

The percentage ρAln(2,h) (%) of the number of alkanes in Aln(25) such that the 2-branch height is at most h to the number of alkanes in Aln(25)

ρAln(2,1) ρAln(2,2) ρAln(2,3) ρAln(2,4)
49.03% 97.67% 99.99% 100.00%

Let ρ2bt(δ) denote the ratio of the number of acyclic chemical graphs in DB(50) such that the degree of the root of the 2-branch-tree is δ ∈ [1,4] to the number of all acyclic chemical graphs in DB(50). See Table 15.

Table 15.

The percentage ρ2bt(δ) of the number of acyclic chemical graphs in DB(50) such that the degree of the root of the 2-branch-tree is δ ∈ [1,4] to the number of all acyclic chemical graphs in DB(50)

ρ2bt(1) ρ2bt(2) ρ2bt(3) ρ2bt(4)
6.39% 83.58% 9.30% 0.73%

Among the 2-fringe-trees T of all acyclic chemical graphs in DB(100), over 90% of them satisfy n ≤ 2d+2 for the number n=|V(T)| of non-hydrogen atoms in a 2-fringe-tree T and the number d of non-hydrogen atoms adjacent to the root in T.

Let FT0,2 denote the set of all 2-fringe-trees that appear in an acyclic chemical graph in DB(100), and FT0,2(δ), δ ∈ [1,3], denote the set of all 2-fringe-trees T ∈ FT0,2 that have δ children (i.e., the degree of the root is δ). Let ρ2δ+2(δ) (%) denote the ratio of the number of 2-fringe-trees in FT0,2(δ) that have at most 2δ+2 vertices to the number of 2-fringe-trees in FT0,2(δ). See Table 16.

Table 16.

The percentage ρ2δ+2(δ) (%) of the number of 2-fringe-trees in FT0,2(δ) that have at most 2δ+2 vertices to the number of 2-fringe-trees in FT0,2(δ)

ρ4(1) ρ6(2) ρ8(3)
93.77% 93.99% 92.01%

Appendix B: Formulating an MILP based on scheme graphs

This section shows how to formulate an MILP based on a scheme graph.

Scheme graphs

Let t*, s* and c* be integers such that

  • t* = n-(bh-1)-(k+1)bl;

  • s* = a(b^c-1)/(b-1)+1 for a=dmax, b=dmax-1 and c=bh; and

  • c* = s*-1.
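As a quick numerical check of the three integers defined above, the following sketch evaluates the formulas and compares the closed form for the number of base-tree vertices with a level-by-level count. It assumes that T(a, b, c) denotes the rooted tree whose root has a children, whose other internal vertices have b children each, and whose leaves all lie at depth c; the function names are hypothetical.

```python
def scheme_graph_sizes(n, dmax, k, bh, bl):
    """Return (t, s, c): link-path size, base-tree vertices, base-tree edges."""
    a, b, c = dmax, dmax - 1, bh
    t = n - (bh - 1) - (k + 1) * bl
    s = a * (b ** c - 1) // (b - 1) + 1   # closed form for |V(T(a, b, c))|
    return t, s, s - 1


def count_vertices_by_levels(a, b, c):
    """Count the vertices of T(a, b, c) one depth level at a time."""
    total, width = 1, a                   # the root, then a vertices at depth 1
    for _ in range(c):
        total += width
        width *= b                        # each vertex has b children
    return total


# Example: n = 50, dmax = 3, k = 2, bh = 2, bl = 2.
t, s, c = scheme_graph_sizes(50, 3, 2, 2, 2)
```

For dmax = 3 and bh = 2 the base-tree has 10 vertices, which agrees with the leaf indices 5 to 10 in the example of Fig. 4.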

Let a scheme graph SG(dmax,k,bh,t) consist of a tree TB, a path Pt, a set {Ss | s ∈ [1,s*]} of trees, a set {Tt | t ∈ [1,t*]} of trees, and a set of directed edges between TB and Pt so that an acyclic graph H ∈ ℋ(n,dmax,dia,k,bh,bl) will be constructed in the following way:

  • (i)

    The k-branch-tree of H will be chosen as a subtree of TB=(VB,EB);

  • (ii)

    Each k-fringe-tree rooted at a vertex us ∈ V(TB) of H will be chosen as a subtree of Ss;

  • (iii)

    Each k-branch-path of H (except for its end-vertices) will be chosen as a subpath of Pt or as an edge in TB;

  • (iv)

    Each k-fringe-tree rooted at a vertex vt ∈ V(Pt) of H will be chosen as a subtree of Tt; and

  • (v)

    An edge (u, v) directed from TB to Pt will be selected as an initial edge of a k-branch-path of H and an edge (v, u) directed from Pt to TB will be selected as an ending edge of a k-branch-path of H.

More formally, each component of a scheme graph SG(dmax,k,bh,t) is defined as follows.

  • (i)

    TB = (VB = {u1, u2, …, us*}, EB = {a1, a2, …, ac*}), called a base-tree, is a tree rooted at a vertex u1 that is isomorphic to the rooted tree T(dmax,dmax-1,bh). Regard TB as an ordered tree by introducing a total order for each set of siblings and call the first (resp., last) child in a set of siblings the leftmost (resp., rightmost) child, which defines the leftmost (resp., rightmost) path from the root u1 to a leaf in TB, as illustrated in Fig. 4a.

    For each vertex us ∈ VB, let EB(s) denote the set of indices i of edges ai ∈ EB incident to us and CldB(s) denote the set of indices i of children ui ∈ VB of us in the tree TB.

    For each integer d ∈ [0,k], let VB(d) denote the set of indices s of vertices us ∈ VB whose depth is d in the tree TB, where VB(bh) is the set of indices s of leaves us of TB.

    Regard each edge ai ∈ EB as a directed edge (us, us′) from one end-vertex us of ai to the other end-vertex us′ of ai such that s = prt(s′) (i.e., us is the parent of us′), where head(i) and tail(i) denote the head us′ and tail us of edge ai ∈ EB, respectively.

    For each index s ∈ [1,s*], let EB+(s) (resp., EB-(s)) denote the set of indices i of edges ai ∈ EB such that the tail (resp., head) of ai is vertex us.

    Let LB denote the set of indices of leaves of TB, and sleft (resp., sright) denote the index s ∈ LB of the leaf us at which the leftmost (resp., rightmost) path from the root ends.

    For each leaf us, s ∈ LB, let VB,s (resp., EB,s) denote the set of indices s′ of non-root vertices us′ (resp., indices i of edges ai ∈ EB) along the path from the root to the leaf us in the tree TB.

    For the example of a base-tree TB with bh=2 in Fig. 4, it holds that LB={5,6,7,8,9,10}, sleft=5, sright=10, EB,sleft={1,4} and VB,sleft={2,5}.

  • (ii)

    Ss, s ∈ [1,s*], is a tree rooted at vertex us ∈ VB in TB that is isomorphic to the rooted tree T(dmax-1,dmax-1,k), as illustrated in Fig. 4b. Let us,i and es,i denote the vertex and edge in Ss that correspond to the i-th vertex and the i-th edge in T(dmax-1,dmax-1,k), respectively. Regard each edge es,i as a directed edge (us,prt(i), us,i). For this, each vertex us ∈ VB is also denoted by us,1.

  • (iii)

    Pt = (VP = {v1, v2, …, vt*}, EP = {e2, e3, …, et*}), called a link-path with size t*, is a directed path from vertex v1 to vertex vt*, as illustrated in Fig. 4a. Each edge et ∈ EP is directed from vertex vt-1 to vertex vt.

  • (iv)

    Tt, t ∈ [1,t*], is a tree rooted at vertex vt in Pt that is isomorphic to the rooted tree T(dmax-2,dmax-1,k), as illustrated in Fig. 4c. Let vt,i and et,i denote the vertex and edge in Tt that correspond to the i-th vertex and the i-th edge in T(dmax-2,dmax-1,k), respectively. Regard each edge et,i as a directed edge (vt,prt(i), vt,i). For this, each vertex vt ∈ VP is also denoted by vt,1.

  • (v)

    For every pair $(s,t)$ with $s\in[1,s^*]$ and $t\in[1,t^*]$, join vertices $u_s$ and $v_t$ with directed edges $(u_s,v_t)$ and $(v_t,u_s)$, as illustrated in Fig. 4a.

We explain the basic idea of an MILP in Theorem 2. The MILP mainly consists of the following three types of constraints.

  C1. Constraints for selecting an acyclic graph $H$ as a subgraph of the scheme graph $SG(d_{\max},k,\mathrm{bh},t^*)$;

  C2. Constraints for assigning chemical elements to vertices and multiplicity to edges to determine a chemical graph $G=(H,\alpha,\beta)$; and

  C3. Constraints for computing descriptors from the selected acyclic chemical graph $G$.

In the constraints of C1, more formally we prepare the following.

  • (i)

    In the scheme graph $SG(d_{\max},k,\mathrm{bh},t^*)$, we prepare a binary variable $u(s,1)$ for each vertex $u_s=u_{s,1}\in V_B$, $s\in[1,s^*]$, so that vertex $u_s=u_{s,1}$ becomes a $k$-branch of a selected graph $H$ if and only if $u(s,1)=1$. The subgraph of the base-tree $T_B$ that consists of vertices $u_s=u_{s,1}$ with $u(s,1)=1$ will be the $k$-branch-tree of the graph $H$. We also prepare a binary variable $a(i)$, $i\in[1,c^*]$, for each edge $a_i\in E_B$, where $c^*=s^*-1$. For a pair of a vertex $u_{s,1}$ and a child $u_{s',1}$ of $u_{s,1}$ such that $u(s,1)=u(s',1)=1$, either the edge $a_i=(u_{s,1},u_{s',1})$ is used in the selected graph $H$ (when $a(i)=1$) or a path $P_i=(u_{s,1},v_{t',1},v_{t'+1,1},\ldots,v_{t'',1},u_{s',1})$ from vertex $u_{s,1}$ to vertex $u_{s',1}$ is constructed in $H$ with an edge $(u_{s,1},v_{t',1})$, a subpath $(v_{t',1},v_{t'+1,1},\ldots,v_{t'',1})$ of the link-path $P_{t^*}$ and an edge $(v_{t'',1},u_{s',1})$ (when $a(i)=0$). For example, vertices $u_{1,1}$ and $u_{2,1}$ are connected by a path $P_1=(u_{1,1},v_{1,1},v_{2,1},u_{2,1})$ in the selected graph $H$ in Fig. 5c.

  • (ii)
    Let
    • $n_{\mathrm{tree}}^{S}=1+(d_{\max}-1)\bigl((d_{\max}-1)^{k}-1\bigr)/(d_{\max}-2)$,
    • $n_{\mathrm{tree}}^{T}=1+(d_{\max}-2)\bigl((d_{\max}-1)^{k}-1\bigr)/(d_{\max}-2)$,
    where $n_{\mathrm{tree}}^{S}$ (resp., $n_{\mathrm{tree}}^{T}$) is the number of vertices in the rooted tree $T(d_{\max}-1,d_{\max}-1,k)$ (resp., $T(d_{\max}-2,d_{\max}-1,k)$). In each tree $S_s$, $s\in[1,s^*]$ (resp., $T_t$, $t\in[1,t^*]$) in the scheme graph, we prepare a binary variable $u(s,i)$ (resp., $v(t,i)$) for each vertex $u_{s,i}$, $i\in[2,n_{\mathrm{tree}}^{S}]$ (resp., $v_{t,i}$, $i\in[2,n_{\mathrm{tree}}^{T}]$) so that $u(s,i)=1$ (resp., $v(t,i)=1$) means that the corresponding vertex $u_{s,i}$ (resp., $v_{t,i}$) is used as a vertex in a selected graph $H$. The (non-empty) subgraph of a tree $S_s$ (resp., $T_t$) that consists of vertices $u_{s,i}$ with $u(s,i)=1$ (resp., $v_{t,i}$ with $v(t,i)=1$) will be a $k$-fringe-tree of a selected graph $H$.
  • (iii)

    In the link-path $P_{t^*}$, we prepare a binary variable $e(t)$, $t\in[2,t^*]$, for each edge $e_{t,1}=(v_{t-1,1},v_{t,1})\in E_P$ so that $e(t)=1$ if and only if edge $e_{t,1}$ is used in some path $P_i=(u_{s,1},v_{t',1},v_{t'+1,1},\ldots,v_{t'',1},u_{s',1})$ constructed in (i).

  • (iv)

    For each pair $(s,t)$ of $s\in[1,s^*]$ and $t\in[1,t^*]$, we prepare a binary variable $e(s,t)$ (resp., $e(t,s)$) so that $e(s,t)=1$ (resp., $e(t,s)=1$) if and only if directed edge $(u_{s,1},v_{t,1})$ (resp., $(v_{t,1},u_{s,1})$) is used as the first edge (resp., last edge) of some path $P_i=(u_{s',1},v_{t',1},v_{t'+1,1},\ldots,v_{t'',1},u_{s'',1})$ constructed in (i).

Based on these, we include constraints with some more additional variables so that a selected subgraph H is a connected acyclic graph. See constraints (12) to (32) in Appendix C for the details.
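The vertex counts $n_{\mathrm{tree}}^{S}$ and $n_{\mathrm{tree}}^{T}$ in (ii) can be cross-checked numerically. The following is a minimal sketch that assumes $T(a,b,k)$ denotes the rooted tree of height $k$ whose root has $a$ children and whose other non-leaf vertices have $b$ children, which is the reading suggested by the closed form; the function names are hypothetical.

```python
def tree_size_by_levels(root_children: int, other_children: int, height: int) -> int:
    """Count vertices level by level: one root, then
    root_children * other_children**(d - 1) vertices at depth d."""
    return 1 + sum(root_children * other_children ** (d - 1)
                   for d in range(1, height + 1))


def tree_size_closed_form(d_max: int, k: int, root_children: int) -> int:
    """Closed form 1 + a * ((d_max - 1)**k - 1) / (d_max - 2), a = root_children.
    Requires d_max >= 3 so that the denominator is positive."""
    b = d_max - 1
    return 1 + root_children * (b ** k - 1) // (b - 1)


if __name__ == "__main__":
    for d_max in (3, 4):
        for k in (1, 2, 3):
            n_s = tree_size_closed_form(d_max, k, d_max - 1)  # size of each S_s
            n_t = tree_size_closed_form(d_max, k, d_max - 2)  # size of each T_t
            assert n_s == tree_size_by_levels(d_max - 1, d_max - 1, k)
            assert n_t == tree_size_by_levels(d_max - 2, d_max - 1, k)
            print(d_max, k, n_s, n_t)
```

These counts determine how many binary variables $u(s,i)$ and $v(t,i)$ the scheme graph contributes per tree.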

In the constraints of C2, we prepare an integer variable $\widetilde{\alpha}(u)$ for each vertex $u$ in the scheme graph that represents the chemical element $\alpha(u)\in\Lambda$ if $u$ is in a selected graph $H$ (or $\widetilde{\alpha}(u)=0$ otherwise) and an integer variable $\widetilde{\beta}(e)\in[0,3]$ (resp., $\widehat{\beta}(e)\in[0,3]$) for each edge $e$ (resp., $e=e(s,t)$ or $e(t,s)$, $s\in[1,s^*]$, $t\in[1,t^*]$) in the scheme graph that represents the multiplicity $\beta(e)\in[1,3]$ if $e$ is in a selected graph $H$ (or $\widetilde{\beta}(e)$ or $\widehat{\beta}(e)$ takes 0 otherwise). This determines a chemical graph $G=(H,\alpha,\beta)$. We also include constraints for a selected chemical graph $G$ to satisfy the valence condition $(\alpha(u),\alpha(v),\beta(uv))\in\Gamma$ for each edge $uv\in E$. See constraints (33) to (47) in Appendix C for the details.

In the constraints of C3, we introduce a variable for each descriptor and constraints with some more variables to compute the value of each descriptor in f(G) for a selected chemical graph G. See constraints (48) to (75) in Appendix C for the details.

Appendix C: All constraints in an MILP formulation for chemical acyclic graphs

To formulate an MILP that represents a chemical graph, we distinguish a tuple $(a,b,m)$ from a tuple $(b,a,m)$. For a tuple $\gamma=(a,b,m)\in\Lambda\times\Lambda\times\{1,2,3\}$, let $\overline{\gamma}$ denote the tuple $(b,a,m)$. Let $\Gamma_{<}\triangleq\{\overline{\gamma}\mid\gamma\in\Gamma_{>}\}$. We call a tuple $\gamma=(a,b,m)\in\Lambda\times\Lambda\times\{1,2,3\}$ proper if $m\le\min\{\mathrm{val}(a),\mathrm{val}(b)\}$ and $m\le\max\{\mathrm{val}(a),\mathrm{val}(b)\}-1$, where the latter is assumed because otherwise $G$ must consist of two atoms of $a=b$. Assume that each tuple $\gamma\in\Gamma$ is proper. Let $\epsilon$ be a fictitious chemical element that represents null, call a tuple $(a,b,0)$ with $a,b\in\Lambda\cup\{\epsilon\}$ fictitious, and define $\Gamma_0$ to be the set of all fictitious tuples; i.e., $\Gamma_0=\{(a,b,0)\mid a,b\in\Lambda\cup\{\epsilon\}\}$. To represent chemical elements $e\in\Lambda\cup\{\epsilon\}\cup\Gamma$ in an MILP, we encode these elements $e$ into some integers denoted by $[e]$. Assume that, for each element $a\in\Lambda$, $[a]$ is a positive integer and that $[\epsilon]=0$.
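As an illustration of the properness condition, the sketch below tests a few tuples $(a,b,m)$. The valence table is an assumption chosen for the example (C: 4, N: 3, O: 2), and the names `conj` and `is_proper` are hypothetical.

```python
# Illustrative valences; an assumption for this example only.
VAL = {"C": 4, "N": 3, "O": 2}


def conj(gamma):
    """Return the reversed tuple (b, a, m) of gamma = (a, b, m)."""
    a, b, m = gamma
    return (b, a, m)


def is_proper(gamma, val=VAL):
    """gamma = (a, b, m) is proper if m <= min(val(a), val(b))
    and m <= max(val(a), val(b)) - 1."""
    a, b, m = gamma
    return m <= min(val[a], val[b]) and m <= max(val[a], val[b]) - 1


if __name__ == "__main__":
    for gamma in [("C", "O", 2), ("C", "N", 3), ("O", "O", 2), ("N", "N", 3)]:
        print(gamma, conj(gamma), is_proper(gamma))
```

Under these valences, the tuples (O, O, 2) and (N, N, 3) fail the second condition: such a bond saturates both atoms, so the graph could contain nothing but those two atoms.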

Upper and lower bounds on descriptors

In our formulation of an MILP for inferring a vector $x$ in Stage 4, we fix the following descriptors as specified constants: the number $n(G)$ of vertices, the diameter $\mathrm{dia}(G)$, and the number $\mathrm{bl}_k(G)$ of leaf $k$-branches, which are set to be given integers $n$, $\mathrm{dia}$, and $\mathrm{bl}$, respectively. For each of the other descriptors, we specify a lower bound LB and an upper bound UB so that the descriptor takes a value in the range between LB and UB.

constants

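  • $n\ge 5$: the size $n(G)$ of $G$;

  • $\mathrm{LB}_{\mathrm{dg}}^{t}(i),\mathrm{UB}_{\mathrm{dg}}^{t}(i)\in[0,n]$, $i\in[1,4]$, $t\in\{\mathrm{in},\mathrm{ex}\}$: lower and upper bounds on the number $\mathrm{dg}_i^{t}(G)$ of $k$-internal/$k$-external vertices of degree $i$ in $G$;

  • $\mathrm{LB}_{\mathrm{ce}}^{t}(a),\mathrm{UB}_{\mathrm{ce}}^{t}(a)\in[0,n]$, $a\in\Lambda$, $t\in\{\mathrm{in},\mathrm{ex}\}$: lower and upper bounds on the number $\mathrm{ce}_a^{t}(G)$ of $k$-internal/$k$-external vertices $v$ with $\alpha(v)=a$ in $G$;

  • $\mathrm{LB}_{\mathrm{bd}}^{t}(m),\mathrm{UB}_{\mathrm{bd}}^{t}(m)\in[0,n-1]$, $m\in[2,3]$, $t\in\{\mathrm{in},\mathrm{ex}\}$: lower and upper bounds on the number $\mathrm{bd}_m^{t}(G)$ of $k$-internal/$k$-external edges $e$ with $\beta(e)=m$ in $G$;

  • $\mathrm{LB}_{\mathrm{ac}}^{t}(\gamma),\mathrm{UB}_{\mathrm{ac}}^{t}(\gamma)\in[0,n-1]$, $t\in\{\mathrm{in},\mathrm{ex}\}$, $\gamma\in\Gamma_{<}\cup\Gamma_{=}$: lower and upper bounds on the number $\mathrm{ac}_\gamma^{t}(G)$ of $k$-internal/$k$-external edges $e$ with adjacency-configuration $\gamma$ in $G$;

  • $\mathrm{LB}_{\mathrm{bc}}^{t}(\mu),\mathrm{UB}_{\mathrm{bc}}^{t}(\mu)\in[0,n-1]$, $t\in\{\mathrm{in},\mathrm{ex}\}$, $\mu\in\mathrm{Bc}$: lower and upper bounds on the number $\mathrm{bc}_\mu^{t}(G)$ of $k$-internal/$k$-external edges $e$ with bond-configuration $\mu$ in $G$;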

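variables x for descriptors

  • $\mathrm{dg}^{\mathrm{in}}(i),\mathrm{dg}^{\mathrm{ex}}(i)\in[0,n]$, $i\in[1,4]$: $\mathrm{dg}^{\mathrm{in}}(i)$ (resp., $\mathrm{dg}^{\mathrm{ex}}(i)$) represents $\mathrm{dg}_i^{\mathrm{in}}(G)$ (resp., $\mathrm{dg}_i^{\mathrm{ex}}(G)$);

  • $\mathrm{ce}^{\mathrm{in}}(a),\mathrm{ce}^{\mathrm{ex}}(a)\in[0,n]$, $a\in\Lambda$: $\mathrm{ce}^{\mathrm{in}}(a)$ (resp., $\mathrm{ce}^{\mathrm{ex}}(a)$) represents $\mathrm{ce}_a^{\mathrm{in}}(G)$ (resp., $\mathrm{ce}_a^{\mathrm{ex}}(G)$);

  • $\mathrm{bd}^{\mathrm{in}}(m),\mathrm{bd}^{\mathrm{ex}}(m)\in[0,2n]$, $m\in[1,3]$: $\mathrm{bd}^{\mathrm{in}}(m)$ (resp., $\mathrm{bd}^{\mathrm{ex}}(m)$) represents $\mathrm{bd}_m^{\mathrm{in}}(G)$ (resp., $\mathrm{bd}_m^{\mathrm{ex}}(G)$);

  • $\mathrm{ac}^{\mathrm{in}}(\gamma),\mathrm{ac}^{\mathrm{ex}}(\gamma)\in[0,n]$, $\gamma\in\Gamma_{<}\cup\Gamma_{=}$: $\mathrm{ac}^{\mathrm{in}}(\gamma)$ (resp., $\mathrm{ac}^{\mathrm{ex}}(\gamma)$) represents $\mathrm{ac}_\gamma^{\mathrm{in}}(G)$ (resp., $\mathrm{ac}_\gamma^{\mathrm{ex}}(G)$);

  • $\mathrm{bc}^{\mathrm{in}}(\mu),\mathrm{bc}^{\mathrm{ex}}(\mu)\in[0,n-1]$, $\mu\in\mathrm{Bc}$: $\mathrm{bc}^{\mathrm{in}}(\mu)$ (resp., $\mathrm{bc}^{\mathrm{ex}}(\mu)$) represents $\mathrm{bc}_\mu^{\mathrm{in}}(G)$ (resp., $\mathrm{bc}_\mu^{\mathrm{ex}}(G)$);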

constraints

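$$\mathrm{LB}_{\mathrm{dg}}^{t}(i)\le\mathrm{dg}^{t}(i)\le\mathrm{UB}_{\mathrm{dg}}^{t}(i),\quad i\in[1,4],\ t\in\{\mathrm{in},\mathrm{ex}\}, \tag{2}$$
$$\mathrm{LB}_{\mathrm{ce}}^{t}(a)\le\mathrm{ce}^{t}(a)\le\mathrm{UB}_{\mathrm{ce}}^{t}(a),\quad a\in\Lambda,\ t\in\{\mathrm{in},\mathrm{ex}\}, \tag{3}$$
$$\mathrm{LB}_{\mathrm{bd}}^{t}(m)\le\mathrm{bd}^{t}(m)\le\mathrm{UB}_{\mathrm{bd}}^{t}(m),\quad m\in[2,3],\ t\in\{\mathrm{in},\mathrm{ex}\}, \tag{4}$$
$$\mathrm{LB}_{\mathrm{ac}}^{t}(\gamma)\le\mathrm{ac}^{t}(\gamma)\le\mathrm{UB}_{\mathrm{ac}}^{t}(\gamma),\quad \gamma\in\Gamma,\ t\in\{\mathrm{in},\mathrm{ex}\}, \tag{5}$$
$$\mathrm{LB}_{\mathrm{bc}}^{t}(\mu)\le\mathrm{bc}^{t}(\mu)\le\mathrm{UB}_{\mathrm{bc}}^{t}(\mu),\quad \mu\in\mathrm{Bc},\ t\in\{\mathrm{in},\mathrm{ex}\}. \tag{6}$$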

We use the range-based method to define an applicability domain for our method. For this, we find the range (the minimum and maximum) of each descriptor over all relevant chemical compounds and represent each range as a set of linear constraints in the constraint set C1 of our MILP formulation. Recall that $D_\pi$ stands for a set of chemical graphs used for constructing a prediction function. However, the number of examples in $D_\pi$ may not be large enough to capture general features of the structure of chemical graphs. For this, we also use some data set from the whole set $DB_G$ of chemical graphs in a database. Let $DB_G^{(i)}$ denote the set of chemical graphs $G\in DB_G$ such that $n(G)=i$ for each integer $i\ge 1$. Based on this, we assume that the given lower and upper bounds on the above descriptors satisfy the following. For each $t\in\{\mathrm{in},\mathrm{ex}\}$,

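$$\min_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{dg}_i^{t}(G)}{n(G)}\le\frac{\mathrm{LB}_{\mathrm{dg}}^{t}(i)}{n}\le\frac{\mathrm{UB}_{\mathrm{dg}}^{t}(i)}{n}\le\max_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{dg}_i^{t}(G)}{n(G)},\quad i\in[1,4], \tag{7}$$
$$\min_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{ce}_a^{t}(G)}{n(G)}\le\frac{\mathrm{LB}_{\mathrm{ce}}^{t}(a)}{n}\le\frac{\mathrm{UB}_{\mathrm{ce}}^{t}(a)}{n}\le\max_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{ce}_a^{t}(G)}{n(G)},\quad a\in\Lambda, \tag{8}$$
$$\min_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{bd}_m^{t}(G)}{n(G)-1}\le\frac{\mathrm{LB}_{\mathrm{bd}}^{t}(m)}{n-1}\le\frac{\mathrm{UB}_{\mathrm{bd}}^{t}(m)}{n-1}\le\max_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{bd}_m^{t}(G)}{n(G)-1},\quad m\in[2,3], \tag{9}$$
$$\min_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{ac}_\gamma^{t}(G)}{n(G)-1}\le\frac{\mathrm{LB}_{\mathrm{ac}}^{t}(\gamma)}{n-1}\le\frac{\mathrm{UB}_{\mathrm{ac}}^{t}(\gamma)}{n-1}\le\max_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{ac}_\gamma^{t}(G)}{n(G)-1},\quad \gamma\in\Gamma, \tag{10}$$
$$\min_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{bc}_\mu^{t}(G)}{n(G)-1}\le\frac{\mathrm{LB}_{\mathrm{bc}}^{t}(\mu)}{n-1}\le\frac{\mathrm{UB}_{\mathrm{bc}}^{t}(\mu)}{n-1}\le\max_{G\in D_\pi\cup DB_G^{(n)}}\frac{\mathrm{bc}_\mu^{t}(G)}{n(G)-1},\quad \mu\in\mathrm{Bc}. \tag{11}$$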
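The range-based conditions (7) to (11) can be turned into concrete integer bounds. The sketch below is one possible way to do this, not the procedure used in the paper: it computes the widest integer pair (LB, UB) whose normalized values stay inside the observed range. The sample data and the function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from math import ceil, floor


def ratio_range(samples):
    """samples: iterable of (descriptor_value, denominator) pairs, where the
    denominator is n(G) for (7)-(8) and n(G) - 1 for (9)-(11)."""
    ratios = [Fraction(value, denom) for value, denom in samples]
    return min(ratios), max(ratios)


def widest_bounds(samples, target_denom):
    """Widest integers (LB, UB) with min ratio <= LB/target_denom and
    UB/target_denom <= max ratio. The caller must still check LB <= UB."""
    lo, hi = ratio_range(samples)
    return ceil(lo * target_denom), floor(hi * target_denom)


if __name__ == "__main__":
    # Hypothetical data: (number of degree-3 vertices, n(G)) per compound.
    data = [(2, 10), (5, 10), (3, 12)]
    print(widest_bounds(data, 20))
```

If the returned pair has LB greater than UB, no integer bound satisfies the condition for that descriptor at the chosen size.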

Construction of scheme graph

We infer a subgraph $H$ such that the maximum degree is $d_{\max}\in\{3,4\}$, $n(H)=n$, $\mathrm{bh}_k(H)=\mathrm{bh}$, and $\mathrm{bl}_k(H)=\mathrm{bl}$. For this, we first construct the scheme graph $SG(d_{\max},k,\mathrm{bh},t^*)$. We then prepare a binary variable $u(s,i)$ (resp., $v(t,i)$) for each vertex $u_{s,i}$ in tree $S_s$ (resp., $v_{t,i}$ in tree $T_t$).

Recall that when the two end-vertices of edge $a_i=(u_{s,1},u_{s',1})\in E_B=\{a_1,a_2,\ldots,a_{c^*}\}$ are connected in a selected subgraph $H$, either edge $a_i$ is directly used in $H$ or a path $P_i=(u_{s,1},v_{t',1},v_{t'+1,1},\ldots,v_{t'',1},u_{s',1})$ from $u_{s,1}$ to $u_{s',1}$ visiting some vertices in $P_{t^*}$ is constructed in $H$. We regard the index $i$ of each edge $a_i\in E_B=\{a_1,a_2,\ldots,a_{c^*}\}$ as the “color” of the edge, and define the color set of $E_B$ to be $[1,c^*]$. To introduce necessary linear constraints that can construct such a path $P_i$ properly in our MILP, we assign the color $i$ to the vertices $v_{t',1},v_{t'+1,1},\ldots,v_{t'',1}$ in $P_{t^*}$ when a path $P_i=(u_{s,1},v_{t',1},v_{t'+1,1},\ldots,v_{t'',1},u_{s',1})$ is used in $H$.

constants

Integers $d_{\max}\in\{3,4\}$, $n\ge 3$, $\mathrm{dia}\ge 3$, $k\ge 1$, $\mathrm{bh}\ge 1$ and $\mathrm{bl}\ge 2$;

variables

  • $a(i)\in\{0,1\}$, $i\in[1,c^*]$: $a(i)$ represents edge $a_i\in E_B$ ($a(i)=1\Leftrightarrow$ edge $a_i$ is used in $H$);

  • $e(s,t),e(t,s)\in\{0,1\}$, $s\in[1,s^*]$, $t\in[1,t^*]$: $e(s,t)$ (resp., $e(t,s)$) represents direction $(u_{s,1},v_{t,1})$ (resp., $(v_{t,1},u_{s,1})$), where $e(s,t)=1$ (resp., $e(t,s)=1$) $\Leftrightarrow$ edge $u_{s,1}v_{t,1}$ is used in $H$ and direction $(u_{s,1},v_{t,1})$ (resp., $(v_{t,1},u_{s,1})$) is assigned to edge $u_{s,1}v_{t,1}$;

  • $\chi(t)\in[0,c^*]$, $t\in[1,t^*]$: $\chi(t)$ represents the color $c\in[0,c^*]$ assigned to vertex $v_{t,1}$ ($\chi(t)=c\Leftrightarrow$ vertex $v_{t,1}$ is assigned color $c$, where $\chi(t)=c=0$ iff $v_{t,1}$ is not in $H$);

  • $\delta_{\mathrm{clr}}(t,c)\in\{0,1\}$, $t\in[1,t^*]$, $c\in[0,c^*]$ ($\delta_{\mathrm{clr}}(t,c)=1\Leftrightarrow\chi(t)=c$);

  • $\mathrm{clr}(c)\in[0,t^*]$, $c\in[0,c^*]$: the number of vertices $v_{t,1}$ with color $c$;

  • $\mathrm{deg}_b^{+}(s)\in[0,4]$, $s\in[1,s^*]$: the out-degree of vertex $u_{s,1}$ in the $k$-branch-subtree of $H$;

  • $\mathrm{deg}_b^{-}(s)\in[0,4]$, $s\in[1,s^*]$: the in-degree of vertex $u_{s,1}$ in the $k$-branch-subtree of $H$;

constraints

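$$\sum_{c\in[0,c^*]}\delta_{\mathrm{clr}}(t,c)=1,\quad \sum_{c\in[0,c^*]}c\cdot\delta_{\mathrm{clr}}(t,c)=\chi(t),\quad t\in[1,t^*], \tag{12}$$
$$\sum_{t\in[1,t^*]}\delta_{\mathrm{clr}}(t,c)=\mathrm{clr}(c),\quad c\in[0,c^*], \tag{13}$$
$$t^*(1-a(i))\ge\mathrm{clr}(i),\quad i\in[1,c^*], \tag{14}$$
$$e(s,t)+e(t,s)\le 1,\quad s\in[1,s^*],\ t\in[1,t^*], \tag{15}$$
$$\sum_{s\in[1,s^*]\setminus\{\mathrm{head}(c)\}}e(t,s)\le 1-\delta_{\mathrm{clr}}(t,c),\quad \sum_{s\in[1,s^*]\setminus\{\mathrm{tail}(c)\}}e(s,t)\le 1-\delta_{\mathrm{clr}}(t,c),\quad c\in[1,c^*],\ t\in[1,t^*], \tag{16}$$
$$\sum_{i\in E_B^{-}(s)}a(i)+\sum_{t\in[1,t^*]}e(t,s)=\mathrm{deg}_b^{-}(s),\quad \sum_{i\in E_B^{+}(s)}a(i)+\sum_{t\in[1,t^*]}e(s,t)=\mathrm{deg}_b^{+}(s),\quad \mathrm{deg}_b^{-}(s)+\mathrm{deg}_b^{+}(s)\le d_{\max},\quad s\in[1,s^*]. \tag{17}$$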
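Constraints (12) and (13) are a standard one-hot encoding of the color variable: each link-path vertex gets exactly one indicator set, and the color counts are column sums of the indicators. The sketch below only checks this encoding for a given color assignment; it does not solve the MILP, and the function name and input format are hypothetical.

```python
def color_counts(chi, c_star):
    """chi[t] is the color in [0, c_star] of link-path vertex t (0 = unused).
    Builds the indicators delta_clr(t, c), checks both parts of (12),
    and returns clr(c) for c = 0..c_star as defined by (13)."""
    clr = [0] * (c_star + 1)
    for chi_t in chi:
        delta_row = [1 if chi_t == c else 0 for c in range(c_star + 1)]
        # First part of (12): exactly one indicator is set.
        assert sum(delta_row) == 1
        # Second part of (12): the indicators recover the color value.
        assert sum(c * delta_row[c] for c in range(c_star + 1)) == chi_t
        for c in range(c_star + 1):
            clr[c] += delta_row[c]  # (13): count the vertices of color c
    return clr


if __name__ == "__main__":
    # Four link-path vertices: two of color 1, one unused, one of color 2.
    print(color_counts([1, 1, 0, 2], c_star=2))
```

With these counts, constraint (14) forces the count for color $i$ to be zero whenever edge $a_i$ is used directly.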

Selecting a subgraph

From the scheme graph $SG(d_{\max},k,\mathrm{bh},t^*)$, we select a subgraph $H$ such that $n(H)=n$, $\mathrm{dia}(H)=\mathrm{dia}$, $\mathrm{bh}_k(H)=\mathrm{bh}$, and $\mathrm{bl}_k(H)=\mathrm{bl}$.

constants

  • Integers $d_{\max}\in\{3,4\}$, $n\ge 3$, $\mathrm{dia}\ge 3$, $k\ge 1$, $\mathrm{bh}\ge 1$ and $\mathrm{bl}\ge 2$;

  • For each tree $S_s=T(d_{\max}-1,d_{\max}-1,k)$, prepare
    • the set $\mathrm{Cld}_S(i)$ of the indices of children of a vertex $v_i$;
    • the index $\mathrm{prt}(i)$ of the parent of a non-root vertex $v_i$;
    • the set $\mathrm{Dsn}_S(d)$ of indices $i$ of a vertex $v_i$ whose depth is $d$;
    • a proper set $P_{\mathrm{prc}}(d_{\max}-1,d_{\max}-1,k)$ of index pairs, which we denote by $P_{S,\mathrm{prc}}$;
  • For each tree $T_t=T(d_{\max}-2,d_{\max}-1,k)$, prepare
    • the set $\mathrm{Cld}_T(i)$ of the indices of children of a vertex $v_i$;
    • the index $\mathrm{prt}(i)$ of the parent of a non-root vertex $v_i$;
    • a proper set $P_{\mathrm{prc}}(d_{\max}-2,d_{\max}-1,k)$ of index pairs, which we denote by $P_{T,\mathrm{prc}}$;

variables

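  • $\sigma(s)\in\{0,1\}$, $s\in[1,s^*]$: ($\sigma(s)=1\Leftrightarrow$ vertex $u_{s,1}$ is a non-leaf $k$-branch or a root);

  • $u(s,i)\in\{0,1\}$, $s\in[1,s^*]$, $i\in[1,n_{\mathrm{tree}}^{S}]$: $u(s,i)$ represents vertex $u_{s,i}$ ($u(s,i)=1\Leftrightarrow$ vertex $u_{s,i}$ is used in $H$ and edge $e_{s,i}$ ($i\ge 2$) is used in $H$), ($u(s,1)=1$ and $\sigma(s)=0\Leftrightarrow$ vertex $u_{s,1}$ is a leaf $k$-branch);

  • $v(t,i)\in\{0,1\}$, $t\in[1,t^*]$, $i\in[1,n_{\mathrm{tree}}^{T}]$: $v(t,i)$ represents vertex $v_{t,i}$ ($v(t,i)=1\Leftrightarrow$ vertex $v_{t,i}$ is used in $H$ and edge $e_{t,i}$ ($i\ge 2$) is used in $H$);

  • $e(t)\in\{0,1\}$, $t\in[1,t^*+1]$: $e(t)$ represents edge $e_{t,1}=v_{t-1,1}v_{t,1}$, where $e_{1,1}$ and $e_{t^*+1,1}$ are fictitious edges ($e(t)=1\Leftrightarrow$ edge $e_{t,1}$ is used in $H$);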

constraints

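$$u(s,i)\ge u(s,j),\quad s\in[1,s^*],\ (i,j)\in P_{S,\mathrm{prc}}, \tag{18}$$
$$v(t,i)\ge v(t,j),\quad t\in[1,t^*],\ (i,j)\in P_{T,\mathrm{prc}}, \tag{19}$$
$$\sum_{s\in[1,s^*],\,i\in[1,n_{\mathrm{tree}}^{S}]}u(s,i)+\sum_{t\in[1,t^*],\,i\in[1,n_{\mathrm{tree}}^{T}]}v(t,i)=n, \tag{20}$$
$$\sum_{i\in[1,n_{\mathrm{tree}}^{S}]}u(s,i)\le 2+2\sum_{j\in\mathrm{Cld}_S(1)}u(s,j),\quad s\in[1,s^*], \tag{21}$$
$$\sum_{i\in[1,n_{\mathrm{tree}}^{T}]}v(t,i)\le 2+2\sum_{j\in\mathrm{Cld}_T(1)}v(t,j),\quad t\in[1,t^*], \tag{22}$$
$$e(t+1)+\sum_{s\in[1,s^*]}e(t,s)=v(t,1),\quad e(t)+\sum_{s\in[1,s^*]}e(s,t)=v(t,1),\quad \sum_{c\in[1,c^*]}\delta_{\mathrm{clr}}(t,c)=v(t,1)\quad(\text{where } e(1)=e(t^*+1)=0),\quad t\in[1,t^*], \tag{23}$$
$$c^*\cdot(1-e(t+1))\ge\chi(t)-\chi(t+1)\ge v(t,1)-e(t+1),\quad t\in[1,t^*-1], \tag{24}$$
$$a(i)+\sum_{t\in[1,t^*]}e(t,i+1)=u(i+1,1),\quad i\in[1,c^*], \tag{25}$$
$$\sigma(s)\le u(s,1),\quad s\in[1,s^*], \tag{26}$$
$$\sigma(s)=u(s,1)=1,\quad\text{if } u_s \text{ is the root}, \tag{27}$$
$$(d_{\max}-1)\,\sigma(s)\ge\sum_{s'\in\mathrm{Cld}_B(s)}u(s',1)\ge 2\sigma(s),\quad \sum_{i\in\mathrm{Dsn}_S(k)}u(s,i)\ge u(s,1)-\sigma(s),\quad s\in[1,s^*],\ u_s\ne\text{root}, \tag{28}$$
$$\sum_{s\in[2,s^*]}(u(s,1)-\sigma(s))=\mathrm{bl},\quad \sum_{s\in V_B(\mathrm{bh})}u(s,1)\ge 1, \tag{29}$$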
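$$\sum_{s\in V_{B,s^{\mathrm{left}}}}u(s,1)+\sum_{i\in E_{B,s^{\mathrm{left}}}}\mathrm{clr}(i)=\left\lceil\frac{\mathrm{dia}}{2}\right\rceil-k, \tag{30}$$
$$\sum_{s\in V_{B,s^{\mathrm{right}}}}u(s,1)+\sum_{i\in E_{B,s^{\mathrm{right}}}}\mathrm{clr}(i)=\left\lfloor\frac{\mathrm{dia}}{2}\right\rfloor-k, \tag{31}$$
$$\sum_{i\in V_{B,s}}u(i,1)+\sum_{i\in E_{B,s}}\mathrm{clr}(i)\le\left\lfloor\frac{\mathrm{dia}}{2}\right\rfloor-k,\quad s\in L_B\setminus\{s^{\mathrm{left}},s^{\mathrm{right}}\}. \tag{32}$$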

Constraints (21) and (22) represent an extension of constraint (1) on the size of 2-fringe-trees to the case of a general branch-parameter k.

Assigning multiplicity

We prepare an integer variable $\widetilde{\beta}(e)$ or $\widehat{\beta}(e)$ for each edge $e$ in the scheme graph $SG(d_{\max},k,\mathrm{bh},t^*)$ to denote the multiplicity of $e$ in a selected graph $H$ and include necessary constraints for the variables to satisfy in $H$.

constants

  • Prepare functions tail and head such that $a_i=(u_{\mathrm{tail}(i)},u_{\mathrm{head}(i)})\in E_B$;

  • Assume that each edge in a tree $S_s$, $s\in[1,s^*]$ (resp., $T_t$, $t\in[1,t^*]$) is denoted by $e_{s,i}$ (resp., $e_{t,i}$) with the integer $i\in[2,n_{\mathrm{tree}}^{S}]$ of the head $u_{s,i}$ (resp., $v_{t,i}$) of the edge;

variables

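  • $\widetilde{\beta}(i)\in[0,3]$, $i\in[1,c^*]$: $\widetilde{\beta}(i)$ represents the multiplicity of edge $a_i$, where $\widetilde{\beta}(i)=0$ if edge $a_i$ is not in an inferred chemical graph $G$;

  • $\widetilde{\beta}(p,i)\in[0,3]$, $p\in[1,s^*+t^*]$, $i\in[2,n_{\mathrm{tree}}^{S}]$: $\widetilde{\beta}(p,i)$ with $p\le s^*$ (resp., $p>s^*$) represents the multiplicity of edge $e_{p,i}$ (resp., $e_{p-s^*,i}$);

  • $\widetilde{\beta}(t,1)\in[0,3]$, $t\in[1,t^*+1]$: $\widetilde{\beta}(t,1)$ represents the multiplicity of edge $e_{t,1}$;

  • $\widehat{\beta}(s,t)\in[0,3]$, $s\in[1,s^*]$, $t\in[1,t^*]$: $\widehat{\beta}(s,t)$ represents the multiplicity of edge $u_{s,1}v_{t,1}$;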

constraints

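$$a(i)\le\widetilde{\beta}(i)\le 3a(i),\quad i\in[1,c^*], \tag{33}$$
$$u(s,i)\le\widetilde{\beta}(s,i)\le 3u(s,i),\quad s\in[1,s^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{34}$$
$$v(t,i)\le\widetilde{\beta}(s^*+t,i)\le 3v(t,i),\quad t\in[1,t^*],\ i\in[2,n_{\mathrm{tree}}^{T}], \tag{35}$$
$$e(t)\le\widetilde{\beta}(t,1)\le 3e(t),\quad t\in[1,t^*+1], \tag{36}$$
$$e(s,t)+e(t,s)\le\widehat{\beta}(s,t)\le 3e(s,t)+3e(t,s),\quad s\in[1,s^*],\ t\in[1,t^*]. \tag{37}$$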
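Each of constraints (33) to (37) links a binary selection variable $x$ to an integer multiplicity $\beta\in[0,3]$ through $x\le\beta\le 3x$. A brute-force enumeration, given here as a minimal sketch with a hypothetical function name, shows which multiplicities remain feasible in each case.

```python
def feasible_multiplicities(x: int):
    """All integer multiplicities beta in [0, 3] with x <= beta <= 3 * x,
    where x in {0, 1} says whether the edge is selected."""
    return [beta for beta in range(0, 4) if x <= beta <= 3 * x]


if __name__ == "__main__":
    print(feasible_multiplicities(0))  # edge not selected
    print(feasible_multiplicities(1))  # edge selected
```

An unselected edge is forced to multiplicity 0 and a selected edge to a multiplicity in [1, 3], which is the intended meaning of the multiplicity variables.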

Assigning chemical elements and valence condition

We include constraints so that each vertex $v$ in a selected graph $H$ satisfies the valence condition; i.e., $\beta(v)\le\mathrm{val}(\alpha(v))$. With these constraints, a chemical acyclic graph $G=(H,\alpha,\beta)$ on a selected subgraph $H$ will be constructed.

constants

  • A set $\Lambda\cup\{\epsilon\}$ of chemical elements, where $\epsilon$ denotes null;

  • A coding $[a]$, $a\in\Lambda\cup\{\epsilon\}$, such that $[\epsilon]=0$; $[a]\ge 1$, $a\in\Lambda$; and $[a]\ne[b]$ if $a\ne b$. Let $[\Lambda]$ and $[\Lambda\cup\{\epsilon\}]$ denote $\{[a]\mid a\in\Lambda\}$ and $\{[a]\mid a\in\Lambda\cup\{\epsilon\}\}$, respectively;

  • A valence function: $\mathrm{val}:\Lambda\to[1,4]$;

  • Let $E_B(s)$ denote the set of indices $i$ of all edges $a_i\in E_B$ adjacent to vertex $u_{s,1}$ in $T_B$.

variables

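  • $\widetilde{\alpha}(p,i)\in[\Lambda\cup\{\epsilon\}]$, $p\in[1,s^*+t^*]$, $i\in[1,n_{\mathrm{tree}}^{S}]$: $\widetilde{\alpha}(p,i)$ with $p\le s^*$ (resp., $p>s^*$) represents $\alpha(u_{p,i})$ (resp., $\alpha(v_{p-s^*,i})$);

  • $\delta_{\alpha}(p,i,a)\in\{0,1\}$, $p\in[1,s^*+t^*]$, $i\in[1,n_{\mathrm{tree}}^{S}]$, $a\in\Lambda\cup\{\epsilon\}$: $\delta_{\alpha}(p,i,a)=1\Leftrightarrow\alpha(u_{p,i})=a$ for $p\le s^*$ and $\alpha(v_{p-s^*,i})=a$ for $p>s^*$;

  • $\delta_{\widetilde{\beta}}(i,m)\in\{0,1\}$, $i\in[1,c^*]$, $m\in[0,3]$: $\delta_{\widetilde{\beta}}(i,m)=1\Leftrightarrow$ the multiplicity of edge $a_i$ in an inferred chemical graph $G$ is $m$;

  • $\delta_{\widetilde{\beta}}(p,i,m)\in\{0,1\}$, $p\in[1,s^*+t^*]$, $i\in[2,n_{\mathrm{tree}}^{S}]$, $m\in[0,3]$: $\delta_{\widetilde{\beta}}(p,i,m)=1\Leftrightarrow$ the multiplicity of edge $e_{p,i}$, $p\le s^*$ (or $e_{p-s^*,i}$, $p>s^*$) in $G$ is $m$;

  • $\delta_{\widetilde{\beta}}(t,1,m)\in\{0,1\}$, $t\in[1,t^*+1]$, $m\in[0,3]$: $\delta_{\widetilde{\beta}}(t,1,m)=1\Leftrightarrow$ the multiplicity of edge $e_{t,1}$ in $G$ is $m$;

  • $\delta_{\widehat{\beta}}(s,t,m)\in\{0,1\}$, $s\in[1,s^*]$, $t\in[1,t^*]$, $m\in[0,3]$: $\delta_{\widehat{\beta}}(s,t,m)=1\Leftrightarrow$ the multiplicity of edge $u_{s,1}v_{t,1}$ in $G$ is $m$;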

constraints

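$$\sum_{a\in\Lambda\cup\{\epsilon\}}\delta_{\alpha}(p,i,a)=1,\quad p\in[1,s^*+t^*],\ i\in[1,n_{\mathrm{tree}}^{S}], \tag{38}$$
$$\sum_{a\in\Lambda\cup\{\epsilon\}}[a]\cdot\delta_{\alpha}(p,i,a)=\widetilde{\alpha}(p,i),\quad p\in[1,s^*+t^*],\ i\in[1,n_{\mathrm{tree}}^{S}], \tag{39}$$
$$\sum_{m\in[0,3]}\delta_{\widetilde{\beta}}(i,m)=1,\quad \sum_{m\in[1,3]}m\cdot\delta_{\widetilde{\beta}}(i,m)=\widetilde{\beta}(i),\quad i\in[1,c^*], \tag{40}$$
$$\sum_{m\in[0,3]}\delta_{\widetilde{\beta}}(p,i,m)=1,\quad \sum_{m\in[1,3]}m\cdot\delta_{\widetilde{\beta}}(p,i,m)=\widetilde{\beta}(p,i),\quad p\in[1,s^*+t^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{41}$$
$$\sum_{m\in[0,3]}\delta_{\widetilde{\beta}}(t,1,m)=1,\quad \sum_{m\in[1,3]}m\cdot\delta_{\widetilde{\beta}}(t,1,m)=\widetilde{\beta}(t,1),\quad t\in[1,t^*+1], \tag{42}$$
$$\sum_{m\in[0,3]}\delta_{\widehat{\beta}}(s,t,m)=1,\quad \sum_{m\in[0,3]}m\cdot\delta_{\widehat{\beta}}(s,t,m)=\widehat{\beta}(s,t),\quad s\in[1,s^*],\ t\in[1,t^*], \tag{43}$$
$$\sum_{i\in E_B(s)}\widetilde{\beta}(i)+\sum_{t\in[1,t^*]}\widehat{\beta}(s,t)+\sum_{j\in\mathrm{Cld}_S(1)}\widetilde{\beta}(s,j)\le\sum_{a\in\Lambda}\mathrm{val}(a)\cdot\delta_{\alpha}(s,1,a),\quad s\in[1,s^*], \tag{44}$$
$$\sum_{s\in[1,s^*]}\widehat{\beta}(s,t)+\widetilde{\beta}(t,1)+\widetilde{\beta}(t+1,1)+\sum_{j\in\mathrm{Cld}_T(1)}\widetilde{\beta}(s^*+t,j)\le\sum_{a\in\Lambda}\mathrm{val}(a)\cdot\delta_{\alpha}(s^*+t,1,a),\quad t\in[1,t^*], \tag{45}$$
$$\widetilde{\beta}(s,i)+\sum_{j\in\mathrm{Cld}_S(i)}\widetilde{\beta}(s,j)\le\sum_{a\in\Lambda}\mathrm{val}(a)\cdot\delta_{\alpha}(s,i,a),\quad s\in[1,s^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{46}$$
$$\widetilde{\beta}(s^*+t,i)+\sum_{j\in\mathrm{Cld}_T(i)}\widetilde{\beta}(s^*+t,j)\le\sum_{a\in\Lambda}\mathrm{val}(a)\cdot\delta_{\alpha}(s^*+t,i,a),\quad t\in[1,t^*],\ i\in[2,n_{\mathrm{tree}}^{T}]. \tag{47}$$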

Descriptors on mass, the numbers of elements and bonds

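We include constraints to compute descriptors $\overline{\mathrm{ms}}(G)$, $\mathrm{ce}_a(G)$ ($a\in\Lambda$), $\mathrm{bd}_m(G)$ ($m\in[2,3]$) and $n_{\mathrm{H}}(G)$ according to the definitions in "Modeling of chemical compounds" section.

constants

  • A function $\mathrm{mass}^*:\Lambda\to\mathbb{Z}$ (we let $\mathrm{mass}(a)$ denote the observed mass of a chemical element $a\in\Lambda$, and define $\mathrm{mass}^*(a)\triangleq\lfloor 10\cdot\mathrm{mass}(a)\rfloor$);

variables

  • $\mathrm{Mass}\in\mathbb{Z}$: Mass represents $\sum_{v\in V}\mathrm{mass}^*(\alpha(v))$;

  • $\mathrm{bd}(m)\in[0,2n]$, $m\in[1,3]$;

  • $n_{\mathrm{H}}\in[0,4n]$: the number $n_{\mathrm{H}}(G)$ of hydrogen atoms to be included in $G$;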

constraints

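$$\sum_{p\in[1,s^*+t^*]}\delta_{\alpha}(p,1,a)=\mathrm{ce}^{\mathrm{in}}(a),\quad \sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\delta_{\alpha}(p,i,a)=\mathrm{ce}^{\mathrm{ex}}(a),\quad a\in\Lambda, \tag{48}$$
$$\sum_{a\in\Lambda}\mathrm{mass}^*(a)\bigl(\mathrm{ce}^{\mathrm{in}}(a)+\mathrm{ce}^{\mathrm{ex}}(a)\bigr)=\mathrm{Mass}, \tag{49}$$
$$\sum_{i\in[1,c^*]}\delta_{\widetilde{\beta}}(i,m)+\sum_{s\in[1,s^*],\,t\in[1,t^*]}\delta_{\widehat{\beta}}(s,t,m)+\sum_{t\in[2,t^*]}\delta_{\widetilde{\beta}}(t,1,m)=\mathrm{bd}^{\mathrm{in}}(m),\quad \sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\delta_{\widetilde{\beta}}(p,i,m)=\mathrm{bd}^{\mathrm{ex}}(m),\quad m\in[1,3], \tag{50}$$
$$\sum_{a\in\Lambda}\mathrm{val}(a)\bigl(\mathrm{ce}^{\mathrm{in}}(a)+\mathrm{ce}^{\mathrm{ex}}(a)\bigr)-2\bigl(n-1+\mathrm{bd}^{\mathrm{in}}(2)+\mathrm{bd}^{\mathrm{ex}}(2)+2\,\mathrm{bd}^{\mathrm{in}}(3)+2\,\mathrm{bd}^{\mathrm{ex}}(3)\bigr)=n_{\mathrm{H}}. \tag{51}$$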
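Constraint (51) counts hydrogen atoms as the total valence minus twice the total bond multiplicity of an acyclic graph with $n$ non-hydrogen atoms. The sketch below evaluates this formula on three small acyclic molecules; the valence table and the function name are assumptions made for the example, and internal and external counts are merged because (51) only uses their sums.

```python
# Illustrative valences; an assumption for this example only.
VAL = {"C": 4, "N": 3, "O": 2}


def hydrogen_count(ce, bd2, bd3, val=VAL):
    """Evaluate constraint (51).
    ce:  dict element -> number of atoms (internal plus external),
    bd2: number of double bonds, bd3: number of triple bonds."""
    n = sum(ce.values())                      # non-hydrogen atoms
    total_valence = sum(val[a] * count for a, count in ce.items())
    # A tree on n vertices has n - 1 edges; a double (triple) bond adds
    # 1 (2) to the total multiplicity beyond the single bond.
    return total_valence - 2 * (n - 1 + bd2 + 2 * bd3)


if __name__ == "__main__":
    print(hydrogen_count({"C": 2, "O": 1}, bd2=0, bd3=0))  # ethanol, C2H6O
    print(hydrogen_count({"C": 2}, bd2=1, bd3=0))          # ethylene, C2H4
    print(hydrogen_count({"C": 2, "N": 1}, bd2=0, bd3=1))  # acetonitrile, C2H3N
```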

Descriptor for the Number of Specified Degree

We include constraints to compute descriptors $\mathrm{dg}_i(G)$ ($i\in[1,4]$) according to the definitions in "Modeling of chemical compounds" section. We also add constraints so that the maximum degree of a vertex in $H$ is at most 3 (resp., equal to 4) when $d_{\max}=3$ (resp., $d_{\max}=4$).

variables

  • $\mathrm{deg}(p,i)\in[0,4]$, $p\in[1,s^*+t^*]$, $i\in[1,n_{\mathrm{tree}}^{S}]$: $\mathrm{deg}(p,i)$ represents $\deg_H(u_{p,i})$ for $p\le s^*$ or $\deg_H(v_{p-s^*,i})$ for $p>s^*$;

  • $\delta_{\deg}(p,i,d)\in\{0,1\}$, $p\in[1,s^*+t^*]$, $i\in[1,n_{\mathrm{tree}}^{S}]$, $d\in[0,4]$: $\delta_{\deg}(p,i,d)=1\Leftrightarrow\mathrm{deg}(p,i)=d$;

constraints

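$$\sum_{i\in E_B(s)}a(i)+\sum_{t\in[1,t^*]}(e(s,t)+e(t,s))+\sum_{j\in\mathrm{Cld}_S(1)}u(s,j)=\mathrm{deg}(s,1),\quad s\in[1,s^*], \tag{52}$$
$$u(s,i)+\sum_{j\in\mathrm{Cld}_S(i)}u(s,j)=\mathrm{deg}(s,i),\quad s\in[1,s^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{53}$$
$$2v(t,1)+\sum_{j\in\mathrm{Cld}_T(1)}v(t,j)=\mathrm{deg}(s^*+t,1),\quad t\in[1,t^*], \tag{54}$$
$$v(t,i)+\sum_{j\in\mathrm{Cld}_T(i)}v(t,j)=\mathrm{deg}(s^*+t,i),\quad t\in[1,t^*],\ i\in[2,n_{\mathrm{tree}}^{T}], \tag{55}$$
$$\sum_{d\in[0,4]}\delta_{\deg}(p,i,d)=1,\quad \sum_{d\in[1,4]}d\cdot\delta_{\deg}(p,i,d)=\mathrm{deg}(p,i),\quad p\in[1,s^*+t^*],\ i\in[1,n_{\mathrm{tree}}^{S}], \tag{56}$$
$$\sum_{p\in[1,s^*+t^*]}\delta_{\deg}(p,1,d)=\mathrm{dg}^{\mathrm{in}}(d),\quad \sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\delta_{\deg}(p,i,d)=\mathrm{dg}^{\mathrm{ex}}(d),\quad d\in[1,4], \tag{57}$$
$$\mathrm{dg}^{\mathrm{in}}(4)+\mathrm{dg}^{\mathrm{ex}}(4)\ge 1\ (\text{resp., } =0)\quad\text{when } d_{\max}=4\ (\text{resp., } =3). \tag{58}$$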

Descriptor for the number of adjacency-configurations

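We include constraints to compute descriptors $\mathrm{ac}_\gamma(G)$ ($\gamma=(a,b,m)\in\Gamma$) according to the definitions in "Modeling of chemical compounds" section.

constants

  • A set $\Gamma=\Gamma_{<}\cup\Gamma_{=}\cup\Gamma_{>}$ of proper tuples $(a,b,m)\in\Lambda\times\Lambda\times[1,3]$;

  • The set $\Gamma_0=\{(a,b,0)\mid a,b\in\Lambda\cup\{\epsilon\}\}$;

variables

  • $\delta_{\tau}(i,\gamma)\in\{0,1\}$, $i\in[1,c^*]$, $\gamma\in\Gamma\cup\Gamma_0$: $\delta_{\tau}(i,\gamma)=1\Leftrightarrow$ edge $a_i$ is assigned tuple $\gamma$; i.e., $\gamma=(\widetilde{\alpha}(\mathrm{tail}(i),1),\widetilde{\alpha}(\mathrm{head}(i),1),\widetilde{\beta}(i))$;

  • $\delta_{\tau}(t,1,\gamma)\in\{0,1\}$, $t\in[2,t^*]$, $\gamma\in\Gamma\cup\Gamma_0$: $\delta_{\tau}(t,1,\gamma)=1\Leftrightarrow$ edge $e_{t,1}$ is assigned tuple $\gamma$; i.e., $\gamma=(\widetilde{\alpha}(s^*+t-1,1),\widetilde{\alpha}(s^*+t,1),\widetilde{\beta}(t,1))$;

  • $\delta_{\tau}(p,i,\gamma)\in\{0,1\}$, $p\in[1,s^*+t^*]$, $i\in[2,n_{\mathrm{tree}}^{S}]$, $\gamma\in\Gamma\cup\Gamma_0$: $\delta_{\tau}(p,i,\gamma)=1\Leftrightarrow$ edge $e_{p,i}$, $p\le s^*$ (or $e_{p-s^*,i}$, $p>s^*$) is assigned tuple $\gamma$; i.e., $\gamma=(\widetilde{\alpha}(p,\mathrm{prt}(i)),\widetilde{\alpha}(p,i),\widetilde{\beta}(p,i))$;

  • $\delta_{\widehat{\tau}}(s,t,\gamma)\in\{0,1\}$, $s\in[1,s^*]$, $t\in[1,t^*]$, $\gamma\in\Gamma\cup\Gamma_0$: $\delta_{\widehat{\tau}}(s,t,\gamma)=1\Leftrightarrow$ edge $u_{s,1}v_{t,1}$ is assigned tuple $\gamma$; i.e., $\gamma=(\widetilde{\alpha}(s,1),\widetilde{\alpha}(s^*+t,1),\widehat{\beta}(s,t))$;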

constraints

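$$\sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[a]\,\delta_{\tau}(i,(a,b,m))=\widetilde{\alpha}(\mathrm{tail}(i),1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[b]\,\delta_{\tau}(i,(a,b,m))=\widetilde{\alpha}(\mathrm{head}(i),1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}m\cdot\delta_{\tau}(i,(a,b,m))=\widetilde{\beta}(i),\quad \sum_{\gamma\in\Gamma\cup\Gamma_0}\delta_{\tau}(i,\gamma)=1,\quad i\in[1,c^*], \tag{59}$$
$$\sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[a]\,\delta_{\tau}(t,1,(a,b,m))=\widetilde{\alpha}(s^*+t-1,1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[b]\,\delta_{\tau}(t,1,(a,b,m))=\widetilde{\alpha}(s^*+t,1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}m\cdot\delta_{\tau}(t,1,(a,b,m))=\widetilde{\beta}(t,1),\quad \sum_{\gamma\in\Gamma\cup\Gamma_0}\delta_{\tau}(t,1,\gamma)=1,\quad t\in[2,t^*], \tag{60}$$
$$\sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[a]\,\delta_{\tau}(p,i,(a,b,m))=\widetilde{\alpha}(p,\mathrm{prt}(i)),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[b]\,\delta_{\tau}(p,i,(a,b,m))=\widetilde{\alpha}(p,i),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}m\cdot\delta_{\tau}(p,i,(a,b,m))=\widetilde{\beta}(p,i),\quad \sum_{\gamma\in\Gamma\cup\Gamma_0}\delta_{\tau}(p,i,\gamma)=1,\quad p\in[1,s^*+t^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{61}$$
$$\sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[a]\,\delta_{\widehat{\tau}}(s,t,(a,b,m))=\widetilde{\alpha}(s,1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}[b]\,\delta_{\widehat{\tau}}(s,t,(a,b,m))=\widetilde{\alpha}(s^*+t,1),\quad \sum_{(a,b,m)\in\Gamma\cup\Gamma_0}m\cdot\delta_{\widehat{\tau}}(s,t,(a,b,m))=\widehat{\beta}(s,t),\quad \sum_{\gamma\in\Gamma\cup\Gamma_0}\delta_{\widehat{\tau}}(s,t,\gamma)=1,\quad s\in[1,s^*],\ t\in[1,t^*], \tag{62}$$
$$\sum_{i\in[1,c^*]}\bigl(\delta_{\tau}(i,\gamma)+\delta_{\tau}(i,\overline{\gamma})\bigr)+\sum_{s\in[1,s^*],\,t\in[1,t^*]}\bigl(\delta_{\widehat{\tau}}(s,t,\gamma)+\delta_{\widehat{\tau}}(s,t,\overline{\gamma})\bigr)+\sum_{t\in[2,t^*]}\bigl(\delta_{\tau}(t,1,\gamma)+\delta_{\tau}(t,1,\overline{\gamma})\bigr)=\mathrm{ac}^{\mathrm{in}}(\gamma),\quad \gamma\in\Gamma_{<}, \tag{63}$$
$$\sum_{i\in[1,c^*]}\delta_{\tau}(i,\gamma)+\sum_{s\in[1,s^*],\,t\in[1,t^*]}\delta_{\widehat{\tau}}(s,t,\gamma)+\sum_{t\in[2,t^*]}\delta_{\tau}(t,1,\gamma)=\mathrm{ac}^{\mathrm{in}}(\gamma),\quad \gamma\in\Gamma_{=}, \tag{64}$$
$$\sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\bigl(\delta_{\tau}(p,i,\gamma)+\delta_{\tau}(p,i,\overline{\gamma})\bigr)=\mathrm{ac}^{\mathrm{ex}}(\gamma),\quad \gamma\in\Gamma_{<}, \tag{65}$$
$$\sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\delta_{\tau}(p,i,\gamma)=\mathrm{ac}^{\mathrm{ex}}(\gamma),\quad \gamma\in\Gamma_{=}. \tag{66}$$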

Descriptor for bond-configuration

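We include constraints to compute the descriptors for bond-configuration $\mathrm{bc}_\mu(G)$, $\mu\in\mathrm{Bc}$, according to the definition.

variables

  • $\mathrm{bc}(\mu)\in[0,n-1]$, $\mu\in\mathrm{Bc}$;

  • $\delta_{\mathrm{dc}}(i,d,d',m)\in\{0,1\}$, $i\in[1,c^*]$, $d,d'\in[0,4]$, $m\in[0,3]$: $\delta_{\mathrm{dc}}(i,d,d',m)=1\Leftrightarrow\deg_H(u_{\mathrm{tail}(i)})=d$, $\deg_H(u_{\mathrm{head}(i)})=d'$ and $\beta(a_i)=m\in[1,3]$ in $G$;

  • $\delta_{\mathrm{dc}}(t,1,d,d',m)\in\{0,1\}$, $t\in[2,t^*]$, $d,d'\in[0,4]$, $m\in[0,3]$: $\delta_{\mathrm{dc}}(t,1,d,d',m)=1\Leftrightarrow\deg_H(v_{t-1,1})=d$, $\deg_H(v_{t,1})=d'$ and $\beta(e_{t,1})=m\in[1,3]$ in $G$;

  • $\delta_{\mathrm{dc}}(p,i,d,d',m)\in\{0,1\}$, $p\in[1,s^*+t^*]$, $i\in[2,n_{\mathrm{tree}}^{S}]$, $d,d'\in[0,4]$, $m\in[0,3]$: $\delta_{\mathrm{dc}}(p,i,d,d',m)=1\Leftrightarrow\deg_H(u_{p,\mathrm{prt}(i)})=d$, $\deg_H(u_{p,i})=d'$ and $\beta(e_{p,i})=m\in[1,3]$ for $p\le s^*$ (or $\deg_H(v_{p-s^*,\mathrm{prt}(i)})=d$, $\deg_H(v_{p-s^*,i})=d'$ and $\beta(e_{p-s^*,i})=m\in[1,3]$ for $p>s^*$) in $G$;

  • $\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)\in\{0,1\}$, $s\in[1,s^*]$, $t\in[1,t^*]$, $d,d'\in[0,4]$, $m\in[0,3]$: $\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)=1\Leftrightarrow\deg_H(u_{s,1})=d$, $\deg_H(v_{t,1})=d'$ and $\beta(u_{s,1}v_{t,1})=m\in[1,3]$ in $G$;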

constraints

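$$\sum_{d,d'\in[0,4],\,m\in[0,3]}\delta_{\mathrm{dc}}(i,d,d',m)=1,\quad \sum_{d,d'\in[0,4],\,m\in[0,3]}m\cdot\delta_{\mathrm{dc}}(i,d,d',m)=\widetilde{\beta}(i),\quad \sum_{d\in[1,4],\,d'\in[0,4],\,m\in[0,3]}d\cdot\delta_{\mathrm{dc}}(i,d,d',m)=\mathrm{deg}(\mathrm{tail}(i),1),\quad \sum_{d\in[0,4],\,d'\in[1,4],\,m\in[0,3]}d'\cdot\delta_{\mathrm{dc}}(i,d,d',m)=\mathrm{deg}(\mathrm{head}(i),1),\quad i\in[1,c^*], \tag{67}$$
$$\sum_{d,d'\in[0,4],\,m\in[0,3]}\delta_{\mathrm{dc}}(t,1,d,d',m)=1,\quad \sum_{d,d'\in[0,4],\,m\in[0,3]}m\cdot\delta_{\mathrm{dc}}(t,1,d,d',m)=\widetilde{\beta}(t,1),\quad \sum_{d\in[1,4],\,d'\in[0,4],\,m\in[0,3]}d\cdot\delta_{\mathrm{dc}}(t,1,d,d',m)=\mathrm{deg}(s^*+t-1,1),\quad \sum_{d\in[0,4],\,d'\in[1,4],\,m\in[0,3]}d'\cdot\delta_{\mathrm{dc}}(t,1,d,d',m)=\mathrm{deg}(s^*+t,1),\quad t\in[2,t^*], \tag{68}$$
$$\sum_{d,d'\in[0,4],\,m\in[0,3]}\delta_{\mathrm{dc}}(p,i,d,d',m)=1,\quad p\in[1,s^*+t^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{69}$$
$$\sum_{d,d'\in[0,4],\,m\in[0,3]}m\cdot\delta_{\mathrm{dc}}(s,i,d,d',m)=\widetilde{\beta}(s,i),\quad s\in[1,s^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{70}$$
$$\sum_{d,d'\in[0,4],\,m\in[0,3]}m\cdot\delta_{\mathrm{dc}}(s^*+t,i,d,d',m)=\widetilde{\beta}(s^*+t,i),\quad t\in[1,t^*],\ i\in[2,n_{\mathrm{tree}}^{T}], \tag{71}$$
$$\sum_{d\in[1,4],\,d'\in[0,4],\,m\in[0,3]}d\cdot\delta_{\mathrm{dc}}(p,i,d,d',m)=\mathrm{deg}(p,\mathrm{prt}(i)),\quad \sum_{d\in[0,4],\,d'\in[1,4],\,m\in[0,3]}d'\cdot\delta_{\mathrm{dc}}(p,i,d,d',m)=\mathrm{deg}(p,i),\quad p\in[1,s^*+t^*],\ i\in[2,n_{\mathrm{tree}}^{S}], \tag{72}$$
$$\sum_{d,d'\in[0,4],\,m\in[0,3]}\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)=1,\quad \sum_{d,d'\in[0,4],\,m\in[0,3]}m\cdot\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)=\widehat{\beta}(s,t),\quad \sum_{d\in[1,4],\,d'\in[0,4],\,m\in[0,3]}d\cdot\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)=\mathrm{deg}(s,1),\quad \sum_{d\in[0,4],\,d'\in[1,4],\,m\in[0,3]}d'\cdot\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)=\mathrm{deg}(s^*+t,1),\quad s\in[1,s^*],\ t\in[1,t^*], \tag{73}$$
$$\sum_{i\in[1,c^*]}\bigl(\delta_{\mathrm{dc}}(i,d,d',m)+\delta_{\mathrm{dc}}(i,d',d,m)\bigr)+\sum_{t\in[2,t^*]}\bigl(\delta_{\mathrm{dc}}(t,1,d,d',m)+\delta_{\mathrm{dc}}(t,1,d',d,m)\bigr)+\sum_{s\in[1,s^*],\,t\in[1,t^*]}\bigl(\delta_{\widehat{\mathrm{dc}}}(s,t,d,d',m)+\delta_{\widehat{\mathrm{dc}}}(s,t,d',d,m)\bigr)=\mathrm{bc}^{\mathrm{in}}(\mu),\quad \sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\bigl(\delta_{\mathrm{dc}}(p,i,d,d',m)+\delta_{\mathrm{dc}}(p,i,d',d,m)\bigr)=\mathrm{bc}^{\mathrm{ex}}(\mu),\quad \mu=(d,d',m)\in\mathrm{Bc},\ d<d', \tag{74}$$
$$\sum_{i\in[1,c^*]}\delta_{\mathrm{dc}}(i,d,d,m)+\sum_{t\in[2,t^*]}\delta_{\mathrm{dc}}(t,1,d,d,m)+\sum_{s\in[1,s^*],\,t\in[1,t^*]}\delta_{\widehat{\mathrm{dc}}}(s,t,d,d,m)=\mathrm{bc}^{\mathrm{in}}(\mu),\quad \sum_{p\in[1,s^*+t^*],\,i\in[2,n_{\mathrm{tree}}^{S}]}\delta_{\mathrm{dc}}(p,i,d,d,m)=\mathrm{bc}^{\mathrm{ex}}(\mu),\quad \mu=(d,d,m)\in\mathrm{Bc}. \tag{75}$$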

Appendix D: Descriptions of new graph search algorithms

Multi-rooted trees and frequency vectors

For a finite set $A$ of elements, let $\mathbb{Z}_+^{A}$ denote the set of functions $w:A\to\mathbb{Z}_+$. A function $w\in\mathbb{Z}_+^{A}$ is called a non-negative integer vector (or a vector) on $A$ and the value $w(a)$ for an element $a\in A$ is called the entry of $w$ for $a\in A$. For a vector $w\in\mathbb{Z}_+^{A}$ and an element $a\in A$, let $w+1_a$ (resp., $w-1_a$) denote the vector $w'$ such that $w'(a)=w(a)+1$ (resp., $w'(a)=w(a)-1$) and $w'(b)=w(b)$ for the other elements $b\in A\setminus\{a\}$. For a vector $w\in\mathbb{Z}_+^{A}$ and a subset $B\subseteq A$, let $w_{[B]}$ denote the projection of $w$ to $B$; i.e., $w_{[B]}\in\mathbb{Z}_+^{B}$ such that $w_{[B]}(b)=w(b)$, $b\in B$.
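The three vector operations defined above (adding one to an entry, subtracting one from an entry, and projecting onto a subset) can be sketched with plain dictionaries. This is a minimal illustration, not the data structure used by the authors; the function names are hypothetical, and every element of $A$ is assumed to be present as a key.

```python
def plus_one(w, a):
    """Return the vector w + 1_a without modifying w."""
    out = dict(w)
    out[a] = out[a] + 1
    return out


def minus_one(w, a):
    """Return the vector w - 1_a; entries must stay non-negative."""
    if w[a] == 0:
        raise ValueError("entry would become negative")
    out = dict(w)
    out[a] = out[a] - 1
    return out


def project(w, B):
    """Return the projection of w onto the subset B of its index set."""
    return {b: w[b] for b in B}


if __name__ == "__main__":
    w = {"C": 2, "O": 1, "N": 0}
    print(plus_one(w, "N"))
    print(minus_one(w, "C"))
    print(project(w, {"C", "O"}))
```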

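Let $\mathrm{Bc}$ denote the set of tuples $\mu=(d_1,d_2,k)\in[1,4]\times[1,4]\times[1,3]$ (bond-configuration) such that $\max\{d_1,d_2\}+k\le 4$. For two tuples $\mu=(d_1,d_2,k),\mu'=(d'_1,d'_2,k')\in\mathrm{Bc}$, we write $\mu\ge\mu'$ if

  • $\max\{d_1,d_2\}\ge\max\{d'_1,d'_2\}$, $\min\{d_1,d_2\}\ge\min\{d'_1,d'_2\}$ and $k\ge k'$,

and write $\mu>\mu'$ if

  • $\mu\ge\mu'$ and $\mu\ne\mu'$.

Let $\mathrm{Dg}=\{\mathrm{dg}_1,\mathrm{dg}_2,\mathrm{dg}_3,\mathrm{dg}_4\}$, where $\mathrm{dg}_i$ denotes the number of vertices with degree $i$.

Henceforth we deal with vectors $w$ that have their $w_{\mathrm{in}}$ and $w_{\mathrm{ex}}$ components, both $w_{\mathrm{in}},w_{\mathrm{ex}}\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$, and for convenience we write $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ in the sense of concatenation.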

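For a vector $x=(x_{\mathrm{in}},x_{\mathrm{ex}})$ with $x_{\mathrm{in}},x_{\mathrm{ex}}\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$, let $\mathcal{G}(x)$ denote the set of chemical acyclic graphs $G$ whose 2-internal (resp., 2-external) vertices/edges are determined by the vector $x_{\mathrm{in}}$ (resp., $x_{\mathrm{ex}}$); i.e., $G$ satisfies the following:

  • $\mathrm{ce}_a^{\mathrm{in}}(G)=x_{\mathrm{in}}(a)$ and $\mathrm{ce}_a^{\mathrm{ex}}(G)=x_{\mathrm{ex}}(a)$ for each chemical element $a\in\Lambda$,

  • $\mathrm{ac}_\gamma^{\mathrm{in}}(G)=x_{\mathrm{in}}(\gamma)$ and $\mathrm{ac}_\gamma^{\mathrm{ex}}(G)=x_{\mathrm{ex}}(\gamma)$ for each adjacency-configuration $\gamma\in\Gamma$,

  • $\mathrm{bc}_\mu^{\mathrm{in}}(G)=x_{\mathrm{in}}(\mu)$ and $\mathrm{bc}_\mu^{\mathrm{ex}}(G)=x_{\mathrm{ex}}(\mu)$ for each bond-configuration $\mu\in\mathrm{Bc}$,

  • $\mathrm{dg}_i^{\mathrm{in}}(G)=x_{\mathrm{in}}(\mathrm{dg}_i)$ and $\mathrm{dg}_i^{\mathrm{ex}}(G)=x_{\mathrm{ex}}(\mathrm{dg}_i)$ for each degree $\mathrm{dg}_i\in\mathrm{Dg}$.

Throughout the section, let $k=2$ be a branch-parameter, $x=(x_{\mathrm{in}},x_{\mathrm{ex}})$ be a given feature vector with $x_{\mathrm{in}},x_{\mathrm{ex}}\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$, and $\mathrm{dia}$ be an integer. We infer a chemical acyclic graph $G\in\mathcal{G}(x)$ such that $\mathrm{bl}_2(G)\in[2,3]$ and the diameter of $G$ is $\mathrm{dia}$, where $n=\sum_{a\in\Lambda}(x_{\mathrm{in}}(a)+x_{\mathrm{ex}}(a))$. Note that any other descriptors of $G\in\mathcal{G}(x)$ can be determined by the entries of vector $x$.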

To infer a chemical acyclic graph $G\in\mathcal{G}(x)$, we consider a connected subgraph $T$ of $G$ that consists of

  - a subtree of the 2-branch-subtree $G'$ of $G$ and
  - the 2-fringe-trees rooted at vertices in $G'$. (76)

Our method first generates a set $\mathcal{FT}$ of all possible rooted trees $T$ that can be a 2-fringe-tree of a chemical graph $G\in\mathcal{G}(x)$, and then extends the trees $T$ by repeatedly appending a tree in $\mathcal{FT}$ until a chemical graph $G\in\mathcal{G}(x)$ is formed. In the extension, we actually manipulate the “frequency vectors” of trees defined below.

To specify which part of a given tree $T$ plays the role of 2-internal vertices/edges or 2-external vertices/edges in a chemical graph $G\in\mathcal{G}(x)$ to be inferred, we designate at most three vertices $r_1(T)$, $r_2(T)$, and $r_3(T)$ in $T$ as terminals, and call $T$ rooted (resp., bi-rooted and tri-rooted) if the number of terminals is one (resp., two and three). For a rooted tree (resp., bi- or tri-rooted tree) $T$, let $\widetilde{V}_{\mathrm{in}}$ denote the set of vertices contained in a path between two terminals of $T$, $\widetilde{E}_{\mathrm{in}}$ denote the set of edges in $T$ between two vertices in $\widetilde{V}_{\mathrm{in}}$, and define $\widetilde{V}_{\mathrm{ex}}\triangleq V(T)\setminus\widetilde{V}_{\mathrm{in}}$ and $\widetilde{E}_{\mathrm{ex}}\triangleq E(T)\setminus\widetilde{E}_{\mathrm{in}}$. For a bi- or tri-rooted tree $T$, define the backbone path $P_T$ of $T$ to be the path of $T$ between vertices $r_1(T)$ and $r_2(T)$.

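Given a chemical acyclic graph $T$, define $f_t(T)$, $t\in\{\mathrm{in},\mathrm{ex}\}$, to be the vector $w\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$ that consists of the following entries:

  • $w(a)=|\{v\in\widetilde{V}_t\mid\alpha(v)=a\}|$, $a\in\Lambda$,

  • $w(\gamma)=|\{uv\in\widetilde{E}_t\mid\{\alpha(u),\alpha(v)\}=\{a,b\},\ \beta(uv)=q\}|$, $\gamma=(a,b,q)\in\Gamma$,

  • $w(\mu)=|\{uv\in\widetilde{E}_t\mid\{\deg_T(u),\deg_T(v)\}=\{d,d'\},\ \beta(uv)=m\}|$, $\mu=(d,d',m)\in\mathrm{Bc}$,

  • $w(\mathrm{dg}_i)=|\{v\in\widetilde{V}_t\mid\deg_T(v)=i\}|$, $\mathrm{dg}_i\in\mathrm{Dg}$.

Define $f(T)\triangleq(f_{\mathrm{in}}(T),f_{\mathrm{ex}}(T))$. The entry for an element $e\in\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}$ in $f_t(T)$, $t\in\{\mathrm{in},\mathrm{ex}\}$, is denoted by $f_t(e;T)$. For a subset $B$ of $\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}$, let $f_{t[B]}(T)$ denote the projection of $f_t(T)$ onto $B$.

Our aim is to generate all chemical bi-rooted (resp., tri-rooted) trees $T$ with diameter $\mathrm{dia}$ such that $f(T)=x$.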

A new algorithm for computing chemical bi-rooted trees G with bl2(G)=2

This section describes a sketch of our new graph search algorithm for the case of bl2(G)=2. See Appendix “A sketch of algorithm for computing chemical tri-rooted trees G with bl2(G)=3” for a sketch of a new algorithm for the case of bl2(G)=3.

We call a chemical graph $G\in\mathcal{G}(x)$ with diameter $\mathrm{dia}$ and $\mathrm{bl}_2(G)=2$ a target graph.

A chemical acyclic graph $G$ with $\mathrm{bl}_2(G)=2$ has exactly two leaf 2-branches $v_i$, $i=1,2$, where the length of the path between the two leaf 2-branches $v_1$ and $v_2$ of a target graph $G$ is $\mathrm{dia}-2k=\mathrm{dia}-4$. We observe that a connected subgraph $T$ of a target graph $G$ that satisfies (76) for $\mathrm{bl}_2(G)=2$ is a chemical rooted or bi-rooted tree with roots $u$ and $v$, where possibly $u=v$. We call such a subgraph $T$ an internal-subtree (resp., end-subtree) of $G$ if neither (resp., one) of $u$ and $v$ is a 2-branch in $G$. When $u=v$, we call an internal-subtree (resp., end-subtree) $T$ of $G$ an internal-fringe-tree (resp., end-fringe-tree) of $G$. Figure 10a–d illustrates an internal-subtree, an internal-fringe-tree, an end-subtree and an end-fringe-tree of $G$.

Fig. 10. An illustration of subtrees $T$ of a chemical acyclic graph $G$ in Fig. 6a, where the vertices/edges in $T$ are depicted by solid lines: a An internal-subtree $T$ of $G$; b An internal-fringe-tree $T$ of $G$; c An end-subtree $T$ of $G$; d An end-fringe-tree $T$ of $G$

Let $\delta_1=\lfloor(\mathrm{dia}-5)/2\rfloor$ and $\delta_2=\mathrm{dia}-5-\delta_1=\lceil(\mathrm{dia}-5)/2\rceil$. We regard a target graph $G\in\mathcal{G}(x)$ with $\mathrm{bl}_2(G)=2$ and diameter $\mathrm{dia}$ as a combination of two chemical bi-rooted trees $T_1$ and $T_2$ with $\ell(P_{T_i})=\delta_i$, $i=1,2$, joined by an edge $e=r_1(T_1)r_1(T_2)$, as illustrated in Fig. 11.

Fig. 11. An illustration of combining two bi-rooted trees $T_1=T_{w^1}$ and $T_2=T_{w^2}$ with a new edge with multiplicity $m$ joining vertices $r_1(T_1)$ and $r_1(T_2)$ to construct a target graph $G$, where $a_i\in\Lambda$, $d_i\in[1,d_{\max}-1]$, $m_i\in[d_i,\mathrm{val}(a_i)-1]$, $i=1,2$, and $m\in[1,\min\{3,\mathrm{val}(a_1)-m_1,\mathrm{val}(a_2)-m_2\}]$

We start with generating chemical rooted trees and then iteratively extend chemical bi-rooted trees $T$ with $\ell(P_T)=1,2,\ldots,\delta_1$, before we finally combine two chemical bi-rooted trees $T_1$ and $T_2$ with $\ell(P_{T_i})=\delta_i$. To describe our algorithm, we introduce some notation.

  • Let $\mathcal{T}(x)$ denote the set of all bi-rooted trees $T$ (where possibly $r_1(T)=r_2(T)$) such that $f_{\mathrm{in}}(T)\le x_{\mathrm{in}}$ and $f_{\mathrm{ex}}(T)\le x_{\mathrm{ex}}$, which is a necessary condition for $T$ to be an internal-subtree or end-subtree of a target graph $G\in\mathcal{G}(x)$.

  • Let $\mathcal{FT}$ denote the set of all rooted trees $T\in\mathcal{T}(x)$ that can be a 2-fringe-tree of a target graph $G$, where $T$ satisfies the size constraint (1) of 2-fringe-trees.

  • For each integer $h\in[1,\mathrm{dia}-4]$, let $\mathcal{T}_{\mathrm{end}}^{(h)}$ denote the set of all bi-rooted trees $T\in\mathcal{T}(x)$ that can be an end-subtree of a target graph $G$ such that $\ell(P_T)=h$, and each 2-fringe-tree $T_v$ rooted at a vertex $v$ in $P_T$ belongs to $\mathcal{FT}$.

The idea of our new algorithm is to compute only the set $W_{\mathrm{end}}^{(h)}$ of frequency vectors $w$ of end-subtrees, whose size $|W_{\mathrm{end}}^{(h)}|$ is much smaller than that of $\mathcal{T}_{\mathrm{end}}^{(h)}$. We compute the set $W_{\mathrm{end}}^{(h)}$ of frequency vectors $w$ of trees in $\mathcal{T}_{\mathrm{end}}^{(h)}$ iteratively for each integer $h\ge 0$. During the computation, we keep a sample tree $T_w$ for each frequency vector $w$ so that a final step can construct some number of target graphs $G$ by assembling these sample trees. Based on this, we generate target graphs $G\in\mathcal{G}(x)$ by the following steps:

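  1. Execute the next:
    • (i)
      Compute $\mathcal{FT}$ by a branch-and-bound procedure that generates all possible rooted trees $T\in\mathcal{T}(x)$ (where $r_1(T)=r_2(T)$) that can be a 2-fringe-tree of a target graph $G\in\mathcal{G}(x)$;
    • (ii)
      Compute the set $W^{(0)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some tree $T\in\mathcal{FT}$, and let $W_{\mathrm{end}}^{(0)}\subseteq W^{(0)}$ be the set of vectors of those trees with height exactly 2;
    • (iii)
      For each vector $w=(w_{\mathrm{in}},w_{\mathrm{ex}})\in W^{(0)}$, choose a sample tree $T_w\in\mathcal{FT}$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T_w)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T_w)$, and store these sample trees;
  2. For each integer $h=1,2,\ldots,\delta_2$, iteratively execute the next:
    • (i)
      Compute the set $W_{\mathrm{end}}^{(h)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some bi-rooted tree $T\in\mathcal{T}_{\mathrm{end}}^{(h)}$, where such a vector $w$ is obtained from a combination of vectors $w'\in W^{(0)}$ and $w''\in W_{\mathrm{end}}^{(h-1)}$;
    • (ii)
      For each vector $w\in W_{\mathrm{end}}^{(h)}$, store a sample tree $T_w$, which is obtained from a combination of sample trees $T_{w'}$ with $w'\in W^{(0)}$ and $T_{w''}$ with $w''\in W_{\mathrm{end}}^{(h-1)}$;
  3. We call a pair of vectors $w^1\in W_{\mathrm{end}}^{(\delta_1)}$ and $w^2\in W_{\mathrm{end}}^{(\delta_2)}$ feasible if it admits a target graph $G\in\mathcal{G}(x)$ such that $w_{\mathrm{in}}^1+w_{\mathrm{in}}^2\le x_{\mathrm{in}}$ and $w_{\mathrm{ex}}^1+w_{\mathrm{ex}}^2\le x_{\mathrm{ex}}$. Find the set $W_{\mathrm{pair}}$ of all feasible pairs of vectors $w^1$ and $w^2$;

  4. For each feasible vector pair $(w^1,w^2)\in W_{\mathrm{pair}}$, construct a corresponding target graph $G$ by combining the corresponding sample trees $T_{w^1}$ and $T_{w^2}$, as illustrated in Fig. 11.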

Detailed descriptions of the five steps in the above algorithm can be found in Appendix “Case of two leaf 2-branches”.
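The pair-finding step can be pictured as a filter over two sets of frequency vectors. The sketch below checks only the componentwise necessary condition stated in the step (the summed internal and external vectors must not exceed the target vector); the additional check that the joining edge completes a target graph is omitted. The representation of vectors as pairs of dictionaries, the sample data and all function names are hypothetical.

```python
def vec_add(u, v):
    """Componentwise sum of two sparse non-negative integer vectors."""
    return {key: u.get(key, 0) + v.get(key, 0) for key in set(u) | set(v)}


def vec_leq(u, x):
    """True if u(e) <= x(e) for every entry e (missing entries count as 0)."""
    return all(value <= x.get(key, 0) for key, value in u.items())


def candidate_pairs(W1, W2, x_in, x_ex):
    """Pairs (w1, w2), each w = (w_in, w_ex), whose sums fit under (x_in, x_ex).
    This is only a necessary condition for a pair to be feasible."""
    out = []
    for w1_in, w1_ex in W1:
        for w2_in, w2_ex in W2:
            if (vec_leq(vec_add(w1_in, w2_in), x_in)
                    and vec_leq(vec_add(w1_ex, w2_ex), x_ex)):
                out.append(((w1_in, w1_ex), (w2_in, w2_ex)))
    return out


if __name__ == "__main__":
    x_in, x_ex = {"C": 3}, {"C": 2, "O": 1}
    W1 = [({"C": 1}, {"C": 1}), ({"C": 3}, {})]
    W2 = [({"C": 2}, {"O": 1})]
    print(candidate_pairs(W1, W2, x_in, x_ex))
```

A quadratic scan like this one becomes expensive when the vector sets are large, which is one reason to cap their sizes.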

For a relatively large instance with $n\ge 40$ and $\mathrm{dia}\ge 20$, the number $|W_{\mathrm{pair}}|$ of feasible vector pairs in Step 3 is still very large. In fact, the size $|W_{\mathrm{end}}^{(h)}|$ of a vector set $W_{\mathrm{end}}^{(h)}$ to be computed in Step 2 can also be considerably large during an execution of the algorithm. For such a case, we impose a time limit on the running time for computing $W_{\mathrm{end}}^{(h)}$ and a memory limit on the number of vectors stored in a vector set $W_{\mathrm{end}}^{(h)}$. With these limits, we can compute only a limited subset $\widehat{W}_{\mathrm{end}}^{(h)}$ of each vector set $W_{\mathrm{end}}^{(h)}$ in Step 2. Even with such a subset $\widehat{W}_{\mathrm{end}}^{(h)}$, we can still find a large subset $\widehat{W}_{\mathrm{pair}}$ of $W_{\mathrm{pair}}$ in Step 3.

Our algorithm also delivers a lower bound on the number of all target graphs $G\in\mathcal{G}(x)$ in the following way. In Step 1, we also compute the number $t(w)$ of trees $T\in\mathcal{FT}$ such that $w=f(T)$ for each $w\in W^{(0)}$. In Step 2, when a vector $w$ is constructed from two vectors $w'$ and $w''$, we iteratively compute the number $t(w)$ of trees $T$ such that $w=f(T)$ by $t(w):=t(w')\times t(w'')$. In Step 3, when a feasible vector pair $(w^1,w^2)\in W_{\mathrm{pair}}$ is obtained, we know that the number of the corresponding target graphs $G$ is $t(w^1)\times t(w^2)$. Possibly we compute only a subset $\widehat{W}_{\mathrm{pair}}$ of $W_{\mathrm{pair}}$ in Step 3. Then $(1/2)\sum_{(w^1,w^2)\in\widehat{W}_{\mathrm{pair}}}t(w^1)\times t(w^2)$ gives a lower bound on the number of target graphs $G\in\mathcal{G}(x)$, where we divide by 2 since an axially symmetric target graph $G$ can correspond to two vector pairs in $W_{\mathrm{pair}}$.
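The counting argument above amounts to a sum of products over the stored vector pairs, halved. A minimal sketch follows, in which vectors are stood in for by hypothetical string labels, the tree counts $t(w)$ are made-up numbers, and the function name is an assumption.

```python
from fractions import Fraction


def lower_bound(pairs, t):
    """Half of the sum of t(w1) * t(w2) over the stored pairs (w1, w2).
    pairs: iterable of (label1, label2); t: dict label -> number of trees."""
    total = sum(t[w1] * t[w2] for w1, w2 in pairs)
    return Fraction(total, 2)


if __name__ == "__main__":
    # Hypothetical counts of trees per frequency vector.
    t = {"w_a": 2, "w_b": 3}
    # The same unordered pair may be found in both orders.
    print(lower_bound([("w_a", "w_b"), ("w_b", "w_a")], t))
```

Exact fractions are used so that an odd total is not silently rounded.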

A sketch of algorithm for computing chemical tri-rooted trees G with bl2(G)=3

We call a chemical graph GG(x) with diameter dia and bl2(G)=3 a target graph. Let ninlaΛxin(a), which is the number of 2-internal vertices in a target graph GG(x).

A chemical acyclic graph G with bl2(G)=3 has exactly three leaf 2-branches vi, i=1,2,3, and exactly one 2-internal vertex v4 adjacent to three 2-internal vertices vi, i=1,2,3, as illustrated in Fig. 6(b). We call vertex v4 the joint-vertex of G. Without loss of generality assume that the length of the path Pv1,v2 between v1 and v2 is dia-4 and that the length of the path Pv1,v1 is not smaller than that of Pv2,v2.

Analogously with the case of bl2(G)=2, we define internal-subtree (resp., end-subtree, internal-fringe-tree, and end-fringe-tree) of G, to be a connected subgraph G that satisfies (76). Observe that G can be partitioned into three end-subtrees Ti, i=1,2,3, the 2-fringe-tree T4 rooted at the joint-vertex v4 and three edges viv4, i=1,2,3, where the backbone path PTi connects leaf 2-branch vi and vertex vi. In particular, we call the end-subtree of G that consists of T1, T2, T4, and edges viv4, i=1,2, the main-subtree of G, which consists of the path Pv1,v2 and all the 2-fringe-trees rooted at vertices in Pv1,v2. We call T3 the co-subtree of G.

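Let $\delta_i$, $i=1,2,3$, denote the length of the backbone path of $T_i$. Note that

  • $\delta_1+\delta_2+2=\mathrm{dia}-4$ and $\delta_1\ge\delta_2\ge\delta_3=n_{\mathrm{inl}}-\mathrm{dia}+2$,

from which it follows that

  • $\delta_2\in[\delta_3,\lfloor\mathrm{dia}/2\rfloor-3]$ and $\delta_1\in[\lceil\mathrm{dia}/2\rceil-3,\mathrm{dia}-6-\delta_3]$.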

We regard a target graph $G\in\mathcal{G}(x)$ with $\mathrm{bl}_2(G)=3$ and diameter $\mathrm{dia}$ as a combination of the main-subtree and the co-subtree joined with an edge. We represent the co-subtree as a chemical bi-rooted tree $T$ with $\ell(P_T)=\delta_3$. We represent the main-subtree of a target graph $G$ as a tri-rooted tree $T$ with $\ell(P_T)=\mathrm{dia}-4$ so that terminals $r_1(T)$, $r_2(T)$, and $r_3(T)$ correspond to the two leaf 2-branches and the joint-vertex of $G$, respectively.

We start by generating chemical rooted trees and then iteratively extend chemical bi-rooted trees $T$ with $\ell(P_T)=1,2,\ldots,\mathrm{dia}-6-\delta_3$, before we combine two chemical bi-rooted trees $T'$ and $T''$ to obtain a chemical tri-rooted tree $T_1$ with $\ell(P_{T_1})=\delta_i$, and finally combine a chemical tri-rooted tree $T_1$ and a chemical bi-rooted tree $T_2$ with $\ell(P_{T_2})=\delta_3$ to obtain a target graph $G\in\mathcal{G}(x)$.

Analogously with the case of $\mathrm{bl}_2(G)=2$, we define the set $\mathcal{T}(x)$ of all bi-rooted trees $T$, the set $\mathcal{FT}$ of all rooted trees $T\in\mathcal{T}(x)$ that can be a 2-fringe-tree of a target graph $G$, and the set $\mathcal{T}_{\mathrm{end}}^{(h)}$, $h\in[1,\mathrm{dia}-6-\delta_3]$, of all bi-rooted trees $T\in\mathcal{T}(x)$ that can be an end-subtree of a target graph $G$ such that $\ell(P_T)=h$.

We generate target graphs GG(x) by the following steps:

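  1. Analogously with Step 1 for the case of $\mathrm{bl}_2(G)=2$, compute the set $\mathcal{FT}$ by a branch-and-bound algorithm as described in "Step 1: Enumeration of 2-fringe-trees" section, and the set $W^{(0)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some tree $T\in\mathcal{FT}$. For each vector $w\in W^{(0)}$, store a sample tree $T_w\in\mathcal{FT}$, and let $W_{\mathrm{end}}^{(0)}\subseteq W^{(0)}$ be the set of feature vectors of possible end-trees with height 2;

  2. For each integer $h=1,2,\ldots,\mathrm{dia}-6-\delta_3$, compute the set $W_{\mathrm{end}}^{(h)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some bi-rooted tree $T\in\mathcal{T}_{\mathrm{end}}^{(h)}$. For each vector $w\in W_{\mathrm{end}}^{(h)}$, store a sample tree $T_w$;

  3. For each integer $h\in[\lceil\mathrm{dia}/2\rceil-2,\mathrm{dia}-5-\delta_3]$, compute the set $W_{\mathrm{end}+2}^{(h)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some bi-rooted tree $T$ with $\ell(P_T)=h$ that represents an end-subtree rooted at the joint-vertex. For each vector $w\in W_{\mathrm{end}+2}^{(h)}$, store a sample tree $T_w$;

  4. For each integer $\delta_1\in[\lceil\mathrm{dia}/2\rceil-3,\mathrm{dia}-6-\delta_3]$, compute the set $W_{\mathrm{main}}^{(\delta_1+1)}$ of all vectors $w=(w_{\mathrm{in}},w_{\mathrm{ex}})$ such that $w_{\mathrm{in}}=f_{\mathrm{in}}(T)$ and $w_{\mathrm{ex}}=f_{\mathrm{ex}}(T)$ for some tri-rooted tree $T$ that represents the main-subtree such that the length of the path $P_{r_2(T),r_3(T)}$ between terminals $r_2(T)$ and $r_3(T)$ is $\delta_1+1$. For each vector $w\in W_{\mathrm{main}}^{(\delta_1+1)}$, store a sample tree $T_w$;

  5. We call a pair of vectors $w^1\in W_{\mathrm{main}}^{(\delta_1+1)}$ and $w^2\in W_{\mathrm{end}}^{(\delta_3)}$ feasible if it admits a target graph $G\in\mathcal{G}(x)$ such that $w^1_{\mathrm{in}}+w^2_{\mathrm{in}}\le x_{\mathrm{in}}$ and $w^1_{\mathrm{ex}}+w^2_{\mathrm{ex}}\le x_{\mathrm{ex}}$. Find the set $W_{\mathrm{pair}}$ of all feasible pairs of vectors $w^1$ and $w^2$;

  6. For each feasible vector pair $(w^1,w^2)\in W_{\mathrm{pair}}$, construct a corresponding target graph $G$ by combining the sample trees $T_{w^1}$ and $T_{w^2}$, which correspond to the main-subtree and the co-subtree of a target graph $G$, respectively, as illustrated in Fig. 12.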

Fig. 12. An illustration of combining a tri-rooted tree $T_1=T_{w^1}$ and a bi-rooted tree $T_2=T_{w^2}$ with a new edge joining vertices $r_3(T_1)$ and $r_1(T_2)$ to construct a target graph $G$

Detailed descriptions of the six steps in the above algorithm can be found in Appendix “Case of three leaf 2-branches”.

Frequency vectors of fictitious trees

Let $T$ be a chemical bi-rooted or tri-rooted tree, where we regard a rooted tree $T$ as a bi-rooted tree with $r_1(T)=r_2(T)$ for notational convenience. Recall that our algorithm generates a target graph $G\in\mathcal{G}(x)$ as a supergraph of $T$, where one of the terminals $r_1(T)$ and $r_2(T)$ can be a 2-branch of $G$. In such a case, we assume in our algorithms that the second terminal $r_2(T)$ is the one that becomes a 2-branch of $G$.

For an integer $p\in[1,3]$, let $T[+p]$ denote a fictitious chemical graph obtained from $T$ by regarding the degree of terminal $r_1(T)$ as $\deg_T(r_1(T))+p$. Figure 13 (resp., Fig. 14a) illustrates fictitious trees $T[+p]$ in the case of $r_1(T)=r_2(T)$ (resp., $r_1(T)\ne r_2(T)$). The frequency vectors $f_{\mathrm{in}}(T[+p])$ and $f_{\mathrm{ex}}(T[+p])$ are obtained as follows: Let $d=\deg_T(r_1(T))$, let $v_i$, $i\in[1,d]$, denote the neighbors of $r_1(T)$, and let $d_i=\deg_T(v_i)$, $m_i=\beta(r_1(T)v_i)$, $\mu_i=(d,d_i,m_i)$, and $\mu'_i=(d+p,d_i,m_i)$, $i\in[1,d]$.

Fig. 13. An illustration of fictitious rooted trees $T[+p]$, $p\in[1,3]$, for rooted trees $T$ with $r=r_1(T)=r_2(T)$ and $d=\deg_T(r)$, where a dashed line depicts a fictitious edge incident to the terminal $r_1(T)=r_2(T)$: (a) $T[+1]$ and $d=1$; (b) $T[+1]$ and $d=2$; (c) $T[+1]$ and $d=3$; (d) $T[+2]$ and $d=0$; (e) $T[+2]$ and $d=1$; (f) $T[+2]$ and $d=2$; (g) $T[+3]$ and $d=0$; (h) $T[+3]$ and $d=1$

Fig. 14. An illustration of fictitious trees $T[+q]$ and $T\langle{+1}\rangle$ for bi-rooted trees and tri-rooted trees $T$: (a) $T[+q]$ of a bi-rooted tree $T$; (b) $T\langle{+1}\rangle$ of a tri-rooted tree $T$

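For $r_1(T)=r_2(T)$ and $d'=d+p$,

  • $f_{\mathrm{in}}(T[+p])=f_{\mathrm{in}}(T)+1_{\mathrm{dg}d'}-1_{\mathrm{dg}d}$,   $f_{\mathrm{ex}}(T[+p])=f_{\mathrm{ex}}(T)+\sum_{1\le i\le d}(1_{\mu'_i}-1_{\mu_i})$.

For $r_1(T)\ne r_2(T)$ and $d'=d+p$, where $v_d$ denotes the vertex in $P_T$,

  • $f_{\mathrm{in}}(T[+1])=f_{\mathrm{in}}(T)+1_{\mathrm{dg}d'}-1_{\mathrm{dg}d}+1_{\mu'_d}-1_{\mu_d}$,

  • $f_{\mathrm{ex}}(T[+1])=f_{\mathrm{ex}}(T)+\sum_{1\le i\le d-1}(1_{\mu'_i}-1_{\mu_i})$.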
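The update for the rooted case $r_1(T)=r_2(T)$ can be sketched as follows. The key encoding of vector entries (`("dg", d)` for degrees, `("bc", d, d_i, m_i)` for bond-configurations) and the function name are assumptions made only for illustration.

```python
from collections import Counter

def fictitious_rooted_vectors(f_in, f_ex, d, p, children):
    """Return (f_in(T[+p]), f_ex(T[+p])) for a rooted tree T.

    d is the degree of the root in T and children lists (d_i, m_i) for each
    neighbor v_i of the root: its degree and the bond multiplicity to it.
    """
    d_new = d + p
    g_in, g_ex = Counter(f_in), Counter(f_ex)
    # The root is counted with degree d + p instead of d.
    g_in[("dg", d_new)] += 1
    g_in[("dg", d)] -= 1
    # Every root-to-child edge changes its bond-configuration accordingly.
    for d_i, m_i in children:
        g_ex[("bc", d_new, d_i, m_i)] += 1
        g_ex[("bc", d, d_i, m_i)] -= 1
    # Unary plus drops entries whose count fell to zero.
    return +g_in, +g_ex

# Toy tree: a carbon root with a single oxygen child joined by a single bond.
f_in = {("atom", "C"): 1, ("dg", 1): 1}
f_ex = {("atom", "O"): 1, ("dg", 1): 1, ("ac", "C", "O", 1): 1, ("bc", 1, 1, 1): 1}
g_in, g_ex = fictitious_rooted_vectors(f_in, f_ex, d=1, p=1, children=[(1, 1)])
```

The bi-rooted and tri-rooted cases differ only in which incident edges are internal, so the same pattern applies with the backbone edges updated in the internal vector.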

Let $T$ be a chemical tri-rooted tree, where the third terminal $r_3(T)$ is in the backbone path $P_T$ between vertices $r_1(T)$ and $r_2(T)$. Let $T\langle{+1}\rangle$ denote a fictitious chemical graph obtained from $T$ by regarding the degree of terminal $r_3(T)$ as $\deg_T(r_3(T))+1$. Figure 14b illustrates a fictitious tri-rooted tree $T\langle{+1}\rangle$. The frequency vectors $f_{\mathrm{in}}(T\langle{+1}\rangle)$ and $f_{\mathrm{ex}}(T\langle{+1}\rangle)$ are obtained as follows: Let $d=\deg_T(r_3(T))$, and let $v_i$, $i\in[1,d]$, denote the neighbors of $r_3(T)$, where $v_{d-1}$ and $v_d$ are contained in the path $P_T$. For each index $i\in[1,d]$, let $d_i=\deg_T(v_i)$, $m_i=\beta(r_3(T)v_i)$, $\mu_i=(d,d_i,m_i)$, and $\mu'_i=(d+1,d_i,m_i)$.

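Then

$$f_{\mathrm{in}}(T\langle{+1}\rangle)=f_{\mathrm{in}}(T)+1_{\mathrm{dg}(d+1)}-1_{\mathrm{dg}d}+\sum_{i\in[d-1,d]}(1_{\mu'_i}-1_{\mu_i}),\qquad f_{\mathrm{ex}}(T\langle{+1}\rangle)=f_{\mathrm{ex}}(T)+\sum_{i\in[1,d-2]}(1_{\mu'_i}-1_{\mu_i}).\tag{77}$$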

Sets of frequency vectors

For an element $a\in\Lambda$ and integers $d\in[0,d_{\max}-2]$ and $m\in[d,\mathrm{val}(a)-1]$, let $W_{\mathrm{inl}}^{(0)}(a,d,m)$ (resp., $W_{\mathrm{inl}+3}^{(0)}(a,d,m)$) denote the set of frequency vectors $(f_{\mathrm{in}}(T[+2]),f_{\mathrm{ex}}(T[+2]))$ (resp., $(f_{\mathrm{in}}(T[+3]),f_{\mathrm{ex}}(T[+3]))$) of chemical rooted trees $T$ such that

  • $r_1(T)=r_2(T)$, the height of $T$ is at most 2,

  • $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, and $\beta(r_1(T))=m$.

Recall that $\beta(u)=\sum_{uv\in E}\beta(uv)$, defined in “Preliminary” section.

For an element $a\in\Lambda$ and integers $d\in[1,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, and $h\ge 0$, let $W_{\mathrm{end}}^{(h)}(a,d,m)$ (resp., $W_{\mathrm{end}+2}^{(h)}(a,d,m)$) denote the set of frequency vectors $(f_{\mathrm{in}}(T[+1]),f_{\mathrm{ex}}(T[+1]))$ (resp., $(f_{\mathrm{in}}(T[+2]),f_{\mathrm{ex}}(T[+2]))$) of chemical bi-rooted trees $T$ such that

  • $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, $\beta(r_1(T))=m$, $\ell(P_T)=h$ and

  • if $h=0$ then the height of the tree $T$ rooted at $r_2(T)$ is 2.

Case of two leaf 2-branches

Step 1: Enumeration of 2-fringe-trees

The main task of Step 1 is to compute, for each tuple $(a,d,m)$ of an element $a\in\Lambda$ and integers $d\in[1,d_{\max}-1]$ (resp., $d\in[0,d_{\max}-2]$) and $m\in[d,\mathrm{val}(a)-1]$ (resp., $m\in[d,\mathrm{val}(a)-2]$), the set $W_{\mathrm{end}}^{(0)}(a,d,m)$ (resp., $W_{\mathrm{inl}}^{(0)}(a,d,m)$) of all frequency vectors $f(T[+1])$ (resp., $f(T[+2])$) of chemical rooted trees $T$ such that $r_1(T)=r_2(T)$, $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$ and $\beta(r_1(T))=m$.

Step 1 first computes the set $\mathcal{FT}$ of all possible chemical rooted trees $T\in\mathcal{T}(x)$ (where $r_1(T)=r_2(T)$) that can be a 2-fringe-tree of a target graph $G\in\mathcal{G}(x)$. For this, we design a branch-and-bound procedure in which we append new vertices one by one to construct a rooted tree with only one child. To design a bounding procedure, we derive a property of the structure of chemical rooted trees that can be a 2-fringe-tree of a target graph.

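Let $G_0$ be a chemical rooted tree with a terminal $r_0=r_1(G_0)=r_2(G_0)$, where $f_{\mathrm{in}}(\alpha(r_0);G_0)=1$, $f_{\mathrm{in}}(a;G_0)=0$ for $a\in\Lambda\setminus\{\alpha(r_0)\}$, and $f_{\mathrm{in}}(\gamma;G_0)=0$ for $\gamma\in\Gamma$. For a vector $x=(x_{\mathrm{in}},x_{\mathrm{ex}})$ with $x_{\mathrm{in}},x_{\mathrm{ex}}\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$, we call $G_0$ $x$-extensible if some chemical acyclic graph $G\in\mathcal{G}(x)$ contains $G_0$ as a subgraph of a 2-fringe-tree $T$ rooted at $r_0$ in $G$.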

We use the next condition as a bounding procedure when we generate chemical rooted trees in Step 1.

Lemma 3

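For a branch-parameter $k=2$, let $x=(x_{\mathrm{in}},x_{\mathrm{ex}})$ be a vector with $x_{\mathrm{in}},x_{\mathrm{ex}}\in\mathbb{Z}_+^{\Lambda\cup\Gamma\cup\mathrm{Bc}\cup\mathrm{Dg}}$, and $G_0$ be a chemical rooted tree rooted at a vertex $r_0$ such that $f(G_0)\le x$.

  • (i)
    Graph $G_0$ is $x$-extensible only when the next holds for any subset $\Lambda'\subseteq\Lambda$:
    $$\sum_{a\in\Lambda'}(x_{\mathrm{ex}}(a)-f_{\mathrm{ex}}(a;G_0))\le\sum_{\gamma=(a,b,m)\in\Gamma:\,a\in\Lambda',\,b\in\Lambda\setminus\Lambda'}(x_{\mathrm{ex}}(\gamma)-f_{\mathrm{ex}}(\gamma;G_0))+2\sum_{\gamma=(a,b,m)\in\Gamma:\,a,b\in\Lambda'}(x_{\mathrm{ex}}(\gamma)-f_{\mathrm{ex}}(\gamma;G_0)).\tag{78}$$
  • (ii)
    Let $G_1$ denote the chemical rooted tree obtained from $G_0$ by appending a new atom with an element $b\in\Lambda$ to an atom with an element $a\in\Lambda$ in $G_0$ with a multiplicity $q$; i.e., we join an atom $a$ in $G_0$ and a new atom $b$ with an adjacency-configuration $(a,b,q)$. Then $G_1$ is $x$-extensible only when the next holds:
    • $x_{\mathrm{ex}}(a)-f_{\mathrm{ex}}(a;G_0)\le\mathrm{nb}(a)-1$
    for
    • $\mathrm{nb}(a)=\sum_{\gamma=(a,b,m)\in\Gamma:\,b\ne a\in\Lambda}(x_{\mathrm{ex}}(\gamma)-f_{\mathrm{ex}}(\gamma;G_0))+2\sum_{\gamma=(a,a,m)\in\Gamma}(x_{\mathrm{ex}}(\gamma)-f_{\mathrm{ex}}(\gamma;G_0))$.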
Proof
  • (i)

    Assume that $G_0$ is a subgraph of a 2-fringe-tree $T$ in some chemical graph $G\in\mathcal{G}(x)$ so that $T$ is rooted at $r_0$. The left-hand side is the number of the remaining 2-external vertices with elements in $\Lambda'$ in the 2-fringe-trees in $G$. Each such atom has a neighbor in the connected graph $G$. The right-hand side is an upper bound on the number of 2-external edges joining elements in $\Lambda'$ in the 2-fringe-trees in $G$.

  • (ii)

    Note that $f_{\mathrm{ex}}[\Lambda\cup\Gamma](G_1)=f_{\mathrm{ex}}[\Lambda\cup\Gamma](G_0)+1_b+1_\gamma$. For $\Lambda'=\{a\}$, the left-hand side in Eq. (78) is $x_{\mathrm{ex}}(a)-f_{\mathrm{ex}}(a;G_0)$, which remains unchanged if $a\ne b$ (resp., is reduced by 1 if $a=b$); and the right-hand side in (78) is $\mathrm{nb}(a)$, which is reduced by 1 if $a\ne b$ (resp., is reduced by 2 if $a=b$). That is, the right-hand side minus the left-hand side in (78) is always reduced by 1. This gives the required necessary condition for $G_1$ to be $x$-extensible.
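A minimal sketch of the test in Lemma 3(ii) follows. It assumes, purely for illustration, that atom entries are keyed by element strings, that adjacency-configurations are keyed by tuples `(a, b, m)`, and that every configuration involving element `a` is stored with `a` in the first position, following the formula for $\mathrm{nb}(a)$ literally. The function name is hypothetical.

```python
def lemma3_ii_allows(x_ex, f_ex, a):
    """Check x_ex(a) - f_ex(a; G0) <= nb(a) - 1 for the atom element a.

    x_ex and f_ex are dicts from keys to counts; missing keys count as 0.
    """
    def remaining(key):
        return x_ex.get(key, 0) - f_ex.get(key, 0)

    nb = 0
    for key in x_ex:
        # Adjacency-configurations are tuples (a, b, m); atoms are strings.
        if isinstance(key, tuple) and key[0] == a:
            weight = 2 if key[1] == a else 1
            nb += weight * remaining(key)
    return remaining(a) <= nb - 1

# Toy vectors: three external carbons and one external oxygen are required.
x_ex = {"C": 3, "O": 1, ("C", "C", 1): 2, ("C", "O", 1): 1}
f_ex = {"C": 1, ("C", "C", 1): 1}
ok = lemma3_ii_allows(x_ex, f_ex, "C")
```

A branch-and-bound enumeration would call such a test before each append and discard the branch when it fails; the check that the new atom and the new configuration are still available in $x$ is a separate test not shown here.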

Figure 15 illustrates all graph structures of rooted trees $T$ with height at most 2 and only one child satisfying the size constraint (1). For each element $a\in\Lambda$, we enumerate chemical trees $T\in\mathcal{T}(x)$ rooted at vertex $r$ with $\alpha(r)=a$ that have only one child by a branch-and-bound algorithm. Let $\mathcal{T}_a$ denote the set of resulting rooted trees for each root element $a\in\Lambda$.

Fig. 15. An illustration of rooted trees $T$ with height at most 2 and only one child satisfying the size constraint: (a) case of $n(T)=2$; (b) case of $n(T)=3$; (c) case of $n(T)=4$; (d) case of $n(T)=5$

We next enumerate chemical trees $T\in\mathcal{T}(x)$ rooted at vertex $r$ with $\alpha(r)=a$ that have two or three children by generating combinations of two or three graphs in $\mathcal{T}_a$. While generating graphs, our bounding procedure tests whether the current graph satisfies the necessary condition in Lemma 3(ii).

Finally, we compute the following sets:

  • for each element $a\in\Lambda$ and integers $d\in[1,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, the set $W_{\mathrm{end}}^{(0)}(a,d,m)$ of frequency vectors $f(T[+1])$ for rooted trees $T\in\mathcal{T}_a$ with $\deg_T(r)=d$ and height 2;

  • for each element $a\in\Lambda$ and integers $d\in[0,d_{\max}-2]$, $m\in[d,\mathrm{val}(a)-2]$, the set $W_{\mathrm{inl}}^{(0)}(a,d,m)$ of frequency vectors $f(T[+2])$ for rooted trees $T\in\mathcal{T}_a$ with $\deg_T(r)=d$ and height at most 2.

For each vector $w\in W_{\mathrm{end}}^{(0)}(a,d,m)$ (resp., $w\in W_{\mathrm{inl}}^{(0)}(a,d,m)$), we store a sample tree $T_w$.

We remark that the size of the set $\mathcal{FT}$ depends on the vector $x$. However, since the height of the trees is limited to 2, the degree is at most 3 or 4, and the trees must satisfy the size constraint (1) on fringe trees in "Our target graph class" section, the size of the set $\mathcal{FT}$ is fairly limited.

Step 2: Generation of frequency vectors of end-subtrees

The main task of Step 2 is to compute the following sets in the ascending order of $h=1,2,\ldots,\delta_2$:

For elements $a\in\Lambda$, integers $d\in[1,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, and $h\in[1,\delta_2]$, the sets $W_{\mathrm{end}}^{(h)}(a,d,m)$ of all frequency vectors $f(T[+1])$ of chemical bi-rooted trees $T\in\mathcal{T}(x)$ such that $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, $\beta(r_1(T))=m$ and $\ell(P_T)=h$.

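Observe that each vector $w=(w_{\mathrm{in}},w_{\mathrm{ex}})\in W_{\mathrm{end}}^{(h)}(a,d,m)$ is obtained from a combination of vectors $w'=(w'_{\mathrm{in}},w'_{\mathrm{ex}})\in W_{\mathrm{inl}}^{(0)}(a,d-1,m')$ and $w''=(w''_{\mathrm{in}},w''_{\mathrm{ex}})\in W_{\mathrm{end}}^{(h-1)}(b,d'',m'')$ such that

  • $m'\le\mathrm{val}(a)-2$, $1\le m-m'\le\mathrm{val}(b)-m''$,

  • $w_{\mathrm{in}}=w'_{\mathrm{in}}+w''_{\mathrm{in}}+1_\gamma+1_\mu\le x_{\mathrm{in}}$, $w_{\mathrm{ex}}=w'_{\mathrm{ex}}+w''_{\mathrm{ex}}\le x_{\mathrm{ex}}$

  • for $\gamma=(a,b,m-m')\in\Gamma$ and $\mu=(d+1,d''+1,m-m')\in\mathrm{Bc}$.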

Figure 16 illustrates this process of computing a vector wWend(h)(a,d,m).

Fig. 16. An illustration of appending a rooted tree $T'$ to a bi-rooted tree $T''$ to compute a vector $w\in W_{\mathrm{end}}^{(h)}(a,d,m)$ from the frequency vectors $w'=f(T'[+2])\in W_{\mathrm{inl}}^{(0)}(a,d-1,m')$ of a rooted tree $T'$ and $w''=f(T''[+1])\in W_{\mathrm{end}}^{(h-1)}(b,d'',m'')$ of a bi-rooted tree $T''$

For each vector $w\in W_{\mathrm{end}}^{(h)}(a,d,m)$ obtained from a combination of $w'\in W_{\mathrm{inl}}^{(0)}(a,d-1,m')$ and $w''\in W_{\mathrm{end}}^{(h-1)}(b,d'',m'')$, we construct a sample tree $T_w$ from their sample trees $T_{w'}$ and $T_{w''}$.
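One extension step of Step 2 can be sketched as follows, assuming (for illustration only) that each frequency vector is a pair of dictionaries and that the adjacency-configuration and bond-configuration of the new edge are passed as tagged tuples. The function name and key encoding are hypothetical.

```python
from collections import Counter

def combine_end_vectors(w1, w2, gamma, mu, x_in, x_ex):
    """Combine w' = w1 and w'' = w2 across a new internal edge (gamma, mu).

    Returns (w_in, w_ex) if both stay within x_in and x_ex, else None.
    """
    (in1, ex1), (in2, ex2) = w1, w2
    # The new edge contributes one adjacency- and one bond-configuration.
    w_in = Counter(in1) + Counter(in2) + Counter([gamma, mu])
    w_ex = Counter(ex1) + Counter(ex2)

    def within(w, x):
        return all(count <= x.get(key, 0) for key, count in w.items())

    if within(w_in, x_in) and within(w_ex, x_ex):
        return w_in, w_ex
    return None

# Toy data: two carbon terminals joined by a single bond.
gamma, mu = ("ac", "C", "C", 1), ("bc", 2, 2, 1)
w1 = ({("atom", "C"): 1, ("dg", 2): 1}, {("atom", "C"): 1})
w2 = ({("atom", "C"): 1, ("dg", 2): 1}, {("atom", "O"): 1})
x_in = {("atom", "C"): 2, ("dg", 2): 2, gamma: 1, mu: 1}
x_ex = {("atom", "C"): 1, ("atom", "O"): 1}
w = combine_end_vectors(w1, w2, gamma, mu, x_in, x_ex)
```

Discarding a combination as soon as it exceeds $x$ keeps the sets $W_{\mathrm{end}}^{(h)}$ from growing with vectors that can never lead to a target graph.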

Step 3: Enumeration of feasible vector pairs

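A feasible pair of vectors is defined to be a pair of vectors $w^i=(w^i_{\mathrm{in}},w^i_{\mathrm{ex}})\in W_{\mathrm{end}}^{(\delta_i)}(a_i,d_i,m_i)$, $a_i\in\Lambda$, $d_i\in[1,d_{\max}-1]$, $m_i\in[d_i,\mathrm{val}(a_i)-1]$, $i=1,2$, that admits an adjacency-configuration $\gamma=(a_1,a_2,m)\in\Gamma$ and a bond-configuration $\mu=(d_1+1,d_2+1,m)\in\mathrm{Bc}$ with an integer $m\in[1,\min\{3,\mathrm{val}(a_1)-m_1,\mathrm{val}(a_2)-m_2\}]$ such that

  • $x_{\mathrm{in}}=w^1_{\mathrm{in}}+w^2_{\mathrm{in}}+1_\gamma+1_\mu$ and $x_{\mathrm{ex}}=w^1_{\mathrm{ex}}+w^2_{\mathrm{ex}}$,

or equivalently $w^1$ is equal to the vector $(x_{\mathrm{in}}-w^2_{\mathrm{in}}-1_\gamma-1_\mu,\;x_{\mathrm{ex}}-w^2_{\mathrm{ex}})$, which we call the $(\gamma,\mu)$-complement of $w^2$ and denote by $\overline{w^2}$.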

The main task of Step 3 is to enumerate all feasible vector pairs $(w^1,w^2)$, $w^i\in W_{\mathrm{end}}^{(\delta_i)}(a_i,d_i,m_i)$ with $a_i\in\Lambda$, $d_i\in[1,d_{\max}-1]$, $m_i\in[d_i,\mathrm{val}(a_i)-1]$, $i=1,2$.

To efficiently search for a feasible pair of vectors in two sets $W_{\mathrm{end}}^{(\delta_i)}(a_i,d_i,m_i)$, $i=1,2$, we first compute the $(\gamma,\mu)$-complement vector $\overline{w}$ of each vector $w\in W_{\mathrm{end}}^{(\delta_2)}(a_2,d_2,m_2)$ for each pair of $\gamma=(a_1,a_2,m)\in\Gamma$ and $\mu=(d_1+1,d_2+1,m)\in\mathrm{Bc}$ with $m\in[1,\min\{3,\mathrm{val}(a_1)-m_1,\mathrm{val}(a_2)-m_2\}]$, and denote by $\overline{W_{\mathrm{end}}^{(\delta_2)}}$ the set of the resulting $(\gamma,\mu)$-complement vectors. Observe that $(w^1,w^2)$ is a feasible vector pair if and only if $w^1=\overline{w^2}$. To find such pairs, we merge the sets $W_{\mathrm{end}}^{(\delta_1)}(a_1,d_1,m_1)$ and $\overline{W_{\mathrm{end}}^{(\delta_2)}}$ into a sorted list $L_{\gamma,\mu}$. Then each feasible vector pair $(w^1,w^2)$ appears as a consecutive pair of vectors $w^1$ and $\overline{w^2}$ in the list $L_{\gamma,\mu}$.
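The complement-and-sort search can be sketched as follows for one fixed pair $(\gamma,\mu)$. The flat tuple encoding of vectors, the `offset` vector standing for $1_\gamma+1_\mu$ (zero on the external coordinates), and the function name are assumptions for illustration.

```python
from itertools import groupby

def feasible_pairs(W1, W2, x, offset):
    """Return all (w1, w2) with w1 equal to the complement x - w2 - offset."""
    def complement(w):
        return tuple(xi - wi - oi for xi, wi, oi in zip(x, w, offset))

    # Tag each entry with its side and original vector, then sort by key so
    # that equal keys from the two sets become adjacent in the merged list.
    merged = [(w, 0, w) for w in W1] + [(complement(w), 1, w) for w in W2]
    merged.sort()

    pairs = []
    for _, group in groupby(merged, key=lambda item: item[0]):
        group = list(group)
        firsts = [orig for _, side, orig in group if side == 0]
        seconds = [orig for _, side, orig in group if side == 1]
        pairs.extend((w1, w2) for w1 in firsts for w2 in seconds)
    return pairs

# Coordinates (toy): [internal C, gamma, mu, external C].
x = (2, 1, 1, 3)
offset = (0, 1, 1, 0)
W1 = [(1, 0, 0, 1), (1, 0, 0, 2)]
W2 = [(1, 0, 0, 2), (1, 0, 0, 1)]
found = feasible_pairs(W1, W2, x, offset)
```

Sorting costs $O(N\log N)$ vector comparisons for $N$ vectors in total, compared with testing all pairs of the two sets directly.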

Step 4: Construction of chemical graphs

The task of Step 4 is to construct, for each feasible vector pair $w^i\in W_{\mathrm{end}}^{(\delta_i)}(a_i,d_i,m_i)$, $i=1,2$, such that $w^1$ is equal to the $(\gamma=(a_1,a_2,m),\mu)$-complement vector $\overline{w^2}$ of $w^2$, a target graph $T_{(w^1,w^2)}\in\mathcal{G}(x)$ by combining the sample trees $T_i=T_{w^i}$ of vectors $w^i$ with an edge $e=r_1(T_1)r_1(T_2)$ such that $\beta(e)=m$. Figure 11 illustrates two sample trees $T_i$, $i=1,2$, to be combined with a new edge $e=r_1(T_1)r_1(T_2)$.
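The joining operation itself is simple once the two sample trees are available. The sketch below assumes a hypothetical tree representation (a list of atom elements, a dictionary of bond multiplicities keyed by vertex index pairs, and the index of the first terminal); none of these names come from the authors' code.

```python
def join_sample_trees(tree1, tree2, m):
    """Join two sample trees by a new edge of multiplicity m between r1's."""
    shift = len(tree1["atoms"])  # re-index the vertices of the second tree
    atoms = tree1["atoms"] + tree2["atoms"]
    bonds = dict(tree1["bonds"])
    for (u, v), mult in tree2["bonds"].items():
        bonds[(u + shift, v + shift)] = mult
    # The new edge e = r1(T1) r1(T2) with beta(e) = m.
    bonds[(tree1["r1"], tree2["r1"] + shift)] = m
    return {"atoms": atoms, "bonds": bonds}

# Toy sample trees: C-C with terminal 0, and C=O with terminal 0.
t1 = {"atoms": ["C", "C"], "bonds": {(0, 1): 1}, "r1": 0}
t2 = {"atoms": ["C", "O"], "bonds": {(0, 1): 2}, "r1": 0}
g = join_sample_trees(t1, t2, 1)
```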

Case of three leaf 2-branches

Step 1: Enumeration of 2-fringe-trees

The main task of Step 1 is to compute the following sets:

for each tuple $(a,d,m)$ of an element $a\in\Lambda$ and integers $d\in[1,d_{\max}-1]$ (resp., $d\in[0,d_{\max}-2]$ and $d\in[0,d_{\max}-3]$) and $m\in[d,\mathrm{val}(a)-1]$ (resp., $m\in[d,\mathrm{val}(a)-2]$ and $m\in[d,\mathrm{val}(a)-3]$), the set $W_{\mathrm{end}}^{(0)}(a,d,m)$ (resp., $W_{\mathrm{inl}}^{(0)}(a,d,m)$ and $W_{\mathrm{inl}+3}^{(0)}(a,d,m)$) of all frequency vectors $f(T[+1])$ (resp., $f(T[+2])$ and $f(T[+3])$) of chemical rooted trees $T$ such that $r_1(T)=r_2(T)$, $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$ and $\beta(r_1(T))=m$. For each vector $w\in W_{\mathrm{end}}^{(0)}(a,d,m)$ (resp., $w\in W_{\mathrm{inl}}^{(0)}(a,d,m)$ and $w\in W_{\mathrm{inl}+3}^{(0)}(a,d,m)$), we store a sample tree $T_w$. This step can be designed in a similar way as Step 1 for the case of $\mathrm{bl}_2(G)=2$.

Step 2: Generation of frequency vectors of end-subtrees

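Analogously with Step 2 for the case of $\mathrm{bl}_2(G)=2$, Step 2 computes the following sets in the ascending order of $h=1,2,\ldots,\mathrm{dia}-6-\delta_3$:

For elements $a\in\Lambda$, integers $d\in[1,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, and $h\in[1,\mathrm{dia}-6-\delta_3]$, the sets $W_{\mathrm{end}}^{(h)}(a,d,m)$ of all frequency vectors $f(T[+1])$ of chemical bi-rooted trees $T\in\mathcal{T}(x)$ such that $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, $\beta(r_1(T))=m$ and $\ell(P_T)=h$.

For each vector $w\in W_{\mathrm{end}}^{(h)}(a,d,m)$, we construct a sample tree $T_w$ from the sample trees $T_{w'}$ and $T_{w''}$ of the two vectors $w'$ and $w''$ from which $w$ was obtained.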

Step 3: Generation of frequency vectors of end-subtrees with two fictitious edges

The main task of Step 3 is to compute the following sets:

For elements $a\in\Lambda$, integers $d\in[1,d_{\max}-2]$, $m\in[d,\mathrm{val}(a)-2]$ and $h\in[\lceil\mathrm{dia}/2\rceil-2,\mathrm{dia}-5-\delta_3]$, the sets $W_{\mathrm{end}+2}^{(h)}(a,d,m)$ of all frequency vectors of bi-rooted trees $T[+2]$ such that $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, $\beta(r_1(T))=m$ and $\ell(P_T)=h$. For each vector $w\in W_{\mathrm{end}+2}^{(h)}(a,d,m)$, we store a sample tree $T_w$. This step can be designed in a similar way as Step 3 for the case of $\mathrm{bl}_2(G)=2$.

Step 4: Enumeration of frequency vectors of main-subtrees

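For an element $a\in\Lambda$, and integers $d\in[2,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, and $\delta_1\in[\lceil\mathrm{dia}/2\rceil-3,\mathrm{dia}-6-\delta_3]$, define $W_{\mathrm{main}}^{(\delta_1+1)}(a,d,m)$ to be the set of the frequency vectors $f(T\langle{+1}\rangle)$ of chemical tri-rooted trees $T$ such that

  • $\alpha(r_1(T))=a$, $\deg_T(r_1(T))=d$, $\beta(r_1(T))=m$, $\ell(P_T)=\mathrm{dia}-4$ and

  • the length of the path $P_{r_2(T),r_3(T)}$ between vertices $r_2(T)$ and $r_3(T)$ is $\delta_1+1$.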

See Fig. 12 for the structure of a main-tree. Such a chemical tri-rooted graph $T$ corresponds to the main-subtree of a target graph $G\in\mathcal{G}(x)$.

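The main task of Step 4 is to compute the sets $W_{\mathrm{main}}^{(\delta_1+1)}(a,d,m)$, $a\in\Lambda$, $d\in[2,d_{\max}-1]$, $m\in[d,\mathrm{val}(a)-1]$, $\delta_1\in[\lceil\mathrm{dia}/2\rceil-3,\mathrm{dia}-6-\delta_3]$. Each vector $w\in W_{\mathrm{main}}^{(\delta_1+1)}(a,d,m)$ can be obtained from a combination of vectors $w^1\in W_{\mathrm{end}+2}^{(\delta_1+1)}(a,d-1,m)$ and $w^2\in W_{\mathrm{end}}^{(\delta_2)}(a,d,m)$ such that $\delta_1+\delta_2=\mathrm{dia}-4$ and $\delta_1\ge\delta_2$, as illustrated in Fig. 17. For each vector $w\in W_{\mathrm{main}}^{(\delta_1+1)}(a,d,m)$, we store a sample tree $T_w$. This step can be designed in a similar way as Step 3 for the case of $\mathrm{bl}_2(G)=2$.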

Fig. 17. An illustration of computing the frequency vector $w=f(T\langle{+1}\rangle)\in W_{\mathrm{main}}^{(\delta_1+1)}(a,d,m)$ of a tri-rooted tree $T$ from the frequency vectors $w^1=f(T_1[+2])\in W_{\mathrm{end}+2}^{(\delta_1+1)}(a,d-1,m)$ and $w^2=f(T_2[+1])\in W_{\mathrm{end}}^{(\delta_2)}(a,d,m)$ for bi-rooted trees $T_1$ and $T_2$

Step 5: Enumeration of feasible vector pairs

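Analogously with the case of $\mathrm{bl}_2(G)=2$, a feasible pair of vectors is defined to be a pair of vectors $w^1=(w^1_{\mathrm{in}},w^1_{\mathrm{ex}})\in W_{\mathrm{main}}^{(\delta_1+1)}(a_1,d_1,m_1)$ and $w^2=(w^2_{\mathrm{in}},w^2_{\mathrm{ex}})\in W_{\mathrm{end}}^{(\delta_3)}(a_2,d_2,m_2)$, $\delta_1\in[\lceil\mathrm{dia}/2\rceil-3,\mathrm{dia}-6-\delta_3]$, $a_i\in\Lambda$, $d_i\in[1,d_{\max}-1]$, $m_i\in[d_i,\mathrm{val}(a_i)-1]$, $i=1,2$, that admits an adjacency-configuration $\gamma=(a_1,a_2,m)\in\Gamma$ and a bond-configuration $\mu=(d_1+1,d_2+1,m)\in\mathrm{Bc}$ with an integer $m\in[1,\min\{3,\mathrm{val}(a_1)-m_1,\mathrm{val}(a_2)-m_2\}]$ such that

  • $x_{\mathrm{in}}=w^1_{\mathrm{in}}+w^2_{\mathrm{in}}+1_\gamma+1_\mu$ and $x_{\mathrm{ex}}=w^1_{\mathrm{ex}}+w^2_{\mathrm{ex}}$.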

Step 5 computes the set of all feasible vector pairs $(w^1,w^2)$ by using a sorting algorithm as in Step 3 for the case of $\mathrm{bl}_2(G)=2$.

Step 6: Construction of chemical graphs

Analogously with Step 4 for the case of $\mathrm{bl}_2(G)=2$, Step 6 constructs a target graph $T_{(w^1,w^2)}\in\mathcal{G}(x)$ for each feasible vector pair $(w^1,w^2)$ by combining the sample trees $T_i=T_{w^i}$ of vectors $w^i$ with a new edge $e=r_1(T_1)r_1(T_2)$.

Authors' contributions

Conceptualization, HN and TA; methodology, HN; software, NAA, JZ, YS, YS, AS and LZ; validation, NAA, JZ, AS and HN; formal analysis, HN; data resources, AS, LZ, HN and TA; writing (original draft preparation), HN; writing (review and editing), NAA, AS and TA; project administration, HN; funding acquisition, TA. All authors read and approved the final manuscript.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Availability of data and materials

Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.

Declarations

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hiroshi Nagamochi, Email: nag@amp.i.kyoto-u.ac.jp.

Tatsuya Akutsu, Email: takutsu@kuicr.kyoto-u.ac.jp.

References

  • 1. Miyao T, Kaneko H, Funatsu K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J Chem Inf Model. 2016;56(2):286–299. doi: 10.1021/acs.jcim.5b00628.
  • 2. Skvortsova MI, Baskin II, Slovokhotova OL, Palyulin VA, Zefirov NS. Inverse problem in QSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kier indices). J Chem Inf Comput Sci. 1993;33(4):630–634. doi: 10.1021/ci00014a017.
  • 3. Ikebata H, Hongo K, Isomura T, Maezono R, Yoshida R. Bayesian molecular design with a chemical language model. J Comput Aided Mol Design. 2017;31(4):379–391. doi: 10.1007/s10822-016-0008-z.
  • 4. Rupakheti C, Virshup A, Yang W, Beratan DN. Strategy to discover diverse optimal molecules in the small molecule universe. J Chem Inf Model. 2015;55(3):529–537. doi: 10.1021/ci500749q.
  • 5. Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T. Enumerating treelike chemical graphs with given path frequency. J Chem Inf Model. 2008;48(7):1345–1357. doi: 10.1021/ci700385a.
  • 6. Kerber A, Laue R, Grüner T, Meringer M. MOLGEN 4.0. Match Commun Math Comput Chem. 1998;37:205–208.
  • 7. Li J, Nagamochi H, Akutsu T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans Comput Biol Bioinf. 2016;15(2):633–646. doi: 10.1109/TCBB.2016.2628888.
  • 8. Reymond J-L. The chemical space project. Accounts Chem Res. 2015;48(3):722–730. doi: 10.1021/ar500432k.
  • 9. Akutsu T, Fukagawa D, Jansson J, Sadakane K. Inferring a graph from path frequency. Discrete Appl Math. 2012;160(10–11):1416–1428. doi: 10.1016/j.dam.2012.02.002.
  • 10. Nagamochi H. A detachment algorithm for inferring a graph from path frequency. Algorithmica. 2009;53(2):207–224. doi: 10.1007/s00453-008-9184-0.
  • 11. Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev. 1996;16(1):3–50. doi: 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6.
  • 12. Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci. 2018;4(2):268–276. doi: 10.1021/acscentsci.7b00572.
  • 13. Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Sci. 2017;4(1):120–131. doi: 10.1021/acscentsci.7b00512.
  • 14. Yang X, Zhang J, Yoshizoe K, Terayama K, Tsuda K. ChemTS: an efficient python library for de novo molecular generation. Sci Technol Adv Mater. 2017;18(1):972–976. doi: 10.1080/14686996.2017.1401424.
  • 15. Kusner MJ, Paige B, Hernández-Lobato JM. Grammar variational autoencoder. In: Proceedings of the 34th International Conference on Machine Learning, vol 70; 2017. p. 1945–54.
  • 16. Akutsu T, Nagamochi H. A mixed integer linear programming formulation to artificial neural networks. In: Proceedings of the 2nd international conference on information science and systems, Tokyo, Japan, ACM; 2019. p. 215–20.
  • 17. Azam NA, Chiewvanichakorn R, Zhang F, Shurbevski A, Nagamochi H, Akutsu T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming with guaranteed admissibility. In: Proceedings of the 13th international joint conference on biomedical engineering systems and technologies, vol 3: BIOINFORMATICS, Valetta, Malta; 2020. p. 101–108.
  • 18. Chiewvanichakorn R, Wang C, Zhang Z, Shurbevski A, Nagamochi H, Akutsu T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In: Proceedings of the 2020 10th international conference on bioscience, biochemistry and bioinformatics, Kyoto, Japan; 2020. p. 40–46. doi: 10.1145/3386052.3386054.
  • 19. Zhang F, Zhu J, Chiewvanichakorn R, Shurbevski A, Nagamochi H, Akutsu T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In: Proceedings of the 33rd international conference on industrial, engineering and other applications of applied intelligent systems, Kitakyushu, Japan; 2020. p. 433–444. doi: 10.1007/978-3-030-55789-8_38.
  • 20. Ito R, Azam NA, Wang C, Shurbevski A, Nagamochi H, Akutsu T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming. In: Proceedings of the 21st international conference on bioinformatics and computational biology; 2020.
  • 21. Zhu J, Wang C, Shurbevski A, Nagamochi H, Akutsu T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms. 2020;13(5):124. doi: 10.3390/a13050124.
  • 22. Suzuki M, Nagamochi H, Akutsu T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J Cheminf. 2014;6(1):31. doi: 10.1186/1758-2946-6-31.
  • 23. Tamura Y, Nishiyama Y, Wang C, Sun Y, Shurbevski A, Nagamochi H, Akutsu T. Enumerating chemical graphs with mono-block 2-augmented tree structure from given upper and lower bounds on path frequencies; 2020. arXiv preprint arXiv:2004.06367.
  • 24. Yamashita K, Masui R, Zhou X, Wang C, Shurbevski A, Nagamochi H, Akutsu T. Enumerating chemical graphs with two disjoint cycles satisfying given path frequency specifications; 2020. arXiv preprint arXiv:2004.08381.
  • 25. Kim S, et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2021;49(D1):D1388–D1395. doi: 10.1093/nar/gkaa971.
  • 26. Netzeva TI, et al. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships: the report and recommendations of ECVAM workshop 52. Altern Lab Anim. 2005;33(2):155–173. doi: 10.1177/026119290503300209.
  • 27. Nagamochi H, Akutsu T. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming; 2020. arXiv preprint arXiv:2010.09203.


Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC
