Abstract
We propose a dynamic programming algorithm that generates chemical isomers of a given chemical compound with cycles. We represent a chemical compound as a chemical graph and define its feature vector based on graph-theoretical descriptors. Our descriptors mainly consist of the numbers of occurrences of “edge-configurations”, each of which captures information about a pair of adjacent atoms, such as their degrees and bond-multiplicity. We call two chemical graphs chemical isomers of each other if they have the same feature vector and share a common prescribed structure. Our proposed algorithm produces a compact representation of all chemical isomers of a given chemical graph. This representation enables efficient counting of chemical isomers without explicit generation. Furthermore, our algorithm allows us to enumerate any number of isomers, even at random. For example, our compact representation for a chemical graph with 70 non-hydrogen atoms contains around 400 arcs in which
chemical isomers are embedded. The proposed algorithm serves as a powerful tool for accelerating chemical compound exploration, particularly in drug discovery and materials science, where identifying novel molecular structures is critical. By enumerating isomers efficiently, our approach improves exploration of the search space for target chemical compounds and thereby facilitates advances in molecular design.
Keywords: Molecular Design, Enumeration of Graphs, Dynamic Programming
Subject terms: Computational biology and bioinformatics, Computer science
Introduction
Graphs are a fundamental data structure in computer science and have been extensively utilized in computational molecular biology, especially for representing chemical molecules. The design of novel graph structures has recently gained significant attention in artificial neural network (ANN) research and related fields. In particular, extensive studies have been conducted on designing chemical graphs with desired chemical properties because of their potential application to drug design. For example, variational autoencoders1, recurrent neural networks2,3, grammar variational autoencoders4, generative adversarial networks5, and invertible flow models6,7 have been applied to this task.
Quantitative structure-activity/property relationship (QSAR/QSPR) models are computational modeling techniques used in cheminformatics. They aim to establish mathematical relationships between the structural attributes of chemical compounds and their biological activities or physicochemical properties8–10. The design of chemical graphs has also been studied for many years in cheminformatics, where this problem is referred to as inverse quantitative structure-activity/property relationships (inverse QSAR/QSPR). In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding chemical graphs with desired properties11–16.
In many of such methods, enumeration of graph structures from a given set of descriptors is a crucial subtask. However, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S, may exceed
17. Enumerating chemical compounds has a long history and numerous applications such as designing novel drugs18 and structure elucidation19. The problem of enumerating chemical compounds can be viewed as the problem of enumerating graphs with given constraints, which is one of the fundamental problems in the field of discrete mathematics and has many applications. Various methods have been developed for general graph structures20–23 and for restricted chemical compounds24–28. Enumeration of restricted chemical compounds with specialized tools is more efficient than with the tools that use general graph structures, which has led to a new trend in the field of chemoinformatics29.
Recently, a novel framework for inferring chemical graphs has been proposed30,31. This framework is illustrated in Figure 1. One of the important stages of this framework is to enumerate chemical isomers of a given chemical graph. The enumeration algorithms required in this framework have been designed for chemical compounds with cycle index at most 2 (refs. 31–34). The computational results show that these algorithms can generate chemical graphs with around 15 non-hydrogen vertices, but cannot handle large chemical graphs because of the long computation time. Instead of focusing on a general chemical graph structure that rarely exists, Azam et al.32 introduced a restricted class of acyclic graphs that is characterized by an integer ρ, called a “branch parameter”, such that the restricted class still covers most of the acyclic chemical compounds in the PubChem database. Based on this characterization, they designed an efficient algorithm to generate acyclic graphs with around 50 non-hydrogen vertices. Recently, Akutsu and Nagamochi37 extended the idea to define a restricted class of cyclic graphs, called “ρ-lean cyclic graphs”, that covers most of the cyclic chemical compounds in the PubChem database, in order to deal with large chemical graphs. Accordingly, they proposed an algorithm to generate chemical isomers of a ρ-lean chemical graph. The method has been implemented, and computational results showed that chemical graphs with up to around 50 non-hydrogen atoms can be inferred and generated35,36.
Fig. 1.
An illustration of a framework for inferring a set of chemical graphs
.
The idea of the chemical graph generation algorithm developed in ref. 37 is to construct a required chemical isomer starting with small chemical subgraphs in a bottom-up manner, where chemical subgraphs are encoded into frequency vectors and the actual construction is carried out in terms of computation of frequency vectors. However, this algorithm was designed to efficiently generate a small number of chemical graphs, and it lacks a backtracking procedure that would allow us to generate all isomers of a rather large chemical graph that admits extremely many isomers. To generate all isomers, we need a procedure that backtracks the computation of frequency vectors in a top-down manner. In this paper, we design a dynamic programming algorithm that enumerates all isomers by constructing a compact representation of the set of all isomers, from which we can generate any number of isomers.
The paper is organized as follows. Section 2 reviews a model of chemical compounds and a choice of descriptors. Section 3 proposes a dynamic programming algorithm that generates chemical isomers of a given chemical graph. Section 4 reports the results of computational experiments. Section 5 gives concluding remarks. The proposed method/system is available on GitHub at https://github.com/ku-dml/mol-infer.
Preliminary
Let $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let [a, b] denote the set of integers i with $a \le i \le b$.
Graphs
Given a graph G, let V(G) and E(G) denote the sets of vertices and edges, respectively and let
denote the set of neighbors of a vertex v in G. The length of a path is defined to be the number of edges in the path. Denote by
the length of a path P. A rooted tree is defined to be a tree where a vertex is designated as the root. The height
of a vertex v in a rooted tree T is defined to be the maximum length of a path from v to a leaf u, and the height
of T is defined to be the height
of the root r.
As an extension of rooted trees, we define a bi-rooted tree to be a tree T with two designated vertices
and
, called terminals. Let T be a bi-rooted tree. Define the backbone path
to be the path of T between terminals
and
, and denote by
(or by
) the set of components of T in the graph
obtained from T by removing the edges in
, where we regard each tree
as a tree rooted at the unique vertex in
. The height
of T is defined to be the maximum of the heights of rooted trees in
.
The rank of a graph G is defined to be the minimum number of edges to be removed to make the graph a tree. We call a graph with rank k a rank-k graph. Figure 2 illustrates three examples of rank-2 graphs
,
.
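The rank defined above can be computed directly for a connected graph, since any spanning tree on n vertices has n − 1 edges. The following is a minimal sketch, not part of the authors' implementation; the function name `graph_rank` and the edge-list input format are assumptions made for illustration.

```python
def graph_rank(num_vertices, edges):
    """Rank of a connected graph: the number of edges that must be
    removed to leave a spanning tree, i.e. |E| - |V| + 1."""
    # Union-find is used only to confirm that the graph is connected.
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    if components != 1:
        raise ValueError("graph must be connected")
    return len(edges) - num_vertices + 1


# Two triangles sharing vertex 0: a rank-2 graph.
two_triangles = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
print(graph_rank(5, two_triangles))
```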
Fig. 2.
An illustration of three rank-2 graphs, where the core vertices (resp., non-core vertices) are depicted with squares (resp., circles) and the 2-branch vertices are depicted with gray circles: (a) a graph that is 2-lean; (b) a graph that is not 2-lean; (c) a graph that is not 2-lean.
The core of a graph with cycles is defined to be the subgraph formed by the cycles and the paths between the cycles. More precisely, let H be a connected simple graph with rank at least 1. The core
of H is defined to be an induced subgraph
such that
is the set of vertices in cycles of H and
is the set of vertices each of which is in a path between two vertices
. A vertex (resp., an edge) in H is called a core vertex (resp., core edge) if it is contained in the core
, i.e., it lies on a cycle or on a path between cycles. A vertex or edge that is not in the core is called non-core vertex or non-core edge, respectively.
The core size
is defined to be
. An exterior tree
T is defined to be a maximal induced subtree of H such that V(T) contains exactly one core vertex v of H, where T is regarded as a rooted tree rooted at v. For example, in Figure 2(a), the tree induced by the vertex set
is an exterior tree of
. The core height
is defined to be the maximum height
of an exterior tree T of H.
The core size and core height of the three rank-2 graphs
,
illustrated in Figures 2(a)-(c) are
,
,
,
,
and
.
Branch parameter
Choose a positive integer
as a branch parameter32. A non-core vertex v is called a
-internal vertex (resp., a
-external vertex) if
(resp.,
). A non-core edge e is called a
-internal edge (resp., a
-external edge) if e is incident to no
-external vertex (resp., to a
-external vertex). A
-internal vertex v is called a
-branch if v has at least two children, each of which has height at least
, where a
-branch v is called a leaf
-branch if
.
A
-fringe tree is defined to be a maximal subtree
of an exterior tree T such that the edge set of
consists of
-external edges. For example, in Figure 2(a), the tree induced by the vertex set
is a 2-fringe tree. Note that every exterior tree T contains a
-fringe tree if and only if
. The
-branch leaf number
of H is defined to be the number of leaf
-branches in H.
We call an exterior tree T of H a
-exterior tree if
; i.e., it contains at least one leaf
-branch.
We call a core vertex adjacent to a
-exterior tree a
-branch core vertex, denote by
the set of
-branch core vertices and define the
-branch core size
to be
. Note that
,
and either
or
.
We call a cyclic graph H ρ-lean, for a branch parameter ρ, if every exterior tree T contains at most one leaf ρ-branch; i.e., the set of ρ-internal edges in each exterior tree T forms a single path.
Figure 2 illustrates three examples of rank-2 graphs. In the first example,
and
are the leaf 2-branches,
and
are the 2-branch core vertices,
holds and
is 2-lean. In the second example,
and
are the leaf 2-branches,
is the 2-branch core vertex,
holds and
is not 2-lean. In the third example,
and
are the leaf 2-branches,
is the non-leaf 2-branch,
is the 2-branch core vertex,
holds and
is not 2-lean.
For branch parameter 2, nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean. This statistical fact allows us to focus on 2-lean chemical graphs instead of general chemical graph structures, which are relatively difficult to generate and may not be of practical use. Over 92% of the 2-fringe trees of chemical compounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint:
| 1 |
Thus we can focus on designing an efficient algorithm to enumerate fringe trees that satisfy Eq. (1), instead of generating chemical graphs with any kind of fringe trees that rarely exist.
A hydrogen-suppressed model for chemical compounds
We represent the graph structure of a chemical compound as a graph H with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. In a cyclic graph H, we regard each non-core edge
as a directed edge (u, v) from a vertex u to a child v of u in an exterior tree of H in order to define a descriptor that exploits the direction of non-core edges.
Let
be a set of labels each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that
does not contain H (hydrogen). Let
and
denote the mass and valence of a chemical element
, respectively. We define an adjacency-configuration to be a tuple
with chemical elements
and a bond-multiplicity
; a chemical symbol to be a pair
of the chemical element
and the degree
, where
denotes the set of all chemical symbols; and an edge-configuration to be a tuple
with
and
.
We choose a branch parameter
and two sets
and
of chemical symbols and three sets
,
and
of edge-configurations.
Let
be an edge in a chemical graph G such that
are assigned to the vertices u and v, the degrees of u and v are i and j, respectively and the bond-multiplicity between them is m. When uv is a core edge, the edge-configuration
of edge e is defined to be
if
in a total order over
(or
otherwise). When uv is a non-core edge which is regarded as a directed edge (u, v) where u is the parent of v in some exterior tree, the edge-configuration
of a
-internal (resp.,
-external) edge e is defined to be
(resp.,
).
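To make the notion of an edge-configuration concrete, the sketch below builds one from the chemical symbols of the two end-vertices and the bond-multiplicity. It assumes that a chemical symbol is a pair (element, degree), that the two symbols of a core edge are sorted by some fixed total order (here Python's tuple order, which is only a stand-in for the order used in the paper), and that a non-core edge keeps the parent-to-child direction. The function name is hypothetical.

```python
def edge_configuration(sym_u, sym_v, multiplicity, is_core):
    """Return a tuple (symbol, symbol, multiplicity) for an edge uv.

    sym_u, sym_v: chemical symbols as (element, degree) pairs.
    For a core edge the two symbols are put in a canonical order;
    for a non-core edge, u is taken to be the parent of v and the
    direction (u, v) is preserved.
    """
    if is_core and sym_v < sym_u:  # assumed total order: tuple comparison
        sym_u, sym_v = sym_v, sym_u
    return (sym_u, sym_v, multiplicity)


# A core double bond between O of degree 1 and C of degree 3.
print(edge_configuration(("O", 1), ("C", 3), 2, is_core=True))
# A non-core single bond directed from parent O to child C.
print(edge_configuration(("O", 2), ("C", 1), 1, is_core=False))
```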
Let
be a tuple with a cyclic graph
and functions
and
, where we use
to denote the function
such that
for each vertex
. A tuple
is called a chemical cyclic graph if (i) H is connected; (ii)
for each vertex
; and (iii)
,
and
for each core edge
,
-internal edge
and
-external edge
, respectively.
Descriptors and feature vectors
A feature vector
f(G) of a chemical cyclic graph
consists of the following 16 kinds of graph-theoretical descriptors.
n(G): the number |V| of vertices;
: the core size of G;
: the core height of G;
: the
-branch leaf number of G;
: the average mass of atoms in G;
: the number of hydrogen atoms suppressed in G;
,
: the numbers of core vertices and non-core vertices of degree
in G;
,
,
: the numbers of core edges,
-internal edges and
-external edges with bond multiplicity
in G;
,
,
,
: the numbers of core vertices and non-core vertices v with a chemical symbol
; and
,
,
: the numbers of core edges
such that
,
-internal edges
such that
, and
-external edges
such that
in G.
Note that excluding the average mass descriptor
, the remaining 15 descriptors in the feature vector correspond to frequency counts.
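Two of the frequency-type descriptors can be sketched in a few lines. The example below assumes neutral atoms with standard valences, so that the number of suppressed hydrogen atoms at a vertex is its valence minus the sum of the bond-multiplicities of its incident edges. The valence table, the function names and the input format are illustrative assumptions, and the sketch does not separate core, internal and external edges as the descriptors above do.

```python
from collections import Counter

# Illustrative valence table (an assumption, not the paper's label set).
VALENCE = {"C": 4, "N": 3, "O": 2}


def suppressed_hydrogens(alpha, beta):
    """alpha: vertex -> element; beta: (u, v) -> bond-multiplicity."""
    used = Counter()
    for (u, v), m in beta.items():
        used[u] += m
        used[v] += m
    return sum(VALENCE[alpha[v]] - used[v] for v in alpha)


def bond_multiplicity_counts(beta):
    """Number of edges for each bond-multiplicity."""
    return Counter(beta.values())


# Acetic acid, CH3-C(=O)-OH, in the hydrogen-suppressed model.
alpha = {0: "C", 1: "C", 2: "O", 3: "O"}
beta = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
print(suppressed_hydrogens(alpha, beta))
print(bond_multiplicity_counts(beta))
```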
An algorithm for generating isomers
This section designs a new algorithm for generating
-lean cyclic graphs G that have the same feature vector
of a given chemical
-lean graph
.
The idea of algorithm
For a graph
, we define the frequency vector
as the vector consisting of the frequency values corresponding to the 15 descriptors used in the feature vector
. Instead of manipulating target graphs directly, we first compute the frequency vectors
of subtrees
of all target graphs and then construct a limited number of target graphs G from the process of computing the vectors. To this end, we extend the dynamic programming algorithm for generating acyclic chemical graphs proposed by Azam et al.32. A sketch of the algorithm is as follows.
1. Given a chemical
-lean cyclic graph
, simplify the core of
into a graph
by replacing some paths with edges. Decompose
into a collection of chemical trees
such that each tree
contains at most two vertices in
. See Figure 3 for an illustration.
2. For each index
, compute the feature vector
and then generate a set
of all (or a limited number of) chemical acyclic graphs
such that
by using the dynamic programming algorithm32.
3. Each combination of chemical trees
forms a chemical
-lean cyclic graph
such that
.
Fig. 3.
An illustration of generating a chemical isomer
of a chemical graph
in Stage 5, where
is decomposed into chemical trees
,
based on a set
of core vertices and a set
of chemical tree
such that
is constructed for each vector
, before a new target graph
is obtained as a combination of
.
Note that the number of chemical isomers
obtained in 3 is
which may possibly include isomorphic chemical graphs depending on the structure of
. However, in many cases of computational experiments, the number
is extremely large. Therefore we generate only a limited number of isomers
by selecting a small number of trees
, and most of the generated isomers are non-isomorphic as evident from the experimental results given in Section 4.
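Because every combination of one tree per component yields an isomer, the total count is the product of the sizes of the tree sets, and a limited number of combinations can be drawn lazily without materializing all of them. The following sketch illustrates this counting argument only; the function names and the list-of-lists representation of the tree sets are assumptions.

```python
from itertools import islice, product
from math import prod


def count_combinations(tree_sets):
    """Number of ways to pick one tree from each set."""
    return prod(len(trees) for trees in tree_sets)


def first_combinations(tree_sets, limit):
    """Lazily take at most `limit` combinations, one tree per set."""
    return list(islice(product(*tree_sets), limit))


# Placeholder names stand in for chemical trees.
tree_sets = [["t1a", "t1b"], ["t2a", "t2b", "t2c"], ["t3a"]]
print(count_combinations(tree_sets))
print(first_combinations(tree_sets, 4))
```

Since duplicates up to graph isomorphism may occur among the combinations, the product is an upper bound on the number of distinct isomers.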
In the following, we describe a new algorithm that for a given chemical
-lean graph
, generates chemical
-lean cyclic graphs
such that
where
may not be graph-isomorphic to
and the elements in
may not correspond between the two cores; i.e., possibly
for some core vertex v of H in the graph-isomorphism
between
and
.
In this section, we describe our new algorithm in a general setting where a branch parameter is any integer
and a chemical graph G to be inferred is any chemical
-lean cyclic graph.
Figure 4(a) and (b) illustrate small chemical 2-lean cyclic graphs, which we use as running examples to demonstrate how our algorithm generates isomers.
Fig. 4.
An illustration of chemical rooted/bi-rooted trees: (a) A chemical rooted tree
rooted at a vertex v in a chemical graph G: CID 7600; (b) A chemical bi-rooted tree
with terminals u and v in a chemical graph G: CID 3729083; (c) An example of a base-graph
to the chemical graph G in (a); (d) An example of a base-graph
to the chemical graph G in (b).
Nc-trees and C-trees
Nc-trees
Let
be a branch parameter and H be a
-lean cyclic graph. We define “non-core-subtrees” in the following way.
Let T be a connected subgraph of H. We call T a non-core-subtree of H if T is regarded as a bi-rooted tree such that the backbone path
is a subgraph of a
-exterior tree of H (excluding the root) and the
-fringe trees rooted at vertices in
. We call a non-core-subtree T of H an internal-subtree (resp., an end-subtree) of H if neither (resp., one) of the two end-vertices of
is a leaf
-branch of H, as illustrated in Figure 5(a) (resp., in Figure 5(b)).
Fig. 5.
An illustration of subtrees of a chemical
-lean cyclic graph G, where thick lines depict the cycle of the core of G, green circles depict leaf
-branches in G and arrows depict non-core directed edges: (a) A non-core-subtree (internal-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (b) A non-core-subtree (end-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (c) A core-subtree T of G with
represented by a c-tree (a chemical rooted tree); (d) A core-subtree T of G with
represented by a c-tree (a chemical bi-rooted tree).
We introduce “nc-trees” to represent non-core subtrees of a
-lean cyclic graph H. The nc-tree is defined to be a chemical bi-rooted tree T such that each rooted tree
has a height at most
.
For an nc-tree T, define
As discussed in Section 2, a non-core edge in
is regarded as a directed edge (u, v). Define the number
of
-branch core vertices in T to be
and the core height
of T to be 0.
C-trees
For a
-lean cyclic graph H, a subtree T of H is called a core-subtree if one of the following holds:
-
(i)
T consists of all pendant-trees rooted at a core vertex
; -
(ii)
T consists of a core-path
with
and all pendant-trees rooted at internal vertices of path
; and -
(iii)
T consists of a core-path
with
, all pendant-trees rooted at internal vertices of path
, and all pendant-trees rooted at one of the end-vertices of path
.
To represent a core-subtree of H, we introduce “c-trees.” For a branch parameter
, we call a bi-rooted tree
-lean if each rooted tree
contains at most one
-branch; i.e., there is no non-leaf
-branch and no two
-exterior trees meet at the same vertex in
. A c-tree is defined to be a chemical
-lean bi-rooted tree T.
The chemical tree
in Figure 4(a) is an example of a c-tree with
,
and
, where v is a core vertex and
and
are nc-trees. The chemical tree
in Figure 4(b) is an example of a c-tree with
,
and
, where s is a core vertex and
is an nc-tree. The chemical tree
in Figure 4(b) is an example of a c-tree with
,
and
, where
and
are c-trees.
For a c-tree T, define
Define the number
of
-branch core vertices in T to be the number of rooted trees in
with
. Define the core height
for the bi-rooted tree T. Note that
(resp.,
) is the set of
-external vertices (resp.,
-external vertices) in the rooted trees in
. Illustrations of c-trees are given in Figures 5(c) and (d).
As discussed in Section 2, a non-core edge in
for an nc-tree or a c-tree T is regarded as a directed edge (u, v).
Fictitious trees
For the nc-tree
in Figure 4(a), the degree of the terminal
is
in G for
in T and
. We treat such a degree
of a terminal v in a target chemical graph G as a fictitious degree of a chemical rooted tree T.
For an nc-tree or a c-tree T and an integer
, let
denote a fictitious chemical graph obtained from T by regarding the degree of terminal
as
. Figure 6(a) and (b) illustrate fictitious trees
in the case of
and
in the case of
and
, respectively.
Fig. 6.

An illustration of fictitious trees: (a)
of a rooted nc- or c-tree T; (b)
of a bi-rooted nc-tree T; (c)
of a bi-rooted c-tree T.
For a c-tree T with
and integers
, let
denote a fictitious chemical graph obtained from T by regarding the degree of terminal
,
as
. Figure 6(c) illustrates a fictitious bi-rooted c-tree
.
Frequency vectors
For a finite set A of elements, let
denote the set of functions
. A function
is called a non-negative integer vector (or a vector) on A and the value
for an element
is called the entry of
for
. For a vector
and an element
, let
(resp.,
) denote the vector
such that
(resp.,
) and
for the other elements
. For a vector
and a subset
, let
denote the projection of
to B; i.e.,
such that
,
.
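The three vector operations just described (incrementing one entry, decrementing one entry, and projecting onto a subset) can be sketched with plain dictionaries. This is only an illustration under the assumption that a vector on A is stored as a mapping from elements of A to non-negative integers; the function names are hypothetical.

```python
def add_unit(w, a):
    """Return a copy of w with the entry for a increased by 1."""
    result = dict(w)
    result[a] = result.get(a, 0) + 1
    return result


def sub_unit(w, a):
    """Return a copy of w with the entry for a decreased by 1."""
    if w.get(a, 0) == 0:
        raise ValueError("entry would become negative")
    result = dict(w)
    result[a] -= 1
    return result


def project(w, subset):
    """Restrict w to the elements of `subset`."""
    return {b: w.get(b, 0) for b in subset}


w = {"a": 2, "b": 0}
print(add_unit(w, "b"), sub_unit(w, "a"), project(w, {"a"}))
```

Returning copies rather than mutating in place keeps each vector usable as an immutable value, which is convenient when many partial vectors are stored in sets during dynamic programming.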
To introduce a “frequency vector” of a subgraph of a chemical cyclic graph, we define sets of symbols that correspond to some descriptors of a chemical cyclic graph. Let
,
and
be sets of edge-configurations in Section 2. We define a vector whose entry is the frequency of an edge-configuration in the sets
,
or the number of
-branch core vertices. We use a symbol
to denote the number of
-branch core vertices in our frequency vector. To distinguish edge-configurations from different sets among three sets
,
, we use
to denote the entry of an edge-configuration
,
. We denote by
the set of entries
,
,
. Define the set of all entries of a frequency vector to be
Define the frequency vector
to be a vector
that consists of the following entries:
,
;
,
,
;
.
For an nc-tree or c-tree T, the frequency vector
of a fictitious tree
is defined as follows: Let
,
,
,
,
. Let
,
. Set
if T is an nc-tree, and
if T is a c-tree. When
,
Let
, and let
belong to
. When T is an nc-tree,
The frequency vector
of a fictitious tree
for a bi-rooted c-tree T with
is defined as follows: For each
, let
,
,
,
of the unique edge incident to
and
,
. Let
,
. Then
Chemical graph isomorphism
For a chemical
-lean cyclic graph for a branch parameter
, we choose a path-partition
of the core
, where
. Let
denote the set of all end-vertices of paths
, where
.
Define the base-graph
of H by
to be the multigraph obtained from H replacing each path
with a single edge
joining the end-vertices of
, where
. We call a vertex in
and an edge in
a base-vertex and a base-edge, respectively. For notational convenience in distinguishing the two end-vertices u and v of a base-edge
, we regard each base-edge
as a directed edge
. For each base-edge
, let
denote the path
that is replaced by edge
.
Figure 4(c) (resp., (d)) illustrates an example of a base-graph
to the chemical graph G in Figure 4(a) (resp., (b)).
We define the “components” of G by
as follows.
Vertex-components
For each base-vertex
, define the component at vertex v (or the v-component)
of G to be the chemical core-subtree rooted at v in G; i.e.,
consists of all pendant-trees rooted at v. We regard
as a c-tree rooted at the core vertex v of G and define the code
of
to be a tuple
such that
The nc-tree
in Figure 4(a) is the v-component of the graph G for the base-vertex
in Figure 4(c), where
,
,
,
and
.
Edge-components
For each base-edge
, define the component at edge e (or the e-component)
of G to be the chemical core-subtree of G that consists of the core-path
and all pendant-trees of G rooted at internal vertices of path
. We regard
as a bi-rooted c-tree with
and
for the base-edge
and define the code
of
to be a tuple
such that
,
,
,
,
and
for the edges
incident to u and v,
.
The c-tree
in Figure 4(b) is the e-component of the graph G for the base-edge
in Figure 4(d), where
contains exactly one leaf 2-branch
(hence
),
,
,
,
,
,
and
.
Observe that
Note that any other descriptors of a chemical
-lean cyclic graph G with
except for the core height can be determined by the entries of the frequency vector
. For example, the vector
with the numbers
of core vertices of degree
is given by
and the vector
with the numbers of symbols
of core vertices is given by
Similarly the vector
with the numbers of symbols
of non-core vertices is given by
We introduce a specification
as a set of functions
.
We call two chemical graphs
-isomorphic if they consist of vertex and edge components with the same codes and heights; i.e., two chemical
-lean cyclic graphs
,
are
-isomorphic if the following hold:
and
are graph-isomorphic, where we assume that
and
denotes the base-graph of both graphs
and
by
.
For the v-components
of
,
at each base-vertex
,
and
.
For the e-components
of
,
at each base-edge
,
and
.
See Section 2 for the definition of height
of a bi-rooted tree T.
The
-isomorphism also implies that
,
,
,
and
.
Chemical isomers of a given chemical graph
Let
be a chemical
-lean cyclic graph
, and
(resp.,
) denote the v-component (resp., the e-component) of
.
Target v-components
Let
denote the height
of the v-component of
. For each base-vertex
, fix a code
and call a rooted c-tree T a target v-component if
where the condition on
is equivalent to
when
, since
is a
-lean cyclic graph and the set of
-internal edges in any target component forms a single path of length
from the root to a unique leaf
-branch. Let
denote the set of all target v-components of a base-vertex
.
For example, the number of all target v-components of the example in Figure 4(a) is 8, which will be computed as a compact representation by our algorithm.
Target e-components
For each base-edge
, fix a code
and call a bi-rooted c-tree T a target e-component if
Let
denote the set of all target e-components of a base-edge
.
For example, the number of all target e-components of the example in Figure 4(b) is 12, which will be computed as a compact representation by our algorithm.
Given a collection of target v-components
,
and target e-components
,
, there is a chemical
-lean cyclic graph
that is
-isomorphic to the original chemical graph
. Such a graph
can be obtained from
by replacing each base-edge
with
and attaching
at each base-vertex
.
From this observation, our aim is now to generate some number of target v-components for each base-vertex v and target e-components for each base-edge e. In the following, we denote
,
,
,
,
and
for each base-edge
by
,
,
,
,
and
, respectively, for notational simplicity. For each base-edge
, let
A sketch of dynamic programming algorithm on frequency vectors
We begin with a sketch of our new algorithm for generating graphs
in Stage 5.
We first enumerate chemical rooted trees with height at most
, each of which can be a
-fringe tree of a target component. Next, we extend each of the rooted trees to an nc-tree T and then to a c-tree T under the constraint that the frequency vector
of T does not exceed a given vector
,
or
,
.
For a vector
, we formulate the following sets of nc-trees and c-trees and of their frequency vectors:
-
(i)
,
,
,
: the set of rooted nc-trees T with a root r such that
Let
denote the set of the frequency vectors
for all nc-trees
; -
(ii)
,
,
,
: the set of rooted nc-trees T with a root r such that
Let
denote the set of the frequency vectors
for all nc-trees
; -
(iii)
,
,
,
,
,
,
: the set of bi-rooted nc-trees T such that
Let
denote the set of all frequency vectors
for all bi-rooted nc-trees
. -
(iv)
,
,
,
,
,
,
,
: the set of rooted c-trees T with a root r such that
Let
denote the set of the frequency vectors
for all c-trees
. -
(v)
,
,
,
,
,
,
,
,
: the set of bi-rooted c-trees T such that
Let
denote the set of the frequency vectors
for all bi-rooted c-trees
.
Note that
for any vector
in the above set in (i)-(iii).
Forward phase
The first phase computes the frequency vectors of some nc-trees and c-trees that can be a subtree of a target component, where we first enumerate chemical rooted trees with height at most
and generate the frequency vectors of other types of nc-trees and c-trees from the frequency vectors of their subtrees recursively.
The first phase consists of five steps. Step 1 computes the sets of trees and vectors in (i), (ii) and (iii) with
, where each tree in these sets is of height at most
. Note that the frequency vectors of some two trees in a tree set
in the above can be identical.
In fact, the size
of a set
of trees can be considerably larger than that
of the set
of their frequency vectors. We mainly maintain a whole vector set
. With this idea, Steps 2–5 compute only vector sets
in (iii) with
, (iv) and (v).
We derive recursive formulas that hold among the above sets. Based on these, we compute the vector sets in (iii) in Step 2, those in (iv) in Step 3 and those in (v) in Step 4. For each base-edge
, Step 5 compares vectors
and
, where
is the frequency vector of a c-tree
that is extended from the end-vertex
, to examine whether
and
give rise to a target e-component.
In the previous method for generating target components due to Akutsu and Nagamochi37, an algorithm is designed along the same lines as the first phase, so that some number of target components are constructed once a necessary set of frequency vectors has been computed at the end of the execution. However, that algorithm cannot enumerate all target components.
Backward phase
To address this problem, we need to backtrack the computation in the first phase to detect which subtrees will be part of a target component. In this paper, we design as the second phase an algorithm that constructs a compact DAG representation of all target components by backtracking the computation of the first phase so that all target components can be generated by tracing the DAG representation.
The second phase consists of two steps, Step A and Step B. Step A (resp., Step B) constructs a DAG representation of all target v-components for each base-vertex
with
(resp., for each base-edge
) so that a path from a source to a sink in the DAG corresponds to a construction process of a target component. We can enumerate all target components by enumerating all paths in the resulting DAG representations.
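Enumerating all source-to-sink paths of a DAG can be done lazily with a depth-first generator, so that any number of paths can be taken without storing them all. The sketch below shows only this generic path enumeration; the arc-list format and the name `iter_paths` are assumptions, and arc labels and parallel arcs, which the DAG representation in this paper may carry, are omitted for brevity.

```python
def iter_paths(arcs, sources, sink):
    """Yield every directed path from a source to the sink as a node list.
    `arcs` is a list of (tail, head) pairs of an acyclic digraph."""
    out = {}
    for tail, head in arcs:
        out.setdefault(tail, []).append(head)

    def walk(node, prefix):
        if node == sink:
            yield prefix
            return
        for nxt in out.get(node, []):
            yield from walk(nxt, prefix + [nxt])

    for s in sources:
        yield from walk(s, [s])


arcs = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"), ("a", "b")]
for path in iter_paths(arcs, ["s"], "t"):
    print(path)
```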
Defining DAG representations for vertex-components
A DAG representation
for the set of target v-components
consists of the following:
An acyclic digraph (N, A) with a node set N, a set
of sources, a single sink
and an arc set A. The node set N consists of
disjoint subsets
and
. The end-nodes of every arc
satisfy one of “
,” “
” and “
.”
A function
.
A set W of labels
, where
stands for the frequency vector of a chemical rooted tree.
A set
of chemical rooted trees T for each non-null label
such that the frequency vector of T is equal to the vector implied by
. We also store
.
A function
such that
and
, where
is a null label that stands for the zero-vector.
For a directed path P in (N, A), let W(P) denote the multi-set of non-null labels
of arcs a in P.
Figure 7 illustrates a DAG representation for the set of target v-components
, where, for example,
and
for an arc
and
consists of one tree
in the figure. All null labels
are omitted in the figure.
Fig. 7.
(a) An illustration of a DAG representation
for the set of target v-components
, where each arc
with
(resp.,
) is depicted with two (resp., three) lines,
on an arc a denotes the non-null label
and the root of a chemical rooted tree
is depicted with a black circle; (b) An illustration of a target v-component
induced by a path
with
,
and
.
A target v-component is constructed from a DAG representation as follows.
1. First choose a path
from a source in
to sink
. For example, let
in Figure 7(a), where we obtain
.
2. Construct a chemical path
such that
is the chemical element in the chemical symbol
and
is the bond-multiplicity
of the arc
. For the example
in Figure 7(a), we obtain
.
3. For each non-null label
of an arc
in the path P, choose a chemical rooted tree
from the set
. The number
of all such combinations is
. For the example
in Figure 7(a), choose
for arc
,
for arc
and with
for arc
, and
.
4. Finally, attach the tree
chosen in step 3 to the vertex
in the chemical path
and the resulting tree becomes a target v-component. For the example
in Figure 7(a), we obtain a target v-component
as illustrated in Figure 7(b).
Any target v-component is constructed in the above manner of steps 1 to 4. Hence the number of all target v-components can be computed as follows. Let
denote the set of nodes
such that
After initializing
for each arc a with
,
for each arc a with
and
, choose non-sink nodes
in a non-decreasing order of the distance from u to sink
, and compute
. The number of all target v-components is given by
for the source
.
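The counting rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `count_components`, `arcs` and `tree_count` are hypothetical, a null label is modelled as `None`, and each non-null label is assumed to contribute a factor equal to the number of chemical rooted trees stored for it.

```python
def count_components(arcs, tree_count, source, sink):
    """Count the target components encoded by a DAG (illustrative sketch).

    arcs: dict mapping a node to a list of (head, label) pairs;
          label is None for a null label (assumed encoding).
    tree_count: dict mapping each non-null label to the number of
          chemical rooted trees stored for that label.
    """
    memo = {sink: 1}  # the sink contributes the multiplicative unit

    def count(u):
        # Memoised recursion yields the same values as processing nodes
        # in non-decreasing order of their distance to the sink.
        if u not in memo:
            memo[u] = sum(
                (1 if label is None else tree_count[label]) * count(v)
                for v, label in arcs.get(u, [])
            )
        return memo[u]

    return count(source)


# Hypothetical toy DAG with two source-to-sink paths.
arcs = {
    "s": [("a", "w1"), ("b", None)],
    "a": [("t", None)],
    "b": [("t", "w2")],
}
tree_count = {"w1": 2, "w2": 3}
total = count_components(arcs, tree_count, "s", "t")  # 2*1 + 1*3
```

Because only one integer per node is stored, the count is obtained without generating any component explicitly.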
Figures 8(a) and (b) illustrate a DAG representation for the set of target v-components of base-vertex v in Figure 4(a). By choosing a path
in the DAG representation, we obtain a target v-component
in Figure 8(b), where
and
. In this case, the number of paths from the source to a sink is eight and the number of all target v-components is eight, since the choice of trees from
is unique.
Fig. 8.
(a) A DAG representation
for the set of target v-components of the base-vertex v in Figure 4(a); (b) The target v-component
induced by a path
.
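Tracing a DAG representation to list components, or to draw one at random, can be sketched as below. The code is a hedged illustration: `enumerate_components`, `sample_component` and the dictionary encoding of arcs and tree sets are assumptions made for this example, and the sketch yields one result per pair of path and tree choice without testing whether two results are isomorphic.

```python
import itertools
import random


def enumerate_components(arcs, trees, source, sink):
    """Yield (path, chosen_trees) for every path and every tree choice."""
    def walk(u, path, labels):
        if u == sink:
            pools = [trees[label] for label in labels]
            for choice in itertools.product(*pools):
                yield tuple(path), choice
            return
        for v, label in arcs.get(u, []):
            extra = [] if label is None else [label]
            yield from walk(v, path + [v], labels + extra)

    yield from walk(source, [source], [])


def sample_component(arcs, trees, source, sink, rng):
    """Draw one (path, chosen_trees) uniformly; assumes at least one exists."""
    memo = {sink: 1}

    def weight(label):
        return 1 if label is None else len(trees[label])

    def count(u):
        if u not in memo:
            memo[u] = sum(weight(l) * count(v) for v, l in arcs.get(u, []))
        return memo[u]

    u, path, choice = source, [source], []
    while u != sink:
        options = arcs[u]
        # An arc is picked with probability proportional to the number of
        # components that continue through it.
        weights = [weight(l) * count(v) for v, l in options]
        v, label = rng.choices(options, weights=weights, k=1)[0]
        if label is not None:
            choice.append(rng.choice(trees[label]))
        path.append(v)
        u = v
    return tuple(path), tuple(choice)


# Hypothetical toy DAG; tree names are placeholders.
arcs = {"s": [("a", "w1"), ("b", None)], "a": [("t", None)], "b": [("t", "w2")]}
trees = {"w1": ["T1", "T2"], "w2": ["T3", "T4", "T5"]}
everything = list(enumerate_components(arcs, trees, "s", "t"))
one = sample_component(arcs, trees, "s", "t", random.Random(0))
```

Weighting each arc by the count of its head node makes every pair of path and tree choice equally likely, so sampling needs no rejection step.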
Defining DAG representations for edge-components
A DAG representation
,
for the set of target e-components
consists of the following:
An acyclic digraph
with a node set
, a set
of sources, a single sink
and an arc set
. The source set
consists of disjoint subsets
. The node set
consists of disjoint subsets
and
. The end-nodes of every arc
satisfy one of “
,” “
” and “
.” The set of sources in
is denoted by
. See Figure 9(a).
An acyclic digraph
with a node set
and an arc set
such that
has a single source
and a single sink
and every path from
to
has length
, where
. Let
denote the set of nodes
whose distance from source
is p. We call a node
(resp.,
) a left-node (resp., a right-node) and call an arc
a left-arc (resp., a right-arc) if u and v are left-nodes (resp., right-nodes) and a middle-arc otherwise. See Figure 9(b).
Functions
and
.
A set W of labels
, where
stands for the frequency vector of a chemical rooted tree.
A set
of chemical rooted trees T for each non-null label
such that the frequency vector of T is equal to the vector implied by
. We also store
.
A function
such that
and
, where
is a null label that stands for the zero-vector. For a directed path P in
, let W(P) denote the multi-set of non-null labels
of arcs in P.
Functions
such that
and
. For a directed path Q in
, let W(Q) denote the multi-set of
of arcs a in Q, and let
denote the multi-set of
of arcs a in Q.
Figure 9 illustrates a DAG representation for the set of target e-components
, where the null labels
are omitted in the figure.
Fig. 9.
An illustration of a DAG representation for the set of target e-components
; (a) A representation
with
for the non-core part, where
is omitted in the figure and each gray circle indicates a source or a sink; (b) A representation
for the core part, where each gray circle indicates a node with distance
or
from the source
.
A target e-component is constructed from a DAG representation as follows.
Choose a path
from the source
to the sink
in
. For example, let
in Figure 9(b), where
,
and
is the middle-arc in Q.
Construct a chemical path
such that
is the chemical element in the chemical symbol
, and
is
of arc
. For the example
in Figure 9(b), we obtain
.
For each
of an arc
in path Q, choose a chemical rooted tree
from the set
; Attach the tree
to the atom at
(resp.,
) in path
if
is a left-arc (resp., a right-arc). For the example
with
in Figure 9(b), we attach a tree
to the atom C at
in path
since
is a left-arc.
- For each
of an arc
in path Q, execute the next to construct an nc-tree
: -
(i)Choose a path
from source
to the sink
in
. Construct a chemical path
such that
is the chemical element in the chemical symbol
and
is the bond-multiplicity
of arc
. -
(ii)For each
of an arc
in the path
, choose a chemical rooted tree
from the set
and attach the tree
to atom
in the path
Let
be one of the resulting chemical trees rooted at
. -
(iii)Attach the tree
to the atom at
(resp.,
) in the path
if
is a left-arc (resp., a right-arc). The resulting chemical tree is a target e-component.
For example, with
in Figure 9(b), we can choose a path
from source
to sink
in
in Figure 9(a), and we construct a tree
in the same manner as in the case of constructing target v-components, and attach the tree
to the atom C at
in the path
since
is a right-arc.
Any target e-component is constructed in the above manner of 1 to 5. Hence the number of all target e-components can be computed as follows. For each source
, the number
of chemical rooted trees
constructed in step 4 can be computed in the same manner as in the case of computing the number of target v-components
in a DAG representation. Let
denote the set of nodes
such that
. After initializing
for each right-arc a with
,
for each right-arc a with
,
for each right-arc a with
and
, choose non-sink right-nodes
in the order of
and compute
to obtain
. For the right-nodes in
, we apply the above procedure to left-nodes
in the order of
after reversing the directions of arcs in
. Finally, for the set
of middle-arcs, the number of all target e-components is given by
.
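The counting scheme for target e-components can be sketched as follows. This is an illustrative sketch only: the function name and the list encoding of arcs are hypothetical, each left-arc or right-arc is assumed to carry an integer weight (1 for a null label, the number of stored trees for a tree label, or the count for the corresponding source of the non-core part), and a middle-arc is assumed to contribute a factor of 1.

```python
from collections import defaultdict


def count_e_components(left_arcs, right_arcs, middle_arcs, source, sink):
    """left_arcs / right_arcs: lists of (tail, head, weight);
    middle_arcs: list of (left_node, right_node)."""
    out_right = defaultdict(list)
    for tail, head, w in right_arcs:
        out_right[tail].append((head, w))
    in_left = defaultdict(list)  # left-arcs with their directions reversed
    for tail, head, w in left_arcs:
        in_left[head].append((tail, w))

    def make_counter(adjacency, base_node):
        memo = {base_node: 1}

        def count(u):
            if u not in memo:
                memo[u] = sum(w * count(v) for v, w in adjacency[u])
            return memo[u]

        return count

    to_sink = make_counter(out_right, sink)      # ways to finish from a right-node
    from_source = make_counter(in_left, source)  # ways to reach a left-node
    return sum(from_source(u) * to_sink(v) for u, v in middle_arcs)


# Hypothetical toy core DAG.
left_arcs = [("s", "l1", 2), ("s", "l2", 1)]
right_arcs = [("r1", "t", 3), ("r2", "t", 1)]
middle_arcs = [("l1", "r1"), ("l2", "r1"), ("l2", "r2")]
total = count_e_components(left_arcs, right_arcs, middle_arcs, "s", "t")
```

Splitting the count at the middle-arc means the left part and the right part are each evaluated once and then multiplied per middle-arc.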
Figures 10(a)-(c) illustrate a DAG representation for the set of target e-components of the base-edge e in Figure 4(b). For this example, we choose a path
from the source
to the sink
in
, where
,
and
is the middle-arc in Q. From this, we have a chemical path
For
of the right-arc
in the path Q, we choose tree
and attach the tree at the atom
at
in
.
Fig. 10.
A DAG representation for the set of target e-components of base-edge
in Figure 4(b): (a) A representation
, for the non-core part, (b) A representation
for the core part, (c) A target e-component
induced by the path
.
For
of left-arc
in the path Q, we choose a path
from source
to sink
in
, from which we have a chemical path
For
of arc
in the path
, we choose tree
and attach the tree to the atom
at
in
to obtain a chemical rooted tree
. Finally, attach the tree
to the atom
at
in
. The resulting target e-component
is illustrated in Figure 10(c). In this example, we obtain
,
and
and the number of target e-components is
.
Computing frequency vectors of subtrees of target components in the forward phase
Step 1: Enumeration of
-fringe trees
Step 1 generates chemical rooted trees with height at most
to compute the following sets in (i)-(iv).
-
(i)
For each base-vertex
such that
, where
, compute the set
,
of rooted c-trees. Note that every c-tree in the set
with
is a target v-component in
; Set
and
; -
(ii)
For each base-vertex
such that
, where
, and integers
and
, compute the sets
of rooted c-trees and
of their frequency vectors; Set
; -
(iii)
For each base-vertex
such that
and each possible tuple
, compute the sets
and
of rooted nc-trees and the sets
and
of their frequency vectors; Set
and
; -
(iv)
For each base-edge
and each possible tuple
, compute the sets
and
of rooted nc-trees and
,
of rooted c-trees and the sets
,
and
of their frequency vectors; For each base-edge
with
and each possible pair (a, d, m) with
,
and
, we compute the sets
of rooted c-trees and
; Set
,
,
and
.
To compute the above sets of trees and vectors, we enumerate all possible trees with height at most
under the size constraint (1) by a branch-and-bound procedure.
Step 2: Generation of frequency vectors of end-subtrees
For each base-vertex
or each base-edge
such that
and each possible tuple
, Step 2 computes the set
in the ascending order of
. Observe that each vector
is obtained as
from a combination of vectors
,
and an edge-configuration
such that
(2)
Figure 11(a) illustrates this process of computing a vector
.
Fig. 11.
(a) An illustration of computing a vector
from the frequency vectors
of a bi-rooted nc-tree T and
of an nc-tree
; (b) An illustration of computing a vector
from the frequency vectors
,
of a c-tree T and
of an nc-tree
.
We call an edge-configuration
feasible to the set
, if at least one vector
is obtained from a combination
and
. We let
store all edge-configurations
feasible to
.
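The way Step 2 builds new frequency vectors from existing ones can be sketched as follows. The sketch assumes, for illustration only, that a new vector is the componentwise sum of the two given vectors plus one occurrence of the connecting edge-configuration, and that vectors exceeding an upper bound vector are discarded. The names `combine_vectors`, `config_index` and `upper` are hypothetical, and the size and compatibility conditions of the actual method are not modelled.

```python
def combine_vectors(W_x, W_y, config_index, upper):
    """Combine two sets of frequency vectors (illustrative sketch).

    W_x, W_y: sets of equal-length integer tuples.
    config_index: dict mapping an edge-configuration name to the vector
        entry that counts it (assumed encoding).
    upper: componentwise upper bound, e.g. the target frequency vector.
    Returns the new vectors and the edge-configurations that produced
    at least one of them.
    """
    new_vectors, feasible_configs = set(), set()
    for x in W_x:
        for y in W_y:
            for gamma, idx in config_index.items():
                w = tuple(
                    a + b + (1 if i == idx else 0)
                    for i, (a, b) in enumerate(zip(x, y))
                )
                if all(wi <= ui for wi, ui in zip(w, upper)):
                    new_vectors.add(w)
                    feasible_configs.add(gamma)
    return new_vectors, feasible_configs


# Hypothetical two-entry vectors.
new_vectors, feasible_configs = combine_vectors(
    {(1, 0)}, {(0, 1), (1, 1)}, {"g0": 0, "g1": 1}, upper=(2, 2)
)
```

Storing vectors in a set keeps one copy of each frequency vector however many trees realise it, which is what keeps the forward phase compact.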
Step 3: Generation of frequency vectors of rooted core-subtrees
For each base-vertex
or each base-edge
such that
and each possible tuple
with
,
Step 3 computes the set
, where
Observe that each vector
is obtained as
from a combination of vectors
,
,
and an edge-configuration
such that
(3)
where
. Figure 11(b) illustrates this process of computing a vector
.
We call an edge-configuration
feasible to the set
, if at least one vector
is obtained from a combination of vectors
and
. We let
store all edge-configurations
feasible to
.
For each base-vertex
, it holds that
. Step A generates all target v-components
such that
based on the sets of frequency vectors generated in Steps 1 to 3.
Note that the set
is constructed in Step 1 for integers
and in Step 3 for integers
.
Step 4: Generation of frequency vectors of bi-rooted core-subtrees
For each base-edge
, each index
and each possible tuple
with
, Step 4 computes the set
in the ascending order of
. Let
denote the c-tree with a single vertex v such that
.
For
, we see that each vector
is obtained as
from a combination of vectors
,
and an edge-configuration
such that
(4)
Figure 12(a) illustrates this process of computing a vector
.
Fig. 12.
An illustration of computing a vector
(a) For
, a vector
is obtained from the vectors
of a rooted c-tree T and
of the c-tree
; (b) For
, a vector
is obtained from the vectors
of a rooted c-tree T and
of a c-tree
.
For
, observe that each vector
is obtained as
from a combination of vectors
,
and an edge-configuration
such that
(5)
Figure 12(b) illustrates this process of computing a vector
.
We call an edge-configuration
with
feasible to the set
, if at least one vector
is obtained from a combination of vectors
and
(or
for
). We let
store all edge-configurations
feasible to
.
Step 5: Enumeration of feasible vector pairs
For each edge
, a feasible vector pair is defined to be a pair of vectors
,
with
that admits an edge-configuration
such that
(6)
Let
denote the set of feasible vector pairs for a base-edge
. We also call each of the two vectors in a feasible vector pair
feasible, and let
,
denote the set of feasible vectors
. Figure 13 illustrates a feasible vector pair
.
Fig. 13.
An illustration of computing a feasible vector pair
with
of c-trees
for a base-edge
.
The last equality in (6) is equivalent to the condition that
is equal to the vector
, which we call the
-complement of
, and denote it by
.
For each edge
, Step 5 enumerates the set
of all feasible vector pairs
. To efficiently search for a feasible pair of vectors in two sets
,
with
, we first compute the
-complement vector
of each vector
for each edge-configuration
with
, and denote by
the set of the resulting
-complement vectors. Observe that
is a feasible vector pair if and only if
. To find such pairs, we merge the sets
and
into a sorted list
. Then each feasible vector pair
appears as a consecutive pair of vectors
and
in the list
.
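The search for feasible vector pairs through complement vectors and a sorted merge can be sketched as follows. The sketch assumes, for illustration, that the complement of a vector is the target vector minus that vector minus one occurrence of the chosen edge-configuration. The names `feasible_vector_pairs` and `config_index` are hypothetical, and any further compatibility conditions on the edge-configuration are ignored.

```python
from itertools import groupby


def feasible_vector_pairs(W1, W2, target, config_index):
    """Return triples (x1, x2, gamma) whose parts add up to the target."""
    tagged = []
    for x1 in W1:
        for gamma, idx in config_index.items():
            comp = tuple(
                t - a - (1 if i == idx else 0)
                for i, (t, a) in enumerate(zip(target, x1))
            )
            if min(comp) >= 0:  # negative entries can never match
                tagged.append((comp, 0, x1, gamma))
    for x2 in W2:
        tagged.append((tuple(x2), 1, x2, None))

    # Merge both sets into one list sorted by vector; within a run of equal
    # vectors the complements (side 0) precede the vector of W2 (side 1).
    tagged.sort(key=lambda rec: (rec[0], rec[1]))

    pairs = []
    for _, run in groupby(tagged, key=lambda rec: rec[0]):
        run = list(run)
        if run[-1][1] == 1:  # a vector of W2 closes this run
            x2 = run[-1][2]
            pairs.extend(
                (x1, x2, gamma) for _, side, x1, gamma in run if side == 0
            )
    return pairs


# Hypothetical two-entry vectors with target (3, 2).
pairs = feasible_vector_pairs(
    {(1, 1), (2, 0)}, {(1, 1), (0, 1), (1, 0)}, (3, 2), {"g0": 0, "g1": 1}
)
```

Sorting once and scanning adjacent entries avoids comparing every vector of one set against every vector of the other.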
From a feasible vector pair
for a base-edge
, Step B generates all target e-components
such that
consists of two c-trees
and
with
,
.
Constructing DAG representations in the backward phase
Step A: Constructing target v-components from frequency vectors
For each base-vertex
such that
, the set of target v-components is constructed in Step 1. Let
be a base-vertex such that
, where
. Based on the sets of frequency vectors computed in Steps 1, 2 and 3, we construct a DAG representation
of the set of target v-components defined in Section 3.7. Step A consists of three steps. Step A1 discards unnecessary vectors from the vector sets computed in Steps 1, 2 and 3 of Section 3.9.
Step A1.
-
(i)
Note that
. We call the vector
feasible. Let
. -
(ii)
For each possible tuple
, we define “feasible vectors” as follows. For each
, if a pair of vectors
and
satisfies (3) and the vector
is feasible, then we call each of these vectors
and
feasible. Let
,
denote the set of feasible vectors
and
denote the set of feasible vectors
. -
(iii)
For each possible tuple
with
, we define “feasible vectors” in the descending order of
as follows. For each
, if a pair of vectors
and
satisfies (2) and the vector
is feasible, then we call each of these vectors
and
feasible. Let
denote the set of feasible vectors
and
denote the set of feasible vectors
.
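The idea of discarding unnecessary vectors by working backward from the target can be sketched as follows. This is a simplified model rather than the procedure of Step A1 itself: it assumes the forward phase has logged, in the order of computation, which vectors each new vector was obtained from, and the names `prune_backward` and `derivations` are hypothetical.

```python
def prune_backward(derivations, targets):
    """Keep only what contributes to a target (illustrative sketch).

    derivations: list of (result, parts) in forward order, so every part
        of a result is derived earlier in the list; parts is a tuple of
        previously derived items (empty for items produced in Step 1).
    targets: items known to be feasible at the end of the forward phase.
    """
    feasible = set(targets)
    kept = []
    # Scanning in reverse decides the feasibility of a result before any
    # derivation of its parts is examined.
    for result, parts in reversed(derivations):
        if result in feasible:
            feasible.update(parts)
            kept.append((result, parts))
    kept.reverse()
    return feasible, kept


# Hypothetical log: "d" never contributes to the target "e".
derivations = [
    ("a", ()),
    ("b", ()),
    ("c", ("a", "b")),
    ("d", ("b",)),
    ("e", ("c",)),
]
feasible, kept = prune_backward(derivations, {"e"})
```

In this model the derivations that survive play the role of the arcs created later, since each one records how a feasible vector is assembled from feasible parts.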
Step A2 constructs a set N of nodes and a function
for the DAG representation. A node in N corresponds to a frequency vector
of a chemical tree T and we denote the node by
for notational simplicity.
Step A2: Constructing N.
-
(i)
Create a unique sink
in N. -
(ii)
For each integer
, create a node
called an internal-node (an int-node, for short) for each feasible vector
, set
and let
denote the set of these int-nodes. -
(iii)
For the feasible vector
, create a node
in N, which will be the unique source in (N, A) and set
with
. Let
consist of source
. -
(iv)
Let
.
Step A3 creates arcs in our DAG representation by re-executing some part of the forward phase in Steps 1, 2 and 3 of Section 3.9. We first execute Step 1, next execute Step 2 only for feasible vectors in the ascending order of
and then execute Step 3 only for feasible vectors as follows.
Step A3: Constructing A.
-
(i)
(Step 1(iii)) For each feasible vector
, create an arc
that leaves from the int-node
and enters sink
, and set
and
. For the resulting arc
, we set
and
. -
(ii)
(Step 2) For each tuple
such that
and
, create an arc
, if there are feasible vectors
and
such that
is a feasible vector for an edge-configuration
. In this case, we set
and
and create an arc
that leaves from the int-node
and enters the int-node
. For the resulting arc
, we set
and
. -
(iii)
(Step 3) Create an arc
, if there are feasible vectors
and
such that
is a feasible vector for an edge-configuration
. In this case, we set
and
and create an arc
that leaves from the source
and enters the int-node
. For the resulting arc
, we set
and
. -
(iv)
Let A(h),
denote the set of arcs
such that
or
. Set
.
For base-vertex
in Figure 4(a), Step 1(ii) computes one c-tree
in Figure 14(a) and the set of the frequency vectors
such that
Fig. 14.
An illustration of chemical rooted/bi-rooted trees: (a) The c-tree
and nc-trees
in Step 1 for base-vertex
in Figure 4(a); (b) The c-trees
and nc-trees
in Step 1 for base-edge
in Figure 4(b).
,
and Step 1(iii) computes seven nc-trees
in Figure 14(a) and the sets of their frequency vectors
such that
,
,
,
,
,
,
,
,
and
.
Step B: Constructing target e-components from frequency vectors
Let
be a base-edge. Recall that Step 5 computes the set
of feasible vector pairs
and the set
of feasible vectors for each index
. For each feasible vector pair
, a target e-component
is obtained as a pair of c-trees
and
with a bond-multiplicity m such that
with
, where we call
the part-i of the target e-component
. For each index
, let
denote the set of the c-trees T such that
is a feasible vector
.
In this section, we design an algorithm for constructing a DAG representation of the c-trees
,
.
Based on the sets of frequency vectors computed in Steps 1 to 5, we construct a DAG representation of the set of target e-components. Step B consists of five steps. Step B1 discards unnecessary vectors from the vector sets computed in Steps 1 to 4.
Step B1: Discarding unnecessary vectors
-
(i)
Compute the sets of feasible vectors,
,
. -
(ii)
For each possible tuple
with
, we define “feasible vectors” in the descending order
as follows. For each
with
and
, if a pair of vectors
and
satisfies (5) and the vector
is feasible, then we call each of these vectors
and
feasible. Let
denote the set of feasible vectors
and
denote the set of feasible vectors
. For each
with
and
, if a pair of vectors
and
satisfies (4) and the vector
is feasible, then we call such a vector
feasible. Let
denote the set of feasible vectors
. -
(iii)
For each possible tuple
with
, we define “feasible vectors” in the descending order of
as follows. For each
, if a pair of vectors
and
satisfies (3) and the vector
is feasible, then we call each of these vectors
and
feasible. Let
,
denote the set of feasible vectors
and
denote the set of feasible vectors
. -
(iv)
For each possible tuple
with
, we define “feasible vectors” in the set
in the descending order of
as follows. For each
, if a pair of vectors
and
satisfies (2) and the vector
is feasible, then we call each of these vectors
and
feasible. Let
denote the set of feasible vectors
and
denote the set of feasible vectors
.
Steps B2 and B3 construct the non-core part
of a DAG representation of target e-components in a manner similar to Steps A2 and A3.
Step B2: Constructing
-
(i)
Create a unique sink
in
. -
(ii)
For each integer
, create a node
, called an internal-node (an int-node, for short) for each feasible vector
, set
and let
denote the set of these int-nodes. -
(iii)
For each integer
, create a node
for each feasible vector
set
with
and let
denote the set of these int-nodes, where each node in
will be a source in
. -
(iv)
Let
.
Step B3: Constructing
-
(i)
Step 1(iv): For each feasible vector
, set
and
and create an arc
that leaves from int-node
and enters sink
, set
and
for the arc
. -
(ii)
Step 2: For each tuple
such that
and
, if there are feasible vectors
and
such that
is a feasible vector for an edge-configuration
, then set
and
and create an arc
that leaves from int-node
and enters int-node
and set
and
for the arc
. -
(iii)
Step 3: For each tuple
such that
,
, if there are feasible vectors
and
such that
is a feasible vector for an edge-configuration
, then set
and
and create an arc
that leaves from source
and enters int-node
and set
and
for the arc
. -
(iv)
Let
,
denote the set of arcs
such that
or
. Set
.
For base-edge
in Figure 4(b), Step 1(iv) computes six c-trees
in Figure 14(b) and the sets of their frequency vectors
such that
,
,
,
,
,
,

and five nc-trees
in Figure 14(b) and the sets of their frequency vectors
such that
,
,
,
,
,
and
.
Steps B4 and B5 construct the core part
of a DAG representation of target e-components.
Step B4: Constructing
-
(i)
For each
, create nodes
, set
with
and let
and
. -
(ii)
For each integer
, create a node
called a core-node (a c-node, for short) for each feasible vector
, set
and let
(resp.,
) denote the set of these c-nodes if
(resp.,
). -
(iii)
Let
.
Step B5: Constructing
-
(i)Step 4: For each integer
and each tuple
such that
, execute the following:- If there are feasible vectors
with
and
such that
is a feasible vector for an edge-configuration
with
, then create an arc a between c-nodes
and
with
and
so that for
, arc a is a left-arc
directed from
to
; for
, arc a is a right-arc
directed from
to
. - If there are feasible vectors
with
and
such that
is a feasible vector for an edge-configuration
with
, then create an arc a between c-nodes
and
with
and
so that for
, arc a is a left-arc
directed from
to
; for
, arc a is a right-arc
directed from
to
.
-
(ii)
Step 5: For each feasible vector pair
, create a middle-arc
that leaves from c-node
and enters c-node
and set
and
for the bond-multiplicity m determined for the pair
uniquely by (6). -
(iii)
Let
,
denote the set of arcs
such that
and set
.
Experimental results
We implemented our new dynamic programming algorithm that generates the target components and conducted experiments to evaluate its computational efficiency. We executed the experiments on a PC with a 3.0 GHz Core i7-9700 processor and 16 GB of DDR4 RAM. We used ChemDoodle version 10.2.0 for constructing 2D drawings of chemical graphs.
We set a branch parameter
to be 2. We conducted the following three experiments.
Select some chemical compounds
in PubChem and a vertex v in
and compute the DAG representation
of the set of all target v-components of the frequency vector
of the v-component
of
at v.
Select some chemical compounds
in PubChem and a u, v-path P such that all internal vertices in P are of degree 2 in the core
and compute the DAG representation
of the set of all target e-components of the frequency vector
of the e-component
of
at e by regarding uv as a base-edge
.
Select some chemical compounds
in PubChem and a base-graph
and compute the set of the DAG representations
for the target v-components and the DAG representations
.
We set a time limit of 3600 sec. In the following tables, M.O. and T.O. mean memory out and time out, respectively. The proposed algorithm is compared with the state-of-the-art compound generator MOLGEN developed by Gugisch et al.38. Using the online version39 of MOLGEN, we generated
(the available limit) isomers for each instance discussed in the following sections. In these experiments, we applied available restrictions on the isomers such as the number of bonds, cycles, maximum bond multiplicity, single bonds, double bonds and triple bonds. Note that the isomers generated by MOLGEN can be structurally different from those generated by the proposed method.
Results of experiment 1
For this experiment, we selected six instances
as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 7600, CID: 152211, CID: 497892, CID: 46930263, CID: 67558426 and CID: 46930349 which are denoted by
,
, respectively. Each chemical instance
has one benzene ring as the core and only one core vertex v that is adjacent to a non-core vertex as illustrated in Figures 4(a) and 15. For this vertex v, we generate target v-components.
Fig. 15.
(a)-(e) Instances
to compute target v-components, respectively. All instances have one benzene ring and a chemical acyclic graph
joining vertex v in the benzene ring and vertex u.
Table 1 shows the results of computing v-components, where we denote the following:
i: the instance
,
;
: the set of chemical elements in the v-component
in instance
;
: the number
of vertices in the v-component
;
: the number of different edge-configurations of 2-internal edges in the v-component
;
: the number of different edge-configurations of 2-external edges in the v-component
;
: the height
of the v-component
;
D-time: the running time (sec.) to construct the DAG representation
;
: the number of vertices in
;
: the number of edges in
;
p-time: the running time (sec.) to trace all paths from the sources to the sinks in
;
p: the number of all paths from the sources to the sinks in
;
T-LB: a lower bound on the number of all target v-components
;
G-time (resp., G-time38): the running time (sec.) to construct all (or up to
) (resp.,
) target v-components
(resp., isomers by MOLGEN38) from
;
(resp.,
38): the number of all (or up to
) (resp.,
) target v-components
(resp., isomers by MOLGEN38) generated from
.
From Table 1, we observe that with the increase in the size of
, there is no significant increase in D-time, p-time and G-time. Furthermore, the D-time, p-time and G-time are bounded above by 0.162, 0.255, and 0.136, respectively, from which it is evident that the proposed algorithm can generate target v-components efficiently. Note that the running time of the proposed algorithm is significantly lower than that of MOLGEN38.
Table 1.
Results for Computing Target v-components.
| i | no. of vertices | chemical elements | no. of edge-configurations (2-internal, 2-external) | height | D-time (sec.) | no. of vertices, edges in the DAG | p-time (sec.) | #p | T-LB | G-time (sec.) | no. generated | G-time38 (sec.) | no. generated38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ![]() | 10 | C,O,N | 5, 3 | 7 | 0.000 | 20, 27 | 0.000 | 8 | 8 | 0.000 | 8 | 31.643 | ![]() |
| ![]() | 20 | C,O,N | 6, 7 | 11 | 0.008 | 343, 780 | 0.001 | 1440 | 1440 | 0.016 | 1440 | 46.22 | ![]() |
| ![]() | 25 | C,O,N | 7, 5 | 14 | 0.100 | 3427, | 0.192 | ![]() | ![]() | 0.121 | ![]() | 33.185 | ![]() |
| ![]() | 27 | C,O,N | 9, 6 | 17 | 0.115 | 6523, | 0.163 | ![]() | ![]() | 0.129 | ![]() | 58.185 | ![]() |
| ![]() | 28 | C,O,N | 8, 7 | 17 | 0.129 | 5627, | 1.640 | ![]() | ![]() | 0.132 | ![]() | 61.73 | ![]() |
| ![]() | 29 | C,O,N | 9, 6 | 19 | 0.162 | 8967, | 0.255 | ![]() | ![]() | 0.136 | ![]() | 42.724 | ![]() |
Results of experiment 2
For this experiment, we selected six instances
as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 3729083, CID: 129130, CID: 47622, CID: 195338, CID: 497867 and CID: 10325899 which are denoted by
,
, respectively. Each chemical instance
has two benzene rings and a chemical acyclic graph
joining them, where we denote by u and v the common vertices with the rings and
as illustrated in Figures 4(b) and 16. We regard
as a base-edge and generate target e-components of
.
Fig. 16.
(a)-(e) Instances
,
to compute target e-components, respectively. All instances have two benzene rings and a chemical acyclic graph
joining two vertices u and v in these benzene rings.
Table 2 shows the results of computing e-components, where we denote the following:
i: the instance
,
;
: the set of chemical elements in the e-component
in instance
;
: the number
of vertices in
;
: the number of different edge-configurations of the core edges in
;
: the number of different edge-configurations of 2-internal edges in
;
: the number of different edge-configurations of 2-external edges in
;
: the number of 2-branch core vertices in
;
: the maximum height of a chemical rooted tree rooted at a core vertex in
;
D-time: the running time (sec.) to construct the DAG representation
;
: the number of vertices in
;
: the number of edges in
;
p-time: the running time (sec.) to trace all paths from the sources to the sinks in
;
p: the number of all paths from the sources to the sinks in
;
T-LB: a lower bound on the number of all target e-components
;
G-time (resp., G-time38): the running time (sec.) to construct all (or up to
) (resp.,
) target e-components
(resp., isomers by MOLGEN38) from
;
(resp.,
38): the number of all (or up to
) (resp.,
) target e-components
(resp., isomers by MOLGEN38) generated from
.
By Table 2, the D-time, p-time and G-time are bounded above by 9.580, 6.420, and 0.168, respectively. This implies that the proposed algorithm can efficiently generate target e-components. Moreover, the running time of the proposed algorithm is significantly lower than that of MOLGEN38.
Table 2.
Results for Computing e-components.
| i | no. of vertices | chemical elements | no. of edge-configurations (core, 2-internal, 2-external) | no. of 2-branch core vertices, max. height | D-time (sec.) | no. of vertices, edges in the DAG | p-time (sec.) | #p | T-LB | G-time (sec.) | no. generated | G-time38 (sec.) | no. generated38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ![]() | 17 | C,O,N | 4, 3, 3 | 1, 7 | 0.000 | 24, 27 | 0.000 | 12 | 12 | 0.000 | 12 | 28.822 | ![]() |
| ![]() | 17 | C,O,N | 3, 2, 5 | 1, 4 | 0.001 | 78, 116 | 0.001 | 180 | 180 | 0.002 | 180 | 29.083 | ![]() |
| ![]() | 20 | C,O,N | 4, 0, 3 | 0, 2 | 0.001 | 193, 313 | 0.004 | 1512 | 1512 | 0.020 | 1512 | 97.158 | ![]() |
| ![]() | 25 | C,O,N | 6, 3, 4 | 1, 5 | 0.011 | 525, 924 | 0.024 | ![]() | ![]() | 0.152 | ![]() | 39.25 | ![]() |
| ![]() | 30 | C,O,N | 3, 6, 6 | 1, 9 | 0.908 | 788, 2275 | 0.310 | ![]() | ![]() | 0.168 | ![]() | 59.099 | ![]() |
| ![]() | 32 | C,O,N | 3, 6, 7 | 3, 9 | 9.580 | , | 6.420 | ![]() | ![]() | 0.165 | ![]() | 58.792 | ![]() |
Results of experiment 3
For this experiment, we selected five
instances as follows. We selected from the database PubChem five cyclic chemical compounds, CID: 1356, CID: 89834791, CID: 334516, CID: 91420002 and CID: 124165467 which are denoted by
,
, respectively. These instances are illustrated in Figure 17.
Fig. 17.
(a)-(e) Instances
,
, respectively, to generate isomers, where a set
is selected as the set of vertices indicated with asterisks.
Table 3 shows the results of computing chemical isomers
of
, where we denote the following:
i: the instance
,
;
: the number
of the given chemical graph
in instance
;
: the number of chemical elements in
;
: the number of different edge-configurations of the core edges in
;
: the number of different edge-configurations of 2-internal edges in
;
: the number of different edge-configurations of 2-external edges in
;
: the number
of base-vertices of a base-graph
selected to
;
: the number
of base-edges of a base-graph
selected to
;
: the core size
of
;
: the number of 2-branch core vertices in
;
: the core height
of
;
D-time: the running time (sec.) to construct the set of DAG representations
and
;
: the total number of vertices in the set of DAG representations
and
;
: the total number of edges in the set of DAG representations
and
;
D: the number of DAG representations successfully constructed in a time limitation out of the total number
of DAG representations defined to
;
G-LB: a lower bound on the number of all chemical isomers
of
;
G-time (resp., G-time38): the running time (sec.) to construct all (or up to
) (resp.,
) chemical isomers
of
(resp., isomers by MOLGEN38);
(resp.,
38): the number of all (or up to
) (resp.,
) chemical isomers
of
generated from the set of DAG representations (resp., isomers generated by MOLGEN38);
#non-iso
: the number of non-isomorphic chemical isomers generated by the proposed algorithm.
By Table 3, the D-time of all the instances except
is bounded above by 0.004 sec., which is very small. Instance
has a relatively large DAG representation with
and
vertices and edges, respectively, and thus the D-time for
is 24.600. However, the G-time for all the instances is bounded above by 0.266, and hence the proposed algorithm can efficiently generate isomers for an instance with around 70 non-hydrogen atoms. Furthermore, the proposed algorithm outperforms MOLGEN38 in terms of running time and the number of isomers.
Table 3.
Results for Computing Chemical Isomers
of
.
| i | no. of vertices | no. of chemical elements | no. of edge-configurations (core, 2-internal, 2-external) | no. of base-vertices, base-edges | core size, no. of 2-branch core vertices, core height | D-time (sec.) | total no. of vertices, edges in the DAGs | #D | G-LB | G-time (sec.) | no. generated | #non-iso | G-time38 (sec.) | no. generated38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ![]() | 40 | 3 | 9, 1, 3 | 6, 8 | 32, 1, 3 | 0.002 | 429, 803 | 2 | ![]() | 0.180 | ![]() | ![]() | 30.964 | ![]() |
| ![]() | 50 | 3 | 9, 9, 8 | 5, 7 | 22, 2, 9 | 24.600 | , | 7 | ![]() | 0.195 | ![]() | ![]() | 43.575 | ![]() |
| ![]() | | | | | | | | | | | | | | |
| ![]() | 50 | 3 | 10, 0, 2 | 10, 13 | 25, 0, 1 | 0.000 | 134, 184 | 5 | 6912 | 0.153 | 6912 | 6912 | 32.978 | ![]() |
| ![]() | 50 | 3 | 7, 4, 8 | 5, 7 | 24, 4, 6 | 0.001 | 124, 165 | 7 | 4608 | 0.090 | 4608 | 3312 | 77.341 | ![]() |
| ![]() | 70 | 3 | 8, 4, 7 | 5, 7 | 42, 4, 6 | 0.004 | 303, 392 | 7 | ![]() | 0.266 | ![]() | 5428 | 46.476 | ![]() |
For an in-depth analysis of our algorithm, we randomly selected 100 instances for each
from the PubChem database and generated chemical isomers. We index the CIDs of these instances for each
from 1 to 100, and the CIDs are given in the supplementary material S1_instances. We show the time D-time to construct DAG representations, and the time G-time to generate chemical isomers for each
in Figures 18(a)-(i). The average D-time and G-time over all instances are 0.1884 and 0.0565 sec., respectively, which are reasonably small. Thus it is evident that the proposed algorithm can efficiently generate small chemical isomers. Similarly, we show the number
of chemical isomers and the number #non-iso
of non-isomorphic isomers generated by the proposed algorithm for each
in Figures 19(a)-(i). From Figure 19 we observe that for most of the instances, the difference
is very small, and hence the proposed algorithm can efficiently generate a large number of non-isomorphic chemical isomers with around 70 non-hydrogen atoms.
Fig. 18.
(a)-(i) Plots of D-time and G-time.
Fig. 19.
(a)-(i) Plots of #
and #non-iso
.
Concluding remarks
In this paper, we considered the problem of enumerating all chemical isomers
of a given chemical
-lean chemical graph
in the sense that
and
have a common base graph
and satisfy
for the feature function based on the frequency of edge-configurations. The dynamic programming algorithm designed by Akutsu and Nagamochi37 can find only a limited number of chemical isomers. In this paper, we improved that algorithm and designed a new backtracking procedure to construct a compact DAG representation of the set of all chemical isomers. By tracing the DAG representation, we can count all chemical isomers and generate any of them, at random if necessary. We implemented the proposed method, and our computational results suggest that the DAG representation of chemical isomers can be constructed for an instance with around 70 vertices. These experiments indicate that the proposed algorithm can help in discovering novel drugs by efficiently exploring the search space of target chemical compounds.
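The idea of counting and sampling by tracing a DAG can be sketched as follows. This is a simplified illustration, not our implementation: it assumes that every source-to-sink path encodes exactly one solution, which need not match the semantics of the DAG representation constructed in this paper, and the graph `TOY_DAG` and the functions `count_paths` and `sample_path` are hypothetical names introduced only for this example.

```python
import random

# Hypothetical toy DAG: node -> list of successor nodes.
TOY_DAG = {
    "s": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": ["t"],
    "d": ["t"],
    "t": [],
}


def count_paths(dag, source, sink):
    """Return, for each node reachable from source, its number of paths to sink."""
    counts = {}

    def visit(node):
        if node not in counts:
            if node == sink:
                counts[node] = 1
            else:
                counts[node] = sum(visit(nxt) for nxt in dag[node])
        return counts[node]

    visit(source)
    return counts


def sample_path(dag, source, sink, counts, rng):
    """Draw one source-to-sink path uniformly at random among all such paths."""
    if counts.get(source, 0) == 0:
        raise ValueError("no path from source to sink")
    path = [source]
    node = source
    while node != sink:
        # Choose each arc with probability proportional to the number of
        # paths that continue through its head.
        successors = [n for n in dag[node] if counts.get(n, 0) > 0]
        weights = [counts[n] for n in successors]
        node = rng.choices(successors, weights=weights, k=1)[0]
        path.append(node)
    return path


counts = count_paths(TOY_DAG, "s", "t")
total = counts["s"]
example = sample_path(TOY_DAG, "s", "t", counts, random.Random(0))
print(total)  # 3
```

Counting needs only one pass over the arcs, and weighting each step by the stored counts makes every path equally likely, so neither operation requires listing all solutions explicitly.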
While the proposed method demonstrates promising performance for graphs with up to around 70 vertices, it may face scalability issues for larger chemical structures. Moreover, the method is currently only applicable to
-lean chemical graphs and relies on specific base-graph definitions, which may limit its general applicability. Additionally, the feature function is restricted to edge-configuration frequencies, and a rigorous analysis of the computational complexity remains as future work.
Acknowledgements
This research was supported, in part, by the Japan Society for the Promotion of Science, Japan, under Grant #22H00532.
Author contributions
Ryota Ido: Software, validation, data resources; Naveed Ahmed Azam: Software, validation, data resources, formal analysis, writing—review and editing; Jianshen Zhu: Software, validation, data resources; Hiroshi Nagamochi: Conceptualization, methodology, formal analysis, data resources, writing—original draft preparation, project administration; Tatsuya Akutsu: Conceptualization, methodology, data resources, writing—review and editing, funding acquisition. All authors read and approved the final manuscript.
Funding
This research was supported, in part, by the Japan Society for the Promotion of Science, Japan, under Grants #18H04113, #22H00532, and #22K19830.
Data availability
Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-05976-0.
References
- 1. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4(2), 268–276 (2018).
- 2. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4(1), 120–131 (2017).
- 3. Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. & Tsuda, K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18(1), 972–976 (2017).
- 4. Kusner, M. J., Paige, B. & Hernández-Lobato, J. M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 1945–1954 (2017).
- 5. De Cao, N. & Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv:1805.11973 (2018).
- 6. Madhawa, K., Ishiguro, K., Nakago, K. & Abe, M. GraphNVP: an invertible flow model for generating molecular graphs. arXiv:1905.11600 (2019).
- 7. Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. arXiv:2001.09382 (2020).
- 8. Arockiaraj, M. et al. Topological and entropy indices in QSPR studies of N-carbophene covalent organic frameworks. BioNanoSci. 14(3), 2762–2773 (2024).
- 9. Zhang, X. et al. Distance-based topological characterization, graph energy prediction, and NMR patterns of benzene ring embedded in P-type surface in 2D network. Sci. Rep. 14(1), 23766 (2024).
- 10. Arockiaraj, M., Greeni, A. B., Kalaam, A. R. A., Aziz, T. & Alharbi, M. Mathematical modeling for prediction of physicochemical characteristics of cardiovascular drugs via modified reverse degree topological indices. Eur. Phys. J. E, Soft Matter. 47(8), 53 (2024).
- 11. Ikebata, H., Hongo, K., Isomura, T., Maezono, R. & Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 31(4), 379–391 (2017).
- 12. Miyao, T., Kaneko, H. & Funatsu, K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 56(2), 286–299 (2016).
- 13. Rupakheti, C., Virshup, A., Yang, W. & Beratan, D. N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 55(3), 529–537 (2015).
- 14. Wazzan, S., Hayat, S. & Ismail, W. Optimizing structure-property models of three general graphical indices for thermodynamic properties of benzenoid hydrocarbons. J. King Saud Univ. Sci. 36(11), 103541 (2024).
- 15. Hayat, S. et al. Optimizing predictive models for evaluating the F-temperature index in predicting the π-electron energy of polycyclic hydrocarbons, applicable to carbon nanocones. Sci. Rep. 14(1), 25494 (2024).
- 16. Hayat, S., Arfan, A., Khan, A., Jamil, H. & Alenazi, M. J. F. An optimization problem for computing predictive potential of general sum/product-connectivity topological indices of physicochemical properties of benzenoid hydrocarbons. Axioms 13(6), 342 (2024).
- 17. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 16(1), 3–50 (1996).
- 18. Blum, L. C. & Reymond, J. L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131(25), 8732–8733 (2009).
- 19. Meringer, M. & Schymanski, E. L. Small molecule identification with MOLGEN and mass spectrometry. Metabolites 3(2), 440–462 (2013).
- 20. Benecke, C. et al. MOLGEN+, a generator of connectivity isomers and stereoisomers for molecular structure elucidation. Anal. Chim. Acta 314(3), 141–147 (1995).
- 21. Kerber, A., Laue, R., Grüner, T. & Meringer, M. MOLGEN 4.0. Match Commun. Math. Comput. Chem. 37, 205–208 (1998).
- 22. Peironcely, J. E. et al. OMG: Open Molecule Generator. J. Cheminform. 4(1), 21 (2012).
- 23. Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48(3), 722–730 (2015).
- 24. Fujiwara, H., Wang, J., Zhao, L., Nagamochi, H. & Akutsu, T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 48(7), 1345–1357 (2008).
- 25. Li, J., Nagamochi, H. & Akutsu, T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 15(2), 633–646 (2016).
- 26. Suzuki, M., Nagamochi, H. & Akutsu, T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J. Cheminform. 6(1), 31 (2014).
- 27. Tamura, Y. et al. Enumerating chemical graphs with mono-block 2-augmented tree structure from given upper and lower bounds on path frequencies. arXiv:2004.06367 (2020).
- 28. Yamashita, K. et al. Enumerating chemical graphs with two disjoint cycles satisfying given path frequency specifications. arXiv:2004.08381 (2020).
- 29. Vogt, M. & Bajorath, J. Chemoinformatics: a view of the field and current trends in method development. Bioorg. Med. Chem. 20(18), 5317–5323 (2012).
- 30. Azam, N. A. et al. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies – Volume 3: BIOINFORMATICS, Valletta, Malta. 101–108 (2020).
- 31. Zhang, F. et al. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan. 433–444. 10.1007/978-3-030-55789-8_38 (2020).
- 32. Azam, N. A. et al. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. Algorithms for Mol. Biol. 16(1), 18. 10.3390/a13050124 (2020).
- 33. Ito, R. et al. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming. In Proceedings of the 21st International Conference on Bioinformatics & Computational Biology (2020).
- 34. Zhu, J., Wang, C., Shurbevski, A., Nagamochi, H. & Akutsu, T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms 13(5), 124. 10.3390/a13050124 (2020).
- 35. Zhu, J. et al. A novel method for inferring of chemical compounds with prescribed topological substructures based on integer programming. IEEE/ACM Trans. Comput. Biol. and Bioinform. arXiv:2010.09203 (2020).
- 36. Azam, N. A., Zhu, J., Ido, R., Nagamochi, H. & Akutsu, T. Experimental results of a dynamic programming algorithm for generating chemical isomers based on frequency vectors. In Proceedings of the Fourth International Workshop on Enumeration Problems and Applications: WEPA. Online, paper ID 15 (2020).
- 37. Akutsu, T. & Nagamochi, H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv:2010.09203 (2020).
- 38. Gugisch, R., Kerber, A., Kohnert, A., Laue, R., Meringer, M., Rücker, C. & Wassermann, A. MOLGEN 5.0, a molecular structure generator. In Advances in Mathematical Chemistry and Applications: Revised Edition, 1, 113–138 (2016).
- 39. MOLGEN Team. MOLGEN 5.0, a molecular structure generator. Available at https://www.molgen.de/online.html. Accessed 2025.