A dynamic programming algorithm for generating chemical isomers based on frequency vectors

Ryota Ido; Naveed Ahmed Azam; Jianshen Zhu; Hiroshi Nagamochi; Tatsuya Akutsu

2025 Jul 1;15:22214. doi: 10.1038/s41598-025-05976-0


PMCID: PMC12219021 PMID: 40594631

Abstract

We propose a dynamic programming algorithm that generates chemical isomers of a given chemical compound with cycles. We represent a chemical compound as a chemical graph and define its feature vector based on graph-theoretical descriptors. Our descriptors mainly consist of the occurrence of “edge-configuration” that captures the information of adjacent atoms such as their degrees and bond-multiplicity. We call two chemical graphs chemical isomers of each other if they have the same feature vector and share a common prescribed structure. Our proposed algorithm produces a compact representation of all chemical isomers of a given chemical graph. This representation enables efficient counting of chemical isomers without requiring explicit generation. Furthermore, our algorithm allows us to enumerate any number of isomers, even at random. For example, our compact representation for a chemical graph with 70 non-hydrogen atoms contains around 400 arcs in which Inline graphic chemical isomers are embedded. The proposed algorithm serves as a powerful tool for accelerating chemical compound exploration, particularly in drug discovery and material science, where identifying novel molecular structures is critical. By efficient enumeration of isomers, our approach enhances the search space exploration for target chemical compounds, facilitating advancements in molecular design.

Keywords: Molecular Design, Enumeration of Graphs, Dynamic Programming

Subject terms: Computational biology and bioinformatics, Computer science

Introduction

Graphs are a fundamental data structure in computer science and have been extensively utilized in computational molecular biology, especially for representing chemical molecules. The design of novel graph structures has recently gained significant attention in artificial neural network (ANN) research and related fields. In particular, extensive studies have been done on designing chemical graphs with desired chemical properties because of their potential application to drug design. For example, variational autoencoders¹, recurrent neural networks^2,3, grammar variational autoencoders⁴, generative adversarial networks⁵, and invertible flow models^6,7 have been applied.

Quantitative structure-activity/property relationship (QSAR/QSPR) analysis comprises computational modeling techniques used in cheminformatics that aim to establish mathematical relationships between the structural attributes of chemical compounds and their biological activities or physicochemical properties^8–10. The design of chemical graphs has also been studied for many years in cheminformatics, where the problem is referred to as inverse quantitative structure-activity/property relationships (inverse QSAR/QSPR). In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding chemical graphs having desired properties^11–16.

In many such methods, enumeration of graph structures from a given set of descriptors is a crucial subtask. However, enumeration is itself a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ ¹⁷. Enumerating chemical compounds has a long history and numerous applications such as designing novel drugs¹⁸ and structure elucidation¹⁹. The problem of enumerating chemical compounds can be viewed as the problem of enumerating graphs with given constraints, which is one of the fundamental problems in discrete mathematics and has many applications. Various methods have been developed for general graph structures^20–23 and for restricted chemical compounds^24–28. Enumerating restricted chemical compounds with specialized tools is more efficient than using tools for general graph structures, which has led to a new trend in cheminformatics²⁹.

Recently, a novel framework for inferring chemical graphs has been proposed^30,31. This framework is illustrated in Figure 1. One of the important stages of this framework is to enumerate chemical isomers of a given chemical graph. The enumeration algorithms required in this framework have been designed for chemical compounds with cycle index at most 2 ^31–34. The computational results show that these algorithms can generate chemical graphs with around 15 non-hydrogen atoms, but cannot deal with larger chemical graphs because of the large computation time. Instead of focusing on a general chemical graph structure that rarely exists, Azam et al.³² introduced a restricted class of acyclic graphs that is characterized by an integer $\rho$, called a “branch parameter”, such that the restricted class still covers most of the acyclic chemical compounds in the PubChem database. Based on this characterization, they designed an efficient algorithm to generate acyclic graphs with around 50 non-hydrogen atoms. Recently, Akutsu and Nagamochi³⁷ extended the idea to define a restricted class of cyclic graphs, called “$\rho$-lean cyclic graphs”, that covers most of the cyclic chemical compounds in the PubChem database, in order to deal with large chemical graphs. Accordingly, they proposed an algorithm to generate chemical isomers of a $\rho$-lean chemical graph. The method has been implemented, and computational results showed that chemical graphs with up to around 50 non-hydrogen atoms can be inferred and generated^35,36.

Fig. 1 — An illustration of a framework for inferring a set of chemical graphs .


The idea of the chemical graph generation algorithm developed in ³⁷ is to construct a required chemical isomer starting with small chemical subgraphs in a bottom-up manner, where chemical subgraphs are encoded into frequency vectors and the actual construction is carried out in terms of computation of frequency vectors. However, this algorithm was designed to efficiently generate a small number of chemical graphs, and it has no backtracking procedure for generating all isomers of a rather large chemical graph that admits extremely many isomers. To generate all isomers, we need to design a procedure that backtracks the computation of frequency vectors in a top-down manner. In this paper, we design a dynamic programming algorithm that enumerates all isomers by constructing a compact representation of the set of all the isomers, from which we can generate any number of isomers.

The paper is organized as follows. Section 2 reviews a modeling of chemical compounds and a choice of descriptors. Section 3 proposes a dynamic programming algorithm that generates chemical isomers of a given chemical graph. Section 4 reports the results on some computational experiments. Section 5 makes some concluding remarks. The proposed method/system is available at GitHub https://github.com/ku-dml/mol-infer.

Preliminary

Let $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let [a, b] denote the set of integers i with $a \le i \le b$.

Graphs

Given a graph G, let V(G) and E(G) denote the sets of vertices and edges, respectively, and let $N_G(v)$ denote the set of neighbors of a vertex v in G. The length of a path is defined to be the number of edges in the path. Denote by $\ell(P)$ the length of a path P. A rooted tree is defined to be a tree where a vertex is designated as the root. The height of a vertex v in a rooted tree T is defined to be the maximum length of a path from v to a leaf u, and the height $\mathrm{ht}(T)$ of T is defined to be the height of the root r.

As an extension of rooted trees, we define a bi-rooted tree to be a tree T with two designated vertices Inline graphic and , called terminals. Let T be a bi-rooted tree. Define the backbone path to be the path of T between terminals and , and denote by (or by ) the set of components of T in the graph obtained from T by removing the edges in , where we regard each tree as a tree rooted at the unique vertex in Inline graphic . The height of T is defined to be the maximum of the heights of rooted trees in .

The rank of a graph G is defined to be the minimum number of edges to be removed to make the graph a tree. We call a graph with rank k a rank-k graph. Figure 2 illustrates three examples of rank-2 graphs.
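For a connected graph, the rank defined above equals |E(G)| − |V(G)| + 1, because a spanning tree keeps exactly |V(G)| − 1 edges. The following is a minimal sketch of that computation; the function name `graph_rank` and the edge-list input format are illustrative choices, not part of the paper.

```python
from collections import defaultdict


def graph_rank(vertices, edges):
    """Return |E| - |V| + 1 for a connected graph given as vertex and edge lists."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Depth-first search to confirm that the graph is connected.
    start = vertices[0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    if len(seen) != len(set(vertices)):
        raise ValueError("rank is computed here only for connected graphs")
    return len(edges) - len(vertices) + 1


# Two triangles sharing vertex 3: removing one edge from each leaves a tree.
bowtie = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 3)]
print(graph_rank([1, 2, 3, 4, 5], bowtie))  # 2
```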

Fig. 2 — An illustration of rank-2 graphs , , where the core vertices (resp., non-core vertices) are depicted with squares (resp., circles), the 2-branch vertices are depicted with gray circles (a) is 2-lean with where , , , , and ; (b) is not 2-lean with where , , , , , and ; (c) is not 2-lean with where , , , , , and .

The core of a graph with cycles is defined to be the subgraph formed by the cycles and the paths between the cycles. More precisely, let H be a connected simple graph with rank at least 1. The core $\mathrm{Cr}(H)$ of H is defined to be the subgraph induced by $V_1 \cup V_2$, where $V_1$ is the set of vertices in cycles of H and $V_2$ is the set of vertices each of which is in a path between two vertices $u, v \in V_1$. A vertex (resp., an edge) in H is called a core vertex (resp., core edge) if it is contained in the core $\mathrm{Cr}(H)$, i.e., it lies on a cycle or on a path between cycles. A vertex or edge that is not in the core is called a non-core vertex or non-core edge, respectively.

The core size Inline graphic is defined to be . An exterior tree T is defined to be a maximal induced subtree of H such that V(T) contains exactly one core vertex v of H, where T is regarded as a rooted tree rooted at v. For example, in Figure 2(a), the tree induced by the vertex set is an exterior tree of . The core height Inline graphic is defined to be the maximum height of an exterior tree T of H.
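As a rough illustration of the core, core size and core height, the sketch below repeatedly prunes degree-1 vertices of a connected graph with rank at least 1. What remains is the set of vertices on cycles or on paths between cycles, and the pruned vertices form the exterior trees. The function name `core_statistics` and the edge-list input are assumptions made for this example only.

```python
from collections import defaultdict, deque


def core_statistics(edges):
    """Return (core vertex set, core size, core height) of a connected graph
    with rank >= 1, given as a list of undirected edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    height = {}  # height of each non-core vertex within its exterior tree
    queue = deque(v for v in adj if deg[v] == 1)
    while queue:
        v = queue.popleft()
        # Neighbors pruned earlier are exactly the children of v.
        children = [c for c in adj[v] if c in removed]
        height[v] = 1 + max(height[c] for c in children) if children else 0
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                if deg[u] == 1:
                    queue.append(u)
    core = set(adj) - removed
    # Core height: maximum height of an exterior tree rooted at a core vertex.
    core_height = 0
    for v in core:
        pendant = [height[c] for c in adj[v] if c in removed]
        if pendant:
            core_height = max(core_height, 1 + max(pendant))
    return core, len(core), core_height


# A triangle 1-2-3 with a small exterior tree hanging from vertex 3.
example = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (4, 6)]
core, core_size, core_height = core_statistics(example)
print(sorted(core), core_size, core_height)  # [1, 2, 3] 3 2
```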

The core size and core height of the three rank-2 graphs Inline graphic , illustrated in Figures 2(a)-(c) are , , , , and .

Branch parameter

Choose a positive integer Inline graphic as a branch parameter³². A non-core vertex v is called a -internal vertex (resp., a -external vertex) if (resp., ). A non-core edge e is called a -internal edge (resp., a -external edge) if e is incident to no -external vertex (resp., to a -external vertex). A -internal vertex v is called a Inline graphic -branch if v has at least two children, each of which has height at least , where a -branch v is called a leaf -branch if .

A Inline graphic -fringe tree is defined to be a maximal subtree of an exterior tree T such that the edge set of consists of -external edges. For example, in Figure 2(a), the tree induced by the vertex set is a 2-fringe tree. Note that every exterior tree T contains a -fringe tree if and only if Inline graphic . The -branch leaf number of H is defined to be the number of leaf -branches in H.

We call an exterior tree T of H a Inline graphic -exterior tree if ; i.e., it contains at least one leaf -branch.

We call a core vertex adjacent to a Inline graphic -exterior tree a -branch core vertex, denote by the set of -branch core vertices and define the -branch core size to be . Note that , and either or .

We call a cyclic graph H $\rho$-lean if every exterior tree T contains at most one leaf $\rho$-branch; i.e., the set of $\rho$-internal edges in each exterior tree T forms a single path.
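The leanness condition can be checked mechanically once vertex heights are known. The sketch below is a hedged illustration that rests on one explicit assumption, since the exact height conditions are not legible in the text above: a non-core vertex is treated as internal when its height is at least `rho`, so that the leaves of the internal part of an exterior tree are exactly the non-core vertices of height `rho`. Under that assumption, the internal edges of an exterior tree form a single path if and only if the tree contains at most one such vertex. The name `is_rho_lean` is hypothetical.

```python
from collections import Counter, defaultdict, deque


def is_rho_lean(edges, rho):
    """Check leanness of a connected graph with rank >= 1.
    ASSUMPTION: a leaf branch is a non-core vertex of height exactly rho."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    removed, height, parent = set(), {}, {}
    queue = deque(v for v in adj if deg[v] == 1)
    while queue:
        v = queue.popleft()
        children = [c for c in adj[v] if c in removed]
        height[v] = 1 + max(height[c] for c in children) if children else 0
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                parent[v] = u  # the only surviving neighbor is the parent
                deg[u] -= 1
                if deg[u] == 1:
                    queue.append(u)
    # Count height-rho vertices per exterior tree, identified by its core root.
    per_tree = Counter()
    for v in removed:
        if height[v] == rho:
            root = v
            while root in removed:
                root = parent[root]
            per_tree[root] += 1
    return all(count <= 1 for count in per_tree.values())


ring = [(1, 2), (2, 3), (3, 1)]
# Two pendant paths of length 3 attached to the same core vertex 3.
same_root = ring + [(3, 4), (4, 5), (5, 6), (3, 7), (7, 8), (8, 9)]
# The same two paths attached to different core vertices.
two_roots = ring + [(3, 4), (4, 5), (5, 6), (1, 7), (7, 8), (8, 9)]
print(is_rho_lean(same_root, 2), is_rho_lean(two_roots, 2))
```

In the first graph both height-2 vertices (4 and 7) lie in the exterior tree rooted at vertex 3, so the check fails; in the second they lie in different exterior trees.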

Figure 2 illustrates three examples of rank-2 graphs. In the first example, Inline graphic and are the leaf 2-branches, and are the 2-branch core vertices, holds and is 2-lean. In the second example, and are the leaf 2-branches, is the 2-branch core vertex, holds and is not 2-lean. In the third example, and are the leaf 2-branches, is the non-leaf 2-branch, is the 2-branch core vertex, Inline graphic holds and is not 2-lean.

For $\rho = 2$, nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean. This statistical fact allows us to focus on 2-lean chemical graphs instead of general chemical graph structures, which are relatively difficult to generate and may not be of practical use. Over 92% of 2-fringe trees of chemical compounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint:

Thus we can focus on designing an efficient algorithm to enumerate fringe trees that satisfy Eq. (1), instead of generating chemical graphs with any kind of fringe trees that rarely exist.

A hydrogen-suppressed model for chemical compounds

We represent the graph structure of a chemical compound as a graph H with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. In a cyclic graph H, we regard each non-core edge $uv$ as a directed edge (u, v) from a vertex u to a child v of u in an exterior tree of H in order to define a descriptor that exploits the direction of non-core edges.

Let $\Lambda$ be a set of labels each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that $\Lambda$ does not contain H (hydrogen). Let $\mathrm{mass}(a)$ and $\mathrm{val}(a)$ denote the mass and valence of a chemical element $a \in \Lambda$, respectively. We define an adjacency-configuration to be a tuple $(a, b, m)$ with chemical elements $a, b \in \Lambda$ and a bond-multiplicity $m \in [1, 3]$; a chemical symbol to be a pair $\mu = (a, d)$ of the chemical element $a \in \Lambda$ and the degree $d$, where $\Lambda_{\mathrm{dg}}$ denotes the set of all chemical symbols; and an edge-configuration to be a tuple $(\mu, \mu', m)$ with $\mu, \mu' \in \Lambda_{\mathrm{dg}}$ and $m \in [1, 3]$.

We choose a branch parameter Inline graphic and two sets and of chemical symbols and three sets , and of edge-configurations.

Let Inline graphic be an edge in a chemical graph G such that are assigned to the vertices u and v, the degrees of u and v are i and j, respectively and the bond-multiplicity between them is m. When uv is a core edge, the edge-configuration of edge e is defined to be if in a total order over (or otherwise). When uv is a non-core edge which is regarded as a directed edge (u, v) where u is the parent of v in some exterior tree, the edge-configuration Inline graphic of a -internal (resp., -external) edge e is defined to be (resp., ).
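To make the edge-configuration concrete, the following sketch computes it from the elements, degrees and bond-multiplicity of an edge and then counts occurrences. Several details are assumptions for illustration only: a chemical symbol is modeled as an `(element, degree)` pair, the total order used for core edges is Python's lexicographic tuple order, and the strings `"co"`, `"in"` and `"ex"` are hypothetical tags for core, internal and external edges.

```python
from collections import Counter


def edge_configuration(kind, sym_u, sym_v, multiplicity):
    """Return a tagged edge-configuration.
    Core edges ("co") are undirected, so their two chemical symbols are put
    into a canonical order (ASSUMED here to be lexicographic tuple order).
    Non-core edges ("in"/"ex") keep the parent-to-child direction (u, v)."""
    if kind == "co" and sym_v < sym_u:
        sym_u, sym_v = sym_v, sym_u
    return (kind, sym_u, sym_v, multiplicity)


def frequency_counts(edges):
    """Count how often each tagged edge-configuration occurs."""
    return Counter(edge_configuration(*e) for e in edges)


edges = [
    ("co", ("O", 2), ("C", 3), 1),  # same core configuration as the next edge
    ("co", ("C", 3), ("O", 2), 1),
    ("in", ("C", 3), ("C", 2), 1),  # directed: parent C3, child C2
    ("ex", ("C", 2), ("N", 1), 1),
]
counts = frequency_counts(edges)
print(counts[("co", ("C", 3), ("O", 2), 1)])  # 2
```

Counting tagged configurations in a `Counter` mirrors the role these frequencies play in the descriptors, without claiming to reproduce the authors' data layout.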

Let Inline graphic be a tuple with a cyclic graph and functions and , where we use to denote the function such that for each vertex . A tuple is called a chemical cyclic graph if (i) H is connected; (ii) for each vertex ; and (iii) , and for each core edge , -internal edge and -external edge Inline graphic , respectively.

Descriptors and feature vectors

A feature vector f(G) of a chemical cyclic graph Inline graphic consists of the following 16 kinds of graph-theoretical descriptors.

n(G): the number |V| of vertices; : the core size of G; : the core height of G; : the -branch leaf number of G;
: the average mass of atoms in G; : the number of hydrogen atoms suppressed in G;
, : the numbers of core vertices and non-core vertices of degree in G;
, , : the numbers of core edges, -internal edges and -external edges with bond multiplicity in G;
, , , : the numbers of core vertices and non-core vertices v with a chemical symbol ; and
, , : the numbers of core edges such that , -internal edges such that , and -external edges such that in G.

Note that, excluding the average mass descriptor, the remaining 15 descriptors in the feature vector correspond to frequency counts.

An algorithm for generating isomers

This section presents a new algorithm for generating $\rho$-lean cyclic graphs G that have the same feature vector as a given chemical $\rho$-lean graph.

The idea of algorithm

For a graph G, we define its frequency vector as the vector consisting of the frequency values corresponding to the 15 descriptors used in the feature vector f(G). Instead of manipulating target graphs directly, we first compute the frequency vectors of subtrees of all target graphs and then construct a limited number of target graphs G from the process of computing the vectors. For this, we extend the dynamic programming algorithm for generating acyclic chemical graphs proposed by Azam et al.³². A sketch of the algorithm is as follows.

Given a chemical -lean cyclic graph , simplify the core of into a graph by replacing some paths with edges. Decompose into a collection of chemical trees such that each tree contains at most two vertices in . See Figure 3 for an illustration.
For each index , compute the feature vector and then generate a set of all (or a limited number of) chemical acyclic graphs such that by using the dynamic programming algorithm³².
Each combination of chemical trees forms a chemical -lean cyclic graph such that .

Fig. 3 — An illustration of generating a chemical isomer of a chemical graph in Stage 5, where is decomposed into chemical trees , based on a set of core vertices and a set of chemical tree such that is constructed for each vector , before a new target graph is obtained as a combination of .

Note that the number of chemical isomers Inline graphic obtained in 3 is which may possibly include isomorphic chemical graphs depending on the structure of . However, in many cases of computational experiments, the number is extremely large. Therefore we generate only a limited number of isomers by selecting a small number of trees , and most of the generated isomers are non-isomorphic as evident from the experimental results given in Section 4.
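The counting argument here can be illustrated with a short sketch. It assumes that the elided count is the product of the sizes of the per-component tree sets, which follows from the statement that each combination of trees forms one isomer, and it shows how a limited number of combinations can be drawn lazily without materializing all of them. The names `count_combinations` and `limited_combinations` and the string placeholders for trees are hypothetical.

```python
import itertools
import math


def count_combinations(tree_sets):
    """Number of ways to pick one tree from every set (product of set sizes)."""
    return math.prod(len(trees) for trees in tree_sets)


def limited_combinations(tree_sets, limit):
    """Return at most `limit` combinations, generated lazily."""
    return list(itertools.islice(itertools.product(*tree_sets), limit))


# Placeholder tree sets for three components of a decomposed graph.
tree_sets = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
print(count_combinations(tree_sets))       # 6
print(limited_combinations(tree_sets, 4))
```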

In the following, we describe a new algorithm that for a given chemical Inline graphic -lean graph , generates chemical -lean cyclic graphs such that

where Inline graphic may not be graph-isomorphic to and the elements in may not correspond between the two cores; i.e., possibly for some core vertex v of H in the graph-isomorphism between and .

In this section, we describe our new algorithm in a general setting where the branch parameter is any integer $\rho \ge 1$ and a chemical graph G to be inferred is any chemical $\rho$-lean cyclic graph.

Figure 4(a) and (b) illustrate small chemical 2-lean cyclic graphs, which we use as running examples to demonstrate how our algorithm generates isomers.

Fig. 4 — An illustration of chemical rooted/bi-rooted trees: (a) A chemical rooted tree rooted at a vertex v in a chemical graph G: CID 7600; (b) A chemical bi-rooted tree with terminals u and v in a chemical graph G: CID 3729083; (c) An example of a base-graph to the chemical graph G in (a); (d) An example of a base-graph to the chemical graph G in (b).

Nc-trees and C-trees

Nc-trees

Let $\rho$ be a branch parameter and H be a $\rho$-lean cyclic graph. We define “non-core-subtrees” in the following way.

Let T be a connected subgraph of H. We call T a non-core-subtree of H if T is regarded as a bi-rooted tree such that the backbone path Inline graphic is a subgraph of a -exterior tree of H (excluding the root) and the -fringe trees rooted at vertices in . We call a non-core-subtree T of H an internal-subtree (resp., an end-subtree) of H if neither (resp., one) of the two end-vertices of is a leaf -branch of H, as illustrated in Figure 5(a) (resp., in Figure 5(b)).

Fig. 5 — An illustration of subtrees of a chemical -lean cyclic graph G, where thick lines depict the cycle of the core of G, green circles depict leaf -branches in G and arrows depict non-core directed edges: (a) A non-core-subtree (internal-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (b) A non-core-subtree (end-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (c) A core-subtree T of G with represented by a c-tree (a chemical rooted tree); (d) A core-subtree T of G with represented by a c-tree (a chemical bi-rooted tree).

We introduce “nc-trees” to represent non-core subtrees of a Inline graphic -lean cyclic graph H. The nc-tree is defined to be a chemical bi-rooted tree T such that each rooted tree has a height at most .

For an nc-tree T, define

As discussed in Section 2, a non-core edge in Inline graphic is regarded as a directed edge (u, v). Define the number of -branch core vertices in T to be and the core height of T to be 0.

C-trees

For a $\rho$-lean cyclic graph H, a subtree T of H is called a core-subtree if one of the following holds:

(i)
T consists of all pendant-trees rooted at a core vertex ;
(ii)
T consists of a core-path with and all pendant-trees rooted at internal vertices of path ; and
(iii)
T consists of a core-path with , all pendant-trees rooted at internal vertices of path , and all pendant-trees rooted at one of the end-vertices of path .

To represent a core-subtree of H, we introduce “c-trees.” For a branch parameter Inline graphic , we call a bi-rooted tree -lean if each rooted tree contains at most one -branch; i.e., there is no non-leaf -branch and no two -exterior trees meet at the same vertex in . A c-tree is defined to be a chemical -lean bi-rooted tree T.

The chemical tree Inline graphic in Figure 4(a) is an example of a c-tree with , and , where v is a core vertex and and are nc-trees. The chemical tree in Figure 4(b) is an example of a c-tree with , and , where s is a core vertex and is an nc-tree. The chemical tree in Figure 4(b) is an example of a c-tree with Inline graphic , and , where and are c-trees.

For a c-tree T, define

Define the number Inline graphic of -branch core vertices in T to be the number of rooted trees in with . Define the core height for the bi-rooted tree T. Note that (resp., ) is the set of -external vertices (resp., -external vertices) in the rooted trees in . Illustrations of c-trees are given in Figures 5(c) and (d).

As discussed in Section 2, a non-core edge in Inline graphic for an nc-tree or a c-tree T is regarded as a directed edge (u, v).

Fictitious trees

For the nc-tree Inline graphic in Figure 4(a), the degree of the terminal is in G for in T and . We treat such a degree of a terminal v in a target chemical graph G as a fictitious degree of a chemical rooted tree T.

For an nc-tree or a c-tree T and an integer Inline graphic , let denote a fictitious chemical graph obtained from T by regarding the degree of terminal as . Figure 6(a) and (b) illustrate fictitious trees in the case of and in the case of and , respectively.

Fig. 6 — An illustration of fictitious trees: (a) of a rooted nc- or c-tree T; (b) of a bi-rooted nc-tree T; (c) of a bi-rooted c-tree T.

For a c-tree T with Inline graphic and integers , let denote a fictitious chemical graph obtained from T by regarding the degree of terminal , as . Figure 6(c) illustrates a fictitious bi-rooted c-tree .

Frequency vectors

For a finite set A of elements, let Inline graphic denote the set of functions . A function is called a non-negative integer vector (or a vector) on A and the value for an element is called the entry of for . For a vector and an element , let (resp., ) denote the vector such that (resp., ) and for the other elements . For a vector Inline graphic and a subset , let denote the projection of to B; i.e., such that , .
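The three vector operations just described (adding one at a single entry, subtracting one at a single entry, and projecting onto a subset B) can be sketched with plain dictionaries. The function names are hypothetical, and raising an error when a subtraction would produce a negative entry is a design choice made here so that results stay non-negative integer vectors.

```python
def plus_one(x, a):
    """Return a copy of x with the entry for a increased by 1."""
    y = dict(x)
    y[a] = y.get(a, 0) + 1
    return y


def minus_one(x, a):
    """Return a copy of x with the entry for a decreased by 1."""
    if x.get(a, 0) < 1:
        raise ValueError("entry would become negative")
    y = dict(x)
    y[a] -= 1
    return y


def project(x, subset):
    """Return the projection of x onto the elements of `subset`."""
    return {b: x.get(b, 0) for b in subset}


x = {"p": 2, "q": 0, "r": 1}
print(plus_one(x, "q"))         # {'p': 2, 'q': 1, 'r': 1}
print(minus_one(x, "p"))        # {'p': 1, 'q': 0, 'r': 1}
print(project(x, ["p", "r"]))   # {'p': 2, 'r': 1}
```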

To introduce a “frequency vector” of a subgraph of a chemical cyclic graph, we define sets of symbols that correspond to some descriptors of a chemical cyclic graph. Let Inline graphic , and be sets of edge-configurations in Section 2. We define a vector whose entry is the frequency of an edge-configuration in the sets , or the number of -branch core vertices. We use a symbol to denote the number of -branch core vertices in our frequency vector. To distinguish edge-configurations from different sets among three sets Inline graphic , , we use to denote the entry of an edge-configuration , . We denote by the set of entries , , . Define the set of all entries of a frequency vector to be

Define the frequency vector Inline graphic to be a vector that consists of the following entries:

, ;
, , ;
.

For an nc-tree or c-tree T, the frequency vector Inline graphic of a fictitious tree is defined as follows: Let , , , , . Let , . Set if T is an nc-tree, and if T is a c-tree. When ,

Let Inline graphic , and let belong to . When T is an nc-tree,

The frequency vector Inline graphic of a fictitious tree for a bi-rooted c-tree T with is defined as follows: For each , let , , , of the unique edge incident to and , . Let , . Then

Chemical graph isomorphism

For a chemical Inline graphic -lean cyclic graph for a branch parameter , we choose a path-partition of the core , where . Let denote the set of all end-vertices of paths , where .

Define the base-graph Inline graphic of H by to be the multigraph obtained from H replacing each path with a single edge joining the end-vertices of , where . We call a vertex in and an edge in a base-vertex and a base-edge, respectively. For a notational convenience in distinguishing the two end-vertices u and v of a base-edge Inline graphic , we regard each base edge as a directed edge . For each base-edge , let denote the path that is replaced by edge .

Figure 4(c) (resp., (d)) illustrates an example of a base-graph Inline graphic to the chemical graph G in Figure 4(a) (resp., (b)).

We define the “components” of G by Inline graphic as follows.

Vertex-components

For each base-vertex Inline graphic , define the component at vertex v (or the v-component) of G to be the chemical core-subtree rooted at v in G; i.e., consists of all pendent-trees rooted at v. We regard as a c-tree rooted at the core vertex v of G and define the code of to be a tuple such that

The nc-tree Inline graphic in Figure 4(a) is the v-component of the graph G for the base-vertex in Figure 4(c), where , , , and .

Edge-components

For each base-edge Inline graphic , define the component at edge e (or the e-component) of G to be the chemical core-subtree of G that consists of the core-path and all pendant-trees of G rooted at internal vertices of path . We regard as a bi-rooted c-tree with and for the base-edge and define the code of to be a tuple Inline graphic such that

, , , ,
and for the edges incident to u and v,
.

The c-tree Inline graphic in Figure 4(b) is the e-component of the graph G for the base-edge in Figure 4(d), where contains exactly one leaf 2-branch (hence ), , , , , , and .

Observe that

Note that any other descriptors of a chemical Inline graphic -lean cyclic graph G with except for the core height can be determined by the entries of the frequency vector . For example, the vector with the numbers of core vertices of degree is given by

and the vector Inline graphic with the numbers of symbols of core vertices is given by

Similarly the vector Inline graphic with the numbers of symbols of non-core vertices is given by

We introduce a specification Inline graphic as a set of functions .

We call two chemical graphs Inline graphic -isomorphic if they consist of vertex and edge components with the same codes and heights; i.e., two chemical -lean cyclic graphs , are -isomorphic if the following hold:

and are graph-isomorphic, where we assume that and denotes the base-graph of both graphs and by .
For the v-components of , at each base-vertex , and .
For the e-components of , at each base-edge , and .

See Section 2 for the definition of height Inline graphic of a bi-rooted tree T.

The Inline graphic -isomorphism also implies that , , , and .

Chemical isomers of a given chemical graph

Let Inline graphic be a chemical -lean cyclic graph , and (resp., ) denote the v-component (resp., the e-component) of .

Target v-components

Let Inline graphic denote the height of the v-component of . For each base-vertex , fix a code and call a rooted c-tree T a target v-component if

where the condition on Inline graphic is equivalent to when , since is a -lean cyclic graph and the set of -internal edges in any target component forms a single path of length from the root to a unique leaf -branch. Let denote the set of all target v-components of a base-vertex .

For example, the number of all target v-components of the example in Figure 4(a) is 8, which will be computed as a compact representation by our algorithm.

Target e-components

For each base-edge Inline graphic , fix a code and call a bi-rooted c-tree T a target e-component if

Let Inline graphic denote the set of all target e-components of a base-edge .

For example, the number of all target e-components of the example in Figure 4(b) is 12, which will be computed as a compact representation by our algorithm.

Given a collection of target v-components Inline graphic , and target e-components , , there is a chemical -lean cyclic graph that is -isomorphic to the original chemical graph . Such a graph can be obtained from by replacing each base-edge with and attaching at each base-vertex .

From this observation, our aim is now to generate some number of target v-components for each base-vertex v and target e-components for each base-edge e. In the following, we denote Inline graphic , , , , and for each base-edge by , , , , and , respectively for a notational simplicity. For each base-edge , let

A sketch of dynamic programming algorithm on frequency vectors

We start with describing a sketch of our new algorithm for generating graphs Inline graphic in Stage 5.

We start with enumerating chemical rooted trees with height at most Inline graphic , which can be a -fringe tree of a target component. Next we extend each of the rooted tree to an nc-tree T and then to a c-tree T under a constraint that the frequency vector of T does not exceed a given vector , or , .

For a vector Inline graphic , we formulate the following sets of nc-trees and c-trees and of their frequency vectors:

(i)
, , , : the set of rooted nc-trees T with a root r such that
Let denote the set of the frequency vectors for all nc-trees ;
(ii)
, , , : the set of rooted nc-trees T with a root r such that
Let denote the set of the frequency vectors for all nc-trees ;
(iii)
, , , , , , : the set of bi-rooted nc-trees T such that
Let denote the set of all frequency vectors for all bi-rooted nc-trees .
(iv)
, , , , , , , : the set of rooted c-trees T with a root r such that
Let denote the set of the frequency vectors for all c-trees .
(v)
, , , , , , , , : the set of bi-rooted c-trees T such that
Let denote the set of the frequency vectors for all bi-rooted c-trees .

Note that Inline graphic for any vector in the above set in (i)-(iii).

Forward phase

The first phase computes the frequency vectors of some nc-trees and c-trees that can be a subtree of a target component, where we first enumerate chemical rooted trees with height at most $\rho$ and generate the frequency vectors of other types of nc-trees and c-trees from the frequency vectors of their subtrees recursively.

The first phase consists of five steps. Step 1 computes the sets of trees and vectors in (i), (ii) and (iii) with Inline graphic , where each tree in these sets is of height at most . Note that the frequency vectors of some two trees in a tree set in the above can be identical.

In fact, the size Inline graphic of a set of trees can be considerably larger than that of the set of their frequency vectors. We mainly maintain a whole vector set . With this idea, Steps 2–5 compute only vector sets in (iii) with , (iv) and (v).

We derive recursive formula that holds among the above sets. Based on this, we compute the vector sets in (iii) in Step 2, those in (iv) in Step 3 and those in (v) in Step 4. For each base-edge Inline graphic , Step 5 compares vectors and , where is the frequency vector of a c-tree that is extended from the end-vertex , to examine whether and give rise to a target e-component.
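The idea of maintaining sets of frequency vectors rather than sets of trees, and of discarding any vector that exceeds a given bound, can be illustrated with a deliberately simplified sketch. It is not the recursion used in Steps 2–5: it only sums a fixed number of base vectors, keeps sums that stay entrywise within a bound, and stores each distinct vector once. The name `extend_vector_sets` and the tuple encoding of vectors are assumptions of this example.

```python
def extend_vector_sets(base_vectors, bound, max_len):
    """levels[k] is the set of sums of k base vectors that never exceed
    `bound` in any entry; identical sums are stored only once."""
    zero = tuple(0 for _ in bound)
    levels = [{zero}]
    for _ in range(max_len):
        nxt = set()
        for x in levels[-1]:
            for w in base_vectors:
                y = tuple(a + b for a, b in zip(x, w))
                # Prune vectors that already exceed the target vector.
                if all(yi <= bi for yi, bi in zip(y, bound)):
                    nxt.add(y)
        levels.append(nxt)
    return levels


levels = extend_vector_sets([(1, 0), (0, 1)], bound=(1, 1), max_len=2)
print(levels[2])  # {(1, 1)}
```

In this toy run, two different orders of combining the base vectors produce the same sum (1, 1), which is stored once. This is the sense in which a vector set can be much smaller than the corresponding tree set.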

The previous method of Akutsu and Nagamochi³⁷ for generating target components follows an idea similar to that of the first phase: it constructs some number of target components once a necessary set of frequency vectors has been computed at the end of the execution. However, that algorithm cannot enumerate all target components.

Backward phase

To address this problem, we need to backtrack the computation of the first phase to detect which subtrees can be part of a target component. As the second phase, we design an algorithm that constructs a compact DAG representation of all target components by backtracking the computation of the first phase, so that all target components can be generated by tracing the DAG representation.

The second phase consists of two steps, Step A and Step B. Step A (resp., Step B) constructs a DAG representation of all target v-components for each base-vertex Inline graphic with (resp., for each base-edge ) so that a path from a source to a sink in the DAG corresponds to a construction process of a target component. We can enumerate all target components by enumerating all paths in the resulting DAG representations.
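The path enumeration mentioned here can be sketched as follows. This is a generic depth-first enumeration of source-to-sink paths in a small DAG, not the authors' implementation; the adjacency format, the node names and the arc labels are assumptions made for the example, with `None` standing in for a null label.

```python
def enumerate_paths(arcs, source, sink):
    """Yield every directed path from source to sink as a list of arcs.

    arcs maps a node to a list of (successor, label) pairs. The digraph is
    assumed to be acyclic, so the search always terminates.
    """
    stack = [(source, [])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            yield path
            continue
        for succ, label in arcs.get(node, []):
            stack.append((succ, path + [(node, succ, label)]))

# Hypothetical toy DAG with source "s" and sink "t".
arcs = {
    "s": [("a", "w1"), ("b", None)],
    "a": [("t", "w2")],
    "b": [("a", "w3"), ("t", None)],
}
paths = list(enumerate_paths(arcs, "s", "t"))
```

Because the function is a generator, a caller can stop after any number of paths, which matches the idea of producing only as many components as needed.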

Defining DAG representations for vertex-components

A DAG representation Inline graphic for the set of target v-components consists of the following:

An acyclic digraph (N, A) with a node set N, a set of sources, a single sink and an arc set A. The node set N consists of disjoint subsets and . The end-nodes of every arc satisfy one of “,” “” and “.”
A function .
A set W of labels , where stands for the frequency vector of a chemical rooted tree.
A set of chemical rooted trees T for each non-null label such that the frequency vector of T is equal to the vector implied by . We also store .
A function such that and , where is a null label that stands for the zero-vector. For a directed path P in (N, A), let W(P) denote the multi-set of non-null labels of arcs a in P.

Figure 7 illustrates a DAG representation for the set of target v-components Inline graphic , where, for example, and for an arc and consists of one tree in the figure. All null labels are omitted in the figure.

Fig. 7 — (a) An illustration of a DAG representation for the set of target v-components , where each arc with (resp., ) is depicted with two (resp., three) lines, on an arc a denotes the non-null label and the root of a chemical rooted tree is depicted with a black circle; (b) An illustration of a target v-component induced by a path with , and .

A target v-component is constructed from a DAG representation as follows.

First choose a path from a source in to sink . For example, let in Figure 7(a), where we obtain .
Construct a chemical path such that is the chemical element in the chemical symbol and is the bond-multiplicity of the arc . For the example in Figure 7(a), we obtain .
For each non-null label of an arc in the path P, choose a chemical rooted tree from the set . The number of all such combinations is . For the example in Figure 7(a), choose for arc , for arc and with for arc , and .
Finally attach the tree chosen in 3 to the vertex in the chemical path and the resulting tree becomes a target v-component. For the example in Figure 7(a), we obtain a target v-component as illustrated in Figure 7(b).
Any target v-component is constructed in the above manner of 1 to 4. Hence the number of all target v-components can be computed as follows. Let denote the set of nodes such that After initializing for each arc a with , for each arc a with and , choose non-sink nodes in a non-decreasing order of the distance from u to sink , and compute . The number of all target v-components is given by for the source .
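The counting procedure can be sketched as a dynamic program over the DAG. The sketch assumes that the count for a node is the sum, over its outgoing arcs, of the number of trees stored for the arc label times the count of the arc head, with the sink counting as 1 and a null label counting as a single choice. The function `count_components`, the table `tree_count` and the toy DAG are hypothetical, and memoised recursion is used in place of an explicit ordering of nodes by distance to the sink.

```python
from functools import lru_cache

def count_components(arcs, tree_count, source, sink):
    """Count (path, tree choice) combinations from source to sink.

    arcs maps a node to a list of (successor, label) pairs; tree_count maps
    a non-null label to the number of trees stored for it. A label of None
    is a null label and contributes a factor of 1.
    """
    @lru_cache(maxsize=None)
    def count(node):
        if node == sink:
            return 1
        total = 0
        for succ, label in arcs.get(node, ()):
            weight = 1 if label is None else tree_count[label]
            total += weight * count(succ)
        return total

    return count(source)

# Hypothetical toy DAG: three source-to-sink paths.
arcs = {
    "s": [("a", "w1"), ("b", None)],
    "a": [("t", "w2")],
    "b": [("a", "w3"), ("t", None)],
}
tree_count = {"w1": 2, "w2": 3, "w3": 1}
total = count_components(arcs, tree_count, "s", "t")  # 2*3 + 1*3 + 1 = 10
```

When every label stores exactly one tree, the result equals the number of source-to-sink paths, so counting never requires generating the components themselves.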

Figures 8(a) and (b) illustrate a DAG representation for the set of target v-components of base-vertex v in Figure 4(a). By choosing a path Inline graphic in the DAG representation, we obtain a target v-component in Figure 8(b), where and . In this case, the number of paths from the source to a sink is eight and the number of all target v-components is eight, since the choice of trees from is unique.

Fig. 8 — (a) A DAG representation for the set of target v-components of the base-vertex v in Figure 4(a); (b) The target v-component induced by a path .

Defining DAG representations for edge-components

A DAG representation Inline graphic , for the set of target e-components consists of the following:

An acyclic digraph with a node set , a set of sources, a single sink and an arc set . The source set consists of disjoint subsets . The node set consists of disjoint subsets and . The end-nodes of every arc satisfy one of “,” “” and “.” The set of sources in is denoted by . See Figure 9(a).
An acyclic digraph with a node set and an arc set such that has a single source and a single sink and every path from to has length , where . Let denote the set of nodes whose distance from source is p. We call a node (resp., ) a left-node (resp., a right-node) and call an arc a left-arc (resp., a right-arc) if u and v are left-nodes (resp., right-nodes) and a middle-arc otherwise. See Figure 9(b).
Functions and .
A set W of labels , where stands for the frequency vector of a chemical rooted tree.
A set of chemical rooted trees T for each non-null label such that the frequency vector of T is equal to the vector implied by . We also store .
A function such that and , where is a null label that stands for the zero-vector. For a directed path P in , let W(P) denote the multi-set of non-null labels of arcs in P.
Functions such that and . For a directed path Q in , let W(Q) denote the multi-set of of arcs a in Q, and let denote the multi-set of of arcs a in Q.

Figure 9 illustrates a DAG representation for the set of target e-components Inline graphic , where the null labels are omitted in the figure.

Fig. 9 — An illustration of a DAG representation for the set of target e-components ; (a) A representation with for the non-core part, where is omitted in the figure and each gray circle indicates a source or a sink; (b) A representation for the core part, where each gray circle indicates a node with distance or from the source .

A target e-component is constructed from a DAG representation as follows.

Choose a path from the source to the sink in . For example, let in Figure 9(b), where , and is the middle-arc in Q.
Construct a chemical path such that is the chemical element in the chemical symbol , and is of arc . For the example in Figure 9(b), we obtain .
For each of an arc in path Q, choose a chemical rooted tree from the set ; Attach the tree to the atom at (resp., ) in path if is a left-arc (resp., a right-arc). For the example with in Figure 9(b), we attach a tree to the atom C at in path since is a left-arc.
For each of an arc in path Q, execute the next to construct an nc-tree :
- (i)
  Choose a path from source to the sink in . Construct a chemical path such that is the chemical element in the chemical symbol and is the bond-multiplicity of arc .
- (ii)
  For each of an arc in the path , choose a chemical rooted tree from the set and attach the tree to atom in the path Let be one of the resulting chemical trees rooted at .
- (iii)
  Attach the tree to the atom at (resp., ) in the path if is a left-arc (resp., a right-arc). The resulting chemical tree is a target e-component.
For the example with in Figure 9(b), we can choose a path from source to sink in in Figure 9(a), and we construct a tree in the same manner as in the case of constructing target v-components, and attach tree to the atom C at in the path since is a right-arc.
Any target e-component is constructed in the above manner of 1 to 5. Hence the number of all target e-components can be computed as follows. For each source , the number of chemical rooted trees constructed in 4 can be computed in the same manner as in the case of computing the number of target v-components in a DAG representation. Let denote the set of nodes such that . After initializing for each right-arc a with , for each right-arc a with , for each right-arc a with and , choose non-sink right-nodes in the order of and compute to obtain . For the right-nodes in , we apply the above procedure to left-nodes in the order of after reversing the directions of arcs in . Finally, for the set of middle-arcs, the number of all target e-components is given by .

Figures 10(a)-(c) illustrate a DAG representation for the set of target e-components of the base-edge e in Figure 4(b). For this example, we choose a path

from the source Inline graphic to the sink in , where , and is the middle-arc in Q. From this, we have a chemical path

For Inline graphic of the right-arc in the path Q, we choose tree and attach the tree at the atom at in .

Fig. 10 — A DAG representation for the set of target e-components of base-edge in Figure 4(b): (a) A representation , for the non-core part, (b) A representation for the core part, (c) A target e-component induced by the path .

For Inline graphic of left-arc in the path Q, we choose a path

from source Inline graphic to sink in , from which we have a chemical path

For Inline graphic of arc in the path , we choose tree and attach the tree to the atom at in to obtain a chemical rooted tree . Finally, attach the tree to the atom at in . The resulting target e-component is illustrated in Figure 10(c). In this example, we obtain ,

Inline graphic and and the number of target e-components is .

Computing frequency vectors of subtrees of target components in the forward phase

Step 1: Enumeration of -fringe trees

Step 1 generates chemical rooted trees with height at most Inline graphic to compute the following sets in (i)-(iv).

(i)
For each base-vertex such that , where , compute the set , of rooted c-trees. Note that every c-tree in the set with is a target v-component in ; Set and ;
(ii)
For each base-vertex such that , where , and integers and , compute the sets of rooted c-trees and of their frequency vectors; Set ;
(iii)
For each base-vertex such that and each possible tuple , compute the sets and of rooted nc-trees and the sets and of their frequency vectors; Set and ;
(iv)
For each base-edge and each possible tuple , compute the sets and of rooted nc-trees and , of rooted c-trees and the sets , and of their frequency vectors; For each base-edge with and each possible tuple (a, d, m) with , and , we compute the sets of rooted c-trees and ; Set , , and .

To compute the above sets of trees and vectors, we enumerate all possible trees with height at most Inline graphic under the size constraint (1) by a branch-and-bound procedure.

Step 2: Generation of frequency vectors of end-subtrees

For each base-vertex Inline graphic or each base-edge such that and each possible tuple , Step 2 computes the set in the ascending order of . Observe that each vector is obtained as from a combination of vectors , and an edge-configuration such that

Figure 11(a) illustrates this process of computing a vector Inline graphic .

Fig. 11 — (a) An illustration of computing a vector from the frequency vectors of a bi-rooted nc-tree T and of an nc-tree ; (b) An illustration of computing a vector from the frequency vectors , of a c-tree T and of an nc-tree .

We call an edge-configuration Inline graphic feasible to the set , if at least one vector is obtained from a combination and . We let store all edge-configurations feasible to .

Step 3: Generation of frequency vectors of rooted core-subtrees

For each base-vertex Inline graphic or each base-edge such that and each possible tuple with ,

Step 3 computes the set Inline graphic , where

Observe that each vector Inline graphic is obtained as from a combination of vectors , , and an edge-configuration such that

where Inline graphic . Figure 11(b) illustrates this process of computing a vector .

We call an edge-configuration Inline graphic feasible to the set , if at least one vector is obtained from a combination of vectors and . We let store all edge-configurations feasible to .

For each base-vertex Inline graphic , it holds that . Step A generates all target v-components such that based on the sets of frequency vectors generated in Steps 1 to 3.

Note that the set Inline graphic is constructed in Step 1 for integers and in Step 3 for integers .

Step 4: Generation of frequency vectors of bi-rooted core-subtrees

For each base-edge Inline graphic , each index and each possible tuple with , Step 4 computes the set in the ascending order of . Let denote the c-tree with a single vertex v such that .

For Inline graphic , we see that each vector is obtained as from a combination of vectors , and an edge-configuration such that

Figure 12(a) illustrates this process of computing a vector Inline graphic .

Fig. 12 — An illustration of computing a vector (a) For , a vector is obtained from the vectors of a rooted c-tree T and of the c-tree ; (b) For , a vector is obtained from the vectors of a rooted c-tree T and of a c-tree .

For Inline graphic , observe that each vector is obtained as from a combination of vectors , and an edge-configuration such that

Figure 12(b) illustrates this process of computing a vector Inline graphic .

We call an edge-configuration Inline graphic with feasible to the set , if at least one vector is obtained from a combination of vectors and (or for ). We let store all edge-configurations feasible to .

Step 5: Enumeration of feasible vector pairs

For each edge Inline graphic , a feasible vector pair is defined to be a pair of vectors , with that admits an edge-configuration such that

Let Inline graphic denote the set of feasible vector pairs for a base-edge . We also call each of the two vectors in a feasible vector pair feasible, and let , denote the set of feasible vectors . Figure 13 illustrates a feasible vector pair .

Fig. 13 — An illustration of computing a feasible vector pair with of c-trees for a base-edge .

The last equality in (6) is equivalent to the condition that Inline graphic is equal to the vector , which we call the -complement of and denote by .

For each edge Inline graphic , Step 5 enumerates the set of all feasible vector pairs . To efficiently search for a feasible pair of vectors in two sets , with , we first compute the -complement vector of each vector for each edge-configuration with , and denote by the set of the resulting -complement vectors. Observe that Inline graphic is a feasible vector pair if and only if . To find such pairs, we merge the sets and into a sorted list . Then each feasible vector pair appears as a consecutive pair of vectors and in the list .
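The sort-and-scan search for feasible pairs can be sketched as follows. The sketch assumes that a pair is feasible for an edge-configuration when the two vectors plus the contribution of that edge-configuration sum to the target vector, so that the complement is the target minus the second vector minus that contribution. This stands in for condition (6), and the names `feasible_pairs`, `gamma_units` and the toy vectors are hypothetical. Equal vectors are grouped into runs after sorting, which also covers the case where several complements coincide.

```python
from itertools import groupby

def feasible_pairs(W1, W2, target, gamma_units):
    """Return all (w1, w2, gamma) with w1 + w2 + gamma_units[gamma] == target.

    W1, W2: iterables of integer tuples of equal length.
    gamma_units: dict mapping an edge-configuration name to its contribution.
    """
    # Tag each entry with its side: 0 for a vector of W1, 1 for a complement.
    entries = [(w1, 0, w1, None) for w1 in W1]
    for w2 in W2:
        for gamma, unit in gamma_units.items():
            comp = tuple(t - a - g for t, a, g in zip(target, w2, unit))
            if min(comp) >= 0:  # frequency vectors cannot have negative entries
                entries.append((comp, 1, w2, gamma))
    entries.sort(key=lambda e: (e[0], e[1]))

    pairs = []
    for _, run in groupby(entries, key=lambda e: e[0]):
        run = list(run)
        lefts = [e[2] for e in run if e[1] == 0]
        rights = [(e[2], e[3]) for e in run if e[1] == 1]
        pairs.extend((w1, w2, gamma) for w1 in lefts for (w2, gamma) in rights)
    return pairs

# Hypothetical toy data with three-dimensional frequency vectors.
target = (2, 1, 1)
gamma_units = {"g1": (0, 0, 1)}
W1 = [(1, 0, 0), (2, 1, 0)]
W2 = [(1, 1, 0), (0, 0, 0), (1, 0, 1)]
pairs = feasible_pairs(W1, W2, target, gamma_units)
```

Sorting costs O(n log n) in the total number of vectors and complements, compared with the quadratic cost of testing every pair directly.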

From a feasible vector pair Inline graphic for a base-edge , Step B generates all target e-components such that consists of two c-trees and with , .

Constructing DAG representations in the backward phase

Step A: Constructing target v-components from frequency vectors

For each base-vertex Inline graphic such that , the set of target v-components is constructed in Step 1. Let be a base-vertex such that , where . Based on the sets of frequency vectors computed in Steps 1, 2 and 3, we construct a DAG representation of the set of target v-components defined in Section 3.7. Step A consists of three steps. Step A1 discards unnecessary vectors from the vector sets computed in Steps 1, 2 and 3 of Section 3.9.

Step A1.

(i)
Note that . We call the vector feasible. Let .
(ii)
For each possible tuple , we define “feasible vectors” as follows. For each , if a pair of vectors and satisfies (3) and the vector is feasible, then we call each of these vectors and feasible. Let , denote the set of feasible vectors and denote the set of feasible vectors .
(iii)
For each possible tuple with , we define “feasible vectors” in the descending order of as follows. For each , if a pair of vectors and satisfies (2) and the vector is feasible, then we call each of these vectors and feasible. Let denote the set of feasible vectors and denote the set of feasible vectors .
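The backward marking of feasible vectors in Step A1 can be viewed as a reachability computation, sketched below. The sketch assumes that the forward phase recorded, for every derived vector, the combinations of vectors from which it was obtained; the table `derivations`, the function `mark_feasible` and the vector identifiers are hypothetical. A worklist replaces the explicit descending order used in the text, which yields the same set when the derivation records are acyclic.

```python
def mark_feasible(derivations, target):
    """Return the set of vectors that contribute to the target vector.

    derivations maps a vector identifier to a list of tuples, each tuple
    listing the identifiers of the vectors combined to obtain it.
    """
    feasible = {target}
    worklist = [target]
    while worklist:
        vec = worklist.pop()
        for parts in derivations.get(vec, ()):
            for part in parts:
                if part not in feasible:
                    feasible.add(part)
                    worklist.append(part)
    return feasible

# Hypothetical records: "x*" is the target; "e" and "f" never reach it.
derivations = {
    "x*": [("a", "b")],
    "a": [("c", "d")],
    "e": [("c", "f")],
}
feasible = mark_feasible(derivations, "x*")
```

Vectors outside the returned set, such as "e" and "f" here, can be discarded before the nodes and arcs of the DAG are created.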

Step A2 constructs a set N of nodes and a function Inline graphic for the DAG representation. A node in N corresponds to a frequency vector of a chemical tree T and we denote the node by for notational simplicity.

Step A2: Constructing N.

(i)
Create a unique sink in N.
(ii)
For each integer , create a node called an internal-node (an int-node, for short) for each feasible vector , set and let denote the set of these int-nodes.
(iii)
For the feasible vector , create a node in N, which will be the unique source in (N, A) and set with . Let consist of source .
(iv)
Let .

Step A3 creates arcs in our DAG representation by re-executing some part of the forward phase in Steps 1, 2 and 3 of Section 3.9. We first execute Step 1, next execute Step 2 only for feasible vectors in the ascending order of Inline graphic and then execute Step 3 only for feasible vectors as follows.

Step A3: Constructing A.

(i)
(Step 1(iii)) For each feasible vector , create an arc that leaves from the int-node and enters sink , and set and . For the resulting arc , we set and .
(ii)
(Step 2) For each tuple such that and , create an arc , if there are feasible vectors and such that is a feasible vector for an edge-configuration . In this case, we set and and create an arc that leaves from the int-node and enters the int-node . For the resulting arc , we set and .
(iii)
(Step 3) Create an arc , if there are feasible vectors and such that is a feasible vector for an edge-configuration . In this case, we set and and create an arc that leaves from the source and enters the int-node . For the resulting arc , we set and .
(iv)
Let A(h), denote the set of arcs such that or . Set .

For base-vertex Inline graphic in Figure 4(a), Step 1(ii) computes one c-tree in Figure 14(a) and the set of the frequency vectors such that

Fig. 14 — An illustration of chemical rooted/bi-rooted trees: (a) The c-tree and nc-trees in Step 1 for base-vertex in Figure 4(a); (b) The c-trees and nc-trees in Step 1 for base-edge in Figure 4(b).

Inline graphic ,

and Step 1(iii) computes seven nc-trees Inline graphic in Figure 14(a) and the sets of their frequency vectors such that

Inline graphic ,

Inline graphic , ,

Inline graphic ,

Inline graphic , and

Inline graphic .

Step B: Constructing target e-components from frequency vectors

Let Inline graphic be a base-edge. Recall that Step 5 computes the set of feasible vector pairs and the set of feasible vectors for each index . For each feasible vector pair , a target e-component is obtained as a pair of c-trees and with a bond-multiplicity m such that with , where we call the part-i of the target e-component Inline graphic . For each index , let denote the set of the c-trees T such that is a feasible vector .

In this section, we design an algorithm for constructing a DAG representation of the c-trees Inline graphic , .

Based on the sets of frequency vectors computed in Steps 1 to 5, we construct a DAG representation of the set of target e-components. Step B consists of five steps. Step B1 discards unnecessary vectors from the vector sets computed in Steps 1 to 4.

Step B1: Discarding unnecessary vectors

(i)
Compute the sets of feasible vectors, , .
(ii)
For each possible tuple with , we define “feasible vectors” in the descending order as follows. For each with and , if a pair of vectors and satisfies (5) and the vector is feasible, then we call each of these vectors and feasible. Let denote the set of feasible vectors and denote the set of feasible vectors . For each with and , if a pair of vectors and satisfies (4) and the vector is feasible, then we call such a vector feasible. Let denote the set of feasible vectors .
(iii)
For each possible tuple with , we define “feasible vectors” in the descending order of as follows. For each , if a pair of vectors and satisfies (3) and the vector is feasible, then we call each of these vectors and feasible. Let , denote the set of feasible vectors and denote the set of feasible vectors .
(iv)
For each possible tuple with , we define “feasible vectors” in the set in the descending order of as follows. For each , if a pair of vectors and satisfies (2) and the vector is feasible, then we call each of these vectors and feasible. Let denote the set of feasible vectors and denote the set of feasible vectors .

Steps B2 and B3 construct the non-core part Inline graphic of a DAG representation of target e-components in a manner similar to Steps A2 and A3.

Step B2: Constructing Inline graphic

(i)
Create a unique sink in .
(ii)
For each integer , create a node , called an internal-node (an int-node, for short) for each feasible vector , set and let denote the set of these int-nodes.
(iii)
For each integer , create a node for each feasible vector set with and let denote the set of these int-nodes, where each node in will be a source in .
(iv)
Let .

Step B3: Constructing Inline graphic

(i)
Step 1(iv): For each feasible vector , set and and create an arc that leaves from int-node and enters sink , set and for the arc .
(ii)
Step 2: For each tuple such that and , if there are feasible vectors and such that is a feasible vector for an edge-configuration , then set and and create an arc that leaves from int-node and enters int-node and set and for the arc .
(iii)
Step 3: For each tuple such that , , if there are feasible vectors and such that is a feasible vector for an edge-configuration , then set and and create an arc that leaves from source and enters int-node and set and for the arc .
(iv)
Let , denote the set of arcs such that or . Set .

For base-edge Inline graphic in Figure 4(b), Step 1(iv) computes six c-trees in Figure 14(b) and the sets of their frequency vectors such that ,

Inline graphic ,

Inline graphic , ,

Inline graphic ,

Inline graphic

and five nc-trees Inline graphic in Figure 14(b) and the sets of their frequency vectors such that

Inline graphic ,

Inline graphic , ,

Inline graphic ,

Inline graphic and

Inline graphic .

Steps B4 and B5 construct the core part Inline graphic of a DAG representation of target e-components.

Step B4: Constructing Inline graphic

(i)
For each , create nodes , set with and let and .
(ii)
For each integer , create a node called a core-node (a c-node, for short) for each feasible vector , set and let (resp., ) denote the set of these c-nodes if (resp., ).
(iii)
Let .

Step B5: Constructing Inline graphic

(i)
Step 4: For each integer and each tuple such that , execute the following:
- If there are feasible vectors with and such that is a feasible vector for an edge-configuration with , then create an arc a between c-nodes and with and so that for , arc a is a left-arc directed from to ; for , arc a is a right-arc directed from to .
- If there are feasible vectors with and such that is a feasible vector for an edge-configuration with , then create an arc a between c-nodes and with and so that for , arc a is a left-arc directed from to ; for , arc a is a right-arc directed from to .
(ii)
Step 5: For each feasible vector pair , create a middle-arc that leaves from c-node and enters c-node and set and for the bond-multiplicity m determined for the pair uniquely by (6).
(iii)
Let , denote the set of arcs such that and set .

Experimental results

We implemented our new dynamic programming algorithm that generates the target components and conducted experiments to evaluate its computational efficiency. We executed the experiments on a PC with a 3.0 GHz Core i7-9700 processor and 16 GB of DDR4 RAM. We used ChemDoodle version 10.2.0 for constructing 2D drawings of chemical graphs.

We set the branch parameter Inline graphic to 2. We conducted the following three experiments.

Select some chemical compounds from PubChem and a vertex v in and compute the DAG representation of the set of all target v-components of the frequency vector of the v-component of at v.
Select some chemical compounds from PubChem and a u, v-path P such that all internal vertices in P are of degree 2 in the core and compute the DAG representation of the set of all target e-components of the frequency vector of the e-component of at e by regarding uv as a base-edge .
Select some chemical compounds from PubChem and a base-graph and compute the set of the DAG representations for the target v-components and the DAG representations .

We set the time limit to 3600 sec. In the following tables, M.O. and T.O. mean memory out and time out, respectively. The proposed algorithm is compared with the state-of-the-art compound generator MOLGEN developed by Gugisch et al.³⁸. Using the online version³⁹ of MOLGEN, we generated Inline graphic (the available limit) isomers for each instance discussed in the following sections. In these experiments, we applied available restrictions on the isomers such as the number of bonds, cycles, maximum bond multiplicity, single bonds, double bonds and triple bonds. Note that the isomers generated by MOLGEN can be structurally different from those generated by the proposed method.

Results of experiment 1

For this experiment, we selected six instances Inline graphic as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 7600, CID: 152211, CID: 497892, CID: 46930263, CID: 67558426 and CID: 46930349 which are denoted by , , respectively. Each chemical instance has one benzene ring as the core and only one core vertex v that is adjacent to a non-core vertex as illustrated in Figures 4(a) and 15. For this vertex v, we generate target v-components.

Fig. 15 — (a)-(e) Instances to compute target v-components, respectively. All instances have one benzene ring and a chemical acyclic graph joining vertex v in the benzene ring and vertex u.

Table 1 shows the results of computing v-components, where we denote the following:

i: the instance , ;
: the set of chemical elements in the v-component in instance ;
: the number of vertices in the v-component ;
: the number of different edge-configurations of 2-internal edges in the v-component ;
: the number of different edge-configurations of 2-external edges in the v-component ;
: the height of the v-component ;
D-time: the running time (sec.) to construct the DAG representation ;
: the number of vertices in ;
: the number of edges in ;
p-time: the running time (sec.) to trace all paths from the sources to the sinks in ;
p: the number of all paths from the sources to the sinks in ;
T-LB: a lower bound on the number of all target v-components ;
G-time (resp., G-time³⁸): the running time (sec.) to construct all (or up to ) (resp., ) target v-components (resp., isomers by MOLGEN³⁸) from ;
(resp., ³⁸): the number of all (or up to ) (resp., ) target v-components (resp., isomers by MOLGEN³⁸) generated from .

From Table 1, we observe that with the increase in the size of Inline graphic , there is no significant increase in D-time, p-time and G-time. Furthermore, the D-time, p-time and G-time are bounded above by 0.162, 0.255, and 0.136, respectively, from which it is evident that the proposed algorithm can generate target v-components efficiently. Note that the running time of the proposed algorithm is significantly lower than that of MOLGEN³⁸.

Table 1.

Results for Computing Target v-components.

				D-time		p-time	p	T-LB	G-time		G-time³⁸
10	C,O,N	5, 3	7	0.000	20, 27	0.000	8	8	0.000	8	31.643
20	C,O,N	6, 7	11	0.008	343, 780	0.001	1440	1440	0.016	1440	46.22
25	C,O,N	7, 5	14	0.100	3427,	0.192			0.121		33.185
27	C,O,N	9, 6	17	0.115	6523,	0.163			0.129		58.185
28	C,O,N	8, 7	17	0.129	5627,	1.640			0.132		61.73
29	C,O,N	9, 6	19	0.162	8967,	0.255			0.136		42.724


Results of experiment 2

For this experiment, we selected six instances Inline graphic as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 3729083, CID: 129130, CID: 47622, CID: 195338, CID: 497867 and CID: 10325899 which are denoted by , , respectively. Each chemical instance has two benzene rings and a chemical acyclic graph joining them, where we denote by u and v the common vertices with the rings and Inline graphic as illustrated in Figures 4(b) and 16. We regard as a base-edge and generate target e-components of .

Fig. 16 — (a)-(e) Instances , to compute target e-components, respectively. All instances have two benzene rings and a chemical acyclic graph joining two vertices u and v in these benzene rings.

Table 2 shows the results of computing e-components, where we denote the following:

i: the instance , ;
: the set of chemical elements in the e-component in instance ;
: the number of vertices in ;
: the number of different edge-configurations of the core edges in ;
: the number of different edge-configurations of 2-internal edges in ;
: the number of different edge-configurations of 2-external edges in ;
: the number of 2-branch core vertices in ;
: the maximum height of a chemical rooted tree rooted at a core vertex in ;
D-time: the running time (sec.) to construct the DAG representation ;
: the number of vertices in ;
: the number of edges in ;
p-time: the running time (sec.) to trace all paths from the sources to the sinks in ;
p: the number of all paths from the sources to the sinks in ;
T-LB: a lower bound on the number of all target e-components ;
G-time (resp., G-time³⁸): the running time (sec.) to construct all (or up to ) (resp., ) target e-components (resp., isomers by MOLGEN³⁸) from ;
(resp., ³⁸): the number of all (or up to ) (resp., ) target e-components (resp., isomers by MOLGEN³⁸) generated from .

By Table 2, the D-time, p-time and G-time are bounded above by 9.580, 6.420, and 0.168, respectively. This implies that the proposed algorithm can efficiently generate target e-components. Moreover, the running time of the proposed algorithm is significantly lower than that of MOLGEN³⁸.

Table 2.

Results for Computing e-components.

				D-time		p-time	p	T-LB	G-time		G-time³⁸
17	C,O,N	4, 3, 3	1, 7	0.000	24, 27	0.000	12	12	0.000	12	28.822
17	C,O,N	3, 2, 5	1, 4	0.001	78, 116	0.001	180	180	0.002	180	29.083
20	C,O,N	4, 0, 3	0, 2	0.001	193, 313	0.004	1512	1512	0.020	1512	97.158
25	C,O,N	6, 3, 4	1, 5	0.011	525, 924	0.024			0.152		39.25
30	C,O,N	3, 6, 6	1, 9	0.908	788, 2275	0.310			0.168		59.099
32	C,O,N	3, 6, 7	3, 9	9.580	,	6.420			0.165		58.792


Results of experiment 3

For this experiment, we selected five Inline graphic instances as follows. We selected from the database PubChem five cyclic chemical compounds, CID: 1356, CID: 89834791, CID: 334516, CID: 91420002 and CID: 124165467 which are denoted by , , respectively. These instances are illustrated in Figure 17.

Fig. 17 — (a)-(e) Instances , , respectively, to generate isomers, where a set is selected as the set of vertices indicated with asterisks.

Table 3 shows the results of computing chemical isomers Inline graphic of , where we denote the following:

i: the instance , ;
: the number of the given chemical graph in instance ;
: the number of chemical elements in ;
: the number of different edge-configurations of the core edges in ;
: the number of different edge-configurations of 2-internal edges in ;
: the number of different edge-configurations of 2-external edges in ;
: the number of base-vertices of a base-graph selected to ;
: the number of base-edges of a base-graph selected to ;
: the core size of ;
: the number of 2-branch core vertices in ;
: the core height of ;
D-time: the running time (sec.) to construct the set of DAG representations and ;
: the total number of vertices in the set of DAG representations and ;
: the total number of edges in the set of DAG representations and ;
D: the number of DAG representations successfully constructed within the time limit out of the total number of DAG representations defined to ;
G-LB: a lower bound on the number of all chemical isomers of ;
G-time (resp., G-time³⁸): the running time (sec.) to construct all (or up to ) (resp., ) chemical isomers of (resp., isomers by MOLGEN³⁸);
(resp., ³⁸): the number of all (or up to ) (resp., ) chemical isomers of generated from the set of DAG representations (resp., isomers generated by MOLGEN³⁸);
#non-iso: the number of non-isomorphic chemical isomers generated by the proposed algorithm.

By Table 3, the D-time of every instance except one is at most 0.004 seconds, which is very small. The exceptional instance has a considerably larger DAG representation, and thus its D-time is 24.600 seconds. However, the G-time for all instances is at most 0.266 seconds, and hence the proposed algorithm can efficiently generate isomers for an instance with around 70 non-hydrogen atoms. Furthermore, the proposed algorithm outperforms MOLGEN³⁸ in terms of running time and the number of isomers.

Table 3.

Results for computing chemical isomers of the five instances.

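vertices	elements	edge-configurations (core, 2-internal, 2-external)	base-vertices, base-edges	core size, 2-branch core vertices, core height	D-time	DAG vertices, DAG edges	D	G-LB	G-time	isomers generated	#non-iso	G-time³⁸
40	3	9, 1, 3	6, 8	32, 1, 3	0.002	429, 803	2		0.180			30.964
50	3	9, 9, 8	5, 7	22, 2, 9	24.600	,	7		0.195			43.575
50	3	10, 0, 2	10, 13	25, 0, 1	0.000	134, 184	5	6912	0.153	6912	6912	32.978
50	3	7, 4, 8	5, 7	24, 4, 6	0.001	124, 165	7	4608	0.090	4608	3312	77.341
70	3	8, 4, 7	5, 7	42, 4, 6	0.004	303, 392	7		0.266		5428	46.476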
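
The running-time comparison with MOLGEN³⁸ can be made concrete with a small calculation on the Table 3 values. The sketch below is only illustrative: treating D-time plus G-time as the total time of the proposed method is an assumption made here, and the names `ROWS`, `time_ratio` and `RATIOS` are hypothetical. The ratio compares wall-clock times only and says nothing about how many isomers each tool produced in that time.

```python
# (D-time, G-time, G-time by MOLGEN) in seconds, copied from the five rows of Table 3.
ROWS = [
    (0.002, 0.180, 30.964),
    (24.600, 0.195, 43.575),
    (0.000, 0.153, 32.978),
    (0.001, 0.090, 77.341),
    (0.004, 0.266, 46.476),
]


def time_ratio(d_time, g_time, molgen_time):
    """Return MOLGEN time divided by (D-time + G-time).

    Summing D-time and G-time is an assumption of this sketch,
    not a quantity reported in the table.
    """
    total = d_time + g_time
    if total <= 0:
        raise ValueError("total running time must be positive")
    return molgen_time / total


RATIOS = [time_ratio(*row) for row in ROWS]

if __name__ == "__main__":
    for row_index, ratio in enumerate(RATIOS, start=1):
        print(f"row {row_index}: MOLGEN time / (D-time + G-time) = {ratio:.2f}")
```

Under this assumption every ratio exceeds 1, and the smallest ratio belongs to the row whose D-time is 24.600 seconds, because DAG construction dominates its total time.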

For an in-depth analysis of our algorithm, we randomly selected 100 instances from the PubChem database for each setting considered and generated their chemical isomers. For each setting, the instances are indexed from 1 to 100, and their CIDs are given in the supplementary material S1_instances. Figures 18(a)-(i) show, for each setting, the time D-time to construct the DAG representations and the time G-time to generate chemical isomers. The averages of D-time and G-time over all instances are 0.1884 and 0.0565 seconds, respectively, which are reasonably small. Thus the proposed algorithm can efficiently generate small chemical isomers. Similarly, Figures 19(a)-(i) show the number of chemical isomers generated by the proposed algorithm and the number #non-iso of non-isomorphic ones among them. From Figure 19 we observe that, for most instances, the difference between these two numbers is very small, and hence the proposed algorithm can efficiently generate a large number of non-isomorphic chemical isomers with around 70 non-hydrogen atoms.
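
The gap between the number of generated isomers and #non-iso can be summarised as the fraction of generated isomers that are pairwise non-isomorphic. The following sketch computes that fraction exactly for the two rows of Table 3 in which both counts are legible; the function name `distinct_fraction` and the dictionary `TABLE3_COUNTS` are hypothetical names introduced here for illustration.

```python
from fractions import Fraction

# (isomers generated, #non-iso) for the two Table 3 rows where both counts are given.
TABLE3_COUNTS = {
    "row 3": (6912, 6912),
    "row 4": (4608, 3312),
}


def distinct_fraction(generated, non_isomorphic):
    """Exact fraction of generated isomers that are pairwise non-isomorphic."""
    if generated <= 0:
        raise ValueError("the number of generated isomers must be positive")
    if not 0 <= non_isomorphic <= generated:
        raise ValueError("#non-iso must lie between 0 and the number generated")
    return Fraction(non_isomorphic, generated)


if __name__ == "__main__":
    for name, (generated, non_iso) in TABLE3_COUNTS.items():
        frac = distinct_fraction(generated, non_iso)
        print(f"{name}: {frac} = {float(frac):.5f}")
```

For row 3 the fraction is 1, meaning no duplicates up to isomorphism were produced, while for row 4 it is 23/32.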

Fig. 18 — (a)-(i) Plots of D-time and G-time.

Fig. 19 — (a)-(i) Plots of the number of generated chemical isomers and #non-iso.

Concluding remarks

In this paper, we considered the problem of enumerating all chemical isomers of a given -lean chemical graph, where an isomer and the given graph have a common base-graph and the same feature vector under the feature function based on the frequency of edge-configurations. The dynamic programming algorithm designed by Akutsu and Nagamochi³⁷ can find only a limited number of chemical isomers. In this paper, we improved that algorithm and designed a new backtracking procedure to construct a compact DAG representation of the set of all chemical isomers. By tracing the DAG representation, we can count all chemical isomers and generate each of them, at random if necessary. We implemented the proposed method, and our computational results suggest that the DAG representation of chemical isomers can be constructed for an instance with around 70 vertices. These experiments suggest that the proposed algorithm can support the discovery of novel drugs by efficiently exploring the search space of target chemical compounds.
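
The idea of counting and randomly generating objects by tracing a DAG can be illustrated with a generic sketch. It assumes, for illustration only, that every path from a source to a sink of a DAG encodes exactly one object; the toy graph `ARCS` and the functions `path_counts` and `sample_path` are hypothetical and do not reproduce the DAG representations defined in this paper. Counting is a single dynamic programming pass over the DAG, and a uniformly random path is obtained by choosing each successor with probability proportional to the number of paths that continue through it.

```python
import random

# Hypothetical toy DAG: ARCS[v] lists the successors of v; "s" is the source, "t" the sink.
ARCS = {
    "s": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": ["t"],
    "d": ["c", "t"],
    "t": [],
}


def path_counts(arcs, sink):
    """Return, for every vertex v, the number of directed paths from v to sink."""
    counts = {}

    def visit(v):
        if v not in counts:
            # The sink contributes the single empty path; other vertices sum over successors.
            counts[v] = 1 if v == sink else sum(visit(w) for w in arcs[v])
        return counts[v]

    for v in arcs:
        visit(v)
    return counts


def sample_path(arcs, source, sink, counts, rng):
    """Draw one source-to-sink path uniformly at random among all such paths."""
    if counts[source] == 0:
        raise ValueError("no path from source to sink")
    path = [source]
    while path[-1] != sink:
        successors = arcs[path[-1]]
        # Weighting by the remaining path counts makes every full path equally likely.
        weights = [counts[w] for w in successors]
        path.append(rng.choices(successors, weights=weights, k=1)[0])
    return path


COUNTS = path_counts(ARCS, "t")

if __name__ == "__main__":
    print("number of encoded objects:", COUNTS["s"])
    print("one random object:", sample_path(ARCS, "s", "t", COUNTS, random.Random(0)))
```

In this toy graph the count at the source is 5, obtained without listing the paths, which mirrors how a compact representation allows counting without explicit generation.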

While the proposed method demonstrates promising performance for graphs with up to around 70 vertices, it may face scalability issues for larger chemical structures. Moreover, the method currently applies only to -lean chemical graphs and relies on specific base-graph definitions, which may limit its general applicability. Additionally, the feature function is restricted to edge-configuration frequencies, and a rigorous analysis of the computational complexity remains future work.

Supplementary Information

Supplementary Information (PDF, 53.7 KB).

Acknowledgements

This research was supported, in part, by the Japan Society for the Promotion of Science, Japan, under Grant #22H00532.

Author contributions

Ryota Ido: Software, validation, data resources; Naveed Ahmed Azam: Software, validation, data resources, formal analysis, writing—review and editing; Jianshen Zhu: Software, validation, data resources; Hiroshi Nagamochi: Conceptualization, methodology, formal analysis, data resources, writing—original draft preparation, project administration; Tatsuya Akutsu: Conceptualization, methodology, data resources, writing—review and editing, funding acquisition. All authors read and approved the final manuscript.

Funding

This research was supported, in part, by the Japan Society for the Promotion of Science, Japan, under Grants #18H04113, #22H00532, and #22K19830.

Data availability

Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-05976-0.

References

1. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4(2), 268–276 (2018).
2. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4(1), 120–131 (2017).
3. Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. & Tsuda, K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18(1), 972–976 (2017).
4. Kusner, M. J., Paige, B. & Hernández-Lobato, J. M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 1945–1954 (2017).
5. De Cao, N. & Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv:1805.11973 (2018).
6. Madhawa, K., Ishiguro, K., Nakago, K. & Abe, M. GraphNVP: an invertible flow model for generating molecular graphs. arXiv:1905.11600 (2019).
7. Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. arXiv:2001.09382 (2020).
8. Arockiaraj, M. et al. Topological and entropy indices in QSPR studies of N-carbophene covalent organic frameworks. BioNanoSci. 14(3), 2762–2773 (2024).
9. Zhang, X. et al. Distance-based topological characterization, graph energy prediction, and NMR patterns of benzene ring embedded in P-type surface in 2D network. Sci. Rep. 14(1), 23766 (2024).
10. Arockiaraj, M., Greeni, A. B., Kalaam, A. R. A., Aziz, T. & Alharbi, M. Mathematical modeling for prediction of physicochemical characteristics of cardiovascular drugs via modified reverse degree topological indices. Eur. Phys. J. E, Soft Matter 47(8), 53 (2024).
11. Ikebata, H., Hongo, K., Isomura, T., Maezono, R. & Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 31(4), 379–391 (2017).
12. Miyao, T., Kaneko, H. & Funatsu, K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 56(2), 286–299 (2016).
13. Rupakheti, C., Virshup, A., Yang, W. & Beratan, D. N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 55(3), 529–537 (2015).
14. Wazzan, S., Hayat, S. & Ismail, W. Optimizing structure-property models of three general graphical indices for thermodynamic properties of benzenoid hydrocarbons. J. King Saud Univ. Sci. 36(11), 103541 (2024).
15. Hayat, S. et al. Optimizing predictive models for evaluating the F-temperature index in predicting the π-electron energy of polycyclic hydrocarbons, applicable to carbon nanocones. Sci. Rep. 14(1), 25494 (2024).
16. Hayat, S., Arfan, A., Khan, A., Jamil, H. & Alenazi, M. J. F. An optimization problem for computing predictive potential of general sum/product-connectivity topological indices of physicochemical properties of benzenoid hydrocarbons. Axioms 13(6), 342 (2024).
17. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 16(1), 3–50 (1996).
18. Blum, L. C. & Reymond, J. L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131(25), 8732–8733 (2009).
19. Meringer, M. & Schymanski, E. L. Small molecule identification with MOLGEN and mass spectrometry. Metabolites 3(2), 440–462 (2013).
20. Benecke, C. et al. MOLGEN, a generator of connectivity isomers and stereoisomers for molecular structure elucidation. Anal. Chim. Acta 314(3), 141–147 (1995).
21. Kerber, A., Laue, R., Grüner, T. & Meringer, M. MOLGEN 4.0. Match Commun. Math. Comput. Chem. 37, 205–208 (1998).
22. Peironcely, J. E. et al. OMG: Open Molecule Generator. J. Cheminform. 4(1), 21 (2012).
23. Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48(3), 722–730 (2015).
24. Fujiwara, H., Wang, J., Zhao, L., Nagamochi, H. & Akutsu, T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 48(7), 1345–1357 (2008).
25. Li, J., Nagamochi, H. & Akutsu, T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 15(2), 633–646 (2016).
26. Suzuki, M., Nagamochi, H. & Akutsu, T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J. Cheminform. 6(1), 31 (2014).
27. Tamura, Y. et al. Enumerating chemical graphs with mono-block 2-augmented tree structure from given upper and lower bounds on path frequencies. arXiv:2004.06367 (2020).
28. Yamashita, K. et al. Enumerating chemical graphs with two disjoint cycles satisfying given path frequency specifications. arXiv:2004.08381 (2020).
29. Vogt, M. & Bajorath, J. Chemoinformatics: a view of the field and current trends in method development. Bioorg. Med. Chem. 20(18), 5317–5323 (2012).
30. Azam, N. A. et al. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies – Volume 3: BIOINFORMATICS, Valetta, Malta. 101–108 (2020).
31. Zhang, F. et al. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan. 433–444. 10.1007/978-3-030-55789-8_38 (2020).
32. Azam, N. A. et al. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. Algorithms for Mol. Biol. 16(1), 18. 10.3390/a13050124 (2020).
33. Ito, R. et al. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming. In Proceedings of the 21st International Conference on Bioinformatics & Computational Biology. (2020).
34. Zhu, J., Wang, C., Shurbevski, A., Nagamochi, H. & Akutsu, T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms 13(5), 124. 10.3390/a13050124 (2020).
35. Zhu, J. et al. A novel method for inferring of chemical compounds with prescribed topological substructures based on integer programming. IEEE/ACM Trans. Comput. Biol. and Bioinform. arXiv:2010.09203 (2020).
36. Azam, N. A., Zhu, J., Ido, R., Nagamochi, H. & Akutsu, T. Experimental results of a dynamic programming algorithm for generating chemical isomers based on frequency vectors. In Proceedings of the Fourth International Workshop on Enumeration Problems and Applications: WEPA. online, paper ID 15. (2020).
37. Akutsu, T. & Nagamochi, H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv:2010.09203 (2020).
38. Gugisch, R., Kerber, A., Kohnert, A., Laue, R., Meringer, M., Rücker, C. & Wassermann, A. MOLGEN 5.0, A Molecular Structure Generator. In Advances in Mathematical Chemistry and Applications: Revised Edition, 1, pp. 113–138 (2016).
39. MOLGEN Team, MOLGEN 5.0, A Molecular Structure Generator. Available at https://www.molgen.de/online.html. Accessed 2025.
