Scientific Reports. 2025 Jul 1;15:22214. doi: 10.1038/s41598-025-05976-0

A dynamic programming algorithm for generating chemical isomers based on frequency vectors

Ryota Ido 1, Naveed Ahmed Azam 1,2, Jianshen Zhu 1, Hiroshi Nagamochi 1, Tatsuya Akutsu 3
PMCID: PMC12219021  PMID: 40594631

Abstract

We propose a dynamic programming algorithm that generates chemical isomers of a given chemical compound with cycles. We represent a chemical compound as a chemical graph and define its feature vector based on graph-theoretical descriptors. Our descriptors mainly consist of the occurrence of “edge-configuration” that captures the information of adjacent atoms such as their degrees and bond-multiplicity. We call two chemical graphs chemical isomers of each other if they have the same feature vector and share a common prescribed structure. Our proposed algorithm produces a compact representation of all chemical isomers of a given chemical graph. This representation enables efficient counting of chemical isomers without requiring explicit generation. Furthermore, our algorithm allows us to enumerate any number of isomers, even at random. For example, our compact representation for a chemical graph with 70 non-hydrogen atoms contains around 400 arcs in which Inline graphic chemical isomers are embedded. The proposed algorithm serves as a powerful tool for accelerating chemical compound exploration, particularly in drug discovery and material science, where identifying novel molecular structures is critical. By efficient enumeration of isomers, our approach enhances the search space exploration for target chemical compounds, facilitating advancements in molecular design.

Keywords: Molecular Design, Enumeration of Graphs, Dynamic Programming

Subject terms: Computational biology and bioinformatics, Computer science

Introduction

Graphs are a fundamental data structure in computer science and have been extensively utilized in computational molecular biology, especially for representing chemical molecules. The design of novel graph structures has recently gained significant attention in artificial neural network (ANN) research and related fields. In particular, extensive studies have been done on designing chemical graphs with desired chemical properties because of their potential application to drug design. For example, variational autoencoders [1], recurrent neural networks [2,3], grammar variational autoencoders [4], generative adversarial networks [5], and invertible flow models [6,7] have been applied.

Quantitative structure activity/property relationship (QSAR/QSPR) models are computational modeling techniques used in cheminformatics. They aim to establish mathematical relationships between the structural attributes of chemical compounds and their biological activities or physicochemical properties [8–10]. The design of chemical graphs has also been studied for many years in cheminformatics, where the problem is referred to as inverse quantitative structure activity/property relationships (inverse QSAR/QSPR). In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding chemical graphs having desired properties [11–16].

In many such methods, enumeration of graph structures from a given set of descriptors is a crucial subtask. However, enumeration is itself a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S may exceed Inline graphic [17]. Enumerating chemical compounds has a long history and numerous applications, such as designing novel drugs [18] and structure elucidation [19]. The problem of enumerating chemical compounds can be viewed as the problem of enumerating graphs with given constraints, which is one of the fundamental problems in discrete mathematics and has many applications. Various methods have been developed for general graph structures [20–23] and for restricted chemical compounds [24–28]. Enumerating restricted chemical compounds with specialized tools is more efficient than using tools designed for general graph structures, which has led to a new trend in cheminformatics [29].

Recently, a novel framework for inferring chemical graphs has been proposed [30,31]. This framework is illustrated in Figure 1. One of the important stages of this framework is to enumerate chemical isomers of a given chemical graph. The enumeration algorithms required in this framework have been designed for chemical compounds with cycle index at most 2 [31–34]. The computational results show that these algorithms can generate chemical graphs with around 15 vertices without hydrogen atoms, but cannot deal with large chemical graphs because of the long computation time. Instead of focusing on a general chemical graph structure that rarely exists, Azam et al. [32] introduced a restricted class of acyclic graphs that is characterized by an integer Inline graphic, called a “branch parameter”, such that the restricted class still covers most of the acyclic chemical compounds in the PubChem database. Based on this characterization, they designed an efficient algorithm to generate acyclic graphs with around 50 vertices without hydrogen atoms. Recently, Akutsu and Nagamochi [37] extended the idea to define a restricted class of cyclic graphs, called “Inline graphic-lean cyclic graphs”, that covers most of the cyclic chemical compounds in the PubChem database, in order to deal with large chemical graphs. Accordingly, they proposed an algorithm to generate chemical isomers of a Inline graphic-lean chemical graph. The method has been implemented, and computational results showed that chemical graphs with up to around 50 non-hydrogen atoms can be inferred and generated [35,36].

Fig. 1.

An illustration of a framework for inferring a set of chemical graphs Inline graphic.

The idea of the chemical graph generation algorithm developed in [37] is to construct a required chemical isomer starting with small chemical subgraphs in a bottom-up manner, where chemical subgraphs are encoded into frequency vectors and the actual construction is carried out by computing frequency vectors. However, this algorithm was designed to efficiently generate a small number of chemical graphs, and it has no backtracking procedure for generating all isomers of a rather large chemical graph that admits extremely many isomers. To generate all isomers, we need to design a procedure that backtracks the computation of frequency vectors in a top-down manner. In this paper, we design a dynamic programming algorithm that enumerates all isomers by constructing a compact representation of the set of all the isomers, from which we can generate any number of isomers.

The paper is organized as follows. Section 2 reviews a model of chemical compounds and a choice of descriptors. Section 3 proposes a dynamic programming algorithm that generates chemical isomers of a given chemical graph. Section 4 reports the results of some computational experiments. Section 5 makes some concluding remarks. The proposed method is available on GitHub at https://github.com/ku-dml/mol-infer.

Preliminary

Let $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let [a, b] denote the set of integers i with $a \le i \le b$.

Graphs

Given a graph G, let V(G) and E(G) denote the sets of vertices and edges, respectively and let Inline graphic denote the set of neighbors of a vertex v in G. The length of a path is defined to be the number of edges in the path. Denote by Inline graphic the length of a path P. A rooted tree is defined to be a tree where a vertex is designated as the root. The height Inline graphic of a vertex v in a rooted tree T is defined to be the maximum length of a path from v to a leaf u, and the height Inline graphic of T is defined to be the height Inline graphic of the root r.

As an extension of rooted trees, we define a bi-rooted tree to be a tree T with two designated vertices Inline graphic and Inline graphic, called terminals. Let T be a bi-rooted tree. Define the backbone path Inline graphic to be the path of T between terminals Inline graphic and Inline graphic, and denote by Inline graphic (or by Inline graphic) the set of components of T in the graph Inline graphic obtained from T by removing the edges in Inline graphic, where we regard each tree Inline graphic as a tree rooted at the unique vertex in Inline graphic. The height Inline graphic of T is defined to be the maximum of the heights of rooted trees in Inline graphic.

The rank of a graph G is defined to be the minimum number of edges to be removed to make the graph a tree. We call a graph with rank k a rank-k graph. Figure 2 illustrates three examples of rank-2 graphs Inline graphic, Inline graphic.
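The definition of rank can be illustrated with a short computation. For a connected graph, the minimum number of edges whose removal leaves a tree equals |E(G)| - |V(G)| + 1. The following is a minimal sketch in Python; the function name `graph_rank` and the edge-list input format are assumptions made for illustration and are not part of the proposed method.

```python
from collections import deque

def graph_rank(vertices, edges):
    """Rank of a connected simple graph, computed as |E| - |V| + 1.

    `vertices` is an iterable of hashable labels and `edges` a list of
    unordered pairs (u, v); both formats are illustrative assumptions.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # The formula is only valid for connected graphs, so check
    # connectivity with a breadth-first search first.
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    if len(seen) != len(adj):
        raise ValueError("graph must be connected")
    return len(edges) - len(adj) + 1

# Two triangles sharing vertex 3: 6 edges, 5 vertices, rank 2.
two_triangles = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
print(graph_rank(range(1, 6), two_triangles))
```

A tree has rank 0 under this formula, and each additional independent cycle raises the rank by one.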

Fig. 2.

An illustration of rank-2 graphs Inline graphic, Inline graphic, where the core vertices (resp., non-core vertices) are depicted with squares (resp., circles), the 2-branch vertices are depicted with gray circles (a) Inline graphic is 2-lean with Inline graphic where Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic; (b) Inline graphic is not 2-lean with Inline graphic where Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic; (c) Inline graphic is not 2-lean with Inline graphic where Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

The core of a graph with cycles is defined to be the subgraph formed by the cycles and the paths between the cycles. More precisely, let H be a connected simple graph with rank at least 1. The core Inline graphic of H is defined to be the induced subgraph Inline graphic such that Inline graphic is the set of vertices in cycles of H and Inline graphic is the set of vertices each of which is in a path between two vertices Inline graphic. A vertex (resp., an edge) in H is called a core vertex (resp., core edge) if it is contained in the core Inline graphic, i.e., it lies on a cycle or on a path between cycles. A vertex or edge that is not in the core is called a non-core vertex or a non-core edge, respectively.
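For a connected graph with rank at least 1, the vertices lying on cycles or on paths between cycles are exactly those that survive repeated removal of degree-1 vertices. The sketch below uses this standard leaf-peeling observation to compute the set of core vertices; it is an illustration under that assumption, not the implementation used in the paper, and the names `build_adj` and `core_vertices` are hypothetical.

```python
def build_adj(edges):
    """Adjacency sets of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def core_vertices(adj):
    """Core vertices of a connected graph with at least one cycle.

    Vertices of degree at most 1 are peeled off repeatedly; what remains
    are the vertices on cycles and on paths between cycles.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    stack = [v for v, d in degree.items() if d <= 1]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] <= 1:
                    stack.append(w)
    return set(adj) - removed

# Two triangles a-b-c and e-f-g joined by the path c-d-e, plus a pendant vertex h.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"),
         ("e", "f"), ("f", "g"), ("g", "e"), ("g", "h")]
print(sorted(core_vertices(build_adj(edges))))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

In this example the connecting vertex d is a core vertex although it lies on no cycle, while the pendant vertex h is a non-core vertex.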

The core size Inline graphic is defined to be Inline graphic. An exterior tree T is defined to be a maximal induced subtree of H such that V(T) contains exactly one core vertex v of H, where T is regarded as a rooted tree rooted at v. For example, in Figure 2(a), the tree induced by the vertex set Inline graphic is an exterior tree of Inline graphic. The core height Inline graphic is defined to be the maximum height Inline graphic of an exterior tree T of H.
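Given the set of core vertices, the core height can be obtained by walking from each core vertex into the non-core vertices hanging off it and recording the greatest depth reached. The following sketch assumes an adjacency-set dictionary and a precomputed core vertex set as inputs; the function name `core_height` and these input formats are illustrative assumptions.

```python
def core_height(adj, core):
    """Maximum height of an exterior tree rooted at a core vertex.

    `adj` maps each vertex to the set of its neighbors and `core` is the
    set of core vertices. Every non-core vertex belongs to exactly one
    exterior tree, so a depth-first walk that never enters another core
    vertex visits precisely the exterior tree of each root.
    """
    best = 0
    for root in core:
        stack = [(root, None, 0)]
        while stack:
            v, parent, depth = stack.pop()
            best = max(best, depth)
            for w in adj[v]:
                if w != parent and w not in core:
                    stack.append((w, v, depth + 1))
    return best

# Triangle 1-2-3 with an exterior tree at vertex 3: 3-4, 4-5, 4-6, 6-7.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6},
       5: {4}, 6: {4, 7}, 7: {6}}
core = {1, 2, 3}
print(len(core), core_height(adj, core))  # core size 3, core height 3
```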

The core size and core height of the three rank-2 graphs Inline graphic, Inline graphic illustrated in Figures 2(a)-(c) are Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

Branch parameter

Choose a positive integer Inline graphic as a branch parameter [32]. A non-core vertex v is called a Inline graphic-internal vertex (resp., a Inline graphic-external vertex) if Inline graphic (resp., Inline graphic). A non-core edge e is called a Inline graphic-internal edge (resp., a Inline graphic-external edge) if e is incident to no Inline graphic-external vertex (resp., to a Inline graphic-external vertex). A Inline graphic-internal vertex v is called a Inline graphic-branch if v has at least two children, each of which has height at least Inline graphic, where a Inline graphic-branch v is called a leaf Inline graphic-branch if Inline graphic.

A Inline graphic-fringe tree is defined to be a maximal subtree Inline graphic of an exterior tree T such that the edge set of Inline graphic consists of Inline graphic-external edges. For example, in Figure 2(a), the tree induced by the vertex set Inline graphic is a 2-fringe tree. Note that every exterior tree T contains a Inline graphic-fringe tree if and only if Inline graphic. The Inline graphic-branch leaf number Inline graphic of H is defined to be the number of leaf Inline graphic-branches in H.

We call an exterior tree T of H a Inline graphic-exterior tree if Inline graphic; i.e., it contains at least one leaf Inline graphic-branch.

We call a core vertex adjacent to a Inline graphic-exterior tree a Inline graphic-branch core vertex, denote by Inline graphic the set of Inline graphic-branch core vertices and define the Inline graphic-branch core size Inline graphic to be Inline graphic. Note that Inline graphic, Inline graphic and either Inline graphic or Inline graphic.

We call a cyclic graph H Inline graphic-lean if every exterior tree T contains at most one leaf Inline graphic-branch; i.e., the set of Inline graphic-internal edges in each exterior tree T forms a single path.

Figure 2 illustrates three examples of rank-2 graphs. In the first example, Inline graphic and Inline graphic are the leaf 2-branches, Inline graphic and Inline graphic are the 2-branch core vertices, Inline graphic holds and Inline graphic is 2-lean. In the second example, Inline graphic and Inline graphic are the leaf 2-branches, Inline graphic is the 2-branch core vertex, Inline graphic holds and Inline graphic is not 2-lean. In the third example, Inline graphic and Inline graphic are the leaf 2-branches, Inline graphic is the non-leaf 2-branch, Inline graphic is the 2-branch core vertex, Inline graphic holds and Inline graphic is not 2-lean.

For Inline graphic, nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean. This statistical fact allows us to focus on 2-lean chemical graphs instead of general chemical graph structures which are relatively difficult to generate and may not be of practical use. Over 92% of 2-fringe trees of chemical compounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint:

graphic file with name 41598_2025_5976_Article_Equ1.gif 1

Thus we can focus on designing an efficient algorithm to enumerate fringe trees that satisfy Eq. (1), instead of generating chemical graphs with any kind of fringe trees that rarely exist.

A hydrogen-suppressed model for chemical compounds

We represent the graph structure of a chemical compound as a graph H with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. In a cyclic graph H, we regard each non-core edge Inline graphic as a directed edge (u, v) from a vertex u to a child v of u in an exterior tree of H in order to define a descriptor that exploits the direction of non-core edges.

Let Inline graphic be a set of labels each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that Inline graphic does not contain H (hydrogen). Let Inline graphic and Inline graphic denote the mass and valence of a chemical element Inline graphic, respectively. We define an adjacency-configuration to be a tuple Inline graphic with chemical elements Inline graphic and a bond-multiplicity Inline graphic; a chemical symbol to be a pair Inline graphic of the chemical element Inline graphic and the degree Inline graphic, where Inline graphic denotes the set of all chemical symbols; and an edge-configuration to be a tuple Inline graphic with Inline graphic and Inline graphic.

We choose a branch parameter Inline graphic and two sets Inline graphic and Inline graphic of chemical symbols and three sets Inline graphic, Inline graphic and Inline graphic of edge-configurations.

Let Inline graphic be an edge in a chemical graph G such that Inline graphic are assigned to the vertices u and v, the degrees of u and v are i and j, respectively, and the bond-multiplicity between them is m. When uv is a core edge, the edge-configuration Inline graphic of edge e is defined to be Inline graphic if Inline graphic in a total order over Inline graphic (or Inline graphic otherwise). When uv is a non-core edge which is regarded as a directed edge (u, v) where u is the parent of v in some exterior tree, the edge-configuration Inline graphic of a Inline graphic-internal (resp., Inline graphic-external) edge e is defined to be Inline graphic (resp., Inline graphic).
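The distinction between core and non-core edge-configurations can be sketched in code. The example below assumes that a chemical symbol is the pair (element, degree), that an edge-configuration is the triple of two chemical symbols and a bond-multiplicity, and that the total order on chemical symbols is Python's built-in tuple ordering; the actual order used in the paper is not recoverable from the text here, so this choice and the function name `edge_configuration` are purely illustrative.

```python
def edge_configuration(elem_u, deg_u, elem_v, deg_v, multiplicity, is_core):
    """Edge-configuration of an edge uv as (symbol, symbol, multiplicity).

    For a core edge the two chemical symbols are sorted so that the same
    undirected edge always yields the same tuple. For a non-core edge the
    direction parent u -> child v is kept as given.
    Assumption: tuple comparison stands in for the paper's total order.
    """
    mu_u = (elem_u, deg_u)
    mu_v = (elem_v, deg_v)
    if is_core and mu_v < mu_u:
        mu_u, mu_v = mu_v, mu_u
    return (mu_u, mu_v, multiplicity)

# A C=O double bond seen from either end gives one configuration in the core ...
print(edge_configuration("O", 1, "C", 3, 2, is_core=True))
print(edge_configuration("C", 3, "O", 1, 2, is_core=True))
# ... but two different configurations when the edge is directed (non-core).
print(edge_configuration("O", 1, "C", 3, 2, is_core=False))
```

Sorting the endpoints of a core edge makes the frequency count independent of the order in which the edge is traversed, whereas directed non-core edges retain the parent-to-child information that the descriptors exploit.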

Let Inline graphic be a tuple with a cyclic graph Inline graphic and functions Inline graphic and Inline graphic, where we use Inline graphic to denote the function Inline graphic such that Inline graphic for each vertex Inline graphic. A tuple Inline graphic is called a chemical cyclic graph if (i) H is connected; (ii) Inline graphic for each vertex Inline graphic; and (iii) Inline graphic, Inline graphic and Inline graphic for each core edge Inline graphic, Inline graphic-internal edge Inline graphic and Inline graphic-external edge Inline graphic, respectively.

Descriptors and feature vectors

A feature vector f(G) of a chemical cyclic graph Inline graphic consists of the following 16 kinds of graph-theoretical descriptors.

  • n(G): the number |V| of vertices; Inline graphic: the core size of G; Inline graphic: the core height of G; Inline graphic: the Inline graphic-branch leaf number of G;

  • Inline graphic: the average mass of atoms in G; Inline graphic: the number of hydrogen atoms suppressed in G;

  • Inline graphic, Inline graphic: the numbers of core vertices and non-core vertices of degree Inline graphic in G;

  • Inline graphic, Inline graphic, Inline graphic: the numbers of core edges, Inline graphic-internal edges and Inline graphic-external edges with bond multiplicity Inline graphic in G;

  • Inline graphic, Inline graphic, Inline graphic, Inline graphic: the numbers of core vertices and non-core vertices v with a chemical symbol Inline graphic; and

  • Inline graphic, Inline graphic, Inline graphic: the numbers of core edges Inline graphic such that Inline graphic, Inline graphic-internal edges Inline graphic such that Inline graphic, and Inline graphic-external edges Inline graphic such that Inline graphic in G.

Note that excluding the average mass descriptor Inline graphic, the remaining 15 descriptors in the feature vector correspond to frequency counts.

An algorithm for generating isomers

This section designs a new algorithm for generating Inline graphic-lean cyclic graphs G that have the same feature vector Inline graphic of a given chemical Inline graphic-lean graph Inline graphic.

The idea of the algorithm

For a graph Inline graphic, we define the frequency vector Inline graphic as the vector consisting of the frequency values corresponding to the 15 descriptors used in the feature vector Inline graphic. Instead of manipulating target graphs directly, we first compute the frequency vectors Inline graphic of subtrees Inline graphic of all target graphs and then construct a limited number of target graphs G from the process of computing the vectors. For this, we extend the dynamic programming algorithm for generating acyclic chemical graphs proposed by Azam et al. [32]. The algorithm is sketched as follows.

  1. Given a chemical Inline graphic-lean cyclic graph Inline graphic, simplify the core of Inline graphic into a graph Inline graphic by replacing some paths with edges. Decompose Inline graphic into a collection of chemical trees Inline graphic Inline graphic such that each tree Inline graphic contains at most two vertices in Inline graphic. See Figure 3 for an illustration.

  2. For each index Inline graphic, compute the feature vector Inline graphic and then generate a set Inline graphic of all (or a limited number of) chemical acyclic graphs Inline graphic such that Inline graphic by using the dynamic programming algorithm [32].

  3. Each combination of chemical trees Inline graphic forms a chemical Inline graphic-lean cyclic graph Inline graphic such that Inline graphic.

Fig. 3.

An illustration of generating a chemical isomer Inline graphic of a chemical graph Inline graphic in Stage 5, where Inline graphic is decomposed into chemical trees Inline graphic, Inline graphic based on a set Inline graphic of core vertices and a set Inline graphic of chemical trees Inline graphic such that Inline graphic is constructed for each vector Inline graphic, before a new target graph Inline graphic is obtained as a combination of Inline graphic.

Note that the number of chemical isomers Inline graphic obtained in Step 3 is Inline graphic, which may include isomorphic chemical graphs depending on the structure of Inline graphic. However, in many cases of computational experiments, the number Inline graphic is extremely large. Therefore we generate only a limited number of isomers Inline graphic by selecting a small number of trees Inline graphic, and most of the generated isomers are non-isomorphic, as is evident from the experimental results given in Section 4.
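The counting and sampling idea in Step 3 can be sketched as follows, under the assumption that every combination of one tree per component yields an isomer, so that the number of combinations is the product of the sizes of the per-component tree sets. The exact expression is not legible in this copy of the text, so this product form, the function names, and the placeholder tree labels are all assumptions made for illustration.

```python
import math
import random

def count_combinations(tree_sets):
    """Number of ways to pick one tree from each component's candidate set."""
    return math.prod(len(trees) for trees in tree_sets)

def sample_combinations(tree_sets, k, seed=None):
    """Draw k combinations uniformly at random (repetitions are possible).

    Each combination picks one tree per component independently, so no
    explicit enumeration of all combinations is needed.
    """
    rng = random.Random(seed)
    return [tuple(rng.choice(trees) for trees in tree_sets) for _ in range(k)]

# Placeholder labels stand in for the chemical trees of three components.
tree_sets = [["T1a", "T1b"], ["T2a"], ["T3a", "T3b", "T3c"]]
print(count_combinations(tree_sets))          # 6
print(sample_combinations(tree_sets, 3, seed=0))
```

Because the count is a product, it can be astronomically large even when each individual set is small, which is why counting and random sampling are done on the per-component sets rather than on an explicit list of graphs.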

In the following, we describe a new algorithm that, for a given chemical Inline graphic-lean graph Inline graphic, generates chemical Inline graphic-lean cyclic graphs Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ7.gif

where Inline graphic may not be graph-isomorphic to Inline graphic and the elements in Inline graphic may not correspond between the two cores; i.e., possibly Inline graphic for some core vertex v of H in the graph-isomorphism Inline graphic between Inline graphic and Inline graphic.

In this section, we describe our new algorithm in a general setting where a branch parameter is any integer Inline graphic and a chemical graph G to be inferred is any chemical Inline graphic-lean cyclic graph.

Figure 4(a) and (b) illustrate small chemical 2-lean cyclic graphs, which we use as running examples to demonstrate how our algorithm generates isomers.

Fig. 4.

An illustration of chemical rooted/bi-rooted trees: (a) A chemical rooted tree Inline graphic rooted at a vertex v in a chemical graph G: CID 7600; (b) A chemical bi-rooted tree Inline graphic with terminals u and v in a chemical graph G: CID 3729083; (c) An example of a base-graph Inline graphic to the chemical graph G in (a); (d) An example of a base-graph Inline graphic to the chemical graph G in (b).

Nc-trees and C-trees

Nc-trees

Let Inline graphic be a branch parameter and H be a Inline graphic-lean cyclic graph. We define “non-core-subtrees” in the following way.

Let T be a connected subgraph of H. We call T a non-core-subtree of H if T is regarded as a bi-rooted tree such that the backbone path Inline graphic is a subgraph of a Inline graphic-exterior tree of H (excluding the root) and the Inline graphic-fringe trees rooted at vertices in Inline graphic. We call a non-core-subtree T of H an internal-subtree (resp., an end-subtree) of H if neither (resp., one) of the two end-vertices of Inline graphic is a leaf Inline graphic-branch of H, as illustrated in Figure 5(a) (resp., in Figure 5(b)).

Fig. 5.

An illustration of subtrees of a chemical Inline graphic-lean cyclic graph G, where thick lines depict the cycle of the core of G, green circles depict leaf Inline graphic-branches in G and arrows depict non-core directed edges: (a) A non-core-subtree (internal-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (b) A non-core-subtree (end-subtree) T of G represented by an nc-tree (a chemical bi-rooted tree); (c) A core-subtree T of G with Inline graphic represented by a c-tree (a chemical rooted tree); (d) A core-subtree T of G with Inline graphic represented by a c-tree (a chemical bi-rooted tree).

We introduce “nc-trees” to represent non-core subtrees of a Inline graphic-lean cyclic graph H. The nc-tree is defined to be a chemical bi-rooted tree T such that each rooted tree Inline graphic has a height at most Inline graphic.

For an nc-tree T, define

graphic file with name 41598_2025_5976_Article_Equ8.gif

As discussed in Section 2, a non-core edge in Inline graphic is regarded as a directed edge (u, v). Define the number Inline graphic of Inline graphic-branch core vertices in T to be Inline graphic and the core height Inline graphic of T to be 0.

C-trees

For a Inline graphic-lean cyclic graph H, a subtree T of H is called a core-subtree if one of the following holds:

  • (i) T consists of all pendant-trees rooted at a core vertex Inline graphic;

  • (ii) T consists of a core-path Inline graphic with Inline graphic and all pendant-trees rooted at internal vertices of path Inline graphic; and

  • (iii) T consists of a core-path Inline graphic with Inline graphic, all pendant-trees rooted at internal vertices of path Inline graphic, and all pendant-trees rooted at one of the end-vertices of path Inline graphic.

To represent a core-subtree of H, we introduce “c-trees.” For a branch parameter Inline graphic, we call a bi-rooted tree Inline graphic-lean if each rooted tree Inline graphic contains at most one Inline graphic-branch; i.e., there is no non-leaf Inline graphic-branch and no two Inline graphic-exterior trees meet at the same vertex in Inline graphic. A c-tree is defined to be a chemical Inline graphic-lean bi-rooted tree T.

The chemical tree Inline graphic in Figure 4(a) is an example of a c-tree with Inline graphic, Inline graphic and Inline graphic, where v is a core vertex and Inline graphic and Inline graphic are nc-trees. The chemical tree Inline graphic in Figure 4(b) is an example of a c-tree with Inline graphic, Inline graphic and Inline graphic, where s is a core vertex and Inline graphic is an nc-tree. The chemical tree Inline graphic in Figure 4(b) is an example of a c-tree with Inline graphic, Inline graphic and Inline graphic, where Inline graphic and Inline graphic are c-trees.

For a c-tree T, define

graphic file with name 41598_2025_5976_Article_Equ9.gif

Define the number Inline graphic of Inline graphic-branch core vertices in T to be the number of rooted trees in Inline graphic with Inline graphic. Define the core height Inline graphic for the bi-rooted tree T. Note that Inline graphic (resp., Inline graphic) is the set of Inline graphic-external vertices (resp., Inline graphic-external vertices) in the rooted trees in Inline graphic. Illustrations of c-trees are given in Figures 5(c) and (d).

As discussed in Section 2, a non-core edge in Inline graphic for an nc-tree or a c-tree T is regarded as a directed edge (u, v).

Fictitious trees

For the nc-tree Inline graphic in Figure 4(a), the degree of the terminal Inline graphic is Inline graphic in G for Inline graphic in T and Inline graphic. We treat such a degree Inline graphic of a terminal v in a target chemical graph G as a fictitious degree of a chemical rooted tree T.

For an nc-tree or a c-tree T and an integer Inline graphic, let Inline graphic denote a fictitious chemical graph obtained from T by regarding the degree of terminal Inline graphic as Inline graphic. Figure 6(a) and (b) illustrate fictitious trees Inline graphic in the case of Inline graphic and Inline graphic in the case of Inline graphic and Inline graphic, respectively.

Fig. 6.

An illustration of fictitious trees: (a) Inline graphic of a rooted nc- or c-tree T; (b) Inline graphic of a bi-rooted nc-tree T; (c) Inline graphic of a bi-rooted c-tree T.

For a c-tree T with Inline graphic and integers Inline graphic, let Inline graphic denote a fictitious chemical graph obtained from T by regarding the degree of terminal Inline graphic, Inline graphic as Inline graphic. Figure 6(c) illustrates a fictitious bi-rooted c-tree Inline graphic.

Frequency vectors

For a finite set A of elements, let Inline graphic denote the set of functions Inline graphic. A function Inline graphic is called a non-negative integer vector (or a vector) on A and the value Inline graphic for an element Inline graphic is called the entry of Inline graphic for Inline graphic. For a vector Inline graphic and an element Inline graphic, let Inline graphic (resp., Inline graphic) denote the vector Inline graphic such that Inline graphic (resp., Inline graphic) and Inline graphic for the other elements Inline graphic. For a vector Inline graphic and a subset Inline graphic, let Inline graphic denote the projection of Inline graphic to B; i.e., Inline graphic such that Inline graphic, Inline graphic.
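The three vector operations introduced above (incrementing an entry, decrementing an entry, and projecting onto a subset) can be sketched with dictionaries. The sketch assumes, following the wording of the paragraph, that the increment and decrement change exactly one entry by one and that the projection keeps only the entries indexed by B; the function names are hypothetical, and entries absent from a dictionary are treated as 0.

```python
def add_unit(x, a):
    """Vector equal to x except that the entry for a is increased by 1."""
    y = dict(x)
    y[a] = y.get(a, 0) + 1
    return y

def sub_unit(x, a):
    """Vector equal to x except that the entry for a is decreased by 1.

    Raises ValueError if the entry is already 0, since the vectors are
    assumed to take non-negative integer values only.
    """
    if x.get(a, 0) == 0:
        raise ValueError(f"entry for {a!r} is already 0")
    y = dict(x)
    y[a] -= 1
    return y

def project(x, subset):
    """Projection of x onto the index set `subset`."""
    return {b: x.get(b, 0) for b in subset}

x = {"p": 2, "q": 0, "r": 1}
print(add_unit(x, "q"))          # {'p': 2, 'q': 1, 'r': 1}
print(sub_unit(x, "p"))          # {'p': 1, 'q': 0, 'r': 1}
print(project(x, ["p", "r"]))    # {'p': 2, 'r': 1}
```

All three functions return new dictionaries and leave the input unchanged, which matches the way a dynamic program reuses previously computed vectors.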

To introduce a “frequency vector” of a subgraph of a chemical cyclic graph, we define sets of symbols that correspond to some descriptors of a chemical cyclic graph. Let Inline graphic, Inline graphic and Inline graphic be the sets of edge-configurations defined in Section 2. We define a vector whose entry is the frequency of an edge-configuration in the sets Inline graphic, Inline graphic or the number of Inline graphic-branch core vertices. We use a symbol Inline graphic to denote the number of Inline graphic-branch core vertices in our frequency vector. To distinguish edge-configurations from different sets among three sets Inline graphic, Inline graphic, we use Inline graphic to denote the entry of an edge-configuration Inline graphic, Inline graphic. We denote by Inline graphic the set of entries Inline graphic, Inline graphic, Inline graphic. Define the set of all entries of a frequency vector to be

graphic file with name 41598_2025_5976_Article_Equ10.gif

Define the frequency vector Inline graphic to be a vector Inline graphic that consists of the following entries:

  • Inline graphic Inline graphic, Inline graphic Inline graphic Inline graphic;

  • Inline graphic, Inline graphic, Inline graphic;

  • Inline graphic.

For an nc-tree or c-tree T, the frequency vector Inline graphic of a fictitious tree Inline graphic is defined as follows: Let Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic. Let Inline graphic, Inline graphic. Set Inline graphic if T is an nc-tree, and Inline graphic if T is a c-tree. When Inline graphic,

graphic file with name 41598_2025_5976_Article_Equ11.gif

Let Inline graphic, and let Inline graphic belong to Inline graphic. When T is an nc-tree,

graphic file with name 41598_2025_5976_Article_Equ12.gif

The frequency vector Inline graphic of a fictitious tree Inline graphic for a bi-rooted c-tree T with Inline graphic is defined as follows: For each Inline graphic, let Inline graphic, Inline graphic, Inline graphic, Inline graphic of the unique edge incident to Inline graphic and Inline graphic, Inline graphic. Let Inline graphic, Inline graphic. Then

graphic file with name 41598_2025_5976_Article_Equ13.gif

Chemical graph isomorphism

For a chemical Inline graphic-lean cyclic graph for a branch parameter Inline graphic, we choose a path-partition Inline graphic of the core Inline graphic, where Inline graphic. Let Inline graphic denote the set of all end-vertices of paths Inline graphic, where Inline graphic.

Define the base-graph Inline graphic of H by Inline graphic to be the multigraph obtained from H by replacing each path Inline graphic with a single edge Inline graphic joining the end-vertices of Inline graphic, where Inline graphic. We call a vertex in Inline graphic and an edge in Inline graphic a base-vertex and a base-edge, respectively. For notational convenience in distinguishing the two end-vertices u and v of a base-edge Inline graphic, we regard each base-edge Inline graphic as a directed edge Inline graphic. For each base-edge Inline graphic, let Inline graphic denote the path Inline graphic that is replaced by edge Inline graphic.

Figure 4(c) (resp., (d)) illustrates an example of a base-graph Inline graphic to the chemical graph G in Figure 4(a) (resp., (b)).
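The construction of base-vertices and base-edges from a path-partition can be sketched as follows. The sketch assumes each path of the partition is given as a sequence of core vertices from one end-vertex to the other and considers the core only; the function name `base_graph` and the representation of a base-edge as a triple (u, v, path index) are illustrative assumptions.

```python
def base_graph(paths):
    """Base-vertices and directed base-edges of a path-partition of the core.

    `paths` is a list of vertex sequences. Each path contributes one
    base-edge from its first to its last vertex; the index of the path is
    stored with the edge so that parallel base-edges stay distinct and the
    replaced path can be looked up again.
    """
    base_vertices = set()
    base_edges = []
    for index, path in enumerate(paths):
        u, v = path[0], path[-1]
        base_vertices.update((u, v))
        base_edges.append((u, v, index))
    return base_vertices, base_edges

# A 4-cycle 1-2-3-4-1 split into two paths between vertices 1 and 3.
paths = [[1, 2, 3], [3, 4, 1]]
print(base_graph(paths))  # ({1, 3}, [(1, 3, 0), (3, 1, 1)])
```

The example shows why the base-graph is a multigraph: the two paths of the cycle share both end-vertices, so they become two parallel base-edges.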

We define the “components” of G by Inline graphic as follows.

Vertex-components

For each base-vertex Inline graphic, define the component at vertex v (or the v-component) Inline graphic of G to be the chemical core-subtree rooted at v in G; i.e., Inline graphic consists of all pendant-trees rooted at v. We regard Inline graphic as a c-tree rooted at the core vertex v of G and define the code Inline graphic of Inline graphic to be a tuple Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ14.gif

The nc-tree Inline graphic in Figure 4(a) is the v-component of the graph G for the base-vertex Inline graphic in Figure 4(c), where Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

Edge-components

For each base-edge Inline graphic, define the component at edge e (or the e-component) Inline graphic of G to be the chemical core-subtree of G that consists of the core-path Inline graphic and all pendent-trees of G rooted at internal vertices of path Inline graphic. We regard Inline graphic as a bi-rooted c-tree with Inline graphic and Inline graphic for the base-edge Inline graphic and define the code Inline graphic of Inline graphic to be a tuple Inline graphic such that

  • Inline graphic, Inline graphic, Inline graphic, Inline graphic,

  • Inline graphic and Inline graphic for the edges Inline graphic incident to u and v,

  • Inline graphic.

The c-tree Inline graphic in Figure 4(b) is the e-component of the graph G for the base-edge Inline graphic in Figure 4(d), where Inline graphic contains exactly one leaf 2-branch Inline graphic (hence Inline graphic), Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

Observe that

graphic file with name 41598_2025_5976_Article_Equ15.gif

Note that any other descriptors of a chemical Inline graphic-lean cyclic graph G with Inline graphic except for the core height can be determined by the entries of the frequency vector Inline graphic. For example, the vector Inline graphic with the numbers Inline graphic of core vertices of degree Inline graphic is given by

graphic file with name 41598_2025_5976_Article_Equ16.gif

and the vector Inline graphic with the numbers of symbols Inline graphic of core vertices is given by

graphic file with name 41598_2025_5976_Article_Equ17.gif

Similarly the vector Inline graphic with the numbers of symbols Inline graphic of non-core vertices is given by

graphic file with name 41598_2025_5976_Article_Equ18.gif

We introduce a specification Inline graphic as a set of functions Inline graphic.

We call two chemical graphs Inline graphic-isomorphic if they consist of vertex and edge components with the same codes and heights; i.e., two chemical Inline graphic-lean cyclic graphs Inline graphic, Inline graphic are Inline graphic-isomorphic if the following hold:

  • Inline graphic and Inline graphic are graph-isomorphic, where we assume that Inline graphic and Inline graphic denotes the base-graph of both graphs Inline graphic and Inline graphic by Inline graphic.

  • For the v-components Inline graphic of Inline graphic, Inline graphic at each base-vertex Inline graphic, Inline graphic and Inline graphic.

  • For the e-components Inline graphic of Inline graphic, Inline graphic at each base-edge Inline graphic, Inline graphic and Inline graphic.

See Section 2 for the definition of height Inline graphic of a bi-rooted tree T.

The Inline graphic-isomorphism also implies that Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

Chemical isomers of a given chemical graph Inline graphic

Let Inline graphic be a chemical Inline graphic-lean cyclic graph Inline graphic, and Inline graphic (resp., Inline graphic) denote the v-component (resp., the e-component) of Inline graphic.

Target v-components

Let Inline graphic denote the height Inline graphic of the v-component of Inline graphic. For each base-vertex Inline graphic, fix a code Inline graphic and call a rooted c-tree T a target v-component if

graphic file with name 41598_2025_5976_Article_Equ19.gif

where the condition on Inline graphic is equivalent to Inline graphic when Inline graphic, since Inline graphic is a Inline graphic-lean cyclic graph and the set of Inline graphic-internal edges in any target component forms a single path of length Inline graphic from the root to a unique leaf Inline graphic-branch. Let Inline graphic denote the set of all target v-components of a base-vertex Inline graphic.

For example, the number of all target v-components of the example in Figure 4(a) is 8, which will be computed as a compact representation by our algorithm.

Target e-components

For each base-edge Inline graphic, fix a code Inline graphic Inline graphic and call a bi-rooted c-tree T a target e-component if

graphic file with name 41598_2025_5976_Article_Equ20.gif

Let Inline graphic denote the set of all target e-components of a base-edge Inline graphic.

For example, the number of all target e-components of the example in Figure 4(b) is 12, which will be computed as a compact representation by our algorithm.

Given a collection of target v-components Inline graphic, Inline graphic and target e-components Inline graphic, Inline graphic, there is a chemical Inline graphic-lean cyclic graph Inline graphic that is Inline graphic-isomorphic to the original chemical graph Inline graphic. Such a graph Inline graphic can be obtained from Inline graphic by replacing each base-edge Inline graphic with Inline graphic and attaching Inline graphic at each base-vertex Inline graphic.
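The assembly described above amounts to choosing one target component per base-vertex and per base-edge independently. The sketch below illustrates this under stated assumptions: the names `count_combinations` and `assemble`, the dictionaries, and the placeholder strings standing in for trees are hypothetical, and the sketch counts choices of components rather than graphs up to graph-isomorphism.

```python
from itertools import product
from math import prod

def count_combinations(v_components, e_components):
    """Number of ways to pick one component per base-vertex and per base-edge."""
    return (prod(len(c) for c in v_components.values())
            * prod(len(c) for c in e_components.values()))

def assemble(v_components, e_components):
    """Yield every selection as a dict keyed by ('v', vertex) or ('e', edge)."""
    keys = [("v", k) for k in v_components] + [("e", k) for k in e_components]
    pools = list(v_components.values()) + list(e_components.values())
    for choice in product(*pools):
        yield dict(zip(keys, choice))

# Placeholder strings stand in for actual target components.
v_components = {"v1": ["T1", "T2"]}
e_components = {("v1", "v2"): ["E1", "E2", "E3"]}
total = count_combinations(v_components, e_components)
selections = list(assemble(v_components, e_components))
```

Because the count is a product of set sizes, it can be obtained without generating any selection, which is what makes a compact representation of each component set useful.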

From this observation, our aim is now to generate some number of target v-components for each base-vertex v and target e-components for each base-edge e. In the following, we denote Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic for each base-edge Inline graphic by Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic, respectively, for notational simplicity. For each base-edge Inline graphic, let

graphic file with name 41598_2025_5976_Article_Equ21.gif

A sketch of the dynamic programming algorithm on frequency vectors

We begin with a sketch of our new algorithm for generating graphs Inline graphic in Stage 5.

We first enumerate chemical rooted trees with height at most Inline graphic, each of which can be a Inline graphic-fringe tree of a target component. Next, we extend each of these rooted trees to an nc-tree T and then to a c-tree T under the constraint that the frequency vector Inline graphic of T does not exceed a given vector Inline graphic, Inline graphic or Inline graphic, Inline graphic.

For a vector Inline graphic, we formulate the following sets of nc-trees and c-trees and of their frequency vectors:

  • (i)
    Inline graphic, Inline graphic, Inline graphic, Inline graphic: the set of rooted nc-trees T with a root r such that
    graphic file with name 41598_2025_5976_Article_Equ22.gif
    Let Inline graphic denote the set of the frequency vectors Inline graphic for all nc-trees Inline graphic;
  • (ii)
    Inline graphic, Inline graphic, Inline graphic, Inline graphic: the set of rooted nc-trees T with a root r such that
    graphic file with name 41598_2025_5976_Article_Equ23.gif
    Let Inline graphic denote the set of the frequency vectors Inline graphic for all nc-trees Inline graphic;
  • (iii)
    Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic: the set of bi-rooted nc-trees T such that
    graphic file with name 41598_2025_5976_Article_Equ24.gif
    Let Inline graphic denote the set of all frequency vectors Inline graphic for all bi-rooted nc-trees Inline graphic.
  • (iv)
    Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic: the set of rooted c-trees T with a root r such that
    graphic file with name 41598_2025_5976_Article_Equ25.gif
    Let Inline graphic denote the set of the frequency vectors Inline graphic for all c-trees Inline graphic.
  • (v)
    Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic: the set of bi-rooted c-trees T such that
    graphic file with name 41598_2025_5976_Article_Equ26.gif
    Let Inline graphic denote the set of the frequency vectors Inline graphic for all bi-rooted c-trees Inline graphic.

Note that Inline graphic for any vector Inline graphic in the above sets in (i)-(iii).

Forward phase

The first phase computes the frequency vectors of some nc-trees and c-trees that can be a subtree of a target component, where we first enumerate chemical rooted trees with height at most Inline graphic and generate the frequency vectors of other types of nc-trees and c-trees from the frequency vectors of their subtrees recursively.

The first phase consists of five steps. Step 1 computes the sets of trees and vectors in (i), (ii) and (iii) with Inline graphic, where each tree in these sets is of height at most Inline graphic. Note that two distinct trees in a tree set Inline graphic above can have identical frequency vectors.

In fact, the size Inline graphic of a set Inline graphic of trees can be considerably larger than that Inline graphic of the set Inline graphic of their frequency vectors. We mainly maintain a whole vector set Inline graphic. With this idea, Steps 2–5 compute only vector sets Inline graphic in (iii) with Inline graphic, (iv) and (v).
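The saving from storing vectors rather than trees can be illustrated by grouping trees by their frequency vector. The following is a toy sketch: the function `group_by_frequency_vector`, the encoding of a tree as a tuple of edge labels, and the stand-in `toy_vector` are assumptions for illustration and do not reproduce the paper's definition of a frequency vector.

```python
from collections import Counter, defaultdict

def group_by_frequency_vector(trees, frequency_vector):
    """Map each distinct frequency vector to the trees that realize it."""
    groups = defaultdict(list)
    for tree in trees:
        groups[frequency_vector(tree)].append(tree)
    return groups

def toy_vector(tree):
    """Toy stand-in: count how often each edge label occurs in the tree."""
    return tuple(sorted(Counter(tree).items()))

# Three hypothetical trees, two of which share the same label counts.
trees = [("CC", "CO"), ("CO", "CC"), ("CC", "CC")]
groups = group_by_frequency_vector(trees, toy_vector)
```

Here three trees collapse to two vectors; the later steps then only need to combine the two keys, while the trees behind each key are kept aside for the final construction.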

We derive recursive formulas that hold among the above sets. Based on these, we compute the vector sets in (iii) in Step 2, those in (iv) in Step 3 and those in (v) in Step 4. For each base-edge Inline graphic, Step 5 compares vectors Inline graphic and Inline graphic, where Inline graphic is the frequency vector of a c-tree Inline graphic that is extended from the end-vertex Inline graphic, to examine whether Inline graphic and Inline graphic give rise to a target e-component.

In the previous method for generating target components due to Akutsu and Nagamochi37, an algorithm is designed along the same lines as the first phase so that some number of target components are constructed once a necessary set of frequency vectors has been computed at the end of the execution. However, that algorithm cannot enumerate all target components.

Backward phase

To address this problem, we need to backtrack the computation in the first phase to detect which subtrees will be part of a target component. In this paper, we design as the second phase an algorithm that constructs a compact DAG representation of all target components by backtracking the computation of the first phase so that all target components can be generated by tracing the DAG representation.

The second phase consists of two steps, Step A and Step B. Step A (resp., Step B) constructs a DAG representation of all target v-components for each base-vertex Inline graphic with Inline graphic (resp., for each base-edge Inline graphic) so that a path from a source to a sink in the DAG corresponds to a construction process of a target component. We can enumerate all target components by enumerating all paths in the resulting DAG representations.
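Enumerating components by enumerating source-to-sink paths can be sketched as a depth-first traversal combined with a Cartesian product over the tree choices on each path. The data layout below (an adjacency dictionary `arcs` whose entries carry a label or `None` for a null label, and a dictionary `trees` from labels to lists of trees) is a hypothetical encoding chosen for the example, not the paper's data structure.

```python
from itertools import product

def enumerate_components(arcs, trees, source, sink):
    """Yield (path, chosen_trees) for every source-to-sink path and tree choice.

    arcs:  node -> list of (next_node, label), where label is None for a null label
    trees: label -> list of rooted trees sharing that label's frequency vector
    """
    def walk(node, path):
        if node == sink:
            # One pool of candidate trees per non-null label on the path.
            pools = [trees[w] for _, w in path if w is not None]
            for choice in product(*pools):
                yield list(path), list(choice)
            return
        for nxt, w in arcs.get(node, []):
            yield from walk(nxt, path + [((node, nxt), w)])

    yield from walk(source, [])

# Toy DAG with two paths; placeholder strings stand in for rooted trees.
arcs = {"s": [("a", "w1"), ("b", None)], "a": [("t", None)], "b": [("t", "w2")]}
trees = {"w1": ["T1", "T2"], "w2": ["T3"]}
components = list(enumerate_components(arcs, trees, "s", "t"))
```

Since the function is a generator, a caller can stop after any desired number of components instead of materializing the whole set.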

Defining DAG representations for vertex-components

A DAG representation Inline graphic for the set of target v-components Inline graphic consists of the following:

  • An acyclic digraph (N, A) with a node set N, a set Inline graphic of sources, a single sink Inline graphic and an arc set A. The node set N consists of Inline graphic disjoint subsets Inline graphic and Inline graphic. The end-nodes of every arc Inline graphic satisfy one of “Inline graphic,” “Inline graphic” and “Inline graphic.”

  • A function Inline graphic.

  • A set W of labels Inline graphic, where Inline graphic stands for the frequency vector of a chemical rooted tree.

  • A set Inline graphic of chemical rooted trees T for each non-null label Inline graphic such that the frequency vector of T is equal to the vector implied by Inline graphic. We also store Inline graphic.

  • A function Inline graphic such that Inline graphic and Inline graphic, where Inline graphic is a null label that stands for the zero-vector. For a directed path P in (N, A), let W(P) denote the multi-set of non-null labels Inline graphic of arcs a in P.

Figure 7 illustrates a DAG representation for the set of target v-components Inline graphic, where, for example, Inline graphic and Inline graphic for an arc Inline graphic and Inline graphic consists of one tree Inline graphic in the figure. All null labels Inline graphic are omitted in the figure.

Fig. 7.


(a) An illustration of a DAG representation Inline graphic for the set of target v-components Inline graphic, where each arc Inline graphic with Inline graphic (resp., Inline graphic) is depicted with two (resp., three) lines, Inline graphic on an arc a denotes the non-null label Inline graphic and the root of a chemical rooted tree Inline graphic is depicted with a black circle; (b) An illustration of a target v-component Inline graphic induced by a path Inline graphic with Inline graphic, Inline graphic and Inline graphic.

A target v-component is constructed from a DAG representation as follows.

  1. First choose a path Inline graphic from a source in Inline graphic to sink Inline graphic. For example, let Inline graphic in Figure 7(a), where we obtain Inline graphic.

  2. Construct a chemical path Inline graphic such that Inline graphic is the chemical element in the chemical symbol Inline graphic and Inline graphic is the bond-multiplicity Inline graphic of the arc Inline graphic. For the example Inline graphic in Figure 7(a), we obtain Inline graphic.

  3. For each non-null label Inline graphic of an arc Inline graphic in the path P, choose a chemical rooted tree Inline graphic from the set Inline graphic. The number Inline graphic of all such combinations is Inline graphic. For the example Inline graphic in Figure 7(a), choose Inline graphic for arc Inline graphic, Inline graphic for arc Inline graphic and with Inline graphic for arc Inline graphic, and Inline graphic.

  4. Finally attach the tree Inline graphic chosen in 3 to the vertex Inline graphic in the chemical path Inline graphic and the resulting tree becomes a target v-component. For the example Inline graphic in Figure 7(a), we obtain a target v-component Inline graphic as illustrated in Figure 7(b).

  5. Any target v-component is constructed in the above manner of 1 to 4. Hence the number of all target v-components can be computed as follows. Let Inline graphic denote the set of nodes Inline graphic such that Inline graphic. After initializing Inline graphic for each arc a with Inline graphic, Inline graphic for each arc a with Inline graphic and Inline graphic, choose non-sink nodes Inline graphic in non-decreasing order of the distance from u to sink Inline graphic, and compute Inline graphic. The number of all target v-components is given by Inline graphic for the source Inline graphic.
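The counting in item 5 can be sketched as a weighted path count over the DAG. The recurrence used here is an assumption reconstructed from the description: an arc with a non-null label contributes a factor equal to the number of stored trees for that label, a null-labelled arc contributes a factor of 1, and the count at a node is the sum over its outgoing arcs. The names `count_components`, `arcs` and `tree_count` are hypothetical.

```python
from functools import lru_cache

def count_components(arcs, tree_count, source, sink):
    """Count target components without generating them.

    arcs:       node -> list of (next_node, label), label None for a null label
    tree_count: label -> number of rooted trees stored for that label
    """
    @lru_cache(maxsize=None)
    def h(u):
        if u == sink:
            return 1
        # Each arc multiplies the count of its head by the number of tree choices.
        return sum((1 if w is None else tree_count[w]) * h(v)
                   for v, w in arcs.get(u, []))

    return h(source)

# Toy DAG: path s-a-t offers 2 tree choices, path s-b-t offers 1.
arcs = {"s": [("a", "w1"), ("b", None)], "a": [("t", None)], "b": [("t", "w2")]}
tree_count = {"w1": 2, "w2": 1}
total = count_components(arcs, tree_count, "s", "t")
```

Memoized recursion visits each node once, which has the same effect as processing nodes in non-decreasing order of their distance to the sink.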

Figures 8(a) and (b) illustrate a DAG representation for the set of target v-components of base-vertex v in Figure 4(a). By choosing a path Inline graphic in the DAG representation, we obtain a target v-component Inline graphic in Figure 8(b), where Inline graphic and Inline graphic. In this case, the number of paths from the source to a sink is eight and the number of all target v-components is eight, since the choice of trees from Inline graphic is unique.

Fig. 8.


(a) A DAG representation Inline graphic for the set of target v-components of the base-vertex v in Figure 4(a); (b) The target v-component Inline graphic induced by a path Inline graphic.

Defining DAG representations for edge-components

A DAG representation Inline graphic, Inline graphic for the set of target e-components Inline graphic consists of the following:

  • An acyclic digraph Inline graphic with a node set Inline graphic, a set Inline graphic of sources, a single sink Inline graphic and an arc set Inline graphic. The source set Inline graphic consists of disjoint subsets Inline graphic. The node set Inline graphic consists of disjoint subsets Inline graphic and Inline graphic. The end-nodes of every arc Inline graphic satisfy one of “Inline graphic,” “Inline graphic” and “Inline graphic.” The set of sources in Inline graphic is denoted by Inline graphic. See Figure 9(a).

  • An acyclic digraph Inline graphic with a node set Inline graphic and an arc set Inline graphic such that Inline graphic has a single source Inline graphic and a single sink Inline graphic and every path from Inline graphic to Inline graphic has length Inline graphic, where Inline graphic. Let Inline graphic denote the set of nodes Inline graphic whose distance from source Inline graphic is p. We call a node Inline graphic (resp., Inline graphic) a left-node (resp., a right-node) and call an arc Inline graphic a left-arc (resp., a right-arc) if u and v are left-nodes (resp., right-nodes) and a middle-arc otherwise. See Figure 9(b).

  • Functions Inline graphic and Inline graphic.

  • A set W of labels Inline graphic, where Inline graphic stands for the frequency vector of a chemical rooted tree.

  • A set Inline graphic of chemical rooted trees T for each non-null label Inline graphic such that the frequency vector of T is equal to the vector implied by Inline graphic. We also store Inline graphic.

  • A function Inline graphic such that Inline graphic and Inline graphic, where Inline graphic is a null label that stands for the zero-vector. For a directed path P in Inline graphic, let W(P) denote the multi-set of non-null labels Inline graphic of arcs in P.

  • Functions Inline graphic such that Inline graphic and Inline graphic. For a directed path Q in Inline graphic, let W(Q) denote the multi-set of Inline graphic of arcs a in Q, and let Inline graphic denote the multi-set of Inline graphic of arcs a in Q.

Figure 9 illustrates a DAG representation for the set of target e-components Inline graphic, where the null labels Inline graphic are omitted in the figure.

Fig. 9.


An illustration of a DAG representation for the set of target e-components Inline graphic; (a) A representation Inline graphic with Inline graphic for the non-core part, where Inline graphic is omitted in the figure and each gray circle indicates a source or a sink; (b) A representation Inline graphic for the core part, where each gray circle indicates a node with distance Inline graphic or Inline graphic from the source Inline graphic.

A target e-component is constructed from a DAG representation as follows.

  1. Choose a path Inline graphic from the source Inline graphic to the sink Inline graphic in Inline graphic. For example, let Inline graphic in Figure 9(b), where Inline graphic, Inline graphic and Inline graphic is the middle-arc in Q.

  2. Construct a chemical path Inline graphic such that Inline graphic is the chemical element in the chemical symbol Inline graphic, and Inline graphic is Inline graphic of arc Inline graphic. For the example Inline graphic in Figure 9(b), we obtain Inline graphic Inline graphic.

  3. For each Inline graphic of an arc Inline graphic in path Q, choose a chemical rooted tree Inline graphic from the set Inline graphic, and attach the tree Inline graphic to the atom at Inline graphic (resp., Inline graphic) in path Inline graphic if Inline graphic is a left-arc (resp., a right-arc). For the example Inline graphic with Inline graphic in Figure 9(b), we attach a tree Inline graphic to the atom C at Inline graphic in path Inline graphic since Inline graphic is a left-arc.

  4. For each Inline graphic of an arc Inline graphic in path Q, execute the next to construct an nc-tree Inline graphic:
    • (i)
      Choose a path Inline graphic from source Inline graphic to the sink Inline graphic in Inline graphic. Construct a chemical path Inline graphic such that Inline graphic is the chemical element in the chemical symbol Inline graphic and Inline graphic is the bond-multiplicity Inline graphic of arc Inline graphic.
    • (ii)
      For each Inline graphic of an arc Inline graphic in the path Inline graphic, choose a chemical rooted tree Inline graphic from the set Inline graphic and attach the tree Inline graphic to atom Inline graphic in the path Inline graphic. Let Inline graphic be one of the resulting chemical trees rooted at Inline graphic.
    • (iii)
      Attach the tree Inline graphic to the atom at Inline graphic (resp., Inline graphic) in the path Inline graphic if Inline graphic is a left-arc (resp., a right-arc). The resulting chemical tree is a target e-component.
    For the example Inline graphic with Inline graphic in Figure 9(b), we can choose a path Inline graphic from source Inline graphic to sink Inline graphic in Inline graphic in Figure 9(a), and we construct a tree Inline graphic in the same manner as in the case of constructing target v-components, and attach tree Inline graphic to the atom C at Inline graphic in the path Inline graphic since Inline graphic is a right-arc.
  5. Any target e-component is constructed in the above manner of 1 to 4. Hence the number of all target e-components can be computed as follows. For each source Inline graphic, the number Inline graphic of chemical rooted trees Inline graphic constructed in 4 can be computed in the same manner as in the case of computing the number of target v-components Inline graphic in a DAG representation. Let Inline graphic denote the set of nodes Inline graphic such that Inline graphic. After initializing Inline graphic for each right-arc a with Inline graphic, Inline graphic for each right-arc a with Inline graphic, Inline graphic for each right-arc a with Inline graphic and Inline graphic, choose non-sink right-nodes Inline graphic in the order of Inline graphic and compute Inline graphic to obtain Inline graphic. For the right-nodes in Inline graphic, we apply the above procedure to left-nodes Inline graphic in the order of Inline graphic after reversing the directions of arcs in Inline graphic. Finally, for the set Inline graphic of middle-arcs, the number of all target e-components is given by Inline graphic.
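The two-sided counting in item 5 can be sketched as follows. Several details are assumptions made for this illustration: each left- or right-arc is given directly with a multiplicity m (the number of subtree choices it admits, whether from a fringe-tree set or from the non-core DAG), middle-arcs carry no multiplicity of their own, and the total is the sum over middle-arcs of the product of the left count at the tail and the right count at the head. The function name and argument layout are hypothetical.

```python
from functools import lru_cache

def count_e_components(left_arcs, right_arcs, middle_arcs, source, sink):
    """Count target e-components from the core-part DAG.

    left_arcs, right_arcs: lists of (u, v, m), m = number of subtree choices
    middle_arcs:           list of (u, v) joining a left-node to a right-node
    """
    out_right, in_left = {}, {}
    for u, v, m in right_arcs:
        out_right.setdefault(u, []).append((v, m))
    for u, v, m in left_arcs:
        in_left.setdefault(v, []).append((u, m))   # reversed direction

    @lru_cache(maxsize=None)
    def to_sink(u):
        # Weighted number of ways to reach the sink through right-arcs.
        return 1 if u == sink else sum(m * to_sink(v) for v, m in out_right.get(u, []))

    @lru_cache(maxsize=None)
    def from_source(v):
        # Weighted number of ways to arrive from the source through left-arcs.
        return 1 if v == source else sum(m * from_source(u) for u, m in in_left.get(v, []))

    return sum(from_source(u) * to_sink(v) for u, v in middle_arcs)

# Toy core DAG: two left-arcs, two middle-arcs, one right-arc.
left_arcs = [("s", "l1", 2), ("s", "l2", 1)]
middle_arcs = [("l1", "r1"), ("l2", "r1")]
right_arcs = [("r1", "t", 3)]
total = count_e_components(left_arcs, right_arcs, middle_arcs, "s", "t")
```

Splitting the count at the middle-arc keeps the two halves independent, so each half is computed once and reused for every middle-arc that touches it.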

Figures 10(a)-(c) illustrate a DAG representation for the set of target e-components of the base-edge e in Figure 4(b). For this example, we choose a path

graphic file with name 41598_2025_5976_Article_Equ27.gif

from the source Inline graphic to the sink Inline graphic in Inline graphic, where Inline graphic, Inline graphic and Inline graphic is the middle-arc in Q. From this, we have a chemical path

graphic file with name 41598_2025_5976_Article_Equ28.gif

For Inline graphic of the right-arc Inline graphic in the path Q, we choose tree Inline graphic and attach the tree at the atom Inline graphic at Inline graphic in Inline graphic.

Fig. 10.


A DAG representation for the set of target e-components of base-edge Inline graphic in Figure 4(b): (a) A representation Inline graphic for the non-core part, (b) A representation Inline graphic for the core part, (c) A target e-component Inline graphic induced by the path Inline graphic.

For Inline graphic of left-arc Inline graphic in the path Q, we choose a path

graphic file with name 41598_2025_5976_Article_Equ29.gif

from source Inline graphic to sink Inline graphic in Inline graphic, from which we have a chemical path

graphic file with name 41598_2025_5976_Article_Equ30.gif

For Inline graphic of arc Inline graphic in the path Inline graphic, we choose tree Inline graphic and attach the tree to the atom Inline graphic at Inline graphic in Inline graphic to obtain a chemical rooted tree Inline graphic. Finally, attach the tree Inline graphic to the atom Inline graphic at Inline graphic in Inline graphic. The resulting target e-component Inline graphic is illustrated in Figure 10(c). In this example, we obtain Inline graphic, Inline graphic and Inline graphic and the number of target e-components is Inline graphic.

Computing frequency vectors of subtrees of target components in the forward phase

Step 1: Enumeration of Inline graphic-fringe trees

Step 1 generates chemical rooted trees with height at most Inline graphic to compute the following sets in (i)-(iv).

  • (i)

    For each base-vertex Inline graphic such that Inline graphic, where Inline graphic, compute the set Inline graphic, Inline graphic of rooted c-trees. Note that every c-tree in the set Inline graphic with Inline graphic is a target v-component in Inline graphic; Set Inline graphic and Inline graphic;

  • (ii)

    For each base-vertex Inline graphic such that Inline graphic, where Inline graphic, and integers Inline graphic and Inline graphic, compute the sets Inline graphic of rooted c-trees and Inline graphic of their frequency vectors; Set Inline graphic;

  • (iii)

    For each base-vertex Inline graphic such that Inline graphic and each possible tuple Inline graphic, compute the sets Inline graphic and Inline graphic of rooted nc-trees and the sets Inline graphic and Inline graphic of their frequency vectors; Set Inline graphic and Inline graphic;

  • (iv)

    For each base-edge Inline graphic and each possible tuple Inline graphic, compute the sets Inline graphic and Inline graphic of rooted nc-trees and Inline graphic, Inline graphic of rooted c-trees and the sets Inline graphic, Inline graphic and Inline graphic of their frequency vectors; For each base-edge Inline graphic with Inline graphic and each possible pair (adm) with Inline graphic, Inline graphic and Inline graphic, we compute the sets Inline graphic of rooted c-trees and Inline graphic; Set Inline graphic Inline graphic, Inline graphic Inline graphic, Inline graphic Inline graphic and Inline graphic Inline graphic.

To compute the above sets of trees and vectors, we enumerate all possible trees with height at most Inline graphic under the size constraint (1) by a branch-and-bound procedure.

Step 2: Generation of frequency vectors of end-subtrees

For each base-vertex Inline graphic or each base-edge Inline graphic such that Inline graphic and each possible tuple Inline graphic, Step 2 computes the set Inline graphic in the ascending order of Inline graphic. Observe that each vector Inline graphic is obtained as Inline graphic from a combination of vectors Inline graphic, Inline graphic and an edge-configuration Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ2.gif 2

Figure 11(a) illustrates this process of computing a vector Inline graphic.

Fig. 11.


(a) An illustration of computing a vector Inline graphic from the frequency vectors Inline graphic Inline graphic of a bi-rooted nc-tree T and Inline graphic of an nc-tree Inline graphic; (b) An illustration of computing a vector Inline graphic from the frequency vectors Inline graphic, Inline graphic of a c-tree T and Inline graphic of an nc-tree Inline graphic.

We call an edge-configuration Inline graphic feasible to the set Inline graphic, if at least one vector Inline graphic is obtained from a combination Inline graphic and Inline graphic. We let Inline graphic store all edge-configurations Inline graphic feasible to Inline graphic.
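The combination step and the bookkeeping of feasible edge-configurations can be sketched as follows. This is a simplified illustration under explicit assumptions: a new vector is taken to be the sum of the two subtree vectors plus one occurrence of the joining edge-configuration, and the only condition checked is that the result does not exceed a bound vector, whereas condition (2) in the text also constrains further quantities not modelled here. The function `combine` and the encoding of an edge-configuration as a coordinate index are hypothetical.

```python
def combine(vectors_a, vectors_b, gammas, bound):
    """Combine two sets of frequency vectors through each edge-configuration.

    gammas: indices of the coordinates that count each edge-configuration
    bound:  upper-bound vector that every generated vector must respect
    Returns the generated vectors and the edge-configurations that were feasible.
    """
    result, feasible = set(), set()
    for g in gammas:
        for x in vectors_a:
            for y in vectors_b:
                # Sum both vectors and add one occurrence of configuration g.
                z = tuple(a + b + (1 if i == g else 0)
                          for i, (a, b) in enumerate(zip(x, y)))
                if all(zi <= bi for zi, bi in zip(z, bound)):
                    result.add(z)
                    feasible.add(g)
    return result, feasible

# Toy two-coordinate example.
vectors_a = {(1, 0), (0, 1)}
vectors_b = {(0, 0), (1, 1)}
new_vectors, feasible_gammas = combine(vectors_a, vectors_b, [0, 1], (2, 1))
```

Storing the feasible configurations alongside the vector set is what later allows the backward phase to trace which combinations produced a given vector.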

Step 3: Generation of frequency vectors of rooted core-subtrees

For each base-vertex Inline graphic or each base-edge Inline graphic such that Inline graphic and each possible tuple Inline graphic with Inline graphic, Step 3 computes the set Inline graphic, where

graphic file with name 41598_2025_5976_Article_Equ31.gif

Observe that each vector Inline graphic is obtained as Inline graphic from a combination of vectors Inline graphic, Inline graphic, Inline graphic and an edge-configuration Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ3.gif 3

where Inline graphic. Figure 11(b) illustrates this process of computing a vector Inline graphic.

We call an edge-configuration Inline graphic feasible to the set Inline graphic, if at least one vector Inline graphic is obtained from a combination of vectors Inline graphic and Inline graphic. We let Inline graphic store all edge-configurations Inline graphic feasible to Inline graphic.

For each base-vertex Inline graphic, it holds that Inline graphic. Step A generates all target v-components Inline graphic such that Inline graphic based on the sets of frequency vectors generated in Steps 1 to 3.

Note that the set Inline graphic is constructed in Step 1 for integers Inline graphic and in Step 3 for integers Inline graphic.

Step 4: Generation of frequency vectors of bi-rooted core-subtrees

For each base-edge Inline graphic, each index Inline graphic and each possible tuple Inline graphic with Inline graphic, Step 4 computes the set Inline graphic in the ascending order of Inline graphic. Let Inline graphic denote the c-tree with a single vertex v such that Inline graphic.

For Inline graphic, we see that each vector Inline graphic is obtained as Inline graphic from a combination of vectors Inline graphic, Inline graphic and an edge-configuration Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ4.gif 4

Figure 12(a) illustrates this process of computing a vector Inline graphic Inline graphic.

Fig. 12.


An illustration of computing a vector Inline graphic (a) For Inline graphic, a vector Inline graphic is obtained from the vectors Inline graphic of a rooted c-tree T and Inline graphic of the c-tree Inline graphic; (b) For Inline graphic, a vector Inline graphic is obtained from the vectors Inline graphic of a rooted c-tree T and Inline graphic of a c-tree Inline graphic.

For Inline graphic, observe that each vector Inline graphic is obtained as Inline graphic from a combination of vectors Inline graphic, Inline graphic Inline graphic and an edge-configuration Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ5.gif 5

Figure 12(b) illustrates this process of computing a vector Inline graphic Inline graphic.

We call an edge-configuration Inline graphic with Inline graphic feasible to the set Inline graphic Inline graphic, if at least one vector Inline graphic is obtained from a combination of vectors Inline graphic and Inline graphic Inline graphic (or Inline graphic for Inline graphic). We let Inline graphic Inline graphic store all edge-configurations Inline graphic feasible to Inline graphic.

Step 5: Enumeration of feasible vector pairs

For each edge Inline graphic, a feasible vector pair is defined to be a pair of vectors Inline graphic Inline graphic, Inline graphic with Inline graphic that admits an edge-configuration Inline graphic such that

graphic file with name 41598_2025_5976_Article_Equ6.gif 6

Let Inline graphic denote the set of feasible vector pairs for a base-edge Inline graphic. We also call each of the two vectors in a feasible vector pair Inline graphic feasible, and let Inline graphic Inline graphic, Inline graphic denote the set of feasible vectors Inline graphic Inline graphic. Figure 13 illustrates a feasible vector pair Inline graphic.

Fig. 13.


An illustration of computing a feasible vector pair Inline graphic with Inline graphic of c-trees Inline graphic for a base-edge Inline graphic.

The last equality in (6) is equivalent to the condition that Inline graphic is equal to the vector Inline graphic, which we call the Inline graphic-complement of Inline graphic, and denote it by Inline graphic.

For each edge Inline graphic, Step 5 enumerates the set Inline graphic of all feasible vector pairs Inline graphic. To efficiently search for a feasible pair of vectors in two sets Inline graphic, Inline graphic with Inline graphic, we first compute the Inline graphic-complement vector Inline graphic of each vector Inline graphic for each edge-configuration Inline graphic with Inline graphic, and denote by Inline graphic the set of the resulting Inline graphic-complement vectors. Observe that Inline graphic is a feasible vector pair if and only if Inline graphic. To find such pairs, we merge the sets Inline graphic and Inline graphic into a sorted list Inline graphic. Then each feasible vector pair Inline graphic appears as a consecutive pair of vectors Inline graphic and Inline graphic in the list Inline graphic.
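The search for feasible vector pairs through complements and a merged sorted list can be sketched as follows. The form of the complement used here is an assumption made for the example: the target vector minus the second vector minus one occurrence of the edge-configuration, so that a pair is feasible exactly when the first vector equals that complement. The names `feasible_pairs`, `target` and the index encoding of edge-configurations are hypothetical.

```python
def feasible_pairs(set1, set2, gammas, target):
    """Return triples (x1, x2, g) with x1 + x2 + (unit vector of g) == target."""
    # Tag 0 marks vectors of the first set, tag 1 marks complement vectors.
    entries = [(x1, 0, x1, -1) for x1 in set1]
    for x2 in set2:
        for g in gammas:
            comp = tuple(t - b - (1 if i == g else 0)
                         for i, (t, b) in enumerate(zip(target, x2)))
            entries.append((comp, 1, x2, g))
    entries.sort()  # equal vectors become adjacent, first-set entry first

    pairs, current = [], None
    for vec, tag, x, g in entries:
        if tag == 0:
            current = vec                 # latest vector seen from the first set
        elif vec == current:
            pairs.append((vec, x, g))     # complement matches it: feasible pair
    return pairs

# Toy two-coordinate example.
set1 = {(1, 1), (2, 0)}
set2 = {(0, 1), (1, 0)}
pairs = feasible_pairs(set1, set2, [0, 1], (2, 2))
```

Sorting replaces a quadratic comparison of the two sets by one pass over a list whose length is the size of the first set plus the number of complement vectors.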

From a feasible vector pair Inline graphic for a base-edge Inline graphic, Step B generates all target e-components Inline graphic such that Inline graphic consists of two c-trees Inline graphic and Inline graphic with Inline graphic, Inline graphic.

Constructing DAG representations in the backward phase

Step A: Constructing target v-components from frequency vectors

For each base-vertex Inline graphic such that Inline graphic, the set of target v-components is constructed in Step 1. Let Inline graphic be a base-vertex such that Inline graphic, where Inline graphic. Based on the sets of frequency vectors computed in Steps 1, 2 and 3, we construct a DAG representation Inline graphic of the set of target v-components defined in Section 3.7. Step A consists of three steps. Step A1 discards unnecessary vectors from the vector sets computed in Steps 1, 2 and 3 of Section 3.9.

Step A1.

  • (i)

    Note that Inline graphic. We call the vector Inline graphic feasible. Let Inline graphic.

  • (ii)

    For each possible tuple Inline graphic, we define “feasible vectors” as follows. For each Inline graphic, if a pair of vectors Inline graphic Inline graphic and Inline graphic satisfies (3) and the vector Inline graphic is feasible, then we call each of these vectors Inline graphic and Inline graphic feasible. Let Inline graphic, Inline graphic denote the set of feasible vectors Inline graphic and Inline graphic denote the set of feasible vectors Inline graphic.

  • (iii)

    For each possible tuple Inline graphic with Inline graphic, we define “feasible vectors” in the descending order of Inline graphic as follows. For each Inline graphic, if a pair of vectors Inline graphic and Inline graphic satisfies (2) and the vector Inline graphic is feasible, then we call each of these vectors Inline graphic and Inline graphic feasible. Let Inline graphic denote the set of feasible vectors Inline graphic and Inline graphic denote the set of feasible vectors Inline graphic.
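The backward marking in Step A1 can be illustrated with a short Python sketch. It is not the authors' code and rests on several assumptions: the forward phase is assumed to have stored a record `(z, x, y)` whenever vectors `x` and `y` were combined into `z`, each vector is keyed by a hypothetical `(level, vector)` pair, and the ingredients of a record are assumed to sit at strictly lower levels than its result.

```python
from typing import Iterable, Set, Tuple

Key = Tuple[int, Tuple[int, ...]]  # (level, frequency vector)
Record = Tuple[Key, Key, Key]      # (z, x, y): x and y were combined into z


def prune_feasible(records: Iterable[Record], target: Key) -> Set[Key]:
    feasible = {target}
    # Visit results from the highest level down, so that the status of z is
    # final before its ingredients x and y are examined.
    for z, x, y in sorted(records, key=lambda r: r[0][0], reverse=True):
        if z in feasible:
            feasible.add(x)
            feasible.add(y)
    return feasible


# Hypothetical toy records; only the first and third lead to the target.
records = [
    ((2, (3, 2)), (1, (2, 1)), (0, (1, 0))),
    ((2, (4, 2)), (1, (2, 1)), (0, (2, 0))),
    ((1, (2, 1)), (0, (1, 0)), (0, (1, 0))),
    ((1, (3, 1)), (0, (2, 0)), (0, (1, 0))),
]
feasible = prune_feasible(records, target=(2, (3, 2)))
```

Vectors that never contribute to the target, such as `(0, (2, 0))` here, are left unmarked and can be discarded before the DAG is built.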

Step A2 constructs a set N of nodes and a function Inline graphic for the DAG representation. A node in N corresponds to a frequency vector Inline graphic of a chemical tree T, and we denote the node by Inline graphic for notational simplicity.

Step A2: Constructing N.

  • (i)

    Create a unique sink Inline graphic in N.

  • (ii)

    For each integer Inline graphic, create a node Inline graphic called an internal-node (an int-node, for short) for each feasible vector Inline graphic, set Inline graphic and let Inline graphic denote the set of these int-nodes.

  • (iii)

    For the feasible vector Inline graphic, create a node Inline graphic in N, which will be the unique source in (N, A), and set Inline graphic with Inline graphic. Let Inline graphic consist of source Inline graphic.

  • (iv)

    Let Inline graphic.

Step A3 creates arcs in our DAG representation by re-executing some part of the forward phase in Steps 1, 2 and 3 of Section 3.9. We first execute Step 1, next execute Step 2 only for feasible vectors in the ascending order of Inline graphic and then execute Step 3 only for feasible vectors as follows.

Step A3: Constructing A.

  • (i)

    (Step 1(iii)) For each feasible vector Inline graphic, create an arc Inline graphic that leaves from the int-node Inline graphic and enters sink Inline graphic, and set Inline graphic and Inline graphic. For the resulting arc Inline graphic, we set Inline graphic and Inline graphic.

  • (ii)

    (Step 2) For each tuple Inline graphic such that Inline graphic and Inline graphic, create an arc Inline graphic, if there are feasible vectors Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic. In this case, we set Inline graphic and Inline graphic and create an arc Inline graphic that leaves from the int-node Inline graphic and enters the int-node Inline graphic. For the resulting arc Inline graphic, we set Inline graphic and Inline graphic.

  • (iii)

    (Step 3) Create an arc Inline graphic, if there are feasible vectors Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic. In this case, we set Inline graphic and Inline graphic and create an arc Inline graphic that leaves from the source Inline graphic and enters the int-node Inline graphic. For the resulting arc Inline graphic, we set Inline graphic and Inline graphic.

  • (iv)

    Let A(h), Inline graphic denote the set of arcs Inline graphic such that Inline graphic or Inline graphic. Set Inline graphic.
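As a rough illustration of Steps A2 and A3, the following Python sketch builds a node set and an arc list from a set of feasible vectors. It makes several assumptions that are not taken from the paper: feasible vectors are keyed by a hypothetical `(level, vector)` pair, the forward phase is assumed to have left records `(z, x, y)` meaning that `x` and `y` were combined into `z`, and the `attached` field is a stand-in for the attributes that the steps above assign to each arc.

```python
from typing import Hashable, List, NamedTuple, Optional, Set, Tuple

SINK = "sink"


class Arc(NamedTuple):
    tail: Hashable
    head: Hashable
    attached: Optional[Hashable]  # hypothetical: vector of the subtree joined along this arc


def build_dag(feasible: Set[tuple], records: List[tuple], base_level: int = 0):
    nodes = set(feasible) | {SINK}
    arcs: List[Arc] = []
    # Vectors of the lowest level are connected directly to the sink.
    for key in sorted(feasible):
        if key[0] == base_level:
            arcs.append(Arc(key, SINK, None))
    # A record contributes an arc only if all three of its vectors are feasible.
    for z, x, y in records:
        if z in feasible and x in feasible and y in feasible:
            arcs.append(Arc(z, x, y))
    return nodes, arcs


# Hypothetical toy input.
feasible = {(2, (3, 2)), (1, (2, 1)), (0, (1, 0))}
records = [
    ((2, (3, 2)), (1, (2, 1)), (0, (1, 0))),
    ((2, (4, 2)), (1, (2, 1)), (0, (2, 0))),
    ((1, (2, 1)), (0, (1, 0)), (0, (1, 0))),
    ((1, (3, 1)), (0, (2, 0)), (0, (1, 0))),
]
nodes, arcs = build_dag(feasible, records)
sources = nodes - {arc.head for arc in arcs}
```

In this toy instance the only node without an incoming arc is the node of the target vector, which plays the role of the unique source.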

For base-vertex Inline graphic in Figure 4(a), Step 1(ii) computes one c-tree Inline graphic in Figure 14(a) and the set of the frequency vectors Inline graphic such that

Fig. 14.

An illustration of chemical rooted/bi-rooted trees: (a) The c-tree Inline graphic and nc-trees Inline graphic in Step 1 for base-vertex Inline graphic in Figure 4(a); (b) The c-trees Inline graphic and nc-trees Inline graphic in Step 1 for base-edge Inline graphic in Figure 4(b).

Inline graphic,

and Step 1(iii) computes seven nc-trees Inline graphic in Figure 14(a) and the sets of their frequency vectors Inline graphic such that

Inline graphic,

Inline graphic,

Inline graphic, Inline graphic,

Inline graphic, Inline graphic,

Inline graphic,

Inline graphic, Inline graphic and

Inline graphic.

Step B: Constructing target e-components from frequency vectors

Let Inline graphic be a base-edge. Recall that Step 5 computes the set Inline graphic of feasible vector pairs Inline graphic and the set Inline graphic of feasible vectors for each index Inline graphic. For each feasible vector pair Inline graphic, a target e-component Inline graphic is obtained as a pair of c-trees Inline graphic and Inline graphic with a bond-multiplicity m such that Inline graphic Inline graphic with Inline graphic, where we call Inline graphic the part-i of the target e-component Inline graphic. For each index Inline graphic, let Inline graphic denote the set of the c-trees T such that Inline graphic is a feasible vector Inline graphic Inline graphic.

In this section, we design an algorithm for constructing a DAG representation of the c-trees Inline graphic, Inline graphic.

Based on the sets of frequency vectors computed in Steps 1 to 5, we construct a DAG representation of the set of target e-components. Step B consists of five steps. Step B1 discards unnecessary vectors from the vector sets computed in Steps 1 to 4.

Step B1: Discarding unnecessary vectors

  • (i)

    Compute the sets of feasible vectors, Inline graphic, Inline graphic Inline graphic.

  • (ii)

    For each possible tuple Inline graphic with Inline graphic, we define “feasible vectors” in the descending order Inline graphic as follows. For each Inline graphic with Inline graphic and Inline graphic, if a pair of vectors Inline graphic and Inline graphic satisfies (5) and the vector Inline graphic is feasible, then we call each of these vectors Inline graphic and Inline graphic feasible. Let Inline graphic denote the set of feasible vectors Inline graphic and Inline graphic denote the set of feasible vectors Inline graphic. For each Inline graphic with Inline graphic and Inline graphic, if a pair of vectors Inline graphic and Inline graphic satisfies (4) and the vector Inline graphic is feasible, then we call such a vector Inline graphic feasible. Let Inline graphic denote the set of feasible vectors Inline graphic.

  • (iii)

    For each possible tuple Inline graphic with Inline graphic, we define “feasible vectors” in the descending order of Inline graphic as follows. For each Inline graphic, if a pair of vectors Inline graphic and Inline graphic satisfies (3) and the vector Inline graphic is feasible, then we call each of these vectors Inline graphic and Inline graphic feasible. Let Inline graphic, Inline graphic denote the set of feasible vectors Inline graphic and Inline graphic denote the set of feasible vectors Inline graphic.

  • (iv)

    For each possible tuple Inline graphic with Inline graphic, we define “feasible vectors” in the set Inline graphic in the descending order of Inline graphic as follows. For each Inline graphic, if a pair of vectors Inline graphic and Inline graphic satisfies (2) and the vector Inline graphic is feasible, then we call each of these vectors Inline graphic and Inline graphic feasible. Let Inline graphic denote the set of feasible vectors Inline graphic and Inline graphic denote the set of feasible vectors Inline graphic.

Steps B2 and B3 construct the non-core part Inline graphic of a DAG representation of target e-components in a manner similar to Steps A2 and A3.

Step B2: Constructing Inline graphic

  • (i)

    Create a unique sink Inline graphic in Inline graphic.

  • (ii)

    For each integer Inline graphic, create a node Inline graphic, called an internal-node (an int-node, for short) for each feasible vector Inline graphic, set Inline graphic and let Inline graphic denote the set of these int-nodes.

  • (iii)

    For each integer Inline graphic, create a node Inline graphic for each feasible vector Inline graphic, set Inline graphic with Inline graphic and let Inline graphic denote the set of these int-nodes, where each node in Inline graphic will be a source in Inline graphic.

  • (iv)

    Let Inline graphic.

Step B3: Constructing Inline graphic

  • (i)

    Step 1(iv): For each feasible vector Inline graphic, set Inline graphic and Inline graphic and create an arc Inline graphic that leaves from int-node Inline graphic and enters sink Inline graphic, set Inline graphic and Inline graphic for the arc Inline graphic.

  • (ii)

    Step 2: For each tuple Inline graphic such that Inline graphic and Inline graphic, if there are feasible vectors Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic, then set Inline graphic and Inline graphic and create an arc Inline graphic that leaves from int-node Inline graphic and enters int-node Inline graphic and set Inline graphic and Inline graphic for the arc Inline graphic.

  • (iii)

    Step 3: For each tuple Inline graphic such that Inline graphic, Inline graphic, if there are feasible vectors Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic, then set Inline graphic and Inline graphic and create an arc Inline graphic that leaves from source Inline graphic and enters int-node Inline graphic and set Inline graphic and Inline graphic for the arc Inline graphic.

  • (iv)

    Let Inline graphic, Inline graphic denote the set of arcs Inline graphic such that Inline graphic or Inline graphic. Set Inline graphic.

For base-edge Inline graphic in Figure 4(b), Step 1(iv) computes six c-trees Inline graphic in Figure 14(b) and the sets of their frequency vectors Inline graphic such that Inline graphic,

Inline graphic,

Inline graphic, Inline graphic,

Inline graphic,

Inline graphic,

Inline graphic

and five nc-trees Inline graphic in Figure 14(b) and the sets of their frequency vectors Inline graphic such that

Inline graphic,

Inline graphic,

Inline graphic, Inline graphic,

Inline graphic,

Inline graphic and

Inline graphic.

Steps B4 and B5 construct the core part Inline graphic of a DAG representation of target e-components.

Step B4: Constructing Inline graphic

  • (i)

    For each Inline graphic, create nodes Inline graphic, set Inline graphic with Inline graphic and let Inline graphic and Inline graphic.

  • (ii)

    For each integer Inline graphic, create a node Inline graphic called a core-node (a c-node, for short) for each feasible vector Inline graphic, set Inline graphic and let Inline graphic (resp., Inline graphic) denote the set of these c-nodes if Inline graphic (resp., Inline graphic).

  • (iii)

    Let Inline graphic.

Step B5: Constructing Inline graphic

  • (i)

    Step 4: For each integer Inline graphic and each tuple Inline graphic such that Inline graphic, execute the following:

    • If there are feasible vectors Inline graphic with Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic Inline graphic with Inline graphic, then create an arc a between c-nodes Inline graphic and Inline graphic with Inline graphic and Inline graphic so that for Inline graphic, arc a is a left-arc Inline graphic directed from Inline graphic to Inline graphic; for Inline graphic, arc a is a right-arc Inline graphic directed from Inline graphic to Inline graphic.

    • If there are feasible vectors Inline graphic with Inline graphic and Inline graphic such that Inline graphic is a feasible vector for an edge-configuration Inline graphic Inline graphic with Inline graphic, then create an arc a between c-nodes Inline graphic and Inline graphic with Inline graphic and Inline graphic so that for Inline graphic, arc a is a left-arc Inline graphic directed from Inline graphic to Inline graphic; for Inline graphic, arc a is a right-arc Inline graphic directed from Inline graphic to Inline graphic.

  • (ii)

    Step 5: For each feasible vector pair Inline graphic, create a middle-arc Inline graphic that leaves from c-node Inline graphic and enters c-node Inline graphic and set Inline graphic and Inline graphic for the bond-multiplicity m determined uniquely for the pair Inline graphic by (6).

  • (iii)

    Let Inline graphic, Inline graphic denote the set of arcs Inline graphic such that Inline graphic and set Inline graphic.

Experimental results

We implemented our new dynamic programming algorithm for generating target components and conducted experiments to evaluate its computational efficiency. The experiments were run on a PC with a 3.0 GHz Core i7-9700 processor and 16 GB of DDR4 RAM. We used ChemDoodle version 10.2.0 to construct 2D drawings of chemical graphs.

We set the branch parameter Inline graphic to 2. We conducted the following three experiments.

  1. Select some chemical compounds Inline graphic in the PubChem database and a vertex v in Inline graphic, and compute the DAG representation Inline graphic of the set of all target v-components of the frequency vector Inline graphic of the v-component Inline graphic of Inline graphic at v.

  2. Select some chemical compounds Inline graphic in the PubChem database and a uv-path P such that all internal vertices in P are of degree 2 in the core Inline graphic, and compute the DAG representation Inline graphic of the set of all target e-components of the frequency vector Inline graphic of the e-component Inline graphic of Inline graphic at e by regarding uv as a base-edge Inline graphic.

  3. Select some chemical compounds Inline graphic in the PubChem database and a base-graph Inline graphic, and compute the set of the DAG representations Inline graphic for the target v-components and the DAG representations Inline graphic.

We set a time limit of 3600 sec. In the following tables, M.O. and T.O. denote out-of-memory and time-out, respectively. The proposed algorithm is compared with the state-of-the-art compound generator MOLGEN developed by Gugisch et al.38. Using the online version39 of MOLGEN, we generated Inline graphic (the available limit) isomers for each instance discussed in the following sections. In these experiments, we applied the available restrictions on the isomers, such as the number of bonds, cycles, maximum bond multiplicity, single bonds, double bonds and triple bonds. Note that the isomers generated by MOLGEN can be structurally different from those generated by the proposed method.

Results of experiment 1

For this experiment, we selected six instances Inline graphic as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 7600, CID: 152211, CID: 497892, CID: 46930263, CID: 67558426 and CID: 46930349 which are denoted by Inline graphic, Inline graphic, respectively. Each chemical instance Inline graphic has one benzene ring as the core and only one core vertex v that is adjacent to a non-core vertex as illustrated in Figures 4(a) and 15. For this vertex v, we generate target v-components.

Fig. 15.

(a)-(e) Instances Inline graphic to compute target v-components, respectively. All instances have one benzene ring and a chemical acyclic graph Inline graphic joining vertex v in the benzene ring and vertex u.

Table 1 shows the results of computing v-components, where we denote the following:

  • i: the instance Inline graphic, Inline graphic;

  • Inline graphic: the set of chemical elements in the v-component Inline graphic in instance Inline graphic;

  • Inline graphic: the number Inline graphic of vertices in the v-component Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-internal edges in the v-component Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-external edges in the v-component Inline graphic;

  • Inline graphic: the height Inline graphic of the v-component Inline graphic;

  • D-time: the running time (sec.) to construct the DAG representation Inline graphic;

  • Inline graphic: the number of vertices in Inline graphic;

  • Inline graphic: the number of edges in Inline graphic;

  • p-time: the running time (sec.) to trace all paths from the sources to the sinks in Inline graphic;

  • Inline graphicp: the number of all paths from the sources to the sinks in Inline graphic;

  • T-LB: a lower bound on the number of all target v-components Inline graphic;

  • G-time (resp., G-time38): the running time (sec.) to construct all (or up to Inline graphic) (resp., Inline graphic) target v-components Inline graphic (resp., isomers by MOLGEN38) from Inline graphic;

  • Inline graphic (resp., Inline graphic38): the number of all (or up to Inline graphic) (resp., Inline graphic) target v-components Inline graphic (resp., isomers by MOLGEN38) generated from Inline graphic.

From Table 1, we observe that with the increase in the size of Inline graphic, there is no significant increase in D-time, p-time and G-time. Furthermore, the D-time, p-time and G-time are bounded above by 0.162, 0.255, and 0.136, respectively, from which it is evident that the proposed algorithm can generate target v-components efficiently. Note that the running time of the proposed algorithm is significantly lower than that of MOLGEN38.

Table 1.

Results for Computing Target v-components.

| i | Inline graphic | Inline graphic | Inline graphic | Inline graphic | D-time | Inline graphic | p-time | Inline graphicp | T-LB | G-time | Inline graphic | G-time38 | Inline graphic38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inline graphic | 10 | C,O,N | 5, 3 | 7 | 0.000 | 20, 27 | 0.000 | 8 | 8 | 0.000 | 8 | 31.643 | Inline graphic |
| Inline graphic | 20 | C,O,N | 6, 7 | 11 | 0.008 | 343, 780 | 0.001 | 1440 | 1440 | 0.016 | 1440 | 46.22 | Inline graphic |
| Inline graphic | 25 | C,O,N | 7, 5 | 14 | 0.100 | 3427, Inline graphic | 0.192 | Inline graphic | Inline graphic | 0.121 | Inline graphic | 33.185 | Inline graphic |
| Inline graphic | 27 | C,O,N | 9, 6 | 17 | 0.115 | 6523, Inline graphic | 0.163 | Inline graphic | Inline graphic | 0.129 | Inline graphic | 58.185 | Inline graphic |
| Inline graphic | 28 | C,O,N | 8, 7 | 17 | 0.129 | 5627, Inline graphic | 1.640 | Inline graphic | Inline graphic | 0.132 | Inline graphic | 61.73 | Inline graphic |
| Inline graphic | 29 | C,O,N | 9, 6 | 19 | 0.162 | 8967, Inline graphic | 0.255 | Inline graphic | Inline graphic | 0.136 | Inline graphic | 42.724 | Inline graphic |

Results of experiment 2

For this experiment, we selected six instances Inline graphic as follows. We selected from the database PubChem six cyclic chemical compounds, CID: 3729083, CID: 129130, CID: 47622, CID: 195338, CID: 497867 and CID: 10325899 which are denoted by Inline graphic, Inline graphic, respectively. Each chemical instance Inline graphic has two benzene rings and a chemical acyclic graph Inline graphic joining them, where we denote by u and v the common vertices with the rings and Inline graphic as illustrated in Figures 4(b) and 16. We regard Inline graphic as a base-edge and generate target e-components of Inline graphic.

Fig. 16.

(a)-(e) Instances Inline graphic, Inline graphic to compute target e-components, respectively. All instances have two benzene rings and a chemical acyclic graph Inline graphic joining two vertices u and v in these benzene rings.

Table 2 shows the results of computing e-components, where we denote the following:

  • i: the instance Inline graphic, Inline graphic;

  • Inline graphic: the set of chemical elements in the e-component Inline graphic in instance Inline graphic;

  • Inline graphic: the number Inline graphic of vertices in Inline graphic;

  • Inline graphic: the number of different edge-configurations of the core edges in Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-internal edges in Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-external edges in Inline graphic;

  • Inline graphic: the number of 2-branch core vertices in Inline graphic;

  • Inline graphic: the maximum height of a chemical rooted tree rooted at a core vertex in Inline graphic;

  • D-time: the running time (sec.) to construct the DAG representation Inline graphic;

  • Inline graphic: the number of vertices in Inline graphic;

  • Inline graphic: the number of edges in Inline graphic;

  • p-time: the running time (sec.) to trace all paths from the sources to the sinks in Inline graphic;

  • Inline graphicp: the number of all paths from the sources to the sinks in Inline graphic;

  • T-LB: a lower bound on the number of all target e-components Inline graphic;

  • G-time (resp., G-time38): the running time (sec.) to construct all (or up to Inline graphic) (resp., Inline graphic) target e-components Inline graphic (resp., isomers by MOLGEN38) from Inline graphic;

  • Inline graphic (resp., Inline graphic38): the number of all (or up to Inline graphic) (resp., Inline graphic) target e-components Inline graphic (resp., isomers by MOLGEN38) generated from Inline graphic.

By Table 2, the D-time, p-time and G-time are bounded above by 9.580, 6.420, and 0.168, respectively. This implies that the proposed algorithm can efficiently generate target e-components. Moreover, the running time of the proposed algorithm is significantly lower than that of MOLGEN38.

Table 2.

Results for Computing e-components.

| i | Inline graphic | Inline graphic | Inline graphic | Inline graphic | D-time | Inline graphic | p-time | Inline graphicp | T-LB | G-time | Inline graphic | G-time38 | Inline graphic38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inline graphic | 17 | C,O,N | 4, 3, 3 | 1, 7 | 0.000 | 24, 27 | 0.000 | 12 | 12 | 0.000 | 12 | 28.822 | Inline graphic |
| Inline graphic | 17 | C,O,N | 3, 2, 5 | 1, 4 | 0.001 | 78, 116 | 0.001 | 180 | 180 | 0.002 | 180 | 29.083 | Inline graphic |
| Inline graphic | 20 | C,O,N | 4, 0, 3 | 0, 2 | 0.001 | 193, 313 | 0.004 | 1512 | 1512 | 0.020 | 1512 | 97.158 | Inline graphic |
| Inline graphic | 25 | C,O,N | 6, 3, 4 | 1, 5 | 0.011 | 525, 924 | 0.024 | Inline graphic | Inline graphic | 0.152 | Inline graphic | 39.25 | Inline graphic |
| Inline graphic | 30 | C,O,N | 3, 6, 6 | 1, 9 | 0.908 | 788, 2275 | 0.310 | Inline graphic | Inline graphic | 0.168 | Inline graphic | 59.099 | Inline graphic |
| Inline graphic | 32 | C,O,N | 3, 6, 7 | 3, 9 | 9.580 | Inline graphic, Inline graphic | 6.420 | Inline graphic | Inline graphic | 0.165 | Inline graphic | 58.792 | Inline graphic |

Results of experiment 3

For this experiment, we selected five Inline graphic instances as follows. We selected from the database PubChem five cyclic chemical compounds, CID: 1356, CID: 89834791, CID: 334516, CID: 91420002 and CID: 124165467 which are denoted by Inline graphic, Inline graphic, respectively. These instances are illustrated in Figure 17.

Fig. 17.

(a)-(e) Instances Inline graphic, Inline graphic, respectively, to generate isomers, where a set Inline graphic is selected as the set of vertices indicated with asterisks.

Table 3 shows the results of computing chemical isomers Inline graphic of Inline graphic, where we denote the following:

  • i: the instance Inline graphic, Inline graphic;

  • Inline graphic: the number Inline graphic of the given chemical graph Inline graphic in instance Inline graphic;

  • Inline graphic: the number of chemical elements in Inline graphic;

  • Inline graphic: the number of different edge-configurations of the core edges in Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-internal edges in Inline graphic;

  • Inline graphic: the number of different edge-configurations of 2-external edges in Inline graphic;

  • Inline graphic: the number Inline graphic of base-vertices of a base-graph Inline graphic selected to Inline graphic;

  • Inline graphic: the number Inline graphic of base-edges of a base-graph Inline graphic selected to Inline graphic;

  • Inline graphic: the core size Inline graphic of Inline graphic;

  • Inline graphic: the number of 2-branch core vertices in Inline graphic;

  • Inline graphic: the core height Inline graphic of Inline graphic;

  • D-time: the running time (sec.) to construct the set of DAG representations Inline graphic and Inline graphic;

  • Inline graphic: the total number of vertices in the set of DAG representations Inline graphic and Inline graphic;

  • Inline graphic: the total number of edges in the set of DAG representations Inline graphic and Inline graphic;

  • Inline graphicD: the number of DAG representations successfully constructed within the time limit out of the total number Inline graphic of DAG representations defined for Inline graphic;

  • G-LB: a lower bound on the number of all chemical isomers Inline graphic of Inline graphic;

  • G-time (resp., G-time38): the running time (sec.) to construct all (or up to Inline graphic) (resp., Inline graphic) chemical isomers Inline graphic of Inline graphic (resp., isomers by MOLGEN38);

  • Inline graphic (resp., Inline graphic38): the number of all (or up to Inline graphic) (resp., Inline graphic) chemical isomers Inline graphic of Inline graphic generated from the set of DAG representations (resp., isomers generated by MOLGEN38);

  • #non-isoInline graphic: the number of non-isomorphic chemical isomers generated by the proposed algorithm.

By Table 3, the D-time of every instance except Inline graphic is bounded above by 0.004, which is very small. Instance Inline graphic has a relatively large DAG representation with Inline graphic and Inline graphic vertices and edges, respectively, and thus the D-time for Inline graphic is 24.600. However, the G-time for all the instances is bounded above by 0.266, and hence the proposed algorithm can efficiently generate isomers for an instance with around 70 non-hydrogen atoms. Furthermore, the proposed algorithm outperforms MOLGEN38 in terms of running time and the number of isomers.

Table 3.

Results for Computing Chemical Isomers Inline graphic of Inline graphic.

| i | Inline graphic | Inline graphic | Inline graphic | Inline graphic, Inline graphic | Inline graphic | D-time | Inline graphic | Inline graphicD | G-LB | G-time | Inline graphic | #non-isoInline graphic | G-time38 | Inline graphic38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inline graphic | 40 | 3 | 9, 1, 3 | 6, 8 | 32, 1, 3 | 0.002 | 429, 803 | 2 | Inline graphic | 0.180 | Inline graphic | Inline graphic | 30.964 | Inline graphic |
| Inline graphic | 50 | 3 | 9, 9, 8 | 5, 7 | 22, 2, 9 | 24.600 | Inline graphic, Inline graphic | 7 | Inline graphic | 0.195 | Inline graphic | Inline graphic | 43.575 | Inline graphic |
| Inline graphic | 50 | 3 | 10, 0, 2 | 10, 13 | 25, 0, 1 | 0.000 | 134, 184 | 5 | 6912 | 0.153 | 6912 | 6912 | 32.978 | Inline graphic |
| Inline graphic | 50 | 3 | 7, 4, 8 | 5, 7 | 24, 4, 6 | 0.001 | 124, 165 | 7 | 4608 | 0.090 | 4608 | 3312 | 77.341 | Inline graphic |
| Inline graphic | 70 | 3 | 8, 4, 7 | 5, 7 | 42, 4, 6 | 0.004 | 303, 392 | 7 | Inline graphic | 0.266 | Inline graphic | 5428 | 46.476 | Inline graphic |

For an in-depth analysis of our algorithm, we randomly selected 100 instances for each Inline graphic from the PubChem database and generated chemical isomers. For each Inline graphic, we index the CIDs of these instances from 1 to 100; they are given in the supplementary material S1_instances. We show the time D-time to construct DAG representations and the time G-time to generate chemical isomers for each Inline graphic in Figures 18(a)-(i). The averages of D-time and G-time over all instances are 0.1884 and 0.0565, respectively, which are reasonably small. Thus it is evident that the proposed algorithm can efficiently generate small chemical isomers. Similarly, we show the number Inline graphic of chemical isomers and the number #non-isoInline graphic of non-isomorphic isomers generated by the proposed algorithm for each Inline graphic in Figures 19(a)-(i). From Figure 19 we observe that for most of the instances, the difference Inline graphic is very small, and hence the proposed algorithm can efficiently generate a large number of non-isomorphic chemical isomers with around 70 non-hydrogen atoms.
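To make the distinction between generated isomers and non-isomorphic isomers concrete, the following Python sketch counts distinct graphs up to isomorphism in a small list. The paper does not state how isomorphism was tested, so this is a swapped-in technique for illustration only: a brute-force canonical form over all vertex permutations, which is practical only for very small graphs. The function names and the toy molecules are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Bonds = Dict[Tuple[int, int], int]  # {(u, v): bond multiplicity}


def canonical_form(labels: List[str], bonds: Bonds):
    n = len(labels)
    best = None
    for perm in permutations(range(n)):  # perm[old] = new vertex index
        new_labels = [""] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), m)
            for (u, v), m in bonds.items()
        )
        candidate = (tuple(new_labels), tuple(new_bonds))
        # Keep the lexicographically smallest relabeling as the canonical form.
        if best is None or candidate < best:
            best = candidate
    return best


def count_non_isomorphic(graphs) -> int:
    # Isomorphic graphs share a canonical form, so a set removes duplicates.
    return len({canonical_form(labels, bonds) for labels, bonds in graphs})


graphs = [
    (["C", "C", "O"], {(0, 1): 1, (1, 2): 2}),  # C-C=O
    (["O", "C", "C"], {(0, 1): 2, (1, 2): 1}),  # O=C-C, the same graph relabeled
    (["C", "C", "O"], {(0, 1): 2, (1, 2): 1}),  # C=C-O, a different graph
]
n_distinct = count_non_isomorphic(graphs)
```

Here three generated graphs collapse to two non-isomorphic ones, which mirrors the gap between the two counts reported above.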

Fig. 18.

(a)-(i) Plots of D-time and G-time.

Fig. 19.

(a)-(i) Plots of #Inline graphic and #non-isoInline graphic.

Concluding remarks

In this paper, we considered the problem of enumerating all chemical isomers Inline graphic of a given Inline graphic-lean chemical graph Inline graphic, in the sense that Inline graphic and Inline graphic have a common base graph Inline graphic and satisfy Inline graphic for the feature function based on the frequency of edge-configurations. The dynamic programming algorithm designed by Akutsu and Nagamochi37 can find a limited number of chemical isomers. In this paper, we improved the algorithm and designed a new backtracking procedure to construct a compact DAG representation of the set of all chemical isomers. Just by tracing the DAG representation, we can count the number of all chemical isomers and generate each of them, if necessary in a random way. We implemented the proposed method, and our computational results suggest that the DAG representation of chemical isomers can be constructed for an instance with around 70 vertices. These experiments show that the proposed algorithm can help in discovering novel drugs by efficiently exploring the search space of target chemical compounds.
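The idea of counting and random generation by tracing a DAG can be sketched in Python on a plain DAG. This is a simplified illustration, not the authors' procedure: it counts and samples source-to-sink paths only, whereas arcs in the DAG representation of the paper carry additional attributes, so the number of isomers need not equal the number of paths. The dictionary `out_arcs` and the node names are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Optional


def count_paths(
    out_arcs: Dict[Hashable, List[Hashable]],
    node: Hashable,
    sink: Hashable,
    memo: Optional[dict] = None,
) -> int:
    # Number of directed paths from `node` to `sink`, memoized per node.
    if memo is None:
        memo = {}
    if node == sink:
        return 1
    if node not in memo:
        memo[node] = sum(
            count_paths(out_arcs, head, sink, memo) for head in out_arcs.get(node, [])
        )
    return memo[node]


def sample_path(out_arcs, source, sink, rng: random.Random) -> list:
    memo: dict = {}
    if count_paths(out_arcs, source, sink, memo) == 0:
        raise ValueError("no source-to-sink path")
    path = [source]
    node = source
    while node != sink:
        heads = out_arcs[node]
        # Choosing each arc in proportion to the number of paths that continue
        # through it makes every complete path equally likely.
        weights = [count_paths(out_arcs, head, sink, memo) for head in heads]
        node = rng.choices(heads, weights=weights, k=1)[0]
        path.append(node)
    return path


# Hypothetical toy DAG with source "s" and sink "t".
out_arcs = {"s": ["a", "b"], "a": ["c", "t"], "b": ["a", "t"], "c": ["t"]}
total = count_paths(out_arcs, "s", "t")
path = sample_path(out_arcs, "s", "t", random.Random(0))
```

The count is obtained in time proportional to the number of arcs, without listing the paths, and each sampled path costs one walk from the source to the sink.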

While the proposed method demonstrates promising performance for graphs with up to around 70 vertices, it may face scalability issues for larger chemical structures. Moreover, the method is currently only applicable to Inline graphic-lean chemical graphs and relies on specific base-graph definitions, which may limit its general applicability. Additionally, the feature function is restricted to edge-configuration frequencies, and a rigorous analysis of the computational complexity remains as future work.

Acknowledgements

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant#22H00532.

Author contributions

Ryota Ido: Software, validation, data resources; Naveed Ahmed Azam: Software, validation, data resources, formal analysis, writing—review and editing; Jianshen Zhu: Software, validation, data resources; Hiroshi Nagamochi: Conceptualization, methodology, formal analysis, data resources, writing—original draft preparation, project administration; Tatsuya Akutsu: Conceptualization, methodology, data resources, writing—review and editing, funding acquisition. All authors read and approved the final manuscript.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113, #22H00532, and #22K19830.

Data availability

Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-05976-0.

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group