Journal of Computational Biology. 2013 Mar;20(3):188–198. doi: 10.1089/cmb.2012.0239

Charge Group Partitioning in Biomolecular Simulation

Stefan Canzar 8,*, Mohammed El-Kebir 1,2,*, René Pool 2,3, Khaled Elbassioni 4, Alan E Mark 5,6, Daan P Geerke 3, Leen Stougie 1,7, Gunnar W Klau 1
PMCID: PMC3590896  PMID: 23461571

Abstract

Molecular simulation techniques are increasingly being used to study biomolecular systems at an atomic level. Such simulations rely on empirical force fields to represent the intermolecular interactions. There are many different force fields available, each based on a different set of assumptions and thus requiring different parametrization procedures. Recently, efforts have been made to fully automate the assignment of force-field parameters, including atomic partial charges, for novel molecules. In this work, we focus on a problem arising in the automated parametrization of molecules for use in combination with the gromos family of force fields: namely, the assignment of atoms to charge groups such that for every charge group the sum of the partial charges is ideally equal to its formal charge. In addition, charge groups are required to have size at most k. We show NP-hardness and give an exact algorithm that solves practical problem instances to provable optimality in a fraction of a second.

Key words: atomic force fields, biomolecular simulation, charge groups, dynamic programming, gromos, tree decomposition

1. Introduction

In the context of drug development, biomolecular systems such as protein-peptide (Yang et al., 2010), protein-ligand (Sharma et al., 2009), and protein-lipid interactions (Boggara et al., 2010) can be studied with the use of molecular simulations (Allen and Tildesley, 1987; van Gunsteren et al., 2006) using a force-field model that describes the interatomic interactions. Many biomolecular force fields are available, including amber (Cornell et al., 1995), charmm (Brooks et al., 2009), opls (Jorgensen et al., 1996), and gromos (Scott et al., 1999; Oostenbrink et al., 2004; Schmid et al., 2011). These force fields have in common that the nonbonded intermolecular interactions are represented in terms of interatomic pair potentials.

Typically, the number of atoms in a biomolecular system is in the range of $10^4$ to $10^6$. To observe relevant biological phenomena, time scales on the order of nano- to milliseconds need to be simulated. For such large-scale systems, evaluating all atom–atom interactions is practically infeasible. One way of dealing with this is to only consider interactions of atoms whose distance is within a pre-specified cutoff radius. Since not all interactions are considered, an error is introduced. The magnitude of the error due to omitting atom–atom interactions is inversely proportional to the distance between the atoms. More problematically, there are discontinuities as atoms move in and out of the cutoff radius.

Errors and discontinuities are reduced by combining atoms into charge groups, for which individual centers of geometry are determined. If the distance between two centers of geometry lies within the cutoff distance then all interactions between the atoms of the involved charge groups are considered. Ideally, charge groups should be neutral as interactions are then reduced to dipole–dipole interactions that scale inversely proportional to the cubed interatomic distance. Charge groups should not be too large. This is because the effective cutoff distance of an individual atom in a given charge group is given by the cutoff distance minus the distance to the center of geometry of the charge group. If the distance of an atom to the center of geometry becomes large, the effective cutoff becomes small, leading to errors and discontinuities as described above. For the same reason, charge groups should be connected as interatomic bonds impose spatial proximity.
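The relation between group size and effective cutoff can be illustrated with a small numerical sketch. The coordinates and the cutoff radius below are hypothetical values chosen only for illustration, and the center of geometry is taken to be the unweighted mean of the atom positions.

```python
import math

def center_of_geometry(positions):
    """Unweighted mean of the atom positions (x, y, z)."""
    n = len(positions)
    return tuple(sum(p[axis] for p in positions) / n for axis in range(3))

def effective_cutoffs(positions, cutoff):
    """Effective cutoff per atom: cutoff minus the distance to the group's center."""
    center = center_of_geometry(positions)
    return [cutoff - math.dist(p, center) for p in positions]

# Hypothetical three-atom charge group (coordinates in nm) and cutoff radius (nm).
group = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.1, 0.3, 0.0)]
cutoffs = effective_cutoffs(group, cutoff=1.4)
```

The atom farthest from the center of geometry has the smallest effective cutoff, which is why spatially extended charge groups are undesirable.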

To simulate a molecule, a force field requires a specific topology, which includes the atom types, bonds and angles, the atomic charges, and the charge group assignment. Most biomolecular force fields come with a set of topologies for frequently simulated molecules such as amino acids, lipids, nucleotides, and cofactors. Unparametrized molecules, however, require the construction of their topologies. Such a situation occurs, for instance, when assessing the binding affinity of a novel druglike compound to a certain protein.

Manually building topologies for new compounds is a tedious and time-consuming task, especially when a large chemical library needs to be screened, for example, when determining binding affinities for large sets of potential drug compounds to a newly discovered protein target. Therefore, automated approaches are needed.

Here, we focus on the gromos family of force fields, which has been specifically tailored to simulate biochemical processes, including protein-drug binding and peptide folding. A widely used topology generator for the gromos force field is prodrg (Schüttelkopf and van Aalten, 2004). However, the charge group assignment by prodrg for amino acid topologies contained several large charge groups comprising disconnected atoms, which is inconsistent with gromos (Lemkul et al., 2010). The Automated Topology Builder (atb) is a recent method for automated generation of gromos topologies (Malde et al., 2011). The assignment of atomic charges and charge groups by the atb proceeds in three consecutive stages. Firstly, partial charges are computed using quantum calculations. Subsequently, the symmetry of the molecule is exploited to ensure that symmetric atoms have identical charges. Finally, the molecule is partitioned into charge groups using a greedy algorithm. The atb method was experimentally verified for a set of biologically relevant molecules (Malde et al., 2011). For some large molecules, such as the cofactor adenosine-5’-triphosphate (ATP), however, the atb assigns charge groups that are too large, which leads to instabilities during simulation as described above.

As existing automated procedures such as prodrg and the atb fail in assigning appropriate charge groups, we have investigated the problem in detail. Our contribution is threefold: (1) We introduce the charge group partitioning problem and give a sound mathematical problem definition resulting in charge groups of small size and zero charge. We prove NP-hardness of the problem and identify important special cases, for which we give polynomial time algorithms. (2) Exploiting the properties of molecular structures enables us to present a tree decomposition-based algorithm that solves typical practical problem instances to optimality within fractions of a second. (3) We evaluate the performance of our method by running simulations using the resulting charge group assignments of amino acid side chains, which yield results consistent with experimentally known values. Moreover, for large, highly charged molecules such as ATP, we obtain charge groups that are both suitable for use in simulations as well as reasonable from a chemical perspective.

2. Problem Statement and Complexity

In this section, we give a formal definition of the problem associated with assigning appropriate charge groups within a molecule. Our aim is to capture the two important aspects of chemical intuition discussed above: (1) The number of atoms in a charge group should not exceed a given integer k and (2) the sum of the partial charges of a charge group should ideally equal the sum of its formal charges. Mathematically, the latter condition is equivalent to requiring the sum of differences of formal and partial charges in a charge group to be close to zero. We prove NP-hardness of the problem even if we take into account special characteristics of graphs representing a molecular structure. For the special case k = 2, we obtain a polynomial-time algorithm by reducing the problem to a minimum-cost perfect-matching problem.

A molecular structure can be modeled as a degree-bounded graph G = (V, E), where the nodes correspond to atoms and the edges to chemical bonds. In addition, we consider node weights $\delta : V \to \mathbb{R}$, where δ(v) corresponds to the difference between formal and partial charge of the atom v. A formal definition of the charge group partitioning problem is as follows:

Definition 1

(Charge group partitioning, cgp). Given a graph G = (V, E), node weights $\delta : V \to \mathbb{R}$, and an integer 2 ≤ k ≤ |V|−1, find a partition $\mathcal{V}$ of V such that for all $V' \in \mathcal{V}$ it holds that |V′| ≤ k, the subgraph G[V′] induced by V′ is connected, and which has minimal total error

$$\sum_{V' \in \mathcal{V}} \Bigl| \sum_{v \in V'} \delta(v) \Bigr|.$$

Each subset $V' \in \mathcal{V}$ of the nodes in the partition corresponds to a charge group. The following theorem shows NP-hardness of the problem, even for the restricted case where G is planar. As we will discuss in Section 3, most molecular graphs are planar.
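As a minimal sketch of Definition 1, the following functions check the feasibility of a candidate partition and evaluate its total error. The function names, the adjacency-dictionary representation, and the example weights (integers, in hypothetical units of 0.01 elementary charges) are illustrative assumptions and not part of the original method.

```python
from collections import deque

def is_connected(nodes, adjacency):
    """Breadth-first search restricted to the given non-empty node set."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes

def is_feasible(partition, adjacency, k):
    """Every node lies in exactly one group; groups are connected with at most k nodes."""
    covered = [v for group in partition for v in group]
    if sorted(covered) != sorted(adjacency):
        return False
    return all(len(g) <= k and is_connected(g, adjacency) for g in partition)

def total_error(partition, delta):
    """Sum over all groups of the absolute value of the summed node weights."""
    return sum(abs(sum(delta[v] for v in group)) for group in partition)

# Hypothetical path a - b - c - d with integer weights.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
delta = {"a": 25, "b": -25, "c": 10, "d": -4}
good = [["a", "b"], ["c", "d"]]
bad = [["a", "c"], ["b", "d"]]  # groups are not connected
```

Here `good` is feasible for k = 2 and has total error |25 − 25| + |10 − 4| = 6, whereas `bad` violates the connectivity requirement.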

Theorem 1

cgp is NP-hard, even in the restricted case where G is planar, k = 4, the maximum degree of a node in the graph is 4, and the node weights are $\delta(v) \in \{-1, 3, \varepsilon, -3\varepsilon\}$ for an arbitrary 0 < ε < 1.

Proof

Clearly, the problem belongs to NP. Consider the following problem.

Definition 2

(Planar three-dimensional matching, planar 3DM). Given disjoint sets X1, X2, X3 with |X1| = |X2| = |X3| = m and a set of n triples $\mathcal{T} \subseteq X_1 \times X_2 \times X_3$. The bipartite graph B, with $\mathcal{T}$ as its one color class and X = X1 ∪ X2 ∪ X3 as its other color class and an edge between $T \in \mathcal{T}$ and $x \in X$ if and only if $x \in T$, is planar. Each element of X appears in 2 or 3 triples only. Does there exist a perfect matching in $\mathcal{T}$ (i.e., a subset $\mathcal{T}' \subseteq \mathcal{T}$ of m triples such that each element of X occurs uniquely in a triple in $\mathcal{T}'$)?

This problem has been shown to be NP-complete by Dyer and Frieze (1985). We reduce it to cgp in polynomial time. Take the bipartite graph B in the definition of Planar 3DM with $\mathcal{T}$ and X as color classes. Give each $x \in X$ a weight δ(x) = −1 and each $T \in \mathcal{T}$ a weight δ(T) = 3. For each $T \in \mathcal{T}$ we introduce three extra vertices $t_1^T, t_2^T, t_3^T$ with weights $\delta(t_1^T) = \delta(t_2^T) = \varepsilon$ and $\delta(t_3^T) = -3\varepsilon$, for an arbitrary 0 < ε < 1, and connect them by the path $T, t_1^T, t_2^T, t_3^T$, which we call the tail of T. See Figure 1 for an example. Clearly, the resulting graph G remains planar (and bipartite). Since each $x \in X$ is in at most three triples, it is easy to see that G has bounded degree 4.

FIG. 1.


Reduction from planar 3dm (see Definition 2): every $x \in X$ corresponds to a node with weight δ(x) = −1, whereas every $T \in \mathcal{T}$ corresponds to a node with weight δ(T) = 3. There is an edge between nodes $T \in \mathcal{T}$ and $x \in X$ if and only if $x \in T$. In addition to every $T \in \mathcal{T}$, a path ($t_1^T, t_2^T, t_3^T$) is attached with weights $\delta(t_1^T) = \delta(t_2^T) = \varepsilon$ and $\delta(t_3^T) = -3\varepsilon$.
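The construction can be sketched in a few lines. The sketch below builds the cgp instance from a small hypothetical 3DM instance and evaluates the partition induced by a perfect matching. It assumes tail weights ε, ε, and −3ε (so that a tail on its own has error ε and a triple node combined with its tail has error 3 − ε); planarity of the instance is not checked, and all names are illustrative.

```python
from fractions import Fraction

def build_reduction(triples, eps):
    """Return (adjacency, delta) of the cgp instance built from a 3DM instance."""
    adjacency, delta = {}, {}

    def add_edge(u, v):
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    for idx, triple in enumerate(triples):
        t = ("T", idx)
        delta[t] = 3
        for x in triple:
            delta[x] = -1
            add_edge(t, x)
        # Tail of T with assumed weights eps, eps, -3*eps (the tail sums to -eps).
        tail = [("tail", idx, j) for j in range(3)]
        for node, w in zip(tail, (eps, eps, -3 * eps)):
            delta[node] = w
        for u, v in zip([t] + tail, tail):
            add_edge(u, v)
    return adjacency, delta

def matching_partition(triples, matching):
    """Type 3 groups for matched triples; unmatched triple nodes join their tails."""
    groups = []
    for idx, triple in enumerate(triples):
        t = ("T", idx)
        tail = [("tail", idx, j) for j in range(3)]
        if idx in matching:
            groups += [[t, *triple], tail]
        else:
            groups.append([t] + tail)
    return groups

def total_error(partition, delta):
    return sum(abs(sum(delta[v] for v in g)) for g in partition)

# Hypothetical instance with m = 2; every element occurs in exactly two triples.
triples = [("a1", "b1", "c1"), ("a2", "b2", "c2"),
           ("a1", "b2", "c2"), ("a2", "b1", "c1")]
eps = Fraction(1, 2)
adjacency, delta = build_reduction(triples, eps)
partition = matching_partition(triples, matching={0, 1})
m, n = 2, len(triples)
```

For this instance the partition has total error mε + (n − m)(3 − ε) = 6, all groups have at most four nodes, and no node has degree larger than 4.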

Given a feasible partition to the cgp-instance, we say a group is of type i if it contains exactly i nodes from X and exactly one node from $\mathcal{T}$. Notice that, for i = 1, 2, 3, each type i group contributes error (3−i) by itself, and because it covers a $\mathcal{T}$-node and therefore leaves a tail-path, it contributes indirectly an extra error ε (the alternative of including one of the tail nodes into the group with the triple node does not decrease the sum of the two errors). A type 0 group consists of a $\mathcal{T}$-node only and therefore will be combined with its tail to yield an error of 3 − ε. Let yi denote the number of type i groups, $i = 0, 1, 2, 3$. Let y denote the number of X-vertices that form a group on their own. Then the feasible solution has total error

$$W = y + y_0(3 - \varepsilon) + \sum_{i=1}^{3} y_i(3 - i + \varepsilon). \qquad (1)$$

We show that there exists a perfect matching if and only if G admits a partition with total error

$$W \le m\varepsilon + (n - m)(3 - \varepsilon).$$

Suppose $\mathcal{T}' \subseteq \mathcal{T}$ is a perfect matching. For every triple $T_i \in \mathcal{T}'$, we create a type 3 group consisting of the corresponding vertex Ti in G and the three vertices corresponding to its three elements. Hence y3 = m. By the properties of the matching all X-vertices of G are now covered, and n − m triple-vertices of G remain uncovered. The latter necessarily form n − m type 0 groups: y0 = n − m. Insertion in Equation (1) yields W = mε + (n − m)(3 − ε).

Now assume that no perfect matching exists. First, note that in any optimal solution to the cgp-instance y = 0. Assume y > 0 and let $x \in X$ be such a vertex. Then every neighbor of x in $\mathcal{T}$ is contained in a group of type i, with i ≤ 2. Therefore, adding x to any such group would decrease the cost of the solution by at least 2(1−ε). Furthermore, every group that contains two nodes from $\mathcal{T}$ can be split into two groups without increasing the cost of the solution. Now, since there exists no perfect matching, we need m + c groups of type 1, 2, or 3, for some c ≥ 1, to cover all vertices in X. Using equations

$$y_1 + y_2 + y_3 = m + c \qquad (2)$$
$$y_1 + 2y_2 + 3y_3 = 3m \qquad (3)$$

we get

$$y_1 = y_3 - m + 2c \qquad (4)$$
$$y_2 = 2m - c - 2y_3 \qquad (5)$$

and the cost contributed to Equation (1) by type 1, 2, and 3 groups becomes equal to mε + c(3 + ε). Together with the remaining n − m − c groups of type 0, the total weight becomes

$$W = m\varepsilon + c(3 + \varepsilon) + (n - m - c)(3 - \varepsilon) = m\varepsilon + (n - m)(3 - \varepsilon) + 2c\varepsilon > m\varepsilon + (n - m)(3 - \varepsilon).$$

   ■

Using the same reduction but extending the tails to length k − 1 paths with ε weight on the internal vertices and −(k−1)ε weight on the leaf proves the problem to be hard for any k ≥ 4.

cgp with k = 2 can be solved by formulating a minimum-cost perfect-matching problem. Starting from G = (V, E), we assign a weight to the edges that is equal to the error that the pair of vertices will contribute if chosen as a group of the partition. For each vertex $v \in V$ we create a shadow vertex v′ with δ(v′) = 0. The weight on the edge {v, v′} is then |δ(v)|, the error if v is chosen as a single vertex group. Additionally, we insert an edge {u′, v′} of weight 0 if and only if $\{u, v\} \in E$. It is not difficult to see that a minimum-cost perfect-matching in this graph corresponds to an optimal partition, where an edge in the matching between a vertex and its shadow vertex signifies a single vertex group in the partition.
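The shadow-vertex construction can be sketched as follows. The sketch assumes that the zero-weight edges join the shadows u′ and v′ of adjacent vertices u and v, so that the shadows of a matched pair can be matched to each other at no cost. For brevity, the minimum-cost perfect matching is found by an exponential-time bitmask search, which is suitable only for tiny examples; a polynomial-time matching algorithm would be used in practice. All function names and the integer example weights are illustrative.

```python
from functools import lru_cache

def min_cost_perfect_matching(num_vertices, weight):
    """Exhaustive bitmask search; weight maps (i, j) with i < j to an edge cost."""
    full = (1 << num_vertices) - 1

    @lru_cache(maxsize=None)
    def solve(mask):
        if mask == full:
            return 0
        # Match the lowest-numbered vertex that is still unmatched.
        i = next(v for v in range(num_vertices) if not mask >> v & 1)
        best = float("inf")
        for j in range(i + 1, num_vertices):
            if not mask >> j & 1 and (i, j) in weight:
                best = min(best, weight[(i, j)] + solve(mask | 1 << i | 1 << j))
        return best

    return solve(0)

def cgp_k2(n, edges, delta):
    """Optimal total error for k = 2; vertex v has shadow vertex v + n."""
    weight = {}
    for u, v in edges:
        a, b = sorted((u, v))
        weight[(a, b)] = abs(delta[a] + delta[b])  # u and v form a group of size 2
        weight[(a + n, b + n)] = 0                 # their shadows pair up for free
    for v in range(n):
        weight[(v, v + n)] = abs(delta[v])         # v forms a group on its own
    return min_cost_perfect_matching(2 * n, weight)

# Hypothetical path 0 - 1 - 2 - 3 with integer node weights.
error = cgp_k2(4, [(0, 1), (1, 2), (2, 3)], [5, 3, -5, -3])
```

In this example the optimal partition is {0}, {1, 2}, {3} with total error 5 + 2 + 3 = 10.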

For k = 3 and for general, non-planar graphs, cgp is NP-hard by reduction from ordinary 3DM. Intriguingly, for planar graphs and k = 3 the complexity is still unknown.

3. Dynamic Programming for Bounded Treewidth

While problem cgp is NP-hard in general, as shown in the previous section, we can solve it by a dynamic program in polynomial time if the molecule graph is a tree. Starting from the leaves, we proceed toward an arbitrarily chosen root node. At a given node i we guess the group V′ that contains i in the optimal solution to the subproblem induced by the subtree rooted at i and recurse on the subtrees obtained when removing V′. Due to the size restriction |V′| ≤ k, we only have to consider a polynomial number of groups.
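A minimal sketch of this dynamic program on a rooted tree is given below. It enumerates, for each node, all connected groups of at most k nodes that contain the node and lie in its subtree, and recurses on the subtrees that are cut off. The representation (a dictionary of child lists), the function names, and the integer example weights are assumptions made for illustration; this is not the implementation evaluated in Section 4.

```python
from functools import lru_cache

def tree_cgp(children, delta, root, k):
    """Optimal total error for a rooted tree; children maps node -> list of children."""

    def groups(v, budget):
        """Connected sets containing v inside v's subtree with at most budget nodes.

        Returns pairs (members, cut_roots), where cut_roots are the roots of the
        subtrees that remain once the group has been removed."""
        partial = [((v,), ())]
        for c in children[v]:
            extended = []
            for members, cuts in partial:
                extended.append((members, cuts + (c,)))  # leave c outside the group
                room = budget - len(members)
                if room > 0:                             # or absorb a piece rooted at c
                    for sub_members, sub_cuts in groups(c, room):
                        extended.append((members + sub_members, cuts + sub_cuts))
            partial = extended
        return partial

    @lru_cache(maxsize=None)
    def best(v):
        # Guess the group containing v, then solve the remaining subtrees.
        return min(
            abs(sum(delta[u] for u in members)) + sum(best(c) for c in cuts)
            for members, cuts in groups(v, k)
        )

    return best(root)

# Hypothetical path 0 - 1 - 2 - 3 rooted at node 0, with integer node weights.
children = {0: [1], 1: [2], 2: [3], 3: []}
delta = [5, 3, -5, -3]
errors = {k: tree_cgp(children, delta, 0, k) for k in (1, 2, 3, 4)}
```

On this example the optimal total error decreases from 16 (k = 1) over 10 (k = 2) and 6 (k = 3) to 0 (k = 4), illustrating the trade-off between group size and residual charge.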

Although the structural formula of biomolecules is not always a tree, as we will see later, it is usually still treelike, which has already been exploited in Dehof et al. (2011). Formally, this property is captured by the treewidth of a graph (Robertson and Seymour, 1986). The definition is as follows.

Definition 3

A tree decomposition (T, X) of a graph G = (V, E) consists of a tree T and sets Xi for all $i \in V(T)$, called bags, satisfying the three following properties:

  1. Every vertex in G is associated with at least one node in T, that is, $\bigcup_{i \in V(T)} X_i = V$.

  2. For every edge $\{u, v\} \in E$, there is an $i \in V(T)$ such that {u, v} ⊆ Xi.

  3. The nodes in T associated with any vertex in G define a subtree of T.

The width of a tree decomposition is $\max_i |X_i| - 1$. The treewidth of G is the minimum width of any tree decomposition of G.
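The three properties of Definition 3 can be checked mechanically. The following sketch verifies them for a given candidate decomposition and computes its width; it assumes that the supplied `tree_edges` indeed form a tree (this is not verified), and all names are illustrative.

```python
def is_tree_decomposition(vertices, edges, tree_edges, bags):
    """bags maps each tree node to a set of graph vertices; tree_edges are the edges of T."""
    # Property 1: every vertex appears in some bag.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Property 2: every edge is contained in some bag.
    if not all(any({u, v} <= bag for bag in bags.values()) for u, v in edges):
        return False
    # Property 3: the tree nodes whose bags contain a vertex form a connected subtree.
    for v in vertices:
        holders = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for a, b in tree_edges:
                for src, dst in ((a, b), (b, a)):
                    if src == i and dst in holders and dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
        if seen != holders:
            return False
    return True

def width(bags):
    """Size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1

# A 4-cycle 0 - 1 - 2 - 3 - 0 and two candidate decompositions.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
valid_bags = {0: {0, 1, 2}, 1: {0, 2, 3}}
invalid_bags = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}
```

The first candidate is a tree decomposition of width 2. The second violates the third property, because the two bags containing vertex 0 are not adjacent in the path-shaped tree.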

In this section, we propose a tree decomposition-based dynamic program for problem cgp, whose running time grows exponentially with the treewidth of G. Therefore, a tree decomposition of small width is crucial for the efficiency of our approach. Unfortunately, computing a tree decomposition of minimum width is NP-hard (Arnborg et al., 1987). However, for the class of r-outerplanar graphs, an optimal tree decomposition can be determined in time $O(r \cdot n)$ (Alber et al., 2005). A graph is r-outerplanar if, after removing all vertices on the boundary face, the remaining graph is (r − 1)-outerplanar. A graph is 1-outerplanar if it is outerplanar, that is, if it admits a crossing-free embedding in the plane such that all vertices are on the same face. Interestingly enough, most molecule graphs of biomolecules are r-outerplanar for some small integer r. For example, Horváth et al. (2010) have observed that 94.3% of the molecules in the National Cancer Institute (NCI) database (http://cactus.nci.nih.gov/) are 1-outerplanar. Even more, every r-outerplanar graph has treewidth at most 3r − 1 (Bodlaender, 1998). Therefore, not surprisingly, Yamaguchi et al. (2003) observed that out of 9,712 chemical compounds in the kegg ligand database (Goto et al., 2002), all but one had treewidth between 1 and 3, with a single molecule having treewidth 4. In fact, among the molecules considered here, the maximal treewidth was 2. As a result, our tree decomposition-based dynamic program found an optimal charge group partitioning in well under one second.

Let (T, X) be a tree decomposition of width ℓ for graph G = (V, E). The high-level idea of the algorithm is as follows (Fig. 2). For an arbitrarily chosen root i of the tree decomposition, we guess the groups that intersect Xi, denoted by the dashed lines in the figure. After removing these groups, G falls apart into connected components, denoted by the filled regions in the figure. By the properties of a tree decomposition, these connected components will correspond one-to-one to the subtrees of the tree decomposition obtained by removing bags that became empty. Recursing on the roots of these new subtrees yields the overall optimal solution.

FIG. 2.


Illustration of the tree decomposition-based dynamic programming algorithm. A graph G falls apart into connected components (gray regions) by removing the groups (dashed lines) that intersect bag Xi.

Without loss of generality we assume that T has at most n := |V| vertices and depth $O(\log n)$ (Bodlaender, 1989), with r being the root of T. In the following, we let $V_i = \bigcup_{j \in V(T_i)} X_j$, where Ti denotes the subtree rooted at i, and write V(Ti) for the set of nodes in Ti. We define an extension of a partition of a vertex set V1 ⊆ V with nodes in V2\V1 into connected subgraphs of G of size at most k:

Definition 4

For vertex sets V1 ⊆ V2 ⊆ V, set Inline graphic contains all sets Inline graphic with Inline graphic, all sets in Inline graphic being disjoint, and all Inline graphic satisfying: (i) G[V′] is connected, (ii) |V′| ≤ k, and (iii) V1 ∩ V′ ≠ ∅.

Furthermore, by r(S) we denote the root of a subtree S of T, and for any node i in T and any vertex set A ⊆ V we denote by Inline graphic the set of trees, corresponding to the connected components of Inline graphic whose roots are not a descendant of another subtree in Inline graphic (i.e., there are no Inline graphic for which Inline graphic). With a slight abuse of notation, for sets A ⊆ V and Inline graphic we will write Inline graphic instead of Inline graphic, when the meaning is clear from the context. Then for any node i of T and any subset A ⊆ V, the cost of an optimal solution to cgp on graph G[Vi\A], denoted by cgp(i, A), can be described by the recurrence

graphic file with name M82.gif (6)

which also holds in the base case where Inline graphic, in particular when i is a leaf of T. The optimal partition has cost cgp(r, ∅). We can solve the recurrence relation (Eq. 6) using dynamic programming.

Theorem 2

The cost of an optimal solution to cgp on a graph of treewidth ℓ and maximum degree d can be computed in time Inline graphic.

Proof

Let (T, X) be a tree decomposition of G of width ℓ and depth $O(\log n)$. Consider an arbitrary node i in T and a subset A ⊆ V, for which Xi\A ≠ ∅. We first observe that

graphic file with name M86.gif (7)

Indeed, for each partition $\{Y_1, \ldots, Y_h\}$ of Xi\A, the number of possible extensions in Inline graphic can be bounded as follows. For $j = 1, \ldots, h$, let Bj be the set of vertices at distance at most k − 1 from Yj in the graph Gj = G[Vi\(A ∪ Xi) ∪ Yj] (this set can be found by contracting Yj to a single vertex yj and performing BFS in Gj starting from yj). Each possible extension is then given by a family of pairwise-disjoint sets $\{Z_1, \ldots, Z_h\}$, where Zj ⊆ Bj, G[Zj ∪ Yj] is connected and |Yj ∪ Zj| ≤ k. Since the degree of each vertex is at most d, it follows that $|B_j| \le |Y_j| d^{k-1}$. Consequently, the total number of choices of sets Zj is at most $(\ell + 1) e^k d^{2k-1}/(k(d-1))$ (and all these choices can be enumerated in time Inline graphic and space Inline graphic) (see Uehara, 1999). Since h ≤ ℓ + 1, the overall number of choices we consider is bounded by Equation (7).

Since every Inline graphic considered in Equation (6) intersects Xi\A (requirement 3), and due to the properties of a tree decomposition and the connectivity of all parts Inline graphic (in G), the induced subgraph Inline graphic, for all Inline graphic, is a subtree of Ti rooted at i. Keeping this crucial observation in mind, let us focus our attention on a particular node i in T, and bound the number of sets A that we need to consider on the left-hand side of Equation (6). To this end, it is convenient to consider the computation tree T for Equation (6) (that is, the recursion tree obtained when solving Eq. (6)) in a top-down fashion. We can label each node in this tree by (j, A), where j is a node in T and A is a subset of V. The root of T is (r, ∅), and the children of node (j, A) are labeled by the elements of the set Inline graphic.

Consider node (i, A) in T, and let Inline graphic be its ancestors. It is clear that every vertex Inline graphic belongs to exactly one connected component (group) V′ that originated at some ancestor Inline graphic; we say, in this case, that ancestor (jr, Ar) contributes to (i, A). Since Xi\A ≠ ∅ [by our assumption that (i, A) appears in the computation tree], it follows by our observation above that the number of ancestors that contribute to (i, A) is at most ℓ (since each such ancestor contributes at least one component that has a nonempty intersection with Xi). In other words, A can be partitioned into at most ℓ parts, such that each part belongs to a connected component that originated at some ancestor of (i, A), and hence, |A| ≤ kℓ. The number of choices for the contributing ancestors is at most depth(T). Using an argument similar to the one used to derive Equation (7), we can conclude that for each vertex v in one of the chosen ancestors, the number of connected components originating at v is at most $e^k d^{2k-1}/(k(d-1))$, and thus we obtain Inline graphic for the total number of choices for A. For each such choice we have to evaluate a number of sets Inline graphic bounded by Equation (7), whose properties (i)–(iii) can be verified in time Inline graphic. Determining the roots of subtrees in Inline graphic takes time Inline graphic.   ■

Additionally storing, along with each entry cgp(i, A), the partition Inline graphic minimizing the right-hand side in Equation (6), allows us to finally reconstruct a charge group partition that gives the optimal cost.

4. Experimental Evaluation

We implemented the dynamic programming method for bounded treewidth in C++ using the lemon graph library (http://lemon.cs.elte.hu). We used libtw (www.treewidth.com/) to obtain bounded treewidth decompositions of the input molecules. In our implementation, we solve the dynamic programming recurrence Equation (6) in a top-down fashion by employing memoization.

4.1. Hydration-free energy of amino acid side chains

We tested the quality of charge group assignments by comparing the calculated free energies of solvation in water of a set of 14 charge-neutral amino acid side chain analogs to experimental values, which are denoted by ΔGhyd,exp (Gerber, 1998; Oostenbrink et al., 2004). For each analog, we used the gromos 53A6 covalent and van der Waals parameters (Oostenbrink et al., 2004) and partial atomic charges symmetrized by the atb (Malde et al., 2011). A united-atom representation is used for aliphatic carbon groups. For comparison, we also include the manually parametrized solution that the gromos 53A6 force field provides (Oostenbrink et al., 2004). The topologies are derived from the amino acid structures by truncating at the Cα–Cβ bond. For simplicity, we refer to these analogs by their parent amino acid.

Using the gromacs 4.5.1 package (Berendsen et al., 1995), we computed the free energy of hydration ΔGhyd,calc using the thermodynamic integration method (Beveridge and DiCapua, 1989). A series of simulations were performed at a constant pressure of p = 1 bar and a constant temperature T = 298.15 K. The free energy was calculated for the process A → B, which involved switching off all nonbonded interactions of the solute in water and in the gas phase. The hydration-free energy is calculated as ΔGhyd,calc = ΔGAB,solution−ΔGAB,gas (Villa and Mark, 2002). The simulations were performed in cubic periodic boxes of length L ≈ 3 nm. Depending on the analog, the solvated system contained approximately 900 SPC (Villa and Mark, 2002) water molecules.

As described in the introduction, neutral charge groups lead to more accurate simulation results. In our problem definition, we aim to identify a charge group assignment in which the constituent charge groups have small residual error, which is the absolute difference between the sum of the formal charges and the sum of the partial charges of the atoms in the charge group. To ensure neutral charge groups where possible, we adjust the partial charges slightly by redistributing the residual error of every charge group over its atoms.
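The redistribution step can be sketched as follows. The text does not specify how the residual error is spread over the atoms of a group; the sketch assumes a uniform distribution, and the atom names and charge values are hypothetical.

```python
def neutralize_groups(partition, partial, formal):
    """Shift the partial charges so that each group sums to its formal charge.

    Assumption: the residual of a group is divided equally among its atoms."""
    adjusted = dict(partial)
    for group in partition:
        residual = sum(formal[a] for a in group) - sum(partial[a] for a in group)
        share = residual / len(group)
        for a in group:
            adjusted[a] = partial[a] + share
    return adjusted

# Hypothetical charge group of three atoms with formal charge zero.
partial = {"C1": 0.15, "O1": -0.60, "H1": 0.40}
formal = {"C1": 0.0, "O1": 0.0, "H1": 0.0}
adjusted = neutralize_groups([["C1", "O1", "H1"]], partial, formal)
```

After the adjustment, the partial charges of the group sum to its formal charge (up to floating-point rounding), while the charge differences between atoms within the group are unchanged.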

The results are presented in Table 1 and Figure 3. The gromos 53A6 simulation results (ffG53A6 in Table 1) for the studied analogs show good agreement with experiment, which is not surprising as the force field has been parametrized to reproduce the hydration-free energy (Oostenbrink et al., 2004). Using the atb charge group assignment solution (ATB in Table 1) leads to slightly larger deviations from experiment, but the average deviation is also within the experimental error of approximately 5 kJ/mol (Malde et al., 2011). Although the current method leads to values close to those obtained experimentally, they deviate slightly more from experiment than the atb values.

Table 1.

Comparison of Hydration-free Energies ΔGhyd of Amino Acid (AA) Analogs

 
 
| AA analog | ΔGhyd,exp | ΔGhyd,calc (ffG53A6) | ΔGhyd,calc (ATB) | ΔGhyd,calc (k = 5) |
| --- | --- | --- | --- | --- |
| Asn | −40.6 | −42.7 (2.1) | −40.5 (0.1) | −47.0 (6.4) |
| Asp | −28.0 | −30.1 (2.1) | −29.1 (1.1) | −28.6 (0.6) |
| Cys | −5.2 | −4.9 (0.3) | −7.0 (1.8) | −7.1 (1.9) |
| Gln | −39.4 | −40.4 (1.0) | −35.9 (3.5) | −35.9 (3.5) |
| Glu | −27.0 | −27.0 (0.0) | −28.2 (1.2) | −32.1 (5.1) |
| His | −42.9 | −44.8 (1.9) | −43.7 (0.8) | −40.9 (2.0) |
| Ile | 8.7; 8.8 | 9.1 (0.3) | 6.3 (2.5) | 6.7 (2.1) |
| Leu | 9.4; 9.7 | 10.8 (1.2) | 7.4 (2.2) | 7.1 (2.5) |
| Lys | −18.3 | −18.1 (0.2) | −7.2 (11.1) | −7.2 (11.1) |
| Met | −6.2 | −7.4 (1.2) | 2.5 (8.7) | 2.6 (8.8) |
| Phe | −3.1 | −1.3 (1.8) | 1.8 (4.9) | 0.6 (3.7) |
| Trp | −24.7 | −25.9 (1.2) | −20.9 (3.8) | −19.7 (5.0) |
| Tyr | −26.6 | −26.9 (0.3) | −30.1 (3.5) | −39.5 (12.9) |
| Val | 8.2 | 8.5 (0.3) | 8.0 (0.2) | 8.0 (0.2) |
| Average | | (1.1) | (3.2) | (4.7) |

All free-energy values are given in kJ/mol. When two values separated by a semicolon are given, two experimental values were found. The absolute free-energy differences between simulation outcomes and the experimental values are given in parentheses. The average values of these differences are given in the bottom line. “ffG53A6” denotes results using the default gromos force field parameters for the analog, “ATB” denotes those using the atb charge group assignment, “k = 5” denotes those using our method. We performed a two-tailed paired Student's t-test between the distributions given in column 6 (ATB) and column 8 (k = 5) resulting in a p-value of 0.2867. The difference in hydration-free energy differences is thus not statistically significant.
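As a small arithmetic illustration, the averages in the bottom line for the ATB and k = 5 columns can be recomputed from the absolute deviations given in parentheses in Table 1. The lists below are transcribed from the table in row order (Asn to Val); the variable names are illustrative.

```python
# Absolute deviations from experiment (kJ/mol), transcribed from Table 1.
atb_dev = [0.1, 1.1, 1.8, 3.5, 1.2, 0.8, 2.5, 2.2, 11.1, 8.7, 4.9, 3.8, 3.5, 0.2]
k5_dev = [6.4, 0.6, 1.9, 3.5, 5.1, 2.0, 2.1, 2.5, 11.1, 8.8, 3.7, 5.0, 12.9, 0.2]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

atb_average = round(mean(atb_dev), 1)
k5_average = round(mean(k5_dev), 1)
```

Both lists contain one entry per amino acid analog, and the rounded means agree with the averages reported in the table.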

FIG. 3.


Calculated ΔGhyd values versus experimental ones, showing the effect of the charge group assignment on the simulated hydration-free energy. The labels in the legend are the same as in Table 1. The solid line represents perfect agreement with experiment, dotted lines indicate the ±5 kJ/mol approximate experimental error.

4.2. Adenosine tri-phosphate

Although showing good performance on the amino acid side chains, the atb method may lead to unacceptably large charge groups, in particular for large highly charged molecules. An example is the cofactor ATP, for which the atb combined all phosphate groups and part of the ribose and nucleotide ring systems into a single charge group (Fig. 4c). In Figure 4b, the gromos 53A6 charge group assignment is given. For comparison, our solution is presented in Figure 4a and shows that the phosphate groups have been placed in separate charge groups, in agreement with the 53A6 assignment and in line with chemical intuition, where one expects functional groups such as phosphate, amino, and hydroxyl moieties to form separate charge groups.

FIG. 4.


Charge group assignments for adenosine tri-phosphate (ATP) at pH 5.0. The total molecular charge is −3. The partial charges are shown in gray. (a) Our optimal assignment according to Definition 1, obtained with k = 5, (b) gromos 53A6 assignment, and (c) assignment by the atb. Note that the C–H segments indicated by the rounded boxes are considered as single atom types in the gromos assignment, whereas they comprise two atoms in the other assignments.

5. Discussion

In this work, we have formally introduced the charge group partitioning problem that arises in the development of atomic force fields, and more generally, in the identification of functional groups in molecules. The problem is to assign atoms to charge groups of size at most k and such that for every charge group the sum of its partial charges is close to the sum of its formal charges. We showed NP-hardness for k ≥ 4 and proposed and implemented an exact algorithm capable of solving practical problem instances to provable optimality. With this combination of rigorous definition and exact solution approach, we have made a first step toward formalizing and quantifying some of the aspects that make up “chemical intuition”.

Algorithmically, we showed that the case k = 2 is solvable in polynomial time. In addition, we have presented a polynomial-time algorithm for bounded charge group size in cases where the molecular graph is a tree. Based on the observation that molecular graphs have bounded treewidth in practice and exploiting further properties such as outerplanarity and bounded degree, we developed a practical dynamic programming algorithm, which is based on a tree decomposition of the graph corresponding to the chemical structure of interest. An interesting open question is to settle the complexity status for the case k = 3.

Since our method relies on point charges obtained from quantum mechanical calculations, the quality of charge group assignments and subsequently of simulation outcomes depends on the accuracy of these calculations. However, our experiments have shown that taking into account charge group size and neutrality already gives good results, especially for large highly charged molecules such as ATP, where other methods fail to produce meaningful solutions. Still, the greedy partitioning algorithm built into the atb performs better on the set of smaller amino acid side chain molecules, which is due to the fact that this method exploits additional chemical knowledge. It is thus able, for instance, to deal with a symmetric molecule such as the tyrosine side chain, where the charge group assignment of our new method resulted in a large deviation because we do not consider symmetry in our problem definition. We will, therefore, investigate how to incorporate symmetry into our approach, which is not trivial as symmetry may interfere with the optimal substructures required by the dynamic program. In addition to symmetry, we plan to integrate other aspects of chemical intuition. For example, we will investigate the effect of bounding the error per charge group. Additionally, we plan to integrate constraints that take spatial geometry into account rather than using the number of atoms as a measure for charge group size. We would like to stress that only through a proper problem definition, together with a method capable of obtaining provably optimal solutions, can one make progress in answering the question of what a good charge group partition should look like.

Acknowledgments

We thank SARA Computing and Networking Services (www.sara.nl) for their support in using the Lisa Compute Cluster. In addition, we are grateful to the referees for helpful comments. The research leading to these results has received support from the Tinbergen Institute as well as from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115002 (eTOX), resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007–2013) and EFPIA companies' in-kind contribution.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

  1. Alber J. Dorn F. Niedermeier R. Experimental evaluation of a tree decomposition-based algorithm for vertex cover on planar graphs. Discrete Appl. Math. 2005;145:219–231.
  2. Allen M. Tildesley D. Computer Simulation of Liquids. Oxford University Press; New York: 1987.
  3. Arnborg S. Corneil D.G. Proskurowski A. Complexity of finding embeddings in a k-tree. SIAM J. Algebra. Discr. 1987;8:227–284.
  4. Berendsen H.J.C. van der Spoel D. Van Drunen R. GROMACS: a message-passing parallel molecular dynamics implementation. Com. Phys. Comm. 1995;91:43–56.
  5. Beveridge D.L. DiCapua F.M. Free energy via molecular simulation: applications to chemical and biomolecular systems. Annu. Rev. Biophys. Biophys. Chem. 1989;18:431–492. doi: 10.1146/annurev.bb.18.060189.002243.
  6. Bodlaender H.L. NC-algorithms for graphs with small treewidth. In: van Leeuwen J., editor. Lecture Notes in Computer Science; Proc. 14th International Workshop on Graph-Theoretic Concepts in Computer Science (WG 1988); 1989. pp. 1–10.
  7. Bodlaender H.L. A partial k-arboretum of graphs with bounded treewidth. Theor. Comput. Sci. 1998;209:1–45.
  8. Boggara M.B.B. Faraone A. Krishnamoorti R. Effect of pH and ibuprofen on the phospholipid bilayer bending modulus. J. Phys. Chem. B. 2010;114:8061–8066. doi: 10.1021/jp100494n.
  9. Brooks B.R. Brooks C.L., III Mackerell A.D., Jr., et al. CHARMM: The Biomolecular Simulation Program. J. Comput. Chem. 2009;30(10, Sp. Iss. SI):1545–1614. doi: 10.1002/jcc.21287.
  10. Cornell W.D. Cieplak P. Bayly C.I., et al. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 1995;117:5179–5197.
  11. Dehof A. Rurainski A. Bui Q.B.A., et al. Automated bond order assignment as an optimization problem. Bioinformatics. 2011;27:619–625. doi: 10.1093/bioinformatics/btq718.
  12. Dyer M. Frieze A. On the complexity of partitioning graphs into connected subgraphs. Discrete Appl. Math. 1985;10:139–153.
  13. Gerber P.R. Charge distribution from a simple molecular orbital type calculation and non-bonding interaction terms in the force field mab. J. Comput.-Aided Mol. Des. 1998;12:37–51. doi: 10.1023/a:1007902804814.
  14. Goto S. Okuno Y. Hattori M., et al. LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res. 2002;30:402–404. doi: 10.1093/nar/30.1.402.
  15. Horváth T. Ramon J. Wrobel S. Frequent subgraph mining in outer planar graphs. Data Min. Knowl. Discov. 2010;21:472–508.
  16. Jorgensen W.L. Maxwell D.S. Tirado-Rives J. Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 1996;118:11225–11236.
  17. Lemkul J.A. Allen W.J. Bevan D.R. Practical considerations for building GROMOS-compatible small-molecule topologies. J. Chem. Inf. Model. 2010;50:2221–2235. doi: 10.1021/ci100335w.
  18. Malde A.K. Zuo L. Breeze M., et al. An automated force field topology builder (ATB) and repository: version 1.0. J. Chem. Theory Comput. 2011;7:4026–4037. doi: 10.1021/ct200196m.
  19. Oostenbrink C. Villa A. Mark A.E. van Gunsteren W.F. A biomolecular force field based on the free enthalpy of hydration and solvation: The GROMOS force-field parameter sets 53A5 and 53A6. J. Comp. Chem. 2004;25:1656–1676. doi: 10.1002/jcc.20090.
  20. Robertson N. Seymour P.D. Graph minors. II. Algorithmic aspects of tree-width. J. Algorithms. 1986;7:309–322.
  21. Schmid N. Eichenberger A. Choutko A., et al. Definition and testing of the GROMOS force-field versions 54A7 and 54B7. Eur. Biophys. J. 2011;40:843–856. doi: 10.1007/s00249-011-0700-9.
  22. Schüttelkopf A.W. van Aalten D.M. PRODRG: a tool for high-throughput crystallography of protein-ligand complexes. Acta Crystallogr. 2004;60:1355–1363. doi: 10.1107/S0907444904011679.
  23. Scott W.R.P. Hünenberger P.H. Tironi I.G., et al. The GROMOS biomolecular simulation program package. J. Phys. Chem. A. 1999;103:3596–3607.
  24. Sharma M. Khanna S. Bulusu G. Mitra A. Comparative modeling of thioredoxin glutathione reductase from Schistosoma mansoni: a multifunctional target for antischistosomal therapy. J. Mol. Graphics Model. 2009;27:665–675. doi: 10.1016/j.jmgm.2008.10.009.
  25. Uehara R. The number of connected components in graphs and its applications. IEICE Technical Report COMP99-10, Natural Science Faculty, Komazawa University; Japan: 1999.
  26. van Gunsteren W.F. Bakowies D. Baron R., et al. Biomolecular modeling: Goals, problems, perspectives. Angew. Chem. Int. Ed. 2006;45:4064–4092. doi: 10.1002/anie.200502655.
  27. Villa A. Mark A.E. Calculation of the free energy of solvation for neutral analogs of amino acid side chains. J. Comp. Chem. 2002;23:548–553. doi: 10.1002/jcc.10052.
  28. Yamaguchi A. Aoki K.F. Mamitsuka H. Graph complexity of chemical compounds in biological pathways. Genome Inform. 2003;14:376–377.
  29. Yang C. Zhu X. Li J. Shi R. Exploration of the mechanism for LPFFD inhibiting the formation of beta-sheet conformation of A beta(1-42) in water. J. Mol. Model. 2010;16:813–821. doi: 10.1007/s00894-009-0594-y.

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
