The Journal of Chemical Physics. 2013 Dec 4;139(21):214111. doi: 10.1063/1.4832895

Proximal distributions from angular correlations: A measure of the onset of coarse-graining

Kippi M Dyer 1, B Montgomery Pettitt 1,a)
PMCID: PMC3869852  PMID: 24320368

Abstract

In this work we examine and extend the theory of proximal radial distribution functions for molecules in solution. We point out two formal extensions, the first of which generalizes the proximal distribution function hierarchy approach to the complete, angularly dependent molecular pair distribution function. Second, we generalize from the traditional right-handed solute-solvent proximal distribution functions to the left-handed distributions. The resulting neighbor hierarchy convergence is shown to provide a measure of the coarse-graining of the internal solute sites with respect to the solvent. Simulation of the test case of a deca-alanine peptide shows that this coarse-graining measure converges at a length scale of approximately 5 amino acids for the system considered.

INTRODUCTION

Since their introduction,1, 2, 3 so-called proximal distribution functions have been utilized to reduce the complexity of the molecular pair distribution functions of macromolecular systems3 to a set of conditional pair functions. In principle, the proximal distributions admit an approximate connection to the more general pair correlations of complex fluids while using a simpler, more easily understood description of the liquid structure around solutes, especially convenient for macromolecules.2

The proximal functions are based on a conditional property of the standard pair distribution functions in the liquid state. They are defined within a decomposition of the averages which build up the radial pair distribution functions via a sum over a hierarchy of neighbors, such that the radial distribution function is a direct sum over the nearest neighbor, next-nearest neighbor, etc., conditional probability distributions. The decomposition is exact, and finite where the number of atomic (or otherwise) sites within a given pair of molecules is finite. Taken as a whole, the neighbor hierarchy is an alternate, and equivalent, mathematical description of the radial distribution functions of the fluid.2

The existence of a series decomposition of the fundamental structural description of a fluid system leads naturally to the question of whether or how the series can be truncated to construct a reduced description of a fluid system which still accurately measures the thermodynamics. To this point, it has been shown4, 5, 6, 7, 8 that the truncation of the neighbor hierarchy of the average solvent distribution around a solute at the nearest neighbor approximation, with suitable renormalization, is a reasonable approximation for the density distribution and thus the electrostatic densities and energies of small biologically interesting molecules. The resulting electrostatic free energies of solvation of the approximate density distributions computed with linear response are within a chemically acceptable error and generally better than continuum solvent approaches.7

In this paper we consider the generalization of proximal distributions to full angular dependence. This leads us to address two questions. First, at the level of the radial distribution functions, we consider the convergence of solute length scales as measured by the neighbor hierarchy. We demonstrate that series convergence is slow for molecules of the same size, where the potential functions act over distances for which more than one element of the neighbor hierarchy is meaningful to the immediate structure. For the van der Waals interactions of molecules of comparable size, many members of the hierarchy are essential to the complete description of the system energy. As we demonstrate, this is due to the fact that many of the terms in the hierarchy of the neighbor decomposition of the radial distribution functions are of the same relative magnitude for similar size molecules, and thus, the hierarchy is instead more usefully considered to be a finite basis set expansion of the radial distribution function in which all members of the hierarchy may contribute to the sum. We and others8 have successfully used truncation of one of the series for molecules which are very different in size. Here we use a slightly different set of functions than have previously been studied3, 4, 6, 7, 8 to measure coarse-graining for large solutes with respect to the smaller solvent molecule interactions.

Previously the proximal functions have been used only as a decomposition of the radial distribution of solvent around a solute.3 They are, however, not limited to the angle-averaged radial distribution functions, and can be generalized to apply to the complete pair distribution functions, including both distance and angular correlations between molecular pairs. Once this generalization is made, then, the question we address is whether and how the complete set of proximal functions can be utilized to describe the structure and thermodynamics of solvation of a complex molecule in a way which keeps the spirit of the simpler structure offered by the original proximal distribution function methods.

We apply the current method to consider the aqueous solvation of deca-alanine. We wish to consider the general construction of the proximal distributions. We conclude with the idea of using the proximal functions as physical or environmental criteria measures for coarse graining of all-atom potentials for certain purposes.

THEORY

Here, we review the proximal distribution function theory2, 3 and present a generalization of the theory for the full angular-dependent molecular pair distribution functions.

Consider a fluid system of molecules whose interaction potential is defined by a site-site model, such that, for a given pair of molecules labeled 1, 2, the total potential U(1, 2) between the molecules is

$$U(1,2) = \sum_{i=1}^{m} \sum_{j=1}^{n} u_{ij}(r_{ij}), \tag{1}$$

where rij is the distance between site i on molecule 1 and site j on molecule 2, m is the number of sites on molecule 1, n is the number of sites on molecule 2, and uij is the potential between the sites i, j. Here and throughout, we assume that uij(rij) is a radially symmetric function of the scalar distance between two sites, and that the number of sites m, n are finite. For most of this work, these assumptions do not represent a loss of generality, though we will point out where they are a necessary condition of certain results. Finally, the site-site notation is usually simplified9 from uij(rij) to uij(r) where it is understood that the scalar distance r implicitly refers to the distance between sites i and j. We will follow this convention, except where there is a possibility for confusion.
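As an illustration of Eq. 1, the following minimal sketch evaluates the double sum over site pairs for two small rigid molecules. The function name `site_site_energy`, the toy coordinates, and the bare 1/r pair term are assumptions made for the example and are not taken from the models used in this work.

```python
import numpy as np

def site_site_energy(sites1, sites2, pair_potential):
    """Eq. (1): sum u_ij(r_ij) over all m x n site pairs of two molecules.

    sites1: (m, 3) coordinates of molecule 1; sites2: (n, 3) coordinates of molecule 2.
    pair_potential(i, j, r) returns u_ij at the scalar distance r.
    """
    # r[i, j] = distance between site i of molecule 1 and site j of molecule 2
    diff = sites1[:, None, :] - sites2[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    m, n = r.shape
    return sum(pair_potential(i, j, r[i, j]) for i in range(m) for j in range(n))

# Hypothetical toy input: two 2-site molecules and an illustrative 1/r pair term.
mol1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mol2 = np.array([[0.0, 3.0, 0.0], [1.0, 3.0, 0.0]])
energy = site_site_energy(mol1, mol2, lambda i, j, r: 1.0 / r)
```

Because the pair term depends only on the scalar distance, the whole calculation needs nothing beyond the m × n distance matrix, which is the object that the proximal orderings below act on.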

If we have N such molecules at a given thermodynamic phase point, then one of the standard descriptions of the structure of the fluid is given by the set of site-site pair distribution functions,10

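$$\rho_i\, g_{ij}(\mathbf{r})\, \rho_j = \frac{1}{V} \left\langle N(N-1)\, \delta(\mathbf{r}_j - \mathbf{r}_i - \mathbf{r}) \right\rangle, \tag{2}$$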

where the angle brackets here represent the ensemble average, N is the number of molecules, V is the volume, the δ(rj − ri − r) functions are the operators restricting the average to the vector rij between site i on molecule 1 and site j on molecule 2, ρi is the total number density of site i, and we assume a homogeneous, non-reactive system where the site density ρi = ρ1, with ρ1 the total number density of molecular species 1, and similarly for ρj for species 2. With this standard definition, for cases where the site-site potential terms uij(r) are radially symmetric functions of the distance between sites alone and there are no external fields, the r-vector dependence of the gij(r) functions reduces to the radially symmetric case, so that

$$4\pi r^{2}\, g_{ij}(r)\, \rho_j\, dr \tag{3}$$

is the probability of finding site j of molecule 2 at a distance between r and r + dr, given that site i of molecule 1 is at the origin.
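To make the meaning of Eq. 3 concrete, the sketch below integrates 4πr²g(r)ρj up to a cutoff to obtain the average number of j sites within that distance of site i. The function name `site_count_within` and the ideal-gas test function g(r) = 1 are assumptions for the example; the density is the pure-water value quoted in the Results.

```python
import numpy as np

def site_count_within(r, g, rho_j, r_max):
    """Integrate 4*pi*r^2 * g(r) * rho_j from r[0] to r_max with the trapezoid rule."""
    mask = r <= r_max
    rr = r[mask]
    integrand = 4.0 * np.pi * rr ** 2 * g[mask] * rho_j
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(rr)))

# Illustrative check with an uncorrelated (ideal gas) distribution, g(r) = 1.
r = np.linspace(0.0, 5.0, 5001)          # distances in angstrom
g_ideal = np.ones_like(r)
count = site_count_within(r, g_ideal, rho_j=0.03346, r_max=5.0)
```

For g(r) = 1 the integral reduces to the density times the volume of the sphere, (4/3)πr³ρj, which gives a simple sanity check on the numerical quadrature.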

For thermodynamic quantities whose operators are functions of only the distance between sites, such that

$$O = \left\langle \hat{O}(1,2) \right\rangle = \left\langle \sum_{ij} \hat{o}_{ij}(r) \right\rangle \tag{4}$$

for O the thermodynamic average of the molecular operators Ô and site-site operators ô, such as the excess energy, the radial distribution functions gij(r) are a complete description of the thermodynamics of interaction of the molecular system, since in this case

$$O = \sum_{ij} \int \hat{o}(r_{ij})\, g_{ij}(r)\, d\mathbf{r}. \tag{5}$$

However, for many important cases, the radial site-site functions must be generalized to take into account the orientational correlations between molecules.11

The potential between pairs of arbitrary molecules with fixed intramolecular distributions is usually written as11

$$U(1,2) = U(\mathbf{r}, \Omega_1, \Omega_2), \tag{6}$$

where r is the vector between arbitrary points in molecule 1 and molecule 2, Ω1 is the orientation of molecule 1 with respect to an arbitrary origin, and similarly for Ω2 for molecule 2. If we choose the origin of r coincident with site i on molecule 1 and pointing to site j on molecule 2, labeled as rij, then the radial site-site distribution functions are angular averages of the molecular pair distribution functions,

$$g_{ij}(r) = \left\langle g(\mathbf{r}_{ij}, \Omega_1, \Omega_2) \right\rangle_{\Omega_1, \Omega_2}, \tag{7}$$

with g(rij, Ω1, Ω2) the molecular pair distribution function fully dependent on the distance and orientations of the pair of molecules, and the ⟨⋯⟩Ω1,Ω2 notation here indicates a normalized average over the angular coordinates.11 This notation usually assumes fixed internal molecular coordinates, but can be understood to apply in the case with internal degrees of freedom if the Ω-weighted ensemble average is understood to include as well the average over all of the respective intramolecular degrees of freedom of each molecule in the pair terms. The notation here is that of Gray and Gubbins,11 and Hansen and McDonald,9 and is standard in liquid state theory. The site-site labels gij(r) are discussed above. The ij labels inside of g(rij, Ω1, Ω2) mean simply that the origin of the coordinates is chosen to coincide with sites i and j, and is thus a simple coordinate change of the general g(r, Ω1, Ω2), without loss of generality. This is the standard convention when discussing the correspondence between site-site distribution functions and the general molecular pair distribution functions, and we follow it here.

To define the usual proximal distribution functions, we consider the coordinate geometry in a given pair function. From the definition of the site-site potential in Eq. 1, each point (rij, Ω1, Ω2) is equivalent to defining the set of distance vectors (or distance matrix) from m sites i on molecule 1 to n sites j on molecule 2, [rij]m, n ≡ [r11, r12, …, rmn]. This collection of distance vectors can be well-ordered, up to degeneracies, by scalar magnitude. For example, if we pick a particular site i on molecule 1, and label the set of distance vectors from site i to the n sites j of molecule 2, [rij]n, by their relative magnitude as [ri(n)], where ri(1) is the shortest distance of the set [rij]n, ri(2) is the second-shortest, and so on, then we have the ordered set [rij]n ≐ [ri(1) ⩽ ri(2) ⩽ ⋯ ⩽ ri(n)] for the n distances between any single site i on molecule 1 and the n sites in molecule 2 for a particular configuration (rij, Ω1, Ω2). Given this ordering, we define the conditional radial distribution function

$$\rho_i\, g_{ij}(r, n)\, \rho_j = \frac{1}{V} \left\langle N(N-1)\, \delta\!\left(r_{ij}^{(n)} - r\right) \right\rangle \tag{8}$$

as the probability density of site j on molecule 2 around site i on molecule 1 given that site j is the nth closest site on molecule 2 to site i on molecule 1. By definition, the complete set of conditional probabilities sums to the radial site-site distribution function:

$$g_{ij}(r) = \sum_{n} g_{ij}(r, n). \tag{9}$$

As a specific example, consider any standard 3-site model of water, with gOH(r) the radial site-site distribution between an oxygen site on molecule 1 and a specific hydrogen site on molecule 2. The site-site distribution is then given as the sum of the series

$$g_{OH}(r) = g_{OH}(r, 1) + g_{OH}(r, 2) + g_{OH}(r, 3), \tag{10}$$

and is constructed to be the sum of probabilities for gOH(r, 1) the radial distribution of hydrogen sites on molecule 2 with respect to the oxygen site on molecule 1 where the rOH distance is the shortest distance between the oxygen on molecule 1 and the 3 sites of molecule 2, gOH(r, 2) is the distribution where rOH is the second shortest distance, and gOH(r, 3) is the distribution for rOH the third shortest distance. Here we have considered water as a solute in water for an illustration.
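The decomposition in Eqs. 8, 9, and 10 can be sketched numerically as a histogram that is split by neighbor rank. The code below is a minimal illustration under stated assumptions: the function name `proximal_histograms` is hypothetical, the input distances are random numbers standing in for sampled configurations, and the density and shell-volume normalization that turns counts into g(r, n) is omitted.

```python
import numpy as np

def proximal_histograms(dist, j, bins):
    """Histogram r_ij for a fixed site j of molecule 2, split by right-handed rank n.

    dist: array of shape (frames, m, n) holding the site-site distance matrix of one
    molecular pair per frame. Returns hist[i, k, :], the counts of r_ij for which
    site j is the (k+1)-th closest site of molecule 2 to site i of molecule 1.
    """
    frames, m, n = dist.shape
    # rank[f, i, j] = 0 for the closest site of molecule 2 to site i, 1 for the next, ...
    rank = np.argsort(np.argsort(dist, axis=2), axis=2)
    hist = np.zeros((m, n, len(bins) - 1))
    for i in range(m):
        for k in range(n):
            mask = rank[:, i, j] == k
            hist[i, k], _ = np.histogram(dist[mask, i, j], bins=bins)
    return hist

# Synthetic stand-in data (not simulation output): 500 frames of a 3 x 3 distance matrix.
rng = np.random.default_rng(0)
dist = rng.uniform(1.0, 10.0, size=(500, 3, 3))
bins = np.linspace(0.0, 10.0, 21)
hist = proximal_histograms(dist, j=1, bins=bins)

# Unconditional histogram of r_ij for the same site j, for comparison with Eq. 9.
total = np.stack([np.histogram(dist[:, i, 1], bins=bins)[0] for i in range(3)])
```

Each sampled distance falls into exactly one rank, so summing `hist` over the rank index reproduces `total` bin by bin, which is the discrete counterpart of Eq. 9.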

These are the most familiar, and basic, form of the proximal radial distribution functions. However, there is an interesting degree of freedom left to examine. The development of the series expansion in this form is chiefly aimed at understanding the solvation density around a given solute. Thus, the proximal expansions of the site-site radial pair functions gij(r, n) are taken from the complete mutual orientational averages of the full g(r, Ω1, Ω2). There are two principal intermediate averages over the orientations of molecule 1 or 2 separately,

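$$g_{ij}(\mathbf{r}_{ij}, \Omega_2) = \left\langle g(\mathbf{r}_{ij}, \Omega_1, \Omega_2) \right\rangle_{\Omega_1}, \qquad g_{ij}(\mathbf{r}_{ij}, \Omega_1) = \left\langle g(\mathbf{r}_{ij}, \Omega_1, \Omega_2) \right\rangle_{\Omega_2}, \tag{11}$$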

which we call12, 13, 14 the right- and left-handed angular correlation functions, respectively. For clarity, we point out that the left-hand correlation function is generated from the complete pair function by a right-hand average, and vice versa for the right-hand correlation function. Since, by definition,

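$$g_{ij}(r) = \left\langle g_{ij}(\mathbf{r}_{ij}, \Omega_2) \right\rangle_{\Omega_2} = \left\langle g_{ij}(\mathbf{r}_{ij}, \Omega_1) \right\rangle_{\Omega_1}, \tag{12}$$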

the site-site distribution functions gij(r) are independent of the order in which the angular averages are taken.

For the proximal functions, instead, since the definition in Eq. 8 is for the n sites on molecule 2 relative to a given site i on molecule 1, there is an additional set of functions gij(r, m),

$$\rho_i\, g_{ij}(r, m)\, \rho_j = \frac{1}{V} \left\langle N(N-1)\, \delta\!\left(r_{ij}^{(m)} - r\right) \right\rangle, \tag{13}$$

where 4πgij(r, m)ρjr2dr is the probability of finding site j on molecule 2 at a distance r from site i on molecule 1, given that site i is the mth closest site of molecule 1 to site j on molecule 2, and ∑mgij(r, m) = gij(r). Just as for the angular functions, there is a left-handed and a corresponding right-handed sense to the proximal functions. They are formally distinguished by recognizing which angular averages are unrestricted in the construction. For the gij(r, n) functions of Eq. 8, the angular average is unrestricted for molecule 1, and thus the gij(r, n) are conditional averages of the right-handed gij(rij, Ω2) angular correlation functions. Similarly, for the gij(r, m) functions in Eq. 13, the averages over molecule 2 are unrestricted, and the gij(r, m) are functions of the left-handed gij(rij, Ω1) angular correlation functions. We therefore label the usual proximal distribution function hierarchy set gij(r, n) from Eq. 8 as the right-handed proximal distribution functions, and the complementary set gij(r, m) from Eq. 13 as the left-handed proximal distribution functions.

The relation of the left and right intermediate proximal radial distribution functions to their respective intermediate angular averages of the pair distribution function g(r, Ω1, Ω2) indicates that there exists an even more general set of functions that correspond to the full pair distribution function. Rather than the m sites of molecule 1 with respect to site j of molecule 2, or the n sites of molecule 2 with respect to a site i of molecule 1, if we instead take the complete set of η = m · n site-site distances between the pair of molecules 1 and 2 ordered from shortest to longest, [r(η)] ≐ [r(1) ⩽ r(2) ⩽ ⋯ ⩽ r(η)], we then have the set of functions

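$$\rho_i\, g_{ij}(r, \eta)\, \rho_j = \frac{1}{V} \left\langle N(N-1)\, \delta\!\left(r_{ij}^{(\eta)} - r\right) \right\rangle, \tag{14}$$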

where 4πgij(r, η)ρjr2dr is the probability of finding site j of molecule 2 at distance r from site i of molecule 1 given that rij is the ηth closest of the m · n site-site distances between molecules 1 and 2. As before,

$$g_{ij}(r) = \sum_{\eta} g_{ij}(r, \eta). \tag{15}$$

For molecular pair functions, this is the most general set of proximal distribution functions that can be defined using only the set of site-site distances between molecules, and has the same formal relationship to the left and right functions gij(r, m) and gij(r, n) as does the pair function g(rij, Ω1, Ω2) to the left and right angularly dependent correlation functions gij(rij, Ω1) and gij(rij, Ω2).
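The three orderings that define the right-handed, left-handed, and complete proximal functions differ only in the set of distances over which a given rij is ranked. A minimal sketch follows, assuming a single m × n distance matrix as input; the function name `proximal_ranks` and the example matrix are hypothetical.

```python
import numpy as np

def proximal_ranks(r):
    """Return the 1-based rank of each r[i, j] in the three orderings used in the text."""
    # Right-handed (Eq. 8): rank of site j among the n sites of molecule 2, for fixed i.
    n_rank = np.argsort(np.argsort(r, axis=1), axis=1) + 1
    # Left-handed (Eq. 13): rank of site i among the m sites of molecule 1, for fixed j.
    m_rank = np.argsort(np.argsort(r, axis=0), axis=0) + 1
    # Complete (Eq. 14): rank among all m * n site-site distances of the pair.
    eta_rank = (np.argsort(np.argsort(r, axis=None)) + 1).reshape(r.shape)
    return n_rank, m_rank, eta_rank

# Hypothetical distance matrix for m = 2 sites on molecule 1 and n = 3 on molecule 2.
r = np.array([[2.0, 3.0, 5.0],
              [4.0, 1.0, 6.0]])
n_rank, m_rank, eta_rank = proximal_ranks(r)
```

In this example the entry r[0, 1] = 3.0 is second in its row, second in its column, and third overall, so the same distance contributes to g(r, n = 2), g(r, m = 2), and g(r, η = 3) in the three hierarchies.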

To summarize, the proximal radial distribution functions as previously constructed2, 3 are one set of a general class of proximal distribution functions, each member of the class being delineated by its construction from one of the various angle averages of molecular pair correlation functions. Additionally, each of the classes of proximal distribution functions, as a complete series, is a set of distributions intermediate between the given angular pair function and the radially symmetric site-site distribution functions. In particular, the standard proximal radial distribution functions gij(r, n) are intermediate between the right-handed gij(rij, Ω2) angular correlation function and gij(r), the gij(r, m) functions are intermediate between the left-handed gij(rij, Ω1) angular correlation function and gij(r), and the gij(r, η) functions are intermediate between the complete g(rij, Ω1, Ω2) pair correlation function and gij(r). In strict terms of angular information content, the set of proximal functions is intermediate between the radial and the angular pair correlations of the fluid.

Finally, we can always in principle re-normalize the conditional functions gij(r, η), gij(r, m) and gij(r, n) such that they approach 1 at long distance when using a partial or quasi-component set. In such case we will denote the re-normalized functions as the perpendicular distribution functions, g⊥ij(r, η), g⊥ij(r, m), and g⊥ij(r, n).

With these definitions of the proximal distribution functions, both with respect to the site-site radial distribution functions and the molecular pair distribution functions, it is reasonable to ask whether the series representations that the proximal functions provide of the different pair functions allow for a truncation of the series that still predicts the thermodynamic averages of the system with reasonable accuracy. To demonstrate, from the definitions of the thermodynamic averages, the excess internal energy per particle of a pure site-site model system is given by9

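$$\begin{aligned}
\frac{\beta U^{ex}}{N} &= \frac{1}{2} \sum_{ij} \int \beta u_{ij}(r)\, g_{ij}(r)\, \rho_j\, d\mathbf{r} \\
&= \frac{1}{2} \sum_{ij} \sum_{n} \int \beta u_{ij}(r)\, g_{ij}(r, n)\, \rho_j\, d\mathbf{r} \\
&\approx \frac{1}{2} \sum_{ij} \int \beta u_{ij}(r)\, g^{\perp}_{ij}(r, 1)\, \rho_j\, d\mathbf{r},
\end{aligned} \tag{16}$$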

where g⊥ij(r, 1) is the nearest neighbor proximal distribution normalized to 1 at large distances, β−1 = kbT with kb Boltzmann's constant, T the absolute temperature in kelvin, and N the total number of molecules in the pure fluid. Since gij(r) = ∑ngij(r, n) for the right-handed proximal distribution functions, we can approximate the total βUex/N using the truncated series in the last line after appropriate normalization. Previous work7 has shown that this truncation works reasonably well for the Coulomb term, uijc(r), in potentials of the form

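$$u_{ij}(r) = \frac{q_i q_j}{r} + 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] = u^{c}_{ij}(r) + u^{lj}_{ij}(r), \tag{17}$$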

when used in conjunction with a Voronoi-like spatial (volume) normalization6 to the distribution's first terms, g⊥ij(r, 1) → 1 as r → ∞, so that the approximation gij(r) ≈ g⊥ij(r, 1) holds at long distance.
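The split of the site-site potential in Eq. 17 into Coulomb and Lennard-Jones terms can be written out directly. This is a minimal sketch: the function names are hypothetical, and the Coulomb term is written exactly as in Eq. 17, so any unit conversion factor is assumed to be absorbed into the charges.

```python
def coulomb(r, qi, qj):
    """Coulomb term u^c_ij(r) = q_i q_j / r, in the unit convention of Eq. (17)."""
    return qi * qj / r

def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones term u^lj_ij(r) = 4*epsilon*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def site_site_potential(r, qi, qj, epsilon, sigma):
    """Total site-site potential of Eq. (17) as the sum of the two terms."""
    return coulomb(r, qi, qj) + lennard_jones(r, epsilon, sigma)

# Illustrative parameter values only (not a published water model).
u_at_sigma = lennard_jones(3.0, epsilon=0.2, sigma=3.0)                 # zero crossing
u_at_minimum = lennard_jones(3.0 * 2.0 ** (1.0 / 6.0), epsilon=0.2, sigma=3.0)
```

Keeping the two terms as separate functions mirrors the way the energy integrals are reported below, where the Coulomb and Lennard-Jones contributions to each member of the hierarchy are tabulated separately.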

We note that this can be done from a right- or a left-handed angular set. In past work we have used the right-handed gij(r, n) proximal decomposition of the solvent density around a solute. In the results given below, we discuss the numerical results of this truncation for the distributions and potential energy terms, considering the solute angular average using the left-handed decomposition gij(r, m). In considering the truncation of the series in this way, however, we find that there is another utility to the proximal distribution function set: it can be used to directly measure the onset of the coarse-graining convergence of the solution properties of a macromolecular solute.

Coarse-grained averaging as a concept in the physics of atomic systems has a long history. As an illustration, consider that the Lennard-Jones potential is itself a coarse-graining of the non-bonded interactions of the electronic distributions of atoms and molecules. We have in mind here the current and ongoing work15 that seeks to generate nano and mesoscale potential surface models of molecular systems in order to accurately project the underlying classical Hamiltonians of molecular mechanics models, for instance, the CHARMM16 and AMBER17 force fields, to longer length or time scales. The principal idea is that, rather than having macromolecules composed of hundreds of thousands of sites, we use models composed of ten to a hundred times fewer sites, each such site representing some unique subset of the atomic sites of the generating potential surface. If these units can be rigorously parameterized,18 then in principle we could have a transferable set of nano and mesoscale models applicable to larger length and time scales than are generally accessible via all-atom molecular dynamics and Monte Carlo simulations. Our purpose here is not to produce such a model of the solute but to consider length scales where the properties as seen by solvent converge.

Formally, we seek to reduce the total number of site-site calculations necessary to generate the effective solvent-solute potential, so that

$$\sum_{i=1}^{m} \sum_{j=1}^{n} u_{ij}(r_{ij}) \approx \sum_{i=1}^{m} \sum_{J=1}^{\nu} U(\mathbf{r}_{iJ}, \Omega_2) \tag{18}$$

where the sites labeled by i, j are for the all-atom potential, ν ≪ n, J labels the sites of the coarse-grained model molecule 2, and ∑JU(riJ, Ω2) is an arbitrary set of functions of many fewer sites which can be parameterized to make the equality approximately true for all corresponding configurations. In principle, the set of ν sites can be defined arbitrarily for any given solvent-solute system. If we take a hint from generalized nonlocal grid techniques, the ν sites can be defined dynamically so as to parameterize the potential surface according to mathematical and computational convenience. Given the ideal goal of transferability of any given parameterization, however, then a strictly utilitarian argument would be for the ν sites to be associated with a particular defined subset of the sites in a given molecule.

For instance, we might restrict the ν sites to represent a given number of adjacent base pairs in a DNA sequence, or amino acids in a sequence in a protein, averaged such that the parameters roughly represent the sequence potential regardless of the particular macromolecule modeled. This localization of the ν sites also has the virtue of representing the same step across hierarchies of physical subunits as, for instance, the electronic structure to Lennard-Jones plus charges potential for small molecules and atomic species, as well as keeping the principal physical meaning of coarse-graining in the sense of summing over j in small numbers of sites that are adjacent to each other and which make up a physically and chemically useful subunit of the macromolecule being modeled. We note that this type of coarse-graining will be most useful for studying macromolecular systems at finite concentration, where the sheer number of all-atom sites of any polymer is a significant hurdle, for example in polymer melts or cellular crowding.

If we want to then understand the general idea of coarse-graining over a well-defined molecular length scale, we need to account for, and measure, the results of all possible ways that one can partition the atomic sites of a molecule, given that the admissible subgroups are restricted to those which are close to each other (proximal) relative to an arbitrary solvent molecule. The proximal distribution functions provide precisely this measure. To see this, we consider the excess energy per molecule for species α in a mixture of site-site model fluids:9, 10

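$$\frac{\beta U^{ex}}{N_\alpha} = \frac{1}{2} \rho_\gamma \sum_{\alpha\gamma} \sum_{ij} \int d\mathbf{r}\, g_{\alpha_i \gamma_j}(r)\, \beta u_{\alpha_i \gamma_j}(r) = \frac{1}{2} \rho_\gamma \sum_{\alpha\gamma} \sum_{ij} \sum_{m} \int d\mathbf{r}\, g_{\alpha_i \gamma_j}(r, m)\, \beta u_{\alpha_i \gamma_j}(r), \tag{19}$$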

where α and γ are species labels, i and j are the site labels within species α and γ, respectively, and for this case, we use the left-handed gij(r, m) functions.4 Now, for the left-handed proximal distributions of the macromolecule-water distributions, the hierarchy runs over the m sites of the macromolecule, which are many more than the n sites that enter when the right-handed gij(r, n) functions are used, as in the more common convention,3, 4, 6, 7, 8 since n ≪ m for water. Expanding the neighbor hierarchy in the larger number of macromolecular sites allows for the fine-grained functionalization of the excess energy per molecule over the m-member hierarchy:

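$$\frac{\beta U^{ex}(\nu)}{N_\alpha} = \frac{1}{2} \rho_\gamma \sum_{m=1}^{\nu} \sum_{\alpha\gamma} \sum_{ij} \int d\mathbf{r}\, g_{\alpha_i \gamma_j}(r, m)\, \beta u_{\alpha_i \gamma_j}(r), \tag{20}$$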

where ν ∈ [1, …, m] for the m sites of the α molecule.
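Once the energy integral for each member m of the hierarchy is available, Eq. 20 is a running sum over m, and its approach to the full sum can be tracked directly. The sketch below assumes the per-rank contributions have already been computed; the function names, the example numbers, and the tolerance-based convergence criterion are illustrative assumptions rather than the procedure used in this work.

```python
import numpy as np

def running_energy(per_rank_energy):
    """Partial sums U(nu) = sum over m <= nu of the per-rank contributions (Eq. 20)."""
    return np.cumsum(per_rank_energy)

def onset_of_convergence(per_rank_energy, tol):
    """Smallest nu (1-based) beyond which every partial sum stays within tol of the total."""
    partial = running_energy(per_rank_energy)
    err = np.abs(partial - partial[-1])
    outside = np.nonzero(err > tol)[0]
    # outside[-1] is the last 0-based index still out of tolerance.
    return 1 if outside.size == 0 else int(outside[-1]) + 2

# Hypothetical per-rank contributions, m = 1..5 (not data from this work).
per_rank = np.array([-4.0, -2.0, -1.0, -0.5, -0.25])
partial = running_energy(per_rank)
nu_onset = onset_of_convergence(per_rank, tol=0.5)
```

For these made-up numbers the partial sums are −4.0, −6.0, −7.0, −7.5, and −7.75, so the running energy first stays within 0.5 of the total at ν = 4.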

This parameterization of βUex(ν) with respect to ν measures the solute potential energy as a function of all subsets of ν sites within molecule 1 nearest to a test molecule 2. For a descriptive geometry, consider a linear polymer of arbitrary (large) length. A schematic of this idealized case is given in Figure 1. Now, consider an infinite radius disc perpendicular to the long axis of the polymer molecule, containing a single site of the polymer and a test site of a water molecule in the bulk. This disc is the restricted geometric region for which the solute site included is the nearest solute site to the test site of the solvent molecule, and thus is the region of g(r, Ωw, Ωp) which generates g(r, 1), the nearest neighbor proximal distribution function for the polymer-water pair. The higher order members of the proximal distribution function hierarchy will have similar closed, but increasingly complex, geometric regions of g(r, Ω1, Ω2) as generators. In total, the potential Uex(ν) is that of the ν sites closest to the solvent molecule, summed through all ν measuring geometries centered to the individual solute sites and then summed over all sites of the solute. That is, the first approximation, the potential from nearest neighbors, contains the contributions from only the solute site connected to the solvent via the corresponding blue line averaged over the blue disc region, then summed over each change of register which moves the blue measuring disc and line to the next site and corresponding test solvent molecule in the chain. The next approximation, nearest neighbor plus next-nearest neighbor, contains the blue line/disc contribution plus the green line contribution in its (more complex) next-nearest measuring geometry, and the total sum over solute sites sums these contributions for all changes of register along the solute. The next approximation would include as well the red line contributions, and so on. 
Further, the construction of the gij(r, m) functions ensures that all possible Boltzmann-weighted orientations of the test disc are also taken into account or, for solutes with arbitrary internal geometry, that all possible localized subsets of the ν-nearest sites to a solvent molecule are accounted for.

Figure 1.


Schematic for the coordinates of the solute-solvent distributions g(r, Ωw, Ωp) and proximal components discussed in the text. The black outlined large spheres are the sites on a linear polymer solute with orientation Ωp, with connecting bonds. The dark blue sphere is the test solvent site. The blue line indicates the nearest neighbor site of the solute to the test solvent site within the blue averaging disc, the green line is for the next nearest neighbor(s), and the red line is for the next-next nearest neighbor.

We point out that the choice of the solute energy is not a unique measure of coarse graining. There are a variety of possible thermodynamic measures which can be used as the proxy by which the parameters of a given coarse-grained model are chosen, though minimization of the free energy functional is probably the choice most commonly considered formally.19, 20 The choice for a given system can be formally justified in a variety of ways, and all of the thermodynamic measures that we are aware of have a range of problems for which the given construction is appropriate.21, 22, 23 The energy here is chosen both for reasons of simplicity, as well as generality, since the integral over the distribution functions would be the key element for, e.g., virial pressure or, through the Kirkwood-Buff approach,24 any of a variety of thermodynamic variables. In application, for example, we have shown that the free energy via the linear response approximation is also accessible from the proximal functions.7 Additionally, recent work by Voth and co-workers25 has been concerned with χ2 fitness measures for coarse-grained models, and in particular for our purposes they have investigated the properties of χ2(n), that is, the fitness of a given model as a function of the number of coarse-grained sites used to construct the potential surface. Our simple energy measure is meant to be representative of this sort of χ2 fitness measure. The proximal convergence of the energy integral, or any other thermodynamic integral over the distribution functions can be viewed as physical or environmental criteria/measures for coarse graining of the all-atom system. As such, the method developed here is meant as a tool for analysis of all-atom systems in order to complement the existing tools for building coarse-grained models from them.

RESULTS

Here we present numerical results for two systems. Note that, throughout this work, we do not renormalize any of the functions in the various proximal distribution function series, so that the identity gij(r) = ∑ngij(r, n) holds rigorously throughout. For our first example, in order to illustrate the fundamental properties of both sets of proximal functions, the right-handed set g(r, n) and the complete set g(r, η), we use results from constant number, volume, and temperature (NVT) simulations of the simple point charge (SPC) water model; that is, we consider water about water. All results for pure water were derived from a system of 1024 water molecules, at 300 K and 0.03346 molecules per Å3. We use the full Ewald sum, the Nosé thermostat, the Verlet algorithm, and otherwise standard methods of molecular dynamics simulation.26 In our second example, we will use two previously determined7 fixed configurations of a deca-alanine peptide in solution using the Amber model17 with 2200 TIP3P water molecules.

Pure water

We first examine the excess internal energy integrals over the set of right-handed proximal distribution functions for the pure water system. In Table 1, we break out the results of the integrals in Eq. 19 into their Lennard-Jones and Coulomb contributions term-by-term with the sum over n for g(r, n), in comparison to the total sum, and the direct integral over gij(r) calculated in the standard way. The trivial result is that the total sum of energy contributions is identical to the gij(r) integrals. The more illustrative result is the total (Lennard-Jones plus Coulomb) potential for each site-site pair as given as a running total (i.e., the sum as each additional term is added to the total) over n in Table 1. For this case of a very small solute molecule, it is clear that each term of the series over n, for either the Lennard-Jones contributions or the Coulomb case, makes a substantial contribution to the individual site terms. However, for the Coulomb potential, the total site-site Coulomb sum for n = 1 is UC; total = −14.4, while for the full n = 3 the result is UC; total = −18.4. The molecular Coulomb potential (i.e., the total Coulomb contribution to the molecular potential) remains of the same magnitude across the series. Truncation of n for the Lennard-Jones potential for the very small molecule case considered would lead to large errors for any set of functions chosen, while the Coulomb results are less sensitive as seen in previous work.3, 4, 6, 7 We note that, because the n = 2, 3 terms oscillate around this result, the conclusion that can be drawn from all of these various results together is that the fundamental conservation of charge rescues the partial sums of the Coulomb potential for the total molecular potential, but as is normal and expected the term-by-term convergence of a Coulomb series is not generally so well behaved.

Table 1.

Excess internal energies for SPC water system at 298.15 K and ρ = 0.03346 N/Å3.

Generating function       Total βU(ex)N   Coulomb   Lennard-Jones   Running i, j total over n
Total over all sites ∑ijgij(r) = −15.2
gOO(r) = ∑ngOO(r, n)      624.8           622.0     2.8
gOO(r, 1)                 171.9           170.4     1.5             171.9
gOO(r, 2)                 365.7           364.2     1.5             537.6
gOO(r, 3)                 87.2            87.4      −0.2            624.8
gOH(r) = ∑ngOH(r, n)      −319.2          −319.2    0.0
gOH(r, 1)                 −163.2          −163.2    0.0             −163.2
gOH(r, 2)                 −60.8           −60.8     0.0             −224.0
gOH(r, 3)                 −95.2           −95.2     0.0             −319.2
gHO(r) = ∑ngHO(r, n)      −319.0          −319.0    0.0
gHO(r, 1)                 −99.2           −99.2     0.0             −99.2
gHO(r, 2)                 −176.0          −176.0    0.0             −275.2
gHO(r, 3)                 −43.8           −43.8     0.0             −319.0
gHH(r) = ∑ngHH(r, n)      159.0           159.0     0.0
gHH(r, 1)                 77.6            77.6      0.0             77.6
gHH(r, 2)                 32.8            32.8      0.0             110.4
gHH(r, 3)                 48.5            48.5      0.0             159.0
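
The running-total column of Table 1 is simply the cumulative sum of the per-n totals. The short sketch below recomputes it from the tabulated values; the dictionary and function names are illustrative only.

```python
from itertools import accumulate

# Per-term totals (n = 1, 2, 3) copied from the "Total" column of Table 1.
TABLE1_TERMS = {
    "OO": [171.9, 365.7, 87.2],
    "OH": [-163.2, -60.8, -95.2],
    "HO": [-99.2, -176.0, -43.8],
    "HH": [77.6, 32.8, 48.5],
}


def running_totals(terms):
    """Cumulative sum over n, rounded to the one decimal used in the table."""
    return [round(x, 1) for x in accumulate(terms)]


for pair, terms in TABLE1_TERMS.items():
    print(pair, running_totals(terms))
```

For the HH pair the tabulated terms add to 158.9 rather than the listed 159.0, which is consistent with rounding of the individual entries in the source table.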

For the water around water case, we plot the g(r, n) series for each site-site pair in Figures 2, 3, 4, and 5. Notice that, due to the handedness of the angular averages, gOH ≠ gHO. All of the terms, nearest neighbor, next nearest, and next-next nearest, are of the same relative order of magnitude over the full range of the functions. While there is a clear ordering of the probabilities over any given i, j pair, that ordering is not consistent across pairs. Taken together, this means that no truncation of the series over n is a good approximation of the relative probabilities for the averages considered for small molecules, and one should take care to examine any such series.

Figure 2.


The oxygen-oxygen gOO(r, n) functions for the SPC model water system discussed in the text, together with the total site-site distribution function, gOO(r) = ∑ngOO(r, n). The blue line is the total gOO(r), the black line is the nearest neighbor gOO(r, 1), the red line is the next-nearest neighbor gOO(r, 2), and the green line is the next-next nearest neighbor gOO(r, 3).

Figure 3.


The oxygen-hydrogen gOH(r, n) functions for the SPC model water system discussed in the text, together with the total site-site distribution function, gOH(r) = ∑ngOH(r, n). The line color conventions are as for Figure 2, blue for the total gOH(r), and black, red, and green for gOH(r, 1), gOH(r, 2), and gOH(r, 3), respectively.

Figure 4.


The hydrogen-oxygen gHO(r, n) functions for the SPC model water system discussed in the text, together with the total site-site distribution function, gHO(r) = ∑ngHO(r, n). The line color conventions are as for Figure 2, blue for the total gHO(r), and black, red, and green for gHO(r, 1), gHO(r, 2), and gHO(r, 3), respectively.

Figure 5.


The hydrogen-hydrogen gHH(r, n) functions for the SPC model water system discussed in the text, together with the total site-site distribution function, gHH(r) = ∑ngHH(r, n). The line color conventions are as for Figure 2, blue for the total gHH(r), and black, red, and green for gHH(r, 1), gHH(r, 2), and gHH(r, 3), respectively.

For molecules with tight angular coupling over longer ranges, correlations between pairs of molecules are non-trivial. Since the gij(r, n) set corresponds to g(r, Ω2), the averaging which generates the gij(r, n) functions is not given by a strict series expansion, in the sense of n being a strict order parameter. As such, while the sub-averages which generate each member g(r, n) are well-defined, direct measurement shows that there is no a priori reason to assign an ordering weight to the sequence 1, 2, …, n. By considering the correlations of the averaged solute, in some cases, the nearest neighbor sites of a small molecule solvent may be less important than the next nearest or next-next nearest, at any given particular distance, at least as far as which probabilities are important to the whole. Here, since there are only a small number of possible site neighbors for a given solvent there is no reason not to include them all. This is opposite to the case where we consider the angularly averaged solvent around a solute composed of a large number of sites. There, the neighbor hierarchy provides considerable computational convenience with more controllable precision5 because proximity then correlates with influence, i.e., a solute site on one side of a macromolecule has little influence on the solvent on the opposite side.

The difference when considering small molecules is striking. For water as a solute in a solvent of water, the results of breaking r, Ω2 out into rij(n) in their thermodynamic contributions indicate that, at distances beyond the immediate contact distance between sites, we cannot look at a given orientation and assign the equivalent thermodynamic weight to the bare potential between sites. In other words, since gij(r, n) and g(rij, Ω2) average over the orientations of molecule 1 and correlate with all other molecules 2 in the system, we can no longer make that direct prediction.

To illustrate, we plot the nearest neighbor functions for SPC water in Figures 6 and 7, for oxygen and hydrogen, respectively. In both cases, at any distance greater than the ∼3 Å Lennard-Jones contact distance between oxygen sites, the most probable nearest site neighbor j on molecule 2 for site i (whether oxygen or hydrogen), when averaged over the angles of molecule 1, is the hydrogen site on molecule 2. Naively, taking only the distances and bare potential of two SPC water molecules, this might seem counter-intuitive, since the hydrogen-hydrogen interaction is repulsive. However, if we have a site of molecule 1 at the origin and average over the orientations of molecule 1, then the contributions leading to both oxygen and hydrogen of molecule 1 more favorably interacting with the hydrogens of molecule 2 arise from two distinct physical paths. For an oxygen on molecule 1, the hydrogen on molecule 2 should be a direct interaction, while for a hydrogen on molecule 1, the interaction with a hydrogen on molecule 2 is indirect, mediated by the companion oxygen site of molecule 1 in its interaction with the hydrogen of molecule 2. It is more complicated to use distances between molecular surfaces rather than sites. This appears to alleviate some of the problems of widely disparate atomic site diameters,4, 27 and may also help for molecules with a wide range of site charges, which pose an additional challenge coupled to that of sizes.

Figure 6.


The oxygen-nearest neighbor gOj(r, 1) functions for the SPC model water system discussed in the text. The solid red line is the oxygen-oxygen nearest neighbor function gOO(r, 1) and the dotted-dashed red line is the oxygen-hydrogen nearest neighbor function gOH(r, 1).

Figure 7.


The hydrogen-nearest neighbor gHj(r, 1) functions for the SPC model water system discussed in the text. The solid red line is the hydrogen-oxygen nearest neighbor function gHO(r, 1) and the dotted-dashed line is the hydrogen-hydrogen nearest neighbor function gHH(r, 1).

Ala10

For the more complicated deca-alanine solute, the right-handed angle averaged solvent distributions have been discussed elsewhere.7 Here we consider the left-handed distributions. The excess energy per solute results βUex(ν)/Nsolute are presented in Figures 8 and 9, for the alpha-helix and the extended structures previously identified. These calculations were taken from simulations run expressly for this work. The total Lennard-Jones potentials, −5.8 kcal/mol for the alpha-helix structure and −1.8 kcal/mol for the extended structure, and the total Coulomb potentials, −57.2 kcal/mol for the alpha-helix and −99.2 kcal/mol for the extended, are consistent with the previously reported results for these systems.

Figure 8.


The excess Lennard-Jones solvent-solute potential for deca-alanine in water, as a function of the ν nearest solute sites to a solvent molecule. The black line is for the alpha-helix structure and the red line is for the extended structure. The energy units here are in kcal/mol.

Figure 9.


The excess Coulomb solvent-solute potential for deca-alanine in water, as a function of the ν nearest solute sites to a solvent molecule. The line types and units are the same as for Figure 8.

There are two noteworthy features of the Lennard-Jones energy results. The first is the convergence with respect to the ν nearest solute sites to a solvent molecule, and the second is the positive shoulder at shorter length scales. We identify the first as the onset of solute coarse-graining in the system, due to the construction of Uex(ν) as discussed above. Inspection indicates that the coarse-graining convergence in this particular system takes effect at approximately the length of 4 or 5 amino acids in the solute (assuming that each interior amino acid has ∼10 sites). We identify the second feature, the positive shoulder, as related to the repulsion/exclusion work, or cavity work, necessary to insert the solute into the solvent. That this must be the cavity work is due to the form of the Lennard-Jones potential: the positive energy contribution is from the repulsive r^−12 term alone. The effect for the linear polymer chosen here is that the cavity work is measurable by the solvent at the scale of 1 or 2 amino acids, while the attractive r^−6 term dominates the total at larger scales. Finally, we note that the physical scaling of the Lennard-Jones potential of the all-atom system appears to follow a similar decay as the χ2(n) functions used to measure the accuracy of coarse-grained models as a function of the number of sites used to represent the potential model.25 Our results for the scaling of the different extended and helical structures also agree with the results25 showing that the number of sites necessary to converge a coarse-grained macromolecule model may be conformationally dependent, due to the fluctuation in structures.

For the Coulomb surface, the convergence is dominated by cancellation of charges. This is guaranteed by the sum over sites in Uex(ν), and the result is a function that oscillates around its ultimate result. The oscillations are on the scale of the contributions from the partially charged sites of the individual amino acids, and are a representation of the fact that the ν parameterization is with respect to any ν-nearest neighbors of a solvent molecule. As with the Lennard-Jones potential, the fluctuations converge at approximately the scale of 4 or 5 amino acids (less for the extended system, more for the alpha-helix).

A direct display of the data for deca-alanine analogous to that for water in Table 1 would contain too many entries to be immediately useful. However, we can take the total for each site type within the solute, that is, the sum over j water sites of Uij(ν) for each solute site type i (hydrogen, carbon, nitrogen, oxygen), summed over all sites of that type present in the solute. This gives a manageable graphical presentation of the information for the solute, as well as a solute site decomposition of the total energies given in Figures 8 and 9. The total is presented in Figure 10 for the extended structure. The major feature, other than the expected convergence scale, is that the total solute-water potential is dominated by the oxygen and carbon terms, with the hydrogen and nitrogen terms being small due to the summation over water sites.

Figure 10.


The excess solute site-water potential for the extended deca-alanine configuration in water, as a function of ν. The black line is the total for all hydrogen sites in the solute, the red line is for all carbons, the blue line is for all nitrogens, and the green line is for all oxygens.

In the context of using the proximal distribution function methods outlined above to help generate coarse-grained models of all-atom site-site model molecules, we see the utility of the method given here as follows. We would start by running a test polypeptide simulation using an all-atom potential model. Then, one would use the proximal functions to measure the all-atom simulation as above, and find the natural scaling of the system, which here occurs at approximately 5 amino acids for the solute-solvent interaction. Assuming that further all-atom test cases found a similar scaling, then we would conclude that the potential model chosen to represent that basic unit, regardless of the particular functional form, could be optimally parameterized so as to reproduce the energy of interaction of the all-atom sequence dependent test cases. That is, the proximal functions are a tool for a priori measuring the convergence of coarse-graining at the all-atom level. One might then build a coarse-grained model according to the required solute-solute interactions, efficiency, and so on.
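
The step of finding "the natural scaling of the system" can be sketched as locating the smallest ν beyond which Uex(ν) stays within a tolerance of its final value. The function name, the tolerance, and the numbers below are hypothetical illustrations; they are synthetic values, not data from Figures 8 or 9.

```python
def convergence_onset(u_of_nu, tol):
    """Return the smallest nu (1-based) such that |U(nu') - U(final)| <= tol
    for every nu' >= nu, where u_of_nu[nu - 1] is the energy using the nu
    nearest solute sites and the last entry is the complete energy.
    """
    final = u_of_nu[-1]
    onset = len(u_of_nu)
    # Walk backwards from the full sum until the tolerance is first violated.
    for nu in range(len(u_of_nu), 0, -1):
        if abs(u_of_nu[nu - 1] - final) <= tol:
            onset = nu
        else:
            break
    return onset


# Synthetic example: a positive shoulder followed by convergence.
toy_energy = [3.0, -1.0, -4.0, -5.5, -5.9, -5.8, -5.8]
onset = convergence_onset(toy_energy, tol=0.2)

# Rough conversion to residues, using the text's estimate of about 10 sites
# per interior amino acid (an approximation, not an exact count).
SITES_PER_RESIDUE = 10
onset_in_residues = onset / SITES_PER_RESIDUE
```

Scanning from the converged end rather than from ν = 1 avoids reporting a spurious onset where an oscillating curve, such as the Coulomb case, merely crosses its final value.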

We seek convergence, as a function of the coarse-graining parameter ν, to hold over a range of oligomeric macromolecules. For molecular mechanics potentials, it is usually the case that a given polymer residue or subunit has the same potential parameters (within a given model set) regardless of the macromolecule investigated. That is, a glycine is modeled the same whether it is used in tyrosine or myoglobin. Though chemically this is only approximately true, in molecular mechanics computations it is a common compromise across a huge variety of different competing necessities. Since the potentials are the same, it is also reasonable to assume that the approximate solution structures of the resulting subunits will, on average, have features that are similar. As an example, the average solution structure around an arbitrary glycine in a globular protein should be broadly similar to any other glycine in any other globular protein, given its proximity to the solvent and its placement within the protein geometry. While any particular subunit may have a local solution structure that deviates, the average structure of all subunits should be similar in different protein cases simply due to the fact that the same basic physical properties hold for each system in the class. This suggests that, just as molecular mechanics potential models are constructed from a broad sample of the electronic structures of a chemical species across many states, a distribution model can be constructed in the liquid state.

Consider a general set of model solvation distributions which reproduces on average the solvation of an arbitrary polymer. If there is a ν which holds for a sequence specific subset of polymers, then there is a functional description of the solvation of the members of the subset that allows a coarse-graining length related to ν, and the functions which describe it are

G_{ij}(r, \nu) = \sum_{n=1}^{\nu} g_{ij}(r, n), \qquad (21)

where i is the solvent label, j labels the sites of the individual subunits of the polymers in consideration (i.e., base pairs, amino acids, etc.), and the gij(r, n) functions in the collection are averaged over many examples of the solute class in question. Using our example above, one would have Gij(r, 50) for an individual alanine amino acid, averaged over a representative sample of alanine-containing proteins in a class for which ν ≈ 50 is found to be representative as a measure of the coarse graining, and the general model of the average solvation structure of an arbitrary protein in the class is that constructed from the G functions for all of the amino acids in the protein sequence. We have considered only the solvation here; considerable effort is still required for the intra-solute or solute-solute aspects of the model.
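
Equation 21 is a partial sum over the neighbor hierarchy, which a few lines of code make concrete. This is a minimal sketch: the list-of-lists layout (one list of r-bin values per neighbor rank n) and the function name are assumptions for illustration, and the numbers are toy values rather than simulation output.

```python
def cumulative_proximal(g_terms, nu):
    """Return G(r, nu) = sum over n = 1..nu of g(r, n).

    g_terms[n - 1] is the list of values of g(r, n) on a common r grid.
    """
    if not 1 <= nu <= len(g_terms):
        raise ValueError("nu must lie between 1 and the number of terms")
    nbins = len(g_terms[0])
    return [sum(g_terms[n][b] for n in range(nu)) for b in range(nbins)]


# Toy hierarchy with three neighbor ranks on a three-bin grid.
toy_terms = [
    [0.0, 0.5, 0.25],   # g(r, 1)
    [0.0, 0.25, 0.5],   # g(r, 2)
    [0.0, 0.25, 0.25],  # g(r, 3)
]
G_2 = cumulative_proximal(toy_terms, 2)
G_full = cumulative_proximal(toy_terms, 3)
```

When ν equals the total number of terms, the partial sum reduces to the full site-site distribution, consistent with the unrenormalized identity used throughout the paper.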

CONCLUSIONS

The purpose of this work was to examine the generalization of proximal pair distribution functions to consider the coarse-grained properties of the solute as opposed to the solvent. We considered how the contributions of the effective potential energy surface of a polymer in aqueous solution could be meaningfully represented using one-sided, angularly averaged proximal radial distribution function methods. We have formally sketched the four constructed proximal alternatives, angularly averaging on either the left or the right side, for either the solute or the solvent. Previous work in this lab showed that the most well-known such proximal distribution functions, the nearest-neighbor or 1st-order terms of the solute-solvent site-site proximal radial distribution function hierarchy, were sufficient as an approximation to the solvation structure of the deca-alanine peptide to predict the effective Coulomb contribution to the potential energy with acceptable accuracy. In this work we showed that we could also use the solute averaged distributions to determine convergence criteria for coarse graining. For 3-site water models considered as both solvent and solute, acceptable convergence of the energies for the individual site-site terms of the total potential requires all terms in the proximal distribution function expansion of the site-site distribution functions. Yet, it is still the case that the total molecular Coulomb potential approximately converges at the first term in the proximal series n. The charge density zeroth moment condition aids convergence of the terms of the partial sums. For the Lennard-Jones terms, of course, there are no similar fundamental relations. Convergence in this case requires adding sufficient proximal terms to obtain the usual expected distance criteria used in cutoffs and other potential truncation schemes.

We also derived several formal generalizations of the basic proximal distribution theory. The first result was that the distance criterion of the neighbor hierarchy generalizes to all collections of site-site distances within the molecule. As such, there is a general set of site-generated proximal distribution functions that follow the same left, right, and both hierarchy of terms as do the full, angularly dependent molecular pair distribution functions.14 In turn, each type of generalized proximal distribution function is uniquely generated from an average of its related angular correlation function over the familiar r, Ω1, Ω2 molecular coordinates within the proximal ordering of the complete set of rij site-site vectors unique to a given configuration of a pair of molecules.

Consideration of which side to angularly average and which group to leave as sites allows us to generate Uex(ν), where when ν = n, Uex(ν) = Uex, the complete potential energy, in this case, of the alanine solute in water. A result of this paper was that this functionalization gives a measure of the length scale, in terms of the number of sites ν in the solute molecule, at which coarse-graining of the internal sites of the solute molecule with respect to the solvent sets in. Using this result, we generated the total set of right-handed solvent-solute proximal distribution functions of an extended and an alpha-helix conformation of deca-alanine in water, and then showed the functionalization of the total Uex(ν) for each conformation. Both conformations showed convergence of the solute potential energy at a ν value on the scale of approximately 4 or 5 amino acids in length. Additionally, while the extended structure converged more smoothly at that length scale, the alpha-helical structure demonstrates a more persistent length correlation, indicative of a different local solvation structure.

We note that both of these results are consistent with the χ2(n) fitness measures developed to study the accuracy of the representation of coarse-grained models.25 Overall, these results are consistent with a coarse-grained picture in which the length scale of the polymer that a proximal solvent water molecule samples is of the order of 5 amino acids for this poly-peptide. This is also consistent with previous and ongoing free-energy calculations on this and similar systems in this lab, and with a standard picture of the solvation mechanism of large molecules.

These formal and numerical results lead us to the idea that, for broad classes of molecules, similar average scaling results should hold for the solvation potential functions Uex(ν) for the members of a class. The functions which may form a transferable solvation structure consistent with the measured solvation potentials of the members of the class are, or can be constructed from, the average gij(r, ν) sets of proximal distribution functions for the whole class. Specifically, suppose that, for some set of polypeptides on the scale of the deca-alanine studied here, the solvation energies Uex(ν) all converge at roughly the 5 amino acid scale. Then, for that set of polypeptides, the set of transferable solvation distribution functions broadly describing the solvation of the members of the set and consistent with the measured scale ν are the sets of G_{ij}(r, \nu) = \sum_{n=1}^{\nu} g_{ij}(r, n) functions for each amino acid averaged over the set, i.e., Gij(r, ν) for alanine, glycine, and so on. The remaining question for practical implementation requires consideration of whether a ν exists for such classes of polymer as well as sequence variations.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the financial support of the National Institutes of Health (GM 037657), the National Science Foundation (CHE-1152876), and the Robert A. Welch Foundation (H-0037). A portion of the research was performed at EMSL, a national scientific user facility sponsored by the Department of Energy's Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory. Additional high performance computing work was carried out through the resources from XSEDE via the Texas Advanced Computing Center (TACC) at The University of Texas at Austin.

References

  1. Ben-Naim A., Water and Aqueous Solutions (Plenum Press, New York, 1992).
  2. Mehrotra P. K. and Beveridge D. L., J. Am. Chem. Soc. 102, 4287 (1980). doi:10.1021/ja00533a001
  3. Mezei M. and Beveridge D. L., Methods Enzymol. 127, 21 (1986). doi:10.1016/0076-6879(86)27005-6
  4. Lounnas V., Pettitt B. M., and Phillips G. N. Jr., Biophys. J. 66, 601 (1994). doi:10.1016/S0006-3495(94)80835-5
  5. Rudnicki W. R. and Pettitt B. M., Biopolymers 41, 107 (1997).
  6. Makarov V. A., Andrews B. K., and Pettitt B. M., Biopolymers 45, 469 (1998).
  7. Lin B., Wong K.-Y., Kokubo H., and Pettitt B. M., J. Phys. Chem. Lett. 2, 1626 (2011). doi:10.1021/jz200609v
  8. Virtanen J. J., Makowski L., Sosnick T. R., and Freed K. F., Biophys. J. 99, 1611 (2010). doi:10.1016/j.bpj.2010.06.027
  9. Hansen J. P. and McDonald I. R., Theory of Simple Liquids, 2nd ed. (Academic Press Inc., London, 1986).
  10. Chandler D., in The Liquid State of Matter: Fluids, Simple and Complex, edited by Montroll E. W. and Lebowitz J. L. (North Holland Pub. Co., Amsterdam, 1982), p. 275.
  11. Gray C. G. and Gubbins K. E., Theory of Molecular Fluids (Clarendon Press, Oxford, 1984), Vol. 1.
  12. Ladanyi B. M. and Chandler D., J. Chem. Phys. 62, 4308 (1975). doi:10.1063/1.431001
  13. Chandler D., Silbey R., and Ladanyi B. M., Mol. Phys. 46, 1335 (1982). doi:10.1080/00268978200101971
  14. Dyer K. M., Perkyns J. S., and Pettitt B. M., J. Chem. Phys. 127, 194506 (2007). doi:10.1063/1.2785188
  15. Curtis E. M. and Hall C., J. Phys. Chem. B 117, 5019 (2013). doi:10.1021/jp309712b
  16. Foloppe N. and MacKerell A. D. Jr., J. Comput. Chem. 21, 86 (2000).
  17. Wang J., Wolf R. M., Caldwell J. W., Kollman P. A., and Case D. A., J. Comput. Chem. 25, 1157 (2004). doi:10.1002/jcc.20035
  18. Noid W. G., Chu J.-W., Ayton G. S., Krishna V., Izvekov S., Voth G. A., Das A., and Andersen H. C., J. Chem. Phys. 128, 244114 (2008). doi:10.1063/1.2938860
  19. Gray C. G., Gubbins K. E., and Twu C. H., J. Chem. Phys. 69, 182 (1978). doi:10.1063/1.436383
  20. Hansen J.-P., Addison C. I., and Louis A. A., J. Phys.: Condens. Matter 17, S3185 (2005). doi:10.1088/0953-8984/17/45/001
  21. Head-Gordon T. and Stillinger F. H., J. Chem. Phys. 98, 3313 (1993). doi:10.1063/1.464103
  22. Toth G., J. Phys.: Condens. Matter 19, 335222 (2007). doi:10.1088/0953-8984/19/33/335222
  23. Li W. and Takada S., Biophys. J. 99, 3029 (2010). doi:10.1016/j.bpj.2010.08.041
  24. Kirkwood J. G. and Buff F. P., J. Chem. Phys. 19, 774 (1951). doi:10.1063/1.1748352
  25. Sinitskiy A. V., Saunders M. G., and Voth G. A., J. Phys. Chem. B 116, 8363 (2012). doi:10.1021/jp2108895
  26. Allen M. P. and Tildesley D. J., Computer Simulation of Liquids (Oxford University Press, Oxford, 1987).
  27. Lounnas V. and Pettitt B. M., Proteins 18, 133 (1994). doi:10.1002/prot.340180206

Articles from The Journal of Chemical Physics are provided here courtesy of American Institute of Physics
