Author manuscript; available in PMC 2016 Jan 11.
Published in final edited form as: Multiscale Model Simul. 2015;13(4):1327–1353. doi: 10.1137/140981587

CONSTRUCTING SURROGATE MODELS OF COMPLEX SYSTEMS WITH ENHANCED SPARSITY: QUANTIFYING THE INFLUENCE OF CONFORMATIONAL UNCERTAINTY IN BIOMOLECULAR SOLVATION*

H. Lei, X. Yang, B. Zheng, G. Lin, N. A. Baker†,§
PMCID: PMC4707684  NIHMSID: NIHMS748533  PMID: 26766929

Abstract

Biomolecules exhibit conformational fluctuations near equilibrium states, inducing uncertainty in various biological properties in a dynamic way. We have developed a general method to quantify the uncertainty of target properties induced by conformational fluctuations. Using a generalized polynomial chaos (gPC) expansion, we construct a surrogate model of the target property with respect to varying conformational states. To alleviate the high-dimensionality of the corresponding stochastic space, we propose a method to increase the sparsity of the gPC expansion by defining a set of conformational “active space” random variables. With the increased sparsity, we employ the compressive sensing method to accurately construct the surrogate model. We demonstrate the performance of the surrogate model by evaluating fluctuation-induced uncertainty in solvent-accessible surface area for the bovine trypsin inhibitor protein system and show that the new approach offers more accurate statistical information than standard Monte Carlo approaches. Furthermore, the constructed surrogate model also enables us to directly evaluate the target property under various conformational states, yielding a more accurate response surface than standard sparse grid collocation methods. In particular, the new method provides higher accuracy in high-dimensional systems, such as biomolecules, where sparse grid performance is limited by the accuracy of the computed quantity of interest. Our new framework is generalizable and can be used to investigate the uncertainty of a wide variety of target properties in biomolecular systems.

Keywords: uncertainty quantification, biomolecular conformation fluctuation, polynomial chaos, compressive sensing method, model reduction

1. Introduction

Biomolecular structures are inherently uncertain due to thermal fluctuations and experimental limits in structural characterization. At equilibrium, a biomolecule samples an ensemble of states governed by an energy landscape. For a biomolecule with well-defined native structure at an energetic global minimum, these states are generally located in the neighborhood of the native structure. While the native equilibrium structure of a biomolecule provides essential insight, it is also important to understand conformational fluctuations of biomolecular systems and their impact on molecular properties. In particular, it is of great interest to accurately quantify the uncertainty in these properties caused by stochastic conformational fluctuations.

Molecular dynamics (MD) simulations offer a powerful tool for examining the influence of conformational uncertainty on biomolecular properties [16, 2]. Over the past few decades, this approach has benefited from great progress in the development of accurate empirical force fields as well as efficient simulation algorithms [38]. However, despite these advances, MD remains a computationally expensive simulation approach, particularly for large biomolecular complexes. Moreover, properties calculated from finite-duration MD simulations are plagued by uncertainty due to non-ergodic sampling. Many coarse-grained (CG) models and methods have been developed to facilitate molecular simulation at larger length scales and longer time scales. One popular approach is the elastic network model (ENM), which involves a harmonic approximation of the molecular energy landscape. It has been observed that the low-frequency normal modes of a biomolecular system can be reproduced using a single-parameter Hookean potential between neighboring residues [49, 25, 47]. In particular, by modeling only interactions between neighboring α-carbons (Cα), ENMs are able to predict structural fluctuations (e.g., Debye-Waller or B-factors) with surprising accuracy [25, 4].

The simplified potentials used by CG models such as the ENM allow us to examine structural fluctuations in a semi-analytical manner. However, no analytical formula leads directly from the structural fluctuations to target biomolecular properties computed from the structure. Instead, given a specific biomolecular conformation (e.g., one snapshot of the fluctuating biomolecular structure), we still need further numerical computation to obtain the target properties. This leads to an important practical question: how do we utilize the stochastic information obtained from these models to efficiently quantify the uncertainty of the target property induced by the biomolecular conformational (structural) fluctuation? In many applications, a single native conformation of a molecule is used when computing properties such as molecular volume and area [27, 41, 12], electrostatic and solvation properties [40, 44], titration states [3], and other quantities. However, these quantities are all sensitive to the structure of the molecule and therefore subject to uncertainty induced by conformational fluctuations. Many studies neglect this uncertainty; those which attempt to assess it are forced to resort to time-consuming Monte Carlo sampling over the numerous biomolecular conformational states.

In the present work, we address this issue by providing a general framework to quantify conformation-induced uncertainty in various biomolecular properties. In particular, we construct a surrogate model of a target quantity in terms of the molecular conformational states. The constructed surrogate model enables us to efficiently evaluate the statistical information of the target property, e.g., its probability density function. To the best of our knowledge, this is the first demonstration of how a target property response surface, including property uncertainty, can be directly evaluated from the biomolecular conformational distribution.

To construct the surrogate model, we adopt the generalized polynomial chaos (gPC) [21, 53] and formulate the target property as an expansion in a set of gPC basis functions determined by the specific conformational states, where the gPC coefficients are determined by the values of the target property at a number of sampled conformational states. Within this framework, numerical quantification of the conformation-induced uncertainty is formulated as the following problem: how can we accurately and efficiently construct the gPC-based surrogate model of the target property using limited sampling points within the high-dimensional conformational space? Several probabilistic collocation methods (PCM) such as ANOVA [31, 17, 58, 55] and sparse grid methods [52, 19, 18, 36, 30] have been proposed to accurately construct gPC expansions by selecting specific collocation points for sampling. However, there are two fundamental barriers to directly applying these approaches to high-dimensional biomolecular systems with hundreds to thousands of degrees of freedom in CG representations. The first barrier is the required number of sampling points, which can be too large for any gPC approach beyond a linear approximation. Moreover, empirical evidence indicates that sparse grid methods are often limited to dimensions less than ~40 (e.g., see [37]). The second barrier is the limited accuracy in the calculation of target properties, even in the absence of structural uncertainty. For example, many calculations related to biomolecular solvation properties are subject to errors in the discretization and numerical solution of the associated partial differential equations [5, 26]. The error between the true and computed values of these target properties can lead to erroneous results due to the inhomogeneous weight distribution over the sampling points, as illustrated in this paper.
To circumvent these difficulties, we adopt an alternative noncollocation method based on compressive sensing [11, 14, 7], which reduces the influence of the limited accuracy of the target property while taking advantage of the sparsity of the gPC expansion. The compressive sensing method was initially proposed for signal processing and later applied to a wide range of applications, including uncertainty quantification frameworks [29, 15, 54, 56].

2. Stochastic model

In this section, we briefly introduce a semi-analytical stochastic model based on the elastic network model presented in [49, 25, 47]. The resulting harmonic system yields a Gaussian probability distribution for conformational states [4] that is straightforward to use in stochastic models for uncertainty quantification. In addition to the dimensionality reduction provided by the coarse-grained ENM, we note that further dimensionality reduction can be obtained for biomolecular target properties that have local dependence on structure; i.e., where the values associated with a particular property depend only on a subset of atoms in the molecule.

2.1. Full stochastic model of conformational fluctuation

We construct the stochastic conformation space of the biomolecular system based on the coarse-grained (CG) anisotropic network model (ANM) [4], a variant of the ENM where each amino acid residue is modeled as a single CG particle connected to neighboring residues by anisotropic harmonic potentials. The ANM can be viewed as a simplified CG model of normal mode analysis [22, 6, 28, 32], where the model potential does not rely on a complex atomic-detail force field. Consider a biomolecule of N residues; we denote the 3N-dimensional equilibrium position vector by $\bar{R}^T = [\bar{r}_1^T\, \bar{r}_2^T \cdots \bar{r}_N^T]$, where $\bar{r}_i$ is a 3-dimensional vector representing the equilibrium position of residue i. Similarly, we denote the 3N-dimensional instantaneous position vector by $R^T = [r_1^T\, r_2^T \cdots r_N^T]$, where $r_i$ represents the instantaneous position of residue i. The fluctuation vector can then be defined by $\Delta R = R - \bar{R}$. The harmonic approximation for the potential energy V with respect to the instantaneous position R is given by

$$V(R) = \frac{\gamma}{2} \sum_{i<j} \left(r_{ij} - \bar{r}_{ij}\right)^2 h\!\left(r_c - \bar{r}_{ij}\right), \qquad (2.1)$$

where $\bar{r}_{ij}$ and $r_{ij}$ represent the equilibrium and instantaneous distances between residues i and j, γ is a model parameter representing the elastic coefficient of the harmonic potential, $r_c$ is the cut-off distance of the harmonic potential, and h is the Heaviside function.

Given the potential defined by Eq. (2.1), the 3N × 3N Hessian matrix has the form

$$H = \begin{pmatrix} H_{11} & H_{12} & \cdots & H_{1N} \\ H_{21} & H_{22} & \cdots & H_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ H_{N1} & H_{N2} & \cdots & H_{NN} \end{pmatrix}$$

with the element Hij defined by

$$H_{ij} = \begin{pmatrix} \partial^2 V/\partial X_i \partial X_j & \partial^2 V/\partial X_i \partial Y_j & \partial^2 V/\partial X_i \partial Z_j \\ \partial^2 V/\partial Y_i \partial X_j & \partial^2 V/\partial Y_i \partial Y_j & \partial^2 V/\partial Y_i \partial Z_j \\ \partial^2 V/\partial Z_i \partial X_j & \partial^2 V/\partial Z_i \partial Y_j & \partial^2 V/\partial Z_i \partial Z_j \end{pmatrix},$$

where $X_i$, $Y_i$, and $Z_i$ represent the Cartesian coordinates of residue i. We note that the rank of H is 3N − 6 since V is translationally and rotationally invariant. This harmonic form for the potential leads to Gaussian statistics for the conformational probability distribution (e.g., individual residue position distributions). The correlation between individual residue fluctuations can be determined from the pseudo-inverse of the Hessian matrix H as [4]

$$C = \mathbb{E}\!\left[\Delta R\, \Delta R^T\right] = \frac{k_B T}{\gamma} H^{-1}, \qquad (2.2)$$

where $\mathbb{E}[\cdot]$ denotes the expectation, $k_B$ is the Boltzmann constant, and T is the temperature.

We perform an eigendecomposition of H

$$H = W \Lambda W^T, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{3N-6}), \qquad (2.3)$$

where $\lambda_i$ is the i-th nonzero eigenvalue of H and W is a 3N × (3N − 6) matrix defined by

$$W = [w_1\ w_2\ \cdots\ w_{3N-6}], \qquad (2.4)$$

where $w_i$ is the corresponding i-th eigenvector of H. Then the correlation matrix can be written as

$$C = \frac{k_B T}{\gamma} W \Lambda^{-1} W^T = U U^T, \qquad (2.5)$$

where $U = \left(\frac{k_B T}{\gamma}\right)^{1/2} W \Lambda^{-1/2}$. The stochastic conformation space is then given by

$$R(\xi) = \bar{R} + \Delta R(\xi), \qquad (2.6a)$$
$$\Delta R(\xi) = U \xi, \qquad (2.6b)$$

where $\xi = (\xi_1, \xi_2, \cdots, \xi_{3N-6})^T$ is a vector of independent and identically distributed (i.i.d.) standard Gaussian random variables. Given a value of ξ, the corresponding coarse-grained conformation is fully determined by Eq. (2.6), allowing us to calculate target properties, denoted by X(ξ).
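The construction in Eqs. (2.1)–(2.6) can be sketched compactly in NumPy. This is an illustrative implementation, not the authors' code: the function names, the unit defaults for γ and $k_B T$, and the eigenvalue tolerance used to discard the six rigid-body modes are all assumptions.

```python
import numpy as np

def anm_factor(R_bar, gamma=1.0, kBT=1.0, rc=9.8):
    """Build the ANM Hessian (Eq. 2.1) for C-alpha positions R_bar (N x 3)
    and return U with C = U U^T as in Eq. (2.5)."""
    N = R_bar.shape[0]
    H = np.zeros((3 * N, 3 * N))
    for i in range(N):
        for j in range(i + 1, N):
            d = R_bar[j] - R_bar[i]
            r = np.linalg.norm(d)
            if r < rc:                       # Heaviside cutoff h(rc - r_ij)
                block = -np.outer(d, d) / r**2
                H[3*i:3*i+3, 3*j:3*j+3] = block
                H[3*j:3*j+3, 3*i:3*i+3] = block
                H[3*i:3*i+3, 3*i:3*i+3] -= block
                H[3*j:3*j+3, 3*j:3*j+3] -= block
    lam, W = np.linalg.eigh(H)
    keep = lam > 1e-8 * lam.max()            # drop the 6 zero (rigid-body) modes
    return np.sqrt(kBT / gamma) * W[:, keep] / np.sqrt(lam[keep])

def sample_conformation(R_bar, U, rng):
    """Draw one conformation R(xi) = R_bar + U xi (Eq. 2.6)."""
    xi = rng.standard_normal(U.shape[1])
    return R_bar + (U @ xi).reshape(-1, 3)
```

For a generic, fully connected structure the retained spectrum has dimension 3N − 6, matching the rank argument above.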

2.2. Reduced-dimensionality stochastic model for conformational fluctuations

The dimension of the stochastic conformation space constructed by Eq. (2.6) is 3N − 6, which can still be high when N is large. However, we note that some target properties of interest, particularly those related to a specific residue p, may depend only locally on the neighboring residues' conformations rather than on all degrees of freedom in the biomolecule. For example, the solvent accessible surface area (SASA) of a specific residue p depends only on the positions of the residue itself and its neighboring residues within a certain cutoff distance $r_c^{\{p\}}$. Under such circumstances, the full position fluctuation correlation matrix C can be replaced by a sparser 3N × 3N matrix C′ whose (i, j) element (a 3 × 3 block) is given by

$$C'_{ij} = C_{ij}\, h\!\left(r_c^{\{p\}} - r_{ip}\right) h\!\left(r_c^{\{p\}} - r_{jp}\right), \qquad (2.7)$$

where $r_{ip} = |r_p - r_i|$, $r_{jp} = |r_p - r_j|$, and $r_c^{\{p\}}$ is a cut-off distance for residue p such that the target property $X^{\{p\}}$ is independent of residue i if $r_{ip} > r_c^{\{p\}}$.

Fig. 2.1 illustrates this dimensionality reduction procedure for local properties as discussed above. Similar to Eq. (2.6), we can construct the reduced stochastic conformation space by

$$C^{\{p\}} = U^{\{p\}} U^{\{p\}T}, \qquad (2.8a)$$
$$R^{\{p\}}(\xi^{\{p\}}) = \bar{R}^{\{p\}} + U^{\{p\}} \xi^{\{p\}}, \qquad (2.8b)$$

where $\xi^{\{p\}}$ is a d-dimensional vector of i.i.d. standard normal random variables.
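A minimal sketch of this masking-and-factoring step, assuming the full correlation matrix C and the equilibrium positions are given; the function name and the eigendecomposition used to factor the retained block $C^{\{p\}} = U^{\{p\}} U^{\{p\}T}$ are illustrative choices, not the authors' implementation.

```python
import numpy as np

def reduce_correlation(C, R_bar, p, rc_p):
    """Apply the Heaviside mask of Eq. (2.7) around residue p and factor
    the retained block as in Eq. (2.8): C_p = U_p U_p^T."""
    dist = np.linalg.norm(R_bar - R_bar[p], axis=1)
    keep = np.flatnonzero(dist < rc_p)            # residue p and its neighbors
    idx = (3 * keep[:, None] + np.arange(3)).ravel()
    C_p = C[np.ix_(idx, idx)]
    lam, V = np.linalg.eigh(C_p)
    U_p = V * np.sqrt(np.clip(lam, 0.0, None))    # C_p is positive semidefinite
    return U_p, keep
```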

Fig. 2.1.


Sketch of a typical reduced property correlation matrix. The square on the left hand side represents the full correlation matrix. Each block represents 9 elements (in x, y and z directions) of a residue in the correlation matrix C. The blocks in blue color represent the matrix elements associated with some local target property X. The square on the right hand side represents the reduced correlation matrix C{p} with lower dimensionality.

In summary, the value of a target property X is determined by the specific conformational state, corresponding to a point ξ (or $\xi^{\{p\}}$) in the full (or reduced) random space. Our goal is to systematically quantify the uncertainty in X with respect to the conformational fluctuations through a gPC expansion, as introduced in the next section. The rest of the manuscript focuses on local properties, so we will omit the superscript {p} in the following text and use X(ξ) and ξ to represent the target property and the d-dimensional random vector, respectively.

3. Numerical methods

In this section, we first review the generalized polynomial chaos (gPC) expansion with a brief discussion of possible difficulties with probabilistic collocation methods. Next, we introduce a noncollocation method to construct the gPC expansion based on compressive sensing. We note that the sparsity of the gPC coefficients affects the performance of the compressive sensing method (see [10]). Hence, we propose a method to increase the sparsity of the gPC expansion by defining a new set of random variables according to the directions of variability in the target properties.

3.1. gPC expansion and collocation method

We use the gPC expansion to construct the surrogate model of the target property X with respect to the random vector ξ parameterizing the molecular conformation:

$$X(\xi) = \sum_{|\alpha|=0}^{\infty} c_\alpha \psi_\alpha(\xi), \qquad (3.1a)$$
$$\psi_\alpha(\xi) = \psi_{\alpha_1}(\xi_1)\, \psi_{\alpha_2}(\xi_2) \cdots \psi_{\alpha_d}(\xi_d), \qquad \alpha_i \in \mathbb{N} \cup \{0\}, \qquad (3.1b)$$

where d is the number of random variables, $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_d)$ is a multi-index, and $c_\alpha$ is the gPC coefficient to be determined. The $\psi_{\alpha_i}(\xi_i)$ are univariate normalized Hermite polynomials, which satisfy the orthonormality condition:

$$\int \psi_k(\xi_i)\, \psi_l(\xi_i)\, \rho(\xi_i)\, d\xi_i = \delta_{kl}, \qquad k, l \in \mathbb{N} \cup \{0\}, \qquad (3.2)$$

where $\delta_{kl}$ is the Kronecker delta and $\rho(\xi_i) = \frac{1}{\sqrt{2\pi}} e^{-\xi_i^2/2}$ is the standard normal density function. The ENM described in Sec. 2.1 relies on d i.i.d. standard normal random variables; hence the gPC basis functions are constructed as tensor products of univariate normalized Hermite polynomials, as shown in Eq. (3.1b).
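The orthonormality condition (3.2) is easy to verify numerically. The sketch below (an illustration, not part of the paper) builds the normalized probabilists' Hermite polynomials from NumPy's `hermite_e` module and checks Eq. (3.2) with Gauss-Hermite quadrature for the weight $e^{-\xi^2/2}$; the quadrature weights sum to $\sqrt{2\pi}$, so the normal-density normalization is divided out.

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial.hermite_e import hermeval, hermegauss

def psi(n, x):
    """Normalized probabilists' Hermite polynomial He_n(x) / sqrt(n!)."""
    coeff = np.zeros(n + 1)
    coeff[n] = 1.0
    return hermeval(x, coeff) / sqrt(factorial(n))

# Check <psi_k, psi_l> = delta_kl under the standard normal density.
x, w = hermegauss(20)                 # nodes/weights for weight exp(-x^2/2)
for k in range(4):
    for l in range(4):
        inner = np.sum(w * psi(k, x) * psi(l, x)) / sqrt(2.0 * pi)
        assert abs(inner - (1.0 if k == l else 0.0)) < 1e-10
```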

We truncate the expansion (3.1) at polynomial order P, so that X is approximated as:

$$X(\xi) \approx \tilde{X}(\xi) = \sum_{|\alpha|=0}^{P} c_\alpha \psi_\alpha(\xi), \qquad (3.3)$$

using a total of n gPC terms with n = (P + d)!/(P! d!).
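The number of terms grows combinatorially in P and d; a quick check of the formula via the binomial identity n = C(P + d, d), using the 27-dimensional case treated later in Sec. 4:

```python
from math import comb

def num_gpc_terms(P, d):
    """n = (P + d)! / (P! d!) gPC basis functions of total order <= P."""
    return comb(P + d, d)

assert num_gpc_terms(2, 27) == 406     # quadratic expansion, d = 27
assert num_gpc_terms(3, 27) == 4060    # cubic expansion, d = 27
```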

Ideally, we would construct the truncated gPC expansion of X(ξ) by computing cα using the orthonormality of ψα; e.g.,

$$c_\alpha = \int X(\xi)\, \psi_\alpha(\xi)\, \rho(\xi)\, d\xi, \qquad (3.4)$$

where ρ(ξ) is the probability density function (PDF) of ξ.

The integration can be accomplished by utilizing probabilistic collocation approaches such as tensor product [42, 43] or sparse grid [52, 19, 18] methods. Specifically, by evaluating X at specific collocation points $\xi^1, \xi^2, \cdots, \xi^S$, we have

$$c_\alpha = \int X(\xi)\, \psi_\alpha(\xi)\, \rho(\xi)\, d\xi \approx \sum_{i=1}^{S} X(\xi^i)\, \psi_\alpha(\xi^i)\, w^i, \qquad (3.5)$$

where $w^i$ is the quadrature weight associated with collocation point $\xi^i$.

However, for the high-dimensional biomolecular systems considered in the present work, the required number of collocation points S can be computationally intractable. For example, a small biomolecular system with a 27-dimensional reduced conformational random space would require $S = 7.6 \times 10^{12}$ tensor product collocation points to construct a quadratic-order gPC expansion. A standard sparse grid method based on Gaussian quadrature and the Smolyak construction reduces this to 1513 sampling points; however, the required number of sampling points is fixed for each order of the gPC approximation (e.g., 27829 sampling points for a 3rd-order approximation), which makes it difficult to incorporate adaptive sampling strategies.
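The tensor-product figure quoted above follows from using P + 1 = 3 Gauss quadrature nodes per dimension (an assumption consistent with a quadratic-order rule):

```python
# 3 Gauss nodes per dimension in a 27-dimensional space:
S_tensor = 3 ** 27
assert S_tensor == 7_625_597_484_987   # ~7.6e12, as quoted above
```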

We also note that X is generally subject to numerical error; e.g.,

$$X(\xi) = \bar{X}(\xi) + \phi, \qquad (3.6)$$

where $\bar{X}(\xi)$ is the true value of the target property and ϕ represents the numerical error. In this work, we require ϕ to satisfy

$$|\phi| \ll |X|, \qquad (3.7)$$

so that we can systematically study the accuracy of the constructed surrogate model using different numbers of sampling data. In general, the condition |ϕ| ≪ |X| is not essential for applying the gPC expansion to construct the surrogate model. However, ϕ provides a lower bound on the numerical error of the surrogate model; i.e., we should not expect the error of the surrogate model to be smaller than the error carried by the sampling data.

Moreover, as will be shown in Sec. 4, the aliasing error and the numerical error ϕ associated with X may lead to poor approximations of $c_\alpha$ by probabilistic collocation methods even if Eq. (3.7) is satisfied. To overcome these difficulties, we compute the gPC expansion by applying compressive sensing, as described in Sec. 3.2.

3.2. Compressive sensing method

To construct the gPC expansion in Eq. (3.3), we compute X(ξ) at M sampling points $\xi^1, \xi^2, \cdots, \xi^M$, which are generated according to the distribution of the random variables ξ. In the present work, ξ is a vector of d i.i.d. standard normal random variables. We discretize Eq. (3.3) as a linear system

$$\begin{pmatrix} \psi_{\alpha_1}(\xi^1) & \psi_{\alpha_2}(\xi^1) & \cdots & \psi_{\alpha_n}(\xi^1) \\ \psi_{\alpha_1}(\xi^2) & \psi_{\alpha_2}(\xi^2) & \cdots & \psi_{\alpha_n}(\xi^2) \\ \vdots & \vdots & \ddots & \vdots \\ \psi_{\alpha_1}(\xi^M) & \psi_{\alpha_2}(\xi^M) & \cdots & \psi_{\alpha_n}(\xi^M) \end{pmatrix} \begin{pmatrix} c_{\alpha_1} \\ c_{\alpha_2} \\ \vdots \\ c_{\alpha_n} \end{pmatrix} = \begin{pmatrix} X(\xi^1) \\ X(\xi^2) \\ \vdots \\ X(\xi^M) \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_M \end{pmatrix}$$

or equivalently,

$$\Psi c = X + \varepsilon, \qquad (3.8)$$

where Ψ is the “measurement matrix” with entries $\Psi_{i,j} = \psi_{\alpha_j}(\xi^i)$, $c = (c_{\alpha_1}, c_{\alpha_2}, \cdots, c_{\alpha_n})^T$ is the vector of gPC coefficients, $X = (X(\xi^1), X(\xi^2), \cdots, X(\xi^M))^T$ is the vector of outputs, and $\varepsilon = (\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_M)^T$ is related to the truncation error.

Notice that Ψ is an M × n matrix, and we are interested in the case when M < n or even M ≪ n. It is shown in [10] that if Ψ satisfies the restricted isometry property (RIP), we can estimate c by solving the following optimization problem:

$$(P_{1,\epsilon}): \quad \arg\min_{c^*} \|c^*\|_1 \quad \text{subject to} \quad \|\Psi c^* - X\|_2 \le \epsilon, \qquad (3.9)$$

where $\epsilon = \|\varepsilon\|_2$. The upper bound of the error $\|c - c^*\|_2$ is determined by ε and the sparsity of c:

$$\|c - c^*\|_2 \le C_1 \epsilon + C_2 \frac{\|c - c_s\|_1}{\sqrt{s}}, \qquad (3.10)$$

where $C_1$, $C_2$ are constants, s is a positive integer, and $c_s$ is c with all but the s largest entries set to zero. For c in the present work, “sparse” means small $\|c - c_s\|_1$ with s smaller (or much smaller) than the length of c. The $(P_{1,\epsilon})$ optimization problem can be solved using classical convex optimization solvers (e.g., CVX [24]), sparse recovery software packages (e.g., the SPGL1 package [50], ℓ1-MAGIC [1]), or the split Bregman method [23, 57, 9, 8]. In this paper, we use SPGL1.
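As a concrete illustration of sparse recovery from M < n samples, the sketch below uses a plain iterative soft-thresholding (ISTA) loop for the closely related LASSO problem as a simple stand-in for the $(P_{1,\epsilon})$ solvers named above; the random Gaussian matrix, the threshold for support detection, and the final de-biasing least-squares step are all illustrative choices, not part of the paper's procedure.

```python
import numpy as np

def ista_l1(Psi, X, lam=1e-3, n_iter=5000):
    """ISTA for the LASSO problem min_c 0.5*||Psi c - X||_2^2 + lam*||c||_1,
    a simple stand-in for basis-pursuit-denoise solvers such as SPGL1."""
    L = np.linalg.norm(Psi, 2) ** 2            # Lipschitz constant of gradient
    c = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        z = c - Psi.T @ (Psi @ c - X) / L      # gradient step
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return c

# Recover a 3-sparse coefficient vector from M < n random "measurements".
rng = np.random.default_rng(1)
M, n = 60, 200
Psi = rng.standard_normal((M, n)) / np.sqrt(M)
c_true = np.zeros(n)
c_true[[3, 17, 90]] = [1.0, -2.0, 0.5]
X = Psi @ c_true
c_hat = ista_l1(Psi, X)
supp = np.abs(c_hat) > 0.05                    # de-bias on detected support
c_hat[~supp] = 0.0
c_hat[supp] = np.linalg.lstsq(Psi[:, supp], X, rcond=None)[0]
```

For random Gaussian Ψ with M = 60 and sparsity s = 3, the RIP-type conditions discussed above hold with high probability and the support is recovered exactly.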

To solve Eq. (3.9), we need the value of ε, which is generally not known a priori. In this work, we estimate ε using a cross-validation method [15, 56]. We first divide the M sampling points into two parts of sizes $M_r$ (reconstruction) and $M_v$ (validation). Second, c is computed from the $M_r$ sample points for a chosen series of error tolerances $\epsilon_r$. Next, an optimized estimate $\hat{\epsilon}_r$ is determined such that $\|\Psi_v c - X_v\|_2$ is minimized, where $\Psi_v$ and $X_v$ represent the submatrix of Ψ and the subvector of X corresponding to the validation portion of the sampling data. Finally, we repeat the above process for different replicas of the sample points and determine the optimal ε as $\epsilon = \sqrt{M/M_r}\, \hat{\epsilon}_r$. In this work, we set $M_r = 2M/3$ and performed the cross-validation for three replications. We note that verifying the RIP for a given matrix is an NP-hard problem; the aforementioned cross-validation procedure also serves as a verification for applying the ℓ1 minimization method to approximate c, as it estimates the error ε.

3.3. Sparsity recovery via a “renormalized active” random space

The performance of the compressive sensing method introduced above is closely related to the ratio between the number of sampling points M and the number of basis functions n, as well as to the sparsity of the linear system in Eq. (3.8). In general, accuracy improves with either larger M/n ratios or sparser target vectors c. One way to increase M/n (for a given M) is to reduce the dimension of the stochastic space and hence reduce n. Unfortunately, for biomolecular systems, the dimension of the stochastic conformation space is determined by the structure of the molecule and is not always amenable to direct reduction. Constantine et al. [13] have developed an alternative approach that can be used to increase sparsity through analysis of the variability in the target properties. For the target X(ξ) with respect to the PDF ρ(ξ), we define the gradient matrix G by [13]

$$G = \mathbb{E}\!\left[\nabla X(\xi)\, \nabla X(\xi)^T\right], \qquad (3.11)$$

where ∇X(ξ) is the gradient vector $\nabla X(\xi) = \left(\frac{\partial X}{\partial \xi_1}, \frac{\partial X}{\partial \xi_2}, \cdots, \frac{\partial X}{\partial \xi_d}\right)^T$. We conduct the eigendecomposition

$$G = Q K Q^T, \qquad Q = [q_1\ q_2\ \cdots\ q_d], \qquad (3.12a)$$
$$K = \mathrm{diag}(k_1, \ldots, k_d), \qquad k_1 \ge \cdots \ge k_d \ge 0, \qquad (3.12b)$$

where $q_i$ is the i-th eigenvector of G. The target property X exhibits the largest variability along the direction $q_1$ and the smallest variability along the direction $q_d$. This motivates the definition of a new random vector

$$\chi = Q^T \xi, \qquad (3.13)$$

where Q is unitary and $\chi = (\chi_1, \chi_2, \cdots, \chi_d)^T$ are i.i.d. Gaussian variables since ξ are i.i.d. Gaussian (similar to Ref. [48]). The dependence of the target property X on $\chi_i$ decreases from $\chi_1$ to $\chi_d$.

Therefore, if we represent X by a gPC expansion with respect to χ, X may depend primarily on the first few random variables. The gPC coefficients associated with the remaining variables then exhibit much smaller values (or are even close to 0), yielding a sparser c for the linear system defined in Eq. (3.8). Hence, if we recover the gPC coefficients with respect to χ in Eq. (3.8) by the compressive sensing method, we expect more accurate results than by directly recovering the gPC coefficients with respect to ξ.

Unfortunately, the gradient vector ∇X(ξ) is generally not known a priori, and direct evaluation of $\mathbb{E}[\nabla X(\xi) \nabla X(\xi)^T]$ is computationally expensive: the cost of evaluating ∇X(ξ) is proportional to the dimension of ξ. Therefore, we approximate ∇X(ξ) via the gPC expansion recovered with respect to ξ; e.g.,

$$G \approx \mathbb{E}\!\left[\nabla X_{gPC}(\xi)\, \nabla X_{gPC}(\xi)^T\right], \qquad (3.14a)$$
$$X_{gPC}(\xi) = \sum_{|\alpha|=0}^{P} c_\alpha^{\{\xi\}} \psi_\alpha(\xi), \qquad (3.14b)$$

where the superscript {ξ} denotes gPC coefficients directly recovered with respect to ξ. The evaluation of $\mathbb{E}[\nabla X_{gPC}(\xi) \nabla X_{gPC}(\xi)^T]$ with respect to the PDF ρ(ξ) is straightforward and can be used to define Q and, therefore, the new random vector $\chi = Q^T \xi$. Finally, the basis functions associated with the new random variables χ can be used to reconstruct the gPC expansion of X with respect to χ, which, in general, yields greater sparsity.
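A toy illustration of the rotation in Eq. (3.13): if a target depends on ξ only through a single direction a (here $X(\xi) = (a \cdot \xi)^2$, a hypothetical choice, with its gradient known in closed form), the gradient matrix G is rank one and its dominant eigenvector recovers ±a, so all variability concentrates in one rotated variable.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
a = rng.standard_normal(d)
a /= np.linalg.norm(a)

def grad_X(xi):
    """Gradient of X(xi) = (a . xi)^2, i.e., 2 (a . xi) a."""
    return 2.0 * (a @ xi) * a

# Monte Carlo estimate of G = E[grad X grad X^T] (Eq. 3.11).
samples = rng.standard_normal((10000, d))
G = np.mean([np.outer(grad_X(x), grad_X(x)) for x in samples], axis=0)
k, Q = np.linalg.eigh(G)          # ascending eigenvalues; Q gives Eq. (3.13)
q1 = Q[:, -1]                     # dominant variability direction
assert min(np.linalg.norm(q1 - a), np.linalg.norm(q1 + a)) < 1e-6
```

After the rotation χ = Q^T ξ, this toy target depends on a single component of χ, which is exactly the sparsity-enhancing effect exploited above.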

Remark 3.1

We do not reduce the dimension of the conformational space in the above procedure. Instead, we define a new basis spanning the random space based on the variability directions of the target property. This set of basis functions is not universal; it depends on the specific target property X.

Remark 3.2

The gradient matrix G is approximated by Eq. (3.14). Therefore, the eigenvectors $[q_1\ q_2\ \cdots\ q_d]$ may not correspond exactly to the steepest-decay directions of variability for the target property X. Nevertheless, we adopt Eq. (3.14) to construct a “rotated” space that provides larger (if not optimal) sparsity.

Remark 3.3

Notice that ξ are i.i.d. Gaussian random variables, and so are χ because the matrix Q is unitary; the new basis functions associated with χ are therefore still tensor products of Hermite polynomials, i.e., of the same form as in Eq. (3.1).

We summarize the entire procedure presented above (Sections 2 and 3) in Algorithm 1. In the next section, we apply this framework to quantify uncertainty in biomolecular solvent accessible surface area in the presence of conformational fluctuations.

Algorithm 1

[Procedure to construct the gPC response surface of a given target quantity X with respect to a stochastic biomolecular conformation space.]

  • Step 1. For a biomolecular system, we model the potential energy using the harmonic elastic network approach so the conformation fluctuation is Gaussian-distributed. We construct the full stochastic conformation space given in Eq. (2.6). For “local” target properties, we further reduce the dimension of the stochastic conformation space as in Eq. (2.8). We conduct eigenvalue decomposition of the correlation matrix and represent the fluctuation by a d-dimensional i.i.d. standard normal random vector denoted by ξ.

  • Step 2. Generate M sampling points $\xi^1, \xi^2, \cdots, \xi^M$ based on the distribution of ξ. Numerically compute X at $\xi^1, \xi^2, \cdots, \xi^M$ to obtain M outputs $X_1, X_2, \cdots, X_M$ (where $X_q = X(\xi^q)$). Denote $X = (X_1, X_2, \cdots, X_M)^T$ as the “observation” in $(P_{1,\epsilon})$. The “measurement matrix” Ψ is constructed as $\Psi_{i,j} = \psi_{\alpha_j}(\xi^i)$, where the $\psi_{\alpha_j}$ are the basis functions. The size of Ψ is M × n, where n is the total number of basis functions, depending on P in (3.3).

  • Step 3. Set the tolerance ε in (P1,ε) by employing the cross-validation method.

  • Step 4. Solve the ℓ1 minimization problem
    $$\arg\min_{c^*} \|c^*\|_1 \quad \text{subject to} \quad \|\Psi c^* - X\|_2 \le \epsilon$$
    to obtain the gPC coefficients $c^{\{\xi\}}$.
  • Step 5. Evaluate the gradient matrix $G = \mathbb{E}[\nabla X_{gPC}(\xi) \nabla X_{gPC}(\xi)^T]$ given $c^{\{\xi\}}$, and define the random vector χ by Eq. (3.13). Compute the samples of χ as $\chi^q = Q^T \xi^q$, q = 1, ⋯, M.

  • Step 6. Construct the new “measurement matrix” $\tilde{\Psi}$ by setting $\tilde{\Psi}_{ij} = \psi_{\alpha_j}(\chi^i)$. Construct the gPC expansion of X(χ) by repeating steps 3–4 on the random vector χ, using the outputs $X = (X_1, X_2, \cdots, X_M)^T$ already determined in step 2.

4. Numerical Results

As an example, we apply our method to quantify the uncertainty in solvent-accessible surface area (SASA) caused by conformational fluctuations in the biomolecule bovine pancreatic trypsin inhibitor (PDB code: 5pti) [51], shown in Fig. 4.1. SASA is an essential element of numerous solvation models [5, 44, 40]. The SASA of the entire molecule can be decomposed into residue-specific contributions, allowing us to explore the influence of conformational fluctuations on local area uncertainty. SASA is calculated following Shrake et al. [45] by setting $N_p$ nearly equidistant probing points on the solvent-accessible sphere of each residue and determining the SASA value for each residue from the fraction of probing points that are not buried by any of the neighboring residues. In particular, we choose $N_p \approx 2.5 \times 10^5$ such that the numerical error ϕ satisfies $|\phi|/|X| \lesssim 1.0 \times 10^{-4}$; see Sec. 4.2 for a sensitivity study of the accuracy of the constructed surrogate model under different magnitudes of numerical error.
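A simplified sketch of this style of SASA computation for a set of equal spheres, using a Fibonacci lattice to generate nearly equidistant probing points; this is not the paper's implementation, and only the sphere and probe radii (2.8 Å and 1.2 Å, given in Sec. 4.1) come from the text.

```python
import numpy as np

def sasa(R, radius=2.8, probe=1.2, n_points=1000):
    """Shrake-Rupley-style SASA per residue: count probing points on each
    solvent-expanded sphere that are not buried by any neighboring sphere."""
    rs = radius + probe
    i = np.arange(n_points)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # Fibonacci sphere angles
    z = 1.0 - 2.0 * (i + 0.5) / n_points
    r = np.sqrt(1.0 - z * z)
    pts = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    areas = np.empty(len(R))
    for k, center in enumerate(R):
        surf = center + rs * pts
        buried = np.zeros(n_points, dtype=bool)
        for j, other in enumerate(R):
            if j != k:
                buried |= np.linalg.norm(surf - other, axis=1) < rs
        areas[k] = 4.0 * np.pi * rs**2 * np.mean(~buried)
    return areas
```

An isolated sphere recovers the full area $4\pi(radius + probe)^2$ exactly, while overlapping neighbors reduce the accessible fraction.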

Fig. 4.1.


(a) Tube diagram of the equilibrium structure of bovine pancreatic trypsin inhibitor (PDB code: 5pti) with spheres denoting the residue Cα positions. (b) Tube diagrams of the molecule representing instantaneous conformational states under thermal fluctuation.

To demonstrate the applicability of our method in exploiting information from limited sampling data, we focus on the performance of our method when constructing a surrogate model using fewer than 2500 sample points. This performance is assessed relative to two references: a direct Monte Carlo simulation of the conformational space with $10^6$ sampling points and a surrogate constructed by the standard sparse-grid collocation method. We test our method by examining the L2 error of the model as well as the Kullback-Leibler divergence between the probability density functions obtained from this new approach and the reference data.

4.1. Surrogate model for SASA of individual residues

Fig. 4.1 shows a sketch of the CG biomolecular model under equilibrium and thermally fluctuating states. Following Ref. [4], each residue is modeled as a single α-carbon particle, as shown in Fig. 4.1(a). Due to thermal fluctuations, the molecule exhibits a distribution of conformational states in which individual residues may deviate from their equilibrium positions, as shown in Fig. 4.1(b). To model the fluctuations of individual residues, we construct the ANM correlation matrix C by Eq. (2.5) using a cut-off distance for the harmonic potential of $r_c$ = 9.8 Å. The radii of the α-carbon residue and the solvent probe were set to 2.8 Å and 1.2 Å, respectively, for the SASA calculations.

We first consider local properties and study the SASA of residue P14. Starting with the full 168-dimensional random space with correlation matrix C, we construct the local correlation matrix C′ via Eq. (2.7) by setting the neighbor cut-off distance $r_c^{\{p\}}$ to 9.5 Å. This cutoff yields 8 neighboring residues and therefore a 27-dimensional random space by Eq. (2.8). As shown in Fig. 4.2, the PDFs of the SASA of residue P14 extracted from the local and the full random conformation spaces agree well with each other, indicating that this particular property can be represented within a reduced space rather than the full 168-dimensional space. The dashed line in Fig. 4.2 represents the PDF extracted from the local random space by neglecting the fluctuation correlations between different residues (e.g., setting the off-diagonal blocks to zero). The resulting distribution is wider than that predicted by the full correlation matrix. This is not surprising, since the off-diagonal elements represent the harmonic potential contribution of molecular deformation in Eq. (2.1). Neglecting the off-diagonal block elements results in a more "flexible" molecular model that lacks these harmonic restraints and therefore exhibits a wider distribution of SASA values.

Fig. 4.2.


Probability density function of the SASA of the 14th residue obtained from the full correlation matrix C (solid line) and the local reduced correlation matrix C′ (“●” symbol). The dashed line represents the distribution obtained from the reduced correlation matrix where off-diagonal elements are set to zero.

Next, we construct the surrogate model by computing the gPC coefficients within the reduced 27-dimensional random space following the method presented in Sec. 3. First, we calculate the gPC coefficients $c_\alpha^{\{\xi\}}$ up to order P = 2 (406 basis functions) by setting M = 300 in Algorithm 1 and applying steps 1–4. Given $c_\alpha^{\{\xi\}}$, we next construct the approximate gradient matrix G by Eq. (3.14). Eigendecomposition of this matrix provides a set of rotated random variables χ by Eq. (3.13) (Step 5 in Algorithm 1). Fig. 4.3 shows the resulting normalized eigenvalues of G and of the reduced correlation matrix C′. We note that C′ is independent of the target quantity X; it is completely determined by the molecular structure. The eigenvalues of C′ decay slowly, at a rate similar to those of the full correlation matrix C (not shown in the plot), while the eigenvalues of the gradient matrix G decay much more quickly. This result indicates that, for a particular quantity X, the eigenvectors of C do not necessarily correspond to the directions with the steepest decay of variability in the target property.

Fig. 4.3.

Normalized eigenvalues of the gradient matrix G (“▼” symbol) and the correlation matrix C′ (“■” symbol).

Given the variables χ, we compute the corresponding gPC coefficients cα{χ} with order P = 2 by applying Step 6 in Algorithm 1. The results are shown in Fig. 4.4. Compared with cα{ξ}, the spectrum of cα{χ} exhibits a higher degree of sparsity, as expected. This result indicates that, with the same polynomial order, the target quantity X can be approximated using fewer gPC terms with respect to the set of random variables χ than with respect to ξ.

Fig. 4.4.

gPC coefficients (up to 2nd order) for the SASA value of the 14th residue obtained from the compressive sensing method with respect to the random vectors ξ (red) and χ (blue) with dimension d = 27.

To examine the constructed surrogate model, we compute its relative L2 error ε by

\varepsilon = \left( \frac{\int \left| X(\boldsymbol{\xi}) - \tilde{X}(\boldsymbol{\xi}) \right|^2 \rho(\boldsymbol{\xi})\, d\boldsymbol{\xi}}{\int \left| X(\boldsymbol{\xi}) \right|^2 \rho(\boldsymbol{\xi})\, d\boldsymbol{\xi}} \right)^{1/2}, \quad (4.1)

where X̃ is the gPC expansion of X given by Eq. (3.3) with coefficients cα{ξ} and cα{χ}, respectively. As X(ξ) is unknown in general, we use Monte Carlo sampling to approximate the integrals in Eq. (4.1):

\varepsilon \approx \left( \frac{\sum_{i=1}^{N_s} \left| X(\boldsymbol{\xi}_i) - \tilde{X}(\boldsymbol{\xi}_i) \right|^2}{\sum_{i=1}^{N_s} \left| X(\boldsymbol{\xi}_i) \right|^2} \right)^{1/2}, \quad (4.2)

where Ns is the number of sample points. In this work, we choose Ns = 10⁶.
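The Monte Carlo estimator in Eq. (4.2) is straightforward to implement; the sketch below (names hypothetical) takes the model and its surrogate as callables over an array of sample points:

```python
import numpy as np

def relative_l2_error(X, X_surrogate, xi_samples):
    """Monte Carlo estimate of the relative L2 error, Eq. (4.2)."""
    num = np.mean((X(xi_samples) - X_surrogate(xi_samples)) ** 2)
    den = np.mean(X(xi_samples) ** 2)
    return np.sqrt(num / den)
```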

Fig. 4.5 shows the relative L2 error of the constructed surrogate model with gPC coefficients recovered from two independent sets of sample data. For each sample set, we use 200–400 points to construct the order P = 2 gPC expansion with 406 basis functions and 500–600 sample points to construct the order P = 3 gPC expansion with 4060 basis functions. For the order P = 2 expansions, the L2 error decreases as we increase the number of sampling points from 200 to 400. For the same number of sample points, the surrogate models constructed with respect to χ exhibit smaller L2 error than those constructed with respect to ξ. In particular, given the same number of sampling points, sparser gPC coefficients c in Eq. (3.8) lead to more accurate recovery of c by the compressive sensing method, Eq. (3.9). The accuracies of the compressive sensing methods based on ξ and χ are comparable when the number of sampling points is close to the number of basis functions.
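The ℓ1-minimization in Eq. (3.9) is solved in the paper with dedicated solvers such as ℓ1-magic [1]. Purely as an illustration of why sparsity helps recovery from few samples (this is not the authors' implementation), the noise-free basis-pursuit problem min ‖c‖1 subject to Ψc = y can be posed as a linear program by splitting c = u − v with u, v ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Psi, y):
    """Solve min ||c||_1 s.t. Psi c = y as a linear program in (u, v), c = u - v."""
    M, N = Psi.shape
    res = linprog(c=np.ones(2 * N),
                  A_eq=np.hstack([Psi, -Psi]), b_eq=y,
                  bounds=[(0, None)] * (2 * N), method="highs")
    return res.x[:N] - res.x[N:]

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 40))   # 20 samples, 40 basis functions
c_true = np.zeros(40)
c_true[[3, 17]] = [2.0, -1.5]         # a sparse coefficient vector
c_rec = basis_pursuit(Psi, Psi @ c_true)
```

Although the linear system is underdetermined (20 equations, 40 unknowns), the sparse coefficient vector is typically recovered to solver tolerance, mirroring the behavior exploited in Eq. (3.9).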

Fig. 4.5.

Relative L2 error of the SASA value on residue 14 predicted by the gPC expansions X̃(ξ) and X̃(χ), where the gPC coefficients are obtained from two separate sets of sampling data, represented by (a) and (b), respectively. The symbols “▼” and “■” denote the 2nd- and 3rd-order gPC expansions by ξ. The symbols “◆” and “●” denote the 2nd- and 3rd-order gPC expansions by χ. The dash-dot and dash-dot-dot lines represent the relative L2 error of the 1st- and 2nd-order gPC expansions obtained from level-1 and level-2 sparse grid points, using 55 and 1513 sample points, respectively.

For the random variables ξ, the L2 error changes non-monotonically as we compute cα at order P = 3 with increasing numbers of sampling points. The error increases as we increase the number of sample points to 500 and then decreases as we increase the number of sample points to 600 (although it remains larger than the 400-point error). This behavior is primarily due to the fact that the number of basis functions for P = 3 is much larger than the number of sample points, so cα{ξ} is poorly recovered from insufficient samples. However, for the transformed random variables χ, cα{χ} can be accurately recovered due to the high sparsity of the gPC spectrum, with a monotonic decrease in error as the number of sampling points increases.

We also examined the surrogate model constructed by the sparse grid method based on Gaussian quadrature collocation points and the Smolyak structure, with gPC coefficients computed according to Eq. (3.5). Fig. 4.5 shows the relative L2 error of the surrogate model constructed by approximating the integral in Eq. (3.4) with the level-1 and level-2 sparse grid methods using 55 and 1513 sample points, respectively. Note that the algebraic accuracies of the level-1 and level-2 sparse grid methods we use are 3 and 5, respectively; therefore, we construct 1st-order and 2nd-order gPC expansions with the level-1 and level-2 methods, respectively. The sparse grid results show systematically larger L2 errors than the compressive sensing approach. An unexpected phenomenon is that the error of the 2nd-order expansion is larger than that of the 1st-order expansion. This behavior is explained in Sec. 4.2.

The differences between the models constructed by cα{ξ} and cα{χ} can be further illustrated by examining the response surfaces in the reduced random space shown in Fig. 4.6. This figure shows the response surfaces X̃(ξ) and X̃(χ) with respect to two random variables with the remaining 25 random variables fixed. The gPC coefficients are computed using 300 sample points with order P = 2 in both cases. For X̃(χ), we consider only the first two random variables χ1 and χ2. For X̃(ξ), we consider the random variables ξ21 and ξ22, which are associated with the largest magnitudes of the first-order gPC coefficients. In each case, we fix the remaining variables at constant values drawn from an i.i.d. standard normal distribution 𝒩(0, 1).

Fig. 4.6.

(a) The reduced response surface constructed by X̃(χ1, χ2, χ3⁰, …, χ27⁰), where (χ3⁰, …, χ27⁰) are fixed values drawn from the i.i.d. normal distribution 𝒩(0, 1). The scattered symbols (blue points) are direct numerical simulation results at stochastic points (χ1, χ2, ⋯, χ27) in ℝ27 following the i.i.d. normal distribution 𝒩27(0, 1). (b) The reduced response surface constructed by X̃(ξ1⁰, …, ξ21, ξ22, …, ξ27⁰), where (ξ1⁰, …, ξ20⁰, ξ23⁰, …, ξ27⁰) are fixed values drawn from the i.i.d. normal distribution 𝒩(0, 1). The scattered symbols (blue points) are direct numerical simulation results at points (ξ1, ξ2, ⋯, ξ27) following the i.i.d. normal distribution 𝒩27(0, 1).

The behavior of the Monte Carlo data around the response surfaces in Fig. 4.6 indicates that the variation of X depends strongly on χ1 and χ2, while its dependence on ξ21 and ξ22 is much weaker. Furthermore, most of the sampled values of X fall near the reduced response surface X̃(χ) with small deviation, while the deviations around the response surface X̃(ξ) are much larger. As expected from the rotation of the space by Eq. (3.13), this result indicates that X can be fitted fairly well using only the two variables χ1 and χ2. However, with the original random variables, the reduced response surface cannot be captured well even using the two most important variables associated with the first-order gPC expansion. Fig. 4.6 clearly illustrates that the different sparsities of c result in different accuracies for the recovered response surfaces X̃(χ) and X̃(ξ).
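Reduced response surfaces such as those in Fig. 4.6 can be generated by sweeping two coordinates over a grid while holding the remaining coordinates at fixed draws from 𝒩(0, 1); a generic sketch (names hypothetical):

```python
import numpy as np

def response_surface_2d(surrogate, free_idx, fixed_values, grid):
    """Evaluate a surrogate over a 2D grid of two free variables, holding the
    remaining coordinates at the given fixed values (cf. Fig. 4.6)."""
    a, b = free_idx
    A, B = np.meshgrid(grid, grid)
    Z = np.empty_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            x = fixed_values.copy()
            x[a], x[b] = A[i, j], B[i, j]
            Z[i, j] = surrogate(x)
    return A, B, Z
```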

To evaluate the statistical information extracted from the surrogate model, we compute the SASA PDF for target residue P14 by evaluating the constructed surrogate model at 10⁶ sample points. These results are shown in Fig. 4.7(a) and compared with a reference solution based on the PDF computed from 10⁶ direct MC sample points. The compressive sensing method with 300 sample points yields the closest approximation of the reference solution. In contrast, the PDFs constructed by the direct Monte Carlo and sparse grid methods show significant deviation from the reference solution. To quantify the numerical error of the obtained PDFs, we computed the Kullback-Leibler divergence

D_{KL} = \int \ln\!\left( \frac{f_N(X)}{f_0(X)} \right) f_N(X)\, dX \quad (4.3)

in discrete form, where fN(X) and f0(X) represent the PDFs of the numerical and reference solutions, respectively. For the compressive sensing method, DKL decreases as we increase the number of sampling points, which is consistent with the L2 error of the surrogate model (Fig. 4.5). The plateau value at 500–600 sampling points is primarily due to the finite resolution of the PDF: a sensitivity study shows that DKL between two i.i.d. sets of 10⁶ MC sample points is on the order of 10⁻⁴. In contrast, the DKL values of the PDFs obtained by the level-1 and level-2 sparse grid methods are about 20 and 450 times larger, respectively, than the results of the compressive sensing method.
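In discrete form, Eq. (4.3) can be evaluated from histograms of the two sample sets; the sketch below is one possible discretization (the bin count and the masking of unresolved bins are ad hoc choices, not taken from the paper):

```python
import numpy as np

def kl_divergence(samples_num, samples_ref, bins=100):
    """Discrete Kullback-Leibler divergence D_KL(f_N || f_0), Eq. (4.3),
    estimated from histograms over a shared range."""
    lo = min(samples_num.min(), samples_ref.min())
    hi = max(samples_num.max(), samples_ref.max())
    fN, edges = np.histogram(samples_num, bins=bins, range=(lo, hi), density=True)
    f0, _ = np.histogram(samples_ref, bins=bins, range=(lo, hi), density=True)
    dx = edges[1] - edges[0]
    mask = (fN > 0) & (f0 > 0)        # keep only bins where both PDFs are resolved
    return np.sum(fN[mask] * np.log(fN[mask] / f0[mask])) * dx
```

With finite samples the estimate has a resolution floor, consistent with the plateau in DKL noted above for two i.i.d. sample sets.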

Fig. 4.7.

(a) Probability density function (PDF) of the SASA values on residue P14 obtained from the gPC expansion X̃(χ) using 300 sampling points (“●” symbol). The reference solution (solid line) is obtained from MC sampling using 10⁶ sample points. Results of the 1st- and 2nd-order gPC expansions obtained from the level-1 (dash-dot line, 55 sample points) and level-2 (dash-dot-dot line, 1513 sample points) sparse grid methods, and results from direct MC sampling (dashed line, 300 sample points), are also presented for comparison. (b) Kullback-Leibler divergence between the PDF of the reference solution and the PDFs obtained from the gPC expansion X̃(χ) (“■” symbol) with varying numbers of sample points. Level-1 sparse grid (dash-dot line) and direct Monte Carlo (dashed line, 300 sample points) results are presented for comparison.

4.2. Error sources and sensitivity analysis

To further investigate the applicability of the numerical methods for biomolecular systems, we quantified the SASA uncertainty for two other residues P11 and P20, which have 13 and 20 neighboring residues and correspond to random conformation spaces ℝ42 and ℝ63, respectively. For each case, we constructed the surrogate model by the compressive sensing method with respect to both ξ and χ, as well as by the level-1 (1st-order gPC expansion) and level-2 (2nd-order gPC expansion) sparse grid methods. Fig. 4.8 shows the relative L2 error of the surrogate models, the PDFs of the SASA values, and the K-L divergence with respect to the reference solution.

Fig. 4.8.

(a–b) Relative L2 error of the SASA value on residues 11 (a) and 20 (b) predicted by the gPC expansions X̃(ξ) and X̃(χ). The symbols “▼” and “■” denote the 2nd- and 3rd-order gPC expansions by ξ. The symbols “◆” and “●” denote the 2nd- and 3rd-order gPC expansions by χ. The dash-dot lines represent the relative L2 error of the 1st-order gPC expansion obtained from the level-1 sparse grid method using 85 and 127 sample points. The dash-dot-dot lines represent the relative L2 error of the 2nd-order gPC expansion obtained from the level-2 sparse grid method using 3613 and 8065 sample points. (c–d) Kullback-Leibler divergence between the PDF of the reference solution and the PDFs obtained from the constructed surrogate models (“■” symbol) for residues 11 (c) and 20 (d). Level-1 sparse grid (dash-dot line) and direct Monte Carlo (dashed line, 300 sample points) results are presented for comparison.

Similar to the results for residue P14, the surrogate models constructed with respect to χ yield smaller errors than those constructed with respect to ξ. The accuracies of the ξ- and χ-based compressive sensing methods are comparable when the number of sampling points is close to the number of basis functions. However, the surrogate model constructed with respect to χ is more accurate when the number of sampling points is much smaller than the number of basis functions, e.g., when third-order gPC terms are incorporated. In particular, the surrogate model (2nd-order gPC expansion) for residue 20 constructed by the level-2 sparse grid in random space ℝ63 yields the largest deviation from the reference solution.

For the present system, the relatively large error of the surrogate model constructed by the sparse grid method (e.g., Eq. (3.5)) can be explained as follows. Given the target quantity X computed at the collocation points, the gPC coefficient cα is computed by

c_\alpha = \sum_{i=1}^{N_{sp}} w_i \left( \bar{X}(\boldsymbol{\xi}_i) + \phi(\boldsymbol{\xi}_i) \right) \psi_\alpha(\boldsymbol{\xi}_i) = \sum_{i=1}^{N_{sp}} w_i \left( \bar{X}_i + \phi_i \right) \psi_{\alpha i}, \quad (4.4)

where X̄i = X̄(ξi), ϕi = ϕ(ξi), and ψαi = ψα(ξi) represent the true solution, the numerical error, and the Hermite basis function evaluated at the sparse grid collocation point ξi, respectively, and Nsp is the number of sampling points required for integral accuracy up to order 2P + 1. We assume that

|\phi_i| \ll |\bar{X}_i| \quad (4.5)

and that cα can be approximated by

c_\alpha = \sum_{i=1}^{N_{sp}} w_i \bar{X}_i \psi_{\alpha i} + \sum_{i=1}^{N_{sp}} w_i \phi_i \psi_{\alpha i} = \bar{c}_\alpha + \sum_{|\alpha+\beta|>2P+1} \sum_{i=1}^{N_{sp}} \bar{c}_\beta w_i \psi_{\alpha i} \psi_{\beta i} + \sum_{i=1}^{N_{sp}} w_i \phi_i \psi_{\alpha i}, \quad (4.6)

where c̄α = ∫ X̄(ξ)ψα(ξ)ρ(ξ)dξ represents the true value of the gPC coefficient of index α and c̄β represents the gPC coefficients with order |α + β| > 2P + 1. The second term on the right-hand side of Eq. (4.6) represents the aliasing error due to the sparse grid approximation. The third term, Σi wi ϕi ψαi, represents the error arising from the numerical error ϕ in the computation of X̄.

For systems of high dimensionality, both the aliasing error and the numerical error term Σi wi ϕi ψαi may introduce pronounced errors into the computation of cα. Specifically, we assume that the numerical error ϕi superimposed at each collocation point is i.i.d. with zero mean and small variance σϕ² ≪ |X̄|². Under this assumption, the term Σi wi ϕi ψαi has zero mean and variance

\mathrm{Var}\left( \sum_{i=1}^{N_{sp}} w_i \phi_i \psi_{\alpha i} \right) = \sum_{i=1}^{N_{sp}} (w_i)^2 (\psi_{\alpha i})^2 \sigma_\phi^2. \quad (4.7)

When the dimension of ξ is large, we note that the weight distribution on sparse grid points is inhomogeneous; i.e.,

\sum_i w_i = 1, \quad \exists\, k,\ |w_k| \gg 1. \quad (4.8)

Fig. 4.9 plots the variance of the term Σ_{i=1}^{Nsp} wi ϕi (normalized by σϕ²) for c0 (i.e., the coefficient of the basis function ψ0(ξ) ≡ 1) computed at different levels of sparse grid points. As the dimension increases, the variance of the error term increases rapidly, leading to non-negligible errors in the computation of cα.
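The consequence of Eq. (4.7) is easiest to see for c0, where ψ0 ≡ 1 and the noise variance is amplified by the factor Σi wi². The weights below are hypothetical, chosen only to mimic the signed, large-magnitude sparse grid weights of Eq. (4.8); for homogeneous Monte Carlo-style weights the factor is 1/N, so the noise is averaged down instead:

```python
import numpy as np

def error_variance_factor(weights):
    """Var(sum_i w_i phi_i) = sigma_phi^2 * sum_i w_i^2 for i.i.d. noise phi_i
    (Eq. (4.7) with psi_0 = 1); return the amplification factor sum_i w_i^2."""
    w = np.asarray(weights, dtype=float)
    return np.sum(w ** 2)

w_mc = np.full(100, 1.0 / 100)                     # homogeneous weights, sum = 1
w_sg = np.array([60.0, -45.0, -40.0, 13.0, 13.0])  # hypothetical inhomogeneous weights, sum = 1
```

Both weight sets integrate constants exactly (they sum to 1), but the inhomogeneous set amplifies i.i.d. noise by several orders of magnitude rather than suppressing it.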

Fig. 4.9.

Variance of the numerical error term Σ_{i=1}^{Nsp} wi ϕi (normalized by σϕ²) for c0 computed with level-1 (“■”), level-2 (“▲”), and level-3 (“●”) sparse grid points.

For illustration purposes, we consider the following 27-dimensional function:

f(\boldsymbol{\xi}) = 305 + \sum_{k=1}^{27} \xi_k - 0.01 \left( \sum_{k=1}^{27} \xi_k^2 \right)^2. \quad (4.9)

We first construct a 1st-order gPC expansion f1 by using the level-1 sparse grid method (algebraic accuracy 3) to compute the coefficients:

c_\alpha = \int_{\mathbb{R}^{27}} f(\boldsymbol{\xi}) \psi_\alpha(\boldsymbol{\xi}) \rho(\boldsymbol{\xi})\, d\boldsymbol{\xi} \approx \sum_{i=1}^{N_1^{sp}} f(\boldsymbol{\xi}_i) \psi_\alpha(\boldsymbol{\xi}_i) w_i, \quad (4.10)

where ξi and wi are the sparse grid points and corresponding weights, and N1sp is the total number of level-1 sparse grid points. Next, we construct a 2nd-order gPC expansion f2 by using the level-2 sparse grid method (algebraic accuracy 5) to compute the coefficients in the same manner. We compute the relative L2 errors of f1 and f2 as

\varepsilon_k = \| f_k - f \|_2 / \| f \|_2 \approx \left( \sum_{q=1}^{N_4^{sp}} \left( f_k(\boldsymbol{\xi}_q) - f(\boldsymbol{\xi}_q) \right)^2 w_q \right)^{1/2} \bigg/ \left( \sum_{q=1}^{N_4^{sp}} f(\boldsymbol{\xi}_q)^2 w_q \right)^{1/2}, \quad k = 1, 2, \quad (4.11)

where we use the level-4 sparse grid method (algebraic accuracy 9) so that the numerical integration is accurate. The 2nd-order expansion yields the larger L2 error due to the aliasing error in the numerical integration: ε1 = 0.029 while ε2 = 0.100.
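The function in Eq. (4.9) has analytically known gPC coefficients against which any quadrature can be checked: with Q = Σk ξk² ~ χ²₂₇ we have E[Q²] = 27² + 2·27 = 783, so c0 = 305 − 0.01·783 = 297.17, while the first-order coefficients all equal 1 (the quartic term is even and does not contribute). The sketch below verifies these values by plain Monte Carlo rather than sparse grid quadrature:

```python
import numpy as np

def f(xi):
    """Eq. (4.9): the 27-dimensional test function."""
    return 305.0 + xi.sum(axis=-1) - 0.01 * (xi ** 2).sum(axis=-1) ** 2

rng = np.random.default_rng(1)
xi = rng.standard_normal((200000, 27))
y = f(xi)

c0 = y.mean()                       # estimate of c_0 = E[f] = 297.17
c1 = np.mean((y - c0) * xi[:, 0])   # estimate of c_1 = E[f psi_1], psi_1(xi) = xi_1;
                                    # centering by c0 reduces the Monte Carlo variance
```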

We also constructed the gPC expansion of f from sample points with superimposed numerical error:

c_\alpha = \int_{\mathbb{R}^{27}} f(\boldsymbol{\xi}) \psi_\alpha(\boldsymbol{\xi}) \rho(\boldsymbol{\xi})\, d\boldsymbol{\xi} \approx \sum_{q=1}^{N_1^{sp}} f(\boldsymbol{\xi}_q)(1 + \sigma \zeta) \psi_\alpha(\boldsymbol{\xi}_q) w_q, \quad (4.12)

where ζ is a standard Gaussian random variable and σ is the magnitude of the noise. We repeat each test with 1000 independent sets of noise and present the mean and standard deviation of the L2 error in Table 4.1. As σ increases from 10⁻⁴ to 10⁻³, the relative L2 error increases further. Moreover, the 2nd-order expansion yields much larger error than the 1st-order expansion due to its more inhomogeneous weight distribution.

Table 4.1.

L2 error of the 1st-order and 2nd-order expansions of f constructed by the level-1 (ε1) and level-2 (ε2) sparse grid methods with sampling data superimposed with different magnitudes of numerical error.

σ          ε1              ε2

1 × 10⁻³   0.04 ± 0.02     1.1 ± 0.8
5 × 10⁻⁴   0.03 ± 0.01     0.5 ± 0.4
1 × 10⁻⁴   0.029 ± 0.002   0.14 ± 0.09

Similar to the simple numerical example presented above, the error of the surrogate model of the biomolecular system constructed by the sparse grid method is determined by both the aliasing error and the numerical error at the sampling points. Here we systematically investigate the L2 error of the surrogate model of the target property X (e.g., the SASA value) on residue P14. The target quantity X at each sampling point is computed at various accuracy levels with relative error from approximately 10⁻⁵ to 10⁻³. The different accuracy levels are achieved by choosing different numbers of probe points on the solvent particle when computing the SASA value of the target residue. For each accuracy level, we apply a random 3D rotation of the molecule and conduct 32 computations of the SASA value on residue P14. We approximate the relative error by σX/𝔼(X), with σX and 𝔼(X) defined by

\sigma_X = \frac{1}{S} \sum_{i=1}^{S} \sigma_{X_i}, \qquad \mathbb{E}(X) = \frac{1}{S} \sum_{i=1}^{S} X_i, \quad (4.13)

where σXi is the standard deviation of the 32 independently computed values of X at sample point ξi and S is the total number of sample points. We emphasize that σX defined by Eq. (4.13) is not equal to ϕ (i.e., the difference between the observation and the true solution). However, σX provides a useful guide for understanding the magnitude of the disturbance on the sample data. We also note that all the numerical results presented in Sec. 4.1 were computed using sampling data with relative error σX/𝔼(X) ≈ 5 × 10⁻⁵.
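The estimator in Eq. (4.13) can be computed directly from a matrix of repeated evaluations; a sketch with synthetic data standing in for the 32 repeated SASA computations (names hypothetical):

```python
import numpy as np

def relative_accuracy(repeats):
    """Eq. (4.13): repeats has shape (S, R) -- R repeated evaluations of X at
    each of S sample points; return sigma_X / E(X)."""
    sigma_X = repeats.std(axis=1, ddof=1).mean()   # (1/S) sum_i sigma_{X_i}
    mean_X = repeats.mean()                        # (1/S) sum_i X_i
    return sigma_X / mean_X
```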

Fig. 4.10 shows the relative L2 error of the gPC expansion using the compressive sensing, level-1 sparse grid, and level-2 sparse grid methods. The results from the sparse grid methods are very sensitive to the accuracy level of the sample points. For high accuracy levels, the L2 error is mainly due to the aliasing error. As σX/𝔼(X) increases from 2 × 10⁻⁵ to 1.2 × 10⁻³, the mean value of the relative L2 error increases from 2.20% to 3.36% for the level-1 sparse grid method and from 9.66% to 135.13% for the level-2 sparse grid method. In contrast, the compressive sensing method is insensitive to the imposed error on X for the present system; the resulting error is nearly constant for σX/𝔼(X) ∈ [10⁻⁵, 10⁻³]. This result reveals another advantage of the present method: it is more stable in the presence of limited accuracy in the computed target quantity. For high-dimensional systems, the performance of the sparse grid method strongly depends on the accuracy of the evaluation of X at the collocation points; similar phenomena have been reported previously [59]. In practice, it may be computationally infeasible to evaluate X at the accuracy level required for stable sparse grid results. However, our new method based on compressive sensing shows a much weaker dependence on the accuracy at individual sample points.

Fig. 4.10.

Relative L2 error of the SASA value on residue 14 predicted by 1st-order gPC expansion constructed by the level-1 sparse grid method (“▼” symbol), 2nd-order gPC expansion constructed by the level-2 sparse grid method (“■” symbol), and 2nd-order gPC expansion constructed by compressive sensing (“●” symbol using 200 sample points) under different accuracy levels σX/𝔼(X). For each accuracy level, 32 sets of independent computations are conducted to compute the L2 error of the constructed response surface.

To explore the sensitivity of the constructed gPC expansion to the accuracy level, we conducted 32 independent computations of the target quantity X by randomly rotating the biomolecule 32 times at each sampling point. However, we are cautious about claiming that the numerical error ϕi superimposed on Xi is i.i.d. among the sampling points. The i.i.d. assumptions adopted in Eqs. (4.7) and (4.12) are used to demonstrate that the numerical error ϕ may introduce additional error into the constructed gPC expansion. The study presented in this section demonstrates that the sparse grid method may introduce relatively large errors into the constructed surrogate model. Rigorous error analysis of the sparse grid method in high-dimensional/complex systems is beyond the scope of this work. However, there appear to be at least two important error sources (aliasing and numerical error on X) that could lead to erroneous results when applying the sparse grid method to high-dimensional systems such as biomolecules. We note that other specially structured or adaptive sparse grid methods [20, 30, 33, 34] may alleviate the instability issue in high-dimensional systems. However, these methods either have less flexibility (the required number of sampling points is fixed for each accuracy level) or require a specialized design of adaptivity criteria.

4.3. Surrogate model for total molecular SASA

Finally, we apply our method to quantify the uncertainty of the total SASA of the entire molecule. Unlike the local per-residue SASA considered above, this target quantity depends on the conformational states of all residues, so we construct the gPC expansion within the full random space ℝ168. Due to the high dimensionality, we restrict the expansion to second order, which yields 14365 basis functions. Fig. 4.11 shows the relative L2 error of the surrogate model and the K-L divergence of the PDFs.

Fig. 4.11.

(a) Relative L2 error of the total molecular SASA by the gPC expansions X̃(ξ) (“●”) and X̃(χ) (“■”). The dash-dot line represents the relative L2 error obtained from sampling on level-1 sparse grid points. Sampling over the level-2 sparse grid points generates erroneous results, as discussed in the text. (b) Kullback-Leibler divergence between the PDF of the reference solution and the PDFs obtained from the surrogate model X̃(χ) (“■”), the level-1 sparse grid method (dash-dot line), and direct MC sampling (dashed line, 2400 sample points).

We note that the Hermite basis functions associated with the normal distribution are unbounded, which leads to inhomogeneous error distributions in the random space. Fig. 4.12 shows the average error distribution of the surrogate model within different regimes of the SASA value. The average error of the surrogate model of X within [x1, x2] is defined by

\mathbb{E}(\varepsilon(x_1, x_2)) = \left( \frac{\sum_i \left( X_{gPC}(\boldsymbol{\xi}_i) - X(\boldsymbol{\xi}_i) \right)^2 I_{(x_1, x_2)}(X(\boldsymbol{\xi}_i))}{\sum_i I_{(x_1, x_2)}(X(\boldsymbol{\xi}_i))} \right)^{1/2}, \quad (4.14)

where I(x1, x2)(X(ξi)) is an indicator function that is 1 if X(ξi) ∈ [x1, x2] and 0 otherwise. As shown in Fig. 4.12, the error exhibits a minimum near the equilibrium state and increases as X approaches the tails of the SASA PDF. This demonstrates that the constructed surrogate model is not a global approximation of the target quantity X over the entire random space. Instead, it provides an approximation of X near the equilibrium points of the random space. Nevertheless, in practice we are generally interested in the variation of X in response to conformational fluctuations near the equilibrium state, e.g., the relatively small thermally induced molecular fluctuations considered in the present work.
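Eq. (4.14) restricts the root-mean-square surrogate error to samples whose value lies in a window [x1, x2]; a generic sketch over a set of window centers (the synthetic surrogate in the test is hypothetical, with error growing away from the center of the distribution):

```python
import numpy as np

def binned_rms_error(x_surrogate, x_true, centers, half_width):
    """Eq. (4.14): RMS error over the samples with x_true in [c - hw, c + hw]."""
    out = []
    for c in centers:
        mask = np.abs(x_true - c) <= half_width       # indicator I_(x1, x2)
        if not mask.any():
            out.append(np.nan)                        # window contains no samples
        else:
            out.append(np.sqrt(np.mean((x_surrogate[mask] - x_true[mask]) ** 2)))
    return np.array(out)
```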

Fig. 4.12.

The average error distribution of the surrogate models of the total SASA within different regimes [X − ΔX, X + ΔX], with ΔX chosen as 5 Å². The surrogate models are constructed using 800 (blank) and 1600 (filled) sample points, respectively. The mean value of the total SASA is approximately 3351 Å² (denoted by the “⋆” symbol), corresponding to conformations near the equilibrium state with respect to the thermal fluctuation.

Similar to the “local” properties discussed above, the gPC expansion recovered by our compressive sensing method yields the smallest error. However, the advantage of our new method over direct Monte Carlo sampling is not as large for the global SASA as it is for the local properties. Because the multi-dimensional basis functions are constructed as tensor products of one-dimensional basis functions, their upper bound (which exists because the sampling of the Gaussian random variables is truncated in practice) becomes larger, which decreases the efficiency of the compressive sensing method. This is similar to the phenomenon observed by others [39, 54]. If only statistical information such as expectation values or PDFs is needed, other methods such as quasi-Monte Carlo [35, 46] may be more suitable for high-dimensional systems.

Supplementary Material

LaTeX source code

Footnotes

* This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research as part of the Collaboratory on Mathematics for Mesoscopic Modeling of Materials (CM4). Pacific Northwest National Laboratory is operated by Battelle for the DOE under Contract DE-AC05-76RL01830. We would like to thank Xiaoliang Wan, Wen Zhou and Tom Goddard for fruitful discussions. H. Lei acknowledges a travel grant from the IMA for a workshop on Uncertainty Quantification in Materials Modeling and a travel grant from the DOE for the Conference on Data Analysis 2014.

REFERENCES

  • 1. ℓ1-magic. http://statweb.stanford.edu/~candes/l1magic/.
  • 2. Adcock SA, McCammon JA. Molecular dynamics: survey of methods for simulating the activity of proteins. Chem. Rev. 2006;106:1589–1615. doi:10.1021/cr040426m.
  • 3. Alexov E, Mehler EL, Baker NA, Baptista AM, Huang Y, Milletti F, Nielsen JE, Farrell D, Carstensen T, Olsson MHM, Shen JK, Warwicker J, Williams S, Word JM. Progress in the prediction of pKa values in proteins. Proteins. 2011;79:3260–3275. doi:10.1002/prot.23189.
  • 4. Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I. Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 2001;80:505–515. doi:10.1016/S0006-3495(01)76033-X.
  • 5. Baker NA. Biomolecular applications of Poisson–Boltzmann methods. Rev. Comp. Ch. 2005;21:349–379.
  • 6. Brooks B, Karplus M. Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. U.S.A. 1983;80:6571–6575. doi:10.1073/pnas.80.21.6571.
  • 7. Bruckstein AM, Donoho DL, Elad M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 2009;51:34–81.
  • 8. Cai J, Osher S, Shen Z. Convergence of the linearized Bregman iteration for ℓ1-norm minimization. Math. Comput. 2009;78:2127–2136.
  • 9. Cai J, Osher S, Shen Z. Linearized Bregman iterations for compressed sensing. Math. Comput. 2009;78:1515–1536.
  • 10. Candès EJ. The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris. 2008;346:589–592.
  • 11. Candès EJ, Tao T. Decoding by linear programming. IEEE Trans. Inform. Theory. 2005;51:4203–4215.
  • 12. Connolly ML. Solvent-accessible surfaces of proteins and nucleic acids. Science. 1983;221:709–713. doi:10.1126/science.6879170.
  • 13. Constantine PG, Dow E, Wang Q. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM J. Sci. Comput. 2014;36:A1500–A1524.
  • 14. Donoho DL, Elad M, Temlyakov VN. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory. 2006;52:6–18.
  • 15. Doostan A, Owhadi H. A non-adapted sparse approximation of PDEs with stochastic inputs. J. Comput. Phys. 2011;230:3015–3034.
  • 16. Dror RO, Dirks RM, Grossman JP, Xu H, Shaw DE. Biomolecular simulation: a computational microscope for molecular biology. Annu. Rev. Biophys. 2012;41:429–452. doi:10.1146/annurev-biophys-042910-155245.
  • 17. Foo J, Karniadakis GE. Multi-element probabilistic collocation method in high dimensions. J. Comput. Phys. 2010;229:1536–1557.
  • 18. Foo J, Wan X, Karniadakis GE. The multi-element probabilistic collocation method (ME-PCM): error analysis and applications. J. Comput. Phys. 2008;227:9572–9595.
  • 19. Ganapathysubramanian B, Zabaras N. Sparse grid collocation schemes for stochastic natural convection problems. J. Comput. Phys. 2007;225:652–685.
  • 20. Genz A, Keister BD. Fully symmetric interpolatory rules for multiple integrals over infinite regions with Gaussian weight. J. Comput. Appl. Math. 1996;71:299–309.
  • 21. Ghanem RG, Spanos PD. Stochastic Finite Elements: A Spectral Approach. New York: Springer-Verlag; 1991.
  • 22. Go N, Noguti T, Nishikawa T. Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proc. Natl. Acad. Sci. U.S.A. 1983;80:3696–3700. doi:10.1073/pnas.80.12.3696.
  • 23. Goldstein T, Osher S. The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci. 2009;2:323–343.
  • 24. Grant M, Boyd S. CVX: Matlab software for disciplined convex programming. http://cvxr.com/cvx/.
  • 25. Haliloglu T, Bahar I, Erman B. Gaussian dynamics of folded proteins. Phys. Rev. Lett. 1997;79:3090–3093.
  • 26. Harris RC, Boschitsch AH, Fenley MO. Influence of grid spacing in Poisson–Boltzmann equation binding energy estimation. J. Chem. Theory Comput. 2013;9:3677–3685. doi:10.1021/ct300765w.
  • 27. Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 1971;55:379–IN4. doi:10.1016/0022-2836(71)90324-x.
  • 28. Levitt M, Sander C, Stern PS. The normal modes of a protein: native bovine pancreatic trypsin inhibitor. Int. J. Quant. Chem.: Quantum Biology Symposium. 1983;10:181–199.
  • 29. Li X. Finding deterministic solution from underdetermined equation: large-scale performance variability modeling of analog/RF circuits. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 2010;29:1661–1668.
  • 30. Ma X, Zabaras N. An adaptive hierarchical sparse grid collocation algorithm for the solution of stochastic differential equations. J. Comput. Phys. 2009;228:3084–3113.
  • 31. Ma X, Zabaras N. An adaptive high-dimensional stochastic model representation technique for the solution of stochastic partial differential equations. J. Comput. Phys. 2010;229:3884–3915.
  • 32. McCammon JA, Harvey SC. Dynamics of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1987.
  • 33. Narayan A, Xiu D. Stochastic collocation methods on unstructured grids in high dimensions via interpolation. SIAM J. Sci. Comput. 2012;34:A1729–A1752.
  • 34. Narayan A, Xiu D. Constructing nested nodal sets for multivariate polynomial interpolation. SIAM J. Sci. Comput. 2013;35:A2293–A2315.
  • 35. Niederreiter H. Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia, PA: SIAM; 1992.
  • 36. Nobile F, Tempone R, Webster CG. An anisotropic sparse grid stochastic collocation method for partial differential equations with random input data. SIAM J. Numer. Anal. 2008;46:2411–2442.
  • 37. Petras K. Smolpack: a software for Smolyak quadrature with Clenshaw–Curtis basis-sequence. http://people.sc.fsu.edu/~jburkardt/c_src/smolpack/smolpack.html; 2003.
  • 38. Ponder JW, Case DA. Force fields for protein simulations. In: Advances in Protein Chemistry, vol. 66. Elsevier; 2003. pp. 27–85.
  • 39. Rauhut H, Ward R. Sparse Legendre expansions via ℓ1-minimization. J. Approx. Theory. 2012;164:517–533.
  • 40. Ren P, Chun J, Thomas DG, Schnieders MJ, Marucho M, Zhang J, Baker NA. Biomolecular electrostatics and solvation: a computational perspective. Q. Rev. Biophys. 2012;45:427–491. doi:10.1017/S003358351200011X.
  • 41. Richmond TJ. Solvent accessible surface area and excluded volume in proteins. J. Mol. Biol. 1984;178:63–89. doi:10.1016/0022-2836(84)90231-6.
  • 42. Rizzi F, Najm HN, Debusschere BJ, Sargsyan K, Salloum M, Adalsteinsson H, Knio OM. Uncertainty quantification in MD simulation. Part I: forward propagation. Multiscale Model. Simul. 2012;10.
  • 43. Rizzi F, Najm HN, Debusschere BJ, Sargsyan K, Salloum M, Adalsteinsson H, Knio OM. Uncertainty quantification in MD simulation. Part II: Bayesian inference of force-field parameters. Multiscale Model. Simul. 2012;10:1460–1492.
  • 44. Roux B, Simonson T. Implicit solvent models. Biophys. Chem. 1999;78:1–20. doi:10.1016/s0301-4622(98)00226-9.
  • 45. Shrake A, Rupley JA. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 1973;79:351–371. doi:10.1016/0022-2836(73)90011-9.
  • 46. Sloan IH, Joe S. Lattice Methods for Multiple Integration. New York: Oxford University Press; 1994.
  • 47. Tama F, Sanejouand Y-H. Conformational change of proteins arising from normal mode calculations. Protein Eng. 2001;14:1–6. doi:10.1093/protein/14.1.1.
  • 48. Tipireddy R, Ghanem R. Basis adaptation in homogeneous chaos spaces. J. Comput. Phys. 2014;259:304–317.
  • 49. Tirion MM. Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 1996;77:1905–1908. doi:10.1103/PhysRevLett.77.1905.
  • 50. van den Berg E, Friedlander MP. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput. 2008;31:890–912.
  • 51. Wlodawer A, Walter J, Huber R, Sjölin L. Structure of bovine pancreatic trypsin inhibitor: results of joint neutron and X-ray refinement of crystal form II. J. Mol. Biol. 1984;180:301–329. doi:10.1016/s0022-2836(84)80006-6.
  • 52. Xiu D, Hesthaven JS. High-order collocation methods for differential equations with random inputs. SIAM J. Sci. Comput. 2005;27:1118–1139.
  • 53. Xiu D, Karniadakis GE. The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput. 2002;24:619–644.
  • 54. Yan L, Guo L, Xiu D. Stochastic collocation algorithms using ℓ1-minimization. Int. J. Uncertainty Quantification. 2012;2:279–293.
  • 55. Yang X, Choi M, Lin G, Karniadakis GE. Adaptive ANOVA decomposition of stochastic incompressible and compressible flows. J. Comput. Phys. 2012;231:1587–1614.
  • 56.Yang X, Karniadakis GE. Reweighted ℓ1 minimization method for stochastic elliptic differential equations. J. Comput. Phys. 2013;248:87–108. [Google Scholar]
  • 57.Yin W, Osher S, Goldfarb D, Darbon J. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 2008;1:143–168. [Google Scholar]
  • 58.Zhang Z, Choi M, Karniadakis GE. Error estimates for the ANOVA method with polynomial chaos interpolation: Tensor product functions. SIAM J. Sci. Comput. 2012;34:A1165–A1186. [Google Scholar]
  • 59.Zhang Z, Tretyakov MV, Rozovskii B, Karniadakis GE. A recursive sparse grid collocation method for differential equations with white noise. SIAM J. Sci. Comput. 2014;36:A1652–A1677. [Google Scholar]

Associated Data

Supplementary Materials

LaTeX source code