Author manuscript; available in PMC: 2021 Apr 2.
Published in final edited form as: J Phys Commun. 2020 Nov 26;4(11):115010. doi: 10.1088/2399-6528/abcbac

Implicit ligand theory for relative binding free energies: II. An estimator based on control variates

Trung Hai Nguyen 1, David D L Minh 2
PMCID: PMC8018686  NIHMSID: NIHMS1681548  PMID: 33817346

Abstract

Implicit ligand theory describes the relationship between the noncovalent binding free energy and the binding free energy between a ligand and multiple rigid receptor conformations. We have previously shown that if the receptor conformations are sampled from or reweighed to a holo ensemble, the binding free energy relative to the ligand that defines the ensemble can be calculated. Here, we apply a variance reduction technique known as control variates to derive a new statistical estimator for the relative binding free energy. In applications to a data set of 6 reference ligands and 18 test ligands, statistically significant differences between the estimators are not observed for most systems. However, in cases where such differences are observed, the new estimator is more accurate and precise and converges more quickly. Performance improvements are most consistent where there is a clear correlation, with a correlation coefficient greater than 0.3, between the control variate and the statistic being averaged.

1. Introduction

Noncovalent binding between small organic molecules and biological macromolecules is a ubiquitous process in biology and is critical to the mechanism of most drugs. Thus, there has been significant interest in developing computational methods to predict binding free energies, which quantify the strength of these interactions [1, 2, 3, 4, 5, 6, 7]. A key application of these methods is rational drug design [8, 9, 5, 7].

Alchemical pathway methods [10] are a class of theoretically rigorous but computationally expensive binding free energy methods. These methods involve sampling conformations of receptor-ligand complexes from a series of thermodynamic states along a possibly nonphysical pathway. They can be used to obtain either absolute binding free energies between a receptor and ligand (ΔGRL, as defined in Eq. 1) or relative binding free energies between a receptor and two different ligands (ΔΔGRL, as defined in Eq. 2). In ΔGRL calculations, the pathway may involve decoupling or physically separating the ligand from the receptor [11, 12, 13, 14]. In ΔΔGRL calculations, the ligand is transformed from one molecule to another [15, 16]. Alchemical pathway methods have accurately predicted binding free energies for many systems including protein-ligand [17, 18, 19, 20, 21, 9, 22] and protein-protein complexes [23, 24]. However, computational expense continues to restrict more widespread application of the methods.

Since 2012, we have been developing binding free energy methods based on implicit ligand theory (ILT) [25] — a statistical mechanics framework for calculating absolute or relative binding free energies using a pre-sampled set of rigid receptor conformations — that can quickly calculate binding free energies for a large set of ligands binding to the same receptor [26, 27, 28]. Computing ΔGRL involves drawing from or reweighing receptor configurations to the apo ensemble - the receptor alone, in the absence of ligand [25]. In a ΔΔGRL calculation, receptor configurations are drawn from a holo ensemble - where the receptor is bound to a ligand [29]. After receptor configurations are selected, the next step of an ILT-based free energy calculation is to compute the binding potential of mean force (BPMF) — the ΔGRL between a flexible ligand and a rigid receptor — for the receptor configurations. The standard absolute ΔGRL, computed using the “apo estimator”, is an exponential average of the BPMF over the apo ensemble. ΔΔGRL of a ligand relative to the reference ligand defining the holo ensemble, computed with the “holo estimator” (Eq. 6), is an exponential average of the BPMF over the reference holo ensemble.

ILT has several advantages over alternative binding free energy methods based on a flexible receptor. The main advantage of ILT is that once conformations of the receptor are thoroughly sampled, they can be used to calculate the binding free energies of many ligands. In contrast, in alchemical pathway calculations with a flexible receptor, receptor conformational sampling needs to be performed for every receptor-ligand complex. Another advantage of ILT is that BPMF calculations are much faster and more scalable than binding free energies to flexible receptors [28]. Protein-ligand interaction energies may be pre-computed on a grid and interpolated, reducing the number of nonbonded force calculations and essentially eliminating the relationship between receptor size and computational speed. Finally, it is worth noting that unlike most methods for ΔΔGRL estimation, the holo estimator does not require one ligand to be transformed into another. Therefore it can be employed even for cases where the target ligand significantly differs from the reference one.

Regarding receptor conformational sampling, the optimal choice of apo or holo ensemble depends on the extent to which ligand binding induces a consistent conformational change in the receptor. Both the apo and holo estimators are exponential averages that are dominated by the lowest BPMFs. The receptor configurations with the lowest BPMFs are those that are prevalent in the holo ensemble of interest. In contrast, BPMFs for receptor configurations that poorly accommodate a ligand may be approximated as infinite without a significant effect on the binding free energy estimate. Thus, converged ΔGRL estimates with ILT require sampling receptor configurations that are prevalent in the holo ensemble for the ligand of interest. If ligand binding induces a significant conformational change, then simulating the apo ensemble is unlikely to yield such conformations. In this case, relevant configurations may be obtained by simulating one or more holo ensembles with different ligands [26]. In principle, enhanced and biased sampling techniques may also be used for receptor configurational sampling.

Here, we describe a variation of the holo estimator that is based on exploiting the BPMFs between a ligand and its own holo ensemble. These BPMFs may be used to estimate the relative binding free energy between a ligand and itself, the “self” relative binding free energy, ΔΔGRLo. By definition, ΔΔGRLo is zero. However, for a finite set of receptor configurations, ΔΔGRLo may be estimated as nonzero. In previous work [29], we corrected ΔΔGRL estimates for all ligands by subtracting the estimate of ΔΔGRLo. This correction ensures that the estimated ΔΔGRLo is always zero, regardless of the sample size. Although subtracting this term helped reduce the error of other ΔΔGRL estimates, this strategy may not be the statistically optimal way to use the information contained in the BPMFs.

To improve the holo estimator, we apply a statistical variance reduction technique known as control variates. In the control variates technique, the bias in an estimate of a known expectation value is used to correct the bias in the estimate of an unknown expectation value. The expectation values are of quantities that can be calculated from the same samples. The method works best when the quantities are highly correlated and therefore bias in one estimate is related to bias in the other.
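As a toy illustration of the technique (unrelated to binding free energies), the sketch below estimates E[exp(U)] for U uniform on [0, 1], using U itself, whose expectation is exactly 0.5, as the control variate. The function name, sample sizes, and random seed are illustrative assumptions, not part of this work.

```python
import numpy as np

def control_variate_mean(f, g, g_mean):
    """Estimate E[f] using g, whose exact mean g_mean is known, as a control variate."""
    # Coefficient that minimizes the variance of the corrected sample mean
    c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)
    # Shift the plain sample mean by the observed error in the mean of g
    return np.mean(f) - c * (np.mean(g) - g_mean)

rng = np.random.default_rng(0)
plain, corrected = [], []
for _ in range(200):                      # 200 independent repeats
    u = rng.uniform(0.0, 1.0, size=100)   # 100 samples per repeat
    f = np.exp(u)
    plain.append(np.mean(f))
    corrected.append(control_variate_mean(f, u, 0.5))

exact = np.e - 1.0  # analytical value of E[exp(U)]
print("spread of plain estimates:    ", np.std(plain))
print("spread of corrected estimates:", np.std(corrected))
```

Because exp(U) and U are strongly correlated, the corrected estimates should scatter much less around the exact value than the plain sample means.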

The structure of this paper is as follows: First, we introduce notation and review the holo estimator. We then develop a variation of the estimator based on control variates. We then apply the newly optimized estimator to ΔGRL calculations for T4 lysozyme and 24 ligands. Finally, we compare the accuracy of the unoptimized and optimized estimators with the results from alchemical pathway calculations.

2. Theory

The absolute standard binding free energy of a ligand L noncovalently bound to receptor R to form a complex RL is given by,

$$\Delta G_{RL} = -k_B T \ln\left(\frac{C_{RL}\, C^\circ}{C_R\, C_L}\right) \tag{1}$$

where kB is Boltzmann’s constant and T is the temperature in Kelvin. CR, CL and CRL are equilibrium concentrations of the receptor, ligand and complex, respectively. C° is the standard concentration, 1 M.
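As a quick numerical check of Eq. 1, the following sketch evaluates the standard binding free energy from equilibrium concentrations. The function name, the default temperature of 298.15 K, and the example concentrations are hypothetical choices made for illustration.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def standard_binding_free_energy(c_rl, c_r, c_l, temperature=298.15, c_std=1.0):
    """Eq. 1: standard binding free energy in kcal/mol.

    All concentrations are in mol/L; c_std is the standard concentration (1 M).
    """
    return -K_B * temperature * math.log(c_rl * c_std / (c_r * c_l))

# Hypothetical example: receptor, ligand, and complex all at 1 micromolar,
# which corresponds to a dissociation constant of 1 micromolar.
dG = standard_binding_free_energy(c_rl=1e-6, c_r=1e-6, c_l=1e-6)
print(f"{dG:.2f} kcal/mol")
```

A tighter binder (more complex at the same free concentrations) gives a more negative value.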

In this work, we focus on ΔΔGRL, which is defined as the difference in binding free energies between a target ligand L and a reference ligand Lo,

$$\Delta\Delta G_{RL} = \Delta G_{RL} - \Delta G_{RL_o}. \tag{2}$$

The definition of ΔGRLo is the same as ΔGRL except that Lo replaces L.

2.1. ILT estimator for relative binding free energies

Previously, we reported that ΔΔGRL can be written as an expectation value [29],

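$$\exp\left[-\frac{\Delta\Delta G_{RL}}{k_B T}\right] = \frac{\int \rho_{RL_o}(r_{RL_o})\, e^{-\beta[B(r_R) - \Psi(r_{RL_o})]}\, dr_{RL_o}}{\int \rho_{RL_o}(r_{RL_o})\, dr_{RL_o}} \tag{3}$$
$$\equiv \left\langle \exp\{-\beta[B(r_R) - \Psi(r_{RL_o})]\} \right\rangle_{RL_o}, \tag{4}$$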

where β = 1/(kBT), kB is Boltzmann’s constant, and T is the temperature. $\rho_{RL_o}(r_{RL_o}) = I(\xi_o)\, J(\xi_o)\, e^{-\beta U(r_{RL_o})}$ is the unnormalized probability density of the receptor-ligand complex in the bound state. It includes: I(ξo), a function based on the ligand external coordinates ξo that specifies whether the ligand is in the binding site; J(ξo), the Jacobian for transforming the ligand external coordinates from Cartesian into the coordinate system used for ξo; and U(rX), the potential energy of species X ∈ {R, Lo, RLo} in implicit solvent. The observable of the expectation ⟨…⟩RLo includes B(rR), the BPMF, which is the binding free energy of the flexible target ligand L to a rigid receptor conformation. It also includes $\Psi(r_{RL_o}) = U(r_{RL_o}) - U(r_R) - U(r_{L_o})$, the interaction energy between the reference ligand Lo and the receptor R when they form a bound complex rRLo.

If configurations of the complex are drawn from a different statistical distribution than the bound state, then importance sampling may be used to estimate ΔΔGRL,

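$$\exp\left[-\frac{\Delta\Delta G_{RL}}{k_B T}\right] = \frac{\left\langle w(r_{RL_o}) \exp\{-\beta[B(r_R) - \Psi(r_{RL_o})]\} \right\rangle_o}{\left\langle w(r_{RL_o}) \right\rangle_o}, \tag{5}$$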

where w(rRLo) = ρRLo(rRLo)/ρo(rRLo) is the ratio of the statistical weight in the holo state, ρRLo (rRLo), over the statistical weight in the original distribution that configurations are sampled from, ρo(rRLo).

Using a sample mean to estimate the expectation value in Eq. 5 leads to,

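$$\Delta\Delta\hat{G}_{RL} = -k_B T \ln\left\{\frac{\sum_{r_{RL_o}} w(r_{RL_o})\, e^{-\beta[\hat{B}(r_R) - \Psi(r_{RL_o})]}}{\sum_{r_{RL_o}} w(r_{RL_o})}\right\}, \tag{6}$$

where the summation is over conformations of the receptor. $\hat{B}(r_R)$ is an estimate of the BPMF.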
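A minimal sketch of how Eq. 6 might be evaluated is given below. The function and argument names are assumptions made for illustration and are not taken from the authors' code. The maximum exponent is subtracted before exponentiating to avoid overflow when some BPMFs are very favorable.

```python
import numpy as np

def holo_estimator(bpmf, psi, weights, kT):
    """Eq. 6: relative binding free energy from a weighted exponential average.

    bpmf    : BPMF estimates of the target ligand, one per receptor snapshot
    psi     : reference-ligand interaction energies, one per snapshot
    weights : importance weights w; all ones if sampled directly from the holo state
    kT      : thermal energy, in the same units as bpmf and psi
    """
    bpmf, psi, weights = (np.asarray(x, dtype=float) for x in (bpmf, psi, weights))
    exponent = -(bpmf - psi) / kT
    shift = exponent.max()  # subtract the largest exponent for numerical stability
    log_numerator = shift + np.log(np.sum(weights * np.exp(exponent - shift)))
    return -kT * (log_numerator - np.log(np.sum(weights)))
```

Because the average is exponential, the result is dominated by the snapshots with the lowest values of B − Ψ and can never be lower than the smallest such value.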

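When $\hat{B}(r_R)$ is the BPMF of the reference ligand itself, $\hat{B}_o(r_R)$, we obtain an estimate of the “self” relative binding free energy,

$$\Delta\Delta\hat{G}_{RL_o} = -k_B T \ln\left\{\frac{\sum_{r_{RL_o}} w(r_{RL_o})\, e^{-\beta[\hat{B}_o(r_R) - \Psi(r_{RL_o})]}}{\sum_{r_{RL_o}} w(r_{RL_o})}\right\}. \tag{7}$$

In the limit of infinite sample size, $\Delta\Delta\hat{G}_{RL_o}$ should converge to zero. However, it can be nonzero for a finite number of samples. In our previous study [29], we subtracted $\Delta\Delta\hat{G}_{RL_o}$ from $\Delta\Delta\hat{G}_{RL}$ to ensure that the corrected ΔΔGRLo estimate is always zero, regardless of the sample size:

$$\Delta\Delta\hat{G}_{RL,A} = \Delta\Delta\hat{G}_{RL} - \Delta\Delta\hat{G}_{RL_o} \tag{8}$$
$$= -k_B T \ln\left\{\frac{\sum_{r_{RL_o}} w(r_{RL_o})\, e^{-\beta[\hat{B}(r_R) - \Psi(r_{RL_o})]}}{\sum_{r_{RL_o}} w(r_{RL_o})\, e^{-\beta[\hat{B}_o(r_R) - \Psi(r_{RL_o})]}}\right\}. \tag{9}$$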

We will refer to this expression as estimator A. Although estimator A ensures that the “self” relative binding free energy estimate is zero, it may not be statistically optimal.
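Estimator A (Eq. 9) can be sketched as the ratio of two weighted exponential sums. The helper and argument names below are illustrative assumptions, and the authors' implementation may be organized differently.

```python
import numpy as np

def log_weighted_boltzmann_sum(energy, weights, kT):
    """Return ln(sum_i w_i * exp(-energy_i / kT)), shifted for numerical stability."""
    x = -np.asarray(energy, dtype=float) / kT
    shift = x.max()
    return shift + np.log(np.sum(np.asarray(weights, dtype=float) * np.exp(x - shift)))

def estimator_a(bpmf, bpmf_ref, psi, weights, kT):
    """Eq. 9: holo estimate for the target ligand minus the 'self' estimate.

    bpmf     : BPMFs of the target ligand, one per receptor snapshot
    bpmf_ref : BPMFs of the reference ligand for the same snapshots
    psi      : reference-ligand interaction energies
    weights  : importance weights w
    """
    psi = np.asarray(psi, dtype=float)
    log_num = log_weighted_boltzmann_sum(np.asarray(bpmf, dtype=float) - psi, weights, kT)
    log_den = log_weighted_boltzmann_sum(np.asarray(bpmf_ref, dtype=float) - psi, weights, kT)
    # The normalizing sum of weights cancels between Eq. 6 and Eq. 7
    return -kT * (log_num - log_den)
```

Passing the reference ligand's BPMFs as both `bpmf` and `bpmf_ref` returns zero by construction, which is the property that motivated this estimator.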

2.2. ILT estimator based on control variates

To apply the control variates technique, we need a property with an exactly known expectation value. In the special case that the ligand is the reference ligand, then the left hand side of Eq. 5 is unity. Rearrangement and combination of the expectation values yields ⟨g(rRLo)⟩o = 0, where,

$$g(r_{RL_o}) = w(r_{RL_o}) \left\{ e^{-\beta[\hat{B}_o(r_R) - \Psi(r_{RL_o})]} - 1 \right\}. \tag{10}$$

For notational convenience, we also define a function based on the key observable in Eq. 5,

$$h(r_{RL_o}) = w(r_{RL_o})\, e^{-\beta[\hat{B}(r_R) - \Psi(r_{RL_o})]}. \tag{11}$$

In the control variates technique, ⟨h(rRLo)⟩ is estimated by the sample mean of a new variable, m(rRLo), which is given by,

$$m(r_{RL_o}) = h(r_{RL_o}) + C\left(\langle g(r_{RL_o}) \rangle - g(r_{RL_o})\right) = h(r_{RL_o}) - C\, g(r_{RL_o}), \tag{12}$$

where C is an arbitrary constant. In the standard control variates method, C is optimized to minimize the variance of the sample mean of m(rRLo). In this paper, we select C to minimize the variance of a new estimator for relative binding free energy, which is given by,

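$$\Delta\Delta\hat{G}_{RL,B} = -k_B T \ln\frac{\bar{m}}{\bar{w}}. \tag{13}$$

The bar over the variable name denotes the sample mean, such that $\bar{m} = \frac{1}{N}\sum_{r_{RL_o}} m(r_{RL_o})$ and $\bar{w} = \frac{1}{N}\sum_{r_{RL_o}} w(r_{RL_o})$. N is the number of sampled receptor configurations.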

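To optimize C, we use error propagation formulae to obtain the variance of $\Delta\Delta\hat{G}_{RL,B}$ as a function of C, set its derivative to zero, and solve for C. The variance of the estimator is given by,

$$\begin{aligned}
\mathrm{Var}[\Delta\Delta\hat{G}_{RL,B}] &= (k_B T)^2 \left(\frac{\bar{w}}{\bar{m}}\right)^2 \mathrm{Var}\left[\frac{\bar{m}}{\bar{w}}\right] \\
&= (k_B T)^2 \left[\frac{\mathrm{Var}[\bar{m}]}{\bar{m}^2} + \frac{\mathrm{Var}[\bar{w}]}{\bar{w}^2} - \frac{2}{(\bar{w})(\bar{m})}\mathrm{Cov}[\bar{m},\bar{w}]\right] \\
&= \frac{(k_B T)^2}{N} \left[\frac{\mathrm{Var}[m]}{\bar{m}^2} + \frac{\mathrm{Var}[w]}{\bar{w}^2} - \frac{2}{(\bar{w})(\bar{m})}\mathrm{Cov}[m,w]\right].
\end{aligned} \tag{14}$$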

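Note that only m depends on C. Taking the derivative of $\mathrm{Var}[\Delta\Delta\hat{G}_{RL,B}]$ with respect to C yields,

$$\frac{d}{dC}\mathrm{Var}[\Delta\Delta\hat{G}_{RL,B}] = \frac{(k_B T)^2}{N}\frac{d}{dC}\left[\frac{\mathrm{Var}[m]}{\bar{m}^2}\right] - \frac{2(k_B T)^2}{N\bar{w}}\frac{d}{dC}\left[\frac{\mathrm{Cov}[m,w]}{\bar{m}}\right]. \tag{15}$$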

The derivative can be set to zero, resulting in,

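$$\frac{d}{dC}\left[\frac{\mathrm{Var}[m]}{\bar{m}^2}\right] - \frac{2}{\bar{w}}\frac{d}{dC}\left[\frac{\mathrm{Cov}[m,w]}{\bar{m}}\right] = 0. \tag{16}$$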

Substituting derivatives of the ratios into Eq. 16 and rearranging gives,

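$$(\bar{w})(\bar{m})\frac{d}{dC}\mathrm{Var}[m] - 2\bar{m}^2\frac{d}{dC}\mathrm{Cov}[m,w] + 2\left(\bar{m}\,\mathrm{Cov}[m,w] - \bar{w}\,\mathrm{Var}[m]\right)\frac{d}{dC}\bar{m} = 0, \tag{17}$$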

where

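$$\begin{aligned}
m &= h - Cg, \\
\bar{m} &= \bar{h} - C\bar{g}, \\
\frac{d}{dC}\bar{m} &= -\bar{g}, \\
\mathrm{Var}[m] &= \mathrm{Var}[h] + C^2\,\mathrm{Var}[g] - 2C\,\mathrm{Cov}[h,g], \\
\frac{d}{dC}\mathrm{Var}[m] &= 2C\,\mathrm{Var}[g] - 2\,\mathrm{Cov}[h,g], \\
\mathrm{Cov}[m,w] &= \mathrm{Cov}[h,w] - C\,\mathrm{Cov}[g,w], \\
\frac{d}{dC}\mathrm{Cov}[m,w] &= -\mathrm{Cov}[g,w].
\end{aligned}$$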
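As a check on the step from Eq. 16 to Eq. 17, the quotient rule applied to the two ratios in Eq. 16 gives

$$\frac{d}{dC}\left[\frac{\mathrm{Var}[m]}{\bar{m}^2}\right] = \frac{1}{\bar{m}^2}\frac{d}{dC}\mathrm{Var}[m] - \frac{2\,\mathrm{Var}[m]}{\bar{m}^3}\frac{d}{dC}\bar{m}, \qquad \frac{d}{dC}\left[\frac{\mathrm{Cov}[m,w]}{\bar{m}}\right] = \frac{1}{\bar{m}}\frac{d}{dC}\mathrm{Cov}[m,w] - \frac{\mathrm{Cov}[m,w]}{\bar{m}^2}\frac{d}{dC}\bar{m}.$$

Inserting these expressions into Eq. 16 and multiplying through by $\bar{w}\,\bar{m}^3$ collects the terms into the form of Eq. 17.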

Substituting these identities into Eq. 17, rearranging, and solving for C gives,

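$$C = \frac{(\bar{w})(\bar{h})\,\mathrm{Cov}[h,g] - (\bar{w})(\bar{g})\,\mathrm{Var}[h] - \bar{h}^2\,\mathrm{Cov}[g,w] + (\bar{h})(\bar{g})\,\mathrm{Cov}[h,w]}{(\bar{w})(\bar{h})\,\mathrm{Var}[g] - (\bar{w})(\bar{g})\,\mathrm{Cov}[h,g] - (\bar{h})(\bar{g})\,\mathrm{Cov}[g,w] + \bar{g}^2\,\mathrm{Cov}[h,w]}. \tag{18}$$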

We will refer to Eq. 13 with C given by Eq. 18 as estimator B. Estimator B is the key theoretical result of the paper. In the remainder of the paper, we will compare the accuracy and precision of estimator A (Eq. 9) and estimator B (Eq. 13).
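The following is a minimal sketch of estimator B, written directly from Eqs. 10 to 13 and Eq. 18. Function and argument names are assumptions for illustration; the authors' own Python code in the Supplementary Material may differ. For brevity the sketch exponentiates directly rather than working in log space, and it assumes that the sample mean of m is positive.

```python
import numpy as np

def _cov(a, b):
    """Sample covariance; the normalization factor cancels in the ratio of Eq. 18."""
    return np.mean((a - a.mean()) * (b - b.mean()))

def optimal_c(h, g, w):
    """Eq. 18: the constant C that minimizes the propagated variance (Eq. 14)."""
    hb, gb, wb = h.mean(), g.mean(), w.mean()
    numerator = (wb * hb * _cov(h, g) - wb * gb * _cov(h, h)
                 - hb**2 * _cov(g, w) + hb * gb * _cov(h, w))
    denominator = (wb * hb * _cov(g, g) - wb * gb * _cov(h, g)
                   - hb * gb * _cov(g, w) + gb**2 * _cov(h, w))
    return numerator / denominator

def estimator_b(bpmf, bpmf_ref, psi, weights, kT):
    """Eq. 13 with C from Eq. 18. Returns (relative free energy, C)."""
    bpmf, bpmf_ref, psi, w = (np.asarray(x, dtype=float)
                              for x in (bpmf, bpmf_ref, psi, weights))
    h = w * np.exp(-(bpmf - psi) / kT)               # Eq. 11
    g = w * (np.exp(-(bpmf_ref - psi) / kT) - 1.0)   # Eq. 10, control variate
    c = optimal_c(h, g, w)
    m = h - c * g                                    # Eq. 12
    return -kT * np.log(m.mean() / w.mean()), c      # Eq. 13
```

One useful sanity check follows from the algebra: when the target ligand is the reference ligand, h = g + w, Eq. 18 reduces to C = 1, and the estimate is exactly zero, so estimator B retains the defining property of estimator A.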

3. Methods

All of the quantities needed to calculate binding free energies using estimators A (Eq. (9)) and B (Eq. (13)) and to assess their quality were obtained in previous studies. For the reader’s convenience, we describe some key details in this section.

To benchmark the accuracy of the two estimators, we used absolute binding free energies obtained via an alchemical pathway method. The absolute binding free energy ΔGRL between T4 lysozyme and 24 small organic molecules was previously calculated [26] using YANK, a software package developed by the Chodera group [12]. In these calculations, the AMBER ff14 force field [30] was used for T4 lysozyme and the generalized AMBER force field [31] with Bondi radii [32] and AM1BCC partial charges [33, 34] for ligands. Solvent was treated with the OBC2 generalized Born/surface area implicit solvent model [35]. Calculations were repeated 3 times for each system. Standard deviations were less than 0.5 kcal/mol for 22 systems and less than 1 kcal/mol for all 24 systems [26].

We considered one reference set and two test sets: test set I to benchmark consistency with YANK and test set II to benchmark against experimental results. The reference set consisted of methylpyrrole, benzene, p-xylene, phenol, n-hexylbenzene and DL-camphor. Test set I consisted of the 18 remaining ligands from the YANK calculations. Test set II included 18 ligands having experimental binding free energies reported in references [36, 37, 38]. As seven ligands were part of both test sets, we considered a total of 29 ligands.

ILT-based binding free energies were also computed using quantities obtained in previous studies. In previous work, 96 receptor snapshots were selected from every alchemical state of YANK calculations with the reference ligands. The BPMF between each ligand and the selected receptor configurations was calculated [26] using our software package AlGDock [39, 28]. The same force field and implicit solvent model were used in the YANK and AlGDock calculations. Interaction energies Ψ(rRLo) were calculated using the OpenMM 6.3.1 library [40, 41]. Since receptor snapshots selected for AlGDock calculations were drawn from all alchemical states, not just the holo state, we needed to reweigh them to the holo state. To do this, we used the multistate Bennett Acceptance Ratio (MBAR) [42] to calculate the weights w(rRLo) in the holo ensemble.

Estimator A (Eq. (9)) and B (Eq. (13)) were used to calculate relative binding free energies. Relative binding free energies were converted to absolute binding free energies by adding them to the YANK binding free energy of the appropriate reference ligand.

The accuracy of the two estimators was compared via several metrics. The root mean square error (RMSE) and Pearson’s R were computed with respect to YANK results. We also compared the estimators using the difference in absolute deviation from YANK,

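$$d = \left|\Delta\Delta\hat{G}_{RL,B} - \Delta\Delta\hat{G}_{RL,Y}\right| - \left|\Delta\Delta\hat{G}_{RL,A} - \Delta\Delta\hat{G}_{RL,Y}\right|, \tag{19}$$

where $\Delta\Delta\hat{G}_{RL,A}$ and $\Delta\Delta\hat{G}_{RL,B}$ are estimator A (Eq. (9)) and B (Eq. (13)), respectively. $\Delta\Delta\hat{G}_{RL,Y}$ is the ΔΔGRL obtained from YANK. A negative value of d indicates that estimator B reduces the absolute deviation from YANK compared to estimator A and a positive value means the opposite.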
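The three metrics can be sketched in a few lines of NumPy. The function names and the numbers in the example are hypothetical and serve only to show the sign convention of d.

```python
import numpy as np

def rmse(estimate, reference):
    """Root mean square error between estimated and reference free energies."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return np.sqrt(np.mean(diff**2))

def pearson_r(estimate, reference):
    """Pearson correlation coefficient between two sets of free energies."""
    return np.corrcoef(estimate, reference)[0, 1]

def abs_deviation_difference(dd_a, dd_b, dd_yank):
    """Eq. 19: negative values mean estimator B is closer to YANK than estimator A."""
    dd_a, dd_b, dd_yank = (np.asarray(x, dtype=float) for x in (dd_a, dd_b, dd_yank))
    return np.abs(dd_b - dd_yank) - np.abs(dd_a - dd_yank)

# Hypothetical values in kcal/mol for two ligands
dd_yank = np.array([0.0, -1.5])
dd_a = np.array([1.0, -2.0])
dd_b = np.array([0.5, -1.0])
d = abs_deviation_difference(dd_a, dd_b, dd_yank)
print(d)
```

In this made-up example, estimator B is closer to the reference for the first ligand (d < 0) and equally close for the second (d = 0).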

Bootstrapping was used to estimate statistical errors. For relative binding free energies, random sets of samples were drawn with replacement from the set of 96 receptor snapshots. The standard deviation of free energy estimates from 100 such sets was reported as the uncertainty. To estimate error bars for RMSE and Pearson’s R values, the set of 18 test ligands was resampled with replacement 100 times.
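The bootstrap over receptor snapshots can be sketched as follows. The generic `estimator` callable, the function name, and the seed are assumptions made for illustration; any of the free energy estimators can be passed in, provided it accepts per-snapshot arrays.

```python
import numpy as np

def bootstrap_std(estimator, arrays, n_boot=100, seed=0):
    """Standard deviation of an estimator over bootstrap resamples of the snapshots.

    estimator : callable taking one array per entry in `arrays`
    arrays    : per-snapshot quantities of equal length (e.g. BPMFs, Psi, weights)
    n_boot    : number of bootstrap sets (100 in this work)
    """
    rng = np.random.default_rng(seed)
    arrays = [np.asarray(a) for a in arrays]
    n = len(arrays[0])
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw snapshot indices with replacement
        estimates.append(estimator(*(a[idx] for a in arrays)))
    return np.std(estimates)

# Example with a simple exponential average as the estimator (kT is a hypothetical value)
kT = 0.6
exp_average = lambda b: -kT * np.log(np.mean(np.exp(-b / kT)))
bpmfs = np.random.default_rng(1).normal(0.0, 1.0, size=96)
print(bootstrap_std(exp_average, [bpmfs]))
```

The same index set is applied to every array in each resample, so quantities belonging to one snapshot stay together.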

4. Results

4.1. Estimator B improves consistency with YANK

Based on binding free energies of all test ligands in test set I relative to all reference ligands, estimator B is more consistent with YANK than estimator A (Fig. 1 and Tab. 1). The RMSE of free energies with respect to YANK is lower with estimator B (3.11 ± 0.21 kcal/mol) than with estimator A (3.76 ± 0.29 kcal/mol). The difference in RMSE is larger than the estimated error, suggesting that the improvement is statistically significant. While the Pearson’s R of estimator B (0.72 ± 0.05) is also slightly higher than that of estimator A (0.65 ± 0.05), the improvement is not large compared to the estimated error.

Figure 1.

Binding free energies for ligands in test set I relative to all reference ligands estimated by YANK (x-axis) and AlGDock (y-axis) using estimator A (left column) or estimator B (right column). Error bars denote the standard deviation from three independent YANK calculations (x-axis) or from bootstrapping BPMFs (y-axis), with the range of error bars representing a single standard deviation. A least-squares linear regression is shown as a dashed line.

Table 1. Comparing the accuracy of estimators A and B with respect to YANK free energies. The RMSE is in kcal/mol. Estimated errors are given in parentheses.

| Reference ligand | Estimator A, Pearson’s R | Estimator A, RMSE | Estimator B, Pearson’s R | Estimator B, RMSE |
|---|---|---|---|---|
| methylpyrrole | 0.94(0.03) | 1.20(0.31) | 0.94(0.03) | 1.59(0.32) |
| benzene | 0.89(0.06) | 1.48(0.46) | 0.85(0.08) | 1.68(0.45) |
| p-xylene | 0.83(0.07) | 1.72(0.36) | 0.83(0.05) | 1.77(0.30) |
| phenol | 0.82(0.07) | 1.94(0.33) | 0.84(0.07) | 1.73(0.37) |
| DL-camphor | 0.91(0.04) | 3.98(0.26) | 0.92(0.04) | 3.55(0.26) |
| n-hexylbenzene | 0.89(0.05) | 7.65(0.33) | 0.86(0.07) | 5.81(0.33) |
| overall | 0.65(0.05) | 3.76(0.29) | 0.72(0.05) | 3.11(0.21) |

Although estimator B performs better overall, differences in quality metrics are only statistically significant for binding free energies relative to two reference ligands (Figs. 2 and 3 and Tab. 1). For three out of six reference ligands (methylpyrrole, benzene and p-xylene), estimator B yields slightly higher RMSEs than estimator A, but the differences are smaller than the estimated error. For the other three systems (phenol, DL-camphor and n-hexylbenzene), estimator B is more consistent with YANK than estimator A. For phenol, the improvement is slight and is unlikely to be statistically significant. However, for DL-camphor and n-hexylbenzene, which are the two most difficult cases yielding the largest RMSEs, estimator B shows a significant reduction in RMSE. For n-hexylbenzene, the RMSE is dramatically reduced from 7.65 ± 0.33 kcal/mol to 5.82 ± 0.35 kcal/mol. On the other hand, differences between the Pearson’s R of the two estimators are smaller than the estimated error (Tab. 1).

Figure 2.

Binding free energies estimated by YANK (x-axis) and AlGDock (y-axis) using estimator A (left column) or estimator B (right column) for ligands in test set I. Error bars denote the standard deviation from three independent YANK calculations (x-axis) or from bootstrapping BPMFs (y-axis), with the range of error bars representing a single standard deviation. A least-squares linear regression is shown as a dashed line. Each row corresponds to a reference ligand. Results for the other three reference ligands are shown in Figure 3.

Figure 3.

Binding free energies continued from Figure 2.

4.2. Estimator B reduces errors with respect to experiment

Overall, estimator B reproduces experimental results more accurately than estimator A (Fig. 4 and Tab. 2). The RMSE with respect to experiment of estimator B (2.77 ± 0.21 kcal/mol) is lower than the RMSE of estimator A (3.53 ± 0.28 kcal/mol). The difference in RMSE is statistically significant because it is much larger than the estimated errors. The Pearson’s R of estimator B (0.32 ± 0.07) is slightly higher than that of estimator A (0.25 ± 0.08), but the difference is small compared to the estimated errors. When considering each reference ligand separately (Figs. 5, 6 and Tab. 2), the RMSE of estimator B with respect to experiment is significantly lower than the RMSE of estimator A in three of six cases. In particular, it significantly improves the most difficult case, based on the reference ligand n-hexylbenzene. For two reference ligands, methylpyrrole and p-xylene, estimator A has a somewhat lower RMSE than estimator B. The two estimators perform equally well in the case of the reference ligand benzene. Due to large estimated errors, differences in the Pearson’s R between the two estimators are not statistically significant.

Figure 4.

Binding free energies relative to all reference ligands measured by experiment (x-axis) and estimated by AlGDock (y-axis) using estimator A (left column) or estimator B (right column) for ligands in test set II. Error bars denote the standard deviation from bootstrapping BPMFs (y-axis), with the range of error bars representing a single standard deviation. A least-squares linear regression is shown as a dashed line.

Table 2. Comparing the accuracy of estimators A and B with respect to experimental free energies. The RMSE is in kcal/mol. Estimated errors are given in parentheses.

| Reference ligand | Estimator A, Pearson’s R | Estimator A, RMSE | Estimator B, Pearson’s R | Estimator B, RMSE |
|---|---|---|---|---|
| methylpyrrole | 0.50(0.18) | 1.04(0.12) | 0.52(0.18) | 1.70(0.20) |
| benzene | 0.41(0.22) | 1.20(0.19) | 0.43(0.19) | 1.03(0.19) |
| p-xylene | 0.75(0.09) | 1.34(0.17) | 0.72(0.10) | 1.78(0.28) |
| phenol | 0.22(0.25) | 2.01(0.22) | 0.42(0.20) | 1.48(0.17) |
| DL-camphor | 0.42(0.11) | 3.78(0.33) | 0.35(0.21) | 2.94(0.28) |
| n-hexylbenzene | 0.63(0.18) | 7.22(0.37) | 0.57(0.17) | 5.29(0.42) |
| overall | 0.25(0.08) | 3.53(0.28) | 0.32(0.07) | 2.77(0.21) |

Figure 5.

Binding free energies measured by experiment (x-axis) and estimated by AlGDock (y-axis) using estimator A (left column) or estimator B (right column) for ligands in test set II. Error bars denote the standard deviation from bootstrapping BPMFs (y-axis), with the range of error bars representing a single standard deviation. A least-squares linear regression is shown as a dashed line. Each row corresponds to a reference ligand. Results for the other three reference ligands are shown in Figure 6.

Figure 6.

Binding free energies continued from Figure 5.

4.3. Estimator B converges faster for the most difficult case

For all reference ligands except for n-hexylbenzene, the rate at which the binding free energy estimators converge is comparable (Fig. 7). Convergence was assessed by the rate at which the RMSE with respect to YANK is reduced as the number of receptor snapshots is increased. Convergence behavior is very similar for five out of six reference ligands: methylpyrrole, benzene, p-xylene, phenol and DL-camphor; the curves look rather similar and level off starting at almost the same number of receptor snapshots. For p-xylene and DL-camphor, estimator B appears to converge more closely than estimator A, but differences are well within error bars. On the other hand, estimator B shows significantly faster convergence in the RMSE for the hardest system, n-hexylbenzene. With estimator A, increasing the number of snapshots actually increases the RMSE with respect to YANK.

Figure 7. Convergence of binding free energies.

RMSE with respect to YANK binding free energies for estimator A (blue) and B (red) for ligands in test set I. Error bars denote the standard deviation from bootstrapping. Different panels correspond to different reference ligands.

4.4. Overall, estimator B is more precise than A

In addition to being more accurate and converging more quickly than estimator A, estimator B is also more precise. In the majority of systems, 63%, the error estimated for estimator B is smaller than for estimator A (Fig. 8). The mean and median values of error estimates for estimator A are 0.89 kcal/mol and 0.81 kcal/mol, respectively; for estimator B, they are 0.74 kcal/mol and 0.66 kcal/mol, respectively.

Figure 8.

Bootstrap errors of estimator B versus A.

4.5. Performance improvement of estimator B over A is related to correlation between h(rRLo) and g(rRLo)

Following expectations for the control variates technique, estimator B exhibits the most consistent improvement over estimator A when the statistic h(rRLo) is most correlated with the control variate g(rRLo) (Fig. 9 and Tab. 3). Here, the metric that we consider is d (Eq. (19)), the difference in the absolute deviation from YANK. Note that a negative d indicates that estimator B has a smaller absolute deviation from YANK than estimator A, while a positive value means that it has a larger one. When the absolute correlation between h(rRLo) and g(rRLo) is weak, such that ∣Corr(h, g)∣ < 0.3, the values of d are almost equally spread on both positive and negative sides; estimator B improves over A in 58% of the binding free energy calculations. On the other hand, when the correlation is stronger, ∣Corr(h, g)∣ > 0.3, d is negative in 79% of calculations. Averaging over all systems, the mean of d is −0.3 kcal/mol, consistent with other metrics demonstrating that estimator B is an improvement over estimator A.

Figure 9.

Difference in absolute deviation from YANK, d, versus correlation between g(rRLo) and h(rRLo).

Table 3. Statistics related to the difference in absolute deviation from YANK, d, and the absolute correlation coefficient ∣Corr(g, h)∣. d is in kcal/mol.

| statistic | ∣Corr(g, h)∣ < 0.3, d < 0 | ∣Corr(g, h)∣ < 0.3, d ≥ 0 | ∣Corr(g, h)∣ ≥ 0.3, d < 0 | ∣Corr(g, h)∣ ≥ 0.3, d ≥ 0 |
|---|---|---|---|---|
| percentage of calculations | 58% | 42% | 79% | 21% |
| mean of d within subset | −0.84 | 0.58 | −0.64 | 0.38 |
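The statistics in Table 3 could be computed along the following lines, given one correlation coefficient and one value of d per binding free energy calculation. The function name, the dictionary layout, and the example numbers are hypothetical.

```python
import numpy as np

def _mean_or_nan(values):
    """Mean of an array, or NaN when the array is empty."""
    return float(np.mean(values)) if len(values) > 0 else float("nan")

def summarize_by_correlation(corr_hg, d, threshold=0.3):
    """Split calculations by |Corr(h, g)| and summarize the sign of d in each subset."""
    abs_corr = np.abs(np.asarray(corr_hg, dtype=float))
    d = np.asarray(d, dtype=float)
    summary = {}
    for label, mask in (("weak", abs_corr < threshold), ("strong", abs_corr >= threshold)):
        subset = d[mask]
        improved = subset < 0  # estimator B closer to YANK than estimator A
        summary[label] = {
            "fraction_improved": _mean_or_nan(improved),
            "mean_d_improved": _mean_or_nan(subset[improved]),
            "mean_d_not_improved": _mean_or_nan(subset[~improved]),
        }
    return summary

# Hypothetical inputs: four calculations
example = summarize_by_correlation(corr_hg=[0.1, -0.2, 0.5, -0.9],
                                   d=[-1.0, 0.5, -0.2, -0.4])
print(example)
```

The correlation for each calculation would come from the per-snapshot values of h and g, for example with `np.corrcoef(h, g)[0, 1]`.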

5. Conclusions

Using the control variates technique, we have developed a new statistical estimator for using ILT to calculate binding free energies relative to a reference ligand that defines the holo ensemble. We have applied the new estimator to previously collected simulation data for the binding of 6 reference ligands and 18 test ligands to T4 lysozyme. In most cases, differences in the performance of the new and old estimators are not statistically significant. In the cases where statistically significant differences are observed, the new estimator outperforms the old one; it is more accurate and precise, and it converges more quickly. As anticipated for the control variates technique, performance improvements are greatest in cases where there is a strong correlation between the statistic h(rRLo) and control variate g(rRLo).

We recommend using the new estimator over the previous one. Python code for the new estimator is included in the SI and on GitHub at https://github.com/nguyentrunghai/RelBfe_control_variates. Data analyzed in this study will be made available upon acceptance for publication.

Supplementary Material

code

6. Acknowledgment

This research was supported in part by the National Institutes of Health (R01GM127712).

Contributor Information

Trung Hai Nguyen, Laboratory of Theoretical and Computational Biophysics, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Faculty of Applied Sciences, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

David D. L. Minh, Department of Chemistry, Illinois Institute of Technology, Chicago, IL 60616, USA.

References
