Protein Science: A Publication of the Protein Society. 2023 Oct 1;32(10):e4772. doi: 10.1002/pro.4772

Structural characterization of an intrinsically disordered protein complex using integrated small‐angle neutron scattering and computing

Serena H Chen 1, Kevin L Weiss 2, Christopher Stanley 1, Debsindhu Bhowmik 1
PMCID: PMC10503416  PMID: 37646172

Abstract

Characterizing structural ensembles of intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) of proteins is essential for studying structure–function relationships. Because hydrogen and deuterium have different neutron scattering lengths, selective labeling and contrast matching in small‐angle neutron scattering (SANS) become effective tools to study dynamic structures of disordered systems. However, experimental timescales typically capture measurements averaged over multiple conformations, leaving complex SANS data for disentanglement. Here, we demonstrate an integrated method to elucidate the structural ensemble of a complex formed by two IDRs. We use data from both full contrast and contrast matching with residue‐specific deuterium labeling SANS experiments, microsecond all‐atom molecular dynamics (MD) simulations with four molecular mechanics force fields, and an autoencoder‐based deep learning (DL) algorithm. From our combined approach, we show that selective deuteration provides additional information that helps characterize structural ensembles. We find that among the four force fields, a99SB‐disp and CHARMM36m show the strongest agreement with SANS and NMR experiments. In addition, our DL algorithm not only complements conventional structural analysis methods but also successfully differentiates NMR and MD structures that are indistinguishable on the free energy surface. Lastly, we present an ensemble that describes experimental SANS and NMR data better than MD ensembles generated by any single force field and reveal three clusters of distinct conformations. Our results demonstrate a new integrated approach for characterizing structural ensembles of IDPs.

Keywords: deep learning, force field, intrinsically disordered protein, small‐angle neutron scattering, structural ensemble

1. INTRODUCTION

Prevalent across the three domains of life, intrinsically disordered proteins (IDPs) are important for a wide range of biological functions from molecular recognition to regulation of enzyme activity (Chen & Bell, 2021; Dunker et al., 2000; Dunker et al., 2002; Huber & Bennett Jr., 1983). As the function of a protein depends on its structural dynamics, studying the structural ensembles of IDPs is essential for uncovering their physiological roles. Yet, due to their extensive structural flexibility, determining structural ensembles of IDPs presents a challenge for both experiment and computation.

A variety of experimental techniques have been developed and employed to characterize structural ensembles of IDPs, including innovative variations of nuclear magnetic resonance (NMR) spectroscopy (Konrat, 2014; Schneider et al., 2019) and single‐molecule approaches (Ferreon et al., 2010; Michalet et al., 2006). Complementary to these techniques, small‐angle x‐ray and neutron scattering (SAXS and SANS) offer global structural information of IDPs in terms of shape and size which are useful for validation of high‐resolution structures (Bernadó & Svergun, 2012; Gabel, 2012; Kachala et al., 2015) and elucidating complexes (Bush et al., 2019). Compared with X‐rays, neutrons are scattered by atomic nuclei, leading to their high sensitivity to light elements, such as hydrogen. Different hydrogen isotopes, such as hydrogen and deuterium, result in distinct neutron scattering lengths. Therefore, by selective deuteration and varying the D2O to H2O ratio of the solvent, one can modulate contrast between different components of a system, making SANS an ideal probe for studying structural ensembles of IDPs (Banks et al., 2018; Goldenberg & Argyle, 2014; Johansen et al., 2011; Mansouri et al., 2016; Stuhrmann, 2012). While SANS contrast matching and variation experiments typically use selective deuteration of an entire component, methods of selective labeling at the residue level were developed to characterize dynamics within globular proteins (Laux et al., 2008; Réat et al., 1998; Wood et al., 2013). Here we employed these capabilities to measure an IDP complex involving a residue‐specific deuterium‐labeled protein, with the remaining hydrogenated portion contrast matched out. To the best of our knowledge, these methods have not been demonstrated in IDPs before.

Due to the experimental timescale, SANS, similar to other ensemble methods, yields scattering intensities averaged over all conformations. In this respect, computational techniques complement experiment by modeling a pool of conformations to describe experimental data. Modeling is predominantly based on three approaches. One approach is to apply the experimental data as restraints for conformational sampling (Allison et al., 2009; Vendruscolo & Dobson, 2005). Structural ensembles constructed by this approach depend on the number of experimental restraints as well as on how stringently the restraints are enforced during sampling. Another approach, used by the ensemble optimization method (EOM) (Bernadó et al., 2007) and ENSEMBLE (Krzeminski et al., 2013), is to generate a large number of possible conformations from the conformational space, and then select a subset of the conformations that fit the experimental data. The outcomes of this approach depend on the quality and diversity of the initial pool of conformations. The third approach applies various enhanced sampling techniques to cross free energy barriers and sample more conformational space (Appadurai et al., 2021; Shrestha et al., 2019). This approach remains computationally intensive and requires careful scrutiny for convergence (Zhang et al., 2005).

Here, we integrated SANS experiments, molecular dynamics (MD) simulations, and deep learning (DL) to characterize the structural ensemble of an IDP complex. Specifically, we coupled residue‐specific deuterium labeling in SANS, MD simulation, and an autoencoder‐based DL algorithm to study the structural ensemble of a protein complex formed by two intrinsically disordered regions (IDRs): the nuclear co‐activator binding domain (NCBD) of CREB binding protein and the activation domain of the p160 transcriptional co‐activator for thyroid hormone and retinoid receptors (ACTR). The NCBD/ACTR complex regulates gene expression and has a long‐known association with breast and ovarian cancers (Anzick et al., 1997). The complex formation, determined by NMR spectroscopy, illustrated the first example of coupled binding and folding of IDPs (Demarest et al., 2002). While the NCBD/ACTR complex is more rigid than NCBD and ACTR in their individual unbound states, 20% of the residues of the complex remain highly flexible (Ebert et al., 2008), which motivates our present study to characterize the NCBD/ACTR structural ensemble. To gain insights into the dynamic structures of the NCBD/ACTR complex, we performed SANS experiments and microsecond all‐atom MD simulations with four molecular mechanics force fields. We compared the structural ensembles generated by these force fields with NMR experiments from the literature and our SANS data. We found that among the four force field‐based structural ensembles, a99SB‐disp and CHARMM36m show the strongest agreement with experiments. Based on the MD simulations, we observed conversion between helices and disordered coils. Such transient secondary structure changes affect the shape of the interface. These subtle global and local structure details are indistinguishable on the free energy surface and only become apparent when visualizing their three‐dimensional structures. 
Therefore, we developed a DL algorithm to complement potential of mean force calculations and enable a comprehensive comparison of high‐dimensional molecular structures of the NCBD/ACTR complex. We show that the DL model successfully distinguishes NMR and MD structures in the latent space. From the latent space, we characterized the composition of distinct conformations populated in a representative ensemble that showed the best agreement with the experiments. Our work provides an integrated SANS, MD, and DL workflow for characterizing structural ensembles of IDPs.

2. RESULTS AND DISCUSSION

Given the disordered nature of IDPs, the more structural descriptions we determine from experiments, the better we can characterize the structural ensemble. To this end, we conducted selective labeling SANS experiments. We deuterated the mouse NCBD peptide at all five alanine (Ala) and seven leucine (Leu) positions (d A,L‐NCBD) (Figure 1a) and collected SANS data using the extended Q‐range small‐angle neutron scattering instrument at the Spallation Neutron Source located at Oak Ridge National Laboratory (Zhao et al., 2010). Ala and Leu residues were chosen based on commercially available FMOC deuterated amino acids for the peptide synthesis. The motivation to label all of these positions was to provide sufficient signal for the SANS experiment. By adjusting the buffer solution to 40% D2O for d A,L‐NCBD/ACTR, the hydrogenated portion is suppressed such that the measured signal predominately reflects NCBD d‐Ala and d‐Leu correlations. As a control measurement, SANS was performed on hydrogenated NCBD (h‐NCBD/ACTR) in 40% D2O buffer, where a linear–linear plot of the scattering profile illustrates the contrast matched complex compared to the signal from d A,L‐NCBD/ACTR in the same 40% D2O buffer (Figure 1b). Indeed, the calculated match point of the d A,L‐NCBD/ACTR hydrogenated component (i.e., excluding d‐Ala and d‐Leu residues) is 43.7% D2O using MULCh (Whitten et al., 2008). A full contrast SANS profile also was obtained for all‐hydrogenated, h‐NCBD/ACTR in 100% D2O buffer. Note that our study assumes that deuteration does not cause substantial changes in the NCBD/ACTR complex structure; however, the contrast‐matched and full‐contrast SANS profiles capture distinct parts of the protein complex. The former focuses on d‐Ala and d‐Leu correlations of the NCBD in the complex and the latter on the whole complex. 
After initial Guinier fits to the SANS data (Figure S1a,b), pair‐distribution, P(r), profiles were calculated yielding the radius of gyration (R g) for each case (Figure 1c). The linearized Guinier plots and fits provide a quality assessment of the low Q portion of the scattering curves, while the P(r) analysis covers a broader Q range of the data (see fits in Figure S1c,d). Based on the differences between the Guinier and P(r) fit methods, slight discrepancies in the obtained R g values can be expected, as we note particularly for the d A,L‐NCBD/ACTR in 40% D2O case (see Table S1). We also investigated the magnitude of the dependence of R g values on chosen D max in the P(r) fitting procedure. We performed multiple P(r) fits with varying D max , while staying within an appropriate range to still obtain good fits (see Figure 1c). The R g values are consistent; they only vary by ~0.5 Å in both cases and by no more than 0.8 Å if the extremes between error bars are considered (Figure S1e,f). For all of our further analyses, we used representative P(r) fit results, where the R g values of h‐NCBD/ACTR in 100% D2O and d A,L‐NCBD/ACTR in 40% D2O are 15.0 ± 0.1 Å and 11.7 ± 0.1 Å, respectively. These R g values were used as structural metrics to categorize MD structures of the NCBD/ACTR complex for further characterization.
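
The Guinier step can be illustrated with a minimal sketch. It assumes the standard Guinier approximation, I(Q) ≈ I(0) exp(−Q²R g²/3), and fits a straight line to ln I(Q) versus Q². The function name `guinier_fit` and the synthetic, noise‐free data are hypothetical; this is not the fitting procedure used for Figure S1.

```python
import math

def guinier_fit(q, intensity):
    """Fit ln I = ln I0 - (Rg^2 / 3) * Q^2 by ordinary least squares.

    Returns (Rg, I0). Only meaningful at low Q, where the Guinier
    approximation holds.
    """
    x = [qi * qi for qi in q]
    y = [math.log(ii) for ii in intensity]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.sqrt(-3.0 * slope), math.exp(intercept)

# Synthetic data generated from the Guinier law itself (illustration only).
rg_true, i0_true = 15.0, 12.2                 # Å and cm^2/g
q = [0.02 + 0.005 * k for k in range(12)]     # Å^-1, so Q*Rg stays near or below 1.1
intensity = [i0_true * math.exp(-(qi * rg_true) ** 2 / 3.0) for qi in q]

rg_fit, i0_fit = guinier_fit(q, intensity)
```

On real data the fit would also weight each point by its experimental uncertainty and restrict the Q range until the residuals are flat, which this sketch omits.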

FIGURE 1.


Selective deuteration and SANS provide representative R g values for NCBD/ACTR structural characterization. (a) Model of the NCBD/ACTR complex illustrating the selectively deuterated Ala and Leu residues of NCBD used for neutron experiments. NCBD (black ribbon) was selectively deuterated at all five Ala and seven Leu positions (yellow spheres). ACTR (gray ribbon) is also shown. (b) Linear–linear plot of SANS curves to illustrate the contrast matched NCBD/ACTR control experiment. SANS curves are normalized by concentration, I(Q)/c. The all‐hydrogenated, h‐NCBD/ACTR complex at 40% D2O (black, open symbols) is contrast matched, showing no scattering signal above the background. The measurable scattering from selectively labeled, d A,L‐NCBD/ACTR in 40% D2O (red, solid symbols) is shown for comparison. (c) Pair‐distribution profiles calculated from the SANS data for h‐NCBD/ACTR in 100% D2O (blue) and d A,L‐NCBD/ACTR in 40% D2O buffer (red). Additional profile calculations, with varying D max , are also shown for each in light blue and red, respectively (also see Figure S1e,f). The R g values from the representative profiles are listed in the legend (also see Table S1). (d) Dimensionless Kratky plots for the full contrast h‐NCBD/ACTR in 100% D2O (blue) and selectively labeled d A,L‐NCBD/ACTR in 40% D2O (red). The dotted lines represent the reference (x, y)‐values (√3, 1.104), where folded proteins exhibit a local maximum near the intersection. (e) Ab initio shape reconstructions from the two SANS contrast conditions superimposed onto the PDB 1KBH structure. The views are rotated 90° (indicated by the arrow) from top to bottom. The middle and right views also highlight the NCBD selectively deuterated Ala and Leu residues (yellow). The right view shows the merge overlay of the left and middle shape envelopes.

We also performed additional analysis on the two SANS contrast conditions to investigate the overall shape information obtained from each of them. From a dimensionless Kratky plot (Figure 1d), we observe that both h‐NCBD/ACTR in 100% D2O and d A,L ‐NCBD/ACTR in 40% D2O trend monotonically downward toward higher QR g with a local maximum near the intersection point for folded proteins: (x, y)‐values = (√3, 1.104) (Receveur‐Brechot & Durand, 2012). However, these maxima are off‐centered to the upper‐right of the intersection point, in the direction indicating some structural flexibility. The selectively labeled region appears more compact compared to the full contrast structure, which is consistent with the P(r) curves in Figure 1c. We also generated ab initio shape reconstructions from the two SANS contrast conditions (Figure 1e). These shape comparisons provide a visualization of the more compact substructure of the selectively labeled NCBD portion relative to the entire, full‐contrast NCBD/ACTR complex. Yet, our collective SANS analysis is unable to resolve atomic‐level insights into NCBD/ACTR complex structure and flexibility. To draw further insights into the indicated structural flexibility, we next leverage our integrated computational approach.
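
As an illustration of where the reference point comes from: for a scatterer that follows the Guinier law, the dimensionless Kratky curve, (QR g)² I(Q)/I(0) plotted against QR g, peaks at QR g = √3 with height 3/e ≈ 1.104. The sketch below checks this numerically. The function name and the synthetic curve are assumptions made for illustration only.

```python
import math

def dimensionless_kratky(q, intensity, rg, i0):
    """Return (Q*Rg, (Q*Rg)^2 * I(Q)/I(0)) pairs for a scattering curve."""
    return [(qi * rg, (qi * rg) ** 2 * ii / i0) for qi, ii in zip(q, intensity)]

# Synthetic compact-particle curve from the Guinier law (illustration only).
rg, i0 = 15.0, 1.0
q = [0.001 * k for k in range(1, 301)]        # Å^-1
intensity = [i0 * math.exp(-(qi * rg) ** 2 / 3.0) for qi in q]

points = dimensionless_kratky(q, intensity, rg, i0)
x_peak, y_peak = max(points, key=lambda p: p[1])
```

A maximum displaced up and to the right of this point, as reported above for both contrast conditions, is the usual signature of a less compact or more flexible particle.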

To compare with experimental data, we generated a comprehensive pool of structures using multiple atomistic molecular mechanics force fields in MD simulations and explored the effect of force fields on simulated NCBD/ACTR complex structures (Rauscher et al., 2015; Yoda et al., 2004). We started from an NMR model of PDB 1KBH (Demarest et al., 2002) and performed a total of 12 μs all‐atom MD simulations in an explicit solvent with four widely used Amber and CHARMM force fields: a99SB (Hornak et al., 2006), a99SB‐ILDN (Lindorff‐Larsen et al., 2010), a99SB‐disp (Robustelli et al., 2018), and C36m (Huang et al., 2017). We selected these force fields to capture the transition of force field development from targeting mainly ordered proteins (a99SB and a99SB‐ILDN) to both ordered and disordered proteins (a99SB‐disp and C36m). Note that the a99SB force field was used in the original NMR structural refinement (Demarest et al., 2002). Although some studies show deuteration effects in force field parameters (Agarwal et al., 2020) as well as on protein stability and dynamics (Nichols et al., 2020), the computationally rigorous approach of reparametrizing d‐Ala and d‐Leu with D2O solvent is beyond the scope of this work. Instead, our focus here is to study structural ensembles of unmodified proteins in the native environment, that is, H2O solvent, and to understand how they compare to contrast matching SANS experimental data. Toward this aim, we performed MD simulations using unmodified hydrogen mass and force field parameters with H2O solvent. To assess structural flexibility at the deuterated positions, Ala and Leu of NCBD, in our simulations, we highlighted the positions of the deuterium‐labeled side chains in a representative MD trajectory and calculated the average Cα root‐mean‐square fluctuation (RMSF) per residue of ACTR and NCBD structures in four MD ensembles (Figure S2). 
In all MD structures, the deuterated positions show little fluctuation (less than 2 Å), suggesting that these sites are relatively stable. Moreover, we computed helical fraction and all‐atom root‐mean‐square deviation (RMSD) of overall MD structures from the NMR model (Figure S3). The MD structures sampled by different force fields have distinct helical fraction distributions in ACTR and NCBD, especially in the first 30 residues of ACTR. Overall, the a99SB structures have the lowest average helical content, and the C36m structures have the highest average helical content. Interestingly, the regions of low helical fraction also correspond to regions of higher RMSF values in Figure S2. However, the average helical fraction of these regions is not zero, suggesting flexible movement of ordered helices and helix–loop conversion in the complex. RMSD distributions of the MD structures range from 4 to 10 Å, suggesting various degrees of flexibility within the complex structure. The trends of the median RMSD and average helical fraction of the four forcefields are reversed, where the C36m structures have the lowest median RMSD value and the highest helical fraction, and the a99SB structures have the highest median RMSD value and the lowest helical fraction. Interestingly, despite their higher helical content, the a99SB‐disp structures have the broadest RMSD distribution, followed by the C36m structures. The RMSD distributions of the a99SB and a99SB‐ILDN structures are relatively narrow.
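
For readers unfamiliar with the fluctuation metric, the following is a minimal sketch of a per‐atom RMSF calculation, the square root of the time‐averaged squared deviation of each atom from its mean position. It assumes the frames have already been superimposed on a common reference; the function name and the two‐frame toy trajectory are hypothetical.

```python
import math

def rmsf_per_atom(frames):
    """frames: list of frames, each a list of (x, y, z) tuples, one per atom.

    Assumes all frames were aligned beforehand, so that only internal
    fluctuations remain.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for a in range(n_atoms):
        mean = [sum(f[a][d] for f in frames) / n_frames for d in range(3)]
        msd = sum(
            sum((f[a][d] - mean[d]) ** 2 for d in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy trajectory: atom 0 is fixed, atom 1 moves by 2 Å along x.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsf = rmsf_per_atom(frames)
```

In this toy case the fixed atom has an RMSF of 0 Å and the moving atom 1 Å, which would fall under the 2 Å threshold quoted above.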

Next, to evaluate the agreement between MD simulations and experiments, we compared MD‐derived and experimental SANS curves and NMR chemical shifts. For each MD NCBD/ACTR complex structure we computed two SANS scattering curves and associated R g values following the same deuteration pattern in the experiments using the CRYSON program (Svergun et al., 1998) without experimental fitting to prevent bias. The scattering curves overlap well with the experiments, especially at low Q, suggesting that all MD ensembles have similar average shape (Figure 2a). To further compare theoretical and experimental scattering curves at different Q ranges, we calculated the reduced chi‐square value (χ 2) against increasing Q range using Equation (1).

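$$\chi^{2} = \frac{1}{N-1}\sum_{Q}\frac{\left(I_{\mathrm{exp}}(Q) - I_{\mathrm{comp}}(Q)\right)^{2}}{\sigma_{\mathrm{exp}}(Q)^{2}} \qquad (1)$$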
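
The increasing‐Q‐range comparison can be sketched as follows. This assumes that Equation (1) is a sum over Q of squared, error‐weighted residuals divided by N − 1. The normalization constant `n` is passed in by the caller rather than derived, and all function names and toy numbers are hypothetical.

```python
def chi_square(q, i_exp, i_comp, sigma_exp, q_upper, n, q_lower=0.02):
    """Sum of squared, error-weighted residuals over q_lower <= Q <= q_upper,
    divided by (n - 1)."""
    total = 0.0
    for qi, ie, ic, se in zip(q, i_exp, i_comp, sigma_exp):
        if q_lower <= qi <= q_upper:
            total += ((ie - ic) / se) ** 2
    return total / (n - 1)

def chi_square_vs_range(q, i_exp, i_comp, sigma_exp, upper_bounds, n):
    """Evaluate chi_square for a series of increasing upper Q bounds."""
    return [
        (qu, chi_square(q, i_exp, i_comp, sigma_exp, qu, n))
        for qu in upper_bounds
    ]

# Toy curves (illustration only; not experimental values).
q = [0.02, 0.03, 0.04, 0.05]          # Å^-1
i_exp = [10.0, 8.0, 6.0, 4.0]
i_comp = [10.0, 7.0, 6.0, 6.0]
sigma_exp = [1.0, 0.5, 1.0, 1.0]
n = 5                                  # hypothetical normalization constant

chi_low = chi_square(q, i_exp, i_comp, sigma_exp, q_upper=0.035, n=n)
chi_high = chi_square(q, i_exp, i_comp, sigma_exp, q_upper=0.055, n=n)
scan = chi_square_vs_range(q, i_exp, i_comp, sigma_exp, [0.035, 0.045, 0.055], n)
```

Because each upper bound adds non‐negative terms to the sum while `n` stays fixed, the value can only stay level or grow as the Q range widens, so a late rise points to disagreement at high Q.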

FIGURE 2.


Comparison of NCBD/ACTR force field‐based structural ensembles reveals that a99SB‐disp and C36m show the strongest agreement with SANS and NMR experiments. (a–c) Comparison of calculated average SANS curves by CRYSON (Svergun et al., 1998) to experimental SANS curves. (a) SANS curves normalized by concentration, I(Q)/c, against Q from 0.02 Å⁻¹ to 0.30 Å⁻¹ for h‐NCBD/ACTR in 100% D2O and against Q from 0.02 Å⁻¹ to 0.25 Å⁻¹ for d A,L‐NCBD/ACTR in 40% D2O. Experimental SANS data are shown as black points, where error bars represent standard deviation. Computed average SANS curves are depicted as lines, where NMR is in navy, a99SB in tan, a99SB‐ILDN in pink, a99SB‐disp in dark orange, and C36m in light blue. Note that the calculated I(Q)/c is normalized by the experimental I(0)/c of P(r) listed in Table S1, that is, 12.2 cm²/g and 0.596 cm²/g for h‐NCBD/ACTR in 100% D2O and d A,L‐NCBD/ACTR in 40% D2O, respectively. The standard deviations of computed SANS curves are shown but lie within the line plots. (b,c) Reduced chi‐square values (χ²) of NMR structures and MD ensembles referenced to experimental SANS curves for (b) h‐NCBD/ACTR in 100% D2O and (c) d A,L‐NCBD/ACTR in 40% D2O. The χ² is plotted against the increasing Q range, in which the lower bound is 0.02 Å⁻¹ and the upper bound (Q upper bound) runs from 0.03 Å⁻¹ to 0.30 Å⁻¹ in b and to 0.25 Å⁻¹ in c, with an interval of 0.01 Å⁻¹. Shaded error bars represent the 95% confidence interval of the mean. Error bars of the χ² for the MD ensembles are shown but most of them lie within the line plots. The χ² value for h‐NCBD/ACTR in 100% D2O is higher than that for d A,L‐NCBD/ACTR in 40% D2O, which is attributed to the higher scattering contrast of h‐NCBD/ACTR in 100% D2O. (d) Radius of gyration (R g) of MD structures computed using atomic coordinates and masses in GROMACS (Lindahl et al., 2001) (MD) as well as using CRYSON with and without selective deuteration (CRYSON d A,L‐NCBD/ACTR in 40% D2O vs. CRYSON h‐NCBD/ACTR in 100% D2O). 
Average R g values of the 20 NMR structures using CRYSON are included for comparison. R g values derived from the experimental SANS curves of h‐NCBD/ACTR in 100% D2O and d A,L‐NCBD/ACTR in 40% D2O are shown as dashed lines in blue and red, respectively. Error bars represent the standard deviation among the structures of each ensemble. (e) Comparison of MD‐derived and experimental chemical shifts by (left) Pearson's correlation coefficients and (right) RMSEs. Chemical shifts of MD structures are computed using SPARTA+ (Shen & Bax, 2010).

Iexp and Icomp are the scattering intensities measured by SANS experiments and obtained by computation, respectively. σexp is the standard deviation of the experiments. N is the total number of structures in an ensemble. There are 27,000 structures in each MD ensemble. The lower bound of all Q ranges is 0.02 Å⁻¹, and the upper bound increases from 0.03 Å⁻¹ to 0.30 Å⁻¹ for h‐NCBD/ACTR in 100% D2O and from 0.03 Å⁻¹ to 0.25 Å⁻¹ for d A,L‐NCBD/ACTR in 40% D2O, both with an interval of 0.01 Å⁻¹. The χ² values computed across the increasing Q range are summarized in Figure 2b,c. Even though all MD ensembles fit well at low Q, for h‐NCBD/ACTR in 100% D2O, the a99SB‐disp ensemble has the lowest χ² value, especially when Q is less than 0.15 Å⁻¹. At high Q, the χ² value increases due to the background scattering, but overall, the C36m ensemble has the lowest χ² value when Q is greater than 0.15 Å⁻¹. The χ² value for d A,L‐NCBD/ACTR in 40% D2O is small for all ensembles, ranging between 1.18 and 2.18 for the largest Q range, from 0.02 Å⁻¹ to 0.25 Å⁻¹. The small χ² values support the observation from the MD simulations that the deuterated positions are relatively stable. Note that the χ² value for h‐NCBD/ACTR in 100% D2O is higher than that for d A,L‐NCBD/ACTR in 40% D2O. This likely results from the higher scattering contrast of h‐NCBD/ACTR in 100% D2O: the experimental I(0)/c of P(r) for h‐NCBD/ACTR is roughly 20‐fold higher than that for d A,L‐NCBD/ACTR (Table S1), and propagating this factor to the χ² values would account for the observed differences. The large χ² value in the high Q region for h‐NCBD/ACTR in 100% D2O is unlikely to be due to incomplete H/D exchange, given that our sample preparation method involved dissolving NCBD and ACTR separately in 100% D2O before combining these proteins to form the complex (see Materials and Methods for details). 
Considering both proteins alone are intrinsically disordered and highly solvent accessible, we expect fully exchanged H/D for these experiments. We also compared the R g values of MD structures computed using GROMACS (Lindahl et al., 2001) as well as using CRYSON with and without selective deuteration (Figure 2d). As expected, the R g values calculated by GROMACS and CRYSON without selective deuteration are comparable. The a99SB‐disp and C36m structures have the R g values that are the closest to the SANS experiments in both sample conditions.
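
For reference, a mass‐weighted R g computed from coordinates, the kind of geometric quantity an MD package reports, can be sketched as below. This is an illustrative reimplementation with hypothetical names, not GROMACS code. The scattering‐derived R g is weighted by scattering contrast rather than mass, which is why the selectively deuterated CRYSON values differ from the geometric ones.

```python
import math

def radius_of_gyration(coords, masses):
    """Mass-weighted Rg: sqrt( sum_i m_i |r_i - r_com|^2 / sum_i m_i )."""
    total_mass = sum(masses)
    com = [
        sum(m * c[d] for m, c in zip(masses, coords)) / total_mass
        for d in range(3)
    ]
    weighted = sum(
        m * sum((c[d] - com[d]) ** 2 for d in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted / total_mass)

# Toy system: two equal masses 2 Å apart, so each sits 1 Å from the center.
coords = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
masses = [12.0, 12.0]
rg = radius_of_gyration(coords, masses)
```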

To explore variation between individual NCBD/ACTR complex structures of each MD ensemble, we used the two calculated R g values as structural metrics to classify each MD structure into four categories: “100+40+,” “100+40−,” “100−40+,” and “100−40−” (Figure S4). A structure falls into the “100+40+” category if both calculated R g values are within the two experimental R g constraints listed in Figure 1c. If only one of the R g values satisfies the experimental constraints, the structure is in the “100+40−” or “100−40+” category. If neither calculated R g value satisfies the experimental constraints, the structure is in the “100−40−” category. As a comparison, we also computed the R g values of all 20 NMR models in PDB 1KBH. Surprisingly, none of the 20 NMR structures satisfies both experimental constraints. Three out of the 20 structures belong to 100−40+, and the other 17 structures are 100−40−. A similar distribution is observed in the a99SB structural ensemble. About 12.7% of the NCBD/ACTR structures are 100−40+, and the rest are mostly 100−40− (there is one 100+40− structure in one of the three replicas near 950 ns). As the NMR structures are refined using an Amber force field (Demarest et al., 2002), it is likely that these early versions of Amber force fields are biased to the 100−40+ and 100−40− conformations. The a99SB‐ILDN force field improves the accuracy of the side chain torsion potentials for four amino acids of the a99SB force field (Lindorff‐Larsen et al., 2010). However, these modifications do not improve sampling of NCBD/ACTR structures that satisfy the constraints. Less than 0.1% of the sampled structures are 100+40− or 100−40+, while most of the sampled structures satisfy neither experimental constraint. In contrast to the a99SB and a99SB‐ILDN force fields, the a99SB‐disp force field generates NCBD/ACTR complex structures in all four categories. 
About 0.2% of the NCBD/ACTR complex structures are 100+40+, 17.7% are 100+40−, 1.1% are 100−40+, and the remaining 81.0% are 100−40−. Similarly, the C36m force field also generates a heterogeneous ensemble of the NCBD/ACTR complex in all four categories. The C36m distribution of 100+40+, 100+40−, 100−40+, and 100−40− structures is 0.7%, 7.3%, 5.6%, and 86.4%, respectively. Although both the a99SB‐disp and C36m force fields generate a small number of 100+40+ structures, these appear mostly after 800 ns in the a99SB‐disp ensemble, while they are distributed consistently throughout 1 μs in the C36m ensemble. According to this initial comparison of the MD‐derived SANS curves and R g values with the experimental SANS data, we found that the C36m and a99SB‐disp force fields describe the NCBD/ACTR complex structure better than the a99SB and a99SB‐ILDN force fields. The same trend applies when we compared MD‐derived and experimental chemical shifts of different nuclei, where the C36m and a99SB‐disp structures show the strongest agreement with the experiments, with overall the highest correlation coefficients and the lowest RMSE values (Figure 2e).
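
The four‐way classification can be sketched as follows. The acceptance windows are an assumption: here a calculated R g satisfies a constraint if it lies within the representative experimental value plus or minus its reported uncertainty (15.0 ± 0.1 Å and 11.7 ± 0.1 Å), which may differ from the exact windows taken from Figure 1c. Function names and the toy R g pairs are hypothetical, and the labels use an ASCII hyphen for the minus sign.

```python
from collections import Counter

def classify(rg_100, rg_40, exp_100=(15.0, 0.1), exp_40=(11.7, 0.1)):
    """Label a structure by whether each calculated Rg (Å) falls inside the
    assumed window (value, half-width) for the 100% and 40% D2O conditions."""
    ok_100 = abs(rg_100 - exp_100[0]) <= exp_100[1]
    ok_40 = abs(rg_40 - exp_40[0]) <= exp_40[1]
    return "100" + ("+" if ok_100 else "-") + "40" + ("+" if ok_40 else "-")

def category_fractions(rg_pairs):
    """Fraction of structures in each category for a list of (rg_100, rg_40)."""
    counts = Counter(classify(a, b) for a, b in rg_pairs)
    return {label: c / len(rg_pairs) for label, c in counts.items()}

# Toy Rg pairs in Å (illustration only).
pairs = [(15.05, 11.72), (15.05, 12.5), (16.0, 11.65), (16.0, 13.0)]
labels = [classify(a, b) for a, b in pairs]
fractions = category_fractions(pairs)
```

Applied per force field, the same tally would yield percentages of the kind quoted above.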

Next, we consider how our R g ‐based structure categories relate to experiments. It is possible that the R g values of individual structures in the actual ensemble are higher or lower than the ensemble R g values derived from the SANS experiments. However, our goal here is not to reconstruct the full ensemble, as there are many established methods for this task. Rather, we are interested in how MD structures grouped by their R g values relate to experiments. Therefore, we compared the ensembles of 100+40+, 100+40−, 100−40+, and 100−40− structures to the SANS and NMR experiments. There are 233, 6772, 5265, and 95,750 structures in the 100+40+, 100+40−, 100−40+, and 100−40− ensembles, respectively. Figure 3a−c illustrate their computed average SANS curves and χ² values against increasing Q range as compared to experimental SANS curves. The 100+40+ ensemble has the lowest χ² value at all Q ranges compared with the other R g –based ensembles, suggesting that the 100+40+ ensemble captures the shape and size that best describe the experimental SANS curves. As a validation, we calculated the average R g value of each R g –based ensemble and confirmed the procedure of grouping each of the NMR and MD structures into one of the four categories based on the two calculated R g values described above (Figure 3d). In addition, we repeated the analysis of comparing computed average chemical shifts of each R g ‐based ensemble with experimental chemical shifts (Figure 3e). The 100−40− and 100−40+ ensembles have slightly lower correlation coefficients and higher RMSE values for all nuclei, which is expected as the two ensembles mainly consist of a99SB and a99SB‐ILDN structures (Figure S4b). However, comparing Figures 2e and 3e, we found that the average chemical shifts of the R g ‐based ensembles show better agreement with NMR experiments than the average chemical shifts of the force field‐based ensembles, with overall higher correlation coefficients and lower RMSE values.
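
The two agreement metrics used for the chemical shift comparison, Pearson's correlation coefficient and the root‐mean‐square error (RMSE), can be sketched as below. The function names and the toy shift values are hypothetical; in practice the computed shifts would be ensemble averages of per‐structure predictions for one nucleus type.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy shifts in ppm (illustration only): computed values are offset by 1 ppm.
computed = [52.0, 55.0, 58.0, 61.0]
experimental = [53.0, 56.0, 59.0, 62.0]

r = pearson_r(computed, experimental)
err = rmse(computed, experimental)
```

The toy case shows why both metrics are reported: a constant 1 ppm offset leaves the correlation perfect while the RMSE is 1 ppm.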

FIGURE 3.


Comparison of NCBD/ACTR R g –based structural ensembles reveals that 100+40+ has the best agreement with SANS experiments while all R g –based ensembles have better agreement with NMR experiments than the force field–based ensembles. (a–c) Comparison of calculated average SANS curves by CRYSON to experimental SANS curves for 100−40−, 100−40+, 100+40−, and 100+40+ structures. (a) SANS curves normalized by concentration, I(Q)/c, against Q as in Figure 2a. Experimental SANS data are shown as black points, where error bars represent standard deviation. Computed average SANS curves are depicted as lines, where 100+40+ is in purple, 100+40− in teal, 100−40+ in yellow, and 100−40− in gray. Note that the calculated I(Q)/c is normalized by the experimental I(0)/c of P(r) listed in Table S1, that is, 12.2 cm²/g and 0.596 cm²/g for h‐NCBD/ACTR and d A,L‐NCBD/ACTR, respectively. The standard deviations of computed SANS curves are shown but lie within the line plots. (b,c) Reduced chi‐square values (χ²) of the SANS curves between experiments and each of the R g –based ensembles for (b) h‐NCBD/ACTR in 100% D2O and (c) d A,L‐NCBD/ACTR in 40% D2O. The χ² is plotted against increasing Q range as in Figure 2b,c. Shaded error bars represent the 95% confidence interval of the mean. Note that error bars of the χ² for the ensembles are shown but most of them lie within the line plots. The χ² value for h‐NCBD/ACTR in 100% D2O is higher than that for d A,L‐NCBD/ACTR in 40% D2O, which is attributed to the higher scattering contrast of h‐NCBD/ACTR in 100% D2O. (d) Radius of gyration (R g) computed using atomic coordinates and masses in GROMACS (MD) as well as using CRYSON with and without selective deuteration (CRYSON d A,L‐NCBD/ACTR in 40% D2O vs. CRYSON h‐NCBD/ACTR in 100% D2O). R g values derived from the experimental SANS curves of h‐NCBD/ACTR in 100% D2O and d A,L‐NCBD/ACTR in 40% D2O are shown as dashed lines in blue and red, respectively. 
Error bars represent the standard deviation among the structures of each R g –based ensemble. (e) Comparison of computed and experimental chemical shifts by (left) Pearson's correlation coefficients and (right) RMSEs. These analyses are performed on 233 100+40+ structures, 6772 100+40− structures, 5265 100−40+ structures, and 95,750 100−40− structures.

Given the varying R g distributions of the MD structures, we sought further structural insight into NCBD and ACTR interactions. To this end, we determined the free energy surface as a function of two structural metrics, the sum of the contact area, A, between NCBD and ACTR and the crossing angle (Chen et al., 2020; Dalton et al., 2003), θ, between their longest helices, helix 3 of the NCBD and helix 1 of the ACTR (Figure 4). We used all NMR and MD structures to construct a two‐dimensional histogram of the normalized probability of A and θ, P(A, θ). Following a similar method used in previous studies (Bell et al., 2018; Zhou et al., 2001), we computed the potential of mean force (PMF) from W(A, θ) = −RT ln P(A, θ) with a uniform reference distribution. Figure 4a highlights the structures satisfying at least one of the experimental R g constraints from NMR and each force field on the free energy surface. We found that the surface has a free energy minimum of 5.2 kcal/mol located at A = 33 nm² and θ = −28°. The negative sign of θ denotes that the near helix, that is, helix 1 of the ACTR, is rotated clockwise relative to the far helix, that is, helix 3 of the NCBD. The three 100−40+ NMR structures are in the free energy basin and overlap with some structures sampled by the a99SB force field. This similar distribution of A and θ of the NMR and a99SB structures again suggests the influence of the force field applied in NMR structural refinement. In comparison, the few R g ‐satisfying a99SB‐ILDN structures have a wide distribution, especially in θ. Some structures are populated around another local free energy minimum of A = 34 nm² and θ = 38°. As compared with the structures sampled by the a99SB force field, the a99SB‐disp and C36m structures are populated around the free energy basin with A mostly less than 33 nm². The smaller A suggests that the NCBD and ACTR of these structures are more flexible and less compact than the structures sampled by the a99SB force fields. 
We further compared the structures closest to the free energy minimum from each force field. Even with similar values of A and θ, there are clear differences between these structures (Figure 4b–f). The a99SB and a99SB‐ILDN structures have less defined helical content, which is consistent with the helical fraction analysis shown in Figure S3a. However, this change in secondary structure is not reflected to the same extent in the two reaction coordinates investigated. Compared with the NMR and a99SB‐disp structures, helix 1 of the ACTR in the C36m structure tilts away as the shape of the accommodating groove formed by helices 1 and 3 of the NCBD shifts. In addition, based on the helical content, the ACTR in the a99SB‐disp and C36m structures gains a helical turn between helices 1 and 2, while the NCBD in the a99SB‐disp structure loses a turn at the end of helix 3. These subtle global and local structural details are indistinguishable on the free energy surface but become apparent when the three‐dimensional (3D) structures are visualized. From the distribution of the structural ensembles on the free energy surface, we observe distinct regions sampled by different force fields. Nevertheless, these two structural metrics alone are insufficient to distinguish finer structural details.
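
The PMF construction described above can be illustrated with a minimal sketch. It bins hypothetical (A, θ) samples into a two‐dimensional histogram and converts bin probabilities into W = −RT ln(P/P_ref). The sample values, bin edges, and the way the uniform reference enters (P_ref = 1/number of bins) are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol K)


def bin_index(x, edges):
    """Index k of the half-open bin [edges[k], edges[k+1]) holding x, else None."""
    for k in range(len(edges) - 1):
        if edges[k] <= x < edges[k + 1]:
            return k
    return None


def pmf_2d(samples, a_edges, theta_edges, temperature=293.15):
    """Map each occupied (A, theta) bin to W = -RT ln(P / P_ref) in kcal/mol."""
    counts = Counter()
    for a, theta in samples:
        i, j = bin_index(a, a_edges), bin_index(theta, theta_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1
    total = sum(counts.values())
    n_bins = (len(a_edges) - 1) * (len(theta_edges) - 1)
    p_ref = 1.0 / n_bins  # assumed uniform reference: every bin equally likely
    rt = R_KCAL * temperature
    return {b: -rt * math.log((n / total) / p_ref) for b, n in counts.items()}


# Hypothetical samples: (contact area in nm^2, crossing angle in degrees)
samples = [(33.0, -28.0)] * 6 + [(31.0, -45.0)] * 3 + [(29.0, -10.0)]
w = pmf_2d(samples, a_edges=[28, 30, 32, 34, 36], theta_edges=[-60, -40, -20, 0])
lowest_bin = min(w, key=w.get)  # the most populated bin has the lowest free energy
print(lowest_bin, round(w[lowest_bin], 2))
```

Because the reference is uniform, it only shifts W by a constant, so the location of the minimum depends on the histogram alone.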

FIGURE 4.

Representative NMR and MD structures are structurally different yet indistinguishable on the free energy surface. (a) Free energy surface of all NMR and MD structures in terms of two reaction coordinates, the sum of the contact area, A, between NCBD and ACTR and the crossing angle, θ, between their longest helices. Each panel shows the NMR structures or the MD structures from one force field that satisfy at least one of the Rg constraints from SANS experiments. Data points are color coded as in Figure 2. An NMR structure and a representative MD structure closest to the free energy minimum from each force field are highlighted by a yellow “x” and presented in b, c, d, e, and f, respectively. (b–f) Representative NCBD (black)/ACTR (gray) structures selected from the free energy surface. The longest helices of NCBD and ACTR are colored in blue and cyan, respectively. The contact area is depicted by a magenta surface. The rest of the complex surface is shown in light gray in the NMR structure but omitted in the other structures for clarity.

To comprehensively compare the NCBD/ACTR complex structures, we investigated an alternative method using a deep learning (DL) technique. We selected the NMR and MD structures satisfying at least one of the experimental Rg constraints (i.e., all data points shown on the free energy surface in Figure 4a) and applied a convolutional variational autoencoder (CVAE) to encode high‐dimensional protein complex structures into a 3‐D latent space for visualization. The CVAE is a variational autoencoder (VAE) (Kingma & Welling, 2013) in which the encoder and decoder are convolutional neural networks. The encoder of the CVAE serves as a dimensionality reduction tool that projects high‐dimensional molecular structures into a lower‐dimensional, normally distributed latent space, in which similar structures are placed close to one another (Degiacomi, 2019; Tian et al., 2021). The decoder, which mirrors the encoder in reverse order, reconstructs the input from samples of the constructed latent space. Direct comparison of the decoded data with the original input ensures the accuracy of the latent space representation. The loss function of the CVAE is the sum of the Kullback–Leibler divergence of the encoder output distribution from the standard normal distribution and the reconstruction loss between the decoder output and the original input. By minimizing the loss function, the CVAE model learns to compress and reconstruct data between a high‐dimensional input space and a low‐dimensional representation while maintaining high integrity (Byun & Rayadurgam, 2020). This feature allows efficient data analysis and visualization by comparing different complex molecular structures in the low‐dimensional latent space. We chose a 3‐D latent space mainly for visualization and because it was the lowest dimension that could represent the input effectively without degrading the loss.
The CVAE has been successfully applied to study the folding pathways of small proteins (Bhowmik et al., 2018) and structural clustering of biomolecules (Akere et al., 2020; Bell et al., 2020; Chen et al., 2021).
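
As a minimal sketch of the loss described above, the following adds the closed‐form Kullback–Leibler divergence between a diagonal Gaussian encoder output and the standard normal distribution to a reconstruction term. The mean squared error used for reconstruction is an assumption, since the text does not state the exact form used in the study, and the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_loss(original, decoded):
    """Mean squared error between flattened input and decoder output (assumed form)."""
    return sum((x - y) ** 2 for x, y in zip(original, decoded)) / len(original)


def vae_loss(original, decoded, mu, log_var):
    """Total loss: reconstruction term plus KL regularizer."""
    return reconstruction_loss(original, decoded) + kl_to_standard_normal(mu, log_var)


# A 3-D latent code that already matches N(0, I) contributes no KL penalty.
print(kl_to_standard_normal([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
# Shifting one latent mean by 1 adds 0.5 to the KL term.
print(kl_to_standard_normal([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

The KL term pulls the latent codes toward a standard normal distribution, while the reconstruction term keeps enough structural information in the code to rebuild the input.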

We represented each Rg‐satisfying structure by a distance matrix calculated between the Cα atoms of every residue pair. The distance matrix representation removes translational and rotational variance in the 3‐D molecular structures. The CVAE model learned structural features present in the distance matrices and projected similar structures close to each other in the latent space. The latent space reveals separation between MD structures sampled by different force fields (Figure 5a), suggesting distinct structural features between different MD ensembles. In particular, the structures sampled by the a99SB and a99SB‐disp force fields are projected away from those sampled by the a99SB‐ILDN and C36m force fields, which agrees with the overall MD structure distribution on the free energy surface.

FIGURE 5.

A convolutional variational autoencoder model discerns structural differences in the NCBD/ACTR structures from NMR spectroscopy and MD simulations. (a) The 3‐D latent space representing a total of 9816 NMR and MD structures in the training set. Clusters are labeled by the experimental method and force fields and color coded as in Figure 2. Note that some C36m data are dimmed to reveal the NMR data and the y‐ and z‐axes are reversed due to the viewing angle of the 3‐D plot. White ‘x’ marks are the representative NMR, a99SB, a99SB‐disp, and C36m structures selected from the free energy surface in Figure 4, showing a clear separation between these structures in the CVAE latent space. The representative a99SB‐ILDN structure is in the validation set and therefore not on the graph. We show the same latent space viewed from VAE 1 with VAE 2 and 3 axes switched for ‘x’ visibility and easy comparison with panel b. (b) The 3‐D latent space projected onto the VAE 1‐VAE 2 plane. Structures are labeled based on a conditional agreement of Rg values from SANS experiments and all‐atom RMSD values with respect to a reference 100+40+ structure highlighted by a cyan star. Structures that satisfy only one of the Rg constraints (100−40+ or 100+40−) are in yellow and teal, respectively. Structures that are within both constraints (100+40+) are in purple.

To elucidate the structural features that the model learned to encode and decode, we labeled the same latent space by the Rg categories defined previously and projected the space onto a 2‐D plane with the structures labeled by their all‐atom RMSD values with respect to a 100+40+ reference structure (Figure 5b). Interestingly, the 100+40+ structures lie in the regions where the 100−40+ and 100+40− structures overlap, and there is a clear trend that the RMSD value increases as a structure is projected further away from the reference. These results further demonstrate that the CVAE model learns the structural details embedded in the distance matrices and that the resulting latent space presents a road map that allows comprehensive comparison of different NCBD/ACTR complex structures. Compared to commonly applied structural analysis methods, which are mostly limited to specific regions of the structure of interest, this DL approach allows us to evaluate the structure as a whole and still capture detailed local structural features (Chen et al., 2021). Moreover, unlike other dimensionality reduction methods that rely on pre‐defined collective variables, the CVAE constructs the underlying low‐dimensional representation space in an unsupervised manner.

To further characterize the 100+40+ structures, we analyzed the projection of their distance matrices in the latent space. Using a 100+40+ structure as the reference, the structures arrange approximately into three clusters described by RMSD ranges: (1) 0 to 5 Å, (2) 5 to 8 Å, and (3) 8 Å and above (Figure 6a). The Cα RMSF values of the ACTR and NCBD structures show similar structural fluctuations in the three clusters. In addition to the unstructured regions at the N‐ and C‐termini of both proteins, which fluctuate by up to 12.7 ± 2.5 Å, there are three internal regions of the complex with fluctuations between 2.8 and 4.2 Å (Figure 6b). These flexible regions roughly correspond to a lower average helical fraction of the complex. However, helical content also varies between the three clusters, especially in ACTR, in regions with low fluctuation (Figure 6d). The contact area between NCBD and ACTR of the 100+40+ structures is less than 30 nm² and the helix crossing angle is around 50°. Figure 6c shows their location on the free energy surface. When highlighting the flexible regions on a 3‐D NCBD/ACTR complex structure, we found that they are located at the junctions connecting neighboring helices (Figure 6e). One flexible region consisting of ACTR residues 1057 to 1063 connects helices 1 and 2. The other two flexible regions are in the NCBD. Residues 2077 to 2085, which form a part of the polyglutamine tract, connect helices 1 and 2, while residues 2092 to 2097 connect helices 2 and 3. These flexible regions are in agreement with observations from a previous NMR relaxation study (Ebert et al., 2008). Representative complex structures selected from the three clusters reveal distinct orientation and organization of the helices (Figure 6f–h). Based on the number of structures in each cluster, we calculated the composition percentages of the three clusters in the 100+40+ ensemble, which are 37%, 49%, and 14%, respectively.
Globally, the three helices of the ACTR display different tilting angles with respect to the NCBD. Locally, structural flexibility in the NCBD reveals that a helical turn in the polyglutamine tract and regions near the two termini may become disordered coils. Note that the MD simulations and SANS measurements here were performed at 20°C. At human body temperature (37°C), we expect a slight increase in the structural flexibility of the NCBD/ACTR complex, in which the overall helical content may decrease and the disordered coil regions may shift. This expectation is based on previous NMR experiments on temperature‐induced structural changes in IDPs (Kjaergaard et al., 2010), in which the authors compared the backbone chemical shifts of ACTR between 5°C and 45°C and found that at the higher temperature the amount of transient helix decreases and the protein becomes more disordered. At human body temperature, the composition could therefore shift toward clusters 1 and 3, which have a lower helical content.
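
The RMSD ranges that describe the three clusters and the composition percentages derived from them can be sketched as follows. This is a simplification that labels structures directly by their RMSD to the reference rather than by their position in the latent space. The RMSD values are hypothetical, and assigning a boundary value such as exactly 5 Å to the higher cluster is an assumption.

```python
from collections import Counter


def rmsd_cluster(rmsd, cutoffs=(5.0, 8.0)):
    """Return 1, 2, or 3 for an RMSD (in Å) in [0, 5), [5, 8), or [8, inf)."""
    low, high = cutoffs
    if rmsd < low:
        return 1
    if rmsd < high:
        return 2
    return 3


def composition(rmsds):
    """Percentage of structures falling in each of the three RMSD ranges."""
    counts = Counter(rmsd_cluster(r) for r in rmsds)
    return {k: 100.0 * counts.get(k, 0) / len(rmsds) for k in (1, 2, 3)}


# Hypothetical all-atom RMSD values (Å) with respect to the reference structure
example_rmsds = [1.0, 4.9, 5.0, 7.0, 9.0]
print(composition(example_rmsds))
```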

FIGURE 6.

Characterization of NCBD/ACTR structural ensemble based on the structures satisfying both experimental Rg constraints reveals three distinct clusters. (a) The 3‐D latent space projected onto the VAE 1‐VAE 2 plane with only the 100+40+ structures. Structures are labeled based on all‐atom RMSD values using cutoff distances of 5 Å and 8 Å with respect to the same reference 100+40+ structure shown in Figure 5b highlighted by a cyan box and presented in e. The structure closest to the centroid of each cluster is boxed and presented in f, g, and h, respectively. (b) The Cα RMSF values of ACTR and NCBD structures in the three clusters. The flexible region of ACTR is highlighted in magenta while that of NCBD is in yellow. Error bars represent the standard deviation. (c) The 100+40+ structures on the free energy surface shown in Figure 4a. (d) Helical fraction of 100+40+ structures. The flexible regions are highlighted again for comparison. (e) Reference 100+40+ structure for RMSD. (f–h) Representative NCBD (black)/ACTR (gray) structures selected from the three clusters, each with its cluster's composition percentage in the ensemble. The flexible regions are color coded as in b and d.

In summary, we present an integrated SANS, MD, and DL workflow to effectively distinguish different protein conformations and characterize structural ensembles of IDPs. We demonstrate our workflow by characterizing the structural ensemble of the NCBD/ACTR complex. The SANS‐derived Rg and PMF analysis suggests that the structural ensemble of the NCBD/ACTR complex is heterogeneous, with distinct complex arrangements that are in good agreement with SANS experiments. Nevertheless, we have identified a structural ensemble that describes SANS and NMR experiments better than other ensembles determined by MD simulation with one force field. Although obtaining sufficient neutron scattering signal is a main driver in designing residue‐based selective labeling for SANS, the flexibility of labeling choice in combination with varying the D2O concentration in the solvent to adjust contrast offers a promising option to study structurally flexible systems such as IDPs. Selective labeling and contrast matching SANS experiments provide not only shape and size as constraints to select from an initial pool of conformations but also average scattering intensity for more stringent validation. Furthermore, the DL algorithm complements PMF calculations and captures both global and local structural features that enable effective recognition of different structures that are indistinguishable on the free energy surface.

3. CONCLUSION

In this work, we characterize the structural ensemble of an intrinsically disordered protein complex consisting of the NCBD and ACTR domains by small‐angle neutron scattering (SANS) and computing. Beyond full contrast SANS experiments, residue‐specific deuterium labeling with contrast matching presents additional information about the same system to aid characterization of structural ensembles. We combine SANS experiments with molecular dynamics (MD) simulations using four different AMBER and CHARMM force fields and a deep learning (DL)‐based convolutional variational autoencoder (CVAE) to explore the structural space of the NCBD/ACTR complex. We find that each force field generates a distinct pool of structures, where the a99SB‐disp and C36m structural ensembles show better agreement with the experimental SANS and NMR data. Applying structural constraints, such as the radius of gyration (Rg) from SANS experiments, to classify structures leads to a structural ensemble that fits the SANS scattering curves and the NMR chemical shifts better than other ensembles generated by MD simulations with one force field. Complementary to structural metrics like contact area or helix crossing angle, the CVAE algorithm allows for a comprehensive comparison of complete three‐dimensional structures using their distance matrices by projecting similar structures closer in a lower‐dimensional latent space. Moreover, the CVAE algorithm successfully differentiates NMR and MD structures in the latent space, which are otherwise indistinguishable on the free energy surface. From structure projection in the latent space, we characterize an Rg‐based ensemble, determining three representative conformations and the corresponding composition percentage. Taken together, we present an integrated SANS, MD, and DL method for characterizing the structural ensemble of an intrinsically disordered protein complex. This work provides insights for further study of more structurally flexible systems.

4. MATERIALS AND METHODS

4.1. Sample preparation

The nuclear receptor coactivator binding domain (NCBD) of mouse cAMP response element binding protein (CREB)‐binding protein (CBP, accession: NP_001020603), CBP(2059‐2117), and the interaction domain of mouse activator for thyroid hormone and retinoid receptor (ACTR, accession: Q9Y6Q9), ACTR(1046‐1093), were synthesized by solid‐phase FMOC chemistry and purified to >95%. Hydrogenated NCBD (h‐NCBD, 6545 Da) was synthesized by Keck Yale facility. Selectively deuterated NCBD (dA,L‐NCBD, 6633 Da) was synthesized by New England Peptide with L‐alanine‐N‐Fmoc (2,3,3,3‐D4, 98%) and L‐leucine‐N‐Fmoc (D10, 98%) to label all five alanine and seven leucine residues. Hydrogenated ACTR (5214 Da) was also synthesized by New England Peptide. Mass spectrometry confirmed the molecular mass of the peptides. D2O (99.9% D) (Cambridge Isotope Laboratories, Inc., Tewksbury, MA, USA), monosodium phosphate, disodium phosphate (Sigma‐Aldrich), and sodium chloride (Fluka) were used without further purification.

4.2. Small‐angle neutron scattering

SANS experiments were performed on the extended Q‐range small‐angle neutron scattering beamline at the Spallation Neutron Source located at Oak Ridge National Laboratory (Zhao et al., 2010). In 60 Hz operation mode, a 2.5 m sample‐to‐detector distance with a 2.5–6.4 Å wavelength band was used to obtain the relevant wavevector transfer, Q = 4π sin(θ)/λ, where 2θ is the scattering angle. NCBD/ACTR samples were prepared in 20 mM sodium phosphate (pH 7), 50 mM NaCl, H2O/D2O. h‐NCBD/ACTR (2.5 mg/mL) in 100% D2O, h‐NCBD/ACTR (10.3 mg/mL) in 40% D2O, and dA,L‐NCBD/ACTR (8.7 mg/mL) in 40% D2O buffer were measured. Peptide concentrations were determined by UV–Vis using a calculated absorbance of a 1 mg/mL solution at 280 nm, A280(1 mg/mL) = 0.228, for NCBD (Gasteiger et al., 2005). Samples were loaded into either 1 or 2 mm pathlength circular‐shaped quartz cuvettes (Hellma USA, Plainville, NY) and SANS measurements were performed at 20°C. Data reduction followed standard procedures using MantidPlot (Arnold et al., 2014). The measured scattering intensity was corrected for the detector sensitivity and scattering contribution from the solvent and empty cells and then placed on an absolute scale using a calibrated standard (Wignall & Bates, 1987).
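
As a small worked example of the wavevector transfer defined above, the sketch below evaluates Q = 4π sin(θ)/λ at the two ends of the 2.5–6.4 Å wavelength band. The scattering angle of 2° is a hypothetical value chosen for illustration and is not an instrument parameter from the study.

```python
import math


def momentum_transfer(two_theta_deg, wavelength_angstrom):
    """Q = 4*pi*sin(theta)/lambda in 1/Å, where 2*theta is the scattering angle."""
    theta = math.radians(two_theta_deg) / 2.0
    return 4.0 * math.pi * math.sin(theta) / wavelength_angstrom


# Hypothetical scattering angle of 2 degrees at both ends of the wavelength band
for wavelength in (2.5, 6.4):
    print(wavelength, round(momentum_transfer(2.0, wavelength), 4))
```

At a fixed angle, shorter wavelengths reach higher Q, which is why a wavelength band covers a range of Q at a single detector position.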

4.3. SANS analysis

Guinier fits to the low‐Q region of the I(Q) scattering curves were initially performed to confirm a Guinier regime (Guinier et al., 1955) (Figure S1). The pair distance distribution function, P(r), was then calculated from the I(Q) curves using the GNOM program in the ATSAS software package (Manalastas‐Cantos et al., 2021) (Figure 1c). The P(r) function was set to zero for r = 0 and r = Dmax, the maximum linear dimension of the scattering object, and the P(r) was normalized to the peak maximum in each profile. The real‐space radius of gyration, Rg, and scattering intensity at zero angle, I(0), were determined from the P(r) solution to the scattering data. Molecular mass, M, was calculated from

$$\frac{I(0)}{c} = \frac{M}{N_A}\,\Delta\rho^{2}\,\bar{\upsilon}^{2} \qquad (2)$$

where c = protein concentration, Δρ = contrast in scattering length density between protein and D2O buffer solution (= ρprot − ρbuf), ῡ = protein partial specific volume, and NA = Avogadro's number. The protein scattering length density, ρprot, of NCBD/ACTR complexes was calculated from the sequence using the Contrast module of MULCh (Whitten et al., 2008). The D2O scattering length density used was ρD2O = 6.388 × 10¹⁰ cm⁻² (Kachala et al., 2015). MULCh calculations yielded Δρ = −3.298, 0.098, and 0.75 × 10¹⁰ cm⁻² for h‐NCBD/ACTR in 100% D2O, h‐NCBD/ACTR in 40% D2O, and dA,L‐NCBD/ACTR in 40% D2O, respectively. The MULCh value for ῡ (= 0.737 mL/g) of NCBD/ACTR was also used in the above equation. Experimental SANS values are given in Table S1.
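
To illustrate how the relation between I(0) and molecular mass is used, the sketch below first computes the I(0) expected for h‐NCBD/ACTR in 100% D2O from the values quoted in the text and then inverts the relation to recover M. It assumes a 1:1 complex whose mass is the sum of the two peptide masses (6545 Da + 5214 Da). The resulting I(0) is a forward calculation, not a measured value, and the function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # 1/mol


def forward_intensity(mass_da, conc_g_per_ml, delta_rho_per_cm2, v_bar_ml_per_g):
    """I(0) in 1/cm from I(0)/c = M * (delta_rho * v_bar)^2 / N_A."""
    return conc_g_per_ml * mass_da * (delta_rho_per_cm2 * v_bar_ml_per_g) ** 2 / AVOGADRO


def molecular_mass(i0_per_cm, conc_g_per_ml, delta_rho_per_cm2, v_bar_ml_per_g):
    """Invert the same relation to obtain M in Da (g/mol)."""
    return i0_per_cm * AVOGADRO / (conc_g_per_ml * (delta_rho_per_cm2 * v_bar_ml_per_g) ** 2)


mass_complex = 6545 + 5214   # Da, assuming a 1:1 h-NCBD/ACTR complex
conc = 2.5e-3                # g/mL (2.5 mg/mL in 100% D2O)
delta_rho = -3.298e10        # 1/cm^2, contrast quoted for 100% D2O
v_bar = 0.737                # mL/g

i0 = forward_intensity(mass_complex, conc, delta_rho, v_bar)
print(round(i0, 4))                                        # expected I(0) in 1/cm
print(round(molecular_mass(i0, conc, delta_rho, v_bar)))   # recovers the input mass
```

Because the contrast enters squared, the sign of Δρ does not affect I(0). A near‐zero contrast, as for h‐NCBD/ACTR in 40% D2O, suppresses the signal strongly.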

The DAMMIF program within ATSAS (Manalastas‐Cantos et al., 2021) was used to generate ab initio shape reconstruction models to fit the SANS data. A total of 10 models were generated for each of the two scattering conditions and compared by the program to identify the most probable model. The resulting shape envelopes were superimposed on the first model of the PDB 1KBH structure. The shape envelope for dA,L‐NCBD/ACTR in 40% D2O was superimposed on the NCBD structure only. Ab initio shape visualizations were created using UCSF Chimera (Pettersen et al., 2004).
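
The Guinier check mentioned at the start of this section relies on the low‐Q approximation ln I(Q) = ln I(0) − (Rg²/3)Q². A minimal sketch of such a fit is given below, using synthetic data generated from assumed values of Rg and I(0). In the study itself, Rg and I(0) were taken from the P(r) solution, so this only illustrates the linear fit used to confirm a Guinier regime.

```python
import math


def guinier_fit(q, intensity):
    """Least-squares line through (Q^2, ln I); returns (Rg, I0).

    Assumes the fitted slope is negative, as expected in a Guinier regime.
    """
    x = [qi * qi for qi in q]
    y = [math.log(i) for i in intensity]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.sqrt(-3.0 * slope), math.exp(intercept)


# Synthetic, noise-free data from assumed Rg = 15 Å and I(0) = 0.03 1/cm,
# restricted to Q*Rg <= 1.2 so that the approximation applies.
rg_true, i0_true = 15.0, 0.03
q_values = [0.01 + 0.005 * k for k in range(15)]
i_values = [i0_true * math.exp(-(rg_true * qv) ** 2 / 3.0) for qv in q_values]
rg_fit, i0_fit = guinier_fit(q_values, i_values)
print(round(rg_fit, 3), round(i0_fit, 5))
```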

5. MOLECULAR SYSTEMS AND MD SIMULATIONS

We performed all‐atom MD simulations to determine the dynamic structure of the complex formed by the two intrinsically disordered proteins, NCBD and ACTR. We investigated four molecular mechanics force fields from the Amber and CHARMM families to sample diverse conformations. These force fields include a99SB (Hornak et al., 2006), a99SB‐ILDN (Lindorff‐Larsen et al., 2010) with the TIP3P water model (Jorgensen et al., 1983), a99SB‐disp (Robustelli et al., 2018), and C36m with the CHARMM‐modified TIP3P water model (Huang et al., 2017). The a99SB force field was used in the original NMR structural refinement (Demarest et al., 2002). Both the a99SB‐disp and C36m force fields were developed to address structural flexibility and improve the accuracy of disordered proteins in simulations. We used an NMR structure of the NCBD/ACTR complex in PDB 1KBH (Demarest et al., 2002) as the initial structure. The complex consists of a total of 106 residues, in which the first 47 residues are in chain A, which belongs to ACTR, and the remaining 59 residues are in chain B, which is part of NCBD. The complex structure was solvated in the center of a water box with a minimum distance of 15 Å from the edge of the box to the nearest protein atom, neutralized with counter ions, and ionized with 50 mM NaCl following the experimental conditions. We then minimized and equilibrated the resulting system for 100 ns, followed by a 1 μs trajectory in a production run. For each force field, we performed three independent 1 μs trajectories. After the first 100 ns, we sampled structures every 100 ps for analysis, yielding a total of 27,000 structures for each force field.
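
The number of sampled structures follows from the trajectory length and the sampling interval. The short calculation below reproduces the counts under one reading of the protocol, in which the first 100 ns of each 1 μs trajectory are excluded and the remaining 900 ns are sampled every 100 ps. The variable names are illustrative.

```python
trajectory_ns = 1000   # length of each trajectory (1 μs)
discarded_ns = 100     # initial portion not used for analysis
stride_ps = 100        # sampling interval
replicas = 3           # independent trajectories per force field
force_fields = 4       # a99SB, a99SB-ILDN, a99SB-disp, C36m

frames_per_trajectory = (trajectory_ns - discarded_ns) * 1000 // stride_ps
frames_per_force_field = frames_per_trajectory * replicas
total_md_frames = frames_per_force_field * force_fields

print(frames_per_trajectory, frames_per_force_field, total_md_frames)
```

The per‐force‐field count matches the 27,000 structures stated above, and the total matches the 108,000 MD structures evaluated later with CRYSON.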

All MD simulations were performed with OpenMM (Eastman et al., 2017) in an NPT ensemble at 1 atm and 293.15 K using a Monte Carlo barostat and a Langevin integrator with a time step of 2 fs. Nonbonded interactions were calculated with a typical cutoff distance of 12 Å, while the long‐range electrostatic interactions were computed with the Particle Mesh Ewald algorithm (Darden et al., 1993).

6. STRUCTURE ASSESSMENT BY CRYSON

To evaluate the 108,000 NCBD/ACTR complex structures from MD simulations and the 20 NMR models in PDB 1KBH (Demarest et al., 2002), we used the CRYSON program (Svergun et al., 1998) to compute SANS curves from their atomic structures and compared these curves with scattering data from SANS experiments. Following the experimental conditions, for each structure we calculated two scattering curves, one with the hydrogenated NCBD/ACTR complex (h‐NCBD/ACTR) in 100% D2O, and the other with the selectively deuterated NCBD/ACTR complex (dA,L‐NCBD/ACTR) in 40% D2O. For dA,L‐NCBD/ACTR in 40% D2O, all five alanine and seven leucine residues were labeled as a separate chain (i.e., chain C) in their atomic coordinate files and set to a deuterated fraction of 1 in CRYSON. All other residues in the original chains (i.e., chain A and chain B) were set to a deuterated fraction of 0. The fraction of D2O in the solvent was set to 0.4.

All CRYSON calculations were performed with a maximum scattering vector of 0.3 Å⁻¹ and 200 points in the scattering curve. Explicit hydrogen atoms were considered, and the contrast of the solvation shell was set to 0. No fitting to experimental data was involved in the scattering curve calculations to provide an unbiased comparison between the theoretical and experimental scattering curves. The experimental Rg values serve as the constraints to select qualified MD and NMR structures. For each structure, we derived Rg values from the two scattering curves. Only the structures satisfying at least one of the Rg constraints from SANS experiments were included for deep learning analysis. The total number of Rg‐satisfying structures was 12,270, which was about 11.4% of the 108,020 structures generated from MD simulations and NMR spectroscopy.
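
The Rg‐based selection described above can be sketched as a simple labeling step. The acceptance rule used here (a computed Rg within a tolerance of the experimental value) and the target values are hypothetical placeholders, since the experimental values are reported in Table S1 and the exact criterion is not restated in this section. The final lines only recompute the fraction quoted in the text.

```python
def rg_category(rg_100, rg_40, target_100, target_40):
    """Label a structure by which experimental Rg windows it satisfies.

    target_* are (value, tolerance) pairs in Å; a computed Rg passes when
    |Rg - value| <= tolerance. Returns "100+40+", "100+40-", "100-40+",
    or "100-40-".
    """
    def passes(rg, target):
        value, tolerance = target
        return abs(rg - value) <= tolerance

    sign_100 = "+" if passes(rg_100, target_100) else "-"
    sign_40 = "+" if passes(rg_40, target_40) else "-"
    return "100" + sign_100 + "40" + sign_40


# Hypothetical targets (value, tolerance) in Å, for illustration only
TARGET_100 = (15.0, 0.5)
TARGET_40 = (12.0, 0.5)
print(rg_category(15.2, 12.9, TARGET_100, TARGET_40))

# Bookkeeping from the text: 12,270 Rg-satisfying structures out of 108,020
satisfying, total = 12270, 108000 + 20
print(round(100.0 * satisfying / total, 1))
```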

7. DEEP LEARNING ANALYSIS

To systematically analyze the Rg‐satisfying NCBD/ACTR complex structures, we deployed an autoencoder‐based deep learning architecture, a convolutional variational autoencoder (CVAE), for structure comparison and visualization at a large scale. To generate translation‐ and rotation‐invariant input data for the CVAE, we represented each Rg‐satisfying structure by a distance matrix using the Cα atoms of the protein complex. We used a distance matrix for several reasons that align with our objective:

  • Invariant to coordinate transformations: A distance matrix captures relative distances between selected atoms regardless of their absolute positions, alleviating the need for prior structural alignments and errors introduced by the procedure.

  • Encodes pairwise relationships: A distance matrix encodes distances between all pairs of selected atoms, including both local and global structural features for learning.

  • Robust to noise and missing data: A distance matrix exhibits greater resilience to noise and missing data compared to raw coordinates as the data is enriched from 3N to N(N − 1)/2, where N is the number of selected atoms.
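
The invariance and data‐enrichment properties listed above can be checked with a small sketch. The coordinates are hypothetical, and the rigid‐body move (a rotation about the z axis followed by a translation) is just one example of a coordinate transformation.

```python
import math


def distance_matrix(coords):
    """Symmetric matrix of pairwise Euclidean distances between 3-D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_and_shift(coords, angle_deg, shift):
    """Rigid-body move: rotate about the z axis, then translate by `shift`."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# Hypothetical Cα positions (Å) for a four-residue fragment
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.0, 3.9, 1.5)]
moved = rotate_z_and_shift(ca, angle_deg=73.0, shift=(10.0, -4.0, 2.5))

d_before, d_after = distance_matrix(ca), distance_matrix(moved)
max_diff = max(abs(a - b) for row_a, row_b in zip(d_before, d_after)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-9)  # distances are unchanged by the rigid-body move

n = 106  # residues in the NCBD/ACTR complex
print(3 * n, n * (n - 1) // 2)  # coordinate values vs. unique pairwise distances
```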

There are 106 residues in the NCBD/ACTR complex, so the size of each distance matrix was 106 × 106. We then merged the distance matrices of the 12,270 MD and NMR structures and randomly split the matrices into training and validation datasets using an 80/20 ratio. We then constructed a CVAE model to project these high‐dimensional structural data into a 3‐D latent space using the Keras library (Chollet, 2015). The encoder model consisted of three convolutional layers and a fully connected layer, each with 64 feature maps. We used a 2 × 2 convolution kernel and a stride of 1, 2, and 1 at the three convolutional layers, respectively. The activation function at each convolutional layer was ReLU. The optimizer was RMSProp (Tieleman & Hinton, 2012), with a learning rate of 0.001. We trained the model for 250 epochs, over which the training and validation loss converged (Figure S5). The difference between decoded and original images is minimal, suggesting the model was trained successfully (Figure S6). We then analyzed the structures from the clusters in the latent space and selected representative structures for visualization using VMD (Humphrey et al., 1996).
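
The dataset split and the spatial size of the feature maps through the three convolutional layers can be worked out with a short calculation. The padding mode is not stated in the text, so the "same" padding used here is an assumption, and the helper function is hypothetical rather than part of Keras.

```python
import math


def conv_output_size(size, kernel, stride, padding="same"):
    """Output length along one axis of a 2-D convolution.

    "same" padding gives ceil(size / stride); "valid" padding gives
    floor((size - kernel) / stride) + 1.
    """
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


n_structures = 12270
n_train = round(0.8 * n_structures)   # 80% for training
n_val = n_structures - n_train        # remaining 20% for validation
print(n_train, n_val)

size = 106  # each distance matrix is 106 x 106
for stride in (1, 2, 1):
    size = conv_output_size(size, kernel=2, stride=stride)
    print(size)
```

The training count agrees with the 9816 structures shown in the latent space of Figure 5a. Under the assumed padding, only the stride‐2 layer reduces the spatial size, from 106 to 53.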

AUTHOR CONTRIBUTIONS

Serena H. Chen designed the study and performed MD simulations and data analysis. Kevin L. Weiss and Christopher Stanley designed and performed SANS experiments. Serena H. Chen and Debsindhu Bhowmik developed the CVAE code. Serena H. Chen and Christopher Stanley wrote the paper. All authors read and approved the final manuscript.

CONFLICT OF INTEREST STATEMENT

The authors declare no competing interests.

Supporting information

Data S1. Supporting information.

ACKNOWLEDGMENTS

We would like to thank David Bell (FNLCR) and Yung‐Ko Chen (Alicuu Technology Co., Ltd.) for helpful discussions. We would also like to thank Chris Layton (ORNL) and Daniel Dewey (ORNL) for their technical support.

This work was performed at the Compute and Data Environment for Science (CADES) of the Oak Ridge National Laboratory (ORNL), which is funded by the Office of Science of the U.S. Department of Energy under Contract No. DE‐AC05‐00OR22725. A portion of this research was performed at Oak Ridge National Laboratory's Spallation Neutron Source, sponsored by the U.S. Department of Energy, Office of Basic Energy Sciences. We acknowledge laboratory support by the Center for Structural Molecular Biology, funded by the Office of Biological and Environmental Research of the U.S. Department of Energy.

The research was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract numbers DE‐AC05‐00OR22725 and DE‐SC0023490; the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. It was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE‐AC02‐06‐CH11357, Lawrence Livermore National Laboratory under Contract DE‐AC52‐07NA27344, Los Alamos National Laboratory under Contract DE‐AC5206NA25396, Oak Ridge National Laboratory under Contract DE‐AC05‐00OR22725, and Frederick National Laboratory for Cancer Research under Contract HHSN261200800001E.

Chen SH, Weiss KL, Stanley C, Bhowmik D. Structural characterization of an intrinsically disordered protein complex using integrated small‐angle neutron scattering and computing. Protein Science. 2023;32(10):e4772. 10.1002/pro.4772

This manuscript has been authored by UT‐Battelle, LLC, under Contract No. DE‐AC05‐00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non‐exclusive, paid‐up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe‐public‐access‐plan).

Review Editor: Nir Ben‐Tal

DATA AVAILABILITY STATEMENT

All relevant data are available from within the manuscript and Supplementary Information, and from the corresponding author upon reasonable request.

REFERENCES

  1. Agarwal R, Smith MD, Smith JC. Capturing deuteration effects in a molecular mechanics force field: deuterated THF and the THF–water miscibility gap. J Chem Theory Comput. 2020;16(4):2529–2540. [DOI] [PubMed] [Google Scholar]
  2. Akere A, Chen SH, Liu X, Chen Y, Dantu SC, Pandini A, et al. Structure‐based enzyme engineering improves donor‐substrate recognition of Arabidopsis thaliana glycosyltransferases. Biochem J. 2020;477(15):2791–2805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Allison JR, Varnai P, Dobson CM, Vendruscolo M. Determination of the free energy landscape of α‐synuclein using spin label nuclear magnetic resonance measurements. J Am Chem Soc. 2009;131(51):18314–18326. [DOI] [PubMed] [Google Scholar]
  4. Anzick SL, Kononen J, Walker RL, Azorsa DO, Tanner MM, Guan XY, et al. AIB1, a steroid receptor coactivator amplified in breast and ovarian cancer. Science. 1997;277(5328):965–968. [DOI] [PubMed] [Google Scholar]
  5. Appadurai R, Nagesh J, Srivastava A. High resolution ensemble description of metamorphic and intrinsically disordered proteins using an efficient hybrid parallel tempering scheme. Nat Commun. 2021;12(1):958. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Arnold O, Bilheux JC, Borreguero JM, Buts A, Campbell SI, Chapon L, et al. Mantid—data analysis and visualization package for neutron scattering and μ SR experiments. Nucl Instrum Methods Phys Res, Sect A. 2014;764:156–166. [Google Scholar]
  7. Banks A, Qin S, Weiss KL, Stanley CB, Zhou H‐X. Intrinsically disordered protein exhibits both compaction and expansion under macromolecular crowding. Biophys J. 2018;114(5):1067–1079.
  8. Bell D, Domeniconi G, Yang C‐C, Zhang L, Cong GJ. Dynamics‐based peptide‐MHC binding optimization by a convolutional variational autoencoder: a use‐case model for CASTELO. J Chem Theory Comput. 2020;17(12):7962–7971.
  9. Bell DR, Kang S‐G, Huynh T, Zhou R. Concentration‐dependent binding of CdSe quantum dots on the SH3 domain. Nanoscale. 2018;10(1):351–358.
  10. Bernadó P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI. Structural characterization of flexible proteins using small‐angle X‐ray scattering. J Am Chem Soc. 2007;129(17):5656–5664.
  11. Bernadó P, Svergun DI. Structural analysis of intrinsically disordered proteins by small‐angle X‐ray scattering. Mol Biosyst. 2012;8(1):151–167.
  12. Bhowmik D, Gao S, Young MT, Ramanathan A. Deep clustering of protein folding simulations. BMC Bioinform. 2018;19(18):484.
  13. Bush M, Alhanshali BM, Qian S, Stanley CB, Heller WT, Matsui T, et al. An ensemble of flexible conformations underlies mechanotransduction by the cadherin–catenin adhesion complex. Proc Natl Acad Sci. 2019;116(43):21545–21555.
  14. Byun T, Rayadurgam S. Manifold for machine learning assurance. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE‐NIER), 5–11 Oct. 2020. p. 97–100.
  15. Chen SH, Bell DR. Evolution of thyroglobulin loop kinetics in EpCAM. Life. 2021;11(9):915.
  16. Chen SH, Perez‐Aguilar JM, Zhou R. Graphene‐extracted membrane lipids facilitate the activation of integrin αvβ8. Nanoscale. 2020;12(14):7939–7949.
  17. Chen SH, Young MT, Gounley J, Stanley C, Bhowmik D. How distinct structural flexibility within SARS‐CoV‐2 spike protein reveals potential therapeutic targets. In: 2021 IEEE International Conference on Big Data (Big Data), 15–18 Dec. 2021. p. 4333–4341.
  18. Chollet F. keras, GitHub. https://github.com/fchollet/keras 2015.
  19. Dalton JAR, Michalopoulos I, Westhead DR. Calculation of helix packing angles in protein structures. Bioinformatics. 2003;19(10):1298–1299.
  20. Darden T, York D, Pedersen L. Particle mesh Ewald: an N·log(N) method for Ewald sums in large systems. J Chem Phys. 1993;98(12):10089–10092.
  21. Degiacomi MT. Coupling molecular dynamics and deep learning to mine protein conformational space. Structure. 2019;27(6):1034–1040.e3.
  22. Demarest SJ, Martinez‐Yamout M, Chung J, Chen H, Xu W, Dyson HJ, et al. Mutual synergistic folding in recruitment of CBP/p300 by p160 nuclear receptor coactivators. Nature. 2002;415(6871):549–553.
  23. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradović Z. Intrinsic disorder and protein function. Biochemistry. 2002;41(21):6573–6582.
  24. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ. Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform. 2000;11:161–171.
  25. Eastman P, Swails J, Chodera JD, McGibbon RT, Zhao Y, Beauchamp KA, et al. OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol. 2017;13(7):e1005659.
  26. Ebert M‐O, Bae S‐H, Dyson HJ, Wright PE. NMR relaxation study of the complex formed between CBP and the activation domain of the nuclear hormone receptor coactivator ACTR. Biochemistry. 2008;47(5):1299–1308.
  27. Ferreon ACM, Moran CR, Gambin Y, Deniz AA. Single‐molecule fluorescence studies of intrinsically disordered proteins. In: Walter NG, editor. Methods in enzymology. Volume 472. Academic Press; 2010. p. 179–204.
  28. Gabel F. Small angle neutron scattering for the structural study of intrinsically disordered proteins in solution: a practical guide. Methods in molecular biology. Volume 896. New York, NY: Springer; 2012. p. 123–135.
  29. Gasteiger E, Hoogland C, Gattiker A, Wilkins MR, Appel RD, Bairoch AJ. Protein Identification and Analysis Tools on the ExPASy Server. In: Walker JM, editor. The Proteomics Protocols Handbook. Totowa, NJ: Humana Press; 2005. p. 571–607.
  30. Goldenberg DP, Argyle B. Minimal effects of macromolecular crowding on an intrinsically disordered protein: a small‐angle neutron scattering study. Biophys J. 2014;106(4):905–914.
  31. Guinier A, Fournet G, Yudowitch KL. Small‐angle scattering of X‐rays. 1955.
  32. Hornak V, Abel R, Okur A, Strockbine B, Roitberg A, Simmerling C. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins Struct Funct Bioinform. 2006;65(3):712–725.
  33. Huang J, Rauscher S, Nawrocki G, Ran T, Feig M, de Groot BL, et al. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods. 2017;14(1):71–73.
  34. Huber R, Bennett WS Jr. Functional significance of flexibility in proteins. Biopolymers. 1983;22(1):261–279.
  35. Humphrey W, Dalke A, Schulten K. VMD: Visual molecular dynamics. J Mol Graph. 1996;14(1):33–38.
  36. Johansen D, Jeffries CM, Hammouda B, Trewhella J, Goldenberg DP. Effects of macromolecular crowding on an intrinsically disordered protein characterized by small‐angle neutron scattering with contrast matching. Biophys J. 2011;100(4):1120–1128.
  37. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML. Comparison of simple potential functions for simulating liquid water. J Chem Phys. 1983;79(2):926–935.
  38. Kachala M, Valentini E, Svergun DI. Application of SAXS for the structural characterization of IDPs. In: Felli IC, Pierattelli R, editors. Intrinsically disordered proteins studied by NMR spectroscopy. Cham: Springer International Publishing; 2015. p. 261–289.
  39. Kingma DP, Welling M. Auto‐encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
  40. Kjaergaard M, Nørholm AB, Hendus‐Altenburger R, Pedersen SF, Poulsen FM, Kragelund BB. Temperature‐dependent structural changes in intrinsically disordered proteins: formation of alpha‐helices or loss of polyproline II? Protein Sci. 2010;19(8):1555–1564.
  41. Konrat R. NMR contributions to structural dynamics studies of intrinsically disordered proteins. J Magn Reson. 2014;241:74–85.
  42. Krzeminski M, Marsh JA, Neale C, Choy W‐Y, Forman‐Kay JD. Characterization of disordered proteins with ENSEMBLE. Bioinformatics. 2013;29(3):398–399.
  43. Laux V, Callow P, Svergun DI, Timmins PA, Forsyth VT, Haertlein M. Selective deuteration of tryptophan and methionine residues in maltose binding protein: a model system for neutron scattering. Eur Biophys J. 2008;37(6):815–822.
  44. Lindahl E, Hess B, van der Spoel D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J Mol Model. 2001;7(8):306–317.
  45. Lindorff‐Larsen K, Piana S, Palmo K, Maragakis P, Klepeis JL, Dror RO, et al. Improved side‐chain torsion potentials for the Amber ff99SB protein force field. Proteins Struct Funct Bioinform. 2010;78(8):1950–1958.
  46. Manalastas‐Cantos K, Konarev PV, Hajizadeh NR, Kikhney AG, Petoukhov MV, Molodenskiy DS, et al. ATSAS 3.0: expanded functionality and new tools for small‐angle scattering data analysis. J Appl Crystallogr. 2021;54(Pt 1):343–355.
  47. Mansouri AL, Grese LN, Rowe EL, Pino JC, Chennubhotla SC, Ramanathan A, et al. Folding propensity of intrinsically disordered proteins by osmotic stress. Mol Biosyst. 2016;12(12):3695–3701.
  48. Michalet X, Weiss S, Jäger M. Single‐molecule fluorescence studies of protein folding and conformational dynamics. Chem Rev. 2006;106(5):1785–1813.
  49. Nichols PJ, Falconer I, Griffin A, Mant C, Hodges R, McKnight CJ, et al. Deuteration of nonexchangeable protons on proteins affects their thermal stability, side‐chain dynamics, and hydrophobicity. Protein Sci. 2020;29(7):1641–1654.
  50. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF Chimera: a visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605–1612. doi: 10.1002/jcc.20084
  51. Rauscher S, Gapsys V, Gajda MJ, Zweckstetter M, de Groot BL, Grubmüller H. Structural ensembles of intrinsically disordered proteins depend strongly on force field: a comparison to experiment. J Chem Theory Comput. 2015;11(11):5513–5524.
  52. Réat V, Patzelt H, Ferrand M, Pfister C, Oesterhelt D, Zaccai G. Dynamics of different functional parts of bacteriorhodopsin: H‐2H labeling and neutron scattering. Proc Natl Acad Sci. 1998;95(9):4970–4975.
  53. Receveur‐Brechot V, Durand D. How random are intrinsically disordered proteins? A small angle scattering perspective. Curr Protein Pept Sci. 2012;13(1):55–75.
  54. Robustelli P, Piana S, Shaw DE. Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci. 2018;115(21):E4758.
  55. Schneider R, Blackledge M, Jensen MR. Elucidating binding mechanisms and dynamics of intrinsically disordered protein complexes using NMR spectroscopy. Curr Opin Struct Biol. 2019;54:10–18.
  56. Shen Y, Bax A. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J Biomol NMR. 2010;48(1):13–22.
  57. Shrestha UR, Juneja P, Zhang Q, Gurumoorthy V, Borreguero JM, Urban V, et al. Generation of the configurational ensemble of an intrinsically disordered protein from unbiased molecular dynamics simulation. Proc Natl Acad Sci U S A. 2019;116(41):20446–20452.
  58. Stuhrmann HB. Contrast variation application in small‐angle neutron scattering experiments. J Phys Conf Ser. 2012;351:12002.
  59. Svergun DI, Richard S, Koch MHJ, Sayers Z, Kuprin S, Zaccai G. Protein hydration in solution: experimental observation by x‐ray and neutron scattering. Proc Natl Acad Sci. 1998;95(5):2267–2272.
  60. Tian H, Jiang X, Trozzi F, Xiao S, Larson EC, Tao P. Explore protein conformational space with variational autoencoder. Front Mol Biosci. 2021;8:8.
  61. Tieleman T, Hinton G. Lecture 6.5‐rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 4(2), 26–31. 2012.
  62. Vendruscolo M, Dobson CM. Towards complete descriptions of the free‐energy landscapes of proteins. Philos Transact A Math Phys Eng Sci. 2005;363(1827):433–450.
  63. Whitten AE, Cai S, Trewhella J. MULCh: modules for the analysis of small‐angle neutron contrast variation data from biomolecular assemblies. J Appl Crystallogr. 2008;41(1):222–226.
  64. Wignall GD, Bates FS. Absolute calibration of small‐angle neutron scattering data. J Appl Crystallogr. 1987;20(1):28–40.
  65. Wood K, Gallat F‐X, Otten R, van Heel AJ, Lethier M, van Eijck L, et al. Protein surface and core dynamics show concerted hydration‐dependent activation. Angew Chem Int Ed. 2013;52(2):665–668.
  66. Yoda T, Sugita Y, Okamoto Y. Comparisons of force fields for proteins by generalized‐ensemble simulations. Chem Phys Lett. 2004;386(4–6):460–467.
  67. Zhang W, Wu C, Duan Y. Convergence of replica exchange molecular dynamics. J Chem Phys. 2005;123(15):154105.
  68. Zhao JK, Gao CY, Liu D. The extended Q‐range small‐angle neutron scattering diffractometer at the SNS. J Appl Crystallogr. 2010;43(5):1068–1077.
  69. Zhou R, Berne BJ, Germain R. The free energy landscape for β hairpin folding in explicit water. Proc Natl Acad Sci. 2001;98(26):14931–14936.

Associated Data

Supplementary Materials

Data S1. Supporting information.

Data Availability Statement

All relevant data are available from within the manuscript and Supplementary Information, and from the corresponding author upon reasonable request.


