Author manuscript; available in PMC: 2021 Jun 1.
Published in final edited form as: Proteins. 2019 Dec 27;88(6):775–787. doi: 10.1002/prot.25865

SAXSDom: Modeling multidomain protein structures using small-angle X-ray scattering data

Jie Hou 1, Badri Adhikari 2, John J Tanner 3, Jianlin Cheng 4,*
PMCID: PMC7230021  NIHMSID: NIHMS1546921  PMID: 31860156

Abstract

Many proteins are composed of several domains that pack together into a complex tertiary structure. Multidomain proteins can be challenging for protein structure modeling, particularly those for which templates can be found for individual domains but not for the entire sequence. In such cases, homology modeling can generate high quality models of the domains but not for the orientations between domains. Small-angle X-ray scattering (SAXS) reports the structural properties of entire proteins and has the potential for guiding homology modeling of multidomain proteins. In this paper, we describe a novel multidomain protein assembly modeling method, SAXSDom, that integrates experimental knowledge from SAXS with probabilistic Input-Output Hidden Markov model (IOHMM) to assemble the structures of individual domains together. Four SAXS-based scoring functions were developed and tested, and the method was evaluated on multidomain proteins from two public datasets. Incorporation of SAXS information improved the accuracy of domain assembly for 40 out of 46 CASP multidomain protein targets and 45 out of 73 multidomain protein targets from the AIDA dataset. The results demonstrate that SAXS data can provide useful information to improve the accuracy of domain-domain assembly. The source code and tool packages are available at https://github.com/jianlin-cheng/SAXSDom.

Keywords: Small-angle X-ray scattering, SAXS, protein structure, domain assembly, machine learning, probabilistic model, CASP

1. Introduction

Most proteins contain multiple domains. Vogel et al. define a protein domain as an “independent, evolutionary unit that can form a single-domain protein or be part of one or more different multidomain proteins”.1 Protein domains range in length from about 40 to 500 amino acids, with 100 residues being the most frequent domain length.2,3 Since the median protein chain length found in nature is a few hundred residues (361 in Eukarya, 267 in Bacteria, 247 in Archaea),4 most proteins are multidomain. The three-dimensional arrangement of domains within the folded protein, known as the domain architecture, is central to the function of multidomain proteins.

Multidomain proteins present unique challenges to protein structure modeling. The most difficult case occurs when templates can be found only for the domains but not for the entire sequence. In this case, most computational methods adopt a “divide and conquer” strategy in which the sequence is parsed into domains, and the three-dimensional structures of the domains are predicted with either comparative (homology) structure modeling5,6 or de novo structure prediction7,8 on individual domains.9 The predicted structures of domains are subsequently assembled into a full-length structural model using a variety of approaches, such as treating the problem as a special case of protein-protein docking,10–12 using protein folding algorithms to predict the conformation of the linkers between rigid domains,13,14 and using ab initio folding potentials.15 Despite these advances, the modeling of multidomain protein structures remains an ongoing area of research. The use of experimental restraints has the potential to improve the accuracy of predicting multidomain protein structures. Cross-linking/mass spectrometry and small-angle X-ray scattering (SAXS) are two notable examples of experimental methods that provide distance information that can be combined with structure modeling into so-called “hybrid” methods.16–18 In particular, the explosion of biological SAXS over the last 5–10 years19–22 suggests that it may be especially impactful in hybrid methods. SAXS provides solution structural information in the form of the radius of gyration (Rg), the maximum particle dimension, and the electron pair distance distribution function (P(r)).
Furthermore, SAXS provides information about the molecular mass in solution, oligomeric state, and quaternary structure.23 Several groups have integrated SAXS data into their protein structure prediction pipelines.24–27 Also, in the recent Critical Assessment of Protein Structure Prediction (CASP) competition, SAXS information was incorporated into the data-assisted category that aimed to assess the potential of integrating SAXS data with protein structure prediction methods for protein folding.18 Most CASP12 approaches utilized SAXS as additional driving restraints involving (1) the goodness-of-fit between the experimental SAXS curve and those computed from models; (2) comparison of the experimental P(r) to the P(r) histogram calculated from the model; and (3) Rg as a restraint on the size of the structure. Although SAXS-based hybrid modeling holds great promise, more research is needed to determine the best ways to fully leverage the experimental information from SAXS in protein structure modeling.

In this work, we investigated the use of restraints from SAXS in multidomain assembly. We developed a novel framework to systematically integrate the probabilistic approach for protein conformational sampling with SAXS-assisted structure folding. Our method applies a probabilistic Input-Output Hidden Markov Model (IOHMM) and Monte Carlo sampling to simulate the domain-domain orientation with SAXS-related energies enforced, so that it can generate near-native structures that have low free energy and good agreement with the SAXS curve. In addition, we examined the correlation between the SAXS scoring functions and structural quality (i.e. RMSD) on the CASP proteins, which shows the effectiveness of SAXS data in structural analysis. Our method shows a significant improvement in domain assembly and structure folding after incorporating SAXS information as additional energies to the physics-based force field, which demonstrates the promise of using SAXS data in computational protein structure modeling.

2. Materials and Methods

2.1. Benchmark sets

To assess how well each SAXS-based pseudo-energy function correlates with structural quality (i.e. RMSD),28 we collected predicted structural models generated for protein targets that were tested in the 8th, 9th, 10th, and 11th Critical Assessments of Structure Prediction (CASP) experiments.29 The proteins whose experimental structures were available were selected for preliminary analysis. The dataset contains 112,050 models corresponding to 428 single-domain and multidomain proteins. The detailed statistics are provided in Supporting Information Table S1.

In addition, we evaluated our method on three types of datasets to validate the effectiveness of SAXS data in protein domain assembly. The first dataset contains multidomain proteins from CASP8–12 whose experimental structures are available. The domain definition (i.e. number of domains and the domain boundaries) of each protein was determined by CASP assessors.30 Since our method requires continuous domains as input, the domains with chain breaks (defined as a distance between adjacent CA atoms larger than 4 Å) were removed from the dataset. After this filtering, we collected 51 CASP multidomain proteins for the domain assembly analysis. The length of domain linkers among the 51 proteins ranges from 5 to 21 residues. We randomly selected 5 targets to determine the weights for the SAXS terms of the target function. The remaining 46 targets were used to compare the performance of different SAXS scoring functions for domain assembly. The structural similarities between the five training proteins and the testing proteins are summarized in Figure S1. The structures of individual domains for all 51 CASP targets were directly derived from their native protein structures and were further used for domain assembly.
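The continuity filter described above can be illustrated with a short sketch. It is not taken from the SAXSDom source code: the function name `has_chain_break` and the toy coordinates are hypothetical, and only the 4 Å CA-CA threshold comes from the text.

```python
import math

def has_chain_break(ca_coords, max_ca_ca=4.0):
    """Return True if any two consecutive CA atoms are more than max_ca_ca (in Å) apart."""
    for a, b in zip(ca_coords, ca_coords[1:]):
        if math.dist(a, b) > max_ca_ca:
            return True
    return False

# Toy CA traces (hypothetical coordinates in Å).
continuous_domain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
broken_domain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (12.0, 0.0, 0.0)]

print(has_chain_break(continuous_domain))  # False: all steps are 3.8 Å
print(has_chain_break(broken_domain))      # True: the last step is 8.2 Å
```

A domain for which the function returns True would be excluded from the benchmark under the criterion stated above.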

The second dataset is a collection of two-domain proteins curated in the Ab initio Domain Assembly (AIDA) server.15 The number of domains in each protein was determined by DomainParser.31 In contrast to the CASP dataset, for which native domain structures were used for assembly, we first used our MULTICOM tertiary structure system9 to predict the structures of individual domains from their homology templates. The domains whose predicted structures have a TM-score > 0.9 against their native structures were selected for domain assembly. In total, MULTICOM predicted high-quality models for the domains of 73 proteins in the AIDA dataset. The length of domain linkers in these 73 proteins ranges from 5 to 15 residues. The predicted structures were used for domain assembly analysis.

We also tested our method on two monomeric proteins for which experimental SAXS data are available. The first protein is the 1127-residue Rhodobacter capsulatus PutA (RcPutA), whose homology model has been comprehensively studied using SAXS data in previous work.32 Two domains have been identified in RcPutA from the templates, corresponding to residues 1–972 and residues 994–1127. The second test case is bovine serum albumin (SASBDB33 accession code SASDBJ3).34 The domain boundary was determined according to the structural templates, resulting in two domains: residues 1–292 and residues 303–583.

2.2. Domain-Domain orientation driven by united-residue model and probabilistic sampling

Given individual domain structures for a protein sequence, our method first converts the polypeptide chains of domains into the united-residue representation described in the UNRES model.8,35 In the UNRES model, the backbone of the polypeptide chain is approximated by a sequence of α-carbon atoms linked by virtual bonds, and the conformation of the protein chain is determined by the virtual bond lengths (bcαi), virtual bond angles (θi), and virtual bond dihedral angles (τi) among adjacent α-carbon atoms (Figure 1). In addition, united side chains are attached to the α-carbon atoms, where two side-chain angles (δi and γi) and a virtual-bond length (bsci) determine the location of the side chain. These six variables parameterize the geometry of the α-carbon (i) and side chain (SCi) at the ith residue of a polypeptide chain in conformation space. We used the Input-Output Hidden Markov Model (IOHMM) that was trained in our previous work8 to sample the virtual-bond lengths and virtual-bond torsion angles given the predicted secondary structure in the linker regions. Each cycle of Monte Carlo sampling proposes one move of the domain-domain orientation, which is accepted or rejected under simulated annealing. The structures of the individual domains are unchanged during sampling (i.e. treated as rigid bodies). Thus, the conformation of the linker regions can be conditionally resampled given the known prior structural information of the domains based on the probabilistic model, which can predict more accurate local structural preferences of linkers than random sampling and potentially reduce the number of local movements in conformational space needed to achieve convergence.
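To make the backbone part of this parameterization concrete, the following sketch computes virtual-bond lengths, virtual-bond angles, and virtual-bond dihedral angles from a Cα trace. It is only an illustration under simplifying assumptions: the function names are hypothetical, the side-chain variables are omitted, and the sign convention of the dihedral is the common atan2-based one, which may differ from the convention used in UNRES.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def virtual_bond_geometry(ca):
    """Return (bond lengths, bond angles in degrees, dihedrals in degrees) of a CA trace."""
    lengths = [_norm(_sub(ca[i + 1], ca[i])) for i in range(len(ca) - 1)]
    angles = []
    for i in range(len(ca) - 2):
        u = _sub(ca[i], ca[i + 1])
        v = _sub(ca[i + 2], ca[i + 1])
        cos_theta = _dot(u, v) / (_norm(u) * _norm(v))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_theta)))))
    dihedrals = []
    for i in range(len(ca) - 3):
        b1 = _sub(ca[i + 1], ca[i])
        b2 = _sub(ca[i + 2], ca[i + 1])
        b3 = _sub(ca[i + 3], ca[i + 2])
        n1, n2 = _cross(b1, b2), _cross(b2, b3)
        y = _norm(b2) * _dot(b1, n2)
        x = _dot(n1, n2)
        dihedrals.append(math.degrees(math.atan2(y, x)))
    return lengths, angles, dihedrals

# Toy four-residue trace with unit virtual bonds at right angles.
toy_trace = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 1.0)]
toy_lengths, toy_angles, toy_dihedrals = virtual_bond_geometry(toy_trace)
print(toy_lengths, toy_angles, toy_dihedrals)
```

A chain of n residues yields n-1 virtual-bond lengths, n-2 virtual-bond angles, and n-3 dihedral angles, which is why resampling a short linker involves only a handful of variables.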

Figure 1.


Pipeline of SAXSDom for domain assembly with parameterization of conformation in linker regions and overall shape match with SAXS data.

Our method implements domain assembly in the following steps, as depicted in Figure 1. Given the full-length sequence of a protein, we first predict the sequence’s 8-class secondary structure using SSpro.36 Then we sample the united-residue conformation for the entire polypeptide chain using the IOHMM for structure initialization. After the conformation is initialized, the torsion angles and virtual-bond lengths of the α-carbon and side-chain atoms at each residue position in the full-length polypeptide chain are updated according to their geometry in the pre-determined domain structures. The regions whose structure information is not provided in the domain structures are considered linkers that anchor the domains together. The conformation of the linker regions is then sampled using the IOHMM, and the domain structures are oriented using a simulated annealing algorithm to generate structural models with the lowest energy. Because every region without domain coordinates is treated as a linker, our method can be applied to assemble any number of domains for multidomain proteins.
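The rule that residues not covered by any domain structure are treated as linkers can be sketched as follows. The helper `find_linkers` is hypothetical and is not part of the SAXSDom package; the two examples use the domain boundaries given in Section 2.1 for RcPutA and bovine serum albumin.

```python
def find_linkers(seq_len, domains):
    """Return 1-based inclusive residue ranges that are not covered by any domain."""
    covered = [False] * (seq_len + 1)  # index 0 is unused
    for start, end in domains:
        for i in range(start, end + 1):
            covered[i] = True
    linkers, begin = [], None
    for i in range(1, seq_len + 1):
        if not covered[i] and begin is None:
            begin = i                       # a new uncovered stretch starts here
        elif covered[i] and begin is not None:
            linkers.append((begin, i - 1))  # the uncovered stretch ended at i - 1
            begin = None
    if begin is not None:
        linkers.append((begin, seq_len))    # uncovered C-terminal tail
    return linkers

rcputa_linkers = find_linkers(1127, [(1, 972), (994, 1127)])
bsa_linkers = find_linkers(583, [(1, 292), (303, 583)])
print(rcputa_linkers)  # [(973, 993)]
print(bsa_linkers)     # [(293, 302)]
```

Because the function simply reports every uncovered stretch, the same logic applies unchanged to proteins with three or more domains.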

2.3. Integrating physics-based force field with SAXS restraints for domain-domain assembly

Our method adopts the united-residue physics-based force field that was defined in our previous work to represent the energy of a united-residue peptide chain.8 The physics energy includes the mean free energy of hydrophobic (hydrophilic) interactions between side chains (Esciscj), the excluded-volume potential of side-chain and peptide group interactions (Escipj), and the backbone peptide group interaction representing the average electrostatic interaction (Epipj) for any pair of residues at the ith and jth positions in the polypeptide chain, as represented in Equation (1):

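$$E_{physics} = w_{sc} \times \sum_{j} \sum_{i<j} E_{sc_i sc_j} + w_{scp} \times \sum_{j} \sum_{i \neq j} E_{sc_i p_j} + w_{el} \times \sum_{j} \sum_{i<j-1} E_{p_i p_j}. \quad (1)$$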

Unlike our earlier approach that generated chain conformation based on stepwise sampling of foldon units, our current method only samples the conformation of the linker regions and keeps the structures of the domains fixed. Therefore, the physics-based force field of intra-domain interactions is stable during conformation sampling, and the energy of chain conformation is only affected by the interactions of all inter-domain residues (i.e. interaction interface) and all linker residues, where the physics energy can be further represented as in Equation (2):

$$E_{physics} = E_{physics}(\text{intradomain}) + E_{physics}(\text{interdomain}) + E_{physics}(\text{linker}). \quad (2)$$

It is worth noting that the energy of hydrophobic (hydrophilic) interactions between side chains of linker residues plays an important role in the protein folding and domain-domain movement.37 Studies showed that the average residue hydrophobicity (hydrophilicity) is largely influenced by the size of linkers, where longer linkers are more hydrophilic and exposed so that they induced larger domain motions in the conformation space. Inversely, smaller linkers showed more hydrophobic character, which may significantly restrain the domain-domain movement.38

We introduced additional energy terms corresponding to the SAXS restraints for the total energy calculation, defined as:

$$E_{saxs} = E_{saxs \cdot IntFit} + E_{saxs \cdot \chi} + E_{saxs \cdot Pr} + E_{saxs \cdot Rg}. \quad (3)$$

The first term in the SAXS energy, Esaxs ⋅ IntFit, represents the normalized fitness between the experimental SAXS intensity and computed intensity from the models, which is defined as:

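$$E_{saxs \cdot IntFit} = w_{saxs \cdot IntFit} \times \frac{\sum_{i=1}^{N} \left| I_{exp}(q_i) - I_{model}(q_i) \right|}{\sum_{i=1}^{N} \left| I_{exp}(q_i) \right|}. \quad (4)$$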

In Equation (4), Iexp(q) is the experimental SAXS intensity and Imodel(q) is the theoretical SAXS intensity calculated from models. We employ the same strategy as FoXS39,40 to calculate Imodel(q) and to determine the best fit between Iexp(q) and Imodel(q) by minimizing the χ function:

$$\chi = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{I_{exp}(q_i) - c \, I_{model}(q_i)}{\sigma(q_i)} \right)^{2}}. \quad (5)$$

In Equation (5), σ(q) is the experimental error of the measured SAXS profile, N is the number of points in the profile, and c is the scale factor determined from linear least-squares analysis to derive the minimum value of χ.
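The following sketch illustrates the unweighted forms of Equations (4) and (5) on toy data. It is a simplified illustration rather than the FoXS implementation: it fits only the scale factor c by weighted linear least squares, the function names are hypothetical, and the intensity values are invented.

```python
import math

def scale_factor(i_exp, i_model, sigma):
    """Least-squares scale c minimizing sum(((I_exp - c * I_model) / sigma)^2)."""
    num = sum(e * m / s ** 2 for e, m, s in zip(i_exp, i_model, sigma))
    den = sum(m * m / s ** 2 for m, s in zip(i_model, sigma))
    return num / den

def chi_score(i_exp, i_model, sigma):
    """chi of Equation (5), evaluated at the best-fitting scale factor."""
    c = scale_factor(i_exp, i_model, sigma)
    residuals = sum(((e - c * m) / s) ** 2 for e, m, s in zip(i_exp, i_model, sigma))
    return math.sqrt(residuals / len(i_exp))

def intensity_fit(i_exp, i_model):
    """Normalized absolute intensity difference of Equation (4), without the weight."""
    num = sum(abs(e - m) for e, m in zip(i_exp, i_model))
    den = sum(abs(e) for e in i_exp)
    return num / den

# Toy profiles: the model is exactly half the experimental intensity at every q.
i_exp = [20.0, 10.0, 2.0]
i_model = [10.0, 5.0, 1.0]
sigma = [1.0, 1.0, 1.0]

c = scale_factor(i_exp, i_model, sigma)
chi = chi_score(i_exp, i_model, sigma)
fit = intensity_fit([10.0, 5.0, 1.0], [9.0, 5.0, 2.0])
print(c, chi, fit)
```

In the toy example the two profiles differ only by a constant factor, so the fitted scale absorbs the difference and χ vanishes, whereas the normalized difference of Equation (4) as written applies no scaling to the model intensity.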

The second term in the SAXS energy function, Esaxs ⋅ χ, includes χ as an additional score term to account for the degree of SAXS profile matching and is defined as follows:

$$E_{saxs \cdot \chi} = w_{saxs \cdot \chi} \times \chi. \quad (6)$$

The third term in the SAXS energy function, Esaxs ⋅ Pr, represents the Kullback-Leibler divergence between the pairwise atom-atom distance distribution function P(r) derived from the experimental SAXS profile and the pair distance distribution computed from the model, which is defined as:

$$E_{saxs \cdot Pr} = w_{saxs \cdot Pr} \times \sum_{i=1}^{N} P_{r}^{model}(r_i) \, \log \frac{P_{r}^{model}(r_i)}{P_{r}^{exp}(r_i)}. \quad (7)$$

The experimental P(r) is calculated from the experimental SAXS intensity curve using an indirect Fourier transform along with an assumption of the maximum particle size (dmax).41,42 The pair distance distribution of the protein structure is directly calculated from its atomic coordinates.
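A minimal sketch of the Kullback-Leibler term in Equation (7) is given below. Several details are assumptions made for illustration only: both distributions are binned on the same r grid, they are normalized to unit sum before comparison, and a small floor `eps` guards against empty experimental bins. None of these choices is confirmed by the text, and the function name is hypothetical.

```python
import math

def kl_divergence(p_model, p_exp, eps=1e-12):
    """Kullback-Leibler divergence D(P_model || P_exp) for two histograms on the same bins."""
    total_model, total_exp = sum(p_model), sum(p_exp)
    divergence = 0.0
    for m, e in zip(p_model, p_exp):
        m, e = m / total_model, e / total_exp  # normalize to unit sum (assumption)
        if m > 0.0:                            # bins with zero model density contribute nothing
            divergence += m * math.log(m / max(e, eps))
    return divergence

same = kl_divergence([1.0, 2.0, 1.0], [2.0, 4.0, 2.0])
different = kl_divergence([0.5, 0.5], [0.25, 0.75])
print(same, different)
```

The divergence is zero when the two normalized distributions coincide and grows as the model places probability mass at distances that are rare in the experimental P(r).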

The last term in the SAXS energy function, Esaxs ⋅ Rg, is a penalty function based on the agreement between the experimental Rg and the Rg calculated from the protein model:

$$E_{saxs \cdot Rg} = w_{saxs \cdot Rg} \times \overline{\left( \frac{R_{g,exp} - R_{g,model}}{R_{g,exp}} \right)}. \quad (8)$$

The SAXS-related quantities (i.e. SAXS intensity, P(r) and Rg ) described above were calculated using algorithms implemented in the Integrated Modeling Platform (IMP) package.43
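For illustration, the sketch below computes a geometric radius of gyration from coordinates and a relative Rg deviation in the spirit of Equation (8). It does not reproduce the IMP routines: the equal weighting of atoms, the use of the absolute value of the deviation, and the function names are all assumptions of this sketch.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration: root-mean-square distance of points from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    squared = sum(sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords)
    return math.sqrt(squared / n)

def rg_penalty(rg_exp, rg_model):
    """Relative deviation between experimental and model Rg (absolute value is an assumption)."""
    return abs(rg_exp - rg_model) / rg_exp

toy_rg = radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)])
toy_penalty = rg_penalty(2.0, toy_rg)
print(toy_rg, toy_penalty)
```

Dividing by the experimental Rg makes the penalty dimensionless, so a single weight can be applied to proteins of very different sizes.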

We adopted the same weight configuration for the physics-based force field energy terms listed in Equation (1) as in our previous method,8 where wsc = 1.00000, wscp = 2.73684, and wel = 0.06833. For the SAXS energy terms described in Equation (3), we set wsaxs ⋅ χ = 10, wsaxs ⋅ IntFit = 700, wsaxs ⋅ Pr = 700, and wsaxs ⋅ Rg = 700 after experimenting with several weights on the five training proteins. In summary, the energy for a multidomain polypeptide chain in our method is:

$$E_{total} = E_{physics}(\text{intradomain}) + E_{physics}(\text{interdomain}) + E_{physics}(\text{linker}) + E_{saxs}. \quad (9)$$

In addition to the four SAXS-related scoring functions defined in Equations (4)–(8), we also experimented with ten other SAXS-based scoring functions based on the agreement between the experimental SAXS profiles and those computed from models (functions 5–14 of Table S2).

The physics-based energies are calculated from united-residue models, whereas the SAXS energy calculations require a full-atom representation or at least a Cα-trace. We therefore reconstruct the Cα-trace and side chains from the united-residue protein representation using PULCHRA44 to generate full-atom protein models for the SAXS energy calculation. In order to speed up SAXS fitting and computation, the functions of FoXS,39 PULCHRA,44 and IMP43 have been incorporated into our program instead of being called as external programs during sampling.

We used simulated annealing Monte Carlo to search for the lowest-energy assembled multidomain conformation. Since only the linker regions are resampled during domain-domain orientation, the sampling space is significantly reduced. The number of Monte Carlo cycles for each linker is set to 100 times the number of residues in the linker. Given an assembled protein model in each cycle, the total energy, including the physics- and SAXS-based energies, is calculated and compared to the energy of the previous conformation. The domain movement is accepted with probability α = min(1, e^(−ΔE/t)), where ΔE is the energy change caused by the domain movement and t is the temperature of simulated annealing.
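The acceptance rule and cycle count described above can be sketched with a toy simulated annealing loop. This is an illustration under stated assumptions, not the SAXSDom sampler: the Metropolis form min(1, exp(-ΔE/t)), the geometric cooling schedule, the temperature range, and the one-dimensional toy energy are all choices made for this sketch, and every function name is hypothetical.

```python
import math
import random

def acceptance_probability(delta_e, temperature):
    """Metropolis criterion: always accept downhill moves, uphill with exp(-dE / t)."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / temperature)

def monte_carlo_cycles(linker_length, cycles_per_residue=100):
    """Number of cycles for one linker: 100 times the number of linker residues."""
    return linker_length * cycles_per_residue

def anneal(energy, propose, state, n_cycles, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing that returns the lowest-energy state visited."""
    rng = random.Random(seed)
    current_e = energy(state)
    best_state, best_e = state, current_e
    for step in range(n_cycles):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(1, n_cycles - 1))
        candidate = propose(state, rng)
        candidate_e = energy(candidate)
        if rng.random() < acceptance_probability(candidate_e - current_e, t):
            state, current_e = candidate, candidate_e
            if current_e < best_e:
                best_state, best_e = state, current_e
    return best_state, best_e

# Toy stand-in for the total energy: a single coordinate with a minimum at x = 3.
toy_energy = lambda x: (x - 3.0) ** 2
toy_move = lambda x, rng: x + rng.uniform(-0.5, 0.5)

n_cycles = monte_carlo_cycles(9)  # e.g. a 9-residue linker gives 900 cycles
best_x, best_energy = anneal(toy_energy, toy_move, 0.0, n_cycles)
print(n_cycles, best_x, best_energy)
```

In the actual method the state would be the set of linker virtual-bond variables and the energy would be the total energy of Equation (9); the toy coordinate only demonstrates the accept/reject logic.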

3. Results and Discussions

3.1. Evaluation of different SAXS profile matching score functions

We first tested several SAXS scoring functions to identify those that correlate best with the structural quality of a predicted model. Fourteen functions were considered, including the four described in detail above (Equations 4, 6, 7, 8) and ten more shown in Table S2. The test set consisted of the predicted server models of 428 targets from CASP8 to CASP11 (Table S1). Theoretical SAXS curves (I(q)) were calculated from both the experimental structures and the predicted models using FoXS,39 and the resulting SAXS curves were used to calculate distance distribution functions (P(r)) using GNOM.45 For each predicted model, we generated SAXS data from both the full-atom and the Cα-atom structure. Model quality was expressed as the Root Mean Square Deviation (RMSD) between the model and its experimental structure.

The Pearson correlation coefficient (PCC) between the RMSD and each of the 14 SAXS scores of all the predicted models for each protein was calculated, and the averaged correlations over the 428 targets are listed in Table S2 (full-atom model) and Table S3 (Cα-atom model). Three SAXS scores stood out from the others. The P(r)-based function (score 2), Rg agreement function (score 3), and the normalized I(q) fitness function (score 5) showed the highest correlation with RMSD, with averaged PCCs of 0.6, 0.7, and 0.59, respectively, when using the full-atom treatment (Table S2). The use of Cα-atom models led to a similar result, with scores 2, 3, and 5 outperforming the others (Table S3). This result is potentially useful, since Cα-trace modeling is typically faster than all-atom modeling. The averaged PCCs for the three best functions are shown in Figure 2. Since the χ function is a common metric for comparing SAXS scattering curves, we include it for comparison in Figure 2. Note that the χ-score (score 1 in Table S2) achieved relatively low correlations of 0.47 and 0.38 for full-atom and Cα-atom models, respectively. Based on these results, we included the three top performing score functions (Equations 4, 7, 8) as SAXS energies in the SAXSDom domain assembly calculations described below.
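The per-target correlation analysis can be sketched as follows: a Pearson correlation is computed between RMSD and a SAXS score over the models of each target, and the coefficients are then averaged over targets. The function names and the toy numbers are hypothetical and serve only to show the calculation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def mean_pcc(per_target):
    """Average the per-target PCC over a list of (rmsd_values, saxs_scores) pairs."""
    values = [pearson(rmsd, score) for rmsd, score in per_target]
    return sum(values) / len(values)

# Toy data for two targets, three models each (invented values).
target_a = ([2.0, 5.0, 9.0], [0.1, 0.4, 0.8])   # score rises with RMSD
target_b = ([3.0, 4.0, 8.0], [0.2, 0.3, 0.7])
average = mean_pcc([target_a, target_b])
print(average)
```

Averaging per-target coefficients, rather than pooling all models, keeps targets with many models from dominating the statistic.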

Figure 2.


Average Pearson correlation coefficient (PCC) between the structural quality (RMSD) and the SAXS score functions derived from (a) full-atom and (b) Cα atom models of protein structure. Analysis was done based on the predicted models from CASP8–11.

3.2. Performance of SAXSDom in assembling 46 CASP multidomain proteins

In order to validate the improvement of domain assembly obtained by incorporating SAXS information, we first developed a baseline approach, SAXSDom-abinitio, which used only the united-residue physics-based force field (Equation 1) and did not incorporate any SAXS information. We then tested five SAXS-based approaches that adopted four different SAXS energy terms either alone or in combination. The results using the SAXS functions individually are labeled SAXSDom(Esaxs ⋅ IntFit), SAXSDom(Esaxs ⋅ Pr), SAXSDom(Esaxs ⋅ Rg), and SAXSDom(Esaxs ⋅ χ). Note that these metrics correspond to the top performing functions identified in the previous section, plus the historical SAXS χ statistic. Results obtained when using all four SAXS functions in combination are denoted SAXSDom(Esaxs). All SAXSDom methods were employed to assemble domains for 46 CASP multidomain proteins, and each method generated 50 full-length models for each protein. For each protein, the initial coordinates of each domain were directly derived from the experimental structure, and the secondary structure of the full-length protein sequence was predicted by SCRATCH.46 The “experimental” SAXS intensity profile was calculated by FoXS from the experimental structure. After 50 models were generated, we assessed model quality with Qprob47 to rank the assembled models. Qprob estimates the prediction error using several physicochemical, structural and energy feature scores, and then uses the combination of probability density distributions of the errors for the global quality assessment. Each domain assembly method was evaluated based on the averaged TM-score and RMSD of the Qprob-ranked top 1 model, the best in the top five models, and the best in all 50 models for the 46 proteins. The results for the six methods are reported in Table 1 and Figure 3.
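The three evaluation levels (top 1, best in top 5, best in all models) can be sketched for a single target as shown below. The sketch assumes that a higher quality-assessment score means a better predicted model and that the best TM-score and the best RMSD within a subset are taken independently; both points, as well as the function name and the toy values, are assumptions of this illustration rather than details confirmed by the text.

```python
def summarize_target(models, top_n=5):
    """models: list of (qa_score, tm_score, rmsd) tuples for one target.

    Returns the best (TM-score, RMSD) among the top-ranked model,
    the top_n ranked models, and all models.
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)  # higher QA score first

    def best(subset):
        return max(m[1] for m in subset), min(m[2] for m in subset)

    return {
        "top1": best(ranked[:1]),
        "top5": best(ranked[:top_n]),
        "all": best(ranked),
    }

# Six toy models (invented values): (QA score, TM-score, RMSD in Å).
toy_models = [
    (0.9, 0.70, 6.0), (0.8, 0.85, 3.0), (0.5, 0.60, 9.0),
    (0.4, 0.75, 5.0), (0.3, 0.65, 7.0), (0.2, 0.90, 2.0),
]
summary = summarize_target(toy_models)
print(summary)
```

The gap between the top 1 entry and the best-in-all entry in such a summary reflects ranking loss: how much accuracy is forfeited because the quality-assessment score does not always place the most accurate model first.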

Table 1.

Summary of the domain assembly performance using ab initio modeling (without SAXS) and ab initio modeling plus different SAXS-related scoring functions on the 46 multidomain proteins in CASP dataset. The top 1 model and top 5 models are determined based on Qprob ranking.

| Scoring function | Top 1 model: TM-score | Top 1 model: RMSD | Best in top 5 models: TM-score | Best in top 5 models: RMSD | Best in all 50 models: TM-score | Best in all 50 models: RMSD |
|---|---|---|---|---|---|---|
| SAXSDom-abinitio | 0.73 | 8.41 | 0.76 | 6.47 | 0.80 | 4.43 |
| SAXSDom (Esaxs ⋅ χ) | 0.81 | 5.09 | 0.85 | 3.49 | 0.88 | 2.60 |
| SAXSDom (Esaxs ⋅ IntFit) | 0.76 | 6.77 | 0.82 | 3.96 | 0.87 | 2.74 |
| SAXSDom (Esaxs ⋅ Pr) | 0.80 | 5.27 | 0.85 | 3.46 | 0.89 | 2.29 |
| SAXSDom (Esaxs ⋅ Rg) | 0.77 | 6.20 | 0.81 | 4.20 | 0.85 | 3.03 |
| SAXSDom (Esaxs) | 0.80 | 5.17 | 0.85 | 3.48 | 0.89 | 2.36 |

Figure 3.


Comparison of five SAXSDom approaches with the SAXSDom-abinitio method (which does not use SAXS) on the best of 50 assembled models. (A) SAXSDom (Esaxs) versus SAXSDom-abinitio (Left plot: TM-scores of SAXSDom (Esaxs) models versus TM-scores of SAXSDom-abinitio models; Middle plot: RMSD of the models of the two methods; Right plot: Distribution of χ-scores of all assembled models for 46 proteins by the two methods). (B) SAXSDom (Esaxs ⋅ χ) versus SAXSDom-abinitio. (C) SAXSDom (Esaxs ⋅ Pr) versus SAXSDom-abinitio. (D) SAXSDom (Esaxs ⋅ Rg) versus SAXSDom-abinitio. (E) SAXSDom (Esaxs ⋅ IntFit) versus SAXSDom-abinitio.

Incorporation of SAXS information clearly improved the accuracy of domain assembly. Whether one considers the top 1 model based on Qprob ranking, the best in the top five models, or the best in all 50 models, the averaged TM-score and RMSD of the assembled models are consistently better when SAXS information is included, compared to using only the physics-based force field (Table 1). The P-values for the differences between the SAXS-based methods and ab initio modeling according to TM-score and RMSD are reported in Table S4. For instance, as shown in Table 1, the method SAXSDom(Esaxs), which combines all four SAXS energy terms during conformation sampling, outperforms SAXSDom-abinitio by 9.59% (i.e. (0.80 − 0.73)/0.73), 11.84%, and 11.25% in TM-score and by 38.52%, 46.21%, and 46.73% in RMSD for the top one model, best of top five models, and best of all 50 models, respectively. Figure 3 shows the performance of the five SAXSDom methods with different SAXS energies and of the SAXSDom-abinitio method evaluated on the best of all 50 assembled models based on the RMSD, TM-score, and SAXS χ-score. As shown in Figure 3(A), SAXSDom(Esaxs) outperforms SAXSDom-abinitio in 40 out of 46 proteins in terms of RMSD and TM-score. We also evaluated the distribution of SAXS χ-scores for all generated models. As expected, the SAXS χ-scores of models assembled using SAXS information were lower than those of models built by ab initio sampling: the distribution for SAXSDom(Esaxs) is consistently shifted to lower SAXS χ-scores compared with SAXSDom-abinitio. Figure 3 (B), (C), (D) and (E) show the performance of domain assembly using the four individual SAXS energy terms and their comparison with the performance of ab initio sampling. The results of the method comparison evaluated on the top one and best five assembled models of the 46 proteins are also shown in Figures S2 and S3.
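The relative improvements quoted above follow directly from the averages in Table 1, as the short check below shows. The helper names are hypothetical; the input values are the SAXSDom-abinitio and SAXSDom(Esaxs) rows of Table 1. Because the tabulated averages are rounded to two decimals, the recomputed percentages can differ from the reported ones in the last digit.

```python
def relative_gain(new, baseline):
    """Percentage increase of a score where higher is better (TM-score)."""
    return (new - baseline) / baseline * 100.0

def relative_reduction(new, baseline):
    """Percentage decrease of a score where lower is better (RMSD)."""
    return (baseline - new) / baseline * 100.0

# (TM-score, RMSD) for top 1, best in top 5, best in all 50 models (Table 1).
abinitio = [(0.73, 8.41), (0.76, 6.47), (0.80, 4.43)]
saxs_all = [(0.80, 5.17), (0.85, 3.48), (0.89, 2.36)]

tm_gains = [relative_gain(s[0], a[0]) for s, a in zip(saxs_all, abinitio)]
rmsd_reductions = [relative_reduction(s[1], a[1]) for s, a in zip(saxs_all, abinitio)]

for label, tm, rmsd in zip(("top 1", "top 5", "all 50"), tm_gains, rmsd_reductions):
    print(f"{label}: TM-score +{tm:.2f}%, RMSD -{rmsd:.2f}%")
```

Expressing the RMSD change as a reduction relative to the baseline keeps both columns on the same footing, so that a larger percentage always means a larger improvement.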

Altogether, these results show that incorporating SAXS information as additional energies for conformational sampling can improve the accuracy of domain assembly. Using all four SAXS functions in combination performs comparably to the best individual terms (Esaxs ⋅ χ and Esaxs ⋅ Pr) and better than Esaxs ⋅ IntFit and Esaxs ⋅ Rg used alone (Table 1).

3.3. Performance of SAXSDom in AIDA multidomain proteins using predicted domain structures

We also assessed the performance of SAXSDom using 73 multidomain proteins which were originally curated for evaluating the ab initio domain assembly approach AIDA.15 In our work, the domain structures for these 73 proteins were predicted by the MULTICOM tertiary structure prediction method and then further assembled using our protocol. SAXSDom then generated 50 assembled models using the reference SAXS intensities derived from the native structures of the full-length proteins. Qprob was then used to re-rank the 50 models. The same protocol was applied to SAXSDom-abinitio to generate 50 models for the 73 proteins. The accuracy of the top Qprob-ranked models (i.e. top 1 model, best in top 5 models, best in all 50 models) was subsequently evaluated according to TM-score and RMSD. We also compared our methods with two other state-of-the-art structure modeling approaches, Modeller13 and AIDA.15 For each protein, Modeller and AIDA also generated 50 models, which were ranked according to their default energies. The qualities of the top ranked models generated by Modeller and AIDA were also evaluated and compared to our methods.

Table 2 reports the averaged TM-score and RMSD of top ranked models generated by the four methods tested. AIDA achieved relatively better performance in domain assembly compared to the other methods. The main difference between AIDA and our approach is that AIDA uses an all-atom representation of the protein structure, whereas SAXSDom uses a united-residue representation. The results also show that SAXSDom outperforms both SAXSDom-abinitio and Modeller in terms of all metrics with statistical significance shown by the one-sample paired t-test. Figure 4 shows the performance of SAXSDom with SAXSDom-abinitio, AIDA and Modeller evaluated on the best of all 50 assembled models based on the RMSD, TM-score, and SAXS χ-scores. According to the evaluation, as shown in Figure 4(A), the method SAXSDom outperforms the SAXSDom-abinitio in 50 out of 73 proteins in terms of RMSD and 45 out of 73 proteins in terms of TM-score. Figure 4(B) compares the performance of SAXSDom and AIDA. AIDA was able to assemble domains with slightly better qualities according to RMSD, while SAXSDom can generate assembled models that were better matched to the SAXS profile. Figure 4(C) shows that SAXSDom can generate significantly better models with lower SAXS χ-scores compared to that of Modeller. The results of the method comparison evaluated on the top one and best five assembled models are also shown in Figure S4 and S5.

Table 2.

Summary of the domain assembly performance of four domain assembly methods on the 73 proteins in the AIDA dataset. The top 1 model and top 5 models are determined based on Qprob ranking.

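| Method | Top 1 model: TM-score | Top 1 model: RMSD | Best in top 5 models: TM-score | Best in top 5 models: RMSD | Best in all 50 models: TM-score | Best in all 50 models: RMSD | P-value: TM-score | P-value: RMSD |
|---|---|---|---|---|---|---|---|---|
| AIDA | 0.716 | 9.135 | 0.767 | 6.444 | 0.810 | 4.438 | 1.00E+00 | 0.9999 |
| Modeller | 0.620 | 16.207 | 0.622 | 15.349 | 0.621 | 14.953 | 2.20E-16 | 2.20E-16 |
| SAXSDom-abinitio | 0.705 | 9.005 | 0.724 | 6.917 | 0.742 | 5.811 | 5.60E-08 | 1.98E-08 |
| SAXSDom | 0.722 | 7.658 | 0.750 | 5.987 | 0.767 | 5.012 | - | - |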

Figure 4.


Comparison of SAXSDom with SAXSDom-abinitio, AIDA and Modeller on the best of 50 assembled models. (A) SAXSDom versus SAXSDom-abinitio (Left plot: TM-scores of SAXSDom models versus TM-scores of SAXSDom-abinitio models; Middle plot: RMSD of the models of the two methods; Right plot: Distribution of χ-scores of all assembled models for 46 proteins by the two methods). (B) SAXSDom versus AIDA. (C) SAXSDom versus Modeller.

In addition to the global statistical performance analysis provided so far, we present the results for four representative targets as three-dimensional structures (Figure 5). The crystal structure of the signal recognition particle receptor from E. coli (PDB code 1FTS) consists of an α-helical domain (residues 1–82) connected to an αβα domain (residues 92–295) by a 9-residue linker (Figure 5(A)). SAXSDom successfully placed the domains into the correct orientation using SAXS information, although the linker conformation is not correct. The assembled structure agrees well with the envelope of the protein structure even though the variation of the linker region is relatively large. The shape envelopes were reconstructed from SAXS data using the DAMMIN program in the ATSAS package.42,48 The agreement of the SAXSDom model with the SAXS data is characterized by χ = 2.8 (Figure 6(A)). Figure 6(A) and Figure 6(B) show that the SAXSDom model has better agreement with the SAXS data than the models from the other methods, both for P(r) and the scattering curve. The residue-by-residue distance errors between the experimental structure and the models show that the accuracy of domain assembly is improved by incorporating SAXS energies in SAXSDom compared to the ab initio method SAXSDom-abinitio (Figure 6(C)).

Figure 5.


The predicted assembly models and shape envelopes of four two-domain proteins. The predicted model (colored) and the native structure (green) are superimposed. The domain linker (yellow) and domains (purple, red) are highlighted in the predicted model. (A) The signal recognition particle receptor from E. coli (chain A of 1FTS), linker length = 9, RMSD = 2.8, TM-score = 0.88, χ-score = 2.8. (B) The rRNA methyltransferase ErmC’ (chain A of 1QAM), linker length = 4, RMSD = 2.9, TM-score = 0.81, χ-score = 1.6. (C) Protein of unknown function from Bacteroides ovatus (chain A of 3P02), linker length = 4, RMSD = 3.4, TM-score = 0.81, χ-score = 1.7. (D) Myo-inositol monophosphatase (chain A of 2BJI), linker length = 7, RMSD = 2.7, TM-score = 0.86, χ-score = 0.70. The shape envelopes were reconstructed from SAXS data using the DAMMIN program in the ATSAS package.

Figure 6.

Comparison of predicted models for 1FTS by SAXSDom, SAXSDom-abinitio, AIDA and Modeller. (A) The SAXS profiles calculated from the models and the experimental structure. The SAXS curves in the q = 0–0.15 region are also shown. (B) Pair distance distribution functions (P(r)) calculated from the models and the experimental structure. (C) Residue-by-residue distance error between the predicted models and the experimental structure.
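As a rough illustration of the quantity plotted in panel (B), the sketch below builds a normalized histogram of pairwise distances from a set of coordinates. It assumes one point per residue (for example Cα atoms) with equal weights, whereas a real P(r) weights atom pairs by their scattering contributions. The function name `pair_distance_distribution` and the 1 Å default bin width are illustrative choices, not taken from the paper.

```python
import math


def pair_distance_distribution(coords, bin_width=1.0):
    """Return a normalized histogram of all pairwise distances.

    coords is a sequence of (x, y, z) tuples; bin k covers distances in
    [k * bin_width, (k + 1) * bin_width). The returned fractions sum to 1.
    """
    dists = [math.dist(a, b)
             for i, a in enumerate(coords)
             for b in coords[i + 1:]]
    if not dists:
        raise ValueError("need at least two points")
    n_bins = int(max(dists) // bin_width) + 1
    counts = [0] * n_bins
    for d in dists:
        counts[int(d // bin_width)] += 1
    total = len(dists)
    return [c / total for c in counts]
```

Histograms computed this way for a model and for the experimental structure can be compared bin by bin, provided both use the same bin width and are padded to the same length.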

Figure 5(B) shows the predicted domain assembly for the ErmC’ rRNA methyltransferase (PDB entry 1QAM). The structure consists of two domains, an N-terminal αβα domain (residues 1–171) and a C-terminal α domain (residues 176–235). The predicted assembly model has RMSD = 3.0 and TM-score = 0.81 to the experimental structure, and a χ-score of 1.6 to the SAXS profile. The domain linker contains 4 residues and is folded into a shape similar to that in the native structure.

Domain assembly for a protein of unknown function (PDB code 3P02) also achieved good performance, with two β-domains combined into a native-like orientation (RMSD = 3.4, TM-score = 0.81 and χ-score = 1.7, Figure 5(C)). In this case, the structure has a rather short linker of only four residues, which restricts the conformational space that needs to be sampled.

Finally, Figure 5(D) presents the predicted assembly for a myo-inositol monophosphatase (2BJI). The fold consists of a penta-layered αβαβα sandwich, and the linker connects the last strand of the first β-sheet to the first strand of the second β-sheet. SAXSDom successfully generated a native-like model with RMSD = 2.7, TM-score = 0.86 and χ-score = 0.70. The comparisons of domain assembly methods for these targets are also summarized in Figures S6, S7, and S8.
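The RMSD, TM-score and residue-by-residue distance errors reported for these targets can be illustrated with the sketch below. It assumes the model and the native structure are already superimposed and have one coordinate per aligned residue, so it evaluates the scores for that fixed superposition only; the published TM-score additionally searches for the superposition that maximizes the score. The formula d0 = 1.24·(L − 15)^(1/3) − 1.8 is the standard TM-score length normalization, while the 0.5 Å floor on d0 and the function name `rmsd_and_tm` are assumptions made for this example.

```python
import math


def rmsd_and_tm(model, native):
    """Return (rmsd, tm_score, per_residue_distances).

    model and native are equal-length sequences of (x, y, z) tuples that
    are assumed to be superimposed already; no fitting is performed here.
    """
    if len(model) != len(native) or not model:
        raise ValueError("structures must be non-empty and equally long")
    dists = [math.dist(a, b) for a, b in zip(model, native)]
    length = len(dists)
    rmsd = math.sqrt(sum(d * d for d in dists) / length)
    # TM-score length normalization; the 0.5 floor (an assumption of this
    # sketch) keeps d0 positive for short chains.
    if length > 15:
        d0 = max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    else:
        d0 = 0.5
    tm = sum(1.0 / (1.0 + (d / d0) ** 2) for d in dists) / length
    return rmsd, tm, dists
```

The per-residue distance list is the kind of quantity plotted against residue index in Figure 6(C), where large values in the linker or in one domain indicate a misplaced region.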

3.4. Performance of SAXSDom using experimental SAXS data

To further examine the performance of SAXSDom on domain assembly using real SAXS profiles, we applied our method to two bi-domain, monomeric proteins for which experimental SAXS data are available. The experimental SAXS profile of the protein RcPutA was used to validate the tertiary structural interaction between two domains (residues 1–972 and 994–1127) in Luo, et al.32. The homology model of RcPutA, generated using the crystal structure of a close homolog (PDB code 5KF6 49) as the structural template, agrees very well with the experimental SAXS data (χ-score = 2.55, Fig. 7A) and was therefore used as the reference structure to validate the performance of domain assembly. In this case study, the results showed some dependence on the length of the linker, and therefore we systematically varied the linker length to explore the robustness of our method. The performance of domain assembly on RcPutA is summarized in Table S5. The RMSDs of the assembled models span the range of 2.9–5.6 Å, with χ-scores ranging from 2.12 to 5.06, for linker lengths of 6–21 residues (Table S5). Regardless of linker length, SAXSDom correctly captured the essential tertiary structural interactions between the two domains. In particular, all the models show the β-hairpin of domain 2 near the center of domain 1 (e.g. Fig. S9A, S9B). However, the details of the inter-domain interface were more accurately described when shorter linkers were used (6–7 residues, Fig. S9A). We also evaluated domain assembly performance on bovine serum albumin (SASDBJ3), and the results are provided in Table S6. The top 1 model generated by SAXSDom shows good agreement with the crystal structure (RMSD = 2.5 Å and χ-score = 1.0). The final predicted structures for the two proteins are visualized in Figure 7 and Figure S9.

Figure 7.

Performance of SAXSDom on two bi-domain proteins using real SAXS data. (A) Final prediction for RcPutA with domains consisting of residues 1–972 and residues 994–1127. The conformation of the 21-residue linker region is sampled to assemble the two domains. The reference structure is colored gray, and the SAXSDom model is colored red (domain 1) and purple (domain 2). The scatter plot shows the RMSD from the reference structure and the SAXS χ-score for 50 decoys generated by SAXSDom; the top 1 ranked model is highlighted in red. On the right, the theoretical SAXS profiles calculated from the reference structure (blue) and the predicted structure (red) are compared to the experimental data (black circles). (B) Two-domain assembly for target SASDBJ3 using real SAXS data. The structures of the domain regions (1–292, 303–583) were predicted by the MULTICOM protein structure prediction system. The reference structure is colored gray, and the SAXSDom model is colored red (domain 1) and purple (domain 2). The scatter plot shows the RMSD from the reference structure and the SAXS χ-score for 50 decoys generated by SAXSDom; the top 1 ranked model is highlighted in red. On the right, the theoretical SAXS profiles calculated from the reference structure (blue) and the predicted structure (red) are compared to the experimental data (black circles).

4. Conclusion and Future work

In this work, we developed a data-assisted domain assembly method, SAXSDom, by integrating a probabilistic approach for backbone conformation sampling with SAXS-assisted restraints in domain assembly. We evaluated several SAXS-related score functions for structure modeling, including the fitness of SAXS intensities, the divergence of the pair-atom distance distribution, the agreement of the radius of gyration, and the traditional χ-score. Our results show that incorporating restraints from SAXS data into a de novo conformational sampling method can improve protein domain assembly. SAXSDom generated more accurate domain assemblies for 40 of 46 CASP multidomain proteins in terms of RMSD and TM-score when compared to modeling without SAXS information. On the AIDA dataset, SAXSDom also achieved better accuracy for 50 out of 73 multidomain proteins according to the RMSD metric and for 45 out of 73 targets in terms of TM-score. Despite this success in improving protein domain assembly using SAXS data, our method can still be improved in several ways: (1) adopting new physical energies derived from full-atom structures, such as van der Waals hard sphere repulsion, residue environment, residue pair, and radius of gyration terms as introduced in Rosetta14; (2) extending continuous domain assembly to discontinuous domain assembly for proteins with inserted domains; and (3) designing more advanced SAXS scoring functions to guide domain assembly.
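Two of the score ingredients named above, the radius of gyration and a divergence between pair distance distributions, can be sketched as follows. This is only an illustration under stated assumptions: the radius of gyration is computed from equally weighted points, and the divergence is taken to be the Kullback-Leibler divergence between two histograms on identical bins, with a small pseudocount to avoid division by zero. The exact functional forms used by SAXSDom are not given in this section, and the names `radius_of_gyration` and `kl_divergence` are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of equally weighted points from their centroid."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one point")
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
             for x, y, z in coords)
    return math.sqrt(sq / n)


def kl_divergence(p, q, pseudo=1e-9):
    """KL(p || q) for two histograms defined on the same bins.

    A pseudocount is added to every bin and both histograms are
    renormalized, so empty bins do not produce log(0) or division by zero.
    """
    if len(p) != len(q) or not p:
        raise ValueError("histograms must be non-empty and equally long")
    ps = [x + pseudo for x in p]
    qs = [x + pseudo for x in q]
    sp, sq = sum(ps), sum(qs)
    return sum((a / sp) * math.log((a / sp) / (b / sq))
               for a, b in zip(ps, qs))
```

A model whose radius of gyration and distance histogram match the values derived from the experimental data would give a zero Rg difference and a divergence close to zero; larger values indicate a poorer match.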

Supplementary Material

1

5. Acknowledgements

Research reported in this publication was supported by the NIGMS of the National Institutes of Health (NIH) under award number R01GM093123 and two National Science Foundation (NSF) grants (DBI 1759934 and IIS1763246).

6. References

  • 1. Vogel C; Bashton M; Kerrison ND; Chothia C; Teichmann SA Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 2004;14(2):208–16.
  • 2. Wheelan SJ; Marchler-Bauer A; Bryant SH Domain size distributions can predict domain boundaries. Bioinformatics 2000;16(7):613–8.
  • 3. Korasick DA; Jez JM, Protein Domains: Structure, Function, and Methods In Encyclopedia of Cell Biology, Bradshaw RA; Stahl PD, Eds. Academic Press: Waltham, 2016; pp 91–97.
  • 4. Brocchieri L; Karlin S Protein length in eukaryotic and prokaryotic proteomes. Nucleic acids research 2005;33(10):3390–3400.
  • 5. Krieger E; Nabuurs SB; Vriend G Homology modeling. Methods of biochemical analysis 2003;44:509–524.
  • 6. Li J; Adhikari B; Cheng J An improved integration of template-based and template-free protein structure modeling methods and its assessment in CASP11. Protein and peptide letters 2015;22(7):586–593.
  • 7. Kim DE; Chivian D; Baker D Protein structure prediction and analysis using the Robetta server. Nucleic acids research 2004;32(suppl_2):W526–W531.
  • 8. Bhattacharya D; Cao R; Cheng J UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics 2016;32(18):2791–2799.
  • 9. Hou J; Wu T; Cao R; Cheng J Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins: Structure, Function, and Bioinformatics 2019.
  • 10. Lise S; Walker-Taylor A; Jones DT Docking protein domains in contact space. BMC Bioinformatics 2006;7:310.
  • 11. Inbar Y; Benyamini H; Nussinov R; Wolfson HJ Combinatorial docking approach for structure prediction of large proteins and multi-molecular assemblies. Phys Biol 2005;2(4):S156–65.
  • 12. Cheng TM; Blundell TL; Fernandez-Recio J Structural assembly of two-domain proteins by rigid-body docking. BMC Bioinformatics 2008;9:441.
  • 13. Eswar N; Webb B; Marti-Renom MA; Madhusudhan M; Eramian D; Shen M. y.; Pieper U; Sali A Comparative protein structure modeling using Modeller. Current protocols in bioinformatics 2006;15(1):5.6.1–5.6.30.
  • 14. Rohl CA; Strauss CE; Misura KM; Baker D Protein structure prediction using Rosetta. Methods in enzymology 2004;383:66–93.
  • 15. Xu D; Jaroszewski L; Li Z; Godzik A AIDA: ab initio domain assembly for automated multi-domain protein structure prediction and domain-domain interaction prediction. Bioinformatics 2015;31(13):2098–2105.
  • 16. Belsom A; Schneider M; Brock O; Rappsilber J Blind evaluation of hybrid protein structure analysis methods based on cross-linking. Trends in biochemical sciences 2016;41(7):564–567.
  • 17. Ogorzalek TL; Hura GL; Belsom A; Burnett KH; Kryshtafovych A; Tainer JA; Rappsilber J; Tsutakawa SE; Fidelis K Small angle X-ray scattering and cross-linking for data assisted protein structure prediction in CASP 12 with prospects for improved accuracy. Proteins: Structure, Function, and Bioinformatics 2018;86:202–214.
  • 18. Moult J; Fidelis K; Kryshtafovych A; Schwede T; Tramontano A Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins: Structure, Function, and Bioinformatics 2018;86:7–15.
  • 19. Dyer KN; Hammel M; Rambo RP; Tsutakawa SE; Rodic I; Classen S; Tainer JA; Hura GL High-throughput SAXS for the characterization of biomolecules in solution: a practical approach. Methods Mol Biol 2014;1091:245–58.
  • 20. Hura GL; Menon AL; Hammel M; Rambo RP; Poole FL 2nd; Tsutakawa SE; Jenney FE Jr.; Classen S; Frankel KA; Hopkins RC; Yang SJ; Scott JW; Dillard BD; Adams MW; Tainer JA Robust, high-throughput solution structural analyses by small angle X-ray scattering (SAXS). Nat. Methods 2009;6(8):606–612.
  • 21. Graewert MA; Svergun DI Impact and progress in small and wide angle X-ray scattering (SAXS and WAXS). Curr Opin Struct Biol 2013;23(5):748–54.
  • 22. Tuukkanen AT; Spilotros A; Svergun DI Progress in small-angle scattering from biological solutions at high-brilliance synchrotrons. IUCrJ 2017;4(Pt 5):518–528.
  • 23. Korasick DA; Tanner JJ Determination of protein oligomeric structure from small-angle X-ray scattering. Protein Sci 2018;27(4):814–824.
  • 24. Dos Reis MA; Aparicio R; Zhang Y Improving protein template recognition by using small-angle x-ray scattering profiles. Biophysical journal 2011;101(11):2770–2781.
  • 25. Joo K; Heo S; Joung I; Hong SH; Lee SJ; Lee J Data-assisted protein structure modeling by global optimization in CASP12. Proteins: Structure, Function, and Bioinformatics 2018;86:240–246.
  • 26. Ogorzalek TL; Hura GL; Kryshtafovych A; Tainer JA; Fidelis K; Tsutakawa SE Small Angle X-ray Scattering for Data-Assisted Structure Prediction in CASP12 with Prospects to Improve Accuracy. Biophysical Journal 2018;114(3):576a–577a.
  • 27. Jiménez-García B; Pons C; Svergun DI; Bernadó P; Fernández-Recio J pyDockSAXS: protein-protein complex structure by SAXS and computational docking. Nucleic acids research 2015;43(W1):W356–W361.
  • 28. Zhang Y; Skolnick J Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics 2004;57(4):702–710.
  • 29. Kryshtafovych A; Monastyrskyy B; Fidelis K CASP 11 statistics and the prediction center evaluation system. Proteins: Structure, Function, and Bioinformatics 2016;84:15–19.
  • 30. Tress ML; Ezkurdia I; Richardson JS Target domain definition and classification in CASP8. Proteins: Structure, Function, and Bioinformatics 2009;77(S9):10–17.
  • 31. Xu Y; Xu D; Gabow HN Protein domain decomposition using a graph-theoretic approach. Bioinformatics 2000;16(12):1091–1104.
  • 32. Luo M; Christgen S; Sanyal N; Arentson BW; Becker DF; Tanner JJ Evidence that the C-terminal domain of a type B PutA protein contributes to aldehyde dehydrogenase activity and substrate channeling. Biochemistry 2014;53(35):5661–5673.
  • 33. Valentini E; Kikhney AG; Previtali G; Jeffries CM; Svergun DI SASBDB, a repository for biological small-angle scattering data. Nucleic acids research 2014;43(D1):D357–D363.
  • 34. Jeffries CM; Graewert MA; Blanchet CE; Langley DB; Whitten AE; Svergun DI Preparing monodisperse macromolecular samples for successful biological small-angle X-ray and neutron-scattering experiments. Nature protocols 2016;11(11):2122.
  • 35. Liwo A; Ołdziej S; Pincus MR; Wawak RJ; Rackovsky S; Scheraga HA A united-residue force field for off-lattice protein-structure simulations. I. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. Journal of computational chemistry 1997;18(7):849–873.
  • 36. Magnan CN; Baldi P SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 2014;30(18):2592–2597.
  • 37. Dill KA Dominant forces in protein folding. Biochemistry 1990;29(31):7133–7155.
  • 38. George RA; Heringa J An analysis of protein domain linkers: their classification and role in protein folding. Protein Engineering, Design and Selection 2002;15(11):871–879.
  • 39. Schneidman-Duhovny D; Hammel M; Sali A FoXS: a web server for rapid computation and fitting of SAXS profiles. Nucleic acids research 2010;38(suppl_2):W540–W544.
  • 40. Schneidman-Duhovny D; Hammel M; Tainer JA; Sali A Accurate SAXS profile computation and its assessment by contrast variation experiments. Biophysical journal 2013;105(4):962–974.
  • 41. Liu H; Zwart PH Determining pair distance distribution function from SAXS data using parametric functionals. Journal of structural biology 2012;180(1):226–234.
  • 42. Franke D; Petoukhov M; Konarev P; Panjkovich A; Tuukkanen A; Mertens H; Kikhney A; Hajizadeh N; Franklin J; Jeffries C ATSAS 2.8: a comprehensive data analysis suite for small-angle scattering from macromolecular solutions. Journal of applied crystallography 2017;50(4):1212–1225.
  • 43. Russel D; Lasker K; Webb B; Velazquez-Muriel J; Tjioe E; Schneidman-Duhovny D; Peterson B; Sali A Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS biology 2012;10(1):e1001244.
  • 44. Rotkiewicz P; Skolnick J Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of computational chemistry 2008;29(9):1460–1465.
  • 45. Svergun D Determination of the regularization parameter in indirect-transform methods using perceptual criteria. Journal of applied crystallography 1992;25(4):495–503.
  • 46. Cheng J; Randall AZ; Sweredoski MJ; Baldi P SCRATCH: a protein structure and structural feature prediction server. Nucleic acids research 2005;33(suppl 2):W72–W76.
  • 47. Cao R; Cheng J Protein single-model quality assessment by feature-based probability density functions. Scientific reports 2016;6:23990.
  • 48. Svergun DI Restoring low resolution structure of biological macromolecules from solution scattering using simulated annealing. Biophysical journal 1999;76(6):2879–2886.
  • 49. Luo M; Gamage TT; Arentson BW; Schlasner KN; Becker DF; Tanner JJ Structures of proline utilization A (PutA) reveal the fold and functions of the aldehyde dehydrogenase superfamily domain of unknown function. Journal of Biological Chemistry 2016;291(46):24065–24075.
