Bioinformatics. 2017 Jun 26;33(21):3405–3414. doi: 10.1093/bioinformatics/btx416

Forecasting residue–residue contact prediction accuracy

P P Wozniak 1,2, B M Konopka 1,2, J Xu 2, G Vriend 3, M Kotulska 1
Editor: Alfonso Valencia
PMCID: PMC5860164  PMID: 29036497

Abstract

Motivation

Apart from meta-predictors, most of today's methods for residue–residue contact prediction are based entirely on Direct Coupling Analysis (DCA) of correlated mutations in multiple sequence alignments (MSAs). These methods are on average ∼40% correct for the 100 strongest predicted contacts in each protein. The end-user who works on a single protein of interest will not know whether the predictions for that protein are much more or much less correct than 40%, which is especially a problem if contacts are predicted to steer experimental research on that protein.

Results

We designed a regression model that forecasts the accuracy of residue–residue contact prediction for individual proteins with an average error of 7 percentage points. Contacts were predicted with two DCA methods (gplmDCA and PSICOV). The models were built on parameters that describe the MSA, the predicted secondary structure, the predicted solvent accessibility and the contact prediction scores for the target protein. Results show that our models can also be applied to meta-methods, which was tested on RaptorX.

Availability and implementation

All data and scripts are available from http://comprec-lin.iiar.pwr.edu.pl/dcaQ/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Prediction of residue–residue contacts from sequence information alone has been a challenge in protein structure bioinformatics for a long time (Göbel et al., 1994). Most often such contacts are predicted to support modelling of the three-dimensional structure of a protein (Bohr et al., 1993; Konopka et al., 2014; Saitoh et al., 1993; Skolnick et al., 1997; Vendruscolo et al., 1997), but other fields like protein interface prediction (Du et al., 2016; González et al., 2013; Horn et al., 1998), protein structure alignment (Terashi and Takeda-Shitaka, 2015), or even experimental protein structure determination and validation (Olmea et al., 1999; Wozniak et al., 2017) can benefit from contact prediction too. The importance of residue–residue contact prediction is further underlined by the fact that it has been one of the challenges in the biennial CASP (Critical Assessment of techniques for protein Structure Prediction) (Monastyrskyy et al., 2014) experiment since 1996 (Lesk, 1997). Typically, a contact is defined as two amino acids that have either their Cα or Cβ atoms within a specific distance, usually between eight and twelve ångström. These definitions gave the best results in protein structure reconstruction from contact map predictions (Duarte et al., 2010).

Today’s contact prediction methods all start with the analysis of correlated mutations in a multiple sequence alignment (MSA). An MSA is a set of homologous sequences in which evolutionarily related residues are lined up at the same position. The main assumption of methods based on correlated mutations is that if two sequence positions together are important for protein structure and the residue at one of the two positions mutates, then the one at the other position should mutate as well, to maintain proper interactions (Morcos et al., 2011). Unfortunately, two residue positions that together are important for the protein function will show such behaviour as well (Oliveira et al., 2003), and so will residues that do not interact but both contact the same third residue (transitivity of correlations (Marks et al., 2011)). Contact prediction methods need to cope with such problems. Many of the methods that have been proposed over the years use machine learning techniques such as support vector machines (Cheng and Baldi, 2007; González et al., 2013), neural networks (Di Lena et al., 2012; Ding et al., 2013; Kukic et al., 2014; Tegge et al., 2009; Xue et al., 2009), hidden Markov models (Bjorkholm et al., 2009; Bystroff et al., 2000; Pollastri and Baldi, 2002), or random forests (Li et al., 2011; Wang, 2011; Wang and Xu, 2013). Today's methods with the highest residue–residue contact prediction accuracy, though, are based on Direct Coupling Analysis (DCA), which infers interactions from observations of correlated mutations using so-called model learning and inference (Kappen and Rodríguez, 1998; Wainwright and Jordan, 2008). Most of the DCA-based methods deduce direct interactions between sequence positions using a Potts model (Cocco et al., 2013; Ekeberg et al., 2013; Feinauer et al., 2014; Morcos et al., 2011; Skwark et al., 2013; Zhang et al., 2016), but other statistical approaches proved their usefulness too, e.g. the sparse inverse covariance estimation in PSICOV (Jones et al., 2012).

Neither machine learning nor DCA-based methods work well enough yet for reliable ab initio protein structure prediction, although the performance is occasionally impressive, e.g. in transmembrane protein structure prediction (Hopf et al., 2012; Wang and Barth, 2015), RNA secondary structure prediction (De Leonardis et al., 2015), or protein–protein interaction prediction (Guo et al., 2015; Horn et al., 1998; Iserte et al., 2015). Excluding meta-servers, the accuracy of today's best methods is on the order of 40% for the 100 contacts that are predicted with the greatest confidence (Feinauer et al., 2014). Nevertheless, a recent study indicated great potential for modelling completely unknown protein folds only from contacts obtained from correlated mutation analyses (Ovchinnikov et al., 2017). Several studies have shown that even scarce correct contact information obtained from a known structure can be enough to reconstruct that structure with good quality (Duarte et al., 2010; Dyrka et al., 2016; Konopka et al., 2014; Sathyapriya et al., 2009). Obviously, an unofficial competition is ongoing to increase the prediction accuracy, but results presented in new publications often differ only marginally from previous ones (Cocco et al., 2013; Ekeberg et al., 2013; Feinauer et al., 2014; Morcos et al., 2011; Skwark et al., 2013; Zhang et al., 2016).

A major problem for the end-user of DCA-based methods is that the reported percentage of correctly predicted contacts is an average over a large test set. This implies that the number of correctly predicted contacts can be much higher but also much lower than 40%. Especially the latter can be a problem (if unnoticed) for scientists who use contact predictions to steer their research on one individual protein. We therefore designed a method to forecast the accuracy of residue–residue contact predictions for individual proteins.

We analyzed a large number of parameters that describe the MSA, the predicted secondary structure, the predicted solvent accessibility, and the contact prediction scores for the target protein. We built regression models to test how well (combinations of) these parameters forecast the accuracy of two well-performing, readily accessible, representative DCA-based methods: gplmDCA (Feinauer et al., 2014) and PSICOV (Jones et al., 2012). gplmDCA is representative of a large set of methods based on the Potts model, while PSICOV uses sparse inverse covariance estimation to deduce direct information from correlated mutations. When only information independent of the DCA-based method is used, the contact prediction accuracy can be forecasted with an average error of about 11 percentage points. When parameters that describe the contact prediction output are included, this becomes about 7 percentage points.

A forecast of the prediction accuracy will be helpful for research in all fields where DCA-based contact prediction methods play a role, including research on the contact prediction methods themselves.

2 Materials and methods

We used regression models on a large number of parameters that describe the MSA, the predicted secondary structure, the predicted solvent accessibility, and the contact prediction scores for the target protein. We analyzed which parameters should be combined to obtain the best model to forecast the quality of contact prediction. The regression models were trained and tested on multiple sets of proteins with known structures.

The 24 parameters that contributed most to the performance of the regression models are described in this Methods section. Other parameters that were tested, but for various reasons were rejected, are listed in the Supplementary Materials (SM) Methods.

2.1 Training and test data

The training set consisted of 1128 domains selected from SCOP (Murzin et al., 1995). These domains were superfamily representatives that had no alternate side chain conformations, no missing residues inside the chain, and only contained the 20 canonical amino acid types. The domains in the training set were structurally diverse with an average pair-wise TM-score (Zhang and Skolnick, 2004; Zhang and Skolnick, 2005) of 0.34 ± 0.08.

The test set was a subset of the 804 protein chains used by the gplmDCA authors to test their method (Feinauer et al., 2014). The list is available online at http://gplmdca.aurell.org/input-data. All PDB chains that were structurally similar to any of the domains in the training set (TM-score 0.7 or higher) were removed from this list. The final test set contained 446 PDB chains. The average pair-wise TM-score for this final test set was 0.32 ± 0.06. The average pair-wise TM-score between the training set and the test set was 0.33 ± 0.06.

2.2 Contact definition

Two residues are in contact when their Cβ–Cβ distance is less than 8.0 Å. Pairs that are separated in the sequence by fewer than four positions, however, are excluded from the calculations.
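As a minimal sketch (not the authors' code), this contact definition can be written as follows; `contact_map` is a hypothetical helper name:

```python
from math import dist

def contact_map(cb_coords, cutoff=8.0, min_sep=4):
    """Residue pairs (i, j), i < j, whose Cβ–Cβ distance is below `cutoff` Å
    and whose sequence separation is at least `min_sep` positions."""
    n = len(cb_coords)
    contacts = set()
    for i in range(n):
        # pairs separated by fewer than min_sep positions are rejected
        for j in range(i + min_sep, n):
            if dist(cb_coords[i], cb_coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts
```

The same routine applies to both the DSSP-derived training structures and the evaluation of predicted pairs.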

2.3 Secondary structure and solvent accessibility

Secondary structure and solvent accessibility were calculated for each domain in the training set with DSSP-2.2.1 (Kabsch and Sander, 1983; Touw et al., 2015). Secondary structure and solvent accessibility for each protein chain in the test set were predicted with SSpro and ACCpro20, respectively, which are both part of the SCRATCH 5.2 software package (Magnan and Baldi, 2014). We used the get_abinitio_predictions script that is provided with this software. This script ensures that the software does not make use of homologous structures.

ACCpro20 predicts relative solvent accessibilities. Relative solvent accessibilities were therefore also used for the training set instead of the absolute solvent accessibilities as provided by DSSP. The relative solvent accessibility is defined as the observed solvent accessible surface area divided by the maximally possible solvent accessible surface area and is given as a percentage. The maximal solvent accessible surface area for each residue type was set at the largest value observed (for each residue type) in the DSSP files of the training set (see Supplementary Table S26 in SMResults). Residues are considered buried when their relative solvent accessibility is less than 20% (Chen and Zhou, 2005); otherwise they are called exposed.
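A sketch of this classification (the maxima below are illustrative placeholders, not the per-residue values from Supplementary Table S26):

```python
# Illustrative maxima only; the paper takes, per residue type, the largest
# accessible surface area observed in the DSSP files of the training set.
MAX_ACC = {"ALA": 110.0, "GLY": 85.0, "TRP": 255.0}

def relative_accessibility(res_type, observed_acc):
    """Observed accessible surface area over the maximum, in percent."""
    return 100.0 * observed_acc / MAX_ACC[res_type]

def is_buried(res_type, observed_acc, threshold=20.0):
    """Buried when relative accessibility < 20% (Chen and Zhou, 2005)."""
    return relative_accessibility(res_type, observed_acc) < threshold
```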

2.4 MSA

MSAs for the test set have been made available online by the authors of gplmDCA (Feinauer et al., 2014) (http://gplmdca.aurell.org/input-data). MSAs for the training set were calculated following the same procedure as for the test set (Feinauer et al., 2014), except that HHblits (Remmert et al., 2011) from HHsuite 2.0.16 used a slightly older version of UniProt. Five search iterations with an E-value cut-off of 1.0 were performed. MSA filtering was switched off during the search using the -all parameter. The HHblits commands used are given in the SMMethods.

2.5 Correlation scores

Correlation scores (also called co-evolutionary scores), as given to residue pairs by gplmDCA or PSICOV, reflect the chance that a pair of residues is in contact (Feinauer et al., 2014; Jones et al., 2012; Marks et al., 2011; Morcos et al., 2011).

Correlation scores calculated with gplmDCA for the training set were obtained by running gplmDCA with default parameters (lambda_h = 0.01, lambda_J = 0.01, lambda_chi = 0.001, reweighting_threshold = 0.1 and M = -1). The scores calculated with gplmDCA for the test set are available online (http://gplmdca.aurell.org/input-data).

Correlation scores calculated with PSICOV for the training and test sets were obtained using default parameters with a target precision matrix density of 3% (-d) and output estimated rather than raw scores (-p). The minimum sequence separation was set to 1 (-j), although pairs with a sequence separation of less than 4 were not treated as contacts. PSICOV normally requires a minimal level of sequence diversity in the MSA to calculate correlation scores. We forced PSICOV calculations to also be performed on insufficiently diverse MSAs by setting the MINEFSEQS parameter to override (-o).

2.6 Contact prediction procedure

For each domain and chain, predicted contact pairs were sorted in descending order of correlation score. We call the top N pairs the strongest predicted contacts. We tried N = 20, 50, 100, 200, L/2, L and 3L/2, in which L is the length of the target sequence.
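The selection of the strongest predicted contacts can be sketched as below (assuming (i, j, score) triples as input; `top_n_pairs` is a hypothetical helper name). The count of skipped pairs corresponds to the sepReduction parameter described in Section 2.9:

```python
def top_n_pairs(scored_pairs, n, min_sep=4):
    """Sort (i, j, score) triples by descending correlation score and keep
    the top `n` that meet the sequence-separation threshold.  Also returns
    how many pairs were skipped on the way down the list (sepReduction)."""
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    kept, skipped = [], 0
    for i, j, score in ranked:
        if len(kept) == n:
            break
        if abs(i - j) < min_sep:
            skipped += 1
        else:
            kept.append((i, j, score))
    return kept, skipped
```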

2.7 Contact prediction evaluation

To assess the performance of the contact prediction method the positive predictive value (PPV) was used:

PPV = TP / (TP + FP) · 100% (1)

in which:

TP—number of properly predicted contacts (true positives),

FP—number of non-contacts predicted as contacts (false positives).

This PPV will be referred to as the true PPV, while the PPV that was forecasted will just be called PPV or forecasted PPV.
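Eq. 1 is straightforward to compute; a sketch (with `ppv` as a hypothetical helper name):

```python
def ppv(predicted_pairs, true_contacts):
    """Eq. 1: positive predictive value, TP / (TP + FP), as a percentage.
    Every predicted pair is either a TP (in the true contact set) or a FP,
    so the denominator equals the number of predicted pairs."""
    tp = sum(1 for pair in predicted_pairs if pair in true_contacts)
    return 100.0 * tp / len(predicted_pairs)
```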

2.8 Regression models

The regression models built to forecast the PPV of gplmDCA and PSICOV were: Lasso (LASSO), regression trees (TREE), Random Forest (FOREST), Support Vector Machine (SVM) and linear regression models in two setups: with only single parameters (primary terms) as input (LM1) and with all possible products of two parameters included as input too (LM2). Combinations of the most successful models that were also tested were: (i) average of Lasso and Random Forest (avLF), (ii) average of SVM and Random Forest (avSF) and (iii) average of Lasso, SVM and Random Forest (avLSF). Detailed information on the training and testing of regression models is available in the SMMethods.
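A sketch of how the combined models (avLF, avSF, avLSF) merge forecasts, assuming a simple unweighted mean of the component models' per-protein forecasts (`average_forecast` is a hypothetical helper name):

```python
def average_forecast(*model_forecasts):
    """Combine per-protein PPV forecasts from several regression models by
    an unweighted mean.  Each argument is a list of forecasted PPVs, one
    value per protein; e.g. avLSF averages LASSO, SVM and FOREST."""
    return [sum(vals) / len(vals) for vals in zip(*model_forecasts)]
```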

2.9 Input parameters

The parameters that were used as input for the regression models can be grouped into four sets:

  A. MSA;

  B. secondary structure;

  C. solvent accessibility;

  D. correlation score.

The 24 parameters that were used to train regression models are listed below. Descriptions of rejected parameters are available in the SMMethods. A. MSA parameters:

  1. seqLen—target sequence length;

  2. MSAseq—number of sequences in the MSA;

  3. gapArea—number of gaps in the MSA defined as:
    gapArea = gapNmb / (L · seqNmb) (2)

in which:

gapNmb—number of gaps in the MSA,

seqNmb—number of sequences in the MSA,

L—target sequence length;

  4. aveBlock—average gap length in the MSA defined as:
    aveBlock = (σblocks · 100 / L) / seqNmb (3)

in which:

σblocks—average gap length in the MSA,

seqNmb—number of sequences in the MSA,

L—target sequence length;

  5. identScoreAve—average sequence identity between the target sequence and all sequences in the MSA (Barton, 2008).
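The MSA parameters above can be sketched as follows. The identity score used here (fraction of matching non-gap positions over the target length) is an assumption; Barton (2008) reviews several conventions and the paper does not spell out which one it uses:

```python
def msa_parameters(target, msa):
    """Simple MSA descriptors for a target sequence.  `msa` is a list of
    aligned sequences (strings of equal length, '-' for gaps)."""
    L = len(target)
    seq_nmb = len(msa)
    gap_nmb = sum(seq.count("-") for seq in msa)
    gap_area = gap_nmb / (L * seq_nmb)  # Eq. 2
    # assumed identity definition: matching non-gap positions / target length
    ident_ave = sum(
        sum(a == b and a != "-" for a, b in zip(target, seq)) / L
        for seq in msa
    ) / seq_nmb
    return {"seqLen": L, "MSAseq": seq_nmb,
            "gapArea": gap_area, "identScoreAve": ident_ave}
```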

B. Secondary structure parameters:

  1. helixAll—number of occurrences of helical residues in the top N pairs divided by the total number of helical residues in the target sequence (note that a residue can be counted more than one time);

  2. otherAll—number of occurrences of non-α and non-β residues in the top N pairs divided by the total number of non-α and non-β residues in the target sequence (note that a residue can be counted more than one time);

  3. betaUnique—fraction of β-strand residues that participate in the top N pairs;

  4. otherUnique—fraction of non-α and non-β residues that participate in the top N pairs.

C. Solvent accessibility parameters:

  1. accBuriedAll—number of occurrences of buried residues in the top N pairs divided by the total number of buried residues in the target sequence (note that a residue can be counted more than one time);

  2. accBuriedUnique—fraction of buried target sequence residues that are involved in the top N pairs;

  3. accExposedUnique—fraction of exposed target sequence residues that are involved in the top N pairs;

  4. accBurBur—fraction of the top N pairs in which both residues are buried;

  5. accExpExp—fraction of the top N pairs in which both residues are exposed;

  6. accExpBur—fraction of the top N pairs in which one residue is buried and the other is exposed;

  7. accCV—coefficient of variation for the average relative solvent accessibility in the top N pairs defined as:
    accCV = σaveRSA / μaveRSA (4)

in which:

σaveRSA—standard deviation of the average relative solvent accessibility calculated for each of the top N pairs,

μaveRSA—average of the average relative solvent accessibility calculated for each of the top N pairs.

The average relative solvent accessibility for a predicted contact is defined as:

aveRSA = (RSA1 + RSA2) / 2 (5)

in which:

RSA1—relative solvent accessibility of the first residue in the predicted contact,

RSA2—relative solvent accessibility of the second residue in the predicted contact;

  8. accGlobal—average relative solvent accessibility of all residues in the target sequence.
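Eqs. 4–5 can be sketched as below. Whether the paper uses the sample or the population standard deviation is not stated; the population version is used here:

```python
from statistics import mean, pstdev

def acc_cv(pairs_rsa):
    """Eqs. 4-5: coefficient of variation of the per-pair average relative
    solvent accessibility.  `pairs_rsa` holds (RSA1, RSA2) tuples for the
    top N predicted pairs."""
    ave_rsa = [(rsa1 + rsa2) / 2 for rsa1, rsa2 in pairs_rsa]  # Eq. 5
    return pstdev(ave_rsa) / mean(ave_rsa)                     # Eq. 4
```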

D. Parameters that describe the variation and distribution of correlation scores calculated for the top N pairs:

  1. standDev—standard deviation of the correlation scores;

  2. corSpread—difference between the maximal and minimal correlation score divided by the maximal correlation score:
    corSpread = (Cormax − Cormin) / Cormax (6)

in which:

Cormax—correlation score for the strongest predicted contact in the top N pairs,

Cormin—correlation score for the Nth strongest predicted contact;

  3. IQR—half of the interquartile range defined as:
    IQR = (Q3 − Q1) / 2 (7)

in which:

Q1—first quartile of the correlation scores,

Q3—third quartile of the correlation scores;

  4. robIQR—robust coefficient of half of the interquartile range defined as:
    robIQR = Σ_{i=1..N} |Cori − IQR| / (N · IQR) (8)

in which:

Cori—correlation score for the i-th strongest predicted contact,

IQR—half of the interquartile range as defined in Eq. 7;

  5. CV—coefficient of variation for the correlation scores defined as:
    CV = standDev / μ (9)

in which:

standDev—standard deviation of the correlation scores,

μ—average of the correlation scores;

  6. skewness—skewness of the distribution of correlation scores;

  7. sepReduction—number of strongest predicted contacts that were skipped because of the sequence separation threshold while going down the list to select the top N pairs.
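Several of these descriptors (Eqs. 6–9) can be sketched in a few lines; the quartile and standard-deviation estimators are assumptions, since the paper does not specify them:

```python
from statistics import mean, pstdev, quantiles

def score_descriptors(scores):
    """Spread statistics of the top-N correlation scores (Eqs. 6-9)."""
    stand_dev = pstdev(scores)                               # population stdev assumed
    cor_spread = (max(scores) - min(scores)) / max(scores)   # Eq. 6
    q1, _, q3 = quantiles(scores, n=4)                       # exclusive method assumed
    half_iqr = (q3 - q1) / 2                                 # Eq. 7
    rob_iqr = (sum(abs(s - half_iqr) for s in scores)
               / (len(scores) * half_iqr))                   # Eq. 8
    return {"standDev": stand_dev, "corSpread": cor_spread, "IQR": half_iqr,
            "robIQR": rob_iqr, "CV": stand_dev / mean(scores)}  # Eq. 9
```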

2.10 Model performance evaluation

The root mean square error (RMSE) and R-squared statistics (R2) (James et al., 2013) were used to measure the performance of regression models. RMSE describes the average difference between the estimated and the true value of the response variable:

RMSE = √( (1/S) Σ_{i=1..S} (yi − f(xi))² ) (10)

in which:

S—number of samples in the test set (here number of protein chains in the test set),

yi—true value of the response variable (here true PPV) for the i-th sample,

xi—vector of input variables (here the 24 parameters) for the i-th sample,

f(xi)—value predicted by the model (here forecasted PPV) for xi.

The true PPV and forecasted PPV are both represented using percentages. Therefore, the RMSE value is expressed in percentage points.

R2 indicates the amount of the total variance in the data that is explained by the model. In general R2 falls between 0 and 1. In case the model forecasts values worse than the sample average, R2 can adopt negative values. R2 is defined as:

R2 = (TSS − RSS) / TSS = 1 − RSS / TSS (11)

in which: RSS—residual sum of squares defined as:

RSS = Σ_{i=1..S} (yi − f(xi))² (12)

TSS—total sum of squares defined as:

TSS = Σ_{i=1..S} (yi − yave)² (13)

and:

S—number of samples in the test set (here number of protein chains in the test set),

yi—true value of the response variable (here true PPV) for the i-th sample,

xi—vector of input variables (here 24 parameters) for the i-th sample,

yave—average of the true values of the response variable (here the average of the true PPVs),

f(xi) —value predicted by the model (here forecasted PPV) for xi.
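Eqs. 10–13 translate directly into code; a sketch (`rmse` and `r_squared` are hypothetical helper names):

```python
from math import sqrt

def rmse(y_true, y_forecast):
    """Eq. 10: root mean square error, here in percentage points."""
    s = len(y_true)
    return sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_forecast)) / s)

def r_squared(y_true, y_forecast):
    """Eqs. 11-13: 1 - RSS/TSS; negative when the model forecasts worse
    than the sample average."""
    y_ave = sum(y_true) / len(y_true)
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_forecast))
    tss = sum((y - y_ave) ** 2 for y in y_true)
    return 1 - rss / tss
```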

2.11 Parameter significance evaluation

The significance of parameters was determined as follows:

%IncRMSE = (RMSEfeat_perm − RMSEall_data) / RMSEall_data · 100% (14)

in which RMSEall_data denotes the RMSE based on all input data, and RMSEfeat_perm denotes the RMSE based on all data but with the values of a single feature shuffled randomly among all samples. The %IncRMSE was calculated with 5-fold cross-validation. This means that the model was trained on 4/5th of the training set and then, for every feature, %IncRMSE was calculated on the remaining 1/5th. This resulted in a single %IncRMSE value for each input feature. After five repetitions the five %IncRMSE values were averaged, resulting in the finally reported %IncRMSE for each input feature. %IncRMSE is very similar to %IncMSE in the R package randomForest (Liaw and Wiener, 2002).
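The permutation step of Eq. 14 can be sketched as below (the `.predict(rows)` interface and helper names are hypothetical; the paper's implementation is in R):

```python
import random

def perc_inc_rmse(model, X, y_true, feature, rmse_fn, seed=0):
    """Eq. 14: permutation importance of a single feature.  `model` is any
    fitted object exposing a .predict(rows) method; X is a list of
    per-protein feature dicts."""
    base = rmse_fn(y_true, model.predict(X))
    values = [row[feature] for row in X]
    random.Random(seed).shuffle(values)  # break the feature-target link
    X_perm = [{**row, feature: v} for row, v in zip(X, values)]
    permuted = rmse_fn(y_true, model.predict(X_perm))
    return (permuted - base) / base * 100.0
```

A feature the model ignores yields %IncRMSE of zero; important features inflate the error when shuffled.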

2.12 Data and code availability

Data collection and parameter calculations were performed with scripts written in Java using BioJava 3.0.7 (Prlić et al., 2012), commons-math3 3.6 and mrs-ws-client 1.0.1 (Hekkelman and Vriend, 2005) libraries. Regression models were built using the R programming language and a series of freely available R packages (Friedman et al., 2010; Liaw and Wiener, 2002; Meyer et al., 2015; Therneau and Atkinson, 2015; Varmuza and Filzmoser, 2009). All scripts and datasets generated during the study are available online at http://comprec-lin.iiar.pwr.edu.pl/dcaQ/.

3 Results

Figure 1 shows the RMSE values for the nine regression models used to forecast the prediction accuracy of gplmDCA. The best results were obtained with the avLSF regression model that calculates the average of the LASSO, SVM and FOREST results. avLSF forecasts the PPV of gplmDCA for N =200 with an RMSE of 7.3 percentage points. For N =200 the LASSO regression model gives a similar RMSE of 7.6 percentage points. The avLSF model, though, is the most robust of all regression models with regard to different values of N (see Supplementary Tables S1 and S3 in SMResults). Figure 2A presents the scatter plot of the PPVs forecasted with the avLSF regression model as a function of the gplmDCA true PPV for the whole test set. The R2 value is 0.86 (Supplementary Table S2 in SMResults). Figure 2B shows, as an example, that in 70% of all cases the forecasted PPV is less than 7.3 percentage points different from the true PPV. It can also be seen that in half of all cases the forecasted PPV error is less than 5 percentage points.

Fig. 1. RMSE for forecasting the PPV of gplmDCA. Results are shown for the nine regression models and N =200. RMSE is expressed in percentage points. Note that the ordinate starts at 5.

Fig. 2. Results for the gplmDCA avLSF model trained for N =200. (A) Forecasted PPV as a function of the gplmDCA true PPV. Both PPVs are expressed in percentages. (B) Cumulative distribution plot. The ordinate is the fraction of protein chains for which the absolute difference between the true PPV and the forecasted PPV is lower than the value on the abscissa. The difference on the abscissa is expressed in percentage points. The dashed lines indicate, as an example, that avLSF forecasted the PPV of gplmDCA with an accuracy better than 7.3 percentage points for 70% of the proteins in the test set.

Figure 3 shows the average true PPV of gplmDCA for seven values of N. The true PPV of gplmDCA, like that of every DCA-based method, decreases when the number of predicted contacts increases. Figure 3 also shows that the avLSF forecast improves as more contacts are predicted: the RMSE clearly decreases with increasing N. The same effects are observed when N is a function of the sequence length L. The RMSEs for N =200 and N =3L/2 are nearly the same (7.3 and 7.2 percentage points, respectively). The RMSE and R2 values for all regression models and all values of N are presented in Supplementary Tables S1–S4 in the SMResults.

Fig. 3. Average true PPV of gplmDCA for seven values of N: 20, 50, 100, 200, L/2, L and 3L/2, where L is the target sequence length. Error bars indicate the RMSE for forecasting the PPV of gplmDCA using the avLSF model. Average true PPV and error bars are expressed in percentages and percentage points, respectively.

We examined how our best regression model, which was trained for gplmDCA, performs with mfDCA—one of the oldest DCA implementations (Morcos et al., 2011), which applies a mean-field approximation. We used an in-house Java implementation of mfDCA with MSAs from Pfam (see SMMethods) (Finn et al., 2016). Our models, trained for N =200 on the contacts predicted with gplmDCA, forecasted the PPV of mfDCA with RMSE = 7.3 percentage points, which is similar to the results obtained for gplmDCA. The R2 values, on the other hand, are lower for mfDCA than for gplmDCA (see Supplementary Tables S2 and S4 in SMResults). mfDCA rarely predicts contacts with a PPV better than 50%, so its TSS value (see Eq. 13) is lower than that of gplmDCA. If R2 is calculated for gplmDCA using only protein chains for which the true PPV is lower than 50%, then the R2 is almost identical to the R2 for the mfDCA case.

These observations suggested that the regression model trained for gplmDCA can also be applied to forecast the PPV of other methods. When we tried to forecast the PPVs of PSICOV, however, the gplmDCA-based regression model performed poorly (Supplementary Tables S1–S4 in SMResults). This problem is likely caused by the radically different ways in which PSICOV and gplmDCA cope with the transitivity of contacts, and thus by the information the regression models can extract from each method's output. We therefore trained a separate regression model for PSICOV. The results of this training are shown in Figure 4. The best results, for N =200, were obtained with the avLF and avLSF models. The avLSF regression model forecasts PSICOV's PPV with RMSE = 6.5 percentage points. The best model for PSICOV, avLF, has RMSE = 6.4 percentage points, which is marginally better. For most other values of N, however, the avLSF model works better than the avLF model (see Supplementary Tables S5–S8, S11 and S12 in SMResults). We therefore continue the discussion using the avLSF results, also to allow for a better comparison of the results obtained for gplmDCA and PSICOV.

Fig. 4. RMSE for forecasting the PPV of PSICOV. Results are for models trained for N =200. RMSE is in percentage points. Note that the ordinate starts at 5.

Figure 5A shows the correlation between the true PPVs for PSICOV and the avLSF forecasted PPVs. The R2 value equals 0.86 (Supplementary Table S6 in SMResults). Figure 5B shows that the regression model forecasts the PPV of PSICOV with an accuracy better than its RMSE (of 6.5 percentage points) for 72% of the proteins in the test set.

Fig. 5. Results for the PSICOV avLSF model trained for N =200. Axes and lines have the same meaning as in Figure 2.

The ability to forecast the PSICOV results is slightly better than for gplmDCA. The avLSF model performs better on PSICOV data (RMSE = 6.5 percentage points) than on gplmDCA data (RMSE = 7.3 percentage points). Figure 5B reveals that avLSF forecasts the PPV of PSICOV with an accuracy better than 7.3 percentage points (the RMSE of gplmDCA) for 79% of all cases.

Figure 6 shows the performance of avLSF models built for PSICOV as function of N. This bar plot is highly similar to the one presented in Figure 3 for the gplmDCA results. The RMSE and R2 values for all regression models and all variants of N are presented in Supplementary Tables S5–S8 in the SMResults.

Fig. 6. Average true PPV of PSICOV for N = 20, 50, 100, 200, L/2, L and 3L/2. Error bars indicate RMSEs for forecasting the PPV of PSICOV using the avLSF model. Average true PPV and error bars are expressed in percentages and percentage points, respectively.

We examined the importance of individual parameters for the performance of the regression models built for gplmDCA or PSICOV. Table 1 presents the values of %IncRMSE for each parameter, each method, and the LASSO, FOREST, SVM and avLSF models. Supplementary Table S13 in the SMResults also contains the results for the other regression models. The higher the value of %IncRMSE, the more important the parameter is for the model's performance. Parameters that describe solvent accessibility are highly important for the performance of both the gplmDCA and the PSICOV regression models. The coefficient of variation for the average relative solvent accessibility (accCV) is usually found among the five most important parameters. The importance of all other parameter types differs between gplmDCA and PSICOV. For gplmDCA models, secondary structure and MSA parameters are among the five most important ones much more often than for PSICOV models. The length of the target sequence (seqLen) is almost always one of the most significant parameters for gplmDCA models (see also Supplementary Table S13 in SMResults). For the FOREST regression model the %IncRMSE of seqLen is low, but it is high for the number of sequences in the MSA (MSAseq). For the LASSO, FOREST, SVM and avLSF models, parameters that describe correlation scores are clearly more often important for PSICOV than for gplmDCA models. In particular, the half of the interquartile range (IQR), the standard deviation (standDev) and the skewness are always among the five most important parameters for PSICOV models. These differences in parameter importance between gplmDCA and PSICOV are likely the reason why regression models for PSICOV had to be built separately from the gplmDCA models.

Table 1.

The %IncRMSE for LASSO, FOREST, SVM and avLSF regression models built for gplmDCA and PSICOV with N =200

Parameter | gplmDCA: LASSO FOREST SVM avLSF | PSICOV: LASSO FOREST SVM avLSF
accBurBur 76.40 39.86 20.96 46.72 41.33 6.25 16.92 18.63
accBuriedAll 17.37 0.35 10.79 5.11 46.01 0.81 13.44 11.46
accBuriedUnique 101.59 3.00 25.20 35.11 39.50 3.59 15.40 15.72
accExpBur 2.85 9.84 16.08 9.33 9.32 5.49 16.57 10.32
accExposedUnique 2.69 1.06 8.40 2.62 5.66 1.07 5.52 2.80
accGlobal 8.31 2.06 14.22 7.52 11.66 3.30 17.10 10.28
accCV 39.30 53.47 37.62 46.25 40.45 12.34 21.03 22.80
accExpExp 31.08 3.81 9.39 3.17 17.41 1.01 9.77 3.02
betaUnique 15.50 10.61 12.80 13.30 24.30 19.68 16.24 20.55
helixAll 19.94 2.15 16.75 12.06 14.62 0.48 7.97 6.13
otherAll 72.33 1.18 14.86 22.89 8.49 0.53 8.71 4.25
otherUnique 1.35 0.91 6.66 1.93 1.91 1.13 5.38 1.65
gapArea 12.11 1.72 15.52 9.26 3.63 0.30 2.12 1.50
aveBlock 1.08 1.81 2.36 0.88 −0.19 0.16 0.81 −0.03
identScoreAve 9.90 3.80 3.76 4.33 9.03 0.96 3.51 3.03
MSAseq 6.58 10.82 5.24 6.54 1.24 3.85 5.34 2.98
seqLen 66.92 0.50 22.92 24.18 17.81 0.55 8.06 4.74
corSpread 1.88 0.41 5.31 2.45 0.27 3.19 29.40 8.02
IQR 50.06 12.54 16.39 24.72 87.89 46.21 48.37 61.14
robIQR 23.23 6.27 4.73 6.94 37.57 1.40 1.77 3.43
skewness 0.30 0.87 2.15 1.11 121.83 34.75 27.11 58.39
sepReduction 9.82 1.85 11.27 6.94 14.10 7.56 10.04 9.88
standDev 12.48 4.45 19.53 12.30 56.79 32.40 37.10 43.50
CV 1.45 1.99 2.74 1.35 0.48 1.82 9.14 2.56

Note: For each regression model the %IncRMSE values of the five most important parameters are in bold face. Blue: solvent accessibility; yellow: secondary structure; grey: MSA; white: correlation score.

Besides the %IncRMSE, for N =200 we also examined the RMSE of our models when only one single parameter was used, and the correlation between all parameters. The results are presented in Supplementary Tables S14–S17 in SMResults. accGlobal and accExpExp are among the most important single features (Supplementary Tables S14 and S15 in SMResults), which was not seen in the %IncRMSE test. This might be caused by their high mutual correlation (Supplementary Tables S16 and S17 in SMResults). Similarly, %IncRMSE did not place MSAseq and sepReduction among the important parameters in either model. Finally, for PSICOV's model, permutation of betaUnique led to an increase of 20% in RMSE (Table 1), whereas in the single-feature analysis the importance of betaUnique was not high (Supplementary Table S15 in SMResults).

We investigated whether our regression models can forecast the PPV for a target protein independently of the DCA-based method in use. The parameters that do not depend on the output of the DCA-based method are the average relative solvent accessibility of all residues in the target sequence (accGlobal in Table 1) and all MSA parameters (grey cells in Table 1). Regression models trained on only these parameters perform much worse than models using all parameters. For example, with N = 200 the gplmDCA avLSF model achieves an RMSE of 7.3 percentage points (see Fig. 1 and Supplementary Table S1 in SMResults), which rises to 11.4 percentage points when the other parameters are left out (see Supplementary Table S9 in SMResults). This test gave similar results for the regression models built for gplmDCA and PSICOV (see Supplementary Tables S9–S12 in SMResults).

4 Discussion

We designed a method to forecast the residue–residue contact prediction accuracy of gplmDCA and PSICOV. The method works well, resulting in RMSE values of 7.3 and 6.5 percentage points, respectively; that is, the percentage of correct contacts among the 200 strongest predicted ones can be forecasted with an average error of about 7 percentage points. The best performing regression model is avLSF, which averages the forecasts of the LASSO, SVM and FOREST models. It is also the most robust model with regard to different values of N.
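The avLSF combination is a plain average of the three base regressors' forecasts. A minimal sketch of this idea, assuming synthetic data and placeholder hyperparameters (the paper's models were built and tuned in R):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic PPV-like regression task standing in for the real feature set.
X = rng.normal(size=(200, 4))
y = 35.0 + 8.0 * X[:, 0] - 6.0 * X[:, 1] + rng.normal(scale=3.0, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# The three base regressors behind avLSF (hyperparameters are illustrative).
models = [
    Lasso(alpha=0.1),
    SVR(kernel="rbf", C=10.0),
    RandomForestRegressor(n_estimators=200, random_state=1),
]
for m in models:
    m.fit(X_train, y_train)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# avLSF-style forecast: the unweighted mean of the three model outputs.
avlsf_pred = np.mean([m.predict(X_test) for m in models], axis=0)
rmse_single = [rmse(y_test, m.predict(X_test)) for m in models]
rmse_avlsf = rmse(y_test, avlsf_pred)
```

By Minkowski's inequality, the RMSE of the averaged forecast can never exceed the mean of the individual RMSEs, which is one reason simple averaging is a robust combiner.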

The models were trained on SCOP domains and tested on protein chains. Domains and chains are often part of larger proteins with multiple domains, or of multiple chains in a quaternary structure, so a (limited) number of residues in both the training set and the test set will have a wrong accessibility value assigned; Table 1 indicates that accessibility values are important for the performance of the regression models. Fortunately, the SCOP definition of domains (Gough et al., 2001; Murzin et al., 1995) keeps this error small: in 720 out of 1128 domains in the training set, all residues have the same accessibilities in the domain as in the full protein. Nevertheless, this remains a point of interest for future work.

The avLSF model trained for gplmDCA performed just as well for mfDCA, despite the differences between the mfDCA and gplmDCA algorithms. For example, we trained the models with SCOP domains, while the authors of mfDCA used Pfam domains; SCOP domains might be closer to Pfam domains than to the complete chains used in our test set. Also, Pfam alignments usually contain fewer sequences than HHblits alignments. A lower number of sequences in the MSA usually leads to a lower PPV (Cocco et al., 2013; Feinauer et al., 2014), and our regression models forecast a low PPV better than a high one (see e.g. Fig. 3). Finally, gplmDCA returns corrected Frobenius norms, which we used as correlation scores to calculate parameters, while mfDCA returns direct information values.

Models trained with gplmDCA forecast the accuracy of mfDCA very well but perform poorly for PSICOV. The PSICOV algorithm differs so much from the other DCA algorithms that it needed its own regression model, and building separate regression models for gplmDCA and PSICOV showed that different types of parameters were important for the two methods (see Table 1). Two Potts-model-based methods and one sparse-matrix-based method obviously provide too few examples to allow for extrapolation, but we nevertheless believe that our method generalizes to families of prediction methods and that our procedure can be applied to prediction methods yet to be developed. It might be interesting to see whether better contact predictions can be obtained by combining the PSICOV and gplmDCA ways of dealing with the transitivity of contacts, and how our regression models would perform on such a hybrid method.

Meta-methods perform better than the single methods they combine (Jones et al., 2015; Wang et al., 2017). Beyond a certain limit, though, meta-methods can only improve if the underlying non-meta methods improve, which is why we focused on gplmDCA and PSICOV. Nevertheless, we also built models for RaptorX (Wang et al., 2017), one of the best methods in CASP11 and CASP12 (Monastyrskyy et al., 2016). RaptorX is a deep learning meta-method that integrates information such as sequence conservation, residue co-evolution, predicted secondary structure and solvent accessibility. We trained and tested our models following the same procedures as for gplmDCA and PSICOV. We predicted the PPV of RaptorX for N = 200 and obtained the best RMSE for the avLF model (RMSE = 9.67) and the avLSF model (RMSE = 9.72). Although these results show the potential of our approach for meta-methods, they are slightly worse than for gplmDCA or PSICOV, which suggests that meta-methods behave differently from the two non-meta methods. One problem might be that RaptorX uses more data types than we cover in the gplmDCA and PSICOV models; RaptorX also does certain things differently from the other methods (e.g. three-state versus two-state accessibility descriptions), so we cannot use all the information RaptorX uses. The %IncRMSE and single-parameter analyses, shown in Supplementary Tables S22–S24 in SMResults, show that parameters based on correlation scores are the most important for forecasting the RaptorX PPV. Unlike the non-meta methods, RaptorX returns not a raw correlation score but the probability that a residue–residue pair forms a contact. That makes the output more intuitive: the more the distribution of scores is shifted toward 1.0, the higher the true PPV of RaptorX. This probably explains why the corSpread, skewness and CV parameters are the most significant.
All the results for RaptorX are provided in SMResults (Supplementary Tables S18–S25).
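The link between the shape of RaptorX's score distribution and its PPV can be made concrete with simple distribution descriptors. The sketch below computes illustrative analogues of the corSpread, skewness and CV parameters from a list of contact scores; the exact parameter definitions are given in the Methods, so these formulas are assumptions.

```python
import numpy as np

def score_shape_features(scores):
    """Illustrative analogues of the corSpread, skewness and CV
    parameters, computed from the top-N contact scores."""
    s = np.asarray(scores, dtype=float)
    mean, std = s.mean(), s.std()
    return {
        "spread": float(s.max() - s.min()),           # range of scores
        "skewness": float(np.mean((s - mean) ** 3) / std ** 3),
        "CV": float(std / mean),                      # coefficient of variation
    }

# A score distribution shifted toward 1.0 (confident predictions)
# versus one concentrated at low probabilities.
confident = score_shape_features([0.85, 0.90, 0.95, 0.97, 0.99])
uncertain = score_shape_features([0.05, 0.10, 0.15, 0.20, 0.60])
```

For RaptorX, whose scores are probabilities, a distribution pushed toward 1.0 (small CV, tail toward low values) signals a high true PPV, consistent with corSpread, skewness and CV being its most important parameters.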

The parameters that describe the solvent accessibility of the top N pairs were most often among the most important parameters for the gplmDCA and PSICOV models (see Table 1). This reflects the fact that a contact predicted between buried residues is more likely to be a true contact, due to the geometry of the protein core. Parameters related to correlation scores are also important, especially for forecasting the performance of PSICOV. When the model has to forecast the contact prediction accuracy from a small number of predicted contacts, it receives less information about the distribution of correlation scores, which hurts forecasting performance. This agrees with Figures 3 and 6, which show that model performance depends on the number of predicted contacts: the more contacts were predicted, the better the forecast.

We trained our models on solvent accessibilities and secondary structures provided by DSSP and tested them on values predicted with the SSpro and ACCpro20 methods. The performance of such predictors continues to improve, and each research group working on DCA-based methods and interested in using our method will use its own predictors for this purpose. We therefore used the perfect, DSSP-based values for training, and any well-performing prediction method for testing. We also investigated the performance when training and testing were both done with SSpro and ACCpro20, or both with DSSP, and obtained marginally better RMSE values of 7.1 and 6.5, respectively.

We have shown that the quality of DCA-based contact predictions can be forecasted remarkably well from a series of parameters that can be extracted from the MSA, the predicted secondary structure, the predicted solvent accessibility, and the contact prediction scores for the target protein. We believe that knowledge about the quality of contact predictions will greatly enhance their usefulness in many areas of the life sciences, and might even contribute to improving the DCA-based methods themselves.

Supplementary Material

Supplementary Data

Acknowledgements

We thank Monika Kurczynska for producing the SCOP domain training set. Wroclaw Centre for Networking and Supercomputing at Wroclaw University of Science and Technology is also acknowledged.

Funding

The work of P.P.W., B.M.K. and M.K. was partially supported by Statutory Funds. P.P.W. acknowledges partial funding from the Polish National Science Centre (Etiuda 4, DEC-2016/20/T/NZ1/00514). G.V. acknowledges funding from the EU project NewProt (289350). J.X. is partially supported by National Institutes of Health grant R01GM089753 and National Science Foundation grant DBI-1564955.

Conflict of Interest: none declared.

Author Contributions

B.M.K. and P.P.W. came up with the general concept of the prediction method and designed the input parameters for the regression models. P.P.W. implemented the Java code for processing the data and calculating the input parameters for the regression models, and prepared the draft of the manuscript. B.M.K. implemented the R code for building, training and testing the regression models. J.X. provided the RaptorX data. M.K. and G.V. supervised the project. All authors participated in the study design, data analysis and final manuscript writing.

References

1. Barton G.J. (2008) Sequence alignment for molecular replacement. Acta Crystallogr. D Biol. Crystallogr., 64, 25–32.
2. Bjorkholm P. et al. (2009) Using multi-data hidden Markov models trained on local neighbourhoods of protein structure to predict residue–residue contacts. Bioinformatics, 25, 1264–1270.
3. Bohr J. et al. (1993) Protein structures from distance inequalities. J. Mol. Biol., 231, 861–869.
4. Bystroff C. et al. (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol., 301, 173–190.
5. Chen H., Zhou H.X. (2005) Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res., 33, 3193–3199.
6. Cheng J., Baldi P. (2007) Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics, 8, 113.
7. Cocco S. et al. (2013) From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction. PLoS Comput. Biol., 9, e1003176.
8. De Leonardis E. et al. (2015) Direct-coupling analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction. Nucleic Acids Res., 43, 10444–10455.
9. Di Lena P. et al. (2012) Deep architectures for protein contact map prediction. Bioinformatics, 28, 2449–2457.
10. Ding W. et al. (2013) CNNcon: improved protein contact maps prediction using cascaded neural networks. PLoS One, 8, e61533.
11. Du T. et al. (2016) Prediction of residue–residue contact matrix for protein–protein interaction with Fisher score features and deep learning. Methods, 110, 97–105.
12. Duarte J.M. et al. (2010) Optimal contact definition for reconstruction of contact maps. BMC Bioinformatics, 11, 283.
13. Dyrka W. et al. (2016) Fast assessment of structural models of ion channels based on their predicted current-voltage characteristics. Proteins, 84, 217–231.
14. Ekeberg M. et al. (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 87, 012707.
15. Feinauer C. et al. (2014) Improving contact prediction along three dimensions. PLoS Comput. Biol., 10, e1003847.
16. Finn R.D. et al. (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
17. Friedman J. et al. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 1–22.
18. González A.J. et al. (2013) Prediction of contact matrix for protein-protein interaction. Bioinformatics, 29, 1018–1025.
19. Gough J. et al. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903–919.
20. Göbel U. et al. (1994) Correlated mutations and residue contacts in proteins. Proteins, 18, 309–317.
21. Guo F. et al. (2015) Identification of protein–protein interactions by detecting correlated mutation at the interface. J. Chem. Inf. Model., 55, 2042–2049.
22. Hekkelman M.L., Vriend G. (2005) MRS: a fast and compact retrieval system for biological data. Nucleic Acids Res., 33, W766–W769.
23. Hopf T.A. et al. (2012) Three-dimensional structures of membrane proteins from genomic sequencing. Cell, 149, 1607–1621.
24. Horn F. et al. (1998) The interaction of class B G protein-coupled receptors with their hormones. Recept. Channels, 5, 305–314.
25. Iserte J. et al. (2015) I-COMS: Interprotein-Correlated Mutations Server. Nucleic Acids Res., 43, W320–W325.
26. James G. et al. (2013) An Introduction to Statistical Learning with Applications in R. Springer-Verlag, New York.
27. Jones D.T. et al. (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics, 28, 184–190.
28. Jones D.T. et al. (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics, 31, 999–1006.
29. Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
30. Kappen H.J., Rodríguez F.B. (1998) Efficient learning in Boltzmann machines using linear response theory. Neural Comput., 10, 1137–1156.
31. Konopka B.M. et al. (2014) Automated procedure for contact-map-based protein structure reconstruction. J. Membr. Biol., 247, 409–420.
32. Kukic P. et al. (2014) Toward an accurate prediction of inter-residue distances in proteins using 2D recursive neural networks. BMC Bioinformatics, 15, 6.
33. Lesk A.M. (1997) CASP2: report on ab initio predictions. Proteins, Suppl. 1, 151–166.
34. Li Y. et al. (2011) Predicting residue–residue contacts using random forest models. Bioinformatics, 27, 3379–3384.
35. Liaw A., Wiener M. (2002) Classification and regression by randomForest. R News, 2, 18–22.
36. Monastyrskyy B. et al. (2014) Evaluation of residue–residue contact prediction in CASP10. Proteins, 82, 138–153.
37. Monastyrskyy B. et al. (2016) New encouraging developments in contact prediction: assessment of the CASP11 results. Proteins, 84, 131–144.
38. Magnan C.N., Baldi P. (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, 30, 2592–2597.
39. Marks D.S. et al. (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One, 6, e28766.
40. Meyer D. et al. (2015) e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071). TU Wien.
41. Morcos F. et al. (2011) Direct-coupling analysis of residue co-evolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA, 108, E1293–E1301.
42. Murzin A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
43. Ovchinnikov S. et al. (2017) Protein structure determination using metagenome sequence data. Science, 355, 294–298.
44. Oliveira L. et al. (2003) Identification of functionally conserved residues with the use of entropy-variability plots. Proteins, 52, 544–552.
45. Olmea O. et al. (1999) Effective use of sequence correlation and conservation in fold recognition. J. Mol. Biol., 293, 1221–1239.
46. Pollastri G., Baldi P. (2002) Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18, S62–S70.
47. Prlić A. et al. (2012) BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28, 2693–2695.
48. Remmert M. et al. (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods, 9, 173–175.
49. Saitoh S. et al. (1993) A geometrical constraint approach for reproducing the native backbone conformation of a protein. Proteins, 15, 191–204.
50. Sathyapriya R. et al. (2009) Defining an essence of structure determining residue contacts in proteins. PLoS Comput. Biol., 5, e1000584.
51. Skolnick J. et al. (1997) MONSSTER: a method for folding globular proteins with a small number of distance restraints. J. Mol. Biol., 265, 217–241.
52. Skwark M.J. et al. (2013) PconsC: combination of direct information methods and alignments improves contact prediction. Bioinformatics, 29, 1815–1816.
53. Tegge A.N. et al. (2009) NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Res., 37, W515–W518.
54. Terashi G., Takeda-Shitaka M. (2015) CAB-align: a flexible protein structure alignment method based on the residue–residue contact area. PLoS One, 10, e0141440.
55. Therneau T.M., Atkinson E.J. (2015) An Introduction to Recursive Partitioning Using the RPART Routines.
56. Touw W.G. et al. (2015) A series of PDB related databases for everyday needs. Nucleic Acids Res., 43, D364–D368.
57. Vendruscolo M. et al. (1997) Recovery of protein structure from contact maps. Fold. Des., 2, 295–306.
58. Varmuza K., Filzmoser P. (2009) Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press (Taylor & Francis), Boca Raton, FL, USA.
59. Wainwright M.J., Jordan M.I. (2008) Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1, 1–305.
60. Wang S. et al. (2017) Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol., 13, e1005324.
61. Wang X.F. (2011) Predicting residue–residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach. PLoS One, 6, e26767.
62. Wang Y., Barth P. (2015) Evolutionary-guided de novo structure prediction of self-associated transmembrane helical proteins with near-atomic accuracy. Nat. Commun., 6, 7196.
63. Wang Z., Xu J. (2013) Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics, 29, i266–i273.
64. Wozniak P.P. et al. (2017) Correlated mutations select misfolded from properly folded proteins. Bioinformatics, 33, 1497–1504.
65. Xue B. et al. (2009) Predicting residue–residue contact maps by a two-layer, integrated neural-network method. Proteins, 76, 176–183.
66. Zhang H. et al. (2016) Improving residue–residue contact prediction via low-rank and sparse decomposition of residue correlation matrix. Biochem. Biophys. Res. Commun., 472, 217–222.
67. Zhang Y., Skolnick J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.
68. Zhang Y., Skolnick J. (2005) TM-align: a protein structure alignment algorithm based on TM-score. Nucleic Acids Res., 33, 2302–2309.


Data Availability Statement

Data collection and parameter calculations were performed with scripts written in Java using BioJava 3.0.7 (Prlić et al., 2012), commons-math3 3.6 and mrs-ws-client 1.0.1 (Hekkelman and Vriend, 2005) libraries. Regression models were built using the R programming language and a series of freely available R packages (Friedman et al., 2010; Liaw and Wiener, 2002; Meyer et al., 2015; Therneau and Atkinson, 2015; Varmuza and Filzmoser, 2009). All scripts and datasets generated during the study are available online at http://comprec-lin.iiar.pwr.edu.pl/dcaQ/.

