Abstract
De novo membrane protein structure prediction is limited to small proteins due to the conformational search space quickly expanding with length. Long-range contacts (24+ amino acid separation)–residue positions distant in sequence, but in close proximity in the structure, are arguably the most effective way to restrict this conformational space. Inverse methods for co-evolutionary analysis predict a global set of position-pair couplings that best explain the observed amino acid co-occurrences, thus distinguishing between evolutionarily explained co-variances and these arising from spurious transitive effects. Here, we show that applying machine learning approaches and custom descriptors improves evolutionary contact prediction accuracy, resulting in improvement of average precision by 6 percentage points for the top 1L non-local contacts. Further, we demonstrate that predicted contacts improve protein folding with BCL::Fold. The mean RMSD100 metric for the top 10 models folded was reduced by an average of 2 Å for a benchmark of 25 membrane proteins.
Introduction
Determining membrane protein (MP) structures experimentally is difficult as they are often too large for nuclear magnetic resonance experiments and remain difficult to crystallize [1]. Only about 2% of reported structures are MPs [2] and only around 100 distinct folds of integral helical MPs with more than one transmembrane (TM) span [3] are represented in the Protein Data Bank (PDB) [2, 4, 5]. At the same time, understanding MP structure is critical, as it is estimated that MPs comprise 15–39% [6] of the human proteome. They are also particularly important players in cell-environment interactions, i.e. environment sensing receptors, transporters, and channels. Accordingly, approximately half of current therapeutics target MPs[7]. Given the relatively small number of experimental structures, structure-based drug discovery is underdeveloped for MPs. However, as better models for MPs become available structure-based drug discovery will offer an avenue for developing new and more targeted pharmaceuticals [7]. Computational methods that can assist in de novo MP structure determination are therefore of increasing relevance.
One avenue for tackling the experimental difficulties in membrane protein structure determination is de novo protein folding. This approach, while theoretically sound, suffers from insurmountable computational complexity due to exponential dimensionality of the conformational space to be searched. The vastness of the search space can be significantly reduced by introduction of conformational restraints (hereinafter “restraints”), which often take form of upper limits on inter-atom distances (“contacts”). While the definition of a contact may differ between different works, the most commonly used criterion [8] is an amino acid pair with Cβ carbons within 8Å or less (Cα is used in the case of glycine). For the purpose of this work, we will neglect local, short-range contacts (with separation of at most 5 amino acids in the primary sequence), as they are not conducive for successful protein folding, focusing rather on non-local contacts, in short (6–11 amino acids), medium (12–23 amino acids), and long range contacts (24+ amino acids separation) [9, 10]. For sake of clarity, we will designate all contacts with separation of at least 12 amino acids as “long-range contacts”, instead of “medium- and long-range contacts”.
Until recently, contact prediction in proteins was of limited use for guiding protein folding, with leading methods being able to predict approximately 1 in 5 contacts correctly within a set of L/5 predictions (where L is the length of protein sequence) [9, 11–13]. Subsequent advances employing multiple layered artificial neural networks (ANNs) were able to improve this number to 30% [12]. Support vector machines have also been used to predict TM contacts[14]. Our previous work in the Bio Chemical Library (BCL) suite, which uses only sequence information, yields positive predictive values no higher than 42%[15]. Recent improvements in contact prediction [16–18] successfully applied the statistical methods used for solving inverse problems in statistical mechanics to the contact prediction problem. These methods rely on the idea of global statistical inference using co-evolutionary information encoded in the multiple sequence alignments (MSA) of evolutionarily related (homologous) proteins to assess the statistical couplings between individual positions in the protein. In contrast to older methodology, these methods, collectively referred to as Direct Coupling Analysis (DCA)[17–19], model the entire data set at once and not only independent pairs of residues, assuming the evolutionary interaction model is an instance of a high-dimensional Ising model on a fully connected graph. In so doing, they reduce the systematic errors of direct co-evolutionary methods (such as Mutual Information), where high correlation between two residues is an indirect consequence of both being highly correlated to a third residue. This in turn results in a significant boost in predictive power of the co-evolutionary methods, which in some case are able to obtain nearly perfect predictive performance within the set of top L predictions.
DCA-like methods, while successfully applied to contact prediction problem, do not actually predict contacting amino acids, but rather indicate the pairs of amino acids that are under evolutionary pressure to co-evolve[17, 18, 20]. While such a pressure is likely to originate due to spatial proximity, it may also be a result of functional constraints, oligomeric state or conformational flexibility of a protein. Furthermore, the inferred couplings are specific to a particular MSA and may differ substantially upon the change of the homology threshold for inclusion of a sequence in the MSA.
Here we combine mean field DCA (mfDCA) [17] with a variety of custom descriptors as an input to a machine learning approach. The resulting method surpasses the predictive capabilities of mfDCA (including subsequent filtering steps advised by method's authors) in terms of precision by 6% on average, up to 17% (for aquaporin 0, PDB id:1YMGA). While this improvement has been measured for the top L long-range contacts (i.e. pairs of separation of at least 12 amino acids) using the ANN approach, we show that the improvement is relatively agnostic to the machine learning method used, by replicating the results using decision trees (DTs). Finally, we demonstrate, that these predicted contacts improve MP folding with BCL::Fold on average by 2Å in comparison to the same protocol without contact restraints.
Materials and methods
Membrane protein benchmark data set and alignment preparation
This work focuses on the contact and structure prediction of a set of 25 diverse α-helical membrane proteins of known structure, having more than four transmembrane helices. The data set has been chosen, so that for each target one can identify sufficiently many homologs (more than 1000) with satisfactory coverage to warrant success with coupling inference. Site-coverage (Cov) is the percent of the target sequence sites that map onto the final MSA after one removes columns with a large number of gaps. Based on prior work [17], a threshold of 30% gaps was used for this analysis. These 25 membrane proteins come from 23 Pfam families [21], contain a maximum of 15 helices, and have a maximum initial target length of 572 [22].
It is not evident what makes for the most suitable multiple sequence alignment for coupling inference. Most of the work published so far either chooses the alignment generation parameters arbitrarily [18] uses external, oftentimes intuition-based approaches [22] or samples multiple alternative alignments [20, 23, 24]. We created alignments for each protein using the HMMER3 software package [25] at the same E-values—1E-3, 1E-5, 1E-10, 1E-15, 1E-20, 1E-30, and 1E-40.
As many of the DCA-like methods are sensitive to the alignments containing large unaligned stretches (gaps), we have pruned the input alignments to discard the sequences containing more than 30% of gaps. The statistics of filtered and unfiltered MSA are included in Table 1. Filtered MSA are significantly smaller—on average nearly 16 fold so. Maximum, minimum, and average unfiltered MSA sizes are 65525, 1503, and 17865 respectively. The maximum, minimum, and average filtered MSA sizes are 1745, 684, and 1142, respectively. Coverage increases nearly 16% from approximately 76% coverage in unfiltered to 88% in filtered MSA.
Table 1. The 25 membrane proteins used as the benchmark set.
Protein | Initial Sequence | Filtered | Unfiltered | |||||||
---|---|---|---|---|---|---|---|---|---|---|
Name | PDBID | L | Opt. Eval | TMhelix | Malign | Meff | Cov | Malign | Meff | Cov |
ADIC_SALTY | 3NCYA | 422 | 1.00E-20 | 12 | 1215 | 1205 | 0.891 | 16975 | 5560 | 0.884 |
ADRB2_HUMAN | 2RH1A | 442 | 1.00E-20 | 8 | 818 | 451 | 0.652 | 22822 | 5228 | 0.559 |
ADT1_BOVIN | 1OKCA | 292 | 1.00E-40 | 6 | 1068 | 1043 | 0.890 | 8631 | 3516 | 0.890 |
AMTB_ECOLI | 1XQFA | 362 | 1.00E-05 | 11 | 1048 | 1021 | 0.961 | 4051 | 1416 | 0.925 |
AQP4_HUMAN | 3GD8A | 223 | 1.00E-10 | 7 | 1073 | 1062 | 0.964 | 5400 | 1933 | 0.955 |
BTUC_ECOLI | 1L7VA | 324 | 1.00E-10 | 10 | 1049 | 1045 | 0.914 | 9399 | 4125 | 0.910 |
C3NQD8_VIBCJ | 3MKTA | 460 | 1.00E-20 | 12 | 1072 | 1068 | 0.917 | 11067 | 5591 | 0.913 |
C6E9S6_ECOBD | 3RKON | 473 | 1.00E-10 | 14 | 1745 | 1722 | 0.831 | 59616 | 5932 | 0.588 |
COX1_BOVIN | 1OCCA | 514 | 1.00E-40 | 12 | 1157 | 754 | 0.979 | 47394 | 1289 | 0.089 |
COX3_BOVIN | 1OCCC | 261 | 1.00E-03 | 6 | 684 | 521 | 0.958 | 9105 | 1444 | 0.709 |
CYB_BOVIN | 1PP9C | 379 | 1.00E-03 | 8 | 1069 | 581 | 0.921 | 49258 | 855 | 0.272 |
FIEF_ECOLI | 3H90A | 283 | 1.00E-05 | 6 | 1050 | 1039 | 0.968 | 7610 | 3473 | 0.933 |
GLPG_ECOLI | 3B45A | 180 | 1.00E-05 | 6 | 1092 | 1073 | 0.867 | 4625 | 2323 | 0.739 |
GLPT_ECOLI | 1PW4A | 434 | 1.00E-30 | 12 | 1611 | 1604 | 0.878 | 25199 | 10789 | 0.882 |
METI_ECOLI | 3DHWA | 203 | 1.00E-15 | 5 | 1086 | 1065 | 0.877 | 13418 | 4788 | 0.877 |
MIP_BOVIN | 1YMGA | 233 | 1.00E-10 | 7 | 1032 | 1010 | 0.901 | 5431 | 1937 | 0.897 |
MSBA_SALTY | 3B60A | 572 | 1.00E-03 | 6 | 1576 | 1568 | 0.881 | 65525 | 29777 | 0.388 |
O67854_AQUAE | 2A65A | 510 | 1.00E-03 | 12 | 1135 | 1043 | 0.825 | 4351 | 1657 | 0.818 |
OPSD_BOVIN | 1HZXA | 340 | 1.00E-20 | 7 | 1165 | 1151 | 0.803 | 40460 | 8873 | 0.782 |
Q87TN7_VIBPA | 3PJZA | 468 | 1.00E-10 | 12 | 1019 | 923 | 0.793 | 3340 | 1587 | 0.791 |
Q8EKT7_SHEON | 2XUTA | 456 | 1.00E-10 | 14 | 1055 | 1040 | 0.706 | 8196 | 2983 | 0.697 |
Q9K0A9_NEIMB | 3ZUXA | 308 | 1.00E-10 | 10 | 1024 | 1005 | 0.899 | 3928 | 1549 | 0.903 |
SGLT_VIBPA | 2XQ2A | 538 | 1.00E-05 | 15 | 1515 | 1380 | 0.820 | 8075 | 3177 | 0.784 |
TEHA_HAEIN | 3M71A | 306 | 1.00E-03 | 10 | 822 | 646 | 0.971 | 1503 | 735 | 0.948 |
URAA_ECOLI | 3QE7A | 407 | 1.00E-03 | 14 | 1371 | 1355 | 0.818 | 11244 | 3384 | 0.747 |
Statistics | ||||||||||
Mean | 376 | 2.42E-04 | 9.68 | 1142 | 1055 | 0.875 | 17865 | 4557 | 0.755 | |
Standard Deviation | 109 | 4.26E-04 | 3.07 | 244 | 309 | 0.080 | 18581 | 5691 | 0.218 | |
Maximum | 572 | 1.00E-03 | 15 | 1745 | 1722 | 0.979 | 65525 | 29777 | 0.955 | |
Minimum | 180 | 1.00E-40 | 5 | 684 | 451 | 0.652 | 1503 | 735 | 0.089 |
Protein names and Protein Data Bank IDs (PDBID) are accompanied by the length of the initial target sequence L, the optimal E-value used to generate the alignment (Opt. Eval) [22], the number of transmembrane helices predicted with SPOCTOPUS (TMhelix), in addition to the number of sequences in the created alignments (Malign), the effective number of alignment sequences, which takes into consideration sequence diversity (Meff), and the percent coverage of the initial sequence by the final alignment with columns containing over 30% gaps removed (Cov). The MSA-related data is included for both filtered and unfiltered datasets.
Coupling inference
In the interest of direct comparability with prior work, we have used the mean field DCA method employed by Morcos et al [17], which has been instrumental to successful folding approaches, such as EVfold [22, 26]. Therefore, for each of the alignments, we have re-weighted the sequences that share more than 80% identical amino acids, in the interest of reducing the bias of large homologous clusters on the inference process. Then we use the mean field principles to obtain the coupling matrix D for individual position pairs, which in this instance (naïve mean field DCA) reduces to computing an inverse of covariance matrix C-1.
Due to insufficient sampling of the sequence space, not all amino acid type pairs will be represented, therefore the covariance matrix is highly likely to be singular (non-invertible). In order to avoid this situation, Morcos et al. augment the covariance matrix with a high dose of pseudocounts (0.8 of the resulting weighted sum) derived from the known sequence database statistics.
The entries in the resultant coupling matrix D are not directly comparable between different positions in the protein. Therefore, in order to enable ranking of position pairs, we compute a Mutual Information-like score involving the individual positions in the matrix D (and their sums as proxies for single-site values), producing the scoring matrix S. Finally, the computed scores were mapped back to the original sequence coordinates (before gap-column removal) using the auxiliary script from HHsuite [27].
Membrane protein topology prediction
Global methods for coupling inference, especially based on the mean-field approach, when applied to contact prediction suffer from systematic overprediction for evolutionarily coupled, but spatially distant sites. One of such overpredictions involves membrane protein sites located at the interface between membrane and aqueous environment.
In order to alleviate these and related issues, as well as introduce, what we believe to be, crucially important piece of prior data, we have incorporated the information on the predicted membrane topology and topography into the contact prediction process. To do so, we have used SPOCTOPUS [28], a two-track membrane protein topology predictor, that incorporates a signal peptide pre-filter. None of the protein sequences in our data set included signal peptide, thus rendering the method we used functionally identical to OCTOPUS [29]. To further enhance the predictive performance, we moved beyond three-state predictions of (SP)OCTOPUS to a numerical indicator of each amino acids membrane location. Positions predicted to be on the inner-membrane as part of a loop are assigned a value of zero (consistent with a SPOCTOPUS prediction of “i”). Positions predicted to be outside of the membrane as part of a coil are assigned a value of one (consistent with a SPOCTOPUS prediction of “o”). Transmembrane helices are predicted as “M” in SPOCTOPUS and their subsequent encoding is determined by the position within the helix (from inner to outer membrane) and normalized based on the helix size, which in case of (SP)OCTOPUS is always 21. Therefore, the amino acid of the transmembrane span closest to the outside of the membrane will receive value of 1/22, second 2/22 and so on, until the one closest to the inside of the membrane receives value 21/22, differentiating it from the subsequent outer-membrane loop amino acid. We treated the few short re-entrant helices encountered as inner or outer-membrane loops.
Feature generation
It is not obvious what additional prior information may assist the contact prediction for membrane proteins. Therefore, we have explored 1,505 additional descriptors (features) for training the machine learning approaches. A schematic depiction of a descriptor input vector is provided in Fig 1 panel B. S1 Table lists all the descriptor categories considered.
The most basic category includes global descriptors, that is amino acid positions for sites i and j–indexed starting from one, the separation between i and j (i.e. |i-j|), and total sequence length.
The inferred co-evolution couplings, which constitute a major improvement of this method over the previous iterations, were included both in a direct and aggregate way. Coupling inference using alignments at different homology thresholds tends to reveal slightly different information about protein's evolutionary history [23], therefore we have used couplings derived from both filtered and unfiltered alignments, at “optimal” set of E-values, as well as from all E-values. For each of the coupling sets, we included the information on the individual strengths of evolutionary information by using a nine residue window around each of the amino acids in question. The window size has been selected to enable capturing the information on two turns of alpha helices, which are predominant building blocks of MPs. In addition to raw coupling strengths, we have also incorporated the maximum, mean, standard deviation, normalized mean (see S1 File) and sum, both window-wise and protein-wise.
Sequence information descriptors comprise windows of biochemical properties of amino acids surrounding sites i and j. These include volume, hydrophobicity, steric parameter, polarizability, isoelectric point, and the Basic Local Alignment Search Tool (BLAST)-derived PSSM (sequence profile) for each site in the window pair. These sequence descriptors together with global descriptors were used previously in the BCL to predict long-range contacts Descriptor set includes also the statistics on the multiple sequence alignments used for the coupling inference, that is the length of the aligned target sequence including gaps, the depth of the alignments (raw number of non-identical sequences) and the effective depths of the alignments (Meff S1 File) which allows to account for sequence redundancy, as well as target sequence coverage for the given MSA.
For prediction of a secondary structure element (SSE) propensity (helix/strand/coil), as well as the membrane state (membrane/transition/solution) for each amino acid position, we used JUFO9D [30]. It is a newer, internally developed multitrack ANN algorithm. Each individual ANN is trained on certain SSE type interactions, to improve accuracy. The SSE assignments for both sites were supplemented with SSE index difference, a descriptor indicating how many SSEs are between these amino acids (if both amino acids are in the same SSE, the descriptor equals 0, if they are in neighboring SSEs– 1 etc.). Finally, we have supplemented this information with position-aware descriptors. They are related to the size of SSEs containing amino acids in question and contain their distance from the center of respective SSEs. In particular, we have encoded the vertical membrane position of considered amino predicted by using this approach.
Feature selection and learning the statistical model
In order to avoid overfitting to the training data and improve robustness of the methods, we have iteratively pruned the set of descriptors in order to find a minimal set of descriptors, that captures the information essential for MP contact prediction. The pruning was achieved by ranking descriptors by F1-score (harmonic mean of precision and recall, S1 File) and information gain (the difference of the information entropy after partitioning the set using the given descriptors, Kullback-Leibler divergence–S1 File).
The statistical models trained for the purpose of this work were obtained in a two-step process. In order to ascertain the most suitable machine learning approach, we have trained both ANNs and DTs. For both approaches, we have first performed initial training with all the potential descriptors. To conduct the training, we have performed five-fold cross-validation, by splitting the proteins randomly between training, monitoring and independent (testing) set, of 15, 5 and 5 proteins respectively. No data from the independent set was ever used to optimize parameters or train the machine learning model. We were thus able to optimize the method using entirety of our protein set, while at the same time avoiding cross-contamination, as no data points appeared in more than one of the data set simultaneously. As protein pairs 3GD8A/1YMGA and 2RH1A/1HZXA share non-negligible sequence similarity, they have been always included together in the same data set.
Initial stages of training used root mean square error (RMSE) as the optimization criterion. As DTs are prone to overfitting, we have forbidden splitting the nodes (making tree deeper), if the size of the node was less than 20 data points. For ANN learning, we have established the α (learning momentum) and η (gradient descent contribution) parameters by grid search in cross-validation regime. The values that provided the best results were η equal to 0.000017 and α equal to zero.
For the DT-based approach, the first round of optimization has used F1-score (harmonic mean of precision and recall) and information gain (Kullback-Leibler divergence), which has demonstrated that only 210 of the descriptors considered contributed significantly to the predictive performance.
This set of descriptors have been then iteratively pruned in respect to the measured input sensitivity [31, 32]. We then iteratively rescored the top remaining descriptors (starting from this set of 210) with input sensitivity using the best run from the previous stage. The best threshold was set as the new top descriptor threshold. To decrease the likelihood of removing useful descriptors, no more than half were removed at each stage. Thus, we subsequently examined the top 160, 130, and 70 descriptors. In the third iteration (top 70 descriptors), approximately top 30 descriptors resulted in the best performance. We used the average of the enrichment as the objective function, where enrichment is the fold increase in positives. We evaluated each set of models generated by calculating the integral of the precision over the range 0.01% to 0.55% of the fraction predicted positive. This range closely captures the contacts predicted for the top L predictions across all proteins while avoiding the noise present below 0.01%. The small number of data points results in drastic changes from small perturbations in overall predictions below 0.01%.
In case of ANN based approach, we use the method for descriptor selection that has already been introduced and tested in BCL. We approximate the derivative of the effect of each feature column on the results, as measured by the weights of trained ANN. This derivative is then used in two ways to score descriptors: consistency of effect (i.e. does the derivative’s sign remain constant across all models in the cross validation ensemble, thus consistently increaseing/decreasing the emitted likelihood of contact) and the mean of the square of derivative. To compute the latter metric, we rescale each derivative between 0–0.5, sum the scaled values and square their values. The underlying rationale is that insignificant, noisy features will have low weights, roughly equal to zero. However, it is possible for the non-noise features to have a non-linear relationship to the output, in which the mean weight (and the mean derivative) would be close to zero, but most of the weights would not, which is the situation this metric aims to detect.
After selecting the optimal set of parameters and descriptors, we conducted final training, using the same procedure, but substituting average enrichment in place of RMSE.
BCL::Fold protein structure predictions
BCL::Fold creates each model through a Monte Carlo optimization with two stages. It begins model assembly by either placing, removing ir performing large SSE-based moves. The second stage focuses on the refinement of the generated models and utilizes small amplitude SSE translations and rotations to arrive at a final model. After each SSE move, models are scored using knowledge-based potentials that examine MP topology, environment prediction accuracy, SSE alignment, radius of gyration, amino acid environment, contact order, amino acid clashes, and a loop score [33–35].
The knowledge-based potential, BCL::Fold may be augmented with prior knowledge, in case of this work—contact information. As predicted contacts are imperfect, additional terms should encourage meeting the restraints, but not punish large violations. To do so, we have used a smooth step function with two parameters or “threshold values”. If restrained amino acids' carbon alpha atoms are not further than 8Å (upper bound of the contact range), restraint function confers a -1 bonus to the score. The other parameter is the width of transition region, 12Å, a maximal distance above which restraint does not affect the score anymore. The values of restraint function between these two thresholds are dictated by a smooth, monotonic function, characterized by a zero derivative at threshold points—a the negative of the sine function in range [0, pi/2] where the output decreases from 0 to the maximal penalty of -1. The penalty plateaus as separation approaches 20Å, thus limiting the influence of sites well beyond the contact range.
Here we used the mean of the predicted contact propensities output by each of the five cross-validated classifiers (both for DT and ANN based approaches). For each protein, we have restrained the top L highest scoring amino acid pairs to be in contact and ran the folding protocol independently and in parallel, generating 1,000 models for each protein and machine learning approach, producing 50,000 total models.
As the main goal of this work is to demonstrate the effect of contact prediction on protein folding, we have used top 10 best models by RMSD100 to the native structure. Performing a second iteration of folding did not significantly alter results.
Results and discussion
Topology filtering DCA-derived contacts improves prediction precision
While evolutionary coupling inference has been shown to be successfully applicable to protein contact prediction [17, 18, 23, 24, 26, 36], not every highly coupled amino acid pair corresponds to close proximity in the native structure. This is due to three confounding factors in evolutionary coupling inference: (i) method limitations, (ii) missing data, and (iii) non-spatial coupling causes. The statistical methods used for coupling inference make certain assumptions about the model of protein evolution, including the assertion that co-evolution analysis can be reduced to a two-site inverse problem. While not unreasonable, there is no evidence that certain evolutionary effects could not be modelled more aptly as n-body interactions. To do so would require substantially more sequence data than currently available. The other assertion that these methods make is that the input multiple sequence alignment captures the evolutionary history of a protein accurately. It is a fact, that multiple sequence alignments available currently are not uniformly samples from evolutionary history and consequently lack the ancestral sequences, as well as potentially include sequences of proteins that are not homologous to the protein of interest. Finally, even given the most accurate method and infinite data, there may be other sources of evolutionary couplings, including but not limited to: homo-oligomers with evolutionary pressures across protein interfaces that confound intra-protein contact prediction, selection pressures across proteins with multiple receptor/signaling domains that are evolutionarily related but not in physical proximity and other functional constraints.
As demonstrated before [20, 24], use of prior knowledge on protein structure substantially improves the contact prediction accuracy. The initial approaches to using evolutionary couplings to solve protein folding problem filtered away the pairs that were implausible to form contact, which led to a substantial increase in expected folding accuracy [22]. For example, two sites predicted to be on either end of a transmembrane α-helix are very unlikely to be in contact. However, such a pair may yield a high coupling value if they are part of a receptor-signaling pathway, which constrain their evolution similar to the functional roles that rely on the physical proximity.
The transmembrane position descriptor (see Methods), that we have developed, attempts to encode this observation. Fig 2 demonstrates that this descriptor results in a very consistent gradient along a blue-white-red (0–1) spectrum (inner to outer membrane) with similar colors in close vertical proximity. The mostly perpendicular orientation of the α-helices within MPs simplifies vertical separation prediction. Amino acids with different colors are unlikely to be in spatial contact.
For the original mfDCA filtered and processed method to which we compare, we used a threshold of 0.35 predicted vertical separation units, above which couplings are discarded as implausible. This threshold appears to be optimal in regard to the final accuracy for all L-fractions including ranges ideal for contact-driven protein structure prediction.
Aggregate descriptors improve contact prediction with decision trees and artificial neural networks
Out of the 1,505 features assessed for the potential to improve protein contact prediction, by far the most promising were the aggregated, coupling-related descriptors. They ranked high in respect to F-score, information gain [37], and input sensitivity [31, 32, 38]. These included the normalized mean, window maximum and mean direct information (DI), and maximum and mean DI across larger sets of all calculated MSA, using various e-values, both with and without filtering. As demonstrated before [20, 23], the use of summary statistics combined with multiple sources of co-evolutionary information allows for separating spatially-related couplings from the ones arising due to the confounding factors introduced by suboptimal input alignments.
The descriptor which proved particularly useful is the normalized mean coupling value. It reduces the impact of discrepancies in coupling values between proteins by dividing the mean coupling score for a square sub-matrix by the mean coupling score across the entire protein sequence (S1 File). This descriptor implicitly captures a phenomenon employed by other methods, that is that pairs with high couplings corresponding to spatial proximity tend to have neighbors that have high coupling values as well, while this regularity is usually not present for the spurious ones. The size of a window is a compromise between capturing sufficient amount of information and generalizability. A window size of nine was chosen to capture two complete windings of an α-helix. Shorter and longer window sizes were tested but yielded worse results (data not shown).
In the final optimized set of 30 descriptors that we used for DTs (see Methods and S2 Table), 18 involved coupling values, out of which 12 were aggregate metrics. Moreover, 9 of these features are estimated to be of greater importance for prediction performance than the first “simple” co-evolution descriptor. The input sensitivity of the highest ranked aggregate descriptor is also almost 30 fold higher than that of descriptor based on DI alone.
The highest ranked descriptor is the correlation window maximum using filtered MSA created with the optimal E-values. Coincidentally, this set of DI values also performs best for naïve DI contact prediction. The second highest by input sensitivity is the sequence separation, which is partially due to positions closer in sequence being more likely to be in contact, but also captures the periodicity of certain contacts (e.g. helix-helix contacts, where contact between i,j implies lack of contact between i±2,j±2). The third highest-ranking descriptor is transmembrane separation predicted by the topology filter. The normalized window mean and maximum DI from the unfiltered optimized MSA are also ranked highly. Polarizability is the highest ranked traditional biochemical property descriptor but in this case it is an aggregated sequence mean and not just the polarizability for positions i and/or j.
Fig 3 displays the receiver operating characteristic (ROC) curves for DTs and ANNs for predictions across the entire benchmark set in comparison to naïve DI using filtered MSA at minimum separations of twelve. We also include results for a sequence separation of one. Most notably, the AUC is higher for both DTs (0.862) and ANNs (0.855) in comparison to naïve DI (0.611). However, the goal is to predict a small fraction of all theoretically possible contacts with especially high accuracy. DTs and ANNs outperform naïve DI for the key false-positive-rate range of 0.001 to 0.1.
Fig 4 compares prediction accuracy at a minimum separation of twelve between naïve DI using filtered and unfiltered MSA, processed DI on filtered MSA, and the best DT and ANN sets across varying top L fractions. DTs outperform all DI-only methods. The ANNs outperform all methods, including DT-based methods, for L-fractions of L/2 and higher. The ANN’s advantage increases with larger fractions of L. Full results are given in S3 Table. The best accuracy for each PDB ID and L-fraction is marked in bold. ANNs have the highest accuracy at 3L and a minimum separation of twelve for 19 of the 25 benchmark proteins.
To put these results in context of currently established contact prediction methods, we have calculated positive predictive values for predictions with PSICOV[39], CCMpred [40] and FreeContact [41], using the same alignments as used by the methods we describe (S4 Table).
Known contacts improve BCL::Fold structure predictions
Accurate contact restraints limit the fold search space thereby enabling more efficient conformational sampling. For methods such as BCL::Fold [34, 35], the smaller search space increases the likelihood of sampling native-like topologies. BCL::Fold is especially well suited to incorporating contact restraint information as the primary sequence is distilled down to SSEs with intervening loop regions removed during folding [42]. Thus, SSEs can move more easily to satisfy restraints. For the full-sequence single-chain models, reaching long-range restraints may be kinetically or energetically inaccessible, due to relatively large perturbations needed to overcome local minima. We can subsequently rebuild loops for the best scoring models, thus rendering fully-fledged models.
In order to illustrate the extent to which BCL::Fold may take advantage of contact restraints, we have attempted to recover the structures of proteins in our data set by folding them with a varying number of contact restraints derived from experimentally determined structures. As expected, increasingly large L-fractions of correct restraints improve results (S1 Fig). For the beta2-adrenergic G protein-coupled receptor (PDB: 2RH1A) the best model sampled improved from 4.5Å to 2.5Å. For aquaporin 4 (PDB: 3GD8A) and cytochrome C oxidase (PDB: 1OCCA) the effect is less dramatic but present. This shift to lower RMSD100 values across models persists across the benchmark set and represents the positive control.
Predicted contacts improve BCL::Fold structure predictions
Having determined the upper limit of improvement, we compared the results of folding using the positive control restraint sets to the results obtained using the DI, DT, and ANN predicted contacts (Fig 5). All predictions were made using models with training sets that excluded data from the structures to be predicted. The usage of a set of diverse alpha-helical membrane proteins limits our possible set of training inputs to relatively few proteins as there are relatively few membrane protein families of known structure. We are further limited to proteins with alignments of sufficient depth to calculate DI for our coevolution-based contact prediction. While a larger training set would be desirable, we posit that the results we present are generalizable to other proteins.
The negative control shows the performance of BCL::Fold without contact restraints. All predicted contacts shift models towards lower RMSD, although not to the same degree as the improvement achieved with only correct contacts. However, the shift seen for cytochrome C oxidase (PDB: 1OCCA) is very close to the positive control. This is likely due to its especially high contact prediction accuracy. Both machine learning methods perform noticeably better than the DI based methods for beta2-adrenergic G protein-coupled receptor (PDB: 2RH1A). This is due to the methods’ automatic filtering of false positives between residues from opposite sides of the membrane. This improves accuracy from 12.4% (DI) to 30.7% (ANN) for the top L/2 contacts at a minimum separation of twelve.
Correct contacts result in the greatest average improvement in RMSD100 by 4.1Å. Predicted contacts all provide improvement over the negative control, but are not statistically significantly different from one another as determined by the Wilcoxon signed-rank test (Fig 6). However, the distribution of improvements as gauged by the quartiles does appear to be consistently higher for ANNs. In addition, the average improvement is highest for results from structures predicted using the contacts from the best ANNs (1.7Å) compared to the second best, processed DI also at 1.7 Å. Tables 2 and 3 include full benchmark folding results across all categories. The mutual relationship of the folding results is invariant with respect to chosen structural similarity metric, for results using GDT-TS and TMscore, see S5 Table.
Table 2. Full benchmark folding results for controls and predicted restraints from the best methods—DI filtered, negative control, and positive control (Top 10 average RMSD100 Å).
PDBID | DI Filtered MSA Stats | Negative Best RMSD100 | Pos. Control 2L ms 12 | |||||||
---|---|---|---|---|---|---|---|---|---|---|
L | TMhelix | Meff | Cov | Best | Top 10 Avg | S.D. | Best | Top 10 Avg | S.D. | |
3NCYA | 422 | 12 | 1205 | 0.89 | 7.90 | 8.56 | 0.28 | 2.53 | 2.78 | 0.11 |
2RH1A | 442 | 8 | 451 | 0.65 | 5.52 | 6.17 | 0.27 | 2.22 | 2.44 | 0.09 |
1OKCA | 292 | 6 | 1043 | 0.89 | 6.45 | 7.61 | 0.42 | 3.56 | 3.89 | 0.17 |
1XQFA | 362 | 11 | 1021 | 0.96 | 7.46 | 7.75 | 0.16 | 2.34 | 2.45 | 0.07 |
3GD8A | 223 | 7 | 1062 | 0.96 | 6.25 | 6.58 | 0.20 | 2.88 | 3.08 | 0.13 |
1L7VA | 324 | 10 | 1045 | 0.91 | 6.53 | 7.37 | 0.47 | 2.90 | 3.12 | 0.11 |
3MKTA | 460 | 12 | 1068 | 0.92 | 6.33 | 7.93 | 0.57 | 3.86 | 4.17 | 0.20 |
3RKON | 473 | 14 | 1722 | 0.83 | 7.80 | 8.30 | 0.29 | 2.86 | 3.36 | 0.21 |
1OCCA | 514 | 12 | 754 | 0.98 | 7.33 | 7.71 | 0.23 | 2.32 | 2.85 | 0.32 |
1OCCC | 261 | 6 | 521 | 0.96 | 7.44 | 8.00 | 0.21 | 5.36 | 5.59 | 0.08 |
1PP9C | 379 | 8 | 581 | 0.92 | 7.15 | 7.59 | 0.22 | 3.56 | 4.04 | 0.27 |
3H90A | 283 | 6 | 1039 | 0.97 | 6.62 | 7.26 | 0.37 | 3.17 | 3.84 | 0.24 |
3B45A | 180 | 6 | 1073 | 0.87 | 6.30 | 6.53 | 0.12 | 3.46 | 3.66 | 0.10 |
1PW4A | 434 | 12 | 1604 | 0.88 | 6.98 | 7.57 | 0.33 | 2.46 | 2.82 | 0.21 |
3DHWA | 203 | 5 | 1065 | 0.88 | 6.94 | 7.52 | 0.30 | 5.33 | 5.51 | 0.11 |
1YMGA | 233 | 7 | 1010 | 0.9 | 6.42 | 6.64 | 0.24 | 2.84 | 3.08 | 0.12 |
3B60A | 572 | 6 | 1568 | 0.88 | 9.37 | 10.01 | 0.33 | 5.56 | 6.27 | 0.40 |
2A65A | 510 | 12 | 1043 | 0.83 | 8.64 | 9.06 | 0.24 | 2.20 | 2.63 | 0.15 |
1HZXA | 340 | 7 | 1151 | 0.8 | 5.77 | 6.28 | 0.23 | 2.08 | 2.16 | 0.06 |
3PJZA | 468 | 12 | 923 | 0.79 | 7.69 | 8.59 | 0.36 | 4.14 | 4.82 | 0.35 |
2XUTA | 456 | 14 | 1040 | 0.71 | 7.74 | 8.31 | 0.34 | 3.18 | 3.74 | 0.26 |
3ZUXA | 308 | 10 | 1005 | 0.9 | 7.11 | 7.61 | 0.22 | 2.66 | 2.79 | 0.08 |
2XQ2A | 538 | 15 | 1380 | 0.82 | 8.75 | 9.28 | 0.23 | 3.40 | 3.88 | 0.39 |
3M71A | 306 | 10 | 646 | 0.97 | 5.88 | 6.41 | 0.34 | 2.34 | 2.45 | 0.08 |
3QE7A | 407 | 14 | 1355 | 0.82 | 8.43 | 9.05 | 0.32 | 4.37 | 4.64 | 0.18 |
Avg | 376 | 9.68 | 1055 | 0.875 | 7.15 | 7.75 | 0.29 | 3.26 | 3.60 | 0.18 |
The table displays the average, best, and standard deviation with respect to the RMSD100 across the entire benchmark set for positive and negative controls and with contacts predicted using filtered DI at L/2 and minimum separation of 6 (additional methods included in Table 3). In addition, each result row has the length (L), number of transmembrane helices (TMhelix), the effective alignment depth (Meff), and target sequence coverage (Cov) matched to each PDBID. The positive control shown was the best performing run analyzed and utilized 2L contacts at a minimum separation of 12.
Table 3. Full benchmark folding results for controls and predicted restraints from the best methods (Top 10 average RMSD100 Å).
PDBID | DI Naïve L/2 ms 6 | DI Processed Filt L/2 ms 6 | Best DT 1L ms 12 | Best ANN 3L ms 12 | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Best | Top 10 Avg | S.D. | Best | Top 10 Avg | S.D. | Best | Top 10 Avg | S.D. | Best | Top 10 Avg | S.D. | |
3NCYA | 6.15 | 6.84 | 0.39 | 6.17 | 6.89 | 0.47 | 6.26 | 6.85 | 0.38 | 6.66 | 7.19 | 0.26 |
2RH1A | 5.84 | 5.98 | 0.08 | 5.49 | 5.86 | 0.23 | 4.26 | 4.47 | 0.10 | 4.24 | 4.60 | 0.21 |
1OKCA | 5.12 | 5.90 | 0.32 | 5.49 | 5.99 | 0.30 | 5.28 | 5.47 | 0.13 | 4.56 | 5.27 | 0.26 |
1XQFA | 5.87 | 6.81 | 0.51 | 5.76 | 6.53 | 0.38 | 5.81 | 6.46 | 0.31 | 6.01 | 6.70 | 0.27 |
3GD8A | 3.58 | 3.68 | 0.06 | 4.08 | 4.26 | 0.11 | 3.62 | 3.82 | 0.13 | 3.54 | 3.72 | 0.09 |
1L7VA | 5.19 | 6.34 | 0.54 | 5.71 | 6.33 | 0.33 | 6.09 | 6.91 | 0.41 | 6.31 | 6.82 | 0.24 |
3MKTA | 6.27 | 6.67 | 0.14 | 5.65 | 5.85 | 0.12 | 6.11 | 6.50 | 0.24 | 4.73 | 5.57 | 0.39 |
3RKON | 6.11 | 6.98 | 0.48 | 5.69 | 6.34 | 0.31 | 7.17 | 7.86 | 0.30 | 4.98 | 5.85 | 0.38 |
1OCCA | 3.94 | 4.23 | 0.22 | 3.84 | 4.16 | 0.20 | 4.45 | 4.79 | 0.17 | 4.46 | 5.06 | 0.27 |
1OCCC | 5.63 | 6.06 | 0.22 | 6.20 | 6.39 | 0.16 | 6.02 | 6.19 | 0.13 | 5.52 | 6.18 | 0.24 |
1PP9C | 5.26 | 5.63 | 0.21 | 5.30 | 5.62 | 0.20 | 5.04 | 5.51 | 0.23 | 6.41 | 6.81 | 0.22 |
3H90A | 4.78 | 4.90 | 0.09 | 5.71 | 5.80 | 0.05 | 5.05 | 5.33 | 0.14 | 4.79 | 4.95 | 0.09 |
3B45A | 4.29 | 4.54 | 0.12 | 4.64 | 4.93 | 0.17 | 4.43 | 4.63 | 0.15 | 4.70 | 4.82 | 0.10 |
1PW4A | 5.20 | 5.47 | 0.17 | 4.50 | 5.11 | 0.34 | 5.35 | 5.67 | 0.23 | 5.78 | 6.23 | 0.23 |
3DHWA | 6.41 | 6.63 | 0.11 | 5.81 | 6.30 | 0.21 | 6.88 | 7.19 | 0.13 | 5.91 | 6.30 | 0.21 |
1YMGA | 3.89 | 4.22 | 0.18 | 4.06 | 4.23 | 0.12 | 4.29 | 4.43 | 0.06 | 3.99 | 4.25 | 0.16 |
3B60A | 8.69 | 9.36 | 0.32 | 8.57 | 8.86 | 0.18 | 9.84 | 10.13 | 0.23 | 9.25 | 9.55 | 0.11 |
2A65A | 4.78 | 5.59 | 0.37 | 4.67 | 5.33 | 0.32 | 6.02 | 6.42 | 0.29 | 5.72 | 6.32 | 0.36 |
1HZXA | 3.95 | 4.14 | 0.16 | 3.31 | 3.61 | 0.17 | 3.64 | 4.12 | 0.20 | 3.28 | 3.60 | 0.16 |
3PJZA | 8.01 | 8.44 | 0.18 | 7.95 | 8.53 | 0.25 | 7.41 | 8.02 | 0.32 | 6.42 | 7.02 | 0.33 |
2XUTA | 8.34 | 8.84 | 0.28 | 8.19 | 8.47 | 0.16 | 8.00 | 8.27 | 0.12 | 6.73 | 7.50 | 0.42 |
3ZUXA | 4.88 | 5.14 | 0.14 | 4.50 | 4.84 | 0.21 | 4.87 | 5.17 | 0.21 | 4.33 | 5.31 | 0.53 |
2XQ2A | 8.27 | 9.51 | 0.48 | 9.12 | 9.32 | 0.10 | 8.44 | 9.08 | 0.35 | 8.38 | 8.96 | 0.27 |
3M71A | 4.69 | 5.19 | 0.28 | 4.56 | 5.53 | 0.35 | 4.88 | 5.31 | 0.18 | 4.57 | 5.42 | 0.37 |
3QE7A | 5.98 | 6.69 | 0.40 | 6.23 | 7.30 | 0.43 | 6.81 | 7.12 | 0.21 | 6.13 | 7.16 | 0.56 |
Avg | 5.64 | 6.15 | 0.26 | 5.65 | 6.09 | 0.24 | 5.84 | 6.23 | 0.21 | 5.50 | 6.05 | 0.27 |
The table displays the average, best, and standard deviation with respect to the RMSD100 across the entire benchmark set for contacts predicted from one of the following methods: naïve DI at L/2 and minimum separation of 6, processed, best DT at 1L and minimum separation of 12, or best ANN at 3L and a minimum separation of 12. In addition, each result row has the length (L), number of transmembrane helices (TMhelix), the effective alignment depth (Meff), and target sequence coverage (Cov) matched to each PDBID. The L-fraction and minimum separation for each method was independently optimized by sampling across L-fractions of L/10, L/5, L/2, 1L, 2L, and 3L as well as minimum separations of 6 or 12. Further alignment details are in Table 1. Additionally, the best single model and top 10 average results from all included contact prediction methods for each PDBID is bolded. Our method using ANNs has both the lowest average RMSD100 for the best model (5.50Å) as well as the lowest top 10 average across the benchmark (6.05Å).
Fig 7 shows how close the best model by RMSD100 replicates the native fold for 1HZXA (3.3Å). The topology is correct and there is substantial superimposition of model helices with those of the native structure. One expects some deviation as BCL::Fold uses idealized helices with limited bends or kinks. These models still require the addition of side chains as well as other refinement but the similarity between the predicted model and the native greatly simplifies refinement and final all-atom prediction.
The best model by RMSD100 (3.3Å) is aligned above to the native structure. The model was produced as part of a folding run of 1000 proteins using the top 3L contacts as predicted by the best ANN with a minimum sequence separation of 12. Helices line up well, as can be seen from both an in-membrane and above-membrane view. Deviation from the native structure is due in part to the use of idealized SSEs that do not bend.
Conclusion
Membrane protein de-novo structure prediction can be aided by predicted contacts, which is of particular importance given the challenges of experimental MP structure determination. The evolutionary methods for contact prediction, which use rapidly growing pool of protein sequence information can be expected to render more proteins (including MPs) amenable to analysis and hence improve our understanding of MP structure. Substantial improvement is possible by increasing the accuracy of contact prediction methods as evidenced by the correct contact results. One may attain further improvement by leveraging the confidence in each contact prediction or dynamically adjusting the contact restraints used based on scoring and confidence values. Since development of this method, new, improved methods for contact prediction have emerged [20, 36, 43], which allow to further increase the applicability of approaches similar to the one we described. Despite use of imperfectly predicted contacts, we observed a significant improvement in contact-restrained model accuracy, in comparison to the unrestrained ones.
Co-evolutionary methods for coupling inference suffer from systematic biases, which can be alleviated by machine learning. While this class of methods appears to work well for contact prediction, their goal only partially overlaps with the goal of contact prediction. Not all contacting residues co-evolve and—conversely—not all co-evolving residues contact. For membrane proteins, the most notable class of inferred couplings that do not correspond to contacts is the couplings between termini of transmembrane spans. Use of prior information, either on the stage of inference, or post-processing with machine learning does alleviate some of these “mispredictions”, thus improving the expected utility of these methods for protein structure prediction.
The accuracy of protein folding by BCL::Fold improves significantly by use of experimental and predicted contacts. As simulating protein folding by macromolecular simulations (molecular dynamics) is still largely unfeasible, the most promising results are to be achieved by stochastic methods that allow for sampling larger chunks of conformational space [34, 44]. The assembly of large structural elements, an approach implemented in BCL::Fold, permits for reaching native-like topologies of predicted models with a fewer computational steps. Use of sparse experimental information [42, 45] or prior information on protein structure—be it experimental or predicted—does further increase the likelihood of sampling and retaining the models with correct (close to the native) folds.
BCL::Fold’s protein structure prediction performance is best with L-fractions of L/2 and higher (S1 Fig). Machine learning methods achieve the best average benchmark accuracies for L-fractions most suitable for protein fold prediction using BCL::Fold. Thus, machine learning methods improve protein fold prediction.
We strongly believe that with improvement in evolutionary contact prediction methodology, increase of sequence information available and introduction of model refinement techniques to BCL::Fold (e.g. aggressive secondary structure bending, optimization of hydrogen bonding etc.), the approach described above will become a feasible strategy for accurate membrane protein structure prediction.
Availability
BCL::Fold is implemented as part of the BioChemical Library, a suite of software currently under development in the Meiler laboratory (http://www.meilerlab.org/index.php/bclcommons/show/b_apps_id/1). BCL software, including BCL::Fold, is freely available for academic use. Predicted contacts and top 10 structures are available at DOI 10.6084/m9.figshare.4600372. All source PDB files are available from the RCSB database http://www.rcsb.org/ and homologous sequences via BLAST https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Supporting information
Acknowledgments
We thank the Vanderbilt University Center for Structural Biology computational support team for hardware and software maintenance. We also thank the Vanderbilt University Advanced Computing Center for Research and Education for computer cluster access and support. The authors thank Thomas Lasko MD, PhD and Terry Lybrand PhD for their valuable input and Warren Lemon for his work on the implementation of decision trees within the BCL.
Data Availability
All relevant data is available at Figshare. DOI: 10.6084/m9.figshare.4600372. URL: https://figshare.com/s/71fef14b62e4d2aba9ad.
Funding Statement
Work in the Meiler laboratory is supported through the National Institutes of Health (National Institute of General Medical Sciences R01 GM080403 [https://www.nigms.nih.gov/Pages/default.aspx] (JM). This work was supported by Public Health Service award T32 GM07347 [https://www.nigms.nih.gov/Pages/default.aspx] from the National Institute of General Medical Studies for the Vanderbilt Medical-Scientist Training Program (PLT). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Dill KA, MacCallum JL, The protein-folding problem, 50 years on. Science, 2012. 338(6110): p. 1042–6. 10.1126/science.1219021 [DOI] [PubMed] [Google Scholar]
- 2.Raman P, Cherezov V, Caffrey M, The Membrane Protein Data Bank. Cell Mol Life Sci, 2006. 63(1): p. 36–51. 10.1007/s00018-005-5350-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Fox NK., Brenner SE, Chandonia JM, SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res, 2014. 42(Database issue): p. D304–9. 10.1093/nar/gkt1240 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al. , The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 2002. 58(Pt 6 No 1): p. 899–907. [DOI] [PubMed] [Google Scholar]
- 5.Tusnády GE, Dosztányi Z, Simon I, PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res, 2005. 33(Database issue): p. D275–8. 10.1093/nar/gki002 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ahram M, Litou ZI, Fang R, Al-Tawallbeh G, Estimation of membrane proteins in the human proteome. In Silico Biol, 2006. 6(5): p. 379–86. [PubMed] [Google Scholar]
- 7.Bakheet TM, Doig AJ, Properties and identification of human protein drug targets. Bioinformatics, 2009. 25(4): p. 451–7. 10.1093/bioinformatics/btp002 [DOI] [PubMed] [Google Scholar]
- 8.Graña O, Baker D, MacCallum RM, Meiler J, Punta M, Rost B et al. , CASP6 assessment of contact prediction. Proteins, 2005. 61 Suppl 7: p. 214–24. [DOI] [PubMed] [Google Scholar]
- 9.Ezkurdia I, Grana O, Izarzugaza JM, Tress ML, Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8. Proteins, 2009. 77 Suppl 9: p. 196–209. [DOI] [PubMed] [Google Scholar]
- 10.Izarzugaza JM, Grana O, Tress ML, Valencia A, Clarke ND, Assessment of intramolecular contact predictions for CASP7. Proteins, 2007. 69 Suppl 8: p. 152–8. [DOI] [PubMed] [Google Scholar]
- 11.Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A, Evaluation of residue-residue contact prediction in CASP10. Proteins, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Di Lena P, Nagata K, Baldi P, Deep architectures for protein contact map prediction. Bioinformatics, 2012. 28(19): p. 2449–2457. 10.1093/bioinformatics/bts475 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Fuchs A, Kirschner A, Frishman D, Prediction of helix-helix contacts and interacting helices in polytopic membrane proteins using neural networks. Proteins, 2009. 74(4): p. 857–71. 10.1002/prot.22194 [DOI] [PubMed] [Google Scholar]
- 14.Zhang H, Huang Q, Bei Z, Wei Y, Floudas CA, COMSAT: Residue contact prediction of transmembrane proteins based on support vector machines and mixed integer linear programming. Proteins, 2016. 84(3): p. 332–48. 10.1002/prot.24979 [DOI] [PubMed] [Google Scholar]
- 15.Karakaş M, Woetzel N, Meiler J, BCL::contact-low confidence fold recognition hits boost protein contact prediction and de novo structure determination. J Comput Biol, 2010. 17(2): p. 153–68. 10.1089/cmb.2009.0030 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Weigt M, White RA, Szurmant H, Hoch JA, Hwa T, Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci U S A, 2009. 106(1): p. 67–72. 10.1073/pnas.0805923106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A, 2011. 108(49): p. E1293–301. 10.1073/pnas.1111471108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Feinauer C, Skwark MJ, Pagnani A, Aurell E, Improving contact prediction along three dimensions. PLoS Comput Biol, 2014. 10(10): p. e1003847 10.1371/journal.pcbi.1003847 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Aurell E, The Maximum Entropy Fallacy Redux? PLoS Comput Biol, 2016. 12(5): p. e1004777 10.1371/journal.pcbi.1004777 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Jones DT, Singh T, Kosciolek T, Tetchner S, MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics, 2015. 31(7): p. 999–1006. 10.1093/bioinformatics/btu791 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE et al. , The Pfam protein families database: towards a more sustainable future. Nucleic acids research, 2016. 44(D1): p. D279–D285. 10.1093/nar/gkv1344 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS, Three-dimensional structures of membrane proteins from genomic sequencing. Cell, 2012. 149(7): p. 1607–21. 10.1016/j.cell.2012.04.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Skwark MJ, Abdel-Rehim A, Elofsson A, PconsC: combination of direct information methods and alignments improves contact prediction. Bioinformatics, 2013. 29(14): p. 1815–6. 10.1093/bioinformatics/btt259 [DOI] [PubMed] [Google Scholar]
- 24.Skwark MJ, Raimondi D, Michel M, Elofsson A, Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol, 2014. 10(11): p. e1003889 10.1371/journal.pcbi.1003889 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Mistry J, Finn RD, Eddy SR, Bateman A, Punta M, Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res, 2013. 41(12): p. e121 10.1093/nar/gkt263 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R et al. , Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS One, 2011. 6(12): p. e28766 10.1371/journal.pone.0028766 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Remmert M, Biegert A, Hauser A, Söding J, HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods, 2012. 9(2): p. 173–5. [DOI] [PubMed] [Google Scholar]
- 28.Viklund H, Bernsel A, Skwark M, Elofsson A, SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics, 2008. 24(24): p. 2928–9. 10.1093/bioinformatics/btn550 [DOI] [PubMed] [Google Scholar]
- 29.Viklund H and Elofsson A, OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics, 2008. 24(15): p. 1662–8. 10.1093/bioinformatics/btn221 [DOI] [PubMed] [Google Scholar]
- 30.Koehler-Leman J, Mueller R, Karakas M, Woetzel N, Meiler J, Simultaneous prediction of protein secondary structure and transmembrane spans. Proteins, 2013. 81(7): p. 1127–40. 10.1002/prot.24258 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Zurada JM, Malinowski A, Usui S, Perturbation method for deleting redundant inputs of perceptron networks. Neurocomputing, 1997. 14(2): p. 177–193. [Google Scholar]
- 32.Engelbrecht AP, Sensitivity analysis for selective learning by feedforward neural networks. Fundamenta Informaticae, 2001. 46(3): p. 219–252. [Google Scholar]
- 33.Woetzel N, Karakaş M, Staritzbichler R, Müller R, Weiner BE, Meiler J, BCL::Score—knowledge based energy potentials for ranking protein models represented by idealized secondary structure elements. PLoS One, 2012. 7(11): p. e49242 10.1371/journal.pone.0049242 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Karakaş M, Woetzel N, Staritzbichler R, Alexander N, Weiner BE, Meiler J, BCL::Fold—de novo prediction of complex and large protein topologies by assembly of secondary structure elements. PLoS One, 2012. 7(11): p. e49240 10.1371/journal.pone.0049240 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Weiner BE, Woetzel N, Karakaş M, Alexander N, Meiler J, BCL::MP-fold: folding membrane proteins through assembly of transmembrane helices. Structure, 2013. 21(7): p. 1107–17. 10.1016/j.str.2013.04.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt et al. , Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners. Plos One, 2014. 9(3). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kent JT, Information gain and a general measure of correlation. Biometrika, 1983. 70(1): p. 163–173. [Google Scholar]
- 38.Dekker JP, Fodor A, Aldrich RW, Yellen G A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments. Bioinformatics, 2004. 20(10): p. 1565–1572. 10.1093/bioinformatics/bth128 [DOI] [PubMed] [Google Scholar]
- 39.Jones DT, Buchan DW, Cozzetto D, Pontil M, PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics, 2012. 28(2): p. 184–90. 10.1093/bioinformatics/btr638 [DOI] [PubMed] [Google Scholar]
- 40.Seemayer S, Gruber M, Söding J., CCMpred—fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics, 2014. 30(21): p. 3128–30. 10.1093/bioinformatics/btu500 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kaján L, Hopf TA, Kalaš M, Marks DS, Rost B, FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics, 2014. 15: p. 85 10.1186/1471-2105-15-85 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Weiner BE, Alexander N, Akin LR, Woetzel N, Karakas M, Meiler J, BCL::Fold—protein topology determination from limited NMR restraints. Proteins, 2014. 82(4): p. 587–95. 10.1002/prot.24427 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Feinauer C, Skwark MJ, Pagnani A, Aurell E, Improving Contact Prediction along Three Dimensions. Plos Computational Biology, 2014. 10(10). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R et al. , ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol, 2011. 487: p. 545–74. 10.1016/B978-0-12-381270-4.00019-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Fischer AW, Alexander NS, Woetzel N, Karakas M, Weiner BE, Meiler J, BCL::MP-Fold: Membrane protein structure prediction guided by EPR restraints. Proteins, 2015. 83(11): p. 1947–62. 10.1002/prot.24801 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All relevant data is available at Figshare. DOI: 10.6084/m9.figshare.4600372. URL: https://figshare.com/s/71fef14b62e4d2aba9ad.