Abstract
Nuclear export signal (NES) motifs function as essential regulators of the subcellular location of proteins by interacting with the major nuclear exporter protein, CRM1. Prediction of NES motifs is of great interest in many areas of research, including cancer, but currently available methods, which are mostly sequence-based, suffer from high false positive rates because NES consensus patterns are very common in protein sequences. Finding a feature that can distinguish real NES motifs from false positives is therefore desirable to improve prediction power, but this is quite challenging when using sequence alone. Here, we provide a comprehensive table for the validated cargo proteins, containing the locations of the NES consensus patterns together with disorder propensity plots, known protein domain information, and predicted secondary structures. It could be useful for determining the most plausible NES region in the context of the whole protein sequence, and it suggests that some annotated regions may be non-binders. In addition, using the currently available crystal structures of CRM1 bound to various classes of NES peptides, we adopted, for the first time, structure-based prediction of NES motifs bound to the CRM1 binding groove. Combining sequence-based and structure-based predictions, we suggest a novel and more straightforward approach to identify CRM1-binding NES sequences by analysis of their structural prerequisites and energetic evaluation of their stability at the CRM1 binding site.
Subject terms: Computational biophysics, Protein structure predictions, Computational models, Molecular modelling
Introduction
Active transport between the nucleus and cytoplasm is an essential regulatory mechanism for many cellular proteins. As a major nuclear export factor, chromosome maintenance protein 1 (CRM1; or exportin-1, XPO1) mediates nuclear export of hundreds of distinct cargo proteins by recognizing short sequence motifs called Nuclear Export Signal (NES)1–3. CRM1 shuttles between the nucleus and the cytoplasm, binds cargo molecules at high RanGTP levels inside the nucleus, traverses the nuclear pore complex (NPC) as a ternary cargo–CRM1–RanGTP complex, and releases cargo into the cytoplasm upon hydrolysis of the Ran-bound GTP4. Since spatial re-localization of oncoproteins and tumor suppressor proteins is important in cancer cells, a better understanding of the NES can support basic research on this process as well as the discovery of anticancer agents5.
Classical NES motifs in the early studies were described as a cluster of hydrophobic residues, mostly leucines (hence also called Leu-rich NES), within a 10–15 residue-long sequence motif1,6,7. Many years of research on various export cargoes and randomization-and-selection screens showed that more residue types, such as Ile, Val, Met, and Phe, are also allowed at the hydrophobic positions of the CRM1-dependent NES signals8,9. These hydrophobic residues (Φ) are spaced with various patterns following the consensus Φ1-(x)2–3-Φ2-(x)2–3-Φ3-x-Φ4, where x denotes any amino acid. Later, structural studies of CRM1 bound to NES peptides revealed another hydrophobic pocket in CRM1 that can bind one more hydrophobic amino acid (Φ0)10,11. This site is less restricted to hydrophobic residues than the others. Until recently, the existing 11 consensus patterns were defined by the peptide library-based study9 and structural analyses of CRM1-NES complexes11–14. They consist of four to five hydrophobic residues (Φ0-Φ4; generally, L, I, V, M, and F), which are bound to the corresponding hydrophobic pockets (P0-P4) in CRM1. Based on the pattern of these Φ’s and spacing sequences, the NES motifs are classified as class 1a, 1b, 1c, 1d, 2, 3, and 4. Additionally, compared to these classes, some peptides bind in the opposite (−) direction, so that their Φ3-Φ4 positions are bound to P0-P1 (class 1-reverse)13. To date, X-ray crystal structures of CRM1 bound to NES peptides of the 1a, 1b, 1c, 2, 3, 4, and 1a-reverse classes have been solved. Depending on the class, the NES peptides showed distinct backbone conformations when binding to the central portion of the hydrophobic groove of CRM1. A one-turn helix in the middle is remarkably conserved among all classes and maintains a hydrogen bond with a Lys residue (Lys568) in human CRM114.
Modeling short motifs or patterns like NES is a major research area in bioinformatics. Since NES motifs are essential regulators of the subcellular location of proteins in relation to cancer, the cell cycle, cell differentiation, and other important aspects of molecular biology, prediction of NES motifs is of great interest but remains a challenge. To date, more than 300 experimentally identified protein cargoes have been recorded in databases such as validNESs15 and NESdb16, and over 1000 putative CRM1 cargoes were identified in a recent proteomics study17. Based on the ever-growing repertoire of CRM1 cargo proteins, many attempts have been made to employ machine learning approaches to decide whether a given sequence has a CRM1-dependent NES motif. Several computational tools, such as NetNES8, NESsential18, NESmapper19, LocNES20, Wregex21, and NoLogo22, have been developed to predict NES motifs. Most of them are sequence-based predictors that rely on consensus pattern matching and on the calculation of biophysical properties such as disorder propensity, secondary structure content, and solvent accessibility. To capture the diversity of NES sequences, the consensus patterns were generally applied in the form of regular expressions or position-specific scoring matrices (PSSMs). Unfortunately, NES patterns occur in a large portion of the proteome, so prediction based on these consensus patterns results in a high false positive rate. Since a functional NES needs to be solvent-exposed and not buried in a globular fold, Kırlı et al. applied these criteria together with pattern matching to identify NES motifs in a set of validated, new CRM1 cargoes and found that functional NES motifs still could not be identified in a significant portion of them17. Moreover, sequences of functional NES motifs appear to be more diverse than previously appreciated.
A large portion of experimentally defined NES regions does not match the current consensus patterns17. To reduce the high false positive rate, other biophysical features such as disorder propensity, secondary structure content, and evolutionary conservation were incorporated into machine learning algorithms like support vector machines (SVMs) or neural networks8,20. However, the false positive rates remain high. In addition to the ever-expanding NES patterns, which produce many false positives when used in NES prediction, the limited information about direct CRM1 binding of the annotated NES regions hinders the development of accurate predictors from the available data sets. Therefore, predicting NES motifs using only protein sequence information seems to have limitations, and combining it with structure-based prediction could be a new strategy for distinguishing NES motifs from false positives.
In this study, using validated cargo protein sequences in NESdb and validNES, we provide a comprehensive look-up table that contains the locations of the NES consensus patterns together with disorder propensity plots, conserved domain information, and predicted secondary structures. This information could be useful for determining the most plausible NES region in the context of the whole protein sequence and for suggesting that some annotated NES regions may be non-binders. In addition, for the first time, we adopted structure-based prediction of NES sequences bound to the NES binding groove of CRM1, using multiple crystal structures of CRM1-NES peptide complexes as templates. For several experimentally validated NES peptides and false positive ones, we calculated the relative binding energy of the sequence segments at the CRM1 binding pocket, and the reliability of these predicted binding energies was validated against experimental binding affinities. Combining sequence-based and structure-based predictions, we suggest a novel and more straightforward approach to identify NES sequences that bind directly to CRM1.
Results and Discussion
Deducing NES consensus pattern-matching sequences in candidate cargo proteins
Using the validated cargo protein sequences in NESdb and validNES (which have Leptomycin B (LMB) sensitivity data as evidence of CRM1 dependency), we extracted the NES consensus pattern-matching sequence segments based on the modified version of the Kosugi consensus16,20, as summarized in Fig. 1. All possible consensus patterns are recorded and prioritized by the empirical class priority (see Methods for details). Based on these criteria, 4226 consensus-matching segments were extracted for 318 cargo protein sequences. Among them, 463 segments were treated as candidate NES motifs because they occur in regions that overlap with experimental evidence, and 3763 were treated as false positives (FPs). The experimental NES regions of 54 cargo proteins do not match the current consensus and are not considered in this study. Also excluded are four cargo proteins with no reported NES regions and five cargoes with long reported NES regions (>25 residues) that do not have specific residues annotated. Among the consensus patterns, class 1a is the most abundant (41%), as expected. Notably, class 1a is observed more than twice as often in the candidate NES sequences as in the false positive sequences. Classes 1c, 2, and 3 follow with 14–15% each, class 1a-reverse accounts for 8.6%, and classes 1b, 1d, 4, and 1c-reverse are quite rare (Fig. S1).
A comprehensive look-up table of NES patterns in NES cargo proteins
For an NES motif to be accessible for CRM1 binding, it should not be located within a compactly folded protein domain. The NES motif may be located at the N-terminus, at the C-terminus, or within an unstructured region of an export cargo11. Therefore, for precise prediction of export signals, it is crucial to consider the location of a motif with respect to protein domains and disordered regions. For all possible NES consensus patterns that we extracted from the cargo proteins, we analyzed their relationship with ordered/disordered regions, known domains, and predicted secondary structures, and we provide the results as a comprehensive online table. For a given full protein sequence, we plotted the disorder propensity, the location of the known domains, the predicted secondary structures, and all possible NES consensus regions (Fig. 2). For a given entry, the information annotated in NESdb or validNES, such as evidence of CRM1 dependency, mutation data, and functional sequences or sites, is listed together. The locations of all NES consensus-matching segments are marked together with the experimentally validated regions (Fig. 2A, the bottom of the plot). The reference databases (NESdb, validNES, and UniProt), a protein visualization tool (ProViz)23, and the structure and model database (SWISS-MODEL repository)24 are linked for user convenience, and a filter for easy look-up is also provided. This table could be useful for determining the most likely NES region in the context of a whole protein sequence. The online table is accessible via: http://prodata.swmed.edu/nes_pattern_location/.
NES candidates in the disordered or ordered regions
Even if a sequence motif fits the NES consensus, a motif that is located deep in a globular fold can hardly bind to CRM1 unless the region unfolds. In some cases, it may be possible for the region to unfold and bind, but we assume that these cases are very limited. Also, short linear interaction motifs like NES motifs have been proposed to be locally disordered to facilitate dynamic interactions with their binding partners, and NES prediction algorithms have used disorder context to help distinguish correct NES motifs from false predictions18,20. However, NES motifs do not necessarily have to be located in fully disordered regions. Indeed, we observed that some NES candidates are located in fully disordered regions, whereas others are located next to ordered regions or in “boundary” regions. Therefore, we employed the disorder propensity as a pre-filter to remove only the segments located in “highly” ordered regions.
Various computational tools have been developed for analyzing the potential intrinsic disorder of protein sequences, and they have been quite successful owing to the clear association between disorder propensity and sequence features such as low complexity or low aromatic content. We utilized DISOPRED325 and SPOT-disorder26, which use profiles derived from alignments of homologous sequences to detect disordered regions, and IUPred2A27, which is much faster because it does not rely on sequence alignment. For some proteins, the predicted disordered regions differ considerably between programs. In order to define ordered and buried regions with high confidence, we applied a strict cutoff value (~0.1) to decide the order/disorder boundary (note that the default cutoff for disordered regions in most of these programs is ~0.5). If the disorder propensities of a residue predicted by both DISOPRED3 and SPOT-disorder are below 0.1, the residue is defined as being in a highly ordered region (the values predicted by IUPred2A are also recorded for reference).
As shown in Fig. 3A, 55% of the NES candidate motifs are located in the disordered region, and 37% are found in the boundary region between the ordered and disordered parts. Only 8% of the NES candidate motifs are located in the highly ordered region. Among the 361 candidate motifs, 37 segments (from 20 cargo proteins) are located in the highly ordered region and are therefore less likely to be accessible for CRM1 binding. For example, HDAC1 (UniProt ID: Q13547) has a reported NES motif supported by mutation data (L158A/L161A/L164A) for nuclear export28. This region fits classes 1c, 2, or 3, but it is located in the highly ordered region. The crystal structure of HDAC1 (PDB ID: 4bkx) shows that this segment is buried in the globular domain and seems unlikely to be accessed by CRM1 (Fig. 4A). Note that in its homolog HDAC5, the candidate NES motif (1081EEAETVSAMALLSVGA1096, class 1a) is located in the disordered region after the conserved Hist_deacetyl domain and was found to bind directly to CRM1. The corresponding region (after the Hist_deacetyl domain) in HDAC1 (358YLEKIKQRLFENLRMLP374, class 1c) could also be considered a possible NES motif of HDAC1. Table S1 lists the NES candidate motifs located in the highly ordered region, and Fig. 4A,C shows examples of these segments in the available 3D structures.
Among the false positives, 19% of the segments are located in the highly ordered region, a larger percentage than for the candidate NES motifs (note that far fewer segments fall in the ordered region than in the disordered region because we use a stringent cutoff for defining the ordered region). The false positives in the disordered and boundary regions account for 31% and 51%, respectively.
CDD domains and NES locations
To analyze the location of the candidate NES motifs with respect to conserved regions, we extracted conserved domain information for the cargo protein sequences using four different databases, i.e., SMART, Pfam, NCBI-curated, and the Conserved Domain Database (CDD). As shown in Fig. 3B, only 33% of the candidate NES regions are located in the middle of CDD domains, and 40% are in the boundary region. NES regions therefore do not have to be located within protein domains. Rather, known domains are often considered to form folding units that mask possible motifs from binding other proteins. Among the false positives, more than half are located in the middle of known domains, possibly because hydrophobic residues are commonly located in the protein core or within domains.
Secondary structure components of the NES peptides
Crystal structures of CRM1-bound NES peptides have been resolved for the classes 1a, 1a-reverse, 1b, 1c, 2, 3, and 4. They showed distinct backbone conformations that match their hydrophobic positions to the corresponding hydrophobic pockets in CRM1. Structural analysis, as well as secondary structure prediction of NES motifs, suggest that most NES motifs contain α-helices or helix-to-extended conformation12–14. The class 1d is also expected to have helix-strand, and other reverse (−) classes are likely the reverse of their (+) counterparts14. The common feature of the backbone conformations among the classes is one turn of helix at the region from Φ2 to Φ314.
In our analysis of the 361 candidate motifs, 36 segments (from 23 cargoes) have a β-strand conformation in the middle (the β-strand content of the middle part is >50%) (Table S2). Among them, 11 segments were confirmed to form β-strands in the available X-ray or solution structures. For example, NPM has two reported NES regions, but both are predicted to form β-strands in the middle of the segments. As shown in Fig. 4B, the two segments are both β-strands located in the middle of the jelly-roll fold. Indeed, both regions were also reported to be quite weak binders of CRM129, and the sequence of residues 42–61 failed to bind CRM1 in a GST-pulldown assay (Chook Lab, unpublished results; annotated in NESdb). The candidate NES region in TDP-43 is also located in β-strands within a folded globular RRM domain, and it was recently shown to be a non-binder of CRM1; instead, TDP-43 is exported by passive diffusion30. For six segments, there is no experimentally determined structure, but homology models show β-strands for the segments. For 17 segments, no structural information is available. For two segments, the conformations in the modeled structures (with sequence identities of 79% and 98%, respectively) are helical, reflecting the limitations of secondary structure prediction.
Evaluation of the stability of the NES peptides at the CRM1 binding groove based on structure modeling
Recent structural works of CRM1 complexed with various cargo sequences expand the possible consensus patterns13,14. Also, the NES-binding site in RanGTP-bound CRM1 is found to be quite rigid, and the peptides display CRM1-dependent NES activity only if their backbone conformations can place a sufficient number of the hydrophobic residues into the CRM1’s binding groove11. The adapting conformation of the peptides can be efficiently analyzed by structure-based modeling methods so that the application of the structural information can advance more accurate NES prediction.
Using the reported NES peptides with experimental binding affinities14,31 as a benchmarking set (Table 1), we evaluated the binding energy (Ebind) for a given peptide sequence at the CRM1 groove (see Methods for details). The binding energy can be regarded as the stability of the protein (CRM1)-peptide (NES) complex structure relative to the free protein and the free peptide. The lower the binding energy, the more likely the peptide segment is to bind CRM1. Multiple crystal structures of CRM1-NES peptide complexes (super PKI and MVM-NS2 for class 1a; FMRP-1b for class 1b; SNUPN for class 1c; FMRP and SMAD4 for class 2; HIV-Rev for the class 2-rev type; X11L2 for class 4; and CPEB4 for class 1a-reverse; class 1a templates can be used to fit class 3 NES peptides) were utilized as templates. The model generation and energy calculation process is summarized in Fig. 5A.
Table 1.
Protein | Class | NES sequence | KD (nM)§ | ref. |
---|---|---|---|---|
MVM NS2 | 1a | 77STVDEMTKKFGTLTIHD93 | 2 | 31 |
*super PKI | 1a | 34NLNELALKLAGLDINK49 | 4 | 31 |
PKI | 1a | 34NSNELALKLAGLDINK49 | 34 | 31 |
ADAR1 | 1a | 121RGVDCLSSHFQELSIYQ137 | 69 | 31 |
MEK1 | 1a | 28TNLEALQKKLEELELDE44 | 70 | 31 |
Pax | 1a | 264RELDELMASLSDFKFMA280 | 700 | 31 |
*CPEB4-R | 1a | 395RMIDILSSELSHMDFTR379 | 710 | 31 |
NPMmutA | 1a | 278MTDQEAIQDLCLAVEEVSLRK298 | 790 | 31 |
HDAC5 | 1a | 1081EAETVSAMALLSVG1095 | 1600 | 31 |
p73 | 1a | 364NFEILMKLKESLELMELVP382 | 2000 | 31 |
*hRio2-R | 1a | 405GKIEELAQNFETMEFSR389 | 2600 | 31 |
Stradα | 1a | 413GIFGLVTNLEELEVD427 | 10300 | 31 |
*FMRP-1b | 1b | YLKEVDQLRALERLQID | 3000 | 14 |
SNUPN | 1c | 1MEELSQALASSFSVSQDLNS20 | 12500 | 31 |
HPV E7 | 1c | 73HVDIRTLEDLLMGTLGIVC91 | 34000 | 31 |
HIV Rev | 2 | 73LQLPPLERLTLDC85 | 1180 | 31 |
FMRP | 2 | 424LKEVDQLRLERLQID438 | 2000 | 31 |
SMAD4 | 2 | 134ERVVSPGIDLSGLTLQ149 | 4600 | 31 |
mDia2 | 3 | 1157SVPEVEALLARLRAL1171 | 1600 | 31 |
CDC7 | 3 | 456QDLRKLCERLRGMDSSTP473 | 20000 | 31 |
X11L2 | 4 | 55SSLQELVQQFEALPGDLV72 | 1500 | 31 |
CPEB4 | 1a-R | 379RTFDMHSLESSLIDIMR395 | 800 | 31 |
hRio2 | 1a-R | 389RSFEMTEFNQALEEIKG405 | 2800 | 31 |
*PKImut1 (I47A) | — | 34NSNELALKLAGLDANK49 | 150000 | 31 |
*PKImut2 (L42A/L45A) | — | 34NSNELALKAAGADINK49 | 900000 | 31 |
†APC | 1a | 163AQLQNLTKRIDSLPL174 | (−) | 33 |
‡Cyclin D1 | 1a | 281VDLACTPTDVRDVDI295 | (−) | — |
APRIL | 1b | 106LEPLKKLECLKSLDL120 | (−) | 33 |
‡hTERT | 1c | 965KAGRNMRRKLFGVLRLKC982 | (−) | — |
DcpS | 1c | 136TEKHLQKYLRQDLRL150 | (−) | 33 |
Cdk5 | 2 | 133LINRNGELKLADFGL147 | (−) | 33 |
†FGF1 | 2 | 138THYGQKAILFLPLPV152 | (−) | 33 |
COMMD1 | 3 | 171ILKTLSEVEESISTL185 | (−) | 20 |
DEAF1 | 1a-R | 452SWLYLEEMVNSLLNTAQQ469 | (−) | 13 |
SGN5 | 1a-R | 221YALEVSYFKSSLDRKLL238 | (−) | 13 |
†COMMD1–2 | 1a-R | 164DEVKVNQILKTLSEVEES181 | (−) | 13 |
†ELF3 | 1a-R | 111RLVFGPLGDQLHAQLR126 | (−) | 13 |
*Engineered or mutated (underscored residues are the ones inserted or mutated).
†Do not fit the consensus in Fig. 1 (due to Pro or do not have a bulky residue at Φ3/4 in the class 1a-R).
‡Unpublished data.
§(−) means no binding determined by pull-down binding assay.
The final model structures showed that all classes were predicted well, with their Φ residues bound to the corresponding hydrophobic pockets (Fig. 5B). The calculated Ebind selected the right template for each class, so it can be used to find the most plausible class when multiple consensus patterns are found in one segment. The calculated Ebind values correlated quite well with the experimental KD values (Fig. 6, left; R2~0.63; Pearson’s r~0.79 with p = 2e-6). However, for the two PKI mutant peptides, which have extremely low binding affinities, the Ebind scores are not clearly distinguishable from those of weak binders such as SNUPN, SMAD4, and HPV-E7. For the PKI double mutant peptide, we found a large cavity at the binding interface with CRM1 (Fig. S3A), but this feature, which is clearly detrimental to binding, is not well reflected in the modeling process or the energy calculation. To penalize the interface cavity of the complex structure, the residue solvent accessibility (RSA) of key interface residues (Fig. S3B) is calculated using the NACCESS program32 and treated as another scoring term. The RSA-corrected Ebind score (EbindRSA) is obtained as EbindRSA = Ebind + w∙RSA, where w is the weight of the RSA term and is optimized to maximize the correlation (Fig. 6, middle). EbindRSA gave an improved correlation (Fig. 6, right; R2~0.73; Pearson’s r~0.86 with p = 5e-8).
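The weight optimization for the RSA term can be illustrated with a minimal sketch. It assumes that the correlation is measured as Pearson's r between the corrected score and log10(KD), and that w is chosen by a simple grid search; the function names, the grid, and the choice of log10(KD) are illustrative assumptions, not the procedure actually used in this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rsa_corrected(ebind, rsa, w):
    """EbindRSA = Ebind + w * RSA, evaluated per peptide."""
    return [e + w * r for e, r in zip(ebind, rsa)]

def fit_rsa_weight(ebind, rsa, log_kd, weights):
    """Grid search (an assumed strategy): return the (w, r) pair that
    maximizes Pearson's r between EbindRSA and log10(KD)."""
    best_w = max(
        weights,
        key=lambda w: pearson_r(rsa_corrected(ebind, rsa, w), log_kd),
    )
    return best_w, pearson_r(rsa_corrected(ebind, rsa, best_w), log_kd)
```

Given per-peptide lists of Ebind, RSA, and log10(KD) values, `fit_rsa_weight(ebind, rsa, log_kd, [i / 10 for i in range(51)])` would scan w from 0 to 5 in steps of 0.1. A positive correlation is maximized because higher (less favorable) scores are expected to accompany larger KD values.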
For comparison, several false positive sequences that fit the NES consensus but are experimentally validated as non-binders (determined by pull-down binding assay)13,33 were subjected to modeling with the same procedure. Interestingly, these false positives showed significantly higher Ebind scores, reflecting their low binding affinities at the CRM1 binding groove. Notably, peptides such as COMMD1 (164DEVKVNQILKTLSEVEES181) and ELF3 (111RLVFGPLGDQLHAQLR126) were not fitted to the right template (i.e., the lowest-Ebind complex is not the class 1a-R structure). This suggests that these sequences could be energetically unstable when their backbones adopt the conformation needed to place their hydrophobic residues into the CRM1 hydrophobic pockets. For the false positive peptides fitted to the right template (Fig. 7), the backbone conformation and the Φ residues may appear quite similar to those of the true positives; however, they showed inferior binding energies. In some cases, such as Cyclin D1 (Fig. 7A, middle) or FGF1 (Fig. 7C, right), the backbone conformation does not seem to be well maintained when the Φ side chains are presented into the pockets.
We expect that the merit of this structure-based, energy-based method is its ability to discriminate true positives from false positives with similar sequence patterns by analyzing energetic differences at the CRM1 binding site via full-atom modeling. This atomic-level energetic analysis cannot be deduced from sequence alone. From this perspective, our method suggests a novel approach to finding CRM1-binding NES motifs. We cannot ignore the fact that the interaction between CRM1 and a whole cargo protein can involve more than the CRM1-NES peptide interaction10; however, it is extremely difficult to consider extra contacts between CRM1 and the whole structure of a cargo, which may differ for each cargo. Based on our previous result showing that the strength of the CRM1-NES peptide interaction correlates with nuclear export activity31, we assume that predicting the energy between CRM1 and the NES peptide is a practical strategy.
To evaluate the performance, we compared our results with those of other sequence-based methods, i.e., NetNES8, NESmapper19, and LocNES20 (Figs S4–S20). Using the whole sequences of 17 proteins in Table 1, we extracted 19 positive cases (regions annotated as NES motifs in the NESdb or validNES database with mutational evidence) and 341 negative cases (non-NES regions that match a consensus pattern). As shown in Table S3, the Ebind score performs the same as LocNES in terms of recall (both predict 17 true positives out of 19 experimentally verified NES cases). On the other hand, Ebind outperforms LocNES in terms of specificity and false positive rate: Ebind recorded 23 false positives, while LocNES predicted nearly twice as many (40 cases). NetNES showed better specificity (true negative rate (TNR): 0.988) than our method (TNR: 0.933). However, its recall (sensitivity or true positive rate (TPR): 0.474) was much lower than that of our method (TPR: 0.895). Overall, our method compares favorably with these available methods. It effectively decreases false positives while maintaining a high recall, showing the best performance with respect to the balance of precision and recall (F1 score) and effectiveness (diagnostic odds ratio, DOR).
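The performance measures quoted above follow directly from the confusion-matrix counts. The sketch below recomputes them from the counts stated in the text (19 positives, 341 negatives, 17 true positives for both methods, and 23 or 40 false positives); the false negative and true negative counts are derived by subtraction, and the function name is an illustrative choice.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # recall / sensitivity
    tnr = tn / (tn + fp)                   # specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    dor = (tp * tn) / (fp * fn)            # undefined if fp or fn is zero
    return {"TPR": tpr, "TNR": tnr, "precision": precision, "F1": f1, "DOR": dor}

N_POS, N_NEG = 19, 341                     # benchmark sizes given in the text

# Counts stated in the text; FN and TN are obtained by subtraction.
ebind = confusion_metrics(tp=17, fn=N_POS - 17, fp=23, tn=N_NEG - 23)
locnes = confusion_metrics(tp=17, fn=N_POS - 17, fp=40, tn=N_NEG - 40)
```

With these counts, `ebind["TPR"]` and `ebind["TNR"]` round to the 0.895 and 0.933 reported above, and both the F1 score and the DOR are higher for Ebind than for LocNES because the two methods differ only in their number of false positives.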
Possibility of non-binders to CRM1 among the NES-annotated regions
Databases like validNESs15 and NESdb16 provide valuable information for NES research; however, defining CRM1-dependent NES regions is still a difficult task. The expanding NES patterns result in many false positives. Also, the lack of information showing direct CRM1 binding for many annotated NES regions prevents the development of accurate predictors using available data sets. Most published experimental studies focused on showing that a protein is an export cargo, by deletion of the whole region encompassing a candidate NES or by mutation of all the suspected hydrophobic residue positions. These perturbations are drastic and may affect structural stability and result in defects of functions other than CRM1 binding and nuclear export. Therefore, one should interpret the experimental data carefully to identify the CRM1-binding NES location, and it is always possible that regions annotated as experimentally validated are not in fact functional NES motifs. Indeed, some of the annotated NES regions were found in buried (highly ordered) protein domains (Fig. 4A,C). Some others can form β-strands in the middle of the segment (Fig. 4B,C), which would be rare in real NES sequences. Candidate segments that form β-strands and are located in the ordered region are observed in three cargoes: FAK (91RSEEVHWLHVDMGVSS106), MoKA (190KIQTLHLVGVNVPE203), and Sirt1 (423DEVDLLIVIGSSLKVRP439). We suggest that these segments are highly likely to be non-binders of CRM1 unless they unfold or change their conformations under specific conditions. Some cargo proteins might be exported following other events, such as binding to an NES-containing adaptor protein.
Even if a segment fits the NES consensus and also satisfies the location criteria, these criteria are still not enough to locate the real NES segments in the whole protein sequence (see yellow highlighted segments in the online table). We applied the Ebind calculation to all possible segments of the natural cargo proteins listed in Table 1. If a segment cannot form an energetically stable complex at the NES binding groove of CRM1, it is likely a non-binder of CRM1. As shown in Fig. 8, the NES candidates tend to have lower Ebind scores than the false positive segments. Among the seventeen cases, eleven have NES candidate motifs with the lowest Ebind, and four have NES regions with the second lowest Ebind, where the difference between the lowest and second lowest scores is usually marginal (less than 2). Although the data set used in the structure-based modeling is quite small, the resulting binding energy values can discriminate between CRM1 binders and false positives. This structure-based prediction method can be utilized as one of the features for finding real CRM1-dependent NES peptides in the pool of numerous false positive sequences.
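The ranking analysis described above can be expressed as a short sketch. It assumes that each consensus-matching segment of a cargo protein already has an Ebind score; the segment identifiers, the dictionary layout, and the function name are hypothetical, and the margin of 2 simply mirrors the "less than 2" difference mentioned in the text.

```python
def rank_candidates(scores, candidates, margin=2.0):
    """Rank annotated NES candidates among all scored segments of one protein.

    scores:     {segment_id: Ebind}, lower means more stable (hypothetical layout)
    candidates: segment ids that overlap experimental evidence
    Returns {segment_id: (rank, within_margin)}, where rank 1 is the lowest
    Ebind and within_margin tells whether the score lies within `margin`
    of the lowest score.
    """
    ordered = sorted(scores, key=scores.get)   # ascending Ebind
    best = scores[ordered[0]]
    return {
        seg: (ordered.index(seg) + 1, scores[seg] - best < margin)
        for seg in candidates
    }
```

A candidate that is ranked second but flagged as within the margin corresponds to the cases above in which the NES region has the second lowest Ebind with only a marginal difference from the lowest.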
Conclusion
In summary, we analyzed the structural prerequisites for CRM1-dependent NES motifs, i.e., accessibility (by locating disordered/ordered regions), adapting conformation (by predicting secondary structures), and stability at the binding site (by applying structure-based modeling to calculate binding energies). The comprehensive table, which includes all possible consensus patterns together with the disorder propensity plot, conserved domain information, and predicted secondary structures, provides valuable information for determining or correcting the most probable NES regions.
In light of the currently resolved crystal structures of CRM1-NES peptides of diverse classes, we modeled the CRM1-NES peptide complex structures and calculated the stability of the NES peptides at the CRM1 binding groove. The resulting binding energies correlate well with the experimental binding affinities, and we can distinguish real NES motifs from false positives even though both match NES consensus patterns. Also, we do not rely on the pattern of the input sequence but rather use the energy function to select the most energetically favorable class template. Therefore, if multiple patterns exist in one peptide segment, this energy calculation can be used to predict the conformation of the peptide when it binds to CRM1. Although the method can still be improved, this study provides a starting point for predicting NES motifs by combining sequence-based and structure-based approaches. Because our method relies on template-based modeling, it is difficult to adequately model NES motifs of classes other than those of the templates. Since newly discovered NES motifs often deviate from the established consensus patterns, more structural information is needed, not only to understand new consensus patterns and the NES-CRM1 binding mechanism but also to predict NES motifs more accurately.
Methods
Extraction of the NES consensus sequences
For the cargo proteins that have LMB sensitivity data annotated as evidence of CRM1 dependency in NESdb16 and validNES15, the NES consensus-matching sequence segments were extracted using the modified version of the Kosugi consensus16,20 (Fig. 1): Φ1-X{1,2,3}-Φ2-[^PW]{2}-Φ3-[^PW]-Φ4; Φ1-X{2,3}-Φ2-[^PW]{3}-Φ3-[^PW]-Φ4; or Φ1-X{2}-Φ2-X[^PW]{2}-Φ3-[^PW]{2}-Φ4 (a number in braces is the number of consecutive residues of that type; [^PW] is any of the 20 amino acids except Pro and Trp; Ala or Thr can be used only once at Φ1 or Φ2; X stands for any amino acid). If one segment, or segments in a similar region (difference between the starting residue numbers of the two segments <5), can be fitted to multiple patterns, all possible patterns are recorded but prioritized based on the following observations: (i) the class 1a pattern is the most frequently observed class in the validated NES sets, suggesting that it interacts more preferentially with CRM1 than other classes9,16,22; (ii) in the current NES databases, class 3 sequences are as prevalent as NES motifs of classes 1c and 213; (iii) classes 1b and 1d are found in only a few NES sequences, and the majority of the class 1d sequences overlap with the class 1a pattern in the validated NES sets9,13; and (iv) the reverse (−) forms of classes 3 and 4 appear to lack β-strands to hydrogen bond with the Lys residue and may not be ideal NES motifs14. The empirical class priority is defined as follows: (i) class 1a with five Φs (c1a-5) as priority 1; (ii) class 1a with four Φs (c1a-4) and classes 1a-R, 2, 3, and 4 as priority 2; (iii) classes 1a/1c with Thr or Ala in one of their Φ1 or Φ2 positions as priority 3; (iv) classes 1b, 1d, and 1c-reverse, and classes 2/3 with Thr or Ala in one of their Φ1 or Φ2 positions, as priority 4; and (v) classes 1b/1d with Thr or Ala in one of their Φ1 or Φ2 positions as priority 5. The extracted regions extend from one residue before Φ0 to two residues after Φ4 (or are shorter if located at the protein N- or C-terminus).
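The three consensus patterns above translate naturally into regular expressions. The following is a simplified sketch rather than the extraction code used in this study: it assumes Φ = L, I, V, M, or F, omits the allowance for a single Ala or Thr at Φ1 or Φ2, handles only the forward direction, and ignores Φ0, the flanking residues, and the class priorities. All names are illustrative.

```python
import re

PHI = "[LIVMF]"     # hydrophobic positions; the Ala/Thr allowance is omitted here
NOT_PW = "[^PW]"    # any residue except Pro and Trp
ANY = "."           # X: any residue

def _pattern(s1, s2, s3):
    """Join four hydrophobic positions with the three spacer sub-patterns."""
    return f"{PHI}{s1}{PHI}{s2}{PHI}{s3}{PHI}"

# Each variable-length spacer is expanded into fixed lengths so that every
# alternative alignment starting at the same residue is reported separately.
# Keys give the three spacer lengths between consecutive hydrophobic positions.
PATTERNS = {}
for n in (1, 2, 3):   # first pattern: X repeated 1, 2, or 3 times
    PATTERNS[f"{n}-2-1"] = _pattern(f"{ANY}{{{n}}}", f"{NOT_PW}{{2}}", NOT_PW)
for n in (2, 3):      # second pattern: X repeated 2 or 3 times
    PATTERNS[f"{n}-3-1"] = _pattern(f"{ANY}{{{n}}}", f"{NOT_PW}{{3}}", NOT_PW)
# third pattern: fixed spacers of 2, 3 (X followed by two non-P/W), and 2
PATTERNS["2-3-2"] = _pattern(f"{ANY}{{2}}", f"{ANY}{NOT_PW}{{2}}", f"{NOT_PW}{{2}}")

def find_consensus_segments(sequence):
    """Return sorted (start, spacing, segment) tuples; start is 0-based."""
    hits = []
    for spacing, pat in PATTERNS.items():
        # A lookahead lets finditer report overlapping matches.
        for m in re.finditer(f"(?=({pat}))", sequence):
            hits.append((m.start(), spacing, m.group(1)))
    return sorted(hits)
```

Applied to the PKI peptide from Table 1, `find_consensus_segments("NSNELALKLAGLDINK")` reports two overlapping matches, `LALKLAGLDI` and `LKLAGLDI`, which illustrates why one region can fit several patterns and why a priority scheme is needed.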
If the Φ2-Φ4 portion of the extracted region overlaps with experimental evidence (annotated as “mutations that affect nuclear export,” “mutations that affect CRM1 binding,” or “functional export signal” in NESdb, or annotated as “sites” in validNES), the region is considered a candidate NES. Otherwise, it is deemed a false positive.
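The labeling rule above amounts to an interval-overlap test. The following Python sketch shows one way to express it; the function names and the representation of annotations as inclusive (start, end) residue ranges are assumptions made for illustration.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two inclusive residue ranges share at least one residue."""
    return a_start <= b_end and b_start <= a_end


def label_segment(phi2_pos, phi4_pos, evidence_ranges):
    """Label a consensus match by comparing its Φ2-Φ4 span with annotations.

    evidence_ranges: iterable of inclusive (start, end) residue ranges taken
    from the database annotations (hypothetical input format).
    """
    for start, end in evidence_ranges:
        if intervals_overlap(phi2_pos, phi4_pos, start, end):
            return "candidate NES"
    return "false positive"
```

For example, a match whose Φ2-Φ4 span covers residues 41 to 46 is labeled a candidate NES if any annotated range includes at least one of those residues.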
Calculation of disorder propensity and definition of ordered regions
The disorder propensity of the cargo protein sequences is calculated using three different programs: DISOPRED3 (ref. 25), SPOT-disorder26, and IUPred2A27. For the DISOPRED3 and SPOT-disorder calculations, which are based on multiple sequence alignments, the uniref90_2015_01 database (ref. 34) is used to find homologs during the PSI-BLAST search35. In order to define ordered regions with high confidence, we applied a strict cutoff value (~0.1) to decide the order/disorder border lines (note that the default cutoff for disordered regions in these three programs is ~0.5). If the disorder propensities of a residue predicted by both DISOPRED3 and SPOT-disorder are below 0.1, the residue is defined as ordered (“O”). If not, the residue is recorded as potentially disordered (“D”). The values predicted by IUPred2A are also recorded for reference. The location of a sequence segment is determined by scanning the proportion of “D” or “O” residues in the segment and its flanking residues (20 residues on each side) (Fig. S2A). If “O” accounts for more than 90% of the segment and flanking regions, the location of the segment (loc_DISO) is defined as an ordered region (“ORD”). If “D” accounts for more than 90%, the location is defined as a disordered region (“DISO”). The remaining segments are considered to be located in the “boundary” region. Segments in the boundary region can be found at the ends of ordered regions, or they can lie within ordered regions in which some portion (>10%) has a disorder propensity higher than the cutoff value.
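The classification rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that a window dominated by “O” residues is labeled “ORD” and one dominated by “D” residues is labeled “DISO”, that segment coordinates are 0-based and half-open, and that the flanks are simply truncated at the sequence termini. The function names are hypothetical.

```python
def residue_states(disopred_scores, spot_scores, cutoff=0.1):
    """Mark a residue 'O' only if both predictors score below the cutoff."""
    return [
        "O" if d < cutoff and s < cutoff else "D"
        for d, s in zip(disopred_scores, spot_scores)
    ]


def segment_location(states, start, end, flank=20, threshold=0.9):
    """Classify the segment states[start:end] plus flanking residues."""
    lo = max(0, start - flank)          # truncate at the N-terminus
    hi = min(len(states), end + flank)  # truncate at the C-terminus
    window = states[lo:hi]
    frac_ordered = window.count("O") / len(window)
    frac_disordered = window.count("D") / len(window)
    if frac_ordered > threshold:
        return "ORD"
    if frac_disordered > threshold:
        return "DISO"
    return "boundary"
```

Because a residue is ordered only when both predictors agree, a single high score from either program is enough to mark it as potentially disordered, which keeps the “ORD” label conservative.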
Extraction of the conserved domain information of the cargo proteins
Conserved domain information for the cargo protein sequences was extracted using the Batch CD-search tool36. Four different databases, i.e., CDD (cdd v3.16), NCBI_Curated (cdd_ncbi v3.16), Pfam (oasis_pfam v3.16), and SMART (oasis_smart v3.16), were searched with an expect value threshold of 0.01. The results were retrieved in Concise mode.
Prediction of secondary structure
Secondary structures of the cargo protein sequences are predicted by PSIPRED Version 3.21 (ref. 37). During the PSI-BLAST search35 to find homologs, the uniref90_2015_01 database (ref. 34) is used. In the online table, the confidence level of the prediction is indicated by a color gradient from dark (high confidence) to light (low confidence).
Relative binding energy (Ebind) prediction
Nine crystal structures of CRM1 bound to various NES peptides, including MVM-NS2 (PDB ID: 6CIT31), super PKI (unpublished data), FMRP-1b (5UWO14), SNUPN (3GB812), FMRP (5UWJ14), SMAD4 (5UWU14), HIV-Rev (3NBZ11), X11L2 (5UWS14), and CPEB4 (5DIF13), were utilized as templates. For the CRM1 part, we extracted residues 479 to 655 (scCRM1 numbering) to reduce the computation time. For potential NES peptides, the positions from Φ0 − 1 to Φ4 + 2 were modeled (or a shorter segment if the sequence used in the experimental KD measurement is shorter). A given peptide sequence is fitted to the backbone coordinates of every template structure. By using the Rosetta backrub module38, the backbone conformations of the fitted NES peptide and the surrounding helices in CRM1 are sampled to generate 50 models (50,000 backrub Monte Carlo trials/steps were run for each model). Among them, the ten complex structures with the lowest energy are selected and then optimized by the Rosetta relax module39,40, which searches the local conformational space around the starting structure. The relaxation was carried out 50 times for each model (i.e., the total number of models for a given peptide sequence is 10 × 50 = 500 models) with the ‘-use_input_sc -ex1 -ex2’ flags for a more rigorous search. The backrub-modeled backbone conformation was constrained during the relaxation by applying the ‘-constrain_relax_to_start_coords’ flag. Structures of the CRM1 protein itself and of the free peptide are also modeled separately with the same process. The all-atom energy function REF15 in Rosetta v.3.9 was utilized for all calculations.
The binding energy (Ebind) is calculated as Ecomplex − Eprotein − Epeptide. The values of Ecomplex, Eprotein, and Epeptide are each the average of the 10 lowest energy values among the 500 models. For Epeptide, we used the lowest Epeptide among all the backbone-fitted models. Among the various template-fitted models, the one with the lowest Ebind score is selected. The Ebind scores were corrected with a solvent accessibility term calculated by the NACCESS v.2.1.1 program32, which calculates the atomic accessible surface defined by rolling a probe of a given size around a van der Waals surface. To penalize cavities at the interface between CRM1 and low-affinity binders (such as the PKI double mutant), the relative solvent accessibility (RSA) values of the hydrophobic residues at the interface (Fig. S3) were extracted and added to the Ebind scores with an optimized weight.
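The energy bookkeeping described above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the authors' script: the input layout (a dictionary of per-template energy lists) and the function names are hypothetical, and the solvent accessibility correction is left out because its weight is not given in the text.

```python
def mean_of_lowest(energies, n=10):
    """Average of the n lowest energies (all values if fewer than n)."""
    lowest = sorted(energies)[:n]
    return sum(lowest) / len(lowest)


def select_template(models):
    """Pick the template with the lowest uncorrected Ebind.

    models: {template_name: {"complex": [...], "protein": [...],
                             "peptide": [...]}}  (hypothetical layout),
    where each list holds the scores of the relaxed models.
    """
    # The free-peptide reference is the lowest Epeptide over all templates.
    e_peptide = min(mean_of_lowest(m["peptide"]) for m in models.values())
    scores = {}
    for name, m in models.items():
        e_complex = mean_of_lowest(m["complex"])
        e_protein = mean_of_lowest(m["protein"])
        scores[name] = e_complex - e_protein - e_peptide
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Using one shared Epeptide for all templates means that the ranking between templates depends only on Ecomplex − Eprotein, so the template choice reflects how well the peptide fits each binding mode rather than differences in the free-peptide reference.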
Acknowledgements
This work is funded by the Cancer Prevention Research Institute of Texas (CPRIT) Grants RP170170 (N.V.G. and Y.M.C.) and RP180410 (Y.M.C.), the National Institutes of Health Grant GM127390 (N.V.G.), and Welch Foundation Grants I-1532 (Y.M.C.) and I-1505 (N.V.G.). The authors acknowledge the Texas Advanced Computing Center (TACC; http://www.tacc.utexas.edu) at The University of Texas at Austin for providing HPC resources.
Author Contributions
N.V.G. conceived the presented idea and designed the research. Y.L. and J.P. developed the theory, performed the simulations, and analyzed the data. J.M.B. and Y.M.C. performed the experimental validation of the binding affinities and provided the structural data. Y.L. wrote the manuscript. Y.L., J.P., J.M.B., Y.M.C. and N.V.G. contributed to the interpretation of the results and revised the manuscript. N.V.G. supervised the entire study.
Data Availability
The datasets generated during and/or analyzed during the current study are included in this published article and available via: http://prodata.swmed.edu/nes_pattern_location/.
Competing Interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary information accompanies this paper at 10.1038/s41598-019-43004-0.
References
- 1. Fornerod M, Ohno M, Yoshida M, Mattaj IW. CRM1 is an export receptor for leucine-rich nuclear export signals. Cell. 1997;90:1051–1060. doi: 10.1016/S0092-8674(00)80371-2.
- 2. Fukuda M, et al. CRM1 is responsible for intracellular transport mediated by the nuclear export signal. Nature. 1997;390:308–311. doi: 10.1038/36894.
- 3. Ossareh-Nazari B, Bachelerie F, Dargemont C. Evidence for a role of CRM1 in signal-mediated nuclear protein export. Science. 1997;278:141–144. doi: 10.1126/science.278.5335.141.
- 4. Dickmanns A, Monecke T, Ficner R. Structural Basis of Targeting the Exportin CRM1 in Cancer. Cells-Basel. 2015;4:538–568. doi: 10.3390/cells4030538.
- 5. Kau TR, Way JC, Silver PA. Nuclear transport and cancer: From mechanism to intervention. Nat Rev Cancer. 2004;4:106–117. doi: 10.1038/nrc1274.
- 6. Fischer U, Huber J, Boelens WC, Mattaj IW, Luhrmann R. The Hiv-1 Rev Activation Domain Is a Nuclear Export Signal That Accesses an Export Pathway Used by Specific Cellular Rnas. Cell. 1995;82:475–483. doi: 10.1016/0092-8674(95)90436-0.
- 7. Wen W, Meinkoth JL, Tsien RY, Taylor SS. Identification of a Signal for Rapid Export of Proteins from the Nucleus. Cell. 1995;82:463–473. doi: 10.1016/0092-8674(95)90435-2.
- 8. la Cour T, et al. Analysis and prediction of leucine-rich nuclear export signals. Protein Eng Des Sel. 2004;17:527–536. doi: 10.1093/protein/gzh062.
- 9. Kosugi S, Hasebe M, Tomita M, Yanagawa H. Nuclear Export Signal Consensus Sequences Defined Using a Localization-Based Yeast Selection System. Traffic. 2008;9:2053–2062. doi: 10.1111/j.1600-0854.2008.00825.x.
- 10. Monecke T, et al. Crystal Structure of the Nuclear Export Receptor CRM1 in Complex with Snurportin1 and RanGTP. Science. 2009;324:1087–1091. doi: 10.1126/science.1173388.
- 11. Guttler T, et al. NES consensus redefined by structures of PKI-type and Rev-type nuclear export signals bound to CRM1. Nat Struct Mol Biol. 2010;17:1367–U1229. doi: 10.1038/nsmb.1931.
- 12. Dong XH, et al. Structural basis for leucine-rich nuclear export signal recognition by CRM1. Nature. 2009;458:1136–U1171. doi: 10.1038/nature07975.
- 13. Fung HYJ, Fu SC, Brautigam CA, Chook YM. Structural determinants of nuclear export signal orientation in binding to exportin CRM1. Elife. 2015;4:e10034. doi: 10.7554/eLife.10034.
- 14. Fung HYJ, Fu SC, Chook YM. Nuclear export receptor CRM1 recognizes diverse conformations in nuclear export signals. Elife. 2017;6:e23961. doi: 10.7554/eLife.23961.
- 15. Fu SC, Huang HC, Horton P, Juan HF. ValidNESs: a database of validated leucine-rich nuclear export signals. Nucleic Acids Res. 2013;41:D338–D343. doi: 10.1093/nar/gks936.
- 16. Xu DR, Grishin NV, Chook YM. NESdb: a database of NES-containing CRM1 cargoes. Mol Biol Cell. 2012;23:3673–3676. doi: 10.1091/mbc.E12-01-0045.
- 17. Kirli K, et al. A deep proteomics perspective on CRM1-mediated nuclear export and nucleocytoplasmic partitioning. Elife. 2015;4:e11466. doi: 10.7554/eLife.11466.
- 18. Fu SC, Imai K, Horton P. Prediction of leucine-rich nuclear export signal containing proteins with NESsential. Nucleic Acids Res. 2011;39:e111. doi: 10.1093/nar/gkr493.
- 19. Kosugi S, Yanagawa H, Terauchi R, Tabata S. NESmapper: Accurate Prediction of Leucine-Rich Nuclear Export Signals Using Activity-Based Profiles. Plos Comput Biol. 2014;10:e1003841. doi: 10.1371/journal.pcbi.1003841.
- 20. Xu DR, et al. LocNES: a computational tool for locating classical NESs in CRM1 cargo proteins. Bioinformatics. 2015;31:1357–1365. doi: 10.1093/bioinformatics/btu826.
- 21. Prieto G, Fullaondo A, Rodriguez JA. Prediction of nuclear export signals using weighted regular expressions (Wregex). Bioinformatics. 2014;30:1220–1227. doi: 10.1093/bioinformatics/btu016.
- 22. Liku ME, Legere EA, Moses AM. NoLogo: a new statistical model highlights the diversity and suggests new classes of Crm1-dependent nuclear export signals. Bmc Bioinformatics. 2018;19:65. doi: 10.1186/s12859-018-2076-7.
- 23. Jehl P, Manguy J, Shields DC, Higgins DG, Davey NE. ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences. Nucleic Acids Res. 2016;44:W11–W15. doi: 10.1093/nar/gkw265.
- 24. Bienert S, et al. The SWISS-MODEL Repository-new features and functionality. Nucleic Acids Res. 2017;45:D313–D319. doi: 10.1093/nar/gkw1132.
- 25. Jones DT, Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics. 2015;31:857–863. doi: 10.1093/bioinformatics/btu744.
- 26. Hanson J, Yang YD, Paliwal K, Zhou YQ. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics. 2017;33:685–692. doi: 10.1093/bioinformatics/btw678.
- 27. Meszaros B, Erdos G, Dosztanyi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018;46:W329–W337. doi: 10.1093/nar/gky384.
- 28. Kim JY, et al. HDAC1 nuclear export induced by pathological conditions is essential for the onset of axonal damage. Nat Neurosci. 2010;13:180–U163. doi: 10.1038/nn.2471.
- 29. Bolli N, et al. Born to be exported: COOH-terminal nuclear export signals of different strength ensure cytoplasmic accumulation of nucleophosmin leukemic mutants. Cancer Res. 2007;67:6230–6237. doi: 10.1158/0008-5472.Can-07-0273.
- 30. Pinarbasi ES, et al. Active nuclear import and passive nuclear export are the primary determinants of TDP-43 localization. Sci Rep-Uk. 2018;8:7083. doi: 10.1038/s41598-018-25008-4.
- 31. Fu SC, Fung HYJ, Cagatay T, Baumhardt J, Chook YM. Correlation of CRM1-NES affinity with nuclear export activity. Mol Biol Cell. 2018;29:2037–2044. doi: 10.1091/mbc.E18-02-0096.
- 32. ‘NACCESS’, computer program. (Department of Biochemistry and Molecular Biology, University College, London, 1993).
- 33. Xu DR, Farmer A, Collett G, Grishin NV, Chook YM. Sequence and structural analyses of nuclear export signals in the NESdb database. Mol Biol Cell. 2012;23:3677–3693. doi: 10.1091/mbc.E12-01-0046.
- 34. Suzek BE, et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi: 10.1093/bioinformatics/btu739.
- 35. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 36. Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004;32:W327–W331. doi: 10.1093/nar/gkh454.
- 37. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091.
- 38. Smith CA, Kortemme T. Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction. J Mol Biol. 2008;380:742–756. doi: 10.1016/j.jmb.2008.05.023.
- 39. Nivon LG, Moretti R, Baker D. A Pareto-Optimal Refinement Method for Protein Design Scaffolds. Plos One. 2013;8:e59004. doi: 10.1371/journal.pone.0059004.
- 40. Conway P, Tyka MD, DiMaio F, Konerding DE, Baker D. Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 2014;23:47–55. doi: 10.1002/pro.2389.