Identifying molecular features that are associated with biological function of intrinsically disordered protein regions

Taraneh Zarin; Bob Strome; Gang Peng; Iva Pritišanac; Julie D Forman-Kay; Alan M Moses

doi:10.7554/eLife.60220

. 2021 Feb 22;10:e60220. doi: 10.7554/eLife.60220

Identifying molecular features that are associated with biological function of intrinsically disordered protein regions

Taraneh Zarin ^1,^†,^✉, Bob Strome ¹, Gang Peng ¹, Iva Pritišanac ^1,², Julie D Forman-Kay ^2,³, Alan M Moses ^1,^✉

Editors: Michael B Eisen⁴, Michael B Eisen⁵

PMCID: PMC7932695 PMID: 33616531

Abstract

In previous work, we showed that intrinsically disordered regions (IDRs) of proteins contain sequence-distributed molecular features that are conserved over evolution, despite little sequence similarity that can be detected in alignments (Zarin et al., 2019). Here, we aim to use these molecular features to predict specific biological functions for individual IDRs and identify the molecular features within them that are associated with these functions. We find that the predictable functions are diverse. Examining the associated molecular features, we note some that are consistent with previous reports and identify others that were previously unknown. We experimentally confirm that elevated isoelectric point and hydrophobicity, features that are positively associated with mitochondrial localization, are necessary for mitochondrial targeting function. Remarkably, increasing isoelectric point in a synthetic IDR restores weak mitochondrial targeting. We believe feature analysis represents a new systematic approach to understand how biological functions of IDRs are specified by their protein sequences.

Research organism: S. cerevisiae

Introduction

Intrinsically disordered regions (IDRs) are increasingly appreciated as playing key roles in diverse aspects of cell biology (Forman-Kay and Mittag, 2013). Bioinformatics methods identify thousands of IDRs in eukaryotic proteomes (Dosztányi et al., 2005; Uversky, 2002), and methods have been developed to predict biophysical or structural behavior for specific subsets of IDRs, such as those that fold upon binding (Katuwawala et al., 2019), contain specific N- or C-terminal motifs (e.g., Chen et al., 2008; Chuang et al., 2012), or phase separate (e.g., Vernon et al., 2018, reviewed in Vernon and Forman-Kay, 2019). Although there have been recent advances in predicting function from disordered region sequences (reviewed in Uversky, 2020), for the vast majority of IDRs, it is currently difficult to obtain high-specificity predictions of biological function or to identify which amino acid residues within them are important for function (van der Lee et al., 2014). Progress toward these goals would facilitate designing assays and mutagenesis experiments to understand the function of IDRs, which is especially pressing for those IDRs that are mutated in disease (Vacic and Iakoucheva, 2012).

For folded regions, similarity to homologous protein sequences offers highly specific and diverse predictions of biological function (El-Gebali et al., 2019) and can point to key residues (Ondrechen et al., 2001). However, IDRs often show little sequence similarity detectable in alignments, and therefore, alignment-based methods (Davey et al., 2012; Nguyen Ba et al., 2012) reveal only a small part of the functional elements in these regions (Nguyen Ba et al., 2012). Recent studies indicate that despite little conservation in amino acid sequence alignments, IDRs contain sequence-distributed molecular features, such as biophysical properties, repeats, and short linear motifs, that are likely under natural selection and associated with biological function (Zarin et al., 2019; Zarin et al., 2017). This suggests that, using appropriate statistical methods, it should be possible to predict function for individual IDRs based on their amino acid sequences and to associate specific molecular features with biological functions. If successful, such an approach would allow general predictions of diverse functions for IDRs and generate testable hypotheses about molecular features that are important for those functions. This contrasts with current bioinformatics approaches that are designed to predict specific IDR subtypes or approaches that predict whole-protein function based on amino acid sequences (Lobley et al., 2007; Radivojac et al., 2013).

Here we describe a first general approach to predict diverse functions for IDRs and to identify molecular features that are important for those functions. Using this approach, we identify molecular features that are associated with functions such as subcellular organization and signaling. This allows us to confirm experimentally that molecular features are necessary for mitochondrial targeting signals and can be used to establish a mitochondrial targeting signal in a synthetic IDR. We also demonstrate how our approach can predict a role in transcription for the N-terminal IDR in an uncharacterized yeast protein Mfg1, based in part on evolutionary variation in polyglutamine repeats. Based on these examples, our work indicates a path forward for prediction of specific biological functions based on IDR sequences, and experimental characterization of molecular features that are necessary for IDR function.

Results

A model that predicts protein function from molecular features in IDRs

One of the main challenges of training statistical models to predict IDR function is the lack of systematic data at the IDR level. Functional annotations such as Gene Ontology (The Gene Ontology Consortium, 2019) or deletion phenotypes are at the protein level, but many proteins contain more than one predicted IDR. We therefore devised a statistical model that could combine the contributions of all IDRs in a protein (Figure 1, Figure 1—figure supplement 1). We assume that IDRs are associated with molecular feature vectors (e.g., the ‘evolutionary signatures’ that summarize the evolution of molecular features [Zarin et al., 2019]), and that functional annotations are binary at the protein level (we treat each annotation independently): either the protein is associated with the given annotation or is not. If we knew which IDR in the protein was responsible for the function, this would amount to a standard classification problem.

Figure 1. — Example inputs (function, molecular features) and outputs (FAIDR probability, association with function) are shown. Given a set of features (e.g., molecular features comprising evolutionary signatures, as shown) for one or more IDRs in a given protein and functional annotation information (e.g., whether or not the protein is annotated with mRNA binding function or nucleolus function, as shown), FAIDR outputs the probability that the IDR is associated with the given function as well as the strength and direction of the association with function for each molecular feature. Molecular features comprising the evolutionary signatures are represented by a heatmap scaling from blue (decreased mean or variance of the feature compared to our null expectation of IDR evolution) to yellow (increased mean or variance of the feature compared to our null expectation of IDR evolution). The association with function is represented by a heatmap scaling from blue (negative association) to yellow (positive association). For example, for nucleolus function (as shown), mean CKII consensus sites are strongly positively associated with this function (indicated with yellow), whereas for mRNA binding (also shown), mean threonine content is negatively associated with this function (indicated with blue).

Figure 1—figure supplement 1. — Example inputs (function, molecular features) and outputs (FAIDR probability, association with function) are shown. Given a set of features (e.g., molecular features comprising evolutionary signatures, as shown) for one or more IDRs in a given protein and functional annotation information (e.g., whether or not the protein is annotated with mRNA binding function or nucleolus function, as shown), FAIDR outputs the probability that the IDR is associated with the given function as well as the strength and direction of the association with function for each molecular feature. Molecular features comprising the evolutionary signatures are represented by a heatmap scaling from blue (decreased mean or variance of the feature compared to our null expectation of IDR evolution) to yellow (increased mean or variance of the feature compared to our null expectation of IDR evolution). The association with function is represented by a heatmap scaling from blue (negative association) to yellow (positive association). For example, for nucleolus function (as shown), mean CKII consensus sites are strongly positively associated with this function (indicated with yellow), whereas for mRNA binding (also shown), mean threonine content is negatively associated with this function (indicated with blue).

Since, in fact, we do not know which IDR is responsible for the functional annotation, we treat this as a hidden variable. To ensure that the model infers only a few important molecular features for each function, we use LASSO regularization (Hastie et al., 2015) and to infer the hidden variables, we use an expectation–maximization (E–M) algorithm (see Materials and methods). The model automatically determines which IDRs are likely to be responsible for each function and which features are most informative for each function (Figure 1). We refer to this model as FAIDR (Feature Analysis of Intrinsically Disordered Regions). Given IDR feature vectors, in principle, it can predict protein function, infer which IDR is responsible for a given function, and determine the molecular features that are important for each function.

IDR function can be predicted from protein-level annotations and IDR sequence properties

We trained FAIDR models using protein-level functional annotations from the Saccharomyces Genome Database (SGD) (Cherry et al., 2012) (see Materials and methods). As previously described (Zarin et al., 2019), to summarize the molecular features of IDRs, we used evolutionary signatures, vectors of Z-scores that summarize the deviations of 82 molecular features in disordered regions from the distribution of those features obtained from simulations of IDR evolution. For each molecular feature, we calculate the mean and (log) variance across a set of orthologous IDRs, which leads to a total of 164 features describing the evolution of these molecular features in each IDR. Here we wanted to use both the evolutionarily conserved and diverged parts of the disordered region to form the evolutionary signature. Therefore, in contrast to our previous work (Zarin et al., 2019), we did not constrain the simulations to preserve short conserved motifs (see Materials and methods for details). The evolutionary signatures that we calculated for the yeast proteome are available for download and visualization at http://www.moseslab.csb.utoronto.ca/idr/.

To demonstrate the power of FAIDR at predicting IDR-level function, we used two systematic, global screens for function associated with IDRs that are mapped to protein coordinates: mitochondrial N-terminal targeting signals identified in a proteome-wide screen (Vögtle et al., 2009) and high confidence Cdc28 phosphorylation sites (which point to kinase substrates that are key regulatory sites) curated from in vitro and in vivo studies (Lai et al., 2012). Importantly, these datasets are unseen by the model during training and testing, and because they are mapped to protein coordinates, they can be used to create IDR-level functional annotations (see Materials and methods). To further ensure that there is no ‘leakage’ between training and test data, we fit the model parameters (b, described above) to protein-level annotations from 80% of the proteome and evaluate the performance on the unseen IDR-level data for the 20% of proteins that were not part of the training dataset. We find that the held-out predictions from FAIDR do as well or better than state-of-the-art predictors of mitochondrial targeting signals and Cdc28 substrates that were specifically designed to predict these functions (Mitofates and Condens, respectively) (Fukasawa et al., 2015; Lai et al., 2012; Figure 2). This suggests that our general approach is capturing comparable functional information to previous function-specific predictors.

Figure 2. — Receiver operating curves (ROC) on a held-out 20% sample are shown for FAIDR trained on evolutionary signatures (blue) versus Mitofates (A) and Condens (B) (orange). Area under curve (AUC) is indicated in legend on bottom right for each method. The number of positive IDRs (Np) in the held-out 20% is indicated in the top left of each plot.

Figure 2—source data 1. Data for ROC curves including IDR coordinates, FAIDR probabilities for held out IDRs and ground truth labels from experimental data mapped to each IDR.
idr_name_protein_coordinates.txt defines IDR names as amino acid coordinates in each protein. heldout_idr_probability_cdc28_mito.csv is the FAIDR probability for IDRs that were not included in the training set (‘held-out IDRs’). total_idr_expt_db_overlap_lai_2009_cdc28.csv is the binary summary of whether or not each IDR is phosphorylated by Cdc28 according to experimental data mapped from protein coordinates to each IDR (data compiled by Lai et al., 2012). total_idr_expt_db_overlap_vogtle_2009_mitochondrial_nterm.csv is the binary summary of whether or not each IDR is a mitochondrial N-terminal targeting signal according to experimental data mapped from protein coordinates to each IDR (based on a proteome-wide experimental study by Vögtle et al., 2009).

elife-60220-fig2-data1.zip^{(125KB, zip)}

Figure 2—figure supplement 1. — Receiver operating curves (ROC) on a held-out 20% sample are shown for FAIDR trained on evolutionary signatures (blue) versus Mitofates (A) and Condens (B) (orange). Area under curve (AUC) is indicated in legend on bottom right for each method. The number of positive IDRs (Np) in the held-out 20% is indicated in the top left of each plot.

Figure 2—source data 1. Data for ROC curves including IDR coordinates, FAIDR probabilities for held out IDRs and ground truth labels from experimental data mapped to each IDR.
idr_name_protein_coordinates.txt defines IDR names as amino acid coordinates in each protein. heldout_idr_probability_cdc28_mito.csv is the FAIDR probability for IDRs that were not included in the training set (‘held-out IDRs’). total_idr_expt_db_overlap_lai_2009_cdc28.csv is the binary summary of whether or not each IDR is phosphorylated by Cdc28 according to experimental data mapped from protein coordinates to each IDR (data compiled by Lai et al., 2012). total_idr_expt_db_overlap_vogtle_2009_mitochondrial_nterm.csv is the binary summary of whether or not each IDR is a mitochondrial N-terminal targeting signal according to experimental data mapped from protein coordinates to each IDR (based on a proteome-wide experimental study by Vögtle et al., 2009).

elife-60220-fig2-data1.zip^{(125KB, zip)}

For functions and phenotypes with no IDR-level data (the vast majority), we tested whether we could identify molecular features predictive of IDR function by evaluating the area under the curve (AUC) in cross-validation on held-out proteins (see Materials and methods). We found a diverse set of protein annotations and phenotypes that could be predicted with reasonable power (overall average fivefold cross-validation AUC = 0.81, Figure 2—figure supplement 1, gray bars), ranging from well-characterized IDR functions such as signal transduction and mitochondrial targeting, to underappreciated IDR functions such as ribosome biogenesis and DNA damage (full list of functions in Supplementary file 1). Although some of these annotations and phenotypes are correlated, this analysis strongly supports the idea that there is rich functional information in IDR sequences (Zarin et al., 2019), and is evidence that the model is not overfitting.

Because evolutionary signatures can be technically difficult and time consuming to compute (Zarin et al., 2019), we also tested whether FAIDR could predict function from single-species profiles of molecular features. Indeed, we found that the 82 sequence features (Zarin et al., 2019) computed from Saccharomyces cerevisiae IDR sequences alone (scaled to unit variance and 0 mean) could also predict protein function (overall average fivefold cross-validation AUC = 0.77, Figure 2—figure supplement 1, unfilled bars). Although the overall power was significantly less than the evolutionary signatures (paired t-test over 23 classes p<0.001), for some IDR functions the predictive power was similar, which suggests that IDR function can be predicted from single-protein sequences. Identifying informative molecular feature descriptions for IDRs that are easier to compute than evolutionary signatures is a promising area for further research (see Discussion).

Exploring molecular features of IDRs associated with diverse functions and phenotypes

We next sought to determine which molecular features were associated with IDR function. To do so, we obtained the features that were assigned non-zero coefficients using FAIDR, and fit a standard logistic regression to that subset of features. We extracted the associated t-statistics and visualized them as a clustered heatmap (Figure 3A). Encouragingly, we found that related functions group together in this analysis (Figure 3A, indicated by symbols above function names). For example, consistent with their putative functions (Darling et al., 2018), proteinaceous membraneless organelles (PMLOs) (Figure 3A, unfilled square) clustered with several phenotypes and functions related to RNA (Figure 3A, filled square).

Figure 3. — (A) Hierarchical clustering of t-statistics obtained from regression of 23 functions and phenotypes on evolutionary signatures (means and variances of 82 molecular features). Symbols beside function/phenotype names indicate related functions/phenotypes. Indicated subgroups (**i–ix**) are referred to and described in the main text. (B) Examples of positive and negatively predictive molecular features for signaling (i) and PMLOs (vi-ix).

Figure 3—source data 1. Raw T-statistics used for cluster analysis and display.
The remaining molecular features (n = 73) and the associated functions/phenotypes (n = 23) were then hierarchically clustered using weighted uncentered correlation distance (‘cluster’ and ‘calculate weights’ checked off with cutoff = 0.1 and exponent = 1) on Cluster 3.0, using average linkage. The clustered t-statistics underlying Figure 3 (tstat_table_atleast1_over_3_v2.cdt) can be viewed/explored using Java Treeview.

elife-60220-fig3-data1.zip^{(9.2KB, zip)}

We note that the features that underlie these t-statistics represent the extent to which the mean or (log) variance of a sequence-distributed molecular feature (such as a physicochemical property or amino acid repeats) differs from our null expectation over evolutionary time (see Materials and Methods for details). It is relatively straightforward to interpret a positively predictive mean of a property such as hydrophobicity to indicate that elevated hydrophobicity is associated with a certain function. It is more difficult to interpret how the variance of a sequence-distributed property over evolution can be interpreted with respect to its function. For both mean and variance of sequence-distributed properties, we interpret deviations from our null expectation to point to the properties that are (positively or negatively) associated with function (see Discussion).

We found that several molecular features known to be associated with specific biological functions were recovered in this analysis. For example, signaling-associated functions such as Cdc28 kinase substrates, signal transduction, and protein kinase activity share an increase in KEN degradation boxes (Figure 3B,i), while Cdc28 substrates (as expected) are distinguished by increased CKS1 docking sites and CDK consensus sites (Figure 3B,i). The DNA damage response seems to share some features with the more general functions discussed above, but is also (as expected) associated with the presence of specific motifs: PCNA/Pip boxes, Mec1/Tel1 consensus sites (Lai et al., 2012), and NLSs (nuclear localization signals) (Figure 3A,ii). Signal transduction and protein kinase activity share increased histidine residues (Figure 3B,i), which we speculate is related to histidine kinase signaling. Interestingly, the opposite (a decrease in histidine residues) is predictive of essentiality (‘inviable’ phenotype) and nuclear pore localization (Figure 3A,iii). Other features that are consistent with known disordered region function include the presence of PRK1 consensus sites as a predictor of actin cortical patch function (Figure 3A,iv) (Sekiya-Kawasaki et al., 2003), and increased isoelectric point as a predictor of mitochondrial functions (Figure 3A,v) (reviewed in Jaussi, 1995).

We next focused on features predictive of PMLOs, as it is currently unclear whether the association of IDRs with these organelles is determined by amino acid sequence. As expected, we found the presence of RGG (Chong et al., 2018) and K/A/P repeats (van der Lee et al., 2014) among the strongest predictors for all these compartments (Figure 3B,vi). We also found a decrease in threonine residues associated with these, as well as all the RNA-related functions (Figure 3B,viii), which to our knowledge has not been previously reported. We were particularly interested in features that could discriminate between PMLOs. We found that stress granules are distinct from all other bodies, showing a decrease in the variance of basic residues and lysine repeats (Figure 3B,vii). Nucleoli and associated ribosomal functions are distinct in that they show increased CKII consensus sites (Figure 3B,vi). We also found that processing body (P body) proteins show an increased variance in Mec1 consensus sites, while mRNPs and mRNA binding proteins show a decrease in this feature (Figure 3B,ix).

Molecular features in the Cox15 IDR predictably affect mitochondrial targeting of Cox15

In order to directly test whether the features in our evolutionary signatures that are predictive of function are required for function (and not simply correlated with function), we sought to manipulate these features in an example IDR. For this, we chose the N-terminal IDR of the Cox15 protein, which is a known mitochondrial targeting signal (Vögtle et al., 2009). As in previous work (Zarin et al., 2019), there are several reasons that make this system ideal for testing our hypotheses about molecular features that underlie a biological function: (1) localization to mitochondria is a clearly observable phenotype and (2) mitochondrial precursor peptides contain well-characterized sequence features that enable their prediction as mitochondrial targeting sequences (Fukasawa et al., 2015) (see Discussion). Using FAIDR, we first identified the top predictive features for this IDR (Cox15 IDR 1, which spans amino acids 1–45 in Cox15) that indicate association with mitochondrial localization (‘mitochondrion’ GO annotation), and plotted the molecular feature vector (evolutionary signature) values corresponding to these top molecular features in Figure 4A. The top predictive molecular feature (a positive predictor) is mean isoelectric point (pI), which also has a relatively high Z-score in the Cox15 IDR 1 feature vector. This indicates that increased isoelectric point is not just a generally positive predictor for mitochondrial localization, but that elevated isoelectric point is present in the Cox15 IDR specifically. Another top positive predictor that is present in this IDR is increased hydrophobicity (Figure 4A).

Figure 4. — (A) Feature vector of Z-scores (evolutionary signature) of the Cox15 N-terminal IDR (Cox15 IDR 1) (y-axis) is plotted against predictive features (t-statistics) for mitochondrial function as determined by FAIDR (x-axis). Two of the top features associated with mitochondrial localization are mean isoelectric point (pI), circled in red, and mean hydrophobicity (H), circled in blue. (B) Amino acid sequences for each Cox15 IDR variant. Wild-type and simulated Cox15 IDR sequences are compared to the same sequence with mutations altering isoelectric point (pI) and hydrophobicity. Sequences with variable residues (gray) are visualized with Jalview (Waterhouse et al., 2009). Predictions from Mitofates (Fukasawa et al., 2015), a predictor of mitochondrial targeting sequences, are also shown: prediction of whether or not the given sequence is a mitochondrial pre-sequence (check mark), as well as the predicted location of the cleavage site in the wild-type and mutated sequences (red box), the regions predicted to have a high (purple box) or low (pink box) max positively charged amphiphilicity score, and region of the TOM20 (receptor) recognition motif (green box). (C) Micrographs showing the mitochondrial localization phenotype for different budding yeast strains that differ in their Cox15 N-terminal IDRs. Green shows GFP-tagged Cox15 localization. Left to right: wild-type IDR, Cox15 IDR with low pI, Cox15 IDR with low hydrophobicity, simulated IDR, simulated IDR with high pI. Scale bar represents 1 µm.

Figure 4—figure supplement 1. — (A) Feature vector of Z-scores (evolutionary signature) of the Cox15 N-terminal IDR (Cox15 IDR 1) (y-axis) is plotted against predictive features (t-statistics) for mitochondrial function as determined by FAIDR (x-axis). Two of the top features associated with mitochondrial localization are mean isoelectric point (pI), circled in red, and mean hydrophobicity (H), circled in blue. (B) Amino acid sequences for each Cox15 IDR variant. Wild-type and simulated Cox15 IDR sequences are compared to the same sequence with mutations altering isoelectric point (pI) and hydrophobicity. Sequences with variable residues (gray) are visualized with Jalview (Waterhouse et al., 2009). Predictions from Mitofates (Fukasawa et al., 2015), a predictor of mitochondrial targeting sequences, are also shown: prediction of whether or not the given sequence is a mitochondrial pre-sequence (check mark), as well as the predicted location of the cleavage site in the wild-type and mutated sequences (red box), the regions predicted to have a high (purple box) or low (pink box) max positively charged amphiphilicity score, and region of the TOM20 (receptor) recognition motif (green box). (C) Micrographs showing the mitochondrial localization phenotype for different budding yeast strains that differ in their Cox15 N-terminal IDRs. Green shows GFP-tagged Cox15 localization. Left to right: wild-type IDR, Cox15 IDR with low pI, Cox15 IDR with low hydrophobicity, simulated IDR, simulated IDR with high pI. Scale bar represents 1 µm.

In order to manipulate these sequence-distributed features (isoelectric point and hydrophobicity), we constructed strains with different Cox15 IDR 1 genotypes and molecular features (Figure 4B). We first created a mutant IDR with lower pI by mutating each arginine residue (R) in the wild-type IDR sequence to a glutamic acid (E), which lowered the pI of the wild-type IDR from 12.40 to 4.04 while only minimally changing the hydrophobicity (Figure 4B, top). We tested mitochondrial localization in all GFP-tagged Cox15 strains using co-localization with mitochondrial DNA as visualized by DAPI staining (as in Higuchi-Sanabria et al., 2016; Figure 4—figure supplement 1). As expected, the wild-type GFP-tagged Cox15 strain has a clear mitochondrial localization (Figure 4C, left; Figure 4—figure supplement 1). However, the ‘Cox15 IDR low pI’ strain no longer localizes to the mitochondria (Figure 4C, Figure 4—figure supplement 1). This provides evidence that the elevated pI is necessary for mitochondrial targeting function. Similarly, we created a mutant IDR with lower hydrophobicity by mutating each leucine residue (L) to asparagine (N), each valine (V) to lysine (K), and each isoleucine residue (I) to arginine (R) (Figure 4B, middle). This changed the three most hydrophobic residues into the three least hydrophobic according to the Kyte–Doolittle scale (Kyte and Doolittle, 1982), while only minimally affecting the isoelectric point. Interestingly, the ‘Cox15 IDR low hydrophobicity’ strain also fails to localize to the mitochondria (Figure 4C, Figure 4—figure supplement 1), which indicates that elevated hydrophobicity is also necessary for mitochondrial targeting function.

To test whether elevated pI is sufficient to restore a targeting signal, we first constructed an N-terminal Cox15 IDR in which we ‘knocked out’ sequence-distributed features that we predict to be important for specific IDR function. To make this synthetic Cox15 IDR, we simulated the evolution of the Cox15 N-terminal IDR under a model that randomly evolves IDRs while preserving position-specific variation in evolutionary rates, but includes no specific constraints on the sequence-distributed molecular features used to train FAIDR (as in Iserman et al., 2020; Figure 4B, bottom). We note that this simulated IDR is not a completely random or generic IDR; for example, the simulated IDR shares 4/5 of the N-terminal amino acids with the wild-type Cox15 IDR from which it ‘evolved’ in silico (Figure 4B). Nevertheless, consistent with the functional importance of the sequence-distributed features, the strain in which the N-terminal IDR of GFP-tagged Cox15 is replaced with this simulated IDR fails to show a mitochondrial localization pattern (Figure 4C, Figure 4—figure supplement 1), indicating that the first few conserved amino acids in the IDR, the predicted cleavage site and helicity (Figure 4B) are not sufficient for mitochondrial localization. We next made mutations in the simulated Cox15 IDR to increase the pI back to the range of the wild-type Cox15 N-terminal targeting signal. We mutated each glutamic acid (E) and aspartic acid (D) residue to arginine (R), bringing the pI from 10.2 in the simulated IDR to 12.2 in the simulated IDR with high pI. Remarkably, this mitochondrial localization phenotype seems to be partially rescued when the pI of the simulated IDR is increased (‘sim IDR high pI’, Figure 4C, Figure 4—figure supplement 1). This indicates that the elevated pI in the context of a synthetic, non-functional Cox15 N-terminal disordered region is sufficient to restore weak mitochondrial targeting function.

Taken together, these results demonstrate that functional associations with evolutionary signatures revealed by the FAIDR statistical model can be used to design mutations in IDRs that alter phenotype in predictable ways, and suggest that a small number of molecular features may be necessary and sufficient for partially restoring some IDR functions.

Inferring function for specific IDRs in uncharacterized proteins

One of the important use cases of our approach (the FAIDR model trained on evolutionary signatures of predicted IDRs) is to predict functions for specific protein regions that are otherwise challenging, if not impossible, to annotate using current methods. To demonstrate the use of FAIDR for annotating protein regions, we chose to focus on open reading frames (ORFs) annotated as ‘uncharacterized’ in the yeast proteome by SGD (Cherry et al., 2012). There are currently 739 uncharacterized ORFs in the yeast proteome, of which 121 have one or more IDRs for which we are able to make predictions. We extracted the probability that each IDR corresponds to each of the functions that we considered (equation 1), and used this to make functional predictions for IDRs in uncharacterized ORFs.

One interesting example is the N-terminal IDR in putative mitochondrial protein SHH3, which we predict to be associated with the mitochondrion, as well as the mitochondrial inner membrane. Indeed, SHH3 assumes what looks to be a mitochondrial localization in recent localization collections such as the LoQAtE (Localization and Quantitation Atlas of the yeast proteomE) database (Breker et al., 2014; Breker et al., 2013) and Yeast RGB (Dubreuil et al., 2019). Another interesting case is an IDR in MFG1, which FAIDR predicts to be associated with sequence-specific DNA binding (Figure 5A), indicating that MFG1 could be a transcription factor. Indeed, all other IDRs with a higher probability for association with transcription factor activity according to FAIDR are IDRs from confirmed transcription factors (Sok2, Sfp1, Mot3) (Figure 5A). Despite being an uncharacterized ORF, Mfg1 regulates filamentous growth by interacting with the FLO11 promoter and regulating FLO11 expression (Ryan et al., 2012), and has a nuclear localization (Huh et al., 2003), further supporting the idea that it is indeed a transcription factor. Using FAIDR, we can specifically identify the Mfg1 protein region that is associated with transcription factor activity (Figure 5B). Furthermore, by looking at the top predictive features that FAIDR points to (Figure 5C, left) and plotting the feature vector (evolutionary signature) for these top features, we can begin to understand the molecular features that are contributing to this function for the Mfg1 IDR 1 specifically. For example, increased variance of glutamine residues in Mfg1 is a positive predictor of its association with transcription factor activity (Figure 5C, orange box), and this sequence feature can be seen in a multiple sequence alignment of the Mfg1 IDR and its orthologs (Figure 5C, right). Thus, we can formulate the hypothesis that this specific sequence feature is important for function, and test this by modulating the number of glutamine repeats in the lab (see Discussion).

Figure 5. — (A) Histogram of probabilities for association of IDRs with sequence-specific DNA binding (transcription factor or ‘TF’) function. Top predictions are indicated. From the top predictions, IDRs from known transcription factors are indicated with a green dot. Mfg1 is an uncharacterized protein. (B) Mfg1 protein coordinates, with all known domain annotations (via PFAM), disorder prediction (via DISOPRED3), and FAIDR probability for association with sequence-specific DNA binding indicated. (C) Feature vector (evolutionary signature) for Mfg1 IDR 1 (y-axis) is plotted against predictive features for transcription factor/sequence-specific DNA binding function as determined by FAIDR (x-axis). Log variance in glutamine (Q) residues is highlighted as one of the top features (orange circle) associated with TF function. Glutamine residues (Q) are indicated (in orange) in the IDR of Mfg1 and orthologs obtained from the YGOB (Byrne and Wolfe, 2005). All other (non-glutamine) residues are indicated in black. Multiple sequence alignment of orthologs is visualized using Jalview (Waterhouse et al., 2009). Species in alignment (from bottom to top) are as follows: *Saccharomyces cerevisiae*, *Saccharomyces mikatae*, *Saccharomyces kudriavzevii*, *Saccharomyces uvarum*, *Candida glabrata*, *Kazachstania africana*, *Kazachstania naganishii*, *Naumovozyma castellii*, *Naumovozyma dairenensis*, *Zygosaccharomyces rouxii*, *Torulaspora delbrueckii*, *Eremothecium (Ashbya) gossypii*, *Eremothecium (Ashbya) cymbalariae*, *Lachancea kluyveri*, *Lachancea thermotolerans*, and *Lachancea waltii*.

Figure 5—figure supplement 1. — (A) Histogram of probabilities for association of IDRs with sequence-specific DNA binding (transcription factor or ‘TF’) function. Top predictions are indicated. From the top predictions, IDRs from known transcription factors are indicated with a green dot. Mfg1 is an uncharacterized protein. (B) Mfg1 protein coordinates, with all known domain annotations (via PFAM), disorder prediction (via DISOPRED3), and FAIDR probability for association with sequence-specific DNA binding indicated. (C) Feature vector (evolutionary signature) for Mfg1 IDR 1 (y-axis) is plotted against predictive features for transcription factor/sequence-specific DNA binding function as determined by FAIDR (x-axis). Log variance in glutamine (Q) residues is highlighted as one of the top features (orange circle) associated with TF function. Glutamine residues (Q) are indicated (in orange) in the IDR of Mfg1 and orthologs obtained from the YGOB (Byrne and Wolfe, 2005). All other (non-glutamine) residues are indicated in black. Multiple sequence alignment of orthologs is visualized using Jalview (Waterhouse et al., 2009). Species in alignment (from bottom to top) are as follows: *Saccharomyces cerevisiae*, *Saccharomyces mikatae*, *Saccharomyces kudriavzevii*, *Saccharomyces uvarum*, *Candida glabrata*, *Kazachstania africana*, *Kazachstania naganishii*, *Naumovozyma castellii*, *Naumovozyma dairenensis*, *Zygosaccharomyces rouxii*, *Torulaspora delbrueckii*, *Eremothecium (Ashbya) gossypii*, *Eremothecium (Ashbya) cymbalariae*, *Lachancea kluyveri*, *Lachancea thermotolerans*, and *Lachancea waltii*.

Discussion

We have demonstrated a systematic approach to associate molecular features of IDR sequences with diverse biological functions. Our method stands in contrast with those established for specific subtypes of IDRs, as well as existing methods that use amino acid sequences to predict whole-protein function. Our goal is to generate hypotheses about both the function of individual IDRs and which molecular features in those IDRs are important for those functions, much like Pfam annotations provide both predictions of molecular functions of domains and hypotheses about specific residues within those domains that are important for that function (El-Gebali et al., 2019).

In the case of Mfg1, which is a protein of unknown function, the N-terminal IDR is among the IDRs most strongly associated with transcription factor function in the yeast proteome. Because Mfg1 contains three predicted IDRs (and in general proteins contain multiple IDRs), this illustrates the advantage of our approach that infers which (if any) of the IDRs are associated with a particular function. Although in some well-studied cases, it is clear that the C-terminal and N-terminal IDRs have different functions (e.g., p53, Krois et al., 2018; Laptenko et al., 2016), in general we do not know if functional information within IDRs is modular (confined to one IDR) or distributed (spread over all the IDRs in a protein). In our exploration of yeast IDRs strongly associated with functions, we identified interesting cases where multiple IDRs within a protein were strongly associated with different functions (e.g., Figure 5—figure supplement 1) possibly supporting the idea that IDRs can be thought of as functional modules. However, we note that the identification of IDR boundaries in protein sequences (via bioinformatics or experiments) is an error-prone process (Nielsen and Mulder, 2019), and thus in many cases, we are not confident how many IDRs a protein ‘truly’ contains. A key technical challenge hindering inference of the IDR sequence features that are associated with function is that most proteome-scale functional information is at the level of entire proteins and proteins often contain multiple predicted IDRs, the exact boundaries of which are often uncertain. Further improvement of the approaches described here could allow simultaneous inference of IDR function and boundaries, perhaps providing more insight.

The analysis of Mfg1 also reveals that homologs of the N-terminal IDR show elevated variance in glutamine repeats, a property of IDRs that we find associated with transcription factor function, consistent with previous reports (Freiman and Tjian, 2002; Gemayel et al., 2015; Klemsz and Maki, 1996). Thus, in addition to providing a hypothesis about which IDR is important for Mfg1 function, we predict that modulation of the glutamine residues in the N-terminal IDR of Mfg1 will impact its function in transcription. We note that in the case of Mfg1, the molecular feature is an elevation of evolutionary variance, and this feature cannot be modulated experimentally because we only affect one single value. Instead, we interpret the associations with our evolutionary signatures as pointing toward testable hypotheses. While FAIDR analysis using descriptions of molecular features based on single sequences would be easier to interpret, we found that our features did not perform as well when computed on S. cerevisiae alone (Figure 2—figure supplement 1). This supports the use of evolutionary features in this case, but as we discuss further below, this suggests that development of improved molecular features that can be manipulated in single sequences is an area for future research.

In addition to other previously reported molecular features (the expected short linear motifs and repeats associated with signaling, posttranslational regulation, and RNA binding), we found an unexpected depletion of threonine residues in IDRs in most classes of RNA-binding proteins and an unexpected decreased variance of lysine residues specific to stress-granule proteins. Future experiments will test the importance of these features. Nevertheless, our interpretable molecular features and statistical framework both confirm previous associations with function and reveal new properties of IDRs, while still providing predictive power comparable to current function-specific bioinformatics approaches.

Understanding which molecular features in IDRs are necessary and sufficient for various biological functions is an important area for further research. At least in the case of the Cox15 N-terminal mitochondrial targeting signal, we could show that two molecular features, mean isoelectric point and mean hydrophobicity, the increase of which we found to be positively associated with function, appear to be necessary for function of that IDR (Figure 4). We note that hydrophobicity and isoelectric point are only two of the top features that we chose to focus on in this study and are not the only two features that are important for mitochondrial localization. It is known that features including positive charge, potential to form an amphipathic helix, hydrophobicity, and the presence of cleavage motifs can help identify mitochondrial targeting signals (Fukasawa et al., 2015; Garg and Gould, 2016; Roise and Schatz, 1988; Vögtle et al., 2009). We note that the molecular features that we find to be associated with mitochondrial targeting function are in line with previous findings and serve to illustrate how our method could be used to identify and test functional molecular features for protein regions that are less characterized than mitochondrial targeting signals. Furthermore, while mitochondrial pre-sequences are known to form helices upon interacting with their target receptor, they do not engage in a stable interaction and sample different conformations (Abe et al., 2000; Kohda, 2018; Saitoh et al., 2007). We therefore include these regions in our analysis of IDRs. We suggest that our approach can be used to identify molecular features associated with function for any protein regions that are not amenable to standard methods of sequence analysis. To facilitate wider use of this approach, we have developed a website that provides search, visualization, and comparison of evolutionary signatures for predicted IDRs in the yeast proteome (http://www.moseslab.csb.utoronto.ca/idr/).

Remarkably, we also found that artificially increasing the isoelectric point of a simulated Cox15 N-terminal IDR (i.e., an IDR with no enrichment of sequence-distributed features) enabled us to create a weak synthetic targeting signal. To our knowledge, this represents the first synthetic IDR designed to perform a specific biological function and highlights the benefit of using a simple, interpretable statistical model: we could hand-design mutations in a synthetic IDR to perform a desired function through eight amino acid changes. In general, designing synthetic IDRs represents a new approach for testing the importance of molecular features.

In all, we found that diverse functions can be associated with the IDR sequence features included in our evolutionary signatures (Zarin et al., 2019). However, we note that these features of disordered regions were obtained from a review of the literature, and new features are continuously being discovered (e.g., patterning of aromatic residues [Martin et al., 2020]). It is likely that including these features will further improve predictive power of the approach described here. Nevertheless, these features are necessarily biased by current research interests, and because they rely on comparisons to simulations of molecular evolution, they are cumbersome to compute. In future work, we believe that application of advanced computational methods will enable more systematic exploration, ideally automatically learning molecular features that are most important for IDR function.

Materials and methods

The FAIDR statistical model

FAIDR was implemented in R with the glmnet package (Friedman et al., 2010). Code and results are available at https://github.com/taraneh-z/FAIDR; Zarin, 2021; copy archived at swh:1:rev:b1ef30705f7133f21201166522077a062984566b. A schematic of the model is provided in Supplementary materials (Figure 1—figure supplement 1).

We assume that IDRs are associated with molecular feature vectors Z (e.g., the ‘evolutionary signatures’ that summarize the evolution of molecular features [Zarin et al., 2019]) and that functional annotations, Y, are binary indicator variables at the protein level (we treat each annotation independently) such that Y = 1 if the protein is associated with the given annotation and Y = 0 if the protein is not associated with the given annotation. If we knew that the jth IDR in the ith protein was responsible for the function, this would amount to a standard classification problem.

Since our goal is to develop a method that can identify the molecular features associated with each function as well as to predict the function of the IDRs, we consider logistic regression, which summarizes the contribution of each feature to the prediction task using a vector of coefficients, b, one for each of m features. Using standard logistic regression, if the IDR responsible for a function were known, we could write the probability of a function given the features

P (Y_{i} = 1 | Z_{i j}, b) = \frac{1}{1 + e^{- Z_{i j} b}} = 1 - P (Y_{i} = 0 | Z_{i j}, b)

(1)

Since, in fact, we do not know which IDR is responsible for the functional annotation Y, we treat this as a hidden indicator variable, X, and marginalize. The likelihood of the data for n proteins is then:

P (Y | Z, b) = \prod_{i = 1}^{n} \sum_{j = 1}^{r_{i}} P (X_{i j} = 1) {P (Y_{i} = 1 | Z_{i j}, b)}^{Y_{i}} {P (Y_{i} = 0 | Z_{i j}, b)}^{1 - Y_{i}}

(2)

where r_i is the number of disordered regions in the ith protein and $X_{i j} = 1$ if the jth IDR in the ith protein is responsible for the function, Y.

To maximize the likelihood of our model (Equation 2), we use the E–M algorithm or iteratively maximize the expected complete log likelihood (reviewed in Moses, 2017). To derive the E–M algorithm for this model, we first write the complete likelihood (CL, the log likelihood assuming the hidden variables are observed):

C L = \prod_{i = 1}^{n} \prod_{j = 1}^{r_{i}} {({[\frac{1}{1 + e^{- Z_{i j} b}}]}^{Y_{i}} {[1 - \frac{1}{1 + e^{- Z_{i j} b}}]}^{1 - Y_{i}})}^{X_{i j}} = \prod_{i = 1}^{n} \prod_{j = 1}^{r} {({[h (Z_{i j})]}^{Y_{i}} {[1 - h (Z_{i j})]}^{1 - Y_{i}})}^{X_{i j}}

(A.1)

where $h (Z_{i j}) = \frac{1}{1 + e^{- Z_{i j} b}}$ .

Taking logs and expectations yields:

⟨ l o g C L ⟩ = \sum_{i = 1}^{n} \sum_{j = 1}^{r} ⟨ X_{i j} ⟩ [Y_{i} l o g h (Z_{i j}) + (1 - Y_{i}) l o g (1 - h (Z_{i j}))]

(A.2)

where the angle brackets denote expectations. In the E-step of the E–M algorithm, these expectations are calculated using Bayes’ theorem as follows:

⟨ X_{i j} ⟩ = P (X_{i j} = 1 | Y_{i}, Z_{i j}, b) = \frac{P (X_{i j} = 1) P (Y_{i} | X_{i j} = 1, Z_{i j}, b)}{P (Y_{i} | Z_{i j}, b)} = \frac{Y_{i} h (Z_{i j}) + (1 - Y_{i}) (1 - h (Z_{i j}))}{\sum_{k}^{} Y_{i} h (Z_{i k}) + (1 - Y_{i}) (1 - h (Z_{i k}))}

(A.3)

Note that in (Equation A.3), we have assumed no prior knowledge about which IDR in a protein is most likely to be responsible for the function (uninformative priors), $P (X_{i j} = 1)$ is constant over j.

In the M-step, we maximize the expected complete log likelihood. Inspection of Equation A.2 reveals exactly one term for each IDR, weighted by $⟨ X_{i j} ⟩$ , the expected value of the hidden variable. Thus, our model corresponds to an iteratively reweighted logistic regression problem, where the weights are the current guess of the hidden variable, $⟨ X_{i j} ⟩$ , representing which IDR is responsible for the function. We can therefore maximize the L1 penalized version of this expected complete log likelihood by adding an extra factor to the weights of the IRLS algorithm (Hastie et al., 2015). The modified weights are:

w_{i j} = ⟨ X_{i j} ⟩ h (Z_{i j}) (1 - h (Z_{i j}))

(A.4)

In each M-step in our implementation, we do five iterations of this IRLS (on standard LASSO regression) to obtain new parameter values. We used a fixed L1 penalty of 0.2 throughout, as we found this to provide the strong regularization needed for our small numbers of positive examples. To improve the numerical stability of IRLS (needed when we had small numbers of positive examples relative to the number of features), we also used a small L2 penalty (alpha = 0.99 in glmnet) and we monitored the weights (A.4) and set them to an arbitrarily small value if they reached 0. After five iterations of IRLS, we recalculate the expected values of the hidden variables $⟨ X_{i j} ⟩$ . We initialize $⟨ X_{i j} ⟩$ by setting them equal to 1/r_i, so that they are downweighted equally.

Obtaining evolutionary signatures for IDRs in the yeast proteome

In order to calculate evolutionary signatures from predicted IDRs (disordered regions predicted from the S. cerevisiae proteome using DISOPRED3 [Jones and Cozzetto, 2015] that are 30 amino acids or longer), we used a similar method to previously published work (Zarin et al., 2019) with some modifications.

Briefly, for each set of orthologous IDRs (obtained from the Yeast Gene Order Browser [Byrne and Wolfe, 2005] and filtered to include those proteins for which there are 10 or more orthologs), we simulated 1000 sets of orthologous IDRs using previously described methods (Nguyen Ba et al., 2014; Nguyen Ba et al., 2012) and quantified the difference in the evolution of molecular features in the real set of orthologous IDRs to that of the simulated orthologous IDRs using a standard Z-score (as described in Zarin et al., 2019). The evolutionary signature of each IDR thus comprised a vector of 164 Z-scores mean and log variance over evolution for 82 molecular features (listed in Zarin et al., 2019). A negative Z-score implies that the evolution (in mean or variance) of a molecular feature is lower than our null expectation, while a positive Z-score implies that the evolution (in mean or variance) of a molecular feature is higher than our null expectation for the given IDR. A Z-score around zero implies that the molecular feature in question is evolving according to our null expectation of IDR evolution. The null expectation includes selection to preserve amino acid residue frequency according to disordered region-specific substitution matrices, but no selection to preserve the sequence-distributed molecular features.

In previous work (Zarin et al., 2019), we calculated these evolutionary signatures for highly diverged segments of IDRs by constraining the evolution of short conserved segments in our simulations of IDR evolution. For this study, we wanted to include all of the potentially functional sequence information in our evolutionary signatures (i.e., highly diverged segments as well as short conserved segments) and therefore did not constrain the evolution of short conserved segments in the evolutionary simulations. The evolutionary simulation software (Nguyen Ba et al., 2014; Nguyen Ba et al., 2012) takes as input maximum-likelihood estimates for the column (site-specific) rate of evolution, the local rate of evolution, and whether or not a site is predicted to be part of a conserved short linear motif (Nguyen Ba et al., 2012). In order to ignore the constraint on these short conserved segments, we randomly scrambled the column (site-specific) rate of evolution for each IDR and set each found motif to 0. This results in evolutionary simulations of IDRs in which positionally conserved sites evolve with the same average constraint as other sites in the IDR.

Evolutionary signatures are available for download and visualization at http://www.moseslab.csb.utoronto.ca/idr/.

Compiling functions and phenotypes for prediction

We compiled a series of functional annotations, phenotypes, and datasets to predict with our model. These fell into the following categories:

Gene Ontology annotations that we previously found to be strongly enriched in our unsupervised clustering analysis (Zarin et al., 2019), acquired from SGD (Cherry et al., 2012).
Datasets that screened the S. cerevisiae proteome for membraneless organelle/reversible protein assembly formation under stress (Mitchell et al., 2013; Narayanaswamy et al., 2009; Wallace et al., 2015).
A dataset of gold-standard Cdc28 substrate predictions (Sharifpoor et al., 2011).
A set of publicly available S. cerevisiae phenotype annotations from SGD (Cherry et al., 2012).

In total, we found 23 functions and phenotypes for which we could get accurate predictions (Figure 2—figure supplement 1). Overall, we tested the model on 120 functions and phenotypes.

Evaluating FAIDR at the IDR level

We obtained two datasets for which we could define ‘positive’ and ‘negative’ IDRs from proteomics data.

First, we obtained a dataset of mitochondrial targeting signals from a global analysis that identified and validated mitochondrial targeting signals in highly purified mitochondria using combined fractional diagonal chromatography (Vögtle et al., 2009).

Second, we obtained Cdc28 phosphorylation sites discovered in high-throughput kinase-specific mass spectrometry experiments (Holt et al., 2009; Lai et al., 2012) and mapped them to IDRs. If at least one verified phosphorylation site was identified in a region, we considered this IDR to be a target of Cdc28.

To make predictions of functions for unseen IDRs, we used the parameters (b) estimated for each function and computed Equation (1) above. We compared these probability values to the known IDR function as the probability cutoff is varied. We used ROCR (Sing et al., 2009) to compute ROC curves and AUCs.

Even though these datasets were not used directly for the protein-level annotations, they may have been used as part of the evidence that led to the annotation of the protein. Therefore, we sought to further remove the potential for circularity by removing 20% of proteins from the training dataset and only evaluating the predictive power on the IDRs from those 20%. We report the number of positive IDRs in this held-out data set in Figure 2.

In order to confirm that the results of our IDR-level predictions are not based on sequence similarity between the training and test set, we performed an all-by-all protein BLAST (Altschul et al., 1990) with default parameters, using the training set as the ‘query’ and the test set as the ‘subject’. We found that out of 1016 IDRs in the test set, 49 (5%) had sequence similarity to IDRs in the training set (i.e., had a BLAST e-value that was less than 10⁻⁵). For mitochondrial targeting signals, this sequence similarity explains at most 3% of our predictive power (1/35 IDRs in the positive test set have sequence similarity to IDRs in the training set according to the above e-value threshold). For CDK targets, this sequence similarity explains at most 10% of our predictive power (6/59 IDRs in the positive test set have sequence similarity to IDRs in the training set according to the above e-value threshold). IDRs with sequence similarity tend to be in paralogous proteins.

Evaluating FAIDR at the protein level

Although predicting protein function was not a primary goal of FAIDR, we evaluated the AUCs on held-out protein annotations to confirm that FAIDR could learn features that are predictive of a diverse set of IDR functions. To make predictions of functions for held-out proteins (Y), we sum the weighted contributions of all the IDRs:

P (Y_{i} = 1 {| Z}_{i}, b) = \sum_{j = 1}^{r} P (X_{i j} = 1 | Z_{i j}, b) P (Y_{i} = 1 | Z_{i j}, b)

(A.5)

where $P ({Y_{i} = 1 | Z}_{i j}, b)$ is given in Equation (1) above. To compute $P (X_{i j} = 1 | Z_{i j}, b)$ for an unseen sequence, we cannot use (Equation A.3) because it treats Y as an observed variable. Instead we used $P (X_{i j} = 1 | Z_{i j}, b) = \frac{1}{r_{i}}$ , which amounts to a simple average over the IDRs in each protein.

For each functional category/annotation, we used fivefold cross-validation, randomly dividing the data into 80% of proteins for training and 20% of proteins for testing. For each fold of cross-validation, we compute the AUC using ROCR (Sing et al., 2009) on the random 20% held-out data. We report the average AUC over the five held-out datasets.

Clustering of t-statistics

We fit a standard logistic regression to molecular features that obtained non-zero coefficient values from FAIDR. We extracted the corresponding t-statistic for each molecular feature and clustered these t-statistics and the 23 functions that we predicted using hierarchical clustering for Figure 3. Prior to clustering, we assigned those molecular features with a coefficient of 0 a corresponding t-statistic of 0 and filtered the 164 molecular features to only include those molecular feature vectors that had at least one t-statistic value of 3 or greater for a predicted function. This filtering resulted in a total of 72 molecular features. We used the Cluster 3.0 program (de Hoon et al., 2004) to perform hierarchical clustering (clustering both ‘genes’ [molecular feature t-statistics] and ‘arrays’ [functions] using uncentered correlation distance, with calculated weights and using average linkage). We visualized the clustering using Java Treeview (Saldanha, 2004).

Construction of Cox15-GFP strains with variable molecular features and confocal microscopy

We obtained a BY4741-derived Cox15-GFP-tagged budding yeast strain from the Yeast GFP collection (Huh et al., 2003), which serves as our wild-type IDR strain. We mutated the N-terminal IDR in this strain as previously described (Zarin et al., 2019) using the Delitto Perfetto method (Storici et al., 2001), which performs markerless site-directed mutagenesis in the genome. In order to construct the IDR mutant strains (‘Cox15 low pI’, ‘Cox15 low hydrophobicity’, ‘Sim IDR’, and ‘Sim IDR high pI’), we used gene synthesis (via Integrated DNA Technologies [IDT]) whose sequences we confirmed using PCR and Sanger sequencing. The genotypes for the IDR strains are as follows:

WT:
ATGCTTTTCAGAAACATAGAAGTGGGCAGGCAGGCAGCTAAGCTATTAACGAGAACCTCGAGTCGTTTGGCCTGGCAAAGTATTGGGGCCTCAAGGAATATTTCTACCATCAGACAACAAATCAGAAAGACTCAA
Cox15 low pI:
ATGCTTTTCGAAAACATAGAAGTGGGCGAACAGGCAGCTAAGCTATTAACGGAAACCTCGAGTGAATTGGCCTGGCAAAGTATTGGGGCCTCAGAAAATATTTCTACCATCGAACAACAAATCGAAAAGACTCAA
Cox15 low hydrophobicity:
ATGaatTTCAGAAACagaGAAaagGGCAGGCAGGCAGCTAAGaataatACGAGAACCTCGAGTCGTaatGCCTGGCAAAGTagaGGGGCCTCAAGGAATagaTCTACCagaAGACAACAAagaAGAAAGACTCAA
Sim IDR:
ATGCTGCTGAGAAACGTTGAATCCTCCAAACCCGAAGCAAAACTAATTACCAGAGCTTCTTACGCCGTGCCCAGGAAAATGAACAATTCATACTTGGGCGATAATACATTGAATAACCTGGTCTTAAAGAAGAGCTATCTTTTAGCTGTTCCCAGAAAGATTCCCACGATTCCAGCCAGTCTGCCGCAAATTCGTGACAAGGAT
Sim IDR high pI:
ATGCTGCTGAGAAACGTTAGATCCTCCAAACCCAGAGCAAAACTAATTACCAGAGCTTCTTACGCCGTGCCCAGGAAAATGAACAATTCATACTTGGGCAGAAATACATTGAATAACCTGGTCTTAAAGAAGAGCTATCTTTTAGCTGTTCCCAGAAAGATTCCCACGATTCCAGCCAGTCTGCCGCAAATTCGTAGAAAGAGA

Confocal microscopy (Leica TCS-SP8, 63× oil immersion objective, Leica HyD hybrid detectors) was used to image cells. Overnight cultures were diluted 1/10 in fresh SD-His media and grown to log phase for 3 hr at 30°C. Where applicable (Figure 4—figure supplement 1), cells were treated with 0.5 µg/ml DAPI in SD-complete growth media for 15 min. Cultures were concentrated 1/10 by centrifugation at 3000 rpm for 3 min and imaged on standard, uncoated glass slides. Nuclear and mitochondrial localization was imaged sequentially to eliminate potential crosstalk in the following series: DAPI (excitation/emission 358 nm/461 nm), GFP (excitation/emission 488 nm/507 nm), and a bright-field image. Each strain was imaged on at least three separate days. Images in Figure 4C, Figure 4—figure supplement 1 were taken on the same day with the same microscope settings across all images.

Acknowledgements

We thank Alex X Lu for comments on the manuscript and Dr Alan R Davidson for discussions about IDR functional predictions. We thank Dr Helena Friesen and Dr Brenda Andrews for providing strains from the yeast GFP collection. We thank Canadian Institutes for Health Research (CIHR) for funding to AMM and JDF-K (grant no. PJT-148532), the Canada Research Chairs program and a CIHR Foundation grant (grant no. FDN-148375) to JDF-K, Canada Foundation for Innovation (CFI) for funding to AMM, and the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding to TZ.

Appendix 1

We first review logistic regression and IRLS and then explain how our approach for associating protein functions with sequence features in disordered regions is a simple extension of logistic regression.

Multiple LASSO logistic regression

For classification, the target Y (or dependent variable in regression) is one if true and 0 if false for each of n observations. More formally,

Y = Y_{1}, Y_{2}, \dots, Y_{n}

Y_{i} \in \{0,1\}

We assume that for each observation $Y_{i}$ , we have a vector of features that we would like to use for prediction, say, $X_{i}$ .

X_{i} = X_{i 1}, X_{i 2}, \dots, X_{i m}

Since we are doing linear regression, we define a linear combination of weights, b, and features, X:

b_{0} + {b_{1} X}_{i 1} + {b_{2} X}_{i 2} + . . . + {b_{m} X}_{i m} = b_{0} + \sum_{j = 1}^{m} X_{i j} b_{j} \equiv X_{i} b

For (binary) classification, we use the logistic function to turn the linear combination into a probability

P (Y_{i} = 1| X_{i}, b) = \frac{1}{1 + e^{- X_{i} b}}

and we know that $P (Y_{i} = 0| X_{i}, b) = 1 - P (Y_{i} = 1| X_{i}, b)$ because there are only two possibilities for $Y_{i}$ , and the probability must sum to one.

The likelihood (the probability of Y given the model) is the product over the (by assumption) independent and identically distributed (i.i.d) observations, because the probability of a series of observations is just the probabilities of independent events multiplied together:

P (Y | X, b) = \prod_{i = 1}^{n} {[\frac{1}{1 + e^{- X_{i} b}}]}^{Y_{i}} {[1 - \frac{1}{1 + e^{- X_{i} b}}]}^{1 - Y_{i}}

In the expression above, we have used the trick that $x^{0} = 1$ and $x^{1} = x$ , so when $Y_{i} = 1$ , the first term, which is $P (Y_{i} = 1| X_{i}, b)$ , is included in the product and the second term is just 1 (because $1 - Y_{i} = 0$ ), so it does not change the product. Similar logic means that the second term, which corresponds to $P (Y_{i} = 0| X_{i}, b)$ , is selected when $Y_{i} = 0$ . Maximum-likelihood estimation means choosing the values (or estimates) of b that maximize the likelihood function.

In practice, we work with the log likelihood (whose maximum is also the maximum of the likelihood):

l o g L = \sum_{i = 1}^{n} Y_{i} \log \frac{1}{1 + e^{- X_{i} b}} + (1 - Y_{i}) \log (1 - \frac{1}{1 + e^{- X_{i} b}})

LASSO logistic regression adds a penalty to the log likelihood, so that

P l o g L = \sum_{i = 1}^{n} Y_{i} \log \frac{1}{1 + e^{- X_{i} b}} + (1 - Y_{i}) \log (1 - \frac{1}{1 + e^{- X_{i} b}}) - λ \sum_{j = 1}^{m} | b_{j} |

This penalty ensures that in order to have a non-zero value, a parameter b contributes at least $λ$ to the model fit (in units of log likelihood). $λ$ is a hyperparameter and cannot be estimated on the training data. In practice it can be chosen through cross-validation or based on prior knowledge. Although there is no closed form for the values of b that maximize the penalized loglikelihood (or standard loglikelihood), the problem can be solved by a so-called ‘iteratively reweighted least squares’ (IRLS) algorithm, which converts the logistic regression problem to a ‘weighted’ ordinary least squares (OLS) regression. In the LASSO or penalized likelihood framework, instead of OLS regression, L1 penalized regression is done in each iteration (Tibshirani et al., 2015).

IRLS: weighted OLS regression for logistic regression

In general, if the weights are defined to be $w = w_{1}, w_{2}, \dots, w_{n}$ , then the ‘weighted’ OLS regression problem is to choose b so that it minimizes

\sum_{i = 1}^{n} {w_{i} (Y_{i} - X_{i} b)}^{2}

In the case of IRLS for logistic regression, it is known that for a certain choice of weights and transformation of Y, the OLS solution will be exactly the maximum-likelihood solution for logistic regression. The weights are given by

w_{i} = \frac{1}{1 + e^{- X_{i} b}} (1 - \frac{1}{1 + e^{- X_{i} b}})

and the transformed variable, say, Y*, is defined as:

{Y^{*}}_{i} = X_{i} b + \frac{Y_{i} - \frac{1}{1 + e^{- X_{i} b}}}{w_{i}}

In these equations, b is our current estimate of the parameters. Since the weights and transformed target variable depend on the current estimate of the parameters, the algorithm proceeds by alternately solving the weighted OLS to estimate the parameters, and then updating the weights and transformed target variable. This proceeds iteratively from an initial guess, but in practice is known to converge rapidly.

A model to account for multiple disordered regions in one protein

If we let $Y_{i}$ represent the function of the proteins that we would like to predict, we would like to combine the features for all IDRs, $Z_{i} = (Z_{i 1,}, Z_{i 2}, . . . Z_{i r})$ , for that protein. A simple assumption we can make is that one of the IDRs is responsible for the function, and the others are irrelevant. We use a hidden variable, $X_{i j}$ , to represent this, so that if $X_{i j} = 1$ , the jth IDR in the ith protein is the one responsible for the function. In a linear regression framework, the likelihood of this model is

P (Y | Z, b) = \prod_{i = 1}^{n} \sum_{j = 1}^{r} P (X_{i j}) {[\frac{1}{1 + e^{- Z_{i j} b}}]}^{Y_{i}} {[1 - \frac{1}{1 + e^{- Z_{i j} b}}]}^{1 - Y_{i}}

In practice, to maximize likelihoods involving hidden variables, we use the E–M algorithm. This means iteratively maximizing the ‘expected complete log likelihood’, where we assume the hidden variables are observed, and equal to their expectations given our current parameter estimates. To derive this, we first write the complete likelihood, the log likelihood assuming the hidden variables are observed:

C L = \prod_{i = 1}^{n} \prod_{j = 1}^{r} {({[\frac{1}{1 + e^{- Z_{i j} b}}]}^{Y_{i}} {[1 - \frac{1}{1 + e^{- Z_{i j} b}}]}^{1 - Y_{i}})}^{X_{i j}}

Once again, the exponent $X_{i j}$ ensures that terms corresponding to the IDRs that are not responsible for the function do not contribute to the likelihood. Taking logs and expectations yields

⟨ l o g C L ⟩ = \sum_{i = 1}^{n} \sum_{j = 1}^{r} ⟨ X_{i j} ⟩ (Y_{i} \log \frac{1}{1 + e^{- Z_{i j} b}} + (1 - Y_{i}) \log (1 - \frac{1}{1 + e^{- Z_{i j} b}}))

where the angle brackets denote expectations. In the M-step of the E–M algorithm, we maximize the expected complete log likelihood. Comparing the standard loglikelihood for logistic regression (where we replace X with Z for consistency):

l o g L = \sum_{i = 1}^{n} Y_{i} \log \frac{1}{1 + e^{- Z_{i} b}} + (1 - Y_{i}) \log (1 - \frac{1}{1 + e^{- Z_{i} b}})

Inspection of the expected complete log likelihood reveals exactly one term for each IDR (the double sum $\sum_{i = 1}^{n} \sum_{j = 1}^{r}$ is over all n proteins over all r IDRs in each protein) weighted by $⟨X_{i j}⟩$ , the expected value of the hidden variable. Thus, our model corresponds to an iteratively reweighted logistic regression problem at the IDR level, where the weights are the current guess of the hidden variable $⟨X_{i j}⟩$ , that represents which IDR is responsible for the function. We can therefore maximize the LASSO penalized version of this expected complete log likelihood, by adding an extra set of weights to the IRLS algorithm described above, so that the weights are:

w_{i j} = ⟨X_{i j}⟩ \frac{1}{1 + e^{- Z_{i j} b}} (1 - \frac{1}{1 + e^{- Z_{i j} b}})

In each M-step, we do five iterations of this IRLS (on standard LASSO regression) to obtain new parameter values. We initialize the weights by setting them equal to one divided by the number of IDRs in the protein, so that they are downweighted equally.

In the E-step of the E–M algorithm, the expectations of the hidden indicator variables $⟨X_{i j}⟩$ are calculated using Bayes’ theorem as follows:

\begin{array}{ll} ⟨ X_{i j} ⟩ & = P (X_{i j} = 1 | Y_{i}, Z_{i j}, b) = \frac{P (X_{i j} = 1 | Z_{i j}, b) P (Y_{i} | X_{i j} = 1, Z_{i j}, b)}{P (Y_{i} | Z_{i j}, b)} \\ = \frac{Y_{i} P (Y_{i} = 1 | X_{i j} = 1, Z_{i j}, b) + (1 - Y_{i}) (1 - P (Y_{i} = 1 | X_{i j} = 1, Z_{i j}, b))}{\sum_{k} Y_{i} P (Y_{i} = 1 | X_{i k} = 1, Z_{i j}, b) + (1 - Y_{i}) (1 - P (Y_{i} = 1 | X_{i k} = 1, Z_{i k}, b))} \end{array}

Note that we have assumed no prior knowledge about which IDR in a protein is most likely to be responsible for the function (uninformative priors), that is $P (X_{i j} = 1 | Z_{i j}, b)$ is constant over j.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Taraneh Zarin, Email: taraneh.zarin@mail.utoronto.ca.

Alan M Moses, Email: alan.moses@utoronto.ca.

Michael B Eisen, University of California, Berkeley, United States.

Funding Information

This paper was supported by the following grants:

Canadian Institutes of Health Research PJT-148532 to Julie D Forman-Kay, Alan M Moses.
Canadian Institutes of Health Research FDN-148375 to Julie D Forman-Kay.
NSERC to Alan M Moses.
Fondation canadienne pour l’innovation to Alan M Moses.
NSERC Canada Graduate Scholarship to Taraneh Zarin.
Hospital for Sick Children RESTRACOMP Research Fellowship to Iva Pritišanac.

Additional information

Competing interests

Reviewing editor, eLife.

No competing interests declared.

Author contributions

Conceptualization, Software, Investigation, Visualization, Methodology, Validation, Writing - original draft, Writing - review and editing.

Investigation, Methodology.

Software, Visualization.

Investigation, Writing - review and editing.

Conceptualization, Funding acquisition, Writing - review and editing.

Conceptualization, Resources, Software, Supervision, Funding acquisition, Validation, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Additional files

Supplementary file 1. Table of function and phenotype abbreviations used in this article, corresponding function and phenotype full names, references, and function and phenotype name used in data frames/source data (i.e., name with no spaces).

elife-60220-supp1.zip^{(1.1KB, zip)}

Transparent reporting form

elife-60220-transrepform.docx^{(65.5KB, docx)}

Data availability

All data generated or analysed during this study are included in the manuscript, supporting files and the accompanying website.

References

Abe Y, Shodai T, Muto T, Mihara K, Torii H, Nishikawa S, Endo T, Kohda D. Structural basis of presequence recognition by the mitochondrial protein import receptor Tom20. Cell. 2000;100:551–560. doi: 10.1016/S0092-8674(00)80691-1. [DOI] [PubMed] [Google Scholar]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
Breker M, Gymrek M, Schuldiner M. A novel single-cell screening platform reveals proteome plasticity during yeast stress responses. Journal of Cell Biology. 2013;200:839–850. doi: 10.1083/jcb.201301120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Breker M, Gymrek M, Moldavski O, Schuldiner M. LoQAtE--localization and quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast. Nucleic Acids Research. 2014;42:D726–D730. doi: 10.1093/nar/gkt933. [DOI] [PMC free article] [PubMed] [Google Scholar]
Byrne KP, Wolfe KH. The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research. 2005;15:1456–1461. doi: 10.1101/gr.3672305. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen JR, Chang BH, Allen JE, Stiffler MA, MacBeath G. Predicting PDZ domain-peptide interactions from primary sequences. Nature Biotechnology. 2008;26:1041–1045. doi: 10.1038/nbt.1489. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Research. 2012;40:D700–D705. doi: 10.1093/nar/gkr1029. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chong PA, Vernon RM, Forman-Kay JD. RGG/RG motif regions in RNA binding and phase separation. Journal of Molecular Biology. 2018;430:4650–4665. doi: 10.1016/j.jmb.2018.06.014. [DOI] [PubMed] [Google Scholar]
Chuang GY, Boyington JC, Joyce MG, Zhu J, Nabel GJ, Kwong PD, Georgiev I. Computational prediction of N-linked glycosylation incorporating structural properties and patterns. Bioinformatics. 2012;28:2249–2255. doi: 10.1093/bioinformatics/bts426. [DOI] [PMC free article] [PubMed] [Google Scholar]
Darling AL, Liu Y, Oldfield CJ, Uversky VN. Intrinsically disordered proteome of human Membrane-Less organelles. Proteomics. 2018;18:1700193. doi: 10.1002/pmic.201700193. [DOI] [PubMed] [Google Scholar]
Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ. Attributes of short linear motifs. Mol. BioSyst. 2012;8:268–281. doi: 10.1039/C1MB05231D. [DOI] [PubMed] [Google Scholar]
de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20:1453–1454. doi: 10.1093/bioinformatics/bth078. [DOI] [PubMed] [Google Scholar]
Dosztányi Z, Csizmók V, Tompa P, Simon I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of Molecular Biology. 2005;347:827–839. doi: 10.1016/j.jmb.2005.01.071. [DOI] [PubMed] [Google Scholar]
Dubreuil B, Sass E, Nadav Y, Heidenreich M, Georgeson JM, Weill U, Duan Y, Meurer M, Schuldiner M, Knop M, Levy ED. YeastRGB: comparing the abundance and localization of yeast proteins across cells and libraries. Nucleic Acids Research. 2019;47:D1245–D1249. doi: 10.1093/nar/gky941. [DOI] [PMC free article] [PubMed] [Google Scholar]
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The pfam protein families database in 2019. Nucleic Acids Research. 2019;47:D427–D432. doi: 10.1093/nar/gky995. [DOI] [PMC free article] [PubMed] [Google Scholar]
Forman-Kay JD, Mittag T. From sequence and forces to structure, function, and evolution of intrinsically disordered proteins. Structure. 2013;21:1492–1499. doi: 10.1016/j.str.2013.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
Freiman RN, Tjian R. A glutamine-rich trail leads to transcription factors. Science. 2002;296:2149–2150. doi: 10.1126/science.1073845. [DOI] [PubMed] [Google Scholar]
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–22. doi: 10.18637/jss.v033.i01. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fukasawa Y, Tsuji J, Fu SC, Tomii K, Horton P, Imai K. MitoFates: improved prediction of mitochondrial targeting sequences and their cleavage sites. Molecular & Cellular Proteomics. 2015;14:1113–1126. doi: 10.1074/mcp.M114.043083. [DOI] [PMC free article] [PubMed] [Google Scholar]
Garg SG, Gould SB. The role of charge in protein targeting evolution. Trends in Cell Biology. 2016;26:894–905. doi: 10.1016/j.tcb.2016.07.001. [DOI] [PubMed] [Google Scholar]
Gemayel R, Chavali S, Pougach K, Legendre M, Zhu B, Boeynaems S, van der Zande E, Gevaert K, Rousseau F, Schymkowitz J, Babu MM, Verstrepen KJ. Variable Glutamine-Rich repeats modulate transcription factor activity. Molecular Cell. 2015;59:615–627. doi: 10.1016/j.molcel.2015.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity. Chapman and Hall/CRC; 2015. [Google Scholar]
Higuchi-Sanabria R, Garcia EJ, Tomoiaga D, Munteanu EL, Feinstein P, Pon LA. Characterization of fluorescent proteins for three- and Four-Color Live-Cell imaging in S. cerevisiae. PLOS ONE. 2016;11:e0146120. doi: 10.1371/journal.pone.0146120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Holt LJ, Tuch BB, Villén J, Johnson AD, Gygi SP, Morgan DO. Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution. Science. 2009;325:1682–1686. doi: 10.1126/science.1172867. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. doi: 10.1038/nature02026. [DOI] [PubMed] [Google Scholar]
Ionescu CN, Origanti S, McAlear MA. The yeast rRNA biosynthesis factor Ebp2p is also required for efficient nuclear division. Yeast. 2004;21:1219–1232. doi: 10.1002/yea.1177. [DOI] [PubMed] [Google Scholar]
Iserman C, Desroches Altamirano C, Jegers C, Friedrich U, Zarin T, Fritsch AW, Mittasch M, Domingues A, Hersemann L, Jahnel M, Richter D, Guenther UP, Hentze MW, Moses AM, Hyman AA, Kramer G, Kreysing M, Franzmann TM, Alberti S. Condensation of Ded1p promotes a translational switch from housekeeping to stress protein production. Cell. 2020;181:818–831. doi: 10.1016/j.cell.2020.04.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jaussi R. Homologous nuclear-encoded mitochondrial and cytosolic isoproteins. A review of structure, biosynthesis and genes. European Journal of Biochemistry. 1995;228:551–561. doi: 10.1111/j.1432-1033.1995.tb20294.x. [DOI] [PubMed] [Google Scholar]
Jones DT, Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics. 2015;31:857–863. doi: 10.1093/bioinformatics/btu744. [DOI] [PMC free article] [PubMed] [Google Scholar]
Katuwawala A, Peng Z, Yang J, Kurgan L. Computational prediction of MoRFs, short Disorder-to-order transitioning protein binding regions. Computational and Structural Biotechnology Journal. 2019;17:454–462. doi: 10.1016/j.csbj.2019.03.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
Klemsz MJ, Maki RA. Activation of transcription by PU.1 requires both acidic and glutamine domains. Molecular and Cellular Biology. 1996;16:390–397. doi: 10.1128/MCB.16.1.390. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kohda D. "Multiple partial recognitions in dynamic equilibrium" in the binding sites of proteins form the molecular basis of promiscuous recognition of structurally diverse ligands. Biophysical Reviews. 2018;10:421–433. doi: 10.1007/s12551-017-0365-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Krois AS, Dyson HJ, Wright PE. Long-range regulation of p53 DNA binding by its intrinsically disordered N-terminal transactivation domain. PNAS. 2018;115:E11302–E11310. doi: 10.1073/pnas.1814051115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology. 1982;157:105–132. doi: 10.1016/0022-2836(82)90515-0. [DOI] [PubMed] [Google Scholar]
Lai AC, Nguyen Ba AN, Moses AM. Predicting kinase substrates using conservation of local motif density. Bioinformatics. 2012;28:962–969. doi: 10.1093/bioinformatics/bts060. [DOI] [PubMed] [Google Scholar]
Laptenko O, Tong DR, Manfredi J, Prives C. The tail that wags the dog: how the disordered C-Terminal domain controls the transcriptional activities of the p53 Tumor-Suppressor protein. Trends in Biochemical Sciences. 2016;41:1022–1034. doi: 10.1016/j.tibs.2016.08.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lobley A, Swindells MB, Orengo CA, Jones DT. Inferring function using patterns of native disorder in proteins. PLOS Computational Biology. 2007;3:e162. doi: 10.1371/journal.pcbi.0030162. [DOI] [PMC free article] [PubMed] [Google Scholar]
Martin EW, Holehouse AS, Peran I, Farag M, Incicco JJ, Bremer A, Grace CR, Soranno A, Pappu RV, Mittag T. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science. 2020;367:694–699. doi: 10.1126/science.aaw8653. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mitchell SF, Jain S, She M, Parker R. Global analysis of yeast mRNPs. Nature Structural & Molecular Biology. 2013;20:127–133. doi: 10.1038/nsmb.2468. [DOI] [PMC free article] [PubMed] [Google Scholar]
Moses A. Statistical Modeling and Machine Learning for Molecular Biology. Boca Raton: Chapman and Hall/CRC Press; 2017. [DOI] [Google Scholar]
Narayanaswamy R, Levy M, Tsechansky M, Stovall GM, O'Connell JD, Mirrielees J, Ellington AD, Marcotte EM. Widespread reorganization of metabolic enzymes into reversible assemblies upon nutrient starvation. PNAS. 2009;106:10147–10152. doi: 10.1073/pnas.0812771106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nguyen Ba AN, Yeh BJ, van Dyk D, Davidson AR, Andrews BJ, Weiss EL, Moses AM. Proteome-wide discovery of evolutionary conserved sequences in disordered regions. Science Signaling. 2012;5:rs1. doi: 10.1126/scisignal.2002515. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nguyen Ba AN, Strome B, Hua JJ, Desmond J, Gagnon-Arsenault I, Weiss EL, Landry CR, Moses AM. Detecting functional divergence after gene duplication through evolutionary changes in posttranslational regulatory sequences. PLOS Computational Biology. 2014;10:e1003977. doi: 10.1371/journal.pcbi.1003977. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nielsen JT, Mulder FAA. Quality and Bias of protein disorder predictors. Scientific Reports. 2019;9:5137. doi: 10.1038/s41598-019-41644-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ondrechen MJ, Clifton JG, Ringe D. THEMATICS: a simple computational predictor of enzyme function from structure. PNAS. 2001;98:12473–12478. doi: 10.1073/pnas.211436698. [DOI] [PMC free article] [PubMed] [Google Scholar]
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DW, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJ, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YA, van Dijk AD, ter Braak CJ, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013;10:221–227. doi: 10.1038/nmeth.2340. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roise D, Schatz G. Mitochondrial presequences. Journal of Biological Chemistry. 1988;263:4509–4511. doi: 10.1016/S0021-9258(18)68809-X. [DOI] [PubMed] [Google Scholar]
Ryan O, Shapiro RS, Kurat CF, Mayhew D, Baryshnikova A, Chin B, Lin ZY, Cox MJ, Vizeacoumar F, Cheung D, Bahr S, Tsui K, Tebbji F, Sellam A, Istel F, Schwarzmüller T, Reynolds TB, Kuchler K, Gifford DK, Whiteway M, Giaever G, Nislow C, Costanzo M, Gingras AC, Mitra RD, Andrews B, Fink GR, Cowen LE, Boone C. Global gene deletion analysis exploring yeast filamentous growth. Science. 2012;337:1353–1356. doi: 10.1126/science.1224339. [DOI] [PubMed] [Google Scholar]
Saitoh T, Igura M, Obita T, Ose T, Kojima R, Maenaka K, Endo T, Kohda D. Tom20 recognizes mitochondrial presequences through dynamic equilibrium among multiple bound states. The EMBO Journal. 2007;26:4777–4787. doi: 10.1038/sj.emboj.7601888. [DOI] [PMC free article] [PubMed] [Google Scholar]
Saldanha AJ. Java treeview--extensible visualization of microarray data. Bioinformatics. 2004;20:3246–3248. doi: 10.1093/bioinformatics/bth349. [DOI] [PubMed] [Google Scholar]
Sekiya-Kawasaki M, Groen AC, Cope MJTV, Kaksonen M, Watson HA, Zhang C, Shokat KM, Wendland B, McDonald KL, McCaffery JM, Drubin DG. Dynamic phosphoregulation of the cortical actin cytoskeleton and endocytic machinery revealed by real-time chemical genetic analysis. Journal of Cell Biology. 2003;162:765–772. doi: 10.1083/jcb.200305077. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sharifpoor S, Nguyen Ba AN, Youn JY, Young JY, van Dyk D, Friesen H, Douglas AC, Kurat CF, Chong YT, Founk K, Moses AM, Andrews BJ. A quantitative literature-curated gold standard for kinase-substrate pairs. Genome Biology. 2011;12:R39. doi: 10.1186/gb-2011-12-4-r39. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: Visualizing the performance of scoring classifiers. 1:4R Packag. 2009 https://cran.r-project.org/web/packages/ROCR/ROCR.pdf
Storici F, Lewis LK, Resnick MA. In vivo site-directed mutagenesis using oligonucleotides. Nature Biotechnology. 2001;19:773–776. doi: 10.1038/90837. [DOI] [PubMed] [Google Scholar]
The Gene Ontology Consortium The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Research. 2019;47:D330–D338. doi: 10.1093/nar/gky1055. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tibshirani R, Wainwright M, Hastie T. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC; 2015. [DOI] [Google Scholar]
Uversky VN. Natively unfolded proteins: a point where biology waits for physics. Protein Science. 2002;11:739–756. doi: 10.1110/ps.4210102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Uversky VN. New technologies to analyse protein function: an intrinsic disorder perspective. F1000Research. 2020;9:101–114. doi: 10.12688/f1000research.20867.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vacic V, Iakoucheva LM. Disease mutations in disordered regions--exception to the rule? Mol. BioSyst. 2012;8:27–32. doi: 10.1039/C1MB05251A. [DOI] [PMC free article] [PubMed] [Google Scholar]
van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, Kim PM, Kriwacki RW, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright PE, Babu MM. Classification of intrinsically disordered regions and proteins. Chemical Reviews. 2014;114:6589–6631. doi: 10.1021/cr400525m. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vernon RM, Chong PA, Tsang B, Kim TH, Bah A, Farber P, Lin H, Forman-Kay JD. Pi-Pi contacts are an overlooked protein feature relevant to phase separation. eLife. 2018;7:e31486. doi: 10.7554/eLife.31486. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vernon RM, Forman-Kay JD. First-generation predictors of biological protein phase separation. Current Opinion in Structural Biology. 2019;58:88–96. doi: 10.1016/j.sbi.2019.05.016. [DOI] [PubMed] [Google Scholar]
Vögtle FN, Wortelkamp S, Zahedi RP, Becker D, Leidhold C, Gevaert K, Kellermann J, Voos W, Sickmann A, Pfanner N, Meisinger C. Global analysis of the mitochondrial N-proteome identifies a processing peptidase critical for protein stability. Cell. 2009;139:428–439. doi: 10.1016/j.cell.2009.07.045. [DOI] [PubMed] [Google Scholar]
Wallace EW, Kear-Scott JL, Pilipenko EV, Schwartz MH, Laskowski PR, Rojek AE, Katanski CD, Riback JA, Dion MF, Franks AM, Airoldi EM, Pan T, Budnik BA, Drummond DA. Reversible, specific, active aggregates of endogenous proteins assemble upon heat stress. Cell. 2015;162:1286–1298. doi: 10.1016/j.cell.2015.08.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. Jalview version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191. doi: 10.1093/bioinformatics/btp033. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zarin T, Tsai CN, Nguyen Ba AN, Moses AM. Selection maintains signaling function of a highly diverged intrinsically disordered region. PNAS. 2017;114:E1450–E1459. doi: 10.1073/pnas.1614787114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman-Kay JD, Moses AM. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife. 2019;8:e46883. doi: 10.7554/eLife.46883. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zarin T. FAIDR. swh:1:rev:b1ef30705f7133f21201166522077a062984566bSoftware Heritage. 2021 https://archive.softwareheritage.org/swh:1:dir:f7116bc669d3b5399e74f85f21a01dac652862ef;origin=https://github.com/taraneh-z/FAIDR;visit=swh:1:snp:4efca85c9ee543283e54280a32418b345359049a;anchor=swh:1:rev:b1ef30705f7133f21201166522077a062984566b/

eLife. doi: 10.7554/eLife.60220.sa1

Decision letter

Editor: Michael B Eisen¹

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This research advance builds on the author's previous work on predicting the molecular function of intrinsically disordered from amino acid sequence and evolutionary dynamics. This is an emerging and important field and the author's contribution helps lay key foundations for how this question can be addressed. In addition novel scientific insight, a new online tool, the authors provide a detailed walk-through of the underlying statistical analysis performed.

Decision letter after peer review:

Thank you for submitting your article "Identifying molecular features that are associated with biological function of intrinsically disordered protein regions" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Michael Eisen as the Senior and Reviewing Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

The work presented offers a promising advancement for understanding protein function in general and of IDRs in particular. The manuscript presents the theoretical model development, experimental validation, and a protein engineering effort to further understand the results. A refreshing, pertinent discussion of potential shortcomings of the model is included throughout the manuscript. The manuscript is well-written but given its interdisciplinary nature, it appears overwhelmingly dense in details at times.

In general, it would be nice to have a less mathematical overview of the model workflow as a figure. What are the used features and the resulting predictions? Give more specific examples from the 82 features in Zarin et al., 2019.

Some specific comments raised by the reviewers:

Results/equations

"An interpretable regularized probabilistic model to predict function from IDR features" – I believe the LHS of Equation 1, P(Y=1|Z,b), does not match the description in the preceding paragraph. I think the issue is due to the lack of an “I” subscript for the “Y” variable. Since “I” corresponds to the specific protein, excluding the subscript and writing "P(Y=1)" seems to represent the probability that all proteins in the dataset have a particular functional annotation, while "P(Yi=1)" represents the probability that a particular protein has a particular functional annotation. This discrepancy causes the equivalency in Equation 1 to not be true ("all proteins have annotation" and "no proteins have annotation" are not the only two cases). This same issue extends into Equation 2. I don't believe there is any issue with the underlying approach of FAIDR, but it seems the equations (as written here) are not quite correct. It would be extremely useful to include a more detailed derivation in an Appendix in order to better explain the math behind the approach. Specifically if, this could be mathematically derived and explained in “laymans” terms this would, I think, significantly improve the utility of the manuscript, as for those less familiar with statistical inference/Bayesian analysis the mathematics as they stand basically just have to be trusted (which, having gone through they should, and clearly Dr. Moses is perhaps literally the world authority on this topic, BUT, never-the-less from a pedagogical standpoint it would do a great service to the field to provide a more accessible explanation of how the model is constructed).

Potential major concerns:

"IDR function can be predicted from protein-level annotations and IDR sequence properties" – As far as I can see the authors never indicate which specifics are used to derive their dataset, only that it is the "yeast proteomes". Depending on the source of the proteins in their dataset, it is possible that they have sequences with high similarity to one another in their dataset? This could be problematic. Since if they have two similar sequences with similar functional annotations, and the sequences are divided (since it's random) into the training and test sets, their predictions (and ROC curves) could be inflated and not a true measure of how accurate their method is. I.e., if the proteomic dataset they are using has already accounted for sequence similarity, then this is not a problem.

Results – The test of mitochondrial targeting is consistent with the authors analysis, but it's not clear to me if it's a particularly convincing demonstration of a novel approach, and perhaps the stated conclusions are a little strong given the evidence.

Some general concerns regarding the experiments:

There is no true negative control (i.e. completely redesign the sequence while maintaining pI and hydrophobicity), only the WT Cox15 pos control and various mutant sequences that are relatively similar (other than pI/hydrophobicity).

It is difficult to definitively conclude that the "sim IDR high pI" protein is localizing to the mitochondria from the image without concurrent labelling of mitochondria, only that it is localizing somewhere.

Not enough mutant sequences were tested to reach the conclusion that the localization must be due to high pI and/or high hydrophobicity of the sequence. For both primary mutants, less than 10% of the residues were mutated, so it's impossible to pinpoint what the exact cause of the loss of function is from. For example, given mitochondrial targeting signals have historically been described as amphipathic helices, is loss/gain of helicity a confounding variable? What about loss/gain of a cleavage site (as is commonly associated with targeting sequences).

All this said, the experiments do provide strong evidence for their conclusions, but I don't believe as high of confidence is warranted.

The features identified and then modified are known to be associated with mitochondrial targeting (in fact, represent how these sequences were historically identified, although predictions have now progressed somewhat).

To take a region that has a set of features that had already been identified as correlating with function and then modifying those features is – at least in principle – something that could have been done with the expected outcome without any of the authors methodologies. This of course does not invalidate the authors approach (and does strengthen the case) but it is in my mind not a well-defined “test”. This is not actually a huge problem in terms of publication, but perhaps something worth discussing. In defense of this section, the actually systematic approach the authors take to IDR mutation/evolution/rescue is great and I think a really useful blueprint for how one might test the importance of sequence features.

A sort of important question – is the Cox15 N-terminal IDR actually an IDR? It is not strongly predicted to be disordered (see http://d2p2.pro/view/sequence/up/P40086), and traditionally mitochondrial targeting sequences have been shown to form amphipathic helices (with a positive face). There may be experimental data demonstrating the disordered nature of this region, but it seems odd to select a demonstration example which it is (at least to me?) unclear if it meets the specific requirement for the approach to work. That said, a way to flip this is that as a prediction tool it is fundamentally assessing solvent-accessible residues, and any structural constraints imposed on those residues is perhaps less important than the mean-field chemical composition that the sequence provides. This might be a useful line of explanation (and has the added bonus of broadening the scope of the tool).

eLife. 2021 Feb 22;10:e60220. doi: 10.7554/eLife.60220.sa2

Author response

The work presented offers a promising advancement for understanding protein function in general and of IDRs in particular. The manuscript presents the theoretical model development, experimental validation, and a protein engineering effort to further understand the results. A refreshing, pertinent discussion of potential shortcomings of the model is included throughout the manuscript. The manuscript is well-written but given its interdisciplinary nature, it appears overwhelmingly dense in details at times.

We thank the reviewers for the supportive comments. We agree that the interdisciplinary nature of the paper requires us to be quite specific, and perhaps overwhelmingly so at times. We have tried to simplify the language throughout the manuscript, where possible, e.g. in the Abstract and Introduction. We have also reorganized the manuscript so that equations only appear in the Materials and methods.

In general, it would be nice to have a less mathematical overview of the model workflow as a figure. What are the used features and the resulting predictions? Give more specific examples from the 82 features in Zarin et al., 2019.

We have followed this suggestion. We have now included a less mathematical overview of the model workflow as Figure 1, including more specific examples from 82 features in Zarin et al., 2019, as suggested (e.g. mean CKII consensus sites, mean net charge, etc.).

Some specific comments raised by the reviewers:

Results/equations

"An interpretable regularized probabilistic model to predict function from IDR features" – I believe the LHS of Equation 1, P(Y=1|Z,b), does not match the description in the preceding paragraph. I think the issue is due to the lack of an “I” subscript for the “Y” variable. Since “I” corresponds to the specific protein, excluding the subscript and writing "P(Y=1)" seems to represent the probability that all proteins in the dataset have a particular functional annotation, while "P(Yi=1)" represents the probability that a particular protein has a particular functional annotation. This discrepancy causes the equivalency in Equation 1 to not be true ("all proteins have annotation" and "no proteins have annotation" are not the only two cases). This same issue extends into Equation 2. I don't believe there is any issue with the underlying approach of FAIDR, but it seems the equations (as written here) are not quite correct. It would be extremely useful to include a more detailed derivation in an Appendix in order to better explain the math behind the approach. Specifically if, this could be mathematically derived and explained in “laymans” terms this would, I think, significantly improve the utility of the manuscript, as for those less familiar with statistical inference/Bayesian analysis the mathematics as they stand basically just have to be trusted (which, having gone through they should, and clearly Dr. Moses is perhaps literally the world authority on this topic, BUT, never-the-less from a pedagogical standpoint it would do a great service to the field to provide a more accessible explanation of how the model is constructed).

The reviewer is right: we have now added the “I” subscripts where suggested.

As suggested, we have also added an Appendix with a more simple, laypersons derivation of the model. We now also describe the model more simply in the Results (under “A model that predicts protein function from molecular features in IDRs”), and have moved Equations 1 and 2 to the Materials and methods (under “The FAIDR statistical model”) so that no equations appear in the Results sections.

Potential major concerns:

"IDR function can be predicted from protein-level annotations and IDR sequence properties" – As far as I can see the authors never indicate which specifics are used to derive their dataset, only that it is the "yeast proteomes". Depending on the source of the proteins in their dataset, it is possible that they have sequences with high similarity to one another in their dataset? This could be problematic. Since if they have two similar sequences with similar functional annotations, and the sequences are divided (since it's random) into the training and test sets, their predictions (and ROC curves) could be inflated and not a true measure of how accurate their method is. I.e., if the proteomic dataset they are using has already accounted for sequence similarity, then this is not a problem.

We thank the reviewer for raising this important point. Although the IDRs are highly diverged in general, there are a small fraction that show detectible sequence similarity. In order to quantify the sequence similarity in our training and test sets, we performed an all-by-all BLAST of the IDRs in each respective set. As we now describe in the Materials and methods (under “Evaluating FAIDR at the IDR level”), we find that 5% of the IDRs in our test set had significant sequence similarity (BLAST e-value less than 10^-5) to IDRs in our training set. These were mostly comprised of paralogous proteins that were randomly split in our dataset. Further we report that sequence similarity explains up to 3% of our predictive power for mitochondrial targeting signals, and up to 10% of our predictive power for CDK targets. Hence, sequence similarity can account for at most only a small fraction of our classification results.

Results – The test of mitochondrial targeting is consistent with the authors analysis, but it's not clear to me if it's a particularly convincing demonstration of a novel approach, and perhaps the stated conclusions are a little strong given the evidence.

We take the reviewer’s point here. While we haven’t demonstrated any new biology in this analysis, we chose this example to illustrate the use of our new approach in guiding experimental design. We clarified the choice of a mitochondrial targeting signal as an example in this section.

Some general concerns regarding the experiments:

There is no true negative control (i.e. completely redesign the sequence while maintaining pI and hydrophobicity), only the WT Cox15 pos control and various mutant sequences that are relatively similar (other than pI/hydrophobicity).

We agree with the reviewer here, and hence we never claim that hydrophobicity and pI are the only important features for mitochondrial localization. We have emphasized this the Discussion of the revised manuscript. Completely re-designing the sequence while maintaining pI and hydrophobicity would not be an applicable negative control here. We have tried to make this more clear in the Results and Discussion sections of the revised manuscript.

It is difficult to definitively conclude that the "sim IDR high pI" protein is localizing to the mitochondria from the image without concurrent labelling of mitochondria, only that it is localizing somewhere.

In order to more confidently conclude that the “sim IDR high pI” protein is localizing to the mitochondria, we repeated the localization experiments in the presence of DAPI staining, which allows us to visualize the nuclear and mitochondrial DNA. The colocalization of our GFP signal with the mitochondrial DNA staining is clearly visible in these experiments. We have included the results of this experiment in the relevant Results section (“Molecular features in the Cox15 IDR predictably affect mitochondrial targeting of Cox15”) and include the micrographs in Figure 4—figure supplement 1.

Not enough mutant sequences were tested to reach the conclusion that the localization must be due to high pI and/or high hydrophobicity of the sequence. For both primary mutants, less than 10% of the residues were mutated, so it's impossible to pinpoint what the exact cause of the loss of function is from. For example, given mitochondrial targeting signals have historically been described as amphipathic helices, is loss/gain of helicity a confounding variable? What about loss/gain of a cleavage site (as is commonly associated with targeting sequences).

All this said, the experiments do provide strong evidence for their conclusions, but I don't believe as high of confidence is warranted.

Although we believe the definitive experiments suggested by the reviewer are beyond the scope of our study, we now provide some bioinformatics results to further support for our interpretation of the results. To test if the helicity or loss/gain of cleavage site could explain our results, we input our wildtype and mutant sequences into Mitofates, which predicts the helicity and cleavage sites for any given sequence. At least as far as the bioinformatics can be believed, neither the predicted helicity, nor the cleavage site, nor the recognition site can explain the phenotypic differences that we see when we make targeted mutations based on our model. We now include these results in Figure 4B (formerly Figure 3B). Although this is still indirect, we believe our interpretation of the data is consistent with all the evidence we have.

The features identified and then modified are known to be associated with mitochondrial targeting (in fact, represent how these sequences were historically identified, although predictions have now progressed somewhat).

To take a region that has a set of features that had already been identified as correlating with function and then modifying those features is – at least in principle – something that could have been done with the expected outcome without any of the authors methodologies. This of course does not invalidate the authors approach (and does strengthen the case) but it is in my mind not a well-defined “test”. This is not actually a huge problem in terms of publication, but perhaps something worth discussing. In defense of this section, the actually systematic approach the authors take to IDR mutation/evolution/rescue is great and I think a really useful blueprint for how one might test the importance of sequence features.

We thank the reviewer for these comments. The reviewer is right in that we strategically chose an easily observable phenotype which has been heavily studied, and in which we could easily manipulate sequence features that have (in some cases) been used to characterize these regions. As suggested by the reviewer, we now provide more context for why we chose this as a test case in the first paragraph of the relevant Results section (“Molecular features in the Cox15 IDR predictably affect mitochondrial targeting of Cox15”) and in the Discussion.

A sort of important question – is the Cox15 N-terminal IDR actually an IDR? It is not strongly predicted to be disordered (see http://d2p2.pro/view/sequence/up/P40086), and traditionally mitochondrial targeting sequences have been shown to form amphipathic helices (with a positive face). There may be experimental data demonstrating the disordered nature of this region, but it seems odd to select a demonstration example which it is (at least to me?) unclear if it meets the specific requirement for the approach to work. That said, a way to flip this is that as a prediction tool it is fundamentally assessing solvent-accessible residues, and any structural constraints imposed on those residues is perhaps less important than the mean-field chemical composition that the sequence provides. This might be a useful line of explanation (and has the added bonus of broadening the scope of the tool).

We have now provided more context on mitochondrial presequences. We explain in the Discussion section that although mitochondrial targeting sequences form helices upon binding to their target receptors, they sample dynamic conformations (like many disordered regions). Following the reviewer’s suggestion, we note that our approach can be applied to any protein region, but is particularly useful for those protein regions which are not amenable to standard methods of sequence analysis.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Figure 2—source data 1. Data for ROC curves including IDR coordinates, FAIDR probabilities for held out IDRs and ground truth labels from experimental data mapped to each IDR.

idr_name_protein_coordinates.txt defines IDR names as amino acid coordinates in each protein. heldout_idr_probability_cdc28_mito.csv is the FAIDR probability for IDRs that were not included in the training set (‘held-out IDRs’). total_idr_expt_db_overlap_lai_2009_cdc28.csv is the binary summary of whether or not each IDR is phosphorylated by Cdc28 according to experimental data mapped from protein coordinates to each IDR (data compiled by Lai et al., 2012). total_idr_expt_db_overlap_vogtle_2009_mitochondrial_nterm.csv is the binary summary of whether or not each IDR is a mitochondrial N-terminal targeting signal according to experimental data mapped from protein coordinates to each IDR (based on a proteome-wide experimental study by Vögtle et al., 2009).

elife-60220-fig2-data1.zip^{(125KB, zip)}

Figure 3—source data 1. Raw T-statistics used for cluster analysis and display.

The remaining molecular features (n = 73) and the associated functions/phenotypes (n = 23) were then hierarchically clustered using weighted uncentered correlation distance (‘cluster’ and ‘calculate weights’ checked off with cutoff = 0.1 and exponent = 1) on Cluster 3.0, using average linkage. The clustered t-statistics underlying Figure 3 (tstat_table_atleast1_over_3_v2.cdt) can be viewed/explored using Java Treeview.

elife-60220-fig3-data1.zip^{(9.2KB, zip)}

elife-60220-supp1.zip^{(1.1KB, zip)}

Transparent reporting form

elife-60220-transrepform.docx^{(65.5KB, docx)}

Data Availability Statement

All data generated or analysed during this study are included in the manuscript, supporting files and the accompanying website.

[bib1] Abe Y, Shodai T, Muto T, Mihara K, Torii H, Nishikawa S, Endo T, Kohda D. Structural basis of presequence recognition by the mitochondrial protein import receptor Tom20. Cell. 2000;100:551–560. doi: 10.1016/S0092-8674(00)80691-1. [DOI] [PubMed] [Google Scholar]

[bib2] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]

[bib3] Breker M, Gymrek M, Schuldiner M. A novel single-cell screening platform reveals proteome plasticity during yeast stress responses. Journal of Cell Biology. 2013;200:839–850. doi: 10.1083/jcb.201301120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] Breker M, Gymrek M, Moldavski O, Schuldiner M. LoQAtE--localization and quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast. Nucleic Acids Research. 2014;42:D726–D730. doi: 10.1093/nar/gkt933. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] Byrne KP, Wolfe KH. The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research. 2005;15:1456–1461. doi: 10.1101/gr.3672305. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] Chen JR, Chang BH, Allen JE, Stiffler MA, MacBeath G. Predicting PDZ domain-peptide interactions from primary sequences. Nature Biotechnology. 2008;26:1041–1045. doi: 10.1038/nbt.1489. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Research. 2012;40:D700–D705. doi: 10.1093/nar/gkr1029. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Chong PA, Vernon RM, Forman-Kay JD. RGG/RG motif regions in RNA binding and phase separation. Journal of Molecular Biology. 2018;430:4650–4665. doi: 10.1016/j.jmb.2018.06.014. [DOI] [PubMed] [Google Scholar]

[bib9] Chuang GY, Boyington JC, Joyce MG, Zhu J, Nabel GJ, Kwong PD, Georgiev I. Computational prediction of N-linked glycosylation incorporating structural properties and patterns. Bioinformatics. 2012;28:2249–2255. doi: 10.1093/bioinformatics/bts426. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Darling AL, Liu Y, Oldfield CJ, Uversky VN. Intrinsically disordered proteome of human Membrane-Less organelles. Proteomics. 2018;18:1700193. doi: 10.1002/pmic.201700193. [DOI] [PubMed] [Google Scholar]

[bib11] Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ. Attributes of short linear motifs. Mol. BioSyst. 2012;8:268–281. doi: 10.1039/C1MB05231D. [DOI] [PubMed] [Google Scholar]

[bib12] de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20:1453–1454. doi: 10.1093/bioinformatics/bth078. [DOI] [PubMed] [Google Scholar]

[bib13] Dosztányi Z, Csizmók V, Tompa P, Simon I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of Molecular Biology. 2005;347:827–839. doi: 10.1016/j.jmb.2005.01.071. [DOI] [PubMed] [Google Scholar]

[bib14] Dubreuil B, Sass E, Nadav Y, Heidenreich M, Georgeson JM, Weill U, Duan Y, Meurer M, Schuldiner M, Knop M, Levy ED. YeastRGB: comparing the abundance and localization of yeast proteins across cells and libraries. Nucleic Acids Research. 2019;47:D1245–D1249. doi: 10.1093/nar/gky941. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The pfam protein families database in 2019. Nucleic Acids Research. 2019;47:D427–D432. doi: 10.1093/nar/gky995. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Forman-Kay JD, Mittag T. From sequence and forces to structure, function, and evolution of intrinsically disordered proteins. Structure. 2013;21:1492–1499. doi: 10.1016/j.str.2013.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] Freiman RN, Tjian R. A glutamine-rich trail leads to transcription factors. Science. 2002;296:2149–2150. doi: 10.1126/science.1073845. [DOI] [PubMed] [Google Scholar]

[bib18] Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–22. doi: 10.18637/jss.v033.i01. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Fukasawa Y, Tsuji J, Fu SC, Tomii K, Horton P, Imai K. MitoFates: improved prediction of mitochondrial targeting sequences and their cleavage sites. Molecular & Cellular Proteomics. 2015;14:1113–1126. doi: 10.1074/mcp.M114.043083. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] Garg SG, Gould SB. The role of charge in protein targeting evolution. Trends in Cell Biology. 2016;26:894–905. doi: 10.1016/j.tcb.2016.07.001. [DOI] [PubMed] [Google Scholar]

[bib21] Gemayel R, Chavali S, Pougach K, Legendre M, Zhu B, Boeynaems S, van der Zande E, Gevaert K, Rousseau F, Schymkowitz J, Babu MM, Verstrepen KJ. Variable Glutamine-Rich repeats modulate transcription factor activity. Molecular Cell. 2015;59:615–627. doi: 10.1016/j.molcel.2015.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity. Chapman and Hall/CRC; 2015. [Google Scholar]

[bib23] Higuchi-Sanabria R, Garcia EJ, Tomoiaga D, Munteanu EL, Feinstein P, Pon LA. Characterization of fluorescent proteins for three- and Four-Color Live-Cell imaging in S. cerevisiae. PLOS ONE. 2016;11:e0146120. doi: 10.1371/journal.pone.0146120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] Holt LJ, Tuch BB, Villén J, Johnson AD, Gygi SP, Morgan DO. Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution. Science. 2009;325:1682–1686. doi: 10.1126/science.1172867. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. doi: 10.1038/nature02026. [DOI] [PubMed] [Google Scholar]

[bib26] Ionescu CN, Origanti S, McAlear MA. The yeast rRNA biosynthesis factor Ebp2p is also required for efficient nuclear division. Yeast. 2004;21:1219–1232. doi: 10.1002/yea.1177. [DOI] [PubMed] [Google Scholar]

[bib27] Iserman C, Desroches Altamirano C, Jegers C, Friedrich U, Zarin T, Fritsch AW, Mittasch M, Domingues A, Hersemann L, Jahnel M, Richter D, Guenther UP, Hentze MW, Moses AM, Hyman AA, Kramer G, Kreysing M, Franzmann TM, Alberti S. Condensation of Ded1p promotes a translational switch from housekeeping to stress protein production. Cell. 2020;181:818–831. doi: 10.1016/j.cell.2020.04.009. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] Jaussi R. Homologous nuclear-encoded mitochondrial and cytosolic isoproteins. A review of structure, biosynthesis and genes. European Journal of Biochemistry. 1995;228:551–561. doi: 10.1111/j.1432-1033.1995.tb20294.x. [DOI] [PubMed] [Google Scholar]

[bib29] Jones DT, Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics. 2015;31:857–863. doi: 10.1093/bioinformatics/btu744. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] Katuwawala A, Peng Z, Yang J, Kurgan L. Computational prediction of MoRFs, short Disorder-to-order transitioning protein binding regions. Computational and Structural Biotechnology Journal. 2019;17:454–462. doi: 10.1016/j.csbj.2019.03.013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] Klemsz MJ, Maki RA. Activation of transcription by PU.1 requires both acidic and glutamine domains. Molecular and Cellular Biology. 1996;16:390–397. doi: 10.1128/MCB.16.1.390. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] Kohda D. "Multiple partial recognitions in dynamic equilibrium" in the binding sites of proteins form the molecular basis of promiscuous recognition of structurally diverse ligands. Biophysical Reviews. 2018;10:421–433. doi: 10.1007/s12551-017-0365-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Krois AS, Dyson HJ, Wright PE. Long-range regulation of p53 DNA binding by its intrinsically disordered N-terminal transactivation domain. PNAS. 2018;115:E11302–E11310. doi: 10.1073/pnas.1814051115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology. 1982;157:105–132. doi: 10.1016/0022-2836(82)90515-0. [DOI] [PubMed] [Google Scholar]

[bib35] Lai AC, Nguyen Ba AN, Moses AM. Predicting kinase substrates using conservation of local motif density. Bioinformatics. 2012;28:962–969. doi: 10.1093/bioinformatics/bts060. [DOI] [PubMed] [Google Scholar]

[bib36] Laptenko O, Tong DR, Manfredi J, Prives C. The tail that wags the dog: how the disordered C-Terminal domain controls the transcriptional activities of the p53 Tumor-Suppressor protein. Trends in Biochemical Sciences. 2016;41:1022–1034. doi: 10.1016/j.tibs.2016.08.011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] Lobley A, Swindells MB, Orengo CA, Jones DT. Inferring function using patterns of native disorder in proteins. PLOS Computational Biology. 2007;3:e162. doi: 10.1371/journal.pcbi.0030162. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib38] Martin EW, Holehouse AS, Peran I, Farag M, Incicco JJ, Bremer A, Grace CR, Soranno A, Pappu RV, Mittag T. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science. 2020;367:694–699. doi: 10.1126/science.aaw8653. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib39] Mitchell SF, Jain S, She M, Parker R. Global analysis of yeast mRNPs. Nature Structural & Molecular Biology. 2013;20:127–133. doi: 10.1038/nsmb.2468. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib40] Moses A. Statistical Modeling and Machine Learning for Molecular Biology. Boca Raton: Chapman and Hall/CRC Press; 2017. [DOI] [Google Scholar]

[bib41] Narayanaswamy R, Levy M, Tsechansky M, Stovall GM, O'Connell JD, Mirrielees J, Ellington AD, Marcotte EM. Widespread reorganization of metabolic enzymes into reversible assemblies upon nutrient starvation. PNAS. 2009;106:10147–10152. doi: 10.1073/pnas.0812771106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib42] Nguyen Ba AN, Yeh BJ, van Dyk D, Davidson AR, Andrews BJ, Weiss EL, Moses AM. Proteome-wide discovery of evolutionary conserved sequences in disordered regions. Science Signaling. 2012;5:rs1. doi: 10.1126/scisignal.2002515. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib43] Nguyen Ba AN, Strome B, Hua JJ, Desmond J, Gagnon-Arsenault I, Weiss EL, Landry CR, Moses AM. Detecting functional divergence after gene duplication through evolutionary changes in posttranslational regulatory sequences. PLOS Computational Biology. 2014;10:e1003977. doi: 10.1371/journal.pcbi.1003977. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib44] Nielsen JT, Mulder FAA. Quality and Bias of protein disorder predictors. Scientific Reports. 2019;9:5137. doi: 10.1038/s41598-019-41644-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib45] Ondrechen MJ, Clifton JG, Ringe D. THEMATICS: a simple computational predictor of enzyme function from structure. PNAS. 2001;98:12473–12478. doi: 10.1073/pnas.211436698. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib46] Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DW, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJ, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YA, van Dijk AD, ter Braak CJ, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013;10:221–227. doi: 10.1038/nmeth.2340. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib47] Roise D, Schatz G. Mitochondrial presequences. Journal of Biological Chemistry. 1988;263:4509–4511. doi: 10.1016/S0021-9258(18)68809-X. [DOI] [PubMed] [Google Scholar]

[bib48] Ryan O, Shapiro RS, Kurat CF, Mayhew D, Baryshnikova A, Chin B, Lin ZY, Cox MJ, Vizeacoumar F, Cheung D, Bahr S, Tsui K, Tebbji F, Sellam A, Istel F, Schwarzmüller T, Reynolds TB, Kuchler K, Gifford DK, Whiteway M, Giaever G, Nislow C, Costanzo M, Gingras AC, Mitra RD, Andrews B, Fink GR, Cowen LE, Boone C. Global gene deletion analysis exploring yeast filamentous growth. Science. 2012;337:1353–1356. doi: 10.1126/science.1224339. [DOI] [PubMed] [Google Scholar]

[bib49] Saitoh T, Igura M, Obita T, Ose T, Kojima R, Maenaka K, Endo T, Kohda D. Tom20 recognizes mitochondrial presequences through dynamic equilibrium among multiple bound states. The EMBO Journal. 2007;26:4777–4787. doi: 10.1038/sj.emboj.7601888. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib50] Saldanha AJ. Java treeview--extensible visualization of microarray data. Bioinformatics. 2004;20:3246–3248. doi: 10.1093/bioinformatics/bth349. [DOI] [PubMed] [Google Scholar]

[bib51] Sekiya-Kawasaki M, Groen AC, Cope MJTV, Kaksonen M, Watson HA, Zhang C, Shokat KM, Wendland B, McDonald KL, McCaffery JM, Drubin DG. Dynamic phosphoregulation of the cortical actin cytoskeleton and endocytic machinery revealed by real-time chemical genetic analysis. Journal of Cell Biology. 2003;162:765–772. doi: 10.1083/jcb.200305077. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib52] Sharifpoor S, Nguyen Ba AN, Youn JY, Young JY, van Dyk D, Friesen H, Douglas AC, Kurat CF, Chong YT, Founk K, Moses AM, Andrews BJ. A quantitative literature-curated gold standard for kinase-substrate pairs. Genome Biology. 2011;12:R39. doi: 10.1186/gb-2011-12-4-r39. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib53] Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: Visualizing the performance of scoring classifiers. 1:4R Packag. 2009 https://cran.r-project.org/web/packages/ROCR/ROCR.pdf

[bib54] Storici F, Lewis LK, Resnick MA. In vivo site-directed mutagenesis using oligonucleotides. Nature Biotechnology. 2001;19:773–776. doi: 10.1038/90837. [DOI] [PubMed] [Google Scholar]

[bib55] The Gene Ontology Consortium The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Research. 2019;47:D330–D338. doi: 10.1093/nar/gky1055. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib56] Tibshirani R, Wainwright M, Hastie T. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC; 2015. [DOI] [Google Scholar]

[bib57] Uversky VN. Natively unfolded proteins: a point where biology waits for physics. Protein Science. 2002;11:739–756. doi: 10.1110/ps.4210102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib58] Uversky VN. New technologies to analyse protein function: an intrinsic disorder perspective. F1000Research. 2020;9:101–114. doi: 10.12688/f1000research.20867.1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib59] Vacic V, Iakoucheva LM. Disease mutations in disordered regions--exception to the rule? Mol. BioSyst. 2012;8:27–32. doi: 10.1039/C1MB05251A. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib60] van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, Kim PM, Kriwacki RW, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright PE, Babu MM. Classification of intrinsically disordered regions and proteins. Chemical Reviews. 2014;114:6589–6631. doi: 10.1021/cr400525m. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib61] Vernon RM, Chong PA, Tsang B, Kim TH, Bah A, Farber P, Lin H, Forman-Kay JD. Pi-Pi contacts are an overlooked protein feature relevant to phase separation. eLife. 2018;7:e31486. doi: 10.7554/eLife.31486. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib62] Vernon RM, Forman-Kay JD. First-generation predictors of biological protein phase separation. Current Opinion in Structural Biology. 2019;58:88–96. doi: 10.1016/j.sbi.2019.05.016. [DOI] [PubMed] [Google Scholar]

[bib63] Vögtle FN, Wortelkamp S, Zahedi RP, Becker D, Leidhold C, Gevaert K, Kellermann J, Voos W, Sickmann A, Pfanner N, Meisinger C. Global analysis of the mitochondrial N-proteome identifies a processing peptidase critical for protein stability. Cell. 2009;139:428–439. doi: 10.1016/j.cell.2009.07.045. [DOI] [PubMed] [Google Scholar]

[bib64] Wallace EW, Kear-Scott JL, Pilipenko EV, Schwartz MH, Laskowski PR, Rojek AE, Katanski CD, Riback JA, Dion MF, Franks AM, Airoldi EM, Pan T, Budnik BA, Drummond DA. Reversible, specific, active aggregates of endogenous proteins assemble upon heat stress. Cell. 2015;162:1286–1298. doi: 10.1016/j.cell.2015.08.041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib65] Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. Jalview version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191. doi: 10.1093/bioinformatics/btp033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib66] Zarin T, Tsai CN, Nguyen Ba AN, Moses AM. Selection maintains signaling function of a highly diverged intrinsically disordered region. PNAS. 2017;114:E1450–E1459. doi: 10.1073/pnas.1614787114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib67] Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman-Kay JD, Moses AM. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife. 2019;8:e46883. doi: 10.7554/eLife.46883. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib68] Zarin T. FAIDR. swh:1:rev:b1ef30705f7133f21201166522077a062984566bSoftware Heritage. 2021 https://archive.softwareheritage.org/swh:1:dir:f7116bc669d3b5399e74f85f21a01dac652862ef;origin=https://github.com/taraneh-z/FAIDR;visit=swh:1:snp:4efca85c9ee543283e54280a32418b345359049a;anchor=swh:1:rev:b1ef30705f7133f21201166522077a062984566b/

PERMALINK

Identifying molecular features that are associated with biological function of intrinsically disordered protein regions

Taraneh Zarin

Bob Strome

Gang Peng

Iva Pritišanac

Julie D Forman-Kay

Alan M Moses

Roles

Abstract

Introduction

Results

A model that predicts protein function from molecular features in IDRs

Figure 1. Schematic of the FAIDR statistical model.

Figure 1—figure supplement 1. The FAIDR probabilistic model.

IDR function can be predicted from protein-level annotations and IDR sequence properties

Figure 2. FAIDR trained on evolutionary signatures of IDRs can predict unseen data comparably to state-of-the-art, specific predictors for mitochondrial targeting signals (Mitofates [Fukasawa et al., 2015]) and Cdc28 substrates (Condens [Lai et al., 2012]).

Figure 2—figure supplement 1. Predicting diverse protein function and phenotype using FAIDR.

Exploring molecular features of IDRs associated with diverse functions and phenotypes

Figure 3. Molecular features of IDRs are associated with specific functions.

Molecular features in the Cox15 IDR predictably affect mitochondrial targeting of Cox15

Figure 4. Molecular features predicted to be associated with ‘mitochondrion’ GO annotation affect mitochondrial localization phenotype.

Figure 4—figure supplement 1. Micrographs showing (columns, left to right) bright-field, DAPI, Cox15-GFP, and 4× zoomed Cox15-GFP/DAPI overlay of different budding yeast strains with different Cox15 IDR genotypes (labels on left side of image).

Inferring function for specific IDRs in uncharacterized proteins

Figure 5. Identification of a specific IDR sequence associated with transcription factor activity in an uncharacterized protein.

Figure 5—figure supplement 1. Example of modular IDR function predicted by FAIDR.

Discussion

Materials and methods

The FAIDR statistical model

Obtaining evolutionary signatures for IDRs in the yeast proteome

Compiling functions and phenotypes for prediction

Evaluating FAIDR at the IDR level

Evaluating FAIDR at the protein level

Clustering of t-statistics

Construction of Cox15-GFP strains with variable molecular features and confocal microscopy

Acknowledgements

Appendix 1

Multiple LASSO logistic regression

IRLS: weighted OLS regression for logistic regression

A model to account for multiple disordered regions in one protein

Funding Statement

Contributor Information

Funding Information

Additional information

Competing interests

Author contributions

Additional files

Data availability

References

Decision letter

Roles

Author response

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases