Nucleic Acids Research. 2025 May 21;53(W1):W512–W519. doi: 10.1093/nar/gkaf408

SHARK: web server for alignment-free homology assessment for intrinsically disordered and unalignable protein regions

Chi Fung Willis Chow, Maxim Scheremetjew, HongKee Moon, Soumyadeep Ghosh, Anna Hadarovich, Lena Hersemann, Agnes Toth-Petroczy
PMCID: PMC12230711  PMID: 40396357

Abstract

Whereas alignment has been fundamental to sequence-based assessments of protein homology, it is ineffective for intrinsically disordered regions (IDRs) due to their lowered sequence conservation and unique sequence properties. Here, we present a web server implementation of SHARK (bio-shark.org), an alignment-free algorithm for homology classification that compares the overall amino acid composition and short regions (k-mers) shared between sequences (SHARK-scores). The output of such k-mer-based comparisons is used by SHARK-dive, a machine learning classifier to detect homology between unalignable, disordered sequences. SHARK-web provides sequence-versus-database assessment of protein sequence homology akin to conventional tools such as BLAST and HMMER. Additionally, we provide precomputed sets of IDR sequences from 16 model organism proteomes facilitating searches against species-specific IDR-omes. SHARK-dive offers superior overall homology detection performance to BLAST and HMMER, driven by a large increase in sensitivity to low sequence identity homologs, and can be used to facilitate the study of sequence–function relationships in disordered, difficult-to-align regions.

Graphical Abstract


Introduction

A key goal in biology is to decipher the function of all proteins. Since exhaustive experimental validation is infeasible, computational approaches have been developed to facilitate systematic annotation of proteins via homology-based annotation transfer [1]. Specifically, local sequence alignment algorithms such as BLAST and HMMER have been invaluable in identifying evolutionary homologs and functional analogs [2, 3]. This has further allowed highly conserved and alignable regions to be cataloged into databases such as Pfam and InterPro (https://www.ebi.ac.uk/interpro/).

Nevertheless, many residues are still unmapped to these domains despite continued releases [4], indicating that ∼30% of residues are not amenable to alignment-based analyses. This includes residues in intrinsically disordered regions (IDRs), which are known to be difficult to align [5–7] because of their elevated rate of evolution [8, 9] as well as their biased amino acid compositions relative to structured regions [10, 11]. Nonetheless, given their immense functional repertoire [12–14] and widespread prevalence in the protein universe [15, 16], accurate assessment of evolutionary homology and functional analogy in IDRs and other difficult-to-align regions, without resorting to sequence alignment, is urgently required.

To fill this unmet need, we have developed SHARK, an alignment-free sequence comparison algorithm based on assessing the physicochemical similarities at compositional and k-mer levels (SHARK-scores), as well as SHARK-dive, a machine-learned homology classifier trained on a manually curated set of unalignable and highly disordered homologous sequences [15]. SHARK-dive has been shown to offer superior performance to alignment-based homology assessment tools, such as BLAST and HMMER, in detecting remote, difficult-to-align, disordered homologs and analogs. Here, we present the web server implementation of SHARK-dive, aimed at facilitating proteome-wide homology searches for IDRs and other difficult-to-align regions by offering an intuitive user interface and providing access to the wider scientific community.

Materials and methods

The SHARK-dive algorithm

SHARK-dive is a homology classifier trained specifically for difficult-to-align, disordered sequences. It builds on two alignment-free sequence comparison algorithms: the SHARK algorithm, a k-mer-based method that assesses the physicochemical similarity between short k-mers using the Grantham distance matrix [15, 17], and the Normalized Google Distance (NGD) [18], which offered the best performance for shorter k-mer lengths in a feature selection task on highly disordered Pfam families [15]. SHARK-dive uses the results of these algorithms (SHARK-scores) for a range of k-mer lengths, from k = 1 (amino acid composition) to k = 10, in order to integrate information on any regions of similarity between the sequences. These include short linear motifs (SLiMs), which are usually 3–10 amino acids long [19–21] and can be identified by motif detection tools such as SHARK-capture [22]. Mechanistically, SHARK-dive is a 10-fold ensemble of gradient-boosted decision tree models; the outputs of all models are averaged to give the final SHARK-dive score.
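The averaging step can be illustrated with a minimal sketch. It assumes that the ensemble consists of member models that each return a homology probability for a k-mer score vector; the names `ensemble_score` and `is_homolog` and the callable-based model interface are hypothetical and are not the SHARK-dive code. The 0.5 decision threshold is the one used throughout this article.

```python
from statistics import mean


def ensemble_score(models, score_vector):
    """Average the outputs of all member models for one sequence pair.

    `models` is any iterable of callables mapping a score vector to a
    probability in [0, 1] (a hypothetical interface for illustration).
    """
    return mean(model(score_vector) for model in models)


def is_homolog(score, threshold=0.5):
    """Apply the decision threshold to an averaged ensemble score."""
    return score >= threshold
```

With three toy models returning 0.2, 0.4, and 0.9, the averaged score is 0.5, which sits exactly on the threshold and is therefore called as homologous.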

SHARK-dive was trained on a manually curated set of non-Pfam-annotated, disorder-enriched orthologous regions, where correspondence between orthologous regions within a “sequence family” is determined by their domain architectures, alongside a strict filter of <50% sequence identity to ensure that it is trained on highly dissimilar orthologous regions. On a test set of 1785 sequences sharing <50% identity [15], SHARK-dive achieved superior precision-recall in an all-versus-all (i.e. 3 186 225 comparisons) homology prediction task, detecting 172 767 of 322 581 homologs (i.e. sensitivity = 0.536) (Table 1). This is a more than four-fold increase in sensitivity over conventional alignment-based homology detection tools such as BLAST (sensitivity = 0.115) and HMMER (sensitivity = 0.119), and results in the best overall performance, with an F1 score of 0.336 (BLAST = 0.201, HMMER = 0.210) (Table 1). For a more exhaustive benchmarking analysis, please refer to Chow et al. [22].

Table 1. Performance of the SHARK-dive algorithm compared to HMMER and BLAST on a test set of 1785 sequences with <50% identity (2 863 644 negative and 322 581 positive pairs)

| Method, threshold | Recall | Specificity | Precision | Accuracy | F1 |
|---|---|---|---|---|---|
| SHARK-dive, 0.5 | 0.536 | 0.814 | 0.245 | 0.786 | 0.336 |
| pHMMER E-value, 1 (a) | 0.119 | 0.998 | 0.856 | 0.909 | 0.210 |
| BLAST BLOSUM62 (default matrix), 10 (b) | 0.115 | 0.996 | 0.787 | 0.907 | 0.201 |

(a) Default reporting E-value on web server.

(b) Default E-value in BLAST application.
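To make the relationship between the columns of Table 1 explicit, the sketch below recomputes the five metrics from confusion-matrix counts. The positive and negative pair counts and the number of detected homologs are taken from the text; the true-negative count is not reported and is back-calculated here from the rounded specificity of 0.814, so it is an approximation. The function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute recall, specificity, precision, accuracy and F1 from counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


POSITIVES = 322_581    # homologous pairs (from the text)
NEGATIVES = 2_863_644  # non-homologous pairs (from the text)
TP = 172_767           # homologs detected by SHARK-dive (from the text)
FN = POSITIVES - TP
# Assumption: TN is reconstructed from the rounded specificity in Table 1.
TN = round(0.814 * NEGATIVES)
FP = NEGATIVES - TN

shark_dive_metrics = confusion_metrics(TP, FP, TN, FN)
```

The low precision alongside high accuracy reflects the class imbalance: negatives outnumber positives roughly nine to one, so even a modest false-positive rate produces more false positives than true positives.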

The SHARK web server

SHARK-web is free and open to all users and there is no login requirement. We use a common design concept for the underlying back-end architecture, which allows computationally expensive jobs of various sizes to be executed asynchronously (Fig. 1). This has the advantage that the website is not blocked while a job is running. The Python-based Django web framework (https://www.djangoproject.com/) is used as the backbone for receiving requests from the front-end client. Gunicorn (https://gunicorn.org/) acts as the HTTP server between the front-end client (e.g. a browser) and Django, and communicates with Django via the WSGI protocol. We use Celery (https://docs.celeryq.dev/en/stable/index.html) for managing task execution. Requests are processed by Django: they are validated, approved or declined, and transformed into new Celery tasks/jobs. Once a new job has been created, Django submits the task to a job queue. For the communication between Django and the distributed task queue manager, we use RabbitMQ (https://www.rabbitmq.com/) as the messenger (message broker) (Fig. 1).

Figure 1. Overview of the architectural design of SHARK-web. Django is used as a back-end orchestrator to wire all components together. A user sends a request to the backend, which gets validated. If the request gets approved, it will create and submit a new job to the distributed task queue where a worker will pick up the job from the queue and execute it. Once the job has been executed, results will be stored in the database and reported back to Django via the messenger. The client sends regular requests to monitor the status of each job. Once the status is completed, it will visualize the results for the user. For long-running jobs, a user can also request to be notified via email as soon as the job is done.

Jobs scheduled in the queue are picked up by preconfigured workers. Each worker runs one job at a time. The number of workers is configurable depending on the computing power of the server where the application is running, which allows horizontal scaling if needed. The results of a finished task are stored in the back-end database and the status of the job is communicated back to the Django application. The front-end client monitors the status of a job by sending HTTP GET requests.
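The submit, queue, execute, and poll cycle described above can be sketched with the Python standard library alone. This is a simplified stand-in, not the SHARK-web code: `queue.Queue` plays the role of the Celery/RabbitMQ job queue, a dictionary plays the role of the back-end database, and all function names and status strings are hypothetical.

```python
import queue
import threading
import uuid

task_queue = queue.Queue()     # stand-in for the Celery/RabbitMQ job queue
job_store = {}                 # stand-in for the back-end database
store_lock = threading.Lock()  # protects job_store across threads


def set_job(job_id, **fields):
    """Create or update the stored record of a job."""
    with store_lock:
        job_store.setdefault(job_id, {}).update(fields)


def submit_job(payload):
    """Validate a request, register a job, and enqueue it without blocking."""
    if not payload.get("queries") or not payload.get("targets"):
        raise ValueError("both query and target sequences are required")
    job_id = str(uuid.uuid4())
    set_job(job_id, status="PENDING", result=None)
    task_queue.put((job_id, payload))
    return job_id


def worker_loop(run_job):
    """Process one job at a time until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        try:
            if item is None:
                return
            job_id, payload = item
            set_job(job_id, status="RUNNING")
            try:
                set_job(job_id, status="COMPLETED", result=run_job(payload))
            except Exception as exc:
                set_job(job_id, status="FAILED", result=str(exc))
        finally:
            task_queue.task_done()


def job_status(job_id):
    """What a client would learn from one status-polling request."""
    with store_lock:
        return job_store[job_id]["status"]
```

Starting several threads that run `worker_loop` corresponds to configuring several workers: each still handles a single job at a time, while `submit_job` returns immediately, so the caller is never blocked by a running job.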

We utilized the Vue.js framework (https://vuejs.org/) to create the front-end pages for our web application. Additionally, we used D3.js (https://d3js.org/) for the visualizations on the result page, e.g. for the heatmap and table view.

Datasets

IUPred-based IDR datasets

Sequences from reference proteomes of seven organisms: H. sapiens, M. musculus, E. coli, S. cerevisiae, A. thaliana, D. rerio, and D. melanogaster (obtained from the 2022-05 release of UniProt) were analyzed using the disorder predictor IUPred2a (long) [23]. Following smoothing of residue-specific disorder values using a Savitzky–Golay filter (moving average window size = 9, polynomial degree = 3), consecutive disordered residues (disorder threshold of 0.4) were concatenated into IDR segments, and filtered for ≥10 amino acid IDR segments without unresolved or non-canonical amino acids (Table 2).
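The smoothing and segmentation steps can be sketched as follows. The window size, polynomial degree, disorder threshold, and minimum length come from the text; the rest is an assumption made for illustration. In particular, the sketch uses the standard 9-point cubic Savitzky–Golay convolution weights for interior residues and leaves the first and last four residues unsmoothed, and it treats residues with a score equal to the threshold as disordered. Neither choice is confirmed by the article, and the function names are hypothetical.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

# Standard Savitzky-Golay smoothing weights (window = 9, degree = 3).
SG_WEIGHTS = (-21, 14, 39, 54, 59, 54, 39, 14, -21)
SG_NORM = 231


def smooth(scores):
    """Smooth interior positions; edge positions are copied unchanged."""
    half = len(SG_WEIGHTS) // 2
    out = list(scores)
    for i in range(half, len(scores) - half):
        window = scores[i - half:i + half + 1]
        out[i] = sum(w * s for w, s in zip(SG_WEIGHTS, window)) / SG_NORM
    return out


def idr_segments(sequence, scores, threshold=0.4, min_len=10):
    """Return (start, end, segment) for runs of disordered residues.

    Coordinates are 0-based and half-open. Runs shorter than `min_len`
    or containing non-canonical/unresolved residues are discarded.
    """
    segments = []
    start = None
    # A sentinel below any threshold closes a run that reaches the C-terminus.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            segment = sequence[start:i]
            if len(segment) >= min_len and set(segment) <= CANONICAL:
                segments.append((start, i, segment))
            start = None
    return segments
```

A typical call would be `idr_segments(seq, smooth(raw_scores))`, where `raw_scores` holds one disorder value per residue from the predictor.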

Table 2. IDR datasets available at bio-shark.org to facilitate systematic homology annotations

| Unique proteome ID | Organism (common name) | Taxonomy ID | No. IUPred2a IDRs | No. AlphaFold2 IDRs |
|---|---|---|---|---|
| UP000005640 | Homo sapiens (human) | 9606 | 69 383 | 70 807 |
| UP000000589 | Mus musculus (mouse) | 10090 | 67 570 | 52 334 |
| UP000000625 | Escherichia coli | 83333 | 5207 | 1111 |
| UP000002311 | Saccharomyces cerevisiae (Baker’s yeast) | 559292 | 14 985 | 12 288 |
| UP000006548 | Arabidopsis thaliana (mouse-ear cress) | 3702 | 60 185 | 51 011 |
| UP000000437 | Danio rerio (zebrafish) | 7955 | 89 418 | 60 173 |
| UP000000803 | Drosophila melanogaster (fruit fly) | 7227 | 41 643 | 34 177 |
| UP000000559 | Candida albicans | 237561 | - | 12 840 |
| UP000008827 | Glycine max (soybean) | 3847 | - | 105 632 |
| UP000002485 | Schizosaccharomyces pombe | 284812 | - | 8946 |
| UP000007305 | Zea mays (maize) | 4577 | - | 87 978 |
| UP000059680 | Oryza sativa (Japanese rice) | 39947 | - | 78 722 |
| UP000002195 | Dictyostelium discoideum | 44689 | - | 34 528 |
| UP000000805 | Methanocaldococcus jannaschii | 243232 | - | 440 |
| UP000002494 | Rattus norvegicus (Norway rat) | 10116 | - | 50 388 |
| UP000001940 | Caenorhabditis elegans | 6239 | - | 35 478 |

AlphaFold2-based IDR datasets

Sequences and predicted AlphaFold2 structures from proteomes of 16 organisms: H. sapiens, M. musculus, E. coli, S. cerevisiae, A. thaliana, D. rerio, D. melanogaster, C. albicans, C. elegans, D. discoideum, G. max, M. jannaschii, O. sativa, R. norvegicus, S. pombe, and Z. mays were obtained from AlphaFoldDB version 2022-11-01 [24]. Consecutive disordered residues with pLDDT <50 were concatenated into IDR segments, and filtered for ≥10 amino acid IDR segments without unresolved or non-canonical amino acids (Table 2).
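As an illustration of this extraction, the sketch below reads per-residue pLDDT values from an AlphaFold2 model in PDB format, where the pLDDT is stored in the B-factor column, and collects runs of residues with pLDDT <50. It assumes a single-chain model and reads only the Cα atom of each residue; the function names are hypothetical and the code is not the pipeline used to build the datasets.

```python
from itertools import groupby

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_plddt(pdb_lines):
    """Return [(one_letter_residue, pLDDT), ...] from the CA atoms.

    Uses the fixed-width PDB columns: atom name 13-16, residue name
    18-20, B-factor 61-66. Unknown residue names are mapped to "X".
    """
    residues = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            aa = THREE_TO_ONE.get(line[17:20], "X")
            residues.append((aa, float(line[60:66])))
    return residues


def low_plddt_segments(residues, cutoff=50.0, min_len=10):
    """Return (start, end, segment) for runs with pLDDT below `cutoff`.

    Coordinates are 0-based and half-open; short runs and runs with
    non-canonical residues ("X") are discarded.
    """
    segments = []
    position = 0
    for is_low, run in groupby(residues, key=lambda r: r[1] < cutoff):
        run = list(run)
        segment = "".join(aa for aa, _ in run)
        if is_low and len(segment) >= min_len and "X" not in segment:
            segments.append((position, position + len(run), segment))
        position += len(run)
    return segments
```

Grouping consecutive residues by the boolean `pLDDT < cutoff` yields alternating ordered and disordered runs, so segment coordinates follow directly from the running position.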

Web server usage

Input

SHARK-web is designed for (multiple-)query-versus-database searches similar to existing homology search tools, e.g. BLAST and HMMER. Its inputs are FASTA-formatted sequences for both query and target/database sequence(s), both of which can be entered directly or uploaded as a file (note that sequence IDs should be unique within each set of query and target FASTA sequences). We further accept single sequences (query and/or target) that are not FASTA-formatted (i.e. lacking a FASTA header) for quick copy-and-paste searches. Whereas SHARK-dive has no upper limit for sequence length, each sequence must be at least 10 amino acids long and must not contain non-canonical or unresolved amino acids.

For searches against species-specific IDR-omes, we provide precomputed sets of IDRs extracted from the proteomes of several model organisms, using IUPred2a [23] and AlphaFold2 [25] to predict disorder (Table 2). SHARK-web limits each job to fewer than 106 000 comparisons, i.e. the number of query sequences times the number of target/database sequences must be <106 000. Whereas there is no login or email requirement, users can optionally enter their email address to be notified when their submitted jobs are completed, which may be useful for jobs involving a large number of comparisons.
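The input rules above can be checked before submission with a short script. The sketch below is a hedged illustration of those rules, not the validation code of the server: the function names are hypothetical, the sequence ID is assumed to be the first whitespace-delimited token of the FASTA header, and a headerless single sequence is given a placeholder ID.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 10
MAX_COMPARISONS = 106_000


def parse_fasta(text, default_id="sequence"):
    """Parse FASTA text, or a bare sequence without header, into (id, seq)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records.append([header.split()[0] if header else default_id, ""])
        else:
            if not records:  # bare sequence pasted without a header
                records.append([default_id, ""])
            records[-1][1] += line.upper()
    return [(name, seq) for name, seq in records]


def validate_records(records):
    """Return human-readable problems; an empty list means the set is valid."""
    problems = []
    seen = set()
    for name, seq in records:
        if name in seen:
            problems.append(f"{name}: duplicate sequence ID")
        seen.add(name)
        if len(seq) < MIN_LENGTH:
            problems.append(f"{name}: shorter than {MIN_LENGTH} amino acids")
        if set(seq) - CANONICAL:
            problems.append(f"{name}: contains non-canonical residues")
    return problems


def within_limit(n_queries, n_targets):
    """True if queries x targets stays below the comparison limit."""
    return n_queries * n_targets < MAX_COMPARISONS
```

For example, a single query against the largest precomputed set in Table 2 (105 632 soybean IDRs) stays within the limit, whereas two queries against the same set do not.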

SHARK-dive processing

Upon sequence submission, the backend performs query–target pairwise homology assessment via the following steps. We note that the web server utilizes an unmodified version of SHARK and SHARK-dive consistent with Chow et al. [15]; hence, the performance is identical and directly comparable:

  • Each sequence is represented as k-mer vectors ranging from k = 1 to k = 10 (Fig. 2A).

  • Alignment-free sequence comparison scores are computed based on the k-mer vectors of the sequence pair [15]. These comprise SHARK-scores (Fig. 2A), which assess the physicochemical similarity between k-mers to identify the most similar/sufficiently similar k-mers shared by the sequences, as well as the NGD [18]. Since one score is computed for each value of k, this results in a 10-element k-mer score vector.

  • The SHARK-dive model is applied to the score vector (Fig. 2B).
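The first two steps can be sketched as follows to show how a sequence pair becomes a 10-element score vector. Note that the similarity measure here is a plain cosine similarity between k-mer count vectors, used purely as a stand-in: it is neither the SHARK-score (which compares k-mers physicochemically via the Grantham matrix) nor the NGD. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k):
    """Count all overlapping k-mers of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(counts_a, counts_b):
    """Placeholder similarity between two k-mer count vectors (not SHARK)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[kmer] * counts_b[kmer] for kmer in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_vector(seq_a, seq_b, ks=range(1, 11), similarity=cosine_similarity):
    """Return one similarity score per k-mer length (10 values by default)."""
    return [
        similarity(kmer_counts(seq_a, k), kmer_counts(seq_b, k))
        for k in ks
    ]
```

Because exact-match counting becomes sparse as k grows, a measure of this kind drops to zero quickly for diverged sequences; scoring similar rather than identical k-mers, as the SHARK-scores do, is what addresses this.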

Figure 2. SHARK-dive algorithm. This comprises k-mer similarity scores including SHARK-scores (A), which are the inputs of the SHARK-dive model to predict the homology between a pair of sequences (B) (adapted from [15]).

Outputs

The SHARK-dive score is the main output of the model and represents the likelihood of homology between the sequence pair: SHARK-dive scores ≥0.5 are predictive of homology. The results page first shows a “heatmap,” which visualizes the individual k-mer scores of the score vector (Fig. 3B). Since the NGD is a distance metric where lower scores indicate greater similarity, its values are shown as 1 − NGD (otherwise known as the Normalized Google Similarity) for consistency, so that higher values indicate greater similarity at that particular k-mer length. The final column of the “heatmap” view shows the SHARK-dive score of the sequence pair, which is colored dark gray if homology is predicted (i.e. SHARK-dive score ≥0.5) and light gray otherwise.

Figure 3. Example SHARK-dive workflow. This begins with sequence submission in the homepage (A) to compare query sequence(s) against target/database sequence(s). For long jobs, an email can be entered for notification of job completion. Upon job completion, the heatmap (B) and table (C) views are shown on the results page. Users can then download the entire results table or visualize the amino acid compositions and k-mer similarity scores (D) for a particular sequence pair.

To further highlight the similarities (or lack thereof) between individual sequence pairs, we provide an option to visualize the amino acid composition of a particular sequence pair (Fig. 3B). This can be done by selecting the “Generate charts” button. This opens a new page containing a bar chart comparison of the amino acid frequencies of both sequences, as well as a series of 10 boxplots showing the k-mer score distribution of SHARK-dive training sequences, with the k-mer score of the current pair overlaid as a dashed line (Fig. 3D). The boxplots aim to contextualize the k-mer score of the sequence pair: high k-mer scores that far exceed the training distribution of homologs may indicate the presence of conserved, highly similar k-mers between the sequences that may contribute to their predicted shared homology.
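The idea behind the boxplot overlay can be expressed as a small helper that places a pair's k-mer score relative to a reference distribution. This is only a sketch under assumptions: the training scores are not reproduced here, so `training_scores` is a hypothetical input, and the use of Tukey fences (1.5 × IQR beyond the quartiles) to define "far exceeds" is an illustrative choice rather than the rule used by the web server.

```python
from statistics import quantiles


def position_in_distribution(score, training_scores):
    """Classify a k-mer score relative to quartiles of a reference set."""
    q1, _median, q3 = quantiles(training_scores, n=4)
    iqr = q3 - q1
    if score > q3 + 1.5 * iqr:
        return "above_upper_fence"  # far exceeds the reference distribution
    if score > q3:
        return "above_q3"
    if score >= q1:
        return "within_iqr"
    if score >= q1 - 1.5 * iqr:
        return "below_q1"
    return "below_lower_fence"
```

Applying this helper at each of the ten k-mer lengths gives a quick textual summary of which lengths stand out for a given pair.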

Since the heatmap only offers the top 100 SHARK-dive-scoring pairs, the results page also offers a table containing the same set of scores for the entire search (Fig. 3C). Here, the k-mer scores are color-coded by their value relative to the k-mer scores in the training distribution. This color-coding scheme aims to highlight particular k-mer lengths which harbor many similar regions between the pair of sequences. Finally, for further processing, the entire k-mer score table, including the SHARK-dive scores, can also be downloaded using the “Download Table” button.
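The relationship between the heatmap subset and the full table can be sketched as a sort-and-truncate step followed by export. The row layout, column names (`k1` to `k10`, `shark_dive`), and tab-separated output below are hypothetical; the actual format of the downloadable table is not specified here.

```python
import csv
import io

FIELDS = ["query", "target"] + [f"k{k}" for k in range(1, 11)] + ["shark_dive"]


def top_pairs(rows, n=100):
    """Return the n highest-scoring pairs, best first (heatmap subset)."""
    return sorted(rows, key=lambda row: row["shark_dive"], reverse=True)[:n]


def to_tsv(rows):
    """Serialize all rows of a search as a tab-separated table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the full table separate from the truncated view means that lower-scoring pairs remain available for downstream filtering with a different threshold.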

Usage examples

In order to offer an example of a typical use case of the SHARK web server, we performed a search of the human FUS RNA-binding domain (FUS RBD, residues 212–526 as defined by Wang et al. [13], UniProtID P35637) to identify potential homologs in M. musculus IDRs and functional analogs in H. sapiens IDRs. This follows the rationale of a BLAST/HMMER search of a query (FUS RBD) against a sequence database (mouse and human IDRs).

Example 1: Finding IDR homologs in another species

As expected, the search against mouse IDRs predicts homology to the C-terminal IDR of M. musculus FUS, indicating that SHARK-dive is sensitive to orthologous IDRs. Concordant with BLAST and HMMER predictions, homology is also predicted to the C-terminal IDRs of mouse EWS (UniProtID Q61545) and TAF15 (UniProtID Q8BQ46) proteins, which are the other members of the FET family [26, 27]. Moreover, similar to human FUS, human EWS and TAF15 are also known to be capable of promoting in vitro phase separation using their IDRs [13]. The consistency with BLAST and HMMER predictions highlights that SHARK-dive not only identifies orthologous IDRs, but is also sensitive to closely related IDR sequences with similar functions, and serves as a sanity check that homologous IDRs of a known functional family are not missed.

Importantly, SHARK-dive is also capable of predicting mouse homologs that BLAST and HMMER fail to identify. One such example is the N-terminal IDR of mouse fibrillarin (UniProtID P35550). This IDR is rich in arginine, glycine, and RGG repeats [28], similar to the FUS RBD, and contains the conserved GAR domain [29], which was shown to be important in driving localization and phase separation of fibrillarin [30, 31]. Moreover, this N-terminal region interacts with the N-terminal IDR of FUS [31]; similarly, the FUS RBD is also known to drive phase transitions via interacting with the N-terminal FUS IDR [13]. As such, this predicted homology is supported by shared functionality as well as sequence compositional similarities.

Example 2: Finding functional IDR analogs within the same species

It may also be interesting to discover similar sequences within the same proteome, which may therefore be functionally similar; this could be especially useful when looking for conserved recognition/modification sites. In the case of the FUS RBD, we focused mainly on its propensity to drive condensate formation, since this has been described in detail previously [13, 32]. Similar to the search against the mouse IDRs, SHARK-dive also predicted homology to the C-terminal IDRs of the other FET family members, EWS and TAF15. Not only does the RBD of each protein enhance the phase-separating capability of the protein via interactions with its N-terminal prion-like domain (PLD), but a chimeric FUS PLD-EWSR1 RBD construct also showed robust phase separation. This indicates that the EWSR1 RBD is able to functionally complement the FUS RBD and suggests that the FET family RBDs are functionally homologous [13].

More interestingly, SHARK-dive also predicted homology to the C-terminal IDR of human FAM98A (UniProtID Q8NCA5). Like FUS, FAM98A is also capable of partitioning into condensates, particularly into stress granules, where FUS is also localized [33]. Moreover, FAM98A and FUS are both implicated in amyotrophic lateral sclerosis pathology [34]. Upon further inspection of the generated charts, it becomes apparent that the sequences have highly similar amino acid compositions. This is further supported by the consistently high k-mer scores, which indicate regions of similarity between the two sequences; indeed, both sequences appear to be RGG-rich. This predicted homology could be investigated further to see if the two proteins are capable of direct interactions, especially given the known propensity of the FUS RBD to drive phase separation via multivalent interactions. The increased sensitivity of SHARK-dive also yields more predictions. For example, lower-scoring predictions such as the N-terminal IDR of H. sapiens ribonuclease 3 (UniProtID Q9NRR4) may also be functionally relevant, as it has also been predicted to harbor a prion-like amino acid composition, which may drive interactions and functional similarities to the FUS RBD (Wang et al. [13]).

Conclusion

As an alignment-free alternative to existing homology detection tools, SHARK-dive aims to facilitate the study of sequence–function relationships in disordered, difficult-to-align regions. The SHARK web server provides easy-to-use online access to SHARK-dive, particularly for experimentalists seeking to identify interesting candidate sequences for further functional investigations.

Acknowledgements

We would like to thank the Computer Services and Scientific Computing Facilities of the MPI-CBG for their support, especially Matt Boes and Oscar Gonzales for supporting our servers and the HPC.

The Gunicorn logo is used under the terms of the Creative Commons Attribution 3.0 Unported License (CC BY 3.0).

The original image is available at https://commons.wikimedia.org/wiki/File:Gunicorn_logo_2010.svg. No modifications were made. Use of the logo does not imply endorsement by the original creator. License details: https://creativecommons.org/licenses/by/3.0/.

The Django file-type icon is used under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0). The original image is available at https://icon-icons.com/de/symbol/Datei-Typ-django/130645. No modifications were made. Use of the icon does not imply endorsement by the original creator. License details: https://creativecommons.org/licenses/by/4.0/.

The Celery logo is used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

The original image is available at https://commons.wikimedia.org/wiki/File:Celery_logo.png. No modifications were made. Use of the logo does not imply endorsement by the original creator. License details: https://creativecommons.org/licenses/by-sa/4.0/.

Author contributions: Chi Fung Willis Chow (Conceptualization [equal], Data curation [equal], Formal analysis [equal], Methodology [equal], Validation [equal], Visualization [equal], Writing—original draft [equal]), Maxim Scheremetjew (Conceptualization [equal], Methodology [equal], Software [lead], Visualization [equal], Writing—review & editing [equal]), HongKee Moon (Methodology [equal], Software [equal], Visualization [equal], Writing—review & editing [equal]), Soumyadeep Ghosh (Methodology [supporting], Software [supporting], Writing—review & editing [equal]), Anna Hadarovich (Methodology [equal], Writing—review & editing [equal]), Lena Hersemann (Supervision [supporting], Writing—review & editing [equal]), and Agnes Toth-Petroczy (Conceptualization [equal], Funding acquisition [lead], Investigation [equal], Supervision [equal], Visualization [equal], Writing—review & editing [equal])

Contributor Information

Chi Fung Willis Chow, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Cluster of Excellence Physics of Life, TU Dresden, 01062 Dresden, Germany.

Maxim Scheremetjew, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany.

HongKee Moon, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany.

Soumyadeep Ghosh, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany.

Anna Hadarovich, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany.

Lena Hersemann, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany.

Agnes Toth-Petroczy, Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Center for Systems Biology Dresden, Pfotenhauerstrasse 108, 01307 Dresden, Germany; Cluster of Excellence Physics of Life, TU Dresden, 01062 Dresden, Germany.

Conflict of interest

None declared.

Funding

This work was supported by the Max Planck Gesellschaft (MPG) “free-floater” RGL funds and by the European Research Council (CONDEVO, ERC grant agreement number 101116284). Funding to pay the Open Access publication charges for this article was provided by Max Planck Gesellschaft.

Data availability

The precomputed IDR datasets are available at https://bio-shark.org/help.

References

  • 1. Loewenstein Y, Raimondo D, Redfern OC et al. Protein function annotation by homology-based inference. Genome Biol. 2009; 10:207. 10.1186/gb-2009-10-2-207.
  • 2. Altschul SF, Gish W, Miller W et al. Basic Local Alignment Search Tool. J Mol Biol. 1990; 215:403–10. 10.1016/S0022-2836(05)80360-2.
  • 3. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998; 14:755–63. 10.1093/bioinformatics/14.9.755.
  • 4. Mistry J, Chuguransky S, Williams L et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021; 49:D412–9. 10.1093/nar/gkaa913.
  • 5. Zarin T, Strome B, Nguyen Ba AN et al. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife. 2019; 8: 10.7554/eLife.46883.
  • 6. Ho W-L, Huang H-C, Huang J-R. IFF: identifying key residues in intrinsically disordered regions of proteins using machine learning. Protein Sci. 2023; 32:e4739. 10.1002/pro.4739.
  • 7. Riley AC, Ashlock DA, Graether SP. The difficulty of aligning intrinsically disordered protein sequences as assessed by conservation and phylogeny. PLoS One. 2023; 18:e0288388. 10.1371/journal.pone.0288388.
  • 8. Khan T, Douglas GM, Patel P et al. Polymorphism analysis reveals reduced negative selection and elevated rate of insertions and deletions in intrinsically disordered protein regions. Genome Biol Evol. 2015; 7:1815–26. 10.1093/gbe/evv105.
  • 9. Tóth-Petróczy A, Tawfik DS. Protein insertions and deletions enabled by neutral roaming in sequence space. Mol Biol Evol. 2013; 30:761–71. 10.1093/molbev/mst003.
  • 10. Wootton JC, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993; 17:149–63. 10.1016/0097-8485(93)85006-X.
  • 11. Altschul SF, Boguski MS, Gish W et al. Issues in searching molecular sequence databases. Nat Genet. 1994; 6:119–29. 10.1038/ng0294-119.
  • 12. Van Der Lee R, Buljan M, Lang B et al. Classification of intrinsically disordered regions and proteins. Chem Rev. 2014; 114:6589–631. 10.1021/cr400525m.
  • 13. Wang J, Choi J-M, Holehouse AS et al. A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell. 2018; 174:688–99. 10.1016/j.cell.2018.06.006.
  • 14. Hadarovich A, Singh HR, Ghosh S et al. PICNIC accurately predicts condensate-forming proteins regardless of their structural disorder across organisms. Nat Commun. 2024; 15:10668. 10.1038/s41467-024-55089-x.
  • 15. Chow CFW, Ghosh S, Hadarovich A et al. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Proc Natl Acad Sci USA. 2024; 121:e2401622121. 10.1073/pnas.2401622121.
  • 16. Peng Z, Yan J, Fan X et al. Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci. 2015; 72:137–51. 10.1007/s00018-014-1661-9.
  • 17. Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974; 185:862–4. 10.1126/science.185.4154.862.
  • 18. Lee JC, Rashid NA. Adapting normalized Google similarity in protein sequence comparison. Proceedings of the International Symposium on Information Technology 2008 (ITSim). 2008; 1: New York City, USA: IEEE; 6–10. 10.1109/ITSIM.2008.4631601.
  • 19. Hugo W, Sung W-K, Ng S-K. Discovering interacting domains and motifs in protein–protein interactions. Methods Mol Biol. 2013; 939:9–20. 10.1007/978-1-62703-107-3_2.
  • 20. Davey NE, Van Roey K, Weatheritt RJ et al. Attributes of short linear motifs. Mol BioSyst. 2012; 8:268–81. 10.1039/c1mb05231d.
  • 21. O’Brien KT, Haslam NJ, Shields DC. SLiMScape: a protein short linear motif analysis plugin for Cytoscape. BMC Bioinformatics. 2013; 14:224. 10.1186/1471-2105-14-224.
  • 22. Chow CFW, Lenz S, Scheremetjew M et al. SHARK-capture identifies functional motifs in intrinsically disordered protein regions. Protein Sci. 2025; 34:e70091. 10.1002/pro.70091.
  • 23. Mészáros B, Erdös G, Dosztányi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018; 46:W329–37. 10.1093/nar/gky384.
  • 24. Varadi M, Bertoni D, Magana P et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024; 52:D368–75. 10.1093/nar/gkad1011.
  • 25. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–9. 10.1038/s41586-021-03819-2.
  • 26. Schwartz JC, Cech TR, Parker RR. Biochemical properties and biological functions of FET proteins. Annu Rev Biochem. 2015; 84:355–79. 10.1146/annurev-biochem-060614-034325.
  • 27. King OD, Gitler AD, Shorter J. The tip of the iceberg: RNA-binding proteins with prion-like domains in neurodegenerative disease. Brain Res. 2012; 1462:61–80. 10.1016/j.brainres.2012.01.016.
  • 28. David E, McNeil JB, Basile V et al. An unusual fibrillarin gene and protein: structure and functional implications. Mol Biol Cell. 1997; 8:1051–61. 10.1091/mbc.8.6.1051.
  • 29. Zhang X, Li W, Sun S et al. Advances in the structure and function of the nucleolar protein fibrillarin. Front Cell Dev Biol. 2024; 12:1494631. 10.3389/fcell.2024.1494631.
  • 30. Snaar S, Wiesmeijer K, Jochemsen AG et al. Mutational analysis of fibrillarin and its mobility in living human cells. J Cell Biol. 2000; 151:653–62. 10.1083/jcb.151.3.653.
  • 31. Kim E, Kwon I. Phase transition of fibrillarin LC domain regulates localization and protein interaction of fibrillarin. Biochem J. 2021; 478:799–810. 10.1042/BCJ20200847.
  • 32. Ganser LR, Niaki AG, Yuan X et al. The roles of FUS-RNA binding domain and low complexity domain in RNA-dependent phase separation. Structure. 2024; 32:177–87. 10.1016/j.str.2023.11.006.
  • 33. Ozeki K, Sugiyama M, Akter KA et al. FAM98A is localized to stress granules and associates with multiple stress granule-localized proteins. Mol Cell Biochem. 2019; 451:107–15. 10.1007/s11010-018-3397-6.
  • 34. Kamelgarn M, Chen J, Kuang L et al. Proteomic analysis of FUS interacting proteins provides insights into FUS function and its role in ALS. Biochim Biophys Acta. 2016; 1862:2004–14. 10.1016/j.bbadis.2016.07.015.



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
