PLOS Computational Biology. 2022 Mar 25;18(3):e1009930. doi: 10.1371/journal.pcbi.1009930

Mining folded proteomes in the era of accurate structure prediction

Charles Bayly-Jones 1,2,*, James C Whisstock 1,2,*
Editor: Dina Schneidman-Duhovny3
PMCID: PMC8986115  PMID: 35333855

Abstract

Protein structure fundamentally underpins the function and processes of numerous biological systems. Fold recognition algorithms offer a sensitive and robust tool to detect structural, and thereby functional, similarities between distantly related homologs. In the era of accurate structure prediction owing to advances in machine learning techniques and a wealth of experimentally determined structures, previously curated sequence databases have become a rich source of biological information. Here, we use bioinformatic fold recognition algorithms to scan the entire AlphaFold structure database to identify novel protein family members, infer function and group predicted protein structures. As an example of the utility of this approach, we identify novel, previously unknown members of various pore-forming protein families, including MACPFs, GSDMs and aerolysin-like proteins.

Author summary

Virtually every cellular process in all organisms on Earth is driven by molecular nano-machines known as proteins. The diverse functions of proteins are the result of the unique three-dimensional shape adopted by a given protein molecule. It is therefore important to determine the shape of a given protein, which unlike DNA and our genes, cannot be known from its sequence alone. Since two proteins with similar shapes typically have a similar function, knowing a protein shape provides crucial clues about its function. By virtue of decades of experimental work and advances in artificial intelligence, this complex shape can now be computationally predicted for any protein whose composition is known. Scientists have used these and other methods to produce enormous libraries of protein shapes consisting of nearly a million unique entries. However, these libraries are too large and too complex for researchers to ‘read’. We use shape-comparison algorithms to carefully check these shape-libraries to gain insight into the potential function and biological role of previously unknown proteins. Furthermore, we identified new members of protein families using this technique. We show that shape-matching algorithms and computationally generated shape-libraries can be used effectively together to yield new insights and expedite scientific endeavours.

Introduction

Knowledge of a protein’s structure is a powerful means for the prediction of biological function and molecular mechanism [1,2]. Accordingly, powerful pairwise fold recognition tools such as DALI [3] have been developed that permit searching of known fold space in order to identify homology between distantly related, structurally characterised proteins. These approaches can identify homologous proteins even when primary amino acid sequence similarity is not readily detectable. This method is particularly useful when a protein of no known function can be flagged as belonging to a well characterised fold class (e.g. Rosado et al. [4]).

A key and obvious limitation of using fold recognition to infer function is that the structure of the protein of interest needs to first be determined. By virtue of advances in experimental techniques and judicious deposition of results over several decades, this limitation is now being addressed by machine learning approaches. Today, in the era of accurate protein structure prediction [5,6], it is possible to build a reasonably accurate library comprising representative structures of all proteins in a proteome [7–9] (Fig 1A, 1B and 1C). One utility of such a resource is that fold recognition approaches for prediction of function can now be applied to any protein (Fig 1D and 1E).

Fig 1. Conceptual overview of structure-guided fold recognition against the AlphaFold database.

a. Entire proteome sequence databases are converted by (b) machine-learning methods into high-accuracy (c) structural model predictions. d. Curation of these structural databases into a unified resource of structures (S1 Text). e. DALI-based fold matching to perform functional inference, identification of unknown members and structural classification. f. Searching the foldome in (d) for matches of murine GSDM-D N-terminal domain yields MACPF/GSDM family members, including C11orf42, an unknown member of the GSDM family. g. Example phylogenetic analysis of perforin-like proteins identified from the foldome by DALI. Alignment of the central MACPF/GSDM fold is possible by extracting domain boundaries based on AlphaFold prediction, allowing specific comparison between members without interfering ancillary domains.

To investigate the utility of this approach we used established bioinformatic tools to mine the “foldome” (S1 Text). We employed the popular DALI algorithm due to its sensitivity and robustness. We constructed a locally hosted DALI database of all protein structures predicted by AlphaFold, covering humans to flies to yeast (Fig 1D and 1E). We then began mining the whole database using a probe structure representing a well characterised protein superfamily (in this case the perforin-like superfamily of pore-forming immune effectors). Repeating this search with different MACPF probes (e.g., MPEG1, perforin, C9) yields very similar results, indicating DALI is robust to the chosen search template.
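One way to put a number on how similar the results from different probes are is to compare the sets of hits returned by each probe. The sketch below computes the Jaccard overlap between every pair of probes. It is illustrative only: the text does not say how the comparison was made, and the function names and input layout (a mapping from probe name to a list of hit identifiers) are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of hit identifiers (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_overlap(hits_by_probe):
    """Return {(probe_1, probe_2): jaccard} for every pair of probes.

    hits_by_probe is a hypothetical mapping from probe name to the
    identifiers of the database entries that probe retrieved.
    """
    return {
        (p, q): jaccard(hits_by_probe[p], hits_by_probe[q])
        for p, q in combinations(sorted(hits_by_probe), 2)
    }
```

Values close to 1 for all pairs would support the statement that the search is robust to the choice of template.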

Design and implementation

Generation and acquisition of AlphaFold models

All AlphaFold models were obtained from the EMBL-EBI database (https://alphafold.ebi.ac.uk/) for each available model organism. Where a search required a model that was not available from the PDB, the model was generated using AlphaFold hosted through ColabFold [10]. A regex search of PDB metadata for ‘uncharacterised’ was used to curate a subset of uncharacterised or unknown proteins in the human foldome. Atomic coordinates in these files were subsequently discarded where the pLDDT score was less than 70. Remaining models with fewer than 100 residues were discarded.
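The pruning step can be sketched as follows. AlphaFold models store the per-residue pLDDT in the B-factor column of each ATOM record, so the filter only needs the fixed-column PDB layout. This is a minimal sketch rather than the script used in the study: the function name is hypothetical, and treating a pLDDT of exactly 70 as retained is an assumption.

```python
def prune_by_plddt(pdb_lines, min_plddt=70.0, min_residues=100):
    """Keep ATOM records whose pLDDT (B-factor column) is at least min_plddt.

    Returns the retained lines, or None when fewer than min_residues
    distinct residues survive the pruning.
    """
    kept = []
    residues = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        plddt = float(line[60:66])  # B-factor field, PDB columns 61-66
        if plddt < min_plddt:
            continue
        kept.append(line)
        # chain ID (column 22) plus residue number and insertion code (columns 23-27)
        residues.add((line[21], line[22:27]))
    if len(residues) < min_residues:
        return None
    return kept
```

A model would be read with `open(path).read().splitlines()` and written back out only when the function returns a list.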

Construction of local DALI search engine and database

All DALI searches were performed using DaliLite [3] (v5; available from http://ekhidna2.biocenter.helsinki.fi/dali/) on one of two Linux workstations equipped with a 16-core or 20-core Intel i7 CPU and 128 GB of DDR4 RAM. The DALI database was generated as described in the DALI manual. Briefly, for every AlphaFold model a randomised four-character internal “PDB” code was generated and associated with the model (unique_identifiers.txt). Subsequently, all models were imported and converted to DALI format to enable all-versus-all structure searches. Individual proteomes were isolated as separate lists of entries or combined, to enable independent or grouped searches.
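The identifier step might look like the sketch below, which draws a random four-character code for each model and redraws on collision so that every code is unique. The character set (lowercase letters and digits, giving 36^4 = 1,679,616 possible codes) and the tab-separated layout of the mapping file are assumptions; only the file name unique_identifiers.txt comes from the text.

```python
import random
import string


def assign_internal_codes(model_names, seed=None):
    """Map each model name to a unique random four-character code."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits  # assumed character set
    if len(model_names) > len(alphabet) ** 4:
        raise ValueError("more models than available four-character codes")
    used = set()
    mapping = {}
    for name in model_names:
        code = "".join(rng.choices(alphabet, k=4))
        while code in used:  # redraw until the code is unused
            code = "".join(rng.choices(alphabet, k=4))
        used.add(code)
        mapping[name] = code
    return mapping


def write_mapping(mapping, path="unique_identifiers.txt"):
    """Write one 'code<TAB>model name' line per model (assumed layout)."""
    with open(path, "w") as handle:
        for name, code in mapping.items():
            handle.write(f"{code}\t{name}\n")
```

Keeping the mapping file is what allows DALI hits, which are reported under the internal code, to be traced back to the original AlphaFold entry.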

Construction of local SA Tableau (GPU accelerated) search engine and database

All SA Tableau searches were performed on the same Linux workstations as for the DALI searches; however, jobs were split equally into four sets and each set was executed on a single Nvidia GTX1080 GPU. Since SA Tableau was originally written and compiled [11] for outdated CUDA architectures, we re-compiled SA Tableau under modern CUDA (v8.0), gcc (v4.9.3) and g++ (v5.4.0). To run SA Tableau, it was necessary to create a conda environment with python2.7, where numpy (v1.8.1) and biopython (v1.49) were found to properly execute and run the original code. SA Tableau databases and distance matrices were calculated with `buildtableauxdb.py` and combined into ASCII format with `convdb2.py` (available from http://munk.cis.unimelb.edu.au/~stivalaa/satabsearch/ at the date of publication). SA Tableau results were sorted and selected based on expectation value with a cut-off of 1×10⁻⁴.
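The job splitting and the final selection step can be sketched in a few lines. The code below divides a list of queries into four near-equal sets and filters a list of hits by expectation value. It does not call SA Tableau itself, and the function names and the (target, e-value) tuple layout are assumptions made for illustration.

```python
def split_into_sets(items, n_sets=4):
    """Divide items into n_sets contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_sets)
    chunks, start = [], 0
    for i in range(n_sets):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def select_hits(hits, max_evalue=1e-4):
    """Keep (target, evalue) pairs at or below the cut-off, most significant first."""
    return sorted((h for h in hits if h[1] <= max_evalue), key=lambda h: h[1])
```

Each set could then be pinned to one GPU, for example by setting the `CUDA_VISIBLE_DEVICES` environment variable for the process that handles it.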

Proteome-wide assignment of Pfam

In order to search the entire Pfam classification against a structural proteome database, we used the GPU-accelerated SA Tableau search algorithm to expedite the search process. Furthermore, we chose to search only the S. aureus foldome as this represents the smallest and therefore most computationally inexpensive example proteome. All Pfam classifications were represented by structure in one of two ways [12]. Firstly, trRosetta models for ~7,000 Pfam classifications were recently produced and used without modification. Secondly, for the remaining 60% of entries, we used the ProtCID database [13] (http://dunbrack2.fccc.edu/ProtCiD/default.aspx) to link Pfam IDs to known protein structures in the PDB. These exemplar structures from the PDB were downloaded and single chains were extracted from each model (that is, only a single copy of each domain was considered). Domain boundaries and chain IDs defined by ProtCID were used to discard unrelated chains and residues that did not pertain to the particular Pfam classification in question. Finally, the trRosetta and exemplar structures were searched against the entire S. aureus foldome with SA Tableau.
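Trimming an exemplar structure to a single domain amounts to selecting ATOM records by chain ID and residue range. The following is a minimal sketch under the assumption that the chain ID and domain boundaries have already been read from ProtCID; the function name is hypothetical and parsing of ProtCID output is not shown.

```python
def extract_domain(pdb_lines, chain_id, start, end):
    """Return ATOM records of one chain whose residue number lies in [start, end]."""
    selected = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[21] != chain_id:  # chain ID, PDB column 22
            continue
        resseq = int(line[22:26])  # residue sequence number, PDB columns 23-26
        if start <= resseq <= end:
            selected.append(line)
    return selected
```

Because only one chain ID is accepted, additional copies of the same domain in other chains are dropped automatically.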

Results

This approach readily yields functional insight into previously uncharacterised proteins (Fig 1F, 1G and S1 Table). For example, structure-based mining identified all known perforin/GSDM family members, but also identified a likely new member of the GSDM pore-forming family in humans, namely C11orf42 (UniProt Q8N5U0, a protein of no known function). Remarkably, there is only 1% sequence identity between the GSDMs and C11orf42 despite predicted conservation of tertiary structure. In humans, C11orf42 is expressed in testis and is highly expressed in thyroid tumours [14]. Moreover, CRISPR screens [15–17] identified C11orf42 as contributing to fitness and proliferation in lymphoma, glioblastoma and leukaemia cell lines (BioGRID gene ID 160298) [18].

Identification of C11orf42 as a likely GSDM family member permits several useful predictions. Owing to the presence of a GSDM fold, we postulate that C11orf42 may share GSDM-like functions such as oligomerisation and membrane interaction. Unlike other GSDMs [19,20], however, inspection of the predicted structure suggests that C11orf42 lacks membrane penetrating regions entirely. These data imply that C11orf42 may have lost the ability to perforate lipid bilayers and instead may function as a scaffold of sorts, as has been postulated for members of the perforin superfamily [21].

We next expanded our analysis to all proteomes covering 356,000 predicted structures; these computations take ~24 hours on a 16-core Intel i7 workstation. We identified roughly 16 novel perforin-like proteins across the twenty-one model organisms covered by the AlphaFold database (Fig 1G, S1 Table and File 1 in https://zenodo.org/record/5893808#.YiE_LOhKhPY). Domain boundaries defined by the structure prediction were identified manually and the perforin/GSDM-like domains were aligned based on fold. We constructed a phylogenetic tree based on the structure-constrained multiple sequence alignment (MSA), suggesting that C11orf42 is potentially related to the precursor of the common GSDMs. Curation of sequences based on predicted structures, such as this, may enable further, more comprehensive evolutionary analyses. For example, newly identified structural homologs could be used as seed sequences for iterative PSI-BLAST searches of sequence databases, or the gene loci of newly discovered family members could be studied.

We next performed these foldome-wide searches for several other pore-forming protein families, identifying new members of aerolysins, lysenins, cry1 toxins and more (S1 Fig, S1 Table and File 1 in https://zenodo.org/record/5893808#.YiE_LOhKhPY). Members of these toxin families have applications in next-generation sequencing (both DNA/RNA [22,23] and polypeptide [24–27]), as well as agricultural applications in crop protection. We anticipate the new members of these families to be of utility in translational research programs. Remarkably, many of the hits we identified suggest the unexpected presence of pore-forming protein families that were previously thought to be entirely absent in the selected phyla, for example aerolysin-like proteins in Drosophila, C. elegans, yeast and zebrafish.

To further assess the utility of the database in functional inference, we curated a subset of the human proteome corresponding to uncharacterised proteins of unknown function (Fig 2A, 2B and 2C). These proteins are largely unannotated, lacking both domain and functional descriptions. We pruned all regions of the predicted structures to have pLDDT (per-residue confidence score) greater than 70 and discarded models for which fewer than 100 residues remained. These became the probe structures for iterative searches against the whole human foldome to identify known proteins with assigned domains and function. We provide these as File 2 in https://zenodo.org/record/5893808#.YiE_LOhKhPY for the convenience of the reader.

Fig 2. Identification of uncharacterised human proteins by fold-recognition.

a. Uncharacterised human proteins were curated from the AlphaFold database. b. Low-confidence regions of AlphaFold models were excluded based on pLDDT criteria. c. All models were screened against the rest of the human AlphaFold database. d. Four examples (C12orf56, Q8IXR9; C22orf9, Q6ICG6; C11orf16, Q9NQ32; C6orf136, Q5SQH8) of uncharacterised proteins where fold-matching enabled the assignment of domain composition (labelled in various colours). Furthermore, homologs or similar proteins (blue label) provide insight into potential function (black dot points).

From these analyses, we highlight four notable examples of uncharacterised structures which met the criteria and yielded insight into potential function (Fig 2D). One of these, C12orf56, appears to be a previously unknown GTPase activator protein. When compared to its homolog Ric-8A [28], the PH domain appears to sterically occlude binding of Gα proteins and may result in a potentially autoinhibited conformation. Previously, the identity of this protein was most likely obscured in sequence-based approaches due to the abnormally large loop insertion in the PH domain (Fig 2D). Similarly, a putative nuclear import factor (NFT) with strong homology to NFT2 was identified. These examples demonstrate the utility of structure-guided curation and annotation of uncharacterised proteins. Unlike domain assignment by primary sequence analysis, fold-matching algorithms are sensitive and robust [29–31]. We anticipate that domain assignment by fold-matching will likely provide more accurate and informative predictions than existing sequence analysis methods, especially in contexts where sequences have poor overall homology or possess discontinuous breaks and insertions. Of course, imputed function remains to be experimentally validated.

Lastly, many conserved protein families have no known function and roughly a quarter of sequences in the proteome are not assigned a protein family [12]. Further, domain annotations for numerous proteins are incomplete. In an attempt to employ structure matching to assign domain composition or identify protein families, we searched representative protein domains against the Staphylococcus aureus foldome (for simplicity we chose the smallest available foldome) to score unknown and known proteins according to their similarity (Fig 3A). These representative structures comprise the entire trRosetta Pfam library and curated exemplar structures from the PDB for Pfam entries not modelled by trRosetta. The public nature of these representative structures makes the trRosetta models a convenient choice; however, computing AlphaFold predictions for each Pfam entry would likely give an improved representative library. A remaining 19% of Pfam entries were excluded (fewer than 50 residues, absence of PDB entry).

Fig 3. Proteome-wide search and classification of Pfam groups.

a. Overview of analysis. Firstly, search models representing Pfam entries were curated from either trRosetta or the PDB. These were each searched against the entire S. aureus foldome to identify matches. The final output was filtered by expectation value. b. Select examples of S. aureus enzymes classified into Pfam category PF04481. This Pfam category previously had unknown role or function. c. The uncategorised protein (Q2G056) is homologous to different Pfam groups with similar enzymatic function (PF06259, PF15609, PF19742).

Here, we employed GPU-accelerated fold recognition software, SA Tableau search [11], to expedite the large comparison, which was not computationally tractable with DALI. The analysis outputs a ranked list of all proteins which match the query domain (File 3 in https://zenodo.org/record/5893808#.YiE_LOhKhPY). As such, these serve as a first-pass approximation of structure-assigned domain annotation and family classification for the S. aureus proteome. We provide a full mapping of Pfam to S. aureus entries ranked by likelihood, which enables prediction of function by similarity (S2 Table). Examples of the results include the prediction that the unknown Pfam group (PF04481) is structurally related to a group of synthases (Fig 3B) and the classification of the S. aureus protein (Q2G056) as a hydrolase-like or transferase-like fold.
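A ranked mapping of this kind can be assembled from a flat list of search hits. The sketch below groups hits by Pfam entry, drops those above the expectation-value cut-off and orders the remainder from most to least significant. The input layout (tuples of Pfam ID, protein ID and e-value) and the function name are assumptions; the format of the actual SA Tableau output is not reproduced here.

```python
from collections import defaultdict


def rank_matches(hits, max_evalue=1e-4):
    """Group (pfam_id, protein_id, evalue) hits by Pfam entry, best e-value first."""
    by_pfam = defaultdict(list)
    for pfam_id, protein_id, evalue in hits:
        if evalue <= max_evalue:
            by_pfam[pfam_id].append((protein_id, evalue))
    return {
        pfam_id: sorted(matches, key=lambda m: m[1])
        for pfam_id, matches in by_pfam.items()
    }
```

Pfam entries with no hit below the cut-off are simply absent from the returned dictionary.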

Exhaustive searches of predicted structures serve as a sensitive, but computationally expensive, domain and family assignment tool for proteins which lack sequence annotations or where domain assignment has not been successful using sequence-based approaches. Owing to the prohibitive computational cost, it was not feasible for us to search the entire foldome across all organisms. The dedication of high-performance computing resources to the remaining proteomes, or particularly subsets that are still unknown, may be merited. Similarly, the AlphaFold database would benefit from application of other structure-based classification methods (such as adaptations to classification schemes of SCOPe [32] or ECOD [33]). The curated subset of PDB entries used for DALI searches is available as a resource to expedite efforts by others (File 3 in https://zenodo.org/record/5893808#.YiE_LOhKhPY) and supplements the trRosetta Pfam models (publicly available: http://ftp.ebi.ac.uk/pub/databases/).

Finally, it is our perspective that adoption of the above analyses among structural biologists may be beneficial. A common practice of first searching for related folds before beginning a project may accelerate investigations and improve the likelihood of success. For example, one might first generate an accurate structural prediction of the target molecule, then search this against larger foldome databases (via the DALI [3] or FoldSeek [34] webserver) to gain insight into function and putative mechanism. In this way, before experimentation, previously obtained knowledge of function can provide rationale, guide inquiry and minimise unnecessary or resource-intensive efforts, saving time and money. Likewise, the identification of homologs in model organisms may facilitate parallel studies in vivo or in situ.

Future directions

The need for modern and sensitive implementations of fold-matching algorithms has once again become relevant. The current release of AlphaFold predictions has expanded the available structural database by more than double, with additional contributions expected to reach nearly a million entries in the near future. Exhaustive search algorithms, such as DALI, are slow and scale poorly, meaning searching structural databases of these sizes is not tractable. As such, improvements and further work on fold matching algorithms are paramount to enable rapid searching and exploration of these new resources (for example, FoldSeek [34]). Other algorithms for fold matching are notably much faster than DALI, such as FoldSeek (comparably sensitive to DALI) [34] and 3D Zernike moment decomposition of protein structures and subsequent k-means nearest neighbour [35]. However, the latter comes with sensitivity trade-offs. A mixed approach may be beneficial, where 3D Zernike descriptors could be initially used to filter the database, followed by an exhaustive DALI search. This would be akin to a coarse initial pass to define a smaller subset of the search space to allow a computationally tractable exhaustive local search. Alternatively, sequences could be first filtered based on exclusion criteria, such as length.
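The coarse-then-exhaustive idea can be expressed generically. In the sketch below, `coarse_score` stands in for a fast descriptor comparison (such as one based on 3D Zernike descriptors) and `fine_score` for an expensive pairwise comparison (such as DALI); both are hypothetical callables supplied by the user, and neither tool is actually invoked. The length pre-filter uses only a lower bound, on the assumption that a query domain may be embedded in a much larger protein.

```python
def length_prefilter(query_length, lengths, min_fraction=0.5):
    """Return names of entries at least min_fraction times as long as the query.

    lengths maps entry name to residue count; min_fraction is an assumed threshold.
    """
    return [name for name, n in lengths.items() if n >= min_fraction * query_length]


def two_stage_search(query, database, coarse_score, fine_score, keep=100):
    """Shortlist by a cheap score, then re-score only the shortlist exhaustively.

    Higher scores mean greater similarity for both scoring functions.
    Returns (entry, fine_score) pairs, best first.
    """
    shortlist = sorted(
        database, key=lambda entry: coarse_score(query, entry), reverse=True
    )[:keep]
    rescored = [(entry, fine_score(query, entry)) for entry in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The expensive comparison runs `keep` times rather than once per database entry, which is where the saving comes from; the cost is that any true match ranked below `keep` by the coarse score is missed.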

Overall, the efficacy of structure mining depends on the accuracy of predicted models. Currently, this is dependent on MSAs for the detection of evolutionary covariance, however new single-sequence structure prediction methods are emerging that do not rely on sequence alignment [36]. Currently, the extent and quality of the MSA will affect the quality of AlphaFold/RoseTTAFold predictions and thus the quality of search results. Notably, protein families with extensive primary sequence conservation may not benefit from structure-guided mining, as existing techniques are likely sufficiently sensitive, as well as being far quicker and more computationally efficient. As such, protein folds that are structurally conserved but have poor overall sequence conservation may represent ideal targets for structure-based mining.

Supporting information

S1 Fig. Select examples of newly identified PFPs.

a. Identified MACPFs from slime mold and zebrafish. b. Various β-PFPs with aerolysin-like pore-forming domains which resemble aerolysin, lysenin, epsilon toxin, monalysin and LSL. Observed in numerous organisms including Drosophila, C. elegans, zebrafish and yeast, among others. c. Several novel α-haemolysin-like proteins identified in S. aureus and plants. The putative receptor binding domain is coloured green and the pore-forming domain is coloured grey with the expected transmembrane region in red.

(TIF)

S1 Table. Curated lists of pore-forming proteins identified by DALI search of AlphaFold database, organised by query.

(XLSX)

S2 Table. Ranked list of structurally-assigned Pfam matches against the S. aureus foldome.

(XLSX)

S1 Text. Definition of a “foldome”.

(DOCX)

Acknowledgments

CBJ acknowledges helpful discussion and feedback from Dr. Bradley A. Spicer and Dr. Andrew Ellisdon.

Data Availability

All supplementary data are made available as a public submission to Zenodo: https://zenodo.org/search?page=1&size=20&q=5893808.

Funding Statement

JCW acknowledges funding from the Australian Research Council, the Australian Research Data Commons, and the National Health and Medical Research Council of Australia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009930.r001

Decision Letter 0

Dina Schneidman-Duhovny

24 Nov 2021

Dear Dr. Bayly-Jones,

Thank you very much for submitting your manuscript "Mining folded proteomes in the era of accurate structure prediction" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Importantly, please provide a command line tool or ideally a web server to perform the structural similarity search and analysis from the paper.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Dina Schneidman

Software Editor

PLOS Computational Biology

***********************

Specifically, please provide a command line tool or ideally a web server to perform the structural similarity search and analysis from the paper.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript presents a case study on mining the computationally predicted structures in the recently released AlphaFold Protein Structure Database for structural similarities. The authors use established fold recognition algorithms to compare proteins from the entire database to known members of several pore-forming protein families and succeed in identifying previously unknown members of such families.

The authors' methods represent a clever and straightforward-to-apply workflow to explore computationally predicted structures. The paper explains well why the particular results from their examples are of biological interest. Beyond corroborating the approach itself, these results demonstrate that computationally predicted protein structures can be used to discover relevant connections and relations between different proteins that cannot be obtained from sequence databases.

This study is the first of its kind (as far as I am aware) which comes with a lot of merit but also necessitates some additional tests and discussion to form a solid basis for future similar ventures. In particular, the following points should be addressed:

- It is not clear to me whether using a different reference protein from one family would lead to the same result. For example, if instead of GSDM-D one had searched for similarities to a MACPF, would the results have included C11orf42, too? Similarly, could one identify all MACPF/GSDM members by using C11orf42 as a reference?

- When expanding the analysis to all proteomes (line 52ff), the computational cost increases strongly. Could one in principle alleviate this problem with a less expensive pre-screen that excludes proteins that, by some simpler criteria, clearly cannot have a similar fold?

- The authors introduce the term "foldome" for a set of structures corresponding to a proteome. According to their definition (one of many potential folds per sequence), a proteome has multiple potential foldomes. This definition leaves no room to describe a set of structural ensembles, which would intuitively be associated with the term foldome meaning "entire set of folds."

- While the "era of accurate structure prediction" is indubitably owed to the machine learning techniques, as much credit should be attributed to advances in the experimental techniques that determine the structures from which these models learn.

- The paper alludes to "more comprehensive evolutionary analyses" in line 59. It would help to elaborate on that or provide a few examples.

Reviewer #2: Summary of the paper

The authors point out a fairly obvious use-case for ML predictions of protein structure and then describe some interesting applications. Structures predicted by AlphaFold2 or similar neural networks are queried using fold recognition tools such as DALI in order to find structural homologs with very low sequence identity and impute function to uncharacterized proteins. The primary contributions are:

- The authors identify new pore-forming proteins, including a new human perforin / GSDM with only 1% sequence homology to known examples

- The authors query human proteins with confident structural predictions against the remainder of the AF2 human foldome, identifying possible functions for several uncharacterized proteins

- The authors query predicted structures for Pfam domains of unknown function against the S. aureus AF2 foldome

Main Review

Strengths

In general, the paper is concise and clear.

I appreciated the discussion of limitations and possible improvements

The example of structural homology with only 1% sequence homology is impressive!

Major Weaknesses

I think the contribution would be stronger if the authors also made it easier (with a command-line tool or web server) for scientists with less-specialized knowledge to do structural homology search against AF2 predictions.

Minor weaknesses

The paper could be stronger with justifications or reasoning for tool choices. Why use DALI vs SA Tableau for different tasks? Or why use trRosetta predictions instead of AF2 or RosettaFold predictions for some tasks?

Likewise, how were the foldomes chosen? Why limit searches to the human foldome or the S. aureus foldome? Would it be better to search against a deduplicated, "representative" predicted structure library? If not, why not?

While acknowledging that bench verification of imputed functions is out of scope for this paper, I would like some discussion of the trustworthiness of functions imputed via structural homology, even under the assumption that the structure predictions are accurate.

Given that AF2 predicts accurate protein structures given MSAs, could we do homology search by just comparing MSAs?

Summary of the review

While the idea of applying structural homology search to ML structure predictions is fairly obvious, the authors do it carefully and find convincing, biologically-interesting results. Together, this makes a strong case for making this sort of search a part of the standard toolbox when working with proteins of unknown function, and the work is of broad interest to protein biologists and engineers. Overall, this paper is suitable for publication in PLOS Comp Bio with minor revisions.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Kevin Kaichuang Yang

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009930.r003

Decision Letter 1

Dina Schneidman-Duhovny

16 Feb 2022

Dear Dr. Bayly-Jones,

We are pleased to inform you that your manuscript 'Mining folded proteomes in the era of accurate structure prediction' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Dina Schneidman

Software Editor

PLOS Computational Biology

***********************************************************

Please address the comment of Reviewer 2 in the final version

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: My questions and concerns have been sufficiently addressed.

Congratulations on the impressive results!

Reviewer #2: The authors respond thoroughly to the concerns from the first set of reviews. My one remaining concern is about this paragraph:

"Unlike domain assignment by primary sequence analysis, fold-matching algorithms are sensitive and robust. We anticipate that domain assignment by fold-matching will likely provide more accurate and informative predictions over existing sequence analysis methods, especially in contexts where sequences have poor overall homology or possess discontinuous breaks and insertions. Of course, imputed function remains to be experimentally validated."

Intuitively, it makes sense that fold-matching would be more sensitive and robust than primary sequence analysis. However, citing examples from the literature where these are compared head-to-head would strengthen this section.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Martin Vögele

Reviewer #2: Yes: Kevin Kaichuang Yang

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009930.r004

Acceptance letter

Dina Schneidman-Duhovny

22 Mar 2022

PCOMPBIOL-D-21-01823R1

Mining folded proteomes in the era of accurate structure prediction

Dear Dr Bayly-Jones,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Select examples of newly identified PFPs.

    a. Identified MACPFs from slime mold and zebrafish. b. Various β-PFPs with aerolysin-like pore-forming domains which resemble aerolysin, lysenin, epsilon toxin, monalysin and LSL. These are observed in numerous organisms, including Drosophila, C. elegans, zebrafish and yeast, among others. c. Several novel α-haemolysin-like proteins identified in S. aureus and plants. The putative receptor binding domain is coloured green and the pore-forming domain is coloured grey with the expected transmembrane region in red.

    (TIF)

    S1 Table. Curated lists of pore-forming proteins identified by DALI search of AlphaFold database, organised by query.

    (XLSX)

    S2 Table. Ranked list of structurally assigned Pfam matches against the S. aureus foldome.

    (XLSX)

    S1 Text. Definition of a “foldome”.

    (DOCX)

    Attachment

    Submitted filename: response_to_reviewers.docx

    Data Availability Statement

    All supplementary data is made available as a public submission to zenodo: https://zenodo.org/search?page=1&size=20&q=5893808.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
